From: Kenneth Whistler [kenw@sybase.com]
Sent: Thursday, August 08, 2002 2:19 PM
To: keld@dkuug.dk
Cc: kenw@sybase.com; tplum@plumhall.com; jb@benito.com; Winkler, Arnold F; nwallace@us.ibm.com; frank@farance.com; John.Hill@eng.sun.com; rex@RexJaeschke.com; nobuyoshi.mori@sap.com; Don.Schricker@microfocus.com; willemw@ace.nl; convener@research.att.com; asmusf@ix.netcom.com
Subject: Re: Agenda for Character set ad-hoc - 26th August

Keld said:

> I would rather ask Unicode to give up their separate
> specification, and join the WG20 work on this issue.

I guess then it is time for me to speak up for the other point of view.

It is quite clear that the data files provided by the Unicode Consortium
are a much more comprehensive and up-to-date set of specifications for
issues such as identifiers than anything that WG20 has been able to
produce.

In fact, as Arnold indicated, I have had to get involved in developing
the detailed list of Annex A of ISO TR 10176 for the last two rounds, to
ensure, by checking against the better data source, that the Annex A
recommendations for identifiers didn't drift in an unprincipled way from
the Unicode Consortium recommendations when additions were made to the
repertoire for 10646.

Essentially the Annex A list (at a given level of the repertoire) is
equivalent to the Unicode recommendations minus certain classes of
characters (non-spacing marks and some "specials") that some of the
formal language specifications might find problematical. And the detailed
difference is now documented in the table, too, for implementers'
convenience. This at least makes it possible to understand the difference
in identifier behavior between specifications built on TR 10176 Annex A
and specifications, such as Java, ECMAScript, C# or the ICU
implementation library, which are built on the full Unicode specification
for identifiers.

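The relationship described above (Annex A as the Unicode identifier
recommendation minus non-spacing marks and some "specials") can be
sketched in code. What follows is only an approximation under stated
assumptions: general categories from Python's unicodedata module stand in
for the Unicode identifier recommendation, and the Annex A restriction is
modelled by excluding non-spacing marks (category Mn) alone. The function
names are hypothetical, and the real Annex A table is not generated this
way.

```python
import unicodedata

# Assumption: these general categories approximate the Unicode identifier
# recommendation. Letters and letter numbers may start an identifier;
# marks, digits and connector punctuation may continue one.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def unicode_style_part(ch: str) -> bool:
    """Hypothetical check: may ch appear inside an identifier?"""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def restricted_part(ch: str) -> bool:
    """Hypothetical Annex-A-like check: the same rule minus non-spacing marks."""
    return unicode_style_part(ch) and unicodedata.category(ch) != "Mn"


def differences(text: str) -> list:
    """Characters accepted by the broader rule but rejected by the narrower one."""
    return [c for c in text if unicode_style_part(c) and not restricted_part(c)]
```

Calling differences() on a string that contains a combining accent, such
as "e" followed by U+0301, lists the accent as the point where the two
rules diverge. That is the kind of difference the Annex A table now
documents explicitly.
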
But if it were just up to me, I would can Annex A in ISO TR 10176 as a
pain to maintain and a needless divergence from more widespread industry
practice.

> We cannot build
> international standards on separate industry consortia, this is at least
> the point of view seen from Norway.

I understand -- and even sympathize (a little) -- with this point of
view. However, I believe it ignores the reality of the impact of the
Unicode Consortium in defining implementations of ISO/IEC 10646. (And if
taken seriously, would also amount to a rejection of HTML, XML, and
everything else standardized by W3C, as well.)

> It is also unfortunate that Unicode
> wants to build parallel standards in this field, when ISO already has
> done the work here.

In this particular case, ISO has not done the work. Annex A is playing
belated catchup with the Unicode Consortium recommendations for
identifiers, which are based on a consistent analysis of character
properties -- extensible for new additions.

Recommendations for programming languages (and markup languages, such as
XML) should be based on that Unicode analysis -- and then should make
principled decisions regarding whether identifier syntax should be
permanently pegged at some particular release version of Unicode (e.g.
Unicode 3.0), should accept new repertoire as it is added in future
versions, or should simply take the position that all characters are
allowed except for a deliberate exception list (also tied to a particular
release version of Unicode).

There are tradeoffs in identifier and maintenance stability, as well as
interoperability considerations, but burying one's head in the sand about
the importance of the Unicode Consortium specifications doesn't
particularly help in coming to consensus agreements about identifier
stability for formal language specifications.

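The three choices named above (pegging at one Unicode release, tracking
new repertoire, or allowing everything outside an exception list) can be
compared in a small sketch. The assumptions are significant: the
category set only approximates the Unicode identifier recommendation, the
exception list is invented for illustration, and Unicode 3.2.0 serves as
the pegged version solely because Python's standard library exposes that
release as unicodedata.ucd_3_2_0. All function names are hypothetical.

```python
import unicodedata

# Assumption: general categories approximating identifier characters.
ID_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"}

# Hypothetical exception list: the space plus ASCII punctuation other than "_".
EXCLUDED = set(" !\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~")


def pegged(ch: str) -> bool:
    """Policy 1: frozen at one release (here the Unicode 3.2.0 database)."""
    return unicodedata.ucd_3_2_0.category(ch) in ID_CATEGORIES


def tracking(ch: str) -> bool:
    """Policy 2: follows whatever Unicode version the runtime ships."""
    return unicodedata.category(ch) in ID_CATEGORIES


def exclusion(ch: str) -> bool:
    """Policy 3: everything is allowed unless it is on the exception list."""
    return ch not in EXCLUDED


def compare(ch: str) -> dict:
    """Report how each policy treats a single character."""
    return {"pegged": pegged(ch), "tracking": tracking(ch), "exclusion": exclusion(ch)}
```

A character added after the pegged release, such as U+1E9E (added in
Unicode 5.1), is rejected by the first policy and accepted by the other
two, which is the stability versus repertoire tradeoff in miniature.
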
Regards,

--Ken Whistler

>
> Kind regards
> Keld
>
> On Thu, Aug 08, 2002 at 06:42:12AM -0400, Winkler, Arnold F wrote:
> > Tom,
> >
> > The definition of characters for safe use in identifiers is in Annex A of
> > ISO TR 10176. This table is provided by Ken Whistler, based on the relevant
> > Unicode table.
> >
> > TR 10176 is using the repertoire of approved ISO 10646 characters and might
> > thus be a bit more restrictive than the Unicode table at any time, but WG20
> > makes sure that the 10176 Annex A table is always synchronized with a major
> > release of the Unicode standard.
> >
> > Yes, you are right, it would be much easier to just point to the Unicode
> > definition. This request could be one of the results of the characterset
> > ad-hoc at the SC22 plenary and a resolution from SC22 directing WG20 to
> > document that pointer in its next revision of TR 10176.
> >
> > TR 10176, edition #4 is currently in the DTR ballot process (SC22 N3419),
> > the proposed table is at
> > http://wwwold.dkuug.dk/jtc1/sc22/wg20/docs/TR10176-4-table.txt
> >
> > Tom, I hope that helps you to assess your options and take appropriate
> > action at the SC22 plenary.
> >
> > Arnold
> >
> > > -----Original Message-----
> > > From: Thomas Plum [mailto:tplum@plumhall.com]
> > > Sent: Wednesday, August 07, 2002 10:23 PM
> > > To: John Benito; Ann Bennett; Frank Farance; John Hill; Rex Jaeschke;
> > > Nobuyoshi Mori; Don Schricker; Keld Jørn Simonsen; Willem Wakker;
> > > Winkler, Arnold F; Tom Plum (WG21)
> > > Subject: Re: Agenda for Character set ad-hoc - 26th August
> > >
> > >
> > > I think C++ was the first language to define extended identifiers
> > > ... anyway, one of the first. We know that the table embedded
> > > in the C++ standard is an old table. For my part, I was waiting
> > > until WG20 and Unicode Consortium worked out some process for
> > > harmonizing the process of defining extended-id characters. But
> > > we've seen Java, ECMAscript, and C# pointing to Unicode specs,
> > > meanwhile C and Cobol use WG20 tables, so I've kept quiet
> > > and proposed no changes to the old C++ tables.
> > >
> > > SC22 should take some note of the definition (in XML?) of
> > > characters that should _never_ be used in an identifier. One
> > > fairly-reasonable approach to identifiers is to allow anything
> > > that isn't on the black-list.
> > >
> > > The Java spec refers the language spec to the system API spec:
> > > IsIdentifierStart and IsIdentifierPart (roughly ... from memory).
> > > SC22 specs should encourage use of system API facilities; let
> > > only one group of people have to track this evolution, not
> > > every compiler group.
> > >
> > > If WG20 believes that Unicode got it wrong on two, or three, or
> > > N, specific identifier characters, then publish an N-paragraph
> > > TR that clarifies what the problems are, but otherwise point to
> > > the relevant Unicode spec. Having an independently-
> > > maintained table is a nuisance.
> > >
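
The suggestion in the quoted message, that a language specification
should defer to a system API instead of carrying its own table, can be
illustrated briefly. In Java the corresponding methods are
Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The
sketch below swaps in Python's built-in str.isidentifier() as an
analogous runtime facility; the wrapper name is hypothetical, and the
rule it applies is Python's own identifier rule, not the one in TR 10176
Annex A.

```python
def is_valid_name(name: str) -> bool:
    """Hypothetical wrapper: ask the runtime, not a privately maintained table.

    str.isidentifier() applies the identifier rule of the running Python
    version, so the accepted repertoire follows the runtime's Unicode data.
    """
    return name.isidentifier()


def rejected_names(names: list) -> list:
    """Return the candidate names that the runtime does not accept."""
    return [n for n in names if not is_valid_name(n)]
```

With this arrangement only the maintainers of the runtime have to track
additions to the repertoire; a tool built on the wrapper picks them up
when the runtime is upgraded.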