Since a character is an abstract concept, the question of whether two characters are or are not "the same" is not a trivial one. If we recall the definition:
then there is no problem within a single coded character set, since any such set must clearly specify its members. But frequently, in the transmission or processing of character data, that data needs to be converted from one coded representation to another. This is a particular problem in the migration of existing applications from other character set codes to the UCS. The question then arises of identifying "the same character" in two different character sets.
One cannot just look at the characters, in a visual representation, since distinct characters may have the same glyph. The question should be whether they are both used in the same way in the organisation, control or representation of data. But the specification of a coded character set does not specify how its characters should be used; that is outside of its scope. It merely makes the characters available for use. The "sameness" of characters from different coded character sets is therefore ultimately a matter of convention or of definition.
One of the largest resources of coded character sets is the International Register of Coded Character Sets for use with Escape Sequences, maintained and published by the Registration Authority for ISO 2375 in accordance with the procedures of that standard. Those procedures specify how to compare two coded character sets, as follows. Two sets are deemed to be identical if
If we abstract from this those aspects which compare individual characters, rather than their code positions or overall aspects of the complete set, we see that two graphic characters are regarded as identical if
The first of these requirements permits the Registration Authority to change the name (for example, from that used in the source standard whose code is being registered) to bring it into a standardized form. It is the policy of SC2, the ISO/IEC JTC1 sub-committee responsible for coded character set standards, to align the names of characters in its published standards with those used in ISO/IEC 10646. When necessary, renaming will take place when standards are next revised. Such renaming will ensure that two characters are given distinct names if they have distinct glyphs or distinct combining procedures. It follows that
The naming guidelines of the UCS are given in annex K of ISO/IEC 10646. They include the following:
Not all of the eight terms in this numbered list need be present. Examples of character names, with term numbers added after each name element, are
These guidelines are sufficiently precise that it is rarely unclear whether two characters from different coded character sets should have the same name under them. Here are some examples of naming problems.
Because the names of characters are so significant in constructing correspondences between the UCS and other coded character sets, it has been controversial within the relevant sub-committee, ISO/IEC JTC1/SC2, whether the names of characters may be translated when the text of ISO/IEC 10646-1 is translated into another language. It has recently been agreed that the names of characters may be translated.
One effect of this decision is that names will no longer serve as language-independent unique identifiers of characters. They retain their central role in determining whether characters from different coded character sets are or are not the same, but the comparison of names must take place in a common language.
If names of characters are to be translatable, some other form of unique identifier for characters is needed, one that is language independent. Since the aim of the UCS is to include all the world's characters, the coding of a character in the UCS can be used as an identifier of that character in all situations, including in the specification of other coded character sets. Such a scheme would solve, for the future, the problem of comparing characters from different coded character sets. However, in order to add such identifiers to existing character sets as they are revised, it is first necessary to create a correspondence between the set concerned and the UCS by means of names as described above.
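The name-based correspondence described above can be sketched in a few lines. This is only an illustration under stated assumptions: the table LEGACY_NAMES and the function correspondence_by_name are hypothetical, and Python's standard unicodedata module is used as a stand-in for a table of UCS character names (in English).

```python
import unicodedata

# Hypothetical excerpt of a legacy coded character set:
# code position -> character name as given in that set's specification.
LEGACY_NAMES = {
    0x41: "LATIN CAPITAL LETTER A",
    0xE9: "LATIN SMALL LETTER E WITH ACUTE",
    0xFF: "HYPOTHETICAL UNNAMED SYMBOL",  # deliberately not a UCS name
}

def correspondence_by_name(names):
    """Map each legacy code position to a UCS code position, or None if no
    single UCS character carries that name."""
    result = {}
    for position, name in names.items():
        try:
            char = unicodedata.lookup(name)
        except KeyError:
            result[position] = None  # no character with this name
            continue
        # lookup() may return a multi-character sequence for some names;
        # treat that as "no single character" in this sketch.
        result[position] = ord(char) if len(char) == 1 else None
    return result

mapping = correspondence_by_name(LEGACY_NAMES)
for position, ucs in sorted(mapping.items()):
    label = "no match" if ucs is None else f"{ucs:08X}"
    print(f"{position:02X} -> {label}")
```

Positions left unmatched are exactly the cases in which a naming problem has to be resolved by convention or definition before an identifier can be assigned.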
Amendment 9 to ISO/IEC 10646-1 proposes several alternative forms for unique identifiers constructed from UCS code positions. These have the following constructions, in which hhhhhhhh represents the eight hexadecimal digits that represent the code position in the UCS and kkkk represents the last four of these digits for characters of the Basic Multilingual Plane (BMP):
The significance of the optional prefixes is as follows:
If there is no prefix letter then the relevant amendment level is unspecified. The three forms (no prefix letter, T prefix, U prefix) coincide unless hhhhhhhh lies in the range 00003400 to 00004DFF inclusive. For this range, the correspondence between the T and U forms is given by the mapping table in the annex to Amd.5. As an example:
The prefix letters, and the letters A to F used as hexadecimal letters, may be written either as capital letters or as small letters.
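As a rough illustration of how such identifiers might be read by software, the following sketch parses an identifier and reports whether its T and U forms could differ. The exact syntax accepted here is an assumption, not a transcription of the amendment: an optional prefix letter T or U, an optional "+" (as in U+0041), then either eight hexadecimal digits or four for the BMP. The names parse_identifier and forms_may_differ are hypothetical.

```python
import re

# Assumed syntax: optional prefix letter (T or U), optional "+", then
# eight hexadecimal digits, or four for a BMP character.
# Letter case is not significant.
_IDENTIFIER = re.compile(r"([TU])?\+?([0-9A-F]{8}|[0-9A-F]{4})", re.IGNORECASE)

def parse_identifier(text):
    """Return (prefix, code_position); prefix is 'T', 'U' or None."""
    match = _IDENTIFIER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised identifier: {text!r}")
    prefix = match.group(1).upper() if match.group(1) else None
    return prefix, int(match.group(2), 16)

def forms_may_differ(code_position):
    """True when the no-prefix, T and U forms need not coincide, that is,
    for code positions 00003400 to 00004DFF inclusive."""
    return 0x3400 <= code_position <= 0x4DFF

print(parse_identifier("u+0041"))
print(parse_identifier("00000041"))
print(forms_may_differ(0x3400))
```

The sketch only detects the range in which the forms differ; converting between the T and U forms would additionally need the mapping table from the annex to Amd.5, which is not reproduced here.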
The unique identifiers described above for characters are based on the International Standard ISO/IEC 10646. There is also an internationally agreed assignment of unique identifiers to glyphs, but this is instead based on an International Registration Authority. The registrar is the Association for Font Information Interchange and the register operates under procedures laid down in ISO/IEC 10036.
Glyphs registered under ISO/IEC 10036 are assigned an identifier by the Registration Authority that is a hexadecimal number in the range from 0 to FFFFFFFF. This is the same range of values as that used for identifiers of characters in accordance with ISO/IEC 10646. For each character of ASCII, one possible glyph has been assigned the same value as the character has in the ASCII code, and therefore also in the UCS. For example, the character LATIN CAPITAL LETTER A has the character identifier U+0041 and is represented by the glyph "A", which has the glyph identifier 41 (hexadecimal). However, certain characters of the ASCII code have had their interpretation refined as coded character sets have developed over time. This has led to departures from a strict correspondence even for the ASCII code. In particular:
The use of code positions 27 and 60 (hexadecimal; U+0027 is APOSTROPHE and U+0060 is GRAVE ACCENT) for right and left single quotation marks was an allowed alternative in the original ASCII code. The glyph for a right single quotation mark is also acceptable for an apostrophe, but that for a left single quotation mark is not acceptable as a grave accent. These ASCII alternatives are still present in the registration entry under ISO 2375 for the ASCII code, namely ISO IR-6 in the International Register of Coded Character Sets to be used with Escape Sequences, as this entry dates from 1975. Register entries, once made, cannot be revised (other than in exceptional circumstances and if the possibility of revision was stated in the original entry). However, these alternatives are not present in the international standard equivalent to ASCII, namely the International Reference Version (IRV) of ISO/IEC 646:1991. Nevertheless, that standard states explicitly that its IRV may be identified as ISO IR-6.
For use in a wider context, ISO/IEC 9541-1 specifies a structured-name form for the identification of glyphs registered under ISO/IEC 10036. These have the form
where nnnn is a sequence of decimal digits, beginning with a non-zero digit, which expresses in decimal the value of the hexadecimal glyph identifier assigned by the Registration Authority. The concept of a structured-name is specified normatively in ISO/IEC 9541-2, which gives both ASN.1 and SGML forms for such names.
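The relationship between a hexadecimal glyph identifier and the decimal digits nnnn can be illustrated with a short conversion. This is a minimal sketch: the function name glyph_id_to_decimal_digits is hypothetical, and rejecting the value 0 is an assumption made here because nnnn must begin with a non-zero digit. The rest of the structured-name is not constructed.

```python
def glyph_id_to_decimal_digits(hex_identifier):
    """Convert a hexadecimal glyph identifier to the decimal digit string
    used as the nnnn part of a structured-name."""
    value = int(hex_identifier, 16)
    # Identifiers lie in the range 0 to FFFFFFFF; 0 is rejected in this
    # sketch because nnnn has to begin with a non-zero digit.
    if not 0 < value <= 0xFFFFFFFF:
        raise ValueError(f"identifier out of range: {hex_identifier!r}")
    return str(value)

# The glyph "A" has the glyph identifier 41 (hexadecimal).
print(glyph_id_to_decimal_digits("41"))  # prints 65
```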