ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC 1/SC 2/WG 2

Universal Multiple-Octet Coded Character Set
(UCS)

ISO/IEC JTC1/SC2/WG2 N2536
Date: 2002-11-24

 

Title: Constraints on Character Names for Loose Matching
Source: US national body
Status: Submission
Action: Request for addition to Policies and Procedures

For property names, the Unicode Consortium recommends loose string matching: only letters and digits should be taken into account when matching. In particular, spaces and hyphens are disregarded in loose matching. The Unicode Character Property and Property Value aliases are vetted to make sure that this does not cause collisions: that the aliases will always remain distinct even if only letters and digits are considered in matching.
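As an illustration, the matching rule described above might be sketched in Python as follows. The function names are hypothetical, and treating uppercase and lowercase as equal is an assumption of this sketch; the text above only requires that everything other than letters and digits be disregarded.

```python
def loose_key(alias: str) -> str:
    """Reduce a property alias to the characters that count for loose matching."""
    # Keep only letters and digits. str.isalnum() is sufficient here because
    # property aliases are ASCII; spaces, hyphens and underscores are dropped.
    kept = "".join(ch for ch in alias if ch.isalnum())
    # Assumption of this sketch: case differences are ignored as well.
    return kept.casefold()


def loose_equal(a: str, b: str) -> bool:
    """Return True if two aliases are the same under loose matching."""
    return loose_key(a) == loose_key(b)
```

With these definitions, `loose_equal("Line_Break", "line break")` holds, while two aliases that differ in a letter or digit, such as `"Lu"` and `"Ll"`, remain distinct.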

Such loose matching can be used in a variety of environments. It is especially useful in regular expressions, where sets of characters are built up using properties.

Loose matching is also very useful for Unicode character names in such environments. There are currently only three cases where loose matching fails, because two names become identical once spaces and hyphens are disregarded:

  • U+0F68 TIBETAN LETTER A and
    U+0F60 TIBETAN LETTER -A
  • U+0FB8 TIBETAN SUBJOINED LETTER A and
    U+0FB0 TIBETAN SUBJOINED LETTER -A
  • U+116C HANGUL JUNGSEONG OE and
    U+1180 HANGUL JUNGSEONG O-E

With such a limited number of exceptions, one can still match loosely, by special-casing these three exceptions. As it turns out, the match can even be slightly looser than with property aliases: one can also remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT", and still not have collisions; those are essentially "noise" words (in terms of loose matching).
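One possible way to special-case the three exceptions is sketched below in Python. The approach (checking a small exception table before folding), the names `fold`, `lookup`, `EXCEPTIONS` and `SAMPLE_NAMES`, and the tiny sample table are all assumptions made for illustration; the text above does not prescribe how the special-casing should be done.

```python
NOISE_SEQUENCES = ("LETTER", "CHARACTER", "DIGIT")

# The three hyphenated names that collide with their unhyphenated counterparts.
EXCEPTIONS = {
    "TIBETAN LETTER -A": 0x0F60,
    "TIBETAN SUBJOINED LETTER -A": 0x0FB0,
    "HANGUL JUNGSEONG O-E": 0x1180,
}

# Hypothetical sample of ordinary names; a real index would cover every name.
SAMPLE_NAMES = {
    0x0F68: "TIBETAN LETTER A",
    0x0FB8: "TIBETAN SUBJOINED LETTER A",
    0x116C: "HANGUL JUNGSEONG OE",
    0x0041: "LATIN CAPITAL LETTER A",
    0x0030: "DIGIT ZERO",
}


def fold(name: str) -> str:
    """Keep only letters and digits, then drop the noise sequences."""
    key = "".join(ch for ch in name.upper() if ch.isalnum())
    for seq in NOISE_SEQUENCES:
        key = key.replace(seq, "")
    return key


INDEX = {fold(name): cp for cp, name in SAMPLE_NAMES.items()}


def lookup(query: str):
    """Return the code point for a loosely written name, or None."""
    # Check the exceptions first, ignoring only case and extra whitespace,
    # so that the hyphen still distinguishes them.
    strict = " ".join(query.upper().split())
    if strict in EXCEPTIONS:
        return EXCEPTIONS[strict]
    return INDEX.get(fold(query))
```

In this sketch a query such as `"latin capital a"` resolves to U+0041 even though the word LETTER is omitted, and `"Tibetan Letter -A"` resolves to U+0F60 only because the exception table is consulted before the hyphen is discarded.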

The US National Body recommends that the UTC and WG2 adopt a constraint on future character names, so that loose matching can be performed easily (with the exception of the three pairs of characters above). The Unicode Technical Committee has accepted this proposal, and also recommends that WG2 adopt it in the policies and procedures.

The specific proposal is:

Whenever a character name is assigned to a new character, that name will be distinct from all existing character names, even if the following transformation were to be performed:

  1. Remove all characters except for letters and decimal digits.
    • Letters and decimal digits are those with general-category = L or Nd in the Unicode Character Database.
  2. Remove all instances of the letter sequences "LETTER", "CHARACTER", "DIGIT".
    • This is only applicable to the English normative character names, not to translated names.
  3. Case-fold all characters.
    • This is only applicable to translated names that may contain both uppercase and lowercase characters.

Note: clause 2 does not exclude the words LETTER, CHARACTER and DIGIT from future names. Instead, it just ensures that those words are not required in order to distinguish two character names. That is, one could not have both of the following names, although one could have either one:

KHAROSHTI LETTER AA
KHAROSHTI CHARACTER AA
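
The proposed transformation and the example in the note can be sketched together in Python. This is a minimal illustration, not a normative algorithm: the function names `transform` and `first_conflict` and the `english` parameter are hypothetical, and the sketch assumes that English names are supplied in uppercase, as the normative names are.

```python
import unicodedata

NOISE_SEQUENCES = ("LETTER", "CHARACTER", "DIGIT")


def transform(name: str, english: bool = True) -> str:
    """Apply the three proposed steps to a character name."""
    # Step 1: keep only letters (general category L*) and decimal digits (Nd).
    kept = "".join(
        ch for ch in name
        if unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
    )
    # Step 2: remove the noise sequences; English normative names only.
    if english:
        for seq in NOISE_SEQUENCES:
            kept = kept.replace(seq, "")
    # Step 3: case-fold; this only matters for mixed-case translated names.
    return kept.casefold()


def first_conflict(new_name: str, existing_names, english: bool = True):
    """Return the first existing name that collides with new_name, or None."""
    key = transform(new_name, english)
    for old in existing_names:
        if transform(old, english) == key:
            return old
    return None
```

Under these assumptions both names in the note reduce to the same key, so `first_conflict("KHAROSHTI CHARACTER AA", ["KHAROSHTI LETTER AA"])` reports the existing name, whereas a name that differs in its remaining letters, such as KHAROSHTI LETTER A, would not conflict.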