ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC 1/SC 2/WG 2

Universal Multiple-Octet Coded Character Set
(UCS)

ISO/IEC JTC1/SC2/WG2 N1838
Date: 1998-09-15

Title: Proposal to add four binary completion letters to the BMP

Source: Mark Davis

Status: Expert Contribution

Action: For consideration by JTC1/SC2/WG2

This document contains the proposal summary (ISO/IEC JTC1/SC2/WG2 form N1352) and a full proposal for the encoding of four new characters in the BMP of ISO/IEC 10646.



A. Administrative

1. Title Proposal to add four binary completion letters to the BMP
2. Requester's name Mark Davis
3. Requester type Expert contribution
4. Submission date 1998-09-15
5. Requester's reference  
6a. Completion This is a complete proposal.
6b. More information to be provided? No

B. Technical -- General

1a. New script? Name? No
1b. Addition of characters to existing block? Name? Yes, to Latin. Suggested locations are U+1E9C through U+1E9F. However, the characters could be added at any reasonable place in the BMP.
2. Number of characters 4
3. Proposed category Category A
4. Proposed level of implementation and rationale Level 1
5a. Character names included in proposal? Yes
5b. Character names in accordance with guidelines? Yes
5c. Character shapes reviewable? Yes
6a. Who will provide computerized font? Mark Davis
(if necessary--it is a trivial modification of any font containing U+01E0, U+01E1, U+1E1C, U+1E1D)
6b. Font currently available? No, but it can be generated quickly
6c. Font format? TrueType
7a. Are references (to other character sets, dictionaries, descriptive texts, etc.) provided? N/A--See below
7b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? N/A--See below
8. Does the proposal address other aspects of character data processing? Yes

C. Technical -- Justification

1. Has this proposal been submitted before? No
2. Contact with the user community? N/A--See below
3. Information on the user community? N/A--See below
4a. The context of use for the proposed characters? N/A--See below
4b. Reference N/A--See below
5a. Proposed characters in current use? N/A--See below
5b. Where? N/A--See below
6a. Characters should be encoded entirely in BMP? Yes
6b. Rationale Required for efficient normalization of Unicode/10646, as described below.
7. Should characters be kept in a continuous range? It would be useful, but not absolutely necessary
8a. Can the characters be considered a presentation form of an existing character or character sequence? To the same degree as:
U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
8b. Where?  N/A--See below
8c. Reference  N/A--See below
9a. Can any of the characters be considered to be similar (in appearance or function) to an existing character? No
9b. Where?  
9c. Reference  
10a. Combining characters or use of composite sequences included? No
10b. List of composite sequences and their corresponding glyph images provided? No
11. Characters with any special properties such as control function, etc. included? No

D. SC2/WG2 Administrative

To be completed by SC2/WG2

1. Relevant SC 2/WG 2 document numbers:                                                                     
2. Status (list of meeting number and corresponding action or disposition)  
3. Additional contact to user communities, liaison organizations etc.  
4. Assigned category and assigned priority/time frame  
5. Other Comments  


E. Proposal

Proposal to add four binary completion letters to the BMP

The proposal is to add the following letters to the BMP:

X001 LATIN CAPITAL LETTER A WITH DOT ABOVE
X002 LATIN SMALL LETTER A WITH DOT ABOVE

X003 LATIN CAPITAL LETTER E WITH CEDILLA
X004 LATIN SMALL LETTER E WITH CEDILLA

While these characters may indeed occur in natural languages or academic use, the principal reason for this proposal has to do with the nature of normalization. There has been a great deal of interest in providing complete specifications for different normalized forms of Unicode/10646. (Cf. http://www.unicode.org/unicode/reports/techreports.html)

One of the normalization forms of particular interest is one that normalizes to precomposed forms wherever they exist. For example, it uses the single coded character U+00C0 LATIN CAPITAL LETTER A WITH GRAVE instead of the combining character sequence <U+0041 LATIN CAPITAL LETTER A, U+0300 COMBINING GRAVE ACCENT>. Such a form is of particular interest for systems supporting implementation Level 1.
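
As a rough illustration of the precomposed form described above, the following Python sketch uses the standard library function unicodedata.normalize with the "NFC" form. Treating NFC as the form this proposal has in mind is an assumption; the proposal itself does not name it.

```python
import unicodedata

# Combining character sequence <U+0041, U+0300>.
decomposed = "\u0041\u0300"

# Normalize to the precomposed form. "NFC" is assumed here to correspond
# to the precomposed normalization form discussed in the text.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(f"U+{ord(composed):04X}", unicodedata.name(composed))
# U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
```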

Implementations of such a normalization form can be particularly efficient if Unicode and 10646 are coded such that composed characters always have binary canonical decompositions. (For more information on canonical decomposition, see The Unicode Standard, Version 2.0, Chapters 3 and 4.)

A composed character X has a binary canonical decomposition when X is canonically equivalent to the composed character sequence:
      <B, C1, C2,...,Cn-1,Cn>
and there is another composed character Y which is canonically equivalent to the sequence without the final combining mark:
      <B, C1, C2,...,Cn-1>.

In such a case, Y is called a canonical binary completion character for X. If X does not have a binary completion character, X is called incomplete.

Notice that only characters with two or more combining marks need to be checked for completeness.
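
The definition above can be sketched as a small check. The table below is a hypothetical excerpt of full canonical decompositions, chosen to reflect the repertoire before the characters proposed here are added; the names DECOMPOSITIONS and is_incomplete are illustrative only.

```python
# Hypothetical excerpt: composed character -> full canonical decomposition
# <B, C1, ..., Cn>, as it stands without the characters proposed here.
DECOMPOSITIONS = {
    0x00C4: (0x0041, 0x0308),          # A WITH DIAERESIS
    0x01DE: (0x0041, 0x0308, 0x0304),  # A WITH DIAERESIS AND MACRON
    0x01E0: (0x0041, 0x0307, 0x0304),  # A WITH DOT ABOVE AND MACRON
    0x0112: (0x0045, 0x0304),          # E WITH MACRON
    0x1E14: (0x0045, 0x0304, 0x0300),  # E WITH MACRON AND GRAVE
    0x1E1C: (0x0045, 0x0327, 0x0306),  # E WITH CEDILLA AND BREVE
}


def is_incomplete(cp, table):
    """Return True if cp has no canonical binary completion character."""
    seq = table[cp]
    # With a single combining mark, the base character B plays the role
    # of Y, so only sequences with two or more marks need checking.
    if len(seq) < 3:
        return False
    prefix = seq[:-1]  # <B, C1, ..., Cn-1>
    return prefix not in set(table.values())


incomplete = sorted(cp for cp in DECOMPOSITIONS if is_incomplete(cp, DECOMPOSITIONS))
print([f"U+{cp:04X}" for cp in incomplete])  # ['U+01E0', 'U+1E1C']
```

In this excerpt, U+01DE and U+1E14 are complete because U+00C4 and U+0112 supply the shorter sequences, while U+01E0 and U+1E1C have no such counterpart.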

There are only four incomplete characters in 10646/Unicode:

U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
U+01E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON

U+1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
U+1E1D LATIN SMALL LETTER E WITH CEDILLA AND BREVE

(Characters 1E1C and 1E1D can be produced by a binary decomposition, but not a canonical binary decomposition.)

The four characters proposed for addition to 10646/Unicode in this document are the canonical binary completion characters for these four incomplete characters.
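
One way to see the benefit is a composer that consults only a table keyed by pairs (current character, next combining mark). The following is a simplified sketch, not a full normalization algorithm: it handles a single base followed by marks and ignores reordering and blocking rules. Characters are written as labels, and "X001" and "X003" are this proposal's placeholder identifiers rather than assigned code points; the names PAIRS and compose are hypothetical.

```python
# Pair table: (current character, next combining mark) -> composed character.
# "X001" and "X003" are the proposal's placeholders, not real code points.
PAIRS = {
    ("0041", "0307"): "X001",  # A + DOT ABOVE
    ("X001", "0304"): "01E0",  # A WITH DOT ABOVE + MACRON
    ("0045", "0327"): "X003",  # E + CEDILLA
    ("X003", "0306"): "1E1C",  # E WITH CEDILLA + BREVE
}


def compose(seq, pairs):
    """Pairwise composition of one base character followed by marks."""
    current = seq[0]
    leftover = []
    for mark in seq[1:]:
        combined = pairs.get((current, mark))
        if combined is None:
            leftover.append(mark)  # no pair entry: mark stays uncomposed
        else:
            current = combined     # fold the mark into the current character
    return [current] + leftover


# With the completion characters, two pair lookups reach the target.
with_completion = compose(["0041", "0307", "0304"], PAIRS)

# Without them there is no entry for the intermediate step, so a purely
# pairwise composer cannot reach 01E0 at all.
pairs_without = {k: v for k, v in PAIRS.items()
                 if not k[0].startswith("X") and not v.startswith("X")}
without_completion = compose(["0041", "0307", "0304"], pairs_without)
```

Without the completion characters, an implementation would need an additional table keyed on longer sequences to produce U+01E0 and U+1E1C.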

The value of all composed characters is fundamentally a product of their usefulness in implementations, since they could be expressed with composed character sequences. This is a special case where the addition of these characters is of particular value to a wide variety of implementations, ranging from XML parsers to programming language parsers.

It is particularly important that these characters be added before Unicode 3.0 is final, since that is likely to be the version on which the normalization forms are based.


Name and glyph

LATIN CAPITAL LETTER A WITH DOT ABOVE

LATIN SMALL LETTER A WITH DOT ABOVE

LATIN CAPITAL LETTER E WITH CEDILLA

LATIN SMALL LETTER E WITH CEDILLA


Unicode Character Properties

X001;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;X002;
X002;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;X001;;X001

X003;LATIN CAPITAL LETTER E WITH CEDILLA;Lu;0;L;0045 0327;;;;N;;;;X004;
X004;LATIN SMALL LETTER E WITH CEDILLA;Ll;0;L;0065 0327;;;;N;;;X003;;X003
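
The records above appear to follow the 15-field, semicolon-separated layout of the Unicode Character Database file UnicodeData.txt. Under that assumption, a minimal Python sketch for reading them might look as follows; the field names in FIELDS and the function parse_record are illustrative, not part of the proposal.

```python
# Assumed field order, following the UnicodeData.txt layout.
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "old_name", "comment", "uppercase", "lowercase", "titlecase",
]


def parse_record(line):
    """Split one property record into a dict keyed by field name."""
    values = line.split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # The decomposition field is a space-separated list of code points.
    record["decomposition"] = record["decomposition"].split()
    return record


upper = parse_record(
    "X001;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;X002;"
)
lower = parse_record(
    "X002;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;X001;;X001"
)

print(upper["decomposition"], upper["lowercase"])  # ['0041', '0307'] X002
print(lower["uppercase"], lower["titlecase"])      # X001 X001
```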