There is really only one concept of a repertoire: a repertoire is a specified set of characters. However, the concept is defined slightly differently in different character set standards, and it is interpreted in ways that may differ from one's expectations. Two particular definitions are
For completeness, a coded character set also has a formal definition
It is instructive to see how these two standards differ in their use of the concept of repertoire. Recall that ISO/IEC 6937 is a standard that bases a variable-length encoding of characters from the Latin script on forming combinations of non-spacing diacritical marks with unaccented letters. It is based on two separate 7-bit coded character sets that are separately registered under ISO 2375. The primary set of ISO/IEC 6937 is the left-hand set, coded in an 8-bit code as 20 to 7E. This is precisely the ASCII set registered as ISO-IR 6. The supplementary set of ISO/IEC 6937 is the right-hand set, coded as A0 to FF, which contains both the non-spacing diacritical marks and other (spacing) characters.
The repertoire of ISO/IEC 6937 is specified separately, as a list of characters together with their (variable length) coded representations. It consists of 333 characters, including SPACE. Its characters include the accented characters that are coded by two octets, the first representing a non-spacing diacritical mark from the supplementary set and the second representing an unaccented letter from the primary set. The repertoire of ISO/IEC 6937 does not include the non-spacing diacritical marks as characters in their own right.
This is entirely consistent with the definition of a repertoire. The repertoire of ISO/IEC 6937 is established by that standard as a specific list of characters, each of which is represented by one or more bit combinations. It is quite separate from the union of the repertoires of the primary and supplementary sets of ISO/IEC 6937, which consists of 191 characters, including SPACE, each coded by one octet. That union does include, say, NON-SPACING ACUTE ACCENT, but it does not include LATIN SMALL LETTER E WITH ACUTE, while the repertoire of ISO/IEC 6937 includes the latter character but not the former.
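The distinction between the two repertoires can be sketched with a few entries. The octet values used here (C2 for the non-spacing acute accent, 65 for the small letter e) are assumed from the usual ISO/IEC 6937 code table, and the dictionaries are illustrative fragments, not the full lists of 333 and 191 characters.

```python
# Illustrative fragment only: a handful of entries, not the full code tables.
# Octet values are an assumption based on the usual ISO/IEC 6937 layout.

# Single-octet characters of the primary and supplementary sets.
PRIMARY = {0x45: "LATIN CAPITAL LETTER E", 0x65: "LATIN SMALL LETTER E"}
SUPPLEMENTARY = {0xC2: "NON-SPACING ACUTE ACCENT"}

# Repertoire of ISO/IEC 6937 itself: each character maps to its coded
# representation, which may be one octet or two.
ISO6937_REPERTOIRE = {
    "LATIN CAPITAL LETTER E": bytes([0x45]),
    "LATIN SMALL LETTER E": bytes([0x65]),
    "LATIN SMALL LETTER E WITH ACUTE": bytes([0xC2, 0x65]),
}

union_of_sets = set(PRIMARY.values()) | set(SUPPLEMENTARY.values())
standard_repertoire = set(ISO6937_REPERTOIRE)


def membership(name):
    """Return (in union of the two sets, in repertoire of the standard)."""
    return name in union_of_sets, name in standard_repertoire
```

Calling `membership` on the two character names discussed above shows each one belonging to exactly one of the two repertoires.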
The concept of repertoire as used in ISO/IEC 10646 corresponds in the context of ISO/IEC 6937 to that of the union of its primary and supplementary sets, not to that of ISO/IEC 6937 itself. The repertoire of ISO/IEC 10646 consists of the characters that are assigned to code positions within the 31-bit coding space of the UCS. It therefore includes combining characters (which are the nearest equivalent in ISO/IEC 10646 to the non-spacing diacritical marks of ISO/IEC 6937) but does not include either composite sequences or characters, such as LATIN SMALL LETTER G WITH GRAVE, which have glyphs that can be represented by composite sequences.
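A rough way to observe this is through Python's `unicodedata` module. This is a sketch under two assumptions: that the Unicode data shipped with Python matches the ISO/IEC 10646 repertoire code for code, and that a non-zero canonical combining class is an adequate approximation of "combining character" (the two notions are not defined identically).

```python
import unicodedata


def is_combining(ch):
    """Approximation: non-zero canonical combining class."""
    return unicodedata.combining(ch) != 0


def has_precomposed_form(sequence):
    """True if the sequence composes (under NFC) to a single coded character."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


e_acute = "e\u0301"  # e followed by COMBINING ACUTE ACCENT
g_grave = "g\u0300"  # g followed by COMBINING GRAVE ACCENT
```

Under these assumptions, `e_acute` composes to a single coded character, while `g_grave` remains a composite sequence of two characters because no such single character is in the repertoire.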
There is a faint indication of this difference in the definitions
given in these two standards. In ISO/IEC 6937 the definition refers
to characters represented by bit combinations; in ISO/IEC
10646-1 it refers to characters represented in the coded
character set. There is no conflict, since it is the definition
of a coded character set that is crucial. A coded character set
is first required to establish a character set, before it assigns
coding. That character set is then the repertoire of the coded
character set. A repertoire, composed of characters, is therefore
whatever the relevant standard says it is. It is, in principle,
quite distinct from the set of glyphs that may be represented
by the characters of the repertoire. For many purposes it is this
set of glyphs that is relevant, not the set of characters used
to represent them. But describing or specifying this set of glyphs
is outside of the scope of standards for coded character sets.
There are three levels of implementation specified in ISO/IEC 10646, distinguished from one another by limitations on the characters that may be encoded at the level concerned. They are as follows:
Hangul Jamo characters are used in the Hangul syllable composition method. A sequence of two or three Hangul Jamo characters has a glyph that represents a syllable. Hangul syllables also have precomposed coding in the HANGUL EXTENDED block of the I-zone of the BMP. The relationship between coding in terms of Hangul Jamo and that as a single syllabic character is similar to that between the precomposed and decomposed forms of Latin characters with diacritical marks. However, there is no distinction for the Hangul Jamo characters corresponding to that between the non-combining and combining characters of a composite sequence. No Hangul Jamo characters have a meaning in isolation within the Hangul script. For this reason it is specifically stated that the characters of the HANGUL JAMO block are not combining characters. Note that the Hangul syllabic characters of the HANGUL EXTENDED block are permitted at all levels of implementation.
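The relationship between a Jamo sequence and its precomposed syllable can be sketched arithmetically. The constants below are taken from the Hangul composition arithmetic published in the Unicode Standard and assume the syllable block starting at AC00; this guide does not list them, so treat them as an assumption of the example.

```python
# Constants from the Unicode Standard's Hangul composition arithmetic
# (assumed here; they are not given in this guide).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(jamo):
    """Map a sequence of two or three conjoining Jamo to one syllable."""
    if len(jamo) not in (2, 3):
        raise ValueError("expected two or three Jamo characters")
    l_index = ord(jamo[0]) - L_BASE
    v_index = ord(jamo[1]) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("expected a leading consonant and a vowel")
    t_index = 0  # no trailing consonant
    if len(jamo) == 3:
        t_index = ord(jamo[2]) - T_BASE
        if not 1 <= t_index < T_COUNT:
            raise ValueError("expected a trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
```

For example, the three Jamo U+1112, U+1161 and U+11AB map to the single syllable U+D55C, and the pair U+1100, U+1161 maps to U+AC00.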
The chapter on visual representation of characters gives more information about the scripts that can be represented at the different levels of implementation.
A collection of characters consists of the characters of the UCS that are allocated to code positions lying within one of the ranges specified for this purpose in annex A of ISO/IEC 10646-1. Each collection is assigned both a number and a name. There is a collection associated with, and frequently identical to, each block into which the BMP is divided. These collections, together with their names and numbers, are listed in the chapter of this guide on the Basic Multilingual Plane (BMP). It should be noted that, as a collection is defined by a range, it may include code positions to which no characters have been assigned. An amendment to the standard may allocate characters to such code positions, so the repertoire defined by a collection may change over time. This is not always desirable, so the notion of a fixed collection was introduced in Corrigendum No. 2. The definition of a fixed collection therefore has to be much more precise: none of its ranges may contain unassigned code positions.
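The difference between an ordinary collection and a fixed collection can be sketched as a small data structure. The collection number, name, range and sets of assigned positions below are hypothetical values chosen for illustration; they are not taken from annex A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    number: int
    name: str
    ranges: tuple  # inclusive (first, last) pairs of code positions

    def positions(self):
        """Yield every code position covered by the collection's ranges."""
        for first, last in self.ranges:
            yield from range(first, last + 1)

    def repertoire(self, assigned):
        """Characters of the collection, given the assigned code positions."""
        return {cp for cp in self.positions() if cp in assigned}

    def could_be_fixed(self, assigned):
        """A fixed collection may not cover any unassigned code position."""
        return all(cp in assigned for cp in self.positions())


# Hypothetical collection covering eight code positions.
example = Collection(900, "EXAMPLE", ((0x0370, 0x0377),))

# Hypothetical state before and after an amendment assigns two more characters.
assigned_before = set(range(0x0370, 0x0376))
assigned_after = set(range(0x0370, 0x0378))
```

With these values the repertoire of `example` grows from six to eight characters when the amendment is applied, which is exactly the drift that a fixed collection rules out.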
Two different collections of characters may overlap, but of those associated with specific blocks the only overlap is that two of the four characters comprising the collection ZERO-WIDTH BOUNDARY INDICATORS are also present in collections for a number of specific scripts. A number of other specialized collections are defined in annex A which put together selections of characters that are also present in other collections. These consist of script-specific formatting characters and alternate forms. There are also two collections related to the permitted levels of implementation. One consists of all combining characters and the other of those combining characters that are not permitted in an implementation at level 2. Finally there are five large collections (two of which are fixed collections) defined as follows:
|299||BMP FIRST EDITION|
Note: a fixed collection containing only characters contained in the first edition prior to any amendments.
|See ISO/IEC 10646-1 A.3|
Note: a fixed collection containing those characters of the first edition as amended by amendments 1 to 7.
|See ISO/IEC 10646-1 A.3|
|400||PRIVATE USE PLANES||G=00, P=0F, 10, E0-FF|
|500||PRIVATE USE GROUPS||G=60-7F|
The specifications of collections 300 and 400 were changed by Amendment 1 consequent on the introduction of the S-zone and its reservation for the use of UCS Transformation Format 16.
A subset is a more general term that refers to any identified set of characters from the entire repertoire of the UCS. Two alternative means of specifying subsets are recognized within ISO/IEC 10646-1:
A selected subset is more restricted than a limited subset in its permitted content, but it has two great advantages. It is much more concise to list collections rather than individual characters. Also, annex M of ISO/IEC 10646-1 specifies by algorithm an ASN.1 object identifier that may be used to identify a selected subset of the UCS within any context in which OSI protocols are used.
A limited subset may be assigned an ASN.1 object identifier, but only by means outside the scope of ISO/IEC 10646-1. The European pre-standard ENV 1973
contains the definition of a limited subset (the Minimum European
Subset) and assigns an ASN.1 object identifier to it. It also
describes a selected subset (the Extended European Subset) that
has an ASN.1 object identifier assigned in accordance with the
algorithm of ISO/IEC 10646-1.
Because of the size and open-ended nature of the repertoire of the UCS, conformance to ISO/IEC 10646-1 does not require the ability to handle all of the characters in the repertoire. Instead, a claim of conformance for information interchange is required to identify:
A separate definition of conformance is given for conformance of a device. For this purpose a device is a component of information processing equipment which can transmit and/or receive coded information, such as an input/output device, an application program or a gateway function. A claim of conformance for a device is required to specify the above three items and in addition
The precise meaning of conformance is specified in ISO/IEC 10646
and will not be reproduced here. The important aspect here is
that conformance only requires support of the UCS within the
limits determined by these specified items.
The ability to conform to ISO/IEC 10646 while supporting only a subset of its characters is a great aid to migration from other coded character sets. In particular it permits support to be developed collection by collection. Only in a few cases is there a direct correspondence between the collections defined in ISO/IEC 10646-1 and the repertoires of other standardized coded character sets. However, expanding support one collection at a time substantially eases the effort required, for example in developing glyphs for additional characters.
The assignment by ISO/IEC 10646 of an ASN.1 object identifier
for any selected subset provides a means within OSI protocols
for an application to notify its peer, in any communication, of
the collections that it supports. The Extended European Subset
(EES) specified in ENV 1973 consists of the collections numbered
1-11, 27-28, 30-48, 63, 65 and 70. These contain 4013 code positions,
of which 3095 are currently assigned to characters. These are
all the collections that contain characters of the Latin, Greek,
Cyrillic, Armenian and Georgian scripts together with other characters
of the International Phonetic Alphabet and a wide range of symbols
used for academic, commercial and scientific purposes within Europe.
This subset is defined as guidance for product developers, but
it in no way restricts the ability of any developer to extend
support to either a smaller or a larger range of collections than
that of the EES.
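The collection list of the EES can be expanded mechanically, as in the following sketch. The function name is hypothetical, and the example only expands and counts the collection numbers quoted above; it does not attempt to reproduce the object identifier algorithm of ISO/IEC 10646-1.

```python
def parse_collection_list(text):
    """Expand a list such as '1-11, 27-28, 63 and 70' into collection numbers."""
    numbers = []
    for part in text.replace(" and ", ", ").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(n) for n in part.split("-"))
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(part))
    return numbers


# Collection numbers of the EES as quoted in the text.
EES_COLLECTIONS = parse_collection_list("1-11, 27-28, 30-48, 63, 65 and 70")

# Code positions in the EES that had no character assigned at the time.
EES_UNASSIGNED = 4013 - 3095
```

Expanded this way, the EES comprises 35 collections, and the figures quoted above leave 918 code positions without an assigned character.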