CEN Guide to the Use of Character Sets in Europe
TC 304

UCS - Nature of character data

Characters, character names and glyphs

To understand the role of the UCS in the electronic representation of character data, we first need to consider what is meant, in this context, by a character. The instinctive view of a character, which must be our starting point, is that it is the basic element of some writing system, such as a letter of an alphabet or an ideograph of an ideographic writing system. But this view needs refinement in the context of such an ambitious project as the coding of all the languages of the world.

Characters are identified in their written form by their shape, which is an imprecise concept arising from the ability of the human brain to recognize that two distinct and non-identical objects have the same "shape". It is this ability that enables us to read handwriting, different typefaces, etc. It is a learned ability; most Western people have difficulty in telling whether two similar written Chinese ideographs are in fact "the same character". But it exists and we have to accept that there is an abstract concept of "shape" that underlies the entire nature of written language.

Subtleties enter when we realize that there is context dependence to the recognition of written characters. There are letters with the same shape in the Latin and Greek alphabets, for example, but we do not think of them as the same character. The shape for a Latin capital letter A is recognized as a Greek capital letter alpha when it appears in Greek text. A hyphen is interpreted as a minus sign when it appears in mathematical expressions. Are the Greek capital letter omega and the Ohm sign (the symbol for the electrical unit of resistance) the same character or not? Historically the Greek letter was adopted as the Ohm sign, but it is a matter of opinion whether it has by usage now become a symbol in its own right. The viewpoint of the UCS is that they are now distinct characters.
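The point can be illustrated with a short Python sketch. It relies on the standard unicodedata module, whose names come from the Unicode character database; the helper name describe and the choice of sample pairs are assumptions made for this illustration and are not taken from the standard.

```python
import unicodedata

# Pairs of characters that can share a shape but are encoded separately.
LOOKALIKES = [
    ("\u0041", "\u0391"),  # Latin capital A, Greek capital alpha
    ("\u002D", "\u2212"),  # hyphen-minus, minus sign
    ("\u03A9", "\u2126"),  # Greek capital omega, Ohm sign
]

def describe(ch):
    """Return the code position and name of a character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for first, second in LOOKALIKES:
    print(describe(first), "|", describe(second), "| same character:", first == second)
```

Each pair compares as unequal even though a typeface may draw both members with the same shape.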

There are also subtleties in the opposite direction. The Greek language uses two distinct written forms for the Greek small letter sigma depending on whether it is, or is not, the final letter of a word. Printed text often makes use of ligatures (joined letters) for reasons of appearance that have no linguistic basis. For example, printed text in the Latin alphabet often combines a small letter F followed by a small letter I into an fi ligature. This creates a recognizably distinct shape but it is interpreted as two distinct letters when it is read. These are examples where the shape that represents the character or characters is affected by the context in which the character appears.
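As a hedged illustration of these two cases, the following Python sketch inspects the fi ligature and the two forms of sigma. The code positions used (U+FB01, U+03C3, U+03C2) and the NFKC normalization step are taken from the current Unicode repertoire and are assumptions here; this guide does not describe them.

```python
import unicodedata

FI_LIGATURE = "\uFB01"   # LATIN SMALL LIGATURE FI
SIGMA = "\u03C3"         # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"   # GREEK SMALL LETTER FINAL SIGMA

# Compatibility normalization (NFKC) replaces the ligature by the two
# letters that a reader interprets it as.
letters = unicodedata.normalize("NFKC", FI_LIGATURE)
print(unicodedata.name(FI_LIGATURE), "->", [unicodedata.name(c) for c in letters])

# The two written forms of sigma occupy separate code positions,
# but both correspond to the same capital letter.
print(unicodedata.name(SIGMA), "/", unicodedata.name(FINAL_SIGMA))
print(SIGMA.upper() == FINAL_SIGMA.upper())
```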

Which of these subtleties is important, for the purposes of the electronic encoding of data, depends substantially on the use to which the coding is to be put. A particular application of encoded data is normally concerned either with the visual appearance of encoded symbols, e.g. for printing applications, or with the semantics of the encoded symbols, e.g. for data processing. This has given rise to two distinct concepts arising from our first idea of a character as the basic element of some writing system. Elements of written data that are distinguished from one another by visual appearance are known as glyphs. The term character has become specialized to mean elements of written data that are distinguished from one another by semantic interpretation. The formal definitions are as follows:

Character: A member of a set of elements used for the organisation, control, or representation of data (taken from ISO/IEC 10646-1:1993).
Glyph: A recognizable abstract graphic symbol which is independent of any specific design (taken from ISO/IEC 9541-1:1991).

Characters are distinguished from one another by name, not by form or shape. ISO standards for coded character sets normally include tables that show a representative printed form for each character represented. These printed forms are purely illustrative and are not necessarily distinctive; the same shape (glyph) may be used for more than one character in a table. It is the name, such as LATIN CAPITAL LETTER A, that identifies the character being encoded in each code position. It is a convention adopted by the UCS that the names of characters are composed only from Latin capital letters A to Z, digits 0 to 9, space and hyphen. There are restrictions on the use of digits in names; in particular, they may only be used in the names of ideographic characters.
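The naming convention can be checked mechanically. The sketch below is a minimal illustration in Python, assuming that the names reported by the standard unicodedata module correspond to the UCS names; the pattern NAME_PATTERN, the helper follows_convention and the sample characters are hypothetical choices for this example. It tests only the permitted symbols, not the restrictions on digits.

```python
import re
import unicodedata

# The convention described above: capital letters A to Z,
# digits 0 to 9, space and hyphen only.
NAME_PATTERN = re.compile(r"[A-Z0-9 -]+")

def follows_convention(name):
    """True if the whole name uses only the permitted symbols."""
    return NAME_PATTERN.fullmatch(name) is not None

# Latin A, e with acute, Greek omega, and one ideograph.
samples = ["A", "\u00E9", "\u03A9", "\u4E00"]
for ch in samples:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: {follows_convention(name)}")
```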

With this distinction in place, we can say that the UCS is a standard that specifies an encoding of characters. The standard shows a representative printed form (glyph image) for each encoded character, but these are not all distinct from one another.

Graphic characters and control characters

The characters described in the preceding section are all graphic characters, i.e. characters that have a visual representation. Character data also includes characters present for control purposes, such as CARRIAGE RETURN or LINE FEED. The names of these particular control characters originate with electromechanical teleprinters, but the characters are still used today to control paragraph separation in modern text processing systems. They are just two examples of many such non-printing characters that may be required to control the systems used for the display or printing of coded character data.

When data is encoded directly as a sequence of characters, such control characters will appear interspersed in the sequence of graphic characters. They must therefore be assigned code positions along with the graphic characters of the code. Nowadays character data is often transmitted or otherwise processed by means of protocols that separate the control data from the character data. One such protocol is Abstract Syntax Notation One (ASN.1). When such protocols are used, it is not necessary to keep code positions for control characters within the code used for graphic characters as the separation is achieved by other means. However, the UCS does reserve code positions for the use of control characters, to permit use in systems where a single sequence of intermixed graphic and control characters is required.
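A single sequence of intermixed graphic and control characters can be taken apart as in the following Python sketch. It is an illustration only: it assumes that a control character is one whose general category in the Unicode character database is Cc, and the function name split_controls is hypothetical.

```python
import unicodedata

def split_controls(text):
    """Separate a mixed sequence into graphic and control characters.

    Assumption: a character counts as a control character when its
    general category is 'Cc'; everything else is treated as graphic.
    """
    graphic, control = [], []
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            control.append(ch)
        else:
            graphic.append(ch)
    return "".join(graphic), control

# CARRIAGE RETURN and LINE FEED interspersed in graphic characters.
graphic, control = split_controls("First line\r\nSecond line")
print(repr(graphic))
print([f"U+{ord(c):04X}" for c in control])
```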

Alphabetic, syllabic and ideographic scripts

The world's languages whose characters are encoded in the UCS differ substantially from one another in the extent to which the written forms of the languages can be broken down into constituent elements. The scripts used for written languages fall, for this purpose, into three distinct classes: alphabetic, syllabic and ideographic.

In these descriptions the meaning of "limited" is that the number will not increase as further words are added to the language, either in the future or to cover language usage in the past.

These different classes of script have very different requirements in terms of the number of code positions required to represent them in a coded character set. All the alphabetic scripts of the world, taken together, require fewer code positions than does the Chinese ideographic script on its own. The UCS sets aside somewhat over one quarter of its code space in the BMP, a total of 20992 code positions, for the East Asian ideographic scripts of the Chinese, Japanese and Korean languages taken together. A further 11172 code positions are occupied by the Korean Hangul syllabic script. This leaves somewhat over one half of the code space of the BMP for all other scripts of all the other languages of the world that are in current use. This is likely to be more than adequate. The effect of the limitation of space on the encoding of the ideographic scripts is described in the section on unified ideographs in the chapter of this guide on the Basic Multilingual Plane.
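The proportions quoted above can be verified with simple arithmetic. The Python sketch below uses only the two figures given in the text together with the size of the BMP (256 rows of 256 code positions). The remainder it computes is an upper bound, since it makes no allowance for any positions reserved for other purposes.

```python
BMP_SIZE = 256 * 256        # 65536 code positions in the BMP
CJK_IDEOGRAPHS = 20992      # figure quoted in the text
HANGUL_SYLLABLES = 11172    # figure quoted in the text

remaining = BMP_SIZE - CJK_IDEOGRAPHS - HANGUL_SYLLABLES

for label, count in [
    ("CJK ideographs", CJK_IDEOGRAPHS),
    ("Hangul syllables", HANGUL_SYLLABLES),
    ("everything else", remaining),
]:
    print(f"{label}: {count} positions, {count / BMP_SIZE:.1%} of the BMP")
```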

Sequence order and writing mode

The written form of a language is composed of a sequence of script elements. This is true whether the script is alphabetic, syllabic or ideographic. But languages differ from one another in the arrangement of the sequence on paper (or other writing surface). Three arrangements are in common use. The succession of script elements may be written left-to-right (e.g. Latin, Cyrillic and Greek scripts and horizontal Japanese Kanji) or right-to-left (e.g. Hebrew and Arabic scripts), with successive rows being written top-to-bottom, or the script elements may be written top-to-bottom (e.g. vertical Japanese Kanji) with successive columns being written right-to-left.

The sequence order of the characters in an encoding of any script is that of the logical succession of characters, regardless of the writing mode. If the encoding is to be used to create a written presentation of the encoded material, it is up to the application to observe the correct writing mode for the script in use. This is so even for encoded data that intermixes two or more scripts with different writing modes, e.g. text in Latin script containing Hebrew quotations. Where it is required to encode the intended writing mode along with the character data, the control functions SELECT PRESENTATION DIRECTIONS and START REVERSED STRING may be used. Their coding, which makes use of control characters, is defined in ISO/IEC 6429:1992. The first of these functions is used to set the writing mode of the main text. The second is used to reverse the direction temporarily, as in the example of Hebrew quotations within a predominantly Latin script.
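The following Python sketch illustrates logical order for Latin text containing a Hebrew word. It is a deliberately simplified illustration, not an implementation of any standardized presentation algorithm: it assumes that the bidirectional classes L, R and AL reported by the standard unicodedata module indicate the writing direction, and the function name direction_runs is hypothetical.

```python
import unicodedata

# Latin text containing the Hebrew word shalom (shin, lamed, vav,
# final mem), stored in logical order: shin comes first in the data.
text = "He said \u05E9\u05DC\u05D5\u05DD to me"

def direction_runs(s):
    """Group a string into runs of a single writing direction.

    Assumption: only the strong classes 'L', 'R' and 'AL' decide the
    direction; any other character joins the run that precedes it.
    """
    runs = []
    for ch in s:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            direction = "right-to-left"
        elif cls == "L":
            direction = "left-to-right"
        else:
            direction = runs[-1][0] if runs else "left-to-right"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, run) for direction, run in runs]

for direction, run in direction_runs(text):
    print(direction, repr(run))
```

The encoded sequence is the same whatever the direction of each run; arranging the runs on the writing surface is left to the application.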

Certain characters have semantics that depend on writing direction. The symbols "(" and ">" represent an opening parenthesis and a greater-than sign when they occur in a script written from left to right, but in a script written from right to left they represent a closing parenthesis and a less-than sign respectively. There are provisions within ISO/IEC 10646-1 for such characters to be presented in mirrored form, ")" and "<" in this example, when used with a script written from right to left. However, such mirroring should not be performed automatically since there are separate characters which have these glyphs as their normal form. The specific rules governing such forms of presentation are given in annexes C and D of ISO/IEC 10646-1.
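Characters of this kind can be identified programmatically. The Python sketch below assumes that the mirrored property in the Unicode character database, as exposed by the standard unicodedata module, marks the characters that are candidates for mirrored presentation; the helper name is_mirrored is hypothetical. The sketch only reports the property and does not decide whether mirroring should be applied.

```python
import unicodedata

def is_mirrored(ch):
    """True if the character database flags ch as a mirrored character."""
    return unicodedata.mirrored(ch) == 1

for ch in ["(", ")", ">", "<", "A"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: mirrored = {is_mirrored(ch)}")
```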

Precomposed and decomposed characters

Even within alphabetic scripts, there is ambiguity as to what are the constituent elements of the script. Many scripts use diacritical marks, such as accents and tone marks, as modifiers of basic letters. At what point does one cease the decomposition? Is ê (e circumflex) a letter in its own right or a composite of two separate elements, a letter (e) and an accent (circumflex)? If it is to be regarded as a composite on the grounds that the letter and accent are separated from one another, then what about i (small letter I)? A dotless i is a letter in its own right in the Turkish language. And what about ø (o with stroke), which is a superposition that is not visually in two distinct parts?

The UCS has adopted the view that basic letters and diacritical marks should be assigned encodings as separate graphic characters, but that the composites that are in normal use in current languages should also be encoded as graphic characters in their own right. A character such as a diacritical mark, intended only to be used in conjunction with a base letter, is said in the UCS to be a combining character. A composite formed from a base letter and one or more combining characters is called a composite sequence.

A composite sequence is not a character, as it is not a member of the set of elements that form the UCS; it is a sequence of such elements. But both graphic characters and composite sequences have visual representations as glyphs, and the same glyph may be the visual representation both of a graphic character of the UCS and of a composite sequence. The glyph é (e acute) is the visual representation both of the character LATIN SMALL LETTER E WITH ACUTE and the composite sequence LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
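The relationship between the two encodings of e acute can be examined with the Python sketch below. The normalization forms NFD and NFC used to convert between them are a Unicode mechanism that this guide does not describe, so their use here is an assumption made for illustration, as are the constant names.

```python
import unicodedata

PRECOMPOSED = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
COMPOSITE = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# One character versus a sequence of two: the encodings differ
# although the glyph is the same.
print(len(PRECOMPOSED), len(COMPOSITE))
print(PRECOMPOSED == COMPOSITE)

# A non-zero combining class marks the accent as a combining character.
print(unicodedata.combining("\u0301"))

# Normalization converts between the two forms (Unicode mechanism,
# assumed here; not part of this guide).
decomposed = unicodedata.normalize("NFD", PRECOMPOSED)
recomposed = unicodedata.normalize("NFC", COMPOSITE)
print(decomposed == COMPOSITE, recomposed == PRECOMPOSED)

# o with stroke has no decomposition into a base letter and a mark.
print(repr(unicodedata.decomposition("\u00F8")))
```

An application that compares text therefore has to decide whether a graphic character and the corresponding composite sequence are to be treated as equal.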
