CEN Guide to the Use of Character Sets in EuropeTC 304

UCS - Use of control functions with the UCS

The coding of control functions in 7-bit and 8-bit codes

The general structure of 7-bit and 8-bit codes for character sets is given in:

In particular, this standard provides the overall rules governing the use of control functions with such character sets. When control functions are used in conjunction with the UCS, their coded representation is algorithmically derived from their ISO/IEC 2022 representations. In the first edition of ISO/IEC 10646-1 this algorithm was based exclusively on the ISO/IEC 2022 representation for a 7-bit code, but Amendment 3 changed this to permit also coded representation of control functions based on their 8-bit code representation.

The following definitions are taken from ISO/IEC 2022:

control character
A control function the coded representation of which consists of a single bit combination.
control function
An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more bit combinations.

In this context a bit combination is either a 7-bit or an 8-bit byte. In a 7-bit code the hexadecimal values 00-1F are reserved for control characters; this is known as the CL area of the code table. In an 8-bit code there are two such reserved areas, the CL area comprising values 00-1F and the CR area comprising 80-9F.

The coding of any control function either consists of a single control character from either the CL or CR areas, or such a character followed by one or more bit combinations. These following bit combinations are unrestricted in number and value. In particular, they may be (and usually are) values that on their own would be coded representations of graphic characters. The syntax for the use of any control character must provide a way of determining the end of the coding of each control function coded with its use.

ISO/IEC 2022 specifies the assignment of only one control character:

The semantics specified for this control character are simple - it causes the meaning of a limited number of bytes following it to be changed. A sequence consisting of the ESCAPE character and such following bytes is known as an escape sequence.

ISO/IEC 2022 requires an escape sequence to have the following structure:

This structure enables the escape sequence to be delimited without knowledge of the semantics of the control functions represented by such sequences. ISO/IEC 2022 also specifies a more detailed structure for the use of escape sequences. These uses are all for code extension purposes, such as changing the sets of control characters or graphic characters that are in use, but that standard does not specify the semantics of individual escape sequences. These are subject to registration in accordance with:

The register set up in accordance with ISO 2375 is:

C0 and C1 sets of control characters

The ISO 2375 register includes escape sequences that invoke individual control functions, but it also includes escape sequences that invoke sets of control characters. Each such set is either a C0-set or a C1-set. One set of each type may be invoked simultaneously in either a 7-bit or an 8-bit code. In both cases a C0-set is mapped into the CL area of the code table. A C1-set may be mapped into the CR area of an 8-bit code, but an alternative coding is necessary in a 7-bit code and this alternative is also available for use with 8-bit codes.

This alternative coding of a C1-set is by means of escape sequences. Control characters that would be coded as values in the CR area are represented by a two-byte escape sequence consisting of ESCAPE followed by a Final Byte in the range 40-5F. These values 40-5F correspond sequentially to the values 80-9F of the CR area. This coding is known as the ESC Fe representation of the control character. This term is used in the first edition of ISO/IEC 10646-1 in its explanation of the coding of control functions, but the permitted coding of control functions was extended by Amendment 3 and references to ESC Fe sequences have been deleted.

Control functions for purposes other than code extension are specified in

This standard provides a repertoire of control functions from which particular functions may be selected to meet particular needs. Conformance to ISO/IEC 6429 does not require any particular control functions to be supported. It does, however, require that when any of its control functions are used then they shall be encoded as specified in that standard, and that encodings so specified shall not be used for any other purpose.

A particular type of coding specified in ISO/IEC 6429 is a control sequence. Control sequences are similar in nature to the escape sequences of ISO/IEC 2022 but they are less restricted in their use. In particular they permit the coding of control functions that take one or more numeric values as parameters. Control sequences start with the C1 control character

which may be coded either as 9B in the CR area or as the ESC Fe escape sequence ESC 5B. This character occupies the same relative location in a C1 set as does the ESCAPE character (1B) in a C0 set. A control sequence has the following structure:

The Parameter Bytes used to represent a sequence of numeric parameters consist of the decimal representation of those numbers, with digits 0-9 coded by hexadecimal values 30-39, any separator symbol within a parameter (e.g. decimal point) coded as 3A and with distinct parameters separated from one another by value 3B.

The Final Byte value 6D is reserved in ISO/IEC 6429 for use by ISO/IEC 10646-1, in which it is used to encode the control function IDENTIFY UNIVERSAL CHARACTER SUBSET.

The use of control functions with the UCS

The coded representation of a control function for use with the UCS is obtained very simply from its coded representation for use with an 8-bit code. For an 8-bit code the coded representation consists of a sequence of one or more 8-bit bytes, of which the first is in one of the ranges 00-1F or 80-9F. Each of these bytes is converted to a UCS code position by taking the byte value as the C-octet value and setting the G-octet, P-octet and R-octet all to zero. Each UCS code position is then encoded according to the adopted coding method of the UCS, namely one of UCS-2, UCS-4, UTF-8 and UTF-16. Since code positions 0000-001F and 0080-009F of the BMP are reserved for the use of control characters, this gives an unambiguous coded representation of any control function without conflicting with the coding of graphic characters in the UCS.

In the first edition of ISO/IEC 10646-1 there was an additional requirement that control characters from a C1-set should be represented as ESC Fe escape sequences before this algorithm is applied. This requirement was removed by Amendment 3. As an example the control function CONTROL SEQUENCE INTRODUCER, which has coded representation 9B in the CR area of an 8-bit code, would have coded representations in a UCS-2 coding of

Identification of UCS subsets by use of control functions

One particular control function is defined within ISO/IEC 10646-1 for identifying selected subsets of the UCS. This is coded by means of an ISO/IEC 6429 control sequence, using a Final Byte value reserved in that standard for the use of ISO/IEC 10646-1. The control function is

It takes as parameters the collection numbers of the collections that comprise the selected subset. The ISO/IEC 6429 coded representation of this control function consists of :

This coded representation is then converted to the form appropriate to the UCS coding method in use, following the transformation rules explained above.

As an example,

This UCS-2 coding uses the ESC Fe coding form for CSI; following Amendment 3 to ISO/IEC 10646-1 the code values 001B 005B can be replaced by the single code value 009B.

Invocation of the UCS from an 8-bit code

The character code structure and extension techniques specified in ISO/IEC 2022 includes provision for control functions that

The coded representation of such a control function is by means of an escape sequence in which the first Intermediate Byte has value 25. If there is a second Intermediate Byte with value 2F, it signifies that the new coding system either has no means of return to ISO/IEC 2022 coding or that it provides its own means for so doing. If the second Intermediate Byte is either absent or has any other value then it signifies that the new coding system uses the escape sequence

to return to the coding system of ISO/IEC 2022 and that on such return, the state of the coding system (which sets of control and graphic characters are invoked, etc.) is restored to the state at the time that the other coding system was invoked. This is known as the standard return.

When a new coding system is designated and invoked by means of an escape sequence that signifies that the invocation is without standard return, then any return to the coding system of ISO/IEC 2022 is (by 15.4.2 of ISO/IEC 2022) a return to that coding system in an unspecified state. No announcements, designations and invocations of sets of control and graphic characters, or of other features of ISO/IEC 2022, survive from the state in which the new coding system was invoked. This is true even if the new coding system in fact specifies that return shall be by means of the escape sequence ESC 25 40; it is the escape sequence used to invoke the new coding method, not that used to return from it, that determines the state of the ISO/IEC 2022 coding system upon return. This has particular significance for the UCS, for reasons that will be seen below.

The assignment of particular escape sequences of these forms to particular coding methods is subject to registration in accordance with ISO 2375. The escape sequences so allocated are then published in the International Register of Coded Character Sets to be used with Escape Sequences. The following escape sequences have been registered to designate and invoke the coding methods of the UCS. They are given along with their registration numbers:

Note that of the coding methods defined for the UCS, only UTF-8 and (the now withdrawn) UTF-1 have the ability to make use of the standard return, since they are the only coding methods in which the escape sequence for the standard return is not transformed when used with the coding method concerned.

ISO/IEC 10646-1 does specify a means to return to the ISO/IEC 2022 coding system when it has been invoked without standard return. The specified method is by means of the escape sequence ESC 25 40 transformed to the appropriate coding method of the UCS as described above, e.g. 001B 0025 0040 designates a return from UCS-2. This is the same escape sequence as that used for a standard return, but except when used with UTF-8 its coded representation will include additional padding octets with value zero. Even when used with UTF-8, when that coding method has been invoked by means of ISO-IR 190, 191 or 192, it is interpreted for the purposes of ISO/IEC 2022 as a non-standard return, with consequent loss of the state of that coding system as described above.

To Top of UCS Guide