CEN Guide to the Use of Character Sets in Europe (TC 304)

8-Bit Character Sets - Historical background


The assignment of specific bit combinations to a particular set of characters constitutes a coded character set, or more concisely, a code. The larger the set of characters, the greater the number of bits required for the coding. Any increase in the number of bits causes a corresponding increase in cost in the systems that use the resulting code. This need not be a monetary cost; it may instead be a cost in terms of resources, but it is nevertheless real. The historical development of coded character sets is a story of balancing the desire for more characters against the desire for the fewest possible bits. The decisions that were made have had consequences that have lasted long after the pressures that led to them eased. This section of the guide gives some account of that history.

The first binary codes

The legacy of Baudot

The first binary coded character set was a 5-bit code patented by Jean-Maurice-Émile Baudot (1845-1903) in 1874 in connection with his invention of a precursor of the teleprinter. Since the device was operated by electromechanical means, even one further bit would have added significantly to the complexity of the equipment. In 1932 the CCITT (Comité Consultatif International Télégraphique et Téléphonique) standardized a 5-bit code for teleprinters, based on that of Baudot, which is the code of the international telex (teleprinter) network to the present day. This is known as the International Telegraphic Alphabet No.2, also as CCITT code No.2 or simply as the Baudot code. It was last re-issued as ITU-T Recommendation S.1 (1993).

Locking shifts

A 5-bit code has room for 32 characters, which is not enough even for the 26 letters A to Z and the ten digits. To get round this, a teleprinter operated by the Baudot code has a shift lock in the manner of a typewriter. This locks 26 "keys", i.e. bit combinations, into one of two modes. In the alphabetic mode they print the letters A to Z. In the numeric mode some of these bit combinations print the ten digits 0 to 9 and various punctuation marks while the remainder operate certain functions such as line feed, carriage return and sounding a bell. The effect of the remaining 6 bit combinations is not affected by the shift lock. Two of these six are used to switch the shift lock between the two modes. In this way the 5-bit code conveys 58 different signals (26 times 2, plus 6).
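
The locking shift described above can be sketched as a small stateful decoder. This is only an illustration: the code assignments below (which values are the shifts, and which character sits at which position) are hypothetical and chosen for readability. They are not the real assignments of International Telegraphic Alphabet No.2.

```python
# Toy 5-bit code with a locking shift. All assignments are hypothetical;
# they do NOT reproduce the real ITA2 table.
LETTERS_SHIFT = 30   # fixed code: lock into alphabetic mode
FIGURES_SHIFT = 31   # fixed code: lock into numeric mode

# Codes 0..25 are the 26 "keys" whose meaning depends on the shift lock.
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Ten digits, thirteen punctuation marks, then line feed, carriage return, bell.
FIGURES = "0123456789.,:;?!'()+-=/\n\r\a"
assert len(LETTERS) == len(FIGURES) == 26

# Codes 26..29 behave the same in both modes (27..29 are unused in this toy).
FIXED = {26: " ", 27: "", 28: "", 29: ""}


def decode(codes):
    """Decode a sequence of 5-bit values, tracking the shift-lock state."""
    table = LETTERS          # assumption: the receiver starts in alphabetic mode
    out = []
    for code in codes:
        if not 0 <= code < 32:
            raise ValueError(f"not a 5-bit value: {code}")
        if code == LETTERS_SHIFT:
            table = LETTERS  # the lock stays set until the other shift arrives
        elif code == FIGURES_SHIFT:
            table = FIGURES
        elif code in FIXED:
            out.append(FIXED[code])
        else:
            out.append(table[code])
    return "".join(out)


def signal_count(shiftable=26, total=32):
    """Shiftable codes carry two meanings each; the rest carry one."""
    return shiftable * 2 + (total - shiftable)
```

With these toy assignments, decode([7, 8, 31, 1, 2, 30, 0]) returns "HI12A", and signal_count() returns 58, matching the arithmetic in the text. Note that the meaning of a code depends on everything that came before it, which is the main weakness of any locking shift.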

National variants

Right from the beginning it was recognised that different countries had different needs. Although 58 character positions do not give much room for flexibility, the 1932 standard for the Baudot code filled only 55 of the available positions. The remaining three character positions were then available for national use.

ASCII

A 7-bit code

In the late 1950s the American Standards Association (now the American National Standards Institute) set about the development of a new code for the communications and data processing industries. By then, there was a need for further character positions to be available, both for printing and control purposes. To avoid the need for shift operations, it was agreed to develop a 7-bit code. This has 128 bit combinations available. The code was to be known as the American Standard Code for Information Interchange, or ASCII.

The legacy of paper tape

By that time it was common to store data on punched paper tape, for input to data processing systems or automated communication equipment. A "1" bit was represented by a hole and a "0" bit by the absence of a hole. Since a row of 7 zero bits would be indistinguishable from blank tape, the coding "0000000" would have to represent a NULL character (absence of any effect).

Holes, once punched, could not be erased, but any erroneous character could be converted into "1111111" by punching out the remaining positions. This bit pattern was therefore adopted as a DELETE character. Like NULL, it was to have no effect when received or otherwise processed, but it could be punched on top of any other character to erase that character's effects.
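
The paper tape reasoning amounts to a bitwise OR: punching can only add holes, never remove them. A minimal sketch follows, in which the function names are invented for illustration.

```python
NUL = 0b0000000   # blank tape: no holes in the row
DEL = 0b1111111   # every one of the 7 positions punched


def overpunch(existing, new):
    """Punch `new` on top of `existing`. Holes accumulate, so this is an OR."""
    return (existing | new) & 0x7F


def read_tape(rows):
    """Return the rows that carry data, ignoring NUL and DEL rows."""
    return [row for row in rows if row not in (NUL, DEL)]
```

Whatever was on the tape before, overpunch(row, DEL) always yields DEL, so a reader built like read_tape skips the corrected row just as it skips blank tape.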

94 characters

A design decision was taken to reserve the first 32 bit combinations (i.e. those with the two most significant bits being 0) for control characters. This range includes the NULL character but not the DELETE character, so it leaves 95 bit combinations for printing characters. The printing characters (including SPACE) were to be arranged in an order that could be used for sorting purposes. The SPACE character is normally sorted before any other printing character and so was allocated the first position among the printing characters. There are then 94 contiguous positions for printing characters between "0100000" (SPACE) and "1111111" (DELETE).

This division of the code positions into 32 for control starting with NULL, followed by 94 printing characters lying between SPACE and DELETE, has dominated the structure of coded character sets right to the present day; see "the future is 16-bit" below.
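
The division can be checked with a few lines of code. The sketch below classifies each of the 128 bit combinations; the category names are labels chosen here, not terms from any standard.

```python
from collections import Counter


def classify_7bit(code):
    """Place a 7-bit value in the ASCII layout described in the text."""
    if not 0 <= code <= 0x7F:
        raise ValueError(f"not a 7-bit value: {code}")
    if code < 0x20:          # two most significant bits are 0
        return "control"
    if code == 0x20:         # first position after the controls
        return "space"
    if code == 0x7F:         # all seven bits set
        return "delete"
    return "printing"


counts = Counter(classify_7bit(code) for code in range(128))
```

Here counts holds 32 control positions, one SPACE, 94 printing positions and one DELETE, which together account for all 128 bit combinations.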

The first ASCII standard was published in 1963, but at that time it left many bit combinations unallocated. It included capital letters but not small letters. The ASCII standard as we know it today dates from 1968.

Built-in extendability

Although the use of a 7-bit code for ASCII was designed to avoid the use of the locking shifts of the Baudot code, a pair of locking shift codes SHIFT IN (SI) and SHIFT OUT (SO) were included in the set of control characters to allow for future extension. An ESCAPE character was also included, to act as a non-locking route to extension.

International adoption

The later stages of the ASCII story were joint developments with the ISO subcommittee ISO/TC97/SC2. This led to the publication in 1967 of ISO Recommendation 646 (ISO had Recommendations rather than Standards at that time). Just as with the Baudot code, the need for national variants was recognised. With the greater freedom offered by 94 printing characters, 10 positions were reserved for national use. To ensure maximum consistency when these 10 positions were not all required, the recommendation provided a default assignment of characters to these positions. The version in which all the default assignments were used was known as the International Reference Version (IRV).

The most recent edition of this standard is the third edition, ISO/IEC 646:1991, which superseded the second edition of 1983. In these editions there are still 10 positions for national use and in addition a further two that have two alternative graphics assigned (number sign versus pound sign, dollar sign versus a generic currency sign). Both these and the 10 national use positions have specific assignments in the IRV. However, it is important to be aware that the IRV of ISO/IEC 646 was changed between the 1983 and 1991 editions. To conform with de facto usage, the 1991 edition recognised ASCII as the new IRV. The IRV of the 1983 edition specified the generic currency sign alternative in the choice between that and the dollar sign.

The world after ASCII

7-bit codes in an 8-bit world

The 7-bit codes that followed ASCII, such as those for other scripts (Greek, Arabic, etc.), followed the basic structure of ASCII. They kept the 32 control characters, SPACE and DELETE and changed only the remaining 94 printing characters. When 7-bit codes were used in the 8-bit environment provided by most computers, the most significant bit was set to "0". This leaves the NULL character as "00000000", which is consistent with its original design intention, but it codes the DELETE character as "01111111", which no longer has all bits set to "1".

8-bit codes

This approach led to a natural method of extension to accommodate more characters: use a second 94-character 7-bit code and distinguish it by setting the most significant bit to "1". Such an extended code can be transmitted through 7-bit communication channels by use of the SI and SO locking shifts of ASCII. There is no need for second SPACE and DELETE characters, so these positions are unassigned in the second code. When 7-bit codes came to be designed specifically for use in this extended area, they could use the full 96 printing positions.
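
As an illustration of the transmission technique, the following sketch carries such an 8-bit code over a 7-bit channel by inserting SO before characters of the second code and SI before a return to the first. It is a simplification under stated assumptions: the second code is a 94-character one, the channel starts shifted in, and bytes in the upper control area are rejected rather than handled. The function names are invented here.

```python
SO = 0x0E   # SHIFT OUT: following printing codes belong to the second code
SI = 0x0F   # SHIFT IN: following printing codes belong to the first code


def to_7bit(data):
    """Rewrite 8-bit data so that every output byte fits in 7 bits."""
    out = bytearray()
    upper = False                      # assumption: channel starts shifted in
    for b in data:
        if b in (SO, SI) or 0x80 <= b <= 0xA0 or b == 0xFF:
            raise ValueError(f"byte {b:#04x} is outside this sketch")
        if b < 0x21 or b == 0x7F:
            out.append(b)              # controls, SPACE, DELETE: shift-independent
            continue
        want_upper = b >= 0xA1
        if want_upper != upper:        # emit a shift only when the state changes
            out.append(SO if want_upper else SI)
            upper = want_upper
        out.append(b & 0x7F)           # drop the most significant bit
    return bytes(out)


def from_7bit(data):
    """Inverse of to_7bit: restore the most significant bit while shifted out."""
    out = bytearray()
    upper = False
    for b in data:
        if b > 0x7F:
            raise ValueError("not a 7-bit stream")
        if b == SO:
            upper = True
        elif b == SI:
            upper = False
        elif upper and 0x21 <= b <= 0x7E:
            out.append(b | 0x80)
        else:
            out.append(b)
    return bytes(out)
```

For example, to_7bit(bytes([0x41, 0xE9, 0x42])) produces the five bytes 41 0E 69 0F 42 (hexadecimal), and from_7bit restores the original three.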

This design for 8-bit codes has some immediate consequences. Viewed in binary sequence the control characters are no longer contiguous; there are two control areas "000xxxxx" and "100xxxxx" and two separated areas for printing characters, a lower area with the most significant bit set to "0" and an upper area with it set to "1". Between them they can accommodate 190 printing characters (94 plus 96) excluding SPACE.
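
The four areas can be told apart from the three most significant bits alone. The sketch below uses area names chosen here for readability and recomputes the figure of 190 printing characters.

```python
from collections import Counter

SPACE, DELETE = 0x20, 0x7F


def area(byte):
    """Name the area of the 8-bit code table that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"not an 8-bit value: {byte}")
    top3 = byte >> 5
    if top3 == 0b000:
        return "lower control"       # 000xxxxx
    if top3 == 0b100:
        return "upper control"       # 100xxxxx
    if byte & 0x80:
        return "upper printing"      # most significant bit is 1
    return "lower printing"          # most significant bit is 0


areas = Counter(area(b) for b in range(256))

# SPACE and DELETE occupy two of the 96 positions in the lower printing area.
printing = sum(
    1 for b in range(256)
    if area(b).endswith("printing") and b not in (SPACE, DELETE)
)
```

Each control area has 32 positions and each printing area has 96; once SPACE and DELETE are set aside, printing comes to 190.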

Locking shifts again

The 8-bit codes described above have room for, say, accented letters or Greek letters in the upper half, but not both. The need to cater for both at once gave rise to an obvious further extension: locking shifts for 8-bit codes. As with any locking shift mechanism, it is only suitable for communication rather than for data processing, but this route was followed. A second set of two 7-bit codes could then be accommodated, to be shifted as required into the lower and upper areas for printing characters. This gives a total of four 7-bit codes in use simultaneously. There is no need to restrict usage to one alternative code for the lower area and another for the upper, so mechanisms were set up to shift (invoke) any of the four 7-bit codes into the lower area and, independently, any other into the upper area.

The International Register

The limit of four 7-bit codes is in fact arbitrary; it would be possible to have even more 7-bit codes "on standby" and even more shift mechanisms for invoking them. But four is enough for most needs, and once this is exceeded there seems no particular reason to stop at any other number. The next stage in code extension was therefore to permit the choice of the four 7-bit codes to be changed by means of control functions. This is like sending a telephone message to a user of an interchangeable typehead ("golfball") typewriter to change the typehead.

This is more difficult as there has to be some method of identifying the new choice of code to the remote party to the communication. In effect there has to be a catalogue of typeheads from which one can choose, with catalogue numbers that one can communicate. This was achieved by means of an International Register of such codes, established by ISO. Each registered code is assigned a number and is referenced as ISO-IR xx. In this register, for example, the IRV of ISO 646:1983 is ISO-IR 2 and that of ISO/IEC 646:1991 (ASCII) is ISO-IR 6. This register enables new codes to be selected (designated) and subsequently shifted into use (invoked), all through the use of control functions.
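
The distinction between designating a code and invoking it can be modelled as two separate lookups. The sketch below is a simplified model, not an implementation of any standard: the class, its method names and its initial state are assumptions made for illustration, and the register excerpt holds only the two entries named above. The escape sequences follow the usual ISO/IEC 2022 form for 94-character sets (ESCAPE, an intermediate byte selecting one of four standby positions conventionally called G0 to G3, then a final byte identifying the registered set).

```python
ESC = 0x1B

# Intermediate bytes selecting the standby position for a 94-character set.
INTERMEDIATE_94 = {"G0": 0x28, "G1": 0x29, "G2": 0x2A, "G3": 0x2B}

# Tiny excerpt of the register: ISO-IR number -> (description, final byte).
REGISTER = {
    2: ("IRV of ISO 646:1983", 0x40),
    6: ("IRV of ISO/IEC 646:1991 (ASCII)", 0x42),
}


def designation_sequence(slot, ir_number):
    """Bytes telling the remote party which registered set fills a position."""
    _description, final = REGISTER[ir_number]
    return bytes([ESC, INTERMEDIATE_94[slot], final])


class CodeState:
    """Tracks which sets are on standby and which are shifted into use."""

    def __init__(self):
        # Hypothetical starting point: ASCII on standby in G0, nothing else.
        self.designated = {"G0": 6, "G1": None, "G2": None, "G3": None}
        self.invoked = {"lower": "G0", "upper": "G1"}

    def designate(self, slot, ir_number):
        """Change the typehead on standby; returns the bytes to transmit."""
        sequence = designation_sequence(slot, ir_number)
        self.designated[slot] = ir_number
        return sequence

    def invoke(self, area, slot):
        """Shift a standby position into the lower or upper printing area."""
        if slot not in self.designated:
            raise KeyError(slot)
        self.invoked[area] = slot

    def active(self, area):
        """ISO-IR number currently in use for the given area, or None."""
        return self.designated[self.invoked[area]]
```

In this model, designating ISO-IR 2 into G2 changes nothing visible until G2 is invoked into one of the printing areas. That two-step structure is the catalogue and typehead analogy of the text: first the typehead is chosen by number, then it is fitted.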

The structure of character codes and their extension techniques described above is formalised in the international standard ISO/IEC 2022. The International Register is maintained according to procedures laid down in ISO 2375 and is published by the Japanese standards institution (JISC) under the authority of ISO. The register may be accessed on-line. It currently contains over 200 registered coded character sets.

Limits on expansion

Not all implementations of communications protocols may be able to cope with all the possibilities of this complex system of code extension by designation and invocation. There needs to be a way of notifying the remote user of the intention to use only a selection of the available facilities. This was achieved by means of further control functions, known as announcers. These make it possible, for example, for a system to announce that it will only use a 7-bit code (or an 8-bit code) with no code extension facilities.

This summary of the world after ASCII has skipped over a number of difficulties that arise in these code extension techniques. In particular, attention has been concentrated on the printing characters. The control characters also have their extension problems. An account with greater precision is given in the section on concepts and definitions.

The future is 16-bit

With the growing processing power of computers and the increasing bandwidth of communications channels, the pressure to squeeze an ever increasing number of characters into an 8-bit code structure has diminished. A need has arisen for a simpler structure at the expense of more bits. This need has given rise to a complete rethinking of code structure for a world of 16-bit and even 32-bit processing and communication. From it has arisen a new international standard, ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set.

It is interesting to note that even this "ultimate" standard retains some past legacies. Control functions are coded according to ISO/IEC 2022, although the code extension functions of that standard are forbidden. The first 32 bit combinations are therefore reserved for control purposes. The next 95 bit combinations contain the printing characters of ASCII including SPACE. This brings one to the bit combination "00...001111111" (the dots denote enough zeroes to fill either 16 or 32 bits, as the case may be). The legacy of paper tape survives. This is still reserved for the DELETE character!
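
The survival of this layout can be observed in any environment whose strings use code points aligned with ISO/IEC 10646, as Python strings do. The sketch below is an illustration only; the helper name is invented here.

```python
import unicodedata


def padded_bits(code, width):
    """Write a code position as a bit string of the given width (16 or 32)."""
    if width not in (16, 32):
        raise ValueError("width must be 16 or 32")
    return format(code, f"0{width}b")


# General category "Cc" marks a control position in the character database.
controls = [c for c in range(0x80) if unicodedata.category(chr(c)) == "Cc"]
printing = [c for c in range(0x80) if unicodedata.category(chr(c)) != "Cc"]
```

Here controls consists of the first 32 positions plus 0x7F, printing consists of the 95 positions from SPACE onwards, and padded_bits(0x7F, 16) gives "0000000001111111".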

It is the intention that ISO/IEC 10646 will be, in some sense, the last character set standard. It is planned as a multi-part standard, of which part 1 was published in 1993. Future parts will add to the code, and since it has the potential to fill a 32-bit code space, it has the capacity to be extended to meet all foreseeable future needs. It has the ultimate aim of including all characters that have ever been used for communication. The coding of ancient runes has already been standardized; that of Egyptian hieroglyphics is for future study.

More detailed information may be found in the part of the Guide to the Use of Character Set Standards in Europe dealing specifically with the UCS code structure.
