8-Bit Character Sets - Historical background
The assignment of specific bit combinations to a particular set of characters
constitutes a coded character set, or more concisely, a code. The larger the
set of characters, the greater the number of bits required for the coding.
Any increase in the number of bits causes a corresponding increase in cost in
the systems that use the resulting code. This need not be a monetary cost;
it may instead be a cost in resources, but it is nevertheless real.
The historical development of coded character sets is a story of balancing the
desire for more characters against that for the fewest possible number of
bits. The decisions that were made have had consequences that have lasted
long after the pressures that led to them have eased. This section of the
guide gives some account of that history.
The first binary coded character set was a 5-bit code patented by
Jean-Maurice-Émile Baudot (1845-1903) in 1874 in connection with his
invention of a precursor of the teleprinter. Since the device was operated
by electromechanical means, even one further bit would have added
significantly to the complexity of the equipment. In 1932 the CCITT
(Comité Consultatif International Télégraphique et
Téléphonique) standardized a 5-bit code for teleprinters, based
on that of Baudot, which is the code of the international telex (teleprinter)
network to the present day. This is known as the International Telegraphic
Alphabet No.2, also as CCITT code No.2 or simply as the Baudot code. It was
last re-issued as ITU-T Recommendation S.1 (1993).
A 5-bit code has room for 32 characters, which is not enough even for the 26
letters A to Z and the ten digits. To get round this, a teleprinter operated
by the Baudot code has a shift lock in the manner of a typewriter. This locks
26 "keys", i.e. bit combinations, into one of two modes. In the
alphabetic mode they print the letters A to Z. In the numeric mode some of
these bit combinations print the ten digits 0 to 9 and various punctuation
marks while the remainder operate certain functions such as line feed,
carriage return and sounding a bell. The effect of the remaining 6 bit
combinations is not affected by the shift lock. Two of these six are used to
switch the shift lock between the two modes. In this way the 5-bit code
conveys 58 different signals (26 times 2, plus 6).
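The shift-lock mechanism described above can be sketched as a small state machine. This is only a toy model: every code value in it, including the two shift codes, is a placeholder chosen for illustration and is not taken from the real International Telegraphic Alphabet No.2 table, and the names `decode` and `signal_count` are hypothetical.

```python
# Toy model of a 5-bit code with a shift lock.
# All code values are illustrative placeholders, NOT the real
# International Telegraphic Alphabet No.2 assignments.
SHIFT_TO_LETTERS = 0b11111  # hypothetical value
SHIFT_TO_FIGURES = 0b11011  # hypothetical value

# Combinations whose meaning depends on the shift lock
# (26 of them in the real code; only 3 in this toy table).
LETTERS = {0b00001: "A", 0b00010: "B", 0b00011: "C"}
FIGURES = {0b00001: "1", 0b00010: "2", 0b00011: "3"}

# Combinations whose meaning is the same in both modes.
UNSHIFTED = {0b00100: " "}


def decode(codes):
    """Decode a sequence of 5-bit values, tracking the shift lock."""
    table = LETTERS  # assume the device starts in alphabetic mode
    out = []
    for code in codes:
        if code == SHIFT_TO_LETTERS:
            table = LETTERS
        elif code == SHIFT_TO_FIGURES:
            table = FIGURES
        elif code in UNSHIFTED:
            out.append(UNSHIFTED[code])
        else:
            out.append(table[code])
    return "".join(out)


def signal_count(bits=5, shifted=26):
    """Signals conveyed: shifted combinations count twice, the rest once."""
    return shifted * 2 + (2 ** bits - shifted)
```

With these placeholder tables, `decode([1, 2, 0b11011, 1, 2, 0b00100, 0b11111, 3])` yields `"AB12 C"`, and `signal_count()` reproduces the figure of 58 given in the text.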
Right from the beginning it was recognised that different countries had
different needs. Although 58 character positions do not give much room for
flexibility, the 1932 standard for the Baudot code filled only 55 of the
available positions. The remaining three character positions were then
available for national use.
In the late 1950s the American Standards Association (now the American
National Standards Institute) set about the development of a new code for the
communications and data processing industries. By then, there was a need for
further character positions to be available, both for printing and control
purposes. To avoid the need for shift operations, it was agreed to develop
a 7-bit code. This has 128 bit combinations available. The code was to be
known as the American Standard Code for Information Interchange, or ASCII.
By that time it was common to store data on punched paper tape, for input to
data processing systems or automated communication equipment. A "1"
bit was represented by a hole and a "0" bit by the absence of a
hole. Since a row of 7 zero bits would be indistinguishable from blank tape,
the coding "0000000" would have to represent a NULL character
(absence of any effect).
Since holes, once punched, could not be erased but an erroneous character
could always be converted into "1111111", this bit pattern was
adopted as a DELETE character. When received, or otherwise processed, it
again was to have no effect but it could be punched on top of any other
character to erase that character's effects.
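The paper-tape reasoning can be checked with a few lines of arithmetic. The sketch below assumes that punching a new pattern over an existing row can only add holes, which is modelled as a bitwise OR; the function names `overpunch` and `read_tape` are hypothetical.

```python
NULL = 0b0000000    # blank tape: no holes
DELETE = 0b1111111  # every position punched


def overpunch(existing: int, new: int) -> int:
    """Punch `new` over `existing`; holes can be added but never removed."""
    return existing | new


def read_tape(rows):
    """Return the rows that carry meaning, skipping NULL and DELETE."""
    return [row for row in rows if row not in (NULL, DELETE)]
```

Because `overpunch(c, DELETE)` equals `DELETE` for every 7-bit value `c`, any erroneous row can be turned into a row that the reader ignores.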
A design decision was taken to reserve the first 32 bit combinations (i.e.
with the two most significant bits being 0) for control characters. This
range includes the NULL character but not the DELETE character, so it leaves
95 bit combinations for printing characters. The printing characters
(including SPACE) were to be arranged in an order that could be used for
sorting purposes. The SPACE character is normally sorted before any other
printing character and so was allocated the first position among the printing
characters. There are then 94 contiguous positions for printing characters
between "0100000" (SPACE) and "1111111" (DELETE).
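The counts in this layout can be verified with a short classification sketch. The function name `classify` and the category labels are chosen here for illustration only.

```python
from collections import Counter


def classify(code: int) -> str:
    """Place a 7-bit combination in the layout described above."""
    if not 0 <= code <= 0b1111111:
        raise ValueError("not a 7-bit combination")
    if code == 0b1111111:
        return "DELETE"
    if code >> 5 == 0:           # two most significant bits are 0
        return "control"
    if code == 0b0100000:
        return "SPACE"
    return "printing"


counts = Counter(classify(code) for code in range(128))
```

Here `counts` comes out as 32 control combinations, one SPACE, 94 other printing characters and one DELETE, giving 95 printing characters when SPACE is included.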
This division of the code positions into 32 for control starting with NULL,
followed by 94 printing characters lying between SPACE and DELETE, has
dominated the structure of coded character sets right to the present day; see
"the future is 16-bit" below.
The first ASCII standard was published in 1963, but at that time it left many
bit combinations unallocated. It included capital letters but not small
letters. The ASCII standard as we know it today dates from 1968.
Although the use of a 7-bit code for ASCII was designed to avoid the use of
the locking shifts of the Baudot code, a pair of locking shift codes SHIFT IN
(SI) and SHIFT OUT (SO) were included in the set of control characters to
allow for future extension. An ESCAPE character was also included, to act as
a non-locking route to extension.
The later stages of the ASCII story were joint developments with the ISO
subcommittee ISO/TC97/SC2. This led to the publication in 1967 of ISO
Recommendation 646 (ISO had Recommendations rather than Standards at that
time). Just as with the Baudot code, the need for national variants was
recognised. With the greater freedom offered by 94 printing characters, 10
positions were reserved for national use. To ensure maximum consistency when
these 10 positions were not all required, the recommendation provided a
default assignment of characters to these positions. The version in which all
the default assignments were used was known as the International Reference
Version (IRV).
The most recent edition of this standard is the third edition, ISO/IEC
646:1991, which superseded the second edition
of 1983. In these editions there are still 10 positions for national use and
in addition a further two that have two alternative graphics assigned (number
sign versus pound sign, dollar sign versus a generic currency sign). Both
these and the 10 national use positions have specific assignments in the IRV.
However, it is important to be aware that the IRV of ISO/IEC 646 was changed
between the 1983 and 1991 editions. To conform with de facto usage,
the 1991 edition recognised ASCII as the new IRV. The IRV of the 1983 edition
specified the generic currency sign alternative in the choice between that and
the dollar sign.
The 7-bit codes that followed ASCII, such as those for other scripts (Greek, Arabic,
etc.), followed the basic structure of ASCII. They kept the 32 control
characters, SPACE and DELETE and changed only the remaining 94 printing
characters. When 7-bit codes were used in the 8-bit environment provided by
most computers, the most significant bit was set to "0". This
leaves the NULL character as "00000000", which is consistent with
its original design intention, but it codes the DELETE character as
"01111111", which no longer consists entirely of "1" bits.
This approach led to a natural method of extension to accommodate more
characters: use a second 94-character 7-bit code and distinguish it by
setting the most significant bit to "1". Such an extended code can
be transmitted through 7-bit communication channels by use of the SI and SO
locking shifts of ASCII. There is no need for second SPACE and DELETE
characters, so these positions are unassigned in the second code. When 7-bit
codes came to be designed specifically for use in this extended area, they
could use the full 96 printing positions.
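As a rough sketch of how such an extended code might pass through a 7-bit channel, the functions below insert SHIFT OUT before a run of upper-area characters and SHIFT IN before returning to the lower area. The function names are hypothetical, and the sketch makes simplifying assumptions: the upper area holds a 94-character code, the input contains no upper control characters, and the input does not itself contain SO or SI.

```python
SO, SI = 0x0E, 0x0F  # SHIFT OUT and SHIFT IN in ASCII


def to_7bit(data: bytes) -> bytes:
    """Encode 8-bit data for a 7-bit channel using locking shifts."""
    out = bytearray()
    shifted = False  # True while the second code is shifted in
    for b in data:
        if b in (SO, SI):
            raise ValueError("input must not contain SO or SI")
        if b >= 0x80:
            if not 0xA1 <= b <= 0xFE:
                raise ValueError("only the 94 upper printing positions are handled")
            if not shifted:
                out.append(SO)
                shifted = True
            out.append(b & 0x7F)
        else:
            # Controls, SPACE and DELETE are unaffected by the shift state.
            if 0x21 <= b <= 0x7E and shifted:
                out.append(SI)
                shifted = False
            out.append(b)
    if shifted:
        out.append(SI)  # leave the channel in its initial state
    return bytes(out)


def from_7bit(data: bytes) -> bytes:
    """Reverse to_7bit, restoring the most significant bit."""
    out = bytearray()
    shifted = False
    for b in data:
        if b == SO:
            shifted = True
        elif b == SI:
            shifted = False
        elif shifted and 0x21 <= b <= 0x7E:
            out.append(b | 0x80)
        else:
            out.append(b)
    return bytes(out)
```

For example, the five bytes `41 E9 E8 20 42` (hexadecimal) are sent as `41 0E 69 68 20 0F 42`, and decoding restores the original. The shift state is only changed when a printing character actually requires it, which keeps the number of inserted shift characters small.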
This design for 8-bit codes has some immediate consequences. Viewed in binary
sequence the control characters are no longer contiguous; there are two
control areas "000xxxxx" and "100xxxxx" and two separated
areas for printing characters, a lower area with the most significant bit set
to "0" and an upper area with it set to "1". Between them
they can accommodate 190 printing characters (94 plus 96) excluding SPACE.
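The four areas and their sizes can be tabulated with a short sketch. The function name `area` and the labels it returns are illustrative only and are not the terms used by any standard.

```python
from collections import Counter


def area(byte: int) -> str:
    """Place an 8-bit combination in the structure described above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not an 8-bit combination")
    high = byte >> 7      # most significant bit
    low7 = byte & 0x7F    # remaining seven bits
    if low7 >> 5 == 0:    # patterns 000xxxxx and 100xxxxx
        return "upper control" if high else "lower control"
    if high:
        return "upper printing"
    if low7 == 0x20:
        return "SPACE"
    if low7 == 0x7F:
        return "DELETE"
    return "lower printing"


sizes = Counter(area(b) for b in range(256))
```

The tally gives 32 combinations in each control area, 94 in the lower printing area and 96 in the upper, so `sizes["lower printing"] + sizes["upper printing"]` is 190.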
The 8-bit codes described above have room for, say, accented letters or Greek
letters in the upper half, but not both. The need to cater for both at once
gave rise to an obvious further extension: locking shifts for 8-bit codes.
As with any locking shift mechanism, it is only suitable for communication
rather than for data processing, but this route was followed. A second set
of two 7-bit codes could then be accommodated, to be shifted as required into
the lower and upper areas for printing characters. This gives a total of four
7-bit codes in use simultaneously. There is no need to restrict usage to one
alternative code for the lower area and another for the upper, so mechanisms
were set up to shift (invoke) any of the four 7-bit codes into the lower area
and, independently, any other into the upper area.
The limit of four 7-bit codes is in fact arbitrary; it would be possible to
have even more 7-bit codes "on standby" and even more shift
mechanisms for invoking them. But four is enough for most needs, and once
this is exceeded there seems no particular reason to stop at any other number.
The next stage in code extension was therefore to permit the choice of the
four 7-bit codes to be changed by means of control functions. This is like
sending a telephone message to a user of an interchangeable typehead
("golfball") typewriter to change the typehead.
This is more difficult as there has to be some method of identifying the new
choice of code to the remote party to the communication. In effect there has
to be a catalogue of typeheads from which one can choose, with catalogue
numbers that one can communicate. This was achieved by means of an
International Register of such codes, established by ISO. Each registered
code is assigned a number and is referenced as ISO-IR xx. In this register,
for example, the IRV of ISO 646:1983 is ISO-IR 2 and that of ISO/IEC 646:1991
(ASCII) is ISO-IR 6. This register enables new codes to be selected
(designated) and subsequently shifted into use (invoked), all through the use
of control functions.
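The two-step process of designation followed by invocation can be modelled as a small piece of state. The sketch below is a conceptual model only: the class and method names are hypothetical, the slots are simply numbered 0 to 3, and it does not represent the actual control functions or their coding. The two register entries are those mentioned in the text.

```python
# Excerpt of the register, limited to the two entries named in the text.
REGISTER = {
    2: "IRV of ISO 646:1983",
    6: "IRV of ISO/IEC 646:1991 (ASCII)",
}


class CodeExtensionState:
    """Tracks which registered codes are on standby and which are in use."""

    def __init__(self):
        self.designated = [None, None, None, None]  # four standby slots
        self.lower = 0  # slot currently invoked into the lower area
        self.upper = 1  # slot currently invoked into the upper area

    def designate(self, slot: int, registration_number: int) -> None:
        """Select a registered code into one of the four standby slots."""
        if registration_number not in REGISTER:
            raise KeyError(f"ISO-IR {registration_number} is not in this excerpt")
        self.designated[slot] = registration_number

    def invoke(self, area: str, slot: int) -> None:
        """Shift a designated code into the lower or upper printing area."""
        if area not in ("lower", "upper"):
            raise ValueError("area must be 'lower' or 'upper'")
        if self.designated[slot] is None:
            raise ValueError("nothing has been designated to that slot")
        if area == "lower":
            self.lower = slot
        else:
            self.upper = slot

    def active(self, area: str):
        """Return the description of the code in use in an area, if any."""
        slot = self.lower if area == "lower" else self.upper
        return REGISTER.get(self.designated[slot])
```

A session might designate ISO-IR 6 to slot 0 and ISO-IR 2 to slot 2, then call `invoke("lower", 2)` to switch the lower area to the 1983 IRV and `invoke("lower", 0)` to switch back to ASCII.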
The structure of character codes and their extension techniques that is
described above is formalised in the international standard ISO/IEC 2022.
The International Register is maintained according to procedures laid down
in ISO 2375 and is published by the Japanese standards institution (JISC)
under the authority of ISO. The register may be accessed on-line. It
currently contains over 200 registered coded character sets.
Not all implementations of communications protocols may be able to cope with
all the possibilities of this complex system of code extension by designation
and invocation. There needs to be a way of notifying the remote user of the
intention to use only a selection of the available facilities. This was
achieved by means of further control functions, known as announcers. These
make it possible, for example, for a system to announce that it will only use
a 7-bit code (or an 8-bit code) with no code extension facilities.
This section's summary of the world after ASCII has skipped over a number
of difficulties that arise in these code extension techniques. In particular,
attention has been concentrated on the printing characters. The control
characters also have their extension problems. An account with greater
precision is given in the section on concepts and
With the growing processing power of computers and the increasing bandwidth
of communications channels, the pressure to squeeze an ever increasing number
of characters into an 8-bit code structure has diminished. A need has arisen
for a simpler structure at the expense of more bits. This need has given
rise to a complete rethinking of code structure for a world of 16-bit and
even 32-bit processing and communication. From it has arisen a new
international standard, ISO/IEC 10646, the Universal Multiple-Octet Coded
Character Set.
It is interesting to note that even this "ultimate" standard retains some
past legacies. Control functions are coded according to ISO/IEC 2022,
although the code extension functions of that standard are forbidden. The
first 32 bit combinations are therefore reserved for control purposes. The
next 95 bit combinations contain the printing characters of ASCII including
SPACE. This brings one to the bit combination "00...001111111" (the dots
denote enough zeroes to fill either 16 or 32 bits, as the case may be),
which is still reserved for the DELETE character. The legacy of paper tape
survives!
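This legacy can be observed from a modern programming language. The sketch below uses Python's standard `unicodedata` module, on the assumption that its character data (taken from the Unicode Character Database) assigns the same code positions as ISO/IEC 10646 in this range; the helper name `is_control` is hypothetical.

```python
import unicodedata


def is_control(code_point: int) -> bool:
    """True if the character has the general category Cc (control)."""
    return unicodedata.category(chr(code_point)) == "Cc"


# Control characters among the first 128 code positions.
low_controls = [cp for cp in range(0x80) if is_control(cp)]
```

The list `low_controls` consists of the positions 0 to 31 followed by 127, so the DELETE character still sits immediately after the 95 printing characters of ASCII.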
It is the intention that ISO/IEC 10646 will be, in some sense, the last
character set standard. It is planned as a multi-part standard, of which
part 1 was published in 1993. Future parts will add to the code, and since
it has the potential to fill a 32-bit code space, it has the capacity to be
extended to meet all foreseeable future needs. It has the ultimate aim of
including all characters that have ever been used for communication. The
coding of ancient runes has already been standardized; that of Egyptian
hieroglyphics is for future study.
More detailed information may be found in the part of the Guide to the Use of Character Set Standards in Europe dealing specifically with the UCS code structure.