Title: Response to comments on the question of encoding Egyptian hieroglyphs in the UCS (N2096)

Source: Michael Everson, Everson Gunn Teoranta (IE)
Status: Expert contribution
Date: 1999-10-04
Action: For information

Dr Schenkel of the Institut für Ägyptologie of the Eberhard-Karls-Universität Tübingen responded in SC2/WG2 N2096 ( sc2/wg2/docs/n2096.pdf) to my exploratory proposal to encode Egyptian hieroglyphs in the UCS (SC2/WG2 N1944 ( c2/wg2/doc/n1944.pdf)).

Those of us who have taken an interest in encoding Egyptian in the UCS are well aware that it will take a long time to finalize a standard repertoire. If we erred in thinking that the CCER fonts were more useful than they are -- and there may be defensible arguments for why they could be considered useful even so -- we have not erred in making an analysis of both the repertoire of those fonts and of the system used by MacScribe and WinGlyph to manipulate them in processing. Egyptian will one day be encoded in the UCS, as one of the world's major scripts. It is important that Egyptological programmers learn about the coding conventions of the UCS, and that the architects of the UCS learn about Egyptian processing and presentation requirements. In the short term, even long before a formal encoding is made, prototype software which can interact with UCS-based operating systems could -- and should -- be made using the Private Use Zone in the BMP or the Private Use Plane if the former proves too small. At the WG2 meeting in Copenhagen, the convener of WG2 and I strongly urged that Egyptian not be "dropped" from our agenda until some unknown day when Egyptologists are satisfied with the repertoire, but rather that discussion should continue at the present time. In this paper I will try to address the numbered points in Dr Schenkel's contribution to explain why.

0. The proposal is not mature enough for decision.

We knew this ourselves. But architectural issues must be discussed. As stated, we know that even if it will take a long time Egyptian will one day be encoded in the UCS; we look forward to eventual maturity for encoding, whether or not we ourselves will be involved. (It might not even happen in our lifetimes!) We take it as read that Egyptian, as one of the major scripts of the ancient world, is a prime candidate for encoding.

1. All current lists are based on lead printing types and fonts.

In this regard Egyptian is not unlike the Asian ideographs. When these were first encoded a unification of existing character sets was done; many glyph variants were encoded as "compatibility characters" despite their "unifiability" (for purposes of roundtrip mapping). This only applied to existing national and industrial Chinese, Japanese, and Korean standards. It may or may not apply to Egyptian for various reasons. For instance, if mountains of data were encoded with the CCER fonts, compatibility encoding might be considered advantageous in terms of converting that data to UCS format. Or it might not, though this might necessitate complex mapping tables.

Strict adherence to the character/glyph model may not be advantageous for Egyptian. Or it may -- but we need to ask the question and consider the costs and benefits.

2. The Standard Library (Gardiner's list) contains extraneous characters, and if (initial) coding were only to focus on classical Egyptian, the set should be twice as large as the Standard Library.

This is a good point, but it should also be remembered that scripts can be encoded incrementally (cf. Basic and Extended Tibetan, Basic and Extended Ethiopic, etc.). If nothing else, the value of an eventual online version of Gardiner's Egyptian Grammar should be considered, since a great deal of introductory and non-specialist material is restricted to that set. Early encoding of the Gardiner set could have value to many users (professional and amateur). The set, whether perfect or not, is well-established and very widespread in many publications. I would recommend that the entire Gardiner set (even as a subset of the CCER Standard Library) be considered to be standard and a candidate for early (but not rushed) encoding even if it has a few nonce palaeographic forms. In the English-speaking world, at least, every introductory course in Egyptian is based on it. Dr Schenkel should feel free to contradict me on this point if I am entirely misguided in my thinking.

(As an aside, I did introductory Egyptian with Nigel Strudwick at UCLA in 1986 or so. At that time I experimented in preparing some homework assignments using glyphs I prepared with bitmap fonts and MacPaint on the Macintosh. We used Gardiner.)

Again, there are millions of code positions available in principle, and characters can be coded incrementally. At minimum the very widely known set of "the Egyptian alphabet" (given in section 19 of Gardiner) might also be considered for coding in the very short term. There are a great many nonspecialist users of this set. Specialists must remember that the Universal Character Set is for everyone, not just for academics. But we don't have to rush forward with that either.

3. The Extended Library is an incomplete glyph registry of the Ptolemaic-Roman period which doesn't even sort the set systematically.

It is normal to anticipate an incremental encoding process for scripts like Egyptian, where new characters are always to be found.

With regard to ordering, I assume that characters are sorted by the Gardiner-based catalogue numbers -- one reason why in N1944 I proposed using these numbers as the character names. This is an issue which needs to be discussed. If ordering is based on the catalogue numbers this is not really problematic. I can't easily imagine other ordering schemes.
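As an illustration only, ordering by catalogue number could be sketched as below. The sketch assumes catalogue numbers of the form category letter (A to Z, or "Aa", which Gardiner places after Z), a number, and an optional variant suffix such as the "A" in "A14A". The function name and the sample codes are hypothetical and not taken from N1944 or from the CCER software.

```python
import re

# Assumed shape of a catalogue number: category, number, optional variant suffix.
_CODE = re.compile(r"^(Aa|[A-Z])(\d+)([A-Za-z]*)$")


def gardiner_sort_key(code):
    """Return a tuple that sorts catalogue numbers in list order."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognised catalogue number: {code!r}")
    category, number, variant = match.groups()
    # "Aa" (unclassified signs) is ranked after "Z".
    rank = 26 if category == "Aa" else ord(category) - ord("A")
    # Compare numbers numerically so that A2 precedes A14.
    return (rank, int(number), variant)


codes = ["G43", "Aa1", "A14A", "A2", "A14", "Z1", "A1"]
ordered = sorted(codes, key=gardiner_sort_key)
print(ordered)
```

A plain string sort would put "A14" before "A2" and "Aa1" before "G43", which is why the key splits the number into its parts.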

4. Early Egyptian characters have not been considered.

They're not in the base character sets, that's why. Marginal or not, these could be added incrementally once that particular set is mature, regardless of the status of other Egyptian characters.

5. Erik Hornung is working on a new repertoire.

This sounds like exactly the sort of thing we need to further the discussion on the repertoire. We need to contact him and give him N1944, N2096, and the present document in order to further the discussion. Can DIN provide him with these documents?

6A. Whether a character is considered a character or a glyph variant is not a constant at the present state of knowledge.

With regard to the question of whether abstract characters might be encoded today (or in the year 2012) and discovered tomorrow (or in 2129) to be glyph variants, this is not, in principle, a problem. Normalization tables could handle subsequent unifications if "duplicate" characters (even Gardiner's) were encoded. Such tables may be used for operations like searching and inputting.
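A minimal sketch of such a table, used here for searching, might look as follows. The code points are placeholders in the BMP Private Use Zone and the pairings are invented for illustration; no real Egyptian character assignments or actual unifications are implied.

```python
# Hypothetical fold table: each "duplicate" code point maps to the code point
# of the character it was later judged to be a glyph variant of.
FOLD = {
    0xE001: 0xE000,  # placeholder: variant form -> base character
    0xE005: 0xE004,  # placeholder: variant form -> base character
}


def fold(text):
    """Replace every duplicate with its base character; other text is unchanged."""
    return text.translate(FOLD)


def find_folded(haystack, needle):
    """Search while ignoring distinctions later declared to be glyph variants."""
    # The mapping is one-to-one in length, so the index is valid in the original.
    return fold(haystack).find(fold(needle))
```

The stored data is left untouched; only the comparison is folded, so a later change of opinion about a unification means changing the table rather than the texts.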

It is well-known that the glyph/font question is less well-defined for Egyptian than it is for many scripts (though the same can be said for many ancient scripts). Looking ahead even 300 years, it is obvious that there will always be overlap and ambiguity for much data due simply to the complexity and time-depth of the corpus. This is not disputed, even by Egyptologists.

Once again, this is not an argument for rushing through any Egyptian encoding, but it is intended to point out that there are different levels of processing which can enable appropriate handling even if a "character/glyph mistake" were ever made. One consideration is the data currently encoded with the CCER set. No matter what, this data will one day need to be converted and normalized to whatever is eventually encoded in the UCS. Now is the right time to think about this. Now is not necessarily the time to act and encode, but now is the time to recognize the importance of what will one day be 8-bit legacy data (or multi-octet Private Use data) that will need to be mapped to Egyptian as encoded in the UCS.
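The kind of mapping meant here can be sketched as a round-trip table. The sketch assumes that a legacy glyph is identified by a font name plus an 8-bit value; that is a guess at how 8-bit data would address more than 256 glyphs, not a description of the CCER fonts. The font names, byte values, and code points are all placeholders.

```python
# Hypothetical table: (font name, byte value) -> Private Use code point.
LEGACY_TO_PUA = {
    ("FontA", 0x41): 0xE000,
    ("FontA", 0x42): 0xE001,
    ("FontB", 0x41): 0xE100,
}
# Inverse table, so that converted data can be mapped back without loss.
PUA_TO_LEGACY = {cp: key for key, cp in LEGACY_TO_PUA.items()}


def legacy_to_ucs(font, data):
    """Convert a run of 8-bit values in one legacy font to a UCS string."""
    try:
        return "".join(chr(LEGACY_TO_PUA[(font, byte)]) for byte in data)
    except KeyError as exc:
        raise ValueError(f"unmapped legacy glyph: {exc.args[0]}") from None


def ucs_to_legacy(text):
    """Convert a UCS string back to a list of (font name, byte value) pairs."""
    return [PUA_TO_LEGACY[ord(ch)] for ch in text]
```

When a formal encoding exists, the right-hand side of the first table would be replaced by the standardized code points, and the conversion code would stay the same.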

6B. A specialist font registry such as used by the CCER would be better for now.

As stated, implementations based on such fonts should be UCS-based in the Private Use areas to ease eventual formal encoding.
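By way of illustration, a prototype could assign Private Use code points to a font registry along the following lines. The ranges are the Private Use Zone of the BMP (U+E000 to U+F8FF) and the private use range of plane 15 (U+F0000 to U+FFFFD); the function name and the catalogue identifiers are hypothetical.

```python
BMP_PUA = range(0xE000, 0xF900)        # U+E000..U+F8FF, 6400 code points
PLANE15_PUA = range(0xF0000, 0xFFFFE)  # U+F0000..U+FFFFD


def assign_private_use(catalogue_ids):
    """Map each catalogue identifier to a Private Use code point, in order."""
    ids = list(catalogue_ids)
    # Use the BMP zone when the registry fits, otherwise fall back to plane 15.
    block = BMP_PUA if len(ids) <= len(BMP_PUA) else PLANE15_PUA
    if len(ids) > len(block):
        raise ValueError("registry is too large for a single Private Use range")
    return {cid: block[i] for i, cid in enumerate(ids)}


small = assign_private_use(["A1", "A2", "A3"])
print(hex(small["A1"]))
```

Such an assignment is a private agreement between the programs that share it; it would have to be published alongside any data exchanged, and replaced by a mapping table once a formal encoding is made.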

6C. The CCER programs are named "Glyph", which is significant.

No it isn't; this is a purely rhetorical statement. "Glyph" is not so named because its designers were working from any clear understanding of the UCS character/glyph model. The name is based on ordinary language and could just as easily have been "Hieroglyph" without any such implication. Given that we know there will inevitably be overlap of a given shape with the meaning of some base character (will all Egyptologists finally agree on the taxonomy?), it has to be taken as read that some identifications will eventually turn up as "duplicate encodings" of base characters. Knowing this beforehand helps us to deal with that eventuality.

6D. Only after the repertoires will have stabilized within Egyptology itself will further steps be sensible.

Actually, this isn't true. The repertoire is one thing; the architecture is another. We can make progress on the latter while waiting for the maturity of the former.

Even if nothing more than the Gardiner list were standardized in the short term (and I do not urge this particularly, though one can see certain advantages of it for the community of Egyptologist specialists, learners, and enthusiasts), it is important that architectural issues be discussed even now. Eventual encoding of Egyptian in the UCS will have to be conformant with UCS coding practice. Developers of 8-bit coding software like MacScribe and WinGlyph need to understand and conform (at least in terms of eventual mapping) to UCS practice, even if we wait decades before proposing a final encoding. We need to understand how the UCS can be used for Egyptian processing, even long before a stable repertoire is settled upon.

Again, we do not want to rush Egyptian but we do want to continue dialogue and address what can be addressed. We could decide to go ahead with the Gardiner list, warts and all, if the benefits would outweigh the disadvantages of some minor ambiguities. Or we could wait for the Hornung list, or for other lists. Either way, the architectural issues which are independent of repertoire issues can be investigated and addressed now. A prime example of a technical issue independent of the repertoire is how cartouching is handled in existing implementations vs. how it should be handled in the UCS, as described in N1944. There is much in that paper which remains to be discussed.

Michael Everson, 15 Port Chaeimhghein Íochtarach, Baile Átha Cliath, Éire, 1999-10-04