In response to issues raised in Linguist List, Vol. 2, No. 0283.

First of all, the terminology of "open repertoire" and "closed repertoire" tends to cause endless confusion when applied to character encodings, because the proponents of different character encoding architectures often mean different things when they say "character". This leads to different senses of "closed repertoire of characters".

In Unicode terminology, "character" refers to the thing which gets a 16-bit number attached to it in the encoding. In this sense, Unicode clearly has a closed repertoire of characters. There are about 27,000 of them, mostly Han characters, and each one of them is unambiguously identified in the standard.

However, the classes of things which are encoded as "characters" include both baseform letters (U+0065 LATIN SMALL LETTER E) and floating diacritics (U+0301 NON-SPACING ACUTE), as well as accented letters (U+00E9 LATIN SMALL LETTER E ACUTE). This creates the multiple spelling problem for accented letters that we all know about--but it is also the basis for the open-ended, productive part of Unicode, since U+0301 NON-SPACING ACUTE can be used with other characters to create compositions which are NOT preencoded in the standard (e.g. x-acute, beta-acute, Georgian-an-acute, who can guess...?).

In this sense, the encoding of non-spacing characters in Unicode (of various classes--the Latin/Greek/Cyrillic floating diacritics are only one of several major classes of non-spacing marks used in various scripts) creates a vast potential universe of coded "things" resulting from non-spacing marks applied one or more at a time to baseform characters. To avoid confusing these "things" with characters, let's for now call them "charactoids". While Unicode has a well-defined closed repertoire of characters (each exactly 16 bits in size), at the same time it has an open repertoire of charactoids.
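
The two "spellings" of e-acute, and the open-ended composition of something like x-acute, can be illustrated with a minimal sketch. It assumes Python's standard unicodedata module, which reflects present-day Unicode data rather than the Unicode 1.0 tables discussed here; the helper name has_single_character_form is made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # one character: e-acute
decomposed = "e\u0301"   # baseform e followed by the non-spacing acute

# Two different code sequences for what a reader sees as the same letter.
print([f"U+{ord(c):04X}" for c in precomposed])  # ['U+00E9']
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0065', 'U+0301']
print(precomposed == decomposed)                 # False


def has_single_character_form(sequence: str) -> bool:
    """Hypothetical helper: True if composition folds the sequence into one code point."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


print(has_single_character_form("e\u0301"))  # True: e-acute is preencoded
print(has_single_character_form("x\u0301"))  # False: x-acute exists only as a composition
```

The last line is the open-ended case: nothing in the character repertoire corresponds to x-acute, yet the sequence is perfectly encodable.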
The class of charactoids is defined by a well-defined set of composition rules, rather than by enumeration. We know the numerosity of the class is huge, but no one is going to try to count it--indeed the whole point is that charactoids are freely generable by the encoding scheme, without having to go through a committee to get a single 16-bit number assigned to each one. (As an analogy, think of characters as representing the morphemes of a morphologically complex language like Cree, for example. Charactoids are then analogous to the words of Cree. Who knows what they all are? And is anyone going to try to define them all ahead of time?)

While the class of charactoids encodable by composition in Unicode is vast and open, the structure of the code, together with the facts about several widely used orthographies in the Latin/Greek family of scripts, results in several well-defined subsets of charactoids, as follows:

A. charactoids which are functionally equivalent to accented letters encoded as single characters in Unicode. This is the <e>+<non-spacing acute> = <e-acute> case. The list of such cases is well-defined, not too large, and will be published in the Unicode 1.0 standard as one of the auxiliary tables. The important principle is that Unicode does not specify a functional distinction between the two equivalent "spellings" in Unicode. This is important because to do otherwise would prevent Unicode applications and systems programmers from normalizing freely from one to the other depending on their internal requirements for representation. Unicode does not require normalization; nor does it prevent it. Unicode also does not prevent an application from maintaining a private distinction between, say, <e-acute> as a fundamental vowel unit in an orthography, and <e>+<non-spacing acute> as a vowel plus applied tone mark.
What it does say is that such private distinctions cannot be reliably conveyed in plain Unicode text, because another Unicode text interpreter may normalize them all to <e-acute>.

B. charactoids which are functionally equivalent to accented letters which are NOT encoded as single characters in Unicode, but which are used in important orthographies. The ones which cause all the controversy are Vietnamese and Polytonic Greek. Both make wide use of doubly accented letters. Both are freely encodable in Unicode by baseform plus non-spacing diacritic combinations. The set of charactoids required for Vietnamese or for Polytonic Greek is well-defined and is being published as part of the Unicode 1.0 standard.

C. charactoids which are useful, but whose users LIKE to have them be an open class, not enumerated. All of IPA falls in this class, together with the productive application of vector notation diacritics to mathematical symbols, for instance.

D. The remainder are the charactoids which... well, who the hell knows what they might be. The Unicode standard has no intention of prescribing them all, nor of proscribing any of them. (Unicoders simply want to build software that lets users do what they want to do.)

The design goals of Unicode were to keep subset A as small as possible, because keeping track of such "required" equivalences tends to impose on software efficiency and resource penalties whose combinatorial properties grow as n-squared. Subset A cannot, however, be reduced to zero, because of other code compatibility requirements and offsetting inefficiencies which set in when having to deal with charactoids instead of characters in the software. Others believe that all of subset B should be encoded as characters (i.e., moved into subset A). Vietnamese is on the hairy edge of this argument, and a strong case can be made either way.
There is no absolutely right answer as to how to encode it--just a lot of conflicting tradeoffs in a multiple sum game, only some of whose sums are purely technical.

Lars Henrik Mathiesen noted that "there are technical reasons why a standard without floating diacritics is easier to implement." While that is true in a world of limited implementations of European languages on glass terminals with character ROMs, the Unicode designers are firmly of the opinion that in building an international character set for multilingual, multiscript applications, the arguments all come down on the other side. (In the following I am not ascribing to Lars a particular position in this--I think his contribution was intended primarily as an exegesis of another note written by an anti-Unicoder and posted by a pro-Unicoder.)

1. Open-ended productivity of diacritic application is a fundamental principle of the Latin/Greek family of scripts. To attempt to code all "useful" combinations and proscribe all others is both obtuse and unworkable. (Any devotee of IPA who wants to be able to encode and exchange it on a computer should see that this is self-evident. But the ISO 10646 approach to the IPA encoding problem was to remove the problem by removing IPA from the encoding! Now THAT's a great solution for linguists!)

2. Any putatively universal character encoding has to be able to convert and interwork with existing standards (e.g. in the bibliographic community) which ALREADY have non-spacing diacritics. So it's a done deal. They MUST be included--unless you just ignore them. And THAT's a great solution for bibliographers!

3. Finally, non-spacing diacritics aren't even very hard to implement. Compared to the problems which need to be addressed and solved to support Arabic and Indic scripts, the whole issue of non-spacing diacritics is revealed for what it really is in the larger picture: well-understood "easy stuff".
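
To give a rough sense of why point 3 calls this "easy stuff", here is a minimal sketch that groups each baseform with the non-spacing marks that follow it. It assumes Python's standard unicodedata module and treats the general categories Mn and Me as "non-spacing"; the function name charactoids is made up for illustration, and real text segmentation has more cases than this handles.

```python
import unicodedata
from typing import List


def charactoids(text: str) -> List[str]:
    """Group each base character with the non-spacing marks that follow it."""
    units: List[str] = []
    for ch in text:
        # Assumption: categories Mn (nonspacing mark) and Me (enclosing mark)
        # stand in for the "non-spacing" characters discussed in the text.
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and units:
            units[-1] += ch   # attach the mark to the preceding baseform
        else:
            units.append(ch)  # start a new unit (also covers a stray leading mark)
    return units


# x-acute, plain e, then a with circumflex and acute (a doubly accented letter).
sample = "x\u0301ea\u0302\u0301"
print(len(sample))               # 6 characters
print(len(charactoids(sample)))  # 3 charactoids
```

A single pass with one table lookup per character is all the grouping takes, which is a long way from the contextual shaping needed for Arabic or Indic scripts.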
Next, to address some issues raised by Mark Johnson's comments:

ISO/IEC DIS 10646 did not propose a "fixed-length character encoding system." That was but one of its many drawbacks. It proposed a character encoding system whose canonical form was four "octets" (ISO standardese for "bytes") for one "character", but which also allowed for "compaction forms" which would result in characters encoded as one, two, three, or a variable number of bytes. And in any case, even when a fixed multiple number of bytes (say 2) would be used to represent "graphical" characters such as <e-acute>, any control characters would have to be interpreted one byte at a time. That aside, it is true that 10646 attempted to enumerate all the "useful" accented letter forms for Latin, Greek, and Cyrillic, and encoded them as distinct characters.

Now, with respect to the "general escape method, which...would allow overstriking of arbitrary characters to build new characters," such methods are already standardized! The first-order hack implemented in PCs (and some earlier computers) was to use the BACKSPACE control code as a direct technical calque from everyman's solution for creation of composite characters on a manual typewriter. ISO, in amongst the various standards for control characters (ISO 6429 to be exact), has defined a tonier, less lowbrow control character function, the GCC (Graphic Character Combination), to serve exactly as the "escape method" for combining two characters.

The problem is that such approaches are in the stone age of computer typography. Characters (and charactoids) are not the same as glyphs, and glyphs are not the same as images. The glyphs are abstractions of the TYPE of elements of textual graphic representation. (This is pretty close to what linguists mean by "grapheme", but abstracted away from issues relevant to its use as a structural unit; think of "glyph" as being to "grapheme" what "phone" is to "phoneme" and you'll be close.)
Images are instantiations or TOKENS of actual textual graphic representations taken from particular fonts (of defined face, style, weight, size, etc.).

Modern rendering software provides layers of mapping between a) character encoding, which is designed to encode textual CONTENT (appropriate for manipulation by textual processes) and b) glyphic representation (appropriate for rendering in visible form on screen, printer, or other device). Such mappings between character and glyph can be one-to-one in the simple (ASCII) case, but typically are not simple in computer typography even of English. Several characters may map to a single ligature glyph; a sequence of baseform + non-spacing diacritic may map to a single composite glyph. The typeface designer builds the glyphs; the rendering software maps to the correct choice.

Even in those cases (as for the open-ended set of charactoids encodable in Unicode which could not all be anticipated by a typeface designer) in which a baseform + non-spacing diacritic is mapped to a pair of glyphs which must be (glyphically) composed, modern font technology builds the composition rules into the fonts. Effectively the diacritics "know" where to place themselves with respect to baseforms and each other (within limits).

Once again, the solutions for handling these issues for the Latin script are quite well-understood in the industry. Font technology is an entire computer sub-industry single-mindedly driven towards making computer typography even better than the "real thing". But what is very well understood for Latin (and rapidly being extended to Greek and Cyrillic) is still very skimpily implemented for Arabic, or Devanagari, or Tibetan (!) or Burmese (!), for example, where the problems are very much harder and where getting it right is going to take a lot more work yet.
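
The character-to-glyph layer described above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the glyph table, the glyph names, and the greedy longest-match strategy are invented here and do not describe any particular font format or rendering system.

```python
from typing import List

# Hypothetical glyph table: character sequences mapped to made-up glyph names.
GLYPHS = {
    "fi": "fi_ligature",             # several characters, one ligature glyph
    "e\u0301": "eacute_composite",   # baseform + diacritic, one composite glyph
    "f": "f",
    "i": "i",
    "e": "e",
    "x": "x",
    "\u0301": "acute_mark",          # free-standing mark, positioned by the font
}


def to_glyphs(text: str) -> List[str]:
    """Map characters to glyph names, preferring the longest known sequence."""
    longest = max(len(key) for key in GLYPHS)
    glyphs: List[str] = []
    pos = 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in GLYPHS:
                glyphs.append(GLYPHS[chunk])
                pos += size
                break
        else:
            glyphs.append("missing_glyph")  # nothing in the table matched
            pos += 1
    return glyphs


print(to_glyphs("fie\u0301x\u0301"))
# ['fi_ligature', 'eacute_composite', 'x', 'acute_mark']
```

The final two entries are the unanticipated case: x-acute has no composite glyph in this toy table, so it comes out as a pair of glyphs that the font's own composition rules would then have to position.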
Hoping this helps some,

--Ken Whistler
Secretary, Unicode Consortium (and a practicing linguist!)
Mr. Hackney: May I suggest you read the (extensive) discussions on multilingual character coding before speculating, and before asserting that ``the answer'' is a variable-length coding?

-s (Stavros Macrakis)

PS Your `inferred explanation' of `floating diacritics' is incorrect.