As one of the more active participants in the ISO10646 and Unicode ListServ discussions of "floating diacritics," I would like to comment on some of the postings on the subject in Linguist List, Vol. 2, No. 0283 (Monday, 10 June 1991, Subj: 2.0283 Diacritics). I apologize for the length of this contribution, but it's a large topic with a long history and many readers of this ListServ may not know the background.

First, thousands of lines have already been written about this subject on those two ListServs (hundreds of them by me). I and a number of other linguists have been arguing from the beginning that character sets are too important to be left entirely to specialists in computer languages (who have their own priorities) and that natural language orthography is serious business. It is encouraging that someone has taken steps to draw more linguists into the discussion by some judicious crossposting to the Linguist ListServ. But I would like to suggest that interested linguists subscribe to the specialized ListServs mentioned above and that they read the archives available there. This will avoid unnecessary repetition and cross-posting, will ensure that all participants in the discussion know the background, and -- perhaps most important -- will ensure that linguists' informed opinions are shared with colleagues in other disciplines who often play decisive roles in developing international character set standards.

I would also like to urge linguists to become involved in character set issues in an effective way. The ISO is composed of national representative bodies and is not required to listen to individuals. It is possible to join your country's national delegation and help formulate official positions on ISO proposals. ISO character set development is ultimately politics, not science. If you want to influence the outcome, you can't just post intelligent observations to a non-binding ListServ; you have to participate at the national delegation level.
Why waste your time? Those of us who work with unusual writing systems are forced to develop our own character coding. Since we are not organized, we wind up with files that cannot be shared. Hardware and software manufacturers support recognized standards (both official ISO standards and de facto industry standards, which Unicode is likely to be); making a standard suit your needs means 1) you don't have to do your own character set development any more, and 2) you can share files with colleagues.

macrakis@osf.org writes:

>The argument is much narrower than that: should
>character encodings be closed (i.e. contain a fixed repertoire of
>character+diacritic combinations) or open (i.e. permit arbitrary
>combinations of character and diacritic)?

From a different perspective, both repertoires are closed and open simultaneously. I see the crucial difference in their different definitions of character.

First, an "open" repertoire, as defined above, is also fixed, in that there is a finite number of machine characters. In neither case can a new base, diacritic, or precomposed base+diacritic be added arbitrarily by a user as a machine character. In the sense of allowing new machine characters, all character sets are closed. (Both ISO DIS 10646 and Unicode have provisions for private use zones.)

Second, no character set ever limits the arbitrary combination of alphabetic characters, and character sets always permit combinations that would not be meaningful in any writing system. If we define base and diacritic elements as our characters, allowing their arbitrary juxtaposition is no different from allowing the arbitrary juxtaposition of any two base characters. In the sense of allowing combinations of elements, all character sets are open.

The "precomposed" camp essentially views characters as things that occupy linear space. The "separable" camp does not; from the latter perspective, base+diacritic is simply two characters, one of which is traditionally displayed above, rather than next to, the other. The issue isn't so much that one repertoire is closed and the other open as that the two repertoires have different constituencies. Creating a new base+diacritic combination in a separable diacritic system isn't creating a new machine character because the combination is no more a character than a sequence of two bases.

>The difference comes with characters which are NOT widely used ... [a random
>accented character] does not exist as a precomposed character in Unicode
>or in 10646. But in Unicode it can be represented as a combination of three
>codes, even if it's never been used before (and even if it is a typo!).
>10646 could of course add it in a future revision, but this has to be done
>on a case-by-case basis.

There are two sets of issues in choosing between precomposed characters and separable diacritics. (Note: "diacritic" is not necessarily the best term for reasons discussed by Lloyd Anderson in the ISO10646 and Unicode ListServ archives, but I'll continue using it here.) These issues are adequacy and appropriateness.

Adequacy is the easy one: the price of prohibiting separable diacritics is making room for all precomposed combinations. For some poorly-codified writing systems, it is impossible to determine which precomposed combinations actually occur. Separable diacritics ensure that unforeseen combinations can be represented. Opponents of separable diacritics, largely computer scientists with no experience in uncommon or poorly codified writing systems, do not care whether the repertoire is adequate for scholars, as long as it is adequate for businessmen using modern languages. One suggested criterion for representing characters is to represent only those characters used in newspapers, a criterion that I hope all linguists will find appalling.
(Other opponents of separable diacritics are more reasonable, but, through lack of experience with poorly-codified writing systems, may not understand that even with the best of intentions it may be impossible to define a complete precomposed repertoire in advance.)

Appropriateness is tougher. One set of arguments holds that character inventory should reflect grapheme inventory; diacritics would be encoded separably if they function as separable orthographic entities, while precomposed combinations would be used otherwise. (Graphemic analysis is not necessarily unique and "separable orthographic entity" is tricky to define, but constructive suggestions can be found in the archives.) Another holds that the proper criterion for appropriateness is processing efficiency; the programming languages people prefer precomposed combinations because they are better suited to certain machine operations. (In some cases, this reflects a limited understanding of the type of operations that people perform on texts, since separable diacritics may be better suited for other operations. Any operation _can_ be performed with either coding, but with differences in efficiency that programmers may consider significant.) Another holds that the decision is arbitrary because the most appropriate or efficient encoding is unknowable. I will not rehash these arguments here; please consult the archives.

It seems self-evident to me that the adequacy issue, which is entirely on the side of separable diacritics, must be paramount.

>Proponents of closed repertoire systems argue that inventors of
>NEW orthographies should limit themselves to standard characters.
>Proponents of open repertoire systems argue that this is an unnatural
>limitation which restricts designers of orthographies artificially.
>
><<That>> is what the argument is about, <<not>> about suppressing e-acute.

That may be what the argument _should_ be about.
Nobody wants to suppress e-acute because it is used in French, but nobody is clamoring to make room for the early Cyrillic lower_case_neutral_jer+longa that I need. And nobody in the "precomposed" camp can tell me how they plan to provide for early Cyrillic when it is impossible to determine a precomposed inventory. My colleague Kyongsok Kim has raised exactly the same argument concerning ancient Hangul.

Let me close with a telling anecdote from the ISO10646 ListServ. One of the Unicode developers had occasion to work with a Russian-language teach-yourself-Japanese book. The Japanese is transcribed phonetically in Cyrillic and uses a macron to indicate long vowels. This includes macron over e+diaeresis, which occurs in no Slavic writing of any period as far as I know. It was pointed out that a separable diacritic approach can handle this, while ISO DIS 10646, which is a precomposed approach, cannot. It was also suggested that if we petitioned the ISO to include this precomposed combination in ISO DIS 10646 because it was needed for Russian phonetic transcriptions of Japanese, we would not be warmly received (how many newspapers are published in Russian transcriptions of Japanese?).

Someone responded to this: why can't Russians represent vowel length in some other way, such as using doubled vowels? Aside from the linguistic ignorance this betrays (every vowel letter in Russian is syllabic and a doubled vowel letter is two syllables), it demonstrates an attitude that making life easier for programmers is more important than the data. If I need e+diaeresis+macron, it is the responsibility of character set designers to provide for it. It is not their business to tell me to bang in a screw with a hammer because their toolkit doesn't include a screwdriver and they don't think I need one.

And <<that>> is what the argument is <<really>> about.

Concerning Mark Johnson's summary of technical problems, he raises important and genuine issues, but confuses certain basic points. "Character" and "glyph" are technical terms in the character set business and what gets rendered is glyphs, not characters. A character set is not concerned with centering accent marks over vowels any more than it is concerned with forming Arabic ligatures; character sets encode characters and rendering software is responsible for putting out the proper glyphs. Inventories of characters and glyphs are not identical. Once again, please consult the archives of the appropriate ListServs for background on this issue.

--David

The technical issues around diacritics have been extensively discussed in the ISO10646 and Unicode mailing lists. Let me try to summarize them (with a bias towards Unicode, I'm afraid):

Encoding issues

In a closed repertoire system, characters (with or without diacritics) can presumably have a fixed encoding. ISO10646 prescribes exactly one way of representing e-acute. Unicode allows both the precomposed e-acute and also the composed e+acute. (Although precomposed characters appear inconsistent with Unicode's approach, they were included for compatibility with existing standards (here, Latin-1).) Unicode also does not prescribe a canonical form for multiple diacritics in cases where their relationship is unambiguous (although it suggests one).

However, 10646 does not in fact adhere strictly to biunique mapping of characters and codes. It includes numerous ligatures (especially for Arabic). It also includes many contextual forms (for Arabic, Mongolian, etc.). For Chinese-character languages, 10646 may represent one character by several different codes, depending on language and variant rendering. Neither 10646 nor Unicode questions the distinctness of Greek, Roman, and Cyrillic `A', even though they have a common history and shape. In general, biunique encoding seems unattainable, since there are many borderline cases in the world's writing systems.

Processing issues

Proponents of 10646 argue that fixed-length encodings simplify processing. Proponents of Unicode argue that this is only true for the simplest cases. For instance, in Spanish, the digraph `ch' must be treated as a single letter for alphabetic sorting, but no one proposes to encode it as a single code.

It is also argued that it would be easier to process text in a canonical form--otherwise, you must be prepared to handle both e-acute and e+acute.
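As an illustration only, the sketch below shows what "handling both e-acute and e+acute" can look like in practice. It assumes present-day tooling that is not part of the discussion above: Python's standard `unicodedata` module and the normalization form NFC, which later versions of Unicode define. The names `PRECOMPOSED`, `COMPOSED`, `to_canonical` and `same_text` are hypothetical, chosen for this example.

```python
import unicodedata

# Two encodings of the same written letter, e-acute.
PRECOMPOSED = "\u00e9"   # one code: LATIN SMALL LETTER E WITH ACUTE
COMPOSED = "e\u0301"     # two codes: "e" + COMBINING ACUTE ACCENT


def to_canonical(text: str) -> str:
    """Fold text to one agreed-upon form (here NFC) before comparing."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after folding both to the canonical form."""
    return to_canonical(a) == to_canonical(b)


if __name__ == "__main__":
    print(PRECOMPOSED == COMPOSED)           # raw code-by-code comparison
    print(same_text(PRECOMPOSED, COMPOSED))  # comparison after folding
```

The raw comparison treats the two spellings as different, while the comparison after folding treats them as the same text. Either form could serve as the canonical one; what matters for processing is that both sides are folded the same way before they are compared.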
But after all, you must already be prepared to treat as equivalent: `b' and `B'; u-umlaut, U-umlaut, u+e, and U+e; eta-subscript and Eta+iota; the single character esszed and the two characters SS (and maybe ss or sz in some cases); the single character stigma and the two characters sigma+tau (in Greek numerals).

The main upshot of this discussion has been to make clear that processing multilingual text is non-trivial. The methods that work more or less well for English do not work for many other languages -- but even in English, improved internationalization will mean better handling of such things as capitalization, which are handled poorly by all too many programs.

Rendering issues

As Johnson says, good rendering of most composed characters requires individual graphic design. This is technically compatible with both open and closed repertoire systems. Open repertoire systems should be able to render combinations which weren't considered by the designer; but even with today's technology, you can do better than overstriking (e.g. place accents above the character rather than overlapping with it).

Note that Unicode does not prohibit meaningless combinations, such as using Hebrew vowel points on Japanese kana! But you can expect that the rendering will be just as absurd as the spelling....

-s

[End Linguist List, Vol. 2, No. 0296]