The question of whether to have a "closed repertoire" character set (like the ISO proposal, which fixes single character codes for a limited but large set of letter+diacritic combinations) or an "open repertoire" set (like Unicode, which allows arbitrary combinations, but as multiple codes) does not depend much on modern type font technology or the current art of type design.

Most major font formats in use today, including PostScript Type 1, Apple/Microsoft TrueType, and Sun F3 (and perhaps also Hewlett-Packard / Agfa CG Intellifont, though I am not certain), actually store most letter+diacritic combinations as subroutine calls to the separate elements - letter, diacritic - rather than as a fully formed character comprising letter and diacritic. That is to say, when a program like a word processor calls out the character code for, say, a-acute, the font looks up the a, then looks up the acute, then looks up some information about where to position the acute over the a, puts the pieces together, rasterizes the new composite, and hands it over for display and/or printing. This method of forming composites has two advantages: economy - it reduces the memory requirements of the font; and power - it allows the arbitrary production of all possible letter + accent combinations.

The creation of new diacritic combinations doesn't require the skills of professional type designers. Some brave and ingenious souls may write PostScript programs to implement the desired combinations and to assign those combinations to character codes. Or, if a font already has most floating diacritics (as in the Macintosh or Microsoft UGL character sets), a "kerning" table can be devised to position the accents properly over the selected letters. This requires some planning, arithmetic, etc.
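The composite lookup described above (find the letter, find the accent, find the positioning information, put the pieces together) can be illustrated with a toy model. This is only a sketch: the glyph names, the point lists, and the offsets are invented for illustration, and real formats such as Type 1 or TrueType store curves and hinting rather than bare points.

```python
# Hypothetical toy font: each simple glyph is a list of (x, y) outline points.
# All names and coordinates here are invented for illustration.
SIMPLE_GLYPHS = {
    "a": [(0, 0), (500, 0), (500, 450), (0, 450)],
    "acute": [(0, 0), (120, 0), (200, 150)],
}

# A composite stores references to its parts plus an offset for the accent,
# not a second copy of the outlines.
COMPOSITES = {
    "aacute": ("a", "acute", (190, 500)),
}


def outline(name):
    """Return the outline points for a simple or composite glyph."""
    if name in SIMPLE_GLYPHS:
        return list(SIMPLE_GLYPHS[name])
    base, accent, (dx, dy) = COMPOSITES[name]
    # Shift the accent's points into position over the base letter.
    shifted = [(x + dx, y + dy) for x, y in SIMPLE_GLYPHS[accent]]
    return SIMPLE_GLYPHS[base] + shifted


# Adding a new combination costs one table entry, not a new drawing.
COMPOSITES["aacute_high"] = ("a", "acute", (190, 560))
```

Both advantages show up even in this toy: the composite entry is far smaller than a full outline, and a new letter + accent pairing needs only a new row in the table.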
Another and simpler way is to use a font editing program, such as Altsys' Fontographer, Letraset's FontStudio, or URW's Ikarus M (available on the Macintosh; there are also related programs for the PC), to get into the font, mix 'n' match letters and accents for the desired effect, and assign the results to arbitrary character codes/positions. This requires some time to learn the rudiments of the editing program, but no training in type design.

In most fonts, the designer has provided most of the common letters and diacritics. All the user needs is the desire to combine them. In fact, some users will actually do a better job of it than the designers, since the designers are not likely to be literate in all the languages for which they have designed accents, and simply follow some basic rules or various precedents, whereas literate users often have a better feeling for what constitutes discriminability among the graphemes of their own language. In the 1950s, French typographers persuaded the English Monotype Corporation, originators of Times Roman, to reposition several of the accented characters and to redesign various other characters, to make a Gallicized version of Times that would be acceptable to the French literate palate. It is reasonable to suppose that literates of other languages and orthographies might want something similar.

Moreover, some users may want to design new forms that are not included in a standard font. Such things don't always look as sleek and polished as professional work, but they might also have merits that professional designers would have failed to include. If a new form achieves acceptance, sooner or later some designer will come along to spiff it up.

So, the technology of fonts and the art of type design provide the means for either closed or open character sets. The decision of which to use is based on other factors, including politics.
-- Chuck Bigelow
The most commonly used coding for text is the ASCII (American Standard Code for Information Interchange) character set, which does not provide for characters containing diacritical marks. As it stands, only 7 of a possible 8 bits are used, giving 128 encodings (the reason for this is historical - the 8th bit was used for parity checking). Various extensions are in use (ISO multinational, DEC multinational, et al.) which use the 8th bit to provide another 128 encodings containing the commonly used European characters. However, in my experience, not all terminals, printers and personal computers support even this limited character set. There are (at least) two proposals for extending the character set into something that addresses the rich variety of symbols found in the many languages of the world.

In response to John Baima, I must confess to an ignorance of what a floating diacritic is. I will therefore limit my comment to an inferred explanation: a floating diacritic is a character that can be combined with a normal character (such as ~ [tilde] and n) to provide a composite. This method is satisfactory but limited, in that it only patches an existing, impoverished system of coding. A more general solution is to extend the coding to cover more alphabets.

I recently came across an article in New Scientist (a popular and serious scientific magazine) which described a new coding system ["Computer code speaks many tongues", New Scientist, 9 March 1991, p. 28]. Apparently a consortium of American companies called "Unicode" (including IBM, Apple, Sun, ...) have chosen to represent their character set using a 16 bit code, which will give a possible 65,536 characters. They suggest that 6,000 codes suffice for all the alphabets of Europe, the Middle East and the Indian subcontinent. Chinese, Japanese and Korean require about another 18,000 codes.
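The inferred definition of a floating diacritic and the code-space figures above can be checked with a few lines of Python. This is a minimal sketch that assumes today's Unicode tables as exposed by the standard `unicodedata` module, not the 1991 proposal itself; the variable names are illustrative only.

```python
import unicodedata

# Open repertoire: base letter followed by a floating (combining) tilde.
decomposed = "n" + "\u0303"      # two codes: 'n' + COMBINING TILDE
# Closed repertoire: one precomposed code for the same letter.
precomposed = "\u00f1"           # LATIN SMALL LETTER N WITH TILDE

# Normalization converts between the two representations.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

# Code-space sizes mentioned in the text.
ascii_codes = 2 ** 7             # 7-bit ASCII
eight_bit_codes = 2 ** 8         # ASCII plus a 128-code extension
sixteen_bit_codes = 2 ** 16      # the 16-bit proposal
claimed_need = 6_000 + 18_000    # figures quoted from the article
```

Here `composed == precomposed` and `split == decomposed`, so the same letter can be carried either as one code or as two, and the quoted 24,000 codes fit comfortably within the 65,536 available.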
I expect it is arguable whether these figures are really representative of the characters used and preferred by the respective nationals. However, one thing is certain: it is not compatible with our present system. Furthermore, you will need twice as much space to store text that uses only the current character set, and transmission times will be doubled.

I have come across another system that does not suffer from these limitations, and is in my opinion a winner by lengths. Instead of using a fixed length encoding, the answer is to arrange for the encoding to expand to two or three bytes when required [Becker, J.D. (1984?) Multilingual Word Processing, from: Language, Writing, and the Computer: Readings from Scientific American, pp 86-96 ISBN 0-7167-1772-7]. This is simply done by setting aside a few bytes as signals to the computer (or printer etc.) and embedding these in the text. The principal signal is one byte that indicates that the next byte is a code representing the alphabet to be used for the subsequent text. This gives 255 different alphabets, each containing 255 codes. Compatibility between the system and the current ASCII encoding is easily achieved by assuming the start of the text is in alphabet "Roman" (i.e. ASCII).

Although most (all?) of the European languages, Cyrillic, Arabic etc. can be fully represented with 255 characters, other "alphabets", such as Chinese, require considerably more encodings. A simple extension to the above scheme provides an elegant solution. Two 'shift-alphabet' characters in sequence indicate that the next byte signals which 'super alphabet' is to be used. These alphabets use a two byte encoding scheme giving 65,536 possible letters (this is similar to the "Unicode" proposed system). (A 3 byte 'super-super-alphabet' would allow well over 16 million codes.) I am really convinced that this system should be adopted in preference to the fixed length encoding.
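The shift-alphabet scheme described above can be sketched as a small decoder. This is only an illustration under stated assumptions, not Becker's actual specification: the signal byte value `0xFF`, the alphabet id `0` for "Roman", and the tuple layout returned by the hypothetical `decode` function are all invented here. The sketch also assumes well-formed input in which the signal byte never appears as part of a character code, so its two-byte alphabets hold slightly fewer than 65,536 codes.

```python
SHIFT = 0xFF   # assumed value for the 'shift-alphabet' signal byte
ROMAN = 0      # assumed id for the default alphabet (plain ASCII)


def decode(data):
    """Decode bytes into (alphabet_id, width, code) tuples.

    width is 1 for ordinary alphabets and 2 for 'super alphabets'.
    Assumes well-formed input; truncated input is not handled.
    """
    out = []
    alphabet, width = ROMAN, 1      # text starts in alphabet "Roman"
    i = 0
    while i < len(data):
        if data[i] == SHIFT:
            if i + 1 < len(data) and data[i + 1] == SHIFT:
                # Two shifts: next byte names a two-byte super alphabet.
                alphabet, width = data[i + 2], 2
                i += 3
            else:
                # One shift: next byte names a one-byte alphabet.
                alphabet, width = data[i + 1], 1
                i += 2
            continue
        # Ordinary character: read one or two bytes in the current alphabet.
        code = int.from_bytes(data[i:i + width], "big")
        out.append((alphabet, width, code))
        i += width
    return out


# Plain ASCII text needs no signals, so existing files decode unchanged.
plain = decode(b"Hi")
mixed = decode(b"Hi" + bytes([SHIFT, 5, 0x41, SHIFT, SHIFT, 2, 0x12, 0x34]))
```

In this sketch a pure ASCII file costs one byte per character, as it does now, and the overhead of two or three signal bytes is paid only when the text changes alphabet.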
Unfortunately "Unicode" appear to be well established and their proposed system may well become the de facto standard (as they hope).

Paul Hackney

[End Linguist List, Vol. 2, No. 0295]