I am compiling my comments and responses on a number of items posted on LINGUIST over the last several weeks. I'm sorry about the temporal disconnect on this--maybe some of the items are getting stale, but they do deserve some answers.

Re Linguist List, Vol. 2, No. 0295. Friday, 14 June 1991

Charles Bigelow's discussion of Diacritics, type design, and font format is a wholly commendable introduction to modern font technology. It clearly distinguishes the issue of character encoding from the issues of type design and font technology. As Bigelow points out, modern font technology already handles productively applied marks (e.g. floating diacritics in the Latin script) as separable pieces. But the details of that are all hidden away in the font rendering engines and the fonts themselves. A character encoding is *mapped* to glyphs in the fonts, and there is no particular reason why that relationship has to be a simple one.

The Unicode standard takes the position that non-spacing marks (including Latin floating diacritics) must be encoded as characters, primarily for information sufficiency. Diacritics are productively applied to Latin letters--this goes well beyond the relatively small set of accented letters considered to be part of the standard alphabets of most European languages. In order to be able to encode the textual information of any usage of the Latin script (including various kinds of ad hoc transcriptions published over hundreds of years), we have to do one of the following:

A. Research every baseform + diacritic(s) combination ever used in the history of the Latin (and other) script(s) and assign a character encoding number to each combination;

B. Collect the set of productively applied diacritics and encode them as non-spacing marks which can compose with any baseform a user chooses for them.

I believe alternative (A) to be a hopeless approach. It is almost, though not quite, as absurd as trying to catalog all the sentences of a language.
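To make the contrast between (A) and (B) concrete, here is a minimal sketch in Python. It assumes Python's standard unicodedata module and the code point assignments of the present-day Unicode standard (U+0301 for the non-spacing acute, U+00E1 for precomposed a-acute), none of which come from the discussion above; the helper name same_text is hypothetical.

```python
import unicodedata

# Alternative (B): a baseform followed by a non-spacing mark.
# U+0301 is COMBINING ACUTE ACCENT in the present-day Unicode standard.
ACUTE = "\u0301"
a_acute_by_rule = "a" + ACUTE

# Alternative (A): one precomposed code per combination (U+00E1, a-acute).
a_acute_precomposed = "\u00e1"


def same_text(s1: str, s2: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)


# An ad hoc transcription symbol that has no precomposed code of its own:
# the same productive rule still encodes it unambiguously.
q_acute = "q" + ACUTE
```

Both spellings of a-acute carry the same textual information, so same_text(a_acute_by_rule, a_acute_precomposed) holds, while q_acute stays a two-code sequence because nobody had to catalog it in advance.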
Position (B) builds a well-defined productive rule into the encoding which allows for unambiguous character encoding of any combination which has to be encoded.

Also Re Linguist List, Vol. 2, No. 0295. Friday, 14 June 1991

Paul Hackney objected to the Unicode proposal for a fixed-width character encoding:

>However one thing is certain: it is not compatible with
>our present system. Furthermore you will need twice as much space to store
>text using the current character set, and transmission times will be doubled.
>I have come across another system that does not suffer from these limitations,
>and is in my opinion a winner by lengths.

He then goes on to summarize Joe Becker's discussion of the Xerox Coded Character Set (XCCS), cited in the Scientific American article on "Multilingual Word Processing." Hackney goes on to praise the notion of a character set which uses a signal to indicate that the next byte in a character stream is an "alphabet" identifier which indicates a particular set of 255 codes to be used (a byte at a time) in subsequent text. Extension of the scheme would allow for two byte encodings (e.g. for Chinese) or for three byte encodings if required.

What Hackney has done is reinvent the "compaction forms" of ISO DIS 10646, which has just such a signal for supporting 1-, 2-, 3-, or 4-byte encoding forms, as well as variable length.

The irony of this argument is that in citing Becker's article about software based on the XCCS, he seems unaware that Becker is the principal architect of the Unicode fixed-length character encoding architecture which Hackney objects to! It is precisely Xerox's long experience with building multilingual software on a mixed one- or two-byte character encoding which has convinced most of the software implementors in the industry that such encoding schemes are inefficient, buggy, and not even close to being worth the savings in space for text storage which was their main reason for being.
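A small sketch in Python may help show where the inefficiency and the bugs come from. The mixed-width scheme below is a toy invented for illustration (a lead byte of 0x80 or above starts a two-byte character); it is not XCCS and not any ISO 10646 compaction form, and both function names are hypothetical.

```python
def nth_char_fixed(data: bytes, n: int) -> bytes:
    """Fixed 16-bit encoding: character n always starts at byte 2 * n."""
    return data[2 * n : 2 * n + 2]


def nth_char_mixed(data: bytes, n: int) -> bytes:
    """Toy mixed-width encoding (hypothetical, for illustration only).

    A byte >= 0x80 is a lead byte that starts a two-byte character;
    any other byte is a one-byte character. Finding character n
    requires scanning every character before it.
    """
    i = 0
    for _ in range(n):
        i += 2 if data[i] >= 0x80 else 1
    width = 2 if data[i] >= 0x80 else 1
    return data[i : i + width]


# "a", "b", one two-byte character (0x88 0x41), then "c".
mixed = bytes([0x61, 0x62, 0x88, 0x41, 0x63])

# The same kind of text in a fixed 16-bit form.
fixed = "abc".encode("utf-16-be")
```

In the mixed stream, the byte at offset 3 is 0x41, which read in isolation looks like the letter "A", although it is really the second half of a two-byte character. Any code that indexes, truncates, or searches by byte offset has to guard against that case, whereas the fixed-width lookup is a single slice.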
Keep in mind that the mixed-width character encodings were designed in an era when 16K was a LOT of RAM and hard disks were something that mainframe computers had. Nowadays a minimal PC or workstation has at least 1 megabyte of RAM and a 40 megabyte hard disk, and the standards are rapidly expanding to 8 megabytes of RAM and even bigger hard disks supplemented by removable cartridge hard disks and CD-ROMs. The growth in the size of text by moving to a 16-bit fixed-width character encoding is insignificant in this context--especially when you take into account that interspersing formatting and graphics with the text means that non-text quickly comes to dominate over the text itself for most storage requirements.

When it comes to transmission times, once again the technology is outracing any problem caused by doubling the character size. Modems have moved from 120 baud to 300 to 1200 to 2400 to 9600 baud. Furthermore, if a lot of text has to be transmitted across a phone line, it can be easily compressed and decompressed at each end with well-known algorithms already in use. This is a problem which can be confined to a clear point of appropriate application. It would be inappropriate, however, to force the character encoding to carry around the compression--that would simply make *all* software have to deal with compressed or compacted text just to make characters go down a telephone line faster. For all other network communication protocols, the transmission rates are so high (multiple megabits per second) that the one- versus two-byte character size is completely irrelevant to any perceived behavior for textual transmission.

In summary, mixed-width character encoding has been discredited among those who have actually tried to implement software which uses it. And moving from an 8-bit to a 16-bit character encoding is *not* going to suddenly make everything twice as inefficient in dealing with text.
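The transmission argument can be checked with some rough arithmetic. The following is a minimal sketch in Python under several assumptions of mine: modem speeds are treated as bits per second, each byte costs 10 bits on the wire (start bit, 8 data bits, stop bit), zlib stands in for the "well-known algorithms", and the sample text is artificial and far more repetitive than real prose.

```python
import zlib


def seconds_to_send(num_bytes: int, bits_per_second: int,
                    bits_per_byte_on_wire: int = 10) -> float:
    """Rough transfer time over an asynchronous serial line.

    The default of 10 bits per byte is an assumption (start bit,
    8 data bits, stop bit); real links differ.
    """
    return num_bytes * bits_per_byte_on_wire / bits_per_second


# Artificial sample text; real prose compresses less well than this.
text = "The quick brown fox jumps over the lazy dog. " * 200

eight_bit = text.encode("latin-1")      # one byte per character
sixteen_bit = text.encode("utf-16-be")  # two bytes per character

# Doubling the character size on a 9600 bps line versus
# keeping 8-bit characters on a 2400 bps line.
t_16bit_fast = seconds_to_send(len(sixteen_bit), 9600)
t_8bit_slow = seconds_to_send(len(eight_bit), 2400)

# Compression is applied only at the point of transmission;
# the text itself stays in its plain fixed-width form.
packed = zlib.compress(sixteen_bit)
unpacked = zlib.decompress(packed)
```

The point of the design is in the last two lines: compression lives at the ends of the phone line, and everything else in the system keeps working with uncompressed fixed-width characters.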
(Among other things, note that most systems and software that require moving to the new Unicode standard because they need a large, multilingual character set are already having to deal with Japanese--which means they already have provisions for two-byte characters. They just do so now in a mixed-width and *less* efficient way!)

Re Linguist List, Vol. 2, No. 0296. Friday, 14 June 1991

I wish to second David Birnbaum's comments about the importance of the participation of linguists in the character encoding process. It is extremely important that we get the international character encoding to be complete and correct. This will allow for an interchange standard for linguistic data, even in those instances where individual linguists with their own independently developed ad hoc encodings for PCs and Macs have no need to change the software they are using locally.

Note that the major current initiative aimed at providing a universal text content interchange format (the Text Encoding Initiative, sponsored by ACL, et al.) is deficient precisely in the area of character encoding. Getting involved to encourage the building of a Text Encoding protocol on top of a *Universal* character set would greatly improve the possibilities that linguists will be able to exchange data freely in the future.

Re Linguist List: Vol-2-308. Thursday, 20 June 1991.

Herb Stahlke raised the issue of sorting. While it is true that diacritics (among other things) have to be taken into account to produce correct sortings in various languages, the way sorts are implemented is generally by weighted multiple keys generated from tables. The relation between an <a-acute> and its appropriate sorting key for a particular language in a particular orthography is independent of whether that <a-acute> was encoded as a precomposed single character or a baseform character plus non-spacing mark character.
(One scheme for encoding characters may be more efficiently mappable to multiple keys than another scheme, but the multiple keys themselves are not *determined* by the character encoding.) The table has to be correct to produce the correct sort. When we want to put control over the sort "in the hands of the user," what we really mean is finding a way to allow a user to define their own collation tables or to modify those provided by the system or application software.

I see that David Birnbaum (in Linguist List: Vol-2-321. Tuesday, 25 June 1991) has also responded to this contribution, and I concur completely with his analysis of the distinction between precomposed vs. "separable" architecture in the character encoding versus any issues of how to produce a correct sort.
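As a rough illustration of table-driven sorting with weighted multiple keys, here is a minimal sketch in Python. The tables, weights, and function names are all hypothetical, the two-level key is a simplification, and Python's standard unicodedata module is assumed as a way of reducing both encodings of <a-acute> to the same text element before the table lookup.

```python
import unicodedata

# Hypothetical collation table for one language and orthography.
# Each text element maps to (primary weight, secondary weight).
SYSTEM_TABLE = {
    "a": (1, 0),
    "a\u0301": (1, 1),  # a-acute: sorts with "a", differs at the second level
    "b": (2, 0),
}


def text_elements(s: str) -> list:
    """Split a string into baseform + following non-spacing marks."""
    elements = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and elements:
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def sort_key(s: str, table=SYSTEM_TABLE) -> tuple:
    """Build a multi-level key: all primary weights, then all secondary."""
    weights = [table[e] for e in text_elements(s)]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    return (primary, secondary)


# A user-modified table: here a-acute is treated as a separate letter
# that sorts after "b".
USER_TABLE = dict(SYSTEM_TABLE)
USER_TABLE["a\u0301"] = (3, 0)

words = ["b", "\u00e1", "a"]
system_order = sorted(words, key=sort_key)
user_order = sorted(words, key=lambda w: sort_key(w, USER_TABLE))
```

The precomposed and the baseform-plus-mark spellings of <a-acute> yield the same key, so the sort order depends only on the table, and swapping in USER_TABLE changes the order without touching the encoded text.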