In (partial) answer to T.R. Hofmann's question, there is a mailing list devoted to discussion of the ISO 10646 standard (and the Unicode one as well). One can subscribe by sending the message TELL LISTSERV AT JHUVM SUBSCRIBE your name or SEND LISTSERV@JHUVM SUBSCRIBE your name or variants thereof, depending on how your system can reach Bitnet addresses.
Here at CILS, we've (somewhat tentatively, perhaps) adopted Unicode for our multilingual textual database projects. At present we're still actually using the ASCII/Latin-1 combination (and our main database, that of the ARTFL project, will probably remain that way for reasons of compression: as has rightly been pointed out, naive storage of Unicode can cause space problems; in our case it'd push us off the end of a gigabyte disk). But since this combination is isomorphic to page zero of Unicode, the translation is completely trivial, and we have started writing code that can be compiled to use Unicode natively.

In reply to "Thomas R. Hofmann" <71721.2655%CompuServe.COM@RICEVM1.RICE.EDU>'s (second) question about wide diacritics: yes, Unicode does in fact provide for these. A Unicode diacritic is a postfix operator that takes a fixed number of arguments determined by the diacritic itself, so you'd see internal representations like "oo-" (where "-" represents a two-place macron, whatever).

Speaking as a computer scientist with a slight footing in mathematics and linguistics, the main problems with the Unicode approach are, to my mind:

(1) the technical difficulty that it is not possible (as far as I know) to determine the argument structure of a diacritic by examination of its bit pattern: automatic processing would be a lot simpler if all the floating diacritics had, for example, been grouped together on pages by number of arguments, so that the character stream could at least be parsed into its linear components by "dumb" (i.e., non-language-aware) software.

(2) mathematics (contrary to the belief of most printers, I sometimes suspect) uses font change and similar mechanisms as productive diacritics (font is frequently used just like circumflex or the vector accent to specify domain).
This is demonstrably *not* the way "font" is usually used, since in mathematics (a) it typically applies to a single base letter, and (b) a uniform font substitution clearly changes the meaning of an expression: characteristics that are generally diagnostic of diacritics. Thus, for mathematics, diacritics like -fraktur and -shell are at least extremely desirable and to my mind quite necessary. (There are in fact other and more technically challenging instances of this phenomenon, since operator symbols are also productive; arguably the union symbol can be described as lessthan -roundify -rotate-90, for instance, and it would be a great relief to be able to type in a double-swung-shafted triple-open-headed NE arrow...). Now that linguistics is starting to sprout footnotes about combinators and domain equations, this may be a real concern for linguists as well.

(3) a separate (and coordinated) standard is needed NOW to specify how language switching is to be indicated (Unicode has justifiably decided to punt on this), and I don't know of one. What language a sequence of characters is in is the thing that actually determines collation sequence, rendition rules, and so forth, and how this is handled seems to be in danger of falling through the cracks between the character set and the markup notation.

Clearly workable solutions to (2) and (3) are still possible within the Unicode framework, but the danger is that by postponing them the opportunity for standardisation will be lost.

Despite these minor quibbles, Unicode does seem to be the best alternative from a technical perspective, whether I think about it as a computer scientist or as an amateur of linguistics: it's sufficiently flexible and sufficiently easy to process that where problems arise their solutions seem to be practical, something that it would be much harder to say of, for example, variable-width coding schemes.
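As an illustration of two processing points made above (Latin-1 coinciding with page zero of Unicode, and problem (1) about diacritic argument structure), here is a minimal sketch in Python. Everything named in it is an assumption made for the example: the ARITY table, its stand-in characters "^" and "-", and the function names are hypothetical, and are neither real Unicode assignments nor anything from the CILS code. It follows the postfix model exactly as described in this message and does not handle stacked diacritics.

```python
# Hypothetical arity table: how many preceding base characters each postfix
# diacritic consumes. The symbols and arities are invented for illustration
# only; they are NOT real Unicode assignments.
ARITY = {
    "^": 1,  # stand-in for a one-place circumflex
    "-": 2,  # stand-in for the two-place macron of the "oo-" example
}


def latin1_to_code_points(data: bytes) -> list[int]:
    """Widen Latin-1 bytes to Unicode code points.

    Each Latin-1 byte value is numerically equal to its Unicode code
    point, so the translation is the identity on the numbers.
    """
    return list(data)


def group_postfix(stream: str, arity: dict[str, int] = ARITY) -> list[tuple[str, str]]:
    """Split a stream into (bases, diacritic) units.

    A bare base character yields (char, ""). A diacritic with arity n
    absorbs the n bare bases immediately before it.
    """
    units: list[tuple[str, str]] = []
    for ch in stream:
        n = arity.get(ch)
        if n is None:
            # Not in the table: treat as an ordinary base character.
            units.append((ch, ""))
            continue
        # The diacritic needs n bare (unmarked) bases directly before it.
        if len(units) < n or any(mark for _, mark in units[-n:]):
            raise ValueError(f"diacritic {ch!r} needs {n} bare base character(s)")
        bases = "".join(base for base, _ in units[-n:])
        del units[-n:]
        units.append((bases, ch))
    return units


if __name__ == "__main__":
    print(latin1_to_code_points(b"na\xefve"))
    print(group_postfix("foo-d"))
```

Note that group_postfix cannot segment the stream without consulting the ARITY table. If arity were instead recoverable from the code value itself (for instance by grouping diacritics on pages by number of arguments, as suggested in (1)), the table lookup could become a simple range check that "dumb" software could apply without any knowledge of individual diacritics.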
+----------------------------------------------------------------------
stephen p spackman         Center for Information and Language Studies
systems analyst                                  University of Chicago
+----------------------------------------------------------------------