LINGUIST List 2.354

Sunday, 21 July 1991

Disc: Character Encoding

Editor for this issue: <>


Directory

  1. Wayles Browne, Re: Character Encoding
  2. Stephen P Spackman, Re: Whistler's characters & ISO representation

Message 1: Re: Character Encoding

Date: Sat, 20 Jul 91 11:21:08 EDT
From: Wayles Browne <JN5J@CORNELLA.cit.cornell.edu>
Subject: Re: Character Encoding
In (partial) answer to T.R. Hofmann's question, there is a mailing list
devoted to discussion of the ISO10646 standard (and the Unicode one as
well). One can subscribe by sending the message
TELL LISTSERV AT JHUVM SUBSCRIBE your name
or
SEND LISTSERV@JHUVM SUBSCRIBE your name
or variants thereof, depending on how your system can reach Bitnet addresses.

Message 2: Re: Whistler's characters & ISO representation

Date: Sat, 20 Jul 91 14:09:17 -0500
From: Stephen P Spackman <stephen%tira.uchicago.edu@RICEVM1.RICE.EDU>
Subject: Re: Whistler's characters & ISO representation
Here at CILS, we've (somewhat tentatively, perhaps) adopted Unicode
for our multilingual textual database projects. At present we're still
actually using the ASCII/Latin-1 combination (and our main database,
that of the ARTFL project, will probably remain that way for reasons
of compression: as has rightly been pointed out, naive storing of
Unicode can cause space problems: in our case it'd push us off the end
of a gigabyte disk). But since this is isomorphic to page zero of
Unicode, the translation is completely trivial, and we have started
writing code that can be compiled to use Unicode natively.
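
The "completely trivial" translation can be sketched as follows. This is a minimal illustration, not CILS code: the function name is hypothetical, and the fixed two-bytes-per-character storage is an assumption standing in for "naive storing of Unicode".

```python
def latin1_to_code_points(data):
    """Map Latin-1 bytes to Unicode code points.

    Latin-1 byte values 0x00-0xFF coincide with code points
    U+0000-U+00FF, so each byte value is already the code point.
    """
    return list(data)


text = "naïve".encode("latin-1")          # 5 bytes in Latin-1
code_points = latin1_to_code_points(text)

# Assumption: naive storage uses a fixed 2 bytes per character.
wide = b"".join(cp.to_bytes(2, "big") for cp in code_points)

print(code_points)           # [110, 97, 239, 118, 101]
print(len(text), len(wide))  # 5 10
```

Under that assumption the text doubles in size, which is the space problem mentioned above.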
In reply to "Thomas R. Hofmann"
<71721.2655%CompuServe.COM@RICEVM1.RICE.EDU>'s (second) question about
wide diacritics, yes, Unicode does in fact provide for these: a
Unicode diacritic is a postfix operator that takes a fixed number of
arguments determined by the diacritic itself; so you'd see internal
representations like "oo-" (where "-" represents a two place macron,
whatever).
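
A rough sketch of how a stream in this postfix model might be grouped into units, assuming the number of arguments of each diacritic is known in advance. The table below is hypothetical: "-" stands for the two place macron of the example above and "^" for a one place circumflex; neither is a real code assignment.

```python
# Hypothetical arity table: marker character -> number of arguments.
# These markers are stand-ins for illustration, not real code points.
ARITY = {"-": 2, "^": 1}


def parse_clusters(stream, arity=ARITY):
    """Group a character stream into units, treating each diacritic
    as a postfix operator over the preceding n units."""
    clusters = []
    for ch in stream:
        n = arity.get(ch, 0)
        if n == 0:
            # Ordinary base character: a unit of its own.
            clusters.append(ch)
            continue
        if len(clusters) < n:
            raise ValueError("diacritic %r needs %d arguments" % (ch, n))
        # Pop the n preceding units and fuse them with the diacritic.
        args = clusters[-n:]
        del clusters[-n:]
        clusters.append("".join(args) + ch)
    return clusters


print(parse_clusters("boo-k"))  # ['b', 'oo-', 'k']
```

Note that the parser cannot work without the arity table, since nothing in the marker itself says how many arguments it takes.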
Speaking as a computer scientist with slight feet in mathematics and
linguistics, the main problems with the Unicode approach are, to my
mind:
(1) the technical difficulty that it is not possible (as far as I
know) to determine the argument structure of a diacritic by
examination of its bitpattern: automatic processing would be a lot
simpler if all the floating diacritics had, for example, been grouped
together on pages by number of arguments, so that the character stream
could at least be parsed into its linear components by "dumb" (i.e.,
non-language-aware) software.
(2) mathematics (contrary to the belief of most printers, I sometimes
suspect) uses font change and similar mechanisms as productive
diacritics (font is frequently used just like circumflex or the vector
accent to specify domain). This is demonstrably *not* the way "font"
is usually used since (a) it typically applies to a single base
letter, and (b) a uniform font substitution clearly changes the
meaning of an expression; characteristics that are generally
diagnostic of diacritics. Thus, for mathematics, diacritics like
-fraktur and -shell are at least extremely desirable and to my mind
quite necessary. (There are in fact other and more technically
challenging instances of this phenomenon, since operator symbols are
also productive; arguably the union symbol can be described as
lessthan -roundify -rotate-90, for instance, and it would be a great
relief to be able to type in a double-swung-shafted triple-open-headed
NE arrow...).
Now that linguistics is starting to sprout footnotes about
combinators and domain equations, this may be a real concern for
linguists as well.
(3) a separate (and coordinated) standard (since Unicode has
justifiably decided to punt on this) is needed NOW to specify how
language switching is to be specified, and I don't know of one. What
language a sequence of characters is in is the thing that actually
determines collation sequence, rendition rules, and so forth, and how
this is handled seems to be in danger of falling through the cracks
between the character set and the markup notation.
Clearly workable solutions to (2) and (3) are still possible within
the Unicode framework, but the danger is that by postponing them the
opportunity for standardisation will be lost.
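
Point (3) can be illustrated with a small sketch in which the same character strings sort differently depending on the language they are declared to be in. Everything here is an assumption made for illustration: the language tags, the function names, and the much simplified collation rules (Swedish placing å, ä, ö after z; German dictionary order treating ä as a).

```python
# Simplified, hypothetical collation rules for two languages.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}


def swedish_key(word):
    # Rank each letter by its position in the Swedish alphabet.
    # Raises ValueError for characters outside that alphabet.
    return [SWEDISH_ALPHABET.index(c) for c in word.lower()]


def german_key(word):
    # Fold umlauts onto their base letters before comparing.
    return "".join(GERMAN_FOLD.get(c, c) for c in word.lower())


COLLATORS = {"sv": swedish_key, "de": german_key}


def sort_words(words, lang):
    """Sort words using the rules of the declared language."""
    return sorted(words, key=COLLATORS[lang])


words = ["zebra", "äpple", "apa"]
print(sort_words(words, "de"))  # ['apa', 'äpple', 'zebra']
print(sort_words(words, "sv"))  # ['apa', 'zebra', 'äpple']
```

The characters alone do not determine the order; the language tag has to travel with the text somehow, which is exactly the information at risk of being lost.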
Despite these minor quibbles, Unicode does seem to be the best
alternative from a technical perspective, whether I think about it as
a computer scientist or as an amateur of linguistics: it's
sufficiently flexible and sufficiently easy to process that where
problems arise their solutions seem to be practical, something that it
would be much harder to say of, for example, variable-width coding
schemes.
+----------------------------------------------------------------------
stephen p spackman Center for Information and Language Studies
systems analyst University of Chicago
+----------------------------------------------------------------------