LINGUIST List 2.299

Tuesday, 18 June 1991

Disc: Diacritics

Editor for this issue: <>


Directory

  1. Ken Whistler, Diacritics, Unicode, and ISO10646
  2. Stavros Macrakis, Character encodings

Message 1: Diacritics, Unicode, and ISO10646

Date: Thu, 13 Jun 91 22:22:40 PDT
From: Ken Whistler <whistler@zarasun.Metaphor.COM>
Subject: Diacritics, Unicode, and ISO10646
In response to issues raised in Linguist List, Vol. 2, No. 0283.
First of all, the terminology of "open repertoire" and "closed
repertoire" tends to cause endless confusion when applied to
character encodings, because the proponents of different
character encoding architectures often mean different things
when they say "character". This leads to different senses
of "closed repertoire of characters".
In Unicode terminology, "character" refers to the thing which
gets a 16-bit number attached to it in the encoding. In this
sense, Unicode clearly has a closed repertoire of characters.
There are about 27,000 of them, mostly Han characters, and each
one of them is unambiguously identified in the standard.
However, the classes of things which are encoded as "characters"
include both baseform letters (U+0065 LATIN SMALL LETTER E) and floating
diacritics (U+0301 NON-SPACING ACUTE), as well as accented letters
(U+00E9 LATIN SMALL LETTER E ACUTE). This creates the multiple
spelling problem for accented letters that we all know about--
but it is also the basis for the open-ended, productive part of
Unicode, since U+0301 NON-SPACING ACUTE can be used with other
characters to create compositions which are NOT preencoded in
the standard. (e.g. x-acute, beta-acute, Georgian-an-acute, who can
guess...?) In this sense, the encoding of non-spacing characters
in Unicode (of various classes--the Latin/Greek/Cyrillic floating
diacritics are only one of several major classes of non-spacing
marks used in various scripts) creates a vast potential universe
of coded "things" resulting from non-spacing marks applied one
or more at a time to baseform characters. To avoid confusing these
"things" with characters, let's for now call them "charactoids".
While Unicode has a well-defined, closed repertoire of characters
(each exactly 16 bits in size), it at the same time
has an open repertoire of charactoids. The class of charactoids
is defined by a well-defined set of composition rules, rather than
by enumeration. We know the numerosity of the class is huge, but
no one is going to try to count it--indeed the whole point is that
charactoids are freely generable by the encoding scheme, without
having to go through a committee to get a single 16-bit number
assigned to each one. (As an analogy, think of characters as
representing the morphemes of a morphologically complex language
like Cree, for example. Charactoids are then analogous to the
words of Cree. Who knows what they all are? And is anyone
going to try to define them all ahead of time?)
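As a minimal sketch of the character/charactoid distinction, here is how
it looks in present-day Python (the unicodedata module and its
normalization forms are conveniences of this sketch, not part of the
encoding itself):

    import unicodedata

    # <e> + <non-spacing acute>: a charactoid with a preencoded
    # single-character equivalent, U+00E9 LATIN SMALL LETTER E ACUTE.
    e_acute = "e\u0301"
    # <x> + <non-spacing acute>: a freely generated charactoid with NO
    # single code point of its own.
    x_acute = "x\u0301"

    print(len(unicodedata.normalize("NFC", e_acute)))   # 1, folds to U+00E9
    print(len(unicodedata.normalize("NFC", x_acute)))   # 2, stays base + mark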
While the class of charactoids encodable by composition in
Unicode is vast and open, the structure of the code, together
with the facts about several widely used orthographies in the
Latin/Greek family of scripts, results in several well-defined
subsets of charactoids which have the following properties:
A. charactoids which are functionally equivalent to accented
letters encoded as single characters in Unicode. This is
the <e>+<non-spacing acute> = <e-acute> case. The list of
such cases is well-defined, not too large, and will be
published in the Unicode 1.0 standard as one of the auxiliary
tables. The important principle is that Unicode does not
specify a functional distinction between the two equivalent
"spellings" in Unicode. This is important because to do
otherwise would prevent Unicode applications and systems
programmers from normalizing freely from one to the other
depending on their internal requirements for representation.
Unicode does not require normalization; nor does it prevent it.
Unicode also does not prevent an application from maintaining
a private distinction between, say, <e-acute> as a fundamental
vowel unit in an orthography, and <e>+<non-spacing acute> as a
vowel plus applied tone mark. What it does say is that such
private distinctions cannot be reliably conveyed in plain
Unicode text, because another Unicode text interpreter may
normalize them all to <e-acute>.
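(A short sketch after this list illustrates this two-way equivalence.)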
B. charactoids which are functionally equivalent to accented
letters which are NOT encoded as single characters in Unicode,
but which are used in important orthographies. The ones which
cause all the controversy are Vietnamese and Polytonic Greek.
Both make wide use of doubly accented letters. Both are
freely encodable in Unicode by baseform plus non-spacing
diacritic combinations. The set of charactoids required for
Vietnamese or for Polytonic Greek is well-defined and is
being published as part of the Unicode 1.0 standard.
C. charactoids which are useful, but whose users LIKE to have
them be an open class, not enumerated. All of IPA falls in
this class, together with the productive application of
vector notation diacritics to mathematical symbols, for instance.
D. The remainder are the charactoids which... well, who the
hell knows what they might be. The Unicode standard has no
intention of prescribing them all, nor of proscribing any of
them. (Unicoders simply want to build software that lets
users do what they want to do.)
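To make the subset A equivalence concrete, here is the same kind of
sketch, again leaning on Python's unicodedata module as a stand-in for
whatever normalization machinery a Unicode application might actually use:

    import unicodedata

    precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E ACUTE (one character)
    decomposed  = "e\u0301"  # U+0065 + U+0301 (baseform + non-spacing acute)

    # An application may normalize freely in either direction, depending
    # on its internal requirements for representation.
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed

    # A raw comparison sees two different spellings; a comparison made
    # after normalization treats them as the same text element.
    print(precomposed == decomposed)                              # False
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))               # True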
The design goals of Unicode were to keep subset A as small as
possible, because keeping track of such "required" equivalences
tends to impose efficiency and resource penalties on software
whose combinatorial properties grow as n-squared. Subset A
cannot, however, be reduced to zero, because of other code compatibility
requirements and offsetting inefficiencies which set in when
having to deal with charactoids instead of characters in the
software.
Others believe that all of subset B should be encoded as
characters (i.e., move them into subset A). Vietnamese is
on the hairy edge of this argument, and a strong case can
be made either way. There is no absolutely right answer as
to how to encode it--just a lot of conflicting tradeoffs
in a multiple-sum game, only some of whose sums are purely
technical.
Lars Henrik Mathiesen noted that "there are
technical reasons why a standard without floating diacritics is easier
to implement." While that is true in a world of limited implementations
of European languages on glass terminals with character ROM's,
the Unicode designers are firmly of the opinion that in building
an international character set for multilingual, multiscript
applications, the arguments all come down on the other side.
(In the following I am not ascribing to Lars a particular
position in this--I think his contribution was intended primarily
as an exegesis of another note written by an anti-Unicoder
posted by a pro-Unicoder.)
1. Open-ended productivity of diacritic application is a fundamental
principle of the Latin/Greek family of scripts. To attempt to
code all "useful" combinations and proscribe all others is
both obtuse and unworkable. (Any devotee of IPA who wants to
be able to encode and exchange it on a computer should see
that this is self-evident. But the ISO 10646 approach
to the IPA encoding problem was to remove the problem by
removing IPA from the encoding! Now THAT's a great solution
for linguists!)
2. Any putatively universal character encoding has to be able to convert
and interwork with existing standards (e.g. in the bibliographic
community) which ALREADY have non-spacing diacritics. So it's
a done deal. They MUST be included--unless you just ignore
them. And THAT's a great solution for bibliographers!
3. Finally, non-spacing diacritics aren't even very hard to
implement. Compared to the problems which need to be addressed
and solved to support Arabic and Indic scripts, the whole
issue of non-spacing diacritics is revealed for what it really
is in the larger picture: well-understood "easy stuff".
Next, to address some issues raised by Mark Johnson's comments:
ISO/IEC DIS 10646 did not propose a "fixed-length character
encoding system." That was but one of its many drawbacks. It
proposed a character encoding system whose canonical form was
four "octets" (ISO standardese for "bytes") for one "character",
but which also allowed for "compaction forms" which would
result in characters encoded as one, two, three, or a variable
number of bytes. And in any case, even when a fixed number
of bytes (say 2) would be used to represent "graphical" characters
such as <e-acute>, any control characters would have to be
interpreted one byte at a time.
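Purely as a structural illustration of that point (the group/plane/row/cell
labels follow the general DIS 10646 scheme, but the draft's actual octet
values and restrictions are not reproduced here), a sketch in Python:

    def canonical_form(group: int, plane: int, row: int, cell: int) -> bytes:
        # Canonical form: four octets per graphic character.
        return bytes([group, plane, row, cell])

    def two_octet_compaction(row: int, cell: int) -> bytes:
        # One compaction form: row and cell octets only, group and plane implied.
        return bytes([row, cell])

    # The same character costs four octets in one form and two in another,
    # so software cannot assume a single fixed character width.
    print(len(canonical_form(0x20, 0x20, 0x4C, 0x69)))   # 4
    print(len(two_octet_compaction(0x4C, 0x69)))         # 2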
That aside, it is true that 10646 attempted to enumerate all the
"useful" accented letter forms for Latin, Greek, and Cyrillic,
and encoded them as distinct characters.
Now, with respect to the "general escape method, which...would
allow overstriking of arbitrary characters to build new characters,"
such methods already are standardized! The first-order hack
implemented in PC's (and some earlier computers) was to use
the BACKSPACE control code as a direct technical calque from
everyman's solution for creation of composite characters on
a manual typewriter. ISO, in amongst the various standards
for control characters (ISO 6429 to be exact), has defined a
tonier, less lowbrow control character function, the GCC (Graphic
Character Composition) to serve exactly as the "escape method"
for combining two characters.
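For concreteness, here is roughly what the old hack looks like next to the
Unicode spelling (the overstrike bytes are illustrative, with an ASCII
apostrophe standing in for an acute accent; the ISO 6429 GCC control
sequence itself is not reproduced here):

    BACKSPACE = b"\x08"

    overstrike_e_acute = b"e" + BACKSPACE + b"'"   # print 'e', back up, overprint
    unicode_e_acute = "e\u0301"                    # baseform + non-spacing acute

    print(overstrike_e_acute)   # b"e\x08'"
    print(unicode_e_acute)      # renders as e-acute where the pair is composed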
The problem is that such approaches are in the stone age of
computer typography. Characters (and charactoids) are not the
same as glyphs, and glyphs are not the same as images. Glyphs
are abstractions, the TYPES of the elements of textual
graphic representation. (This is pretty close to what linguists
mean by "grapheme", but abstracted away from issues relevant to
its use as a structural unit; think of "glyph" as being to "grapheme"
what "phone" is to "phoneme", and you'll be close.) Images are
instantiations or TOKENS of actual textual graphic representations
taken from particular fonts (of defined face, style, weight,
size, etc.). Modern rendering software provides layers of
mapping between a) character encoding, which is designed to
encode textual CONTENT (appropriate for manipulation by textual
processes) and b) glyphic representation (appropriate for rendering
in visible form on screen, printer, or other device). Such
mappings between character and glyph can be one-to-one in the
simple (ASCII) case, but typically are not, even in the computer
typography of English. Several characters may map to a single ligature
glyph; a sequence of baseform + non-spacing diacritic may map to a single
composite glyph. The typeface designer builds the glyphs; the
rendering software maps to the correct choice. Even in those
cases (as for the open-ended set of charactoids encodable in
Unicode which could not all be anticipated by a typeface designer)
in which a baseform + non-spacing diacritic is mapped to
a pair of glyphs which must be (glyphically) composed, modern
font technology builds the composition rules into the fonts.
Effectively the diacritics "know" where to place themselves
with respect to baseforms and each other (within limits).
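A toy sketch of that character-to-glyph mapping layer (the glyph names and
the substitution table are hypothetical; real fonts carry such rules in
their own ligature and mark-positioning tables, not in anything like Python):

    SUBSTITUTIONS = {
        ("f", "i"):      "fi_ligature",        # several characters, one ligature glyph
        ("e", "\u0301"): "e_acute_composite",  # base + non-spacing mark, one glyph
    }

    def map_to_glyphs(text: str) -> list:
        # Map a character sequence to glyph names, preferring pairwise substitutions.
        glyphs = []
        i = 0
        while i < len(text):
            pair = (text[i], text[i + 1]) if i + 1 < len(text) else None
            if pair in SUBSTITUTIONS:
                glyphs.append(SUBSTITUTIONS[pair])
                i += 2
            else:
                glyphs.append(text[i])         # one-to-one in the simple (ASCII) case
                i += 1
        return glyphs

    print(map_to_glyphs("file\u0301"))   # ['fi_ligature', 'l', 'e_acute_composite']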
Once again, the solutions for handling these issues for the Latin
script are quite well-understood in the industry. Font
technology is an entire computer sub-industry single-mindedly
driven towards making computer typography even better than
the "real thing". But what is very well understood for
Latin (and rapidly being extended to Greek and Cyrillic) is
still very skimpily implemented for Arabic, or Devanagari,
or Tibetan (!) or Burmese (!), for example, where the problems
are very much harder and where getting it right is going to
take a lot more work yet.
Hoping this helps some,
--Ken Whistler
Secretary, Unicode Consortium
(and a practicing linguist!)

Message 2: Character encodings

Date: Fri, 14 Jun 91 16:43:15 EDT
From: <macrakis@osf.org>
Subject: Character encodings
Mr. Hackney: May I suggest you read the (extensive) discussions on
multilingual character coding before speculating, and before asserting
that ``the answer'' is a variable-length coding?
	-s	(Stavros Macrakis)
PS Your `inferred explanation' of `floating diacritics' is incorrect.