LINGUIST List 2.296

Friday, 14 June 1991

Disc: Character Coding (Part 2)

Editor for this issue: <>


Directory

  1. Re: Diacritics
  2. Technical problems with diacritics

Message 1: Re: Diacritics

Date: Tue, 11 Jun 91 09:05 EDT
From: <DJBPITT%PITTVMS.BITNET@CUNYVM.CUNY.EDU>
Subject: Re: Diacritics
As one of the more active participants in the ISO10646 and Unicode ListServ
discussions of "floating diacritics," I would like to comment on some of the
postings on the subject in Linguist List, Vol. 2, No. 0283 (Monday, 10 June
1991, Subj: 2.0283 Diacritics). I apologize for the length of this
contribution, but it's a large topic with a long history and many readers of
this ListServ may not know the background.

First, thousands of lines have already been written about this subject on
those two ListServs (hundreds of them by me). I and a number of other
linguists have been arguing from the beginning that character sets are too
important to be left entirely to specialists in computer languages (who have
their own priorities) and that natural language orthography is serious
business.

It is encouraging that someone has taken steps to draw more linguists into the
discussion by some judicious crossposting to the Linguist ListServ. But I
would like to suggest that interested linguists subscribe to the specialized
ListServs mentioned above and that they read the archives available there.
This will avoid unnecessary repetition and cross-posting, will ensure that all
participants in the discussion know the background, and -- perhaps most
important -- will ensure that linguists' informed opinions are shared with
colleagues in other disciplines who often play decisive roles in developing
international character set standards.

I would also like to urge linguists to become involved in character set issues
in an effective way. The ISO is composed of national representative bodies and
is not required to listen to individuals. It is possible to join your
country's national delegation and help formulate official positions on ISO
proposals. ISO character set development is ultimately politics, not science.
If you want to influence the outcome, you can't just post intelligent
observations to a non-binding ListServ; you have to participate at the national
delegation level.

Why waste your time? Those of us who work with unusual writing systems are
forced to develop our own character coding. Since we are not organized, we
wind up with files that cannot be shared. Hardware and software manufacturers
support recognized standards (both official ISO standards and de facto industry
standards, which Unicode is likely to be); making a standard suit your needs
means 1) you don't have to do your own character set development any more, and
2) you can share files with colleagues.

macrakis@osf.org writes:

>The argument is much narrower than that: should
>character encodings be closed (i.e. contain a fixed repertoire of
>character+diacritic combinations) or open (i.e. permit arbitrary
>combinations of character and diacritic)?

From a different perspective, both repertoires are closed and open
simultaneously. The crucial difference, as I see it, lies in their differing
definitions of "character."

First, an "open" repertoire, as defined above, is also fixed, in that there is
a finite number of machine characters. In neither case can a new base,
diacritic, or precomposed base+diacritic be added arbitrarily by a user as a
machine character. In the sense of allowing new machine characters, all
character sets are closed. (Both ISO DIS 10646 and Unicode have provisions for
private use zones.)

Second, no character set ever limits the arbitrary combination of alphabetic
characters, and character sets always permit combinations that would not be
meaningful in any writing system. If we define base and diacritic elements as
our characters, allowing their arbitrary juxtaposition is no different from
allowing the arbitrary juxtaposition of any two base characters. In the sense
of allowing combinations of elements, all character sets are open.

The "precomposed" camp essentially views characters as things that occupy
linear space. The "separable" camp does not; from the latter perspective,
base+diacritic is simply two characters, one of which is traditionally
displayed above, rather than next to, the other. The issue isn't so much that
one repertoire is closed and the other open as that the two repertoires are
constituted differently. Creating a new base+diacritic combination in a
separable-diacritic system isn't creating a new machine character, because the
combination is no more a character than a sequence of two bases is.
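
To make the two views concrete, here is a minimal sketch in Python (using the
standard unicodedata module; purely illustrative, and the code points are those
of present-day Unicode rather than anything in the 1991 drafts):

    import unicodedata

    precomposed = "\u00E9"    # one machine character: LATIN SMALL LETTER E WITH ACUTE
    separable   = "e\u0301"   # two machine characters: 'e' + COMBINING ACUTE ACCENT

    print(len(precomposed), len(separable))                        # 1 2
    # The two codings name the same text element:
    print(unicodedata.normalize("NFC", separable) == precomposed)  # True

In the "separable" view, the second coding is simply a sequence of two
characters, one of which the renderer happens to place above the other.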

>The difference comes with characters which are NOT widely used ... [a random
>accented character] does not exist as a precomposed character in Unicode
>or in 10646. But in Unicode it can be represented as a combination of three
>codes, even if it's never been used before (and even if it is a typo!).
>10646 could of course add it in a future revision, but this has to be done
>on a case-by-case basis.

There are two sets of issues in choosing between precomposed characters and
separable diacritics. (Note: "diacritic" is not necessarily the best term
for reasons discussed by Lloyd Anderson in the ISO10646 and Unicode
ListServ archives, but I'll continue using it here.) These issues are adequacy
and appropriateness.

Adequacy is the easy one: the price of prohibiting separable diacritics is
making room for all precomposed combinations. For some poorly-codified writing
systems, it is impossible to determine which precomposed combinations actually
occur. Separable diacritics ensure that unforeseen combinations can be
represented. Opponents of separable diacritics, largely computer scientists
with no experience in uncommon or poorly codified writing systems, do not care
whether the repertoire is adequate for scholars, as long as it is adequate for
businessmen using modern languages. One suggested criterion for representing
characters is to represent only those characters used in newspapers, a
criterion that I hope all linguists will find appalling. (Other opponents of
separable diacritics are more reasonable, but, through lack of experience with
poorly-codified writing systems, may not understand that even with the best of
intentions it may be impossible to define a complete precomposed repertoire in
advance.)

Appropriateness is tougher. One set of arguments holds that character
inventory should reflect grapheme inventory; diacritics would be encoded
separably if they function as separable orthographic entities, while
precomposed combinations would be used otherwise. (Graphemic analysis is not
necessarily unique and "separable orthographic entity" is tricky to define,
but constructive suggestions can be found in the archives.) Another holds that
the proper criterion for appropriateness is processing efficiency; the
programming languages people prefer precomposed combinations because they are
better suited to certain machine operations. (In some cases, this reflects a
limited understanding of the type of operations that people perform on texts,
since separable diacritics may be better suited for other operations. Any
operation _can_ be performed with either coding, but with differences in
efficiency that programmers may consider significant). Another holds that the
decision is arbitrary because the most appropriate or efficient encoding is
unknowable. I will not rehash these arguments here; please consult the
archives.
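
As a rough illustration of how the choice of coding favours different
operations (again a Python sketch of my own, with illustrative names; nothing
here is prescribed by either standard):

    import unicodedata

    def strip_diacritics(text):
        """Remove combining marks -- a cheap filter when diacritics are coded separably."""
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_diacritics("r\u00E9sum\u00E9"))   # 'resume'
    # Exact matching, by contrast, is a single-code comparison when text is
    # kept in a precomposed canonical form.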

It seems self-evident to me that the adequacy issue, which is entirely on the
side of separable diacritics, must be paramount.

>Proponents of closed repertoire systems argue that inventors of
>NEW orthographies should limit themselves to standard characters.
>Proponents of open repertoire systems argue that this is an unnatural
>limitation which restricts designers of orthographies artificially.
>
><<That>> is what the argument is about, <<not>> about suppressing e-acute.

That may be what the argument _should_ be about. Nobody wants to suppress
e-acute because it is used in French, but nobody is clamoring to make room for
the early Cyrillic lower_case_neutral_jer+longa that I need. And nobody in the
"precomposed" camp can tell me how they plan to provide for early Cyrillic when
it is impossible to determine a precomposed inventory. My colleague Kyongsok
Kim has raised exactly the same argument concerning ancient Hangul.

Let me close with a telling anecdote from the ISO10646 ListServ. One of the
Unicode developers had occasion to work with a Russian-language
teach-yourself-Japanese book. The Japanese is transcribed phonetically in
Cyrillic and uses a macron to indicate long vowels. This includes macron over
e+diaeresis, which occurs in no Slavic writing of any period as far as I know.
It was pointed out that a separable diacritic approach can handle this, while
ISO DIS 10646, which is a precomposed approach, cannot. It was also suggested
that if we petitioned the ISO to include this precomposed combination in ISO
DIS 10646 because it was needed for Russian phonetic transcriptions of
Japanese, we would not be warmly received (how many newspapers are published in
Russian transcriptions of Japanese?).
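
For what it is worth, a separable-diacritic coding handles the case trivially.
A small sketch (present-day Python and code points, used only for illustration;
no precomposed io-with-macron exists):

    import unicodedata

    io_macron = "\u0451\u0304"   # CYRILLIC SMALL LETTER IO + COMBINING MACRON
    # No precomposed code exists for this letter, so composition leaves the
    # two-code sequence intact -- yet it is representable and exchangeable.
    print([unicodedata.name(ch) for ch in unicodedata.normalize("NFC", io_macron)])
    # ['CYRILLIC SMALL LETTER IO', 'COMBINING MACRON']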

Someone responded to this: why can't Russians represent vowel length in some
other way, such as using doubled vowels? Aside from the linguistic ignorance
this betrays (every vowel letter in Russian is syllabic and a doubled vowel
letter is two syllables), it demonstrates an attitude that making life easier
for programmers matters more than fidelity to the data. If I need
e+diaeresis+macron, it is the responsibility of character set designers to
provide for it. It is not
their business to tell me to bang in a screw with a hammer because their
toolkit doesn't include a screwdriver and they don't think I need one.

And <<that>> is what the argument is <<really>> about.

Concerning Mark Johnson's summary of technical problems, he raises important
and genuine issues, but confuses certain basic points. "Character" and "glyph"
are technical terms in the character set business and what gets rendered is
glyphs, not characters. A character set is not concerned with centering accent
marks over vowels any more than it is concerned with forming Arabic ligatures;
character sets encode characters and rendering software is responsible for
putting out the proper glyphs. Inventories of characters and glyphs are not
identical. Once again, please consult the archives of the appropriate
ListServs for background on this issue.
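
One way to see the character/glyph distinction in data rather than prose (a
Python sketch; the "fi" ligature stands in here for the Arabic presentation
forms, which behave the same way):

    import unicodedata

    ligature = "\uFB01"                              # LATIN SMALL LIGATURE FI
    print(unicodedata.decomposition(ligature))       # '<compat> 0066 0069'
    print(unicodedata.normalize("NFKC", ligature))   # 'fi': two characters behind one glyph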

--David

Message 2: Technical problems with diacritics

Date: Tue, 11 Jun 91 11:26:09 EDT
From: <macrakis@osf.org>
Subject: Technical problems with diacritics
The technical issues around diacritics have been extensively discussed
in the ISO10646 and Unicode mailing lists. Let me try to summarize
them (with a bias towards Unicode, I'm afraid):

Encoding issues

In a closed repertoire system, characters (with or without diacritics)
can presumably have a fixed encoding. ISO10646 prescribes exactly one
way of representing e-acute. Unicode allows both the precomposed
e-acute and the decomposed sequence e+acute. (Although precomposed
characters may appear inconsistent with Unicode's approach, they were
included for compatibility with existing standards, here Latin-1.)
Unicode also does not prescribe a canonical form for multiple
diacritics in cases where their relationship is unambiguous (although
it suggests one).
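
To make the multiple-diacritics point concrete, a small Python sketch (the
normalization machinery shown here was standardized well after this discussion,
so take it as an illustration of the suggested ordering, not of the 1991
drafts):

    import unicodedata

    a = "e\u0302\u0323"   # e + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW
    b = "e\u0323\u0302"   # the same marks in the other order
    print(a == b)                                # False as raw code sequences
    print(unicodedata.normalize("NFD", a)
          == unicodedata.normalize("NFD", b))    # True: marks ordered by combining class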

However, 10646 does not in fact adhere strictly to biunique mapping of
characters and codes. It includes numerous ligatures (especially for
Arabic). It also includes many contextual forms (for Arabic,
Mongolian, etc.). For Chinese-character languages, 10646 may
represent one character by several different codes, depending on
language and variant rendering. Neither 10646 nor Unicode questions
the distinctness of Greek, Roman, and Cyrillic `A', even though they
have a common history and shape. In general, biunique encoding seems
unattainable, since there are many borderline cases in the world's
writing systems.
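
The Latin/Greek/Cyrillic `A' case is easy to verify directly (a Python sketch;
the code points are those of present-day Unicode):

    import unicodedata

    for ch in "A\u0391\u0410":    # Latin, Greek, and Cyrillic capital A
        print(hex(ord(ch)), unicodedata.name(ch))
    # 0x41  LATIN CAPITAL LETTER A
    # 0x391 GREEK CAPITAL LETTER ALPHA
    # 0x410 CYRILLIC CAPITAL LETTER A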

Processing issues

Proponents of 10646 argue that fixed-length encodings simplify
processing. Proponents of Unicode argue that this is only true for
the simplest cases. For instance, in Spanish, the digraph `ch' must
be treated as a single letter for alphabetic sorting, but no one
proposes to encode it as a single code. It is also argued that it
would be easier to process text in a canonical form--otherwise, you
must be prepared to handle both e-acute and e+acute. But after all,
you must already be prepared to treat as equivalent: `b' and `B';
u-umlaut, U-umlaut, u+e, and U+e; eta-subscript and Eta+iota; the
single character eszett (ß) and the two characters SS (and maybe ss or sz
in some cases); the single character stigma and the two characters
sigma+tau (in Greek numerals).
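
A quick sketch of the same equivalence problem (modern Python, offered only as
an illustration of how far beyond one-to-one code mapping these equivalences
already go):

    import unicodedata

    print("B".casefold())                                        # 'b'
    print("Stra\u00DFe".casefold())                              # 'strasse': one character matched as two
    print(unicodedata.normalize("NFC", "u\u0308") == "\u00FC")   # True: u+diaeresis vs u-umlaut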

The main upshot of this discussion has been to make clear that
processing multilingual text is non-trivial. The methods that work
more or less well for English do not work for many other languages --
but even in English, improved internationalization will mean better
handling of such things as capitalization, which are handled poorly by
all too many programs.

Rendering issues

As Johnson says, good rendering of most composed characters requires
individual graphic design. This is technically compatible with both
open and closed repertoire systems. Open repertoire systems should be
able to render even combinations that the designer never anticipated;
and even with today's technology, such fallback rendering can do better
than simple overstriking (e.g. placing accents above the base character
rather than overlapping with it).
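
Even a generic fallback renderer can be table-driven rather than hard-coded:
each combining mark carries a class saying roughly where it attaches (a Python
sketch of the lookup; the numeric classes are those of present-day Unicode):

    import unicodedata

    print(unicodedata.combining("\u0301"))   # 230: a mark rendered above the base
    print(unicodedata.combining("\u0323"))   # 220: a mark rendered below the base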

Note that Unicode does not prohibit meaningless combinations, such as
using Hebrew vowel points on Japanese kana! But you can expect that
the rendering will be just as absurd as the spelling....

	-s
[End Linguist List, Vol. 2, No. 0296]