LINGUIST List 2.363

Thursday, 25 July 1991

Disc: Character Encoding

Editor for this issue: <>


Directory

  1. Ken Whistler, Character Encoding
  2. Ken Whistler, Character Encoding

Message 1: Character Encoding

Date: Wed, 24 Jul 91 15:17:12 PDT
From: Ken Whistler <whistler@zarasun.Metaphor.COM>
Subject: Character Encoding

In response to a question posted on Linguist List Vol-2-352,
20 July 1991, by Ron Hofmann.

The way character encoding in ISO works is roughly as follows. The
parent organization creates standing committees which work on
various broad standards issues. SC2 is the standing committee which
works on encoding, and in particular on character encoding. SC2
in turn creates working groups which have various specific projects
that they work on. WG2 of SC2 is the working group designated to
come up with a multi-byte international character encoding standard--
i.e. the group whose responsibility it is to "solve" the problem
of international character encoding. The convenor of WG2 (currently
Mike Ksar of Hewlett-Packard) is the person responsible for
carrying that project to a conclusion and taking the WG2 recommendation
to SC2 for final approval. The specific project that WG2 is
working on is known as DIS (Draft International Standard) 10646.

ISO standards become standards by being floated as official work
items assigned to working groups. The drafts are then distributed to
the ISO member bodies (the national standards organizations) for
comments and voting. Standards go through a couple
of levels of draft (1st DP, 2nd DP), and then progress to DIS
level for final voting. 10646 was taken to DIS level, and we currently
have the information that it was NOT approved by the vote of
the ISO member bodies which closed June 7. However, that is not
the end of the story. WG2 meets in August (in Geneva) to attempt
to resolve the voting. How could this be? Well, ISO voting is not
really YES versus NO, but instead YES versus YES WITH COMMENTS versus
NO WITH OBJECTIONS. Any NO WITH OBJECTIONS vote *must* explicitly
state what it would take to change the vote into a YES. This gives
the convenor of WG2 the leeway to resolve enough details to
convert the overall NO vote into a YES vote by tinkering with the
structure of the standard to meet individual member bodies'
objections. Thus, it is still possible that by tinkering and providing
a justification, DIS 10646 can be carried to SC2 with a WG2 recommendation
for progressing it to IS (International Standard) status. At that point
it is carved in stone for all eternity. Or it may be recommended
for revision and new voting at the DIS level. The SC2 plenary session
that decides this meets in Paris, France in October.

O.k., now how does U.S. input happen? The official body which represents
the U.S. in this is ANSI, the American National Standards Institute.
But the operative body is X3, a committee of ANSI which deals with
information processing systems. The business of X3 is run out of an outfit
called CBEMA (the Computer and Business Equipment Manufacturers Association),
in Washington, DC. X3 designates subcommittees for various purposes.
The standing subcommittee which deals with character encoding is
X3L2. X3L2 basically has representatives of major computer
companies on it (IBM, Apple, DEC, Unisys, Microsoft, Xerox, etc.). It
has a fairly low fee to get on the committee (X3 itself is expensive),
but also has strict rules for full voting memberships, as opposed
to observer status. Basically, you have to be ready to fly
around the country to attend a string of consecutive (boring)
meetings before you can be promoted to voting status, so only large
companies with people who have travel budgets and several weeks to
burn each year retain voting status. But X3L2 is the committee which
actually voted for the U.S. on ISO DIS 10646 and wrote the
official U.S. position paper to accompany that vote.

The only effective way to gain a voice at X3L2, other than to
start a computer company and send an employee to 3 consecutive
meetings, is to lobby existing representatives (e.g. me, representing
Metaphor) or to send formal written contributions to the
Chairman of X3L2. The current chairman is Jerry Andersen, of IBM.
Jerry J. Andersen
IBM Corporation
C71/673
P.O. Box 12195
Research Triangle Park, NC 27709
email: Andersen@ralvmk.iinus1.ibm.com

Before you spend a lot of time writing him, however, you should
understand that the ISO process is an exceedingly hidebound, rule-driven,
bureaucratic, and historically-inert process. Academic contributions
of the sort that get published in journals, or the more informal
discussions which happen on LINGUIST or other email forums will
bounce off X3L2 and ISO like so much water off a duck's back. They
will be viewed generally as "off-the-wall" because they come from
outsiders who are not plugged into the ISO process, don't understand
the politics, do not have the requisite collection of paper with
all the national standards bodies' votes and comments, etc., etc.

Rather than sending out-of-the-blue comments, any linguist who
seriously wants to take this up might first be advised to
send me a request for an electronic copy of the official U.S.
position paper on DIS 10646 and of the founding documents regarding
the Unicode/10646 merger process now underway. (An ad hoc proposal
for 10646M, as it is being called; plus the Unicode Consortium's
official response to 10646M.) Responding to specific points in
those documents would get you a better chance at a hearing, since
those documents are either generated by or will have to be responded
to by X3L2 and WG2, formally or informally, and any comments relating
to a specific document become "real" in the ISO process--and may
even get a registration number as an X3L2 or WG2 document!

Now, if you are concerned about Canada, Japan, or some other
country, you have to figure out the national standards structure
for that particular country to find out who to contact. I can
provide some pointers for people, but can't claim to be an
expert about any of those bodies.
--Ken Whistler
Unicode Secretary

Message 2: Character Encoding

Date: Thu, 25 Jul 91 11:03:53 PDT
From: Ken Whistler <whistler@zarasun.Metaphor.COM>
Subject: Character Encoding

In response to questions about coding of multiple-letter
non-spacing marks, raised in:
	Linguist List: Vol-2-352. Saturday, 20 July 1991
	Linguist List: Vol-2-354. Sunday, 21 July 1991

There are a number of obvious encoding requirements for multiple-letter
non-spacing marks. Some instances occur in standard orthographies, such as
a double-letter tilde in Tagalog. Others are really ligating ties,
such as the IPA ligature bars (rendered above for ligated digraphs
such as eng-g or below for mb, or indifferently above or below
for coarticulations which mix ascenders and descenders: kp, gb, etc.)
(Incidentally, it is not clear whether such things should be
considered diacritics. In any case, the only real character
encoding issue is how to encode non-spacing entities which are
"applied" in the typography across two {or more} letters in
Latin {or other} scripts.)

There are at least three approaches to encoding such things (a short
sketch contrasting them in code follows this list):
	1. glyph parts approach. Encode each part that goes over
		a single letter as a separate character.
	2. unitary character approach, postfix. Encode the
		entire non-spacing mark as a single character,
		and postfix it in the text stream.
	2a. same as 2, but prefix.
	3. unitary character approach, infix. Encode the
		entire non-spacing mark as a single character,
		but specify that it occurs BETWEEN the two
		characters it modifies in the text stream.
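
To make the three models concrete, here is a minimal Python sketch of the
same digraph under each one. The specific code points are only
illustrative (the half marks and the unitary tie shown are later Unicode
assignments, U+FE20/U+FE21 and U+0361); they simply stand in for whatever
values a given standard would assign.

# Illustrative code points only; see the caveat above.
LEFT_HALF_TIE  = "\uFE20"   # above-letter ligature tie, left half (glyph part)
RIGHT_HALF_TIE = "\uFE21"   # above-letter ligature tie, right half (glyph part)
DOUBLE_TIE     = "\u0361"   # unitary above-letter ligature tie

# 1. Glyph-parts model: one non-spacing part per letter, each postfixed.
glyph_parts     = "n" + LEFT_HALF_TIE + "g" + RIGHT_HALF_TIE

# 2. Unitary mark, postfix: the whole tie follows both letters.
unitary_postfix = "n" + "g" + DOUBLE_TIE

# 2a. Unitary mark, prefix: the whole tie precedes both letters.
unitary_prefix  = DOUBLE_TIE + "n" + "g"

# 3. Unitary mark, infix: the whole tie sits between the letters it spans.
unitary_infix   = "n" + DOUBLE_TIE + "g"

for label, s in [("glyph parts", glyph_parts), ("postfix", unitary_postfix),
                 ("prefix", unitary_prefix), ("infix", unitary_infix)]:
    print(f"{label:12s}", [hex(ord(c)) for c in s])
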
The December 1990 draft of Unicode 1.0 contained 5 non-spacing
marks (we were calling them "non-spacing diacritics" at the time)
of type 2: a tilde, a macron, a breve, an above-letter ligature tie,
and a below-letter ligature tie. It ALSO contained, for compatibility
with existing bibliographic standards for non-spacing characters,
4 non-spacing marks of type 1: a left- and right-half tilde, and
a left- and right-half above-letter ligature tie.

In long discussions during Unicode Technical Committee meetings
this spring, the "double diacritics" problem was the subject of
heated debate on at least two accounts. First, the draft contained
two inconsistent models of how to encode them. Second, the type 2
postfix encoding causes potentially large processing problems for
software supporting non-spacing marks.

A compromise along the lines of the type 3 encoding outlined above
almost resulted. However, there was a concern that rushing to
judgement on this without further investigation of all the
potential ramifications would be ill-advised. Instead, in the
current version of Unicode 1.0 being published, all "double diacritics"
were removed, and the entire issue was deferred for further
discussion and resolution in the next additions to Unicode.

It is my personal opinion that type 3 encoding will work just
fine. In a way, it is a kind of Wackernagel's solution to
finding a fixed position for something which may have an indefinite
scope. Unicode can always specify that a non-spacing mark follows
the FIRST item it modifies. This is exactly the current case with
simple non-spacing marks such as a non-spacing acute accent, for
example. If we then add back the five non-spacing marks which
typically apply to two letters, they will occur between the
letters in the text store, but logically they will be occurring
AFTER the FIRST item, as for a "single diacritic". The rendering
engines can handle the placement over two letters without too
much trouble, and the parsing engines don't have to have special
cases built in for the double diacritics.
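
As a hedged illustration of that parsing point (not Unicode's specified
algorithm, just a sketch in Python), a parser that attaches every
non-spacing mark to the preceding base character handles an infix
two-letter tie with no special case at all:

DOUBLE_TIE = "\u0361"                           # illustrative two-letter tie
NONSPACING = {DOUBLE_TIE, "\u0301", "\u0303"}   # tiny stand-in mark set

def clusters(text):
    """Group text into (base character, [trailing non-spacing marks])."""
    out = []
    for ch in text:
        if ch in NONSPACING and out:
            out[-1][1].append(ch)   # mark attaches to the preceding base
        else:
            out.append((ch, []))
    return out

# "n" + tie + "g": the tie lands in the FIRST letter's cluster, exactly as
# a single acute accent would -- [('n', [tie]), ('g', [])].  Only the
# rendering engine needs to know the glyph also extends over the "g".
print(clusters("n" + DOUBLE_TIE + "g"))
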
The glyph parts are still needed for building fonts which will
be used to render "double diacritics" over Latin letters, but
there will be no need for them in the character encoding. It will
also be possible to build the translations which can convert
accurately between existing bibliographic encodings and Unicode
for such things.
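
A sketch of the kind of conversion meant here, again with illustrative
code points only: a bibliographic glyph-parts sequence (each half of the
tie postfixed to its own letter) rewritten as the unitary infix form.

LEFT_HALF_TIE, RIGHT_HALF_TIE = "\uFE20", "\uFE21"   # illustrative halves
DOUBLE_TIE = "\u0361"                                # illustrative unitary tie

def parts_to_unitary(text):
    """Rewrite  X + left-half + Y + right-half  as  X + tie + Y."""
    out, i = [], 0
    while i < len(text):
        if (i + 3 < len(text)
                and text[i + 1] == LEFT_HALF_TIE
                and text[i + 3] == RIGHT_HALF_TIE):
            out += [text[i], DOUBLE_TIE, text[i + 2]]
            i += 4
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# "n" + left half + "g" + right half  ->  "n" + tie + "g"
converted = parts_to_unitary("n" + LEFT_HALF_TIE + "g" + RIGHT_HALF_TIE)
print([hex(ord(c)) for c in converted])
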
As for instances of extending breves or macrons over 3 or more
letters, I would be inclined to draw the line for plain text
encoding at this point. Unicode includes a large number
of mathematical symbols but does not make it possible to
encode complete math formulas *in plain text* without a higher
level protocol for formula syntax and layout. Similarly,
Unicode supplies a large number of non-spacing marks, but does
not make it possible to do things like extend a nasalization
mark across an entire syllable or word, for example, without
higher-level protocols for layout of such supra-character
text content. (This is comparable to the difference between
applying a non-spacing underline as a diacritic to a single
letter, versus applying an underline style to a word.)

In response to Spackman's discussion of the mathematical aspects
of non-spacing mark handling in Unicode: I agree that specifying
different numbers of "arguments" for the non-spacing marks
causes trouble for processing. This was the basic argument
(stated differently) that led to a reexamination and pulling
of "double diacritics" from Unicode for now. As it stands,
a "dumb" algorithm *can* parse Unicode for non-spacing marks,
even without tables, for the Greek family of scripts, at least,
since all non-spacing marks are confined to a small series of
ranges within Unicode. (The issue for the Indian family of
scripts is more complicated--but then, no "dumb" algorithm is
going to be able to do correct parsing of Devanagari, anyway.)
Moving to coding model #3 above for the double diacritics should
leave the parsing algorithms unaffected for the Greek family of
scripts, since all non-spacing marks, including the "double diacritics"
would have only one argument.
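
A minimal sketch of that "dumb", table-free test (the ranges below are
only roughly indicative of where the non-spacing marks sit; a real
implementation would take them from whichever version of the standard it
targets):

NONSPACING_RANGES = [
    (0x0300, 0x036F),   # general combining diacritical marks (illustrative)
    (0xFE20, 0xFE2F),   # combining half marks / glyph parts (illustrative)
]

def is_nonspacing(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NONSPACING_RANGES)

# Under coding model #3 a two-letter tie is just one more mark inside these
# ranges, so the same one-argument test keeps working unchanged.
print([is_nonspacing(c) for c in "n\u0361g"])   # [False, True, False]
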
On other issues raised by Spackman:
It has already been noted that mathematics uses type style (not font)
variation as a meaningful semantic component. Unicode encoded
some of the widely used style variants for particular letters
as separate characters (e.g. black-letter I and R for imaginary and
real). It clearly didn't make sense to encode style variants
for every possible combination. (This is where 10646 goes wrong,
for example, in encoding italic A-Z, a-z, 0-9 and underlined A-Z,
a-z, 0-9 for APL as distinct characters.) However, when
a mathematician starts using typeface or style as a
productive semantic component, it is also not clear
that that should be encoded *as a character* in plain text. This
is one of those fuzzy edges which make character encoding difficult.
Functionally the style shift may be equivalent to addition of
a diacritic to a letter, but encoding it the same way in the text
store may cause more problems than it solves. If we add
<black-letter> or <italic style> as characters to support
such mathematical usage as semantic primitives, what is to prevent
such characters from appearing in text in non-mathematical usage,
resulting effectively in in-line character encoding of face and
style information? That is just an invitation to open an enormous
Pandora's Box of problems in plain text.

Language identification clearly does *not* belong in a character
encoding. However, I do agree with Spackman that it would be
a good idea to create a standard scheme for identification of
language. The lack of such is one of the things which drives people
to attempt bizarre things in character encodings, for example.
Also, language identification as currently implemented at the
systems level in computers tends to get muddled with support
for "country" standards for display of dates, times, and numbers.
Right now every system vendor has its own idiosyncratic system
of identifying languages (usually a small list of the commercially
important European and Asian languages), and these differ from
the idiosyncratic system of identifying languages used by each
application vendor. And none of that does much to help the linguist
who wants a standard language identifier for tagging text
for communication and interchange.
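
For what it is worth, here is a sketch of the sort of out-of-band tagging
argued for here, in Python: language identity travels as markup alongside
the text rather than as characters inside it. The tag values are purely
illustrative; no particular registry of identifiers is assumed.

from dataclasses import dataclass

@dataclass
class Span:
    lang: str   # language identifier from some agreed-on registry (assumed)
    text: str   # plain character data, with no language information inside

def languages_used(doc):
    """The set of language tags a document claims to contain."""
    return {span.lang for span in doc}

doc = [Span("en", "character encoding "), Span("de", "Zeichencodierung")]
print(languages_used(doc))   # -> a set like {'en', 'de'}
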
--Ken Whistler