In response to a question posted on Linguist List Vol-2-352,
20 July 1991, by Ron Hofmann.

The way character encoding in ISO works is roughly as follows.
The parent organization creates standing committees which work on
various broad standards issues. SC2 is the standing committee which
works on encoding, and in particular on character encoding. SC2 in
turn creates working groups which have various specific projects
that they work on. WG2 of SC2 is the working group designated to
come up with a multi-byte international character encoding
standard--i.e. the group whose responsibility it is to "solve" the
problem of international character encoding. The convenor of WG2
(currently Mike Ksar of Hewlett-Packard) is the person responsible
for carrying that project to a conclusion and taking the WG2
recommendation to SC2 for final approval.

The specific project that WG2 is working on is known as DIS (Draft
International Standard) 10646. ISO standards become standards by
being floated as official work items assigned to working groups.
The drafts are then distributed to the ISO member bodies (which
represent their national standards bodies) for comments and voting.
Standards go through a couple of levels of draft (1st DP, 2nd DP),
and then progress to DIS level for final voting. 10646 was taken to
DIS level, and we currently have the information that it was NOT
approved by the vote of the ISO member bodies which closed June 7.

However, that is not the end of the story. WG2 meets in August (in
Geneva) to attempt to resolve the voting. How could this be? Well,
ISO voting is not really YES versus NO, but instead YES versus YES
WITH COMMENTS versus NO WITH OBJECTIONS. Any NO WITH OBJECTIONS vote
*must* explicitly state what it would take to change the vote into
a YES. This gives the convenor of WG2 the leeway to resolve enough
details to convert the overall NO vote into a YES vote by tinkering
with the structure of the standard to meet individual member bodies'
objections.

Thus, it is still possible that by tinkering and providing a
justification, DIS 10646 can be carried to SC2 with a WG2
recommendation for progressing it to IS (International Standard)
status. At that point it is carved in stone for all eternity. Or it
may be recommended for revision and new voting at the DIS level.
The SC2 plenary session that decides this meets in Paris, France in
October.

O.k., now how does U.S. input happen? The official body which
represents the U.S. in this is ANSI, the American National Standards
Institute. But the operative body is X3, a committee of ANSI which
deals with information processing systems. The business of X3 is
run out of an outfit called CBEMA (Computer Business Equipment
Manufacturers Association), in Washington, DC. X3 designates
subcommittees for various purposes. The standing subcommittee which
deals with character encoding is X3L2. X3L2 basically has
representatives of major computer companies on it (IBM, Apple, DEC,
Unisys, Microsoft, Xerox, etc.). It has a fairly low fee to get on
the committee (X3 itself is expensive), but also has strict rules
for full voting memberships, as opposed to observer status.
Basically, you have to be ready to fly around the country to attend
a string of consecutive (boring) meetings before you can be promoted
to voting status, so only large companies with people who have
travel budgets and several weeks to burn each year retain voting
status. But X3L2 is the committee which actually voted for the U.S.
on ISO DIS 10646 and wrote the official U.S. position paper to
accompany that vote.

The only effective way to gain a voice at X3L2, other than to start
a computer company and send an employee to 3 consecutive meetings,
is to lobby existing representatives (e.g. me, representing
Metaphor) or to send formal written contributions to the Chairman
of X3L2. The current chairman is Jerry Andersen, of IBM.

Jerry J. Andersen
IBM Corporation C71/673
P.O. Box 12195
Research Triangle Park, NC 27709
email: Andersen ralvmk.iinus1.ibm.com

Before you spend a lot of time writing him, however, you should
understand that the ISO process is an exceedingly hidebound,
rule-driven, bureaucratic, and historically-inert process. Academic
contributions of the sort that get published in journals, or the
more informal discussions which happen on LINGUIST or other email
forums, will bounce off X3L2 and ISO like so much water off a duck's
back. They will be viewed generally as "off-the-wall" because they
come from outsiders who are not plugged into the ISO process, don't
understand the politics, do not have the requisite collection of
paper with all the national standards bodies' votes and comments,
etc., etc.

Rather than sending out-of-the-blue comments, any linguist who
seriously wants to take this up might first be advised to send me a
request for an electronic copy of the official U.S. position paper
on DIS 10646 and of the founding documents regarding the
Unicode/10646 merger process now underway. (An ad hoc proposal for
10646M, as it is being called; plus the Unicode Consortium's
official response to 10646M.) Responding to specific points in those
documents would get you a better chance at a hearing, since those
documents are either generated by or will have to be responded to
by X3L2 and WG2, formally or informally, and any comments relating
to a specific document become "real" in the ISO process--and may
even get a registration number as an X3L2 or WG2 document!

Now, if you are concerned about Canada, Japan, or some other
country, you have to figure out the national standards structure
for that particular country to find out who to contact. I can
provide some pointers for people, but can't claim to be an expert
about any of those bodies.

--Ken Whistler
Unicode Secretary
In response to questions about coding of multiple-letter
non-spacing marks, raised in:
Linguist List: Vol-2-352. Saturday, 20 July 1991
Linguist List: Vol-2-354. Sunday, 21 July 1991
There are a number of obvious encoding requirements for multiple-letter
non-spacing marks. Some instances occur in standard orthographies, such as
a double-letter tilde in Tagalog. Others are really ligating ties,
such as the IPA ligature bars (rendered above for ligated digraphs
such as eng-g or below for mb, or indifferently above or below
for coarticulations which mix ascenders and descenders: kp, gb, etc.).
(Incidentally, it is not clear whether such things should be
considered diacritics. In any case, the only real character
encoding issue is how to encode non-spacing entities which are
"applied" in the typography across two {or more} letters in
Latin {or other} scripts.)
There are at least three approaches to encoding such things:
   1.  glyph parts approach. Encode each part that goes over
       a single letter as a separate character.
   2.  unitary character approach, postfix. Encode the
       entire non-spacing mark as a single character,
       and postfix it in the text stream.
   2a. same as 2, but prefix.
   3.  unitary character approach, infix. Encode the
       entire non-spacing mark as a single character,
       but specify that it occurs BETWEEN the two
       characters it modifies in the text stream.
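
As a rough illustration of the approaches listed above, the sketch
below writes out the stored sequence for a tied "kp" under each one.
The code point values are an assumption borrowed from later versions
of the Unicode Standard (U+0361 for a unitary tie above, U+FE20 and
U+FE21 for its left and right halves); they are not values from the
December 1990 draft, and the function names are hypothetical.

```python
# Illustrative only: code points are assumed from later Unicode versions,
# not taken from the December 1990 draft.
TIE = "\u0361"        # unitary tie above two letters (assumed value)
TIE_LEFT = "\uFE20"   # left half of the tie (assumed value)
TIE_RIGHT = "\uFE21"  # right half of the tie (assumed value)


def glyph_parts(a, b):
    """Type 1: one half-mark stored after each letter."""
    return a + TIE_LEFT + b + TIE_RIGHT


def unitary_postfix(a, b):
    """Type 2: the whole mark follows both letters."""
    return a + b + TIE


def unitary_prefix(a, b):
    """Type 2a: the whole mark precedes both letters."""
    return TIE + a + b


def unitary_infix(a, b):
    """Type 3: the whole mark sits between the two letters."""
    return a + TIE + b


def show(text):
    """Render a string as a list of code points for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, encode in [
    ("1  glyph parts", glyph_parts),
    ("2  postfix", unitary_postfix),
    ("2a prefix", unitary_prefix),
    ("3  infix", unitary_infix),
]:
    print(f"{label:15} {show(encode('k', 'p'))}")
```

Only the position of the mark in the stored sequence differs between
types 2, 2a and 3; type 1 differs in spending two characters on what
is typographically one mark.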
The December 1990 draft of Unicode 1.0 contained 5 non-spacing
marks (we were calling them "non-spacing diacritics" at the time)
of type 2: a tilde, a macron, a breve, an above-letter ligature tie,
and a below-letter ligature tie. It ALSO contained, for compatibility
with existing bibliographic standards for non-spacing characters,
4 non-spacing marks of type 1: a left- and right-half tilde, and
a left- and right-half above-letter ligature tie.
In long discussions during Unicode Technical Committee meetings
this spring, the "double diacritics" problem was the subject of
heated debate on at least two accounts. First, the draft contained
two inconsistent models of how to encode them. Second, the type 2
postfix encoding causes potentially large processing problems for
software supporting non-spacing marks.
A compromise along the lines of the type 3 encoding outlined above
almost resulted. However, there was a concern that rushing to
judgement on this without further investigation of all the
potential ramifications would be ill-advised. Instead, in the
current version of Unicode 1.0 being published, all "double diacritics"
were removed, and the entire issue was deferred for further
discussion and resolution in the next additions to Unicode.
It is my personal opinion that type 3 encoding will work just
fine. In a way, it is a kind of Wackernagel's solution to
finding a fixed position for something which may have an indefinite
scope. Unicode can always specify that a non-spacing mark follows
the FIRST item it modifies. This is exactly the current case with
simple non-spacing marks such as a non-spacing acute accent, for
example. If we then add back the five non-spacing marks which
typically apply to two letters, they will occur between the
letters in the text store, but logically they will be occurring
AFTER the FIRST item, as for a "single diacritic". The rendering
engines can handle the placement over two letters without too
much trouble, and the parsing engines don't have to have special
cases built in for the double diacritics.
The glyph parts are still needed for building fonts which will
be used to render "double diacritics" over Latin letters, but
there will be no need for them in the character encoding. It will
also be possible to build the translations which can convert
accurately between existing bibliographic encodings and Unicode
for such things.
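
A translation of the kind mentioned above could be sketched roughly
as follows. This is only an illustration under stated assumptions:
the half-mark and unitary-mark code points are taken from later
Unicode versions, the half marks are assumed to follow their letters,
and the unitary mark is assumed to sit between the two letters. Real
bibliographic encodings may order marks differently, and the function
names are hypothetical.

```python
import re

# Assumed code points (from later Unicode versions, not the 1991 text).
TIE = "\u0361"        # unitary tie, stored between the two letters
TIE_LEFT = "\uFE20"   # left half, assumed to follow the first letter
TIE_RIGHT = "\uFE21"  # right half, assumed to follow the second letter

# letter + left half + letter + right half
_HALVES = re.compile(f"(.){TIE_LEFT}(.){TIE_RIGHT}", re.DOTALL)
# letter + unitary tie + letter
_INFIX = re.compile(f"(.){TIE}(.)", re.DOTALL)


def halves_to_infix(text):
    """Replace each pair of half-marks with one mark between the letters."""
    return _HALVES.sub(lambda m: m.group(1) + TIE + m.group(2), text)


def infix_to_halves(text):
    """Expand each infix mark back into a left and a right half."""
    return _INFIX.sub(
        lambda m: m.group(1) + TIE_LEFT + m.group(2) + TIE_RIGHT, text
    )


source = "k" + TIE_LEFT + "p" + TIE_RIGHT
converted = halves_to_infix(source)
# The round trip restores the original sequence for this simple case.
print(infix_to_halves(converted) == source)
```

The sketch ignores cases such as other diacritics stacked on either
letter, which a real conversion table would have to address.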
As for instances of extending breves or macrons over 3 or more
letters, I would be inclined to draw the line for plain text
encoding at this point. Unicode includes a large number
of mathematical symbols but does not make it possible to
encode complete math formulas *in plain text* without a higher
level protocol for formula syntax and layout. Similarly,
Unicode supplies a large number of non-spacing marks, but does
not make it possible to do things like extend a nasalization
mark across an entire syllable or word, for example, without
higher-level protocols for layout of such supra-character
text content. (This is comparable to the difference between
applying a non-spacing underline as a diacritic to a single
letter, versus applying an underline style to a word.)
In response to Spackman's discussion of the mathematical aspects
of non-spacing mark handling in Unicode: I agree that specifying
different numbers of "arguments" for the non-spacing marks
causes trouble for processing. This was the basic argument
(stated differently) that led to a reexamination and pulling
of "double diacritics" from Unicode for now. As it stands,
a "dumb" algorithm *can* parse Unicode for non-spacing marks,
even without tables, for the Greek family of scripts, at least,
since all non-spacing marks are confined to a small series of
ranges within Unicode. (The issue for the Indian family of
scripts is more complicated--but then, no "dumb" algorithm is
going to be able to do correct parsing of Devanagari, anyway.)
Moving to coding model #3 above for the double diacritics should
leave the parsing algorithms unaffected for the Greek family of
scripts, since all non-spacing marks, including the "double diacritics"
would have only one argument.
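
The "dumb" range-check parse described above might look something
like the following. This is a minimal sketch, assuming a single block
of non-spacing marks at U+0300 to U+036F (the layout of later Unicode
versions; the ranges in the 1991 text may differ) and assuming U+0361
as a double tie. The names `MARK_RANGES`, `is_nonspacing` and
`clusters` are hypothetical.

```python
# Assumed range of generic non-spacing marks (from later Unicode versions).
MARK_RANGES = [(0x0300, 0x036F)]


def is_nonspacing(ch):
    """Range check only; no per-character property tables."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MARK_RANGES)


def clusters(text):
    """Group each base character with the marks that follow it."""
    out = []
    for ch in text:
        if out and is_nonspacing(ch):
            out[-1] += ch   # a mark attaches to the preceding base
        else:
            out.append(ch)  # a new base (or a stray leading mark)
    return out


single = clusters("e\u0301a")   # acute accent on "e", then plain "a"
infix = clusters("k\u0361p")    # double tie stored between "k" and "p"
postfix = clusters("kp\u0361")  # double tie stored after both letters

print(len(single), len(infix), len(postfix))
```

With the infix ordering the tie stays attached to "k", exactly as a
single diacritic would, and the loop needs no change. With the postfix
ordering the same loop attaches the tie to "p" alone, so a parser
would need a special case to look back over two letters.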
On other issues raised by Spackman:
It has already been noted that mathematics uses type style (not font)
variation as a meaningful semantic component. Unicode encoded
some of the widely used style variants for particular letters
as separate characters (e.g. black-letter I and R for imaginary and
real). It clearly didn't make sense to encode style variants
for every possible combination. (This is where 10646 goes wrong,
for example, in encoding italic A-Z, a-z, 0-9 and underlined A-Z,
a-z, 0-9 for APL as distinct characters.) However, when
a mathematician starts using typeface or style as a
productive semantic component, it is also not clear
that that should be encoded *as a character* in plain text. This
is one of those fuzzy edges which make character encoding difficult.
Functionally the style shift may be equivalent to addition of
a diacritic to a letter, but encoding it the same way in the text
store may cause more problems than it solves. If we add
<black-letter> or <italic style> as characters to support
such mathematical usage as semantic primitives, what is to prevent
such characters from appearing in text in non-mathematical usage,
resulting effectively in in-line character encoding of face and
style information? That is just an invitation to open an enormous
Pandora's Box of problems in plain text.
Language identification clearly does *not* belong in a character
encoding. However, I do agree with Spackman that it would be
a good idea to create a standard scheme for identification of
language. The lack of one is among the things that drive people
to attempt bizarre things in character encodings.
Also, language identification as currently implemented at the
systems level in computers tends to get muddled with support
for "country" standards for display of dates, times, and numbers.
Right now every system vendor has its own idiosyncratic system
of identifying languages (usually a small list of the commercially
important European and Asian languages), and these differ from
the idiosyncratic system of identifying languages used by each
application vendor. And none of that much helps the linguist
who wants a standard language identifier for tagging text
for communication and interchange.
--Ken Whistler