LINGUIST List 2.344

Wednesday, 10 July 1991

Disc: Character Encoding

Editor for this issue: <>


Directory

  1. Ken Whistler, Unicode and character encoding issues

Message 1: Unicode and character encoding issues

Date: Fri, 5 Jul 91 15:29:09 PDT
From: Ken Whistler <whistler@zarasun.Metaphor.COM>
Subject: Unicode and character encoding issues

I am compiling my comments and responses on a number of items
posted on LINGUIST over the last several weeks. I'm sorry about
the delay--some of the items may be getting stale, but they
still deserve answers.

Re Linguist List, Vol. 2, No. 0295. Friday, 14 June 1991

Charles Bigelow's discussion of Diacritics, type design, and
font format is a wholly commendable introduction to modern
font technology. It clearly distinguishes the issue of character
encoding from the issues of type design and font technology.
As Bigelow points out, modern font technology already handles
productively applied marks (e.g. floating diacritics in the
Latin script) as separable pieces. But the details of that
are all hidden away in the font rendering engines and the fonts
themselves. A character encoding is *mapped* to glyphs in
the fonts, and there is no particular reason why that relationship
has to be a simple one.

The Unicode standard takes the position that non-spacing marks
(including Latin floating diacritics) must be encoded as
characters primarily for information sufficiency. Diacritics
are productively applied to Latin letters--this goes well beyond
the relatively small set of accented letters considered to be
part of the standard alphabets of most European languages. In
order to be able to encode the textual information of any usage
of the Latin script (including various kinds of ad hoc transcriptions
published over hundreds of years), we have to do one of the
following:
	A. Research every baseform + diacritic(s) combination
		ever used in the history of the Latin (and other)
		script(s) and assign a character encoding number
		to each combination;
	B. Collect the set of productively applied diacritic(s) and
		encode them as non-spacing marks which can compose
		with any baseform a user chooses for them.

I believe alternative (A) to be a hopeless approach. It is almost,
though not quite, as absurd as trying to catalog all the sentences
of a language. Position (B) builds a well-defined productive
rule into the encoding which allows for unambiguous character
encoding of any combination which has to be encoded.
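The productive rule of alternative (B) is what Unicode ultimately adopted, and its later normalization machinery makes the equivalence concrete. A minimal sketch in modern Python (the `unicodedata` module and the NFC/NFD normalization forms postdate this post, so this is an after-the-fact illustration, not anything available in 1991):

```python
import unicodedata

# Alternative (B): a base letter plus a combining (non-spacing) mark
# encodes any combination, whether or not a precomposed character exists.
composed = "a" + "\u0301"        # 'a' + COMBINING ACUTE ACCENT

# Where a precomposed character DOES exist, normalization maps between
# the two encodings without losing information:
assert unicodedata.normalize("NFC", composed) == "\u00e1"  # one code point
assert unicodedata.normalize("NFD", "\u00e1") == composed  # back to a + mark

# An ad hoc transcription combination with no precomposed form is still
# unambiguously encodable -- it simply stays decomposed:
exotic = "q" + "\u0301"          # no precomposed 'q with acute' exists
assert unicodedata.normalize("NFC", exotic) == exotic
```

The two encodings of <a-acute> are interchangeable by rule, which is exactly why alternative (A)'s exhaustive catalog is unnecessary.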

Also Re Linguist List, Vol. 2, No. 0295. Friday, 14 June 1991

Paul Hackney objected to the Unicode proposal for a fixed-width
character encoding:
>However one thing is certain: it is not compatible with
>our present system. Furthermore you will need twice as much space to store
>text using the current character set, and transmission times will be doubled.
>I have come across another system that does not suffer from these limitations,
>and is in my opinion a winner by lengths.

He then summarizes Joe Becker's discussion of the Xerox Coded
Character Set (XCCS), cited in the Scientific American article on
"Multilingual Word Processing," and praises the notion of a character
set which uses a signal to indicate that the next byte in a character
stream is an "alphabet" identifier which indicates a particular set
of 255 codes to be used (a byte at a time) in subsequent text. Extension
of the scheme would allow for two byte encodings (e.g. for Chinese) or
for three byte encodings if required. What Hackney has done is reinvent
the "compaction forms" of ISO DIS 10646, which has just such a signal
for supporting 1-, 2-, 3-, or 4-byte encoding forms, as well as
variable length. The irony of this argument is that in citing
Becker's article about software based on the XCCS, he seems unaware
that Becker is the principal architect of the Unicode fixed-length
character encoding architecture which Hackney objects to! It is
precisely Xerox's long experience with building multilingual
software on a mixed one- or two-byte character encoding which has
convinced most of the software implementors in the industry that
such encoding schemes are inefficient, buggy, and not even close
to being worth the savings in space for text storage which was
their main reason for being.

Keep in mind that the mixed-width character encodings were
designed in an era when 16K was a LOT of RAM and hard disks were
something that mainframe computers had. Nowadays a minimal
PC or workstation has at least 1 megabyte of RAM and a 40 megabyte
hard disk, and the standards are rapidly expanding to 8 megabytes
of RAM and even bigger hard disks supplemented by removable
cartridge hard disks and CD-ROMs. The growth in the size of
text by moving to a 16-bit fixed-width character encoding is
insignificant in this context--especially when you take into
account that interspersing formatting and graphics with the
text means that non-text quickly comes to dominate over the
text itself for most storage requirements.

When it comes to transmission times, once again the
technology is outracing any problem caused by doubling the
character size. Modems have moved from 110 baud to 300 to
1200 to 2400 to 9600 baud. Furthermore, if a lot of
text has to be transmitted across a phone line, it can
be easily compressed and decompressed at each end with
well-known algorithms already in use. This is a problem
which can be confined to a clear point of appropriate
application. It would be inappropriate, however, to force
the character encoding to carry around the compression--that
would simply make *all* software have to deal with compressed
or compacted text just to make characters go down a telephone
line faster. For all other network communication protocols,
the transmission rates are so high (multiple megabits per second)
that the one- versus two-byte character size is completely
irrelevant to any perceived behavior for textual transmission.
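The division of labor argued for above -- plain fixed-width text inside the software, compression confined to the transmission endpoint -- can be sketched with a well-known algorithm. Here zlib's DEFLATE stands in for whatever a given link might actually use; the specific algorithm is an assumption for illustration, not something the post names:

```python
import zlib

# Software manipulates plain fixed-width text; only the transmission
# endpoint sees compressed bytes, and the receiving end restores them.
text = "the quick brown fox jumps over the lazy dog " * 50
raw = text.encode("utf-16-le")   # 16-bit fixed-width form, 2 bytes/char

sent = zlib.compress(raw)        # applied only at the point of transmission
assert len(sent) < len(raw)      # repetitive text shrinks considerably
assert zlib.decompress(sent).decode("utf-16-le") == text
```

Everything between the two endpoints deals only in ordinary characters; nothing else in the software stack has to know compression ever happened.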

In summary, mixed-width character encoding has been discredited
among those who have actually tried to implement software which
uses it. And moving from a 8-bit to a 16-bit character encoding
is *not* going to suddenly make everything twice as inefficient
in dealing with text. (Among other things, note that most
systems and software that require moving to the new Unicode standard
because they need a large, multilingual character set are already
having to deal with Japanese--which means they already have
provisions for two-byte characters. They just do so now in a mixed-
width and *less* efficient way!)
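The trade-off can be made concrete with modern codecs as stand-ins: UTF-16LE for the 16-bit fixed-width design, Shift-JIS for a mixed one-/two-byte Japanese encoding. Neither is the XCCS or the ISO DIS 10646 compaction scheme the post discusses, so this is only an analogy:

```python
ascii_text = "character encoding"
japanese = "\u6587\u5b57"        # two Japanese characters ("moji")

# Fixed width: every character costs exactly 2 bytes, so the n-th
# character is found by simple arithmetic...
assert len(ascii_text.encode("utf-16-le")) == 2 * len(ascii_text)
assert len(japanese.encode("utf-16-le")) == 2 * len(japanese)

# ...at the cost of doubling ASCII-only text, the overhead the post
# argues is now insignificant. A mixed-width scheme saves that space,
# but character boundaries can only be found by scanning from the start:
assert len(ascii_text.encode("shift_jis")) == len(ascii_text)  # 1 byte each
assert len(japanese.encode("shift_jis")) == 2 * len(japanese)  # 2 bytes each
```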

Re Linguist List, Vol. 2, No. 0296. Friday, 14 June 1991

I wish to second David Birnbaum's comments about the importance of
the participation of linguists in the character encoding process.
It is extremely important that we get the international character
encoding to be complete and correct. This will allow for an
interchange standard for linguistic data, even in those instances
where individual linguists with their own independently developed
ad hoc encodings for PCs and Macs have no need to change the
software they are using locally. Note that the major current
initiative aimed at providing a universal text content interchange
format (the Text Encoding Initiative, sponsored by ACL, et al.)
is deficient precisely in the area of character encoding.
Getting involved to encourage the building of a Text Encoding
protocol on top of a *Universal* character set would greatly
improve the possibilities that linguists will be able to
exchange data freely in the future.

Re Linguist List: Vol-2-308. Thursday, 20 June 1991.

Herb Stahlke raised the issue of sorting. While it is true that
diacritics (among other things) have to be taken into account
to produce correct sortings in various languages, the way
sorts are implemented is generally by weighted multiple keys
generated from tables. The relation between an <a-acute> and
its appropriate sorting key for a particular language in a
particular orthography is independent of whether that <a-acute>
was encoded as a precomposed single character or a baseform
character plus non-spacing mark character. (One scheme for
encoding characters may be more efficiently mappable to multiple
keys than another scheme, but the multiple keys themselves
are not *determined* by the character encoding.) The table has to
be correct to produce the correct sort. When we want to
put control over the sort "in the hands of the user," what we
really mean is finding a way to allow a user to define their
own collation tables or to modify those provided by the system
or application software.
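The table-driven, weighted-multiple-key approach described above can be sketched as a toy two-level collation. The tables and weights here are hypothetical, invented for illustration (real collation tables are far richer), and the normalization step is later Unicode machinery; the point is only that the keys are identical whichever way <a-acute> was encoded:

```python
import unicodedata

# Toy collation: primary weight from the base letter, secondary weight
# from the diacritic, both read from tables rather than code-point order.
PRIMARY = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
SECONDARY = {"\u0301": 1, "\u0300": 2}   # plain < acute < grave (invented)

def sort_key(word):
    key = []
    # Decompose first, so precomposed and base+mark encodings converge.
    for ch in unicodedata.normalize("NFD", word):
        if ch in PRIMARY:
            key.append((PRIMARY[ch], 0))
        elif ch in SECONDARY and key:
            base, _ = key[-1]
            key[-1] = (base, SECONDARY[ch])
    return key

# Same key whether <a-acute> is one precomposed character or two:
assert sort_key("\u00e1b") == sort_key("a\u0301b")
# The ordering comes from the tables, not from the encoding:
assert sorted(["ab", "\u00e0b", "\u00e1b"], key=sort_key) == \
       ["ab", "\u00e1b", "\u00e0b"]
```

Putting the sort "in the hands of the user" then amounts to letting the user edit the PRIMARY and SECONDARY tables, not changing the character encoding.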

I see that David Birnbaum (in Linguist List: Vol-2-321. Tuesday,
25 June 1991) has also responded to this contribution, and I concur
completely with his analysis of the distinction between precomposed
vs. "separable" architecture in the character encoding versus any
issues of how to produce a correct sort.