LINGUIST List 2.321

Tuesday, 25 June 1991

Disc: Character Encoding



Directory

  1. Re: Character Encodings
  2. Character Encodings

Message 1: Re: Character Encodings

Date: Fri, 21 Jun 91 10:04 EDT
From: <DJBPITT%PITTVMS.BITNET@CUNYVM.CUNY.EDU>
Subject: Re: Character Encodings
In Linguist List Vol-2-308 (Thursday, 20 June 1991) Herb Stahlke
<00HFSTAHLKE%BSUVAX1.BITNET@UICVM.uic.edu> writes:
>An issue that seems not to have been addressed is sorting. A symbol code with
>multi-character representation for diacritics is almost a must for languages in
>which tone must be marked on each vowel for the orthography to be readable.
>(The Smalley-Gudschinsky sort of orthography, like Hmong, in which tones are
>marked by final consonants works only if there are no actual final consonants.)
>Whether a sort needs to be sensitive to tone or not depends on the reason for
>the sort, but that control needs to be in the hands of the user, not the code
>designer. Allowing for separate representation of diacritics would permit
>this.
I suggested in an earlier posting that there are two basic criteria for
resolving the "precomposed" vs "separable" approach to encoding productive
orthographic units with fixed semantic content like tone marks. I agree with
HS that separable diacritics are convenient or necessary. Nonetheless, I think
his discussion of sorting is misleading.
In my earlier posting, I suggested as self-evident that adequacy of
representation should be the primary criterion. HS's wording of the sorting
issue ("Allowing for separate representation of diacritics would permit this.")
might lead to a conclusion that this is a question of adequacy (i.e., it might
imply that "only allowing for separate representation of diacritics would
permit this."). I think this conclusion, which HS doesn't draw but which is
easily inferred from his wording, would be wrong.
One issue that has been very well documented on the ISO10646 ListServ is that
sorting cannot be done on the script level (Latin, Cyrillic, etc.), because
different languages that share a script may sort differently. This effectively
means that sorting cannot be implemented as bitwise operations on characters in
machine order. Additionally, the same ListServ has documented that sorting
cannot even be done on the language level, because different applications
within a single language may have different sorting algorithms.
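
The point that languages sharing a script may sort differently can be
illustrated with a small sketch. It assumes two simplified conventions that are
not taken from this posting: a German dictionary-style order that files
umlauted vowels with their base vowels, and a Swedish order that treats a-ring,
a-umlaut and o-umlaut as separate letters after z. The function names are
hypothetical, and real collation rules are considerably more detailed.

```python
import unicodedata

def german_dictionary_key(word):
    """Assumed convention: umlauted vowels sort with their base vowels."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)  # the original spelling only breaks ties

# Assumed convention: three extra letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    unknown = len(SWEDISH_ALPHABET)  # anything else sorts last
    return [SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET else unknown
            for c in composed]

words = ["zebra", "\u00e4ra", "arm"]
print(sorted(words, key=german_dictionary_key))  # ['ära', 'arm', 'zebra']
print(sorted(words, key=swedish_key))            # ['arm', 'zebra', 'ära']
```

The same three strings come out in two different orders, so no single machine
order of the characters can serve both conventions.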
Most frequently it is the precomposed camp that argues that sorting _requires_ a
precomposed architecture. Otherwise, accented and unaccented letters quickly
get out of sync; e.g., re'sume' is not the same length as resume and it becomes
more difficult to compare the two esses, etc. On the other hand, HS seems to
suggest that sorting that is sensitive to tone marks _requires_ a separable
system.
In fact, neither precomposed nor separable architecture forces or prohibits one
type of sorting or the other. Whichever architecture we select, we _can_ use
or ignore accent or tone mark information. There may be extremely serious
differences in implementation efficiency, differences that may ultimately play
a role in our decision (although they may not prove decisive, since they may be
balanced by equally serious differences in other areas of efficiency). But the
bottom line is that the ability to sort is not an adequacy issue.
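
A rough sketch of this claim, using Python's standard unicodedata module: the
same word is stored once with precomposed letters and once with separate
combining accents, and both yield identical sort keys whether the accents are
used or ignored. The helper names are hypothetical, and the normalization
forms used here come from later Unicode practice, not from this discussion.

```python
import unicodedata

def split_marks(text):
    """Return (base letters, accent marks with positions) for either storage form."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Record which base letter the mark belongs to.
            marks.append((len(base) - 1, ord(ch)))
        else:
            base.append(ch)
    return "".join(base), tuple(marks)

def key_ignoring_accents(text):
    return split_marks(text)[0]

def key_using_accents(text):
    # Base letters decide first; accents only break ties.
    return split_marks(text)

precomposed = "r\u00e9sum\u00e9"      # 6 code points
separable = "re\u0301sume\u0301"      # 8 code points

print(len(precomposed), len(separable))                                      # 6 8
print(key_ignoring_accents(precomposed) == key_ignoring_accents(separable))  # True
print(key_using_accents(precomposed) == key_using_accents(separable))        # True
```

The two stored forms differ in length, which is the "out of sync" problem, but
the preprocessing step removes that difference before any comparison is made.
What remains is a question of cost, not of capability.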
>As I understand it, Unicode will allow both, which will lead inevitably
>to problems of determining which representation is being used. A utility for
>converting from one representation to another would be useful.
There are several layers between the user and the character set. Input devices
may allow accent or tone marks to be input separately, but that doesn't require
that they be stored that way by the application that receives them. This issue
has also been discussed on the Unicode ListServ.
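
As an illustration of the layering, the sketch below lets a hypothetical input
layer deliver a letter followed by a separate accent, while the application
stores the text in whichever form it prefers. It relies on the normalization
forms (NFC and NFD) that Unicode later standardized; the function names are
assumptions made for the example.

```python
import unicodedata

def store_precomposed(typed):
    """Convert whatever the input layer delivered to precomposed form (NFC)."""
    return unicodedata.normalize("NFC", typed)

def store_separable(typed):
    """Convert whatever the input layer delivered to separable form (NFD)."""
    return unicodedata.normalize("NFD", typed)

typed = "e\u0301"  # the user keyed 'e', then a combining acute accent

print(store_precomposed(typed) == "\u00e9")                    # True
print(store_separable("\u00e9") == typed)                      # True
print(store_separable(store_precomposed(typed)) == typed)      # True
```

A utility of the kind HS asks for is then a conversion between the two forms,
independent of how the text was typed.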
--David

Message 2: Character Encodings

Date: Fri, 21 Jun 91 13:16:12 EDT
From: <macrakis@osf.org>
Subject: Character Encodings
First, let me apologize to Hackney for my testiness. I trust that
some of the more recent postings have filled him in and pointed him to
more complete sources on character encoding issues.
Stahlke brings up the issue of sorting. Sorting rules should
certainly be controlled by the user: it is now generally accepted by
the computing community that one cannot expect to treat natural
language data as uninterpreted bitstrings for sorting or many other
operations--some sort of preprocessing is inevitable, even for
English. Alain LaBonte/ has performed extensive analyses of sorting
rules in many languages, reported in part on the ISO10646 mailing
list. Some of the complications are summed up by the ordering used by
the best French dictionaries:
 cote < Cote < co^te < Co^te < cote/ < Cote/ < co^te/ < Co^te/
(May I suggest that those interested in the details of this analysis
contact LaBonte/ directly (ALB%SEASliverpool.ac.uk) rather than start
a discussion on this list?)
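
One way to reproduce the dictionary ordering quoted above is a three-level
key: base letters first, then accents compared from the end of the word, then
case with lowercase first. This is a sketch inferred from the eight-word
example only, not a statement of LaBonte/'s analysis; the function name
french_key is hypothetical, and the relative weight of different accents on
the same letter is an arbitrary assumption here.

```python
import unicodedata

def french_key(word):
    """Three-level key: letters, then accents read backwards, then case."""
    letters, accents, cases = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            # Attach the mark to the preceding base letter.
            accents[-1] += (ord(ch),)
        else:
            letters.append(ch.lower())
            accents.append(())          # unaccented sorts before accented
            cases.append(ch.isupper())  # False (lowercase) sorts first
    return (letters, accents[::-1], cases)

words = ["Côté", "coté", "Cote", "côte", "cote", "Coté", "côté", "Côte"]
print(sorted(words, key=french_key))
# ['cote', 'Cote', 'côte', 'Côte', 'coté', 'Coté', 'côté', 'Côté']
```

Reading the accents from the end of the word is what places co^te before
cote/, which a left-to-right comparison of accents would reverse.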
Thus, the particular character encoding used will have little effect
on the ease of sorting, since _no_ encoding will work without
preprocessing. Of course, existing programs using naive algorithms
which ignore language-specific rules cannot hope to be correct in
general, since rules differ in different languages (consider `ch',
which is treated as a single letter for sorting in Spanish).
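
The `ch' case can be sketched as a small preprocessing step that splits each
word into sorting units before comparison. The alphabet list and the function
name below are assumptions for illustration, and only `ch' is handled; a full
treatment of Spanish would need further rules.

```python
import re

# Assumed ordering: the 26 basic letters with 'ch' inserted after 'c'.
ALPHABET = "a b c ch d e f g h i j k l m n o p q r s t u v w x y z".split()
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def spanish_key(word):
    """Split into sorting units, taking 'ch' as one unit, and rank each unit."""
    units = re.findall(r"ch|.", word.lower())
    return [RANK.get(u, len(RANK)) for u in units]  # unknown units sort last

words = ["chico", "cuna", "dama", "cosa"]
print(sorted(words))                   # ['chico', 'cosa', 'cuna', 'dama']
print(sorted(words, key=spanish_key))  # ['cosa', 'cuna', 'chico', 'dama']
```

The naive sort files chico under c, ahead of cosa; with `ch' as its own
letter it falls after every word beginning with a plain c.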
	-s