Editor for this issue: <>
In Linguist List Vol-2-308 (Thursday, 20 June 1991) Herb Stahlke <00HFSTAHLKE%BSUVAX1.BITNET@UICVM.uic.edu> writes:

>An issue that seems not to have been addressed is sorting. A symbol code with
>multi-character representation for diacritics is almost a must for languages in
>which tone must be marked on each vowel for the orthography to be readable.
>(The Smalley-Gudschinsky sort of orthography, like Hmong, in which tones are
>marked by final consonants works only if there are no actual final consonants.)
>Whether a sort needs to be sensitive to tone or not depends on the reason for
>the sort, but that control needs to be in the hands of the user, not the code
>designer. Allowing for separate representation of diacritics would permit
>this.

I suggested in an earlier posting that there are two basic criteria for resolving the "precomposed" vs "separable" approach to encoding productive orthographic units with fixed semantic content like tone marks. I agree with HS that separable diacritics are convenient or necessary. Nonetheless, I think his discussion of sorting is misleading.

In my earlier posting, I suggested as self-evident that adequacy of representation should be the primary criterion. HS's wording of the sorting issue ("Allowing for separate representation of diacritics would permit this.") might lead to a conclusion that this is a question of adequacy (i.e., it might imply that "only allowing for separate representation of diacritics would permit this."). I think this conclusion, which HS doesn't draw but which is easily inferred from his wording, would be wrong.

One issue that has been very well documented on the ISO10646 ListServ is that sorting cannot be done on the script level (Latin, Cyrillic, etc.), because different languages that share a script may sort differently.
This effectively means that sorting cannot be implemented as bitwise operations on characters in machine order. Additionally, the same ListServ has documented that sorting cannot even be done on the language level, because different applications within a single language may have different sorting algorithms.

Most frequently it is the precomposed camp that argues that sorting _requires_ a precomposed architecture. Otherwise, accented and unaccented letters quickly get out of sync; e.g., re'sume' is not the same length as resume and it becomes more difficult to compare the two esses, etc. On the other hand, HS seems to suggest that sorting that is sensitive to tone marks _requires_ a separable system.

In fact, neither precomposed nor separable architecture forces or prohibits one type of sorting or the other. Whichever architecture we select, we _can_ use or ignore accent or tone mark information. There may be extremely serious differences in implementation efficiency, differences that may ultimately play a role in our decision (although they may not prove decisive, since they may be balanced by equally serious differences in other areas of efficiency). But the bottom line is that the ability to sort is not an adequacy issue.

>As I understand it, Unicode will allow both, which will lead inevitably
>to problems of determining which representation is being used. A utility for
>converting from one representation to another would be useful.

There are several layers between the user and the character set. Input devices may allow accent or tone marks to be input separately, but that doesn't require that they be stored that way by the application that receives them. This issue has also been discussed on the Unicode ListServ.

--David
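The point that either architecture can use or ignore accent information can be illustrated with a minimal Python sketch. It relies on the standard `unicodedata` module, whose NFC and NFD normalization forms stand in here for the "precomposed" and "separable" representations; the helper names `strip_marks` and `sort_key` are illustrative assumptions, not anything proposed in the postings above.

```python
import unicodedata

def strip_marks(text):
    """Return the text with all combining marks removed (base letters only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_key(text, use_marks=True):
    """Sort key that gives the same result for either representation.

    Hypothetical helper: base letters are compared first; if use_marks
    is True, the decomposed form breaks ties between words that differ
    only in their marks.
    """
    base = strip_marks(text).casefold()
    if not use_marks:
        return (base,)
    return (base, unicodedata.normalize("NFD", text))

# The same word in both representations (re'sume' in the notation above).
precomposed = unicodedata.normalize("NFC", "r\u00e9sum\u00e9")  # 6 code points
separable = unicodedata.normalize("NFD", "r\u00e9sum\u00e9")    # 8 code points

# Identical keys despite the different lengths of the stored strings.
same_key = sort_key(precomposed) == sort_key(separable)

# Accent-insensitive comparison treats re'sume' and resume as equal.
ignores_marks = (sort_key(precomposed, use_marks=False)
                 == sort_key("resume", use_marks=False))
```

The conversion between representations is done by `unicodedata.normalize`, which is the kind of utility the quoted passage asks for; whether the key is sensitive to marks is a choice made by the caller, not by the stored encoding.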
First, let me apologize to Hackney for my testiness. I trust that some of the more recent postings have filled him in and pointed him to more complete sources on character encoding issues.

Stahlke brings up the issue of sorting. Sorting rules should certainly be controlled by the user: it is now generally accepted by the computing community that one cannot expect to treat natural language data as uninterpreted bitstrings for sorting or many other operations--some sort of preprocessing is inevitable, even for English. Alain LaBonte/ has performed extensive analyses of sorting rules in many languages, reported in part on the ISO10646 mailing list. Some of the complications are summed up by the ordering used by the best French dictionaries:

    cote < Cote < co^te < Co^te < cote/ < Cote/ < co^te/ < Co^te/

(May I suggest that those interested in the details of this analysis contact LaBonte/ directly (ALB%SEAS@liverpool.ac.uk) rather than start a discussion on this list?)

Thus, the particular character encoding used will have little effect on the ease of sorting, since _no_ encoding will work without preprocessing. Of course, existing programs using naive algorithms which ignore language-specific rules cannot hope to be correct in general, since rules differ in different languages (consider `ch', which is treated as a single letter for sorting in Spanish).

-s
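As a rough illustration of the preprocessing the French ordering above seems to require, here is a small Python sketch. It assumes a three-level comparison: base letters first, then accents compared from the end of the word backwards, then lowercase before uppercase. That rule is inferred from the eight-word example alone and is not a statement of LaBonte/'s full analysis; the function name `french_key` is hypothetical.

```python
import unicodedata

def french_key(word):
    """Hypothetical three-level sort key inferred from the example ordering."""
    decomposed = unicodedata.normalize("NFD", word)
    letters = []  # level 1: base letters, case ignored
    marks = []    # level 2: combining marks attached to each base letter
    cases = []    # level 3: 0 for lowercase, 1 for uppercase
    for ch in decomposed:
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += (ord(ch),)
        else:
            letters.append(ch.lower())
            marks.append(())
            cases.append(1 if ch.isupper() else 0)
    # Accents are compared starting from the last letter of the word.
    return (tuple(letters), tuple(reversed(marks)), tuple(cases))

# cote, Cote, co^te, Co^te, cote/, Cote/, co^te/, Co^te/ in the notation above.
expected = ["cote", "Cote", "c\u00f4te", "C\u00f4te",
            "cot\u00e9", "Cot\u00e9", "c\u00f4t\u00e9", "C\u00f4t\u00e9"]

# Start from the reverse order and let the key restore the dictionary order.
ordered = sorted(reversed(expected), key=french_key)
```

A plain code point sort of the same words puts every capitalized form before every lowercase one, which is the sense in which no encoding works without preprocessing. A rule such as the Spanish `ch' would need a further step that groups letters into sorting units before any key is built.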