Summary Details
| Query: | Re Linguist 13.591, Unicode & Tones |
| Author: | Musgrave, S |
| Linguistic Field(s): | Computational Linguistics; Text/Corpus Linguistics |
| Summary: |
I recently posted the following query to the list:

In developing a typological database which will include text data from numerous languages, we have encountered a problem with the representation of tone using Unicode fonts (we are using Lucida Sans Unicode in our application). The Unicode standard includes two diacritics which can be used to represent contour tones, those usually used for HL and LH contours. But many languages have more contour tones than these two: for example, Ngiti has three tone levels, and all combinations of levels are allowed in one contour tone: HM, HL, LH, LM, MH, ML.

In principle it should be possible to combine more than one diacritic with a text character in a Unicode font, and therefore (if the font in question includes the full diacritic set) it should be possible to provide diacritics for all contour tones. However, our attempts suggest that this method is not workable because the positioning of diacritics cannot be controlled finely enough. That is, the various diacritics tend to be positioned on top of one another, rather than beside each other. Our first question then is:

1) Has anyone else had more success in producing diacritics for contour tones using the Unicode standard, and if so, what technique was used?

If no satisfactory answers to this question emerge, we intend to explore the possibility of creating a set of contour tone diacritics for inclusion in Unicode, either as a part of the user-defined area which the standard makes available, or (preferably) as a part of the defined standard encoding. To this end, we also seek answers to a second question:

2) What range of contour tones has been reported for the languages of the world?
Thanks to the following people for their responses to my query: Deborah Anderson, Chuck Bigelow, Peter Constable, Andrew Cunningham, Tom Emerson, John Koontz, John Kovarik, Elizabeth Pyatt, Cory Sheedy, Ken Whistler, Moira Yip, Michael.

The first point to emerge from the responses was that the Unicode standard does not specify how different characters should combine; this problem must be handled by the software that renders the character set. John Koontz noted that the problem is not limited to tones: it also arises for linguists dealing with, for example, the diacritic for nasality, especially when that has to combine with some other diacritic. Various resources for investigating these issues were suggested, including SIL's Graphite font rendering technology and the Unicode discussion list (http://www.unicode.org/unicode/consortium/distlist.html).

Andrew Cunningham noted that it should be possible to define OpenType fonts which would handle diacritic placement, but that this was currently problematic for Windows users, as Uniscribe (the Windows Unicode Script Processor) treats Latin script as a simple script which does not require complex rendering. He reports that Microsoft are addressing the problem. I do not know whether similar considerations apply to the IPA section of the Unicode standard.

Peter Constable noted that the 1999 IPA handbook lists only five contour tone diacritics, of which two are already supported in Unicode and the other three cannot be generated as combinations of Unicode characters. He suggests that diacritics are inherently limited as a means of representing tone, and that this accounts for the meagre repertoire of symbols.
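The point that the encoding stores only a sequence of marks, and leaves their placement to the renderer, can be illustrated with a minimal Python sketch. The mapping of H, M and L to the combining acute accent, macron and grave accent is an assumption made for illustration (it is one common convention, not something specified in the responses), and the helper names `contour` and `describe` are hypothetical.

```python
import unicodedata

# Assumed level-tone convention (illustrative only):
# H = COMBINING ACUTE ACCENT, M = COMBINING MACRON, L = COMBINING GRAVE ACCENT.
LEVEL_MARKS = {"H": "\u0301", "M": "\u0304", "L": "\u0300"}


def contour(base, levels):
    """Return the base letter followed by one combining mark per tone level."""
    return base + "".join(LEVEL_MARKS[level] for level in levels)


def describe(text):
    """Return (character name, canonical combining class) for each code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


if __name__ == "__main__":
    # The six Ngiti contours mentioned in the query.
    for levels in ("HM", "HL", "LH", "LM", "MH", "ML"):
        encoded = contour("a", levels)
        print(levels, [name for name, _ in describe(encoded)])
```

All three marks share the same canonical combining class (230, "above"), so the encoded text records only their order. Whether the two marks end up stacked, overprinted or set side by side depends on the font and the rendering software, which is consistent with the behaviour described in the query.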
Other responses described the range of tone possibilities attested: up to 6 levels (reported for Chori of Nigeria according to Dave Odden) plus the possibility of downstep and upstep, and "just about any combination of two of these on one vowel, plus perhaps three of them" (Moira Yip) or even "4-element contours to account for Lomongo" (Dave Odden). This range of possibilities argues strongly for using a system of numbers or letters annotating vowels to indicate tone.

In responses to follow-ups from me, I learned from Peter Constable and Ken Whistler what I should already have known: that superscript numerals 0-9 are separately encoded in the Unicode standard. This provides a solution to our original problem, which was how to represent tones in text fields in an Access database, where only minimal formatting is possible. I have now experimented, and I can report that even in such an environment the superscripts are rendered perfectly when you use a Unicode font. This is an immediate solution to our problem, for which I am grateful, but, as both Dave Odden and Moira Yip pointed out to me, there is no consensus in the linguistics community as to whether 1 should indicate the lowest tone or the highest tone.

I can report two additional points which emerged. Chuck Bigelow and Kris Holmes are currently working on a revision of the Lucida Sans Unicode font as a result of their discussions with linguists over recent years. This will include some non-Unicode character-diacritic combinations used in the orthographies of Native American languages. And a proposal is under way to include in the Unicode standard a large number of characters needed for the Finno-Ugric Phonetic alphabet. This includes some contour tone diacritics.

If any reader would like more detailed information on any of the above points, please feel free to contact me and I will try to help, if only by putting you in touch with the person whose input I have reported here.
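As a rough illustration of the superscript-numeral notation described above, the following Python sketch converts ordinary digits to the separately encoded superscript characters, so that the result can be stored in a plain text field. The function name `with_tone`, the syllable "ma" and the contour "35" are hypothetical, and the sketch takes no position on whether 1 is the lowest or the highest tone.

```python
import unicodedata

# Superscript digits are encoded individually: 1, 2 and 3 are in the Latin-1
# Supplement block (U+00B9, U+00B2, U+00B3); 0 and 4-9 are in the
# Superscripts and Subscripts block (U+2070, U+2074-U+2079).
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def with_tone(syllable, tone_digits):
    """Append tone digits to a syllable as superscript characters."""
    if not tone_digits or any(ch not in "0123456789" for ch in tone_digits):
        raise ValueError("tone_digits must be a non-empty string of ASCII digits")
    return syllable + tone_digits.translate(SUPERSCRIPT_DIGITS)


if __name__ == "__main__":
    # Hypothetical example: a syllable with a two-level contour written as 35.
    marked = with_tone("ma", "35")
    print(marked)
    print([unicodedata.name(ch) for ch in marked[2:]])
```

Because the superscripts are ordinary characters rather than formatting, the annotated string survives in any environment that stores Unicode text and uses a font containing these code points.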
Once again, thank you to those who responded.

Simon Musgrave
Spinoza Program Lexicon and Syntax (SPLS)
http://www.let.leidenuniv.nl/ulcl/faculty/musgrave (my pages)
http://www.let.leidenuniv.nl/spls (project pages)
Mail address: ULCL/Spinoza, Leiden University, P.O. Box 9515, 2300 RA The Netherlands |
| LL Issue: | 13.681 |
| Date Posted: | 13-Mar-2002 |