LINGUIST List 13.681

Wed Mar 13 2002

Sum: Unicode & Tones

Editor for this issue: Karen Milligan


  1. Musgrave, S., Re Linguist 13.591, Unicode & Tones

Message 1: Re Linguist 13.591, Unicode & Tones

Date: Tue, 12 Mar 2002 15:45:45 +0100
From: Musgrave, S.
Subject: Re Linguist 13.591, Unicode & Tones

I recently posted the following query to the list:

In developing a typological database which will include text data from
numerous languages, we have encountered a problem with the
representation of tone using Unicode fonts (we are using Lucida Sans
Unicode in our application). The Unicode standard includes two
diacritics which can be used to represent contour tones, those usually
used for HL and LH contours. But many languages have more contour
tones than these two: for example, Ngiti has three tone levels and all
combinations of levels allowed in one contour tone: HM, HL, LH, LM,
MH, ML. In principle it should be possible to combine more than one
diacritic with a text character in a Unicode font, and therefore (if
the font in question includes the full diacritic set) it should be
possible to provide diacritics for all contour tones. However, our
attempts suggest that this method is not workable because the
positioning of diacritics cannot be controlled finely enough. That is,
the various diacritics tend to be positioned on top of one another,
rather than beside each other. Our first question then is:

1) has anyone else had more success in producing diacritics for
contour tones using the Unicode standard, and if so, what technique
was used?

If no satisfactory answers to this question emerge, we intend to
explore the possibility of creating a set of contour tone diacritics
for inclusion in Unicode, either as a part of the user-defined area
which the standard makes available, or (preferably) as a part of the
defined standard encoding. To this end, we also seek answers to a
second question:

2) what range of contour tones have been reported for the languages
of the world?

Thanks to the following people for their responses to my query:

Deborah Anderson
Chuck Bigelow
Peter Constable
Andrew Cunningham
Tom Emerson
John Koontz
John Kovarik
Elizabeth Pyatt
Cory Sheedy
Ken Whistler
Moira Yip

The first point to emerge from the responses was that the Unicode
standard does not specify how different characters should combine;
this problem must be handled by the software that renders the
character set. John Koontz noted that the problem is not limited to
handling tones, but also arises for linguists in dealing with, for
example, the diacritic for nasality, especially if that has to combine
with some other diacritic. Various resources for investigating
these issues were suggested, including SIL's Graphite font rendering
technology and the Unicode discussion list. Andrew
Cunningham noted that it should be possible to define OpenType fonts
which would handle diacritic placement, but that this was currently
problematic for Windows users as Uniscribe (the Windows Unicode Script
Processor) treats Latin script as a simple script which does not
require complex rendering. He reports that Microsoft are addressing
the problem. I do not know whether similar considerations apply to the
IPA section of the Unicode standard.
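
The stacking behaviour can be inspected at the encoding level. What
follows is a minimal Python sketch using the standard `unicodedata`
module. The choice of macron for a mid tone and the helper name
`describe` are illustrative assumptions, not something taken from the
responses. It shows that the tone marks share one combining class, so
the order in which they are typed is preserved in the data, while their
visual placement is left to the renderer.

```python
import unicodedata

ACUTE = "\u0301"   # COMBINING ACUTE ACCENT, commonly used for high tone
MACRON = "\u0304"  # COMBINING MACRON, assumed here to mark mid tone

def describe(text):
    """Return (code point, character name, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# Hypothetical encodings of HM and MH contours on the vowel "a".
hm = "a" + ACUTE + MACRON
mh = "a" + MACRON + ACUTE

for code_point, name, combining_class in describe(hm):
    print(code_point, name, combining_class)

# Marks with the same combining class are not reordered by normalization,
# so the two sequences stay distinct as data even if a font draws the
# marks on top of one another.
print(unicodedata.normalize("NFD", hm) != unicodedata.normalize("NFD", mh))
```

Both marks report combining class 230 (above), which is why a renderer
without finer positioning rules tends to stack them vertically or
overprint them.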

Peter Constable noted that the 1999 IPA handbook lists only 5 contour
tone diacritics, of which 2 are already supported in Unicode and the
other three cannot be generated as combinations of Unicode
characters. He suggests that diacritics are inherently limited as a
means of representing tone, and that this accounts for the meagre
repertoire of symbols. Other responses described the range of tone
possibilities attested: up to 6 levels (reported for Chori of Nigeria
according to Dave Odden) plus the possibility of downstep and upstep,
and "just about any combination of two of these on one vowel, plus
perhaps three of them" (Moira Yip) or even "4-element contours to
account for Lomongo" (Dave Odden). This range of possibilities argues
strongly for using a system of numbers or letters annotating vowels to
indicate tone. In responses to follow-ups from me, I learned from
Peter Constable and Ken Whistler what I should already have known:
that superscript numerals 0-9 are separately encoded in the Unicode
standard. This provides a solution to our original problem, which was
how to represent tones in text fields in an Access database, where
only minimal formatting is possible. I have now experimented, and I
can report that even in such an environment the superscripts are
rendered perfectly when you use a Unicode font. This is an immediate
solution to our problem, for which I am grateful, but, as both Dave
Odden and Moira Yip pointed out to me, there is no consensus in the
linguistics community as to whether 1 should indicate the lowest tone
or the highest tone.
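
The superscript approach can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function names
`annotate`, `flip_convention` and `two_level_contours` are hypothetical,
and tone levels are assumed to be written as digit strings such as "31".
The superscript code points themselves are those defined in Unicode.

```python
from itertools import permutations

ASCII_DIGITS = "0123456789"
# Superscript 1, 2, 3 are U+00B9, U+00B2, U+00B3 (Latin-1 Supplement);
# superscript 0 and 4-9 are U+2070 and U+2074..U+2079.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
TO_SUPERSCRIPT = str.maketrans(ASCII_DIGITS, SUPERSCRIPT_DIGITS)

def annotate(syllable, levels):
    """Append tone levels (a digit string such as "31") as superscripts."""
    if not levels or any(ch not in ASCII_DIGITS for ch in levels):
        raise ValueError("levels must be a non-empty string of ASCII digits")
    return syllable + levels.translate(TO_SUPERSCRIPT)

def flip_convention(levels, n_levels):
    """Convert between '1 = lowest' and '1 = highest' for n_levels tones."""
    if not 1 <= n_levels <= 9:
        raise ValueError("n_levels must be between 1 and 9")
    values = [int(ch) for ch in levels]
    if any(not 1 <= v <= n_levels for v in values):
        raise ValueError("each level must lie between 1 and n_levels")
    return "".join(str(n_levels + 1 - v) for v in values)

def two_level_contours(n_levels):
    """List every contour made of two different levels, e.g. "13"."""
    digits = ASCII_DIGITS[1:n_levels + 1]
    return ["".join(pair) for pair in permutations(digits, 2)]

print(annotate("ma", "31"))             # syllable followed by superscript 3 and 1
print(flip_convention("31", 3))         # the same contour under the other convention
print(len(two_level_contours(3)))       # number of two-level contours for 3 levels
```

Because the two numbering conventions are mirror images, any database
using this scheme would need to record which one applies; the sketch
leaves that as a caller decision rather than guessing.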

I can report two additional points which emerged. Chuck Bigelow and
Kris Holmes are currently working on a revision of the Lucida Sans
Unicode font as a result of their discussions with linguists over
recent years. This will include some non-Unicode character-diacritic
combinations used in the orthographies of Native American
languages. And a proposal is under way to include in the Unicode
standard a large number of characters needed for the Finno-Ugric
Phonetic alphabet. This includes some contour tone diacritics.

If any reader would like more detailed information on any of the above
points, please feel free to contact me and I will try to help, if only
by putting you in touch with the person whose input I have reported.

Once again, thank you to those who responded.

Simon Musgrave
Spinoza Program Lexicon and Syntax (SPLS)

Mail address: 
ULCL/Spinoza, Leiden University, 
P.O. Box 9515, 2300 RA The Netherlands