Summary Details


Query:   Re Linguist 13.591, Unicode & Tones
Author:  Musgrave, S
Linguistic LingField(s):   Computational Linguistics
Text/Corpus Linguistics

Summary:   I recently posted the following query to the list:

In developing a typological database which will include text data from
numerous languages, we have encountered a problem with the
representation of tone using Unicode fonts (we are using Lucida Sans
Unicode in our application). The Unicode standard includes two
diacritics which can be used to represent contour tones, those usually
used for HL and LH contours. But many languages have more contour
tones than these two: for example, Ngiti has three tone levels and all
combinations of levels allowed in one contour tone: HM, HL, LH, LM,
MH, ML. In principle it should be possible to combine more than one
diacritic with a text character in a Unicode font, and therefore (if
the font in question includes the full diacritic set) it should be
possible to provide diacritics for all contour tones. However, our
attempts suggest that this method is not workable because the
positioning of diacritics cannot be controlled finely enough. That is,
the various diacritics tend to be positioned on top of one another,
rather than beside each other. Our first question then is:

1) has anyone else had more success in producing
diacritics for contour tones using the Unicode standard, and if so, what
technique was used?

If no satisfactory answers to this question emerge, we intend to
explore the possibility of creating a set of contour tone diacritics
for inclusion in Unicode, either as a part of the user-defined area
which the standard makes available, or (preferably) as a part of the
defined standard encoding. To this end, we also seek answers to a
second question:

2) what range of contour tones have been reported for the
languages of the world?


Thanks to the following people for their responses to my
query:

Deborah Anderson
Chuck Bigelow
Peter Constable
Andrew Cunningham
Tom Emerson
John Koontz
John Kovarik
Elizabeth Pyatt
Cory Sheedy
Ken Whistler
Moira Yip
Michael

The first point to emerge from the responses was that the Unicode
standard does not specify how different characters should combine;
this problem must be handled by the software that renders the
character set. John Koontz noted that the problem is not limited to
tones: it also arises for linguists dealing with, for example, the
diacritic for nasality, especially when that has to combine with
another diacritic. Various resources for investigating these issues
were suggested, including SIL's Graphite font rendering technology
and the Unicode discussion list
(http://www.unicode.org/unicode/consortium/distlist.html). Andrew
Cunningham noted that it should be possible to define OpenType fonts
which would handle diacritic placement, but that this was currently
problematic for Windows users as Uniscribe (the Windows Unicode Script
Processor) treats Latin script as a simple script which does not
require complex rendering. He reports that Microsoft are addressing
the problem. I do not know whether similar considerations apply to the
IPA section of the Unicode standard.
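
As a rough illustration of the point that the standard defines the
encoding but not the drawing, here is a minimal Python sketch of
stacking two combining diacritics on one vowel. It is not taken from
any of the responses; the helper name with_marks and the choice of
acute for high and grave for low are assumptions made for the example.

```python
import unicodedata

# Assumed convention for this example only: acute = high, grave = low.
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def with_marks(base: str, *marks: str) -> str:
    """Append combining marks to a base character, in the order given."""
    return base + "".join(marks)


# Intended as a high-low contour on "a".
falling = with_marks("a", ACUTE, GRAVE)

# The encoded sequence is unambiguous: one base letter, two marks.
print([unicodedata.name(ch) for ch in falling])
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT']

# NFC normalization composes a + acute into the precomposed U+00E1, but
# the grave remains a separate combining character, because no
# precomposed "a with acute and grave" exists.
composed = unicodedata.normalize("NFC", falling)
print(len(falling), len(composed))  # 3 2
```

Nothing in the string says whether the second mark is drawn above or
beside the first; that decision belongs to the font and the rendering
software, which is where the positioning problem described in the
query arises.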

Peter Constable noted that the 1999 IPA handbook lists only five contour
tone diacritics, of which two are already supported in Unicode and the
other three cannot be generated as combinations of Unicode
characters. He suggests that diacritics are inherently limited as a
means of representing tone, and that this accounts for the meagre
repertoire of symbols. Other responses described the range of tone
possibilities attested: up to six levels (reported for Chori of Nigeria
according to Dave Odden), plus the possibility of downstep and upstep,
and "just about any combination of two of these on one vowel, plus
perhaps three of them" (Moira Yip) or even "4-element contours to
account for Lomongo" (Dave Odden). This range of possibilities argues
strongly for using a system of numbers or letters annotating vowels to
indicate tone. In their replies to my follow-up questions, Peter
Constable and Ken Whistler told me what I should already have known:
that superscript numerals 0-9 are separately encoded in the Unicode
standard. This provides a solution to our original problem, which was
how to represent tones in text fields in an Access database, where
only minimal formatting is possible. I have now experimented, and I
can report that even in such an environment the superscripts are
rendered perfectly when you use a Unicode font. This is an immediate
solution to our problem, for which I am grateful, but, as both Dave
Odden and Moira Yip pointed out to me, there is no consensus in the
linguistics community as to whether 1 should indicate the lowest tone
or the highest tone.
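
The superscript approach can be sketched in a few lines of Python.
This is an illustration only, not the code used for the database; the
function name tone_superscript and the sample syllable and tone
numbers are hypothetical, and whether 1 or 5 marks the highest tone is
left to whatever convention a project adopts.

```python
# Superscript digits sit in two Unicode blocks: 1, 2 and 3 are in
# Latin-1 Supplement (U+00B9, U+00B2, U+00B3); 0 and 4-9 are in
# Superscripts and Subscripts (U+2070, U+2074-U+2079).
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def tone_superscript(syllable: str, tone: str) -> str:
    """Append a tone, given as ASCII digits, as superscript numerals."""
    return syllable + tone.translate(SUPERSCRIPTS)


# Hypothetical example: a contour written as the digit sequence "51".
print(tone_superscript("ma", "51"))  # prints ma⁵¹
```

Because each superscript numeral is an ordinary character rather than
a formatting attribute, the result can be stored in a plain text
field, and a contour of any length is just a longer digit sequence.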

I can report two additional points which emerged. Chuck Bigelow and
Kris Holmes are currently working on a revision of the Lucida Sans
Unicode font as a result of their discussions with linguists over
recent years. This will include some non-Unicode character-diacritic
combinations used in the orthographies of Native American
languages. In addition, a proposal is under way to include in the
Unicode standard a large number of characters needed for the
Finno-Ugric Phonetic Alphabet, among them some contour tone diacritics.

If any reader would like more detailed information on any of the above
points, please feel free to contact me and I will try to help, if only
by putting you in touch with the person whose input I have reported
here.

Once again, thank you to those who responded.

Simon Musgrave
Spinoza Program Lexicon and Syntax (SPLS)
http://www.let.leidenuniv.nl/ulcl/faculty/musgrave (my pages)
http://www.let.leidenuniv.nl/spls (project pages)

Mail address:
ULCL/Spinoza, Leiden University,
P.O. Box 9515, 2300 RA The Netherlands

LL Issue: 13.681
Date Posted: 13-Mar-2002

