Query Details


Query Subject:   Pashto in Unicode
Author:   Ron Artstein

Linguistic Field(s):  Computational Linguistics
Subject Language(s):  Pashto, Central


Query:   Hi,

I'm interested in knowing if there is a standard way to encode the various
Pashto y-characters in Unicode, and if so, what it is. This question is a
bit more complicated than it sounds, so here's the background.

Pashto is written using a derivative of the Arabic script. The Arabic
language uses a single character for both /j/ and /i:/ sounds. Like many
Arabic characters, this one is composed of a base form (which changes shape
based on its position in a word) and dots (in this case, two dots below the
base form). In most of the Arabic-speaking world the dots are present with
both the medial and final form, though in Egypt (and possibly other places)
the convention is to have two dots on the medial form but leave them off
the final form. The standard arrangement of the two dots is horizontal, but
they can be placed vertically or diagonally with no change in meaning.
Arabic has a separate character derived from an etymological /j/ with
phonetic value /a/ which is written with the same base form but no dots; it
only ever appears at the end of a word.

Persian also uses a single character for /j/ and /i:/, with the convention
of two dots on the medial form, no dots on the final form (same as in Egypt).

The two conventions for the /j/-/i:/ character were given distinct code
points in Unicode despite the fact that they do not contrast; documentation
is scarce, but presumably this was done in order to allow writing both
Arabic and Persian in the same document. Therefore, Unicode has the
following code points:

U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
U+0649 no dots medially or finally (/a/ in Arabic, etymological /j/)
U+06D0 two dots medially and finally in vertical arrangement (Pashto /e/,
see below)
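
As a rough illustration, the list above can be written down as data. The sketch below (Python, using only the standard unicodedata module) records the dot pattern of each code point as described in this message, not as taken from any font or from the Unicode charts, and then groups the code points that would look the same in a given position. The names DOTS, label and lookalikes are made up for this example.

```python
import unicodedata

# Dot pattern per code point, transcribed from the list above:
# (dots on the medial form, dots on the final form).
# This reflects the description in this message, not font data.
DOTS = {
    "\u064A": ("horizontal", "horizontal"),
    "\u06CC": ("horizontal", "none"),
    "\u0649": ("none", "none"),
    "\u06D0": ("vertical", "vertical"),
}


def label(ch):
    """Return the code point in U+XXXX form with its official Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def lookalikes(position):
    """Group code points that share a dot pattern in the given position."""
    if position not in ("medial", "final"):
        raise ValueError("position must be 'medial' or 'final'")
    index = 0 if position == "medial" else 1
    groups = {}
    for ch, dots in DOTS.items():
        groups.setdefault(dots[index], []).append(f"U+{ord(ch):04X}")
    # Keep only patterns shared by more than one code point.
    return {pattern: cps for pattern, cps in groups.items() if len(cps) > 1}


for ch in DOTS:
    print(label(ch), DOTS[ch])
print("medial:", lookalikes("medial"))
print("final:", lookalikes("final"))
```

Under these assumptions, U+064A and U+06CC fall into one group medially, and U+06CC and U+0649 fall into one group finally.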

As it so happens, there is much confusion in how these characters are used
in actual electronic documents, which is not surprising given that U+06CC
looks like U+064A in medial position but like U+0649 in final position.
There is an excellent article by Jonathan Kew that sorts out what this
means for various languages that use derivatives of the Arabic script.

http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf

Unfortunately, this article does not discuss Pashto. I have little
knowledge of the language, but here's what I managed to understand (and
please correct me if I'm wrong).

First of all, Pashto makes a distinction between a character with two dots
arranged horizontally, representing /j/ or /i:/ as in Arabic and Persian,
and a character with two dots arranged vertically, representing the sound
/e/. I have very little access to Pashto documents from before the computer
age, but from what little I saw, my impression is that the /j/-/i:/
character used the same convention as in Persian, of two dots in the medial
form and none on the final form. I don't have access to documents that
would allow me to determine whether or not the /e/ character traditionally
had dots on its final form.

With the advent of computer typesetting, a new convention appears to have
arisen, which as far as I can tell is unique to Pashto in that it
distinguishes between /j/ and /i:/ (though only in word-final position):

/j/ is written with two dots medially, none finally
/i:/ is written with two dots both medially and finally
/e/ is written with two dots in vertical arrangement, both medially and finally

Which brings me to my original question, of how to represent Pashto in
Unicode. The linguist in me notices a correspondence between sounds and
Unicode code points (which, given the history I have just described, is
most certainly accidental):

/j/ corresponds to U+06CC
/i:/ corresponds to U+064A
/e/ corresponds to U+06D0

However, the Wikipedia article on the Pashto alphabet
http://en.wikipedia.org/wiki/Pashto_alphabet
gives a different correspondence, based on visual appearance:

forms with dots: U+064A (standing for /i:/ and /j/ medially and /i:/ finally)
forms without dots: U+0649 (only /j/ in word-final position)
forms with vertical dots: U+06D0 (/e/ in any position)

And there is yet a third convention, which I encountered in an electronic
lexicon:

U+06CC: medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
U+064A: final form with dots (/i:/)
U+06D0: all forms with vertical dots (/e/)
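
One way to check which of these conventions a given electronic text follows is to count each y-character code point by position. The Python sketch below is only a heuristic under stated assumptions: "final" is taken to mean "not followed by another letter from the main Arabic block (U+0600 to U+06FF)", combining marks are skipped, and joiner characters such as ZWNJ are not handled specially. The function names and the sample string are invented for illustration; the sample is an arbitrary letter sequence, not a claim about real Pashto spelling.

```python
import unicodedata
from collections import Counter

# The four y-character code points discussed in this message.
YEHS = {"\u064A", "\u06CC", "\u0649", "\u06D0"}


def is_arabic_letter(ch):
    """True for letters (category Lo) in the main Arabic block."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"


def next_base_char(text, i):
    """Return the first character after index i that is not a combining mark."""
    for ch in text[i + 1:]:
        if unicodedata.category(ch) != "Mn":
            return ch
    return ""


def tally_yehs(text):
    """Count each y-character code point as 'final' or 'non-final'."""
    counts = Counter()
    for i, ch in enumerate(text):
        if ch in YEHS:
            nxt = next_base_char(text, i)
            position = "non-final" if nxt and is_arabic_letter(nxt) else "final"
            counts[(f"U+{ord(ch):04X}", position)] += 1
    return counts


# Arbitrary sample: one U+06CC inside a word, one at the end of a word.
sample = "\u06A9\u06CC\u0646 \u0633\u0693\u06CC"
for key, n in sorted(tally_yehs(sample).items()):
    print(key, n)
```

In a text that follows the Wikipedia correspondence, one would expect U+0649 only in final position and no U+06CC at all, whereas a text that follows the lexicon convention would show U+06CC in both positions and U+064A only finally.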

To wrap up, are my observations about the Pashto writing conventions
correct? And is there a standard for assigning the Pashto characters to
Unicode code points? Any resources or insight on Pashto writing conventions
would be appreciated.

-Ron.
LL Issue: 21.2971
Date posted: 18-Jul-2010


