LINGUIST List 21.2971
Sun Jul 18 2010
Qs: Pashto in Unicode
Editor for this issue: Elyssa Winzeler
<elyssalinguistlist.org>
1. Ron Artstein, Pashto in Unicode
Message 1: Pashto in Unicode
Date: 15-Jul-2010
From: Ron Artstein <linguistartstein.org>
Subject: Pashto in Unicode
Hi,
I'm interested in knowing if there is a standard way to encode the various Pashto y-characters in Unicode, and if so, what it is. This question is a bit more complicated than it sounds, so here's the background.
Pashto is written using a derivative of the Arabic script. The Arabic language uses a single character for both /j/ and /i:/ sounds. Like many Arabic characters, this one is composed of a base form (which changes shape based on its position in a word) and dots (in this case, two dots below the base form). In most of the Arabic-speaking world the dots are present with both the medial and final form, though in Egypt (and possibly other places) the convention is to have two dots on the medial form but leave them off the final form. The standard arrangement of the two dots is horizontal, but they can be placed vertically or diagonally with no change in meaning. Arabic has a separate character derived from an etymological /j/ with phonetic value /a/ which is written with the same base form but no dots; it only ever appears at the end of a word.
Persian also uses a single character for /j/ and /i:/, with the convention of two dots on the medial form, no dots on the final form (same as in Egypt).
The two conventions for the /j/-/i:/ character were given distinct code points in Unicode despite the fact that they do not contrast; documentation is scarce, but presumably this was done in order to allow writing both Arabic and Persian in the same document. Therefore, Unicode has the following code points:
U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
U+0649 no dots medially or finally (/a/ in Arabic, etymological /j/)
U+06D0 two dots medially and finally in vertical arrangement (Pashto /e/, see below)
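For reference, the official Unicode names of these four code points can be looked up with Python's standard unicodedata module. This is only a small sketch; the function name y_char_names is made up for illustration.

```python
import unicodedata

# The four y-like code points listed above.
Y_CODE_POINTS = (0x064A, 0x06CC, 0x0649, 0x06D0)

def y_char_names():
    """Return a mapping such as 'U+064A' -> official Unicode character name."""
    return {f"U+{cp:04X}": unicodedata.name(chr(cp)) for cp in Y_CODE_POINTS}

if __name__ == "__main__":
    for label, name in y_char_names().items():
        print(label, name)
```

The names reflect the same history: U+06CC is called FARSI YEH and U+0649 is called ALEF MAKSURA, even though other languages use them as well.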
As it so happens, there is much confusion in how these characters are used in actual electronic documents, which is not surprising given that U+06CC looks like U+064A in medial position but like U+0649 in final position. There is an excellent article by Jonathan Kew that sorts out what this means for various languages that use derivatives of the Arabic script.
http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf
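One rough way to see how a given electronic document actually uses these characters is to count each code point separately in word-final and non-final position. The sketch below assumes that a y-character is word-final whenever the next character is not a letter or combining mark, which is a simplification; the function name tally_y_characters is hypothetical.

```python
from collections import Counter
import unicodedata

# The four y-like code points under discussion.
Y_CHARS = {"\u064A", "\u06CC", "\u0649", "\u06D0"}

def tally_y_characters(text):
    """Count y-characters by code point and by (assumed) position in the word."""
    counts = Counter()
    for i, ch in enumerate(text):
        if ch not in Y_CHARS:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Assumption: word-final unless a letter (L*) or mark (M*) follows.
        is_final = not (nxt and unicodedata.category(nxt)[0] in "LM")
        position = "final" if is_final else "non-final"
        counts[(f"U+{ord(ch):04X}", position)] += 1
    return counts
```

A document in which U+06CC never occurs, for example, is unlikely to follow the Persian convention.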
Unfortunately, this article does not discuss Pashto. I have little knowledge of the language, but here's what I managed to understand (and please correct me if I'm wrong).
First of all, Pashto makes a distinction between a character with two dots arranged horizontally, representing /j/ or /i:/ as in Arabic and Persian, and a character with two dots arranged vertically, representing the sound /e/. I have very little access to Pashto documents from before the computer age, but from what little I saw, my impression is that the /j/-/i:/ character used the same convention as in Persian, of two dots in the medial form and none on the final form. I don't have access to documents that would allow me to determine whether or not the /e/ character traditionally had dots on its final form.
With the advent of computer typesetting, a new convention appears to have arisen, which as far as I can tell is unique to Pashto in that it distinguishes between /j/ and /i:/ (though only in word-final position):
/j/ is written with two dots medially, none finally
/i:/ is written with two dots both medially and finally
/e/ is written with two dots in vertical arrangement, both medially and finally
Which brings me to my original question, of how to represent Pashto in Unicode. The linguist in me notices a correspondence between sounds and Unicode code points (which, given the history I have just described, is most certainly accidental):
/j/ corresponds to U+06CC
/i:/ corresponds to U+064A
/e/ corresponds to U+06D0
However, the Wikipedia article on the Pashto alphabet
http://en.wikipedia.org/wiki/Pashto_alphabet
gives a different correspondence, based on visual appearance:
forms with dots: U+064A (standing for /i:/ and /j/ medially and /i:/ finally)
forms without dots: U+0649 (only /j/ in word-final position)
forms with vertical dots: U+06D0 (/e/ in any position)
And there is yet a third convention, which I encountered in an electronic lexicon:
U+06CC: medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
U+064A: final form with dots (/i:/)
U+06D0: all forms with vertical dots (/e/)
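To make the differences concrete, the three candidate conventions can be written down as lookup tables from (sound, position) to code point. This is a minimal sketch of the correspondences as I have described them, not an established standard. The labels "phonemic", "wikipedia" and "lexicon" and the function name encode_y are invented for illustration, and only medial and final positions are covered.

```python
YEH = "\u064A"           # two dots medially and finally
FARSI_YEH = "\u06CC"     # two dots medially, none finally
ALEF_MAKSURA = "\u0649"  # no dots
E = "\u06D0"             # two vertical dots

# (sound, position) -> code point, one table per convention (labels are hypothetical).
CONVENTIONS = {
    "phonemic": {
        ("j", "medial"): FARSI_YEH, ("j", "final"): FARSI_YEH,
        ("i:", "medial"): YEH, ("i:", "final"): YEH,
        ("e", "medial"): E, ("e", "final"): E,
    },
    "wikipedia": {
        ("j", "medial"): YEH, ("j", "final"): ALEF_MAKSURA,
        ("i:", "medial"): YEH, ("i:", "final"): YEH,
        ("e", "medial"): E, ("e", "final"): E,
    },
    "lexicon": {
        ("j", "medial"): FARSI_YEH, ("j", "final"): FARSI_YEH,
        ("i:", "medial"): FARSI_YEH, ("i:", "final"): YEH,
        ("e", "medial"): E, ("e", "final"): E,
    },
}

def encode_y(sound, position, convention):
    """Return the code point a convention assigns to a sound in a given position."""
    return CONVENTIONS[convention][(sound, position)]
```

Laid out this way, all three conventions agree on /e/, and the first and third differ only in how they encode medial /i:/.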
To wrap up, are my observations about the Pashto writing conventions correct? And is there a standard for assigning the Pashto characters to Unicode code points? Any resources or insight on Pashto writing conventions would be appreciated.
-Ron.
Linguistic Field(s): Computational Linguistics
Subject Language(s): Pashto, Central (pst)
Page Updated: 18-Jul-2010