Summary Details

Query: Pashto in Unicode
Author: Ron Artstein
Linguistic Field(s): Writing Systems

Summary:
Hi,
Several months ago I posted a query asking whether there are standards for encoding the various Pashto y-characters in Unicode. I received many helpful responses on this list and the Unicode list (special thanks to Wilma Heston, Kamal Mansour and Roozbeh Pournader), and also met personally with several Pashto speakers in Southern California. The short answer is that there is a proposed standard, but it is often not followed in actual electronic texts, partly due to inherent problems with the standard itself. So processing needs to be done with care.

The long answer will give the details of the proposed standard at the end, but to make sense of it we will need to look at the history of these characters, both in the development of the Pashto script as an adaptation of the Arabic and Persian scripts, and in the later encoding of these scripts into computer character sets.

Terminological note: I will refer to characters and character bases by their Unicode names, allowing me to sidestep transliteration issues and the fact that the characters are known by different names in Arabic, Persian and Pashto.

1. Arabic

Each Arabic character consists of a base form and (possibly) a set of dots or other marks, whose use is compulsory in contemporary writing. Historically the dots developed as a way to disambiguate base forms that had become too similar, and they are distinct from a separate set of optional vowel diacritics.

The Arabic script is cursive, and most characters are connected to the preceding and following characters within the word (though some character bases do not connect); consequently, each character has up to 4 shapes: initial (connected only to the following character), medial (connected on both sides), final (connected only to the preceding character), and isolated. Often the various shapes are similar, but for some bases they look very different.
For the yeh base, the initial and medial forms look very similar, but they are distinct from the isolated and final forms, which are also similar to each other. When I talk about medial and final yeh forms I intend to cover also the initial and isolated forms, respectively.

While there has been (and continues to be) debate about what exactly constitutes a character in various derivatives of the Arabic script, the identity of characters used for writing the Arabic language follows a grammatical tradition of over a thousand years. This tradition recognizes 3 characters with a yeh base:

yeh: a yeh base with two dots below, used to represent the /j/ and /i:/ sounds. The standard arrangement of the two dots is horizontal, but they can be placed vertically or diagonally with no change in meaning. In Egypt, the final form is written without dots.

alef maksura: a yeh base with no dots, used historically to represent long /a:/ in certain contexts; in contemporary writing it is used only at the end of a word for certain short /a/ which derive from an etymological /j/.

yeh with hamza above: a yeh base with a hamza character above (or, in some historical texts, below), representing a glottal stop in certain contexts (typically before or after the vowel /i/).

Encoding of Arabic for text processing predates computers, going back to 5-bit teletype codes. However, these codes, as well as early computer codes, were all proprietary and did not allow interoperability across systems. Some of these systems had separate codes for initial, medial and final forms where the shapes differed significantly. The first documented and accepted standard for encoding Arabic characters was ASMO-449 (1982), a 7-bit code based on ASCII, with Arabic characters occupying the space of Latin lower-case letters.
This code established the principle that each (traditional) character had exactly one code point, with selection of the appropriate contextual glyph done by software and not represented in the characters themselves. ASMO-449 has 3 code points for yeh-based characters: yeh at 0x6A, alef maksura at 0x69, and yeh with hamza above at 0x46. Later 8-bit codes from the 1980s such as ECMA-114, ASMO-708, and ISO-8859-6 used the same 3 code points, transposed to 0xEA, 0xE9 and 0xC6 by turning on the eighth bit. The same code points found their way to Unicode starting at version 1.0 (1991) as U+064A Arabic Letter Yeh, U+0649 Arabic Letter Alef Maksura, and U+0626 Arabic Letter Yeh with Hamza Above.

The state of yeh-based characters in Arabic is rather straightforward, except in Egypt. As mentioned above, convention in Egypt is to write yeh in final position as a base form without dots, which makes it look identical to alef maksura. Moreover, since contemporary texts only use alef maksura at the end of a word, writing in Egypt has no need to distinguish between yeh and alef maksura, so confusion arises. The Egyptian daily Al-Ahram, for example, uses the character U+064A Yeh for both yeh and alef maksura in its online edition (http://www.ahram.org.eg/), and the result is that final yeh looks non-Egyptian because of the dots, and alef maksura looks simply incorrect. In the print edition the characters appear correctly, with no dots on the final forms, presumably through the use of proprietary fonts.

2. Persian

The Persian script is an adaptation of the Arabic script. Native Persian vocabulary uses just one yeh-based character, representing the /j/ and /i:/ sounds; additionally, alef maksura and yeh with hamza above are used in loanwords from Arabic. The convention for writing the yeh character in Persian is the same as for Arabic in Egypt: two dots in medial form, none in final form. Thus, Persian does not distinguish between yeh and alef maksura.
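As an aside on the Arabic code points from section 1: the relationship between the 7-bit ASMO-449 positions and the later 8-bit positions can be checked mechanically. The following is a minimal sketch, not part of any of the standards; the names ASMO_449, to_8bit and to_unicode are made up for illustration, and it assumes that Python's built-in "iso-8859-6" codec reflects the ISO-8859-6 mapping to Unicode.

```python
# ASMO-449 (7-bit) positions of the three yeh-based characters, as listed in the text.
ASMO_449 = {
    "yeh": 0x6A,
    "alef maksura": 0x69,
    "yeh with hamza above": 0x46,
}


def to_8bit(code7: int) -> int:
    """Turn on the eighth bit to get the corresponding 8-bit position."""
    return code7 | 0x80


def to_unicode(code7: int) -> str:
    """Decode the 8-bit position using Python's built-in ISO-8859-6 codec."""
    return bytes([to_8bit(code7)]).decode("iso-8859-6")


for name, code7 in ASMO_449.items():
    ch = to_unicode(code7)
    print(f"{name}: 0x{code7:02X} -> 0x{to_8bit(code7):02X} -> U+{ord(ch):04X}")
```

The loop prints, for each character, the 7-bit position, the 8-bit position and the Unicode code point that the codec assigns to it.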
The first standard 8-bit code for Persian with contextual rendering of characters was ISIRI-3342 (1993), which replaced a previous standard with separate codes for distinct contextual shapes. ISIRI-3342 has a yeh character in position 0xE1, which is displayed without dots on the final form. ISIRI-3342 also includes yeh with hamza above in position 0xFB as well as a character called "Arabic yeh" with dots on the final form in position 0xFE; an annotation specifies that the latter two characters are taken from ISO-8859-6. There is no specific character for alef maksura.

The yeh character from ISIRI-3342 corresponds to Unicode character U+06CC Arabic Letter Farsi Yeh, which appears already in Unicode version 1.0 (1991), two years prior to the publication of ISIRI-3342. Character U+06CC carries an explicit annotation: "Initial and medial forms of this letter have dots".

I have not found documentation on why Unicode and ISIRI decided to give separate code points to the Arabic and Persian conventions of writing yeh. It is not clear if an actual need exists to use both conventions in a single document, because when Persian names or terms are written in an Arabic document or vice versa, it is common practice to write the yeh according to the conventions of the document language rather than the source language. At any rate, presently Unicode contains the following three characters which encode versions of yeh with and without dots:

U+0649   no dots medially or finally
U+064A   two dots medially and finally
U+06CC   two dots medially, none finally

The intention behind these codes is probably to use U+0649 and U+064A for alef maksura and yeh in Arabic, and U+06CC for yeh in Persian; it is not clear what the intention is for Arabic in Egypt, or for alef maksura in Persian words of Arabic origin. Things are more complicated in practice.
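To see what a given text actually does with these three code points, one can report each occurrence together with its position and whether the character is expected to show dots there. The sketch below is only an illustration under simplifying assumptions: the names YEH_DOTS, position and yeh_report are hypothetical, "medial" covers initial and medial forms and "final" covers final and isolated forms (as in section 1), and a yeh counts as medial whenever the next character, skipping combining marks, is an Arabic-script letter. Non-joining letters such as hamza are not handled.

```python
import unicodedata

# Expected dots for each code point, taken from the three-way table in the text.
YEH_DOTS = {
    "\u0649": {"medial": False, "final": False},
    "\u064A": {"medial": True, "final": True},
    "\u06CC": {"medial": True, "final": False},
}


def is_arabic_letter(ch: str) -> bool:
    """Rough test for an Arabic-script letter (simplification, see lead-in)."""
    return (unicodedata.category(ch) == "Lo"
            and unicodedata.name(ch, "").startswith("ARABIC"))


def position(text: str, i: int) -> str:
    """Return 'medial' if text[i] is followed by an Arabic letter, else 'final'."""
    j = i + 1
    # Skip combining marks such as optional vowel diacritics.
    while j < len(text) and unicodedata.category(text[j]) == "Mn":
        j += 1
    if j < len(text) and is_arabic_letter(text[j]):
        return "medial"
    return "final"


def yeh_report(text: str) -> list:
    """List (code point, position, expected dots) for every yeh-like character."""
    report = []
    for i, ch in enumerate(text):
        if ch in YEH_DOTS:
            pos = position(text, i)
            report.append((f"U+{ord(ch):04X}", pos, YEH_DOTS[ch][pos]))
    return report
```

A final U+064A reported with dots in a Persian text, for example, is the kind of case discussed next.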
The online edition of Hamshahri newspaper (http://www.hamshahrionline.ir/) uses U+06CC regularly, though stray occurrences of U+064A are also found; in contrast, the online edition of Kayhan (http://kayhannews.ir/) uses U+064A exclusively, resulting in inappropriate dots on all final yeh forms (as with Al-Ahram in Egypt, these dots are absent from the print edition, again probably due to proprietary fonts). Online forums in Persian such as http://balatarin.com/ show a mixture of U+06CC and U+064A.

3. Pashto

The Pashto script is an adaptation of the Persian-Arabic script; it shares some non-Arabic characters with Persian but differs on others (for example the sound /g/, not represented in the Arabic script, is written by different modifications of the kaf character base in Persian and Pashto). Traditionally, Pashto used a single yeh character with the same convention as in Persian, of two dots in the medial form and none on the final form, with no significance attached to the visual arrangement of the dots. This character was three-way ambiguous between the sounds /j/, /i:/ and /e/. Some informants I met with who had left Afghanistan and Pakistan in the 1980s are not familiar with any distinction among yeh characters, and while they tend to write final yeh without dots, they also accept it with dots.

However, recent developments have caused some differentiation (Wilma Heston suggests that this came from some conferences organized in the early 1990s by the Pashto Academy at the University of Peshawar, Pakistan; I was not able to find documentation on this effort, though reference to "a 1991 meeting of Pashto experts in Peshawar" is made in the UNDP document cited below). One convention that has gained fairly wide acceptance is a distinction between a horizontal arrangement of the dots, representing /j/ or /i:/ as in Arabic and Persian, and a vertical arrangement representing the sound /e/.
This distinction is the same as in Uighur, and the character with vertical dots is codified as U+06D0 Arabic Letter E. Additional conventions concern the sound /j/ following schwa in final position, represented as U+0626 yeh with hamza above when it is used as the 2nd person plural verb inflection, and as U+06CD Arabic Letter Yeh with Tail when it is used to represent the feminine noun and adjective inflection.

This four-way distinction is used, for example, in the following book: Habibullah Tegey and Barbara Robson, A Reference Grammar of Pashto, Center for Applied Linguistics, Washington DC, 1996 (http://www.eric.ed.gov/ERICWebPortal/detail?accno=ED399825). Unfortunately the PDF is a scan of a printout, so I can only identify the characters by their visual shape, but I list them with the most likely corresponding Unicode characters; the final form of the j/i character is usually without dots, but sometimes with.

U+06CC or U+064A   /j/ and /i:/
U+06D0             /e/
U+06CD             /j/ after schwa in final position, feminine marker
U+0626             /j/ after schwa in final position, 2nd person plural marker

A five-way distinction is offered by M. A. Zyar, A Guide of Standard Pashto, Oxford, 2006 (http://www.tolafghan.com/assets/download/pashto_liklar.pdf). The book itself is in Pashto, which I can't read, but on page 387 it spells out the same usage as Tegey and Robson above, with an additional distinction: a final form with dots is used for /i:/, while a final form without dots is used for the masculine nominal marker /aj/. Word-medially, both /j/ and /i:/ are represented with two dots in a horizontal arrangement. Zyar does not specify a computer encoding for these characters; the book itself contains a mix of U+0649, U+064A and U+06CC, suggesting that the producers of the book cared only about the visual shape of the glyphs, not the machine encoding.
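A mix of code points like the one in Zyar's book is easy to detect by counting. The following is a rough sketch of such a count, assuming the text has already been extracted as a Python string; the names YEH_LIKE and tally_yeh are hypothetical, and the set of characters is simply the six yeh-based code points mentioned so far.

```python
import unicodedata
from collections import Counter

# The yeh-based code points discussed in the text:
# U+0626, U+0649, U+064A, U+06CC, U+06CD, U+06D0.
YEH_LIKE = "\u0626\u0649\u064A\u06CC\u06CD\u06D0"


def tally_yeh(text: str) -> dict:
    """Count each yeh-based code point in text, keyed by code point and Unicode name."""
    counts = Counter(ch for ch in text if ch in YEH_LIKE)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
        for ch, n in counts.items()
    }
```

Nonzero counts for more than one of U+0649, U+064A and U+06CC in a single document are a hint that the encoding was chosen by glyph shape; the count alone does not say which occurrences are wrong.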
The same five-way distinction of yeh shapes with a recommended encoding (but no phonetic characterization) is specified in the document Computer Locale Requirements for Afghanistan, published in 2003 by the UNDP (http://www.evertype.com/standards/af/af-locales.pdf), as follows:

U+06CC   medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
U+064A   final form with dots (/i:/)

The rationale is given in a note on page 5 (<> indicates places where the document has a specific Pashto glyph): "Since the shapes of the <> initial and <> medial forms of the Pashto letters <> ye (U+06CC) and <> saxta ye (U+064A) are exactly the same, to avoid encoding ambiguities in Pashto data ... we recommend that the Unicode character for saxta ye, namely <> U+064A, never be used in initial and medial forms in Pashto data".

The same convention is followed in other on-line resources as well as a proprietary electronic lexicon, so it can be considered to be the preferred or standard encoding. However, this is not the only convention found. For example, the Wikipedia article (http://en.wikipedia.org/wiki/Pashto_alphabet) makes a visual distinction:

U+064A   forms with dots (/i:/ and /j/ medially, /i:/ finally)
U+0649   forms without dots (only /j/ in word-final position)

Electronic documents in the wild such as BBC News (http://www.bbc.co.uk/pashto) and Deutsche Welle (http://www.dw-world.de/) show great confusion, with U+064A and U+06CC used interchangeably even within a single article, and even in final position where the glyphs differ.

4. Concluding Remarks

I believe that the confusion in the use of Pashto yeh characters is inherent to the design of the script, namely the fact that /i:/ and /j/ are distinguished in the final form but not in the medial form.
This is already a difficult concept, and I cannot think of another case where a single language written in a modern derivative of the Arabic script uses two distinct characters that have an identical appearance in some positions but look different in others.

The existence of three Unicode characters, representing different combinations of dots on medial and final yeh, gives what appears at first glance to be a linguistically elegant alternative to the encoding of /i:/ and /j/ in Pashto:

U+064A   /i:/ (with dots medially and finally)
U+06CC   /j/ (with dots medially, without finally)

However, this encoding is impractical, since it is not reasonable to expect typists to make a distinction between characters that have identical shape. This point is illustrated by the complete merger in visual form between alef maksura and yeh in Persian and in Arabic as written in Egypt: while authors and typists presumably make a conceptual distinction between the characters (since they are pronounced differently), the fact that the characters look the same means that people do substitute one for the other while typing.

Hopefully a future spelling reform in Pashto will either fully split /i:/ and /j/ in all contextual positions, or fully merge them. Until then, we will have to live with confusion in electronic documents.

- Ron

LL Issue: 21.5228
Date Posted: 22-Dec-2010