LINGUIST List 21.5228

Wed Dec 22 2010

Sum: Pashto in Unicode

Editor for this issue: Danielle St. Jean <daniellelinguistlist.org>


Directory
        1. Ron Artstein, Pashto in Unicode

Message 1: Pashto in Unicode
Date: 20-Dec-2010
From: Ron Artstein <linguistartstein.org>
Subject: Pashto in Unicode

Query for this summary posted in LINGUIST Issue: 21.2971
Hi,

Several months ago I posted a query, asking whether there are
standards for encoding the various Pashto y-characters in Unicode. I
received many helpful responses on this list and the Unicode list
(special thanks to Wilma Heston, Kamal Mansour and Roozbeh
Pournader), and also met personally with several Pashto speakers in
Southern California. The short answer is that there is a proposed
standard but it is often not followed in actual electronic texts, partly due
to inherent problems with the standard itself. So processing needs to
be done with care.

The long answer will give the details of the proposed standard at the
end, but to make sense of it we will need to look at the history of these
characters, both in the development of the Pashto script as an adaptation of
the Arabic and Persian scripts, and in the later encoding of these
scripts into computer character sets.

Terminological note: I will refer to characters and character bases by
their Unicode name, allowing me to sidestep transliteration issues and
the fact that the characters are known by different names in Arabic,
Persian and Pashto.

1. Arabic

Each Arabic character consists of a base form and (possibly) a set of
dots or other marks, whose use is compulsory in contemporary writing.
Historically the dots developed as a way to disambiguate base forms
that had become too similar, and they are distinct from a separate set
of optional vowel diacritics. The Arabic script is cursive, and most
characters are connected to the preceding and following characters
within the word (though some character bases do not connect);
consequently, each character has up to 4 shapes -- initial (connected
only to the following character), medial (connected on both sides), final
(connected only to the preceding character), and isolated. Often the
various shapes are similar, but for some bases they look very different.
For the yeh base, the initial and medial forms look very similar, but they
are distinct from the isolated and final forms which are also similar to
each other. When I refer to medial and final yeh forms below, I mean to
include the initial and isolated forms, respectively.

While there has been (and continues to be) debate about what exactly
constitutes a character in various derivatives of the Arabic script, the
identity of characters used for writing the Arabic language follows a
grammatical tradition of over a thousand years. This tradition
recognizes 3 characters with a yeh base:

yeh: a yeh base with two dots below, used to represent the /j/ and /i:/
sounds. The standard arrangement of the two dots is horizontal, but
they can be placed vertically or diagonally with no change in meaning.
In Egypt, the final form is written without dots.

alef maksura: a yeh base with no dots, used historically to represent
long /a:/ in certain contexts; in contemporary writing it is used only at
the end of a word for certain short /a/ which derive from an
etymological /j/.

yeh with hamza above: a yeh base with a hamza character above (or,
in some historical texts, below), representing a glottal stop in certain
contexts (typically before or after the vowel /i/).

Encoding of Arabic for text processing predates computers, going back
to 5-bit teletype codes. However, these codes, as well as early
computer codes, were all proprietary and did not allow interoperability
across systems. Some of these systems had separate codes for initial,
medial and final forms where the shapes differed significantly. The first
documented and accepted standard for encoding Arabic characters
was ASMO-449 (1982), a 7-bit code based on ASCII, with Arabic
characters occupying the space of Latin lower-case letters. This code
established the principle that each (traditional) character had exactly
one code point, with selection of the appropriate contextual glyph done
by software and not represented in the characters themselves. ASMO-
449 has 3 code points for yeh-based characters: yeh at 0x6A, alef
maksura at 0x69, and yeh with hamza above at 0x46. Later 8-bit codes
from the 1980s such as ECMA-114, ASMO-708, and ISO-8859-6 used
the same 3 code points, transposed to 0xEA, 0xE9 and 0xC6 by
turning on the eighth bit. The same code points found their way to
Unicode starting at version 1.0 (1991) as U+064A Arabic Letter Yeh,
U+0649 Arabic Letter Alef Maksura, and U+0626 Arabic Letter Yeh
with Hamza Above.
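
The relationship between these code points can be checked mechanically.
The following is a minimal Python sketch, not part of any of the
standards: it takes the three ASMO-449 values quoted above, turns on the
eighth bit, and decodes the resulting byte with Python's built-in
ISO-8859-6 codec. The names YEH_CODES and eight_bit_code are my own and
are only illustrative.

```python
import unicodedata

# The three yeh-based characters as given in the text:
# (label used in the text, ASMO-449 7-bit code, Unicode code point)
YEH_CODES = [
    ("yeh", 0x6A, 0x064A),
    ("alef maksura", 0x69, 0x0649),
    ("yeh with hamza above", 0x46, 0x0626),
]


def eight_bit_code(seven_bit: int) -> int:
    """Turn on the eighth bit, as the later 8-bit codes did."""
    return seven_bit | 0x80


for label, asmo, ucs in YEH_CODES:
    byte = eight_bit_code(asmo)
    # Decode the single byte with Python's ISO-8859-6 codec.
    decoded = bytes([byte]).decode("iso8859_6")
    print(f"{label}: 0x{asmo:02X} -> 0x{byte:02X} -> "
          f"U+{ord(decoded):04X} {unicodedata.name(decoded)}")
```

Each line of output should pair the 7-bit code, the 8-bit code and the
Unicode character for one of the three letters.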

The state of yeh-based characters in Arabic is rather straightforward,
except in Egypt. As mentioned above, convention in Egypt is to write
yeh in final position as a base form without dots, which makes it look
identical to alef maksura. Moreover, since contemporary texts only use
alef maksura at the end of a word, writing in Egypt has no need to
distinguish between yeh and alef maksura, and the two are easily
confused in electronic text. The
Egyptian daily Al-Ahram, for example, uses the character U+064A Yeh
for both yeh and alef maksura in its online edition
(http://www.ahram.org.eg/) and the result is that final yeh looks non-
Egyptian because of the dots, and alef maksura looks simply incorrect.
In the print edition the characters appear correctly, with no dots on the
final forms, presumably through the use of proprietary fonts.

2. Persian

The Persian script is an adaptation of the Arabic script. Native Persian
vocabulary uses just one yeh-based character, representing the /j/ and
/i:/ sounds; additionally, alef maksura and yeh with hamza above are
used in loanwords from Arabic. The convention for writing the yeh
character in Persian is the same as for Arabic in Egypt: two dots in
medial form, none in final form. Thus, Persian does not distinguish
between yeh and alef maksura.

The first standard 8-bit code for Persian with contextual rendering of
characters was ISIRI-3342 (1993), which replaced a previous standard
with separate codes for distinct contextual shapes. ISIRI-3342 has a
yeh character in position 0xE1, which is displayed without dots on the
final form. ISIRI-3342 also includes yeh with hamza above in position
0xFB as well as a character called "Arabic yeh" with dots on the final
form in position 0xFE; an annotation specifies that the latter two
characters are taken from ISO-8859-6. There is no specific character
for alef maksura.

The yeh character from ISIRI-3342 corresponds to Unicode character
U+06CC Arabic Letter Farsi Yeh, which appears already in Unicode
version 1.0 (1991), two years prior to the publication of ISIRI-3342.
Character U+06CC carries an explicit annotation: "Initial and medial
forms of this letter have dots". I have not found documentation on why
Unicode and ISIRI decided to give separate code points to the Arabic
and Persian conventions of writing yeh. It is not clear if an actual need
exists to use both conventions in a single document, because when
Persian names or terms are written in an Arabic document or vice
versa, it is common practice to write the yeh according to the
conventions of the document language rather than the source
language. At any rate, presently Unicode contains the following three
characters which encode versions of yeh with and without dots:

U+0649 no dots medially or finally
U+064A two dots medially and finally
U+06CC two dots medially, none finally

The intention behind these codes is probably to use U+0649 and
U+064A for alef maksura and yeh in Arabic, and U+06CC for yeh in
Persian; it is not clear what the intention is for Arabic in Egypt, or for
alef maksura in Persian words of Arabic origin. Things are more
complicated in practice. The online edition of Hamshahri newspaper
(http://www.hamshahrionline.ir/) uses U+06CC regularly, though stray
occurrences of U+064A are also found; in contrast, the online edition of
Kayhan (http://kayhannews.ir/) uses U+064A exclusively, resulting in
inappropriate dots on all final yeh forms (as with Al-Ahram in Egypt,
these dots are absent from the print edition, again probably due to
proprietary fonts). Online forums in Persian such as
(http://balatarin.com/) show a mixture of U+06CC and U+064A.
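
A mixture of this kind is easy to measure in any text. Here is a
minimal Python sketch that counts the three dotted and dotless yeh code
points in a string; the function name count_yeh_variants and the sample
string are my own inventions for illustration, not data from the sites
mentioned above.

```python
from collections import Counter

# The three code points discussed above, keyed by character.
YEH_VARIANTS = {
    "\u0649": "U+0649",
    "\u064A": "U+064A",
    "\u06CC": "U+06CC",
}


def count_yeh_variants(text: str) -> Counter:
    """Count occurrences of each yeh code point in text."""
    return Counter(YEH_VARIANTS[ch] for ch in text if ch in YEH_VARIANTS)


# Hypothetical sample: two U+06CC and one U+064A among other letters.
sample = "\u0628\u06CC\u0646 \u0628\u064A\u0646 \u0628\u06CC"
print(count_yeh_variants(sample))
```

A text that follows one convention consistently should show only one of
U+064A and U+06CC; nonzero counts for both suggest the kind of mixture
described above.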

3. Pashto

The Pashto script is an adaptation of the Persian-Arabic script; it
shares some non-Arabic characters with Persian but differs on others
(for example the sound /g/, not represented in the Arabic script, is
written by different modifications of the kaf character base in Persian
and Pashto). Traditionally, Pashto used a single yeh character with the
same convention as in Persian, of two dots in the medial form and none
on the final form, with no significance attached to the visual
arrangement of the dots. This character was three-way ambiguous
between the sounds /j/, /i:/ and /e/. Some informants I met with who had
left Afghanistan and Pakistan in the 1980s are not familiar with any
distinction among yeh characters, and while they tend to write final yeh
without dots, they also accept it with dots. However, recent
developments have caused some differentiation (Wilma Heston
suggests that this came from some conferences organized in the early
1990s by the Pashto Academy at the University of Peshawar, Pakistan;
I was not able to find documentation on this effort, though reference to
"a 1991 meeting of Pashto experts in Peshawar" is made in the UNDP
document cited below).

One convention that has gained fairly wide acceptance is a distinction
between a horizontal arrangement of the dots, representing /j/ or /i:/ as
in Arabic and Persian, and a vertical arrangement representing the
sound /e/. This distinction is the same as in Uighur, and the character
with vertical dots is codified as U+06D0 Arabic Letter E. Additional
conventions concern the sound /j/ following schwa in final position,
represented as U+0626 yeh with hamza above when it is used as the
2nd person plural verb inflection, and as U+06CD Arabic Letter Yeh
with Tail when it is used to represent the feminine noun and adjective
inflection. This four-way distinction is used, for example, in the following
book: Habibullah Tegey and Barbara Robson, A Reference Grammar
of Pashto, Center for Applied Linguistics, Washington DC, 1996
(http://www.eric.ed.gov/ERICWebPortal/detail?accno=ED399825)
(unfortunately the PDF is a scan of a printout, so I can only identify the
characters by their visual shape, but I list them with the most likely
corresponding Unicode characters; the final form of the j/i character is
usually without dots, but sometimes with).

U+06CC or U+064A /j/ and /i:/
U+06D0 /e/
U+06CD /j/ after schwa in final position, feminine marker
U+0626 /j/ after schwa in final position, 2nd person plural marker
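
For processing purposes the list above can be kept as a small lookup
table. This is a Python sketch under the same caveat as the list
itself: the code points are the most likely matches for the printed
shapes, not something the book specifies. The names PASHTO_YEH_ROLES
and describe are illustrative only.

```python
import unicodedata

# Four-way distinction as listed above; the code points are assumed
# matches for the printed glyphs, not an encoding given by the book.
PASHTO_YEH_ROLES = {
    "\u06CC": "/j/ and /i:/ (U+064A is also possible here)",
    "\u06D0": "/e/",
    "\u06CD": "/j/ after schwa in final position, feminine marker",
    "\u0626": "/j/ after schwa in final position, 2nd person plural marker",
}


def describe(ch: str) -> str:
    """Return code point, Unicode name and role for a listed character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {PASHTO_YEH_ROLES[ch]}"


for ch in PASHTO_YEH_ROLES:
    print(describe(ch))
```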

A five-way distinction is offered by M. A. Zyar, A Guide of Standard
Pashto, Oxford, 2006
(http://www.tolafghan.com/assets/download/pashto_liklar.pdf). The
book itself is in Pashto, which I can't read, but on page 387 it spells out
the same usage as Tegey and Robson above, with an additional
distinction: a final form with dots is used for /i:/, while a final form
without dots is used for the masculine nominal marker /aj/. Word-
medially, both /j/ and /i:/ are represented with two dots in a horizontal
arrangement.

Zyar does not specify a computer encoding for these characters; the
book itself contains a mix of U+0649, U+064A and U+06CC, suggesting
that the producers of the book cared only about the visual shape of the
glyphs, not the machine encoding. The same five-way distinction of yeh
shapes with a recommended encoding (but no phonetic
characterization) is specified in the document Computer Locale
Requirements for Afghanistan, published in 2003 by the UNDP
(http://www.evertype.com/standards/af/af-locales.pdf). The two
characters at issue here are encoded as follows:

U+06CC medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
U+064A final form with dots (/i:/)

The rationale is given in a note on page 5 (<> indicates places where
the document has a specific Pashto glyph): "Since the shapes of the <>
initial and <> medial forms of the Pashto letters <> ye (U+06CC) and <>
saxta ye (U+064A) are exactly the same, to avoid encoding ambiguities
in Pashto data ... we recommend that the Unicode character for saxta
ye, namely <> U+064A, never be used in initial and medial forms in
Pashto data".
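
The recommendation quoted above can be applied to existing text as a
cleanup step. The following Python sketch is one possible reading of
it, not code from the UNDP document: it replaces U+064A with U+06CC
whenever the yeh is followed by another letter in the same word (that
is, in initial or medial position) and leaves final U+064A alone. The
test for "followed by a letter" is my own rough approximation based on
Unicode general categories, and the names joins_to_next and
normalize_nonfinal_yeh are hypothetical.

```python
import unicodedata

ARABIC_YEH = "\u064A"  # dots in all positions
FARSI_YEH = "\u06CC"   # dots initially and medially, none finally


def joins_to_next(text: str, i: int) -> bool:
    """Rough test: is text[i] followed by a letter in the same word?

    Combining marks (category Mn) are skipped; anything else that is
    not a letter (space, punctuation, ZWNJ, end of string) is taken to
    end the word. This is an assumption, not a full shaping algorithm.
    """
    j = i + 1
    while j < len(text) and unicodedata.category(text[j]) == "Mn":
        j += 1
    return j < len(text) and unicodedata.category(text[j]).startswith("L")


def normalize_nonfinal_yeh(text: str) -> str:
    """Replace initial and medial U+064A with U+06CC; keep final U+064A."""
    out = []
    for i, ch in enumerate(text):
        if ch == ARABIC_YEH and joins_to_next(text, i):
            out.append(FARSI_YEH)
        else:
            out.append(ch)
    return "".join(out)
```

Final U+064A is deliberately left untouched, since under this
convention it is the only position in which the two characters
contrast.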

The same convention is followed in other online resources as well as a
proprietary electronic lexicon, so it can be considered to be the
preferred or standard encoding. However, this is not the only
convention found. For example, the Wikipedia article
(http://en.wikipedia.org/wiki/Pashto_alphabet) makes a visual
distinction:

U+064A forms with dots (/i:/ and /j/ medially, /i:/ finally)
U+0649 forms without dots (only /j/ in word-final position)

Electronic documents in the wild such as BBC News
(http://www.bbc.co.uk/pashto) and Deutsche Welle
(http://www.dw-world.de/) show great confusion, with U+064A and
U+06CC used interchangeably even within a single article, and even in
final position where the glyphs differ.
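
Since final position is where the choice of code point changes the
glyph, it can be useful to count yeh characters in that position only.
Here is a minimal Python sketch; the word-final test is an
approximation based on Unicode general categories, and the function
name final_yeh_counts and the sample string are illustrative, not taken
from the sites above.

```python
import unicodedata
from collections import Counter

YEHS = {"\u0649", "\u064A", "\u06CC"}


def final_yeh_counts(text: str) -> Counter:
    """Count yeh code points that are not followed by a letter."""
    counts = Counter()
    for i, ch in enumerate(text):
        if ch not in YEHS:
            continue
        # Skip combining marks (category Mn) after the yeh.
        j = i + 1
        while j < len(text) and unicodedata.category(text[j]) == "Mn":
            j += 1
        # Assumption: the word ends at the first non-letter.
        at_end = j == len(text)
        if at_end or not unicodedata.category(text[j]).startswith("L"):
            counts[f"U+{ord(ch):04X}"] += 1
    return counts


# Hypothetical sample: final U+064A, final U+06CC, and an initial U+064A.
sample = "\u0628\u064A \u0628\u06CC \u064A\u0628"
print(final_yeh_counts(sample))
```

If a single article yields substantial counts for both U+064A and
U+06CC here, the difference is unlikely to reflect a deliberate /i:/
versus /j/ distinction without checking the words by hand.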

4. Concluding Remarks

I believe that the confusion in the use of Pashto yeh characters is
inherent to the design of the script, namely the fact that /i:/ and /j/ are
distinguished in the final form but not in the medial form. This is already
a difficult concept, and I cannot think of another case where a single
language written in a modern derivative of the Arabic script uses two
distinct characters that have an identical appearance in some positions
but look different in others.

The existence of three Unicode characters, representing different
combinations of dots on medial and final yeh, gives what appears at
first glance to be a linguistically elegant alternative to the encoding of
/i:/ and /j/ in Pashto:

U+064A: /i:/ (with dots medially and finally)
U+06CC: /j/ (with dots medially, without finally)

However, this encoding is impractical, since it is not reasonable to
expect typists to make a distinction between characters that have
identical shape. This point is illustrated by the complete merger in
visual form between alef maksura and yeh in Persian and Arabic
written in Egypt: while authors and typists presumably make a
conceptual distinction between the characters (since they are
pronounced differently), the fact that the characters look the same
means that people do substitute one for the other while typing. A future
spelling reform in Pashto may hopefully either fully split /i:/ and /j/ in all
contextual positions, or fully merge them. Until then, we will have to live
with confusion in electronic documents.

- Ron
Linguistic Field(s): Writing Systems




Page Updated: 22-Dec-2010
