  1. John Clews, Potential future candidates for new ISO 639 codes (larger languages)

Message 1: Potential future candidates for new ISO 639 codes (larger languages)

Date: Tue, 22 Feb 2000 10:47:00 +0000
From: John Clews <>
Subject: Potential future candidates for new ISO 639 codes (larger languages)


Potential future candidates for new ISO 639 codes (larger languages)

Thank you for the correspondence on the Linguist List (and on other
language-related lists) on codes that might be useful for inclusion
in ISO 639. The correspondence has certainly helped me to
prioritise the most urgent languages to be added.

ISO 639 tends to provide codes only for the larger languages,
although it still needs to provide codes for several larger languages
(by number of speakers). As a rough guide, I am aiming to ensure that
codes will be provided in due course for most distinct languages
where there are a million or more speakers.

In passing, ISO 639 tends to leave the allocation of codes for
lesser-used languages to organizations such as the Summer Institute
of Linguistics (SIL) in its Ethnologue codes.

In fact while in Washington at the recent ISO 639 Joint Advisory
Committee (17-18 February 2000) I had some very useful preliminary
discussions with Peter Constable of SIL, about how these 3-letter
codes might interface with each other, although considerably more
work needs to be put into that. However, given linguists' frequent
use of SIL codes, this may be a useful exercise in due course.

I plan to send the Linguist List a report of the meeting of the
ISO 639 Joint Advisory Committee later. Codes for a few additional
languages were added at this meeting: the main discussions were on
clarifying some procedural issues, which should allow for much more
rapid addition of codes in the future.

Meanwhile, I would be glad if any of you could comment on the list
below: the Foundation for Endangered Languages plans to submit an
application for new codes for some of the larger languages below to
be added to ISO 639-2, based on the following list. There's no
suggestion that these languages are endangered: just that it would be
useful to provide ISO 639 3-letter codes for at least some of them,
and some more information on these languages would be helpful to
present to the ISO 639 Joint Advisory Committee.

The comments that would be particularly useful are:
(a) which of these have some official status in any part of the
 countries concerned;
(b) where there is a significant body of documents in this language, in
 i. manuscripts
 ii. academic linguistic transcriptions
 iii. printed materials
 iv. sound recordings
(c) which of the languages listed below could be regarded as dialects
 of another language, or closely related to another language.
 There is no suggestion that dialects are of less value: it will
 just help in presenting the application to the ISO 639-2
 Maintenance Agency.

In addition the marginal symbol
>>>> marks other specific queries that I have below.

ISO 639-2 already provides 3-letter codes for many of the larger
languages of the world: I have not repeated those in this current
posting. Thus this current posting is NOT a query on "what languages
are missing" - more of a request for further information on the
languages that are listed.

This list runs broadly from East through West, from China through
Europe. The addition of further major languages of the Americas is
not proposed here, as ISO 639 covers most larger languages of the
Americas fairly well already, although I plan to look at this area
again at a later time.

Please embed your comments in your reply, and send it to
<> rather than to the list.

Again I plan to post a summary to the Linguist List in due course.

John Clews

21 February 2000

- ----------------------------------------------------------
East Asia
- ----------------------------------------------------------

 1,487,000 China KHAMS KHG
 1,480,750 China DONG, SOUTHERN KMC

>>>> ISO 639: no codes for Khams and Dong (it is assumed that
 these are non-Han languages).

>>>> NB: it will be useful to consult the official lists of
 around 55 national minorities, to check which, if any,
 non-Han languages with official status are omitted from
 ISO 639.

>>>> What scripts are used for KHAMS and DONG? Latin script?

- ----------------------------------------------------------
Southeast Asia and Oceania
- ----------------------------------------------------------

 1,190,000 Viet Nam TAY THO

ISO 639-2 provides only for Tai (other); not for Tay (Tai Tho)

>>>> How closely related is Tai Tho to other Tai languages?

 2,083,000 Myanmar ARAKANESE MHV

ISO 639-2 provides for Karen and Shan; nothing for Arakanese

>>>> How close is Arakanese to other languages?

- ----------------------------------------------------------

 3,000,000 Indonesia BANJAR BJN
 Also known as

 2,700,000 Indonesia BETAWI BEW
 Also known as

>>>> After checking with Southeast Asian librarians at the
 British Library, it is apparent that these are significantly
 different from Malay. It is not clear whether the situation with
 Malay languages is similar to that with Sami languages.

 In Sami, there is now a code for "Northern Sami" (the
 Sami language with the largest population, and the largest
 publishing statistics) and "Sami (other)." The addition of
 further specific Sami languages may be reviewed again later.

 There may be a case for providing a code for "Malay languages
 (other)" as well as particular Malay languages.

 2,000,000 Indonesia BATAK TOBA BBC
 1,200,000 Indonesia BATAK DAIRI BTD

>>>> Note: there are various languages called BATAK in Sumatra:
 BATAK DAIRI (which has 1,200,000 speakers), BATAK KARO,
 BATAK MANDAILING, BATAK SIMALUNGUN and BATAK TOBA (which has
 2,000,000 speakers).

>>>> NB: note also the different language BATAK in the
 Philippines, with the SIL code BTK, which is assumed to be
 the "Batak" language encoded in ISO 639-2.

 1,500,000 Indonesia LAMPUNG LJP
 1,000,000 Indonesia REJANG REJ

ISO 639: No codes for LAMPUNG or REJANG. These too are spoken in
Sumatra, Indonesia.

>>>> Dialects assumed? Or different languages?

 1,000,000 Philippines MAGINDANAON MDH

>>>> In passing, ISO 639 provides for most other large languages
 of the Philippines.

 50,000 Papua New Guinea TOK PISIN PDG

>>>> ISO 639: prefer to add special code for Tok Pisin? This has
 national status in Papua New Guinea. Currently only "cpe"
 (Creoles & Pidgins, English) is available. However, Bislama
 (which can also be described as an English-based creole
 language) does have a separate code.

>>>> In passing, ISO 639-2 provides codes for most other larger
 languages of Oceania.

- ----------------------------------------------------------
South Asia
- ----------------------------------------------------------


ISO 639 does not list several of the following, with names as such:

 13,000,000 India HARYANVI BGC
 6,000,000 India KANAUJI BJJ
 3,500,000 India PARSI PRP
 2,730,120 India LAMBADI LMN
 2,246,105 India KHANDESI KHN
 2,095,280 India DOGRI-KANGRI DOJ
 2,081,756 India GARHWALI GBM
 2,013,000 India KUMAUNI KFY
 1,921,000 India BAGRI BGQ
 1,861,965 India SADRI SCK
 1,856,000 India TULU TCY
 1,600,000 India BHILI BHB
 1,544,000 India WAGDI WBR
 1,473,000 India MUNDARI MUW
 1,295,000 India NIMADI NOE
 1,050,000 India MALVI MUP
 1,026,000 India HO HOC
 3,000 India BROKSKAT BKK
 (Brokskat is an Indo-Aryan (Dardic) language)

>>>> Some dialects assumed in above list?

- ----------------------------------------------------------
For Indian languages, Peter Claus (California State University,
Hayward) also suggests

 - Kodagu (Coorgi) which has a relatively small (but established)
 literature with a number of scholars working on it.

 - Badaga, which has oral texts transliterated by scholars, and

 - Toda, Kota, and Kuruba languages, along the border of Karnataka
 and Tamil Nadu.

- ----------------------------------------------------------
 5,100,000 Bangladesh SYLHETTI SYL

>>>> Widely used in the United Kingdom Bangladeshi community.
 Sylheti Nagri script was used in the past in Bengal.

- ----------------------------------------------------------
 15,015,000 Pakistan SARAIKI (Siraiki) SKR
 2,210,000 Pakistan BRAHUI BRH
 1,875,000 Pakistan HINDKO, NORTHERN HNO
 625,000 Pakistan HINDKO, SOUTHERN HIN

>>>> Some dialects assumed?

- ----------------------------------------------------------
Dr. Elena Bashir, University of Michigan, also suggests the following
languages which are in SIL:

 333,640 Pakistan BALTI BFT

 320,000 Pakistan SHINA SCL
 222,800 Pakistan KHOWAR KHW
 220,000 Pakistan KOHISTANI, INDUS MVY
 200,000 Pakistan SHINA, KOHISTANI PLK
 108,000 Afghanistan PASHAYI, SOUTHWEST PSH
 60,000 Pakistan TORWALI TRW
 5,000 Pakistan DAMELI DML
 2,900 Pakistan KALASHA KLS
 (Indo-Aryan (Dardic))

 29,000 Pakistan WAKHI WBL
 5,000 Pakistan YIDGHA YDG

 500 Pakistan DOMAAKI DMK

 55,000 Pakistan BURUSHASKI BSK

 9,500 Afghanistan GAWAR-BATI GWT
 5,000 Afghanistan GRANGALI NLI

 1,000 Afghanistan SHUMASHTI SMS
 - Afghanistan TIRAHI TRA
 (Indo-Aryan (Dardic))

 4,000 Tajikistan YAZGULYAM YAH

 4,280,000 Iran LURI LRI
 3,265,000 Iran MAZANDERANI MZN
 3,265,000 Iran GILAKI GLK
 1,500,000 Iran QASHQAI QSQ

Dr. Elena Bashir, University of Michigan, also suggests the following
languages which are apparently not in SIL:

 Gojri Indo-Aryan
 Kanyawali Indo-Aryan (Dardic)
 Palula Indo-Aryan (Dardic)
 Sawi Indo-Aryan (Dardic)

 Ishkashmi Iranian
 Zebaki Iranian

- ----------------------------------------------------------
Northern Africa (including the Horn of Africa)
- ----------------------------------------------------------


ISO 639 codes Tamashek; check differences from Tamazight and other
languages with similar names (see below and Ethnologue entries)

 3,500,000 Morocco TACHELHIT SHI
 2,000,000 Morocco TARIFIT RIF
 2,511,000 Mauritania HASSANIYYA MEY

 1,400,000 Algeria CHAOUIA SHY
 1,148,000 Sudan BEDAWI BEI

 1,236,637 Ethiopia GAMO-GOFA-DAWRO GMO
 1,231,673 Ethiopia WOLAYTTA WBC

- ----------------------------------------------------------
West Africa (including North-West Africa)
- ----------------------------------------------------------

 600,000 Mali DOGON DOG
 361,700 Mali BOMU BMQ
 100,000 Mali BOSO, SOROGAMA BZE


ISO 639 codes Tamashek; check differences from Tamazight (see above)

+ 1,168,500 Mali FULFULDE, MAASINA FUL
+ 7,611,000 Nigeria FULFULDE, NIGERIAN FUV
+ 450,000 Niger FULFULDE,

>>>> ISO 639 codes are "ful" & "ff" - Fulah (Fulfulde/Fulani assumed)
>>>> Relationship of Fulfulde languages etc. needs clarification.


>>>> ISO 639 codes Tamashek; check differences from Tamajaq (see above)

 2,151,000 Niger ZARMA DJE

 2,520,000 Burkina Faso JULA DYU

 1,500,000 Nigeria IBIBIO IBB
 1,000,000 Nigeria EDO EDO
 1,000,000 Nigeria EBIRA IGB
 1,000,000 Nigeria ANAANG ANW

 2,921,300 Senegal PULAAR FUC
 313,000 Senegal JOLA-FOGNY DYO

 2,900,000 Guinea FUUTA JALON FUF

 2,130,000 Cote d'Ivoire BAOULE BCI
 1,020,000 Cote d'Ivoire DAN DAF

- ----------------------------------------------------------
Eastern and Central Africa
- ----------------------------------------------------------

 2,458,000 Kenya KALENJIN KLN
 1,582,000 Kenya GUSII GUZ
 1,305,000 Kenya MERU MER

 1,300,000 Tanzania GOGO GOG
 1,260,000 Tanzania MAKONDE KDE
 1,200,000 Tanzania HAYA HAY
 1,050,000 Tanzania NYAKYUSA-NGONDE NYY

 1,391,442 Uganda CHIGA CHG
 1,370,845 Uganda SOGA SOG
 1,217,000 Uganda TESO TEO

- ----------------------------------------------------------
Central and Southern Africa
- ----------------------------------------------------------

 4,200,000 Congo Dem Rep KITUBA KTU
 1,156,800 Congo MUNUKUTUBA MKW

 1,004,000 Congo Dem Rep CHOKWE CJK
 1,000,000 Congo Dem Rep SONGE SOP

>>>> In passing, no relationship to Tsonga, already in ISO 639

 2,850,000 Mozambique LOMWE NGL
 2,500,000 Mozambique MAKHUWA VMW
 1,160,000 Mozambique MAKHUWA-MEETTO MAK
 1,100,000 Mozambique SENA SEH

John Clews

7 February 2000 (updated/corrected 21 February 2000).

tel: +44 1423 888 432; fax: +44 1423 889061;

Committee Chair of ISO/TC46/SC2: Conversion of Written Languages;
Committee Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Committee Member of CEN/TC304: Information and Communications
 Technologies: European Localization Requirements
Committee Member of TS/1: Terminology (UK national member body of
 ISO/TC37: Terminology)
Committee Member of the Foundation for Endangered Languages;
Committee Member of ISO/IEC/JTC1/SC2: Coded Character Sets
