LINGUIST List 31.1411

Mon Apr 20 2020

Review: Anthropological Linguistics; Language Documentation; Typology: Jones (2019)

Editor for this issue: Jeremy Coburn

Date: 04-Feb-2020
From: Michael Maxwell
Subject: Endangered Languages and New Technologies

EDITOR: Mari C. Jones
TITLE: Endangered Languages and New Technologies
PUBLISHER: Cambridge University Press
YEAR: 2019

REVIEWER: Michael B. Maxwell, University of Maryland


Twenty years ago, I reviewed a publication coming out of the First International Conference on Language Resources and Evaluation (LREC). While that conference ostensibly targeted smaller languages, I remarked in my review that their notion of ''smaller'' appeared to be restricted to the largest hundred or so languages of the world, minus English, Modern Standard Arabic, and Mandarin Chinese.

The situation has improved today, with entire conferences looking at ways to study and preserve endangered languages, in many cases relying on computational analysis. Nevertheless, with a few exceptions, endangered languages are hardly a topic of great interest in the field of computational linguistics. This book is one of those exceptions. Its preface, by Mari C. Jones, states that it aims for ''a practicable synthesis of old and new methodologies'', where ''old'' is presumably pencil-and-paper techniques, and ''new'' is computationally informed (if not necessarily driven) technologies. The synthesis is viewed from two directions: new technologies for description and analysis, and how new technologies are being used for language revitalization (but see my comment at the end of this review).

Nicholas Ostler provides an ''Introduction: Endangered languages in the new multi-lingual order per genus et differentiam.'' The claim here is that ''the world will lose its motivation to maintain English as a convenient lingua franca just as automatic language conversion becomes...realistic.'' Ostler expresses the hope that automatic language conversion (i.e. Machine Translation, MT) will extend to smaller languages--but in the end, the hope that this will breathe new life into those languages has been dashed by their lack of computer-readable data, particularly parallel text. Ostler briefly describes approaches to solving this problem, such as finding parallel corpora between better documented and less documented related languages, so that their similarities and differences (the ''genus et differentiam'' of the title) could in theory more easily provide a way to discover the properties of the less documented language, and thereby build MT systems (presumably by pivoting on the better documented language, although Ostler does not go into detail here). There has indeed been some research on this approach (which I will come back to at the end of this review); but even this is not enough, if only because not all small languages have better documented relatives. Bilingual dictionaries could provide another help for MT; but it has been parallel text, not bilingual dictionaries, that jump-started machine translation in the last couple of decades. In sum, while I find it quite possible that English will be displaced as a lingua franca in some future, and that in that future there will be MT systems for many more languages, in my view that will not come soon enough to help languages that are endangered today.

Aimée Lahaussois writes about ''The Kiranti comparable corpus: A prototype corpus for the comparison of Kiranti languages and mythology.'' This is a study of how interlinear text in three closely related languages of Nepal can be aligned across documents in the different languages, and then used for comparative study of the languages. At present, there is only a single narrative transcribed from a single speaker in each of three languages, but it provides a proof of concept.

Sjef Barbiers' ''European Dialect Syntax: Towards an infrastructure for documentation and research of endangered dialects'' argues that since the boundary between language and dialect is not a principled one, it is reasonable to make an effort to document endangered dialects. Additionally, since the variation between dialects is (by definition) smaller than that between distinct languages, dialect variation may shed light on individual parameters of variation, such as syntactic parameters. The age-old issue of elicitation vs. pure corpus approaches arises here, however, since some syntactic variation (the examples are from Dutch dialects) is quite rare, making directed elicitation necessary in order to obtain sufficient examples for study.

Hugh Patterson writes about ''Keyboard Layouts: Lessons from the Me'phaa and Sochiapam Chinantec designs.'' Both these Mexican languages are written in Latin scripts, but with some diacritics or other combining characters that are not found in Spanish. While smartphone input is mentioned, most of the attention is given to physical keyboards in Windows and Mac systems (Linux is mentioned in passing). One of the problems which Patterson discusses is that of Unicode normalization, although he does not use that term (which is well known in the literature about Unicode).
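
For readers unfamiliar with the term, the normalization problem can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the character used (a with an acute accent) is chosen for convenience and is not taken from Patterson's chapter, and the helper name same_text is hypothetical. The point is that a keyboard may emit either a single precomposed code point or a base letter followed by a combining mark, and the two look identical on screen but do not compare as equal until both are normalized to the same form.

```python
import unicodedata

# Illustrative strings only (not drawn from the chapter under review).
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT, two code points


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The raw strings differ even though they render identically.
raw_equal = precomposed == decomposed

# After normalization they compare as equal.
normalized_equal = same_text(precomposed, decomposed)
```

A keyboard layout that consistently emits one form, or software that normalizes on input, avoids searches and spell checkers silently failing to match visually identical words.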

Matt Coler and Petr Homola describe a rule-based machine translation system they are developing for translating from Aymara into Spanish and English. The MT system relies on a number of theories, including a version of Lexical Functional Grammar (LFG) using dependency parsing, and augmented with additional structures. But the article is too short to provide an understanding of how all the structures are derived (automatically?) from the input sentences, and how the MT system pieces together target language output while referring to these multiple structures. For example, while Aymara is an agglutinating language, and the MT system must deal with considerable derivational as well as inflectional morphology, the morphological parser is mentioned in a single short paragraph, which does not explain how its rules are written, what technology is used for parsing (a finite state transducer?), how much ambiguity there is at the morphological output stage, or how the syntactic parser deals with this ambiguity.

The article closes by claiming a 12.1% Word Error Rate (WER) from Aymara to English. While WER is sometimes used to evaluate MT systems, BLEU score is more commonly used for this purpose (whereas WER is used for speech recognition; see Cer, Manning and Jurafsky 2010 for a discussion of MT metrics). It is also not clear how the 12.1% WER was measured (on how large a corpus, for example), much less what it means in this case (for example, the size of the vocabulary could have an effect on WER).
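
To make the concern about corpus size concrete, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, insertions, and deletions) between a reference and a system output, divided by the number of words in the reference. The function name and the example sentences are my own assumptions for illustration; nothing here reflects how Coler and Homola actually measured their figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: two substituted words in a six-word reference.
wer = word_error_rate("the dog barked at the cat", "the dog barks at a cat")
```

In this invented example two of six reference words are wrong, giving a WER of about 0.33. On a test set this small, a single additional error moves the score by roughly 17 percentage points, which is why a reported WER is hard to interpret without knowing the size of the evaluation corpus.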

Dorothee Beermann describes a data management and analysis system for endangered languages (although the languages used as examples are not endangered at present). The system, called TypeCraft (TC), emphasizes the production and display of interlinear glossed text (IGT) from text data (audio and video data is apparently planned for the future). TC differs from such tools as SIL's FieldWorks Language Explorer in that annotation and display can both be done over the web. Unfortunately, as with the preceding article, many details are unclear. For example, while collaborative annotation is listed as supported, and eleven annotators are said to have worked on IGT using this system, it is unclear whether simultaneous annotation by different annotators is possible, or whether multiple annotators must coordinate non-overlapping work times.

This project uses an LFG parser and an HPSG (Head-Driven Phrase Structure Grammar) parser, such that ''the subsequent linguistic analysis becomes linked to the material on which it is built.'' But it is unclear how this linking works: can these parsers be integrated into TC? Or is the data in TC exported in some form usable by the parser, the parser then run on that data, and the result imported back into TC? Also unclear is whether the annotation of IGT results in a dictionary of morphemes, and whether the system offers previous analyses of a particular word when that same wordform is encountered in later annotation, which would speed up annotation and encourage consistency.

Russell Hugo presents some ''fundamental questions for endangered language learning technology projects'', the answers to which should drive projects to produce pedagogical materials for the teaching of endangered languages in revitalization projects. He proposes the use of a Learning Management System (LMS) such as Moodle. Such a system generally requires internet access; although access is increasingly available, this requirement precludes the use of the system in some parts of the world where endangered languages are found. Hugo points out that the overwhelming advantage of developing a language learning curriculum in such a pre-existing tool is that it removes the need to re-invent the wheel by providing the software framework. Moreover, software changes; Moodle will not always be the best tool for teaching languages. But by using a tool like Moodle, which provides for the export of lessons, one can future-proof the lesson content.

Bernard Bel and Médéric Gasquet-Cyrus argue in favor of not simply preserving data about endangered languages, but curating it, by which they mean adding at least enough metadata to make the resources findable, defining usage rights, and possibly labeling the data (or chunks of data) with location identifiers and linguistic concepts. (I would have thought that everyone did this, but apparently not.) They illustrate using their own efforts to document endangered varieties (dialects) of Occitan, primarily with audio and video, but also with information about ''informants'' (their term), photographs, and data about the audio and video collection methods. (The source they point to for linguistic labels is unfortunately defunct, a problem that recurs distressingly frequently.) They discuss in some depth the legal and ethical constraints on the collected data, e.g. protecting ''problematic'' parts of sound files by replacing those parts with humming so as to preserve the prosody. (Again, details would be helpful: must the prosodically-based humming be done by humans, or is it possible to generate this automatically?)

Anthony Scott Warren and Geraint Jennings document efforts to preserve Jèrriais, a language spoken on the island of Jersey between France and England. The language has been in print for over two centuries, but English has been taking over domains of use for a hundred years. The authors then turn to developments of the last few decades which have been used to promote the use of Jèrriais: initially, internet web pages, and more recently, smartphones, Twitter, Facebook, YouTube, and so forth. These tools have provided ways both to promote the use of Jèrriais and to share ideas with groups trying to maintain other endangered languages.

Tjeerd de Graaf, Cor van der Meer, and Lysbeth Jongbloed-Faber document efforts to sustain West Frisian (Netherlands). This language has hundreds of thousands of native speakers, and many more second language speakers. The language enjoys status as the official second language of the Netherlands, and has played an official role in primary education for over a century. Nevertheless, UNESCO considers it to be ''vulnerable''. The article briefly describes the many things that have been done over the past 20 years to maintain the language, ranging from TV shows to Twitter accounts. All these provide a mine of ideas that other languages could try, although I suspect most truly endangered languages could only wish for the budgets and support available for West Frisian (and likewise Jèrriais).

Cecilia Odé writes about a project for Tundra Yukaghir (a language of Siberia). In a predicament much more similar to most other endangered languages than that of West Frisian, the only fluent speakers of Yukaghir are elderly, while teachers--although motivated--are not fluent; and support from the government is less than desired. The project developed an academic grammar, recordings of the spoken language and of songs, and courseware for teachers. Recordings were made in both audio and video form, and the discussion of this work may provide useful ideas to those working with other endangered languages.

Unlike virtually all other sign languages, American Indian Sign Language (AISL) served as a language of communication among non-deaf speakers of diverse, even unrelated, languages. But with the dominance of English, AISL is disappearing. Jeffrey Davis describes efforts to document and describe this unique language, combining the digitization of historically collected materials with ''born-digital'' documentation.


This book is not intended as a handbook of new technologies for endangered languages. There are no descriptions here of methods of elicitation suited to digital methods; no papers about corpus collection or lexicography or interlinearization or grammatical description. Rather, its purpose is to describe a set of new(ish) ideas in the use of technology to document and describe endangered languages. Some of these new directions may be fruitful, while others may prove less so.

Some new directions are not covered at all. For example, there is virtually no discussion here of the use of machine learning for language documentation. Examples of ways in which machine learning might be used include Automatic Speech Recognition (ASR), dictionary induction, parser induction, and the development of ways to communicate with speakers of endangered languages in emergency situations--as in work coming out of the US DARPA Low Resource Languages for Emergent Incidents (LORELEI) project. To be sure, all of these are experimental technologies: they are anything but mature, and many current machine learning techniques require larger quantities of data (particularly, annotated data) than there will ever be for most endangered languages. That said, there is research into ways to reduce that data requirement, e.g. by using cross-lingual alignment (as briefly mentioned by Nicholas Ostler in the introduction to this book), as well as research into ways to collect more transcribed data (see e.g. Bird 2010 for one such method).

While the chapters are organized into two sections, namely Creating New Technologies for Endangered Languages, and Applying New Technologies, it is not clear that the chapters actually fall neatly into this dichotomy. Hugo's chapter, for example, is in the section on creating new technologies, but it is actually a call for using an existing technology, Moodle.

The endangered languages used as case studies are oddly skewed: while most endangered languages are to be found outside of Europe (as a glance at a map of the world's endangered languages will show), four of the eleven languages discussed in this book are found in Europe, and two more in the United States or Canada. This is doubtless due to the availability of speakers of those languages in close proximity to linguists, and perhaps also to the availability of technology in these regions (something that Bel and Gasquet-Cyrus allude to).

In sum, if you come to this book expecting a handbook, you will be disappointed. But if you come looking for new ideas, you may well find some useful ones. As for the lack of studies here putting machine learning to work, it is the nature of books like this to be superseded--and I'm sure any advocate of documentation and description of endangered languages will join me in hoping that day will come soon.


Bird, Steven. 2010. ''A Scalable Method for Preserving Oral Literature from Small Languages.'' Vol. 6102, pp. 5-14. doi:10.1007/978-3-642-13654-2_2.

Cer, Daniel, Christopher D. Manning, and Daniel Jurafsky. 2010. ''The Best Lexical Metric for Phrase-Based Statistical MT System Optimization.'' Pp. 555-563 in ACL 2010.


Michael Maxwell is a research scientist at the University of Maryland, with experience in language documentation and description, and computational linguistic methods including computational lexicography and morphological parsing. In the past, he developed an appreciation for minority and endangered languages working in Ecuador and Colombia with SIL International, and for other low density languages while working with the Linguistic Data Consortium at the University of Pennsylvania.

Page Updated: 20-Apr-2020