LINGUIST List
LINGUIST List 24.2604

Wed Jun 26 2013

Review: Language Documentation: Seifart et al. (2012)

Editor for this issue: Joseph Salmons <jsalmonslinguistlist.org>

Date: 20-May-2013
From: Richard Littauer <richard.littauergmail.com>
Subject: Potentials of Language Documentation
EDITOR: Frank Christian Seifart
EDITOR: Geoffrey Haig
EDITOR: Nikolaus P. Himmelmann
EDITOR: Dagmar Jung
TITLE: Potentials of Language Documentation
SUBTITLE: Methods, Analyses, and Utilization
SERIES TITLE: Language Documentation & Conservation, Special Publication No. 3
PUBLISHER: University of Hawai‘i Press
YEAR: 2012

REVIEWER: Richard G Littauer, Universität des Saarlandes

''Potentials of Language Documentation: Methods, Analyses, and Utilization'',
a Special Publication of the Language Documentation & Conservation journal
edited by Frank Seifart, Geoffrey Haig, Nikolaus P. Himmelmann, Dagmar Jung,
Anna Margetts, and Paul Trilsbeek, is a collection of 18 chapters which were
originally presented at a workshop at the Max Planck Institute for
Evolutionary Anthropology in Leipzig, November 2011. The workshop was composed
of language documentation practitioners, experts on computational methods,
linguistic researchers, and applied documentation linguists (working on
maintenance, curation, and presentation of corpora). The publication, as well
as offering an overview of the state of the field for language documentation,
is meant to cover advances and potential work for three different aspects of
language documentation -- namely, computational methods, analyses, and
utilization. The volume is split into three different sections, covering each
of these aspects.

In Chapter 1 (1-6), ''The threefold potential of language documentation,''
Frank Seifart, the first editor, introduces the volume and explains the need
for it. He gives a brief overview of each chapter and section. He goes on to
point out additional aspects of language documentation potential not addressed
explicitly in the papers. First he states that the possibilities of
computational methods are virtually endless, and that ''real challenges are
often conceptual, not technical'' -- he advises that multidisciplinary action
is necessary to overcome this, especially as ''there is often not one single
ideal computational solution for a linguistic problem.'' Thus, modularisation
of various techniques and implementation of interactive learning are useful.
Second, multimodal corpora are now available and there are methods to study
them, making this a promising research area. Third, documentation archives are
normally focused on particular regions instead of being region-independent,
and a possible solution would be mirroring archives centrally. Finally he
reiterates the often acknowledged fact that language documentation receives
less academic recognition than journal articles. To help change this and
encourage more language documentation work, one outcome of the workshop is
that LD&C will have a special section for the review of online language
documentations in the future.

As Seifart notes in the introduction, the authors of the five chapters in
“Part One: Methods” address the central question of ''How do computational
methods developed for large corpora of well-known languages apply to the
relatively small language documentation corpora of less well-known
languages?'' They do this in a variety of ways -- Chapters 2, 3 and 5 call for
carefully planning future corpora and annotation schemes, while Chapters 4 and
6 lay out automatic methods that work well on small corpora (despite
statistical techniques largely needing large amounts of input) to help cut
down on time spent on annotation.

In Chapter 2 (7-16), ''Prospects for e-grammars and endangered languages
corpora'', Sebastian Drude explores three aspects of current computational
documentation research -- hypertext grammars, treebanks and the Grammar Matrix
project (Bender et al. 2010), and interoperable grammars. He covers them
briefly before exploring how they could be used together to provide more
comprehensive grammatical descriptions and more comparable corpora, and
ultimately a better understanding of language.

In Chapter 3 (17-24), Jost Gippert covers current difficulties with marking up
corpus data for minority endangered languages that display large amounts of
code switching, in ''Language-specific encoding in endangered language
corpora.'' He uses examples of code switching in three Caucasian languages to
point out problems and questions that arise from fine-grained language
identification in corpora, as well as to show how the emerging ISO standard
639-6 could be used to help with accurate demarcation of languages and

In Chapter 4 (25-31), ''Unsupervised morphological analysis of small corpora:
First experiments with Kilivila'', Amit Kirschenbaum, Peter Wittenburg, and
Gerhard Heyer develop a method for unsupervised (statistical) morphological
analysis and annotation of small corpora from low resource languages, using a
word co-occurrence model to find statistically relevant groupings of words
(which are either etymologically or morphosyntactically similar), then
aligning them using multiple sequence alignment (a method from bioinformatics)
to find regularities. The method performs better than random, and they briefly
discuss how they plan to integrate it with other methods in the future.

In Chapter 5 (32-38), ''A corpus linguistics perspective on language
documentation, data, and the challenge of small corpora'', Anke Lüdeling
discusses how a flexible corpus architecture is necessary for low resource
language corpora. There are many parameters which influence variant choices
speakers make, and the metadata which may be relevant for future understanding
of the language is not always immediately clear. As such, she argues that it
is important to design corpora to which annotation layers and metadata can be
added at any point.

In Chapter 6 (39-45), ''Supporting linguistic research using generic automatic
audio/video analysis'', Oliver Schreer and Daniel Schneider describe several
automatic tools that could be used to expedite annotation processes for
audio/visual corpora, developed in the AVATech project (Auer et al. 2010,
Tschöpel et al. 2011). In particular, they cover tools developed for audio
segmentation, speech detection, speaker clustering, vowel and pitch contour
detection, shot/cut detection and key frames extraction, global motion
detection, skin color estimation, head and hands tracking, and user
interaction. They also compare the use of these tools to manual analysis in a
preliminary experiment, highlighting the time that using these tools will save

In “Part Two: Analyses”, the editors include chapters addressing the question
of ''What impact has language documentation had on analyses and theorizing in
linguistics and related disciplines so far and how can it make greater
impact?'' Reflecting this, each paper is in some sense a call for better
documentation practices, and each uses examples to point out how documentation
has helped illuminate an issue, or how adoption of best practices for
documentation could benefit the field as a whole.

In Chapter 7 (46-53), titled ''Bilingual multimodality in language
documentation data'', Marianne Gullberg raises current unanswered questions in
bilingualism and second language acquisition research, as well as drawing
attention to gaps in documentation of multimodality in languages. She points
out that there are large gaps that must be filled in current documentation
practices and research, and calls for more joint ventures and collaborative
interdisciplinary work to further our knowledge of existing data, and to
inform new data collection.

In Chapter 8 (54-63), ''Tours of the past through the present of eastern
Indonesia'', Marian Klamer looks at new documentation of minority languages in
a specific region to shed light on their origins and history, and to highlight
cases where new efforts in documentation can explain historical phenomena and
language phylogenies. In particular, she is able to provide a convincing
argument regarding the origin of Alorese, an Austronesian language spoken on
Pantar and Alor, by comparing it to Lamaholot, a language spoken 200 km away,
and by looking with fresh eyes at ethnological evidence.

In Chapter 9 (64-72), ''Data from language documentations in research on
referential hierarchies'', Stefan Schnell uses a textual analysis of the
Oceanic language Vera'a to examine the referential hierarchy, particularly
involving number and object marking. His textual analysis is used in contrast
to traditional structural descriptions and elicited data, and he examines how
this method could potentially be used across corpora to expedite research.

In Chapter 10 (73-82), ''Information structure, variation and the Referential
Hierarchy'', Jane Simpson also looks at the referential hierarchy, using the
Australian language Arrernte, which exhibits a putative counterexample to a
proposed typological universal. She calls for larger corpora of texts, linked
to audio-visual recordings, in order to fully observe and record languages
which may be typologically interesting -- particularly those like Arrernte
which are undergoing or have recently undergone massive changes. She fully
explains here how better documentation may have given a fuller understanding
of the language and the typological feature in question.

In Chapter 11 (83-89), ''How to measure frequency? Different ways of counting
ergatives in Chintang (Tibeto-Burman, Nepal) and their implications'', Sabine
Stoll and Balthasar Bickel discuss the best way to measure frequency. By
looking at different options -- raw numbers per age in months or ergatives per
word, per transitive verb, or per time unit -- they come to the conclusion
that using time-alignment and measuring frequency in a given time window is
the most psychologically relevant way to count, instead of the standard
frequency of a feature given the opportunity for it.
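
The difference between these ways of counting can be illustrated with a
minimal sketch. It is not the authors' procedure, and every name and number
in it is a hypothetical value chosen for illustration: two invented recording
sessions contain the same number of ergatives in the same amount of time, but
differ in how many words and transitive verbs they contain.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One recording session; all values used below are invented."""
    ergatives: int         # ergative markers produced
    words: int             # total word tokens
    transitive_verbs: int  # transitive verbs, i.e. opportunities for an ergative
    minutes: float         # length of the recording in minutes


def frequency_measures(s: Session) -> dict:
    """Return the same ergative count under four different denominators."""
    return {
        "raw": float(s.ergatives),
        "per_word": s.ergatives / s.words,
        "per_transitive_verb": s.ergatives / s.transitive_verbs,
        "per_minute": s.ergatives / s.minutes,
    }


# Hypothetical sessions: identical raw counts and durations,
# but different amounts of speech overall.
session_a = Session(ergatives=30, words=1500, transitive_verbs=100, minutes=60.0)
session_b = Session(ergatives=30, words=600, transitive_verbs=50, minutes=60.0)

measures_a = frequency_measures(session_a)
measures_b = frequency_measures(session_b)

for name in measures_a:
    print(f"{name}: A={measures_a[name]:.3f}  B={measures_b[name]:.3f}")
```

In this invented case the per-minute measure treats the two sessions as
equal, while the per-word and per-transitive-verb measures make session B
look richer in ergatives, so the choice of denominator changes the
conclusion drawn from the same raw count.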

In Chapter 12 (90-95), ''On the sociolinguistic typology of linguistic
complexity loss'', Peter Trudgill points out that small, minority, and often
endangered languages have been shaped by different socio-structural
conditions than the larger languages upon which most of modern linguistic
theory is built, and that these conditions influence their typological
complexity. He argues that they are generally more mature, more complex, and
spoken in more intimate societies than more global languages, and raises a
call to arms for linguists to document minority languages -- or else the only
languages left to study will be historically atypical languages.

In “Part Three: Utilization”, the central unifying question is ''How can
language documentation data be stored, represented, and made accessible in
order to be utilized in a broader context?'' The chapters here range from
guides for linguists in the field (Chapters 14-16), to an outline of the new
DoBeS portal (Chapter 17), to different tools that can be used by linguists
now (Chapters 13, 18).

In Chapter 13 (96-104), ''Visualization and online presentation of linguistic
data'', Hans-Jörg Bibiko uses R, open-source statistical and graphics software,
to show how wordlists, structural features (such as those from WALS (Dryer &
Haspelmath 2011)), and geographical information can be easily graphed. He
gives a good, brief overview of the possibilities R presents to linguists.

In Chapter 14 (105-110), ''Language archives: They’re not just for linguists
any more'', Gary Holton describes how the Alaska Native Language Archive
(ANLA, http://www.uaf.edu/anla) has been useful not just for linguists, but
has also been queried for non-linguistic data, such as ethnoastronomy,
ethnomusicology, and ethnobotany. He uses the example of Eyak, a severely
endangered language undergoing revitalisation, to illustrate how archives can
be useful to language communities. He calls for archives to be constructed to
allow for these two types of use.

In Chapter 15 (111-117), ''Creating educational materials in language
documentation projects – creating innovative resources for linguistic
research'', Ulrike Mosel presents a way that linguists can work with a
community, helping to produce educational materials while also building a
language documentation corpus. She gives an overview of work done creating a
book of local stories following this method in Teop, an Oceanic
Meso-Melanesian language spoken in Papua New Guinea.

In Chapter 16 (118-125), ''From language documentation to language planning:
Not necessarily a direct route'', Julia Sallabank looks at common difficulties
arising in language planning, policy implementation, and documentation, using
Guernesiais as an example. In particular, she highlights when the views of all
stakeholders -- not just the native speakers, but also semi- and heritage
speakers -- are as valid as those of documentary linguists or language

In Chapter 17 (126-128), ''Online presentation and accessibility of endangered
languages data: The General Portal to the DoBeS Archive'', Gabriele Schwiertz
gives an overview of the DoBeS online portal, which can be found at
http://www.mpi.nl/dobes. The hope is that the DoBeS online portal will allow
the resource to be used more easily and regularly.

In Chapter 18 (129-134), ''Using language documentation data in a broader
context'', Nick Thieberger concludes by discussing the scale of current global
documentation efforts, and ways to ensure longevity of linguistic archives. He
covers how digital data should be stored and curated, what standards are
available and accepted, and presents a call for more effort on all sides,
from training new linguists to maintaining old archives, in order for language
data (and ultimately languages) to not be lost.

The workshop from which these papers grew was held in order to ''critically
discuss and make more explicit the threefold potentials of language
documentation,'' as Frank Seifart states in the introduction. The three
potentials -- computational methods, analyses, and utilization -- are clearly
evident, in that each chapter deals with one or more of them. The collection
was organised into three parts around these potentials, and each section
responds to specific questions that each potential raises. On the whole, this
worked moderately well. However, the collection still reads like workshop
proceedings with a loose theme rather than a fully coherent volume. Some of
the papers -- such as Chapter 13 on R and Chapter 17 on DoBeS -- struggle to
fit with others, for example Chapter 16 on language policy and planning in

That said, the chapters cover many of the pressing issues facing the language
documentation community today, and many are spot on in calling for renewed or
focused efforts -- for instance, in carefully considering frequency measures
as in Chapter 11, or in planning a corpus as in Chapter 5. Many chapters
feature detailed examples from particular languages, which provide a framework
for the linguist or student reading to easily interpret how the central
message could be applied to their own research. At times it is clear that
language communities and non-linguists may be able to use this work themselves
-- for instance, Chapter 15 has ideas for starting revitalisation efforts. On
the whole, this volume is approachable, timely, and useful for anyone involved
in language documentation efforts.

Richard Littauer is a graduate student in Computational Linguistics, studying
for a joint degree at the University of Malta and Saarland University. He
completed an MA (Hons) in Linguistics at the University of Edinburgh. His main
research interests include minority language documentation and conservation,
particularly involving developing resources for low-resource languages, as
well as understanding language change on a historical and evolutionary

