Review of Parallel corpora, parallel worlds.
Date: Tue, 17 Dec 2002 14:05:41 +0800
From: Yang Shouxun
Subject: Borin (2002) Parallel Corpora, Parallel Worlds
Borin, Lars, ed. (2002) Parallel Corpora, Parallel Worlds: Selected
Papers from a Symposium on Parallel and Comparable Corpora at Uppsala
University, Sweden, 22-23 April, 1999. Rodopi, vii+220pp, hardback
ISBN 9042015306, $50. Language and Computers: Studies in Practical
Linguistics No 43.
Book Announcement on Linguist:
YANG Shouxun, Foreign Language Teaching and Research Press, Beijing
This book contains a collection of papers from a symposium devoted to
all aspects of parallel and comparable corpora as an effort "to bring
together parallel corpus researchers for an exchange of experiences
and ideas" (preface). In addition to a general introduction to the
papers in the collection and an overview of the field, it contains 12
papers grouped into 4 topical sections:
(1) 3 general presentations of parallel and comparable corpus projects,
(2) 3 discussions of specific linguistic applications of parallel and
comparable corpora,
(3) 4 descriptions of computational tools for parallel corpus
processing, and
(4) 2 papers on parallel corpus annotation.
The target audience is thus researchers and students involved in
parallel corpus projects.
'... and never the twain shall meet?' (1-43), the general
introduction by Lars Borin, covers many aspects of the field and
clears the ground for the papers to follow. The discussion of
various commonly used terms related to parallel corpora is helpful
to avoid confusion in terminology, as researchers often use slightly
different terms to suit their own needs. Borin also discusses two
different traditions in depth and advocates cooperation between
researchers of the plain corpus linguistics tradition and those of the
computational corpus linguistics tradition. Section 6 on creating and
processing parallel
corpora is particularly informative. There is a lengthy list of
references, including some from the last 3 years, obviously added after
the symposium.
'Towards a multilingual corpus for contrastive analysis and
translation studies' (47-60), by Stig Johansson, illustrates how to
do contrastive analysis and translation studies with the English verb
"spend" and its correspondences in Norwegian and German using the
Oslo Multilingual Corpus.
'The PLUG project: parallel corpora in Linköping, Uppsala,
Göteborg: aims and achievements' (61-78), by Anna Hein, is an
overview of the PLUG project, including a description of Uppsala Word
Aligner, which may eventually be combined with Linköping Word
Aligner into the PLUG Word Aligner.
'The Uppsala Student English Corpus (USE): a multi-faceted resource
for research and course development' (79-90), by Margareta Axelsson
and Ylva Berglund, presents the composition of the learner corpus of
Swedish English and the on-going compilation process and suggests
ways to exploit the corpus for research, teaching, course evaluation
and course development.
'How can linguists profit from parallel corpora?' (93-109), by
Raphael Salkie, raises questions parallel corpora enable linguists
to ask and outlines some methods to do linguistic and translation
studies with parallel corpora.
'Parallel corpora as tools for investigating and developing minority
languages' (111-122), by Trond Trosterud, is a discussion of how
parallel corpora can be used in grammatical documentation and in
lexicographic and terminological language planning for minority
languages.
'Reversing a Swedish-English dictionary for the Internet' (123-133),
by Christer Geisler, describes the experiment in some detail, and
proposes to use examples from parallel corpora as authentic language
examples in the reversed dictionary.
'Multilingual corpus-based extraction and the Very Large Lexicon'
(137-149), by Gregory Grefenstette, introduces different levels of
computational processing of corpus texts (called "computational
linguistic abstractions" in the paper), and shows how to store the
information in a Very Large Lexicon with a sentence from a
Swedish-English bitext and how to use the Lexicon for multilingual
term translation with the WWW as the resource for examples.
The PLUG Link Annotator, together with the Link Scorer, is presented
in 'The PLUG Link Annotator -- Interactive construction of data from
parallel corpora' (151-168), by Magnus Merkel, Mikael Andersson and
Lars Ahrenberg. The annotator is an interactive program for creating
reference word lists, which can then be used to measure the
performance of a word alignment program automatically.
'Building and processing a multilingual corpus of parallel texts'
(169-179), by Peter Stahl, demonstrates the use of the Tübingen
System of Text Processing Programs (TUSTEP). Examples from the
Finnish-German parallel corpus are used to show how to write
instructions for TUSTEP, and the results are presented. Aligned output
of texts in German, English, Italian and Finnish is shown as
PostScript as well as HTML files.
'Uplug -- a modular corpus tool for parallel corpora' (181-197), by
Jörg Tiedemann, introduces the Uplug system, which includes three
components for data management, application management and user
'Part-of-speech tagging for Swedish' (201-206), by Klas Prütz,
describes a tagger for Swedish based on Brill (1992) and suggests
methods to improve the performance via extending the lexicon. It
concludes that the best result can be "achieved when the complete
tagset was used and then converted to the limited one" (p.206).
'Alignment and tagging' (207-218), by Lars Borin, reports an
experiment that combines German texts part-of-speech tagged by Morphy
(Lezius et al. 1998) with the word alignment system Uplug (as
described by Tiedemann in this volume) to obtain a (partial)
tagging for Swedish and reviews linguistic works on the typology of
Few books in the field of computational linguistics get reviewed
at the LINGUIST site, which may indicate a general lack of interest.
That may also explain why Maxwell started his review of
Melamed (2001) by asking the question "why should linguists care?"
and answering it. Salkie raises the same question and suggests
possible uses in theoretical and empirical linguistic studies.
Many works touch on the uses of parallel and comparable corpora, but a
comprehensive presentation is still lacking, and a more detailed list
would be more useful to linguists. Borin summarizes the most common
uses as follows (p. 14):
(1) for contrastive and typological grammatical and lexicographical
studies in linguistics,
(2) for knowledge acquisition for machine translation in
computational linguistics, and
(3) as a source of authentic contrastive language data in language
learning and teaching.
Many linguists may already be convinced of the usefulness of
(parallel) corpora. What is not so certain is how to make use of
them. For linguists without a background in computing, (parallel)
corpus linguistics is foreign territory. As Borin
points out in the introduction, there are two traditions in corpus
linguistics (plain and computational corpus linguistics). The first
tradition "tends to be located in university language departments
(often English departments), and in which the emphasis is on the
construction and use of [parallel] corpora for the investigation of
linguistic phenomena for such purposes as traditional lexicography,
second and foreign language pedagogy, or grammatical description for
human consumption" (p. 6). The second tradition "has emerged more
recently in computational linguistics, partly as an effect of a
reawakened interest in probabilistic methods" (p. 6). According to
Borin, computational corpus linguistics falls within the 'theory of
linguistic computation', the first of the three subareas into which
Gazdar (1996) divides computational linguistics, involving "the study of
the computational, mathematical and statistical properties of natural
languages and systems for processing natural languages"
(Gazdar 1996:2), "even though there are often practical applications
-- i.e., belonging in the third subarea -- in the minds of the
researchers working in this field" (p. 6). I'd expect computational
corpus linguistics to include the third subarea. Grefenstette's paper
in this volume would be a good example of my point. I totally agree
with the editor and contributors that researchers of both traditions
should come together for "an exchange of experiences and ideas",
and explore the possibilities of cooperation.
Borin's overview is broad in coverage, with numerous pointers to
related research works in the field. It serves as a good starting
point for further exploration in parallel corpus linguistics.
Papers in Parts I and II fall in the plain corpus linguistics
tradition, while those in Parts III and IV fall in the computational
tradition. That is exactly one of the things the title of Borin's
overview hints at: communication and cooperation between researchers
of the two traditions. However, even the papers in Parts III and IV
slant towards the plain corpus linguistics tradition.
In the discussion on terminology in the overview, Merkel (1999: 11)
is cited as providing a taxonomy of corpora which may be considered
to fall in the general category of parallel corpus: diachronic
corpus, transcription corpus, target variant corpus, translation
corpus, multi-target corpus, mixed source corpus, text type corpus,
and mixed text type corpus. Yet the taxonomy is flawed in that these
types are classified along different dimensions: they may overlap with
each other, and one may even be a subset of another.
The presentations of parallel corpus projects and exemplary uses of
parallel corpora in Parts I and II can be a very useful starting point
and guide for others who are about to launch similar projects. The
presentation of parallel corpus processing tools in Parts III and IV
can be of similar use. I am not going to comment on the computational
side, though, since technical details are not interesting to most
fellow linguists and the algorithms are only sketchily presented in
the papers.
I think the book serves its purpose well. You can find much
information in this collection of papers, though most papers are
introductory in nature. The contributors basically belong to the
plain corpus linguistics tradition, and the book is for fellow
linguists rather than computer scientists or engineers. This book has a
wide coverage of the field: one can find an overview of the field as
well as introductions to some specific parallel corpus
projects, presentations of tools for parallel corpus processing, and
demonstrations of various uses of parallel corpora. If you want to
get an idea of what people (at least those in Scandinavia) have done in
parallel corpus linguistics and see some interesting examples of using
parallel corpora, this book is for you. If you want a detailed and
in-depth account of specific topics, or if you are computationally
oriented and want to find out what algorithms you can use in your own
parallel corpus processing, this book can be the starting point but
you should eventually turn to more technical works, for instance
Véronis (2000) for alignment.
Véronis, Jean (2000). Parallel Text Processing: Alignment and Use
of Translation Corpora. Kluwer Academic.
Melamed, I. Dan (2001). Empirical Methods for Exploiting Parallel
Texts. MIT Press. (Linguist List reviews at
http://linguistlist.org/issues/12/12-1755.html by Mike Maxwell and at
http://linguistlist.org/issues/12/12-1707.html by Constantin Orasan)
ABOUT THE REVIEWER:
YANG Shouxun is a research fellow at Foreign Language Teaching and Research Press, China. He is currently working on his thesis on automatic extraction of collocations. His research interests include Generative Grammar and natural language processing.