LINGUIST List 13.3368

Thu Dec 19 2002

Review: Corpus Linguistics: Borin (2002)

Editor for this issue: Naomi Ogasawara <naomi@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at simin@linguistlist.org.

Directory

  1. Yang Shouxun, Borin (2002), Parallel Corpora, Parallel Worlds

Message 1: Borin (2002), Parallel Corpora, Parallel Worlds

Date: Wed, 18 Dec 2002 23:06:54 +0000
From: Yang Shouxun <yangsx@fltrp.com>
Subject: Borin (2002), Parallel Corpora, Parallel Worlds

Borin, Lars, ed. (2002) Parallel Corpora, Parallel Worlds: Selected
Papers from a Symposium on Parallel and Comparable Corpora at Uppsala
University, Sweden, 22-23 April, 1999. Rodopi, vii+220pp, hardback
ISBN 9042015306, $50. Language and Computers: Studies in Practical
Linguistics No 43.

Book Announcement on Linguist:
http://linguistlist.org/get-book.html?BookID=3542
http://linguistlist.org/issues/13/13-2104.html


YANG Shouxun, Foreign Language Teaching and Research Press, Beijing

SYNOPSIS

This book contains a collection of papers from a symposium devoted to
all aspects of parallel and comparable corpora as an effort ''to bring
together parallel corpus researchers for an exchange of experiences
and ideas'' (preface). In addition to a general introduction to the
papers in the collection and an overview of the field, it contains 12
papers grouped into 4 topical sections: (1) 3 general presentations of
parallel and comparable corpus projects, (2) 3 discussions of specific
linguistic applications of parallel and comparable corpora, (3) 4
descriptions of computational tools for parallel corpus linguistics,
and (4) 2 papers on parallel corpus annotation. The target audience
is thus researchers and students involved in parallel corpus
projects.

'... and never the twain shall meet?' (1-43), the general introduction
by Lars Borin, covers many aspects of the field and clears the ground
for the papers to follow. The discussion of various commonly used
terms related to parallel corpora is helpful to avoid confusion in
terminology, as researchers often use slightly different terms to suit
their own needs. Borin also discusses two different traditions in
depth and advocates cooperation between researchers of the plain
corpus linguistics tradition and those of the computational corpus
linguistics tradition. Section 6 on creating and processing parallel
corpora is particularly informative. There is a lengthy list of
references, including some from the last three years, obviously added
after the conference.

'Towards a multilingual corpus for contrastive analysis and
translation studies' (47-60), by Stig Johansson, illustrates how to do
contrastive analysis and translation studies with the English verb
''spend'' and its correspondences in Norwegian and German using the
Oslo Multilingual Corpus.

'The PLUG project: parallel corpora in Linkoeping, Uppsala, Goeteborg:
aims and achievements' (61-78), by Anna Hein, is an overview of the
PLUG project, including a description of Uppsala Word Aligner, which
may eventually be combined with Linkoeping Word Aligner into the Plug
Word Aligner.

'The Uppsala Student English Corpus (USE): a multi-faceted resource
for research and course development' (79-90), by Margareta Axelsson
and Ylva Berglund, presents the composition of the learner corpus of
Swedish English and the on-going compilation process and suggests ways
to exploit the corpus for research, teaching, course evaluation and
course development.

'How can linguists profit from parallel corpora?' (93-109), by Raphael
Salkie, raises questions that parallel corpora enable linguists to ask
and outlines some methods for doing linguistic and translation studies
with parallel corpora.

'Parallel corpora as tools for investigating and developing minority
languages' (111-122), by Trond Trosterud, is a discussion of how
parallel corpora can be used in grammatical documentation,
lexicographic and terminological language planning for minority
languages.

'Reversing a Swedish-English dictionary for the Internet' (123-133),
by Christer Geisler, describes the experiment in some detail, and
proposes to use examples from parallel corpora as authentic language
examples in the reversed dictionary.

'Multilingual corpus-based extraction and the Very Large Lexicon'
(137-149), by Gregory Grefenstette, introduces different levels of
computational processing of corpus texts (called ''computational
linguistic abstractions'' in the paper), and shows how to store the
information in a Very Large Lexicon with a sentence from a
Swedish-English bitext and how to use the Lexicon for multilingual
term translation with the WWW as the resource for examples.

The PLUG Link Annotator, together with the Link Scorer, is presented
in 'The PLUG Link Annotator -- Interactive construction of data from
parallel corpora' (151-168), by Magnus Merkel, Mikael Andersson and
Lars Ahrenberg. The annotator is an interactive program for creating
reference word lists that can be used to measure the performance of a
word alignment program automatically.
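
For readers unfamiliar with this kind of evaluation, the idea of
scoring an aligner against a reference list can be sketched in a few
lines of Python. This is only a generic precision/recall calculation
under my own assumptions; it is not the measure implemented in the
PLUG Link Scorer, and the function name and the word pairs are
invented for illustration.

```python
def score_alignment(reference, proposed):
    """Compare proposed word links with a reference list.

    Both arguments are collections of (source_word, target_word) pairs.
    Returns (precision, recall, f1). Hypothetical helper, not PLUG code.
    """
    reference, proposed = set(reference), set(proposed)
    correct = len(reference & proposed)  # links found in both sets
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented Swedish-English links, for illustration only.
reference_links = {("hus", "house"), ("bok", "book"), ("bil", "car")}
proposed_links = {("hus", "house"), ("bok", "car"), ("bil", "car")}

precision, recall, f1 = score_alignment(reference_links, proposed_links)
print(precision, recall, f1)
```

Here two of the three proposed links also appear in the reference
list, so precision and recall are both two thirds.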

'Building and processing a multilingual corpus of parallel texts'
(169-179), by Peter Stahl, demonstrates the use of the Tuebingen
System of Text Processing Program (TUSTEP). Examples from the
Finnish-German parallel corpus are used to show how to write the
instructions for TUSTEP and the results are presented. Aligned output
of texts in German, English, Italian and Finnish is shown as
PostScript as well as HTML files.

'Uplug -- a modular corpus tool for parallel corpora' (181-197), by
Joerg Tiedemann, introduces the Uplug system, which includes three
components for data management, application management and user
interaction, respectively.

'Part-of-speech tagging for Swedish' (201-206), by Klas Pruetz,
describes a tagger for Swedish based on Brill (1992) and suggests
methods to improve the performance via extending the lexicon. It
concludes that the best result can be ''achieved when the complete
tagset was used and then converted to the limited one'' (p.206).

'Alignment and tagging' (207-218), by Lars Borin, reports an
experiment using German texts part-of-speech tagged by Morphy (Lezius
et al. 1998) in combination with the word alignment system Uplug (as
described by Tiedemann, this volume) to obtain a (partial) tagging for
Swedish, and reviews linguistic work on the typology of part-of-speech
systems.
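
The general idea of carrying tags across word links can be sketched as
follows. This is a minimal illustration under my own assumptions, not
the procedure used in the paper: the function name, the tag labels and
the example sentence pair are all invented, and neither Morphy's
tagset nor Uplug's output format is reproduced here.

```python
def project_tags(source_tags, links, target_length):
    """Copy tags from source tokens to the target tokens linked to them.

    source_tags: one tag per source token
    links: (source_index, target_index) pairs from a word aligner
    Unlinked target tokens stay None, so the tagging is partial.
    """
    target_tags = [None] * target_length
    for source_index, target_index in links:
        if target_tags[target_index] is None:  # keep the first tag seen
            target_tags[target_index] = source_tags[source_index]
    return target_tags


# Invented example: German "das Haus ist rot",
# Swedish "huset är mycket rött" (the third word has no link).
german_tags = ["ART", "N", "V", "ADJ"]
word_links = [(1, 0), (2, 1), (3, 3)]

swedish_tags = project_tags(german_tags, word_links, 4)
print(swedish_tags)  # ['N', 'V', None, 'ADJ']
```

The unlinked Swedish word receives no tag, which is one way such a
tagging comes out partial.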

CRITICAL EVALUATION

Few books in the field of computational linguistics get reviewed on
the LINGUIST site, which may indicate a general lack of interest. That
would also explain why Maxwell began his review of Melamed (2001) by
asking the question ''why should linguists care?'' and answering it.
Salkie raises the same question and suggests possible uses in
theoretical and empirical linguistic studies.

Borin summarizes the most common uses of parallel and comparable
corpora as follows (p. 14): (1) for contrastive and typological
grammatical and lexicographical studies in linguistics, (2) for
knowledge acquisition for machine translation in computational
linguistics, and (3) as a source of authentic contrastive language
data in language learning and teaching. A more detailed list would be
more useful to linguists, as many works touch on this topic but a
comprehensive presentation is still lacking.

Many linguists may already be convinced of the usefulness of
(parallel) corpora. What is not so certain is how to make use of
(parallel) corpora. Without a background in computing, (parallel)
corpus linguistics is foreign territory to most linguists. As Borin
points out in the introduction, there are two traditions in corpus
linguistics (plain and computational corpus linguistics). The first
tradition ''tends to be located in university language departments
(often English departments), and in which the emphasis is on the
construction and use of [parallel] corpora for the investigation of
linguistic phenomena for such purposes as traditional lexicography,
second and foreign language pedagogy, or grammatical description for
human consumption'' (p. 6). The second tradition ''has emerged more
recently in computational linguistics, partly as an effect of a
reawakened interest in probabilistic methods'' (p. 6). According to
Borin, computational corpus linguistics is within 'theory of
linguistic computation', the first of the tripartite divisions of
computational linguistics by Gazdar (1996), involving ''the study of
the computational, mathematical and statistical properties of natural
languages and systems for processing natural languages'' (Gazdar
1996:2), ''even though there are often practical applications - i.e.,
belonging in the third subarea -- in the minds of the researchers
working in this field'' (p. 6). I'd expect computational corpus
linguistics to include the third subarea. Grefenstette's paper in this
volume would be a good example of my point. I totally agree with the
editor and contributors that researchers of both traditions should
come together for ''an exchange of experiences and ideas'', and
explore the possibilities of cooperation.

Borin's overview is broad in coverage, with numerous pointers to
related research works in the field. It serves as a good starting
point for further exploration in parallel corpus linguistics.

Papers in Parts I and II fall in the plain corpus linguistics
tradition, while papers in Parts III and IV fall in the computational
tradition. This is exactly what the title of Borin's overview hints
at: communication and cooperation between researchers of the two
traditions. However, even the papers in Parts III and IV slant towards
the plain corpus linguistics tradition.

In the discussion on terminology in the overview, Merkel (1999: 11) is
cited as providing a taxonomy of corpora which may be considered to
fall in the general category of parallel corpus: diachronic corpus,
transcription corpus, target variant corpus, translation corpus,
multi-target corpus, mixed source corpus, text type corpus, and mixed
text type corpus. Yet the taxonomy is flawed in that these are
classified along different dimensions. They may overlap with each
other and one may even be a subset of another.

Presentations of parallel corpus projects and exemplary uses of
parallel corpora in Parts I and II can be a very useful starting point
and guide for others who plan to launch similar projects. The
presentation of parallel corpus processing tools in Parts III and IV
can be of similar use. I am not going to comment on the computational
side, though, since technical details are unlikely to interest most
fellow linguists and the algorithms are only sketched in the papers.

I think the book serves its purpose well. You can find much
information in this collection of papers, though most papers are
introductory in nature. The contributors basically belong to the plain
corpus linguistics tradition, and the book is for fellow linguists
rather than computer scientists or engineers. The book covers the
field widely: one can find an overview of the field as well
as introductory papers on some specific parallel corpus projects,
presentations of tools for parallel corpus processing, and
demonstrations of various uses of parallel corpora. If you want to get
an idea of what people (at least those in Scandinavia) did in parallel
corpus linguistics and some interesting examples of using parallel
corpora, this book is for you. If you want a detailed and in-depth
account of specific topics, or if you are computationally oriented and
want to find out what algorithms you can use in your own parallel
corpus processing, this book can be the starting point but you should
eventually turn to more technical works, for instance Veronis (2000)
for alignment.

REFERENCES

Veronis, Jean (2000). Parallel Text Processing: Alignment and Use of
Translation Corpora. Kluwer Academic.

Melamed, I. Dan (2001). Empirical Methods for Exploiting Parallel 
Texts. MIT Press. (Linguist List reviews at 
http://linguistlist.org/issues/12/12-1755.html by Mike Maxwell and at
http://linguistlist.org/issues/12/12-1707.html by Constantin Orasan)

ABOUT THE REVIEWER

YANG Shouxun is a research fellow at Foreign Language Teaching and
Research Press, China. He is currently working on his thesis on
automatic extraction of collocations. His research interests include
Generative Grammar and natural language processing.