Review of Parallel Corpora, Parallel Worlds
Review:
Date: Tue, 17 Dec 2002 14:05:41 +0800
From: Yang Shouxun
Subject: Borin (2002) Parallel Corpora, Parallel Worlds
Borin, Lars, ed. (2002) Parallel Corpora, Parallel Worlds: Selected Papers from a Symposium on Parallel and Comparable Corpora at Uppsala University, Sweden, 22-23 April, 1999. Rodopi, vii+220pp, hardback ISBN 9042015306, $50. Language and Computers: Studies in Practical Linguistics No 43.
Book Announcement on Linguist: http://linguistlist.org/issues/13/13-2104.html
YANG Shouxun, Foreign Language Teaching and Research Press, Beijing
SYNOPSIS
This book collects papers from a symposium devoted to all aspects of parallel and comparable corpora, held as an effort "to bring together parallel corpus researchers for an exchange of experiences and ideas" (preface). In addition to a general introduction to the papers in the collection and an overview of the field, it contains 12 papers grouped into 4 topical sections: (1) 3 general presentations of parallel and comparable corpus projects, (2) 3 discussions of specific linguistic applications of parallel and comparable corpora, (3) 4 descriptions of computational tools for parallel corpus linguistics, and (4) 2 papers on parallel corpus annotation. The target audience is thus researchers and students involved in parallel corpus projects.
'... and never the twain shall meet?' (1-43), the general introduction by Lars Borin, covers many aspects of the field and clears the ground for the papers to follow. The discussion of commonly used terms related to parallel corpora helps to avoid terminological confusion, as researchers often use slightly different terms to suit their own needs. Borin also discusses two different traditions in depth and advocates cooperation between researchers of the plain corpus linguistics tradition and those of the computational corpus linguistics tradition. Section 6, on creating and processing parallel corpora, is particularly informative. There is a lengthy list of references, including some from the last 3 years, obviously added after the conference.
'Towards a multilingual corpus for contrastive analysis and translation studies' (47-60), by Stig Johansson, illustrates how to do contrastive analysis and translation studies, taking as its example the English verb "spend" and its correspondences in Norwegian and German in the Oslo Multilingual Corpus.
'The PLUG project: parallel corpora in Linköping, Uppsala, Göteborg: aims and achievements' (61-78), by Anna Hein, is an overview of the PLUG project, including a description of the Uppsala Word Aligner, which may eventually be combined with the Linköping Word Aligner into the PLUG Word Aligner.
'The Uppsala Student English Corpus (USE): a multi-faceted resource for research and course development' (79-90), by Margareta Axelsson and Ylva Berglund, presents the composition of the learner corpus of Swedish English and the on-going compilation process and suggests ways to exploit the corpus for research, teaching, course evaluation and course development.
'How can linguists profit from parallel corpora?' (93-109), by Raphael Salkie, raises questions that parallel corpora enable linguists to ask and outlines some methods for doing linguistic and translation studies with parallel corpora.
'Parallel corpora as tools for investigating and developing minority languages' (111-122), by Trond Trosterud, discusses how parallel corpora can be used in grammatical documentation and in lexicographic and terminological language planning for minority languages.
'Reversing a Swedish-English dictionary for the Internet' (123-133), by Christer Geisler, describes the experiment in some detail, and proposes to use examples from parallel corpora as authentic language examples in the reversed dictionary.
'Multilingual corpus-based extraction and the Very Large Lexicon' (137-149), by Gregory Grefenstette, introduces different levels of computational processing of corpus texts (called "computational linguistic abstractions" in the paper), and shows how to store the information in a Very Large Lexicon with a sentence from a Swedish-English bitext and how to use the Lexicon for multilingual term translation with the WWW as the resource for examples.
The PLUG Link Annotator, together with the Link Scorer, is presented in 'The PLUG Link Annotator -- Interactive construction of data from parallel corpora' (151-168), by Magnus Merkel, Mikael Andersson and Lars Ahrenberg. The annotator is an interactive program for creating reference word lists, which are then used to measure the performance of a word alignment program automatically.
'Building and processing a multilingual corpus of parallel texts' (169-179), by Peter Stahl, demonstrates the use of the Tübingen System of Text Processing Program (TUSTEP). Examples from the Finnish-German parallel corpus are used to show how to write the instructions for TUSTEP, and the results are presented. Aligned output of texts in German, English, Italian and Finnish is shown as PostScript as well as HTML files.
'Uplug -- a modular corpus tool for parallel corpora' (181-197), by Jörg Tiedemann, introduces the Uplug system, which includes three components for data management, application management and user interaction, respectively.
'Part-of-speech tagging for Swedish' (201-206), by Klas Prütz, describes a tagger for Swedish based on Brill (1992) and suggests methods to improve the performance via extending the lexicon. It concludes that the best result can be "achieved when the complete tagset was used and then converted to the limited one" (p.206).
'Alignment and tagging' (207-218), by Lars Borin, reports an experiment that combines German texts part-of-speech tagged by Morphy (Lezius et al. 1998) with the word alignment system Uplug (as described by Tiedemann, this volume) to obtain a (partial) tagging for Swedish, and reviews linguistic work on the typology of part-of-speech systems.
CRITICAL EVALUATION
Few books in the field of computational linguistics get reviewed at the LINGUIST site. This may indicate a general lack of interest. It also explains why Maxwell started his review of Melamed (2001) by asking the question "why should linguists care?" and answering it. Salkie raises the same question and suggests possible uses in theoretical and empirical linguistic studies.
Borin summarizes the most common uses of parallel and comparable corpora as follows (p. 14): (1) for contrastive and typological grammatical and lexicographical studies in linguistics, (2) for knowledge acquisition for machine translation in computational linguistics, and (3) as a source of authentic contrastive language data in language learning and teaching. A more detailed list would be more useful to linguists, as many works touch on this topic but a comprehensive presentation is still lacking.
Many linguists may already be convinced of the usefulness of (parallel) corpora. What is not so certain is how to make use of them. Without a background in computing, (parallel) corpus linguistics is foreign territory to most linguists. As Borin points out in the introduction, there are two traditions in corpus linguistics (plain and computational corpus linguistics). The first tradition "tends to be located in university language departments (often English departments), and in which the emphasis is on the construction and use of [parallel] corpora for the investigation of linguistic phenomena for such purposes as traditional lexicography, second and foreign language pedagogy, or grammatical description for human consumption" (p. 6). The second tradition "has emerged more recently in computational linguistics, partly as an effect of a reawakened interest in probabilistic methods" (p. 6). According to Borin, computational corpus linguistics falls within the 'theory of linguistic computation', the first of the tripartite divisions of computational linguistics by Gazdar (1996), involving "the study of the computational, mathematical and statistical properties of natural languages and systems for processing natural languages" (Gazdar 1996:2), "even though there are often practical applications -- i.e., belonging in the third subarea -- in the minds of the researchers working in this field" (p. 6). I would expect computational corpus linguistics to include the third subarea as well. Grefenstette's paper in this volume is a good example of my point. I totally agree with the editor and contributors that researchers of both traditions should come together for "an exchange of experiences and ideas", and explore the possibilities of cooperation.
Borin's overview is broad in coverage, with numerous pointers to related research works in the field. It serves as a good starting point for further exploration in parallel corpus linguistics.
Papers in Parts I and II fall in the plain corpus linguistics tradition, while papers in Parts III and IV fall in the computational tradition. That is one of the things the title of Borin's overview hints at: communication and cooperation between researchers of the two traditions. However, even the papers in Parts III and IV slant towards the plain corpus linguistics tradition.
In the discussion of terminology in the overview, Merkel (1999: 11) is cited as providing a taxonomy of corpora that may be considered to fall in the general category of parallel corpus: diachronic corpus, transcription corpus, target variant corpus, translation corpus, multi-target corpus, mixed source corpus, text type corpus, and mixed text type corpus. Yet the taxonomy is flawed in that these types are classified along different dimensions. They may overlap with each other, and one may even be a subset of another.
The presentations of parallel corpus projects and the exemplary uses of parallel corpora in Parts I and II can be a very useful starting point and guide for others who are about to launch similar projects. The presentations of parallel corpus processing tools in Parts III and IV can be of similar use. I am not going to comment on the computational side, though, since technical details are not interesting to most fellow linguists and the algorithms are only sketched in the papers.
I think the book serves its purpose well. You can find much information in this collection of papers, though most papers are introductory in nature. The contributors basically belong to the plain corpus linguistics tradition, and the book is for fellow linguists rather than computer scientists or engineers. The book has a wide coverage of the field: one can find an overview of the field as well as introductions to some specific parallel corpus projects, presentations of tools for parallel corpus processing, and demonstrations of various uses of parallel corpora. If you want to get an idea of what people (at least those in Scandinavia) did in parallel corpus linguistics, along with some interesting examples of using parallel corpora, this book is for you. If you want a detailed and in-depth account of specific topics, or if you are computationally oriented and want to find out what algorithms you can use in your own parallel corpus processing, this book can be the starting point, but you should eventually turn to more technical works, for instance Véronis (2000) for alignment.
REFERENCES
Véronis, Jean (2000). Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic.
Melamed, I. Dan (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press. (Linguist List reviews at http://linguistlist.org/issues/12/12-1755.html by Mike Maxwell and at http://linguistlist.org/issues/12/12-1707.html by Constantin Orasan)
ABOUT THE REVIEWER:
YANG Shouxun is a research fellow at Foreign Language Teaching and Research Press, China. He is currently working on his thesis on automatic extraction of collocations. His research interests include Generative Grammar and natural language processing.