LINGUIST List 12.1755

Fri Jul 6 2001

Review: Melamed, Parallel Texts

Editor for this issue: Terence Langendoen <terry@linguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at simin@linguistlist.org or Terry Langendoen at terry@linguistlist.org.


Directory

  • Mike Maxwell, Review of Melamed "Empirical Methods for Exploiting Parallel Texts"

    Message 1: Review of Melamed "Empirical Methods for Exploiting Parallel Texts"

    Date: Fri, 6 Jul 2001 12:32:01 -0400
    From: Mike Maxwell <Mike_Maxwell@sil.org>
    Subject: Review of Melamed "Empirical Methods for Exploiting Parallel Texts"


    Review of Melamed, I. Dan (2001) Empirical Methods for Exploiting Parallel Texts. MIT Press, xi + 195 pp., $32.95. (publisher's announcement in Linguist List 12.622)

    Mike Maxwell, Summer Institute of Linguistics

    Two questions. Suppose you had a text in some language, and a translation of that text into another language: a machine-readable Rosetta Stone. How close could a computer come to finding corresponding paragraphs, sentences, and words in the two texts, knowing nothing (or very little) about the two languages to begin with? The answer is, surprisingly close.

    The second question: Why should linguists care? Since this review is for Linguist List (not for a computational linguistics mailing list), I will try to answer this question before proceeding to the review itself. Melamed suggests a number of uses for computer tools that find correspondences, including bilingual lexicography (particularly for newer terminology, which may not have found its way into published dictionaries, and for new or rare word senses); other bilingual resources for translators, such as translation examples; aids for students of foreign languages (when the student is reading a text and gets stuck, the aligned translation is available for reference); and a sort of proofreader for detecting omissions in translated text (the subject of chapter four, see below).

    I might also add that interlinear text is widely used by field linguists, and perhaps the sorts of alignment tools discussed by Melamed could be adapted to do interlinear glossing semi-automatically. There are obstacles, however, not the least of which is that languages with substantial inflectional morphology present special problems for automatic alignment; more on this later.

    Bilingual texts might also help with machine-driven syntactic annotation, since different languages often have differing patterns of syntactic ambiguity (as pointed out by Matsumoto and Utsuro 2000: 582, footnote 5); this is largely unexplored, and Melamed does not comment on it.

    Finally, it might be possible to create machine translation tools more or less automatically from aligned bilingual text; this is touched on in chapters seven through nine, although the reality of what can be done at present (lexicography) falls short of actual translation.

    In short, a good deal of machine learning from bilingual texts is possible, and linguists should care. This book, then, is an elaboration on these issues.

    In the acknowledgements, Melamed says that the book is a revision of his dissertation. Nevertheless, it reads rather like a collection of stand-alone chapters. Indeed, several of the chapters are revisions of work published elsewhere: save for the addition of a few paragraphs dealing with Chinese-English alignment, chapters two and three are almost identical, verbatim, to an article published in the journal Computational Linguistics (Melamed 1999), while chapter seven was previously published (with rather more differences) in the same journal a year later (Melamed 2000). Some of the remaining chapters appear to be revisions of conference papers. Because the chapters are almost stand-alone, I will depart from the usual practice in Linguist List reviews of saving the evaluation until the end, instead interspersing my comments where appropriate.

    The chapters are arranged into three sections plus an introductory chapter (previewing the following chapters) and a summary chapter (reprising the preceding chapters, and suggesting directions for future work).

    The first section, 'Translation Equivalence among Word Tokens', sets out the methodology of alignment, and one application of alignment tools.

    Chapter two (following the introductory chapter) describes an algorithm for finding the alignment between the two halves of a "bitext", that is, a bilingual text. The algorithm requires that some correspondences between words in the two halves of the bitext be known in advance, either from a seed bilingual lexicon, or from cognates. (The term 'cognates' should be interpreted liberally: for instance, numbers or dates could serve as cognates in certain texts.) Given a pair of words in the two languages which are mutual translations, there may be multiple occurrences of each in the two halves of the bitext (and the number of occurrences may not be the same, since there is often more than one way to translate a given word). In theory, any pair of occurrences could represent an alignment; in practice, the true correspondences tend to occur in "corresponding positions" in the two texts (where "corresponding position" means something like "the same fraction of the way through the text"). Melamed's algorithm capitalizes on this to find the most likely correspondences between the two texts.
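
    To make the "corresponding position" idea concrete, here is a small illustrative sketch of my own (in Python; the seed lexicon, threshold, and data are invented, and this is emphatically not SIMR itself): candidate correspondence points are generated from a seed lexicon and kept only if the two tokens sit at roughly the same fraction of the way through their respective texts.

        # Toy illustration of the "corresponding position" heuristic (not SIMR).
        # A candidate point pairs the position of a source token with the position
        # of a target token that the seed lexicon says could translate it; points
        # far from the bitext "diagonal" (same fraction of the way through each
        # text) are discarded.

        def candidate_points(src_tokens, tgt_tokens, seed_lexicon, max_deviation=0.05):
            """Return (i, j) index pairs whose relative positions roughly agree."""
            pairs = []
            for i, s in enumerate(src_tokens):
                for j, t in enumerate(tgt_tokens):
                    if (s, t) in seed_lexicon:
                        rel_s = i / max(len(src_tokens) - 1, 1)
                        rel_t = j / max(len(tgt_tokens) - 1, 1)
                        if abs(rel_s - rel_t) <= max_deviation:
                            pairs.append((i, j))
            return pairs

        # Hypothetical miniature bitext and seed lexicon.
        src = "the treaty was signed in ottawa".split()
        tgt = "le traité a été signé à ottawa".split()
        seed = {("the", "le"), ("treaty", "traité"), ("signed", "signé"), ("ottawa", "ottawa")}
        print(candidate_points(src, tgt, seed, max_deviation=0.2))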

    This sounds straightforward, but of course the difficulty is in the details, and Melamed lays out in some detail the way his implementation, SIMR ('Smooth Injective Map Recognizer'), finds the (usually) correct correspondences. One of his innovations over other alignment algorithms is to search for just a few correspondences at a time, gradually extending the mapping from the beginning of the bitext. In addition to running faster and in less memory (linear time and space) on long texts, this innovation also allows localized 'noise filters'. That is, a particular word and its translation may be rare in texts, and therefore a good overall indication of alignment. But at a particular position in a text, the word may be quite frequent, and therefore a source of confusion. Since Melamed's algorithm works on a small stretch of text at a time, it can afford to ignore words which are locally frequent--in effect, a localized noise filter.
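
    The localized noise filter can be pictured with another small sketch of mine (the threshold and data are invented, and this is only my schematic reading of the idea, not the filter SIMR actually uses): a word that is rare in the bitext overall, but frequent within the stretch currently being mapped, is simply ignored in that stretch.

        # Schematic localized noise filter: a word pair that is rare in the bitext
        # as a whole is normally good evidence for alignment, but if the word
        # happens to be frequent within the small stretch of text currently being
        # mapped, candidate points built from it are discarded there.

        from collections import Counter

        def locally_filtered(candidates, src_window, max_local_count=2):
            """Drop (i, j, word) candidates whose word is frequent in the window."""
            local_freq = Counter(src_window)
            return [(i, j, w) for (i, j, w) in candidates if local_freq[w] <= max_local_count]

        # Hypothetical window in which "article" suddenly occurs many times.
        window = "article one article two article three of the treaty".split()
        cands = [(0, 0, "article"), (8, 6, "treaty")]
        print(locally_filtered(cands, window))  # only the "treaty" point survives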

    SIMR has a number of parameters, such as how long a stretch of text it attempts to use at a time. For optimum speed, these parameters must be set individually for each language pair (and indeed, for different genres), and Melamed sets them using training data (pre-aligned texts). The claim, tested on French, Spanish, Korean and Chinese (with English as the second language in each case), is that "it should be possible to create enough hand-aligned [training] data to port SIMR to any language pair in under two days" (page 33). The training data in one case was the Bible, with the initial alignment being at verse boundaries. Significantly, this sort of training data is available for nearly every written language (Resnik, Olsen and Diab 1999).

    A limitation of this methodology is that it requires stemming words to canonical form. (This observation is not limited to SIMR, but applies to many of the other programs Melamed describes; there are, however, other alignment algorithms to which it does not apply.) That is, inflectional affixes must be removed, and any stem allomorphy undone. While there are programs that attempt to 'learn' morphology (see e.g. Goldsmith 2001), such work is still experimental and limited; for now, the implication for languages with inflectional morphology is that one must first have a stemmer.
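
    As a deliberately naive illustration of what such preprocessing involves, here is a toy suffix-stripping stemmer (my own sketch; real stemmers, and the morphology learners Goldsmith describes, are far more sophisticated):

        # Deliberately naive suffix-stripping "stemmer", for illustration only: it
        # strips a few English inflectional endings and does nothing about stem
        # allomorphy, which is exactly the kind of phenomenon that makes richly
        # inflecting languages hard for this family of alignment methods.

        SUFFIXES = ("ing", "ed", "es", "s")

        def naive_stem(word):
            for suffix in SUFFIXES:
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    return word[: -len(suffix)]
            return word

        print([naive_stem(w) for w in ["signed", "treaties", "runs", "run"]])
        # ['sign', 'treati', 'run', 'run'] -- 'treati' shows why stem allomorphy must be undone separately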

    After correspondences have been found, there is the further problem of finding alignments between the halves of the bitext. Correspondences may cross; alignments do not. For example, if the word order in two languages is different, correspondences between the words in a given sentence will likely cross, but the correspondences for two successive sentences will likely not cross. The algorithm described in chapter three finds alignments between 'segments' of two texts for which correspondences have already been discovered at a finer level of granularity, where a segment may be a sentence, paragraph, list, etc. Tests on French-English bitexts show Melamed's algorithm to be more accurate than other alignment algorithms. At least as important, its run time on long texts should be much shorter than that of other published algorithms.
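
    The crossing/non-crossing distinction can be made concrete with a further sketch of my own (this is not the chapter-three algorithm; the points and the simple longest-chain step are purely illustrative): word-level correspondence points are reduced to segment indices, and a maximal non-decreasing chain of segment pairs is kept as the alignment.

        # Sketch: derive a non-crossing segment alignment from (possibly crossing)
        # correspondence points, each given as (source_segment, target_segment).
        # A non-crossing alignment never decreases in either coordinate; here we
        # keep a longest such chain with a simple O(n^2) dynamic programme.

        def non_crossing_chain(points):
            pts = sorted(set(points))
            best = [1] * len(pts)      # length of the best chain ending at each point
            prev = [-1] * len(pts)
            for k, (s, t) in enumerate(pts):
                for m in range(k):
                    if pts[m][0] <= s and pts[m][1] <= t and best[m] + 1 > best[k]:
                        best[k], prev[k] = best[m] + 1, m
            k = max(range(len(pts)), key=lambda i: best[i])
            chain = []
            while k != -1:
                chain.append(pts[k])
                k = prev[k]
            return list(reversed(chain))

        # Word correspondences inside a sentence may cross; across sentences they rarely do.
        print(non_crossing_chain([(0, 1), (0, 0), (1, 0), (1, 1), (2, 2), (3, 3)]))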

    Chapter four shows that automatic alignment tools can be used to automatically discover (larger) omissions in translated texts. A test of this software found a number of previously undetected omissions in a hand-aligned text - a text which had been used as a standard of comparison for computer alignment! This seems to be an excellent use of computers for something which people are not good at. At the same time, I had to wonder whether better tools for the human translators would not have prevented the omissions in the first place. In fact, it seems likely that much of the high-level alignment which is the topic of chapters two and three could be better done in a translation tool, given that human translators almost invariably translate paragraph-by-paragraph (if not sentence-by-sentence) in the first place. This would relegate the need for software for higher-level alignment to "legacy" texts. (Of course, there are a great many such legacy texts now, so perhaps this is a needless worry.)
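
    In outline, the omission detector can be pictured like this (my sketch, with invented thresholds and data; the system described in chapter four is considerably more careful): walk along the bitext map and flag spans where the source text advances much further than the target text does.

        # Toy omission detector: given a monotone bitext map of (source_position,
        # target_position) anchor points, flag intervals where the source advances
        # much more than the target -- a likely sign that material was skipped.

        def suspected_omissions(bitext_map, ratio=3.0, min_gap=50):
            """Return (src_start, src_end) spans whose target-side counterpart is suspiciously short."""
            suspects = []
            for (s1, t1), (s2, t2) in zip(bitext_map, bitext_map[1:]):
                src_gap, tgt_gap = s2 - s1, t2 - t1
                if src_gap >= min_gap and src_gap > ratio * max(tgt_gap, 1):
                    suspects.append((s1, s2))
            return suspects

        # Hypothetical map in token offsets: between source tokens 300 and 460,
        # the target side barely moves, so that span is flagged.
        print(suspected_omissions([(0, 0), (150, 160), (300, 320), (460, 330), (600, 480)]))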

    Part two of this book is entitled 'The Type-Token Interface', although it is not clear that the two chapters it contains have much to do with each other. In chapter five, Melamed describes a predicate, called a 'model of co-occurrence', which, given a region of the bitext and a pair of word tokens, indicates whether those tokens co-occur in that region. Such a predicate might be used to help build a bilingual dictionary, for example. Others have proposed such predicates before; the work described here consists of (substantial) refinements.
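
    In its simplest form, such a predicate might look like the following sketch (mine, not Melamed's refined model): a source token and a target token co-occur if some aligned region of the bitext contains both.

        # Minimal "model of co-occurrence" predicate: regions are aligned spans of
        # the bitext, given as half-open ranges of source and target token indices;
        # two tokens co-occur if some region contains both of them.

        def co_occur(src_index, tgt_index, regions):
            """regions: list of ((src_start, src_end), (tgt_start, tgt_end)) half-open ranges."""
            for (s_start, s_end), (t_start, t_end) in regions:
                if s_start <= src_index < s_end and t_start <= tgt_index < t_end:
                    return True
            return False

        # Hypothetical sentence-level regions.
        regions = [((0, 8), (0, 9)), ((8, 20), (9, 22))]
        print(co_occur(3, 5, regions))   # True: both tokens fall in the first sentence pair
        print(co_occur(3, 15, regions))  # False: source token in sentence 1, target token in sentence 2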

    Chapter six (and the appendix) describes a program for manually marking correspondences between words in bitext. The program was used by a number of annotators on 250 verses of the Bible, in French and English, with good results: inter-annotator agreement was in the low 90% range if function words were ignored (and somewhat less counting function words). The resulting bitext with correspondences marked is used as a gold standard against which to test computer programs.

    Part three turns to 'translation models', a general term which refers in this context to a probabilistic equivalence between the elements in the two halves of a bitext. Such a model can be decomposed into sub-models in various ways. For example, ignoring syntax, and indeed the entire problem of relative word order, gives a word-to-word translation model, the subject of chapter seven. Conceptually, this is like a bilingual dictionary, except that it includes (an approximation to) the relative 'importance' (frequency) of various translations of a word (a property which may vary by topic and by genre).
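
    As data, a word-to-word translation model is little more than a table of conditional probabilities. The sketch below (my own, with invented aligned pairs; the estimation methods of chapter seven are considerably more sophisticated) builds such a table by raw relative frequency.

        # Word-to-word translation model as a table of probabilities
        # P(target_word | source_word), estimated here by raw relative frequency
        # from a toy list of aligned word-token pairs.

        from collections import Counter, defaultdict

        def word_to_word_model(aligned_pairs):
            counts = defaultdict(Counter)
            for src, tgt in aligned_pairs:
                counts[src][tgt] += 1
            return {src: {tgt: n / sum(c.values()) for tgt, n in c.items()}
                    for src, c in counts.items()}

        # Hypothetical aligned tokens: 'right' translates three different ways.
        pairs = [("right", "droite"), ("right", "droit"), ("right", "droite"),
                 ("right", "exact"), ("house", "maison")]
        print(word_to_word_model(pairs)["right"])
        # {'droite': 0.5, 'droit': 0.25, 'exact': 0.25}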

    Lest word-to-word translation models seem irrelevant or naive, Melamed points out a number of applications, including cross-language information retrieval, and development and maintenance of bilingual lexicons for machine translation. The methodology described in this chapter seems sufficient for adding entries to such a bilingual lexicon; the state of the art is not, apparently, sufficient for a machine to determine the contexts in which each translation of a word would be appropriate. (The bilingual lexicon entries resulting from the translation model described here must also be validated by humans. Happily, the entries can be sorted by their probability of being correct, which should make the validation task easier.)

    This chapter is probably the most mathematical of the book. But the non-mathematician linguist should not feel that his role is being usurped by statistical methods, for Melamed is careful to point out that the field is ripe for exploiting pre-existing (e.g. linguistic) knowledge: "each infusion of knowledge about the problem domain yielded better translation models" (page 121).

    The next chapter looks at how bitexts can be exploited to discover what Melamed calls "non-compositional compounds." (The term refers to any sequence of words--not necessarily contiguous--which is not transparently translated: idioms, for example.) Melamed tested his methodology for finding such "compounds" on a corpus of French-English text. A random sample of the compounds ranges from genuine cases of non-compositional translations ('shot-gun wedding' and 'blaze the trail', the latter presumably in its non-idiomatic sense), to company names ('Generic Pharmaceutical Industry Association'). Depending on your view, this chapter shows that lexemes are not in one-to-one correspondence with space-delimited words (in languages whose orthography works that way), or it shows one way the methodology of the previous chapter could be extended beyond the word-to-word model.

    The last substantive chapter (preceding the summary chapter) describes an algorithm for automatically discovering the word senses in a bitext. The question of how to divide up the senses of a word is a controversial one; Melamed finesses it by assuming that the senses of a word in language X correspond approximately to the number of words into which that word translates in language Y. (This is an approximate limit; for instance, it is possible that two words in language Y are actually synonyms.) The problem is then to discover 'informants': evidence from the context of the word in language X which can be used to predict which way it will be translated into language Y. Melamed achieves a statistically significant improvement in translation accuracy--but it was astonishing to me how small that improvement was: between one and two percentage points. Melamed attributes the small improvement to limits on what the program could use as 'informants': the five words to the left and right of the word to be disambiguated. A limit this is, but in the 1950s Abraham Kaplan showed that for humans, even two words to the left and right were as good as the entire text for disambiguating word senses (Ide and Veronis 1998).
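
    The 'informant' idea itself is simple to picture; the following sketch (mine, not Melamed's algorithm) merely gathers the words within five positions of each occurrence of an ambiguous word as candidate predictors of its translation. The hard part, of course, is learning which of those candidates actually discriminate between translations.

        # Sketch of 'informant' gathering: for each occurrence of an ambiguous
        # source word, collect the words within +/- 5 positions as candidate
        # predictors of the translation it received. A real system would then
        # learn which informants actually discriminate between translations.

        def informants(tokens, target_word, window=5):
            contexts = []
            for i, tok in enumerate(tokens):
                if tok == target_word:
                    contexts.append(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
            return contexts

        sentence = "my friend to my right spoke first on the question".split()
        print(informants(sentence, "right"))
        # [['my', 'friend', 'to', 'my', 'spoke', 'first', 'on', 'the', 'question']]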

    A list of sample ambiguities, together with the 'informants' that Melamed's program found to disambiguate them in the bitexts, offers a clue to why sense discovery is not more helpful: the 'informants' are largely ad hoc. The English word 'right', for example, is translated into French in at least seven ways in the bitexts, one of which refers to the direction. The informants for this sense are the words 'my' and 'friend'. It turns out that this sense appeared frequently in the phrase 'my friend to my right'! The lack of generality here is obvious, and it is not clear how a computer could do much better with that sort of data. As has often been observed, word sense disambiguation is 'AI-complete': its resolution requires resolving all the problems of artificial intelligence.

    To my knowledge, the computer tools discussed in this book have not been made available on the Web (although the 250 verse 'gold standard' described in chapter six is freely available, as are a number of other tools and papers at http://www.cis.upenn.edu/~melamed/).

    Anyone wanting to know more about the uses (and limitations) of bitexts will want to read this book (although, as mentioned above, much of it has been published elsewhere). The jacket blurb (reproduced at the MIT Press web site) claims it is "a start-to-finish guide to designing and evaluating many translingual applications." Melamed's book is not (nor was it probably intended to be) a start-to-finish guide, but this bit of publisher's hyperbole should not detract from its usefulness. (For those needing the 'start' of a 'start-to-finish guide', see Wu 2000. The finish is not in sight yet, as Melamed makes clear.)

    Finally, I wish to comment on the publisher. At a reasonable price for a hard cover book, MIT Press has done an excellent job of production. The format is clear, illustrations and charts are reproduced well, and typos seem to be few (more precisely, I could not find any). There are publishers who would do well to imitate MIT Press in these areas (if not with respect to the jacket blurb).

    References

    Dale, Robert; Hermann Moisl; and Harold Somers (editors). 2000. The Handbook of Natural Language Processing. Marcel Dekker, Inc.

    Goldsmith, John. 2001. "Unsupervised Learning of the Morphology of a Natural Language." Computational Linguistics 27: 153-198.

    Ide, Nancy; and Jean Veronis. 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art." Computational Linguistics 24: 1-40.

    Matsumoto, Yuji; and Takehito Utsuro. 2000. "Lexical Knowledge Acquisition." Pages 563-610 in Dale, Moisl, and Somers 2000.

    Melamed, I. Dan. 1999. "Bitext Maps and Alignment via Pattern Recognition." Computational Linguistics 25: 107-130.

    Melamed, I. Dan. 2000. "Models of Translational Equivalence among Words." Computational Linguistics 26: 221-249.

    Resnik, Philip; Mary Broman Olsen; and Mona Diab. 1999. "Creating a Parallel Corpus from the Book of 2000 Tongues." Computers and the Humanities 33: 129-153.

    Wu, Dekai. 2000. "Alignment." Pages 415-458 in Dale, Moisl, and Somers 2000.

    Mike Maxwell works in the development of computational environments for syntactic, morphological and phonological analysis for the Summer Institute of Linguistics. He has a Ph.D. in linguistics from the University of Washington.