Editor for this issue: Terence Langendoen <terry
linguistlist.org>
Melamed, I. Dan (2001): Empirical Methods for Exploiting Parallel Texts, The MIT Press, hardback, 195 pages, $32.95.

Reviewed by Constantin Orasan

After attempts to design efficient machine translation programs using knowledge bases and rules failed, research in machine translation turned to statistical methods. Although these methods proved quite flexible and appropriate for the task, they must be trained using a parallel corpus. Parallel corpora are not only necessary for developing automatic translation systems. They can also be used by linguists to compare languages and to gain insights into their linguistic properties.

With the recent development of the Internet as a multilingual medium for information dissemination, more advanced methods are needed for finding and processing multilingual information. However, these methods need real data (i.e. corpora) in order to be useful, and in most cases they need aligned parallel corpora. Unfortunately, these corpora are difficult to build, even for humans. The difficulty in aligning parallel texts comes from the fact that translations are not word for word. However, instead of building parallel corpora in which all the words are aligned, it is possible to build bitexts, where only certain words have a clear equivalent in the other text.

The book "Empirical Methods for Exploiting Parallel Texts" is a revised version of Melamed's doctoral dissertation and demonstrates how bitexts can be built and how they can be exploited for different tasks. The book is structured in three parts. The first one shows how a classical pattern recognition algorithm can be used to infer knowledge about tokens and how this information can be used for other applications. The second part discusses issues about the type-token interface and a project undertaken to manually annotate translation equivalence in large bitexts.
Part three deals with the process of automatic translation. In the following paragraphs a synopsis of each chapter is presented.

The book begins with an introductory chapter which explains why parallel texts are important for the linguistic community and outlines the structure of the book. The author argues that bitexts are one of the richest sources of knowledge because the translation of a text can be seen as a detailed annotation of what that text means. However, the big challenge is how to identify corresponding tokens in the two texts.

PART ONE: TRANSLATIONAL EQUIVALENCE AMONG WORD TOKENS

Chapter 2 proposes a solution to this problem by presenting an algorithm which treats the search for corresponding words as a pattern recognition problem. The Smooth Injective Map Recogniser (SIMR), a generic pattern recognition algorithm, was found to be well suited to the task, building bitext maps that are injective partial functions in the bitext space. However, these maps do not indicate correspondences between all the words. In order to find the points of correspondence in the bitext space, the algorithm alternates between a generation phase and a recognition phase. The generation phase uses simple heuristics to identify points which may represent correspondences in the bitext space. For closely related languages, orthographic cognates proved to be useful, but for more distant languages phonetic cognates are used instead. A simple list of words which are mutual translations can further improve such a method. However, these very simple methods are not infallible, and the noise they produce has to be filtered out. The final step of the algorithm uses a very simple technique to identify chains of true points of correspondence. Using a small training corpus, the maximum point dispersion and maximum angle deviation between true points of correspondence are computed and then used to identify the true chains.
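As a rough illustration of the generate-then-filter idea summarised above, the following Python sketch may help. It is not Melamed's SIMR. The cognate test (a string-similarity ratio from the standard library's difflib), the thresholds, the minimum word length, the function names and the toy sentences are all assumptions made for this example. The filtering step simply discards points that lie far from the main diagonal of the bitext space, as a crude stand-in for SIMR's chain recognition based on point dispersion and angle deviation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude orthographic similarity in [0, 1]; a stand-in for a cognate test."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_points(src, tgt, threshold=0.7, min_len=4):
    """Generation step: every (i, j) whose tokens look like cognates.

    Tokens shorter than min_len are skipped because short function
    words match each other too easily (an assumption of this sketch).
    """
    return [
        (i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
        if len(s) >= min_len and len(t) >= min_len
        and similarity(s, t) >= threshold
    ]


def filter_near_diagonal(points, src_len, tgt_len, max_offset=2.0):
    """Filtering step: keep points whose vertical offset from the main
    diagonal of the bitext space is at most max_offset token positions."""
    slope = tgt_len / src_len
    return [(i, j) for i, j in points if abs(j - slope * i) <= max_offset]


# Toy English/French sentence pair, invented for this example.
src = "the parliament adopted the budget after a long debate in parliament".split()
tgt = "le parlement a adopté le budget après un long débat au parlement".split()

candidates = candidate_points(src, tgt)
kept = filter_near_diagonal(candidates, len(src), len(tgt))
print("candidates:", candidates)
print("kept:      ", kept)
```

In this toy pair the repeated word "parliament" produces candidate points that pair its first occurrence with the last occurrence of "parlement" and vice versa. These are the kind of noise that the filtering step is meant to remove.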
Given that SIMR uses ideas from previously proposed methods, this chapter reviews the existing work in the field. The extension of the algorithm to new pairs of languages is also discussed.

Chapter 3 shows how the pairs of corresponding tokens can be used to align segments. The geometric properties of the corresponding tokens are used to determine the boundaries of the segments. Evaluation on the Hansard corpus shows that the method proposed in this book is more accurate than previous methods.

Chapter 4 presents an algorithm for detecting omissions in translations, based only on the geometric properties of bitext maps and no linguistic information. In a noise-free bitext, the missing segments can easily be identified as nearly horizontal stretches of the bitext map. However, in the real world, noise-free bitext maps are very unlikely to be found. Therefore a more advanced algorithm, which uses more parameters, is proposed. Evaluation of the algorithm shows that it could be a useful tool for the human translator.

PART TWO: THE TYPE-TOKEN INTERFACE

Chapter 5 discusses models of co-occurrence in a bitext, co-occurrence being a precondition for the possibility that two tokens might be mutual translations. Which model of co-occurrence applies depends on the kind of bitext maps available, the language-specific information available, and the assumptions made about the nature of translation equivalence. The counting of co-occurrences, a problem often considered trivial by other authors, is also discussed, with emphasis on its difficulties.

The author developed a parallel corpus during his PhD. The structure of this corpus, the decisions taken and the annotation process are presented in chapter 6. The quality of the corpus is evaluated using inter-annotator agreement measures.

PART THREE: TRANSLATIONAL EQUIVALENCE AMONG WORD TYPES

The third part of the book takes up the wider question of how bitexts could be used to improve the performance of machine translation algorithms.
The problem of word-to-word translation is discussed in chapter 7. In this chapter the author argues that the results of the translation model can be improved by using the properties of bitexts. Given that the proposed models are statistical, they provide more accurate information about the relative importance of different translations. Three models are proposed. The first one, called the competitive linking algorithm, can be seen as a greedy search through the space of possible assignments for the most likely one. The second method is based on the observation that a polysemous word is often used with only one sense in a text. Therefore, the translation of a word can be determined using the words in its vicinity. The third method improves on the second one by adding auxiliary parameters for different word classes. Evaluation of the third method showed that even the distinction between open class words and closed class words could improve the results. All three methods are evaluated and shown to be efficient. One of the advantages of the one-to-one assumption for translation is that even rare words can be translated correctly, as long as the words in their vicinity are frequent enough. The author also argues that the proposed method can be used to build translation lexicons in a semiautomatic manner.

In many cases it is not appropriate to translate groups of words word by word. In chapter 8, a method for translating non-compositional compounds (NCCs) is presented and evaluated. NCCs are defined as a "(not necessarily contiguous) sequence of words whose translation is not typically composed of the translation of its parts". The proposed method uses measures from information theory in order to maximise the predictive power of the model. Evaluation on the Hansard corpus showed that the results of a translation algorithm are better if NCCs are considered.
One of the by-products of the algorithm is that it can be used to discover NCCs for other NLP applications that do not involve parallel data. Chapter 8 also reviews the literature on the discovery of NCCs for machine translation.

Chapter 9 presents a word sense discrimination algorithm which is based on information from bitexts and is used to improve the results of translation methods. Word sense discrimination is considered a first step in the word sense disambiguation task: it clusters word tokens into senses without trying to label these clusters. To achieve this clustering, the algorithm uses information-theoretic criteria chosen to improve the accuracy of the translation model. This chapter also features a review of the literature in the field.

Chapter 10, the final chapter of the book, provides a general summary and points to directions for future research.

The book also has an appendix where the guidelines of the annotation project are explained. Researchers involved in building parallel corpora will find this appendix particularly useful, given the well-known dependence of high quality annotation upon good guidelines.

The book as a whole is clearly written and informative. In most cases the author explains difficult notions using figures and examples. Mathematical formulas are not used very frequently, a fact that will be appreciated by readers without a solid mathematical background. However, I found that in certain places the use of formulas is not fully justified and interrupts the normal flow of reading. A better idea might have been to group the parts which require more advanced mathematical knowledge, and which are not absolutely necessary, at the end of each chapter. Another fact which makes the text more difficult to read is the frequent use of abbreviations, most of them in the first part of the book.
Even though they are explained in the book, their large number can be confusing, especially in short sentences like: "Therefore, SIMR is likely to miss TPCs wherever the TBM is not linear" (p. 21). A more sparing use of abbreviations would have been preferable.

The book can be read sequentially or it can be used only to find relevant information. In addition to the short description of each chapter presented in the introduction, each chapter starts with a short summary. The conclusions at the end of each chapter are also very useful. However, a reader trying to find out more about the field will be a little disappointed because this information is not in one place, being instead spread across different chapters.

Throughout the book, the proposed methods are thoroughly evaluated using general measures from computational linguistics or specific ones from the field of bilingual text alignment. Wherever possible, the author makes comparisons with previously proposed methods in order to assess the advantages of the new method. However, the pleasant surprise was not the presence of these evaluation measures, which are a must for any work in computational linguistics; rather it was the evaluation of the algorithms in terms of complexity, and therefore of the time necessary to run them.

To conclude, I think that any researcher working with parallel corpora will find this book very useful. I also recommend the book to other researchers in the field of computational linguistics, given the great potential of parallel corpora that remains to be exploited.

Constantin Orasan is doing a PhD in Automatic Summarisation at the University of Wolverhampton, U.K. In addition to automatic summarisation, his other current research interests are anaphora resolution, corpus building and analysis, and machine learning techniques for natural language.