Review of Idiom Treatment Experiments in Machine Translation
Yuancheng Tu, Department of Linguistics, University of Illinois at Urbana-Champaign
Idiomatic expressions refer to various types of linguistic units or expressions, including idioms, noun compounds, Named Entities, complex verb phrases and any other habitual collocations. These linguistic units pose a particular challenge in empirical Natural Language Processing (NLP) because they often have idiosyncratic interpretations which cannot be derived by directly aggregating the semantics of their constituents (Sag et al., 2002). ''Idiom Treatment Experiments in Machine Translation'' systematically reviews some theories of idiomatic expressions and presents a method for recognizing these idiomatic expressions in a corpus and translating them automatically within an Example-based Machine Translation System, METIS-II.
The author focuses on one particular type of idiomatic expression, idiomatic Verb Phrases (iVPs) in German, and their translation into English. The author shows that the METIS-II system performs the automatic translation with the help of a bilingual dictionary, a monolingual corpus in the target language, and four types of manually constructed morphosyntactic rules. Three corpora from three different sources are used to evaluate the results. The first corpus consists of 80 sentences sampled from Europarl (EP). The second has 275 sentences filtered from the web (MDS), and the last consists of 131 sentences constructed from a part of the digital lexicon of the German language in the 20th Century (DWDS). With a German-English idiom dictionary of 871 entries, the system achieves over 80% precision, recall and F1 on all three evaluation corpora.
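For readers less familiar with the three evaluation measures mentioned above, the following is a minimal sketch of how precision, recall and F1 are conventionally computed from raw counts. The function name is hypothetical, and no counts from the book are used or implied.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from raw counts.

    Hypothetical helper for illustration; it is not taken from the book.
    """
    predicted = true_positives + false_positives   # items the system marked as idioms
    actual = true_positives + false_negatives      # items that really are idioms
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts, chosen only to show the call.
    print(precision_recall_f1(8, 2, 2))
```

Because F1 is a harmonic mean, it can only exceed 80% when precision and recall are both reasonably high, which is why the three figures are usually reported together.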
The book consists of eleven chapters, which can be grouped into five sections. The first chapter introduces the definition of translation, and the motivation and contribution of the current research. The next three chapters review the literature on Machine Translation (MT). Chapter five extensively reviews the theories of idiomatic expressions. In chapters six to ten, the author explains her experiments on MT for idiomatic expressions. Chapter eleven presents the conclusion and discusses further research.
Chapter two of this book describes the history of MT from the perspective of projects, companies and patents related to MT technology. In chapter three, the author introduces a brief history of Example-based Machine Translation (EBMT) and compares it to two other popular MT frameworks, Rule-based Machine Translation (RBMT) and Statistical Machine Translation (SMT). The author introduces EBMT as a system between RBMT and SMT. As in RBMT, its translation rules are manually extracted. However, unlike in RBMT, such translation knowledge usually serves as templates and can be used repeatedly in the system. EBMT is similar to SMT in the sense that it uses bilingual or monolingual corpora to extract knowledge about sentence formation. However, it does not use statistical models to decode the alignment or generate the translation.
In chapter five, the author reviews the broad literature on theories of idioms. In line with various previous works, she concludes that idioms are mainly multi-word expressions (MWEs) and that no single universal definition works for all of them. Idioms can be compositional or non-compositional, continuous or discontinuous. In addition, the set of idioms is open-ended, since new idioms appear in languages daily. These properties of idioms pose a substantial challenge for recognizing and translating them automatically.
Chapters six to ten explain the idiom treatment experiments. The source idioms are iVPs in German and the target language is English. These idioms are either continuous or discontinuous within a sentence. In chapter seven, the author introduces experiments with three commercial MT systems and concludes that these systems cannot identify discontinuous idioms. In chapter eight, she describes an RBMT system, CAT2, and conducts a small-scale experiment with 58 sentences. Since her evaluation achieves 100% precision and recall, she concludes that CAT2 can handle iVP translation successfully. Finally, in chapters nine and ten, the author discusses how the EBMT system, METIS-II, treats iVPs with a German-English bilingual dictionary, four manually constructed morphosyntactic rules and a monolingual corpus in English. The system assumes that the idioms are listed in the bilingual dictionary. For a continuous idiom, only one rule is needed to identify it within the sentence, after which a dictionary look-up supplies the translation. The other three rules handle the cases where the iVPs are discontinuous within the sentence. Sentences containing discontinuous idioms are constructed manually according to the German topological field model so that they can be identified by the morphosyntactic rules. The author conducted three small-scale evaluations on three different data sets, and the experiments show more than 80% precision and recall on all data sets and for both continuous and discontinuous iVPs.
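The distinction between continuous and discontinuous iVPs can be made concrete with a toy sketch. The code below is not the METIS-II rule set, which the book describes in terms of morphosyntactic rules; it is a deliberately simplified illustration that assumes the input has already been lemmatized and that the idiom is listed in a dictionary. The one-entry dictionary, the function names and the example sentences are all hypothetical.

```python
# Hypothetical one-entry idiom dictionary: German lemma sequence -> English translation.
IDIOMS = {("in", "Gras", "beißen"): "bite the dust"}


def find_continuous(lemmas, idiom):
    """Return the start index where the idiom lemmas occur adjacent and in order, else None."""
    n = len(idiom)
    for i in range(len(lemmas) - n + 1):
        if tuple(lemmas[i:i + n]) == idiom:
            return i
    return None


def find_discontinuous(lemmas, idiom):
    """Return sorted positions of all idiom lemmas, in any order and with gaps, else None."""
    positions = []
    for part in idiom:
        for i, lemma in enumerate(lemmas):
            if lemma == part and i not in positions:
                positions.append(i)
                break
        else:
            return None  # one idiom component is missing from the sentence
    return sorted(positions)


def match_idiom(lemmas):
    """Return (translation, kind) for the first dictionary idiom found, else None."""
    for idiom, translation in IDIOMS.items():
        if find_continuous(lemmas, idiom) is not None:
            return translation, "continuous"
        if find_discontinuous(lemmas, idiom) is not None:
            return translation, "discontinuous"
    return None


if __name__ == "__main__":
    # Verb-final clause ("weil er ins Gras biss"): idiom parts are adjacent.
    print(match_idiom(["weil", "er", "in", "Gras", "beißen"]))
    # Verb-second clause ("Er biss gestern ins Gras"): the verb is separated from the rest.
    print(match_idiom(["er", "beißen", "gestern", "in", "Gras"]))
```

Matching on lemmas alone will also fire on literal uses of the same words and cannot find idioms missing from the dictionary, which is where the limitations discussed later in this review come in.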
This book is structured clearly, from theoretical review to system description and finally to system comparison and evaluation. It offers the reader a relatively comprehensive view of theories of idioms, provides a brief history of EBMT and introduces the stages involved in identifying and translating idioms in one such EBMT system. The author lists ample iVP examples in German and shows systematically how the EBMT system can translate them automatically. However, the method offered in this book focuses on only one specific idiom type, iVPs, and the evaluation corpora used in this study are all very small. The whole thesis would be significantly strengthened if the author showed how the method used to translate iVPs can be adapted to other idiomatic phrases, and evaluated it on larger corpora.
The book identifies several key challenges in MT for idiom translation. However, the method described in this book does not seem to provide a general approach to tackling them. The first key challenge is the Out of Vocabulary (OOV) problem related to idioms. As mentioned in chapter five of this book, new idioms are constantly appearing in languages through various communication channels, and updating these OOV idioms within any MT system is a non-trivial task. However, the method provided in this book assumes that all idioms exist in the bilingual dictionary. To cover OOV idioms, the system requires constant, labor-intensive manual maintenance of its electronic dictionaries. In addition, the morphosyntactic rules within the system are also manually constructed, and different types of idioms need different rules. This constraint also limits the scalability and adaptability of the proposed method. The second challenge mentioned in this book is to distinguish the literal and idiomatic usage of idioms, and the author suggests manually constructing simple heuristics and matching rules to handle this phenomenon. As with the approach offered for the OOV problem, manually constructing rules for each idiom usage is hard and very labor-intensive. The author neglects solutions proposed in the SMT literature, which offer more robust alternatives for tackling these challenges.
One final note: there are some incongruities between certain chapters of this book. For example, chapter four, on Translation Memory, is only remotely related to the main thesis and could be incorporated into the previous chapter on the history of EBMT. Chapter six, which gives a historical view of idiom treatment within MT systems, could also be included in the chapter on the history of EBMT. In addition, chapter six lists several schemes for translation equivalence between source and target language. However, the later chapters do not clearly state which scheme is used in the current study.
''Idiom Treatment Experiments in Machine Translation'' offers a specific approach to handling a specific type of idiom within the framework of EBMT. It provides valuable resources such as heuristics and rule templates for EBMT. However, the proposed method, which consists of manually constructing rules and heuristics for only one type of idiom in German, is not flexible enough to be adapted to other types of idioms, and is labor-intensive to maintain as well. If the book surveyed some of the techniques used in SMT to tackle the challenges posed by idioms, it would have a bigger impact and provide readers with a more comprehensive view of automatic idiom translation.
I. Sag, T. Baldwin, F. Bond, and A. Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, CICLing-2002, pages 1-15.
ABOUT THE REVIEWER
Yuancheng Tu is a PhD student in the Department of Linguistics at the
University of Illinois at Urbana-Champaign. Her primary research interests
are Natural Language Processing (NLP), machine learning and computational
lexical semantics. She is also interested in structure learning in NLP and
Text Mining. She is now working on her PhD dissertation on recognizing and
learning complex verb predicates, such as factive/imperative verbs,
light verb constructions and other inference rules with instantiated or
typed predicates. Her dissertation proposes a general approach to handle
these complex verb predicates within the framework of lexical and
relational similarities and to use them in real NLP applications such as
the task of Textual Entailment.