LINGUIST List 12.1707

Mon Jul 2 2001

Review: Melamed, Exploiting Parallel Texts

Editor for this issue: Terence Langendoen <terry@linguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at simin@linguistlist.org or Terry Langendoen at terry@linguistlist.org.

Directory

  1. Constantin Orasan, review Melamed, Empirical Methods for Exploiting Parallel Texts

Message 1: review Melamed, Empirical Methods for Exploiting Parallel Texts

Date: Mon, 02 Jul 2001 18:55:39 +0100
From: Constantin Orasan <in6093@wlv.ac.uk>
Subject: review Melamed, Empirical Methods for Exploiting Parallel Texts

Melamed, I. Dan (2001): Empirical Methods for Exploiting Parallel
Texts, The MIT Press, hardback, 195 pages, $32.95.


Reviewed by Constantin Orasan

Following the failure of knowledge bases and rules to produce
efficient machine translation programs, research in machine
translation started to consider statistical methods. Although
these proved quite flexible and appropriate for the task,
statistical methods must be trained using a parallel corpus.

Parallel corpora are necessary not only for developing
automatic translation systems. They can also be used by
linguists to compare languages and to gain insights into
their linguistic properties. With the recent development of
the Internet as a multilingual medium for information
dissemination, more advanced methods are needed for finding
and processing multilingual information. However, these
methods need real data (i.e. corpora) in order to be useful,
and in most cases aligned parallel corpora are required.
Unfortunately, such corpora are difficult to build, even for
humans: the difficulty in aligning parallel texts comes from
the fact that translations are not word for word. However,
instead of building parallel corpora in which all the words
are aligned, it is possible to build bitexts, in which only
certain words have a clear equivalent in the other text.

The book "Empirical methods for exploiting parallel texts" is
a revised version of Melamed's doctoral dissertation and
demonstrates how bitexts can be built and how they can be
exploited for different tasks.

The book is structured in three parts. The first one shows how
a classical pattern recognition algorithm can be used to infer
knowledge about tokens and how this information can be used
for other applications. The second part discusses issues about
the type-token interface and a project undertaken to manually
annotate translation equivalence in large bitexts. Part three
deals with the process of automatic translation. In the
following paragraphs a synopsis of each chapter is presented.

The book begins with an introductory chapter explaining why
parallel texts are important for the linguistic community,
also outlining the structure of the book. The author argues
that bitexts are one of the richest sources of knowledge
because the translation of a text can be seen as a detailed
annotation of what that text means. However, the big challenge
is how to identify corresponding tokens in the two texts.

PART ONE: TRANSLATIONAL EQUIVALENCE AMONG WORD TOKENS

Chapter 2 proposes a solution to this problem by presenting an
algorithm which regards the problem of finding corresponding
words as a pattern recognition problem. The Smooth Injective
Map Recogniser (SIMR), a generic pattern recognition
algorithm, was found to be well suited to the task, building
bitext maps
that are injective partial functions in the bitext space.
However, these maps do not indicate correspondences between
all the words.

In order to find the points of correspondence in the bitext
space, the algorithm alternates between a generation phase and
a recognition phase. The generation phase identifies points
which can represent correspondence in the bitext space using
simple heuristics. For closely related languages, methods
based on orthographic cognates proved useful, but for more
distant languages phonetic cognates are used instead. A
simple list of words which are mutual translations can
further improve such a method. However, these very simple
methods are not infallible, so the resulting noise must be
filtered out.
The final step of the algorithm uses a very simple technique
to identify chains of true points of correspondence. Using a
small training corpus, the maximum point dispersion and
maximum angle deviation between true points of correspondence
are computed, and used to identify the true chains. Given that
SIMR uses ideas from previously proposed methods, this chapter
reviews the existing work in the field. The extension of the
algorithm for new pairs of languages is also discussed.
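The generation-then-recognition idea described above can be sketched roughly as follows. The LCS-ratio cognate test, the 0.58 threshold, and the use of the chain's own endpoints to fit a line are illustrative simplifications, not the exact mechanisms or parameters of SIMR:

```python
import math

def is_cognate(w1, w2, min_ratio=0.58):
    # Crude orthographic cognate test: longest common subsequence
    # ratio (the 0.58 threshold is a hypothetical value).
    m, n = len(w1), len(w2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if w1[i] == w2[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n) >= min_ratio

def candidate_points(src_tokens, tgt_tokens):
    # Generation phase: every cognate pair becomes a candidate point
    # (x, y) in the bitext space, where x and y are token positions.
    return [(i, j)
            for i, s in enumerate(src_tokens)
            for j, t in enumerate(tgt_tokens)
            if is_cognate(s.lower(), t.lower())]

def chain_is_acceptable(chain, max_dispersion, max_angle_dev):
    # Recognition phase: accept a chain of candidate points only if its
    # point dispersion and angle deviation stay under thresholds that
    # would be estimated from a small training corpus.
    (x0, y0), (x1, y1) = chain[0], chain[-1]
    if x1 == x0:
        return False
    slope = (y1 - y0) / (x1 - x0)
    # distance of each interior point from the chain's endpoint line
    dists = [abs((y - y0) - slope * (x - x0)) / math.sqrt(1 + slope ** 2)
             for x, y in chain[1:-1]]
    dispersion = math.sqrt(sum(d * d for d in dists) / len(dists)) if dists else 0.0
    angle_dev = abs(math.degrees(math.atan(slope)) - 45.0)
    return dispersion <= max_dispersion and angle_dev <= max_angle_dev
```

On this toy scale, identical words are trivially cognates, and a chain lying on the main diagonal passes both tests while a horizontal chain fails the angle test.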

Chapter 3 shows how the pairs of corresponding tokens can be
used to align segments. The geometric properties of the
corresponding tokens are used in order to determine the
boundaries of the segments. Evaluation on the Hansard corpus
shows that the method proposed in this book is more accurate
than previous methods.

Chapter 4 presents an algorithm for determining omissions in
translations, based solely on the geometric properties of
bitext maps, without any linguistic information. In a
noise-free bitext, missing segments show up as nearly
horizontal stretches of the bitext map. However, in the real
world, noise-free bitext maps are very unlikely to be found.
Therefore a more advanced algorithm, which uses more
parameters, is proposed. Evaluation of the algorithm proves
that it could be a useful tool for the human translator.
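The geometric intuition behind omission detection can be sketched in a few lines. The thresholds and the point-pair representation here are hypothetical; the book's actual algorithm uses more parameters to cope with noisy maps:

```python
def find_omissions(bitext_map, min_run, max_slope):
    # A stretch of the bitext map where many source positions advance
    # while the target position barely moves (a nearly horizontal
    # segment) signals source text missing from the translation.
    # min_run and max_slope are illustrative thresholds.
    omissions = []
    pts = sorted(bitext_map)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx >= min_run and dx > 0 and dy / dx <= max_slope:
            omissions.append((x0, x1))  # source span with no counterpart
    return omissions
```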

PART TWO: THE TYPE-TOKEN INTERFACE

Chapter 5 discusses models of co-occurrence in a bitext, a
precondition for the possibility that two tokens might be
mutual translations. The different models of co-occurrence
which can apply depend on what kind of bitext maps are
available, the language-specific information available, and
the assumptions made about the nature of translation
equivalence. The counting of co-occurrences, a problem often
considered trivial by other authors, is also discussed, with
its pitfalls emphasised.
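One simple co-occurrence model, counting each word-type pair at most once per pair of aligned segments, might look like the following. This is a minimal sketch, not one of the specific models analysed in chapter 5:

```python
from collections import Counter

def cooccurrence_counts(aligned_segments):
    # Per-segment (boolean) counting: a word-type pair is counted at
    # most once per aligned segment pair, avoiding the inflation that
    # arises when every token pairing inside a segment is counted.
    cooc = Counter()
    for src_seg, tgt_seg in aligned_segments:
        for s in set(src_seg):
            for t in set(tgt_seg):
                cooc[(s, t)] += 1
    return cooc
```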

The author developed a parallel corpus during his PhD. The
structure of this corpus, the decisions taken, and the
annotation process are presented in chapter 6. The quality of
the corpus is evaluated using inter-annotator agreement
measures.

PART THREE: TRANSLATIONAL EQUIVALENCE AMONG WORD TYPES

The third part of the book takes up the wider question of how
bitexts could be used to improve the performance of machine
translation algorithms.

The problem of word-to-word translation is discussed in
chapter 7. In this chapter the author argues that by using
bitexts' properties the results of the translation model can
be improved. Given that the proposed models are statistical,
they provide more accurate information about the relative
importance of different translations. Three models are
proposed. The first one, based on the competitive linking
algorithm, can be seen as a greedy search through the space
of possible assignments for the most likely one. The
second method is based on the observation that a polysemous
word is often used with only one sense in a text. Therefore,
the translation of a word can be determined using the words in
its vicinity. The third method improves the second one by
adding auxiliary parameters on different word classes.
Evaluation of the third method showed that even the
distinction between open class words and closed class words
could improve the results.

All three methods are evaluated and shown to be efficient. One
of the advantages of the one-to-one assumption for translation
is the fact that even rare words can be translated correctly,
as long as the words in their vicinity are frequent enough.
The author also argues that the proposed method can be used to
build translation lexicons in a semiautomatic manner.
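The greedy search under the one-to-one assumption can be illustrated with a short sketch. The association scores are assumed to come from some earlier estimation step, and the whole thing is a simplification of the iterative method in the book:

```python
def competitive_linking(scores):
    # Greedy one-to-one linking: repeatedly accept the highest-scoring
    # (source, target) pair whose two words are both still unlinked.
    # scores maps (src_word, tgt_word) to an association score.
    linked_src, linked_tgt, links = set(), set(), []
    for (s, t), _score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in linked_src and t not in linked_tgt:
            links.append((s, t))
            linked_src.add(s)
            linked_tgt.add(t)
    return links
```

Because each word is linked at most once, a strong link such as ("dog", "chien") blocks the weaker competing link ("dog", "chat"), which is what lets rare words be linked correctly when their neighbours are frequent enough.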

In many cases it is not appropriate to translate groups of
words word by word. In chapter 8, a method for translating
non-compositional compounds (NCCs) is presented and evaluated.
NCCs are defined as a "(not necessarily contiguous) sequence
of words whose translation is not typically composed of the
translation of its parts". The proposed method uses measures
from information theory in order to maximise the predictive
power of the model. Evaluation on the Hansard corpus showed
that the results of a translation algorithm are better if NCCs
are considered. One of the by-products of the algorithm is
that it can be used to discover NCCs for other NLP
applications that do not involve parallel data. Chapter 8
also reviews the literature on the discovery of NCCs for
machine translation.
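As a rough illustration of how an information-theoretic score can flag word pairs that behave like single units, here is a monolingual pointwise mutual information filter. Note that this is only a stand-in: the book's method measures the gain in a translation model's predictive power, not monolingual PMI:

```python
import math

def ncc_candidates(bigrams, unigrams, total, min_pmi):
    # Rank adjacent word pairs by pointwise mutual information
    # (PMI = log2 of observed vs. independence-expected frequency);
    # high-PMI pairs are candidates for non-compositional compounds.
    out = []
    for (w1, w2), c in bigrams.items():
        pmi = math.log2((c * total) / (unigrams[w1] * unigrams[w2]))
        if pmi >= min_pmi:
            out.append(((w1, w2), pmi))
    return sorted(out, key=lambda p: -p[1])
```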

Chapter 9 presents a word sense discrimination algorithm based
on information from bitexts and used to improve the results of
translation methods. Word sense discrimination is considered
a first step in the word sense disambiguation task: it
clusters word tokens into senses without trying to label
these clusters. To achieve this clustering, the algorithm
uses information-theoretic criteria aimed at improving
translation model accuracy. This chapter also features
a review of the literature in the field.

Chapter 10, the final chapter of the book, provides a general
summary and points to directions for future research.

The book also has an appendix where the guidelines of the
annotation project are explained. Researchers involved in
building parallel corpora will find this appendix
particularly useful, given the well-known dependence of
high-quality annotation on good guidelines.

The book as a whole is clearly written and informative. In most
cases the author explains difficult notions using figures and
examples. Mathematical formulas are not used very frequently,
a fact that will be appreciated by those readers without a
solid mathematical background. However, I found that in
certain places the use of formulas is not fully justified and
interrupts the normal flow of reading. A better idea might
have been to group, at the end of each chapter, those parts
which require more advanced mathematical knowledge and are
not absolutely necessary.

Another fact which makes the text more difficult to read is
the frequent use of abbreviations, most of them in the first
part of the book. Even though they are explained in the book,
their large number can be confusing especially in short
sentences like: "Therefore, SIMR is likely to miss TPCs
wherever the TBM is not linear" (p. 21). A more careful use
of abbreviations would have been preferable.

The book can be read sequentially, or it can be consulted
selectively for relevant information. In addition to the
short description of each chapter given in the introduction,
each chapter starts with a short summary. The conclusions at
the end of each chapter are also very useful. However, a
reader trying to find out more about the field may be a
little disappointed, because this information is not in one
place but is spread across different chapters.

Throughout the book, the proposed methods are thoroughly
evaluated using general measures from computational
linguistics or specific ones from the field of bilingual text
alignment. Wherever possible, the author makes comparisons
with previously proposed methods in order to assess the
advantages of using a new method. However, the pleasant
surprise was not the presence of these evaluation measures,
which are a must for any work in computational linguistics;
rather it was the evaluation of algorithms in terms of
complexity and therefore of the time necessary to run them.

To conclude, I think that any researcher working with
parallel corpora will find this book very useful. I also
recommend the book to other researchers in computational
linguistics, given the great potential of parallel corpora
waiting to be exploited.

Constantin Orasan is doing a PhD in Automatic Summarisation
at the University of Wolverhampton, U.K. In addition to
automatic summarisation, his current research interests
include anaphora resolution, corpus building and analysis,
and machine learning techniques for natural language.