LINGUIST List 13.1636

Mon Jun 10 2002

Review: Computational Ling: Daelemans et al (2001)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at siminlinguistlist.org or Terry Langendoen at terrylinguistlist.org.

Directory

  1. Pablo Ariel Duboue, Computational Linguistics in the Netherlands 2000

Message 1: Computational Linguistics in the Netherlands 2000

Date: Sat, 08 Jun 2002 20:47:14 +0000
From: Pablo Ariel Duboue <pablocs.columbia.edu>
Subject: Computational Linguistics in the Netherlands 2000


Daelemans, Walter, Khalil Sima'an, Jorn Veenstra, and Jakub Zavrel, eds. (2001)
Computational Linguistics in the Netherlands 2000. 
Rodopi, 204pp, paperback ISBN 90-420-1247-1, US$ 45.00, EUR 48,00.

Book Announcement on Linguist:
http://linguistlist.org/issues/12/12-3072.html


Pablo A. Duboue, Computer Science Department, Columbia University

SYNOPSIS 

This book contains a selection of the papers presented at the eleventh
annual conference on Computational Linguistics in the Netherlands
(Tilburg, 2000). Although its title seems to suggest an audience
limited to the Netherlands and Flanders, this is far from
true. The book is targeted at a wide audience. As noted in the
introduction of the book, 50% of the contributions are not from the
Benelux area. However, while the book does not concentrate
exclusively on Dutch computational linguistics (CL) issues, people
interested in these issues will find valuable articles in it. In
general, it seems to me this book is a nice conclusion to the process
started in the Balancing Act (Klavans and Resnik, 1996): looking for
some stability between knowledge-based and statistical approaches. In
the papers presented in this book, you can see both statistical
systems trying to incorporate more knowledge into their structure (e.g.,
Carson-Berndsen, Joue and Walsh) and symbolic systems trying to
uncover areas for learning, in order to improve robustness (e.g.,
Poibeau and Kosseim). The book's topics cover a considerable spectrum,
including parsing, generation, speech processing and information
retrieval.

DETAILED ANALYSIS 

In his Invited Talk "Very Large Lexicons" (1-15), Gregory Grefenstette
brings added value to the book. Invited talks are normally for the
enjoyment of the attendees of a conference. By providing a
transcription of his talk, Dr. Grefenstette significantly enriches the
collection. It points out the new challenges of internet-aware CL
research. His talk focuses on the prospects for building a
full-lexicon language model out of the Internet. Two years after
publication, his figures on hard disk space seem very conservative,
making a full-lexicon language model an even more doable task.

The first of the regular papers, "Phonotactic Speech Ranking for
Speech Recognition" (16-29), by Julie Carson-Berndsen, Gina Joue and
Michael Walsh, deals with the question of how to add more knowledge to
a speech recognition system. One of the proposed solutions involves
the use of constraints on the permissible combinations of sounds from
a language (phonotactic constraints). The authors draw from their
previous work to present a technique which extrapolates constraints
for enhanced robustness, together with additional techniques which
acquire constraints automatically from corpora.

The next article is "Through a glass darkly: Part-of-speech
distribution in original and translated text" (30-44), by Lars Borin
and Klas Prutz. This article will be of real interest for linguists
studying language acquisition and bilingualism. The authors are
dealing with a very interesting resource (a magazine for foreigners in
Sweden, available in eight different languages) and look for
particular patterns of POS tags that, while not appearing in a
balanced corpus of general English, do appear in the translated
counterparts. They then formulate a hypothesis about how the
source language affects the choice among possible translations. As
pointed out in their conclusions, this is just one of the possible
experiments that can be undertaken to study the same issue.

In the first article on Dutch linguistics, "Alpino: Wide-coverage
Computational Analysis of Dutch" (45-59), Gosse Bouma, Gertjan van
Noord, and Robert Malouf present a clearly written contribution,
understandable for people with no prior knowledge of Dutch. It is a
broad system description of Alpino, an analytical tool for Dutch,
including its hand-built, head-driven lexicalized grammar (over 100
rules, with inheritance) and its Part-Of-Speech module. Aside from
its obvious impact on Dutch CL, this article can be of interest to
other researchers working on wide-coverage systems in new languages.

The following article is a very unusual paper for a CL conference,
"Revolution in Computational Linguistics: Towards a Genuinely Applied
Science" (60-72), written by Pius ten Hacken. The article is really
important for the average CL person, in particular for students, I
believe. It points out how CL has moved in the last 30 years from
being merely linguistics with different methods to a completely
applied field. In the words of the paper, the shift has been from:
Problem: Understanding human language processing; Knowledge:
Contemporary linguistics theories; Solution: A running program in a
computer; to: Problem: A practical problem occurring in real life;
Knowledge: Whatever turns out to be helpful in a solution; Solution: A
system or program in practical use. I consider this paper and
Grefenstette's paper to be the ones that define the style of the book
as a whole. A last note of caution: for the reader used to CL
articles, the eleven pages of margin-to-margin running text with no
figures can make for hard reading.

In "Syntactic Annotation for the Spoken Dutch Corpus Project (CGN)"
(73-87), the first of the two articles dealing with the Spoken Dutch
Corpus Project, Heleen Hoekstra, Michael Moortgat, Ineke Schuurman,
and Ton van der Wouden describe how to annotate Dutch continuous
speech. The overall task is to annotate one thousand hours, circa 10M
words, using a theory-neutral formalism. Aside from the problems
involved in achieving theory neutrality, the peculiarities of Dutch
(crossing dependencies, etc.) make this annotation a complex endeavor.

Andre Kempe's "Part-of-Speech Tagging with Two Sequential Transducers"
(88-96) presents an interesting idea: use two sequential transducers
(i.e., finite state technology) to correct the errors of a baseline
(most frequent tag-per-word) part-of-speech tagger. These transducers
are applied in opposite directions (the first one left-to-right and the
second one right-to-left). It is interesting to note that while this
technique does not improve on existing part-of-speech taggers, it is very
efficient, making it useful for domains such as information retrieval
that may need to trade accuracy for speed.
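
To make the two-pass idea concrete, here is a minimal sketch in Python.
It is not Kempe's system: the lexicon and the two correction rules are
hypothetical, and plain functions stand in for the finite state
transducers. It only illustrates a most-frequent-tag baseline followed
by one left-to-right and one right-to-left correction pass.

```python
# Hypothetical toy lexicon: each word's most frequent tag (the baseline).
MOST_FREQUENT_TAG = {
    "the": "DET", "a": "DET", "can": "MD",
    "rusts": "VBZ", "to": "TO", "run": "VB",
}


def baseline_tag(words):
    """Assign every word its most frequent tag; unknown words become NN."""
    return [MOST_FREQUENT_TAG.get(w.lower(), "NN") for w in words]


def left_to_right_pass(tags):
    """Correct a tag using the already corrected tag to its left.

    Hypothetical rule: a modal or base verb right after a determiner
    is retagged as a noun.
    """
    out = list(tags)
    for i in range(1, len(out)):
        if out[i - 1] == "DET" and out[i] in ("MD", "VB"):
            out[i] = "NN"
    return out


def right_to_left_pass(tags):
    """Correct a tag using the already corrected tag to its right.

    Hypothetical rule: "to" not followed by a base verb is a preposition.
    """
    out = list(tags)
    for i in range(len(out) - 2, -1, -1):
        if out[i] == "TO" and out[i + 1] != "VB":
            out[i] = "IN"
    return out


def tag(words):
    """Baseline tagging followed by the two directional correction passes."""
    return right_to_left_pass(left_to_right_pass(baseline_tag(words)))


print(tag("run to the can".split()))
```

Each pass visits every position once, so the whole pipeline is linear in
sentence length, which is the kind of speed the paper is after.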

Regarding information retrieval, "Different approaches to Cross
Language Information Retrieval" (97-110) by Wessel Kraaij and Renee
Pohlmann presents an overview of cross-language information retrieval
as seen, for instance, in the TREC-6 evaluation conference. It seems
to me that the most outstanding contribution of their technique is to
mix different approaches regarding whether to translate the documents
or translate the query. They achieve this by incorporating translation
information into their document ranking model. They provide a thorough
evaluation including all translations, most probable translation and
word sense disambiguated translation. Their results seemed a little
counter-intuitive to me, but their analysis is indeed thorough.

In "A New-Old Class of Linguistically Motivated Regulated Grammars"
(111-125), S. Marcus, C. Martin-Vide, V. Mitrana and Gh. Paun present
a heavily theoretical paper, following some ideas presented by
I. Bellert in 1965 that seem to have been overlooked. The
authors are interested in the central problem of the generative power
of families of grammars: which grammar formalism can be used to deal
with natural language beyond the context-free grammars but below the
context-sensitive ones. Their proposed formalism, "Path Controlled
Grammars", works with two context-free grammars over different
alphabets. The first one is an ordinary context-free grammar over the
alphabet of the target language. The second one is defined over the
possible set of intermediate rewritings of the first grammar. A
derivation of the whole system can only use strings validated by the
second grammar. This family of grammars is mildly
context-sensitive. The authors also prove a pumping lemma. It would
be interesting to see further development of parsers and grammars
using this formalism.

In the second article dealing with the CGN project, "CGN to Grail:
Extracting a Type-logical Lexicon from the CGN Annotation" (126-143),
Michael Moortgat and Richard Moot describe how the CGN annotation
can be mapped to one particular formalism. The formalism itself (proof
nets for the Grail theorem prover/parser) is quite complicated. The
two articles (Hoekstra, Moortgat, Schuurman and van der Wouden, and
this one) are best read together. However, this one requires a good
deal of background knowledge on their formalism to be understood, as
well as knowledge of Dutch linguistics. It is interesting to see how
the annotation affects the transformation process. In any case, this
paper is the most Dutch-dependent in the collection.

Thierry Poibeau and Leila Kosseim present in "Proper Name Extraction
from Non-Journalistic Texts" (146-157) a series of experiments dealing
with named entity recognition in unusual domains. I found this
article an important contribution, since, in my personal experience,
general tools work poorly on domains different from the ones for which
they were trained. The figures shown in the paper (90% performance on
journalistic text dropping to 50% in non-traditional domains) are
indicative of the effects a practitioner may find with tools for
tasks other than proper name extraction when the tools are trained
on general text. The process followed by the authors in adapting the
tools to new domains allows them to recover most of the lost
performance. The article itself should be mandatory reading for
researchers working on specific domains and sub-languages.

The only generation article in the proceedings, "Generating
Referring Expressions in a Multimodal Context: An empirically oriented
approach" (158-176), by Ielka van der Sluis and Emiel Krahmer, targets
the classic problem of generating referring expressions but now in a
multimodal context. They extend Dale & Reiter's classic algorithm with
information such as the distance between the object and focus of
attention. Their algorithm, however, is NP-complete. It can regain
polynomial time behavior under certain conditions, explained in (van
Deemter 2001). Their "empirical approach" refers to the fact that they
derive their algorithm from the experiments with human subjects done by
Beun and Cremers (1998).
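
For readers unfamiliar with the starting point, the following Python
sketch shows the core loop of an incremental algorithm in the style of
Dale & Reiter (presumably the classic algorithm referred to above). It
is a simplified illustration, not the authors' multimodal extension:
distance and focus of attention are not modelled, and the objects,
attribute names and preference order are all hypothetical.

```python
def incremental_description(target, distractors, preferred_attributes):
    """Pick attribute/value pairs that distinguish target from distractors.

    Attributes are tried in a fixed preference order; an attribute is kept
    only if it rules out at least one remaining distractor. Returns None
    when the target cannot be distinguished with the given attributes.
    """
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        if not remaining:
            break
        value = target.get(attr)
        if value is None:
            continue
        survivors = [d for d in remaining if d.get(attr) == value]
        if len(survivors) < len(remaining):  # attr rules something out
            description[attr] = value
            remaining = survivors
    return description if not remaining else None


# Hypothetical scene: refer to the small black dog.
target = {"type": "dog", "colour": "black", "size": "small"}
distractors = [
    {"type": "dog", "colour": "white", "size": "small"},
    {"type": "cat", "colour": "black", "size": "small"},
]
print(incremental_description(target, distractors, ["type", "colour", "size"]))
```

The loop never backtracks, so this basic version runs in time
proportional to the number of attributes times the number of
distractors.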

Erik F. Tjong Kim Sang's "Transforming a Chunker to a Parser"
(177-188) presents a promising idea of building a parse tree by
cascading chunker applications. The techniques and experiments
described in the paper seem sound and well grounded, although the
results do not compare well with the state of the art in parsing
technology. I like to compare this paper to Kempe's approach to part
of speech tagging. In this case, however, each of the chunkers must
be trained separately, and its model must be loaded at runtime;
therefore, a claim of efficiency gain cannot be made.
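
The cascading idea can be sketched in a few lines of Python. This is
only an illustration under strong assumptions: the paper uses trained
chunkers, whereas the sketch below uses a hypothetical hand-written
rule table, and the tag and phrase labels are invented for the example.

```python
# Hypothetical chunk patterns: sequence of child labels -> parent label.
# Every pattern has at least two symbols, so each match shortens the list.
RULES = [
    (("DET", "NN"), "NP"),
    (("IN", "NP"), "PP"),
    (("VBZ", "NP"), "VP"),
    (("NP", "VP"), "S"),
]


def label(node):
    """A node is (label, word) for a leaf or (label, children) otherwise."""
    return node[0]


def chunk_once(nodes, rules):
    """One chunker application: group adjacent nodes, scanning left to right."""
    out, i = [], 0
    while i < len(nodes):
        for pattern, parent in rules:
            span = nodes[i:i + len(pattern)]
            if tuple(label(n) for n in span) == pattern:
                out.append((parent, span))
                i += len(pattern)
                break
        else:  # no pattern matched at position i
            out.append(nodes[i])
            i += 1
    return out


def cascade(tagged_words, rules, max_levels=10):
    """Apply the chunker to its own output until nothing changes."""
    nodes = [(tag, word) for word, tag in tagged_words]
    for _ in range(max_levels):
        new_nodes = chunk_once(nodes, rules)
        if len(new_nodes) == len(nodes):
            break
        nodes = new_nodes
    return nodes


sentence = [("the", "DET"), ("dog", "NN"), ("sees", "VBZ"),
            ("a", "DET"), ("cat", "NN")]
tree = cascade(sentence, RULES)
print(tree)
```

Each level of the tree comes from a separate pass over the previous
level's output, which is why every level needs its own model when the
chunkers are learned rather than hand-written.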

The last paper in the book is "Automatic Detection of Problematic
Turns in Human-Machine Interactions" (189-200), by Antal van den
Bosch, Emiel Krahmer and Marc Swerts. The authors describe a Dutch
travel reservation system. In particular, they address the
automatic construction of a classifier for errors in dialogs (such as "I
want to go to Amsterdam/So you want to go to Rotterdam?"). Their
system is quite successful. They compare two machine learning
techniques, with the rule induction one clearly outperforming the
memory-based approach (an interesting result from the machine
learning perspective).

OVERALL ANALYSIS 

The book itself contains a good snapshot of natural language
processing and computational linguistics in Europe at the beginning of
the decade. In general, the Dutch contributions are more homogeneous
than their non-Dutch counterparts. All in all, the book makes for
interesting reading, covering a variety of topics.

REFERENCES 

Klavans J L, Resnik P, (1996) The Balancing Act, Combining Symbolic
and Statistical Approaches to Language. Cambridge, MA: MIT Press.
(Linguist List review at http://linguistlist.org/issues/8/8-834.html)

ABOUT THE REVIEWER

Pablo Ariel Duboue is a senior PhD student working under the
supervision of Dr. Kathleen McKeown at the Natural Language Processing
group, Columbia University in the City of New York (USA). His research
interest lies in the area of Natural Language Generation, mainly in
the automatic construction of content planners from aligned
corpora. More information about Pablo is available at:
http://www.cs.columbia.edu/~pablo
 