LINGUIST List 14.499

Wed Feb 19 2003

Review: Computational Ling: Theune, et al. (2002)

Editor for this issue: Naomi Ogasawara <naomi@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive, and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at simin@linguistlist.org.

Directory

  1. Pablo Ariel Duboue, Computational Linguistics in the Netherlands 2001

Message 1: Computational Linguistics in the Netherlands 2001

Date: Mon, 17 Feb 2003 22:36:38 +0000
From: Pablo Ariel Duboue <pablo@cs.columbia.edu>
Subject: Computational Linguistics in the Netherlands 2001

Theune, Mariët, Anton Nijholt, and Hendri Hondorp, eds. (2002)
Computational Linguistics in the Netherlands 2001: Selected Papers
from the Twelfth CLIN Meeting. Rodopi, viii+207pp, hardback, ISBN
90-420-0943-8, US$50, EUR50, Language and Computers series 45.

Announced at http://linguistlist.org/issues/13/13-2106.html


Pablo A. Duboue, Computer Science Department, Columbia University, USA

SYNOPSIS

Continuing the tradition of a second round of submission and
reviewing after the CLIN meeting, this year's "Computational
Linguistics in the Netherlands" offers mostly Dutch-related content.
The book covers a wide variety of topics, distributed over
14 papers and the extended abstract of the invited talk. The second
round of revisions ensures a high level of quality, and the authors
profit from the discussions at the meeting before sending their
extended versions.

DETAILED ANALYSIS

I decided to divide the published papers into four sections, to
facilitate the discussion. The division into sections is not
clear-cut; it serves mostly expository purposes. These
sections are Theory (including psycholinguistically motivated works),
Speech (including dialogs), Corpus (including creation and evaluation)
and Tools (for Dutch and multilingual).

Theory

I found four papers in this category, two dealing with particular
results and two with psycholinguistic concerns. Their results are
mostly language independent, with most papers providing English
examples. In "Conservative vs Set-driven Learning Functions for the
Classes k-valued" (Christophe COSTA FLORÊNCIO), the author
answers an open question set aside by Kanazawa (1998), by means of a
constructive proof. The work focuses on Classical Categorial
Grammars (CCGs). In "Reference Resolution in Context" (Jan van
EIJCK), pronoun reference resolution is analyzed in terms of
incremental semantics. The concepts of the paper are exemplified with
an implemented Haskell prototype, available from the author's
homepage.

Finally, two contributions deal with psycholinguistically motivated
formalisms: "Incremental Generation of Self-corrections Using
Underspecification" (Markus GUHE and Frank SCHILDER), from a
generation perspective; and "Performance Grammar: a Declarative
Definition" (Gerard KEMPEN and Karin HARBUSCH), from an understanding
perspective. Guhe and Schilder's work builds on a
psycholinguistically plausible generation architecture to generate
self-corrections (e.g. "I have two seats... uh no... one seat
available"). The authors argue that such self-corrections are
required for systems with dynamic input data. Kempen and Harbusch, on
the other hand, present an HPSG-motivated grammar formalism,
Performance Grammar (PG). PG also captures important psycholinguistic
features such as incrementality and late linearization. The authors
provide both Dutch and English formalizations.

Speech

Dutch morphology makes for a particularly challenging environment in
Speech Recognition. The number of out-of-vocabulary (OOV) words can
be quite large, as a result of compounding and other word formation
rules. Two papers explore solutions to this problem: "Memory-Based
Phoneme-to-Grapheme Conversion" (Bart DECADT, Jacques DUCHATEAU,
Walter DAELEMANS, and Patrick WAMBACQ) and "Automated Compounding as
a Means for Maximizing Lexical Coverage" (Vincent VANDEGHINSTE).
Decadt et al. investigate the automatic guessing of Dutch spelling
from phoneme transcriptions (phoneme-to-grapheme conversion). Their
algorithm performs outstandingly well on clean input. However, the
authors acknowledge that further work is required to accommodate the
highly noisy phonetic transcriptions coming from the speech recognition
system. Vandeghinste explores a different but related problem:
optimizing the use of the bounded memory of the speech
recognition system. As only 36,000 words can be stored in that
memory, the author combines several readily available lexicons for
Dutch to extract roots and "quasi"-roots for Dutch words. He later
re-combines the roots into more complex words, using a statistically
trained module. His results seem to be ready for practical
application and his statistical analysis is very thorough.

Dealing with errors in dialogs, "Multi-feature Error Detection in
Spoken Dialogue Systems" (Piroska LENDVAI, Antal van den BOSCH, Emiel
KRAHMER, and Marc SWERTS) analyzes the impact of combining
prosodic and non-prosodic features in automatic error detection.
The authors try to reproduce results reported for English spoken
corpora; their own results on a Dutch corpus provide mixed evidence
regarding the importance of prosodic features.

In the extended abstract of the invited talk, "Ideas on Multi-layer
Dialogue Management for Multi-party, Multi-conversation, Multi-modal
Communication" (David R. TRAUM), the challenges behind the complex
Mission Rehearsal Exercise (MRE) are outlined. The MRE is a military
training environment where synthetic agents interact with a human
trainee, in a Bosnian village setting. The talk stresses the
multiple problems involved in the MRE's development. The MRE
challenges go well beyond the ones faced by regular dialogue systems.

Corpus

Corpus creation and evaluation in Dutch is a matter of optimizing
existing, limited resources and maximizing the impact of the
resources' applications. On those grounds, "The Alpino Dependency
Treebank" (Leonoor van der BEEK, Gosse BOUMA, Rob MALOUF, and Gertjan
van NOORD) describes the on-going construction of a dependency
treebank for Dutch, with the objective of theory-neutrality. Also on
the Alpino treebank, "Corpus-based Acquisition of Collocational
Prepositional Phrases" (Gosse BOUMA and Begoña VILLADA)
investigates the problem of collocational prepositional phrases
(CPPs), and experiments with techniques for automated acquisition.
While their initial analysis of the linguistics of the CPPs is very
thorough (and goes beyond computational linguistics, being of interest
to linguists in general), the authors express slight disappointment
with their acquisition results. It seems a better definition of
CPPs is required.

Working on the PAROLE corpus, in "Tagging the Dutch PAROLE Corpus"
(Jesse de DOES et al.) the authors face little training data and a
large tagset (with syntactically motivated, complex tags). The
authors try to cope with such a challenging situation by using a
mixture of different part-of-speech taggers. They also adapted
POS-taggers trained on larger corpora with a different tagset, by
learning tag-transformation rules. While the authors express regret
about their overall results, the constraints on their task render it a
very challenging one indeed.

"Creating a Dutch Information Retrieval Test Corpus" (Djoerd
HIEMSTRA, David van LEEUWEN) explains the internals of the Dutch
section employed in the CLEF (Cross-language Evaluation Forum). CLEF
is a European, multilingual counterpart to the Text Retrieval
Conference (TREC), focusing on information retrieval (IR). The paper
discusses the logistics involved in the construction of the Dutch
corpus, together with some CLEF results. A very thorough analysis of
the impact of judge subjectivity on the overall IR results is worth
mentioning.

Tools

This very general section captures the three remaining papers. "A Named
Entity Recognition System for Dutch" (Fien DE MEULDER, Walter
DAELEMANS, Véronique HOSTE) presents an interesting approach for
rapid development of language technology tools: a small sample of
expected output is hand-tagged and a rule induction machine learning
system (RIPPER) is run over it. System developers then analyze the
rules and integrate them in a rule-based system. The benefits of this
approach are the ability of the human programmer to tell good rules
from bad ones, together with the possibility of integrating rules from
different runs of the machine learning system. The use of machine
learning as an aid for human knowledge acquisition seems to speed up
their development process quite a bit, and it is a technique easily
applicable to other problems or domains.

The question of whether stemming (reducing a word to a rough version
of its root) is useful for text classification is revisited in
"Accurate Stemming of Dutch for Text Classification" (Tanja GAUSTAD
and Gosse BOUMA). The authors carry out an extrinsic evaluation
of two stemmers: a complex, very accurate, dictionary-based stemmer
and the Dutch version of the Porter stemmer (straightforward but
inaccurate). Their results provide mixed evidence of the utility of
stemming and diverge from published English experiments.

Finally, "Applying Monte Carlo Techniques to Language
Identification" (Arjen POUTSMA) provides an interesting new
methodology to perform language identification. While the author
acknowledges that automatically guessing the language of a
given document is considered a solved problem, he proposes a novel,
more efficient approach. The technique, based on Monte Carlo
sampling, requires only a small sample of the text in question. It
provides results slightly below the state of the art but with an 850%
speed-up.
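
To make the sampling idea concrete, here is a minimal sketch of
language identification from a random sample of words. It is only an
illustration under stated assumptions and not the method of the paper:
the tiny word lists, the function name identify_language and the
parameters sample_size and seed are all hypothetical.

```python
import random

# Toy per-language word lists (hypothetical; a real system would rely on
# much larger frequency lists or character n-gram models).
COMMON_WORDS = {
    "nl": {"de", "het", "een", "en", "van", "is", "niet", "dat"},
    "en": {"the", "a", "and", "of", "is", "not", "that", "to"},
}


def identify_language(text, sample_size=20, seed=0):
    """Guess the language of `text` from a random sample of its words.

    Returns a language code from COMMON_WORDS, or None if the text is
    empty or no sampled word matches any list.
    """
    words = text.lower().split()
    if not words:
        return None
    rng = random.Random(seed)
    # Sample with replacement so that short texts can still be sampled.
    sample = [rng.choice(words) for _ in range(sample_size)]
    # Score each language by how many sampled words appear in its list.
    scores = {
        lang: sum(word in vocab for word in sample)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The cost of a guess depends on sample_size rather than on the length
of the document, which is where a speed-up of this kind would come
from; the price is that a small sample can occasionally miss the
informative words.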

OVERALL ANALYSIS

A quick scan over the list of contributors shows that, out of 31
contributors, only three authors (the US invited speaker and two
German authors) are located outside the Netherlands and Flanders.
Such focus on Dutch processing makes the book of particular
interest for researchers working on Dutch or similar languages
with a complex morphology. Nevertheless, computational
linguists focusing on languages spoken by small communities can profit
from the experiences reported in the book. It is also worth noting
that the new edition is hardcover, compared to last year's paperback.
This may be a reason to purchase the physical book, given that its
contents are also available online.

REFERENCE

Kanazawa, M. (1998) Learnable Classes of Categorial Grammars, CSLI
Publications, Stanford University.

ABOUT THE REVIEWER

Pablo Ariel Duboue is a senior PhD student working under the
supervision of Dr. Kathleen McKeown at the Natural Language Processing
group, Columbia University in the City of New York (USA). His
research interests lie in the area of Natural Language Generation,
mainly the automatic construction of content planners from aligned
corpora. More information about Pablo is available at
http://www.cs.columbia.edu/~pablo