
Review of Computational Linguistics in the Netherlands 2001.


Reviewer: Pablo Ariel Duboue
Book Title: Computational Linguistics in the Netherlands 2001.
Book Author: Hendri Hondorp, Mariët Theune, Anton Nijholt
Publisher: Rodopi
Linguistic Field(s): Computational Linguistics
Subject Language(s): Dutch, English
Language Family(ies): Germanic, New English
Book Announcement: 14.499

Review:


Date: Mon, 17 Feb 2003 15:51:14 -0500 (EST)
From: Pablo Ariel Duboue <pablo@cs.columbia.edu>
Subject: Computational Linguistics in the Netherlands 2001

Theune, Mariët, Anton Nijholt, and Hendri Hondorp, eds. (2002)
Computational Linguistics in the Netherlands 2001: Selected Papers
from the Twelfth CLIN Meeting. Rodopi, viii+207pp, hardback, ISBN
90-420-0943-8, US$50, EUR50, Language and Computers series 45.

Book Announcement on Linguist:
http://linguistlist.org/issues/13/13-2106.html

Pablo A. Duboue, Computer Science Department, Columbia University, USA

SYNOPSIS
Continuing the tradition of a second round of submission and
reviewing after the CLIN meeting, this year's "Computational
Linguistics in the Netherlands" offers mostly Dutch-related content.
The book covers a wide variety of topics in 14 papers plus the
extended abstract of the invited talk. The second round of revisions
ensures a high level of quality, since the authors can profit from
the discussions at the meeting before submitting their extended
versions.

DETAILED ANALYSIS
To facilitate the discussion, I have divided the published papers
into four sections. The division is not clear-cut and serves mostly
expository purposes. The sections are Theory (including
psycholinguistically motivated work), Speech (including dialogs),
Corpus (including creation and evaluation) and Tools (for Dutch and
multilingual).

Theory
I placed four papers in this category, two presenting specific
results and two addressing psycholinguistic concerns. Their results
are mostly language independent, with most papers providing English
examples. In "Conservative vs Set-driven Learning Functions for the
Classes k-valued" (Christophe COSTA FLORÊNCIO), the author answers,
by means of a constructive proof, a question left open by Kanazawa
(1998). The work focuses on Classical Categorial Grammars (CCGs). In
"Reference Resolution in Context" (Jan van EIJCK), pronoun reference
resolution is analyzed in terms of incremental semantics. The
concepts of the paper are illustrated with a Haskell prototype,
available from the author's homepage.

Finally, two contributions deal with psycholinguistically motivated
formalisms: "Incremental Generation of Self-corrections Using
Underspecification" (Markus GUHE and Frank SCHILDER), from a
generation perspective; and "Performance Grammar: a Declarative
Definition" (Gerard KEMPEN and Karin HARBUSCH), from an understanding
perspective. Guhe and Schilder's work builds on a
psycholinguistically plausible generation architecture to generate
self-corrections (e.g. "I have two seats... uh no... one seat
available"). The authors argue that such self-corrections are
required for systems with dynamic input data. Kempen and Harbusch,
on the other hand, present an HPSG-motivated grammar formalism,
Performance Grammar (PG). PG also captures important psycholinguistic
features such as incrementality and late linearization. The authors
provide both Dutch and English formalizations.

Speech
Dutch morphology makes for a particularly challenging environment in
Speech Recognition. The number of out-of-vocabulary (OOV) words can
be quite large, as a result of compounding and other word formation
rules. Two papers explore solutions to this problem: "Memory-Based
Phoneme-to-Grapheme Conversion" (Bart DECADT, Jacques DUCHATEAU,
Walter DAELEMANS, and Patrick WAMBACQ) and "Automated Compounding as a
Means for Maximizing Lexical Coverage" (Vincent VANDEGHINSTE). Decadt
et al. investigate the automatic guessing of Dutch spelling from
phoneme transcriptions (phoneme-to-grapheme conversion). Their
algorithm performs outstandingly well on clean input. However, the
authors acknowledge that further work is required to accommodate the
highly noisy phonetic transcriptions produced by the speech
recognition system. Vandeghinste explores a different but related
problem: optimizing the use of the bounded memory of the speech
recognition system. As only 36,000 words can be stored in that
memory, the author combines several readily available lexicons for
Dutch to extract roots and "quasi"-roots for Dutch words. He then
recombines these into more complex words, using a statistically
trained module. His results seem to be ready for practical
application and his statistical analysis is very thorough.

Dealing with errors in dialogs, "Multi-feature Error Detection in
Spoken Dialogue Systems" (Piroska LENDVAI, Antal van den BOSCH, Emiel
KRAHMER, and Marc SWERTS) analyzes the impact of combining prosodic
and non-prosodic features in automatic error detection. The authors
try to reproduce results previously reported for English spoken
corpora; their results on a Dutch corpus provide mixed evidence
regarding the importance of prosodic features.

In the extended abstract of the invited talk, "Ideas on Multi-layer
Dialogue Management for Multi-party, Multi-conversation, Multi-modal
Communication" (David R. TRAUM), the challenges behind the complex
Mission Rehearsal Exercise (MRE) are outlined. The MRE is a military
training environment where synthetic agents interact with a human
trainee, in a Bosnian village setting. The talk stresses the multiple
problems encountered during the MRE's development. The MRE challenges
go well beyond the ones faced by regular dialogue systems.

Corpus
Corpus creation and evaluation in Dutch is a matter of optimizing
existing, limited resources and maximizing the impact of their
applications. On those grounds, "The Alpino Dependency
Treebank" (Leonoor van der BEEK, Gosse BOUMA, Rob MALOUF, and Gertjan
van NOORD) describes the ongoing construction of a dependency
treebank for Dutch, with the objective of theory-neutrality. Also on
the Alpino treebank, "Corpus-based Acquisition of Collocational
Prepositional Phrases" (Gosse BOUMA and Begoña VILLADA) investigates
the problem of collocational prepositional phrases (CPPs), and
experiments with techniques for automated acquisition. While their
initial analysis of the linguistics of the CPPs is very thorough (and
goes beyond computational linguistics, being of interest to linguists
in general), the authors express slight disappointment with their
acquisition results. It seems a better definition of CPPs is
required.

Working on the PAROLE corpus, in "Tagging the Dutch PAROLE Corpus"
Jesse de DOES et al. are confronted with little training data and a
large tagset (with complex, syntactically motivated tags). The authors
try to cope with such a challenging situation by using a mixture of
different part-of-speech taggers. They also adapt POS taggers trained
on larger corpora with a different tagset, by learning
tag-transformation rules. While the authors express regret about their
overall results, the constraints on their task make it a very
challenging one indeed.
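
To make the idea of mixing taggers concrete, here is a minimal sketch
that assumes the simplest possible combination scheme, a per-token
majority vote. The review does not specify how de Does et al. combine
their taggers, so the function `vote`, the tag names and the sample
outputs are all hypothetical.

```python
from collections import Counter


def vote(tag_sequences):
    """Combine per-token tags from several taggers by majority vote.

    `tag_sequences` holds one tag list per tagger, all of equal length.
    This is an assumed combination scheme, not the authors' method.
    """
    combined = []
    for token_tags in zip(*tag_sequences):
        counts = Counter(token_tags)
        best = max(counts.values())
        # On a tie, keep the tag proposed by the earliest tagger listed.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical outputs of three taggers for a three-token sentence.
TAGGER_OUTPUTS = [
    ["DET", "NOUN", "VERB"],
    ["DET", "VERB", "VERB"],
    ["DET", "NOUN", "NOUN"],
]

print(vote(TAGGER_OUTPUTS))
```

With a large tagset and little training data, a scheme like this can
only help where the individual taggers make different errors.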

"Creating a Dutch Information Retrieval Test Corpus" (Djoerd HIEMSTRA,
David van LEEUWEN) explains the internals of the Dutch section
employed in the CLEF (Cross-language Evaluation Forum). CLEF is a
European, multilingual counterpart of the Text Retrieval Conference
(TREC), focusing on information retrieval (IR). The paper discusses
the logistics involved in the construction of the Dutch corpus,
together with some CLEF results. A very thorough analysis of the
impact of judge subjectivity on the overall IR results is worth
mentioning.

Tools
This very general section covers the three remaining papers. "A Named
Entity Recognition System for Dutch" (Fien DE MEULDER, Walter
DAELEMANS, Véronique HOSTE) presents an interesting approach for the
rapid development of language technology tools: a small sample of
expected output is hand-tagged and a rule induction machine learning
system (RIPPER) is run over it. System developers then analyze the
rules and integrate them into a rule-based system. The benefits of
this approach are the ability of the human programmer to tell good
rules from bad ones, together with the possibility of integrating
rules from different runs of the machine learning system. The use of
machine learning as an aid for human knowledge acquisition seems to
speed up their development process considerably, and the technique is
easily applicable to other problems or domains.

The question of whether stemming (reducing a word to a rough version
of its root) is useful for text classification is revisited in
"Accurate Stemming of Dutch for Text Classification" (Tanja GAUSTAD
and Gosse BOUMA). The authors carry out an extrinsic evaluation
of two stemmers: a complex, very accurate, dictionary-based stemmer
and the Dutch version of the Porter stemmer (straightforward but
inaccurate). Their results provide mixed evidence for the utility of
stemming and diverge from published English experiments.
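
The contrast between the two kinds of stemmer can be shown with a toy
sketch. Neither function below is one of the stemmers evaluated in
the paper: `dictionary_stem` looks words up in a hypothetical
three-entry lexicon, and `suffix_stem` is a naive suffix stripper
that is far cruder than the Dutch Porter stemmer.

```python
# Hypothetical mini-lexicon mapping Dutch word forms to their lemmas.
LEMMAS = {"huizen": "huis", "boeken": "boek", "liep": "lopen"}


def dictionary_stem(word):
    """Return the lexicon lemma, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)


def suffix_stem(word, suffixes=("en", "e", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("huizen", "boeken", "liep"):
    print(w, dictionary_stem(w), suffix_stem(w))
```

The suffix stripper gets "boeken" right, turns "huizen" into the
non-word "huiz", and cannot relate the irregular form "liep" to
"lopen". An extrinsic evaluation asks whether such errors actually
matter for the downstream classification task.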

Finally, "Applying Monte Carlo Techniques to Language Identification"
(Arjen POUTSMA) provides an interesting new methodology for
language identification. Although the author acknowledges that
automatically guessing the language of a given document is generally
considered a solved problem, he proposes a novel, more efficient
approach. The technique, based on Monte Carlo sampling, requires only
a small sample of the text in question. It provides results slightly
below the state of the art but with an 850% speed-up.
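
The general idea of sampling instead of reading the whole text can be
sketched as follows. This is not Poutsma's algorithm, whose details
the review does not give. It assumes character trigram profiles with
add-one smoothing, and the tiny training strings, the function names
and the `n_samples` parameter are all made up for illustration.

```python
import math
import random
from collections import Counter


def trigrams(text):
    """Return all overlapping character trigrams of a lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train_profile(sample):
    """Count trigram frequencies in a training sample for one language."""
    return Counter(trigrams(sample))


def identify(text, profiles, n_samples=50, seed=0):
    """Guess the language of `text` from a random sample of its trigrams."""
    grams = trigrams(text)
    if not grams:
        raise ValueError("text must contain at least three characters")
    rng = random.Random(seed)
    # Monte Carlo step: score a random draw rather than every trigram.
    drawn = [rng.choice(grams) for _ in range(n_samples)]
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        vocab = len(profile) + 1
        # Add-one smoothing keeps unseen trigrams from producing log(0).
        scores[lang] = sum(
            math.log((profile[g] + 1) / (total + vocab)) for g in drawn
        )
    return max(scores, key=scores.get)


# Toy training data; real profiles would come from large corpora.
PROFILES = {
    "dutch": train_profile(
        "de kat zit op de mat en het is een mooie dag vandaag in het dorp"
    ),
    "english": train_profile(
        "the cat sits on the mat and it is a beautiful day today in the village"
    ),
}

print(identify("het is vandaag een mooie dag", PROFILES))
print(identify("the cat is in the village today", PROFILES))
```

The cost of `identify` depends on `n_samples` rather than on the
length of the document, which is where a speed-up of this kind would
come from, at the price of some accuracy.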

OVERALL ANALYSIS
A quick scan of the list of contributors shows that, out of 31
contributors, only three authors (the US invited speaker and two
German authors) are based outside the Netherlands and Flanders.
This focus on Dutch processing makes the book of particular
interest for researchers working on Dutch or on similar languages
with a complex morphology. Nevertheless, computational linguists
focusing on languages spoken by small communities can also profit
from the experiences reported in the book. It is also worth noting
that the new edition is hardcover, compared to last year's paperback.
This may motivate purchasing the physical book, even though its
contents are also available online.

REFERENCE
Kanazawa, M. (1998) Learnable Classes of Categorial Grammars, CSLI
Publications, Stanford University.





 
ABOUT THE REVIEWER:
Pablo Ariel Duboue is a senior PhD student working under the supervision of Dr. Kathleen McKeown at the Natural Language Processing group, Columbia University in the City of New York (USA). His research interests fall in the area of Natural Language Generation, mainly the automatic construction of content planners from aligned corpora. More information about Pablo is available at http://www.cs.columbia.edu/~pablo
