LINGUIST List 12.964

Fri Apr 6 2001

Review: Comp Ling in the Netherlands 1998

Editor for this issue: Terence Langendoen <terrylinguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at siminlinguistlist.org or Terry Langendoen at terrylinguistlist.org.

Directory

  1. Laura Alonso, Review: Computational Linguistics in the Netherlands 1998

Message 1: Review: Computational Linguistics in the Netherlands 1998

Date: Fri, 6 Apr 2001 16:34:43 +0200 (CEST)
From: Laura Alonso <lauraclic.fil.ub.es>
Subject: Review: Computational Linguistics in the Netherlands 1998

Van Eynde, F., Schuurman, I., and Schelkens, N., ed. (2000)
Computational Linguistics in the Netherlands 1998, Editions Rodopi
B.V., 233 pp., Studies in Practical Linguistics

Laura Alonso, CLiC (Centre for Language and Computation),
University of Barcelona


This book consists of fourteen selected papers from the 1998
Computational Linguistics in the Netherlands (CLIN) conference.
The contents fall into three groups: statistical methods,
syntax/semantics and applications.

The section on statistical methods is the largest one with six
papers. Two of them concentrate on phonotactic properties (Kleiweg
& Nerbonne and Stoianov & Nerbonne). The other four are of a
general methodological nature; they concern style adaptation of
statistical language models (Van Uystel, Wambacq & Van
Compernolle), memory-based word sense disambiguation (Veenstra,
van den Bosch, Buchholz, Daelemans & Zavrel), instance families in
memory-based language learning (van den Bosch), and the
arbitrariness of lexical categories (Durieux, Daelemans & Gillis).

The section on syntax and semantics consists of five papers. Three
of them employ the framework of Head-driven Phrase Structure
Grammar (HPSG) for formal analyses of Polish clitics (Kupsc),
null-headed nominals in German and English (Nerbonne & Mullen), and
words without content, or figure heads (Van Eynde). The other two
report on work connected with the Amazon grammar: the last part of
the middle field in Dutch (Van Dreumel) and parenthetical
reporting clauses (Schelfhout).

The section on applications has three papers. One shows how NLP
tools can be used for the development of a course on computational
linguistics (Bouma). Another one discusses the use of NLP
techniques in document processing (van der Eijk & Janssen). The
last one concerns the evaluation of the NLP components of the
OVIS2 spoken dialogue system (Veldhuijzen van Zanten, Bouma,
Sima'an, van Noord & Bonnema).

Here is a summary and discussion of the papers in the volume.

Kleiweg, P., Nerbonne, J., 'An FGREP Investigation on
Phonotactics'.
This paper discusses experiments with neural networks trained to
represent letters in a manner meaningful to the processing task:
the network is presented with monosyllabic words, one letter at a
time, and the network has to learn to predict the next letter. The
algorithm used is Miikkulainen's FGREP (Forming Global
Representations with Extended back Propagation). However, since
the network did very badly in distinguishing valid from invalid
words, FGREP was augmented with a 'dispersion' algorithm to
improve distinctness among the letter representations, which
improved performance and made the network sensitive to
dependencies of letters separated by one or more letters, showing
a stricter notion of 'valid word'.
 These experiments are psycholinguistically interesting in
that they were aimed at capturing some aspects of human language
processing rather than at developing a better algorithm for the
machine-learning of language. The data sets used for these
experiments and the results are available at:
http://www.let.rug.nl/~kleiweg/papers/afiip.

Stoianov, I., Nerbonne, J. 'Exploring Phonotactics with Simple
Recurrent Networks'. This paper extends an initial experiment on
learning graphotactics to the learning of phonotactics. In
addition, the network's performance is analyzed further with
regard to variables such as word frequency, length, neighborhood
and error location.
 This informal comparison of SRNs and human performance
suggests that neural networks may be used for learning natural
language, thus challenging connectionism to tackle symbolic
problems.

Van Uystel, D.H., Wambacq, P., Van Compernolle, D., 'Style
Adaptation of Statistical Language Models'. Language models
associate each sentence hypothesis generated by a speech
recognizer with its probability of occurring in a given domain.
Statistical language models are based on n-grams: an n-gram model
assumes that the occurrence of a word in a sentence depends only
on the n-1 preceding words.
 This paper discusses the adaptation of language models from a
general training corpus (broadcast news or financial
newspaper) to the style of a given domain (news talkshow). The
authors propose three different adaptation schemes,
transformation and two variants of relevance weighting, which make
use of a weighted counting approach. The results show that
weighted counting is more effective on the financial newspaper
corpus than on the broadcast news, although the gains from the
tested methods remain rather modest. POS n-grams may not be
sufficient to characterize style, so more fine-grained style
distinctions should be made.
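
The n-gram assumption described above can be illustrated with a
minimal sketch. It is a plain maximum-likelihood bigram model (n = 2)
over a two-sentence toy corpus invented for this example; it is not
the authors' weighted counting scheme, and the function names are
hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences padded with <s>, </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])             # every token that serves as a context
        bigrams.update(zip(padded, padded[1:]))  # (previous word, word) pairs
    return unigrams, bigrams

def sentence_probability(tokens, unigrams, bigrams):
    """Unsmoothed bigram probability: product of count(prev, w) / count(prev)."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0                           # unseen context
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

# Toy corpus (an assumption made for illustration only).
corpus = [["the", "market", "rose"], ["the", "market", "fell"]]
unigrams, bigrams = train_bigram(corpus)
p_seen = sentence_probability(["the", "market", "rose"], unigrams, bigrams)
p_scrambled = sentence_probability(["market", "the", "rose"], unigrams, bigrams)
print(p_seen, p_scrambled)
```

A recognizer could use such scores to rank competing sentence
hypotheses; a practical model would add smoothing so that unseen
word sequences do not receive zero probability.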

Veenstra, J., Van den Bosch, A., Buchholz, S., Daelemans, W.,
Zavrel, J. 'Memory-based Word Sense Disambiguation'. The authors
present a method for the word sense disambiguation task in the
SENSEVAL project <http://www.itri.brighton.ac.uk/events/senseval>:
the association of a word in context with its contextually
appropriate sense tag. For each word to be disambiguated, a
distinct classifier is constructed and then trained on POS-tagged
corpus examples and selected information from dictionary entries.
 The classifier extracts: (1) context features: a window of
two words (and their POS tags) to the left and to the right of the
word of interest; and (2) keyword features: a number of relatively
frequent words that often co-occur with the sense of interest.
 The method achieves relatively high accuracy and is
computationally very economical, in contrast with other
machine-learning methods that cannot deal with the number of
features that interact in word sense disambiguation. An
interesting direction for future work would be to determine
whether the method can rely on dictionary information alone when
no abundant labeled training corpus is available.
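
As a rough illustration of the context features described above, the
sketch below extracts a window of two words and POS tags on each side
of a target word and classifies by feature overlap with stored
examples. The example sentences, sense labels and function names are
invented for illustration; the keyword features and the authors'
actual similarity metric are not reproduced here.

```python
from collections import Counter

def context_features(tagged, index, width=2):
    """Words and POS tags in a window of `width` tokens on each side of `index`."""
    feats = []
    for offset in range(-width, width + 1):
        if offset == 0:
            continue                      # skip the target word itself
        pos = index + offset
        word, tag = tagged[pos] if 0 <= pos < len(tagged) else ("_", "_")
        feats.extend([word, tag])
    return tuple(feats)

def classify(features, memory, k=1):
    """k-nearest-neighbour vote; similarity = number of matching feature values."""
    def overlap(a, b):
        return sum(x == y for x, y in zip(a, b))
    ranked = sorted(memory, key=lambda item: overlap(features, item[0]), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training examples for the target word "bank" (position 2).
train = [
    ([("the", "DT"), ("river", "NN"), ("bank", "NN"), ("was", "VBD"), ("muddy", "JJ")], "shore"),
    ([("the", "DT"), ("central", "JJ"), ("bank", "NN"), ("raised", "VBD"), ("rates", "NNS")], "institution"),
]
memory = [(context_features(sent, 2), sense) for sent, sense in train]

test = [("a", "DT"), ("central", "JJ"), ("bank", "NN"), ("cut", "VBD"), ("rates", "NNS")]
predicted = classify(context_features(test, 2), memory)
print(predicted)
```

Training here amounts to storing the feature vectors, which is what
makes the memory-based approach cheap to build for each target word.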

Van den Bosch, A., 'Instance Families in Memory-Based Language
Learning'. Pure memory-based language learning treats a set of
pre-classified training linguistic instances as points in a multi-
dimensional feature-space. They are then stored in memory to
classify new instances by matching them to all instances in the
instance base. In this paper it is shown how careful abstraction
improves the performance of these systems by reducing their memory
requirements. Six automated classification tasks are carried out
to compare these two approaches.
 The FAMBL algorithm (FAMily-Based Learning) carefully merges
groups of nearest-neighbor instances labeled with the same class
into a single, more general instance. It works in two stages: a
'probing' stage, in which all possible families are extracted
randomly, and a 'family extraction' stage, in which no family is
extracted that has more members, or more distance between members,
than the median. This reduces the number of items in memory by
between 31% and 75%, depending on the task. When a new instance is
submitted, it is matched against the stored family expressions
rather than against the whole instance base.
 The careful abstraction method was applied to six language
tasks: grapheme-phoneme conversion, word pronunciation,
morphological segmentation, base-NP chunking, PP attachment, and
part-of-speech tagging. It performed close to a non-abstraction
algorithm (IB1-IG), though equaling it on only three of the six
tasks. The FAMBL algorithm does not adequately handle properties
like very high disjunctivity or feature interaction.
 The incorporation of feature interaction is a relevant point
for future research in the field of language learning, involving
linguistic, cognitive and computational knowledge.
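
The idea of merging same-class nearest neighbours into a more general
instance can be sketched roughly as follows. This is a simplified
illustration, not the FAMBL algorithm itself: it omits the probing
stage and the median limits on family size and distance, processes
seeds in input order rather than randomly, and uses invented data and
function names.

```python
def overlap_distance(a, b):
    """Number of feature positions on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def extract_families(instances):
    """Greedily merge each seed with its nearest neighbours of the same class.

    A family stops growing at the first neighbour with a different class.
    Each family stores, per feature, the set of values seen in its members.
    """
    remaining = list(instances)
    families = []
    while remaining:
        seed_feats, seed_label = remaining.pop(0)
        family = [{value} for value in seed_feats]
        neighbours = sorted(remaining,
                            key=lambda inst: overlap_distance(seed_feats, inst[0]))
        for feats, label in neighbours:
            if label != seed_label:
                break
            for slot, value in zip(family, feats):
                slot.add(value)
            remaining.remove((feats, label))
        families.append((family, seed_label))
    return families

def classify(features, families):
    """Return the class of the family with the fewest mismatching features."""
    def mismatches(family):
        return sum(value not in slot for slot, value in zip(family, features))
    return min(families, key=lambda fam: mismatches(fam[0]))[1]

# Invented toy instances: (feature tuple, class label).
instances = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("e", "f", "g"), "Y"),
    (("a", "f", "g"), "Y"),
]
families = extract_families(instances)
print(len(instances), "instances ->", len(families), "families")
print(classify(("e", "f", "c"), families))
```

New instances are compared with the merged family expressions only,
which is where the memory saving reported in the paper comes from.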

Durieux, G., Daelemans, W., Gillis, S., 'On the Arbitrariness of
Lexical Categories'. This paper shows how automated classification
can predict, with similar reliability, lexical categories which
are linguistically considered to differ widely in their degree of
arbitrariness.
 The starting hypotheses were: (1) that the degree to which a
lexical category is predictable can be shown quantitatively by
machine learning techniques, and (2) that memory-based learning
succeeds in learning lexical categories, irrespective
of their degree of arbitrariness. These hypotheses were tested
through automated classification tasks concerning three lexical
categories in Dutch which have a varying predictability:
completely predictable (diminutive), rather predictable (stress)
and essentially arbitrary (gender).
 The results of the two experiments which were carried out
show how predictability is indeed reflected by machine learning
techniques and how any lexical category can be learned with these
methods. However, a correlation was observed between
predictability and success in learning: the more arbitrary a
category is, the harder it is for the algorithm to learn it
correctly.

Kupsc, A., 'Position of Polish Clitics: an HPSG Approach'. Even
though Polish clitics have a rather free distribution, there are
some constraints that are worth analyzing. Since
most of the positions of clitics follow from general principles of
Polish linear order, the author treats them as syntactic items
and uses order domains, which account for Polish linear order
facts, to account for their distribution as well. An LP constraint is
proposed which uniformly accounts for the distribution of both
preverbal and postverbal clitics.
 Two alternatives to this approach are also analyzed: one
based on lexical weight and the other on topological fields.
However, neither of them yields satisfactory results: the first is
still too general, and the second is too restrictive for the
freedom of order of Polish.

Nerbonne, J., Mullen, T., 'Null-Headed Nominals in German and
English'. The authors give an explanation for certain nominal
phrase constructions in German and English which are best
considered as having empty lexical heads: they propose the Left
Periphery feature of a nominal tree structure, which can be empty,
full or none. Simple, language-specific rules are then applied to
account for the combination of signs according to their Left
Periphery values: for example, determiners such as 'none' or
'mine' are restricted to combining with nominal constituents whose
left periphery is empty, while 'no' and 'my' require a nominal
constituent with a full left periphery. These rules also give a
satisfactory account of some kinds of anaphoric constructions in
German which seem to lack a clear nominal head, as well as the
'one' anaphor in English.
The cross-linguistic descriptive power of this account justifies
the use of the null constituent. To prove this, a grammar has been
implemented which, not using the null constituent or the feature
Left Periphery, introduces ambiguity and requires unnecessary
detail.
 The Left Periphery feature seems to provide a general
explanation for a number of phenomena where determiners or
adjectives, and not nominal elements, appear to be the central
part of the phrase.

Van Eynde, F., 'Figure Heads in HPSG'. Figure heads are words
without semantic content, such as the copula or the infinitive
'to'. For these, the head-driven semantics of HPSG-94 stipulates
that the semantic value of a semantically vacuous word is
identified with that of its complement, in contrast to the
general tendency of this framework to assign to a head-complement
combination the content value of its head daughter. But this
leaves unresolved the semantic contribution of the verbal tense.
Moreover, no criteria are given for the identification of vacuous
words.
 To overcome some of these limitations, the author integrates
a treatment of the tenses in HPSG and defines some criteria for
identifying vacuous words. In doing so, he provides empirical
evidence against the figure head treatment of HPSG, and replaces
it with an alternative, nonsubstantive analysis in which vacuous
verbs have no 'content' feature.

Van Dreumel, S., 'The Amazon Grammar and the Last Part of the
Middle Field'.
The author successfully gives a formalization of the end of the
structuralist Middle Field, the part of the sentence which is
considered to be found between two poles of the sentence. Since
these two poles can be empty in Dutch, the boundary between the
two fields is invisible, thus causing a transparency problem to a
structuralist parser like AMAZON. A parsing technique is proposed
that predicts the Middle Field closing point by recognizing and
determining the internal order of elements with closing
properties, namely particles belonging to pronominal adverbs and
predicative elements. Once formalized and implemented, the
properties of these closing mi-elements enable the development of
more efficient and more robust parsers, able to handle
transparency situations in which an empty verb cluster is followed
by an infinitival complement without complementizer.

Schelfhout, C., 'Corpus-Based Analysis of Parenthetical Reporting
Clauses'.
In investigating the syntactic properties of parenthetical
reporting clauses in Dutch, it is shown that it is inadequate to
give too simple an analysis, in which the quote would be
considered as the direct object of the reporting verb. The author
proposes an analysis in which the quote and the reporting clause
are taken to be adjoined. However, some problems arose from this
analysis, because sometimes obligatory objects were not present,
and there were also frequent inversions in the reporting clause.
To solve this, an abstract particle 'so' was postulated in first
position in reporting clauses. This may be explicit or not, and it
stands in anaphoric relation to the quote.
 This analysis will be implemented in AMAZON (a parser for
Dutch), and it is hoped that, in testing the implementation on the
corpus, answers may come to light for some questions that still
remain, such as: What is the relation between the 'so' and the
quote or a possible direct object in the quote? Are there contexts
in which 'so' could never become explicit? At which positions in
the quote can a reporting clause occur?

Bouma, G., 'A Modern Computational Linguistics Course using
Dutch'. This paper presents a course in computational linguistics
concentrating on realistic language technology applications for
Dutch. The main aims of this course are that students learn to
use high-level tools and become familiar with quantitative
evaluation methods by working with real data. Some illustrative
exercises in this course deal with finite state methods using
regular expression calculus, grammar development and natural
language interface development, such as report generation and the
development of a question-answering system.

van der Eijk, P., Janssen, D. 'XML Mixed Content Grammars'. The
authors argue that some NLP tools, such as style checkers or
Controlled Language tools, need to evolve into content processing.
They discuss the extension of a Controlled Language tool (CLarity)
by mapping XML DTDs to a constraint-based grammar formalism. They
have embedded the XML document structure within the syntactic
analysis. The only drawback of this extension is the loss of the
modularity of grammar and DTD, which requires grammar developers
to be competent both in computational linguistics and in document
technology. However, this extension can be generalized to support
many arbitrary DTDs.
 They also discuss the effect of XML content manipulation on
XML document integrity, and outline conditions on grammars under
which an NLP system preserves XML well-formedness and validity,
interpreted as guidelines for grammar writers and implemented as a
verification procedure.
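
The well-formedness concern can be illustrated with a small sketch
using Python's standard xml.etree.ElementTree module. It contrasts a
naive string edit that breaks the markup with a rewrite that touches
only text chunks and leaves the element tree intact. The sample
sentence and the rewrite rule are invented, this is not the CLarity
tool's mechanism, and validity against a DTD is not checked here (the
standard library does not validate).

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the string parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def rewrite_text(xml_text, rewrite):
    """Apply rewrite() to every text chunk, leaving the element tree untouched."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text:
            elem.text = rewrite(elem.text)   # text before the first child
        if elem.tail:
            elem.tail = rewrite(elem.tail)   # text following the element
    return ET.tostring(root, encoding="unicode")

# Invented mixed-content example.
source = "<p>Press the <b>red</b> button in order to stop.</p>"

# A careless string-level edit that swallows a closing tag.
naive = source.replace("<b>red</b> button", "<b>red button")

# A text-only rewrite, such as a style rule shortening "in order to".
safe = rewrite_text(source, lambda s: s.replace("in order to", "to"))

print(is_well_formed(naive), is_well_formed(safe))
print(safe)
```

Restricting edits to text nodes is one simple way to guarantee that
well-formedness survives; rewrites that span element boundaries need
the kind of grammar-level conditions the paper outlines.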

Veldhuijzen van Zanten, G., Bouma, G., Sima'an, K., van Noord, G.,
Bonnema, R., 'Evaluation of the NLP Components of the OVIS2 Spoken
Dialogue System'. In the framework of a five-year research program
for the development of spoken language information systems, two
natural language processing modules are developed: a grammar one,
rule-based, and a data-oriented one, memory-based and stochastic.
In order to compare them, a formal evaluation has been carried out
showing that the grammar-based component performs much better than
the data-oriented one, and requires far fewer computational
resources. For example, the best data-oriented method obtains an
error rate for concept accuracy of 24.5%, whereas the best
grammar-based method obtains a rate of 17%, and differences
increase with increasing sentence length.
 The most important problem for the application consists of
disambiguation of the word graph used for representing all
sequences of words that the speech recognizer hypothesizes for a
spoken utterance. Here, a combination of speech scores and trigram
scores performs much better in string accuracy than the data-
oriented methods. The grammar-based methods incorporate N-gram
statistics.


Laura Alonso i Alemany is a postgraduate student at the University
of Barcelona. She is currently working on a shallow rhetorical
parser for Spanish unrestricted text. Her thesis project consists
of developing a rhetorical-structure-based system for automated
text summarization for Spanish. Her areas of interest are
discourse and rhetoric, and natural language processing.