LINGUIST List 11.86

Mon Jan 17 2000

Review: Kornai: Extended Finite State Models of Lang.

Editor for this issue: Andrew Carnie <carnielinguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Andrew Carnie at carnielinguistlist.org

Directory

  1. Alberto Lavelli, Review of "Extended Finite State Models of Language"

Message 1: Review of "Extended Finite State Models of Language"

Date: Sun, 16 Jan 2000 12:24:12 +0100 (MET)
From: Alberto Lavelli <lavelliitc.it>
Subject: Review of "Extended Finite State Models of Language"


Andras Kornai (ed.), 1999, Extended Finite State Models of Language,
Cambridge University Press, pages 278+xii (plus a CD-ROM).


Reviewed by Alberto Lavelli, ITC-IRST, Trento (Italy)


I'll do a frequent use of the following acronyms:

 - ECAI: European Conference on Artificial Inteligence
 - NLE: the journal Natural Language Engineering
 - NLP: Natural Language Processing
 - POS: part of speech


SYNOPSIS

The book appears in the ACL Studies in Natural Language Processing
series and originates from a workshop held in Budapest in 1996 during
ECAI'96. In a special issue of the journal Natural Language
Engineering (vol. 2 n. 4, December 1996) a set of articles partially
overlapping with those present in this book appeared (some of the NLE
papers are only abstracts of 2 or 3 pages). The electronic versions
of the papers presented at the ECAI'96 workshop are included in the
CD-ROM accompanying the book. Now I briefly describe the papers
contained in the book. The papers that appeared also in the ECAI
workshop proceedings are marked with "ECAI" near the name of the
authors. Note that sometimes the ECAI'96 versions are considerably
shorter than those in the book.


1. Extended finite state models of language by Andras Kornai

 This is a general introduction with a brief presentation of the
 papers contained in the book and of the contents of the CD-ROM

2. A parser from antiquity: an early application of finite state
 transducers to natural language parsing by Aravind K. Joshi and
 Philip Hopely (ECAI)

 This paper describes a parser based on a cascade of finite state
 transducers developed at the University of Pennsylvania in 1958.
 The parser is remarkably modern when compared to some of the recent
 work on finite state transducers. A faithful reconstruction of the
 parser is available on the CD-ROM.

3. Comments on Joshi and Hopely by Lauri Karttunen (ECAI)

 It presents some brief remarks by Karttunen who underlines the
 modernity of the parser described in the previous chapter by Joshi
 and Hopely.

4. Implementing and using finite automata toolkits by Bruce W. Watson
 (ECAI)

 It describes a toolkit (FIRE Lite) developed by the author while at
 the Eindhoven University of Technology and now freely available for
 non-commercial use. The toolkit is available in the accompanying
 CD-ROM and also on the Web at www.RibbitSoft.com (note however that
 at the beginning of January 2000 I have repeatedly and
 unsuccessfully tried to connect to www.RibbitSoft.com). RibbitSoft
 distributes also a commercial version of the toolkit (FIRE Engine
 II). Both the commercial and the non-commercial toolkit implement
 algorithms for building automata from regular expressions and for
 minimizing deterministic finite automata.

5. Finite state morphology and formal verification by Manuel Vilares 
 Ferro, Jorge Grana Gil and Pilar Alvarino Alvarino

 It presents the use of verification methods to ease maintenance
 during the development of resources for morphological analyzers.
 Examples and experiments on Spanish are presented.

6. The Japanese lexical transducer based on stem-suffix style forms by 
 Masakazu Tateno, Hiroshi Masuichi and Hiroshi Umemoto (ECAI)

 It describes a method for building a lexical transducer for
 Japanese with stems and suffixes stored separately in different
 lexicons; an extra level of automata relates canonical citation
 forms and stem-suffix style forms.

7. Acquiring rules for reducing morphological ambiguity from POS
 tagged corpus in Korean by Jae-Hoon Kim and Byung-Gyu Jang

 It presents a method for reducing morphological ambiguities when
 performing morphological analysis of Korean texts. The method
 automatically infers rules (called subsumption conditions) from a
 POS tagged corpus. Experiments are presented on the effectiveness
 of the method.

8. Finite state based reductionist parsing for French by Jean-Pierre
 Chanod and Pasi Tapanainen (ECAI; but see below the description of
 the paper)

 The paper describes a parser based on finite state methods. The
 system includes nondeterministic tokenization, lexical analysis,
 multiword recognition, shallow syntactic analysis. Examples of
 treatment of French linguistic phenomena and an evaluation of the
 parser effectiveness during the analysis of technical manuals are
 presented. This paper is a considerably extended version of the
 ECAI workshop paper by the same authors.

9. Light parsing as finite state filtering by Gregory Grefenstette
 (ECAI)

 The paper presents an approach to parsing useful in case of
 applications that need to extract relevant information without
 necessarily performing a full parse of the text. The approach is
 based on the use of finite state markers and filters. An evaluation
 of the parser effectiveness in analyzing a large corpus is
 presented.

10. Vectorized finite state automata by Andras Kornai (ECAI)

 It presents a technique of finite state parsing based on
 vectorization and describes the application of such technique to
 the problem of extracting relational information from texts. A
 system based on such approach, NewsMonitor, is available in the
 accompanying CD-ROM.

11. Finite state transducers: parsing free and frozen sentences by 
 Emmanuel Roche (ECAI)

 In NLP finite state models are usually considered a lesser evil
 with respect to more powerful techniques. The author instead
 claims that they are quite suitable for representing accurately
 complex linguistic phenomena. This claim is supported by examples
 of finite state analysis of linguistic phenomena (i.e., support
 verbs and frozen expressions).

12. Text and speech translation by means of subsequential transducers
 by Juan Miguel Vilar, Victor Manuel Jimenez, Juan Carlos Amengual,
 Antonio Castellanos, David Llorens and Enrique Vidal (ECAI; but
 see below the description of the paper)

 The authors propose a technique that increases the performance of
 the learning algorithm of Subsequential Transducers from training
 examples; moreover, the use of error-correcting parsing to
 increase the robustness of the model is explored. Experiments on
 both text and speech translation from Spanish to English are
 described. This paper is a considerably extended version of the
 ECAI workshop paper by J.M. Vilar, E. Vidal & J.C. Amengual.

13. Finite state segmentation of discourse into clauses by Eva Ejerhed
 (ECAI)

 The paper presents first of all the analysis of the correlation
 between different acoustically and perceptually derived
 information and clause boundaries in spoken utterances. Then it
 proposes an algorithm for segmenting Swedish texts into clauses
 and evaluates its performance, comparing the results on
 automatically and manually tagged texts.

14. Between finite state and Prolog: constraint-based automata for
 efficient recognition of phrases by Klaus U. Schulz and Tomek
 Mikolajewski

 The paper describes "constraint-based automata", that incorporate
 features from finite state techniques and constraint programming.
 Preliminary empirical evaluation of the performance of the
 proposed approach against that of constraint logic programming
 implementations is presented.

15. Explanation-based learning and finite state transducers:
 applications to parsing lexicalized tree adjoining grammars by
 Srinivas Banglore (ECAI)

 The paper describes the application of explanation-based learning
 (EBL) techniques to parsing Lexicalized Tree Adjoining Grammars.
 Starting from a hand-crafted wide-coverage English grammar (XTAG
 Group 1995), EBL techniques based on finite state transducers are
 applied to customize the grammar to a specific domain. A
 simplified parser, called stapler, is also described; the stapler
 is used in conjunction with the results of the application of EBL
 techniques. Experimental results of such approach are presented,
 comparing the performance with respect to the original system
 in terms of recall, number of parses and processing time.

16. Use of weighted finite state transducers in part of speech tagging
 by Evelyne Tzoukermann and Dragomir R. Radev

 The paper presents the application of weighted finite state
 transducers to POS tagging. The approach uses a combination of
 linguistic and statistical techniques for POS disambiguation.
 Experimental results for French POS tagging are presented.

17. Colonies: a multi-agent approach to language generation by
 Erzsebet Csuhaj-Varju (ECAI)

 The paper presents "colonies", multi-agent symbol systems whose
 behavior is jointly determined by the combination of very simple
 grammars.

18. An innovative finite state concept for recognition and parsing of
 context-free languages by Mark-Jan Nederhof and Eberhard Bertsch
 (ECAI)

 The paper shows that all the languages which are in the regular
 closure of the class of the deterministic (context free) languages
 can be recognized in linear time. The result is interesting as
 this closure contains many inherently ambiguous languages.

19. Hidden Markov models with finite state supervision by Eric Sven
 Ristad

 The paper presents a supervised training approach to Hidden Markov
 Models (HMMs). The author claims that, unlike popular ad hoc
 techniques, the proposed approach is completely general, need not
 make any simplifying assumptions about independence , and can take
 better advantage of the information contained in the training
 corpus.


In the accompanying CD-ROM there are 6 subdirectories with the
following contents:

 - ECAI: the original papers of the ECAI workshop (also available on
 the Web at the location: http://www.cs.rice.edu/~andras/ecai.html)
 - Kanungo: a simple implementation of Hidden Markov Models realized
 by Tapas Kanungo of the University of Maryland
 - Kim: the morphological analyzer described in chapter 7
 - Kornai: the NewsMonitor system, described in chapter 10
 - Uniparse: the source code of the parser described in chapter 2 by 
 Joshi and Hopely
 - Watson: FIRE Lite, the toolkit developed by Bruce W. Watson and 
 described in chapter 4

The ECAI papers not present in the book are listed below:

 - Language modeling for speech recognition by Frederick Jelinek
 - Regular expressions for finite-state syntactic description by Lauri
 Karttunen
 - Finite-state morphology and information retrieval by Kimmo
 Koskenniemi
 - Weighted automata in text and speech processing by Mehryar Mohri,
 Fernando Pereira and Michael Riley
 - Finite-state methods, binding, and anaphora by Richard Oehrle
 - Efficient finite-state approximation of context free grammars by
 Catherine Rood
 - Multilingual finite-state noun phrase extraction by Anne Schiller
 - Finite automata for processing word order by Wojciech Skut
 - Multilingual text analysis for text-to-speech synthesis by Richard
 Sproat


CRITICAL EVALUATION

The book contains papers that cover the application of finite state
techniques to a wide range of NLP areas (morphological analysis, POS
tagging, clause boundary detection, syntactic analysis, etc.). This
fact makes it difficult for a single person to have the necessary
expertise to thoroughly evaluate all the contributions (and, at the
same time, prospective readers will be probably interested only on a
subset of the papers, depending on their areas of interest).
Obviously also this review is partly influenced by the reviewer's
limited knowledge in some areas of NLP (particularly statistical
techniques).

The book contains some very interesting papers from both a theoretical
and an applicative perspective. However, it suffers from a defect
that is often present in books originating from workshops, i.e. the
fact that contributions are uneven in both quality (i.e., clarity of
the presentation, systematic coverage of all the main areas in the
field) and quantity (length and thoroughness of papers). This makes
also difficult to have an overall view of the different areas covered
by the various contributions.


Going to the analysis of some of the papers, I found the contribution
by Joshi & Hopely (chapter 2) particularly interesting as it provides
a useful historical perspective on the work in the field. Too often we
tend to concentrate only on the most recent contributions running the
risk to reinvent the wheel and this paper reminds us not to neglect
past experiences.

The papers by Chanod & Tapanainen and Grefenstette (chapters 8 and 9)
provide a useful indication of the current advances in the application
of finite-state techniques at Xerox Research Centre Europe in
Grenoble, one of the leading centers in this area.

The paper by Roche (chapter 11) is in my opinion not as convincing as
others by the same author, for instance that contained in (Roche and
Schabes 1997).

The papers by Tateno, Masuichi & Umemoto and Kim & Jang (chapters 6
and 7) fail to clearly explain the background and the issues specific
respectively to Japanese and Korean, needed to fully appreciate the
techniques proposed in the papers.

The paper by Srinivas (chapter 15) is a long and complex but
well-written contribution that proposes an approach that combines
manually developed generic grammars with domain-specific constraints
extracted from a corpus. The interesting experimental results seem
however due to to the particular formalism adopted (i.e., Lexicalized
Tree Adjoining Grammars) because they crucially employ some of its
specific characteristics.

In the paper by Kornai (chapter 10) it would have been interesting to
provide more details about the application of vectorized finite state
automata, i.e. the NewsMonitor system (also present in the
accompanying CD-ROM).

In some of the papers there is only a generic reference to the usage
in NLP of the techniques and tools described. For example, the paper
by Watson (chapter 4) mentions an interest in using the toolkit by
computational linguists. This same generic claim was present in the
original paper at the ECAI workshop. Provided that the ECAI workshop
took place in 1996, it would have been interesting that the author
made some explicit mentions of NLP areas where such uses have been
pursued in the meanwhile. The paper by Csuhaj-Varju (chapter 17)
presents some results from the field of formal languages. As far as I
understand, the only link with NLP is that some languages that can be
described using such results (e.g., the languages of multiple
agreement, crossed agreements and replication) would present
structures that are present in natural languages; no further evidence
for this claim is produced. Sometimes the link between formal results
and NLP is more explicit: the theoretical paper by Nederhof & Bertsch
(chapter 18) provides some more direct hints at the usefulness of the
results proposed for NLP.

Given my limited knowledge of statistical techniques, I cannot
thoroughly evaluate the paper by Ristad (chapter 19). I would only
underline that some experiments would probably be needed in order to
empirically verify the claims about the advantages of the proposed
method with respect to the standard ones.

Most contributions provide some kind of experimental evaluation of the
proposed techniques. However, it is not always clear if such
experimental results allow a real comparison with other techniques.


Among the contributions included in the NLE special issue and not
present in the book for various reasons, I have found particularly
interesting the paper "Partial parsing via finite-state cascades" by
Steven Abney. Other NLE papers not included in the book are: "Regular
expressions for language engineering" by Lauri Karttunen, Jean-Pierre
Chanod, Gregory Grefenstette and Anne Schiller, "Finite state
morphology and information retrieval" by Kimmo Koskeniemmi, and
"Multilingual text analysis for text-to-speech synthesis" by Richard
Sproat (the last two papers were present at the ECAI workshop in a
slightly different version).

The editorial care of the book is not completely satisfactory. There
are a few typos (not so many but they could be easily detected using a
spelling checker). Sometimes acronyms are used without being defined.
In a couple of papers there are pending references. The list of
bibliographical references presents some mistakes: for example, Roche
& Schabes 1997 is wrongly cited as "Finite-State Devices for Natural
Language Processing" (the correct title is "Finite-State Language
Processing"), there is one duplicated entry (Tapanainen 1995), there
is no coherence in the style of bibliographical entries. This is
obviously due to the fact that the bibliographical references are the
sum of the references of the single contributions; perhaps it would
have been better if every contribution had listed its references
separately.

The links mentioned on page 2 for HTK (Hidden Markov Model Toolkit)
and XFST (Xerox Finite State Technology) are no longer valid, probably
because of some reorganization undergone by the respective web sites.
The correct locations should be:

 - http://www.entropic.com/support/FAQ/htk/index.html for HTK 
 - http://www.rxrc.xerox.com/research/mltt/fst/home.html for XFST
 

In conclusion, the book is a useful reading for people interested in
the use of finite state techniques in NLP and provides an interesting
perspective of the current status of the area. As said above, given
the wide range of NLP areas covered by the contributions, many
prospect readers will probably be interested only in part of the
papers in the book.


BIBLIOGRAPHY

Steven Abney 1996. Partial parsing via finite-state cascades. Natural
Language Engineering, 2(4): 337-344. (appeared also in the Proceedings
of the ESSLLI '96 Robust Parsing Workshop; also available at the
location http://www.sfs.nphil.uni-tuebingen.de/~abney/Papers.html)

Eva Ejerhed, Frederic Jelinek, Lauri Karttunen, Andras Kornai (eds.)
1996. Proceedings of the ECAI'96 workshop on Extended Finite State
Models of Language, Budapest, Hungary (also available at the
location http://www.cs.rice.edu/~andras/ecai.html).

Andras Kornai (ed.) 1996. Special issue on Extended Finite State
Models of Language. Natural Language Engineering, 2(4).

Emmanuel Roche and Yves Schabes (eds.) 1997. Finite-State Language
Processing. MIT Press, Cambridge, MA.

XTAG Group 1995. A Lexicalized Tree Adjoining Grammar for English. 
Technical Report IRCS 95-03, University of Pennsylvania.


ABOUT THE REVIEWER

Alberto Lavelli is a researcher at ITC-IRST in Trento (Italy). His
interests are related to chart parsing of natural languages,
computational environments for grammar development and finite-state
parsing for Information Extraction. He is currently acting as Local
Arrangements Chair of IWPT 2000 (Sixth International Workshop on
Parsing Technologies) which will be held in Trento from 23 to 25
February 2000.
Mail to author|Respond to list|Read more issues|LINGUIST home page|Top of issue