Editor for this issue: Andrew Carnie <carnie
linguistlist.org>
Andras Kornai (ed.), 1999, Extended Finite State Models of Language, Cambridge University Press, pages 278+xii (plus a CD-ROM). Reviewed by Alberto Lavelli, ITC-IRST, Trento (Italy) I'll do a frequent use of the following acronyms: - ECAI: European Conference on Artificial Inteligence - NLE: the journal Natural Language Engineering - NLP: Natural Language Processing - POS: part of speech SYNOPSIS The book appears in the ACL Studies in Natural Language Processing series and originates from a workshop held in Budapest in 1996 during ECAI'96. In a special issue of the journal Natural Language Engineering (vol. 2 n. 4, December 1996) a set of articles partially overlapping with those present in this book appeared (some of the NLE papers are only abstracts of 2 or 3 pages). The electronic versions of the papers presented at the ECAI'96 workshop are included in the CD-ROM accompanying the book. Now I briefly describe the papers contained in the book. The papers that appeared also in the ECAI workshop proceedings are marked with "ECAI" near the name of the authors. Note that sometimes the ECAI'96 versions are considerably shorter than those in the book. 1. Extended finite state models of language by Andras Kornai This is a general introduction with a brief presentation of the papers contained in the book and of the contents of the CD-ROM 2. A parser from antiquity: an early application of finite state transducers to natural language parsing by Aravind K. Joshi and Philip Hopely (ECAI) This paper describes a parser based on a cascade of finite state transducers developed at the University of Pennsylvania in 1958. The parser is remarkably modern when compared to some of the recent work on finite state transducers. A faithful reconstruction of the parser is available on the CD-ROM. 3. Comments on Joshi and Hopely by Lauri Karttunen (ECAI) It presents some brief remarks by Karttunen who underlines the modernity of the parser described in the previous chapter by Joshi and Hopely. 4. Implementing and using finite automata toolkits by Bruce W. Watson (ECAI) It describes a toolkit (FIRE Lite) developed by the author while at the Eindhoven University of Technology and now freely available for non-commercial use. The toolkit is available in the accompanying CD-ROM and also on the Web at www.RibbitSoft.com (note however that at the beginning of January 2000 I have repeatedly and unsuccessfully tried to connect to www.RibbitSoft.com). RibbitSoft distributes also a commercial version of the toolkit (FIRE Engine II). Both the commercial and the non-commercial toolkit implement algorithms for building automata from regular expressions and for minimizing deterministic finite automata. 5. Finite state morphology and formal verification by Manuel Vilares Ferro, Jorge Grana Gil and Pilar Alvarino Alvarino It presents the use of verification methods to ease maintenance during the development of resources for morphological analyzers. Examples and experiments on Spanish are presented. 6. The Japanese lexical transducer based on stem-suffix style forms by Masakazu Tateno, Hiroshi Masuichi and Hiroshi Umemoto (ECAI) It describes a method for building a lexical transducer for Japanese with stems and suffixes stored separately in different lexicons; an extra level of automata relates canonical citation forms and stem-suffix style forms. 7. Acquiring rules for reducing morphological ambiguity from POS tagged corpus in Korean by Jae-Hoon Kim and Byung-Gyu Jang It presents a method for reducing morphological ambiguities when performing morphological analysis of Korean texts. The method automatically infers rules (called subsumption conditions) from a POS tagged corpus. Experiments are presented on the effectiveness of the method. 8. Finite state based reductionist parsing for French by Jean-Pierre Chanod and Pasi Tapanainen (ECAI; but see below the description of the paper) The paper describes a parser based on finite state methods. The system includes nondeterministic tokenization, lexical analysis, multiword recognition, shallow syntactic analysis. Examples of treatment of French linguistic phenomena and an evaluation of the parser effectiveness during the analysis of technical manuals are presented. This paper is a considerably extended version of the ECAI workshop paper by the same authors. 9. Light parsing as finite state filtering by Gregory Grefenstette (ECAI) The paper presents an approach to parsing useful in case of applications that need to extract relevant information without necessarily performing a full parse of the text. The approach is based on the use of finite state markers and filters. An evaluation of the parser effectiveness in analyzing a large corpus is presented. 10. Vectorized finite state automata by Andras Kornai (ECAI) It presents a technique of finite state parsing based on vectorization and describes the application of such technique to the problem of extracting relational information from texts. A system based on such approach, NewsMonitor, is available in the accompanying CD-ROM. 11. Finite state transducers: parsing free and frozen sentences by Emmanuel Roche (ECAI) In NLP finite state models are usually considered a lesser evil with respect to more powerful techniques. The author instead claims that they are quite suitable for representing accurately complex linguistic phenomena. This claim is supported by examples of finite state analysis of linguistic phenomena (i.e., support verbs and frozen expressions). 12. Text and speech translation by means of subsequential transducers by Juan Miguel Vilar, Victor Manuel Jimenez, Juan Carlos Amengual, Antonio Castellanos, David Llorens and Enrique Vidal (ECAI; but see below the description of the paper) The authors propose a technique that increases the performance of the learning algorithm of Subsequential Transducers from training examples; moreover, the use of error-correcting parsing to increase the robustness of the model is explored. Experiments on both text and speech translation from Spanish to English are described. This paper is a considerably extended version of the ECAI workshop paper by J.M. Vilar, E. Vidal & J.C. Amengual. 13. Finite state segmentation of discourse into clauses by Eva Ejerhed (ECAI) The paper presents first of all the analysis of the correlation between different acoustically and perceptually derived information and clause boundaries in spoken utterances. Then it proposes an algorithm for segmenting Swedish texts into clauses and evaluates its performance, comparing the results on automatically and manually tagged texts. 14. Between finite state and Prolog: constraint-based automata for efficient recognition of phrases by Klaus U. Schulz and Tomek Mikolajewski The paper describes "constraint-based automata", that incorporate features from finite state techniques and constraint programming. Preliminary empirical evaluation of the performance of the proposed approach against that of constraint logic programming implementations is presented. 15. Explanation-based learning and finite state transducers: applications to parsing lexicalized tree adjoining grammars by Srinivas Banglore (ECAI) The paper describes the application of explanation-based learning (EBL) techniques to parsing Lexicalized Tree Adjoining Grammars. Starting from a hand-crafted wide-coverage English grammar (XTAG Group 1995), EBL techniques based on finite state transducers are applied to customize the grammar to a specific domain. A simplified parser, called stapler, is also described; the stapler is used in conjunction with the results of the application of EBL techniques. Experimental results of such approach are presented, comparing the performance with respect to the original system in terms of recall, number of parses and processing time. 16. Use of weighted finite state transducers in part of speech tagging by Evelyne Tzoukermann and Dragomir R. Radev The paper presents the application of weighted finite state transducers to POS tagging. The approach uses a combination of linguistic and statistical techniques for POS disambiguation. Experimental results for French POS tagging are presented. 17. Colonies: a multi-agent approach to language generation by Erzsebet Csuhaj-Varju (ECAI) The paper presents "colonies", multi-agent symbol systems whose behavior is jointly determined by the combination of very simple grammars. 18. An innovative finite state concept for recognition and parsing of context-free languages by Mark-Jan Nederhof and Eberhard Bertsch (ECAI) The paper shows that all the languages which are in the regular closure of the class of the deterministic (context free) languages can be recognized in linear time. The result is interesting as this closure contains many inherently ambiguous languages. 19. Hidden Markov models with finite state supervision by Eric Sven Ristad The paper presents a supervised training approach to Hidden Markov Models (HMMs). The author claims that, unlike popular ad hoc techniques, the proposed approach is completely general, need not make any simplifying assumptions about independence , and can take better advantage of the information contained in the training corpus. In the accompanying CD-ROM there are 6 subdirectories with the following contents: - ECAI: the original papers of the ECAI workshop (also available on the Web at the location: http://www.cs.rice.edu/~andras/ecai.html) - Kanungo: a simple implementation of Hidden Markov Models realized by Tapas Kanungo of the University of Maryland - Kim: the morphological analyzer described in chapter 7 - Kornai: the NewsMonitor system, described in chapter 10 - Uniparse: the source code of the parser described in chapter 2 by Joshi and Hopely - Watson: FIRE Lite, the toolkit developed by Bruce W. Watson and described in chapter 4 The ECAI papers not present in the book are listed below: - Language modeling for speech recognition by Frederick Jelinek - Regular expressions for finite-state syntactic description by Lauri Karttunen - Finite-state morphology and information retrieval by Kimmo Koskenniemi - Weighted automata in text and speech processing by Mehryar Mohri, Fernando Pereira and Michael Riley - Finite-state methods, binding, and anaphora by Richard Oehrle - Efficient finite-state approximation of context free grammars by Catherine Rood - Multilingual finite-state noun phrase extraction by Anne Schiller - Finite automata for processing word order by Wojciech Skut - Multilingual text analysis for text-to-speech synthesis by Richard Sproat CRITICAL EVALUATION The book contains papers that cover the application of finite state techniques to a wide range of NLP areas (morphological analysis, POS tagging, clause boundary detection, syntactic analysis, etc.). This fact makes it difficult for a single person to have the necessary expertise to thoroughly evaluate all the contributions (and, at the same time, prospective readers will be probably interested only on a subset of the papers, depending on their areas of interest). Obviously also this review is partly influenced by the reviewer's limited knowledge in some areas of NLP (particularly statistical techniques). The book contains some very interesting papers from both a theoretical and an applicative perspective. However, it suffers from a defect that is often present in books originating from workshops, i.e. the fact that contributions are uneven in both quality (i.e., clarity of the presentation, systematic coverage of all the main areas in the field) and quantity (length and thoroughness of papers). This makes also difficult to have an overall view of the different areas covered by the various contributions. Going to the analysis of some of the papers, I found the contribution by Joshi & Hopely (chapter 2) particularly interesting as it provides a useful historical perspective on the work in the field. Too often we tend to concentrate only on the most recent contributions running the risk to reinvent the wheel and this paper reminds us not to neglect past experiences. The papers by Chanod & Tapanainen and Grefenstette (chapters 8 and 9) provide a useful indication of the current advances in the application of finite-state techniques at Xerox Research Centre Europe in Grenoble, one of the leading centers in this area. The paper by Roche (chapter 11) is in my opinion not as convincing as others by the same author, for instance that contained in (Roche and Schabes 1997). The papers by Tateno, Masuichi & Umemoto and Kim & Jang (chapters 6 and 7) fail to clearly explain the background and the issues specific respectively to Japanese and Korean, needed to fully appreciate the techniques proposed in the papers. The paper by Srinivas (chapter 15) is a long and complex but well-written contribution that proposes an approach that combines manually developed generic grammars with domain-specific constraints extracted from a corpus. The interesting experimental results seem however due to to the particular formalism adopted (i.e., Lexicalized Tree Adjoining Grammars) because they crucially employ some of its specific characteristics. In the paper by Kornai (chapter 10) it would have been interesting to provide more details about the application of vectorized finite state automata, i.e. the NewsMonitor system (also present in the accompanying CD-ROM). In some of the papers there is only a generic reference to the usage in NLP of the techniques and tools described. For example, the paper by Watson (chapter 4) mentions an interest in using the toolkit by computational linguists. This same generic claim was present in the original paper at the ECAI workshop. Provided that the ECAI workshop took place in 1996, it would have been interesting that the author made some explicit mentions of NLP areas where such uses have been pursued in the meanwhile. The paper by Csuhaj-Varju (chapter 17) presents some results from the field of formal languages. As far as I understand, the only link with NLP is that some languages that can be described using such results (e.g., the languages of multiple agreement, crossed agreements and replication) would present structures that are present in natural languages; no further evidence for this claim is produced. Sometimes the link between formal results and NLP is more explicit: the theoretical paper by Nederhof & Bertsch (chapter 18) provides some more direct hints at the usefulness of the results proposed for NLP. Given my limited knowledge of statistical techniques, I cannot thoroughly evaluate the paper by Ristad (chapter 19). I would only underline that some experiments would probably be needed in order to empirically verify the claims about the advantages of the proposed method with respect to the standard ones. Most contributions provide some kind of experimental evaluation of the proposed techniques. However, it is not always clear if such experimental results allow a real comparison with other techniques. Among the contributions included in the NLE special issue and not present in the book for various reasons, I have found particularly interesting the paper "Partial parsing via finite-state cascades" by Steven Abney. Other NLE papers not included in the book are: "Regular expressions for language engineering" by Lauri Karttunen, Jean-Pierre Chanod, Gregory Grefenstette and Anne Schiller, "Finite state morphology and information retrieval" by Kimmo Koskeniemmi, and "Multilingual text analysis for text-to-speech synthesis" by Richard Sproat (the last two papers were present at the ECAI workshop in a slightly different version). The editorial care of the book is not completely satisfactory. There are a few typos (not so many but they could be easily detected using a spelling checker). Sometimes acronyms are used without being defined. In a couple of papers there are pending references. The list of bibliographical references presents some mistakes: for example, Roche & Schabes 1997 is wrongly cited as "Finite-State Devices for Natural Language Processing" (the correct title is "Finite-State Language Processing"), there is one duplicated entry (Tapanainen 1995), there is no coherence in the style of bibliographical entries. This is obviously due to the fact that the bibliographical references are the sum of the references of the single contributions; perhaps it would have been better if every contribution had listed its references separately. The links mentioned on page 2 for HTK (Hidden Markov Model Toolkit) and XFST (Xerox Finite State Technology) are no longer valid, probably because of some reorganization undergone by the respective web sites. The correct locations should be: - http://www.entropic.com/support/FAQ/htk/index.html for HTK - http://www.rxrc.xerox.com/research/mltt/fst/home.html for XFST In conclusion, the book is a useful reading for people interested in the use of finite state techniques in NLP and provides an interesting perspective of the current status of the area. As said above, given the wide range of NLP areas covered by the contributions, many prospect readers will probably be interested only in part of the papers in the book. BIBLIOGRAPHY Steven Abney 1996. Partial parsing via finite-state cascades. Natural Language Engineering, 2(4): 337-344. (appeared also in the Proceedings of the ESSLLI '96 Robust Parsing Workshop; also available at the location http://www.sfs.nphil.uni-tuebingen.de/~abney/Papers.html) Eva Ejerhed, Frederic Jelinek, Lauri Karttunen, Andras Kornai (eds.) 1996. Proceedings of the ECAI'96 workshop on Extended Finite State Models of Language, Budapest, Hungary (also available at the location http://www.cs.rice.edu/~andras/ecai.html). Andras Kornai (ed.) 1996. Special issue on Extended Finite State Models of Language. Natural Language Engineering, 2(4). Emmanuel Roche and Yves Schabes (eds.) 1997. Finite-State Language Processing. MIT Press, Cambridge, MA. XTAG Group 1995. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS 95-03, University of Pennsylvania. ABOUT THE REVIEWER Alberto Lavelli is a researcher at ITC-IRST in Trento (Italy). His interests are related to chart parsing of natural languages, computational environments for grammar development and finite-state parsing for Information Extraction. He is currently acting as Local Arrangements Chair of IWPT 2000 (Sixth International Workshop on Parsing Technologies) which will be held in Trento from 23 to 25 February 2000.Mail to author|Respond to list|Read more issues|LINGUIST home page|Top of issue