Date: Wed, 27 Apr 2005 11:11:38 +0200 From: Svetoslav Marinov <svetoslav.marinov@his.se> Subject: Memory-based Parsing
AUTHOR: Kübler, Sandra TITLE: Memory-Based Parsing SERIES: Natural Language Processing 7 YEAR: 2004 PUBLISHER: John Benjamins
Svetoslav Marinov, Gothenburg University/University College Skoevde, Sweden
SYNOPSIS
The book consists of 8 chapters plus 2 appendices.
The first chapter (Introduction) sets up the stage for the two central topics in this monograph -- parsing and machine-learning. The notion of syntactic parsing is explained and the basic distinction between data-driven and discrete approaches to parsing is given. Relying on the former to create a broad coverage parser for German, the author has chosen machine learning (ML) techniques and specifically the paradigm of memory-based learning (MBL), which is an example of supervised methods within ML. Brief introduction to MBL is given and the place of ML in Natural Language Processing (NLP) is also mentioned.
Chapter 2 gives the reader a more extensive introduction to MBL. Section 2.1 covers the basic approach and the crucial idea of lazy learning - i.e. "storing the presented training instances without abstraction" (p. 11), contrary to eager learning, as well as the similarity metrics used to calculate the distance of a new instance to be classified to the already stored ones. The algorithms extending the basic model in order to achieve faster retrieval, reduction of storage requirements and tolerance to noise are given in section 2.2 together with results of their performance. Feature selection is important in any k-nearest neighbour approach within ML. It is therefore necessary to carefully select the relevant ones and have proper metrics for calculating their weights. This is clearly explained in section 2.3, where the author discusses two of the most widely used - mutual information and class projection, relying mainly, but not only, on the work of (Wettschereck et al, 1997).
The next chapter deals with the application of MBL to syntactic parsing. It is a careful review of the existing albeit not a very extensive literature. The difficulty in utilising MBL for parsing stems from the fact that "[n]atural language parsing [...] is not a task that can easily be defined [as a classification problem]" (p. 34). While the majority of approaches described in this part split the task into sub-tasks in order to better mold parsing into the classification paradigm, Sandra Kübler's method is different, as it will become clear later on in the book. Chapter 3, therefore, describes how different authors solve the partial tasks or adopt the idea of MBL by cascaded classifiers. Section 3.1 summarises two approaches to noun-phrase (NP) chunking. Next, shallow parsing and grammatical relation finding are discussed. Memory-based sequence learning, an approach in which the order of the words is very important, is presented with its results for NP chunking and subject-verb, and verb-object relation assignment. Certain emphasis is also given to the work of Sabine Buchholz (e.g. (Buchholz, 2002)), especially the optimal results and parameters involved in discovering relations between verb chunks and other chunks in a sentence. Last, in section 3.3, an approach to full parsing using MBL is reviewed.
Data-oriented parsing (DOP) is the topic of chapter 4. While not an embodiment of the k-nearest neighbour model, DOP shares many similarities with MBL, which Kübler summarises on p. 58. The basic DOP model, i.e. DOP1, for phrase structure trees is presented in section 4.1. The author then goes on to describe the two important stages in DOP - the actual parsing of new input and the disambiguation procedures. The former being a generation of a parsed forest, while the latter is the finding of the most probable derivation (an NP-hard problem, unlike in standard probabilistic context-free parsing). Two evaluations of DOP1 are discussed in section 4.2 - one on the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) and the other on the Air Travel Information System corpus. In the next section the non-probabilistic DOP is mentioned and the comparison to its stochastic counterpart is shown. Certain enhancements of DOP to tackle unknown words (the so called DOP3 model) are presented in section 4.4. Further, the possibility to render DOP to PCFG in order to achieve faster search for the most probable derivation and alleviate the exponential growth of grammar, as well as a memory-based approach to DOP are cited in the remaining two sections.
Chapter 5 is central to the book since it presents the memory-based parser TueSBL and the treebank it is trained on TueBa-D/S. TueBa- D/S is a corpus of transliterated spontaneous dialogues developed in the context of VERBMOBIL (speech-to-speech machine translation project). Therefore it covers the domain of business appointments, travel scheduling and hotel reservations and contains approximately 67000 trees. In quite some detail the annotation scheme is presented with appropriate examples and a comparison to the Penn Treebank is given in section 5.1. In the following section Kübler gives reasons for not employing the cascaded approach to parsing in the MBL paradigm. Complete trees are thus viewed as classification classes, a decision reminiscent to DOP. To quickly batter away doubts on the 'trees as classes'-concept, the hybrid parsing architecture of TueSBL is presented. Its indispensable preprocessing module CASS, Abney's (1991) chunk parser and its adoption to German, and the present task, are described in section 5.3. The learning component in the MBL paradigm requires two things in order to succeed - lots of data and carefully selected relevant features used for the classification task. Since syntactically annotated corpora is after all of modest size, the same goes for TueBa-D/S, TueSBL is given access to all possible information in the treebank, yet not simultaneously but "ordered according to their reliability" (p. 160). These design decisions, the organisation of the instance base, as well as the search in it are clearly explained in section 5.5. A sample parse and the weighting scheme are then presented. The latter is also an example of non-standard approach. The chapter ends with an explanation of the backing-off strategy of TueSBL, which is triggered in those case where no tree structures were discovered for input chunks or no sentences were matched to the input sentence.
Chapter 6 is devoted to the evaluation of TueSBL. The standard PARSEVAL metrics of bracketed/labelled precision and recall are described and TueSBL's performance in these terms is given. Since TueBa-D/S contains functional labels (e.g. HD for head, described in Chapter 5), "functional"-labelled precision and recall are also calculated. Additionally, certain modules of the parser are evaluated separately and the results are given in section 6.4. Analysis of TueSBL's errors is then presented. Since it is debatable whether PARCEVAL's measures offer an accurate picture of a parser's output, a dependency-based evaluation is proposed (cf. (Lin, 1995)). Section 6.6 also discussed the conversion of TueBa-D/S into dependency- based representation. Section 6.7 not only gives the results of the dependency-based evaluation, but it also offers a comparison of the two metrics.
Bringing us back to the wider concept of MBL, Chapter 7 compares Kübler's parser to other approaches already mentions in Chapters 2- 4. TueSBL resembles DOP, since both methods "rely on trees or tree fragments that go beyond the local information present in rules." [p. 252]. Yet memory requirements and levels of generalisation are quite different in the two approaches. TueSBL strives to return a complete parse while other MBL methods based on cascaded classifiers focus on chunk and grammatical relation finding (section 7.2). Section 7.3 gives a comparison to a conceptually different full memory-based parser.
Chapter 8 concludes the book and brings up the fact that the presented approach "is especially tailored towards processing spontaneous speech." [p. 262]. Appendix A is the Stuttgart-Tuebingen Tagset. Appendix B gives the syntactic categories and functional labels of TueBa-D/S.
CRITICAL EVALUATION
The book by Sandra Kübler is an important contribution to the area of syntactic parsing in several respects. First, this is the monograph's main point - a memory-based robust parser for German spontaneous speech. A data-driven approach to NLP in its incarnation as an MBL is used for the design of a parser (TueSBL) whose architecture deserves to be looked at by anyone interested in parsing spoken input or using analogy-based methods in Computational Linguistics, or parsing German.
TueSBL is designed to exploit a relatively small treebank to its maximum when trained on it. Complete trees are stored in the memory and also represent the classification classes. This is something that resembles DOP very much yet one do not get the weak points of pure DOP - large storage space and intractability of the parsing model. TueSBL is also different than the standard MBL approaches. It does not use cascades of MBL classifiers and does not only return grammatical relations between chunks but delivers a complete parse. On the other hand it is not a purely k-nearest neighbour approach. As such certain modifications are necessary. The notion of classification classes becomes a very relaxed and loose idea. These have internal structure and allow for modifications, such as omission of words and chunks from a tree (see pages 127-132). This particular decision is very well motivated in the book and also stems from the design and organisation of the treebank (TueBa-D/S). With 3 levels of annotation - phrasal, topological and functional (section 5.2), compared to the flat structure of Penn Treebank; with the particularities of spoken input - omissions, false starts, corrections, etc; with the free word-order of German (while most MBL approaches use English and the Penn Treebank), TueBa-D/S would offer an impossible task to the majority of standard MBL methods using cascaded classifier. The weighting metrics are also non-standard and this hinges again on the design decisions of TueSBL.
Another strong point of the monograph is that the work described in it is clearly placed in the context of other memory-based approaches to parsing. Chapters 3 and 4 give in enough detail to what previous authors have done in this field. Chapter 2 is also essential to better understand TueBSL. The reader is given enough detail to understand the essence of memory-based learning, yet not too much to be distracted from the central topic - TueSBL. Recently, another way to exploit the benefits of MBL for syntactic parsing has been presented, namely the work of (Nivre and Sholz, 2004) and (Nivre et al., 2004). There, TiMBL (Tilburg Memory-Based Learner, (Daelemans et al., 2003)) is used to predict the next step of a deterministic dependency based parser. (Nivre et al., 2004) present results for Swedish and in (Nivre and Scholz, 2004) the results for English are given.
Finally, the question remains how scalable Kübler's approach is. While the author claims that TueSBL can be trained on any treebank that conforms to the TueBa-D/S format, it is well known that treebanks are time- and resource-consuming. It would have been interesting to know whether TueSBL could be trained on other treebanks (not strictly following the TueBa-D/S format) or even on non-phrase structure treebanks, i.e. dependency-based representations.
REFERENCES
Abney, Steven. 1991. Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
Buchholz, Sabine. 2002. Memory-Based Grammatical Relation Finding. PhD Thesis. University of Tilburg, The Netherlands.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot and Antal van den Bosch. 2003. 'TiMBL: Tilburg Memory Based Learner, version 5.0, Reference Guide.' ILK Research Group Technical Report Series no. 03-10, 56 pages.
Lin, Dekang. 1995. A Dependency-based Method for Evaluating Broad-Coverage Parsers. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI-95. Montreal, Canada.
Marcus, Mitchell, Beatrice Santorini and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19:2.313-330
Nivre, Joakim, Johan Hall and Jens Nilsson. 2004. Memory-Based Dependency Parsing. In Ng, H. T. and Riloff, E. (eds.) Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL). May 6-7, 2004. Boston, Massachusetts. pp. 49-56.
Nivre, Joakim and Mario Scholz. 2004. Deterministic Dependency Parsing of English Text. In Proceedings of COLING 2004, Geneva, Switzerland, August 23-27, 2004.
Wettschereck, Dietrich, David W. Aha and Takao Mohri. 1997. A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 11(1- 5): 273-314.
|