Christopher D. Manning & Hinrich Schuetze (1999) Foundations of Statistical Natural Language Processing, MIT Press, Massachusetts, US, Pp. 680 Hard $60
Reviewed by Richard Evans, Research Assistant, Computational Linguistics Research Group, University of Wolverhampton, UK
SYNOPSIS
The book provides an introduction to the field of Statistical Natural Language Processing. Aimed at graduate students and researchers, it should also be seen as a valuable teaching aid for courses in computational linguistics. Deriving mathematical formulae from basic principles with reference to specific language processing tasks prevents the descriptions from becoming too dry. At all points the material is thoroughly reinforced with relevant linguistic examples. The authors succeed in ensuring that the material is relevant and interesting, one of the most important yet difficult criteria to meet when teaching statistics. The book has been written in LaTeX and has the format commonly associated with such documents. One useful stylistic feature is that as important terms are introduced in the text, they are printed in the margin, which makes it easy to scan the text for topics of interest. These terms are also listed in the index. Each chapter concludes with a fairly thorough 'Further Reading' section and a set of exercises of varying difficulty. Several of the chapters are also broken up by small sets of exercises. The book concludes with a 44-page bibliography and a 23-page index.
Part I Preliminaries
Chapter 1 sets out the empiricist standpoint adopted throughout the volume, providing a critique of rationalist views on linguistics. The points concerning weaknesses with the 'categorical judgment' approach to linguistics exemplified by the work of Chomsky are convincingly made with illustrative examples (section 1.2.1).
Chapter 2 provides a self-contained introduction to the mathematical foundations of the ensuing material. It is divided into two parts, one on probability theory and the other on information theory. The section on probability theory includes the notions of conditional probability, Bayes' theorem, random variables (functions that map each possible outcome of an experiment to a numerical value), joint and conditional distributions, Bayesian updating (where our statistical estimates are influenced by our prior beliefs about what those estimates should be) and Bayesian decision theory. The section on information theory covers Entropy (defined both as the amount of information contained in a random variable and as the average amount of information required to describe an outcome of that variable), Mutual Information (the amount of information that one random variable contains about another) and Noisy Channel models (in which the output from a communication channel has some probability of differing from the input to that channel), among others.
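The two information-theoretic quantities described above can be illustrated with a minimal sketch. It assumes discrete distributions given as plain Python lists and dictionaries; the function names are illustrative and are not taken from the book.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Mutual information I(X;Y), in bits, from a dict mapping (x, y) pairs to joint probabilities."""
    px = defaultdict(float)
    py = defaultdict(float)
    # Marginalise the joint distribution to obtain p(x) and p(y).
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

A fair coin has an entropy of one bit, and two independent variables share no mutual information, which makes these convenient sanity checks for the sketch.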
Chapter 3 provides a self-contained introduction to the linguistic (largely syntactic) theory that will be used in subsequent chapters. Here we have an introduction to parts of speech, phrase structure and brief descriptions of morphology, semantics and pragmatics. The introduction is quite detailed and the authors have not been afraid to present some quite difficult examples (complex NPs and the like). Having said this, there is not much coverage of analyses above the level of the sentence but this is reflective of the field itself. In combination with chapter 2, a basic statistical and linguistic toolkit has been formed upon which the ensuing approaches will depend. Later chapters do introduce further statistical methods, but it is to chapters 2 and 3 that the reader will return for the fundamentals.
Chapter 4 introduces the notion of corpus-based work and provides an overview of the low-level formatting issues that must be addressed when using documents as an information source for further processing (section 4.2). This chapter usefully provides details about organisations that can be contacted in order to obtain these crucial resources (table 4.1). There is also discussion of the SGML encoding that is important for much current work (section 4.3).
Part II Words
Chapter 5 examines collocations and simple term extraction using Mutual Information methods (section 2.2.3). There is some brief discussion of proper name recognition (sections 5.5 and 5.6), but the chapter fails to highlight the particular problems associated with that subject. For instance, the Named Entity Recognition task that has challenged participants in the MUC conferences is not mentioned, nor are any of the approaches taken to address the problem (Mikheev, Grover & Moens 1998). This chapter also covers the notions of hypothesis testing and significance (section 5.3).
Chapter 6 concerns statistical inference and the application of probabilistic approaches to language modeling. This is a stochastic method where our expectation of seeing some word or category in a text is based only on the information we have about the preceding n words (section 6.1). The chapter also covers a variety of statistical estimation methods over those models (section 6.2) and the process of smoothing (sections 6.2 and 6.3) which lets us apply statistical methods in the face of sparse data.
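As a rough illustration of the kind of model and smoothing described here, the following sketch estimates bigram probabilities (conditioning on one preceding word) with add-k smoothing. It is a minimal example under stated assumptions: the function names and the toy sentence are hypothetical, and add-one smoothing is only one of the simplest estimators of this kind, not necessarily the one the book favours.

```python
from collections import Counter

def train_bigram(tokens):
    """Collect the counts needed for a bigram model from a list of tokens."""
    contexts = Counter(tokens[:-1])              # how often each word occurs as a context
    bigrams = Counter(zip(tokens, tokens[1:]))   # how often each word pair occurs
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_prob(word, prev, contexts, bigrams, vocab, k=1.0):
    """P(word | prev) with add-k smoothing; k=1.0 gives add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

# Toy usage: unseen pairs such as ("the", "sat") still receive non-zero probability.
tokens = "the cat sat on the mat".split()
contexts, bigrams, vocab = train_bigram(tokens)
p_seen = bigram_prob("cat", "the", contexts, bigrams, vocab)
p_unseen = bigram_prob("sat", "the", contexts, bigrams, vocab)
```

The point of the smoothing term is visible in the last line: without it, the unseen pair would be assigned probability zero because of the sparse training data.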
Chapter 7 applies the preceding methods to word sense disambiguation. Several different algorithms are presented and reviewed. The authors set out supervised and unsupervised methods for disambiguation, the supervised ones being based on Bayes' decision rule and Mutual Information techniques. On supervised learning methods that require manually annotated corpora, the authors note (p.232) "the production of labeled training data is expensive". However, they do not mention any of the software tools that have been produced to make the annotation task less time-consuming and therefore less expensive (such as MITRE's Alembic Workbench).
Chapter 8 presents methods for Lexical Acquisition. Here, the goal is to classify lexical items on the basis of verb subcategorisation, selectional restrictions, attachment ambiguity and semantic similarity. Co-occurrence statistics and vector similarity methods are used to obtain classes of semantically similar words. The chapter also gives good coverage of evaluation measures (precision, recall and F-measure).
Part III Grammar
Chapter 9 presents Markov models, a variation of the language models presented in section 6.1. Familiarity with them is widely presumed in current work and it is useful to have them derived here from scratch for the benefit of the uninitiated reader. The Viterbi algorithm is presented as a means of finding the most probable path through a Markov model.
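The idea behind the Viterbi algorithm can be sketched briefly for a hidden Markov model. This is a minimal illustration rather than the book's own formulation: the function signature, the tag names and all probabilities below are invented for the example, and a practical implementation would work with log probabilities to avoid underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations, and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor r that maximises the probability of reaching s.
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Toy usage with two hypothetical tags and made-up probabilities.
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.7, "V": 0.3}, "V": {"N": 0.4, "V": 0.6}}
emit_p = {
    "N": {"fish": 0.5, "can": 0.4, "swim": 0.1},
    "V": {"fish": 0.1, "can": 0.3, "swim": 0.6},
}
path, prob = viterbi(["fish", "can", "swim"], states, start_p, trans_p, emit_p)
# path is ['N', 'N', 'V'] and prob is approximately 0.01512
```

Because only the best path into each state is kept at each step, the cost grows linearly with the length of the sequence rather than exponentially with the number of possible paths.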
Chapter 10, Part of Speech Tagging, sets out four different strategies and concludes with a discussion of performance and applications. The algorithms include methods based on the Markov model techniques introduced in chapter 9 and Brill's transformation-based learning method. There is some coverage of issues like base NP chunking, but discussion of complex NP extraction (section 10.6.2) is omitted.
Chapter 11, Probabilistic Context Free Grammars, describes an application of Hidden Markov Models to determine the probabilities of strings of words in a language. The authors present the Inside-Outside algorithm as a method for finding the most likely analysis for a sentence. There do appear to be a number of typographical problems with this chapter. Space prevents me from making them explicit, but examination of pages 384, 385 and 391 should reveal them to the interested reader.
Chapter 12, Probabilistic Parsing, shows how annotated corpora (treebanks) can be used as the basis for finding a syntactic analysis for new sentences. The distinction between phrase-structure and dependency grammars is presented (section 12.1.7) and various statistical methods and search techniques are put forward. The authors present a sample of the Penn treebank. Here we note that the analysis consists of many 'flat' infrequent trees that do not contain X-Bar nodes, only X and XP ones. Many current systems are based on this treebank and the astute reader will be somewhat concerned about the quality of the analyses returned by such systems. There is a good, thorough description of evaluation difficulties with respect to parsing in section 12.1.8. An assumption made in this chapter is that a parser should first try out the analysis of a word string that is most commonly observed in a treebank. However, some best-first techniques based on human reading-time experiments suggest that this is not always the best approach (Crocker and Pickering 1996, unpublished work).
Part IV Applications and Techniques
Chapter 13, Statistical Alignment and Machine Translation, presents the idea of aligning sentences and paragraphs between documents in different languages and using this information as the basis for automatic translation (section 13.1). The method is based on the noisy channel model (chapter 2). When reviewing the problems with machine translation techniques, the authors write, "on the surface these are problems of the model, but they are all related to the lack of linguistic knowledge in the model." They then give examples (pp. 489-492) which demonstrate a range of linguistic information that is not exploited by the statistical models and could serve as the basis for future work.
Chapter 14, Clustering, presents a number of methods and algorithms that classify items on the basis of some measure of similarity. Hierarchical (section 14.1) and non-hierarchical (section 14.2) approaches are covered. By using these techniques, words can be classified automatically into categories that reflect something like semantic similarity. Some promising results are shown in table 14.5.
Chapter 15, Topics in Information Retrieval, covers automatic term extraction from documents. One of the approaches uses a vector space model, following from material in chapter 8, together with the measures of Term Frequency and Term Frequency-Inverse Document Frequency, which are derived here (section 15.2.2). The other method for term identification is based on a term distribution model. The review is followed by a method for discourse segmentation, 'TextTiling', that is based on information about the distribution of terms in a document (section 15.5).
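The term weighting mentioned above can be illustrated with a short sketch. It assumes one common variant, raw term frequency multiplied by log(N / document frequency); the book discusses several weighting schemes and the exact formula it derives may differ. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term in each tokenised document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy usage: "model" occurs in every document, so its weight is zero.
docs = [
    ["markov", "model", "markov"],
    ["vector", "model"],
]
weights = tf_idf(docs)
```

Terms that occur in every document carry no discriminating information and are weighted at zero, while terms concentrated in a few documents are weighted up.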
Chapter 16, Text Categorisation, introduces a number of statistical classification methods. The goal here is to automatically identify the topics or themes of documents. Several methods are used. With Decision Trees (section 16.1), a given document is described in terms of feature-value trees where possible values are labeled with probability scores. The combination of a document's value scores gives the likelihood that it belongs to a given class. Maximum entropy models (section 16.2) are described, in which a number of pre-classified documents are defined by means of constraint features. The classification with the highest entropy score is defined as the maximum entropy model. New documents are then classified according to their similarity with this model. With the perceptron learning method, term vectors (chapter 15) and iteratively induced weights are used to classify documents. The K-nearest neighbour classification method (section 16.4) is also described. Here, documents are classified according to their similarity to positively classified documents.
Although much of the material here is also covered by (Charniak 1996) and (Krenn and Samuelsson 1997) and less so by (Allen 1995), Manning and Schuetze's work provides wider, more detailed coverage. Strangely, none of the works discusses the application of corpus-based optimisation techniques such as genetic algorithms (Mitchell 1997) to natural language processing.
I recommend this book both as an exemplary teaching aid and as a rigorous introduction to statistical NLP. It is to be commended for its readability and the coherent presentation of a notoriously difficult subject. This reviewer did note some flaws, but they represent very minor points in the context of a 680-page book. Part of the beauty of this work is that it can stand alone, without the reader having to refer to anything else in order to understand or clarify parts of it. All the crucial information is here, presented from first principles. It is a very good reference book for anyone working in the field of NLP.
Bibliography
Allen, J. (1995) Natural Language Understanding, Benjamin/Cummings
Charniak, E. (1996) Statistical Language Learning, MIT Press
Crocker, M. & Pickering, M. (1996) A Rational Analysis of Parsing and Interpretation, Unpublished
Day, D. et al. (1997) Mixed Initiative Development of Language Processing Systems, The Mitre Corporation
Krenn, B. & Samuelsson, C. (1997) The Linguist's Guide to Statistics, at http://coli.uni-sb.de/{~Krenn,~christer}
Mikheev, A., Grover, C. & Moens, M. (1998) Description of the LTG System Used for MUC-7, Language Technology Group, http://www.ltg.ed.ac.uk/papers/muc.ps
Mitchell, T. (1997) Machine Learning, McGraw Hill
Richard Evans is a research assistant with the Computational Linguistics Research Group at the University of Wolverhampton in the UK. His current research interest is anaphor resolution and the application of corpus-based machine learning and optimisation methods to that task.