LINGUIST List 10.1349

Mon Sep 13 1999

Review: Manning & Schuetze: Statistical NLP

Editor for this issue: Andrew Carnie

What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Andrew Carnie at


  1. Richard Evans, Book review (Manning and Schuetze)

Message 1: Book review (Manning and Schuetze)

Date: Tue, 07 Sep 1999 21:03:05 +0100
From: Richard Evans
Subject: Book review (Manning and Schuetze)

Christopher D. Manning & Hinrich Schuetze (1999) Foundations of 
Statistical Natural Language Processing, MIT Press, Massachusetts, 
US, Pp. 680 Hard $60

Reviewed by Richard Evans, Research Assistant, Computational 
Linguistics Research Group, University of Wolverhampton, UK


The book provides an introduction to the field of Statistical 
Natural Language Processing. Aimed at graduate students and 
researchers, it should also be seen as a valuable teaching aid for 
courses in computational linguistics. Deriving mathematical formulae 
from basic principles with reference to specific language processing 
tasks prevents the descriptions from becoming too dry. At all points 
the material is thoroughly reinforced with the relevant linguistic 
examples. The authors succeed in ensuring that the material is 
relevant and interesting, one of the most important yet difficult 
criteria to meet when teaching statistics.

The book has been written in LaTeX and has the format commonly 
associated with such documents. One useful stylistic feature is that 
as important terms are introduced to the text, they are printed in 
the margin, which makes it easy to scan the text for topics of 
interest. These terms are also listed in the index. Each chapter is 
concluded by a fairly thorough 'Further Reading' section and a set 
of exercises with tasks of varying difficulty. Several of the 
chapters are also broken up by small sets of exercises. The book 
concludes with a 44-page bibliography and a 23-page index.

Part I Preliminaries

Chapter 1 sets out the empiricist standpoint adopted throughout the 
volume, providing a critique of rationalist views on linguistics. 
The points concerning weaknesses with the 'categorical judgment' 
approach to linguistics exemplified by the work of Chomsky are 
convincingly made with illustrative examples (section 1.2.1).

Chapter 2 provides a self-contained introduction to the mathematical 
foundations of the ensuing material. It is divided into two parts, 
one on probability theory and the other on information theory. The 
section on probability theory includes the notions of conditional 
probability, Bayes' theorem, random variables (functions that map 
the possible outcomes of an experiment to numerical values), joint and 
conditional distributions, Bayesian updating (where our statistical 
expectation estimates are influenced by our prior beliefs about what 
those expectations should be) and Bayesian decision theory. The 
section on Information Theory covers Entropy (defined both as the 
amount of information contained in a random variable and also as a 
measure of the average amount of information required to describe an 
outcome of that variable), Mutual Information (the amount of 
information that one random variable contains about another) and 
Noisy Channel models (in which the output from a communication 
channel has a probability of differing from the input to that 
channel), among others.
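
As an illustration of the two information-theoretic quantities described 
above, the following is a minimal Python sketch that computes entropy and 
mutual information for small discrete distributions. It is not taken from 
the book: the function names entropy and mutual_information, and the 
choice of a dictionary to hold the joint distribution, are assumptions 
made for this example.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    `probs` is an iterable of probabilities that sum to 1.
    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Mutual information I(X; Y), in bits.

    `joint` maps (x, y) pairs to joint probabilities that sum to 1
    (a hypothetical representation chosen for this sketch).
    """
    # Marginal distributions are obtained by summing the joint table.
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    # I(X; Y) = sum over x, y of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

Under these definitions a fair coin has an entropy of 1 bit, two 
independent fair coins share 0 bits of mutual information, and two 
perfectly correlated fair coins share 1 bit.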

Chapter 3 provides a self-contained introduction to the linguistic 
(largely syntactic) theory that will be used in subsequent chapters. 
Here we have an introduction to parts of speech, phrase structure 
and brief descriptions of morphology, semantics and pragmatics. The 
introduction is quite detailed and the authors have not been afraid 
to present some quite difficult examples (complex NPs and the like). 
Having said this, there is not much coverage of analyses above the 
level of the sentence but this is reflective of the field itself. In 
combination with chapter 2, a basic statistical and linguistic 
toolkit has been formed upon which the ensuing approaches will 
depend. Later chapters do introduce further statistical methods, but 
it is to chapters 2 and 3 that the reader will return for the 

Chapter 4 introduces the notion of corpus-based work and provides an 
overview of the low level formatting issues that must be addressed 
when using documents as an information source for further processing 
(section 4.2). This chapter usefully provides details about 
organisations that can be contacted in order to obtain these crucial 
resources (table 4.1). There is also discussion of the SGML encoding 
that is important for much current work (section 4.3).

Part II Words

Chapter 5 examines collocations and simple term extraction using 
Mutual Information (2.2.3) methods. There is some brief discussion 
of proper name recognition (sections 5.5 and 5.6), but the authors do 
not highlight the particular problems associated with that subject. 
For instance, the Named Entity Recognition task that has challenged 
participants in the MUC conferences is not mentioned, nor are any of 
the approaches taken to address the problem (Mikheev, Grover & Moens 
1998). This chapter also covers the notions of hypothesis testing 
and significance (section 5.3).

Chapter 6 concerns statistical inference and the application of 
probabilistic approaches to language modeling. This is a stochastic 
method where our expectation of seeing some word or category in a 
text is based only on the information we have about the preceding n 
words (section 6.1). The chapter also covers a variety of 
statistical estimation methods over those models (section 6.2) and 
the process of smoothing (sections 6.2 and 6.3) which lets us apply 
statistical methods in the face of sparse data.
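
To make the idea of smoothing concrete, here is a minimal Python sketch 
of a bigram model (each word conditioned on the single preceding word) 
with add-one smoothing. This is only one of the simplest estimators of 
its kind and is not claimed to be the book's preferred method; the 
function names train_bigram_model and add_one_prob are hypothetical, and 
the vocabulary is assumed to be just the words seen in training.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Collect the counts needed for a bigram model from a token list."""
    vocab = set(tokens)
    # Count each word only where it has a successor, so that the
    # smoothed probabilities for a given context sum to exactly 1.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return vocab, contexts, bigrams


def add_one_prob(prev, word, model):
    """Estimate P(word | prev) with add-one (Laplace) smoothing.

    Every possible bigram gets one extra pseudo-count, so unseen
    bigrams receive a small non-zero probability.
    """
    vocab, contexts, bigrams = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
```

Trained on the six tokens "the cat sat on the mat", this sketch assigns 
P(cat | the) = 2/7 to a bigram that was seen once and P(sat | the) = 1/7 
to one that was never seen, rather than zero.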

Chapter 7 applies the prior methods to word sense disambiguation. 
Several different algorithms are presented and reviewed. The authors 
set out supervised and unsupervised methods for disambiguation, the 
supervised ones being based on Bayes decision rule and Mutual 
Information techniques. On supervised learning methods that require 
manually annotated corpora, the authors note (p.232) "the 
production of labeled training data is expensive". However, they do 
not mention any of the software tools that have been produced to 
make the annotation task less time-consuming and therefore less 
expensive (such as MITRE's Alembic Workbench; Day et al. 1997).

Chapter 8 presents methods for Lexical Acquisition. Here, the goal 
is to classify lexical items on the basis of verb subcategorisation, 
selectional restrictions, attachment ambiguity and semantic 
similarity. Co-occurrence statistics and vector similarity methods 
are used to obtain classes of semantically similar words. The 
chapter also gives good coverage of evaluation measures (precision, 
recall and f-measure).
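
The three evaluation measures named above can be sketched in a few lines 
of Python. This is a minimal illustration assuming the common 
formulation in which the F-measure is a weighted harmonic mean of 
precision and recall; the function name precision_recall_f and the beta 
parameter are choices made for this example, not the book's notation.

```python
def precision_recall_f(selected, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items.

    `selected` holds the items the system returned and `relevant` the
    items it should have returned. beta=1.0 weights precision and
    recall equally.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        # Avoid dividing by zero when nothing correct was returned.
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

For example, a system that returns four items, two of which are among 
the three correct ones, scores a precision of 1/2, a recall of 2/3 and a 
balanced F-measure of 4/7.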

Part III Grammar

Chapter 9 presents Markov models, a variation of the language models 
presented in section 6.1. Familiarity with them is widely presumed 
in current work and it is useful to have them derived here from 
scratch for the benefit of the uninitiated reader. The Viterbi 
algorithm is presented as a means of finding the most probable 
traversal of Markov models.
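
For readers who have not met it, the Viterbi algorithm can be sketched 
as follows. This is a minimal Python illustration, assuming a hidden 
Markov model supplied as plain dictionaries of start, transition and 
emission probabilities; the function name and argument layout are 
hypothetical and are not the book's notation.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most probable state sequence.

    start_p[s]    : probability of starting in state s
    trans_p[r][s] : probability of moving from state r to state s
    emit_p[s][o]  : probability of emitting observation o in state s
    """
    # best[t][s] = (probability of the best path that ends in state s
    #               at time t, the previous state on that path)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s][first], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Extend the best path into each possible predecessor r.
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r)
                for r in states
            )
        best.append(column)
    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return probability, path
```

Because each column keeps only the best path into each state, the cost 
grows linearly with the length of the input rather than exponentially, 
which is the point of the dynamic programming formulation.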

Chapter 10 Part of Speech Tagging sets out 4 different strategies 
and concludes with a discussion of performance and applications. The 
algorithms include methods based on the Markov model techniques 
introduced in chapter 9 and Brill's transformation based learning 
method. There is some coverage of issues like base NP chunking, but 
discussion of complex NP extraction (section 10.6.2) is omitted.

Chapter 11 Probabilistic Context Free Grammars describes an 
application of Hidden Markov Models to determine the probabilities 
of strings of words in a language. The authors present the Inside-
Outside algorithm as a method for finding the most likely analysis 
for a sentence. There do appear to be a number of typographical 
problems with this chapter. Space prevents me from making them 
explicit but examination of pages 384, 385 and 391 should reveal 
them to the interested reader.

Chapter 12 Probabilistic Parsing shows how annotated corpora 
(treebanks) can be used as the basis for finding a syntactic 
analysis for new sentences. The distinction between phrase-structure 
and dependency grammars is presented (section 12.1.7) and various 
statistical methods and search techniques are put forward. The 
authors present a sample of the Penn treebank. Here we note that the 
analysis consists of many 'flat' infrequent trees that do not 
contain X-Bar nodes, only X and XP ones. Many current systems are 
based on this treebank and the astute reader will be somewhat 
concerned about the quality of the analyses returned by such 
systems. There is a good, thorough description of evaluation 
difficulties with respect to parsing in 12.1.8. An assumption made 
in this chapter is that a parser should first try out the analysis 
of a word string that is most commonly observed in a treebank. 
However some best-first techniques based on human reading-time 
experiments suggest that this is not always the best approach 
(Crocker and Pickering 1996 unpublished work).

Part IV Applications and Techniques

Chapter 13 Statistical Alignment and Machine Translation presents 
the idea of aligning sentences and paragraphs between documents of 
different languages and using this information as the basis for 
automatic translation (section 13.1). The method is based on the 
noisy channel model (chapter 2). When reviewing the problems with 
machine translation techniques, the authors write, "on the surface 
these are problems of the model, but they are all related to the 
lack of linguistic knowledge in the model." They then give examples 
(p.489-492) which demonstrate a range of linguistic information that 
is not exploited by the statistical models and could serve as the 
basis for future work.

Chapter 14 Clustering presents a number of methods and algorithms 
that classify items on the basis of some measure of similarity. 
Hierarchical (section 14.1) and non-hierarchical (section 14.2) 
approaches are covered. By using these techniques, words can be 
classified automatically into categories that reflect something like 
semantic similarity. Some promising results are shown in table 14.5. 

Chapter 15 Topics in Information Retrieval covers automatic term 
extraction from documents. One of the approaches uses a vector space 
model, following from material in chapter 8, and the measures of 
Term Frequency and Term Frequency Inverse Document Frequency which 
are derived here (section 15.2.2). The other method for term 
identification is based on a term distribution model. The review is 
followed by a method for discourse segmentation, 'TextTiling', that is 
based on information about the distribution of terms in a document 
(section 15.5).
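
The term weighting mentioned above can be illustrated with a short 
Python sketch. It assumes one common variant, raw term frequency 
multiplied by log(N / document frequency); the book discusses several 
weighting schemes and this is not claimed to be the one it settles on. 
The function name tf_idf is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    `documents` is a list of token lists. The weight of a term is its
    count in the document times log(N / df), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Use a set so that each document counts a term only once.
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights
```

A term that occurs in every document receives a weight of zero under 
this scheme, however frequent it is, which is why such weights favour 
terms that distinguish one document from the others.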

Chapter 16 Text Categorisation introduces a number of statistical 
classification methods. The goal here is to automatically identify 
the topics or themes of documents. Several methods are used. With 
Decision Trees (section 16.1) a given document is described in terms 
of feature-value trees where possible values are labeled with 
probability scores. The combination of a document's value scores 
gives the likelihood that it belongs to a given class. Maximum 
entropy models (16.2) are described in which a number of pre-
classified documents are defined by means of constraint features. 
The classification with the highest entropy score is defined as the 
maximum entropy model. New documents are then classified according 
to their similarity with this model. With the perceptron learning 
method, term vectors (chapter 15) and iteratively induced weights are 
used to classify documents. The K-nearest neighbour (16.4) 
classification method is also described. Here, documents are 
classified according to their similarity to positively classified 

Although much of the material here is also covered by (Charniak 
1996) and (Krenn and Samuelsson 1997) and less so by (Allen 1995), 
Manning and Schuetze's work provides wider, more detailed coverage. 
Strangely, none of the works discusses the application of corpus-
based optimisation techniques such as genetic algorithms (Mitchell 
1997) to natural language processing.

I recommend this book both as an exemplary teaching aid and a 
rigorous introduction to statistical NLP. It is to be commended for 
its readability and the coherent presentation of a notoriously 
difficult subject. This reviewer did note some flaws, but they 
represent very minor points in the context of a 680-page book. Part 
of the beauty of this work is that it can stand alone without the 
reader having to refer to anything else in order to understand or 
clarify parts of it. All the crucial information is here, presented 
from first principles. It is a very good reference book for anyone 
working in the field of NLP.

Allen, J. (1995) Natural Language Understanding, Benjamin/Cummings
Charniak, E. (1996) Statistical Language Learning, MIT Press
Crocker, M. & Pickering, M. (1996) A Rational Analysis of Parsing 
and Interpretation, Unpublished
Day, D. et al. (1997) Mixed Initiative Development of Language 
Processing Systems, The Mitre Corporation
Krenn, B. & Samuelsson, C. (1997) The Linguist's Guide to 
Statistics, at{~Krenn,~christer}
Mikheev, A., Grover, C. & Moens, M. (1998) Description of the LTG 
System Used for MUC-7, Language Technology Group,
Mitchell, T. (1997) Machine Learning, McGraw Hill

Richard Evans is a research assistant with the Computational 
Linguistics Research Group at the University of Wolverhampton in the 
UK. His current research interest is anaphor resolution and the 
application of corpus-based machine learning and optimisation 
methods to that task.