LINGUIST List 9.488

Mon Mar 30 1998

Review: Boguraev & Pustejovsky: Corpus Processing.

Editor for this issue: Andrew Carnie <>

What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Andrew Carnie.


  1. Kevin Cohen, Review of Boguraev and Pustejovsky

Message 1: Review of Boguraev and Pustejovsky

Date: Wed, 18 Mar 1998 09:58:37 -0500
From: Kevin Cohen <>
Subject: Review of Boguraev and Pustejovsky

Branimir Boguraev and James Pustejovsky. 1996. Corpus processing for 
	lexical acquisition. Cambridge, Massachusetts: MIT Press. 245 pages.

The term "acquisition" in the title of this book refers to automatic 
learning---acquisition not by human children, but by natural language 
systems. The papers in this book deal with the topic of building and 
refining lexica for natural language systems automatically--i.e. by 
computer, with little or no human intervention--from large corpora.

Building lexica for natural language systems by hand is difficult, expensive, 
and labor-intensive, and the result may be out of date before it is completed.
Furthermore, by the standards of earlier systems, lexica have become 
enormous. Continuous speech dictation systems ship with active vocabularies 
in the range of 30,000 lexical items. Lexica in production by one company
are expected to have 200,000 entries for American English and 700,000 
entries for German. So, from an industrial point of view, work on the 
automatic acquisition of lexical knowledge is very welcome. This is not 
to say that automatic lexical acquisition should be of interest only to 
applied linguists. Lexical information is also necessary in psycholinguistic 
research, and some of the work in this volume shows such application. 
Furthermore, the sorts of data that researchers in this field are attempting 
to acquire are just what is needed for large-scale applications of 
formalisms like Head-Driven Phrase Structure Grammar. 
So, the work described in this book should be of interest to academic, 
as well as industrial, linguists.

This book is the result of a workshop, and as such, it has the usual 
scattering of topics seen in proceedings. This should be seen as a 
feature, not a bug: the result is that there is something here for 
everyone. Various papers come from the fields of corpus linguistics, 
statistical analysis of language, psycholinguistics, rule acquisition, 
semantics, and lexical acquisition. The papers are divided into five 
broad categories: (1) unknown words, (2) building representations, 
(3) categorization, (4) lexical semantics, and (5) evaluation. In 
addition, a paper by the editors lays out the reasons for, and challenges 
of, automatic acquisition of lexical information.

(1) Introduction

Issues in text-based lexicon acquisition, Branimir Boguraev and James 
Pustejovsky. This paper presents an in-depth answer to the question 
with which lexicon builders are perennially plagued by anyone to whom 
they try to explain their work: why not just use an on-line dictionary? 
The short answer is that such dictionaries are static and do not evolve at 
the same pace as the language that they are attempting to describe. The 
long answer is that natural language systems require information that is 
not reflected in traditional dictionaries: semantic feature geometries, 
subcategorization frames, and so on. So: "the fundamental problem of 
lexical acquisition... is how to provide, fully and adequately, the 
systems with the lexical knowledge they need to operate with the proper 
degree of efficiency. The answer... to which the community is converging 
today... is to extract the lexicon from the texts themselves" (3). 
Automatic lexical acquisition can trivially solve the short-answer problem 
by allowing updating as frequently as new data can be acquired. More 
importantly, it allows the linguist to define the questions that they 
would like the lexicon to answer, rather than having those questions 
chosen for them by the dictionary maker.

(2) Dealing with unknown words

Consider a spell-checking program that encounters the (unknown) word 
"Horowitz." The spell checker would like to know the best action to 
take with this word: is it a misspelling that should be replaced with 
something else, or is it a precious datum that should be added to its 
lexicon? The spell-checker asks its user; the papers in this section 
discuss attempts to answer these questions automatically.
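The decision such a system faces can be sketched in a few lines. The routine below is entirely my own construction for illustration; the toy lexicon, the capitalization heuristic, and the edit-distance fallback are assumptions, not anything proposed in the book:

```python
# Toy sketch of an unknown-word handler: is "Horowitz" a misspelling
# to correct, or a new lexical item to add to the lexicon?
LEXICON = {"the", "pianist", "played", "a", "recital"}

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def handle_unknown(word, sentence_initial=False):
    if word.lower() in LEXICON:
        return ("known", word)
    # Capitalization in non-initial position is weak evidence of a name.
    if word[0].isupper() and not sentence_initial:
        return ("add-to-lexicon", word)
    # Otherwise propose the nearest known word as a correction.
    best = min(LEXICON, key=lambda w: edit_distance(word.lower(), w))
    return ("suggest", best)
```

Under this sketch, "Horowitz" mid-sentence is treated as a candidate lexical entry, while "pianst" is corrected to "pianist"; the papers in this section replace the crude capitalization test with much richer evidence.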

Linguists tend not to pay much attention to proper nouns. As McDonald 
puts it in the epigraph to his paper in this volume, "proper names are 
the Rodney Dangerfield of linguistics. They don't get no respect" (21). 
Thus, it might surprise the reader to find that all three of the papers in 
this section deal with names. The identification and classification of 
names is, in fact, of considerable interest in natural language systems. 
For relatively uninflected languages like English, names may constitute 
the majority of unknown words encountered in a corpus. Names raise 
special issues for classification, including the facts that they may 
have multiple forms; multiple forms may have the same referent in a 
single text, raising problems for reference and coindexation; and, 
on a less theoretically interesting but no less morally and legally 
compelling level, they may require special treatment in the corpus. 
For instance, proper names are routinely removed from medical data, 
and may need to be removed from sociolinguistic data, as well.

Internal and external evidence in the identification and semantic 
categorization of proper names. David D. McDonald. This paper is 
written in the language of artificial intelligence. It describes 
the Proper Name Facility of the SPARSER system. It describes the 
use of context-sensitive rewrite rules to analyze "external evidence" 
for proper names, e.g. their combinatorial properties. A surprising 
and impressive aspect of the system described here is that it does not 
use stored lists of proper nouns.

Identifying unknown proper names in newswire text. Inderjeet Mani, 
T. Richard MacMillan. This paper describes a method of using contextual 
clues such as appositives ("<name>, the daughter of a prominent local 
physician" or "a Niloticist of great repute, <name>") and felicity 
conditions for identifying names. The contextual clues themselves 
are then tapped for data about the referents of the names.
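A toy version of the idea, in the spirit of the paper but not the authors' code, can be written as a single pattern that both spots a name and harvests a descriptor for its referent (the regular expression and example sentence are my own illustrative assumptions):

```python
import re

# Appositive pattern: "<Name>, the <description>," as in
# "<name>, the daughter of a prominent local physician, ..."
APPOSITIVE = re.compile(
    r"\b(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*), (?:the |a |an )(?P<desc>[^,]+),"
)

def find_names(text):
    """Return (name, descriptor) pairs licensed by the appositive pattern."""
    return [(m.group("name"), m.group("desc"))
            for m in APPOSITIVE.finditer(text)]

sentence = ("We spoke with Mary Okello, the daughter of a "
            "prominent local physician, yesterday.")
```

Here `find_names(sentence)` recovers both the name "Mary Okello" and the fact that she is the daughter of a physician, which is exactly the sense in which the contextual clue is "tapped" for data about the referent.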

Categorizing and standardizing proper nouns for efficient information 
retrieval. Woojin Paik, Elizabeth D. Liddy, Edmund Yu, and Mary McKenna. 
This paper deals with discovering and encoding relationships between groups 
and their members. Paik et al. state the problem as follows: "proper nouns 
are... important sources of information for detecting relevant document in 
information retrieval....Group proper nouns (e.g., "Middle East") and group 
common nouns (e.g., "third world") will not match on their constituents 
unless the group entity is mentioned in the document" (61). The problem, 
then, is to allow a search on "health care third world" to find a document 
on "health care in Nicaragua." The paper includes a short but useful 
discussion of the problems that can arise with respect to prepositions 
when noun phrases containing proper nouns are parsed as common noun phrases. 
(The authors solved this problem by changing the ordering of two bracketing 
rules.)

(3) Building representations

Customizing a lexicon to better suit a computational task. Marti A. Hearst, 
Hinrich Schuetze. As mentioned above, lexicon building is expensive; this 
paper describes a method for reducing development costs by customizing a 
pre-existing lexicon, rather than building a new one. The project described 
here uses as its pre-existing lexicon WORDNET, an on-line lexicon that 
contains information about semantic relationships such as hypernymy, 
hyponymy, etc. This was customized by reducing the resolution of the 
semantic hierarchies to simple categories, and by combining categories 
from "distant parts of the hierarchy.... We are interested in finding 
grouping of terms that contribute to a frame or schema-like representation... 
This can be achieved by finding associational lexical relations among the 
existing taxonymic relations" (79). Crucially, these relations should be 
derived from a particular corpus. The paper includes a nice description of 
the algorithm used for collapsing semantic categories.
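The flavor of resolution reduction can be conveyed with a minimal sketch. This is my own construction over a toy hypernym table, not the WORDNET interface or the authors' algorithm: every word is relabeled with its ancestor a fixed number of steps below the top of its hypernym chain, so fine distinctions collapse into coarse categories.

```python
# Toy hypernym table (child -> parent); a stand-in for WORDNET's
# hypernymy links, invented for illustration.
HYPERNYM = {
    "trout": "fish", "fish": "animal", "animal": "entity",
    "oak": "tree", "tree": "plant", "plant": "entity",
    "beech": "tree",
}

def coarse_category(word, depth=1):
    """Relabel `word` with its ancestor `depth` steps below the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    # chain runs word ... root; pick the node `depth` below the root.
    return chain[max(len(chain) - 1 - depth, 0)]
```

With this table, "oak" and "beech" both collapse to "plant" and "trout" to "animal"; the paper's method additionally merges categories from distant parts of the hierarchy based on corpus associations.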

Towards building contextual representations of word senses using statistical 
models. Claudia Leacock, Geoffrey Towell, and Ellen M. Voorhees. This paper 
describes a method for differentiating amongst the multiple senses of a 
polysemous word. The authors discuss using "topical context," or content 
words occurring in the vicinity, and "local context," which includes not 
just content words but function morphemes, word order, and syntactic 
structure. They test three methods of acquiring topical context: 
Bayesian, context vector, and a neural network. They also give the 
results of a psycholinguistic experiment comparing human performance 
with machine performance, given the topical contexts created by the 
three types of "classifiers." Local context acquisition is based on 
acquiring "templates," or specific sequences of words. This paper 
gives a particularly nice description of its algorithms, and is so 
clearly written as to be suitable for presentation in courses on 
statistics or psycholinguistics.
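The Bayesian, topical-context idea reduces to a very small amount of code. The sketch below is a bare-bones naive-Bayes sense classifier of my own; the training snippets and word lists are invented, and the paper's actual models are far richer:

```python
import math
from collections import Counter

# Tiny labeled "topical contexts" for two senses of "bank"
# (financial institution vs. river bank) -- illustrative data only.
train = [
    ("bank",  "money deposit loan interest account"),
    ("bank",  "loan money cash account teller"),
    ("shore", "river water fishing mud bank"),
    ("shore", "water river erosion flood mud"),
]

def fit(examples):
    counts = {}                     # sense -> Counter over context words
    for sense, context in examples:
        counts.setdefault(sense, Counter()).update(context.split())
    return counts

def classify(counts, context):
    vocab = len({w for c in counts.values() for w in c})
    def score(sense):
        c, total = counts[sense], sum(counts[sense].values())
        # uniform prior over senses; add-one smoothing on P(word | sense)
        return sum(math.log((c[w] + 1) / (total + vocab))
                   for w in context.split())
    return max(counts, key=score)

model = fit(train)
```

Given the model above, a context like "loan interest money" is scored as the financial sense and "river water mud" as the riparian one; the paper compares this style of classifier against context-vector and neural-network alternatives.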

(4) Categorization

A context driven conceptual clustering method for verb classification. 
Roberto Basili, Maria-Teresa Pazienza, Paola Velardi. This paper describes 
a method of categorizing verbs with respect to thematic roles, drawing on 
the COBWEB and ARIOSTO_LEX systems. Its aim is to do categorization without 
relying on "defining features," and to categorize with respect to the domain 
of discourse. The authors describe their algorithms, and the paper has a 
nice literature review, covering both psycholinguistic and computational 
perspectives on classification.

Distinguished usage. Scott A. Waterman. This paper tackles the 
syntax/semantics interface. The author attempts to give a linguistic 
grounding to systems that map text to some knowledge base by means of 
pattern matching: "by relating lexical pattern-based approaches to a 
lexical semantic framework, such as the Generative Lexicon theory 
[Pustejovsky, 1991], my aim is to provide a basis through which 
pattern-based understanding systems can be understood in more conventional 
linguistic terms.... My main contention is that such a framework can be 
developed by viewing the lexical patterns as structural mappings from text 
to denotation in a compositional lexical semantics...obviating the need for 
separate syntactic and semantic analysis" (144). This paper features an 
excellent presentation of background ideas and explication of the issues 
that it discusses. 

(5) Lexical semantics

Detecting dependencies between semantic verb subclasses and subcategorization 
frames in text corpora. Victor Poznanski, Antonio Sanfilippo. This paper 
describes "a suite of programs....which elicit dependencies between semantic 
verb classes and their... subcategorization frames using machine readable 
thesauri to assist in semantic tagging of texts" (176). The system uses a 
commercially available thesaurus-like online lexicon to do semantic tagging. 
A "subcategorization frame" is then automatically extracted, and the 
subcategorization frames are analyzed and classified.

Acquiring predicate-argument mapping information from multilingual texts. 
Chinatsu Aone, Douglas McKee. The authors hold predicate-argument mapping 
to be equivalent to conceptual representation; as such, it is clearly 
important to language understanding. This is the only paper in the volume 
that deals with bilingual corpora.

(6) Evaluating acquisition

Evaluation techniques for automatic semantic extraction: comparing 
syntactic and window based approaches. Gregory Grefenstette. This 
paper proposes techniques for comparing "knowledge-poor" approaches 
to determining the degree of semantic similarity between two words. 
A syntax-based method is compared to a windowing technique. The 
syntax-based method is shown to perform better for high-frequency words, 
while the windowing method is the better performer for low-frequency words.
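The windowing technique can be illustrated with a minimal sketch (my own, not Grefenstette's code): each word is represented by the bag of words occurring within k positions of it in the corpus, and similarity between two words is the cosine of their bags.

```python
import math
from collections import Counter

def window_vectors(tokens, k=2):
    """Map each word to a Counter over its +/- k-token neighbors."""
    vecs = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda c: math.sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

On even a toy corpus ("the cat sat on the mat the dog sat on the mat"), "cat" and "dog" come out more similar to each other than either is to "on", because they share window contexts; the syntax-based alternative replaces raw windows with words linked by parsed grammatical relations.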


This is by no means an introductory text on automatic lexical acquisition. 
Nonetheless, this volume contains papers that will appeal to workers in a 
variety of linguistic disciplines.

The reviewer

K. Bretonnel Cohen is a linguist at Voice Input Technologies in Dublin, 
Ohio, where his responsibilities include the construction of tools for 
lexicon building and analysis.