LINGUIST List 14.196

Mon Jan 20 2003

Review: Semantics/Corpus Ling: Barnbrook (2002)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at siminlinguistlist.org.

Directory

  1. Stefano Bertolo, Barnbrook (2002) Defining Language

Message 1: Barnbrook (2002) Defining Language

Date: Mon, 20 Jan 2003 16:23:47 +0000
From: Stefano Bertolo <bertolocyc.com>
Subject: Barnbrook (2002) Defining Language

Barnbrook, Geoff (2002) Defining Language: A Local Grammar of
Definition Sentences. John Benjamins Publishing Company, xv+280pp,
hardback ISBN 1-58811-298-5, $79.00, Studies in Corpus Linguistics 11.

Book Announcement on Linguist:
http://linguistlist.org/get-book.html?BookID=4381 
http://linguistlist.org/issues/13/13-2964.html


Stefano Bertolo, unaffiliated reader of LINGUIST

SUMMARY OF THE BOOK'S PURPOSE AND CONTENT

The book describes the main features of the language used in
definition sentences in the Collins Cobuild Student's Dictionary
(CCSD). The reason for what might otherwise appear as a rather narrow
choice of topic is explained very clearly and with an abundance of
examples: in the CCSD a definition uses the head-word to be defined in
a context that

a) uses a predictable and simplified syntactic format
 designed to sound natural (and so to be easily understood
 by a student);
b) introduces enough information to disambiguate the
 intended meaning of the head-word.

As an example of a), transitive verbs in the CCSD are commonly defined
using the format

 ''If you [Verb1 an Obj-1], you [[Verb2 Anaphora-1] Adjunct]''
as in
 ''If you MANHANDLE someone, you TREAT them very roughly''

As an example of b), two common senses of the word ''breast'' are
disambiguated by introducing the appropriate modifier (''woman's'' vs
''bird's'') as in
 ''A woman's breasts are ...''
 ''A bird's breast is ...''
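The predictability of format a) can be illustrated with a minimal sketch. It assumes a single hypothetical regular expression for the ''If you VERB someone/something, you ...'' pattern; the names IF_YOU and parse_transitive_definition are invented for this illustration and do not come from Barnbrook's book or from the CCSD.

```python
import re

# Hypothetical pattern for the "If you VERB someone/something, you ..." format.
# This is an illustrative assumption, not Barnbrook's grammar.
IF_YOU = re.compile(
    r"^If you (?P<headword>\w+) (?P<object>someone|something), "
    r"you (?P<definiens>.+?)\.?$",
    re.IGNORECASE,
)


def parse_transitive_definition(sentence):
    """Return headword, object placeholder and definiens, or None if no match."""
    match = IF_YOU.match(sentence.strip())
    if match is None:
        return None
    return {
        "headword": match.group("headword").lower(),
        "object": match.group("object").lower(),
        "definiens": match.group("definiens"),
    }


example = "If you MANHANDLE someone, you TREAT them very roughly"
print(parse_transitive_definition(example))
```

Sentences in other formats, such as the ''breast'' definitions in b), simply fail to match and would need patterns of their own.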

Barnbrook's main insight, as I understand it, is the following: given
that it is theoretically possible to explain a word to a learner in
infinitely many ways, how is it that the CCSD manages to do a very good
job using only very few patterns of the kind described above?
Evidently, those patterns represent a (or possibly ''the'') minimal
set required to cover all communication needs associated with
situations in which definitional information needs to be
exchanged. Such patterns, Barnbrook goes on to explain, cannot be
directly aligned with the constituents that would be returned by a
syntactic parser. The goal of his book is to offer a detailed analysis
of those patterns from a functional perspective that is orthogonal to
that of current research in syntax/parsing.

I found it very difficult to understand the idea behind this approach
until I happened to recognize the analogy between Barnbrook's
definitional patterns and ''patterns'' as understood in the practice
of software engineering, see for example Gamma et al. (1995). I
mention it here in the hope that other readers might find it as
enlightening as I did. The analogy goes like this: when a programmer
wants to build a web application she might decide to use the
client/server pattern, because this satisfies the need of the
application (many agents need to access information maintained by a
central entity). Nevertheless, when the program has been completed in,
say, Java, analyzing the program with respect to the syntax of Java is
the wrong level of abstraction to choose if one wants to understand
what the intent of the program is: knowing which parts of the program
are variables, method definitions, method calls, etc. doesn't help
you in the least in recognizing that you are looking at an
instantiation of the client/server pattern.

Having summarized the general purpose of the book, I now give a brief
description of each chapter.

Chapter 1 introduces the definition format of the Cobuild dictionaries
described above, i.e. the fact that head-words are explained in the
context of simple sentences of a predictable syntactic
format. Barnbrook aptly points out that, in addition to being helpful
for a learner, such predictability can and should be exploited by
programs for automatic information extraction. Several Cobuild
dictionaries are introduced and some of their characteristics
explained. Considering that the rest of the book contains a very large
number of tables, I would have expected and enjoyed finding a table
listing all the Cobuild dictionaries with a summary of their
similarities and differences.

Chapter 2 is devoted to a discussion of features of monolingual
English dictionaries. Barnbrook summarizes earlier discussion to the
effect that, in monolingual dictionaries, the language whose words are
being defined acts as its own metalanguage in defining its
semantics. I find this unconvincing, considering that, with the
exception of entries that record usage information (register, regional
variation, etc.), most Cobuild entries do not mention the words that
are part of the definition, but simply use them in particularly well
chosen examples. The rest of the chapter is devoted to a short history
of the development of monolingual dictionaries, showing how they evolved
from lists of 'difficult' (i.e. usually latinate) words which could be
explained by means of the corresponding English word to dictionaries
which, for the benefit of non-native speakers of English, list an
English definition for every English word.

Chapter 3 articulates what, in my opinion, is the central point of the
book, i.e. the claim that monolingual definitional sentences have a
recognizable structure (i.e. can be expected to contain certain
elements in a certain order) which requires a specialized vocabulary
to be described, i.e. cannot be described by (reduced to) the
vocabulary of syntactic theory. Philosophers of mind will find here an
interesting parallel to the debate on functionalism and the
irreducibility of (some) private or public mental constructs to their
physical substrate (e.g. the fact that ''money'' is a symbolic entity
that cannot be reduced to its physical realization, precious metals,
paper bills, wampum, ...). This is a rather important point that, in my
opinion, is obfuscated by the insistence on the parallel between the
grammar of English and the grammar of definition, which results in
rather perfunctory attempts to place the grammar of definition in the
class of context-sensitive grammars (page 71) and to show that the
definition language so produced is a subset of English closed under
some operations that are never defined (page 73). The discussion
becomes much more interesting when empirical facts about definitional
sentences are listed that are independent of any syntactic analysis,
for example the fact that the entropy of the CCSD is much lower than
that of a corpus of the same size from ''The Times''; the fact that
definitional sentences never contain inter-sentential coreferences;
the fact that only very few of the possible senses of an ambiguous
word are used in definition sentences.
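To make the entropy comparison concrete, here is a minimal sketch. The review does not say how the entropy was measured, so the sketch assumes the simplest option, the Shannon entropy of the unigram word distribution; the function name and the two toy strings are invented for illustration and are not data from the CCSD or ''The Times''.

```python
import math
from collections import Counter


def unigram_entropy(text):
    """Shannon entropy, in bits per token, of the word-frequency distribution."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy strings (assumptions): a formulaic text reuses the same frame words,
# a varied text of the same length does not.
formulaic = "if you a b you c if you d b you e"
varied = "the quick brown fox jumps over a lazy dog today near us"

print(unigram_entropy(formulaic) < unigram_entropy(varied))
```

A text that keeps recycling a small set of frame words concentrates probability mass on few types, which is what lowers the figure.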

Chapter 4 describes the methodology employed by the author in creating
a taxonomy of definition questions able to subsume almost without
exceptions all the definitional sentences in the CCSD. Most of the
steps described reflect the train of thought that any reasonable
person without any more sophisticated tools than grammar school
linguistic categories would follow in classifying CCSD definitional
sentences. In other words, exactly because definitions are analyzed
according to criteria that are orthogonal to current sophisticated
theories of syntax (and so cannot rely on what could be referred to as
theory-internal constructs), the whole classification has an extremely
empiricist feel to it to the point that one is not quite sure if the
resulting taxonomy could be used to make testable predictions or if,
instead, it is just a convenient way to summarize the data. It is
also to be noted, as Barnbrook's discussion makes quite clear, that
the markup information made available by the CCSD database was relied
upon extensively, a fact not without consequence, which will be
discussed later.

Chapter 5 is devoted to an exposition of the definition type
taxonomy. Four major types are identified and described in section
5.4. Each of these types is further subdivided into several subtypes
yielding a total of 17. Descriptively, I found this to be the most
interesting chapter of the book. Most bodies of definitions a reader
would encounter (from Aristotle on) really are definitions of
collections, which in turn are typically denoted (at least in the
Western languages I am familiar with) by nouns. In such cases, as
Aristotle had observed, a definition can be assembled using a
super-ordinate term of the term to be defined (often referred to as
its 'genus' i.e. a term denoting a collection that properly includes
the one denoted by the term to be defined) together with a
'differentia', a term denoting a collection that subsumes the one
denoted by the term to be defined and does not subsume any of the
other possible sub-ordinate terms of the super-ordinate term. A
monolingual dictionary, however, cannot always employ this strategy as
it is required to provide definitions for every word, even those that
are not nouns, e.g. verbs. In formal semantics, e.g. Chierchia and
McConnell-Ginet (1991), it is common to represent verbs as relations,
i.e. as collections not of individuals but of n-tuples of individuals.

Given this analysis it is easy to show why defining a verb cannot
always be done easily or effectively by means of the intersection of a
'genus' and a 'differentia'. For example, in German the word ''eats''
in ''X eats Y'' is translated differently depending on whether X is a
person (''essen'') or an animal (''fressen''). This goes to show that,
although it is easy to find a 'genus' for either of ''essen'' and
''fressen'' (e.g. the collection of pairs <X, Y> such that X ingests
Y), the required 'differentiae' are really best expressed not as
collections of pairs to be intersected with the 'genus' but rather as
type restrictions on the X and Y element of the pair: person vs animal
on X to realize the ''essen''/''fressen'' distinction and solid vs
liquid on Y to realize the ''eat''/''drink'' distinction. Although he
doesn't cast his analysis in these terms, Barnbrook does an excellent
job of presenting an extremely rich selection of examples that show
what definitional strategies are commonly employed when the 'genus' +
'differentia' strategy cannot be easily followed. I found this
completely fascinating and I hope Barnbrook or others will follow up
on these leads in future research. Among obvious research questions to
be asked are:

a) to what extent are Barnbrook's types 'stable'? For example, would a
clustering algorithm -- see Manning and Schuetze (1999), section 14 --
generate the same types for most reasonable similarity metrics?

b) is it possible to predict aggregate properties of these types given
their description? For example, one might predict that because verbs
are often defined by means of argument type restrictions that cannot
be lexically realized (but require, say, entire locative or
instrumental prepositional phrases) the entropy of verb definitions
will turn out to be much higher than the entropy of noun definitions.

c) is it possible to show that definition types cut across
part-of-speech classification as long as the underlying semantics of
the term to be defined is the same? For example, can one expect to
find verbs and their nominalizations (e.g. ''destroy'' and
''destruction'') to be defined according to the same pattern?
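The ''essen''/''fressen'' analysis given earlier can be sketched in a few lines. This is only an illustration under stated assumptions: the toy ontology (person, animal, solid, liquid), the four entities and the helper restrict are all hypothetical, and the 'genus' is modeled as a plain set of pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # toy ontology (assumption): "person", "animal", "solid", "liquid"


anna = Entity("Anna", "person")
rex = Entity("Rex", "animal")
bread = Entity("bread", "solid")
water = Entity("water", "liquid")

# The 'genus': the collection of pairs <X, Y> such that X ingests Y.
INGESTS = {(anna, bread), (anna, water), (rex, bread), (rex, water)}


def restrict(relation, x_kind=None, y_kind=None):
    """Keep only the pairs whose arguments satisfy the type restrictions."""
    return {
        (x, y)
        for (x, y) in relation
        if (x_kind is None or x.kind == x_kind)
        and (y_kind is None or y.kind == y_kind)
    }


# The 'differentiae' are type restrictions on one argument position each.
essen = restrict(INGESTS, x_kind="person")
fressen = restrict(INGESTS, x_kind="animal")
eat = restrict(INGESTS, y_kind="solid")
drink = restrict(INGESTS, y_kind="liquid")

print(sorted(x.name + " ingests " + y.name for (x, y) in essen))
```

The point of the sketch is that each distinction constrains a single argument position of the relation, which is awkward to state as one more collection of pairs to intersect with the 'genus'.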

Chapter 6, to quote from its introduction, ''describes the functional
components of the definition sentences, the structural combinations of
those components and the variations in structure between the different
definition types, together with an outline of the processing involved
in the analysis of the definition sentences''. I found this chapter
tantalizing: it contains more than twenty tables displaying how the
elements of definitions of different types are detected and analyzed
by a program written by Barnbrook. All of these analyses are eminently
sensible, so that one would want to learn by what algorithm exactly
they have been generated in order to reproduce those results on a
different domain. Unfortunately, the algorithm is never described in
any detail (except for several remarks that explain how the markup of
the CCSD database is exploited to detect head word boundaries and
register information) with the consequence that the results reported
are essentially irreproducible. On page 187, Barnbrook reports an
interesting proposal by Schnelle (1995, section 2) (inconsistently
listed as Schnelle (1996) in the bibliography). Schnelle's idea is
that definition patterns can be standardized to the most expressive
pattern(s) in order to achieve uniformity (which in turn would be
desirable as a machine readable format). For example, even definitions
for nouns that can be expressed using the simplest 'genus' +
'differentia' pattern can be recast into the more complex 'If/then'
pattern often required by verbs for the reasons explained above. To
exemplify,
 ''A bachelor is an adult unmarried male''
can be recast as
 ''If someone is a bachelor, then they are adult, male and
 unmarried''

I find this very interesting, because it hints at the
possibility that most (all?) definitions might be analyzed
using the ''logical form''
 (AND P-{n}(x1, ... , xn)
      Q-{m}-1(x1.1, ... , x1.m)
      ...
      Q-{j}-k(xk.1, ... , xk.j))
 implies
 (AND Q-{p}-k+1(xk+1.1, ... , xk+1.p)
      ...
      Q-{q}-k+n(xk+n.1, ... , xk+n.q))

where P-{n} is a predicate of the appropriate -arity for the
definiendum (i.e. unary for nouns, binary for transitive verbs,
ternary for di-transitive verbs), each of the Q-{m}-i's is a formula
of m arguments (closed terms or variables, either bound within the
Q-{m}-i literal itself or by the implicit universal quantifier taking
scope over the entire ''logical form'') imposing some restriction on
the arguments of P-{n} and each of the Q-{p}-k+i's is a formula of p
arguments stating conditions that apply to p-tuples that satisfy
conditions of the conjunction in the antecedent. It is easy to see how
such formulae could be classified according to different, independent,
facets. To name a few that come readily to mind: a) the arity of
P-{n}; b) the number of clauses in the antecedent/consequent; c) the
maximum arity of any of the Q's.

It would be nice if one could prove that this format is sufficiently
expressive to represent each of the 17 types identified by Barnbrook.
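One way to make the ''logical form'' and its facets tangible is the following minimal sketch. The class names, the choice to count P itself as one clause of the antecedent, and the encodings of ''bachelor'' and ''manhandle'' are all assumptions made for illustration; they are not Schnelle's or Barnbrook's representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Literal:
    predicate: str
    args: Tuple[str, ...]

    @property
    def arity(self):
        return len(self.args)


@dataclass(frozen=True)
class Definition:
    definiendum: Literal                # plays the role of P-{n}
    restrictions: Tuple[Literal, ...]   # the Q's in the antecedent
    consequent: Tuple[Literal, ...]     # the Q's in the consequent

    def facets(self):
        """Classify the definition along the three facets named in the text."""
        qs = self.restrictions + self.consequent
        return {
            "definiendum_arity": self.definiendum.arity,
            # Assumption: P itself counts as one clause of the antecedent.
            "antecedent_clauses": 1 + len(self.restrictions),
            "consequent_clauses": len(self.consequent),
            "max_q_arity": max((q.arity for q in qs), default=0),
        }


# Hypothetical encodings of two definitions quoted in the review.
bachelor = Definition(
    definiendum=Literal("bachelor", ("x",)),
    restrictions=(),
    consequent=(
        Literal("adult", ("x",)),
        Literal("male", ("x",)),
        Literal("unmarried", ("x",)),
    ),
)
manhandle = Definition(
    definiendum=Literal("manhandle", ("x", "y")),
    restrictions=(Literal("person", ("y",)),),
    consequent=(Literal("treat_very_roughly", ("x", "y")),),
)

print(bachelor.facets())
print(manhandle.facets())
```

Under this encoding the noun and the verb differ only in the values of their facets, not in the shape of the structure that holds them.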

Also of note is the fact that this format invariably expresses only
necessary and never sufficient conditions. In other words it tells you
what follows from the fact that an x is a P (or that it P's something)
but it doesn't tell you what you need to observe in order to be sure
that x is a P (or that it P's something). Philosophers such as Jerry
Fodor (1975) have argued that this is evidence in favor of the
existence of a ''Language of Thought'', with the association between
words and concepts (Harnad's 1990 ''grounding problem'') being
triggered by mechanisms other than testing of ''meaning
hypotheses''. Linguists such as Wierzbicka (1996) have equally pointed
to the likely existence of language primitives that cannot be further
''defined away''. Finally, practicing knowledge engineers encounter
this problem on a daily basis as they expand their ontologies. If
someone has found an evolutionary explanation for why this should be
so (i.e. why, as a species, we crave for necessary conditions but can
live reasonably well without sufficient conditions), I would be very
interested in learning about it.

Chapter 7 is concerned with the evaluation of the taxonomy of
definitions and its possible applications. I found the evaluation part
only marginally interesting because it reports on the evaluation of an
algorithm that is never fully described and because it delves into
some idiosyncrasies of the CCSD database which might not be of general
interest, especially considering that the format of the CCSD is
proprietary. The applications part contains some interesting
ideas. For example, to use the notation introduced above in order to
explain Schnelle's idea, a database of CCSD definitions broken
down into the components identified by Barnbrook's definition grammar
can be queried for Q-{m}-i formulas, either in the antecedent or in
the consequent, to reveal robust word clusters that could hardly be
listed exhaustively even by a native speaker. The reader who doubts
this is invited to write down a list of verbs that can be naturally
defined as ''V-ing something somewhere'' and compare it with the very
interesting list Barnbrook produces on pages 231-2. For a publicly
accessible database, compiled according to a similar design philosophy
from a corpus of newspaper articles, one could visit Dekang Lin's page
at http://www.cs.ualberta.ca/~lindek
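The kind of query described in this paragraph can be sketched as an inverted index from formulas to head-words. Everything in the sketch is an assumption made for illustration: the three toy entries are hand-made and not taken from the CCSD, the formulas are stored as plain strings, and the function names are invented.

```python
from collections import defaultdict

# Hypothetical, hand-made entries; not taken from the CCSD.
DEFINITIONS = {
    "put": {
        "antecedent": {"thing(y)", "place(z)"},
        "consequent": {"move(x, y, z)"},
    },
    "store": {
        "antecedent": {"thing(y)", "place(z)"},
        "consequent": {"keep(x, y, z)"},
    },
    "manhandle": {
        "antecedent": {"person(y)"},
        "consequent": {"treat_roughly(x, y)"},
    },
}


def build_index(definitions):
    """Map each (side, formula) pair to the set of head-words that use it."""
    index = defaultdict(set)
    for headword, sides in definitions.items():
        for side, formulas in sides.items():
            for formula in formulas:
                index[(side, formula)].add(headword)
    return index


def cluster(index, side, formula):
    """Return the sorted head-words whose definition has the formula on that side."""
    return sorted(index.get((side, formula), set()))


index = build_index(DEFINITIONS)
print(cluster(index, "antecedent", "place(z)"))
```

Querying for the restriction that introduces a place argument returns the toy counterpart of a ''V-ing something somewhere'' cluster.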

A second interesting application is the harvesting of definition
sentences for contextual information that could be used for word sense
disambiguation, an idea that is being taken to its logical conclusion
by the WordNet 2.0 release team (see
http://www.cogsci.princeton.edu/~wn), presently engaged in the tagging
of all word tokens included in WordNet definitions by means of the
synset that corresponds to the one meaning, from those available for
that word type, which is the one intended in the definitional context
in which the word token is found. A
third idea is to use Barnbrook's definition grammar as a quality
control tool to verify that definitions in a dictionary really do what
they are intended to do (i.e. help a learner understand the meaning
of words, differentiating every word from any other) and do so using a
consistent and easily interpretable format. One possibility worth
exploring would be having definitional assertions about concepts be
stored in a language independent knowledge base (in the format
mentioned above in the context of Schnelle's proposal) and definitions
be automatically generated into any number of target languages into
which the knowledge base contents can be paraphrased. A knowledge base
of that kind is freely available for download at http://www.opencyc.org

CRITICAL EVALUATION

My specific comments have already been worked into the review of the
individual chapters. Here I will just summarize them as follows: the
book is clearly written and filled with detailed discussions of a very
large number of well chosen examples and as such it is pedagogically
exemplary. At times, however, the reader finds herself wishing that
something like a theory capable of delivering testable predictions
would emerge from all those details. In this review I have tried to
list what I consider the most promising candidates for such
theory-like developments. I benefited a great deal from reading this
book, as it has forced me to think about possible dimensions of the
analysis of language independent of the more traditional ones of
syntax and semantics. In this respect it reminds me of the kind of
''goal oriented'' analysis that can be found in the short essays in
Rubinstein (2000). My main reason for disappointment, as a software
developer, is that, due to copyright reasons, Barnbrook was not able
to be more explicit in the description of his definition analysis
algorithm. Also, I wonder how easy it would be to generalize the
algorithm to operate on free form text (as opposed to the richly
annotated CCSD database format).

REFERENCES
Chierchia, Gennaro and Sally McConnell-Ginet (1991) Meaning
 and Grammar. MIT Press.
Fodor, Jerry (1975) The Language of Thought: A Philosophical 
 Study of Cognitive Psychology. T. Y. Crowell.
Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides (1995)
 Design Patterns: Elements of Reusable Object-Oriented Software.
 Addison-Wesley.
Harnad, Stevan (1990) ''The Symbol Grounding Problem''. Physica
 D 42: 335-346.
Manning, Chris and Hinrich Schuetze (1999) Foundations of 
 Statistical Natural Language Processing. MIT Press.
Rubinstein, Ariel (2000) Economics of Language. Cambridge
 University Press.
Wierzbicka, Anna (1996) Semantics, Primes and Universals.
 Oxford University Press.

ABOUT THE REVIEWER

Stefano Bertolo currently works as a software developer on projects
that require expertise in information extraction, knowledge
representation and inference. After receiving a Ph.D. in Philosophy
and a Diploma in Cognitive Science from Rutgers University he spent
three years as a Post-Doctoral associate at the Brain and Cognitive
Science Department of MIT, working on formal theories of human
language learning.