Review of Defining Language


Reviewer: Stefano Bertolo
Book Title: Defining Language
Book Author: Geoff Barnbrook
Publisher: John Benjamins
Linguistic Field(s): Text/Corpus Linguistics
Book Announcement: 14.196

Review:


Date: Mon, 20 Jan 2003 10:06:57 -0600 (CST)
From: Stefano Bertolo <bertolo@cyc.com>
Subject: Barnbrook (2002) Defining Language

Barnbrook, Geoff (2002) Defining Language: A Local Grammar
of Definition Sentences. John Benjamins Publishing Company,
xv+280pp, hardback ISBN 1-58811-298-5, $79.00, Studies in
Corpus Linguistics 11.

Stefano Bertolo, unaffiliated reader of LINGUIST

SUMMARY OF THE BOOK'S PURPOSE AND CONTENT
The book describes the main features of the language used in
definition sentences in the Collins Cobuild Student's
Dictionary (CCSD). The reason for what might otherwise
appear as a rather narrow choice of topic is explained very
clearly and with an abundance of examples: in the CCSD a
definition uses the head-word to be defined in a context that

a) uses a predictable and simplified syntactic format
designed to sound natural (and so to be easily understood
by a student);
b) introduces enough information to disambiguate the
intended meaning of the head-word.

As an example of a), transitive verbs in the CCSD are
commonly defined using the format

"If you [Verb1 an Obj-1], you [[Verb2 Anaphora-1] Adjunct]"
as in
"If you MANHANDLE someone, you TREAT them very roughly"

As an example of b), two common senses of the word "breast"
are disambiguated by introducing the appropriate modifier
("woman's" vs "bird's") as in
"A woman's breasts are ..."
"A bird's breast is ..."
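This predictable format lends itself to simple mechanical extraction, a point Barnbrook himself makes. As a minimal sketch (the regular expression and example sentence here are my own illustration, not the CCSD's actual markup or Barnbrook's program), the transitive-verb pattern above can be captured as:

```python
import re

# Hypothetical pattern for "If you [Verb1 an Obj-1], you [[Verb2 ...] Adjunct]"
TRANSITIVE_PATTERN = re.compile(
    r"^If you (?P<verb>\w+) (?P<object>.+?), "
    r"you (?P<explanation>.+)$"
)

definition = "If you MANHANDLE someone, you TREAT them very roughly"
match = TRANSITIVE_PATTERN.match(definition)
if match:
    print(match.group("verb"))         # the head-word being defined
    print(match.group("explanation"))  # the defining paraphrase
```

A single pattern of this kind recovers the head-word, its argument slot, and the defining paraphrase without any full syntactic parse, which is the sense in which the definition format is "predictable".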

Barnbrook's main insight, as I understand it, is the
following: given that it is theoretically possible to
explain a word to a learner in infinitely many ways, how is
it that the CCSD manages to do a very good job using only
few patterns of the kind described above? Evidently, those
patterns represent a (or possibly "the") minimal set
required to cover all communication needs associated with
situations in which definitional information needs to be
exchanged. Such patterns, Barnbrook goes on to explain,
cannot be directly aligned with the constituents that would
be returned by a syntactic parser. The goal of his book is
to offer a detailed analysis of those patterns from a
functional perspective that is orthogonal to that of current
research in syntax/parsing.

I found it very difficult to understand the idea behind this
approach until I happened to recognize the analogy between
Barnbrook's definitional patterns and "patterns" as
understood in the practice of software engineering, see for
example Gamma et al. (1995). I mention it here in the hope
that other readers might find it as enlightening as I did.
The analogy goes like this: when a programmer wants to build
a web application she might decide to use the client/server
pattern, because this satisfies the need of the application
(many agents need to access information maintained by a
central entity). Nevertheless, when the program has been
completed in, say, Java, analyzing the program with respect
to the syntax of Java is the wrong level of abstraction to
choose if one wants to understand what the intent of the
program is: knowing which parts of the program are
variables, method definitions, method calls etc... doesn't
help you in the least in recognizing that you are looking at
an instantiation of the client/server pattern.

Having summarized the general purpose of the book, I now
give a brief description of each chapter.

Chapter 1 introduces the definition format of the Cobuild
dictionaries described above, i.e. the fact that head-words
are explained in the context of simple sentences of a
predictable syntactic format. Barnbrook aptly points out
that, in addition to being helpful for a learner, such
predictability can and should be exploited by programs for
automatic information extraction. Several Cobuild
dictionaries are introduced and some of their characteristics
explained. Considering that the rest of the book contains
a very large number of tables, I would have expected and
enjoyed finding a table listing all the Cobuild dictionaries
with a summary of their similarities and differences.

Chapter 2 is devoted to a discussion of features of
monolingual English dictionaries. Barnbrook summarizes
earlier discussion to the effect that, in monolingual
dictionaries, the language whose words are being defined
acts as its own metalanguage in defining its semantics. I
find this unconvincing, considering that, with the exception
of entries that record usage information (register, regional
variation, etc...) most Cobuild entries do not mention the
words that are part of the definition, but simply use them
in particularly well chosen examples. The rest of the
chapter is devoted to a short history of the development of
monolingual dictionaries, showing how they evolved from lists
of 'difficult' (i.e. usually latinate) words which could be
explained by means of the corresponding English word to
dictionaries which, for the benefit of non-native speakers
of English, list an English definition for every English word.

Chapter 3 articulates what, in my opinion, is the central
point of the book, i.e. the claim that monolingual
definitional sentences have a recognizable structure (i.e.
can be expected to contain certain elements in a certain
order) which requires a specialized vocabulary to be
described, i.e. cannot be described by (reduced to) the
vocabulary of syntactic theory. Philosophers of mind will
find here an interesting parallel to the debate on
functionalism and the irreducibility of (some) private or
public mental construct to their physical substrate (e.g.
the fact that "money" is a symbolic entity that cannot be
reduced to its physical realization, precious metals, paper
bills, wampum, ...). This is a rather important point that,
in my opinion, is obfuscated by the insistence on the
parallel between the grammar of English and the grammar of
definition, which results in rather perfunctory attempts to
place the grammar of definition in the class of context-
sensitive grammars (page 71) and to show that the definition
language so produced is a subset of English closed under
some operations that are never defined (page 73). The
discussion becomes much more interesting when empirical
facts about definitional sentences are listed that are
independent of any syntactic analysis, for example the fact
that the entropy of the CCSD is much lower than that of a
corpus of the same size from "The Times"; the fact that
definitional sentences never contain inter-sentential
coreferences; the fact that only very few of the possible
senses of an ambiguous word are used in definition
sentences.
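The entropy comparison is easy to make concrete. As a toy illustration (the texts below are invented; they are not the CCSD or Times data, and the figures carry no empirical weight), a formulaic text has a lower word-frequency entropy than a varied text of comparable length:

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

# Repetitive, pattern-bound prose vs. varied prose of similar length.
formulaic = "if you run you move fast if you walk you move slowly"
varied = "the quick brown fox jumps over a lazy dog near the old barn"
print(word_entropy(formulaic) < word_entropy(varied))  # True
```

The same computation, run over the CCSD and a size-matched newspaper sample, is what grounds the claim that definition language is measurably more constrained than general English.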

Chapter 4 describes the methodology employed by the author
in creating a taxonomy of definition types able to
subsume, almost without exception, all the definitional
sentences in the CCSD. Most of the steps described reflect
the train of thought that any reasonable person without any
more sophisticated tools than grammar school linguistic
categories would follow in classifying CCSD definitional
sentences. In other words, exactly because definitions are
analyzed according to criteria that are orthogonal to
current sophisticated theories of syntax (and so cannot rely
on what could be referred to as theory-internal constructs),
the whole classification has an extremely empiricist feel to
it to the point that one is not quite sure if the resulting
taxonomy could be used to make testable predictions or if,
instead, it is just a convenient way to summarize the data.
It is also to be noted, as Barnbrook's discussion makes
quite clear, that the markup information made available by
the CCSD database was relied upon extensively, a fact not
without consequence, which will be discussed later.

Chapter 5 is devoted to an exposition of the definition type
taxonomy. Four major types are identified and described in
section 5.4. Each of these types is further subdivided into
several subtypes yielding a total of 17. Descriptively, I
found this to be the most interesting chapter of the book.
Most bodies of definitions a reader would encounter (from
Aristotle on) really are definitions of collections, which
in turn are typically denoted (at least in the Western
languages I am familiar with) by nouns. In such cases, as
Aristotle had observed, a definition can be assembled using
a super-ordinate term of the term to be defined (often
referred to as its 'genus' i.e. a term denoting a collection
that properly includes the one denoted by the term to be
defined) together with a 'differentia', a term denoting a
collection that subsumes the one denoted by the term to be
defined and does not subsume any of the other possible
sub-ordinate terms of the super-ordinate term. A monolingual
dictionary, however, cannot always employ this strategy as
it is required to provide definitions for every word, even
those that are not nouns, e.g. verbs. In formal semantics,
e.g. Chierchia and McConnell-Ginet (1991), it is common to
represent verbs as relations, i.e. as collections not of
individuals but of n-tuples of individuals.

Given this analysis it is easy to show why defining a verb
cannot always be done easily or effectively by means of the
intersection of a 'genus' and a 'differentia'. For example,
in German the word "eats" in "X eats Y" is translated
differently depending on whether X is a person ("essen") or
an animal ("fressen"). This goes to show that, although it
is easy to find a 'genus' for either of "essen" and
"fressen" (e.g. the collection of pairs <X, Y> such that X
ingests Y), the required 'differentiae' are really best
expressed not as collections of pairs to be intersected with
the 'genus' but rather as type restrictions on the X and Y
element of the pair: person vs animal on X to realize the
"essen"/"fressen" distinction and solid vs liquid on Y to
realize the "eat"/"drink" distinction. Although he doesn't
cast his analysis in these terms, Barnbrook does an
excellent job at presenting an extremely rich selection of
examples that show what definitional strategies are commonly
employed when the 'genus' + 'differentia' strategy cannot be
easily followed. I found this completely fascinating and I
hope Barnbrook or others will follow up on these leads in
future research. Among obvious research questions to be
asked are:

a) to what extent are Barnbrook's types 'stable'? For
example, would a clustering algorithm -- see Manning and
Schuetze (1999), section 14 -- generate the same types for
most reasonable similarity metrics?

b) is it possible to predict aggregate properties of these
types given their description? For example, one might
predict that because verbs are often defined by means of
argument type restrictions that cannot be lexically realized
(but require, say, entire locative or instrumental
prepositional phrases) the entropy of verb definitions will
turn out to be much higher than the entropy of noun
definitions.

c) is it possible to show that definition types cut across
part-of-speech classification as long as the underlying
semantics of the term to be defined is the same? For
example, can one expect to find verbs and their
nominalizations (e.g. "destroy" and "destruction") to be
defined according to the same pattern?
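Returning to the "essen"/"fressen" example above, the type-restriction idea can be sketched very simply (the function names and dispatch tables are my own illustration, not Barnbrook's analysis): the 'genus' is a single relation ingest(X, Y), and verb choice is driven by type restrictions on the arguments rather than by intersecting collections.

```python
def select_german_eating_verb(subject_type):
    """German lexicalizes a person/animal restriction on X in ingest(X, Y)."""
    return {"person": "essen", "animal": "fressen"}[subject_type]

def select_english_verb(object_state):
    """English lexicalizes a solid/liquid restriction on Y in ingest(X, Y)."""
    return {"solid": "eat", "liquid": "drink"}[object_state]

print(select_german_eating_verb("person"))  # essen
print(select_german_eating_verb("animal"))  # fressen
print(select_english_verb("liquid"))        # drink
```

The point of the sketch is that the 'differentiae' live in the argument positions, not in the relation itself, which is why a flat 'genus' + 'differentia' intersection cannot express them directly.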

Chapter 6, to quote from its introduction, "describes the
functional components of the definition sentences, the
structural combinations of those components and the
variations in structure between the different definition
types, together with an outline of the processing involved
in the analysis of the definition sentences". I found this
chapter tantalizing: it contains more than twenty tables
displaying how the elements of definitions of different
types are detected and analyzed by a program written by
Barnbrook. All of these analyses are eminently sensible, so
that one would want to learn by what algorithm exactly they
have been generated in order to reproduce those results on a
different domain. Unfortunately, the algorithm is never
described in any detail (except for several remarks that
explain how the markup of the CCSD database is exploited to
detect head word boundaries and register information) with
the consequence that the results reported are essentially
irreproducible. On page 187, Barnbrook reports an
interesting proposal by Schnelle (1995, section 2)
(inconsistently listed as Schnelle (1996) in the
bibliography). Schnelle's idea is that definition patterns
can be standardized to the most expressive pattern(s) in
order to achieve uniformity (which in turn would be
desirable as a machine readable format). For example, even
definitions for nouns that can be expressed using the
simplest 'genus' + 'differentia' pattern can be recast into
the more complex 'If/then' pattern often required by verbs
for the reasons explained above. To exemplify,
"A bachelor is an adult unmarried male"
can be recast as
"If someone is a bachelor, then they are adult, male and
unmarried"

I find this very interesting, because it hints at the
possibility that most (all?) definitions might be analyzed
using the "logical form"
(AND P-{n}(x1, ... , xn)
     Q-{m}-1(x1.1, ... , x1.m)
     ...
     Q-{j}-k(xk.1, ... , xk.j))
implies
(AND Q-{p}-k+1(xk+1.1, ... , xk+1.p)
     ...
     Q-{q}-k+n(xk+n.1, ... , xk+n.q))
where P-{n} is a predicate of the appropriate arity for the
definiendum (i.e. unary for nouns, binary for transitive
verbs, ternary for di-transitive verbs), each of the
Q-{m}-i's is a formula of m arguments (closed terms or
variables, either bound within the Q-{m}-i literal itself or
by the implicit universal quantifier taking scope over the
entire "logical form") imposing some restriction on the
arguments of P-{n} and each of the Q-{p}-k+i's is a formula
of p arguments stating conditions that apply to p-tuples
that satisfy conditions of the conjunction in the
antecedent. It is easy to see how such formulae could be
classified according to different, independent, facets. To
name a few that come readily to mind:
a) the arity of P-{n};
b) the number of clauses in the antecedent/consequent;
c) the maximum arity of any of the Q's.

It would be nice if one could prove that this format is
sufficiently expressive to represent each of the 17 types
identified by Barnbrook.
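As an illustration of how such a classification might be mechanized, here is a minimal sketch (the data structure and facet names are my own, not Barnbrook's or Schnelle's) that encodes the bachelor example in the proposed format and reads off the three facets just listed:

```python
# A definition as an implication between two conjunctions of
# (predicate, argument-tuple) literals; variables are implicitly
# universally quantified over the whole formula.
bachelor_def = {
    "antecedent": [("bachelor", ("x1",))],
    "consequent": [("adult", ("x1",)),
                   ("male", ("x1",)),
                   ("unmarried", ("x1",))],
}

def facets(definition):
    """Classify a definition by the three facets named in the text."""
    _, head_args = definition["antecedent"][0]
    all_literals = definition["antecedent"] + definition["consequent"]
    return {
        "arity_of_head": len(head_args),                          # facet a)
        "clauses": (len(definition["antecedent"]),
                    len(definition["consequent"])),               # facet b)
        "max_arity": max(len(args) for _, args in all_literals),  # facet c)
    }

print(facets(bachelor_def))
# {'arity_of_head': 1, 'clauses': (1, 3), 'max_arity': 1}
```

A transitive-verb definition in the "If you X someone, you Y them" pattern would come out with arity_of_head 2, which is one concrete sense in which the facets cut across Barnbrook's surface types.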

Also of note is the fact that this format invariably
expresses only necessary and never sufficient conditions. In
other words it tells you what follows from the fact that an
x is a P (or that it P's something) but it doesn't tell you
what you need to observe in order to be sure that x is a P
(or that it P's something). Philosophers such as Jerry Fodor
(1975) have argued that this is evidence in favor of the
existence of a "Language of Thought", with the association
between words and concepts (Harnad's 1990 "grounding
problem") being triggered by mechanisms other than testing
of "meaning hypotheses". Linguists such as Wierzbicka (1996)
have equally pointed to the likely existence of language
primitives that cannot be further "defined away". Finally,
practicing knowledge engineers encounter this problem on a
daily basis as they expand their ontologies. If someone has
found an evolutionary explanation for why this should be so
(i.e. why, as a species, we crave necessary conditions
but can live reasonably well without sufficient conditions),
I would be very interested in learning about it.

Chapter 7 is concerned with the evaluation of the taxonomy
of definitions and its possible applications. I found the
evaluation part only marginally interesting because it
reports on the evaluation of an algorithm that is never
fully described and because it delves into some
idiosyncrasies of the CCSD database which might not be of
general interest, especially considering that the format of
the CCSD is proprietary. The applications part contains some
interesting ideas. For example (to use the notation
introduced above in order to explain Schnelle's idea), a
database of CCSD definitions broken down into the components
identified by Barnbrook's definition grammar can be queried
for Q-{m}-i formulas, either in the antecedent or in the
consequent, to reveal robust word clusters that could hardly
be listed exhaustively even by a native speaker. The reader
who doubts this is invited to write down a list of verbs
that can be naturally defined as "V-ing something somewhere"
and compare it with the very interesting list Barnbrook
produces on pages 231-2. For a publicly accessible database,
compiled according to a similar design philosophy from a
corpus of newspaper articles, one could visit Dekang Lin's
page at
http://www.cs.ualberta.ca/~lindek

A second interesting application is the harvesting of
definition sentences for contextual information that could
be used for word sense disambiguation, an idea that is being
taken to its logical conclusion by the WordNet 2.0 release
team (see
http://www.cogsci.princeton.edu/~wn),
which is presently engaged in tagging every word token in
the WordNet definitions with the synset corresponding to the
one meaning, among those available for that word type,
intended in the definitional context in which the token
occurs. A third idea is to
use Barnbrook's definition grammar as a quality control tool
to verify that definitions in a dictionary really do what
they are intended to do (i.e. help a learner understand
the meaning of words by differentiating every word from any
other) and do so using a consistent and easily interpretable
format. One possibility worth exploring would be having
definitional assertions about concepts be stored in a
language-independent knowledge base (in the format mentioned
above in the context of Schnelle's proposal) and definitions
be automatically generated into any number of target
languages into which the knowledge base contents can be
paraphrased. A knowledge base of that kind is freely
available for download at
http://www.opencyc.org

CRITICAL EVALUATION
My specific comments have already been worked into the
review of the individual chapters. Here I will just
summarize them as follows: the book is clearly written and
filled with detailed discussions of a very large number of
well chosen examples and as such it is pedagogically
exemplary. At times, however, the reader finds herself
wishing that something like a theory capable of delivering
testable predictions would emerge from all those details. In
this review I have tried to list what I consider the most
promising candidates for such theory-like developments. I
benefited a great deal from reading this book, as it has
forced me to think about possible dimensions of the analysis
of language independent of the more traditional ones of
syntax and semantics. In this respect it reminds me of the
kind of "goal oriented" analysis that can be found in the
short essays in Rubinstein (2000). My main reason for
disappointment, as a software developer, is that, due to
copyright reasons, Barnbrook was not able to be more
explicit in the description of his definition analysis
algorithm. Also, I wonder how easy it would be to generalize
the algorithm to operate on free form text (as opposed to
the richly annotated CCSD database format).

REFERENCES
Chierchia, Gennaro and Sally McConnell-Ginet (1991) Meaning
and Grammar. MIT Press.
Fodor, Jerry (1975) The Language of Thought: A Philosophical
Study of Cognitive Psychology. T. Y. Crowell.
Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides
(1995) Design Patterns: Elements of Reusable Object-oriented
Software. Addison-Wesley.
Harnad, Stevan (1990) "The Symbol Grounding Problem". Physica
D42: 335-346
Manning, Chris and Hinrich Schuetze (1999) Foundations of
Statistical Natural Language Processing. MIT Press.
Rubinstein, Ariel (2000) Economics of Language. Cambridge
University Press.
Wierzbicka, Anna (1996) Semantics, Primes and Universals.
Oxford University Press.




 
ABOUT THE REVIEWER:
Stefano Bertolo currently works as a software developer on projects that require expertise in information extraction, knowledge representation and inference. After receiving a Ph.D. in Philosophy and a Diploma in Cognitive Science from Rutgers University, he spent three years as a Post-Doctoral associate at the Brain and Cognitive Science Department of MIT, working on formal theories of human language learning.