LINGUIST List 14.2409

Fri Sep 12 2003

Review: Computational Linguistics: Copestake (2002)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at siminlinguistlist.org.

Directory

  1. Mike Maxwell, Implementing Typed Feature Structure Grammars

Message 1: Implementing Typed Feature Structure Grammars

Date: Thu, 11 Sep 2003 18:14:14 +0000
From: Mike Maxwell <maxwellldc.upenn.edu>
Subject: Implementing Typed Feature Structure Grammars

Copestake, Ann. 2002. Implementing Typed Feature Structure Grammars.
CSLI Publications, xi+233 pp., hardback ISBN 1575862611, USD 62.00;
paperback ISBN 1575862603, USD 22.00.

Announced at http://linguistlist.org/issues/13/13-2529.html
Modified announcement at http://linguistlist.org/issues/13/13-2589.html

Mike Maxwell, Linguistic Data Consortium at the University of
Pennsylvania.

(Reviewer's note: While announced as a book, this is more properly
described as software, the book being the tutorial and user manual.
The software can be freely downloaded or obtained as a CD.)

Generative grammar was originally a theory of context-free phrase
structure rules, augmented with a notion of transformations. But
embedded in 'Aspects of the Theory of Syntax' was a nascent theory of
syntactic features. Originally viewed as a way to augment a
context-free grammar of atomic categories (noun, verb, NP, VP...),
these features were in fact capable of completely replacing the atomic
categories.

It took many years for syntactic theories to catch up to the
possibilities opened up by the introduction of syntactic features.
Generalized Phrase Structure Grammar, for example, is a half-breed: an
over-generating system of rules based on atomic categories is filtered
by principles governing the percolation of those features. But in
Head-Driven Phrase Structure Grammar (HPSG), and related theories, the
grammar uses only features. While such a grammar is still
context-free, it can be considerably more complex to parse. One way
of looking at this is that there are orders of magnitude more
categories: (potentially) as many as there are combinations of feature
values.
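
(To make the arithmetic concrete, here is a minimal Python sketch.
The feature names and values are invented for illustration and are
not taken from the book or from any LKB grammar. Real HPSG feature
structures are typed and nested, so this flat inventory only hints at
how quickly the number of distinct categories grows.)

```python
from itertools import product

# Hypothetical, flat feature inventory (illustration only; not from the book).
FEATURES = {
    "head": ["noun", "verb", "adj", "prep"],
    "number": ["sg", "pl"],
    "person": ["1", "2", "3"],
    "case": ["nom", "acc"],
}

def count_categories(features):
    """Number of distinct categories = product of the sizes of the value sets."""
    total = 1
    for values in features.values():
        total *= len(values)
    return total

def enumerate_categories(features):
    """Yield every fully specified category as a feature -> value dict."""
    names = list(features)
    for combo in product(*(features[name] for name in names)):
        yield dict(zip(names, combo))

num_categories = count_categories(FEATURES)
all_categories = list(enumerate_categories(FEATURES))
print(num_categories, "categories from", len(FEATURES), "features")
```

Adding one more binary feature doubles the count, which is why a
parser that treats every feature combination as a separate atomic
category quickly becomes impractical.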

Thus, while there have been many more or less efficient parsers for
context-free phrase structure grammars using atomic categories
(sometimes enhanced with features), there have been very few parsers
suitable for theories like HPSG.

The tool described in this book is such a parser: the Linguistic
Knowledge Building system (LKB). (The LKB also includes a generator,
although it is not as well developed as the parser.) The book is a
tutorial (and, to some extent, a reference) for LKB, as well as
describing (briefly) the linguistic theory which the software models.
It does presuppose some knowledge of syntax, and readers who come from
a different theoretical background should probably begin by reading an
introductory text on HPSG. Prospective users of this program should
also be comfortable with a programmer's editor, such as emacs.

Ann Copestake is the book's author, but she credits many others with
helping design and build LKB. The LKB can be downloaded in executable
form from the book's website. (Actually, as I write this, it is
downloadable from http://www-csli.stanford.edu/~aac/lkb.html; the
download link at the URL given in the book is broken.) It runs under
Windows (95 or later), RedHat Linux, and Solaris operating systems.
The source code is written in Common Lisp (with the gui running under
a particular variety of Common Lisp), and has been open-sourced.
Reportedly, this allows it to run on an Apple Macintosh, provided you
purchase a Lisp interpreter.

I successfully downloaded and ran the Windows version (which I ran
under MS-Windows 2000). Users of this version should however be
warned that some Windows conventions are not followed (probably
because they are not standards under the other operating systems).
For instance, installation does not automatically create an icon on
your desktop, nor an entry in your 'Start' menu. If you create your
own shortcut for starting LKB, and you prefer to keep your grammar
files in some location other than a subdirectory of the LKB directory
(perhaps under your ''My Documents'' directory), you'll want to edit
the shortcut to start LKB in that directory. Also, clicking on the
'x' in the upper right-hand corner of the running LKB ''Top'' window
shuts down LKB, but leaves a Lisp window running. (Using the 'Quit'
menu item from LKB does shut down the Lisp process as well.) Another
oddity: while I was able to change the font sizes for most aspects of
the display, I was unable to change the minuscule font size used for
displaying parses. However, clicking on the parse tree brings up a
menu, and one of the menu choices then displays a larger version of
the parse tree, in which one can browse individual nodes to view the
feature structures which those nodes abbreviate.

I also successfully downloaded and ran the Solaris version (which uses
a similarly minuscule and apparently unchangeable font for displaying
parses).

Finally, I downloaded several of the Linux versions, but I was unable
to make lkb run under my RedHat v.9 Linux installation. It was
unclear which version of lkb I was supposed to use for this latest
version of Linux, but the 'unstable' version dated 2003-09-03 seemed
to come the closest to running. (For those interested in the gory
details, lkb complained about a library error. The LKB website FAQ
suggested that this was a Motif problem, and that it would only run
under an earlier version (2.1) of Motif than what I had on my Linux
machine (2.2--the same version downloadable by default from the Motif
website). I suspect that I could have obtained this older version
from the Motif web site, and replaced the installed version of
Motif--but this was getting beyond my meager Linux skills, and I gave
up.)

Once running, LKB includes a windowing interface to the grammar
loader, parser and generator, various display tools including a type
hierarchy browser, and some debugging tools. All these tools are
basically read-only, that is, browsers rather than editors; editing is
done with your favorite programming editor, which you may think of as
a blessing (if you are a veteran emacs user, since emacs has special
hooks into the LKB) or a curse (otherwise).

The book consists of two parts, a tutorial and a user manual. The
tutorial section begins with a very simple grammar (using atomic
categories, and no features), and progressively adds refinements,
working up to a grammar that handles long-distance dependencies.
Interspersed with the instructions on how to write and debug grammars
are discussions of the theory which the grammars presuppose.
Exercises, both thought problems and hands-on (programming) exercises,
help ensure the reader's understanding. (Simple answers to the
exercises are supplied in the book, while longer answers are contained
in the data files downloadable from the LKB website.)

The user manual section of the book is essentially a reference manual,
although there are some useful explanations as well. It is the
nearest thing to a 'help' file for the various commands and displays
(and parts of it could usefully be turned into an on-line help file).

While the book is generally well written, I found it hard going at
times. I also developed an increased appreciation for the joy of
hyperlinked text--since the text was merely paper, I found myself
penciling in, next to in-text references to figures, the page numbers
on which those figures appeared. (I was not helped by the figure
numbering systems, of which there are two: one numbered on a
per-chapter basis, and one running sequentially through the book.)

There are few typos, and most do not interfere with the reading. A
couple of the more important ones are the substitution of ''latter''
for ''former'' (pg. 76, just above figure 3.56); and two distinct
versions of the same rule (first rule on pg. 128 and rule at the
bottom of pg. 124; the latter is correct).

In addition to the text and the website, there is an LKB mailing list
(archived since June at Linguist List). It does not appear to have
received much traffic, but would be a useful resource particularly to
readers working through the LKB system on their own.

The LKB program has (of course) certain limitations, perhaps the most
important of which is that the order of argument phrases in sentences
must match their order in the argument lists of lexical items (apart
from wh-movement; dative movement and other such ''transformations''
are of course treated lexically). The implication is that it will be
difficult to treat 'free word order' languages. (I hasten to add that
this is a fundamental problem which virtually all parsing programs
face.)

Another limitation lies in the simplistic implementation of morphology
in LKB. Actually, although some might see this as a limitation, I
look at this as the right way to build software. For whatever you may
think of the theoretical relationship between morphology and syntax,
when it comes to computational implementations, they are quite
different. Therefore computational implementations should treat
syntax and morphology as separate modules. And indeed, it should not
be difficult to attach to LKB a more sophisticated morphological
parser/generator (such as the Xerox finite state morphological
transducer, recently published through CSLI/University of Chicago
Press; see Linguist List 14.2028). The same might hold for efficient
lexical access; while lexical access is implemented internally to LKB,
it may be more efficient with large lexicons to hook up a specialized
lexical lookup engine. (The LKB also supplies 'hooks' to allow for an
alternative semantic ''back end''.)

The LKB is also somewhat limited in its treatment of multi-word
lexical items, which can consist of a list of words in fixed order,
with one of those words being specified as a possible (inflectional)
affixation site. As readers of the Linguist List may well appreciate,
there is an entire spectrum of constructions which lie somewhere near
the border between syntax and morphology, including compounds (which
may or may not be written solid, as in English 'doghouse' and 'dog
house'), proper names and place names, and ranging through idioms,
fixed expressions, and perhaps even verb-particle constructions; not
all of these fit such a limited approach. Neither linguistic theory
nor computational linguistics has caught up with the variety of
such constructions (see e.g.
http://www.cl.cam.ac.uk/users/alk23/mwe/mwe.html 
for a recent conference on this topic), so it is understandable 
that the LKB would not handle them all well.

I did not try to use LKB with anything except vanilla ASCII
characters. The LKB web site has a place-holder for this issue (as
well as for on-going work on multi-word constructions). Since there
does not appear to be any obvious way to change the actual font used
to display parses (short of editing the Lisp code), it is presumably
not possible at present to use alternative fonts inside LKB itself.
However, there is a downloadable Japanese grammar (reachable from the
LKB website) that is said to run under the LKB; the web documentation
recommends running it from inside emacs for purposes of Japanese text
input. Since I am proficient neither in emacs nor in Japanese, I did
not try this.

Finally, LKB is not intended to serve as a production environment.
Rather, it is intended as a tool for learning about parsing and
generating with HPSG-like grammars.

My principal criticism of the LKB is not so much a criticism as a
concern with what the intended audience is. While my sympathies are
entirely with hand-crafted grammars, this is not a prominent
methodology in computational linguistics in recent years, where the
emphasis has instead been on statistical approaches. Nor, outside of
HPSG, will many linguists find the twin barriers of theory and
computational tools easy to cross. So the main audience seems to be
those linguists who already know HPSG (or a related theory), and who
are already comfortable enough with computers and programming to build
and debug grammars. And this seems--unfortunately--a rather small
audience.

There are several things that could make this grammar development
system more accessible to linguists (or to linguistics students). One
such aid would be to hide more of the computational details, such as
the distinction (discussed in section 4.3 and elsewhere) between
(ordinary) lists and 'difference lists'. This distinction is purely
computational, not linguistic, and should be invisible to users.
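
For readers unfamiliar with the distinction, the following Python
sketch illustrates the general idea of a difference list: a list that
keeps a pointer (here LAST) to its own open tail, so that two lists
can be concatenated by identifying the first list's tail with the
start of the second, without walking the first list. All names here
are hypothetical and chosen for illustration; this is not LKB
notation, and the forwarding pointer only simulates what unification
does in a feature structure system.

```python
class Node:
    """A list node; a node with no `first` value yet is an open tail."""
    def __init__(self):
        self.first = None
        self.rest = None
        self.forward = None  # set when this open tail is identified with another node

def deref(node):
    """Follow forwarding pointers to the node this one was identified with."""
    while node.forward is not None:
        node = node.forward
    return node

def make_diff_list(items):
    """Build a difference list: LIST is the start, LAST is the open tail."""
    head = Node()
    last = head
    for item in items:
        last.first = item
        last.rest = Node()
        last = last.rest
    return {"LIST": head, "LAST": last}

def append(a, b):
    """Concatenate by identifying a's open tail with b's start (no traversal of a)."""
    deref(a["LAST"]).forward = b["LIST"]
    return {"LIST": a["LIST"], "LAST": b["LAST"]}

def to_list(d):
    """Read the elements between LIST and LAST back out as a Python list."""
    out = []
    node = deref(d["LIST"])
    end = deref(d["LAST"])
    while node is not end:
        out.append(node.first)
        node = deref(node.rest)
    return out

subject = make_diff_list(["the", "dog"])
verb = make_diff_list(["barks"])
sentence = append(subject, verb)
print(to_list(sentence))
```

An ordinary list has no such tail pointer, so appending to it means
stepping through every element first. That bookkeeping is exactly the
kind of detail a grammar writer should not have to think about.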

It would also be helpful to provide as part of the development
environment structured editors, which would make typographic errors
(spelling, unbalanced brackets) less likely. In my experience,
tracking down such errors (not to mention learning the special
notation in the first place) can chew up a lot of time.

Another change which would make the system much easier to use concerns
the scripting commands. As it stands, in addition to learning the
grammar notation, the advanced user must learn a Lisp-like notation in
order to tell the system which files to load, and for defining certain
system parameters. While the learning curve for this portion of the
application is not long (scripting commands tend not to be complex,
and often what one needs for a new grammar can simply be adapted from
an existing grammar), the learning curve could have been eliminated
entirely by providing access to the necessary commands and variable
settings through the gui (with each gui widget linked to the
appropriate section of the user manual, converted into an on-line help
document).

But perhaps what would do most to make LKB useful and interesting to
linguists is something its users could best do: build
up a set of sample grammars. As a start towards this, the LKB website
makes available for download two good-sized grammars, one of English
and one of Japanese. I googled several more LKB grammars (including a
categorial grammar implementation), but it would be a service to the
NLP community and particularly to users of LKB to gather the URLs for
such systems in one location (and make them accessible from OLAC as
well).

In summary, I can recommend this book and its software for those who
wish to try their hand at parsing using an HPSG-like approach. I
suspect it would work well in a classroom setting, although I think
students would need some prior experience in program editing and
debugging. (As I mentioned, a structured editor would do much to ease
the frustration for naïve users that inevitably comes with using a
'dumb' editor for programming.) There are of course other parsing
systems (Copestake provides a long list of freely available systems at
the end of chapter five), but it is beyond the scope of this review to
compare them with LKB. Suffice to say that if you want to experiment
with parsing using HPSG or similar formalisms, then LKB is probably
the system to use.

ABOUT THE REVIEWER

Mike Maxwell works for the Linguistic Data Consortium at the
University of Pennsylvania. He holds a Ph.D. in linguistics from the
University of Washington. In between these two places, he developed a
broad coverage syntactic grammar of English, consulted on minority
languages in Colombia, and built a morphological parser/generator.