LINGUIST List 14.3165

Wed Nov 19 2003

Review: Software: Natural Language Toolkit (NLTK) 1.2

Editor for this issue: Terence Langendoen <terry@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at simin@linguistlist.org.

Directory

  1. Kevin Russell, Natural Language Toolkit (NLTK) 1.2

Message 1: Natural Language Toolkit (NLTK) 1.2

Date: Tue, 18 Nov 2003 21:03:42 -0600
From: Kevin Russell <krussll@cc.UManitoba.CA>
Subject: Natural Language Toolkit (NLTK) 1.2

Natural Language Toolkit, ver. 1.2, downloadable from
http://nltk.sourceforge.net/.

Kevin Russell, University of Manitoba

INTRODUCTION

The Natural Language Toolkit (NLTK) is a set of programs and
programming resources for dealing with natural language, intended
mainly for teachers and students in computational linguistics courses.
It includes example programs for many of the common tasks in
computational and corpus linguistics (such as tokenizing, tagging,
parsing, and word-sense disambiguation), programming libraries that
students can use in writing their own programs for these tasks, and
graphical displays useful for classroom presentations.

NLTK is free and licensed under the GNU General Public License, so you
can read the source code (written in Python) and modify it as much as
you want. The current version of the toolkit (1.2) can be downloaded
from the project's website at http://nltk.sourceforge.net/.


SYNOPSIS

The following are some of the things included in NLTK:

* data structure definitions that allow the computer to represent
 linguistically interesting entities: tokens, trees, grammar rules,
 frequency distributions (e.g., for keeping count of how often each
 word occurs in a corpus).

* pre-programmed classes and functions for many language-processing
 tasks, including: tokenizers, basic n-gram taggers, recursive
 descent parsers, chart parsers, and various probabilistic parsers.

* a number of graphical displays. Some can draw static images such as
 syntactic trees and basic histograms for frequency
 distributions. Others can dynamically show how, say, a finite-state
 automaton or a chart parser changes its state through time.

* sample data (mostly English) to practise with, including a portion
 of the Penn Treebank, the complete million-word Brown corpus with
 part-of-speech tags and a subset of it tagged with WordNet senses,
 verb subcategorizations from Levin (1993), and word-lists for a
 number of European languages.

* a number of extras contributed by users, including an interface to
 the Festival speech synthesis system, re-implementations of the
 Brill tagger and the Porter stemmer, and a parser for Tree Adjoining
 Grammars.

* demonstration programs illustrating how to use most of the above.
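
To make the first item in the list concrete, here is a minimal sketch
of a frequency distribution. It uses Python's standard
collections.Counter rather than NLTK's own classes, whose interface is
not shown in this review; the sample text and the variable names are
illustrative assumptions.

```python
from collections import Counter

# Illustrative sample in a slash-tagged "word/tag" format (an assumption).
text = "The/at dog/nn bit/vbd the/at man/nn ./."

# Count how often each lower-cased word form occurs.
word_counts = Counter(token.split('/')[0].lower() for token in text.split())

# Count how often each part-of-speech tag occurs.
tag_counts = Counter(token.split('/')[1] for token in text.split())

print(word_counts.most_common(2))   # the two most frequent word forms
print(tag_counts['nn'])             # number of tokens tagged as nouns
```

The same counting idea scales from this one-line sample to a whole
corpus; only the source of the tokens changes.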

For documentation, NLTK comes with an automatically generated catalog
of every last class, method, and function -- fairly standard for
open-source software projects these days. There are also a number of
tutorials on using NLTK for various tasks: modelling probabilistic
systems, tokenizing, tagging, parsing, chunk parsing, chart parsing,
probabilistic parsing, and text classification. The tutorials are
still incomplete, with many sections marked "to be written".


EVALUATION

NLTK will be extremely useful to teachers and students in
computational linguistics courses. The website has links to the pages
of several courses that have already integrated NLTK into the lectures
and assignments. The toolkit may also be useful to researchers
without much programming experience, but many of the features that
make it so good for learning get in the way of its usefulness for real
work.

For teachers of computational linguistics, becoming familiar with NLTK
is unquestionably worth the effort. If you use a laptop and a data
projector while teaching, NLTK's graphics abilities can make for some
very effective classroom presentations. It is easy to set up
step-by-step demonstrations of complex processes that textbooks seldom
devote enough paper to diagramming fully, such as the actions of a
chart parser in parsing a sentence. For those with reasonably fast
internet access, the sample data alone is worth the download time.
The case for requiring students to learn NLTK and use it in
assignments is not quite as clear-cut, but still strong.

Compared with some available stand-alone programs for doing the same
jobs, the various pieces of NLTK are usually slower and offer fewer
options. Their real strength lies in their simplicity (which allows
students to understand how they're written) and in how easy it is to
make minor modifications to them or combine them together to handle
multi-step language processing tasks. NLTK ties together a number of
different areas using the same programming language and the same
overall design. A student who learns part of the toolkit in order to
complete an assignment on part-of-speech tagging does not have to
learn completely different systems in order to complete assignments on
parsing or word-sense disambiguation.

Using Python, one of the easiest programming languages to learn, was
perhaps the best choice the designers made in terms of usefulness for
teaching. While not entirely free of rough edges and eccentricities,
Python has fewer than most other languages. Students with no previous
programming experience will be able to spend more of their time
thinking about the logical steps involved in getting the computer to
process language data, and less time mastering and using the arcana
involved in getting the computer to do anything at all. Python's
interactivity is especially useful for learners: instead of writing a
complete program then seeing if the whole thing runs correctly, users
can issue one command at a time and poke around in the intermediate
results to make sure they're what they expected before moving on.

There are some trade-offs. Programs written in Python often run more
slowly than those in other languages. The speed difference is all but
unnoticeable for the kinds of programs and data-sets involved in a
typical course assignment, but it could become inconvenient for larger
tasks. The very ease of Python is also a double-edged sword. Because
Python already makes it fairly simple to do many of the common
language-processing tasks, the additional simplification that NLTK can
offer (its "value added") is less than it would be for most other
programming languages.

NLTK has quite extensively applied the object-oriented software
philosophy to the natural language domain -- perhaps too extensively.
As an example, let's compare how we would approach the same simple
task in ordinary Python and in NLTK's object-oriented system. (Both
versions below are more long-winded than strictly necessary.) Suppose
you wanted to print out every noun that occurs in a tagged text with a
format like: "The/at dog/nn bit/vbd me/ppo ./." Assuming that the
text is a character string stored in a variable named "text", in
ordinary Python, you could simply use the following code:

 tokenlist = text.split()
 for token in tokenlist:
     (base, tag) = token.split('/')
     if tag == 'nn':
         print(base)

This splits the text apart into a list of smaller strings, like
["The/at", "dog/nn", "bit/vbd", "me/ppo", "./."]. Then it loops
through this list, splitting each string on the slash character to
make a base/tag pair. If the tag is 'nn', it prints the base.

In comparison, an NLTK version of the noun printer is somewhat wordier:

 tokenizer = nltk.tokenizer.TaggedTokenizer()
 tokenlist = tokenizer.tokenize(text)
 for taggedtoken in tokenlist:
     taggedtype = taggedtoken.type()
     if taggedtype.tag() == 'nn':
         print(taggedtype.base())

This first creates a TaggedTokenizer object. Calling the
TaggedTokenizer's "tokenize" method breaks the text into a list of
TaggedToken objects. Each TaggedToken object contains a TaggedType
object and a Location object. The Location object stores the position
of the token in the input, the units that the position is measured in
(characters, words, sentences, etc.), and possibly a lot of other
information we aren't interested in for this task. The TaggedType
object contains the base string and the tag, which we get by invoking
its "base" and "tag" methods.
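
The nesting just described can be pictured with a few stand-in
classes. This is only a sketch: the classes below are hypothetical and
are not NLTK's real implementation, the field names are assumptions,
and "base" and "tag" are plain attributes here rather than methods.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the objects described in the review; these
# are NOT NLTK's classes, only an illustration of how they nest.

@dataclass(frozen=True)
class Location:
    start: int          # position of the token in the input
    unit: str = 'w'     # what the position is measured in (here: words)

@dataclass(frozen=True)
class TaggedType:
    base: str           # the word form, e.g. "dog"
    tag: str            # the part-of-speech tag, e.g. "nn"

@dataclass(frozen=True)
class TaggedToken:
    type: TaggedType    # what this token is an occurrence of
    loc: Location       # where this occurrence sits in the text

def tokenize(text):
    """Turn slash-tagged text into a list of TaggedToken objects."""
    tokens = []
    for position, chunk in enumerate(text.split()):
        base, tag = chunk.split('/')
        tokens.append(TaggedToken(TaggedType(base, tag), Location(position)))
    return tokens

tokens = tokenize("The/at dog/nn bit/vbd the/at dog/nn ./.")

# The two occurrences of "dog/nn" share a type but are distinct tokens.
print(tokens[1].type == tokens[4].type)   # True
print(tokens[1] == tokens[4])             # False
```

Each word in the input produces three objects in this sketch, which is
the kind of layering the noun printer has to reach through.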

Despite the extra typing required, there are some advantages to the
NLTK approach. You don't have to worry about the details of how the
input represents tags, and you could recycle almost exactly the same
program for a different tagged text, no matter whether it used "/" or
"|" to separate tags or even encoded them as XML attributes. (This is
assuming that somebody has written a TaggedTokenizer class for the
input format -- which is at least true for most of the sample corpora
included with NLTK.) The rigorous distinction between types and
tokens that NLTK enforces is an important thing for students to learn.
For everyone else, though, it can be just another layer of
complication that gets in the way of getting things done.
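
The recycling argument can be sketched in plain Python. The two
tokenizer functions below are hypothetical stand-ins, not NLTK
classes; the point is only that the noun-finding code stays the same
when the tokenizer for a different input format is swapped in.

```python
# Hypothetical tokenizers for two input formats; neither is an NLTK class.
def slash_tokenizer(text):
    """Parse 'word/tag' items into (base, tag) pairs."""
    return [tuple(item.rsplit('/', 1)) for item in text.split()]

def bar_tokenizer(text):
    """Parse 'word|tag' items into (base, tag) pairs."""
    return [tuple(item.rsplit('|', 1)) for item in text.split()]

def nouns(text, tokenizer):
    """Return every base form tagged 'nn', whatever the input format."""
    return [base for base, tag in tokenizer(text) if tag == 'nn']

# The same function handles both formats once given the right tokenizer.
print(nouns("The/at dog/nn bit/vbd me/ppo ./.", slash_tokenizer))
print(nouns("The|at dog|nn bit|vbd me|ppo .|.", bar_tokenizer))
```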

A more serious disadvantage is that profligately creating objects
within objects within objects takes much more time and uses more of
the computer's memory. Such design decisions, combined with the
disadvantage that Python already has in speed and with the design
team's overall philosophy that it's more important for NLTK code to be
clear than to be efficient, can make the toolkit too slow for programs
involving large amounts of data.
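
A rough way to see the memory side of this cost is to build the same
token list twice, once as plain tuples and once as nested objects, and
compare what each leaves allocated. The sketch below uses hypothetical
stand-in classes and Python's standard tracemalloc module; it does not
measure NLTK itself, and the exact numbers will vary between Python
versions.

```python
import tracemalloc

# Hypothetical stand-ins for nested token objects (not NLTK's classes).
class Location:
    def __init__(self, start, unit):
        self.start, self.unit = start, unit

class TaggedType:
    def __init__(self, base, tag):
        self.base, self.tag = base, tag

class TaggedToken:
    def __init__(self, tagged_type, loc):
        self.type, self.loc = tagged_type, loc

def build_flat(text):
    """One plain (base, tag) tuple per token."""
    return [tuple(item.split('/')) for item in text.split()]

def build_nested(text):
    """One token object, wrapping two further objects, per token."""
    return [TaggedToken(TaggedType(*item.split('/')), Location(i, 'w'))
            for i, item in enumerate(text.split())]

def allocated_bytes(build, text):
    """Bytes still allocated once the token list has been built."""
    tracemalloc.start()
    data = build(text)      # keep a reference alive while measuring
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

text = "The/at dog/nn bit/vbd me/ppo ./. " * 2000   # 10,000 tokens
flat_bytes = allocated_bytes(build_flat, text)
nested_bytes = allocated_bytes(build_nested, text)
print("flat tuples:   ", flat_bytes, "bytes")
print("nested objects:", nested_bytes, "bytes")
```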

The following two experiments might give some idea of the scale of
problem that the toolkit can handle in a practical amount of time.
In the first test, I used NLTK to train a tagger on extracts from a
corpus of Cree texts and then used it to tag other extracts, using only
about ten thousand words at a time. Most steps were nearly
instantaneous and none took more than a few seconds. In the second
test, I trained a tagger on the million tagged words of the Brown
corpus (included with NLTK), which took several minutes. Then I used
that tagger and NLTK's regular-expression tokenizer to try to tag a
year's worth of CNN transcripts. After an hour, it had finished only
three days' worth (about a quarter of a million words), an amount that
stand-alone programs like the Brill tagger can finish in a few
seconds.
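
The figures from the second test can be turned into a throughput
estimate with a little arithmetic. The sketch below uses only the
numbers reported above, plus one assumption that is not from the
experiment itself: that every day of transcripts is about the same
size.

```python
# Figures reported for the second test.
words_tagged = 250_000      # "about a quarter of a million words"
elapsed_seconds = 60 * 60   # "after an hour"
days_covered = 3            # "three days' worth"

# Throughput of the run.
words_per_second = words_tagged / elapsed_seconds

# Assumption (not from the review): each day of transcripts is the same size.
words_per_year = words_tagged / days_covered * 365
hours_for_year = words_per_year / words_tagged   # one hour per 250,000 words

print(round(words_per_second), "words per second")           # 69
print(round(hours_for_year), "hours for a full year of data")  # 122
```

On that assumption, the full year of transcripts would have taken
roughly five days of continuous running.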

Since NLTK was designed primarily for teaching, it may be a little
unfair to ask how useful it might be for linguists doing real
research. On the other hand, students will presumably go on to do
their own research -- how much of what they've learned of NLTK must
then be tossed away like a no-longer-needed crutch? Also,
intentionally or not, NLTK is one of the few tools available for
people who would like a compromise between using existing
task-specific programs such as concordancers (and being locked into
only the options offered and the data-formats required by the
programs) and writing their own programs completely from scratch.
NLTK has the promise of eventually filling this gap -- offering
high-quality components that can be modified, extended, and combined
together with minimal time and programming -- but it's not quite there
yet.

Especially compared to the abysmal documentation that often plagues
open-source software projects, NLTK's documentation is quite good, and
getting better all the time, but it still has a ways to go.
Experienced programmers may be able to use the catalog-like
documentation of every last class and function in the toolkit.
Newcomers will have more difficulty getting started.

The parts of the tutorials that have been written are usually well
written. Unfortunately, instead of giving sequences of commands that
the reader can interactively type into their own computer as they go
along, the tutorials often present only fragments of Python programs
without the context that would allow readers to recreate the results.
Also, they seem to be aimed at linguistically naive computer science
students rather than computationally naive linguistics students. You
can, for example, find a longish and earnest discussion of what
syntactic constituents are and why they're important, but terms like
"stack", "epsilon production", and "polynomial complexity" are dropped
with little or no explanation. Even as a companion to a textbook, the
documentation will probably not allow a typical linguistics student to
use the toolkit effectively without considerable guidance from a
teacher.


SUMMARY

Overall, NLTK is already, as it stands, an excellent resource for
teaching computational linguistics. It is also a promising start for
a toolkit that will be useful for general work in corpus linguistics
-- and may already be useful to many for smaller-scale tasks or for
prototyping larger tasks.

This review actually started out in October as a review of NLTK 1.1,
which was released in August. Just as I was finishing and getting
ready to e-mail it to LinguistList, out came NLTK 1.2 in early
November. As aggravating as this can be to reviewers who write too
slowly, it points to one of the strengths of the open-source project.
Problems tend to get fixed, fast, and the improved version is just as
free as the old one was. And if the folks in Philadelphia are moving
too slow for you, you can always change it yourself -- just make sure
you contribute your changes back so that others can benefit too.

ABOUT THE REVIEWER

Kevin Russell is associate professor of linguistics at the University
of Manitoba. He specializes in phonology and morphology and has built
and used computer corpora of Cree and Dakota texts in his research.
He wishes NLTK had been around the last time he taught computational
linguistics using Python.