Title: Review of Natural Language Toolkit 1.2
Date: Tue, 18 Nov 2003 21:03:42 -0600
From: Kevin Russell <krussll@cc.UManitoba.CA>
Subject: Natural Language Toolkit (NLTK) 1.2
Natural Language Toolkit, ver. 1.2, downloadable from http://nltk.sourceforge.net/.
Kevin Russell, University of Manitoba
The Natural Language Toolkit (NLTK) is a set of programs and programming resources for dealing with natural language, intended mainly for teachers and students in computational linguistics courses. It includes example programs for many of the common tasks in
computational and corpus linguistics (such as tokenizing, tagging, parsing, and word-sense disambiguation), programming libraries that students can use in writing their own programs for these tasks, and graphical displays useful for classroom presentations.
NLTK is free and licensed under the GNU General Public License, so you can read the source code (written in Python) and modify it as much as you want. The current version of the toolkit (1.2) can be downloaded from the project's website at http://nltk.sourceforge.net/.
The following are some of the things included in NLTK:
* data structure definitions that allow the computer to represent linguistically interesting entities: tokens, trees, grammar rules, frequency distributions (e.g., for keeping count of how often each word occurs in a corpus).
* pre-programmed classes and functions for many language-processing tasks, including: tokenizers, basic n-gram taggers, recursive descent parsers, chart parsers, and various probabilistic parsers.
* a number of graphical displays. Some can draw static images such as syntactic trees and basic histograms for frequency distributions. Others can dynamically show how, say, a finite-state automaton or a chart parser changes its state through time.
* sample data (mostly English) to practise with, including a portion of the Penn Treebank, the complete million-word Brown corpus with part-of-speech tags and a subset of it tagged with WordNet senses, verb subcategorizations from Levin (1993), and word-lists for a number of European languages.
* a number of extras contributed by users, including an interface to the Festival speech synthesis system, re-implementations of the Brill tagger and the Porter stemmer, and a parser for Tree Adjoining Grammars.
* demonstration programs illustrating how to use most of the above.
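To make the idea of a frequency distribution from the list above concrete, here is a minimal sketch in plain, modern Python. It uses the standard library's collections.Counter rather than NLTK's own frequency-distribution class, and the function name word_frequencies and the sample sentence are made up for illustration.

```python
from collections import Counter

def word_frequencies(text):
    """Count how often each whitespace-separated word occurs in text."""
    # Lowercase first so that 'The' and 'the' count as the same word.
    return Counter(text.lower().split())

freq = word_frequencies("the dog bit the man and the man bit the dog")
print(freq["the"])          # how many times 'the' occurs
print(freq.most_common(2))  # the two most frequent words with their counts
```

A word that never occurs simply has a count of zero, which is convenient when comparing corpora.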
For documentation, NLTK comes with an automatically generated catalog of every last class, method, and function -- fairly standard for open-source software projects these days. There are also a number of tutorials on using NLTK for various tasks: modelling probabilistic systems, tokenizing, tagging, parsing, chunk parsing, chart parsing,
probabilistic parsing, and text classification. The tutorials are still incomplete, with many sections marked ''to be written''.
NLTK will be extremely useful to teachers and students in computational linguistics courses. The website has links to the pages of several courses that have already integrated NLTK into the lectures and assignments. The toolkit may also be useful to researchers without much programming experience, but many of the features that make it so good for learning get in the way of its usefulness for real work.
For teachers of computational linguistics, becoming familiar with NLTK is unquestionably worth the effort. If you use a laptop and a data projector while teaching, NLTK's graphics abilities can make for some very effective classroom presentations. It is easy to set up step-by-step demonstrations of complex processes that textbooks seldom devote enough paper to diagramming fully, such as the actions of a chart parser in parsing a sentence. For those with reasonably fast internet access, the sample data alone is worth the download time.
The case for requiring students to learn NLTK and use it in assignments is not quite as clear-cut, but still strong.
Compared with some available stand-alone programs for doing the same jobs, the various pieces of NLTK are usually slower and offer fewer options. Their real strength lies in their simplicity (which allows students to understand how they're written) and in how easy it is to make minor modifications to them or combine them together to handle multi-step language processing tasks. NLTK ties together a number of different areas using the same programming language and the same overall design. A student who learns part of the toolkit in order to complete an assignment on part-of-speech tagging does not have to
learn completely different systems in order to complete assignments on parsing or word-sense disambiguation.
There are some trade-offs. Programs written in Python often run more slowly than those in other languages. The speed difference is all but unnoticeable for the kinds of programs and data-sets involved in a typical course assignment, but it could become inconvenient for larger tasks. The very ease of Python is also a double-edged sword. Because
Python already makes it fairly simple to do many of the common language-processing tasks, the additional simplification that NLTK can offer (its ''value added'') is less than it would be for most other programming languages.
NLTK has quite extensively applied the object-oriented software philosophy to the natural language domain -- perhaps too extensively.
As an example, let's compare how we would approach the same simple task in ordinary Python and in NLTK's object-oriented system. (Both versions below are more long-winded than strictly necessary.) Suppose you wanted to print out every noun that occurs in a tagged text with a format like: ''The/at dog/nn bit/vbd me/ppo ./.'' Assuming that the text is a character string stored in a variable named ''text'', in ordinary Python, you could simply use the following code:
tokenlist = text.split()
for token in tokenlist:
    (base, tag) = token.split('/')
    if tag == 'nn':
        print(base)
This splits the text apart into a list of smaller strings, like [''The/at'', ''dog/nn'', ''bit/vbd'', ''me/ppo'', ''./.'']. Then it loops through this list, splitting each string on the slash character to make a base/tag pair. If the tag is 'nn', it prints the base.
In comparison, an NLTK version of the noun printer is somewhat wordier:
tokenizer = nltk.tokenizer.TaggedTokenizer()
tokenlist = tokenizer.tokenize(text)
for taggedtoken in tokenlist:
    taggedtype = taggedtoken.type()
    if taggedtype.tag() == 'nn':
        print(taggedtype.base())
This first creates a TaggedTokenizer object. Calling the TaggedTokenizer's ''tokenize'' method breaks the text into a list of TaggedToken objects. Each TaggedToken object contains a TaggedType object and a Location object. The Location object stores the position of the token in the input, the units that the position is measured in
(characters, words, sentences, etc.), and possibly a lot of other information we aren't interested in for this task. The TaggedType object contains the base string and the tag, which we get by invoking its ''base'' and ''tag'' methods.
Despite the extra typing required, there are some advantages to the NLTK approach. You don't have to worry about the details of how the input represents tags, and you could recycle almost exactly the same program for a different tagged text, no matter whether it used ''/'' or ''|'' to separate tags or even encoded them as XML attributes. (This is
assuming that somebody has written a TaggedTokenizer class for the input format -- which is at least true for most of the sample corpora included with NLTK.) The rigorous distinction between types and tokens that NLTK enforces is an important thing for students to learn. For everyone else, though, it can be just another layer of complication that gets in the way of getting things done.
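The design idea described here, a tokenizer that hides the input format behind token, type, and location objects, can be sketched in plain, modern Python. The classes below are hypothetical stand-ins written for this illustration; they are not NLTK's real TaggedTokenizer, TaggedToken, TaggedType, or Location classes, and they expose the base and tag as attributes rather than as methods.

```python
from collections import namedtuple

# Hypothetical stand-ins for the kinds of objects described in the text;
# these are NOT NLTK's actual classes.
TaggedType = namedtuple("TaggedType", ["base", "tag"])
Location = namedtuple("Location", ["start", "unit"])
TaggedToken = namedtuple("TaggedToken", ["type", "loc"])

class SimpleTaggedTokenizer:
    """Split text like 'dog/nn' into tokens, hiding the separator."""

    def __init__(self, separator="/"):
        self.separator = separator

    def tokenize(self, text):
        tokens = []
        for index, chunk in enumerate(text.split()):
            # rpartition splits on the LAST separator, so './.' still
            # yields the base '.' and the tag '.'.
            base, _, tag = chunk.rpartition(self.separator)
            # The location records the word position, measured in words ('w').
            tokens.append(TaggedToken(TaggedType(base, tag), Location(index, "w")))
        return tokens

def nouns(tokens):
    """Return the base form of every token tagged 'nn'."""
    return [tok.type.base for tok in tokens if tok.type.tag == "nn"]

slash_text = "The/at dog/nn bit/vbd me/ppo ./."
pipe_text = "The|at dog|nn bit|vbd me|ppo .|."

# The noun-finding code is identical; only the tokenizer changes.
print(nouns(SimpleTaggedTokenizer("/").tokenize(slash_text)))  # ['dog']
print(nouns(SimpleTaggedTokenizer("|").tokenize(pipe_text)))   # ['dog']
```

Two occurrences of the same word with the same tag share an equal type but sit at different locations, so they are different tokens. That is the type/token distinction in miniature.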
A more serious disadvantage is that profligately creating objects within objects within objects takes much more time and uses more of the computer's memory. Such design decisions, combined with the disadvantage that Python already has in speed and with the design team's overall philosophy that it's more important for NLTK code to be clear than to be efficient, can make the toolkit too slow for programs involving large amounts of data.
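The cost of wrapping every word in nested objects can be explored with a rough timing sketch like the one below. It is written in plain, modern Python with made-up classes (Loc, Typ, Tok) and says nothing about NLTK's actual implementation; the timings it prints depend entirely on the machine and the Python version.

```python
import timeit

# Hypothetical nested classes, loosely imitating token/type/location objects.
class Loc:
    def __init__(self, index):
        self.index = index

class Typ:
    def __init__(self, base, tag):
        self.base, self.tag = base, tag

class Tok:
    def __init__(self, typ, loc):
        self.typ, self.loc = typ, loc

# 2000 copies of a five-token sentence: ten thousand tokens in all.
TEXT = "The/at dog/nn bit/vbd me/ppo ./. " * 2000

def nouns_with_strings(text):
    """Find nouns by working directly on the split strings."""
    found = []
    for chunk in text.split():
        base, _, tag = chunk.rpartition("/")
        if tag == "nn":
            found.append(base)
    return found

def nouns_with_objects(text):
    """Find nouns after first building three objects for every token."""
    tokens = []
    for i, chunk in enumerate(text.split()):
        base, _, tag = chunk.rpartition("/")
        tokens.append(Tok(Typ(base, tag), Loc(i)))
    return [t.typ.base for t in tokens if t.typ.tag == "nn"]

string_time = timeit.timeit(lambda: nouns_with_strings(TEXT), number=20)
object_time = timeit.timeit(lambda: nouns_with_objects(TEXT), number=20)
print("strings: %.3fs  objects: %.3fs" % (string_time, object_time))
```

Both functions return the same list of nouns; the only difference is how much bookkeeping is done along the way.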
The following two experiments might give some idea of the scale of problem that the toolkit can handle in a practical amount of time. In the first test, I used NLTK to train a tagger on extracts from a corpus of Cree texts and then used it to tag other extracts, using only about ten thousand words at a time. Most steps were nearly instantaneous and none took more than a few seconds. In the second test, I trained a tagger on the million tagged words of the Brown corpus (included with NLTK), which took several minutes. Then I used that tagger and NLTK's regular-expression tokenizer to try to tag a year's worth of CNN transcripts. After an hour, it had finished only three days' worth (about a quarter of a million words), an amount that stand-alone programs like the Brill tagger can finish in a few seconds.
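For readers unfamiliar with what "training a tagger and then using it" involves, the following is a minimal sketch of a unigram tagger in plain, modern Python. It is not NLTK's tagger and not the code used in the experiments above; the function names, the default tag, and the tiny training list are invented for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_words):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    # Keep only the single most frequent tag for each word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(model, words, default_tag="nn"):
    """Tag each word, falling back to default_tag for unseen words."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Toy training data: 'bit' is a verb twice and a noun once.
training = [("the", "at"), ("dog", "nn"), ("bit", "vbd"), ("me", "ppo"),
            ("the", "at"), ("bit", "nn"), ("bit", "vbd")]

model = train_unigram_tagger(training)
print(tag_words(model, ["The", "cat", "bit", "me"]))
```

Here "bit" receives its majority tag 'vbd', and the unseen word "cat" receives the default tag. Training is a single pass over the tagged corpus and tagging is one dictionary lookup per word.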
Since NLTK was designed primarily for teaching, it may be a little unfair to ask how useful it might be for linguists doing real research. On the other hand, students will presumably go on to do their own research -- how much of what they've learned of NLTK must
then be tossed away like a no-longer-needed crutch? Also, intentionally or not, NLTK is one of the few tools available for people who would like a compromise between using existing task-specific programs such as concordancers (and being locked into only the options offered and the data-formats required by the programs) and writing their own programs completely from scratch.
NLTK has the promise of eventually filling this gap -- offering high-quality components that can be modified, extended, and combined together with minimal time and programming -- but it's not quite there yet.
Especially compared to the abysmal documentation that often plagues open-source software projects, NLTK's documentation is quite good, and getting better all the time, but it still has a ways to go. Experienced programmers may be able to use the catalog-like
documentation of every last class and function in the toolkit. Newcomers will have more difficulty getting started.
The parts of the tutorials that have been written are usually well written. Unfortunately, instead of giving sequences of commands that the reader can interactively type into their own computer as they go along, the tutorials often present only fragments of Python programs without the context that would allow readers to recreate the results.
Also, they seem to be aimed at linguistically naive computer science students rather than computationally naive linguistics students. You can, for example, find a longish and earnest discussion of what syntactic constituents are and why they're important, but terms like ''stack'', ''epsilon production'', and ''polynomial complexity'' are dropped with little or no explanation. Even as a companion to a textbook, the documentation will probably not allow a typical linguistics student to use the toolkit effectively without considerable guidance from a teacher.
Overall, NLTK is already, as it stands, an excellent resource for teaching computational linguistics. It is also a promising start for a toolkit that will be useful for general work in corpus linguistics -- and may already be useful to many for smaller-scale tasks or for prototyping larger tasks.
This review actually started out in October as a review of NLTK 1.1, which was released in August. Just as I was finishing and getting ready to e-mail it to LinguistList, out came NLTK 1.2 in early November. As aggravating as this can be to reviewers who write too
slowly, it points to one of the strengths of the open-source project.
Problems tend to get fixed, fast, and the improved version is just as free as the old one was. And if the folks in Philadelphia are moving too slow for you, you can always change it yourself -- just make sure you contribute your changes back so that others can benefit too.
ABOUT THE REVIEWER
Kevin Russell is associate professor of linguistics at the University of Manitoba. He specializes in phonology and morphology and has built and used computer corpora of Cree and Dakota texts in his research. He wishes NLTK had been around the last time he taught computational linguistics using Python.