Review of Exploring Natural Language

Reviewer: Lea Cyrus
Book Title: Exploring Natural Language
Book Author: Bas Aarts, Gerald Nelson, Sean Wallis
Publisher: John Benjamins
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): English
Language Family(ies): New English
Issue Number: 15.263

Date: Mon, 19 Jan 2004 19:30:32 +0100 (CET)
From: Lea Cyrus
Subject: Exploring Natural Language: Working with the British Component of ICE

Nelson, Gerald, Sean Wallis and Bas Aarts, eds. (2002)
Exploring Natural Language: Working with the British
Component of the International Corpus of English,
John Benjamins Publishing Company, Varieties of English
Around the World G29.

Lea Cyrus, Arbeitsbereich Linguistik, Westfaelische
Wilhelms-Universitaet Muenster.


The International Corpus of English (ICE) is a project which
aims at compiling up to twenty comparable syntactically
parsed one million-word corpora of different national and
regional varieties of English, both written and spoken. This
book is the first volume in a new sub-series of "Varieties
of English Around the World", which has been launched to
provide "handbooks for the various corpora in the
International Corpus of English" (p. xi). However, rather
than merely being a handbook for the British Component of
ICE (ICE-GB), it is first and foremost a manual for the
Windows-based International Corpus of English Corpus Utility
Program (ICECUP), a tailor-made retrieval program for all
ICE subcorpora.

The book consists of four parts, which in turn are
subdivided into several chapters. There are also six
appendices listing details of the corpus annotation and
design. The first part ("Introducing the corpus", pp. 1-68)
begins with a general overview of the ICE-GB corpus, which
is the first of the ICE subcorpora to be completed. The
remainder of this part describes the annotation scheme
used. Each node in the tree is labelled with up to three
types of information: word class/syntactic category,
syntactic function and features (such as "transitivity"),
the latter being optional. All word class-tags, syntactic
categories, functions and features are listed and
illustrated by short explanations and examples.

The second and by far the longest part ("Exploring the
corpus", pp. 69-231) is a very detailed manual to the query
tool ICECUP 3.0. The various facilities are introduced
little by little, starting from general browsing by text
category and leading up to complex queries by means of fuzzy
tree fragments (FTFs). The descriptions are constantly
illustrated by an impressive number of screenshots, and all
features are introduced by means of concrete examples,
accompanied by detailed instructions which allow the readers
to reproduce the queries on their own computers. This part
ends with a description of extensions that have been made to
ICECUP 3.0, resulting in the "evolutionary advance"
(p. 203) ICECUP 3.1.

The third part ("Performing research with the corpus",
pp. 232-283) is used to further exemplify the research
possibilities offered by ICECUP. This is achieved by six
case studies which exploit the annotation available in
ICE-GB. The first two studies are lexical, exploring the
adverbial use of "pretty" and the verbal vs. nominal use of
"book" respectively. The following study is concerned with
the question whether there exists a correlation between the
transitivity properties of an embedded clause and its
syntactic function. The fourth study investigates the
differences in the use of "what" and "which" functioning as
determiners in noun phrases. The fifth study compares the
distribution of passives in the six written registers in
ICE-GB, and in the final study, the authors confirm the
results of an earlier investigation which had shown that
if-clauses occur more frequently in clause-initial than in
clause-final position. The third part is rounded off with a
general chapter on experimental design and basic statistical
methods, such as working with contingency tables and
calculating the statistical significance of results with the
chi-square test. This chapter, too, includes three exemplary
studies based on ICE-GB.

The fourth and last part ("The future of the corpus",
pp. 284-300) gives a short overview of possible extensions
to both ICE-GB and ICECUP, provided that funding is
available. These include further levels of annotation,
particularly in the spoken component of ICE-GB, and the
partial automation of the experimentation process.


As noted above, large parts of this book are
devoted to presenting the annotation scheme and retrieval
software for the ICE-GB corpus - to a large extent, this
book is really a manual (or two manuals, to be precise). It
is not the objective of this review to evaluate either the
annotation scheme or the application, but to point out
whether or not these manuals are good introductions to their
respective fields.

I will follow the order in the book and begin with the
chapter on the ICE-GB grammar, which I found wanting in some
respects. Firstly, there is the overall organisation of the
chapter. Why are word-class tags treated in a separate
section while phrasal and functional categories are mixed?
My guess is that this was done because it mirrors the
practical annotation process - first tagging, then parsing
(p. 22), and it might have seemed natural to keep up this
ordering in the book. However, from a theoretical point of
view, it would have been neater to group lexical and phrasal
categories together, if such grouping is at all necessary,
since they both refer to the form of the units they
describe. For a discussion of the distinction between
syntactic categories (which include lexical categories) and
grammatical functions see Quirk et al. (1985, pp. 48f. and
pp. 64-67) and Huddleston and Pullum (2002, pp. 20-26).

Also, it is questionable whether alphabetical ordering is
the best approach when describing grammatical choices made
in an annotation scheme. In such a list, elements which
belong closely together are spread over the entire list. If,
for instance, a reader wants to find out how extraposition
is annotated in ICE-GB, he or she will have to read the
entire chapter in order to collect all pieces of
information: for instance, anticipatory 'it' is tagged as a
pronoun with the special feature 'antit' (p. 36), its
function is either 'provisional direct object' (PROD) or
'provisional subject' (PRSU) (p. 53), the extraposed
constituents are function-tagged as 'notional direct object'
(NOOD) or 'notional subject' (NOSU) respectively (p. 50),
and the entire clause is marked with either the specific
feature 'extraposed direct object' ('extod') or 'extraposed
subject' ('extsu') (p. 58). To be fair, it must be added
that sometimes there are references to further tags used in
the same construction, but since these are used
sporadically, they do not really solve the problem.
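
To illustrate what a construction-based index might look
like, here is a minimal sketch that groups the extraposition
tags mentioned above under a single heading. The tags and
page numbers are the ones cited in this review; the names
TagRef, CONSTRUCTIONS and pages_for are hypothetical and are
not part of the book or of ICECUP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRef:
    """One tag, what kind of label it is, and where the book describes it."""
    tag: str
    kind: str         # "feature" or "function"
    attaches_to: str  # the unit the label is attached to
    page: int         # page reference as cited in this review


# Hypothetical construction-based index; content taken from the review text.
CONSTRUCTIONS = {
    "extraposition": [
        TagRef("antit", "feature", "anticipatory 'it' (pronoun)", 36),
        TagRef("PROD", "function", "anticipatory 'it'", 53),
        TagRef("PRSU", "function", "anticipatory 'it'", 53),
        TagRef("NOOD", "function", "extraposed constituent", 50),
        TagRef("NOSU", "function", "extraposed constituent", 50),
        TagRef("extod", "feature", "entire clause", 58),
        TagRef("extsu", "feature", "entire clause", 58),
    ],
}


def pages_for(construction):
    """Return the sorted, de-duplicated pages on which a construction's tags are described."""
    return sorted({ref.page for ref in CONSTRUCTIONS[construction]})


if __name__ == "__main__":
    print(pages_for("extraposition"))
```

With such a grouping, a reader interested in extraposition
would be sent to four pages directly instead of having to
scan the whole alphabetical list.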

All in all, this part is more or less an annotated atomic
list of the tags, but there is little actual grammar in
it. The discussion of grammatical constructions is
limited to six and a half pages at the end of the chapter
(pp. 62-68), where a few seemingly arbitrarily chosen
"special topics" (i.e. inversion, coordination, direct
speech) are discussed in greater detail. For a better
understanding of the grammar behind the scheme, a more
construction-based approach, following the good examples of
Bies et al. (1995) for the Penn Treebank and Sampson (1995)
for the SUSANNE corpus, would have been helpful. For those
who really only want to look up a particular tag, there is
an alphabetically ordered reference guide (Appendix 5), and
if the tags were also included in the index (which they
aren't), the reader could easily be directed to the
appropriate location in the detailed description.

A further point I would like to make is that most tag
descriptions are very general and not detailed enough to
reveal the way the tags are used in the corpus. I shall
illustrate this with two examples: the word class tag "ADJ"
(adjective) can be subclassified by various features. Two of
these are 'edp' for -ed participles and 'ingp' for -ing
participles (p. 25). The problem here is that the
distinction between adjectival participles and their verbal
counterparts is not in all cases easy to make. Huddleston
and Pullum (2002, p. 1438) give the sentence "Kim was
worried by the prospect of redundancy" as an ambiguous
example. I would have liked to learn more about the criteria
which have been applied in ICE-GB to identify adjectival
participles or the conventions according to which borderline
cases like the one cited above are treated.

Adverbials in ICE-GB "can appear at practically any level of
the tree and within any category of the tree" - a
description could hardly be more general. Considering the
space allocated to adverbials in grammars (Quirk et
al. (1985): pp. 475-653; Huddleston and Pullum (2002):
pp. 663-784) this is not very informative. What are the
criteria for attaching adverbials at a specific position in
the tree? Do different attachment positions correspond to
different classes of adverbials? What happens if adverbials
cause discontinuities as in "The door will then be opened",
where the verb group (or verb phrase in ICE-GB terminology)
is split? It would have been interesting to learn the
answers to questions like these.

Having said that, it must be stressed that this criticism of
too little detail does not extend to the ICECUP manual,
which is truly a very thorough, exhaustive and easy to
follow tutorial to all the facilities this software has to
offer. While working through the manual, I sometimes found
myself thinking something along the lines of "All very nice,
but it would be great if this or that were explained, too",
only to find my questions answered a few pages on. As
mentioned above, there is an abundance of screenshots,
examples and step by step instructions which enable the
reader to reproduce even the more complicated queries.

However, sometimes the authors mean too well. Repeated hints
like "You can use cursor keys [...] and the scroll bar to
move through the elements in the hierarchy" (p. 72), "you
can briskly click twice with the left mouse button, with the
mouse cursor over the text label in the variable hierarchy"
(p. 90) or "[y]ou can make the tree window active by
clicking down with the left mouse button inside it" (p. 108)
do make one wonder what type of audience they have in
mind. Since corpus linguists can generally be expected not
to be complete computer novices, these basics could have
been omitted.

This minor criticism apart, the manual really has all the
properties one could possibly wish for in a tutorial. I have
only one more practical remark concerning this section:
since this is a tutorial rather than a general introduction
to ICECUP, it follows that reading it without actually
trying things out doesn't make much sense and certainly is
not very satisfying. For those who do not have the full
ICE-GB at their disposal, there exists the possibility to
download ICECUP as well as a free sample (ten texts, over
20,000 tokens) of ICE-GB from
A little note to inform the reader of this possibility would
have been helpful.

Part three, containing the two chapters on practical
treebank work, is a good supplement to the manuals. Again,
it is a very positive feature that the authors draw on real
and easily reproducible examples to demonstrate some of the
many ways in which a syntactically annotated corpus like
ICE-GB can be exploited. The case study investigating
wh-determiners in noun phrases (pp. 244-249) is particularly
clever from a didactical point of view. Here, Nelson et
al. do not simply present the best solution to the
problem. Instead, they begin by formulating a query which
turns out to be too narrow. They then modify their query,
but a look at the results shows that this time it is too
general. It is only at the third attempt that they finally
get the results they want. This trial and error procedure
demonstrates that a typical query is not a straightforward
matter but more often than not will be a cyclic procedure
during which the researcher "may need to experiment first in
order to define these queries appropriately" (p. 87). Apart
from this, the authors also show that corpus query results
can only be the starting point for any linguistic
investigation and will not replace the linguist's
interpretation of the findings.

Providing a chapter on experimental design and statistical
methods is also a good and useful idea, especially if, as in
this case, it is geared to the needs of corpus linguists and,
again, is illustrated by a wealth of real corpus
examples. However, for those without prior knowledge, this
chapter will not replace a more comprehensive introduction
like Oakes (1998), since some definitions of technical terms
are not easy to understand (e.g. how the expected
distribution is calculated, p. 263) and need to be deduced
from the examples. Also, the term "degrees of freedom"
(p. 264) is left unexplained - the mere formula "df = r-1 =
1" will not mean much to a beginner, especially as no
mention is made of what "r" stands for.
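
For readers who would like to see the two points spelled
out, the following is a minimal sketch of a chi-square test
on a contingency table. The counts are invented for
illustration and are not taken from ICE-GB or from the book,
and the function names are hypothetical. The expected count
of each cell is computed as row total times column total
divided by the grand total, and the degrees of freedom as
(rows - 1) * (columns - 1), which is the standard formula for
contingency tables; whether the book's "r" is meant to denote
the number of rows or categories is my assumption.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return the chi-square statistic and the degrees of freedom for a contingency table."""
    expected = expected_counts(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Invented 2 x 2 table: rows are two text categories, columns are two variants.
observed = [[30, 20],
            [20, 30]]

statistic, df = chi_square(observed)

# Standard critical value of chi-square for df = 1 at the 0.05 level.
CRITICAL_0_05_DF1 = 3.841

if __name__ == "__main__":
    print("expected:", expected_counts(observed))
    print("chi-square:", statistic, "df:", df)
    print("significant at 0.05:", statistic > CRITICAL_0_05_DF1)
```

In this invented table every expected count is 25, so each
cell contributes (5 * 5) / 25 = 1 to the statistic, giving a
chi-square of 4 with one degree of freedom, which just
exceeds the critical value of 3.841.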

I will close this evaluation with a few remarks on the
overall presentation of the book, which unfortunately gives
the impression of being somewhat hastily edited. For
example, I detected two references in the text which do not
appear in the list of works cited: Wallis, Aarts and Nelson
1999, p. 86 and Mair 1990, p. 242. Furthermore, the
alphabetical ordering of entries in the reference section is
faulty - "Declerck" is positioned after "Depraetere",
"Shastri" after "Spinillo". Also, on page 264, the authors
refer the reader to "Appendix 8" for a table of critical
values for chi-square - but there exists no Appendix 8 (the
appendices only go up to 6).

To sum up, this book can be highly recommended for those who
actually own a copy of ICE-GB and want to set about
analysing it with ICECUP. For those with a merely
theoretical interest in the corpus, the software, or corpus
linguistics in general, it will be of only limited use.


Lea Cyrus is a research assistant and PhD student at the
English Department at Westfaelische Wilhelms-Universitaet in
Muenster, Germany, where she teaches 1st and 2nd year
undergraduate students. Her main research interest lies in
treebank design. She is currently investigating the
possibilities of bi- or multilingual treebanking.