LINGUIST List 15.263

Sat Jan 24 2004

Review: Text/Corpus Ling: Nelson, Wallis & Aarts (2002)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley Collberg at collberglinguistlist.org.

Directory

  1. Lea Cyrus, Exploring Natural Language: Working with the British Component of ICE

Message 1: Exploring Natural Language: Working with the British Component of ICE

Date: Sat, 24 Jan 2004 10:16:06 -0500 (EST)
From: Lea Cyrus <leamarley.uni-muenster.de>
Subject: Exploring Natural Language: Working with the British Component of ICE

Nelson, Gerald, Sean Wallis and Bas Aarts, eds. (2002) Exploring
Natural Language: Working with the British Component of the
International Corpus of English, John Benjamins Publishing Company,
Varieties of English Around the World G29.

Announced at http://linguistlist.org/issues/13/13-2711.html


Lea Cyrus, Arbeitsbereich Linguistik, Westfaelische
Wilhelms-Universitaet Muenster.

PURPOSE AND CONTENTS

The International Corpus of English (ICE) is a project which aims to
compile up to twenty comparable, syntactically parsed corpora of one
million words each, covering different national and regional varieties
of English, both written and spoken. This book is the first volume in a
new sub-series of ''Varieties of English Around the World'', which has
been launched to provide ''handbooks for the various corpora in the
International Corpus of English'' (p. xi). However, rather than merely
being a handbook for the British Component of ICE (ICE-GB), it is
first and foremost a manual for the Windows-based International Corpus
of English Corpus Utility Program (ICECUP), a tailor-made retrieval
program for all ICE subcorpora.

The book consists of four parts, which in turn are subdivided into
several chapters. There are also six appendices listing details of the
corpus annotation and design. The first part (''Introducing the
corpus'', pp. 1-68) begins with a general overview of the ICE-GB
corpus, which is the first of the ICE subcorpora to be completed. The
remainder of this part describes the annotation scheme used. Each node
in the tree is labelled with up to three types of information: word
class/syntactic category, syntactic function and features (such as
''transitivity''), the latter being optional. All word class-tags,
syntactic categories, functions and features are listed and
illustrated by short explanations and examples.

The second and by far the longest part (''Exploring the corpus'',
pp. 69-231) is a very detailed manual for the query tool ICECUP
3.0. The various facilities are introduced little by little, starting
from general browsing by text category and leading up to complex
queries by means of fuzzy tree fragments (FTFs). The descriptions are
constantly illustrated by an impressive number of screenshots, and all
features are introduced by means of concrete examples, accompanied by
detailed instructions which allow the readers to reproduce the queries
on their own computers. This part ends with a description of
extensions that have been made to ICECUP 3.0, resulting in the
''evolutionary advance'' (p. 203) ICECUP 3.1.

The third part (''Performing research with the corpus'', pp. 232-283)
is used to further exemplify the research possibilities offered by
ICECUP. This is achieved by six case studies which exploit the
annotation available in ICE-GB. The first two studies are lexical,
exploring the adverbial use of ''pretty'' and the verbal vs. nominal
use of ''book'' respectively. The following study is concerned with
the question of whether there is a correlation between the
transitivity properties of an embedded clause and its syntactic
function. The fourth study investigates the differences in the use of
''what'' and ''which'' functioning as determiners in noun phrases. The
fifth study compares the distribution of passives in the six written
registers in ICE-GB, and in the final study, the authors confirm the
results of an earlier investigation which had shown that if-clauses
occur more frequently in clause-initial than in clause-final
position. The third part is rounded off with a general chapter on
experimental design and basic statistical methods, such as working with
contingency tables and calculating the statistical significance of
results with the chi-square test. This chapter, too, includes three
exemplary studies based on ICE-GB.

The fourth and last part (''The future of the corpus'', pp. 284-300)
gives a short overview of possible extensions to both ICE-GB and
ICECUP, provided funding is available. These include further levels
of annotation, particularly in the spoken component of ICE-GB, and the
partial automation of the experimentation process.

CRITICAL EVALUATION

As noted above, large parts of this book are devoted to
presenting the annotation scheme and retrieval software for the ICE-GB
corpus - to a large extent, this book is really a manual (or two
manuals, to be precise). It is not the objective of this review to
evaluate either the annotation scheme or the application, but to point
out whether or not these manuals are good introductions to their
respective fields.

I will follow the order in the book and begin with the chapter on the
ICE-GB grammar, which I found wanting in some respects. Firstly, there
is the overall organisation of the chapter. Why are word-class tags
treated in a separate section while phrasal and functional categories
are mixed? My guess is that this was done because it mirrors the
practical annotation process - first tagging, then parsing (p. 22),
and it might have seemed natural to keep up this ordering in the
book. However, from a theoretical point of view, it would have been
neater to group lexical and phrasal categories together, if such
grouping is at all necessary, since they both refer to the form of the
units they describe. For a discussion of the distinction between
syntactic categories (which include lexical categories) and
grammatical functions see Quirk et al. (1985, pp. 48f. and pp. 64-67)
and Huddleston and Pullum (2002, pp. 20-26).

Also, it is questionable whether alphabetical ordering is the best
approach when describing grammatical choices made in an annotation
scheme. In such a list, elements which belong closely together are
spread over the entire list. If, for instance, a reader wants to find
out how extraposition is annotated in ICE-GB, he or she will have to
read the entire chapter in order to collect all pieces of information:
for instance, anticipatory 'it' is tagged as a pronoun with the
special feature 'antit' (p. 36), its function is either 'provisional
direct object' (PROD) or 'provisional subject' (PRSU) (p. 53), the
extraposed constituents are function-tagged as 'notional direct
object' (NOOD) or 'notional subject' (NOSU) respectively (p. 50), and
the entire clause is marked with either the specific feature
'extraposed direct object' ('extod') or 'extraposed subject' ('extsu')
(p. 58). To be fair, it must be added that sometimes there are
references to further tags used in the same construction, but since
these are used sporadically, they do not really solve the problem.

All in all, this part is more or less an annotated atomic list of the
tags, but there is little actual grammar in it. The discussion of
grammatical constructions is limited to six and a half pages at the
end of the chapter (pp. 62-68), where a few seemingly arbitrarily
chosen ''special topics'' (i.e. inversion, coordination, direct
speech) are discussed in greater detail. For a better understanding of
the grammar behind the scheme, a more construction-based approach,
following the good examples of Bies et al. (1995) for the Penn
Treebank and Sampson (1995) for the SUSANNE corpus, would have been
helpful. For those who really only want to look up a particular tag,
there is an alphabetically ordered reference guide (Appendix 5), and
if the tags were also included in the index (which they aren't), the
reader could easily be directed to the appropriate location in the
detailed description.

A further point I would like to make is that most tag descriptions are
very general and not detailed enough to reveal the way the tags are
used in the corpus. I shall illustrate this with two examples: the
word class tag ''ADJ'' (adjective) can be subclassified by various
features. Two of these are 'edp' for -ed participles and 'ingp' for
-ing participles (p. 25). The problem here is that the distinction
between adjectival participles and their verbal counterparts is not in
all cases easy to make. Huddleston and Pullum (2002, p. 1438) give the
sentence ''Kim was worried by the prospect of redundancy'' as an
ambiguous example. I would have liked to learn more about the criteria
which have been applied in ICE-GB to identify adjectival participles
or the conventions according to which borderline cases like the one
cited above are treated.

Adverbials in ICE-GB ''can appear at practically any level of the tree
and within any category of the tree'' - a description could hardly be
more general. Considering the space allocated to adverbials in
grammars (Quirk et al. (1985): pp. 475-653; Huddleston and Pullum
(2002): pp. 663-784) this is not very informative. What are the
criteria for attaching adverbials at a specific position in the tree?
Do different attachment positions correspond to different classes of
adverbials? What happens if adverbials cause discontinuities as in
''The door will then be opened'', where the verb group (or verb phrase
in ICE-GB terminology) is split? Answers to questions like these would
have been welcome.

Having said that, it must be stressed that this criticism of too
little detail does not extend to the ICECUP manual, which is truly a
very thorough, exhaustive and easy-to-follow tutorial on all the
facilities this software has to offer. While working through the
manual, I sometimes found myself thinking something along the lines of
''All very nice, but it would be great if this or that were explained,
too'', only to find my questions answered a few pages on. As mentioned
above, there is an abundance of screenshots, examples and step-by-step
instructions which enable the reader to reproduce even the more
complicated queries.

However, the authors are sometimes too helpful. Repeated hints like
''You can use cursor keys [...] and the scroll bar to move through the
elements in the hierarchy'' (p. 72), ''you can briskly click twice
with the left mouse button, with the mouse cursor over the text label
in the variable hierarchy'' (p. 90) or ''[y]ou can make the tree
window active by clicking down with the left mouse button inside it''
(p. 108) do make one wonder what type of audience they have in
mind. Since corpus linguists can generally be expected not to be
complete computer novices, these basics could have been omitted.

This minor criticism apart, the manual really has all the properties
one could possibly wish for in a tutorial. I have only one more
practical remark concerning this section: since this is a tutorial
rather than a general introduction to ICECUP, it follows that reading
it without actually trying things out doesn't make much sense and
certainly is not very satisfying. For those who do not have the full
ICE-GB at their disposal, it is possible to download
ICECUP as well as a free sample (ten texts, over 20,000 tokens) of
ICE-GB from
http://www.ucl.ac.uk/english-usage/ice-gb/sampler/download.htm 
A little note to inform the reader of this possibility would have been
helpful.

Part three, containing the two chapters on practical treebank work, is
a good supplement to the manuals. Again, it is a very positive feature
that the authors draw on real and easily reproducible examples to
demonstrate some of the many ways in which a syntactically annotated
corpus like ICE-GB can be exploited. The case study investigating
wh-determiners in noun phrases (pp. 244-249) is particularly clever
from a didactic point of view. Here, Nelson et al. do not simply
present the best solution to the problem. Instead, they begin by
formulating a query which turns out to be too narrow. They then modify
their query, but a look at the results shows that this time it is too
general. It is only at the third attempt that they finally get the
results they want. This trial-and-error procedure demonstrates that a
typical query is not a straightforward matter but more often than not
will be a cyclic procedure during which the researcher ''may need to
experiment first in order to define these queries appropriately''
(p. 87). Apart from this, the authors also show that corpus query
results can only be the starting point for any linguistic
investigation and will not replace the linguist's interpretation of
the findings.

Providing a chapter on experimental design and statistical methods is
also a good and useful idea, especially if, as in this case, it is
geared to the needs of corpus linguists and, again, is illustrated by a
wealth of real corpus examples. However, for those without prior
knowledge, this chapter will not replace a more comprehensive
introduction like Oakes (1998), since some definitions of technical
terms are not easy to understand (e.g. how the expected distribution
is calculated, p. 263) and need to be deduced from the examples. Also,
the term ''degrees of freedom'' (p. 264) is left unexplained - the
mere formula ''df = r-1 = 1'' will not mean much to a beginner,
especially as no mention is made of what ''r'' stands for.
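
To make these two points concrete, here is a minimal sketch of the
standard textbook calculation, not taken from the book under review.
It assumes a contingency table of observed frequencies, with the
expected frequency of each cell computed as row total times column
total divided by the grand total, and degrees of freedom computed as
(rows - 1) * (columns - 1). On one common reading, which is an
assumption here since the book does not say what ''r'' denotes, ''r''
is the number of rows in a two-column table, so that df = r-1. The
function names and the counts are invented for illustration.

```python
def expected_counts(observed):
    """Expected cell frequencies under independence:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return the chi-square statistic and the degrees of freedom
    for a rows-by-columns table of observed frequencies."""
    expected = expected_counts(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # (rows - 1) * (columns - 1); with two columns this reduces to rows - 1
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Invented counts: two constructions (rows) in two registers (columns)
table = [[30, 10],
         [20, 40]]

print(expected_counts(table))   # [[20.0, 20.0], [30.0, 30.0]]
statistic, df = chi_square(table)
print(round(statistic, 2), df)  # 16.67 1
```

With one degree of freedom, the critical value of chi-square at the
0.05 level is 3.841, so a statistic of 16.67 would count as
significant for these invented counts.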

I will close this evaluation with a few remarks on the overall
presentation of the book, which unfortunately gives the impression of
being somewhat hastily edited. For example, I detected two references
in the text which do not appear in the list of works cited: Wallis,
Aarts and Nelson 1999, p. 86 and Mair 1990, p. 242. Furthermore, the
alphabetical ordering of entries in the reference section is faulty -
''Declerck'' is positioned after ''Depraetere'', ''Shastri'' after
''Spinillo''. Also, on page 264, the authors refer the reader to
''Appendix 8'' for a table of critical values for chi-square - but
there is no Appendix 8 (the appendices only go up to 6).

To sum up, this book can be highly recommended for those who actually
own a copy of ICE-GB and want to set about analysing it with
ICECUP. For those with a merely theoretical interest in the corpus,
the software, or corpus linguistics in general, it will be of only
limited use.

BIBLIOGRAPHY

Bies, Ann, Mark Ferguson, Karen Katz and Robert MacIntyre (1995)
Bracketing Guidelines for Treebank II Style. Penn Treebank Project.

Huddleston, Rodney and Geoffrey K. Pullum (2002) The Cambridge Grammar
of the English Language. Cambridge: Cambridge University Press.

Oakes, Michael P. (1998) Statistics for Corpus Linguistics. Edinburgh:
Edinburgh University Press.

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik
(1985) A Comprehensive Grammar of the English Language. Harlow, Essex:
Longman.

Sampson, Geoffrey (1995) English for the Computer: The SUSANNE Corpus
and Analytic Scheme. Oxford: Clarendon Press.

ABOUT THE REVIEWER

Lea Cyrus is a research assistant and PhD student at the English
Department at Westfaelische Wilhelms-Universitaet in Muenster,
Germany, where she teaches first- and second-year undergraduate
students. Her main research interest lies in treebank design. She is
currently investigating the possibilities of bi- or multilingual
treebanking.