Review of Exploring Natural Language
Review:
Date: Mon, 19 Jan 2004 19:30:32 +0100 (CET)
From: Lea Cyrus <lea@marley.uni-muenster.de>
Subject: Exploring Natural Language: Working with the British Component of ICE
Nelson, Gerald, Sean Wallis and Bas Aarts, eds. (2002) Exploring Natural Language: Working with the British Component of the International Corpus of English, John Benjamins Publishing Company, Varieties of English Around the World G29.
Lea Cyrus, Arbeitsbereich Linguistik, Westfaelische Wilhelms-Universitaet Muenster.
PURPOSE AND CONTENTS
The International Corpus of English (ICE) is a project which aims at compiling up to twenty comparable, syntactically parsed, one-million-word corpora of different national and regional varieties of English, both written and spoken. This book is the first volume in a new sub-series of "Varieties of English Around the World", which has been launched to provide "handbooks for the various corpora in the International Corpus of English" (p. xi). However, rather than merely being a handbook for the British Component of ICE (ICE-GB), it is first and foremost a manual for the Windows-based International Corpus of English Corpus Utility Program (ICECUP), a tailor-made retrieval program for all ICE subcorpora.
The book consists of four parts, which in turn are subdivided into several chapters. There are also six appendices listing details of the corpus annotation and design. The first part ("Introducing the corpus", pp. 1-68) begins with a general overview of the ICE-GB corpus, which is the first of the ICE subcorpora to be completed. The remainder of this part describes the annotation scheme used. Each node in the tree is labelled with up to three types of information: word class/syntactic category, syntactic function and features (such as "transitivity"), the latter being optional. All word-class tags, syntactic categories, functions and features are listed and illustrated by short explanations and examples.
The second and by far the longest part ("Exploring the corpus", pp. 69-231) is a very detailed manual to the query tool ICECUP 3.0. The various facilities are introduced little by little, starting from general browsing by text category and leading up to complex queries by means of fuzzy tree fragments (FTFs). The descriptions are constantly illustrated by an impressive number of screenshots, and all features are introduced by means of concrete examples, accompanied by detailed instructions which allow the readers to reproduce the queries on their own computers. This part ends with a description of extensions that have been made to ICECUP 3.0, resulting in the "evolutionary advance" (p. 203) ICECUP 3.1.
The third part ("Performing research with the corpus", pp. 232-283) is used to further exemplify the research possibilities offered by ICECUP. This is achieved by six case studies which exploit the annotation available in ICE-GB. The first two studies are lexical, exploring the adverbial use of "pretty" and the verbal vs. nominal use of "book" respectively. The following study is concerned with the question whether there exists a correlation between the transitivity properties of an embedded clause and its syntactic function. The fourth study investigates the differences in the use of "what" and "which" functioning as determiners in noun phrases. The fifth study compares the distribution of passives in the six written registers in ICE-GB, and in the final study, the authors confirm the results of an earlier investigation which had shown that if-clauses occur more frequently in clause-initial than in clause-final position. The third part is rounded off with a general chapter on experimental design and basic stochastic methods, such as working with contingency tables and calculating the statistical significance of results with the chi-square test. This chapter, too, includes three exemplary studies based on ICE-GB.
The fourth and last part ("The future of the corpus", pp. 284-300) gives a short overview of possible extensions to both ICE-GB and ICECUP, provided that funding becomes available. These include further levels of annotation, particularly in the spoken component of ICE-GB, and the partial automation of the experimentation process.
CRITICAL EVALUATION
I have pointed out before that large parts of this book are devoted to presenting the annotation scheme and retrieval software for the ICE-GB corpus - to a large extent, this book is really a manual (or two manuals, to be precise). It is not the objective of this review to evaluate either the annotation scheme or the application, but to point out whether or not these manuals are good introductions to their respective fields.
I will follow the order in the book and begin with the chapter on the ICE-GB grammar, which I found wanting in some respects. Firstly, there is the overall organisation of the chapter. Why are word-class tags treated in a separate section while phrasal and functional categories are mixed? My guess is that this was done because it mirrors the practical annotation process - first tagging, then parsing (p. 22), and it might have seemed natural to keep up this ordering in the book. However, from a theoretical point of view, it would have been neater to group lexical and phrasal categories together, if such grouping is at all necessary, since they both refer to the form of the units they describe. For a discussion of the distinction between syntactic categories (which include lexical categories) and grammatical functions see Quirk et al. (1985, pp. 48f. and pp. 64-67) and Huddleston and Pullum (2002, pp. 20-26).
Also, it is questionable whether alphabetical ordering is the best approach when describing grammatical choices made in an annotation scheme. In such a list, elements which belong closely together are spread over the entire list. If, for instance, a reader wants to find out how extraposition is annotated in ICE-GB, he or she will have to read the entire chapter in order to collect all pieces of information: for instance, anticipatory 'it' is tagged as a pronoun with the special feature 'antit' (p. 36), its function is either 'provisional direct object' (PROD) or 'provisional subject' (PRSU) (p. 53), the extraposed constituents are function-tagged as 'notional direct object' (NOOD) or 'notional subject' (NOSU) respectively (p. 50), and the entire clause is marked with either the specific feature 'extraposed direct object' ('extod') or 'extraposed subject' ('extsu') (p. 58). To be fair, it must be added that sometimes there are references to further tags used in the same construction, but since these are used sporadically, they do not really solve the problem.
All in all, this part is more or less an annotated atomic list of the tags, but there is little actual grammar in it. The discussion of grammatical constructions is limited to six and a half pages at the end of the chapter (pp. 62-68), where a few seemingly arbitrarily chosen "special topics" (i.e. inversion, coordination, direct speech) are discussed in greater detail. For a better understanding of the grammar behind the scheme, a more construction-based approach, following the good examples of Bies et al. (1995) for the Penn Treebank and Sampson (1995) for the SUSANNE corpus, would have been helpful. For those who really only want to look up a particular tag, there is an alphabetically ordered reference guide (Appendix 5), and if the tags were also included in the index (which they are not), the reader could easily be directed to the appropriate location in the detailed description.
A further point I would like to make is that most tag descriptions are very general and not detailed enough to reveal the way the tags are used in the corpus. I shall illustrate this with two examples: the word class tag "ADJ" (adjective) can be subclassified by various features. Two of these are 'edp' for -ed participles and 'ingp' for -ing participles (p. 25). The problem here is that the distinction between adjectival participles and their verbal counterparts is not in all cases easy to make. Huddleston and Pullum (2002, p. 1438) give the sentence "Kim was worried by the prospect of redundancy" as an ambiguous example. I would have liked to learn more about the criteria which have been applied in ICE-GB to identify adjectival participles or the conventions according to which borderline cases like the one cited above are treated.
Adverbials in ICE-GB "can appear at practically any level of the tree and within any category of the tree" - a description could hardly be more general. Considering the space allocated to adverbials in grammars (Quirk et al. (1985): pp. 475-653; Huddleston and Pullum (2002): pp. 663-784) this is not very informative. What are the criteria for attaching adverbials at a specific position in the tree? Do different attachment positions correspond to different classes of adverbials? What happens if adverbials cause discontinuities as in "The door will then be opened", where the verb group (or verb phrase in ICE-GB terminology) is split? Answers to questions like these would have been welcome.
Having said that, it must be stressed that this criticism of too little detail does not extend to the ICECUP manual, which is truly a very thorough, exhaustive and easy-to-follow tutorial on all the facilities this software has to offer. While working through the manual, I sometimes found myself thinking something along the lines of "All very nice, but it would be great if this or that were explained, too", only to find my questions answered a few pages on. As mentioned above, there is an abundance of screenshots, examples and step-by-step instructions which enable the reader to reproduce even the more complicated queries.
However, sometimes the authors mean too well. Repeated hints like "You can use cursor keys [...] and the scroll bar to move through the elements in the hierarchy" (p. 72), "you can briskly click twice with the left mouse button, with the mouse cursor over the text label in the variable hierarchy" (p. 90) or "[y]ou can make the tree window active by clicking down with the left mouse button inside it" (p. 108) do make one wonder what type of audience they have in mind. Since corpus linguists can generally be expected not to be complete computer novices, these basics could have been omitted.
This minor criticism apart, the manual really has all the properties one could possibly wish for in a tutorial. I have only one more practical remark concerning this section: since this is a tutorial rather than a general introduction to ICECUP, it follows that reading it without actually trying things out does not make much sense and certainly is not very satisfying. For those who do not have the full ICE-GB at their disposal, it is possible to download ICECUP as well as a free sample (ten texts, over 20,000 tokens) of ICE-GB from <http://www.ucl.ac.uk/english-usage/ice-gb/sampler/download.htm>. A short note informing the reader of this possibility would have been helpful.
Part three, containing the two chapters on practical treebank work, is a good supplement to the manuals. Again, it is a very positive feature that the authors draw on real and easily reproducible examples to demonstrate some of the many ways in which a syntactically annotated corpus like ICE-GB can be exploited. The case study investigating wh-determiners in noun phrases (pp. 244-249) is particularly clever from a didactic point of view. Here, Nelson et al. do not simply present the best solution to the problem. Instead, they begin by formulating a query which turns out to be too narrow. They then modify their query, but a look at the results shows that this time it is too general. It is only at the third attempt that they finally get the results they want. This trial-and-error procedure demonstrates that a typical query is not a straightforward matter but more often than not will be a cyclic procedure during which the researcher "may need to experiment first in order to define these queries appropriately" (p. 87). Apart from this, the authors also show that corpus query results can only be the starting point for any linguistic investigation and will not replace the linguist's interpretation of the findings.
Providing a chapter on experimental design and statistical methods is also a good and useful idea, especially if, as in this case, it is geared to the needs of corpus linguists and, again, is illustrated by a wealth of real corpus examples. However, for those without prior knowledge, this chapter will not replace a more comprehensive introduction like Oakes (1998), since some definitions of technical terms are not easy to understand (e.g. how the expected distribution is calculated, p. 263) and need to be deduced from the examples. Also, the term "degrees of freedom" (p. 264) is left unexplained - the mere formula "df = r-1 = 1" will not mean much to a beginner, especially as no mention is made of what "r" stands for.
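To illustrate the kind of explanation a beginner might need here, the following is a minimal sketch of how expected frequencies and the chi-square statistic are conventionally computed for a contingency table. It is not taken from the book: the function names and the counts are invented for illustration, and the degrees of freedom are computed with the standard contingency-table rule (rows - 1) * (columns - 1), which may or may not be what the book's "r" refers to.

```python
def expected_counts(table):
    """Expected cell counts under independence:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return the chi-square statistic and the degrees of freedom."""
    expected = expected_counts(table)
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Standard rule for an r x c contingency table (an assumption here,
    # not the book's stated definition).
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Invented counts, NOT taken from ICE-GB: two constructions (rows)
# counted in two text categories (columns).
observed = [[30, 10],
            [20, 40]]

stat, df = chi_square(observed)
# Critical value of chi-square for df = 1 at the 0.05 level.
CRITICAL_0_05_DF1 = 3.841
significant = stat > CRITICAL_0_05_DF1
print(expected_counts(observed), round(stat, 3), df, significant)
```

With a 2 x 2 table the rule gives one degree of freedom, so the statistic is compared against the critical value for df = 1. For larger tables a full table of critical values would be needed.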
I will close this evaluation with a few remarks on the overall presentation of the book, which unfortunately gives the impression of being somewhat hastily edited. For example, I detected two references in the text which do not appear in the list of works cited: Wallis, Aarts and Nelson 1999, p. 86 and Mair 1990, p. 242. Furthermore, the alphabetical ordering of entries in the reference section is faulty - "Declerck" is positioned after "Depraetere", "Shastri" after "Spinillo". Also, on page 264, the authors refer the reader to "Appendix 8" for a table of critical values for chi-square - but there exists no Appendix 8 (the appendices only go up to 6).
To sum up, this book can be highly recommended for those who actually own a copy of ICE-GB and want to set about analysing it with ICECUP. For those with a merely theoretical interest in the corpus, the software, or corpus linguistics in general, it will be of only limited use.
BIBLIOGRAPHY
Bies, Ann, Mark Ferguson, Karen Katz and Robert MacIntyre (1995) Bracketing Guidelines for Treebank II Style. Penn Treebank Project.
Huddleston, Rodney and Geoffrey K. Pullum (2002) The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Oakes, Michael P. (1998) Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik (1985) A Comprehensive Grammar of the English Language. Harlow, Essex: Longman.
Sampson, Geoffrey (1995) English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford: Clarendon Press.
ABOUT THE REVIEWER
Lea Cyrus is a research assistant and PhD student at the
English Department at Westfaelische Wilhelms-Universitaet in
Muenster, Germany, where she teaches 1st and 2nd year
undergraduate students. Her main research interest lies in
treebank design. She is currently investigating the
possibilities of bi- or multilingual treebanking.