Date: Mon, 10 Jan 2005 09:56:34 +1300
From: Ute Knoch <email@example.com>
Subject: Corpus Linguistics: Readings in a Widening Discipline
EDITORS: Sampson, Geoffrey Richard; McCarthy, Diana
TITLE: Corpus Linguistics
SUBTITLE: Readings in a Widening Discipline
PUBLISHER: Continuum International Publishing Group Ltd
Ute Knoch, Department of Applied Language Studies and Linguistics,
University of Auckland, New Zealand
The first thing that struck me about this edited volume of papers in corpus
linguistics is that the chapters are organised not by topic but
chronologically, by the year of original publication, and that every
chapter has previously been published elsewhere. The contributions range
from 1952 to 2002. The editors explain their rationale in the introductory
chapter, noting that many of these important publications first appeared
in low-circulation volumes. They chose not to organise the chapters by
topic because they believe corpus linguistics should be seen as a field as
a whole rather than as a set of compartmentalized areas of study. In this
review, I first describe the content of each of the book's 43 chapters and
then provide a critical evaluation.
In their introduction (chapter 1), the editors give a brief definition of
corpora as well as a concise history of the development of the field.
Chapter 2 contains the oldest contribution (1952), written by Charles C.
Fries before corpora existed in electronic form. It presents excerpts from
the introduction and chapter 3 of 'The Structure of English'. Fries was
one of the first modern corpus linguists: he recorded 250,000 words of
telephone conversation and used this data to describe English structure on
the basis of real-life use.
In 'A standard corpus of edited present-day American English' (chapter 3),
Francis describes what the editors call 'the great grandfather' of the
electronic corpora: the Brown Corpus of American written English, released
in 1964. It comprises 1 million words of edited written American English.
The paper sets out the rationale behind the make-up of the corpus.
In chapter 4, 'On the distribution of noun-phrase types in English
clause-structure', originally published by F.G.A.M. Aarts in 1971, the
author used the then still paper-based Survey of English Usage as a basis
for contradicting received assumptions about grammar, validating his
findings with statistical methods.
Chapter 5 was published 15 years after Aarts's paper, in 1986. In the
interim, information technology had advanced, so more sophisticated
processing methods were available. This chapter, which can be situated in
the area of language engineering, describes the development of the Text
Segmentation for Speech (TESS) project, which aimed to develop predictive
theories of English intonation in order to make automated text-to-speech
systems sound more natural.
'Typicality and meaning potentials' (chapter 6), written by the
lexicographer Patrick Hanks (1986), illustrates how useful large corpora
can be for the development of more accurate dictionaries; they may also
shed light on other information that should be included in dictionary
entries.
Biber and Finegan describe in chapter 7, 'Historical drift in three
English genres', the change that three genres (fiction, essays and
letters) have undergone since the eighteenth century. To aid their
analysis they made use of automatic grammatical feature detection and the
statistical method of factor analysis.
John Sinclair, the creator of the COBUILD corpus, touches in chapter 8 on
considerations necessary in the design of corpora, including overall size,
design criteria, and the material to be included.
For his paper 'Cleft and pseudo-cleft constructions in English spoken and
written discourse' (chapter 9), Collins used the LOB and London-Lund
corpora to compare spoken and written discourse with respect to clefts and
pseudo-clefts, taking into account the communicative strategies they
serve.
The next chapter, chapter 10, is the first of several statistical papers
included in the book. Here, Gale and Church (originally published in 1989)
show that a statistical method commonly used in corpus linguistics to
estimate probability (adding one to each category before dividing) is not
valid and should therefore not be used. They suggest the Good-Turing
method instead.
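As a rough illustration of the contrast (my own sketch, not code from the chapter): add-one (Laplace) smoothing gives unseen events probability by inflating every count, whereas Good-Turing reserves for unseen events the probability mass of the types seen exactly once.

```python
from collections import Counter

def add_one_prob(count, total, vocab_size):
    # Laplace smoothing: add one to every count before dividing
    return (count + 1) / (total + vocab_size)

def good_turing_unseen_mass(counts, total):
    # Good-Turing: the total probability reserved for unseen events
    # is N1/N, where N1 is the number of types seen exactly once
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / total

tokens = "the cat sat on the mat near the cat".split()
counts = Counter(tokens)
total = len(tokens)

# Probability add-one would give to one unseen word, assuming a
# (hypothetical) vocabulary of 10,000 types
p_unseen_laplace = add_one_prob(0, total, 10_000)
# Probability mass Good-Turing reserves for *all* unseen words
p_unseen_gt = good_turing_unseen_mass(counts, total)
```

Gale and Church's point is that the add-one estimates can be badly miscalibrated; the Good-Turing estimate is grounded in the observed frequency-of-frequencies.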
In chapter 11, Brown and his co-authors describe how they bypassed
traditional problems with machine translation by developing a computer
system that by itself works out the relationship between equivalent
sentences in two different languages (in this case French and English)
using a large parallel corpus. This avoided the problem researchers had
previously struggled with when trying to formulate the rules that
translators draw on and encode them into software applications.
Chapter 12, by Ihalainen, is an example of a dialect study. The author
investigates a variation in verb syntax found in Southwest England.
Hellberg, the author of chapter 13, shows how he used both corpus and
intuitive data to develop a comprehensive Swedish grammar.
'On the history of that/zero as object clause links in English' (chapter
14), written by Rissanen, is an example of the use of a historical corpus
to investigate a particular English structure. Unlike the corpus used in
chapter 7, this corpus was designed to be representative of English from
its earliest written records onward. The author shows that both 'that' and
zero existed in early written texts, so the zero link is not a recent
omission, as some researchers have claimed.
In chapter 15, Burnage and Dunlop describe some of the many recording and
encoding issues encountered in the development of the British National
Corpus.
Chapter 16 is entitled 'Computer corpora - what do they tell us about
culture?'. The authors, Geoffrey Leech and Roger Fallon, use the LOB and
Brown corpora as representative of British and American writing to see
whether the vocabulary used reveals any social or cultural differences.
They were indeed able to show differences between the two varieties, but
point out that these two corpora were developed in the early 1960s and
that there might have been changes in language use since.
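Frequency comparisons of this kind are nowadays usually quantified with a keyness statistic. A minimal sketch of Dunning's log-likelihood measure, a standard choice for comparing a word's frequency across two corpora (my own illustration, not necessarily the method reported in the chapter):

```python
import math

def log_likelihood(freq1, freq2, size1, size2):
    # Dunning's log-likelihood (G2) keyness score for one word
    # freq1/freq2: the word's counts in the two corpora
    # size1/size2: total tokens in the two corpora
    e1 = size1 * (freq1 + freq2) / (size1 + size2)  # expected count, corpus 1
    e2 = size2 * (freq1 + freq2) / (size1 + size2)  # expected count, corpus 2
    g2 = 0.0
    for obs, exp in ((freq1, e1), (freq2, e2)):
        if obs > 0:
            g2 += obs * math.log(obs / exp)
    return 2 * g2

# A word occurring 150 times per 10,000 tokens in one corpus
# but only 50 times per 10,000 in the other
score = log_likelihood(150, 50, 10_000, 10_000)
```

The higher the score, the less plausible it is that the frequency difference between the two corpora is due to chance.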
Douglas Biber, the author of chapter 17, shows in his
paper 'Representativeness in corpus design' how statistical methods can be
used to establish what might be considered a fair sample size for a corpus.
In chapter 18, Gill Francis shows how closely tied grammar and lexicon
are, using the very large COBUILD 'Bank of English' to illustrate her
approach.
In chapter 19, which is situated in computational linguistics and more
specifically in automatic natural language processing, Hindle and Rooth
show that it is not always correct to assume that automatic parsing has a
single correct answer. Specifically, they investigate where in the parse
tree a prepositional phrase should be attached.
In his article entitled 'Irony in the text or insincerity in the writer?
The diagnostic potential of semantic prosodies' (chapter 20), the author
William Louw shows that large corpora can reveal patterns of collocations
between lexical items which cannot be predicted on the basis of their
dictionary meaning. Some of these patterns can be found in literary
writing and are used to achieve for example irony.
Chapter 21 describes the Penn Treebank, one of the largest currently
available corpora annotated for clause structure as well as POS-tagged.
This is an advance on older corpora, which were generally raw text.
In chapter 22, Kenji Kita and his co-authors describe methods for
extracting collocations from corpora. The two methods they use yield very
different results; one measure generates results that are arguably far
more useful for language-teaching purposes as well as for computational
linguists.
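To give a flavour of what such a measure looks like (my own sketch; the chapter's exact measures may differ), pointwise mutual information is one widely used collocation score:

```python
import math
from collections import Counter

def pmi(pair_count, w1_count, w2_count, total):
    # Pointwise mutual information: how much more often the pair
    # co-occurs than expected if the two words were independent
    p_xy = pair_count / total
    p_x = w1_count / total
    p_y = w2_count / total
    return math.log2(p_xy / (p_x * p_y))

words = "strong tea strong tea weak tea strong coffee".split()
bigrams = Counter(zip(words, words[1:]))
unigrams = Counter(words)
n = len(words)  # token count, used here as an approximation of the bigram total

score = pmi(bigrams[("strong", "tea")], unigrams["strong"], unigrams["tea"], n)
```

PMI is known to inflate the scores of rare pairs, which is one reason different measures rank collocations so differently.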
Developing a parser capable of handling naturally occurring language was a
challenge taken up in the mid-1990s as computational linguistics developed
further. Briscoe and Carroll, the authors of chapter 23, tested such a
parser, which incorporated probabilistic information, against a treebank,
and report recall and precision figures.
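Parser evaluation against a treebank typically compares sets of labelled constituents; a minimal sketch of how precision and recall are computed (my own illustration, with made-up spans):

```python
def precision_recall(gold, predicted):
    # gold, predicted: sets of (start, end, label) constituents
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Treebank analysis vs. parser output for one (hypothetical) sentence
gold = {(0, 2, "NP"), (3, 6, "VP"), (0, 6, "S")}
predicted = {(0, 2, "NP"), (3, 6, "NP"), (0, 6, "S")}

p, r = precision_recall(gold, predicted)  # 2 of 3 brackets match each way
```

Precision asks how many of the parser's constituents are correct; recall asks how many of the treebank's constituents the parser found.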
Chapter 24, authored by Tent and Mugler in 1996, explores the reasons for
collecting a Fijian English corpus as part of the International Corpus of
English by looking at the history and current role of English in Fiji.
Charniak, a leading advocate of artificial intelligence, argues in chapter
25 for parsers that extract their rules directly from treebanks (unlike
the parser described in chapter 23, whose rules were developed from human
linguistic expertise). Charniak shows that this approach yields good
results, which he reports as precision, recall and accuracy.
In chapter 26, Dieter Mindt shows how differently modals are presented in
English teaching materials from how they are actually used by native
speakers of English. He argues that much more academic work needs to be
incorporated into EFL and ESL teaching materials and syllabi.
Data-oriented processing holds that what human language users have in
their heads is not a system of rules extracted from experience, but the
experience itself. The authors of chapter 27, Bod and Scha, show
experimentally that computer simulations of this type can produce
impressive results.
Chapter 28, 'Conflict talk: a comparison of the verbal disputes between
adolescent females in two corpora' by Hasund and Stenström, shows that
corpora make it possible to investigate differences between the speech of
social classes. Drawing on the COLT corpus, the authors find quite
distinctive differences in the kinds of dispute conducted by adolescent
girls in London from different social backgrounds.
In another statistics paper, chapter 29, Jean Carletta argues for the use
of the kappa statistic to calculate inter-annotator agreement.
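Kappa corrects raw agreement for the agreement two annotators would reach by chance; a minimal sketch for the two-annotator case (my own illustration, with made-up tags):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["stat", "stat", "quest", "quest"]  # annotator A's tags
b = ["stat", "stat", "quest", "stat"]   # annotator B's tags
k = cohens_kappa(a, b)
```

A kappa of 1 means perfect agreement, 0 means no better than chance; this correction is what makes the statistic preferable to raw percentage agreement.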
The author of chapter 30, Christopher Werry, investigates some of the
features of Internet Relay Chat that can be described as speech-like
because of the physical constraints of the medium. He also shows that this
type of interaction is very similar across different languages.
Chapter 31 discusses one problem at the lexical level encountered in
natural-language processing: word-sense disambiguation. Algorithms for
word-sense selection have not yet reached acceptable levels of
reliability. The authors, Resnik and Yarowsky, report on some of the
lessons learned from the SENSEVAL evaluation workshop.
In chapter 32, 'Qualification and certainty in L1 and L2 students'
writing', Hyland and Milton compare the lexical devices used to indicate
epistemic modality in the English writing of British native-speaker and
Hong Kong school leavers. They show that non-native speakers under- and
overuse certain constructions expressing epistemic modality, and that the
manipulation of certainty and affect proves particularly difficult for
them.
Corpora also allow for annotation above the sentence-level. Such an
annotation system is DAMSL, which is described by Core in chapter 33.
DAMSL annotates speech-act features. The author discusses the motivation
behind using machine learning to predict DAMSL tags automatically and
describes an attempt to obtain decision trees that predict DAMSL tags.
In the paper entitled, 'Assessing claims about language use with corpus-
data: swearing and abuse' (chapter 34), McEnery and his co-authors
investigate the functions of bad language by describing the ongoing
construction of the Lancaster Corpus of Abuse (LCA).
McKelvie, in chapter 35, investigates dysfluencies such as pauses,
fillers, repetitions, repairs and fresh starts, to see how they relate to
grammatical structure.
Pols et al., the authors of chapter 36, suggest that the success of a
text-to-speech synthesiser should be evaluated using documents from large
corpora (preferably in several different languages) rather than with
purpose-built test material.
One non-English corpus that has received widespread attention is the
Prague Dependency Treebank which is an annotated section of the Czech
National Corpus. This corpus is of interest because it is annotated
according to dependency analysis rather than the phrase-structure analysis
used in most English-language treebanks. In chapter 37, the authors
discuss the automation of this annotation process.
In his paper 'Reflections of a dendrographer' (chapter 38), Sampson
discusses the usefulness of treebank data for language engineering, as
well as the usefulness of software engineering for gaining new insights
into developing treebanks. The paper is based on a speech the author gave
in honour of Geoffrey Leech in 1999.
In chapter 39, Carletta et al. argue for the use of XML as a generic
markup language for all corpora.
McEnery (the author of chapter 40), argues that the languages of India,
Pakistan and Bangladesh are the most ignored languages in terms of
language engineering although there is a great need for work in this area,
for example for translation studies. He argues that work in this field has
only just started.
In chapter 41, Campione and Veronis, the authors of 'Semi-automatic
tagging of intonation in French spoken corpora', describe an approach
which partially automates annotation of prosodic features. Although their
work is done on French, it is also applicable to other languages.
The author of chapter 42, Kilgarriff, claims that dedicated corpus
compilation has become unnecessary, since sufficient material is freely
available on the web.
The final chapter focuses on intonation, which is crucial for speech to
sound natural. Studying this phenomenon is central for the advancement of
synthesized speech. For this purpose a research project at Cambridge
University has set out to document the diverse intonation patterns in the
British Isles. Grabe and Post show some of the results of this project.
It can be seen that the book has been compiled with a great deal of
thought, covering a large number of topic areas within corpus linguistics.
The editors' introductions to each chapter are very useful: they not only
briefly summarize each chapter but also put it into context for the
reader. All chapters are relatively short, so they are not overwhelming
for a reader new to the area, and all were selected for their importance
to the field of corpus linguistics. The editors also supply a very useful
list of URLs as an appendix. Personally, coming from an Applied
Linguistics background, I would have preferred more material on learner
corpora, as found in the books by Granger (1998) and Granger, Hung, and
Petch-Tyson (2002); more on the kind of corpus-based material now
developed for language teaching purposes, as seen, for example, in Tim
Johns's data-driven learning materials; or more on how corpora can be used
by language learners themselves. This area could have been covered more
extensively, especially as the editors make repeated reference to the fact
that most work on corpora has been initiated by the EFL profession.
Overall, however, the book is an extremely valuable resource to own, not
only as a reference for corpus linguists, but also for those newly
interested in the area who want to understand the wider field of corpus
linguistics and the historical development it has undergone.
Granger, S. (Ed.). (1998). Learner English on Computer. London, New York:
Longman.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer Learner
Corpora, Second Language Acquisition and Foreign Language Teaching.
Amsterdam, Philadelphia: John Benjamins Publishing Company.