Review of Corpus Linguistics


Reviewer: Ute Knoch
Book Title: Corpus Linguistics
Book Author: Diana McCarthy, Geoffrey Sampson
Publisher: Continuum International Publishing Group Ltd
Linguistic Field(s): Text/Corpus Linguistics
LL Issue: 16.98




Review:

	Date: Mon, 10 Jan 2005 09:56:34 +1300 
From: Ute Knoch  
Subject: Corpus Linguistics: Readings in a Widening Discipline

EDITORS: Sampson, Geoffrey Richard; McCarthy, Diana 
TITLE: Corpus Linguistics 
SUBTITLE: Readings in a Widening Discipline 
PUBLISHER: Continuum International Publishing Group Ltd 
YEAR: 2004 

Ute Knoch, Department of Applied Language Studies and Linguistics, 
University of Auckland, New Zealand

The first thing that struck me about this edited volume of papers in 
corpus linguistics is that the chapters are organised not by topic area 
but by year of original publication, and that every chapter has been 
published previously elsewhere. The contributions range from 1952 to 2002. 
The editors explain the rationale behind the book in the introductory 
chapter, where they note that many of these important publications 
previously appeared in low-circulation volumes. They decided not to 
organise the chapters by topic area because they think that corpus 
linguistics should be seen as a field as a whole and not as a 
compartmentalized area of study. In this review, I first describe the 
content of each of the 43 chapters of this book and then provide a 
critical evaluation of the contents. 

In their introduction (chapter 1), the editors give a brief definition of 
corpora as well as a concise history of the development of the field. 

Chapter 2 is the oldest contribution (1952), written by Charles C. Fries 
at a time before corpora were in electronic form. This chapter presents 
excerpts from the introduction and chapter 3 of 'The Structure of 
English'. The author was one of the first modern corpus linguists. He 
recorded 250,000 words of telephone conversation and used this data to 
describe English structure based on real-life use. 

In 'A standard corpus of edited present day American English' (chapter 3), 
Francis describes what the editors call 'the great grandfather' of the 
electronic corpora, the Brown Corpus of American written English, which 
was published in 1964. It was made up of 1 million words of edited written 
prose. The paper specifies the rationale for the make-up of the corpus.

In chapter 4, entitled 'On the distribution of noun-phrase types in 
English clause-structure' and originally published by F.G.A.M. Aarts in 
1971, the author used the then still paper-based Survey of English Usage 
as a basis to contradict assumptions about grammar. The author used 
statistical methods to validate his study. 

Chapter 5 was published 15 years after Aarts's paper, namely in 1986. In 
the interim, information technology had advanced and more complicated 
processing methods had therefore become available. This chapter, which can 
be situated in the area of language engineering, describes the development 
of the Text Segmentation for Speech (TESS) project, which aimed to develop 
predictive theories about English intonation to make automated 
text-to-speech systems sound more natural.

'Typicality and meaning potentials' (chapter 6), written by Patrick Hanks 
(1986), a lexicographer, illustrates how useful large corpora can be for 
the development of more accurate dictionaries, and how they might also 
shed some light on other information that should be included in modern 
dictionaries.

Biber and Finegan describe in chapter 7, 'Historical drift in three 
English genres', the change that three genres (fiction, essays and 
letters) have undergone since the eighteenth century. To aid their 
analysis, they made use of automatic grammatical feature detection and the 
statistical method of factor analysis. 

John Sinclair, the creator of the COBUILD corpus, touches in chapter 8 on 
considerations necessary in the design of corpora. These include issues 
such as overall size, design criteria, and the material included.

For his paper 'Cleft and pseudo-cleft constructions in English spoken and 
written discourse' (chapter 9), Collins used the LOB and the London-Lund 
corpora to compare spoken and written discourse with respect to clefts and 
pseudo-clefts, taking into account what communicative strategies they 
serve. 

The next chapter, chapter 10, is the first of several statistical papers 
included in the book. Here, Gale and Church (originally published in 1989) 
show that a statistical method commonly used in corpus linguistics to 
estimate probability (adding one to each category before doing divisions) 
is not valid and should therefore not be used. They suggest instead the 
use of the Good-Turing method. 
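
The contrast between the two estimators can be illustrated with a minimal 
Python sketch. It is not the procedure from the chapter: it assumes the 
basic textbook form of Good-Turing, in which an observed count r is 
replaced by r* = (r + 1) * N_(r+1) / N_r (N_r being the number of word 
types seen exactly r times), with no smoothing of the N_r values. The 
function names and the toy word counts are hypothetical.

```python
from collections import Counter


def add_one_probability(count, total_tokens, vocabulary_size):
    """Add-one estimate: (count + 1) / (total tokens + vocabulary size)."""
    return (count + 1) / (total_tokens + vocabulary_size)


def good_turing_adjusted_counts(frequencies):
    """Map each observed count r to r* = (r + 1) * N_{r+1} / N_r.

    Counts r for which no type was seen r + 1 times are left out,
    because this basic form is undefined there without smoothing N_r.
    """
    count_of_counts = Counter(frequencies.values())
    adjusted = {}
    for r, n_r in count_of_counts.items():
        n_r_plus_1 = count_of_counts.get(r + 1, 0)
        if n_r_plus_1:
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted


def good_turing_unseen_mass(frequencies):
    """Probability mass reserved for unseen types: N_1 / total tokens."""
    total_tokens = sum(frequencies.values())
    n_1 = sum(1 for count in frequencies.values() if count == 1)
    return n_1 / total_tokens


# Hypothetical toy frequency list (10 tokens, 6 types).
toy_frequencies = {"the": 3, "of": 2, "corpus": 2,
                   "treebank": 1, "kappa": 1, "prosody": 1}

print(add_one_probability(0, 10, 6))
print(good_turing_adjusted_counts(toy_frequencies))
print(good_turing_unseen_mass(toy_frequencies))
```

In practice the N_r values become sparse for large r, which is why 
smoothed variants of Good-Turing are normally used on real corpora.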

In chapter 11, Brown and his co-authors describe how they bypassed 
traditional problems with machine translation by developing a computer 
system that by itself works out the relationship between equivalent 
sentences in two different languages (in this case French and English) 
using a large parallel corpus. This bypassed the problem researchers had 
struggled with previously when they tried to formulate the rules that 
translators draw on and to encode these into software applications.

Chapter 12, by Ihalainen, is an example of a dialect study. The author 
investigates a variation in verb syntax found in Southwest England. 

Hellberg, the author of chapter 13, shows how he used both corpus and 
intuitive data to develop a comprehensive Swedish grammar.

'On the history of that/zero as object clause links in English' (chapter 
14), written by Rissanen, is an example of the use of a historical corpus 
to investigate a particular English structure. Unlike the corpus used in 
chapter 7, this corpus was developed to be representative of the English 
language from the Dark Ages. The author shows that both 'that' and zero 
existed in early written texts and that zero is therefore not a more 
recent omission, as has been claimed by some researchers.

In chapter 15, Burnage and Dunlop describe some of the many recording and 
encoding issues encountered in the development of the British National 
Corpus.

Chapter 16 is entitled 'Computer corpora - what do they tell us about 
culture?'. The authors, Geoffrey Leech and Roger Fallon, use the LOB and 
Brown corpora as representative corpora of British and American writing to 
examine whether the vocabulary used reveals any social or cultural 
differences. They were indeed able to show differences between the two 
varieties, but point out that these two corpora were developed in the 
early 1960s and that there might have been changes in language use since.

Douglas Biber, the author of chapter 17, shows in his paper 
'Representativeness in corpus design' how statistical methods could be 
used to establish what might be seen as a fair sample size for a corpus.

In chapter 18, written by Gill Francis, the author shows how closely tied 
grammar and lexicon are. She uses the very large COBUILD 'Bank of English' 
to illustrate her approach.

In chapter 19, which is situated in the area of computational linguistics 
and more specifically in the area of automatic natural language 
processing, Hindle and Rooth show that it is not always correct to assume 
that there is only one correct answer to automatic parsing. They 
specifically investigate at what point a prepositional phrase should be 
attached to a tree.

In his article entitled 'Irony in the text or insincerity in the writer? 
The diagnostic potential of semantic prosodies' (chapter 20), William Louw 
shows that large corpora can reveal patterns of collocation between 
lexical items which cannot be predicted on the basis of their dictionary 
meaning. Some of these patterns can be found in literary writing and are 
used to achieve, for example, irony.

Chapter 21 describes the Penn Treebank, one of the largest currently 
available corpora that is annotated for clause structure as well as 
POS-tagged. This is an advance on older corpora, which were generally raw 
corpora. 

In chapter 22, Kenji Kita and his co-authors describe methods used to 
extract collocations from corpora. The two different methods used yield 
very different results. One measure they illustrate generates results 
which are arguably a lot more useful for language teaching purposes as 
well as for computational linguists.
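
How two collocation measures can rank the same word pairs very differently 
can be sketched in a few lines of Python. The review does not name the 
measures compared in the chapter, so the two used here (raw bigram 
frequency and pointwise mutual information) are assumptions chosen purely 
for illustration, as are the function name and the toy token list.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (raw frequency, pointwise mutual information)}.

    PMI is log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated by relative frequency in the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), freq in bigrams.items():
        p_pair = freq / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = (freq, math.log2(p_pair / (p_w1 * p_w2)))
    return scores


# Hypothetical toy data: a frequent function-word pair and a rare
# but strongly associated pair.
tokens = "of the of the of the hong kong".split()
scores = bigram_scores(tokens)

top_by_frequency = max(scores, key=lambda pair: scores[pair][0])
top_by_pmi = max(scores, key=lambda pair: scores[pair][1])
print(top_by_frequency, top_by_pmi)
```

On this toy input, raw frequency favours the function-word pair while 
mutual information favours the rarer, more strongly associated pair, which 
is the kind of divergence that matters when choosing a measure for 
teaching materials or for language engineering.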

Developing a parser capable of handling naturally occurring language was a 
challenge taken up in the mid-1990s as computational linguistics developed 
even further. Briscoe and Carroll, the authors of chapter 23, tested such 
a parser, which incorporated probabilistic information, against a treebank 
and report recall and precision.
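
Recall and precision, as reported in this kind of parser evaluation, can 
be illustrated with a minimal Python sketch. It assumes that each parse is 
reduced to a set of labelled brackets of the form (label, start, end); the 
function name, the bracket representation and the toy data are 
hypothetical and are not taken from the chapter.

```python
def bracket_precision_recall(predicted, gold):
    """Compare two collections of (label, start, end) brackets.

    Precision: share of predicted brackets that also occur in the gold parse.
    Recall: share of gold brackets that the parser recovered.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard and parser output for one five-word sentence.
gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
predicted = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5),
             ("NP", 3, 5), ("NP", 4, 5)}

precision, recall = bracket_precision_recall(predicted, gold)
print(precision, recall)  # 0.6 0.75
```

Published evaluations typically aggregate these figures over a whole test 
set and may apply further conventions, for example for punctuation, which 
this sketch ignores.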

Chapter 24, authored by Tent and Mugler in 1996, explores the reasons for 
collecting a Fijian English corpus as part of the International Corpus of 
English by looking at the history and current role of English in Fiji.

Charniak, who is a leading advocate of Artificial Intelligence, argues in 
chapter 25 for parsers that extract their rules directly from treebanks 
(unlike the parser described in chapter 23, whose rules were developed 
from human linguistic experience). Charniak shows that he is able to 
obtain good results and reports these as precision, recall and accuracy.

In chapter 26, Dieter Mindt shows how differently modals are presented in 
English teaching materials from how they are actually used by native 
speakers of English. He argues that much more of the work done by 
academics needs to be incorporated into EFL and ESL teaching materials and 
syllabi.

Data-oriented processing argues that what human language users have in 
their heads is not a system of rules extracted from experience; it is just 
experience. The authors of chapter 27, Bod and Scha, show experimentally 
that computer simulations of this type can produce impressive results.

Chapter 28, 'Conflict talk: a comparison of the verbal disputes between 
adolescent females in two corpora' by Hasund and Stenström, shows that 
corpora make it possible to investigate differences between the speech of 
social classes. By investigating the COLT corpus, the authors find quite 
distinctive differences in the kinds of dispute engaged in by adolescent 
girls in London from different social backgrounds.

In another statistics paper, chapter 29, Jean Carletta argues for the use 
of the kappa statistic to calculate inter-annotator agreement.
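
The idea behind kappa (agreement corrected for chance) can be shown with a 
minimal Python sketch. It assumes the two-annotator form usually 
attributed to Cohen, kappa = (P_o - P_e) / (1 - P_e), where P_o is 
observed agreement and P_e is the agreement expected from each annotator's 
own label distribution; the exact variant argued for in the chapter may 
differ. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a and labels_b are equal-length sequences holding the label
    each annotator gave to the same items.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n_items = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n_items
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label
    # if each labels independently according to their own distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a)
    expected /= n_items * n_items
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels from two annotators for four tokens.
annotator_a = ["N", "N", "V", "V"]
annotator_b = ["N", "N", "V", "N"]
print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

Here the annotators agree on three of four items (0.75), but half of that 
agreement would be expected by chance, so kappa is only 0.5.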

The author of chapter 30, Christopher Werry, investigates some of the 
features of Internet Relay Chat which can be described as speech-like 
because of the physical constraints of the medium. He also shows that this 
type of interaction is very similar in different languages.

Chapter 31 discusses one problem at the lexical level encountered in 
natural-language processing: word-sense disambiguation. Algorithms for 
word-sense selection have not yet reached acceptable levels of 
reliability. The authors, Resnik and Yarowsky, report on some of the 
lessons learned from the SENSEVAL evaluation workshop.

In chapter 32, entitled 'Qualification and certainty in L1 and L2 
students' writing', Hyland and Milton compare the lexical devices used to 
indicate epistemic modality in the English writing of British 
native-speaker and Hong Kong school leavers. They show that non-native 
speakers under- and overuse certain constructions used to express 
epistemic modality and that the manipulation of certainty and effect 
proves particularly difficult for L2 students.

Corpora also allow for annotation above the sentence level. One such 
annotation system is DAMSL, which is described by Core in chapter 33. 
DAMSL annotates speech-act features. The author discusses the motivation 
behind using machine learning to automatically predict DAMSL tags and 
describes an attempt at obtaining decision trees which predict DAMSL tags.

In the paper entitled 'Assessing claims about language use with corpus 
data: swearing and abuse' (chapter 34), McEnery and his co-authors 
investigate the functions of bad language by describing the ongoing 
construction of the Lancaster Corpus of Abuse (LCA). 

McKelvie (chapter 35) investigates dysfluencies like pauses, fillers, 
repetitions, repairs and fresh starts to see how they relate to 
grammatical structure.

Pols et al., the authors of chapter 36, suggest that the success of a 
text-to-speech synthesiser should be evaluated by using documents from 
large corpora (preferably in several different languages) rather than with 
devised sentences.

One non-English corpus that has received widespread attention is the 
Prague Dependency Treebank, which is an annotated section of the Czech 
National Corpus. This corpus is of interest as it is annotated according 
to dependency analysis and not based on phrase structure analysis as most 
English-language treebanks are. In chapter 37, the authors discuss the 
automation of this annotation process.

In his paper 'Reflections of a dendrographer' (chapter 38), Sampson 
discusses the usefulness of treebank data for language engineering as well 
as the usefulness of software engineering to find new insights for 
developing treebanks. This paper is based on a speech the author gave in 
honour of Geoffrey Leech in 1999.

In chapter 39, Carletta et al. argue for the use of XML as a generic 
markup language to be used for all corpora.

McEnery, the author of chapter 40, argues that the languages of India, 
Pakistan and Bangladesh are the most ignored languages in terms of 
language engineering, although there is a great need for work in this 
area, for example for translation studies. He argues that work in this 
field has only just started.

In chapter 41, Campione and Veronis, the authors of 'Semi-automatic 
tagging of intonation in French spoken corpora', describe an approach 
which partially automates annotation of prosodic features. Although their 
work is done on French, it is also applicable to other languages.

The author of chapter 42, Kilgarriff, claims that corpus compilation has 
become redundant as sufficient material is freely available on the web.

The final chapter focuses on intonation, which is crucial for speech to 
sound natural. Studying this phenomenon is central for the advancement of 
synthesized speech. For this purpose a research project at Cambridge 
University has set out to document the diverse intonation patterns in the 
British Isles. Grabe and Post show some of the results of this project.

It can be seen that the book has been compiled with a lot of thought, 
covering a large number of different topic areas within corpus 
linguistics. The editors' introductions to each chapter are very useful as 
they not only briefly summarize the chapter but also put it into context 
for the readers. All chapters are relatively short so that they are not 
overwhelming for a reader new to the area, and all were selected for their 
importance to the field of corpus linguistics. The editors also supply a 
very useful list of URLs as an appendix. Personally, coming from an 
Applied Linguistics background, I would have preferred some more material 
on learner corpora, as can be found in the books by Granger (1998) and 
Granger, Hung, and Petch-Tyson (2002), more on the kind of corpus-based 
material now developed for language teaching purposes, as can be seen, for 
example, in Tim Johns's data-driven learning, or more on how corpora can 
be used by language learners themselves. This area could have been more 
extensively covered, especially as the editors make repeated reference to 
the fact that most work on corpora has been initiated by the EFL 
profession. 

Overall, however, it can be said that the book is an extremely valuable 
resource to own, not only for corpus linguists as a reference, but also 
for those newly interested in the area who want to understand the wider 
field of corpus linguistics as well as the historical development that it 
has undergone. 

REFERENCES

Granger, S. (Ed.). (1998). Learner English on Computer. London, New York: 
Longman.

Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer Learner 
Corpora, Second Language Acquisition and Foreign Language Teaching. 
Amsterdam, Philadelphia: John Benjamins Publishing Company.

 

	
ABOUT THE REVIEWER


Ute Knoch is a research assistant and a PhD candidate at the University of
Auckland, New Zealand. Her special interests are in the area of corpus
linguistics and language assessment.


Page Updated: 30-Jul-2010
