Review of The Routledge Handbook of Corpus Linguistics

Reviewer: Kornel Bangha
Book Title: The Routledge Handbook of Corpus Linguistics
Book Author: Anne O'Keeffe Michael McCarthy
Publisher: Routledge (Taylor and Francis)
Linguistic Field(s): Applied Linguistics
Text/Corpus Linguistics
Book Announcement: 23.336

EDITORS: O'Keeffe, Anne; McCarthy, Michael
TITLE: The Routledge Handbook of Corpus Linguistics
SERIES: Routledge Handbooks in Applied Linguistics
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2010

Kornel Bangha, Vantage Linguistics, PA, USA

The Routledge Handbook of Corpus Linguistics (RHCL) provides an overview of
corpus linguistics (CL), as a resource for advanced undergraduates and
postgraduates. The book contains 45 contributions from 54 authors divided into
eight major sections. Each contribution is divided into five sub-parts, followed
by further readings and references.

In Section I, the first contribution, by the editors, presents the evolution of
corpora from their historical origins (starting with the earliest Bible
concordances) up to their various types and uses in modern day applications.
Elena Tognini Bonelli's contribution offers an overview of the evolution of
corpus linguistics: it describes a shift of focus in linguistics from a
data-driven approach to an approach based on intuition and introspection -- and
back again to a data-driven approach; she explains that a corpus is
fundamentally different from a text because the former, unlike the latter, brings
together many different texts and therefore cannot be identified with a unique
and coherent communicative event; she concludes that, using Saussurian
terminology, a text is an instance of ‘parole’ while the patterns uncovered in
corpus evidence yield insight into ‘langue’; finally, the chapter presents a
corpus typology, originally proposed in the course of an EU project.

Section II, Building and designing a corpus: what are the key considerations?,
starts with Randi Reppen's description of key considerations: the chapter covers
the basics, the kind and size of data to collect, how to collect texts and how
much mark-up is needed, and ends with a look to the future. Svenja Adolphs and Dawn
Knight write about the process of building a spoken corpus: corpus design,
metadata collection (citing Burnard (2005: 31) who states that 'without metadata
the investigator has nothing but disconnected words of unknowable provenance or
authenticity'); the transcription of spoken data and the issue of spoken
interaction being multi-modal in nature including prosodic, gestural and
environmental elements as well; and the analysis of spoken corpora. Mike Nelson
discusses the process of building a written corpus: what this process entails;
how a corpus should be planned; sampling, balancing and representativeness;
gathering, organizing and annotating texts. Almut Koester starts the next
contribution with arguments in favor of small specialized corpora: based on
Carter and McCarthy (1995), she argues that grammatical items are sufficiently
frequent to be reliably studied using a relatively small corpus, that a smaller
data-set is more manageable, and also that there is a closer link between the
corpus and the context in the case of smaller corpora. The author then discusses
how small and specialized corpora should or could be, noting that spoken corpora
tend to be smaller than written ones; this is followed by some considerations in
the design of small corpora, and issues of compilation and transcription are also
discussed. Brian Clancy discusses how to build a corpus to represent a variety
of a language. He starts by examining what a variety of language is; then he
continues with issues like size, diversity, representativeness and balance;
finally he presents two case studies about a language variety. Paul Thompson is
interested in building a specialized audio-visual corpus. First, he presents the
characteristics of such corpora and argues for the fine granularity of the
corpus annotation to be the most useful. Then he describes the major steps in
the building process: data collection (consent, location, equipment, skills...),
transcription, annotation, assembly and analysis.

The first contribution of Section III (Analysing a corpus -- What are the
basics?) was written by David Y. W. Lee. He offers a non-exhaustive overview
of the currently available ready-made corpora: general and specialized; spoken,
written or both; both in English and in other languages. Jane Evison covers the
basics of analyzing a corpus: how to manipulate and exploit word frequency
lists, key word lists and concordance lines. She states that corpora are useful
not in themselves but through the analysis and manipulation of data they
contain. Mike Scott describes what corpus software in general and WordSmith (his
own software) in particular can do. He starts by explaining what computers are
good at, what they are bad at, and why; then he addresses some issues of
re-formatting and re-organizing data; finally he briefly describes how to
process concordances, wordlists and key word lists. Susan Hunston is interested
in the exploration of patterns in a corpus: what patterns are, why they are
difficult to identify, how to find them in concordance lines, and finally how to
assess their frequency. Christopher Tribble's
contribution describes concordances. It starts with a clear definition: 'a
concordance is a collection of occurrences of a word-form, each in its own
textual environment...' (Sinclair 1991: 32). Both historical (like Becket's) and
modern ones are covered in the paper, including tools (like WordSmith Tools) and
methods: working with lemmas, sorting and sampling, restricted searches, just to
name a few. Xiaofei Lu studies what corpus software can reveal about language
development. The author first defines what language development is and presents
the three most influential approaches to it: rationalist, empiricist and
pragmatist. He also describes how to measure language development, and discusses
how a corpus can be used to learn more about first and second language development.
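
As a rough illustration of two of the basic tools named in this section, a word
frequency list and concordance (key-word-in-context) lines, here is a minimal
Python sketch. It is not taken from the RHCL and does not reproduce WordSmith
Tools; the function names, the deliberately naive tokenizer and the sample
sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (a naive, illustrative tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)


def concordance(tokens, node, width=3):
    """Return one (left context, node, right context) triple per hit of `node`.

    `width` is the number of tokens of context kept on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence "corpus", used only to exercise the functions.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

for word, count in frequency_list(tokens, top=3):
    print(f"{word}\t{count}")

# Print the hits aligned on the node word, as concordancers usually do.
for left, node, right in concordance(tokens, "sat", width=2):
    print(f"{left:>12}  [{node}]  {right}")
```

A real study would replace the tokenizer with one suited to the corpus's
mark-up and would add sorting and sampling of the concordance lines, the
methods the Tribble chapter is reported to cover.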

Section IV (Using a corpus for language research) starts with Rosamund Moon's
contribution, 'What can a corpus tell us about lexis?'. She examines questions
like how many words comprise the main vocabulary of a language, what we can
learn about a word from looking at the words with which it co-occurs, how far
the meanings of words are derived from context, how different senses and uses of
words are distinguished in context, how corpora can help studying synonyms, what
we can learn about lexis from a spoken corpus. Chris Greaves and Martin Warren
study what a corpus can tell us about multi-word units. They discuss what
multi-word units are, including not only n-grams but also discontinuous units,
and why and how they are important. Susan Conrad studies what a corpus can show
about grammar, shifting the focus from what is acceptable or unacceptable to the
choices speakers actually make. Douglas Biber's paper covers what a corpus
can indicate about registers and genres. First, a distinction is established
between the genre perspective and the register perspective, then various aspects
of the register variation are presented and finally corpus-based genre studies
are briefly discussed. Michael Handford studies the corpora of specialized
genres. He mentions several criticisms of corpus linguistics and presents a
rationale for specialist corpora and the genre approach. Corpus study in
academic genres, professional genres and non-institutional genres is also
examined. Scott Thornbury discusses what a corpus can reveal about discourse,
what the limitations are and how to overcome them, how a corpus-based approach
works in practice and what kind of data is needed for this. Christoph Rühlemann's
contribution investigates what corpora can tell us about pragmatics: after
outlining the restrictions involved, he discusses various pragmatic
phenomena. Thuc Anh Vo and Ronald Carter study what a corpus can reveal about
creativity. The authors discuss the concept of creativity, how it is related to
corpora, what corpora can reveal about it, spoken and written aspects of
creativity, and finally some other manifestations of creativity found in corpora.
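
The distinction between contiguous n-grams and discontinuous multi-word units
mentioned above can be made concrete with a short Python sketch. This is an
assumption-laden illustration, not Greaves and Warren's method: the function
names, the window-based notion of co-occurrence and the toy token sequence are
invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def windowed_pairs(tokens, window=3):
    """Count ordered word pairs whose second word occurs within `window`
    tokens after the first, so that intervening words are allowed."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + 1 + window]:
            pairs[(first, second)] += 1
    return pairs


# Hypothetical token sequence: "play ... role" with varying material in between.
tokens = "play a role play a key role play a very important role".split()

trigram_counts = Counter(ngrams(tokens, 3))
pair_counts = windowed_pairs(tokens, window=4)

# The contiguous trigram captures only the uninterrupted variant ...
print(trigram_counts[("play", "a", "role")])
# ... while the windowed pair also captures the variants with inserted words.
print(pair_counts[("play", "role")])
```

Counting raw co-occurrences in a window is only a starting point; deciding
which pairs count as genuine multi-word units would need a measure of
association, which is outside the scope of this sketch.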

Winnie Cheng wrote the first contribution in Section V (Using a corpus for
language pedagogy and methodology), addressing the role of corpora in language
teaching. Following Johns (1991: 30), the author emphasizes the importance of
data-driven learning (DDL) and illustrates how corpora can be used by students,
teachers and even editors of grammars. The contribution written by Steve Walsh
covers how corpora can be exploited in creating language teaching materials.
Corpus-based materials to teach speaking, listening, reading and writing are
discussed, and the merits of learner corpora (over invented textbook dialogues,
for instance) are explored in detail. Angela Chambers writes about
data-driven learning. Her paper covers a brief history of DDL, how it can be
used and how it changes language pedagogy. Gaëtanelle Gilquin and Sylviane
Granger discuss the possible applications of DDL: its advantages (like bringing
authenticity and providing corrective functions), the resources it requires (a
corpus and tools to exploit the corpus) and the activities it involves. Their
contribution also covers the problems and limitations of DDL; when it comes
to evaluation, they admit with remarkable honesty that the claims about the
effectiveness of DDL are largely an act of faith. Passapong Sripicharn is
interested in preparing learners for using language corpora. The author covers
topics like assessing students' knowledge of corpora and their objectives,
preparing learners to build and use corpora, familiarizing them with different
tools and interpreting results.

Section VI (Designing corpus-based materials for the language classroom) starts
with the contribution of Martha Jones and Philip Durrant about corpora and
vocabulary teaching materials. The importance of vocabulary (including
lexicalized phrasal units), the type of corpus suitable for academic vocabulary
learning and the design are among the topics discussed in the paper. Rebecca
Hughes writes about corpora and grammar teaching materials: the role of corpora,
their benefits (e.g. providing evidence of frequency, encouraging more
autonomous learning), their limitations and their future development. Jeanne
McCarten's contribution is about corpus-informed course book design. She
suggests useful considerations in choosing a corpus, and discusses areas of the
course book where a corpus can inform, the use of corpus data in course books
and the future of corpus-informed course books. She does mention some
realizations, like the Collins COBUILD English Grammar, but also admits that the
actual use of corpora in this field is rather limited. Elizabeth Walter
discusses the use of corpora in dictionary writing: the reasons to use corpora,
their size and their content, and the analysis tools for lexicographers. She
also illustrates how to use a corpus, paying special attention to learner
corpora and concludes with current limitations and future developments. Lynne
Flowerdew reviews recent corpus applications to various aspects of writing,
covering English for General Academic Purposes and English for Specific Academic
Purposes, followed by discussing the issues in the application of corpora and
possible future expansions and extensions. Averil Coxhead is interested in the
relationship between corpora and English for Academic Purposes (EAP). Five
questions are addressed: what can corpora reveal about aspects of academic
language in use; how can corpora influence EAP pedagogy; how can corpora be used
in EAP materials; what can a corpus tell us about EAP learner language; and what
might the future be for corpora in EAP? Elaine Vaughan's contribution is about
using corpora for teachers' own research. She mentions reasons to do that and
issues in doing so; she also discusses the use of corpora inside and outside the
classroom.

Section VII covers the topic of using corpora to study literature and
translation. Marie-Madeleine Kenning describes parallel and comparable corpora.
She explains what they are, mentions some existing ones, and discusses their
compilation and use. The contribution by Natalie Kübler and Guy Aston is about
using corpora in translation: the purposes, processes and types of corpora used.
They also cover special issues like the translator's need to take into account
the reader's knowledge. Dan McIntyre and Brian Walker are interested in the use
of corpora to study the language of poetry and drama. They illustrate the use of
corpora through case studies of poems of William Blake and some blockbusters.
Carolina P. Amador-Moreno investigates the use of corpora to explore literary
speech representation. She discusses similarities and differences between real
and fictional speech, how to use corpora to compare them, and includes a case
study of an Irish novel, concluding with thoughts on the limitations of using
corpora to study speech representation.

Gisle Andersen's contribution about how to use corpus linguistics in
sociolinguistics begins Section VIII (Applying corpus linguistics to other areas
of research). The author discusses advantages and limitations, proposes a few
rules of thumb, provides examples of corpus-based sociolinguistic studies and
considers possible future developments. Kieran O'Halloran writes about the use
of corpus linguistics in the study of media discourse. The author discusses the
corpus-based approach to Critical Discourse Analysis and presents a case study
of a British newspaper. Janet Cotterill is interested in the use of corpus
linguistics in forensic linguistics. She presents various characteristics of a
forensic corpus, discusses some major tasks (like identifying or eliminating
authorship) and concludes with some limitations and future challenges. Annelie
Ädel covers corpus linguistics and political discourse. She presents what
political discourse is, its relationship to corpora and techniques for exploring
it, gives examples of topics, and concludes with reflections on possible
future developments. Sarah Atkins and Kevin Harvey write about the use of
corpora in the study of health communications. They explain the importance of
studying healthcare communication, present some corpus-based studies, describe
the creation of a specialized corpus (related to adolescent health) and the use
of this corpus to explore patterns. Fiona Farr presents the use of corpora in
teacher education. She shows how CL reinforces current approaches and practices
in Language Teacher Education, discusses three types of relevant corpora
(corpora of classroom language, learner corpora and pedagogic corpora), the use
of corpora to develop language awareness skills and finally the use
of specialized corpora. Fiona Barker's contribution discusses corpus-informed
language testing. She describes language testing, provides examples of corpora
developed for this purpose, and discusses the use of both learner corpora and
native speaker corpora.

The RHCL covers an impressively large and very specific set of topics related to
corpus linguistics. Readers interested in these topics will probably find what
they are looking for either in the corresponding contribution or in the list of
publications included in further readings and references. Surprisingly, there is
no contribution dealing directly with the use of corpus linguistics in Natural
Language Processing or computational linguistics.

Cooperation between so many contributors (54!) could have been realized in two
completely different ways. The book could have aimed simply to be a collection of
autonomous contributions (independent papers). Alternatively, it could have
aimed to be a coherent, unified work. In the first case, each
author would have defined key notions on their own, independently of the others,
while in the second, key notions would have been agreed upon and defined once and
for all (ideally when they first appear). Unfortunately, this book falls somewhere
in between: cross-references are frequent but do not create coherence. For
example, as early as page 16, ''balance'' and ''representativeness'' are used
without being defined. The index entry for ''balance'' refers to pages 86-87 and
60. Those pages do discuss the notion of balance but do not offer any formal
definition. Fortunately, representativeness fares better: the index also points
to pages 86-87, where we find a definition from Leech (1991: 27). Koester also
invokes representativeness (p. 69) but points to Reppen instead. Evison proposes
her own definition of keyness on p. 127. These are just a few examples of
redundant, discrepant or missing definitions. Readers might appreciate a glossary at
the end of the volume; let us hope that one will be included in a future edition.

Most but not all of the corpora discussed in the RHCL are in English. This,
however, is unlikely to be a shortcoming of the book: it probably reflects the
predominance of English in existing corpora.

It is worth noting that many of the contributors found Mike Scott's WordSmith
software very useful. Linguists interested in CL might want to give it a try and
assess if it also fits their needs.

In scientific papers, one expects an abstract, an introduction and a conclusion,
all of which make a paper easier to understand. Unfortunately, these are generally
lacking in this volume; they are partially or informally present in only a few
cases (e.g. O'Halloran).

More than half of the contributions (23 of 45) contain the word ''can'' in the
title. For example, Coxhead addresses five questions: four contain ''can'', the
fifth ''might''; ''can'' appears six times in the first page and a half of Kübler
and Aston, etc. Readers might wonder why there is so much to say about what can
be done compared to what has been done. Does it mean that the authors focused
more on potential than on accomplishment? Or is this an indication that the
Golden Age of corpus linguistics is yet to come? The papers give the clear
impression that corpus linguistics can achieve more than what it has already
achieved. Also, many of the contributions end with a section titled 'Looking to
the future' or something similar. This is another indication that contributors
believe that corpus linguistics has more to offer.

A future edition would benefit from reconsidering some of the issues raised in
this evaluation: regrouping some of the minor topics and including other major
ones; improving the book's coherence; adding
abstracts/introductions/conclusions; putting greater emphasis on what corpus
linguistics has actually accomplished.

Burnard, L. (2005) 'Developing Linguistic Corpora: Metadata for Corpus Work' in
M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice. Oxford:
Oxbow Books, pp. 30-46.

Carter, R. and McCarthy, M. (1995) 'Grammar and the Spoken Language', Applied
Linguistics 16(2): 141-58.

Johns, T. (1991) 'From Printout to Handout: Grammar and Vocabulary Teaching in
the Context of Data-driven learning', English Language Research Journal 4: 27-45.

Leech, G. (1991) 'The State of the Art in Corpus Linguistics', in K. Aijmer and
B. Altenberg (eds) English Corpus Linguistics. London: Longman, pp. 8-30.

Sinclair, J.M. (1991) Corpus, Concordance and Collocation. Oxford: Oxford
University Press.

Kornel Bangha studied linguistics in Paris and Montreal. He was a post-doctoral research fellow at INRIA, France. Since 2005, he has been working for software companies in Canada and the USA. His main expertise is linguistic data curation for software development.

Format: Hardback
ISBN: 0415464897
ISBN-13: 9780415464895
Pages: 624
Prices: U.K. £ 110
U.S. $ 175