Review of Quantitative Corpus Linguistics with R

Reviewer: Michael Thomas Pace-Sigge
Book Title: Quantitative Corpus Linguistics with R
Book Author: Stefan Th. Gries
Publisher: Routledge (Taylor and Francis)
Linguistic Field(s): Text/Corpus Linguistics
Discipline of Linguistics
Issue Number: 20.4066

AUTHOR: Stefan Th. Gries
TITLE: Quantitative Corpus Linguistics with R
SUBTITLE: A practical introduction
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2009

Michael Pace-Sigge, School of English, University of Liverpool, UK


The announcement for this book said: ''The first textbook of its kind,
'Quantitative Corpus Linguistics with R' (QCLwR) demonstrates how to use the
open source programming language R for corpus linguistic analyses. Computational
and corpus linguists doing corpus work will find that R provides an enormous
range of functions that currently require several programs to achieve --
searching and processing corpora, arranging and outputting the results of corpus
searches, statistical evaluation, and graphing.'' This is exactly what it does.

I have to start this review with a criticism, however. When Routledge published
this, they clearly thought only the youngest, best-sighted readers would make
use of it. So they chose a font that saves them paper (and, presumably, the
planet) but is incredibly hard to read for anyone over 30 because it is just too
small. This youthful approach is underlined by the choice of book jacket that,
even for a paperback, seems very thin.

In other words: the worst bits about this book are not the author's fault,
because Stefan Gries produced an admirable piece with QCLwR. For a start, this
is one of the first books with an interactive side to it. The companion website
is not just the boring old bit of marketing with some addenda thrown in. The
website here is an essential part of working with the book, and has a platform
to ask questions, discuss problems, and offer new solutions. As QCLwR is indeed
a very practical introduction and clearly aimed at students who develop their
own corpus linguistics tools, this is a very welcome approach.


QCLwR has four distinct parts:

1) Chapter 2 is a very good, concise and comprehensive introduction to Corpus
Linguistics (CL), what it is for, and what research techniques are available.
This starts with the very basic differentiation between 'corpus, text archive
and example collection' and leads to a step-by-step introduction to statistics.
Here the reader is introduced to features that show how well thought-out this
book is as an instruction manual: At the end of each section (not just chapter)
there is a little grey box with the literature for 'further study / exploration'.

On page 13, the reader gets the first experience with the interactive part of
QCLwR: the exercise box, where the reader is asked (1) 'Write up a plain English
definition how you would ''tell a computer programme'' what a word is'. That
seems simple enough until (2) 'How does your definition handle the expressions
''better-suited'' and ''ill-defined''? ''Armchair-linguist'' and ''armchair
linguist''...' Yes, you will have to stop and think, but it is necessary.
Unfortunately, the explanation (and password) for how to find the rest of the
exercise boxes is hidden in the text, which can be confusing. (All other
exercises are on the companion website and the key can be found on page 20.)
There is also an 'exercise light' version that pops up throughout: the 'think
break'. It is indeed a break from simply consuming text, a moment to stop and
think. A question is posed and a cartoon asks the reader to give it some
thought. A (possible) answer is then provided. Given that this book sees itself
mainly as an instruction manual -- either for self-study or in a course -- this
cannot be commended enough.
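
To show why the 'what is a word' exercise is harder than it looks, here is a
minimal sketch of one possible answer. It is written in Python rather than R and
is not taken from the book; the pattern and the function name are assumptions
made purely for illustration. It treats a word as a run of letters that may
contain internal hyphens or apostrophes.

```python
import re

# Hypothetical definition: a word is a run of ASCII letters, optionally joined
# by single internal hyphens or apostrophes (so "ill-defined" is one word).
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def tokenize(text):
    """Return the lower-cased words found in text under the definition above."""
    return [match.group(0).lower() for match in WORD_PATTERN.finditer(text)]


print(tokenize("An armchair-linguist is better-suited than an armchair linguist."))
```

Under this definition 'armchair-linguist' counts as one word and 'armchair
linguist' as two, which is exactly the kind of decision the exercise forces the
reader to make explicit. Other definitions (for instance, splitting at hyphens)
are just as defensible.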

2) The introduction to R (and beyond) is presented in chapters 3 and 4. QCLwR
is a companion piece to 'Data Manipulation with R' (Spector 2008) and I am
confident that this open source software will see a growing number of
applications (and written guides) in years to come. The first impression is
daunting and takes me back to the late 1980s, when the first people at school
dabbled in computers. Anyone who did so will find themselves very much at home
with R. There
is much commercially available concordancing software out there -- notably
WordSmith. These tools were developed and then marketed when there was simply
nothing else available and, by now, have reached a high degree of sophistication
and penetration. However, like all standardised software, there are clear
drawbacks. QCLwR addresses these as it tries to teach people to write the
software they need for themselves and others -- and it is available regardless
of the platform used (Windows, Linux, or Mac). At this point the user needs to
decide which disadvantage is greater: the strictures of off-the-shelf software
or the time and effort it takes to write one's own programme. Once the decision
is made in favour of the latter option, Gries presents us with an impressive
guide on how to do it. It starts with relatively simple instructions that are
the basis to general programming and, building on this, moves on to construct
programmes that answer real demands within CL. These include simple things like
why it is better to edit data frames in R than in Excel 2003 (Excel does not
have enough columns, probably because it was not developed with text mining in
mind -- cf. p. 51) or the overriding importance of knowing the corpus you are
working with, in particular when it comes to tags and transcriptions (see p.
68). Gries gives a broad idea of what 'vectors' are for, and proves a good
teacher through the variety of approaches used in the book. While the book's
premise is learning-by-doing, Gries is aware that doing includes making mistakes
and he tries to make use of this fact. For example, on p. 76, the reader is
asked to let the programme just developed 'retrieve the length of the matches',
only to be told: 'Note that this does not work (...) Why not?'

By the end of chapter 3 the reader will have learned, amongst other things, how
to make the computer read dates in any format, how to clean up BNC files, and
how to compress and save files and data structures. Chapter 4 lets the reader (or
student doing the module) apply what they have learned. This includes some of
the juicier bits of corpus research -- for example, generating frequency lists
of word pairs (page 126), how to study grammatical constructions (referred to as
'advanced regular expressions' -- pp. 141ff.), or processing corpora that
provide extra-textual information ('multi-tiered corpora' -- pp. 156ff.).
Equally impressive is the excursion into 'Unicode' and what can be done with R
when the corpora in question do not use Latin script.
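
As an illustration of what a frequency list of word pairs involves, the
following is a minimal sketch in Python (not R, and not the book's own code).
The function name, the tokenizing pattern, and the sample sentence are
assumptions made for the example.

```python
import re
from collections import Counter


def word_pair_frequencies(text):
    """Count adjacent word pairs (bigrams) in text, ignoring case."""
    # Assumed tokenization: letters with optional internal hyphens/apostrophes.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    # Pair every word with its successor and tally the pairs.
    return Counter(zip(words, words[1:]))


sample = "the cat sat on the mat and the cat slept"
for pair, count in word_pair_frequencies(sample).most_common(3):
    print(pair, count)
```

Note that this simple version also counts pairs that straddle sentence
boundaries; a real corpus script would probably need to split the text into
sentences or utterances first.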

3) Statistics and CL (chapter 5): Statistics is becoming increasingly important
to CL. Little wonder: ''Statistical significance'' describes relevant
differences between two sets of data. That corpus linguists tend to shy away
from it is not surprising either: so do many scientists. Statistical
calculations are difficult and rely on a multitude of factors. Consequently, if
they are not really needed, researchers do not learn about them. Yet, to prove
greater validity of their claims, corpus linguists will need to be more familiar
with statistical concepts. QCLwR gives a solid foundation on how to write
statistical programmes that are relevant to CL research.
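
To make the notion of a significant difference between two sets of data
concrete, here is a minimal Python sketch (again not R and not from the book) of
a Pearson chi-squared test on a 2x2 table, without continuity correction. The
function name and the counts are invented for illustration; they stand for the
frequency of one word versus all other tokens in two corpora.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for one degree of freedom.
    Assumes every row and column total is greater than zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, the chi-squared tail probability can be
    # expressed through the complementary error function.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented counts: the word occurs 120 times in 10,000 tokens of corpus A
# and 80 times in 10,000 tokens of corpus B.
stat, p = chi_squared_2x2(120, 9880, 80, 9920)
print(f"chi-squared = {stat:.2f}, p = {p:.4f}")
```

A small p-value suggests that the difference in frequency is unlikely to be due
to chance alone, although whether this test suits a given corpus design is a
separate question.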

In chapter 5 I felt at times a lack of the clarity and ease-of-use that is found
in other sections of this book. For a number of points explained in chapter 5,
Gries seems to start with the more difficult issue and then move on to the
simpler ones. Similarly, when he refers to his own data (on p. 195), a reader
can easily feel confused. Self-reference may work very well in a classroom, but
on a page it does not. On the whole, though, chapter 5 is extremely useful, and
many things are explained in an accessible way (and not only the difference
between 'very' and 'highly' significant). However, it would be better still were
certain parts slightly rearranged.

4) Further applications: Not many people seem to be aware that CL is actually
used outside lexicography. Gries gives a brief overview of some of the other
areas where CL is found to be useful, including psycholinguistics and applied
linguistics. This means the book is not purely focused on developing R for
research purposes. Indeed, it can be used as an initial point of reference
without even working with R. This, together with (1), gives the book a balanced
and rounded feel.

To conclude: This is an outstanding work by a scholar who brings in massive
experience of how to teach, and also manages to translate this onto the page. An
instructor will find it the perfect textbook for a module on how to use R for
corpus linguistic investigations. It is the book for this time -- shown by the
fact that the index refers to literature that has, mostly, been published within
the last four years. Any user will find QCLwR extremely versatile and, by and
large, a step-by-step guide to building their programming skills. Beyond the 'Think
Breaks' and material for further exploration, Gries keeps reminding the reader
that often, there are no correct answers: not the writer, but the reader may
come up with the 'more elegant, more efficient, simpler, easier-to-use' answer.
Such encouragement is nothing but laudable.


Spector, Phil. 2008. Data Manipulation with R. New York: Springer.

WordSmith Tools: Published by Oxford University Press since 1996 and now at
version 5.0:

Michael TL Pace-Sigge is University Teacher in the School of English at the University of Liverpool. His research interests lie mainly in corpus linguistics and spoken language research. After completing his MA on lenition in Liverpool English stop consonants, using spectrography as sound representation, he moved on to do his PhD on the use of lexis in Liverpool English (due for completion in 2009). He is particularly interested in Michael Hoey's theory of Lexical Priming, and evidence of priming forms a central part of his thesis. His other main area of interest is phonology, in particular the extent to which David Brazil's work on the discourse intonation system can be applied in describing language-in-use.

Format: Hardback
ISBN: 0415962714
ISBN-13: 9780415962711
Pages: 248
Prices: U.S. $ 120.00
U.K. £ 65.00
Format: Paperback
ISBN: 0415962706
ISBN-13: 9780415962704
Pages: 248
Prices: U.K. £ 26.99
U.S. $ 49.95
Format: Electronic
ISBN: 0203880927
ISBN-13: 9780203880920
Pages: 256
Prices: U.S. $ 49.95