LINGUIST List 20.4066

Sat Nov 28 2009

Review: Text/Corpus Linguistics: Gries (2009)

Editor for this issue: Monica Macaulay

        1.    Michael Pace-Sigge, Quantitative Corpus Linguistics with R

Message 1: Quantitative Corpus Linguistics with R
Date: 28-Nov-2009
From: Michael Pace-Sigge
Subject: Quantitative Corpus Linguistics with R

AUTHOR: Stefan Th. Gries
TITLE: Quantitative Corpus Linguistics with R
SUBTITLE: A practical introduction
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2009

Michael Pace-Sigge, School of English, University of Liverpool, UK


The announcement for this book said, ''The first textbook of its kind, ''Quantitative Corpus Linguistics with R'' (QCLwR) demonstrates how to use the open source programming language R for corpus linguistic analyses. Computational and corpus linguists doing corpus work will find that R provides an enormous range of functions that currently require several programs to achieve -- searching and processing corpora, arranging and outputting the results of corpus searches, statistical evaluation, and graphing'' -- and this is exactly what it does.

I have to start this review with a criticism, however. When Routledge published this, they clearly thought only the youngest, best-sighted readers would make use of it. So they chose a font that saves them paper (and, presumably, the planet) but is incredibly hard to read for anyone over 30 because it is just too small. This youthful approach is underlined by the choice of book jacket that, even for a paperback, seems very thin.

In other words: the worst bits about this book are not the author's fault, because Stefan Gries produced an admirable piece with QCLwR. For a start, this is one of the first books with an interactive side to it. The companion website is not just the boring old bit of marketing with some addenda thrown in. The website here is an essential part of working with the book, and has a platform to ask questions, discuss problems, and offer new solutions. As QCLwR is indeed a very practical introduction and clearly aimed at students who develop their own corpus linguistics tools, this is a very welcome approach.


QCLwR has four distinct parts:

1) Chapter 2 is a very good, concise and comprehensive introduction to Corpus Linguistics (CL), what it is for, and what research techniques are available. This starts with the very basic differentiation between 'corpus, text archive and example collection' and leads to a step-by-step introduction to statistics. Here the reader is introduced to features that show how well thought-out this book is as an instruction manual: At the end of each section (not just chapter) there is a little grey box with the literature for 'further study / exploration'.

On page 13, the reader gets a first taste of the interactive part of QCLwR, the exercise box, where the reader is asked (1) 'Write up a plain English definition how you would ''tell a computer programme'' what a word is'. That seems simple enough until (2) 'How does your definition handle the expressions 'better-suited' and 'ill-defined'? 'Armchair-linguist' and 'armchair linguist'...' Yes, you will have to stop and think, but it is necessary. Unfortunately, the explanation (and password) for how to find the rest of the exercise boxes is hidden in the text, which can be confusing. (All other exercises are on the companion website and the key can be found on page 20.) There is also an 'exercise light' version that pops up throughout: the 'think break'. It is indeed a break from simply consuming text, a moment to stop and think: a question is posed, a cartoon asks the reader to give it some thought, and a (possible) answer is then provided. Given that this book sees itself mainly as an instruction manual -- either for self-study or in a course -- this cannot be commended enough.
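To make the point of that exercise concrete, here is a minimal sketch, written in Python rather than R and not taken from the book. The two regular expressions are hypothetical 'definitions' of a word, and the sample sentence is invented for illustration.

```python
import re

# Two hypothetical "definitions" of a word, expressed as regular expressions.
# Neither comes from the book; they only illustrate the exercise's point.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")                      # a word is a run of letters
LETTERS_WITH_HYPHENS = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")  # hyphens may join letter runs

def tokenize(text, pattern):
    """Return every stretch of text that matches the given word definition."""
    return pattern.findall(text)

# Invented sample sentence containing the kinds of expressions the exercise mentions.
sample = "An armchair-linguist is a better-suited armchair linguist."

split_tokens = tokenize(sample, LETTERS_ONLY)
hyphen_tokens = tokenize(sample, LETTERS_WITH_HYPHENS)

print(len(split_tokens), split_tokens)
print(len(hyphen_tokens), hyphen_tokens)
```

The two definitions disagree on how many words the sentence contains, and the second one treats 'armchair-linguist' and 'armchair linguist' differently. That is exactly the kind of decision the exercise forces the reader to make.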

2) The introduction to R (and beyond) is presented in chapters 3 and 4. QCLwR is a companion piece to 'Data Manipulation with R' (Spector 2008) and I am confident that this open source software will see a growing number of applications (and written guides) in years to come. The first impression is daunting and takes me back to the late 1980s, when the first people at school dabbled in computers. Those early adopters will find themselves very much at home with R. There is much commercially available concordancing software out there -- notably WordSmith. These tools were developed and marketed when there was simply nothing else available and have by now reached a high degree of sophistication and penetration. However, as with all standardised software, there are clear drawbacks. QCLwR addresses these by trying to teach people to write the software they need for themselves and others -- and R is available regardless of the platform used (Windows, Linux, or Mac). At this point the user needs to decide which disadvantage is greater: the strictures of off-the-shelf software or the time and effort it takes to write one's own programme. Once the decision is made in favour of the latter option, Gries presents us with an impressive guide on how to do it. It starts with relatively simple instructions that are the basis of general programming and, building on this, moves on to construct programmes that answer real demands within CL. Along the way he covers simple points such as why it is better to edit data frames in R than in Excel 2003 (Excel does not have enough columns, probably because it was not developed with text mining in mind -- cf. p. 51) and the overriding importance of knowing the corpus you are working with (in particular when it comes to tags and transcriptions) -- see p. 68. Gries gives a broad idea of what 'vectors' are for, and proves a good teacher through the variety of approaches used in the book.
While the book's premise is learning-by-doing, Gries is aware that doing includes making mistakes and he tries to make use of this fact. For example, on p. 76, the reader is asked to let the programme just developed 'retrieve the length of the matches', only to be told: 'Note that this does not work (...) Why not?'

By the end of chapter 3 the reader will have learned, amongst other things, how to make the computer read dates in any format, clean up BNC files and how to compress and save files and data structures. Chapter 4 lets the reader (or student doing the module) apply what they have learned. This includes some of the juicier bits of corpus research -- for example, generating frequency lists of word pairs (page 126), how to study grammatical constructions (referred to as 'advanced regular expressions' -- pp. 141ff.), or processing corpora that provide extra-textual information ('multi-tiered corpora' -- pp. 156ff.). Equally impressive is the excursion into 'Unicode' and what can be done with R when the corpora in question do not use Latin script.
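As an illustration of what a frequency list of word pairs involves, here is a minimal sketch in Python rather than R. It is not the book's code: the function name `bigram_frequencies` and the sample sentence are my own assumptions, and the whitespace tokenisation is deliberately naive.

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count every pair of adjacent tokens (hypothetical helper, not from the book)."""
    return Counter(zip(tokens, tokens[1:]))

# Invented sample text; a real study would read and clean corpus files first.
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()

freqs = bigram_frequencies(tokens)

# Print the word pairs from most to least frequent.
for (first, second), count in freqs.most_common():
    print(first, second, count)
```

Pairing the token list with a copy of itself shifted by one position is the whole trick; everything else is counting and sorting.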

3) Statistics and CL (chapter 5): Statistics is becoming increasingly important to CL. Little wonder: tests of ''statistical significance'' indicate whether a difference between two sets of data is unlikely to be due to chance alone. That corpus linguists tend to shy away from statistics is not surprising either: so do many scientists. Statistical calculations are difficult and rely on a multitude of factors. Consequently, researchers tend not to learn about them unless they really need to. Yet, to give their claims greater validity, corpus linguists will need to become more familiar with statistical concepts. QCLwR gives a solid foundation on how to write statistical programmes that are relevant to CL research.
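To give a flavour of such a calculation, here is a minimal sketch, in Python rather than R and not taken from the book, of a chi-squared test on a 2x2 table. The counts are invented for illustration: a word is assumed to occur 30 times in one corpus of 1,000 tokens and 10 times in another of the same size. The helper name `chi_squared_2x2` is likewise my own.

```python
def chi_squared_2x2(table):
    """Chi-squared statistic for a 2x2 table of observed counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the word were equally frequent in both corpora.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Invented counts: [occurrences of the word, all other tokens] per corpus.
observed = [[30, 970],
            [10, 990]]

chi2 = chi_squared_2x2(observed)

# 3.841 is the standard critical value for 1 degree of freedom at p < 0.05.
CRITICAL_VALUE_DF1_P05 = 3.841
significant = chi2 > CRITICAL_VALUE_DF1_P05
print(round(chi2, 3), significant)
```

With these invented counts the statistic exceeds the critical value, so the difference in frequency would count as significant at the 5% level. Whether a chi-squared test is appropriate for a given corpus question is a separate matter.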

In chapter 5 I felt at times a lack of the clarity and ease of use found in other sections of this book. For a number of the points explained in chapter 5, Gries seems to start with the more difficult issue and then move on to the simpler ones. Similarly, when he refers to his own data (on p. 195), a reader can easily feel confused. Self-reference may work very well in a classroom, but on a page it does not. On the whole, though, chapter 5 is extremely useful, and many things are explained in an accessible way (and not only the difference between 'very' and 'highly' significant). However, it would be better still were certain parts slightly rearranged.

4) Further applications: Not many people seem to be aware that CL is actually used outside lexicography. Gries gives a brief overview of some of the other areas where CL is found to be useful, including psycholinguistics and applied linguistics. This means the book is not focused purely on developing R for research purposes. Indeed, it can be used as an initial point of reference without even working with R. This, together with (1), gives the book a balanced and rounded feel.

To conclude: This is an outstanding work by a scholar who brings in massive experience of how to teach, and also manages to translate this onto the page. An instructor will find it the perfect textbook for a module on how to use R for corpus linguistic investigations. It is the book for this time -- shown by the fact that the index refers to literature that has, mostly, been published within the last four years. Any user will find QCLwR extremely versatile and, by and large, a step-by-step guide to building their programming skills. Beyond the 'Think Breaks' and material for further exploration, Gries keeps reminding the reader that often there are no correct answers: it may be the reader, not the writer, who comes up with the 'more elegant, more efficient, simpler, easier-to-use' answer. Such encouragement is nothing but laudable.


Spector, Phil. 2008. Data Manipulation with R. New York: Springer.

WordSmith Tools: Published by Oxford University Press since 1996 and now at version 5.0:


Michael TL Pace-Sigge is University Teacher in the School of English at the University of Liverpool. His research interests lie mainly in corpus linguistics and spoken language research. After completing his MA on lenition in Liverpool English stop consonants, using spectrography as sound representation, he moved on to do his PhD on the use of lexis in Liverpool English (due for completion in 2009). He is particularly interested in Michael Hoey's theory of Lexical Priming, and evidence of priming forms a central part of his thesis. His other main area of interest is phonology, particularly how far David Brazil's work on the discourse intonation system can be applied in describing language-in-use.