Review of Corpus Linguistics and Statistics with R

Reviewer: Gözde Mercan
Book Title: Corpus Linguistics and Statistics with R
Book Author: Guillaume Desagulier
Publisher: Springer Nature
Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
Issue Number: 29.4696


“Corpus Linguistics and Statistics with R” by Guillaume Desagulier is a book introducing the principal methods and statistics in corpus linguistics using the programming language R (R Core Team 2018). The author states that “This is a book on empirical linguistics from a theoretical linguist’s perspective” (p. viii). It provides not only clear, hands-on, step-by-step instructions on how to apply these techniques, but also some theoretical discussion on the scope of corpus linguistics.

While its target audience is mainly novices in the fields of programming, statistics and cognitive linguistics, the book may also be of interest to more experienced researchers. As stated on the back cover, it is suitable for use as a textbook in graduate and advanced undergraduate courses as well as for self-study.

The book is part of Springer’s “Quantitative Methods in the Humanities and Social Sciences” series. It consists of two parts, 10 chapters and 353 pages.

In the Preface, Desagulier opens with a personal anecdote about what motivated him to acquire empirical techniques and how he was inspired by Stefan Th. Gries (whose work is frequently cited throughout the book, e.g. Gries 2009), and he explains the intended readership of the book. The Preface also presents the goals of the book and the online supplementary materials, as well as some notes to instructors.

Chapter 1, “Introduction”, presents the theoretical relevance of corpus-informed judgments by contrasting the top-down generativist approach to language (e.g. Chomsky 1957) with the derivative, bottom-up approach of usage-based theories such as cognitive linguistics (e.g. Langacker 1987). In this chapter, Desagulier also explains what makes a corpus by presenting the required criteria, and describes how linguists make use of corpora. He finishes the chapter with an explanation of the role of a corpus within the empirical cycle of a linguist’s work.

The nine chapters following the introductory chapter are grouped into two parts. Part I, entitled “Methods in Corpus Linguistics”, includes five chapters. Part II, entitled “Statistics for Corpus Linguistics”, contains the remaining four chapters.

Chapter 2, the first chapter in Part I, is a practical introduction to the R programming language. It acquaints the reader with the fundamental notions of R and provides step-by-step instructions starting with downloading and installing R. It presents basic R concepts like scripts, packages, variables, assignment, functions and arguments. The chapter also introduces the four main types of R objects, namely vectors, lists, matrices and data frames.

Chapter 3, entitled “Digital Corpora”, presents the different types of corpora. In this chapter, Desagulier also outlines the steps involved in corpus compilation. The chapter contains guidelines for creating one’s own unannotated corpus, and also introduces the properties of ready-made, annotated corpora such as markup, part-of-speech (POS-) tagging and semantic tagging.

Chapter 4 is entitled “Processing and Manipulating Character Strings”. In this chapter, the author aims to teach the basic methods for handling text material with R to lay the basis for applied character string processing. He covers the relevant R functions and regular expressions.

In Chapter 5, “Applied Character String Processing”, Desagulier makes use of and combines the R methods he presented in the previous chapter to demonstrate how to handle text material. He describes basic corpus linguistics operations and covers concordances, data frame creation from an annotated corpus and frequency lists.
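To give a concrete sense of what a frequency list and a concordance are, here is a minimal sketch. It is written in Python rather than the book's R and is not taken from the book; the sample sentence and the function names `tokenize`, `frequency_list` and `concordance` are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, node, window=3):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Made-up sample text, used only for illustration.
sample = "The blood is the life. His blood ran cold, and the blood on the floor was dry."
tokens = tokenize(sample)
freq = frequency_list(tokens)        # word counts, most frequent first
kwic = concordance(tokens, "blood")  # one line per hit, node word in brackets
```

Each concordance line shows the node word with up to `window` tokens of context on either side, which is the layout typically used to inspect a word's usage patterns by eye.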

Chapter 6, which is the final chapter of Part I, aims to teach the readers how to summarize frequency data graphically. There are instructions demonstrating the construction of plots, barplots, histograms, word clouds, motion charts and other visual representations to summarize results.

Part II of the book, “Statistics for Corpus Linguistics”, opens with a short introductory section, emphasizing the relevance and importance of statistics for contemporary linguistics despite some ongoing misconceptions.

The first chapter in this part, Chapter 7, consists of a concise introduction to descriptive statistics. It presents key concepts of descriptive statistics, namely measures of central tendency and dispersion. This chapter serves as the basis for the following one.

Chapter 8 is entitled “Notions of Statistical Testing”. As this title indicates, Desagulier presents some basic concepts of statistical thinking and inferential statistics. He starts with probabilities, and then explains the key notions of populations, samples and individuals, as well as random variables and dependent and independent variables. Next, he covers hypothesis testing and probability distributions. He concludes the chapter with some important statistical tests, namely the chi-square (χ²) test, Fisher’s exact test of independence, and correlation.
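As an illustration of the kind of test covered in this chapter, the following sketch computes Pearson's chi-square statistic for a contingency table by hand. It is in Python rather than R and does not reproduce the book's code; the counts are invented, and the function name `chi_square` is an assumption made for this example.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = two words, columns = two syntactic positions.
counts = [[30, 10],
          [20, 40]]
stat = chi_square(counts)

# A 2x2 table has 1 degree of freedom; the critical value at
# alpha = 0.05 is approximately 3.841.
significant = stat > 3.841
```

For very small expected counts the chi-square approximation becomes unreliable, which is the usual reason for turning to Fisher's exact test instead.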

Chapter 9, “Association and Productivity Measures”, starts with an introduction discussing the role of frequency in the generativist vs. usage-based traditions and outlining the evolution of the concept of frequency from first- to second-generation usage-based linguistics. The chapter covers co-occurrence phenomena (collocation, colligation, collostruction). It presents association measures which quantify the attraction or repulsion between two co-occurring linguistic units, including asymmetric association measures positing a directional dependency between collocates. The chapter concludes with a section on lexical richness and productivity, covering issues such as type-token ratio and vocabulary growth curves.
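Two of these notions are easy to make concrete. The sketch below, in Python rather than R, computes a type-token ratio and ΔP, one commonly used directional association measure. Whether ΔP is formulated in exactly this way in the book is an assumption, and the counts and function names are invented for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def delta_p(a, b, c, d):
    """Directional association between two words from a 2x2 table.

    a: word1 with word2       b: word1 without word2
    c: word2 without word1    d: neither word
    Returns (dP(word2 | word1), dP(word1 | word2)).
    """
    forward = a / (a + b) - c / (c + d)   # how strongly word1 cues word2
    backward = a / (a + c) - b / (b + d)  # how strongly word2 cues word1
    return forward, backward


tokens = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(tokens)  # 8 types over 11 tokens

# Hypothetical co-occurrence counts for a word pair.
forward, backward = delta_p(a=80, b=20, c=120, d=9780)
```

Because the two ΔP values differ, the measure captures the asymmetry that symmetric measures miss: here the first word predicts the second more strongly than the reverse.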

The last chapter of the book, Chapter 10, is on “Clustering Methods”. Desagulier presents five clustering techniques: Principal Component Analysis, t-distributed Stochastic Neighbor Embedding, Correspondence Analysis, Multiple Correspondence Analysis and Hierarchical Cluster Analysis. He explains the principles of each analysis and illustrates applications with case studies. Finally, the chapter also covers cluster dendrograms and network graphs.


“Corpus Linguistics and Statistics with R” is a very well-written and well-organized introductory book. Its contents are clear and readable. The level of complexity of the text increases gradually from basic to quite advanced. Each chapter begins with an abstract and most chapters have an introductory section. This helps the reader contextualize the contents of the relevant chapter. Furthermore, most chapters have their separate references in addition to the full bibliography at the end of the book. The references are sound and comprehensive.

One of the main strengths of this book is that it constitutes an elaborate, step-by-step manual for practical implementations of the contents. It enables the reader to engage in hands-on applications of the methods presented. The rich online supplements include data sets and R code, making it possible for the reader to work interactively. Moreover, the exercises at the end of the chapters (with the solutions at the very end) offer additional study material.

R being already a flexible and versatile tool, Desagulier makes the life of the reader even easier by providing separate instructions for Windows and Mac users. When several R packages are available for a particular purpose, he lists them all and mentions his own preference, explaining his reasons. Also, he cites relevant websites and recommends references for further reading throughout the book.

Another asset of the book is the author’s fluent style. In addition to being articulate in his writing, he makes subtle jokes and references to popular culture (to Star Trek, for instance) and uses catchy examples such as a concordance of words based on ‘blood’ in the novel “Dracula” to keep the reader interested while reading a technically demanding text. Furthermore, he uses figures efficiently to explain his points. For instance, Figure 6.5 (p. 120) is an excellent example demonstrating the rationale of a word cloud, as it consists of a word cloud of the novel “Moby Dick”, in which the word ‘whale’ is strongly emphasized. Two other examples of such clever use of visuals are Figure 10.1 (p. 117), a phylogenetic tree by Darwin used to illustrate a dendrogram, and Figure 10.2 (p. 118), which shows the Eiffel Tower viewed from four different angles to explain the logic behind visualization in clustering methods. From time to time, Desagulier also appends interesting information about how certain tools and methods have been developed (for example, on p. 132, where he mentions the recent history of motion charts).

More importantly, Desagulier uses examples from linguistically relevant topics, case studies and actual data from his own and others’ previous studies. For example, he refers to his study on pre-adjectival vs. pre-determiner uses of the intensifiers ‘quite’ and ‘rather’ in the British National Corpus (BNC) (Desagulier 2015) both in Chapter 8 (p. 160) to explain the notion of hypothesis testing and in Chapter 10 (p. 270) as a case study illustrating Multiple Correspondence Analysis. In his discussion of the normal distribution, Desagulier also uses a data set from a real-life lexical decision task on the auditory processing of German compounds by Isel, Gunter and Friederici (2003).

In this book, Desagulier takes on a triple challenge. He aims to introduce the basics of the R language, statistics and corpus linguistics in one book. He is successful in this ambitious endeavor, which is the greatest strength of the book. He also manages to increase the level of complexity of the contents smoothly across chapters. In addition to the detailed methodological instructions, Desagulier provides some theoretical background in various sections of the book. Chapters 1, 2 and 3, together with Chapters 7 and 8 in Part II, are appropriate even for complete beginners. Chapters 4, 5 and 6 and Chapters 9 and 10 are more advanced, but still accessible. Even though most of the methodology is presented from the perspective of corpus linguistics, some or all chapters of the book may also appeal to researchers from other related fields such as computational linguistics and psycholinguistics.

There are only two minor shortcomings of the book. First, even though a certain number of typos are expected or probably unavoidable in any text, there are slightly more typos in “Corpus Linguistics and Statistics with R” than one would expect in such a meticulously crafted book. Just to give a few examples: In the first sentence of page 44, ‘is’ should read ‘if’ in “…another thing is the…”, “The Bank of English” is printed twice in the last sentence of the third paragraph of section 3.1 on page 51, and “three sentence” in the third paragraph of section 4.3.2 on page 72 should be plural. There are some more such typos missed in proofreading, but these can easily be corrected in future editions or in errata.

The second minor point of criticism is the absence of a general conclusion chapter or section. Although the book is written in a textbook format focusing mainly on methodology, it also contains some theoretical aspects. Therefore, a closing section wrapping up the theoretical discussion in particular could have helped the reader to put everything in better perspective. In the absence of such a conclusion, there is a risk that readers might feel left in suspense.

To conclude, this clearly written, coherent book, with its linguistically relevant examples, data sets and R code, is an inspiring resource for theoretical linguists who wish to familiarize themselves with quantitative methods and statistics. In the present era of big data, this book is a very timely and valuable contribution to the literature. I strongly recommend Guillaume Desagulier’s “Corpus Linguistics and Statistics with R” to anyone interested in learning about R, statistics and the concepts and methods of corpus linguistics.


Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.

Desagulier, Guillaume. 2015. Forms and meanings of intensification: A multifactorial comparison of ‘quite’ and ‘rather’. Anglophonia 20. doi:10.4000/anglophonia.558.

Gries, Stefan Thomas. 2009. Quantitative corpus linguistics with R: A practical introduction. New York, NY: Routledge.

Isel, Frédéric, Thomas C. Gunter & Angela D. Friederici. 2003. Prosody-assisted head-driven access to spoken German compounds. Journal of Experimental Psychology: Learning, Memory, and Cognition 29(2). 277–288. doi:10.1037/0278-7393.29.2.277.

Langacker, Ronald W. 1987. Foundations of cognitive grammar: Theoretical prerequisites, Vol. 1. Stanford: Stanford University Press.

R Core Team. 2013. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Gözde Mercan is a psycholinguist with a PhD in Cognitive Science from Middle East Technical University, Ankara, Turkey. Her research focuses on the processing and mental representation of language, mainly through the structural priming paradigm. She has conducted structural priming experiments on various linguistic forms in Turkish, English and Norwegian with monolingual and multilingual participants. She is also interested in language acquisition in children and adults. Currently, she is an (external) affiliate of the Center for Multilingualism in Society across the Lifespan of the University of Oslo.