LINGUIST List 32.2625

Wed Aug 11 2021

Review: Text/Corpus Linguistics: Egbert, Larsson, Biber (2020)

Editor for this issue: Jeremy Coburn

Date: 19-Jul-2021
From: Tyler Anderson
Subject: Doing Linguistics with a Corpus

AUTHOR: Jesse Egbert
AUTHOR: Tove Larsson
AUTHOR: Douglas Biber
TITLE: Doing Linguistics with a Corpus
SUBTITLE: Methodological Considerations for the Everyday User
PUBLISHER: Cambridge University Press
YEAR: 2020

REVIEWER: Tyler Kimball Anderson, Colorado Mesa University


Egbert, Larsson & Biber’s booklet ‘Doing linguistics with a corpus: Methodological considerations for the everyday user’ discusses ways to improve investigations in which the main research tool is a corpus. The authors’ main goal is to put the linguist back into corpus linguistics. To that end, they discuss how to improve research methods, research design, and the selection of appropriate research questions. In other words, they want all practicing corpus linguists, seasoned and novice alike, “to take control of their research while also employing available resources” (p. 2). They further argue for qualitative interpretation of quantitative data and for adopting minimally sufficient statistical methods.

To begin their discussion, the authors briefly lay out their goals in Section 1 by comparing corpus linguists to the everyday driver. Neither requires expertise in engineering to use the tools at their disposal, but Egbert, Larsson & Biber propose that a basic understanding of how those tools work will help both avoid problems. The authors postulate that “just understanding the nature and composition of the corpus used for analysis…can be of tremendous assistance when conducting and interpreting corpus analyses” (pp. 1-2). Section 2, titled ‘Getting to know your corpus’, begins by discussing the importance of corpus size, concluding that given two equally designed corpora, the larger one will always be better because it will provide more word and phrase types that would not be represented in the smaller one. However, they recognize that finding two equally designed corpora is unlikely, and thus researchers also need to consider ‘representativeness.’ Here linguists must ensure that the corpus includes texts that are as representative as possible of the target population they are interested in studying. Thus, they should always find (or compile) a large corpus whose texts match the goals of their study. In all cases, argue the authors, researchers should read and critically examine any documentation associated with the corpus (looking for information about the texts themselves) and analytically evaluate the texts in each corpus.

In Section 3, the authors address “how quantitative corpus analyses relate to tangible linguistic descriptions” (p. 15). Here they delve into research design and the development of research questions. According to Egbert, Larsson & Biber, researchers who use corpora must decide whether their unit of analysis will be individual linguistic tokens or entire texts. Similarly, research questions can center on analyzing the factors that predict structural variation, or on what they call descriptive linguistics, which they define as describing the linguistic features of the texts. They also discuss dispersion, where texts are analyzed to discover how uniformly a given linguistic feature is distributed across the corpus. In Section 4, the authors emphasize the need for linguistically interpretable variables in all corpus linguistic studies, as well as the need for clear operational definitions of these variables.
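To make the notion of dispersion concrete, the following is a minimal sketch of one commonly used measure, Gries's deviation of proportions (DP). It is offered as an illustration only and is not necessarily the measure used in the book; the function name and the counts are hypothetical.

```python
def deviation_of_proportions(feature_counts, part_sizes):
    """Gries's DP: 0.0 means the feature is spread exactly in proportion
    to the size of each corpus part; values near 1.0 mean it is clumped."""
    if len(feature_counts) != len(part_sizes):
        raise ValueError("need one feature count per corpus part")
    total_hits = sum(feature_counts)
    total_words = sum(part_sizes)
    if total_hits == 0 or total_words == 0:
        raise ValueError("feature and corpus must both be non-empty")
    # Compare each part's share of the feature with its share of the corpus.
    return 0.5 * sum(
        abs(hits / total_hits - size / total_words)
        for hits, size in zip(feature_counts, part_sizes)
    )


# Hypothetical data: 100 hits of a feature in four texts of 1,000 words each.
even = deviation_of_proportions([25, 25, 25, 25], [1000, 1000, 1000, 1000])
clumped = deviation_of_proportions([100, 0, 0, 0], [1000, 1000, 1000, 1000])
print(even, clumped)
```

Both hypothetical corpora have the same overall frequency for the feature, yet the second concentrates every hit in a single text, which is exactly the kind of difference a raw frequency count hides.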

In Section 5, ‘Software tools and linguistic interpretability’, the authors discuss some of the pitfalls of commonly used tools in corpus linguistics, and how researchers can work around them. A key point is that the output of any tool should be checked for precision rather than taken at face value. They illustrate this by showing how three different tools used for linguistic annotation (i.e., Stanford Dependency Parser, Malt Parser, and Biber Tagger) all exhibited different rates of precision when it came to tagging noun-noun sequences. In a similar vein, Section 6 focuses on ‘The role of statistical analysis in linguistic descriptions,’ where the emphasis is placed on using minimally sufficient statistical methods. The authors warn against adopting a statistical model merely for the sake of using it. They illustrate how many researchers rely on the null hypothesis paradigm, which is extremely sensitive to sample size: with a large enough corpus, almost any measurable difference will reach statistical significance. Regardless of what statistical package is used, the authors argue that qualitative linguistic analysis is always required in addition to the quantitative analyses. Indeed, Section 7 centers on ‘Interpreting quantitative results.’ Here it is stressed that “linguistics is done by linguists, not by computers” (p. 52). Building on the results of statistical tests, researchers should go on to provide sound qualitative analyses, a task facilitated by the abundance of linguistic context found in corpora. They advocate that linguists closely evaluate a subset of the texts that have been submitted to statistical analysis, and that they pay careful attention to the text-external contexts (also discussed in Section 2).
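The sample-size sensitivity of null hypothesis testing noted for Section 6 can be illustrated with a small sketch. The counts below are invented for illustration, and the function name is hypothetical. The same one-percentage-point difference between two corpora is tested with a Pearson chi-square test (no continuity correction) at two corpus sizes, alongside the phi effect size.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].
    Returns (chi2, p_value, phi). With 1 degree of freedom the p-value
    is erfc(sqrt(chi2 / 2))."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    phi = math.sqrt(chi2 / n)  # effect size, independent of sample size
    return chi2, p_value, phi


# Hypothetical counts: the feature is present in 50% of the units sampled
# from corpus A and 51% of the units sampled from corpus B.
small = chi_square_2x2(500, 500, 510, 490)              # 1,000 units each
large = chi_square_2x2(500_000, 500_000, 510_000, 490_000)  # 1,000,000 each

for label, (chi2, p, phi) in (("small", small), ("large", large)):
    print(f"{label}: chi2={chi2:.2f}  p={p:.3g}  phi={phi:.4f}")
```

Under these assumptions the difference is far from significant in the small sample and overwhelmingly significant in the large one, while the effect size stays the same. That is the reason a significant p-value from a very large corpus says little on its own about linguistic importance.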

The manuscript concludes with Section 8, wherein Egbert, Larsson & Biber provide a summary of the main points of the booklet. It is important to mention that each of the main sections (2-7) contains one or more case studies that illustrate its principal points. For example, the case study in Section 2 breaks down two corpora (i.e., COCA Academic and BNC Academic) that appear comparable on the surface; a deeper dive, however, reveals stark differences in composition that affect the interpretation of the results.


This Element from Egbert, Larsson & Biber is an overall positive addition to the field of corpus linguistics. Perhaps the greatest contribution of this work is its proposal to return linguistics to the forefront of corpus linguistics. Beginning with the title’s focus on “Doing linguistics,” the authors show that a corpus is a useful tool to help linguists analyze texts, and not the final word in the analytical process. With few exceptions, they exemplify each of their topics skillfully with a variety of case studies. Perhaps the most important of these come in Section 7, where they provide three case studies that illustrate the complexity of qualitatively interpreting quantitative results. For example, in case study 1 (Section 7.2) the research question asks which adjectives collocate with the nouns ‘man’ and ‘woman’ and what differences emerge. They show how looking beyond the output of the concordancer provided answers that went beyond the authors’ initial intuitions. They had postulated that the adjective ‘American’ collocated more frequently with ‘woman’ because of the popular song by Lenny Kravitz with that same title. However, by going beyond the frequency-based results and examining a subset of examples, the researchers showed that the majority of these examples dealt with minority groups (e.g., ‘Native American woman’). These case studies were a strong addition to the tome.

As with any book, this manuscript has some weaknesses that kept the authors from reaching several of their goals. First of all, their analogy of getting to know what is ‘under the hood’, while appropriate, did not bear fruit. At no point in the manuscript did the present reader feel that he understood more fully what is ‘under the hood’ when it comes to corpora. For example, in Section 2 they encourage the compilation of a new corpus for every study without discussing how to perform such a feat, only providing a reference to another work. If their target audience is the novice, more information on how to accomplish this task should be provided here. Similarly, they discuss the pitfalls of reusing publicly available corpora, but do not discuss the familiarity and trust that some of the most widely used corpora (e.g., COCA, BNC, etc.) would inspire compared with a self-compiled corpus. Later they discuss the option of researchers developing their own software programs (p. 33); arguably, the group of researchers that can carry out such a task is small, and anyone who can is probably not diving into this tome.

At some points terminology was not consistent. In Section 3, for example, the authors discuss the development of research questions and research designs, two distinct concepts. However, in exemplifying these concepts they talk about “one major…research question” (p. 16) followed by a “second major type of…research design” (p. 17), as if the terms were interchangeable. Even the use of ‘Section’ was a bit confusing and raised the question of why the units were called sections and not chapters.

While the booklet was well written overall, there were a few points where the authors did not make appropriate transitions, especially between sections. For example, between Section 4 and Section 5 no connection is made between the topic of software tools and the case study provided for linguistically interpretable variables. Similarly, definitions of terms were often missing. For example, in Section 4 they discuss ‘employing MI scores’ but fail to indicate what these are or how to compute them. Perhaps this is because Mutual Information (MI) “is one of the most popular statistical tests that corpus linguists use to explore collocations” (Szudarski, 2018, p. 77); however, the authors should not take for granted that such information will be readily understood by the inexperienced members of their target audience. Similarly, case study 2 in Section 7.3 failed to clearly explain some key terms (e.g., multidimensional (MD) analysis). In fact, they state that this type of analysis is “a classic example of a complex statistical technique that can create distance between a researcher and language data” (p. 57). If that is the case, it raises the question of why it was included in light of their discussion of ‘minimally sufficient statistics.’ And if they deem it necessary to include, they must further explain the topic for those readers who have never seen such an analysis.
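For readers unfamiliar with MI scores, a minimal sketch follows. It assumes the widely used formulation MI = log2(observed / expected), where the expected co-occurrence frequency is f(node) × f(collocate) / N; the book's own formulation may differ (for example, by adjusting for the collocation window), and the function name and frequencies here are hypothetical.

```python
import math


def mi_score(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Mutual Information for a node-collocate pair:
    log2(observed / expected), with expected = f(node) * f(coll) / N."""
    if min(cooccurrences, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(cooccurrences / expected)


# Hypothetical frequencies in a 1,000,000-word corpus: the node occurs
# 1,000 times, the collocate 500 times, and they co-occur 32 times.
score = mi_score(32, 1000, 500, 1_000_000)
print(score)
```

A positive score means the two words co-occur more often than chance would predict. Because the expected frequency sits in the denominator, MI tends to reward rare words, which is one reason the score needs linguistic interpretation rather than being read at face value.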

In a similar vein, the topic of accuracy is discussed in Section 5. They recommend that researchers always carry out accuracy checks (including precision and recall), but fail to explain how such measures can be calculated. Again, if the topic is important enough to be included in the book, the book should illustrate how to carry out such tasks.
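As a rough illustration of what such a check involves, precision and recall can be computed by comparing the items a tool flags against a hand-checked sample. The sketch below assumes each hit is identified by a token position; the function name and the data are hypothetical and do not come from the book.

```python
def precision_recall(retrieved, gold):
    """Compare the items a tool flagged with a hand-checked gold standard.
    precision = correct hits / all items the tool flagged
    recall    = correct hits / all items it should have flagged"""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical token positions of noun-noun sequences in one sample text.
tagger_hits = {3, 8, 15, 21, 30}   # flagged by the tagger
manual_hits = {3, 8, 15, 40}       # confirmed by a human annotator

precision, recall = precision_recall(tagger_hits, manual_hits)
print(precision, recall)
```

In this invented sample the tagger is right on three of its five hits and finds three of the four true cases. In practice the gold standard would be a random sample of corpus texts coded by hand.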

An apparent oversight was also found in the conclusion of the book. Here they reference a blog titled “Linguistics with a corpus,” but fail to tell readers where they can find it. But perhaps the most glaring shortfall of this booklet came in Section 4. The stated goal of this section was to ensure that all variables used in a corpus study are linguistically interpretable. However, their case study, on measures of collocation, has no apparent connection to this goal.

These shortcomings aside, the booklet provides some great insights into how to improve research for linguists interested in using corpora as tools for language analysis. As previously mentioned, the emphasis on inviting linguists back to their own party is well merited. In a data-driven world, Egbert, Larsson & Biber’s focus on using just enough statistical analysis to get answers is also a refreshing addition to the field.


Szudarski, Paweł (2018). Corpus linguistics for vocabulary: A guide for research. Routledge.


Tyler K. Anderson is Professor of Spanish at Colorado Mesa University, where he teaches courses in language, linguistics and second language acquisition. His research interests include language attitudes toward manifestations of contact linguistics, including the acceptability of lexical borrowing and code-switching in Spanish and English contact situations. He is currently researching loanwords and core vocabulary using corpus linguistics.

Page Updated: 11-Aug-2021