Date: 21-Mar-2018
From: Foinse Ó Caoimh
Subject: Quantitative Historical Linguistics
AUTHOR: Gard B. Jenset
AUTHOR: Barbara McGillivray
TITLE: Quantitative Historical Linguistics
SUBTITLE: A Corpus Framework
PUBLISHER: Oxford University Press
YEAR: 2017

REVIEWER: Foinse Ó Caoimh, Maynooth University


Corpora and quantitative methods have been extensively employed in many sub-fields of linguistics for decades. However, historical linguists seem to be more reluctant to embrace the corpus-driven quantitative approach, and there are fewer corpora of historical languages available. Recent introductions to historical linguistics (Ringe and Eska 2013, Campbell 2013, Hale 2007, etc.) remain reticent or even hostile towards quantitative methods, despite the large amount of scholarly output in corpus building and natural language processing for historical languages (Gippert & Gehrke 2015, Piotrowski 2012), and despite the fruitful results produced by corpus-driven quantitative studies of historical languages, notably championed by the two authors of the book under review (Jenset 2013, McGillivray 2013, etc.). The present book therefore appeals to the community of historical linguists and sets itself three tasks: first, to justify and advocate the corpus-driven quantitative approach in historical linguistics; second, to outline the methodological framework of such an approach; and third, to provide a general account of the current practices and techniques in quantitative studies of historical languages.

Chapter 1 sets out the aim of the book upfront, namely to introduce a methodological framework for quantitative historical linguistics by discussing the necessary steps in doing research, without subscribing to specific techniques or theories in historical linguistics (p. 1). The authors point out that while there is a high-level awareness of historical linguistics as data-focused, quantitative corpus methods are still underused and often misused, largely because the empirical nature of historical linguistics is less clear (p. 2). They argue for a ‘conceptual change of pace’ (p. 6), whereby the transparency and objective verifiability required by an empirical discipline should be conceptualised in a probabilistic rather than a categorical way (p. 4). The conventional evidence-based approach provides the categorical judgment that a certain phenomenon exists, but fails to inform on its frequency or trend of change (pp. 8-10), for which one needs annotated corpora (pp. 10-12). Even when using annotated corpora, historical linguists must avoid the pitfalls of raw frequency counts and ‘post hoc analysis’ (pp. 12-15), about which more details are given in Chapter 6. After a short plea for better documenting and sharing of the research process in order to enhance reproducibility and collaboration (pp. 15-18), and an even shorter section advocating pattern-searching in linguistics (pp. 18-19), the authors turn to an interesting metaphor, ‘crossing the chasm’ (Moore 1991), which is usually employed to model the acceptance of new products in the market. They use it to explain the tactics that could help the majority of historical linguists to ‘cross the chasm’ and accept the methodology proposed in this book (pp. 19-25). This amalgam chapter ends with a showcase study (pp. 25-35) that surveys articles from six journals focusing on historical linguistics, using quantitative methods to uncover the links between individual journals, corpus-based research and the quantitative-qualitative distinction.
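
The pitfall of raw frequency counts mentioned above can be illustrated with a minimal sketch. The periods, hit counts and sub-corpus sizes below are invented for illustration and are not taken from the book; the point is only that a raw count can rise while the normalised rate falls when sub-corpora differ in size.

```python
# Hypothetical data: (period, raw hits of a construction, sub-corpus size in tokens).
# None of these figures come from the book under review.
counts = [
    ("1150-1250", 120, 200_000),
    ("1250-1350", 180, 600_000),
]


def per_million(hits, tokens):
    """Normalise a raw count to a rate per million tokens."""
    return hits * 1_000_000 / tokens


# The raw count grows from 120 to 180, but the later sub-corpus is three times larger.
rates = {period: per_million(hits, tokens) for period, hits, tokens in counts}

for period, rate in rates.items():
    print(f"{period}: {rate:.0f} per million tokens")
```

Comparing the two printed rates, rather than the raw hits, is what allows a claim about a trend over time.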

Having prepared the readers’ minds for a paradigmatic shift from categorical models to probabilistic ones, Chapter 2 outlines the methodological framework which constitutes the cornerstone of this book. Several basic assumptions pertaining to historical linguistics are made, such as that the historical linguistic reality is lost and that qualitative models are still indispensable in some fields, and key terms such as ‘evidence’ and ‘model’ are defined (pp. 37-44). A diagram (p. 45) clearly shows the research process leading from primary sources to ‘models of language that are quantitatively driven from evidence’ (p. 44). The authors list twelve principles (pp. 44-53) for conducting historical linguistic research under the proposed framework. These include general principles for an empirical discipline (e.g. it is necessary to reach consensus based on empirical argumentation (p. 45), as against, say, literary criticism) as well as more subject-specific requirements (e.g. languages are multivariate and should be studied as such (p. 51)). At this point the authors return to the plea raised in Chapter 1 and propose several ‘best practices’ (pp. 53-58) aimed at increasing reproducibility and collaboration in the discipline. The last part of this chapter is dedicated to an elucidation of the concepts of ‘corpus-driven’ and ‘data-driven’ approaches (pp. 58-61), together with an epistemological probe of the relationship between data and theory (pp. 61-65).

Chapter 3 reviews the early methods, both qualitative and quantitative, employed in historical linguistics, especially glottochronology. The authors show that the failure of glottochronology and the advent of structuralism and generative theory together reduced the interest in quantitative methods during the past decades (pp. 68-71). The rise of modern electronic corpora provides exciting opportunities. As the authors convincingly show with regression plots (pp. 74-78), the advance in computing power in recent decades is highly relevant to the rapid growth in both the number and the size of corpora of historical languages. The distinction between qualitative and quantitative approaches, and the advantages of the latter in certain contexts, are briefly restated (pp. 78-81), followed by an extensive defence of the use of corpora and quantitative methods in historical linguistics. Arguments against such methods from the standpoints of convenience, redundancy, limitation of scope, principle and the so-called ‘pseudo-science’ scepticism are presented and refuted (pp. 81-97).

Chapter 4 advances from the question of ‘why do it’ to ‘how to do it’ by introducing various current methods of annotating historical corpora. Compared to corpora of contemporary languages, those of historical languages are in greater need of detailed, interpretative annotations guided by philology (pp. 100-101). Data in the corpus can be structured to facilitate retrieval in many ways, such as the table format and markup languages (pp. 103-106). Structured data can then be further annotated in embedded or standalone formats. Different levels of linguistic annotation are explained, from pre-processing and tokenization to part-of-speech, morphological, syntactic and even sociolinguistic annotations (pp. 110-122). Annotation schemes and standards, the authors argue, should be implemented in the annotating process. The Universal Dependencies standard and the Text Encoding Initiative (TEI) are mentioned as promising candidates for standardizing the many existing markup schemes (pp. 122-125). Many of the methods in this chapter are exemplified by sample annotated data, and at this point (pp. 125-127) the authors illustrate the application of automatic Natural Language Processing (NLP) tools to a Latin corpus, although the efficacy of such tools in this particular case is still not quite clear. The chapter ends with some reflections on the limitations and risks of annotation.
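
The difference between embedded and standalone annotation can be sketched in a few lines. This is an illustration only, not the authors' own format: the Latin sentence, the hand-assigned part-of-speech tags and the `word/TAG` rendering are all assumptions made for the example.

```python
import re

# Source text, left untouched by the annotation.
text = "Gallia est omnis divisa in partes tres"

# Tokenise on whitespace, recording (start, end) character offsets into the text.
tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Standalone layer: tags are stored apart from the text, keyed by token index.
# The tags are illustrative and hand-assigned, not the output of any tagger.
pos_layer = {0: "NOUN", 1: "AUX", 2: "DET", 3: "VERB", 4: "ADP", 5: "NOUN", 6: "NUM"}


def embedded(text, tokens, layer):
    """Render the same information as embedded word/TAG annotation."""
    return " ".join(
        f"{text[start:end]}/{layer[i]}" for i, (start, end) in enumerate(tokens)
    )


print(embedded(text, tokens, pos_layer))
```

Keeping the layer separate means that further layers (morphology, syntax) can be added or revised without altering the transcribed text itself.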

Chapter 5 explores the possibility of (re)using resources, including not only purpose-built corpora but also dictionaries, official documents and historical archives. This is a highly original chapter and represents some of the main breakthroughs the authors have made in recent publications. The authors lucidly demonstrate, with concrete examples, how historical valency lexicons automatically derived from treebanks can contribute to our understanding of languages to a greater extent than conventional dictionaries (pp. 130-135). Such corpus-driven lexicons can in turn improve the precision of Optical Character Recognition (OCR) and NLP tools. Historical linguistic research can benefit from including information on the social features of texts among the factors that influence language change, while sociolinguists and historians are able to investigate a large number of source texts with the help of quantitative corpus methods (pp. 137-140). One way to integrate more resources into the corpus is to add metadata, preferably in a separate database linked to the corpus (pp. 140-142). Popular tools for linking data include the Resource Description Framework (RDF) and the Hypertext Transfer Protocol (HTTP), and an example of linking a treebank to the LexInfo ontology via RDF is given in detail on pp. 143-148. Historical and geographical data can be linked to an annotated corpus in many innovative ways as well, as exemplified by the Pleiades and the Pelagios projects (pp. 148-151).
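
The idea of linking a treebank to an ontology via RDF can be sketched with plain subject-predicate-object triples. The sketch below does not reproduce the book's example: every URI uses a placeholder `example.org` namespace, and the property names `lemma` and `partOfSpeech` are hypothetical.

```python
# Placeholder namespace; all URIs below are hypothetical, not real identifiers.
EX = "http://example.org/"

# A tiny triple store: each entry is (subject, predicate, object).
triples = {
    (EX + "treebank/token/1", EX + "vocab/lemma", EX + "lexicon/divido"),
    (EX + "lexicon/divido", EX + "vocab/partOfSpeech", EX + "ontology/verb"),
}


def objects(subject, predicate):
    """Return, in sorted order, all objects linked to a subject by a predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Follow the links: corpus token -> lexicon entry -> ontology category.
lemma = objects(EX + "treebank/token/1", EX + "vocab/lemma")[0]
category = objects(lemma, EX + "vocab/partOfSpeech")

print(lemma)
print(category)
```

Because each resource is identified by a URI, the corpus, the lexicon and the ontology can be maintained separately and still be queried together.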

The beginning of Chapter 6 reiterates the benefits of corpus and quantitative methods (pp. 153-157, cf. pp. 8-15, 78-81). Since language is multivariate, as suggested by Principle 11 (p. 51), its complexity should be tackled with multivariate techniques (p. 157). The authors choose the problem of the co-occurrence of Latin spatial preverbs with certain argument structures to exemplify what these statistical techniques are and how they work (pp. 157-166). A more complex investigation of the rise of existential ‘there’ in Middle English is then reported, showing the readers how to translate linguistic claims into statistical questions, and how much the factors of word order, sentence structure, genre and dialect each contributed to the change in frequency of ‘there’ over time (pp. 166-186). This analysis offers a valuable showcase of how to evaluate different statistical techniques and how to test the model fit.

Chapter 7 summarises the core steps of the research process (pp. 189-190) and presents yet another case study that implements the framework. This study addresses an old problem, that of the variation between the third person verbal endings -(e)s and -(e)th in Early Modern English. It tests the various hypotheses on the reasons for this variation and tries to establish the relative importance of these reasons (pp. 190-206). The complete process set out at the beginning of this chapter is followed and exemplified step by step in this case study. The chapter concludes with some final remarks.


This book is the first to systematically construct the methodological framework of corpus-driven quantitative approaches in historical linguistics, and it has done an excellent job both in proving the necessity and advantages of the approach, and in providing a useful and clear framework for future research. In addition, it serves as a comprehensive review of the progress made by corpus-driven quantitative methods so far, and its bibliography can be used as an up-to-date reference list of major published corpora of historical languages and corpus-driven quantitative studies of historical languages. The book benefits from its inclusion of many sample data, charts and figures, all of which the authors have made openly available online for readers to repeat the tests or explore further possibilities.

The editorial standard is high, and I noticed only three typos: 1) ‘We can distinguish between different types [of] evidence’ (p. 39); 2) ‘since a particular feature or phenomenon can be absent from a corpus for a number [of] very different reasons’ (p. 80); 3) the subsection title ‘Pragmatically and sociolinguistically annotated corpora’ should be in bold type in accordance with other titles on the same level.

The main problem with this monograph is its structure. As I summarised in the first paragraph of this review, the triple task of this book is quite clear and should be (and indeed can be) presented in its logical sequence, namely first to justify the approach, second to lay out the framework, and third to present the actual uses and techniques. However, what one finds is a mixed-up presentation of the three tasks. For example, the definitions of ‘qualitative’ and ‘quantitative’ methods (p. 79) are so basic to the whole argument of this book that they really should be placed in the first chapter together with 1.2.1 ‘Empirical methods’ (pp. 3-4); otherwise the ‘new pace’ of switching from a qualitative to a quantitative approach proposed on p. 6 cannot be precisely understood. Similarly, the distinction between ‘data-driven’ and ‘corpus-driven’ (pp. 58-61) should be raised at least to the ‘definitions’ section (pp. 39-44), whereas the in-depth discussion of the relationship between data and theory (pp. 61-65) may be better relegated to the defence of quantitative methods against objections in Chapter 3. As a matter of fact, the review and defence of quantitative methods (pp. 81-97) should come first in the book, while the section on ‘problems with certain quantitative analyses’ (pp. 12-15) would belong more naturally in Chapter 6. I understand that the authors may wish to foreshadow in the first chapter some of the main points discussed in the whole book, and to remind the readers of the important conclusions reached in the earlier parts (e.g. pp. 154-156 are no more than a summary of some points made earlier), but this should be done in a way more consistent with the logic of presentation.

Notwithstanding these structural considerations, this book is still highly recommendable. Corpus-driven quantitative approaches have huge potential in historical linguistics, and this treatise on methodology provides, in an informative and accessible manner, a firm starting point for historical linguists to get to know and accept these approaches. I believe scholars will greatly benefit from reading this book.


Fangzhe Qiu is a postdoctoral researcher at the Chronologicon Hibernicum project hosted at Maynooth University, Ireland. The project aims at mapping the linguistic variations in the Old Irish language (7th to 10th century AD) with quantitative and corpus-driven methods. Qiu's main research interests lie in the Irish language, historical and comparative linguistics, and early Irish law.

