LINGUIST List 31.1581

Tue May 12 2020

Review: Computational Linguistics; Historical Linguistics: Toner, Han (2019)

Editor for this issue: Jeremy Coburn <jecoburnlinguistlist.org>



Date: 14-Feb-2020
From: Mark Faulkner <mark.faulknertcd.ie>
Subject: Language and Chronology


Book announced at https://linguistlist.org/issues/30/30-3419.html

EDITOR: Gregory Toner
EDITOR: Xiwu Han
TITLE: Language and Chronology
SUBTITLE: Text Dating by Machine Learning
SERIES TITLE: Language and Computers
PUBLISHER: Brill
YEAR: 2019

REVIEWER: Mark Faulkner, Trinity College Dublin

SUMMARY

Language and Chronology investigates the utility of machine-learning techniques for dating undated medieval Irish texts. More specifically, it asks whether a temporal model built from the year-by-year record of events offered by the Irish Annals (generally thought to have been maintained contemporaneously over perhaps a thousand years) can date texts from other genres to 101-year windows.

The book begins with an introduction, which outlines the problem of dating undated texts in the terms of both philology and machine learning, borrowing from archaeology the term ‘chronometrics’ to refer to a method that attempts ‘to provide absolute dates’ for texts ‘with a defined margin of error’ (p. 6).

Chapter 1, ‘Dating Texts: Principles and Methods’ is a detailed summary of the methods that philologists have traditionally used to date the composition of undated medieval Irish texts. Much of what it has to say applies equally to medieval texts in other languages, though it is clear that several facets of the Irish tradition make it a particularly interesting case study for dating methods, not least the survival of a significant number of texts generally thought to have been composed at a relatively early date only in manuscript copies written some centuries later, a tendency for writers to claim authorship of texts in fact written earlier by others, and a fondness for deliberate stylistic archaisms.

Chapter 2, ‘Computational Approaches to Text Dating’, introduces how machine learning has approached the problem of text dating, describing the different linguistic features that have been targeted (which include named entities, keywords and word or character n-grams, as well as metalinguistic or extralinguistic features such as text length or font) and the various techniques adopted (including language modelling, regression and classification). On the basis of previous studies, Toner and Han adopt a classification-based approach, in which the machine builds a model to assign texts to given dating windows (e.g. 1400±50 = 1350-1450). They then outline five new techniques that might improve the dating performance of their algorithm, the force of which is in essence to allow the dating windows to be derived from the language of the texts themselves, rather than imposed by the analysts (so that instead of 1400±50, we might have 1407±17 if that is what best suits the texts).
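
The arithmetic of such dating windows can be made concrete with a minimal sketch. It is not taken from the book: the function names are hypothetical, and it assumes only that a window written as centre ± half-width includes both of its end years.

```python
from typing import Tuple


def window_bounds(centre: int, half_width: int) -> Tuple[int, int]:
    """Return the inclusive (start, end) years of the window centre ± half_width."""
    return centre - half_width, centre + half_width


def window_length(centre: int, half_width: int) -> int:
    """Number of calendar years covered, counting both end years."""
    start, end = window_bounds(centre, half_width)
    return end - start + 1


def contains(year: int, centre: int, half_width: int) -> bool:
    """True if `year` falls inside the window centre ± half_width."""
    start, end = window_bounds(centre, half_width)
    return start <= year <= end


if __name__ == "__main__":
    # A fixed, analyst-imposed window and a narrower, data-derived one.
    for centre, half_width in [(1400, 50), (1407, 17)]:
        print(centre, half_width,
              window_bounds(centre, half_width),
              window_length(centre, half_width))
```

On this reading a window of ±h years always covers 2h + 1 years, which is why the windows discussed in this review have odd lengths (21, 51, 101).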

Chapter 3 trials these new techniques in English and medieval Irish texts. For English, they use two datasets: one of over 6,000 news snippets published between 1700 and 2010, chosen because it was used in the Diachronic Text Evaluation (DTE) task introduced at SemEval-2015 and therefore allowed ready comparison of their results with those from earlier studies, the other of almost 2,500 adverts posted on the website Freecycle over a period of 180 days. Targeting character and word n-grams (with n = 1, 2, 3), their algorithm correctly dated 54% of texts from the DTE dataset within a 21-year window, and 43% of those from Freecycle within a 21-day window. They then turn to the annals, telling the algorithm to look only at character n-grams. With the Annals of Innisfallen, it manages to date 74% of segments to the correct 51-year window.
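
For readers unfamiliar with the term, the following sketch shows what counting character and word n-grams with n = 1, 2, 3 amounts to. It is an illustration only, not Toner and Han's feature extraction: the function names are hypothetical, and details such as whitespace tokenisation and the absence of any normalisation are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count every run of 1..max_n consecutive characters (spaces included)."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count every run of 1..max_n consecutive whitespace-separated words."""
    words = text.split()
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = "to be or not to be"
    print(char_ngrams(sample).most_common(5))
    print(word_ngrams(sample).most_common(5))
```

A classifier of the kind described would then be trained on such counts for texts of known date. Because spaces count as characters here, a character trigram can capture a word ending together with the word boundary that follows it.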

Chapter 4, ‘Dating Long Documents’, examines whether a dating model derived from the annals is applicable to other undated Irish texts. Their test corpus comprises 22 ‘longer medieval Irish texts’, ranging from 263 to 80,020 words. These were divided into chunks of 20 or more words, and the algorithm asked to predict a date for each chunk. Asking it to return a 101-year window, its most frequent prediction for each text coincided with existing philological opinion 32% of the time. Asking it to return a 21-year window, but stipulating this must fall within the 101-year window already established, led to a slightly improved performance, coinciding with existing philological opinion half the time. The model fared best with Middle Irish texts, less well with Old and Early Modern Irish, but it nonetheless assigned most texts to their correct period. Since the algorithm tries to date texts chunk by chunk, there is, they show, some scope to use it to distinguish different strata within a text, such as where an Early Modern reviser has extended a Middle Irish text.
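
The chunk-and-vote procedure described here can be sketched roughly as follows. This is a hedged illustration rather than the authors' code: `chunk_words` and `most_frequent_prediction` are hypothetical names, the way a short final chunk is handled is an assumption, and `predict` stands in for whatever trained classifier maps a chunk to the centre year of a dating window.

```python
from collections import Counter
from typing import Callable, List


def chunk_words(text: str, size: int = 20) -> List[str]:
    """Split text into chunks of `size` words, folding a short remainder
    into the preceding chunk so that no chunk (except in a text shorter
    than `size` words) has fewer than `size` words."""
    words = text.split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < size:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]


def most_frequent_prediction(chunks: List[str],
                             predict: Callable[[str], int]) -> int:
    """Date each chunk separately and return the most common prediction."""
    votes = Counter(predict(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Dummy stand-in for a trained model (purely illustrative).
    def dummy_predict(chunk: str) -> int:
        return 1100 if len(chunk.split()) == 20 else 1200

    text = " ".join(f"w{i}" for i in range(65))
    chunks = chunk_words(text)
    print(len(chunks), most_frequent_prediction(chunks, dummy_predict))
```

Keeping the per-chunk predictions, rather than only the winning vote, is what would allow different strata within a single text to be distinguished.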

A conclusion reviews the success of Toner and Han’s approach, briefly looking inside the ‘black box’ of the algorithm and considering what linguistic features it might have been using to date the texts. It is followed by two appendices, the first a lengthy outline of the dates philologists have usually assigned to the texts on which the approach was tested in Chapter 4 and the second a brief outline of some basic concepts from machine learning. A bibliography closes the book.

EVALUATION

Better datings for undated medieval texts are a major desideratum. As Toner and Han report (p. 15), the medieval Welsh text, the Four Branches of the Mabinogi, has been dated anywhere in the two and a half centuries between 1018 and 1275. In English studies, some scholars continue to advocate an origin for the epic Beowulf in the eleventh century even as a compelling array of evidence suggests c. 700 is a more reasonable date. Any new approach is therefore to be welcomed and it is probable that a machine learning implicitly from extant texts will notice patterns a human cannot.

That said, Toner and Han’s approach is, by objective standards, a failure. Their conclusion is that the dating model derived from the annals ‘can be applied to long narrative texts of various genres’ (p. 116) and ‘could be used as a tool for assigning texts to a linguistic period’ (p. 137). This would perhaps have some utility if all knowledge of Irish were lost while a large corpus, their algorithm and some people (interested in Irish) survived, but short of such an esoteric doomsday scenario, it is difficult to see what use such a tool would have. Yet this is clearly a naïve view that ignores the incremental nature of scientific work: Toner and Han’s research will, we have to hope, be built upon by others and, in due course, machines will better date medieval texts than humans.

What should those machines be told to look at? Toner and Han’s algorithm, as we have seen, looks at character unigrams, bigrams and trigrams. N-grams primarily target orthography and may indirectly pick up features of inflectional morphology (in Old English, one thing the trigram <um > could be is the dative plural morpheme -um and its frequent occurrence would probably point to a date before the twelfth century). But orthography is the feature a modernising scribe can most readily alter when copying a text; it is no surprise therefore that Toner and Han note their algorithm sometimes generates predictions which correspond more closely to manuscript date than presumed composition date. Other linguistic features are less easy to modernise and the general scholarly consensus (at least in Anglocentric medieval studies) is that scribes intervened very little with syntax. Focusing on syntax would require the development of a part-of-speech tagger and parser for medieval Irish, but using word n-grams might perhaps superficially pick up some underlying syntactic patterns, much as character n-grams pick up some morphological ones.

The meld of philological and computational methods on show in this book is a stimulating one. Philology has always been about contextualising particular linguistic forms, and this is in effect what a machine attempting a classification task undertakes. It is salutary to someone working in a philological tradition where it is quite normal to rely still on work undertaken in the 1880s to encounter the section in the chapter on computational approaches to text dating entitled ‘Early Research’ and notice the earliest paper it cites is from 2005. Toner and Han helpfully include in their introduction instructions on ‘how to read this book’, counselling those coming from a humanities background to ignore the body of Chapters 2 and 3. It is certainly true these are very difficult, but while the appendix with its brief definition of some terms from machine learning is helpful, more could have been done to make them more accessible. This reader at least would have preferred to be told what ‘prototype methods’ and a ‘non-parametric memory-based distances function’ do linguistically rather than learn that the Wiener process is ‘named in honour of Norbert Wiener’ (pp. 42-3). This is a serious deficiency in that, as I have argued in the previous paragraph, what the machine is told directly or indirectly to look at does matter and if philologists cannot understand what the machine is doing, they cannot advise on what it should be told to look at. But explaining the methods of one discipline to those of another is difficult and it is to Toner and Han’s credit that they have generally explained their methods and results clearly.
The first chapter, ‘Dating Texts’, is a model of clarity, making accessible to a wider audience an otherwise challenging body of scholarship, much of it written in Irish, and could easily be set as reading for a class on a philology course on a masters in Medieval Studies; the appendix on the datings philologists have assigned to the Irish texts used in the study will be an invaluable reference point for scholars from a range of different disciplines.

Language and Chronology lays the foundations for the next generation of work on a crucial philological problem. It is to be hoped many computer scientists interested in text dating will seize the challenge that is offered by medieval texts, with their unstandardized orthographies, erratic attestations and want of tools like parsers that are de rigueur for the languages like Present-Day English on which most chronometric work is focused.


ABOUT THE REVIEWER

Mark Faulkner is Ussher Assistant Professor in Medieval Literature at Trinity College Dublin. His work on twelfth-century English has led to an interest in periodisation and text dating, and he has recently received a Provost’s Project Award for Medieval Big Dating, which will explore quantitative methods to develop ‘big data’ techniques to assist in the dating of texts from the Old and early Middle English periods.



Page Updated: 12-May-2020