Review: Computational Linguistics; Text/Corpus Linguistics: Landulfo Teixeira Paradela Cunha (2020)

Editor for this issue: Jeremy Coburn

Date: 01-Jun-2021
From: Nicolás Arellano
Subject: Contributions to the Computational Processing of Diachronic Linguistic Corpora
AUTHOR: Evandro Landulfo Teixeira Paradela Cunha
TITLE: Contributions to the Computational Processing of Diachronic Linguistic Corpora
SERIES TITLE: LOT Dissertation Series
PUBLISHER: Netherlands Graduate School of Linguistics / Landelijke Onderzoekschool Taalwetenschap (LOT)
YEAR: 2020

REVIEWER: Nicolás Arellano, Universidad de Buenos Aires


‘Contributions to the Computational Processing of Diachronic Linguistic Corpora’ is a book that aims to offer insight into several tasks in which computational tools are applied to diachronic corpora. To do so, the book presents at its core three chapters that discuss how to develop new diachronic (and not necessarily historical) corpora. The work may not only be helpful for individual research projects but also, and more importantly, serve as a starting point for the creation of more general databases.

Chapter 1 serves as the introduction, in which Cunha justifies his research by exploring the many intersections between linguistics and computer science. Besides the remarkable increase in studies on formal models of language, which could represent the most obvious junction between the two fields, the author also shows several uses of computational methods in already established subdisciplines, such as sociolinguistics, language preservation, and dialectology, among others. He does so without losing sight of what makes up the very essence of the research: computer-aided studies within the scope of corpus linguistics. Cunha particularly examines the diachronic aspect of corpus linguistics. He opposes diachronic corpora to historical corpora: the former must deal with change over time and cover a specific span, whether it ends in the past or the present, while the latter concentrate on the past without taking shifts and changes into account. Although diachronic corpora have already provided linguists with considerable new opportunities, the author explores particular tools that help not only to work with corpora, but also to compile and analyze them. Each of the following chapters focuses on a different aspect of corpus development.

Chapter 2 deals with corpus building and compilation. In particular, it presents two resources: first, an easy-to-use web scraper that collects comments from news portals and websites; second, a freely available corpus of comments from a Brazilian news site, built with that scraper. Cunha argues for the importance of news comment corpora because this type of discourse has often been neglected due to assumptions about its validity as a source of information; as a result, most general corpora tend not to include comments. Yet this discourse genre could shed light on a number of research areas, ranging from language change and lexicology to language variation and social aspects of language. The web scraper (i.e., an automated agent used to extract data from a particular online source), named Xereta, is open-source and free (Cunha, Magno & Almeida 2017). It allows the user to extract linguistic data and metadata from up to a thousand URLs. Thus far, it runs on two major Brazilian news sites: UOL and Folha de São Paulo. Using the architecture of the web scraper he designed, Cunha collected a (diachronic) corpus containing more than two hundred thousand comments that appeared on UOL from 2016 to 2018. The corpus comprises more than 7 million tokens and follows a ‘rich-get-richer’ pattern in both commenters and positive evaluations: a few people make many comments while many only participate once or twice, and a few comments collect a considerable number of likes while many receive little or no feedback. Further analysis of the corpus could be carried out with other corpus software, such as AntConc. However, this particular corpus, unlike other, considerably smaller corpora of the same type in English or Portuguese, is not annotated.
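
The ‘rich-get-richer’ pattern described for commenters can be made concrete with a small calculation. The following Python sketch is only an illustration and is not taken from the dissertation: the function names, the 10% cut-off, and the toy list of commenter IDs are all assumptions introduced here. It measures how large a share of all comments comes from the most active commenters, and what proportion of commenters participate only once.

```python
from collections import Counter


def top_share(authors, top_fraction=0.1):
    """Share of all comments written by the most active `top_fraction` of commenters.

    `authors` holds one (hypothetical) commenter ID per comment.
    """
    if not authors:
        raise ValueError("at least one comment is required")
    counts = sorted(Counter(authors).values(), reverse=True)
    k = max(1, round(len(counts) * top_fraction))  # always include at least one commenter
    return sum(counts[:k]) / sum(counts)


def single_comment_share(authors):
    """Proportion of commenters who wrote exactly one comment."""
    if not authors:
        raise ValueError("at least one comment is required")
    counts = Counter(authors).values()
    return sum(1 for c in counts if c == 1) / len(counts)


# Toy data (invented for illustration): 10 commenters, 16 comments.
toy_authors = ["a"] * 6 + ["b"] * 2 + list("cdefghij")

print(top_share(toy_authors))             # 0.375: one commenter wrote 6 of 16 comments
print(single_comment_share(toy_authors))  # 0.8: 8 of 10 commenters wrote a single comment
```

The same two measures could be applied to likes per comment instead of comments per commenter, which is the second dimension in which the review reports this pattern.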

Chapter 3 explores methodological limitations in diachronic corpora and presents an algorithm that aims to help identify the establishment and obsolescence of linguistic forms. No agreement has yet been reached on the criteria for recognizing these processes, especially obsolescence (Tichý 2018). This difficulty is partly explained by the gap between the point in time at which a word first appears in a given language and the time when a significant part of the population becomes familiar with it (Tulloch 1991). For this reason, Cunha proposes five possible states of a linguistic form in a time period: a) established, b) obsolete, c) permanent, d) short-lived, e) random. These categories are further defined based on binary criteria. Under this analysis, a corpus is divided into uniform time frames. In each frame, if the relative frequency of a target item is above a previously defined threshold, the frame is assigned the value 1; if it is below, 0. The result is a binary sequence with a particular pattern. For example, if the corpus spans one hundred years and is segmented into ten sub-time frames of ten years each, sequences such as the following could appear: 0000111111, 0101010101, 1111100000, among other logically possible options. These three patterns correspond to the states established, random, and obsolete, respectively. As an additional advantage, the algorithm tolerates deviations from these prototypical sequences: it determines which sub-time frame yields the fewest deviations from the expected runs (all 0, all 1). Finally, Cunha implements the algorithm on the Corpus of Historical American English (Davies 2012) and shows favorable results in characterizing established, obsolete, lost, and short-lived words, among others.
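
To illustrate the binary-sequence idea, the following Python sketch turns relative frequencies into a 0/1 sequence and assigns it to one of the five states. It is a simplified reading of the summary above, not Cunha's actual implementation: the prototype patterns, the Hamming-distance comparison, the `max_deviations` parameter, and the extra label `unattested` for sequences that never exceed the threshold are all assumptions made for this example.

```python
def binarize(rel_freqs, threshold):
    """Map each time frame to '1' if its relative frequency is above the threshold, else '0'."""
    return "".join("1" if f > threshold else "0" for f in rel_freqs)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def prototypes(n):
    """Yield (state, pattern) pairs for the assumed prototypical sequences of length n."""
    yield "permanent", "1" * n
    for k in range(1, n):
        yield "established", "0" * k + "1" * (n - k)  # absent, then present
        yield "obsolete", "1" * k + "0" * (n - k)     # present, then absent
    for a in range(1, n - 1):
        for b in range(1, n - a):
            # absent, present for a while, then absent again
            yield "short-lived", "0" * a + "1" * b + "0" * (n - a - b)


def classify(bits, max_deviations=1):
    """Return the state whose prototype is closest to `bits`, or 'random' if none is close."""
    if "1" not in bits:
        # Assumption: a form that never passes the threshold falls outside the five states.
        return "unattested"
    state, pattern = min(prototypes(len(bits)), key=lambda sp: hamming(bits, sp[1]))
    return state if hamming(bits, pattern) <= max_deviations else "random"


# The three sequences mentioned in the review:
print(classify("0000111111"))  # established
print(classify("0101010101"))  # random
print(classify("1111100000"))  # obsolete
```

Allowing a small number of deviations means that a sequence such as 1111011111, with a single dip below the threshold, is still treated as permanent rather than random in this sketch.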

Chapter 4 presents a framework for corpus analysis based on the examination of changes in the expression ‘fake news’ both in the English-speaking world and in Brazil, concentrating on the periods around the 2016 US election and the 2018 Brazilian presidential election. Cunha claims that the change in public interest in this subject ended up transforming the linguistic expression itself, thus positing a link between certain terminology and social change. To support this point, the author uses two diachronic corpora of news articles. For English, he selects the NOW Corpus (Davies 2013), whereas for Brazilian Portuguese he creates an ad hoc corpus consisting of almost five thousand tokens of the term ‘fake news’ found on ten Brazilian news sites. Through an analysis that combines multiple techniques (web search behavior, co-occurring entities and general vocabulary, co-occurrence networks, contextualized topics, and polarity, i.e., the sentiment around the utterance), Cunha observes that interest in fake news increased globally after the US election in 2016, when the term became closely tied to topics and contexts related to politics rather than to the media industry, as had been the case in the data before 2016. In Brazil in particular, during and after the 2018 presidential election, the focus shifted from US politics or ‘fake news’ in society in general to subjects that revolve around Brazilian domestic affairs. Indeed, the rise of public interest in the term ‘fake news’, from a niche to a widely known expression, entailed changes in its conceptualization.

The last section of the book, Chapter 5, briefly sums up the conclusions of the investigation, which center on the outcomes of the three main chapters (2-4). Additionally, Cunha acknowledges a series of limitations of his research, including the lack of annotation, the fact that Xereta currently works with only two news portals, and a certain degree of imbalance in the samples of the databases used.


The main problems of the dissertation concern its organization, which the author already addresses to some degree. Firstly, he acknowledges the lack of a precise integration among the core chapters of the book, which were originally conceived as three separate papers. Although one could postulate a certain temporal sequence from Chapter 2 through Chapter 4, following the different stages of corpus-linguistic methodology and ultimately leading to an example of analysis, each chapter feels like a self-contained capsule in which many different aspects are addressed at once, with little connection between the phenomena discussed across the book. Moreover, some computational tools are presented as easy to use; however, a fair few need to be complemented with additional instruments, such as lemmatizers or specific corpus-oriented software, especially in Chapter 4. Additionally, some of the outcomes, particularly in Chapter 3, tend to focus more on phenomena related to spelling variation than on grammar or lexicography.

Nonetheless, all of these possible issues could also be seen as advantages, especially for readers who may use this book to look for concrete data or methodological advice. Furthermore, ‘Contributions to the Computational Processing of Diachronic Linguistic Corpora’ remains a solid work from a conceptual point of view and represents a great asset for the integration of more sophisticated and accurate computational tools into the domain of linguistics, while still providing a general outlook on both computational and diachronic corpus linguistics. More importantly, the book successfully focuses on a language other than English (Brazilian Portuguese) and aims to help remove barriers to scientific access by already providing several free and simple-to-use tools. Also, specialized concepts are always explained and contextualized, which makes the dissertation very easy to read, even for people with little expertise in the field. Finally, the work marks out a path that research on computational linguistics and diachronic corpus linguistics should follow. Several readers, and most certainly Cunha himself, will pick up from here and find new and refined ways to contribute to corpus linguistics through computational tools.


Cunha, Evandro L. T. P., Gabriel Magno & Virgilio Almeida. 2017. A elaboração de um coletor e de um corpus de comentários extraídos de portais de notícias. In Anais do X Congresso Internacional da Associação Brasileira de Linguística (ABRALIN), 764-771. Niterói: Universidade Federal Fluminense.

Davies, Mark. 2012. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora 7(2). 121-157.

Davies, Mark. 2013. Corpus of News on the Web (NOW): 3+ billion words from 20 countries, updated every day. Last accessed on May 28, 2021.

Tichý, Ondřej. 2018. Lexical obsolescence and loss in English: 1700-2000. In Kopaczyk, Joanna & Jukka Tyrkkö (eds.), Applications of pattern-driven methods in corpus linguistics, 81-103. Amsterdam: John Benjamins.

Tulloch, Sara. 1991. The Oxford dictionary of new words: A popular guide to words in the news. Oxford: Oxford University Press.


Nicolás Arellano is a Linguistics graduate student (Universidad de Buenos Aires) with a scholarship granted by Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina. His main topic of research is Spanish lexicology within a usage-based approach. He is interested in corpus linguistics and has written several articles and given presentations related to this field. He also has some experience in second language acquisition and second language teaching.

