In her book, ''Methods in Latin Computational Linguistics'', Barbara McGillivray builds on Piotrowski (2012), using several case studies to offer historical linguists basic training in quantitative and corpus methods and to offer computational linguists the challenge of working with historical data. Chapters 1 and 2 give a general overview of Latin linguistics, computational linguistics, and their intersections. Chapter 3 covers the creation of a verb valency lexicon, a valuable resource for future studies. Chapters 4 and 5 present a case study in selectional preferences and argument structure; the former details the linguistic theory while the latter covers the computational and statistical methods. Chapters 6 and 7 present a second case study on Latin preverbs; again, the former details the linguistics while the latter details the computer science and mathematics. Finally, Chapter 8 ties everything together, defining ''Latin computational linguistics'' as a unified field that needs expertise from scholars across several disciplines.
Chapter 1, ''Historical Languages, Corpora, and Computational Methods'', situates the book for both historical linguists and computational linguists. It surveys some of the challenges Latin poses for computational linguists: spoken Latin is mostly unknowable, the dataset is limited because there are no living native speakers, and the language is morphologically rich with flexible word order. The author also introduces some basic concepts in computational linguistics, such as corpus annotation, automatic parsing, statistical significance, and the creation of a well-balanced corpus, explaining how each might benefit Latin scholars. McGillivray defines ''Latin'' and ''language'' as a certain subset of all the available data, for the purposes of her case studies, and then outlines the remainder of the book by previewing the case studies covered in later chapters.
Chapter 2, ''Computational Resources and Tools for Latin'', surveys the currently available corpora and programs for Latin, as well as the steps necessary to create new tools and resources. The author points out that although the Latin Index Thomisticus (Busa 1980) was the first electronically available corpus in any language, Latin has not kept pace with modern languages such as English, which have the benefit of native speakers and a market demand for applications such as machine translation, a demand that in turn drives computational linguistics research on those languages. Latin cannot match English in the scale and availability of resources such as digitized corpora, automatic annotation tools, part-of-speech taggers, treebanks, and lexical databases, and the author seeks to partially remedy the situation through her work.
Chapter 3, ''Verbs in Corpora, Lexicon ex machina'', exemplifies a computational approach to Latin that solves one of the problems introduced in Chapter 2: the lack of a verb valency lexicon. The chapter first introduces the concepts of verb valency, transitivity, and semantic roles. Next, McGillivray discusses the advantages of a corpus-based distributional approach to semantics over the traditional lexicographic approach to verb valency, including detailed usage-based frequency information and the absence of the selection bias introduced when a lexicographer is forced to choose only one or two examples because of the space limitations of traditional dictionaries. The chapter then describes how to work with the Prague Dependency Treebank using MySQL queries to create the valency lexicon. One challenge for verb valency is that Latin allows pro-drop: in order to count all arguments of a verb, one must account for the subject by extracting the person marking from the verb. Once the verb valency lexicon is created, a number of additional studies can be carried out. For example, the author demonstrates how the valency lexicon allows one to test diachronic trends, finding that VO word order is slightly more common in later Latin while OV word order is slightly more common in earlier Latin.
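To make the pro-drop point concrete, the following is a minimal sketch of how valency frames might be counted when the subject can be implicit. It is not the author's method: the book works with MySQL queries over a treebank, whereas this sketch uses plain Python, and the `Verb` class, the relation labels `SBJ` and `OBJ`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical relation labels; a real treebank defines its own tag set.
ARGUMENT_RELATIONS = {"SBJ", "OBJ"}

@dataclass
class Verb:
    lemma: str
    person: Optional[str]  # "1", "2", "3", or None for a non-finite form
    dependents: List[Tuple[str, str]] = field(default_factory=list)  # (relation, case)

def valency_frame(verb):
    """Return the verb's arguments as a sorted tuple of (relation, case) pairs."""
    args = [(rel, case) for rel, case in verb.dependents if rel in ARGUMENT_RELATIONS]
    has_overt_subject = any(rel == "SBJ" for rel, _ in args)
    # Pro-drop: a finite verb with no overt subject still has a subject,
    # recoverable from its person marking, so it is counted as an argument.
    if not has_overt_subject and verb.person is not None:
        args.append(("SBJ", "nom"))
    return tuple(sorted(args))

def build_lexicon(verbs):
    """Map each lemma to a Counter of the valency frames it occurs with."""
    lexicon = {}
    for verb in verbs:
        lexicon.setdefault(verb.lemma, Counter())[valency_frame(verb)] += 1
    return lexicon

# Toy data: "amo" with a dropped subject, an overt subject, and an infinitive.
toy_verbs = [
    Verb("amo", "1", [("OBJ", "acc")]),
    Verb("amo", "3", [("SBJ", "nom"), ("OBJ", "acc")]),
    Verb("amo", None, [("OBJ", "acc")]),
]
toy_lexicon = build_lexicon(toy_verbs)
```

In this sketch, implicit and overt subjects are deliberately collapsed into the same frame so that the two finite uses of ''amo'' count as one transitive pattern; a study interested in pro-drop itself would keep them apart.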
Chapter 4, ''The Agonies of Choice: Automatic Selectional Preferences'', outlines a case study that makes use of the valency lexicon created in Chapter 3. This chapter covers the linguistic background behind concepts such as selectional preferences, argument structure, semantic features, and animacy. The benefits of a computational approach are also outlined: manual coding of these features is costly and time-consuming, but automatic computational methods can complete this process quickly and accurately. Semantic similarity can be measured computationally as well, either through synonym resources such as WordNet or through distributional approaches that make use of relative frequencies and word collocations. Both approaches, the knowledge-based WordNet and the knowledge-free distributional approach, are tested against a ''gold standard''. Although a gold standard is normally made by native speakers, for Latin it must be made from a separate test corpus.
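As an illustration of the distributional idea, here is a minimal sketch that represents each noun by the verbs it occurs with as an object and compares nouns by cosine similarity. Cosine is one common choice and is an assumption here, not necessarily the measure used in the book; the function names and the toy verb-object pairs are hypothetical.

```python
import math
from collections import Counter

def noun_vectors(verb_object_pairs):
    """Build one count vector per noun: how often it is the object of each verb."""
    vectors = {}
    for verb, noun in verb_object_pairs:
        vectors.setdefault(noun, Counter())[verb] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy observations, invented for illustration only.
toy_pairs = [
    ("bibo", "vinum"), ("bibo", "aqua"),
    ("fundo", "vinum"), ("fundo", "aqua"),
    ("lego", "liber"),
]
toy_vectors = noun_vectors(toy_pairs)
```

With these toy counts, ''vinum'' and ''aqua'' share all of their verb contexts and ''liber'' shares none, which is the kind of contrast a distributional model of selectional preferences relies on.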
Chapter 5, ''A Closer Look at Automatic Selectional Preferences for Latin'', covers the statistical and computational methods as well as some of the technical details behind the case study outlined in Chapter 4. It describes the structure of the synsets found in WordNet, the organization of data into a matrix of variables, and the concepts of vector space and clustering algorithms. Different clustering algorithms are illustrated with charts and dendrograms, and the benefits and drawbacks of various techniques are discussed. The chapter also discusses some probabilistic models and the statistical tests carried out on the data.
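For readers unfamiliar with clustering, the following sketch shows one of the simplest variants, agglomerative clustering with single linkage, on a handful of points in a two-dimensional vector space. It is an assumption that this variant resembles any of those discussed in the chapter; the function names and coordinates are invented for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    points maps a label to its coordinates; the distance between two clusters
    is the distance between their closest members (single linkage).
    """
    clusters = [[label] for label in points]
    while len(clusters) > max(n_clusters, 1):
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]

# Toy vectors, invented for illustration: two "liquid" nouns, two "text" nouns.
toy_points = {
    "vinum": (1.0, 0.0),
    "aqua": (0.9, 0.1),
    "liber": (0.0, 1.0),
    "epistula": (0.1, 0.9),
}
toy_clusters = single_linkage(toy_points, 2)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what yields the dendrograms mentioned above.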
Chapter 6, ''A Corpus-Based Foray into Latin Preverbs'', outlines the typical corpus-based approach to linguistic hypothesis formation and testing. This chapter then tackles the Latin preverb system, which is an interesting test case in diachronic morphosyntax. After covering some of the typological background of analytic and synthetic languages, as well as the known facts about the evolution of Latin into the modern Romance languages, this chapter delves into another case study using Latin corpora, which seeks to replicate the work done by hand by Bennett (1914). The hypotheses tested include whether preverbs correspond to various Latin cases or prepositions. A multivariate analysis is conducted to test the relationships among linguistic features (the preverb; the prepositional phrase; features of the noun such as case or animacy; features of the verb such as argument structure, selectional preference, or semantics) and other variables such as the author, era, and genre of the text. The results confirm what was already known: Latin underwent a shift from an inflectionally rich language to the more analytic Romance languages. However, the author argues that because this study is replicable, statistically significant, and free of the selection bias inherent in choosing examples by hand, it is an improvement on Bennett's (1914) work.
Chapter 7, ''Statistical Background to the Investigation on Preverbs'', covers the statistical and computational side of the study outlined in Chapter 6. Topics include basic hypothesis testing, the concept that correlation does not imply causation, and some of the theories and formulae behind linear regression models, correspondence analysis, multiple correspondence analysis, and singular value decomposition. The benefits and drawbacks of each approach are discussed and illustrated with various graphs.
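As a small example of the kind of hypothesis test involved, the sketch below computes Pearson's chi-square statistic for a contingency table, which measures how far observed counts depart from what independence between two variables would predict. Choosing this particular test is an assumption for illustration, and the counts are invented, not taken from the book.

```python
def chi_square(table):
    """Pearson's chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows are two preverbs, columns are two noun cases.
toy_table = [
    [10, 20],
    [20, 10],
]
toy_statistic = chi_square(toy_table)
```

The statistic is then compared against a chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom to obtain a p-value; a significant result indicates an association between the two variables, not a causal link.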
Chapter 8, ''Latin Computational Linguistics'', wraps everything up by summarizing the main goals and contributions of the book. The author suggests several lines of inquiry that future Latin computational linguists could take. McGillivray concludes that computational approaches are an ''unavoidable step in the digital era'' and advises that all scholars ''have a responsibility to acquaint themselves with each other's fields'' (p. 216).
Despite the narrow subfield implied by the title, this book could be of interest to a wide variety of scholars in the broad discipline of the digital humanities. Latin scholars can benefit from more efficient data mining and analysis, as well as the increased scientific rigor of replicable, quantitative studies. Corpus and computational linguists benefit by adapting methods used on the million-word corpora of modern, synchronic languages to the smaller diachronic corpora available for Latin, while meeting the computational challenges of an inflectionally rich language with relatively free word order and no native speakers to test on. Latin, however, is just a case study: many of the methods and concepts covered in this book apply to any diachronic corpus, such as historical or acquisition data, and to any small corpus, such as those for endangered or extinct languages.
Those without a background in Latin, linguistics, computer science, and statistics may find parts of this book difficult. Some Latin examples are given without a translation, and the statistical formulae assume at least some prior knowledge of Bayes' theorem. The reader is also expected to be familiar with morphosyntax, particularly the Latin case system. Furthermore, it is important to note that this is not a ''how-to'' guide to Latin computational linguistics. While there is some discussion of the programs and packages used, and a few examples of code or pseudocode, for the most part this book covers only the theoretical background, both linguistic and computational, behind the analyses, not the practical details of the analyses themselves.
Overall, this book makes a unique contribution to the field, both by expanding existing Latin resources and by encouraging greater interdisciplinary research among scholars from such disparate fields as historical linguistics and computer science.
Bennett, C.E. 1914. Syntax of early Latin, Volume II: The Cases. Boston: Allyn and Bacon.
Busa, R. 1974-1980. Index Thomisticus: sancti Thomae Aquinatis operum indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium opera atque electronico IBM automato usus digessit Robertus Busa SJ. Stuttgart-Bad Cannstatt: Frommann-Holzboog.
Piotrowski, M. 2012. Natural Language Processing for Historical Texts. Morgan & Claypool Publishers.