LINGUIST List 26.1458
Tue Mar 17 2015
Review: Comp Ling; Forensic Ling; General Ling; Text/Corpus Ling: Oakes (2014)
Editor for this issue: Sara Couture <sara@linguistlist.org>
Date: 11-Nov-2014
From: Bev Thurber <b.thurber@shimer.edu>
Subject: Literary Detective Work on the Computer
Book announced at
http://linguistlist.org/issues/25/25-2660.html
AUTHOR: Michael P. Oakes
TITLE: Literary Detective Work on the Computer
SERIES TITLE: Natural Language Processing 12
PUBLISHER: John Benjamins
YEAR: 2014
REVIEWER: Bev Thurber, Shimer College
Review's Editor: Helen Aristar-Dry
SUMMARY
This book provides a concise summary of the ways computational linguistics has been used to obtain certain types of information about texts. The applications discussed are authorship identification, plagiarism identification and spam filtering, Shakespearean authorship, style in religious texts, and decipherment. Each of these topics is discussed in a chapter of approximately 50 pages. The book begins with a brief preface summarizing the content and explaining its structure, which brings all of the chapters together under the heading of computer stylometry.
Chapter 1, “Author identification,” provides a summary of some of the basic techniques that are applied in later chapters and shows how they have been used to determine who wrote a text when the author is unknown. The chapter opens with a short introduction to the problem, then discusses the features on which different scholars have based their evaluations. Two major types of technique are discussed: inter-textual distances and clustering. Inter-textual distances are ways of measuring the similarity between two texts based on their shared features (11). The examples provided focus on comparing the vocabularies of two texts. The distances discussed include the Euclidean and chi-squared distances, Kullback-Leibler Divergence, and others. Mathematical formulae are provided along with explanations. The Euclidean distance is based on the idea of geographical distance, i.e. the shortest line from one point to another (12). The chi-squared distance is similar to the Euclidean distance, but with the addition of weights that reflect the number of times a word occurs in each text (15). Kullback-Leibler Divergence is an application of the information-theoretic concept of entropy to comparing texts (18). Clustering techniques start with textual features and transform them, by means of inter-textual distances, into diagrams of how sets of more than two texts are related (30). The section on clustering techniques focuses on factor analysis techniques, especially principal components analysis, which is used in later chapters. A principal components analysis begins with a table of data, such as normalized word frequencies from several texts. Standard techniques from linear algebra are then applied to compute the principal components, orthogonal eigenvectors that can be used to produce a graph showing how closely related the texts are (38-44). The chapter ends with a section comparing the different methods described and examples of related studies. These studies do not directly address unknown authorship, but ask closely related questions, such as how an author's writing style changes over her or his lifetime.
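To make these computations concrete, here is a minimal R sketch of my own (not code from the book) using invented word frequencies: it computes a Euclidean distance between two texts' frequency profiles and runs a principal components analysis on a small frequency table with the built-in prcomp function.

    # Invented relative frequencies (per 1,000 words) of three function
    # words in two texts.
    text_a <- c(the = 62.1, of = 35.4, and = 28.9)
    text_b <- c(the = 58.7, of = 31.2, and = 33.5)

    # Euclidean distance: the straight-line distance between the two
    # frequency profiles treated as points in word-frequency space.
    euclidean <- sqrt(sum((text_a - text_b)^2))

    # Principal components analysis of a small table of normalized word
    # frequencies (rows = texts, columns = words).
    freq_table <- rbind(text_a, text_b,
                        text_c = c(the = 60.3, of = 40.1, and = 25.6),
                        text_d = c(the = 55.0, of = 30.8, and = 35.2))
    pca <- prcomp(freq_table, scale. = TRUE)

    # The first two principal components give coordinates for a plot
    # showing how closely related the texts are.
    plot(pca$x[, 1:2], type = "n")
    text(pca$x[, 1:2], labels = rownames(freq_table))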
Chapter 2, “Plagiarism and spam filtering,” includes two main sections, one on each of these topics. The plagiarism section is approximately twice as long as the spam filtering section. It begins with a discussion of commercially available plagiarism-detection software, then goes on to describe the algorithms behind the software, with applications to student essays and program code. A variety of ways to measure document similarity are discussed, including the cosine measure, overlapping n-grams, fingerprinting, language modeling, and techniques from natural language processing. These techniques require that a suspicious document be compared to a corpus of similar documents. When such a corpus is unavailable, intrinsic techniques, which examine the text's writing style, can be used instead. Oakes describes such techniques in this section as well. Finally, this section of the chapter addresses the problems of plagiarism by translation and how to tell which of two similar texts is the original. The second part of Chapter 2, on spam filtering, covers a variety of approaches to the problem. These include content-based, exact-matching, and rule-based methods, as well as approaches based on machine learning and some that lie outside the linguistic realm.
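As an illustration of what two of these similarity measures involve (my own sketch, not the book's code), the following R fragment computes the cosine measure between two invented word-frequency vectors and the proportion of shared word trigrams between two short strings.

    # Invented word-frequency vectors over a shared vocabulary.
    doc1 <- c(plagiarism = 3, detection = 2, software = 1, essay = 4)
    doc2 <- c(plagiarism = 2, detection = 1, software = 0, essay = 5)

    # Cosine measure: the cosine of the angle between the two vectors,
    # 1 for identical proportions, 0 for no shared vocabulary.
    cosine <- sum(doc1 * doc2) / (sqrt(sum(doc1^2)) * sqrt(sum(doc2^2)))

    # Overlapping word n-grams between two short texts (here, trigrams).
    ngrams <- function(text, n = 3) {
      words <- strsplit(tolower(text), "\\s+")[[1]]
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }
    a <- ngrams("the cat sat on the mat near the door")
    b <- ngrams("the cat sat on a mat near the door")
    overlap <- length(intersect(a, b)) / length(union(a, b))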
Chapter 3, “Computer studies of Shakespearean authorship,” returns to some of the questions addressed in Chapter 1 with a specific focus on Shakespeare. The chapter summarizes questions about which plays were or were not written by Shakespeare and the answers obtained by computational methods. The plays are divided into three categories, ''Traditional attributions,'' ''Dubitanda,'' and ''Apocrypha,'' following Elliott and Valenza (1996). The 35 plays in the first category provide a control set for the analyses that follow, as there is no reason to doubt that Shakespeare wrote those plays himself. The other two categories, containing plays in which Shakespeare's involvement may have been less than full authorship, provide subjects for the analyses described in the chapter. Computational methods have been used to assess Shakespeare's level of involvement in writing plays from these sets. The principal components analysis discussed in Chapter 1 is shown in action in this chapter, and other methods beyond those presented in Chapter 1 are discussed, including Bayesian analyses and neural networks.
Chapter 4, “Stylometric analysis of religious texts,” addresses questions related to those in Chapters 1 and 3 while focusing on the New Testament, the Book of Mormon, and the Qur'an. The first of these takes up the bulk of the chapter because less work has been done on the other two. The chapter describes analyses done to answer questions related to authorship by means of writing style, with correspondence analysis and cluster analysis as the most frequently mentioned methods. Some time is spent on a discussion of the hypothetical source Q for the Gospels of Matthew and Luke. Correspondence analyses showing how the gospels may be related to each other and to Q are summarized in detail. Other New Testament topics covered include possible relationships between all the books of the New Testament derived using the methods of prediction by partial match and word recurrence interval. The sections on the Book of Mormon and the Qur'an are summaries of similar studies done on those books.
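For readers unfamiliar with the second of these methods, a cluster analysis of texts can be run in a few lines of base R; the sketch below is my own illustration with invented frequency counts, not an analysis from the book.

    # Invented normalized frequencies of three words across four texts.
    freqs <- matrix(c(12,  8,  5,
                      11,  9,  6,
                       4, 15, 10,
                       5, 14,  9),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("text", 1:4), c("w1", "w2", "w3")))

    d <- dist(freqs)    # Euclidean distances between the texts
    tree <- hclust(d)   # agglomerative hierarchical clustering
    plot(tree)          # dendrogram showing how the texts group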
Chapter 5, “Computers and decipherment,” summarizes ways computational techniques have been and could be useful in analyzing unknown writing systems. The best-known decipherments have made little use of computers, but Oakes suggests that computers could be useful for ''routine tasks like collating and counting'' (207). The chapter relates decipherment to cryptography and machine translation and considers Rongorongo and the Indus Valley seals as case studies. Some attention is also paid to Linear A, Pictish symbols, and Mayan glyphs. One question discussed in this chapter is how to tell whether a set of symbols encodes a particular language. Statistical properties of language, such as Zipf's law, and statistical tests, such as Sinkov's test, are explained as pointers toward an answer to this question. The chapter concludes on the pessimistic note that decipherment of these unreadable scripts is unlikely, but with the hope that interesting new computational methods will be developed in the attempt.
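As a small example of the kind of routine check a computer can do here, the following R sketch (my own; the input file name is hypothetical) plots the rank-frequency curve of a tokenized text on log-log axes, where a roughly straight line with slope near -1 is what Zipf's law predicts for natural language.

    # Read whitespace-separated tokens from a hypothetical text file.
    tokens <- scan("corpus.txt", what = character(), quiet = TRUE)

    # Sorted frequency list and ranks.
    freqs <- sort(table(tokens), decreasing = TRUE)
    ranks <- seq_along(freqs)

    # Under Zipf's law, log frequency falls roughly linearly with log rank.
    plot(log(ranks), log(as.numeric(freqs)),
         xlab = "log rank", ylab = "log frequency")
    abline(lm(log(as.numeric(freqs)) ~ log(ranks)))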
The book ends with a long list of references and a short index.
EVALUATION
This book is the twelfth volume of John Benjamins' Natural Language Processing series. Edited by Ruslan Mitkov of the University of Wolverhampton, this series focuses on ''new results in NLP and modern alternative theories and methodologies'' (back of half-title page). This particular volume fits into the series framework by providing a very broad, yet concise, summary of what has been done in recent years. The book focuses on methods for analyzing style with an eye to determining the author of a given text. It provides a mix of theory (in the form of computations that can be performed) and application (in the form of case studies). The first chapter, on authorship attribution, lays the foundation for the rest of the book. Chapters 2 through 4 are clear follow-ups to Chapter 1, as they present case studies of questioned authorship. Chapter 5 treats a topic that is related to these, but not quite the same. Rather than questions of who wrote a text, this chapter is concerned with whether a given sequence of symbols is a text and, if so, how one can determine what it says.
The division of the material into chapters based on applications rather than on methods makes the book's focus seem to be on what has been done rather than how it was done. Someone looking to solve a particular problem will be able to see what techniques have been used for that problem or for similar ones, which makes the book useful as a source of ideas to try out. This organization does result in some repetition of methods; principal components analysis, for example, is first explained in Chapter 1 and then reappears in Chapters 3 through 5, where it is shown in action through case studies. This repetition ensures that readers interested in a particular area of study see this important technique at work.
According to the back cover, ''[t]his book is written for students and researchers of general linguistics, computational and corpus linguistics, and computer forensics.'' Graduate students and other researchers in the early stages of their careers seem the most likely to benefit from the book's system of organization, and the level of mathematics presented is consistent with this. The author assumes that his audience understands basic statistics but may not be familiar with other mathematical topics, such as vectors and matrix algebra. The book provides a concise overview of many mathematical methods and includes details of the mathematics behind some of them, providing, for example, a detailed tutorial on matrix arithmetic (pp. 35-38).
In keeping with this audience, the book occasionally presents source code in the programming language R for the methods discussed, in order to show concretely how the mathematical techniques can be applied. These examples range from the very simple, such as the description of how to calculate a dot product on page 43, to the code for creating a sorted frequency list shown on page 242. They are occasional enough to make it clear that the book is not intended as a primer on R, but readers may find them helpful as an introduction to R and a guide to implementation.
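For readers who have not seen R, the flavor of such examples can be suggested by a few lines of my own (not the book's code) performing the same two operations: a dot product and a sorted word-frequency list.

    # Dot product of two numeric vectors.
    x <- c(1, 2, 3)
    y <- c(4, 5, 6)
    dot <- sum(x * y)    # equivalently, drop(x %*% y)

    # Sorted frequency list for a small vector of words.
    words <- c("to", "be", "or", "not", "to", "be")
    freq_list <- sort(table(words), decreasing = TRUE)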
The back cover states that this book ''will inspire future researchers to study these topics for themselves, and gives sufficient details of the methods and resources to get them started.'' This is an accurate summary of what seems to be the primary purpose of the book. As a source of starting points for research, it will be helpful to anyone looking for inspiration.
REFERENCES
Elliott, Ward and Robert Valenza. 1996. And then there were none: Winnowing the Shakespeare claimants. Computers and the Humanities 30: 191-245. DOI: 10.1007/BF00055107.
ABOUT THE REVIEWER
B. A. Thurber is an Assistant Professor of Humanities and Natural Sciences at Shimer College in Chicago, IL, who is interested in historical and computational linguistics and medieval ice skating.
Page Updated: 17-Mar-2015