Review of Natural Language Processing for Historical Texts

Reviewer: Bev Thurber
Book Title: Natural Language Processing for Historical Texts
Book Author: Michael Piotrowski
Publisher: Morgan & Claypool Publishers
Linguistic Field(s): Computational Linguistics; Historical Linguistics
Issue Number: 25.292

In this book, Michael Piotrowski summarizes much of the state of the art on how techniques from natural language processing (NLP) have been applied to texts written in historical variants of modern languages. The book is a volume in the ''Synthesis Lectures on Human Language Technologies,'' a series which claims to ''provide concise, original presentations of important research and development topics'' (back cover). The intended audience consists of readers with backgrounds in either NLP or the humanities.

In the book's nine chapters, Piotrowski shows what has been done in this field and what problems are unique to processing historical texts. The main problem, to which the book repeatedly returns, is spelling variation. Two complete chapters are devoted to this subject, and other chapters frequently mention it. Other problems mentioned include the fact that there are no living native speakers of historical languages and that there are often few comparable texts, so the corpora to which the techniques described can be applied are small.

The book begins with a brief introduction outlining the concepts to be presented, the scope of the book and its overall structure, and the intended audience. Rather than giving a definition of what ''historical language'' means in this context, Piotrowski gives examples of features of modern languages that historical languages lack, which define the challenges of processing them. These features are standard variants and orthographies, reasonably-sized corpora, and existing processing tools.

Chapter 2, ''NLP and Digital Humanities,'' provides a broad overview of the field of study. Its goal is to situate NLP within the digital humanities. The chapter cites examples of how NLP has been used to solve problems in the humanities, highlighting the potential of NLP techniques and stressing the importance of a thorough understanding of both fields. Piotrowski concludes with the opinion that ''both the humanities and NLP could very much benefit from increased collaboration'' (p. 10).

In Chapter 3, ''Spelling in Historical Texts,'' Piotrowski begins his treatment of the major problem in dealing with historical texts: non-standardized spelling. The chapter begins with an explanation of why this is a problem, and then describes different types of spelling variation. These types are difference (i.e. diachronic variation), variance (i.e. synchronic variation), and uncertainty (i.e. variation introduced by the digitization process). Data from spell-checkers and taggers is used to illustrate the problems caused by spelling variation.

At 28 pages, Chapter 4, ''Acquiring Historical Texts,'' is the longest in the book. It provides an overview of current digitization projects and methods of digitizing texts, including scanning, optical character recognition, manual text entry, and computer-aided transcription. Piotrowski discusses the strengths, weaknesses, and limitations of each method in approximately the order in which they would be used in digitizing a text. Scanning to turn a written text into a digital image is generally the first step. Optical character recognition (OCR) is then applied to the image to turn it into electronic text. While OCR works very well for modern texts and is presented as the best system currently available, it is not yet perfect. Much of the chapter describes the adaptations to current OCR systems needed to make them work well with historical texts. Possible adaptations include using several OCR systems and merging the results, linking an OCR system to a lexicon, and providing a crowd-sourcing system for humans to correct OCR output. Manual text entry and computer-assisted transcription are discussed as alternatives to OCR that may provide better results in some circumstances.

Chapter 5, ''Text Encoding and Annotation Schemes,'' begins with freshly-digitized text and describes how to encode and annotate it to make it useful for researchers. Unicode is discussed for the former purpose, and the Text Encoding Initiative (TEI) Guidelines for encoding text with Extensible Markup Language (XML) are discussed for the latter. These two have emerged as standards for this kind of work, and Piotrowski considers them ''a solid foundation for encoding and processing many types of historical texts'' (p. 67).

In Chapter 6, ''Handling Spelling Variation,'' Piotrowski returns to the issue of spelling irregularities discussed in Chapter 3. This chapter focuses on specific problems caused by all three types of spelling variation. The focus is on languages that are still living and how tools for the modern versions of those languages can be applied to the historical versions. Piotrowski adds the caveat that ''[t]exts in dead or extinct languages and scripts certainly pose additional challenges'' without detailing how to deal with those challenges (p. 69). The major concept discussed is canonicalizing the spelling in some way. Edit distance is described as a way of comparing similar strings, which is relevant background for Piotrowski's treatment of canonicalization methods. He describes both absolute and relative methods, and then discusses ways to handle OCR errors and the limits of canonicalization.
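
To make the idea of edit distance concrete, the following is a minimal Python sketch of the standard Levenshtein distance, followed by a naive canonicalization step that maps a historical spelling to its nearest entry in a modern word list. It is an illustration only and is not taken from the book: the function names, the tiny lexicon, and the `max_distance` threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def canonicalize(word: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Hypothetical helper: return the lexicon entry closest to word,
    or word itself if nothing lies within max_distance edits."""
    if not lexicon:
        return word
    best = min(lexicon, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else word


# Illustrative modern word list (an assumption, not data from the book).
modern_lexicon = ["love", "king", "queen"]
print(canonicalize("loue", modern_lexicon))
print(canonicalize("kynge", modern_lexicon))
```

A threshold such as `max_distance` is one simple way to avoid mapping an unknown word onto an unrelated lexicon entry; a realistic system would probably also need frequency information or context to choose among equally close candidates.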

Chapter 7, ''NLP Tools for Historical Languages,'' summarizes some currently available NLP tools that have been applied to historical texts. This chapter's point is ''not to give an exhaustive listing of available tools, but rather to illustrate the variety of approaches that may be used for creating NLP tools for historical languages'' (p. 85). The techniques discussed are part-of-speech tagging (both creating a new tagger for a historical language and using an existing tagger for an ancestor of its target language), lemmatization and morphological analysis, and syntactic parsing. Spelling variation remains a problem in this area, resulting in lower tagger accuracies than those achieved with modern languages. However, Piotrowski points to several reasons, including a lack of native speakers, that lead one to expect lower performance standards for taggers applied to historical languages.

Chapter 8, ''Historical Corpora,'' is a list of corpora that have been developed for Arabic, Chinese, Dutch, English, French, German, the Nordic languages, Latin and Ancient Greek, and Portuguese. The author dedicates a few pages to each language containing brief descriptions of some available corpora, along with instructions on obtaining them. The corpora represent different approaches, including different formats and licenses.

The book concludes with Chapter 9, which provides a couple of pages of summary and looks to the future. Piotrowski sees three challenges in the future of this field: to deal with variation in historical languages, to develop tools for marked-up text processing, and to connect NLP and the digital humanities (p. 118).

A 25-page bibliography concludes the book. A nice feature is that each entry includes the numbers of the pages on which the resource was referenced, so a reader browsing the bibliography can find a longer description of any text that seems interesting.


The publisher's description of the series is quite accurate for this book. Piotrowski packs a lot of valuable information into its 145 pages. While the book is not, and does not claim to be, a complete summary of everything that has been done in the field, it provides a concise explanation of the high points and numerous avenues for future research. The book ''does not aim to teach a certain set of core techniques but rather tries to give an overview of projects and the methods used therein'' (p. 117). As a result, the examples presented take a range of approaches, but focus on relevant standards, or emerging standards, when appropriate, as in the case of Unicode and TEI. The back cover suggests that the topics covered in this book are also relevant to a variety of modern types of texts, including text messages and online postings. While this statement seems true, these genres are not referenced in the text.

The book provides a valuable introduction to the field for humanists who want an overview of how NLP techniques have been used with historical texts and what promise NLP holds for the future. Readers from the humanities will need some background in the digital humanities or computer science to be able to fully appreciate all that this book provides. Occasionally, a concept from computer science is introduced without the kind of explanation that a reader without any background in the field may need (e.g. hashing on pp. 79-81). For a reader with an NLP background who is interested in working with historical texts, this book provides a concise and up-to-date summary of the major problems and methods specific to such texts.

One limitation of the book is that it mainly focuses on languages written using the Roman alphabet. Since the Roman alphabet presents more than enough problems, this should be considered an appropriate limit to the book's scope rather than an omission. Mentions of non-Roman systems include tools for handling Greek and corpora for Arabic and Chinese as well as additional challenges associated with other writing systems, such as cuneiform or Egyptian hieroglyphs.

Overall, the book is exactly what it claims to be: a good overview of recent progress and problems in applying techniques from NLP to historical texts. It covers the entire processing cycle, from creating a digital text, to tools to analyze it, to existing corpora. The varied approaches described in the book provide many starting points for investigations as well as the necessary references to help a reader follow up on any of those starting points.
B.A. Thurber is an Assistant Professor of Humanities and Natural Sciences in Chicago who is interested in historical linguistics.

Format: Paperback
ISBN-13: 9781608459469
Pages: 157
Prices: U.S. $ 45.00
Format: Electronic
ISBN-13: 9781608459476
Pages: 157
Prices: U.S. $ 30.00