Review of Best Practices for Spoken Corpora in Linguistic Research

Reviewer: Yolanda Rivera Castillo
Book Title: Best Practices for Spoken Corpora in Linguistic Research
Book Author: Şükriye Ruhi, Michael Haugh, Thomas Schmidt, Kai Wörner
Publisher: Cambridge Scholars Publishing
Linguistic Field(s): Discourse Analysis
Text/Corpus Linguistics
Discipline of Linguistics
Issue Number: 26.2650

Review's Editor: Helen Aristar-Dry


The book “Best Practices for Spoken Corpora in Linguistic Research” describes projects of spoken data documentation, as well as current standards in the field. It includes projects documenting diatopic varieties of German and English (such as British, American, and Australian English), French, Turkish, Czech, Russian, Portuguese, Catalan, Swedish, Danish, Norwegian, Faroese, and Basque. It represents a variety of databases, from those encoding formal and academic registers to those encoding informal varieties of some languages (such as Turkish). The goal of the book is to provide standards on data documentation, curation, processing, annotation, and preservation, so that researchers can retrieve information in formats that are standardized, easy to use, and accessible. It also aims to describe methods of corpus construction and sharing. The book consists of two major sections and fourteen chapters, including an introductory chapter (Chapter 1) that summarizes the content and general goals of the volume. Chapter 1 (Şükriye Ruhi, Michael Haugh, Thomas Schmidt, and Kai Wörner) provides a justification for the volume and a brief description of the chapters that follow.

Since the book includes mostly projects involving spoken corpora, it might be of interest to linguists studying discourse analysis, language variation, conversation analysis, phonology, and pragmatics. There is great emphasis on the creation of metadata, data conservation and segmentation, all key issues in these fields. Only one chapter (Chapter 7) describes a project on speech data rather than spoken data (see definition below).

The first section of the book, entitled “Case Studies on Corpora Design, Annotation and Analysis”, describes five projects. Chapter 2 (Adriana Slavcheva and Cordula Maißner) summarizes the main characteristics of the GeWiss Corpus, a corpus of German academic spoken data. Chapter 3 (Şükriye Ruhi and E. Eda Işik Taş) discusses the components of the STC (Spoken Turkish Corpus) and the STCDC (Spoken Turkish Cypriot Dialect Corpus), which encompass informal spoken varieties of Turkish. These corpora include information about allophonic variation. Chapter 4 (Theodosia-Soula Pavlidou, Charikleia Kapellidi, and Eleni Karafoti) describes the Corpus of Spoken Greek (CSG), which comprises informal conversations (including phone exchanges) and their orthographic transcriptions. Chapter 5 (Ines Rehbein, Sören Schalowski and Heike Wiese) explains that KiDKo consists of syntactically annotated data on Kiezdeutsch, a contact variety of German used by teenagers in multiethnic communities. In Chapter 6, the last one in this section, Seongsook Choi and Keith Richards explain the features of the MICASE database, which encodes metadata on conversations between English speakers.

The second section of the book, entitled “Discussions on Best Practices in Spoken Corpora”, includes eight chapters. Chapter 7 (Pavel Skrelin and Daniil Kocharov) presents a speech database for Russian produced by the project's managers. In Chapter 8, Lucie Besešová, Martina Waclawičová, and Michal Křen describe a database of spoken Czech. Chapter 9 (Oliver Ehmer and Camille Martinez) provides information on a database of naturally occurring spoken French in twenty-four communicative areas around the world. Chapter 10 (Peter M. Fisher and Andreas Witt) distinguishes between “data providers”, “data compilers”, “data curators”, and “data consumers” in data preservation for the Datenbank für Gesprochenes Deutsch, a project that includes preservation of data from previous projects on the German language. In Chapter 11 (Sebastian Drude, Paul Trilsbeek, Han Sloetjes, and Dan Broeder), issues of privacy and the ethical treatment of data providers are discussed for the DOBES corpus. Chapter 12 (Hanna Hedeland, Timm Lehmberg, Thomas Schmidt, and Kai Wörner) describes a multilingual corpus encompassing data from 1999-2011. Chapter 13 (Simon Musgrave, Andrea C. Schalley, and Michael Haugh) presents the Australian National Corpus, which includes a variety of data with different types of annotation. The authors devised an “interlingua ontology” to represent “the knowledge embodied by all the annotations of all the collected data” (p. 226). The last chapter (Thomas Schmidt) sketches the history of annotation conventions, highlighting the conventions followed by the Hamburg Centre for Language Corpora (HZSK).

In summary, the book provides an overview of numerous spoken corpus projects and discusses the main issues related to the standardization, creation, annotation, copyright, and conservation of these data. It provides clear explanations for the non-specialist, and discusses key issues of interest for the specialist as well. It is a good introduction for those pursuing projects on spoken data documentation. The book aims to report on developing standards and common practices in the field.


The project descriptions include information on the status of corpus creation activities, web addresses for corpora, and tools and protocols for data annotation, data curation, and data dissemination. The initial and final chapters of the book sum up the main issues that constitute the backbone of the collection. These projects center on annotation standards and tools, a key issue since the publication of Bird and Liberman’s (2001) paper on the annotation graph framework. Many of these projects also select the same annotation tools (such as EXMARaLDA), as well as the same data encoding formats, such as XML markup.

The introductory chapter (1) draws a distinction between “speech” and “spoken corpora”, indicating that the former are intended as tools for phonological analysis, while the latter have a different range of uses and aim at representing “language as used by its speakers in naturally occurring communicative contexts” (p. 3). However, both fields share many of the same goals and standards. Indeed, some of the contributors to this volume also participate in a collection on speech corpora published in the same year (Durand, Gut and Kristoffersen 2014).

Chapter 1 (pp. 5-6) also distinguishes between recordings and data, parallel to the difference between primary and non-primary data described by Himmelmann (2012, p. 188). The distinction between raw, primary and structural data is key to language documentation. In fact, as Himmelmann states, linguistic analysis depends on these distinctions; and despite many misconceptions about the role of documentary linguistics, the creation of different types of data in this field is key to linguistic analysis: “[…] documentary linguistics has the important task of making descriptive generalizations replicable and accountable, and in this sense it provides the empirical basis for many branches of linguistics” (Himmelmann, 2012, p. 187).

The first part of the book, “Case Studies on Corpora Design, Annotation and Analysis”, deals with data processing and selection. The descriptions show that the differing goals of individual projects result in vast differences among them. For example, the communicative situations documented vary greatly, ranging from formal academic presentations (Chapter 2) and radio and telephone communications, to spontaneous conversations recorded without the interviewers’ influence (Chapter 9, an ecological approach). One important question not addressed by the editors is which of these approaches is more effective in representing “natural” exchanges between speakers.

Chapter 3 describes an important issue related to data processing: how to handle segmentation of spoken corpora. In some projects, recordings are segmented into equal chunks based on time units. This produces data that might not be suitable for discourse analysis, since discourse units are ignored and replaced by time units.

An additional issue is the amount of data collected by these projects. Some of the papers do not address how much data is required, or how much evidence is necessary to analyze certain aspects of language use or of the linguistic system. Most projects are data driven and collect a large number of words, but it is not always clear why a specific amount of data is necessary. An explanation of these issues would help readers make informed decisions during data collection. Additionally, except for Chapter 5, there are few references to the role of corpus creation in linguistic analysis and linguistic theory.

This section of the book also examines issues related to data management, such as automated transcription and the choice between orthographic and phonetic transcription. In fact, orthographic transcription dominates in project annotation.

The second section of the book places individual corpus creation within a larger context of ethical issues regarding data management, and the availability of data for future generations. For example, Chapter 11 discusses ethical issues in data use and accessibility, particularly with the purpose of shielding data providers. It describes a project that allows four levels of data accessibility to protect speakers and the copyright of collected data (p. 201).

Additionally, one of the most important issues discussed in the second section of the book is data preservation. The role of data curators and the differences between data migration and emulation are fundamental for those interested in long-term archiving of corpora. Chapter 10 deals with this issue and with the issue of software availability for corpora creation. Table 10-1 (pp. 166-167) is remarkably useful as a summary of descriptions of these tools. As indicated, the book also places great emphasis on data “curation” and migration from one format to another to preserve these data for future generations. This issue is critical, particularly in the case of endangered languages. Bird and Simons (2003) stress this point in corpora creation:

“Funded documentation projects are usually tied to software versions, file formats, and system configurations having a lifespan of three to five years. Once this infrastructure is no longer tended, the language documentation is quickly mired in obsolete technology. The issue is acute for endangered languages. In the very generation when the rate of language death is at its peak, we have chosen to use moribund technologies, and to create endangered data” (p. 557).

On the other hand, with regard to the type of data collected, one wonders whether the book should place more emphasis on the documentation of lesser-known languages. The book describes documentation efforts in a variety of languages. Of these, Faroese is the language with the smallest number of speakers. Documenting lesser-known languages can contribute to understanding differences and similarities in conversational exchanges across cultures. Descriptions of projects on poorly studied languages can provide linguists interested in documenting them with the knowledge required to start a project. Documentation and extensive study of poorly studied and endangered languages are very important for numerous reasons (see Krauss 1992).

Another concern in this section is the incorporation of data from different fieldwork projects into a single corpus. Chapter 13 describes a project that intends to create an interlingua ontology to represent “the knowledge embodied in all the annotations of all collected data”. The authors built this ontology inductively, since they are collecting materials from many different projects that followed different methodologies in data gathering.

A few issues are not discussed in detail in the book, such as the role of native speakers in data collection, segmentation, and annotation. In fact, native speakers have participated in the creation of many projects described in the book.

Finally, this book is an important contribution to the documentation of ongoing projects on corpus creation. The authors provide detailed descriptions that offer the reader enough information on current standards, project content, and the rationale behind decision making in corpus linguistics. This book is particularly useful for those creating large databases for a diverse body of languages.


Yolanda Rivera-Castillo is currently a professor at the University of Puerto Rico, Río Piedras campus. She has taught at different institutions in the US, and has chaired linguistics programs. Her research interests include the study of the Papiamentu prosodic system, as well as nasalization and vowel harmony in Papiamentu and other Atlantic Creoles. She is currently working on a project on language documentation and has published papers on Creole phonology as well as on the phonology-syntax interface.

Format: Hardback
ISBN-13: 9781443860338
Pages: 285
Prices: U.K. £ 47.99
U.S. $ 81.99