
Review of  Handbook of Linguistic Annotation


Reviewer: Emmanuel Schang
Book Title: Handbook of Linguistic Annotation
Book Author: Nancy Ide, James Pustejovsky
Publisher: Springer Nature
Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
Issue Number: 29.3243

Review:
SUMMARY

This handbook, edited by Nancy Ide (Vassar College) and James Pustejovsky (Brandeis University), gathers 54 chapters in two volumes, for a total of 1459 pages. The first volume (438 pages) collects papers on methodological and theoretical aspects (15 chapters plus an introduction), while the second volume presents detailed case studies. It is aimed at a broad audience of students and scholars in linguistics and/or computer science and assumes no programming background.

Volume 1 ('The Science of Annotation') starts with an introduction by Nancy Ide. This introduction presents an overview of the volumes and describes how linguistic annotation emerged as an important matter for linguistics and natural language processing (NLP). Initially, linguistic annotations were conceived to validate corpus linguistic theories. But, as Nancy Ide notes, ''Over the past three decades, advances in computing power and storage together with development of robust methods for automatic annotation have made linguistically-annotated data increasingly available in ever growing quantities. As a result, these resources now serve not only linguistic studies, but also the field of natural language processing (NLP), which relies on linguistically-annotated text and speech corpora to evaluate new human language technologies and crucially, to develop reliable statistical models for training these technologies''. She indicates that ''the goal of this volume is to provide a comprehensive survey of the development and state-of-the-art for linguistic annotation of language resources, including methods for annotation scheme design, annotation creation, physical format considerations, annotation tools, annotation use, evaluation, etc.''

The volume continues with theoretical papers ['Designing Annotation Schemes: From Theory to Model' (James Pustejovsky, Harry Bunt and Annie Zaenen); 'Designing Annotation Schemes: From Model to Representation' (Nancy Ide, Christian Chiarcos, Manfred Stede and Steve Cassidy) and 'Community Standards for Linguistically-Annotated Resources' (Nancy Ide, Nicoletta Calzolari, Judith Eckle-Kohler, Dafydd Gibbon, Sebastian Hellmann, Kiyong Lee, Joakim Nivre and Laurent Romary)], which offer both an overview of the field and a historical perspective on this young domain. Chapter 1 presents the MATTER methodology (an acronym for Model, Annotate, Train, Test, Evaluate, Revise), which aims at improving the design of annotation schemes through a back-and-forth exchange between the data and the model. Chapter 2 presents an overview of the representation formats and discusses the issues related to the choice of format (from SGML to XML and TEI), and Chapter 3 presents the history and key concepts of the standards for linguistic resources (ISO, TEI, LAF, etc.).

This volume also collects chapters on annotation tools and procedures [Overview of Annotation Creation: Processes and Tools (Mark A. Finlayson and Tomaž Erjavec); The Evolution of Text Annotation Frameworks (Graham Wilcock); Tools for Multimodal Annotation (Steve Cassidy and Thomas Schmidt); Collaborative Web-Based Tools for Multi-layer Text Annotation (Chris Biemann, Kalina Bontcheva, Richard Eckart de Castilho, Iryna Gurevych and Seid Muhie Yimam)]. In particular, Section 4 of ''Overview of Annotation Creation: Processes and Tools'' surveys the features of a large number of annotation tools, which is very handy. G. Wilcock's chapter is more technical and will be more useful to computer engineers than to linguists. It mainly discusses the difference between an annotation pipeline and an annotation framework.

The following chapters focus on techniques and methods: Iterative Enhancement (Markus Dickinson and Dan Tufiş); Crowdsourcing (Massimo Poesio, Jon Chamberlain and Udo Kruschwitz); Machine Learning for Higher-Level Linguistic Tasks (Anna Rumshisky and Amber Stubbs); Sustainable Development and Refinement of Complex Linguistic Annotations at Scale (Dan Flickinger, Stephan Oepen and Emily M. Bender); Linguistic Annotation in/for Corpus Linguistics (Stefan Th. Gries and Andrea L. Berez). Poesio et al. wrote a chapter dedicated to crowdsourcing (web collaboration for annotation) and 'games-with-a-purpose'. Delegating the linguistic annotation task to unknown contributors (be they gamers seeking enjoyment or remote workers, as with Amazon Mechanical Turk) is not a harmless choice. This chapter honestly weighs the pros and cons of these techniques.

Two chapters take on the difficult matter of evaluation [Inter-annotator Agreement (Ron Artstein) and Ongoing Efforts: Toward Behaviour-Based Corpus Evaluation (Takenobu Tokunaga)]. Ron Artstein raises the issue of the reliability of the annotation and clearly explains the philosophy and the math behind measures of inter-annotator agreement. The chapter is punctuated by useful technical reminders and tips. The author has made a considerable effort to remain clear and accessible to non-specialists. Tokunaga presents a different and complementary approach, which is based on the analysis of the annotator's behavior during the annotation task.
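
To give the reader unfamiliar with agreement measures a concrete idea of what is at stake, here is a minimal sketch of one common chance-corrected coefficient, Cohen's kappa, for two annotators. It is my own illustration and not necessarily the formulation used in Artstein's chapter; the function name, the part-of-speech labels and the two annotators' data are all hypothetical. Kappa compares the observed agreement A_o with the agreement A_e expected by chance from each annotator's label distribution: kappa = (A_o - A_e) / (1 - A_e).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label: kappa is undefined;
        # returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels assigned to ten tokens (invented data).
annotator_1 = ["N", "V", "N", "ADJ", "N", "V", "N", "N", "ADJ", "V"]
annotator_2 = ["N", "V", "N", "N", "N", "V", "ADJ", "N", "ADJ", "V"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this invented example the annotators agree on 8 tokens out of 10, but since agreement on 3.8 tokens would be expected by chance alone, kappa is noticeably lower than the raw 80% agreement. This is precisely why raw percentages are a poor indicator of reliability.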

Finally, this volume ends with a paper discussing the links between linguistic theory and corpus-based studies [Developing Linguistic Theories Using Annotated Corpora (Marie-Catherine de Marneffe and Christopher Potts)].

Volume 2 ('Case Studies') gathers 39 chapters which describe corpus-based projects. The chapters provide both an overview of each project (purpose and method) and the 'lessons learned' by its team. These case studies offer an opportunity to evaluate the design of the experiments and the annotation schemes. Among the projects, one can cite MULTEXT-East, OntoNotes, ISO-TimeML and several treebanks (Prague Dependency Treebank, German Treebank, Sinica Treebank and Hindi/Urdu Treebank), to name but a few. The reader will find here a description of the projects mentioned in Volume 1 and can go back and forth between the two volumes. Here are two examples:

- ISO-TimeML is mentioned many times in Volume 1 as an example of the MATTER methodology. The reader will find a precise description of the project and the annotation scheme in a dedicated chapter (pp. 941-968).

- The reader who is interested in crowdsourcing annotation projects can navigate between a theoretical paper in Volume 1 and a project on Named Entity Recognition using crowdsourcing in Volume 2.

EVALUATION

With its 54 chapters, this handbook covers the wide field of linguistic annotation (and linguistic resource creation). Interestingly, this book reverses the usual perspective in which just one chapter is dedicated to linguistic annotation in an NLP handbook (see Palmer and Xue (2010) for instance).

In recent years we have seen considerable growth in Machine Learning (ML) techniques, and NLP tends to be more and more a matter for engineers, to the detriment of linguists. But ML techniques crucially require resources (annotated corpora). The building of reliable resources is thus an important matter that cannot be neglected or relegated to a subsidiary rank.

In this context, this book is an important effort towards giving linguistic annotation full attention.

Here, the annotation work in its various facets is put forward and the technical or practical tools are in the background (in Volume 1). In Volume 2, the major projects and resources are detailed and one can appreciate that the choice of the projects is well balanced between Europe and the USA.

The chapters on method and theory are written by renowned specialists and the case studies provide the reader with interesting lessons learned. The authors had to follow a guideline, which gives Volume 2 a certain consistency despite the great disparity of the domain. This makes the handbook interesting for both computer scientists and linguists. Both will find a rich variety of examples and technical information (tools, methods, etc.). Of course, certain chapters about tools or machine learning are aimed more at computer scientists than at linguists, but overall, this book can be read by linguists without specific technical skills, except perhaps a basic knowledge of XML and document formats. Each chapter can be read alone, as is usually the case in handbooks, but this sometimes leads to repetitions. For instance, the MATTER cycle is presented several times: p. 22, p. 170 and p. 335. This probably could have been avoided, but it is not a major flaw since these repetitions are drowned in the mass of information provided in these chapters. Incidentally, an index would have been useful: it would have made searching for a technical term easier.

For the reader who is still reluctant to take an interest in corpus linguistics I recommend, as a starter, De Marneffe and Potts' chapter at the end of Volume 1. In Section 2 they provide a clever review of the arguments for and against corpus linguistics (Intuition and Experiment, Corpora and Experimental Methods, Competence and performance...) and argue that ''corpus, introspective, and psychological methods all complement each other; far from being in tension methodologically or philosophically, they can be brought together to strengthen linguistic theory and increase its scope and scientific relevance'' (p. 431).

For the enthusiastic reader willing to start a first project in linguistic annotation, I also recommend Gries (2013), Reinhart (2015) and Pustejovsky & Stubbs (2012). Indeed, this handbook will give you all you need to conceive your annotation scheme and assess its quality, but the correct interpretation of your results requires a prior (basic) knowledge of statistics (power curve, confidence intervals, etc.), which falls outside the scope of this book.

To summarize, this book undoubtedly deserves a place in every linguistics department library as a major reference on linguistic annotation. The price probably puts it out of reach of individual linguists in most parts of the world (the number of pages has its price), but since linguistic annotation projects are supposed to be carried out by teams and not by individuals, this is not a serious problem.


REFERENCES

Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction. Walter de Gruyter.

Palmer, M., & Xue, N. (2010). Linguistic annotation. Handbook of Computational Linguistics and Natural Language Processing.

Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media.

Reinhart, A. (2015). Statistics done wrong: The woefully complete guide. No Starch Press.

ABOUT THE REVIEWER:
Emmanuel Schang is an associate professor in syntax at the University of Orléans (France). He's in charge of the SEEPiCLa (Structure, Emergence and Evolution of Pidgin and Creole Languages) International Research Group (CNRS) and he has led several projects on linguistic annotation.