Review of Computational Methods for Corpus Annotation and Analysis

Reviewer: Phoebe M. S. Lin
Book Title: Computational Methods for Corpus Annotation and Analysis
Book Author: Xiaofei Lu
Publisher: Springer Nature
Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
Issue Number: 26.1303

Review's Editor: Helen Aristar-Dry


Corpus linguistics has developed rapidly since the 1960s. The efficiency of microprocessors and data storage has grown exponentially while their cost has dropped. In this information age, language corpora of tens of millions of words are increasingly common, and linguists can easily compile their own multi-million-word corpora and search them for patterns of interest on a personal computer. However, as corpora continue to grow, a new problem has emerged: the apparent scarcity of automated language analysis tools that offer alternative and/or more advanced analyses. Without appropriate automated tools, the variety of analyses that can be performed on a multi-million-word corpus is limited. Lu’s ‘Computational Methods for Corpus Annotation and Analysis’ offers corpus linguists a solution to this problem. The book introduces automated tools for 14 types of annotation and analysis, including tokenization, word segmentation, lemmatization, parsing, POS tagging, semantic tagging, dialogue act tagging, discourse structure tagging and discourse cohesion and coherence tagging. The aim is to empower corpus linguists to conduct advanced analyses on very large corpora even without knowledge of computer programming.

The book consists of eight chapters. Chapter 1, “Introduction”, presents the objectives, rationale and organization of the book and highlights the benefits of corpus annotation.

Chapter 2, “Text processing with the command line interface”, paves the way for the use of the 14 types of corpus annotation and analysis tools, some of which need to be invoked using UNIX command lines. Two types of UNIX command lines are presented: basic commands (e.g. for creating, changing and copying directories, reading from and writing to directories, moving and copying files) and commands for text processing (e.g. searching with regular expressions, filtering and substituting text strings in files). The commands are illustrated with examples downloadable from the book’s website.
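
For readers unfamiliar with these text-processing operations, the following is a minimal Python sketch of regular-expression searching, filtering and substitution. It is an analogue only, not the book's UNIX commands, which this review does not name; the sample lines and both patterns are hypothetical.

```python
import re

# Hypothetical sample lines standing in for the lines of a corpus file.
lines = [
    "The tagger was run on the corpus.",
    "Results were saved to output.txt.",
    "The parser was run afterwards.",
]

# Search and filter: keep only the lines that match a regular expression.
pattern = re.compile(r"\b\w+er\b was run")
matches = [line for line in lines if pattern.search(line)]

# Substitute: replace every occurrence of a string pattern in each line.
replaced = [re.sub(r"\bwas run\b", "ran", line) for line in lines]

for line in matches:
    print(line)
```

The same three operations, applied to files rather than to an in-memory list, are what the command-line approach makes available without any scripting.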

Chapter 3, “Lexical annotation”, demonstrates the use of the Stanford Part-of-Speech (POS) tagger (Toutanova et al. 2003) and the CLAWS Tagger (Garside 1987) for POS tagging, the TreeTagger (Schmid 1994) and the Morpha lemmatizer (Minnen et al. 2001) for lemmatization, the Stanford Tokenizer (Klein and Manning 2003) for tokenization, and the Stanford Word Segmenter (Green and DeNero 2012, Tseng et al. 2005) for word segmentation.

Chapter 4, “Lexical analysis”, shows how frequency and n-gram lists may be generated from the output files of the TreeTagger, the Stanford POS tagger and Morpha using UNIX command lines. A range of tools for lexical analysis is then presented, including the Lexical Complexity Analyzer (Ai and Lu 2010, Lu 2012), the vocd utility in CHILDES’ CLAN (MacWhinney 2000, where CHILDES and CLAN are the short forms for the Child Language Data Exchange System and the Computerized Language Analysis program), MATTR (a tool for calculating the Moving Average Type-Token Ratio, Covington and McFall 2010), the Gramulator (McCarthy et al. 2012), RANGE (Heatley et al. 2002) and VocabProfile (Cobb and Horst 2011). These tools automate the calculation of various indices of lexical density (e.g. the ratio of content to function words), lexical variation (e.g. the classic type-token ratio (TTR), mean segmental TTR (Johnson 1944), the D measure (Malvern et al. 2004), the HD-D index (McCarthy and Jarvis 2007) and the MTLD (Measure of Textual Lexical Diversity, McCarthy and Jarvis 2010)), and lexical sophistication (e.g. Laufer and Nation’s 1995 Lexical Frequency Profile (LFP)).
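
To make some of these notions concrete, here is a minimal Python sketch of a frequency list, an n-gram list, the classic TTR and a moving-average TTR. It follows the usual textbook definitions and is not the code of any tool named above; the whitespace tokenization, the sample sentence and the window size of 5 are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ttr(tokens):
    """Classic type-token ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window):
    """Moving-average TTR: mean TTR over every window of a fixed length."""
    if len(tokens) < window:
        # Assumption: fall back to plain TTR when the text is shorter than the window.
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


# Hypothetical input; a real study would start from tokenizer or tagger output.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

freq = Counter(tokens)                 # word frequency list
bigrams = Counter(ngrams(tokens, 2))   # bigram frequency list

print(freq.most_common(3))
print(bigrams.most_common(2))
print(round(ttr(tokens), 3), round(mattr(tokens, 5), 3))
```

The plain TTR falls as a text gets longer, which is why indices computed over fixed-size windows or segments are often preferred when texts of different lengths are compared.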

Chapter 5, “Syntactic annotation”, discusses the notions of phrase structure grammars and dependency grammars before introducing the Stanford Parser and Collins’ Parser (Collins 1999) for automating the annotation of a corpus based on these two types of grammars.

Chapter 6, “Syntactic analysis”, demonstrates the use of the graphical user interface of Tregex (Levy and Andrew 2006), i.e. the TregexGUI, for searching Stanford Parser outputs and displaying the results as human-readable phrase structure trees. An overview is then given of indices of syntactic complexity (e.g. the Developmental Sentence Score (DSS), Black Developmental Sentence Scoring (BDSS), and the Index of Productive Syntax (IPSyn)) before a demonstration of the tools that automate their calculation. These tools include the D-Level Analyzer (Lu 2009), the L2 Syntactic Complexity Analyzer (Lu 2010), and the DSS Utility in CHILDES’ CLAN (MacWhinney 2000). Computerized Profiling (CP, Long et al. 2008) and Coh-Metrix (Graesser et al. 2004) also offer automated syntactic analyses, but they run on Windows and the Internet, respectively.

Chapter 7, “Semantic, pragmatic and discourse analysis”, presents tools for performing semantic, pragmatic and discourse analysis, all of which offer graphical user interfaces (GUIs) that run on Windows, the Internet and/or iOS. The types of analyses and tools introduced in the chapter include USAS (the UCREL Semantic Analysis System, Archer et al. 2002) and PRISM-L (Profile in Semantics-Lexical, Crystal 1982) for semantic field analysis, CPIDR (Computerized Propositional Idea Density Rater, Brown et al. 2008) and APRON (Analysis of Propositions, Johnston and Kamhi 1984) for annotating specific relations between propositions, CAP (Conversational Act Profile, Fey 1986) for annotating a range of dialogue acts, Coh-Metrix (Graesser et al. 2004) for measuring the levels of cohesion and coherence, and AntMover (Anthony 2003) for a semi-automatic analysis of discourse structure.

Chapter 8, “Summary and Outlook”, concludes the book with a summary of the functions of the computational tools and a discussion of future directions in the development of computational methods for corpus analysis and annotation.


The book presents an overview of computational tools that extend the range of investigations that can be performed even on very large corpora. It invites corpus linguists who are accustomed to analysing corpus data through GUIs (e.g. AntConc and WordSmith) to consider using UNIX command lines as well. UNIX not only offers arguably the highest computational efficiency and flexibility of all operating systems, but its command line is also the medium for invoking advanced corpus annotation tools such as the Stanford Parser and Collins’ Parser. Knowledge of UNIX command lines is thus a key to strengthening our power to extract meaningful patterns from corpora.

To researchers who are new to computer science, the task of learning UNIX command lines may be quite daunting. Since these tools are operated through command lines only (i.e. there are no radio buttons, check boxes, navigation windows or data previews), users need to understand the structure of their data and remember the parameters of each UNIX command very precisely. Even when one remembers the commands and their uses well, applying them successfully to real corpora may require some computing aptitude and experience. In other words, learning UNIX command lines may be considerably more difficult than the book makes out.

Considering the difficulty of teaching and learning UNIX command lines and using them to operate the various computational tools, the book is outstanding in its treatment of the subject. The presentation of the material is clear and well organized. The reasons for performing each type of annotation, using each tool, and introducing each UNIX command line are clearly explained. All relevant background concepts are discussed in adequate detail before each tool is introduced step by step with well-designed examples. The tools presented were also selected on the grounds that their use does not require knowledge of scripting, so they are relatively accessible to beginners. Where a tool needs the corpus to be pre-arranged in a special format, the book’s website provides computer scripts that automate the task. In cases where users need to choose between multiple options (e.g. a grammar needs to be chosen for the Stanford Parser), the strengths of each option are clearly presented. Finally, for users who prefer to operate the advanced corpus tagging and annotation tools via GUIs (despite the fact that GUIs typically offer less operational flexibility than UNIX command lines), the book recommends some GUIs (e.g. Xu and Jia’s (2011) GUI for operating the Stanford Parser and Liang and Xu’s (2011) GUI for operating the TreeTagger).

While the book may be one of the best introductions to computational tools for corpus annotation and analysis, it could be further strengthened by a discussion of computational tools that facilitate phonological and prosodic annotation and the examination of corpora with multiple tiers of annotation. The ability to investigate the interfaces between multiple layers of annotation on a corpus is especially important to researchers of discourse. The book is right to devote more attention to lexical and syntactic annotation, since the technology for these is more mature, but these two types of annotation are concerned more with language at the sentence level than at the discourse level. In light of the great demand for a deeper understanding of discourse, a stronger emphasis on discourse annotation tools (and probably a separate chapter) would be welcome. Furthermore, the book should include statistics about the accuracy and coverage of all the tools, rather than only selected ones, for readers’ reference.

In summary, ‘Computational Methods for Corpus Annotation and Analysis’ is an excellent book for corpus linguists who are interested in using advanced corpus queries. It presents the latest computational tools for corpus annotation and analysis in a very accessible manner. The advice and resources in the book are also very practical and useful. It is highly recommended to researchers and students of corpus linguistics.


Ai, Haiyang and Xiaofei Lu. 2010. A web-based system for automatic measurement of lexical complexity. Paper presented at the 27th Annual Symposium of the Computer-Assisted Language Instruction Consortium. Amherst, MA.

Anthony, Laurence. 2003. AntMover, Version 1.0. Tokyo, Japan: Waseda University. (accessed August, 2014).

Archer, Dawn, Andrew Wilson and Paul Rayson. 2002. Introduction to the USAS category system. Lancaster: University Centre for Computer Corpus Research on Language, Lancaster University. (accessed August, 2014).

Brown, C., Snodgrass, T., Kemper, S. J., Herman, R. and Covington, M. A. 2008. Automatic measurement of propositional idea density from part-of-speech tagging. Behavior Research Methods 40. 540-545.

Cobb, Tom and Marlise Horst. 2011. Does word coach coach words? CALICO Journal 28. 639-661.

Covington, Michael A. and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics 17. 94-100.

Crystal, David. 1982. Profiling linguistic disability. London: Edward Arnold.

Fey, Marc E. 1986. Language intervention with young children. San Diego: College-Hill Press.

Garside, Roger. 1987. The CLAWS word-tagging system. In Roger Garside, Geoffrey Leech and Geoffrey Sampson (eds), The computational analysis of English: A corpus-based approach. London: Longman. 30-41.

Graesser, Arthur C., Danielle S. McNamara, Max M. Louwerse and Zhiqiang Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers 36. 193-202.

Green, Spence and John DeNero. 2012. A class-based agreement model for generating accurately inflected translations. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics. 146-155.

Heatley, Alex, I. S. Paul Nation, and Averil Coxhead. 2002. RANGE and FREQUENCY programs. Wellington: Victoria University of Wellington. (accessed August, 2014).

Johnson, Wendell. 1944. Studies in language behavior: I. A program of research. Psychological Monographs 56. 1-15.

Johnston, Judith R. and Alan G. Kamhi. 1984. Syntactic and semantic aspects of the utterances of language-impaired children: The same can be less. Merrill-Palmer Quarterly 30. 65-86.

Klein, Dan and Christopher D. Manning. 2003. Accurate unlexicalized parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics. 423-430.

Laufer, Batia, and I. S. Paul Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16. 307-322.

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: Tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2231-2234. Paris: ELRA.

Liang, Maocheng and Jiajin Xu. 2011. TreeTagger for Windows 2.0 (Multilingual edition). Beijing: Beijing Foreign Studies University.

Long, Steven H., Marc E. Fey and Ron W. Channell. 2008. Computerized Profiling, Version 9.7.0. Cleveland, OH: Case Western Reserve University. (accessed August, 2014).

Lu, Xiaofei. 2009. Automatic measurement of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics 14. 3-28.

Lu, Xiaofei. 2010. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics 15. 474-496.

Lu, Xiaofei. 2012. The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal 96. 190-208.

MacWhinney, B. 2000. The CHILDES project: Tools for analyzing talk. Mahwah: Erlbaum.

Malvern, David, Brian Richards, Ngoni Chipere, and Pilar Durán. 2004. Lexical diversity and language development: Quantification and assessment. Houndmills: Palgrave Macmillan.

McCarthy, Philip M. and Scott Jarvis. 2007. vocd: A theoretical and empirical evaluation. Language Testing 24. 459-488.

McCarthy, Philip M. and Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods 42. 381-392.

Minnen, Guido, John Carroll and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering 7. 207-223.

Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. Manchester: University of Manchester. 44-49.

Toutanova, Kristina, Dan Klein, Christopher D. Manning and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of Human Language Technologies: The 2003 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics. 252-259.

Tseng, Huihsin, Pi-Chuan Chang, Galen Andrew, Dan Jurafsky and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN Bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. Singapore: Asian Federation of Natural Language Processing. 168-171.

Xu, Jiajin and Yunlong Jia. 2011. BFSU Stanford POS tagger: A graphical interface Windows version. Beijing: Beijing Foreign Studies University.

Phoebe Lin is Research Assistant Professor at the Hong Kong Polytechnic University. Her research focuses on the acquisition, processing and use of formulaic language by first and foreign language learners. She publishes on corpus linguistics, applied linguistics, English vocabulary and second language acquisition. She has a forthcoming monograph on the prosody of formulaic language in spontaneous English spoken discourse.