LINGUIST List 26.1303

Mon Mar 09 2015

Review: Computational Ling; Text/Corpus Ling: Lu (2014)

Editor for this issue: Sara Couture

Date: 10-Aug-2014
From: Phoebe Lin
Subject: Computational Methods for Corpus Annotation and Analysis

AUTHOR: Xiaofei Lu
TITLE: Computational Methods for Corpus Annotation and Analysis
YEAR: 2014

REVIEWER: Phoebe M. S. Lin, Hong Kong Polytechnic University

Review's Editor: Helen Aristar-Dry


Corpus linguistics has developed rapidly since the 1960s. The efficiency of
microprocessors and data storage has grown exponentially while their cost has
fallen. In this information age, language corpora of tens of millions of
words are increasingly common. Linguists can easily compile their own
multi-million-word corpora and search them for patterns of interest on a
personal computer. However, as corpora continue to grow, a new problem has
emerged: the apparent shortage of automated language analysis tools that
offer alternative and/or more advanced analyses. Without appropriate automated
tools, the variety of analyses that can be performed on a multi-million-word
corpus is limited. Lu’s ‘Computational Methods for Corpus Annotation and
Analysis’ offers corpus linguists a solution to this problem.
The book introduces automated tools for 14 types of annotation and analysis,
including tokenization, word segmentation, lemmatization, parsing, POS
tagging, semantic tagging, dialogue act tagging, discourse structure tagging
and discourse cohesion and coherence tagging. The aim is to empower corpus
linguists so that advanced corpus analyses can be conducted on very large
corpora even without knowledge of computer programming.

The book consists of eight chapters. Chapter 1, “Introduction”, presents the
objectives, rationale and organization of the book and highlights the benefits
of corpus annotation.

Chapter 2, “Text processing with the command line interface”, paves the way
for the use of the 14 types of corpus annotation and analysis tools, some of
which need to be invoked using UNIX command lines. Two types of UNIX command
lines are presented: basic commands (e.g. for creating, changing and copying
directories, reading from and writing to directories, moving and copying
files) and commands for text processing (e.g. searching with regular
expressions, filtering and substituting text strings in files). The commands
are illustrated with examples downloadable from the book’s website.
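
As a rough illustration of the kind of text processing the chapter covers
(searching with regular expressions, filtering lines and substituting
strings), the following is a minimal sketch in Python rather than in the UNIX
commands the book itself teaches. The sample sentences and variable names are
hypothetical and do not come from the book or its website.

```python
import re

# Hypothetical three-line corpus; in practice the lines would be read from a file.
lines = [
    "The cats were sleeping on the mat.",
    "A dog barked loudly.",
    "The children were playing outside.",
]

# Search and filter: keep only the lines containing a word that ends in "ing"
# (comparable to filtering a file with a regular-expression search).
ing_pattern = re.compile(r"\b\w+ing\b")
matching_lines = [line for line in lines if ing_pattern.search(line)]

# Extract: list the matched words themselves rather than the whole lines.
ing_words = [word for line in lines for word in ing_pattern.findall(line)]

# Substitute: replace every whole-word "were" with "was"
# (comparable to a global search-and-replace over a file).
substituted = [re.sub(r"\bwere\b", "was", line) for line in lines]

for line in matching_lines:
    print(line)
```

The same three operations (search, extract, substitute) are what the
command-line tools perform on files, without the need to write a script.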

Chapter 3, “Lexical annotation”, demonstrates the use of the Stanford
Part-of-Speech (POS) tagger (Toutanova et al. 2003) and the CLAWS Tagger
(Garside 1987) for POS tagging, the TreeTagger (Schmid 1994) and the Morpha
lemmatizer (Minnen et al. 2001) for lemmatization, the Stanford Tokenizer
(Klein and Manning 2003) for tokenization, and the Stanford Word Segmenter
(Green and DeNero 2012, Tseng et al. 2005) for word segmentation.

Chapter 4, “Lexical analysis”, then shows how frequency and n-gram lists may
be generated from the output files of the TreeTagger, the Stanford POS tagger
and Morpha using UNIX command lines. A range of tools for lexical analysis is
then presented, including the Lexical Complexity Analyzer (Ai and Lu
2010, Lu 2012), the vocd utility in CHILDES’ CLAN (MacWhinney 2000, where
CHILDES and CLAN are the short forms for the Child Language Data Exchange
System and Computerized Language Analysis program), MATTR (a tool for
calculating the Moving Average Type-Token Ratio, Covington and McFall 2010),
the Gramulator (McCarthy et al. 2012), RANGE (Heatley et al. 2002) and
VocabProfile (Cobb and Horst 2011). These tools automate the calculation of
various indices of lexical density (e.g. the ratio of content to function
words), lexical variation (e.g. the classic type-token ratio (TTR), mean
segmental TTR (Johnson 1944), D measure (Malvern et al. 2004), the HD-D index
(McCarthy and Jarvis 2007) and the MTLD (Measure of Textual Lexical
Diversity, McCarthy and Jarvis 2010)), and lexical sophistication (e.g. Laufer
and Nation’s 1995 Lexical Frequency Profile (LFP)).
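
To make some of these notions concrete, the following is a minimal sketch in
Python of a frequency list, an n-gram list, the classic TTR and a
moving-average TTR. It is not the code of any of the tools named above; the
function names, the sample sentence and the assumption that the text is
already tokenized and lower-cased are all illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ttr(tokens):
    """Classic type-token ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window):
    """Moving-average TTR: the mean TTR over every window of `window` tokens,
    with the window advanced one token at a time."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    ratios = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


# Hypothetical sample, assumed to be tokenized and lower-cased already.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

frequency_list = Counter(tokens).most_common()   # word forms, most frequent first
bigram_counts = Counter(ngrams(tokens, 2))       # counts of two-word sequences

print(frequency_list[:3])
print(ttr(tokens), mattr(tokens, 5))
```

Because the classic TTR falls as a text grows longer, the moving-average
variant computes it over windows of fixed size, which makes values from texts
of different lengths more comparable.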

Chapter 5, “Syntactic annotation”, discusses the notions of phrase structure
grammars and dependency grammars before introducing the Stanford Parser and
Collins’ Parser (Collins 1999) for automating the annotation of a corpus
based on these two types of grammars.

Chapter 6, “Syntactic analysis”, demonstrates the use of Tregex’s (Levy and
Andrew 2006) graphical user interface (i.e. the TregexGUI) for searching
Stanford Parser outputs and displaying the results in the form of
human-readable phrase structure trees. An overview of indices of syntactic
complexity follows (e.g. the Developmental Sentence Score (DSS), Black
Developmental Sentence Scoring (BDSS), and the Index of Productive Syntax
(IPSyn)), before a demonstration of the tools that automate their calculation.
These tools include the D-Level Analyzer (Lu 2009), the L2 Syntactic
Complexity Analyzer (Lu 2010), and the DSS Utility in CHILDES’ CLAN
(MacWhinney 2000). Computerized Profiling (CP, Long et al. 2008) and
Coh-Metrix (Graesser et al. 2004) also offer automated syntactic analyses, but
they run on Windows and the Internet, respectively.
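
For readers unfamiliar with what searching parser output involves, the
following is a minimal sketch in Python that reads a bracketed phrase
structure tree and collects every noun phrase. It does not use Tregex or its
pattern syntax; the functions are hypothetical, and the sample tree is
hand-written in the bracketed style rather than taken from actual Stanford
Parser output.

```python
def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples.
    Words (leaves) are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip the opening "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1                      # skip the closing ")"
        return (label, children)

    return read_node()


def find_nodes(node, label):
    """Yield every subtree carrying the given label, top-down, left to right."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1]:
        yield from find_nodes(child, label)


def words(node):
    """Return the words covered by a subtree, in sentence order."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in words(child)]


# Hand-written sample tree (an assumption, not real parser output).
tree = parse_tree(
    "(ROOT (S (NP (DT The) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat)))))"
)
noun_phrases = [" ".join(words(np)) for np in find_nodes(tree, "NP")]
print(noun_phrases)
```

Counting the nodes that carry particular labels in this way is also the
starting point for many syntactic complexity measures.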

Chapter 7, “Semantic, pragmatic and discourse analysis”, presents tools for
performing semantic, pragmatic and discourse analysis, all of which offer
graphical user interfaces (GUIs) that run on Windows, the Internet and/or
iOS. The types of analyses and tools introduced in the chapter include USAS
(the UCREL Semantic Analysis System, Archer et al. 2002) and PRISM-L (Profile
in Semantics-Lexical, Crystal 1982) for semantic field analysis, CPIDR
(Computerized Propositional Idea Density Rater, Brown et al. 2008) and APRON
(Analysis of Propositions, Johnston and Kamhi 1984) for annotating specific
relations between propositions, CAP (Conversational Act Profile, Fey 1986) for
annotating a range of dialogue acts, Coh-Metrix (Graesser et al. 2004) for
measuring the levels of cohesion and coherence, and AntMover (Anthony 2003)
for a semi-automatic analysis of discourse structure.

Chapter 8, “Summary and Outlook”, concludes the book with a summary of the
functions of the computational tools and a discussion of the future directions
in the development of computational methods for corpus analysis and
annotation.

The book presents an overview of the computational tools that can enhance the
types of investigations performed on even very large corpora. It invites
corpus linguists who are accustomed to using GUIs to analyse corpus data (e.g.
AntConc and WordSmith) to consider using UNIX command lines as well. The UNIX
command line not only offers arguably the highest computational efficiency and
flexibility of any working environment, but is also the medium for invoking
advanced corpus annotation tools such as the Stanford Parser and Collins’
Parser. Knowledge of UNIX command lines is thus a key to strengthening our
power to extract meaningful patterns from corpora.

To researchers who are new to computer science, the task of learning UNIX
command lines may be quite daunting. Since UNIX is operated using command
lines only (i.e. there are no radio buttons, check boxes, navigation windows
or data previews), users need to understand their data’s structure and
remember the parameters of each UNIX command precisely. Even when one
remembers the commands and their uses well, applying them successfully to
real corpora may require some computing aptitude and experience. In other
words, learning UNIX command lines may be considerably harder than the book
makes out.

Considering the difficulty of teaching and learning UNIX command lines and
using them to operate the various computational tools, the book is outstanding
in its treatment of the subject. The presentation of the material is very
clear and organized. The reasons for performing each type of annotation, using
each tool, and introducing each UNIX command line are clearly explained. All
relevant background concepts are discussed in adequate detail before each tool
is introduced step-by-step with well-designed examples. The tools presented
were also selected on the grounds that their use does not require knowledge of
scripting, so they are more accessible to beginners. Where a tool needs the
corpus to be pre-arranged in a special format, the book makes scripts that
automate the task available on its website.
In cases where users need to choose between multiple options (e.g. a grammar
needs to be chosen for the Stanford Parser), the strengths of each option are
clearly presented. Finally, for users who prefer to operate the advanced
corpus tagging and annotation tools via GUIs (even though GUIs typically
offer less operational flexibility than UNIX command lines), the book
recommends some GUIs (e.g. Xu and Jia’s (2011) GUI for operating the
Stanford Parser and Liang and Xu’s (2011) GUI for operating the TreeTagger).

While the book may be one of the best introductions to computational tools for
corpus annotation and analysis, it could be further strengthened by adding a
discussion of computational tools that facilitate phonological and prosodic
annotation and the examination of corpora with multiple tiers of annotation.
This ability to investigate the interfaces between multiple layers of
annotations on a corpus is especially important to researchers of discourse.
The book is right to devote more attention to lexical and syntactic
annotation, since the technology for those is more mature, but these two
types of annotation are more concerned with language at the sentence level
than at the discourse level. In light of the great demand for a deeper
understanding of discourse, a stronger emphasis (and probably a separate
chapter) on discourse annotation tools would be welcome. Furthermore, the
book should report statistics on the accuracy and coverage of all the tools,
rather than only a selection, for readers’ reference.

In summary, ‘Computational Methods for Corpus Annotation and Analysis’ is an
excellent book for corpus linguists who are interested in using advanced
corpus queries. It presents the latest computational tools for corpus
annotation and analysis in a very accessible manner. The advice and resources
in the book are also very practical and useful. It is highly recommended to
researchers and students of corpus linguistics.


Ai, Haiyang and Xiaofei Lu. 2010. A web-based system for automatic measurement
of lexical complexity. Paper presented at the 27th Annual Symposium of the
Computer-Assisted Language Instruction Consortium. Amherst, MA.

Anthony, Laurence. 2003. AntMover, Version 1.0. Tokyo, Japan: Waseda
University.

Archer, Dawn, Andrew Wilson and Paul Rayson. 2002. Introduction to the USAS
category system. Lancaster: University Centre for Computer Corpus Research on
Language, Lancaster University.

Brown, C., Snodgrass, T., Kemper, S. J., Herman, R. and Covington, M. A. 2008.
Automatic measurement of propositional idea density from part-of-speech
tagging. Behavior Research Methods 40. 540-545.

Cobb, Tom and Marlise Horst. 2011. Does word coach coach words? CALICO Journal
28. 639-661.

Covington, Michael A. and Joe D. McFall. 2010. Cutting the Gordian knot: The
moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics
17. 94-100.

Crystal, David. 1982. Profiling linguistic disability. London: Edward Arnold.

Fey, Marc E. 1986. Language intervention with young children. San Diego:
College-Hill Press.

Garside, Roger. 1987. The CLAWS word-tagging system. In Roger Garside,
Geoffrey Leech and Geoffrey Sampson (eds), The computational analysis of
English: A corpus-based approach. London: Longman. 30-41.

Graesser, Arthur C., Danielle S. McNamara, Max M. Louwerse and Zhiqiang Cai.
2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research
Methods, Instruments, and Computers 36. 193-202.

Green, Spence and John DeNero. 2012. A class-based agreement model for
generating accurately inflected translations. In Proceedings of the 50th
Annual Meeting of the Association for Computational Linguistics. Stroudsburg:
Association for Computational Linguistics. 146-155.

Heatley, Alex, I. S. Paul Nation, and Averil Coxhead. 2002. RANGE and
FREQUENCY programs. Wellington: Victoria University of Wellington.

Johnson, Wendell. 1944. Studies in language behavior: I. A program of research.
Psychological Monographs 56. 1-15.

Johnston, Judith R. and Alan G. Kamhi. 1984. Syntactic and semantic aspects of
the utterances of language-impaired children: The same can be less.
Merrill-Palmer Quarterly 30. 65-86.

Klein, Dan and Christopher D. Manning. 2003. Accurate unlexicalized parsing.
Proceedings of the 41st Meeting of the Association for Computational
Linguistics. Stroudsburg: Association for Computational Linguistics. 423-430.

Laufer, Batia, and I. S. Paul Nation. 1995. Vocabulary size and use: Lexical
richness in L2 written production. Applied Linguistics 16. 307-322.

Levy, Roger and Galen Andrew. 2006. Tregex and Tsurgeon: Tools for querying
and manipulating tree data structures. In Proceedings of the Fifth
International Conference on Language Resources and Evaluation, 2231-2234.
Paris: ELRA.

Liang, Maocheng and Jiajin Xu. 2011. TreeTagger for Windows 2.0 (Multilingual
edition). Beijing: Beijing Foreign Studies University.

Long, Steven H., Marc E. Fey and Ron W. Channell. 2008. Computerized
Profiling, Version 9.7.0. Cleveland, OH: Case Western Reserve University.

Lu, Xiaofei. 2009. Automatic measurement of syntactic complexity in child
language acquisition. International Journal of Corpus Linguistics 14. 3-28.

Lu, Xiaofei. 2010. Automatic analysis of syntactic complexity in second
language writing. International Journal of Corpus Linguistics 15. 474-496.

Lu, Xiaofei. 2012. The relationship of lexical richness to the quality of ESL
learners’ oral narratives. The Modern Language Journal 96. 190-208.

MacWhinney, B. 2000. The CHILDES project: Tools for analyzing talk. Mahwah:

Malvern, David, Brian Richards, Ngoni Chipere, and Pilar Durán. 2004. Lexical
diversity and language development: Quantification and assessment. Houndmills:
Palgrave MacMillan.

McCarthy, Philip M. and Scott Jarvis. 2007. vocd: A theoretical and empirical
evaluation. Language Testing 24. 459-488.

McCarthy, Philip M. and Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A
validation study of sophisticated approaches to lexical diversity assessment.
Behavior Research Methods 42. 381-392.

Minnen, Guido, John Carroll and Darren Pearce. 2001. Applied morphological
processing of English. Natural Language Engineering 7. 207-223.

Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision
trees. In Proceedings of the International Conference on New Methods in
Language Processing. Manchester: University of Manchester. 44-49.

Toutanova, Kristina, Dan Klein, Christopher D. Manning and Yoram Singer. 2003.
Feature-rich part-of-speech tagging with a cyclic dependency network. In
Proceedings of Human Language Technologies: The 2003 Conference of the North
American Chapter of the Association for Computational Linguistics.
Stroudsburg: Association for Computational Linguistics. 252-259.

Tseng, Huihsin, Pi-Chuan Chang, Galen Andrew, Dan Jurafsky and Christopher
Manning. 2005. A conditional random field word segmenter for SIGHAN Bakeoff
2005. In Proceedings of the fourth SIGHAN Workshop on Chinese Language
Processing. Singapore: Asian Federation of Natural Language Processing.

Xu, Jiajin and Yunlong Jia. 2011. BFSU Stanford POS tagger: A graphical
interface Windows version. Beijing: Beijing Foreign Studies University.


Phoebe Lin is Research Assistant Professor at the Hong Kong Polytechnic
University. Her research focuses on the acquisition, processing and use of
formulaic language by first and foreign language learners. She publishes on
corpus linguistics, applied linguistics, English vocabulary and second
language acquisition. She has a forthcoming monograph on the prosody of
formulaic language in spontaneous English spoken discourse.

Page Updated: 09-Mar-2015