LINGUIST List 21.3318

Tue Aug 17 2010

Review: Text/Corpus Linguistics, morphology, syntax: Gries et al. (2010)

Editor for this issue: Joseph Salmons

        1.    Andrew Caines, Corpus-linguistic applications

Message 1: Corpus-linguistic applications
Date: 17-Aug-2010
From: Andrew Caines
Subject: Corpus-linguistic applications

EDITORS: Gries, Stefan Th.; Wulff, Stefanie; Davies, Mark
TITLE: Corpus-linguistic applications
SUBTITLE: Current studies, new directions
SERIES TITLE: Language and Computers 71
PUBLISHER: Rodopi
YEAR: 2010

Andrew Caines, CCL Group, Research Centre for English and Applied Linguistics, University of Cambridge


This is an edited volume of fifteen papers which arise from the eighth conference of the American Association for Corpus Linguistics, held in 2008 at Brigham Young University, Utah. It is the seventy-first volume in the long-running Rodopi series, 'Language and Computers: Studies in Practical Linguistics'. The papers are divided into four themes, representing to some extent the range of presentations given at the conference. The themes are: diachronic applications, function-oriented applications, register/genre applications and methodological applications.

In the first section, reporting diachronic research, there are four papers by Viola Miglio, Alfonso Medina Urrea, Juhani Rudanko, and Cristina Mota. The second function-oriented section includes work by Georgie Columbus, Philip Dilts, and Tatiana Zdorenko. Phuong Dzung Pho, Eniko Csomay & Viviana Cortes, Luciana Diniz, and Eileen Fitzpatrick & Joan Bachenko contribute four papers in total to the third section on register/genre. The fourth and final section is made up of papers by Stefan Th. Gries, Christopher Cox, Elke Teich & Peter Fankhauser, and Kenneth Bloom & Shlomo Argamon.

Miglio presents a study of the grammaticalization of the Spanish phrase 'dizque' (now 'allegedly', 'supposedly') from the 12th to the 20th century, from an impersonal to an evidential to an epistemic function. She uses a number of corpus resources and discusses the problems arising from gaps in the data. Nevertheless she tracks its fluctuations in meaning, showing that its development has not been entirely unidirectional along the grammaticalization cline described above.

Medina Urrea presents an attempt at automatic diachronic morphological profiling of 16th-20th century texts, tracking the development of Mexican Spanish and Peninsular Spanish separately. He uses automatic segmentation to identify morphemes and measures paradigm (dis-)similarity by Euclidean distance. One important outcome is that a distinct Mexican Spanish is shown to have emerged before the 18th century -- earlier than previously thought.
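
To make the distance measure concrete, here is a minimal sketch in Python. It is not Medina Urrea's implementation: it simply assumes that each variety can be summarised as a profile of affix scores, and the affixes and values shown are invented for illustration.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two profiles given as {affix: score} dicts.

    An affix missing from one profile is treated as having score 0.0
    (an assumption of this sketch).
    """
    affixes = set(profile_a) | set(profile_b)
    return math.sqrt(sum(
        (profile_a.get(affix, 0.0) - profile_b.get(affix, 0.0)) ** 2
        for affix in affixes
    ))

# Hypothetical affix scores, not taken from the paper.
mexican = {"-ito": 0.30, "-illo": 0.10, "-ote": 0.05}
peninsular = {"-ito": 0.20, "-illo": 0.25, "-ote": 0.05}

print(euclidean_distance(mexican, peninsular))
```

The smaller the distance, the more similar the two morphological profiles; identical profiles give a distance of zero.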

Rudanko meanwhile takes the verb 'submit' and homes in on the years 1880-1922 and the present day in order to pick out changes in complement distribution in U.S. and U.K. English. He identifies a 'Great Complement Shift' by which there has been a dramatic increase over time in gerundial rather than infinitival complements with the verb in question, a shift he describes as system-internal change.

Lastly under the diachronic theme, Mota applies Kilgarriff's (2001) corpus similarity measure to samples of Portuguese journalism genres in the period 1991-1998. The stated goal of the paper is to apply statistical methodology to corpus comparison but since the work forms part of a larger project in named entity recognition, this topic too features significantly in the paper.
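
For readers unfamiliar with this kind of measure, the following Python sketch shows a chi-square comparison over the most frequent words of two corpora, which is the general idea behind Kilgarriff's (2001) approach. It is a simplified illustration, not the procedure followed by Mota: the function name, the `top_n` parameter and the toy token lists are all assumptions of the sketch.

```python
from collections import Counter

def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Sum of chi-square contributions over the top_n most frequent joint words.

    Expected counts assume each word is spread across the two corpora
    in proportion to corpus size. Lower scores mean more similar corpora.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        expected_a = joint_count * size_a / (size_a + size_b)
        expected_b = joint_count * size_b / (size_a + size_b)
        score += (counts_a[word] - expected_a) ** 2 / expected_a
        score += (counts_b[word] - expected_b) ** 2 / expected_b
    return score

# Invented toy samples standing in for two newspaper sections.
sport = "the team won the match and the fans cheered".split()
economy = "the bank raised the rate and the market fell".split()

print(chi_square_distance(sport, sport))
print(chi_square_distance(sport, economy))
```

Comparing a sample with itself gives zero; the more the vocabularies diverge, the higher the score.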

The first paper of the second section on function-oriented applications is by Columbus on invariant tags -- 'eh', 'yeah', 'no', 'na' -- in the New Zealand, Indian and British components of the International Corpus of English (ICE). She identifies likely areas of cultural misunderstanding and points the findings towards application in the pedagogical domain.

Dilts presents a study of 'semantic orientation' and 'semantic preference' using the British National Corpus (BNC). The orientation of nouns is assessed through the type of adjective they collocate with, and this is found to correlate with positive/negative ratings for the same set of nouns -- a measure of their semantic preference.

Zdorenko examines subject omission in Russian using the Russian National Corpus. She finds evidence that the binary encoding required by the 'Principles and Parameters' tradition is too crude when actual usage data is analysed. It is not simply a matter of Russian being or not being a null subject language. Rates of omission are in fact domain-dependent. The null subject is found most often in spontaneous conversation, with first and second person subjects, and with an identified set of verbs.

The third section (on register/genre) begins with Pho's study of moves in a corpus of 40 journal articles. She examines abstracts and introductions for 'moves' -- segments of text with a certain function -- and the linguistic features which associate with them. The intended application is to better instruct nonnative speakers in academic writing techniques.

Next, Csomay & Cortes investigate 4-word 'lexical bundles' in terms of the 'vocabulary-based discourse unit' (VBDU; conceptualised by Youmans 1991, programmed by Biber, Connor and Upton 2007). A VBDU is a section of text of at least 50 words, the limits of which are identified when the section's lexical content is sufficiently distinct by a set statistical measure from the text which follows it. The authors take the first three and the second three VBDUs from a corpus of TOEFL assignments and compare the VBDU groups for differences in four-word lexical bundles. In this way, the outcome of their work is a side-by-side comparison of introductions and bodies of text at the discourse level.
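
The bundle-counting step can be sketched as follows in Python. This is only an illustration under simplifying assumptions: the VBDU segmentation itself is not reproduced, the function name is hypothetical, and the sample sentence is invented. Studies of lexical bundles normally also apply frequency and range thresholds, which are omitted here.

```python
from collections import Counter

def four_word_bundles(tokens):
    """Count every contiguous four-word sequence in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)
    )

# Invented sample text, not from the TOEFL corpus.
tokens = "on the other hand it is clear that on the other hand".split()
bundles = four_word_bundles(tokens)

# Show the bundles that occur more than once.
for bundle, count in bundles.items():
    if count > 1:
        print(" ".join(bundle), count)
```

Running the same count separately over two groups of segments would give the two frequency lists that such a comparison rests on.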

The third section continues with an analysis by Diniz of modal chunks in academic discourse, specifically examining the language professors use to address their students. This is a functional analysis which identifies politeness as the most likely function for modals in this context, followed by transfer of responsibility and communication of expectations among others.

The fourth and final paper in this section is by Fitzpatrick & Bachenko. They present a forensic linguistic study of ground truth and the identification of deception through linguistic cues. They identify twelve linguistic cues which might signal deception (among them: hedges, a preference for negative expressions, and a range of inconsistencies such as tense changes) and manually annotate a training corpus of criminal statements, police interrogations and testimonies. The predictive accuracy of these cues is high (c.76%) and promising for future research. The authors observe that a concentration of the linguistic cues is the most likely indicator of deception.

The methodological papers include an exploration by Gries of an underresearched issue which affects all corpus work: dispersion (the measure of the homogeneity of a word's distribution in a corpus). Gries uses the BNC to compare various dispersion measures from the literature, and subsequently investigates how these measures correlate with lexical reaction time data reported in the psycholinguistic domain.
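
As an illustration of what a dispersion measure computes, the sketch below implements DP ('deviation of proportions'), the measure proposed in Gries (2008). It is shown only as one example of the family of measures under comparison; the corpus part sizes and word counts are invented.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """DP: half the summed absolute difference between, for each corpus part,
    the share of the word's occurrences found there and the share of the
    corpus that the part makes up.

    Values near 0 mean the word is spread evenly; values near 1 mean it is
    concentrated in a small portion of the corpus.
    """
    total_word = sum(word_counts)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(count / total_word - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )

# Hypothetical corpus of four equally sized parts.
part_sizes = [1000, 1000, 1000, 1000]
even_word = [5, 5, 5, 5]      # same frequency in every part
clumpy_word = [20, 0, 0, 0]   # same total frequency, all in one part

print(deviation_of_proportions(even_word, part_sizes))
print(deviation_of_proportions(clumpy_word, part_sizes))
```

The two invented words have the same overall frequency but very different DP values, which is precisely why raw frequency alone can mislead.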

Cox's theme is corpus planning. He considers the tagging process, and evaluates the time-accuracy trade-off in using (a) normalized/unnormalized orthography; (b) various chunk sizes for rounds of iterative, interactive tagging; (c) tagset size. He does so in the context of corpus building for minority languages which are on the whole associated with more modest resources than major language projects.

Teich & Fankhauser present a study of data mining using DaSciTex -- a corpus of scientific journal papers. They find that certain features (part of speech distribution, type-token ratio and lexical density) suffice to distinguish DaSciTex from The Freiburg-LOB Corpus of British English (FLOB) -- a more diverse corpus. These features, on the other hand, do not successfully identify the subdisciplines within DaSciTex. Instead, data mining techniques (feature ranking, clustering and classification) accurately differentiate the subcorpora of pure science (e.g. linguistics, biology) from mixed disciplines (e.g. computational linguistics, bioinformatics) from computer science.
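
Two of the features mentioned can be computed as in the Python sketch below. It assumes the text is already tokenised and part-of-speech tagged; the tag names, the choice of which tags count as content words, and the sample sentence are assumptions of the sketch rather than details from the paper.

```python
# Assumption: these coarse tags mark content (lexical) words.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len({token.lower() for token in tokens}) / len(tokens)

def lexical_density(tagged_tokens):
    """Proportion of tokens whose tag is in CONTENT_TAGS."""
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)

# Invented, hand-tagged sample.
tagged = [
    ("the", "DET"), ("corpus", "NOUN"), ("contains", "VERB"),
    ("the", "DET"), ("texts", "NOUN"),
]
tokens = [word for word, _ in tagged]

print(type_token_ratio(tokens))
print(lexical_density(tagged))
```

Note that type-token ratio is sensitive to text length, so comparisons across corpora are usually made on samples of equal size.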

Finally, Bloom & Argamon present a grammatically motivated system for extracting opinionated text by identifying the attitudes and the targets of the attitudes by way of linguistic 'linkages' contained in the text. They test their automated system on corpora of product and movie reviews and achieve a rate of success comparable to manual methods of extraction.


These papers present a wide range of innovative, high quality and at times important work, and the editors are to be credited for assembling such diversity into one volume. Not only are the potential applications shown to be academic, but also pedagogical (Pho, Diniz), judicial (Fitzpatrick & Bachenko) and commercial (Bloom & Argamon). The common themes of the papers are not only automaticity, scale and efficiency, but also cognitive grounding, a data-driven approach and innovative computational techniques in a field which has for so long fallen short in these three respects. Overall, therefore, the authors deserve great credit indeed.

However, as is inevitable in a collected volume, there are highs and lows in the quality of work, the clarity of explanation, and the presentation of data. In terms of making a good early impression, it is unfortunate that Miglio should be first up. Her charts are poorly presented -- without axis labelling and without apparently relativizing the frequencies in comparing (sub)corpora (figures 1 and 2, for example). In addition, strong conclusions are drawn on the basis of low frequency data without caveat as to the validity of distribution pattern. Nevertheless, Miglio makes some good points -- particularly when observing that fiction is a fair indicator of spoken language, given the oral data gap in diachronic (pre-20th century) studies. Additionally, it is a shame there is not more focus on the dialect issue raised regarding region and urban/rural differences (p23), since the data suggest this would have been an interesting story.

The first section improves from this beginning, though Rudanko is guilty of a failure to relativize frequency counts in the data tables comparing the American English and British English sections of the target corpus (he makes up for it somewhat by doing so in the body text discussion). His paper also suffers from irrelevant sections and from claims made without sufficient evidence. However, Medina Urrea presents innovative and interesting work, introducing a combined measure of affixality / glutinosity, discussing problems in corpus sizes successfully, and contrasting the morphological systems of Peninsular and American Spanish through Euclidean distance in a clear and coherent manner.

Mota's paper is the best of all in this opening section on diachronic corpus applications. Her topic is well chosen and neatly self-contained, while at the same time pointing to established research (namely, Kilgarriff's distance statistic for corpus comparison) and future extensions (the named entity recognition which is the overarching project in which this paper is situated). Mota selects the texts of one newspaper from a corpus of 1990s Portuguese journalism, and plots within and across topic vocabulary similarity over time at six-month intervals. The topics are: culture, politics, economy, society and sport. She presents measures of homogeneity, diversification and change in vocabulary, finding that -- among other things -- the culture texts are the most diversified and the least homogeneous, the politics genre is the most homogeneous, and the economy genre is least diversified and yet demonstrates the greatest change over time. Mota discusses the implications of these findings for training Natural Language Processing (NLP) tools on datasets which are specific to the target genre. The only let-downs in this paper are charts which are difficult to interpret -- specifically, the right hand plot in figure 5 and the top right plot in figure 6. Other than that, this paper is of an excellent standard.

The second section starts with Columbus's paper on invariant tags. This is a solid, if unremarkable, study of 'eh', 'yeah', 'no', 'na' in three varieties of English. The results are clearly presented and there is a lengthy discussion of clause position and function. There is an unfortunate tendency to use the phrase 'reach significance' in what might be misinterpreted as a statistical sense even though no supporting statistical measures are described. Nevertheless, this is on the whole a paper which amply demonstrates the possible benefits of applying corpus linguistics to discourse research.

Dilts presents a fascinating semantic study, though he overcomplicates an otherwise clean narrative with various extraneous levels of analysis. It is a fine example of interdisciplinary work -- drawing as he does on previous psycholinguistic work on semantic orientation for nouns in English, and NLP research on semantic preference in terms of the adjectives which collocate with those nouns. However, the presentation of results by 'empirical', 'full-strength' and 'half-strength' datasets is unnecessary and only serves to cloud the picture. The decision as to which set to use could have been made behind the scenes and the optimal set presented to the reader as the only set analysed. As it is, there is little difference in results between the three sets. All the same, this is an innovative piece of research and the results are communicated clearly (except for, again, a lack of axis labelling).

The function-oriented second section concludes with Zdorenko's neat case study of why the Principles and Parameters approach does not work. Use of the null subject in Russian is shown to depend strongly on genre and register. It is most frequently found in spontaneous conversation and infrequently found in writing, even at the more informal register levels. Therefore Russian cannot be assigned to a binary null-subject or not-null-subject parameter. Moving away from the generative tradition, Zdorenko instead extends her corpus study to topicality (person) and lexical association. The null subject is shown to correlate more strongly with first and second than third person contexts. Zdorenko reports that 'znat' (to know) and 'ponimat' (to understand) are verbs used as discourse markers comparable to 'I dunno' or 'y'know': ''verbs with a particular pragmatic function that grammaticalized either in a subjectless form or with a certain pronominal subject''.

Pho's paper does not sufficiently explain the 'deviance residual' -- used as a key statistic -- or fully exemplify the 'move', the central concept under investigation. These assumptions of prior knowledge are too strong for a crossover volume such as this with an intended audience including linguists and computer scientists. There are problems also with the corpus size (only forty articles) and the conclusions which can be drawn on the basis of it. It is unsurprising that little difference should be found in move construction between the two genres studied -- applied linguistics and educational technology. More academic disciplines would need to be included in the research before it could be said with any certainty that specific moves have specific linguistic features consistently. Having said this, Pho does pick out apparent differences between move types, observes well that feature clustering rather than any single feature alone is the cause of such differences, and demonstrates that this understanding is beneficial for teaching academic English.

Csomay & Cortes present an innovative and concise study of the change in the nature of lexical bundles as an academic text progresses. They have a clear research question, a well-explained methodology, and manage to retrieve results with clear patterns, finding that their set of 'stance markers' and 'discourse organizers' occur less frequently as the document moves from introduction to second section whereas the use of 'referential expressions' increases. Further detail about each of the bundles within each category is made available to the reader in the appendix. The paper is only let down by an unreadable figure (p160) and no description of just what the two comparison sets -- VBDUs 1-3 and 4-6 -- might be in terms of which parts of the text we are considering. The concept of a VBDU itself is explained clearly, and it is understood that each must contain at least fifty words. Can it be surmised then that the comparison is between something like the first and second 150-200 words of a text? The reader is thus left wondering whether this is effectively the start and middle for most documents in the corpus, or whether the texts are short enough that we are in fact comparing the beginning and end. Answers to these questions would assist the reader in interpreting the results. Also, it is reported that the corpus contains both spoken and written data, but whether the study is of both types or just written language is another unexplained issue.

The study of modals by Diniz is well formulated and reported, and the pedagogical relevance of her work is made clear. She finds that teachers use indirect language to reduce the power differential to their students, while at the same time communicating expectations. Instructing non-native speakers on this nuance of educational communication is an important and necessary step. Fitzpatrick & Bachenko's paper is excellent and points to a promising future for their work. Their model can predict the truth or falsity of a proposition at 75% accuracy and they indicate how they can improve the model in the long term -- primarily by further data collection. They provide a high point on which to end the third section.

The final, methodological section of the book begins with Gries's exploration of dispersion in corpus linguistics (a follow-up to Gries 2008). His is informationally the densest of the papers, breezing through an abundance of technical detail in only twelve pages. However, the work is highly important -- since, as he points out, dispersion is relevant to virtually all corpus research -- and, even if the discussion is at times opaque and difficult to follow, there is a clearly written summary section at the conclusion. Cox considers what is required to tag a minority-language corpus. He finds that orthographically normalized data is 20% more accurate but more expensive to prepare, that smaller chunks are preferable for iterative interactive tagging, and that a less elaborate tagset is more accurate and efficient. Cox notes that these observations must be set against the purpose of the corpus and the requirements of the researchers who will be using it. This is a well-written paper with well-defined research questions and conclusions which are explicitly linked back to them -- an attribute which cannot be taken for granted in academic literature.

Teich & Fankhauser fail to explain IGain, which they use as a crucial statistical measure in their paper -- they say a value of 0.48 is 'fairly high' (p238) and the reader has to take their word for it. Nor are there any hypotheses for the choice of linguistic features (nouns, verbs, adverbs, type-token ratio, lexical density) and why these might indicate 'abstractness', 'technicality' and 'informational density'. Furthermore there is no discussion of why the results for these indicators are as they are: for example, what does it mean that there are more verbs and adverbs in the control corpus (FLOB) but more nouns in the scientific corpus (DaSciTex)? The same question can be asked as to why type-token ratio should be greater in FLOB while lexical density is greater in DaSciTex. Nevertheless, the classification accuracy reported is very impressive, as is the subcorpora comparison within DaSciTex. This is in fact research of a very high standard whose potential impact is not fully realised due to gaps in the discussion.
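
For readers left in the same position, 'IGain' presumably abbreviates information gain: the reduction in uncertainty (entropy) about a text's class once the value of a feature is known. The Python sketch below shows the standard calculation for a feature that has already been binned into discrete values. The labels and feature values are invented, and nothing here is taken from the authors' own setup.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(labels).values()
    )

def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy given the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Invented example: eight texts, four from each of two corpora.
labels = ["sci"] * 4 + ["general"] * 4
noun_rate = ["high"] * 4 + ["low"] * 4         # separates the classes perfectly
first_letter = ["a", "b", "a", "b"] * 2        # tells us nothing about class

print(information_gain(noun_rate, labels))
print(information_gain(first_letter, labels))
```

With two equally sized classes the measure runs from 0 bits (the feature is uninformative) to 1 bit (the feature separates the classes perfectly), which gives at least a scale against which a reported value can be read.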

Bloom & Argamon end the book with an intriguing methods paper in which they report the use of a dependency parser to identify 'linkages' between attitude and target (e.g. 'The Matrix [target] was a good [attitude] movie'). They test the linkage learner on a corpus of user product reviews and a corpus of IMDb movie reviews. They achieve results comparable to manual extraction and conclude that the next step in their research is a disambiguator. This is a very satisfactory way to end the book.

Overall, then, ''Corpus-linguistic applications'' is a volume which describes many aspects of corpus linguistic research, featuring a wide range of innovative techniques, a wide range of corpus resources, languages and topics, and indications of future directions in the field. The book will be of interest to researchers in NLP and computational linguistics first and foremost, but also to those in discourse, semantic, syntactic, morphological, historical, applied and forensic linguistics. It is at once an excellent overview of the activity in corpus linguistics and a varied assortment which demonstrates the diversity of research in the field. The authors and editors are to be commended on the whole for an excellent publication.


Biber, D., U. Connor and T. Upton (2007). Discourse on the move. Amsterdam: John Benjamins.

Gries, St. Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13: 403-437.

Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics 1: 1-37.

Youmans, G. (1991). A new tool for discourse analysis: the Vocabulary Management Profile. Language 67: 763-789.


Andrew Caines recently completed the thesis for his PhD at the University of Cambridge. His research is a corpus-based study of an innovative construction in English -- namely, the 'zero auxiliary' interrogative: 'what you doing? you going to town? you talking to me?' For more information go to

Page Updated: 17-Aug-2010