Review of Corpus-linguistic applications.
EDITORS: Gries, Stefan Th.; Wulff, Stefanie; Davies, Mark
TITLE: Corpus-linguistic applications
SUBTITLE: Current studies, new directions
SERIES TITLE: Language and Computers 71
Andrew Caines, CCL Group, Research Centre for English and Applied Linguistics,
University of Cambridge
This is an edited volume of fifteen papers which arise from the eighth
conference of the American Association for Corpus Linguistics, held in 2008 at
Brigham Young University, Utah. It is the seventy-first volume in the
long-running Rodopi series, 'Language and Computers: Studies in Practical
Linguistics'. The papers are divided into four themes, representing to some
extent the range of presentations given at the conference. The themes are:
diachronic applications, function-oriented applications, register/genre
applications and methodological applications.
In the first section, reporting diachronic research, there are four papers by
Viola Miglio, Alfonso Medina Urrea, Juhani Rudanko, and Cristina Mota. The
second function-oriented section includes work by Georgie Columbus, Philip
Dilts, and Tatiana Zdorenko. Phuong Dzung Pho, Eniko Csomay & Viviana Cortes,
Luciana Diniz, and Eileen Fitzpatrick & Joan Bachenko contribute four papers in
total to the third section on register/genre. The fourth and final section is
made up of papers by Stefan Th. Gries, Christopher Cox, Elke Teich & Peter
Fankhauser, and Kenneth Bloom & Shlomo Argamon.
Miglio presents a study of the grammaticalization of the Spanish phrase 'dizque'
(now 'allegedly', 'supposedly') from the 12th to the 20th century, from an
impersonal to an evidential to an epistemic function. She uses a number of
corpus resources and discusses the problems arising from gaps in the data.
Nevertheless she tracks its fluctuations in meaning, which have not been entirely
unidirectional along the grammaticalization cline described above.
Medina Urrea presents an attempt at automatic diachronic morphological
profiling of 16th-20th century texts, tracking the
development of Mexican Spanish and Peninsular Spanish separately. He uses
automatic segmentation to identify morphemes and measures paradigm
(dis-)similarity by Euclidean distance. One important outcome is that a
distinct Mexican Spanish is shown to have emerged before the 18th century --
earlier than previously thought.
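To make the distance measure concrete, the following is a minimal sketch of Euclidean distance between two morphological profiles. It assumes each profile is a mapping from affix to relative frequency; the affixes, the figures and the function name are hypothetical illustrations and are not taken from Medina Urrea's paper.

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two affix-frequency profiles (dicts).

    An affix missing from one profile is treated as having frequency 0.
    """
    affixes = set(profile_a) | set(profile_b)
    squared = sum(
        (profile_a.get(affix, 0.0) - profile_b.get(affix, 0.0)) ** 2
        for affix in affixes
    )
    return math.sqrt(squared)


# Hypothetical relative frequencies of three suffixes in two varieties.
mexican = {"-ito": 0.30, "-ero": 0.50, "-mente": 0.20}
peninsular = {"-ito": 0.10, "-ero": 0.60, "-mente": 0.30}

distance = euclidean_distance(mexican, peninsular)
```

The smaller the distance, the more similar the two affix inventories; identical profiles yield a distance of zero.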
Rudanko meanwhile takes the verb 'submit' and homes in on the years 1880-1922
and the present day in order to pick out changes in complement distribution in U.S.
and U.K. English. He identifies a 'Great Complement Shift' by which there has
been a dramatic increase over time in gerundial rather than infinitival
complements with the verb in question, a shift he describes as system-internal
Lastly under the diachronic theme, Mota applies Kilgarriff's (2001) corpus
similarity measure to samples of Portuguese journalism genres in the period
1991-1998. The stated goal of the paper is to apply statistical methodology to
corpus comparison but since the work forms part of a larger project in named
entity recognition, this topic too features significantly in the paper.
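Kilgarriff (2001) compares corpora using a chi-square statistic computed over their most frequent words. A minimal sketch of that idea follows, assuming tokenised input and a hypothetical cut-off parameter `top_n`; the exact configuration Mota uses is not described here, so this should be read as an illustration only.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of two corpora.

    Lower values indicate more similar word-frequency distributions.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    total = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((counts_a[word], size_a), (counts_b[word], size_b)):
            # Expected count if the word were spread in proportion to corpus size.
            expected = joint_count * size / (size_a + size_b)
            total += (observed - expected) ** 2 / expected
    return total


# Toy English stand-ins for two journalism topics (hypothetical data).
sport = "the team won the match".split()
politics = "the government passed the budget".split()

same_topic = chi_square_distance(sport, sport)
cross_topic = chi_square_distance(sport, politics)
```

Comparing a sample with itself gives zero, while the cross-topic comparison gives a positive value; tracking such values between successive time slices is one way to plot vocabulary change.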
The first paper of the second section on function-oriented applications is by
Columbus on invariant tags -- 'eh', 'yeah', 'no', 'na' -- in the New Zealand,
Indian and British components of the International Corpus of English (ICE). She
identifies likely areas of cultural misunderstanding and points the findings
towards application in the pedagogical domain.
Dilts presents a study of 'semantic orientation' and 'semantic preference' using
the British National Corpus (BNC). The orientation of nouns is assessed through
the type of adjective they collocate with, and this is found to correlate with
positive/negative ratings for the same set of nouns -- a measure of their
Zdorenko examines subject omission in Russian using the Russian National Corpus.
She finds evidence that the binary encoding required by the 'Principles and
Parameters' tradition is too crude when actual usage data is analysed. It is not
simply a matter of Russian being or not being a null subject language. Rates of
omission are in fact domain-dependent. The null subject is found most often in
spontaneous conversation, with first and second person subjects, and with an
identified set of verbs.
The third section (on register/genre) begins with Pho's study of moves in a
corpus of 40 journal articles. She examines abstracts and introductions for
'moves' -- segments of text with a certain function -- and the linguistic
features which associate with them. The intended application is to better
instruct nonnative speakers in academic writing techniques.
Next, Csomay & Cortes investigate 4-word 'lexical bundles' in terms of the
'vocabulary-based discourse unit' (VBDU; conceptualised by Youmans 1991,
programmed by Biber, Connor and Upton 2007). A VBDU is a section of text of at
least 50 words, the limits of which are identified when the section's lexical
content is sufficiently distinct by a set statistical measure from the text
which follows it. The authors take the first three and the second three VBDUs
from a corpus of TOEFL assignments and compare the VBDU groups for differences
in four-word lexical bundles. In this way, the outcome of their work is a
side-by-side comparison of introductions and bodies of text at the discourse level.
The third section continues with an analysis by Diniz of modal chunks in
academic discourse, specifically examining the language professors use to
address their students. This is a functional analysis which identifies
politeness as the most likely function for modals in this context, followed by
transfer of responsibility and communication of expectations among others.
The fourth and final paper in this section is by Fitzpatrick & Bachenko. They
present a forensic linguistic study of ground truth and the identification of
deception through linguistic cues. They identify twelve linguistic cues which
might signal deception (among them: hedges, a preference for negative
expressions, and a range of inconsistencies such as tense changes) and manually
annotate a training corpus of criminal statements, police interrogations and
testimonies. The predictive accuracy of these cues is high (c.76%) and promising
for future research. The authors observe that a concentration of the linguistic
cues is the most likely indicator of deception.
The methodological papers include an exploration by Gries of an underresearched
issue which affects all corpus work: dispersion (the measure of the homogeneity
of a word's distribution in a corpus). Gries uses the BNC to compare various
dispersion measures from the literature, and subsequently investigates how these
measures correlate with lexical reaction time data reported in the
Cox's theme is corpus planning. He considers the tagging process, and evaluates
the time-accuracy trade-off in using (a) normalized/unnormalized orthography;
(b) various chunk sizes for rounds of iterative, interactive tagging; (c) tagset
size. He does so in the context of corpus building for minority languages which
are on the whole associated with more modest resources than major language
Teich & Fankhauser present a study of data mining using DaSciTex -- a corpus of
scientific journal papers. They find that certain features (part of speech
distribution, type-token ratio and lexical density) are sufficient to distinguish
DaSciTex from the Freiburg-LOB Corpus of British English (FLOB) -- a more
diverse corpus. These features on the other hand do not successfully identify
the subdisciplines within DaSciTex. Instead, data mining techniques (feature
ranking, clustering and classification) accurately differentiate the subcorpora
of pure science (e.g. linguistics, biology) from mixed disciplines (e.g.
computational linguistics, bioinformatics) from computer science.
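Two of the features mentioned above are simple to compute, as the following sketch shows. It assumes tokens already tagged for part of speech and counts nouns, verbs, adjectives and adverbs as lexical words; the tag names and the example sentence are hypothetical, and Teich & Fankhauser's own definitions may differ.

```python
# Assumed set of 'lexical' (content-word) tags; the labels are illustrative.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def type_token_ratio(tokens):
    """Number of distinct word forms (case-folded) divided by number of tokens."""
    return len({token.lower() for token in tokens}) / len(tokens)


def lexical_density(tagged_tokens):
    """Proportion of tokens whose tag marks them as content words."""
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)


tagged = [
    ("The", "DET"),
    ("parser", "NOUN"),
    ("tags", "VERB"),
    ("the", "DET"),
    ("corpus", "NOUN"),
]

ttr = type_token_ratio([word for word, _ in tagged])
density = lexical_density(tagged)
```

Type-token ratio is sensitive to text length, so in practice it is usually computed over samples of equal size before corpora are compared.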
Finally, Bloom & Argamon present a grammatically motivated system for extracting
opinionated text by identifying the attitudes and the targets of the attitudes
by way of linguistic 'linkages' contained in the text. They test their automated
system on corpora of product and movie reviews and achieve a rate of success
comparable to manual methods of extraction.
These papers present a wide range of innovative, high quality and at times
important work, and the editors are to be credited for assembling such diversity
into one volume. The potential applications are shown to be not only academic
but also pedagogical (Pho, Diniz), judicial (Fitzpatrick & Bachenko) and
commercial (Bloom & Argamon). The common themes of the papers are not only
automaticity, scale and efficiency, but also cognitive grounding, a data-driven
approach and innovative computational techniques in a field which has for so
long fallen short in these three respects. Overall, therefore, the authors
deserve great credit indeed.
However, as is inevitable in a collected volume, there are highs and lows in the
quality of work, the clarity of explanation, and the presentation of data. In
terms of making a good early impression, it is unfortunate that Miglio should be
first up. Her charts are poorly presented -- without axis labelling and without
apparently relativizing the frequencies in comparing (sub)corpora (figures 1 and
2, for example). In addition, strong conclusions are drawn on the basis of
low-frequency data without any caveat as to the validity of the distribution patterns.
Nevertheless, Miglio makes some good points -- particularly when observing that
fiction is a fair indicator of spoken language, given the oral data gap in
diachronic (pre-20th century) studies. Additionally, it is a shame there is not
more focus on the dialect issue raised regarding region and urban/rural
differences (p23), since the data suggest this would have been an interesting
The first section improves from this beginning, though Rudanko is guilty of a
failure to relativize frequency counts in the data tables comparing the American
English and British English sections of the target corpus (he makes up for it
somewhat by doing so in the body text discussion). His paper also suffers from
irrelevant sections and from claims made without sufficient evidence. However,
Medina Urrea presents innovative and interesting work, introducing a combined
measure of affixality / glutinosity, discussing problems in corpus sizes
successfully, and contrasting the morphological systems of Peninsular and
American Spanish through Euclidean distance in a clear and coherent manner.
Mota's paper is the best of all in this opening section on diachronic corpus
applications. Her topic is well chosen and neatly self-contained, while at the
same time pointing to established research (namely, Kilgarriff's distance
statistic for corpus comparison) and future extensions (the named entity
recognition which is the overarching project in which this paper is situated).
Mota selects the texts of one newspaper from a corpus of 1990s Portuguese
journalism, and plots within- and across-topic vocabulary similarity over time at
six-month intervals. The topics are: culture, politics, economy, society and
sport. She presents measures of homogeneity, diversification and change in
vocabulary, finding that -- among other things -- the culture texts are the most
diversified and the least homogeneous, the politics genre is the most
homogeneous, and the economy genre is least diversified and yet demonstrates the
greatest change over time. Mota discusses the implications of these findings for
training Natural Language Processing (NLP) tools on datasets which are specific
to the target genre. The only let-downs in this paper are charts which are
difficult to interpret -- specifically, the right hand plot in figure 5 and the
top right plot in figure 6. Other than that, this paper is of an excellent
The second section starts with Columbus's paper on invariant tags. This is a
solid, if unremarkable, study of 'eh', 'yeah', 'no', 'na' in three varieties of
English. The results are clearly presented and there is a lengthy discussion of
clause position and function. There is an unfortunate tendency to use the
phrase, 'reach significance', in what might be misinterpreted as a statistical
sense even though no supporting statistical measures are described.
Nevertheless, this is on the whole a paper which amply demonstrates the possible
benefits of applying corpus linguistics to discourse research.
Dilts presents a fascinating semantic study, though he overcomplicates an
otherwise clean narrative with various extraneous levels of analysis. It is a
fine example of interdisciplinary work -- drawing as he does on previous
psycholinguistic work on semantic orientation for nouns in English, and NLP
research on semantic preference in terms of the adjectives which collocate with
those nouns. However, the presentation of results by 'empirical',
'full-strength' and 'half-strength' datasets is unnecessary and only serves to
cloud the picture. The decision as to which set to use could have been made
behind-the-scenes and the optimal set presented to the reader as the only set
analysed. As it is, there is little difference in results between the three
sets. All the same, this is an innovative piece of research and the results are
communicated clearly (except for, again, a lack of axis labelling).
The function-oriented second section concludes with Zdorenko's neat case study
of why the Principles and Parameters approach does not work. Use of the null
subject in Russian is shown to depend strongly on genre and register. It is most
frequently found in spontaneous conversation and infrequently found in writing,
even at the more informal register levels. Therefore Russian cannot be assigned
to a binary null-subject or not-null-subject parameter. Moving away from the
generative tradition, Zdorenko instead extends her corpus study to topicality
(person) and lexical association. The null subject is shown to correlate more
strongly with first and second than third person contexts. Zdorenko reports that
'znat' (to know) and 'ponimat' (to understand) are verbs used as discourse
markers comparable to 'I dunno' or 'y'know': ''verbs with a particular pragmatic
function that grammaticalized either in a subjectless form or with a certain
Pho's paper does not sufficiently explain the 'deviance residual' -- used as a
key statistic -- or fully exemplify the 'move', the central concept under
investigation. These assumptions of prior knowledge are too strong for a
crossover volume such as this with an intended audience including linguists and
computer scientists. There are problems also with the corpus size (only forty
articles) and the conclusions which can be drawn on the basis of it. It is
unsurprising that little difference should be found in move construction between
the two genres studied -- applied linguistics and educational technology. More
academic disciplines would need to be included in the research before it could
be said with any certainty that specific moves have specific linguistic features
consistently. Having said this, Pho does pick out apparent differences between
move types, observes well that feature clustering rather than any single feature
alone is the cause of such differences, and demonstrates that this understanding
is beneficial for teaching academic English.
Csomay & Cortes present an innovative and concise study of the change in the
nature of lexical bundles as an academic text progresses. They have a clear
research question, a well-explained methodology, and manage to retrieve results
with clear patterns, finding that their set of 'stance markers' and 'discourse
organizers' occur less frequently as the document moves from introduction to
second section whereas the use of 'referential expressions' increases. Further
detail about each of the bundles within each category is made available to the
reader in the appendix. The paper is only let down by an unreadable figure
(p160) and by the lack of any description of just what the two comparison sets
-- VBDUs 1-3 and 4-6 -- might be in terms of which parts of the text we are considering. The
concept of a VBDU itself is explained clearly, and it is understood that each
must contain at least fifty words. Can it be surmised then that the comparison
is between something like the first and second 150-200 words of a text? The
reader is thus left wondering whether this is effectively the start and middle
for most documents in the corpus, or whether the texts are short enough that we
are in fact comparing the beginning and end. Answers to these questions would
assist the reader in interpreting the results. Also, it is reported that the
corpus contains both spoken and written data, but whether the study is of both
types or just written language is another unexplained issue.
The study of modals by Diniz is well formulated and reported, and the
pedagogical relevance of her work is made clear. She finds that teachers use
indirect language to reduce the power differential to their students, while at
the same time communicating expectations. Instructing non-native speakers on
this nuance of educational communication is an important and necessary step.
Fitzpatrick & Bachenko's paper is excellent and points to a promising future for
their work. Their model can predict the truth or falsity of a proposition at 75%
accuracy and they indicate how they can improve the model in the long term --
primarily by further data collection. They provide a high point on which to end
the third section.
The final, methodological section of the book begins with Gries's exploration of
dispersion in corpus linguistics (a follow-up to Gries 2008). His is
informationally the densest of the papers, breezing through an abundance of
technical detail in only twelve pages. However, the work is highly important --
since, as he points out, dispersion is relevant to virtually all corpus research
-- and, even if the discussion is at times opaque and difficult to follow, there
is a clearly written summary section at the conclusion.
Cox considers what is
required to tag a minority-language corpus. He finds that tagging orthographically
normalized data is 20% more accurate but that the data is more expensive to prepare, that smaller
chunks are preferable for iterative interactive tagging, and that a less
elaborate tagset is more accurate and efficient. Cox notes that these
observations must be set against the purpose of the corpus and the requirements
of the researchers who will be using it. This is a well-written paper with
well-defined research questions and conclusions which are explicitly linked back
to them -- an attribute which cannot be taken for granted in academic literature.
Teich & Fankhauser fail to explain IGain, which they use as a crucial
statistical measure in their paper -- they say a value of 0.48 is 'fairly high'
(p238) and the reader has to take their word for it. Nor are there any
hypotheses for the choice of linguistic features (nouns, verbs, adverbs,
type-token ratio, lexical density) and why these might indicate 'abstractness',
'technicality' and 'informational density'. Furthermore there is no discussion
of why the results for these indicators are as they are: for example, what does
it mean that there are more verbs and adverbs in the control corpus (FLOB) but
more nouns in the scientific corpus (DaSciTex)? The same question can be asked
as to why type-token ratio should be greater in FLOB while lexical density is
greater in DaSciTex. Nevertheless, the classification accuracy reported is very
impressive, as is the subcorpora comparison within DaSciTex. This is in fact
research of a very high standard whose potential impact is not fully realised
due to gaps in the discussion.
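For readers unfamiliar with the measure, the sketch below assumes that 'IGain' denotes information gain in its standard sense: the reduction in class entropy obtained by knowing a feature's value. The function names and the toy data are hypothetical, and the authors' own implementation (including how continuous features are discretised) is not described in this review.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy once split by feature value."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy example: two classes of text and two candidate (discretised) features.
labels = ["science", "science", "general", "general"]
noun_rate = ["high", "high", "low", "low"]     # separates the classes perfectly
adverb_rate = ["high", "low", "high", "low"]   # carries no class information

gain_informative = information_gain(noun_rate, labels)
gain_uninformative = information_gain(adverb_rate, labels)
```

With two equally sized classes the measure runs from 0 bits (the feature says nothing about the class) to 1 bit (the feature determines the class), which gives some context for judging whether a value such as 0.48 is high.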
Bloom & Argamon end the book with an intriguing methods paper in which they
report the use of a dependency parser to identify 'linkages' between attitude
and target (e.g. 'The Matrix [target] was a good [attitude] movie'). They test
the linkage learner on a corpus of user product reviews and a corpus of IMDb
movie reviews. They achieve results comparable to manual extraction and conclude
that the next step in their research is a disambiguator. This is a very
satisfactory way to end the book.
Overall, then, 'Corpus-linguistic applications' is a volume which describes many
aspects of corpus linguistic research, featuring a wide range of innovative
techniques, a wide range of corpus resources, languages and topics, and
indications of future directions in the field. The book will be of interest to
researchers in NLP and computational linguistics first and foremost, but also
discourse, semantic, syntactic, morphological, historical, applied and forensic
linguistics. It is at once an excellent overview of the activity in corpus
linguistics and a varied assortment which demonstrates the diversity of
research in the field. The authors and editors are to be commended on the whole
for an excellent publication.
REFERENCES
Biber, D., U. Connor and T. Upton (2007). Discourse on the move. Amsterdam: John
Gries, St. Th. (2008). Dispersions and adjusted frequencies in corpora.
International Journal of Corpus Linguistics 13: 403-437.
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus
Linguistics 1: 1-37.
Youmans, G. (1991). A new tool for discourse analysis: the Vocabulary Management
Profile. Language 67: 763-789.
ABOUT THE REVIEWER
Andrew Caines recently completed the thesis for his PhD at the University
of Cambridge. His research is a corpus-based study of an innovative
construction in English -- namely, the 'zero auxiliary' interrogative:
'what you doing? you going to town? you talking to me?' For more
information go to http://www.srcf.ucam.org/~apc38