Review of Corpora and Language Learners

Reviewer: Przemysław Kaszubski
Book Title: Corpora and Language Learners
Book Author: Guy Aston, Silvia Bernardini, Dominic Stewart
Publisher: John Benjamins
Linguistic Field(s): Applied Linguistics
Text/Corpus Linguistics
Language Acquisition
Issue Number: 16.1893

Date: Thu, 16 Jun 2005 19:02:28 +0200
From: Przemek Kaszubski
Subject: Corpora and Language Learners

EDITORS: Aston, Guy; Bernardini, Silvia; Stewart, Dominic
TITLE: Corpora and Language Learners
SERIES: Studies in Corpus Linguistics 17
PUBLISHER: John Benjamins
YEAR: 2004

Przemysław Kaszubski, School of English, Adam Mickiewicz University,
Poznań, Poland


CORPORA AND LANGUAGE LEARNERS features a selection of
papers presented at the fifth meeting of the biennial TaLC (Teaching
and Language Corpora) conference, which was held in Bertinoro, Italy,
in the summer of 2002. The book is divided into five parts, the central
sections exploring three areas involving corpora and
learners: "Corpora by learners" (i.e. corpus-based studies of learner
language, 6 papers), "Corpora for learners" (various types of target
language corpora, 4 papers), and "Corpora with learners" (data-driven
learning, 3 papers). These 'core' contents are braced by two more
general contributions: a proposal for a corpus-informed theory for
applied linguistics, and an overview of prospects for applying the Web
to corpus-based pedagogy. An index (pp. 301-305) and contributors'
bionotes (307-311) complement the volume.

In their "Introduction: Ten years of TaLC", the editors, previewing the
book's organization and contents, note the field's constantly evolving
and diversifying efforts to optimize the link between corpus application
and language pedagogy. Central to these efforts are attempts to
understand learners and their needs, and the necessity to resolve the
vexed notion of input 'authenticity', surfacing in several papers.

The first major contribution, Michael Hoey's "The textual priming of
lexis", is the one that offers "A theory for TaLC". The author claims
that lexical units -- central to his proposal -- display the property of
becoming loaded ('primed') in a mind exposed to frequently repeating
patterns of usage. Priming may concern any broadly understood
grammatical and collocational properties, both within and beyond the
sentence. Thus, a word may, for example, be primed for acting as a
noun or verb, for representing certain meanings, for preceding or
following specific modification patterns (colligation), for appearing in
particular textual positions (textual colligation), for contributing to
textual relations (e.g. a Problem-Solution pattern), etc. Such primings
are, in addition, relative to specific genres and domains of use. Priming
may "change through an individual's lifetime" (p. 24); it also precedes
grammatical categorizations, which are likely to be post hoc creations.
According to Hoey, effective studies of primings must be based on
specialized corpora, which do not regularize any specific preferences
in favour of the 'big picture'. Evaluating the pedagogical relevance of
his theory, the author points to the role of teachers and materials in
ensuring correct, though gradual, priming of lexical content, properly
contextualized. Priming may also account for creative uses of
language, which, as Hoey claims, can breach some -- but never all --
of the priming constraints (the latter would produce "non-language").
Overall, priming theory, recently elaborated in a monograph (Hoey
2005), merits attention in that it aptly positions, and legitimizes, corpus-
based lexical research within the larger scope of psycholinguistics,
language variation, and acquisition theory.

The first paper in the "Corpora by learners" part is Yukio
Tono's "Multiple comparisons of IL, L1 and TL corpora: The case of L2
acquisition of verb subcategorization patterns by Japanese learners of
English". Solidly grounded in L1 and L2 acquisition theory and Levin's
division of verb classes, the paper makes a methodological case
for a multiple-corpus comparison method in corpus studies of
learner language. Tono shows that combining interlanguage (IL)
material (at possibly various stages of proficiency) with, on the one
hand, appropriate target language corpora (here: English textbooks)
and, on the other, comparable L1 corpora, can make it possible to
capture computationally diverse effects influencing SLA, such as the
L1 effects, the L2 input effects, or the developmental effects. The
advanced linguistic analysis relies on syntactic parsing, database
systems and log-linear analysis of clusters, whose brief discussion
some readers may find a little obscure. The author's concluding wish is
to see international collaboration for the development of a
computational model of SLA.

In "New wine in old skins? A corpus investigation of L1 syntactic
transfer in learner language", Lars Borin and Klaus Prütz attempt to
investigate the English syntax of Swedish university-level students through
frequencies of part-of-speech (POS) n-grams (a procedure feasible for
languages with fixed-order syntax, as the authors rightly point out, p.
71). A contrastive, multi-corpus environment is also advocated here,
the corpora ranging between 350,000 and 1 million tokens. The
counted frequencies of 1- to 4-grams (excluding sequences containing
proper nouns and punctuation, as well as, controversially, those
exclusive to either language) are compared and tested statistically
(Mann-Whitney), revealing a predominant overuse pattern in the
learner data. The authors illustrate their findings and compare them
with earlier studies, most notably Aarts and Granger (1998). The final
outcome is far from definitive, but the discussion sheds interesting light
on the significance of methodological decisions for this kind of
research, such as about the size of the adopted tagset or the degree
of manual adjustment in the frequency lists.
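
For readers unfamiliar with the procedure, the counting step can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the tag names ("NNP", "PUNCT") and the toy sentence are hypothetical, and the statistical comparison of the resulting lists (Mann-Whitney) is not covered.

```python
from collections import Counter

# Hypothetical tag names; the tagset actually used in the paper is not
# reproduced here.
EXCLUDED_TAGS = {"NNP", "PUNCT"}


def pos_ngram_counts(tagged_sentences, max_n=4, excluded=EXCLUDED_TAGS):
    """Count POS n-grams (n = 1..max_n) within each sentence, skipping
    any n-gram that contains an excluded tag."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        for n in range(1, max_n + 1):
            for i in range(len(tags) - n + 1):
                gram = tuple(tags[i:i + n])
                if excluded.isdisjoint(gram):
                    counts[gram] += 1
    return counts


# Toy input: one sentence as (word, tag) pairs.
learner = [[("She", "PRP"), ("likes", "VBZ"), ("London", "NNP"), (".", "PUNCT")]]
counts = pos_ngram_counts(learner)
```

Counting within sentence boundaries and filtering on whole n-grams are design choices of this sketch; as the paper itself suggests, such decisions (like the size of the tagset) can noticeably affect the outcome.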

Agnieszka Leńko-Szymańska's "Demonstratives as anaphora markers
in advanced learners' English" adopts a comparatively lighter
computational approach and a more traditional comparison paradigm,
with a Polish university learner corpus (PELCRA; four proficiency
levels) set against just a native speaker corpus norm (BNC Sampler).
The applied log-likelihood and chi-square statistics demonstrate that
Polish learner writers overuse distal anaphoric signals ('that', 'those'),
primarily in the determiner function, and that the problem does not
seem to disappear with rising proficiency. The author accounts for that
by pointing to the lack of appropriate, explicit explanations in the
grammar books.
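
As an illustration of the kind of statistic involved, the sketch below computes a two-corpus log-likelihood value in the formulation commonly used in corpus linguistics. It is an assumption that this matches the exact computation in the paper, and all frequencies used with it here are invented for demonstration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) for a single item.

    freq_a, freq_b: occurrences of the item in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Invented counts: 300 hits in a 100,000-token learner corpus
# against 200 hits in a 100,000-token reference corpus.
score = log_likelihood(300, 100_000, 200, 100_000)
```

A score above 3.84 (the chi-square critical value for one degree of freedom at p < 0.05) is conventionally read as a significant difference; the direction of the difference (overuse or underuse) has to be read off the relative frequencies.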

In "How learner corpus analysis can contribute to language teaching:
A study of support verb constructions", Nadia Nesselhauf presents yet
another learner corpus research scheme, in which the uses
of 'make', 'have', 'take', and 'give' in support constructions (extracted
by eyeball analysis of concordance lines), are judged for appropriacy
not just against a comparable native English reference corpus (written
BNC), but also using lexicographic sources and native-speaker
informants. The author also undertakes to seek correspondences and
clusters across the error types (despite rather low frequencies). Some
of the suggested implications for teaching may seem obvious (e.g. that
frequency information in learner data is insufficient and should be
complemented by appropriate native-speaker genre/text-type
frequency); more importantly, Nesselhauf reminds us of the need to
consider non-corpus factors in judging errors, such as the degree of
communicative disruption. One interesting pedagogical suggestion for
her data is the idea of focusing learners' attention on instances where
single verb uses differ semantically from the corresponding support
constructions (e.g. 'take notice' vs 'notice').

Lynne Flowerdew's article "The problem-solution pattern in apprentice
vs. professional technical writing: An application of appraisal theory"
explores the possibility of applying the systemic-functional Appraisal
framework of categorizing evaluative language to an analysis of cross-
corpus keyword and key-keyword listings generated with Scott's
WordSmith Tools. The author concentrates on apprentice and
professional authors' use of 'inscribed' (explicitly evaluative,
e.g. 'problem') vs. 'evoking' (inviting evaluation, e.g. 'impact') lexis in
signalling problems and/or solutions. The findings indicate that, for the
genre in question, the majority of keywords identified are indeed
problem-solution in nature, and that while learner writers tend to use
inscribed terms for both the Problem and Solution elements, native-
English professionals signal problems with more evoking terms. This,
Flowerdew argues, may be a teaching-induced phenomenon; other
encountered incongruencies are put down to the inequality of topics in
the two corpora under comparison.

Ngoni Chipere, David Malvern and Brian Richards' paper "Using a
corpus of children's writing to test a solution to the sample size
problem affecting type-token ratios" is primarily computational in
character. The authors review and criticize various existing measures
of lexical richness, in particular the type-token ratio (TTR), and put
forward their own formula for a D parameter, which is independent of
the text sample size and, as empirically tested in the study, better
correlated with varied proficiency levels, determined by human scorers
and certain known measures (word length, text length). The D metric
thus appears especially well suited for tracing linguistic development,
and it is only regrettable that the authors do not provide download or
ordering details for readers wishing to test the tool (by comparison, a
mildly criticized measure, standardized TTR, is readily available in
WordSmith Tools).
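
The sample-size problem itself is easy to demonstrate. The sketch below is not the authors' D measure (whose formula is not given in this review); it only shows how a plain TTR falls as a sample grows, alongside a simple segment-averaged TTR in the spirit of a standardized TTR. The function names and the toy text are illustrative assumptions, and the segment-averaged version is not claimed to match WordSmith's implementation.

```python
def ttr(tokens):
    """Plain type-token ratio: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_segmental_ttr(tokens, segment=1000):
    """Average the TTR over consecutive full segments of equal length.
    Leftover tokens that do not fill a segment are ignored."""
    chunks = [tokens[i:i + segment]
              for i in range(0, len(tokens) - segment + 1, segment)]
    if not chunks:
        return None  # text shorter than one segment
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)


# Toy text: the same 13-word sentence repeated 50 times.
text = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
short, medium, full = ttr(text[:13]), ttr(text[:130]), ttr(text)
```

On this toy text the plain TTR shrinks steadily as more of the text is sampled, while the segment-averaged value stays constant, which is the motivation for measures that do not depend on sample size.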

Opening the "Corpora for learners" section, Ute Römer's "Comparing
real and ideal language learner input: The use of an EFL textbook
corpus in corpus linguistics and language teaching" assesses the
linguistic value of pedagogical materials for classroom use, taking
spoken 'if' constructions as an example. While conceding the point about
the impossibility of fully transferring the contextual authenticity of
attested language to a classroom setting, the author declares
confidence in learners' ability to adapt, and in the overall positive
influence of naturalistic language exposure as opposed to special
input. Suggestions for applying findings contrasting the language of
the scanned German textbook conversations and the evidence of the
spoken BNC are also offered. Römer's optimism may be open to some
question, as thus far relatively little empirical evidence exists
confirming the effectiveness of corpus-driven material selection;
conversely, authors such as Aston (2001: 8), Gabrielatos (2005), or
Nesselhauf and Mauranen (this volume) admit the necessity of
considering non-frequency factors.

An interesting proposal for a corpus-based stylistics programme is
described by Bernhard Kettemann and Georg Marko in "Can the L in
TALC stand for Literature?". The authors plan to offer an integrative
and 'hands-on' awareness-raising course for students at English
departments (in particular at Graz University), whose knowledge often
gets excessively compartmentalized. It is claimed that corpus-based
analyses of literary texts should help students integrate their
knowledge and build five important, inter-related types of awareness:
(1) language, (2) discourse, (3) literary, (4) cultural/social, and (5)
methodological / metatheoretical (= how to organize and logically
conduct research). Although it is still at an early stage of design,
Kettemann and Marko characterize in considerable detail each part of
their course, providing illustrations of concordancing and other corpus
activities (e.g. how to discuss the role of performatives retrieved on
the basis of "I * you" frame searches in a Shakespeare corpus).
Special attention is devoted to methodological awareness, which is
meant to build gradually throughout the course, incorporating such
elements as acquisition of strict research procedures, co-textual and
transtextual analysis of data, and the faculty of critical analysis. The
authors hope that, when properly combined with other components in
the curriculum, their course may be successful, especially in view of its
focus on culturally vital literary texts.
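
A frame search of the "I * you" kind can be approximated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name, the treatment of "*" as exactly one word, and the sample text (invented, not taken from any Shakespeare corpus) are all hypothetical and do not reproduce the software used by the authors.

```python
import re


def frame_search(text, frame, width=30):
    """Return KWIC-style lines for a frame such as "I * you",
    where "*" stands for exactly one word."""
    parts = [r"\w+(?:'\w+)?" if part == "*" else re.escape(part)
             for part in frame.split()]
    pattern = re.compile(r"\b" + r"\s+".join(parts) + r"\b")
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text for demonstration only.
sample = "I thank you for this. I beg you, stay. I will see you soon."
hits = frame_search(sample, "I * you")
```

Here the search is case-sensitive and the third sentence is not retrieved, because two words intervene between "I" and "you"; a classroom tool would probably let students vary the number of wildcard slots.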

The possibility of enhancing academic speaking skills with the help of
the Michigan Corpus of Academic Spoken English (MICASE) is in turn
reviewed by Anna Mauranen ("Speech corpora in the classroom"),
who reports on the responses from a teacher and her students after
running such an experimental course. While the teacher found corpus
use fascinating and stimulating (though humbling), students'
appreciation depended on the level of computer literacy. Most cited
problems sound familiar: the need for longer pre-training, high time
cost, the questionable value of corpora for less proficient learners. In
addition, some users found inductive learning uncomfortable and
studying frequency irrelevant. In her comments on these results,
Mauranen proposes that pedagogical authenticity of corpora be seen
as including both 'objective authenticity' (the linguistic evidence) as
well as 'subjective authenticity' (how students relate to corpus
material); secondly, she notes that the appeal of corpus material may
relate to its discourse nature: "interactively saturated" spoken data
may deactivate students more than, e.g., written prose. Other issues
concern adapting corpus activities to analytically processing learners
(e.g. adults), and taking a stand on the native-English vs. English as a
lingua franca (ELF) controversy.

In "Lost in parallel concordances", Ana Frankenberg-Garcia gives
recipes for using parallel concordancing in a general language course.
The assumption is that such practice encourages explicit L1-L2
comparison, which, as current research shows, may facilitate rather
than necessarily impede effective learning, since it engages students
and, providing the teacher is sufficiently experienced, brings to the
fore relevant L1-related difficulties. "Navigating through a parallel
corpus" may depend on whether uni-directional or bidirectional
translations are available. Frankenberg-Garcia considers all the
different options for initiating parallel searches (beginning with source
texts in L1, source texts in L2, target texts in L1, or target texts in
L2) and compares their pedagogical value. Some interesting points
are raised (e.g. about the possibility of using L1 translations as
models), although the activities presented seem unsupported by
classroom practice, which poses the question of their genuine
effectiveness. What is perhaps lacking is some proof of parallel
concordancing actually outperforming bilingual dictionaries in some
contexts. Also, little attention is paid to the age or proficiency of
learners, or the importance of genres. Overall, the paper emerges as
a catalogue of ideas that may (some would say should) be useful, but
which have yet to be proved so. (For those interested, an online
version of the paper is available.)

The third section of the volume, "Corpora with learners", begins with
Passapong Sripicharn's research report on "Examining native
speakers' and learners' investigation of the same concordance data
and its implications for classroom concordancing with EFL learners". In
the recounted experiment, six BA-level Thai and British students were
presented with brief, pre-selected concordance material and asked to
perform three simple tasks: (1) compare collocations of two verbs, (2)
name the difference between two groups of sentences arranged
according to grammatical patterns, (3) guess the meaning of a
concordanced word, complete a gapped line and (most interestingly)
justify the answer during a taped interview. The results showed that
the Thai students were eager to apply data-driven strategies, while the
native-English students preferred to rely on intuition, generalize
beyond the data, question the evidence and call up exceptions. Such
results, while probably anticipated, may have been prompted by the
non-native English group having been introduced to concordancing
prior to the experiment. This flaw in the set-up appears
rather unfortunate; however, the study validly points out that
concordancing does not always have to be used in a data-driven way
(cf. Aston 2001: 22-25), and that limited corpus evidence can encourage
overgeneralizing -- a point for teachers preparing material to beware of.

In "Some lessons students learn: Self-discovery and corpora", Pascual
Pérez-Paredes and Pascual Cantos-Gómez outline a corpus-based,
form-focused protocol designed to help English learners attain greater
awareness of and control over their spoken performance. For
convenience, the protocol only monitors the use of words. Students
access and query hyperlinked transcriptions and audio recordings of
their oral output, and, guided by a series of open-ended questions,
compare the statistics from their own file with the average class results
and then with data derived from reference corpora (it is, however, not
clear which corpora are used for reference). Pérez-Paredes and
Cantos-Gómez describe their system as promoting Nunan's fifth stage
of learner autonomy (learner as researcher) and report generally
positive feedback from their students. A convenient feature of this
networked database environment is that student data are processed
statistically (cluster analysis), allowing teachers to classify learners by
performance. Overall, the system described is an interesting example
of how learner-corpus data can inform IT solutions for learning, a
promising line of development for intelligent CALL (I-CALL).

In the final paper of the book's third section, entitled "Student use of
large, annotated corpora to analyze syntactic variation", Mark Davies
describes his corpora-supported advanced online course in Spanish
syntax, in which students learn to retrieve and combine data from
multiple corpora in order to solve variation tasks. The corpora are
large (100 M words; 200 M words; and the Spanish web -- Google and
Google Groups), diversified, and, in one case, richly annotated to
enable more powerful searches (Davies' Corpus del Español
resembles in this respect his VIEW interface to the British National
Corpus). The course is not corpus-driven,
however: hands-on practice follows readings from a grammar book,
and mainly involves testing the validity of the rules and claims found
there. The author emphasizes the importance of an intensive, task-
based training stage, and of supervising students' early projects
during which they can develop expertise in choosing and combining
corpora and search patterns. A valuable pedagogical suggestion is the
shift from purely quantitative to more explanatory tasks in mid-course.
Concluding, Davies argues that, at advanced levels of proficiency,
even less experienced students can learn to use and appreciate
corpora, a cogent point considering the author's enormous experience
in the field.

In the last, forward-looking article on "Facilitating the compilation and
dissemination of ad-hoc web corpora", William H. Fletcher summarizes
the current possibilities for linguistic exploitation of the World Wide
Web and outlines the prospects for future developments. According to
Fletcher "[t]he quantity of information online greatly surpasses its
overall quality" (p. 275); on the other hand, the infrequency of some
phenomena and genres and the inevitable ageing of finite corpora
force linguists to embrace the web. Techniques of access range from
the most widely known 'browsing' to 'hunting', 'grazing' and
automatic 'crawling', but none of them guarantees immediately high
quality results to linguistic searches. There is therefore a strong need
to filter search engine output by applying linguistic and heuristic "noise-
reduction techniques", which, however, can unduly prolong access
time. Fletcher considers two possibilities for breaking the deadlock: (1)
the creation of a special Web Corpus Archive (WCA), whereby
professionals would help one another by analysing and classifying
web content and submitting reports which would trigger automatic
download and annotation of the pages for future use; and (2) the
creation of a special Search Engine for Applied Linguists (SEAL),
enabling direct, highly sophisticated KWiC concordancing of the web.
Neither solution is free from problems (securing copyright, providing
sufficient processing power, etc.). Fletcher then compares
his 'idealistic' visions with the existing facilities: online concordancers
for static corpora, commercial meta search engines, web
concordancers (e.g. WebCorp), the Internet Archive ('Wayback
machine'), advanced linguistic search engines. A practical tip resulting
from this discussion is that students should be taught "responsible
online searching techniques". Overall, the paper offers a useful, if
slightly lengthy (28 pages), overview of the workings of the web-as-corpus
sub-domain, supported by numerous URLs
that IT-minded teachers should be willing to explore.


As transpires from the extended summary above, Aston et al.'s 2004
collection will be a valuable resource for teachers seeking working and
prospective solutions, as well as up-to-date theoretical motivations, for
corpus-informed teaching practice. The book offers an admirable
continuation of Aston's edited volume of 2001 as well as of the
previous volumes of TaLC proceedings. A sceptical reader could
require more theory and more empirical verification, but there is no
doubt that the field of 'applied corpus linguistics' (a broader term,
borrowed here from the name of an American association and a recent
volume of proceedings from a conference it organized) is growing,
maturing and slowly developing its standards (Hoey, Mauranen,
Römer). This progress should lead to the establishment of models for
integrating corpora with other teaching methods and programmes -- a
key to success not just in academic education (Kettemann and Marko),
but also at schools. As noted by several authors, some technical and
practical issues must be resolved before corpus-driven tasks can be
added to the bank of regular in-course activities. Both Davies and
Mauranen point out the need for extensive, task-based pre-training, a
point all the more vital if the level of initial computer literacy affects
students' motivation and performance. In addition, the relatively long
time required to complete corpus-based activities may confine them to
some tasks and/or some learners: more empirical testing is needed to
explore such feasibilities. Thirdly, ensuring universal and dependable
access (cf. Fletcher) to 'optimal' corpora, both general and
specialized, large and small, will be another key factor determining the
popularity of corpus methods among teachers and learners. Despite
these as yet unresolved problems, Aston et al.'s collection clearly
demonstrates that enough experience has been accumulated in this
area for a comprehensive resource book for teachers to be offered,
which could recommend specific tools, corpora, methods, techniques,
exercises, etc., for meeting specific teaching aims in a typical (not
necessarily task-based, Gabrielatos 2005) language learning syllabus.

Compared with data-driven learning, the 'behind-the-scenes' (Aston
2000) approach, i.e. corpus-based linguistic research, is well
entrenched. The large size of the "Corpora by learners" section shows
that learner corpora have become a staple component of corpus
networks exploited for educational purposes (all major ELT publishers
today rely on their collections of learner data). Ignoring corpus
evidence is likely to lead to artificiality of input, which many applied
corpus linguists openly criticize. However, as already mentioned,
corpus-derived results, even those supported by the most
sophisticated statistical methods, must be used wisely and in
proportion with other factors. On the other hand, as this volume richly
demonstrates, progress in learner corpus research is on-going and
constantly diversifying inasmuch as ever larger and better annotated
resources are created and new (networking) technologies are reached
for (e.g. Tono, Pérez-Paredes and Cantos-Gómez). The field of
pedagogical exploitation of corpora is thus hardly ready to settle,
inviting interested educators continually to refresh their position on its
development. Of course, Aston et al.'s volume could not be
comprehensive in this respect.

There are no obviously weak papers in the reviewed volume,
although, as indicated, some contributions could be questioned for
methodological assumptions or for insufficient scepticism. Additionally,
some debatable omissions in the use of sources may be noted, e.g.
Römer's lack of reference to earlier word-based analyses of written
textbooks (e.g. Ljung 1990) or Kettemann and Marko's lack of mention
of the Web Concordances service. Fletcher, at the time of writing his
article, could not have heard of the WebCorp team's plans to develop
their own linguistic search engine, or of LexWare Culler -- a fast,
Google-based web concordancer equipped with part-of-speech
search syntax and lemmatization rules for grouping results (several
major languages are supported). These gaps, however, hardly
undermine the overall quality of the volume.

The editing is also generally careful, the few slips mostly concerning
orthography and punctuation. The grossest oversight is the missing
Table 2 in Flowerdew's article, an omission preventing comparison
with Table 1, called upon several times.


Dr Przemysław Kaszubski is a teacher of academic writing and a
corpus linguistics researcher and lecturer at the School of English,
Adam Mickiewicz University, Poznań, Poland. His current research
interests concern the use of online corpus resources for academic
writing instruction. He maintains an online concordancer for English
students, and a large corpus linguistics bibliography. In 1995-2002 he
co-ordinated the compilation of the Polish subcorpus of the International
Corpus of Learner English.