Review of Corpus Linguistics 25 Years On

Reviewer: Mike Conway
Book Title: Corpus Linguistics 25 Years On
Book Author: Roberta Facchinetti
Publisher: Rodopi
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Subject Language(s): English
Issue Number: 19.1359

EDITOR: Facchinetti, Roberta
TITLE: Corpus Linguistics 25 Years On
SERIES: Language and Computers Vol. 62
YEAR: 2007

Mike Conway, National Institute of Informatics, Tokyo

The book under review is the edited proceedings of the 25th International
Computer Archive of Modern and Medieval English Conference (ICAME), held at the
University of Verona in May 2004.

The book is 385 pages long, and consists of nineteen chapters, and an
introduction from the editor. Each chapter contains a list of references and, if
appropriate, endnotes. The volume is divided into three sections, reflecting
some of the core concerns of the conference. The first section, ''Overviewing 25
Years of Corpus Linguistics Studies'' (four chapters) looks back at the early
days and development of corpus linguistics. The second section, ''Descriptive
Studies in English Syntax and Semantics'' (eight chapters) is concerned with
corpus-based language description, a core area of corpus linguistics over the
last 25 years. The third section, ''Second Language Acquisition, Parallel Corpora
and Specialist Corpora'' (seven chapters) focuses primarily on issues relating to
the use of corpora in second language acquisition. This book concentrates on
synchronic corpus research. Another book based on the 25th ICAME conference -
Facchinetti & Rissanen (2006) - is concerned primarily with diachronic language

Roberta Facchinetti, the book's editor, provides the introductory chapter, where
she describes the volume as ''a fairly broad and thematic overview of the work
undertaken in the field of computerised corpus linguistic studies from their
origin to the present day.'' Facchinetti then goes on to summarize each chapter
in turn.

Part 1: Overviewing 25 Years of Corpus Linguistics Studies

''Corpus linguistics 25+ years on'' (Jan Svartvik) describes corpus linguistic
research prior to the first ICAME conference from a personal, conversational
perspective. These early days were ''the stone age of corpus linguistics... when
there were no personal computers, no web, no email, no mobile phones, no Google,
and no electronic corpora.'' Svartvik also describes the experience of being a
corpus linguist in the late 1950s and 1960s, in an environment where empirical
approaches were squeezed by the dominant Chomskyan paradigm. The chapter also
outlines the important foundational work conducted at University College, London
as part of the Survey of English Usage project, including details of how this
project was carried out in a period when computers were ''rare, expensive and

''Corpus development 25 years on: from super corpus to cyber corpus'' (Antoinette
Renouf) provides a survey of the recent history of corpus development, building
the chapter around the three major ''motivating forces'' that have driven the
research area forward: ''science (or intellectual curiosity), pragmatics (or
necessity) and serendipity (or chance).'' Using this explanatory framework,
Renouf describes the motivation for the development of the Brown corpus in the
1960s as primarily scientific. Larger corpora developed in the 1980s and 1990s,
such as the British National Corpus (BNC), are referred to by Renouf as
''super-corpora''. The drivers behind the creation of these super-corpora were
again primarily scientific (''there were questions about lexis and collocation,
and indeed even about grammar, that could not be answered within the scope of a
small corpus''), yet serendipity played a role, with the increasing capabilities
of computers and the emergence of corpus-based dictionaries. The creation of
large scale monitor corpora in the 1990s was largely driven by the scientific
motivation to observe language change across time. From the late 1990s,
cyber-corpora (that is, internet derived corpora or ''web-as-corpus'') were
developed due to a range of drivers: serendipity (the web contains a wide range
of linguistic data), pragmatism (downloading documents from the web is cheap
compared to conventional corpus construction techniques) and scientific interest
(the web allows access to the newest usages). In summary, Renouf describes the
historical development of corpora as ''characterised by the tension between the
desire for knowledge and the constraints of practical necessity and
technological feasibility.''

''Seeing through multilingual corpora'' (Stig Johansson) briefly outlines the
development of multilingual corpora over ''the last 10-15 years or so'' where
multilingual corpora are loosely defined as ''collections of texts in two or more
languages which are parallel in some way, either by being in a translation
relationship, or by being comparable in other respects, such as genre, time of
publication, intended readership and so on.'' Johansson then describes two
common forms of multilingual corpora: translation corpora (consisting of
texts and their translation into one or more languages) and comparable corpora
(consisting of original texts in two or more languages, where the texts chosen
are representative of a given genre, time period and so on for each language).
Johansson goes on to describe an attempt at uniting these paradigms in the
English-Norwegian Parallel Corpus. The rest of the chapter uses this
multilingual corpus to explore linguistic differences between
English and Norwegian (for example, the use of the English ''thing'' and Norwegian

''Corpora and spoken discourse'' (Anne Wichmann) presents some of the practical
and theoretical problems confronted by the researcher in constructing speech
corpora, distinguishing between speech corpora that are created as part of the
development of speech technology systems (often under laboratory conditions) and
speech corpora created from ''natural'' data (that is, speech recorded during
''real'' interactions) that tend to be of interest to corpus linguists (and
conversation analysts). Wichmann stresses the importance of including sound
files with spoken discourse corpora: for spoken language corpora (unlike text
corpora), the recording itself is the raw data and ought to be preserved.

Part 2: Descriptive Studies in English Syntax and Semantics

''An example of frequent English phraseology: distributions, structures and
functions'' (Michael Stubbs) begins by emphasizing that the emergence of interest
in phraseology has accompanied the rise of corpus linguistics. Previously the
study of phrases (and the related concepts of n-grams, lexical bundles and so on)
had been crowded out by concern with grammar and lexical issues, and by some
degree of hostility (or indifference) to the frequency-based investigative
techniques appropriate for the study of phrases. Stubbs describes the software
tool used in his study, the PIE (Phrases in English) system, as ''a
powerful interactive database... constructed from the BNC'' which consists of all
the n-grams shorter than a given length in the BNC (with other phrasal patterns,
also based on the BNC, available to the user). Stubbs uses the software to
explore several research areas, one of which is the prevalence of given phrases
across text types. For example, the use of pronouns in fiction (''I don't want
to'', ''I want you to'') and academic writing (''I shall show that'', ''I have already
mentioned'') is analyzed.
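
As an illustration of the kind of n-gram counting that underlies a phrase
database, the following is a minimal Python sketch. It is not the PIE
implementation: the function name ngram_counts and the toy sentence are
assumptions made for this example, and no BNC data is used.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy text standing in for a corpus (hypothetical, not BNC data).
text = "i do n't want to go and i do n't want to stay"
tokens = text.split()

four_grams = ngram_counts(tokens, 4)
for gram, freq in four_grams.most_common(2):
    print(" ".join(gram), freq)
```

A real phrase database would also need tokenization, part-of-speech
information and indexing by text type, none of which is attempted here.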

''The semantic properties of _going to_: distribution patterns in four subcorpora
of _The British National Corpus_'' (Ylva Berglund and Christopher Williams)
analyzes the ''intentional and predictive uses of the going to construction'' in
four different registers/genres (financial, academic, news and spoken). The
analysis showed that the frequency of occurrence of 'going to' (and also the
more informal 'gonna') varies markedly between the chosen registers ''with less
than one hundred instances per million words of running text in academic
writing, to almost 3000 in spoken conversation.'' The authors then go on to
analyze - among other things - the predictive versus intentional use of ''going
to'' across the four genres of interest, concluding that the news genre ''shows a
marked preference for predictive meaning.''
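
Comparisons of this kind rest on normalized frequencies, since subcorpora
differ in size. A minimal sketch of the per-million-words calculation follows.
The function name per_million and all counts and subcorpus sizes are
hypothetical values chosen for illustration, not figures from the chapter.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical (raw count, subcorpus size in words) pairs.
samples = {
    "academic": (45, 500_000),
    "spoken": (6_000, 2_000_000),
}
for register, (count, size) in samples.items():
    print(register, per_million(count, size))
```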

''The superlative in spoken English'' (Claudia Claridge) suggests that rather than
simply expressing factual comparisons, superlatives are primarily used as ''a
means for (often vague) evaluation and the expression of emotion.'' The spoken
section of the British National Corpus was used as data, as the author was
interested in the everyday, informal use of superlatives. The BNC tagset was
utilized to help identify superlatives, with 1973 adjectival superlatives
identified (a frequency of 5 instances per 10,000 words).

''Semantically-based queries with a joint BNC/WordNet database'' (Mark Davies)
describes an attempt at marrying two important linguistic resources: the British
National Corpus and WordNet. The BNC has emerged as a central resource in
English corpus linguistics. WordNet (Fellbaum, 1998), a comprehensive electronic
lexical database widely used in corpus and computational linguistics, is built
around the central notion of sets of synonymous words (''synsets''). The software
described in this paper allows a user to query the BNC/WordNet database for
BNC-derived frequency information for a given word and the synonyms of that word
(along with many other more sophisticated types of search).
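
The basic idea of such a query (look up a word's synonym set, then report a
corpus frequency for each member) can be sketched in a few lines of Python.
This is a toy illustration only: the synonym sets, the frequencies and the
function name synonym_frequencies are all hypothetical, and the sketch makes
no claim about how the database described in the chapter is implemented.

```python
# Hypothetical synonym sets, in the spirit of WordNet synsets.
synsets = [
    {"big", "large"},
    {"small", "little"},
]

# Hypothetical corpus frequencies (not real BNC counts).
frequencies = {"big": 120, "large": 95, "small": 80, "little": 150}

def synonym_frequencies(word):
    """Return (word, frequency) pairs for every member of each synset that
    contains `word`, ordered from most to least frequent."""
    related = set()
    for synset in synsets:
        if word in synset:
            related |= synset
    return sorted(((w, frequencies.get(w, 0)) for w in related),
                  key=lambda pair: (-pair[1], pair[0]))

print(synonym_frequencies("big"))
```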

''Size matters - or thus can meaningful structures be revealed in large corpora''
(Solveig Granath) continues the descriptive theme developed in the previous four
chapters. Granath shows that for some relatively rare constructions, larger
corpora (like the Guardian/Observer British newspaper corpora) are more
informative than the standard one million word corpora commonly used in corpus
linguistics (for example, BROWN, FLOB, and so on). The chapter focuses primarily
on variation in subject/verb word order in sentences that begin with ''thus''.

''Inversion in modern written English: syntactic complexity, information status
and the creative writer'' (Rolf Kreyer) provides a ''discourse functional, corpus
based account of the construction at issue'' (that is, inversion), stressing the
function of inversion within the discourse structure as an aid to readability.
Additionally, two superordinate functions are identified: text structuring
inversion and ''immediate-observer-effect'' inversion (a technique often used in
fiction to give an impression of unmediated perception). Two subsections of the
BNC were used in this work (written-academic and prose-fiction) and instances of
the inversion construction were identified semi-automatically.

''The filling in the sandwich: internal modification of idioms'' (David Minugh)
uses a three hundred million word corpus (composed of the BNC, British and
American newspaper corpora and broadcast transcripts) to investigate the
occurrence of idioms ''and examine the extent to which these prepackaged chunks
of language can be internally expanded so as to link them into the discourse
within which they are used.'' One example of the kind of 'expanded' idiom of
interest, taken from the chapter, is ''restore some political coals to
Newcastle.'' Fifty-five idioms were used, all of which occur in the Collins
COBUILD Dictionary of Idioms (Collins, 2002). Minugh found that - at least for
the fifty-five idioms considered in the study - idiom expansion is much less
common than previous studies seemed to have indicated.

''NP-internal functions and extended use of the 'type' nouns kind, sort and type:
towards a comprehensive, corpus based description'' (Liesbeth De Smedt,
Lieselotte Brems and Kristin Davidse) begins with a brief review of work on type
noun functions from the 1930s to the present, before going on to identify six
categories of type noun (head, modifier, postdeterminer, qualifying, discourse
marker and quotational). These six categories were identified using the previous
literature on type nouns, and also on the basis of a close analysis of corpus
evidence. The final part of the paper consists of an analysis of the frequency
of the six categories of type noun in two corpora: the Times newspaper section
of the COBUILD Corpus (a formal written register) and the Bergen Corpus of
London Teenage Language (the COLT corpus) (an informal spoken register). The
results of this analysis showed that type nouns from the newspaper corpus were
primarily NP-internal and concerned with classification, whereas in the
teenagers' speech corpus, the use of type nouns as adverbial qualifiers was much
more common.

Part 3: Second Language Acquisition, Parallel Corpora and Specialist Corpora

''Student writing of research articles in a foreign language: metacognition and
corpora'' (Francesca Bianchi and Roberto Pazzaglia) describes the creation of a
corpus of published papers in the area of experimental psychology, designed for
the purpose of teaching Italian undergraduate students how to write research

''The structure of corpora in SLA research'' (Ron Cowan and Michael Leeser)
identifies the characteristics that a corpus should have in order to be useful
for studying SLA (Second Language Acquisition). This focus can be compared to
the previous chapter, which was primarily concerned with the development and use
of corpora for teaching a second language. It is suggested that a useful corpus
should consist of a diversity of subjects (that is, topics) in the second
language, and several levels of proficiency in order to track systematic
differences in the development of the second language. The construction of a
small corpus of writing by Spanish students of different levels of proficiency
enrolled in an English language class at the University of Illinois is also
described. The corpus was used to track those errors that remained common even
for those students who had achieved a good proficiency in English.

''The path from learner corpus analysis to language pedagogy: some neglected
issues'' (Nadja Nesselhauf) stresses the difficulties involved in moving from
corpus studies that identify the difficulties faced by L2 learners to
pedagogical policy. The corpus used was derived from the German subcorpus of
ICLE (containing argumentative and descriptive essays by German native speaking
advanced students of English) and consisted of 150,000 words in total.
Nesselhauf focused on a limited number of collocations and found that ''the
collocations that the learners produced are frequently not unacceptable per se
but rather are existing English collocations used inappropriately.'' The final
section of the paper considers how to best use these findings in a pedagogical
setting, stressing the difficulty of moving from corpus studies (that is,
identifying through corpus evidence particular difficulties that L2 learners
face) to a realistic teaching setting with competing demands on classroom time.

''Exploiting the Corpus of East-African English'' (Josef Schmied) explores this
English as a second language corpus (part of the International Corpus of
English, henceforth ICE-EA) and suggests a number of research questions that the
corpus may be used to address. Examples include assessing the lexical
complexity of the ICE-EA corpus compared to other ESL corpora, and assessing the
syntactic complexity of the ICE-EA corpus compared to other English as a second
language corpora (and also to native speaker English).

''Transitive verb plus reflexive pronoun/personal pronoun patterns in English and
Japanese: using a Japanese-English parallel corpus'' (Makoto Shimizu and Masaki
Murata) falls into three sections. The first section describes the general area
of English/Japanese parallel corpora, along with a list of corpora currently
available. In section two the authors explore the use of reflexive and personal
pronouns with transitive verbs, and find that personal pronouns are much more
common than reflexive pronouns. Section three considers the differences between
English and Japanese in their use of reflexive and personal pronouns. The
Context Sensitive and Tagged Parallel Corpus (which consists of parallel
English/Japanese newspaper articles) is used throughout the work.

''The retrieval of false anglicisms in newspaper texts'' (Cristiano Furiassi and
Knut Hofland) describes a method for identifying 'false anglicisms' in newspaper
text. False anglicisms are roughly defined as words or phrases that look like
English, but are not part of the English language (the authors give the example
of 'autostop' as an Italian false anglicism for hitchhiking). The corpus used
was constructed from Italian newspaper text (La Stampa, La Repubblica and Il
Corriere della Sera) and consists of 19.5 million tokens. Computational
linguistic techniques were used to identify false anglicisms, but automated
methods alone did not prove sufficient, and human post-processing was required
in order to eliminate noise.

''Lexical semantics for software requirements engineering - a corpus based
approach'' (Kerstin Lindmark, Johan Natt och Dag, and Caroline Willners)
describes the use of corpus linguistic techniques for analyzing software
requirements. The authors first identify keywords characteristic of the
requirements domain using the WordSmith toolkit (Scott, 2004) and a corpus
constructed from 1932 requirement texts in English. The BNC Sampler was used as
a reference corpus. That is, in order to identify keywords in the software
requirement domain, the WordSmith toolkit was used to pick out those words that
occur more frequently (at a statistically significant level) in software
requirements compared to a more general corpus of English (the BNC Sampler). In
addition to identifying domain specific keywords, an attempt was made at
constructing a WordNet for the domain (that is, a lexical database specifying
synonyms and part/whole relationships) using simple pattern matching techniques
in conjunction with the extracted keywords.
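
The keyword step described above can be sketched as follows. The review does
not say which test statistic the authors used, so this sketch assumes the
log-likelihood statistic, one commonly used measure of keyness. The word
counts, the function names, and the 3.84 cut-off (the chi-square critical
value for p < 0.05 with one degree of freedom) are assumptions for this
example and are not taken from the chapter.

```python
import math

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Log-likelihood (G2) for one word across a target and a reference corpus."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def keywords(target_counts, ref_counts, threshold=3.84):
    """Words relatively more frequent in the target corpus whose score
    exceeds the threshold, sorted by descending score."""
    size_t = sum(target_counts.values())
    size_r = sum(ref_counts.values())
    found = []
    for word, ft in target_counts.items():
        fr = ref_counts.get(word, 0)
        overused = ft / size_t > fr / size_r
        score = log_likelihood(ft, fr, size_t, size_r)
        if overused and score > threshold:
            found.append((word, score))
    return sorted(found, key=lambda pair: pair[1], reverse=True)

# Hypothetical word counts: a tiny "requirements" corpus and a tiny
# general reference corpus.
target = {"system": 50, "shall": 40, "the": 100, "user": 10}
reference = {"system": 5, "shall": 2, "the": 120, "user": 8, "house": 65}

for word, score in keywords(target, reference):
    print(word, round(score, 2))
```

With these toy counts only ''system'' and ''shall'' pass the threshold; ''the''
is excluded because it is relatively less frequent in the target corpus, and
''user'' because its score falls below the cut-off.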

This edited volume of papers from the twenty-fifth ICAME conference is focused
on (primarily English language) corpus linguistics. The first section of the
book (subtitled ''Overviewing 25 years of corpus linguistic studies'') serves as
an introduction to, and history of, the field, with each article authored by an
influential researcher. Section two of the book is concerned with descriptive
studies of syntax and semantics, historically a core area of corpus linguistics.
The eight papers in this section present a representative sample of current work
in descriptive corpus linguistics by well known researchers in the field.
Section three is titled ''Second language acquisition, parallel corpora and
specialist corpora,'' although most of the papers focus on the use of corpora in
the context of studying second language acquisition, or the development of
corpus-based pedagogical tools for the teaching of second languages. The volume
covers a great deal of ground, from the description of new software tools for
corpus linguistics (Mark Davies' chapter on the development of a joint
BNC/WordNet database) to a study of transitive verbs based on parallel corpora
(Makoto Shimizu and Masaki Murata's chapter on English/Japanese parallel
corpora), and succeeds both in providing an overview of the development of the
discipline and in presenting state-of-the-art research.

It is, however, worth mentioning some minor shortcomings of the book.
First, there are some typographical errors, although these are not serious
enough to compromise understanding. Second, the division of the papers into
three main sections does pose some difficulties. While the first and second
sections (dealing with the development of corpus linguistics over the past 25
years and descriptive corpus linguistics, respectively) are unproblematic, the
third section, ''Second Language Acquisition, parallel corpora and specialist
corpora,'' does not seem to have a unifying theme. This is, however, acknowledged
in the editor's introduction and can be equally well seen in a positive light,
reflecting the diversity of modern corpus research.

Collins (2002) _Collins COBUILD Dictionary of Idioms_. London.

Facchinetti, R. & Rissanen, M. (2006). _Corpus-based Studies of Diachronic
English_. Bern: Peter Lang Publishing.

Fellbaum, C. (1998). _WordNet: An Electronic Lexical Database_. Cambridge: MIT

Scott, M. (2004). _WordSmith Tools_. Oxford: Oxford University Press.

Mike Conway is a research fellow at the National Institute of Informatics, Tokyo.