Review of Advances in Corpus Linguistics.

Reviewer: Rolf Michael Kreyer
Book Title: Advances in Corpus Linguistics.
Book Author: Karin Aijmer, Bengt Altenberg
Publisher: Rodopi
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Issue Number: 16.27

Date: Fri, 31 Dec 2004 18:33:21 +0100
From: Rolf Kreyer
Subject: Advances in Corpus Linguistics: Papers from ICAME 23

EDITOR: Aijmer, Karin; Altenberg, Bengt
TITLE: Advances in Corpus Linguistics
SUBTITLE: Papers from the 23rd International Conference on English
Language Research on Computerized Corpora (ICAME 23) Göteborg 22-26 May
SERIES: Language and Computers Vol. 49
YEAR: 2004

Rolf Kreyer, University of Bonn

The volume under review is a collection of papers from the 23rd
International Conference on English Language Research on Computerized
Corpora and contains a total of 22 articles on 419 pages. The papers cover
a wide range of topics, which according to the editors ''illustrate clearly
the diversity of research that is characteristic of corpus linguistics
today'' (1). The contributions are subsumed under six ''broad -- and
inevitably overlapping -- categories'' (1):
* The role of corpora in linguistic research
* Exploring lexis, grammar and semantics
* Discourse and pragmatics
* Language change and language development
* Cross-linguistic studies
* Software development
The following synopsis will give a summary of the key points of each of
the articles. The review will conclude with a critical evaluation.


The first section, 'The role of corpora in linguistic research', starts
with an article by Michael Halliday, who explores the spoken language
corpus as a foundation for grammatical theory. Quantitative research into
spoken language, he argues, will increase our understanding not only of
spoken language itself but also of language as a whole. In his view, it is
in spoken language that ''systemic patterns are established and maintained
[...], instantial patterns are all the time being created [...] and the
instantial can become systemic'' (25). For instance, patterns as
described by Hunston/Francis (2000) will, Halliday claims, most probably
develop and change in speech. Here 'non-standard' patterns like the
ones below are also found (19):
(1) It's been going to've been being taken out for a long time. [of a
package left on the back seat of the car]
(2) All the system was somewhat disorganized, because of not being sitting
in the front of the screen. [cf. because I wasn't sitting ...].

Such instances should not be dismissed as errors but rather as ''productive
innovations which pass unnoticed in speech but have not (yet) found their
way into the written language'' (19). The transcription of spoken corpora,
however, is not without problems: it is well-known that meaningful
prosodic features are often not represented, but in Halliday's view the
problem of over-transcribing is more serious. For instance, only in
transcribed speech are homophonous forms such as 'icicle' and 'eye sickle'
overtly distinguishable; thus ''writing systems mask the indeterminacy in
the spoken language'' (16). The analysis of spoken corpora might also prove
challenging, due to what Halliday calls ''the lexicogrammatical bind'' (21)
of corpus research. Obviously, lexical phenomena are more accessible by
corpus linguistic methods than grammatical ones. Spoken language, however,
shows a high level of grammatical intricacy and favours grammatical
systems as opposed to written language, where meaning tends to be conveyed
through lexis (cf. Halliday 1989). Written language therefore is
inherently more amenable to corpus linguistic analysis than spoken language.
So, ''especially in relation to a spoken language corpus, there is work to
be done to discover ways of designing a corpus for the use of
grammarians'' (23).

John Sinclair examines ''the roles of intuition and annotation in corpus
linguistics'' (41), thereby trying to clarify the stance of corpus-driven
as opposed to corpus-based linguists. For Sinclair the ''elusive faculty''
(41) of intuition seems to have a dual status: on the one hand, intuition
has been shown not to be trustworthy: for the most part invented sentences
are not of the kind that are usually found in a corpus, and the findings
that the corpus yields often differ drastically from what has been
expected. On the other hand, the corpus-driven linguist has ''a great
respect for intuition, and cannot work without it'' (56), since it
organises corpus evidence; as Sinclair puts it: ''[t]here is no escape from
intuition if you have command of the language you are investigating'' (47).
However, while the corpus-based linguist ''allows his intuition to overrule
his corpus data and hence gives primacy to the former'' (40), the corpus-
driven linguist tries to keep intuition at bay and is aware of its
limitations at all times.

Similar discrepancies seem to divide corpus-based and corpus-driven
researchers on the topic of annotation: while it seems indispensable to
the former, it is rather obfuscating to the latter. Sinclair's scepticism
towards annotation is due to two reasons: firstly, the language models
that underlie most of the tagging programmes are usually pre-corpus
models. Unfortunately, these models are not made subject to close scrutiny
on the basis of corpus evidence but, according to Sinclair, it is usually
assumed ''that the models are basically correct, and [that ...] there is no
need to open up the whole complexity of language theory and description
for the sake of some minor blemishes'' (52). The second argument against
annotation is linked to the first one: since pre-corpus language models
are inadequate for the description of corpus data, human intervention is
necessary. As a consequence, the process of annotation is not entirely
replicable, thereby failing the first test of scientific method. However,
despite his conclusion that ''corpus-driven linguists are not likely to
have much use for annotation'' (56), Sinclair concedes that it ''has its
place in application, where quick results are needed and rough-and-ready
ones will suffice'' (56).

Starting off with a short discussion of Chomsky's well-known three levels
of explanatory, descriptive and observational adequacy (1964: 62-3), Leech
argues that ''a more realistic account of the main strata of investigation
in linguistics'' (62) could be arrived at by the following hierarchy:

THEORY: formal [and functional] characterization or explanation of
language as a phenomenon of the human mind and of society.
DESCRIPTION: formal [and functional] characterization of a given language,
in terms of theory.
DATA COLLECTION: collection of observations which a description, and
ultimately a theory, has to account for [e.g. corpora] (62).

In order to explore the relation between the above levels and in order ''to
argue against the common assumption that corpus linguistics is concerned
with 'mere data collection' or 'mere description''' (62), Leech describes
two short-term diachronic case studies on modal auxiliaries and
grammatical changes relating to colloquialization. Both studies are based
on the Brown, LOB, Frown and FLOB corpora and two spoken mini-corpora
extracted from the SEU and the ICE-GB corpora. Leech emphasizes that the
description of corpus data does not necessarily lead to true statements
about a language as such. The corpus linguist always has ''to bear in mind
some hazardous assumptions which can be made in moving from data
description to language'' (70), for instance, the well-known issues of
representativeness and of interpreting statistical significance.

This, however, should not lead to discarding the corpus linguistic
enterprise as such. Rather, these hazards should be regarded as a reminder
that corpus-linguistic results usually are provisional and that ''further
corroborating evidence as well as means of increasing accuracy and
reliability'' (71) need to be sought. Finally, in moving from the level
of description to the level of theory, the researcher will have to find
explanations for empirical data: for instance, the decline of modals
between the 1960s and the 1990s that Leech describes might be accounted
for by language-internal factors, such as processes of grammaticalization,
or by external factors, such as colloquialization, democratization or
Americanization. On the whole, then, ''corpus linguistics is not purely
observational or descriptive in its goals, but also has theoretical
implications'' (61).

Section 2, 'Exploring lexis, grammar and semantics', starts with an
article by Joybrato Mukherjee, who investigates the place of corpus data in
a usage-based cognitive grammar. The author tries to show ''that corpus
linguistics and cognitive linguistics are not at all mutually exclusive
but can fruitfully complement each other in developing a genuinely usage-
based model of [...] speakers' knowledge of the underlying language
system'' (96). In particular, the author uses an analysis of the
ditransitive verb GIVE in ICE-GB to illustrate how the lexical and
constructional networks of cognitive grammar (e.g. Langacker 1999) can be
refined by incorporating corpus data. Firstly, corpora provide
frequencies, which in turn yield insights into the strength of the
different links between a particular lexical item and the constructions in
which it can occur. In the case of GIVE, for instance, it is found that
38% of all tokens occur in the pattern 'GIVE + Oi + Od'. The second most
frequent pattern, 'GIVE + Od', accounts for 23.2% of the data. These
patterns are supposed to be more deeply entrenched in the cognitive system
than the other less frequent patterns of GIVE. In addition, corpus data
also provide insights into the context-dependent principles that are at
work in the selection of a particular pattern. The author, for instance,
finds that the pattern 'GIVE + Od' is used in those cases only where the
recipient is either retrievable from the context or where the
specification of the recipient is irrelevant. Thus, Mukherjee
claims, ''corpus-linguistic methodology obviously opens up new and
promising perspectives in cognitive linguistics'' (97).
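
The kind of count that lies behind such percentages can be sketched in a
few lines of Python. This is only an illustration of the general technique
(counting pattern labels and expressing each as a share of all tokens), not
Mukherjee's own procedure; the token list is invented toy data and the
function name pattern_shares is hypothetical.

```python
from collections import Counter

def pattern_shares(patterns):
    """Return each pattern's share of all tokens, in percent (one decimal)."""
    counts = Counter(patterns)
    total = sum(counts.values())
    # most_common() orders the patterns from most to least frequent
    return {p: round(100 * n / total, 1) for p, n in counts.most_common()}

# Invented toy data: one pattern label per corpus token of GIVE.
tokens = ["GIVE + Oi + Od"] * 5 + ["GIVE + Od"] * 3 + ["other"] * 2
shares = pattern_shares(tokens)
print(shares)
```

In a usage-based account, the higher shares would then be read as an
indication of stronger entrenchment of the pattern concerned.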

Caroline David puts 'putting verbs' to the test of corpora. In particular,
she attempts to outline a new typology of 'putting verbs' by taking into
account quantitative data from the corpora Brown, Frown, LOB, FLOB and the
BNC. The first part of her paper is concerned with PUT, SET, PLACE, and
LAY. The author finds that PUT is the most frequent of the four and is
more likely to occur in idiomatic structures than the other three. This
the author counts as evidence for ''generalness of meaning'' (102). The
other three, in contrast, seem to be associated with a particular way of
putting, namely a rather careful way. PUT, therefore, ''is considered the
prototypical verb of the general process of putting with little additional
information regarding the way things are displaced'' (105) while the other
three ''are classified together as a kind of manner of putting'' (105). The
second part of the paper concerns verbs of the SPRAY/LOAD class, namely
LOAD, COIL and FILL. Here, the author is mainly concerned with syntactic
alternations of the following kind:
(3) I loaded school trunks on to the car.
(4) I loaded the car with school trunks.

The author claims that in example (3) ''the default interpretation is that
all the trunks are loaded, irrespective of whether the car is 'full' or
not'' (107). Constructions of type (3), therefore, usually take
a 'quantification' reading and are thus similar to constructions with COIL.
In the second case, however, a qualification, namely that the car is now
full, is emphasized. Constructions of type (4) thus resemble those with
FILL-verbs, such as CLOAK, FLOOD, or SOAK.

Peter Willemse explores the relationship between 'esphoric' reference,
i.e. cataphoric reference within the same nominal group, and pseudo-definite
NPs, i.e. NPs that ''are formally definite but in fact realize presenting
rather than presuming reference'' (117). Willemse focuses on pseudo-definite
NPs in unmarked existential constructions, since their semantics
entail that the postverbal NP is indefinite. A formally definite
postverbal NP will therefore always have 'pseudo'-definite referential
status, as does the NP ''the usual sleazy reasons for that'' in the following
example:
(5) The Woody Allen-Mia Farrow breakup [...] seems to have everyone's
attention. There are the usual sleazy reasons for that, of course - the
visceral thrill of seeing the extremely private couple's dirt in the
street, etc.

On the basis of 200 tokens from the Bank-of-English corpus, the author
tries to find a ''motivation of the use of the definite article in [...]
the pseudo-definite NPs'' (130). Willemse provides two possible explanations:
(i) The postverbal NP may have 'dual reference', i.e. it may refer to a
type, which is usually hearer-old, and a token, which is usually hearer-
new. In example (5) above, for instance, the specific reasons for the
public fascination are introduced into the discourse and, therefore,
hearer-new. However, the general type of reason that explains such
attention is assumed to be known to the hearer, i.e. hearer-old.
(ii) The other explanation lies in what Willemse calls ''a relation of
[...] 'forward bridging' within the NP'' (131). In example (6) below, the
definite article in 'the shrunken head' is licensed through the fact
that ''the identity of its referent is recoverable by virtue of an
experiential connection with the entity introduced by the second NP: a
head is a part of (the body of) a boy'' (123). In such cases, as in (6),
the definite article is therefore motivated by esphoric reference (123).
(6) In a room outside the court he talked with the French prosecuting
counsel, [...]. There was the shrunken head of a Polish boy.

In his article 'Why ''an angel rides in the whirlwind and directs the
storm''', Jonathan Charteris-Black analyses the use of metaphor in
political corpora. On the basis of the 51 Inaugural Addresses of the
American Presidents and the political manifestos of the Labour and
Conservative parties from 1945 to 1997, the author explores the similarities
and the differences between types of American and British political
discourse. With regard to similarities, for instance, Charteris-Black
finds that POLITICS IS CONFLICT is the most frequently used metaphor in
the two corpora. This conflict either shows in action for ''abstract social
goals that are positively evaluated'' (138) or in action against ''social
phenomena that are negatively evaluated'' (138), as shown in these examples
(138, 139):
(7) While continuing to defend and respect the absolute right of
individual conscience ....
(8) [...] we intend to continue our fight against all form of social

Perhaps more interesting are the differences between the two corpora. For
instance, the author finds that the fire metaphor is only used in the
American corpus. This may be due to the fact that the fire metaphor was
used by George Washington in the context of liberty. Apparently, ''the
metaphorical link between fire and liberty has become a source of
intertextual reference in presidential addresses'' (143). On the other
hand, plant metaphors are only attested in the British manifestos. Again,
the author suggests a historical-cultural explanation: ''the British
passion for gardening lead[s] to the positive associations of words such
as 'growth' and 'nurture''' (149). Charteris-Black also reports on metaphor
borrowing. The conceptual metaphor POLITICS IS RELIGION is well
represented in the American corpus but is only found in the more recent
British manifestos; this metaphor seems to have found its way from
American into British political discourse.

Peter Tan, Vincent Ooi and Andy Chan, in their article on ''Signalling
spokenness in personal advertisements on the Web'', discuss the use of
English as a second language in this register by South East Asians. Within
this speech community, ''English is often relegated to the position of
a 'neutral' and 'transactional' (as opposed to 'interactional') language
where 'affect' (emotion) is played down'' (151). The question now arises as
to how English language resources are employed for informal, private and
personal means in personal advertisements (PA) by South East Asians. In
particular, the authors want to analyse ''to what extent [...] resources of
spoken discourse [are] relied on in PA'' (163). To this end, they compare
the frequencies of augmenters (e.g. 'very', 'a lot', or 'really') and
mitigators (e.g. 'somewhat', 'a bit' or 'only') in a corpus of South East
Asian adverts with their usage in a spoken and a written subcorpus of ICE-
SIN (the Singapore component of the International Corpus of English). On
the basis of this data, the authors find ''that personal advertisers tend
to make use of features of spokenness'' (163). However, it would
be ''premature to say at this stage that Netspeak in South East Asia is
closely associated with the norms of spoken language although it seems to
be an important contributor to the norms associated with personal
advertisements'' (163).
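
Comparisons of this kind across corpora of different sizes usually rest on
normalised rather than raw frequencies. The following Python sketch shows
the arithmetic under that assumption; the review does not say how the
authors normalised their counts, and all counts and corpus sizes below are
invented for illustration.

```python
def per_10k(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per 10,000 words."""
    return 10_000 * raw_count / corpus_size

# Invented figures: raw counts of one augmenter and corpus sizes in words.
adverts = per_10k(120, 40_000)    # hypothetical corpus of personal adverts
spoken = per_10k(900, 300_000)    # hypothetical spoken subcorpus
written = per_10k(300, 200_000)   # hypothetical written subcorpus

print(adverts, spoken, written)
```

With these invented numbers the adverts pattern with the spoken rather than
the written subcorpus, which is the type of result the authors report.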

''Textual colligation: a special kind of lexical priming'' by Michael Hoey
opens up the third section of the proceedings, ''Discourse and Pragmatics''.
Hoey advocates a view that regards ''textual relationships (interactive,
linear, cohesive, hierarchical and structural) as dependent upon and
created by the lexis of the language in a manner not exhausted by the
demands of the individual text'' (173), thereby claiming a vital role for
corpus linguistic methods and findings in text linguistic research. In
analogy to the term 'colligation', which captures the interdependencies of
lexis and syntax, the author employs the term 'textual colligation' to
denote the ''positive and negative preferences of a lexical item with
regard to [...] textual features'' (174) such as participation in cohesive
chains or occurrence as part of the theme in a Theme-Rheme relation.

An analysis of a 100-million-word, predominantly Guardian newspaper corpus
shows, for instance, that the lexical items 'army', 'baby', or 'political'
occur as members of cohesive chains, whereas 'afterwards', 'best'
or 'particularly' seem to show no tendency to form such chains, i.e. these
lexical items have a negative preference with regard to the textual
feature 'cohesion'. Words such as 'reason' or 'option', on the other hand,
are neutral in this respect; they may occur in cohesive chains but if so,
the chains are usually short. With regard to the feature 'occurrence as
theme', Hoey finds that in 75% of 294 instances 'sixty' occurs as part of
the theme in a Theme-Rheme relation; interestingly, orthography seems to
be relevant here, since '60' does not show this tendency. The preferences
of lexical items for particular textual features should not be analysed in
isolation from each other. The simultaneous occurrence of a lexical item
in two or more textual features will lead to highly interesting
generalizations: for instance, an item that ''has a positive preference for
both Theme and cohesive chains [...] will inevitably have a positive
preference for Thematic Progression'' (177). Moreover, textual-colligation
analysis need not stop at the word level. The lexical items
within a phrase may share certain preferences for textual features and
thus create a particular 'colligational prosody'.

Hilde Hasselgard explores ''adverbials in IT-cleft constructions'' on the
basis of data drawn from the British component of the International Corpus
of English (ICE-GB). In particular, Hasselgard focuses on two aspects: (1)
the information structural role of the adverbial, and (2) the discourse
function of the whole IT-cleft construction. As to the first point, the
author reports a marked difference in the information structure of clefts
with adverbials as opposed to the other kinds of cleft constructions: ''IT
clefts with adverbials occur by far most commonly with cleft clauses
conveying new information (86%), while the cleft clauses of IT-clefts in
general seem to be divided about equally between given and new
information'' (200). The author's discussion of the discourse functions of
adverbial-IT-clefts largely capitalizes on Johansson's (2002) fourfold
taxonomy, which distinguishes contrast, topic launching, topic linking and
summative functions, all of which Hasselgard finds attested in her data,
too. However, she adds a further function, namely thematization, which
serves ''to make extra clear what is to be understood as the theme and the
rheme of a sentence'' (204), as in the following example (204):
(9) It is with much regret that I find it necessary to send you a copy of
the enclosed letter which is self explanatory.

According to Hasselgard, the writer here ''wants to give thematic
prominence to the regret he/she feels'' (204). In addition, she suggests
that thematization might be regarded as superordinate to Johansson's four
discourse functions. For instance, if the focused constituent in a cleft
construction is especially marked off as the theme, this may also serve to
mark the theme as contrastive or it may be employed to introduce a new
topic into the discourse.

Section 3 concludes with Bernard De Clerck's article ''on the pragmatic
functions of 'let's' utterances'' in the spoken part of ICE-GB.
Prototypically, these utterances ''have the directive illocutionary force
of a proposal for joint action [... where] the speaker commits herself to
an action and seeks the addressee's agreement'' (217). However, 'let's'
utterances may also assume speaker or hearer orientation. In the first
case, the construction may be used to secure the addressee's agreement to
an action that the speaker is currently carrying out. In the case of
hearer-orientation, the utterance may ''camouflage an authoritative speech
act as a collaborative one'' (219). In both cases, the idea of joint action
recedes into the background. Most frequently, 'let's' is used in a
conversational function, namely to influence the flow of conversation. In
this case ''they are more like announcements of a topical shift that round
off the present topic and introduce the next step in the talk'' (225). This
function involves interesting sociolinguistic consequences: 'let's' as a
conversational imperative ''seem[s] to be part of the repertoire of [...]
interactionally more powerful speakers, who present the conversation as a
joint enterprise, but actually try to control it by restricting the
hearer's influence to a minimum'' (226). A minor function of the
construction is to present the speaker's evaluations or feelings at an
interpersonal level, as in example (10) below, where the speaker evaluates
the hearer's behaviour. Again, the prototypical aspect of 'proposal for
joint action' is no longer present in such cases (228):
(10) A: God you really know how to put someone down don't you
B: Oh let's not get touchy touchy.

The fourth section on ''Language change and language development'' starts
off with a paper by Thomas Kohnen, who provides a diachronic case study of
English directives, thereby addressing a number of ''methodological
problems in corpus-based historical pragmatics''. Such problems, for
instance, include what Kohnen calls 'pragmatic false friends',
i.e. ''constructions which, against a contemporary background, suggest a
wrong pragmatic interpretation'' (239). Example (11) (taken from
Shakespeare's 'The Merry Wives of Windsor') is a case in point (239):
(11) Ford: Blesse you sir.
Fal.: And you sir: would you speake with me?

In this case, the utterance 'would you speake with me?' should not be
understood as a request but as ''a real question which serves to identify
the man who wanted to talk to Falstaff'' (240). Modern English does not
allow this interpretation. Another methodological issue, not surprisingly,
is the lack of sufficient data. This may be balanced by concentrating on
individual text types or genres and their functional profiles. On the
whole, Kohnen argues for what he calls 'structured eclecticism':
diachronic pragmatic analysis should be based on ''a deliberate selection
of typical patterns which we trace by way of representative analysis
throughout the history of English'' (238). Furthermore, ''a diachronic
analysis of speech acts should be embedded in a reasonably stable
functional profile of text types'' (242). This method is put into practice
in a diachronic analysis of English directives. The author finds that, on
the whole, there seems to be a move away from the explicit and direct
forms of directives (e.g. imperatives) to more indirect alternatives, such
as interrogative realisations. As an underlying motivation for this
development Kohnen regards ''the growing importance of considerations of
politeness'' (246) which entails a reduction of possibly face-threatening
speech acts.

Liselotte Brems discusses ''degrees of delexicalization and
grammaticalization'' in measure nouns (MNs) such as 'bunch(es) of' or
'heap(s) of', and attempts to clarify ''the status of the MNs [...] within their
respective NPs'' (250). In particular, two analyses seem appropriate: the
MN may either function as the head of the bi-nominal NP of which it is a
part, as in (12) (250), or it may be regarded as a quantifier of the
second NP within the construction, as in (13) (251). Other instances, such
as (14) (250), are less easily decided.
(12) The fox, unable to reach a bunch of grapes that hangs too high,
decides that they were sour anyway.
(13) But then, when I needed one, there were a load of excuses as to why I
couldn't borrow one.
(14) We still have to move loads of furniture and other stuff.

The general structural status of MNs, therefore, is far from clear. As an
answer to this problem Brems suggests regarding ''the developments observed
in MN constructions [...] as a case of ongoing delexicalization and
grammaticalization in MNs'' (251). In particular, delexicalization is
understood as a precursor to grammaticalization, i.e. the ''gradual
broadening of collocational scatter [... and the] loosening of the
collocational requirements imposed by the MN'' (256) paves the way for ''the
re-interpretation of the MN as a quantifier'' (256). Her corpus study of
MNs reveals different degrees of synchronic grammaticalization. For
instance, 'heaps of' is used as a quantifier in 65.6% of all cases,
whereas only 4.7% of the tokens of the semantically related 'piles of'
occur in the same function. According to Brems, these findings can be
explained by the fact that 'pile' is associated with a ''feature of
verticality and constructional solidity'' (261) which blocks processes of
semantic generalization. On the other hand, 'heap' lends itself more
easily to delexicalization (and subsequent grammaticalization) since it
is ''in itself more vague and simply profiles an undifferentiated mass''.

Göran Kjellmer investigates the use of 'yourself' as ''a general-purpose
emphatic-reflexive''. The traditional grammar view of the personal
pronoun 'you' and its reflexive counterparts 'yourself' and 'yourselves'
is fixed and stable. However, Kjellmer finds a large number
of 'deviant' uses of 'yourself' in the CobuildDirect and the BNC corpora
which seem to imply ''an ongoing extension of its semantic range, and
consequently an increasing lack of precision'' (270). In (15) below, for
example, 'yourself' unambiguously refers to plurals only (272):
(15) Well can you sort that out amongst yourself [...]

Kjellmer reports on even more deviant (and also rarer) cases, where the
plural that the reflexive pronoun refers to is not limited to the second
person (273):
(16) [...] we were told to use physical resources like deep breathing and
actually making yourself sit down and making yourself go floppy.

Apparently, 'yourself' has ''become more general in its application'' (273).
Furthermore, similar to 'you' as a substitute for the missing generic
personal pronoun in English, 'yourself' also seems to be used generically.
A most illustrative example is given in (17) where 'yourself' refers back
to generic 'one' (274):
(17) [...] in an engineering course one concerns yourself only with how to
apply and harness phenomena

A possible final stage of the changing use of 'yourself', in Kjellmer's
view, might be witnessed in the following examples (275):
(18) I like boxing because it means I can defend yourself if you ever
needed to
(19) Pete's gone down to the shop and got yourself a bottle of whisky

Here, the reflexive pronoun is used specifically with reference to
non-second-person entities. On the whole, Kjellmer argues that 'yourself'
might be regarded as ''a general-purpose emphatic reflexive pronoun'' (275)
which ''has become a close reflexive pronoun copy of [... 'you'] by getting
rid of constraining features in its later stages of development'' (275).

Clive Souter explores ''aspects of spoken vocabulary development in the
Polytechnic of Wales Corpus of Children's English [POW]''. Although the
corpus is fairly small (roughly 61,000 words) and was originally
compiled to study syntactic and semantic development in children from 6 to
12, Souter argues that ''it does have great value for researchers into
child language development, TEFL [Teaching English as a Foreign Language]
syllabus designers and course-book authors'' (280) and sets out to show the
potential of POW for the study of children's vocabulary development.
However, as Souter points out, results have to be interpreted with great
care due to limitations of corpus size and corpus compilation. For
instance, the data show that the active vocabulary of children in the
corpus increases by only around 50 words per year, which, however, might be
an artifact due to ''the limited activities used to elicit speech from the
children'' (279), such as Lego building or conversation with adults about
games or TV. The author also reports on a difference in frequency of the
most common affirmative or negative expressions (e.g. 'yeah', 'yes', 'no'
or 'can't') among boys and girls: boys, in general, seem to prefer
positives while girls more frequently use negatives. Again, the
interpretation of the results is difficult. They might indicate a general
trend but the frequencies might also be explained as a consequence of
corpus compilation - the author concedes: ''[p]erhaps Lego building elicits
more positive responses from boys and more negative responses from girls''
(285). More interesting is the finding that the vocabulary of boys and
girls used in similar contexts only partly overlaps. No more than half of
the words boys and girls use are used by both sexes, whereas the other
half seems to be sex-specific. This feature, as Souter points out, is
worth more investigation and then might indeed turn out to be ''promising
and perhaps disturbing, from the point of view of syllabus and course
material designers'' (288).
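
How such an overlap might be measured can be sketched in Python with plain
set operations. This is a hedged illustration only: the review does not
state how Souter computed the figure, the measure chosen here (shared words
divided by all distinct words used by either group) is an assumption, and
the two word lists are invented.

```python
def shared_proportion(vocab_a, vocab_b):
    """Share of the combined vocabulary that is used by both groups."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Invented toy vocabularies, not data from the POW corpus.
boys = {"yeah", "lego", "car", "no"}
girls = {"yes", "lego", "no", "house"}

overlap = shared_proportion(boys, girls)
print(round(overlap, 2))
```

A value of 0.5 or less on this measure would correspond to the finding that
no more than half of the words are used by both sexes.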

In the last paper of section 4, Roumiana Blagoeva describes the use
of ''demonstrative reference as a cohesive device in advanced learner
writing''. In particular, she is interested in ''the under/overuse of the
demonstratives 'this', 'that' and their plural variants 'these', 'those'''
(298) by advanced Bulgarian learners of English. As a basis for comparison
she chooses the Bulgarian sub-corpus of the International Corpus of
Learner English, the British component of the Louvain Corpus of Native
English essays, a sub-corpus of the BNC from the domains 'Applied
Science', 'Social Science' and 'World Affairs', and a collection of
Bulgarian texts similar to the BNC sub-corpus. Her analysis shows, for
instance, that 'near'-demonstratives, i.e. 'this' and 'these', are
underused by Bulgarian learners when compared to British students while at
the same time the 'remote' types of demonstratives are overrepresented.
This cannot be accounted for by L1 interference, since the Bulgarian
equivalents of 'that' and 'those' show a very low frequency in the
Bulgarian corpus. Rather, the author suggests, a reason seems to lie in
the teaching material that is used in Bulgaria: although Bulgarian,
similarly to English, distinguishes near and remote demonstratives, the
distinction between the English counterparts seems to be overlooked in
teaching materials: ''learners are left with the impression [...] that
both 'this' and 'that' [...] could be used indiscriminately to point to
any word, phrase or longer stretch of text'' (304). Interestingly, both
Bulgarian and British students show a high proportion of 'this'
and 'these' in comparison to the BNC sub-corpus. Blagoeva suggests that
this might be due to ''an influence on learner production by the nature of
the text type'' (305). Furthermore, the author contends that learners of a
foreign language at some point stop learning and mainly seem to be focused
on remedying remaining mistakes in the field of lexis and syntax rather
than developing skills to arrive at ''a more target-like way of producing
coherent texts'' (306), which, of course, would include a native-like use
of demonstratives.

In ''Translation as semantic mirrors'', the first paper of section 5, Helge
Dyvik describes a method for identifying wordnet relations (e.g. synonymy
or hyponymy) on the basis of parallel corpora. The basic assumption
underlying Dyvik's approach is that ''semantically closely related words
ought to have strongly overlapping sets of translations, and words with
wide meanings ought to have a larger number of translations than words
with narrow meanings'' (311). The results he presents are extracted
manually from the 2.6 million word English-Norwegian Parallel Corpus
(ENPC). Searching for a particular Norwegian or English word form in the
corpus will yield all the original sentences that contain this word form
and its translations into English or Norwegian, respectively. From this
set of translations, a human analyser can then compile a list of possible
translations of the word form in question. These lists form the basis for
further analyses. The information they contain may, for instance, be used
to distinguish different senses of a particular word. The Norwegian
word 'tak', for example, is translated into 'roof', 'ceiling', 'cover', 'grip', 'hold'.
These five word forms are translated into various Norwegian words, which
form a number of sets which all contain 'tak' but also partially intersect. The
translations for English 'roof' and 'ceiling', for instance, in addition to 'tak' also
overlap in Norwegian 'hvelving'. Similarly, translations for 'grip' and 'hold' share
Norwegian 'tak' and 'grep'. Apart from 'tak' itself, however, the translation sets
of these two groups do not intersect. One can thus conclude that Norwegian 'tak'
has at least two distinct senses, namely 'roof/ceiling' and 'grip/hold'. After
different senses have been individuated, semantic fields can be established on the basis of overlaps of
translation sets. 'Beautiful', for instance, translates into 'vakker' and 'nydelig'.
These, in turn, in addition to 'beautiful' translate into 'cute' and 'cute'/'delicious',
respectively. It follows that 'beautiful', 'cute' and 'delicious' are part of the same
semantic field. Further procedures assign lexical features to individual entries and
eventually lead to lattices that reveal hyperonym and hyponym relations among
senses, and even identify subsenses and near-synonyms of each individual
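
The sense-individuation step described for 'tak' can be illustrated with a
small sketch. It is not Dyvik's implementation: the function name and the toy
data are assumptions made for illustration. Only the overlaps in 'hvelving' and
'grep' are taken from the review; the entry 'deksel' for 'cover' is an assumed
filler. English translations of a source word are grouped together whenever
their own translation sets share something other than the source word itself.

```python
from itertools import combinations

# Toy inverse translation sets (English word -> Norwegian translations).
# Only the overlaps 'hvelving' and 'grep' are reported in the review;
# 'deksel' for 'cover' is an assumed filler for illustration.
INVERSE = {
    "roof": {"tak", "hvelving"},
    "ceiling": {"tak", "hvelving"},
    "cover": {"tak", "deksel"},
    "grip": {"tak", "grep"},
    "hold": {"tak", "grep"},
}


def partition_senses(source, inverse):
    """Group translations of `source` whose inverse sets overlap beyond `source`."""
    groups = [{word} for word in inverse]
    for a, b in combinations(inverse, 2):
        # An overlap counts only if it involves a word other than the source.
        if (inverse[a] & inverse[b]) - {source}:
            ga = next(g for g in groups if a in g)
            gb = next(g for g in groups if b in g)
            if ga is not gb:
                ga |= gb  # merge the two groups in place
                groups = [g for g in groups if g is not gb]
    return sorted(sorted(g) for g in groups)


senses = partition_senses("tak", INVERSE)
print(senses)
```

With these toy sets, 'roof' and 'ceiling' end up in one group and 'grip' and
'hold' in another, which mirrors the two senses mentioned above; 'cover' stays
on its own only because of the assumed data.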

Åke Viberg analyses ''physical contact verbs in English and Swedish from
the perspective of crosslinguistic lexicology''. On the basis of data drawn
from the English Swedish Parallel Corpus (ESPC), the author presents an
extensive and highly detailed comparison of the English
verbs 'strike', 'hit' and 'beat' with their primary Swedish
translation 'slå'. The author finds several interesting differences
between the items at issue. 'Strike', 'hit' and 'beat' in their
prototypical usage as a ''bodily action verb'', for instance, most frequently
take human beings as objects. This, however, only seems to be a
tendency, ''whereas it is more or less a requirement of Swedish 'slå'''
(332). Furthermore, the Swedish verb occurs with a human subject in 70% of
all instances. The English counterparts show a mixed picture: while 'beat'
with 72% of human subjects is similar to 'slå', 'strike' and 'hit' are not
(41% and 48%, respectively). With these verbs ''natural disasters, economic
crises, wars and diseases'' (334) seem to be frequent subjects. The same
subjects in Swedish usually co-occur with a different verb,
namely 'drabba', which could roughly be translated as 'afflict'.
Similarly, if the subject is a projectile (e.g. a bullet), English 'hit'
is the most frequent verb, whereas Swedish again does not use 'slå'
but 'träffa' meaning 'hit a target'. It follows that generally, 'slå' ''is
grounded more firmly in sensorimotoric experience of limb movement'' (349)
which prototypically makes use of arm and hand. For the English
counterparts the sensorimotoric aspect does not seem to be as central.

Anna-Lena Fredriksson aims ''to discuss different approaches to the notion
of theme and to show how parallel corpora can successfully be used for
cross-linguistic analyses of theme'' (353). The author starts off with a
description of theme and rheme in Systemic Functional Grammar (SFG) as
laid out in Halliday (1994). However, SFG ''has a strong orientation
towards English which is a potential problem for using it in other
languages'' (354). One problem arises out of the V2 requirement in Swedish,
since this leads to a different distribution of clause elements with
initial non-subject, as example (20) illustrates (EO = English Original;
ST = Swedish Translation; LIT = Literal Translation) (361, adapted):
(20) (a) EO: Surely I'd been freed from those painful memories long ago.
(b) ST: Visst hade jag för länge sedan blivit befriad från de där
plågsamma minnena.
LIT: Surely had I for long ago become freed from those painful
memories.

In (20a) 'surely' and 'I' make up the theme. In the Swedish translation,
due to the V2 constraint, the two thematic components are separated by the
auxiliary verb. The question that arises is where to locate the
theme-rheme transition point. Fredriksson suggests a split theme, which ''(in a
declarative clause) can be defined as including all elements preceding the
finite verb plus the postverbal subject'' (365). Thus, the thematic
elements 'surely' and 'I' of the English original can also be treated as
thematic in the Swedish translation. Furthermore, the author questions
Halliday's notion of 'topical theme'. In his approach, the thematic part
of the clause contains one and only one experiential element, the topical
theme, so ''everything that follows the topical theme constitutes the
rheme'' (356). However, Fredriksson allows for several experiential
elements in the theme. Accordingly, ''[t]he concept 'topical theme' has no
function in [... her] approach'' (366). This modified understanding of the
concept 'theme', in her view, is equally applicable to English and to
Swedish data.

In their paper ''Welcoming children, pets and guests'' Elena Tognini Bonelli
and Elena Manca search for translationally equivalent units in two
comparable corpora, namely Italian texts that advertise 'Agriturismo' and
English material that promotes 'Farmhouse Holidays'. The English corpus
indicates that the notion of 'welcome' is central to the whole genre: a
total of 324 instances of this word are attested in the data.
Surprisingly, the 'prima facie' Italian equivalent 'benvenuto' and its
related forms occur only 4 times in the Italian corpus. Translation
equivalence, therefore, does not seem to be located at the word level.
Rather, translation should always consider the context in which a
particular word occurs. The authors therefore suggest a three-stage model
of successive contextualisation for identifying translationally equivalent
units. First, a collocational profile of the word to be translated should
be established. For the word 'welcome' the corpus yields as
collocates 'children', 'pets'/'dogs' and 'visitors'/'guests'. In a second
step, the translator should try to find 'prima facie' translational
equivalents for the respective collocates. In the current example these
would be 'bambini', 'animali' and 'ospiti'. The final step would then try
to identify collocates of these equivalents in L2. For instance, to find a
suitable translation for 'welcome' in the context of 'guests'
or 'visitors', the translator should compare the concordances of 'welcome'
+ 'guests'/'visitors' with the concordance of 'ospiti'. In the English
corpus, the nouns at issue are found to occur regularly in the
structure 'Vb BE + 'welcome' + 'to'-infinitive' ('guests are welcome to
relax'). The concordance of 'ospiti', on the other hand, shows that the
Italian equivalent to this structure is the Italian modal 'potere' and its
inflected forms, as in 'gli ospiti potranno fruire'. Obviously then,
translation equivalents are often not found at the word level. Rather,
translation should aim at ''identifying and comparing syntagmatic units
that share certain contextual features with the view of identifying a
similar function'' (383).
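
The first stage of this model, the collocational profile, can be sketched as
follows. This is only a minimal illustration and not the authors' procedure:
the function name, the window size, the stop-word list and the mini-corpus are
all assumptions. Of the sample sentences, only 'guests are welcome to relax'
is quoted in the review; the others are invented.

```python
import re
from collections import Counter

# Assumed stop-word list, kept tiny for the sake of the example.
STOP = {"are", "and", "to", "in", "the"}


def collocates(sentences, node, window=4):
    """Count content words occurring within `window` tokens of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo, hi = max(0, i - window), i + window + 1
            for j in range(lo, min(hi, len(tokens))):
                if j != i and tokens[j] not in STOP:
                    counts[tokens[j]] += 1
    return counts


# Invented mini-corpus; only the first clause is quoted in the review.
ENGLISH = [
    "Guests are welcome to relax in the garden",
    "Children and pets are welcome",
    "Well-behaved dogs are welcome",
]

profile = collocates(ENGLISH, "welcome")
print(profile.most_common())
```

The same function could then be run on the concordance of a 'prima facie'
equivalent such as 'ospiti' in the second corpus, so that the two profiles can
be compared side by side.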

In the last article of section 5, Natalie Kübler reports on her experience
with ''using WebCorp in the classroom for building specialized
dictionaries''. As the title already indicates, Kübler followed pedagogical
objectives that are different from language teaching, namely ''teaching
students how to extract lexical and syntactic information to build
customised dictionaries for machine translation (MT) in languages for
specific purposes'' (387). The particular register envisaged in this
experiment was computer science, more specifically, the most recent user
manuals of the operating system Linux (HOWTOs). In this particular field
of computer science, new terms are coined almost regularly. Therefore,
existing parallel corpora of HOWTOs, although providing useful information
for translation of the more recent HOWTOs, ''tend to become insufficient or
slightly obsolete, even though they can be regularly updated'' (395). The
web, on the other hand, will contain most of the neologisms in this field.
Accordingly, accessing the internet via WebCorp may be a useful way of
balancing the shortcomings of finite corpora. The term 'buffer', for
instance, occurs as part of five different compounds in the parallel
corpus of English and French HOWTOs. However, terms that were coined after
the translation of the HOWTOs will not be included. Here WebCorp can help
to supplement findings from finite corpora, since French computer
scientists often use English terms together with their French
translations: the search for 'buffer' in the French domain (.fr) yields
two more recent compounds together with the appropriate French
translations, namely 'buffer overflow' and 'heap buffer overflow'.
Accordingly, Kübler concludes that ''WebCorp [...] is ideal for
complementing and updating the information extracted from time-bound
specialised finite corpora'' (398).

The final section, 'Software development', consists of an article by
Antoinette Renouf, Andrew Kehoe and David Mezquiriz, who discuss ''some
issues in extracting linguistic information from the web''. The article
provides insights into the WebCorp project, which was launched at the
University of Liverpool at the end of 2000 in order to investigate ''the
usability of the Web as a linguistic resource, and [... to identify and
address] some of the problems of retrieval and analysis that it presents''
(404). In particular, the authors describe issues that are pertinent in
regard to the WebCorp tool, which allows the internet to be used as a corpus.
Issues discussed include the fact that search engines are constantly
changing, thereby reducing the comparability of results: ''corpus linguists
[...] each access different pages, and different pages at each time. Thus
the linguistic sample is not constant'' (409). Furthermore, Web text may
not easily be transformed into a format that meets linguistic data
requirements. In this context, the authors mention the problem of
providing sentence-length concordances: since Web text is untagged
only ''few clues exist at surface level as to sentence boundary'' (410). The
automatic retrieval of sentences therefore poses considerable problems.
Nevertheless, WebCorp provides a number of useful ways to exploit the web
linguistically. For instance, searches with wildcards serve to search the
web for phrases. More elaborate searches may be used to discover new or
unconventional forms: the string '[he|she|I] text* [him|her|me]', for
example, ''reveals that 'text' not only functions as a verb but as an
uninflected past tense verb'' (413), as in (21) below:
(21) The next time I text him, he didn't reply (413)
In addition, web information can be exploited by the WebCorp tool to refine
searches. This, for example,
includes the specification of text types or genre via the Open Directory
or Yahoo, or a limitation to certain domains, such as '.net' or ''.
Domains may also be combined by Boolean operators. The next steps that the
authors sketch out lead one to hope that eventually the WebCorp tool will
turn out to be a highly useful means of opening up the web for corpus linguistic
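
The kind of pattern search mentioned above can be approximated with an
ordinary regular expression. The sketch below uses Python's 're' module and is
not WebCorp's own query syntax or matching logic; reading the bracketed items
as alternatives and the asterisk as a word-final wildcard is an assumption.
The first sample is example (21) from the review, the other two are invented.

```python
import re

# Rough regex analogue of '[he|she|I] text* [him|her|me]' (an assumption,
# not WebCorp's actual query semantics).
PATTERN = re.compile(r"\b(he|she|I)\s+(text\w*)\s+(him|her|me)\b")

samples = [
    "The next time I text him, he didn't reply",  # example (21) in the review
    "She said he texted her twice",               # invented sample
    "I read the text aloud",                      # invented non-match
]

hits = [m.group(0) for s in samples for m in PATTERN.finditer(s)]
print(hits)
```

Whether a hit such as 'I text him' is a present or an uninflected past tense
form still has to be decided by reading the surrounding context, as in (21).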


Karin Aijmer and Bengt Altenberg have edited an excellent selection of
papers. The articles (apart from two or three exceptions maybe) are of a
very high quality and highly stimulating and show impressively the
relevance of corpus linguistic research to linguistics in general.
Furthermore, the diversity of topics covered will make this volume an
interesting read for linguists of almost any area: from functionalists to
cognitive linguists, from synchrony to diachrony, from syntacticians to
text linguists and even translators.

Also, the variety of corpora analysed by the contributors shows the wealth
of material which corpus linguistics nowadays has at its disposal: in
addition to the use of standard monolingual and parallel corpora, some
contributors quite convincingly show how smaller special purpose corpora
can be exploited: the HOWTOs corpus used by Kübler and the 'agriturismo'
and 'farmhouse holidays' corpora by Tognini Bonelli and Manca are just two
examples. In this context, mention must also be made of attempts to open
up the worldwide web as a possible source of data; its relevance for
future corpus linguistics, in my view, can hardly be overestimated. On the
whole, the large variety of data reported on in this volume leaves no
doubt as to the flexibility of corpus linguistics approaches in regard to

A further point concerns the relationship between data and theory and the
role of corpus linguistics, which ''have been debated ever since the rise
of corpus linguistics'' (2). This debate has also found its way into the
present volume. A number of extremely important issues are discussed by
renowned linguists such as Michael Halliday, John Sinclair, and Geoffrey
Leech. The mere fact that aspects like the role of intuition in corpus
linguistics or the relation of corpus-based and corpus-driven approaches
are still debated clearly shows the strong dedication of corpus linguists
to theoretical and fundamental aspects of their approach. This is also
mirrored in a number of papers that advance far beyond the word-crunching
and case-studying that corpus linguistics often (and not always
unfoundedly) has been accused of: Joybrato Mukherjee with his ''from-corpus-
to cognition-approach'' (85), for instance, impressively shows how corpus
data can refine cognitive models and thus lead to a more appropriate
description of the speaker's linguistic knowledge. Michael Hoey, through
his concept of 'textual colligation', establishes a ''theoretical
relationship between lexis and text-linguistics'' (171). Anna-Lena
Fredriksson uses contrastive corpus data to refine the theoretical notion
of 'theme'. Even if theoretical aspects are not an explicit focus, the
papers usually give convincing (theoretical) explanations for their
findings and, where appropriate, discuss implications for the model of the
speaker's competence or for the abstract language system.

Nonetheless, critical remarks should be made on two individual
contributions. The first concerns Tan, Ooi and Chiang's conclusion on the
use of augmenters in personal advertisements (PA) as opposed to spoken
(SP) or written (WR) texts. I find it difficult to agree with the authors
that ''PA tends towards SP norms -- but not quite reaching them, in most
cases'' (161). Even if the rare cases 'incredibly' and 'ever' are not taken
into consideration, we find that only two of the remaining five types,
namely 'really' and 'too', show similar normalised frequencies in PA and
SP. In contrast, the normalised frequency of 'very' in PA (29.7) just lies
between that of 'very' in WR (9.7) and SP (50.1). In addition, the
frequency of 'a lot' in PA (5.1) is more similar to that in WR (0.6) than
to that in SP (15.2), and 'lah' is highly frequent in SP (77.2) but
extremely rare in both PA (0.2) and WR (0.0). Admittedly, the authors
concede that ''the situation is not always that clear-cut'' (162). However,
on the basis of the data presented I would rather claim that the situation is
not at all clear-cut and that the use of augmenters in PA more strongly
resembles their use in WR than in SP. Another remark concerns the article
by Clive Souter: he wants to convince the reader that POW ''is worth
exploring, particularly if you are interested in learning and teaching
language'' (288). At the same time, however, he repeatedly stresses the
shortcomings of the corpus and the problems that may arise out of the
corpus's size and the compilation of the material. So I am not quite
convinced that ''interesting lexical information can be gleaned from this
corpus for EFL instructors and curriculum designers'' (279).
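
The comparison of normalised frequencies for 'very', 'a lot' and 'lah' can be
checked with a few lines of arithmetic. The figures are those quoted above;
the function name and the choice of absolute difference as the measure of
similarity are assumptions made for this sketch.

```python
# Normalised frequencies quoted in the review, in the order (PA, WR, SP).
FREQS = {
    "very": (29.7, 9.7, 50.1),
    "a lot": (5.1, 0.6, 15.2),
    "lah": (0.2, 0.0, 77.2),
}


def closer_norm(pa, wr, sp):
    """Return 'WR' or 'SP', whichever frequency lies nearer to the PA value."""
    return "WR" if abs(pa - wr) <= abs(pa - sp) else "SP"


for word, (pa, wr, sp) in FREQS.items():
    print(
        f"{word}: |PA-WR| = {abs(pa - wr):.1f}, "
        f"|PA-SP| = {abs(pa - sp):.1f}, closer to {closer_norm(pa, wr, sp)}"
    )
```

On absolute differences all three items come out nearer to WR, although for
'very' the two distances are almost equal. A relative measure, such as the
ratio of the two frequencies, could rank 'very' differently, so the result
depends on the measure chosen.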

The proofreading has been good; the number of typos and inconsistencies in
layout (I found around 15 cases) is within reasonable limits for a book of
over 400 pages.

On the whole, the volume makes for a highly stimulating and interesting
read and gives a good insight into current issues and aspects of corpus
linguistics showing the vitality and the diversity of the field. Linguists
from many different branches of linguistics will no doubt profit from the


Johansson, M. (2002): Clefts in English and Swedish: A Contrastive Study
of IT-clefts and WH-clefts in original texts and translations. PhD
dissertation, Lund University.

Chomsky, N. (1964): ''Current issues in linguistic theory'', The Structure
of Language, ed. by J. A. Fodor & J. J. Katz. Englewood Cliffs, New
Jersey, 50-118.

Langacker, R. W. (1999): Grammar and Conceptualization. Berlin: Mouton de
Gruyter.

Halliday, M. A. K. (1989): Spoken and Written Language. Oxford: Oxford
University Press.

Halliday, M. A. K. (1994): An Introduction to Functional Grammar, 2nd ed.
London: Edward Arnold.

Rolf Kreyer is an Assistant Professor of Modern English Linguistics at the
English Department of the University of Bonn/Germany. He holds a degree in
English and Mathematics and has recently finished his PhD thesis, a corpus-
based analysis of inverted constructions in modern written English. His
research interests include syntax, text linguistics, corpus linguistics
and theoretical linguistics.