Review of Corpus Analysis.

Reviewer: Christophe Parisse
Book Title: Corpus Analysis.
Book Author: Pepi Leistyna, Charles F. Meyer
Publisher: Rodopi
Linguistic Field(s): Applied Linguistics
Computational Linguistics
Discourse Analysis
Text/Corpus Linguistics
Subject Language(s): English
Language Family(ies): New English
Issue Number: 14.3455

Date: Thu, 11 Dec 2003 15:21:44 +0100
From: Christophe Parisse
Subject: Corpus Analysis: Language Structure and Language Use

Leistyna, Pepi and Charles F. Meyer, eds. (2003) Corpus Analysis:
Language Structure and Language Use, Rodopi, Language and computers:
Studies in practical linguistics 46.

Christophe Parisse, LEAPLE-INSERM, Villejuif, France


This book contains no fewer than fifteen articles about corpus
analysis. The papers were originally presented at the 3rd North
American Symposium on Corpus Linguistics and Language Teaching
(Boston, 2001). The book is introduced by a very short presentation
(three pages) that does nothing more than explain the rationale behind
the order of the articles in the book. I have to admit that this order
is quite good, as it is not easy to classify the articles into
clear-cut categories. Indeed, works about corpus analysis can be classified
using at least three variables: the method (fully automatic, semi-
automatic, manual, multi-dimensional analysis, lexical analysis,
grammatical analysis), the theme (written language, oral language,
academic language, historical texts, letters, conversations), the
domain (general linguistics, historical linguistics, sociolinguistics,
pragmatic analysis, conversation turn analysis, second language
acquisition, language teaching). There is a nice progression
throughout the book, as every article has something in common
with the previous one, whether in method, theme, or
domain. I will comment on each chapter in turn and finish with general
comments about the book as a whole.

1. "It's really fascinating work": Differences in evaluative
adjectives across academic registers - Swales, J. M. & Burke, A. The
goal of this article is to document the differences between oral and
written language. The authors base their results on two corpora (one
oral, one written) that cover exclusively academic presentations. As
is stated in the title, the authors focus their work on adjectives
that are used for evaluation purposes, such as 'important',
'serious', and 'trivial'. Adjectives are automatically
classified into seven categories: acuity, aesthetic appeal, assessment,
deviance, relevance, size, and strength. Adjectives are also
categorized as polarized (strong evaluation) or centralized (normal
evaluation). There tend to be more adjectives in oral texts than in
written texts, but this is not true for relevance and strength
categories where this trend is reversed. Oral language tends to use
more polarized adjectives. Results for all classes of adjectives are
detailed and analysed.

2. "But here's a flawed argument": Socialisation into and through
metadiscourse - Mauranen, A. This article is based on oral language
only. It studies vocabulary used for argumentation ('argue', 'claim',
'observe' ...). The corpus comes from academic situations and contains
indications about the academic status of the speaker. An automatic
search for all relevant lexical elements shows that different verbs
are employed differently: for example, 'argue' is used more by senior
faculty members and is also more frequent in monologues. In general, there
are more verbs of argumentation in dialogic discourse, and more nouns
in monologues. At first sight, there seem to be fewer negative
evaluations than positive ones. However, a manual search revealed that
this is not true. There are as many negative as positive evaluations, but
negative constructions often use apparently positive forms ('argue',
'say', 'point', 'puzzle', 'seem', ...), which are somehow hidden in
the text, and a more varied range of features.

3. Register-specificity of signalling nouns in discourse - Flowerdew,
Signalling nouns are a kind of anaphoric noun used to link one
clause to another or one part of a clause to another (for example 'the
process', 'the way', 'the issue', 'the question' ...). This work
involves a comparison between oral language (lectures) and written
language (textbooks). The process is semi-automatic: first all
relevant words are extracted from the corpora and concordances are
computed. Then, these concordances are processed manually to keep the
relevant ones only (this is necessary as many signalling nouns are
ambiguous). A distinction is made according to whether signalling occurs
within a clause or between two clauses. An analysis of the results is
carried out using Halliday's (1978) framework (field, tenor, mode). Results
show that signalling is more frequent in written language than in oral
language, and that collocations are less varied in oral language than
in written language. The existence of register specificities can also
be demonstrated.
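
The semi-automatic procedure described above (extract candidate words, compute concordances, then filter by hand) can be illustrated with a minimal sketch. This is not the tool used in the chapter: the function name `kwic`, the context width, and the sample sentence are assumptions made purely for illustration, and the manual filtering step is left to the analyst.

```python
import re

def kwic(text, targets, width=4):
    """Return (left context, keyword, right context) for each target hit.

    Hypothetical helper: a plain keyword-in-context listing, with
    `width` words of context on each side of the keyword.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    wanted = {t.lower() for t in targets}
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text; candidate nouns taken from the examples above.
sample = ("The process is slow. We then asked the question again, "
          "because the way it was framed had confused everyone.")

for left, key, right in kwic(sample, ["process", "way", "issue", "question"]):
    # Each line would then be inspected manually to discard
    # non-signalling uses of an ambiguous noun.
    print(f"{left:>30} [{key}] {right}")
```

The output is only a candidate list; deciding which hits are genuine signalling nouns remains a manual judgement, as in the study.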

4. Variation among university spoken and written registers: A new
multi-dimensional analysis - Biber, D.
This work is a new multi-dimensional analysis similar to that of Biber
(1988), but with a different corpus. The main interest of this
analysis is that it makes it possible to classify text genres
automatically using only syntactic features. After computing a large
number of linguistic features (such as 'number of contractions',
'number of past tense verbs', 'number of nominalizations', ...), a
statistical factor analysis is performed and four 'dimensions' (systematic
co-occurrence patterns among linguistic features) are retained. The new
corpus is made of university language, in oral and written form,
produced in academic settings as well as in non-academic settings. It
comprises nearly three million words. A list of 129 linguistic
features was computed, but only 90 were used in the final analysis.
The number of features is greater than for the previous study (Biber,
1988). The four dimensions that were found are: (1) oral vs. literate
discourse; (2) procedural vs. content-focused discourse; (3) narrative
orientation; (4) academic stance. The differences between these results
and those of Biber (1988) are then discussed.
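
As a rough illustration of the first step of such an analysis (counting linguistic features per text before any factor analysis), here is a minimal sketch. It rests on several assumptions that are not taken from the chapter: features are normalized per 1,000 words, contractions are detected by an apostrophe inside a token, and nominalizations are approximated by a short list of suffixes. The name `feature_rates` and the two sample sentences are hypothetical, and the factor analysis itself is not shown.

```python
import re

# Assumed suffix list; a crude stand-in for a real nominalization tagger.
NOMINAL_SUFFIXES = ("tion", "tions", "ment", "ments", "ness", "ity", "ities")

def feature_rates(text, per=1000):
    """Return rates of two surface features per `per` words of text."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        return {"contractions": 0.0, "nominalizations": 0.0}
    # Any token with an internal apostrophe counts here, so possessives
    # such as "speaker's" are (wrongly) included in this rough count.
    contractions = sum(1 for t in tokens if "'" in t)
    nominalizations = sum(
        1 for t in tokens if len(t) > 5 and t.endswith(NOMINAL_SUFFIXES)
    )
    return {
        "contractions": contractions * per / n,
        "nominalizations": nominalizations * per / n,
    }

# Invented examples of a spoken-like and a written-like sentence.
spoken = "I don't know, it's just that we can't say"
written = "The evaluation of the argument requires careful consideration"

print(feature_rates(spoken))
print(feature_rates(written))
```

In a real study, one such vector of rates per text, over many more features, would form the input matrix for the factor analysis.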

5. Linguistic dimensions of direct mail letters - Connor, U. & Upton,
This study uses the framework of Biber (1988); its dimensions are
slightly different from those presented in the previous
article. The corpus consists of direct mail letters in non-profit
fundraising (191,540 words). These letters are very interesting
because, although they appear at first glance to be highly personal
(because of the use of personal pronouns), they are in fact very
informational (they are at the bottom of dimension 1, nearly as much
as academic texts). Also, they appear to relate some important tale as
they tend to include narrative elements, but this is contradicted by
the dimensional analysis that reveals that they are in fact highly
non-narrative, even more than academic texts, which were on the
extreme of dimension 2. In the other dimensions of analysis, they are
less strongly characterized, close to professional letters and
academic prose.

6. Gender-based variation in nineteenth-century English letter-writing -
Geisler, C.
This article also uses the framework of Biber (1988), but with a
completely different type of text and with five dimensions of
analysis. The corpus is made of nineteenth-century letters written by an
equal number of men and women. Differences can be found between men's
and women's discourse. Men's writing tends to be more informational, to
use more abstract features such as passives or longer words, and to use
more noun phrase elaboration. Women's writing contains higher
frequencies of features marking involved, situated, and non-abstract
style, such as private verbs, emphatics, and that-deletion. An interesting
fact is that the differences between men's and women's language change
through the century, and some trends from the beginning of the century
are totally reversed at the end of the century.

7. The grammar of stance in early eighteenth-century English epistolary
language - Fitzmaurice, S.
This article is difficult to summarize, as it contains a great deal of
varied and detailed information. It is an in-depth study of the
grammar of stance constructions in early modern English. The corpus
consists of a set of letters from a group of eight men and six women
associated with the essayist and diplomat Joseph Addison. This makes it
possible to study the evolution of the grammar of stance during the
beginning of the eighteenth century. A detailed linguistic analysis is presented,
with many examples and figures about the number of lexical or modal
verbs used in stance constructions.

8. Great vs. lovely: Stance differences in American and British
English - Precht, K.
This study compares the frequencies of stance markers in British and
American conversations. A specialized program, StanceSearch, is used
to find stance markers and generate frequency figures for them.
Stance markers are found in various parts of speech
(lexical verb, adverbial, adjectival, noun, modal verb), and
correspond to four main semantic categories: affect, evidentiality,
amount, modality. There is a strong relationship between part of
speech and semantic category, which is found in both British and
American English. The difference between the two dialects is more a
question of subtle lexical differences than of grammatical format.

9. "What's in a name?" Vocatives in casual conversations and radio
phone-in-calls - McCarthy, M. J. & O'Keeffe, A.
This article is concerned with vocatives and compares a corpus of
conversation with family and friends and a corpus of radio phone-in
programmes. For the conversation corpus, vocative terms were found
automatically, whereas the radio corpus was processed manually. All
vocatives are classified into several categories: relational, topic
manage, badinage, mitigator, turn manage, summons, and call manage.
Each of the categories is analysed in turn for each corpus and a
comparison between the corpora is made. The analysis is closely
detailed with many examples and shows quantitative differences between
the two situations.

10. Turn initiators in spoken English: A corpus-based approach to
interaction and grammar - Tao, H.
This analysis bears strictly on the first word of a turn in a dialog,
as this word is necessarily a good candidate for turn management. This
is assuredly a very restrictive definition of turn initiators (as
the author himself points out). However, it makes possible
a fully automatic analysis of any type of corpus. It appears that turn
initiators are overwhelmingly lexical and that just the 20 most
frequent forms make up 60% of the turn initiators. Some terms are
highly specific to turn initiation (as they are encountered nearly
only in this context) whereas other terms are used more often in other
sentence positions. A detailed functional analysis of the different
types of turn initiators is presented.
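
Because the definition is restricted to the first word of a turn, the counting step is easy to automate. The following is a minimal sketch of that idea, not the author's procedure: the function names `turn_initiators` and `coverage`, the punctuation stripping, and the toy turns are all assumptions made for illustration.

```python
from collections import Counter

def turn_initiators(turns):
    """Count the first word of each non-empty turn (lower-cased)."""
    firsts = []
    for turn in turns:
        words = turn.split()
        if words:
            # Strip trailing punctuation so "Well," and "Well" are merged.
            firsts.append(words[0].strip(".,!?;:").lower())
    return Counter(firsts)

def coverage(counts, n):
    """Share of all turns that begin with one of the n most frequent forms."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total if total else 0.0

# Invented toy dialogue, one string per turn.
toy_turns = [
    "Well, I think so.",
    "Yeah.",
    "Well maybe",
    "So what happened?",
    "Yeah, right.",
]

counts = turn_initiators(toy_turns)
print(counts.most_common(2))
print(coverage(counts, 2))
```

On a real corpus, `coverage(counts, 20)` would give the kind of figure reported above for the 20 most frequent forms; the toy data here says nothing about that result.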

11. Situational variation in intonational strategies - Yaeger-Dror,
M., Hall-Lew, L. & Deckert, S.
This article focuses once again on oral language only. It is based on
three different corpora. It focuses on the intonational
characteristics of negatives. The pitch of each negation that figures
in a declarative sentence was classified, as well as relevant
environmental prosodic information, turn stance and footing
(informative, supportive, remedial, self-correcting, self-protecting,
and hedge). As this manual procedure is very time consuming, the
number of elements tagged is limited. The authors find that pitch
varies with register. They also find that not all
negations have a prominent pitch. This is true for purely informative
negations, but more than 80% of all negations in interactive
situations carry no pitch accent at all. The authors suggest that the
principles at work in choosing the pitch are social more than

12. On the radical difference between the subject personal pronouns in
written and spoken European French - Fonseca-Greber, B. & Waugh, L. R.
This is unfortunately the only article that does not use English
language corpora. French is a language well suited to comparing
written and oral language, as there are many differences between the
two. Many grammatical features are marked in written format but are
not pronounced aloud, with the exception of irregular forms. The use
of verb tense is also different in written and oral French language.
However, the authors focus on a less-known difference which is the use
of subject personal pronouns. They show that the first person plural
pronoun has changed with regard to the traditional written format, and
that the second person singular pronoun has acquired a new use
(impersonal register). The sole problem with this study is that no
corpus is used for statistics on written language, and it remains to
be demonstrated that the evolution that happened in oral language has not
also happened in written language, as norms computed using real (and
recent) corpora can be quite different from the traditional
grammatical norms.

13. The world wide web as linguistic corpus - Meyer, C., Grabowski, R.,
Han, H.-Y., Mantzouranis, K. & Moses, S.
This article makes a thorough evaluation of how it is possible to use
the world wide web as a linguistic corpus. First, the authors stress
that it is not possible to know its exact size, so that exact
frequencies are difficult to compute. It appears that half of the web
is in English and that most of the material has commercial content.
They describe how it is possible to use search engines to make lexical
analyses of the corpus, as well as to find polysemous lexical items or
syntactic constructions. They conclude that, although it is highly
difficult to control the nature of the information, valuable
linguistic data can be extracted from the world wide web.

14. Corpus linguistics and second language acquisition: Rules and
frequency in the acquisition of English multiple wh-questions - Bley-
Vroman, R.
This work uses a mixture of corpus analysis procedure and
psycholinguistic experimentation to demonstrate that the knowledge of
native speakers of English is different in nature from the knowledge
of second language learners. The principle of the demonstration is
that the performance of the second language learners parallels
the frequency of occurrence of syntactic forms, whereas this is not the
case for the first language learners. The demonstration appears
correct at first sight, but the authors themselves point out that some
of the results are contradictory. Also, they do not try to evaluate
whether the frequencies encountered by native speakers match those of
non-native speakers, which, it seems to me, is necessary to confirm
before drawing any definite conclusion from the study.

15. Comparing alternate complements of object control verbs: Evidence
from the Bank of English Corpus - Rudanko, J.
The final study demonstrates that two verbs that appear quite similar
('pressure' and 'prevent') can be differentiated after corpus
analysis. It appears that the constructions used by English speakers
tend to be lexically determined, despite the fact that there is no
grammatical reason for this tendency. Also, a comparison between
British English and American English shows that constructions differ
in the two dialects. These results would be very difficult to obtain
without corpus data.


The average quality of the articles is very good. Each reader of the
book will have different favourite papers, depending on her own
interests. As someone who uses corpus analysis in his own research, I
have found a lot of interesting elements that I may use in my future

This book is not for beginners at all. It is very dense and there is
no introduction that explains the methods and goals of corpus
linguistics. This limits the readership of the book to specialists in
the field or advanced students. However, such readers will probably
find the book very good. It covers a lot of approaches and themes, and
it is of high scientific value. The general presentation of the book
is good and the editors did a very good job in harmonizing the
presentation of the different articles. I only wish that they had
provided the reader with a real introduction to corpus linguistics. I
think this would have made the book more interesting to a general
linguistic readership. My other regret is that, of the fifteen
articles, only one is not about the English language. It would have
been nice to see corpus linguistics applied to inflectional languages,
or attempts to quantify differences between languages. I guess that this
was not the editors' choice but a consequence of the material they had to
work with. I can only hope that such a book will promote the use of
corpus linguistics in all fields of language research and in all


My main research interests are in language development. My main work
is on the initial development of syntax (children aged one to four).
The tools I use include computer simulation as well as
psycholinguistic experiment. I work with children with language
disorders as well as normally-developing children.

