Review of Corpus Linguistics: Investigating Language Structure and Use

Reviewer: Niladri Sekhar Dash
Book Title: Corpus Linguistics: Investigating Language Structure and Use
Book Author: Douglas Biber, Susan Conrad, Randi Reppen
Publisher: Cambridge University Press
Linguistic Field(s): Text/Corpus Linguistics
Book Announcement: 10.1923

Douglas Biber, Susan Conrad, Randi Reppen (1998) Corpus Linguistics:
Investigating Language Structure and Use (Cambridge Approaches to
Linguistics). Cambridge: Cambridge University Press. Binding: Hardback.
216 x 138 mm, 310 pp., 33 tables, 34 figures
ISBN: 0 521 49622 5

Reviewed by:
Niladri Sekhar Dash, Indian Statistical Institute, Calcutta, India.

The publication of this book signals the maturation of corpus linguistics
as a sub-discipline within linguistics and opens up multiple new avenues for
further research and investigation in the discipline. It is a timely
addition to the leading introductory volumes on corpus linguistics,
including Sinclair (1991), Svartvik (ed.) (1992), McEnery and Wilson (1996),
and Kennedy (1998).

Among these, Sinclair (1991) proposes a systematic development of corpora in
English: he considers the criteria of corpus development, discusses aspects
of text processing, shows various types of concordances and collocations,
examines native speakers' introspection, assesses sense and structure in
lexis, evaluates words and phrases, and provides a list of collocated words
compiled from the corpora. Svartvik (ed.) (1992), on the other hand,
assesses the directions of research in corpus linguistics. After an
introduction, the volume discusses the concept of corpus linguistics,
assesses the importance of corpus linguistics to understanding the nature of
language, sets out principles for the transcription of spoken discourse,
views the diachronic corpus as a window on the history of language, uses
computer-based text corpora to analyze the referential strategies of spoken
and written texts, considers automatic analysis of corpora, and looks into
the probabilistic parsing of corpus texts. McEnery and Wilson (1996) present
overviews of the theory and practice of corpus linguistics, emphasize the
key factors in the corpus-based (henceforth C-B) approach such as sampling,
representativeness, size, format etc., and report on subject-based studies
using the Brown and LOB (Lancaster-Oslo/Bergen) corpora. Moreover, they
weigh the relative merits of qualitative versus quantitative approaches to
language study. In the core of his book, Kennedy (1998) systematically
provides a historical overview and evaluates the significance of many
important studies that currently define the C-B approach. His main aim is to
introduce the reader to the linguistic rationale behind the processing, for
which he includes many useful figures and tables to elucidate various
linguistic investigations of the probabilistic nature of language.

This book (Biber et al. 1998) is about investigating the different ways
people use language in speech and writing. It introduces the C-B approach to
the study of language, based on analysis of large databases of real language
examples compiled from corpora (LOB and BNC), and illustrates exciting new
findings about English. It provides 'methodology boxes' on the computer
processing of texts and, with step-by-step descriptions of research methods,
gathers new findings about grammar and vocabulary, language use, language
learning, and differences in language use across registers. The book is
organized as follows:

The prologue starts with the following lines: "This book is about
investigating the way people use language in speech and writing. It
introduces the corpus-based approach to linguistics, based on analysis of
large databases of real language examples stored on computer. Each chapter
focuses on a different area of linguistics, including lexicography, grammar,
discourse, register variation, language acquisition, and historical
linguistics. Example analyses are presented in each chapter to provide
concrete descriptions of the research methods and advantages of corpus-based

Ten methodology boxes provide clear and concise explanations of the issues
in doing corpus-based research and reading corpus-based studies and there is
a useful appendix of resources for corpus-based investigation. This lucid
and comprehensive introduction to the subject will be welcomed by a broad
range of readers, from undergraduate students to professional researchers."

The book is divided into 4 parts besides the preface, introduction and
appendix. The introductory chapter (18 pages) sets out the goals and methods
of the C-B approach. Here the authors discuss the structure and use of
language; define the C-B approach and its characteristics; describe
association patterns in language use and the role of quantitative analyses
and functional interpretations in the description of language use; compare
the C-B approach with other approaches in linguistics; and identify the
areas of linguistics that can be addressed with the C-B approach. They also
acknowledge the sources of the corpora and corpus analysis tools used for
the various studies reported in the book. Finally, the chapter ends with an
overview of the book in which they take pains to distinguish "what this book
is and what this book is not".

Part I (Investigating the use of language features) consists of 4 chapters:
chapter 2 (34 pages), chapter 3 (29 pages), chapter 4 (22 pages), and
chapter 5 (27 pages).

Chapter 2 (Lexicography) is devoted entirely to lexical analysis. It
investigates different issues of lexicography, such as the meanings of
words, the frequency of words, the distribution of word classes, and word
senses and seemingly synonymous words across registers, with the help of
collocation. To make the observations and findings concrete, some words
('deal', 'big', 'large', 'great' etc.), taken from the BNC and LOB corpora,
are cited as examples with sample sentences.

Chapter 3 (Grammar) investigates four research questions: (1) the use and
function of morphological characteristics, where the distribution and
function of nominalizations and nominalization endings in English are
examined; (2) the use and function of grammatical classes, where the
grammatical categories are counted and the ratio of nouns to verbs is
compared across registers; (3) the function of syntactic constructions,
where that- and to-complement clauses are identified from the sample corpus
and codified with their specific connotations; and (4) the association of
linguistic and non-linguistic factors with the choice between seemingly
synonymous structural variants, where the factors associated with the choice
between subject and extraposed that-clauses are identified and compared with
four ESL textbooks. The analyses are conducted on three registers (academic
prose, fiction and spoken English) compiled from the Longman-Lancaster and
London-Lund corpora.

Chapter 4 (Lexico-grammar) illustrates two research questions: (1) how
nearly synonymous words (adjectives: 'little' vs. 'small' and verbs: 'begin'
vs. 'start'), with the same grammatical potential, can be distinguished in
terms of use patterns relating to their grammatical associations; and (2)
how the use of nearly synonymous grammatical constructions (that-clause vs.
to-clause) can be understood in terms of their different lexical
associations across registers. In answering the first question, the
differences between automatic and human analytical techniques are shown with
a KWIC (key word in context) file, which produces far more exhaustive and
robust results than any human being could produce manually. Following this,
an elaborate interpretation of the grammatical association patterns for
'little' versus 'small' and 'begin' versus 'start' is presented with tables,
examples and KWIC output. The second question is also addressed with data
compiled from the corpora and with interpretations and examples from the
sample texts. The analysis shows "that that-clauses and to-clauses differ in
their register associations, grammatical associations and lexical
associations" (PP 104).
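
As an illustration only (this is not the program the authors used), a KWIC
listing of the kind mentioned in this chapter can be sketched in a few lines
of Python. The function name kwic, the width parameter and the sample
sentence are all assumptions made up for this sketch.

```python
import re

def kwic(text, keyword, width=30):
    """Return one key-word-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration; `width` is the number of
    characters of context kept on each side of the match.
    """
    lines = []
    pattern = r"\b%s\b" % re.escape(keyword)
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from any corpus.
sample = ("A little boy lived in a small house. "
          "The small garden had little shade.")

for line in kwic(sample, "small", width=20):
    print(line)
```

Sorting such lines by the word to the right or left of the keyword is the
usual next step for spotting grammatical association patterns by eye.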

Chapter 5 (The Study of Discourse Characteristics) presents an interesting
aspect of C-B study: studying discourse using corpora. For investigating
discourse features, two important questions are raised: (1) the possibility
of developing and using interactive computer programs to analyze discourse
characteristics across registers; and (2) the possibility of using automatic
analysis to track the use of surface grammatical features over the course of
a text. For reference types in spoken and written registers, four parameters
are considered: (i) status of information, (ii) type of reference, (iii)
form of the expression, and (iv) the distance between the expression and its
antecedent. In the interactive analysis technique, six characteristics are
used for capturing noun phrases automatically in relatively small texts from
the London-Lund and LOB corpora. The program illustrates the usefulness of
interactive computer analysis, in combination with automatic techniques, for
analyzing a discourse system. Further, the application of C-B techniques for
tracking the progression of discourse features within a text is shown, where
the linguistic correlates of rhetorical structure (such as the mapping of
verbs, tense, voice etc.) are investigated. This study is excellently
elaborated and supported with tables, figures and text samples.

Part II (Investigating the characteristics of varieties) consists of 3
chapters: chapter 6 (37 pages), chapter 7 (31 pages) and chapter 8 (27
pages).

Chapter 6 (Register Variation and English for Special Purposes) shows that,
with C-B and computational techniques, multi-dimensional analysis is an
effective tool for studying register variation for specific purposes. First,
the following are defined as multi-dimensional factors: (a) Involved vs.
informational production; (b) Narrative vs. non-narrative discourse; (c)
Elaborated vs. situation-dependent reference; (d) Overt expression of
argumentation; and (e) Impersonal vs. non-impersonal style. Next, the
multi-dimensional analysis technique is used to investigate four research
questions: (1) how spoken registers differ from written registers in their
use of dependent clauses (relative clauses, adverbial clauses, complement
clauses etc.); (2) how speech differs from writing in English (the spoken
form mostly contains contractions, second person pronouns, discourse
inserts, semi-modals, wh- complement clauses etc., while the written form is
marked by frequent nouns and attributive adjectives, nominalizations,
specialized vocabulary items, passive constructions, extraposed
constructions etc.); (3) how texts from different academic disciplines vary
in their patterns of linguistic variation (Biology and History are chosen as
two disciplines to investigate how they differ in subject matter, evidence
and methodologies, along with selection of words, formation of sentences
etc.); and (4) how the internal sections (introduction, methods, results and
discussion sections of biology research articles) within a single academic
register vary linguistically. With sufficient examples, tables, samples and
graphs, it is properly established that this technique gives a better
understanding of the variation of language used in different situations for
different purposes across registers.

Chapter 7 (Language Acquisition and Development) focuses on three major
areas: (1) first language acquisition by very young children; (2) later
language development (acquisition of literacy skills by students at various
stages); and (3) second language acquisition (by children and adults). To
illustrate the application of C-B techniques in the analysis of elementary
student speech and writing, five research questions are addressed: the first
four focus on first-language issues, while the last investigates the errors
of non-native English-speaking students, focusing on second-language
acquisition issues. For the first four questions, the corpora (of the
student and by the student) are used; writing development with respect to
individual linguistic features is considered along two popular measures (the
number of words per text and the average length of T-units in a text); the
multi-dimensional investigation technique is applied; and a comparison
between elementary student and adult dimensions of variation is provided
with examples and tables. The last question addresses the error patterns in
the written English of both native (English L1 students) and non-native
(Navajo L1 students) speakers. Only "three types
of errors, which teachers identified as problematic for Navajo students
during the compiling of the Arizona Corpus of Elementary Student Writing"
(PP 198) are considered: subject-verb agreement errors, noun morphology
errors (e.g., unmarked plurals) and verb-morphology errors (e.g., unmarked
past tense). The analyses, results and observations show that the C-B
technique not only introduces a broader and more reliable approach to the
issues of child language acquisition, but also poses a challenge to the
observations of the traditional (experimental and observational) approaches.

Chapter 8 (Historical and Stylistic Investigations) continues the process
discussed in chapters 6 and 7, but from a different angle. Here the C-B
technique is used for studying different stylistic and historical aspects of
language. For historical analyses, a diachronic sample corpus is used for
studying shifts in grammatical and lexical features; for tracking the
evolution of written and speech-based registers (medical prose vs. drama);
and for tracing changes in the grammatical characteristics of dialects along
demographic lines (social class, gender etc.). For stylistic investigations "33
fiction texts, from a pilot version of the ARCHER Corpus, are used" (PP
223). The texts are distributed over 4 centuries and are written by
well-known authors (e.g., S. Johnson, D. Defoe, H. Fielding, J. Swift etc.).
The overall sample analyses include "investigation of grammatical and
lexicographic features, examining the use of modals and semi-modals across
registers over 4 centuries; investigation of register development,
considering changes in the language use patterns of spoken and written
registers over three centuries; examination of dialect differences, focusing
on the language used by women and men in personal letters over four
centuries; and investigation of individual author (Johnson) style, relative
to other fiction texts across historical periods" (PP 227). Many tables,
figures and examples are aptly used to "show how electronic corpora and
different computational techniques can facilitate our understanding of
language use across time and style" (PP 228).

Part III (Summing up and looking ahead) contains only chapter 9 (9 pages).

Chapter 9 (Conclusion) evaluates the contributions of the C-B approach;
identifies additional research areas (computational linguistics, natural
language processing, speech processing and recognition, development of word
taggers and sentence parsers, information retrieval, text processing and
production, machine (aided) translation etc.); outlines the usefulness of
the C-B approach in language education (developing educational materials,
dictionaries and thesauruses, designing classroom activities etc.); and
finally encourages language researchers and students to explore this new
sub-discipline by carrying out their own projects along their own parameters
or along the path(s) defined in this book.

Part IV contains 10 methodology boxes giving concise information and
instructions regarding issues in corpus design (approaches to sampling,
diversity, size, copyright etc.); issues in diachronic corpus design
(examples of the Helsinki Corpus, ARCHER corpus etc. are cited);
concordancing packages versus programming for corpus analysis (what a
computer program is, advantages of writing one's own program, requirements
for writing one's own program etc.); characteristics of tagged corpora
(lexical, grammatical, semantic and syntactic information in a tagged corpus
etc.); the process of tagging (word-level tagging with probabilistic
information for resolving ambiguity etc.); norming frequency counts (the
utility of normalization in frequency counts); statistical measures of
lexical associations (using mutual information scores, t-scores etc.); the
unit of analysis in C-B studies (using chi-squared test, VARBRUL etc.);
significance tests and the reporting of statistics (using inferential
statistics, t-test, ANOVA, Pearson correlation etc.); and factor loadings
and dimension scores (following the multi-dimensional analysis, dimension
scores, standardization etc.).
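
To give a feel for two of the topics listed above, norming frequency counts
and measuring lexical association, here is a minimal Python sketch. It uses
the common textbook formulations of normed frequency, pointwise mutual
information and the collocation t-score; it is not claimed to reproduce the
exact formulas printed in the book's methodology boxes. All function names,
the per-million basis and the counts in the usage lines are assumptions
chosen for illustration.

```python
import math

def normed_frequency(raw_count, text_length, basis=1_000_000):
    """Occurrences per `basis` words, so that texts or corpora of
    unequal length can be compared (basis is an assumed default)."""
    return raw_count / text_length * basis

def mutual_information(pair_count, count_x, count_y, corpus_size):
    """Pointwise mutual information (log base 2) for a word pair:
    observed co-occurrences against those expected by chance."""
    expected = count_x * count_y / corpus_size
    return math.log2(pair_count / expected)

def t_score(pair_count, count_x, count_y, corpus_size):
    """Common approximation of the collocation t-score:
    (observed - expected) / sqrt(observed)."""
    expected = count_x * count_y / corpus_size
    return (pair_count - expected) / math.sqrt(pair_count)

# Invented counts, not taken from LOB, BNC or any other corpus.
print(normed_frequency(50, 200_000))
print(mutual_information(8, 100, 80, 1_024_000))
print(t_score(8, 100, 80, 1_024_000))
```

Mutual information tends to reward rare but exclusive pairings, while the
t-score favors frequent ones, which is why the two are usually reported
side by side.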

Each chapter contains many tables, figures, maps and text samples, along
with notes and lists of books and journals for further reading, which help
to convey the wider scope of the topic under discussion. Moreover, the
resources useful for C-B investigations (corpora, analytical tools and
on-line sources) are listed in the appendix, each in alphabetical order with
all necessary and relevant information. A subject index and an exhaustive
reference list at the end of the book are an asset to readers.

A critical evaluation:
The book rests on two important pillars: corpus and register. It shows the
ways in which the computer can be used for studying a corpus of a language,
and how the results of such study generate new descriptions and
understandings of language use across registers that differ considerably
from those of traditional approaches to language study. However, a few
observations, which I cite here, may be considered for the next edition of
the book.

It is agreed that register is an indispensable aspect of C-B language study.
Therefore, the extra-linguistic variables (attributes) proposed by Atkins et
al. (1992), which are of immense importance in the context of designing
corpora, should have been fully addressed in the book. It would have been
better if the speech corpus had been dealt with separately. This is not to
imply that research into spoken language is any less important, simply that
the needs of the two types are different, making it possible to consider
these two aspects of NLP separately. Moreover, the design criteria for a
spoken corpus, as shown by Svartvik (1990) and Atkins et al. (1992), are
different from those for a written corpus.

Figure 2.5 (PP 31) displays a frequency list generated by TACT showing the
distribution of grammatical forms of the lemma DEAL in the complete, tagged
LOB corpus. It shows the use of the lemma as a singular noun ("nn"), as a
proper name ("np"), as a verb ("vb"), as a plural noun ("nns"), as a third
person singular verb ("vbz"), as a present participle ("vbg"), as a simple
past tense verb ("vbd"), and as a past participle ("vbn"). Surprisingly, the
lemma is never registered as an adverb in the LOB corpus, though in the
dictionary the lexeme 'deal' has an entry as an adverb. This study shows how
the intuitive assumptions of lexicographers can be at odds with actual
usage.

Some minor mistakes: on page 39 (last line) it is said "... two do not
mention the "act of distributing,"". However, in table 2.4 (PP 40) we find
that not 'two' but 'three' dictionaries (Webster's Third 1981, Chambers 1993
and Longman Lang. & Culture 1992) do not mention it. On page 43, in section
2.6 (line 13), the phrase 'such systematic differences' should probably be
'such semantic differences', because the term 'systematic' seems irrelevant
to the discussion in that section. Again, page 44 (last line but one) has
'over ten times', which should probably be 'over thirteen times', because
the table on the same page says so (84:1235 in raw counts or 31:408 in
normed counts). Similarly, page 45 (first line) has 'three times more',
which should probably be 'two times more'. I insist on accuracy here because
the authors have tried to be accurate in their presentation of data
throughout the book, as they have very meticulously noted a
'one-and-a-half times' variation on page 44 (last line).
The authors have drawn interesting observations from table 3.2 (PP 63),
which shows the proportion of nominalizations formed with four English
suffixes: '-tion'/'-sion', '-ment', '-ness' and '-ity'. But what strikes me
most is that the difference between fiction and speech is always less than
the difference between academic prose and fiction for each nominalization
suffix. This is probably because fiction, though written in form, contains
many properties (dialogues, direct speech etc.) which are closer to the
spoken form than the written one.

In chapters 6, 7 and 8, the term 'dimension' occurs quite frequently along
with the phrase 'multi-dimensional analysis'. Unfortunately, the former term
has no entry in the index, though the latter has. It would have been helpful
for readers if the former term had also been entered in the index.

In the appendix (PP 286), an exhaustive list of available corpora,
analytical tools and on-line sources is given. However, these are all about
English corpora, with some stray information on other European languages. In
this context it may be noted that Barlow's website contains almost
exhaustive information regarding corpus development in different languages
of the world besides English. Moreover, the International Journal of Corpus
Linguistics also deserves special mention, as it is entirely devoted to
different issues related to corpus linguistics. Finally, in the reference
section I would like to see the book of Sinclair (1991), which provides a
good guide for corpus design, development and processing.

This excellent book, with its lucid and comprehensive introduction to the
subject, should become an essential reference work for any researcher
interested in C-B language study. Moreover, it deserves the status of a
textbook of corpus linguistics at both undergraduate and post-graduate
levels, because there are ample examples for the student of English who is
primarily interested in the analysis and description of data, for whom the
chapters and methodology boxes are certainly within reach without further
specialized training. Even general readers with a liking for language,
seeking a more comprehensive understanding of the resurgence of empirical
linguistics, can find this book interesting for its various new
observations, which directly contradict our common hypotheses regarding the
uses of words or lexical associations in the language. The quality of paper,
printing and binding is of international standard; the overall get-up of the
book is perfect for easy handling. An eye-catching, bright-colored laminated
cover would be a better bait for drawing the attention of casual visitors to
book stores (I presume that most people who occasionally visit book stores
are not well aware of such an interesting subject, one that scientifically
deals with their most powerful tool).

A short biography of the reviewer:
Niladri Sekhar Dash received his MA in Linguistics from the University of
Calcutta in 1991 with 1st class. In 1994 he completed a course on ANLP at
the Indian Institute of Technology, Kanpur. His Ph.D. thesis, on corpus
design and development for NLP, is scheduled to be submitted to the
University of Calcutta by May 2000. From 1992 to 1995 he worked as a
Language Analyst in the TDIL (Text Development in Indian Languages) project
of the Dept. of Electronics, Govt. of India, where his main work was to
design and develop corpora of Indian languages. From 1995 to 1997 he worked
as a Technical Assistant in the project entitled Computational Linguistics
and NLP at the Computer Vision and Pattern Recognition Unit of the Indian
Statistical Institute, Calcutta. Since 1997 he has been working as a
Scientific Assistant in the same institute. His present areas of research
are: corpus design and development, lexicography, wordform processing,
parts-of-speech tagging, morphological processing etc. in Indian languages.
His contact address is: Computer Vision and Pattern Recognition Unit, Indian
Statistical Institute, 203 B. T. Road, Calcutta - 700 035, India.

