LINGUIST List 10.1923

Sun Dec 12 1999

Review: Biber et al: Corpus Linguistics

Editor for this issue: Andrew Carnie <carnielinguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Andrew Carnie at carnielinguistlist.org

Directory

  1. Niladri Sekhar Dash, Review: Biber et al: Corpus Linguistics

Message 1: Review: Biber et al: Corpus Linguistics

Date: Fri, 26 Nov 1999 01:24:33 PST
From: Niladri Sekhar Dash <niladrisekharhotmail.com>
Subject: Review: Biber et al: Corpus Linguistics


Douglas Biber, Susan Conrad, Randi Reppen (1998) Corpus Linguistics: 
Investigating Language Structure and Use (Cambridge Approaches to 
Linguistics). Cambridge: Cambridge University Press. Binding: Hardback.
216 x 138 mm 310pp 33 tables 34 figures
ISBN: 0 521 49622 5

Reviewed by:
Niladri Sekhar Dash, Indian Statistical Institute, Calcutta, India.

Synopsis:
The publication of this book signals the maturation of this sub-discipline 
within linguistics and opens multiple new avenues for further research in 
the discipline. It is a timely addition to the leading introductory volumes 
on corpus linguistics, including Sinclair (1991), Svartvik (ed.) (1992), 
McEnery and Wilson (1996), and Kennedy (1998).

Among these, Sinclair (1991) proposes a systematic approach to developing 
corpora of English: he considers the criteria for corpus development, 
discusses aspects of text processing, shows various types of concordances 
and collocations, examines native-speaker introspection, assesses sense 
and structure in the lexis, evaluates words and phrases, and provides a list 
of collocates compiled from the corpora. Svartvik (ed.) (1992), on the 
other hand, surveys the directions of research in corpus linguistics. After 
an introduction, the volume discusses the concept of corpus linguistics, 
assesses its importance for understanding the nature of language, sets out 
design principles for the transcription of spoken discourse, views the 
diachronic corpus as a window on the history of language, uses 
computer-based text corpora to analyze the referential strategies of spoken 
and written texts, considers the automatic analysis of corpora, and looks 
into the probabilistic parsing of corpus texts. McEnery and Wilson (1996) 
present overviews of the theory and practice of corpus linguistics, 
emphasize the key factors in the corpus-based (henceforth C-B) approach, 
such as sampling, representativeness, size and format, and report on 
subject-based studies using the Brown and LOB (Lancaster-Oslo/Bergen) 
corpora. They also weigh the relative merits of qualitative versus 
quantitative approaches to language study. In the core of his book, Kennedy 
(1998) systematically provides a historical overview and evaluates the 
significance of many important studies that currently define the C-B 
approach. His main aim is to introduce the reader to the linguistic 
rationale behind corpus processing, and to this end he includes many useful 
figures and tables that elucidate various linguistic investigations of the 
probabilistic nature of language.

This book (Biber et al. 1998) is about investigating the different ways 
people use language in speech and writing. It introduces the C-B approach to 
the study of language, based on the analysis of large databases of real 
language examples compiled from the LOB and BNC corpora, and illustrates 
exciting new findings about English. It provides 'methodology boxes' on the 
computer processing of texts and, with step-by-step descriptions of research 
methods, gathers new findings about grammar and vocabulary, language use, 
language learning, and differences in language use across registers. The 
book is organized as follows:

The prologue starts with the following lines: "This book is about 
investigating the way people use language in speech and writing. It 
introduces the corpus-based approach to linguistics, based on analysis of 
large databases of real language examples stored on computer. Each chapter 
focuses on a different area of linguistics, including lexicography, grammar, 
discourse, register variation, language acquisition, and historical 
linguistics. Example analyses are presented in each chapter to provide 
concrete descriptions of the research methods and advantages of corpus-based 
techniques.

Ten methodology boxes provide clear and concise explanations of the issues 
in doing corpus-based research and reading corpus-based studies and there is 
a useful appendix of resources for corpus-based investigation. This lucid 
and comprehensive introduction to the subject will be welcomed by a broad 
range of readers, from undergraduate students to professional researchers."

The book is divided into four parts, in addition to the preface, 
introduction and appendix. The introductory chapter (18 pages) sets out the 
goals and methods of the C-B approach. Here the authors discuss the 
structure and use of language; define the C-B approach and its 
characteristics; describe association patterns in language use and the role 
of quantitative analyses and functional interpretations in describing 
language use; compare the C-B approach with other approaches in linguistics; 
and identify the areas of linguistics that can be addressed with it. They 
also acknowledge the sources of the corpora and corpus-analysis tools used 
for the studies reported in the book. Finally, the chapter ends with an 
overview of the book in which they take pains to distinguish "what this 
book is and what this book is not".

Part I (Investigating the use of language features) consists of 4 chapters: 
chapter 2 (34 pages), chapter 3 (29 pages), chapter 4 (22 pages), and 
chapter 5 (27 pages).

Chapter 2 (Lexicography) is devoted entirely to lexical analysis. With the 
help of collocation, it investigates lexicographic issues such as the 
meanings of words, the frequency of words, the distribution of word classes, 
and the behaviour of word senses and seemingly synonymous words across 
registers. To make the observations and findings concrete, some words 
('deal', 'big', 'large', 'great' etc.) taken from the BNC and LOB corpora 
are cited as examples with sample sentences.

Chapter 3 (Grammar) investigates four research questions: (1) the use and 
function of morphological characteristics, where the aim is to capture the 
distribution and function of nominalizations and nominalization endings in 
English; (2) the use and function of grammatical classes, where the 
grammatical categories are counted and the ratio of nouns to verbs is 
compared across registers; (3) the function of syntactic constructions, 
where that- and to-complement clauses are identified in the sample corpus 
and coded for their specific connotations; and (4) the association of 
linguistic and non-linguistic factors with the choice between seemingly 
synonymous structural variants, where the factors associated with the choice 
between subject and extraposed that-clauses are identified and compared with 
four ESL textbooks. The analyses are conducted on three registers (academic 
prose, fiction and spoken English) compiled from the Longman-Lancaster and 
London-Lund corpora.

Chapter 4 (Lexico-grammar) illustrates two research questions: (1) how 
nearly synonymous words (adjectives: 'little' vs. 'small'; verbs: 'begin' 
vs. 'start') with the same grammatical potential can be distinguished in 
terms of usage patterns relating to their grammatical associations; and (2) 
how the use of nearly synonymous grammatical constructions (that-clause vs. 
to-clause) can be understood in terms of their different lexical 
associations across registers. In answering the first question, the 
differences between automatic and human analytical techniques are shown with 
a KWIC file, which yields far more exhaustive and robust results than any 
human being could produce manually. Following this, an elaborate 
interpretation of the grammatical association patterns for 'little' versus 
'small' and 'begin' versus 'start' is presented with tables, examples and 
KWIC output. The second question is likewise addressed with data compiled 
from the corpora, and with interpretations and examples from the sample 
texts. The analysis shows "that that-clauses and to-clauses differ in their 
register associations, grammatical associations and lexical associations" 
(p. 104).
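
For readers unfamiliar with the term, a KWIC (key word in context) display 
simply lines up every occurrence of a search word with a fixed amount of 
text on either side. The following is a minimal sketch of the idea only; the 
function name `kwic`, the `width` parameter and the sample sentence are my 
own assumptions for illustration and are not taken from the book or from any 
concordancing package it describes.

```python
import re


def kwic(text, keyword, width=30):
    """Return one formatted line per occurrence of `keyword` in `text`.

    Each line shows up to `width` characters of left and right context.
    Matching is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical sample text, invented for this illustration.
sample = ("A small child found a little dog. The little dog was not small, "
          "but the child thought a small dog would start to grow.")

for line in kwic(sample, "little", width=20):
    print(line)
```

Sorting such lines by the word immediately to the right or left of the 
keyword is what makes grammatical association patterns of the kind 
discussed in this chapter visible at a glance.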

Chapter 5 (The Study of Discourse Characteristics) presents an interesting 
aspect of C-B study: studying discourse using corpora. For investigating 
discourse features, two important questions are raised: (1) the possibility 
of developing and using interactive computer programs to analyze discourse 
characteristics across registers; and (2) the possibility of using automatic 
analysis to track the use of surface grammatical features over the course of 
a text. For reference types in spoken and written registers, four parameters 
are considered: (i) the status of information, (ii) the type of reference, 
(iii) the form of the expression, and (iv) the distance between the 
expression and its antecedent. In the interactive analysis technique, six 
characteristics are used for capturing noun phrases automatically in 
relatively small texts from the London-Lund and LOB corpora. The program 
illustrates the usefulness of interactive computer analysis, in combination 
with automatic techniques, for analyzing a discourse system. Further, the 
application of C-B techniques for tracking the progression of discourse 
features within a text is shown by investigating the linguistic correlates 
of rhetorical structure (such as the mapping of verbs, tense, voice etc.). 
This study is excellently elaborated and supported with tables, figures and 
text samples.

Part II (Investigating the characteristics of varieties) consists of 3 
chapters: chapter 6 (37 pages), chapter 7 (31 pages) and chapter 8 (27 
pages).

Chapter 6 (Register Variation and English for Special Purposes) shows that, 
with C-B and computational techniques, multi-dimensional analysis is an 
effective tool for studying register variation for specific purposes. 
First, the following are defined as the dimensions of the analysis: (a) 
Involved vs. informational production; (b) Narrative vs. non-narrative 
discourse; (c) Elaborated vs. situation-dependent reference; (d) Overt 
expression of argumentation; and (e) Impersonal vs. non-impersonal style. 
Next, the multi-dimensional analysis technique is used to investigate four 
research questions: (1) how spoken registers differ from written registers 
in their use of dependent clauses (relative clauses, adverbial clauses, 
complement clauses etc.); (2) how speech differs from writing in English 
(the spoken form mostly contains contractions, second person pronouns, 
discourse inserts, semi-modals, wh- complement clauses etc., while the 
written form is marked by frequent nouns and attributive adjectives, 
nominalizations, specialized vocabulary items, passive constructions, 
extraposed constructions etc.); (3) how texts from different academic 
disciplines vary in their patterns of linguistic variation (Biology and 
History are chosen as two disciplines to investigate how they differ in 
subject matter, evidence and methodologies, along with the selection of 
words, formation of sentences etc.); and (4) how the internal sections 
(introduction, methods, results and discussion sections of biology research 
articles) within a single academic register vary linguistically. With 
sufficient examples, tables, samples and graphs, it is convincingly 
established that this technique gives a better understanding of how 
language varies across registers in different situations and for different 
purposes.

Chapter 7 (Language Acquisition and Development) focuses on three major 
areas: (1) first language acquisition by very young children; (2) later 
language development (the acquisition of literacy skills by students at 
various stages); and (3) second language acquisition (by children and 
adults). To illustrate the application of C-B techniques in the analysis of 
elementary student speech and writing, five research questions are 
addressed. The first four focus on first-language issues, while the last 
investigates the errors of non-native English speaking students and thus 
bears on second-language acquisition. For the first four questions, the 
corpora (of the student and by the student) are used; writing development 
with respect to individual linguistic features is considered along two 
popular measures (the number of words per text and the average length of 
T-units in a text); the multi-dimensional investigation technique is 
applied; and a comparison between elementary student and adult dimensions of 
variation is provided with examples and tables. The last question addresses 
the error patterns in the written English of both native (English L1 
students) and non-native (Navajo L1 students) speakers. Only "three types 
of errors, which teachers identified as problematic for Navajo students 
during the compiling of the Arizona Corpus of Elementary Student Writing" 
(p. 198) are considered: subject-verb agreement errors, noun morphology 
errors (e.g., unmarked plurals) and verb-morphology errors (e.g., unmarked 
past tense). The analyses, results and observations show that the C-B 
technique not only introduces a broader and more reliable approach to the 
issues of child language acquisition, but also challenges the observations 
of the traditional (experimental and observational) approaches.

Chapter 8 (Historical and Stylistic Investigations) continues the process 
discussed in chapters 6 and 7, but from a different angle. Here the C-B 
technique is used for studying different stylistic and historical aspects of 
language. For the historical analyses, a diachronic sample corpus is used 
for studying shifts in grammatical and lexical features; for tracking the 
evolution of written and speech-based registers (medical prose vs. drama); 
and for examining the changing grammatical characteristics of dialects along 
demographic lines (social class, gender etc.). For stylistic investigations 
"33 fiction texts, from a pilot version of the ARCHER Corpus, are used" (p. 
223). The texts are distributed over four centuries and are written by 
well-known authors (e.g., S. Johnson, D. Defoe, H. Fielding, J. Swift etc.). 
The overall sample analyses include "investigation of grammatical and 
lexicographic features, examining the use of modals and semi-modals across 
registers over 4 centuries; investigation of register development, 
considering changes in the language use patterns of spoken and written 
registers over three centuries; examination of dialect differences, focusing 
on the language used by women and men in personal letters over four 
centuries; and investigation of individual author (Johnson) style, relative 
to other fiction texts across historical periods" (p. 227). Many tables, 
figures and examples are aptly used to "show how electronic corpora and 
different computational techniques can facilitate our understanding of 
language use across time and style" (p. 228).

Part III (Summing up and looking ahead) contains only chapter 9 (9 pages).

Chapter 9 (Conclusion) evaluates the contributions of the C-B approach; 
identifies additional research areas (computational linguistics, natural 
language processing, speech processing and recognition, the development of 
word taggers and sentence parsers, information retrieval, text processing 
and production, machine (aided) translation etc.); outlines the usefulness 
of the C-B approach in language education (developing educational materials, 
dictionaries and thesauruses, designing classroom activities etc.); and 
finally encourages language researchers and students to explore this new 
sub-discipline by carrying out their own projects, either along their own 
parameters or along the path(s) defined in this book.

Part IV contains 10 methodology boxes giving concise information and 
instructions on: issues in corpus design (approaches to sampling, diversity, 
size, copyright etc.); issues in diachronic corpus design (the Helsinki 
Corpus, the ARCHER corpus etc. are cited as examples); concordancing 
packages versus programming for corpus analysis (what a computer program is, 
the advantages of and requirements for writing one's own program etc.); 
characteristics of tagged corpora (lexical, grammatical, semantic and 
syntactic information in a tagged corpus etc.); the process of tagging 
(word-level tagging with probabilistic information for resolving ambiguity 
etc.); norming frequency counts (the utility of normalization in frequency 
counts); statistical measures of lexical associations (using the mutual 
information score, t-scores etc.); the unit of analysis in C-B studies 
(using the chi-squared test, VARBRUL etc.); significance tests and the 
reporting of statistics (using inferential statistics, the t-test, ANOVA, 
Pearson correlation etc.); and factor loadings and dimension scores 
(covering multi-dimensional analysis, dimension scores, standardization 
etc.).
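
Two of these boxes lend themselves to a short worked illustration. The 
sketch below uses common textbook formulations of a normed frequency count, 
the mutual information score and the t-score for a word pair; I am assuming 
these formulations for illustration, and the book's own boxes may define 
them with different details (for example, adjustments for the size of the 
collocation window). All function names and all numbers are hypothetical.

```python
import math


def normed_count(raw_count, text_length, basis=1_000_000):
    """Scale a raw count to a rate per `basis` words of text."""
    return raw_count / text_length * basis


def expected_pair_freq(freq_x, freq_y, corpus_size):
    """Co-occurrences expected if the two words were independent."""
    return freq_x * freq_y / corpus_size


def mutual_information(pair_freq, freq_x, freq_y, corpus_size):
    """log2 of observed over expected co-occurrence frequency."""
    expected = expected_pair_freq(freq_x, freq_y, corpus_size)
    return math.log2(pair_freq / expected)


def t_score(pair_freq, freq_x, freq_y, corpus_size):
    """(observed - expected) divided by the square root of observed."""
    expected = expected_pair_freq(freq_x, freq_y, corpus_size)
    return (pair_freq - expected) / math.sqrt(pair_freq)


# Hypothetical figures: a word occurring 50 times in a 200,000-word text.
print(round(normed_count(50, 200_000), 1))

# Hypothetical figures: two words with 100 and 200 occurrences that
# co-occur 20 times in a one-million-word corpus.
print(round(mutual_information(20, 100, 200, 1_000_000), 2))
print(round(t_score(20, 100, 200, 1_000_000), 2))
```

Norming is what allows counts from texts or registers of unequal length to 
be compared at all, which is why it underlies most of the tables in the 
book.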

Each chapter contains many tables, figures, maps and text samples, along 
with notes and lists of books and journals for further reading, which help 
to cover the wider scope of the topic under discussion. Moreover, the 
resources useful for C-B investigations (corpora, analytical tools and 
on-line sources) are listed in the appendix, each in alphabetical order with 
all necessary and relevant information. A subject index and an exhaustive 
reference list at the end of the book are an asset to readers.

A critical evaluation:
The book rests on two important pillars: corpus and register. It shows the 
ways in which the computer can be used for studying a corpus of a language, 
and how the results of such study can generate new descriptions and 
understandings of language use across registers that differ markedly from 
those of traditional approaches to language study. However, I would like to 
offer a few observations here which may be considered for the next edition 
of the book.

It is agreed that register is an indispensable aspect of C-B language study. 
The extra-linguistic variables (attributes) proposed by Atkins et al. (1992) 
are therefore of immense importance in corpus design, and they should have 
been fully assessed in the book. It would also have been better if the 
speech corpus had been dealt with separately. This is not to imply that 
research into spoken language is any less important, simply that the needs 
of the two types are different, making it possible to consider these two 
aspects of NLP separately. Moreover, the design criteria for a spoken 
corpus, as shown by Svartvik (1990) and Atkins et al. (1992), differ from 
those for a written corpus.

Figure 2.5 (p. 31) displays a frequency list generated by TACT showing the 
distribution of grammatical forms of the lemma DEAL in the complete, tagged 
LOB corpus. It shows the use of the lemma as a singular noun ("nn"), as a 
proper name ("np"), as a verb ("vb"), as a plural noun ("nns"), as a third 
person singular verb ("vbz"), as a present participle ("vbg"), as a simple 
past tense verb ("vbd"), and as a past participle ("vbn"). Surprisingly, the 
lemma is never registered as an adverb in the LOB corpus, though in the 
dictionary the lexeme 'deal' has an entry as an adverb. This study shows how 
the intuitive assumptions of lexicographers can turn out to be wrong in 
practice.

Some minor mistakes: on page 39 (last line) it is said "... two do not 
mention the 'act of distributing,'". However, in table 2.4 (p. 40) we find 
that it is not 'two' but 'three' dictionaries (Webster's Third 1981, 
Chambers 1993 and Longman Lang. & Culture 1992) that do not mention it. On 
page 43, in section 2.6 (line 13), the phrase 'such systematic differences' 
should probably be 'such semantic differences', because the term 
'systematic' seems irrelevant to the discussion in that section. Again, 
page 44 (penultimate line) has 'over ten times', which should probably be 
'over thirteen times', because the table on the same page says so (84:1235 
in raw counts or 31:408 in normalized counts). Similarly, page 45 (first 
line) has 'three times more', which should probably be 'two times more'. I 
insist on accuracy here because the authors have tried to be accurate in 
their presentation of data throughout the book; they have, for instance, 
very meticulously noted a 'one-and-a-half times' variation on page 44 (last 
line).
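
The 'over thirteen times' reading can be checked directly from the two 
pairs of counts quoted above. This is only a quick arithmetic check; the 
variable names are mine.

```python
# Counts quoted above from the table on page 44 of the book.
raw_low, raw_high = 84, 1235        # raw counts
normed_low, normed_high = 31, 408   # normalized counts

raw_ratio = raw_high / raw_low
normed_ratio = normed_high / normed_low

print(f"raw ratio:        {raw_ratio:.2f}")     # about 14.70
print(f"normalized ratio: {normed_ratio:.2f}")  # about 13.16
```

Both ratios exceed thirteen, so 'over ten times' is not wrong as such, but 
it understates the difference shown in the table.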

The authors have drawn interesting observations from table 3.2 (p. 63), 
which gives the proportion of nominalizations formed with four English 
suffixes: '-tion'/'-sion', '-ment', '-ness' and '-ity'. But what strikes me 
most is that the difference between fiction and speech is always smaller 
than the difference between academic prose and fiction for each of the 
nominalization suffixes. The probable reason is that fiction, though written 
in form, contains many properties (dialogues, direct speech etc.) which are 
closer to the spoken form than to the written one.

In chapters 6, 7 and 8, the term 'dimension' occurs quite frequently 
alongside the phrase 'multi-dimensional analysis'. Unfortunately, the former 
term has no entry in the index, though the latter does. It would have been 
helpful for readers if the former term had also been entered in the index.

The appendix (p. 286) gives an exhaustive list of available corpora, 
analytical tools and on-line sources. However, these are almost all about 
English corpora, with some stray information on other European languages. In 
this context it may be noted that Barlow's website 
(http://www.ruf.rice.edu/~barlow/corpus.html) contains almost exhaustive 
information regarding corpus development in different languages of the 
world besides English. Moreover, the International Journal of Corpus 
Linguistics (e-mail: IJCLids-mannheim.de) also deserves special mention, 
as it is entirely devoted to issues related to corpus linguistics. 
Finally, for the reference section I would like to mention Sinclair (1991), 
which provides a good guide to corpus design, development and processing.

This excellent book, with its lucid and comprehensive introduction to the 
subject, should become an essential reference work for any researcher 
interested in C-B language study. Moreover, it deserves the status of a 
textbook of corpus linguistics at both undergraduate and post-graduate 
levels, because there are ample examples for the student of English who is 
primarily interested in the analysis and description of data, and for whom 
the chapters and methodology boxes are certainly within reach without 
further specialized training. Even general readers with a liking for 
language, seeking a more comprehensive understanding of the resurgence of 
empirical linguistics, can find this book interesting for its various new 
observations, which directly contradict our common hypotheses regarding the 
uses of words or lexical associations in the language. The quality of paper, 
printing and binding is of international standard, and the overall 
production of the book makes it easy to handle. An eye-catching, 
bright-colored laminated cover would be a better bait for drawing the 
attention of casual visitors to book stores (I presume that most of the 
people who occasionally visit book stores are not well aware of such an 
interesting subject, one that deals scientifically with their most powerful 
tool).

Bibliography:
- Atkins, Sue, Jeremy Clear and Nicholas Ostler. 1992. "Corpus Design 
  Criteria". Literary and Linguistic Computing. 7(1): 1-16.
- Garside, Roger, Geoffrey Leech, and Tony McEnery (eds.). 1997. Corpus 
  Annotation: Linguistic Information from Computer Text Corpora. London: 
  Addison-Wesley Longman.
- McEnery, Tony and Andrew Wilson. 1996. Corpus Linguistics. Edinburgh: 
  Edinburgh University Press.
- Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London: 
  Addison-Wesley Longman.
- Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford 
  University Press.
- Svartvik, Jan. 1990. The London-Lund Corpus of Spoken English: Description 
  and Research. Lund Studies in English, 82. Lund: Lund University Press.
- Svartvik, Jan. (ed.). 1992. Directions in Corpus Linguistics: Proceedings 
  of Nobel Symposium 82. (Trends in Linguistics: Studies and Monographs, No 
  65). Berlin: Mouton de Gruyter.


A short biography of the reviewer:
Niladri Sekhar Dash received an MA in Linguistics (first class) from the 
University of Calcutta in 1991. In 1994 he completed a course on ANLP at the 
Indian Institute of Technology, Kanpur. His Ph.D. thesis, on corpus design 
and development for NLP, is scheduled to be submitted to the University of 
Calcutta by May 2000. From 1992 to 1995 he worked as a Language Analyst in 
the TDIL (Text Development in Indian Languages) project of the Dept. of 
Electronics, Govt. of India, where his main work was to design and develop 
corpora of Indian languages. From 1995 to 1997 he worked as a Technical 
Assistant in the project entitled Computational Linguistics and NLP at the 
Computer Vision and Pattern Recognition Unit of the Indian Statistical 
Institute, Calcutta. Since 1997 he has been working as a Scientific 
Assistant at the same institute. His present areas of research are corpus 
design and development, lexicography, wordform processing, parts-of-speech 
tagging, morphological processing etc. in Indian languages. His contact 
address is: Computer Vision and Pattern Recognition Unit, Indian Statistical 
Institute, 203 B. T. Road, Calcutta - 700 035, India. E-mails: 
<niladriisical.ac.in> (Off),
<niladrisekharhotmail.com> (Res.).
