LINGUIST List 13.3401

Sat Dec 21 2002

Review: Corpus Linguistics: Meyer (2002)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at siminlinguistlist.org.

Directory

  1. Gaby Saldanha, Meyer (2002) English Corpus Linguistics

Message 1: Meyer (2002) English Corpus Linguistics

Date: Sat, 21 Dec 2002 13:03:17 +0000
From: Gaby Saldanha <gabriela.saldanha2mail.dcu.ie>
Subject: Meyer (2002) English Corpus Linguistics

Meyer, Charles F. (2002) English Corpus Linguistics.
Cambridge University Press, xvi+168pp, hardback ISBN 0-521-80879-0, $60.00.

Book Announcement on Linguist:
http://linguistlist.org/get-book.html?BookID=2348 
http://linguistlist.org/issues/13/13-163.html


Gaby Saldanha, Dublin City University

OVERVIEW

_English Corpus Linguistics_ consists of a preface, a first chapter
providing an overview of the field, three chapters describing the
process of designing, building and annotating a corpus, followed by
one chapter on corpus analysis and a brief concluding chapter. Study
questions are provided at the end of each chapter and there are also
two appendixes listing corpus resources and concordancing programs.

In the preface, Meyer provides a rather general definition of a corpus
as a ''collection of texts or parts of texts upon which some general
linguistic analysis can be conducted'' (xi), and argues that corpus
linguistics is not a distinct paradigm in linguistics but a
methodology. As evidence for this claim he shows that corpora can be
used within different paradigms for different purposes and that
various types of corpora are available for different types of
analyses.

The first chapter, ''Corpus analysis and linguistic theory'', starts by
describing the conflicting views of language offered by descriptive
linguistics and generative grammar, which are explained here with
reference to the three types of ''adequacy'' that linguistic theories
can achieve according to Chomsky. It is refreshing to find the two
disciplines described as complementary rather than mutually exclusive,
although Meyer concludes that ''corpora will probably never have much
of a role in generative grammar'' and ''are much better suited to
functional analyses of language'' (p.5). He then explains how corpora
are used in functional descriptions of language and describes by way
of example an analysis of elliptical co-ordinations based on a
96,000-word corpus of American English texts across different
registers. The rest of the chapter outlines briefly the kind of
corpus-based research being conducted in various fields from NLP to
language pedagogy.

The second chapter, ''Planning the construction of a corpus'', discusses
the kind of decisions that need to be made before starting to build a
corpus. The British National Corpus (BNC) is used as a
reference. Meyer first describes the composition of the BNC and then
explores the factors that need to be considered in order to obtain a
representative and balanced corpus: corpus length, genres, number and
length of text samples to be included, time frame and population
represented. The last section deals with controlling sociolinguistic
variables such as gender, age and dialect. Meyer insists on the
advantages of good planning without being too rigid, since changes are
bound to be required at later stages.

The next chapter is a practical guide to collecting and computerising
samples of speech and writing and uses the International Corpus of
English (ICE) as a reference. It deals with methodological
considerations to be taken into account when collecting data, which
range from ethical issues to the choice of tape recorders and
microphones. There is also advice on how to keep records and store
files. The process of transcribing speech is explained in detail, but
there is less information on scanning written text and applying OCR
software to it.

Chapter Four explains different kinds of annotation, from structural
mark-up to part-of-speech tagging and parsing. It explains how mark-up
languages such as SGML and -- more briefly -- XML are used in order to
annotate features such as overlapping speech or to produce headers
with bibliographic information on texts. The sections on tagging and
parsing explain how rule-based and probability-based taggers and
parsers work and illustrate the kind of problems that affect their
output. The chapter ends with a brief introduction to semantic,
discourse and problem-oriented tagging.
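
To make the distinction between the two kinds of tagger concrete, the
following is a minimal sketch, not the systems Meyer describes. It
assumes a tiny hand-made training sample, a hypothetical three-rule
suffix tagger, and a unigram model that simply picks each word's most
frequent tag; real probabilistic taggers also use the surrounding
context.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample (word, tag); not taken from any real corpus.
TRAINING = [
    ("the", "DET"), ("record", "NOUN"), ("was", "VERB"), ("broken", "VERB"),
    ("they", "PRON"), ("record", "VERB"), ("the", "DET"), ("speech", "NOUN"),
    ("the", "DET"), ("record", "NOUN"),
]


def train_unigram(tagged_words):
    """Probability-based step: keep the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def rule_based_tag(word):
    """Rule-based step: guess a tag from the word's suffix (illustrative rules)."""
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"


def tag_sentence(sentence, model):
    """Use the unigram model for known words and the suffix rules otherwise."""
    tagged = []
    for word in sentence.split():
        key = word.lower()
        tagged.append((word, model[key] if key in model else rule_based_tag(key)))
    return tagged


model = train_unigram(TRAINING)
print(tag_sentence("They record the recording quickly", model))
```

In this toy run both methods make the kind of error that affects real
tagger output: ''record'' receives its most frequent tag (NOUN)
although it is a verb here, and the suffix rule labels the noun
''recording'' as a verb.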

Chapter Five aims at providing a general framework for conducting
corpus analysis by describing in detail a study on pseudo-titles and
equivalent appositives in press reportage using the International
Corpus of English. It discusses how to frame a research question and
choose a suitable corpus, how to extract the relevant information from
the corpus and how to use statistical tests. Meyer advocates combining
qualitative with quantitative methods, or rather using quantitative
information to support qualitative statements. The data from the study
on pseudo-titles is explored in depth using cross-tabulation and
chi-square and log-likelihood tests. The final chapter reviews each step
in the process of building and analysing a corpus and looks at how it
could be made easier in the future.
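
As an illustration of the two statistical tests named above, the
following sketch computes chi-square and log-likelihood (G2) for a
cross-tabulation. The counts are invented for the example and are not
Meyer's data; the row and column labels are assumptions chosen only to
echo the pseudo-title study.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def log_likelihood(table):
    """Log-likelihood ratio G2: 2 * sum of observed * ln(observed / expected)."""
    expected = expected_counts(table)
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute nothing to the sum
    )


# Hypothetical counts. Rows: two regional varieties.
# Columns: pseudo-titles, equivalent appositives.
counts = [[30, 10], [10, 30]]

print(round(chi_square(counts), 2))      # 20.0
print(round(log_likelihood(counts), 2))  # 20.93
```

With one degree of freedom, both values are well above 3.84, the
critical value at p = 0.05, so for these invented counts the two
varieties would differ significantly in their choice of construction.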

EVALUATION

Although the title could lead us to expect broader coverage, _English
Corpus Linguistics_ is in fact a detailed description of the process of
designing and building a corpus which is preceded by an introductory
chapter and complemented by an example of corpus analysis. The focus
on issues of design, compilation and annotation is actually the main
strength of the book and what distinguishes it from other
introductions to corpus linguistics which have tended to focus on the
kinds of research questions that can be investigated using corpus
tools (Stubbs 1996 and Biber et al. 1998) or offer broad overviews of
the field (McEnery and Wilson 1996 and Kennedy 1998).

Meyer provides a very informative and clear description of every step
in the process, leaving almost no questions unanswered as to the
practicalities involved. The account is realistic, taking into account the
constraints imposed by time and financial resources as well as the
need to control as many variables as possible. It also provides good
(and working!) links to useful websites.

On account of its clarity and the practical advice offered, this is
the ideal book to be recommended to anybody planning to build their
own corpus. However, there is something else to be taken into account
if we are recommending the book to students or researchers who are new
to the field, namely that Meyer implies that his approach is the only
way of conducting research in corpus linguistics and fails to point
out possible alternatives or counter-arguments. Given the introductory
nature of the book and its intended audience of inexperienced
researchers, it may seem wise not to expand on theoretical debates,
but it would have been appropriate to point out the existence of other
points of view and perhaps provide references for those who wish to
inquire further.

For example, his argument in favour of categorising corpus linguistics
as a methodology and not a paradigm is convincing, but it is also one
of the points that might have benefited from a clear discussion of
counter-arguments (see, for example, Schoenefeld 1999). As it is,
some questions remain unanswered, such as whether the object of study
is the only thing that defines a paradigm and whether our
understanding of the nature of that object has nothing to do with it.

In Chapter Two, when discussing the length of texts to be included in
a corpus, Meyer does not mention the main and strongest argument for
the inclusion of complete texts, namely that few linguistic features
of a text are evenly distributed throughout the text (Sinclair
1991:19, Stubbs 1996:32). He gives the impression that including
complete texts is ideal but most impractical, without mentioning that
the Bank of English, one of the largest corpora of English currently
available, includes whole texts.

In Chapter Four Meyer starts from the assumption that ''for a corpus
to be fully useful to potential users, it needs to be annotated''
(p.81) and does not mention the arguments in favour of using raw
text. The use of text without any grammatical annotation is, however,
one of the basic tenets of a different approach to corpus analysis,
known as corpus-driven linguistics. When describing tagsets Meyer
comments that ''the differences in the tagsets reflect different
conceptions of English grammar'', and this is precisely why some
corpus linguists object to tagging. Advocates of the corpus-driven
approach (Tognini-Bonelli 2001, Hunston and Francis 2000) avoid
imposing on the data any preconceived linguistic categories that have
not been derived from the data and that in many cases do not fit the
data. While some linguists, like Meyer, see annotation as enriching a
corpus, others, like Sinclair, identify a loss of information in that
process (Sinclair 1992:385-6).

Chapter Five aims at offering guidelines that could be used for any
corpus study, but they are actually adapted to the kind of
corpus-based study that Meyer conducts, where the corpus data are used
to test or exemplify hypotheses elaborated according to existing
linguistic descriptions. A rather different set of guidelines would
have to be followed if we were trying to find lexicogrammatical
patterns.

Finally, it is somewhat disappointing to learn in Chapter Five that
the features chosen for corpus analysis had to be manually retrieved
in six of the seven corpora compared because these were not tagged or
parsed. This particular example supports Meyer's earlier statement
about fully annotated corpora being the most useful. However, given
that each of the six subcorpora contains 40,000 words, and therefore
the process ''took several days'' (p.119), we wonder whether this was
a good choice to exemplify corpus analysis, especially since there are
abundant and equally interesting studies where the features are
retrieved automatically.

REFERENCES

Biber, D., S. Conrad, and R. Reppen (1998) Corpus Linguistics:
Investigating Language Structure and Use, Cambridge: Cambridge
University Press.

Hunston, S. and G. Francis (2000) Pattern Grammar: A Corpus-driven
Approach to the Lexical Grammar of English, Amsterdam and
Philadelphia: John Benjamins.

Kennedy, G. (1998) An Introduction to Corpus Linguistics, London and
New York: Longman.

McEnery, T. and A. Wilson (1996) Corpus Linguistics, Edinburgh:
Edinburgh University Press.

Sinclair, J.M. (1991) Corpus, Concordance, Collocation, Oxford: Oxford
University Press.

Sinclair, J.M. (1992) ''The Automatic Analysis of Corpora'', in
Svartvik, J. (ed.) Directions in Corpus Linguistics, Berlin and New
York: Mouton de Gruyter.

Schoenefeld, D. (1999) ''Corpus Linguistics and Cognitivism'',
International Journal of Corpus Linguistics, Vol. 4(1), 137-171.

Stubbs, M. (1996) Text and Corpus Analysis: Computer-assisted Studies
of Language and Culture, Oxford and Cambridge, Mass.: Blackwell.

Tognini-Bonelli, E. (2001) Corpus Linguistics at Work, Amsterdam and
Philadelphia: John Benjamins.

ABOUT THE REVIEWER

Gabriela Saldanha holds an MPhil in Translation Studies from UMIST,
UK. She is currently doing a PhD at the Centre for Translation and
Textual Studies at Dublin City University, Ireland, where she also
lectures on Translation Technology and Corpus Linguistics for
Translation. Her research interests include Corpus Linguistics,
Corpus-based Translation Studies, Translation Technology, Stylistics
and Literary Translation.