Review of English Corpus Linguistics

Reviewer: Gabriela Saldanha
Book Title: English Corpus Linguistics
Book Author: Charles F. Meyer
Publisher: Cambridge University Press
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): English
Issue Number: 13.3401

_English Corpus Linguistics_ consists of a preface, a first chapter providing an overview of the field, three chapters describing the process of designing, building and annotating a corpus, followed by one chapter on corpus analysis and a brief concluding chapter. Study questions are provided at the end of each chapter and there are also two appendixes listing corpus resources and concordancing programs.

In the preface, Meyer provides a rather general definition of a corpus as a ''collection of texts or parts of texts upon which some general linguistic analysis can be conducted'' (xi), and argues that corpus linguistics is not a distinct paradigm in linguistics but a methodology. As evidence for this claim he shows that corpora can be used within different paradigms for different purposes and that various types of corpora are available for different types of analyses.

The first chapter, ''Corpus analysis and linguistic theory'', starts by describing the conflicting views of language offered by descriptive linguistics and generative grammar, which are explained here with reference to the three types of ''adequacy'' that linguistic theories can achieve according to Chomsky. It is refreshing to find the two disciplines described as complementary rather than mutually exclusive, although Meyer concludes that ''corpora will probably never have much of a role in generative grammar'' and ''are much better suited to functional analyses of language'' (p.5). He then explains how corpora are used in functional descriptions of language and describes, by way of example, an analysis of elliptical co-ordinations based on a 96,000-word corpus of American English texts across different registers. The rest of the chapter briefly outlines the kinds of corpus-based research being conducted in various fields, from NLP to language pedagogy.

The second chapter, ''Planning the construction of a corpus'', discusses the kinds of decisions that need to be made before starting to build a corpus. The British National Corpus (BNC) is used as a reference. Meyer first describes the composition of the BNC and then explores the factors that need to be considered in order to obtain a representative and balanced corpus: corpus length, genres, number and length of text samples to be included, time frame and population represented. The last section deals with controlling sociolinguistic variables such as gender, age and dialect. Meyer stresses the advantages of planning carefully without being too rigid, since changes are bound to be required at later stages.

The next chapter is a practical guide to collecting and computerising samples of speech and writing and uses the International Corpus of English (ICE) as a reference. It deals with methodological considerations to be taken into account when collecting data, which range from ethical issues to the choice of tape recorders and microphones. There is also advice on how to keep records and store files. The process of transcribing speech is explained in detail, but there is less information on scanning written text and using OCR software.

Chapter Four explains different kinds of annotation, from structural mark-up to part-of-speech tagging and parsing. It shows how mark-up languages such as SGML and -- more briefly -- XML are used to annotate features such as overlapping speech or to produce headers with bibliographic information on texts. The sections on tagging and parsing explain how rule-based and probability-based taggers and parsers work and illustrate the kinds of problems that affect their output. The chapter ends with a brief introduction to semantic, discourse and problem-oriented tagging.
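
To make the contrast between the two families of taggers concrete, here is a minimal Python sketch. It is not taken from Meyer's book or from any real tagger: the tagset, the suffix rules and the tiny hand-tagged sample are all hypothetical, and the probability-based tagger is reduced to its simplest form, choosing the tag most often seen with a word in the training data.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sample (word, tag); not from any real corpus.
TRAINING = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
    ("we", "PRON"), ("run", "VERB"), ("quickly", "ADV"),
]


def rule_based_tag(word):
    """Assign a tag from hand-written rules (illustrative only)."""
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word in {"the", "a", "an"}:
        return "DET"
    return "NOUN"  # default guess when no rule fires


def train_unigram(tagged_words):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts


def probability_based_tag(word, counts):
    """Pick the most frequent tag for a known word; fall back to the rules."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return rule_based_tag(word)


counts = train_unigram(TRAINING)
sentence = ["they", "run", "quickly"]
rule_tags = [rule_based_tag(w) for w in sentence]
prob_tags = [probability_based_tag(w, counts) for w in sentence]
print(list(zip(sentence, rule_tags)))
print(list(zip(sentence, prob_tags)))
```

Each toy tagger fails in its own way. The rules label ''run'' a noun whatever the context, while the frequency-based tagger would label the noun in ''the run was long'' a verb. Real probabilistic taggers also weigh the surrounding tags, which this sketch omits.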

Chapter Five aims at providing a general framework for conducting corpus analysis by describing in detail a study on pseudo-titles and equivalent appositives in press reportage using the International Corpus of English. It discusses how to frame a research question and choose a suitable corpus, how to extract the relevant information from the corpus and how to use statistical tests. Meyer advocates combining qualitative with quantitative methods, or rather using quantitative information to support qualitative statements. The data from the study on pseudo-titles is explored in depth using cross-tabulation and chi-square and log-likelihood tests. The final chapter reviews each step in the process of building and analysing a corpus and looks at how it could be made easier in the future.
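
For readers unfamiliar with the two tests, the following Python sketch shows how chi-square and log-likelihood statistics can be computed from a cross-tabulation. It is only an illustration under stated assumptions: the counts are invented and are not Meyer's data, and the function names are hypothetical rather than drawn from the book or from any statistics package.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def log_likelihood(table):
    """Log-likelihood ratio (G2): 2 * sum of O * ln(O / E); empty cells add 0."""
    expected = expected_counts(table)
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


# Invented counts, NOT Meyer's data: rows are two hypothetical newspaper
# samples, columns are pseudo-titles vs. equivalent appositives.
observed = [
    [30, 70],
    [60, 40],
]

chi2 = chi_square(observed)
g2 = log_likelihood(observed)
print(f"chi-square = {chi2:.2f}, log-likelihood = {g2:.2f}")
```

For a two-by-two table such as this one there is one degree of freedom, so either statistic is compared against a critical value of 3.84 at the 0.05 level.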


Although the title could lead us to expect a broader coverage, _English Corpus Linguistics_ is in fact a detailed description of the process of designing and building a corpus, preceded by an introductory chapter and complemented by an example of corpus analysis. The focus on issues of design, compilation and annotation is actually the main strength of the book and what distinguishes it from other introductions to corpus linguistics, which have tended to focus on the kinds of research questions that can be investigated using corpus tools (Stubbs 1996 and Biber et al. 1998) or to offer broad overviews of the field (McEnery and Wilson 1996 and Kennedy 1998).

Meyer provides a very informative and clear description of every step in the process, leaving almost no questions unanswered as to the practicalities involved. It is a realistic account, taking into account constraints imposed by time and financial resources as well as the need to control as many variables as possible. It also provides good (and working!) links to useful websites.

On account of its clarity and the practical advice offered, this is the ideal book to recommend to anybody planning to build their own corpus. However, there is something else to be taken into account if we are recommending the book to students or researchers who are new to the field, namely that Meyer implies that his approach is the only way of conducting research in corpus linguistics and fails to point out possible alternatives or counter-arguments. Given the introductory nature of the book and its intended audience of inexperienced researchers, it may seem wise not to expand on theoretical debates, but it would have been appropriate to point out the existence of other points of view and perhaps provide references for those who wish to inquire further.

For example, his argument in favour of categorising corpus linguistics as a methodology and not a paradigm is convincing, but it is also one of the points that might have benefited from a clear discussion of counter-arguments (see, for example, Schoenefeld 1999). As it is, some questions remain unanswered, such as whether the object of study is the only thing that defines a paradigm and whether our understanding of the nature of that object has nothing to do with it.

In Chapter Two, when discussing the length of texts to be included in a corpus, Meyer does not mention the main and strongest argument for the inclusion of complete texts, namely that few linguistic features of a text are evenly distributed throughout the text (Sinclair 1991:19, Stubbs 1996:32). He gives the impression that including complete texts is ideal but most impractical, without mentioning that the Bank of English, one of the largest corpora of English currently available, includes whole texts.
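
The point about uneven distribution can be illustrated with a small Python sketch that counts one feature in successive segments of a text. The toy text and the function name are invented for illustration and come neither from Meyer nor from the works cited; a real study would use a full text and a linguistically motivated feature.

```python
def counts_per_segment(tokens, target, n_segments):
    """Count occurrences of `target` in each of n roughly equal segments."""
    size = -(-len(tokens) // n_segments)  # ceiling division
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [segment.count(target) for segment in segments]


# Invented toy text: the article "the" occurs in the opening and closing
# thirds but not in the middle third.
text = (
    "the study is described and the method is outlined and the data is listed "
    "we argue that this matters because we think that it changes what we know "
    "the results are clear and the case is closed and the work is done"
)
tokens = text.split()
per_segment = counts_per_segment(tokens, "the", 3)
print(per_segment)
```

A sample drawn only from the middle third of this toy text would miss the feature altogether, which is the risk that the argument for complete texts addresses.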

In Chapter Four Meyer starts from the assumption that ''for a corpus to be fully useful to potential users, it needs to be annotated'' (p.81) and does not mention the arguments in favour of using raw text. The use of text without any grammatical annotation is, however, one of the basic tenets of a different approach to corpus analysis, known as corpus-driven linguistics. When describing tagsets Meyer comments that ''the differences in the tagsets reflect different conceptions of English grammar'', and this is precisely why some corpus linguists object to tagging. Advocates of the corpus-driven approach (Tognini-Bonelli 2001, Hunston and Francis 2000) avoid imposing on the data any preconceived linguistic categories that have not been derived from the data and that in many cases do not fit the data. While some linguists, like Meyer, see annotation as enriching a corpus, others, like Sinclair, identify a loss of information in that process (Sinclair 1992:385-6).

Chapter Five aims at offering guidelines that could be used for any corpus study, but they are actually adapted to the kind of corpus-based study that Meyer conducts, where the corpus data are used to test or exemplify hypotheses elaborated according to existing linguistic descriptions. A rather different set of guidelines would have to be followed if we were trying to find lexicogrammatical patterns. Finally, it is somewhat disappointing to learn in Chapter Five that the features chosen for corpus analysis had to be manually retrieved in six of the seven corpora compared because these were not tagged or parsed. This particular example supports Meyer's earlier statement about fully annotated corpora being the most useful. However, given that each of the six subcorpora contains 40,000 words, and therefore the process ''took several days'' (p.119), we wonder whether this was a good choice to exemplify corpus analysis, especially since there are abundant and equally interesting studies where the features are retrieved automatically.


Gabriela Saldanha holds an MPhil in Translation Studies from UMIST, UK. She is currently doing a PhD at the Centre for Translation and Textual Studies at Dublin City University, Ireland, where she also lectures on Translation Technology and Corpus Linguistics for Translation. Her research interests include Corpus Linguistics, Corpus-based Translation Studies, Translation Technology, Stylistics and Literary Translation.

