John M. Kirk, ed. (2000): Corpora Galore: Analyses and Techniques in Describing English (Language and Computers: Studies in Practical Linguistics No 30). Amsterdam: Rodopi.
Reviewed by Joybrato Mukherjee, University of Bonn
This book presents a selection of papers from the Nineteenth Conference on English Language Research on Computerised Corpora, commonly referred to as ICAME 19-98 (International Computer Archive of Modern and Medieval English). This ICAME conference was held at the Slieve Donard Hotel in Newcastle, Northern Ireland, from 20 to 24 May 1998. To make it clear right at the beginning of this review: this collection is, on the whole, as worth reading as the proceedings of previous ICAME conferences, as it introduces "new descriptions from new corpora using new techniques" (p. i), as Kirk points out in his preface (correctly, I believe). The following synopsis is intended to provide the reader with a bird's eye view of the contents of the book.
Synopsis
The papers are subsumed into three groups. The first section comprises studies which are devoted to the lexical and collocational description of English, whereas the papers in the second group present corpus-based analyses of syntactic and semantic phenomena. The third section is about new methods and innovative techniques in the rapidly developing field of corpus linguistics.
Opening the first section, Susan Blackwell offers an extremely inspiring article on the relevance of corpus data to forensic linguistics. She compares the use of "honest", "look" and "well" as discourse markers in the 10 million word spoken component of the Bank of English Corpus (BofE) with their use in disputed and undisputed utterances of two suspects (convicted of armed robbery and drug abuse respectively). By means of this technique, she shows in both cases that the disputed and unsigned police interviews presumably do not form contemporaneously written and reliable transcripts, but summaries which have been infelicitously produced at a later stage.
The exploration of low frequency collocations in the British National Corpus (BNC) lies at the heart of the second paper. Sebastian Hoffmann and Hans Martin Lehmann compare the knowledge of 16 native speakers and 16 non-native speakers concerning the most frequent collocates of 55 node words (e.g. "goalless draw") which occur between 50 and 100 times in the BNC. Whereas the native speakers performed at an average rate of 70%, the non-native speakers guessed correctly at an average rate of 34%. Two main conclusions are drawn: (1) native speakers are able to memorise collocational patterns which are comparatively rare in language use; (2) taking into account the relatively small exposure to the English language, non-native speakers' performance in this experiment turns out to be surprisingly good.
Using the spoken and written Wellington corpora, Graeme Kennedy and Shunji Yamazaki investigate the influence of Maori on the lexicon of New Zealand English. They find that, in terms of frequency, this influence is not as strong as previously assumed.
Anthony McEnery, John Paul Baker and Andrew Hardie provide a progress report on the current compilation of the Lancaster Corpus of Abuse (LCA) containing examples of swearing from spoken language. So far, corpus data reveal, for example, that terms which have been traditionally regarded as sexist language (e.g. "bitch") are significantly often applied to males as well.
Newspaper CDs as corpora are used by David C. Minugh who explores the frequency of idioms in this genre. Although there are many caveats (e.g. the lack of representativeness), his findings, in general, support the use of these easily accessible corpora in the teaching of English as a Foreign Language (EFL).
The first section is concluded by a highly illuminating article by Vincent B.Y. Ooi. He analyses the use and frequency of culturally distinctive collocations (e.g. "fish-head curry", "urine detector") in Singaporean-Malaysian English (represented by several newspaper corpora) and in the newspaper sections of the BofE. This paper impressively exemplifies the impact of corpus linguistics on the description of collocational differences between varieties of English.
The second section opens with an article by Jürgen Gerner, putting into perspective the choice of singular or plural pronouns in coreference with the indefinite personal pronouns "someone", "anyone", "everyone" and "no one" (as well as the corresponding items ending in "-body"). Whichever pronominal form is chosen, there is either a violation of gender concord (e.g. "himself") or a violation of number concord (e.g. "themselves") between the anaphoric pronoun and its antecedent. Drawing on the spoken component of the BNC, Gerner's analysis reveals that in this medium, the so-called singular "they" ("them", "theirs" etc.) is used in around 96-98% of all cases. Only with regard to "someone/somebody" does this relative frequency drop to 84-87%, as this indefinite pronoun may be used with specific singular reference so that a violation of number concord by means of singular "they" is not necessitated. As far as written usage is concerned, future research will certainly profit from the exploration of the remaining 90 million words in the written domain of the BNC.
In the 50 million word Cobuild Direct Corpus (CDC), a sub-corpus of the BofE, Göran Kjellmer detects 47 occurrences of the verb "try" followed by a bare infinitive. As "try" in these authentic instances meets at least some of the criteria for auxiliary verbs, the hypothesis that this verb is moving towards auxiliaryhood seems plausible. This study shows that ongoing diachronic processes may be detectable at an early stage only thanks to large corpora.
Hans Lindquist studies the choice between inflectional and periphrastic comparison (e.g. "costlier" vs. "more costly") of disyllabic adjectives in two newspaper corpora. His findings suggest that the selection is usually not a matter of free variation, but to a large extent guided by morphological, syntactic and prosodic factors. Periphrastic constructions, for example, tend to be placed at the end of a clause as they are somewhat heavier than inflectional forms: this could be regarded as a realization of the principle of end-weight. Whether or not these mechanisms are genre-specific remains to be seen.
In Inge de Mönnink's paper, seven types of noun phrases (e.g. with a fronted premodifier) are described which elude the usual noun phrase structure and, consequently, a clear-cut immediate constituent analysis. One of the interesting points in this article is the underlying methodology which combines corpus data (from an unspecified 175,000 word corpus and the BNC) with elicitation data so that (1) intuition-based hypotheses can be verified or falsified in the light of empirical corpus data and (2) interpretations of corpus data can be tested by means of intuition or elicitation, potentially leading to new hypotheses (as in this study). Thus, an innovative "data cycle for descriptive linguistics" (p. 144) is established.
Degree modifiers of adjectives in spoken English are investigated by Carita Paradis. To describe diachronic changes in the use of constructions such as "it's well weird", she draws on the 500,000 word London-Lund Corpus (LLC) comprising texts from the sixties and seventies, the Corpus of London Teenage Language (COLT) of the same size and compiled in the nineties, and the spoken component of the BNC. She observes, for example, that there are remarkably fewer degree modifiers in COLT than in LLC, although two degree modifiers are attested in COLT only, namely "well" and "enough".
The article by Aimo Seppänen and Joe Trotta is devoted to the use of the pattern "wh- + that" in sentences such as "I yielded to whatever arguments that were given". The widespread assumption that this pattern became extinct after the Early Modern English period is refuted since 90 examples are discovered in the BNC and the CDC. These occurrences are restricted neither to spoken or written language nor to specific varieties of English. Therefore the authors make a plea to include this - no doubt marginal - structure in the grammar of present-day English.
Anna-Brita Stenström investigates intensifiers in teenage talk as attested in COLT. Two striking results refer to (1) the increasing use of "well" as adjective intensifier and (2) of "enough" as intensifier in premodifying position. These phenomena exemplify the innovative potential of teenage language. Surprisingly enough, both lexical items had already been used as intensifiers in the 8th and 9th centuries so that the recent developments in London teenage language may be considered as a process of revival.
The specialized 800,000 word Corpus of Early English Medical Writing 1375-1750 (still under construction) is the database of Irma Taavitsainen's analysis of the linguistic processes involved in the development of this very genre. In general, there is a clear change from a rather detached to a more involved writing which is, for example, based on a general trend from textual to interpersonal kinds of metatextual comments. This study emphasizes the relevance of corpus data to diachronic linguistics.
Medical writing, though from a synchronic perspective, is also the topic of Minna Vihla's paper which focuses on modal expressions (of epistemic possibility), e.g. "may" and "might", in a 400,000 word corpus of contemporary American medical texts. On the whole, the extensive use of modal expressions ensures that writers do not identify themselves with the research results presented and remain, thus, sincere towards the reader. More specifically, one can differentiate between several sub-genres in which modal expressions are used to different extents: in manuals and clinical textbooks, for example, they are much more frequent than in expository and argumentative texts.
The third section opens with Magnar Brekke's at times hilarious, but no doubt thought-provoking considerations of the future role of the world wide web as a cybercorpus. The occurrences of two test items, i.e. "chaos" and "quantum", and their collocates in the cybercorpus are studied. The results are then compared with corpus data from the BNC, leading to the general assumption that the exploration of the constantly growing and changing web may provide very useful linguistic insights. However, two fundamental problems are clearly identified as well: (1) the lack of representativeness and of any other standards of corpus compilation in the web; (2) the, linguistically speaking, primitive toolkits provided by today's web browsers.
Sylviane Granger and Martin Wynne make an attempt to optimise measures of lexical richness in essays written by EFL learners. Drawing on data from five sections of the International Corpus of Learner English (ICLE) and the concept of adjusted lemma/token ratios, they draw the conclusion that it is not the lack of words but the lack of native-like use of the words used by language learners which should be the prime concern in EFL teaching.
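To make the notion of a lemma/token ratio concrete, the following is a minimal sketch of the unadjusted measure only: the number of distinct lemmas divided by the number of running words. The specific adjustments made by Granger and Wynne are not reproduced here, and the function name, the toy lemma table and the sample sentence are assumptions made purely for illustration.

```python
def lemma_token_ratio(tokens, lemma_of):
    """Return the number of distinct lemmas divided by the number of tokens.

    `lemma_of` is a hypothetical lookup table mapping word forms to lemmas;
    forms missing from the table are treated as their own lemma.
    """
    if not tokens:
        return 0.0
    lemmas = {lemma_of.get(tok.lower(), tok.lower()) for tok in tokens}
    return len(lemmas) / len(tokens)


# Toy lemma table and invented sentence, for illustration only.
TOY_LEMMAS = {"goes": "go", "went": "go", "cars": "car"}
essay = "He goes and she went and the cars go".split()

ratio = lemma_token_ratio(essay, TOY_LEMMAS)
print(f"lemma/token ratio: {ratio:.3f}")
```

A plain word-form count would treat "goes", "went" and "go" as three different items, whereas the lemma-based count treats them as one, which is why the two measures can rank the same learner essays differently.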
The BNCweb is a client for accessing the BNC via the world wide web. Some of its main features are sketched out by Hans Martin Lehmann, Peter Schneider and Sebastian Hoffmann.
Oliver Mason claims that the collocates of a word can be determined empirically. He introduces the concept of lexical gravity of a word which is described in terms of entropy, i.e. the degree of lexico-grammatical stability in the context of a word. Thus, the so-called window, i.e. the number of words to the left and to the right of the node word, in which collocates are to be described, is not a fixed frame, but a variable span the size of which is dependent on the specific node word.
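The general idea of measuring contextual stability by entropy can be sketched as follows. This is not Mason's own formula for lexical gravity, which is not given in this review; it is a simplified stand-in that computes the Shannon entropy of the word forms found at each position around a node word, with low entropy indicating a tightly constrained slot. The function name, the maximum span of three words and the toy text are assumptions.

```python
import math
from collections import Counter


def positional_entropy(tokens, node, max_span=3):
    """Shannon entropy (in bits) of the words at each offset from `node`.

    Returns a dict mapping offset (negative = left, positive = right)
    to entropy. Offsets with no observations are omitted.
    """
    slots = {off: Counter() for off in range(-max_span, max_span + 1) if off != 0}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for off, counts in slots.items():
            j = i + off
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1

    entropies = {}
    for off, counts in slots.items():
        total = sum(counts.values())
        if total == 0:
            continue
        entropies[off] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropies


# Invented toy text; a real study would use a large corpus.
text = "a goalless draw was a goalless draw again but a dull draw followed".split()
entropies = positional_entropy(text, "draw")
for off in sorted(entropies):
    print(off, round(entropies[off], 3))
```

In a sketch of this kind, a variable span could be derived by extending the window outwards only as long as the entropy at the next position stays below some threshold, so that each node word receives a window suited to its own behaviour.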
Nelleke Oostdijk's case study of the linguistic annotation of the English verb phrase underlines how important it is for corpus users to thoroughly know the descriptive model underlying the linguistic analysis provided by corpus compilers. While, for example, the verb in "a bottle containing milk" constitutes a genuine simple verb phrase (i.e. main verb only), the formally similar verb phrase in "the man walking in the park" could be considered as a reduction of a complex verb phrase, namely "is walking". The findings presented in this paper call for a very careful interpretation of corpus data since different systems of linguistic annotation yield, consequently, different results.
The unresolved problem of grammatically annotating spontaneous speech is discussed by Anna Rahman and Geoffrey Sampson. Speech repairs and grammatically ill-formed utterances are two examples of phenomena which still pose great problems to hitherto existing software tools for word tagging and syntactic parsing. If natural language processing is to make substantial progress, grammar annotation standards will have to be extended to these particularities of spoken language.
Considering the increasing availability of syntactically parsed corpora, Pasi Tapanainen and Timo Järvinen develop a new type of concordance which is not based on node words, but on syntactic functions in node position. This approach allows for syntactic concordances in which the key-word is missing, as for example in zero relative clauses.
Focusing on the potential of parsing procedures as well, Atro Voutilainen gives a progress report on recent trends in parser design at the University of Helsinki. In particular, the performance of a new functional dependency parser, visualizing dependency relations between words, seems to be promising: the overall precision of the parser ranges from 90% to 96% with regard to subjects, objects and predicatives.
Finally, Sean Wallis, Bas Aarts and Gerald Nelson provide a general introduction to the ICE Corpus Utility Program (ICECUP). ICECUP is a software tool which has been designed for the exploration of the syntactically parsed 1,000,000 word British component of the International Corpus of English (ICE). ICECUP draws on the use of so-called fuzzy tree fragments which are intended to visualize the function, the category, the features and the edges of text unit elements.
Critical Evaluation
The selection of papers underlines on how many corpus-linguistic fronts progress is being made. As the title implies, the focus is on the diversity of corpora which are available today. This very aspect is, in fact, successfully represented by the selected papers. Some thirty different corpora are used to different extents. In this, three main domains can be identified which do, of course, overlap at times: (1) extensive lexico-grammatical and semantic studies of particular phenomena; (2) comparative analyses of several corpora (including the use of databases as control corpora); (3) putting new techniques and methods to the test.
Those studies which cover the first domain are, on the whole, well-written and plausible contributions to what the prime concern of linguists should be according to the forefather of British contextualism: "The business of linguistics is to describe languages" (Firth 1957: 32). To pick out but one example, Susan Blackwell's paper exemplifies the relevance of corpus-based descriptions of authentic language use to the field of applied (e.g. forensic) linguistics. Hans Lindquist's study is another good example of the advantages of corpus-based analyses (over, say, generative approaches) because the mechanisms which underlie the choice between inflectional and periphrastic comparison can only be identified by considering real language in context as attested in large corpora: "the comprehensive study of language must be based on textual evidence" (Sinclair 1991: 6).
Stubbs (1996: 33) states as a central principle of British traditions in text analysis that "text types must be studied comparatively across text corpora". Accordingly and successfully, Anna-Brita Stenström, for example, compares the use of "well" and "enough" in COLT (representing teenage language) with general tendencies in the BNC and its subcorpora. The versatility of available corpora allows for such empirical text-typological analyses and leaves no excuses for stylistic descriptions based on intuition and/or invented examples only. In a similar way, Vincent B.Y. Ooi's study shows that today's corpora are a goldmine for English dialectology in that different collocational strengths in different varieties of English become measurable in quantitative terms.
Corpus linguistics is a process, and the myriad of new methods and techniques presented in this book reveals the rapid development in this field. For example, Magnar Brekke's paper makes it clear that there may be the cybercorpus on the horizon - a database of unprecedented size and dynamism.
Having highlighted the positive so far, it is, however, necessary to make some critical remarks about the selection of papers in general and about some of the contributions in particular.
On the one hand, the book suffers from a lack of theoretical commitment. Of course, I do not know whether this is due to the selection procedure or to the entirety of papers submitted for consideration in the first place. Kirk explicitly states that "ICAME papers have not only been descriptive, they have been concerned with theoretical issues" (p. v), but in the corresponding third section, progress reports and introductions to new software hold the field. This is not to say that questions of corpus linguistic theory in a wider setting are not addressed at all. But sometimes, there is no genuine attempt to answer them. For example, Sebastian Hoffmann and Hans Martin Lehmann present most inspiring findings as to the acquisition of low frequency collocations by native and non-native speakers. Their final conclusion is "that an even larger corpus would be needed to provide reliable data for future investigations" (p. 31). Notwithstanding the correctness of this conclusion, I feel that their results may also challenge the traditionally established, generative approach to language competence: obviously, exposure to authentic language use plays a much more important role in the shaping of (collocational) competence than previously assumed. A second example of leaving loose threads is David C. Minugh's paper. He is perfectly correct in observing that "students, particularly EFL students, are both encouraged to learn idioms [...] and simultaneously discouraged from using them" (p. 57). However, he does not provide the reader with a clear-cut conclusion as to this problem on the basis of the numerous - and no doubt valuable - quantitative corpus analyses.
On the other hand, some papers are affected by more specific and minor infelicities. Again, two examples should suffice to illustrate this point. The methodology of Inge de Mönnink's study of the mobility of constituents in the English noun phrase has already been pointed out as being very effective and innovative. However, she does not go into much detail about the 175,000 word corpus which she draws on. In my view, this vagueness is at odds with her general (and true) statement that "corpus data are verifiable, which is an important requirement for a scientific approach to linguistics" (p. 133). Some studies seem to get carried away by the irresistible power of figures, tables and diagrams. In Oliver Mason's paper, for example, quite a considerable number of diagrams are intended to illustrate the lexical gravity of several words, but I personally would have preferred a more explicit explanation of the underlying concepts of entropy and gravity (although, as a molecular biologist, I am acquainted with the major aspects of entropy in biochemistry and gravity in the physical sciences). I think that quite generally and perhaps inevitably, there is a latent danger in corpus linguistics of focusing on figures and frequencies at the expense of theoretical and functional considerations, explanations and conclusions.
On the whole, Corpora Galore is a celebration of the fact that only a few years after Jan Svartvik's (1992: 7) statement that "[c]orpus linguistics comes of age", it has by now come of age and is rapidly growing and consistently flourishing. The book provides many interesting results by using many different methods and many different corpora. Everyone who is interested in the linguistic description of authentic English will no doubt profit from reading this selection. As conference proceedings tend to be in general (and this is not a criticism at all), it is more like a jigsaw puzzle than a straightforward introduction to the state of the art in corpus linguistics. Hopefully, many linguists will try to put the puzzle together by reading the book.
(Some small typographical errata: on p. 12, some lines of the running text have been duplicated; on p. 162 one finds *"structur"; and some tables are inconsistently formatted (e.g. p. 171). In one paper, the introductory sentence of section 1 and the first sentence of section 2, comprising 32 words, are identical (p. 133 and p. 134).)
References
Firth, John Rupert (1957): "A synopsis of linguistic theory 1930-1955", Studies in Linguistic Analysis, Special Volume of the Philological Society, 1-32.
Sinclair, John (1991): Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stubbs, Michael (1996): Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford: Blackwell.
Svartvik, Jan (1992): "Corpus linguistics comes of age", Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, edited by Jan Svartvik. Berlin: Mouton de Gruyter. 7-13.
Joybrato Mukherjee is an Assistant Professor of Modern English Linguistics at the English Department of the University of Bonn. His research interests include corpus linguistics, stylistics, textlinguistics, intonation, syntax and EFL teaching. In his forthcoming PhD thesis, interactions between prosody and syntax at tone unit boundaries are described on the basis of quantitative and functional corpus analyses.