EDITOR: Facchinetti, Roberta TITLE: Corpus Linguistics 25 Years On SERIES: Language and Computers Vol. 62 PUBLISHER: Rodopi YEAR: 2007
Mike Conway, National Institute of Informatics, Tokyo
SUMMARY The book under review is the edited proceedings of the 25th International Computer Archive of Modern and Medieval English Conference (ICAME), held at the University of Verona in May 2004.
The book is 385 pages long and consists of nineteen chapters plus an introduction by the editor. Each chapter contains a list of references and, where appropriate, endnotes. The volume is divided into three sections, reflecting some of the core concerns of the conference. The first section, ''Overviewing 25 Years of Corpus Linguistics Studies'' (four chapters), looks back at the early days and development of corpus linguistics. The second section, ''Descriptive Studies in English Syntax and Semantics'' (eight chapters), is concerned with corpus-based language description, a core area of corpus linguistics over the last 25 years. The third section, ''Second Language Acquisition, Parallel Corpora and Specialist Corpora'' (seven chapters), focuses primarily on issues relating to the use of corpora in second language acquisition. This book concentrates on synchronic corpus research; another book based on the 25th ICAME conference, Facchinetti & Rissanen (2006), is concerned primarily with diachronic language studies.
Roberta Facchinetti, the book's editor, provides the introductory chapter, where she describes the volume as ''a fairly broad and thematic overview of the work undertaken in the field of computerised corpus linguistic studies from their origin to the present day.'' Facchinetti then goes on to summarize each chapter in turn.
Part 1: Overviewing 25 Years of Corpus Linguistics Studies
''Corpus linguistics 25+ years on'' (Jan Svartvik) describes corpus linguistic research prior to the first ICAME conference from a personal, conversational perspective. These early days were ''the stone age of corpus linguistics... when there were no personal computers, no web, no email, no mobile phones, no Google, and no electronic corpora.'' Svartvik also describes the experience of being a corpus linguist in the late 1950s and 1960s, in an environment where empirical approaches were squeezed by the dominant Chomskyan paradigm. The chapter also outlines the important foundational work conducted at University College, London as part of the Survey of English Usage project, including details of how this project was carried out in a period when computers were ''rare, expensive and unreliable.''
''Corpus development 25 years on: from super corpus to cyber corpus'' (Antoinette Renouf) provides a survey of the recent history of corpus development, building the chapter around the three major ''motivating forces'' that have driven the research area forward: ''science (or intellectual curiosity), pragmatics (or necessity) and serendipity (or chance).'' Using this explanatory framework, Renouf describes the motivation for the development of the Brown corpus in the 1960s as primarily scientific. Larger corpora developed in the 1980s and 1990s, such as the British National Corpus (BNC), are referred to by Renouf as ''super-corpora''. The drivers behind the creation of these super-corpora were again primarily scientific (''there were questions about lexis and collocation, and indeed even about grammar, that could not be answered within the scope of a small corpus''), yet serendipity played a role, with the increasing capabilities of computers and the emergence of corpus-based dictionaries. The creation of large-scale monitor corpora in the 1990s was largely driven by the scientific motivation to observe language change across time. From the late 1990s, cyber-corpora (that is, internet-derived corpora or ''web-as-corpus'') were developed due to a range of drivers: serendipity (the web contains a wide range of linguistic data), pragmatism (downloading documents from the web is cheap compared to conventional corpus construction techniques) and scientific interest (the web allows access to the newest usages). In summary, Renouf describes the historical development of corpora as ''characterised by the tension between the desire for knowledge and the constraints of practical necessity and technological feasibility.''
''Seeing through multilingual corpora'' (Stig Johansson) briefly outlines the development of multilingual corpora over ''the last 10-15 years or so'', where multilingual corpora are loosely defined as ''collections of texts in two or more languages which are parallel in some way, either by being in a translation relationship, or by being comparable in other respects, such as genre, time of publication, intended readership and so on.'' Johansson then describes two common forms of multilingual corpora: translation corpora (consisting of texts and their translations into one or more languages) and comparable corpora (consisting of original texts in two or more languages, where the texts chosen are representative of a given genre, time period and so on). He goes on to describe attempts at uniting these paradigms in the English-Norwegian Parallel Corpus. The rest of the chapter uses this multilingual corpus to explore linguistic differences between English and Norwegian (for example, the use of English ''thing'' and Norwegian ''ting'').
''Corpora and spoken discourse'' (Anne Wichmann) presents some of the practical and theoretical problems confronted by the researcher in constructing speech corpora, distinguishing between speech corpora that are created as part of the development of speech technology systems (often under laboratory conditions) and speech corpora created from ''natural'' data (that is, speech recorded during ''real'' interactions), which tend to be of interest to corpus linguists (and conversation analysts). Wichmann stresses the importance of including sound files with spoken discourse corpora since, in the case of speech (unlike text corpora), the recording itself is the raw data and ought to be preserved.
Part 2: Descriptive Studies in English Syntax and Semantics
''An example of frequent English phraseology: distributions, structures and functions'' (Michael Stubbs) begins by emphasizing that the emergence of interest in phraseology has accompanied the rise of corpus linguistics. Previously the study of phrases (and the related concepts of n-grams, lexical bundles and so on) had been crowded out by a preoccupation with grammar and lexis, and by a degree of hostility (or indifference) towards the frequency-based investigative techniques appropriate for the study of phrases. Stubbs describes the software tool used in his study, the PIE (Phrases in English) system (http://pie.usna.edu), as ''a powerful interactive database... constructed from the BNC'', which consists of all the n-grams shorter than a given length in the BNC (with other phrasal patterns, also based on the BNC, available to the user). Stubbs uses the software to explore several research areas, one of which is the prevalence of given phrases across text types; for example, the use of pronouns in fiction (''I don't want to'', ''I want you to'') and academic writing (''I shall show that'', ''I have already mentioned'') is analyzed.
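To give a sense of the kind of n-gram counting that underlies a resource like PIE, the following is a minimal illustrative sketch in Python; it is not the PIE system itself, and the corpus file name and n-gram length are assumed purely for the example.

    # Illustrative sketch only: count word n-grams in a plain-text corpus,
    # in the spirit of the n-gram databases described above (not PIE itself).
    from collections import Counter
    import re

    def ngrams(tokens, n):
        """Yield successive n-grams (as tuples) from a list of tokens."""
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    def count_ngrams(text, n=4):
        """Lower-case, tokenise on word characters, and count n-grams."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return Counter(ngrams(tokens, n))

    # Hypothetical usage: 'fiction.txt' stands in for a text-type sample.
    # counts = count_ngrams(open("fiction.txt", encoding="utf-8").read(), n=4)
    # print(counts.most_common(10))

Comparing the most frequent n-grams obtained from samples of different text types is, in essence, the kind of cross-register phraseological comparison the chapter reports.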
''The semantic properties of _going to_: distribution patterns in four subcorpora of _The British National Corpus_'' (Ylva Berglund and Christopher Williams) analyzes the ''intentional and predictive uses of the going to construction'' in four different registers/genres (financial, academic, news and spoken). The analysis showed that the frequency of occurrence of ''going to'' (and also the more informal ''gonna'') varies markedly between the chosen registers, ''with less than one hundred instances per million words of running text in academic writing, to almost 3000 in spoken conversation.'' The authors then go on to analyze - among other things - the predictive versus intentional use of ''going to'' across the four genres of interest, concluding that the news genre ''shows a marked preference for predictive meaning.''
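Comparisons of this kind rest on normalised frequencies (occurrences per million words of running text). A small sketch of that arithmetic follows; the raw count and subcorpus size are invented placeholders, not figures from Berglund and Williams' study.

    # Normalised frequency: occurrences per million words of running text.
    # The count and subcorpus size below are placeholders, not the study's data.
    def per_million(raw_count, corpus_size_in_words):
        return raw_count / corpus_size_in_words * 1_000_000

    # Hypothetical example: 250 hits in a 4.2-million-word subcorpus.
    print(round(per_million(250, 4_200_000), 1))  # -> 59.5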
''The superlative in spoken English'' (Claudia Claridge) suggests that rather than simply expressing factual comparisons, superlatives are primarily used as ''a means for (often vague) evaluation and the expression of emotion.'' The spoken section of the British National Corpus was used as data, as the author was interested in the everyday, informal use of superlatives. The BNC tagset was utilized to help identify superlatives, with 1973 adjectival superlatives identified (a frequency of 5 instances per 10,000 words).
''Semantically-based queries with a joint BNC/WordNet database'' (Mark Davies) describes an attempt at marrying two important linguistic resources: the British National Corpus and WordNet. The BNC has emerged as a central resource in English corpus linguistics. WordNet (Fellbaum, 1998), a comprehensive electronic lexical database widely used in corpus and computational linguistics, is built around the central notion of sets of synonymous words (''synsets''). The software described in this paper allows a user to query the BNC/WordNet database for BNC-derived frequency information for a given word and the synonyms of that word (along with many other more sophisticated types of search).
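The central idea, retrieving corpus frequencies for a word together with its WordNet synonyms, can be sketched as below. This is a minimal illustration using NLTK's WordNet interface and an invented frequency table; it is not Davies' BNC/WordNet database, and the frequencies are placeholders.

    # Minimal sketch: combine WordNet synsets with (invented) corpus frequencies,
    # in the spirit of the joint BNC/WordNet queries described above.
    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data

    # Placeholder frequency table standing in for BNC-derived counts.
    freq = {"strong": 20000, "powerful": 8000, "potent": 900, "stiff": 3000}

    def synonym_frequencies(word):
        """Return corpus frequencies for a word and its WordNet synonyms."""
        synonyms = {lemma.replace("_", " ")
                    for synset in wn.synsets(word)
                    for lemma in synset.lemma_names()}
        return {w: freq.get(w, 0) for w in sorted(synonyms | {word})}

    print(synonym_frequencies("strong"))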
''Size matters - or thus can meaningful structures be revealed in large corpora'' (Solveig Granath) continues the descriptive theme developed in the previous four chapters. Granath shows that for some relatively rare constructions, larger corpora (like the Guardian/Observer British newspaper corpora) are more informative than the standard one-million-word corpora commonly used in corpus linguistics (for example, BROWN, FLOB, and so on). The chapter focuses primarily on different subject/verb word ordering in sentences that begin with ''thus''.
''Inversion in modern written English: syntactic complexity, information status and the creative writer'' (Rolf Kreyer) provides a ''discourse functional, corpus based account of the construction at issue'' (that is, inversion), stressing the function of inversion within the discourse structure as an aid to readability. Additionally, two superordinate functions are identified: text-structuring inversion and ''immediate-observer-effect'' inversion (a technique often used in fiction to give an impression of unmediated perception). Two subsections of the BNC were used in this work (written-academic and prose-fiction) and instances of the inversion construction were identified semi-automatically.
''The filling in the sandwich: internal modification of idioms'' (David Minugh) uses a three-hundred-million-word corpus (composed of the BNC, British and American newspaper corpora and broadcast transcripts) to investigate the occurrence of idioms ''and examine the extent to which these prepackaged chunks of language can be internally expanded so as to link them into the discourse within which they are used.'' An example of the kind of 'expanded' idiom of interest, taken from the chapter, is ''restore some political coals to Newcastle.'' Fifty-five idioms were used, all of which occur in the Collins COBUILD Dictionary of Idioms (Collins, 2002). Minugh found that - at least for the fifty-five idioms considered in the study - idiom expansion is much less common than previous studies seemed to indicate.
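One simple way to retrieve such internally expanded idioms is to allow optional material between the fixed parts of the phrase. The sketch below illustrates the idea with a regular expression; the pattern and the example sentence are my own illustration, since the chapter's actual retrieval procedure is not described here.

    # Illustrative sketch: find "coals to Newcastle" even when extra words are
    # inserted inside the idiom (e.g. "political coals to Newcastle").
    # This is not the retrieval method used in the chapter itself.
    import re

    # Up to two optional words may intervene before "coals".
    pattern = re.compile(r"\b(?:\w+\s+){0,2}coals to Newcastle\b", re.IGNORECASE)

    sentence = "This would restore some political coals to Newcastle."
    match = pattern.search(sentence)
    if match:
        print(match.group())   # -> "some political coals to Newcastle"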
''NP-internal functions and extended use of the 'type' nouns kind, sort and type: towards a comprehensive, corpus based description'' (Liesbeth De Smedt, Lieselotte Brems and Kristin Davidse) begins with a brief review of work on type noun functions from the 1930s to the present, before going on to identify six categories of type noun (head, modifier, postdeterminer, qualifying, discourse marker and quotational). These six categories were identified using the previous literature on type nouns, and also on the basis of a close analysis of corpus evidence. The final part of the paper consists of an analysis of the frequency of the six categories of type noun in two corpora: the Times newspaper section of the COBUILD Corpus (a formal written register) and the Bergen Corpus of London Teenage Language (the COLT corpus) (an informal spoken register). The results of this analysis showed that type nouns from the newspaper corpus were primarily NP-internal and concerned with classification, whereas in the teenagers' speech corpus, the use of type nouns as adverbial qualifiers was much more common.
Part 3: Second Language Acquisition, Parallel Corpora and Specialist Corpora
''Student writing of research articles in a foreign language: metacognition and corpora'' (Francesca Bianchi and Roberto Pazzaglia) describes the creation of a corpus of published papers in the area of experimental psychology, designed for the purpose of teaching Italian undergraduate students how to write research articles.
''The structure of corpora in SLA research'' (Ron Cowan and Michael Leeser) identifies the characteristics that a corpus should have in order to be useful for studying SLA (Second Language Acquisition). This focus can be compared to the previous chapter, which was primarily concerned with the development and use of corpora for teaching a second language. It is suggested that a useful corpus should cover a diversity of subjects (that is, topics) in the second language, and several levels of proficiency, in order to track systematic differences in the development of the second language. The construction of a small corpus of writing by Spanish students of different levels of proficiency enrolled in an English language class at the University of Illinois is also described. The corpus was used to track those errors that remained common even for those students who had achieved a good proficiency in English.
''The path from learner corpus analysis to language pedagogy: some neglected issues'' (Nadja Nesselhauf) stresses the difficulties involved in moving from corpus studies that identify the difficulties faced by L2 learners to pedagogical policy. The corpus used was derived from the German subcorpus of ICLE (containing argumentative and descriptive essays by German native speaking advanced students of English) and consisted of 150,000 words in total. Nesselhauf focused on a limited number of collocations and found that ''the collocations that the learners produced are frequently not unacceptable per se but rather are existing English collocations used inappropriately.'' The final section of the paper considers how to best use these findings in a pedagogical setting, stressing the difficulty of moving from corpus studies (that is, identifying through corpus evidence particular difficulties that L2 learners face) to a realistic teaching setting with competing demands on classroom time.
''Exploiting the Corpus of East-African English'' (Josef Schmied) explores this English as a second language corpus (part of the International Corpus of English, henceforth ICE-EA) and suggests a number of research questions that the corpus may be used to address. Examples include assessing the lexical complexity of the ICE-EA corpus compared to other ESL corpora, and assessing the syntactic complexity of the ICE-EA corpus compared to other English as a second language corpora (and also to native-speaker English).
''Transitive verb plus reflexive pronoun/personal pronoun patterns in English and Japanese: using a Japanese-English parallel corpus'' (Makoto Shimizu and Masaki Murata) falls into three sections. The first section describes the general area of English/Japanese parallel corpora, along with a list of corpora currently available. In section two the authors explore the use of reflexive and personal pronouns with transitive verbs, and found that personal pronouns were much more common than reflexive pronouns. Section three considers the differences between English and Japanese in their use of reflexive and personal pronouns. The Context Sensitive and Tagged Parallel Corpus (which consists of parallel English/Japanese newspaper articles) is used throughout the work.
''The retrieval of false anglicisms in newspaper texts'' (Cristiano Furiassi and Knut Hofland) describes a method for identifying 'false anglicisms' in newspaper text. False anglicisms are roughly defined as words or phrases that look like English, but are not part of the English language (the authors give the example of 'autostop' as an Italian false anglicism for hitchhiking). The corpus used was constructed from Italian newspaper text (La Stampa, La Repubblica and Corriere della Sera) and consists of 19.5 million tokens. Computational linguistic techniques were used to identify false anglicisms, but automated methods alone did not prove sufficient, and human post-processing was required in order to eliminate noise.
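The general idea, flagging words in Italian text that look English but are not attested English words, can be illustrated with a very rough sketch. The orthographic heuristics and the tiny word list below are invented for illustration only and are not the authors' method.

    # Rough illustrative sketch (not the authors' method): flag tokens in
    # Italian text that look orthographically English, as candidate anglicisms,
    # then keep only those that are not attested English words.
    import re

    # Tiny placeholder lexicon; a real system would use large word lists.
    english_lexicon = {"stop", "baby", "weekend"}

    def looks_english(token):
        """Crude orthographic cue: letter sequences rare in native Italian words."""
        return bool(re.search(r"(w|y|k|j|x|stop$|ing$)", token))

    def candidate_false_anglicisms(text):
        tokens = re.findall(r"[a-zàèéìòù]+", text.lower())
        return [t for t in tokens
                if looks_english(t) and t not in english_lexicon]

    print(candidate_false_anglicisms("Ho fatto l'autostop nel weekend"))
    # -> ['autostop'] ("weekend" is filtered out as a genuine English word)

As the chapter notes, heuristics of this kind produce considerable noise, which is why human post-processing was required.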
''Lexical semantics for software requirements engineering - a corpus based approach'' (Kersten Lindmark, Johan Natt och Dag, and Caroline Willners) describes the use of corpus linguistic techniques for analyzing software requirements. The authors first identify keywords characteristic of the requirements domain using the WordSmith toolkit (Scott, 2004) and a corpus constructed from 1932 requirement texts in English. The BNC Sampler was used as a reference corpus. That is, in order to identify keywords in the software requirement domain, the WordSmith toolkit was used to pick out those words that occur more frequently (at a statistically significant level) in software requirements compared to a more general corpus of English (the BNC Sampler). In addition to identifying domain specific keywords, an attempt was made at constructing a WordNet for the domain (that is, a lexical database specifying synonyms and part/whole relationships) using simple pattern matching techniques in conjunction with the extracted keywords.
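Keyword extraction of this kind typically compares a word's frequency in the target corpus with its frequency in the reference corpus using a significance statistic such as log-likelihood. The sketch below uses the standard log-likelihood formula with invented counts; it is not WordSmith Tools itself, and the word and figures are assumptions made for illustration.

    # Illustrative keyword ("keyness") calculation using the standard
    # log-likelihood statistic; the counts below are invented, and this is
    # not WordSmith Tools itself.
    import math

    def log_likelihood(freq_target, size_target, freq_ref, size_ref):
        """Log-likelihood of a word's frequency in a target vs. reference corpus."""
        total = size_target + size_ref
        expected_target = size_target * (freq_target + freq_ref) / total
        expected_ref = size_ref * (freq_target + freq_ref) / total
        ll = 0.0
        if freq_target:
            ll += freq_target * math.log(freq_target / expected_target)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # Hypothetical counts: "shall" in a 300,000-word requirements corpus
    # versus a 1,000,000-word reference corpus.
    print(round(log_likelihood(450, 300_000, 120, 1_000_000), 1))
    # Values above roughly 6.63 are conventionally significant at p < 0.01.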
EVALUATION This edited volume of papers from the twenty-fifth ICAME conference is focused on (primarily English language) corpus linguistics. The first section of the book (subtitled ''Overviewing 25 years of corpus linguistic studies'') serves as an introduction to, and history of, the field, with each article authored by an influential researcher. Section two of the book is concerned with descriptive studies of syntax and semantics, historically a core area of corpus linguistics. The eight papers in this section present a representative sample of current work in descriptive corpus linguistics by well known researchers in the field. Section three is titled ''Second language acquisition, parallel corpora and specialist corpora,'' although most of the papers focus on the use of corpora in the context of studying second language acquisition, or the development of corpus-based pedagogical tools for the teaching of second languages. The volume covers a great deal of ground, from the description of new software tools for corpus linguistics (Mark Davies' chapter on the development of a joint BNC/WordNet database) to a study of transitive verbs based on parallel corpora (Makoto Shimizu and Masaki Murata's chapter on English/Japanese parallel corpora), and succeeds both in providing an overview of the development of the discipline and in presenting state-of-the-art research.
It is, however, worth mentioning some minor shortcomings of the book. First, there are some typographical errors, although these are not serious enough to compromise understanding. Second, the division of the papers into three main sections does pose some difficulties. While the first and second sections (dealing with the development of corpus linguistics over the past 25 years and descriptive corpus linguistics, respectively) are unproblematic, the third section, ''Second Language Acquisition, parallel corpora and specialist corpora,'' does not seem to have a unifying theme. This is, however, acknowledged in the editor's introduction and can equally well be seen in a positive light, reflecting the diversity of modern corpus research.
REFERENCES Collins (2002). _Collins COBUILD Dictionary of Idioms_. London: HarperCollins.
Facchinetti, R. & Rissanen, M. (2006). _Corpus-based Studies of Diachronic English_. Bern: Peter Lang Publishing.
Fellbaum, C. (1998). _WordNet: An Electronic Lexical Database_. Cambridge: MIT Press.
Scott, M. (2004). _WordSmith Tools_. Oxford: Oxford University Press.
ABOUT THE REVIEWER Mike Conway is a research fellow at the National Institute of Informatics, Tokyo.