From: Mike Conway <mikenii.ac.jp>
Subject: Corpus Linguistics 25 Years On
Announced at http://linguistlist.org/issues/18/18-2026.html
EDITOR: Facchinetti, Roberta
TITLE: Corpus Linguistics 25 Years On
SERIES: Language and Computers Vol. 62
PUBLISHER: Rodopi
YEAR: 2007
Mike Conway, National Institute of Informatics, Tokyo
SUMMARY
The book under review is the edited proceedings of the 25th International Computer Archive of Modern and Medieval English Conference (ICAME), held at the University of Verona in May 2004.
The book is 385 pages long and consists of nineteen chapters and an introduction from the editor. Each chapter contains a list of references and, if appropriate, endnotes. The volume is divided into three sections, reflecting some of the core concerns of the conference. The first section, ''Overviewing 25 Years of Corpus Linguistics Studies'' (four chapters), looks back at the early days and development of corpus linguistics. The second section, ''Descriptive Studies in English Syntax and Semantics'' (eight chapters), is concerned with corpus-based language description, a core area of corpus linguistics over the last 25 years. The third section, ''Second Language Acquisition, Parallel Corpora and Specialist Corpora'' (seven chapters), focuses primarily on issues relating to the use of corpora in second language acquisition. This book concentrates on synchronic corpus research. Another book based on the 25th ICAME conference - Facchinetti & Rissanen (2006) - is concerned primarily with diachronic language studies.
Roberta Facchinetti, the book's editor, provides the introductory chapter, where she describes the volume as ''a fairly broad and thematic overview of the work undertaken in the field of computerised corpus linguistic studies from their origin to the present day.'' Facchinetti then goes on to summarize each chapter in turn.
Part 1: Overviewing 25 Years of Corpus Linguistics Studies
''Corpus linguistics 25+ years on'' (Jan Svartvik) describes corpus linguistic research prior to the first ICAME conference from a personal, conversational perspective. These early days were ''the stone age of corpus linguistics... when there were no personal computers, no web, no email, no mobile phones, no Google, and no electronic corpora.'' Svartvik also describes the experience of being a corpus linguist in the late 1950s and 1960s, in an environment where empirical approaches were squeezed by the dominant Chomskyan paradigm. The chapter also outlines the important foundational work conducted at University College London as part of the Survey of English Usage project, including details of how this project was carried out in a period when computers were ''rare, expensive and unreliable.''
''Corpus development 25 years on: from super corpus to cyber corpus'' (Antoinette Renouf) provides a survey of the recent history of corpus development, building the chapter around the three major ''motivating forces'' that have driven the research area forward: ''science (or intellectual curiosity), pragmatics (or necessity) and serendipity (or chance).'' Using this explanatory framework, Renouf describes the motivation for the development of the Brown corpus in the 1960s as primarily scientific. Larger corpora developed in the 1980s and 1990s, such as the British National Corpus (BNC), are referred to by Renouf as ''super-corpora''. The drivers behind the creation of these super-corpora were again primarily scientific (''there were questions about lexis and collocation, and indeed even about grammar, that could not be answered within the scope of a small corpus''), yet serendipity played a role, with the increasing capabilities of computers and the emergence of corpus-based dictionaries. The creation of large-scale monitor corpora in the 1990s was largely driven by the scientific motivation to observe language change across time. From the late 1990s, cyber-corpora (that is, internet-derived corpora or ''web-as-corpus'') were developed due to a range of drivers: serendipity (the web contains a wide range of linguistic data), pragmatism (downloading documents from the web is cheap compared to conventional corpus construction techniques) and scientific interest (the web allows access to the newest usages). In summary, Renouf describes the historical development of corpora as ''characterised by the tension between the desire for knowledge and the constraints of practical necessity and technological feasibility.''
''Seeing through multilingual corpora'' (Stig Johansson) briefly outlines the development of multilingual corpora over ''the last 10-15 years or so'', where multilingual corpora are loosely defined as ''collections of texts in two or more languages which are parallel in some way, either by being in a translation relationship, or by being comparable in other respects, such as genre, time of publication, intended readership and so on.'' Johansson then describes two common forms of multilingual corpora: translation corpora (consisting of texts and their translation into one or more languages) and comparable corpora (consisting of original texts in two or more languages, where the texts chosen are representative of a given genre, time period and so on for each language). He goes on to describe an attempt at uniting these paradigms in the English-Norwegian Parallel Corpus. The rest of the chapter uses this multilingual corpus to explore linguistic differences between English and Norwegian (for example, the use of the English ''thing'' and Norwegian ''ting'').
''Corpora and spoken discourse'' (Anne Wichmann) presents some of the practical and theoretical problems confronted by the researcher in constructing speech corpora, distinguishing between speech corpora that are created as part of the development of speech technology systems (often under laboratory conditions) and speech corpora created from ''natural'' data (that is, speech recorded during ''real'' interactions) that tend to be of interest to corpus linguists (and conversation analysts). Wichmann stresses the importance of including sound files with spoken discourse corpora: for spoken language (unlike text corpora), the recording itself is the raw data and ought to be preserved.
Part 2: Descriptive Studies in English Syntax and Semantics
''An example of frequent English phraseology: distributions, structures and functions'' (Michael Stubbs) begins by emphasizing that the emergence of interest in phraseology has accompanied the rise of corpus linguistics. Previously the study of phrases (and of related concepts such as n-grams, lexical bundles and so on) had been crowded out by a concern with grammar and lexical issues, and by some degree of hostility (or indifference) to the frequency-based investigative techniques appropriate for the study of phrases. Stubbs describes the software tool used in his study, the PIE (Phrases in English) system (http://pie.usna.edu), as ''a powerful interactive database... constructed from the BNC'' which consists of all the n-grams shorter than a given length in the BNC (with other phrasal patterns, also based on the BNC, available to the user). Stubbs uses the software to explore several research areas, one of which is the prevalence of given phrases across text types. For example, the use of pronouns in fiction (''I don't want to'', ''I want you to'') and academic writing (''I shall show that'', ''I have already mentioned'') is analyzed.
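The kind of comparison described above, counting recurrent word sequences and comparing them across text types, can be sketched in a few lines. The following is only a minimal illustration and not the PIE system itself: the function names and the two tiny sample texts are invented for this sketch, and tokenization is a plain whitespace split rather than anything BNC-specific.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count the n-grams in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Hypothetical miniature "text types"; a real study would use BNC subcorpora.
fiction = "i don't want to go . i don't want to stay . i want you to leave"
academic = "i shall show that this holds . i have already mentioned that"

fiction_counts = ngram_counts(fiction, 4)
academic_counts = ngram_counts(academic, 4)

# Compare how often the same 4-gram occurs in each text type.
phrase = ("i", "don't", "want", "to")
print(phrase, fiction_counts[phrase], academic_counts[phrase])
```

On corpora of different sizes, raw counts like these would need to be normalized before any comparison across text types is meaningful.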
''The semantic properties of _going to_: distribution patterns in four subcorpora of _The British National Corpus_'' (Ylva Berglund and Christopher Williams) analyzes the ''intentional and predictive uses of the going to construction'' in four different registers/genres (financial, academic, news and spoken). The analysis showed that the frequency of occurrence of 'going to' (and also the more informal 'gonna') varies markedly between the chosen registers ''with less than one hundred instances per million words of running text in academic writing, to almost 3000 in spoken conversation.'' The authors then go on to analyze - among other things - the predictive versus intentional use of ''going to'' across the four genres of interest, concluding that the news genre ''shows a marked preference for predictive meaning.''
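Figures such as ''instances per million words'' are normalized frequencies, which make subcorpora of different sizes comparable. A minimal sketch of the arithmetic follows; the hit counts and subcorpus sizes are invented for illustration and are not taken from the chapter.

```python
def per_million(hits, corpus_size, per=1_000_000):
    """Normalize a raw count to a rate per `per` words of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * per / corpus_size


# Hypothetical (hits, subcorpus size in words) pairs, for illustration only.
subcorpora = {
    "academic": (45, 500_000),
    "spoken": (12_000, 4_000_000),
}

for name, (hits, size) in subcorpora.items():
    print(f"{name}: {per_million(hits, size):.0f} per million words")
```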
''The superlative in spoken English'' (Claudia Claridge) suggests that rather than simply expressing factual comparisons, superlatives are primarily used as ''a means for (often vague) evaluation and the expression of emotion.'' The spoken section of the British National Corpus was used as data, as the author was interested in the everyday, informal use of superlatives. The BNC tagset was utilized to help identify superlatives, with 1,973 adjectival superlatives identified (a frequency of 5 instances per 10,000 words).
''Semantically-based queries with a joint BNC/WordNet database'' (Mark Davies) describes an attempt at marrying two important linguistic resources: the British National Corpus and WordNet. The BNC has emerged as a central resource in English corpus linguistics. WordNet (Fellbaum, 1998), a comprehensive electronic lexical database widely used in corpus and computational linguistics, is built around the central notion of sets of synonymous words (''synsets''). The software described in this paper allows a user to query the BNC/WordNet database for BNC-derived frequency information for a given word and the synonyms of that word (along with many other more sophisticated types of search).
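The basic query described here (look up a word's synonyms, then report a corpus frequency for each) can be illustrated with a toy example. This is a sketch under stated assumptions and not Davies' database: the synsets and frequency counts below are made up, and a real system would draw them from WordNet and the BNC respectively.

```python
from collections import Counter

# Hypothetical synsets: each set groups words treated as synonymous.
SYNSETS = [
    {"big", "large"},
    {"big", "important", "significant"},
    {"small", "little"},
]

# Hypothetical corpus frequencies (a real system would take these from the BNC).
FREQS = Counter({
    "big": 120, "large": 95, "important": 60,
    "significant": 30, "small": 80, "little": 150,
})


def synonyms(word, synsets):
    """Collect every word that shares at least one synset with `word`."""
    found = set()
    for synset in synsets:
        if word in synset:
            found |= synset
    found.discard(word)
    return found


def synonym_frequencies(word, synsets, freqs):
    """Return (synonym, frequency) pairs, most frequent first."""
    rows = [(w, freqs[w]) for w in synonyms(word, synsets)]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


result = synonym_frequencies("big", SYNSETS, FREQS)
print(result)
```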
''Size matters - or thus can meaningful structures be revealed in large corpora'' (Solveig Granath) continues the descriptive theme developed in the previous four chapters. Granath shows that for some relatively rare constructions, larger corpora (like the Guardian/Observer British newspaper corpora) are more informative than the standard one million word corpora commonly used in corpus linguistics (for example, BROWN, FLOB, and so on). The chapter focuses primarily on different subject/verb word ordering in sentences that begin with ''thus''.
''Inversion in modern written English: syntactic complexity, information status and the creative writer'' (Rolf Kreyer) provides a ''discourse functional, corpus based account of the construction at issue'' (that is, inversion), stressing the function of inversion within the discourse structure as an aid to readability. Additionally, two superordinate functions are identified: text-structuring inversion and ''immediate-observer-effect'' inversion (a technique often used in fiction to give an impression of unmediated perception). Two subsections of the BNC were used in this work (written-academic and prose-fiction) and instances of the inversion construction were identified semi-automatically.
''The filling in the sandwich: internal modification of idioms'' (David Minugh) uses a three hundred million word corpus (composed of the BNC, British and American newspaper corpora and broadcast transcripts) to investigate the occurrence of idioms ''and examine the extent to which these prepackaged chunks of language can be internally expanded so as to link them into the discourse within which they are used.'' An example of the kind of 'expanded' idiom of interest, taken from the chapter, is ''restore some political coals to Newcastle.'' Fifty-five idioms were used, all of which occur in the Collins COBUILD Dictionary of Idioms (Collins, 2002). Minugh found that - at least for the fifty-five idioms considered in the study - idiom expansion is much less common than previous studies seemed to have indicated.
''NP-internal functions and extended use of the 'type' nouns kind, sort and type: towards a comprehensive, corpus based description'' (Liesbeth De Smedt, Lieselotte Brems and Kristin Davidse) begins with a brief review of work on type noun functions from the 1930s to the present, before going on to identify six categories of type noun (head, modifier, postdeterminer, qualifying, discourse marker and quotational). These six categories were identified using the previous literature on type nouns, and also on the basis of a close analysis of corpus evidence. The final part of the paper consists of an analysis of the frequency of the six categories of type noun in two corpora: the Times newspaper section of the COBUILD Corpus (a formal written register) and the Bergen Corpus of London Teenage Language (the COLT corpus) (an informal spoken register). The results of this analysis showed that type nouns from the newspaper corpus were primarily NP-internal and concerned with classification, whereas in the teenagers' speech corpus, the use of type nouns as adverbial qualifiers was much more common.
Part 3: Second Language Acquisition, Parallel Corpora and Specialist Corpora
''Student writing of research articles in a foreign language: metacognition and corpora'' (Francesca Bianchi and Roberto Pazzaglia) describes the creation of a corpus of published papers in the area of experimental psychology, designed for the purpose of teaching Italian undergraduate students how to write research articles.
''The structure of corpora in SLA research'' (Ron Cowan and Michael Leeser) identifies the characteristics that a corpus should have in order to be useful for studying SLA (Second Language Acquisition). This focus can be contrasted with that of the previous chapter, which was primarily concerned with the development and use of corpora for teaching a second language. It is suggested that a useful corpus should cover a diversity of subjects (that is, topics) in the second language, and several levels of proficiency, in order to track systematic differences in the development of the second language. The construction of a small corpus of writing by Spanish students of different levels of proficiency enrolled in an English language class at the University of Illinois is also described. The corpus was used to track those errors that remained common even for those students who had achieved a good proficiency in English.
''The path from learner corpus analysis to language pedagogy: some neglected issues'' (Nadja Nesselhauf) stresses the difficulties involved in moving from corpus studies that identify the difficulties faced by L2 learners to pedagogical policy. The corpus used was derived from the German subcorpus of ICLE (containing argumentative and descriptive essays by German native speaking advanced students of English) and consisted of 150,000 words in total. Nesselhauf focused on a limited number of collocations and found that ''the collocations that the learners produced are frequently not unacceptable per se but rather are existing English collocations used inappropriately.'' The final section of the paper considers how best to use these findings in a realistic teaching setting, where there are competing demands on classroom time.
''Exploiting the Corpus of East-African English'' (Josef Schmied) explores this English as a second language corpus (part of the International Corpus of English, henceforth ICE-EA) and suggests a number of research questions that the corpus may be used to address. Examples include assessing the lexical complexity of the ICE-EA corpus compared to other ESL corpora, and assessing the syntactic complexity of the ICE-EA corpus compared to other English as a second language corpora (and also to native speaker English).
''Transitive verb plus reflexive pronoun/personal pronoun patterns in English and Japanese: using a Japanese-English parallel corpus'' (Makoto Shimizu and Masaki Murata) falls into three sections. The first section describes the general area of English/Japanese parallel corpora, along with a list of corpora currently available. In section two the authors explore the use of reflexive and personal pronouns with transitive verbs, and find that personal pronouns are much more common than reflexive pronouns. Section three considers the differences between English and Japanese in their use of reflexive and personal pronouns. The Context Sensitive and Tagged Parallel Corpus (which consists of parallel English/Japanese newspaper articles) is used throughout the work.
''The retrieval of false anglicisms in newspaper texts'' (Cristiano Furiassi and Knut Hofland) describes a method for identifying 'false anglicisms' in newspaper text. False anglicisms are roughly defined as words or phrases that look like English, but are not part of the English language (the authors give the example of 'autostop' as an Italian false anglicism for hitchhiking). The corpus used was constructed from Italian newspaper text (La Stampa, La Repubblica and Il Corriere della Sera) and consists of 19.5 million tokens. Computational linguistic techniques were used to identify false anglicisms, but automated methods alone did not prove sufficient, and human post-processing was required in order to eliminate noise.
''Lexical semantics for software requirements engineering - a corpus based approach'' (Kerstin Lindmark, Johan Natt och Dag, and Caroline Willners) describes the use of corpus linguistic techniques for analyzing software requirements. The authors first identify keywords characteristic of the requirements domain using the WordSmith toolkit (Scott, 2004) and a corpus constructed from 1,932 requirement texts in English. The BNC Sampler was used as a reference corpus. That is, in order to identify keywords in the software requirement domain, the WordSmith toolkit was used to pick out those words that occur more frequently (at a statistically significant level) in software requirements compared to a more general corpus of English (the BNC Sampler). In addition to identifying domain-specific keywords, an attempt was made at constructing a WordNet for the domain (that is, a lexical database specifying synonyms and part/whole relationships) using simple pattern matching techniques in conjunction with the extracted keywords.
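Keyword identification of this kind compares a word's frequency in the study corpus with its frequency in a reference corpus and keeps the words that are significantly over-represented. The review does not say which statistic the authors used, so the sketch below assumes the log-likelihood (G2) score, a common choice for keyness, with the 3.84 cut-off corresponding to p < 0.05 at one degree of freedom. The word counts are invented for illustration, and this is not WordSmith's implementation.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 keyness score for a word occurring a times in a study corpus
    of c tokens and b times in a reference corpus of d tokens."""
    total = a + b
    expected_study = c * total / (c + d)
    expected_ref = d * total / (c + d)
    score = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study, reference, threshold=3.84):
    """Words relatively more frequent in `study` than in `reference`,
    with a G2 score above `threshold`, highest score first."""
    c, d = sum(study.values()), sum(reference.values())
    result = []
    for word, a in study.items():
        b = reference[word]  # Counter returns 0 for unseen words
        score = log_likelihood(a, b, c, d)
        if a / c > b / d and score > threshold:
            result.append((word, score))
    return sorted(result, key=lambda item: -item[1])


# Hypothetical counts: a tiny "requirements" corpus and a general reference.
study = Counter({"shall": 50, "system": 40, "the": 100, "user": 10})
reference = Counter({"shall": 5, "system": 10, "the": 1000, "user": 5, "house": 980})

key = keywords(study, reference)
print([word for word, _ in key])
```

In this toy data ''the'' has the same relative frequency in both corpora, so it receives a score of zero and is not treated as a keyword, however frequent it is.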
EVALUATION
This edited volume of papers from the twenty-fifth ICAME conference is focused on (primarily English language) corpus linguistics. The first section of the book (subtitled ''Overviewing 25 years of corpus linguistic studies'') serves as an introduction to, and history of, the field, with each article authored by an influential researcher. Section two of the book is concerned with descriptive studies of syntax and semantics, historically a core area of corpus linguistics. The eight papers in this section present a representative sample of current work in descriptive corpus linguistics by well known researchers in the field. Section three is titled ''Second language acquisition, parallel corpora and specialist corpora,'' although most of the papers focus on the use of corpora in the context of studying second language acquisition, or the development of corpus-based pedagogical tools for the teaching of second languages. The volume covers a great deal of ground, from the description of new software tools for corpus linguistics (Mark Davies' chapter on the development of a joint BNC/WordNet database) to a study of transitive verbs based on parallel corpora (Makoto Shimizu and Masaki Murata's chapter on English/Japanese parallel corpora), and succeeds both in providing an overview of the development of the discipline and in presenting state-of-the-art research.
It is, however, worth mentioning some minor shortcomings of the book. First, there are some typographical errors, although these are not serious enough to compromise understanding. Second, the division of the papers into three main sections does pose some difficulties. While the first and second sections (dealing with the development of corpus linguistics over the past 25 years and descriptive corpus linguistics, respectively) are unproblematic, the third section, ''Second Language Acquisition, parallel corpora and specialist corpora,'' does not seem to have a unifying theme. This is, however, acknowledged in the editor's introduction and can equally well be seen in a positive light, reflecting the diversity of modern corpus research.
REFERENCES
Collins (2002). _Collins COBUILD Dictionary of Idioms_. London.
Facchinetti, R. & Rissanen, M. (2006). _Corpus-based Studies of Diachronic English_. Bern: Peter Lang Publishing.
Fellbaum, C. (1998). _WordNet: An Electronic Lexical Database_. Cambridge: MIT Press.
Scott, M. (2004). _WordSmith Tools_. Oxford: Oxford University Press.
ABOUT THE REVIEWER
Mike Conway is a research fellow at the National Institute of Informatics, Tokyo.