LINGUIST List 23.336

Wed Jan 18 2012

Review: Applied Ling.; Text/Corpus Ling.: O'Keeffe & McCarthy (2010)

Editor for this issue: Joseph Salmons

Date: 18-Jan-2012
From: Kornel Bangha
Subject: The Routledge Handbook of Corpus Linguistics

EDITORS: O'Keeffe, Anne; McCarthy, Michael
TITLE: The Routledge Handbook of Corpus Linguistics
SERIES: Routledge Handbooks in Applied Linguistics
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2010

Kornel Bangha, Vantage Linguistics, PA, USA

SUMMARY

The Routledge Handbook of Corpus Linguistics (RHCL) provides an overview of corpus linguistics (CL), as a resource for advanced undergraduates and postgraduates. The book contains 45 contributions from 54 authors divided into eight major sections. Each contribution is divided into five sub-parts, followed by further readings and references.

In Section I, the first contribution, by the editors, presents the evolution of corpora from their historical origins (starting with the earliest Bible concordances) up to their various types and uses in modern-day applications. Elena Tognini Bonelli's contribution offers an overview of the evolution of corpus linguistics. It describes a shift of focus in linguistics from a data-driven approach to an approach based on intuition and introspection -- and back again to a data-driven approach. She explains that a corpus is fundamentally different from a text because the former, unlike the latter, brings together many different texts and therefore cannot be identified with a unique and coherent communicative event. She concludes that, using Saussurian terminology, a text is an instance of 'parole' while the patterns uncovered in corpus evidence yield insight into 'langue'. Finally, the chapter presents a corpus typology, originally proposed in the course of an EU project.

Section II (Building and designing a corpus: what are the key considerations?) starts with Randi Reppen's description of key considerations: the chapter covers the basics, the kind and size of data to collect, how to collect texts, how much mark-up is needed, and finally a look to the future. Svenja Adolphs and Dawn Knight write about the process of building a spoken corpus: corpus design; metadata collection (citing Burnard (2005: 31), who states that 'without metadata the investigator has nothing but disconnected words of unknowable provenance or authenticity'); the transcription of spoken data and the issue of spoken interaction being multi-modal in nature, including prosodic, gestural and environmental elements as well; and the analysis of spoken corpora. Mike Nelson discusses the process of building a written corpus: what this process entails; how a corpus should be planned; sampling, balancing and representativeness; gathering, organizing and annotating texts. Almut Koester starts the next contribution with arguments in favor of small specialized corpora: based on Carter and McCarthy (1995), Koester argues that grammatical items are sufficiently frequent to be reliably studied using a relatively small corpus, that a smaller data-set is more manageable, and also that there is a closer link between the corpus and the context in the case of smaller corpora. The author then discusses how small and how specialized corpora should or could be, noting that spoken corpora tend to be smaller than written ones. Some considerations in the design of small corpora follow, and issues of compilation and transcription are also discussed. Brian Clancy discusses how to build a corpus to represent a variety of a language. He starts by examining what a variety of language is; he continues with issues like size, diversity, representativeness and balance; finally he presents two case studies of a language variety. Paul Thompson is interested in building a specialized audio-visual corpus. First, he presents the characteristics of such corpora and argues that fine-grained corpus annotation is the most useful. Then he describes the major steps in the building process: data collection (consent, location, equipment, skills...), transcription, annotation, assembly and analysis.

The first contribution of Section III (Analysing a corpus -- What are the basics?) was written by David Y. W. Lee. He offers a non-exhaustive overview of the currently available ready-made corpora: general and specialized; spoken, written or both; both in English and in other languages. Jane Evison covers the basics of analyzing a corpus: how to manipulate and exploit word frequency lists, key word lists and concordance lines. She states that corpora are useful not in themselves but through the analysis and manipulation of the data they contain. Mike Scott describes what corpus software in general and WordSmith (his own software) in particular can do. He starts by explaining what computers are good at, what they are bad at, and why; then he addresses some issues of re-formatting and re-organizing data; finally he briefly describes how to process concordances, wordlists and key word lists. Susan Hunston is interested in the exploration of patterns in a corpus: what patterns are, why they are difficult to identify, how to find them in concordance lines, and finally how to assess their frequency. Christopher Tribble's contribution describes concordances. It starts with a clear definition: a concordance is a collection of occurrences of a word-form, each in its own textual environment... (Sinclair 1991: 32). Both historical concordances (like Becket's) and modern ones are covered in the paper, including tools (like WordSmith Tools) and methods: working with lemmas, sorting and sampling, restricted searches, just to name a few. Xiaofei Lu studies what corpus software can reveal about language development. The author first defines what language development is and presents the three most influential approaches to it: rationalist, empiricist and pragmatist. He also describes how to measure language development, and discusses how a corpus can be used to learn more about first and second language development.
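
To make the notions of a word frequency list and a concordance concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names `tokenize` and `kwic`, the naive whitespace tokenization and the sample sentences are hypothetical and are not taken from the RHCL or from WordSmith Tools.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace, strip punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def kwic(tokens, node, width=3):
    """Return one (left context, node, right context) triple per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, invented for this illustration.
sample = "A corpus is a collection of texts. A small corpus can still be a useful corpus."
tokens = tokenize(sample)

# Word frequency list: each word-form with its number of occurrences.
frequencies = Counter(tokens)

# Concordance: every occurrence of "corpus" with two words of context on each side.
concordance = kwic(tokens, "corpus", width=2)

for left, node, right in concordance:
    print(f"{left:>20}  {node}  {right}")
```

Real corpus tools add much more (lemmatization, sorting, sampling, restricted searches), but the core idea of listing each occurrence of a word-form in its own textual environment is already visible here.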

Section IV (Using a corpus for language research) starts with Rosamund Moon's contribution, 'What can a corpus tell us about lexis?'. She examines questions like how many words comprise the main vocabulary of a language, what we can learn about a word from looking at the words with which it co-occurs, how far the meanings of words are derived from context, how different senses and uses of words are distinguished in context, how corpora can help in studying synonyms, and what we can learn about lexis from a spoken corpus. Chris Greaves and Martin Warren study what a corpus can tell us about multi-word units. They discuss what multi-word units are, including not only n-grams but also discontinuous units, and why and how they are important. Susan Conrad studies what a corpus can show about grammar, switching the focus from acceptable versus unacceptable forms to the choices speakers actually make. Douglas Biber's paper covers what a corpus can indicate about registers and genres. First, a distinction is established between the genre perspective and the register perspective, then various aspects of register variation are presented and finally corpus-based genre studies are briefly discussed. Michael Handford studies corpora of specialized genres. He mentions several criticisms of corpus linguistics and presents a rationale for specialist corpora and the genre approach. Corpus study in academic genres, professional genres and non-institutional genres is also examined. Scott Thornbury discusses what a corpus can reveal about discourse, what the limitations are and how to overcome them, how a corpus-based approach works in practice and what kind of data is needed for this. Christoph Rühlemann's contribution investigates what corpora can tell us about pragmatics: after discussing what restrictions it implies, he discusses various pragmatic phenomena. Thuc Anh Vo and Ronald Carter study what a corpus can reveal about creativity. The authors discuss the concept of creativity, how it is related to corpora, what corpora can reveal about it, spoken and written aspects of creativity, and finally some other manifestations of creativity found in corpora.

Winnie Cheng wrote the first contribution in Section V (Using a corpus for language pedagogy and methodology), addressing the role of corpora in language teaching. Following Johns (1991: 30), the author emphasizes the importance of data-driven learning (DDL) and illustrates how corpora can be used by students, teachers and even editors of grammars. The contribution written by Steve Walsh covers how corpora can be exploited in creating language teaching materials. Corpus-based materials to teach speaking, listening, reading and writing are discussed, and the merits of learner corpora, for instance over invented textbook dialogues, are explored in detail. Angela Chambers writes about data-driven learning. Her paper covers a brief history of DDL, how it can be used and how it changes language pedagogy. Gaëtanelle Gilquin and Sylviane Granger discuss the possible applications of DDL: its advantages (like bringing authenticity and providing corrective functions), the resources it requires (a corpus and tools to exploit the corpus), and the activities it involves. Their contribution also covers the problems and limitations of DDL; when it comes to evaluation, they admit with remarkable honesty that the claims about the effectiveness of DDL are largely an act of faith. Passapong Sripicharn is interested in preparing learners for using language corpora. The author covers topics like assessing students' knowledge of corpora and their objectives, preparing learners to build and use corpora, familiarizing them with different tools and interpreting results.

Section VI (Designing corpus-based materials for the language classroom) starts with the contribution of Martha Jones and Philip Durrant about corpora and vocabulary teaching materials. The importance of vocabulary (including lexicalized phrasal units), the type of corpus suitable for academic vocabulary learning and the design are among the topics discussed in the paper. Rebecca Hughes writes about corpora and grammar teaching materials: the role of corpora, their benefits (e.g. providing evidence of frequency, encouraging more autonomous learning), their limitations and their future development. Jeanne McCarten's contribution is about corpus-informed course book design. She suggests useful considerations in choosing a corpus, and discusses areas of the course book where a corpus can inform, the use of corpus data in course books and the future of corpus-informed course books. She mentions some existing examples, like the Collins COBUILD English Grammar, but also admits that the actual use of corpora in this field is rather limited. Elizabeth Walter discusses the use of corpora in dictionary writing: the reasons to use corpora, their size and their content, and the analysis tools for lexicographers. She also illustrates how to use a corpus, paying special attention to learner corpora, and concludes with current limitations and future developments. Lynne Flowerdew reviews recent corpus applications to various aspects of writing, covering English for General Academic Purposes and English for Specific Academic Purposes, and then discusses the issues in the application of corpora and possible future expansions and extensions. Averil Coxhead is interested in the relationship between corpora and English for Academic Purposes (EAP). Five questions are addressed: what can corpora reveal about aspects of academic language in use; how can corpora influence EAP pedagogy; how can corpora be used in EAP materials; what can a corpus tell us about EAP learner language; and what might the future be for corpora in EAP? Elaine Vaughan's contribution is about using corpora for teachers' own research. She mentions reasons to do that and issues in doing so; she also discusses the use of corpora inside and outside the classroom.

Section VII covers the topic of using corpora to study literature and translation. Marie-Madeleine Kenning describes parallel and comparable corpora. She explains what they are, mentions some existing ones, and discusses their compilation and use. The contribution by Natalie Kübler and Guy Aston is about using corpora in translation: the purposes, processes and types of corpora involved. They also cover special issues like the translator's need to take into account the reader's knowledge. Dan McIntyre and Brian Walker are interested in the use of corpora to study the language of poetry and drama. They illustrate the use of corpora through case studies of poems by William Blake and some blockbusters. Carolina P. Amador-Moreno investigates the use of corpora to explore literary speech representation. She discusses similarities and differences between real and fictional speech and how to use corpora to compare them, includes a case study of an Irish novel, and concludes with thoughts on the limitations of using corpora to study speech representation.

Gisle Andersen's contribution about how to use corpus linguistics in sociolinguistics begins Section VIII (Applying corpus linguistics to other areas of research). The author discusses advantages and limitations, proposes a few rules of thumb, provides examples of corpus-based sociolinguistic studies and considers possible future developments. Kieran O'Halloran writes about the use of corpus linguistics in the study of media discourse. The author discusses the corpus-based approach to Critical Discourse Analysis and presents a case study of a British newspaper. Janet Cotterill is interested in the use of corpus linguistics in forensic linguistics. She presents various characteristics of a forensic corpus, discusses some major tasks (like identifying or eliminating authorship) and concludes with some limitations and future challenges. Annelie Ädel covers corpus linguistics and political discourse. She presents what political discourse is, its relationship to corpora and techniques for exploring it, gives examples of topics, and concludes with reflections on possible future developments. Sarah Atkins and Kevin Harvey write about the use of corpora in the study of health communication. They explain the importance of studying healthcare communication, present some corpus-based studies, and describe the creation of a specialized corpus (related to adolescent health) and the use of this corpus to explore patterns. Fiona Farr presents the use of corpora in teacher education. She shows how CL reinforces current approaches and practices in Language Teacher Education, and discusses three types of relevant corpora (corpora of classroom language, learner corpora and pedagogic corpora), the use of corpora for developing language awareness skills and finally the use of specialized corpora. Fiona Barker's contribution discusses corpus-informed language testing. She describes language testing, provides examples of corpora developed for this purpose, and discusses the use of both learner corpora and native speaker corpora.

EVALUATION

The RHCL covers an impressively large and very specific set of topics related to corpus linguistics. Readers interested in these topics will probably find what they are looking for either in the corresponding contribution or in the list of publications included in further readings and references. Surprisingly, there is no contribution dealing directly with the use of corpus linguistics in Natural Language Processing or computational linguistics.

Cooperation between so many contributors (54!) could be realized in two completely different ways. The volume could have aimed simply to be a collection of autonomous contributions (independent papers). Alternatively, it could have aimed to be a coherent, unified work. In the first case, each author would have defined key notions on their own, independently of the others, while in the second, key notions would have been agreed upon and defined once and for all (ideally when they first appear). Unfortunately, this book falls somewhere in between: cross-references are frequent but do not create coherence. For example, as early as page 16, "balance" and "representativeness" are used without being defined. The index entry for "balance" refers to pages 86-87 and 60. Those pages do discuss the notion of balance but do not offer any formal definition. Fortunately, representativeness fares better: the index also points to pages 86-87, where we find a definition from Leech (1991: 27). Koester also evokes representativeness (p. 69), pointing to (Reppen) instead. Evison proposes her own definition of keyness on p. 127. These are just a few examples of redundant, discrepant or missing definitions. Readers might appreciate a glossary at the end of the volume; let us hope that one will be included in a future edition.

Most but not all of the corpora discussed in the RHCL are in English. This, however, is unlikely to be a shortcoming: it probably reflects the predominance of English in existing corpora.

It is worth noting that many of the contributors found Mike Scott's WordSmith software very useful. Linguists interested in CL might want to give it a try and assess whether it also fits their needs.

In scientific papers, one expects an abstract, introduction and conclusion, which all make a paper easier to understand. Unfortunately, these are generally lacking in this volume; they are partially or informally present in only a few cases (e.g. O'Halloran).

More than half of the contributions (23 of 45) contain the word "can" in the title. For example, Coxhead addresses five questions: four contain "can", the fifth "might"; "can" appears six times in the first page and a half in Kübler and Aston, etc. Readers might wonder why there is so much to say about what can be done compared to what has been done. Does it mean that the authors focused more on potential than on accomplishment? Or is this an indication that the Golden Age of corpus linguistics is yet to come? The papers give the clear impression that corpus linguistics can achieve more than it has already achieved. Also, many of the contributions end with a section titled 'Looking to the future' or something similar. This is another indication that contributors believe that corpus linguistics has more to offer.

A future edition would benefit from reconsidering some of the issues raised in this evaluation: regrouping some of the minor topics and including other major ones; improving the book's coherence; adding abstracts/introductions/conclusions; putting greater emphasis on what corpus linguistics has actually accomplished.

REFERENCES

Burnard, L. (2005) 'Developing Linguistic Corpora: Metadata for Corpus Work', in M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books, pp. 30-46.

Carter, R. and McCarthy, M. (1995) 'Grammar and the Spoken Language', Applied Linguistics 16(2): 141-58.

Johns, T. (1991) 'From Printout to Handout: Grammar and Vocabulary Teaching in the Context of Data-driven Learning', English Language Research Journal 4: 27-45.

Leech, G. (1991) 'The State of the Art in Corpus Linguistics', in K. Aijmer and B. Altenberg (eds) English Corpus Linguistics. London: Longman, pp. 8-30.

Sinclair, J.M. (1991) Corpus, Concordance and Collocation. Oxford: Oxford University Press.

ABOUT THE REVIEWER

Kornel Bangha studied linguistics in Paris and Montreal. He was a post-doctoral research fellow at INRIA, France. Since 2005, he has been working for software companies in Canada and the USA. His main expertise is linguistic data curation for software development.

Page Updated: 18-Jan-2012