This book "asserts that the origin and spread of languages must be examined primarily through the time-tested techniques of linguistic analysis, rather than those of evolutionary biology" and "defends traditional practices in historical linguistics while remaining open to new techniques, including computational methods" and "will appeal to readers interested in world history and world geography."
Review of The Routledge Handbook of Corpus Linguistics
EDITORS: O'Keeffe, Anne; McCarthy, Michael
TITLE: The Routledge Handbook of Corpus Linguistics
SERIES: Routledge Handbooks in Applied Linguistics
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2010
Kornel Bangha, Vantage Linguistics, PA, USA
SUMMARY
The Routledge Handbook of Corpus Linguistics (RHCL) provides an overview of corpus linguistics (CL), as a resource for advanced undergraduates and postgraduates. The book contains 45 contributions from 54 authors divided into eight major sections. Each contribution is divided into five sub-parts, followed by further readings and references.
In Section I, the first contribution, by the editors, presents the evolution of corpora from their historical origins (starting with the earliest Bible concordances) up to their various types and uses in modern-day applications. Elena Tognini Bonelli's contribution offers an overview of the evolution of corpus linguistics. It describes a shift of focus in linguistics from a data-driven approach to one based on intuition and introspection, and then back to a data-driven approach. She explains that a corpus is fundamentally different from a text because the former, unlike the latter, brings together many different texts and therefore cannot be identified with a unique and coherent communicative event. She concludes that, in Saussurean terminology, a text is an instance of ‘parole’ while the patterns uncovered in corpus evidence yield insight into ‘langue’. Finally, the chapter presents a corpus typology originally proposed in the course of an EU project.
Section II, Building and designing a corpus: what are the key considerations?, starts with Randi Reppen's description of key considerations: the chapter covers the basics, the kind and size of data to collect, how to collect texts, how much mark-up is needed, and finally a look to the future. Svenja Adolphs and Dawn Knight write about the process of building a spoken corpus: corpus design; metadata collection (citing Burnard (2005: 31), who states that 'without metadata the investigator has nothing but disconnected words of unknowable provenance or authenticity'); the transcription of spoken data, including the issue that spoken interaction is multi-modal in nature and involves prosodic, gestural and environmental elements as well; and the analysis of spoken corpora. Mike Nelson discusses the process of building a written corpus: what this process entails; how a corpus should be planned; sampling, balancing and representativeness; gathering, organizing and annotating texts. Almut Koester starts the next contribution with arguments in favor of small specialized corpora: based on Carter and McCarthy (1995), she argues that grammatical items are sufficiently frequent to be reliably studied using a relatively small corpus, that a smaller data-set is more manageable, and that the link between corpus and context is closer in smaller corpora. The author then discusses how small and how specialized corpora should or could be, noting that spoken corpora tend to be smaller than written ones; considerations in the design of small corpora and issues of compilation and transcription are also discussed. Brian Clancy discusses how to build a corpus to represent a variety of a language. He starts by examining what a variety of a language is, continues with issues such as size, diversity, representativeness and balance, and finally presents two case studies of a language variety. Paul Thompson is interested in building a specialized audio-visual corpus. First, he presents the characteristics of such corpora and argues that fine-grained corpus annotation is the most useful. Then he describes the major steps in the building process: data collection (consent, location, equipment, skills...), transcription, annotation, assembly and analysis.
The first contribution of Section III (Analysing a corpus -- What are the basics?) was written by David Y. W. Lee. He offers a non-exhaustive overview of the currently available ready-made corpora: general and specialized; spoken, written or both; in English and in other languages. Jane Evison covers the basics of analyzing a corpus: how to manipulate and exploit word frequency lists, key word lists and concordance lines. She states that corpora are useful not in themselves but through the analysis and manipulation of the data they contain. Mike Scott describes what corpus software in general and WordSmith (his own software) in particular can do. He starts by explaining what computers are good at, what they are bad at, and why; then he addresses some issues of re-formatting and re-organizing data; finally he briefly describes how to process concordances, wordlists and key word lists. Susan Hunston is interested in the exploration of patterns in a corpus: what patterns are, why they are difficult to identify, how to find them in concordance lines, and finally how to assess their frequency. Christopher Tribble's contribution describes concordances. It starts with a clear definition: a concordance is a collection of occurrences of a word-form, each in its own textual environment... (Sinclair 1991: 32). Both historical concordances (like Becket's) and modern ones are covered in the paper, including tools (like WordSmith Tools) and methods: working with lemmas, sorting and sampling, and restricted searches, to name just a few. Xiaofei Lu studies what corpus software can reveal about language development. The author first defines what language development is and presents the three most influential approaches to it: rationalist, empiricist and pragmatist. He also describes how to measure language development, and discusses how a corpus can be used to learn more about first and second language development.
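For readers new to the basic operations named in this section, the following is a minimal Python sketch of a word frequency list and of concordance lines in the sense of the definition quoted above. It is an illustration only, not the method of WordSmith Tools or of any contributor; the function names, the naive tokenization rule, the context width and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, node, width=3):
    """Return one (left context, node, right context) triple per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, invented for this illustration.
sample = "The corpus is large. A corpus is not a text, and a text is not a corpus."
tokens = tokenize(sample)

# Print the concordance lines with the node word aligned in the middle.
for left, node, right in concordance(tokens, "corpus"):
    print(f"{left:>20} | {node} | {right}")
```

Sorting the resulting triples by their left or right context would give the kind of sorted concordance display mentioned in Tribble's chapter.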
Section IV (Using a corpus for language research) starts with Rosamund Moon's contribution, 'What can a corpus tell us about lexis?'. She examines questions such as how many words make up the main vocabulary of a language, what we can learn about a word from the words with which it co-occurs, how far the meanings of words are derived from context, how different senses and uses of words are distinguished in context, how corpora can help in studying synonyms, and what we can learn about lexis from a spoken corpus. Chris Greaves and Martin Warren study what a corpus can tell us about multi-word units. They discuss what multi-word units are, including not only n-grams but also discontinuous units, and why and how they are important. Susan Conrad studies what a corpus can show about grammar, switching the focus from what is acceptable or unacceptable to the choices speakers actually make. Douglas Biber's paper covers what a corpus can indicate about registers and genres. First, a distinction is established between the genre perspective and the register perspective; then various aspects of register variation are presented; and finally corpus-based genre studies are briefly discussed. Michael Handford studies corpora of specialized genres. He mentions several criticisms of corpus linguistics and presents a rationale for specialist corpora and the genre approach. Corpus study in academic, professional and non-institutional genres is also examined. Scott Thornbury discusses what a corpus can reveal about discourse, what the limitations are and how to overcome them, how a corpus-based approach works in practice, and what kind of data it requires. Christoph Rühlemann's contribution investigates what corpora can tell us about pragmatics: after discussing the restrictions this implies, he examines various pragmatic phenomena. Thuc Anh Vo and Ronald Carter study what a corpus can reveal about creativity. The authors discuss the concept of creativity, how it is related to corpora, what corpora can reveal about it, spoken and written aspects of creativity, and finally some other manifestations of creativity found in corpora.
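The distinction between n-grams and discontinuous multi-word units mentioned above for Greaves and Warren's chapter can be made concrete with a short Python sketch. This is a simplified illustration and not the authors' own procedure; the function names, the gap limit and the sample phrase are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gapped_pairs(tokens, max_gap=2):
    """Count ordered word pairs separated by 1 to max_gap intervening tokens."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        for gap in range(1, max_gap + 1):
            j = i + gap + 1
            if j < len(tokens):
                pairs[(first, tokens[j])] += 1
    return pairs

# Hypothetical sample, invented for this illustration.
tokens = "play a role in and play a key role in".split()

bigrams = Counter(ngrams(tokens, 2))
print(bigrams.most_common(3))

# 'play ... role' is missed by contiguous bigrams but found as a gapped pair,
# whether one word ('a') or two words ('a key') intervene.
print(gapped_pairs(tokens)[("play", "role")])
```

Contiguous counts alone treat 'play a role' and 'play a key role' as unrelated sequences, which is one reason discontinuous units need separate treatment.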
Winnie Cheng wrote the first contribution in Section V (Using a corpus for language pedagogy and methodology), addressing the role of corpora in language teaching. Following Johns (1991: 30), the author emphasizes the importance of data-driven learning (DDL) and illustrates how corpora can be used by students, teachers and even editors of grammars. The contribution written by Steve Walsh covers how corpora can be exploited in creating language teaching materials. Corpus-based materials for teaching speaking, listening, reading and writing are discussed, and the merits of learner corpora (over invented textbook dialogues, for instance) are explored in detail. Angela Chambers writes about data-driven learning. Her paper covers a brief history of DDL, how it can be used and how it changes language pedagogy. Gaëtanelle Gilquin and Sylviane Granger discuss the possible applications of DDL: its advantages (like bringing authenticity and providing corrective functions), the resources it requires (a corpus and tools to exploit the corpus) and the activities it involves. Their contribution also covers the problems and limitations of DDL; when it comes to evaluation, they admit with remarkable honesty that the claims about the effectiveness of DDL are largely an act of faith. Passapong Sripicharn is interested in preparing learners for using language corpora. The author covers topics such as assessing students' knowledge of corpora and their objectives, preparing learners to build and use corpora, familiarizing them with different tools and interpreting results.
Section VI (Designing corpus-based materials for the language classroom) starts with the contribution of Martha Jones and Philip Durrant about corpora and vocabulary teaching materials. The importance of vocabulary (including lexicalized phrasal units), the type of corpus suitable for academic vocabulary learning and the design are among the topics discussed in the paper. Rebecca Hughes writes about corpora and grammar teaching materials: the role of corpora, their benefits (e.g. providing evidence of frequency, encouraging more autonomous learning), their limitations and their future development. Jeanne McCarten's contribution is about corpus-informed course book design. She suggests useful considerations in choosing a corpus, and discusses areas of the course book that a corpus can inform, the use of corpus data in course books and the future of corpus-informed course books. She mentions some existing examples, such as the Collins COBUILD English Grammar, but also admits that the actual use of corpora in this field is rather limited. Elisabeth Walter discusses the use of corpora in dictionary writing: the reasons to use corpora, their size and content, and the analysis tools for lexicographers. She also illustrates how to use a corpus, paying special attention to learner corpora, and concludes with current limitations and future developments. Lynne Flowerdew reviews recent corpus applications to various aspects of writing, covering English for General Academic Purposes and English for Specific Academic Purposes, and then discusses issues in the application of corpora and possible future extensions. Averil Coxhead is interested in the relationship between corpora and English for Academic Purposes (EAP). Five questions are addressed: what can corpora reveal about aspects of academic language in use; how can corpora influence EAP pedagogy; how can corpora be used in EAP materials; what can a corpus tell us about EAP learner language; and what might the future be for corpora in EAP? Elaine Vaughan's contribution is about using corpora for teachers' own research. She mentions reasons for doing so and the issues involved; she also discusses the use of corpora inside and outside the classroom.
Section VII covers the topic of using corpora to study literature and translation. Marie-Madeleine Kennig describes parallel and comparable corpora. She explains what they are, mentions some existing ones, and discusses their compilation and use. The contribution by Natalie Küber and Guy Aston is about using corpora in translation: the purposes, processes and types of corpora used. They also cover special issues such as the translator's need to take into account the reader's knowledge. Dan McIntyre and Brian Walker are interested in the use of corpora to study the language of poetry and drama. They illustrate the use of corpora through case studies of poems by William Blake and of some blockbusters. Carolina P. Amador-Moreno investigates the use of corpora to explore literary speech representation. She discusses similarities and differences between real and fictional speech and how to use corpora to compare them, and includes a case study of an Irish novel, concluding with thoughts on the limitations of using corpora to study speech representation.
Gisle Andersen's contribution about how to use corpus linguistics in sociolinguistics begins Section VIII (Applying corpus linguistics to other areas of research). The author discusses advantages and limitations, proposes a few rules of thumb, provides examples of corpus-based sociolinguistic studies and considers possible future developments. Kieran O'Halloran writes about the use of corpus linguistics in the study of media discourse. The author discusses the corpus-based approach to Critical Discourse Analysis and presents a case study of a British newspaper. Janet Cotterill is interested in the use of corpus linguistics in forensic linguistics. She presents various characteristics of a forensic corpus, discusses some major tasks (like identifying or eliminating authorship) and concludes with some limitations and future challenges. Annelie Ädel covers corpus linguistics and political discourse. She explains what political discourse is, its relationship to corpora and techniques for exploring it, gives examples of topics, and concludes with reflections on possible future developments. Sarah Atkins and Kevin Harvey write about the use of corpora in the study of health communication. They explain the importance of studying healthcare communication, present some corpus-based studies, and describe the creation of a specialized corpus (related to adolescent health) and the use of this corpus to explore patterns. Fiona Farr presents the use of corpora in teacher education. She shows how CL reinforces current approaches and practices in Language Teacher Education, and discusses three types of relevant corpora (corpora of classroom language, learner corpora and pedagogic corpora), the use of corpora for developing language awareness skills and finally the use of specialized corpora. Fiona Barker's contribution discusses corpus-informed language testing. She describes language testing, provides examples of corpora developed for this purpose, and discusses the use of both learner corpora and native speaker corpora.
EVALUATION
The RHCL covers an impressively large and very specific set of topics related to corpus linguistics. Readers interested in these topics will probably find what they are looking for either in the corresponding contribution or in the list of publications included in further readings and references. Surprisingly, there is no contribution dealing directly with the use of corpus linguistics in Natural Language Processing or computational linguistics.
Cooperation between so many contributors (54!) could be realized in two completely different ways. The book could have aimed simply to be a collection of autonomous contributions (independent papers). Alternatively, it could have aimed to be a coherent, unified work. In the first case, each author would have defined key notions independently of the others, while in the second, key notions would have been agreed upon and defined once and for all (ideally when they first appear). Unfortunately, this book falls somewhere in between: cross-references are frequent but do not create coherence. For example, as early as page 16, ''balance'' and ''representativeness'' are used without being defined. The index entry for ''balance'' refers to pages 86-87 and 60. Those pages do discuss the notion of balance but do not offer any formal definition. Fortunately, representativeness fares better: the index also points to pages 86-87, where we find a definition from Leech (1991: 27). Koester also invokes representativeness (p. 69) but points to Reppen instead. Evison proposes her own definition of keyness on p. 127. These are just a few examples of redundant, discrepant or missing definitions. Readers might appreciate a glossary at the end of the volume; let us hope that one will be included in a future edition.
Most but not all of the corpora discussed in the RHCL are in English. This, however, is unlikely to be a shortcoming: it probably reflects the predominance of English in existing corpora.
It is worth noting that many of the contributors found Mike Scott’s WordSmith software very useful. Linguists interested in CL might want to give it a try and assess whether it also fits their needs.
In scientific papers, one expects an abstract, an introduction and a conclusion, all of which make a paper easier to understand. Unfortunately, they are generally lacking in this volume and are only partially or informally present in a few cases (e.g. O'Halloran).
More than half of the contributions (23 of 45) contain the word ''can'' in the title. For example, Coxhead addresses five questions: four contain ''can'', the fifth ''might''; ''can'' appears six times on the first page and a half of Küber and Aston, etc. Readers might wonder why there is so much to say about what can be done compared to what has been done. Does it mean that the authors focused more on potential than on accomplishment? Or is this an indication that the Golden Age of corpus linguistics is yet to come? The papers give the clear impression that corpus linguistics can achieve more than it has already achieved. Also, many of the contributions end with a section titled 'Looking to the future' or something similar. This is another indication that contributors believe that corpus linguistics has more to offer.
A future edition would benefit from reconsidering some of the issues raised in this evaluation: regrouping some of the minor topics and including other major ones; improving the book's coherence; adding abstracts/introductions/conclusions; putting greater emphasis on what corpus linguistics has actually accomplished.
REFERENCES
Burnard, L. (2005) 'Developing Linguistic Corpora: Metadata for Corpus Work', in M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books, pp. 30-46.
Carter, R. and McCarthy, M. (1995) 'Grammar and the Spoken Language', Applied Linguistics 16(2): 141-58.
Johns, T. (1991) 'From Printout to Handout: Grammar and Vocabulary Teaching in the Context of Data-driven learning', English Language Research Journal 4: 27-45.
Leech, G. (1991) 'The State of the Art in Corpus Linguistics', in K. Aijmer and B. Altenberg (eds) English Corpus Linguistics. London: Longman, pp. 8-30.
Sinclair, J.M. (1991) Corpus, Concordance and Collocation. Oxford: Oxford University Press.
ABOUT THE REVIEWER
Kornel Bangha studied linguistics in Paris and Montreal. He was a
post-doctoral research fellow at INRIA, France. Since 2005, he has been
working for software companies in Canada and the USA. His main expertise is
linguistic data curation for software development.