Review of Corpora and Discourse Studies

Reviewer: Sibo Chen
Book Title: Corpora and Discourse Studies
Book Author: Paul Baker, Tony McEnery
Publisher: Palgrave Macmillan
Linguistic Field(s): Discourse Analysis; Text/Corpus Linguistics
Subject Language(s): English
Issue Number: 27.1506

The development of corpus linguistics has had a remarkable impact on discourse analysis: with the growing availability of large collections of texts and computational methods, discourse analysts today can use millions of words as research data to investigate linguistic variation as well as the hidden messages underlying discursive representations. Edited by Paul Baker and Tony McEnery, ‘Corpora and Discourse Studies’ addresses current trends in corpus-driven discourse analysis by presenting 13 independent studies that pay particular attention to the adoption of qualitative and quantitative corpus methods. The main focus of this collection is the contribution of corpora to revealing patterns in a wide range of written, spoken, multimodal, and electronic discourses.

The book’s 14 chapters can be divided into four parts: Chapter 1 introduces the synergy of corpora and discourse studies as well as the key research and debates in this emerging field; then, Chapters 2-5 address modes of discourse and how newer or under-researched forms of text can be examined through corpus-driven methods; Chapters 6-10 consider discourse from the perspective of social practice, showcasing corpus-driven studies on environment, health, academic, and news discourses; finally, Chapters 11-14 broadly take a critical discourse analysis stance, and studies presented here are concerned with discourse as a means of constructing social identities and ideologies.

In the first part, Chapter 1 serves as an introduction, outlines the volume, and introduces the key debates in the quickly developing field of corpus-driven discourse analysis. The chapter begins by reviewing some basic concepts of corpus linguistics and discourse analysis, followed by a brief yet informative summary of the synergy of the two fields. A major advantage of corpus-driven approaches, as the authors argue, is that “sampling and balance which underline corpus building help to guard against cherry-picking […] and to avoid over-focusing on atypical aspects of our texts” (p.5). Yet the use of corpus methods for discourse analysis has also met a number of challenges that prompt ongoing discussion among its proponents. First, although corpus methods allow researchers to take a relatively ‘naïve’ perspective on discursive data, they can only limit the extent of potential biases, and we therefore need to be cautious about the inherent limits of corpus linguistics (e.g. its emphasis on statistical difference rather than similarity). Second, findings obtained through corpus linguistics are often challenged by the ‘so what’ question, which drives researchers to explore the subtle and insidious ways in which obvious discursive patterns are realized. Last but not least, there is an ongoing debate regarding copyright and ethics in how discourse analysts should treat online data. The chapter ends with an overview of the following chapters.

The second part (Chapters 2-5) offers more details about applying corpus methods to under-researched forms of text. Chapter 2 explores modal verb usage in the Cambridge and Nottingham e-Language Corpus and finds that the presence of modal verbs in online discourse is related to the extent to which a text is meant to communicate to a wider audience, since online discourse, despite its immediacy, still lacks the contextual cues available in face-to-face communication. Chapter 3 considers the integration of non-linguistic data into corpus analysis. By examining how location alters spoken language use during a series of art gallery visits, the chapter demonstrates the potential and complexity of considering dynamic contexts during corpus compilation. In Chapter 4, two popular types of multimodal text, film and television, are investigated, with the main focus on how visual and verbal representations are combined to create meanings on screen. Chapter 5 presents an analysis of the discourse marker ‘actually’, focusing on its complex usage and multifunctional nature in naturally occurring speech. By attending to the various meanings of ‘actually’, this analysis demonstrates the advantage of corpus linguistics in processing large corpora as well as the necessity of going beyond the concordance line when analyzing subtle meanings within discourse.

In addition to the exploration of under-researched discourses, corpus methods can also add new insights into social practices within particular discourse communities. The third part (Chapters 6-10) is dedicated to this analytical perspective. This part begins with Chapter 6’s investigation of environmental discourse in American presidential speeches between 1961 and 2013. By addressing collocates of the term ‘environment’ and related concepts such as ‘protect’, ‘energy’, and ‘clean’, the chapter shows how environmental concepts shift along with changing socio-political contexts as well as the ideologies of those in power. Chapter 7 offers a discussion of health discourse by examining discussions of anorexia in two online forums. While the study illustrates how particular discursive representations of eating disorders can (dis)empower forum participants, it also highlights the digitalization of health information and the potential of corpus linguistics to illuminate the implications of online health discourses. Chapter 8 applies Biber’s (1988) multi-dimensional approach to the study of academic discourse produced by undergraduate and graduate students. In line with the findings of traditional rhetorical/genre analysis, this study demonstrates how students’ discursive performance can vary according to the time spent in the academic community and their topics of study. Yet the study reaches this conclusion by systematically addressing the subtle changes (e.g. nominalization, wh-clauses, modal verb usage) underlying students’ academic progress, which adds valuable details to current discussions on academic discourse. Chapter 9 introduces several facets of discourse representation in a corpus of early modern English news reports. Based on a hand-categorized dataset, the chapter compares early modern English and present-day English by analyzing the presentation of speech, writing, and thought in early English news reporting. A particularly interesting finding in this chapter is how early journalists’ fictionalization of people’s thoughts laid the foundation for making dramatic effect a core news value in modern news production. Chapter 10 attempts to draw connections between corpus linguistics and other social research methods. Through a multi-perspectival analysis (Candlin & Crichton, 2011) of creative practice in tertiary art and design education, the chapter illustrates how corpus linguistics can serve as a valuable component in a comprehensive research project that involves various methods and a wide range of texts. A valuable suggestion offered in the chapter is that corpus-based findings need to be aligned with findings obtained through other research methods as well as with the socio-cultural context in which a specific social practice takes place.

The fourth part (Chapters 11-14) consists of studies that offer insights into corpus-driven critical discourse analysis (CDA), especially how corpus methods counter the “cherry-picking” argument (Widdowson, 2004) made by CDA critics. In Chapter 11, the representation of ‘the Arab world’ in a number of English language newspapers is examined, and the analysis shows how this phrase has frequently been used to cast the Arab world as a passive audience/recipient, reactive to outside stimuli. The constraint of grammatical agency is then linked to negativity and prejudice in news reporting, and the chapter shows how these two factors function as news values. Shifting the analytical focus toward natural online discourse, Chapter 12 analyzes a corpus of tweets on the controversial TV documentary series ‘Benefits Street’. The investigation of the corpus’ keywords identifies three key storylines (the ‘idle poor’ discourse, the ‘poor as victim’ discourse, and the ‘rich get richer’ discourse). The chapter ends by discussing how online media such as Twitter alter the ways discourses are articulated and circulated, as well as the potential of corpus methods to capture these changes. Chapter 13 offers a very interesting literary critique of ‘Harry Potter’, in which words related to male and female body parts are investigated through corpus methods. The chapter presents a compelling picture of how female characters are “generally presented as physically deficient in comparison with males and their inability to cope with physical situation is seen as a liability in terms of plot” (p. 282). Although the same conclusion could be reached through traditional literary critique, the chapter provides more details regarding how corpus methods are able to delineate the subtle and insidious discursive strategies underlying problematic representations. Finally, Chapter 14 demonstrates how semantic tagging can be a useful technique for examining social identity construction in news reports.


As Baker and McEnery write in Chapter 1, the primary goal of this volume is to offer a range of contemporary perspectives on the synergy of discourse studies and corpus methods, thereby serving as illuminating reading on current and prospective practice in this evolving field. Given the diverse research topics and corpus methods covered in the volume, it certainly delivers on this promise. Although the volume’s content is not easily understood by readers without sufficient background in corpus linguistics, it succeeds in presenting that content in an informative and thought-provoking way.

For researchers, the book is an up-to-date summary of corpus-driven discourse analysis, and one may find the thorough methodological sections of each chapter helpful in guiding one’s own research. In addition, the book offers a well-balanced presentation of different conceptualizations of discourse by covering studies on linguistic variation (which broadly consider discourse as ‘language-in-use’) as well as those on social representation (which broadly approach discourse from a Foucauldian perspective). Throughout the volume, there are many interesting and useful discussions of particular challenges raised by the integration of corpus methods and discourse studies. Chapters 3, 4, 8, 12, 13, and 14 are particularly impressive since they not only illustrate the great potential of corpus methods in detecting less-obvious linguistic patterns, but also offer informative reflections on moving corpus analysis toward a more interdisciplinary, multimodal, and systemic horizon.

Taken together, the book is coherent and well-edited. It is difficult to point out noticeable shortcomings, and scholars of corpus linguistics, discourse analysis, and other related disciplines will find this book valuable reading with important insights for future research practices.


Sibo Chen is a PhD candidate in the School of Communication, Simon Fraser University. He received his MA in Applied Linguistics from the Department of Linguistics, University of Victoria, Canada. His major research interests include language and communication, critical discourse analysis, and genre theories.

