
Review of Doing Corpus Linguistics

Reviewer: Mariana España-Rivera
Book Title: Doing Corpus Linguistics
Book Author: William J. Crawford Eniko Csomay
Publisher: Routledge (Taylor and Francis)
Linguistic Field(s): Applied Linguistics
Text/Corpus Linguistics
Issue Number: 27.5068

Reviews Editor: Robert A. Cote


The book ''Doing Corpus Linguistics'' (DCL) by William J. Crawford and Eniko Csomay offers a practical hands-on introduction to the growing field of Corpus Linguistics (CL). Intended as an introductory guide for university-level students in Applied Linguistics, it briefly explains how to carry out a complete corpus-based project from a framework of Register Analysis (Biber and Conrad 2009; p.16ff).

Corpus Linguistics is an emerging, cross-disciplinary area of study in which qualitative and quantitative approaches meet. Concerned with understanding how people use language in various contexts, it uses a «corpus», i.e., a collection of written or spoken texts that are analysed collectively to make statements about the use of language (p.6). While a prescriptive approach to language studies traditionally focuses on providing guidelines or rules to dictate language use, CL aims to provide the linguistic researcher with the methodologies and digital tools needed for a descriptive approach, which is useful for ''uncovering'' naturally occurring language patterns and enables the researcher to evaluate ''how prescriptive rules are followed by language users'' (p.5).

To gain practical skills through register analysis and help students with handling online corpora, the authors provide many different clearly formulated problems with practice-oriented solutions. To follow the exercises one should register with COCA (Corpus of Contemporary American English).

The book is divided into three Parts and nine chapters. Part 1: ''Introduction to Doing Corpus Linguistics and Register Analysis''; Part 2: ''Searches in Available Corpora'', and Part 3: ''Building Your Own Corpus, Analyzing Your Quantitative Results, and Making Sense of Data''.
Part 1 contains two chapters: the first presents basic concepts of “Linguistics, Corpus Linguistics, and Language Variation” (Chapter 1), and the second explains the relevant aspects of “Register Analysis” (Chapter 2). Part 2 comprises two chapters that introduce the essentials of searching existing corpora (Chapter 3 “Searching a Corpus”) and provide examples of “Projects Using Publicly Available Corpora” (Chapter 4). Part 3 is divided into five chapters. They illustrate how to build a corpus (Chapter 5 “Building Your Own Corpus”), introduce basic concepts of statistical analysis (Chapter 6 “Basic Statistics” and Chapter 7 “Statistical Tests”), and provide guidelines for carrying out a research project (Chapter 8 “Doing Corpus Linguistics”). The closing chapter offers ideas on how to develop a deeper understanding of the topic (Chapter 9 “A Way Forward”).

The book includes a Preface, Acknowledgments and a list of Tables and Figures. Bibliographical references are listed at the end of each chapter. The book closes with an index of names and relevant terms.

Chapter 1, “Linguistics, Corpus Linguistics, and Language Variation”, focuses on the study of natural language data, i.e., language as it is used in different contexts and produced for purposes ''other than linguistic investigation'' (p.8). In this section, the authors introduce some key concepts of CL: Language variation, Collocation and Frequency. According to Biber, Conrad, and Reppen (1998; p.8), corpus research is characterised by the following elements: it is empirical and it utilises a large and principled collection of natural texts which are analysed by means of automatic and interactive techniques utilising quantitative and qualitative analytical techniques. Additionally, Tognini-Bonelli (2001; p.9) distinguished between ''corpus-based'' and ''corpus-driven'' research; the former based on already-identified language features, the latter being based on extracting lexical patterns from the corpus. She also refers to the ''vertical-analysis'' that a software program performs on a corpus, which locates many examples of a particular language feature instead of reading them 'horizontally' or from start to finish, as a human brain would do (p.9-11).

Chapter 2, “Register Analysis”, illustrates the seven ''contextual variables'' identified by Biber and Conrad (2009; p.17). When describing language from the perspective of register analysis, we take into account the social and cultural environment implicit in the real-world usage of the speech community. These variables are: 1. Participants; 2. Relations between participants; 3. Channel, i.e., Mode (written, oral) and Medium (permanence of language); 4. Production Circumstances (process or degree of planning); 5. Setting (time and place of the communicative event); 6. Communicative Purposes, which include the degree of factuality or the degree to which a personal or subjective attitude about the topic is expressed (e.g., to inform, to persuade or just to interact and share thoughts, ideas or feelings); 7. Topic, a ''broad situational variable'' (p.20) that can also have an impact on the linguistic characteristics and should not be confused with ''communicative purpose'' (p.17-19).

Chapter 3, “Searching a Corpus”, covers the basics of language units and search tools needed to start exploring available online corpora. The four most common lexical units that we identify in CL research are:

1. Keyword in Context (KWIC) searches allow us to look for an individual word or a group of pre-selected keywords. By default, a specialised software program will give us the frequency of a word in the corpus and across registers, together with the corresponding average. The results are usually displayed as highlighted concordance lines, so we can analyse the patterns or speech categories surrounding the keyword.

2. Collocates. Firth (1951; p.40) coined the term «collocation» to refer to two-word combinations. We often find them in partially or fully fixed expressions (e.g., «strong tea rather than *powerful tea», p.40).

3. N-Grams are sequences of co-occurring words, where the n-value denotes how many words there are in a unit (p.41). Some Four-Grams are also called «lexical bundles» (Biber et al., 1999; p.49). On the basis of their specific function, frequency and dispersion across different registers, they can be studied as a unit (e.g., «in the case of»), and their position in the structure of discourse has recently been studied (Csomay 2013; p.49).

4. POS-Tags or ''Part of Speech-Tags''. A corpus can be either tagged for part of speech or not. If it is tagged, POS-tags can help us search, independently of the actual words, for their associated grammatical patterns or for co-occurring grammatical patterns (p.53).
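
The first and third of these units can be illustrated with a minimal Python sketch. It is not taken from the book and does not reproduce how COCA or AntConc work internally; the tokenising rule, the function names and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return concordance lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample text, used only to exercise the functions.
sample = ("In the case of strong tea, the case of the missing cup is closed. "
          "In the case of doubt, ask.")
tokens = tokenize(sample)

for left, key, right in kwic(tokens, "tea", window=2):
    print(f"{left:>12} [{key}] {right}")

# The most frequent four-word sequence in the sample.
print(ngrams(tokens, 4).most_common(1))
```

A real concordancer adds sorting, register filters and dispersion measures, but the core idea is the same: locate every hit and display its immediate context.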

Chapter 4, “Projects Using Publicly Available Corpora”, introduces some corpora developed by Mark Davies at Brigham Young University (BYU). Using these corpora is free of charge; however, registration is mandatory, and some restrictions may apply (e.g., on the number of queries). Through a common web portal we can access different corpora that share the same graphical interface and, in part, allow cross-corpus comparison searches.

The BYU project collects corpora of different varieties of English: British (British National Corpus: BYU-BNC), American (Corpus of Contemporary American English: COCA), Canadian (Strathy Corpus), and other English varieties around the world (Global Web-Based English: GloWbE); for diachronic studies we can access historical texts from the early nineteenth century (Corpus of Historical American English: COHA) and texts from the early twentieth century (TIME Magazine Corpus) (p.58-59).

Through twelve tasks of two different types, Word- and Phrase-Based projects and Grammar-Based projects, we explore corpora and gain hands-on experience in conducting corpus research, interpreting data and presenting results.

To summarise the findings in terms of working with different corpora and to compare distributional patterns of language features across corpora or particular registers, it is important to note that each corpus has its own key characteristics in terms of size, number of registers (single/multiple), situational variables, and time or period. To overcome the size differences of corpora we use normalized counts as standard (p.59).
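
The idea behind normalized counts can be sketched in a few lines of Python. The basis of one million words and all the counts below are assumptions chosen for illustration; they are not figures from the book or from any actual corpus.

```python
def normalized_frequency(raw_count, corpus_size, basis=1_000_000):
    """Convert a raw count into a rate per `basis` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * basis / corpus_size

# Hypothetical counts of the same feature in two corpora of different size.
small_corpus = normalized_frequency(150, 500_000)
large_corpus = normalized_frequency(2_000, 10_000_000)

print(small_corpus, large_corpus)
```

Although the larger corpus has the higher raw count in this invented example, the feature is relatively more frequent in the smaller one, which is exactly the kind of distortion normalization is meant to remove.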

With regard to Register Analysis, we need to consider the contextual factors, which are central when analysing features within a particular register, because situational characteristics can differ substantially even within the same register. For example, the BYU-BNC takes its spoken data from oral histories, meetings, lectures, and doctor-patient interactions, which provide distinctive features of ''interactional types of discourse'', whereas the COCA examples, taken from television, radio news, and information shows, provide language features more closely related to ''informational types of discourse'' (cf. Biber 1995; p.59). Thus, the data obtained will differ in their situational variables, which can significantly impact the results of qualitative analyses.

Part 3, “Building Your Own Corpus, Analyzing Your Quantitative Results, and Making Sense of Data”, introduces smaller, less representative, specialised corpora specifically designed to address narrower research topics. As a rule, we assume that from these corpora, we can only draw conclusions for our own dataset, so we rarely extrapolate our results (p.79).

Depending on how specific the research question is, we may need to build our own corpus. A practical guide for doing this is outlined in Chapter 5, “Building Your Own Corpus”. The first step is to clarify potential copyright issues related to the selection, compilation, and storage of digital texts (p.76). By then, we will have clearly identified the topic of research and framed it within a (set of) research question(s) or ''hypothesis''. This is crucial since the interpretation of the results largely depends on how clearly and concisely the research question(s) are formulated (p.76-79).

For a variety of reasons, a corpus research project will take ''a good deal of time commitment'' (p.76). Key aspects to consider when building a corpus are as follows: LOCATE enough suitable texts that share the selection criteria or ''variables''. For the sake of frequency comparisons, if a corpus includes SUB-CORPORA, it should be BALANCED, that is, the sub-corpora ought to be of equal size in terms of number of texts, total word count, and text types (p.79-80). PREPARE the material by saving it in a plain text format and removing all ''meta-data'' so that concordance software can easily identify textual patterns. NAME your files with a coding scheme that allows you to identify each as being part of a larger group. ADDITIONAL DATA that are not part of the text analysis but are still relevant to the qualitative analysis (e.g., number of words) can be added in the header of the text in angled brackets (< >); the program will then ignore them (p.81-84).
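
As a rough sketch of the PREPARE, NAME and BALANCED steps, the following Python fragment strips angle-bracket headers and tallies the size of each sub-corpus. The file-naming scheme (a sub-corpus code before an underscore), the header fields and the sample texts are hypothetical and are not the conventions prescribed in the book.

```python
import re

def strip_header_tags(text):
    """Remove anything enclosed in angle brackets, e.g. <author=A>."""
    return re.sub(r"<[^>]*>", "", text).strip()

def word_count(text):
    """Count whitespace-separated words after header tags are removed."""
    return len(strip_header_tags(text).split())

def subcorpus_sizes(files):
    """Map each sub-corpus code to (number of texts, total words).

    `files` maps names such as 'bio_001.txt' to raw text; the part of the
    name before the underscore is treated as the sub-corpus code.
    """
    sizes = {}
    for name, text in files.items():
        code = name.split("_")[0]
        texts, words = sizes.get(code, (0, 0))
        sizes[code] = (texts + 1, words + word_count(text))
    return sizes

# Invented mini-corpus with two sub-corpora.
files = {
    "bio_001.txt": "<author=A> <words=5> Cells divide in two stages",
    "bio_002.txt": "<author=B> Enzymes speed up reactions",
    "his_001.txt": "<author=C> The treaty ended the war",
}

for code, (texts, words) in subcorpus_sizes(files).items():
    print(code, texts, words)
```

In this toy corpus the two sub-corpora are clearly unbalanced, so more history texts would have to be located before frequency comparisons become meaningful.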

Of the software programs available for CL research, the authors briefly describe two. One is AntWordProfiler, which comprises a Vocabulary Profile, a File Viewer, and an Editor Tool and generates vocabulary statistics and frequency information; it comes with no corpus of texts preloaded, so you must upload your own. The other is AntConc, a concordance program for lexical and grammatical analysis. Both were developed by Laurence Anthony of Waseda University, Japan, and are available as freeware, easy-to-install, multi-platform tools (Anthony 2014).

In Corpus Studies, quantitative analysis relies on statistical measures. Under the conditions of experimental design, we can test our hypotheses and obtain quantitative measures of how frequently a particular language feature occurs in a particular dataset. The following chapters cover the basics of descriptive statistics and show how data already collected can be tested with so-called parametric and non-parametric tests with regard to Variance and Correlation (p.105ff).

Chapter 6, “Basic Statistics”, introduces the basic terminology used in every statistical analysis: types of Variables, Functions, Scales and Values, and explains their meaning with a number of practical examples. Chapter 7, “Statistical Tests”, explains some statistical methods that have proven useful and are frequently used in linguistic analysis: One-Way and Two-Way Analysis of Variance (ANOVA), Chi-Square Tests of Frequency Tables, and (Pearson) Correlation. Always keeping the target audience in mind, the authors provide specific examples of how to apply these statistical tests to some real-life case studies of linguistic data and how to interpret the results.
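
To give a flavour of one of these tests, here is a minimal Python sketch of the Pearson correlation coefficient computed from its textbook definition. The two variables and their values are invented for illustration and do not come from the book's case studies.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for five texts on two linguistic features.
feature_a = [1, 2, 3, 4, 5]
feature_b = [2, 4, 5, 4, 5]

r = pearson_r(feature_a, feature_b)
print(round(r, 3), round(r ** 2, 3))
```

Squaring r gives the proportion of variance the two variables share, which is the quantity usually reported as R² for a simple correlation.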

Both chapters provide detailed step-by-step instructions that guide the reader through the various statistical procedures, first by means of manual calculations, so that the reader understands how a statistical software package performs them. Finally, we are introduced to the basics of working with SPSS (Statistical Package for the Social Sciences) and learn how to organise and enter data as well as present tables of descriptive statistics (p.109-116).
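
A manual calculation of this kind can be mirrored in a short Python sketch. The example below computes a chi-square statistic for a frequency table step by step; the table is invented (occurrences versus non-occurrences of a feature in two registers) and is not one of the book's examples.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical raw counts: rows are registers, columns are
# (feature present, feature absent).
observed = [[30, 70],
            [50, 50]]

statistic, df = chi_square(observed)
print(round(statistic, 3), df)
```

With one degree of freedom the critical value at the 0.05 level is 3.841, so a statistic above that threshold would lead us to reject the hypothesis that the feature is distributed independently of register. Note that the test is applied to raw counts, not to normalized frequencies.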

Chapter 8, “Doing Corpus Linguistics”, is dedicated to explaining how to put a register analysis framework into practice, following either a corpus-based or a corpus-driven approach, to arrive at a functional interpretation of the results (p.151). The chapter closes with practical guidance on how to prepare a written report of research results (p.152-155).

Chapter 9, “A Way Forward”, briefly summarises the key strengths and weaknesses of corpus-based and corpus-driven studies. With regard to the latter, the authors emphasise the increasing need for corpus researchers with ''computational and statistical skills to carry out more in-depth analyses'' (p.156). This is evident from the fact that, when conducting an in-depth analysis, we still need to look for different word types ourselves (e.g., 'concrete' or 'abstract' nouns; p.156), as tagged corpora usually include only basic grammatical categories, and the output of existing tagging software still needs to be improved manually.

As far as corpus-driven register studies are concerned, the authors refer to a ''multi dimensional analytical framework'' as a more suitable model for those striving for a more comprehensive analysis (e.g., to describe language variation across register). Developed by Biber (1988; p.157), this methodology enables us to investigate different types of texts from different registers by means of measuring co-occurring linguistic features through more sophisticated, multivariate statistical methodologies. In this way, we can gain a better insight into language variation across registers (e.g., to identify dimensions of linguistic variation) or reach comprehensive linguistic descriptions of linguistic variation in already-identified dimensions (e.g., to study variation in the context of specialised language domains, p.158) (cf. Multidimensional analysis, Loewen & Plonsky, p.119-120).


DCL offers a very practical introduction and is clearly aimed at students who want to learn how to build their own corpus-based project. Parts 1 and 2 are a concise yet comprehensive introduction to what CL is, enabling readers to gain their first experience of searching corpora using a corpus approach to language variation. Building on this, Part 3 addresses the basic technical and statistical aspects involved in every corpus research project. In accordance with the premise of learning-by-doing, DCL presents a concise guide on how to do it, including how to choose a research topic (p.76ff) and how to formulate research questions in terms of hypotheses for statistical tests (p.106ff).

Aspiring corpus linguists will need to be familiar with the basics of statistics and the preferred statistical methodologies of the discipline. A minor criticism of Chapters 6 and 7 is that the complexity of the subject is such that it is impossible to offer a comprehensive overview in an introductory handbook about CL; therefore, the theoretical explanations of statistical concepts remain superficial, and the use of symbols can sometimes be a challenge to beginners (e.g., the symbol R² appears on p.117, but it is not explained until p.125). One suggestion for improving the text is to list the abbreviations, symbols, and statistical terms used in a tabular appendix separate from the overall index.

Another criticism concerns the choice of SPSS, which was selected for its user-friendliness. Because of its costly licence, however, it is almost impossible to have it installed on a private laptop. Leading researchers are already working with free software, particularly R, and free software should not be avoided just because it requires a certain level of programming skills.

Linguistics as a science is currently utilising quantitative methodologies, which are enabling it to develop as a discipline and bringing it in line with sociology and psychology. Surely this is in part due to CL research. This does not mean that an introspective view of language will lose its validity, for ''introspection is irreplaceable in the descriptive documentation of language'' (Janda, 2013:6; Leech, 2011). As the authors stress, beyond these examples corpora can have many different applications, and corpus techniques are currently applied in a wide range of subdomains in applied linguistics, including sociolinguistics, second language acquisition, psycholinguistics, and translation studies (cf. Corpus, Loewen & Plonsky, p.36-37).

Unfortunately, there is still too little information about the positive impact that working with CL can have, and it is not even considered a subject in the linguistic curricula of (German) universities. This applies in particular to the Romance language departments. However, learning from its practical applications up to the point where corpus-based methodologies can be directly or indirectly utilised can only be beneficial to the professional perspective. Especially in the era of Big Data and its increasing complexity, CL offers the tools that will become indispensable to a solid linguistic education: there is no escaping this fact. In this regard, DCL offers a very valuable and inspiring point of departure.


Anthony, Laurence. 2014. AntConc (Version 3.4.3m) [Computer Software]. Tokyo, Japan: Waseda University. Available from

Anthony, Laurence. 2014. AntWordProfiler (Version [Computer Software]. Tokyo, Japan: Waseda University. Available from

Davies, Mark. 2008-. The Corpus of Contemporary American English: 520 million words, 1990-present. Available online at

Gries, Stefan Thomas. 2008. Statistik für Sprachwissenschaftler. Göttingen: Vandenhoeck & Ruprecht.

Janda, Laura A. 2013. Quantitative methods in Cognitive Linguistics: An introduction. In Cognitive linguistics: The quantitative turn. The essential Reader. L. A. Janda (ed), 1-32. Germany: De Gruyter.

Leech, Geoffrey. 2011. Principles and applications of Corpus Linguistics. In Perspectives on Corpus Linguistics (Studies in Corpus Linguistics 48). V. Viana, S. Zyngier & G. Barnbrook (eds), 155-170. Amsterdam/Philadelphia: John Benjamins.

Loewen, Shawn & Plonsky, Luke. 2016. An A – Z of Applied Linguistics Research Methods. UK: Palgrave.

McEnery, Tony & Hardie, Andrew. 2012. Corpus linguistics: method, theory and practice. Cambridge & New York: Cambridge University Press.

Mariana España-Rivera is a lecturer at the Department of Romance Languages and Literatures at the University of Marburg (Germany). She earned an M.A. in Romance Linguistics, Musicology and European & Latin American Art History from the University of Heidelberg. Her teaching and research interests include Applied Linguistics, Historical Linguistics and Latin American Cultural Studies. She is currently building her own corpus of academic Spanish written by German students for research purposes.

Format: Hardback
ISBN-13: 9781138024601
Prices: U.S. $ 140.00
Format: Paperback
ISBN-13: 9781138024618
Pages: 164
Prices: U.S. $ 44.95