Review of Corpus-Based Sociolinguistics

Reviewer: Irene Checa-Garcia
Book Title: Corpus-Based Sociolinguistics
Book Author: Eric Friginal, Jack A. Hardy
Publisher: Routledge (Taylor and Francis)
Linguistic Field(s): Sociolinguistics; Text/Corpus Linguistics
Issue Number: 26.63

Review's Editor: Helen Aristar-Dry


As the subtitle indicates, “Corpus-Based Sociolinguistics: A Guide for Students” is intended to be a student book. In the preface the authors define their target audience by stating that “this book can effectively guide student-researchers in upper-level undergraduate and graduate courses in sociolinguistics” (p. xv). The increasing number of studies in sociolinguistics adopting a Corpus Linguistics (CL) approach and the solid quantitative and empirical methodology that such an approach offers inspired the authors to write a manual on the use of CL applied to sociolinguistic inquiries.

The book consists of three blocks. Block A introduces both sociolinguistics and CL (main goals, methodologies, and a brief history), with somewhat more focus on CL. Block B presents an overview of work in several popular areas of sociolinguistic research. Block C discusses practical methodological issues for the researcher who wants to use a corpus for a sociolinguistic inquiry.

All chapters contain sections named “Reflective break” with two or more stimulating questions that require the application of and reflection upon the content previously presented, often along with suggestions for further research. Most chapters include examples or summaries of papers dealing with a sociolinguistic issue in the CL-oriented method discussed in that chapter. In addition, maps, histograms, tables summarizing results, tagging examples, etc., help illustrate the content provided throughout the book. Interviews with well-known CL and sociolinguistics researchers are included in some chapters: Grieve (B-1), Tagliamonte (C-1), and Biber (C-2).

The first block of the book devotes one chapter to introducing sociolinguistics, three chapters to introducing CL, and one final chapter to discussing CL's application to sociolinguistics. The chapter on sociolinguistics briefly defines the discipline and summarizes, in one or two paragraphs each, five main sociolinguistic approaches: ethnography of communication, interactional sociolinguistics, conversation analysis, experimental sociolinguistics, and variationist sociolinguistics. In addition, the authors group sociolinguistic enterprises into two categories, quantitative and qualitative, giving examples of works in each with emphasis on the type of questions answered. The following chapter defines CL and offers a brief history of the field, from early applications in dictionary making to the newest collections of electronic megacorpora. Different kinds of corpora (specialized vs. general; spoken vs. written), their sizes, and how they are typically collected are discussed in the next chapter. This chapter also introduces basic concepts in CL research, such as normalized frequency, n-grams, lexical bundles, keywords, and tagging, among others, which are mentioned again in later sections of the book. The chapter closes with a review of some of the software available for searching corpora for instances of these concepts, although software popular among corpus linguists and sociolinguists, such as VARBRUL (Cedergren and Sankoff 1974) and the R language (Gries 2009), is not reviewed. Next, the authors present the notion of the representativeness of a corpus, in terms of target population, variety of registers, and linguistic diversity. They recommend creating a corpus matrix and offer examples. In addition, they suggest resorting to ethnographic/qualitative studies of the target community to help design such matrices more accurately. All these topics are revisited in the third block in a more detailed and hands-on manner.
Block A ends with a discussion of the limitations and future directions of corpus approaches to sociolinguistics, preceded by a sketch of how CL has been used for some sociolinguistic topics. The limitations of such an approach are of two types: limitations in how corpora encode social variables and account for them in sampling, and the impossibility of applying this methodology to some sociolinguistic areas, for instance language policy.
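
Two of the basic CL measures mentioned above, normalized frequency and n-grams, can be illustrated with a minimal sketch. This is not taken from the book: the function names, the naive regular-expression tokenizer, and the per-million default are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def normalized_frequency(count, corpus_size, per=1_000_000):
    # Raw count rescaled to a common base (e.g. per million words),
    # so that corpora of different sizes can be compared.
    return count * per / corpus_size


def ngrams(tokens, n):
    # All contiguous sequences of n tokens, in order of occurrence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = tokenize("The cat saw the dog, and the dog saw the cat.")
    counts = Counter(tokens)
    # Frequency of "the" per 1,000 words in this toy text.
    print(normalized_frequency(counts["the"], len(tokens), per=1_000))
    # Most frequent two-word sequences.
    print(Counter(ngrams(tokens, 2)).most_common(3))
```

Lexical bundles, in this framing, would be the n-grams that recur above some frequency threshold across many texts; the threshold is a research decision that the sketch does not make.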

The second and most extensive part of the book exemplifies the application of CL to several popular sociolinguistic areas: dialectology, studies on gender, sexuality and age, politeness and stance, workplace discourse, diachronic variation, and web registers. Each chapter focuses on one topic. It first introduces the relevant sociolinguistic notion, typically discussing some results of studies using corpora, and then offers a more detailed summary of one or two key studies on the topic. The corpora used are described in detail, and in some chapters (B-1 on dialectology, B-4 on workplace discourse, and B-5 on diachronic variation) a quite complete list of available corpora is included. Corpora are described in terms of sampling variables, collection methods, size, social variables annotated, and, where relevant, the search tools they offer for analysis. In the case of dialectology, a topic with a very rich tradition in sociolinguistics, the authors comment on older studies which led to more extensive use of corpora in the present. Another area of frequent corpus use has been gender. For stance, the limitations of CL analysis are pointed out, since corpora are not typically annotated for the prosodic and phonetic features that can mark stance. Finally, for some areas (stance, language change, web registers), analyses of word trends and content, along the lines of culturomics research rather than linguistics, are included in the review of studies.

The third and final part of the book offers practical advice and describes procedures for carrying out a sociolinguistic corpus-based study. The first chapter explains how to create a sociolinguistic corpus. After emphasizing the importance of well-defined questions that guide the sampling design and its justification, a model study is presented. Stratified random sampling is recommended, together with very general guidelines concerning statistical test assumptions and the social categories to be considered. Specific formats and format handling, along with file organization, are suggested. Finally, ethical and legal considerations from Scocco (2007) are summarized. Two types of data collection are discussed in more depth: how to create a corpus from blogs, and how to achieve a naturalistic sociolinguistic interview. The next chapter presents the Multidimensional Analysis (MD) model (Biber 1988). First, an account of the rationale of the model and its main achievements is presented. In an interview, Biber evaluates the main findings of this type of analysis as well as criticisms of it and how those could be addressed. MD detects linguistic features that tend to occur together and groups them into factors or dimensions whose relation to social variables can be tested statistically. Then the chapter explains the MD procedure. The authors warn the reader, though, that they offer only a general description of the procedure rather than detailed instructions, and they provide a bibliography for learning how to carry out this analysis precisely. The various steps are exemplified by Friginal’s 2009 work, and an interpretation of the results is provided as a means of exemplifying MD’s explanatory power. The last two chapters present additional CL methods to study variation in language, both diachronic (C-3) and synchronic (C-4).
The authors offer advice for diachronic data collection over the internet and many suggestions for research questions in this area. The last chapter describes a step-by-step procedure to determine keyness in a corpus with respect to a reference corpus using AntConc. Examples of studies using automatic taggers close the book. One uses the LIWC tagger (Linguistic Inquiry and Word Count), which tags content, grammatical, and even spoken language features. The other uses POS (Parts of Speech) tagging, whose results can be correlated to different sociolinguistic variables.
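
To give a sense of what a keyness calculation involves, the following is a minimal sketch using the log-likelihood statistic, one commonly used keyness measure. It is an assumption that this is the relevant statistic: the review does not record which measure the book's AntConc procedure relies on, and the function name and the toy counts are hypothetical.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    # Keyness of one word: compares its observed frequency in a target corpus
    # and in a reference corpus with the frequencies expected if the word
    # were equally likely in both (two-cell log-likelihood formulation).
    combined = freq_target + freq_ref
    total = size_target + size_ref
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Hypothetical counts: a word occurring 30 times in a 1,000-word target
    # corpus and 10 times in a 1,000-word reference corpus.
    print(log_likelihood(30, 1000, 10, 1000))
```

A higher score means the word's frequency differs more between the two corpora than chance alone would suggest; the score by itself does not say in which corpus the word is overused, so the two relative frequencies still need to be compared.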


“Corpus-Based Sociolinguistics” is mostly a practical introduction to the use of CL for sociolinguistic questions. However, rather than offering a step-by-step guide on how to carry out a sociolinguistic analysis using CL, this book mainly offers examples of areas of application of CL within sociolinguistics, and a relevant bibliography for exploring further how to apply CL methods. In this respect, this work can be very helpful in two ways to the new sociolinguist who wants to use CL.

First, it can serve to identify the ideal corpus for a myriad of research topics in sociolinguistics and even content analysis and culturomics, as long as the research is on the English language. There is little reference to corpora in languages other than English, and in the cases where there is, it is not very exhaustive. For instance, for Spanish there is no mention of one of the largest and widest corpora representing different varieties of Spanish, the PRESEEA (Moreno Fernández 1996). However, such a review would be beyond the scope of the book. As for the English corpora, the authors often include the links to access them online or contact information to inquire about them, as well as what software is available to analyze them. Diachronic, synchronic, electronic, and specialized corpora, among many other corpora types, are referenced.

The other aspect in which the book can be very helpful for sociolinguistic CL research is the creation of a corpus, particularly from the internet. Representativeness and feasibility are presented in a very clear and practical manner, and good models are offered and discussed. In addition, the book points to online resources such as blogs and the websites that index them, and gives advice on how to organize the corpus files. Sociolinguistic research on the new internet registers is also reviewed, which constitutes a novelty in the literature: the other general presentation of CL and sociolinguistics (Baker 2010) barely touches upon this, which is to be expected since the boom in this new research area is very recent.

On the other hand, this book cannot be taken as a guide to carrying out the different CL analyses reviewed, such as MD or LIWC. Neither the software needed nor the statistical knowledge required is explained in enough detail, as noted by the authors themselves. Instead, the book refers to relevant specialized books for those tasks. Also, despite the authors' claim in the preface, using this book as a textbook for an introductory class on sociolinguistics would benefit greatly from a supplementary sociolinguistics manual. Although the book does make a successful effort not to assume any prior sociolinguistic knowledge, and every sociolinguistic concept is discussed before studies on it are reviewed, these notional introductions are not always very clear and are often very brief, with the majority of each chapter dedicated to reviewing studies on the concepts and the corpora used.

Another area the book does not spend much time covering is multimedia corpora. This is a consequence of the authors' conceptualization of a corpus as a collection of searchable tagged texts. As a result, there is little or no review of studies concerning video or sound data, nor of software for aligning multimedia with transcription, such as ELAN.

Few books have yet been published that offer an account of how to do sociolinguistics with CL or that describe the relationship between the two disciplines. The one exception is Baker (2010), “Sociolinguistics and Corpus Linguistics”. Although, as mentioned, Baker’s book pays little attention to Computer Mediated Communication of any kind, and possibly explains fewer basic sociolinguistic and CL notions, its explanations are more in depth. Likewise, statistical procedures are explained in more detail by Baker, although arguably they are simpler (univariate) than MD’s factorial analysis. Also, Baker’s work pays more attention to the explicit discussion of the relations between sociolinguistics and CL. By contrast, Friginal and Hardy devote a smaller portion of the book to discussing this relationship (the last two sections of the final chapter in Block A). Instead, the relationship between the two disciplines arises indirectly from the review of a large number of studies that employ both. Another difference in focus is the attention paid to interactional sociolinguistics, which is explored in two chapters in Baker and only briefly discussed in the first chapter in Friginal and Hardy.

In sum, “Corpus-Based Sociolinguistics” will serve the undergraduate course on sociolinguistics if supplemented with a manual on sociolinguistic concepts; it will then constitute an original and up-to-date introduction to the discipline, as well as to CL methodology. Furthermore, it will be an even more valuable resource for the researcher new to CL who wishes to apply this methodology to quantitative sociolinguistic questions or wishes to know what sociolinguistic questions could be addressed with this methodology. Although not a manual on how to do sociolinguistics with corpus linguistics per se, it will direct the researcher to the right resources. Finally, the thought-provoking “reflective breaks” and the numerous examples of studies will stimulate younger students and make sociolinguistic research more appealing, while suggesting new research questions to more advanced students or researchers.


Baker, P. 2010. “Sociolinguistics and Corpus Linguistics”. Edinburgh: Edinburgh University Press.

Biber, D. 1988. “Variation across speech and writing”. Cambridge: Cambridge University Press.

Cedergren, H. and Sankoff, D. 1974. Variable rules: Performance as a statistical reflection of competence. “Language”, 50: 333-355.

Friginal, E. 2009. A corpus-based study of gender and age in blogs. “Language Forum”, 35 (2): 19-37.

Gries, S. T. 2009. “Quantitative corpus linguistics with R: a practical introduction”. London and New York: Routledge.

Moreno Fernández, F. 1996. Metodología del “Proyecto para el Estudio Sociolingüístico del Español de España y de América” (PRESEEA). “Lingüística” 8: 257-287.

Scocco, 2007. Copyright Law: 12 Dos and Don’ts. “DailyBlog Tips”. Available from

Irene Checa-Garcia is Assistant Professor at the University of Wyoming. She wrote her dissertation on measures of syntactic development in adolescents and the social factors influencing it. During her postdoctoral years, at the University of León and the University of California, Santa Barbara, she worked on the functional syntax of Spanish relative clauses using corpus linguistics methodology. She also works on a Conversation Analysis project on very young children's embodiment of action and on the morphosyntactic development of young Spanish-English bilinguals with and without Specific Language Impairment. Her main interests include quantitative sociolinguistics, bilingualism, grammaticalization patterns, and conversation analysis of very young children's interactions.

Format: Paperback
ISBN-13: 9780415529563
Pages: 312
Prices: U.K. £ 29.99
Format: Hardback
ISBN-13: 9780415529556
Pages: 312
Prices: U.K. £ 90.00