LINGUIST List 21.3125

Fri Jul 30 2010

Review: Sociolinguistics; Text/Corpus Linguistics: Baker (2010)

Editor for this issue: Joseph Salmons

        1.    Michael Crombach, Sociolinguistics and Corpus Linguistics

Message 1: Sociolinguistics and Corpus Linguistics
Date: 30-Jul-2010
From: Michael Crombach
Subject: Sociolinguistics and Corpus Linguistics

AUTHOR: Baker, Paul
TITLE: Sociolinguistics and Corpus Linguistics
SERIES: Edinburgh Sociolinguistics
PUBLISHER: Edinburgh University Press
YEAR: 2010

Michael Crombach, Nuance Communications Austria


As the title suggests, ''Sociolinguistics and Corpus Linguistics'' attempts to bring together what at first sight seem rather disparate approaches to language: sociolinguistics (SL) is often considered a very intuitive and practical approach to language and linguistic phenomena, while corpus linguistics (CL) has always been considered a mathematical and technical one. They are linked by statistics. Paul Baker presents an easy-to-read introduction to the methods of CL and how they can be used in SL research.

The book is organized in seven chapters, each with an introduction, 5-7 subchapters, and a conclusion. Baker presents an organizational overview of the book on pp. 28-30.

Chapter 1 ''Introduction'' (1-30) presents the various types of corpora (written, spoken, general or specialized, 12-15) and the essential methods and concepts of corpus linguistics, like ''concordance'', ''annotation'', ''frequency'', etc. Baker gives five good reasons why CL and SL can make meaningful use of each other (pp. 8-9):

1. SL and CL ''share a number of fundamental tenets of practice when it comes to linguistic analyses'';
2. both CL and SL use quantitative methodologies to identify similarities and differences;
3. SL and CL use sampling techniques to make claims about larger data;
4. ''both examine variation and change''; and finally
5. SL and CL ''attempt to provide explanations [...] for the findings that their research produces''.

Chapter 2 ''Corpora and sociolinguistic variation'' (31-56) presents the possibilities of investigating different registers (social varieties of a language) using corpus linguistic methods. In doing so Baker explains in greater detail the concept of ''frequency'', its traps, and how to avoid them: ''we have to make sure that what we *think* is being counted is actually what the computer is counting'' (44).

Chapter 3 ''Diachronic variation'' (57-80) illustrates how linguistic changes can be observed using corpora of different time depths. Baker stresses the difficulties that may arise in working with historical corpora, e.g. for orthographic reasons. He also makes clear that corpora can quickly turn into historical corpora. Take, for instance, a newspaper corpus compiled in the late 1990s and compare it with a newspaper corpus compiled, say, in the past two years; analyzing the occurrence or frequencies of certain words, e.g. *Lewinsky* and *Obama*, makes the changes obvious (my example; Baker uses an example from Smith 2002, illustrating the use of the progressive aspect with modal verbs).

Chapter 4 ''Synchronic variation'' (81-101) is dedicated to the possibilities of comparing synchronic differences, e.g. between the varieties of English spoken around the world. In discussing whether corpora can be used to compare cultures, Baker rightly mentions one of the central problems of using corpus analysis to detect differences (or similarities): what is compared is, in the first place, always the corpora themselves, and it takes diligence, precautions and appropriate methods to ensure that the similarities or differences are not accidental artifacts of the corpora being compared.

Chapter 5 ''Corpora and interpersonal communication'' (102-120) shifts attention to the value CL has for interactional linguistics (IL). Baker describes the hurdles in collecting and compiling corpora that meet the needs of IL, but with some examples he is able to illustrate the value of such analyses. Still, Baker makes clear that a lot of work remains to be done.

Chapter 6 ''Uncovering discourses'' (121-145) demonstrates how CL can be used to ''show evidence for constructed differences (e.g. men are constructed as *x*, women are constructed as *y*)'' (143).

Finally, Chapter 7 ''Conclusion'' (146-156) sums up the book and offers an outlook on future developments in CL, ''hopefully resulting in sophisticated techniques for analyzing linguistic patterns and enabling many more research questions to be asked. For example, do certain eye movements, facial expressions or gestures tend to accompany particular types of words or conversational situations?'' (156).

In addition to references and indices, the book provides appendices showing examples of the tags available in two corpus annotation systems: CLAWS (= Constituent Likelihood Automatic Word-tagging System) and USAS (= University Centre for Computer Corpus Research on Language Semantic Analysis System).


There are some minor points of critique that do not reduce the overall positive impression of this book; addressing them could add some value to it. First, the book lacks an overview of the general history of CL. Baker manages to write a book on the quantitative analysis of linguistic data without even mentioning George Kingsley Zipf (1902-1950), the pioneer of all word-counting (e.g. Zipf 1965, originally published 1935). The frequency distribution of words always follows a power law curve, commonly referred to as a ''Zipf curve'' (e.g. Ferrer i Cancho 2006, Prün 2002, 2005). Due to this lack of historical grounding, CL is presented in a way that neglects the long tradition of handling texts and corpora that has been labeled ''philology''. Baker repeatedly stresses the importance of ''concordances'' for a correct and meaningful data analysis, but he fails to mention that concordances have been a tool of classical philology and theology for centuries: for example, the Thesaurus Linguae Latinae is (more or less) a concordance compiled since 1893, and Bible concordances have been available since the Middle Ages. Only with the advent of computers and vast amounts of digitized texts have corpora and concordances become much easier to create. So, a short chapter dedicated to the history and evolution of corpus linguistics would help readers situate CL historically and methodologically. Another gap is the absence of tables, lists or synopses covering several aspects of CL. One is a list of the most important statistical tests with their pros and cons and linguistic examples of when to use which. I also would have appreciated a list of available corpora; Baker carefully introduces the Brown family of corpora (59-68, with an overview of the fields covered by these corpora in table 3.1, p. 61), but I could imagine a table that at least gives the names of available corpora, their size, their terms of use (free, non-free), their date of compilation, and where to get them.
Another helpful addition would be a list of CL tools, i.e. computer programs that help to manipulate and analyze corpora. Baker gives an overview of the dedicated ''corpus tools'' in table 1.1 (8), but this list certainly could be improved and expanded by adding, e.g., simple text-manipulation tools such as editors (other than Word), e.g. UltraEdit, and the various scripting languages, e.g. Perl, awk, etc. Finally, I had hoped to find a list of the most common or important CL formulae, e.g. for occurrences per million; Baker shows how to calculate this (20), but it would be helpful to have these formulae in a synopsis. An interesting development Baker fails to mention is the analysis of historical data done by Lieberman et al. 2007 (see also Pagel et al. 2007), showing a correlation between the frequencies of irregular verbs and their tendency to become regular. An earlier attempt to use frequency analysis in an explanatory way on historical data is Birkhan 1979. This is certainly an aspect of CL that is also interesting in an SL context, as it allows predictions of future developments, or, more precisely, estimations of the likelihood of future developments, that can be compared with the actual developments. Finally, it would have been good (and certainly a great service to the novice reader) to add further reading suggestions, especially on the statistical procedures, even if only something very general along the lines of Gries 2009.
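To make the occurrences-per-million normalization mentioned above concrete, here is a minimal sketch in Python. It is not taken from Baker's book; the function name `per_million` and the counts are hypothetical and serve only to show why raw counts from corpora of different sizes should be normalized before they are compared.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Return the normalized frequency of an item per million tokens.

    count       -- raw number of occurrences of the item
    corpus_size -- total number of tokens in the corpus
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply first so that round numbers stay exact in floating point.
    return count * 1_000_000 / corpus_size


# Hypothetical counts (an assumption for illustration, not real corpus data):
# the word occurs 150 times in a 1-million-token corpus
# and 450 times in a 5-million-token corpus.
freq_small = per_million(150, 1_000_000)
freq_large = per_million(450, 5_000_000)

print(freq_small, freq_large)
```

Although the raw count is three times higher in the larger corpus, the normalized frequency is lower there, which is exactly the kind of comparison the raw numbers would obscure.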

But these issues do not reduce my overall positive impression of the book. They are more ''this-would-be-also-nice-to-have-add-ons'' that could be considered for a future edition. Certainly the greatest achievement of this book is that it brings down barriers between linguistic approaches; the book clearly shows how CL and SL can learn from each other by improving their methods, re-evaluating findings and finally being able to better position themselves within linguistics. The book shows the possibilities and limits of both approaches and how cooperation can widen those possibilities. Another helpful aspect of this book is that it makes statistical basics easily accessible, presenting them so simply (e.g. the chi-square test, 35-36) that much of a beginner's fright should be alleviated. This, the brevity of the book, the richness of examples and samples (I especially like the CL analysis of the book's chapters 1-6 on pp. 146-147), and a readable style make it suitable as a course book for beginners.
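For readers who want to see what a chi-square test of this kind involves, the following is a minimal sketch in Python. It is not Baker's own worked example: the function name `chi_square_2x2`, the 2x2 layout (occurrences of a word versus all other tokens, in two corpora) and the counts are assumptions made for illustration, and no continuity correction is applied.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
        a = occurrences of the word in corpus A, b = all other tokens in A
        c = occurrences of the word in corpus B, d = all other tokens in B
    No continuity (Yates) correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    # Shortcut formula for 2x2 tables, equivalent to summing (O - E)^2 / E.
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: 150 hits in corpus A and 90 hits in corpus B,
# each corpus containing 1,000,000 tokens.
statistic = chi_square_2x2(150, 999_850, 90, 999_910)

# 3.841 is the standard critical value for 1 degree of freedom at p = 0.05.
print(statistic > 3.841)
```

A statistic above the critical value suggests that the difference in frequency between the two corpora is unlikely to be due to chance alone; whether the difference is linguistically meaningful is a separate question.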


Birkhan, H. 1979. Das ''Zipfsche Gesetz'', das schwache Präteritum und die germanische Lautverschiebung. Wien.

Ferrer i Cancho, R. 2006. On the universality of Zipf's law for word frequencies. In: Exact methods in the study of language and text. In honor of Gabriel Altmann. Grzybek, P. et al. (eds.), Berlin. 131-140.

Gries, S. Th. 2009. Statistics for Linguistics with R: A Practical Introduction. Berlin et al.

Lieberman, E. et al. 2007. Quantifying the evolutionary dynamics of language. Nature 449, 713-716.

Pagel, M. et al. 2007. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717-719.

Prün, C. 2002. Die linguistischen Hypothesen von G.K. Zipf. In: Korpuslinguistische Untersuchungen zur quantitativen und systemtheoretischen Linguistik. R. Köhler (ed.). Trier. 271-321.

Prün, C. 2005. Das Werk von G.K. Zipf. In: Quantitative Linguistik. R. Köhler et al. (eds.). Berlin. 142-152.

Smith, N. 2002. Ever moving on? The progressive in recent British English. In: New frontiers of corpus research. P. Peters et al. (eds.). Amsterdam. 317-330.

Zipf, G. 1965. The psycho-biology of language. Cambridge.

ABOUT THE REVIEWER Michael Crombach is a research and development engineer at Nuance Communications Austria, working on statistical language models and phonetic transcriptions for speech recognition systems. He has a background in historical linguistics (Ph.D.) and biology. His main interests are biology and evolution of language, statistics and language, and theory and history of linguistics.

Page Updated: 30-Jul-2010