From: Michael Crombach <michael.crombachgmx.at>
Subject: Sociolinguistics and Corpus Linguistics
Announced at http://linguistlist.org/issues/21/21-866.html
AUTHOR: Baker, Paul
TITLE: Sociolinguistics and Corpus Linguistics
SERIES: Edinburgh Sociolinguistics
PUBLISHER: Edinburgh University Press
YEAR: 2010
Michael Crombach, Nuance Communications Austria
As the title suggests, ''Sociolinguistics and Corpus Linguistics'' attempts to bring together what at first sight seem rather disparate approaches to language: sociolinguistics (SL) is often considered a very intuitive and practical approach to language and linguistic phenomena, while corpus linguistics (CL) has always been considered a mathematical and technical one. They are linked by statistics. Paul Baker presents an easy-to-read introduction to the methods of CL and how they can be used in SL research.
The book is organized in seven chapters, each with an introduction, 5-7 subchapters, and a conclusion. Baker presents an organizational overview of the book on pp. 28-30.
Chapter 1 ''Introduction'' (1-30) presents the various types of corpora (written, spoken, general or specialized, 12-15) and the essential methods and concepts of corpus linguistics, like ''concordance'', ''annotation'', ''frequency'', etc. Baker gives five good reasons why CL and SL can make meaningful use of each other (pp. 8-9):
1. SL and CL ''share a number of fundamental tenets of practice when it comes to linguistic analyses'',
2. both CL and SL use quantitative methodologies to identify similarities and differences,
3. SL and CL use sampling techniques to make claims about larger data,
4. ''both examine variation and change'', and finally
5. SL and CL ''attempt to provide explanations [...] for the findings that their research produces''.
Chapter 2 ''Corpora and sociolinguistic variation'' (31-56) presents the possibilities of investigating the different registers (social varieties of a language) using corpus linguistic methods. In doing so, Baker explains in greater detail the concept of ''frequency'', its traps, and how to avoid them: ''we have to make sure that what we *think* is being counted is actually what the computer is counting'' (44).
Chapter 3 ''Diachronic variation'' (57-80) illustrates how linguistic changes can be observed using corpora of different time depths. Baker stresses the difficulties that may arise in working with historical corpora, e.g. for orthographic reasons. He also makes clear that corpora can quickly turn into historical corpora. Take, for instance, a newspaper corpus compiled in the late 1990s and compare it with a newspaper corpus compiled, say, in the past two years: analyzing the frequencies of certain words, e.g. *Lewinsky* and *Obama*, makes the changes obvious (my example; Baker uses an example from Smith 2002, illustrating the use of the progressive aspect with modal verbs).
Chapter 4 ''Synchronic variation'' (81-101) is dedicated to comparing synchronic differences, e.g. between the varieties of English around the world. In the context of whether corpora can be used to compare cultures, Baker rightly mentions one of the central problems of using corpus analysis to detect differences (or similarities): in the first place it is always corpora that are compared, and it always requires diligence, precaution and appropriate methods to ensure that the similarities or differences are not accidental artifacts of the corpora being compared.
Chapter 5 ''Corpora and interpersonal communication'' (102-120) shifts attention to the value CL has for interactional linguistics (IL). Baker illustrates the hurdles in collecting and compiling corpora that meet the needs of IL, but with some examples he is able to show the value of such analyses. Still, he makes clear that a lot of work remains to be done.
Chapter 6 ''Uncovering discourses'' (121-145) demonstrates how CL can be used to ''show evidence for constructed differences (e.g. man are constructed as *x*, women are constructed as *y*)'' (143).
Finally, Chapter 7 ''Conclusion'' (146-156) sums up the book and offers an outlook on future developments in CL, ''hopefully resulting in sophisticated techniques for analyzing linguistic patterns and enabling many more research questions to be asked. For example, do certain eye movements, facial expressions or gestures tend to accompany particular types of words or conversational situations?'' (156).
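The frequency comparison described for Chapter 3 can be made concrete with a minimal Python sketch. It is not taken from Baker's book: the two ''corpora'' are invented one-line stand-ins, and the helper names `tokenize` and `per_million` are hypothetical. Frequencies are normalized per million tokens so that corpora of different sizes stay comparable, which is one way of guarding against the counting traps mentioned for Chapter 2.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(word, tokens):
    """Occurrences of `word` per million tokens (0.0 for an empty corpus)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] / len(tokens) * 1_000_000


# Invented toy texts standing in for two newspaper corpora (not real data).
corpus_1990s = "Lewinsky testified today . The Lewinsky inquiry continues ."
corpus_recent = "Obama spoke today . Obama met the press . The inquiry continues ."

old_tokens = tokenize(corpus_1990s)
new_tokens = tokenize(corpus_recent)

# Print each word with its normalized frequency in the older and newer corpus.
for word in ("lewinsky", "obama", "inquiry"):
    print(word, per_million(word, old_tokens), per_million(word, new_tokens))
```

With real corpora the tokenizer would need far more care (orthographic variation, hyphenation, named entities), which is exactly the kind of check Baker's warning about ''what the computer is counting'' calls for.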
In addition to references and indices, the book provides appendices showing examples of the tags available in two corpus annotation systems: CLAWS (= Constituent Likelihood Automatic Word-tagging System) and USAS (= University Centre for Computer Corpus Research on Language Semantic Analysis System).
But these issues do not reduce my overall positive impression of the book. They are more ''this-would-be-also-nice-to-have-add-ons'' that could be considered for future addition. Certainly the greatest achievement of this book is that it brings down barriers between linguistic approaches; the book clearly shows how CL and SL can learn from each other by improving their methods, re-evaluating findings and finally being able to better position themselves within linguistics. The book shows the possibilities and the limits of both approaches and how cooperation can extend them. Another helpful aspect of this book is that it makes statistical basics easily accessible, presenting them so simply (e.g. the chi-square test, 35-36) that much of the fright beginners feel is alleviated. This, the conciseness of the book, the richness of examples and samples (I especially like the CL analysis of the book's chapters 1-6 on pp. 146-147) and a readable style make it suitable as a course book for beginners.
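For readers who want to see how little machinery the chi-square test mentioned above needs, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reproduction of Baker's worked example: the counts are hypothetical, the function name `chi_square_2x2` is my own, and the statistic is the plain Pearson chi-square for a 2x2 table without continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

        [[a, b],    a = hits in corpus A,  b = all other tokens in corpus A
         [c, d]]    c = hits in corpus B,  d = all other tokens in corpus B
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: a word occurs 30 times in 1,000 tokens of corpus A
# and 10 times in 1,000 tokens of corpus B.
chi2 = chi_square_2x2(30, 970, 10, 990)

# 3.841 is the critical value for 1 degree of freedom at the 0.05 level.
print(round(chi2, 3), chi2 > 3.841)
```

A statistic above 3.841 (one degree of freedom, 5% level) suggests that the difference in frequency is unlikely to be due to chance alone; whether it reflects a real sociolinguistic difference or an artifact of corpus design still has to be argued separately.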
Birkhan, H. 1979. Das ''Zipfsche Gesetz'', das schwache Präteritum und die germanische Lautverschiebung. Wien.
Ferrer i Cancho, R. 2006. On the universality of Zipf's law for word frequencies. In: Exact methods in the study of language and text. In honor of Gabriel Altmann. Grzybek, P. et al. (eds.), Berlin. 131-140.
Gries, S. Th. 2009. Statistics for Linguistics with R: A Practical Introduction. Berlin et al.
Lieberman, E. et al. 2007. Quantifying the evolutionary dynamics of language. Nature 449, 713-716.
Pagel, M. et al. 2007. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717-719.
Prün, C. 2002. Die linguistischen Hypothesen von G.K. Zipf. In: Korpuslinguistische Untersuchungen zur quantitativen und systemtheoretischen Linguistik. R. Köhler (ed.). Trier. 271-321.
Prün, C. 2005. Das Werk von G.K. Zipf. In: Quantitative Linguistik. R. Köhler et al. (eds.). Berlin. 142-152.
Smith, N. 2002. Ever moving on? The progressive in recent British English. In: New frontiers of corpus research. P. Peters et al. (eds.). Amsterdam. 317-330.
Zipf, G. 1965. The psycho-biology of language. Cambridge.
ABOUT THE REVIEWER Michael Crombach is a research and development engineer at Nuance Communications Austria, working on statistical language models and phonetic transcriptions for speech recognition systems. He has a background in historical linguistics (Ph.D.) and biology. His main interests are biology and evolution of language, statistics and language, and theory and history of linguistics.
Page Updated: 30-Jul-2010