LINGUIST List 21.3125

Fri Jul 30 2010

Review: Sociolinguistics; Text/Corpus Linguistics: Baker (2010)

Editor for this issue: Joseph Salmons <jsalmonslinguistlist.org>

        1.    Michael Crombach, Sociolinguistics and Corpus Linguistics

Message 1: Sociolinguistics and Corpus Linguistics
Date: 30-Jul-2010
From: Michael Crombach <michael.crombachgmx.at>
Subject: Sociolinguistics and Corpus Linguistics
Announced at http://linguistlist.org/issues/21/21-866.html

AUTHOR: Baker, Paul
TITLE: Sociolinguistics and Corpus Linguistics
SERIES: Edinburgh Sociolinguistics
PUBLISHER: Edinburgh University Press
YEAR: 2010

Michael Crombach, Nuance Communications Austria


As the title suggests, ''Sociolinguistics and Corpus Linguistics'' attempts to
bring together what at first sight seem rather disparate approaches to language:
sociolinguistics (SL) is often considered a very intuitive and practical approach
to language and linguistic phenomena, while corpus linguistics (CL) has always
been considered a mathematical and technical approach to language. They are
linked by statistics. Paul Baker presents an easy-to-read introduction to the
methods of CL and how they can be used in SL research.

The book is organized in seven chapters, each with an introduction, 5-7
subchapters, and a conclusion. Baker presents an organizational overview of the
book on pp. 28-30.

Chapter 1 ''Introduction'' (1-30) presents the various types of corpora (written,
spoken, general or specialized, 12-15) and the essential methods and concepts of
corpus linguistics, like ''concordance'', ''annotation'', ''frequency'', etc. Baker
gives five good reasons why CL and SL can make meaningful use of each other (pp.

1. SL and CL ''share a number of fundamental tenets of practice when it comes to
linguistic analyses'',
2. both CL and SL use quantitative methodologies to identify similarities and
differences,
3. SL and CL use sampling techniques to make claims about larger data,
4. ''both examine variation and change'', and finally
5. SL and CL ''attempt to provide explanations [...] for the findings that their
research produces''.
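
Of the Chapter 1 concepts named above, ''concordance'' is perhaps the easiest to make concrete. The following is a minimal sketch of a keyword-in-context listing, not taken from Baker's book: the function names, the naive tokenization rule and the sample sentence are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample sentence, used only to show the output format.
sample = "The corpus shows that the word the is frequent in the corpus"
for line in concordance(tokenize(sample), "corpus", width=2):
    print(line)
```

Dedicated corpus tools do far more (sorting, tag-aware search, large files), but the principle of lining up every hit with its immediate context is the same.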

Chapter 2 ''Corpora and sociolinguistic variation'' (31-56) presents the
possibilities of investigating the different registers (social varieties of a
language) using corpus linguistic methods, and in doing so Baker explains in
greater detail the concept of ''frequency'' and its traps, and how to avoid them:
''we have to make sure that what we *think* is being counted is actually what the
computer is counting'' (44). Chapter 3 ''Diachronic variation'' (57-80)
illustrates how linguistic changes can be observed using corpora of different
time depths. Baker stresses the difficulties that may arise in working with
historical corpora, e.g. for orthographic reasons. Baker makes clear that
corpora can quickly turn into historical corpora: take, for instance, a newspaper
corpus compiled in the late 1990s and compare it with a newspaper corpus
compiled, say, in the past two years; an analysis of the occurrence or
frequencies of certain words, e.g. *Lewinsky* and *Obama*, then makes the changes
obvious (my example; Baker uses an example from Smith 2002, illustrating the use
of the progressive aspect with modal verbs). Chapter 4 ''Synchronic variation''
(81-101) is dedicated to the possibilities of comparing synchronic differences,
e.g. between the different varieties of English all over the world. In the
context of whether corpora can be used to compare cultures, Baker rightly
mentions one of the central problems of using corpus analysis to detect differences
(or similarities): in the first place it is always corpora that are compared,
and it always requires diligence, precautions and appropriate methods to ensure
that the similarities or differences are not accidental artifacts of the corpora
being compared. Chapter 5 ''Corpora and interpersonal communication'' (102-120)
shifts attention to the value CL has for interactional linguistics (IL). Baker
describes the hurdles in the collection and compilation of corpora that meet
the needs of IL, but with some examples he is able to illustrate the value of
such analyses. Still, Baker makes clear that a lot of work remains to be done.
Chapter 6 ''Uncovering discourses'' (121-145) demonstrates how CL can be used to
''show evidence for constructed differences (e.g. men are constructed as *x*,
women are constructed as *y*)'' (143). Finally, Chapter 7 ''Conclusion'' (146-156)
sums up the book and offers an outlook on future developments in CL,
''hopefully resulting in sophisticated techniques for analyzing linguistic
patterns and enabling many more research questions to be asked. For example, do
certain eye movements, facial expressions or gestures tend to accompany
particular types of words or conversational situations?'' (156).

In addition to references and indices, the book provides appendices showing
examples of the tags available in different corpus annotation systems: CLAWS (=
constituent likelihood automatic word-tagging system) and USAS (= University
Centre for Computer Corpus Research on Language Semantic Analysis system).


There are some minor points of critique that do not reduce the overall positive
impression of this book, but could add some value to it. First, the book lacks
an overview of the general history of CL. Baker manages to write a book on the
quantitative analysis of linguistic data without even mentioning George Kingsley
Zipf (1902-1950), the pioneer of all word-counting (e.g. Zipf 1965, originally
published 1935). The frequency distribution of words typically follows a power-law
curve, commonly referred to as a ''Zipf Curve'' (e.g. Ferrer i Cancho 2006, Prün
2002, 2005). Due to this lack of historical grounding, CL is presented in a way
that neglects the long tradition of handling text and corpora that has been
labeled ''philology''. Baker repeatedly stresses the importance of ''concordances''
for a correct and meaningful data analysis, but fails to mention that
concordances have been a tool of classical philology and theology for centuries:
for example, the Thesaurus Linguae Latinae is (more or less) a concordance that
has been in compilation since 1893. Bible concordances have been available since the Middle
Ages. Only with the advent of computers and vast amounts of digitized texts have
corpora and concordances become much easier to create. So, a short chapter
dedicated to the history and evolution of corpus linguistics would help readers
situate CL historically and methodologically. Another gap is the absence of
tables, lists or synopses covering several aspects of CL. One would be a list of
the most important statistical tests with their pros and cons and linguistic
examples of when to use which. I also would have appreciated a list of available corpora;
Baker carefully introduces the Brown family of corpora (59-68, with an overview
of the fields covered by these corpora in table 3.1, p. 61), but I could imagine
a table that at least gives names of available corpora, their size, their terms
of use (free, non-free), their date of compilation, and where to get these
corpora. Another helpful addition would be a list of CL tools, i.e. computer
programs that help to manipulate and analyze corpora. Baker gives an overview of
the dedicated ''corpus tools'' in table 1.1 (8), but this list certainly could be
improved and expanded by adding simple text-manipulation tools such as editors
(other than Word), e.g. UltraEdit, and the various scripting languages, e.g. Perl,
awk, etc. Finally, I had hoped to find a list of the most common/important CL
formulae, e.g. for occurrences per million. Baker shows how to calculate this
(20), but it would be helpful to have these formulae in a synopsis. An
interesting development Baker fails to mention is the analysis of historical
data done by Lieberman et al. 2007 (see also Pagel et al. 2007), showing a
correlation between the frequencies of irregular verbs and their tendency to
become regular. An earlier attempt to use frequency analysis as an explanatory
tool for historical data is Birkhan 1979. This is certainly an aspect of CL that is also
interesting in an SL context as it allows predictions of future developments, or
-- more precisely -- estimations of the likelihood of future developments, that
can be compared with the actual developments. Finally, it would have been good
(and certainly a great service for the novice reader) to add further reading
suggestions, especially on the statistical procedures, even if only something
very general along the lines of Gries 2009.
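
To indicate what an entry in such a synopsis of formulae might look like, here is a minimal sketch of the occurrences-per-million normalization and of the rank-frequency listing that underlies a Zipf curve. It is an illustration under stated assumptions, not Baker's presentation: the function names and all counts are invented for the example.

```python
from collections import Counter

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Hypothetical figures: 50 hits in a corpus of 200,000 tokens.
print(per_million(50, 200_000))  # 250.0

# Toy token list; in a real corpus, rank times count stays roughly constant
# across much of the list if the distribution is Zipf-like.
for rank, word, count in rank_frequency("a b a c a b".split()):
    print(rank, word, count)
```

Normalizing in this way is what makes counts from corpora of different sizes comparable at all.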

But these issues do not reduce my overall positive impression of the book. They
are more ''this-would-be-also-nice-to-have-add-ons'' that could be considered for
future addition. Certainly the greatest achievement of this book is that it
brings down barriers between linguistic approaches; the book clearly shows how
CL and SL can learn from each other by improving their methods, re-evaluating
findings and finally being able to better position themselves within
linguistics. The book shows the possibilities and the limits of both approaches
and how cooperation can extend the former. Another helpful aspect of this book is
that it makes statistical basics easily accessible, presenting them so simply
(e.g. the chi-square test, 35-36) that beginners should lose much of their
fright. This, the brevity of the book, the richness of examples and samples (I
especially like the CL analysis of the book's chapters 1-6 on pp. 146-147) and
its readable style make it suitable as a course book for beginners.
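
For readers who have not met the chi-square test mentioned above, the following minimal sketch shows the arithmetic for a 2x2 table comparing one word across two corpora. It is not Baker's worked example: the counts are invented, the function name is an assumption, and the plain Pearson statistic is used without a continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected cell count if the word were equally likely in both corpora.
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: the word occurs 30 times in 1,000 tokens of corpus A
# and 10 times in 1,000 tokens of corpus B.
stat = chi_square_2x2(30, 970, 10, 990)
print(round(stat, 2))  # 10.2
```

With one degree of freedom the conventional critical value at p < 0.05 is about 3.84, so a statistic of roughly 10.2 would count as a significant difference between these two hypothetical corpora.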


Michael Crombach is a research and development engineer at Nuance
Communications Austria, working on statistical language models and phonetic
transcriptions for speech recognition systems. He has a background in
historical linguistics (Ph.D.) and biology. His main interests are biology
and evolution of language, statistics and language, and theory and history
of linguistics.
