LINGUIST List 18.2708

Mon Sep 17 2007

Review: Corpus Linguistics: Teubert & Cermáková (2007)

Editor for this issue: Randall Eggert <randylinguistlist.org>

This LINGUIST List issue is a review of a book published by one of our supporting publishers, commissioned by our book review editorial staff. We welcome discussion of this book review on the list, and particularly invite the author(s) or editor(s) of this book to join in. To start a discussion of this book, you can use the Discussion form on the LINGUIST List website. For the subject of the discussion, specify "Book Review" and the issue number of this review. If you are interested in reviewing a book for LINGUIST, look for the most recent posting with the subject "Reviews: AVAILABLE FOR REVIEW", and follow the instructions at the top of the message. You can also contact the book review staff directly.

Announced at http://linguistlist.org/issues/18/18-456.html

AUTHORS: Teubert, Wolfgang; Cermáková, Anna
TITLE: Corpus Linguistics
SUBTITLE: A short introduction
YEAR: 2007
PUBLISHER: Continuum International Publishing Group Ltd

Barbara Schlücker, Institut für Deutsche und Niederländische Philologie, Freie
Universität Berlin

This textbook consists of two main parts: ''Language and corpus linguistics'', by
Wolfgang Teubert, and ''Directions in corpus linguistics'', written jointly by
Wolfgang Teubert and Anna Cermáková. It ends with suggestions for further
reading and a short glossary. The book was published earlier as part of
Teubert's and Cermáková's book _Lexicology and Corpus Linguistics_ (2004).

The first part, ''Language and corpus linguistics'', is divided into five subchapters.
Teubert starts by giving an introduction to the aims and main interests of
Generative Grammar and structuralist theories, concluding that these theories
are preoccupied exclusively with the sameness of all languages, the generative
powers of rules, and the structure of language, but not with the lexicon (for
Chomsky) or the mental processes linked to language (for the structuralists).
Teubert then concentrates on the meaning of words, which he believes to be
ambiguous from both a monolingual and a bilingual perspective. He compares translations
and back-translations of the same word in several bilingual dictionaries. As
these translations turn out not to map onto one another well, Teubert concludes that the single
word does not seem to be an appropriate unit of meaning because, as he puts it,
''units of meaning are, by definition, unambiguous; they only have one meaning''
(p. 16). He supports this claim by discussing at length idioms and collocations
and their possible translations into other languages. He then concludes that
lexicography cannot do without suitable corpora, especially from a bilingual
perspective, because in bilingual lexicography a collocation can be defined as a
phrase that cannot be adequately translated by translating the parts separately
(p. 27). Corpus linguistics, then, is concerned with the meaning of language.
Corpus linguistics is not concerned with what happens in the mind in the process of
encoding and decoding the meaning of language (as cognitive linguistics is)
but with language itself. A corpus is a sample of discourse and studying this
sample will give us the meaning of language in context. The last subchapter of
part one is a brief history of corpus linguistics, where Teubert describes the
first corpus projects and the development of corpus linguistics since the late
1950s as well as the two main directions in current corpus linguistics, namely
corpus-based and corpus-driven research.

In the second part, ''Directions in corpus linguistics'', Teubert and Cermáková
first discuss the question of representativeness, concluding that a corpus
cannot be representative of a given discourse as long as we do not have full
access to all texts, spoken and written, of which the discourse consists - which
will probably never be the case for any discourse, no matter how it is defined.
Therefore, representativeness should not be a defining criterion of a corpus.
The next subchapter, ''Corpus typology'', discusses several corpus types such as
reference corpora, monitor corpora, parallel corpora and the question of whether
the internet constitutes a corpus. The remaining subchapters all deal with the
central subject of this book, the question of defining and identifying units of
meaning. The main claim of Teubert and Cermáková is that meaning can best be
described by usage and paraphrase. The usage pattern of a word contains
information about co-occurrences and collocations, grammatical information as
well as frequency information. Usage patterns can be detected automatically.
Paraphrases offer another way of identifying units of meaning. They are
statements, definitions, explanations and stories about a given meaning unit.
The set of all paraphrases in a given discourse constitutes the meaning of this
unit. Teubert and Cermáková demonstrate how units of meaning can be identified,
using the examples of 'globalization' / 'globalisation' and 'friendly fire', by
analyzing usage patterns and paraphrases from corpora. The existence of
paraphrases, they argue, can also be seen as an indication of neologisms: ''It
makes sense to assume that the introduction of neologisms into the discourse
always occurs along the same lines. As long as the word (or larger unit of
meaning) is still new, it needs to be explained. Not everyone understands the
word in the same way.'' (p. 98). With the example of 'friendly fire', they also
return to the bilingual aspect when they study translations of this expression into
German. In addition to usage patterns and paraphrases, translations, then, are
seen as a third criterion for identifying units of meaning: ''It seems that the
translation equivalent of a true collocation is not what would be the most
appropriate translation if each of the elements were translated separately. (...)
Rather, collocations are translated as a whole, and it does not seem to matter
whether the favoured equivalent makes any sense if interpreted literally as a
combination of the elements involved.'' (p. 113). As their concern lies in the
bilingual aspect, Teubert and Cermáková finally discuss parallel corpora. In
studying the English translations of French 'travail'/'travaux', namely
'work' and 'labour', they show that the two have different collocation profiles which
do not overlap at all, i.e. words that co-occur with 'work' never co-occur
with 'labour' and vice versa. These findings lead them to conclude that in
bilingual dictionaries single-word entries should be replaced with entries for
translation units.
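
As a rough illustration of what comparing collocation profiles involves (a
minimal sketch, not the authors' procedure), the Python snippet below counts the
words found within a fixed window around a node word and checks whether two
profiles share any collocates. The function name collocation_profile, the window
size and the toy sentences are all invented for this example and do not come
from the book.

```python
from collections import Counter


def collocation_profile(sentences, node, window=2):
    """Count the words occurring within `window` tokens of `node`.

    `sentences` is a list of token lists, so the window never crosses
    a sentence boundary. This is a hypothetical helper for illustration.
    """
    profile = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != node:
                continue
            start = max(0, i - window)
            # Neighbours to the left and to the right of the node word.
            neighbours = sent[start:i] + sent[i + 1:i + 1 + window]
            profile.update(neighbours)
    return profile


# Toy sentences made up for this sketch (not corpus data from the book).
sentences = [s.split() for s in [
    "public work begins today",
    "manual labour costs rose",
    "the road work continues",
    "forced labour camps closed",
]]

work = collocation_profile(sentences, "work")
labour = collocation_profile(sentences, "labour")

# Collocates that appear in both profiles; empty if the profiles do not overlap.
shared = set(work) & set(labour)
print(sorted(work), sorted(labour), shared)
```

On a real parallel or monolingual corpus one would normally also lemmatise the
tokens and apply a statistical association measure rather than raw counts; both
steps are left out here to keep the sketch short.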

This book deals with the meaning and the identification of units of meaning.
Teubert and Cermáková discuss at length the problems which arise when we examine
strings of words. When are we dealing with a fixed expression? What is a
collocation? And how can the meaning of such units be described? One of the
merits of this book certainly is the detailed discussion of examples like
'globalization', 'false', 'friendly fire' etc. These examples not only
illustrate the problems discussed but they also show how the approaches
suggested by the authors (usage patterns, paraphrases) work. This is especially
true for the bilingual aspect: Teubert and Cermáková give many examples from
existing bilingual dictionaries, demonstrating that the single word does not
seem to be an appropriate unit of meaning as the meaning of single words is
ambiguous, and thereby supporting their claim that entries for translation
units should replace single-word entries in bilingual dictionaries.

Reading this book thus helps one to gain insight into the problems of lexicology
and of lexicography. It is doubtful, however, whether this book is suitable as
an introduction to corpus linguistics. Corpus linguistics is shown to be a
useful instrument when identifying units of meaning and the meaning of these
units. But it is a major problem that in this textbook corpus linguistics is
addressed exclusively in the context of lexicology. The authors admit that
grammar in principle could be a field of application for corpus linguistics. But
unfortunately they omit this subject, stating that ''corpus linguistics should
keep its hands off grammar, to the extent that the rules we find in our grammar
books are indisputable. (They are not always, though)'' (p. 48). The relevance of
corpus linguistics for research on grammar, form and formal variation is
completely disregarded in this book.

A second problem is that definitions of the notions of 'corpus' and 'corpus
linguistics' are missing. The book starts with an extensive discussion of Chomskyan
linguistics and structuralism. It is only on page 24 that the notion of corpus
linguistics is mentioned for the first time, and it is then mentioned in a way
that presupposes that the reader already knows what corpus linguistics is and
what it is used for. Only on page 37 is the aim of corpus linguistics finally
introduced, by comparison with Chomskyan linguistics and structuralism. There is no
discussion of the essential properties that distinguish a linguistic
corpus from an arbitrary collection of texts, such as composition, size
or annotation. Thus, for the inexperienced reader - probably the typical reader of
an introductory textbook - the idea, the aims and the interests of corpus
linguistics are very hard to grasp.

Another (minor) problem is the way the internet corpus is dealt with. In this
book, the internet is used as a corpus and there is also a short section on this
subject. But whether the internet really constitutes a corpus and under which
circumstances we should make use of this corpus is a difficult question which is
currently being debated (e.g. Kilgarriff & Grefenstette 2003, Hundt, Biewer &
Nesselhauf 2007). Unfortunately, the authors mention neither this debate
nor the relevant literature. Furthermore, when using data from the internet,
they give Google as the reference for these data rather than the real sources.
A corpus linguist, however, should know that Google is an internet search
engine, not an internet source.

Obviously, many of these problems have to do with the fact that this book was
published before as part of _Lexicology and Corpus Linguistics_ (2004) and
has now been republished without adjustments. This would not only explain why the area
of application of corpus linguistics is restricted to lexicology, but also why
there is no introduction, which, among other things, defines the audience it is
written for. According to the publisher's advertisement this book is ''a readable
introductory textbook'' which ''will be useful to students trying to get a grasp
on this subject''. Unfortunately, considering the problems mentioned above, I
doubt that this is true. This book will without doubt appeal to anyone
interested in lexicology and lexicography who wants to know how corpus
linguistics can be used in lexicology and who ideally has some background
knowledge about corpus linguistics. But it is not suitable as a textbook for
students looking for an introduction to corpus linguistics with practical advice.


Hundt, Marianne, Caroline Biewer & Nadja Nesselhauf (eds.). (2007) _Corpus
Linguistics and the Web_. Amsterdam: Rodopi.

Kilgarriff, Adam & Gregory Grefenstette. (2003) Introduction to the special
issue on the web as corpus. _Computational Linguistics_ 29/3, 333-347.

Teubert, Wolfgang & Anna Cermáková. (2004) _Lexicology and Corpus Linguistics_.

Barbara Schlücker is a postdoctoral research fellow and lecturer at the Freie
Universität Berlin. Her research interests include corpus linguistics, semantics
and the morphology-syntax interface.
