LINGUIST List 18.2708

Mon Sep 17 2007

Review: Corpus Linguistics: Teubert & Cermáková (2007)

Editor for this issue: Randall Eggert <randylinguistlist.org>





Announced at http://linguistlist.org/issues/18/18-456.html

AUTHORS: Teubert, Wolfgang; Cermáková, Anna
TITLE: Corpus Linguistics
SUBTITLE: A short introduction
YEAR: 2007
PUBLISHER: Continuum International Publishing Group Ltd

Barbara Schlücker, Institut für Deutsche und Niederländische Philologie, Freie Universität Berlin

SUMMARY

This textbook consists of two main parts: ''Language and corpus linguistics'', by Wolfgang Teubert, and ''Directions in corpus linguistics'', written jointly by Wolfgang Teubert and Anna Cermáková. It ends with suggestions for further reading and a short glossary. The book was published earlier as part of Teubert and Cermáková's book _Lexicology and Corpus Linguistics_ (2004).

The first part, ''Language and corpus linguistics'', is divided into five subchapters. Teubert starts with an introduction to the aims and main interests of Generative Grammar and structuralist theories, concluding that these theories are preoccupied exclusively with the sameness of all languages, the generative powers of rules, and the structure of language, but not with the lexicon (for Chomsky) or the mental processes linked to language (for the structuralists). Teubert then concentrates on the meaning of words, which he believes to be ambiguous from both a monolingual and a bilingual perspective. He compares translations and back-translations of the same word in several bilingual dictionaries. As these translations turn out not to map well onto one another, Teubert concludes that the single word does not seem to be an appropriate unit of meaning because, as he puts it, ''units of meaning are, by definition, unambiguous; they only have one meaning'' (p. 16). He supports this claim by discussing at length idioms and collocations and their possible translations into other languages. He then concludes that lexicography cannot do without suitable corpora, especially from a bilingual perspective, because in bilingual lexicography a collocation can be defined as a phrase that cannot be adequately translated by translating the parts separately (p. 27). Corpus linguistics, then, is concerned with the meaning of language. Unlike cognitive linguistics, it is not concerned with what happens in the mind in the process of encoding and decoding the meaning of language, but with language itself. A corpus is a sample of discourse, and studying this sample will give us the meaning of language in context.
The last subchapter of part one is a brief history of corpus linguistics, where Teubert describes the first corpus projects and the development of corpus linguistics since the late 1950s as well as the two main directions in current corpus linguistics, namely corpus-based and corpus-driven research.

In the second part, ''Directions in corpus linguistics'', Teubert and Cermáková first discuss the question of representativeness, concluding that a corpus cannot be representative of a given discourse as long as we do not have full access to all texts, spoken and written, of which the discourse consists - which will probably never be the case for any discourse, no matter how it is defined. Therefore, representativeness should not be a defining criterion of a corpus. The next subchapter, ''Corpus typology'', discusses several corpus types, such as reference corpora, monitor corpora and parallel corpora, as well as the question of whether the internet constitutes a corpus. The remaining subchapters all deal with the central subject of this book, the question of defining and identifying units of meaning. The main claim of Teubert and Cermáková is that meaning can best be described by usage and paraphrase. The usage pattern of a word contains information about co-occurrences and collocations, grammatical information as well as frequency information. Usage patterns can be detected automatically. Another way of identifying units of meaning is through paraphrases. Paraphrases are statements, definitions, explanations and stories about a given meaning unit. The set of all paraphrases in a given discourse constitutes the meaning of this unit. Teubert and Cermáková demonstrate the identification of units of meaning for the examples of 'globalization' / 'globalisation' and 'friendly fire' by analyzing usage patterns and paraphrases from corpora. The existence of paraphrases, they argue, can also be seen as an indication of neologisms: ''It makes sense to assume that the introduction of neologisms into the discourse always occurs along the same lines. As long as the word (or larger unit of meaning) is still new, it needs to be explained. Not everyone understands the word in the same way.'' (p. 98).
With the example of 'friendly fire', they also return to the bilingual aspect when they study translations of this expression into German. In addition to usage patterns and paraphrases, translations, then, are seen as a third criterion for identifying units of meaning: ''It seems that the translation equivalent of a true collocation is not what would be the most appropriate translation if each of the elements were translated separately. (...) Rather, collocations are translated as a whole, and it does not seem to matter whether the favoured equivalent makes any sense if interpreted literally as a combination of the elements involved.'' (p. 113). As their concern lies in the bilingual aspect, Teubert and Cermáková finally discuss parallel corpora. In studying the English translations of French 'travail'/'travaux', namely 'work' and 'labour', they show that the two words have different collocation profiles which do not overlap at all, i.e. words that co-occur with 'work' never co-occur with 'labour' and vice versa. These findings lead them to conclude that in bilingual dictionaries single-word entries should be replaced with entries of translation units.
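The idea of comparing the collocation profiles of two translation equivalents can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `collocation_profile`, the window size, the stop list and the four toy sentences are all assumptions invented here for illustration, and a real study would use a large parallel corpus and a statistical association measure rather than raw counts.

```python
from collections import Counter

# Hypothetical stop list; a real study would use a fuller one or none at all.
STOPWORDS = {"the", "its", "on", "is", "by", "will"}

def collocation_profile(sentences, node, window=2):
    """Count content words occurring within `window` tokens of `node`."""
    profile = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] not in STOPWORDS:
                    profile[tokens[j]] += 1
    return profile

# Invented toy sentences, NOT data from the book under review.
toy_corpus = [
    "the committee continued its work on the report",
    "road work will begin next week",
    "the labour market remains weak",
    "child labour is banned by law",
]

work = collocation_profile(toy_corpus, "work")
labour = collocation_profile(toy_corpus, "labour")

# Collocates found with both node words; an empty set means the
# two profiles do not overlap in this (tiny) sample.
shared = set(work) & set(labour)

print(sorted(work))
print(sorted(labour))
print(shared)
```

On a sample this small, an empty overlap says nothing about English in general; the sketch only shows what "non-overlapping collocation profiles" means operationally.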

EVALUATION

This book deals with the meaning and the identification of units of meaning. Teubert and Cermáková discuss at length the problems which arise when we examine strings of words. When are we dealing with a fixed expression? What is a collocation? And how can the meaning of such units be described? One of the merits of this book certainly is the detailed discussion of examples like 'globalization', 'false', 'friendly fire', etc. These examples not only illustrate the problems discussed but also show how the approaches suggested by the authors (usage patterns, paraphrases) work. This is especially true for the bilingual aspect: Teubert and Cermáková give many examples from existing bilingual dictionaries, demonstrating that the single word does not seem to be an appropriate unit of meaning because the meaning of single words is ambiguous, and thereby supporting their claim that entries of translation units should replace single-word entries in bilingual dictionaries.

Reading this book thus helps one to gain insight into the problems of lexicology and of lexicography. It is doubtful, however, whether this book is suitable as an introduction to corpus linguistics. Corpus linguistics is shown to be a useful instrument for identifying units of meaning and the meaning of these units. But it is a major problem that in this textbook corpus linguistics is addressed exclusively in the context of lexicology. The authors admit that grammar could in principle be a field of application for corpus linguistics. But unfortunately they omit this subject, stating that ''corpus linguistics should keep its hands off grammar, to the extent that the rules we find in our grammar books are indisputable. (They are not always, though)'' (p. 48). The relevance of corpus linguistics for the study of grammar, form and formal variation is completely disregarded in this book.

A second problem is that definitions of the notions of 'corpus' and 'corpus linguistics' are missing. The book starts with an extensive discussion of Chomskyan linguistics and structuralism. It is only on page 24 that the notion of corpus linguistics is mentioned for the first time, and it is then mentioned in a way that presupposes that the reader already knows what corpus linguistics is and what it is used for. Only on page 37 is the aim of corpus linguistics finally introduced, in comparison with Chomskyan linguistics and structuralism. There is no discussion of the essential properties that distinguish a linguistic corpus from an arbitrary collection of texts, such as composition, size or annotation. Thus, for the inexperienced reader - probably the typical reader of an introductory textbook - the idea, the aims and the interests of corpus linguistics are very hard to grasp.

Another (minor) problem is the way the internet corpus is dealt with. In this book, the internet is used as a corpus and there is also a short section on this subject. But whether the internet really constitutes a corpus and under which circumstances we should make use of this corpus is a difficult question which is currently being debated (e.g. Kilgarriff & Grefenstette 2003, Hundt, Biewer & Nesselhauf 2007). Unfortunately, the authors mention neither this debate nor the relevant literature. Furthermore, when using data from the internet, they give Google as the reference for these data rather than the real sources. A corpus linguist, however, should know that Google is an internet search engine and not an internet source.

Obviously, many of these problems have to do with the fact that this book has been published before as part of _Lexicology and Corpus Linguistics_ (2004) and is now republished without adjustments. This would explain not only why the area of application of corpus linguistics is restricted to lexicology, but also why there is no introduction, which, among other things, would define the audience the book is written for. According to the publisher's advertisement this book is ''a readable introductory textbook'' which ''will be useful to students trying to get a grasp on this subject''. Unfortunately, considering the problems mentioned above, I doubt that this is true. This book will no doubt appeal to anyone interested in lexicology and lexicography who wants to know how corpus linguistics can be used in lexicology and who ideally has some background knowledge of corpus linguistics. But it is not suitable as a textbook for students looking for an introduction to corpus linguistics with practical advice.

REFERENCES

Hundt, Marianne, Caroline Biewer & Nadja Nesselhauf (eds.). (2007) _Corpus Linguistics and the Web_. Amsterdam: Rodopi.

Kilgarriff, Adam & Gregory Grefenstette. (2003) Introduction to the special issue on the web as corpus. _Computational Linguistics_ 29/3, 333-347.

Teubert, Wolfgang & Anna Cermáková. (2004) _Lexicology and Corpus Linguistics_. Continuum.

ABOUT THE REVIEWER

Barbara Schlücker is a postdoctoral research fellow and lecturer at the Freie Universität Berlin. Her research interests include corpus linguistics, semantics and the morphology-syntax interface.