Date: Mon, 10 Jan 2005 09:56:34 +1300
From: Ute Knoch <u.knoch@auckland.ac.nz>
Subject: Corpus Linguistics: Readings in a Widening Discipline

EDITORS: Sampson, Geoffrey Richard; McCarthy, Diana
TITLE: Corpus Linguistics
SUBTITLE: Readings in a Widening Discipline
PUBLISHER: Continuum International Publishing Group Ltd
YEAR: 2004
Ute Knoch, Department of Applied Language Studies and Linguistics, University of Auckland, New Zealand
The first thing that struck me about this edited volume of papers in corpus linguistics is that the chapters are organised not by topic area but by year of original publication, and that every chapter has previously been published elsewhere. The contributions range from 1952 to 2002. The editors explain their rationale in the introductory chapter, where they note that many of these important publications originally appeared in low-circulation volumes. They decided not to organise the chapters by topic area because they think that corpus linguistics should be seen as a field as a whole and not as a compartmentalized area of study. In this review, I first describe the content of each of the 43 chapters of the book and then provide a critical evaluation of the contents.
In their introduction (chapter 1), the editors give a brief definition of corpora as well as a concise history of the development of the field.
Chapter 2 contains the oldest contribution (1952), written by Charles C. Fries at a time before corpora existed in electronic form. The chapter presents excerpts from the introduction and chapter 3 of 'The Structure of English'. Fries was one of the first modern corpus linguists. He recorded 250,000 words of telephone conversation and used these data to describe English structure based on real-life use.
In 'A standard corpus of edited present day American English' (chapter 3), Francis describes what the editors call 'the great grandfather' of the electronic corpora, the Brown Corpus of American written English, which was published in 1964. It was made up of 1 million words of edited written scholarly work. The paper sets out the rationale for the make-up of the corpus.
In chapter 4, entitled 'On the distribution of noun-phrase types in English clause-structure' and originally published in 1971, F.G.A.M. Aarts uses the then still paper-based Survey of English Usage as a basis to contradict assumptions about grammar. The author used statistical methods to validate his study.
Chapter 5 was published in 1986, 15 years after Aarts's paper. In the interim, information technology had advanced, and more complicated processing methods were therefore available. This chapter, which can be situated in the area of language engineering, describes the Text Segmentation for Speech (TESS) project, which aimed to develop predictive theories about English intonation to make automated text-to-speech systems sound more natural.
'Typicality and meaning potentials' (chapter 6), written by the lexicographer Patrick Hanks (1986), illustrates how useful large corpora can be for the development of more accurate dictionaries, and how they might also shed light on other information that should be included in modern dictionaries.
Biber and Finegan describe in chapter 7, 'Historical drift in three English genres', the change that three genres (fiction, essays and letters) have undergone since the eighteenth century. To aid their analysis they made use of automatic grammatical feature detection and the statistical method of factor analysis.
John Sinclair, the creator of the COBUILD corpus, touches in chapter 8 on considerations necessary in the design of corpora. These include issues such as overall size, design criteria, and the material included.
For his paper 'Cleft and pseudo-cleft constructions in English spoken and written discourse' (chapter 9), Collins used the LOB and the London-Lund corpora to compare spoken and written discourse with respect to clefts and pseudo-clefts, taking into account the communicative strategies they serve.
The next chapter, chapter 10, is the first of several statistical papers included in the book. Here, Gale and Church (originally published in 1989) show that a statistical method commonly used in corpus linguistics to estimate probability (adding one to each category before doing divisions) is not valid and should therefore not be used. They suggest instead the use of the Good-Turing method.
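For readers unfamiliar with the two estimators, the contrast can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names and the toy token list are hypothetical, and the Good-Turing part is the basic, unsmoothed form of the estimate (adjusted count r* = (r + 1) N_{r+1} / N_r, where N_r is the number of word types seen exactly r times), not necessarily the exact procedure Gale and Church recommend.

```python
from collections import Counter


def add_one_probability(count, total_tokens, vocabulary_size):
    """Add-one estimate: add 1 to every category count before dividing."""
    return (count + 1) / (total_tokens + vocabulary_size)


def good_turing_adjusted_counts(counts):
    """Basic Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.

    N_r is the number of types observed exactly r times. Where N_{r+1}
    is zero the raw count is kept; this fallback is an assumption of
    this sketch (practical implementations smooth the N_r values).
    """
    freq_of_freqs = Counter(counts.values())
    adjusted = {}
    for r, n_r in freq_of_freqs.items():
        n_next = freq_of_freqs.get(r + 1, 0)
        adjusted[r] = (r + 1) * n_next / n_r if n_next > 0 else float(r)
    return adjusted


def good_turing_unseen_mass(counts):
    """Total probability reserved for unseen types: N_1 / N."""
    total_tokens = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total_tokens


# Hypothetical toy data, for illustration only.
tokens = "a a a b b c c d e f".split()
counts = Counter(tokens)

adjusted = good_turing_adjusted_counts(counts)
unseen_mass = good_turing_unseen_mass(counts)
add_one_unseen = add_one_probability(0, len(tokens), len(counts))

print("Good-Turing adjusted counts:", adjusted)
print("Good-Turing mass for all unseen types:", unseen_mass)
print("Add-one probability for one unseen type:", add_one_unseen)
```

The add-one figure depends on an assumed vocabulary size, whereas the Good-Turing figure is derived from the number of types observed only once.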
In chapter 11, Brown and his co-authors describe how they bypassed traditional problems with machine translation by developing a computer system that works out for itself the relationship between equivalent sentences in two different languages (in this case French and English) using a large parallel corpus. This avoided the problem researchers had previously struggled with when they tried to formulate the rules that translators draw on and encode them in software applications.
Chapter 12, by Ihalainen, is an example of a dialect study. The author investigates a variation in verb syntax found in Southwest England.
Hellberg, the author of chapter 13, shows how he used both corpus and intuitive data to develop a comprehensive Swedish grammar.
'On the history of that/zero as object clause links in English' (chapter 14), written by Rissanen, is an example of the use of a historical corpus to investigate a certain English structure. Unlike the corpus used in chapter 7, this corpus was developed to be representative of the English language from the Dark Ages. The author shows that both that and zero existed in early written texts and that zero is therefore not a more recent omission, as has been claimed by some researchers.
In chapter 15, Burnage and Dunlop describe some of the many recording and encoding issues encountered in the development of the British National Corpus.
Chapter 16 is entitled 'Computer corpora - what do they tell us about culture?'. The authors, Geoffrey Leech and Roger Fallon, use the LOB and Brown corpora as representative corpora of British and American writing to examine whether the vocabulary used reveals any social or cultural differences. They were indeed able to show differences between the two varieties, but point out that these two corpora were developed in the early 1960s and that there might have been changes in language use since.
Douglas Biber, the author of chapter 17, shows in his paper 'Representativeness in corpus design' how statistical methods could be used to establish what might be seen as a fair sample size for a corpus.
In chapter 18, Gill Francis shows how closely tied grammar and lexicon are. She uses the very large COBUILD 'Bank of English' to illustrate her approach.
In chapter 19, which is situated in the area of computational linguistics and more specifically in the area of automatic natural language processing, Hindle and Rooth show that it is not always correct to assume that there is only one correct answer to automatic parsing. They specifically investigate at what point a prepositional phrase should be attached to a tree.
In his article entitled 'Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies' (chapter 20), William Louw shows that large corpora can reveal patterns of collocation between lexical items which cannot be predicted on the basis of their dictionary meaning. Some of these patterns can be found in literary writing, where they are used to achieve, for example, irony.
Chapter 21 describes one of the largest currently available corpora which is annotated for its clause structure as well as POS tagged, the Penn Treebank. This is an advance on older corpora which were generally raw corpora.
In chapter 22, Kenji Kita and his co-authors describe methods used to extract collocations from corpora. The two different methods used yield very different results. One measure they illustrate generates results which are arguably a lot more useful for language teaching purposes as well as for computational linguists.
Developing a POS parser capable of parsing naturally occurring language was a challenge taken up in the mid-1990s as computational linguistics developed even further. Briscoe and Carroll, the authors of chapter 23, tested such a parser, which incorporated probabilistic information, against a treebank and report recall and precision.
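Since recall and precision recur in several chapters, a brief illustration may help. The Python sketch below assumes that parser output and treebank analysis are compared as sets of labelled spans (label, start, end); this representation, the function name and the toy example are hypothetical and are not taken from the chapter, whose exact scoring scheme is not described here.

```python
def precision_recall(predicted, gold):
    """Score predicted constituents against gold-standard constituents.

    Both arguments are sets of (label, start, end) tuples.
    Precision = correct / number predicted.
    Recall    = correct / number in the gold standard.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical analyses of one five-word sentence.
gold_spans = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
predicted_spans = {("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5), ("PP", 3, 5)}

precision, recall = precision_recall(predicted_spans, gold_spans)
print("precision:", precision)
print("recall:", recall)
```

Here two of the four predicted constituents match the treebank, and two of the three treebank constituents are found, so a parser can score differently on the two measures.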
Chapter 24, authored by Tent and Mugler in 1996, explores the reasons for collecting a Fijian English corpus as part of the International Corpus of English by looking at the history and current role of English in Fiji.
Charniak, who is a leading advocate of Artificial Intelligence, argues in chapter 25 for parsers that extract their rules directly from treebanks (unlike the parser described in chapter 23, whose rules were developed from human linguistic experience). Charniak shows that this approach achieves good results and reports these as precision, recall and accuracy.
In chapter 26, Dieter Mindt shows how differently modals are presented in English teaching materials to how they are actually used by native speakers of English. He argues that a lot more work done by academics needs to be incorporated into EFL and ESL teaching materials and syllabi.
Data-oriented processing argues that what human language users have in their heads is not a system of rules extracted from experience but simply the experience itself. The authors of chapter 27, Bod and Scha, show experimentally that computer simulations of this type can produce impressive results.
Chapter 28, 'Conflict talk: a comparison of the verbal disputes between adolescent females in two corpora' by Hasund and Stenstoem, shows that corpora make it possible to investigate differences between the speech of social classes. By investigating the COLT corpus, the authors find quite distinctive differences in the kinds of dispute engaged in by adolescent girls in London from different social backgrounds.
In another statistics paper, chapter 29, Jean Carletta argues for the use of the kappa statistic to calculate inter-annotator agreement.
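To show why a chance-corrected measure differs from raw agreement, here is a minimal Python sketch of the two-annotator (Cohen's) form of kappa, kappa = (P_o - P_e) / (1 - P_e), where P_o is observed agreement and P_e is the agreement expected by chance. The choice of this particular variant, the function name and the toy labels are assumptions made for illustration; the chapter itself may use a different formulation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' individual proportions, summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[cat] / n) * (freq_b[cat] / n)
        for cat in set(freq_a) | set(freq_b)
    )

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four tokens by two annotators.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "V", "V", "V"]

kappa = cohen_kappa(annotator_1, annotator_2)
print("kappa:", kappa)
```

In this toy case the annotators agree on three of four items, but half of that agreement would be expected by chance, so kappa is considerably lower than the raw agreement rate.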
The author of chapter 30, Christopher Werry, investigates some of the features of Internet Relay Chat, which can be described as speech-like because of the physical constraints of the medium. He also shows that this type of interaction is very similar in different languages.
Chapter 31 discusses one problem at the lexical level encountered in natural-language processing: word-sense disambiguation. Algorithms for word-sense selection have not yet reached acceptable levels of reliability. The authors, Resnik and Yarowsky, report on some of the lessons learned from the SENSEVAL evaluation workshop.
In chapter 32, entitled 'Qualification and certainty in L1 and L2 students' writing', Hyland and Milton compare the lexical devices used to indicate epistemic modality in the English writing of British native-speaker and Hong Kong school leavers. They show that non-native speakers under- and overuse certain constructions used to express epistemic modality and that the manipulation of certainty and affect proves particularly difficult for L2 students.
Corpora also allow for annotation above the sentence level. One such annotation system is DAMSL, which is described by Core in chapter 33. DAMSL annotates speech-act features. The author discusses the motivation behind using machine learning to predict DAMSL tags automatically and describes an attempt at obtaining decision trees which predict DAMSL tags.
In the paper entitled 'Assessing claims about language use with corpus data: swearing and abuse' (chapter 34), McEnery and his co-authors investigate the functions of bad language by describing the ongoing construction of the Lancaster Corpus of Abuse (LCA).
McKelvie, in chapter 35, investigates dysfluencies such as pauses, fillers, repetitions, repairs and fresh starts to see how they relate to grammatical structure.
Pols et al., the authors of chapter 36, suggest that the success of a text-to-speech synthesiser should be evaluated using documents from large corpora (preferably in several different languages) rather than devised sentences.
One non-English corpus that has received widespread attention is the Prague Dependency Treebank, which is an annotated section of the Czech National Corpus. This corpus is of interest because it is annotated according to dependency analysis rather than phrase structure analysis, on which most English-language treebanks are based. In chapter 37, the authors discuss the automatisation of this annotation process.
In his paper 'Reflections of a dendrographer' (chapter 38), Sampson discusses the usefulness of treebank data for language engineering as well as the usefulness of software engineering in finding new insights for developing treebanks. This paper is based on a speech the author gave in honour of Geoffrey Leech in 1999.
In chapter 39, Carletta et al. argue for the use of XML as a generic markup language to be used for all corpora.
McEnery, the author of chapter 40, argues that the languages of India, Pakistan and Bangladesh are the most ignored languages in terms of language engineering, although there is a great need for work in this area, for example for translation studies. He argues that work in this field has only just started.
In chapter 41, Campione and Veronis, the authors of 'Semi-automatic tagging of intonation in French spoken corpora', describe an approach which partially automates annotation of prosodic features. Although their work is done on French, it is also applicable to other languages.
The author of chapter 42, Kilgarriff, claims that corpus compilation has become redundant as sufficient material is freely available on the web.
The final chapter focuses on intonation, which is crucial for speech to sound natural. Studying this phenomenon is central for the advancement of synthesized speech. For this purpose a research project at Cambridge University has set out to document the diverse intonation patterns in the British Isles. Grabe and Post show some of the results of this project.
It can be seen that the book has been compiled with a lot of thought, covering a large number of different topic areas within corpus linguistics. The editors' introductions to each chapter are very useful as they not only briefly summarize the chapter but also put it into context for the reader. All chapters are relatively short, so that they are not overwhelming for a reader new to the area, and all were selected for their importance to the field of corpus linguistics. The editors also supply a very useful list of URLs as an appendix. Personally, coming from an Applied Linguistics background, I would have preferred some more material on learner corpora, as can be found in the books by Granger (1998) and Granger, Hung, and Petch-Tyson (2002), more on the kind of corpus-based material now developed for language teaching purposes, as can be seen, for example, in Tim Johns's data-driven learning <http://web.archive.org/web/20040203111227/http://web.bham.ac.uk/johnstf/timconc.htm>, or more on how corpora can be used by language learners themselves. This area could have been covered more extensively, especially as the editors make repeated reference to the fact that most work on corpora has been initiated by the EFL profession.
Overall, however, the book is an extremely valuable resource to own, not only for corpus linguists as a reference, but also for those newly interested in the area who wish to understand the wider field of corpus linguistics as well as the historical development that it has undergone.
REFERENCES
Granger, S. (Ed.). (1998). Learner English on Computer. London, New York: Longman.
Granger, S., Hung, J., & Petch-Tyson, S. (Eds.). (2002). Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam, Philadelphia: John Benjamins Publishing Company.