Editor for this issue: Naomi Ogasawara <naomi@linguistlist.org>

Granger, Sylviane and Stephanie Petch-Tyson, ed. (2003) Extending the Scope of Corpus-Based Research: New Applications, New Challenges, Rodopi.

Announced at http://linguistlist.org/issues/14/14-2103.html

Petek Kurtböke, Ph.D.

Much of the 1980s and 1990s was taken up by considerations of three major areas in the field of Corpus Linguistics: 1. corpus design; 2. corpus annotation (a. encoding; b. tagging; c. parsing); 3. linguistic exploration of the data (Oostdijk and de Haan 1994, Svartvik 1992, Meijs 1987). As we have moved into the 21st century, the focus of Corpus Linguistics has moved too, and publications such as the present volume are a sign that it really has. Such a volume is also a confirmation that the tension of the 1990s, between ''(a) those who want[ed] to know as much as possible about language [...] and (b) those who want[ed] to know as much as possible about what the computer c[ould] do'' (Quirk 1992), has relaxed. There now seems to be agreement that both approaches are equally valid and ''potentially complementary'', hence a collective effort to establish the direction and future of Corpus Linguistics research (see Grefenstette 1998 on ''Approximate Linguistics''). Both parties have used computers, the former to interpret and the latter to generate natural language.

Generally speaking, the term 'natural language' has been perceived as speech or writing produced in 'natural settings', with the term 'natural' meaning 'ideal': a setting where only one language is used, with its rules perfectly in place. Such a view has enabled the expert to approach language processes in procedural terms. In fact, computational applications in linguistics have so far tested the grammars proposed by theoretical linguists. There is an endless literature on these experiments and their results, in which the language, most commonly English, is treated in terms of a limited set of rules.

In Linguistics, then, in both theoretical and computational terms, there has been a tendency to view the 'natural setting' as monolingual, although this is hardly the case in everyday life. Long before the age of 'multiculturalism' and computers, most communities used at least some elements from a second language, or from several languages, as part of their daily communication. For example, communities in the Balkans have for centuries mixed two or three of the following languages in speech (in spite of nationalistic language planning movements to discourage this tendency): Greek, Turkish, Albanian, Croatian, Serbian, Slovenian and others. Similar examples may be listed from all over the world. Despite the prevalence of bilingual or multilingual settings, studies reporting computational treatment of mixed linguistic data are rare. In other words, no such data sets have been fully analyzed using computational techniques. Until recently, it was also uncommon to create corpora in bilingual or multilingual settings (Kurtböke 2000).

Researchers in three areas, LANGUAGE CONTACT, CORPUS LINGUISTICS and NATURAL LANGUAGE PROCESSING, are now starting to think about the problem of how to treat mixed linguistic data computationally, even though some still fail to go beyond the traditional ''borrowing''-''code-switching'' distinction. In corpus construction, on the other hand, some still discuss whether texts of mixed nature should be allowed into a corpus at all. And in computational research, entire funds are still dedicated to the resolution of monolingual grammars by developing more elegant yet robust systems. EXTENDING THE SCOPE OF CORPUS-BASED RESEARCH is a happy indication that the scene might be changing, with articles reporting on the analysis of contact data, be it in the local press (Hajar & Harjita on Malay-English pp. 159-175) or in language learning contexts (Aronsson on Swedish-English pp. 197-210; Neff et al. on Spanish/Dutch/Italian/French/German in contact with English pp. 211-230; Schmied on German-English pp. 231-247).

In the past, corpus texts were usually categorised according to their primary discourse function (Sinclair 1987:12; Rissanen et al. 1987). Biber's extensive work on the typology of English texts showed that a thorough definition of the target population based on the co-occurrence of grammatical features was possible (e.g. 1989, 1990). Text categorisation and document clustering have been of interest particularly to those in the area of Artificial Intelligence (e.g. Machine Translation), although research outcomes depend largely on how the available corpus has been accessed: raw, annotated or analysed (McNaught 1993).

Corpus annotation adds interpretative (especially linguistic) information to an existing corpus of spoken and/or written language, by some kind of coding attached to, or interspersed with, the electronic representation of the language material itself (Leech 1987, 1993). ''At the time that Biber conducted his research, no corpora were available that had been annotated with detailed syntactic information'' (p. 16). Since then fully-parsed corpora have become available (e.g. ICE-GB), and a structurally annotated corpus used to replicate Biber's Multi-Feature/Multi-Dimension method has ''simplified and improved the search for the linguistic features considerably'' (p. 23). Also, ''a factor analysis carried out on the frequency counts of a set of word class tags resulted in largely the same classification'' (De Mönnink, Brom and Oostdijk pp. 15-25).

Parsing has been one of the concerns of computational corpus research since the early 1970s (e.g. TOSCA in Nijmegen). Raw (spoken) data may need ''normalization'' before syntactic parsing can proceed, although how far the normalization procedures should go is still debated (Oostdijk pp. 59-85). Parsing a corpus in order to build a syntactic representation for it is of course hardly an end in itself.
The syntactic structure usually serves as input to some further processing towards the refinement of grammar descriptions (Wallis pp. 27-38).

In the previous decade, numerous corpus exploitation tools became available on the market (see detailed surveys by Schulze et al. 1994, Christ 1996). However, as the advances in computer technology facilitated the exchange of on-line textual resources and electronic transfer on the internet, the largest corpus has become the Web itself. This development has moved the focus of tool design from the exploration of a number of controlled and monitored corpora to one wild and uncontrollable corpus sans frontiers. Three articles in the present volume relate to this aspect of Corpus Linguistics: Renouf on the development of a new tool, ''WebCorp'' (pp. 39-58); Peters and Smith on how e-documents are slowly but firmly changing the conventional print documents (pp. 71-85); Schmied on the Internet Grammar project at Chemnitz University (pp. 231-247).

With the shift of emphasis in the late 1980s and early 1990s from language system to language use, it became obvious that the data extracted from corpora were more complex than was described by the rule-based systems. For example, the traditional parsing technology ignored certain aspects of the lexicon such as collocations and word associations since they were too difficult to capture using rule-based systems (Atkins et al. 1994). The sentence as the central unit of linguistic analysis was questioned (Sinclair 1996) and alternative units of analysis continue to be discussed today (Mukherjee on tone-unit pp. 21-134). As the 20th century came to an end, a prediction as to the future of Linguistics in general was that it would advance in two directions: computational corpus research and the mental lexicon (Halliday 1998). Sampson's article on WORDINESS - or LEXICAL DENSITY in Hallidayan terms (Halliday and Martin 1993) - in children's writing (pp. 177-193), and Kjellmer's article (pp. 149-158) on potential words which constitute unexpected ''lexical gaps'' in the Bank of English, are indeed evidence that Psycholinguistics and Corpus Research are coming closer.

Finally, Pérez-Paredes (pp. 248-261) shows that the tension between corpus-based and task-based approaches to language teaching is no longer there. Learner corpora in a classroom setting provide naturally occurring examples for the instant use of the teacher and the student, whereas in the past, language course books as well as traditional grammars and dictionaries used invented examples, which seemed intuitively right to the native speaker. With an electronic corpus available to the teacher and the learner, tasks in the classroom are now designed around the application of corpus examples to discourse organization. Hence corpus-based and task-based approaches no longer stand in opposition but have become complementary.

CRITICAL COMMENTS

a) More corpus research on typologically different language pairs

L2 acquisition has been a subject of corpus research before. For example, Biber et al. (1994) used corpus analysis to examine the development of discourse competence and register awareness of adult learners of English. Similarly, Lux and Grabe (1991) used corpus-based analysis to compare the compositions of university students, written in Ecuadorian Spanish and English. Also in Canada, the acquisition of French as a second language by Portuguese and other migrant groups has been investigated using a corpus-based approach (Bazergui et al. 1990). The studies reported in the present volume are a welcome addition to the Language Learning-Teaching research library. It would be worthwhile, though, to extend the boundaries of such investigation to more typologically different language pairs.

b) Corpus-based vs corpus-driven

The editors do not make reference to this significant distinction in corpus research, but the title of the volume, EXTENDING THE SCOPE OF CORPUS-BASED RESEARCH, must have been selected with this distinction in mind. In the corpus-driven (or data-driven) approach the linguist investigates the corpus with an open mind to discover how language really works, as opposed to the corpus-based approach, where the linguist first establishes the model and then investigates the corpus to find natural examples to fit into that model (Clear et al. 1996). While the majority of the contributions in the volume may be considered corpus-based, some may be considered corpus-driven (e.g. Gotti on the use of SHALL and WILL pp. 91-109; Kettemann, König and Marko on the morpheme ECO pp. 135-148).

c) Written vs spoken corpus material

The volume places emphasis on devising better methods of differentiation between speech and writing, although this seems to be a contradiction in terms. One cannot ignore that the use of the internet for daily communication, and the globalisation factor creating new diasporas, are two strong forces that are rapidly narrowing the gap between the spoken and written input.

d) Title

Lastly, most of the contributors are Corpus Linguists ''firmly established'' in their area of research and it is much to our community's benefit that they felt ''the need to ask [themselves] where the future [of Corpus Linguistics] lies'' (p. 9). With the points above considered, perhaps the title of the book could have been THE SCOPE OF CORPUS RESEARCH - A VIEW OF THE PRESENT IN TERMS OF THE PAST rather than ''Extending the scope of corpus-based research - new applications, new challenges''. A final word from the Editors Granger and Petch-Tyson on how they expect the work in progress reported in this volume to develop in the future would have made a nicer closure for an elegant volume showing how far Corpus Linguistics has come.

REFERENCES

Atkins, B. T. S., B. Levin and A. Zampolli (1994) Computational Approaches to the Lexicon: An Overview. In B. T. S. Atkins and A. Zampolli (eds.) Computational Approaches to the Lexicon. Oxford University Press, Oxford. pp. 17-45.

Bazergui, N. et al. (eds.) (1990) Acquisition du français chez des adultes à Montréal. Office de la langue française, Québec.

Biber, D. (1989) A Typology of English Texts. Linguistics 27:3-43.

Biber, D. (1990) Methodological Issues Regarding Corpus-based Analyses of Linguistic Variation. Literary and Linguistic Computing 5:4:257-269.

Biber, D., S. Conrad and R. Reppen (1994) Corpus-based Approaches to Issues in Applied Linguistics. Applied Linguistics 15:2:169-189.

Christ, O. (1996) Corpus Exploration Tools. Tutorial script. EURALEX 96, University of Göteborg, Sweden.

Clear, J. et al. (1996) COBUILD, The State of the Art. International Journal of Corpus Linguistics 1:2:303-314.

Grefenstette, G. (1998) The Future of Linguistics and Lexicographers: Will there be lexicographers in the year 3000? Plenary address. EURALEX 98, Proceedings, Univ. of Liège. pp. 25-41.

Halliday, M. A. K. (1998) Representing the child as a semiotic being (one who means). Plenary address. Intl. Conference on Representing The Child. Monash University, Melbourne. 2-3 October.

Halliday, M. A. K. and J. R. Martin (1993) Writing Science. The Falmer Press, London.

Kurtböke, P. (2000) 1001 texts: Ali Baba's Charcoal Chicken Delivery YapIlIr. Paper presented at 21st ICAME Conference, Macquarie University, Sydney.

Leech, G. (1987) General Introduction. In R. Garside et al. (eds.) The Computational Analysis of English - a corpus-based approach. Longman, London. pp. 1-15.

Leech, G. (1993) Corpus Annotation Schemes. Literary and Linguistic Computing 8:4:276-281.

Lux, P. and W. Grabe (1991) Multivariate approaches to contrastive rhetoric. Lenguas Modernas 18:133-60.

McNaught, J. (1993) User needs for textual corpora in Natural Language Processing. Literary and Linguistic Computing 8:227-234.

Meijs, W. (1987) Preface. In W. Meijs (ed.) Corpus Linguistics and Beyond - Proceedings of the Seventh International Conference on English Language Research on Computerised Corpora. Rodopi, Amsterdam. pp. ii-v.

Oostdijk, N. and P. de Haan (eds.) (1994) Corpus-Based Research into Language: In Honour of Jan Aarts. Rodopi, Amsterdam.

Quirk, R. (1992) On corpus principles and design. In Svartvik, pp. 457-469.

Rissanen, M., O. Ihalainen and M. Kytö (1987) The Helsinki Corpus of English Texts. In Meijs, pp. 21-32.

Schulze, B. M. et al. (1994) DECIDE Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora. MLAP Project 93-19.

Sinclair, J. (ed.) (1987) Looking Up: An Account of the Cobuild Project in Lexical Computing. Collins, London.

Sinclair, J. (1996) The Search for Units of Meaning. Textus 9:75-105.

Svartvik, J. (1992) Corpus linguistics comes of age. In J. Svartvik (ed.) Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Mouton de Gruyter, Berlin. pp. 7-13.

ABOUT THE REVIEWER

Petek Kurtböke holds a Ph.D. from Monash University, Melbourne. Her thesis was titled "A Corpus-driven Study of Turkish-English Language Contact in Australia" (1998).