LINGUIST List 14.2667

Fri Oct 3 2003

Review: Text/Corpus Ling: Granger & Petch-Tyson (2003)

Editor for this issue: Naomi Ogasawara

What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at


  1. Petek Kurtböke, Extending the Scope of Corpus-Based Research

Message 1: Extending the Scope of Corpus-Based Research

Date: Thu, 02 Oct 2003 23:39:17 +0000
From: Petek Kurtböke
Subject: Extending the Scope of Corpus-Based Research

Granger, Sylviane and Stephanie Petch-Tyson, ed. (2003) Extending the
Scope of Corpus-Based Research: New Applications, New Challenges,

Announced at

Petek Kurtböke, Ph.D.

Much of the 1980s and 1990s was taken up by considerations of three
major areas in the field of Corpus Linguistics: 1. Corpus design;
2. Corpus annotation (a. encoding; b. tagging; c. parsing);
3. Linguistic exploration of the data (Oostdijk and de Haan 1994,
Svartvik 1992, Meijs 1987). As we have moved into the 21st century,
the focus of Corpus Linguistics has moved too, and publications such
as the present volume are a sign that it really has.

Such a volume is also a confirmation that the tension of the 1990s,
between ''(a) those who want[ed] to know as much as possible about
language [...] and (b) those who want[ed] to know as much as possible
about what the computer c[ould] do'' (Quirk 1992), has relaxed. There
now seems to be agreement that both approaches are equally valid and
''potentially complementary'', hence a collective effort to establish
the direction and future of Corpus Linguistics research (see
Grefenstette 1998 on ''Approximate Linguistics'').

Both parties have used computers, the former to interpret and the
latter to generate natural language. Generally speaking, the term
'natural language' has been perceived as speech or writing produced in
'natural settings', with the term 'natural' meaning 'ideal' in a
setting where only one language is used with its rules perfectly in
place. Such a view has enabled the expert to approach language
processes in procedural terms. In fact, computational applications in
linguistics have so far tested the grammars proposed by theoretical
linguists. There is endless literature on these experiments and their
results, in which the language, most commonly English, is treated in
terms of a limited set of rules.

In Linguistics, then, in both theoretical and computational terms,
there has been a tendency to view the 'natural setting' as
monolingual, although it is hardly the case in everyday life. Long
before the age of 'multiculturalism' and computers, most communities
used at least some elements from a second language as part of their
daily communication, or from more languages. For example, in the
Balkans, communities have for centuries used a mixture of two or
three of the following languages side by side in speech (in spite of
nationalistic language planning movements to discourage this
tendency): Greek, Turkish, Albanian, Croatian, Serbian, Slovenian and
others. Similar examples may be listed from all over the world.
Despite the prevalence of bilingual or multilingual settings,
studies reporting computational treatment of mixed linguistic data are
rare. In other words, no such data sets have been fully analyzed
using computational techniques. Until recently, it was also uncommon
to create corpora in bilingual or multilingual settings (Kurtböke

Researchers in three areas, LANGUAGE CONTACT, CORPUS LINGUISTICS and
NATURAL LANGUAGE PROCESSING, are now starting to think about the
problem of how to treat mixed linguistic data computationally, even
though some still fail to go beyond the traditional ''borrowing'' vs
''code-switching'' distinction. In corpus construction, on the other
hand, some still discuss whether texts of mixed nature should be
allowed into a corpus at all. And in computational research, entire
funds are still dedicated to the resolution of monolingual grammars by
developing more elegant yet robust systems.

that the scene might be changing, with articles reporting on the
analysis of contact data, be it in the local press (Hajar & Harjita on
Malay-English pp. 159-175) or in language learning contexts (Aronsson
on Swedish-English pp. 197-210; Neff et al. on
Spanish/Dutch/Italian/French/German in contact with English
pp. 211-230; Schmied on German-English pp. 231-247).

In the past, corpus texts were usually categorised according to their
primary discourse function (Sinclair 1987:12; Rissanen et al. 1987).
Biber's extensive work on the typology of English texts showed that a
thorough definition of the target population based on the
co-occurrence of grammatical features was possible (e.g. 1989,
1990). Text categorisation and document clustering have been of
interest particularly to those in the area of Artificial Intelligence
(e.g. Machine Translation), although research outcomes depend largely
on how the corpus available has been accessed: raw, annotated or
analysed (McNaught 1993). Corpus annotation adds interpretative
(especially linguistic) information to an existing corpus of spoken
and/or written language, by some kind of coding attached to, or
interspersed with, the electronic representation of the language
material itself (Leech 1987, 1993). ''At the time that Biber conducted
his research, no corpora were available that had been annotated with
detailed syntactic information'' (p. 16). Since then, fully parsed
corpora have become available (e.g. ICE-GB), and using a structurally
annotated corpus to replicate Biber's Multi-Feature/Multi-Dimension
method has ''simplified and improved the search for the linguistic
features considerably'' (p. 23). Also, ''a factor analysis carried out
on the frequency counts of a set of word class tags resulted in
largely the same classification'' (De Mönnink, Brom and Oostdijk
pp. 15-25).
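
As a rough illustration of what such frequency counts look like before
any factor analysis is run, the sketch below normalises word class tag
counts per 1,000 tokens for a single text. It is a minimal Python
sketch, not the authors' procedure: the function name `tag_rates`, the
tag labels and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def tag_rates(tagged_text, tagset, per=1000):
    """Return the rate of each tag in `tagset` per `per` tokens.

    `tagged_text` is a list of (word, tag) pairs for one text.
    """
    counts = Counter(tag for _, tag in tagged_text)
    total = len(tagged_text)
    if total == 0:
        # An empty text has no meaningful rates; report zeros.
        return {tag: 0.0 for tag in tagset}
    return {tag: counts[tag] * per / total for tag in tagset}


# Hypothetical toy text with made-up tag labels.
sample = [("the", "DET"), ("corpus", "N"), ("was", "V"), ("parsed", "V")]
rates = tag_rates(sample, ["N", "V", "DET", "ADJ"])
```

One such vector of rates per text, stacked into a text-by-tag matrix,
is the kind of input a factor analysis would then work on.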

Parsing has been one of the concerns of computational corpus research
since the early 1970s (e.g. TOSCA in Nijmegen). Raw (spoken) data may
need ''normalization'' before syntactic parsing can proceed, although
how far the normalization procedures should go is still debated
(Oostdijk pp. 59-85). Parsing a corpus in order to build a syntactic
representation for it is of course rarely an end in itself. The
syntactic structure usually serves as input to some further processing
towards the refinement of grammar descriptions (Wallis pp. 27-38).

In the previous decade, numerous corpus exploitation tools became
available on the market (see detailed surveys by Schulze et al. 1994,
Christ 1996). However, as the advances in computer technology
facilitated the exchange of on-line textual resources and electronic
transfer on the internet, the largest corpus has become the Web
itself. This development has moved the focus of tool design from the
exploration of a number of controlled and monitored corpora to one
wild and uncontrollable corpus sans frontières. Three articles in the
present volume relate to this aspect of Corpus Linguistics: Renouf on
the development of a new tool, ''WebCorp'' (pp. 39-58); Peters and
Smith on how e-documents are slowly but steadily changing conventional
print documents (pp. 71-85); Schmied on the Internet Grammar project
at Chemnitz University (pp. 231-247).

With the shift of emphasis in the late 1980s and early 1990s from
language system to language use, it became obvious that the data
extracted from corpora were more complex than was described by the
rule-based systems. For example, traditional parsing technology
ignored certain aspects of the lexicon such as collocations and word
associations since they were too difficult to capture using rule-based
systems (Atkins et al. 1994). The sentence as the central unit of
linguistic analysis was questioned (Sinclair 1996) and alternative
units of analysis continue to be discussed today (Mukherjee on the
tone-unit pp. 21-134).

As the 20th century came to an end, a prediction as to the future of
Linguistics in general was that it would advance in two directions:
computational corpus research and the mental lexicon (Halliday 1998).
Sampson's article on WORDINESS - or LEXICAL DENSITY in Hallidayan
terms (Halliday and Martin 1993) - in children's writing
(pp. 177-193) and Kjellmer's article (pp. 149-158) on potential words
which constitute unexpected ''lexical gaps'' in the Bank of English
are indeed evidence that Psycholinguistics and Corpus Research are
coming closer.
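
For readers unfamiliar with the measure, lexical density is often
approximated as the share of content words among all running words.
The Python sketch below follows that simplification; it is not
Sampson's method, and the small function-word list and the name
`lexical_density` are assumptions made for illustration.

```python
# Hypothetical, deliberately tiny function-word list; a real study
# would rely on part-of-speech tags or a much fuller list.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it", "on"}


def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Return the proportion of content words among alphabetic tokens."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    content = [w for w in words if w not in function_words]
    return len(content) / len(words)


density = lexical_density("The cat sat on the mat".split())
```

Here three of the six words (cat, sat, mat) count as content words, so
the toy sentence has a density of one half.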

Finally, Pérez-Paredes (pp. 248-261) shows that the tension between
corpus-based and task-based approaches to language teaching is no
longer there. Learner corpora in a classroom setting provide
naturally occurring examples for the instant use of the teacher and
the student, whereas in the past, language course books as well as
traditional grammars and dictionaries used invented examples which
seemed intuitively right to the native speaker. With an electronic
corpus available to the teacher and the learner, tasks in the
classroom are now designed around the application of corpus examples
to discourse organization. Hence corpus-based and task-based
approaches no longer stand in opposition but they have become


a) More corpus research on typologically different language pairs
L2 acquisition has been the subject of corpus research before. For
example, Biber et al. (1994) used corpus analysis to examine the
development of discourse competence and register awareness of
adult learners of English. Similarly, Lux and Grabe (1991) used
corpus-based analysis to compare the compositions of university
students, written in Ecuadorian Spanish and English. In Canada,
the acquisition of French as a second language by Portuguese and
other migrant groups has been investigated using a corpus-based
approach (Bazergui et al. 1990). The studies reported in the present
volume are a welcome addition to the Language Learning-Teaching
research library. It would be worthwhile though to extend the
boundaries of such investigation to more typologically different
language pairs.

b) Corpus-based vs corpus-driven
The editors do not make reference to this significant distinction in
corpus research but the title of the volume EXTENDING THE SCOPE OF
CORPUS-BASED RESEARCH must have been selected with this distinction in
mind. In the corpus-driven approach the linguist investigates the
corpus with an open mind to discover how language really works, as
opposed to the corpus-based approach where the linguist first
establishes the model and then investigates the corpus to find natural
examples to fit into that model (Clear et al. 1996). While the
majority of the contributions in the volume may be considered
corpus-based, some may be considered corpus-driven (e.g. Gotti on the
use of SHALL and WILL pp. 91-109; Kettemann, König and Marko on the
morpheme ECO pp. 135-148).

c) Written vs spoken corpus material
The volume places emphasis on devising better methods of
differentiation between speech and writing, although this seems to be
a contradiction in terms. One cannot ignore that the use of the
internet for daily communication, and the globalisation factor
creating new diasporas, are two strong forces that are rapidly
narrowing the gap between the spoken and written input.

d) Title
Lastly, most of the contributors are Corpus Linguists ''firmly
established'' in their area of research and it is much to our
community's benefit that they felt ''the need to ask [themselves]
where the future [of Corpus Linguistics] lies'' (p. 9). With the
points above considered, perhaps the title of the book could have been
PAST rather than ''Extending the scope of corpus-based research - new
applications, new challenges''. A final word from the editors, Granger
and Petch-Tyson, on how they see the work in progress reported in this
volume developing in the future would have been a fitting close to an
elegant volume showing how far Corpus Linguistics has come.


Atkins, B. T. S., B. Levin and A. Zampolli (1994) Computational
Approaches to the Lexicon: An Overview. In B.T.S. Atkins and A.
Zampolli (eds.) Computational Approaches to the Lexicon. Oxford
University Press, Oxford. pp. 17-45.

Bazergui, N. et al. (eds.) (1990) Acquisition du français chez des
adultes à Montréal. Office de la langue française, Québec.

Biber, D. (1989) A Typology of English texts. Linguistics 27:3-43.

Biber, D. (1990) Methodological Issues Regarding corpus-based Analyses
of Linguistic variation. Literary and Linguistic Computing

Biber, D., S. Conrad and R. Reppen (1994) Corpus-based Approaches to
Issues in Applied Linguistics. Applied Linguistics 15:2:169-189.

Christ, O. (1996) Corpus Exploration Tools. Tutorial script. EURALEX
96, University of Göteborg, Sweden.

Clear, J. et al. (1996) COBUILD, The State of the Art. International
Journal of Corpus Linguistics 1:2:303-314.

Grefenstette, G. (1998) The Future of Linguistics and Lexicographers:
Will there be lexicographers in the year 3000? Plenary
address. EURALEX 98, Proceedings, Univ. of Liège. pp. 25-41.

Halliday, M. A. K. (1998) Representing the child as a semiotic being
(one who means). Plenary Address. Intl. Conference on Representing The
Child. Monash University, Melbourne. 2-3 October.

Halliday, M. A. K. and J. R. Martin (1993) Writing Science. The Falmer
Press, London.

Kurtböke, P. (2000) 1001 texts: Ali Baba's Charcoal Chicken Delivery
YapIlIr. Paper presented at 21st ICAME Conference, Macquarie
University, Sydney.

Leech, G. (1987) General Introduction. In R. Garside et al. (eds), The
Computational Analysis of English - a corpus-based approach. Longman,
London. pp. 1-15.

Leech, G. (1993) Corpus Annotation Schemes. Literary and Linguistic
Computing 8:4:276-281.

Lux, P. and W. Grabe (1991) Multivariate approaches to contrastive
rhetoric. Lenguas Modernas 18:133-60.

McNaught, J. (1993) User needs for textual corpora in Natural Language
Processing. Literary and Linguistic Computing 8:227-234.

Meijs, W. (1987) Preface. In W. Meijs (ed.) Corpus Linguistics and
Beyond - Proceedings of the Seventh International Conference on
English Language Research on Computerised Corpora. Rodopi,
Amsterdam. pp. ii-v.

Oostdijk, N. and P. de Haan (eds.) (1994) Corpus-Based Research into
Language: In Honour of Jan Aarts. Rodopi, Amsterdam.

Quirk, R. (1992) On corpus principles and design. In Svartvik,
pp. 457-469.

Rissanen, M., O. Ihalainen and M. Kytö (1987) The Helsinki Corpus of
English Texts. In Meijs, pp. 21-32.

Sinclair, J. (1996) The Search for Units of Meaning. Textus 9:75-105.

Sinclair, J. (ed.) (1987) Looking Up: An Account of the Cobuild Project
in Lexical Computing. Collins, London.

Schulze, B. M. et al. (1994) DECIDE Designing and Evaluating
Extraction Tools for Collocations in Dictionaries and Corpora. MLAP
Project 93-19.

Svartvik, J. (1992) Corpus linguistics comes of age. In J. Svartvik
(ed.) Directions in Corpus Linguistics. Proceedings of Nobel Symposium
82. Stockholm, 4-8 August 1991. Mouton de Gruyter, Berlin. pp. 7-13.


Petek Kurtböke holds a Ph.D. from Monash University, Melbourne. Her
thesis was titled "A Corpus-driven Study of Turkish-English Language
Contact in Australia" (1998).