LINGUIST List 12.1218

Wed May 2 2001

Review: Mair & Hundt, Corpus Linguistics

Editor for this issue: Terence Langendoen <terry@linguistlist.org>

What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at simin@linguistlist.org or Terry Langendoen at terry@linguistlist.org.

Directory

j.mukherjee, Review Mair/Hundt, Corpus Linguistics

Message 1: Review Mair/Hundt, Corpus Linguistics

Date: Wed, 2 May 2001 23:25:02 +0200
From: j.mukherjee <j.mukherjee@uni-bonn.de>
Subject: Review Mair/Hundt, Corpus Linguistics

Christian Mair and Marianne Hundt, eds. (2000) Corpus Linguistics and Linguistic Theory (Language and Computers: Studies in Practical Linguistics No 33). Amsterdam/Atlanta: Rodopi.

Reviewed by Joybrato Mukherjee, University of Bonn

This volume (announced on LINGUIST 12.272) comprises a selection of papers from the Twentieth International Conference on English Research on Computerized Corpora, which is usually referred to as ICAME 20 (International Computer Archive of Modern and Medieval English). The conference was held in Freiburg/Germany in May 1999. As pointed out by the convenors and editors of this book, the conference motto - taken up in the book title - reflects the view that it is "timely to focus on a discussion of the changing relationship between current practice in (ICAME-type) corpus linguistics and issues of linguistic theory exercising the field as a whole" (p. 1). Despite the widening theoretical scope of corpus linguistics which is put into perspective in this volume, all papers also represent the traditional "ICAME type" of corpus-based research in that they shed new light on specific aspects of authentic language use by way of extensive and empirical analyses of large corpora. In combining in-depth analyses of corpus data with discussions of their relevance to linguistic theory, this book no doubt makes for highly stimulating reading. In my view, it is a pity that the editors "have abstained from pigeonholing the contributions in any way and resorted to the more neutral alphabetical ordering" (p. 3). As things stand, I will, however, keep to the alphabetical ordering of papers in the following synopsis.

Synopsis

In the first paper, Bas Aarts makes a plea for more "qualitative" research in corpus linguistics. He points to the fact that mere statistics (or "number crunching") cannot be an end in itself, but that frequencies in corpora should always serve as a starting point for a truly functional approach accounting for quantitative data. New software programs such as the International Corpus of English Corpus Utility Program (ICECUP) allow such qualitative research to be conducted even in the field of syntax since syntactically parsed corpora are now available, e.g. the British component of the International Corpus of English (ICE-GB). Apart from those important theoretical considerations (which are exemplified by investigating the distribution of transitivity patterns in ICE-GB), this paper is also worth reading for another reason. Aarts starts off by giving an extract from an interview he conducted with Noam Chomsky at MIT. When reading the first exchange, any corpus linguist will most certainly laugh out loud and shake their head in disbelief: "Bas Aarts: What is your view of modern corpus linguistics? Noam Chomsky: It doesn't exist." (p. 5)

Bengt Altenberg and Karin Aijmer discuss the state of the art in cross-linguistic research which draws on parallel corpora. By diligently reviewing previous studies and systematising different kinds of parallel corpora (i.e. comparable corpora vs. translation corpora), they show how the English-Swedish Parallel Corpus (ESPC) can be analysed from different perspectives. The analysis of parallel corpora provides some important insights into discrepancies between different language systems. For example, agentless English passives tend to be translated not only into corresponding Swedish passive constructions, but also into active sentences with the generic pronoun "man" or a personal pronoun functioning as subject. Such system gaps call for a careful definition of an appropriate "tertium comparationis" as the point of reference in cross-linguistic research.

The identification of sociolinguistic factors conditioning the preferred selection of either "gonna" or "going to" in the spoken component of the British National Corpus (BNC) lies at the heart of Ylva Berglund's paper. The relevant factors include, for instance, speakers' age, education and social class. However, the quantitative data do not reveal a clear-cut correlation between speakers' sex and the preference for the standard variant over the reduced one. In light of the widely held view that women are wont to use standard variants more than men, this empirical analysis nicely shows that authentic language data may well contradict intuition-based assumptions.

Another well-established commonplace in linguistic research is challenged by Sylvie de Cock. Her paper is devoted to the use of highly recurrent word combinations (HRWCs) by advanced learners of English having French as their mother tongue. The database is provided by the relevant parts of several learner corpora at the University of Louvain. Comparable native-speaker databases serve as control corpora. The data reveal that the foreign-soundingness of learners' English is not only due to an overuse of cliche-like prefabricated sequences of words, but that the picture is rather more complex and includes at least four different aspects of non-nativelike language use: misuse (deviant use of English phrases which formally resemble but which are not semantically identical with French phrases, e.g. "on the contrary"); overuse (e.g. "and so on"); underuse (e.g. "sort of"); use of learner idiosyncratic combinations (e.g. *"according to me"). Furthermore, it is necessary to distinguish between spoken and written language since the two media differ with regard to the frequencies and distribution of those four aspects of non-nativelike use of HRWCs.

Learner English is also the object of inquiry in Pieter de Haan's article, though from a more technical perspective. He deals with principles and problems of automatically tagging non-native English. The focus is on how to come to grips with learner errors in the word-tagging procedure. To this end, the author introduces a taxonomy table for learner errors. On this basis, he describes possible solutions as to how the Tag Selection Tool may cope with different kinds of learner errors.

Inge de Moennink discusses the even more complex issue of syntactically parsing learner English. Although computer-aided error analysis is, in principle, feasible and applicable to corpus annotation, a full-fledged automatisation of the process seems to be impossible at present, rendering the parsing procedure extremely time-consuming. However, the semi-automatic solutions offered by the author represent very useful suggestions since the overall goal of parsing a learner corpus should not remain wishful thinking in the long run: firstly, this would allow for an empirical analysis of syntactic differences between learners' and native speakers' language use. Secondly, it would facilitate the development of an automatic error-tagging system.

In a similar vein to Bas Aarts' paper, Juergen Esser's discussion of "corpus linguistics and the linguistic sign" is a truly programmatic celebration of the conference motto. The close inspection of large amounts of corpus data calls for a refinement (or even a revision) of the Saussurean sign model. Firstly, its restriction to the acoustic image needs to be overcome. The signifier should be extended by considering the medium-bound realisations (orthographic and phonological) of, say, a word. Furthermore, the medium-dependent word-form realisations should be integrated into the possible medium-independent grammatical word-forms with specific ranges of meaning. For example, "tree" in singular form is attested with the meanings "plant" and "drawing" in the BNC, while the plural form "trees" is exclusively associated with the first meaning (i.e. "plant"). Secondly, these data warrant a differentiation of the meaning-side of the linguistic sign according to such sense-restrictions on specified word-forms. Accordingly, Esser introduces the notion of a lexical linguistic sign as "the union of a single sense and a set of medium-independent, abstract grammatical word-forms" (p. 97). Such form-meaning associations within the Saussurean lexical sign can be identified with the help of corpora.

Maria Estling's study illuminates so-called competing constructions by investigating the frequency and distribution of grammatical synonyms including the quantifiers "all", "both" and "half". Drawing on relevant parts of the CobuildDirect Corpus and newspaper corpora, special emphasis is put on the comparison of British, American and Australian usage. For example, while in American English there is no clear preference for either of the competing constructions "half a + modifier + noun" and "a half + modifier + noun", in British and Australian usage the former structure clearly outnumbers the latter. The author also draws some important general conclusions from the data. In particular, corpus analyses can help identify the most frequent variants which should be taught first to learners of English. Not the least likely to profit from such data are advanced learners seeking detailed information on when to use which competing construction.

Grammatical alternatives also play a role in Roberta Facchinetti's study. She explores the use of "be able to", which is suppletive to the modal "can" in non-finite contexts, in present-day English. By comparing two written standard corpora from the 1960s and the 1990s (and by taking into account a written sample corpus from the BNC), the author rejects the hypothesis that the use of "be able to" is on the increase. Her careful qualitative analysis reveals that "be able to" is used for specific semantic reasons, even if "can" or "could" are possible: for example, "be able to" may refer to the actuality of an event or the fact that the subject successfully manages to carry out the action. It would most certainly be interesting to look at spoken material in future research.

The Chemnitz InterNet Grammar (CING) is a contrastive and interactive learning environment available on the internet and designed for German learners of English. Angela Hahn, Sabine Reich and Josef Schmied sketch the descriptive potential of this on-line tool by focusing on how to teach the present progressive. CING includes an English-German translation corpus which provides a wide range of examples of translating the English progressive aspect into German (which has no immediate aspectual equivalent). Thus, learners have access to a well-chosen selection of translations illuminating the use of the progressive. The authors also suggest a theoretical model of the present progressive which basically includes two parts: "the reference time is included in event time" and "speech time = reference time" (p. 138).

Janet Holmes proves that corpus-based methods are relevant to sociolinguistics in general and gender studies in particular. Lakoff's (somewhat impressionistic) assumption that "lady" is gradually replacing "woman" is clearly and empirically rejected by Holmes who investigates standard British corpora from the 1960s and 1990s and New Zealand corpora from the 1980s. On the contrary, her in-depth semantic analysis of the data reveals that "woman" is now the unmarked term for referring to adult females (and taking them seriously) whereas "lady", once marked as polite and respectful, is increasingly associated with a negative semantic prosody which can be described as conservative, patronising, dated and trivialising. Thus, it does not come as too much of a surprise that "lady" is decreasing in terms of frequency, which, by the way, also holds true for "gentleman". Once again, careful observation of authentic language data calls into question long-established intuition-based hypotheses.

Gunther Kaltenboeck's paper breaks new ground in a corpus-based analysis of information structure. The object of inquiry is the alternation between it-extraposition and non-extraposition which has often been said to be linked to different distributions of weight and information. Analysing ICE-GB exhaustively, Kaltenboeck explores this issue empirically. To begin with, it-extraposition turns out to be the statistically unmarked form, accounting for almost 90% of all instances. By considering the context of all 1,918 examples at hand, the author gives a detailed and considered account of a multitude of syntactic, semantic, stylistic, pragmatic and information-structure factors that lead the language user to prefer one of the two arrangements. To pick out but one factor, non-extraposition is much more common in writing than in speaking. This, however, does not pertain to non-extraposed wh-clauses which are evenly distributed across the two media. Generally speaking, the scrutiny of (non-)extraposition in authentic contexts makes it clear that the two constructional types "do not show a one-to-one correspondence which would allow easy 'swapping'" (p. 158).

Another innovative paper is provided by Thomas Kohnen who applies corpus-based methods to the analysis of speech acts. Since it is difficult (not to say impossible) to operationalise the pragmatic notion of speech act in terms of linguistic form, he confines himself to performatives which tend to be realised in a restricted range of formal structures. The author not only describes the distribution of performatives across different genres in present-day English corpora, but also opens up a diachronic perspective by looking at the Old English section of the Helsinki Corpus. The tentative results he obtains from the diachronic point of view lead him to point out important questions which await further research, e.g. the issue of the historical development of speech act conventions of politeness and formality.

Uta Lenk's paper is devoted to a classic research topic in corpus linguistics which continues to merit attention: collocational frameworks. The author is particularly interested in so-called "stabilized expressions" including the lexeme "time" (e.g. "all + determiner + time") and their semantic potential. To this end, she investigates several spoken corpora, including those of British, American and New Zealand English. Her analysis casts new light on the specific meanings with which seemingly banal patterns and their variations are associated. For example, the stabilized expression "all this time" is used to refer in a neutral way to a relatively long and continuing period of time. Conversely, "all that time" tends to include "an expression of dismay at the extension of the duration of the period mentioned" (p. 189). Lenk's paper provides ample testimony of the fact that such semantic subtleties of collocational patterns should receive much more attention in foreign language teaching if learners are to acquire as much nativelike communicative competence as possible.

Corpus analysis has no doubt become a standard ("mainstream") methodology in linguistics. Despite (or because of?) this development, there is an increasing awareness that corpus-based methods should be reliable and empirically sound. In this context, Hans Lindquist and Magnus Levin discuss the issue of comparing data from different corpora. Such comparisons are often inevitable, but nonetheless problematical since different corpora tend to be compiled according to different standards and to differ in size, genres and other regards. Thus, results obtained from a comparison of different corpora should be taken with a pinch of salt, as the authors impressively reveal by means of many concrete examples. Furthermore, very large corpora may hide genre-specific facts since there is reason to believe that frequencies in language use are mainly bound to particular genres rather than to the language as a whole. In the last resort, the linguist cannot dispense with a careful consideration of the representativeness and comparability of the corpus material on which he or she draws.

Corpus-based methods are increasingly applied to diachronic studies. Accordingly, Manfred Markus looks at the use of causal connectors in Middle English as opposed to present-day English. From the wealth of interesting data, he draws three important general conclusions: (1) whereas in modern English speakers prefer causal conjunctions (especially "for" and "because"), adverbs (e.g. "therefore" and "thus") prevail in Middle English texts; (2) "because" in particular has changed from a conjunction "of the imprecise kind" (p. 227), i.e. referring to cause or result, to a genuinely causal connector in present-day English; (3) co-occurrences of causal adverbs and conjunctions are typical of Middle English as, for example, in "right so" and "all thus".

The present lack of software standardisation is discussed by Oliver Mason. Corpus-linguistic research turns out to be affected by what the author calls a "programming dilemma": on the one hand, software developers are not (and cannot be) aware of future research questions so that the software at times proves to be less than optimal for the issue at hand. On the other hand, corpus linguists who want to develop their own tailor-made software program have to start from scratch and face a very time-consuming process. In seeking to provide a way out of this dilemma, the author describes The Corpus Universal Examiner (CUE) System, a modular software program, and Qwick, a (simple but robust) corpus browser making use of CUE. They are available free of charge. What is more, the modularisation of the software allows for its application in many research projects since it is possible to adopt suitable modules and complement them with software modules developed individually.

Anneli Meurman-Solin attempts to re-categorise multi-word verbs on the basis of different strengths of cohesive ties that hold between verb and preposition. This also leads to a re-evaluation of the distinction between non-idiomatic free combinations of verb and preposition and idiomatic multi-word verbs. Special attention is paid to the use of "put" in complex-transitive complementation. The author argues that while the description of the clause pattern in "He put the evidence before the jury" as SVOA is, in fact, plausible, the idiomatic use of "put before" in "He put work before family" should be subsumed into the ditransitive complementation type: "work" and "family" function as two objects required by the idiomatic multi-word verb "put before" which has a distinct figurative meaning. Many other examples which support the author's view are obtained from the BNC. This paper nicely exemplifies the way in which language data themselves may lead to a reassessment of existing grammatical models.

Intonationists also benefit from corpus-linguistic methods. Ilka Mindt describes significantly frequent prosodic cues at speaker turns as obtained from the analysis of parts of the Lancaster/IBM Spoken English Corpus (SEC). In principle, there are two prosodic patterns which are formally different (considering F0-levels before and after the turn) and which fulfil different textual functions: (1) the "discontinuity pattern" consists of an extremely low endpoint before the turn and a very high starting point after the turn, signalling the discontinuity of a specific topic; (2) in the "continuity pattern", F0-levels before and after the turn are much closer together, indicating the continuation of the topic at issue.

Tore Nilsson presents a crisp and interesting analysis of noun phrases (NPs) in British travel texts. His 100,000-word corpus covers three categories: British tourist brochures, articles from the Sunday Times Travel Supplement and British travel guides. The results show, for example, that the newspaper articles display the simplest NP structures, whereas travel guides in particular are characterised by heavy NP postmodification. The author suggests some general explanations for those findings, mainly centering around the different communicative functions fulfilled by the text types.

Linguists at the University of Nijmegen have undoubtedly been in the vanguard of the development of a corpus-based approach to the systematic description of language use. Nelleke Oostdijk gives a progress report on the TOSCA (Tools for Syntactic Corpus Analysis) descriptive model. Especially with regard to the syntactic analysis of authentic spoken data, corpus linguistics has brought to light the need for a restructuring of existing descriptive grammars such as the Quirk grammars. The author, for example, points out how the TOSCA descriptive model deals with hesitation signals (e.g. "er") and discourse markers (e.g. "I mean") which elude traditional grammatical frameworks based on hierarchical relations of immediate constituency. The on-going development of the TOSCA descriptive model is an ambitious and impressive project in that it aims to accommodate the grammatical model to real language use. In so doing, the project clearly exposes the fallacy of considering syntax an entirely autonomous level of description.

Minna Palander-Collin focuses on the use of the evidential or epistemic expression "I think" in the language of husbands and wives in seventeenth-century letters. Two main conclusions are drawn from the quantitative and qualitative analysis of parts of the Corpus of Early English Correspondence: (1) in general, wives use "I think" much more often than husbands; (2) in particular, wives turn out to use "I think" predominantly for interpersonal purposes (e.g. in order to be conventionally indirect or to apologise). This paper highlights the importance of corpus analyses for empirically sound gender studies even in the field of historical linguistics.

Pam Peters investigates the use of synthetic and analytic comparatives and superlatives with 60 common disyllabic adjectives in the BNC. Almost all adjectives are attested in the two possible comparative and superlative forms respectively. However, some general, though at times surprising and contradictory trends can be detected: (1) disyllabic adjectives ending in "-y" tend to occur in the synthetic pattern (e.g. "easy/easier/easiest", but not "worthy" whose comparative and superlative forms usually are "more worthy" and "most worthy"); (2) quite a few disyllabic adjectives (e.g. "deadly") are shown to have a "crossover" pattern in that they habitually form analytic comparatives (e.g. "more deadly") but synthetic superlatives (e.g. "deadliest"). To a certain extent, those crossover patterns can be explained by collocational factors since some adjectives are often used in routinised phrases with the synthetic superlative (e.g. "deadliest weapon"). In conclusion, the author correctly suggests that the adjective paradigm seems to be "splintered rather than simply split" (p. 311).

The formal realisations and the functions of the present perfect are explored by Norbert Schlueter. The empirical and semantic analysis of spoken and written as well as British and American corpus material leads the author to identify two distinct functions of the present perfect: either it refers to an "indefinite past" or to a "continuative past" (both functions are sub-categorised further). It is shown that specific forms of the present perfect (e.g. active progressive) are strongly linked to specific functions (e.g. continuative past). As to the range of functions the present perfect fulfils, it is particularly interesting to see that the least common function (i.e. continuative past) tends to be marked by a temporal marker in two-thirds of all instances: quite obviously, there seems to exist a correlation between low frequency of function and high rate of linguistic specification.

The design and compilation of the Rostock Historical Newspaper Corpus is described in detail by Kristina Schneider. The 600,000-word corpus comprises British newspapers from 1700 to the present at 30-year intervals which were selected according to external criteria (circulation, price, frequency and time of publication) and internal criteria (news content, non-news content, layout). Thus, a large and powerful diachronic database is now available to linguists interested in historical developments in newspaper language and/or stylistic differences between down-market, mid-market and up-market newspapers across time.

Although the ICE project comprises regional subcorpora of only one million words each, their contrastive analysis allows for dialectological research into idioms, which is the topic of Paul Skandera's paper. He explores peculiarities in the use of idiomatic word combinations in Kenyan English against the background of British English usage (as laid down in ICE-GB). Kenyan English is shown to make use of idioms rarely or never attested in ICE-GB (e.g. "jerrican") and variant formal realisations of British English idioms (e.g. "quite fine"). Furthermore, idioms are used with different meanings (e.g. "whereby") and there are local coinages as well as loan translations/borrowings from indigenous languages (e.g. "jua kali"). It is to be hoped that this paper will stimulate corpus linguists to pursue idiomatic research on the basis of ICE data.

The issue of English-Swedish translations is discussed by Mikael Svensson. Swedish translators of English texts are faced with the serious problem that while English allows several elements before the finite verb, Swedish allows only one. Taking into account the importance of the sentence-initial, thematic position for textual progression, the author seeks to offer a set of principles according to which translators may choose one specific preverbal element. For example, if the English sentence has an initial element which fulfils a discourse-organising function (be it the subject or not), it should remain in initial position. If necessary, the subject should be moved into postverbal position, and heavy constituents should be placed in sentence-final position. Svensson's suggestions illustrate the immediate relevance of analyses of parallel corpora (e.g. the ESPC) to translation studies.

Bernadette Vine focuses on the methodological challenges which she encountered in her functional approach to directives in spoken corpora. At the outset, the identification of functional entities, such as directives, which have a virtually unlimited number of formal realisations poses a serious problem for the formalisation and automation of the search query. What remains is either a manual or a selective procedure. This article reminds the reader of Thomas Kohnen's comments on the limitations of corpus-based methods in pragmatic research (see above). Nevertheless, the author ends with an encouraging note of optimism: "Getting things done in an analysis of how people get things done is complicated and time-consuming, but also very interesting and rewarding." (p. 374)

In the late sixteenth century, language users had to choose between two possible second person singular pronouns: "you" or "thou". In analysing material from the Corpus of English Dialogues, Terry Walker compares the use of those pronouns (and their variants) in English Drama (i.e. constructed speech) and authentic speech from witness depositions. In all texts, "you" turns out to be the unmarked and neutral form. "Thou", on the other hand, represents the marked form for specific purposes (e.g. to express affection or intimacy) and, in quantitative terms, is shown to have already gone into a decline. Furthermore, men use "thou" more often than women.

In the final paper, Keith Williamson describes the lexico-grammatical tagging system that has been used in the historical linguistic atlas projects at the University of Edinburgh, covering Early Modern English and Older Scots. One of the many problems is caused by the enormous amount of orthographic (and phonological) variants. It seems to be necessary to consider etymological information so that the tagging procedure can be based on a pre-defined set of linguistic forms which have derived from a specific etymon. The issue of automatic parsing is even more complex, but should be pursued in future research since it would allow for the syntactic analysis of texts from a period of time in which language was in a mesmerising state of flux. On the whole, the author points out some important aspects of adjusting synchronic corpus technology to diachronic needs.

Critical evaluation

Christian Mair and Marianne Hundt have edited an excellent selection of papers. All articles are of good quality, in both content and style, and the proof-reading turns out to have been almost perfect. Only very few errata remain (e.g. *"Englis" on p. 320, *"decsribed" on p. 385). The volume covers a wide range of linguistic fields to which corpus-based methodology proves relevant. Living up to the book title, many empirical analyses of specific linguistic phenomena are complemented with thought-provoking discussions of either the implications and applications of the results in a wider setting or of theoretical and methodological principles and problems. Thus, it is to be hoped that not only will corpus professionals closely peruse "Corpus Linguistics and Linguistic Theory", but also that colleagues who are still sceptical about corpus linguistics will be tempted to get involved with corpus-based methods. Let me emphasise, though, that corpus linguists should not pay too much attention to the kind of criticism that has been put forward by generativists in particular for forty years now. Consider the way Noam Chomsky, in a response to Bas Aarts, rebuffs the corpus-linguistic enterprise in its entirety: "You don't take a corpus, you ask questions. You do exactly what they do in the natural sciences. (...) You have to ask probing questions of nature. That's what is called experimentation, and then you may get some answers that mean something. (...) You can take as many texts as you like, you can take tape recordings, but you'll never get the answer." (p. 6) The claim that corpus linguists are, from the outset, unable to provide scientific answers to linguistic questions is, to say the least, utterly ridiculous. The conference proceedings under discussion give thirty impressive examples of the amazing extent to which careful analyses of authentic language in real contexts result in important answers to central (and peripheral) linguistic questions - answers which are difficult (if not impossible) to obtain otherwise, answers which - in the editors' words - represent "detailed and testable accounts of language use in all its baffling complexity rather than a postulated underlying language system" (p. 3).

Biographical note

Joybrato Mukherjee is an Assistant Professor of Modern English Linguistics at the English Department of the University of Bonn. His research interests include corpus linguistics, stylistics, textlinguistics, syntax, intonation and EFL teaching. He is currently working on a corpus-based analysis of ditransitive verbs and their complementation patterns.