LINGUIST List 14.3455

Fri Dec 12 2003

Review: Text/Corpus Ling: Leistyna & Meyer (2003)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>

What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley Collberg at collberglinguistlist.org.

Directory

Christophe Parisse, Corpus Analysis: Language Structure and Language Use

Message 1: Corpus Analysis: Language Structure and Language Use

Date: Thu, 11 Dec 2003 16:48:37 -0500 (EST)
From: Christophe Parisse <parisseext.jussieu.fr>
Subject: Corpus Analysis: Language Structure and Language Use

Leistyna, Pepi and Charles F. Meyer, ed. (2003) Corpus Analysis: Language Structure and Language Use, Rodopi, Language and computers: Studies in practical linguistics 46.

Announced at http://linguistlist.org/issues/14/14-2101.html

Christophe Parisse, LEAPLE-INSERM, Villejuif, France

SUMMARY

This book contains no fewer than fifteen articles about corpus analysis. The papers were originally presented at the 3rd North American Symposium on Corpus Linguistics and Language Teaching (Boston, 2001). The book opens with a very short introduction (three pages) that does nothing more than explain the rationale behind the order of the articles. I have to admit that this order is quite good, as it is not easy to classify the articles into clear-cut categories. Indeed, works about corpus analysis can be classified using at least three variables: the method (fully automatic, semi-automatic, manual, multi-dimensional analysis, lexical analysis, grammatical analysis), the theme (written language, oral language, academic language, historical texts, letters, conversations), and the domain (general linguistics, historical linguistics, sociolinguistics, pragmatic analysis, conversation turn analysis, second language acquisition, language teaching). There is a nice progression throughout the book, as every article has something in common with the previous one, whether in method, theme, or domain. I will comment on each chapter in turn and finish with general comments about the book as a whole.

1. ''It's really fascinating work'': Differences in evaluative adjectives across academic registers - Swales, J. M. & Burke, A.

The goal of this article is to document the differences between oral and written language. The authors base their results on two corpora (one oral, one written) that cover exclusively academic presentations. As stated in the title, the authors focus on adjectives that are used for evaluation purposes, such as 'important', 'serious', 'trivial', etc. Adjectives are automatically classified into seven categories: acuity, aesthetic appeal, assessment, deviance, relevance, size, and strength. Adjectives are also categorized as polarized (strong evaluation) or centralized (normal evaluation). There tend to be more adjectives in oral texts than in written texts, but the trend is reversed for the relevance and strength categories. Oral language tends to use more polarized adjectives. Results for all classes of adjectives are detailed and analysed.

2. ''But here's a flawed argument'': Socialisation into and through metadiscourse - Mauranen, A.

This article is based on oral language only. It studies vocabulary used for argumentation ('argue', 'claim', 'observe' ...). The corpus comes from academic situations and contains indications about the academic status of the speaker. An automatic search for all relevant lexical elements shows that different verbs are employed differently: for example, 'argue' is used more by senior faculty members and is also more frequent in monologues. In general, there are more verbs of argumentation in dialogic discourse, and more nouns in monologues. At first sight, there seem to be fewer occurrences of negative evaluations. However, a manual search revealed that this is not true. There are as many negative as positive evaluations, but negative constructions often use apparently positive forms ('argue', 'say', 'point', 'puzzle', 'seem', ...), so that they are somehow hidden in the text, and a more varied set of features.

3. Register-specificity of signalling nouns in discourse - Flowerdew, J.

Signalling nouns are a kind of anaphoric noun used to link one clause to another or one part of a clause to another (for example 'the process', 'the way', 'the issue', 'the question' ...). This work involves a comparison between oral language (lectures) and written language (textbooks). The process is semi-automatic: first all relevant words are extracted from the corpora and concordances are computed. Then, these concordances are processed manually to keep only the relevant ones (this is necessary as many signalling nouns are ambiguous). A distinction is made between signalling that occurs within a clause and signalling that occurs between two clauses. The results are analysed using Halliday's (1978) framework (field, tenor, mode). Results show that signalling is more frequent in written language than in oral language, and that collocations are less varied in oral language than in written language. The existence of register specificities can also be demonstrated.
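The semi-automatic procedure described above (extract candidate words, build concordances, then filter by hand) can be sketched as follows. This is only a minimal illustration, not the author's tool: the function name kwic, the tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each hit of keyword.

    Hypothetical helper: a keyword-in-context listing with `width` words
    of context on each side, built with a deliberately crude tokenizer.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, not taken from the corpora discussed in the chapter.
sample = "This raises an issue. The issue is that we issue new cards every year."
hits = kwic(sample, "issue")
for left, word, right in hits:
    print(f"{left:>12} | {word} | {right}")
```

The third hit in this sample is the verb 'issue', not a signalling noun, which is the kind of ambiguity that makes the manual filtering step necessary.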

4. Variation among university spoken and written registers: A new multi-dimensional analysis - Biber, D.

This work is a new multi-dimensional analysis similar to that of Biber (1988), but with a different corpus. The main interest of this analysis is that it allows text genres to be classified automatically using only syntactic features. After computing a large number of linguistic features (such as 'number of contractions', 'number of past tense verbs', 'number of nominalizations', ...), a statistical factor analysis is performed and four 'dimensions' (systematic co-occurrence patterns among linguistic features) are retained. The new corpus consists of university language, in oral and written form, produced in academic settings as well as in non-academic settings. It comprises nearly three million words. A list of 129 linguistic features was computed, but only 90 were used in the final analysis. The number of features is greater than in the previous study (Biber, 1988). The four dimensions that were found are: (1) oral vs. literate discourse; (2) procedural vs. content-focused discourse; (3) narrative orientation; (4) academic stance. The differences between these results and those of Biber (1988) are then discussed.
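To give a rough idea of how a text can be placed on one such dimension, the sketch below standardizes feature rates and adds them up with a sign per feature. It does not perform the factor analysis itself; it assumes the salient features and their signs are already known. The feature values, the signs, and the function names are invented for illustration and are not taken from the chapter.

```python
from statistics import mean, pstdev

# Invented toy rates per 1,000 words; not data from the chapter.
texts = {
    "lecture":  {"contractions": 30.0, "nominalizations": 10.0},
    "seminar":  {"contractions": 20.0, "nominalizations": 20.0},
    "textbook": {"contractions": 10.0, "nominalizations": 30.0},
}

# Hypothetical sign of each feature on a single oral vs. literate dimension.
signs = {"contractions": +1, "nominalizations": -1}

def zscores(feature):
    """Standardize one feature across all texts (mean 0, sd 1)."""
    values = [t[feature] for t in texts.values()]
    mu, sigma = mean(values), pstdev(values)
    return {name: (t[feature] - mu) / sigma for name, t in texts.items()}

def dimension_scores():
    """Sum the signed z-scores of the selected features for each text."""
    z = {f: zscores(f) for f in signs}
    return {name: sum(signs[f] * z[f][name] for f in signs) for name in texts}

scores = dimension_scores()
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {score:+.2f}")
```

With these toy values the lecture falls at the 'oral' end and the textbook at the 'literate' end, with the seminar in between.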

5. Linguistic dimensions of direct mail letters - Connor, U. & Upton, T.

This study uses the framework of Biber (1988); the dimensions are slightly different from those presented in the previous article. The corpus consists of direct mail letters in non-profit fundraising (191,540 words). These letters are very interesting because, although they appear at first glance to be highly personal (because of the use of personal pronouns), they are in fact very informational (they are at the bottom of dimension 1, nearly as much as academic texts). They also appear to relate some important tale, as they tend to include narrative elements, but this is contradicted by the dimensional analysis, which reveals that they are in fact highly non-narrative, even more so than academic texts, which were at the extreme of dimension 2. On the other dimensions of analysis, they are less strongly characterized, falling close to professional letters and academic prose.

6. Gender-based variation in nineteenth-century English letter-writing - Geisler, C.

This article also uses the framework of Biber (1988), but with a completely different type of text and with five dimensions of analysis. The corpus is made up of nineteenth-century letters written by an equal number of men and women. Differences can be found between men's and women's discourse. Men's writing tends to be more informational, to use more abstract features such as passives or longer words, and to use more noun phrase elaboration. Women's writing contains higher frequencies of features marking an involved, situated, and non-abstract style, such as private verbs, emphatics, and that-deletion. An interesting fact is that the differences between men's and women's language change over the century, and some trends from the beginning of the century are totally reversed by its end.

7. The grammar of stance in early eighteenth-century English epistolary language - Fitzmaurice, S.

This article is difficult to summarize, as it contains a lot of varied and detailed information. It is an in-depth study of the grammar of stance constructions in early modern English. The corpus consists of a set of letters from a group of eight men and six women associated with the essayist and diplomat Joseph Addison. This makes it possible to study the evolution of the grammar of stance during the beginning of the eighteenth century. A detailed linguistic analysis is presented, with many examples and figures about the number of lexical or modal verbs used in stance constructions.

8. Great vs. lovely: Stance differences in American and British English - Precht, K.

This study compares the frequencies of stance markers in British and American conversations. A specialized program, StanceSearch, is used to find stance markers automatically and to generate frequency figures for them. Stance markers are found in various parts of speech (lexical verb, adverbial, adjectival, noun, modal verb), and correspond to four main semantic categories: affect, evidentiality, amount, modality. There is a strong relationship between part of speech and semantic category, which is found in both British and American English. The difference between the two dialects is more a question of subtle lexical differences than of grammatical format.

9. ''What's in a name?'' Vocatives in casual conversations and radio phone-in-calls - McCarthy, M. J. & O'Keeffe, A.

This article is concerned with vocatives and compares a corpus of conversations with family and friends with a corpus of radio phone-in programmes. For the conversation corpus, vocative terms were found automatically, whereas the radio corpus was processed manually. All vocatives are classified into several categories: relational, topic manage, badinage, mitigator, turn manage, summons, and call manage. Each of the categories is analysed in turn for each corpus and a comparison between the corpora is made. The analysis is closely detailed with many examples and shows quantitative differences between the two situations.

10. Turn initiators in spoken English: A corpus-based approach to interaction and grammar - Tao, H.

This analysis bears strictly on the first word of a turn in a dialogue, as this word is necessarily a good candidate for turn management. This is admittedly a very restrictive definition of turn initiators (as the author himself points out). However, it makes a fully automatic analysis of any type of corpus possible. It appears that turn initiators are overwhelmingly lexical and that just the 20 most frequent forms make up 60% of the turn initiators. Some terms are highly specific to turn initiation (as they are encountered almost exclusively in this context) whereas other terms are used more often in other sentence positions. A detailed functional analysis of the different types of turn initiators is presented.
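Because the definition relies only on the first word of each turn, the counting step is easy to automate. The sketch below is a minimal illustration under that definition, not the author's procedure: the function names, the punctuation handling, and the sample turns are assumptions made for the example.

```python
from collections import Counter

def turn_initiators(turns):
    """Count the first word of every turn, ignoring empty turns."""
    firsts = []
    for turn in turns:
        words = turn.lower().split()
        if words:
            word = words[0].strip(".,!?")
            if word:
                firsts.append(word)
    return Counter(firsts)

def coverage(counts, n):
    """Share of all turn initiators accounted for by the n most frequent forms."""
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total

# Invented sample turns, not taken from the corpora used in the chapter.
turns = ["Yeah I think so.", "Well, maybe.", "Yeah.", "So what happened?", "Yeah, right."]
counts = turn_initiators(turns)
print(counts.most_common(3))
print(f"top-1 coverage: {coverage(counts, 1):.0%}")
```

On a real corpus, coverage(counts, 20) would give the kind of figure reported in the article for the 20 most frequent forms.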

11. Situational variation in intonational strategies - Yaeger-Dror, M., Hall-Lew, L. & Deckert, S.

This article focuses once again on oral language only. It is based on three different corpora and examines the intonational characteristics of negatives. The pitch of each negation occurring in a declarative sentence was classified, as well as relevant environmental prosodic information, turn stance and footing (informative, supportive, remedial, self-correcting, self-protecting, and hedge). As this manual procedure is very time-consuming, the number of elements tagged is limited. The authors find that pitch varies with register. They also find that not all negations have a prominent pitch. Prominence is found in purely informative negations, but more than 80% of all negations in interactive situations carry no pitch accent at all. The authors suggest that the principles at work in choosing the pitch are social more than cognitive.

12. On the radical difference between the subject personal pronouns in written and spoken European French - Fonseca-Greber, B. & Waugh, L. R.

This is unfortunately the only article that does not use English language corpora. French is a language well suited to comparing written and oral language, as there are a lot of differences between the two. Many grammatical features are marked in writing but are not pronounced aloud, with the exception of irregular forms. The use of verb tenses is also different in written and oral French. However, the authors focus on a lesser-known difference, namely the use of subject personal pronouns. They show that the first person plural pronoun has changed with regard to the traditional written format, and that the second person singular pronoun has acquired a new use (impersonal register). The sole problem with this study is that no corpus is used for statistics on written language. It remains to be demonstrated that the evolution observed in oral language has not also happened in written language, as norms computed from real (and recent) corpora can be quite different from traditional grammatical norms.

13. The world wide web as linguistic corpus - Meyer, C., Grabowski, R., Han, H.-Y., Mantzouranis, K. & Moses, S.

This article thoroughly evaluates how the world wide web can be used as a linguistic corpus. First, the authors stress that it is not possible to know its exact size, so that exact frequencies are difficult to compute. It appears that half of the web is in English and that most of the material has commercial content. They describe how search engines can be used to make lexical analyses of the corpus, as well as to find polysemous lexical items or syntactic constructions. They conclude that, although it is very difficult to control the nature of the information, valuable linguistic data can be extracted from the world wide web.

14. Corpus linguistics and second language acquisition: Rules and frequency in the acquisition of English multiple wh-questions - Bley-Vroman, R.

This work uses a mixture of corpus analysis and psycholinguistic experimentation to demonstrate that the knowledge of native speakers of English is different in nature from the knowledge of second language learners. The principle of the demonstration is that the performance of second language learners parallels the frequency of occurrence of syntactic forms, whereas this is not the case for native speakers. The demonstration appears correct at first sight, but the study itself points out that some of the results are contradictory. Also, it does not try to evaluate whether the frequencies encountered by native speakers match those encountered by non-native speakers, which, it seems to me, needs to be confirmed before drawing any definite conclusion from the study.

15. Comparing alternate complements of object control verbs: Evidence from the bank of English corpus - Rudanko, J.

The final study demonstrates that two verbs that appear quite similar ('pressure' and 'prevent') can be differentiated through corpus analysis. It appears that the constructions used by English speakers tend to be lexically determined, despite the fact that there is no grammatical reason for this tendency. Also, a comparison between British English and American English shows that constructions differ in the two dialects. These results would be very difficult to obtain without corpus data.

CONCLUSION

The average quality of the articles is very good. Each reader of the book will have different favourite papers, depending on their own interests. As someone who uses corpus analysis in his own research, I have found a lot of interesting elements that I may use in my future work.

This book is not for beginners at all. It is very dense and there is no introduction that explains the methods and goals of corpus linguistics. This limits the readership of the book to specialists in the field or advanced students. However, these readers will probably find the book very good. It covers a lot of approaches and themes, and it is of high scientific value. The general presentation of the book is good and the editors did a very good job in harmonizing the presentation of the different articles. I only wish that they had provided the reader with a real introduction to corpus linguistics. I think this would have made the book more interesting to a general linguistic readership. My other regret is that, of the fifteen articles, only one is not about the English language. It would have been nice to see corpus linguistics applied to inflectional languages, or attempts to quantify differences between languages. I guess that this was not the editors' choice but a consequence of the material they had to work with. I can only hope that such a book will promote the use of corpus linguistics in all fields of language research and in all languages.

REFERENCES

Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

Halliday, M. A. K. (1978). Language as social semiotic: The social interpretation of language and meaning. London: Arnold.

ABOUT THE REVIEWER

My main research interests are in language development. My main work is on the initial development of syntax (children aged one to four). The tools I use include computer simulation as well as psycholinguistic experiments. I work with children with language disorders as well as normally-developing children.