LINGUIST List 17.104
Fri Jan 13 2006
Review: Corpus Linguistics: Sinclair (2004)
Editor for this issue: Lindsay Butler
<lindsay linguistlist.org>
What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley at dooley linguistlist.org.
Date: 11-Jan-2006
From: Oliver Streiter <ostreiter web.de>
Subject: Trust The Text: Language, Corpus and Discourse
AUTHOR: Sinclair, John McH.
EDITOR: Carter, Ronald
TITLE: Trust The Text
SUBTITLE: Language, Corpus and Discourse
PUBLISHER: Routledge (Taylor and Francis)
YEAR: 2004

Announced at http://linguistlist.org/issues/15/15-2786.html

Oliver Streiter, National University of Kaohsiung, Taiwan

OVERVIEW

The book under review, ''Trust the Text: Language, corpus and discourse'' by John Sinclair, is a collection of 12 papers on written discourse structure, lexical structure, phraseology, lexicography and linguistic theory. All papers were published previously, between 1982 and 2003, but many of them are not easily accessible: some appeared in Festschriften, others are transcripts of lectures. The book thus tries to make these papers available to a wider audience.

The author, John Sinclair, is one of the most influential and original figures in contemporary linguistics. His focus on the analysis of spoken language and his practical and theoretical work in corpus linguistics, long before it became mainstream, have influenced many linguists and changed the face of modern linguistics.

SUMMARY

Most of the ideas presented in this collection have been discussed or assimilated by the research community and taken as a basis for further research. A summary of this follow-up work is beyond the scope of this review. What this review can do is identify and explain the key ideas presented in each paper and, finally, evaluate whether the book succeeds in disseminating these ideas. The book, edited by Ronald Carter, is organized in three parts, called 'Foundations', 'The organization of text' and 'Lexis and grammar'.

PART I Foundations

Paper 1: Trust the text

This paper argues that the availability of electronic corpora should lead to a re-evaluation of linguistic research traditions. It warns against the upward projection of proven linguistic techniques to areas with larger linguistic units.
For the analysis of discourse, therefore, new techniques and a new framework of description are needed. One notion introduced is ''prospection'' in spoken discourse. A prospection classifies what is going to follow in the discourse. Thus, in contrast to backward-oriented models, which focus on antecedents in the preceding discourse, it is argued that either the entire discourse is encapsulated via a reference in the current sentence (examples can be found in Paper 5, pg. 86, e.g. words like 'and', 'however', 'also' etc.) or the current sentence has been projected by the preceding discourse (when you say ''... has dramatic consequence.'', what follows will be understood as the consequences).

The paper then makes a number of claims which challenge established assumptions:

+ The idea of a stable lemma is questioned, as different word forms of a lemma have different patterns of meaning.
+ A word that can be used in more than one word class tends to have specific meanings associated with each word class. This correlation between word class and meaning breaks down when the words form part of idiomatic phrases or technical terms.
+ Words may have specific privileges or restrictions on how they are used (as subject, in prepositional phrases etc.).
+ Words have subliminal meanings, such as the verb 'happen', which refers to something nasty.
+ Grammar is a grammar of meaning and should state which meaning corresponds to which grammatical pattern.
+ Words are not selected independently but share meaning components which cannot be ascribed to a single word or a single morpheme.
+ As a result of the co-selection of related words, these words have to give up parts of their meaning. This is referred to as 'delexicalization'. Delexicalization is easily visible in adjective-noun combinations in which adjectives lose much of their meaning, for instance when they stress part of the meaning of the noun (e.g.
'physical bodies').

Paper 2: The search for units of meaning

This paper proposes a linguistic unit called the 'lexical item', a unit of lexical structure that is selected independently and then selects lexical or grammatical patterns for its expression. That words are not independent units can be seen from compounds, phrasal verbs, proverbs etc. Words are more or less dependent on each other, and this dependence lies somewhere between 'open choice' and 'idiom'. Open choice represents the 'terminological tendency', i.e. the tendency for each word to have a fixed, context-independent meaning. Idiomaticity represents the 'phraseological tendency', where words are selected together and make meanings from their combinations. While the terminological principle is traditionally seen as central to language, this paper focuses on the phraseological tendency. Phraseological combinations, even if considered to be fixed, allow for small variations that fit the combination into its context. In addition, the different components of a phraseological combination have distinct functions. This is taken as an argument for their co-selection.

The phraseological combination 'the naked eye' is analyzed. It is shown that it consists of a semantic prosody ('difficult'), a semantic preference ('see'), a colligation (preposition) and an invariable core, i.e. the collocation 'the naked eye', as in 'just visible to the naked eye'. For the phraseological combination 'true feelings', the lexical item consists of a semantic prosody ('reluctant'), a semantic preference ('communicate'), a colligation (possessive) and a collocation ('true feelings'), as in 'try to communicate our true feelings'. The semantic prosody and the semantic preference can be fused, as in 'conceal', 'hide' or 'mask'. A similar analysis is provided for the verb 'brook', which, because of its infrequent usage, might be more independent of the context.
But even for this verb, a complex lexical unit can be identified if sufficient corpus data are available.

PART II The organization of text

Paper 3: Planes of discourse

This paper integrates written language and discourse in one framework, as both are essentially interactive. Two notions are introduced. The 'autonomous plane' of discourse gives access to the record of experience of speakers by integrating previous experiences in the form of words and phrases in a text structure. The 'interactive plane' of discourse is in charge of negotiating between participants, selecting the effect of utterances and determining which features of the outside world utterances should incorporate. The organization of written text is also managed on the interactive plane, e.g. through predictions, anticipations, self-reference, discourse labeling and participant intervention.

Some operations allow attention to be switched between the two planes. 'Reports' transfer attention to the autonomous plane within an utterance, so that the author does not have to aver the reported fact. A 'reference' to the preceding discourse encapsulates the old interaction and makes it available on the autonomous plane. 'Quotes', however, remain on the interactive plane. In fiction, then, as in a report, the author no longer avers each utterance. However, she does not attribute the utterances to an author in the real world either. The evaluation at the end (laughter, moral) then marks the return to averral. The notions introduced in this paper are illustrated in the analysis of a fragment of fiction.

Paper 4: On the integration of linguistic description

This paper elaborates and illustrates the notions developed in the previous paper. It is shown how the identification of the interactive and autonomous planes of discourse can be used in a descriptive system (annotation scheme) for the analysis of written texts and spoken discourse.

Paper 5: Written discourse structure

This paper elaborates ideas presented in Paper 1 through the analysis of data. Of central importance is the idea of encapsulation. Each new sentence takes over from the previous sentence the status of 'state of the text'. By default, each new sentence encapsulates the previous one by a reference. This removes the discourse function from the previous sentence and leaves mainly a trace of meaning in memory, and only partially a trace of form. Encapsulation creates coherence, and cohesion is defined as the referencing act. Point-to-point references, e.g. a pronoun referring to its antecedent, are then interpreted mainly with reference to shared knowledge and not to the text. 'Logical acts' encapsulate the whole of the previous sentence (e.g. through the words 'but', 'therefore') or the previous half of the same sentence (e.g. through the words 'and', 'rather'). 'Deictic acts' also include the whole of the previous sentence (e.g. 'that', 'this'). A 'prospection' about the next sentence requires the next sentence to fulfill the expectations created if coherence is to be maintained. A text is analyzed to illustrate and discuss this notion. Different sub-types of prospection, such as prospection through an attribution, internal prospection or advance labelling, are introduced.

Paper 6: The internalization of dialogue

This paper tries to link spoken and written discourse in a single description and does so in a very original way. The author claims that properties of sentence grammar can be understood by relating grammatical structures (subordinate clause, relative clause, noun phrase etc.) to features of spoken interaction, and that in the phylogenetic development of languages these features of spoken interaction are internalized (understood as ''creating a (language)-internal representation of''). Through the internalization of the 'speaker change', a single speaker can change posture and present conflicting ideas.
The speaker, when marking this change, is no longer bound by the requirement to be coherent in his posture. Declarative, interrogative and imperative mood can equally be understood as internalizations of performative aspects of discourse. By internalizing them, the speaker can now achieve the same speech act with a combination of different moods. This extends the range and the finesse of mood choices and thus creates an open set of possible speech acts. The internalization of speech acts as subordinate clauses frees them from their interactive function. Thus, hypotheses can be formulated by the speaker. Through internalization, the move (i.e. the discourse unit) becomes a proposition, the averral becomes a truth value and the situational context becomes a possible world. When internalized as a restrictive relative clause, the clause may specify which referents are included under a denotation by reference to a possible world. Prepositional phrases and attributive adjectives are derived from these by leaving the truth value unexpressed (e.g. dropping the copula).

Paper 7: A tool for text explication

The author describes the history of text analysis/explication in its various forms (stylistics, discourse analysis) as a periodic movement between the poles of objectivity (e.g. using descriptive schemes) and subjectivity (to achieve a qualitatively rich analysis). In an impressive analysis of a small text fragment, the author shows how corpus data can be used in a qualitatively rich analysis of discourse strategies that is supported by massive objective data.

PART III Lexis and grammar

Paper 8: The lexical item

This paper starts from a historical account of the distinction between 'word' and 'lexical item'. The author revives the notion of 'lexical item' to describe the vocabulary in more meaningful terms, e.g. to account for the fact that a vocabulary is a limited set of meaningful items which in text can assume an unlimited number of meanings.
An alternative model, according to which words are exchanged within their paradigm, is rejected, as it creates artificial meanings and ambiguities which are not felt by a native speaker. Instead, a mechanism called 'reversal' is introduced, according to which meaning is created from the context and takes precedence over the meaning assigned in the vocabulary. When 'lexical items' are used in generation, there is less choice than with words and almost no ambiguity. The components of lexical items are those we have seen in Paper 2: the core and the semantic prosody (both obligatory), collocation, colligation and semantic preference. Through their syntactic flexibility (colligation) and semantic flexibility, lexical items allow for a limited paradigmatic choice and thus for integration with other lexical items in their context. New meanings are created when contextual constraints and lexical specifications do not match. The nature of a lexical item is illustrated in an analysis of the usage of the verb 'budge'.

Paper 9: The empty lexicon

This paper argues against the conception of language as a simple code for a message. According to the author, a message is only part of communication, and the message cannot easily be separated or distilled from the form, as many elements are concerned with negotiating the interaction and contributing to the message at the same time. Discussing terminology first, the paper contrasts the 'terminological tendency', where words have fixed meanings, with the natural flexibility and variability of language. The function that terminology has in the lexis is the same function that sublanguages have in grammar. Sublanguages also try to protect a chosen set of patterns and limit the influence of contextual factors on meaning. The terminological approach and the sublanguage approach are prevalent in a technical view of language, e.g. in Natural Language Processing. The technical approach is better suited to describing written language, especially scientific texts.
A proposal for a lexicon structure is elaborated. It includes two sublexica. One is similar to a termbank; the other is the flexible lexicon, which is initially empty. The lexicon learns about vocabulary from text and is constantly updated. The only fixed element in this lexicon is its structure. It has three subcomponents, (1) the form of a lexical item, (2) an environment and (3) a meaning, together with associations between elements of these subcomponents.

Paper 10: Lexical grammar

This paper discusses the notions of lexis and grammar. It explains why these notions have historically been seen as two separate entities. A model based on this opposition, however, cannot account for meaning. Neither the study of the lexis with the help of referential or logical semantics nor the study of grammar can assign meaning to syntagmatic patterns (cf. 'the naked eye'). Traditional frameworks cannot handle cross-border categories, semantic prosody or the vagueness of word classes. Without presenting an alternative model, however, the paper finishes with an exemplary analysis, similar to what we have seen with 'the naked eye'.

Paper 11: Phraseognomy

This short paper provides an analysis of the phrases 'Society of X' and 'Society for X'. It does not pretend to provide deeper insight beyond the specific example.

Paper 12: Current issues in corpus linguistics

This paper argues, essentially, against a number of ideas that are neither referenced nor fully described. The first argument is directed against the idea of a fixed, adequate lexicon for the purposes of Natural Language Processing and, related to it, the idea of sublanguage. The second fusillade goes against small corpora and the third against the (over-)annotation of corpora.

CRITICAL EVALUATION

While the overall impression of the book is very positive in terms of its intellectual challenges, its linguistic inspirations, the historical perspectives it offers and its capacity to bring together different lines of research, I will not withhold some critical remarks.

First, the contributions vary in quality, scope and relevance. Paper 11 is nice to read but lacks any import beyond what has been stated repeatedly in the book. Paper 12 I found simply annoying. It epitomizes a writing style in which positions are criticized with only a minimal summary of, or reference to, a specific person, publication or school. I have been forgiving throughout the book, seeing this style as the price for the wider view the author offers to the reader, but this paper does not offer this wider view, and the discourse slips into unfair and unscientific shadow-boxing:

''But when someone says their corpus does not need to get any bigger ...'' (pg. 188)

Second, statements such as the one above can only be understood in the light of the assumption that corpus linguistics is a scientific paradigm defined by the 'exemplary instance of scientific research' (Kuhn 1996/1962) realized by the author and his colleagues. Sometimes this assumption shows up in half-sentences:

''In corpus linguistics, by contrast, we have to work on the assumption that ...'' (pg. 170)

''[T]he vast majority of work with corpora still takes place under the assumptions of pre-corpus linguistics'' (pg. 176)

The author thus silently tries to monopolize the term 'corpus linguistics' and to assign it the meaning of what Tognini-Bonelli (2001) correctly identifies as the 'corpus-driven approach' within the area of corpus linguistics. The author thus denies the label 'corpus linguistics' to those researchers who understand corpus linguistics differently, e.g. as a (complementary) research method (Biber et al. 1998).

Third, the general tendency in these articles to cite research only when it can be integrated en passant, or in order to fire a broadside at 'computational linguistics' or 'structural linguistics', is counterproductive to the advancement of science. As Kuhn (1996/1962) has taught us, new paradigms come up not only with a new theory but also with new data. And this is what the author does extraordinarily well. But as long as the data of the other paradigms cannot be accounted for, or shown to be artificial data or to represent an artificial problem, we have two theories (old and new) which describe different data derived from the same world. Much would have been gained in this book if, instead of repeatedly providing new data for theory verification, an analysis of other theories' data had been given (e.g. in Paper 5, the so-called donkey sentences of Kamp & Reyle 1993, or in Part III, Mel'cuk's 'heavy smoker' (1974) or Pustejovsky's 'fast car' and 'fast secretary' (1995)).

Finally, attempts to make the language of the book accessible have either not been made or have not been successful. Sentence structure is unnecessarily complex, e.g.:

''This chapter concerns the relation between the two types of patterns that are mainly recognized as the means whereby language creates meaning.'' (pg. 164)

and sometimes barely understandable:

''A user community that kept clearly separate the language that was used in a particular subject-matter area, and whose usage in that area differed markedly from its other usage and the usage of comparable communities, while remaining largely within the rules of the general language - such conditions would identify a sublanguage.'' (pg. 152)

''Professional linguists should not be surprised to experience a rather disturbing effect from the massive surge in the availability of evidence and the growing sophistication of the tools for examining it and testing hypotheses against it that corpus linguistics has brought.'' (pg.
173)

To sum up, the content of the book will serve as a rich source of inspiration to those who are involved in corpus linguistic research, lexicography and discourse analysis. The book, however, is not suited as a general introduction and certainly not as a textbook for university courses. The price of the book, the writing style and the fragmented presentation of ideas mean that the ideas will remain difficult to access.

REFERENCES

Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.

Hans Kamp & Uwe Reyle (1993) From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Dordrecht, Kluwer Academic Publishers.

Thomas S. Kuhn (1996/1962) The Structure of Scientific Revolutions. University of Chicago Press, 3rd edition.

Igor A. Mel'cuk (1974) Opyt teorii lingvisticeskix modelej Smysl <=> Tekst. Semantika, sintaksis. Izdatel'stvo ''Nauka'', Moskva.

James Pustejovsky (1995) The Generative Lexicon. MIT Press, Cambridge.

Elena Tognini-Bonelli (2001) Corpus Linguistics at Work. Benjamins.

ABOUT THE REVIEWER

Oliver Streiter teaches computational linguistics and corpus linguistics at the National University of Kaohsiung, Taiwan. His current research focuses on applications in Computer Assisted Language Learning ("Gymn zilla") and on a project which aims at the compilation and annotation of linguistic resources to support low-density languages.