LINGUIST List 16.1893
Sun Jun 19 2005
Review: Corpus Ling/Applied Ling: Aston et al. (2004)
Editor for this issue: Naomi Ogasawara
<naomi linguistlist.org>
What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley at collberg linguistlist.org.
Directory
1. Przemysław Kaszubski, Corpora and Language Learners
Message 1: Corpora and Language Learners
Date: 16-Jun-2005
From: Przemysław Kaszubski <przemka amu.edu.pl>
Subject: Corpora and Language Learners
EDITORS: Aston, Guy; Bernardini, Silvia; Stewart, Dominic
TITLE: Corpora and Language Learners
SERIES: Studies in Corpus Linguistics 17
PUBLISHER: John Benjamins
YEAR: 2004
Announced at http://linguistlist.org/issues/16/16-33.html
Przemyslaw Kaszubski, School of English, Adam Mickiewicz University, Poznan, Poland
SUMMARY
CORPORA AND LANGUAGE LEARNERS features a selection of papers presented at the fifth meeting of the biennial TaLC (Teaching and Language Corpora) conference, which was held in Bertinoro, Italy, in the summer of 2002. The book is divided into five parts, the central sections exploring three areas involving corpora and learners: "Corpora by learners" (i.e. corpus-based studies of learner language, 6 papers), "Corpora for learners" (various types of target language corpora, 4 papers), and "Corpora with learners" (data-driven learning, 3 papers). These 'core' contents are braced by two more general contributions: a proposal for a corpus-informed theory for applied linguistics, and an overview of prospects for applying the Web to corpus-based pedagogy. An index (pp. 301-305) and contributors' bionotes (pp. 307-311) complement the volume.
In their "Introduction: Ten years of TaLC", the editors, previewing the book's organization and contents, note the field's constantly evolving and diversifying efforts to optimize the link between corpus application and language pedagogy. Central to these efforts are attempts to understand learners and their needs, and the necessity to resolve the vexed notion of input 'authenticity', surfacing in several papers.
The first major contribution, Michael Hoey's "The textual priming of lexis", is the one that offers "A theory for TaLC". The author claims that lexical units -- central to his proposal -- display the property of becoming loaded ('primed') in a mind exposed to frequently repeating patterns of usage.
Priming may concern any broadly understood grammatical and collocational properties, both within and beyond the sentence. Thus, a word may, for example, be primed for acting as a noun or verb, for representing certain meanings, for preceding or following specific modification patterns (colligation), for appearing in particular textual positions (textual colligation), for contributing to textual relations (e.g. a Problem-Solution pattern), etc. Such primings are, in addition, relative to specific genres and domains of use. Priming may "change through an individual's lifetime" (p. 24); it also precedes grammatical categorizations, which are likely to be post hoc creations. According to Hoey, effective studies of primings must be based on specialized corpora, which do not regularize any specific preferences in favour of the 'big picture'. Evaluating the pedagogical relevance of his theory, the author points to the role of teachers and materials in ensuring correct, though gradual, priming of lexical content, properly contextualized. Priming may also account for creative uses of language, which, as Hoey claims, can breach some -- but never all -- of the priming constraints (the latter would produce "non-language"). Overall, priming theory, recently elaborated in a monograph (Hoey 2005), merits attention in that it aptly positions, and legitimizes, corpus-based lexical research within the larger scope of psycholinguistics, language variation, and acquisition theory.
The first paper in the "Corpora by learners" part is Yukio Tono's "Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English". Solidly grounded in L1 and L2 acquisition theory and Levin's division of verb classes, the paper lays a methodological claim in favour of a multiple corpus comparison method in corpus studies of learner language.
Tono shows that combining interlanguage (IL) material (at possibly various stages of proficiency) with, on the one hand, appropriate target language corpora (here: English textbooks) and, on the other, comparable L1 corpora, can make it possible to capture computationally diverse effects influencing SLA, such as the L1 effects, the L2 input effects, or the developmental effects. The advanced linguistic analysis relies on syntactic parsing, database systems and log-linear analysis of clusters, whose brief discussion some readers may find a little obscure. The author's concluding wish is to see international collaboration for the development of a computational model of SLA. In "New wine in old skins? A corpus investigation of L1 syntactic transfer in learner language", Lars Borin and Klaus Prütz attempt to investigate the syntax of Swedish university-level students through frequencies of part-of-speech (POS) n-grams (a procedure feasible for languages with fixed-order syntax, as the authors rightly point out, p. 71). A contrastive, multi-corpus environment is also advocated here, the corpora ranging between 350,000 and 1 million tokens. The counted frequencies of 1-4 grams (excluding sequences containing proper nouns and punctuation, as well as, controversially, those exclusive to either language) are compared and tested statistically (Mann-Whitney), revealing a predominant overuse pattern in the learner data. The authors illustrate their findings and compare them with earlier studies, most notably Aarts and Granger (1998). The final outcome is far from definitive, but the discussion sheds interesting light on the significance of methodological decisions for this kind of research, such as about the size of the adopted tagset or the degree of manual adjustment in the frequency lists. 
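For readers unfamiliar with the counting step behind such POS n-gram studies, it can be sketched in a few lines of Python. This is only an illustrative sketch, not Borin and Prütz's actual procedure: the tag names, the toy sentence and the function names are assumptions made here, and the statistical comparison between corpora (Mann-Whitney in the paper) is left out.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Return all contiguous n-grams (as tuples) from a list of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def ngram_counts(tagged_sentences, max_n=4):
    """Count POS n-grams of length 1..max_n, never crossing sentence boundaries."""
    counts = Counter()
    for tags in tagged_sentences:
        for n in range(1, max_n + 1):
            counts.update(pos_ngrams(tags, n))
    return counts

# Hypothetical tag sequence for one learner sentence (tagset invented for illustration)
learner_sentences = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
counts = ngram_counts(learner_sentences)

bigram_det_noun = counts[("DET", "NOUN")]  # the DET NOUN bigram occurs twice
total_ngrams = sum(counts.values())        # 5 + 4 + 3 + 2 n-grams of length 1..4
```

Frequency lists of this kind, built separately for the learner and reference corpora, are what would then be compared statistically; the size of the tagset directly affects how many distinct n-grams appear.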
Agnieszka Lenko-Szymanska's "Demonstratives as anaphora markers in advanced learners' English" adopts a comparatively lighter computational approach and a more traditional comparison paradigm, with a Polish university learner corpus (PELCRA; four proficiency levels) set against just a native speaker corpus norm (BNC Sampler). The applied log-likelihood and chi-square statistics demonstrate that Polish learner writers overuse distal anaphoric signals ('that', 'those'), primarily in the determiner function, and that the problem does not seem to disappear with rising proficiency. The author accounts for that by pointing to the lack of appropriate, explicit explanations in the grammar books. In "How learner corpus analysis can contribute to language teaching: A study of support verb constructions", Nadia Nesselhauf presents yet another learner corpus research scheme, in which the uses of 'make', 'have', 'take', and 'give' in support constructions (extracted by eyeball analysis of concordance lines), are judged for appropriacy not just against a comparable native English reference corpus (written BNC), but also using lexicographic sources and native-speaker informants. The author also undertakes to seek correspondences and clusters across the error types (despite rather low frequencies). Some of the suggested implications for teaching may seem obvious (e.g. that frequency information in learner data is insufficient and should be complemented by appropriate native-speaker genre/text-type frequency); more importantly, Nesselhauf reminds us of the need to consider non-corpus factors in judging errors, such as the degree of communicative disruption. One interesting pedagogical suggestion for her data is the idea of focusing learners' attention on instances where single verb uses differ semantically from the corresponding support constructions (e.g. 'take notice' vs 'notice'). Lynne Flowerdew's article "The problem-solution pattern in apprentice vs. 
professional technical writing: An application of appraisal theory" explores the possibility of applying the systemic-functional Appraisal framework of categorizing evaluative language to an analysis of cross-corpus keyword and key-keyword listings generated with Scott's WordSmith Tools. The author concentrates on apprentice and professional authors' use of 'inscribed' (explicitly evaluative, e.g. 'problem') vs. 'evoking' (inviting evaluation, e.g. 'impact') lexis in signalling problems and/or solutions. The findings indicate that, for the genre in question, the majority of keywords identified are indeed problem-solution in nature, and that while learner writers tend to use inscribed terms for both the Problem and Solution elements, native-English professionals signal problems with more evoking terms. This, Flowerdew argues, may be a teaching-induced phenomenon; other encountered incongruities are put down to the inequality of topics in the two corpora under comparison.
Ngoni Chipere, David Malvern and Brian Richards' paper "Using a corpus of children's writing to test a solution to the sample size problem affecting type-token ratios" is primarily computational in character. The authors review and criticize various existing measures of lexical richness, in particular the type-token ratio (TTR), and put forward their own formula for a D parameter, which is independent of the text sample size and, as empirically tested in the study, better correlated with varied proficiency levels, determined by human scorers and certain known measures (word length, text length). The D metric thus appears especially well suited for tracing linguistic development, and it is only regrettable that the authors do not provide download or ordering details for readers wishing to test the tool (by comparison, a mildly criticized measure, standardized TTR, is readily available in WordSmith Tools).
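The sample size problem itself is easy to demonstrate. The minimal Python sketch below shows how the plain TTR falls as a text grows even though its vocabulary does not change, and how averaging over equal-sized chunks removes that effect. It does not implement the authors' D parameter, whose formula is not given in this review; the function names and the toy sentence are assumptions, and the chunk averaging follows only the general idea of a standardized TTR, not necessarily the exact procedure used in WordSmith Tools.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def standardized_ttr(tokens, chunk_size):
    """Mean TTR over consecutive full chunks of chunk_size tokens (leftover tokens are ignored)."""
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        return 0.0
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)

# Toy data: one 8-word sentence with 6 distinct words, then the same sentence four times over
sentence = "the learner reads the text and the teacher".split()
longer_text = sentence * 4

short_ttr = type_token_ratio(sentence)         # 6 types / 8 tokens
long_ttr = type_token_ratio(longer_text)       # still 6 types, but now 32 tokens
chunked_ttr = standardized_ttr(longer_text, 8) # each 8-token chunk has TTR 6/8
```

The longer text scores lower on plain TTR purely because it is longer, which is the distortion any size-independent measure has to avoid when texts of different lengths are compared.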
Opening the "Corpora for learners" section, Ute Römer's "Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching" assesses the linguistic value of pedagogical materials for classroom use on the example of spoken 'if' constructions. While conceding the point about the impossibility of fully transferring the contextual authenticity of attested language to a classroom setting, the author declares confidence in learners' ability to adapt, and in the overall positive influence of naturalistic language exposure as opposed to special input. Suggestions for applying findings contrasting the language of the scanned German textbook conversations and the evidence of the spoken BNC are also offered. Römer's optimism may be open to some question, as thus far relatively little empirical evidence exists confirming the effectiveness of corpus-driven material selection; conversely, authors such as Aston (2001: 8), Gabrielatos (2005), or Nesselhauf and Mauranen (this volume) admit the necessity of considering non-frequency factors. An interesting proposal for a corpus-based stylistics programme is described by Bernhard Kettemann and Georg Marko in "Can the L in TALC stand for Literature?". The authors plan to offer an integrative and 'hands-on' awareness-raising course for students at English departments (in particular at Graz University), whose knowledge often gets excessively compartmentalized. It is claimed that corpus-based analyses of literary texts should help students integrate their knowledge and build five important, inter-related types of awareness: (1) language, (2) discourse, (3) literary, (4) cultural/social, and (5) methodological / metatheoretical (= how to organize and logically conduct research). 
Although it is still at an early stage of design, Kettemann and Marko characterize in considerable detail each part of their course, providing illustrations of concordancing and other corpus activities (e.g.: how to discuss the role of performatives retrieved on the basis of "I * you" frame searches in a Shakespeare corpus). Special attention is devoted to methodological awareness, which is meant to build gradually throughout the course, incorporating such elements as acquisition of strict research procedures, co-textual and transtextual analysis of data, and the faculty of critical analysis. The authors hope that, when properly combined with other components in the curriculum, their course may be successful, especially in view of its focus on culturally vital literary texts. The possibility of enhancing academic speaking skills with the help of the Michigan Corpus of Academic Spoken English (MICASE) is in turn reviewed by Anna Mauranen ("Speech corpora in the classroom"), who reports on the responses from a teacher and her students after running such an experimental course. While the teacher found corpus use fascinating and stimulating (though humbling), students' appreciation depended on the level of computer-literacy. Most cited problems sound familiar: the need for longer pre-training, high time cost, the questionable value of corpora for less proficient learners. In addition, some users found inductive learning uncomfortable and studying frequency irrelevant. In her comments on these results, Mauranen proposes that pedagogical authenticity of corpora be seen as including both 'objective authenticity' (the linguistic evidence) as well as 'subjective authenticity' (how students relate to corpus material); secondly, she notes that the appeal of corpus material may relate to its discourse nature: "interactively saturated" spoken data may deactivate students more than, e.g., written prose. 
Other issues concern adapting corpus activities to analytically processing learners (e.g. adults), and taking a stand on the native-English vs. English as a lingua franca (ELF) controversy.
In "Lost in parallel concordances", Ana Frankenberg-Garcia gives recipes for using parallel concordancing in a general language course. The assumption is that such practice encourages explicit L1-L2 comparison, which, as current research shows, may facilitate rather than necessarily impede effective learning, since it engages students and, provided the teacher is sufficiently experienced, brings to the fore relevant L1-related difficulties. "Navigating through a parallel corpus" may depend on whether uni-directional or bidirectional translations are available. Frankenberg-Garcia considers all the different options for initiating parallel searches (beginning with source texts in L1, source texts in L2, target texts in L1, or target texts in L2) and compares their pedagogical value. Some interesting points are raised (e.g. about the possibility of using L1 translations as models), although the activities presented seem unsupported by classroom practice, which poses the question of their genuine effectiveness. What is perhaps lacking is some proof of parallel concordancing actually outperforming bilingual dictionaries in some contexts. Also, little attention is paid to the age or proficiency of learners, or the importance of genres. Overall, the paper emerges as a catalogue of ideas that may (some would say should) be useful, but which have yet to be proved so. (For those interested, an online version of the paper is available at http://www.linguateca.pt/Repositorio/Frankenberg-GarciaTALC2002.rtf )
The third section of the volume, "Corpora with learners", begins with Passapong Sripicharn's research report on "Examining native speakers' and learners' investigation of the same concordance data and its implications for classroom concordancing with ELF learners".
In the recounted experiment, six BA-level Thai and British students were presented with brief, pre-selected concordance material and asked to perform three simple tasks: (1) compare collocations of two verbs, (2) name the difference between two groups of sentences arranged according to grammatical patterns, (3) guess the meaning of a concordanced word, complete a gapped line and (most interestingly) justify the answer during a taped interview. The results showed that the Thai students were eager to apply data-driven strategies, while the native-English students preferred to rely on intuition, generalize beyond the data, question the evidence and call up exceptions. Such results, while probably anticipated, may have been prompted by the non-native English group having been introduced to concordancing prior to the experiment. This flaw in the set-up appears rather unfortunate; however, the study validly points out that concordancing does not always have to be used in a data-driven way (cf. Aston 2001: 22-25), and that limited corpus evidence can invite overgeneralizing -- a point that teachers preparing material should beware of.
In "Some lessons students learn: Self-discovery and corpora", Pascual Pérez-Paredes and Pascual Cantos-Gómez outline a corpus-based, form-focused protocol designed to help English learners attain greater awareness of and control over their spoken performance. For convenience, the protocol only monitors the use of words. Students access and query hyperlinked transcriptions and audio recordings of their aural output, and, guided by a series of open-ended questions, compare the statistics from their own file with the average class results and then with data derived from reference corpora (it is, however, not clear which corpora are used for reference). Pérez-Paredes and Cantos-Gómez describe their system as promoting Nunan's fifth stage of learner autonomy (learner as researcher) and report generally positive feedback from their students.
A convenient feature of this networked database environment is that student data are processed statistically (cluster analysis), allowing teachers to classify learners by performance. Overall, the system described is an interesting example of how learner-corpus data can inform IT solutions for learning, a promising line of development for intelligent CALL (I-CALL).
In the final paper of the book's third section, entitled "Student use of large, annotated corpora to analyze syntactic variation", Mark Davies describes his corpora-supported advanced online course in Spanish syntax, in which students learn to retrieve and combine data from multiple corpora in order to solve variation tasks. The corpora are large (100 M words; 200 M words; and the Spanish web -- Google and Google Groups), diversified, and, in one case, richly annotated to enable more powerful searches (Davies' Corpus del Español resembles in this respect his VIEW interface to the British National Corpus, http://view.byu.edu/ ). The course is not corpus-driven, however: hands-on practice follows readings from a grammar book, and mainly involves testing the validity of the rules and claims found there. The author emphasizes the importance of an intensive, task-based training stage, and of supervising students' early projects during which they can develop expertise in choosing and combining corpora and search patterns. A valuable pedagogical suggestion is the shift from purely quantitative to more explanatory tasks in mid-course. Concluding, Davies argues that, at advanced levels of proficiency, even less experienced students can learn to use and appreciate corpora, a cogent point considering the author's enormous experience in the field.
In the last, forward-looking article on "Facilitating the compilation and dissemination of ad-hoc web corpora", William H. Fletcher summarizes the current possibilities for linguistic exploitation of the World Wide Web and outlines the prospects for future developments.
According to Fletcher, "[t]he quantity of information online greatly surpasses its overall quality" (p. 275); on the other hand, the infrequency of some phenomena and genres and the inevitable ageing of finite corpora force linguists to embrace the web. Techniques of access range from the most widely known 'browsing' to 'hunting', 'grazing' and automatic 'crawling', but none of them guarantees immediately high quality results to linguistic searches. There is therefore a strong need to filter search engine output by applying linguistic and heuristic "noise-reduction techniques", which, however, can unduly prolong access time. Fletcher considers two possibilities for breaking the deadlock: (1) the creation of a special Web Corpus Archive (WCA), whereby professionals would help one another by analysing and classifying web content and submitting reports which would trigger automatic download and annotation of the pages for future use; and (2) the creation of a special Search Engine for Applied Linguists (SEAL), enabling direct, highly sophisticated KWiC concordancing of the web. Neither solution is free from problems (securing copyright, providing sufficient processing power, etc.). Fletcher then compares his 'idealistic' visions with the existing facilities: online concordancers for static corpora, commercial meta search engines, web concordancers (e.g. WebCorp), the Internet Archive ('Wayback machine'), advanced linguistic search engines. A practical tip resulting from this discussion is that students should be taught "responsible online searching techniques". Overall, the paper brings a useful, if only slightly lengthy (28 pages), overview of the workings of the web-as-corpus sub-domain, supported by a set of numerous URL addresses that IT-minded teachers should be willing to explore.
EVALUATION: As transpires from the extended summary above, Aston et al.'s 2004 collection will be a valuable resource for teachers seeking working and prospective solutions, as well as up-to-date theoretical motivations, for corpus-informed teaching practice. The book offers an admirable continuation to Aston's edited volume of 2001 as well as to the previous volumes of TaLC proceedings. A sceptical reader could require more theory and more empirical verification, but there is no doubt that the field of 'applied corpus linguistics' (a broader term, borrowed here from the name of an American association and a recent volume of proceedings from a conference it organized) is growing, maturing and slowly developing its standards (Hoey, Mauranen, Römer). This progress should lead to the establishment of models for integrating corpora with other teaching methods and programmes -- a key to success not just in academic education (Kettemann and Marko), but also at schools.
As noted by several authors, some technical and practical issues must be resolved before corpus-driven tasks can be added to the bank of regular in-course activities. Both Davies and Mauranen point out the need for extensive, task-based pre-training, a point all the more vital if the level of initial computer literacy affects students' motivation and performance. In addition, the relatively long time required to complete corpus-based activities may confine them to some tasks and/or some learners: more empirical testing is needed to explore such feasibilities. Thirdly, ensuring universal and dependable access (cf. Fletcher) to 'optimal' corpora, both general and specialized, large and small, will be another key factor determining the popularity of corpus methods among teachers and learners.
Despite these yet unresolved problems, Aston et al.'s collection clearly demonstrates that enough experience has been accumulated in this area for a comprehensive resource book for teachers to be offered, which could recommend specific tools, corpora, methods, techniques, exercises, etc., for meeting specific teaching aims in a typical (not necessarily task-based, Gabrielatos 2005) language learning syllabus. Compared with data-driven learning, the 'behind-the-scenes' (Aston 2000) approach, i.e. corpus-based linguistic research, is well entrenched. The large size of the "Corpora by learners" section shows that learner corpora have become a staple component of corpus networks exploited for educational purposes (all major ELT publishers today rely on their collections of learner data). Ignoring corpus evidence is likely to lead to artificiality of input, which many applied corpus linguists openly criticize. However, as already mentioned, corpus-derived results, even those supported by the most sophisticated statistical methods, must be used wisely and in proportion with other factors. On the other hand, as this volume richly demonstrates, progress in learner corpus research is on-going and constantly diversifying inasmuch as ever larger and better annotated resources are created and new (networking) technologies are reached for (e.g. Tono, Pérez-Paredes and Cantos-Gómez). The field of pedagogical exploitation of corpora is thus hardly ready to settle, inviting interested educators continually to refresh their position on its development. Of course, Aston et al.'s volume could not be comprehensive in this respect. There are no apparently weak papers in the reviewed volume, although, as indicated, some contributions could be questioned for methodological assumptions or for insufficient scepticism. Additionally, some debatable omissions in the use of sources may be noted, e.g. Römer's lack of reference to earlier word-based analyses of written textbooks (e.g. 
Ljung 1990) or Kettemann and Marko's lack of mention of the Web Concordances service. Fletcher, at the time of writing his article, could not have heard of the WebCorp team's plans to develop their own linguistic search engine, or of LexWare Culler -- a fast, Google-based web concordancer equipped with part-of-speech search syntax and lemmatization rules for grouping results (several major languages are supported). These gaps, however, hardly undermine the overall quality of the volume. The editing is also generally careful, the few slips mostly concerning orthography and punctuation. The grossest oversight is the missing Table 2 in Flowerdew's article, an omission preventing comparison with Table 1, called upon several times.
REFERENCES:
Aarts, Jan and Sylviane Granger. 1998. "Tag sequences in learner corpora: a key to interlanguage grammar and discourse". In: Sylviane Granger (ed.), Learner English on computer, London, Longman. 132-141.
Aston, Guy. 2000. "Learning English with the British National Corpus". In: M. Paz Battaner & Carmen López (eds), VI jornada de corpus lingüístics, Barcelona, Institut universitari de lingüística aplicada, Universitat Pompeu Fabra. 15-40.
Aston, Guy. 2001. "Learning with corpora: an overview". In: Guy Aston (ed.). 7-45.
Aston, Guy. (ed.). 2001. Learning with corpora. Houston, TX: Athelstan.
Gabrielatos, Costas. 2005. "Corpora and language teaching: just a fling or wedding bells?". TESL-EJ 8, 4. http://www-writing.berkeley.edu/TESL-EJ/ej32/a1.html
Hoey, Michael. 2005. Lexical priming: a new theory of words and language. London: Routledge.
LexWare Culler. 2004-5. http://82.182.103.45/lexware/concord/culler.html
Ljung, Magnus. 1990. A study of TEFL vocabulary. Stockholm: Almqvist & Wiksell.
Scott, Mike. 1996. WordSmith Tools. Oxford: Oxford University Press.
The Web Concordances. [nd]. http://www.dundee.ac.uk/english/wics/wics.htm
WebCorp. 1999-2005. http://www.webcorp.org.uk/
ABOUT THE REVIEWER
Dr Przemyslaw Kaszubski is a teacher of academic writing and a corpus linguistics researcher and lecturer at the School of English, Adam Mickiewicz University, Poznan, Poland. His current research interests concern the use of online corpus resources for academic writing instruction. He maintains an online concordancer for English students, and a large corpus linguistics bibliography ( http://www.staff.amu.edu.pl/~przemka/ ). In 1995-2002 he co-ordinated the compilation of the Polish subcorpus of the International Corpus of Learner English.