LINGUIST List 23.2764
Tue Jun 19 2012
Review: Applied Linguistics; Text/Corpus Linguistics: Meunier et al. (2011)
Editor for this issue: Rajiv Rao
<rajiv linguistlist.org>
Date: 19-Jun-2012
From: Marlies Prinzl <uclzmgp live.ucl.ac.uk>
Subject: A Taste for Corpora
Announced at http://linguistlist.org/issues/22/22-3472.html
EDITORS: Meunier, Fanny; de Cock, Sylvie; Gilquin, Gaëtanelle; Paquot, Magali
TITLE: A Taste for Corpora
SUBTITLE: In honour of Sylviane Granger
SERIES TITLE: Studies in Corpus Linguistics 45
PUBLISHER: John Benjamins Publishing Company
YEAR: 2011

Marlies Gabriele Prinzl, Centre for Intercultural Studies, University College London, UK

SUMMARY

‘A Taste for Corpora’ is a collection of eleven essays presented in honour of Sylviane Granger’s sixtieth birthday. Although it focuses predominantly on the applications of corpora in language learning, the book covers a wide range of topics within that area, meant to whet the reader’s appetite -- or taste -- for more.

Bengt Altenberg’s preface is followed by an introductory chapter, “Putting corpora to good use”, from the collection’s editors. It provides details on Granger’s work in corpus linguistics, from her beginnings as a PhD student at University College London, to her role in founding the Centre for English Corpus Linguistics (CECL), to her current research interests. An overview of all the essays is also included.

Chapter 1 “Frequency, corpora and language learning” (Geoffrey Leech)

According to Leech, one particular benefit of corpora is that they provide information about frequency that is otherwise not available. He distinguishes between three frequency types: ‘raw frequency’; ‘normalized (or relative) frequency’; and ‘ordinal frequency’, which he deems the most useful measure in language learning. A historical overview of frequency is given, including its rejection by Chomsky’s Generative School of linguistics in the second half of the twentieth century, as well as the role of the computer age in reviving frequency studies and thus challenging “a tradition long established in language study, whereby grammars and dictionaries provide distinct kinds of information about a language” (12).
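The three frequency types can be made concrete with a minimal Python sketch. It is not taken from Leech's chapter: it assumes that raw frequency is a plain count, that normalized frequency is the count scaled to a fixed corpus size (here per million tokens), and that ordinal frequency is an item's rank when items are ordered by count. The toy sentence and the function name `frequency_profile` are invented for illustration.

```python
from collections import Counter

def frequency_profile(tokens, per=1_000_000):
    """Return {word: (raw, normalized, rank)} for a list of tokens.

    Assumptions (not from the reviewed chapter): normalized frequency is
    expressed per `per` tokens, and ties in rank are broken alphabetically.
    """
    counts = Counter(tokens)
    total = len(tokens)
    # Sort by descending count, then alphabetically, for a stable ranking.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        word: (raw, raw / total * per, rank)
        for rank, (word, raw) in enumerate(ordered, start=1)
    }

# Invented toy data, far too small for any real frequency claim.
toy_tokens = "the cat sat on the mat and the dog sat too".split()
profile = frequency_profile(toy_tokens)
raw, normalized, rank = profile["the"]
```

Only the rank is directly comparable across corpora of very different sizes without further scaling, which may be one reason an ordinal measure is attractive for word lists aimed at learners.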
The equation ‘more frequent = more important to learn’ is discussed fairly extensively, with Leech concluding that language learning (i.e. input, performance, evaluation) should be ‘frequency-informed’.

Chapter 2 “Learner corpora and contrastive language analysis” (Hilde Hasselgård and Stig Johansson)

The second chapter commences with an overview of interlanguage studies before computer corpora, from contrastive analyses of native and foreign languages in the 1940s and 1950s to more systematic analyses in the 1960s and 1970s, which focused on errors until it became apparent that both error and success in language learning needed to be considered. The authors proceed to the introduction of computer corpora, which allowed for projects that were larger and more varied in scope. One such project was Sylviane Granger’s International Corpus of Learner English (ICLE) in 1990. Innovative in its systematic approach to corpus design and in its compilation of comparable sub-corpora for text produced by learners with different native languages, it also developed a new framework for learner corpus research: Contrastive Interlanguage Analysis (CIA). While contrastive analysis involves the comparison of two languages, CIA “concerns varieties of the ‘same’ language” (38, italicisation in the original text is substituted with single quotation marks), meaning that both native language (L1) and learner language (L2) in the form of L1 vs. L2, as well as interlanguage varieties (L2 vs. L2), are compared. Hasselgård and Johansson present significant findings of CIA and discuss case studies before identifying some challenges (e.g. application of findings in pedagogy and EFL practice; genre; need for data at different stages of the learning process).
Chapter 3 “The use of small corpora for tracing the development of academic literacies” (JoAnne Neff van Aertselaer and Caroline Bunce)

Chapter 3 presents a study on the academic literacy of Spanish university students based on two corpora: the Spanish subcorpus of the International Corpus of Learner English, containing texts by students with no specific training in academic writing (AW); and a corpus of texts produced by Spanish students of English as part of an AW course. The second corpus is used to ascertain the students’ progress, with the pedagogical aim of syllabus revision in mind. The study uses ‘can do’ descriptors specifying structural and rhetorical features to be learned by the students in the course -- specifically, discourse-oriented words, with a focus on the use of intertextual dialogue devices (such as various types of rhetors and grammar patterns). The authors provide details on the descriptors and the study’s methods and procedures. Based on the data obtained, they note that students with and without AW training perform differently, with only one of the categories evaluated (use of deictic as subject) showing no improvement. They conclude that academic literacy can be improved by providing students with ‘can do’ descriptors and by studying students’ “use of text-internal and external features” (80) and “centring sets of exercises around these features” (80).

Chapter 4 “Revisiting apprentice texts: Using lexical bundles to investigate expert and apprentice performances in academic writing” (Christopher Tribble)

Tribble commences with the observation that corpora only became a resource for language learning from the late 1980s onwards, a development that was partially motivated by concerns over the made-up, rather than real, language data used in classrooms until then. He presents a study on ‘lexical bundles’, which are defined as the “most frequently occurring sequences of words” and are normally “not idiomatic in meaning” nor “complete grammatical structures” (87).
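As a rough illustration of that definition, the following Python sketch counts contiguous word sequences and keeps those that recur. It is not Tribble's procedure: the function name `lexical_bundles`, the raw cut-off `min_count`, and the toy text are all assumptions made for the example, and published bundle studies typically apply a normalized frequency threshold and a requirement that a sequence occur across several texts.

```python
from collections import Counter

def lexical_bundles(tokens, n=4, min_count=2):
    """Return (sequence, count) pairs for n-word sequences seen at least
    `min_count` times, most frequent first.

    `min_count` is a hypothetical placeholder for the frequency and
    dispersion thresholds a real study would use.
    """
    # Slide a window of length n over the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    return [
        (" ".join(gram), count)
        for gram, count in counts.most_common()
        if count >= min_count
    ]

# Invented toy text, not drawn from any of the corpora in the chapter.
toy_text = (
    "on the other hand the results show that on the other hand "
    "the data suggest that the results show that"
).split()
bundles_3 = lexical_bundles(toy_text, n=3)
bundles_4 = lexical_bundles(toy_text, n=4)
```

Running the same function over an apprentice corpus and an expert corpus and comparing the two lists would be one simple way to look for sequences that one group uses and the other does not.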
Tribble looks at the use of both 3-word and 4-word lexical bundles by advanced students in specific disciplinary areas. This corpus of apprentice written production is compared to the language use of experts in the same field (i.e. the exemplar corpus), as well as to several analogue corpora. He concludes that the comparison of such data provides valuable insights into what students “use, fail to use, underuse and overuse” (102) -- insights that are crucial for pedagogy and curriculum design.

Chapter 5 “Automatic error tagging of spelling mistakes in learner corpora” (Paul Rayson and Alistair Baron)

In computer learner corpus research, the marking (i.e. ‘tagging’) of learner errors has been, and still is, done mostly manually or semi-automatically. More recently, however, results from natural language processing (NLP) have been applied to learner corpora. In their study, Rayson and Baron employ Variant Detector (VARD) software for tagging language learners’ spelling mistakes, with the aim of evaluating VARD’s potential for the automatic detection of such errors and the insertion of corrections within learner corpora. The experiment uses an expanded data set from Lefer & Thewissen (2007), which consists of 30,000 words from Spanish, German and French language learners drawn from the ICLE corpus. Because the data had already been manually marked up for all types of learner errors, the researchers were able to determine the accuracy of VARD and, in the second stage of the experiment, used part of the manually corrected corpus to train VARD. With results showing high accuracy and, after training, increased correction, Rayson and Baron conclude that NLP methods can contribute to the automatic error analysis of learner corpora.
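The general idea of scoring an automatic tagger against a manually annotated corpus can be sketched in a few lines of Python. This does not use VARD or reproduce the chapter's evaluation, whose exact measures are not given in this review; it assumes the common measures of precision and recall, and the function name and token positions below are invented.

```python
def precision_recall(gold, predicted):
    """Compare two collections of flagged token positions.

    gold: positions a human annotator marked as spelling errors.
    predicted: positions the automatic tool flagged.
    Returns (precision, recall); both are 0.0 when undefined.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Invented toy data: token indices, not results from the study.
manual_tags = {3, 7, 12, 20}
automatic_tags = {3, 7, 15, 20, 31}
precision, recall = precision_recall(manual_tags, automatic_tags)
```

Precision here is the share of automatic flags that the annotator also marked, and recall is the share of manually marked errors that the tool found. Training a tool on part of the manual annotation, as the chapter describes, would be expected to shift one or both figures.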
Chapter 6 “Data mining with learner corpora: Choosing classifiers for L1 detection” (Scott Jarvis)

Jarvis presents a study on data mining, using a supervised classification approach to evaluate which classifiers are “best able to learn to recognise the relationship between n-gram patterns in ICLE texts and the L1 group membership of learners who produced the texts” (147). The chapter distinguishes between unsupervised and supervised classification, providing details on different classifier types (i.e. centroid-based, boundary-based, Bayesian, artificial neural networks, decision trees, rule-based, means-based, composite), feature selection and parameter tuning. Considerable background on previous research is also covered, including studies on L1 detection and projects tackling the question of which classifier is best. Jarvis then proceeds to his own research, which used 20 classifiers and experimented with various parameter settings and feature selection methods to determine optimal classification accuracy. He observes that the best-performing classifiers for the task (i.e. Linear Discriminant Analysis, Sequential Minimal Optimization, Naïve Bayes Multinomial, Nearest Shrunken Centroid) differ relatively little from one another. He also considers whether an ensemble of classifiers might produce higher classification accuracies, stating that results are inconclusive in this respect.

Chapter 7 “Learners and users: Who do we want corpus data from?” (Anna Mauranen)

With corpora already compiling data from native speakers and language learners, Mauranen considers the next step: data from second-language (L2) speakers who use the language as a lingua franca.
She explores how L2 users differ from L2 learners, noting, among other things, that the former do not typically share a cultural and linguistic background and use the language out of convenience or necessity, with an international target audience rather than one in English-speaking countries. Unlike learners, L2 users may also come to influence the target language as the use of English as a lingua franca increases. Finally, L2 users focus on making sense and being understood, not on language learning. The differences between the two L2 groups are reflected in corpus compilation, with corpora such as the Helsinki-based ELFA (English as a Lingua Franca in Academic Settings) containing no data from learners, and allowing variation in the mother tongue and proficiency of participants. Although the principal differences between learner and ELF corpora provide good reasons to keep them separate, Mauranen concludes that the results yielded by either type of corpus are of mutual interest, as both L2 speakers and learners are using a non-native language.

Chapter 8 “Learner knowledge of phrasal verbs: A corpus-informed study” (Norbert Schmitt and Stephen Redwood)

Chapter 8 deals with phrasal verbs (PVs), which are key features in spoken and written language that can pose difficulties for learners. While PV lists in textbooks and dictionaries are mostly based on intuition and tradition, the researchers use a selection from the 100 most frequently occurring PVs in the British National Corpus for their study, asking the question “[D]o learners tend to know more about the most frequently occurring phrasal verbs than the less frequent ones?” (181). A group of 68 EFL/ESL students of different nationalities, at both intermediate and upper intermediate levels, was tested on both productive and receptive knowledge of PVs.
Although some variation in PV knowledge was seen, Schmitt and Redwood conclude that there is an overall relationship between frequency of occurrence and knowledge. Other factors -- language proficiency, gender, age, formal language instruction, extensive reading, watching films and TV, listening to music and social networking -- are also discussed, some of which the authors find to play a role.

Chapter 9 “Corpora and the new Englishes: Using the ‘Corpus of Cyber-Jamaica’ to explore research perspectives for the future” (Christian Mair)

Chapter 9 commences with a brief overview of corpus-based research on ‘New English(es)’, including a discussion of the term’s definition, before introducing an ongoing project at Freiburg University on the use of Jamaican English (JE) and Jamaican Creole (JC) in the 15 million+ word Corpus of Cyber-Jamaica (CCJ). Building on research from the Jamaican component of the International Corpus of English, the study investigates innovations in two areas: 1) the use of Non-Standard English online through Jamaican web posts, where increased usage of JC forms is seen when compared with traditional writing; and 2) the sociolinguistics of globalisation, as originally localised vernaculars spread through the medium of the web, which becomes a contact site for non-standard varieties of English. The issue of legitimacy and authenticity of data from the web for sociolinguistic research is raised, and Mair concludes that multilingual diasporic web forums “await corpus-linguistic exploration” (234).

Chapter 10 “Towards a new generation of Corpus-derived lexical resources for language learning” (David Wible and Nai-Lung Tsao)

Wible and Tsao put forth the argument that corpora are “by their very nature as collections of texts and tokens, severely limited in what they can offer directly to language learners or teachers” (237). Their focus is on exploring these limitations as they look at the gap between corpora and learners’ lexical knowledge.
Corpus-supported learning, they argue, depends on guided exposure to tokens in use that reveal underlying language patterns. Concordancing and KWIC (Key Word in Context), however, do not find patterns but rather strings, while software that allows pattern searches requires knowledge of technical language (e.g. regular expressions), and will only search for patterns that learners stipulate -- not the ones that they are unaware of. The authors further discuss the limitations of n-grams, concgrams and skipgrams, suggesting that a paradigmatic dimension is missing from all of these. Hybrid n-grams are introduced as an alternative that not only “identify patterns of word use” but also “create a new entire space where relations hold among these patterns and among the words in them” (243), resulting in StringNet, a massive “organic lexical knowledgebase whose structure is not prescribed but emerges” (244). Illustrative examples are provided to demonstrate how this approach is beneficial for language learners.

Chapter 11 “Automating the creation of dictionaries: Where will it all end?” (Michael Rundell and Adam Kilgarriff)

The final chapter of ‘A Taste for Corpora’ explores automation in dictionary making. It provides an overview of developments, starting with the 1960s and 1970s, when computers were first used for dictionary making, before proceeding to the technological advances and increased accessibility of PCs in the 80s and 90s. 1981 is identified as “Year Zero” with the COBUILD project, for which “every linguistic fact… [was] supported by the empirical evidence in the form of corpus extracts” (259). However, although changes were clearly occurring in the way dictionaries were made, automation only became more prominent in the late 90s. Rundell and Kilgarriff then detail their work on the Macmillan English Dictionary for Advanced Learners, looking at the tasks involved (e.g. corpus creation, headword lists, collocations and word sketches, labelling, examples, tickbox lexicography) and commenting on the state of automation in each. All these tasks have by now been automated to a significant degree, though further advances are still on the horizon. The researchers conclude that “the lexicographer’s task changes from selecting and copying data from the software, to validating… the choices made by the computer” (278), but note that fully automated lexicography is “still some way off” (279).

EVALUATION

‘A Taste for Corpora’ offers a diverse and rich collection of essays, all of a high quality, covering a wide spectrum of topics related to the applications of corpora in language learning. The volume can be perused as a whole or as an introduction to different topics of interest. Although it is generally also suitable for newcomers to corpus-based language learning, some chapters are quite specialised and may require further reading on the topic.

Quite a number of essays -- such as Chapters 6, 10 and 11 -- point to approaches that are at early stages and may thus spark both fascination and controversy in terms of possible future directions and developments. It is doubtful, for example, that Wible and Tsao’s previously quoted argument in Chapter 10 that corpora are “by their very nature as collections of texts and tokens, severely limited in what they can offer directly to language learners or teachers” (237) will immediately be welcomed by every corpus linguist, as it seems to question the field itself and also appears to be phrased in a purposely provocative manner. Corpora and corpus-based resources may have limitations, but they are not completely useless even if they only “find the patterns that the user tells them to search for” (241) or are “one-dimensional” (243).
Corpus linguistics challenged approaches to language and language learning that came before it and significantly shifted the focus from “an ideal speaker-listener, in a completely homogeneous speech-community, who knows its language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors” (Chomsky 1965: 3) to models and methods that deal with language as it is actually spoken by both L1 and L2 users in real life. Equally, corpus linguistics as a whole and its specific applications for language teaching and learning should be questioned for usefulness both in theory and practice, so that the rightfully identified “unfortunate gap [that] still stand[s] between what learners need… and what corpora currently provide” (237) may be filled, but to declare corpora outright “severely limited” (237) somewhat overshoots the mark.

Other chapters also deserve a second mention in this review. The contributions from Mauranen (Chapter 7, “Learners and users: Who do we want corpus data from?”) and Mair (Chapter 9, “Corpora and the new Englishes: Using the ‘Corpus of Cyber-Jamaica’ to explore research perspectives for the future”) are exciting, as they strongly signal a move away from the traditional research focus on English as spoken in a select few nations, or by certain kinds of English users. Although projects like the International Corpus of English (ICE), containing multiple 1 million-token subcorpora from Singaporean to Sri Lankan English, have existed for a while already, the research of both Mauranen and Mair expands the field further. Mair’s investigation into the still relatively unmapped territory of language in cyberspace -- web posts in Jamaican English as well as Jamaican Creole -- makes this turn even more interesting, as cyber language in both its more static (e.g. informative websites) and dynamic forms (e.g. Twitter, text messaging) becomes an increasingly significant part of our everyday language usage.

‘A Taste for Corpora’ is a fitting and wonderful collection of essays to honour the achievements of Sylviane Granger. Most chapters make at least some reference to how Granger’s work is significant to a particular subfield within corpus-based language learning. These references, on occasion, feel a little forced, but this is in the nature of a volume published in honour of a researcher. Altogether, ‘A Taste for Corpora’ certainly manages to whet the reader’s appetite -- or taste -- for what the future of corpus-based language learning holds.

REFERENCES

Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

ABOUT THE REVIEWER
Marlies Gabriele Prinzl is a PhD candidate, supervised by Prof. Theo Hermans and Dr. Daniel Abondolo, at the Centre for Intercultural Studies, University College London, UK. Her research interests include literary translation, particularly with regard to creativity and experimental writing, retranslation and corpus linguistic approaches to literature and translation. She is further interested in East Asian cinema and cultural products, including aspects of fansubbing and fantranslation. Details can be found at: http://ucl.academia.edu/mgp.
Page Updated: 19-Jun-2012