LINGUIST List 18.607

Sat Feb 24 2007

Review: Text/Corpus Linguistics: Wilson; Rayson; Archer (2006)

Editor for this issue: Laura Buszard-Welcher

Directory         1.    Isabella Chiari, Corpus Linguistics Around the World

Message 1: Corpus Linguistics Around the World
Date: 24-Feb-2007
From: Isabella Chiari
Subject: Corpus Linguistics Around the World


EDITORS: Andrew Wilson, Dawn Archer, Paul Rayson
TITLE: Corpus Linguistics Around the World
SERIES: Language and Computers 56
PUBLISHER: Rodopi
YEAR: 2006

Isabella Chiari, Dipartimento di Studi Filologici Linguistici e Letterari, Università ''La Sapienza'' di Roma, Italy

The book under review is a selection of papers presented at the Corpus Linguistics 2003 conference, held at Lancaster University in March 2003. It contains 17 contributions covering a wide variety of languages: Basque, English and its dialects, Danish, French, Maltese, Dutch, German, Slovene, Spanish, Polish, Russian, and Chinese. The papers deal with dialects, learner corpora, vocabulary, spoken language, synchronic and diachronic variation, tagging, corpus development, and cross-cultural rhetoric and social psychology.


''Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for automatic processing'', by I. Aduriz, M.J. Aranzabe, J.M. Arriola, A. Atutxa, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa and R. Urizar, describes the different phases of design and construction of the EPEC corpus of written Basque, an annotated corpus consisting of 4,658,036 word forms. The application of the MORFEUS morphological analyzer is described in detail. Manual disambiguation, development of the TATOO stochastic tagger, further supervised training and treebank construction were carried out in order to develop further automatic tools for parsing Basque texts.

''The mood of the (financial) markets: in a corpus of words and of pictures'', by Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee Gillam, Pensiri Manomaisupat, Hayssam Traboulsi and Andrew Hippisley, presents a ''method for extracting sentiment indicators, e.g. shares going up or a currency falling down [...] together with a technique for correlating the quantitative time-series of values with a time series of sentiment indicators'' (Ahmad et al. 2006: 17). The study focused on three years' output of Reuters financial news (starting in 2000) with about 10 million word tokens. Selected items (like 'up', 'down', 'rise', 'fell', 'growth') were monitored in order to determine their frequency and diachronic usages. Relevant issues raised include the need for the integration of different techniques in corpus linguistics (such as mathematical analysis), and image analysis for the construction of information extraction tools.

''Contrastive observations and their possible diachronic interpretations in the Korpus 2000 and Korpus 90 General Corpora of Danish: Towards a methodology for corpus-based studies of linguistic change'', by Jørg Asmussen, describes advantages and risks in contrasting diachronic corpora. It proposes a comparison of two reference corpora of Danish (Korpus 2000 and Korpus 90, both consisting of 28 million words) compiled with texts from the Eighties to 2002. The author offers examples from vocabulary, inflection, collocation, semantic and syntactic analyses, showing some possible biases in comparing differently designed corpora and posing some methodological questions. The contribution is particularly centered on the elaboration of a methodology for comparative corpus investigation standards and diachronic similarities and differences.

''Synchronic and diachronic variation: the how and why of the sociolinguistic corpora'', by Kate Beeching, discusses issues of methodology and application in French spoken corpora. The paper presents the major spoken language corpora available for French and shows that sociolinguistic questions can be fruitfully investigated using synchronically and diachronically varied corpora.

''Statistical analysis of the source origin of Maltese'', by Roderick Bovingdon and Angelo Dalli, investigates different aspects of the Maltese language based on statistical analyses of randomly selected samples from the Maltilex Corpus (the first electronically available corpus for the Maltese language, which contains a variety of texts from newspapers, novels, administrative and radio transcripts). In particular, a study of the quantitative incidence of words from Arabic, Italian, English and Dutch is presented, showing a large-scale influence of Italian on word class distributions. Morphological implications of this influence are discussed.

''Discovering regularities in non-native speech'', by Julie Carson-Berndsen, Ulrike Gut and Robert Kelly, discusses some possible applications of computational tools for the analysis of a corpus of native and non-native phonotactic patterns. The aim is twofold: developing tools to be used in speech technologies and investigating non-native phonological realizations. Machine learning tools are applied to extract regularities from different German corpora, analyzing errors and error schemes, and showing significant deviations from the general German system.

''Tracking lexical changes in the reference corpus of Slovene texts'', by Vojko Gorjanc, deals with competition between English loanwords and their Slovene counterparts in the FIDA corpus of contemporary Slovene. A set of key lexical items concerning new technologies (computer, internet, world wide web) was chosen and monitored for occurrences during the Nineties. The author observed a general tendency toward native Slovene lexical creation over the adoption of loanwords and a general variability and creativity of expression.

''Relating linguistic units to socio-contextual information in a spontaneous speech corpus of Spanish'', by José María Guirao, Antonio Moreno Sandoval, Ana González Ledesma, Guillermo De La Madrid and Manuel Alcántara, shows how statistical measures can be used to derive divergence in linguistic features pertaining to different text typologies present in a reference corpus of Spanish. The analyzed corpus is the Spanish section of the C-ORAL-ROM (300,000 words), an EU project coordinated by Emanuela Cresti at the University of Florence. After first pointing out differences between speech databases and spoken corpora, the authors focus on corpus design and on methodological issues concerning the application of Dunning's statistics of surprise to extract collocational patterns from the corpus (Dunning 1993).
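
As an illustration of the kind of measure involved, the log-likelihood ratio proposed in Dunning (1993) can be sketched for a single word pair as follows. This is a minimal sketch under stated assumptions, not the procedure used by the authors: the function names, the restriction to adjacent word pairs, and the toy token list are all hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table of observed counts.

    k11: w1 followed by w2        k12: w1 followed by another word
    k21: another word before w2   k22: neither w1 nor w2 in the pair
    """
    def xlogx(x):
        # Convention: 0 * log(0) is taken as 0.
        return x * math.log(x) if x > 0 else 0.0

    n = k11 + k12 + k21 + k22
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    # Equivalent to 2 * sum(O * ln(O / E)) over the four cells.
    return 2.0 * (cells - rows - cols + xlogx(n))


def bigram_llr(tokens, w1, w2):
    """Score the adjacent pair (w1, w2) in a list of tokens (illustrative only)."""
    bigrams = list(zip(tokens, tokens[1:]))
    k11 = sum(1 for a, b in bigrams if a == w1 and b == w2)
    k12 = sum(1 for a, b in bigrams if a == w1 and b != w2)
    k21 = sum(1 for a, b in bigrams if a != w1 and b == w2)
    k22 = len(bigrams) - k11 - k12 - k21
    return log_likelihood_ratio(k11, k12, k21, k22)


# Toy data, invented for illustration.
tokens = "buenos dias a todos buenos dias senor".split()
score = bigram_llr(tokens, "buenos", "dias")
```

Higher scores indicate that the two words co-occur more often than their individual frequencies would predict; a table whose cells match the independence expectation scores zero.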

''An analysis of lexical text coverage in contemporary German'', by Randall L. Jones, is a reflection on word frequency coverage in English and German texts of different typologies (conversation, literature, newspaper and academic). Data on English from the work of Nation (2001) on frequency distributions in texts are compared to analogous data extracted from the 400,000-word sub-corpus of the BYU/Leipzig Corpus of Contemporary German, showing some differences in general coverage in the newspaper and literature sub-corpora, and showing that the 1,000 most frequent words cover considerably less of the German texts than of the English ones.
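
The notion of text coverage can be made concrete with a short calculation. The following is a minimal sketch and not the procedure of Jones or Nation: it assumes the frequency ranking is taken from the same text that is being measured, whereas coverage studies typically rank words on a separate reference list. The function name and the toy data are hypothetical.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of running words accounted for by the top_n most frequent word forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Toy data, invented for illustration: six running words, three word forms.
sample = ["a", "a", "a", "b", "b", "c"]
top1 = coverage(sample, 1)
top2 = coverage(sample, 2)
```

Computed for the same value of top_n on comparable English and German samples, a lower figure for German would correspond to the kind of difference the paper reports.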

''Analysing a semantic corpus study across English dialects: Searching for paradigmatic parallels'', by Sarah Lee and Debra Ziegeler, investigates the usages of the 'get' periphrastic constructions in Singapore English, British English and New Zealand English, from the International Corpus of English (ICE). In a comparative approach the authors observe distributional variation and word associations, and raise some methodological questions about the relationship between the frequency of occurrence of a given pattern and the significance of the association found.

''The curse and the blessing of mobile phones - a corpus-based study into American and Polish rhetorical conventions'', by Agnieszka Leńko-Szymańska, proposes a comparison between argumentative essays written by American native speakers and equivalent essays written in English by native speakers of Polish. The research offers a corpus-based approach to contrastive rhetoric, making use of tools such as keyword analysis and pronoun emergence to observe, for example, levels of formality and the use of general versus experience-related arguments (the latter preferred by Americans), which are seen as general indicators of rhetorical conventions imposed by cultural differences.

''Using a dedicated corpus to identify features of professional English usage: What do 'we' do in science journal articles?'', by Judy Noguchi, Thomas Orr and Yukio Tono, deals with the improvement of tools for learners of English as a second language, focusing in particular on the Corpus of Professional English (CPE), in development by the Professional English Research Consortium in Tokyo as a 100-million-word written corpus. The authors conducted a pilot study investigating the usage of the pronoun 'we' in a small section of the corpus. They observed a very high rate of usage, especially when 'we' is followed by mental verbs (such as 'find', 'observe' and 'examine') and activity verbs (such as 'use' and 'show').

''Methods and tools for development of the Russian Reference Corpus'', by Serge Sharoff, after a brief outline of preceding corpus projects focused on Russian, describes in detail the BOKR (Boljshoj Korpus Russkogo yazyka) corpus. The BOKR consists of 100 million words and is designed to be the Russian equivalent of the BNC. Some differences in design include a different proportion of text typologies, POS tagging features and an advanced query interface.

''A profile-based calculation of region and register variation: the synchronic and diachronic status of the two main national varieties of Dutch'', by Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts, is centered on variation in Dutch, through the analysis of the ConDiv corpus of 40 million words (typologically and diachronically varied). A statistical profile of a set of words is observed in three different subcorpora (chat, popular newspaper and quality newspaper) in order to define subcorpora distances as the preferred method of keyword analysis. The authors introduce the new concept of stable lexical markers, which are used to find relevant patterns in different texts.

''A multilingual learner corpus in Brazil'', by Stella E. O. Tagnin, presents the panorama of learner corpora in Brazil and the design of the USP Multilingual Learner Corpus (MLC) by the University of São Paulo. The corpus will be composed of texts produced by undergraduates in extracurricular courses in English, German and Spanish. This composition will make it possible to observe texts produced by the same class over time, individual and collective progress, possible common denominators and problems of Brazilian learners facing different foreign languages.

''Quantitative or qualitative content analysis? Experiences from a cross-cultural comparison of female students' attitudes to shoe fashions in Germany, Poland and Russia'', by Andrew Wilson and Olga Moudraia, observes different results obtained with qualitative and quantitative content analysis applied to texts produced by learners of English from different countries. The focus of the contribution is on cultural aspects connected to the selection of lexical items associated with the topic of footwear. After presenting some methodological issues in content analysis, the authors observe data from a pilot study aimed at comparing dictionary-based quantitative, multivariate and qualitative analyses, which show globally similar results.

''Survey and Prospect of China's Corpus-Based Research'', by Yang Xiao-Jun, is a brief description of the state of the art in Chinese corpus linguistics. The author presents some historical and recent corpora of Chinese, corpora of English as a foreign language, and parallel English-Chinese corpora. The author also presents an overview of the leading scholars working in the field, and some of the major publications regarding corpus-based research.


Common issues raised in the book include theoretical, methodological and computational problems encountered in the development of projects relating to different languages. Among the more traditional issues covered are corpus construction and design (text typology and variation, internal and external criteria for their determination, and dynamic design as in virtual corpora), as well as aspects specific to particular types of corpora. A strong interest is shown in methodological and theoretical aspects of diachronic corpora, which require comparability and reliability in a completely different way than synchronic corpora. Spoken corpora are given great attention, not only for their complex design and treatment (specific transcription training, peculiar POS tagging features, etc.), but also for their suitability in providing evidence and validation to phonological and phonetic theories, the possibility of correlating language variation with sociolinguistic and contextual features, and the possibility of comparing cross-cultural differences emerging in textual choices.

Many methodological questions are posed or suggested, especially on the application of statistical and frequency measures. Problems due to limitations in corpus design, sample choices, or comparability are presented to show how the interpretation of quantitative data is problematic due to factors of content analysis, association measures, weakness in data collection techniques or in the experimental elicitation of texts. Finally, some issues address the application of computational tools (parsers, taggers, lemmatizers, etc.) to languages with different morphological and syntactic structures, posing new questions about existing tools and their possible application, and discussing techniques for the design of new tools.

Many issues that are hot topics in the world of corpus linguistics are presented, exemplified, and discussed in this book, with suggestions for future directions in the field. The volume is extremely interesting in illustrating these issues using the concrete experience of a variety of different projects. The only shortcomings are the absence of a structured order (or series of sections) in the contribution sequence and of a final content index.


Ahmad, K., D. Cheng, T. Taskaya, S. Ahmad, L. Gillam, P. Manomaisupat, H. Traboulsi and A. Hippisley, 2006. The mood of the (financial) markets: in a corpus of words and of pictures. In Corpus Linguistics Around the World, A. Wilson, D. Archer, P. Rayson (eds.), Rodopi, pp. 17-32.

Dunning, T., 1993. Accurate methods for the statistics of surprise and coincidence. In Computational Linguistics, 19(1), pp. 61-74.

Nation, I.S.P., 2001. Learning Vocabulary in Another Language, Cambridge: Cambridge University Press.


Isabella Chiari (Ph.D. in Philosophy of Language, 2000) teaches courses in general and computational linguistics at the University La Sapienza of Rome (Italy). Her interests lie at the intersection of linguistics and philosophy of language. She is concerned with scientific, methodological and theoretical issues in quantitative linguistics and linguistic redundancy, and with linguistic behavior in speech performance (slips of the tongue in first and second language). She is also interested in understanding processes and their implications in language teaching and learning, in computational tools for language teaching, and in psycholinguistic aspects of speech errors and slips.