Review of Corpus Linguistics Around the World

Reviewer: Isabella Chiari
Book Title: Corpus Linguistics Around the World
Book Author: Andrew Wilson, Dawn Archer, Paul Rayson
Publisher: Rodopi
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Subject Language(s): Basque
Chinese, Mandarin
Issue Number: 18.607

EDITORS: Andrew Wilson, Dawn Archer, Paul Rayson
TITLE: Corpus Linguistics Around the World
SERIES: Language and Computers 56
YEAR: 2006

Isabella Chiari, Dipartimento di Studi Filologici Linguistici e Letterari,
Università ''La Sapienza'' di Roma, Italy

The book under review is a selection of papers presented at the Corpus
Linguistics 2003 conference, held at Lancaster University in March 2003. It
contains 17 contributions covering a wide variety of languages: Basque,
English and its dialects, Danish, French, Maltese, Dutch, German, Slovene,
Spanish, Polish, Russian, and Chinese. The papers deal with dialects,
learner corpora, vocabulary, spoken language, synchronic and diachronic
variation, tagging, corpus development, cross-cultural rhetoric and social
psychology.


''Methodology and steps towards the construction of EPEC, a corpus of
written Basque tagged at morphological and syntactic levels for automatic
processing'', by I. Aduriz, M.J. Aranzabe, J.M. Arriola, A. Atutxa, A. Díaz
de Ilarraza, N. Ezeiza, K. Gojenola, M. Oronoz, A. Soroa and R. Urizar,
describes the different phases of design and construction of the EPEC
corpus of written Basque, an annotated corpus consisting of
4,658,036 word forms. The application of the MORFEUS morphological analyzer
is described in detail. Manual disambiguation, development of the TATOO
stochastic tagger, further supervised training and treebank construction
were conducted in order to develop further automatic tools for corpus
parsing on Basque texts.

''The mood of the (financial) markets: in a corpus of words and of
pictures'', by Khurshid Ahmad, David Cheng, Tugba Taskaya, Saif Ahmad, Lee
Gillam, Pensiri Manomaisupat, Hayssam Traboulsi and Andrew Hippisley,
presents a ''method for extracting sentiment indicators, e.g. shares going
up or a currency falling down [...] together with a technique for
correlating the quantitative time-series of values with a time series of
sentiment indicators'' (Ahmad et al. 2006: 17). The study focused on three
years' output of Reuters financial news (starting in 2000) with about 10
million word tokens. Selected items (like 'up', 'down', 'rise', 'fell',
'growth') were monitored in order to determine their frequency and
diachronic usage. Relevant issues raised include the need to integrate
different techniques into corpus linguistics, such as mathematical analysis
and image analysis, for the construction of information extraction tools.
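
The idea of correlating a sentiment time series with a quantitative one can
be illustrated with a minimal Python sketch. It is not the authors' method:
the review does not say which correlation measure they use, so Pearson's r
is assumed here, and the split of the monitored items into positive and
negative sets, as well as the function names, are hypothetical.

```python
import math

# Assumed polarity for the items mentioned in the review (hypothetical split).
POSITIVE = {"up", "rise", "growth"}
NEGATIVE = {"down", "fell"}

def net_sentiment(text):
    """Count positive minus negative indicator tokens in one day's news text."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)
```

Given one text per day and a matching list of index values, the daily
sentiment series would be [net_sentiment(t) for t in texts], which can then
be passed to pearson together with the index values.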

''Contrastive observations and their possible diachronic interpretations in
the Korpus 2000 and Korpus 90 General Corpora of Danish: Towards a
methodology for corpus-based studies of linguistic change'', by Jørg
Asmussen, describes advantages and risks in contrasting diachronic corpora.
It proposes a comparison of two reference corpora of Danish (Korpus 2000
and Korpus 90, both consisting of 28 million words) compiled with texts
from the Eighties to 2002. The author offers examples from vocabulary,
inflection, collocation, semantic and syntactic analyses, showing some
possible biases in comparing differently designed corpora and posing some
methodological questions. The contribution centers on elaborating
methodological standards for comparative corpus investigation of diachronic
similarities and differences.

''Synchronic and diachronic variation: the how and why of the
sociolinguistic corpora'', by Kate Beeching, discusses issues of methodology
and application in French spoken corpora. The paper presents the major
spoken language corpora available for French and shows that sociolinguistic
questions can be fruitfully investigated using synchronically and
diachronically varied corpora.

''Statistical analysis of the source origin of Maltese'', by Roderick
Bovingdon and Angelo Dalli, investigates different aspects of the Maltese
language based on statistical analyses of randomly selected samples from
the Maltilex Corpus (the first electronically available corpus for the
Maltese language, which contains a variety of texts from newspapers,
novels, administrative and radio transcripts). In particular, a study of
the quantitative incidence of words from Arabic, Italian, English and Dutch
is presented, showing a large-scale influence of Italian on word class
distributions. The morphological implications of this influence are discussed.

''Discovering regularities in non-native speech'', by Julie Carson-Berndsen,
Ulrike Gut and Robert Kelly, discusses some possible applications of
computational tools for the analysis of a corpus of native and non-native
phonotactic patterns. The aim is twofold: developing tools to be used in
speech technologies and investigating non-native phonological realizations.
Machine learning tools are applied to extract regularities from different
German corpora, analyzing errors and error schemes, and showing significant
deviations from the general German system.

''Tracking lexical changes in the reference corpus of Slovene texts'', by
Vojko Gorjanc, deals with competition among English loanwords and their
Slovene counterparts in the FIDA corpus of contemporary Slovene. A set of
key lexical items concerning new technologies (computer, internet, world
wide web) was chosen and monitored for occurrence during the Nineties. The
author observed a general tendency toward native Slovene lexical creation
over the adoption of loanwords, along with a general variability and
creativity of expression.

''Relating linguistic units to socio-contextual information in a spontaneous
speech corpus of Spanish'', by José María Guirao, Antonio Moreno Sandoval,
Ana González Ledesma, Guillermo De La Madrid and Manuel Alcántara, shows
how statistical measures can be used to derive divergence in linguistic
features pertaining to different text typologies present in a reference
corpus of Spanish. The analyzed corpus is the Spanish section of the
C-ORAL-ROM (300,000 words), an EU project coordinated by Emanuela Cresti
at the University of Florence. After first pointing out differences between
speech databases and spoken corpora, the authors focus on corpus design and
on methodological issues concerning the application of Dunning's statistics
of surprise to extract collocational patterns from the corpus (Dunning 1993).
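
For readers unfamiliar with the measure, the log-likelihood ratio proposed
in Dunning (1993) can be sketched in Python as follows. This is a minimal
illustration for a 2x2 contingency table of bigram counts, not the
implementation used by the authors; the function and argument names are
hypothetical.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table of co-occurrence counts.

    k11: count of word A followed by word B
    k12: count of word A followed by any other word
    k21: count of any other word followed by word B
    k22: count of bigrams containing neither A (first) nor B (second)
    """
    def h(*counts):
        # Sum of k * log(k / total), skipping empty cells (0 * log 0 = 0).
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    cells = h(k11, k12, k21, k22)
    rows = h(k11 + k12, k21 + k22)
    cols = h(k11 + k21, k12 + k22)
    return 2.0 * (cells - rows - cols)
```

A value near zero indicates that the two words co-occur about as often as
chance predicts; larger values indicate a stronger association, so
candidate collocations can be ranked by this score.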

''An analysis of lexical text coverage in contemporary German'', by Randall
L. Jones, is a reflection on word frequency coverage in English and German
texts of different typologies (conversation, literature, newspaper and
academic). Data on English from the works of Nation (2001) on frequency
distributions in texts are compared to analogous data extracted from the
400,000-word sub-corpus of the BYU/Leipzig Corpus of Contemporary German,
showing some differences in general coverage in the newspaper and
literature sub-corpora, and showing that German texts are considerably less
well covered by the 1,000 most frequent words than their English counterparts.
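
Text coverage in this sense is the share of running words in a text
accounted for by the N most frequent words. A minimal Python sketch follows,
assuming plain word forms as the counting unit (the study itself may count
lemmas or word families, which the review does not specify); the function
name is hypothetical.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Proportion of tokens accounted for by the top_n most frequent types."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)
```

Calling coverage(tokens, 1000) on tokenized English and German samples of
comparable size and text type would give figures of the kind compared in
the paper.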

''Analysing a semantic corpus study across English dialects: Searching for
paradigmatic parallels'', by Sarah Lee and Debra Ziegeler, investigates the
usages of the 'get' periphrastic constructions in Singapore English,
British English and New Zealand English, from the International Corpus of
English (ICE). In a comparative approach the authors observe distributional
variation and word associations, and raise some methodological questions
about the relationship between the frequency of occurrence of a given
pattern and the significance of the found association.

''The curse and the blessing of mobile phones - a corpus-based study into
American and Polish rhetorical conventions'', by Agnieszka Leńsko-Szymańska,
proposes a comparison between argumentative essays written by American
native speakers and equivalent essays written in English by native speakers
of Polish. The research offers
a corpus-based approach to contrastive rhetoric, making use of tools such
as keyword analysis and pronoun emergence to observe, for example, levels
of formality and the use of general versus experience-related arguments
(the latter preferred by Americans), which are seen as general indicators
of rhetorical conventions imposed by cultural differences.

''Using a dedicated corpus to identify features of professional English
usage: What do 'we' do in science journal articles?'', by Judy Noguchi,
Thomas Orr and Yukio Tono, deals with the improvement of tools for learners
of English as a second language, focusing in particular on the Corpus of
Professional English (CPE), in development by the Professional English
Research Consortium in Tokyo as a 100-million-word written corpus. The
authors conducted a pilot study investigating the usage of the pronoun 'we'
in a small section of the corpus. They observed a very high rate of
usage, especially when 'we' is followed by mental verbs (such as 'find',
'observe' and 'examine') and activity verbs (such as 'use' and 'show').

''Methods and tools for development of the Russian Reference Corpus'', by
Serge Sharoff, after a brief outline of preceding corpus projects focused
on Russian, describes in detail the BOKR (Boljshoj Korpus Russkogo yazyka)
corpus. The BOKR consists of 100 million words and is designed to be the
Russian equivalent of the BNC. Some differences in design include a
different proportion of text typologies, POS tagging features and an
advanced query interface.

''A profile-based calculation of region and register variation: the
synchronic and diachronic status of the two main national varieties of
Dutch'', by Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts, is
centered on variation in Dutch, through the analysis of the ConDiv corpus
of 40 million words (typologically and diachronically varied). A
statistical profile of a set of words is observed in three different
subcorpora (chat, popular newspaper and quality newspaper) in order to
define subcorpora distances as the preferred method of keyword analysis.
The authors introduce the new concept of stable lexical markers, which are
used to find relevant patterns in different texts.

''A multilingual learner corpus in Brazil'', by Stella E. O. Tagnin, presents
the panorama of learner corpora in Brazil and the design of the USP
Multilingual Learner Corpus (MLC) by the University of São Paulo. The
corpus will be composed of texts produced by undergraduates in
extracurricular courses in English, German and Spanish. This composition
will offer the capability to observe texts produced by the same class over
time, individual and collective progress, possible common denominators and
problems of Brazilian learners facing different foreign languages.

''Quantitative or qualitative content analysis? Experiences from a
cross-cultural comparison of female students' attitudes to shoe fashions in
Germany, Poland and Russia'', by Andrew Wilson and Olga Moudraia, observes
different results obtained with qualitative and quantitative content
analysis applied to texts produced by learners of English from different
countries. The focus of the contribution is on cultural aspects connected
to the selection of lexical items associated with the topic of footwear.
After presenting some methodological issues in content analysis, the
authors observe data from a pilot study aimed at comparing dictionary-based
quantitative, multivariate and qualitative analyses, which show globally
similar results.

''Survey and Prospect of China's Corpus-Based Research'', by Yang Xiao-Jun,
is a brief description of the state of the art in Chinese corpus
linguistics. The author presents some historical and recent corpora of
Chinese, corpora of English as a foreign language, and parallel
English-Chinese corpora. The author also presents an overview of the
leading scholars working in the field, and some of the major publications
regarding corpus-based research.


Common issues raised in the book include theoretical, methodological and
computational problems encountered in the development of projects relating
to different languages. Among the more traditional issues covered are corpus
construction and design (text typology and variation, internal and external
criteria for their determination, and dynamic design as in virtual
corpora), as well as aspects specific to particular types of corpora. A
strong interest is shown in methodological and theoretical aspects of
diachronic corpora, which require comparability and reliability in a
completely different way than synchronic corpora. Spoken corpora are given
great attention, not only for their complex design and treatment (specific
transcription training, peculiar POS tagging features, etc.), but also for
their suitability in providing evidence and validation to phonological and
phonetic theories, the possibility of correlating language variation with
sociolinguistic and contextual features, and the possibility of comparing
cross-cultural differences emerging in textual choices.

Many methodological questions are posed or suggested, especially on the
application of statistical and frequency measures. Problems arising from
limitations in corpus design, sample choice, or comparability are presented
to show how the interpretation of quantitative data is complicated by
factors relating to content analysis, association measures, and weaknesses
in data collection techniques or in the experimental elicitation of texts.
Finally, some contributions address the application of computational
tools (parsers, taggers, lemmatizers, etc.) to languages with different
morphological and syntactic structures, posing new questions about existing
tools and their possible application, and discussing techniques for the
design of new tools.

Many issues that are hot topics in the world of corpus linguistics are
presented, exemplified, and discussed in this book, with suggestions for
future directions in the field. The volume is extremely interesting in
illustrating these issues using the concrete experience of a variety of
different projects. The only shortcomings are the absence of a structured
order (or series of sections) in the contribution sequence and of a final
content index.


Ahmad, K., D. Cheng, T. Taskaya, S. Ahmad, L. Gillam, P. Manomaisupat, H.
Traboulsi and A. Hippisley, 2006. The mood of the (financial) markets: in a
corpus of words and of pictures, in Corpus Linguistics Around the World, A.
Wilson, D. Archer, P. Rayson (eds.), Rodopi, pp. 17-32.

Dunning, T., 1993. Accurate methods for the statistics of surprise and
coincidence. Computational Linguistics, 19(1), pp. 61-74.

Nation, I.S.P., 2001. Learning Vocabulary in Another Language, Cambridge:
Cambridge University Press.

Isabella Chiari (Ph.D. in Philosophy of Language, 2000) teaches courses in
general and computational linguistics at the University La Sapienza of Rome
(Italy). Her interests lie at the intersection of linguistics and
philosophy of language. She is concerned with scientific, methodological
and theoretical issues in quantitative linguistics and linguistic
redundancy, and with linguistic behavior in speech performance (slips of
the tongue in first and second language). She is also interested in
understanding processes and their implications in language teaching and
learning, in computational tools for language teaching, and in
psycholinguistic aspects of speech errors and slips.

