
LINGUIST List 24.2603

Wed Jun 26 2013

Review: Text/Corpus Linguistics: Schmidt and Wörner (eds., 2012)

Editor for this issue: Anja Wanner <anjalinguistlist.org>

Date: 26-Jun-2013
From: Ali Karakas <akarakasmehmetakif.edu.tr>
Subject: Multilingual Corpora and Multilingual Corpus Analysis

Book announced at http://linguistlist.org/issues/23/23-5128.html

EDITOR: Thomas Schmidt
EDITOR: Kai Wörner
TITLE: Multilingual Corpora and Multilingual Corpus Analysis
SERIES TITLE: Hamburg Studies on Multilingualism 14
PUBLISHER: John Benjamins
YEAR: 2012

REVIEWER: Ali Karakas, University of Southampton


This volume, organized into five sections, is a collection of 22
contributions. Each examines several characteristics of the compilation and
use of multilingual corpora, including learner and attrition corpora, language
contact corpora, interpreting corpora and parallel corpora. The focus is on
the design of corpora in studies on multilingualism, with a critical analysis
of the available multilingual corpora, consideration of the methodological and
technological problems likely to occur in the compilation and analysis of such
corpora, and exemplification of linguistic analyses drawn from them.

The volume opens with an introduction, in which the editors offer information
about the rationale for this volume, its primary aims, and clarification of
related terms.

Section 1, “Learner and attrition corpora,” encompasses nine contributions
exploring the creation and analyses of various multilingual learner corpora of
different sizes. In the first paper, Ulrike Gut introduces the reader to the
LeaP corpus (Learning Prosody in a Foreign Language). The preliminary findings
from the corpus, which look into second language fluency, suggest differences
between learners and native speakers of German and English in virtually all
facets of fluency (e.g. speech rate, articulation rate, filled pauses).
In the second paper, Hanna Hedeland and Thomas Schmidt center on the possible
hurdles regarding the creation, annotation and sharing of a spoken language
corpus with reference to a small German learner corpus of map task recordings.
Taking the re-usability of the corpus as a point of departure in their
discussion, they assert that decisions taken in each stage of the creation of
the corpus will influence further uses of the corpus.
Following this, Niels Ott, Ramon Ziai, and Detmar Meurers provide an overall
presentation of a task-based corpus (in this case, a corpus of reading
comprehension exercises), which explores how appropriate the answers given by
adult learners of German are to the reading questions posed. Initial analyses
indicate that the majority of answers are appropriate based on the meaning
assessment.
In the next paper, Heike Zinsmeister and Margit Breckle present a text-based
corpus of two sub-corpora compiled from Chinese learners of German and native
German students, comparing the use of local coherence between the two groups.
On the basis of quantitative analyses, according to which differences were
observed between the two groups (e.g. underuse of adverbs, shorter sentences,
lexical limitation in L2 learners’ essays), the authors offer suggestions for
the use of this specific corpus in teaching German, for example, for error
analysis and contrastive analysis.
In the fifth contribution, Marta Saceda Ulloa, Conxita Lleó, and Izarbe García
Sánchez give a very clear description of a spoken database composed of four
sub-corpora of recorded Spanish speech samples. Recordings of bilingual
speakers of Spanish and German are compared with those of monolingual German
children in terms of the characteristics of their spoken language.
Conxita Lleó, in the next chapter, turns to two child language corpora of the
speech of different German and Spanish monolingual and bilingual children,
created over a long period of time (circa 25 years). The corpora were created
with the purpose of investigating phonological first language acquisition
(i.e. babbling and early lexicon development) of German-Spanish bilingual
children in a comparative way.
In the subsequent paper, Annette Herkenrath and Jochen Rehbein present the
outline of a spoken corpus of bilingual Turkish-German children and
monolingual Turkish children’s language. In comparing bilingual and
monolingual children with regard to their use of connectivity and
morphological elements, the researchers apply a methodology for quantitative
data analysis called Pragmatic Corpus analysis (PCA), which they illustrate
with screenshots of data analyses.
The next contribution, by Agnieszka Czachór, focuses on a Polish-German
bilingual written and spoken corpus, with the aim of exposing contact-induced
change in the morphosyntactic features (e.g. case markers, word order) of
adult bilingual speakers of Polish and German by means of
grammaticality judgment tasks.
The final paper of this section, by Tanja Kupisch, Dagmar Barton, Giulia
Bianchi, and Ilse Stangen, deals with a corpus of German-French and
German-Italian adult bilinguals. The authors seek to find out whether adult
bilinguals show an acquisition deficit in certain linguistic domains (e.g.
lexicon, morphology, syntax, and semantics).

Section 2, “Language contact corpora”, presents a group of corpora dealing
with varieties of languages whose current or past statuses are described
through language contact, and corpora exploring the evolution of a language
with diachronic accounts of language contact. The first contribution, by
Christoph Gabriel, addresses the impact of migration-induced contact with
Italian and its dialects on two varieties of the Argentinian-Spanish prosodic
system (e.g. accent, stress, tones). The initial corpus analyses
demonstrate that both varieties of Spanish spoken in present day Argentina
share some prosodic features with Italian, which corroborates the influence of
language contact in language change, as such features were not observed in
regions which did not come into contact with Italian.
In the second paper, Karoline Kühl illustrates how a corpus-linguistic
approach can be utilized to distinguish established features of a contact
variety (i.e. Faroe-Danish) from randomly occurring individual features. To
this end, she investigates the use of the subjunctive in written and spoken
corpora of Faroe-Danish. The analysis reveals that register has influenced the
use of the subjunctive in Faroe-Danish, which affirms the register-specific
establishment of a Faroese feature in Faroe-Danish.
In the next paper, Ariadna Benet, Susana Cortés and Conxita Lleó provide an
overview of a spoken corpus of Catalan compiled with the aim of investigating
particular phonological aspects of Catalan uttered by bilingual
Catalan-Spanish speakers of three age groups (i.e. children, young people and
adults). Conducting a descriptive analysis, they have found that the
phonological deviations in Catalan are observed in the areas where the people
are more exposed to the presence of Spanish, illustrating the influence of the
linguistic environment on language contact and thus language change.
Magdalena Putz’ paper is based on a corpus of medical interactions between
doctors (native speakers of Italian) and patients (with German dialects) in
Tyrol, a region in Austria, with the aim of finding out which dialectal
elements cause communication obstacles between patients and doctors while
interacting in German. The researcher’s goal is to introduce a new annotation
system of physicians’ and patients’ utterances, which will assist the
investigation of problem-causing segments in communication.
The last contribution in this section, by Steffen Höder, deals with a corpus
of historical texts written in Old Swedish, many of which were either
translated from Latin or were affected by Latin sources. Höder discusses the
difficulty in designing annotation schemes to analyze such a corpus due to
syntactic ambiguity caused by ongoing language change. To resolve this issue,
he suggests that annotation categories encompassing such characteristics as
clear definition, theory-independency, language-precision and diachronic
broadness should be created in order to avoid misleading results.

Section 3, “Interpreting Corpora,” which consists of three chapters, explores
simultaneous or consecutive interpreter-mediated communication between people
who do not share the same first language. Philipp Sebastian Angermeyer, Bernd
Meyer, and Thomas Schmidt deal with community interpreted corpora of three
types: court interpreting, hospital interpreting and a video recorded training
session for hospital interpreters. The researchers present two types of
annotations (language of utterance and translation status) and discuss ways of
approaching such tasks for the purpose of extending the reusability of the
data for future research.
In the subsequent paper, Juliane House, Bernd Meyer, and Thomas Schmidt focus
on a corpus of consecutive and simultaneous scientific talks on genetically
modified food delivered by a Brazilian professional to a non-professional
German audience. The talks were translated by German interpreters. The authors
give a general overview of the corpus, covering its design, compilation, and
The final contribution to the section, by Kristin Bührig, Ortrun Kliche, Bernd
Meyer, and Birte Pawlick, introduces a linguistic project named ‘Interpreting
in Hospitals,’ which is concerned with communication between German doctors
and patients from an immigrant background (Turkish and Portuguese), mediated
by ad hoc interpreters (e.g. family members or bilingual hospital staff). They
show how such a corpus can be utilized in training future medical
interpreters, with a sample training session in which transcripts from the
corpus are used in order to enable trainees to get familiar with the discourse
types, and to equip them with linguistic and institutional knowledge they need
to act as hospital interpreters.

Section 4, “Comparable and parallel corpora,” explores “comparable” corpora of
a set of speech recordings created in similar settings and content, but in
diverse languages, as well as “parallel” corpora consisting of texts that are
translations of each other. The first paper, by Christian Fandrych, Cordula
Meißner, and Adriana Slavcheva, describes a parallel spoken academic corpus
from German, English and Polish concentrating on two academic genres
(presentations and academic papers), where certain linguistic items are
compared with one another. The paper is focused on discussing the creation of
this corpus, its design, data collection procedures and transcription
conventions, comparing it to similar spoken corpora of academic English (e.g.
Michigan Corpus of Academic Spoken English, British Academic Spoken English,
and English as a Lingua Franca in Academic Settings).
Henrik Dittman, Matej Ďurćo, Alexander Geyken, Tobias Roth, and Kai Zimmer, in
the second paper, present a written corpus of German varieties with the
purpose of tracing the use of the German language throughout the 20th and 21st
centuries in Germany, Switzerland, Austria and Tyrol, Italy. They seek to
construct a reference corpus in which differences of vocabulary and
phonological features among these varieties are compared in a specific period
of time.
Oliver Čulo and Silvia Hansen-Schirra turn towards a chunk-annotated corpus of
parallel texts, made up of German translations of English originals and
English translations of German originals in various registers (e.g. political
and fictional texts, institutional manuals). The corpus, a dependency
treebank, was created to show how such a resource might be used in
translation studies, computational linguistics, and machine translation.

Section 5, “Corpus tools,” revolves around some practical tools that corpus
linguists might use for the objectives of creating and analyzing multilingual
corpora. The section has only two papers. In the first paper, Yvan Rose
presents a description of a project named “PhonBank,” which focuses on
phonological development in first and subsequent languages of learners. Rose
illustrates the use of the Phon software program, which brings new functions
to the corpus building and analysis, ranging from data linkage and
multiple-blind transcription to produced phonological forms. To illustrate how
the software works in practice, a sample phonological analysis of French loan
words adapted in Kinyarwanda, a Bantu language spoken in Rwanda, is explained with a
visually supported sample analysis.
The second paper, by Kai Wörner, introduces the use of metadata in corpus
building and analysis, that is, data providing information about other data
(e.g. title, creator, publisher, date, format), for both spoken and written
language corpora. Wörner presents three
metadata formats of spoken and written corpora, mainly drawing examples from
the metadata model of EXMARaLDA (Extensible Markup Language for Discourse
Annotation), “a collection of data formats and software tools for creating,
analyzing and disseminating corpora of spoken language” (Schmidt & Wörner,
2009: 565) and its implementations illustrated with screenshots of sample


This collection of multilingual corpora studies, above all, appeals to a wide
readership interested in multilingualism and corpus linguistics. In addition,
anyone who is to some extent interested in languages or linguistic studies may
find the book useful, as it covers a wide range of areas related to
linguistics, such as contact situations, interpreting and translation studies,
and the language learning process, across various language levels and
sub-levels (e.g. spoken and written modes, pronunciation, written essays).
The volume differs from related collections, which focus on only one
aspect of bilingual corpora on certain languages (e.g. Johansson, 2007, which
focuses on the English-Norwegian Parallel Corpus and the Oslo Multilingual
Corpus), or just one level and sub-level of language (e.g. Teubert, 2007,
which deals with bilingual and multilingual lexicography and annotation
issues). Thus, this volume fills a gap in the literature of multilingualism
and corpus linguistics. Another important aspect of the volume is that it
includes studies on both small and large corpora and studies that deal with
both the creation and analysis of multilingual corpora. The editors’
objectives of (i) introducing the audience to a large number of available
multilingual corpora, (ii) raising issues frequently encountered in the
methodological and technological aspects of corpus creation, and (iii)
presenting a selection of linguistic analyses drawn from multilingual corpora
clearly appear to have been achieved.

The editors state in the introduction of the book that they take the term
multilingual corpus in a broad sense, even counting monolingual data coming
from multilingual speakers. However, it appears to be a weakness to blur the
boundary between “bilingual” and “multilingual” speakers and, accordingly,
data. In some contributions, this division is not clear, and therefore the
reader might not really know whether the contribution is really about a
multilingual corpus as the name of the book suggests. The lack of
organizational information within the book (no enumeration for individual
chapters under the relevant sections) may also present a challenge to the
reader, especially since the overall distribution of the papers for each
section is not exactly balanced. While some sections have more than five
contributions, some have only two. A glossary of technical terms might also
have been useful.

A last point of criticism concerns the representation of multilingual corpora,
particularly large ones. The contributions make little mention of corpora that
are truly multilingual and recently created, such as ELFA (English as a Lingua
Franca in Academic Settings) and VOICE (Vienna-Oxford International Corpus of
English). It is surprising that these two corpora remain largely unmentioned
within the book (see Mauranen, 2003 for further detail about ELFA, and visit
the VOICE website).

Altogether, however, the book clearly has more strengths than weaknesses, and it
addresses a long-standing gap in corpus linguistic research. I would strongly
recommend the book to all linguists interested in aspects of corpus creation
and multilingualism.


Johansson, S. (2007). Seeing through Corpora: On the Use of Corpora in
Contrastive Studies. Amsterdam: John Benjamins.

Mauranen, A. (2003). The Corpus of English as Lingua Franca in Academic
Settings. TESOL Quarterly, 37(3), 513–527.

Schmidt, T. & Wörner, K. (2009). EXMARaLDA – Creating, Analysing and Sharing
Spoken Language Corpora for Pragmatic Research. Pragmatics, 19(4), 565–582.

Teubert, W. (2007). Text Corpora and Multilingual Lexicography. Amsterdam: John
Benjamins.

VOICE. (2013). The Vienna-Oxford International Corpus of English (version 2.0
XML). Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa
Klimpfinger, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael
Radeka. Available at https://www.univie.ac.at/voice/page/index.php

Page Updated: 26-Jun-2013
