Review of Multilingual Corpora and Multilingual Corpus Analysis

Reviewer: Ali Karakas
Book Title: Multilingual Corpora and Multilingual Corpus Analysis
Book Author: Thomas Schmidt, Kai Wörner
Publisher: John Benjamins
Linguistic Field(s): Text/Corpus Linguistics
Book Announcement: 24.2603

This volume, organized into five sections, is a collection of 22 contributions. Each examines several characteristics of the compilation and use of multilingual corpora, including learner and attrition corpora, language contact corpora, interpreting corpora and parallel corpora. The focus is on the design of corpora in studies on multilingualism, with a critical analysis of the available multilingual corpora, consideration of the methodological and technological problems likely to occur in the compilation and analysis of such corpora, and exemplification of linguistic analyses drawn from them.

The volume opens with an introduction, in which the editors offer information about the rationale for this volume, its primary aims, and clarification of related terms.

Section 1, “Learner and attrition corpora,” encompasses nine contributions exploring the creation and analysis of various multilingual learner corpora of different sizes. In the first paper, Ulrike Gut introduces the reader to the LeaP corpus (Learning Prosody in a Foreign Language). The preliminary findings from the corpus, which look into second language fluency, suggest differences between learners and native speakers of German and English in virtually all facets of fluency (e.g. speech rate, articulation rate, filled pauses).
In the second paper, Hanna Hedeland and Thomas Schmidt center on the possible hurdles regarding the creation, annotation and sharing of a spoken language corpus with reference to a small German learner corpus of map task recordings. Taking the re-usability of the corpus as a point of departure in their discussion, they assert that decisions taken in each stage of the creation of the corpus will influence further uses of the corpus.
Following this, Niels Ott, Ramon Ziai, and Detmar Meurers provide an overall presentation of a task-based corpus (in this case, a reading comprehension exercise corpus), which explores the appropriateness level of answers of adult learners of German to the posed reading questions. Initial analyses indicate that the majority of answers are appropriate based on the meaning assessment.
In the next paper, Heike Zinsmeister and Margit Breckle present a text-based corpus of two sub-corpora compiled from Chinese learners of German and native German students, comparing the use of local coherence between the two groups. On the basis of quantitative analyses, according to which differences were observed between the two groups (e.g. underuse of adverbs, shorter sentences, lexical limitation in L2 learners’ essays), the authors offer suggestions for the use of this specific corpus in teaching German, for example, for error analysis and contrastive analysis.
In the fifth contribution, Marta Saceda Ulloa, Conxita Lleó, and Izarbe García Sánchez give a very clear description of a spoken database composed of four sub-corpora of Spanish recorded speech samples. Recordings of bilingual speakers of Spanish and German are compared with those of monolingual German children in terms of the characteristics of their spoken language.
Conxita Lleó, in the next chapter, turns to two child language corpora of the speech of different German and Spanish monolingual and bilingual children, created over a long period of time (circa 25 years). The corpora were created with the purpose of investigating phonological first language acquisition (i.e. babbling and early lexicon development) of German-Spanish bilingual children in a comparative way.
In the subsequent paper, Annette Herkenrath and Jochen Rehbein present the outline of a spoken corpus of bilingual Turkish-German children and monolingual Turkish children’s language. In comparing bilingual and monolingual children with regard to their use of connectivity and morphological elements, the researchers apply a methodology for quantitative data analysis called Pragmatic Corpus Analysis (PCA), which they illustrate with screenshots of data analyses.
The next contribution, by Agnieszka Czachór, focuses on a Polish-German bilingual written and spoken corpus, with the aim of exposing contact-induced change in the morphosyntactic features (e.g. case markers, word order) of adult bilingual speakers of Polish and German by means of grammaticality judgment tasks.
The final paper of this section, by Tanja Kupisch, Dagmar Barton, Giulia Bianchi, and Ilse Stangen, deals with a corpus of German-French and German-Italian adult bilinguals. The authors seek to find out whether adult bilinguals show an acquisition deficit in certain linguistic domains (e.g. lexicon, morphology, syntax, and semantics).

Section 2, “Language contact corpora,” presents a group of corpora dealing with varieties of languages whose current or past statuses are described through language contact, and corpora exploring the evolution of a language with diachronic accounts of language contact. The first contribution, by Christoph Gabriel, addresses the impact of migration-induced contact with Italian and its dialects on the prosodic system (e.g. accent, stress, tones) of two varieties of Argentinian Spanish. The initial corpus analyses demonstrate that both varieties of Spanish spoken in present-day Argentina share some prosodic features with Italian, which corroborates the influence of language contact on language change, as such features were not observed in regions which did not come into contact with Italian.
In the second paper, Karoline Kühl illustrates how a corpus-linguistic approach can be utilized to distinguish established features of a contact variety (i.e. Faroe-Danish) from randomly occurring individual features. To this end, she investigates the use of the subjunctive in written and spoken corpora of Faroe-Danish. The analysis reveals that register has influenced the use of the subjunctive in Faroe-Danish, which affirms the register-specific establishment of a Faroese feature in Faroe-Danish.
In the next paper, Ariadna Benet, Susana Cortés and Conxita Lleó provide an overview of a spoken corpus of Catalan compiled with the aim of investigating particular phonological aspects of Catalan uttered by bilingual Catalan-Spanish speakers of three age groups (i.e. children, young people and adults). Conducting a descriptive analysis, they have found that the phonological deviations in Catalan are observed in the areas where the people are more exposed to the presence of Spanish, illustrating the influence of the linguistic environment on language contact and thus language change.
Magdalena Putz’s paper is based on a corpus of medical interactions between doctors (native speakers of Italian) and patients (with German dialects) in Tyrol, a region in Austria, with the aim of finding out which dialectal elements cause communication obstacles between patients and doctors while interacting in German. The researcher’s goal is to introduce a new annotation system of physicians’ and patients’ utterances, which will assist the investigation of problem-causing segments in communication.
The last contribution in this section, by Steffen Höder, deals with a corpus of historical texts written in Old Swedish, many of which were either translated from Latin or were affected by Latin sources. Höder discusses the difficulty in designing annotation schemes to analyze such a corpus due to syntactic ambiguity caused by ongoing language change. To resolve this issue, he suggests that annotation categories encompassing such characteristics as clear definition, theory-independency, language-precision and diachronic broadness should be created in order to avoid misleading results.

Section 3, “Interpreting corpora,” which consists of three chapters, explores simultaneous or consecutive interpreter-mediated communication between people who do not share the same first language. Philipp Sebastian Angermeyer, Bernd Meyer, and Thomas Schmidt deal with community interpreting corpora of three types: court interpreting, hospital interpreting and a video-recorded training session for hospital interpreters. The researchers present two types of annotations (language of utterance and translation status) and discuss ways of approaching such tasks for the purpose of extending the reusability of the data for future research.
In the subsequent paper, Juliane House, Bernd Meyer, and Thomas Schmidt focus on a corpus of consecutive and simultaneous scientific talks on genetically modified food delivered by a Brazilian professional to a non-professional German audience. The talks were translated by German interpreters. The authors give a general overview of the corpus, covering its design, compilation, and accessibility.
The final contribution to the section, by Kristin Bührig, Ortrun Kliche, Bernd Meyer, and Birte Pawlick, introduces a linguistic project named ‘Interpreting in Hospitals,’ which is concerned with communication between German doctors and patients from an immigrant background (Turkish and Portuguese), mediated by ad hoc interpreters (e.g. family members or bilingual hospital staff). They show how such a corpus can be utilized in training future medical interpreters, with a sample training session in which transcripts from the corpus are used in order to enable trainees to get familiar with the discourse types, and to equip them with linguistic and institutional knowledge they need to act as hospital interpreters.

Section 4, “Comparable and parallel corpora,” explores “comparable” corpora of a set of speech recordings created in similar settings and content, but in diverse languages, as well as “parallel” corpora consisting of texts that are translations of each other. The first paper, by Christian Fandrych, Cordula Meißner, and Adriana Slavcheva, describes a parallel spoken academic corpus from German, English and Polish concentrating on two academic genres (presentations and academic papers), where certain linguistic items are compared with one another. The paper focuses on the creation of this corpus, its design, data collection procedures and transcription conventions, comparing it to similar spoken corpora of academic English (e.g. Michigan Corpus of Academic Spoken English, British Academic Spoken English, and English as an Academic Lingua Franca).
Henrik Dittman, Matej Ďurćo, Alexander Geyken, Tobias Roth, and Kai Zimmer, in the second paper, present a written corpus of German varieties with the purpose of tracing the use of the German language throughout the 20th and 21st centuries in Germany, Switzerland, Austria and Tyrol, Italy. They seek to construct a reference corpus in which differences of vocabulary and phonological features among these varieties are compared in a specific period of time.
Oliver Čulo and Silvia Hansen-Schirra turn towards a chunk-annotated corpus of parallel texts, which are made up of English-origin German translation and German-origin English translation essays of various registers (e.g. political and fictional texts, institutional manuals). The dependency treebank corpus was created to show how it might be used in translation studies, computational linguistics, and machine translation.

Section 5, “Corpus tools,” revolves around some practical tools that corpus linguists might use for creating and analyzing multilingual corpora. The section has only two papers. In the first paper, Yvan Rose presents a description of a project named “PhonBank,” which focuses on phonological development in first and subsequent languages of learners. Rose illustrates the use of the Phon software program, which brings new functions to corpus building and analysis, ranging from data linkage and multiple-blind transcription to produced phonological forms. To illustrate how the software works in practice, a sample phonological analysis of French loan words adapted in Kinyarwanda, a language spoken in Rwanda, is explained with a visually supported sample analysis.
The second paper, by Kai Wörner, introduces the metadata model in corpus building and analysis, which basically means a set of data providing information about other data (e.g. title, creator, publisher, date, format, etc.) for both spoken and written language corpora. Wörner presents three metadata formats of spoken and written corpora, mainly drawing examples from the metadata model of EXMARaLDA (Extensible Markup Language for Discourse Annotation), “a collection of data formats and software tools for creating, analyzing and disseminating corpora of spoken language” (Schmidt & Wörner, 2009: 565) and its implementations illustrated with screenshots of sample analyses.


This collection of multilingual corpora studies, above all, appeals to a wide readership interested in multilingualism and corpus linguistics. In addition, anyone who is to some extent interested in languages or linguistic studies may find the book useful, as it covers a wide range of areas related to linguistics, such as contact situations, interpretation and translation studies, and the language learning process in terms of various language levels and sub-levels (e.g. spoken and written modes, pronunciation, written essays). The volume differs from related collections, which focus on only one aspect of bilingual corpora of certain languages (e.g. Johansson, 2007, which focuses on the English-Norwegian Parallel Corpus and the Oslo Multilingual Corpus), or just one level and sub-level of language (e.g. Teubert, 2007, which deals with bilingual and multilingual lexicography and annotation issues). Thus, this volume fills a gap in the literature of multilingualism and corpus linguistics. Another important aspect of the volume is that it includes studies on both small and large corpora and studies that deal with both the creation and analysis of multilingual corpora. The editors’ objectives of (i) introducing the audience to a large number of available multilingual corpora, (ii) raising issues frequently encountered in the methodological and technological aspects of corpus creation, and (iii) presenting a selection of linguistic analyses drawn from multilingual corpora clearly appear to have been achieved.

The editors state in the introduction of the book that they take the term multilingual corpus in a broad sense, even counting monolingual data coming from multilingual speakers. However, it appears to be a weakness to blur the boundary between “bilingual” and “multilingual” speakers and, accordingly, data. In some contributions, this division is not clear, and therefore the reader might not know whether the contribution is really about a multilingual corpus, as the name of the book suggests. The lack of organizational information within the book (no enumeration for individual chapters under the relevant sections) may also present a challenge to the reader, especially since the overall distribution of the papers for each section is not exactly balanced. While some sections have more than five contributions, some have only two. A glossary of technical terms might also have been useful.

A last point of criticism concerns the representation of multilingual corpora, particularly large ones. The contributions make little mention of corpora that are truly multilingual and recently created, such as ELFA (English as an Academic Lingua Franca) and VOICE (Vienna-Oxford International Corpus of English). It is surprising that these two corpora remain largely unmentioned within the book (see Mauranen, 2003 for further detail about ELFA, and visit the VOICE website).

Altogether, however, the book has more strengths than weaknesses, and it addresses a long-standing gap in corpus linguistic research. I would strongly recommend the book to all linguists interested in aspects of corpus creation and multilingualism.


