LINGUIST List 24.2603

Wed Jun 26 2013

Review: Text/Corpus Linguistics: Schmidt and Wörner (eds., 2012)

Editor for this issue: Anja Wanner

Date: 26-Jun-2013
From: Ali Karakas
Subject: Multilingual Corpora and Multilingual Corpus Analysis

EDITOR: Thomas Schmidt
EDITOR: Kai Wörner
TITLE: Multilingual Corpora and Multilingual Corpus Analysis
SERIES TITLE: Hamburg Studies on Multilingualism 14
PUBLISHER: John Benjamins
YEAR: 2012

REVIEWER: Ali Karakas, University of Southampton


This volume, organized into five sections, is a collection of 22 contributions. Each examines several characteristics of the compilation and use of multilingual corpora, including learner and attrition corpora, language contact corpora, interpreting corpora and parallel corpora. The focus is on the design of corpora in studies on multilingualism, with a critical analysis of the available multilingual corpora, consideration of the methodological and technological problems likely to occur in the compilation and analysis of such corpora, and exemplification of linguistic analyses drawn out from them.

The volume opens with an introduction, in which the editors offer information about the rationale for this volume, its primary aims, and clarification of related terms.

Section 1, “Learner and attrition corpora,” encompasses nine contributions exploring the creation and analyses of various multilingual learner corpora of different sizes. In the first paper, Ulrike Gut introduces the reader to the LeaP corpus (Learning Prosody in a Foreign Language). The preliminary findings from the corpus, which look into second language fluency, suggest differences between learners and native speakers of German and English in virtually all facets of fluency (e.g. speech rate, articulation rate, filled pauses).

In the second paper, Hanna Hedeland and Thomas Schmidt center on the possible hurdles regarding the creation, annotation and sharing of a spoken language corpus with reference to a small German learner corpus of map task recordings. Taking the re-usability of the corpus as a point of departure in their discussion, they assert that decisions taken in each stage of the creation of the corpus will influence further uses of the corpus.

Following this, Niels Ott, Ramon Ziai, and Detmar Meurers provide an overall presentation of a task-based corpus (in this case, a reading comprehension exercise corpus), which explores the appropriateness level of answers of adult learners of German to the posed reading questions. Initial analyses indicate that the majority of answers are appropriate based on the meaning assessment.

In the next paper, Heike Zinsmeister and Margit Breckle present a text-based corpus of two sub-corpora compiled from Chinese learners of German and native German students, comparing the use of local coherence between the two groups. On the basis of quantitative analyses, according to which differences were observed between the two groups (e.g. underuse of adverbs, shorter sentences, lexical limitation in L2 learners’ essays), the authors offer suggestions for the use of this specific corpus in teaching German, for example, for error analysis and contrastive analysis.

In the fifth contribution, Marta Saceda Ulloa, Conxita Lleó, and Izarbe García Sánchez give a very clear description of a spoken database composed of four sub-corpora of Spanish recorded speech samples. Recordings of bilingual speakers of Spanish and German are compared with those of monolingual German children in terms of the characteristics of their spoken language.

Conxita Lleó, in the next chapter, turns to two child language corpora of the speech of different German and Spanish monolingual and bilingual children, created over a long period of time (circa 25 years). The corpora were created with the purpose of investigating phonological first language acquisition (i.e. babbling and early lexicon development) of German-Spanish bilingual children in a comparative way.

In the subsequent paper, Annette Herkenrath and Jochen Rehbein present the outline of a spoken corpus of bilingual Turkish-German children and monolingual Turkish children’s language. In comparing bilingual and monolingual children with regard to their use of connectivity and morphological elements, the researchers apply a methodology for quantitative data analysis called Pragmatic Corpus analysis (PCA), which they illustrate with screenshots of data analyses.

The next contribution, by Agnieszka Czachór, focuses on a Polish-German bilingual written and spoken corpus, with the aim of exposing contact-induced change on the morphosyntactic features (e.g. case markers, word order) of adult bilingual speakers of Polish and German by means of grammaticality judgment tasks.

The final paper of this section, by Tanja Kupisch, Dagmar Barton, Giulia Bianchi, and Ilse Stangen, deals with a corpus of German-French and German-Italian adult bilinguals. The authors seek to find out whether adult bilinguals show an acquisition deficit in certain linguistic domains (e.g. lexicon, morphology, syntax, and semantics).

Section 2, “Language contact corpora,” presents a group of corpora dealing with varieties of languages whose current or past statuses are described through language contact, and corpora exploring the evolution of a language with diachronic accounts of language contact. The first contribution, by Christoph Gabriel, addresses the impact of migration-induced contact with Italian and its dialects on two varieties of the Argentinian-Spanish prosodic system (e.g. accent, stress, tones). The initial corpus analyses demonstrate that both varieties of Spanish spoken in present-day Argentina share some prosodic features with Italian, which corroborates the influence of language contact in language change, as such features were not observed in regions which did not come into contact with Italian.

In the second paper, Karoline Kühl illustrates how a corpus-linguistic approach can be utilized to distinguish established features of a contact variety (i.e. Faroe-Danish) from randomly occurring individual features. To this end, she investigates the use of the subjunctive in written and spoken corpora of Faroe-Danish. The analysis reveals that register has influenced the use of the subjunctive in Faroe-Danish, which affirms the register-specific establishment of a Faroese feature in Faroe-Danish.

In the next paper, Ariadna Benet, Susana Cortés and Conxita Lleó provide an overview of a spoken corpus of Catalan compiled with the aim of investigating particular phonological aspects of Catalan uttered by bilingual Catalan-Spanish speakers of three age groups (i.e. children, young people and adults). Conducting a descriptive analysis, they have found that the phonological deviations in Catalan are observed in the areas where the people are more exposed to the presence of Spanish, illustrating the influence of the linguistic environment on language contact and thus language change.

Magdalena Putz’s paper is based on a corpus of medical interactions between doctors (native speakers of Italian) and patients (with German dialects) in Tyrol, a region in Austria, with the aim of finding out which dialectal elements cause communication obstacles between patients and doctors while interacting in German. The researcher’s goal is to introduce a new annotation system of physicians’ and patients’ utterances, which will assist the investigation of problem-causing segments in communication.

The last contribution in this section, by Steffen Höder, deals with a corpus of historical texts written in Old Swedish, many of which were either translated from Latin or were affected by Latin sources. Höder discusses the difficulty in designing annotation schemes to analyze such a corpus due to syntactic ambiguity caused by ongoing language change. To resolve this issue, he suggests that annotation categories encompassing such characteristics as clear definition, theory-independency, language-precision and diachronic broadness should be created in order to avoid misleading results.

Section 3, “Interpreting Corpora,” which consists of three chapters, explores simultaneous or consecutive interpreter-mediated communication between people who do not share the same first language. Philipp Sebastian Angermeyer, Bernd Meyer, and Thomas Schmidt deal with community interpreted corpora of three types: court interpreting, hospital interpreting and a video-recorded training session for hospital interpreters. The researchers present two types of annotations (language of utterance and translation status) and discuss ways of approaching such tasks for the purpose of extending the reusability of the data for future research.

In the subsequent paper, Juliane House, Bernd Meyer, and Thomas Schmidt focus on a corpus of consecutive and simultaneous scientific talks on genetically modified food delivered by a Brazilian professional to a non-professional German audience. The talks were translated by German interpreters. The authors give a general overview of the corpus, covering its design, compilation, and accessibility.

The final contribution to the section, by Kristin Bührig, Ortrun Kliche, Bernd Meyer, and Birte Pawlick, introduces a linguistic project named ‘Interpreting in Hospitals,’ which is concerned with communication between German doctors and patients from an immigrant background (Turkish and Portuguese), mediated by ad hoc interpreters (e.g. family members or bilingual hospital staff). They show how such a corpus can be utilized in training future medical interpreters, with a sample training session in which transcripts from the corpus are used in order to familiarize trainees with the discourse types, and to equip them with the linguistic and institutional knowledge they need to act as hospital interpreters.

Section 4, “Comparable and parallel corpora,” explores “comparable” corpora of a set of speech recordings created in similar settings and content, but in diverse languages, as well as “parallel” corpora consisting of texts that are translations of each other. The first paper, by Christian Fandrych, Cordula Meißner, and Adriana Slavcheva, describes a parallel spoken academic corpus from German, English and Polish concentrating on two academic genres (presentations and academic papers), where certain linguistic items are compared with one another. The paper is focused on discussing the creation of this corpus, its design, data collection procedures and transcription conventions, comparing it to similar spoken corpora of academic English (e.g. Michigan Corpus of Academic Spoken English, British Academic Spoken English, and English as an Academic Lingua Franca).

Henrik Dittman, Matej Ďurćo, Alexander Geyken, Tobias Roth, and Kai Zimmer, in the second paper, present a written corpus of German varieties with the purpose of tracing the use of the German language throughout the 20th and 21st centuries in Germany, Switzerland, Austria and Tyrol, Italy. They seek to construct a reference corpus in which differences of vocabulary and phonological features among these varieties are compared in a specific period of time.

Oliver Čulo and Silvia Hansen-Schirra turn towards a chunk-annotated corpus of parallel texts, which are made up of English-origin German translation and German-origin English translation essays of various registers (e.g. political and fictional texts, institutional manuals). The corpus of dependency Treebank is created to show how it might be used for the purposes of translation studies or in computational linguistics and machine translation.

Section 5, “Corpus tools,” revolves around some practical tools that corpus linguists might use for creating and analyzing multilingual corpora. The section has only two papers. In the first paper, Yvan Rose presents a description of a project named “PhonBank,” which focuses on phonological development in first and subsequent languages of learners. Rose illustrates the use of the Phon software program, which brings new functions to corpus building and analysis, ranging from data linkage and multiple-blind transcription to produced phonological forms. To illustrate how the software works in practice, a sample phonological analysis of French loan words adapted in Kinyarwanda, a language spoken in Rwanda, is explained with a visually supported sample analysis.

The second paper, by Kai Wörner, introduces the metadata model in corpus building and analysis, where metadata are data providing information about other data (e.g. title, creator, publisher, date, format) for both spoken and written language corpora. Wörner presents three metadata formats of spoken and written corpora, mainly drawing examples from the metadata model of EXMARaLDA (Extensible Markup Language for Discourse Annotation), “a collection of data formats and software tools for creating, analyzing and disseminating corpora of spoken language” (Schmidt & Wörner, 2009: 565) and its implementations illustrated with screenshots of sample analyses.


This collection of multilingual corpora studies, above all, appeals to a wide readership interested in multilingualism and corpus linguistics. In addition, anyone who is to some extent interested in languages or linguistic studies may find the book useful, as it covers a wide range of areas related to linguistics such as contact situations, interpreting and translation studies, and language learning processes in terms of various language levels and sub-levels (e.g. spoken and written modes, pronunciation, written essays). The volume differs from related collections, which focus on only one aspect of bilingual corpora on certain languages (e.g. Johansson, 2007, which focuses on the English-Norwegian Parallel Corpus and the Oslo Multilingual Corpus), or just one level and sub-level of language (e.g. Teubert, 2007, which deals with bilingual and multilingual lexicography and annotation issues). Thus, this volume fills a gap in the literature of multilingualism and corpus linguistics. Another important aspect of the volume is that it includes studies on both small and large corpora and studies that deal with both the creation and analysis of multilingual corpora. The editors’ objectives of (i) introducing the audience to a large number of available multilingual corpora, (ii) raising issues frequently encountered in the methodological and technological aspects of corpus creation, and (iii) presenting a selection of linguistic analyses drawn from multilingual corpora clearly appear to have been achieved.

The editors state in the introduction of the book that they take the term multilingual corpus in a broad sense, even counting monolingual data coming from multilingual speakers. However, it appears to be a weakness to blur the boundary between “bilingual” and “multilingual” speakers and, accordingly, data. In some contributions, this division is not clear, and therefore the reader might not know whether the contribution is really about a multilingual corpus, as the name of the book suggests. The lack of organizational information within the book (no enumeration for individual chapters under the relevant sections) may also present a challenge to the reader, especially since the overall distribution of the papers for each section is not exactly balanced. While some sections have more than five contributions, some have only two. A glossary of technical terms might also have been useful.

A last point of criticism concerns the representation of multilingual corpora, particularly large ones. The contributions make little mention of corpora that are truly multilingual and recently created, such as ELFA (English as an Academic Lingua Franca) and VOICE (Vienna-Oxford International Corpus of English). It is surprising to see how these two corpora remain largely unmentioned within the book (see Mauranen, 2003 for further detail about ELFA, and visit the VOICE website).

Altogether, however, the book clearly has more strengths than weaknesses, and it addresses a long-standing gap in corpus linguistic research. I would strongly recommend the book to all linguists interested in aspects of corpus creation and multilingualism.


Johansson, S. (2007). Seeing through Corpora: On the Use of Corpora in Contrastive Studies. Amsterdam: John Benjamins.

Mauranen, A. (2003). The Corpus of English as Lingua Franca. TESOL Quarterly, 37(3), 513–527.

Schmidt, T. & Wörner, K. (2009). EXMARaLDA – Creating, Analysing and Sharing Spoken Language Corpora for Pragmatic Research. Pragmatics, 19(4), 565–582.

Teubert, W. (2007). Text Corpora and Multilingual Lexicography. Amsterdam: John Benjamins.

VOICE. (2013). The Vienna-Oxford International Corpus of English (version 2.0 XML). Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael Radeka. Available at


Page Updated: 26-Jun-2013