LINGUIST List 27.989

Thu Feb 25 2016

Review: Computational Ling; Text/Corpus Ling; Translation: Fantinuoli, Zanettin (2015)

Editor for this issue: Sara Couture

Date: 08-Sep-2015
From: Daria Dayter
Subject: New directions in corpus-based translation studies
EDITOR: Claudio Fantinuoli
EDITOR: Federico Zanettin
TITLE: New directions in corpus-based translation studies
SERIES TITLE: Translation and Multilingual Natural Language Processing
PUBLISHER: Language Science Press
YEAR: 2015

REVIEWER: Daria Dayter, Universität Basel

Reviews Editor: Robert Arthur Cote


This volume, entitled “New directions in corpus-based translation studies” and edited by Claudio Fantinuoli and Federico Zanettin, is a collection of six papers on different aspects of corpus-based methodology in translation studies. The authors report on their own efforts within this relatively new field, which explains the focus on know-how, custom-made corpus software, innovative annotation, and probing for new types of research questions. The book is based on the presentations from “Corpus-based translation studies”, a panel held during the 7th Congress of the European Society for Translation Studies in 2013. Its origin is evident throughout the collection, as most of the papers give detailed accounts of work in progress and concentrate on methodological decisions rather than on systematic analysis or an overview of quantified results, which are promised to follow as the projects unfold. This is not to say that the collection is not a success. As anyone who works in corpus-based translation studies (CBTS) knows, technological solutions are often ad hoc, and the type of research carried out is sometimes constrained by the tools available to the researcher. Even more importantly, corpus studies so far have mostly addressed staple questions related to counting and contrasting microlinguistic features with the aim of finding S- or T-universals (Chesterman 2004). Here, the authors attempt to explore less conventional territory armed with corpus tools (e.g. how translators form, reject and confirm hypotheses during the translation process, investigated with the help of a keystroke corpus in Serbina et al., this volume). The scope of investigation includes seven European languages: Basque, Dutch, German, Greek, Italian, Spanish, and English. Because of the innovative nature of the collection, it will be of interest to scholars and advanced students in the areas of translation and interpretation studies. It should also attract the attention of corpus linguists, for it demonstrates the potential applications of corpus methods in previously uncharted territory and covers new corpus design and corpus software.

The first chapter, “Creating and using multilingual corpora in translation studies” by Claudio Fantinuoli and Federico Zanettin, takes a welcome detour from the established format of an introduction to an edited volume. Instead of giving a summary of subsequent chapters, the editors identify the main issues in CBTS that appear in every contribution. These issues predictably lie in the areas of corpus design, annotation and alignment, and corpus analysis. In a terminological aside, the editors propose to resolve the debate surrounding the terms “parallel/comparable corpus” by treating them as a function of corpus architecture. On this view, a parallel corpus is a corpus where “two or more components are aligned, that is, are subdivided into compositional and sequential units (of differing extent and nature) which are linked and can thus be retrieved as pairs (or triplets, etc.)” (p. 4). A comparable corpus, in turn, is a corpus whose components are compared on the basis of assumed similarity. The papers in this collection make use of both parallel and comparable corpora, sometimes drawing on existing monolingual corpora to verify their results. The diversity of datasets and annotations (from automatic to fully manual tagging) is reflected in the range of analyses offered by the contributors, from theta theory to critical discourse analysis. Recognising the achievements of the volume, Fantinuoli and Zanettin call for “a stronger tie between technical expertise and sound methodological practice” (p. 9) to continue to move CBTS forward.

The second chapter, “Development of a keystroke logged translation corpus”, is written by Tatiana Serbina, Paula Niemietz, and Stella Neumann and focuses on the process of translation. To this end, Serbina et al. collected three sub-corpora in an experimental setting: an original English corpus of texts in popular physics and their translations into German by two distinct subject groups, professional translators and domain specialists. During the experiment, the Translog software recorded all the keystrokes and mouse clicks made by the translator and the length of pauses between them. Serbina et al. also designed a custom alignment tool that enabled them to first align the target keystrokes to tokens, and then align these to the alignment units consisting of the source-target token counterparts. The result was a richly annotated corpus that allowed the researchers to identify several intermediate products of translation, juxtapose them with the final version, and draw hypotheses about the thought process of the translator. In addition, the presence of the intermediate versions enabled the researchers to explain mistakes in the final translation by reasons other than a lack of competence in the target language or simple typos. For example, an incorrect agreement marker on the indefinite article in the phrase “eine dünnes Blatt” is ascribed to the fact that the preceding version of the translation contained another noun phrase in the same position, namely “eine dünne Alufolie”, where the feminine form had been the correct choice (p. 23). Serbina et al. conclude with an outlook on further steps in the project: expanding the corpus and including eye-tracker data to complement the keystroke logs.

Chapter 3 by Effie Mouka, Ioannis E. Saridakis, and Angeliki Fotopoulou, “Racism goes to the movies: A corpus-driven study of cross-linguistic racist discourse annotation and translation analysis”, is based on the PhD project of the first author. Mouka et al. conducted a critical discourse analysis of the choices made when translating racial slurs in movie subtitles from English into Greek and Spanish. They used the categories of Appraisal Theory (attitude, graduation, engagement) to describe each slur and to categorise the translation choice as mitigating the original, overtoning it, or maintaining the same force. The corpus on which the study is based consists of nine hours of film material annotated in the ELAN and GATE platforms. The four American films and one British film that the authors chose were all feature films belonging to the drama genre, and the stories revolved around racism and interracial relations (p. 42). Mouka et al. raise an important concern about the inherent multimodality of film data, and, especially, the shift in the mode of the message in the three sub-corpora. Although the original subtitles are text transcribed from an oral medium, the target subtitles are written. In addition to the subtitles corpus, the authors used enTenTen12, GkWaC, and esTenTen11 as reference corpora for English, Greek and Spanish. The findings, which reflect the cultural sensitivity towards heterophobic discourse that has developed in Greece and Spain as a result of their “first frontier” status in the recent influx of refugees, are said to demonstrate “the role of translation in the diachronic development of the sociolinguistic dimension of racism” (p. 65).

The fourth chapter in the collection is “Building a trilingual parallel corpus to analyse literary translations from German into Basque” by Naroa Zubillaga, Zuriñe Sanz, and Ibon Uribarri. Given the minority status of Basque, certain issues specific to this target language made the corpus compilation especially difficult. For example, it is rare to find a book translated directly from German into Basque without Spanish as a bridge language. In addition, there are very few translators who work with the German-Basque language pair, and until recently, no German-Basque dictionaries were even available (p. 78). Zubillaga et al. explain that although they initially created a Spanish subcorpus only for the Basque target texts for which no direct translation was available, they currently plan to complement every German-Basque alignment pair with the Spanish text. The findings of translation research also underscore the special status of Basque. The interference of Spanish, the dominant language of the translators, makes itself known in the Basque translations in the form of literal translations of Spanish idioms. The standardising influence of Basque, a language which is rarely used outside of official domains, is evident in the downtoning of offensive language. At the current stage, however, the authors see the creation of the parallel corpus and the accompanying tools as their main achievements. This impressive undertaking (the corpus contains 5.5 million words) involved the digitisation of hundreds of books. Tagging and aligning were done with the help of TRACE-Aligner, a program developed specifically for this corpus, and were followed by manual fine-tuning. The release of the corpus for general use has unfortunately been delayed indefinitely because of the inevitable copyright issues with the literary works.

Chapter 5 by Ekaterina Lapshinova-Koltunski, “Variation in translation: Evidence from corpora”, compares the end products of human and machine translation. Lapshinova-Koltunski extracted English source texts and their translations into German by professional translators from the CroCo corpus. She then supplemented this data with translations by inexperienced human translators using computer-aided tools, as well as with rule-based and statistical machine translations. The resulting material was tokenized, lemmatized, tagged with part-of-speech information, and segmented into syntactic chunks and sentences. To compare the different types of translation, the author resorts to the well-known features of translationese: explicitation, simplification, normalisation, and convergence. They are operationalized through a number of microlinguistic features that could be easily retrieved from the tagged corpus. For instance, simplification is measured through lexical density and type-token ratio, whereas explicitation is defined as the proportion of nominal phrases filled with pro-forms vs. full nominal phrases. Interestingly, the translation produced by the rule-based system was so poor that the inclusion of these results in the overall discussion is almost nonsensical. Overall, Lapshinova-Koltunski finds that the feature of convergence is the only one visible in all the texts, and it shows no significant variation among translation methods. As the editors remark, although the features of translationese have been extensively tested before, this paper stands out as “one of the first investigations which compares corpora obtained through different methods of translation to test a theoretical hypothesis rather than to evaluate the performance of machine translation systems” (p. 7).
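
To make the two simplification measures mentioned above concrete, here is a minimal sketch of how type-token ratio and lexical density could be computed from a POS-tagged text. It is an illustration only and not the chapter's implementation: the function names, the set of content-word tags (Universal POS style), and the toy German sentence are all assumptions.

```python
from typing import Iterable, List, Tuple

# Assumption: content words are nouns, verbs, adjectives and adverbs.
# The chapter's actual tagset and definition may differ.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def type_token_ratio(tokens: Iterable[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    lowered: List[str] = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    return len(set(lowered)) / len(lowered)


def lexical_density(tagged: Iterable[Tuple[str, str]]) -> float:
    """Share of content words among all tokens of a tagged text."""
    pairs = list(tagged)
    if not pairs:
        return 0.0
    content = sum(1 for _, tag in pairs if tag in CONTENT_TAGS)
    return content / len(pairs)


# Toy input invented for this example (word, POS tag).
sample = [
    ("Die", "DET"), ("Studie", "NOUN"), ("zeigt", "VERB"),
    ("die", "DET"), ("neue", "ADJ"), ("Methode", "NOUN"),
]
ttr = type_token_ratio(word for word, _ in sample)
density = lexical_density(sample)
print(f"TTR = {ttr:.3f}, lexical density = {density:.3f}")
```

Both measures depend on text length and on the tagset, so a comparison across sub-corpora of this kind would normally use samples of comparable size.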

The contribution by Steven Doms, “Non-human agents in subject position: Translation from English into Dutch: A corpus-based translation study of ‘give’ and ‘show’” forms the sixth chapter of the collection. Doms investigates the choices that translators make when confronted with a fundamental typological difference between two languages. The difference in question is the constraint against non-human subjects in the agent role in Dutch. In English, of course, such subjects are perfectly acceptable, as the example demonstrates: “Studies in animals have shown reproductive toxicity […]” (p. 116). The author uses the Dutch Parallel Corpus to extract sentences that contain the verbs “give” and “show” in the English source text and then cleans the data manually according to a number of criteria, e.g. filtering out phrasal verbs and idioms, choosing the sentences that have an agent as the subject, etc. Following D’haeyere (2010), Doms assigns the Dutch translations to three categories: (1) the non-human subject retained in the agent role; (2) avoidance of a non-human agent through changes to the sentence; and (3) the original non-human agent not translated. In Doms’ corpus, when choosing to avoid a non-human agent, the translators either introduced a human agent in Dutch, used a non-agentive subject (theme, recipient, possessor), or replaced the original verb “give/show” with another one. The results show, however, that in a majority of cases (57.2%), the translators retain the non-human agent, thus introducing English interference into Dutch texts.

The collection concludes with Gianluca Pontrandolfo’s contribution “Investigating judicial phraseology with COSPE: A contrastive corpus-based study.” This chapter is based on a custom-made corpus of criminal judgements, COSPE, which contains 6 million tokens in English, Spanish, and Italian. This contribution stands out from the rest of the volume because COSPE is not a parallel but a comparable corpus, i.e. the texts are not translations of each other but simply representative of the same legal genre. To query the corpus, Pontrandolfo resorted to a variety of analytical steps, ranging from corpus-driven to corpus-based. On the corpus-driven end of the continuum, he looked at n-grams and collocations of common legal terms. On the corpus-based end, he investigated complex prepositions and lexical doublets/triplets, both of which are characteristic of the judicial genre. To establish the importance of the investigated features for the legal judgements, Pontrandolfo used the BNC, CORIS/CODIS and CREA as reference corpora for English, Italian, and Spanish respectively. The findings confirmed that although there were some differences between the three sub-corpora, “phraseology is indeed a key lexico-syntactic feature of this genre and it is part of judges’ idiosyncratic drafting conventions” (p. 152).
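
As a rough illustration of the corpus-driven end of the continuum described above, the following sketch counts the most frequent n-grams in a tokenized text. It is not Pontrandolfo's procedure or tooling: the function names and the toy English sentence are invented for this example, and whitespace splitting stands in for proper tokenization.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token sequences in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens: Sequence[str], n: int, k: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams together with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)


# Toy sentence invented for this example; a real study would use the corpus texts.
text = ("the court finds that the court has jurisdiction "
        "and the court dismisses the appeal")
tokens = text.split()
top = top_ngrams(tokens, n=2, k=1)
print(top)
```

Frequent n-grams found this way are only candidates for phraseological units; comparing their frequencies against a general reference corpus, as the chapter does, is what indicates whether they are characteristic of the genre.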


Once again, shortly after the appearance of Straniero Sergio & Falbo’s “Breaking ground in corpus-based interpreting studies”, Italian scholarship has announced its intention to stay on the cutting edge of CBTS. The main strength of “New directions in corpus-based translation studies” is that it reports on the most current, ongoing research that readers would not normally have access to unless they attend thematic conferences. It is also the first volume in the new series “Translation and Multilingual Natural Language Processing” launched by Language Science Press, which promises to be a thoughtful forum dedicated to empirical and interdisciplinary investigation of translation. The contributions tie together well thanks to their methodological unity, which makes the book interesting to researchers who are currently working on or plan to undertake a project in quantitative translation studies. The only paper that somewhat breaks the pattern is Pontrandolfo’s project on phraseology in legal texts because it uses a comparable rather than a parallel corpus. As a result, it is oriented more towards developing a teaching resource than towards answering fundamental questions about translation, and ultimately is not well-situated within a translation studies framework.

The volume illustrates the trend in translation and interpreting studies to shift the attention from the product of translation towards its process – a research objective which until now has mainly drawn the eyes of cognitive linguists. In this vein, Serbina et al.’s paper proposes an excellent way to query the translation process on the level of observable tokens that can be compiled into a corpus. Zubillaga et al.’s work on a corpus of parallel German/Spanish/Basque translations feeds into this research strand from a different direction by giving a corpus analyst a view of the influence of the intermediate language version and the dominant language of the translator.

I recognise a further value in the language combinations chosen by the authors. It is especially cheering to see European minority languages, such as Basque, investigated within translation studies and through a corpus lens. Similarly, the papers based on languages such as Spanish, Italian, and Greek (Mouka et al., Pontrandolfo) fill a gap in corpus-based studies of societally relevant topics, e.g. heterophobic language and legal judgements, which to date have mostly been English-based (cf. Baker et al. 2013 on the representation of Islam in the British press, for example).

The work-in-progress nature of these papers, however, also gives rise to certain drawbacks. Given that quantitative analysis is the key strength of a corpus approach, it would have been desirable to see some overall systematic, quantified results which the authors withhold due to the ongoing status of their projects (see Serbina et al., Mouka et al., Zubillaga et al.). Some methodological decisions are skimmed over, although they appear quite critical to the study design. For example, Lapshinova-Koltunski’s paper makes one wonder about the reliability of corpus-based findings when the chosen operationalisation of the analytical categories is questionable. Is it justifiable to define normalisation solely through the proportion of nominal to verbal phrases? The author does remark that these definitions have limitations. It seems to me, however, that the limitations are too severe to talk of the global categories of translationese, and it would have been more appropriate to talk of individual linguistic features instead.

Finally, the quick production process, which brings the articles to the reader in double time, resulted in some language and formatting issues. Nevertheless, they do not affect readability or understanding in any important way. On the whole, “New directions in corpus-based translation studies” is an excellent publication that gives us a window into the ongoing research in CBTS and undoubtedly deserves the attention of translation scholars, among them those interested in literary translation, machine translation, legal translation, and corpus design. The book can also serve as supplementary reading for courses in translation studies to bring the students up to date on the state of translation research; however, they would need to refer to a simpler text to familiarize themselves with the basics.


Baker, Paul, Costas Gabrielatos & Tony McEnery. 2013. Discourse analysis and media attitudes. The representation of Islam in the British press. Cambridge: Cambridge UP.

Chesterman, Andrew. 2004. “Hypotheses about translation universals.” In Gyde Hansen, Kirsten Malmkjær & Daniel Gile (eds.), Claims, changes and challenges in translation studies. Selected contributions from the EST Congress, Copenhagen 2001, 1-13. Amsterdam: Benjamins.

D’haeyere, Laurence. 2010. Non-prototypical agents with proto-agent requiring predicates: A corpus study of their translation from English into Dutch. Gent: Hogeschool Gent.

Straniero Sergio, Francesco & Caterina Falbo (eds.). 2012. Breaking ground in corpus-based interpreting studies. Bern: Peter Lang.


Daria Dayter is a postdoctoral researcher at the University of Basel, Switzerland. Her habilitation project is a corpus-based investigation of simultaneous interpreting in the Russian-English language pair. Daria Dayter's other research interests include pragmatics of CMC, youth language, and teaching applications of the new media.

