LINGUIST List 28.2935

Wed Jul 05 2017

Review: Applied Linguistics; Discourse Analysis; Language Acquisition; Pragmatics; Text/Corpus Linguistics: Dobrić, Graf, Onysko (2016)

Editor for this issue: Clare Harshey

Date: 14-Jan-2017
From: Luciana Forti
Subject: Corpora in Applied Linguistics

EDITOR: Nikola Dobrić
EDITOR: Eva-Maria Graf
EDITOR: Alexander Onysko
TITLE: Corpora in Applied Linguistics
SUBTITLE: Current Approaches
PUBLISHER: Cambridge Scholars Publishing
YEAR: 2016

REVIEWER: Luciana Forti, Università per Stranieri di Perugia


This volume contains eight studies stemming from the Klagenfurt Conference of Corpus-Based Applied Linguistics (CALK14).

The first contribution by Marcus Callies, entitled “Research on L2 Pragmatics at a conceptual and methodological interface” (pp. 9-31), presents a case study situated at the intersection of two interfaces: the one between pragmatics, syntax and discourse, and the one between SLA (second language acquisition) and LCR (learner corpus research). The study is based on an analysis of demonstrative clefts, which Callies defines as “a syntactic means of information highlighting located at the interface of syntax and discourse-pragmatics” (p. 15), and operationalises as “all instances of that and this followed by a form of be (‘s, is, was) and a wh-word (what, when, why, where, how)” (p. 17).
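The operationalisation quoted above lends itself to a simple pattern search. What follows is only a minimal sketch of how such a query might look, not the procedure Callies actually used: the regular expression, the function name `find_demonstrative_clefts` and the sample sentence are all assumptions made for illustration, and any hits would still need manual checking for their discourse function.

```python
import re

# Assumed pattern: "that"/"this" + a form of be ('s, is, was) + a wh-word,
# loosely following the operationalisation quoted from p. 17.
# Both straight and curly apostrophes are accepted for the contracted form.
CLEFT = re.compile(
    r"\b(that|this)(?:['’]s|\s+(?:is|was))\s+(what|when|why|where|how)\b",
    re.IGNORECASE,
)

def find_demonstrative_clefts(text):
    """Return (demonstrative, wh_word) pairs for each candidate match."""
    return [(m.group(1).lower(), m.group(2).lower()) for m in CLEFT.finditer(text)]

# Invented sample sentence, not taken from LINDSEI or LOCNEC.
sample = "That's what I meant. This is why we left, and that was how it ended."
print(find_demonstrative_clefts(sample))
```

A search of this kind only retrieves candidates; it cannot tell a genuine demonstrative cleft from a superficially similar string.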

The aim of the study is to compare the speech of native and non-native speakers of English in terms of differences in frequency of use, range of discourse functions, and L1 effects. Callies uses three corpora: the French and German sections of LINDSEI (Louvain International Database of Spoken English Interlanguage), respectively named LINDSEI-F and LINDSEI-G, and the comparable LOCNEC (Louvain Corpus of Native English Conversation).

French learners are found to use fewer demonstrative clefts, and with a more restricted functional spectrum, compared not only to natives but also to German learners, who display all discourse functions used by native speakers, though less frequently. Since demonstrative cleft constructions are dispreferred in both French and German, L1 influence may be an explanatory factor for the results based on the French subcorpus, but not for those based on the German subcorpus. On the other hand, the time spent learning English is much longer in LINDSEI-G than in LINDSEI-F. As a result, Callies concludes that the overall explanatory factor for the differences observed in the two language groups may be the number of years spent studying English at school.

The second contribution, “A focus of pragmatic competence: the use of pragmatic markers in a corpus of Business English textbooks” (pp. 33-51), by Peter Furkó, is a replication study based on a previous publication (Furkó & Mónos, 2013). The main aim of the study is to evaluate how two pragmatic markers (PMs), well and of course, are treated in textbooks, in terms of frequency and quality of occurrence, when compared to a reference corpus containing naturally-occurring discourse. Firstly, the author conducts a qualitative analysis of well and of course based on previous literature, in order to identify the functional range of their uses in naturally-occurring spoken discourse; five super categories are defined for both units of analysis. In accordance with these categories, quantitative data are presented and discussed.

The textbook corpus from the 2013 study comprised textbooks published between 1987 and 2006, while the corpus used in the present study contains textbooks published between 2000 and 2011.

The more recent corpus displays a slightly higher proportion of attention devoted to well and of course (47.9% compared to 44% from the 2013 study), but these percentages are still much lower than those based on naturally-occurring discourse. The functional spectrum of PMs observed in both textbook corpora has remained unchanged; however, the author hypothesises that the higher the occurrence of PMs, the higher the likelihood of them being used with three or more super-functions. Finally, the PM of course is found to be characterised by an utterance-initial position in most of its occurrences in the reference corpus, while in both textbook corpora it is found to be described mostly in its medial-final utterance position.

“Written summarisation for academic writing skills development: a corpus-based contrastive investigation of EFL student writing” (pp. 53-77), by Gyula Tankó, is the third study presented in the volume. It aims to evaluate the role of task effects on the kind of language elicited, by comparing the effect of writing an academic essay versus a guided summary. First, the researcher identifies the features of academic prose, both syntactic (prevalence of nominalisation, coordination and use of the passive voice) and lexical (higher density of lexical words, adjectives, linking adverbs, etc.), through a review of previous studies. Then, 50 first-year BA English major students, with varying proficiency levels, are asked to write a short independent argumentative essay as well as a guided summary. The resulting corpus of 13,903 tokens is analysed in relation to 23 syntactic complexity indices and 25 lexical complexity indices. Statistical tests are employed in order to determine significant differences between the two subcorpora, the existence of correlations between the essay and summary syntactic and lexical complexity indices, and the use of academic texts in the two types of texts.

The results ultimately indicate that the typical features of academic prose are more prominently elicited via a guided summary task, rather than an essay. While warning against possible theme effects, which were not controlled for in this study and which may affect the obtained results, Tankó indicates the implications of the study for EAP pedagogy and testing, as well as highlighting the need for further research.

The fourth contribution is by Günther Sigott, Hermann Cesnik and Nikola Dobrić and it is entitled “Refining the scope-substance error taxonomy: a closer look at substance” (pp. 79-94). The aim of the study is to establish the effectiveness of an error coding taxonomy, by determining the extent of agreement amongst a group of error annotators. The taxonomy, originally formulated by Lennon (1991) as the authors discovered after developing their own in 2014 (Dobrić & Sigott, 2014), relies on the notions of scope and substance. In the authors’ words, “Scope refers to the amount of context that is necessary in order for an error to become perceptible. Substance, by contrast, refers to the amount of text that needs to be changed so that the error will disappear” (p. 80). In order to create a coding system for the annotation of errors, with special reference to the substance dimension, the authors take into consideration four textual units beyond the word (i.e. phrase, clause, sentence, text), as well as punctuation, thus arriving at a taxonomy of 14 error types. These had already been identified in a previous study, which is not cited in this volume (Dobrić, 2015). A group of thirteen corpus linguistics students served as annotators of five texts produced by Austrian learners of English.

Overall, the study indicates a low rate of agreement amongst annotators, which the authors discuss in light of two kinds of possible factors: those related to annotators, who may have had an inadequate command of the language or may have had different perceptions as to what constitutes a norm, and those related to the taxonomy itself, which may be lacking in clarity for dealing with difficult phenomena that may arise in error analyses.
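To make the notion of agreement amongst several annotators more concrete, the following is a minimal sketch of Fleiss' kappa, one common chance-corrected agreement measure for more than two annotators. It is offered purely as an illustration: this review does not specify which measure the authors used, and the function name `fleiss_kappa` and the input format (one list of labels per annotated item) are assumptions.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per annotator). Every item needs the same number of labels.
    Undefined (division by zero) if only one category is ever used."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement by chance, from the overall category proportions.
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in p_cats)

    return (p_observed - p_expected) / (1 - p_expected)
```

With thirteen annotators and 14 error types, each inner list would hold thirteen labels drawn from the 14 categories; values close to 0 indicate agreement no better than chance, and values close to 1 indicate near-perfect agreement.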

The role of spoken metadiscourse in intercultural contexts is at the centre of Hermine Penz’s contribution “The uses and functions of metadiscourse in intercultural project discussions on language education” (pp. 95-119). By defining metadiscourse as “discourse about the evolving discourse” (p. 98), the study aims at identifying the types and functions of metadiscoursive strategies, in terms of frequency and variation, and at establishing whether there is a connection between interactivity and the kind of metadiscourse employed. In order to do this, the researcher analyses the production of two discussion groups, the first consisting of 4 participants talking about the topic “Language at the work place” and the second consisting of 6 participants talking about the topic “Intercultural communication in teacher education”.

After qualitatively categorising the types of metadiscoursive units, a quantitative analysis is conducted on the corpus resulting from the data collection, which contains about 14 thousand and 22 thousand tokens respectively for the two groups. The quantitative analysis is based on raw frequencies and percentages. The results indicate the use of similar metadiscourse functions for both groups, while observing a certain degree of variation within the single activities, in terms of frequency and types of metadiscourse.

Olga Grebeshkova’s contribution, entitled “Does code-switching exist in personal writing,” constitutes the sixth study presented in this volume (pp. 121-144). It aims to describe code-switching in a specific type of text: personal writing, which may be defined as the act of producing texts where “the author and the reader are the same person” (p. 124). This appears to be an underexplored area in research: in tracing the lines of background literature, Grebeshkova cites examples based on intra-sentential code-switching from Tolstoy’s “War and Peace,” or studies based on conversational code-switching (p. 122). This study aims to evaluate the extent to which existing models of description developed for analysing code-switching in speech are applicable to the analysis of personal writings. The study is based on the collection of 83 examination notes from French students, and 83 examination notes from Russian students, all having a high proficiency level in English. Code-switching was found in 18 notes in the former sample group and in 25 in the latter, for a total of 43 notes. The analysis of the texts is conducted according to two parameters: the first one relates to Sebba’s language content relationships of multilingual texts (p. 138); the second one, to the use of intra-sentential and inter-sentential code-switching. In relation to both parameters, the two groups display opposing trends, thus making it difficult to describe the phenomenon in terms of common features of development.

The seventh study is by Vesna Lazović and it is entitled “Frequency analysis of trigger words and money-based expressions in British and Serbian bank offers” (pp. 145-163). It applies a corpus-based methodology to a contrastive analysis between two native languages. The author creates a corpus based on texts found on the websites of 65 different banks, 33 Serbian and 32 British, for a total of about 43 thousand tokens (about 14 thousand Serbian, and about 30 thousand British). The study analyses and compares three aspects of lexical use in the two corpora.

First, the most frequent words. The two frequency lists reflect cross-cultural differences in terms of different products being offered: the British subcorpus shows an emphasis on mortgages and fixed rates, as opposed to the Serbian subcorpus showing an emphasis on loans or payments in instalments.

Second, the study analyses the quality and quantity of trigger words and money-saving expressions, finding that they recur slightly more often in the Serbian subcorpus. Third, the lexical strategies used to express some form of restriction to the offer presented are analysed. In this case, the comparison seems to reveal a marked difference between the two subcorpora: in terms of normalised absolute frequencies, British banks use restrictions 12.30 times, while Serbian banks do so only 3.66 times.

Branka Drljača Margić and Irena Vodopija-Krstanović conclude the volume with their study entitled ‘“I use English, but if need be I’m fluent in German as well”: Croatian Business professionals’ use of English and other languages’ (pp. 165-186). The study aims at evaluating the use of English in the context of the Croatian business environment, in terms of use and perceived status and importance. In order to do this, the researchers ask a sample of 117 business professionals to respond to an online questionnaire, made up of five parts designed to gain data about: field of work, mother tongue and English proficiency level; use and perceived status of English in their jobs; corporate languages used in respective companies; opinions about the ideal native speaker, and whether nativeness facilitates or hinders communication in business; the extent to which they agree with a series of statements. The findings indicate the primacy of English as a Lingua Franca in the business corporate sector, without disregarding the use of other languages if needed; a perceived need to further English language education in the corporate field; and that, although it is not deemed indispensable to attain native-like English proficiency, close to native-like proficiency is seen as a factor that is able to positively influence the image of a business professional in the corporate field.


The first feature of the volume catching one’s attention is its title. “Corpora in Applied Linguistics” is, in fact, Susan Hunston’s classic volume published in 2002 and focused on the potential of corpus-based descriptions of language in contexts of second language teaching, with brief accounts related to other areas of Applied Linguistics (Hunston, 2002). In the volume under evaluation, however, we find an addition to the original title: current approaches.

The reader is thus inevitably led towards a few basic though specific expectations. Firstly, that all contributions deal with corpora, i.e. large collections of texts that are authentic, representative and in electronic format. Secondly, that all contributions deal with studies relating to second language acquisition, or other related areas.

The volume opens with a sound study by Marcus Callies on demonstrative cleft constructions, which applies the principles of CIA, Contrastive Interlanguage Analysis. The aim of the author is not only to present the findings of the study, but also to use them to draw attention to the potential that learner corpus research has within the study of L2 pragmatics. The concluding remarks about the most likely explanation of the results obtained rely, in fact, on the corpus metadata. Thanks to the way in which the corpus was designed, the descriptive analysis can be substantiated with the analysis of the variables displayed.

The only shortcoming of this first paper is that references to previous work by the author are not included in the bibliography, which makes it impossible for the reader to know that, in fact, this study is not new, but was originally published in the same form in 2013, alongside another case study regarding the use of emphatic do (Romero-Trillo, 2013, pp. 18–19; 25–35).

Furkó’s study continues along the path of expanding the research agenda on pragmatics in the field of second language acquisition, thus going beyond the sole focus on speech acts. The literature review aimed at describing ‘well’ and ‘of course’ in terms of their pragmatic functions is sound and serves the purpose of identifying the qualitative categories that are necessary in order to conduct the quantitative analysis based on the textbook corpus. Moreover, it is informed by a corpus-based analysis of ‘well’ and ‘of course’ using the Larry King Corpus, a corpus made of transcriptions of a popular TV show that the author compiled and analysed for the same purpose at the time of his PhD research (Furkó, 2005).

However, on more than one occasion, a misleading assumption seems to be made: that all of the learners’ input derives from textbooks. One may argue that the unit of learning is the lesson, and not the textbook, and that teachers frequently plan a lesson by integrating textbook content with other activities that they may invent or take from resource books. Secondly, textbooks may come with audio components, in which case, the analysis would have to be extended to the audio transcriptions as well, which are either included in the student textbook or in the teacher’s book. This aspect does not seem to be specified in the paper.

Since they look at the corpus as a whole, the two tables reporting on the quantitative analysis conducted in the 2013 paper and in the present one (pp. 43, 45) would perhaps have benefited from aggregated quantitative measures, such as means and dispersion rates. The tables, instead, provide only absolute occurrence values, along with percentages (D-values).

Tankó’s study on the syntactic and lexical characteristics of learner prose elicited via two different tasks is one of the most interesting in the volume. It grounds the study in a solid theoretical framework, with a solid founding qualitative analysis, and provides a comprehensive quantitative analysis by taking into consideration a number of different measures, which are ultimately able to create an integrated picture in response to the proposed research question. The combination of descriptive as well as inferential statistics, along with the detailed description of the corpus used, makes the contribution stand out. The implications for pedagogy and language testing are made clear, as well as the shortcomings that may be addressed by future studies. The only shortcoming that does not seem to be mentioned is the need to create larger corpora to conduct similar studies, as almost 14 thousand tokens may be insufficient to make solid generalisations regarding the results.

The fourth study by Günther Sigott, Hermann Cesnik and Nikola Dobrić, based on analysing inter-annotator agreement in relation to the application of an error coding system, is particularly valuable in its methodology because it unveils the difficulties of error analysis and annotation and, as a result, of analysing interlanguage as a whole. The results that the study comes to, indicating a low degree of agreement amongst the annotators, may be due to many reasons which are partly discussed by the authors themselves. The issue of the norm is a central one in second language acquisition studies, and the continuum between correctness and incorrectness is very often made of a series of intermediate areas.

However valuable the study, it is not clear how it fits with the title of the volume. There is no mention of the concept of corpus in the study, unless one assumes that the corpus is represented by those five texts that the annotators are required to analyse. If we go back to the definition of corpus as a large collection of texts that are authentic, representative and in electronic format, we see that the implied notion of corpus emerging from the present study falls short.

Penz’s study of metadiscourse helps to shed light on the ways in which metadiscourse is used in intercultural communication contexts, in the field of language education studies. The study adds to the pragmatic interest significantly manifested in the volume so far and does so by providing a sound qualitative analysis upon which the quantitative analysis bases itself.

The only minor shortcoming concerns, perhaps, the fact that the data regarding the two groups of participants, whose productions make up the two corpora being analysed, are not normalised. Percentages are given for each raw frequency value, but they seem to be of little help because they do not provide a common ground for comparing two corpora that differ considerably in size, one being around 14 thousand tokens and the other around 22 thousand tokens.
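As a rough illustration of the kind of normalisation meant here, the sketch below converts raw counts into frequencies per 1,000 tokens. The corpus sizes are the approximate figures reported above; the raw counts and the helper name `per_thousand` are invented for the example and do not come from Penz's study.

```python
def per_thousand(raw_count, corpus_size):
    """Relative frequency of a feature per 1,000 tokens."""
    return raw_count / corpus_size * 1000

# Approximate corpus sizes as reported in the review.
group_sizes = {"group 1": 14_000, "group 2": 22_000}
# Hypothetical raw counts of one metadiscourse type, for illustration only.
raw_counts = {"group 1": 70, "group 2": 88}

for group, size in group_sizes.items():
    print(group, round(per_thousand(raw_counts[group], size), 2))
```

In this invented case the larger corpus has the higher raw count, yet the smaller one has the higher normalised frequency, which is exactly the kind of difference that raw figures and within-group percentages can hide.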

Grebeshkova’s interest in personal writing interestingly stems from her own experience as a bilingual writer of personal notes (p. 124). As she points out at the beginning of the article, the study is a work-in-progress, but at the same time seems to derive from her uncited doctoral dissertation, “Written code-switching in the note taking of second-language learners in bilingual classroom environments” (Grebeshkova, 2016). The attention devoted to this particular kind of writing is certainly valuable for the field of studies related to code-switching, especially in regard to the possibilities of widening the scope of the empirical basis upon which such studies are based, in terms of text variety and text medium. The divergent results that the study ultimately attains are discussed in light of two possible causes: first, different exercises and different pedagogical traditions that the students are accustomed to may have affected the extent and nature of code-switching; second, the oralised structure of code-switching may play a role in how this takes place in personal writing. Unfortunately, it is not clear whether metadata regarding the students producing the text were collected, i.e. information about the sociolinguistic background characterising each student. This kind of information may help in further interpreting the results obtained so far, by continuing and deepening the quantitative analysis. Here, the notions of ‘corpora’ and ‘applied linguistics’ are implied in their broadest meaning.

The seventh study by Vesna Lazović unites corpus-based discourse analysis with contrastive analysis, thus contributing to widening the meaning of applied linguistics which is implied in the present book. It is not clear where the list of trigger words used in the analysis is taken from. The research provides valuable data indicating the variable according to which the main cross-cultural differences between Serbian and British bank advertising are observable, namely the use of restrictions in offering certain products. In the concluding remarks to the study, Lazović provides an overall picture of the study through a useful table that summarises the main findings. Although more sophisticated statistical analysis could be performed in order to establish the significance of the results found, and even larger corpora of this kind could be built, the study is an example of the usefulness of such comparisons in cross-cultural studies.

The last study presented in the volume, conducted by Branka Drljača Margić and Irena Vodopija-Krstanović, confirms that English as a Lingua Franca is the main language used, and to be used, in the business field. It is certainly useful in order to underline the importance of English language skills in the corporate sector, which implies the need to invest in specialised teaching courses and specialised training courses for English teachers. Interestingly, the study indicates that the status of ELF does not hinder the use of national languages or other languages whenever the need arises, in contrast with the fears that we read about in newspapers or even in some research. However, it is not clear how the study fits in the volume. The data collection tool used in this study is a questionnaire, and the aim of the study is to investigate the perception and use of ELF amongst a sample of speakers in a specific working sector. There is no use of corpora and, again, the expression applied linguistics seems to be considered in its broadest meaning.

Overall, the volume presents eight interesting studies, which reflect the topic indicated by the title with varying degrees of relevance. As we have seen, not all studies are corpus-based, and not all studies deal with second language acquisition. More specifically, five make clear use of corpus linguistics methods, while three do not; on the other hand, five studies pertain to the field of second language acquisition, one to bilingualism, one to contrastive linguistics, one to ELF. In regard to the studies that are explicitly corpus-based and focused on SLA, the volume is a testimony to one of the characteristics of corpus linguistics so far, namely the fact that it is still mostly focused on English language learning, with little space devoted to studies dealing with the acquisition of other L2s.

Inspired by the principles stated in Hunston’s publication from 2002, this volume takes a number of different directions both methodologically and conceptually. It is, of course, a worthwhile read for specialists in the field who are interested in widening the scope of corpus linguistics by reflecting on areas in which corpus linguistics methods may be employed.


Dobrić, N. (2015). Quality Measurements of Error Annotation-Ensuring Validity Through Reliability. The European English Messenger, 24, 36–42.

Dobrić, N., & Sigott, G. (2014). Towards an error taxonomy for student writing. Zeitschrift Für Interkulturellen Fremdsprachenunterricht, 19(2), 111–118.

Furkó, B. P. (2005). The pragmatic marker - discourse marker dichotomy reconsidered - the case of well and of course. Unpublished PhD thesis.

Furkó, B. P., & Mónos, K. (2013). The teachability of communicative competence and the acquisition of pragmatic markers–a case study of some widely-used Business English coursebooks. Argumentum, 9, 132–148.

Grebeshkova, O. (2016). Written code-switching in the note taking of second-language learners in bilingual classroom environments. Unpublished PhD thesis.

Hunston, S. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press.

Romero-Trillo, J. (Ed.). (2013). Yearbook of Corpus Linguistics and Pragmatics 2013 (Vol. 1). Dordrecht: Springer Netherlands.


I am a PhD candidate at the University for Foreigners of Perugia, Italy. My research project deals with the use of corpora in Italian as a second language learning and teaching, with a focus on the acquisition of collocations by Chinese native speakers. It involves the creation of a corpus informed syllabus, followed by an experimental evaluation of its effectiveness. I am interested in the corpus-based analysis of Italian and English learner language, and in the design of corpus-based pedagogical materials and activities. I am also a CELTA qualified EFL teacher.

Page Updated: 05-Jul-2017