"Buenos dias", "buenas noches" -- this was the first words in a foreign language I heard in my life, as a three-year old boy growing up in developing post-war Western Germany, where the first gastarbeiters had arrived from Spain. Fascinated by the strange sounds, I tried to get to know some more languages, the only opportunity being TV courses of English and French -- there was no foreign language education for pre-teen school children in Germany yet in those days.

Review of Approaching Language Variation through Corpora

Reviewer: Annis Shepherd
Book Title: Approaching Language Variation through Corpora
Book Author: Shunji Yamazaki, Robert Sigley
Publisher: Peter Lang AG
Linguistic Field(s): Sociolinguistics; Text/Corpus Linguistics
Subject Language(s): English
This collection of works (produced in honour of Toshio Saito) is aimed at providing practical solutions for the interpretational and methodological problems that are inherently part of corpus-based research, whilst continuing Saito’s mission of integrating Japanese work on corpus research into that of the global linguistic community.

The preface contains an overview of the advantages of corpus studies with regard to studying variation, describes the overall aim of the collection and gives basic details of each paper. The volume is divided into four sections: the first discusses and proposes solutions to methodological issues when using real language samples; section two consists of case studies describing language use in different linguistic environments; the third section considers the investigation of diachronic language change through corpora; and the final section looks at variation in language usage for different purposes or by different groups/individuals.

“Interpreting Textual Distribution: Social and Situational Factors” (Stig Johansson) discusses the use of computerised corpora to study sociolinguistic variation. Johansson begins by providing an overview of two of the “earliest and most influential” corpora: Brown and the Lancaster-Oslo/Bergen Corpus (LOB). He then discusses other corpora that have been modelled on these two, giving details of how they differ from each other. All the examples given from these corpora involve cross-dialectal comparisons (e.g. American English with British English). Johansson goes on to show how corpora can also be used to investigate variation between spoken and written language and variation involving more than one linguistic feature, and he considers the possibility of developing a corpus-based grammar. He concludes that computerised corpora are a valuable tool for studying sociolinguistic variation, but that caution is required when choosing a corpus, as not all corpora are suitable for studying every type of variation.

In “Assessing Corpus Comparability Using a Formality Index: The Case of the Brown/LOB Clones”, Robert Sigley considers the comparison of data from different corpora and how to determine whether any apparent variation is genuine and not the result of differences in the way in which the corpora were compiled. The author uses a formality index to investigate variation between five corpora (i.e. Brown, LOB, WWC, Frown and FLOB), giving details of all five. He provides an in-depth discussion of how this formality index was created, before considering how the different corpora compare in levels of formality. He also draws comparisons between the corpora on a finer level, considering the differences between text categories (e.g. academic writing, fiction and religious writing) in the different corpora. He concludes that factor analysis is of use in sociolinguistic studies of corpora, as it facilitates both general stylistic comparisons and the study of individual linguistic variables.

“Approaching a Linguistic Variable: ‘That’-Omission in Mandative Sentences” (Sebastian Hoffmann and Robert Sigley) discusses how best to describe and explain linguistic variables through a case-study of ‘that’-omission in mandative subjunctive sentences. Hoffmann and Sigley begin by considering the different types of research objectives that are common to such an investigation (i.e. comparative description, confirmatory analysis, and exploratory analysis) and the steps involved in undertaking such an analysis. They then provide a detailed case-study, including details of the methodology used, an overview of their results and how they were interpreted, and a discussion of the application of a variable rules model to subsets of their data. The paper concludes with some ideas for future research on mandative sentences.

“Semantic Preference of High-Frequency Mental Verbs in the British National Corpus” (Graeme Kennedy) explores the idea that words that appear to be semantically related are nonetheless distinguished by the collocates that they adopt; for example, ‘stop’ is frequently used with verbs referring to irritating or unpleasant behaviours (e.g. complaining, moaning etc.), whereas ‘finish’ is often associated with small-scale activities or processes (e.g. washing, unpacking, etc.). Kennedy’s paper is a discussion of variation in the semantic preferences linked with ten frequently used lexical verbs in the British National Corpus (BNC), with the aim of identifying whether the collocates of these words have any underlying semantic characteristics. The paper includes a description of the rationale behind the selection of the verbs used in the study and the methodology adopted in capturing the data. Kennedy proceeds by discussing each of the verbs and their most frequent collocates individually, highlighting the semantic patterns that can be seen within the collocates for each verb, and concludes that the use of a large-scale corpus such as the BNC in studies of variation facilitates such analyses. He ends his paper with some further suggestions on how this type of research could be used in other areas of linguistics, such as second language acquisition theory.

In “Functional Variation in Use of ‘Though’ and ‘When’ Clauses”, Teruhiko Fukaya considers the divide between coordinating and subordinating conjunctions, with the aim of showing that it is gradient rather than categorical. Fukaya considers two conjunctions (typically considered to be subordinating), ‘when’ and ‘though’, and examines the extent to which they act as coordinators in the International Corpus of English (ICE-GB) and the BNC. After providing an overview of existing descriptions of the use of these two words, he highlights weak points of these descriptions, and then gives details of the results of the study, where he considers the syntactic position of ‘when’/‘though’ clauses, the existence of non-finite ‘when’/‘though’ clauses and the lexical patterns seen in the collocates of such clauses. The paper concludes by providing a functional explanation of ‘when’ and ‘though’ in terms of the semantic relationship they create between the two clauses and how they can create both paratactic and hypotactic enhancement effects.

“Comparing Adjective Comparison across Genre and Time in Standard Varieties of Modern English” (Shunji Yamazaki) discusses how corpus studies can add to the debate about adjective use in English by reviewing some of the conclusions made about the use of inflectional and periphrastic comparative and superlative adjectives through a new study of four corpora (i.e. Brown, LOB, Frown and FLOB). Existing studies have identified a number of factors that appear to influence whether an inflectional or a periphrastic adjective is used; each factor is tested in the four corpora used in the study, and the data are discussed with the aim of determining whether every factor still appears to be valid.

In his paper, “On the Occurrence and Variation of the Adverbial Subordination Markers ‘Þe’ and ‘Þæt’ in Old English texts”, Matti Rissanen explores the changes in use of adverbial subordinators in Old English through the Helsinki Corpus and the Dictionary of Old English Corpus in Electronic Form. Rissanen discusses the use of fifteen adverbial subordinators and whether they appear with a null marker, with ‘Þe’, or with ‘Þæt’, and concludes that the diachronic development of overt subordination markers in Old English is a complicated matter which requires attention to the semantic relationships between the main and subordinate clauses and the form of the adverbial subordinator, among other factors.

“The Syntactic Development of the Gerund in Early Modern English: A Survey Based on the Penn-Helsinki Parsed Corpus of Early Modern English” (Toshio Saito) makes use of the Penn-Helsinki Corpus to re-examine the conclusions reached in Saito (1993), which was a study into variant constructions of the gerund in Early Modern English using the smaller Helsinki Corpus of English Texts. The aim of the new study is to determine whether a larger corpus produces more reliable and useful results than a smaller one. After giving an overview of the structure of the corpus and background details about the development of the gerund, Saito discusses his results, and concludes that, whilst the Helsinki Corpus produced results that can be verified through the use of the Penn-Helsinki Corpus, the latter produced results of greater statistical significance.

In “The Verb ‘Pray’ in Chaucer and Caxton”, Yoko Iyeiri investigates how the use of the verb ‘pray’ changed from a marker introducing imperatives to a discourse marker meaning ‘please’ in the Middle English Period using selected works of Chaucer and Caxton as a corpus. Iyeiri focuses on ‘that’-clauses dominated by ‘pray’, showing that there are differences in the use of ‘that’ depending on the subject of ‘pray’. The paper concludes with a detailed discussion of ‘The Canterbury Tales’, showing that this work displays some interesting developments with regard to the subject matter of the paper that appear to be unique.

“Defining Periods of Middle English by Measuring Rates of Language Change” (Satoru Tsukamoto) discusses the lack of clarity in what should be classified as ‘Middle English’, and aims to provide boundaries based on computable morphological and syntactic data. Tsukamoto’s data is drawn from the Penn-Helsinki Corpus of Middle English, Second Edition. He gives details of the methodology adopted for analysing the data and discusses the rates of change that can be observed in the 14 variables under investigation. The author concludes that an examination of syntactic change (in comparison to existing examinations of phonological change) enables a division to be made between early and late Middle English around the date 1300.

Pam Peters, in “Style and Politeness: The Case of the Personal Pronoun”, considers a number of questions surrounding variation in the case forms of personal pronouns, such as whether different media (e.g. speech vs. writing) show different levels of variation and whether there are regional differences. Peters uses the International Corpus of English to investigate pronominal variability in Australian, New Zealand and British English when following ‘than’ (e.g. “he is stronger than I/me”), in coordinated phrases (e.g. “him and me/he and I are best friends”) and when preceding a gerund-participle (e.g. “I wasn’t thinking about him/his not being there”). She concludes that, whilst there are some indications that case distinctions are being eroded in English, there are also other signs that dispute this conclusion. Additionally, there appear to be some stylistic differences between spoken and written language, but all three regional varieties show similar overall trends.

“Approaching Literature as a Corpus: Gender-Based Conversational Styles in Hemingway’s ‘Hills Like White Elephants’” (Masahiro Hori) covers some of the advantages and limitations of using a corpus to study literature through an examination of the stylistic differences between the male and female protagonists in Hemingway’s ‘Hills Like White Elephants’. Hori suggests that, whilst a corpus study may not initially seem like a particularly effective way of studying style in literature, it can (when used appropriately) make a valuable contribution. After giving details about the methodology adopted, the author provides both a quantitative and a qualitative analysis of the stylistic differences between the protagonists. The paper concludes that the language of the two characters differs even in the use of function words, and that a corpus analysis of this type is only of use when gathering both quantitative and qualitative data.

“Active Listening in Conversation: Gender and the Use of Verbal Feedback” (Maria Stubbe) again focuses on how corpora can assist in studies of gender differences, concentrating on the use of supportive verbal feedback in conversations. After describing both the methodology adopted and how previous studies influenced the current one, Stubbe develops an analytical framework for analysing gender-based variation. She concludes with a discussion of some methodological issues involved in using corpora to investigate complex discourse phenomena and outlines how she approached them.


Put together with the main aim of making the use of corpora accessible to all, especially students undertaking their first independent research projects, this collection of papers covers a range of issues (methodological and interpretational) inherent in using corpora to study variation. Some of the papers focus on an explicit discussion of the merits of corpus-based studies of variation, whilst others use case studies to exemplify how corpora can be used.

The majority of papers should be accessible to all postgraduate students. They cover a wide range of linguistic disciplines (e.g. sociolinguistics, syntax, semantics and discourse analysis, to name but a few), showing the potential use of corpora for those studying variation in all of these disciplines. Whilst no explicit instructions are given on how to undertake a corpus study, there are many clear examples of how such a study could be used to enhance linguistic research. The papers by Sigley and Johansson, in particular, give an extremely clear overview of the merits of some of the major computerised corpora in existence, whilst others (e.g. Peters) explain their choice of corpus.

Considering that this volume covers such diverse areas of linguistics as semantics and discourse analysis, the editors were perhaps wise not to order the papers based on this criterion, having instead one section on methodological issues and solutions, another on variation between linguistic environments, a third on language change, and a final one devoted to variation in usage. In many ways, this division works, as it highlights the range of research areas in which corpora can be utilised. However, the student wishing to determine, for example, how a corpus study could be used to study syntactic or semantic variation, would find this layout less useful, as there are papers related to these areas throughout the book. Nevertheless, the volume flows well, with the links between the papers in each section being coherent.

I feel that this volume is a valuable addition to existing literature on both linguistic methodology and variation. Given the number of directions from which variation can be studied, it is perhaps unsurprising that there is no universal consensus on the best data collection technique: the debate over methodology has been ongoing for many years (see, for example, Cornips and Corrigan 2005 and Maguire and McMahon 2011). It is, to my knowledge, relatively rare for pre-existing corpora to be used in studies of variation; researchers (e.g. Adger 2006, Quinn 2005) have a tendency to create their own, highly specific corpora. It is, therefore, interesting to see the results that can be achieved using existing corpora, especially considering that, for many students, the idea of creating their own corpus would be intimidating (if not impossible). This volume shows that, with care, it is possible to undertake a detailed analysis of a wide range of linguistic phenomena without needing to devote time to creating a corpus.


Annis Shepherd is a Ph.D. student at the University of Southampton. Her research interests include the division of labour between syntax and morphology, intra-speaker variation and non-standard varieties of English. Her thesis focuses on case variation in English conjoined phrases.

