From: Kieran O'Halloran <k.a.ohalloran@open.ac.uk>
Subject: Corpus Linguistics Seminar (British Association for Applied Linguistics)
Short Title: BAAL
Date: 28-Apr-2006 - 28-Apr-2006
Location: The Open University, Milton Keynes, United Kingdom
Contact: Lia Blaj
Contact Email: L.L.Blaj@open.ac.uk
Meeting URL: http://www.baal.org.uk/sigs_corpus.htm
Linguistic Field(s): Applied Linguistics; Text/Corpus Linguistics
British Association for Applied Linguistics (BAAL) Corpus Linguistics Seminar
Theme: Text analysis using corpora - methodological issues.
Date: Friday April 28th 2006
Place: The Open University, Milton Keynes, UK. (for travel directions see
Time and Room: 10am - 5pm, Central Meeting Room 15, Christodolou Building.
Professor Guy Cook, The Open University.
Professor Susan Hunston, University of Birmingham.
Here are the programme and abstracts for the annual meeting of the BAAL Corpus
Linguistics Special Interest Group, held at the Open University, Milton Keynes,
UK on April 28th 2006. (BAAL = British Association for Applied Linguistics).
Topic: 'Text analysis using corpora - methodological issues'
10.00 Introduction (Kieran O'Halloran)
10.15 - 11.15 Susan Hunston (Guest Speaker) Text and Intertextuality: Debating
11.15 - 12.15 Guy Cook (Guest Speaker) "It just says 'could'. Yes I just
spotted that." Corpus facts in discourse analysis.
12.15 - 12.45 Bettina Starcke, University of Trier Corpus Linguistic Evidence
and Criteria for its Evaluation
Problems with Data Identification
1.45 - 2.15 Lynne Cameron and Alice Deignan, University of Leeds Emergentism,
metaphor and text analysis
2.15 - 2.45 David Oakey, University of Birmingham Phraseology beyond the bundle:
finding a way in
2.45 - 3.15 Duncan Hunter and Richard Smith, University of Warwick Identifying
keywords and charting their development: Methodological issues in corpus-based
historical research
Coffee / Tea
Issues around Spoken Data
3.45 - 4.15 Svenja Adolphs and Dawn Knight, Nottingham University Analysing
spoken corpora: methodological issues and technological challenges
4.15 - 4.45 Nuria Hernandez, Freiburg University Dialect corpora and
orthographic dialect transcriptions - some methodological considerations
4.45 - 5.00 Summary and final discussion
Susan Hunston, University of Birmingham
This paper will tackle three main issues. First, when corpus investigation
techniques are used in the service of other disciplines, whose research
questions take precedence? This issue will be discussed largely in the context
of corpora used in literary studies, where two different methodological
approaches will be compared. Secondly, possible objections to the assumptions
underlying corpus investigations (of the type often described as corpus-driven)
are explored. The question asked is: to what extent does a focus on intertext
make such methodologies inevitable? The final and most extensive part of the
paper will report my own efforts to integrate text and corpus approaches to the
study of evaluation. This will comment on quantitative and qualitative
techniques and consider their application to a comparison between three texts on
the topic of avian flu.
"It just says 'could'. Yes I just spotted that." Corpus facts in discourse
analysis
Guy Cook, The Open University
Corpus linguistics often presents itself as a replacement for other kinds of
analysis. It is claimed in particular that the "facts" of a corpus analysis
are superior to those produced by intuition. This talk seeks to find a middle
way between the two poles inherent in this opposition. It advances a
constructive critique of some accepted corpus wisdom. It argues that
* the objectification of "actual" rather than idealised language should not be
  bought at the cost of idealising language users;
* for use in discourse analysis, corpus analysis needs a theory and evidence of
  whether its facts are consciously or subconsciously noticed by actual
  language users;
* certain discourse qualities such as eloquence and salience are beyond the
  reach of corpus analysis;
* intuition remains the basis of key operations in corpus analysis, including
  corpus construction, and the interpretation of semantic prosody and key-word
  lists.
The argument does not dismiss corpus analysis, however. On the contrary, it
acknowledges the invaluable insights it allows. But it is suggested that corpus
analysis is strongest when it presents itself as a component rather than the
totality of discourse analysis, and works in conjunction with investigation of
what is salient and valuable to actual users.
The argument is illustrated with reference to the speaker's use of corpus
analysis in four research projects on the language of controversies over food
politics: one on food labels, two on the GM food debate, and one on organic food
marketing. In each of these projects, analysis of a corpus of language use was
combined with intensive analysis of short texts, and with interview and
focus-group data, to produce a rounded view of how specific linguistic choices
reflect the values and strategies of real writers, and affect the views and
behaviour of real readers.
Corpus Linguistic Evidence and Criteria for its Evaluation
Bettina Starcke, University of Trier
The question of whether corpus linguistics generates objective linguistic
evidence is a central question in the evaluation of corpus linguistic analyses.
Arguments in favour are that the corpora or texts that corpus linguists analyse
are not subject to change once the analyses have started, and that the use of
software contributes to the objectivity of the analyses. On the other hand,
the choice of corpus is subject to the analyst's personal or professional
interests, and the software and its settings are selected by the researcher.
Finally, the interpretation of the data generated by the software is a
subjective process. Objective and subjective features in a corpus linguistic
analysis are therefore interdependent.
This means that in order to evaluate the reliability of an analysis, we need
fixed criteria to test its scientific rigour and the validity of conclusions
drawn from the data. The four criteria I suggest for this process are growth of
knowledge, replicability, checkability and innovation.
The question of whether an analysis enhances our knowledge with regard to the
original research purpose is essential. Its answer and evaluation should take
into account the probabilistic and comparative nature of corpus linguistic evidence.
To allow for the replication of an analysis, documentation of the research
process is required. This includes a description (or the inclusion) of the data
and the software with its settings used for the analysis in the report on the
research. Documenting these parameters allows other researchers to identify
decisions taken in the original research and to question them. And, more
importantly, the resulting transparency facilitates a better understanding of
the purpose and reasoning of the research.
Checkability expands replicability as described above. In addition to
facilitating a better understanding of the original research, it also requires
researchers to make their analysis transparent to an extent that enables others
to test the techniques and hypotheses on different data, software, theoretical
premises etc. This allows for an evaluation as to whether the results from the
source study can be generalized and hold if checked with different data. Again,
the probabilistic nature of corpus analyses has to be considered.
Asking whether a linguistic study is innovative and brings new insights into the
field of study is the fourth criterion. This entails the question of whether the
choice of method was appropriate and whether the evidence generated is the best
Emergentism, metaphor and text analysis
Lynne Cameron and Alice Deignan, University of Leeds
Patterns of metaphor use found in a small, hand-searchable corpus of transcribed
talk were subjected to further investigation in a large computerized corpus,
using the now well-established technique of combining small and large corpora
(for example, Cameron & Deignan 2003). In this study, the small corpus consists
of conversations between the daughter of a man killed by a bomb planted by the
Irish Republican Army, and the perpetrator of the bombing. The large corpus was
the spoken section of the Bank of English. Close analysis of metaphor in the
small corpus reveals a number of semi-fixed multi-word expressions with
non-literal meaning. These expressions - for example, walk away from - are not
'idioms' and yet are idiomatic in some sense; they are not fixed and yet show
some levels of fixedness; they are not completely predictable in use and yet are
far from random; they are not clearly metaphorical, often being metonymic or
ambiguous. As such, they present problems for analysis at both a formal and
semantic level. It appears that there is a bundle of linguistic, semantic,
pragmatic and affective patterns of use that constrain metaphorical production
and interpretation of an expression like walk away from. An analysis of the
concordance for walk* away from in the larger corpus confirmed this. It also
suggested a close link between the exact linguistic form of walk away from, and
its semantic and pragmatic features.
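
As a rough illustration of the kind of concordance query described above, the
following minimal Python sketch retrieves keyword-in-context lines for a
wildcard search such as walk* away from. The function name, the regular
expression and the sample sentences are invented for illustration; they are
not the authors' tool, query syntax or data.

```python
import re


def concordance(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match in text."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Assumed reading of the wildcard: walk, walks, walked or walking,
# followed by "away from".
WALK_AWAY_FROM = r"\bwalk(?:s|ed|ing)?\s+away\s+from\b"

# Invented toy text, not taken from either corpus.
sample = (
    "She could not walk away from the past. "
    "He walked away from the table. "
    "They are walking away from any idea of blame."
)

for line in concordance(sample, WALK_AWAY_FROM):
    print(line)
```

Aligning the node expression in a single column makes it easier to compare
what occurs to its left and right across the inflected forms.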
Corpus researchers have found that expressions like these form a sizeable
section of many concordances, and yet they are often left to one side because
there is no way to categorise or account for them. Our talk offers a first
attempt at such an account. We adopt an emergentist perspective (MacWhinney
1999; Larsen-Freeman & Cameron, in press). This sees human systems, including
language, as complex dynamic systems, and the language repertoires of
individuals and social groups as emergent phenomena, resulting from processes of
adaptation and change over time. We offer the term 'metaphoreme' as a descriptor
for the pattern bundles found in our corpora, suggesting how metaphoremes emerge
from, and contribute to, language use.
Phraseology beyond the bundle: finding a way in
David Oakey, University of Birmingham
This is a talk about my work in progress, looking at problems in identifying
discontinuous and semi-fixed phrases in corpora against the background of
previous work on collocation (cf. Sinclair), fixed expressions and idioms (cf.
Moon), (cf. Nattinger and DeCarrico), metadiscourse (cf. Hyland) and lexical
bundles (cf. Biber et al.). I describe the methodology by which a commonly used
phraseological item was identified from a 40-million-word comparative corpus of
research articles from eight academic disciplines. I then look in detail into
the textual environments of this particular phraseological item and present
examples of the variations in its use across the disciplines. This leads to some
remarks about how quantitative, lexical, semantic, syntactic, and pragmatic
features need to be taken into account when making statements on the nature of
phrase boundaries in academic discourse. Audience feedback would be much
Identifying keywords and charting their development: Methodological issues in
corpus-based historical research
Duncan Hunter and Richard Smith, University of Warwick
The focus of this paper will be a discussion of methodological issues
surrounding the investigation of keywords within corpora, in particular in the
context of historical research into the discursive construction of particular
academic or professional communities. We begin by describing two different
approaches to the investigation of keywords using corpus tools. The first,
developed by Stubbs (1996), initially identifies a set of keywords through a
process of intuition, and then applies concordancing techniques and statistical
procedures to describe features of their collocation. The second, referred to
notably by Fairclough in his (2000) study of the language of New Labour, deploys
corpus tools during the initial stage of keyword selection, based on their
frequencies relative to a larger corpus. Further analysis of the terms'
collocation and semantic prosody is then carried out using techniques similar to
those of Stubbs.
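
The second approach, selecting keywords by their frequency relative to a larger
corpus, can be sketched roughly as follows. This is only an illustration under
stated assumptions: the scoring (a ratio of relative frequencies with add-one
smoothing), the function name, the thresholds and the toy token lists are all
invented here, and are not the procedure or data used by Fairclough or by the
authors.

```python
from collections import Counter


def keywords(study_tokens, reference_tokens, top_n=3, min_count=2):
    """Rank words by how much more frequent they are in the study corpus
    than in the reference corpus (ratio of relative frequencies)."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    n_study = len(study_tokens)
    n_ref = len(reference_tokens)
    scores = {}
    for word, count in study.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        rel_study = count / n_study
        # Add-one smoothing so words absent from the reference corpus
        # do not cause a division by zero (an assumption of this sketch).
        rel_ref = (ref[word] + 1) / (n_ref + 1)
        scores[word] = rel_study / rel_ref
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented toy data standing in for a study corpus and a reference corpus.
study = "reform policy reform market reform the policy the".split()
reference = ("the cat sat on the mat and the dog sat on the rug "
             "while a policy was read").split()

print(keywords(study, reference))
```

Keyword tools more commonly use a significance statistic such as
log-likelihood in place of a raw ratio; the ratio is used here only to keep
the sketch short.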
We will suggest that, methodologically, Fairclough's approach to the selection
of keywords has the advantage of being more empirically reliable, in that the
selection of keywords is related to evidence of frequency within the corpus. We
shall also demonstrate the results of some preliminary corpus investigation,
following the model of Fairclough's research. However, we shall also discuss the
drawbacks of a purely statistical approach to the selection of keywords, showing
that there may be advantages to combining statistical analysis with more
Although the main focus of our paper is on appropriate procedures for
corpus-based selection of keywords, we also wish to touch on some of the
specific requirements and benefits of historical corpus-based research,
considering, for example, methodological issues relating to the identification
of keywords at different points in time, and the potential advantages of
tracking keywords diachronically in enabling in-depth understanding of the
evolution of a particular discipline or profession.
Fairclough, Norman (2000). New Labour, New Language? London: Routledge.
Stubbs, Michael (1996). Text and Corpus Analysis: Computer-Assisted Studies of
Language and Culture. Oxford: Blackwell.
Analysing spoken corpora: methodological issues and technological challenges
Svenja Adolphs and Dawn Knight, Nottingham University
The difficulties associated with the development of spoken corpora large enough
to yield stable analytical results have meant that much of corpus linguistics
has focused on the analysis of written discourse. However, alongside the
large-scale studies of lexico-grammar on the basis of mainly written corpora,
there has been a consistent effort in the exploration of spoken discourse using
a corpus-based approach. Spoken corpora provide a particularly valuable resource
for both quantitative and qualitative types of analysis of specific pragmatic
functions. As such they can help in the re-evaluation of claims and concepts
that originate in more philosophical traditions where the conceptualisation of
pragmatic functions has arguably received most attention.
However, one of the key differences between written and spoken corpus analysis
is that current spoken corpora tend to be mediated records, textual renderings
of events which are multi-modal in nature, and thus capture only a limited and
limiting aspect of the reality of that event. As a result, analyses of pragmatic
functions in spoken corpora tend to exclude the exploration of the interplay
between gesture and language and therefore neglect a core element in the
construction of meaning in interaction.
This presentation reports on the development of a multi-modal spoken corpus at
the University of Nottingham and explores the implications of a multi-modal
corpus analysis for our understanding of pragmatic categories. Using as an
example the category of active listenership in conversation, the presentation
focuses on the way in which corpus-based descriptions of functional categories
might be affected by the systematic exploration of a multi-modal resource.
Technological and methodological issues with regard to data capture and
representation will be discussed alongside possible areas of application within
the field of applied linguistics.
Dialect corpora and orthographic dialect transcriptions - some methodological
considerations
Nuria Hernandez, Freiburg University
This paper elaborates on some practical and theoretical issues that might be
encountered when working with dialect data. Based on experiences with FRED, a
corpus of English dialects recently compiled at Freiburg University, I will
consider some practical problems concerning the searchability of dialect
transcripts as well as general restrictions on linguistic claims based on
FRED is currently one of the largest databases for English dialects, with 300
hours of speech recordings and over 2.5 million words of corresponding
transcripts. It consists of casual oral history interviews with native speakers
from all over the British Isles. According to age and other social factors,
these informants qualify as traditional dialect speakers. With its 370
interviews from 9 major dialect areas (including Wales, Scotland and the
Hebrides) FRED is a valuable database for diatopic variation. Nevertheless, it
represents but a section of possible varieties of English and its significance
for linguistic generalisations is therefore restricted. As is the case with
other dialect corpora, researchers might have to reconsider the data at hand and
decide where to place them on a standard - non-standard continuum. Different
accounts of the same data may vary, depending on the definition of terms like
'dialect', 'spoken standard', etc. and the linguist's estimation of potentially
influencing factors such as the interview situation. Depending on the phenomenon
under investigation, transcription guidelines may complicate a classification.
FRED, which was collected for morpho-syntactic research purposes, consists of
easy-to-read orthographic transcripts that were partly standardized, and
nonstandard pronunciations such as h-dropping are not always reproduced.
However, pronunciation variants do occur, and we need to know them before being
able to search and analyse them.
My aim is to draw attention to the necessity (i) of clearly establishing the
type of variety on which any linguistic study is based as well as its
distance/proximity to a previously defined standard and (ii) of paying special
attention to the degree of standardization that the data might have undergone
from speech recording to transcript. I will propose that, for orthographically
transcribed dialect corpora like FRED, a consistent inlined annotation scheme
(myself [miself]) comprising both the dialect variant and the standard form
presents an optimal solution, preserving the readability and searchability as
well as rendering a more adequate picture of the amount of variation found in
More information on the day can be found at:
or by contacting:
Lia Blaj (L.L.Blaj@open.ac.uk)
Institute of Educational Technology,
Room 199, Geoffrey Crowther Building,
The Open University,
Milton Keynes MK7 6AA
The local organisers are: Dr Kieran O'Halloran, Dr Caroline Coffin (Centre for
Language and Communications, The Open University), Lia Blaj (Institute of
Educational Technology, The Open University) in concert with Dr Paul Thompson
(University of Reading), the Corpus Linguistics SIG convenor.
Dr Kieran O'Halloran
Centre for Language and Communications
Faculty of Education and Language Studies