Review of Corpus Linguistics with BNCweb - A Practical Guide

Reviewer: Elizabeth (Betsy) Craig
Book Title: Corpus Linguistics with BNCweb - A Practical Guide
Book Author: Sebastian Hoffmann, Stefan Evert, Ylva Berglund Prytz, David Y.W. Lee, Nicholas Smith
Publisher: Peter Lang AG
Linguistic Field(s): Text/Corpus Linguistics
Discipline of Linguistics
Subject Language(s): English
Issue Number: 21.663

AUTHORS: Hoffmann, Sebastian; Evert, Stefan; Smith, Nicholas; Lee, David; Prytz,
Ylva Berglund
TITLE: Corpus Linguistics with BNCweb
SUBTITLE: A practical guide
SERIES: English Corpus Linguistics, Volume 6
YEAR: 2008

Elizabeth Craig, Department of English, The University of Georgia


''Corpus Linguistics with BNCweb'' is the sixth in a series of titles from Peter
Lang devoted to English corpus linguistics. BNC is an acronym for the British
National Corpus, which has been maintained at
Lancaster University since the early 1990s and consists of 100 million words of
both written (90%) and spoken (10%) British English in over 4000 texts, which
are categorized by genre. A large corpus such as the BNC can provide accurate
information on both a word's meaning and usage through the implementation of
various query tools as explicitly described in this detailed guide to the BNCweb.

Emphasized at the outset is the fact that working with a corpus solves two
problems for language researchers: how to base conclusions on actual usage
rather than on mere introspection and how to consider a large amount of data
without the time-consuming task of interviewing individual informants. A corpus
then is not about what a researcher believes, but about what many people do with
language. Lexical behavior is revealed in patterns that can be quickly and
conveniently displayed in concordance lines through the use of sophisticated
search tools such as the BNCweb, which is designed for working with words and
phrases and their co-occurrence frequencies.


Chapter 1 begins by describing the purpose of each of the subsequent chapters,
advising readers to utilize the manual while on the BNCweb, although the
exhaustive inclusion of screenshots for every sample query discussed renders
such full participation unnecessary. The authors delineate both the advantages
and limitations of working with a corpus and with the BNCweb query tool, which
is essentially a search engine with various parameter settings.

For example, in using the DISTRIBUTION feature to look at the behavior of
'shall' in the spoken portion of the corpus, the term is found to co-occur with
either 'I' or 'we' in 90% of cases. Because the BNCweb also allows for
separating data by such sociological features as age, gender, and class, it can
be further determined that the declarative forms of 'I/we shall' are more
commonly used by older speakers, whereas the interrogative forms of 'shall I/we'
are more commonly used by younger speakers, perhaps providing a 'snapshot' of
language change in progress and indicating that the declarative form may be on
its way out of the language or simply attesting to the fact that younger
speakers ask more questions. It would be interesting to compare this British
usage of 'shall' to North American usage. The corpus was not catalogued by race
of the speakers, an unfortunate oversight in the data collection phase.

Some basic principles of corpus linguistics research such as representativeness
and methodology are outlined in Chapter 2. A corpus is a principled collection
of text, and no corpus can be truly representative of a language as a whole.
The BNC, however, by incorporating a massive number of different text types from
an array of social strata strives to present a picture of late 20th century
British English usage. It is described by the authors as ''a synchronic and
static corpus which consists of a large number of text samples that are heavily
marked-up with information about the texts, speakers, and writers, and annotated
with linguistic information (e.g. parts of speech).''

Also, the authors make the important point here that corpus linguistics,
although concerned only with performance data, does offer a way to expose
linguistic competence. The example of complex (multi-word) prepositions such as
'in terms of' and 'in response to' is used to illustrate how frequent phrasal
patterns can be indicative of mental chunking of ''indivisible units.''
Constituent boundaries are evidenced by the non-random distribution of filled
pauses in the spoken portion of the corpus. The authors demonstrate that
''filled pauses occur very frequently both immediately before and after complex
prepositions,'' but rarely in internal positions surrounding the noun. I found
this particular example extremely relevant to my own corpus research on noun
plus preposition clusters in academic writing.

Chapter 3 is largely cautionary as to how generalizable findings from any corpus
and the BNC in particular can be. After describing the BNC in some detail as a
balanced reference corpus of 4000 files, the authors explain why they used both
the highly accurate (98-99%) CLAWS POS tagset and a smaller, simplified tagset
of only 11 tags to facilitate searches for broad categories such as ''any verb.'' All
words in the corpus are annotated for HEADWORD and LEMMA as well using XML
format in the underlying source files. A discussion of the significance of
type/token ratios is also useful here.
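
For readers unfamiliar with the measure, a type/token ratio can be sketched in
a few lines of Python. This is an illustration only and is not drawn from the
book or from BNCweb: the function name `type_token_ratio`, the lowercasing, and
the whitespace tokenization are assumptions made for the example.

```python
def type_token_ratio(text):
    """Return distinct word forms (types) divided by running words (tokens).

    Assumption for this sketch: tokens are whitespace-separated and
    case is ignored. Real corpus tools tokenize far more carefully.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# 'the' occurs twice, so there are 5 types among 6 tokens.
sample_ratio = type_token_ratio("the cat sat on the mat")
print(round(sample_ratio, 3))
```

Because the ratio tends to fall as a text grows longer, it is usually compared
only across samples of similar size.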

Chapters 4 and 5 focus on methodology. Chapter 4 is where the reader may want
to begin actually sitting at a computer with access to the BNCweb, but
screenshots are provided. Several alternative ways to conduct basic searches
are covered along with some guidance in how to read and manipulate the display
of concordance lines. The default view is of complete sentences, but the user
can select the KWIC (''Key Word in Context'') view, which aligns the query item in
a fixed, central position to facilitate detection of recurrent language
patterns. Query results can also be displayed in random or corpus order and
saved in QUERY HISTORY. The inclusion of hands-on exercises at the end of this
and other chapters gives the reader a good idea of the kinds of specific
questions that can be answered through corpus inquiry and enhances the
suitability of this text for classroom instruction.
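
By way of illustration only, the KWIC alignment described above can be sketched
in Python. This is not BNCweb's implementation: the function names `kwic` and
`format_kwic`, the whitespace tokenization, and the sample sentence are all
assumptions made for this sketch.

```python
def kwic(text, keyword, window=3):
    """Collect (left context, match, right context) for each hit.

    Assumption: tokens are whitespace-separated, and surrounding
    punctuation is ignored when comparing a token with the keyword.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:!?'\"").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def format_kwic(hits):
    """Right-align the left context so every match starts in the same column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {match}  {right}" for left, match, right in hits]


# Invented sample text, not taken from the BNC.
sample = "Shall we go now? I shall go later, and we shall see."
for line in format_kwic(kwic(sample, "shall", window=2)):
    print(line)
```

With the query item held in a fixed column, recurrent words to its left and
right line up vertically, which is what makes patterns easy to spot by eye.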

Chapter 5 on ''the comparability and reliability of findings'' emphasizes why
normalized frequencies and tests of statistical significance are
fundamentally important when comparing corpora or subsections of a corpus, in
order to ensure that apparent differences are not simply due to chance. Raw
frequencies are meaningful only if you are dealing with corpora of the same
size. In comparing the normalized frequencies of the discourse marker 'in fact'
in the written and spoken subsections of the BNC, the authors demonstrate that
it is almost twice as frequent in the spoken data, which is relatively scant
compared to the written portion.
The calculation of normalized frequencies is discussed in some detail because
the authors contend that it is ''the number one source of error for novices in
corpus linguistics.'' In the interest of reliability, there is also a 'Corpus
Frequency Wizard' interface on-line for doing statistical calculations.
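
The arithmetic behind a normalized frequency can be sketched as follows. The
function name `per_million` and the choice of one million words as the base are
assumptions for this example, and the counts are invented for illustration;
they are not BNC figures.

```python
def per_million(raw_count, corpus_size):
    """Return how often an item occurs per million words of text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Invented figures, NOT taken from the BNC: one item counted in a
# large subcorpus and in a much smaller one.
large_raw, large_size = 900, 9_000_000
small_raw, small_size = 200, 1_000_000

large_pm = per_million(large_raw, large_size)
small_pm = per_million(small_raw, small_size)
print(large_pm, small_pm)
```

In this made-up case the raw count is higher in the larger subcorpus (900
against 200), yet the normalized frequency is twice as high in the smaller one,
which is exactly the kind of reversal that raw counts conceal.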

Chapter 6 outlines the use of ''Simple Query Syntax'' for more sophisticated
searches of particular affixes, parts-of-speech, wildcards, and
lexico-grammatical patterns using metacharacters.

Chapters 7 and 8 explain how search results can be further manipulated and
analyzed for specific purposes. Chapter 7 describes the automated features of
DISTRIBUTION and SORT. For example, 'because' is deemed to be ''overused'' in
school essays because this is the only written genre showing frequencies
comparable to those in the spoken genres of the corpus. Frequency breakdowns
further allow the sorting of co-occurrence patterns by type and token.

Chapter 8, in which COLLOCATIONS are discussed in great detail, covers the
automated analysis of concordance lines. A collocation is ''the habitual
co-occurrence of two (or more) words,'' and ''collocational tendencies can
arguably be seen as part of the meaning of a word.'' The concept of semantic
prosody is discussed here using the example of the word 'cause', which is shown
to have ''an overwhelming tendency to co-occur with events of a negative or
unfortunate nature.'' The value of such idiomatic information to non-native
speakers is appropriately mentioned here.

Chapter 9 explains how concordance lines may be manually annotated (tagged or
classified) depending on the user's query results. Both advantages and
disadvantages of categorizing queries are discussed. Users more familiar with
Microsoft Excel will appreciate the inclusion of instructions on how to export
and re-import query results to and from the spreadsheet database.

Chapter 10 provides a detailed guide to creating subcorpora in order to
restrict searches to particular text types. All texts are classified according
to domain, genre, time period, medium, and the sociological factors mentioned above.

Chapter 11 covers KEYWORD and FREQUENCY LIST features. A keyword is defined as
one that occurs ''with significantly greater frequency in one part of the corpus
than [in] another.'' A comparison between academic lectures and academic
writing confirms the relatively high concentration of verbs in the former and
nouns in the latter. Frequency lists are considered ''useful for detecting
potentially salient linguistic items within the corpus.'' In written genres,
'the' is found to be the most frequent word (again attesting to the 'nouniness'
of more formal registers), and pronouns such as 'I', 'you', and 'it' are the
most frequent words in spoken genres. The more nominalized style of academic
texts is also indicated by the relatively higher frequencies of prepositions
such as 'of', 'in', 'by', and 'with' in this genre, another fact I found
particularly supportive of my own corpus research.
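
One common way of judging whether a word is ''significantly'' more frequent in
one part of a corpus than in another is the log-likelihood statistic. The
sketch below is an assumption on my part: this review does not record which
statistic BNCweb applies, and the function name `log_likelihood` and its
parameters are hypothetical.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Log-likelihood (G2) for one word counted in two corpora.

    count_a, count_b: occurrences of the word in each corpus.
    size_a, size_b: total number of words in each corpus.
    """
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts, not BNC data.
same_rate = log_likelihood(10, 1000, 20, 2000)
different_rate = log_likelihood(20, 1000, 10, 1000)
print(same_rate, round(different_rate, 3))
```

A value above 3.84 is conventionally taken as significant at the 5% level (one
degree of freedom); identical relative frequencies give a value of zero.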

Chapter 12 discusses the Corpus Query Processor (CQP) for more advanced searches
and experienced users. Also mentioned is the IMS Open Corpus Workbench
, which allows for searching any annotated corpus
in the proper format.

Chapter 13 concerns the more practical aspects of running BNCweb for network
administrators. Topics include administrative access, customizable
configuration settings, the cache system of previous searches, and disk-space

Finally, a brief list of references is provided, noting seminal works in English
corpus linguistics by Douglas Biber, Graeme Kennedy, Geoffrey Leech, Charles
Meyer, Michael Scott, John Sinclair, and Michael Stubbs. There is also an
11-page glossary of computerese terms relevant to corpus inquiry. Four
appendices provide all genre classifications for the texts in the corpus,
part-of-speech tags (CLAWS), explication of the Simple Query Syntax, and
HTML-entities for less common characters. A brief index is included as well.


This is a general, introductory text suitable for an undergraduate and/or
graduate class in corpus linguistics. It demonstrates how corpus work is very
much a balance between what the tools can deliver and how the human user can
manipulate those tools to answer very elaborate types of questions about
lexico-syntactic patterns.

The greatest attribute of this text is that it is not just a corpus usage
manual, but an explication of corpus linguistics theory and methodology. In
clear prose and using many illustrative examples, the authors go into great
detail in their discussions about conducting various search queries, customizing
annotations, contrasting raw and normalized frequencies, and enhancing validity
and reliability. Throughout the text, the authors point out that the
reader/user should consider intuitively what they may expect to find with
particular queries before doing the actual searches. This practice reinforces
the value of corpus work in that our assumptions about language usage are
frequently found to be in error or in need of some finer revision in light of
the search results.
Even though the BNCweb provides a wide range of search options, the web-based
interface is attractive and quite easy to use.

Some may find it a tedious read, especially the latter chapters for advanced
users and network administrators, but such is the nature of the beast. This
volume keeps it interesting with numerous suggestions about the types of
questions that can and cannot be answered through both simple and more complex
queries, and the chapter-final exercises inspire innovative approaches
to corpus linguistics. The potential for corpus linguistics discoveries about
word/phrase frequencies has yet to be fully exploited, especially in the areas
of lexicography, sociolinguistics, and second/foreign language teaching. A
comparable, user-friendly mechanism for discovering and comparing the patterns
of North American English usage would certainly be welcome on this side of the pond.

Elizabeth Craig is an experienced ESL/EFL teacher and teacher-trainer with a master's degree in applied linguistics (TESOL) and a doctorate in second language acquisition. She was the English Language Fellow to Paraguay in 2006-2007 for the U.S. State Department and is currently teaching English and linguistics courses at The University of Georgia. Her dissertation consists of an examination of N+P clusters in a corpus of native-speaker freshman compositions in an effort to address preposition errors in second language writing. Dr. Craig is also Supervising On-line Editor of 'English around the World', a free, weekly newspaper insert for English language educators in and around Asunción, Paraguay.

