LINGUIST List 13.2840

Mon Nov 4 2002

Review: Software: BNCweb ver. 2.0

Editor for this issue: Terence Langendoen <terry@linguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Simin Karimi at simin@linguistlist.org.

Directory

  1. Joybrato Mukherjee, BNCweb Version 2.0

Message 1: BNCweb Version 2.0

Date: Mon, 4 Nov 2002 21:37:07 +0100
From: Joybrato Mukherjee <j.mukherjee@uni-bonn.de>
Subject: BNCweb Version 2.0

BNCweb Version 2.0 on CD-ROM (2002) by Hans Martin Lehmann,
Sebastian Hoffmann and Peter Schneider. Software available
at <http://homepage.mac.com/bncweb/home.html>. 30.00 EUR.

Rolf Kreyer and Joybrato Mukherjee, University of Bonn

As described in the announcement on Linguist List (see
<http://www.linguistlist.org/issues/13/13-1709.html>),
BNCweb is a new software tool that allows for access to the
100-million word British National Corpus (BNC) via the
internet or a local area network. With the release of
BNCweb, the BNCweb team has succeeded in providing
linguists with a web-based interface that is particularly
user-friendly and that includes a wide range of search and
query options. As will become clear from the synopsis and
the critical evaluation, we feel that both the user-
friendliness and the range of available options render
BNCweb an extremely helpful and powerful software tool for
linguists keen on working with the BNC.

It should be noted at the outset that BNCweb can only be
installed on the basis of a full installation of the BNC
text files, the software tool SARA (with its index files)
and some additional UNIX tools (e.g. the MySQL database
system). Also, BNCweb in its current version 2.0 is not
compatible with the first release of the BNC (cf. Aston and
Burnard 1998) but can only be added to the second version
of the BNC, i.e. the BNC World Edition (cf. Burnard 2000).
Although a detailed step-by-step description of the
installation process is available in the BNCweb manual, the
BNCweb team points out that installers should have some
basic skills in system administration. The BNCweb manual,
which is included on the CD-ROM and is a very useful aid
for BNCweb beginners, was written by Ylva Berglund
(Oxford Text Archive), Sebastian Hoffmann (University of
Zurich), David Y.W. Lee (University of Michigan) and Nicholas
Smith (University of Lancaster).


SYNOPSIS
On starting BNCweb the user will find several groups of
options. The two main search options are the 'standard
query' and the 'lemma query'. While the former query type
searches for words or phrases without being sensitive to
word-class distinctions, the latter query type allows for a
search for words by additionally specifying the lemma type
(such as adjective, verb, noun, conjunction, etc.). With
both options it is possible to restrict searches to a
subset of either written or spoken texts. These subsets can
be defined by selecting relevant metatextual categories
(into which all texts have been grouped in the BNC). It is
thus possible, for example, to search for a specific word
or phrase in all 'written' texts that deal with the subject
matter of 'social science' and have been written by 'male'
authors aged '25-34'. The search results can be viewed in
sentence format or in keyword-in-context (KWIC) format. One
mouse-click on one of the matches opens up the context of
the result, showing up to 25 preceding and 25 subsequent
so-called <s>-units. This is a considerable improvement on
SARA, which has an extremely restricted range of only 2
<s>-units. A further mouse-click gives access to all the
information that is available for the text-file in which
the match occurs. Furthermore, the user may wish to view
all part-of-speech (POS) tags for the match or the entire
context. In addition, word-class colouring is available.
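
For readers unfamiliar with the KWIC format mentioned above, the following
minimal Python sketch shows the basic idea: each match is printed on its own
line with a fixed amount of left and right context, so that the matches line
up in a column. This is only an illustration and not BNCweb's own code; the
function name 'kwic', the 'width' parameter and the sample sentence are our
assumptions.

```python
def kwic(tokens, keyword, width=4):
    """Return one line per occurrence of `keyword`, with up to `width`
    tokens of context on either side (a simplified KWIC display)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the matches line up.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines


# Invented sample sentence, not taken from the BNC.
sample = "the young woman saw another woman in the garden".split()
for line in kwic(sample, "woman"):
    print(line)
```

A sentence view, by contrast, would simply print the whole <s>-unit that
contains the match.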

The post-query options offer the usual functions of 'thin',
'sort', 'collocations', 'delete' and 'save' - functions
that can also be found in SARA. What is unique to BNCweb,
however, is the detailed descriptive statistics for the
query result that it offers in the 'distribution' option,
including the distribution of the word or phrase at hand
across the various metatextual categories (e.g. spoken vs.
written component, age of author, sex of author, year of
publication, social class of respondent). Both absolute
numbers of hits and normalised frequency counts (measured
in instances per million words) are given. Thus, the user
may, for example, find out more about the extent to which
factors such as age and sex impinge on the frequency of a
particular word or phrase. With the help of
crosstabulation, it is also possible to detect the
quantitative effect of two categories in combination. Perhaps
the most innovative feature of BNCweb is the post-query
option 'tag-sequence search'. In the 'simple mode', the
tag-sequence search makes it possible to specify syntactic
structures within the four words preceding and/or following
the match of a given query. Thus, for example, after having
searched for all occurrences of the word 'woman', the user
can easily retrieve from the entirety of matches those
instances in which 'woman' is premodified by a determiner
and an adjective or a determiner and two conjoined
adjectives. In the 'advanced mode', the user is allowed to
search for syntactic structures in a window of up to 10
preceding and following words, and the queries can be
specified even further. Instead of just looking for a
conjoined adjectival premodification the user may, for
example, demand that the first adjective be 'young',
yielding 'a young and lovely woman' as a result. It is, by
the way, worth mentioning that in BNCweb there are no
restrictions on the number of hits, whereas SARA allows
a maximum of only 2000 hits to be downloaded.
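
The normalised frequency counts mentioned above (instances per million words)
can be sketched in a few lines of Python. The sketch is ours and only
illustrates the arithmetic; the hit counts and category sizes below are
invented for the purpose of the example and are NOT figures from the BNC.

```python
def per_million(hits, corpus_words):
    """Normalised frequency: instances per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


# Hypothetical counts: category -> (number of hits, words in that category).
counts = {
    "spoken": (1_200, 10_000_000),
    "written": (5_400, 90_000_000),
}

for category, (hits, words) in counts.items():
    rate = per_million(hits, words)
    print(f"{category:8} {hits:6} hits  {rate:8.1f} per million words")
```

In this invented example the written category has more hits in absolute
terms, but the item is twice as frequent in the spoken category once the
different category sizes are taken into account. This is why it is helpful
that both figures are given.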

Apart from the two main options (i.e. standard and lemma
query), a wide range of other functions is available as
well. For example, the user may browse a complete BNC-
document and may look at an <s>-unit in a larger context,
if the 50-<s>-units window in the post-query option is not
sufficient. Again, the user can opt for a POS-tagged or an
untagged output, both of which can be displayed in plain or
coloured mode. 'Word-lookup' offers an alphabetically
ordered list of all forms of a given word and all
composites that start with this word. A search for 'hate'
based on word-lookup, for example, yields all word-forms of
the verb and the noun 'hate' and all composites beginning
with 'hate' such as 'hate-mail' and 'hate-inspired'.
Additionally, both the frequency of each form and its word
class are indicated; a lemmatised list is available on
demand. BNCweb makes it possible to easily retrieve
'frequency lists' of words and lemmata from the whole BNC
(i.e. 'lemma-frequency' and 'POS-frequency'). With the
'scan keywords/titles' option the user can search the
<title> or the <keyword> elements in the headers of all
BNC-documents and identify, for example, all BNC-documents
that have the word 'animal' in the title. In this context,
the user can choose not only from the list of descriptive
keywords that was available for the first version of the
BNC, but also from the standardised list of keywords
provided by the professional library cataloguing system
COPAC that have been assigned to the text documents
included in the BNC World Edition. It is obvious that
scanning keywords and titles is particularly useful for
designing subcorpora. Another way of designing a subcorpus
is through the option 'exploring genre labels'. Each BNC-
document has been assigned a genre-label in BNCweb on the
basis of David Lee's (2001) alluring genre classification
of the entire BNC. In this regard, BNCweb users have a
clear advantage over linguists that rely exclusively on the
BNC World Edition since the latter offers no immediate and
equally convenient access to such a genre classification of
the BNC texts. In a first step, the 'exploring genre
labels' option of the BNCweb can thus be used to retrieve a
list of, say, all parliamentary debates and courtroom
interactions. In a second step, the user can choose all or
only a selection of these texts in order to create a
subcorpus. Subcorpora can also be created - and edited -
through the 'user data' option. Here subcorpora are to be
defined by selecting specific metatextual categories (or,
alternatively, by specifying the IDs of individual BNC
documents). For example, the user may compile a corpus of
'end-samples' of 'periodicals' published 'from 1975 to
1984'.
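
The logic behind such a subcorpus definition can be sketched as a simple
filter over per-text metadata. The following Python sketch is our own
illustration and does not reproduce BNCweb's implementation; the field names
('medium', 'sample', 'year'), their values and the text IDs are hypothetical
simplifications and not the actual codes used in the BNC headers.

```python
def select_subcorpus(records, medium=None, sample=None, years=None):
    """Return the IDs of all texts that satisfy every criterion given.
    A criterion left as None is not applied."""
    selected = []
    for rec in records:
        if medium is not None and rec["medium"] != medium:
            continue
        if sample is not None and rec["sample"] != sample:
            continue
        if years is not None and not (years[0] <= rec["year"] <= years[1]):
            continue
        selected.append(rec["id"])
    return selected


# Invented metadata records; the IDs are not real BNC document IDs.
records = [
    {"id": "T01", "medium": "periodical", "sample": "end", "year": 1979},
    {"id": "T02", "medium": "periodical", "sample": "beginning", "year": 1980},
    {"id": "T03", "medium": "book", "sample": "end", "year": 1982},
    {"id": "T04", "medium": "periodical", "sample": "end", "year": 1991},
]

print(select_subcorpus(records, medium="periodical", sample="end",
                       years=(1975, 1984)))
```

Defining a subcorpus by listing the IDs of individual documents amounts to
writing down the resulting list by hand.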

In 'user settings', the preferred standard output format
(such as the number of matches shown per page), the range
of context (measured in <s>-units), the output of context
(i.e. with or without word-class tags) and other settings
can be predefined. Brief mention should also be made of the
'query history', which shows all queries performed by the
user so far and also gives information on the date of the
query and the number of hits. Each of the queries can be
re-performed by simply clicking on it. It is also possible
for users to look at 'saved queries'. Saved queries can be
re-activated and thus be used as a starting-point for
further investigation. This feature is particularly useful
for analyses which can be carried out by capitalising on
previously performed complex queries.

BNCweb includes a well-produced online manual which not
only gives general information on BNCweb ('what is BNCweb',
'system requirements', etc.) but also provides detailed
information on 'what users can do with BNCweb' (i.e. a
description of features, all illustrated with clear and
telling examples), 'how users can do it' (i.e. a detailed
list of all available functions, complemented with
illustrative examples, step-by-step instructions and
screenshots), and 'what users cannot do with BNCweb'. As to
this last point, the BNCweb team makes it clear that
because BNCweb operates on the basis of the BNC World
Edition and SARA, it cannot overcome the system-internal
limitations. Of particular importance is the fact that all
searches must start off from a lexical query. Thus, the
user cannot search for delexicalised syntactic structures:
for example, it is not possible to directly look for kinds
and frequencies of postmodifications in noun phrases. What
the program will yield, however, is information on the
kinds and frequencies of syntactic structures that a
particular lexical item occurs in; as soon as a lexical
item is specified, search options are abundant and even
extremely complex searches are, in principle, possible. A
further shortcoming that is explicitly mentioned in the
BNCweb manual is the fact that searches cannot be
restricted to user-defined subcorpora from the outset. All
searches will have to be run over the whole BNC first; it
is only then that a second search can be confined to a
previously defined subcorpus.
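
The two-step logic described above (a lexical query first, a structural
filter on its matches second) can be sketched as follows. The Python sketch is
ours and is not based on BNCweb's code; the function name is hypothetical, the
tagged sample is invented, and the tags 'DET', 'ADJ', 'CONJ', 'NOUN' and
'VERB' are simplified stand-ins rather than the POS tags actually used in the
BNC.

```python
def matches_with_left_context(tagged, word, left_tags):
    """Step 1: find every occurrence of `word` (the lexical query).
    Step 2: keep only those hits whose immediately preceding tags are
    exactly `left_tags` (a tag-sequence filter on the matches)."""
    n = len(left_tags)
    hits = []
    for i, (token, _tag) in enumerate(tagged):
        if token.lower() != word.lower():
            continue  # not a match of the lexical query
        if i < n:
            continue  # not enough preceding context
        window = tagged[i - n:i]
        if [tag for _, tag in window] == list(left_tags):
            hits.append(" ".join(w for w, _ in tagged[i - n:i + 1]))
    return hits


# Invented (word, tag) pairs with simplified tags.
tagged = [
    ("a", "DET"), ("young", "ADJ"), ("and", "CONJ"), ("lovely", "ADJ"),
    ("woman", "NOUN"), ("met", "VERB"), ("the", "DET"), ("woman", "NOUN"),
    ("and", "CONJ"), ("an", "DET"), ("old", "ADJ"), ("woman", "NOUN"),
]

# Determiner + adjective before 'woman'
print(matches_with_left_context(tagged, "woman", ["DET", "ADJ"]))
# Determiner + two conjoined adjectives before 'woman'
print(matches_with_left_context(tagged, "woman", ["DET", "ADJ", "CONJ", "ADJ"]))
```

Note that the function cannot be called without a word: the tag sequence only
narrows down an existing set of lexical matches, which mirrors the limitation
that delexicalised structures cannot be searched for directly.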

CRITICAL EVALUATION
BNCweb is an extremely user-friendly, robust and rich
software tool for the exploration of the BNC. It also works
considerably faster than SARA because the first hits (by
default, the first 50 hits) are made available to the user
while the remaining hits are still being retrieved from the
corpus. In SARA, on the other hand, the entire search
has to be concluded before the first hits are displayed on
the screen. It is thus beyond reasonable doubt that BNCweb
facilitates in various regards the in-depth quantitative
and qualitative analysis of the BNC. Once installed, it
should certainly be able to draw many linguists -
hopefully, not only corpus enthusiasts - into research on
the BNC (which has become over the past decade the most
important reference corpus of the English
language). Also, the BNCweb manual provides a very good
introduction to all features and options and explains each
of them in an easily accessible way. It is very fortunate
that the manual is geared to the needs of the vast majority
of linguists who are not corpus experts and on the lookout
for a shortcut through 'the BNC jungle'. We do feel,
however, that a help index file could usefully have been included.

Our main critical remark refers to the lack of an interface
that is equivalent to the very useful SARA query-builder.
Although it is possible in BNCweb to perform the same
complex queries as allowed for by the SARA query-builder
(i.e. complex queries based on logical AND and/or OR), the
latter is, in our view, more user-friendly: users can first
design and carefully check the complex query, displayed in
its entirety on the screen, before actually carrying it
out. Such complex queries can only be performed in BNCweb
if the user is familiar with the CQL-query syntax (although
it should not go unmentioned that the BNCweb team does
provide information on how such queries can be performed).
However, it is unfortunate that BNCweb does not include an
interface design that is as easily accessible, self-
explanatory and iconic as the SARA query-builder. Since, as
luck would have it, BNCweb requires a full installation of
the BNC World Edition (and SARA) anyway, it is of course
possible to take advantage both of the numerous strengths
of BNCweb and of the particular user-friendliness of the
SARA query-builder at the same time.

In conclusion, BNCweb represents in almost all regards a
genuine alternative to the standard package of the BNC
World Edition. To recap, BNCweb is much more user-friendly,
includes numerous useful and innovative search and query
options, carries out searches at a faster rate and has no
restrictions on the number of hits. We thus strongly
recommend BNCweb to everyone interested in research into
the BNC.

REFERENCES
Aston, Guy and Lou Burnard (1998). The BNC Handbook:
Exploring the British National Corpus with SARA. Edinburgh:
Edinburgh University Press.

Burnard, Lou (2000). Reference Guide for the British
National Corpus (World Edition). Available on-line at
<http://www.hcu.ox.ac.uk/BNC/World/html/urg.html>.

Lee, David Y. W. (2001). "Genres, registers, text types,
domains, and styles: clarifying the concepts and navigating
a path through the BNC jungle", Language Learning and
Technology 5/3, 37-72.

ABOUT THE REVIEWERS
Rolf Kreyer is a Research Assistant in the English
Department of the University of Bonn/Germany. He holds a
degree in English and mathematics and is currently working
on his PhD thesis, a corpus-based analysis of fronted
constructions in English.

Joybrato Mukherjee is Assistant Professor in the same
department. His research interests include corpus linguistics,
EFL teaching, intonation, stylistics, syntax and text-
linguistics.