Review of Corpus Presenter: Software for language analysis


Reviewer: Stefan Th. Gries
Book Title: Corpus Presenter: Software for language analysis
Book Author: Raymond Hickey
Publisher: John Benjamins
Linguistic Field(s): Text/Corpus Linguistics
Book Announcement: 15.681

Review:
Date: Sat, 31 Jan 2004 09:20:41 +0100
From: Stefan Th. Gries <STGries@sitkom.sdu.dk>
Subject: Corpus Presenter

AUTHOR: Hickey, Raymond
TITLE: Corpus Presenter
SUBTITLE: Software for language analysis [...]
PUBLISHER: John Benjamins
YEAR: 2003

Stefan Th. Gries, University of Southern Denmark

---Notes from Review Editors---
There is a reply from the author, see:
http://linguistlist.org/issues/15/15-981.html


NOTES:
An easier-to-read PDF file with this review which also offers
screenshots to exemplify some points can be found at:
<http://people.freenet.de/Stefan_Th_Gries/research/CP_review.pdf>.

Italics are indicated here by underscores before and after a word.

DESCRIPTION/SUMMARY OF THE BUNDLE
1) The book
Part I (p. 1-27) is a brief introduction to corpus
linguistics. This part provides an overview of some
corpus-linguistic terminology (types of corpora, tagging,
corpus headers, etc.) and a brief section entitled
'examining corpora', which introduces a few basic notions
such as _concordance_, _tagging_ and _lexical cluster
analysis_ (each with about two paragraphs). Pages 14 to 27
discuss very briefly a few sample analyses of corpus data
(mainly Irish English data); these involve frequency data
for the presence or absence of particular linguistic forms,
relative frequencies of collocations and relative
frequency data in an investigation of an author's style.
Part II of the book (p. 28-183) consists of descriptions
of the modules or programs of the Corpus Presenter suite
(henceforth CP). Much of this part corresponds to the help
files of the software bundle.
Part III of the book consists of several appendices.
Appendices 1 and 2 of the book (p. 184-9) provide information
about the installation, while Appendix 3 (p. 190-200) lists
a set of common commands, i.e. commands which are found in
several parts of CP. Appendix 4 (p. 201-4) describes the
file interface of CP; most of this will be familiar to most
users of Windows. Appendix 5 (p. 205-7) gives some
troubleshooting information and Appendix 6 (p. 208-9)
introduces three additional dataset files describing three
corpora from the ICAME CD-ROM. Finally, this part contains a
glossary of short definitions of corpus-linguistic and a few
statistical terms.
Part IV of the book (p. 237-76) is a description of A
Corpus of Irish English, which is followed by a general
bibliography, a glossary for this corpus and a combined
subject/name index.

2) The software
CP offers multiple functions for compiling, annotating and
processing corpora.
- CP can search for strings in texts in order to output them
as a concordance:
-- CP allows for detailed specification of strings and
corpus text settings including those for case-sensitive
searches, sentence and word delimiters, punctuation signs;
-- in addition, one can look for single expressions,
lists of expressions and larger expressions (by specifying
the left and the right part of a complex expression as well
as their maximal distance);
-- an especially useful option is the possibility to
include special characters and symbols from every installed
font into the search pattern;
-- one can specify stop words and output options such as
full sentence or x words to left/right of the search word;
-- one can specify Cocoa settings to include only files
with particular attributes into the search;
- CP can generate collocation frequencies for words in a
span of max. eight words around a search/node word;
- CP can generate some text statistics (word counts) as well
as (regular or reverse) word lists of individual words
and/or clusters of up to 8 words;
- CP can perform search and replace operations on files to
alter texts (e.g. tagging and normalizing files for
lemmatization);
- CP can collate files and compile corpora by collecting and
manipulating files of various sorts, organize the files and
their contents hierarchically and add header information
(e.g. Cocoa) for future searches;
- CP can convert files to different file types, manipulate
their attributes (e.g. date stamps, extensions) and do lots
of file handling operations (cut, copy, paste, duplicate,
merge, etc.);
- CP offers a few useful functions that go beyond some
limitations of the Microsoft Windows OS such as a cumulative
clipboard or an undo buffer for deleted text.
The program comes with many different help files, FAQ files
and a brief tutorial with 44 slides and written/read-aloud
instructions (.WAV format).
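
For readers unfamiliar with the two central operations listed
above, the following is a minimal Python sketch of a
concordance (keyword in context) and of collocate frequencies
within a span around a node word. It is not CP's
implementation: the tokenizer, the function names and the
sample sentence are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


def collocates(tokens, node, span=8):
    """Count the words found within `span` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Hypothetical sample text, not taken from any of the corpora discussed here.
sample = "A great deal of work and a good deal of luck made a great deal possible."
tokens = tokenize(sample)

for left, node, right in kwic(tokens, "deal", width=2):
    print(f"{left:>12} [{node}] {right}")

print(collocates(tokens, "deal", span=1).most_common(3))
```

The span of eight tokens used as the default mirrors the
maximum span mentioned above; everything else is a free
design choice.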

CRITICAL EVALUATION
The CP suite is a set of programs that offers a vast range
of possibilities for working with corpus data. It was mainly
tested on a notebook computer (Pentium III 1000 with a 20GB
hard disk and 256MB RAM) running an English Windows XP
Professional; some additional tests were performed on a
desktop computer (Athlon XP 1800+ with a 40GB hard disk and
640MB RAM) running a German Windows 2000 (both systems are
completely updated in terms of Microsoft Service Packs
etc.). I tested the program alone, but in order
to make the evaluation slightly more objective, I also asked
a colleague for her opinion on some issues. In order to
discuss some of CP's properties, I will make reference to a
few concordancing programs, namely WordSmith Tools 3 (WST),
MonoConc Pro 2.2 (MCP) and WinConcord 2.0 (WC).

1) Speed and power of CP
The author (henceforth RH) stressed that "a special fast
retrieval mode has been incorporated into _Corpus Presenter_
to minimize the time one has to wait for returns to be made
during searches" (<CP_GUIDE.RTF>). However, at least my own
experiments do not support this assessment (with one
exception mentioned below, all timing experiments were
performed immediately after a system reboot).
- Concordancing example 1: On the above-mentioned notebook
computer, searching the Brown Corpus for the word form
_best_ using the most powerful 'text retrieval' level lasted
an astonishing 35.7 seconds (and some 35.94 seconds for
_after_; CP Main Programme's own output) even though all
settings concerning Unicode were optimized for the
processing of plain ASCII files. By contrast, MCP took about
2 seconds to load the file and 4 seconds to produce the
concordance.
- Concordancing example 2: On the above-mentioned notebook
computer, searching 674 files from the BNC part A (without
tags) for the word form _best_ once took 1030.48 seconds
(with several applications open but unused in the
background) and 334.21 with no other applications running
(CP Main Programme's own output) even though all settings
concerning Unicode were again optimized for the processing
of plain ASCII files. By contrast, WST took about 57 seconds
to produce the desired concordance ...
- Concordancing example 3: On the above-mentioned notebook
computer, simply finding out that my British National
Corpus (BNC) directory contains 4,054 .TXT files required 47
seconds (CP Flash) - both MCP and WST need about a second or
less.
- Word list examples: Making a simple word list of the Brown
Corpus (without the reversed list) required 469.46 seconds
with CP Main Programme (and I canceled CP Flash after 30
minutes!) but only 11 seconds with MCP. In this connection,
it is worth pointing out that the program has an upper limit
of 32,000 words for word lists. RH does claim that this is
"virtually ample for every corpus" (p. 59), but the word
list for the one-million Brown Corpus already had about
10,000 entries so it is easy to find corpora whose word
lists will exceed this limit: the word list of the one-
million FROWN Corpus mentioned in RH's book itself has
50,000+ lines (processing time with CP Main Programme:
781.41 seconds; processing time with WST: 7 seconds), and
the word list of the 100-million-word BNC available from
A. Kilgarriff's website at
http://www.itri.bton.ac.uk/~Adam.Kilgarriff/
has 938,000+ lines ...
- Merging files: With CP File Manager, merging 15 text files
(with an overall size of 6,758KB) required 4:08 minutes.
All in all, thus, CP is rather slow, especially when
compared to other contemporary programs.
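
Whether a given corpus runs into the 32,000-entry limit for
word lists can be checked independently of any concordancer.
The following Python sketch builds a frequency word list and
compares the number of distinct entries with that limit; the
tokenization rule and all names are assumptions, and the
timing call merely shows how such processing times can be
measured.

```python
import re
import time
from collections import Counter

ENTRY_LIMIT = 32_000  # upper limit for word lists cited in the review


def word_list(text):
    """Return a frequency list of lower-cased word forms (crude tokenization)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def exceeds_limit(freqs, limit=ENTRY_LIMIT):
    """True if the word list has more distinct entries than `limit`."""
    return len(freqs) > limit


# Hypothetical stand-in for the contents of a corpus file.
text = "The best of the best is not always the best choice."

start = time.perf_counter()
freqs = word_list(text)
elapsed = time.perf_counter() - start

print(freqs.most_common(3))
print(f"{len(freqs)} entries, limit exceeded: {exceeds_limit(freqs)}")
print(f"processing time: {elapsed:.4f} seconds")
```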

2) Ease and convenience of use of CP
My own impression of the usability of CP is rather negative,
especially when compared to the other three corpus programs
mentioned above, which I use regularly for teaching and
research. While I do admit that the range of
functions is large and that I may not be able to do justice
to all features the suite has to offer, I am not very happy
with a variety of features. My main concerns are as follows:

2a) The modules of CP
The CP suite comes with a wide variety of different modules
and is intended to bring together modules to carry out a
huge number of different tasks into a single suite, which
basically sounds like a good idea. If CP and other similar
programs such as WST, MCP and WC were located on a
'modularity scale', then MCP and WC would have the simplest
structure such that all commands can be accessed from a
single window with one menu bar; by contrast, WST is a suite
with three modules doing different corpus jobs plus four
modules for file handling etc.; and CP is a suite of 27
modules in five groups. Compared to the other programs, CP's
structure, thus, appears relatively complex, an impression
that was involuntarily confirmed by some unguided
experimentation: while I could use many capabilities of WST,
MCP & WC without having looked at any documentation, even
after several years' experience doing corpus-linguistic
research with different corpus programs and Perl scripts I
was unable to do a simple corpus search with CP without
having looked at the documentation provided with the bundle.
A related point of criticism is that many of the modules
serve purposes for which even the most modestly equipped
(corpus) linguist probably already has resources available
that can perform (most of) what is needed. For example, for
the potential buyer it is worth pointing out that more than
half of the 27 modules provided on the CD-ROM are
applications that strongly resemble Microsoft Windows,
Microsoft Office or OpenOffice products:
- CP Slide, a program which "will group any set of files
into a list which one can page through like slides on a
projector (from one to the next, without interruption, on a
clear screen)" (<CP_GUIDE.RTF>), a set of functions many of
which Microsoft PowerPoint or OpenOffice Impress can
perform;
- CP Browser, a web browser, which provides functions most
of which Netscape Navigator, Internet Explorer etc. already
provide;
- CP File Manager, CP FileManager Lite and CP Quick Backup,
a program "similar to the file manager but slightly
different in its organization" (<CP_GUIDE.RTF>), all allow
you to perform various file manipulation and storing
operations; most, though not all, of these can of course be
performed by the regular Windows Explorer or other
(freeware) programs; the same holds for the module CP Find
Text;
- CP Diary: a program that is intended to remind you of
important dates and allows you to have a yet-to-do list,
i.e. it offers part of the functionality of Microsoft
Outlook etc.;
- CP Jotter "provides a small and very quick version of the
fuller text editors of [CP]" and, thus, does the same thing
as <Notepad.EXE> on every Windows system (or TextPad or
UltraEdit or ...); in addition, CP also has a command 'view
returns storage' which provides yet another window where you
can enter data for later storage just like <Notepad.EXE>;
and there's also CP Text Editor and CP Text Tool, which are
text editing utilities ...;
- as if the previously mentioned text editing modules were
not enough, there is also CP Word Processor, which does the
same things as Microsoft Word or OpenOffice Writer;
- CP Easy Chart "will generate a pie, bar or line chart from
any series of input numbers" (<CP_GUIDE.RTF>), which is of
course what one normally uses Microsoft Excel / OpenOffice
Calc for;
- CP Database Editor and the separate Database Manager serve
the purpose of processing database files (e.g. .DBF), a
function for which again most people use Microsoft Excel or
OpenOffice Calc;
- CP Internet Editor allows you to edit your homepage(s)
and, therefore, does the same thing as Microsoft FrontPage
or any other freely available HTML editor;
- CP Control Centre: a small module that gives you access to
a variety of system setting options most of which are
already accessible from the Windows Control Panel ...;

My further discussion of CP below will not address all of
these modules at the same level of detail since many of the
modules and/or their functions are not relevant in a more
narrowly defined corpus-linguistic sense; in addition, many
options of these 'non-corpus-linguistic' modules I have
tested were not superior in functionality to their
Windows/Office counterparts anyway. For example, the
possibilities to generate charts with CP Easy Chart's chart
options appear to be much less sophisticated than, say,
Microsoft Excel's options especially since the latter can
generate graphs directly from automatically updated pivot
tables without the whole lot of manual effort required by CP
Easy Chart. Also (a minor point though), many of the
modules themselves contain commands which are nice little
gimmicks but which add little to the linguistic
functionality/utility of this corpus processing suite.
Examples for these include the possibility to access a
calculator or the time/date from several modules, the
possibilities of adjusting color and/or wallpaper or font
settings for many modules, the possibility to access CP
Jotter from some modules' menus, the option to view RH's
CV etc. - these options probably don't really
hurt, but they do inflate the number of commands
beyond what is necessary and easily/intuitively manageable ...

2b) CP and Windows
Another important usability issue is concerned with the way
CP integrates into, or makes use of the capabilities of, the
Microsoft Windows operating system. While RH emphasizes that
the program is designed "afresh, utilizing to a maximum the
possibilities of the newer operating system" (p. 28), I
would not quite agree with this assessment. Consider, for an
admittedly painfully detailed example, the installation of
CP:
Upon double-clicking <setup.exe>, the program copies some
files onto the hard disk and opens a window with (i) three
installation options (installing CP, installing a database
software and installing a sample corpus of Irish English)
and (ii) a huge "Installation Advice Text". Among other
things, this text explains, firstly, that the program is
installed into the folder <C:\Corpus Presenter> and - if the
program is installed elsewhere - that the links to the 27
modules must be altered manually!
Secondly, the installation process is split into two
different steps. (This information is provided in the advice
text twice, once before a list of the contents of the CD-ROM
and once again after this list; the confusion is increased
by the fact that, in the second occurrence of the otherwise
identical text segment, a different path is used.) The first
step is called on by clicking on "Installing Corpus
Presenter" so that all the program files are copied to one's
hard disk; you cannot specify where the files are installed
unless you manipulate Windows settings about default program
directories. In the second step, Windows system files are
copied. Surprisingly, you are prompted where these system
files should be installed, and if you decide to install CP
into a different directory (e.g. <D:\>), then the system
files are copied to the system directories where they belong
anyway, and the directory which you chose for installation
only contains .EXE files for CP Programme Launcher and CP
itself, while all the other 477 files CP has installed before
are still in the directory CP mentioned before, namely, on
an English Windows machine, <C:\Program Files\Corpus
Presenter>.
Similar comments hold for the database manager and the
sample corpus: you need to install the database manager
separately (which is ok), but CP expects it to be located in
a particular directory without spaces in the name, and the
sample corpus is simply installed to
<C:\Corpus_Irish_English> regardless of where you would want
to have it ...
Although RH explains in the final paragraph of this
advice text that many of the shortcomings are due to the
Windows operating system, it remains completely mysterious
to me why the user cannot simply enter the desired path for
all to-be-installed components, and the program organizes
itself internally as it needs to and outputs the requisite
links as is customary with nearly every other Windows
program I know. The way it is now, the installation process
and its result are painful if you do not know your way around
Windows quite well; your system partition <C:\> is then
cluttered with different directories that you would perhaps
have preferred to be on an 'applications' partition or on a
'corpus' partition. In addition, the uninstallation with the
Windows control panel did not remove all parts of the
installation properly: the corpus as well as the files in
<C:\Program Files\Corpus Presenter> and <D:\Corpus
Presenter> simply remained on the hard disk.
Unfortunately, there are many more inconveniences
that contradict the claim of maximal use of the possibilities
of the newer Windows system:
- Some of the programs seem to adopt the previous color
settings of the desktop rather than having their own settings.
Doesn't sound like a big deal? Well, on a notebook with a black
desktop it can result in your not being able to read the
black text in the overview windows of CP Programme Launcher
until you have figured out how to change the two
available (and misleading) color settings.
- When you start CP Programme Launcher, you see what looks like
a menu bar at the top but is not one: rather than opening a
menu with commands to choose from, each expression in this
menu-like section is already a command in itself. For users
with a well-entrenched knowledge of the Windows system, this
is at first perplexing, which is why buttons should have
been used here in the first place.
- Why are nearly all program windows opened such that they
cover the whole screen and hide all other applications? When
you turn to the help function to get information on some
window, the help screen hides the window for whose options
you look for clarification. When you open another module, it
hides all other software which you might have needed to see
(e.g. to enter data from it into CP Easy Chart). And why
aren't the windows that cover the whole screen maximized so
that a click on the restore down icon would reasonably
reduce the window size? And then, some windows don't allow
downsizing or maximizing at all: CP Easy Chart does - CP
Flash doesn't.
- In some programs (e.g. CP File Manager and CP Main
Programme), lists of files can be sorted by clicking on a
column heading (e.g. size, type etc.) - in others (e.g. CP
Create Data Set), they cannot.
- Right-clicking does not always open a context-dependent
menu (sometimes it just does the same as left-clicking and
sometimes it just offers to perform one particular action),
and in windows consisting of several horizontally or
vertically separated frames, you often cannot change the
window parts' relative sizes to see more of the more
important information although this is of course standard in
all Windows applications.
- While there is a huge number of commands available in the
'help' menu (twenty in CP! - even Excel XP only has eight
commands, as has WST), many of them don't seem to belong
there (what does benchmarking the system, the system
information, running the graphics program CP Easy Chart or
exploring CP's home directory have to do with a user
struggling with CP's many settings and looking for help?).
Also, CP does not afford Windows users the by now familiar
option of a help index to enter key words describing your
problem in order to retrieve a list of all help topics
related to this notion. Finally, not all help texts are
really useful: when I tried out the interactive tagging
function of CP Text Tool, I was confronted with the window
in which I had to enter the words and the tags for the
tagging. Since I did not immediately understand the makeup
of the window (containing four text fields, seven buttons,
four fields to tick and some text information), I clicked
the help button of this window. However, instead of getting
information on how the information must be entered, I got
eleven lines of text (taken from p. 160f. of the book), nine
of which explain what tagging is and that semantic tagging
is in general not possible plus two lines telling you that
you must enter maximally 512 input forms and none of which
explain the buttons or fields of the window from which this
'help' box was accessed in the first place! (If I do the
same in some window in Excel, I get precise information on
all buttons and all fields of the respective window ...)
Also, the window offers the option to tag forms as words or
strings - neither does the corresponding section of the book
explain what a string is nor is the word 'string' listed in
the index of the whole book ...
- Windows programs usually allow the user to enter data into
several fields of an input window by jumping from data field
to data field by pressing the TAB key; CP does the same, but,
at least in CP Easy Chart, the program does not switch from
one field to the immediately adjacent and thematically
related one, but arbitrarily to some other field, which
doesn't make entering data any easier ...

2c) Some other functionality quibbles
The following is a list of other shortcomings of some
modules which are not directly related to the integration
into Windows; I begin with CP Main Programme.
- If you want to save your results (of a corpus search) such
that collocates at different positions can be accessed
easily, you cannot simply choose to save it as a text file
with tabs as delimiters (at least I didn't find out how).
Instead, you must save it as a database file (.DBF), which
means that you must use CP Database Editor (or, say, Excel) to
retrieve the data again and cannot use your favorite text
editor etc. first.
- If a particular search of CP's main program is
interrupted, then - unlike other concordancers - CP does not
present the results obtained so far; it presents none.
- If you wish to INcrease the number of collocates to be
displayed in a results window, you do so counterintuitively
by clicking an arrow pointing DOWNwards.
- While CP can output the collocates of a particular search
word, it is not quite easy to locate this option: all other
concordancers I know simply have a command called
'collocations' (or some equally telling name), but in CP you
have to find out somehow that the command (in CP Main
Programme) is called 'restructure return lines', which is
not only very unintuitive but also somewhat difficult to
find since (i) there is no help index (cf. above), (ii)
_collocate_ and _collocation_ cannot be found using the find
function of the main help text and (iii) the word
_collocation_ only occurs twice in the whole program folder
(as determined with a grep tool), neither occurrence of
which explains this function. The only way to find this
option if you don't already know it is the index of the book
where the third page entry for _collocates_ points you to
the right page in the book for this option.
- If you want to search for bipartite expressions where one
part can be instantiated by several different forms (such as
inflectional forms of one lemma, say, _put_, _puts_,
_putting_), then you can use the option of editing an input
list - but you cannot simply edit the list by entering a few
words and do a search, you must either load an existing list
or enter the list manually and save it.
- Surprisingly, CP cannot sort concordance lines according
to a user-specified position in the vicinity of the search
word: you can only sort concordance output according to the
leftmost word of a cell and the word _sort_ or any
derivative is not even mentioned in the index of the book,
something I find strange for a program (suite) the main
purpose of which is handling text(s).
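
To make concrete what sorting by a user-specified position
means, here is a minimal Python sketch that orders
concordance lines by the n-th word to the left or right of
the search word. The data layout (token lists for the left
and right context) and the function name are assumptions for
illustration only and do not describe how any of the
programs mentioned here work internally.

```python
def sort_concordance(lines, position):
    """Sort (left, node, right) lines by the word at `position`.

    `left` and `right` are lists of tokens. position=+1 means the first
    word to the right of the node, position=-1 the first word to its left.
    Lines lacking a word at that position sort first (empty key).
    """
    if position == 0:
        raise ValueError("position must be non-zero")

    def key(line):
        left, _node, right = line
        if position > 0:
            idx = position - 1
            return right[idx] if idx < len(right) else ""
        idx = len(left) + position
        return left[idx] if idx >= 0 else ""

    return sorted(lines, key=key)


# Hypothetical concordance lines for the node word "deal".
lines = [
    (["a", "great"], "deal", ["of", "work"]),
    (["a", "good"], "deal", ["of", "luck"]),
    (["a", "great"], "deal", ["possible"]),
]

for left, node, right in sort_concordance(lines, -1):
    print(" ".join(left), f"[{node}]", " ".join(right))
```

Because Python's sort is stable, lines with identical keys
keep their original order, so a second sort on another
position yields a two-level ordering.
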
The module CP Create Data Set also deserves some comments.
If you do not simply load text files as a corpus but want to
compile a corpus, CP needs information on how the corpus is
organized. You can either simply create a text file with a
particular format with any text editor providing this
information for CP by yourself or, alternatively, you can
use this module. However, although the module is explained
on only three pages in the handbook, it is relatively
complex, and its output is the very same text file
description of the corpus. In other words, one must again
enter all information for each corpus file separately and
manually. In addition, several windows this module opens are
not discussed in the book or the corresponding section of
the help file (which are identical anyway) and handling the
module is not always intuitive to say the least:
- I have not been able to find out how the order of files is
changed using CP Create Data Set (other than, of course, by
manually editing the text file itself);
- subheadings of a corpus must make reference to empty dummy
files;
- deleting nodes from your corpus structure does not really
delete the nodes until the data set file has been saved so
you must work with empty nodes and empty files etc.
Other shortcomings of this module are, again, due to the
fact that Windows has not been utilized fully. Why can I
highlight all corpus files which I want to assign to a lower
structure level in my corpus, but cannot also change their
level assignment all in one go? Why does this module not
allow me to simply load a list of files and convert them to
a dataset by providing information as to the structure of
the corpus? (Guess what - you have to turn to a different
module for this option, namely CP Flash, but when you read
the section on CP Create Data Set to find out whether such a
possibility exists, the book doesn't tell you - you must
find out for yourself some other way!) Why is it not
possible to use drag and drop options etc. to determine the
structure of the corpus? Why is there no assistant to guide
you through the creation of the corpus structure (just like
Excel has a guide for pivot/contingency tables and WST has a
brief guide to generate a concordance)? I don't know.
There are similar usability problems throughout CP. I
cannot discuss all of them here since the review is already
(too?) lengthy so a few final examples must suffice for the
moment: First, CP Quick Note makes it possible to structure
a text using embedded table of contents markers. These
markers can be embedded using the very same module ... but
not with the menu 'Insert' as every normal user would
suspect - rather, to insert these markers the menu you have
to open is called ... 'Display'. Second, the program CP List
Processor allows the user to manipulate one or two lists
such that, for example, the lists are merged or differences
between lists are shown. However, there is a little bug in
the program concerning the alphabetical output of the
program since the resulting sorting is not fully
alphabetical.
Finally, let us return to the interactive tagging procedure
of CP Text Tool. You open a text file containing words to
tag, and you need to have one file with words to tag and one
file with tags. For the automatic tagging function, you
choose the 1-512 word forms to be tagged with one tag,
choose automatic tagging and CP Text Tool adds the tag to
all the word forms; since you can specify more than one word
to be tagged at a time, this is a huge advantage over
replacing functions of, say, Microsoft Word or TextPad. With
interactive tagging, the program goes through the corpus
text, stops at every instance of one of the to-be-tagged
word forms and asks the user which tag to add to the word
form. This function is implemented a little clumsily since
you are not simply prompted to choose a tag but have to use
some more mouse-clicks and whenever you want to choose a tag
other than the default one to assign and click on 'reject'
in this window, the list of available tags is recursively
added ... Thus, although the interactive tagging function
works basically ok, there is some bug here that needs
correction.
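
The idea behind the automatic tagging function, attaching one
tag to every occurrence of several word forms in a single
pass, can be sketched in a few lines of Python. The tag
format (word form, slash, tag), the function name and the
example sentence are assumptions; CP Text Tool's actual
output format is not described here.

```python
import re


def tag_forms(text, forms, tag):
    """Append `/tag` to every whole-word occurrence of any form in `forms`."""
    if not forms:
        return text
    # Longest forms first so that "putting" is tried before "put".
    alternatives = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda match: f"{match.group(1)}/{tag}", text)


# Hypothetical example: three inflectional forms of one lemma, one tag.
print(tag_forms("She puts it where he put it.", ["put", "puts", "putting"], "V"))
```

The word-boundary anchors keep the pattern from tagging
substrings of longer words such as _input_.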

3) Nitpicking, typos etc.
This section is concerned with only minor errors and some
other short comments/questions in a simple list form.

Concerning the book:
- On p. 8, the first line of the second paragraph of
section 3.5 is garbled.
- On pp. 5f., 42 and 164f., RH mentions the normalization of
corpora, but with the exception of one example buried within
a table he restricts himself to normalizing spelling
variants; the issue of lemmatization would have deserved
more emphasis here (for analyses of author style or
collocations) but it is only mentioned once in the glossary
(though, surprisingly, not in the index);
- The notion of tagging is explained briefly under the
rubric of '3.2 Versions of corpora' in three paragraphs (p.
5) and once again under 'Tagging a corpus' on p. 8f. - why
not put this together?
- Why are the BNC and the Cobuild Bank of English not
mentioned at all (not even in the glossary although this
includes several other entries of words not mentioned in the
book; cf. below) although they are probably the most widely
distributed and available corpora?
- The explanations of central corpus-linguistic terms
in Part I, sections '3 Preparing corpora' and '4 Examining
corpora', are very brief, averaging about
two to three paragraphs, and thus cover these issues
only very superficially.
- On p. 22f., RH discusses an example of collocate analysis,
namely the frequency of particular collocations for _deal_
in the London-Lund Corpus. He states that "139 finds were
reported in 15 files. 57 finds were with _great_ as
immediate left collocate and 41 finds with _good_ in the
same position. The results in a visually effective form can
be shown as a pie chart [...]." However, the pie chart to
which RH refers reports percentages which do not fit the
data described in the text. 57 [_great deal_] out of 139 [*
_deal_] are 41% and not the 60% represented in the chart;
the same holds for _good_: 41 [_good deal_] out of 139 [*
_deal_] are 29.5% and not 38.89% ...
- On p. 67, RH describes the command 'Frequently asked
questions' in the module CP Main Programme as follows: "The
third text file aims at answering typical questions which
users might come up with who have started working with the
present corpus. This file should preferably be written by
someone who has been connected with the compilation of the
corpus." However, when the test corpus is loaded and one
tries to access the FAQ for this corpus using this option,
what one gets is the FAQ for the CP Main Programme, not for
the corpus and it is unclear why this should have been
"written by someone who has been connected with the
compilation of the corpus."
- On p. 72, section 1.3 ('Normalising texts') consists of
one paragraph only and accidentally interrupts an otherwise
coherent and numbered description of the search parameters
available in CP Main Programme.
- On p. 73, RH gives an example of a word search using the
frame option by suggesting that "if you wished to find all
instances of negated adjectives in a text then you could
enter a frame consisting of _un_ and _able_ [...]" -
obviously, this search would not produce all negated
adjectives since _inadequate_ and _impossible_ would not be
retrieved.
- On p. 81, RH states that the file extension for the output
of a concordance as a plain text file is .OUT, but in the
program it's .TXT.
- On p. 92, a sentence runs "This very is important for
[...]."
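
The percentage mismatch noted above for p. 22f. can be checked
with a few lines of Python. This is only a minimal sketch: the
three counts are the ones quoted from the book, while the
variable and function names are my own and purely illustrative.

```python
# Counts quoted from the book (p. 22f.) as cited in this review.
total_finds = 139   # all finds of _deal_ in the London-Lund Corpus
great_deal = 57     # finds with _great_ as immediate left collocate
good_deal = 41      # finds with _good_ as immediate left collocate


def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


great_pct = share(great_deal, total_finds)
good_pct = share(good_deal, total_finds)

print(f"great deal: {great_pct:.1f}%")  # great deal: 41.0%
print(f"good deal:  {good_pct:.1f}%")   # good deal:  29.5%
```

Neither value matches the 60% and 38.89% shown in the pie chart.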

Some minor comments concerning the glossary:
- Some of the explanations of statistical terms in the
glossary (which are not mentioned in the book otherwise) are
far from optimal. For example, to define an alternative
hypothesis as "an assumption in statistics that two
variables are different" (p. 211) is perhaps a little too
low-level even for a glossary definition.
- Similarly, the definition of a Chi-square test (p. 213) is
grammatically incorrect ("A common test in linguistics is to
determine if the probability that a difference between sets
of values is due to chance alone.") and much too vague since
the above sentence also characterizes a t-test, a U test, an
ANOVA etc. Also, I am not sure that most scholars would
subscribe to the following statement: "A typical cut-off
point for significance is p<0.001" (p. 213).
- I do not know why a corpus processing book needs glossary
entries for _cookie_, _email_, _inkjet printer_, _laser
printer_, _PC_, _RS232C_ and _TFT_.
- I would not equate _lemma_ and _lexeme_.

I already mentioned that large parts of Part II of the book
are just the help files of the program, but sometimes the
book itself is also a little redundant. For instance, a part
of the general description of CP's main module on p. 30f. is
repeated verbatim in the more detailed section on p. 48f. In
addition, sometimes the names of the modules used in the
book are not identical to the names of the modules used in
CP Programme Launcher, the application from which RH
recommends accessing all other modules. For example, the
book has sections on CP Easy Chart and CP Structure, but in
CP Programme Launcher the very same modules are called CP
Chart Generator and CP Structured Texts respectively. This
is of course no big deal (which is why it is in this
nitpicking section of the evaluation in the first place),
but, just like the fact that the book doesn't discuss the
modules in the same order in which they are listed in CP
Programme Launcher although (i) this would be easier to
follow and (ii) perfectly possible since the listing in CP
Programme Launcher is completely arbitrary, it simply does
not speak in favor of careful editing.

Unfortunately, the book is not very well organized in terms
of software learnability either, which is probably a direct
consequence of Part II of the book largely being the help
files. It would have been extremely useful if the book had
provided at least one sample analysis which is designed in
such a way as to lead the beginning user through the many
modules (perhaps in combination with a website) unlike the
sample analyses in the book which make no reference at all
to how exactly they would have been generated with CP. Let
me give an example of what I would have appreciated very much. One
could have some corpus files on the publisher's webpage
which one would then turn into a hierarchically structured
corpus using the text editing modules and CP Create Data
Set. Then this corpus could be tagged and lemmatized using
CP Text Tool. Then one could perform a sample analysis on
the basis of this corpus, for example the collocational
differences of _strong_ and _powerful_ (to use the textbook
example) with CP Main Programme or CP Flash and finally
use CP Easy Chart and CP Slide to prepare a presentation of
the results and CP Internet Editor to present the results on
a website. For all of this, the website could provide
interim results to allow the user to check whether he has
mastered the tasks so far. Sadly, however, none of this is
provided although it would have enhanced the value and
learnability of the bundle by many orders of
magnitude.

Concerning the software:
- The installation advice text contains the same paragraph
twice (with different paths, though); cf. above.
- In the rubric 'Retrieving information' of the CP help,
there are three paragraphs §2.
- In several modules, when a window outputs certain results,
you can change the size of the window as such, but not the
size of the part of the window that contains the output;
i.e. you get a larger window with the same information; cf.
here for an example from CP Flash.
- The fact sheet for the installed test corpus starts with
"Some essential facts abou the Test Corpus."
- Sometimes, the program uses somewhat idiosyncratic
commands: instead of clicking on 'ok' to close a window and
accept what one has entered/changed, one has to click on
'conclude.'

Lastly, although CP is a very recent program, it does not
have some of the added-value gimmicks that competing
programs offer (it is only fair to repeat here that it of
course also has functions these competitors do not have).
For example, CP does not provide corpus-based statistics
such as indices of collocational strength etc. (like, say,
Michael Barlow's Collocate). Also, although the issue of
analyzing style is brought up repeatedly in the book, CP
does not allow for the automatic identification of key words
in texts (unlike WST).
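
To illustrate what is meant by an index of collocational
strength, here is a minimal sketch of pointwise mutual
information (MI), one commonly used measure of this kind. It
is not meant to describe how Collocate or any other program
computes its statistics. The function name, the parameter
names and the counts are all hypothetical; the counts are
invented toy numbers, not corpus data.

```python
import math


def pmi(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (log base 2) of a word pair.

    Compares the observed probability of the pair with the
    probability expected if the two words were independent.
    """
    observed = pair_freq / corpus_size
    expected = (node_freq / corpus_size) * (collocate_freq / corpus_size)
    return math.log2(observed / expected)


# Hypothetical toy counts, NOT taken from any real corpus.
score = pmi(pair_freq=50, node_freq=1000, collocate_freq=500,
            corpus_size=1_000_000)
print(f"MI = {score:.2f}")
```

The higher the value, the more often the two words co-occur
relative to what their individual frequencies would lead one
to expect.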

CONCLUSION
All in all, I am the first to admit that CP is a program
that offers many functions that can be useful for the
compilation, annotation and processing of corpora. I also
freely admit that the evaluation of usability is by
definition a relatively subjective task. I also believe,
however, that many of the flaws I have pointed out above would
render the program much more difficult to use than competing
products. From what I have seen, the (only) positive side I
have been able to detect is in fact the large number of
functions, i.e. the 'what the program does'. But the
negative sides of CP are the 'how the program does it': (the
larger part of) the program is
- difficult, inconvenient and counterintuitive to handle,
sometimes violating even elementary usability principles;
- overloaded with many redundant functions (containing four
text editors alone) that are part and parcel of regular
operating systems and office software;
- painfully slow to execute even some of the most basic
concordancing tasks.

Note in this connection that many functions of CP (other
than the hierarchical corpus compilation functions of
course) are available as (parts of regular) office suites
and freeware programs. In addition, the software book is
largely identical to the many help files that come with the
software and sloppily edited in many ways. Although I had
been waiting quite some time for the program after having
seen it announced as commercially available soon, I am rather
disappointed with the final result and hope that the most
frustrating bugs will be considered for an update soon.
 
ABOUT THE REVIEWER:
Stefan Th. Gries is Associate Professor at the Department of
Business Communication and Information Science at the
University of Southern Denmark. His research interest mainly
lies with corpus linguistics and linguistic methodology,
esp. the syntax-lexis interface as well as corpus-based,
quantitative approaches to word-formation processes (e.g.
blending and suffixation), syntactic variation (dative
movement, particle movement etc.) and semantic issues (near
synonyms, word senses etc.). He is currently co-editing two
volumes on corpora in cognitive linguistics and is also one
of the editors-in-chief of a new journal, Corpus Linguistics and
Linguistic Theory, to be launched in 2005.