LINGUIST List 15.681

Tue Feb 24 2004

Review: Text/Corpus Linguistics: Hickey (2003)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>


What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley Collberg at collberglinguistlist.org.

Directory

  • Stefan Th. Gries, Corpus Presenter

    Message 1: Corpus Presenter

    Date: Sat, 31 Jan 2004 09:20:41 +0100
    From: Stefan Th. Gries <STGriessitkom.sdu.dk>
    Subject: Corpus Presenter


    AUTHOR: Hickey, Raymond
    TITLE: Corpus Presenter
    SUBTITLE: Software for language analysis [...]
    PUBLISHER: John Benjamins
    YEAR: 2003

    Announced at http://linguistlist.org/issues/14/14-2813.html

    Stefan Th. Gries, University of Southern Denmark

    NOTES: An easier-to-read PDF file with this review which also offers screenshots to exemplify some points can be found at: http://people.freenet.de/Stefan_Th_Gries/research/CP_review.pdf

    Italics are indicated here by underscores before and after a word.

    DESCRIPTION/SUMMARY OF THE BUNDLE

    1) The book: Part I (p. 1-27) is a brief introduction to corpus linguistics. This part provides an overview of some corpus-linguistic terminology (types of corpora, tagging, corpus headers, etc.) and a brief section entitled 'examining corpora', which introduces a few basic notions such as _concordance_, _tagging_ and _lexical cluster analysis_ (each with about two paragraphs). P. 14 to 27 discuss very briefly a few sample analyses of corpus data (mainly Irish English data); these involve frequency data for the presence or absence of particular linguistic forms, relative frequencies of collocations and relative frequency data in an investigation of an author's style. Part II of the book (p. 28-183) consists of descriptions of the modules or programs of the Corpus Presenter suite (henceforth CP). Much of this part corresponds to the help files of the software bundle. Part III of the book consists of several appendices. Appendix 1 and 2 of the book (p. 184-9) provide information about the installation, while Appendix 3 (p. 190-200) lists a set of common commands, i.e. commands which are found in several parts of CP. Appendix 4 (p. 201-4) describes the file interface of CP; most of this will be familiar to most users of Windows. Appendix 5 (p. 205-7) gives some troubleshooting information and Appendix 6 (p. 208-9) introduces three additional dataset files describing three corpora from the ICAME CD-ROM. Finally, this part contains a glossary of short definitions of corpus-linguistic and a few statistical terms. Part IV of the book (p. 237-76) is a description of A Corpus of Irish English, which is followed by a general bibliography, a glossary for this corpus and a combined subject/name index.

    2) The software CP offers multiple functions for compiling, annotating and processing corpora.

    - CP can search for strings in texts in order to output them as a concordance:
      -- CP allows for detailed specification of strings and corpus text settings including those for case-sensitive searches, sentence and word delimiters, punctuation signs;
      -- in addition, one can look for single expressions, lists of expressions and larger expressions (by specifying the left and the right part of a complex expression as well as their maximal distance);
      -- an especially useful option is the possibility to include special characters and symbols from every installed font into the search pattern;
      -- one can specify stop words and output options such as full sentence or x words to left/right of the search word;
      -- one can specify Cocoa settings to include only files with particular attributes into the search;

    - CP can generate collocation frequencies for words in a span of max. eight words around a search/node word;

    - CP can generate some text statistics (word counts) as well as (regular or reverse) word lists of individual words and/or clusters of up to 8 words;

    - CP can perform search and replace operations on files to alter texts (e.g. tagging and normalizing files for lemmatization);

    - CP can collate files and compile corpora by collecting and manipulating files of various sorts, organize the files and their contents hierarchically and add header information (e.g. Cocoa) for future searches;

    - CP can convert files to different file types, manipulate their attributes (e.g. date stamps, extensions) and perform many file handling operations (cut, copy, paste, duplicate, merge, etc.);

    - CP offers a few useful functions that go beyond some limitations of the Microsoft Windows OS such as a cumulative clipboard or an undo buffer for deleted text. The program comes with many different help files, FAQ files and a brief tutorial with 44 slides and written/read-aloud instructions (.WAV format).
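    The two core functions listed above, concordancing and collocate counting, can be illustrated with a minimal Python sketch. It is not CP's implementation: the function names, the simple tokenizer and the sample sentence are hypothetical, and the span is assumed to be counted separately to the left and to the right of the node word.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) triples for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens to the left or right of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Hypothetical sample text, not taken from any corpus.
sample = "A great deal of work and a good deal of luck made a great deal of difference."
tokens = tokenize(sample)

for left, node, right in kwic(tokens, "deal"):
    print(f"{left:>20} | {node} | {right}")

print(collocates(tokens, "deal", span=1).most_common())
```

    With span=1 the counter holds only the immediate left and right neighbours of the node word, which is the kind of figure behind statements such as "_great_ as immediate left collocate".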

    CRITICAL EVALUATION

    The CP suite is a set of programs that offers a vast range of possibilities for working with corpus data. It was mainly tested on a notebook computer (Pentium III 1000 with a 20GB hard disk and 256MB RAM) running an English Windows XP Professional; some additional tests were performed on a desktop computer (Athlon XP 1800+ with a 40GB hard disk and 640MB RAM) running a German Windows 2000 (both systems are completely updated in terms of Microsoft Service Packs etc.). The program was tested by myself alone, but in order to make the evaluation slightly more objective, I also asked a colleague for her opinion on some issues. In order to discuss some of CP's properties, I will make reference to a few concordancing programs, namely WordSmith Tools 3 (WST), MonoConc Pro 2.2 (MCP) and WinConcord 2.0 (WC).

    1) Speed and power of CP

    The author (henceforth RH) stressed that "a special fast retrieval mode has been incorporated into _Corpus Presenter_ to minimize the time one has to wait for returns to be made during searches" (<CP_GUIDE.RTF>). However, at least my own experiments do not support this assessment (with one exception mentioned below, all timing experiments were performed immediately after a system reboot).

    - Concordancing example 1: On the above-mentioned notebook computer, searching the Brown Corpus for the word form _best_ using the most powerful 'text retrieval' level lasted an astonishing 35.7 seconds (and some 35.94 seconds for _after_; CP Main Programme's own output) even though all settings concerning Unicode were optimized for the processing of plain ASCII files. By contrast, MCP took about 2 seconds to load the file and 4 seconds to produce the concordance.

    - Concordancing example 2: On the above-mentioned notebook computer, searching 674 files from the BNC part A (without tags) for the word form _best_ once took 1030.48 seconds (with several applications open but unused in the background) and 334.21 with no other applications running (CP Main Programme's own output) even though all settings concerning Unicode were again optimized for the processing of plain ASCII files. By contrast, WST took about 57 seconds to produce the desired concordance ...

    - Concordancing example 3: On the above-mentioned notebook computer, simply finding out that my British National Corpus (BNC) directory contains 4,054 .TXT files required 47 seconds (CP Flash) - both MCP and WST need about a second or less.

    - Word list examples: Making a simple word list of the Brown Corpus (without the reversed list) required 469.46 seconds with CP Main Programme (and I canceled CP Flash after 30 minutes!) but only 11 seconds with MCP. In this connection, it is worth pointing out that the program has an upper limit of 32,000 words for word lists. RH does claim that this is "virtually ample for every corpus" (p. 59), but the word list for the one-million-word Brown Corpus already had about 10,000 entries so it is easy to find corpora whose word lists will exceed this limit: the word list of the one-million-word FROWN Corpus mentioned in RH's book itself has 50,000+ lines (processing time with CP Main Programme: 781.41 seconds; processing time with WST: 7 seconds), and the word list of the 100-million-word BNC available from A. Kilgarriff's website at http://www.itri.bton.ac.uk/~Adam.Kilgarriff/ has 938,000+ lines ...

    - Merging files: With CP File Manager, merging 15 text files (with an overall size of 6,758KB) required 4:08 minutes.

    All in all, thus, CP is rather slow, especially when compared to other contemporary programs.
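    To make the word list limit discussed above concrete, the following Python sketch builds a frequency word list and checks the number of distinct entries against the 32,000 limit. It is only an illustration under stated assumptions: the function names, the tokenizing pattern and the sample sentence are hypothetical, and nothing here reflects how CP itself counts words.

```python
import re
from collections import Counter

ENTRY_LIMIT = 32_000  # upper limit for word lists reported in this review

def word_list(text):
    """Frequency list of lowercased word forms (simple tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

def exceeds_limit(freqs, limit=ENTRY_LIMIT):
    """True if the word list has more distinct entries than the limit allows."""
    return len(freqs) > limit

# Hypothetical sample text; a real test would read the corpus files instead.
freqs = word_list("The best of the best is the best.")
for word, count in freqs.most_common():
    print(f"{count:>6}  {word}")
print("distinct entries:", len(freqs), "| over the limit:", exceeds_limit(freqs))
```

    Since the limit applies to distinct entries rather than running words, the 50,000+ lines of the FROWN word list and the 938,000+ lines of the BNC word list mentioned above would both fail this check.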

    2) Ease and convenience of use of CP

    My own impression of the usability of CP is rather negative, especially when compared to the other three corpus programs that I work regularly with for teaching and research mentioned above. While I do admit that the range of functions is large and that I may not be able to do justice to all features the suite has to offer, I am not very happy with a variety of features. My main concerns are as follows:

    2a) The modules of CP

    The CP suite comes with a wide variety of different modules and is intended to bring together modules to carry out a huge number of different tasks into a single suite, which basically sounds like a good idea. If CP and other similar programs such as WST, MCP and WC were located on a 'modularity scale', then MCP and WC would have the simplest structure such that all commands can be accessed from a single window with one menu bar; by contrast, WST is a suite with three modules doing different corpus jobs plus four modules for file handling etc.; and CP is a suite of 27 modules in five groups. Compared to the other programs, CP's structure, thus, appears relatively complex, an impression that was involuntarily confirmed by some unguided experimentation: while I could use many capabilities of WST, MCP & WC without having looked at any documentation, now, after several years' experience doing corpus-linguistic research with different corpus programs and Perl scripts, I was unable to do a simple corpus search with CP without having looked at the documentation provided with the bundle. A related point of criticism is that many of the modules serve purposes for which even the most modestly equipped (corpus) linguist probably already has resources available that can perform (most of) what is needed. For example, for the potential buyer it is worth pointing out that more than half of the 27 modules provided on the CD-ROM are applications that strongly resemble Microsoft Windows, Microsoft Office or OpenOffice products:

    - CP Slide, a program which "will group any set of files into a list which one can page through like slides on a projector (from one to the next, without interruption, on a clear screen)" (<CP_GUIDE.RTF>), a set of functions many of which Microsoft PowerPoint or OpenOffice Impress can perform;

    - CP Browser, a web browser, which provides functions most of which Netscape Navigator, Internet Explorer etc. already provide;

    - CP File Manager, CP FileManager Lite and CP Quick Backup, a program "similar to the file manager but slightly different in its organization" (<CP_GUIDE.RTF>), all allow you to perform various file manipulation and storing operations; most, though not all, of these can of course be performed by the regular Windows Explorer or other (freeware) programs; the same holds for the module CP Find Text;

    - CP Diary: a program that is intended to remind you of important dates and allows you to have a yet-to-do list, i.e. it offers part of the functionality of Microsoft Outlook etc.;

    - CP Jotter "provides a small and very quick version of the fuller text editors of [CP]" and, thus, does the same thing as <Notepad.EXE> on every Windows system (or TextPad or UltraEdit or ...); in addition, CP also has a command 'view returns storage' which provides yet another window where you can enter data for later storage just like <Notepad.EXE>; and there's also CP Text Editor and CP Text Tool, which are text editing utilities ...;

    - as if the previously mentioned text editing modules were not enough, there is also CP Word Processor, which does the same things as Microsoft Word or OpenOffice Writer;

    - CP Easy Chart "will generate a pie, bar or line chart from any series of input numbers" (<CP_GUIDE.RTF>), which is of course what one normally uses Microsoft Excel / OpenOffice Calc for;

    - CP Database Editor and the separate Database Manager serve the purpose of processing database files (e.g. .DBF), a function for which again most people use Microsoft Excel or OpenOffice Calc;

    - CP Internet Editor allows you to edit your homepage(s) and, therefore, does the same thing as Microsoft Frontpage or any other freely available HTML editor;

    - CP Control Centre: a small module that gives you access to a variety of system setting options most of which are already accessible from the Windows Control Panel ...;

    My further discussion of CP below will not address all of these modules at the same level of detail since many of the modules and/or their functions are not relevant in a more narrowly defined corpus-linguistic sense; in addition, many options of these 'non-corpus-linguistic' modules I have tested were not superior in functionality to their Windows/Office counterparts anyway. For example, the possibilities to generate charts with CP Easy Chart's chart options appear to be much less sophisticated than, say, Microsoft Excel's options, especially since the latter can generate graphs directly from automatically updated pivot tables without all the manual effort required by CP Easy Chart. Also (a minor point though), many of the modules themselves contain commands which are nice little gimmicks but which add little to the linguistic functionality/utility of this corpus processing suite. Examples of these include the possibility to access a calculator or the time/date from several modules, the possibilities of adjusting color and/or wallpaper or font settings for many modules, the possibility to access CP Jotter from some modules' menus, the option to view RH's CV etc. - these options probably don't really hurt, but they do inflate the number of commands beyond what is necessary and easily/intuitively manageable ...

    2b) CP and Windows

    Another important usability issue is concerned with the way CP integrates into, or makes use of the capabilities of, the Microsoft Windows operating system. While RH emphasizes that the program is designed "afresh, utilizing to a maximum the possibilities of the newer operating system" (p. 28), I would not quite agree with this assessment. Consider, for an admittedly painfully detailed example, the installation of CP: Upon double-clicking <setup.exe>, the program copies some files onto the hard disk and opens a window with (i) three installation options (installing CP, installing a database software and installing a sample corpus of Irish English) and (ii) a huge "Installation Advice Text". Among other things, this text explains, firstly, that the program is installed into the folder <C:\Corpus Presenter> and - if the program is installed elsewhere - that the links to the 27 modules must be altered manually! Secondly, the installation process is split into two different steps. (This information is provided in the advice text twice, once before a list of the contents of the CD-ROM and once again after this list; the confusion is increased by the fact that, in the second occurrence of the otherwise identical text segment, a different path is used.) The first step is called on by clicking on "Installing Corpus Presenter" so that all the program files are copied to one's hard disk; you cannot specify where the files are installed unless you manipulate Windows settings about default program directories. In the second step, Windows system files are copied. Surprisingly, you are prompted where these system files should be installed, and if you decide to install CP into a different directory (e.g. <D:\>), then the system files are copied to the system directories where they belong anyway, and the directory which you chose for installation only contains .EXE files for CP Programme Launcher and CP itself, while all the other 477 files CP has installed before are still in the directory mentioned before, namely, on an English Windows machine, <C:\Program Files\Corpus Presenter>. Similar comments hold for the database manager and the sample corpus: you need to install the database manager separately (which is ok), but CP expects it to be located in a particular directory without spaces in the name, and the sample corpus is simply installed to <C:\Corpus_Irish_English> regardless of where you would want to have it ... Although RH explains in the final paragraph of this advice text that many of the shortcomings are due to the Windows operating system, it remains completely mysterious to me why the user cannot simply enter the desired path for all to-be-installed components, and the program organizes itself internally as it needs to and outputs the requisite links as is customary with nearly every other Windows program I know. The way it is now, the installation process and its result are painful if you do not know your way around Windows quite well; your system partition <C:\> is then cluttered with different directories that you would perhaps have preferred to be on an 'applications' partition or on a 'corpus' partition. In addition, the uninstallation with the Windows control panel did not remove all parts of the installation properly: the corpus as well as the files in <C:\Program Files\Corpus Presenter> and <D:\Corpus Presenter> simply remained on the hard disk.

    Unfortunately, there are many more inconvenient things that contradict the claim that the possibilities of the newer Windows system are utilized to a maximum:

    - Some of the programs seem to adopt the previous color settings of the desktop rather than having their own settings. Doesn't sound like a big deal? Well, on a notebook with a black desktop it can result in your not being able to read the black text in the overview windows of CP Programme Launcher until you have figured out how to change the two available (and misleading) color settings.

    - When you start CP Programme Launcher, you get to see what looks like a menu bar at the top but is not one: Rather than opening a menu with commands to choose from, each expression in this menu-like section is already a command in itself. For users with a well-entrenched knowledge of the Windows system, this is at first perplexing, which is why buttons should have been used here in the first place.

    - Why are nearly all program windows opened such that they cover the whole screen and hide all other applications? When you turn to the help function to get information on some window, the help screen hides the window for whose options you look for clarification. When you open another module, it hides all other software which you might have needed to see (e.g. to enter data from it into CP Easy Chart). And why aren't the windows that cover the whole screen maximized so that a click on the restore down icon would reasonably reduce the window size? And then, some windows don't allow downsizing or maximizing at all: CP Easy Chart does - CP Flash doesn't.

    - In some programs (e.g. CP File Manager and CP Main Programme), lists of files can be sorted by clicking on a column heading (e.g. size, type etc.) - in others (e.g. CP Create Data Set), they cannot.

    - Right-clicking does not always open a context-dependent menu (sometimes it just does the same as left-clicking and sometimes it just offers to perform one particular action), and in windows consisting of several horizontally or vertically separated frames, you can often not change the window parts' relative sizes to see more of the more important information although this is of course standard in all Windows applications.

    - While there is a huge amount of commands available in the 'help' menu (twenty in CP! - even Excel XP only has eight commands, as has WST), many of them don't seem to belong there (what does benchmarking the system, the system information, running the graphics program CP Easy Chart or exploring CP's home directory have to do with a user struggling with CP's many settings and looking for help?). Also, CP does not afford Windows users the by now familiar option of a help index to enter key words describing your problem in order to retrieve a list of all help topics related to this notion. Finally, not all help texts are really useful: when I tried out the interactive tagging function of CP Text Tool, I was confronted with the window in which I had to enter the words and the tags for the tagging. Since I did not immediately understand the makeup of the window (containing four text fields, seven buttons, four fields to tick and some text information), I clicked the help button of this window. However, instead of getting information on how the information must be entered, I got eleven lines of text (taken from p. 160f. of the book), nine of which explain what tagging is and that semantic tagging is in general not possible plus two lines telling you that you must enter maximally 512 input forms and none of which explain the buttons or fields of the window from which this 'help' box was accessed in the first place! (If I do the same in some window in Excel, I get precise information on all buttons and all fields of the respective window ...) Also, the window offers the option to tag forms as words or strings - neither does the corresponding section of the book explain what a string is nor is the word 'string' listed in the index of the whole book ...

    - Windows programs usually allow the user to enter data into several fields of an input window by jumping from data field to data field by pressing the TAB key; CP does the same, but, at least in CP Easy Chart, the program does not switch from one field to the immediately adjacent and thematically related one, but arbitrarily to some other field, which doesn't make entering data any easier ...

    2c) Some other functionality quibbles

    The following is a list of other shortcomings of some modules which are not directly related to the integration into Windows; I begin with CP Main Programme.

    - If you want to save your results (of a corpus search) such that collocates at different positions can be accessed easily, you cannot simply choose to save it as a text file with tabs as delimiters (at least I didn't find out how). Instead, you must save it as a database file (.DBF), which entails you must use CP Data editor (or, say, Excel) to retrieve the data again and cannot use your favorite text editor etc. first.

    - If a particular search of CP's main program is interrupted, then - unlike other concordancers - CP does not present the results obtained so far; it presents none.

    - If you wish to INcrease the number of collocates to be displayed in a results window, you do so counterintuitively by clicking an arrow pointing DOWNwards.

    - While CP can output the collocates of a particular search word, it is not quite easy to locate this option: all other concordancers I know simply have a command called 'collocations' (or some equally telling name), but in CP you have to find out somehow that the command (in CP Main Programme) is called 'restructure return lines', which is not only very unintuitive but also somewhat difficult to find since (i) there is no help index (cf. above), (ii) _collocate_ and _collocation_ cannot be found using the find function of the main help text and (iii) the word _collocation_ only occurs twice in the whole program folder (as determined with a grep tool), neither occurrence of which explains this function. The only way to find this option if you don't already know it is the index of the book where the third page entry for _collocates_ points you to the right page in the book for this option.

    - If you want to search for bipartite expressions where one part can be instantiated by several different forms (such as inflectional forms of one lemma, say, _put_, _puts_, _putting_), then you can use the option of editing an input list - but you cannot simply edit the list by entering a few words and do a search, you must either load an existing list or enter the list manually and save it.

    - Surprisingly, CP cannot sort concordance lines according to a user-specified position in the vicinity of the search word: you can only sort concordance output according to the leftmost word of a cell and the word _sort_ or any derivative is not even mentioned in the index of the book, something I find strange for a program (suite) the main purpose of which is handling text(s).

    The module CP Create Data Set also deserves some comments. If you do not simply load text files as a corpus but want to compile a corpus, CP needs information on how the corpus is organized. You can either simply create a text file with a particular format with any text editor providing this information for CP by yourself or, alternatively, you can use this module. However, although the module is explained on only three pages in the handbook, it is relatively complex, and its output is the very same text file description of the corpus. In other words, one must again enter all information for each corpus file separately and manually. In addition, several windows this module opens are not discussed in the book or the corresponding section of the help file (which are identical anyway) and handling the module is not always intuitive to say the least:

    - I have not been able to find out how the order of files is changed using CP Create Data Set (other than, of course, by manually editing the text file itself);

    - subheadings of a corpus must make reference to empty dummy files;

    - deleting nodes from your corpus structure does not really delete the nodes until the data set file has been saved so you must work with empty nodes and empty files etc.

    Other shortcomings of this module are, again, due to the fact that Windows has not been utilized fully. Why can I highlight all corpus files which I want to assign to a lower structure level in my corpus, but cannot also change their level assignment all in one go? Why does this module not allow me to simply load a list of files and convert them to a dataset by providing information as to the structure of the corpus? (Guess what - you have to turn to a different module for this option, namely CP Flash, but when you read the section on CP Create Data Set to find out whether such a possibility exists, the book doesn't tell you - you must find out for yourself some other way!) Why is it not possible to use drag and drop options etc. to determine the structure of the corpus? Why is there no assistant to guide you through the creation of the corpus structure (just like Excel has a guide for pivot/contingency tables and WST has a brief guide to generate a concordance)? I don't know.

    There are similar usability problems throughout CP. I cannot discuss all of them here since the review is already (too?) lengthy so a few final examples must suffice for the moment: First, CP Quick Note makes it possible to structure a text using embedded table of contents markers. These markers can be embedded using the very same module ... but not with the menu 'Insert' as every normal user would suspect - rather, to insert these markers the menu you have to open is called ... 'Display'. Second, the program CP List Processor allows the user to manipulate one or two lists such that, for example, the lists are merged or differences between lists are shown. However, there is a little bug in the program concerning the alphabetical output of the program since the resulting sorting is not fully alphabetical. Finally, let us return to the interactive tagging procedure of CP Text Tool. You open a text file containing words to tag, and you need to have one file with words to tag and one file with tags. For the automatic tagging function, you choose the 1-512 word forms to be tagged with one tag, choose automatic tagging and CP Text Tool adds the tag to all the word forms; since you can specify more than one word to be tagged at a time, this is a huge advantage over replacing functions of, say, Microsoft Word or TextPad. With interactive tagging, the program goes through the corpus text, stops at every instance of one of the to-be-tagged word forms and asks the user which tag to add to the word form. This function is implemented a little clumsily since you are not simply prompted to choose a tag but have to use some more mouse-clicks, and whenever you want to choose a tag other than the default one to assign and click on 'reject' in this window, the list of available tags is recursively added ... Thus, although the interactive tagging function works basically ok, there is some bug here that needs correction.
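    The difference between automatic and interactive tagging described above can be sketched in a few lines of Python. This is a hypothetical illustration, not CP Text Tool's procedure: the function name, the `word_TAG` output format and the `choose` callback (which stands in for the interactive prompt) are all assumptions; only the 1-512 range for the number of word forms comes from the program.

```python
import re

MAX_FORMS = 512  # maximum number of word forms per tagging run, as reported above

def tag_forms(text, forms, default_tag, choose=None, sep="_"):
    """Append a tag to every whole-word occurrence of the given word forms.

    Without `choose`, every hit receives `default_tag` (automatic tagging).
    With `choose`, the callback is asked for each hit and may return another
    tag, or None to leave that hit untagged (a stand-in for interactive tagging).
    """
    if not 1 <= len(forms) <= MAX_FORMS:
        raise ValueError(f"expected 1-{MAX_FORMS} word forms, got {len(forms)}")
    # Longest forms first so that "putting" is tried before "put".
    ordered = sorted(forms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b",
                         re.IGNORECASE)

    def replace(match):
        word = match.group(0)
        tag = default_tag if choose is None else choose(word, default_tag)
        return word if tag is None else f"{word}{sep}{tag}"

    return pattern.sub(replace, text)

# Hypothetical sample sentence and tag name.
sentence = "She puts it down and put it back, putting it off as input."
print(tag_forms(sentence, ["put", "puts", "putting"], "VERB"))
```

    Passing a callback such as `lambda word, default: None if word == "put" else default` skips individual hits, which is the kind of per-instance decision the interactive mode is meant to support.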

    3) Nitpicking, typos etc.

    This section is concerned with only minor errors and some other short comments/questions in a simple list form.

    Concerning the book:

    - On p. 8, the first line of the second paragraph of section 3.5 is garbled.

    - On pp. 5f., 42 and 164f., RH mentions the normalization of corpora, but with the exception of one example buried within a table he restricts himself to normalizing spelling variants; the issue of lemmatization would have deserved more emphasis here (for analyses of author style or collocations) but it is only mentioned once in the glossary (though, surprisingly, not in the index);

    - The notion of tagging is explained briefly under the rubric of '3.2 Versions of corpora' in three paragraphs (p. 5) and once again under 'Tagging a corpus' on p. 8f. - why not put this together?

    - Why are the BNC and the Cobuild Bank of English not mentioned at all (not even in the glossary although this includes several other entries of words not mentioned in the book; cf. below) although they are probably the most widely distributed and available corpora?

    - The brief explanations of central corpus-linguistic terms in Part I, sections '3 Preparing corpora' and '4 Examining corpora' are very brief, having an average length of about two to three paragraphs only and thus cover these issues only very superficially.

    - On p. 22f., RH discusses an example of collocate analysis, namely the frequency of particular collocations for _deal_ in the London-Lund Corpus. He states that "139 finds were reported in 15 files. 57 finds were with _great_ as immediate left collocate and 41 finds with _good_ in the same position. The results in a visually effective form can be shown as a pie chart [...]." However, the pie chart to which RH refers reports percentages which do not fit the data described in the text. 57 [_great deal_] out of 139 [* _deal_] are 41% and not the 60% represented in the chart; the same holds for _good_: 41 [_good deal_] out of 139 [* _deal_] are 29.5% and not 38.89% ...

    - On p. 67, RH describes the command 'Frequently asked questions' in the module CP Main Programme as follows: "The third text file aims at answering typical questions which users might come up with who have started working with the present corpus. This file should preferably be written by someone who has been connected with the compilation of the corpus." However, when the test corpus is loaded and one tries to access the FAQ for this corpus using this option, what one gets is the FAQ for the CP Main Programme, not for the corpus and it is unclear why this should have been "written by someone who has been connected with the compilation of the corpus."

    - On p. 72, section 1.3 ('Normalising texts') consists of one paragraph only and accidentally interrupts an otherwise coherent and numbered description of the search parameters available in CP Main Programme.

    - On p. 73, RH gives an example of a word search using the frame option by suggesting that "if you wished to find all instances of negated adjectives in a text then you could enter a frame consisting of _un_ and _able_ [...]" - obviously, this search would not produce all negated adjectives since _inadequate_ and _impossible_ would not be retrieved.

    - On p. 81, RH states that the file extension for the output of a concordance as a plain text file is .OUT, but in the program it's .TXT.

    - On p. 92, a sentence runs "This very is important for [...]."
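
    The percentage mismatch noted above for p. 22f. can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python (not something CP itself provides); the three counts are the ones quoted from the book, while the variable and function names are made up for illustration.

```python
# Counts quoted from p. 22f. of the book; all names here are illustrative.
total_deal = 139   # all finds of _deal_
great_deal = 57    # finds with _great_ as immediate left collocate
good_deal = 41     # finds with _good_ as immediate left collocate

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

great_pct = percent(great_deal, total_deal)
good_pct = percent(good_deal, total_deal)
print(great_pct, good_pct)
```

    Both values are computed against all 139 finds of _deal_, which is the base the text itself names.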

    Some minor comments concerning the glossary:

    - Some of the explanations of statistical terms in the glossary (which are not mentioned in the book otherwise) are far from optimal. For example, to define an alternative hypothesis as "an assumption in statistics that two variables are different" (p. 211) is perhaps a little too low-level even for a glossary definition.

    - Similarly, the definition of a Chi-square test (p. 213) is grammatically incorrect ("A common test in linguistics is to determine if the probability that a difference between sets of values is due to chance alone.") and much too vague since the above sentence also characterizes a t-test, a U test, an ANOVA etc. Also, I am not sure that most scholars would subscribe to the following statement: "A typical cut-off point for significance is p<0.001" (p. 213).

    - I do not know why a corpus processing book needs glossary entries for _cookie_, _email_, _inkjet printer_, _laser printer_, _PC_, _RS232C_ and _TFT_.

    - I would not equate _lemma_ and _lexeme_.

    I already mentioned that large parts of Part II of the book are just the help files of the program, but sometimes the book itself is also a little redundant. For instance, a part of the general description of CP's main module on p. 30f. is repeated verbatim in the more detailed section on p. 48f. In addition, sometimes the names of the modules used in the book are not identical to the names of the modules used in CP Programme Launcher, the application from which RH recommends accessing all other modules. For example, the book has sections on CP Easy Chart and CP Structure, but in CP Programme Launcher the very same modules are called CP Chart Generator and CP Structured Texts respectively. This is of course no big deal (which is why it is in this nitpicking section of the evaluation in the first place), but it simply does not speak in favor of careful editing. The same holds for the fact that the book doesn't discuss the modules in the same order in which they are listed in CP Programme Launcher, although (i) this would be easier to follow and (ii) perfectly possible since the listing in CP Programme Launcher is completely arbitrary.

    Unfortunately, the book is not very well organized in terms of software learnability either, which is probably a direct consequence of Part II of the book largely being the help files. It would have been extremely useful if the book had provided at least one sample analysis designed in such a way as to lead the beginning user through the many modules (perhaps in combination with a website), unlike the sample analyses in the book, which make no reference at all to how exactly they would have been generated with CP. Let me give an example of what I would have appreciated very much. One could have some corpus files on the publisher's webpage which one would then turn into a hierarchically structured corpus using the text editing modules and CP Create Data Set. Then this corpus could be tagged and lemmatized using CP Text Tool. Then one could perform a sample analysis on the basis of this corpus, for example the collocational differences of _strong_ and _powerful_ (to use the textbook example) with CP Main Programme or CP Flash, and finally use CP Easy Chart and CP Slide to prepare a presentation of the results and CP Internet Editor to present the results on a website. For all of this, the website could provide interim results to allow the user to check whether he has mastered the tasks so far. Sadly, however, none of this is provided although it would have enhanced the value and learnability of the bundle by many orders of magnitude.
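
    To make the _strong_/_powerful_ step of such a walkthrough concrete, here is a minimal sketch of the underlying idea in Python rather than in CP: counting the words that immediately follow each node word. The function name `right_collocates` and the toy sentence are invented for illustration; a real analysis would of course run over corpus files and not over a single made-up sentence.

```python
from collections import Counter

def right_collocates(tokens, node):
    """Count the words that immediately follow `node` in a token list."""
    node = node.lower()
    lowered = [t.lower() for t in tokens]
    # Pair every token with its successor and keep the successors of `node`.
    return Counter(nxt for cur, nxt in zip(lowered, lowered[1:]) if cur == node)

# Invented toy sentence, not corpus data.
tokens = ("She drinks strong tea and strong coffee but drives a "
          "powerful car with a powerful engine").split()

strong = right_collocates(tokens, "strong")
powerful = right_collocates(tokens, "powerful")
print(strong.most_common())
print(powerful.most_common())
```

    Comparing the two frequency lists side by side is the kind of interim result a companion website could have offered for self-checking.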

    Concerning the software:

    - The installation advice text contains the same paragraph twice (with different paths, though); cf. above.

    - In the rubric 'Retrieving information' of the CP help, there are three paragraphs §2.

    - In several modules, when a window outputs certain results, you can change the size of the window as such, but not the size of the part of the window that contains the output; i.e. you get a larger window with the same information; cf. here for an example from CP Flash.

    - The fact sheet for the installed test corpus starts with "Some essential facts abou the Test Corpus."

    - Sometimes, the program uses somewhat idiosyncratic commands: instead of clicking on 'ok' to close a window and accept what one has entered/changed, one has to click on 'conclude.'

    Lastly, although CP is a very recent program, it does not have some of the added-value gimmicks that competing programs offer (it is only fair to repeat here that it of course also has functions these competitors do not have). For example, CP does not provide corpus-based statistics such as indices of collocational strength etc. (like, say, Michael Barlow's Collocate). Also, although the issue of analyzing style is brought up repeatedly in the book, CP does not allow for the automatic identification of key words in texts (unlike WST).
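
    As an illustration of what an index of collocational strength can look like, the following sketch computes pointwise mutual information, one commonly used measure. It is not meant as a description of how Collocate or any other program implements its statistics; the function name, its parameters and the frequencies are all invented for the example.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (log base 2) of a word pair.

    Compares the observed pair frequency with the frequency expected
    if the two words were distributed independently of each other.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Invented frequencies for illustration only.
score = mutual_information(pair_freq=30, node_freq=200,
                           collocate_freq=150, corpus_size=1_000_000)
print(round(score, 2))
```

    A score of 0 means the pair occurs exactly as often as chance would predict; the higher the score, the stronger the association.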

    CONCLUSION

    All in all, I am the first to admit that CP is a program that offers many functions that can be useful for the compilation, annotation and processing of corpora. I also freely admit that the evaluation of usability is by definition a relatively subjective task. I also believe, however, that many of the flaws I have pointed out above would render the program much more difficult to use than competing products. From what I have seen, the (only) positive side I have been able to detect is in fact the large number of functions, i.e. the 'what the program does'. But the negative sides of CP are the 'how the program does it': (the larger part of) the program is

    - difficult, inconvenient and counterintuitive to handle, sometimes violating even elementary usability principles;

    - overloaded with many redundant functions (containing four text editors alone) that are part and parcel of regular operating systems and office software;

    - painfully slow to execute even some of the most basic concordancing tasks.

    Note in this connection that many functions of CP (other than the hierarchical corpus compilation functions of course) are available as (parts of regular) office suites and freeware programs. In addition, the software book is largely identical to the many help files that come with the software and sloppily edited in many ways. Although I had been waiting quite some time for the program after having seen it announced as commercially available soon, I am rather disappointed with the final result and hope that the most frustrating bugs will be considered for an update soon.

    ABOUT THE REVIEWER

    Stefan Th. Gries is Associate Professor at the Department of Business Communication and Information Science at the University of Southern Denmark. His research interests lie mainly in corpus linguistics and linguistic methodology, esp. the syntax-lexis interface as well as corpus-based, quantitative approaches to word-formation processes (e.g. blending and suffixation), syntactic variation (dative movement, particle movement etc.) and semantic issues (near synonyms, word senses etc.). He is currently co-editing two volumes on corpora in cognitive linguistics and is also one of the editors-in-chief of a new journal, Corpus Linguistics and Linguistic Theory, to be launched in 2005.