LINGUIST List 15.1087

Thu Apr 1 2004

Disc: Re: Reply to review of Corpus Presenter

Editor for this issue: Sarah Murray <sarah@linguistlist.org>


Directory

  • Stefan Th. Gries, Re: 15.981, Hickey's response to my review of his program/book

    Message 1: Re: 15.981, Hickey's response to my review of his program/book

    Date: Wed, 24 Mar 2004 23:42:18 +0100
    From: Stefan Th. Gries <STGries@sitkom.sdu.dk>
    Subject: Re: 15.981, Hickey's response to my review of his program/book


    For a more reader-friendly version of my review, please go to http://people.freenet.de/Stefan_Th_Gries/Research/CP_review.pdf .

    In addition to all communication on this matter (including this text), that document provides some screenshots of CP and, to make it easier to locate the discussion of CP's different modules, highlights all occurrences of CP's program names in blue. In what follows I would like to briefly comment on some of the points raised in Hickey's response to my review. The first part is concerned with aspects of Hickey's book/software (henceforth CP), the second with the more general tone of Hickey's response.

    I agree with Hickey that my review did not cover all aspects of the book/software that one could have mentioned. The book alone comprises nearly 300 pages and the software offers an extremely vast range of functions (as I also state in my review), so every review - especially when due within at most six weeks - is by necessity selective, even though the review is already extraordinarily long. Given the vast range of functions the program has to offer, I think it is only natural that not all features can be discussed (to the satisfaction of the author).

    It also goes without saying that I accept responsibility for, and regret, all errors or misrepresentations that are not due to my selective emphasis, although below I also have something to say about how they arose in the first place. However, it is necessary to set straight some of the points that Hickey criticizes about my review, and I will also demonstrate below that some of Hickey's criticisms are due to his not having read the review thoroughly enough.

    As a first example, Hickey complains about my having neglected the presentation of corpora, the retrieval techniques and the Corpus of Irish English. On the one hand, the review does mention that CP can be used to compile and annotate corpora hierarchically, and it mentions the installation of the Corpus of Irish English. On the other hand, while I certainly agree that I could have discussed these issues more prominently, I found it more important to discuss the way CP deals with corpora in general rather than one corpus in particular, and I consider this a reviewer's legitimate option.

    Second, Hickey objects to my comparing CP with other software. I do not see why this is problematic. Sure, CP is a tool that is very different from competing products such as MonoConc Pro 2.2, WordSmith Tools 3/4 (WST) and WinConcord 2 - no doubt about that - but (i) by far the largest part of the review is not concerned with comparing CP to competing products anyway (WST is mentioned 11 times on 9 pages with 6,544 words) and (ii) I do not see any reason why I as a reviewer should not be allowed to compare selected aspects of a program with competing applications. Moreover, I do not know how well Hickey knows the competing products he refers to: (a) He simply affirms that CP offers the most flexible retrieval tools without offering any evidence (and MonoConc Pro's retrieval options are in fact extremely versatile). (b) Where I give evidence about the speed of the program, Hickey simply states he doubts my evidence (which, as I mention in my review, consists of the statistics CP itself outputs!) but does not provide a single argument or perhaps a comparable figure to support his point. Lastly, while I do agree that some may consider speed less relevant nowadays, my own experience is different: searching the BNC or merely larger parts of it for regular expressions of varying degrees of complexity or determining significant collocates for tens of thousands of adjectives can be so time-consuming that speed sometimes matters quite a lot. Be that as it may, reporting performance statistics cannot really be wrong by definition ...

    Third, let me now turn to the so-called factual misrepresentations. For one, Hickey states that, contrary to what I wrote, "[i]t certainly is possible to sort concordance returns on words to the left and right of the keyword (this can be done for up to 8 words each side of the keyword)." If the function 'restructure return lines' is activated for a particular concordance, CP outputs a bipartite window, the lower part of which provides absolute and relative frequencies of collocates within some defined span (1, 2, 3, 4, 6, 8 words on one or two flanks - why not also 5 and 7?). The upper part of this window then provides a part of each concordance line in tabular form such that the line is split into as many slots as were previously defined. It is true that this part of the output can then be sorted (if you want to sort according to the first word to the right of the node/search word, you must click on its column name, which is "Item 1" just like the name of the first word to the left) and copied to the clipboard (and only then to disc). However, it is not the complete concordance line that can be dealt with this way; only the previously defined part of the concordance line is sorted accordingly, which is why context further away from the node/search word cannot be accessed this way. Thus, unless I have missed some other function, I do not see my statement disproven.
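
    To make the distinction concrete, here is a minimal Python sketch of the kind of sorting at issue. It is not CP's (or any other program's) implementation; the function names kwic and sort_by_right_context and the toy sentence are assumptions made purely for illustration. Each concordance line keeps its full left and right context, and the sort key is merely the n-th word to the right of the node word.

```python
def kwic(tokens, node, span=4):
    """Collect (left context, node, right context) triples for every hit of `node`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, token, right))
    return lines


def sort_by_right_context(lines, position=1):
    """Sort concordance lines on the `position`-th word to the right of the node.

    Lines whose right context is too short sort first (empty key).
    The complete line is retained; only the sort key is restricted.
    """
    def key(line):
        right = line[2]
        return right[position - 1].lower() if len(right) >= position else ""
    return sorted(lines, key=key)


# Hypothetical toy input, for illustration only.
tokens = "the cat sat on the mat and the dog ran".split()
for left, node, right in sort_by_right_context(kwic(tokens, "the", span=2)):
    print(" ".join(left).rjust(12), f"[{node}]", " ".join(right))
```

    In this sketch the span only limits how much context is stored per line; a tool that sorts complete lines would simply store more context while keeping the same sort key.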

    In this connection, Hickey states "Gries does not like my terminology - 'restructure return lines' - but as a native speaker of English I beg to maintain that this is an acceptable description of this function." While Hickey simply skips over my criticism that this function is too difficult to locate, since no help index is available and no entries for "sort" or "restructure return lines" (! ;-) ) exist in the index of the book, he is completely right: I do not like his terminology. I pointed out some other idiosyncratic names of functions in his program, and I leave it to the readers to decide whether it is really just due to chance that most, if not all, other programs use the command "sort" for sorting, as do some programming languages (e.g. Python and R, the latter of which also has "order").

    In order to address the only other case of "factual misrepresentation" Hickey cares to mention (in spite of the multitude of errors he implies there are), I have to quote him again at length. He states "Gries thinks that the analysis of style is not treated in Corpus Presenter, but the special text editor, CP Text Tool, has a function for Lexical Clustering analysis which does precisely that. It will allow users to determine the occurrence of stylistic features in a flexible manner and so help them answer such questions as text authorship. Lexical Clustering is mentioned on several occasions, including the various guides available within the Launcher so Gries should have seen this is if he had looked at the material properly."

    Unfortunately, Hickey himself has not cared to read my review with the necessary attention. Here's what I said in my review: "Lastly, although CP is a very recent program, it does not have some of the added-value gimmicks that competing programs offer (it is only fair to repeat here that it of course also has functions these competitors do not have). For example, CP does not provide corpus-based statistics such as indices of collocational strength etc. (like, say, Michael Barlow's Collocate). Also, although the issue of analyzing style is brought up repeatedly in the book, CP does not allow for the automatic identification of key words in texts (unlike WST)." As it turns out, CP really cannot output collocation statistics. More importantly, while it can output lexical clusters in the way that WST can, it cannot compute key words as defined in WST, which is what I explicitly referred to. Key Words in WordSmith takes two corpora (one 'research corpus', one 'reference corpus'), checks the frequencies of all words in both corpora and then outputs key words sorted by their p-values (where key words are words which are significantly overrepresented in the research corpus as compared to the reference corpus [measured in terms of Chi-square tests or the Log-likelihood test]). Hickey's Lexical Cluster Analysis does not do this, and I have not been able to locate any other such function in his program, which is why this claim of his remains as much in need of support as many others.
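
    For readers unfamiliar with the notion, the following Python sketch illustrates a key-word computation of the kind just described, using the log-likelihood variant. It is an assumption-laden illustration, not WordSmith's actual implementation: the function names, the 0.05 threshold and the toy token lists are all hypothetical, and the p-value is taken from a chi-square distribution with one degree of freedom.

```python
import math
from collections import Counter


def log_likelihood(freq_res, size_res, freq_ref, size_ref):
    """G2 statistic for one word from the 2x2 table (word / other words) x (research / reference)."""
    observed = [
        [freq_res, freq_ref],
        [size_res - freq_res, size_ref - freq_ref],
    ]
    total = size_res + size_ref
    col_totals = [size_res, size_ref]
    g2 = 0.0
    for row in observed:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            if obs > 0:  # an empty cell contributes nothing to the sum
                expected = row_total * col_total / total
                g2 += obs * math.log(obs / expected)
    return 2 * g2


def key_words(research_tokens, reference_tokens, alpha=0.05):
    """Return (word, G2, p) for words significantly overrepresented in the research corpus."""
    if not research_tokens or not reference_tokens:
        raise ValueError("both corpora must be non-empty")
    research, reference = Counter(research_tokens), Counter(reference_tokens)
    n_res, n_ref = len(research_tokens), len(reference_tokens)
    results = []
    for word, freq in research.items():
        ref_freq = reference[word]
        if freq / n_res <= ref_freq / n_ref:
            continue  # not overrepresented in the research corpus
        g2 = log_likelihood(freq, n_res, ref_freq, n_ref)
        p = math.erfc(math.sqrt(g2 / 2))  # chi-square survival function, 1 df
        if p < alpha:
            results.append((word, g2, p))
    return sorted(results, key=lambda r: (r[2], r[0]))  # smallest p-value first


# Hypothetical toy corpora, for illustration only.
research = ["corpus"] * 50 + ["the"] * 50
reference = ["the"] * 95 + ["corpus"] * 5
for word, g2, p in key_words(research, reference):
    print(f"{word}\tG2={g2:.2f}\tp={p:.3g}")
```

    The essential point is that such a function needs two corpora and a significance test per word, which is different in kind from listing recurrent word clusters within a single text.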

    Let me turn to the final factual point. I pointed out before that my main problems with CP do not derive from its functionality, i.e. what the program can do. Let me state it as directly as possible: the functionality of CP is great; it can do more than any other corpus program I have ever seen. My main quibble is with usability, i.e. how the program lets you do it. I do not wish to bore the readers with all the details of the original review and refer them to it instead, but let me make two points about Hickey's complaint that I devote too much space to criticizing the many utilities CP offers. First, I think it is only fair to point out to the potential buyer that many of the twenty-something modules the program contains just do what the operating system or other freely available software can already do. For some people, this may be an interesting argument against the bundle, and thus this is something that should be pointed out in a review. Second, the usability of a program is not enhanced by crowding it with many modules or functions one is later invited to delete or neglect - rather, a program should offer its functionality in such a way that the user can make intuitive choices from a reasonably small set of alternatives: a help menu with twenty different entries that even includes a command to benchmark the system is simply not the most usable way to design a program, and the fact that no other program goes to similar extremes testifies to this point.

    As to the book, Hickey objects to my lack of appreciation of the structure of the book and states that there were good reasons for it. I cannot substantially comment on this point since Hickey does not mention the good reasons he alludes to but just states that, if I don't like the structure, he can't help it. This is doubtlessly correct, but it does not constitute a rational argument, nor do I see why I as a reviewer should not be entitled to criticize the structure of a book (not to forget the many editing errors / typos), especially when I also outline a constructive proposal for what, from my point of view, a didactically more feasible structure might look like, as I do at the end of my review. I am, however, very happy to learn that my review has - in spite of all its limitations - already resulted in some bugs being fixed.

    Let me finally say something about the general tone of his reply. First, I (and at least three other colleagues who have read his reply) cannot fail to notice the ad hominem undertone underlying (parts of) his reply. I do not see in what way it is relevant to a discussion of the merits (or lack of them) of my review that I am "one Stefan Gries", "a German academic at the University of Southern Denmark", or that Hickey has never heard of me before. Similarly, Hickey states that "Corpus Presenter works properly and fulfils the functions which it claims to perform (Gries acknowledges this, if only begrudgingly)". What is the purpose of salting his reply by attributing such emotional states to me? Neither did CP perform all the functions Hickey claimed it to perform - remember Hickey's own statements about the bugs he fixed in response to my review? Why is there already an upgrade if CP already performed all functions as intended? - nor is there any statement in my review that can straightforwardly be interpreted as begrudging if one has not already built up some prejudices. Quite the contrary: I mentioned clearly that I had been looking forward to the program for quite some time after having seen it announced as commercially available soon. And as usual, note that Hickey simply attributes this emotional state to me, but - as before - does not cite any sentence whatsoever to support this attribution. I would have welcomed a more sober exchange than the one that has now actually taken place, but it is instructive in this connection to consider not just the reply Hickey posted to the LinguistList, but also the reply he had sent to me personally earlier, in which a compound involving a vulgar German verb for 'to defecate' plays a prominent role in characterizing my review (go to http://people.freenet.de/Stefan_Th_Gries/Research/CP_review.pdf to access the original review with example screenshots and all following communication). This will therefore be my final statement in this matter.

    Stefan Th. Gries University of Southern Denmark