LINGUIST List 11.1411

Sun Jun 25 2000

Review: MonoConc Pro 2.0

Editor for this issue: Andrew Carnie <carnie@linguistlist.org>




What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Andrew Carnie at carnie@linguistlist.org.

Directory

  • John Lawler, Review: MonoConc Pro 2.0 Concordancing Software

    Message 1: Review: MonoConc Pro 2.0 Concordancing Software

    Date: Mon, 19 Jun 2000 16:49:40 -0400 (EDT)
    From: John Lawler <jlawler@umich.edu>
    Subject: Review: MonoConc Pro 2.0 Concordancing Software


    MonoConc Pro Concordance Software, Version 2.0 (March 2000) Athelstan, Houston TX info@athel.com http://www.athel.com/ For Windows 3.1, 95, 98. Educational, single-user price US$85.

    Reviewed by John Lawler, University of Michigan

    MonoConc (MC) is a Windows program that provides most of the functions that corpus linguists require in concordancing. MC has been around for a long time; it was developed by Michael Barlow - barlow@ruf.rice.edu - http://www.ruf.rice.edu/~barlow/ - of Rice's Linguistics Department, with the needs of corpus linguists in mind. (See Hockey 1998 on the needs of corpus linguists, and Barlow's Corpus Linguistics page on meeting them.) The result is a fast, cheap, and reliable program, and its utility is not restricted to specialists in corpora by any means. This latest version, full of new features, is software any linguist would want to have, and most can afford.

    Not to keep anybody in suspense, this is largely a rave review, with a few rants along the way. First I will discuss some of the main features of MC, then describe three rather different projects I used it for, indicate where I found it indispensable, and where I found it less useful and why, with suggestions for revision of the user interface. At the end I append some Web links in the References. Definitions of most of the technical terms used here can be found in the online Glossary of Lawler and Aristar-Dry 1998; see References for URLs.

    The first thing one notices about MC is that it fits on one floppy disk, and it doesn't have an "Installation" module, something unusual for Windows programs. One simply copies three files into whatever directory one pleases, then runs the program (MonoPro.exe). The simplicity continues; when it is run, MC presents the user with a blank window containing only File and Info menus.

    To do anything, one must load some files from disk, or from a URL (this is new in Vers 2.0). MC doesn't keep a list of recent files on the File menu like Word, but it does remember the last folder accessed, which is convenient. Any ASCII text file can be loaded as a corpus, and one can load more than one at a time.

    But you'd better specify all the files you want the first time, because if you try to add more later in the session, they get loaded as a separate corpus. This means that files that can't be grabbed together in the Windows file dialogue may need to be renamed, and sorted in the window, which is at least a bother; Vers 2.0 alleviates this problem somewhat by allowing one to save and load a "workspace", which is essentially a state dump including a specified file list. Then you can simply grab the workspace file icon on the desktop and drop it on the MC icon, thereby avoiding the blank grey window.

    The files get loaded and concatenated together as a corpus (though, oddly, the "Corpus Text" window is named after the last file), and when that happens, the "Corpus Text", "Frequency", "Window", and "Concordance" menus appear between "File" and "Info". One can suppress tags (or only Part-of-Speech tags), or suppress the words and leave just the tags (useful for examining structure), on the "Corpus Text" menu, and the "Frequency" menu produces instant wordlists. But it's the "Concordance" menu that most of us will be using.

    On this menu there are three options available at first: "Search", "Advanced Search", and "Search Options". Each brings up a dialog box from which one can reach the others, so it doesn't matter which is chosen first. The purpose of all of these commands is to determine how the concordance is to be generated -- and in this program it's important to realize that concordances are not permanent objects. Rather, they're essentially reports generated on the fly, searching for any string in any context that can be specified by regular expressions or by a number of special wild card terms peculiar to MC. One can, of course, do a complete concordance of the corpus simply by searching for '*', but that doesn't begin to exhaust the possibilities of the search function. For instance, full Regular Expression searches are supported, as well as more sophisticated context searches. (See Lawler 1998 for more information on Regular Expressions.)

    Once the search term is entered (and usually it has to be adjusted until it gets exactly the desired results), the concordance is generated (very rapidly; time is almost never a limiting variable) and displayed in Key Word In Context (KWIC) format in a separate window, split into two parts; the lower one contains the concordance as such, while the smaller upper one displays the context of the selected line in the concordance as it is selected. Thus, if the context of a line in the KWIC isn't clear, just select it with the mouse, and the paragraph it appears in shows up in the upper window.
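
    The KWIC layout is easy to picture with a few lines of code. What follows is only a minimal sketch in Python, not MC's own implementation; the function name kwic and the width parameter are assumptions made for illustration. It finds every match of a regular expression and returns the keyword with a fixed-width slice of context on either side.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left, keyword, right) tuples for every match of `pattern`."""
    # Collapse line breaks and runs of spaces so each context fits on one line.
    flat = " ".join(text.split())
    rows = []
    for m in re.finditer(pattern, flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Pad both sides so the keywords line up in a column when printed.
        rows.append((left.rjust(width), m.group(0), right.ljust(width)))
    return rows
```

    Printing each tuple on its own line, joined with spaces, gives the familiar aligned column of keywords. Because the result is rebuilt from the text on every call, it behaves like the on-the-fly reports described above rather than like a stored object.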

    The lines in the concordance window (or windows; one can concord on any number of different search terms) appear in the original order found in the corpus. To get better-organized lists, one uses the "Sort" menu, which appears (along with a "Display" menu) when the concordance window is active. Here one is allowed to sort the concordance on the keywords (called the "search term") or the first, second, or third word to the right or left of them, and there can be up to three consecutive sorts, so that similar contexts will appear grouped together. The sort facility is very flexible; one can use any sorting order on any character set, including unique digraphs, so that for instance "ch" and "ll" can be made to sort as alphabetic letters in their customary Spanish order.
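
    How a digraph-aware sort might work can be sketched as follows. This is an illustration in Python under stated assumptions, not a description of MC's internals: the names SPANISH, make_key, context_word, and sort_concordance are hypothetical, and each concordance row is assumed to be a (left, keyword, right) tuple. Accented vowels are not in the sample alphabet and would sort after "z" here.

```python
# Traditional Spanish alphabet, with "ch" and "ll" treated as single letters.
SPANISH = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
           "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w",
           "x", "y", "z"]

def make_key(alphabet):
    """Build a sort key that ranks words by the units of `alphabet`."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)

    def key(word):
        word = word.lower()
        ranks, i = [], 0
        while i < len(word):
            # Try the longest unit first so "ch" wins over "c" followed by "h".
            for size in range(longest, 0, -1):
                unit = word[i:i + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    i += len(unit)
                    break
            else:
                # Characters outside the alphabet sort after all of its units.
                ranks.append(len(rank) + ord(word[i]))
                i += 1
        return ranks

    return key

def context_word(row, offset):
    """Word at `offset` from the keyword: 0 = keyword, 1 = 1st right, -1 = 1st left."""
    left, keyword, right = row
    if offset == 0:
        return keyword
    words = right.split() if offset > 0 else left.split()[::-1]
    index = abs(offset) - 1
    return words[index] if index < len(words) else ""

def sort_concordance(rows, offsets, key):
    """Sort rows on several context positions, given in priority order."""
    return sorted(rows, key=lambda row: [key(context_word(row, o)) for o in offsets])
```

    Calling sort_concordance(rows, [1, -1, 0], make_key(SPANISH)) would sort on the first word to the right, then the first word to the left, then the search term, which is one way of getting the three consecutive sorts described above.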

    Any linguist can get the hang of this kind of rapid sorting and re-sorting (and its resultant eyeballing for patterns) very fast, and it's quite gratifying to be able to look through a corpus for patterns so easily and productively. At any point, of course, one can go to the "Frequency" menu to get numeric data, sorted either alphabetically or numerically.
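
    For readers who want to see what such a wordlist amounts to, here is a small Python sketch. It is an approximation only: the tokenizing pattern (lowercase letters and apostrophes) and the function names are assumptions, and MC's own notion of a word may well differ.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word forms, ignoring case."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def by_frequency(counts):
    """Most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def alphabetical(counts):
    """Plain alphabetical listing of (word, count) pairs."""
    return sorted(counts.items())
```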

    I discovered in the course of the review that one of the principal uses of MC in various recensions is as a classroom tool, for example in language instruction, since it is child's play (literally) to generate pattern exercises from corpora -- one simply searches for the word in question and suppresses the search item, producing attested cloze samples which can be used for drill or other exercises. Indeed, there is a special version of MC for use in classes, and a special course license and rate. I found it useful for other tasks as well, especially in collaboration with other software, to do things it wouldn't.
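
    The cloze trick can be imitated outside MC as well. The sketch below, in Python, assumes the corpus has already been split into sentences; the function name cloze and the blank string are illustrative choices, not anything MC provides.

```python
import re

def cloze(sentences, word, blank="_____"):
    """Keep the sentences containing `word` and blank the word out."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [pattern.sub(blank, s) for s in sentences if pattern.search(s)]
```

    Given "She will remain here." and "Remain calm.", a search for "remain" yields "She will _____ here." and "_____ calm.", ready for a drill sheet. Inflected forms such as "remained" are left alone unless the pattern is widened.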

    Two years ago, for instance, I prepared the index for a book (Lawler and Aristar-Dry 1998) using an earlier version of MC (see Antworth & Valentine 1998 and Stevens 1997 for reviews of that version), and MS-Word's Index facility. I couldn't have done it very well using either alone. The Word Index facility is quite reasonable, given a list of words to index, but getting that list is not easy, to say the least.

    I used MC on the ASCII files of the various chapters to prepare a complete concordance. After that, I simply went through the concordance deleting words I didn't want to index, leaving only those I did, including all the proper names. These were copied and pasted into Word's Index facility, which then dutifully produced an index with correct page numbers (once I had put in manual page breaks to correspond with the physical page numbers on the galleys) for all the important words in the book. I was then able to edit this index further, grouping together related concepts, differentiating homophonous word uses, and eliminating spurious references. It was a very educational experience for me, and produced a really thorough index in what I considered a reasonable amount of time, without having to pore over each page of the galleys physically. The index is available online (see References).

    I recommend MC highly for any form of indexing, especially now that it can build concordances with more than 16,000 hits at a time, which was the limit at the time I did the index (I actually had to do 26 concordances, for a*, b*, etc, to keep the size of each below 16,000). 16,000 is still, unaccountably, the default in MC Pro, and one must continue to remember to reset it each time when concording large corpora, though I have experienced no problems with numbers as high as 100,000 in MC Pro 2.0.

    Another, more recent, project I found MC helpful with was an investigation (Lawler 2000) of the syntactic properties of the verb 'remain' in the peculiar construction:

    (1) I remain to be convinced that your plan is feasible. (Ross 1977)

    As it turns out, the 2,435,659 quotations in the OED2 include 7,167 that contain some version of the word 'remain' (both noun and verb). While useless for statistical purposes, such a collection is very likely to contain occurrences of every possible construction, idiom, important collocation, subcategorization, and selectional restriction for the search term, identified by source and date.

    The Digital Library Production Services OED Web interface delivers the results as a single Web page, tagged in HTML. That amounts to a partially tagged corpus, a real bonanza of syntactic data. (It is available online; see References.)

    The massaging was done outside of MC, which is equipped to search for strings, but not to change them. Making a raw HTML corpus tractable frequently requires other tools in collaboration with a concordancer; one of the many benefits of MC is that it is ASCII through and through, and therefore can be used in combination with such tools as the ones I frequently use, like the stream editor sed, and the filter language awk (available free for both DOS and UNIX; see Lawler 1998), along with programmable editors, such as emacs and ex, on UNIX (emacs is a screen editor -- some say *the* screen editor -- while ex is a line editor), or TextPad and Qedit on Windows (TextPad is a Windows editor, while Qedit -- now called Semware Jr -- is DOS. Both are shareware; see References).

    In this task, I used MC extensively on the massaged corpus; since the study was restricted to infinitive complements governed by the verb 'remain', I had first to eliminate occurrences of noun 'remain(s)', as well as 'remainder'. MC allowed me to eliminate all instances of 'remainder' from the target corpus easily, but of course it could not search for zero morphology in a corpus without Part-of-Speech tags (see Ball's overview on tagging), so I had to put those in myself, by hand, by going through the (by this time) around 4,000 quotations. Once that was done, MC concorded the examples tagged with verbal 'remain' followed (anywhere) by the word 'to'. This didn't guarantee it was an infinitive 'to', but this could be checked easily in MC, and the non-infinitive cases, and those not governed by 'remain', were eliminated.
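
    The search for tagged verbal 'remain' followed anywhere by 'to' can be approximated with a regular expression. The Python sketch below is hypothetical throughout: the "_V" suffix stands in for whatever hand-tagging convention was actually used, which the review does not specify, and each quotation is assumed to sit on a single line.

```python
import re

# Hypothetical hand-tagging: verbal forms carry a "_V" suffix, e.g. "remain_V".
VERB_THEN_TO = re.compile(r"\bremain(?:s|ed|ing)?_V\b.*?\bto\b", re.IGNORECASE)

def candidates(quotations):
    """Keep quotations where a verb-tagged 'remain' is later followed by 'to'."""
    return [q for q in quotations if VERB_THEN_TO.search(q)]
```

    As in the procedure described above, a hit only shows that a 'to' follows somewhere; whether it is an infinitive 'to' governed by 'remain' still has to be checked by eye.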

    At this point, I had about 500 sentences, dated and attributed, all containing a construction formally resembling the one under consideration, suitable for syntactic analysis. From this point on, the task could be done with a wordprocessor; but getting to this point would have been difficult or impossible without MC and the other tools. I found that MC opened up many possibilities that I would not have considered doing in the study, let alone been able to do at all, and that it frequently made the difficult simple, and the impossible merely tedious. This is a program I would recommend to any syntactician or semanticist that wants to work with real data.

    Finally, I have recently become curious about Emily Dickinson's use of phonesthemes. After snagging a lot of her poetry online from various places, I am once again faced with a large amount of text, variously tagged in HTML, to deal with. And once again, MC is the tool of choice. This time, though, I am somewhat more conscious of drawbacks in MC.

    For one thing, while it's very nice to be able to generate concordances on the fly, sometimes you want to work with the same concordance for a long time, and MC makes it hard to do that, because concordances are not saved in the form in which they appear in MC's "Concordance" window, but rather strictly in ASCII, with the search term [[ marked ]] with double brackets. This is OK for an ASCII save, but if I want to work with the same concordance tomorrow that I'm working with today, I have to generate it all over again tomorrow from data if I want to use MC's searching and sorting facilities -- an ASCII concordance does not load in the concordance window, but as a new corpus (containing [[]]'s, which can't be deleted in MC).
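
    Since the saved file is plain ASCII, it can at least be picked apart again with other tools. The following Python sketch assumes one hit per line with exactly one [[ ... ]] pair around the search term; that layout is a guess based on the description above, not a documented file format, and the function name is illustrative.

```python
import re

# Left context, bracketed search term, right context.
MARKED = re.compile(r"^(.*?)\[\[\s*(.*?)\s*\]\](.*)$")

def parse_saved_line(line):
    """Split a saved concordance line into (left, keyword, right), or return None."""
    m = MARKED.match(line.rstrip("\n"))
    if m is None:
        return None
    left, keyword, right = m.groups()
    return left.strip(), keyword, right.strip()
```

    The resulting tuples could then be re-sorted or filtered outside MC, though that is no substitute for being able to reload the concordance itself.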

    This is normally not a big deal, but when dealing with HTML text one usually simply wants to avoid the HTML tags, and this is certainly true of the Dickinson corpus, since the poems are heavily formatted. The "Corpus Text" menu has an option to "Suppress Tags", but it only applies to display of the text in that window; when one opens a Concordance window via a Search command (in this case, for *), all the HTML tags show up in the Concordance window, duly found by the search. This window *also* has a setting that suppresses tags, but it can't be invoked until *after* the search, when it produces a lot of lines that don't have any visible search term.

    The workaround is to sort on the search term (which works, whether it's suppressed or not), then delete all the lines without search terms, about a third of the corpus of 77,000 hits, which group together for selection at the beginning of the corpus. Then I have to remove the hits on the various Roman numerals that stud the pages, then I can get down to business. Tomorrow I'll have to do the same thing all over again. And hope I've remembered all the steps, so I wind up with the same concordance.
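
    An alternative is to clean the files once, before they ever reach the concordancer. The sketch below uses Python's standard html.parser module to drop the tags, then discards lines that consist only of a Roman numeral. The class and function names are illustrative, and the numeral filter is an assumption about how the poem headings look; note that it would also drop a line consisting solely of the word "I".

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect character data and ignore every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the text content of an HTML string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

# A well-formed Roman numeral; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[IVXLCDM])M*(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)

def drop_numeral_lines(text):
    """Remove lines that hold nothing but a Roman numeral (with optional period)."""
    kept = []
    for line in text.splitlines():
        token = line.strip().rstrip(".")
        if not ROMAN.fullmatch(token):
            kept.append(line)
    return "\n".join(kept)
```

    Saving the cleaned text as a new ASCII file means the same steps need not be remembered and repeated each day; only the search itself has to be rerun.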

    Once that's done, though, a variety of sorting strategies suggest themselves, and any number of putative patterns begin to tease the perceptions and beg to be tested. This is the *really* useful part of an interactive concordancer -- it gives one the opportunity to become really intimately acquainted with one's data in ways that are simply impossible with large data sets without such help, and it is here that MC really shines.

    There are plenty of other gripes one might make about details of the user interface, but there are workarounds for almost all of them. And I haven't even begun to list most of the special functions it can also perform, should one desire them; there's a big load of functionality to explore here. The version of MC I received arrived without documentation, but there is now a 70-page comprehensive manual in Word format (9.8 MB), with diagrams and screen shots, that is quite clearly written and covers pretty much everything one needs to know. Nevertheless, it's been possible to learn how to use MC for years just by following one's nose, a sign of an intuitive interface, which itself is a very good sign in any piece of software.

    The flood of text that is now washing over us on the 'net has made us aware that we need to do more than just bail frantically; we need industrial-strength tools if we are to stay afloat. I strongly recommend that every linguist who works with data that is represented (or representable) in ASCII be equipped with MonoConc Pro, for research, development, and teaching, if possible with Departmental site licenses; this is simply too useful a tool to overlook.

    ---------

    John Lawler is Associate Professor of Linguistics at the University of Michigan, where he is principal advisor of its undergraduate Linguistics program, the largest in the US. He is interested in metaphor, computing, sound symbolism, and English grammar, among other subjects. He is co-editor of Lawler and Aristar-Dry 1998. (Full bio at http://www.umich.edu/~jlawler/bio.doc)

    ---------

    References:

    Antworth, E. and J. R. Valentine. 1998. 'Software for Doing Field Linguistics'. Ch 6 of Lawler & Aristar-Dry. Appendices: http://www.sil.org/computing/routledge/antworth-valentine/ , http://www.sil.org/computing/routledge/antworth-valentine/text.html

    Ball, Cathy. Tagging overview http://www.georgetown.edu/cball/ling361/tagging_overview.html

    Barlow, Michael. Corpus Linguistics Page http://www.ruf.rice.edu/~barlow/corpus.html

    -------------- Details of Monoconc 1.5 and Monoconc Pro 2.0 http://www.athel.com/rade.html

    Hockey, Susan. 1998. 'Textual Databases'. Ch 4 of Lawler & Aristar-Dry. Appendix: http://www.ualberta.ca/~shockey/UCLPP/textual.htm

    Lawler, J. 2000. 'Remainders', paper delivered at LANGUAGING 2000. Text: http://www.umich.edu/~jlawler/remainders.doc Handout: http://www.umich.edu/~jlawler/remaindershandout.doc Data: http://www.umich.edu/~jlawler/oedqa-remain.html (original) and http://www.umich.edu/~jlawler/oedqc-remain.html (massaged)

    ------- 1998. 'The Unix Language Family'. Ch 5 of Lawler & Aristar-Dry. Chapter text: http://www.umich.edu/~jlawler/routledge/unix.doc Appendix: http://www.umich.edu/~jlawler/routledge/unix.html

    Lawler, J, and H. Aristar-Dry (eds), 1998. _Using Computers in Linguistics: A Practical Guide_. Routledge. Home: http://www.routledge.com/routledge/linguistics/using-comp.html Intro: http://www.routledge.com/routledge/linguistics/introduction.html Index: http://www.umich.edu/~jlawler/routledge/unix.doc Glossary: http://www.umich.edu/~jlawler/routledge/glossary.html

    Ross, J.R, 1977. 'Remnants'. Studies in Language I:1.127-135.

    sed and awk (universal freeware text filter languages) http://www.umich.edu/~jlawler/routledge/sedawkperl.html

    Semware Jr (formerly Qedit; shareware DOS programmable editor) http://www.semware.com/

    Stevens, V. 1997. Review of Monoconc 1.2, in CALICO (Computer Assisted Language Instruction Consortium) newsletter. http://www.arts.monash.edu.au/others/calico/review/monoconc.htm

    TextPad (shareware Windows programmable editor) http://www.textpad.com/