LINGUIST List 13.1428

Tue May 21 2002

Review: Software: Concordance 3.0

Editor for this issue: Terence Langendoen <terrylinguistlist.org>


What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive, and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at siminlinguistlist.org or Terry Langendoen at terrylinguistlist.org.

Directory

  1. Pernilla Danielsson, Concordance 3.0 (software review)

Message 1: Concordance 3.0 (software review)

Date: Tue, 21 May 2002 13:15:29 +0100
From: Pernilla Danielsson <pernillaclg.bham.ac.uk>
Subject: Concordance 3.0 (software review)

Concordance version 3.0, software package available at
http://www.rjcw.freeserve.co.uk/.

Pernilla Danielsson: Centre for Corpus Linguistics, English Department,
University of Birmingham

INTRODUCTION AND GENERAL INFORMATION
It seems unavoidable that any new concordance software will lack that
one essential feature the researcher needs, whilst offering a number of
less useful features instead. That said, every review of concordance
software could begin with a list of unavailable, but necessary features
and continue to list other less useful, but present, features found in
the new version. The only problem is that these lists would vary
from researcher to researcher.

However, instead of a compilation of advantages and disadvantages, this
review will focus on the positive features offered by this concordance
software, such as the link from the word list to the concordances and
the new option of saving results as web pages. Some concerns are raised
about processing time, disk space and the way collocations are
presented.

The software reviewed here is Concordance 3.0, a Windows-based tool
available for MS Windows 95, 98, NT and XP. Concordance 3.0,
implemented by Rob Watts, was first released in January 1999 and has
since been fairly regularly updated. A test version of the tool can be
downloaded from the web site http://www.rjcw.freeserve.co.uk/, and a
single license (also available on-line) costs GBP 55 or USD 89 for the
first copy.

BACKGROUND
Naming a tool "Concordance" will unavoidably carry expectations based
on connotations around the words concordance and concordancer. It is not
a name to choose if you want to avoid confrontations with similar
software in the field. The word 'concordance' is defined as follows in
Cobuild's English Dictionary for Advanced Learners: "a concordance is a
list of the words in a text or group of texts, with information about
where in the text each word occurs and how often it occurs. The
sentences each word occurs in are often given" (Cobuild 2001). Hence, a
concordancer can be interpreted as software that produces such a
concordance. Concordancing is one of the oldest ways of browsing
through a (computerised) text or corpus. Early in text computing the
KWIC (KeyWord In Context) model was established and it is still the
standard way of presenting concordance information. Concordance 3.0 can
produce these traditional KWIC concordances but also includes the
option of displaying full sentences.
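
To make the KWIC format concrete, here is a minimal sketch in Python. It is not how Concordance 3.0 works internally; the function name kwic, the width parameter and the sample sentence are assumptions made purely for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return KWIC lines: left context, keyword, right context.

    Hypothetical helper for illustration only; `width` is the number
    of characters of context shown on each side of the keyword.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} {match.group(0)} {right}")
    return lines

sample = "The cat sat on the mat. The dog chased the cat up a tree."
for line in kwic(sample, "cat", width=15):
    print(line)
```

The sketch works on characters rather than words and assumes the text contains no line breaks; a full-sentence display would instead return the sentence around each match.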

When concordances first appeared on the linguistics scene they were
criticised for a number of things: their lack of lemmatisation, their
inability to distinguish between homographs and the limited
possibilities for choosing context. Developments in the field of NLP
have now made it possible to add some of this information to the
text, thus enabling concordancers to include lemmatisation and
part-of-speech disambiguation among their features. Whether linguistic
annotation adds information to the text or removes it is not a
consideration for this review (see John Sinclair's talk at the 6th TELRI
Seminar for further discussion of this matter), but it is notable
that, while lemmatisation is an option (you as the linguist may
manually group words together to form a lemma), part-of-speech tagging
is not considered in the featured software.

USING THE NEW CONCORDANCE 3.0
Perhaps it is my own expectation of a piece of software named
Concordance that makes this tool such a great mystery to me. The first
thing the software does is to index your text or corpus at word level
and produce word lists; this leaves you waiting. The word list turns
out to be vital for your searches, which makes this tool differ
substantially from most other available concordance tools. Instead of
prompting you to type in a search word and then returning concordance
lines, as most concordancers do, this tool uses the word list, displayed
in the left-hand window, as a direct link into the text. Once you get
used to the idea that clicking on a word in the word list performs your
search, you may indeed find yourself seeing this as a very attractive
feature. Perhaps not so much for what you get now, but for what it
could provide in a later version. What if every new search also gave
you an immediate word list for the smaller set? A linguist's
trained eye will probably pick up more regularities when confronted
with these lists than any statistical calculation can provide. As such,
this interface has many exciting possibilities for future development.
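
The idea of a word list that doubles as a search index, and of a fresh word list for each result set, can be sketched roughly as follows. This is a hypothetical illustration in Python, not a description of the tool's implementation; all function names and the span parameter are assumptions.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(tokens):
    """Map each word type to the token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def contexts(tokens, index, word, span=3):
    """Return the token window around every occurrence of `word`.

    Looking a word up in the index plays the role of clicking on it
    in the word list: no scan of the whole text is needed.
    """
    return [tokens[max(0, p - span):p + span + 1] for p in index.get(word, [])]

def sub_word_list(windows):
    """Frequency list restricted to the tokens inside the hit windows."""
    return Counter(tok for window in windows for tok in window)

tokens = tokenize("the cat sat on the mat and the cat slept")
index = build_index(tokens)
hits = contexts(tokens, index, "cat", span=1)
print(sub_word_list(hits).most_common())
```

Note that overlapping windows are counted twice in this sketch; a more careful version would collect the set of positions first and count each token once.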

However, while the word list, as a search link, may have attractive
possibilities, it also has its faults. Initially, when beginning a new
session, the index procedure leaves the user waiting. This review was
originally intended to include tests of three separate corpora: one
corpus consisting of five hundred thousand words, another of 2 million
words and a third of 20 million words. When I started with the small
500,000-word corpus, which in fact consists of only 4 novels, the system
sent out a warning that this is a very large file. Considering that it is
currently no challenge to store a 100 million word corpus on your PC, I
chose to proceed and decided to get a cup of coffee while waiting,
leaving the system running. The walk of about 40 meters and the time it
took to fill the cup were not enough: upon my return the system was
still indexing. To be more precise, the system
used 22.85 seconds to analyse the file, another 147.22 seconds to sort
the file and needed 170.10 seconds to finally load the file; all in all
this added up to about 340 seconds, more than five and a half minutes.
For those of us who
are used to instant access to the 430 million words of the Bank of
English, it is a stressful wait. However, it must also be acknowledged
that those of us more comfortable using very large corpora are probably
better off using UNIX based products, such as CWB (Christ 1994), CUE
(Mason 1996) or Lookup (Clear 1987). Still, if we compare Concordance
3.0 with some of its MS Windows-based competitors, such as WordSmith
(Scott 1999),
the latter only takes a few seconds to load the same input on the same
machine. Of course, having indexed the file once, you can save the
index and use it as a starting point the next time you want to search
the same data. Don't hold your breath though: loading a saved file will
also involve some waiting. You may choose the option "Display while
load", but this will make loading even more time consuming. Apart
from the time spent, the saved index file will take up space
on the hard disk. Despite earlier warnings given by the software, there
are no apparent upper limits for what the software can handle. In the
end, it will be the size of your hard disk and your patience that will
decide how big a corpus you can work with.

Once texts are imported into the concordancer, the next item on the
agenda is to try out the query options. I have previously mentioned
that if you are not familiar with the software it may, at first glance,
seem rather confusing. However, once you know the search routine you
might find yourself looking at this as one of Concordance 3.0's
strengths. The concordancer is fitted with a simple interface. Compared
to several of its competitors, it presents you with a rather clean
window split into the word list and the relevant concordance lines.
Although interface design largely comes down to personal preference,
this one is at least easier to use in a teaching environment, as there
are not too many distracting buttons that may lead students astray.

Researchers who do not work with English will be pleased to hear that
this tool does acknowledge other languages. You may choose an alphabet
from an extensive list (not only the Latin alphabet but also, for
example, Greek and Hebrew), and if the tool discovers characters in your
text that are not covered by the chosen alphabet it offers to add them.
After the additions, you may go in and sort the alphabet. I found this
very useful for the Swedish characters "å", "ä" and "ö". They are sorted
in that order at the end of our alphabet (although not in the same
order as in the other Scandinavian countries), so it is a relief that I
can control the sorting manually. Also, when tried on the Chinese part of
the Birmingham Centre for Corpus Linguistics' new Chinese-English
translation database (on a machine running Chinese Windows NT), it
performed well.
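
As an illustration of why a user-defined alphabet matters for sorting, the following Python sketch orders words by an explicit alphabet string. It is only a sketch under stated assumptions: the names SWEDISH_ALPHABET and make_sort_key are hypothetical, and nothing here reflects how Concordance 3.0 stores its alphabets.

```python
import unicodedata

# Assumed alphabet: the 26 basic Latin letters followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def make_sort_key(alphabet):
    """Build a sort key that ranks characters by their place in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def sort_key(word):
        # Normalise so that "a" + combining ring is treated as "å".
        normalized = unicodedata.normalize("NFC", word.lower())
        return [rank.get(ch, unknown) for ch in normalized]

    return sort_key

words = ["öl", "zebra", "äpple", "ål", "apa"]
print(sorted(words, key=make_sort_key(SWEDISH_ALPHABET)))
# ['apa', 'zebra', 'ål', 'äpple', 'öl']
```

Passing a different alphabet string, for example one ending in æ, ø, å for Danish or Norwegian, changes the order without touching the rest of the code.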

The software includes all the normal features, such as statistics of
the text you are working on, which includes information about the text
size, the number of tokens, the number of types and a type/token ratio.
Moving on to statistics between words, the collocation features
found under the menu Context give information about collocation
based on word positions around the keyword. This way of presenting
collocates seems to have gained popularity and is also found in the
WordSmith and Lookup software. It is not clear why the linguist is
forced to look at the collocations in boxes per position. Although this
might be good sometimes, it is very annoying in many cases. On a list
of wanted but unavailable features, other display options for
the collocation information rank high.
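
The statistics discussed in this section can be sketched in a few lines of Python. The example computes a type/token ratio, counts collocates separately for each position around a node word, and then merges the positions into a single list as one possible alternative display. All names and the span parameter are assumptions for illustration; the tool's own calculations may differ.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def type_token_stats(tokens):
    """Return the number of tokens, the number of types and their ratio."""
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ratio": ratio}

def positional_collocates(tokens, node, span=2):
    """Count collocates of `node` separately for each offset -span..+span."""
    by_position = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                by_position[offset][tokens[j]] += 1
    return by_position

def merged_collocates(by_position):
    """Collapse the per-position boxes into one overall frequency list."""
    total = Counter()
    for counts in by_position.values():
        total.update(counts)
    return total

tokens = tokenize("the cat sat on the mat and the cat slept")
print(type_token_stats(tokens))
boxes = positional_collocates(tokens, "cat", span=1)
print(dict(boxes))
print(merged_collocates(boxes).most_common())
```

The merged list is raw co-occurrence frequency only; a significance measure would need corpus-wide frequencies as well.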

Even with all these useful features one of the most interesting parts
of this tool is not found in its linguistics features, but in the way
that it explores web opportunities. In a simple save operation
(separate from the normal save as text or pdf), this tool offers you a
complete web version of your research: a four-part frame window,
including a headword list window, a window for the concordance, a
window for the original text and a window for different sections,
combined with a smaller window to go between the sections. Exploring
the exciting possibilities hypertext has to offer, this format enables
you to jump easily in and out of the concordance into a specific place
in the original text, even without a web server that will post it for
the whole world to see. The innovative use of html code for saving
workspaces can provide you with a useful tool to present to students.
Also, this has great possibilities for the future. Why not include a
small program that lets students manipulate these concordance lines
themselves (sorting, sub-querying, etc.)? Add a short window for the
student to write down her findings and a submit button, and we have a
very useful e-learning facility.
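
The principle behind such a hypertext export, concordance lines that link into the original text, can be sketched in Python as follows. This is a simplified assumption-laden illustration: it produces a single page with in-page anchors rather than the framed layout described above, and the function name export_html and the hit0, hit1, ... anchor names are hypothetical.

```python
import html
import re

def export_html(text, keyword, width=30):
    """Build one HTML page whose concordance lines link into the text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    conc_items, text_parts, last = [], [], 0
    for n, m in enumerate(pattern.finditer(text)):
        left = html.escape(text[max(0, m.start() - width):m.start()])
        right = html.escape(text[m.end():m.end() + width])
        word = html.escape(m.group(0))
        # Concordance line: the keyword is a link to its place in the text.
        conc_items.append(
            f'<li>{left}<a href="#hit{n}"><b>{word}</b></a>{right}</li>'
        )
        # Original text: the same occurrence carries the matching anchor.
        text_parts.append(html.escape(text[last:m.start()]))
        text_parts.append(f'<a id="hit{n}"><mark>{word}</mark></a>')
        last = m.end()
    text_parts.append(html.escape(text[last:]))
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Concordance</title></head><body>\n"
        "<h1>Concordance</h1>\n<ol>\n" + "\n".join(conc_items) + "\n</ol>\n"
        "<h1>Text</h1>\n<p>" + "".join(text_parts) + "</p>\n"
        "</body></html>\n"
    )

page = export_html("The cat sat. A cat & a dog.", "cat")
print(page)
```

Because the result is a plain file with relative links, it can be opened directly in a browser, which matches the point that no web server is required.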

CONCLUDING REMARKS
The new Concordance 3.0 is a useful tool for corpus linguists,
language teachers and lexicographers; the many users that the
concordancer has around the world already prove this. For those of you
who have not yet tried this tool, the publishing of results as web
pages, or the direct link between word lists and concordances, should be
enough motivation to at least download a test version from the internet
(http://www.rjcw.freeserve.co.uk/). Ultimately, if you need a tool that
can handle millions and millions of words of language data you might
find this tool a bit on the slow side. However, this software may still
be of use in your teaching.

REFERENCES
Christ, O. (1994) A Modular and Flexible Architecture for an Integral
Corpus Query System. In Proceedings of Complex'94. 3rd Conference on
Computational Lexicography and Text Research, Budapest, Hungary, July
7-10, 1994, pp. 23-32.

Clear, J. (1987) Computing. In Sinclair, J. M. (ed.), Looking Up: An
Account of the COBUILD Project. Glasgow: Collins ELT.
ISBN 0-00-370256-1.

Mason, O. (1996) Corpus Access Software: The CUE System. In Text
Technology: The Journal of Computer Text Processing. Vol 6 No. 4.
Winter 1996, pp. 257-266. ISSN 1053-900X. Wright State University-Lake
Campus.

Scott, M. (1999) Wordsmith Tools version 3. Oxford: Oxford University
Press. ISBN 0-19-459289-8.

ABOUT THE REVIEWER
Pernilla Danielsson holds a PhD (Gothenburg 2001) in Computational
Linguistics. She is a senior researcher and deputy director at the
Centre for Corpus Linguistics, University of Birmingham.