LINGUIST List 4.894

Fri 29 Oct 1993

FYI: Resources: WLIST, CSLI Bibliography, Corpus Research

Editor for this issue: <>


Directory

  1. Jari Perkiomaki, A new version of WLIST programme available
  2. Jane A. Edwards, linguistics/psycholinguistics bibliography
  3. Jane A. Edwards, list of corpus-related resources

Message 1: A new version of WLIST programme available

Date: Fri, 22 Oct 1993 11:11:31
From: Jari Perkiomaki <jpe@bacall.uwasa.fi>
Subject: A new version of WLIST programme available

WLIST.EXE
 - a language-independent word frequency and word length counter

(C) Copyright 1990-1993 Ari Hovila.
 Based on the ideas of Ari Hovila (ajh@uwasa.fi)
 and Jari Perkiomaki (jpe@uwasa.fi) (University of Vaasa, Finland).

WLIST is a statistical tool for any language user. The program
recognizes all words in an ASCII file and counts their occurrences,
that is, their frequencies. It also counts the lengths of the words,
enabling the user to estimate, e.g., how readable a text is. A simple
rule of thumb for Finnish texts, for instance, is that the more long
words a text contains, the harder it can be to read and understand.
Furthermore, WLIST counts the lengths of all unique (i.e. different)
words as well as the average lengths of all words and of unique
words.
 At its simplest, the program can make an alphabetically ordered
list of words in a text, without any statistics on their lengths or
frequencies.
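For readers who want to experiment with the same kind of counts, the statistics described above can be sketched in a few lines of Python (a hypothetical reimplementation for illustration, not WLIST's own code):

```python
# Sketch of the analysis WLIST performs: word frequencies, and average
# word lengths over all words and over unique (different) words.
from collections import Counter

def word_stats(text):
    # A "word" here is a string of characters surrounded by white space.
    words = text.split()
    freq = Counter(words)
    avg_all = sum(len(w) for w in words) / len(words)
    avg_unique = sum(len(w) for w in freq) / len(freq)
    return freq, avg_all, avg_unique

freq, avg_all, avg_unique = word_stats("the cat sat on the mat")
print(freq["the"])  # 2
print(avg_all, avg_unique)
```

The unique-word average differs from the all-word average because repeated words are counted only once.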

What makes WLIST unique compared with other similar tools available is
its language-independence. When it comes to making an alphabetical
list of words (in other words, sorting them), we have to keep in mind
that, in each language, sorting is determined by the alphabet of that
language, and that the sorting order usually does not follow the
English one.
 We (the authors) are fully aware of the fact that there are a
number of programs which sort words either according to the English
alphabet or according to the order of the ASCII code. Both are a bad
choice for serious linguistic analysis.

So in most languages there is a unique way of sorting the letters. Yet
we are faced with another problem: how to sort special characters,
e.g. logograms like &%$!, punctuation marks, or numbers. To put it
simply, what is their internal order? We also have to determine how
special characters, numbers and letters are inter-related, that is,
which of them comes first, which second, and so on. The sample files
included in this package are partly based on the article "Alphabetical
Ordering in a Lexicological Perspective" by Rolf Gavare in the book
"Studies in Computer-Aided Lexicology" (pp. 63-102). This program is
not, however, an application of the proposal made in that article.

IMPORTANT!

As WLIST can utilize user-defined sorting information, it also places
greater responsibility on the user, who has a free hand to determine
which characters WLIST is supposed to recognize and in which order
those characters are sorted. User-defined sorting is activated with
option -f (see later), and we urge you to study closely how the
user's own sorting file is built up.
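To make the idea of a user-defined sort order concrete, here is a minimal Python sketch; the one-line "sorting file" (just every recognized character listed in its sort order) is our own invention for illustration, not WLIST's actual -f file format:

```python
# Language-specific sorting with a user-defined character order.
# In the Finnish alphabet, å, ä and ö sort after z, unlike in
# English or plain ASCII ordering.
ORDER = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(ORDER)}

def sort_key(word):
    # Characters absent from ORDER sort after all listed ones.
    return [RANK.get(ch, len(ORDER)) for ch in word]

words = ["äiti", "auto", "öljy", "zorro"]
print(sorted(words, key=sort_key))  # ['auto', 'zorro', 'äiti', 'öljy']
```

A plain ASCII or English sort would instead place "äiti" and "öljy" incorrectly relative to "zorro", which is exactly the problem described above.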

At this point, we would like to thank all those people who used the
earlier version of WLIST and made us add this important feature to
this new release.

WLIST can analyze any language whose letters, numbers and special
characters can be found in the (extended) ASCII code. The best results
are achieved with texts that are pure ASCII. Most word processors
support this text format, and if you are using a text editor, the text
is likely to be in ASCII format by default.

WLIST is capable of handling words which are up to 50 characters long.
First of all, by the term "word" we refer to a string of characters
surrounded by one or more white spaces. In most cases, 50 characters
are more than enough, but there are languages where words can be even
that long. Finnish is a good example: in some special languages (LSP,
language for specific purposes), like the language of technology, you
can easily find long compound words which may, in the worst case, even
exceed the 50-character limit. In Finnish, compound words are treated
differently from those in English: they are written as one word,
whereas in English they usually occur as two, three or more separate
words. Be aware that, by this definition of the term "word", it is
quite impossible for WLIST to recognize English compound expressions
written as separate words, e.g. "user defined" or "inverted dipole
antenna".
 If a word exceeds the limit of 50 characters, WLIST simply
makes a new "word" out of the remaining characters. If the remaining
characters also exceed the limit, once again a new "word" is made from
what remains, and so on. This ensures that no characters are ever
omitted -- although it may be hard to spot where the "computer-made"
words are in the output. Such cases are, however, rare, and the user
is prompted whenever such words are generated.
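The splitting behaviour described above can be sketched as follows (the 50-character limit is shortened here for readability; this is an illustration, not WLIST's source):

```python
# Cut an over-long token into successive fixed-length "words" so that
# no characters are ever dropped, as WLIST is described to do.
def split_long(word, limit=50):
    return [word[i:i + limit] for i in range(0, len(word), limit)]

print(split_long("abcdefgh", limit=3))  # ['abc', 'def', 'gh']
print(split_long("short"))             # ['short']
```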

WLIST should run on any MS-DOS or compatible computer. The program has
been tested on an ordinary PC, an AT 286 and a 386 machine with
varying amounts of available memory and with different versions of the
operating system. And what is important: WLIST is a reasonably fast
analyzer of ordinary running text. Due to the programming technique
used, however, any text that is already sorted or almost sorted will
slow the performance down considerably.

WLIST is available via anonymous ftp from garbo.uwasa.fi, directory
 /pub/pc/linguistics, file wlist11.zip.

Yours,

 --Jari

 --
Jari Perkiomaki \ "Man himself is the last of the unexplored interfaces.
University of Vaasa \__________ Man is the interface between spirit and matter.
Dept of Communication Studies \ On which side of the boundary are you?"
jpe@bacall.uwasa.fi \___________________________________________

Message 2: linguistics/psycholinguistics bibliography

Date: Tue, 26 Oct 93 17:09:52
From: Jane A. Edwards <edwards@cogsci.Berkeley.EDU>
Subject: linguistics/psycholinguistics bibliography

A couple of weeks ago someone inquired about linguistics bibliography,
and mentioned the CSLI bibliography. It is an extensive common domain
("copyleft") bibliography which is updated as new contributions (in
bib/tib/refer format) are received. It contains and welcomes
references across a wide range of areas relating to language research,
including psycholinguistics. I am appending the information for
accessing it, and contributing to it. For more information, contact
kornai@csli.stanford.edu.
Best Wishes,
Jane Edwards (edwards@cogsci.berkeley.edu)

 -------- CSLI LINGUISTICS BIBLIOGRAPHY: ----------------

You can get a linguistics bibliography by anonymous ftp
from csli.stanford.edu:pub/bibliography. The README file
follows:

lingbib.csli is a linguistics bibliography database in bib/tib/refer
format, presently containing some 3,300 entries, heavily slanted
towards phonetics/phonology but with a fair amount of morphology,
syntax, and semantics thrown in, especially if your interests are
computational. We recommend you use it with James Alexander's tib
bibliography system, which is Copyright (C) James Alexander
(jca.lakisis.umd.edu) but available for the public by anonymous ftp
from minos.inria.fr (128.93.39.5) among other places. A typical entry
looks like this:

%A George A. Miller
%A Noam Chomsky
%D 1963
%T Finitary models of language users
%E R. Duncan Luce
%E Robert R. Bush
%E Eugene Galanter
%B Handbook of mathematical psychology
%I Wiley
%C New York
%P 419-491

Tib can generate TeX/LaTeX formatted code that conforms to the
citation style requirements of various journals. The style file
for Language, called ling.tib (and the related ling.ttx file), created
by Jeff Goldberg (goldberg@csli.stanford.edu), is enclosed with this
distribution. Tib can also interactively look up entries in the
database -- see its man page for details.

If you wish to make additions to lingbib.csli, please send your
contribution (which will become CSLI copyleft) to kornai@csli.stanford.edu.
Make sure that

 -- you don't send full articles, just the references
 -- the entry is a new entry, not a correction to an
 existing one. (Corrections are also welcome, just send
 them separately)
 -- the entry is maximally informative (e.g. put in full first names
 if you know them)
 -- the file is in the correct tib format (order of fields does not
 matter)

If you use the slow and clumsy BibTeX system, you might wish to
convert to tib -- use the bibtex2ref script by Bernd Fritzke
(fritzke@immd2.informatik.uni-erlangen.de) to convert your bibtex
bibliography files. Please do NOT send anything in bibtex format.

To take full advantage of the lingbib.csli database (e.g. to run
interactive searches) you probably want to install the tib package
even if you don't use TeX/LaTeX. However, the database is in no way
tied to tib, and you are welcome to use it as a plain text file, to
search it by grep or other utilities, or to put it under your own DBMS
system as long as you abide by the terms and conditions of the
CSLI General Public License which requires that you maintain the
License and Copying files together with the lingbib.csli file.
(For other terms and conditions and for a NO WARRANTY statement
see the License file.)
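As an illustration of using the database as a plain text file without tib, here is a small Python sketch that splits refer-format entries (assumed, as is conventional for refer databases, to be separated by blank lines) and filters them by author; the helper names are our own:

```python
# Treat a refer-format bibliography as plain text: entries are blocks
# separated by blank lines, each field a line starting with a %-code
# (%A author, %D date, %T title, ...). Filter entries by author name.
def entries(text):
    return [e for e in text.split("\n\n") if e.strip()]

def by_author(text, name):
    return [e for e in entries(text)
            if any(line.startswith("%A") and name in line
                   for line in e.splitlines())]

sample = ("%A Noam Chomsky\n%D 1957\n%T Syntactic Structures\n\n"
          "%A George A. Miller\n%D 1956\n%T The magical number seven")
hits = by_author(sample, "Chomsky")
print(len(hits))  # 1
```

A plain `grep -B1 -A2 Chomsky lingbib.csli` gives a rougher version of the same lookup, which is the kind of use the paragraph above has in mind.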

Message 3: list of corpus-related resources

Date: Wed, 27 Oct 93 19:56:29
From: Jane A. Edwards <edwards@cogsci.Berkeley.EDU>
Subject: list of corpus-related resources

I have prepared a compilation of corpus-related resources which may be of
interest to the readers of this list, and wanted to mention it explicitly as
it is too recent to be widely-known.
 Edwards (1993) Survey of electronic corpora and related resources for
 language researchers (pp. 263-310).
 In Edwards & Lampert (eds) Talking data: transcription and coding in
 discourse research. Hillsdale, NJ: Erlbaum.
My goal was to bridge some gaps between corpus-using researchers in
linguistics, lexicography, computational linguistics and the
humanities. In service of this goal, the survey describes the most
frequently used corpora in corpus linguistics (since these are often
referred to only by name or acronym), with access information and
contact addresses; describes and gives access information for the
corpus surveys by Lancaster, Georgetown, Oxford, and Rutgers, which
include these corpora plus several hundred less often used corpora,
including literary sources and languages other than English; gives
information on organizations and institutes involved in distributing
corpora and disseminating corpus-related research results; describes
electronic discussion groups pertinent to corpus linguistics (e.g.,
ln, prosody, corpora); and lists a couple of bibliographies of corpus
research from humanities, lexicographic and linguistics perspectives.
I'd be very interested in comments and reactions.

 -Jane Edwards (edwards@cogsci.berkeley.edu)