LINGUIST List 6.99

Tue 24 Jan 1995

FYI: American Dialect Society, Hello, Goodbye and Wordlists

Editor for this issue: <>


Directory

  1. Natalie Maynor, American Dialect Society
  2. Dave Fawthrop, Hello, Goodbye and Wordlists

Message 1: American Dialect Society

Date: Fri, 20 Jan 1995 05:15:44
From: Natalie Maynor <maynor@Ra.MsState.Edu>
Subject: American Dialect Society

Those of you who have been asking about the American Dialect Society can
find information on the Web at http://www.msstate.edu/Archives/ADS/.
 --Natalie (maynor@ra.msstate.edu)

Message 2: Hello, Goodbye and Wordlists

Date: Sat, 21 Jan 95 16:13:04 GMT
From: Dave Fawthrop <hyphen@ibmPCUG.CO.UK>
Subject: Hello, Goodbye and Wordlists


X-Organisation: Computer Hyphenation Ltd, Hyphen House, 8 Cooper Grove,
 Shelf, Halifax, HX3 7RF, United Kingdom.
X-Phone/Fax: +44 1274 691092

Listserver Linguist
Subject Hello, Goodbye and wordlists
Date 21/1/94

Just to say hello and goodbye! I was only here to see what was in your
listservers file store. You are too academic for me.

While I am here, someone may be interested in a FAQ on wordlists in
many languages, which I post to comp.software.international on a
regular basis.

Enjoy :-) or ignore :-( as the case may be.

)>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>(<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
:Newsgroup comp.software.international
:Subject FAQ on wordlists
:Date issue 3 27/10/94
:Organisation Computer Hyphenation Ltd
:Copyright (C) Public domain, use and copy as required.

Here is an offering for the FAQ of comp.software.international or
anywhere else which wants it, on wordlists, when someone gets round to
compiling one.
IMHO dictionaries should be another section of the FAQ.

If you think I have made any mistakes, or have anything to add, please
contact me by email.

I will post this monthly until the FAQ is organised properly.


WORDLISTS
*********
Lists of words in languages are very useful in many fields of
linguistic research, spelling error correctors and the like.

ftp sites
---------
Back in 1992 Jorge Stolfi then of DEC Systems Research Center, 130
Lytton Avenue, Palo Alto CA 94301 collected an excellent set of
wordlists in dutch, english, german, italian, norwegian, and swedish,
before moving on to other work. These are still available on
gatekeeper.dec.com/.8/misc/stolfi-wordlists
They are compressed using a special algorithm (C source code for
decompression is provided) and then compressed even more. These have
propagated around the internet and may be found on many ftp sites.

If you "archie" for a language you will find many wordlists labelled
say "German", and most will be Stolfi's. As these lists are quite
large it is important that you try to identify these duplicates,
without getting the files. Stolfi's readme file identifies them
uniquely if present, otherwise where they have been decompressed and
conventionally recompressed, the size of the file is a good indicator
of duplicates.
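
The size check described above can be sketched in a few lines of
Python. This is only an illustration: the listing is assumed to be
(path, size) pairs gathered by hand from archie or ftp directory
output, and the host names and sizes below are made up. Equal size is
an indicator of a duplicate, not proof.

```python
from collections import defaultdict

def group_by_size(listing):
    """Group (path, size_in_bytes) pairs by size.

    Returns only the sizes that occur more than once, since those are
    the candidates for being copies of the same wordlist.
    """
    by_size = defaultdict(list)
    for path, size in listing:
        by_size[size].append(path)
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}

# Hypothetical listing: these hosts, paths and sizes are invented.
listing = [
    ("site-a.example/pub/german.Z", 512000),
    ("site-b.example/words/deutsch.Z", 512000),
    ("site-c.example/lists/german-small.Z", 40960),
]
likely_duplicates = group_by_size(listing)
```

Only one file from each group then needs to be fetched.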

The best compilation of wordlists at the time of writing
HAS NOW BEEN MOVED TO ftp.ox.ac.uk\pub\wordlists\* (alias for
sable.ox.ac.uk), administered by Paul Leyland (pcl@sable.ox.ac.uk).
His list of directories is at the time of writing: afrikaans,
american, aussie, chinese, computer, croatian, danish, databases,
dictionaries, dutch, finnish, french, german, hindi, hungarian,
italian, japanese, latin, literature, movieTV, music, names, net,
norwegian, places, polish, random, religion, russian, science,
spanish, swahili, swedish, yiddish. This list may not be complete, as
things are often added. The major European languages are well covered
with long lists, but the remainder are the best available, some being
quite short.

If you:
 have other wordlists available,
 can find other wordlists on internet,
 are willing to collect wordlists in any of the minor languages,
 or can improve on any already there,
please contact Paul Leyland direct.

Compilation of wordlists
------------------------
If you cannot find what you want, via archie, WWW, or the instructions
above, you may well want to produce your own list, and then hopefully,
Please! put it in the public domain. If so please offer it to Paul
Leyland when finished. The following assumes that you are creating a
wordlist on a shoestring, or as a minor project without funding. A
properly funded wordlist is called a corpus, and is a different
proposition.

A good way to get information is to subscribe to soc.culture.???, or
any newsgroup which is appropriate, get the FAQ, and monitor the
newsgroup for at least a month or two.

Electronic Newspapers
.....................
The advent of electronic newspapers on the net provides an excellent
opportunity to collect wordlists with minimal effort. The quality of
writing is usually good, they contain short articles on many subjects,
by many people, and so are an excellent source of words. A
disadvantage is that many are in an electronic, or unaccented, version
of the language.

Listservers
...........
Many of the minor languages have BBS systems run from a University
computer. Some are for learners, and the spelling is awful. Some are
just chat, and the spelling is generally bad.

Newsgroups
..........
The newsgroups soc.culture.* etc. are also a source of wordlists. The
writing and spelling are often of an appalling standard, and the text
is mostly unaccented. Subjects are varied, but politics and religion
produce a very high proportion of postings, which gives highly skewed
lists. For instance, in alt.culture.indonesia "allah" is among the
twenty most frequently used words. I do not believe that this is true
for normal Indonesian text, or newspapers.

There are about 6000 newsgroups on the net, but your server probably
only holds the 1500 common ones. Many others are available which
typically are used by a single university. Ask your sysop for a list;
you may find something interesting.

Paper Newspapers
................
All newspapers are now produced on computers, so if you can get access
to a newspaper's files these will be fine. There will be short pieces
on lots of subjects by lots of journalists. The newspapers for the
masses write to a reading age of about 12, so have a restricted range
of words. A "quality" newspaper will be better, with a higher reading
age and more subjects.

If possible take a few articles from each day's issue. A single issue
may have a front page story, a leader article, and several pieces on
inside pages, all on the same subject, and therefore using the same
words.

Beware of the files produced by newspapers internally. A story goes
through many stages in the production cycle, so you can end up with
fifteen versions of the same thing!

Press offices
.............
Government Press offices etc. are now putting news on the net, and can
be used. They however tend to concentrate on Political news, and thus
have a highly skewed, and small, selection of words, although the
quality of text is good.

Dictionaries
............
It may seem obvious, but dictionaries can be converted into wordlists.
Bilingual language1 to language2 dictionaries can, with a bit of very!!
very!! nifty editing, be made into wordlists for either language1
or language2. There is always a problem with dictionaries that they
contain words which are interesting to linguists, and not words which
are used in real text, or which people really speak. Compound words
and derivative words are under-represented. They are however fully
accented and the words which they contain are correctly spelled.

Things to edit out of text
..........................
When collecting text for wordlists, especially from internet, and
especially when counting the number of occurrences, there are some
things which are better edited out.
1: Headings
 These are not representative of normal text.
2: Repeated text
 On the net people often post things daily, weekly or as necessary.
 I have received the daily exchange rates for Slovenian currency,
 supplied courtesy of the Governor of the Bank of Slovenia.
3: Salutations
 Gracias, hola, Hallo, hi etc. etc. should not be overrepresented
 in word lists. Newsgroups always have frequent contributors,
 and mailing lists have owners. One does not want "tony" or
 "briony" overrepresented.
4: Signature blocks.
 These are idiosyncratic, and contain things representing the
 personality of the posters. The language is regularly not the
 same as the text above it.
5: Quoted text.
 It is convention on internet to quote parts of a previous
 communication using ) or : at the start of a line.
 This is best omitted as it distorts the occurrence counts.
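
The five edits above can be roughed out in Python. This is a minimal
sketch, not a finished tool, and several details are assumptions rather
than anything stated above: headers are taken to be leading
"Name: value" lines, a signature block is taken to start at a line
containing only "--", ">" is added to the quote markers, and the
salutation list is supplied by the caller. A line of body text that
happens to begin with ")" or ":" (a smiley, say) is dropped as well.

```python
import re

QUOTE_PREFIXES = (">", ")", ":")           # ">" is an added assumption
SIG_DELIMITER = "--"                       # assumed signature separator
HEADER_RE = re.compile(r"^[A-Za-z-]+:\s")  # e.g. "Subject: ...", "From: ..."

def clean_posting(text, salutations=frozenset()):
    """Return the body of one posting with headings, quoted text,
    salutation lines and the signature block removed."""
    kept = []
    in_headers = True
    for line in text.splitlines():
        stripped = line.strip()
        if in_headers:
            # Skip header lines and the blank lines around them.
            if stripped == "" or HEADER_RE.match(line):
                continue
            in_headers = False
        if line.rstrip() == SIG_DELIMITER:
            break                          # everything below is signature
        if stripped.startswith(QUOTE_PREFIXES):
            continue                       # quoted from an earlier posting
        if stripped.strip(",.!").lower() in salutations:
            continue                       # a line that is only "Hi," etc.
        kept.append(line)
    return "\n".join(kept)

def drop_repeats(bodies):
    """Keep only the first copy of each identical cleaned posting."""
    seen = set()
    unique = []
    for body in bodies:
        if body not in seen:
            seen.add(body)
            unique.append(body)
    return unique

posting = (
    "From: tony\n"
    "Subject: rates\n"
    "\n"
    "Hi,\n"
    ") old quoted line\n"
    "new words here\n"
    ": more quoted\n"
    "more text\n"
    "--\n"
    "tony's sig\n"
)
body = clean_posting(posting, salutations={"hi", "hola", "hallo", "gracias"})
```

drop_repeats only catches postings that are repeated exactly; a daily
posting whose figures change each day would still need editing by hand.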

Accented or unaccented?
.......................
If at all possible collect words with correctly accented and modified
roman characters, e.g. the slashed 'l' in Polish, e acute in French.
The easiest words to find however are from internet, where it is
convention to use a computer version of the language which uses the
unaccented versions of characters.
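
To compare an accented list with unaccented text from the net, the
accents can be folded away on one side. The sketch below is in Python
and assumes the words are already held as Unicode strings, which is not
something the lists discussed here guarantee. Note that the slashed 'l'
has no Unicode decomposition, so it needs an explicit entry in a small
table; the table shown is illustrative and far from complete.

```python
import unicodedata

# Letters with no canonical decomposition; an illustrative, partial table.
NO_DECOMPOSITION = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"}

def fold_accents(word):
    """Return the word with accents removed, as commonly typed on the net."""
    out = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            continue                  # drop the combining accent mark
        out.append(NO_DECOMPOSITION.get(ch, ch))
    return "".join(out)

def group_by_folded(words):
    """Map each unaccented form to the accented words that produce it."""
    groups = {}
    for word in words:
        groups.setdefault(fold_accents(word), []).append(word)
    return groups

groups = group_by_folded(["cote", "côte", "coté", "côté"])
```

Here all four French words fold to "cote", which shows why an
unaccented list cannot be turned back into an accented one
automatically.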

There are more than 500 7 and 8 bit character sets available, so if
you can, say in the readme file which code in the wordlist indicates
which character!

Beware of the accented status of upper case characters, in some
languages, by convention, lower case versions of a character are
accented, whereas upper case are often unaccented.

Non roman languages
...................
The vast majority of Russian, Bulgarian, Ukrainian, and Greek on
internet is transliterated into roman characters. Only a little is
in the correct characters. Several vowels in these languages
transliterate into the same roman character, so you cannot simply get
back to the correct character set from transliterated text. Greek
transliteration is a mess. Wordlists for non roman languages should
be in the original character set; those in transliterated versions are
not of much use.

How many words?
...............
For the simple non compounding languages, English, French, Italian
etc., about 50,000 to 100,000 words will cover almost everything used
in normal text. For the compounding and agglutinative languages, where
it is common to add words to words to words, such as German or Finnish,
several hundred thousand are required, and worse, the actual compound
words used keep changing. These are targets to aim for; wordlists of
any length are useful.

Copyright
.........
IMHO as a non lawyer, the copyright of a piece of ordinary writing
rests in the artistic effort in writing it. This is totally lost when
you make a wordlist, so provided that you have the right to read it,
the copyright, of any wordlist produced, is yours and you can put it
in the public domain. Dictionaries may be different in that the
artistic effort is arguably in the selection of the words themselves.

Readme file
...........
Each wordlist should have a readme file giving relevant information.
Please state who you are and give your email address for queries, but
to reduce these the following should be included:

your name and email address.
the source of your list.
any gross bias in the words found, political, religious, etc.
The character set being used, with the representation of each of the
accented characters defined. Not everyone uses TeX or your favourite
code page! It is a real pain decoding the character set with a paper
dictionary! There are more than 500, 7 and 8 bit character sets: ISO,
PC, Mac, Gem, Word Perfect, ad infinitum. Not to mention 16, 32, 64
bit character sets, Unicode, ISO and so on.
Say whether misspellings have been removed.
Say whether loan words have been removed. They appear in all
languages!!
Say whether proper nouns are capitalised, capitalisation has been left
as found, or all words have been lowercased. IMHO proper nouns are
best capitalised if that is the convention in that language.

Number of occurrences
....................
If one looks at the seminal work of the Brown corpus, available from
the Oxford Text Archive, one will find that the number of occurrences
of each word, in the text examined, is carefully tabulated. This is
very useful in many fields, and its addition to any wordlist is a
distinct advantage. If used for a spelling corrector, it is possible
to construct several dictionaries, one of common words held in main
memory, another of less common words held on disk, and so on.
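
As a rough illustration of the tiered dictionaries just described, the
Python sketch below splits a counted wordlist into a small in-memory
set and a remainder destined for disk. The function names, the cut-off
and the toy counts are all hypothetical, and the disk lookup is left as
a caller-supplied function.

```python
def split_by_frequency(counts, n_common):
    """Split {word: occurrences} into (common_set, rare_list).

    The n_common most frequent words go in the set held in memory; the
    rest are returned in descending order of frequency for writing to disk.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    common = {word for word, _ in ranked[:n_common]}
    rare = [word for word, _ in ranked[n_common:]]
    return common, rare

def is_known(word, common, rare_lookup):
    """Check the in-memory tier first, then fall back to the slower tier."""
    return word in common or rare_lookup(word)

# Toy counts, invented for the example.
counts = {"the": 50, "of": 30, "wordlist": 2, "archie": 1}
common, rare = split_by_frequency(counts, n_common=2)
```

With real counts, the hope is that most lookups are answered by the
first tier and the slower one is consulted only occasionally.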

Two small public domain ANSI C source code programs to facilitate
their collection are currently available from Dave Fawthrop
(hyphen@ibmpcug.co.uk), but hopefully I will find an ftp site to put
them on.

These are "one_word", which takes ASCII text, splits it at white
space, strips off punctuation at both ends of the word, changes the
word to lower case if required, and prints out the words one per line.
Both the definition of lower case and the definition of punctuation
are simply adaptable to the character set in use. The output from
this is then sorted with any gutsy sort. "uniq_num" takes this sorted
output, and prints out each word followed by the number of times it
occurred. The above sequence with edits as required can be repeated
ad infinitum, with the occurrences collected properly.
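
For readers without the C programs, the same sequence can be
approximated in Python. This is a sketch written from the description
above, not a port of the original source: the function names are
borrowed from the two programs, the punctuation set is Python's ASCII
default, and merge_counts is a hypothetical helper for adding together
the counts from repeated runs.

```python
import string
from collections import Counter
from itertools import groupby

def one_word(text, lowercase=True, punctuation=string.punctuation):
    """Split text at white space and yield one cleaned word at a time."""
    for token in text.split():
        word = token.strip(punctuation)   # both ends only, so "don't" survives
        if lowercase:
            word = word.lower()
        if word:                          # skip tokens that were all punctuation
            yield word

def uniq_num(sorted_words):
    """Yield (word, occurrences) pairs from words that are already sorted."""
    for word, run in groupby(sorted_words):
        yield word, sum(1 for _ in run)

def merge_counts(*batches):
    """Add together several lists of (word, occurrences) pairs."""
    totals = Counter()
    for batch in batches:
        for word, n in batch:
            totals[word] += n
    return sorted(totals.items())

text = 'The cat, the "dog" and the cat.'
counts = list(uniq_num(sorted(one_word(text))))
```

For another character set, pass a different punctuation string to
one_word; lower-casing here relies on Python's own Unicode rules and
may need replacing for some languages.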

Dave Fawthrop, (hyphen@ibmpcug.co.uk).
-- God loved the World so much that he gave his only Son, so that --
-- anyone who believes in (trusts, clings to, relies on) him shall --
-- not perish, but have eternal life. STARTING NOW! *IT'S GREAT* --