LINGUIST List 7.583

Fri Apr 19 1996

Sum: Am. English word frequency lists

Editor for this issue: T. Daniel Seely <dseely@emunix.emich.edu>


Directory

  1. "L. HILLMAN", Sum: Am. English word frequency lists

Message 1: Sum: Am. English word frequency lists

Date: Fri, 19 Apr 1996 08:19:19 EDT
From: "L. HILLMAN" <LBHNDP@ritvax.isc.rit.edu>
Subject: Sum: Am. English word frequency lists
In a recent request to LINGUIST, I asked for word frequency
lists for American English. I am grateful to all of you
for your help. Lou Hillman <lbhndp@rit.edu>


In addition to the quoted responses, the following people
also suggested:

 Frequency Analysis of English Usage, Lexicon and Grammar
 by W. Nelson Francis and Henry Kucera

in its various guises.

 MARC PICARD <PICARD@vax2.concordia.ca>
 Guillaume Gantard <ggantard@logos-usa.com>
 Judith Parker <jparker@s850.mwc.edu>

Here are excerpts from the other responses. 
- ---------------------------------------------------------------
From: patrick.juola@psy.ox.ac.uk (Patrick Juola)

There are several professionally compiled lists of several million
words, sorted by frequency in various corpora -- I know that the
Brown corpus (Kucera & Francis, a zillion years ago) is available
on-line from UPenn if you know whom to ask.

*BUT* having answered your question, please please please please
please let me warn you away from trusting any of the answers you
receive -- speaking as a professional corpus linguist, I can tell you
that you're going to have some *serious* sampling effects in any
corpus of that size. A rough
check on the Brown histogram reveals that the 20,000th word is
"bombproof", with a frequency of three per million text tokens. The
inclusion or exclusion of a single page of text in the Brown corpus
would be enough to add or remove a word from the list. (As a rough
test, I just opened a copy of a book and confirmed that both the
words "cliques" [rank 41505 in the Brown corpus] and "subgraphs"
[which did not appear] occurred three times on that page.)

The implications are fairly obvious -- the lists that you get are
very sensitive to the corpora from which they are drawn, and
particularly to the style, language, and content of the corpora --
so a list compiled from six million words of newspaper articles is
likely to be significantly and substantially different from a list
compiled from six million words of USENET postings, which in turn
will be completely different from six million words of magazines,
&c.
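The sampling effect Juola describes is easy to reproduce at small
scale. The sketch below (file names and toy data are mine, not from
the posting) builds a "words occurring at least three times" list
before and after adding one extra "page" of text, and shows which
words enter the list:

```shell
#!/bin/sh
# Toy demonstration: a single extra page of text can push a word
# over a frequency threshold and into the list.
printf 'cat cat cat dog dog bird\n' > corpus.txt
printf 'dog fish fish\n'            > page.txt

# lower-case, one word per line, keep words occurring >= 3 times
tally() { tr -cs '[:alpha:]' '\n' | tr 'A-Z' 'a-z' | sort \
          | uniq -c | awk '$1 >= 3 {print $2}' | sort; }

tally < corpus.txt              > before.txt
cat corpus.txt page.txt | tally > after.txt
comm -13 before.txt after.txt   # words that entered the list: dog
```

With the toy data, "dog" occurs twice in the corpus and once on the
extra page, so it crosses the threshold only when the page is added.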

- ---------------------------------------------------------------
From: john.beaven@sharp.co.uk (John Beaven)

Not exactly an answer to your question, but as a last resort you
could always "roll your own" by running this Unix script on your
favourite multi-million word corpus...

#! /bin/sh -
# finds the word frequencies in a text and sorts them in decreasing
# order (ie most frequent word at top)
awk '{print " " $0}' "$1" | deroff -w | sort | uniq -c | sort -nr
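If deroff(1) is not available on your system, a rough portable
variant of the same idea (my own sketch, using only tr, sort, and
uniq, demonstrated here on a made-up one-line sample) does much the
same job:

```shell
#!/bin/sh
# Portable word-frequency pipeline: split the text into one word per
# line with tr instead of deroff -w, then count and sort by
# decreasing frequency.
printf 'the cat sat on the mat\n' > sample.txt
tr -cs '[:alnum:]' '\n' < sample.txt | sort | uniq -c | sort -nr
# top line of output: the most frequent word ("the", count 2)
```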

- ---------------------------------------------------------------
From: Evan.Antworth@SIL.ORG (Evan L. Antworth)

Go to this address:

 gopher://gopher.sil.org/11/gopher_root/linguistics/info/

and look at the items titled "English word frequencies...". (I didn't
create these lists; I just got them from an FTP site at Vassar.)

[see below for FTP address. LBH]

- ---------------------------------------------------------------
From: cball@guvax.acc.georgetown.edu (Catherine N. Ball)

I think you can find many frequency lists for American English in
the library -- for example, Francis and Kucera published one based
on their (now famous) corpus of American English known as the 'Brown
Corpus' (which is available from the Oxford Text Archive). You can
also make your own frequency list using simple software. I recently
made a 'Web Frequency Indexer' which allows you to paste in your
text and get a frequency list -- I will be modifying it soon to
allow the user to simply give the name of a file on their own
computer. Anyhow, you might find it useful. The URL is
 http://www.georgetown.edu/cball/webtools/web_freqs.html

- ---------------------------------------------------------------------
From: meador@U.Arizona.EDU (Diane L Meador)

I have available, through my web page at the URL below, an American
English lexical database, "Phondic". It's packaged with "Sample", a
program written by Emmanuel Dupoux (CNRS, Paris), which searches the
database by several criteria, such as stress and syllable patterns,
phonemic or orthographic strings, etc. One of the options is
frequency. While I have never tried to sort by frequency, I don't
imagine that it would be difficult to do so.

I hope that this meets your needs. If you do decide to use it, I
ask on behalf of Emmanuel Dupoux that he be given acknowledgment
credit. The program has DOS and Unix versions. Follow the
"Available Papers" link on my page; it's listed under "Miscellany".

 http://aruba.ccit.arizona.edu/~meador

- ---------------------------------------------------------------------
From: ms2928@liverpool.ac.uk (Mike Scott)

Do you have a particular corpus in mind? The kind of 40,000-word
list you get will be pretty dependent on the corpus you use.

For example, I have done a word list on the UK newspaper the
Guardian, and without lemmatising, 4 million tokens will give rise
to about 85,000 word types. 10 million might give about 120,000 and
100 million gives about 250,000.
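The type/token distinction behind Scott's figures can be checked on
any text with standard tools. A minimal sketch (the sample sentence
is made up; no lemmatising, so inflected forms count as separate
types):

```shell
#!/bin/sh
# Tokens = running words; types = distinct word forms (without
# lemmatising, so e.g. "gives" and "give" would be separate types).
printf 'to be or not to be\n' > text.txt
tr -cs '[:alnum:]' '\n' < text.txt | tr 'A-Z' 'a-z' > tokenlist.txt
echo "tokens: $(grep -c . tokenlist.txt)"           # 6 running words
echo "types:  $(sort -u tokenlist.txt | grep -c .)" # 4 distinct forms
```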

I have produced a word lister (etc.) available via
 http://www.liv.ac.uk/~ms2928/homepage.html
 http://www1.oup.co.uk/oup/elt/software/wsmith?

(Published by Oxford Univ. Press.) The software costs UK sterling 49
(about US$75) and does a lot more than just word listing. If you
visit the OUP site you'll see sample screens that show the idea.

Alternatively there are existing word lists in paper format:
McGraw Hill have one, there's Francis & Kucera, and presumably
the Brown corpus of the 60s will be in machine-readable format
too.

- ---------------------------------------------------------------------
[Vera Kempe posted a similar request several months ago and sent
the following message, which she forwarded to me. Some information is
repeated from above; I have edited briefly. LBH]

From: VKEMPE@UOFT02.UTOLEDO.EDU

For all those who have asked me to share the responses on my query
about computerized word frequency lists - here is what I got so far.

Good luck!

- Vera Kempe
Department of Psychology
University of Toledo
vkempe@uoft02.utoledo.edu



From: IN%"PICARD@VAX2.CONCORDIA.CA" "MARC PICARD"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj: Frequency count

I don't have Francis & Kucera but I do have LOB and KWC. Let me know if
you're interested and I'll send them along.

Marc Picard
____________________________________________________

From: IN%"C.J.Gledhill@aston.ac.uk"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj:

Write to Birmingham University's Cobuild, a corpus-based
lexicographic project: direct@cobuild.collins.co.uk



Chris J Gledhill
Lecturer in French
Languages and European Studies
Aston University
BIRMINGHAM B4 7ET
c.j.gledhill@aston.ac.uk
____________________________________________________

From: IN%"griffith@kula.usp.ac.fj" "Patrick Griffiths"

Dear Dr Kempe

In the Journal of Child Language, 1994(2), 513-6, there is a review
by George Dunbar of Philip Quinlan's OXFORD PSYCHOLINGUISTIC
DATABASE (Oxford University Press, 1992). This is a package of
computer software for Macintosh. The reviewer says (p. 513): "The database
contains entries for over 98,000 words, with information on up to 26
properties of each. This includes information on physical
properties, such as the length of the word, other objective
properties, such as its frequency of occurrence in the
Kucera-Francis list, and subjective or 'psychological' properties,
such as imageability ratings."

A single user licence was priced at 205 British pounds, which
corresponds to somewhere between 300 and 400 US dollars, I think.

Best wishes

Patrick
____________________________________________________

From: IN%"edwards@cogsci.Berkeley.EDU"
To: IN%"vkempe@uoft02.utoledo.edu"
CC: IN%"edwards@cogsci.Berkeley.EDU"
Subj: word frequencies online
The site below has a couple, with documentation available there.
Hope this helps,
-Jane Edwards
- -------------------------------------------------------------------

From: veronis@vassar.edu (Jean Veronis)

As I mentioned in a previous message, the frequency list for the
Brown Corpus is available in the public domain as part of the MRC
database. However, to make things easier for those interested only
in the frequencies themselves, I have just put up the list of the
most frequent words (those with more than 10 occurrences) in the
Brown Corpus. A comparison with the list of the 5,000 most frequent
words in the Wall Street Journal, made available by Ken Church,
would be interesting.

ftp : vaxsar.vassar.edu or 143.226.1.6
user : anonymous
password : your name
subdirectory : nlp
____________________________________________________
From: IN%"jem@cobuild.collins.co.uk" "Jem Clear"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj: Word frequencies

We do indeed have word frequency lists drawn from our extensive corpora
of modern English language. Do you know about Cobuild? (If not, have a
look at our WWW site at URL
 http://titania.cobuild.collins.co.uk/
for more information.)

Briefly, we have a 20-million word corpus accessible via a subscription
service called CobuildDirect. These 20m samples are taken from our main
"Bank of English" corpus of 211m words (as at time of writing -- we keep
adding more to it).

We receive many requests like yours, so we have recently decided to make
some sort of standard tariff for providing frequency lists. Here it is:

- --------------------------------------------

1. Complete lemmatised 20m freq list
   a. (incl. infl forms, POS, freqs)  150
2. 10,000 most freq lemmas from 20m
   a. (lemmas + POS)                  100
   b. (with freqs)                    120
   c. (with infl forms + freqs)       150
3. 10,000 most freq lemmas from 211m
   a. (lemmas + POS)                  500
   b. (with freqs)                    600
   c. (with infl forms + freqs)       700
5. a. 1a. but only top 1,000 words     25
   b. 1a. but only top 2,000 words     30
   c. 1a. but only top 5,000 words     50
Note that POS means part-of-speech tags, and lemmatised means that
inflected forms of nouns and verbs have been reduced to the base
form and their separate frequencies summed.

Here is a brief sample of list 1a.

last JJ 19548
no RB 19399
where WH 18542
find V 18420
 VB find 9001
 VB found 35
 VBD found 4294
 VBG finding 1103
 VBN found 3392
 VBZ finds 595
these DTG 18349
down IN 18014
tell V 17662
 VB tell 6400
 VBD told 5999
 VBG telling 1552
 VBN told 2769
 VBZ tells 942
even RB 17523
three CD 17346
should MD 16764
pound N 16564
 NN pound 954
 NNS pounds 14875
 NP pound 22
 NP pounds 713
off IN 16210
week N 16204
 NN week 11967
 NNS weeks 4228
 NP week 4
 NP weeks 5
really RB 16080
work V 16027
 VB work 6036
 VBD worked 1732
 VBG working 5690
 VBN worked 1299
 VBZ works 1270
may MD 15774
back RB 15759
yes UH 15742
life N 15624
 NN life 13432
 NNS lifes 11
 NNS lives 2122
 NP life 46
 NP lives 13
through IN 15614
those DTG 15473
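A quick consistency check on the sample: in list 1a, each lemma's
total should equal the sum of the frequencies of its indented
inflected forms (for "find", 9001 + 35 + 4294 + 1103 + 3392 + 595 =
18420, which matches the quoted total). A small awk sketch of that
check (the file name, and the assumption that lemma lines start in
column one, are mine):

```shell
#!/bin/sh
# Verify that each lemma total equals the sum of its inflected-form
# frequencies, using the "find" entry quoted above as sample data.
cat > freqs.txt <<'EOF'
find V 18420
 VB find 9001
 VB found 35
 VBD found 4294
 VBG finding 1103
 VBN found 3392
 VBZ finds 595
EOF
awk '
  function check() { printf "%s: listed %d, summed %d, %s\n",
                     lemma, total, sum, (total == sum ? "OK" : "MISMATCH") }
  /^[^ ]/ { if (lemma != "") check(); lemma = $1; total = $3; sum = 0; next }
          { sum += $3 }
  END     { if (lemma != "") check() }
' freqs.txt
# -> find: listed 18420, summed 18420, OK
```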


Best wishes
____________________________________________________
From: IN%"peereman@u-bourgogne.fr"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj: RE: Francis&Kucera

You can try the MRC Psycholinguistic Database. You will find
information on the Web at http://web.inf.rl.ac.uk/proj/psych.html

Sincerely,

Ronald Peereman

- -----------------------------------------------------------
Ronald Peereman
Laboratoire d'Etudes des Apprentissages et du Developpement-
C.N.R.S., Universite de Bourgogne, Dijon, France
fax. (33)80395767, email: peereman@satie.u-bourgogne.fr