Editor for this issue: T. Daniel Seely <dseely@emunix.emich.edu>
In a recent request to LINGUIST, I asked for word frequency lists for American English. I am grateful to all of you for your help.

Lou Hillman
lbhndp@rit.edu

In addition to the quoted responses, the following people also suggested Frequency Analysis of English Usage: Lexicon and Grammar by W. Nelson Francis and Henry Kucera, in its various guises:

MARC PICARD <PICARD@vax2.concordia.ca>
Guillaume Gantard <ggantard@logos-usa.com>
Judith Parker <jparker@s850.mwc.edu>

Here are excerpts from the other responses.

- ---------------------------------------------------------------
From: patrick.juola
psy.ox.ac.uk (Patrick Juola)

There are several professionally compiled lists of several million words, sorted by frequency in various corpora -- I know that the Brown corpus (Kucera & Francis, a zillion years ago) is available on-line from UPenn if you know whom to ask.

*BUT* having answered your question, please please please please please let me warn you away from trusting any of the answers you receive -- speaking as a professional corpus linguist, you're going to have some *serious* sampling effects in any corpus of that size. A rough check on the Brown histogram reveals that the 20,000th word is "bombproof", with a frequency of three per million text tokens. The inclusion or exclusion of a single page of text in the Brown corpus would be enough to add or remove a word from the list (as a rough test, I just opened a copy of a book and confirmed that both the words "cliques" [rank 41505 in the Brown corpus] and "subgraphs" [did not appear] occurred three times on that page).

The implications are fairly obvious -- the lists that you get are very sensitive to the corpora from which they are drawn, and particularly to the style, language, and content of the corpora -- so a list compiled from six million words of newspaper articles is likely to be significantly and substantially different from a list compiled from six million words of USENET postings, which in turn will be completely different from six million words of magazines, &c.

- ---------------------------------------------------------------
From: john.beaven
sharp.co.uk (John Beaven)

Not exactly an answer to your question, but as a last resort you could always "roll your own" by running this Unix script on your favourite multi-million word corpus...

    #! /bin/sh -
    # Finds the word frequencies in a text and sorts them in decreasing
    # order (i.e. most frequent word at top).
    # The leading space added by awk presumably keeps deroff from treating
    # lines that start with "." as troff requests.
    awk '{print " " $0}' "$1" | deroff -w | sort | uniq -c | sort -nr

- ---------------------------------------------------------------
From: Evan.Antworth
SIL.ORG (Evan L. Antworth) Go to this address: gopher://gopher.sil.org/11/gopher_root/linguistics/info/ and look at the items titled "English word frequencies...". (I didn't create these lists; I just got them from an FTP site at Vassar.) [see below for FTP address. LBH] - --------------------------------------------------------------- From: cball
guvax.acc.georgetown.edu (Catherine N. Ball) I think you can find many frequency lists for American English in the library -- for example, Francis and Kucera published one based on their (now famous) corpus of American English known as the 'Brown Corpus' (which is available from the Oxford Text Archive). You can also make your own frequency list using simple software. I recently made a 'Web Frequency Indexer' which allows you to paste in your text and get a frequency list -- I will be modifying it soon to allow the user to simply give the name of a file on their own computer. Anyhow, you might find it useful. The URL is http://www.georgetown.edu/cball/webtools/web_freqs.html - --------------------------------------------------------------------- From: meador
U.Arizona.EDU (Diane L Meador)

I have available, through my web page at the URL below, an American English lexical database, "Phondic". It's packaged with "Sample", a program written by Emmanuel Dupoux (CNRS, Paris), which searches the database by several criteria, such as stress and syllable patterns, phonemic or orthographic strings, etc. One of the options is frequency. While I have never tried to sort by frequency, I don't imagine that it would be difficult to do so. I hope that this meets your needs. If you do decide to use it, I ask on behalf of Emmanuel Dupoux that he be given acknowledgment. The program has DOS and Unix versions. Follow the "Available Papers" link on my page; it's listed under "Miscellany".

http://aruba.ccit.arizona.edu/~meador

- ---------------------------------------------------------------------
From: ms2928
liverpool.ac.uk (Mike Scott) Do you have a particular corpus in mind? The kind of 40,000 list will be pretty dependent on the corpus you use. For example, I have done a word list on the UK newspaper the Guardian, and without lemmatising, 4 million tokens will give rise to about 85,000 word types. 10 million might give about 120,000 and 100 million gives about 250,000. I have produced a word lister (etc.) available via http://www.liv.ac.uk/~ms2928/homepage.html http://www1.oup.co.uk/oup/elt/software/wsmith? (published by Oxford Univ. Press) The software costs UK sterling 49 (about 75 US$) and does a lot more than just word listing. If you visit the OUP site you'll see sample screens to show the idea. Alternatively there are existing word lists in paper format: McGraw Hill have one, there's Francis & Kucera, and presumably the Brown corpus of the 60s will be in machine-readable format too. - --------------------------------------------------------------------- [Vera Kempe posted a similar request several months ago and sent the following message, which she forwarded to me. Some information is repeated from above; I have edited briefly. LBH] From: VKEMPE
UOFT02.UTOLEDO.EDU

For all those who have asked me to share the responses on my query about computerized word frequency lists - here is what I got so far. Good luck!

- Vera Kempe
Department of Psychology
University of Toledo
vkempe@uoft02.utoledo.edu

From: IN%"PICARD@VAX2.CONCORDIA.CA" "MARC PICARD"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj: Frequency count

I don't have Francis & Kucera but I do have LOB and KWC. Let me know if you're interested and I'll send them along.

Marc Picard
____________________________________________________
From: IN%"C.J.Gledhill@aston.ac.uk"
To: IN%"vkempe@uoft02.utoledo.edu"
CC:
Subj:

Write to Birmingham University's Cobuild, a corpus-based lexicographic project: direct@cobuild.collins.co.uk

Chris J Gledhill
Lecturer in French
Languages and European Studies
Aston University
BIRMINGHAM B4 7ET
c.j.gledhill@aston.ac.uk
____________________________________________________
From: IN%"griffith
kula.usp.ac.fj" "Patrick Griffiths" Dear Dr Kempe In the Journal of Child Language, 1994(2), 513-6, there is a review by George Dunbar of Philip Quinlan OXFORD PSYCHOLINGUISTIC DATABASE. Oxford University Press, 1992. This is a package of computer software, for Macintosh. The reviewer says (p. 513): "The database contains entries for over 98,000 words, with information on up to 26 properties of each. This includes information on physical properties, such as the length of the word, other objective properties, such as its frequency of occurrence in the Kucera-Francis list, and subjective or 'psychological' properties, such as imageability ratings." A single user licence was priced at 205 British pounds, which corresponds to somewhere between 300 and 400 US dollars, I think. Best wishes Patrick ____________________________________________________ From: IN%"edwards
cogsci.Berkeley.EDU" To: IN%"vkempe
uoft02.utoledo.edu" CC: IN%"edwards
cogsci.Berkeley.EDU" Subj: word frequencies online The site below has a couple, with documentation available there. Hope this helps, -Jane Edwards - ------------------------------------------------------------------- From: veronis
vassar.edu (Jean Veronis)

As I mentioned in a previous message, the list of frequencies in the Brown Corpus is available in the public domain, as part of the MRC database. However, to make things easier for those interested only in these frequencies, I have just posted the list of the most frequent words (more than 10 occurrences) in the Brown Corpus. It would be interesting to compare it with the list of 5000 frequent words in the Wall Street Journal, made available by Ken Church.

ftp: vaxsar.vassar.edu or 143.226.1.6
user: anonymous
password: your name
Subdirectory: nlp
____________________________________________________
From: IN%"jem
cobuild.collins.co.uk" "Jem Clear" To: IN%"vkempe
uoft02.utoledo.edu"
CC:
Subj: Word frequencies

We do indeed have word frequency lists drawn from our extensive corpora of modern English language. Do you know about Cobuild? (If not, have a look at our WWW site at URL http://titania.cobuild.collins.co.uk/ for more information.) Briefly, we have a 20-million word corpus accessible via a subscription service called CobuildDirect. These 20m samples are taken from our main "Bank of English" corpus of 211m words (as at time of writing -- we keep adding more to it).

We receive many requests like yours, so we have recently decided to make some sort of standard tariff for providing frequency lists. Here it is:

- --------------------------------------------
1. Complete lemmatised 20m freq list
   a. (incl. infl forms, POS, freqs)       150

2. 10,000 most freq lemma from 20m
   a. (lemmas + POS)                       100
   b. (with freqs)                         120
   c. (with infl forms + freqs)            150

3. 10,000 most freq lemma from 211m
   a. (lemmas + POS)                       500
   b. (with freqs)                         600
   c. (with infl forms + freqs)            700

5. a. 1a. but only top 1,000 words          25
   b. 1a. but only top 2,000 words          30
   c. 1a. but only top 5,000 words          50

Note that POS means "with part-of-speech" tags and lemmas means that inflected forms of nouns and verbs have been lemmatised to the base form and their several frequencies summed.

Here is a brief sample of list 1a.
last      JJ    19548
no        RB    19399
where     WH    18542
find      V     18420
          VB    find       9001
          VB    found        35
          VBD   found      4294
          VBG   finding    1103
          VBN   found      3392
          VBZ   finds       595
these     DTG   18349
down      IN    18014
tell      V     17662
          VB    tell       6400
          VBD   told       5999
          VBG   telling    1552
          VBN   told       2769
          VBZ   tells       942
even      RB    17523
three     CD    17346
should    MD    16764
pound     N     16564
          NN    pound       954
          NNS   pounds    14875
          NP    pound        22
          NP    pounds      713
off       IN    16210
week      N     16204
          NN    week      11967
          NNS   weeks      4228
          NP    week          4
          NP    weeks         5
really    RB    16080
work      V     16027
          VB    work       6036
          VBD   worked     1732
          VBG   working    5690
          VBN   worked     1299
          VBZ   works      1270
may       MD    15774
back      RB    15759
yes       UH    15742
life      N     15624
          NN    life      13432
          NNS   lifes        11
          NNS   lives      2122
          NP    life         46
          NP    lives        13
through   IN    15614
those     DTG   15473

Best wishes
____________________________________________________
From: IN%"peereman
u-bourgogne.fr" To: IN%"vkempe
uoft02.utoledo.edu"
CC:
Subj: RE: Francis&Kucera

You can try the MRC Psycholinguistic Database. You will find information on the Web at http://web.inf.rl.ac.uk/proj/psych.html

Sincerely,
Ronald Peereman
- -----------------------------------------------------------
Ronald Peereman
Laboratoire d'Etudes des Apprentissages et du Developpement - C.N.R.S.,
Universite de Bourgogne, Dijon, France
fax. (33)80395767, email: peereman@satie.u-bourgogne.fr
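
For anyone who wants to try John Beaven's "roll your own" suggestion on a system without deroff, the following is a minimal sketch using only standard Unix tools. It is not his script: the function names word_freq and freq_summary are made up for illustration, and the tokenisation (a word is any run of letters, case folded) is an assumption that will not match deroff -w exactly. The second function reports the type and token counts of the kind Mike Scott quotes for the Guardian.

```shell
#!/bin/sh
# word_freq: read text on stdin, write "count word" lines, most frequent first.
# Assumption: a "word" is a maximal run of letters, so "don't" counts as
# "don" and "t"; deroff -w in the original script tokenises differently.
word_freq() {
    tr -cs '[:alpha:]' '\n' |        # turn every run of non-letters into one newline
    tr '[:upper:]' '[:lower:]' |     # fold case so "The" and "the" are one type
    sed '/^$/d' |                    # drop the empty line a leading non-letter leaves
    sort | uniq -c |                 # count identical adjacent words
    sort -k1,1nr -k2,2               # by count descending, ties alphabetical
}

# freq_summary: read word_freq output, report distinct types and total tokens.
freq_summary() {
    awk '{ tokens += $1 } END { printf "%d types, %d tokens\n", NR, tokens }'
}

# Demo on a toy sentence; feed a corpus file on stdin for real use.
printf 'The cat saw the dog. The dog ran.\n' | word_freq
printf 'The cat saw the dog. The dog ran.\n' | word_freq | freq_summary
```

On a real corpus, something like `word_freq < corpus.txt | head -n 40000` would give a list of the 40,000 most frequent word forms, subject to all of the sampling caveats Patrick Juola raises above. No lemmatisation is attempted, so "find", "found" and "finds" are counted separately.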