Some weeks ago I asked on the Linguist List whether anyone knew where I might obtain frequency count information. I received replies from a number of different sources. Thanks very much to everyone who sent suggestions. The following are summaries of some of the responses:

-Contact the people of the UPenn Treebank (maryann@unagi.cis.upenn.edu or robertm@unagi.cis.upenn.edu).

-There are quite a few electronic corpora available now - the major British English ones (e.g. LOB, London-Lund) are available from the Oxford Text Archive, as is the Brown corpus ... Penn has either Dow Jones or Wall Street Journal in electronic form, available through the Linguistic Data Consortium (along with various other corpora). Once you have your corpus, many concordance programs will give you frequencies. But frequency in terms of all the words in the corpus is probably not interesting - well, I guess it depends on what you want to do with the frequency count. So often, though, we look at the relative frequency of something within some class - e.g., the relative frequency of the demonstratives within the determiner class, or the relative frequency of 'there' and 'here' within the class of locative adverbs, etc. For that you can also use a concordance program, or you might use a concordance program together with a corpus that has words tagged by part of speech. There are various POS taggers floating around, and then there are also tagged corpora - e.g. I think the Brown corpus is tagged; the LOB corpus is tagged and the Penn Treebank corpus is also. Five of the big corpora plus concordance programs are available from ICAME on CD-ROM for about $500, but you can also get free concordance programs by FTP, also from the Summer Institute of Linguistics.

-Frequency counts of specified words [can be obtained] by using the CLAN computer program on the CHILDES (Child Language Data Exchange System) database. The program has specific commands that allow you to find the frequency of particular words (or phrases) in one or more transcripts. Most of the transcripts that [the author knew] of in the database consist of parent-child interactions - so if you need the frequency counts in adult-to-adult interactions, you may not find what you're looking for in this particular database.
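
To make the relative-frequency suggestion above concrete, here is a minimal Python sketch. It assumes the corpus has already been tagged and is stored as whitespace-separated word/TAG tokens. The tag names (DET, N, V, ...), the sample sentence, and the function names are all hypothetical and not taken from any particular corpus or concordance program; a real tagged corpus will use its own tagset and file layout.

```python
from collections import Counter

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, lowercasing the word."""
    pairs = []
    for token in text.split():
        # Split on the last slash so words that contain '/' keep it.
        word, sep, tag = token.rpartition("/")
        if sep:  # skip tokens that carry no tag
            pairs.append((word.lower(), tag))
    return pairs

def relative_freq_in_class(pairs, class_tags):
    """Share of each word among the tokens whose tag is in class_tags."""
    counts = Counter(word for word, tag in pairs if tag in class_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Hypothetical sample; the tags are illustrative only.
sample = ("this/DET dog/N saw/V the/DET cat/N near/P "
          "that/DET tree/N and/C the/DET house/N")

freqs = relative_freq_in_class(parse_tagged(sample), {"DET"})
for word, share in sorted(freqs.items()):
    print(f"{word}\t{share:.2f}")
```

The denominator here is the number of tokens in the chosen class rather than the size of the whole corpus, which is the distinction the response draws between raw frequency and relative frequency within a class.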
-These counts should be fairly easy to obtain from a parsed corpus (i.e., one in which every sentence has been assigned a parse tree). There are several such corpora around; the cheapest is on a CD-ROM distributed by the ACL, and the best can be obtained by joining the DCI (Data Collection Initiative) centered at U. Penn. for $2,500.

-Try the Brown and the LOB corpora. There is also an Australian one; for details email Pam Peters (ppeters@srsuna.shlrc.mq.oz.au).

Richard Larson (rlarson@semlab5.sbs.sunysb.edu)
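
As a rough illustration of the parsed-corpus suggestion above, the following Python sketch counts how often each node label occurs in a set of parse trees. It assumes the trees are written in a parenthesized (LABEL ...) notation of the kind the Penn Treebank uses; the two sample trees and the function name are made up for the example, and a real corpus may need extra handling for traces, function tags, and file headers.

```python
import re
from collections import Counter

def label_counts(bracketed):
    """Count node labels, i.e. the symbol right after each '('."""
    # A label is a run of characters with no whitespace or parentheses.
    return Counter(re.findall(r"\(([^\s()]+)", bracketed))

# Hypothetical sample trees, not drawn from any actual corpus.
trees = """
(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))
(S (NP (PRP she)) (VP (VBD left)))
"""

counts = label_counts(trees)
for label, n in counts.most_common():
    print(f"{label}\t{n}")
```

This counts phrase labels and part-of-speech labels alike; counting particular constructions (say, NPs that dominate a pronoun) would require actually building the trees rather than scanning for labels.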