Tue 25 May 1993

Sum: Frequency Counts

Date: Thu, 20 May 93 16:50:01 -0
From: Richard Larson
Subject: Frequency Counts

Some weeks ago I asked on the Linguist List whether anyone knew where
I might obtain frequency count information. I received replies from
a number of different sources. Thanks very much to everyone who sent
suggestions. The following are summaries of some of the responses:

 -Contact the people at the UPenn Treebank.

 -There are quite a few electronic corpora available now -
 the major British English ones (e.g. LOB, London-Lund) are
 available from the Oxford Text Archive, as is the Brown
 corpus ... Penn has either Dow Jones or Wall Street Journal
 in electronic form, available through the Linguistic Data
 Consortium (along with various other corpora).
 Once you have your corpus, many concordance programs will
 give you frequencies. But frequency in terms of all the
 words in the corpus is probably not interesting - well, I
 guess it depends on what you want to do with the frequency
 count. But so often we look at relative frequency of
 something within some class - e.g., the relative frequency
 of the demonstratives within the determiner class, or the
 relative frequency of 'there' and 'here' within the class of locative
 adverbs, etc. For that you can also use a concordance
 program, or you might use a concordance program together
 with a corpus that has words tagged by part of speech. There
 are various POS taggers floating around, and then there are
 also tagged corpora - e.g. I think the Brown corpus is
 tagged; the LOB corpus is tagged and the Penn Treebank
 corpus is also.

 Five of the big corpora + concordance programs are available
 from ICAME on CD-ROM for about $500, but you can also get
 free concordance programs by FTP, also from the Summer
 Institute of Linguistics.

 -Frequency counts of specified words [can be obtained] by
 using the CLAN computer program on the CHILDES (Child
 Language Data Exchange System) database. The program has specific
 commands that allow you to find the frequency of particular
 words (or phrases) in one or more transcripts. Most of the
 transcripts that [the author knew] of in the database
 consist of parent-child interactions - so if you need the
 frequency counts in adult-to-adult interactions, you may not
 find what you're looking for in this particular database.

 -These counts should be fairly easy to obtain from a parsed
 corpus (i.e., one in which every sentence has been assigned
 a parse tree). There are several such corpora around; the
 cheapest is on a CD-ROM distributed by the ACL, and the best
 can be obtained by joining the DCI (Data Collection
 Initiative) centered at U. Penn. for $2,500.

 -Try the Brown and LOB corpora. There is also an
 Australian one; for details, email Pam Peters.
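
One of the responses above points out that relative frequency within a
class (e.g. the demonstratives within the determiner class) is often
more useful than raw frequency over the whole corpus. A minimal sketch
of that calculation in Python follows. It assumes the corpus has
already been tagged and is stored as whitespace-separated word/TAG
tokens; the sample sentence, the simplified tag names (DT, NN, etc.)
and the function names are all made up for illustration and do not
correspond to any particular corpus or tagger mentioned here.

```python
from collections import Counter


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, lowercasing words."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so words containing '/' survive
        word, sep, tag = token.rpartition("/")
        if sep:  # ignore tokens that carry no tag at all
            pairs.append((word.lower(), tag))
    return pairs


def relative_frequencies(pairs, class_tags):
    """Share of each word among all tokens whose tag is in class_tags."""
    counts = Counter(word for word, tag in pairs if tag in class_tags)
    total = sum(counts.values())
    if total == 0:
        return {}  # the class does not occur in this sample
    return {word: n / total for word, n in counts.items()}


# Hypothetical tagged sample; a real corpus would be read from a file.
SAMPLE = (
    "this/DT book/NN and/CC that/DT pen/NN are/VB "
    "on/IN the/DT desk/NN near/IN this/DT lamp/NN"
)

pairs = parse_tagged(SAMPLE)
det_freqs = relative_frequencies(pairs, {"DT"})

for word, share in sorted(det_freqs.items(), key=lambda item: -item[1]):
    print(f"{word}\t{share:.2f}")
```

In this sample four tokens are tagged DT, so 'this' (two occurrences)
accounts for half of the class and 'that' and 'the' for a quarter each.
Passing a larger set of tags to relative_frequencies widens the class
being compared against.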


