Summary Details
| Query: | Corpora | |
| Author: | Royle Phaedra | |
| Linguistic Field(s): | Text/Corpus Linguistics | |
| Summary: |
I recently posted a query on the LINGUIST List about corpora or word lists with frequency counts for Bulgarian, Polish, Greek, Turkish and English (excluding Kucera and Francis). Many people responded with helpful comments, which are summarised below. Unfortunately, nothing was found on Greek. If any additions seem necessary, please write back to me.

Thanks,
Phaedra
PhD student
Universite de Montreal
Centre de recherche Theophile Alajouanine

On English:

Gan Wee Keong <ellganwk@leonis.nus.sg>

The British National Corpus word frequency lists generated by Adam Kilgarriff. The lists are organised in several different ways, so read the README file before downloading. To get the lists, connect by ftp to: ftp.itri.bton.ac.uk/pub/bnc

------------------------------------------------------------------

Richard Piepenbrock <celex@mpi.nl>

THE CELEX CD-ROM PRODUCED BY THE DUTCH CENTRE FOR LEXICAL INFORMATION IN COLLABORATION WITH THE LINGUISTIC DATA CONSORTIUM

The Second Release of the CD-ROM, which contains the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.5), is now available for research purposes from the Linguistic Data Consortium for $150. For each language, the CD-ROM contains detailed information on the orthography (variations in spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word-class specific subcategorisations, argument structures), and word frequency (summed word and lemma counts, based on recent and representative text corpora) of both wordforms and lemmas (English: 52446 lemmas, 160594 wordforms; German: 51728 lemmas, 365530 wordforms; Dutch: 124136 lemmas, 381292 wordforms).

------------------------------------------------------------------

Lluís Padró <padro@lsi.upc.es>

I have an English frequency list, extracted from 1.1 million words of the WSJ, available by ftp:

ftp anonymous to ftp-lsi.upc.es
cd pub/lluisp
get wsj.freq

------------------------------------------------------------------

"M. Lynne Roecklein" <lynne@cc.gifu-u.ac.jp>

You may be wanting only very formal frequency lists, or you've probably already checked out the following, but if not, there are 'lists of defining words' in the 1995 __Cambridge International Dictionary of English__ (which claims frequency was one of the factors in the assembly of that list but does not name its references) and the 1993 __Longman Language Activator__ (which refers to the Longman Corpus Network data concerning frequency). The Collins Cobuild people must also have done frequency work on their corpus, which I understand is rather extensive, to arrive at a defining vocabulary, but nothing is said in their standard dictionary. I realize that these dictionaries are specialized in various ways, but their defining word lists would include only high frequency words.

------------------------------------------------------------------

"James L. Fidelholtz" <jfidel@cen.buap.mx>

On English: There's always the granddaddy of all frequency counts, Thorndike & Lorge, ca. 1943, probably still in print at Columbia U. Teachers College Press (later impressions, of course). The most accessible 'recent' version would probably be John Carroll's (title may be slightly off) _The American Heritage word frequency book_, published approx. 1980 by AH. There's also some fairly recent Scandinavian stuff (in the 80's) on English, but I forget now the authors' names (on the basis, if I remember correctly, of the Brown corpus). If you need more info, let me know, and I'll scour the stacks at home. Please let me know what you run across, as I'm always interested in frequency studies.
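
The anonymous ftp steps quoted above for the WSJ list (connect, `cd pub/lluisp`, `get wsj.freq`) can be scripted. The following is only a minimal sketch using Python's standard `ftplib`; the function name `fetch_file` and its `ftp_factory` parameter are illustrative assumptions, not part of any of the tools mentioned in the responses, and the hosts named in this 1997 summary may no longer be reachable.

```python
from ftplib import FTP
from pathlib import Path


def fetch_file(host, directory, filename, dest_dir=".", ftp_factory=FTP):
    """Download one file over anonymous FTP and return the local path.

    ftp_factory is a hypothetical hook (defaulting to ftplib.FTP) so the
    function can be exercised without a live server.
    """
    dest = Path(dest_dir) / filename
    ftp = ftp_factory(host)      # connect to the host on the default FTP port
    try:
        ftp.login()              # no arguments: anonymous login
        ftp.cwd(directory)       # equivalent of "cd pub/lluisp"
        with open(dest, "wb") as out:
            # equivalent of "get wsj.freq": stream the file in binary mode
            ftp.retrbinary(f"RETR {filename}", out.write)
    finally:
        ftp.close()              # always release the connection
    return dest
```

Assuming the server still exists, a call such as `fetch_file("ftp-lsi.upc.es", "pub/lluisp", "wsj.freq")` would mirror the three commands in the message; the same function would apply to the BNC lists by changing the host and directory.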

------------------------------------------------------------------

Ntirampeba Pascal <ntirampp@ERE.UMontreal.CA>

Another English word list is given by: Johansson, S. & K. Hofland. 1989. Frequency analysis of English vocabulary and grammar. Oxford: Clarendon Press.

_____________________________________________________________

POLISH

"James L. Fidelholtz" <jfidel@cen.buap.mx>

With respect to Polish, there is a frequency count (or is it a 'backwards dictionary'?) for at least some poems of a Polish poet whose name escapes me at the moment. Ah, yes, there are frequency counts of at least the press by, I believe, Topolin'ska (Maria?), but super hard to come by -- check the OCLC and LC listings -- it would have been published in the early or middle 70's, in several volumes. I think I have some of them, but I'm not sure.

------------------------------------------------------------------

Andrzej Lyda <kotlet@zeus.polsl.gliwice.pl>

A kind of frequency list was compiled by Tadeusz Piotrowski of the Institute of English, Wroclaw University, for the purposes of a Polish-English dictionary. He has also published: Contemporary English: Word Lists. Part I-II. Wydawnictwo Uniwersytetu Wroclawskiego. 1993. ISBN: 83-229-0940-3. I would also contact PWN, Warsaw (National Scientific Publishers), which has just published a CD-ROM edition of the Dictionary of Contemporary Polish.

Andrzej Lyda
Institute of English
University of Silesia
Sosnowiec
Poland

------------------------------------------------------------------

Tilman Berger <tilman.berger@uni-tuebingen.de>

There is a frequency dictionary for Polish: Slownik frekwencyjny polszczyzny wspolczesnej. Ed. Ida Kurcz et al. Krakow: Polska Akademia Nauk, Instytut Jezyka Polskiego. Vol. 1, 1990. Vol. 2, 1990.

Prof. Dr. Tilman Berger
Slavisches Seminar
Universitaet Tuebingen
Wilhelmstr. 50
D-72074 Tuebingen
Tel. 07071/29-76733 (Universitaet)
07071/63365 (privat)
e-mail: tilman.berger@uni-tuebingen.de

________________________________________________________________

BULGARIAN

Kjetil Ra Hauge <K.R.Hauge@easteur-orient.uio.no>

For Bulgarian:

Nikolova, Cvetanka: CHestoten rechnik na bylgarskata razgovorna rech, Sofija 1987

Todorova, Elena; Rada Panchovska: CHestoten rechnik na bylgarskata publicistika (1944-1989), Sofija 1995

The latter is rarer than a Gutenberg Bible: the total print run is 25 (!) copies.

_________________________________________________________________

TURKISH

Kemal Oflazer <ko@cs.bilkent.edu.tr>

We do not have frequency lists yet, but for general Turkish material you can look at http://www.nlp.cs.bilkent.edu.tr. We have some morphologically disambiguated corpora posted there; however, they are quite short. We have some root word occurrence statistics for those, but they may not be very meaningful.

Kemal Oflazer
e-mail: ko@cs.bilkent.edu.tr
http://www.cs.bilkent.edu.tr/~ko/ko.html
Bilkent University
Computer Engineering Department
Bilkent, ANKARA, 06533 TURKIYE
tel: (90-312) 266-4133 (Sec), 266-4000 x1258 (Off), 240-1627 (Home)
fax: (90-312) 266-4126 |
|
| LL Issue: | 8.363 | |
| Date Posted: | 16-Mar-1997 | |


