Those of you who have been asking about the American Dialect Society can find information on the Web at http://www.msstate.edu/Archives/ADS/.

   --Natalie (maynor@ra.msstate.edu)
X-Organisation: Computer Hyphenation Ltd, Hyphen House, 8 Cooper Grove, Shelf, Halifax, HX3 7RF, United Kingdom.
X-Phone/Fax: +44 1274 691092

Listserver Linguist
Subject    Hello, Goodbye and wordlists
Date       21/1/94

Just to say hello and goodbye! I was only here to see what was in your listserver's file store. You are too academic for me.

While I am here, someone may be interested in a FAQ on wordlists in many languages, which I post to comp.software.international on a regular basis.

Enjoy :-) or ignore :-( as the case may be.

)>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>(<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

:Newsgroup comp.software.international
:Subject FAQ on wordlists
:Date issue 3 27/10/94
:Organisation Computer Hyphenation Ltd
:Copyright (C) Public domain, use and copy as required.

Here is an offering on wordlists for the FAQ of comp.software.international, or anywhere else which wants it, when someone gets round to compiling one. IMHO dictionaries should be another section of the FAQ.

If you think I have made any mistakes, or have anything to add, please contact me by email. I will post this monthly until the FAQ is organised properly.

WORDLISTS
*********

Lists of words in languages are very useful in many fields of linguistic research, spelling error correctors and the like.

ftp sites
---------

Back in 1992 Jorge Stolfi, then of DEC Systems Research Center, 130 Lytton Avenue, Palo Alto CA 94301, collected an excellent set of wordlists in Dutch, English, German, Italian, Norwegian, and Swedish, before moving on to other work. These are still available on gatekeeper.dec.com/.8/misc/stolfi-wordlists They are compressed using a special algorithm (decompression C source code is provided) and then compressed even more.

These have propagated around the internet and may be found on many ftp sites. If you "archie" for a language you will find many wordlists labelled, say, "German", and most will be Stolfi's.
As these lists are quite large, it is important that you try to identify these duplicates without getting the files. Stolfi's readme file identifies them uniquely if present; otherwise, where they have been decompressed and conventionally recompressed, the size of the file is a good indicator of duplicates.

The best compilation of wordlists at the time of writing HAS NOW BEEN MOVED TO ftp.ox.ac.uk\pub\wordlists\* (alias for sable.ox.ac.uk), administered by Paul Leyland (pcl@sable.ox.ac.uk). His list of directories is at the time of writing:

afrikaans, american, aussie, chinese, computer, croatian, danish, databases, dictionaries, dutch, finnish, french, german, hindi, hungarian, italian, japanese, latin, literature, movieTV, music, names, net, norwegian, places, polish, random, religion, russian, science, spanish, swahili, swedish, yiddish.

This list may not be complete, as things are often added. The major European languages are well covered with long lists, but the remainder are the best available, some being quite short.

If you:
   have other wordlists available,
   can find other wordlists on internet,
   are willing to collect wordlists in any of the minor languages,
   or can improve on any already there,
please contact Paul Leyland direct.

Compilation of wordlists
------------------------

If you cannot find what you want via archie, WWW, or the instructions above, you may well want to produce your own list, and then hopefully, Please! put it in the public domain. If so, please offer it to Paul Leyland when finished.

The following assumes that you are creating a wordlist on a shoestring, or as a minor project without funding. A properly funded wordlist is called a corpus, and is a different proposition.

A good source of information is to subscribe to soc.culture.???, or any newsgroup which is appropriate, get the FAQ, and monitor the newsgroup for at least a month or two.

Electronic Newspapers
.....................

The advent of electronic newspapers on the net provides an excellent opportunity to collect wordlists with minimal effort. The quality of writing is usually good, and they contain short articles on many subjects by many people, so they are an excellent source of words. A disadvantage is that many are in an electronic, or unaccented, version of the language.

Listservers
...........

Many of the minor languages have BBS systems run from a University computer. Some are for learners, and the spelling is awful. Some are just chat, and the spelling is generally bad.

Newsgroups
..........

The newsgroups soc.culture.* etc. are also a source of wordlists. The writing and spelling are often of an appalling standard, and the text is mostly unaccented. Subjects are varied, but politics and religion produce a very high proportion of postings, which produces highly skewed lists. For instance, in alt.culture.indonesia "allah" is among the twenty most frequently used words. I do not believe that this is true for normal Indonesian text, or newspapers.

There are about 6000 newsgroups on the net, but your server probably only holds the 1500 common ones. Many others are available which typically are used by a single university. Ask your sysop for a list; you may find something interesting.

Paper Newspapers
................

All newspapers are now produced on computers, so if you can get access to a newspaper's files these will be fine. There will be short pieces on lots of subjects by lots of journalists. The newspapers for the masses write to a reading age of about 12, so have a restricted range of words. A "quality" newspaper will be better, with a higher reading age and more subjects.

If possible, take a few articles from each day's issue. A single issue may have a front page story, a leader article, and several pieces on inside pages, all on the same subject, and therefore using the same words.

Beware of the files produced by newspapers internally.
A story goes through many stages in the production cycle, so you can end up with fifteen versions of the same thing!

Press offices
.............

Government Press offices etc. are now putting news on the net, and can be used. They do, however, tend to concentrate on Political news, and thus have a highly skewed, and small, selection of words, although the quality of text is good.

Dictionaries
............

It may seem obvious, but dictionaries can be converted into wordlists. Bilingual language1 to language2 dictionaries can, with a bit of very!! very!! nifty editing, be made into wordlists for either language1 or language2.

There is always a problem with dictionaries that they contain words which are interesting to linguists, and not words which are used in real text, or which people really speak. Compound words and derivative words are under represented. They are, however, fully accented, and the words which they contain are correctly spelled.

Things to edit out of text
..........................

When collecting text for wordlists, especially from internet, and with number of occurrences, there are some things which are better edited out.

1: Headings
   These are not representative of normal text.

2: Repeated text
   On the net people often post things daily, weekly or as necessary. I have received the daily exchange rates for Slovenian currency, supplied courtesy of the Governor of the Bank of Slovenia.

3: Salutations
   Gracias, hola, Hallo, hi etc. etc. should not be overrepresented in word lists. Newsgroups always have frequent contributors, and mailing lists have owners. One does not want "tony" or "briony" overrepresented.

4: Signature blocks
   These are idiosyncratic, and contain things representing the personality of the posters. The language is often not the same as that of the text above it.

5: Quoted text
   It is convention on internet to quote parts of a previous communication using ) or : at the start of a line. This is best omitted, as it distorts number of occurrence counts.
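
The editing steps listed above can be partly automated. What follows is only a minimal sketch in Python, under stated assumptions: the function names clean_posting and drop_repeats are made up for illustration, headings are assumed to be "Name: value" lines at the top of a posting, a signature block is assumed to start at a line holding only "--", and ">" is accepted as a quote marker alongside ")" and ":". Real postings vary, so treat every one of these as adjustable.

```python
import re

# All three conventions below are assumptions, not taken from any
# particular mail or news reader; adapt them to the text being collected.
QUOTE_PREFIXES = (">", ")", ":")   # markers that start a quoted line
SIG_DELIMITER = "--"               # a line holding only this starts a signature
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:\s")  # "Subject: ..." style lines


def clean_posting(text, salutations=frozenset()):
    """Return one posting with headings, quotes, salutations and signature removed."""
    kept = []
    in_headers = True
    for line in text.splitlines():
        stripped = line.strip()
        if in_headers:
            if HEADER_RE.match(line):
                continue            # 1: drop heading lines at the top
            in_headers = False      # first non-heading line ends the header block
        if stripped == SIG_DELIMITER:
            break                   # 4: drop the signature block and all after it
        if stripped.startswith(QUOTE_PREFIXES):
            continue                # 5: drop quoted text
        if stripped.lower().strip(",.!") in salutations:
            continue                # 3: drop a line that is only a salutation
        kept.append(line)
    return "\n".join(kept).strip()


def drop_repeats(postings):
    """2: keep the first copy of each posting, ignoring case and spacing."""
    seen = set()
    unique = []
    for body in postings:
        key = " ".join(body.split()).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(body)
    return unique
```

Note the limits: a line that begins with a smiley such as ":-)" is thrown away as if it were quoted, and drop_repeats only catches verbatim reposts, so a daily posting whose figures change each day still needs editing by hand.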

Accented or unaccented?
.......................

If at all possible, collect words with correctly accented and modified roman characters, e.g. the slashed 'l' in Polish, e acute in French. The easiest words to find, however, are from internet, where it is convention to use a computer version of the language which uses the unaccented versions of characters.

There are more than 500 7 and 8 bit character sets available, so if you can, state in the readme file what, in the wordlist, indicates which character!

Beware of the accented status of upper case characters: in some languages, by convention, lower case versions of a character are accented, whereas upper case are often unaccented.

Non roman languages
...................

The vast majority of Russian, Bulgarian, Ukrainian, and Greek on internet is transliterated into roman characters. Only a little is in the correct characters. Several vowels in these languages transliterate into the same roman character, so you cannot simply get back to the correct character set from transliterated text. Greek transliteration is a mess.

Wordlists for non roman languages should be in the original character set; those in transliterated versions are not of much use.

How many words?
...............

For the simple non compounding languages, English, French, Italian etc., about 50,000 to 100,000 words will cover almost everything used in normal text. For the agglutinative languages where it is common to add words to words to words, such as German or Finnish, several hundred thousand are required, and worse, the actual compound words used keep changing.

These are targets to aim for; wordlists of any length are useful.

Copyright
.........

IMHO as a non lawyer, the copyright of a piece of ordinary writing rests in the artistic effort in writing it. This is totally lost when you make a wordlist, so provided that you have the right to read it, the copyright of any wordlist produced is yours and you can put it in the public domain.
Dictionaries may be different in that the artistic effort is arguably in the selection of the words themselves.

Readme file
...........

Each wordlist should have a readme file giving relevant information. Please state who you are and your email address for queries, but to reduce these the following should be included:

   Your name and email address.

   The source of your list.

   Any gross bias in the words found, political, religious, etc.

   The character set being used, with the representation of each of the accented characters defined. Not everyone uses TeX or your favourite code page! It is a real pain decoding the character set with a paper dictionary! There are more than 500 7 and 8 bit character sets: ISO, PC, Mac, Gem, Word Perfect, ad infinitum. Not to mention 16, 32, 64 bit character sets, Unicode, ISO and so on.

   Say whether misspellings have been removed.

   Say whether loan words have been removed. They appear in all languages!!

   Say whether proper nouns are capitalised, capitalisation has been left as found, or all words have been lowercased. IMHO proper nouns are best capitalised if that is the convention in that language.

Number of occurrences
.....................

If one looks at the seminal work of the Brown corpus, available from the Oxford Text Archive, one will find that the number of occurrences of each word, in the text examined, is carefully tabulated. This is very useful in many fields, and its addition to any wordlist is a distinct advantage. If used for a spelling corrector, it is possible to construct several dictionaries, one of common words held in main memory, less common words held on disk and so on.

Two small public domain ANSI C source code programs to facilitate their collection are available currently from Dave Fawthrop (hyphen@ibmpcug.co.uk), but hopefully I will find an ftp site to put them.

These are "one_word", which takes ASCII text, splits it at white space, strips off punctuation at both ends of the word, changes the word to lower case if required, and prints out the words one per line. Both the definition of lower case and the definition of punctuation are simply adaptable to the character set in use. The output from this is then sorted with any gutsy sort. "uniq_num" takes this sorted output, and prints out each word followed by the number of times it occurred.

The above sequence, with edits as required, can be repeated ad infinitum, with the occurrences collected properly.

Dave Fawthrop, (hyphen@ibmpcug.co.uk).

-- God loved the World so much that he gave his only Son, so that --
-- anyone who believes in (trusts, clings to, relies on) him shall --
-- not perish, but have eternal life. STARTING NOW! *IT'S GREAT*   --
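
For readers who want to try the sequence described under "Number of occurrences" (one_word, then a sort, then uniq_num), here is a rough sketch in Python. It is not the ANSI C source mentioned there, which is not reproduced in this post; it only follows the description of what those programs do. The punctuation set is assumed to be ASCII punctuation, and merge_counts is a hypothetical helper for combining counts from repeated runs.

```python
import string
from itertools import groupby

# Assumption: ASCII punctuation only. Replace with whatever counts as
# punctuation in the character set of the text being processed.
PUNCTUATION = string.punctuation


def one_word(text, lowercase=True, punctuation=PUNCTUATION):
    """Split text at white space and strip punctuation from both ends of each word."""
    words = []
    for token in text.split():
        word = token.strip(punctuation)
        if lowercase:
            word = word.lower()
        if word:                      # skip tokens that were all punctuation
            words.append(word)
    return words


def uniq_num(sorted_words):
    """Take sorted words and return (word, number of occurrences) pairs."""
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted_words)]


def merge_counts(*count_lists):
    """Hypothetical helper: add together the counts from several runs."""
    totals = {}
    for counts in count_lists:
        for word, n in counts:
            totals[word] = totals.get(word, 0) + n
    return sorted(totals.items())


text = 'The cat sat. "The" cat, the mat!'
counts = uniq_num(sorted(one_word(text)))
for word, n in counts:
    print(word, n)
# prints: cat 2, mat 1, sat 1, the 3 (one pair per line)
```

Python's str.lower() follows Unicode case rules, which may not match the definition of lower case wanted for a given 7 or 8 bit character set, so that step is the one most likely to need adapting.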