Editor for this issue: Ljuba Veselinova <lveselin@emunix.emich.edu>
REMINDERS FROM LSA

--If you plan to attend the LSA Annual Meeting in San Diego, please register for the meeting and make your hotel reservations. For more information and/or forms, please contact the LSA Secretariat: 202-835-1714; zzlsa@gallua.gallaudet.edu

--Program Heads and Department Chairs are reminded to complete and return the questionnaire sent by the LSA Committee on Ethnic Diversity in Linguistics. Replies were requested by 11 December and should be sent as soon as possible.
Announcing a NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and the CENTRE FOR LEXICAL INFORMATION

This message announces the Second Release of the CELEX CD-ROM with lexical data from the Dutch Centre for Lexical Information and the Linguistic Data Consortium.

This CD-ROM contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1000 new lemma entries, revised morphological parses, verb argument structures, inflectional paradigm codes, and a corpus type lexicon. A complete PostScript version of the German Linguistic Guide is also included, in both European A4-format and American Letter format. For German, the total number of lemmas included is now 51,728, while all their inflected forms number 365,530.

Moreover, phonetic syllable frequencies have been added for (British) English and Dutch. Apart from this, and the provision of frequency information alongside every lexical feature, no changes have been made to the Dutch and English lexicons.

Complete AWK-scripts are now provided to compute representations not found in the (plain ASCII) lexical data files, corresponding to the features described in the CELEX User Guide, which is included on the CD as well.

For each language, i.e. English, German and Dutch, the CD-ROM contains detailed information on the orthography (variations in spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word-class specific subcategorisations, argument structures), and word frequency (summed word and lemma counts, based on recent and representative text corpora) of both wordforms and lemmas.

Unique identity numbers allow the linking of information from different files with the aid of an efficient, index-based C-program.
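The idea of linking files through unique identity numbers can be illustrated with a minimal Python sketch. This is not the C-program shipped on the CD. The file layout (identity number in the first field), the backslash field separator, and the identity numbers 1 to 3 are all assumptions made for illustration; only the headwords and frequency counts are taken from the example data in this announcement.

```python
from typing import Dict, Iterable, List


def index_by_id(lines: Iterable[str], sep: str = "\\") -> Dict[str, List[str]]:
    """Build an index from identity number (first field) to the remaining fields.

    The separator and the position of the identity number are assumptions.
    """
    index: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        index[fields[0]] = fields[1:]
    return index


def link(left: Dict[str, List[str]], right: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Join two indexes, keeping only identity numbers present in both."""
    return {key: left[key] + right[key] for key in left if key in right}


# Hypothetical rows: the identity numbers are invented for this sketch.
orthography_rows = ["1\\helfen", "2\\Helfer", "3\\Helm"]
frequency_rows = ["1\\1225", "2\\134", "3\\22"]

linked = link(index_by_id(orthography_rows), index_by_id(frequency_rows))
for identity, fields in linked.items():
    print(identity, fields)
```

Building a dictionary per file first means each lookup during the join is a constant-time operation, which is the general benefit of an index-based approach.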
Like its predecessor, the CD-ROM is mastered using the ISO 9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments. As the new release does not omit any data from the first edition, the current release will replace the old one.

Institutions that have membership in the LDC during the 1995 or 1996 Membership Years will be able to receive CELEX for research purposes only at no additional charge, in the same manner as all other text and speech corpora published by the LDC. Non-members can receive a copy of CELEX for research purposes only for a fee of $150.

If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu, or fax it to (215) 573-2175. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL http://www.cis.upenn.edu/~ldc. More information specific to CELEX can be accessed via hyperlinks from this Home Page. Information is also available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.

A brief overview of the revised German data on the CD is given below:

THE GERMAN DATABASE

When starting to use the German database, the user first has to choose between three so-called `lexicon types':

  - a lemma lexicon
  - a wordform lexicon
  - a corpus type lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma lexicon is the one most similar to an ordinary dictionary since every entry in this lexicon represents a set of related inflected words. In a lexicon, a lemma can be represented by using a headword (cf. traditional dictionary entries) such as, for example, `helfen' (help) or `Hund' (dog), or by a stem such as, for example, `helf' or `Hund'.

The wordform lexicon yields all possible inflected words: every entry in the lexicon is an inflectional variant of the related headword or stem. So, a wordform lexicon contains words like `helfe', `hilft', `geholfen', `huelfe', `Hundes', `Hunde' and so on.

A corpus type lexicon, on the other hand, simply gives you an ordered list of all alphanumeric strings found in the corpus with raw string counts, undisambiguated for relations to either lemmas or wordforms.

For all types of lexicons, the user may subsequently select any number of columns -- from approximately 200 database columns -- combining information on the orthography, phonology, morphology, syntax and frequency of the entries.

LEXICAL DATA, GERMAN

The lexical data that can be selected for each entry in the different German lexicon types can be divided into five categories: orthography, phonology, morphology, syntax and frequency.

Orthography       - with or without diacritics
(spelling)        - with or without word division positions
                  - number of letters/syllables

Phonology         - phonetic transcriptions which use different notations
(pronunciation)     like SAMPA or CPA and include:
                    - syllable boundaries
                    - primary stress markers
                    - consonant-vowel patterns
                  - number of phonemes/syllables

Morphology        - Derivational/compositional:
(word structure)    - division into stems and affixes
                    - flat or hierarchical representations
                  - Inflectional:
                    - stems and their inflections

Syntax            - word class
(grammar)         - subcategorisations per word class

Frequency         - Mannheim frequency(*)

(*) These frequency data are based on the 6 million word corpus compiled by the Institut fuer Deutsche Sprache in Mannheim, Germany.
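The difference between flat and hierarchical morphological representations can be made concrete with a short Python sketch that reads a bracketed segmentation such as `(((hell),(seh)),(er))` (the analysis listed for `Hellseher') into a nested structure and then flattens it. The reading of the bracket notation is an assumption inferred from the example data in this announcement, and the function names are hypothetical; this is not one of the AWK-scripts or tools on the CD.

```python
from typing import List, Tuple, Union

Node = Union[str, List["Node"]]


def _parse_node(text: str, pos: int) -> Tuple[Node, int]:
    """Parse one parenthesised node starting at text[pos]; return (node, next position)."""
    if pos >= len(text) or text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    if pos < len(text) and text[pos] == "(":
        # Complex node: one or more comma-separated child nodes.
        children: List[Node] = []
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        return children, pos + 1
    # Simple node: a bare stem or affix up to the closing parenthesis.
    end = text.find(")", pos)
    if end == -1:
        raise ValueError("unbalanced parentheses")
    return text[pos:end], end + 1


def parse_segmentation(text: str) -> Node:
    """Return the hierarchical representation of a bracketed segmentation."""
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after segmentation")
    return node


def flatten(node: Node) -> List[str]:
    """Return the flat list of stems and affixes, left to right."""
    if isinstance(node, str):
        return [node]
    parts: List[str] = []
    for child in node:
        parts.extend(flatten(child))
    return parts


tree = parse_segmentation("(((hell),(seh)),(er))")
print(tree)           # nested lists: the hierarchical view
print(flatten(tree))  # a single list: the flat view
```

The nested result keeps the information that `hell' and `seh' combine before `er' is attached, while the flat list only records the sequence of morphemes.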
EXAMPLE DATA, GERMAN

An arbitrary query using a small German lemma lexicon (that is, one with very few columns) might yield the following result:

Headword       Pronunciation     Morphology:                M:   Cl  Freq
                                 Structured Segmentation    Cl
-------------  ----------------  -------------------------  ---  --  ----
helfen         "hEl-f@n          (helf)                     V    V   1225
Helfer         "hEl-f@r          ((helf),(er))              Vx   N    134
hellaeugig     "hEl-Oy-gIx       ((hell),(Auge),(ig))       ANx  A      0
hellblau       "hEl-blau         ((hell),(blau))            AA   A     28
Hellseher      "hEl-ze:-@r       (((hell),(seh)),(er))      AVx  N     20
hellseherisch  "hEl-ze:-@-rIS    (((hell),(seh)),(erisch))  AVx  A      0
hellwach       "hEl-vax          ((hell),(((wach),(e))))    AVx  A     13
Helm           "hElm             (Helm)                     N    N     22
Hund           "hUnt             (Hund)                     N    N    364
Huendchen      "hYnt-x@n         ((Hund),(chen))            Nx   N      7
hundekalt      "hUn-d@-kalt      ((Hund),(e),(kalt))        NxA  A      0
hundemuede     "hUn-d@-my:-d@    ((Hund),(e),(muede))       NxA  A      3
Hundeschnauze  "hUn-d@-Snau-ts@  ((Hund),(e),(Schnauze))    NxN  N      1
Hundesteuer    "hUn-d@-StOy-@r   ((Hund),(e),(Steuer))      NxN  N      6
Hundewetter    "hUn-d@-vE-t@r    ((Hund),(e),(Wetter))      NxN  N      0
Huendin        "hYn-dIn          ((Hund),(in))              Nx   N      7
huendisch      "hYn-dIS          ((Hund),(isch))            Nx   A      2
Huene          "hy:-n@           (Huene)                    N    N     13
huenenhaft     "hy:-n@n-haft     ((Huene),(n),(haft))       Nxx  A      4
Hunger         "hU-N@r           (Hunger)                   N    N    102
Hungerkur      "hU-N@r-ku:r      ((Hunger),(Kur))           NN   N      5
Hungerlohn     "hU-N@r-lo:n      ((Hunger),(Lohn))          NN   N      6
hungern        "hU-N@rn          ((Hunger))                 N    V     33
Hungersnot     "hU-N@rs-no:t     ((Hunger),(s),(Not))       NxN  N     23
Hungerstreik   "hU-N@r-Straik    ((Hunger),((streik)))      NV   N     14

Richard Piepenbrock
CELEX Project Manager

-- C E L E X --
-- The Centre for Lexical Information --

Max Planck Institute for Psycholinguistics
Wundtlaan 1
6525 XD NIJMEGEN
The Netherlands

Tel: (+31) (0)24 - 3615797
Fax: (+31) (0)24 - 3521213
E-mail: celex@mpi.nl
WWW-page: http://www.kun.nl/celex/