LINGUIST List 4.1005

Mon 29 Nov 1993

FYI: publication lists & CELEX-cd available

Editor for this issue:


  1. Jose Camacho, USC Linguistics dissertations available
  2. New GLSA Publications List Available!
  3. Richard Piepenbrock, CELEX English, German and Dutch lexical data on CD-ROM

Message 1: USC Linguistics dissertations available

Date: Thu, 4 Nov 1993 16:32:18 -0800 (PST)
From: Jose Camacho <>
Subject: USC Linguistics dissertations available

 GSIL Publications List
 Effective November 15, 1993

 Graduate Students in Linguistics (G.S.I.L.)
 Department of Linguistics
 University of Southern California
 Los Angeles, CA 90089-1693 U.S.A.

Titles available:

1 Authier, J-M. Syntax of unselective binding (1988)
2 Franco, J. On object agreement in Spanish (1993)
3 Heggie, L. Syntax of copular structures (1988)

Titles available shortly:

4 Katada, F. The representation of anaphoric relations in
 logical form (1990)

For e-mail information, please contact

Message 2: New GLSA Publications List Available!

Date: Mon, 25 Oct 1993 18:35:29 -0400
From: <>
Subject: New GLSA Publications List Available!

The updated publication list of the Graduate Linguistic Student Association
(GLSA) of the University of Massachusetts, Amherst is available online in
three different ways:

1) A short version of the list (without full tables of contents) is available
 on the Linguist List Listserver;
2) A long version of the list (with tables of contents) is available by
 anonymous ftp to
 in the directory /linguistics/papers/available
3) Both versions of the Publications List are available by emailing

Message 3: CELEX English, German and Dutch lexical data on CD-ROM

Date: Thu, 28 Oct 1993 14:55 +0100 (MET)
From: Richard Piepenbrock <>
Subject: CELEX English, German and Dutch lexical data on CD-ROM

This message announces the release of a CD-ROM with lexical data from
the Dutch Centre for Lexical Information (CELEX). The disc can be
obtained from the Linguistic Data Consortium (LDC).


The CD-ROM, which contains the CELEX lexical databases of English
(version 2.5), Dutch (version 3.1) and German (version 2.0), is now
available for research purposes from the Linguistic Data Consortium
for $150. For each language, the CD-ROM contains detailed information
on the orthography (variations in spelling, hyphenation), the
phonology (phonetic transcriptions, variations in pronunciation,
syllable structure, primary stress), the morphology (derivational and
compositional structure, inflectional paradigms), the syntax (word
class, word-class specific subcategorisations, argument structures),
and word frequency (summed word and lemma counts, based on recent and
representative text corpora) of both wordforms and lemmas (English:
52446 lemmas, 160594 wordforms; German: 50708 lemmas, 359611
wordforms; Dutch: 124136 lemmas, 381292 wordforms). PostScript files
describe the available lexical information in detail.

The original CELEX databases can be consulted interactively either by
using the SQL*PLUS query language within an ORACLE RDBMS environment,
or by means of the specially designed user interface FLEX. The
databases on this CD-ROM have not been tailored to fit any particular
database management program. Instead, the information is presented in
a series of plain ASCII files in a UNIX directory tree that can be
queried with tools such as AWK or ICON. Unique identity numbers allow
the linking of information from different files. As in the original
databases, some kinds of information have to be computed on-line.
Wherever necessary, AWK functions have been provided to recover this
information. README files specify the details of their use.
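
The linking by identity number described above can be sketched in
Python instead of AWK. This is only an illustration: the field
separator, the position of the identity number and the sample records
are all assumptions, and the README files on the disc remain the
authority on the actual layout.

```python
import io

# Assumption: one record per line, fields separated by a backslash,
# identity number in the first field. Check the README files for the
# real layout before running this against the CD-ROM data.
FIELD_SEP = "\\"


def read_records(handle, sep=FIELD_SEP):
    """Return {identity number: remaining fields} for one ASCII file."""
    records = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, *fields = line.split(sep)
        records[ident] = fields
    return records


def link(left, right):
    """Join two record tables on the identity numbers they share."""
    return {
        ident: left[ident] + right[ident]
        for ident in left
        if ident in right
    }


# Hypothetical sample data standing in for two files on the disc.
orthography = io.StringIO("1\\cell\n2\\cellar\n3\\cello\n")
frequency = io.StringIO("1\\1210\n2\\228\n")

linked = link(read_records(orthography), read_records(frequency))
for ident, fields in linked.items():
    print(ident, *fields)
```

Records whose identity number appears in only one of the two files
(number 3 here) are dropped, which corresponds to an inner join.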

The CD-ROM is mastered using the ISO 9660 data format, with the Rock
Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh (*)
and UNIX environments.

Anyone who would like to purchase the CD-ROM should send a check or
purchase order made payable to the "Trustees of the University of
Pennsylvania" to

 Judith Storniolo
 Administrative Assistant, LDC
 Linguistic Data Consortium
 441 Williams Hall
 University of Pennsylvania
 Philadelphia, PA 19104-6305
 Tel: +1/215/898-0464 Fax: +1/215/573-2175

(*) A Mac with a CD-ROM drive obtained before 12/92, and with no system
upgrades installed since that date, will not be able to read the CELEX
CD-ROM. In that case, all that is needed is to obtain the upgraded
driver software (a very small amount of code) and copy it onto the
system in place of the existing driver. The upgrade can be obtained as
follows:

 Connect to ftp server:
 Go to directory: dts/mac/sys.soft/cdrom
 Get file: cd-rom-setup


Further details concerning the lexical databases can be obtained by
anonymous ftp from the LDC as follows:

 connect to:
 go to directory: pub/ldc
 set transfer mode: binary
 get file:

This file, which corresponds to Chapter 1 of the CELEX User Guide written
by Gavin Burnage and which is subject to CELEX copyright, can be
decompressed and sent to a PostScript-capable printer. The document
should answer most questions regarding the content and use of CELEX.

Persons outside of Europe who are interested in CELEX, but are unable
to retrieve and print the introductory text themselves, may request a
hard copy of the document from the LDC.

Persons in Europe who want a hard copy of the document mailed to
them, and anyone who still has technical questions after reading the
document, should direct their inquiries to:

 Richard Piepenbrock
 CELEX Project Manager
 Max-Planck-Institut fuer Psycholinguistik
 Wundtlaan 1
 The Netherlands

 Tel: (+31) (0)80 - 615797
 Fax: (+31) (0)80 - 521213

 EARN/BITNET: celexhnympi51
 SURFNET: celex::celexmail

Apart from making the introductory text freely available, the LDC is
not equipped to answer detailed technical questions about the CELEX
CD-ROM. Please contact the LDC only if you need assistance in obtaining
the document or would like to purchase the disc.


When starting to use the English database, the user first has to
choose between two so-called `lexicon types':

 - a lemma lexicon
 - a wordform lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma lexicon
is the one most similar to an ordinary dictionary since every entry in
this lexicon represents a set of related inflected words. In a lexicon, a
lemma can be represented by using a headword (cf. traditional dictionary
entries) such as, for example, `call' or `cat'. The wordform lexicon
yields all possible inflected words: every entry in the lexicon is an
inflectional variant of the related headword or stem. So, a wordform
lexicon contains words like `call', `calls', `calling', `called', `cat',
`cats' and so on.
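
The relation between the two lexicon types can be pictured with a small
Python sketch. The class and function names are invented for this
illustration and do not correspond to anything on the CD-ROM; the
entries are the `call' and `cat' examples from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wordform:
    spelling: str
    lemma: str  # headword of the lemma this form belongs to


# Wordform entries taken from the examples in the text.
WORDFORMS = [
    Wordform("call", "call"),
    Wordform("calls", "call"),
    Wordform("calling", "call"),
    Wordform("called", "call"),
    Wordform("cat", "cat"),
    Wordform("cats", "cat"),
]


def lemma_lexicon(wordforms):
    """Collapse a wordform lexicon into {headword: [inflected forms]}."""
    lexicon = {}
    for form in wordforms:
        lexicon.setdefault(form.lemma, []).append(form.spelling)
    return lexicon


print(lemma_lexicon(WORDFORMS))
```

Six wordform entries collapse into two lemma entries, one per headword.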

For both types of lexicons, the user may subsequently select any number
of columns -- from approximately 150 database columns -- combining
information on the orthography, phonology, morphology, syntax and
frequency of the entries. The information sheet `Lexical Data, English'
summarizes the types of information available. An exhaustive overview of
the columns available is given in the CELEX User Guide.


The lexical data that can be selected for each entry in the different
English lexicon types can be divided into five categories: orthography,
phonology, morphology, syntax and frequency. In a separate section,
example data are given for each of these categories.

Orthography       - with or without diacritics
(spelling)        - with or without word division positions
                  - alternative spellings
                  - number of letters/syllables

Phonology         - phonetic transcriptions (using SAMPA notation or
(pronunciation)     Computer Phonetic Alphabet (CPA) notation) with:
                    - syllable boundaries
                    - primary and secondary stress markers
                  - consonant-vowel patterns
                  - number of phonemes/syllables
                  - alternative pronunciations

Morphology        - Derivational/compositional:
(word structure)    - division into stems and affixes
                    - flat or hierarchical representations
                  - Inflectional:
                    - stems and their inflections

Syntax            - word class
(grammar)         - subcategorisations per word class

Frequency         - COBUILD frequency(*)
(*) These frequency data are based on the COBUILD corpus (18 million
words) built up by the University of Birmingham, Great Britain.


An arbitrary query using a small English lemma lexicon (that is, one with
very few columns) might yield the following result:

Headword     Pronunciation     Morphology:          M:  Cl  Freq
                               Structure            Cl
-----------  ----------------  -------------------  --  --  ----
celebrant    "sE-lI-brnt       ((celebrate),(ant))  Vx  N      6
celebration  %sE-lI-"breI-Sn,  ((celebrate),(ion))  Vx  N    201
cell         "sEl              (cell)               N   N   1210
cellar       "sE-lr*           (cellar)             N   N    228
cellarage    "sE-l-rIdZ        ((cellar),(age))     Nx  N      0
cellist      "tSE-lIst         ((cello),(ist))      Nx  N      5
cello        "tSE-lU           (cello)              N   N     25
cellular     "sEl-jU-lr*       ((cell),(ular))      Nx  A     21
celluloid    "sEl-jU-lOId      ((cellulose),(oid))  Nx  N     29
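
The flat morphological structures in the table above can be split into
their constituents with a few lines of Python. This is a sketch for the
flat notation only; hierarchical representations would need a proper
bracket parser. Pairing each constituent with one letter of the
morphology class code is an assumption that happens to fit every row
shown here, not something the announcement states.

```python
import re

# Matches one innermost bracketed constituent, e.g. "(cell)".
CONSTITUENT = re.compile(r"\(([^()]+)\)")


def constituents(structure):
    """Split a flat structure such as '((cell),(ular))' into its parts."""
    return CONSTITUENT.findall(structure)


def labelled(structure, classes):
    """Pair each constituent with one letter of the class code.

    Assumption: the code carries one letter per constituent, which holds
    for every row of the table above but is not stated in the text.
    """
    parts = constituents(structure)
    if len(parts) != len(classes):
        raise ValueError(f"{structure!r} does not match code {classes!r}")
    return list(zip(parts, classes))


print(constituents("(cell)"))
print(labelled("((celebrate),(ant))", "Vx"))
```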

An example selection from a small English wordform lexicon, showing the
inflectional variants of the headwords given in the previous example, is
presented in the next table:

Word          Word division    Pronunciation      Cl  Type  Freq
------------  ---------------  -----------------  --  ----  ----
celebrant     cel-e-brant      "sE-lI-brnt        N   sing     2
celebrants    cel-e-brants     "sE-lI-brnts       N   plu      4
celebration   cel-e-bra-tion   %sE-lI-"breI-Sn,   N   sing   144
celebrations  cel-e-bra-tions  %sE-lI-"breI-Sn,z  N   plu     57
cell          cell             "sEl               N   sing   655
cells         cells            "sElz              N   plu    555
cellar        cel-lar          "sE-lr*            N   sing   187
cellars       cel-lars         "sE-lz             N   plu     41
cellarage     cel-lar-age      "sE-l-rIdZ         N   sing     0
cellarages    cel-lar-ag-es    "sE-l-rI-dZIz      N   plu      0
cellist       cel-list         "tSE-lIst          N   sing     5
cellists      cel-lists        "tSE-lIsts         N   plu      0
cello         cel-lo           "tSE-lU            N   sing    24
cellos        cel-los          "tSE-lUz           N   plu      1
cellular      cel-lu-lar       "sEl-jU-lr*        A   pos     21
celluloid     cel-lu-loid      "sEl-jU-lOId       N   sing    29
