LINGUIST List 2.93

Sunday, 24 Mar 1991

FYI: Job, CELEX Lexicon Project


Directory

  1. John E. Koontz, Intern position
  2. Robert Kluender, Centre for Lexical Information

Message 1: Intern position

Date: Thu, 21 Mar 91 10:24:16 MST
From: John E. Koontz <koontz@alpha.bldr.nist.gov>
Subject: Intern position

Reposted from Usenet's sci.lang group. Note that I have nothing to do with
this offer myself, and don't want to get inquiries, if possible!

Forwarded message follows:
-----------------
From: jeremy@Apple.COM (jeremy j. b. nguyen)
Subject: Summer Internship Available
Date: 21 Mar 91 00:23:25 GMT
Organization: Apple Computer Inc., Cupertino, CA

Summer intern position with Apple Computer's Advanced Technology 
Group

This year, within the Information Access research group at Apple, 
we will have a summer intern position for which readers of this 
newsgroup might be appropriate. Please pass this information on 
to qualified students and to your colleagues. Resumes and 
inquiries should be directed to me at the electronic or U.S. mail 
addresses given below.


Thank you,

Jeremy Bornstein
Information Access Research
Apple Computer Advanced Technology Group
20525 Mariani Ave. MS 76-2C
Cupertino, CA 95014

Internet: jeremy@apple.com
Applelink: JEREMY
Fax: 408-974-9793

----------------------

Description:

Work with senior researcher to survey the academic and commercial 
state of the art in computer indexing and retrieval of texts in a 
multilingual setting, with particular emphasis on languages 
represented with non-Roman character sets or ideograms. Prototype 
and validate generalized retrieval approaches as part of an ongoing
research program.

Requirements:

Graduate or advanced undergraduate student. Familiarity with 
textual information retrieval literature and technology. C 
programming experience. Preference will be given to candidates 
with reading skills in major non-Roman languages, e.g., Arabic,
Chinese, Hebrew, Hindi, Japanese, Russian -- the more the better.
Experience in linguistics and/or computational linguistics is also 
a plus.

Message 2: Centre for Lexical Information

Date: Thu, 21 Mar 91 12:30:31 PST
From: Robert Kluender <rkluender@UCSD.EDU>
Subject: Centre for Lexical Information

CELEX -- CENTRE FOR LEXICAL INFORMATION

Since 1986, the Dutch national Expertise Centre CELEX (Centre for Lexical
Information) has been constructing large electronic databases containing
various types of lexical data on present-day Dutch, English and German. CELEX
makes this information available to institutes and companies engaged in
language and speech research and in the development of language and speech
oriented technological systems. Using the specially-developed program FLEX,
you can access the databases with ease -- no technical expertise is necessary
-- and extract information which matches the detailed requirements you have.
In addition, CELEX can offer assistance with respect to related research and
development projects.

The lexical data are stored in three separate databases. The Dutch
database is now complete, and contains information on approximately 400,000
present-day Dutch wordforms. The English database currently contains 100,000
wordforms and will soon be extended with another 50,000 wordforms. The
first version of the German database was made available in August 1990
and contains 51,000 wordforms. New information on translation equivalency
is currently being developed, along with additional syntactic and
semantic subcategorizations to establish semantic links between the
three databases. The results of these extensions will be made available
in 1991.

The information contained in all three databases has been derived from
various sources. For the most part, dictionary information has been
combined with frequency data taken from large text corpora. By means of
various manual and automatic procedures, CELEX has checked, improved and
extended the information. On offer now is detailed information on the
orthography (spelling), phonology (pronunciation), morphology (word
structure: inflectional and derivational), syntax (grammar) and frequency
of words. An important feature of the CELEX databases is that all the
information in them has been represented to meet the formal and
strict requirements of computational applications. The data are contained
in a relational DBMS (database management system), a highly flexible tool
for storing, updating and manipulating the data; it also allows users to
make individual selections from the vast quantities of data included. The
CELEX user interface FLEX was specially designed to make it easy for
non-technical people to use the databases. Researchers can log in to CELEX,
create their own particular lexicons using FLEX, and extract the information
for their own use. By selecting specific items from the numerous
possibilities presented in the FLEX menus, and by specifying restrictions on
the selection of words from the databases, you can define and control the
contents of your lexicon.
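
As a rough illustration only (FLEX itself is menu-driven, and nothing below
is its actual interface), the two steps described above, choosing columns
and restricting which words are selected, can be sketched in Python against
an in-memory relational table. The table name, column names and the
function build_lexicon are hypothetical; the rows are taken from the English
lemma sample given later in this announcement.

```python
import sqlite3

# Hypothetical schema; rows copied from the English lemma sample
# later in this announcement (headword, word class, frequency).
ROWS = [
    ("celebrant", "N", 6),
    ("celebration", "N", 201),
    ("cell", "N", 1210),
    ("cellar", "N", 228),
    ("cellular", "A", 21),
]


def build_lexicon(min_freq, word_class):
    """Select two columns, restricted by word class and minimum frequency."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lemma (headword TEXT, word_class TEXT, freq INTEGER)"
    )
    conn.executemany("INSERT INTO lemma VALUES (?, ?, ?)", ROWS)
    cur = conn.execute(
        "SELECT headword, freq FROM lemma "
        "WHERE word_class = ? AND freq >= ? ORDER BY headword",
        (word_class, min_freq),
    )
    result = cur.fetchall()
    conn.close()
    return result


# Nouns occurring at least 100 times in the sample rows.
print(build_lexicon(100, "N"))
```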

LEXICON TYPES

When you begin work with any of the databases, you can normally choose
between two so-called `lexicon types': either a LEMMA LEXICON or a
WORDFORM LEXICON.

Each lexicon type is based on a specific kind of main entry, a lemma or a
wordform. The lemma lexicon is the one most similar to an ordinary dictionary
since each entry refers to a full set of inflected words, dealt with together
under some convenient heading. Dictionaries normally represent lemmas as
headwords: the verb lemma `call' represents all the verbal forms which `call'
can appear as. In the CELEX English database, the lemma is represented by the
conventional dictionary-type headword, while in the Dutch and German
databases you can choose between the conventional headword form and the stem
form. In contrast, entries in a wordform lexicon deal with each individual
flection -- this is where you find `call', `calls', `called', and `calling'. 
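
The distinction can be sketched as a small Python data model. This is only
an illustration under assumed names (Lemma, Wordform, lemma_lexicon and
wordform_lexicon are hypothetical, not CELEX structures): a lemma entry
groups its inflected forms under one headword, while a wordform lexicon
lists each form as an entry of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Wordform:
    spelling: str


@dataclass
class Lemma:
    headword: str
    word_class: str
    wordforms: list = field(default_factory=list)


# The verb lemma `call' from the text, with its four forms.
call = Lemma(
    "call", "V",
    [Wordform(s) for s in ("call", "calls", "called", "calling")],
)


def lemma_lexicon(lemmas):
    """One entry per lemma, as in an ordinary dictionary."""
    return [lemma.headword for lemma in lemmas]


def wordform_lexicon(lemmas):
    """One entry per individual inflected form."""
    return [w.spelling for lemma in lemmas for w in lemma.wordforms]


print(lemma_lexicon([call]))
print(wordform_lexicon([call]))
```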

INFORMATION AVAILABLE

For both lexicon types you can select any number of the 150 columns
available. The table below summarizes the sort of information you could
include in an English lexicon:

 -------------------------------------------------------------------
 Orthography       - with or without diacritics
 (spelling)        - with or without word division positions
                   - alternative spellings
                   - number of letters/syllables

 Phonology         - phonetic transcriptions (using SAMPA notation or
 (pronunciation)     Computer Phonetic Alphabet (CPA) notation) with:
                       - syllable boundaries
                       - primary and secondary stress markers
                   - consonant-vowel patterns
                   - number of phonemes/syllables
                   - alternative pronunciations

 Morphology        - Derivational/compositional:
 (word structure)      - division into stems and affixes
                       - flat or hierarchical representations
                   - Inflectional:
                       - stems and their inflections

 Syntax            - word class
 (grammar)         - subcategorizations per word class

 Frequency         - COBUILD frequency*
 -------------------------------------------------------------------
 *These frequency data are based on the COBUILD corpus (sized 18
  million words) built up by the University of Birmingham, UK.

AN EXAMPLE

If you create a small English lemma lexicon (that is, one with only a few
columns), you might extract information like this from it:

 ----------------------------------------------------------------------
 Headword     Pronunciation     Morphology:          Mor:   Class  Freq
                                Structure            Class
 -----------  ----------------  -------------------  -----  -----  ----
 celebrant    "sE-lI-br@nt      ((celebrate),(ant))  Vx     N         6
 celebration  %sE-lI-"breI-Sn,  ((celebrate),(ion))  Vx     N       201
 cell         "sEl              (cell)               N      N      1210
 cellar       "sE-l@r*          (cellar)             N      N       228
 cellarage    "sE-l@-rIdZ       ((cellar),(age))     Nx     N         0
 cellist      "tSE-lIst         ((cello),(ist))      Nx     N         5
 cello        "tSE-l@U          (cello)              N      N        25
 cellular     "sEl-jU-l@r*      ((cell),(ular))      Nx     A        21
 celluloid    "sEl-jU-lOId      ((cellulose),(oid))  Nx     N        29
 ----------------------------------------------------------------------

Similarly, a small English wordform lexicon giving the flections
associated with the lemmas above might look like this:

 -------------------------------------------------------------------
 Word          Word division    Pronunciation      Class  Type  Freq
 ------------  ---------------  -----------------  -----  ----  ----
 celebrant     cel-e-brant      "sE-lI-br@nt       N      sing     2
 celebrants    cel-e-brants     "sE-lI-br@nts      N      plu      4
 celebration   cel-e-bra-tion   %sE-lI-"breI-Sn,   N      sing   144
 celebrations  cel-e-bra-tions  %sE-lI-"breI-Sn,z  N      plu     57
 cell          cell             "sEl               N      sing   655
 cells         cells            "sElz              N      plu    555
 cellar        cel-lar          "sE-l@r*           N      sing   187
 cellars       cel-lars         "sE-l@z            N      plu     41
 cellarage     cel-lar-age      "sE-l@-rIdZ        N      sing     0
 cellarages    cel-lar-ag-es    "sE-l@-rI-dZIz     N      plu      0
 cellist       cel-list         "tSE-lIst          N      sing     5
 cellists      cel-lists        "tSE-lIsts         N      plu      0
 cello         cel-lo           "tSE-l@U           N      sing    24
 cellos        cel-los          "tSE-l@Uz          N      plu      1
 cellular      cel-lu-lar       "sEl-jU-l@r*       A      pos     21
 celluloid     cel-lu-loid      "sEl-jU-lOId       N      sing    29
 -------------------------------------------------------------------
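
In the two samples above, each lemma's frequency equals the sum of the
frequencies of its wordforms (for example, cell: 655 + 555 = 1210). Whether
this holds across the whole database is not stated here; the following is
just a minimal Python sketch that checks it for the sample rows. The
wordform-to-lemma pairing is entered by hand, and the names WORDFORMS,
LEMMA_FREQ and sum_by_lemma are illustrative only.

```python
from collections import defaultdict

# (wordform, lemma, frequency) from the wordform sample; the lemma
# column is filled in by hand for this illustration.
WORDFORMS = [
    ("celebrant", "celebrant", 2), ("celebrants", "celebrant", 4),
    ("celebration", "celebration", 144), ("celebrations", "celebration", 57),
    ("cell", "cell", 655), ("cells", "cell", 555),
    ("cellar", "cellar", 187), ("cellars", "cellar", 41),
    ("cellarage", "cellarage", 0), ("cellarages", "cellarage", 0),
    ("cellist", "cellist", 5), ("cellists", "cellist", 0),
    ("cello", "cello", 24), ("cellos", "cello", 1),
    ("cellular", "cellular", 21),
    ("celluloid", "celluloid", 29),
]

# Lemma frequencies from the lemma sample.
LEMMA_FREQ = {
    "celebrant": 6, "celebration": 201, "cell": 1210, "cellar": 228,
    "cellarage": 0, "cellist": 5, "cello": 25, "cellular": 21,
    "celluloid": 29,
}


def sum_by_lemma(rows):
    """Add up wordform frequencies per lemma."""
    totals = defaultdict(int)
    for _wordform, lemma, freq in rows:
        totals[lemma] += freq
    return dict(totals)


totals = sum_by_lemma(WORDFORMS)
mismatches = {k: v for k, v in totals.items() if LEMMA_FREQ[k] != v}
print(mismatches)  # empty dict: every sample lemma matches
```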

GETTING AT THE DATABASES

People in the Netherlands can log in to CELEX using SURFnet, the Dutch
academic network. People elsewhere can use the available PSDNs (Packet
Switching Data Networks). In the UK, JANET users connect first to the PSS
gateways in London and Manchester, and then log in to CELEX. In the US,
any of the public PSDNs (TYMNET, AUTONET, or UNINET to name just a few)
can provide direct access to the CELEX machine. In Germany the national PSDN
is called DATEX-P. Most countries have a PSDN which can provide a connection
to let you log in and work with the CELEX databases, and several users outside
the Netherlands have been able to do it -- there are CELEX users in the USA,
the UK, Germany, Belgium and Austria. If, however, the network connections
aren't sufficient, then CELEX can prepare the information you require and
send it on tape.

COSTS AND CONDITIONS

Before access to the databases is provided, a licence agreement between the
user (usually the user's institution) and CELEX is drawn up, which settles
the conditions and rights concerning access to and use of the databases. In
most cases, charges are levied for the use of the database. Since the mention
of money usually causes alarm in academic circles, it's worth stressing that
this is purely a cost-covering exercise to ensure that the system can be
maintained, and that more information can be developed. This is Dutch
government policy at present: state funds enable a central resource to be
set up for the general good in the hope that others who need such resources
will not waste time and money in constructing similar facilities. Once set
up, those facilities are available at a price far lower than the cost of new
development would be. For academic and research purposes, the fees asked are
modest. Naturally when commercial use is made of the data, higher fees are
appropriate.

MORE INFORMATION

If you are interested in finding out more about CELEX, then please get in
touch with us. We can send you copies of our introductory booklet, plus back
issues of the five newsletters so far published, and answer any specific
questions you might have. In many cases a `trial' account can be set up to
let you look round the databases before making any financial commitment.

You can send email:

 CELEX@CELEX.KUN.NL (Internet)
 CELEX@HNYMPI52 (EARN/BITNET),

or write to the following address:

 CELEX -- Centre for Lexical Information
 University of Nijmegen
 Wundtlaan 1
 6525 XD NIJMEGEN
 The Netherlands