Reposted from Usenet's sci.lang group. Note that I have nothing to do with this offer myself, and don't want to get inquiries, if possible! Forwarded message follows:

-----------------
From: jeremy@Apple.COM (jeremy j. b. nguyen)
Subject: Summer Internship Available
Date: 21 Mar 91 00:23:25 GMT
Organization: Apple Computer Inc., Cupertino, CA

Summer intern position with Apple Computer's Advanced Technology Group

This year, within the Information Access research group at Apple, we will have a summer intern position for which readers of this newsgroup might be appropriate. Please pass this information on to qualified students and to your colleagues. Resumes and inquiries should be directed to me at the electronic or U.S. mail addresses given below.

Thank you,

Jeremy Bornstein
Information Access Research
Apple Computer Advanced Technology Group
20525 Mariani Ave. MS 76-2C
Cupertino, CA 95014

Internet: jeremy@apple.com
Applelink: JEREMY
Fax: 408-974-9793

----------------------
Description:

Work with senior researcher to survey the academic and commercial state of the art in computer indexing and retrieval of texts in a multilingual setting, with particular emphasis on languages represented with non-Roman character sets or ideograms. Prototype and validate generalized retrieval approaches as part of an ongoing research program.

Requirements:

Graduate or advanced undergraduate student. Familiarity with textual information retrieval literature and technology. C programming experience. Preference will be given to candidates with reading skills in major non-Roman languages, e.g., Arabic, Chinese, Hebrew, Hindi, Japanese, Russian -- the more the better. Experience in linguistics and/or computational linguistics is also a plus.

CELEX -- CENTRE FOR LEXICAL INFORMATION

Since 1986, the Dutch national Expertise Centre CELEX (Centre for Lexical Information) has been constructing large electronic databases containing various types of lexical data on present-day Dutch, English and German. CELEX makes this information available to institutes and companies engaged in language and speech research and in the development of language and speech oriented technological systems. Using the specially-developed program FLEX, you can access the databases with ease -- no technical expertise is necessary -- and extract information which matches the detailed requirements you have. In addition, CELEX can offer assistance with respect to related research and development projects.

The lexical data are stored in three separate databases. The Dutch database is now complete, and contains information on approximately 400,000 present-day Dutch wordforms. The English database currently contains 100,000 wordforms and will soon be extended with another 50,000 wordforms. The first version of the German database was made available in August 1990 and contains 51,000 wordforms. New information on translation equivalency is currently being developed, along with additional syntactic and semantic subcategorizations to establish semantic links between the three databases. The results of these extensions will be made available in 1991.

The information contained in all three databases has been derived from various sources. For the most part, dictionary information has been combined with frequency data taken from large text corpora. By means of various manual and automatic procedures, CELEX has checked, improved and extended the information. On offer now is detailed information on the orthography (spelling), phonology (pronunciation), morphology (word structure: inflectional and derivational), syntax (grammar) and frequency of words.

An important feature of the CELEX databases is that all the information in them has been represented to meet the formal and strict requirements of computational applications. The data are contained in a relational DBMS (database management system), a highly flexible tool for storing, updating and manipulating the data; it also allows users to make individual selections from the vast quantities of data included.

The CELEX user interface FLEX was specially designed to make it easy for non-technical people to use the databases. Researchers can log in to CELEX, create their own particular lexicons using FLEX, and extract the information for their own use. By selecting specific items from the numerous possibilities presented in the FLEX menus, and by specifying restrictions on the selection of words from the databases, you can define and control the contents of your lexicon.

LEXICON TYPES

When you begin work with any of the databases, you can normally choose between two so-called `lexicon types': either a LEMMA LEXICON or a WORDFORM LEXICON. Each lexicon type is based on a specific kind of main entry, a lemma or a wordform. The lemma lexicon is the one most similar to an ordinary dictionary since each entry refers to a full set of inflected words, dealt with together under some convenient heading. Dictionaries normally represent lemmas as headwords: the verb lemma `call' represents all the verbal forms which `call' can appear as. In the CELEX English database, the lemma is represented by the conventional dictionary-type headword, while in the Dutch and German databases you can choose between the conventional headword form and the stem form. In contrast, entries in a wordform lexicon deal with each individual flection -- this is where you find `call', `calls', `called', and `calling'.

INFORMATION AVAILABLE

For both lexicon types you can select any number of columns from the 150 columns available for each lexicon type.

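To make the idea of defining a lexicon by choosing columns and specifying restrictions more concrete, here is a minimal sketch in Python. It is not FLEX and does not reflect the actual CELEX schema: the column names (`word`, `lemma`, `class`), the sample rows and both function names are assumptions made purely for illustration, using the `call' forms mentioned above.

```python
from collections import defaultdict

# Hypothetical wordform rows; the column names are illustrative only
# and are not taken from the CELEX databases.
WORDFORMS = [
    {"word": "call", "lemma": "call", "class": "V"},
    {"word": "calls", "lemma": "call", "class": "V"},
    {"word": "called", "lemma": "call", "class": "V"},
    {"word": "calling", "lemma": "call", "class": "V"},
]


def build_lexicon(rows, columns, restriction=lambda row: True):
    """Keep the rows that pass the restriction, projected onto the chosen columns."""
    return [{c: row[c] for c in columns} for row in rows if restriction(row)]


def lemma_lexicon(rows):
    """Gather the wordforms under their lemma, one entry per headword."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["lemma"]].append(row["word"])
    return dict(grouped)


if __name__ == "__main__":
    # Wordform lexicon: one entry per flection, restricted to words
    # longer than four letters, with two columns selected.
    print(build_lexicon(WORDFORMS, ["word", "class"], lambda r: len(r["word"]) > 4))
    # Lemma lexicon: all flections gathered under a single headword.
    print(lemma_lexicon(WORDFORMS))
```

The restriction is passed as a function so that any condition on a row can be expressed; in a relational DBMS the same selection would more naturally be a query with a column list and a WHERE clause.
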
The table below summarizes the sort of information you could include in an English lexicon:

-------------------------------------------------------------------
Orthography       - with or without diacritics
(spelling)        - with or without word division positions
                  - alternative spellings
                  - number of letters/syllables

Phonology         - phonetic transcriptions (using SAMPA notation or
(pronunciation)     Computer Phonetic Alphabet (CPA) notation) with:
                      - syllable boundaries
                      - primary and secondary stress markers
                  - consonant-vowel patterns
                  - number of phonemes/syllables
                  - alternative pronunciations

Morphology        - Derivational/compositional:
(word structure)      - division into stems and affixes
                      - flat or hierarchical representations
                  - Inflectional:
                      - stems and their inflections

Syntax            - word class
(grammar)         - subcategorizations per word class

Frequency         - COBUILD frequency*
-------------------------------------------------------------------
*These frequency data are based on the COBUILD corpus (sized
 18 million words) built up by the University of Birmingham, UK

AN EXAMPLE

If you create a small English Lemma lexicon (that is, one with only a few columns), you might extract information like this from it:

-----------------------------------------------------------------------
Headword     Pronunciation     Morphology:          Mor:   Class  Freq
                               Structure            Class
-----------  ----------------  -------------------  -----  -----  ----
celebrant    "sE-lI-br@nt      ((celebrate),(ant))  Vx     N         6
celebration  %sE-lI-"breI-Sn,  ((celebrate),(ion))  Vx     N       201
cell         "sEl              (cell)               N      N      1210
cellar       "sE-l@r*          (cellar)             N      N       228
cellarage    "sE-l@-rIdZ       ((cellar),(age))     Nx     N         0
cellist      "tSE-lIst         ((cello),(ist))      Nx     N         5
cello        "tSE-l@U          (cello)              N      N        25
cellular     "sEl-jU-l@r*      ((cell),(ular))      Nx     A        21
celluloid    "sEl-jU-lOId      ((cellulose),(oid))  Nx     N        29
-----------------------------------------------------------------------

Similarly, a small English wordforms lexicon giving the flections associated with the lemmas above might look like this:

--------------------------------------------------------------------
Word          Word division    Pronunciation      Class  Type  Freq
------------  ---------------  -----------------  -----  ----  ----
celebrant     cel-e-brant      "sE-lI-br@nt       N      sing     2
celebrants    cel-e-brants     "sE-lI-br@nts      N      plu      4
celebration   cel-e-bra-tion   %sE-lI-"breI-Sn,   N      sing   144
celebrations  cel-e-bra-tions  %sE-lI-"breI-Sn,z  N      plu     57
cell          cell             "sEl               N      sing   655
cells         cells            "sElz              N      plu    555
cellar        cel-lar          "sE-l@r*           N      sing   187
cellars       cel-lars         "sE-l@z            N      plu     41
cellarage     cel-lar-age      "sE-l@-rIdZ        N      sing     0
cellarages    cel-lar-ag-es    "sE-l@-rI-dZIz     N      plu      0
cellist       cel-list         "tSE-lIst          N      sing     5
cellists      cel-lists        "tSE-lIsts         N      plu      0
cello         cel-lo           "tSE-l@U           N      sing    24
cellos        cel-los          "tSE-l@Uz          N      plu      1
cellular      cel-lu-lar       "sEl-jU-l@r*       A      pos     21
celluloid     cel-lu-loid      "sEl-jU-lOId       N      sing    29
--------------------------------------------------------------------

GETTING AT THE DATABASES

People in the Netherlands can log in to CELEX using SURFnet, the Dutch academic network. People elsewhere can use the available PSDNs (Packet Switching Data Networks). In the UK, JANET users connect first to the PSS gateways in London and Manchester, and then log in to CELEX. In the US, any of the public PSDNs (TYMNET, AUTONET, or UNINET to name just a few) can provide direct access to the CELEX machine. In Germany the national PSDN is called DATEX-P. Most countries have a PSDN which can provide a connection to let you log in and work with the CELEX databases, and several users outside the Netherlands have been able to do it -- there are CELEX users in the USA, the UK, Germany, Belgium and Austria. If, however, the network connections aren't sufficient, then CELEX can prepare the information you require and send it on tape.

COSTS AND CONDITIONS

Before access to the databases is provided, a licence agreement between the user (usually the user's institution) and CELEX is drawn up, which settles the conditions and rights concerning access to and use of the databases. In most cases, charges are levied for the use of the database. Since the mention of money usually causes alarm in academic circles, it's worth stressing that this is purely a cost-covering exercise to ensure that the system can be maintained, and that more information can be developed. This is Dutch government policy at present: state funds enable a central resource to be set up for the general good in the hope that others who need such resources will not waste time and money in constructing similar facilities. Once set up, those facilities are available at a price far lower than the cost of new development would be. For academic and research purposes, the fees asked are modest. Naturally when commercial use is made of the data, higher fees are appropriate.

MORE INFORMATION

If you are interested in finding out more about CELEX, then please get in touch with us. We can send you copies of our introductory booklet, plus back issues of the five newsletters so far published, and answer any specific questions you might have. In many cases a `trial' account can be set up to let you look round the databases before making any financial commitment.

You can send email:

    CELEX@CELEX.KUN.NL (Internet)
    CELEX@HNYMPI52 (EARN/BITNET),

or write to the following address:

    CELEX -- Centre for Lexical Information
    University of Nijmegen
    Wundtlaan 1
    6525 XD NIJMEGEN
    The Netherlands