Editor for this issue: Lydia Grebenyova <lydia@linguistlist.org>
___________________________________________________________

ELRA
European Language Resources Association

ELRA News
___________________________________________________________

*** ELRA NEW RESOURCES ***

We are happy to announce new resources available via ELRA:

ELRA-W0020 ICE-GB (British English component of the International Corpus of English)
ELRA-S0077 Telephone Speech Data Collection for Czech
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000

A description of each database is given below.

_______________________________________
ELRA-W0020 ICE-GB (British English component of the International Corpus of English)
_______________________________________

ICE-GB is the British component of the International Corpus of English (ICE). ICE began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English.

ICE-GB is fully grammatically analysed. Like all the ICE corpora, ICE-GB consists of a million words of spoken and written English and adheres to the common corpus design. 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, allowing complex and detailed searches across the whole corpus. ICE-GB contains 83,394 parse trees, including 59,640 in the spoken part of the corpus.

ICE-GB has been fully checked. Linguists checked it at several stages of its completion, using both a traditional 'post-checking' strategy and cross-sectional error-based searches.

ICE-GB is distributed with the retrieval software ICECUP (the International Corpus of English Corpus Utility Program). ICECUP supports a variety of query types, including the use of the parse analyses to construct Fuzzy Tree Fragments to search the corpus.

_______________________________________
ELRA-S0077 Telephone Speech Data Collection for Czech
_______________________________________

This database contains speech collected in the Czech Republic during summer 1999. The collection was performed at the Institute of Radioelectronics of Brno University of Technology, Faculty of Electrical Engineering and Computer Sciences (VUT Brno) and at the Department of Circuit Theory of Czech Technical University in Prague, Faculty of Electrical Engineering (CVUT Prague), at the request of Siemens AG, Corporate Technology, Munich.

The database comprises telephone recordings from 1227 speakers (590 males and 637 females) recorded directly over the fixed telephone network using an ISDN interface. Speech files are stored as sequences of 8-bit, 8 kHz A-law uncompressed speech samples. Each prompted utterance is stored in a separate file. Each speech file has an accompanying ASCII SAM label file according to the specifications of the SpeechDat project (http://www.speechdat.com).

Corpus contents:
  connected digits (prompt sheet number, telephone number, credit card number);
  sequences of isolated digits (5 digits);
  answers to yes/no questions;
  common application words and phrases.

The following age distribution has been obtained: 36 speakers are below 16 years old, 537 speakers are between 16 and 30, 306 speakers are between 31 and 45, 259 speakers are between 46 and 60, 88 speakers are over 60, and the age of 1 speaker is unknown.

The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. SpeechDat conventions were used in this database.

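For readers who want to inspect such recordings, the 8-bit A-law samples described above can be expanded to 16-bit linear PCM. The following is a minimal sketch of the standard G.711 A-law expansion in Python. It assumes the speech files are headerless streams of one A-law byte per sample; the function names and the idea of reading a whole file into memory are illustrative assumptions, not part of the ELRA distribution.

```python
def alaw_to_pcm16(byte):
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear value."""
    a = byte ^ 0x55              # undo the even-bit inversion used by A-law
    t = (a & 0x0F) << 4          # 4-bit mantissa, shifted into position
    seg = (a & 0x70) >> 4        # 3-bit segment (exponent)
    if seg == 0:
        t += 8                   # first segment: linear, half-step offset
    else:
        t += 0x108               # add the implicit leading bit plus offset
        if seg > 1:
            t <<= seg - 1        # scale according to the segment
    return t if (a & 0x80) else -t   # sign bit set means positive


def decode_alaw(data):
    """Decode a bytes object of A-law samples into a list of PCM integers."""
    table = [alaw_to_pcm16(b) for b in range(256)]   # 256-entry lookup table
    return [table[b] for b in data]


def read_alaw_file(path):
    """Read a raw A-law file (assumed headerless) and return PCM samples."""
    with open(path, "rb") as handle:
        return decode_alaw(handle.read())
```

A lookup table is used because there are only 256 possible codes, so each sample costs a single index operation once the table is built.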
______________________________________
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
_______________________________________

The Finnish SpeechDat(II) FDB-1000 and FDB-4000 databases comprise 1000 and 4000 Finnish speakers respectively, recorded over the Finnish fixed telephone network. The SpeechDat databases were collected and annotated by the Tampere University of Technology's Digital Media Institute.

The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.

Speech is stored as sequences of 8-bit, 8 kHz A-law samples. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
  1 isolated digit;
  1 sequence of 10 isolated digits;
  4 numbers: 1 sheet number (5 digits), 1 telephone number (9-10 digits), 1 credit card number (16 digits), 1 PIN code (6 digits);
  1 currency money amount;
  1 natural number;
  3 dates: 1 spontaneous date (birthdate), 1 prompted date, 1 relative or general date expression;
  2 time phrases: 1 time of day (spontaneous), 1 time phrase;
  3 spelled words: 1 spontaneous own forename, 1 city name, 1 phonetically rich word;
  5 directory assistance names: 1 spontaneous own forename, 1 spontaneous city of growing up, 1 frequent city name, 1 frequent company name, 1 common forename and surname;
  2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question;
  3 application words;
  1 word spotting phrase using an embedded application word;
  4 phonetically rich words;
  9 phonetically rich sentences.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

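Because the format stated above is one 8-bit sample at 8 kHz, each second of speech occupies 8000 bytes, so the length of an utterance can be estimated from its file size alone. The sketch below shows that arithmetic in Python. It assumes the signal files carry no header (the descriptive information sits in the separate label file); the function names are hypothetical.

```python
import os

SAMPLE_RATE_HZ = 8000    # from the announcement: 8 kHz sampling
BYTES_PER_SAMPLE = 1     # from the announcement: 8-bit A-law


def duration_seconds(num_bytes):
    """Duration of a raw A-law signal with the given size in bytes."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def file_duration_seconds(path):
    """Duration of a signal file, assuming it is headerless raw A-law."""
    return duration_seconds(os.path.getsize(path))


# A 24000-byte utterance lasts three seconds.
print(duration_seconds(24000))  # 3.0
```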
______________________________________
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000
______________________________________

The Finnish-Swedish SpeechDat(II) FDB-1000 comprises 1000 Finnish speakers uttering SpeechDat items in the variant of Swedish spoken in Finland, recorded over the Finnish fixed telephone network. The SpeechDat database was collected and annotated by the Tampere University of Technology's Digital Media Institute. The FDB-1000 database is partitioned into 4 CDs: 3 CDs each comprise 300 speaker sessions, and the 4th comprises 100.

The speech databases made within the SpeechDat(II) project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat format and content specifications.

Speech is stored as sequences of 8-bit, 8 kHz A-law samples. Each prompted utterance is stored in a separate file. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

Each speaker uttered the following items:
  1 isolated digit;
  1 sequence of 10 isolated digits;
  4 numbers: 1 sheet number (5 digits), 1 telephone number (9-10 digits), 1 credit card number (16 digits), 1 PIN code (6 digits);
  1 currency money amount;
  1 natural number;
  3 dates: 1 spontaneous date (birthdate), 1 prompted date, 1 relative or general date expression;
  2 time phrases: 1 time of day (spontaneous), 1 time phrase;
  3 spelled words: 1 spontaneous own forename, 1 city name, 1 phonetically rich word;
  5 directory assistance names: 1 spontaneous own forename, 1 spontaneous city of growing up, 1 frequent city name, 1 frequent company name, 1 common forename and surname;
  2 yes/no questions: 1 predominantly "yes" question, 1 predominantly "no" question;
  6 application words;
  1 word spotting phrase using an embedded application word;
  4 phonetically rich words;
  9 phonetically rich sentences.

The following age distribution has been obtained: 178 speakers are below 16 years old, 412 speakers are between 16 and 30, 216 speakers are between 31 and 45, 160 speakers are between 46 and 60, and 34 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

=====================================
For further information, please contact:

ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
E-mail: mapelli@elda.fr

or visit our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or
http://www.elda.fr
=====================================

Summer 2000 at Ohio State University

Spoken Language in Context: Methods and Models

During July of 2000, the Department of Linguistics at the Ohio State University will be offering a unique combination of short courses aimed at exploring spoken language, with a particular focus on the empirical study of naturally-occurring speech through various instrumental, quantitative, and analytic means.

Scholars, researchers (industry or academic), and students are invited to join us for an intense and rewarding summer session.

Course offerings:

Laboratory Phonology - Mary Beckman
Quantitative Methods - Michael Broe
Field Phonetics - Keith Johnson
Historical Phonology - Brian Joseph & Richard Janda
Practicum in English Intonation - Julia McGory
The Pragmatics of Focus - Craige Roberts

For more information see the website:
http://ling.ohio-state.edu/SU2000