LINGUIST List 11.472

Sun Mar 5 2000

FYI: New Resources/ELRA, Summer: Spoken Lang/Context

Editor for this issue: Lydia Grebenyova <lydia@linguistlist.org>


Directory

  1. Valerie Mapelli, New Resources/ European Lang Resources Association (ELRA)
  2. Keith Johnson, Summer Session: Spoken Language in Context/ Ohio State University

Message 1: New Resources/ European Lang Resources Association (ELRA)

Date: March 2, 2000 14:30:26 +0100
From: Valerie Mapelli <mapelli@elda.fr>
Subject: New Resources/ European Lang Resources Association (ELRA)


___________________________________________________________
				ELRA
		European Language Resources Association
			 ELRA News 
___________________________________________________________


		 *** ELRA NEW RESOURCES ***


We are happy to announce new resources available via ELRA:

ELRA-W0020 ICE-GB (British English component of the 
International Corpus of English)
ELRA-S0077 Telephone Speech Data Collection for Czech
ELRA-S0078 Finnish SpeechDat(II) FDB-1000
ELRA-S0079 Finnish SpeechDat(II) FDB-4000
ELRA-S0080 Finnish-Swedish SpeechDat(II) FDB-1000

A description of each database is given below.

_______________________________________
ELRA-W0020 ICE-GB (British English component of 
the International Corpus of English)
_______________________________________

ICE-GB is the British component of the International Corpus 
of English (ICE). ICE began in 1990 with the primary aim 
of providing material for comparative studies of varieties of 
English throughout the world. Twenty centres around the 
world are preparing corpora of their own national or regional 
variety of English.

ICE-GB is fully grammatically analysed. Like all the ICE 
corpora, ICE-GB consists of a million words of spoken and 
written English and adheres to the common corpus design. 
200 written and 300 spoken texts make up the million words. 
Every text is grammatically annotated, allowing complex and 
detailed searches across the whole corpus. 

ICE-GB contains 83,394 parse trees, including 59,640 in 
the spoken part of the corpus.

ICE-GB has been fully checked. It was checked by linguists 
at several stages in its completion, using both a traditional 
`post-checking' strategy and cross-sectional error-based 
searches. 

ICE-GB is distributed with the retrieval software ICECUP 
(the International Corpus of English Corpus Utility Program). 
ICECUP supports a variety of query types, including the use 
of the parse analyses to construct Fuzzy Tree Fragments to 
search the corpus.

_______________________________________
ELRA-S0077 Telephone Speech Data Collection for Czech
_______________________________________

This database contains speech collected in the Czech Republic 
during summer 1999. The collection was performed at the 
Institute of Radioelectronics of Brno University of 
Technology, Faculty of Electrical Engineering and Computer 
Sciences (VUT Brno) and at the Department of Circuit 
Theory of Czech Technical University in Prague, Faculty of 
Electrical Engineering (CVUT Prague) at the request of 
Siemens AG, Corporate Technology, Munich. The database 
comprises telephone recordings from 1227 speakers (590 
males and 637 females) recorded directly over the fixed 
telephone network using an ISDN interface.

Speech files are stored as sequences of 8-bit 8 kHz A-law 
uncompressed speech samples. Each prompted utterance 
is stored in a separate file. Each speech file has an 
accompanying ASCII SAM label file according to the 
specifications of the SpeechDat project 
(URL: http://www.speechdat.com).
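
For readers who want to inspect the audio, the following is a
minimal Python sketch of expanding 8-bit A-law samples to 16-bit
linear PCM using the standard ITU-T G.711 A-law expansion. It
assumes the speech files are raw and headerless, with one byte
per sample at 8 kHz; the function names are illustrative and are
not part of the database or of any SpeechDat tool.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as stated in the database description


def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code word to a signed 16-bit PCM value."""
    a = byte ^ 0x55              # undo the even-bit inversion used by G.711
    t = (a & 0x0F) << 4          # 4-bit mantissa
    seg = (a & 0x70) >> 4        # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    # After the XOR, a set sign bit marks a positive sample.
    return t if a & 0x80 else -t


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte string into a list of PCM samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law recording, at one byte per sample."""
    return len(data) / SAMPLE_RATE_HZ
```

Under these assumptions, the bytes read from a speech file opened
in binary mode can be passed directly to decode_alaw().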

Corpus contents: connected digits (prompt sheet number, 
telephone number, credit card number); sequences of 
isolated digits (5 digits); answers to yes/no questions; 
common application words and phrases.

The following age distribution has been obtained: 36 
speakers are below 16 years old, 537 speakers are between 
16 and 30, 306 speakers are between 31 and 45, 259 
speakers are between 46 and 60, 88 speakers are over 60, 
and the age of 1 speaker is unknown.
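
The age bands listed above can be checked against the 1227
speakers reported for this database. The short Python sketch
below simply sums the published figures; the variable names are
illustrative.

```python
# Figures copied from the ELRA-S0077 description.
age_bands = {
    "below 16": 36,
    "16-30": 537,
    "31-45": 306,
    "46-60": 259,
    "over 60": 88,
    "unknown": 1,
}
speakers_by_sex = {"male": 590, "female": 637}

total_by_age = sum(age_bands.values())
total_by_sex = sum(speakers_by_sex.values())

print(total_by_age, total_by_sex)  # both totals are 1227

# Share of each age band, as a percentage of all speakers.
for band, count in age_bands.items():
    print(f"{band:>8}: {100 * count / total_by_age:.1f}%")
```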

The transcription included in this database is an 
orthographic, lexical transcription with a few details that 
represent audible acoustic events (speech and non-speech)
present in the corresponding waveform files. SpeechDat 
conventions were used in this database. 

______________________________________
ELRA-S0078 Finnish SpeechDat(II) FDB-1000
ELRA-S0079 Finnish SpeechDat(II) FDB-4000
_______________________________________

The Finnish SpeechDat(II) FDB-1000 and FDB-4000 
databases comprise 1000 and 4000 Finnish speakers, 
respectively, recorded over the Finnish fixed telephone 
network. The SpeechDat databases were collected and 
annotated by the Tampere University of Technology's Digital 
Media Institute. The speech databases made within the 
SpeechDat(II) project were validated by SPEX, the 
Netherlands, to assess their compliance with the 
SpeechDat format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz 
A-law samples. Each prompted utterance is stored in a 
separate file. 
Each signal file is accompanied by an ASCII SAM label file 
which contains the relevant descriptive information.
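
As a rough illustration of how such label files might be read,
the Python sketch below parses plain-text lines of the form
"MNEMONIC: value" into a dictionary. This layout is an
assumption, and the sample text and field names (LHD, SAM, SNB,
QNT) are hypothetical placeholders, not content taken from the
database; the SpeechDat specifications remain the reference for
the actual fields.

```python
def parse_label_text(text: str) -> dict[str, list[str]]:
    """Collect 'KEY: value' lines; a key may occur more than once."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'KEY:' prefix
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Hypothetical label content, for illustration only.
sample = "LHD: SAM, 6.0\nSAM: 8000\nSNB: 1\nQNT: A-LAW\n"
labels = parse_label_text(sample)
print(labels["SAM"], labels["QNT"])
```

Values are kept as lists of strings because a mnemonic may appear
on several lines; any type conversion is left to the caller.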

Each speaker uttered the following items: 1 isolated digit; 1 
sequence of 10 isolated digits; 4 numbers: 1 sheet number 
(5 digits), 1 telephone number (9-10 digits), 1 credit card 
number (16 digits), 1 PIN code (6 digits); 1 currency money 
amount; 1 natural number; 3 dates: 1 spontaneous date 
(birthdate), 1 prompted date, 1 relative or general date 
expression; 2 time phrases: 1 time of day (spontaneous), 1 
time phrase; 3 spelled words: 1 spontaneous own forename, 
1 city name, 1 phonetically rich word; 5 directory assistance 
names: 1 spontaneous own forename, 1 spontaneous city of 
growing up, 1 frequent city name, 1 frequent company name, 
1 common forename surname; 2 yes/no questions: 1 
predominantly "yes" question, 1 predominantly "no" question; 
3 application words; 1 word spotting phrase using an 
embedded application word; 4 phonetically rich words; 9 
phonetically rich sentences.

A pronunciation lexicon with a phonemic transcription in 
SAMPA is also included.

______________________________________
ELRA-S0080 Finnish-Swedish SpeechDat(II) FDB-1000
______________________________________

The Finnish-Swedish SpeechDat(II) FDB-1000 comprises 
1000 Finnish speakers uttering SpeechDat items in the variant 
of Swedish spoken in Finland, recorded over the Finnish 
fixed telephone network. The SpeechDat database was 
collected and annotated by the Tampere University of 
Technology's Digital Media Institute. The FDB-1000 
database is partitioned into 4 CDs: 3 CDs each comprise 300 
speaker sessions, and the 4th comprises 100. 
The speech databases made within the SpeechDat(II) 
project were validated by SPEX, the Netherlands, to assess 
their compliance with the SpeechDat format and content 
specifications.

Speech samples are stored as sequences of 8-bit 8 kHz 
A-law samples. Each prompted utterance is stored in a 
separate file.
Each signal file is accompanied by an ASCII SAM label file 
which contains the relevant descriptive information.

Each speaker uttered the following items: 1 isolated digit; 1 
sequence of 10 isolated digits; 4 numbers: 1 sheet number 
(5 digits), 1 telephone number (9-10 digits), 1 credit card 
number (16 digits), 1 PIN code (6 digits); 1 currency money 
amount; 1 natural number; 3 dates: 1 spontaneous date 
(birthdate), 1 prompted date, 1 relative or general date 
expression; 2 time phrases: 1 time of day (spontaneous), 1 
time phrase; 3 spelled words: 1 spontaneous own forename, 
1 city name, 1 phonetically rich word; 5 directory assistance 
names: 1 spontaneous own forename, 1 spontaneous city of 
growing up, 1 frequent city name, 1 frequent company name, 
1 common forename surname; 2 yes/no questions: 1 
predominantly "yes" question, 1 predominantly "no" question; 
6 application words; 1 word spotting phrase using an 
embedded application word; 4 phonetically rich words; 9 
phonetically rich sentences.

The following age distribution has been obtained: 178 
speakers are below 16 years old, 412 speakers are between 
16 and 30, 216 speakers are between 31 and 45, 160 
speakers are between 46 and 60, and 34 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in 
SAMPA is also included.

=====================================
For further information, please contact:

 ELRA/ELDA                    Tel: +33 01 43 13 33 33
 55-57 rue Brillat-Savarin    Fax: +33 01 43 13 33 30
 F-75013 Paris, France        E-mail: mapelli@elda.fr

or visit our Web site:

 http://www.icp.grenet.fr/ELRA/home.html
 or http://www.elda.fr
===================================== 




Message 2: Summer Session: Spoken Language in Context/ Ohio State University

Date: Sat, 4 Mar 2000 14:22:33 -0500
From: Keith Johnson <kjohnson@ling.ohio-state.edu>
Subject: Summer Session: Spoken Language in Context/ Ohio State University

Summer 2000 at Ohio State University

Spoken Language in Context: Methods and Models

During July of 2000, the Department of Linguistics at the
Ohio State University will be offering a unique combination
of short courses aimed at exploring spoken language, with a
particular focus on the empirical study of naturally-occurring
speech through various instrumental, quantitative, and analytic
means. Scholars, researchers (industry or academic), and
students are invited to join us for an intense and rewarding
summer session.

Course offerings:
 Laboratory Phonology - Mary Beckman
 Quantitative Methods - Michael Broe
 Field Phonetics - Keith Johnson
 Historical Phonology - Brian Joseph & Richard Janda
 Practicum in English Intonation - Julia McGory
 The Pragmatics of Focus - Craige Roberts

For more information see the website:
http://ling.ohio-state.edu/SU2000