LINGUIST List 8.1234

Thu Aug 28 1997

FYI: LDC Corpus, Spanish Corpora

Editor for this issue: Martin Jacobsen


  1. LDC Office, New Corpus from the Linguistic Data Consortium
  2. Lee Hartman, Spanish corpora

Message 1: New Corpus from the Linguistic Data Consortium

Date: Wed, 27 Aug 1997 20:06:32 EDT
From: LDC Office
Subject: New Corpus from the Linguistic Data Consortium

 Announcing a NEW RELEASE from the

	 Boston University Radio Speech Corpus

The Boston University Radio Speech Corpus was collected by Mari
Ostendorf of Boston University, primarily to support research in
text-to-speech synthesis, particularly generation of prosodic
patterns. The corpus consists of professionally read radio news data,
including speech and accompanying annotations, suitable for speech and
language research.

The corpus includes speech from seven (4 male, 3 female) FM radio news
announcers associated with WBUR, a public radio station. The main
radio news portion of the corpus consists of over seven hours of news
stories recorded in the WBUR radio studio during broadcasts over a two
year period. In addition, the announcers were recorded in a
laboratory at Boston University. In this, the lab news portion, the
announcers read a total of 24 stories from the radio news portion.
The announcers were first asked to read the stories in their non-radio
style and then, 30 minutes later, to read the same stories in their
radio style.

Each story read by an announcer was digitized in paragraph size units,
which typically include several sentences. The files were digitized
at a 16 kHz sample rate using a 16-bit A/D converter. The paragraphs were
annotated with the orthographic transcription, phonetic alignments,
part-of-speech tags and prosodic markers. The orthographic
transcripts were generated by hand and include indications of where the
speaker took a breath. The phonetic alignments and part-of-speech
tags were generated automatically and hand corrected. The prosodic
labels were marked by hand and are available only for a subset of the

Institutions that have membership in the LDC for either the 1996 or
1997 Membership Year will be able to receive the BU Radio Corpus at no
additional charge, in the same manner as all other speech corpora
published by the LDC.

Nonmembers can receive a copy of this corpus for research purposes
only for a fee of US$400. If you would like to order a copy of this
corpus, please email your request to the LDC Office. If you
need additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or call
(215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page. Information
is also available via ftp under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for password.

Message 2: Spanish corpora

Date: Wed, 27 Aug 1997 11:24:27 -0500 (CDT)
From: Lee Hartman
Subject: Spanish corpora

Nick Caffrey asked

>Does anyone have details of online Spanish corpora?

There is a corpus of written Argentine and Chilean Spanish, and
transcribed spoken Peninsular Spanish online at

Because of recent technical difficulties, it may be temporarily
inaccessible, but keep trying.

-------------------------------------------------------------------
Lee Hartman
Dept. of Foreign Languages
Southern Illinois University
Carbondale, IL 62901-4521