LINGUIST List 8.1249

Mon Sep 1 1997

FYI: LDC Corpora

Editor for this issue: Martin Jacobsen <marty@linguistlist.org>


Directory

  1. LDC Office, New Corpus from the Linguistic Data Consortium
  2. LDC Office, New Corpus from the Linguistic Data Consortium

Message 1: New Corpus from the Linguistic Data Consortium

Date: Sun, 31 Aug 1997 13:43:49 EDT
From: LDC Office <ldc@unagi.cis.upenn.edu>
Subject: New Corpus from the Linguistic Data Consortium


 Announcing a NEW RELEASE from the
 LINGUISTIC DATA CONSORTIUM

		 SWITCHBOARD-1 Release 2

The Switchboard-1 Telephone Speech Corpus was originally collected by
Texas Instruments in 1990-1, under DARPA sponsorship. The first
release of the corpus was published by NIST and distributed by the LDC
in 1992-3. Since that release, a number of corrections have been made
to the data files as presented on the original CD-ROM set, and all
copies of the first pressing have been distributed.

SWITCHBOARD is a collection of about 2400 two-sided telephone
conversations among 543 speakers (302 male, 241 female) from all areas
of the United States. A computer-driven "robot operator" system
handled the calls, giving the caller appropriate recorded prompts,
selecting and dialing another person (the callee) to take part in a
conversation, introducing a topic for discussion, and recording the
speech from the two subjects into separate channels until the
conversation was finished. About 70 topics were provided, of which
about 50 were used frequently. Selection of topics and callees was
constrained so that: (1) no two speakers would converse together more
than once, and (2) no one spoke more than once on a given topic.
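
The two selection constraints can be sketched in a few lines of
Python. This is only an illustration of the rule as stated above, not
the robot operator's actual software; the names CallHistory,
is_allowed, and record are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CallHistory:
    """Hypothetical record of completed conversations."""

    # Unordered speaker pairs that have already talked to each other.
    pairs: set = field(default_factory=set)
    # (speaker, topic) combinations that have already occurred.
    spoken: set = field(default_factory=set)

    def is_allowed(self, caller: str, callee: str, topic: str) -> bool:
        """Check constraints (1) and (2) for a proposed conversation."""
        if frozenset((caller, callee)) in self.pairs:
            return False  # (1) this pair has already conversed
        if (caller, topic) in self.spoken or (callee, topic) in self.spoken:
            return False  # (2) one of them already spoke on this topic
        return True

    def record(self, caller: str, callee: str, topic: str) -> None:
        """Store a finished conversation so later checks can see it."""
        self.pairs.add(frozenset((caller, callee)))
        self.spoken.add((caller, topic))
        self.spoken.add((callee, topic))
```

Storing each pair as a frozenset makes the pair check independent of
who placed the call, which matches "no two speakers would converse
together more than once".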

In this new release, assembled and published by the LDC, all known
errors affecting the original publication of speech files have been
corrected. In addition, modifications have been made to the contents
of the NIST Sphere headers of all speech files, to identify each file
as being part of the new release, and to make the usage of the
"sample_count" header field consistent with standard Sphere usage. (In
particular, the "sample_count" field should reflect the number of
samples on each channel in the file. In the initial release, this
field was improperly set to be the total number of samples in both
channels of the file; this has been corrected in the new release.)
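
The header correction can be illustrated with a small Python sketch.
It assumes the usual NIST Sphere header layout (a "NIST_1A" line, a
header-size line, then "name -type value" fields terminated by
"end_head") and the standard "channel_count" field. The function names
and the example header are made up for illustration and do not come
from the corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST Sphere file."""
    first, size_line, _ = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST Sphere header")
    header_size = int(size_line.strip())  # typically 1024
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields


def samples_per_channel(fields: dict, counts_all_channels: bool) -> int:
    """Return the per-channel sample count.

    counts_all_channels=True models the initial release, where
    sample_count held the total over both channels.
    """
    count = fields["sample_count"]
    if counts_all_channels:
        return count // fields["channel_count"]
    return count


# Made-up two-channel header, padded to the declared 1024 bytes.
example = (b"NIST_1A\n   1024\n"
           b"channel_count -i 2\n"
           b"sample_count -i 2000\n"
           b"end_head\n").ljust(1024, b"\0")
fields = parse_sphere_header(example)
```

For this made-up header, a sample_count of 2000 written under the
initial release's convention corresponds to 1000 samples on each
channel, whereas under standard Sphere usage the field already holds
the per-channel count.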

SWITCHBOARD-1 Release 2 is distributed in a notebook-style binder with
23 CD-ROMs. The intermediate version of the corresponding transcripts
is available separately.

Institutions that have membership in the LDC during the 1997
Membership Year will be able to receive SWITCHBOARD-1 Release 2 at no
additional charge, in the same manner as all other text and speech
corpora published by the LDC.

Nonmembers can receive a copy of SWITCHBOARD-1 Release 2, for research
purposes only, for a fee of $10,000. If you would like to order a copy
of this corpus, please email your request to
ldc@unagi.cis.upenn.edu. If you need additional information before
placing your order, or would like to inquire about membership in the
LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.

Message 2: New Corpus from the Linguistic Data Consortium

Date: Sun, 31 Aug 1997 16:41:31 EDT
From: LDC Office <ldc@unagi.cis.upenn.edu>
Subject: New Corpus from the Linguistic Data Consortium


 Announcing a NEW RELEASE from the
 LINGUISTIC DATA CONSORTIUM

			The Kids Corpus

This database consists of sentences read aloud by children. It was
originally designed to provide a training set of children's speech for
the SPHINX II automatic speech recognizer, for use in the LISTEN
project at Carnegie Mellon University.

The children range in age from 6 to 11 (see details below) and were in
first through third grades (the 11-year-old was in 6th grade) at the
time of recording. There were 24 male and 52 female speakers.
Although the girls outnumber the boys, we feel that the small
difference in vocal tract length between the two at this age should
make the effect of this imbalance negligible. There are 5180
utterances in all.

The speakers come from two separate populations. Since the LISTEN
reading coach needed good examples of reading aloud, it was decided
that the majority of the speakers should be "good" readers. They were
recorded in the summer of 1995, and were enrolled in either the
Chatham College Summer Camp, or the Mount Lebanon Extended Day Summer
Fun program in Pittsburgh. They were recorded on-site. This set will
hereafter be called SUM95. There are 44 speakers and 3333 utterances
in this set. The LISTEN system also needed examples of errorful
reading and dialectal variants. The readers who supplied this type of
speech come from a school which has a high population of children who
are at risk of growing up poor readers and who could therefore benefit
from any reading tutor or other system built upon this database. They
come from Fort Pitt School in Pittsburgh and were recorded in April
1996. This subset will be referred to as FP. There are 32 speakers
and 1847 utterances in this set. The list of speakers, the set they
are in, and the number of sentences per speaker can be found in the
"tables" directory, in the file named "speaker.tbl".

Although there is some dialectal variation in the speech of the SUM95
subset, the speech of the FP subset gives a very good representation
of the dialects of the children who may be targeted by the LISTEN
system. However, the user should be aware that the speakers' dialect
partly reflects what is locally called "Pittsburghese".

The text presented to the children was obtained from Weekly Reader
stories. Weekly Reader is a four-page color reading supplement given
out to children in many classrooms. Special reprint permission
granted by Weekly Reader (R), published by Weekly Reader Corporation.
Copyright (c) 1994, 1995 by Weekly Reader Corporation. All Rights
Reserved.

Because of restrictions imposed by the copyright holders, this corpus
is available to 1997 LDC members only.

If you would like to order a copy of this corpus, please email your
request to ldc@unagi.cis.upenn.edu. If you need additional information
before placing your order, or would like to inquire about membership
in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.ldc.upenn.edu/. Information is also available via ftp at
ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.