Editor for this issue: Martin Jacobsen <marty@linguistlist.org>
Announcing a NEW RELEASE from the LINGUISTIC DATA CONSORTIUM

SWITCHBOARD-1 Release 2

The Switchboard-1 Telephone Speech Corpus was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.

SWITCHBOARD is a collection of about 2400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven "robot operator" system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once, and (2) no one spoke more than once on a given topic.

In this new release, assembled and published by the LDC, all known errors affecting the original publication of speech files have been corrected. In addition, modifications have been made to the contents of the NIST Sphere headers of all speech files, to identify each file as being part of the new release, and to make the usage of the "sample_count" header field consistent with standard Sphere usage. (In particular, the "sample_count" field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to be the total number of samples in both channels of the file; this has been corrected in the new release.)

SWITCHBOARD-1 Release 2 is distributed in a notebook-style binder with 23 CD-ROMs.
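As an illustration of the "sample_count" correction described above, the following Python sketch reads the text fields of a Sphere-style header and converts an old-style total (samples summed over both channels) into the per-channel count. It is only a sketch under stated assumptions: the function names and the example header values are hypothetical, and it is not the procedure the LDC used to prepare the release.

```python
def parse_sphere_header(header_text):
    """Parse 'name -type value' lines of a Sphere-style header into a dict."""
    fields = {}
    for line in header_text.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        # Lines without a '-type' flag (the format id and header size) are skipped.
        if len(parts) != 3 or not parts[1].startswith("-"):
            continue
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # string fields such as '-s5'
    return fields


def per_channel_sample_count(total_samples, channel_count):
    """Convert a total over all channels into the per-channel sample count."""
    if channel_count <= 0:
        raise ValueError("channel_count must be positive")
    if total_samples % channel_count != 0:
        raise ValueError("total is not a multiple of the channel count")
    return total_samples // channel_count


# Hypothetical header in the style of the initial release: sample_count
# holds the total for both channels. The numbers are invented.
EXAMPLE_OLD_HEADER = """NIST_1A
   1024
channel_count -i 2
sample_count -i 4800000
sample_rate -i 8000
end_head
"""

fields = parse_sphere_header(EXAMPLE_OLD_HEADER)
corrected = per_channel_sample_count(fields["sample_count"], fields["channel_count"])
print(corrected)  # 2400000 samples per channel
```

Software that computes durations as sample_count divided by sample_rate would report twice the true length for a two-channel file from the initial release, so the release a file comes from matters when reading this field.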
The intermediate version of the corresponding transcripts is available separately.

Institutions that have membership in the LDC during the 1997 Membership Year will be able to receive SWITCHBOARD-1 Release 2 at no additional charge, in the same manner as all other text and speech corpora published by the LDC. Nonmembers can receive a copy of SWITCHBOARD-1 Release 2 for research purposes only for a fee of $10,000.

If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL http://www.ldc.upenn.edu/. Information is also available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.
Announcing a NEW RELEASE from the LINGUISTIC DATA CONSORTIUM

The Kids Corpus

This database consists of sentences read aloud by children. It was originally designed to create a training set of children's speech for the SPHINX II automatic speech recognizer, for use in the LISTEN project at Carnegie Mellon University. The children range in age from 6 to 11 (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There were 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5180 utterances in all.

The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. They were recorded in the summer of 1995, and were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh. They were recorded on-site. This set will hereafter be called SUM95. There are 44 speakers and 3333 utterances in this set.

The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers, and who could therefore benefit from any reading tutor or other system built upon this database. They come from Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP. There are 32 speakers and 1847 utterances in this set.

The list of speakers, the set they are in, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl".
It should be noted that although there will be some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives us a very good representation of the dialects of the children who may be targeted for the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese".

The text presented to the children was obtained from Weekly Reader stories. Weekly Reader is a four-page color reading supplement given out to children in many classrooms. Special reprint permission granted by Weekly Reader (R), published by Weekly Reader Corporation. Copyright (c) 1994, 1995 by Weekly Reader Corporation. All Rights Reserved.

Because of restrictions imposed by the copyright holders, this corpus is available to 1997 LDC members only.

If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL http://www.ldc.upenn.edu/. Information is also available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.