Editor for this issue: Brett Churchill <brett@linguistlist.org>
Announcing a NEW CORPUS from the LDC

***************************************************
1997 Mandarin Broadcast News Speech and Transcripts
***************************************************

This collection consists of 30 hours of recorded broadcasts and transcripts that have been drawn from the following sources:

  Voice of America (VOA): United States Information Agency Radio
  People's Republic of China Television (CCTV)
  Commercial radio based in Los Angeles, CA. (KAZN-AM)

Of these three sources, the first two comprise the bulk of the collection, and are represented in roughly equal amounts; only a relatively small sample of KAZN-AM recordings is included, owing to the relatively high proportion of unusable material (commercials, local traffic reports loaded with California place names, etc).

The transcripts were created by native speakers of Mandarin working at the LDC. They are in GB-encoded form, with SGML tagging to identify story boundaries, speaker turn boundaries, and phrasal pauses; these tags include time stamps to align the text with the speech data. Word segmentation (white space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish Hub-4 collections.

Because of restrictions imposed by the copyright holders, this corpus is available to 1998 LDC members only. Members who wish to receive this corpus must sign the 1997 Mandarin Broadcast News license. This license can be retrieved from the LDC website at:

  http://www.ldc.upenn.edu/ldc/catalog/nonmem_agree/agreements.html

If you would like to order a copy of this corpus, please email your request to <ldc@unagi.cis.upenn.edu>. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL: http://www.ldc.upenn.edu/
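For readers planning to work with the Mandarin transcripts described above, the following is a minimal Python sketch of reading GB-encoded, whitespace-segmented text. It rests on assumptions that are not part of the announcement: that "GB-encoded" means GB2312, that tags can be matched generically as anything between angle brackets, and the function name words_from_gb and the tag <x> in the usage line are hypothetical. The real tag names are defined by the DTD shipped with the corpus.

```python
import re

# Matches any SGML-style tag generically. This is an assumption for
# illustration; the actual tag inventory comes from the corpus DTD.
TAG = re.compile(r"<[^>]*>")


def words_from_gb(raw: bytes) -> list[str]:
    """Decode GB2312 bytes, drop tags, and split on white space."""
    text = raw.decode("gb2312")   # assumed encoding for "GB-encoded"
    text = TAG.sub(" ", text)     # replace each tag with a space
    return text.split()           # words are already white-space separated


# Usage with a made-up, hypothetical tag name:
sample = "<x>今天 天气 很 好</x>".encode("gb2312")
words = words_from_gb(sample)
```

Because the transcripts already carry word segmentation, a plain split on white space is enough here; no segmenter is needed.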
Announcing a NEW CORPUS from the LDC

*************************
Voicemail Corpus - Part I
*************************

The Voicemail Corpus - Part I was created by the following researchers at IBM: M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P.S. Gopalakrishnan, and C. Dunn.

This CD-ROM corpus consists of 1801 voicemail messages, collected from volunteers at various IBM sites in the United States, which make up the training data set, and 42 messages in the development test set. The average voicemail message is 31 seconds in duration, and has about 100 words. Approximately 38% of the messages correspond to male speakers; the remainder correspond to females. All messages were transcribed by IBM.

During the collection period, volunteers were asked to forward some of their voicemail messages to a local extension number set up for the purpose of collecting this data. The messages were then collected periodically from the voicemailbox of this local extension and added to the database. DirectTalk6000 (DT6K) software was used to transfer the voicemail messages to the computer. DT6K is an application that runs under the AIX operating system on a host computer, and can interface to a phone line through special hardware on the host computer.

Note that the data was collected from IBM sites all over the US, whereas the host computer that the DT6K application was running on was located at a single IBM site. Consequently, when the application dialed into the phonemail system of an IBM site in a different state, the voicemail messages were played out over a long distance line before they were recorded on the host computer.

The data was sampled at 8 kHz, and recorded in 8-bit u-law compressed format onto a local disk of the host computer. The messages were compressed by the proprietary compression techniques used by the ROLM phonemail system, which is the phonemail system in use at various IBM locations.
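As an illustration of the 8 kHz, 8-bit u-law format mentioned above, here is a minimal Python sketch of standard G.711 u-law expansion to 16-bit linear samples. It is not IBM's or the LDC's tooling. It assumes the input is a run of raw u-law sample bytes with any file header already removed (the on-disc file format is not described in this announcement), and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the announcement
BIAS = 0x84             # bias constant of the G.711 u-law scheme


def ulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit u-law code to a signed 16-bit linear sample."""
    u = ~code & 0xFF                 # u-law codes are stored bit-inverted
    exponent = (u >> 4) & 0x07       # 3-bit segment number
    mantissa = u & 0x0F              # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if (u & 0x80) else magnitude


def decode_ulaw(data: bytes) -> list[int]:
    """Expand a run of raw u-law bytes (no header) to linear samples."""
    return [ulaw_to_pcm16(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """One byte per sample at 8 kHz, so duration is length / 8000."""
    return len(data) / SAMPLE_RATE_HZ
```

At one byte per sample, an average 31-second message occupies about 31 * 8000 = 248,000 bytes of sample data before any header.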
IBM would like to acknowledge the support of DARPA for funding this data collection effort under Grant MDA972-97-C-0012 and is also extremely grateful to George di Simone and Ira Ellis (Watson telephone system support) for their help in setting up the data collection process. IBM would also like to thank Dr. Ellen Eide for helping with the verification of transcripts and Dr. Salim Roukos, Dr. David Nahamoo, and Dr. Lalit Bahl for their help and support. Finally, thanks are due to the various volunteers who contributed their voicemail messages to the database.

Institutions that have membership in the LDC during the 1998 Membership Year will be able to receive this corpus in the same manner as all other text and speech corpora published by the LDC.

If you would like to order a copy of this corpus, please email your request to <ldc@unagi.cis.upenn.edu>. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL: http://www.ldc.upenn.edu/
Announcing a NEW CORPUS from the LDC

********************************************
1998 Speaker Recognition Evaluation Test-Set
********************************************

The 1998 speaker recognition evaluation is part of an ongoing series of yearly benchmark tests conducted by NIST. These tests are intended to provide a stable reference point for measuring and comparing the performance of diverse methods for text-independent speaker recognition over the telephone, and should be of interest to all researchers working in this area of speech technology development. The test sets and evaluation protocols have been designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible.

In 1996 and 1997 handset variation was featured as a prominent technical challenge to be addressed. While handset variation remains a formidable challenge, the 1998 evaluation directs greatest attention toward speaker recognition performance for the case in which both training and test data are from the same source.

The speech data were recorded by the LDC between January and March, 1997; most of the speakers recruited for this collection were college students from the Great Lakes (Northern Mid-West) region of the U.S.

Institutions that have membership in the LDC during the 1998 Membership Year will be able to receive this corpus in the same manner as all other text and speech corpora published by the LDC. Nonmembers may purchase the 1998 Speaker Recognition Evaluation Test-Set for $600.

If you would like to order a copy of this corpus, please email your request to <ldc@unagi.cis.upenn.edu>. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL: http://www.ldc.upenn.edu/
Announcing a NEW CORPUS from the LDC

*******************************************************************
JURIS (Justice Department Retrieval and Inquiry System) Text Corpus
*******************************************************************

The text data contained on this two-CD-ROM set represent a release of the JURIS (Justice Department Retrieval and Inquiry System) data collection that has been made available to the Linguistic Data Consortium (LDC) by the U.S. Department of Justice. The time span of the text ranges from the 1700's to the early 1990's.

There are 1664 individual text files in the corpus, 1011 on the first CD-ROM, and 653 on the second. The original archive consisted of 219 files ranging from less than 1 MB to nearly 70 MB in size. In order to make the data more accessible for research use, we chose to divide the larger files into pieces, such that the average file size was about 2 MB when uncompressed (the largest uncompressed file size is about 4.5 MB). Divisions of the files were done at document boundaries, so all files contain whole documents.

There are a total of 694,667 document units in the corpus, and these can be categorized to some extent with regard to their content. The following is a partial list of categories and their descriptions drawn from JURIS documentation contained in the corpus. The terminology and organization of categories are those used in the JURIS documentation:

* ADMINISTRATIVE LAW
  Published Comptroller General Decisions; Unpublished Comptroller General Decisions; Opinions of the Attorney General; Office of Legal Counsel (US Dept. of Justice); Board of Contract Appeals; ADP Protest Report (Summary of ADP Procurement Protests before the GSBCA); Federal Labor Relations Authority Case Decisions; FLRA Administrative Law Judge Decisions; Federal Service Impasses Decisions; Decisions and Reports on Rulings of the Assistant Sec. of Labor for Labor Management Relations; Federal Labor Relations Council Rulings on Requests of the Asst. Sec. of Labor for Labor Management Relations; HUD Administrative Law Decisions; Merit System Protection Board Decisions; Decisions under Immigration and Nationality Laws; Environmental Protection Agency General Counsel Opinions; Equal Opportunity Commission Decisions; Equal Employment Opportunity Commission Policy Statements; US Office of Government Ethics Decisions; HHS Department Appeals Board Decisions.

* DEPARTMENT OF JUSTICE BRIEFS
  Office of the Solicitor General; Civil Division; Civil Division Trial; Environmental and Natural Resources Division; Tax Division Criminal Appellate; US Attorney's Offices; US Trustees' Offices.

* CASE LAW
  U.S. Supreme Court; Federal Reporter, 2nd Series; Court of Appeals Unpublished Decisions; Federal Supplement; Federal Rules Decisions; Atlantic 2nd Reporter (DC only); Bankruptcy Reporter; Courts of Military Review; Military Justice Reporter; Court of Claims.

* FREEDOM OF INFORMATION ACT
  FOIA Update Newsletter; DOJ Guide to the FOIA Case List Publications.

* FEDERAL REGULATIONS
  Code of Federal Regulations; Unified Agenda of Federal Regulations; Defense Acquisition Regulations.

* TREATIES AND OTHER INTERNATIONAL AGREEMENTS
  United States Treaties and Other International Agreements; Department of Defense Unpublished International Agreements.

* INDIAN LAW
  Opinions of the Solicitor (Dept. of Interior); Ratified Treaties; Unratified Treaties; Presidential Proclamations; Executive Orders and Other Orders Pertaining to Indians.

* IMMIGRATION AND NATURALIZATION LAW
  Decisions Under Immigration and Nationality Law; Title 8 - Code of Federal Regulations; Immigration Reform and Control Act of 1988, Legislative History; Equal Access to Justice Act, Legislative History.

* STATUTORY LAW
  Public Laws; United States Code; Executive Orders; Anti-Drug Abuse Act of 1988; Section-by-section analysis of anti-drug abuse act of 1988; Criminal Division Handbook on CCCA; The Organic Laws of the United States.

* TAX LAW
  US Tax Court Decisions; US Board of Tax Appeals Decisions; Tax Division's Summons Enforcement Decisions; Tax Division's Tax Protester Case List; Tax Division's Criminal Tax Manual; Tax Division's Criminal Tax Indictment/Information Forms; Tax Division's Standardized Criminal Tax Jury Instructions; Tax Division's Criminal Section Newsletter; Tax Court Memorandum Decisions; IRS Cumulative Bulletin; Tax International Acts; IRS News Releases; IRS General Counsel Memoranda; IRS Actions on Decisions; IRS Technical Memoranda.

* MANUALS
  United States Attorney's Manual; United States Trustees' Manual; Federal Personnel Manual; Federal Acquisition Regulations; Federal Acquisition Circulars; Federal Travel Regulation; Federal Information Resources Management Regulation; Federal Property Management Regulations; Principles of Federal Appropriations Law; Justice Department Acquisition Regulation; Justice Property Management Regulations.

* DEPARTMENT OF JUSTICE WORKPRODUCTS
  Civil Division Monographs; Civil Division Torts Branch Handbook on damages under FTCA; Criminal Division Monographs; Criminal Division Forms; Criminal Division Guidelines for Drafting Indictments; Criminal Division Narcotics; Forfeiture, Prosecution Manual; Criminal Division Directory of Services; Asset Forfeiture Manuals; Obscenity Enforcement Reporter; Environmental and Natural Resources Division Monographs; US Sentencing Commission's Guidelines Manual; Sentencing Guidelines Updates.

The text files are all formatted using a set of SGML tags to mark document boundaries, and to mark major structural features within documents. As with file organization, the markup is derived from the document structures as provided by the Justice Department.
Institutions that have membership in the LDC during the 1998 Membership Year will be able to receive this corpus in the same manner as all other text and speech corpora published by the LDC. Nonmembers may purchase JURIS for $1500.

If you would like to order a copy of this corpus, please email your request to <ldc@unagi.cis.upenn.edu>. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL: http://www.ldc.upenn.edu/

----- End of Forwarded Message