LINGUIST List 28.1361

Mon Mar 20 2017

FYI: News from LDC

Editor for this issue: Yue Chen <yue@linguistlist.org>


Date: 17-Mar-2017
From: Katie Kindle <ldc@ldc.upenn.edu>
Subject: News from LDC

In this newsletter:

- BOLT Chinese Discussion Forum Parallel Training Data
- IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- Noisy TIMIT Speech
- GALE English-Chinese Parallel Aligned Treebank -- Training

New Corpora:

- BOLT Chinese Discussion Forum Parallel Training Data was developed by LDC and
consists of 1,876,799 tokens of Chinese discussion forum data collected for
the DARPA BOLT program along with their corresponding English translations.

The BOLT (Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging and chat
-- in Chinese, Egyptian Arabic and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference.

The source data in this release consists of discussion forum threads harvested
from the Internet by LDC using a combination of manual and automatic
processes. The full source data collection is released as BOLT Chinese
Discussion Forums (LDC2016T05). Word-aligned and tagged data is released as
BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training
(LDC2016T19).

BOLT Chinese Discussion Forum Parallel Training Data is distributed via web
download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

- IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 200 hours of Swahili conversational and
scripted telephone speech collected from 2012 to 2014 along with corresponding
transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Swahili speech in this release represents that spoken in the Nairobi
dialect region of Kenya. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 65 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and
inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d is distributed via web
download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

- Noisy TIMIT Speech was developed by the Florida Institute of Technology and
contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic
Continuous Speech Corpus (LDC93S1) modified with different additive noise
levels. Only the audio has been modified; the original arrangement of the
TIMIT corpus is still as described by the TIMIT documentation.

The additive noise types are white, pink, blue, red, violet and babble noise,
with levels varying in 5 dB (decibel) steps, ranging from 5 to 50 dB. The color
noise types were generated artificially using MATLAB. The babble noise was
selected from a random segment of recorded babble speech scaled relative to
the power of the original TIMIT audio signal.
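
As a rough illustration of what scaling noise relative to signal power can
look like, here is a minimal Python sketch. It is not the procedure used to
build the corpus (the color noise was generated with MATLAB). It assumes the
dB levels are signal-to-noise ratios, and it uses a synthetic tone as a
stand-in for a TIMIT utterance and white Gaussian noise only. The names
mean_power and mix_at_snr are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech + noise, with noise scaled to the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the noise power.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
# Assumption: a 1-second 440 Hz tone at 16 kHz stands in for real speech.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
# White Gaussian noise; the other noise colors are not reproduced here.
white = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Levels from 5 to 50 dB in 5 dB steps, matching the range described above.
snr_levels = list(range(5, 55, 5))
noisy_versions = {snr: mix_at_snr(speech, white, snr) for snr in snr_levels}
```

Each entry in noisy_versions has the same length as the clean signal, which
mirrors the note above that only the audio is modified.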

Noisy TIMIT Speech is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

- GALE English-Chinese Parallel Aligned Treebank -- Training was developed by
LDC and contains 196,123 tokens of word-aligned English and Chinese parallel
text with treebank annotations. This material was used as training data in the
DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and
syntactic structures aligned at the sentence level and the sub-sentence level.
Such data sets are useful for natural language processing and related fields,
including automatic word alignment system training and evaluation,
transfer-rule extraction, word sense disambiguation, translation lexicon
extraction, and cultural heritage and cross-linguistic studies. With respect to
machine translation system development, parallel aligned treebanks may improve
system performance through enhanced syntactic parsers, better rules and
knowledge about language pairs, and reduced word error rate.

The English source data was translated into Chinese. Chinese and English
treebank annotations were performed independently. The parallel texts were
then word aligned. The material in this release corresponds to portions of the
treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of English source broadcast programming (CNN, NBC/MSNBC)
and web data collected by LDC in 2005 and 2006.

GALE English-Chinese Parallel Aligned Treebank -- Training is distributed via
web download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

Page Updated: 20-Mar-2017