LINGUIST List 32.1752

Wed May 19 2021

FYI: May 2021 Newsletter - LDC

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 17-May-2021
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: May 2021 Newsletter - LDC

In this newsletter:
LDC at ICASSP 2021

New Publications:
The SSNCE Database of Tamil Dysarthric Speech
ESPADA
BOLT Chinese SMS/Chat Parallel Training Data
________________________________________
LDC at ICASSP 2021
LDC will be exhibiting at ICASSP 2021, held virtually this year from June 6-11. Stop by our digital booth June 8-10 to learn more about recent developments at the Consortium and new publications.

Also, check out the following poster featuring LDC work:

Probing Acoustic Representations for Phonetic Properties
Wednesday, June 9, 14:00 - 14:45
Session: AUD-11: Auditory Modeling and Hearing Instruments

LDC will post conference links and updates via our Twitter feed and Facebook page. We hope to “see” you there!
________________________________________

New Publications:
(1) The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech Lab, SSN College of Engineering, India, in collaboration with the Indian National Institute of Empowerment of Persons with Multiple Disabilities (NIEPMD). It contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).

The speech data was collected between 2015 and 2017 in two sessions at NIEPMD. Each speaker recorded 365 utterances consisting of single words and sentences that combined common and uncommon Tamil phrases. The non-dysarthric speakers were five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 to 37 years.
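For readers sizing up the corpus, the figures above can be combined into a rough estimate. The sketch below is illustrative only and is not part of the LDC release: it assumes every one of the 30 speakers recorded the full set of 365 utterances, and it treats "approximately eight hours" as exactly eight, so the totals are back-of-envelope values. The function and constant names are hypothetical.

```python
# Figures quoted in the corpus description; names here are illustrative only.
DYSARTHRIC = {"female": 7, "male": 13}
NON_DYSARTHRIC = {"female": 5, "male": 5}
UTTERANCES_PER_SPEAKER = 365
APPROX_HOURS = 8  # "approximately eight hours", treated as exact for the estimate


def corpus_summary():
    """Return rough corpus totals derived from the announced figures."""
    speakers = sum(DYSARTHRIC.values()) + sum(NON_DYSARTHRIC.values())
    # Assumption: every speaker completed all 365 utterances.
    utterances = speakers * UTTERANCES_PER_SPEAKER
    # Mean audio per utterance in seconds, including any surrounding silence.
    mean_seconds = round(APPROX_HOURS * 3600 / utterances, 2)
    return {
        "speakers": speakers,
        "utterances": utterances,
        "mean_seconds": mean_seconds,
    }


if __name__ == "__main__":
    print(corpus_summary())
```

Under these assumptions the corpus would hold on the order of eleven thousand utterances averaging a few seconds each; the actual counts in the release may differ.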

The SSNCE Database of Tamil Dysarthric Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated parse trees and phrase alignments for English sentential paraphrases from NIST’s OpenMT evaluation corpora. It extends SPADE (LDC2018T09) by adding new annotated data for training/testing phrasal paraphrase detection and phrase representation models to SPADE's development and test sets. Gold standard annotation of HPSG (head-driven phrase structure grammar) trees and phrase alignments yielded 251,972 phrase alignments in 1,916 sentential paraphrases.

ESPADA is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) BOLT Chinese SMS/Chat Parallel Training Data was developed by LDC and consists of approximately 1.8 million tokens of Chinese SMS/Chat data and their corresponding English translations.

The source data was donated or collected by LDC via live platforms. Data was manually selected for translation. Messages/conversations were arranged in chronological order, segmented into sentence units (all or portions of message threads depending on their length), and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines.

BOLT Chinese SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104


Linguistic Field(s): Computational Linguistics


Page Updated: 19-May-2021