LINGUIST List 34.3470 FYI: November 2023 Newsletter

Fri Nov 17 2023

FYI: November 2023 Newsletter - LDC

Editor for this issue: Justin Fuller <justin@linguistlist.org>
Date: 15-Nov-2023
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: November 2023 Newsletter - LDC
In this newsletter:
Join LDC for Membership Year 2024
New publications:
REMIX Telephone Collection
News Sub-domain Named Entity Recognition
________________________________________
Join LDC for Membership Year 2024
It’s time to renew your LDC membership for 2024. Current (2023) members who renew before March 1, 2024, will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1, 2024.
Plans for 2024 publications are in progress. Among the expected releases are:
• KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed
• AIDA Topic Source Data and Annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction
• RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence
• Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions
• Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts
• Diaspora Tibetan Speech: elicited, read, and spontaneous speech from 73 native Tibetan speakers in Katmandu’s diaspora Tibetan community, some recordings transcribed
• IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations, and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)
• LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)
For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.
________________________________________
New publications:
REMIX Telephone Collection was developed by LDC and contains 320 hours of English conversational telephone speech from 358 speakers who had completed all tasks in one of the previous LDC Mixer collections, specifically Mixers 4-7. The data was collected in 2012; recordings in this corpus were used to support the NIST 2012 Speaker Recognition Evaluation. Each speaker completed up to 12 calls of up to 10 minutes each, conversing on suggested topics. Speakers were asked to make half of their calls in a "noisy" environment, e.g., from a speakerphone, a busy street, a noisy store or office, or a room with loud background noise. Speaker metadata is included.
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
News Sub-domain Named Entity Recognition was developed at the University of Pennsylvania and contains over 20,000 English news sentences annotated with named entities and categorized into sub-domains. The sentences were extracted from The New York Times Annotated Corpus (LDC2008T19). Named entity annotation was based on the CoNLL-2003 guidelines and annotation scheme. Sentences were labeled with person (PER), location (LOC) and organization (ORG) tags using phrase matching with a manual second pass. Sub-domains are: Arts (+Weekend/Cultural), Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, Metropolitan, Sports, and Others. "Others" includes topics such as Real Estate, New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.
2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for a fee.
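For readers unfamiliar with the approach, the phrase-matching first pass described for News Sub-domain Named Entity Recognition can be illustrated with a minimal sketch. This is not the corpus developers' code: the phrase list, the function names (tag_sentence, to_conll), and the greedy longest-match strategy are assumptions made for illustration, and the sketch uses the common BIO variant of the PER/LOC/ORG tags in a simplified two-column layout rather than whatever file format the corpus actually ships in. In the workflow described above, output like this would then be corrected in a manual second pass.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase list (not from the corpus): token tuple -> entity type.
PHRASES: Dict[Tuple[str, ...], str] = {
    ("Jane", "Doe"): "PER",
    ("New", "York"): "LOC",
    ("University", "of", "Pennsylvania"): "ORG",
}


def tag_sentence(tokens: List[str],
                 phrases: Dict[Tuple[str, ...], str] = PHRASES) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup against a phrase list."""
    tags = ["O"] * len(tokens)
    max_len = max((len(p) for p in phrases), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so that longer phrases
        # win over shorter ones starting at the same token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrases.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def to_conll(tokens: List[str], tags: List[str]) -> str:
    """Render one token and its tag per line (simplified two-column layout)."""
    return "\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))


if __name__ == "__main__":
    sentence = "Jane Doe visited the University of Pennsylvania from New York .".split()
    print(to_conll(sentence, tag_sentence(sentence)))
```

Pure phrase matching misses unlisted names and mislabels ambiguous ones, which is why a manual review step such as the one the corpus description mentions is needed.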
Membership Coordinator
LDC
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
Linguistic Field(s): Computational Linguistics
Page Updated: 17-Nov-2023