LINGUIST List 36.934

Tue Mar 18 2025

FYI: March 2025 Newsletter - LDC

Editor for this issue: Joel Jenkins <joellinguistlist.org>



Date: 17-Mar-2025
From: Membership Coordinator <ldcldc.upenn.edu>
Subject: March 2025 Newsletter - LDC

In this newsletter:
LDC data and commercial technology development

New publications:
2015 NIST Language Recognition Evaluation Test Set
The Xi’an Multi-Language Learner Corpus

________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
________________________________________

New publications:
2015 NIST Language Recognition Evaluation Test Set was developed by LDC and NIST. It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation (LRE): approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC. The data covers 20 languages in 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).

The CTS data includes calls between individuals in the same social networks lasting 8-15 minutes and telephone speech from the IARPA Babel series collected in 2012-2013 from speakers using a range of phone types in diverse settings with varying noise conditions. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show).

The goal of NIST's LRE evaluations is to establish the baseline of current performance capability for CTS language recognition and to lay the groundwork for further research efforts. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

The Xi’an Multi-Language Learner Corpus was developed by Xi’an International Studies University (XISU) and comprises 526 argumentative essays in 15 languages by Chinese L1 university students studying second languages, along with student metadata and writing prompts. It was developed to support second language learner research and to provide a database for cross-linguistic comparison of second languages.

Data was collected in 2023 and 2024 from students at XISU and Yunnan Minzu University (YMU) who were linguistics majors or studying one of the foreign languages available at XISU and YMU. Off-topic essays and incomplete texts were excluded.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options or contact LDC for assistance.

Linguistic Field(s): Computational Linguistics




Page Updated: 17-Mar-2025

