LINGUIST List 33.3585

Wed Nov 16 2022

FYI: November 2022 Newsletter - LDC

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 15-Nov-2022
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: November 2022 Newsletter - LDC

In this newsletter:

Join LDC for membership year 2023
It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2023 data scholarship application deadline
Applications are being accepted through January 15, 2023 for the Spring 2023 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.
______________________________

New publications:
BOLT English Translation Treebank – Egyptian Arabic SMS/Chat was developed by LDC and consists of SMS and chat text data (472 files representing 98,206 tokens) translated from Egyptian Arabic to English and annotated for part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included in the corpus documentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (children, aged 4-17 years) representing 137,597 utterances.

Speech data was collected between October 2019 and September 2021 using the Samrómur website, which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

2022 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
*
Third DIHARD Challenge Development was developed by LDC and contains approximately 34 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and uncheck the box next to “Receive Newsletter” under Account Options; or contact LDC for assistance.




Linguistic Field(s): Computational Linguistics


Page Updated: 16-Nov-2022