LINGUIST List 32.240

Sun Jan 17 2021

FYI: January 2021 Newsletter - LDC

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 15-Jan-2021
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: January 2021 Newsletter - LDC

In this newsletter:
Renew Your LDC Membership Today

New Publications:
LORELEI Akan Representative Language Pack
ATIS – Seven Languages
BOLT English Treebank – SMS/Chat
________________________________________
Renew Your LDC Membership Today

Through March 1, 2021, organizations that were members in 2020 receive a 10% discount on 2021 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult the Join LDC page for more details on membership options and benefits.
________________________________________
New Publications:
(1) LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

Data was collected from discussion forum, news, reference, social network, and weblog sources. Data volumes are as follows:
- Over 3.3 million words of Akan monolingual text, all of which were translated into English
- 115,000 Akan words translated from English data

Approximately 2,300 words received named entity annotation, full entity annotation (including nominals and pronouns), entity linking, simple semantic annotation, and situation frame annotation (identifying entities, needs, and issues). Around 2,000 words have morphological segmentation annotation.

LORELEI Akan Representative Language Pack is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) ATIS – Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), translated into six languages: Spanish, German, French, Portuguese, Chinese, and Japanese.

The data is separated into 4,978 utterances for training and 893 utterances for testing, following the original ATIS division. The source English utterances were manually translated into the six languages and are included in this release. The utterances are annotated with named entities via table lookup; markers include city, airline, and airport names, and dates.

ATIS – Seven Languages is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*

(3) BOLT English Treebank – SMS/Chat was developed by LDC and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation.

The source data consists of 115,667 tokens/words in 484 files of English SMS and text chat collected by LDC using two methods: new collection via LDC's collection platform and donation of SMS or chat archives from BOLT collection participants.

BOLT English Treebank – SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St., Suite 810
Philadelphia, PA 19104



Linguistic Field(s): Computational Linguistics


Page Updated: 17-Jan-2021