LINGUIST List 32.960

Tue Mar 16 2021

FYI: March 2021 Newsletter - LDC

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 15-Mar-2021
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: March 2021 Newsletter - LDC

In this newsletter:
LDC data and commercial technology development

New Publications:
Columbia Games Corpus
Global TIMIT Mandarin Chinese
BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech
_____
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
_____

New publications:
(1) Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation from 13 subjects playing a series of computer games that required verbal communication to achieve joint goals of identifying and moving images on the screen to reach a combined number of points. This publication also includes corresponding manually time-aligned orthographic transcripts and annotation marking discourse and turn-taking.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Global TIMIT Mandarin Chinese was developed by LDC and Shanghai Jiao Tong University and consists of five hours of read speech, with corresponding transcripts, of sentences drawn from Chinese Gigaword Fifth Edition (LDC2011T13). Each of the 50 speakers read 120 sentences: 20 sentences read by all speakers, 40 sentences each read by 10 speakers, and 60 sentences read by that speaker alone, for a total of 3220 sentence types.
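
The total of 3220 sentence types follows from the reading design. Below is a minimal sketch of the arithmetic, assuming each speaker's 120 sentences split into 20 shared by everyone, 40 shared within groups of 10 speakers, and 60 read by that speaker only. The function and parameter names are illustrative and are not taken from the corpus documentation.

```python
def sentence_type_count(speakers, shared_by_all, shared_per_speaker,
                        group_size, unique_per_speaker):
    """Count distinct sentences (types) in a TIMIT-style reading design."""
    # Sentences read by every speaker are counted once.
    all_types = shared_by_all
    # Each speaker reads `shared_per_speaker` sentences, and each of those
    # sentences is read by `group_size` speakers in total.
    group_types = speakers * shared_per_speaker // group_size
    # Sentences read by exactly one speaker.
    unique_types = speakers * unique_per_speaker
    return all_types + group_types + unique_types


total_types = sentence_type_count(
    speakers=50, shared_by_all=20, shared_per_speaker=40,
    group_size=10, unique_per_speaker=60,
)
# Total recorded utterances (tokens): every speaker reads 120 sentences.
total_tokens = 50 * (20 + 40 + 60)

print(total_types, total_tokens)  # 3220 6000
```

Under these assumptions the 3220 types break down as 20 + 200 + 3000, and the recordings amount to 6000 read sentences in all.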

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Chinese informal text. Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation (i.e., Chinese Treebank 9.0 (LDC2016T13)) and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc@ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104






Linguistic Field(s): Computational Linguistics


Page Updated: 16-Mar-2021