LINGUIST List 31.2805

Tue Sep 15 2020

FYI: September 2020 Newsletter - Linguistic Data Consortium

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 15-Sep-2020
From: Membership Coordinator <ldc@ldc.upenn.edu>
Subject: September 2020 Newsletter - Linguistic Data Consortium

In this newsletter:

New publications:
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech
LORELEI Tigrinya Incident Language Pack
Chinese Lexical Resources for Gender, Number, Animacy

New publications:
(1) BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech was developed by the University of Colorado, Boulder – CLEAR (Computational Language and Education Research) and consists of PropBank and verb sense disambiguation annotation on English discussion forum (DF), SMS/Chat, and conversational telephone speech data. Annotation was applied to each predicate verb tree in LDC’s BOLT phrase structure treebanks. PropBank annotation, which adds a layer of semantic annotation over the treebank, was performed on all three genres. DF and SMS/Chat data were also annotated for verb sense disambiguation using VerbNet 3.2 classes.

BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


(2) LORELEI Tigrinya Incident Language Pack was developed by LDC and comprises approximately 4.5 million words of Tigrinya monolingual text, 25,000 words of English monolingual text, 235,000 words of parallel and comparable Tigrinya-English text, and 50,000 words of data annotated for Entity Discovery and Linking and for Situation Frames. It contains all of the text data, annotations, supplemental resources, and related software tools for the Tigrinya language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Discovery and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food), and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation Frame analysis was designed to extract basic information that would be useful for planning a disaster response effort.

LORELEI Tigrinya Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


(3) Chinese Lexical Resources for Gender, Number, Animacy was developed by LDC and consists of gender, number, and animacy lexicons produced in support of the DARPA DEFT program. Gender, number, and animacy are lexical indicators useful for named entity tagging, including the detection of person mentions in text.

This corpus was created by extracting information from newswire texts in Chinese Gigaword Fifth Edition (LDC2011T13) in the following steps: (1) segmenting source documents into sentences; (2) converting any traditional Chinese script to simplified Chinese; (3) tagging all sentences for parts-of-speech; (4) developing queries to detect patterns; and (5) building lexicons based on frequency counts and entity types.
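As a rough illustration of steps (2), (4), and (5) above, the following Python sketch builds a small gender lexicon from sentences that are assumed to be already segmented and part-of-speech tagged. It is not LDC's actual tooling: the three-character conversion table, the title cues, the "NR" proper-noun tag, and all function names are hypothetical choices made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical, tiny traditional-to-simplified table; a real pipeline
# would need a complete character mapping.
T2S = str.maketrans({"張": "张", "偉": "伟", "麗": "丽"})

# Hypothetical title cues and the gender each one suggests.
TITLE_GENDER = {"先生": "male", "女士": "female"}


def to_simplified(token):
    """Step (2): convert traditional characters to simplified ones."""
    return token.translate(T2S)


def gender_evidence(tagged_sentence):
    """Step (4): yield (name, gender) for each proper noun (assumed
    tag "NR") that is directly followed by a title cue."""
    for (tok, pos), (next_tok, _) in zip(tagged_sentence, tagged_sentence[1:]):
        if pos == "NR" and next_tok in TITLE_GENDER:
            yield tok, TITLE_GENDER[next_tok]


def build_gender_lexicon(tagged_sentences):
    """Step (5): count the evidence per name and keep the majority
    gender together with the total frequency."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        simplified = [(to_simplified(tok), pos) for tok, pos in sentence]
        for name, gender in gender_evidence(simplified):
            counts[name][gender] += 1
    return {
        name: {"gender": c.most_common(1)[0][0], "freq": sum(c.values())}
        for name, c in counts.items()
    }


# Toy input: steps (1) and (3) are assumed to have been done upstream.
sentences = [
    [("張偉", "NR"), ("先生", "NN"), ("来", "VV"), ("了", "AS")],
    [("张伟", "NR"), ("先生", "NN"), ("说", "VV")],
    [("王丽", "NR"), ("女士", "NN"), ("到", "VV")],
]
lexicon = build_gender_lexicon(sentences)
```

Here the traditional and simplified spellings of the same name are merged into one entry before counting, which is why the script conversion precedes the pattern queries.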

The resulting resources include dictionaries of Chinese animate nominals and names; Chinese nominals and names with gender and number predicted; and other dictionaries of Chinese nominals, names, verbs, and pronouns. Each dictionary contains frequency information as well as the features in question.

Chinese Lexical Resources for Gender, Number, Animacy is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium
E: ldc@ldc.upenn.edu


Linguistic Field(s): Computational Linguistics


Page Updated: 15-Sep-2020