Software Details

Title: New Resources from the LDC
Submitter: Linguistic Data Consortium
Description: New Publications from the LDC

The CSLU: Spelled and Spoken Words corpus consists of spelled and spoken
words. 3647 callers were prompted to say and spell their first and last
names, to say what city they grew up in and what city they were calling
from, and to answer two yes/no questions. In order to collect sufficient
instances of each letter, 1371 callers also recited the English alphabet
with pauses between the letters. Each call was transcribed by two people,
and all differences were resolved. In addition, a subset of 2648 calls has
been phonetically labeled.

Korean Propbank is a semantic annotation of the Korean English Treebank
Annotations and Korean Treebank version 2.0. Each verb and adjective
occurring in the Treebank has been treated as a semantic predicate and the
surrounding text has been annotated for arguments and adjuncts of the
predicate. The verbs and adjectives have also been tagged with
coarse-grained senses.

There are two basic components to Korean Propbank:

1. The Verb Lexicon. A frames file, consisting of one or more frame sets,
has been created for each predicate occurring in the Treebank. These files
serve as a reference for the annotators and for users of the data. 2,749
such files have been created.
2. The Annotation. There are two annotation files. The virginia-verbs.pb
file has 9,588 annotated predicate tokens. These predicate tokens include
all those occurring in over 54 thousand words of the Korean English
Treebank Annotations, totaling ~791 KB of uncompressed data. The
newswire-verbs.pb file has 23,707 annotated predicate tokens. These
predicate tokens include all those occurring in over 131 thousand words of
the Korean Treebank version 2.0.

The Speech Controlled Computing corpus was designed to support the
development of small footprint, embedded ASR applications in the domain of
voice control for the home. It consists of the recordings of 125 speakers
of American English from four regions, three age groups and two gender
groups, pronouncing isolated words. The recordings were conducted in a
sound-attenuated room, and a high-quality microphone was used. Each speaker
read a randomized word list consisting of 2100 words (100 distinct words
appearing 21 times each).
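
As a rough illustration of the list arithmetic above (100 distinct words, each appearing 21 times, for 2100 prompts), the following Python sketch builds one randomized list. The function name `build_prompt_list`, the placeholder vocabulary, and the seeding scheme are assumptions made for illustration only; the announcement does not describe how the actual lists were generated or which words they contain.

```python
import random
from collections import Counter


def build_prompt_list(vocabulary, repetitions=21, seed=None):
    """Return a shuffled list in which every word appears `repetitions` times."""
    prompts = [word for word in vocabulary for _ in range(repetitions)]
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(prompts)
    return prompts


# Placeholder vocabulary (hypothetical): the corpus's real word list is not given here.
vocabulary = [f"word{i:03d}" for i in range(100)]

prompts = build_prompt_list(vocabulary, seed=0)
counts = Counter(prompts)

print(len(prompts))   # total number of prompts in the list
print(len(counts))    # number of distinct words
```

Using a different seed per speaker would give each speaker a differently ordered list with the same contents.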

NOTE: Nonmembers may obtain a commercial rights license to Speech
Controlled Computing for US$7000 by signing the LDC User License Agreement
for Speech Controlled Computing. For-Profit Membership to the LDC is not

If you need further information, or would like to inquire about membership
to the LDC, please email or call +1 215 573 1275.
Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics

LL Issue: 17.1053
Date Posted: 07-Apr-2006
