LINGUIST List 2.372

Wednesday, 31 July 1991

Disc: DARPA Linguistic Data Consortium

Editor for this issue: <>


  1. Mark Liberman, Linguistic Data Consortium
  2. Koenraad De Smedt, DARPA Linguistic Data Consortium

Message 1: Linguistic Data Consortium

Date: Wed, 31 Jul 91 10:39 CDT
From: <>
Subject: Linguistic Data Consortium
[Editors' note: we are grateful to Mark Liberman for the following
informative response to our query. The DARPA project seems to us to have
great potential benefit for many linguists who may not have seen the
ACL announcements (e.g., discourse analysts, textlinguists, phonologists).
Thus we appreciate this LINGUIST posting.]
In what follows, I have tried to answer the questions that you raised
in Linguist Vol-2-371 about the proposed Linguistic Data Consortium.
	Mark Liberman
	Department of Linguistics
	University of Pennsylvania
 >Has anyone heard more about the Linguistic Data Consortium that was
 >announced in Linguist Vol-2-367?
I chaired the planning committee for this effort. The other committee
members were:
Janet Baker
Dragon Systems
Ken Church
AT&T Bell Laboratories
(presently at USC/Information Sciences Institute)
George Doddington
Texas Instruments
(presently at SRI)
Paul Jacobs
General Electric Central Research and Development Laboratories
Fred Jelinek
IBM TJ Watson Research Center
Mitch Marcus
University of Pennsylvania
Dave Pallett
National Institute of Standards and Technology
Patti Price
SRI International
Don Walker
Bell Communications Research
Yorick Wilks
New Mexico State University
Victor Zue
 >This is the first we--or anyone we
 >talk to--has heard of it; and yet the deadline for membership application
 >is August 19. We understand that even LSA did not have prior knowledge
 >of this DARPA project. But perhaps we've been misinformed.
The LDC planning committee was formed in January of this year,
following a request by Charles Wayne of DARPA. The announcement in the
Commerce Business Daily formally solicits members for the initial establishment
of the organization, but new members can be added at any later time.
The plans for the LDC were announced at the DARPA Speech and Natural
Language meeting in Asilomar in February, and discussed at the
ACH/ALLC meeting in Tempe in March and the ACL meeting in Berkeley in
June. Within the community of computational linguists and speech
researchers, both in the US and abroad, the LDC has been widely
 >Can someone
 >offer more background on this (apparently) important project?
Over the past decade, research in speech and natural language
technology has come to depend more and more on models induced
from very large amounts of text and speech. The needed data is
expensive and troublesome to get, and it is also hard to compare
results unless different groups can share the same data for
training and testing.
DARPA has funded the development of speech databases for several
years, and has made them generally available through NIST. In 1989,
the ACL formed an ad hoc committee to gather and distribute text and
speech corpora, the ACL Data Collection Initiative. In the fall of
1990, the NSF sponsored a workshop, run by the ACL, on "Open Lexical
and Textual Resources," which aimed to arrive at a consensus on needs
and opportunities in this area, and was attended by representatives of
several government agencies, including DARPA. Large linguistic data
projects, such as the British National Corpus, are underway in Europe
and in Japan.
The proposed formation of the LDC is thus another blossom in an
already-vigorous flowering of efforts to create shared resources for
research and development of natural language technology. The
particular form in this case, a government-industry-university
consortium, seems appropriate given the nature of the problem, but it
also forms part of a larger picture. In response to a recent request
from Congress, DARPA has proposed six consortia intended to promote
pre-competitive technology development: the Linguistic Data
Consortium, a Ceramic Fiber Consortium, a Consortium for
Optoelectronics and All-Optical Networks, a Superconducting
Electronics Consortium, a Scalable Computing Systems Consortium, and
an Advanced Static Random Access Memory Consortium.
 >What institutions are/intend to be Senior Members?
The consortium has not been formed yet, nor have any companies
committed to joining as "senior members." I would like to underline
the fact that senior members do not have any privileged access to
data, and that (as the announcement says) "broad participation is
desired" and "general membership fees will be set at affordable
 >Will there be a later enrollment period?
The plans for the LDC have not included any notion of an "enrollment
period." Applications for membership will be accepted at any time.
 >What linguists, if any, are consultants?
All of the members of the planning committee are researchers who work
on speech or text, and are thus linguists in some sense of the term.
Patti Price and I have degrees in linguistics. Mitch Marcus has a
secondary appointment as a member of the linguistics department at
Penn. Several members of the committee are active in the Association
for Computational Linguistics, notably Don Walker, its

Message 2: DARPA Linguistic Data Consortium

Date: Wed, 31 Jul 91 09:55 MET
Subject: DARPA Linguistic Data Consortium
As a European, I am surprised that in the US, a lot of research money is
apparently spent through Defense, and in particular through DARPA. After
they have almost monopolized American AI research during the last
decades, it seems that the army is now ready to invade linguistics (as
announced in Linguist Vol. 2-367).
Have academic institutions in the US never objected to this continuing
militarization of research? Are American scientists not arguing for
allocation of more government research funds through civilian channels?
And in particular, should linguists not claim that any Linguistic Data
Consortium be supervised by peaceful civilians rather than the DOD?
Koenraad De Smedt