[Editors' note: we are grateful to Mark Liberman for the following informative response to our query. The DARPA project seems to us to have great potential benefit for many linguists who may not have seen the ACL announcements (e.g., discourse analysts, textlinguists, phonologists). Thus we appreciate this LINGUIST posting.]

In what follows, I have tried to answer the questions that you raised in Linguist Vol-2-371 about the proposed Linguistic Data Consortium.

Regards,
Mark Liberman (myl@unagi.cis.upenn.edu)
Department of Linguistics
University of Pennsylvania

>Has anyone heard more about the Linguistic Data Consortium that was
>announced in Linguist Vol-2-367?

I chaired the planning committee for this effort. The other committee members were:

Janet Baker, Dragon Systems, dragon@a.isi.edu
Ken Church, AT&T Bell Laboratories (presently at USC/Information Sciences Institute), church@venera.isi.edu
George Doddington, Texas Instruments (presently at SRI), gd@speech.sri.com
Paul Jacobs, General Electric Central Research and Development Laboratories, jacobs@sol.crd.ge.com
Fred Jelinek, IBM TJ Watson Research Center, jelinek@ibm.com
Mitch Marcus, University of Pennsylvania, mitch@linc.cis.upenn.edu
Dave Pallett, National Institute of Standards and Technologies, dave@ssi.ncsl.nist.gov
Patti Price, Stanford Research International, pprice@speech.sri.com
Don Walker, Bell Communications Research, walker@flash.bellcore.com
Yorick Wilks, New Mexico State University, yorick@nmsu.edu
Victor Zue, MIT, zue@goldilocks.lcs.mit.edu

>This is the first we--or anyone we
>talk to--has heard of it; and yet the deadline for membership application
>is August 19. We understand that even LSA did not have prior knowledge
>of this DARPA project. But perhaps we've been misinformed.

The LDC planning committee was formed in January of this year, following a request by Charles Wayne of DARPA. The announcement in the Commerce Business Daily formally solicits members for the initial establishment of the organization, but new members can be added at any later time.

The plans for the LDC were announced at the DARPA Speech and Natural Language meeting in Asilomar in February, and discussed at the ACH/ALLC meeting in Tempe in March and the ACL meeting in Berkeley in June. Within the community of computational linguists and speech researchers, both in the US and abroad, the LDC has been widely discussed.

>Can someone
>offer more background on this (apparently) important project?

Over the past decade, research in speech and natural language technology has come to depend more and more on models induced from very large amounts of text and speech. The needed data is expensive and troublesome to get, and it is also hard to compare results unless different groups can share the same data for training and testing.

DARPA has funded the development of speech databases for several years, and has made them generally available through NIST. In 1989, the ACL formed an ad hoc committee to gather and distribute text and speech corpora, the ACL Data Collection Initiative. In the fall of 1990, the NSF sponsored a workshop, run by the ACL, on "Open Lexical and Textual Resources," which aimed to arrive at a consensus on needs and opportunities in this area, and was attended by representatives of several government agencies, including DARPA. Large linguistic data projects, such as the British National Corpus, are underway in Europe and in Japan.

The proposed formation of the LDC is thus another blossom in an already-vigorous flowering of efforts to create shared resources for research and development of natural language technology. The particular form in this case, a government-industry-university consortium, seems appropriate given the nature of the problem, but it also forms part of a larger picture. In response to a recent request from Congress, DARPA has proposed six consortia intended to promote pre-competitive technology development: the Linguistic Data Consortium, a Ceramic Fiber Consortium, a Consortium for Optoelectronics and All-Optical Networks, a Superconducting Electronics Consortium, a Scalable Computing Systems Consortium, and an Advanced Static Random Access Memory Consortium.

>What institutions are/intend to be Senior Members?

The consortium has not been formed yet, nor have any companies committed to joining as "senior members." I would like to underline the fact that senior members do not have any privileged access to data, and that (as the announcement says) "broad participation is desired" and "general membership fees will be set at affordable levels."

>Will there be a later enrollment period?

The plans for the LDC have not included any notion of an "enrollment period." Applications for membership will be accepted at any time.

>What linguists, if any, are consultants?

All of the members of the planning committee are researchers who work on speech or text, and are thus linguists in some sense of the term. Patti Price and I have degrees in linguistics. Mitch Marcus has a secondary appointment as a member of the linguistics department at Penn. Several members of the committee are active in the Association for Computational Linguistics, notably Don Walker, its secretary-treasurer.

As a European, I am surprised that in the US, a lot of research money is apparently spent through Defense, and in particular through DARPA. After they have almost monopolized American AI research during the last decades, it seems that the army is now ready to invade linguistics (as announced in Linguist Vol. 2-367). Have academic institutions in the US never objected to this continuing militarization of research? Are American scientists not arguing for allocation of more government research funds through civilian channels? And in particular, should linguists not claim that any Linguistic Data Consortium be supervised by peaceful civilians rather than the DOD?

Koenraad De Smedt