Message 1: Answer to Hansard Inquiry

Date: Tue, 10 Sep 91 22:16:45 EDT
From: <>
Subject: Answer to Hansard Inquiry
In Linguist List Vol-2-485, 9/10/91,
Gregory Grefenstette inquires:
 >A recent listing in the LINGUIST spoke of The Hansards
 >(bilingual transcripts of the Canadian Parliament debates)
 >currently available through the ACL/DCI.
 >and then said to contact the ACL/DCI.
 >Does anyone know the e-mail coordinates of this contact?
I'm probably the best contact for this material, since I'm the one
who is trying to put it in shape to distribute. My email address
is; the Hansard corpus will be available
on CD-ROM sometime later this year. That's the short answer.
Since many Linguist readers don't know what the ACL/DCI is, and some
of these may want to know, I give a longer answer below. I apologize to
others for running on at such length.
In early 1989, the Association for Computational Linguistics set up an
ad hoc committee called the Data Collection Initiative (hence
ACL/DCI), to oversee the acquisition and preparation of a large
linguistic corpus to be made available for scientific research at cost
and without royalties. All materials submitted for inclusion in the
collection remain the exclusive property of the copyright holders (if
any) for all other purposes. Each applicant for data from the ACL/DCI
is required to sign an agreement not to redistribute the data or make
any direct commercial use of it; however, commercial application of
"analytical materials" derived from the text, such as statistical
tables or grammar rules, is explicitly permitted.
The ACL/DCI has gathered several hundred million words of text, one
dictionary, and a bit of speech data. About 40 individuals and
research groups have gotten portions of the ACL/DCI's holdings, mostly
by cartridge tape. Future distribution will mainly be by CD-ROM.
Because of restrictions placed on us by the original information
providers, anonymous ftp is not an option as a distribution mechanism.
For the first couple of years of its operation, the ACL/DCI was run
entirely by volunteer labor, using borrowed computer resources. Last
spring, the General Electric company gave us some much-appreciated
seed money, which we used to buy some disk drives, and AT&T Bell Labs
lent us some other disks. This past summer, we got a grant from the
NSF (IRI 91-13530), which provides for some additional computer
equipment, and a graduate RA for a year and a half to help with data
preparation and distribution. Dragon Systems Inc. paid for the
manufacture of our first CD-ROM, and IBM has offered to pay for the
second one.
The first ACL/DCI CD-ROM is now being manufactured, and 200 copies of
it should be shipped to me on Sept. 16. It contains about 310 MB of
Wall Street Journal text, about 180 MB of scientific abstracts, the
full text of the 1979 edition of the Collins English Dictionary, and a
preliminary sample of tagged and parsed text from the Penn Treebank
project. In order to get it, users need only sign and return
a copy of the ACL/DCI User Agreement, with a $25 check to the ACL.
Most, if not all, of the Hansard material should be on the
second ACL/DCI CD-ROM, which we hope to produce later this fall.
Other material will be released later on.
I now have a total of about six years of the Hansard, obtained from
the Canadian government by two different people, by two different
routes, at two different times. For half of it, I have official
permission to re-distribute via the ACL/DCI, to bona fide
researchers who sign the ACL/DCI's User Agreement. For the other
half, there are a few loose ends to tie up to get such permission.
Each part amounts to about 500 MB. The format of the two parts is
somewhat different. One part arrived in the form of a typographer's
tape, which I have analyzed and marked up so that information in font
shifts and the distribution of white space is replaced by explicit
tagging of headings, speaker attributions, tables, and so forth. The
original version of this portion has also been aligned (by researchers
at Bell Labs), in the sense of explicitly connecting the corresponding
French and English sentences. The marked-up version and the aligned
version have yet to be merged, and both mark-up and alignment are more
errorful than I would like.
The second part was cleaned up and aligned some time ago by
researchers at IBM. They have given it to me in the form of two lists
of sentences, one in French and one in English, numbered
correspondingly. All other material (including the identification of
who is talking, the boundaries of speaking turns, the indication of
session boundaries, etc.) has been elided. The division into sentences
and the alignment of the two languages have been done to a very high
standard, and thus this portion is a wonderful resource for research
on translation (or other issues) at the level of the sentence. Those
interested in discourse phenomena may perhaps prefer the other half of
the Hansard database, or other available texts.
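As a rough illustration of how two correspondingly numbered sentence
lists might be put back together into aligned pairs, here is a minimal
Python sketch. The line layout (a sentence number, a tab, then the
sentence) is purely an assumption made for the example, not the actual
format of the IBM material, and the function names are hypothetical.
The sample sentences in the usage note are invented, not taken from the
Hansard.

```python
def parse_numbered(lines):
    """Map sentence number -> sentence text.

    Assumes (hypothetically) that each line looks like '12<TAB>text'.
    Blank lines are skipped.
    """
    sentences = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        number, _, text = line.partition("\t")
        sentences[int(number)] = text.strip()
    return sentences


def pair_sentences(english_lines, french_lines):
    """Return (number, english, french) tuples for numbers found in both lists."""
    en = parse_numbered(english_lines)
    fr = parse_numbered(french_lines)
    shared = sorted(en.keys() & fr.keys())
    return [(n, en[n], fr[n]) for n in shared]
```

For example, pair_sentences(["1\tThank you."], ["1\tMerci."]) yields a
single tuple pairing the two sentences under number 1. Numbers present
in only one list are dropped without comment; a real tool would
probably want to report them.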
	Mark Liberman
Message 2: The Coyote Papers

Date: Tue, 10 Sep 91 22:54 MST
From: "Pat E. Perez" <>
Subject: The Coyote Papers
The University of Arizona Linguistics Circle is pleased to offer you
'Linguists on the Net' the opportunity to purchase volumes of our
working papers - _The Coyote Papers_. Please check out the order form
that will be available on the server for more information on the
volumes (authors, titles, prices, etc.). If you're interested, you
can either:
1) Download or capture the order form, print it, fill it out, and
 return it to the US Mail address provided (please notice that all
 orders must be prepaid) OR
2) Send a message to the address below telling me what volumes you'd
 like. You'll still need to send us the pre-payment though. No email
 money allowed - sorry! :-)
Thank you for your cooperation!
Patricia E. Perez
Linguistics Circle
Message 3: Correction to conference on language acquisition

Date: Wed, 11 Sep 91 16:31:54 +0200
From: R. Tracy <>
Subject: Correction to conference on language acquisition
Correction: Please substitute "Tuebingen" for "Tingen" throughout the text
of my conference announcement. Sorry, that's what happens when you think the
computer can handle the German "umlaut" !!
