Editor for this issue: <>
In Linguist List Vol-2-485, 9/10/91, grefen@cs.pitt.edu (Gregory Grefenstette) inquires:

>A recent listing in the LINGUIST spoke of The Hansards
>(bilingual transcripts of the Canadian Parliament debates)
>currently available through the ACL/DCI.
>and then said to contact the ACL/DCI.
>
>Does anyone know the e-mail coordinates of this contact?

I'm probably the best contact for this material, since I'm the one who is trying to put it in shape to distribute. My email address is myl@unagi.cis.upenn.edu; the Hansard corpus will be available on CD-ROM sometime later this year.

That's the short answer. Since many Linguist readers don't know what the ACL/DCI is, and some of these may want to know, I give a longer answer below. I apologize to others for running on at such length.

In early 1989, the Association for Computational Linguistics set up an ad hoc committee called the Data Collection Initiative (hence ACL/DCI), to oversee the acquisition and preparation of a large linguistic corpus to be made available for scientific research at cost and without royalties. All materials submitted for inclusion in the collection remain the exclusive property of the copyright holders (if any) for all other purposes. Each applicant for data from the ACL/DCI is required to sign an agreement not to redistribute the data or make any direct commercial use of it; however, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, is explicitly permitted.

The ACL/DCI has gathered several hundred million words of text, one dictionary, and a bit of speech data. About 40 individuals and research groups have gotten portions of the ACL/DCI's holdings, mostly by cartridge tape. Future distribution will mainly be by CD-ROM. Because of restrictions placed on us by the original information providers, anonymous ftp is not an option as a distribution mechanism.

For the first couple of years of its operation, the ACL/DCI was run entirely by volunteer labor, using borrowed computer resources. Last spring, the General Electric company gave us some much-appreciated seed money, which we used to buy some disk drives, and AT&T Bell Labs lent us some other disks. This past summer, we got a grant from the NSF (IRI 91-13530), which provides for some additional computer equipment, and a graduate RA for a year and a half to help with data preparation and distribution. Dragon Systems Inc.
paid for the manufacture of our first CD-ROM, and IBM has offered to pay for the second one.

The first ACL/DCI CD-ROM is now being manufactured, and 200 copies of it should be shipped to me on Sept. 16. It contains about 310 MB of Wall Street Journal text, about 180 MB of scientific abstracts, the full text of the 1979 edition of the Collins English Dictionary, and a preliminary sample of tagged and parsed text from the Penn Treebank project. In order to get it, users need only sign and return a copy of the ACL/DCI User Agreement, with a $25 check to the ACL.

Most, if not all, of the Hansard material should be on the second ACL/DCI CD-ROM, which we hope to produce later this fall. Other material will be released later on.

I now have a total of about six years of the Hansard, derived by two different routes at two different times by two different people from the Canadian government. For half of it, I have official permission to re-distribute via the ACL/DCI, to bona fide researchers who sign the ACL/DCI's User Agreement. For the other half, there are a few loose ends to tie up to get such permission. Each part amounts to about 500 MB.

The format of the two parts is somewhat different. One part arrived in the form of a typographer's tape, which I have analyzed and marked up so that information in font shifts and the distribution of white space is replaced by explicit tagging of headings, speaker attributions, tables, and so forth. The original version of this portion has also been aligned (by researchers at Bell Labs), in the sense of explicitly connecting the corresponding French and English sentences. The marked-up version and the aligned version have yet to be merged, and both mark-up and alignment are more errorful than I would like.

The second part was cleaned up and aligned some time ago by researchers at IBM. They have given it to me in the form of two lists of sentences, one in French and one in English, numbered correspondingly.
All other material (including the identification of who is talking, the boundaries of speaking turns, the indication of session boundaries, etc.) has been elided. The division into sentences and the alignment of the two languages have been done to a very high standard, and thus this portion is a wonderful resource for research on translation (or other issues) at the level of the sentence. Those interested in discourse phenomena may perhaps prefer the other half of the Hansard database, or other available texts.

Mark Liberman
myl@unagi.cis.upenn.edu

The University of Arizona Linguistics Circle is pleased to bring you, 'Linguists on the Net', the opportunity to purchase volumes of our working papers - _The Coyote Papers_. Please check out the order form that will be available on the server for more information on the volumes (authors, titles, prices, etc.). If you're interested, you can either:

1) Download or capture the order form, print it, fill it out, and return it to the US Mail address provided (please notice that all orders must be prepaid)

OR

2) Send a message to the address below telling me what volumes you'd like. You'll still need to send us the pre-payment though. No email money allowed - sorry! :-)

Thank you for your cooperation!

Patricia E. Perez
Linguistics Circle
patep@arizrvax.bitnet

[Moderators' note: additional information relevant to this posting is available on the server. To get the file, send a message to:

    listserv@uniwa.uwa.oz.au

The message should consist of the single line:

    get coyote

You will then receive the complete file.]

Correction: Please substitute "Tuebingen" for "Tingen" throughout the text of my conference announcement. Sorry, that's what happens when you think the computer can handle the German "umlaut"!!