Editor for this issue: T. Daniel Seely <dseely@emunix.emich.edu>
In LINGUIST 6.973 and a parallel query to the corpora list, Raphael Salkie asked some questions about the UN Parallel Corpus from the Linguistic Data Consortium (LDC). Before these queries appeared, a response to an earlier query from Dr. Salkie had already gone out privately from Rebecca Finch. I certainly urge anyone with experience in using the UN corpus to respond to Dr. Salkie as well.

We have also placed a sample of 24 (out of 21,000) parallel documents, each in English, French and Spanish, in:

ftp://ftp.cis.upenn.edu/pub/ldc/data_samples/UN_Par_sample.tar.Z

These samples should also be accessible, along with quite a bit of other LDC information, from the WWW page at URL

http://www.cis.upenn.edu/~ldc

Let me add a few words about LDC prices and costs, since Dr. Salkie's message expressed the normal human annoyance at being asked to part with money, both in the case of the UN corpus and another (not yet published) LDC parallel text corpus, the Canadian Hansard.

The LDC membership fee for a university is $2,000, and for this fee everyone at that university can get an unlimited and perpetual research license for everything that the LDC publishes during the year of membership. Thus you can join the other ninety current members of the consortium and get not only the forthcoming Hansard corpus, but also the other twenty or so databases published this year. For the same amount, a university can get the UN corpus and the other 15 databases published in 1994.

Whether a particular database, or collection of databases, is worth that amount of money is of course a matter of individual or institutional judgement. We feel that $2,000, which is roughly the cost of a moderately configured PC or an international conference trip, is not out of line even for university researchers.
Speaking for myself, I have a great deal of sympathy for the effort to provide research resources free or at minimal cost, and I have been involved in several successful efforts to bring out such databases over the years, including the ACL/DCI CD-ROM, the ECI disk, the CELEX disk, and others offered in the range of $25-$200. These efforts rely heavily on volunteer labor and other donated resources; in several cases they have also relied on cash donations from the LDC.

However, volunteer labor is rarely available in the needed quantities; and of course LDC-supplied cash, as well as the existence of the LDC as an organization, depends on income from somewhere. The money that we get from memberships and database sales is a crucial part of the picture; without it the LDC would not exist, and neither would either of the databases under discussion.

To highlight the point, the history of the U.N. publication is worth reviewing briefly. We decided to try to publish the U.N. archives because translation researchers wanted parallel texts. After concluding several months of negotiations with UN representatives and lawyers for both sides, we paid for a NJ-based computer consultant to go into the UN offices at night so as to make backups of the archives from dismountable disk packs for a long-obsolete Wang word processor onto cartridge tapes. This required several months and cost a considerable sum; we had to use this particular person because he was an authorized service rep for the UN facility.

Then came six person-months of work at the LDC. We had to decode the proprietary and undocumented Wang BACKUP format, and the equally proprietary and undocumented Wang character set, typographical codes and file structures. We re-organized the entire archive and translated it into WordPerfect format, and published a certain number of CD-ROMs in this form for the purposes of the UN; this was part of the agreement that we made with them for access to the data.
Then we translated the documents into ISO-8859-1 with SGML markup (including a working DTD, for those who care), and worked out the correspondences among documents. This was far from trivial, since each UN language had been entered separately, with no coordination of file names, file dates, or even division of documents into files, and there were tens of thousands of documents per language. This work was mainly done by Dave Graff, whose salary the LDC pays.

We are not likely to recover the costs of acquisition and production of this database through sales and memberships bought for its sake. We subsidize our members by cost-sharing with government grants, or by using income from more popular or less expensive databases to cover unrecovered costs of less popular or more expensive ones.

In the case of the forthcoming Hansard corpus, which Dr. Salkie also mentioned, the cost of acquisition and publication has been similar to that of the U.N. material, and the same remarks about subsidies apply. Whether a particular database is worth a certain price is a matter of individual taste, but as a matter of simple arithmetic, the fees charged in these two cases are unlikely ever to cover the costs incurred.

For those who have read this far, I would like to repeat a standing offer that has been in existence since the beginning of the LDC. If you are interested in CD-ROM publication of a language-related database that is plausibly of interest to our membership, and this database is reasonably close to being in shape to publish, we will pay the costs of production, using your label design or one worked out with you; we will give you up to a hundred copies to do with as you see fit; we will put the item in our catalogue at whatever price you choose; and we will remit to you any resulting income in excess of our production costs. The copyright (if any) will remain with you, and we will handle any user license arrangements that may be necessary, sending the signed licenses to you.
We have published several databases on this basis, and are planning to publish several others, although (from past experience) the chances of making back our production costs are no better than even.

Best wishes,

Mark Liberman
myl@unagi.cis.upenn.edu
619 Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
Phone: 215-898-0141
Fax: 215-573-2175