Recently I sent a request to the members of the LINGUIST list asking for references to English corpora. I got several extremely informative responses, which are summarized below. Thanks go to:

Knut Hofland            Knut.Hofland@hd.uib.no
Jane A. Edwards         edwards@cogsci.berkeley.edu
Mark Liberman           myl@sansom.ling.upenn.edu
Loren Allen Billings    BILLINGS@pucc.princeton.edu
Patricia Haegeman       FTE.HAEGEMAN.P@alpha.ufsia.ac.be

I haven't yet investigated all of these options. I truly appreciate the speedy and voluminous response.

Kathy Mitchell
ai.kathy@mcc.com

Several people mentioned a 1993 survey of well-known corpora written by Jane Edwards and published as chapter 10 in the book:

Edwards, Jane A. & Martin D. Lampert (eds). TALKING DATA: TRANSCRIPTION AND CODING IN DISCOURSE RESEARCH. London and Hillsdale, NJ: Erlbaum. 336 pp. 0-8058-0349-1 [ppr]

This chapter is also available via anonymous ftp from cogsci.berkeley.edu in compressed format, in the "pub" directory, under the filename "CorpusSurvey.Z", and as the LINGUIST file "CORPORA FAQ", which is retrievable by email by sending the message "GET CORPORA FAQ LINGUIST" to LISTSERV@tamvm1.tamu.edu. (For a list of all the archived LINGUIST files, send the message "INDEX LINGUIST" to this same address.)

A couple of other surveys of corpora were mentioned as well. The Lancaster Survey of Machine-Readable Language Corpora is available from the ICAME file server FAFSRV@NOBERGEN.BITNET (send mail with Subject: DIR).

Another survey "to create and maintain a comprehensible database about archives and projects in machine-readable text" was done by the Center for Text and Technology at Georgetown University, in collaboration with other centres. More information can be obtained from:

Michael Neuman, Ph.D., Georgetown Centre for Text and Technology, Reiss Science Building, Room 238, Georgetown University, Washington, DC 20057, U.S.A.

There is an unmoderated email list, CORPORA, for discussion of text corpora, covering topics such as availability, aspects of compiling and using corpora, software, tagging, parsing, and bibliography. One can subscribe by sending an email message with the command "sub corpora <firstname> <lastname>" to LISTSERV@UIB.NO. This list is hosted at the Norwegian Computing Centre for the Humanities in Bergen, Norway. Information stored on the machine nora.hd.uib.no can be accessed through gopher, anonymous FTP, or a mail server (send a "help" message for more details). There is also a World-Wide-Web page with the URL http://www.hd.uib.no. A contact for any of these services is Knut Hofland (knut.hofland@hd.uib.no).

The International Computer Archive of Modern English (ICAME) has a number of English corpora (including American, British and Indian English corpora) available in various media at low cost, primarily for research and teaching purposes. These are distributed by the Norwegian Computing Centre for the Humanities (NCCH) in Bergen, Norway, which can be contacted at icame@hd.uib.no.

The Linguistic Data Consortium has an extensive list of corpora available for sale. Information about these can be obtained by anonymous ftp from ftp.cis.upenn.edu in directory /pub/ldc, or by email from ldc@unagi.cis.upenn.edu.

The Center for Electronic Texts in the Humanities (CETH), a joint Rutgers/Princeton organization, is worth investigating. They can be contacted at ceth@pucc.princeton.edu or hockey@zodiac.rutgers.edu.

There is some interesting tagging software available by anonymous ftp from PARCFTP.Xerox.COM in /ftp/pub/tagger.

Some specific corpora mentioned were:

The Multilingual Corpus 1 of the European Corpus Initiative (ECI/MCI) contains almost 100 million words in 27 (mainly European) languages. It consists of 48 opportunistically collected component corpora marked up in SGML. The CD-ROM is available in the US from the Linguistic Data Consortium (LDC) or from ELSNET, 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND. Information on ordering it from ELSNET can be obtained from elsnet@cogsci.ed.ac.uk or http://www.cogsci.ed.ac.uk/elsnet/eci.html or by anonymous ftp from ftp.cogsci.ed.ac.uk:pub/elsnet/eci/mci-listing

The ACL has a CD-ROM, in ISO 9660 format, containing about 300 Mb of Wall Street Journal text, a large collection of scientific abstracts, the full text of the 1979 edition of the Collins English Dictionary, and some samples of tagged and parsed text from the Penn Treebank project. To order this, send a message to Rafi Khan (khanr@unagi.cis.upenn.edu) including your mailing address, and he will send a paper copy of the User Agreement to be signed.

The SUSANNE Corpus is an annotated sample comprising about 130,000 words of written American English text. It was produced to exemplify a set of annotation standards which attempt to specify an explicit notation for all aspects of the surface and logical grammar of real-life English, in sufficient detail that analysts independently applying the standards to the same text must produce identical annotations. These standards are defined in the book ENGLISH FOR THE COMPUTER; a skeleton outline of the scheme is included in the electronic documentation file which accompanies the Corpus. The texts of the SUSANNE Corpus are a subset of the texts included in the (unannotated) Brown University Corpus. Release 3 of the SUSANNE Corpus is available by anonymous ftp from the Oxford Text Archive at black.ox.ac.uk in the directory ota/susanne; follow the instructions in the README file in that directory.

The Oxford Text Archive is a repository for some corpora as well:

Oxford Text Archive, Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN.
E-mail: ARCHIVE@VAX.OX.AC.UK (outside JANET)
        ARCHIVE@UK.AC.OX.VAX (JANET)