LINGUIST List 5.1186

Thu 27 Oct 1994

Sum: English corpora

Editor for this issue: <>


Directory

  1. Kathy Mitchell, Summary of responses to request for English corpora

Message 1: Summary of responses to request for English corpora

Date: Tue, 25 Oct 1994 17:57-050
From: Kathy Mitchell <ai.kathy@mcc.com>
Subject: Summary of responses to request for English corpora

Recently I sent a request to the members of the LINGUIST list asking for
references to English corpora. I got several extremely informative
responses, which are summarized below. Thanks go to:
 Knut Hofland Knut.Hofland@hd.uib.no
 Jane A. Edwards edwards@cogsci.berkeley.edu
 Mark Liberman myl@sansom.ling.upenn.edu
 Loren Allen Billings BILLINGS@pucc.princeton.edu
 Patricia Haegeman FTE.HAEGEMAN.P@alpha.ufsia.ac.be

I haven't yet investigated all of these options. I truly appreciate the
speedy and voluminous response.

Kathy Mitchell
ai.kathy@mcc.com

Several people mentioned a 1993 survey of well-known corpora written
by Jane Edwards and published as chapter 10 in the book:
 Edwards, Jane A. & Martin D. Lampert (eds). TALKING DATA: TRANSCRIPTION AND
 CODING IN DISCOURSE RESEARCH. London and Hillsdale, NJ: Erlbaum.
 336 pp. 0-8058-0349-1 [ppr]
This chapter is also available via anonymous ftp from cogsci.berkeley.edu
in compressed format, in the "pub" directory, under the filename
"CorpusSurvey.Z", and as the LINGUIST file "CORPORA FAQ", which is
retrievable by email by sending the message "GET CORPORA FAQ LINGUIST"
to LISTSERV@tamvm1.tamu.edu. (For a list of all the archived LINGUIST
files, send the message "INDEX LINGUIST" to this same address.)
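
As a rough illustration of the mail-server requests described above, the
following Python sketch builds (but does not send) a message carrying one
LISTSERV command. The function name and the sender address are
hypothetical, and the sketch assumes that the command belongs in the
message body and that the server address has the usual user@host form.

```python
from email.message import EmailMessage


def build_listserv_request(command, server, sender):
    """Build a plain-text mail whose body is a single LISTSERV command."""
    msg = EmailMessage()
    msg["From"] = sender   # hypothetical sender; use your own address
    msg["To"] = server
    # Assumption: the server reads its command from the message body.
    msg.set_content(command)
    return msg


request = build_listserv_request(
    "GET CORPORA FAQ LINGUIST",
    "LISTSERV@tamvm1.tamu.edu",
    "reader@example.org",
)
```

Replacing the command with "INDEX LINGUIST" would give the index request.
Actually sending the message (for example with smtplib) is left out, since
that depends on the reader's own mail setup.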


A couple of other surveys of corpora were mentioned as well. The Lancaster
Survey of Machine-Readable Language Corpora is available from the ICAME
file server FAFSRV@NOBERGEN.BITNET (send mail with Subject: DIR).

Another survey "to create and maintain a comprehensible database about
archives and projects in machine-readable text" was done by the Center for
Text and Technology at Georgetown University, in collaboration with
other centres. More information can be obtained at:

 Michael Neuman, Ph.D.,
 Georgetown Centre for Text and Technology,
 Reiss Science Building, Room 238,
 Georgetown University,
 Washington,
 DC 20057,
 U.S.A.


There is an unmoderated email list, CORPORA, for discussion of text
corpora: availability, aspects of compiling and using corpora,
software, tagging, parsing, bibliography, etc. One can subscribe by
sending an email message with the command "sub corpora <firstname>
<lastname>" to LISTSERV@UIB.NO. This list is hosted at the Norwegian
Computing Centre for the Humanities in Bergen, Norway. Information
stored at the machine nora.hd.uib.no can be accessed through gopher,
anonymous FTP, or a mail server (send a "help" message for more
details). There is also a World-Wide-Web page with the URL
http://www.hd.uib.no. A contact for any of these services is Knut Hofland
(Knut.Hofland@hd.uib.no).


The International Computer Archive of Modern English (ICAME) has a
number of English corpora (including American, British and Indian
English corpora) available in various media at low cost, primarily
for research and teaching purposes. The corpora are distributed by the
Norwegian Computing Centre for the Humanities (NCCH) in Bergen, Norway,
which can be contacted at icame@hd.uib.no.


The Linguistic Data Consortium has an extensive list of corpora
available for sale. Information about these can be obtained by anonymous
ftp from ftp.cis.upenn.edu in directory /pub/ldc, or by email from
ldc@unagi.cis.upenn.edu.


The Center for Electronic Texts in the Humanities (CETH), a joint
Rutgers/Princeton organization, is worth investigating. They can be
contacted at ceth@pucc.princeton.edu or hockey@zodiac.rutgers.edu.


There is some interesting tagging software available by anonymous
ftp from PARCFTP.Xerox.COM in /ftp/pub/tagger.


Some specific corpora mentioned were:

The Multilingual Corpus 1 of the European Corpus Initiative (ECI/MCI)
contains almost 100 million words in 27 (mainly European) languages. It
consists of 48 opportunistically collected component corpora marked up
in SGML. The CD-ROM is available in the US from the Linguistic Data
Consortium (LDC) or from ELSNET, 2 Buccleuch Place, Edinburgh EH8 9LW,
SCOTLAND. Information on ordering it from ELSNET can be obtained from
 elsnet@cogsci.ed.ac.uk
or
 http://www.cogsci.ed.ac.uk/elsnet/eci.html
or by anonymous ftp from
 ftp.cogsci.ed.ac.uk:pub/elsnet/eci/mci-listing


The ACL has a CD-ROM, in ISO 9660 format, containing about 300 Mb of
Wall Street Journal text, a large collection of scientific abstracts,
the full text of the 1979 edition of the Collins English Dictionary, and
some samples of tagged and parsed text from the Penn Treebank project.
To order this, send a message to Rafi Khan (khanr@unagi.cis.upenn.edu)
including your mailing address, and he will send a paper copy of the
User Agreement to be signed.


The SUSANNE Corpus is an annotated sample comprising about 130,000 words
of written American English text, produced to exemplify a set of
annotation standards which attempt to specify an explicit notation for
all aspects of the surface and logical grammar of real-life English in
sufficient detail that analysts independently applying the standards to
the same text must produce identical annotations. These standards are
defined in the book ENGLISH FOR THE COMPUTER; a skeleton outline of the
scheme is included in the electronic documentation file which
accompanies the Corpus. The texts of the SUSANNE Corpus are a subset of
the texts included in the (unannotated) Brown University Corpus.
Release 3 of the SUSANNE Corpus is available by anonymous ftp from the
Oxford Text Archive at black.ox.ac.uk in the directory ota/susanne;
follow the instructions in the README file in that directory.
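
For readers who would rather script the anonymous ftp retrievals mentioned
in this summary, here is a minimal Python sketch using the standard ftplib
module. The function name and the ftp_factory parameter are hypothetical
conveniences, not part of any of the archives described, and the hosts
named in this summary may no longer be reachable.

```python
import ftplib


def fetch_anonymous(host, directory, filename, ftp_factory=ftplib.FTP):
    """Download one file by anonymous FTP and return its contents as bytes."""
    chunks = []
    ftp = ftp_factory(host)          # ftplib.FTP(host) connects immediately
    try:
        ftp.login()                  # no arguments means an anonymous login
        ftp.cwd(directory)
        # retrbinary calls the callback once per block of data received
        ftp.retrbinary("RETR " + filename, chunks.append)
    finally:
        ftp.close()
    return b"".join(chunks)


# Example call (not executed here; the host and path are taken from the text):
# readme = fetch_anonymous("black.ox.ac.uk", "ota/susanne", "README")
```

The ftp_factory argument exists only so that the function can be tried out
against a stand-in object without a network connection.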

The Oxford Text Archive is a repository for some corpora as well:

 Oxford Text Archive,
 Oxford University Computing Service,
 13 Banbury Road,
 Oxford OX2 6NN.

 E-mail: ARCHIVE@VAX.OX.AC.UK (outside JANET)
         ARCHIVE@UK.AC.OX.VAX (JANET)