THE SUSANNE CORPUS
[Revised announcement including modified access instructions]

26 October 1992

Geoffrey Sampson
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, England
geoffs@uk.ac.susx.cogs

Colleagues needing the use of a grammatically-analysed corpus of English may like to know that Release 1 of the SUSANNE Corpus is now complete, and is freely available from the Oxford Text Archive via anonymous ftp to any machine connected to the Internet. Instructions for retrieving a copy of the Corpus are given at the end of this announcement.

The SUSANNE Corpus has been created, with the sponsorship of the Economic and Social Research Council (UK), as part of the process of developing a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English. The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.

The SUSANNE scheme may be likened to a "Linnaean taxonomy" of the grammatical domain: its aim (comparable to that of Linnaeus's eighteenth-century taxonomy for the domain of botany) is not to identify categories which are theoretically optimal or which necessarily reflect the psychological organization of speakers' linguistic competence, but simply to offer a scheme of categories and ways of applying them that make it practical for NLP researchers to register everything that occurs in real-life usage systematically and unambiguously, and for researchers at different sites to exchange empirical grammatical data without misunderstandings over local uses of analytic terminology.

The SUSANNE Corpus comprises an approximately 128,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme.

The SUSANNE analytic scheme is defined in detail in a book by myself, ENGLISH FOR THE COMPUTER, forthcoming from Oxford University Press, and briefly in a documentation file which accompanies the Corpus. The Chairman of the Analysis and Interpretation Working Group of the US/EC-sponsored Text Encoding Initiative has proposed the adoption of the scheme as a recognised TEI standard.

The SUSANNE scheme aims to specify annotation norms for the modern English language; it does not cover other languages, although it is hoped that the general principles of the SUSANNE scheme may prove helpful in developing comparable taxonomies for these.

Regrettably, Release 1 of the SUSANNE Corpus is not a "TEI-conformant" resource, though aspects of the annotation scheme have been decided in such a way as to facilitate a move to TEI conformance in later releases. The working timetable of the Initiative meant that relevant aspects of the TEI Guidelines were not yet complete at the point when the SUSANNE Corpus was ready for initial release; delaying this release would have been unfortunate.

Although the SUSANNE analytic scheme is by now rather tightly defined, Release 1 of the SUSANNE Corpus undoubtedly still contains errors despite considerable proof-checking. It is intended to correct these in later releases; I should be extremely grateful if users discovering errors would notify me, preferably by post rather than e-mail.

The SUSANNE Corpus consists of 64 data files (each comprising an annotated version of one Brown text), together with a documentation file. However, the versions held by the Oxford Text Archive are compressed, in order to reduce file transfer time, into single files in two alternative formats, suitable for Unix users and for users who have access only to a PC.

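Once a copy has been unpacked, the file counts stated above (64 data files plus one documentation file) can be checked mechanically. The following is only a minimal sketch in Python, not part of the release: the helper name check_corpus is hypothetical, the documentation file is assumed to be the one the retrieval instructions call "SUSANNE.doc" (matched without regard to case, since PC unpacking may change it), and every other file found under the chosen directory is assumed to be a data file.

```python
from pathlib import Path

EXPECTED_DATA_FILES = 64      # one annotated Brown text per data file
DOC_NAME = "susanne.doc"      # compared case-insensitively (assumption)


def check_corpus(root):
    """Return (ok, message) for an unpacked copy of the Corpus under root.

    Hypothetical helper: it only counts files; it does not inspect
    their contents or their names (apart from the documentation file).
    """
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    docs = [p for p in files if p.name.lower() == DOC_NAME]
    data = [p for p in files if p.name.lower() != DOC_NAME]
    if len(docs) != 1:
        return False, f"expected 1 documentation file, found {len(docs)}"
    if len(data) != EXPECTED_DATA_FILES:
        return False, (
            f"expected {EXPECTED_DATA_FILES} data files, found {len(data)}"
        )
    return True, "64 data files and 1 documentation file present"
```

The directory should contain nothing but the unpacked Corpus; a leftover archive file, for instance, would be counted as a data file and reported as a mismatch.
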
The procedure for retrieving a copy of the Corpus in either case is as follows:

From a machine on the Internet, type either:

    ftp black.ox.ac.uk

or, since the Archive is not yet in many official name tables:

    ftp 129.67.1.165

When connected, you will be prompted for an account name, to which you should respond:

    ftp

or:

    anonymous

You will be asked to supply a password, in response to which you should type your e-mail address. After this is accepted, your first command should be to move to the directory containing the Text Archive files, by typing:

    cd ota

To see a list of the files and directories currently available, type:

    ls

All files relating to the SUSANNE Corpus are kept in the directory "susanne", so your next command should be:

    cd susanne

Apart from a README file containing the instructions which you are currently reading, this directory contains the two alternative compressed versions of the SUSANNE Corpus. To retrieve a copy of the Corpus, if you are a Unix user, type:

    get susanne.tar.Z

Having successfully transferred a copy of "susanne.tar.Z" to your home system, get the material into a usable state by the successive commands:

    uncompress susanne.tar.Z

and:

    tar -xf susanne.tar

If you are not a Unix user, you need to retrieve the other version of the Corpus, which will be uncompressed using the PKUNZIP software on an IBM-PC. First, set ftp transfer mode to binary by typing the command:

    bin

at the ftp prompt. Then retrieve the appropriate version of the Corpus by typing:

    get susanne.zip

Having transferred a copy of the Corpus to your home machine, uncompress it with the command:

    pkunzip -x susanne.zip

In either case (whether you have followed the Unix or the non-Unix instructions) you should now have the Corpus split up into its 65 files, one of which, "SUSANNE.doc", is a text file describing the format and contents of the 64 data files.

To log out of the ftp connexion, type:

    bye

If you encounter any problems, please send an e-mail message to archive@black.ox.ac.uk or archive@uk.ac.oxford.vax.
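
For users who would rather script the ftp session than type it, the steps above might be sketched in Python with the standard ftplib module. This is an illustration under stated assumptions rather than a supported access method: the function name retrieve_susanne and its parameters are hypothetical, the host, directory and file names are simply those given in the instructions, and the transfer is made in binary mode for both versions of the Corpus.

```python
import os
from ftplib import FTP

HOST = "black.ox.ac.uk"   # host named in the instructions above


def retrieve_susanne(ftp, email, unix=True, dest_dir="."):
    """Fetch one compressed version of the Corpus; return its local path.

    `ftp` is an already-connected ftplib.FTP object (or any object with
    the same login/cwd/retrbinary/quit methods). Hypothetical helper.
    """
    ftp.login("anonymous", email)   # the password is your e-mail address
    ftp.cwd("ota")                  # Text Archive directory
    ftp.cwd("susanne")              # SUSANNE Corpus directory
    name = "susanne.tar.Z" if unix else "susanne.zip"
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as out:
        # retrbinary always transfers in binary mode (the "bin" step)
        ftp.retrbinary("RETR " + name, out.write)
    ftp.quit()                      # equivalent of typing "bye"
    return path
```

A call would look like retrieve_susanne(FTP(HOST), "you@example.org"), where the address is a placeholder for your own. Unpacking the retrieved file is still done with the uncompress, tar or pkunzip commands given above.
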