LINGUIST List 7.1661

Sat Nov 23 1996

Sum: Corpus design

Editor for this issue: Susan Robinson <robinson@emunix.emich.edu>


Directory

  1. Adrian Clynes, Summary, corpus query

Message 1: Summary, corpus query

Date: Thu, 21 Nov 1996 08:29:38 +0800
From: Adrian Clynes <aclynes@ubd.edu.bn>
Subject: Summary, corpus query

Here are edited responses to a query about
corpus-design-for-beginners, posted to the List on 2 November. Many
thanks to the following for their time and suggestions: Imran Ho
<imran.ho@stonebow.otago.ac.nz>, Ellen Gurman Bard
<ellen@ling.ed.ac.uk>, Claire Warwick
<claire.warwick@computing-services.oxford.ac.uk>, and Michael Barlow
<barlow@ruf.rice.edu>:

1) From: Imran Ho <imran.ho@stonebow.otago.ac.nz>

1. I am sure you must be aware of the corpora list (ICAME), which
carries specific discussion of corpus linguistics. They also have a
list of software which might be of interest to your colleagues at
UBD. Altenberg's bibliography is a good place for references and is
available from the same site.
2. I am currently compiling a corpus of written Malaysian English,
following the organisation of the LOB/Brown corpora and the Wellington
corpus of NZ English. I use the Oxford Concordance Programme for
extracting the info I need from the ME corpus. I have also tried
MonoConc (available on both Mac and PC) [AC: see Michael Barlow's
response below], and it seems to be very user-friendly.
3. For tagging, try the Birmingham Tagger (via e-mail). With a
learners' corpus, however, beware: the tagger has an accuracy of
around 80% (my guess, based on the tagging I have done), so a lot of
editing is needed.
4. Hardware is not really a problem. For my corpus of newspaper text,
most of the texts were already in electronic form and only needed to
be downloaded. The rest of the texts are scanned using Calera
WordScan. The storage space for 44 texts of 2,000 words is around
612k. So if you have 500 texts you might need 6Mb of disc space. I
store my documents in text format (ASCII).
5. One stage of corpus development needs particularly careful
thought: the referencing and mark-up of the texts. Imran
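
As a rough check on the storage figures in point 4, here is a minimal
sketch in Python. It assumes plain ASCII text and that disc space
scales linearly with the number of texts; the function name
estimate_storage_kb is purely illustrative and does not come from any
of the tools mentioned above.

```python
def estimate_storage_kb(sample_texts, sample_kb, target_texts):
    """Scale a measured sample size linearly to a larger number of texts."""
    kb_per_text = sample_kb / sample_texts
    return kb_per_text * target_texts

# Figures quoted in point 4: 44 texts of about 2,000 words take about 612 KB.
total_kb = estimate_storage_kb(sample_texts=44, sample_kb=612, target_texts=500)

# Report the estimate in KB and in MB (taking 1 MB = 1024 KB).
print(f"{total_kb:.0f} KB, or about {total_kb / 1024:.1f} MB")
```

On these assumptions the estimate comes out at roughly 6.8 MB, a
little above the round figure of 6Mb quoted above. Any mark-up added
to the texts would increase it further.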


2) From: Ellen Gurman Bard <ellen@ling.ed.ac.uk>
For some examples of design and collection techniques for spoken
corpora, you might want to have a look at:

Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G.,
Garrod, S., Isard, S., Kowtko, J., McAllister, J. M., Miller, J.,
Sotillo, C., Thompson, H., Weinert, R. (1991). The HCRC Map Task Corpus.
LANGUAGE AND SPEECH, 34(4), 351-66.

Bard, E. G., Sotillo, C. F., Anderson, A. H., and Taylor, M. M. (in
press). The DCIEM Map Task Corpus: Spontaneous Dialogue under Sleep
Deprivation and Drug Treatment. SPEECH COMMUNICATION.

or (1996) PROCEEDINGS OF INTERNATIONAL CONFERENCE ON SPEECH
AND LANGUAGE PROCESSING


3) From: Claire Warwick <claire.warwick@computing-services.oxford.ac.uk>

You may like to look at the web page for the British National Corpus,
at http://info.ox.ac.uk/bnc. It should provide you with some of the
information that you need.

4) From: Michael Barlow <barlow@ruf.rice.edu>

You might look at my corpus linguistics page:
http://www.ruf.rice.edu/~barlow/corpus.html

I have developed a couple of concordance programs. MonoConc for
Windows is a commercial program published by Athelstan (my
company). You can download a demo from http://www.nol.net/athel.html.
A HyperTalk-based Mac concordancer (MonoConc) can be downloaded from
http://www.ruf.rice.edu/~barlow/mono.html.
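
For readers new to concordancing, the following is a minimal
keyword-in-context (KWIC) sketch in Python. It is not MonoConc or the
Oxford Concordance Programme, and it makes no claim about how those
programs work. The function name kwic, the sample text and the
context width are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, used only to demonstrate the output layout.
sample = (
    "The corpus contains newspaper text. Each text in the corpus "
    "is stored as plain ASCII, and the corpus is not yet tagged."
)

for line in kwic(sample, "corpus", width=20):
    print(line)
```

Each output line shows one occurrence of the search word with a fixed
amount of context on either side, which is the basic display a
concordancer provides. A real tool would add sorting, frequency
counts and handling of mark-up.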


Again, my thanks to those who responded for their time and suggestions.



Adrian Clynes
aclynes@ubd.edu.bn
Dept of English & Applied Linguistics			
Universiti Brunei Darussalam, Brunei					
	
						