LINGUIST List 8.797

Thu May 29 1997

Qs: Lang id, Student corpus, Syntax

Editor for this issue: Ann Dizdar <>

We'd like to remind readers that the responses to queries are usually best posted to the individual asking the question. That individual is then strongly encouraged to post a summary to the list. This policy was instituted to help control the huge volume of mail on LINGUIST; so we would appreciate your cooperating with it whenever it seems appropriate.


  1. Mark Mandel, Language identification
  2. colber, student corpus - advice sought
  3. Hiroyuki TANAKA, syntax papers

Message 1: Language identification

Date: Tue, 27 May 1997 13:52:58 -0500
From: Mark Mandel <>
Subject: Language identification

An acquaintance of my daughter's writes:


 Identify this language please?

"Idolem urodo iatu a wi rot
 Ukufu kush onuoy nehawuoch
 Etia di ukoik ura nakurah
 Enadu yoimi nnesar urugem
 Eteako ich atak
 Ureatu tso oodah
 Amia wibo koro yonneie"

I think I have a pretty good idea of what languages this is *not* (not
a Romance language, not Germanic, not Slavic, not Chinese, Japanese,
Vietnamese...). Also, if it translates to something really corny,
lemme know so I can stop embarrassing myself every time I sing it.


Please respond to me. I will forward replies to the inquirer and
summarize to the list. Thank you for any help.

 Mark A. Mandel : Senior Linguist :
 Dragon Systems, Inc. : speech recognition : +1 617 965-5200
 320 Nevada St., Newton, MA 02160, USA :
 Personal home page:

Message 2: student corpus - advice sought

Date: Wed, 28 May 1997 01:28:30 +0800
From: colber <>
Subject: student corpus - advice sought

Has anyone on the List compiled or worked with STUDENT corpora?

I am in the process of putting together a corpus of Chinese college
students' unedited writings in English. The purpose is to
subsequently analyze this corpus with a concordancer and other
programs, and find quantitative information about the extent of some
characteristic errors or other non-native speaker word usages in their
writings. This information can be very valuable in determining
syllabuses and directions in secondary school English instruction.

The corpus is planned to be about 300,000 words in size, consisting
of 800-1000 written assignments, each between 150 and 400 words
long, typed and saved as text files. About one third of these
assignments have already been typed (entered).

I haven't so far used any other STUDENT corpus, from any country. So
my question is: are there any STANDARDS, generally used or accepted
electronic formats, in which these corpora are compiled, saved, and
prepared to be used by others?

Below I briefly describe how the corpus is being compiled. I will be
very grateful for suggestions or comments on whether this procedure is
acceptable or whether changes should be made to comply with accepted
formats.

- Each piece is typed in a Word 6.0 window (in the Windows 3.1
environment), using a fixed-space font, making each line about 70
characters long. The text is typed unedited and uncorrected (only
obvious spelling and punctuation mistakes made by the students are
corrected).

- An 8-12 character code (number) is typed in the first line. Then
one line is skipped, and the heading (headline) of the piece, as
written by the student, is typed.

- Paragraphing follows the original, with blank lines between the
paragraphs.
- Before saving the text, possible spelling and other errors made in
the typing process are checked and corrected using Word's spell
checker.
- Then each piece is saved as a "text only with line breaks" file and
given a file name (number).

- All these files are placed in one directory and backed up to prevent
accidental erasure.

- Using a simple merger application, the files are merged.
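For readers who want to try the same procedure, the merging step
above can be sketched in Python. This is only an illustration under
stated assumptions, not the merger application actually used here:
the function name merge_pieces, the .txt file extension, and the
UTF-8 encoding are all hypothetical choices.

```python
from pathlib import Path


def merge_pieces(src_dir, out_file, encoding="utf-8"):
    """Concatenate every .txt file in src_dir into out_file.

    Assumptions (not from the original procedure): pieces end in .txt,
    are encoded as UTF-8, and should be merged in file-name order.
    Returns the number of pieces merged.
    """
    src = Path(src_dir)
    out_path = Path(out_file).resolve()
    # Skip the output file itself in case it lives in the same directory.
    pieces = sorted(p for p in src.glob("*.txt") if p.resolve() != out_path)
    with open(out_path, "w", encoding=encoding) as out:
        for piece in pieces:
            # Each piece keeps its own layout: code line, blank line,
            # heading, then paragraphs separated by blank lines.
            text = piece.read_text(encoding=encoding).strip()
            # One blank line separates consecutive pieces.
            out.write(text + "\n\n")
    return len(pieces)
```

Because every piece starts with its code line, the boundaries between
pieces remain recoverable in the merged file.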

So far, I have tried loading a consolidated file of about 350 pieces
of writing (about 120,000 words) into a concordancer (WordSmith
Tools), and there seem to be no problems. Would files compiled this
way ALSO BE USABLE in other concordancers or text
processing/analysis programs?
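As a rough illustration of what a concordancer does with a plain text
file of this kind, here is a minimal keyword-in-context (KWIC) sketch
in Python. It is not how WordSmith Tools works internally; the
function name kwic and the width parameter are hypothetical, and the
sketch assumes the merged corpus has already been read into a string.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for whole-word matches of keyword.

    Line breaks in the source are collapsed to single spaces first, so
    the fixed line length of the typed files does not affect matching.
    """
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines
```

For example, kwic(corpus_text, "informations") would list every
occurrence of that non-native plural with its surrounding context,
where corpus_text stands for the contents of the merged file.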

Please send your comments either to the List or to me. I will
summarize the communications sent to me and post the summary to the
List.

I should also be very happy to eventually make this corpus available
to anyone interested in using it, or exchange it with similar learner
corpora on file, based on writings of other Chinese or Japanese
students, or English-learning college students in any country.

Best to all,

Colman Bernath

-----------------
Colman Bernath
c/o Department of English
Soochow University, Taipei, TAIWAN

Message 3: syntax papers

Date: Thu, 1 May 1997 19:34:33 +0900
From: Hiroyuki TANAKA <>
Subject: syntax papers

Does anyone have (a published version of) the following two papers,
both of which are cited in L. Rizzi's (1990) _Relativized Minimality_?

 Carstens, V., and Kinyalolo. 1989. Agr, Tense, Aspect and 
 the IP Structure: Evidence from Bantu. Paper presented at 
 GLOW Conference, Utrecht. 

 Schneider-Zioga, P. 1987. Syntax Screening. Paper, USC, Los Angeles.

Please contact me at the address below.
 | Hiroyuki Tanaka |
 | Department of English Linguistics, |
 | Faculty of Letters, Osaka University. +--+
 | e-mail: | /