Editor for this issue: Elyssa Winzeler
<elyssalinguistlist.org>
We'd like to remind readers that the responses to queries are usually best posted to the individual asking the question. That individual is then strongly encouraged to post a summary to the list. This policy was instituted to help control the huge volume of mail on LINGUIST; so we would appreciate your cooperating with it whenever it seems appropriate.
In addition to posting a summary, we'd like to remind people that it is usually a good idea to personally thank those individuals who have taken the trouble to respond to the query.
Date: 04-Jan-2010
From: Barry Kavanagh <b_kavanaghauhw.ac.jp>
Subject: Japanese and English Corpora Research
I have a question regarding corpora, if I may. At the moment I am looking at non-verbal representations of language, such as emoticons, in computer-mediated discourse, and have compiled a fairly large Japanese and English corpus. As I am counting these non-verbal or paralinguistic cues within the corpora, the corpora need to be of the same size; otherwise my data and findings may be deemed void. For example, if the Japanese corpus is much bigger than the English one, then these non-verbal representations are more likely to appear in it. I have tried making the number of sentences the same in each corpus (very time-consuming; defining what a sentence is in online communication can also be difficult). I am also trying to find similar studies that have compared English and Japanese corpora (no luck yet), and to see whether there are any reliable equivalences stating, for example, that 400 kanji are equal to 1000 English words.
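The size concern above can be made concrete with a small sketch. It assumes that raw counts are converted to a rate per 1,000 units of corpus size, which is only one possible approach and not something this query establishes. The function name `rate_per_thousand` and all figures are made up for illustration, and the choice of unit (characters, words, sentences) is deliberately left open, since that is the very point in question.

```python
def rate_per_thousand(cue_count: int, corpus_size: int) -> float:
    """Return the number of cues per 1,000 units of corpus size.

    The unit of corpus_size (characters, words, ...) is an open choice;
    this sketch only assumes the same kind of unit is used consistently
    within one corpus.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 1000 * cue_count / corpus_size


# Hypothetical counts, invented purely for illustration.
japanese = {"cues": 300, "size": 60_000}
english = {"cues": 200, "size": 20_000}

ja_rate = rate_per_thousand(japanese["cues"], japanese["size"])
en_rate = rate_per_thousand(english["cues"], english["size"])

# The larger corpus has the higher raw count but the lower rate.
print(f"Japanese: {japanese['cues']} cues, {ja_rate:.1f} per 1,000 units")
print(f"English:  {english['cues']} cues, {en_rate:.1f} per 1,000 units")
```

With these invented figures, the larger corpus contains more cues in absolute terms but fewer per 1,000 units, which is the distortion that unequal corpus sizes could introduce. The sketch does not settle whether a Japanese unit and an English unit are comparable.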
Any ideas or advice would be fantastic.
Linguistic Field(s):
Computational Linguistics
Text/Corpus Linguistics