LINGUIST List 23.2009
Tue Apr 24 2012
FYI: New Linguistic Corpus of Sina Weibo Messages
Editor for this issue: Kristen Dunkinson
<kristen linguistlist.org>
Date: 24-Apr-2012
From: Daan van Esch <daanvanesch gmail.com>
Subject: New Linguistic Corpus of Sina Weibo Messages
It is my pleasure to announce the Leiden Weibo Corpus (LWC), an annotated 100-million-word linguistic corpus containing 5.1 million messages from Sina Weibo, China’s premier Twitter-like microblogging service. The LWC is freely available online at http://lwc.daanvanesch.nl/.

Data for the LWC was collected in January 2012. Because the data is so recent, it contains many linguistic phenomena that may not be found in older corpora, such as suffixation with "-ing", an aspect marker borrowed from English.

Furthermore, Sina Weibo messages come with valuable metadata, such as the gender and location of the user. This information allows the LWC to calculate how often words are used in different provinces and cities, which is useful for research into lexical variation across China.

Naturally, the LWC also supports searching for single words or grammar patterns, such as "any verb followed by an aspectual particle and then a noun". This feature may also be of interest to students and teachers of Mandarin who are looking for example sentences.

Please feel free to forward this announcement to anyone who might be interested. Any feedback regarding the LWC would be greatly appreciated; please send it to daanvanesch gmail.com.

Best wishes,
Daan van Esch
Graduate Student in Chinese linguistics
Leiden University
Linguistic Field(s): Text/Corpus Linguistics
Page Updated: 24-Apr-2012