FYI: New Linguistic Corpus of Sina Weibo Messages
Author: Daan van Esch
Linguistic Field(s): Text/Corpus Linguistics
FYI Body:
It is my pleasure to announce the Leiden Weibo Corpus (LWC), an annotated linguistic corpus of 100 million words containing 5.1 million messages from Sina Weibo, China’s premier Twitter-like microblogging service. The LWC is freely available online at http://lwc.daanvanesch.nl/.

Data for the LWC was collected in January 2012. As such, it contains many linguistic phenomena that may not be found in older corpora, such as suffixation with "-ing", an aspect marker borrowed from English.

Furthermore, Sina Weibo messages come with valuable metadata, such as the gender and location of the user. This information allows the LWC to calculate how often words are used in different provinces and cities across China, which is useful for research into lexical variation.

Naturally, the LWC also supports searching for single words or grammar patterns, such as "any verb followed by an aspectual particle and then a noun". This feature may also be of interest to students and teachers of Mandarin who are looking for example sentences.

Please feel free to forward this announcement to anyone who might be interested. Any feedback regarding the LWC would be greatly appreciated; please send it to daanvanesch@gmail.com.

Best wishes,
Daan van Esch
Graduate Student in Chinese linguistics
Leiden University

