LINGUIST List 25.2587
Tue Jun 17 2014
FYI: Sinica Chinese Core Vocabulary (version 1.0)
Editor for this issue:
Uliana Kazagasheva <ulianalinguistlist.org>
Date: 15-Jun-2014
From: Shu-Chuan Tseng <tsengsc gate.sinica.edu.tw>
Subject: Sinica Chinese Core Vocabulary (version 1.0)
The Sinica Chinese Core Vocabulary (version
1.0) consists of 1,121 Chinese words derived
from the intersection of the 2,000 most
frequently used words in the Sinica Balanced
Corpus and the 2,000 most frequently used words
in the Taiwan Mandarin Conversational Corpus.
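
As a rough illustration of the derivation described above, the following Python sketch intersects the top-N entries of two frequency lists. The function names (top_n, core_vocabulary) and the toy counts are hypothetical and are not taken from the released resource.

```python
from collections import Counter

def top_n(freq, n):
    """Return the set of the n most frequent words in a frequency table."""
    return {word for word, _ in Counter(freq).most_common(n)}

def core_vocabulary(written_freq, spoken_freq, n=2000):
    """Return the words that are in the top n of both frequency tables."""
    return top_n(written_freq, n) & top_n(spoken_freq, n)

# Toy counts for illustration only (not real corpus figures).
written = {"的": 900, "是": 400, "研究": 120, "我": 100, "政府": 90}
spoken = {"我": 800, "的": 700, "是": 500, "對": 300, "啊": 250}

core = core_vocabulary(written, spoken, n=3)
print(sorted(core))
```

With n=3 the toy example keeps only the words ranked in the top three of both tables; the announced resource uses n=2000.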
The Sinica Balanced Corpus contains mainly
Chinese texts, approximately 4.7 million
Chinese words after some minor modifications to
the original data, whereas the Taiwan Mandarin
Conversational Corpus contains free
conversations and task- and topic-oriented
dialogues, approximately 500,000 transcribed
Chinese words.
The Sinica Chinese Core Vocabulary was produced
based on the “Word List with Accumulated Word
Frequency in Sinica Balanced Corpus 3.0”
released by the Chinese Knowledge and
Information Processing Group (CKIP) via the
Association for Computational Linguistics and
Chinese Language Processing (ACLCLP) and the
“Chinese Spoken Wordlist” released by Dr.
Shu-Chuan Tseng. Words were segmented and
POS-tagged by the CKIP automatic word
segmentation and tagging system. The Sinica
Chinese Core Vocabulary brings together the
Chinese words that are most frequent in both
written and spoken language. It covers
57.6% of word tokens in the Sinica Balanced
Corpus and 86.1% in the Taiwan Mandarin
Conversational Corpus.
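
The coverage figures above are shares of word tokens, not of word types. A minimal sketch of that calculation, assuming a frequency table that maps each word to its token count (the function name token_coverage and the counts are hypothetical):

```python
def token_coverage(freq, vocabulary):
    """Fraction of all tokens in freq whose word type is in vocabulary."""
    total = sum(freq.values())
    covered = sum(count for word, count in freq.items() if word in vocabulary)
    return covered / total if total else 0.0

# Toy counts for illustration only (not real corpus figures).
spoken = {"我": 800, "的": 700, "是": 500, "對": 300, "啊": 200}
core = {"的", "是", "我"}

coverage = token_coverage(spoken, core)
print(f"{coverage:.1%}")
```

Because a few very frequent words account for most tokens, a small vocabulary can cover a large share of a corpus, as the 86.1% figure for the conversational corpus suggests.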
For each word, the Sinica Chinese Core
Vocabulary provides the part of speech, the
frequency and rank in both corpora, and an
English gloss with Chinese examples and English
translations. All Chinese characters are
transcribed in Pinyin. Words that are written
with identical characters but carry different
POS tags, and words that have multiple writing
conventions, are treated as separate lexical
units. Users
can also find a list of those top 2,000 words
of the Sinica Balanced Corpus that do not
appear in the core vocabulary. This list
contains 879 words that are frequently used in
the written language only, covering 13.1% of
word tokens in the Sinica Balanced Corpus.
Another list contains those top 2,000 words of
the Taiwan Mandarin Conversational Corpus that
do not appear in the core vocabulary. These 699
conversation-only high-frequency words make up
7.6% of word tokens in the Taiwan Mandarin
Conversational Corpus.
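
The three lists can be pictured as a set partition over lexical units keyed by (word, POS tag), so that the same characters with different tags count separately. The sketch below is only an assumption about how such a split could be computed; the function name split_lists is hypothetical and the POS tags are illustrative.

```python
def split_lists(written_top, spoken_top):
    """Split two top-N sets of (word, POS) units into
    core, written-only, and conversation-only sets."""
    core = written_top & spoken_top
    return core, written_top - core, spoken_top - core

# Toy top-N sets for illustration only; tags are illustrative.
written_top = {("的", "DE"), ("是", "SHI"), ("研究", "Na"), ("研究", "VE")}
spoken_top = {("的", "DE"), ("是", "SHI"), ("對", "VH"), ("研究", "VE")}

core, written_only, conversation_only = split_lists(written_top, spoken_top)

# The core and written-only sets together make up the written top-N set.
print(len(core) + len(written_only) == len(written_top))
```

The same identity holds for the published figures on the written side: 1,121 core words plus 879 written-only words give the 2,000 top words of the Sinica Balanced Corpus.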
Please note that, because of the way the
corpus scenarios were set up, some proper nouns
in the conversational corpus are
corpus-specific and should not be regarded as
high-frequency words in conversation. For this
reason, 180 words were excluded from the final
conversation-only list. In addition, a set of
1,235 basic Chinese characters is derived from
the three word lists above (core, text-only,
and conversation-only).
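
One plausible way to obtain such a character set is to collect every distinct character that occurs in any word of the three lists. The sketch below assumes exactly that; the function name character_set and the toy words are hypothetical, and the released character list may have been compiled differently.

```python
def character_set(*word_lists):
    """Collect the distinct characters used by the words in all given lists."""
    return {ch for words in word_lists for word in words for ch in word}

# Toy word lists for illustration only.
core = ["我們", "研究"]
text_only = ["政府"]
conversation_only = ["對啊", "我"]

chars = character_set(core, text_only, conversation_only)
print(len(chars))
```

Note that 180 + 699 + 1,121 = 2,000, so the excluded proper nouns, the conversation-only list, and the core vocabulary together account for the top 2,000 words of the conversational corpus.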
To access the Sinica Chinese Core Vocabulary
(version 1.0), please see:
http://www.aclclp.org.tw/use_sccv.php
http://mmc.sinica.edu.tw/resources_e_01.html
Linguistic Field(s): Computational Linguistics;
Language Acquisition
Subject Language(s): Chinese, Mandarin (cmn)
Page Updated: 17-Jun-2014