The LINGUIST List is dedicated to providing information on language and language analysis, and to providing the discipline of linguistics with the infrastructure necessary to function in the digital world. LINGUIST is a free resource, run by linguistics students and faculty, and supported primarily by your donations.
FYI: Sinica Chinese Core Vocabulary (version 1.0)
The Sinica Chinese Core Vocabulary (version 1.0) consists of 1,121 Chinese words: the intersection of the 2,000 most frequently used words in the Sinica Balanced Corpus and the 2,000 most frequently used words in the Taiwan Mandarin Conversational Corpus.
The Sinica Balanced Corpus consists mainly of Chinese texts and contains approximately 4.7 million Chinese words after some minor modifications to the original data. The Taiwan Mandarin Conversational Corpus contains free conversations and task- and topic-oriented dialogues, amounting to approximately 500,000 transcribed Chinese words.
The Sinica Chinese Core Vocabulary is based on two sources: the “Word List with Accumulated Word Frequency in Sinica Balanced Corpus 3.0”, released by the Chinese Knowledge and Information Processing Group (CKIP) via the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), and the “Chinese Spoken Wordlist”, released by Dr. Shu-Chuan Tseng. Words were segmented and POS-tagged by the CKIP automatic word segmentation and tagging system. The Sinica Chinese Core Vocabulary brings together the Chinese words that are most frequently used in both written and spoken language. It covers 57.6% of word tokens in the Sinica Balanced Corpus and 86.1% in the Taiwan Mandarin Conversational Corpus.
For each word, the Sinica Chinese Core Vocabulary gives the part of speech, the frequency and ranking in both corpora, and an English gloss with Chinese examples and English translations. All Chinese characters are transcribed in Pinyin. Words that are written with identical characters but carry different POS tags are regarded as different lexical units, as are words that have multiple writing conventions. Users can also find a list of those top 2,000 words of the Sinica Balanced Corpus that do not appear in the core vocabulary. This list contains 879 words that are frequently used in the written language only, covering 13.1% of word tokens in the Sinica Balanced Corpus. Another list contains those top 2,000 words of the Taiwan Mandarin Conversational Corpus that do not appear in the core vocabulary. Its 699 conversation-only high-frequency words make up 7.6% of the word tokens in the Taiwan Mandarin Conversational Corpus.
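The derivation described above (intersect two top-N lists, keep the remainders as written-only and conversation-only lists, then measure token coverage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the authors actually ran: the function names, the toy frequencies, and the POS labels are invented for the example, and lexical units are modelled as (word, POS) pairs so that identical characters with different tags stay distinct.

```python
from collections import Counter

def top_n(freq, n):
    """Return the n most frequent lexical units of a frequency table as a set."""
    return {unit for unit, _ in freq.most_common(n)}

def coverage(units, freq):
    """Share of all word tokens in `freq` accounted for by `units`."""
    total = sum(freq.values())
    return sum(freq[u] for u in units) / total

def split_vocab(written, spoken, n):
    """Split two top-n lists into core, written-only and spoken-only sets."""
    top_w, top_s = top_n(written, n), top_n(spoken, n)
    core = top_w & top_s          # units frequent in both corpora
    return core, top_w - core, top_s - core

# Toy frequency tables (invented numbers and illustrative POS labels).
written = Counter({("的", "PART"): 50, ("是", "V"): 20, ("研究", "N"): 15,
                   ("我", "PRON"): 10, ("吧", "PART"): 5})
spoken = Counter({("我", "PRON"): 40, ("的", "PART"): 30, ("吧", "PART"): 20,
                  ("是", "V"): 8, ("研究", "N"): 2})

core, written_only, spoken_only = split_vocab(written, spoken, n=3)
print(core)                          # units in both top-3 lists
print(coverage(core, written))       # token coverage in the written table
print(coverage(core, spoken))        # token coverage in the spoken table
```

With real data, `n` would be 2,000 and the two tables would be read from the published frequency lists; the coverage figures reported in the announcement correspond to what `coverage` computes here.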
Please note that, because of the scenarios set up for the conversational corpus, some proper nouns in it are corpus-specific and should not be regarded as high-frequency words in conversation. For this reason, 180 words were excluded from the final conversation-only list. In addition, a set of 1,235 basic Chinese characters is derived from the three wordlists mentioned above, covering the core, text-only, and conversation-only vocabulary lists.
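As a rough illustration of the last two points, the sketch below collects the distinct characters used across several wordlists and checks that the reported list sizes add up to 2,000 per corpus. The function name and the toy wordlists are assumptions made for the example; the announcement does not say how the character set was actually compiled, and reading the 180 excluded words as part of the conversational top 2,000 is an interpretation of the figures given.

```python
def basic_characters(*wordlists):
    """Collect the distinct characters appearing in any of the wordlists."""
    chars = set()
    for words in wordlists:
        for word in words:
            chars.update(word)   # a str is iterated character by character
    return chars

# Toy wordlists (invented for illustration only).
core_words = ["我們", "的"]
text_only_words = ["研究"]
conversation_only_words = ["我", "吧"]

chars = basic_characters(core_words, text_only_words, conversation_only_words)
print(len(chars))  # number of distinct characters across the three lists

# Consistency check on the figures reported in the announcement:
# core + written-only fills the written top 2,000, and
# core + conversation-only + excluded words fills the conversational top 2,000.
assert 1121 + 879 == 2000
assert 1121 + 699 + 180 == 2000
```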
To access the Sinica Chinese Core Vocabulary (version 1.0), please see: