LINGUIST List
Title: Hybrid Models for Chinese Unknown Word Resolution
Author: Xiaofei Lu
Degree Awarded: Ohio State University, Department of East Asian Languages and Literatures
Degree Date: 2006
Linguistic Subfield(s): Computational Linguistics
Subject Language(s): Chinese, Mandarin
Director(s): Walt Meurers

Abstract:

Word segmentation, part-of-speech (POS) tagging, and sense tagging are
important steps in various Chinese natural language processing (CNLP)
systems. Unknown words, i.e., words that are not in the dictionary or
training data used in a CNLP system, constitute a major challenge for each
of these steps. This dissertation is concerned with developing hybrid
models that effectively combine statistical, knowledge-based, and machine
learning approaches for Chinese unknown word resolution, including the
identification, POS tagging, and sense tagging of Chinese unknown words.
What makes Chinese unknown word resolution hard is the limited information
available for predicting the properties of unknown words, and for this
reason it is crucial to make optimal use of the information that is
available. To this end, this research explores two central ideas and aims
to achieve two major goals.

First, the morphological, syntactic, and semantic information of the
component characters or morphemes of an unknown word provides useful
insights into its structural and semantic properties. The first goal of
this work is to develop novel algorithms that capture such insights. To
integrate unknown word identification with word segmentation, the notion of
character-based tagging is adopted to model the tendency of individual
characters to combine with adjacent characters to form words in different
contexts. To predict the POS categories of unknown words, morphological
rules that encode knowledge about the relationship between the POS
categories of unknown words and those of their component morphemes are
developed. Finally, to classify unknown words into appropriate semantic
categories in a Chinese thesaurus, rules that capture the regularities in
the relationship between the semantic categories of unknown words and those
of their component morphemes are developed; information-theoretical models
are used to compute the associations between individual morphemes and
semantic categories for the same purpose.
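
As a rough illustration of the character-based tagging idea mentioned above, the following Python sketch labels each character with its position inside a word, so that segmentation (and with it the detection of unseen words) can be treated as a per-character tagging problem. The B/M/E/S tag set, the function names, and the example sentence are assumptions made for illustration; the abstract does not specify the tag inventory or the tagger actually used.

```python
# Illustrative only: the B/M/E/S tag set is an assumed, commonly used scheme,
# not necessarily the one adopted in the dissertation.

def words_to_tags(words):
    """Label each character with its position inside its word."""
    tagged = []
    for word in words:
        if not word:
            continue                            # ignore empty strings
        if len(word) == 1:
            tagged.append((word, "S"))          # single-character word
        else:
            tagged.append((word[0], "B"))       # word-initial character
            for ch in word[1:-1]:
                tagged.append((ch, "M"))        # word-internal character
            tagged.append((word[-1], "E"))      # word-final character
    return tagged


def tags_to_words(tagged):
    """Rebuild words from (character, tag) pairs; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:                                 # tolerate a sequence ending mid-word
        words.append(current)
    return words


example = ["我", "喜欢", "语言学"]               # hypothetical segmented sentence
tagged = words_to_tags(example)
restored = tags_to_words(tagged)
```

Under this encoding, a tagger that predicts a tag for every character yields a segmentation directly, and any recovered word that is absent from the dictionary is a candidate unknown word.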

Second, in addition to information about the component characters of an
unknown word, information about its type, length, and internal structure,
as well as the context in which it occurs, also provides useful insights
into its properties. Existing approaches to Chinese unknown word resolution
tend to each rely on a single source of information, though not the same
one, and are often effective in handling different subsets of unknown
words. The second goal of this research is to identify the relative
strengths of novel and existing models and to combine them to achieve
optimal use of information and better performance for the task.
Page Updated: 28-Nov-2009
