LINGUIST List 17.2259

Mon Aug 07 2006

Diss: Computational Ling: Lu: 'Hybrid Models for Chinese Unknown Wo...'

Editor for this issue: Hannah Morales <hannah@linguistlist.org>


Directory
        1.    Xiaofei Lu, Hybrid Models for Chinese Unknown Word Resolution


Message 1: Hybrid Models for Chinese Unknown Word Resolution
Date: 05-Aug-2006
From: Xiaofei Lu <xflu@ling.osu.edu>
Subject: Hybrid Models for Chinese Unknown Word Resolution


Institution: Ohio State University
Program: Department of East Asian Languages and Literature
Dissertation Status: Completed
Degree Date: 2006

Author: Xiaofei Lu

Dissertation Title: Hybrid Models for Chinese Unknown Word Resolution

Dissertation URL: http://ling.osu.edu/~xflu/papers/2006diss.pdf

Linguistic Field(s): Computational Linguistics

Subject Language(s): Chinese, Mandarin (cmn)

Dissertation Director:
Walt Detmar Meurers

Dissertation Abstract:

Word segmentation, part-of-speech (POS) tagging, and sense tagging are
important steps in various Chinese natural language processing (CNLP)
systems. Unknown words, i.e., words that are not in the dictionary or
training data used in a CNLP system, constitute a major challenge for each
of these steps. This dissertation is concerned with developing hybrid
models that effectively combine statistical, knowledge-based, and machine
learning approaches for Chinese unknown word resolution, including the
identification, POS tagging, and sense tagging of Chinese
unknown words. What makes Chinese unknown word resolution hard is the
limited information available for predicting the properties of unknown
words, so it is crucial to make optimal use of the information that is
available. To this end, this research explores two central ideas and
pursues two major goals.

First, the morphological, syntactic, and semantic information of the
component characters or morphemes of an unknown word provides useful
insights into its structural and semantic properties. The first goal of
this work is to develop novel algorithms that capture such insights. To
integrate unknown word identification with word segmentation, the notion of
character-based tagging is adopted to model the tendency of individual
characters to combine with adjacent characters to form words in different
contexts. To predict the POS categories of unknown words, morphological
rules that encode knowledge about the relationship between the POS
categories of unknown words and those of their component morphemes are
developed. Finally, to classify unknown words into appropriate semantic
categories in a Chinese thesaurus, rules that capture the regularities in
the relationship between the semantic categories of unknown words and those
of their component morphemes are developed; information-theoretical models
are used to compute the associations between individual morphemes and
semantic categories for the same purpose.
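
As a rough illustration of two of the ideas in this paragraph, the sketch below shows (a) how per-character position tags can be turned back into words, which is the basic mechanism behind character-based tagging, and (b) one possible information-theoretic association score between morphemes and semantic categories. The abstract does not name a tagset or a specific measure, so the B/M/E/S tagset, the use of pointwise mutual information, the function names, and the toy data are all assumptions made for illustration, not the dissertation's actual method.

```python
import math
from collections import Counter


def tags_to_words(chars, tags):
    """Rebuild words from per-character position tags.

    Assumed tagset (not taken from the abstract):
    B = word-initial, M = word-medial, E = word-final, S = single-character word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def pmi_table(pairs):
    """Pointwise mutual information for each observed (morpheme, category) pair.

    PMI is used here as one example of an information-theoretic
    association measure; the abstract does not say which one is used.
    """
    n = len(pairs)
    pair_counts = Counter(pairs)
    morph_counts = Counter(m for m, _ in pairs)
    cat_counts = Counter(c for _, c in pairs)
    return {
        (m, c): math.log2((k / n) / ((morph_counts[m] / n) * (cat_counts[c] / n)))
        for (m, c), k in pair_counts.items()
    }


# Toy usage with hypothetical tags and category labels.
words = tags_to_words("我爱北京", ["S", "S", "B", "E"])
scores = pmi_table([("a", "X"), ("a", "X"), ("b", "Y"), ("b", "X")])
```

A positive score means the morpheme and category co-occur more often than chance would predict, and a negative score means less often. A classifier along these lines could sum such scores over the component morphemes of an unknown word.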

Second, in addition to information about the component characters of an
unknown word, information about its type, length, and internal structure,
as well as the context in which it occurs, also provides useful insights
into its properties. Existing approaches to Chinese unknown word resolution
tend to each rely on a single source of information, differing from one
approach to the next, and are often effective in handling different subsets
of unknown words. The second goal
of this research is to identify the relative strengths of novel and
existing models and to combine them to achieve optimal use of information
and better performance for the task.
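
The abstract does not say how the models are combined. One simple scheme, sketched here purely as an assumption, is a backoff cascade: models are tried in order of expected precision, each may abstain, and the first non-abstaining answer wins. The two toy models, the suffix rules, the tag labels, and the `combine` helper are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A model maps a word to a POS label, or returns None to abstain.
Model = Callable[[str], Optional[str]]

# Hypothetical suffix rules, for illustration only.
SUFFIX_RULES = {"性": "N", "化": "V"}


def rule_model(word: str) -> Optional[str]:
    """High-precision toy model: answers only when a suffix rule fires."""
    return SUFFIX_RULES.get(word[-1]) if word else None


def fallback_model(word: str) -> Optional[str]:
    """Broad-coverage toy model: guesses noun for any multi-character word."""
    return "N" if len(word) >= 2 else None


def combine(models: Sequence[Model], default: str) -> Model:
    """Return a tagger that uses the first model that does not abstain."""
    def predict(word: str) -> str:
        for model in models:
            label = model(word)
            if label is not None:
                return label
        return default
    return predict


tagger = combine([rule_model, fallback_model], default="UNK")
```

Ordering the models this way lets a narrow but reliable source of information take priority, while a broader model covers the unknown words it cannot handle.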


