LINGUIST List 17.2259: Computational Ling: Lu: 'Hybrid Models for Chinese Unknown Wo...'

LINGUIST List 17.2259

Mon Aug 07 2006

Diss: Computational Ling: Lu: 'Hybrid Models for Chinese Unknown Wo...'

Editor for this issue: Hannah Morales <hannahlinguistlist.org>

Directory 1. Xiaofei Lu, Hybrid Models for Chinese Unknown Word Resolution

Message 1: Hybrid Models for Chinese Unknown Word Resolution
Date: 05-Aug-2006
From: Xiaofei Lu <xfluling.osu.edu>
Subject: Hybrid Models for Chinese Unknown Word Resolution

Institution: Ohio State University
Program: Department of East Asian Languages and Literature
Dissertation Status: Completed
Degree Date: 2006

Author: Xiaofei Lu

Dissertation Title: Hybrid Models for Chinese Unknown Word Resolution

Dissertation URL: http://ling.osu.edu/~xflu/papers/2006diss.pdf

Linguistic Field(s): Computational Linguistics
Subject Language(s): Chinese, Mandarin (cmn)
Dissertation Director: Walt Detmar Meurers
Dissertation Abstract:

Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistical, knowledge-based, and machine learning approaches for Chinese unknown word resolution, including the identification, part-of-speech (POS) tagging, and sense tagging of Chinese unknown words. What makes Chinese unknown word resolution hard is the limited information available for predicting the properties of unknown words, and for this reason it is crucial to make optimal use of information that is available. To this end, this research explores two central ideas and aims to achieve two major goals.

First, the morphological, syntactic, and semantic information of the component characters or morphemes of an unknown word provides useful insights into its structural and semantic properties. The first goal of this work is to develop novel algorithms that capture such insights. To integrate unknown word identification with word segmentation, the notion of character-based tagging is adopted to model the tendency of individual characters to combine with adjacent characters to form words in different contexts. To predict the POS categories of unknown words, morphological rules that encode knowledge about the relationship between the POS categories of unknown words and those of their component morphemes are developed. Finally, to classify unknown words into appropriate semantic categories in a Chinese thesaurus, rules that capture the regularities in the relationship between the semantic categories of unknown words and those of their component morphemes are developed; information-theoretical models are used to compute the associations between individual morphemes and semantic categories for the same purpose.
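As an illustration of character-based tagging, the following minimal sketch recasts word segmentation as assigning each character a tag for its position within a word. The four-tag scheme (B, M, E, S), the function names, and the example sentence are assumptions made for illustration; the abstract does not state which tag set or tagger the dissertation uses.

```python
# Illustrative sketch only: the B/M/E/S tag set is an assumed, commonly used
# scheme, not necessarily the one adopted in the dissertation.

def words_to_tags(words):
    """Convert a segmented sentence into (character, position-tag) pairs.

    B = word-initial, M = word-medial, E = word-final, S = single-character word.
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged


def tags_to_words(tagged):
    """Rebuild words from (character, position-tag) pairs."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("S", "E"):  # a word ends on S or E
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = ["我们", "喜欢", "语言学"]  # hypothetical example sentence
tagged = words_to_tags(sentence)
restored = tags_to_words(tagged)
```

Under this view, a tagger that predicts position tags for every character segments known and unknown words in the same pass, since an unknown word is simply a character sequence whose predicted tags form a word.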

Second, in addition to information about the component characters of an unknown word, information about its type, length, and internal structure as well as the context in which it occurs provides useful insights into its properties, too. Existing approaches to Chinese unknown word resolution tend to use different, but single sources of information and are often effective in handling different subsets of unknown words. The second goal of this research is to identify the relative strengths of novel and existing models and to combine them to achieve optimal use of information and better performance for the task.
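One simple way to combine models that each cover a different subset of unknown words is to let every model abstain when it has no evidence and to keep the most confident answer among the rest. The sketch below shows that idea only; the two toy models, their confidence values, the POS labels, and the default tag are all hypothetical and are not taken from the dissertation, whose actual combination method is not described in the abstract.

```python
# Illustrative sketch only: toy models with made-up confidences, combined by
# picking the most confident non-abstaining prediction.
from typing import Callable, List, Optional, Tuple

Prediction = Optional[Tuple[str, float]]  # (label, confidence), or None to abstain
Model = Callable[[str], Prediction]


def suffix_rule_model(word: str) -> Prediction:
    """Hypothetical rule: words ending in 性 or 者 are tagged as common nouns."""
    if word.endswith(("性", "者")):
        return ("NN", 0.9)
    return None  # the rule does not apply, so this model abstains


def length_model(word: str) -> Prediction:
    """Hypothetical low-confidence guess that looks only at word length."""
    return ("NR", 0.5) if len(word) >= 3 else ("VV", 0.4)


def combine(models: List[Model], word: str, default: str = "NN") -> str:
    """Return the label of the most confident model that did not abstain."""
    candidates = [p for p in (m(word) for m in models) if p is not None]
    if not candidates:
        return default
    return max(candidates, key=lambda p: p[1])[0]


models = [suffix_rule_model, length_model]
tag_for_rule_word = combine(models, "可能性")  # the suffix rule applies here
tag_for_other_word = combine(models, "打开")   # only the length model answers
```

A high-precision rule wins wherever it applies, and a broader but weaker model fills in the remaining cases, which is one reading of combining models with complementary strengths.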