LINGUIST List 13.937

Fri Apr 5 2002

Diss: Computational Ling: Fu "Statistical Methods..."

Editor for this issue: Karolina Owczarzak <karolinalinguistlist.org>


Directory

  1. ghfu, Computational Ling: Fu "Research on Statistical Methods to Chinese..."

Message 1: Computational Ling: Fu "Research on Statistical Methods to Chinese..."

Date: Wed, 03 Apr 2002 22:12:09 +0000
From: ghfu <ghfuhotmail.com>
Subject: Computational Ling: Fu "Research on Statistical Methods to Chinese..."


New Dissertation Abstract

Institution: Harbin Institute of Technology
Program: Graduate English Department
Dissertation Status: Completed
Degree Date: 2001

Author: Guohong Fu 

Dissertation Title: 
Research on Statistical Methods to Chinese Syntactic Ambiguity Resolution

Linguistic Field: Computational Linguistics

Subject Language: Chinese, Mandarin

Dissertation Director 1: Xiaolong Wang


Dissertation Abstract: 

Syntactic ambiguity is one of the key problems in natural language
parsing. This thesis proposes a set of statistical methods for
Chinese syntactic ambiguity resolution. To reduce complexity and
improve efficiency, we divide the Chinese parsing process into four
sub-problems, i.e. word boundary identification, part-of-speech
tagging, word sense tagging and syntactic analysis, so that each type
of ambiguity can be resolved by an appropriate technique within the
statistical framework. Our research mainly concerns the following
four areas:

Word boundary identification is the first problem to be resolved in
parsing Chinese. With large-scale applications in mind, this thesis
first proposes a comprehensive framework for a robust Chinese word
boundary identification system, including: (1) A character juncture
model (CJM) is introduced into the word n-gram language model for
word boundary disambiguation. (2) The Chinese word boundary
identification space is discussed, and an A* decoding algorithm is
then applied to detect word boundaries. (3) Based on a simple
superposition principle, the word formation power and patterns of
Chinese characters, the CJM and the word n-gram model are combined to
detect unknown words. (4) To overcome the bottleneck of knowledge
acquisition in large applications, a semi-automatic procedure is
designed to construct large training corpora on the basis of
extracting ambiguous fragments and unknown words.
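
As a rough illustration of A* decoding over a word boundary
identification space, the sketch below segments a string with a plain
word unigram model. It is not the CJM plus word n-gram model of the
thesis: the vocabulary, the probabilities, the unknown-character
penalty and all function names are hypothetical toy assumptions.

```python
import heapq
import math

# Hypothetical toy unigram probabilities; a real system would estimate
# its model parameters from a segmented training corpus.
WORD_PROB = {"研究": 0.02, "研究生": 0.01, "生命": 0.015, "命": 0.002,
             "生": 0.003, "起源": 0.01, "的": 0.05}
UNKNOWN_COST = 15.0  # assumed penalty for an out-of-vocabulary character

COST = {w: -math.log(p) for w, p in WORD_PROB.items()}
MAX_LEN = max(len(w) for w in COST)
MIN_COST = min(min(COST.values()), UNKNOWN_COST)


def heuristic(remaining):
    """Lower bound on the remaining cost: at least
    ceil(remaining / MAX_LEN) more words, each costing >= MIN_COST."""
    return math.ceil(remaining / MAX_LEN) * MIN_COST


def segment(text):
    """Return the lowest-cost segmentation of text found by A* search."""
    n = len(text)
    # Agenda entries: (g + h, g, position, words chosen so far)
    agenda = [(heuristic(n), 0.0, 0, ())]
    closed = set()
    while agenda:
        _, g, pos, words = heapq.heappop(agenda)
        if pos == n:
            return list(words)
        if pos in closed:
            continue
        closed.add(pos)
        for end in range(pos + 1, min(n, pos + MAX_LEN) + 1):
            word = text[pos:end]
            if word in COST:
                step = COST[word]
            elif end == pos + 1:
                step = UNKNOWN_COST  # fall back to a single unknown character
            else:
                continue
            new_g = g + step
            heapq.heappush(
                agenda,
                (new_g + heuristic(n - end), new_g, end, words + (word,)))
    return []
```

With these toy values, segment("研究生命的起源") prefers the reading
研究 / 生命 / 的 / 起源 over the competing 研究生 / 命 / 的 / 起源. The
heuristic never overestimates the remaining cost, so the first
complete path taken from the agenda is the cheapest one.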

Part-of-speech information is the foundation of syntactic structures
and their analysis. After a survey of the error distribution of
Hidden Markov Model (HMM) based part-of-speech (POS) tagging, a novel
method integrating a log-linear model with the HMM is proposed to
address lexical category ambiguity and unknown words in Chinese POS
tagging. The new algorithm runs as follows: a stochastic tagger based
on the HMM is first used to pre-tag the input sentences, and a
log-linear model combining a number of contextual features, such as
context words and POS tags, is then employed to re-evaluate the
error-prone ambiguous words and correct the errors in the initial
tagging. Experiments indicate that the proposed method provides
higher accuracy than the HMM alone.
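
The two-pass procedure can be sketched as follows, under heavy
simplifying assumptions: a two-tag HMM decoded with the Viterbi
algorithm, followed by a log-linear re-evaluation of words on a list
of error-prone items. All probabilities, feature names and weights
are hypothetical toy values, not the parameters of the thesis.

```python
import math

TAGS = ["N", "V"]
# Hypothetical toy HMM parameters (emission mass for words not shown
# is simply omitted).
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.6, "V": 0.4}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {"N": {"我们": 0.2, "研究": 0.5, "语言": 0.3},
        "V": {"我们": 0.05, "研究": 0.5, "语言": 0.05}}
FLOOR = 1e-6  # assumed emission probability for unseen words

# Hypothetical list of error-prone words and log-linear feature weights.
AMBIGUOUS = {"研究"}
WEIGHTS = {("word=研究", "N"): 0.5,
           ("prev_word=我们", "V"): 1.2,
           ("next_tag=N", "V"): 0.8}


def viterbi(words):
    """First pass: most probable tag sequence under the HMM
    (assumes a non-empty sentence)."""
    best = [{t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)),
                 None) for t in TAGS}]
    for w in words[1:]:
        col = {}
        for t in TAGS:
            prev = max(TAGS,
                       key=lambda p: best[-1][p][0] + math.log(TRANS[p][t]))
            score = (best[-1][prev][0] + math.log(TRANS[prev][t])
                     + math.log(EMIT[t].get(w, FLOOR)))
            col[t] = (score, prev)
        best.append(col)
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):  # follow back-pointers
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]


def features(words, tags, i):
    """Contextual features: neighbouring words and their current tags."""
    feats = ["word=" + words[i]]
    if i > 0:
        feats += ["prev_word=" + words[i - 1], "prev_tag=" + tags[i - 1]]
    if i + 1 < len(words):
        feats += ["next_word=" + words[i + 1], "next_tag=" + tags[i + 1]]
    return feats


def rescore(words, tags):
    """Second pass: re-tag error-prone words with the log-linear model."""
    tags = list(tags)
    for i, w in enumerate(words):
        if w not in AMBIGUOUS:
            continue
        feats = features(words, tags, i)
        scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats)
                  for t in TAGS}
        z = sum(math.exp(s) for s in scores.values())
        probs = {t: math.exp(s) / z for t, s in scores.items()}
        tags[i] = max(TAGS, key=probs.get)
    return tags
```

On the toy sentence ["我们", "研究", "语言"], the first pass tags 研究
as N, and the second pass, which sees the neighbouring word 我们 and
the following N tag, changes it to V.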

Semantic information makes a great contribution towards Chinese
syntactic disambiguation. To acquire semantic patterns for parsing, a
stochastic method based on a log-linear model is put forward for
Chinese word sense tagging. To disambiguate non-categorical and mixed
types of Chinese polysemants, features from words, POS tags and word
senses in the local context are combined into one word sense tagging
model within the log-linear framework. To reduce manual supervision
in acquiring annotated training data, an unsupervised method based on
context clustering is also proposed. Compared with naive Bayesian
methods, the log-linear semantic tagging method takes into account
the effect of combinations of different contextual features, as well
as the main effects of individual features, which results in strong
semantic disambiguation power.
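
The contrast with naive Bayesian methods can be written out as
follows. The notation is an assumption made here for illustration
(sense s, context c with features c_1..c_n, binary feature functions
f_i, weights lambda); the abstract does not give the exact
parameterisation used in the thesis.

```latex
% Naive Bayes: context features are assumed independent given the sense,
% so only main effects of individual features are modelled.
\hat{s}_{\mathrm{NB}} = \arg\max_{s} \; P(s) \prod_{i=1}^{n} P(c_i \mid s)

% Log-linear model: main effects plus pairwise interaction terms.
P(s \mid c) = \frac{1}{Z(c)}
  \exp\Big( \sum_{i} \lambda_i f_i(s, c)
          + \sum_{i<j} \lambda_{ij} f_i(s, c) f_j(s, c) \Big),
\qquad
Z(c) = \sum_{s'}
  \exp\Big( \sum_{i} \lambda_i f_i(s', c)
          + \sum_{i<j} \lambda_{ij} f_i(s', c) f_j(s', c) \Big)
```

The second sum in the exponent is what lets a combination of
contextual features carry weight beyond the separate contributions of
its members.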

Finally, an algorithm is proposed to parse Chinese based on
lexicalized probabilistic context-free grammars (LPCFGs). To resolve
the serious data sparseness problem in the word-based LPCFG while
taking the characteristics of the Chinese language into account, word
sense information is introduced, and a novel LPCFG based on word
sense is proposed. In addition, the standard chart algorithm is
discussed, and a modified best-first chart algorithm, which adopts a
word sense based figure of merit (FOM), is given to improve parsing
efficiency. The advantage of the LPCFG is that it combines contextual
lexical information (especially semantic collocation patterns) with
the stochastic information of syntactic structures, and thereby
improves parsing accuracy.
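
A minimal sketch of agenda-driven best-first chart parsing is given
below. It uses an unlexicalized toy PCFG in Chomsky normal form and
takes the negative log inside probability of an edge as its figure of
merit; the word sense based FOM and the LPCFG of the thesis are not
reproduced. The grammar, the probabilities and the function names are
hypothetical.

```python
import heapq
import math
from collections import defaultdict

# Hypothetical toy grammar: (parent, left, right) -> probability
BINARY = {("S", "NP", "VP"): 1.0,
          ("VP", "V", "NP"): 1.0,
          ("NP", "NP", "NP"): 0.2}
# (preterminal, word) -> probability
LEXICAL = {("NP", "我们"): 0.3, ("NP", "研究"): 0.2, ("NP", "语言"): 0.3,
           ("V", "研究"): 0.6}


def parse(words, goal="S"):
    """Return (probability, tree) of the best parse, or None."""
    n = len(words)
    agenda = []   # entries: (cost, tie-breaker, label, start, end, tree)
    counter = 0   # unique tie-breaker so heapq never compares trees

    def push(cost, label, i, j, tree):
        nonlocal counter
        heapq.heappush(agenda, (cost, counter, label, i, j, tree))
        counter += 1

    for i, w in enumerate(words):
        for (label, word), p in LEXICAL.items():
            if word == w:
                push(-math.log(p), label, i, i + 1, (label, w))

    chart = {}                     # (label, start, end) -> (cost, tree)
    by_start = defaultdict(list)   # start -> [(label, end)]
    by_end = defaultdict(list)     # end -> [(label, start)]
    while agenda:
        cost, _, label, i, j, tree = heapq.heappop(agenda)
        if (label, i, j) in chart:
            continue               # a cheaper copy was already finished
        chart[(label, i, j)] = (cost, tree)
        if (label, i, j) == (goal, 0, n):
            return math.exp(-cost), tree
        by_start[i].append((label, j))
        by_end[j].append((label, i))
        for (parent, left, right), p in BINARY.items():
            rule_cost = -math.log(p)
            if left == label:      # new edge is the left child
                for rlabel, k in by_start[j]:
                    if rlabel == right:
                        rcost, rtree = chart[(right, j, k)]
                        push(cost + rcost + rule_cost, parent, i, k,
                             (parent, tree, rtree))
            if right == label:     # new edge is the right child
                for llabel, h in by_end[i]:
                    if llabel == left:
                        lcost, ltree = chart[(left, h, i)]
                        push(lcost + cost + rule_cost, parent, h, j,
                             (parent, ltree, tree))
    return None
```

Because edges leave the agenda in order of increasing cost, the parser
stops as soon as a goal edge spanning the whole sentence is finished,
without completing the rest of the chart. A sharper figure of merit
would change only the priority used when pushing edges.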