|Title:||Research on Statistical Methods to Chinese Syntactic Ambiguity Resolution|
|Author:||Guohong Fu|
|Institution:||Oklahoma State University, English Department TESL/Linguistics Program|
|Linguistic Subfield(s):||Computational Linguistics; Syntax;|
|Abstract:||Syntactic ambiguity is one of the key problems in natural language parsing. This thesis proposes a set of statistical methods for Chinese syntactic ambiguity resolution. To reduce complexity and improve efficiency, we divide Chinese parsing into four sub-problems (word boundary identification, part-of-speech tagging, word sense tagging, and syntactic analysis), so that each type of ambiguity can be resolved by an appropriate technique within the statistical framework. Our research addresses the following four areas:
Word boundary identification is the first problem to be resolved in parsing Chinese. With large-scale applications in mind, this thesis first proposes a comprehensive framework for a robust Chinese word boundary identification system, comprising the following: (1) A character juncture model (CJM) is introduced into the word n-gram language model for word boundary disambiguation. (2) The search space of Chinese word boundary identification is discussed, and an A* decoding algorithm is applied to detect word boundaries. (3) Based on a simple superposition principle, the word formation power and patterns of Chinese characters, the CJM, and the word n-gram model are combined to detect unknown words. (4) To overcome the knowledge acquisition bottleneck in large applications, a semi-automatic procedure is designed to construct large training corpora by extracting ambiguous fragments and unknown words.
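The decoding step in (2) can be illustrated with a minimal sketch. It is not the thesis's system: it assumes a plain word unigram model in place of the CJM-augmented word n-gram model, and the lexicon, probabilities, `UNKNOWN_COST`, and function names are hypothetical values chosen for illustration. The A* heuristic assumed here is the remaining character count times the cheapest per-character cost, which never overestimates the true remaining cost.

```python
import heapq
import math

# Hypothetical toy unigram probabilities (assumption; a real system would
# estimate a word n-gram model from a segmented training corpus).
WORD_PROB = {
    "研究": 0.02, "生命": 0.01, "研究生": 0.008, "命": 0.002,
    "起源": 0.005, "的": 0.05, "生": 0.003,
}
UNKNOWN_COST = 12.0  # assumed penalty for a single out-of-lexicon character
MAX_WORD_LEN = max(len(w) for w in WORD_PROB)

# Cheapest possible cost per character: multiplying it by the number of
# remaining characters gives an admissible (never overestimating) heuristic.
MIN_COST_PER_CHAR = min(
    min(-math.log(p) / len(w) for w, p in WORD_PROB.items()),
    UNKNOWN_COST,
)


def word_cost(word):
    """Return -log P(word), a fixed penalty for unknown single characters,
    or None when the string cannot be treated as a word."""
    if word in WORD_PROB:
        return -math.log(WORD_PROB[word])
    if len(word) == 1:
        return UNKNOWN_COST
    return None


def segment(sentence):
    """A* search over word boundaries; returns the lowest-cost word list."""
    n = len(sentence)
    # Queue entries: (f = g + h, g, position, words so far)
    frontier = [(n * MIN_COST_PER_CHAR, 0.0, 0, ())]
    best_g = {}
    while frontier:
        _, g, pos, words = heapq.heappop(frontier)
        if pos == n:
            return list(words)
        if pos in best_g and best_g[pos] <= g:
            continue  # this position was already reached at lower cost
        best_g[pos] = g
        for end in range(pos + 1, min(n, pos + MAX_WORD_LEN) + 1):
            cost = word_cost(sentence[pos:end])
            if cost is None:
                continue
            g2 = g + cost
            h2 = (n - end) * MIN_COST_PER_CHAR
            heapq.heappush(
                frontier, (g2 + h2, g2, end, words + (sentence[pos:end],))
            )
    return []


print(segment("研究生命的起源"))
```

With these toy numbers, the ambiguous fragment 研究生命 is split as 研究 / 生命 rather than 研究生 / 命, because the first path has the lower total cost.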
Part-of-speech (POS) information is the foundation of syntactic structures and their analysis. After a survey of the error distribution of Hidden Markov Model (HMM) based POS tagging, a novel method that integrates a log-linear model with the HMM is proposed to address lexical category ambiguity and unknown words in Chinese POS tagging. The new algorithm runs as follows: a stochastic HMM-based tagger first pre-tags the input sentences, and a log-linear model combining a number of contextual features, such as context words and POS tags, is then employed to re-evaluate the error-prone ambiguous words and correct the errors in the initial tagging. Experiments indicate that the proposed method achieves higher accuracy than the HMM alone.
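The two-pass procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the tag set, the HMM parameters, the feature templates, and the hand-set weights are all hypothetical, English tokens are used purely for readability, and every ambiguous word is re-scored rather than only those identified as error-prone.

```python
import math

# Hypothetical toy HMM parameters (assumptions, not values from the thesis).
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {("N", "N"): 0.6, ("N", "V"): 0.4, ("V", "N"): 0.7, ("V", "V"): 0.3}
EMIT = {("N", "we"): 0.3, ("N", "book"): 0.4, ("N", "flights"): 0.3,
        ("V", "book"): 0.5}
LEXICON = {"we": {"N"}, "book": {"N", "V"}, "flights": {"N"}}
FLOOR = 1e-6  # assumed smoothing floor for unseen (tag, word) pairs

# Hypothetical hand-set log-linear weights over (feature, tag) pairs.
WEIGHTS = {
    ("prev_word=we", "V"): 1.5,
    ("next_word=flights", "V"): 1.0,
    ("word=book", "N"): 0.5,
    ("prev_tag=N", "N"): 0.3,
}


def viterbi(words):
    """First pass: most probable tag sequence under the HMM."""
    if not words:
        return []
    score = {t: math.log(START[t]) + math.log(EMIT.get((t, words[0]), FLOOR))
             for t in TAGS}
    back = []
    for w in words[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + math.log(TRANS[(p, t)]))
            new_score[t] = (score[prev] + math.log(TRANS[(prev, t)])
                            + math.log(EMIT.get((t, w), FLOOR)))
            pointers[t] = prev
        score = new_score
        back.append(pointers)
    tag = max(TAGS, key=lambda t: score[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


def features(words, tags, i):
    """Contextual features for position i: the word, its neighbours, previous tag."""
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    prev_tag = tags[i - 1] if i > 0 else "<s>"
    return ["word=" + words[i], "prev_word=" + prev_word,
            "next_word=" + next_word, "prev_tag=" + prev_tag]


def retag(words, tags):
    """Second pass: re-score each ambiguous word with the log-linear model.
    The argmax of the linear score equals the argmax of the softmax probability."""
    tags = list(tags)
    for i, w in enumerate(words):
        candidates = LEXICON.get(w, set(TAGS))
        if len(candidates) < 2:
            continue  # unambiguous words keep their HMM tag
        feats = features(words, tags, i)
        scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats)
                  for t in candidates}
        tags[i] = max(sorted(candidates), key=scores.get)
    return tags


sentence = ["we", "book", "flights"]
first_pass = viterbi(sentence)
print(first_pass, retag(sentence, first_pass))
```

In this toy setting the HMM prefers the noun reading of "book", and the contextual features in the second pass overturn it in favour of the verb reading.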
Semantic information makes a great contribution to Chinese syntactic disambiguation. To acquire semantic patterns for parsing, a stochastic method based on a log-linear model is put forward for Chinese word sense tagging. To disambiguate non-categorical and mixed types of Chinese polysemous words, features from words, POS tags, and word senses in the local context are combined into one word sense tagging model under the log-linear framework. To reduce manual supervision in acquiring annotated training data, an unsupervised method based on context clustering is also proposed. Compared with naïve Bayesian methods, the log-linear word sense tagging method takes into account the effects of combinations of different contextual features, as well as the main effects of individual features, which results in strong semantic disambiguation power.
Finally, an algorithm is proposed to parse Chinese based on lexicalized probabilistic context-free grammars (LPCFGs). To alleviate the serious data sparseness problem of the word-based LPCFG while taking the characteristics of the Chinese language into account, word sense information is introduced, and a novel LPCFG based on word senses is proposed. In addition, the standard chart algorithm is discussed, and a modified best-first chart algorithm that adopts a word sense based figure of merit (FOM) is given to improve parsing efficiency. The advantage of the LPCFG is that it combines contextual lexical information (especially semantic collocation patterns) with the stochastic information of syntactic structures, and thereby improves parsing accuracy.
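The idea of best-first chart parsing can be illustrated with a small agenda-based sketch. It departs from the thesis in two labeled ways: it assumes a plain, unlexicalized PCFG in Chomsky normal form rather than a word sense based LPCFG, and it uses the edge's inside probability as the figure of merit rather than a word sense based FOM. The grammar, probabilities, and function names are hypothetical.

```python
import heapq
import itertools
import math
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (assumption for illustration).
BINARY = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("NP", ("NP", "NP")): 0.2,
}
LEXICAL = {("NP", "我"): 0.3, ("V", "喜欢"): 1.0, ("NP", "音乐"): 0.5}


def parse(words, goal="S"):
    """Best-first chart parsing: always expand the most probable edge first.
    Returns (tree, probability), or None when no parse exists."""
    n = len(words)
    tie = itertools.count()  # tiebreaker so the heap never compares trees
    agenda = []  # entries: (cost = -log prob, tiebreak, label, start, end, tree)
    for i, w in enumerate(words):
        for (label, word), p in LEXICAL.items():
            if word == w:
                heapq.heappush(
                    agenda, (-math.log(p), next(tie), label, i, i + 1, (label, w))
                )
    chart = {}                    # (label, start, end) -> (cost, tree)
    by_start = defaultdict(list)  # start -> [(label, end)]
    by_end = defaultdict(list)    # end -> [(label, start)]
    while agenda:
        cost, _, label, i, j, tree = heapq.heappop(agenda)
        if (label, i, j) in chart:
            continue  # a cheaper derivation of this edge was already found
        chart[(label, i, j)] = (cost, tree)
        if (label, i, j) == (goal, 0, n):
            return tree, math.exp(-cost)
        by_start[i].append((label, j))
        by_end[j].append((label, i))
        for (parent, (left, right)), p in BINARY.items():
            rule_cost = -math.log(p)
            if left == label:  # new edge is the left child; look rightwards
                for rlabel, k in by_start[j]:
                    if rlabel == right:
                        rcost, rtree = chart[(right, j, k)]
                        heapq.heappush(agenda, (cost + rcost + rule_cost, next(tie),
                                                parent, i, k, (parent, tree, rtree)))
            if right == label:  # new edge is the right child; look leftwards
                for llabel, h in by_end[i]:
                    if llabel == left:
                        lcost, ltree = chart[(left, h, i)]
                        heapq.heappush(agenda, (cost + lcost + rule_cost, next(tie),
                                                parent, h, j, (parent, ltree, tree)))
    return None


print(parse(["我", "喜欢", "音乐"]))
```

Because rule probabilities never exceed one, an edge's cost can only grow as it is combined with others, so the first time the goal edge leaves the agenda it carries its most probable derivation. The parser can therefore stop without filling the whole chart, which is the efficiency gain that best-first strategies aim for.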