|Title:||Computational Learning of Finite-State Models for Natural Language Processing||Add Dissertation|
|Author:||Anja Belz||Update Dissertation|
|Email:||click here to access email|
|Institution:||University of Sussex, School of Cognitive and Computing Sciences|
|Linguistic Subfield(s):||Computational Linguistics;|
|Abstract:||This thesis is an investigation into the use of finite-state methods for natural language modelling, in particular, the modelling of phonological constraints, and the automatic construction of such models. The contents of the thesis reflect its intended readership of computational linguists, especially with regard to the level of detail provided in background sections.
A novel approach to finite-state modelling of natural language phenomena, OFS Modelling, is proposed which uses prototype models that encode generalisation in terms of levels of description and classes of objects. Prototype models are instantiated and generalised automatically for different data sets. The main application discussed is to the construction of phonotactic models. It is argued that traditional one-syllable models are inadequate for the phonotactics of natural languages, and that the latter can be modelled more accurately with Multi-Syllable Models. Evidence is provided from several languages for the usefulness of distinguishing between different syllable classes. A range of other applications of the OFS modelling approach is described, including graphotactics, finite-state approximation of context-free grammars, and automatic grammar construction from hand-parsed corpora.
The learnability of regular sets is discussed in detail, a sufficient learnability condition is defined, and it is shown that finite-state models under the OFS Modelling formalism are identifiable in the limit from positive data. A new approach to generalised regular grammar induction without prior knowledge, Minimal g-Partitioning (MgP), is introduced. The computability of MgP generalisations is considered and a method is described which uses nonexhaustive search to find MgP generalisations.
A new genetic algorithm, GAL, is introduced which implements a nonexhaustive search for MgP generalisations. Ample background is provided on genetic algorithms, and GAL is described, analysed and evaluated in depth. Particular emphasis is placed on demonstrating the efficiency of a new crossover operator, Fitness-Proportional Uniform Crossover, and on analysing the processes and mechanisms that enable GAL reliably to evolve solutions for what is often considered a worst-case learning task for genetic algorithms.
Experiments are described and results reported for applications of the OFS prototype modelling approach and the GAL algorithm, including the automatic construction of phonotactic models for Russian, German, English and Dutch, and graphotactic models for German, English and Dutch for use in automatic written language identification.
This thesis makes contributions to several fields of research - finite-state NLP, phonological constraint modelling, regular grammar induction and genetic algorithms - by introducing a novel approach to finite-state natural language modelling, a new, multi-syllable approach to phonological description, a new approach to generalised finite-state grammar learning, and a genetic algorithm that incorporates several new techniques.