Publishing Partner: Cambridge University Press CUP Extra Publisher Login

New from Cambridge University Press!


Revitalizing Endangered Languages

Edited by Justyna Olko & Julia Sallabank

Revitalizing Endangered Languages "This guidebook provides ideas and strategies, as well as some background, to help with the effective revitalization of endangered languages. It covers a broad scope of themes including effective planning, benefits, wellbeing, economic aspects, attitudes and ideologies."

E-mail this page

We Have a New Site!

With the help of your donations we have been making good progress on designing and launching our new website! Check it out at!
***We are still in our beta stages for the new site--if you have any feedback, be sure to let us know at***

Dissertation Information

Title: Computational Learning of Finite-State Models for Natural Language Processing Add Dissertation
Author: Anja Belz Update Dissertation
Email: click here to access email
Institution: University of Sussex, School of Cognitive and Computing Sciences
Completed in: 2000
Linguistic Subfield(s): Computational Linguistics;
Subject Language(s): Dutch
Director(s): Gerald Gazdar

Abstract: This thesis is an investigation into the use of finite-state methods for natural language modelling, in particular, the modelling of phonological constraints, and the automatic construction of such models. The contents of the thesis reflect its intended readership of computational linguists, especially with regard to the level of detail provided in background sections.

A novel approach to finite-state modelling of natural language phenomena, OFS Modelling, is proposed which uses prototype models that encode generalisation in terms of levels of description and classes of objects. Prototype models are instantiated and generalised automatically for different data sets. The main application discussed is to the construction of phonotactic models. It is argued that traditional one-syllable models are inadequate for the phonotactics of natural languages, and that the latter can be modelled more accurately with Multi-Syllable Models. Evidence is provided from several languages for the usefulness of distinguishing between different syllable classes. A range of other applications of the OFS modelling approach is described, including graphotactics, finite-state approximation of context-free grammars, and automatic grammar construction from hand-parsed corpora.

The learnability of regular sets is discussed in detail, a sufficient learnability condition is defined, and it is shown that finite-state models under the OFS Modelling formalism are identifiable in the limit from positive data. A new approach to generalised regular grammar induction without prior knowledge, Minimal g-Partitioning (MgP), is introduced. The computability of MgP generalisations is considered and a method is described which uses nonexhaustive search to find MgP generalisations.

A new genetic algorithm, GAL, is introduced which implements a nonexhaustive search for MgP generalisations. Ample background is provided on genetic algorithms, and GAL is described, analysed and evaluated in depth. Particular emphasis is placed on demonstrating the efficiency of a new crossover operator, Fitness-Proportional Uniform Crossover, and on analysing the processes and mechanisms that enable GAL reliably to evolve solutions for what is often considered a worst-case learning task for genetic algorithms.

Experiments are described and results reported for applications of the OFS prototype modelling approach and the GAL algorithm, including the automatic construction of phonotactic models for Russian, German, English and Dutch, and graphotactic models for German, English and Dutch for use in automatic written language identification.

This thesis makes contributions to several fields of research - finite-state NLP, phonological constraint modelling, regular grammar induction and genetic algorithms - by introducing a novel approach to finite-state natural language modelling, a new, multi-syllable approach to phonological description, a new approach to generalised finite-state grammar learning, and a genetic algorithm that incorporates several new techniques.