LINGUIST List 14.1004

Fri Apr 4 2003

Diss: Computational Ling: Zaanen "Bootstrapping..."

Editor for this issue: Anita Yahui Huang <anitalinguistlist.org>


Directory

  1. mvzaanen, Computational Ling: Zaanen "Bootstrapping Structure into Language..."

Message 1: Computational Ling: Zaanen "Bootstrapping Structure into Language..."

Date: Thu, 03 Apr 2003 04:23:45 +0000
From: mvzaanen <mvzaanenuvt.nl>
Subject: Computational Ling: Zaanen "Bootstrapping Structure into Language..."


Institution: University of Leeds
Program: School of Computing
Dissertation Status: Completed
Degree Date: 2002

Author: Menno M. van Zaanen 

Dissertation Title: 

Bootstrapping Structure into Language: Alignment-Based Learning

Dissertation URL: http://ilk.uvt.nl/~mvzaanen/publications.html

Linguistic Field: Computational Linguistics 

Dissertation Director 1: Rens Bod 
Dissertation Director 2: Eric Atwell


Dissertation Abstract: 

This thesis introduces a new unsupervised learning framework, called
Alignment-Based Learning, which is based on the alignment of sentences
and Harris's (1951) notion of substitutability. Instances of the
framework can be applied to an untagged, unstructured corpus of
natural language sentences, resulting in a labelled, bracketed version
of that corpus.

Firstly, in the alignment learning phase, the framework aligns all
sentences in the corpus in pairs. Each alignment partitions the two
sentences into parts that are equal in both and parts that are
unequal. Unequal parts of sentences can be seen as being substitutable
for each other, since substituting one unequal part for the other
results in another valid sentence. The unequal parts of the sentences
are thus considered to be possible (possibly overlapping)
constituents, called hypotheses.
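
As a rough illustration of the alignment step, the sketch below aligns
two tokenised sentences and reports the unequal word spans as
hypotheses. It is not the thesis implementation: Python's
difflib.SequenceMatcher is used as a stand-in for whatever alignment
algorithm a given instance of the framework uses, and the function
name align_hypotheses and the example sentences are made up for this
illustration.

```python
from difflib import SequenceMatcher


def align_hypotheses(sent_a, sent_b):
    """Align two sentences word by word and return the unequal spans.

    Each hypothesis is a pair ((i1, i2), (j1, j2)) of half-open word
    spans, one per sentence, that the alignment marks as unequal.
    difflib is an assumed stand-in aligner, not the thesis method.
    """
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    hypotheses = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, insert or delete: an unequal part
            hypotheses.append(((i1, i2), (j1, j2)))
    return a, b, hypotheses


a, b, hyps = align_hypotheses(
    "show me flights from Boston to Dallas",
    "show me flights from Atlanta to Denver",
)
for (i1, i2), (j1, j2) in hyps:
    print(a[i1:i2], "<->", b[j1:j2])
```

Here the spans covering "Boston"/"Atlanta" and "Dallas"/"Denver" come
out as hypotheses, since swapping one for the other yields another
valid sentence. When one sentence has extra words, one side of a
hypothesis is an empty span.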

Secondly, the selection learning phase considers all hypotheses found
by the alignment learning phase and selects the best of these. The
hypotheses are selected based on the order in which they were found,
or based on a probabilistic function.
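
The order-based selection can be pictured with the following sketch.
It assumes, for illustration only, that hypotheses are half-open word
spans within one sentence and that two hypotheses conflict when their
brackets cross; the abstract does not spell out the exact criterion,
and the names crosses and select_first are hypothetical.

```python
def crosses(x, y):
    """Return True if spans x and y overlap without one containing the other."""
    (s1, e1), (s2, e2) = x, y
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def select_first(hypotheses):
    """Keep each hypothesis unless it crosses one that was found earlier.

    `hypotheses` is a list of (start, end) spans in the order in which
    they were found; nested and adjacent spans are allowed to coexist.
    """
    selected = []
    for span in hypotheses:
        if not any(crosses(span, kept) for kept in selected):
            selected.append(span)
    return selected


print(select_first([(0, 3), (2, 5), (3, 5), (1, 2)]))
```

In this example (2, 5) is dropped because it crosses the earlier
(0, 3), while (3, 5) and the nested (1, 2) are kept. A probabilistic
variant would rank the conflicting hypotheses by a score instead of by
the order in which they were found.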

The framework can be extended with a grammar extraction phase. This
extended framework is called parseABL. In addition to returning a
structured version of the unstructured input corpus, as the ABL
system does, this system also returns a stochastic context-free or
tree substitution grammar.

Different instances of the framework have been tested on the English
ATIS corpus, the Dutch OVIS corpus and the Wall Street Journal
corpus. One of the interesting results, apart from the encouraging
numerical results, is that all instances can (and do) learn recursive
structures.