LINGUIST List 15.2855

Tue Oct 12 2004

Diss: Comp Ling/ Lexicography: Gaustad:'Linguistic...'

Editor for this issue: Takako Matsui <takolinguistlist.org>


Directory


        1.    Tanja Gaustad, Linguistic Knowledge and Word Sense Disambiguation



Message 1: Linguistic Knowledge and Word Sense Disambiguation

Date: 11-Oct-2004
From: Tanja Gaustad <tanjalet.rug.nl>
Subject: Linguistic Knowledge and Word Sense Disambiguation


Institution: University of Groningen
Program: Alpha-Informatica
Dissertation Status: Completed
Degree Date: 2004

Author: Tanja Gaustad

Dissertation Title: Linguistic Knowledge and Word Sense Disambiguation

Linguistic Field(s):
Computational Linguistics; Lexicography

Subject Language(s):
Dutch (Language Code: DUT)

Dissertation Director:
John Nerbonne
Gertjan van Noord

Dissertation Abstract:

The main research question I try to answer in my thesis is which
linguistic knowledge sources are most useful for word sense disambiguation
(WSD), more specifically word sense disambiguation of Dutch. The goal of
the project was to develop a tool which is able to automatically determine
the meaning of a particular ambiguous word in context, a so-called word
sense disambiguation system. In order to achieve this, I make use of the
information contained in the context, namely the words surrounding the
ambiguous word, and additional underlying information (such as syntactic
class and structure) to build a statistical language model. This model is
then used to determine the meaning of examples of that particular ambiguous
word in new contexts.
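
To make the notion of "information contained in the context" concrete, here
is a minimal sketch of how the words surrounding an ambiguous word might be
turned into features for a statistical classifier. The function name, the
window size, and the example sentence are assumptions made for illustration
only; they are not taken from the thesis, which also uses richer features
such as syntactic class and dependency relations.

```python
def context_features(tokens, target_index, window=3):
    """Collect the words within `window` positions of the ambiguous word.

    Hypothetical helper: feature names encode the position of each context
    word relative to the target, e.g. "word[-1]" for the preceding word.
    """
    features = {}
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    for i in range(start, end):
        if i == target_index:
            continue  # the ambiguous word itself is not a context feature
        offset = i - target_index
        features[f"word[{offset:+d}]"] = tokens[i].lower()
    return features


# Invented example sentence: Dutch "bank" can mean a bench or a financial bank.
tokens = "Zij zat op de bank in het park".split()
features = context_features(tokens, tokens.index("bank"), window=2)
print(features)
```

Feature dictionaries of this kind could then be passed to any statistical
classifier; a lemma-based variant would presumably pool the examples of all
inflected forms of one lemma before training.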

My results on the (unseen) Senseval-2 test data show that adding structural
syntactic information in the form of dependency relations instead of PoS of
the context leads to an error-rate reduction of 8% for the word form model.
Furthermore, the lemma-based approach (introduced in this thesis)
outperforms the word form-based approach independently of the features
included in the model. We can observe an error-rate reduction of 10% with
regard to the lemma-based model including PoS in context, and an error-rate
reduction of 6% with regard to the best model based on word forms.
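
As a reading aid, the sketch below shows how an error-rate reduction is
commonly computed, assuming the percentages above are relative reductions
(the share of the baseline model's errors that the new model removes). The
accuracy values in the example are invented for illustration and are not
results from the thesis.

```python
def error_rate_reduction(baseline_accuracy, new_accuracy):
    """Relative error-rate reduction of a new model over a baseline.

    Both arguments are accuracies in the range [0, 1]. The result is the
    fraction of the baseline's errors that the new model eliminates.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline makes no errors; reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies: moving from 80% to 82% accuracy removes
# roughly one tenth of the baseline's errors.
reduction = error_rate_reduction(0.80, 0.82)
print(f"{reduction:.1%}")
```

Under this reading, a small gain in absolute accuracy can correspond to a
sizeable relative reduction when the baseline is already accurate.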

Comparing the results on the test data to results obtained with a different
system, using Memory-Based Learning (MBL) as a classification algorithm,
both the word form-based classifiers and the lemma-based classifiers from
my system produce higher accuracy. The lemma-based model actually leads to
an error rate reduction of 10% if compared to the MBL WSD system.

In my maximum entropy system, the addition of deep linguistic knowledge in
particular greatly improves accuracy. Combined with the lemma-based
approach, which takes advantage of morphological information, it yields the
best results for WSD of Dutch on the Senseval-2 data set. Our
system achieves significantly higher disambiguation accuracy than any
results for Dutch that have been reported in the literature up to now and
is thus state-of-the-art for Dutch WSD.


