LINGUIST List 15.2855
Tue Oct 12 2004
Diss: Comp Ling/ Lexicography: Gaustad:'Linguistic...'
Editor for this issue: Takako Matsui <tako@linguistlist.org>
Linguistic Knowledge and Word Sense Disambiguation
Message 1: Linguistic Knowledge and Word Sense Disambiguation
From: Tanja Gaustad <tanja@let.rug.nl>
Subject: Linguistic Knowledge and Word Sense Disambiguation
Institution: University of Groningen
Dissertation Status: Completed
Degree Date: 2004
Author: Tanja Gaustad
Dissertation Title: Linguistic Knowledge and Word Sense Disambiguation
Linguistic Field(s): Computational Linguistics; Lexicography
Subject Language(s): Dutch (Language Code: DUT)
Dissertation Director(s): Gertjan van Noord
The main research question I try to answer in my thesis is which
linguistic knowledge sources are most useful for word sense disambiguation
(WSD), more specifically word sense disambiguation of Dutch. The goal of
the project was to develop a tool which is able to automatically determine
the meaning of a particular ambiguous word in context, a so-called word
sense disambiguation system. In order to achieve this, I make use of the
information contained in the context, namely the words surrounding the
ambiguous word, and additional underlying information (such as syntactic
class and structure) to build a statistical language model. This model is
then used to determine the meaning of examples of that particular ambiguous
word in new contexts.
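
As a rough illustration of the setup described above, the following minimal
sketch trains a small maximum entropy classifier on bag-of-words context
features for the Dutch word "bank" (financial institution or piece of
furniture). The toy sentences, the window size, the feature names, the function
names and the plain gradient-ascent training loop are all assumptions made for
this example. They are not taken from the thesis, whose actual features and
estimation procedure are not described in this abstract.

```python
import math


def context_features(tokens, target_index, window=3):
    """Bag-of-words features: words within `window` positions of the target."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return {"ctx=" + tokens[i].lower() for i in range(lo, hi) if i != target_index}


def sense_probabilities(weights, senses, feats):
    """Maximum entropy (softmax) distribution over senses given active features."""
    scores = {s: sum(weights.get((f, s), 0.0) for f in feats) for s in senses}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit feature weights by batch gradient ascent on the log-likelihood."""
    senses = sorted({sense for _, sense in examples})
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for feats, gold in examples:
            probs = sense_probabilities(weights, senses, feats)
            for f in feats:
                for s in senses:
                    observed = 1.0 if s == gold else 0.0
                    # gradient = observed feature count - expected feature count
                    gradient[(f, s)] = gradient.get((f, s), 0.0) + observed - probs[s]
        for key, g in gradient.items():
            weights[key] = weights.get(key, 0.0) + learning_rate * g / len(examples)
    return weights, senses


def disambiguate(weights, senses, tokens, target_index):
    """Return the most probable sense for the token at `target_index`."""
    probs = sense_probabilities(weights, senses, context_features(tokens, target_index))
    return max(senses, key=lambda s: probs[s])


# Hypothetical toy training data: (sentence, index of "bank", sense label).
TRAINING = [
    ("ik zit op de bank in de kamer", 4, "furniture"),
    ("zij ligt op de bank thuis", 4, "furniture"),
    ("de bank verhoogt de rente", 1, "finance"),
    ("hij leent geld van de bank", 5, "finance"),
]

examples = [(context_features(text.split(), idx), sense) for text, idx, sense in TRAINING]
weights, senses = train_maxent(examples)

print(disambiguate(weights, senses, "mijn bank betaalt weinig rente".split(), 1))
print(disambiguate(weights, senses, "de kat slaapt op onze bank thuis".split(), 5))
```

Richer knowledge sources such as PoS tags or dependency relations would enter
a model of this kind simply as additional feature strings alongside the
context words.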
My results on the (unseen) Senseval-2 test data show that adding structural
syntactic information in the form of dependency relations, instead of the
part-of-speech (PoS) tags of the context, leads to an error rate reduction of
8% for the word form model.
Furthermore, the lemma-based approach (introduced in this thesis)
outperforms the word form-based approach independently of the features
included in the model. We can observe an error rate reduction of 10% with
regard to the lemma-based model including PoS in context, and an error rate
reduction of 6% with regard to the best model based on word forms.
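
For readers unfamiliar with the measure, the sketch below shows how an error
rate reduction is commonly computed from two accuracy figures. It assumes the
relative convention (the share of the baseline's errors that the new model
removes); the abstract does not state which convention is used. The function
name and the accuracy values are hypothetical and are not results from the
thesis.

```python
def error_rate_reduction(baseline_accuracy, new_accuracy):
    """Relative error rate reduction of a new model over a baseline.

    Both arguments are accuracies in [0, 1]. The result is the fraction of
    the baseline's errors that the new model eliminates.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline makes no errors; reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical figures: accuracy rises from 80% to 82%, so the error rate
# falls from 20% to 18%, which is one tenth of the baseline's errors.
reduction = error_rate_reduction(0.80, 0.82)
print(f"{reduction:.1%}")
```

Under this convention, a reduction of 10% means that one in ten of the
baseline's errors is corrected, not that accuracy rises by ten percentage
points.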
Compared with results obtained on the test data with a different system,
which uses Memory-Based Learning (MBL) as its classification algorithm,
both the word form-based classifiers and the lemma-based classifiers from
my system produce higher accuracy. The lemma-based model leads to an error
rate reduction of 10% compared with the MBL WSD system.
In my maximum entropy system, the addition of deep linguistic knowledge in
particular greatly improves accuracy. Combining it with the lemma-based
approach, which takes advantage of morphological information, yields the
best results for WSD of Dutch on the Senseval-2 data set. Our
system achieves significantly higher disambiguation accuracy than any
results for Dutch that have been reported in the literature up to now and
is thus state-of-the-art for Dutch WSD.