Title: Linguistic Knowledge and Word Sense Disambiguation
Author: Tanja Gaustad
Degree Awarded: Rijksuniversiteit Groningen, Alpha-Informatica
Degree Date: 2004
Linguistic Subfield(s): Computational Linguistics; Lexicography
Subject Language(s): Dutch
Director(s): Gertjan van Noord; John Nerbonne

Abstract:

The main research question of my thesis is which linguistic knowledge sources are most useful for word sense disambiguation (WSD), specifically WSD of Dutch. The goal of the project was to develop a word sense disambiguation system: a tool that automatically determines the meaning of a particular ambiguous word in context. To achieve this, I use the information contained in the context, namely the words surrounding the ambiguous word, together with additional underlying information (such as syntactic class and structure), to build a statistical language model. This model is then used to determine the meaning of occurrences of that ambiguous word in new contexts.
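
As a rough illustration of the kind of contextual information described above, the following minimal sketch collects the surrounding words and the syntactic class of an ambiguous word as features that a statistical classifier could consume. It is not the thesis implementation: the function name `context_features`, the window size, the feature naming scheme, the tag labels, and the Dutch example sentence are all assumptions made for illustration.

```python
def context_features(tokens, pos_tags, target_index, window=2):
    """Collect context-word features plus the target word's PoS tag.

    Hypothetical helper: the feature names and window size are
    illustrative assumptions, not taken from the thesis.
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    # Syntactic class (PoS) of the ambiguous word itself.
    features = {"pos=" + pos_tags[target_index]: 1}
    # Words surrounding the ambiguous word, lowercased, order ignored.
    for i in range(start, end):
        if i != target_index:
            features["ctx=" + tokens[i].lower()] = 1
    return features


# Illustrative sentence; "bank" is the ambiguous target word.
tokens = ["De", "bank", "verhoogt", "de", "rente"]
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
features = context_features(tokens, pos_tags, target_index=1)
```

A feature dictionary of this shape could then be passed to any classifier that accepts sparse binary features, such as a maximum entropy model.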

My results on the (unseen) Senseval-2 test data show that adding structural syntactic information in the form of dependency relations, instead of the PoS of the context, leads to an error rate reduction of 8% for the word form model. Furthermore, the lemma-based approach (introduced in this thesis) outperforms the word form-based approach regardless of which features are included in the model. We observe an error rate reduction of 10% with regard to the lemma-based model including PoS in context, and a 6% reduction in errors with regard to the best model based on word forms.
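
The percentages above are presumably relative reductions in error rate, although the abstract does not spell out the formula. Under that assumption, a reduction can be computed from two accuracy scores as sketched below. The accuracy values in the example are hypothetical and are not results from the thesis.

```python
def error_rate_reduction(baseline_accuracy, new_accuracy):
    """Relative error-rate reduction of a new model over a baseline.

    Assumes both arguments are accuracies in [0, 1] and that the
    reduction is measured relative to the baseline's error rate.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline makes no errors; reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies: going from 80% to 82% accuracy removes
# about one tenth of the baseline's errors.
reduction = error_rate_reduction(0.80, 0.82)
```

Note that a relative reduction of this kind corresponds to a much smaller absolute gain in accuracy when the baseline is already accurate.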

Compared with results obtained on the test data by a different system that uses Memory-Based Learning (MBL) as its classification algorithm, both the word form-based and the lemma-based classifiers from my system produce higher accuracy. The lemma-based model reduces the error rate by 10% compared to the MBL WSD system.

In my maximum entropy system, the addition of deep linguistic knowledge in particular greatly improves accuracy. Combined with the lemma-based approach, which takes advantage of morphological information, it yields the best results for WSD of Dutch on the Senseval-2 data set. The system achieves significantly higher disambiguation accuracy than any results for Dutch reported in the literature so far and is thus state-of-the-art for Dutch WSD.
Page Updated: 29-Nov-2009
