Publishing Partner: Cambridge University Press CUP Extra Publisher Login

New from Cambridge University Press!


Revitalizing Endangered Languages

Edited by Justyna Olko & Julia Sallabank

Revitalizing Endangered Languages "This guidebook provides ideas and strategies, as well as some background, to help with the effective revitalization of endangered languages. It covers a broad scope of themes including effective planning, benefits, wellbeing, economic aspects, attitudes and ideologies."

We Have a New Site!

With the help of your donations we have been making good progress on designing and launching our new website! Check it out at!
***We are still in our beta stages for the new site--if you have any feedback, be sure to let us know at***

Academic Paper

Title: Lexical acquisition and semantic space models: Learning the semantics of unknown words
Author: Kostadin Cholakov
Institution: Rijksuniversiteit Groningen
Linguistic Field: Computational Linguistics; Semantics; Syntax; Text/Corpus Linguistics
Subject Language: Dutch
Abstract: In recent studies it has been shown that syntax-based semantic space models outperform models in which the context is represented as a bag-of-words in several semantic analysis tasks. This has been generally attributed to the fact that syntax-based models employ corpora that are syntactically annotated by a parser and a computational grammar. However, if the corpora processed contain words which are unknown to the parser and the grammar, a syntax-based model may lose its advantage since the syntactic properties of such words are unavailable. On the other hand, bag-of-words models do not face this issue since they operate on raw, non-annotated corpora and are thus more robust. In this paper, we compare the performance of syntax-based and bag-of-words models when applied to the task of learning the semantics of unknown words. In our experiments, unknown words are considered the words which are not known to the Alpino parser and grammar of Dutch. In our study, the semantics of an unknown word is defined by finding its most similar word in, a Dutch lexico-semantic hierarchy. We show that for unknown words the syntax-based model performs worse than the bag-of-words approach. Furthermore, we show that if we first learn the syntactic properties of unknown words by an appropriate lexical acquisition method, then in fact the syntax-based model does outperform the bag-of-words approach. The conclusion we draw is that, for words unknown to a given grammar, a bag-of-words model is more robust than a syntax-based model. However, the combination of lexical acquisition and syntax-based semantic models is best suited for learning the semantics of unknown words.


This article appears IN Natural Language Engineering Vol. 20, Issue 4.

Return to TOC.

Add a new paper
Return to Academic Papers main page
Return to Directory of Linguists main page