Publishing Partner: Cambridge University Press CUP Extra Publisher Login

New from Cambridge University Press!


Revitalizing Endangered Languages

Edited by Justyna Olko & Julia Sallabank

Revitalizing Endangered Languages "This guidebook provides ideas and strategies, as well as some background, to help with the effective revitalization of endangered languages. It covers a broad scope of themes including effective planning, benefits, wellbeing, economic aspects, attitudes and ideologies."

New from Wiley!


We Have a New Site!

With the help of your donations we have been making good progress on designing and launching our new website! Check it out at!
***We are still in our beta stages for the new site--if you have any feedback, be sure to let us know at***

Academic Paper

Title: How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts
Author: Katja Zupan
Author: Nikola Ljubešić
Author: Tomaž Erjavec
Linguistic Field: Computational Linguistics
Subject Language: Slovenian
Abstract: Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.


This article appears IN Natural Language Engineering Vol. 25, Issue 5, which you can READ on Cambridge's site .

Return to TOC.

View the full article for free in the current issue of
Cambridge Extra Magazine!
Add a new paper
Return to Academic Papers main page
Return to Directory of Linguists main page