Academic Paper


Title: New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies
Author: Marcos Garcia
Author: Carlos Gómez-Rodríguez
Author: Miguel Alonso
Linguistic Field: Computational Linguistics
Subject Language: Galician
Subject Language Family: Romance
Abstract: This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks or the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches similar labeled attachment score (LAS) values to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned with the performed experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations in this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.
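
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how UAS and LAS are conventionally computed. It is not the authors' evaluation code: the function name `attachment_scores`, the `(head, label)` token representation, and the toy sentence are all assumptions made for illustration. UAS counts tokens whose predicted head is correct; LAS additionally requires the dependency label to match.

```python
from typing import List, Tuple

# Hypothetical representation: each token is (head index, dependency label),
# with head index 0 standing for the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] over aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    total = len(gold)
    # UAS: only the head has to match.
    uas_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    # LAS: both head and label have to match.
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return uas_hits / total, las_hits / total


# Toy four-token sentence (invented data, not taken from the paper).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 0.75, LAS = 0.50
```

In the toy example three of four heads are correct, but one of those carries the wrong label, so LAS is lower than UAS. This is why the abstract reports LAS and UAS separately.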


This article appears in Natural Language Engineering Vol. 24, Issue 1, which you can read on Cambridge's site.
