LINGUIST List 21.5076

Wed Dec 15 2010

FYI: Free Access: Corpora in Catalan, Spanish, English

Editor for this issue: Brent Miller <brentlinguistlist.org>


        1.     Gemma Boleda, Free Access: Corpora in Catalan, Spanish, English

Message 1: Free Access: Corpora in Catalan, Spanish, English
Date: 15-Dec-2010
From: Gemma Boleda <gemma.boledagmail.com>
Subject: Free Access: Corpora in Catalan, Spanish, English

Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English Wikipedias based on a 2006 dump. The corpora have been automatically tagged with lemma and part-of-speech information using the open source library FreeLing. They have also been annotated with WordNet senses using the state-of-the-art Word Sense Disambiguation algorithm UKB. In their current version, the corpora have the following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide access to the corpora in their raw text and tagged versions, under the same license as Wikipedia itself. To our knowledge, these are the largest Catalan and Spanish corpora freely available for download. For more information and to download the corpora, please visit the project's page:

http://www.lsi.upc.edu/~nlp/wikicorpus

Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): Catalan-Valencian-Balear (cat); English (eng); Spanish (spa)

Page Updated: 15-Dec-2010