LINGUIST List 21.5076

Wed Dec 15 2010

FYI: Free Access: Corpora in Catalan, Spanish, English

Editor for this issue: Brent Miller <brentlinguistlist.org>


Directory
        1.     Gemma Boleda, Free Access: Corpora in Catalan, Spanish, English

Message 1: Free Access: Corpora in Catalan, Spanish, English
Date: 15-Dec-2010
From: Gemma Boleda <gemma.boledagmail.com>
Subject: Free Access: Corpora in Catalan, Spanish, English

Wikicorpus, v. 1.0: Catalan, Spanish, and English portions of Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English
Wikipedias, based on a 2006 dump. The corpora have been automatically tagged
with lemma and part-of-speech information using the open-source library
FreeLing. They have also been annotated with WordNet senses using UKB, a
state-of-the-art word sense disambiguation algorithm. In the current
version, the corpora have the following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide the corpora in both raw-text and tagged versions, under the same
license as Wikipedia itself. To our knowledge, these are the largest Catalan
and Spanish corpora freely available for download. For more information and
to download the corpora, please visit the project page:

http://www.lsi.upc.edu/~nlp/wikicorpus

Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): Catalan-Valencian-Balear (cat)
                            English (eng)
                            Spanish (spa)




Page Updated: 15-Dec-2010
