LINGUIST List 21.5076
Wed Dec 15 2010
FYI: Free Access: Corpora in Catalan, Spanish, English
Editor for this issue: Brent Miller
<brent linguistlist.org>
Date: 15-Dec-2010
From: Gemma Boleda <gemma.boleda gmail.com>
Subject: Free Access: Corpora in Catalan, Spanish, English
Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English Wikipedias based on a 2006 dump. The corpora have been automatically tagged with lemma and part-of-speech information using the open-source library FreeLing. They have also been annotated with WordNet senses using the state-of-the-art Word Sense Disambiguation algorithm UKB.

In their current version, the corpora have the following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide access to the corpora in their raw text and tagged versions, under the same license as Wikipedia itself. To our knowledge, these are the largest Catalan and Spanish corpora freely available for download.

For more information and downloads, please visit the project's page: http://www.lsi.upc.edu/~nlp/wikicorpus
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): Catalan-Valencian-Balear (cat)
English (eng)
Spanish (spa)