LINGUIST List 22.144
Sat Jan 08 2011
FYI: Icelandic Parsed Historical Corpus (IcePaHC) V0.3
Editor for this issue: Brent Miller
<brent linguistlist.org>
Directory
1. Joel Wallenberg, Icelandic Parsed Historical Corpus (IcePaHC) V0.3
Message 1: Icelandic Parsed Historical Corpus (IcePaHC) V0.3
Date: 06-Jan-2011
From: Joel Wallenberg <joel.wallenberg gmail.com>
Subject: Icelandic Parsed Historical Corpus (IcePaHC) V0.3
We are pleased to announce that version 0.3 of the Icelandic Parsed Historical Corpus (IcePaHC) is now available for free download. The corpus is syntactically parsed and annotated for full phrase structure, using an adaptation of the annotation scheme used by the Penn parsed corpora of historical English (http://www.ling.upenn.edu/hist-corpora/) and other corpora in that tradition (see links from website). The corpus contains ca. 262,000 words, drawn from every century from the 12th to the 19th inclusive. Please note that this is about a quarter of the ultimate goal for the completed corpus, ca. 1 million words.

The corpus is distributed as raw UTF-8 data in labeled bracketing format and is therefore compatible with various existing programs, including CorpusSearch (http://corpussearch.sourceforge.net/).

The corpus can be downloaded from: www.linguist.is/icelandic_treebank/Download

Further information on the annotation guidelines and project organization can be found on the project wiki: www.linguist.is/icelandic_treebank/

We hope that this release will result in feedback that allows us to improve the resource for upcoming versions. Updates are released every three months; the upcoming version 0.4 will be released on April 4th, 2011. Between releases, development can be tracked at our open repository on GitHub (http://github.com/antonkarl/icecorpus), but use of released versions is encouraged to ensure that results can be replicated.
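For readers who have not worked with labeled bracketing before, the following is a minimal Python sketch of how such a tree can be read into nested lists and its words extracted. It assumes Penn-Treebank-style bracketing in which a terminal has the form (TAG word). The sample tree, its labels, and the function names parse_bracketed and leaves are hypothetical illustrations and are not taken from IcePaHC or its annotation guidelines.

```python
# Hypothetical sample tree; the labels and words are illustrative only
# and are not taken from the IcePaHC data.
SAMPLE = "(IP-MAT (NP-SBJ (PRO he)) (VBDI came) (ADVP (ADV home)))"


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested Python lists."""
    # Pad the brackets with spaces so that split() yields them as tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())   # nested constituent
            else:
                node.append(tokens[pos])   # label or word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1                           # consume ')'
        return node

    return read_node()


def leaves(node):
    """Return the words of a tree, assuming terminals look like [TAG, word]."""
    if len(node) == 2 and all(isinstance(c, str) for c in node):
        return [node[1]]
    words = []
    for child in node:
        if isinstance(child, list):
            words.extend(leaves(child))
    return words


tree = parse_bracketed(SAMPLE)
print(leaves(tree))  # prints ['he', 'came', 'home']
```

A sketch like this is only meant to show the shape of the data. For actual queries over the corpus, a dedicated tool such as CorpusSearch is the safer choice, since it handles the annotation conventions that this example ignores.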
Texts included in Version 0.3:
4439 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubok (Icelandic book of homilies) (12th century)
3459 words from Egils saga (theta fragment) (13th century)
22720 words from Sturlunga saga (13th century)
23040 words from Finnboga saga ramma (1350)
11486 words from Bandamanna saga (1450)
23041 words from Vilhjálms saga Sjóðs (1450)
8582 words from Erasmus saga (1525)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
17127 words from Ólafur Egilsson's travelogue (1628)
9760 words from Píslarsaga Jóns Magnússonar (1659)
22905 words from Jón Indíafari's travelogue (1661)
22099 words from Jón Steingrímsson's biography (1791)
3269 words from Jónas Hallgrímsson's essay on the nature and origin of the earth (1835)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm) (1882)
Total number of words: 262240

Joel C. Wallenberg (joel.wallenberg gmail.com)
Anton Karl Ingason (anton.karl.ingason gmail.com)
Einar Freyr Sigurðsson (einarfs gmail.com)
Eiríkur Rögnvaldsson (eirikur hi.is)
University of Iceland

The project is funded by the following grants:
Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".
U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): Icelandic (isl)
Page Updated: 08-Jan-2011