LINGUIST List 22.144

Sat Jan 08 2011

FYI: Icelandic Parsed Historical Corpus (IcePaHC) V0.3

Editor for this issue: Brent Miller <brent@linguistlist.org>



Message 1: Icelandic Parsed Historical Corpus (IcePaHC) V0.3
Date: 06-Jan-2011
From: Joel Wallenberg <joel.wallenberg@gmail.com>
Subject: Icelandic Parsed Historical Corpus (IcePaHC) V0.3

We are pleased to announce that version 0.3 of the Icelandic Parsed
Historical Corpus (IcePaHC) is now available for free download.

The corpus is syntactically parsed and annotated for full phrase structure,
using an adaptation of the annotation scheme used by the Penn parsed
corpora of historical English (http://www.ling.upenn.edu/hist-corpora/) and
other corpora in that tradition (see links from website). The corpus
contains ca. 262,000 words drawn from every century from the 12th through
the 19th. Please note that this is about a quarter of the ultimate goal for
the completed corpus, ca. 1 million words.

The corpus is distributed as raw UTF-8 data in labeled bracketing format
and it is therefore compatible with various existing programs, including
CorpusSearch (http://corpussearch.sourceforge.net/).
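
As an illustration of what the labeled bracketing format makes possible, the
following is a minimal Python sketch that reads one bracketed tree into nested
(label, children) pairs and lists its words. It is not part of the corpus
distribution or of CorpusSearch. The function names (tokenize, parse, leaves)
are hypothetical, and the SAMPLE tree is a made-up example whose labels are
only loosely modeled on Penn-style tags; it is not taken from the corpus.

```python
import re


def tokenize(text):
    """Split a bracketed string into '(', ')' and atom tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, pos=0):
    """Parse one tree starting at tokens[pos]; return (tree, next_pos).

    A tree is a (label, children) pair; leaves are plain strings.
    """
    assert tokens[pos] == "(", "a tree must start with an opening bracket"
    pos += 1
    label = ""
    # The label is optional: some wrappers open with a bare bracket.
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1


def leaves(tree):
    """Return the terminal strings of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


# Hypothetical sample, not taken from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO hann)) (VBD kom) (ADVP (ADV heim)))"
tree, _ = parse(tokenize(SAMPLE))
print(tree[0])       # label of the root node
print(leaves(tree))  # words in surface order
```

A real corpus file holds many trees and may carry extra material such as
identifiers or lemmas inside the brackets, so a sketch like this would need
adapting to the conventions described in the annotation guidelines.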

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

Further information on the annotation guidelines and project organization
can be found on the project wiki:
www.linguist.is/icelandic_treebank/

We hope that this release will result in feedback that allows us to improve
the resource in upcoming versions. Updates are released every three months;
version 0.4 will be released on April 4th, 2011. Between releases,
development can be tracked at our open repository on GitHub
(http://github.com/antonkarl/icecorpus), but use of released versions is
encouraged to ensure that results can be replicated.

Texts included in Version 0.3:
4439 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
3459 words from Egils saga (theta fragment) (13th century)
22720 words from Sturlunga saga (13th century)
23040 words from Finnboga saga ramma (1350)
11486 words from Bandamanna saga (1450)
23041 words from Vilhjálms saga Sjóðs (1450)
8582 words from Erasmus saga (1525)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
17127 words from Ólafur Egilsson's travelogue (1628)
9760 words from Píslarsaga Jóns Magnússonar (1659)
22905 words from Jón Indíafari's travelogue (1661)
22099 words from Jón Steingrímsson's biography (1791)
3269 words from Jónas Hallgrímsson's essay on the nature and origin of the earth (1835)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
27192 words from Brynjólfur Sveinsson biskup (novel by Torfhildur Hólm) (1882)
Total number of words: 262240


Joel C. Wallenberg (joel.wallenberg@gmail.com)
Anton Karl Ingason (anton.karl.ingason@gmail.com)
Einar Freyr Sigurðsson (einarfs@gmail.com)
Eiríkur Rögnvaldsson (eirikur@hi.is)
University of Iceland

The project is funded by the following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language
Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship
Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a
comparative study of grammatical change in Icelandic and English".

Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): Icelandic (isl)
