LINGUIST List 22.3413
Tue Aug 30 2011
FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
Editor for this issue: Brent Miller
From: Joel Wallenberg <joel.wallenberggmail.com>
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
We are very pleased to announce that version 0.9 of the Icelandic
Parsed Historical Corpus (IcePaHC) is now available for free download.
The corpus can be downloaded from:
The corpus is a treebank of over 1 million words, annotated with a
full phrase structure parse and hand-corrected, using an adaptation of
the annotation scheme used by the Penn Treebank and the Penn
Parsed Corpora of Historical English
(http://www.ling.upenn.edu/hist-corpora/). Note that this release
contains all of the text for version 1.0, but some minor corrections
remain to be finished.
The corpus contains:
- 1 002 361 words total, consisting of ~100 000-word samples from
each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and
lemmatized.
- The entire parse, POS tagging, and lemmata for every sentence have
been hand-corrected.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the
corpus for research and/or profit with appropriate citation.
The corpus is distributed as raw UTF-8 data in labeled bracketing
format and it is therefore compatible with various existing programs,
including CorpusSearch (http://corpussearch.sourceforge.net/).
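
For readers who have not worked with labeled bracketing before, the following is a minimal sketch of how such a file can be read without any special tooling. It is not part of the IcePaHC distribution: the function names (parse_bracketed, leaves, tagged_words) and the toy English sentence are invented for illustration, and the tags in the sample only imitate the general Penn style, not the actual IcePaHC tag set.

```python
from typing import List, Tuple, Union

# A node is (label, children); each child is another node or a leaf string.
Node = Tuple[str, list]


def parse_bracketed(text: str) -> List[Node]:
    """Parse Penn-style labeled bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek() -> str:
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: input ended inside a tree")
        return tokens[pos]

    def read_node() -> Node:
        nonlocal pos
        pos += 1  # skip the opening "("
        label = ""
        if peek() not in ("(", ")"):
            label = tokens[pos]  # first token after "(" is the node label
            pos += 1
        children: list = []
        while peek() != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf (word) token
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    trees = []
    while pos < len(tokens):
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        trees.append(read_node())
    return trees


def leaves(node: Node) -> List[str]:
    """Return the leaf strings of a tree in left-to-right order."""
    out: List[str] = []
    for child in node[1]:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out


def tagged_words(node: Node) -> List[Tuple[str, str]]:
    """Return (tag, word) pairs for nodes that directly dominate a single leaf."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs: List[Tuple[str, str]] = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Invented toy sentence, NOT taken from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO she)) (VBD saw) (NP-OB1 (D the) (N horse)))"

if __name__ == "__main__":
    tree = parse_bracketed(SAMPLE)[0]
    print(leaves(tree))
    print(tagged_words(tree))
```

For serious querying, a dedicated tool such as CorpusSearch is the better choice; a sketch like this is mainly useful for quick word or tag counts.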
A plain text version without markup and a set of info files containing
philological information accompany the corpus download.
The entire corpus may be downloaded in a plain text version, a
platform-independent GUI, and a Windows-compatible GUI for ease of
use.
Further information on the annotation guidelines and project
organization can be found on the project wiki:
Joel C. Wallenberg (joel.wallenberggmail.com)
Anton Karl Ingason (anton.karl.ingasongmail.com)
Einar Freyr Sigurðsson (einarfsgmail.com)
Eiríkur Rögnvaldsson (eirikurhi.is)
University of Iceland
We were grateful to receive support for this project through the
Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable
Language Technology beyond English – Icelandic as a test case".
U.S. National Science Foundation (NSF) International Research
Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of
Language Systems: a comparative study of grammatical change in
Icelandic and English".
University of Iceland Research Fund (Rannsóknasjóður Háskóla
Íslands), grant Icelandic Diachronic Treebank (Sögulegur íslenskur
Linguistic Field(s): Computational Linguistics; Historical Linguistics; Syntax; Text/Corpus Linguistics
Page Updated: 30-Aug-2011