LINGUIST List 22.3413

Tue Aug 30 2011

FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank

Editor for this issue: Brent Miller <brent@linguistlist.org>


Message 1: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
Date: 29-Aug-2011
From: Joel Wallenberg <joel.wallenberg@gmail.com>
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank

We are very pleased to announce that version 0.9 of the Icelandic
Parsed Historical Corpus (IcePaHC) is now available for free download.

The corpus can be downloaded from:

The corpus is a treebank of over 1 million words in size, annotated for
full phrase structure parse, and hand-corrected, using an adaptation of
the annotation scheme used by the Penn Treebank and the Penn
Parsed Corpora of Historical English
(http://www.ling.upenn.edu/hist-corpora/). Note that this release
contains all of the text for version 1.0, but some minor corrections
remain to be finished.

The corpus contains:

- 1 002 361 words total, consisting of ~100 000-word samples from
each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and
lemmatized.
- The entire parse, pos-tagging, and lemmata for every sentence have
been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the
corpus for research and/or profit with appropriate citation.

The corpus is distributed as raw UTF-8 data in labeled bracketing
format and it is therefore compatible with various existing programs,
including CorpusSearch (http://corpussearch.sourceforge.net/).
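For readers who want to process the files without CorpusSearch, the
following is a minimal Python sketch of reading labeled bracketing. It is
not part of the IcePaHC distribution. The function names and the sample
tree are hypothetical; the sample is an invented English toy sentence in
Penn-style notation, not corpus data, and the exact leaf conventions of
IcePaHC should be checked against the annotation guidelines.

```python
import re

# A token is an opening bracket, a closing bracket, or any run of
# characters that contains neither whitespace nor brackets.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested lists.

    Each node becomes a list: [label, child, child, ...]. A wrapper node
    without a label simply becomes a list of its children.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1  # consume "("
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return items

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after tree")
    return tree


def leaves(tree):
    """Return (tag, word) pairs for every terminal node, left to right."""
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [(tree[0], tree[1])]
    pairs = []
    for child in tree:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs


# Hypothetical toy tree, NOT taken from IcePaHC.
sample = "(IP-MAT (NP-SBJ (PRO She)) (VBD read) (NP-OB1 (D the) (N book)))"
tree = parse_bracketed(sample)
print(leaves(tree))
```

Because the corpus is distributed as UTF-8, a real file would be opened
with open(path, encoding="utf-8") before its trees are passed to a parser
of this kind; how individual trees are delimited within a file is left
out of this sketch.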

A plain text version without markup and a set of info files containing
philological information accompany the corpus download.

The entire corpus may be downloaded in a plain text version, a
platform-independent GUI, and a Windows-compatible GUI for ease of

Further information on the annotation guidelines and project
organization can be found on the project wiki:

Joel C. Wallenberg (joel.wallenberg@gmail.com)
Anton Karl Ingason (anton.karl.ingason@gmail.com)
Einar Freyr Sigurðsson (einarfs@gmail.com)
Eiríkur Rögnvaldsson (eirikur@hi.is)
University of Iceland

We were grateful to receive support for this project through the
following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable
Language Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research
Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of
Language Systems: a comparative study of grammatical change in
Icelandic and English".

University of Iceland Research Fund (Rannsóknasjóður Háskóla
Íslands), grant Icelandic Diachronic Treebank (Sögulegur íslenskur

Linguistic Field(s): Computational Linguistics; Historical Linguistics; Syntax; Text/Corpus Linguistics


Page Updated: 30-Aug-2011
