LINGUIST List 22.3413
Tue Aug 30 2011
FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
Editor for this issue: Brent Miller
<brent linguistlist.org>
Date: 29-Aug-2011
From: Joel Wallenberg <joel.wallenberg gmail.com>
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
We are very pleased to announce that version 0.9 of the Icelandic Parsed Historical Corpus (IcePaHC) is now available for free download. The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

The corpus is a treebank of over 1 million words, annotated with a full phrase structure parse and hand-corrected, using an adaptation of the annotation scheme used by the Penn Treebank and the Penn parsed corpora of historical English (http://www.ling.upenn.edu/hist-corpora/). Note that this release contains all of the text for version 1.0, but some minor corrections remain to be finished.

The corpus contains:
- 1 002 361 words total, consisting of ~100 000-word samples from each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and lemmatized.
- The entire parse, POS-tagging, and lemmata for every sentence have been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the corpus for research and/or profit with appropriate citation.

The corpus is distributed as raw UTF-8 data in labeled bracketing format, and it is therefore compatible with various existing programs, including CorpusSearch (http://corpussearch.sourceforge.net/). A plain text version without markup and a set of info files containing philological information accompany the corpus download. The entire corpus may be downloaded in three forms: a plain text version, a platform-independent GUI, and a Windows-compatible GUI for ease of searching.

Further information on the annotation guidelines and project organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/

Joel C. Wallenberg (joel.wallenberg gmail.com)
Anton Karl Ingason (anton.karl.ingason gmail.com)
Einar Freyr Sigurðsson (einarfs gmail.com)
Eiríkur Rögnvaldsson (eirikur hi.is)
University of Iceland

We were grateful to receive support for this project through the following grants:
- Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".
- U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".
- University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant Icelandic Diachronic Treebank (Sögulegur íslenskur trjábanki).
Linguistic Field(s): Computational Linguistics; Historical Linguistics; Syntax; Text/Corpus Linguistics
Page Updated: 30-Aug-2011