LINGUIST List 18.1699
Mon Jun 04 2007
FYI: Freely Available JRC-Acquis Parallel Corpus
Editor for this issue: Dan Parker
<dan linguistlist.org>
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
Directory
1. Ralf Steinberger, Freely Available JRC-Acquis Parallel Corpus
Message 1: Freely Available JRC-Acquis Parallel Corpus
Date: 01-Jun-2007
From: Ralf Steinberger <Ralf.Steinberger jrc.it>
Subject: Freely Available JRC-Acquis Parallel Corpus
We are pleased to announce a new release of the freely available multilingual parallel corpus JRC-Acquis (version 3.0). The corpus size has nearly tripled (totaling over 1 billion words), and Bulgarian texts have now been added (thanks to the Romanian Academy of Sciences), so that the parallel texts are now available in 22 languages.

Size and Format:
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words, plus 19 million words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML format according to TEI P4, UTF-8 encoded
- Modular: download the languages you need

Languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.

Text Types:
- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements

Paragraph Alignment: Paragraph alignment for all 231 language pairs will soon be available for version 3.0 of the corpus. The following applies to version 2.2, which is still available on the same website:
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair

Manual Subject Domain Classification:
- Manually classified according to EUROVOC subject domains
- Selected from 6000 hierarchically organised, wide-coverage classes

Use / Download:
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes

For More Details: Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages'. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. Available at http://langtech.jrc.it/#Publications.

The JRC's Language Technology group specialises in the development of highly multilingual text analysis tools and in cross-lingual applications. An example is our multilingual (19 languages) news analysis application NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer.

Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and display of the very latest thematic news from around the world
- Medical Information System MedISys (http://medusa.jrc.it): displays the latest health-related news from around the world according to themes and diseases

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
http://langtech.jrc.it, http://press.jrc.it/NewsExplorer

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics; Translation
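
As a quick check on the alignment figures quoted in the announcement: the number of unordered language pairs among n languages is n(n-1)/2. The sketch below is illustrative only and is not part of the JRC release; the function name `pair_count` is hypothetical, and the language list is simply copied from the announcement.

```python
from itertools import combinations
from math import comb

# Language names as listed in the announcement (version 3.0).
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def pair_count(n_languages: int) -> int:
    """Number of unordered language pairs: n * (n - 1) / 2."""
    return comb(n_languages, 2)


# Enumerate the pairs explicitly, e.g. to plan which alignment
# files would be needed for a given subset of languages.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages ->", len(pairs), "pairs")
print("21 languages ->", pair_count(21), "pairs")
```

By the same formula, the 210 pairs quoted for version 2.2 correspond to 21 languages, and the 231 pairs quoted for version 3.0 correspond to the 22 languages listed.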