LINGUIST List 18.1699

Mon Jun 04 2007

FYI: Freely Available JRC-Acquis Parallel Corpus

Editor for this issue: Dan Parker <danlinguistlist.org>


Directory
        1.    Ralf Steinberger, Freely Available JRC-Acquis Parallel Corpus


Message 1: Freely Available JRC-Acquis Parallel Corpus
Date: 01-Jun-2007
From: Ralf Steinberger <Ralf.Steinbergerjrc.it>
Subject: Freely Available JRC-Acquis Parallel Corpus


We are pleased to announce a new release of the freely available
multilingual parallel corpus JRC-Acquis (version 3.0). The corpus size has
nearly tripled (totaling over 1 billion words) and Bulgarian texts have now
been added (thanks to the Romanian Academy of Sciences) so that the
parallel texts are now available in 22 languages.

Size and Format:

- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words + 19 million words
in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
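
As a rough cross-check of the figures above, the per-language averages can be multiplied out to see how they relate to the "over 1 billion words" total. This is only a back-of-the-envelope sketch: it assumes every language has the average size, which the list itself says is not quite true for Bulgarian, Maltese and Romanian. The variable and function names are illustrative.

```python
# Figures quoted in the announcement, in millions of words per language.
LANGUAGES = 22
BODY_WORDS_M = 28.9    # average main-text size per language
ANNEX_WORDS_M = 19.0   # average size of annexes etc. per language


def total_words_millions(languages, body_m, annex_m):
    """Estimate the total corpus size in millions of words,
    assuming every language has the average size."""
    return languages * (body_m + annex_m)


total_m = total_words_millions(LANGUAGES, BODY_WORDS_M, ANNEX_WORDS_M)
print(f"approx. {total_m:.1f} million words")  # approx. 1053.8 million words
```

The estimate of roughly 1.05 billion words is consistent with the stated total.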

Languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.

Text Types:

- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

Paragraph Alignment:

Paragraph alignment for all 231 language pairs will soon be available for
version 3.0 of the corpus. The following text applies to version 2.2, still
available on the same website:

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
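
The pair counts quoted above follow from the number of unordered pairs among n languages, n(n-1)/2. The short sketch below reproduces both figures. It infers from the count of 210 that version 2.2 covers 21 languages, which is an assumption rather than something stated in this message. The overall alignment total is likewise only an estimate from the "ca. 270,000" figure.

```python
def language_pairs(n_languages):
    """Number of unordered language pairs among n languages: n(n-1)/2."""
    return n_languages * (n_languages - 1) // 2


pairs_v3 = language_pairs(22)   # version 3.0: 22 languages
pairs_v22 = language_pairs(21)  # version 2.2: 21 languages (inferred)
print(pairs_v3)   # 231
print(pairs_v22)  # 210

# Rough total for version 2.2, assuming ca. 270,000 alignments per pair.
approx_alignments = pairs_v22 * 270_000
print(approx_alignments)  # 56700000
```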

Manual Subject Domain Classification:

- Manually classified according to EUROVOC subject domains
- Selected from 6000 hierarchically organised classes, wide-coverage.

Use / Download:

- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.

For More Details:

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.

The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual applications.
An example is our multilingual (19 languages) news analysis application
NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer.

Related JRC developments (both covering 22+ languages):

- NewsBrief (http://press.jrc.it): breaking news detection and display of
the very latest thematic news from around the world;

- Medical Information System MedISys (http://medusa.jrc.it): displays the
latest health-related news from around the world according to themes and
diseases.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
http://langtech.jrc.it, http://press.jrc.it/NewsExplorer



Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Translation

