LINGUIST List 18.1699
Mon Jun 04 2007
FYI: Freely Available JRC-Acquis Parallel Corpus
Editor for this issue: Dan Parker
Message 1: Freely Available JRC-Acquis Parallel Corpus
From: Ralf Steinberger <Ralf.Steinbergerjrc.it>
Subject: Freely Available JRC-Acquis Parallel Corpus
We are pleased to announce a new release of the freely available
multilingual parallel corpus JRC-Acquis (version 3.0). The corpus size has
nearly tripled (totaling over 1 billion words) and Bulgarian texts have now
been added (thanks to the Romanian Academy of Sciences) so that the
parallel texts are now available in 22 languages.
Size and Format:
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words, plus 19 million
words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.
- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- International agreements.
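
Since the texts are distributed as UTF-8-encoded XML following TEI P4, a
reader might pull paragraphs out of a document roughly as sketched below.
This is only an illustration: the sample document, its element layout and the
helper name read_paragraphs are assumptions made for the example and are not
taken from the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI P4-style document. Real JRC-Acquis files may
# use different elements and attributes; check the corpus documentation.
SAMPLE = """<TEI.2 lang="en">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Return (n, text) pairs for every <p> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (p.get("n"), "".join(p.itertext()).strip())
        for p in root.iter("p")
    ]


paragraphs = read_paragraphs(SAMPLE)
print(paragraphs)
```

For a downloaded file, ET.parse(path).getroot() can replace ET.fromstring;
the parser then honours the encoding declared in the file itself.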
Paragraph alignment for all 231 language pairs will soon be available for
version 3.0 of the corpus. The following text applies to version 2.2, still
available on the same website:
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
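
The pair counts quoted above follow from the number of unordered pairs among
n languages, n(n-1)/2: 22 languages give 231 pairs, and 210 pairs correspond
to 21 languages. A short check, using ISO 639-1 codes for the 22 languages
listed in this announcement (the codes and function name are chosen here for
illustration only):

```python
from itertools import combinations

# ISO 639-1 codes for the 22 languages named in the announcement.
LANGUAGES = [
    "bg", "cs", "da", "nl", "en", "et", "de", "el", "fi", "fr", "hu",
    "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]


def pair_count(n_languages):
    """Number of unordered language pairs among n_languages languages."""
    return n_languages * (n_languages - 1) // 2


# Enumerate the pairs explicitly and compare with the closed formula.
pairs = list(combinations(LANGUAGES, 2))
print(len(pairs), pair_count(len(LANGUAGES)), pair_count(21))
```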
Manual Subject Domain Classification:
- Manually classified according to EUROVOC subject domains
- Selected from 6000 hierarchically organised, wide-coverage classes.
Use / Download:
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
For More Details:
Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual applications.
An example is our multilingual (19 languages) news analysis application
NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer.
Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and display of
the very latest thematic news from around the world;
- Medical Information System MedISys (http://medusa.jrc.it): displays the
latest health-related news from around the world according to themes and
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
Linguistic Field(s): Computational Linguistics