LINGUIST List 16.3291
Tue Nov 15 2005
FYI: German Treebanks; Corpus of Written Italian
Editor for this issue: Svetlana Aksenova
To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.html.
1. Heike Zinsmeister
German Treebanks
2. Pier Marco Bertinetto
Corpus and Frequency Lexicon of Written Italian
Message 1: German Treebanks
From: Heike Zinsmeister <heike.zinsmeister@uni-tuebingen.de>
Subject: German Treebanks
The Division of Computational Linguistics at the Seminar fuer
Sprachwissenschaft of the University of Tuebingen (Germany) is happy to
announce the release of two German language resources:
* The Tuebingen Treebank of Spoken German (TueBa-D/S)
* The Tuebingen Treebank of Written German (TueBa-D/Z) - second release
Both treebanks share the same basic annotation scheme, which
distinguishes four levels of syntactic constituency: the lexical level, the
phrasal level, the level of topological fields, and the clausal level. In
addition to constituent structure, annotated trees contain edge labels
between nodes which encode grammatical functions.
Both treebanks are available in three formats:
* NEGRA export format
* XML format
* Penn Treebank format
The treebanks in detail:
1. The Tuebingen Treebank of Spoken German (TueBa-D/S)
The TueBa-D/S treebank was annotated as part of the Verbmobil project, a long-term
Machine Translation project for spontaneous speech funded by the German
Ministry for Education, Science, Research, and Technology (BMBF). This is
the first public release of the treebank.
TueBa-D/S is a syntactically annotated corpus based on spontaneous
dialogues, which were manually transliterated. The treebank comprises
approximately 38 000 sentences (ca. 360 000 words). The syntactic
annotation was also performed manually.
The license for TueBa-D/S is granted free of charge for scientific use. For
more information, please refer to:
2. The Tuebingen Treebank of Written German (TueBa-D/Z) - second release
The TueBa-D/Z treebank is a manually annotated German newspaper corpus
based on data taken from daily issues of the newspaper 'die tageszeitung'. It
currently comprises approximately 22 000 sentences (ca. 380 000 words).
The annotation scheme is an extended version of the TueBa-D/S annotation
scheme. It accounts for a larger number of linguistic phenomena and is
enriched at two levels: (multi-word) named entities are marked at the
phrasal level; words are annotated with inflectional morphology at the
lexical level (currently ca. 70% of the sentences are covered).
What is new in the second release:
- about 6 800 additional sentences
- morphological information
- cleaner versions of the trees published in the first release
The license for TueBa-D/Z is granted free of charge for scientific use. For
more information, please refer to:
With best regards,
Erhard W. Hinrichs
For your information:
A related resource is The Tuebingen Partially Parsed Corpus of
Written German (TuePP-D/Z), released 12/2003.
TuePP-D/Z is a 200 million word collection of articles from the taz
newspaper which have been automatically annotated with clause structure,
topological fields, and chunks, in addition to lower-level annotation
including parts of speech and morphological ambiguity classes.
For more information, please refer to:
Linguistic Field(s): Computational Linguistics
Message 2: Corpus and Frequency Lexicon of Written Italian
From: Pier Marco Bertinetto <Bertinetto@sns.it>
Subject: Corpus and Frequency Lexicon of Written Italian
We are glad to announce a new lexical resource:
CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto)
[Corpus and Frequency Lexicon of Written Italian]
Pier Marco Bertinetto°, Cristina Burani*, Alessandro Laudanna^*,
Lucia Marconi+, Daniela Ratti+, Claudia Rolando+, Anna Maria Thornton§
° Scuola Normale Superiore, Pisa
* Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma
^ Università di Salerno
+ Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova
§ Università de L'Aquila
The reference corpus consists of excerpts from newspapers, magazines and
books. It includes 3,150,075 lexical occurrences. The corpus was designed
to approximate as closely as possible what Italians typically read, as
reflected in official statistics.
The lexicon consists of two main components: the forms repertoire and the
lemmas repertoire. In the latter, all identical forms belonging to
different lemmas are disambiguated, while syntagmatic words (such as
table's leg) are treated as single entries.
The lexical lists (both forms and lemmas) are presently available for free.
They can be ordered in several ways: by frequency rank, in inverse
alphabetical order, with or without the capital / non-capital
distinction, etc. The entire corpus is not yet available. We hope to put it
online as soon as we obtain the necessary authorizations.
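As a rough illustration of the orderings mentioned above, here is a minimal Python sketch. It is not the CoLFIS tooling: the entry layout, the function names, and the toy counts are all invented for the example, and "inverse alphabetical ordering" is assumed here to mean sorting forms by their reversed spelling, so that forms sharing an ending appear together.

```python
# Toy (form, count) pairs. The counts are invented for illustration
# and are NOT taken from CoLFIS.
toy_lexicon = [("casa", 120), ("Roma", 45), ("cosa", 300), ("roma", 2), ("mare", 45)]


def by_frequency_rank(entries):
    """Most frequent first; ties are broken by plain string order."""
    return sorted(entries, key=lambda e: (-e[1], e[0]))


def by_inverse_alphabetical(entries):
    """Sort on the reversed spelling (assumed meaning of 'inverse' order)."""
    return sorted(entries, key=lambda e: e[0][::-1])


def merge_case(entries):
    """Drop the capital / non-capital distinction by summing counts per lowercased form."""
    totals = {}
    for form, count in entries:
        key = form.lower()
        totals[key] = totals.get(key, 0) + count
    return list(totals.items())


if __name__ == "__main__":
    for form, count in by_frequency_rank(merge_case(toy_lexicon)):
        print(form, count)
```

Keeping the case merge as a separate step means either ordering can be applied with or without the capital / non-capital distinction.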
The work has been produced with CNR (Consiglio Nazionale delle Ricerche)
With the help of willing users, this product will hopefully be enriched
with further facilities.
Linguistic Field(s): Text/Corpus Linguistics