LINGUIST List 16.3291
Tue Nov 15 2005
FYI: German Treebanks; Corpus of Written Italian
Editor for this issue: Svetlana Aksenova
<svetlana linguistlist.org>
To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.html.
Directory
1. Heike Zinsmeister, German Treebanks
2. Pier Marco Bertinetto, Corpus and Frequency Lexicon of Written Italian
Message 1: German Treebanks
Date: 14-Nov-2005
From: Heike Zinsmeister <heike.zinsmeister uni-tuebingen.de>
Subject: German Treebanks
The Division of Computational Linguistics at the Seminar fuer Sprachwissenschaft of the University of Tuebingen (Germany) is happy to announce the release of two German language resources:

* The Tuebingen Treebank of Spoken German (TueBa-D/S)
* The Tuebingen Treebank of Written German (TueBa-D/Z) - second release

Both treebanks have the same basic annotation scheme, which distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level. In addition to constituent structure, annotated trees contain edge labels between nodes which encode grammatical functions.

Both treebanks are available in 3 different formats:

* NEGRA export format
* XML format
* Penn Treebank format

The treebanks in detail:

1. The Tuebingen Treebank of Spoken German (TueBa-D/S)

The TueBa-D/S treebank was annotated in the project Verbmobil, a long-term Machine Translation project for spontaneous speech funded by the German Ministry for Education, Science, Research, and Technology (BMBF). This is the first public release of the treebank.

TueBa-D/S is a syntactically annotated corpus based on spontaneous dialogues, which were manually transliterated. The treebank comprises approximately 38 000 sentences (ca. 360 000 words). The syntactic annotation was also performed manually.

The license for TueBa-D/S is granted free of charge for scientific use.

For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebads.shtml

2. The Tuebingen Treebank of Written German (TueBa-D/Z) - second release

The TueBa-D/Z treebank is a manually annotated German newspaper corpus based on data taken from the daily issues of the newspaper 'die tageszeitung'. It currently comprises approximately 22 000 sentences (ca. 380 000 words).

The annotation scheme is an extended version of the TueBa-D/S annotation scheme. It accounts for a larger number of linguistic phenomena and is enriched at two levels: (multi-word) named entities are marked at the phrasal level; words are annotated with inflectional morphology at the lexical level (currently ca. 70% of the sentences are covered).

What is new in the second release:

- about 6 800 additional sentences
- morphological information
- cleaner versions of the trees published in the first release

The license for TueBa-D/Z is granted free of charge for scientific use.

For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml

With best regards,

Erhard W. Hinrichs
Sandra Kübler
Heike Zinsmeister

-------------------------------------------------------

For your information:

A related resource is the Tuebingen Partially Parsed Corpus of Written German (TuePP-D/Z), released 12/2003. TuePP-D/Z is a 200 million word collection of articles from the taz newspaper which have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes.

For more information, please refer to:
http://www.sfs.uni-tuebingen.de/en_tuepp.shtml

Linguistic Field(s): Computational Linguistics; Syntax; Text/Corpus Linguistics
Message 2: Corpus and Frequency Lexicon of Written Italian
Date: 14-Nov-2005
From: Pier Marco Bertinetto <Bertinetto sns.it>
Subject: Corpus and Frequency Lexicon of Written Italian
We are glad to announce a new lexical resource:

CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto)
[Corpus and Frequency Lexicon of Written Italian]

produced by Pier Marco Bertinetto°, Cristina Burani*, Alessandro Laudanna^*, Lucia Marconi+, Daniela Ratti+, Claudia Rolando+, Anna Maria Thornton§

° Scuola Normale Superiore, Pisa
* Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma
^ Università di Salerno
+ Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova
§ Università de L'Aquila

The reference corpus consists of excerpts from newspapers, magazines and books. It includes 3,150,075 lexical occurrences. The corpus was designed to approximate as closely as possible what Italians read on average, as reflected in official statistics.

The lexicon consists of two main components: the forms repertoire and the lemmas repertoire. In the latter, all identical forms belonging to different lemmas are disambiguated, while syntagmatic words (such as table's leg) are treated as single entries.

The lexical lists (both forms and lemmas) are presently available for free download at:
http://alphalinguistica.sns.it/BancheDati.htm
http://www.istc.cnr.it/material/database/colfis/

They are organized according to a number of possibilities: frequency rank, inverse alphabetical ordering, with or without capital / non-capital distinction, etc.

The entire corpus is not yet available. We hope to put it on-line as soon as we obtain the necessary authorizations.

The work has been produced with CNR (Consiglio Nazionale delle Ricerche) support. With the help of willing users, this product will hopefully be enriched with further facilities.

Linguistic Field(s): Text/Corpus Linguistics
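For readers unfamiliar with the list orderings mentioned above, the following is a minimal Python sketch of how a frequency-ranked list and an inverse alphabetical list can be derived from a token sample. It is an illustration only, not the procedure used to build CoLFIS. The toy tokens and the function names `frequency_rank` and `inverse_alphabetical` are hypothetical, and the sketch assumes that "inverse alphabetical ordering" means sorting forms by their reversed spelling (so that forms sharing an ending appear together), which the announcement does not spell out.

```python
from collections import Counter

# Hypothetical toy sample; these tokens are NOT taken from CoLFIS.
tokens = ["casa", "Casa", "cosa", "rosa", "casa", "cane", "pane", "cosa", "casa"]


def frequency_rank(tokens, fold_case=True):
    """Return (form, count) pairs, most frequent first.

    fold_case=True merges capitalised and non-capitalised forms;
    fold_case=False keeps them apart. Ties are broken alphabetically.
    """
    forms = [t.lower() for t in tokens] if fold_case else list(tokens)
    counts = Counter(forms)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def inverse_alphabetical(forms):
    """Sort distinct forms by reversed spelling (assumed meaning of 'inverse')."""
    return sorted(set(forms), key=lambda form: form[::-1])


ranked = frequency_rank(tokens)
by_ending = inverse_alphabetical(form for form, _ in ranked)

for form, count in ranked:
    print(f"{form}\t{count}")
print(by_ending)
```

With the capital / non-capital distinction switched on (`fold_case=False`), "Casa" and "casa" are counted as separate forms, which is one way to read the "with or without capital / non-capital distinction" option.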

While the LINGUIST List makes every effort to ensure the linguistic relevance of sites listed on its pages, it cannot vouch for their contents.