FYI: New Release of the TüBa-D/Z German Treebank


Author: Kathrin Beck

Linguistic Field(s): Computational Linguistics; Discourse Analysis; Morphology; Syntax; Text/Corpus Linguistics

FYI Body: The Department of Linguistics of the University of Tuebingen (Germany) is happy to announce the new release of a referentially and syntactically annotated German corpus:

* The Tuebingen Treebank of Written German (TüBa-D/Z) - 8th release

The TueBa-D/Z treebank is a manually annotated German newspaper corpus based on data taken from daily issues of the newspaper 'die tageszeitung'. It currently comprises approximately 75,000 sentences (ca. 1,300,000 words).

The syntactic annotation scheme of the TueBa-D/Z distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level.

The treebank (about 3,200 newspaper articles) has been enriched with anaphoric and coreference relations referring to nominal and pronominal antecedents. Linking relations include coreferential (two NPs refer to the same extralinguistic referent), anaphoric/cataphoric (a definite pronoun refers to a contextual antecedent), and other relations (split-antecedent, instance), as well as the marking of expletive pronouns.

For selected discourse connectives, the instances occurring in the treebank have been annotated with the discourse relation(s) conveyed by the connective instance. Portions of the treebank have been sense-annotated for the connectives 'nachdem' (298 instances), 'während' (531 instances), 'sobald' (28 instances), 'seitdem' (13 instances), 'als' (169 instances), 'aber' (161 instances), and 'bevor' (119 instances).
Another annotation layer contains structural information as well as implicit discourse relations for a subcorpus of 41 annotated newspaper articles (21,817 tokens) with 1,458 (explicit and implicit) discourse relations.

The annotation comprises information on
* inflectional morphology
* lemmas
* syntactic constituency
* grammatical functions
* (complex) named entities incl. semantic classification
* anaphora and coreference relations
* dependency relations (automatically created)
* chunk annotation (automatically created)

The treebank is available in five formats:
* NEGRA export format
* TigerXML format
* exportXML format
* Penn Treebank format
* CoNLL format
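
For readers who plan to work with the CoNLL version, the following is a minimal Python sketch of how such a file might be read. It assumes only the general CoNLL convention of one token per line, tab-separated columns, and a blank line between sentences. The column layout of the TueBa-D/Z files is not described in this announcement and should be checked against the corpus documentation. The function name and the three-column sample are invented for illustration and are not taken from the treebank.

```python
from typing import Iterable, Iterator, List


def read_conll_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of token rows.

    Each token row is the list of tab-separated column values on one line.
    No meaning is assigned to the columns here; that depends on the corpus.
    """
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    # Emit the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample (hypothetical columns: ID, form, lemma), not corpus data.
sample = (
    "1\tDas\tdas\n"
    "2\tHaus\tHaus\n"
    "\n"
    "1\tEs\tes\n"
    "2\tregnet\tregnen\n"
    "3\t.\t.\n"
)

sentences = list(read_conll_sentences(sample.splitlines()))
for tokens in sentences:
    print(" ".join(row[1] for row in tokens))
```

With a real file, the same function could be given an open file handle instead of the sample lines.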

The license for TueBa-D/Z is granted free of charge for scientific use.
For more information, please refer to:

http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tuebadz.html

With best regards,

Erhard W. Hinrichs
Kathrin Beck
Heike Telljohann
Yannick Versley

---

Kathrin Beck

Project Coordinator D-SPIN & CLARIN-D
Dept. of Computational Linguistics
University of Tübingen
Wilhelmstr. 19/ 2.31
72074 Tübingen
Germany

Tel.: +49-7071-29-73970
Fax: +49-7071-29-5214
E-Mail: kbeck@sfs.uni-tuebingen.de, kathrin.beck@uni-tuebingen.de
