Review of Treebanks


Reviewer: Verginica Mititelu
Book Title: Treebanks
Book Author: Anne Abeillé
Publisher: Kluwer
Linguistic Field(s): Computational Linguistics
Syntax
Book Announcement: 15.1589

Date: Mon, 17 May 2004 10:35:40 +0300
From: Verginica Mititelu <vergi@racai.ro>
Subject: Treebanks: Building and Using Parsed Corpora

EDITOR: Abeillé, Anne
TITLE: Treebanks
SUBTITLE: Building and Using Parsed Corpora
SERIES: Text, Speech and Language Technology, volume 20
PUBLISHER: Kluwer Academic Publishers
YEAR: 2003

Verginica Barbu Mititelu, Institute for Artificial
Intelligence, Romanian Academy

The book is a collection of 21 papers on building and using
parsed corpora, most of them formerly presented at
workshops and conferences (ATALA, LINC, LREC, EACL).

The objective of the book, as stated in the Introduction,
is to present an overview of the work being done in the
field of treebanks, the results achieved so far and the
open questions. The addressees are linguists, including
computational linguists, psycholinguists, and
sociolinguists.

The book is organized in two parts: Building treebanks (15
chapters, pp. 1-277) and Using treebanks (6 chapters, pp.
279-389), each of them having subparts. It also contains a
preface (p. xi), an introduction (pp. xiii-xxvi), a list
of contributing authors and their affiliations (pp. 391-397),
and an index of topics (pp. 399-405).

The Introduction (signed by Anne Abeillé) mirrors the
structure of the whole book: it has two parts, entitled
Building treebanks and Using treebanks, respectively. After
drawing the terminological distinction between tagged
corpora and parsed corpora (or treebanks), the author
explains why treebanks are needed and gives a general
presentation of the topics covered by the papers in the
volume. She stresses that the problems encountered for each
language are, to a great extent, the same, hence a certain
redundancy among the papers collected in this volume.


PART I. BUILDING TREEBANKS

The chapters of the first part are grouped according to the
language or language families for which the approaches to
building treebanks are presented: the first four chapters
are dedicated to English treebanks, the next two to German
ones; there are two papers on Slavic treebanks, four on
Romance parsed corpora, and the last three chapters of the
first part address treebanks for other languages
(Mandarin Chinese, Japanese, Turkish).

ENGLISH TREEBANKS
Chapter 1. The Penn Treebank: an Overview. Ann Taylor,
Mitchell Marcus, and Beatrice Santorini.
The authors present the annotation schemes and the
methodology used during the 8-year Treebank project. The
part-of-speech (POS) tagset is based on that of the Brown
Corpus, but adjusted to serve the stochastic orientation of
the Penn Treebank and its concern with sparse data, and
reduced to eliminate lexical and syntactic redundancies.
More than one tag can be associated with a word, thus
avoiding arbitrary decisions. POS tags also contain
syntactic functions, thus serving as a basis for syntactic
bracketing. The bracketing style was modified during the
project: it started as a skeletal context-free bracketing
with limited empty categories and no indication of
non-contiguous structures and dependencies, and became a
style of annotation which aimed at clearly distinguishing
between arguments and adjuncts of a predicate, recovering
the structure of discontinuous constituents, and making use
of null elements and coindexing to deal with wh-movement,
passives, and subjects of infinitival constructions. The
first objective was not always easy to achieve via
structural differences, which is why a set of easily
identifiable roles was defined, although these sometimes
proved difficult to apply as well. The Penn Treebank (PTB)
project also produced dysfluency annotation of transcribed
conversations, labeling complete and incomplete utterances,
non-sentence elements (fillers, explicit editing terms,
discourse markers, coordinating conjunctions) and restarts
(with or without repair).

For all three annotation schemes a two-step methodology was
adopted: an automatic step (represented by the PARTS and
Brill taggers for POS tagging, the Fidditch deterministic
parser for syntactic bracketing, and a simple Perl script
identifying common non-sentential elements) followed by
hand correction.

Chapter 2. Thoughts on Two Decades of Drawing Trees.
Geoffrey Sampson.
The author develops the idea that the annotation of both
written and (transcribed) oral corpora exposes the
deficiencies of theoretical linguistics and may even
contradict some widely accepted conventional linguistic
wisdom. For instance, sentences of the form
subject-intransitive verb are rather infrequent in English
corpora, contrary to what can be found in some linguistics
textbooks.

Chapter 3. Bank of English and beyond. Timo Järvinen.
The aim of this paper is twofold. First, the author
describes the four modules (pre-processing, i.e.
segmentation and tokenization; POS assignment; POS
tagging; functional analysis) of the English Constraint
Grammar (ENGCG) system used for annotating the corpora from
which the second edition of the Collins COBUILD Dictionary
of English was compiled, together with the methodology
adopted to cope with the huge amount of data: manual
inspection was possible only for some random fragments of
the data, and automatic methods were created for monitoring
the data.

As clearly stated, the CG system was chosen for its
morphological accuracy. However, its syntactic ambiguity
was too high, which is why Järvinen argues for a Functional
Dependency Grammar (FDG) parser, which deals better with
long-distance dependencies, ellipsis and other complex
phenomena. He points out the need for deep parsing instead
of shallow parsing, his reasons being the lower ambiguity
and the practical orientation of the former.

Chapter 4. Completing Parsed Corpora. Sean Wallis.
A more challenging title for this paper could have been:
''Do we need linguists for constructing treebanks?'' To
answer this question, S. Wallis starts with a brief
overview of the phases of annotation employed on the
British Component of the International Corpus of English
(ICE-GB) and points out that the use of two parsers (i.e.,
TOSCA and the Survey parser) increased the number of
inconsistencies in the corpus, hence the necessity of
post-correction. He provides two arguments against
Sinclair (1992), who found human annotators a source of
errors in the treebank.

In order to ensure the cleanness of the parsed corpus, one
has two problems to solve: the decision problem (i.e. the
correctness of the analysis) and the consistency problem
(i.e. the consistency of the analysis throughout the
corpus). S. Wallis draws a distinction between longitudinal
correction (that is, working through a corpus sentence by
sentence until it is completed) and transverse correction
(i.e. working through a corpus construction by
construction), and argues in favor of the latter: it is
less time-consuming and gives control over both the
accuracy of the analysis and its consistency. The price
paid is difficulty in implementation and in managing the
process. But once the grammatical query facility (Fuzzy
Tree Fragments) is created, it can be used not only for
correction but also for searching and browsing the corpus
with linguistic queries, so the tool remains useful after
the project.

As clearly stated in the Critique section of Wallis's
paper, the question formulated above receives an
affirmative answer if the final aim of the corpus is not a
study of the parser performance, but of language variation.

GERMAN TREEBANKS
Chapter 5. Syntactic Annotation of a German Newspaper
Corpus. Thorsten Brants, Wojciech Skut, Hans Uszkoreit.
This paper presents the syntactic annotation of the NEGRA
newspaper corpus. Language-specific reasons (free word
order, among others), corpus structure (frequent
elliptical constructions) and the characteristics of the
formalism contributed to the choice of Dependency Grammar
for the annotation. However, it was modified so as to
take advantage of phrase-structure grammar, too: flat
structures, no empty categories, treatment of the head as a
grammatical function expressed by labeling rather than by
the syntactic structure, allowance of crossing branches
(which give rise to a large number of errors), a more
explicit annotation of grammatical functions, and encoding
of predicate-argument information.

A characteristic of this project is the interactive
annotation process, which makes use of the TnT statistical
tagger and second-order Markov models for POS tagging.
Syntactic structure is built incrementally, using cascaded
Markov models. A graphical user interface allows for manual
tree manipulation and runs taggers and parsers in the
background. Human annotators need to concentrate only on
the problematic cases, which are assigned different
probabilities by the statistical tagger and parser.
Accuracy is ensured by having two different annotators
annotate the same set of sentences. Differences are
discussed and, once agreement is reached, the annotation is
modified accordingly.

The design of the corpus and the annotation scheme make it
usable for different linguistic investigations and also for
training taggers and chunkers.

Chapter 6. Annotation of Error Types for German Newsgroup
Corpus. Markus Becker, Andrew Bredenkamp, Berthold
Crysmann, Judith Klein.
This paper reports on work supporting the development of
controlled-language and grammar-checking applications for
German. The corpus in the FLAG project consisted of email
messages (as they present the characteristics needed: high
error density, accessibility, electronic availability).
Their annotation had three phases: the development of a
typology of grammatical errors in the target language
(German), manual annotation on paper, and annotation by
means of computer tools.

The first phase relied on traditional grammar books and its
outcome was a type hierarchy of possible errors, which also
contains error domains (i.e. it tries to define the
relations between the affected words) useful in guiding the
detection of errors. Although the hierarchy was
fine-grained, in the annotation process only a pool of 16
error types was to be detected and classified.
After being manually annotated, the same set of sentences
was annotated in turn with two tools: Annotate and DiET.
Annotation with the former has a tree format: the
nodes are the error types, and the edges are descriptive
information on these types; the result is a rich
representation of the structure of errors in terms of
relations. However, this representation is built bottom-up,
the error type being added last. DiET offers a better
method for configuring an annotation schema, which is why
the annotation was performed with this latter tool.
Orthographic errors were by far the most frequent (83%),
followed, at a great distance, by grammatical ones (16%).

TREEBANKS FOR SLAVIC LANGUAGES
Chapter 7. The Prague Dependency Treebank. Alena Böhmová,
Jan Hajic, Eva Hajicová, Barbora Hladká.
For the annotation of the Czech newspaper corpus, a 3-level
structure was used. At the morphological level, the
automatic analyzer ideally produces, for each token in the
input data, the lemma and the associated MTag. Whenever
more than one lemma and/or MTag is produced, manual
disambiguation is needed. For the analytical (syntactic)
level of annotation a dependency structure was used, based
on a dependency/determination relation. Solutions were
found for problematic structures such as coordination,
ellipsis, ambiguity, and apposition. Two modes of
annotation were employed: at first annotation was manual;
then the Collins parser was trained on the manually
annotated data and used to generate the structure, while
syntactic functions continued to be assigned manually. The
separately produced morphological and analytical
annotations are then merged, all discrepancies being
resolved manually. The third level of annotation, the
tectogrammatical one, describes the meaning of the
sentences by means of tectogrammatical functions and the
information structure of sentences. Analytic trees are
transduced to tectogrammatical ones in two phases: an
automatic one (which makes the necessary changes to the
syntactic trees, such as merging auxiliary nodes with
verbs) and a manual one.

Chapter 8. An HPSG-Annotated Test Suite for Polish.
Malgorzata Marciniak, Agnieszka Mykowiecka, Adam
Przepiórkowski, Anna Kupsc.
The aim of the paper is to present the construction of a
test suite for Polish, consisting of written sentences,
both correct and incorrect, the latter being manually
annotated with correctness markers. Each of these two types
is further classified into three subgroups, according to
complexity. Moreover, each sentence is hand-annotated
with the list of linguistic phenomena it displays,
chosen from nine groups of hierarchies of such phenomena.
Sentences are annotated with attribute-value matrices
(AVMs), whose content is restricted by an HPSG signature.
The result is a database of sentences, the correct ones
augmented with their HPSG structures, and a database of
wordforms. The aim of the former database is to evaluate
computational grammars for Polish.

TREEBANKS FOR ROMANCE LANGUAGES
Chapter 9. Developing a Syntactic Annotation Scheme and
Tools for a Spanish Treebank. Antonio Moreno, Susana López,
Fernando Sánchez, Ralph Grishman.
The paper reports on building an annotated Spanish corpus
based on newspaper articles. Problems specific to Spanish
are presented: dealing with multiword constituents and with
amalgams or portmanteau words, with null subjects and
ellipsis, ''se''-constructions, etc. There are three levels
of annotation: syntactic categories, syntactic functions,
and morpho-syntactic features together with some semantic
features. The annotation and debugging tools are also
presented in the paper, along with some error statistics,
the current state of the Spanish treebank and future
developments.

Chapter 10. Building a Treebank for French. Anne Abeillé,
Lionel Clément, François Toussenel.
A newspaper corpus, representative of contemporary written
French, was subjected to automatic tagging (segmentation
with special attention to compounds, tagging based on a
trigram method, and retagging making use of contextual
information) and parsing (surface and shallow annotation,
theory-neutral, with the aim of identifying sentence
boundaries and limited embedding). Each annotation with
morphosyntax, lemmas (based on lexical rules), compounds
and sentence boundaries was followed by manual validation.
The resulting treebank was used for evaluating lemmatizers
and for training taggers.

Chapter 11. Building the Italian Syntactic-Semantic
Treebank. Simonetta Montemagni, Francesco Barsotti, Marco
Battista, Nicoletta Calzolari, Ornella Corazzari,
Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli,
Maria Massetani, Remo Raffaelli, Roberto Basili, Maria
Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia
Mana, Fabio Pianesi, Rodolfo Delmonte.
The paper presents the syntactic-semantic annotation of a
balanced corpus and of a specialized one. Four levels of
annotation were adopted: morpho-syntactic annotation (POS,
lemma, morpho-syntactic features); syntactic annotation,
made up of constituency annotation (identification of
phrase boundaries and labeling of constituents) and
functional annotation (with functional relations); and
lexico-semantic annotation (distinguishing among single
lexical items, semantically complex units and title sense
units; specification of senses for each word, relying on
ItalWordNet, along with other lexico-semantic information,
such as figurative usage, idiomatic expressions, etc.). The
first two types of annotation were performed
semi-automatically, while the other two were performed
manually. There are two innovations brought about by this
treebank: sense tagging (which resembles a semantic
annotation of the corpus) and two distinct layers of
syntactic annotation, the constituency and the functional
ones, motivated by language-specific phenomena (such as
free constituent order and the pro-drop property) and by
the intended further uses of the treebank, which is
compatible with different approaches to syntax.

In the second part of the article the annotation tool,
GesTALt, is presented: its component applications and its
architecture. Finally, the uses of the data obtained are
presented: improvement of a translation system, enrichment
of dictionaries, improvement at the level of analysis.

Chapter 12. Automated Creation of a Medieval Portuguese
Partial Treebank. Vitor Rocio, Mário Amado Alves, J.
Gabriel Lopes, Maria Francisca Xavier, Gracia Vicente.
The novelty of the approach presented in this paper lies
in the use of tools and resources developed for
Contemporary Portuguese for the annotation of a corpus of
Medieval Portuguese. The differences between these two
stages of the language are presented.

The neural-network based POS tagger was trained on a set of
words manually tagged for each of the texts in the Medieval
Portuguese corpus. It was then used to extract a dictionary
and to tag the rest of the texts. Manual correction
followed. For the lexical analysis, a morphocentric lexical
knowledge-base (LKB) was used. The lexical analyzer uses as
input the output from the POS tagger and applies to it the
knowledge in the LKB. Its output serves as input for the
syntactic analyzer.

The authors present the resources used and the adaptations
required to deal with the corpus. A similar method for
dealing with corpora of other Romance languages is
envisaged.

TREEBANKS FOR OTHER LANGUAGES
Chapter 13. Sinica Treebank. Keh-Jiann Chen, Chi-Ching Lou,
Ming-Chung Chang, Feng-Yi Chen, Chao-Jan Chen, Chu-Ren
Huang, Zhao-Ming Gao.
The paper reports on the construction of a treebank for
Mandarin Chinese, relying on the Sinica Corpus, which was
already annotated when work on the treebank started, so its
resources could be used for the latter. The authors give
reasons for their choice of the grammar formalism used
for the representation of lexico-grammatical information,
namely Information-based Case Grammar. They also present
the concepts they work with: the principles of inheritance,
the phrasal categories, etc.

The Sinica Treebank is not merely a syntactically annotated
corpus, but also a semantically annotated one, containing
thematic information. The automatic annotation process was
followed by manual checking, as in most cases. The
language-specific phenomena (for instance, constructions
with nominal predicates) are given a short presentation,
along with the solution adopted in the annotation process.
The treebank aims at being used as a reliable resource by
(theoretical) linguists, but not only by them, so tools for
extracting information from it were developed.

Chapter 14. Building a Japanese Parsed Corpus. Sadao
Kurohashi, Makoto Nagao.
The morphological and syntactic annotation of a Japanese
newspaper corpus is presented in this paper. It developed
in parallel with the improvement of the morphological
analyzer JUMAN and of the dependency structure analyzer KNP
(chosen in accordance with the characteristics of
Japanese). The dependency relation is defined on bunsetsu,
the traditional Japanese linguistic unit. The free word
order of Japanese raised a problem which remained unsolved:
the predicate-argument relation in embedded sentences.

Chapter 15. Building a Turkish Treebank. Kemal Oflazer,
Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür.
The aim in building the Turkish treebank is for it to be
representative and to contain all the information relevant
to its potential users.

There are two levels of annotation: morphological and
syntactic. Both take into consideration the
characteristics of Turkish, especially its rich
inflectional and derivational morphology. Thus, each word
is annotated for each of its morphemes, as this information
may be necessary for syntax. The free word order and the
discontinuities favor the use of the dependency
framework. For its typical problems (the pro-drop
phenomenon, verb ellipsis, etc.), the solutions adopted in
the annotation process are presented.


PART II. USING TREEBANKS

Chapter 16. Encoding Syntactic Annotation. Nancy Ide,
Laurent Romary.
The emergence of treebanks, along with the proliferation
of annotation schemes, triggered the need for a general
framework to accommodate these annotation schemes and the
different theoretical and practical approaches. The general
framework (built within XCES) presented in this paper is an
abstract model, theory and tagset independent, that can be
instantiated in different ways, according to the
annotator's approach and goal. This abstract model uses two
knowledge sources: Data Category Registry (an inventory of
data categories for syntactic annotation) and a meta-model
(a domain-dependent abstract structural framework for
syntactic annotation). Two other sources are used for the
project-specific formats of the annotation scheme: Data
Category Specification (DCS) (the description of the set of
data categories used within a certain annotation scheme)
and Dialect Specification (defining the project-specific
format for syntactic annotation). Combining the meta-model
with the DCS, a virtual annotation markup language (AML)
can be defined for comparing annotations, for merging them
or for designing tools for visualization, editing,
extraction, etc. A concrete AML results from the
combination of a virtual AML and Dialect Specification.
The abstract model ensures the coherence and consistency of
the annotation schemes.

Chapter 17. Parser Evaluation. John Carroll, Guido Minnen,
Ted Briscoe.
The emergence of syntactic parsers triggered the need for
methods of evaluating them. In fact, this has become a
branch of NLP research in its own right.
This paper presents a corpus annotation scheme that can be
used for the evaluation of syntactic parsers. The scheme
makes use of a grammatical relation hierarchy, containing
types of syntactic dependencies between heads and
dependents. Based on EAGLES lexicon/syntax standards
(Barnett et al. 1996), this hierarchy aims at being
language- and application-independent.

The authors present a 10,000-word corpus, semi-automatically
marked up. For the evaluation three measures are
calculated: precision (the number of bracketing matches
with respect to the total number of bracketings returned by
the parser), recall (the number of bracketing matches with
respect to the number of bracketings in the corpus) and
F-score (a measure combining the previous two:
(2 x precision x recall)/(precision + recall)).
This last measure can be used to illustrate the parser
accuracy. The evaluation of grammatical relations provides
information about levels of precision and recall for groups
of relations or for single relations. These figures are
useful for indicating the areas where more effort should be
concentrated for improvement.
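
The three measures, as defined above, can be sketched in a
few lines of Python. Representing a bracketing as a
(label, start, end) tuple, the function name and the toy
data are assumptions made here for illustration; none of
them is taken from the paper.

```python
def bracket_scores(parser_brackets, corpus_brackets):
    """Return (precision, recall, f_score) for two collections of bracketings.

    Each bracketing is a hypothetical (label, start, end) tuple.
    """
    parser_set = set(parser_brackets)
    corpus_set = set(corpus_brackets)
    matches = len(parser_set & corpus_set)

    # Matches relative to what the parser returned / what the corpus contains.
    precision = matches / len(parser_set) if parser_set else 0.0
    recall = matches / len(corpus_set) if corpus_set else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_score


# Toy data (invented): four bracketings in the corpus, five from the parser,
# three of which match.
corpus = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
parsed = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall, f_score = bracket_scores(parsed, corpus)
print(precision, recall, round(f_score, 3))
```

With three matches out of five returned bracketings and
four corpus bracketings, precision is 3/5 and recall is
3/4, so the F-score is 2/3.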

Chapter 18. Dependency-based Evaluation of MINIPAR. Dekang
Lin.
The author presents a dependency-based method for
evaluating parser performance.
To represent a dependency tree he makes use of a set of
tuples, one for each node in the tree, specifying the word,
its grammatical category, its head (if any, together with
the word's position with respect to this head) and its
relationship with the head (again, if any). To perform the
evaluation, dependency trees are generated both for the
parser-generated trees (called here answers) and for the
manually constructed trees (called keys), and they are
compared on a word-by-word basis. Importantly, a selective
evaluation is also possible: one can measure the parser
performance with respect to a certain type of dependency
relation or even to a certain word. Two scores are
calculated: recall and precision.
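
A minimal sketch of such a word-by-word comparison is given
below. The tuple layout (head given as a word index), the
relation names and the example sentence are assumptions
made for illustration; this is not MINIPAR's actual output
format. A word the parser leaves unattached simply has no
head in the answer.

```python
from typing import NamedTuple, Optional


class Dep(NamedTuple):
    word: str
    category: str
    head: Optional[int]      # index of the head word; None if there is none
    relation: Optional[str]  # relation to the head; None if there is none


def evaluate(answer, key, relation=None):
    """Compare answer and key word by word; return (precision, recall).

    If `relation` is given, only dependencies of that type are counted,
    which gives the selective evaluation described in the text.
    """
    if len(answer) != len(key):
        raise ValueError("answer and key must cover the same words")

    def counted(dep):
        return dep.head is not None and (relation is None or dep.relation == relation)

    in_answer = sum(1 for a in answer if counted(a))
    in_key = sum(1 for k in key if counted(k))
    correct = sum(
        1
        for a, k in zip(answer, key)
        if counted(a) and counted(k) and a.head == k.head and a.relation == k.relation
    )
    precision = correct / in_answer if in_answer else 0.0
    recall = correct / in_key if in_key else 0.0
    return precision, recall


# Invented example: "I saw a dog yesterday"
key = [
    Dep("I", "N", 1, "subj"),
    Dep("saw", "V", None, None),       # root
    Dep("a", "Det", 3, "det"),
    Dep("dog", "N", 1, "obj"),
    Dep("yesterday", "Adv", 1, "mod"),
]
# The hypothetical parser leaves "yesterday" unattached.
answer = key[:4] + [Dep("yesterday", "Adv", None, None)]

print(evaluate(answer, key))           # all relations
print(evaluate(answer, key, "subj"))   # subjects only
```

Here every dependency the parser proposes is correct, but
one of the four key dependencies is missing, so precision
and recall differ.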

The author goes on to present MINIPAR, a principle-based
broad-coverage English parser (Berwick et al. 1991). The
dependency-based method presented above is
used for evaluating this parser. One interesting outcome of
this evaluation is that the parser performs better on
longer sentences than on shorter ones. This may be the
outcome of having trained the parser on press reportage,
with long sentences, while the shorter sentences are found
in fiction, the genre against which the parser is tested.

GRAMMAR INDUCTION WITH TREEBANKS
Chapter 19. Extracting Stochastic Grammars from Treebanks.
Rens Bod.
The assumption (see Scha 1990, 1992, Bod 1992, 1995, 1998)
constituting the basis of this article is that ''human
language perception and production processes may very well
work with representations of concrete past language
experiences, and that language processing models could
emulate this behavior if they analyzed new input by
combining fragments of representations from annotated
corpus''. The idea, then, is to use an already annotated
corpus as a stochastic grammar. The idea is not new; the
aim of the article is to answer the question: to what
extent can constraints be imposed on the subtrees used
without decreasing the performance of the parser?

The results reported here were obtained using a
data-oriented parsing (DOP) model (presented in section 2
of the paper), which was applied to two corpora of phrase
structure trees: the Air Travel Information System (ATIS)
corpus and the Wall Street Journal (WSJ) part of the PTB.
The conclusion drawn from the experiments is that almost
all constraints decrease the performance of the model: the
most probable parse (which takes overlapping subtrees into
consideration) gives better results than the most probable
derivation (which does not); the larger the subtrees, the
better the predictions (as larger subtrees capture more
dependencies than small ones); the larger the lexical
context (up to a certain depth, which seems to be
corpus-specific), the better the accuracy (as more lexical
dependencies are taken into account); low-frequency
subtrees make an important contribution to parse accuracy
(as they tend to be larger, and thus to contain more
lexical/structural context useful for further parsing); the
use of subtrees with non-headwords has a good impact on
the performance of the model (as they contain syntactic
relations for those non-headwords, which cannot be found in
other subtrees).
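
The difference between the most probable derivation and the
most probable parse can be illustrated with a small sketch.
In DOP the same parse tree can be built by several
derivations (different combinations of subtrees), and the
probability of a parse is the sum of the probabilities of
its derivations. The tree names and the probabilities below
are invented for illustration only.

```python
from collections import defaultdict


def most_probable_derivation(derivations):
    """Return the parse produced by the single best derivation."""
    parse, _ = max(derivations, key=lambda item: item[1])
    return parse


def most_probable_parse(derivations):
    """Return the parse whose derivation probabilities sum to the highest total."""
    totals = defaultdict(float)
    for parse, probability in derivations:
        totals[parse] += probability
    return max(totals, key=totals.get)


# Each entry is (resulting parse, probability of one derivation); toy numbers.
derivations = [
    ("tree_A", 0.30),  # one derivation, built from large subtrees
    ("tree_B", 0.20),  # three different derivations of the same tree
    ("tree_B", 0.15),
    ("tree_B", 0.10),
]

print(most_probable_derivation(derivations))
print(most_probable_parse(derivations))
```

In this toy case the single best derivation yields tree_A,
while tree_B wins once all of its derivations are summed.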

Chapter 20. A Uniform Method for Automatically Extracting
Stochastic Lexicalized Tree Grammars from Treebanks and
HPSG. Günter Neumann.
As the title states, the paper presents a uniform method
for the automatic extraction of stochastic lexicalized tree
grammars (SLTGs) from treebanks (allowing corpus-based
analysis of grammars) and from HPSG (allowing extraction of
domain-independent and phenomena-oriented subgrammars),
with the future aim of merging the two SLTGs to improve the
coverage of treebank grammars on unseen data and to ease
the adaptation of treebanks to new domains.

The major operation in the extraction of an SLTG is
recursive top-down tree decomposition according to the head
principle, so each extracted tree is automatically
lexically anchored. The path from the lexical anchor to the
root of the tree is called a head-chain. Two additional
operations are involved: each subtree of the head-chain is
copied and the copy is processed individually by the
decomposition operation, thus allowing a phrase to occur
both in head and in non-head positions; and for each SLTG
tree having a modifier phrase attached, a new tree is
created with the modifier unattached, so that the extracted
grammar can recognize sentences with fewer modifiers than
the seen ones, or with none. The result is an SLTG, which
is processed by a two-phase stochastic parser.
The rest of the paper describes the extraction of SLTGs
from the PTB and from the NEGRA treebank, on the one hand,
and from a set of parse trees obtained with an English
HPSG, on the other, and some experimental results on the
use of an extracted SLTG.
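
The main decomposition step can be sketched as follows. The
sketch assumes that the head child of every node is already
marked (the `head` index below), writes substitution sites
as `LABEL*`, and covers neither of the two additional
operations; the class, the function names and the example
sentence are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    head: Optional[int] = None   # index of the head child; None for a leaf
    word: Optional[str] = None   # filled in for pre-terminal leaves only


def head_chain(tree):
    """Follow head children from the root down to the lexical anchor."""
    chain, node = [tree.label], tree
    while node.children:
        node = node.children[node.head]
        chain.append(node.label)
    return chain, node.word


def decompose(tree):
    """Cut every non-head child off the head chain and recurse on the pieces.

    Returns bracketed strings, each containing exactly one word (its anchor).
    """
    extracted = []

    def render(node):
        if not node.children:
            return f"({node.label} {node.word})"
        parts = []
        for index, child in enumerate(node.children):
            if index == node.head:
                parts.append(render(child))          # stay on the head chain
            else:
                parts.append(f"{child.label}*")      # leave a substitution site
                extracted.extend(decompose(child))   # decompose the cut-off phrase
        return f"({node.label} {' '.join(parts)})"

    return [render(tree)] + extracted


def leaf(tag, word):
    return Node(tag, word=word)


# Invented example: "the dog sees a cat"
sentence = Node("S", [
    Node("NP", [leaf("D", "the"), leaf("N", "dog")], head=1),
    Node("VP", [leaf("V", "sees"),
                Node("NP", [leaf("D", "a"), leaf("N", "cat")], head=1)], head=0),
], head=1)

print(head_chain(sentence))
for elementary_tree in decompose(sentence):
    print(elementary_tree)
```

Each extracted tree contains exactly one word, which is
what makes the resulting grammar lexicalized.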

Chapter 21. From Treebank Resources to LFG F-Structures.
Anette Frank, Louisa Sadler, Josef van Genabith, Andy Way.
This paper presents two methods for automatic f-structure
annotation. The first consists in extracting a Context-Free
Grammar (CFG) from a treebank, following Charniak (1996).
A set of regular-expression-based annotation principles is
then developed and applied to the CFG, resulting in an
annotated CFG. The annotated rules are rematched against
the treebank trees, the result being
f(unctional)-structures. The second method uses flat tree
descriptions. Annotation principles define projection
constraints which associate partial c(onstituent)-structures
with their corresponding partial f-structures.
When these principles are applied to a flat set-based
encoding of treebank trees, they induce the f-structure.
The two methods are characterized by robustness, due to the
following facts: principles are partial and underspecified
and match unseen configurations; partial annotations are
generated instead of failure; the constraint solver copes
with conflicting information.
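
The first step of the first method, reading a CFG off the
treebank trees, can be sketched as below. The sketch counts
the phrase-structure rules found in bracketed trees and
estimates rule probabilities by relative frequency; it does
not attempt the f-structure annotation itself. The
two-sentence treebank, the bracketing format and the
function names are assumptions made for illustration.

```python
from collections import Counter


def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP ...) (VP ...))" into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read():
        nonlocal position
        assert tokens[position] == "("
        label = tokens[position + 1]
        position += 2
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read())
            else:
                children.append(tokens[position])  # a word
                position += 1
        position += 1  # skip the closing bracket
        return (label, children)

    return read()


def extract_rules(tree, rules):
    """Count the rules LHS -> RHS in one tree (lexical rules are skipped)."""
    label, children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if subtrees:
        rules[(label, tuple(child[0] for child in subtrees))] += 1
        for child in subtrees:
            extract_rules(child, rules)
    return rules


def rule_probabilities(rules):
    """Relative-frequency estimate: count(rule) / count(all rules with the same LHS)."""
    lhs_totals = Counter()
    for (lhs, _), count in rules.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in rules.items()}


# A two-sentence toy treebank (invented).
treebank = [
    "(S (NP (D the) (N dog)) (VP (V barks)))",
    "(S (NP (D a) (N cat)) (VP (V sees) (NP (D the) (N dog))))",
]

rules = Counter()
for bracketed in treebank:
    extract_rules(parse_tree(bracketed), rules)
probabilities = rule_probabilities(rules)

for (lhs, rhs), count in sorted(rules.items()):
    print(lhs, "->", " ".join(rhs), count, probabilities[(lhs, rhs)])
```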


DISCUSSION

Although this was not the objective of the book, its first
part can be used as a textbook by those venturing to
construct a treebank. As the papers here focus on different
types of languages, displaying different grammatical
phenomena and different ways of dealing with them, they can
serve as a repository of solutions to various problems
encountered when trying to design a corpus, to establish an
annotation scheme to be used for a treebank, or to develop
annotation tools. The style in which the papers are written
is helpful in this respect: they are clear and accessible,
and the information is introduced gradually.
The second part of the book addresses a narrower audience
than the first, owing to the technical details involved in
the presentation of different applications in computational
linguistics: lexicon induction (Järvinen), grammar
induction (Frank et al., Bod), parser evaluation
(Carroll et al.), checker evaluation (Becker et al.).


REFERENCES

Barnett, R., N. Calzolari, S. Flores, P. Hellwig, P.
Kahrel, G. Leech, M. Melera, S. Montemagni, J. Odijk, V.
Pirrelli, A. Sanfilippo, S. Teufel, M. Villegas, L. Zaysser
(1996) EAGLES Recommendations on Subcategorisation. Report
of the EAGLES Working Group on Computational Lexicons,
ftp://ftp.ilc.pi.cnr.it/pub/eagles/lexicons/synlex.ps.gz.

Berwick, R.C., S.P. Abney, C. Tenny (Eds.) (1991)
Principle-Based Parsing: Computation and Psycholinguistics.
Kluwer Academic Publishers.

Bod, R. (1992) Data Oriented Parsing (DOP), Proceedings
COLING '92, Nantes, France.

Bod, R. (1995) Enriching Linguistics with Statistics:
Performance Models of Natural Language, ILLC Dissertation
Series 1995-14, University of Amsterdam.

Bod, R. (1998) Spoken Dialogue Interpretation with the DOP
Model, Proceedings COLING-ACL'98, Montreal, Canada.

Charniak, E. (1996) Tree-bank Grammars. AAAI-96.
Proceedings of the Thirteenth National Conference on
Artificial Intelligence, pp. 1031-1036. MIT Press.

Scha, R. (1990) Taaltheorie en Taaltechnologie; Competence
en Performance, in Q.A.M. de Kort and G.L.J. Leerdam
(Eds.), Computertoepassingen in de Neerlandistiek, Almere:
Landelijke Vereniging van Neerlandici (LVVN-jaarboek).

Scha, R. (1992) Virtuele Grammatica's en Creatieve
Algoritmen, Gramma/TTT 1(1).

Sinclair, J. (1992) The automatic analysis of corpora. In
J. Svartvik (Ed.) Directions in Corpus Linguistics.
Proceedings of Nobel Symposium 82. Berlin: Mouton de
Gruyter, pp. 379-397.
 
ABOUT THE REVIEWER:


Verginica Barbu Mititelu is a researcher at the Romanian
Institute for Artificial Intelligence and a PhD candidate
at the University of Bucharest. She has been involved in
the development of a treebank for Romanian for a very short
period of time.

