LINGUIST List 3.587

Fri 17 Jul 1992

FYI: Meeting on Grammar Evaluation

Editor for this issue: <>


  1. Meeting on Grammar Evaluation

Message 1: Meeting on Grammar Evaluation

Date: Fri, 17 Jul 92 13:07:01 EDT
From: <>
Subject: Meeting on Grammar Evaluation

 Meeting of Interest Group on Evaluation of Broad-Coverage
 Grammars of English: UPenn Sept. 20-21

 A meeting of the ongoing Interest Group on Evaluation of
Broad-Coverage Grammars of English, hosted by Professors Mitch
Marcus and Mark Liberman, will take place at the University of
Pennsylvania on September 20-21. Invited to attend, in addition
to the original nine group members, are researchers who (a) have
developed working, broad-coverage parser/grammars of English,
(b) are willing to bring to the meeting and to have distributed
among its attendees their system's output parses for 50 real-world
sentences to be randomly selected from large corpora by M. Marcus;
(c) agree to transform their parser output by program into the
canonical form necessary for input into the group's evaluation
program, written in LISP, to run this program on their parses,
and bring the results to the meeting.
 The purpose of the meeting is to continue refining the evaluation
metric worked out jointly at our last University of Pennsylvania
meeting, and more generally to learn from each other's successes and
pratfalls in handling these sentences previously unseen by our parsers.
We expect that at this meeting there will also be a preliminary
discussion of proposals for the evaluation of predicate/argument
analyses. Anyone who is interested specifically in this aspect of
the meeting should mention that in their reply.
 Individuals interested in attending this meeting must contact
Ezra Black (Black at Watson.IBM.Com) by July 31. All invitees will
be sent detailed instructions for transforming parse output into
the format necessary for input to our evaluation program, PARSEVAL.
 Below is an account of our approach to parser/grammar evaluation,
from the Proceedings of the Workshop on Grammar Evaluation at ACL 1991:

 Evaluating Syntax Performance of Parser/Grammars of English

 Philip Harrison, Boeing Computer Services
 Steven Abney, Bellcore
 Ezra Black, IBM
 Dan Flickinger, Hewlett Packard
 Claudia Gdaniec, Logos, Inc.
 Ralph Grishman, NYU
 Donald Hindle, AT&T
 Robert Ingria, BBN
 Mitch Marcus, U. of Pennsylvania
 Beatrice Santorini, U. of Pennsylvania
 Tomek Strzalkowski, NYU

We report on an ongoing collaborative effort to
develop criteria, methods, measures and procedures for evaluating the
syntax performance of different broad-coverage parser/grammars of English.
The project was motivated by the apparent difficulty of comparing different
grammars because of divergences in the way they handle various syntactic
phenomena. The availability of a means for useful comparison would
allow hand-bracketed corpora, such as the University of Pennsylvania
Treebank, to serve as a source of data for evaluation of many grammars.
The project has progressed to the point where the first version of
an automated syntax evaluation program has been
completed and is available for testing. The methodology continues
to undergo refinement as more data is examined.

The project began with a comparison of hand syntactic analyses of 50 Brown
Corpus sentences by grammarians from nine organizations:
Steve Abney (Bellcore), Ezra Black (IBM), Dan Flickinger
(Hewlett Packard), Claudia Gdaniec (Logos), Ralph Grishman and Tomek
Strzalkowski (NYU), Philip Harrison (Boeing), Donald Hindle (AT&T), Robert
Ingria (BBN), and Mitch Marcus and Beatrice Santorini (U. of Pennsylvania).
The purpose of the bracketing exercise was to provide a focus for the
discussion of syntactic differences and a source of data to test proposals for
evaluation techniques. The participating grammarians produced labelled
bracketings representing what they ideally want their grammars to
specify. After the exercise was completed, a small workshop was held
at the University of Pennsylvania to discuss the results and examine
proposals for evaluation methodologies.

The results of the hand-bracketing exercise revealed that very little
structure was common to all the parses. For example, an
analysis revealed that the following three Brown Corpus sentences (taken
from what we call the "consensus" parses) display only the indicated phrases
in common across all of the bracketings:

 The famed Yankee Clipper, now retired, has been assisting (as (a batting coach)).

 One of those capital-gains ventures, in fact, has saddled him (with (Gore Court)).

 He said this constituted a (very serious) misuse (of the (Criminal court) processes).

A rather more encouraging result was obtained when phrases were selected
which appeared bracketed in a majority of the analyses (the "majority" parses):

 ((((The famed (Yankee Clipper)) , (now retired) ,)
 (has been (assisting (as (a batting coach))))) .)

 (((One (of (those capital-gains ventures))) , (in fact) ,
 (has (saddled him (with (Gore Court))))) .)

 ((He (said (this (constituted (a (very serious) misuse
 (of (the (Criminal court) processes))))))) .)

The lack of structure for the consensus parses is a reflection of the
diversity of approaches to such phenomena as punctuation, the employment
of null nodes by the grammar, and the attachment of auxiliaries, negation,
pre-infinitival `to', adverbs, and other types of constituents.
But the results for the majority parses indicated that a good foundation
of agreement exists among the several grammars.

The challenge was to find an evaluation method that would not penalize
even those analyses that diverged from the majority in ways that would be
considered generally acceptable. The proposed solution, explored
in depth by hand analysis at the workshop, involves 1) the systematic
elimination of certain problematical constructions from the parse tree
(resulting in trees that show a much higher degree of structural agreement)
and 2) systematic restructuring of constituents to a minor degree for
particular constructions if the grammar being evaluated
differs from the evaluation standard for these constructions.
The evaluation program itself carries out the elimination of
constituents for both the standard parse and the parse being tested
(hereafter the test or candidate parse, provided by the client grammarian for
evaluation). The client is responsible for restructuring the
special constructions in the test parse. These restructurings will be
discussed after the evaluation procedure itself.

The proposed evaluation procedure has been implemented and is still undergoing
analysis and modification, but generally, it has these characteristics:
it judges a parse based only on the constituent boundaries it stipulates
(and not the categories or features that may be assigned to these constituents);
it compares the parse to a hand-parse of the same sentence from the
University of Pennsylvania Treebank (the standard parse);
and it yields two principal measures for each parse submitted: Crossing
Parentheses and Recall.

The procedure has three steps. For each parse to be evaluated:
 (1) erase all word-external punctuation and null categories from
 both the standard tree and the test tree; use the standard tree
 to identify and erase from both trees all instances of: auxiliaries,
 "not", pre-infinitival "to", and possessive endings ('s and ').
 (2) recursively eliminate from both trees all parenthesis pairs
 enclosing either a single constituent or word, or nothing at all;
 (3) using the nodes that remain, compute goodness scores
 (Crossing Parentheses, and Recall) for the input parse,
 by comparing its nodes to a similarly-reduced node set
 for the standard parse.
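The group's evaluation program is written in LISP; as an illustrative re-rendering only, step (2) can be sketched in Python, assuming trees are nested lists whose first element is the category label and whose remaining elements are words (strings) or subtrees:

```python
# Sketch of step (2): recursively erase parenthesis pairs that
# enclose a single constituent, a single word, or nothing at all.
# The nested-list representation is an assumption of this sketch.

def prune(tree):
    if isinstance(tree, str):
        return tree                        # a bare word: nothing to do
    kids = [k for k in (prune(c) for c in tree[1:]) if k is not None]
    if not kids:                           # ( X )          -> erased
        return None
    if len(kids) == 1:
        only = kids[0]
        if isinstance(only, str):          # ( X word )     -> word
            return only
        return [tree[0]] + only[1:]        # ( X ( Y a b )) -> ( X a b )
    return [tree[0]] + kids
```

Applied to a step-one output such as the "Miss Xydis" example below, this reduction collapses unary chains like (ADJP (ADJ best)) to the bare word and deletes emptied nodes like (VPAST ).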

For example, for the Brown Corpus sentence:

 Miss Xydis was best when she did not need to be too probing.

consider the candidate parse:

 (S (NP-s (PNP (PNP Miss) (PNP Xydis)))
 (VP (VPAST was) (ADJP (ADJ best)))
 (S (COMP (WHADVP (WHADV when)))
 (NP-s (PRO she))
 (VP (X (VPAST did) (NEG not) (V need))
 (VP (X (X to) (V be)) (ADJP (ADV too) (ADJ probing)))))
 (? (FIN .)))

After step-one erasures, this becomes:

 (S (NP-s (PNP (PNP Miss) (PNP Xydis)))
 (VP (VPAST was) (ADJP (ADJ best)))
 (S (COMP (WHADVP (WHADV when)))
 (NP-s (PRO she))
 (VP (X (VPAST ) (NEG ) (V need))
 (VP (X (X ) (V be)) (ADJP (ADV too) (ADJ probing)))))
 (? (FIN )))

And after step-two erasures:

 (S (NP-s Miss Xydis) (VP was best)
 (S when she (VP need (V be (ADJP too probing)))))

The University of Pennsylvania Treebank output for this sentence,
after steps one and two have been applied to it, is:
 (S (S (NP Miss Xydis) (VP was best))
 (SBAR when (S she (VP need (VP be (ADJP too probing))))))

Step three consists of comparing the candidate parse to the Treebank
parse and deriving two scores: (1) The Crossing Parentheses score is the
number of times the candidate parse has a structure such as ((A B) C)
and the standard parse has one or more structures such as (A (B C))
which "cross" with the test parse structure.
(2) The Recall score is the number of parenthesis pairs in the
intersection of the candidate and treebank parses (T intersection C)
divided by the number of parenthesis pairs in the treebank parse T, viz.
(T intersection C) / T. This score provides an additional measure of
the degree of fit between the standard and the candidate parses; in
theory a Recall of 1 certifies a candidate parse as including all
constituent boundaries that are considered essential to the analysis of
the input sentence by the Treebank. (Treebank parses are in general
underspecified because certain structures, such as compound nouns, are
not bracketed.) For the above example sentence, there are no crossings
and the recall is 7/9.
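Since category labels are ignored, step (3) reduces to span arithmetic. A minimal Python sketch (again assuming nested-list trees; the actual scorer is the group's LISP program):

```python
# Sketch of step (3): reduce each tree to a set of (start, end) word
# spans, then count candidate brackets that cross a standard bracket
# and compute recall = |T intersection C| / |T|.

def spans(tree, start=0):
    """Return (set of (start, end) spans, width of tree in words)."""
    found, i = set(), start
    for child in tree[1:]:
        if isinstance(child, str):
            i += 1                         # a word occupies one position
        else:
            sub, w = spans(child, i)
            found |= sub
            i += w
    found.add((start, i))
    return found, i - start

def score(candidate, standard):
    cand, _ = spans(candidate)
    gold, _ = spans(standard)
    crossing = sum(1 for (a, b) in cand
                   if any(a < c < b < d or c < a < d < b
                          for (c, d) in gold))
    recall = len(cand & gold) / len(gold)
    return crossing, recall
```

On the reduced "Miss Xydis" trees above, this yields zero crossings and a recall of 7/9, matching the figures in the text.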

The last element of the proposed evaluation method involves the
restructuring of trees by the client, which is necessary only if
the parse submitted treats any of certain constructions in
a manner different from the standard. At the workshop, three
constructions were identified: extraposition, modification of
noun phrases by post-head phrases
such as PP, and sequences of prepositions which occur constituent-initially
and/or particles which occur constituent-finally. Briefly,
for extraposition sentences like "It is necessary for us to leave"
the extraposed phrase "for us to leave" should be attached at the S
level and not, for example, as a sister of "necessary". For NP
modification, post-head modifiers should be attached to the NP and
not at the N-BAR level. Finally, for sequences of prepositions/particles
we attach to the top node of the constituent. Thus if the initial
client analysis is

 (We (were ((out of) (oatmeal cookies))))

then the restructured analysis should be

 (We (were (out of (oatmeal cookies)))).

These three constructions were identified from a hand analysis of a
limited amount of data and we are currently examining more data
to see whether the list should be extended.

Generally, there are two strategies that can be followed in cases
where a client's analysis differs systematically from the standard:
modify the evaluation program so that it deletes certain nodes, or specify
a procedure that can be adopted by clients to bring their
trees into conformity with the standard. However, we have seen that
there are instances where reconciliation is very difficult or impossible
and are working to assess the expected frequency of such cases.

Before the evaluation software was available, we applied the
method by hand, using the UPenn Treebank as a standard, to 14 of the
above-mentioned 50 Brown Corpus sentences which were given their "ideal"
analyses by the grammarians. (Canonical modifications as specified above were
required.) The sentences were selected
because they had been successfully run by one of our automated systems (NYU's)
and were expected to give some hint of the method's reliability for
sentences that are easy for automated systems.
The Crossing score was zero in every case and the corresponding Recall
average score was 94%. We were encouraged by this initial result to
pursue the development of software to carry out the scoring.

After the evaluation program became available, we ran it on the entire
50 sentence corpus and obtained the following results:

             crossings   recall
 AT&T          3  (1)     .88
 BBN           4  (1)     .86
 Bellcore     10  (5)     .87
 Boeing        4  (1)     .97
 HP            4  (0)     .97
 IBM           4  (2)     .96
 Logos         3  (0)     .86
 NYU          10 (10)     .79

The first number in the crossings column is the total number of sentences
that contained a crossing while the second number in parentheses is the
number of sentences with crossings that remain after certain policy
changes are implemented in the standard parse and the node deletion protocol
of the evaluation procedure.

There are several points to be made about the above data:
First, we feel that the number of crossings initially obtained is
unacceptably high and that changes in the standard bracketing procedures or
changes in the deletion protocols need to be adopted. Second, the
number of crossings obtained after a few suggested changes are implemented
(the number in parentheses) is an acceptable level of crossings for a 50
sentence corpus for all but two of the grammars. However, until more data
are examined, we will not know whether this level of crossings can be
maintained with a fixed evaluation method. We are still in a "training phase"
as far as the bracketing and deletion policies go and the actual level
that will be attained may turn out to be less than is acceptable. The
policy changes themselves are still being debated by the group. Finally,
we note that two of the grammars (Bellcore's and NYU's) differ significantly
from the others with respect to crossings. The Bellcore grammar is based on
a new grammar methodology called "chunking" which results in non-standard
phrasal groupings in some instances, while the NYU grammar is significantly
different in that it does not use any category corresponding to verb phrase,
which results in non-standard attachments. It is unclear at this time
whether convenient transformations can be found to allow these grammars
to be compared to the standard so as to reduce their crossings scores.

There are four proposed changes to the evaluation method and the standard
that are being debated at this time by our group. If the four policies below
are adopted, then the crossing scores obtained are the ones in parentheses
in the above table. The four policies are:

1) Delete left-recursive subnodes of type S from the standard.
The Treebank uses recursive attachment at the S level for adverbial
attachment in sentences like

 Miss Xydis was best when she did not need to be too probing

which results in a structure of the form (S (S (A ..)(B...))(C ...)).
Several of us preferred to attach the rightmost constituent
(the `when' phrase) at a lower level. With a structure of the
form (S (A ...)(B ...)(C ...)) all of the crossings are eliminated
from our data. This policy can be implemented by the evaluation program.
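Policy 1 amounts to a tree transformation applied to the standard before scoring; a minimal sketch under the same nested-list assumption used above (not the group's implementation):

```python
# Sketch of policy (1): splice a left-recursive S daughter into its
# parent S, turning (S (S (A ..) (B ..)) (C ..)) into
# (S (A ..) (B ..) (C ..)).  Nested-list trees are an assumption.

def splice_left_recursive_s(tree):
    if isinstance(tree, str):
        return tree
    kids = [splice_left_recursive_s(c) for c in tree[1:]]
    if (tree[0] == 'S' and kids
            and isinstance(kids[0], list) and kids[0][0] == 'S'):
        kids = kids[0][1:] + kids[1:]      # promote the S daughter's children
    return [tree[0]] + kids
```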

2) Flatten structures in the standard containing the collocations "less than,
more than, greater than," etc. when they precede a number or adjective.
Some of us take these collocations as constituents (under a certain
reading of the sentence) while others always build phrases with "than"
and phrases to its right before combining with "less, more," etc.
The lack of agreement among practitioners can be accommodated only if
the standard is neutral. So the phrase "more than 400,000,000 inhabitants"
would need to be bracketed something like
 (NP (ADVP more than 400,000,000) inhabitants).
The same requirement would also be imposed for phrases such as
"more than likely".

3) Flatten certain common sequences involving preposition, noun, preposition
such as "in light of" and "in violation of". Here again, there is a
diversity of practice in our group as to whether the preposition, noun,
preposition sequence is treated as a
multi-word preposition or has NP and PP structures
built between the words, as exemplified by:

 (PP in (NP light (PP of (NP his success))))

A neutral bracketing of this phrase is

 (PP in light of (NP his success))

4) Delete copular "be" when it precedes an adjective. A phrase such as
"is happy to leave" would receive both of the following bracketings in our
data: ((is happy)(to leave)) and (is (happy (to leave))). The deletion
policy will eliminate any crossings for this type of phrase.

Even with these additional policies, there is still a residual set of
eight sentences with crossings for some of the grammars (excluding, for the
sake of brevity, some sentences for which the NYU grammar has crossings).
We present here the eight sentences along with
a discussion of the differences in analysis that led to the crossings:

 1. The petition listed the mayor's occupation as attorney and
 his age as 71.

The standard analyzes this by coordinating "listed ... as attorney" with
"his age as 71". (The second coordinate is taken to be a verb phrase with a
ellipted verb.) One of us prefers an analysis in which "the mayor's
occupation as attorney" and "his age as 71" are treated as the coordinated
constituents, creating a phrase crossing with the first coordinate phrase
of the standard.

 2. His political career goes back to his election to city council in 1923.

The standard analysis makes a constituent out of "back to ... 1923" while
one of our analyses postulates "goes back" as a constituent.

 3. All Dallas members voted with Roberts, except Rep. Bill Jones, who was absent.

The standard attaches non-restrictive relative clauses to NP. In this
case "who was absent" is attached to "Rep. Bill Jones". Two of
us attach non-restrictive relative clauses at the sentential level.

 4. The odds favor a special session, more than likely early in the year.

The standard attaches "more than likely early in the year" to the NP
associated with "session" while some of us attach it higher.

 5. The year will probably start out with segregation still the most
 troublesome issue.

One of our grammars attaches the adverb "probably" at a low level
to the verb "start" while the standard associates it with the S and
specifies a verb phrase from "start" to the end of the sentence, which
produces a crossing.

 6. The dinner is sponsored by organized labor and is scheduled for 7 p.m.

The standard coordinates "is sponsored by organized labor" and
"is scheduled for 7 p.m." while another analysis coordinates
"the dinner is sponsored by organized labor" with
"is scheduled for 7 p.m."

 7. He is willing to sell it just to get it off his hands.

There is significant disagreement in our group over how to attach the phrase
"just to get it off his hands". The standard attaches it under the root S,
while others attach it variously to phrases beginning with "is willing,
willing", and "sell". (A recursive attachment to "is willing
to sell it" would not produce a crossing with the standard.)

 8. Mr. Reama, far from really being retired, is engaged in industrial
 relations counseling.

The standard takes "far" as an adverb that subcategorizes a PP, while
one of our grammars treats "far from" as a multi-word lexical item.

In conclusion, we believe that the degree of disagreement that remains after
the application of our deletion and restructuring method does not pose a
significant barrier to the use of hand bracketed corpora for evaluation
purposes for most of our grammars. However, the amount of data that we
have been able to examine so far is limited and our judgements about the
success of the method are still tentative. We will continue with our
hand analyses, but also start to use the evaluation program
with the real output of our parsers in a realistic test of
the complete evaluation methodology. We invite other groups to
participate and will make our evaluation software (which runs in
Common Lisp) available.