Summary Details

Query:   Online Survey on Thematic Roles
Author:  Brian Murphy
Linguistic LingField(s):   Psycholinguistics
Cognitive Science

Summary:   Regarding query:

This is the summary posting for a survey on the usage of thematic roles
among linguists. The objective was to evaluate the reliability of
judgements of thematic role made by linguist annotators - both
individually, and between judges.

Respondents were presented with 81 English sentences, each having between 2
and 5 dependent phrases. For each of the total 219 dependents, respondents
were asked to choose one of 12 possible roles, or opt for
''other/unsure''. The original survey can be viewed at [link not preserved]. Since the objective was to study
to what extent linguists share an understanding of thematic role
categories, no role definitions were given.

For 60 of the 219 argument items there was full consensus. For example, all
twelve votes on ''you'' in ''you decide to buy a car'' were Agent, and all
thirteen votes for ''today'' in ''I booked the pitch for Gryffindor today''
were Time.

The remaining 159 items varied in how controversial they were. Those that
garnered the least consensus were ''your years with us here at Rydell'' in
''your years with us here at Rydell have prepared you for the challenges
you face'', which got the following votes: Time (4), Agent (4), Instrument
(3), other/unsure (2), Percept (2), Theme (2), Reason (1); and ''my
fingers'' in ''Every time my fingers touch brain'', which got the votes
Percept (3), Theme (3), Patient (3), other/unsure (1), Place (1), and
Experiencer (1).

Several weeks after initial submissions, we suggested that respondents
repeat part or all of the survey to test their individual reliability.

We would like to take the opportunity to again warmly thank all those who
took part in this very lengthy survey - so MANY thanks to Alessio Frenda,
Chris Koops, Corinna Anderson, Florencia Franceschina, Harry Feldman,
Jean-Charles Khalifa, Lesley Stirling, Lu Bingfu, Luis González, Magda,
Marina Gorlach, Mark Donohue, Rene Dirven, Suzanne Kemmer, Stella
Markantonatou, Steven Schaufele, Suzette Haden Elgin, and four others who
preferred to remain anonymous.

An overview of the results follows below. We have tried to keep this short,
but are sure to have missed some interesting aspects of the results, so we
encourage people to read the full results for each item (listed on a
separate page due to length):

The first section below describes the response rate. That is followed by a
section giving a qualitative overview of the annotations, listed by role
label. The third section approaches the results quantitatively, supported by
statistical tests. The last section addresses issues of experimental design
raised by respondents.

We look forward to hearing your comments,

Brian Murphy and Carl Vogel,
Computational Linguistics Group,
Trinity College Dublin.


Overall, 21 people took part, 4 of those anonymously. Of those who
specified, 8 were native English speakers and 10 were non-natives. Their
dialects were: English of the US (9), Australia (2), England (2), Ireland
(2) and 'other' (2). On average each respondent made 142 dependent
annotations, out of a total of 219; 11 of the respondents completed the
survey in full.

The extent to which individual respondents agreed with the majority view
(or plurality view, if the most voted role got less than 50%) varied, from
54% to 87% (mean 77%, median 77%). Natives agreed with the consensus (78% -
Standard Deviation 6%) more often than non-natives (73% - SD 9%), but the
difference was not statistically significant (p=0.161)#1.

In the follow-on intra-subject experiment, the six repeat-respondents
agreed with their original annotations 77% of the time, ranging from 63% to
88%. Native repeat-responders (77%, SD 12%) agreed only very slightly more
often than non-natives (76%, SD 18%) (p=0.971)#2.

It is interesting to note that there was no difference between
intra-subject agreement and inter-subject agreement (both averaged 77%).
This suggests that the variation in responses is primarily random (which
can be dealt with by increasing the number of respondents), rather than
being due to more fundamental and stable differences of opinion among
linguists.

#1 2-tailed independent samples t-test, assuming equal variance (Levene's
test of unequal variance p=0.106). Assuming unequal variance, p would be 0.142.
#2 2-tailed independent samples t-test, assuming equal variance (Levene's
test of unequal variance p=0.845). Assuming unequal variance, p would be 0.971.
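
As a rough illustration of the tests named in footnotes #1 and #2, the sketch below computes the t statistic and degrees of freedom for a two-sample comparison from summary figures, in both the equal-variance (pooled) and unequal-variance (Welch) forms. The inputs are the rounded means and SDs quoted above, and the group sizes of 8 and 10 are an assumption taken from the respondent counts, so the result can only approximate the reported p-values. The function name is illustrative.

```python
from math import sqrt


def t_from_summary(m1, s1, n1, m2, s2, n2, equal_var=True):
    """Return (t, df) for a two-sample t-test given means, SDs and sizes."""
    v1, v2 = s1 ** 2, s2 ** 2
    if equal_var:
        # Student's t: pool the two variances, df = n1 + n2 - 2.
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
        se = sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # Welch's t: separate variances, Welch-Satterthwaite df.
        se = sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return (m1 - m2) / se, df


# Rounded figures from the text; the group sizes 8 and 10 are assumed.
t_pooled, df_pooled = t_from_summary(78, 6, 8, 73, 9, 10)
t_welch, df_welch = t_from_summary(78, 6, 8, 73, 9, 10, equal_var=False)
print(round(t_pooled, 2), df_pooled)
print(round(t_welch, 2), round(df_welch, 1))
```

A two-tailed p-value would then be read from the t distribution with the given degrees of freedom, for example using a statistics package.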

Description of Responses, by Role

In qualitative terms, here is an overview of annotations and comments made
for each role. In this section we use the majority view of the correct role
label for the purposes of categorisation (or, in cases where no role
received more than 50% of votes, the plurality view).

- Agent dependents were very stable (mean 94%). Agents of cognitive
actions with little or no effect on objects were sometimes annotated as
Experiencer (e.g. ''play'', ''find'', ''seek''), and agents which involved
possession got some votes of Recipient (''get'', ''take'').
- Similarly, Experiencers that might imply active involvement or
volition gathered quite a few votes of Agent (e.g. the verbs ''know'',
''hate'', ''want'', ''see'').
- Annotators seemed reluctant to give inanimate referents (particularly
clausal constituents) the role of Agent, often opting for Instrument, or
suggesting an additional category of Cause.
- Some annotators wanted to distinguish between active and passive
perception (i.e. listen vs hear and look vs see).
- One respondent suggested Counter-Agent for ''computer'' in ''play
against the ... computer''.

- The Patient/Theme distinction seemed to be the largest source of
disagreement and uncertainty among judges, and accounted for a majority of
the tied result dependents. The highest scoring Patient had majority
agreement of 80%; the highest scoring Theme item 75%. Prototypically
affected participants tended to be labelled Patient (e.g. ''man'' in
''Dinosaurs eat man''), with a cline of decreasing affectedness towards
prototypical Themes (e.g. ''take orders''). In addition, judges seemed more
comfortable with sentient concretely affected patients (i.e. Patients who
are also Experiencers). Inanimate, abstract or event participants were more
often annotated as Theme.
- For some judges Theme was a ''residual'' category to use for items that
did not seem to fit any other.
- Percept was used very little, although there were (in our opinion)
plenty of candidates to which this role might apply. Judges preferred Theme
in these cases (e.g. ''analyze their attack'', ''saying that'', ''saw a
cockroach'', ''find something''), or suggested the additional roles of Goal
or Source.
- Several judges suggested an additional Product role, for entities that
are created during an event (e.g. ''make statues'').
- There was sometimes disagreement on the concrete or abstract nature of
affectedness of animate participants, leading to votes of Patient or
Experiencer respectively (consider ''you'' in ''your years [here] have
prepared you for ...'' and ''I'm warning you'').

- Many variations on Reason were suggested, including Cause (a directly
precipitating event), Condition (a potentially limiting event), Purpose
(motivation for Agent) and Result (precipitated event).
- Source and Goal were suggested for Reasons that precede or follow an
event respectively.
- One respondent suggested different levels of reasons - a ''meta reason''
(''in order to ...'') and ''specific reason'' (''to get ...'') in ''In
order to prepare this speech I rang a few people to get a general picture
of how Gareth was regarded by those who met him''

- Place and Time were highly reliable. However many judges noted that they
would prefer Source and Goal variants of Place, and (less often) Time.
- Source and Goal were also suggested for non-spatial and
non-temporal cases such as communication (e.g. the subject of ''say'').
- Range or Measure roles were suggested for distances, lengths of time or
amounts of money.

- Beneficiaries were often Recipients also, and so there was some
disagreement (e.g. ''compensate you'', ''bring your son back to you'',
''fetch slippers for you'').
- It was not always clear whether an end-point which can be construed either
animately or inanimately should be Place or Recipient (e.g. ''doctor'' in
''bring my daughter to a witch doctor''; or ''office'' in ''sent a wire to
the main office'').
- Goal was often suggested.

Degree of Agreement

For each dependent annotated we examined what the 'consensus' role was
(that is, the plurality choice) among the 21 people who took part. Of the
219 dependents, 4 were judged ''other/unsure'' by a plurality of judges,
and 14 resulted in ties (9 of which were Patient/Theme ties). The degree of
agreement on a single role ranged from 100% to 22%. The average consensus
agreement was 74% (median 75%) - i.e. for an average dependent,
three-quarters of linguist respondents agreed on a single role as the
correct annotation. Subject dependents were on average more reliable (mean
85% agreement, median 95%) than other dependents (mean 68%, median 64%).

An alternative way to evaluate agreement is by proportion of pairwise
agreement. For example, given three judges, A, B and C, there are three
possible agreements: A with B, B with C and C with A. If two of them agree
on a single annotation, and the third disagrees, pairwise agreement would
be 33% (one agreement/three possible agreements), while majority agreement
would be 67% (two same/three judgements). Pairwise agreement ranged from
11% to 100%, and averaged 63% (median 59%). Again, subjects (mean 78%,
median 90%) saw considerably more agreement than other dependent types
(mean 55%, median 49%).
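
The two measures described above can be sketched as follows. This is a minimal illustration rather than the script used for the survey; the function names are illustrative, and the three votes are the hypothetical judges A, B and C from the example.

```python
from collections import Counter
from math import comb


def majority_agreement(labels):
    """Share of judgements that fall on the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def pairwise_agreement(labels):
    """Share of judge pairs that chose the same label (needs 2+ labels)."""
    counts = Counter(labels)
    agreeing_pairs = sum(comb(c, 2) for c in counts.values())
    return agreeing_pairs / comb(len(labels), 2)


# Three judges, two of whom agree (the example in the text).
votes = ["Agent", "Agent", "Experiencer"]
print(round(majority_agreement(votes), 2))  # 0.67
print(round(pairwise_agreement(votes), 2))  # 0.33
```

Because the number of pairs grows with the square of the number of judges, pairwise agreement falls faster than majority agreement as votes spread across more roles.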

The degree of agreement varied dramatically by consensus role. Here each
role is listed, together with the number of items for which it was the
majority choice (n), the mean (and median) majority agreement, followed by
the mean (and median) pairwise agreement.

Role           n    Majority, mean (median)   Pairwise, mean (median)
Agent          56   94% (100%)                90% (100%)
Beneficiary    10   84% (90%)                 74% (80%)
Experiencer    13   69% (75%)                 55% (59%)
Instrument     4    48% (48%)                 28% (24%)
Manner         4    67% (67%)                 50% (52%)
other/unsure   4    46% (46%)                 32% (30%)
Patient        25   62% (58%)                 46% (47%)
Percept        3    49% (47%)                 33% (32%)
Place          16   86% (91%)                 77% (82%)
Reason         14   80% (83%)                 70% (69%)
Recipient      6    59% (50%)                 41% (29%)
Theme          32   54% (55%)                 37% (38%)
tied           14   38% (38%)                 28% (29%)
Time           18   95% (100%)                91% (100%)

Both the majority and pairwise measures can be adjusted for the degree of
agreement that might be expected by chance. Taking the distribution of roles
found across all responses to estimate their distribution in the language
at large (which is a conservative assumption), we can normalise agreement
measures to a scale where 1 signifies full agreement, and 0 signifies only
the degree of agreement predicted by chance. By this normalisation the
average majority agreement is 0.64 (median 0.66). The same measure
calculated by grammatical function was 0.44 (median 0.80, n=77) for
subjects, and 0.59 (n=142, median 0.54) for other types of dependents.
Using the pairwise measure, the normalised overall agreement was 0.57 (the
Kappa statistic, p<10^-12)#3.

#3 See Jean Carletta, 1996: ''Assessing agreement on classification tasks:
the kappa statistic'', Computational Linguistics 22(2):249-254; and Siegel
& Castellan 1988: ''Nonparametric Statistics for the Behavioral
Sciences'', pp.284-290.
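
The chance correction for the pairwise measure can be sketched as below. This follows the general kappa form, (observed - expected) / (1 - expected), with expected agreement estimated from the pooled distribution of all responses; it is an assumption that this matches the exact computation used here, and the normalisation of the majority measure is not shown because its formula is not given. The data and function name are hypothetical.

```python
from collections import Counter
from math import comb


def chance_corrected_pairwise(items):
    """Kappa-style agreement; items is a list of per-item label lists."""
    # Observed agreement: mean pairwise agreement over items
    # (each item needs at least two labels).
    observed = sum(
        sum(comb(c, 2) for c in Counter(labels).values()) / comb(len(labels), 2)
        for labels in items
    ) / len(items)
    # Expected agreement: chance that two labels drawn from the pooled
    # distribution of all responses coincide.
    pooled = Counter(label for labels in items for label in labels)
    total = sum(pooled.values())
    expected = sum((c / total) ** 2 for c in pooled.values())
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: four items, three judges each.
items = [
    ["Agent", "Agent", "Agent"],
    ["Theme", "Theme", "Patient"],
    ["Time", "Time", "Time"],
    ["Patient", "Theme", "Patient"],
]
print(round(chance_corrected_pairwise(items), 2))  # 0.56
```

A value of 1 means full agreement and 0 means only the agreement predicted by chance, as described above.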

Comments on Experimental Design

Several people commented that the role categories were not defined, and so
they could not be sure how to apply them. Our intention in the experiment
was to investigate linguists' existing conception of roles, since they are
often appealed to in the literature without any explanation. Generally,
respondents seem to have applied the roles without much confusion, with the
exception of Theme/Patient (the boundary seems not well defined) and
Percept (which was used much less than expected - we consider that this
*may* have been due to a terminological choice, and that Stimulus might
have been more widely chosen).

In particular many felt that Goal and Source were missing. The reasons for
their omission were two-fold. On the one hand, it was not our objective in
this experiment to presuppose a particular categorisation of the spatial
and temporal domains (as the range of prepositions available suggests, this
is a complex area - consider ''under'', ''behind'', ''towards'', ''at'',
''in'', ''on'', ''away'', ''from'', ''back'', etc. A two or even three-way
distinction (Goal/Path/Source) is unlikely to be adequate). Secondly, we
feel that systems using Source and Goal often miss essential distinctions,
as they are often and variously applied outside the spatial domain by
metaphorical extension. In the last section we saw that Goal or Source were
variously suggested for dependents that received majority votes of Agent,
Place, Time, Reason, and Recipient.

Some respondents commented that the questionnaire was too long. We agree,
but surprisingly a fair proportion finished it, and there was only a
minimal order effect. One might expect respondents to become less careful
as they proceed through the exercise, and so for agreement to decrease, but
only a very slight correlation was found between order and either the
majority or pairwise agreement measures (majority: Pearson's r=-.080, 0.6%
of variability, p=.24; pairwise: Pearson's r=-.107, 1.1% of variability,
p=.115). For comparison, correlation between the two agreement measures was
high (r=.982, 96.5% of variability, p<.001).
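
The relation between Pearson's r and the ''% of variability'' figures above is simply the square of r. A minimal sketch follows; the function names are illustrative, and the short position and agreement lists are hypothetical, not survey data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def variance_explained(r):
    """Percentage of variability accounted for by a correlation r."""
    return 100 * r ** 2


# Reported order-effect correlations from the text.
print(round(variance_explained(-0.080), 1))  # 0.6
print(round(variance_explained(-0.107), 1))  # 1.1

# Hypothetical data: item position in the survey vs. agreement on the item.
position = [1, 2, 3, 4, 5]
agreement = [0.90, 0.70, 0.80, 0.60, 0.75]
print(pearson_r(position, agreement))
```

An order effect would show up as a clearly negative r between position and agreement; the reported values are close to zero.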

One respondent questioned why we used popular film scripts. We wanted to
use everyday language, and judged scripts to give a good approximation of
genuine speech, with readily accessible contexts (since understanding of
the context was expected to play a large part in how roles were
interpreted). In addition we used arbitrary sentences from the web, as
returned by Google.

Another question was on what semantic ''range'' (for want of a better word)
was to be considered. Immediate interpretation sometimes encouraged a
different role than a wider interpretation, after drawing in more
real-world knowledge. This was an issue, precisely because we used heavily
contextualised materials.
- For example, when we read ''they chose him a new form'', ''him'' might be
viewed as a Beneficiary. However in this context (Ghostbusters, a
supernatural one) it seems that the pragmatically derived meaning is ''they
changed him into something else'', in which case ''he'' is more likely a
- Similarly in ''bring Captain Solo ... to me'', ''me'' might be seen as
an endpoint, and so a Place. However in the context of Captain Solo being a
prisoner, ''me'' might be a Recipient.
- Can we consider ''us'' in ''they didn't design a survival suit for us''
a Beneficiary, since the proposition is negated? No benefit was received.
- ''What'' and ''you'' in ''What brought you to Casablanca'' might be a
Cause and Patient, if viewed literally, but are more likely to be Reason and
Agent in the pragmatic interpretation of ''Why did you come to Casablanca''.
- ''you will bring X to me'': while ''you'' would ordinarily be an Agent,
one respondent thought it might be an Instrument, since that person was
being compelled (i.e. was not volitionally involved).
- ''you hit the wall'' (in context of extreme sports): ''you'' might be
seen as an Agent (if you think they chose to hit the wall), Experiencer or
Patient (if you think they got hurt hitting the wall), or Theme (if you
think they couldn't avoid hitting the wall, but were little affected by it).

Some respondents disagreed with our analysis of sentences and dependents.
They believed that certain types of dependent phrase do not bear a thematic
role: for some, adverbials presented problems; for others, clausal phrases;
and for one, prepositional objects (since they are directly governed by the
preposition, not the verb). However, our approach is that since these are
dependent phrases, they have a (semantic) relationship to the verb which
deserves to be described.

A more serious problem was disagreement on the boundaries of dependents.
Many of the respondents believed that a sentence like ''this lets us see
...'' has three dependents (''this'', ''us'' and ''see ...''), rather than
the two of our analysis (''this'' and ''us see ...''). We had not
considered this analysis, but consider both to be plausible. In the example
above, it is possible to see ''us'' as an Experiencer or perhaps Patient of
the ''let'' event. However, if we consider ''let chaos reign'', it is hard
to see what relation ''chaos'' has to the verb.

Some objected to the analysis of ''why'' as a dependent of ''take'' in
''why ... students take the SAT ...''. We can see how ''why'' can be viewed
as an operator head, with ''... students take the SAT ...'' as its
dependent. However, we also think that it can be validly viewed as a
fronted pronominal version of a reason adjunct (e.g. ''students take the
SAT because XYZ'', where ''because XYZ'' has an equivalent function to

Finally, people also objected to the analysis of ''Because'' as a dependent
of ''choose'' in ''Because your father chose me''. They are right - this is
a mistake. ''your father chose me'' is in fact a dependent of ''Because''.

LL Issue: 17.1209
Date Posted: 21-Apr-2006

