LINGUIST List 17.1209
Fri Apr 21 2006
Sum: Online Survey on Thematic Roles
Editor for this issue: James Rider
<rider linguistlist.org>
Message 1: Online Survey on Thematic Roles
Date: 20-Apr-2006
From: Brian Murphy <Brian.Murphy cs.tcd.ie>
Subject: Online Survey on Thematic Roles
Regarding query: http://www.linguistlist.org/issues/17/17-17.html#1

This is the summary posting for a survey on the usage of thematic roles among linguists. The objective was to evaluate the reliability of thematic role judgements made by linguist annotators - both individually, and between judges. Respondents were presented with 81 English sentences, each having between 2 and 5 dependent phrases. For each of the total 219 dependents, respondents were asked to choose one of 12 possible roles, or opt for ''other/unsure''. The original survey can be viewed at http://www.cs.tcd.ie/Brian.Murphy/survey/. Since the objective was to study to what extent linguists share an understanding of thematic role categories, no role definitions were given.

For 60 of the 219 argument items there was full consensus. For example, all twelve votes on ''you'' in ''you decide to buy a car'' were Agent, and all thirteen votes for ''today'' in ''I booked the pitch for Gryffindor today'' were Time. The remaining 159 items varied in the degree to which they were controversial. The two items with the least consensus were:

- ''your years with us here at Rydell'' in ''your years with us here at Rydell have prepared you for the challenges you face'', which got the following votes: Time (4), Agent (4), Instrument (3), other/unsure (2), Percept (2), Theme (2), Reason (1).
- ''my fingers'' in ''Every time my fingers touch brain'', which got the votes Percept (3), Theme (3), Patient (3), other/unsure (1), Place (1), and Experiencer (1).

Several weeks after initial submissions, we suggested that respondents repeat part or all of the survey to test their individual reliability.
We would like to take the opportunity to again warmly thank all those who took part in this very lengthy survey - so MANY thanks to Alessio Frenda, Chris Koops, Corinna Anderson, Florencia Franceschina, Harry Feldman, Jean-Charles Khalifa, Lesley Stirling, Lu Bingfu, Luis González, Magda, Marina Gorlach, Mark Donohue, Rene Dirven, Suzanne Kemmer, Stella Markantonatou, Steven Schaufele, Suzette Haden Elgin, and four others who preferred to remain anonymous.

An overview of the results follows below. We have tried to keep this short, but are sure to have missed some interesting aspects of the results, so we encourage people to read the full results for each item (listed on a separate page due to length): http://www.cs.tcd.ie/Brian.Murphy/survey/overallReport.html.

The first section below describes the response rate. That is followed by a section giving a qualitative overview of the annotations, listed by role label. The third section approaches the results quantitatively, supported by statistical tests. The last section addresses issues of experimental design raised by respondents.

We look forward to hearing your comments,

Brian Murphy and Carl Vogel,
Computational Linguistics Group,
Trinity College Dublin.

Responses
=========

Overall, 21 people took part, 4 of those anonymously. Of those who specified, 8 were native English speakers and 10 were non-natives. Their dialects were: English of the US (9), Australia (2), England (2), Ireland (2) and 'other' (2). On average each respondent made 142 dependent annotations, out of a total of 219; 11 of the respondents completed the survey in full.

The extent to which individual respondents agreed with the majority view (or plurality view, if the most voted role got less than 50%) varied from 54% to 87% (mean 77%, median 77%). Natives agreed with the consensus (78%, Standard Deviation 6%) more often than non-natives (73%, SD 9%), but the difference was not statistically significant (p=0.161)#1.
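The per-respondent figures above can be illustrated with a minimal sketch. It assumes annotations are held in a nested dictionary and that items with a tied top vote are skipped when scoring; the posting does not say how ties were treated for this measure, and the names `plurality_label` and `agreement_with_consensus` are hypothetical, not taken from the survey software.

```python
from collections import Counter


def plurality_label(votes):
    """Return the most-voted label, or None when the top count is tied."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: tied items have no consensus label
    return ranked[0][0]


def agreement_with_consensus(annotations):
    """Share of each respondent's answers that match the plurality label.

    annotations maps respondent -> {item: label}; respondents may have
    answered different subsets of the items.
    """
    votes_by_item = {}
    for answers in annotations.values():
        for item, label in answers.items():
            votes_by_item.setdefault(item, []).append(label)
    consensus = {item: plurality_label(v) for item, v in votes_by_item.items()}

    result = {}
    for respondent, answers in annotations.items():
        scored = [item for item in answers if consensus[item] is not None]
        hits = sum(answers[item] == consensus[item] for item in scored)
        result[respondent] = hits / len(scored) if scored else float("nan")
    return result


# Toy data, invented for illustration only.
toy = {
    "r1": {"s1": "Agent", "s2": "Theme"},
    "r2": {"s1": "Agent", "s2": "Patient"},
    "r3": {"s1": "Agent", "s2": "Theme"},
}
print(agreement_with_consensus(toy))
```

Because respondents answered different numbers of items, the sketch scores each person only on the items they actually annotated.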
In the follow-on intra-subject experiment, the six repeat-respondents agreed with their original annotations 77% of the time, ranging from 63% to 88%. Native repeat-responders (77%, SD 12%) agreed only very slightly more often than non-natives (76%, SD 18%) (p=0.971)#2.

It is interesting to note that there was no difference between intra-subject agreement and inter-subject agreement. This suggests that the variation in responses is primarily random (which can be dealt with by increasing the number of respondents), rather than being due to more fundamental and stable differences of opinion among linguists.

#1 2-tailed independent samples t-test, assuming equal variance (Levene's test of unequal variance p=0.106). Assuming unequal variance, p would be 0.142.
#2 2-tailed independent samples t-test, assuming equal variance (Levene's test of unequal variance p=0.845). Assuming unequal variance, p would be 0.971.

Description of Responses, by Role
=================================

In qualitative terms, here is an overview of annotations and comments made for each role. In this section we use the majority view of the correct role label for the purposes of categorisation (or, in cases where no role received more than 50% of votes, the plurality view).

Agents/Experiencer/Instrument

- Agent dependents were very stable (mean 94%). Agents of cognitive actions with little or no effect on objects were sometimes annotated as Experiencer (e.g. ''play'', ''find'', ''seek''), and agents which involved possession got some votes of Recipient (''get'', ''take'').
- Similarly, Experiencers that might involve active involvement or volition gathered quite a few votes of Agent (e.g. the verbs ''know'', ''hate'', ''want'', ''see'').
- Annotators seemed reluctant to give inanimate referents (particularly clausal constituents) the role of Agent, often opting for Instrument, or suggesting an additional category of Cause.
- Some annotators wanted to distinguish between active and passive perception (i.e. listen vs hear and look vs see).
- One respondent suggested Counter-Agent for ''computer'' in ''play against the ... computer''.

Patient/Theme/Percept

- The Patient/Theme distinction seemed to be the largest source of disagreement and uncertainty among judges, and accounted for a majority of the tied result dependents. The highest scoring Patient had majority agreement of 80%; the highest scoring Theme item 75%. Prototypically affected participants tended to be labelled Patient (e.g. ''man'' in ''Dinosaurs eat man''), with a cline of decreasing affectedness towards prototypical Themes (e.g. ''take orders''). In addition, judges seemed more comfortable with sentient concretely affected patients (i.e. Patients who are also Experiencers). Inanimate, abstract or event participants were more often annotated as Theme.
- For some judges Theme was a ''residual'' category to use for items that did not seem to fit any other.
- Percept was used very little, although there were (in our opinion) plenty of candidates to which this role might apply. Judges preferred Theme in these cases (e.g. ''analyze their attack'', ''saying that'', ''saw a cockroach'', ''find something''), or suggested the additional roles of Goal or Source.
- Several judges suggested an additional Product role, for entities that are created during an event (e.g. ''make statues'').
- There was sometimes disagreement on the concrete or abstract nature of affectedness of animate participants, leading to votes of Patient or Experiencer respectively (consider ''you'' in ''your years [here] have prepared you for ...'' and ''I'm warning you'').

Reason

- Many variations on reason were suggested, including Cause (a directly precipitating event); Condition (a potentially limiting event); Purpose (motivation for Agent); Result (precipitated event).
- Source and Goal were suggested for Reasons that precede or follow an event respectively.
- One respondent suggested different levels of reasons - a ''meta reason'' (''in order to ...'') and ''specific reason'' (''to get ...'') in ''In order to prepare this speech I rang a few people to get a general picture of how Gareth was regarded by those who met him''.

Place/Time

- Place and Time were highly reliable. However many judges noted that they would prefer Source and Goal variants of Place, and (less often) Time.
- Source and Goal were also suggested for non-spatial and non-temporal cases such as communication (e.g. the subject of ''say'').
- Range or Measure roles were suggested for distances, lengths of time or amounts of money.

Beneficiary/Recipient

- Beneficiaries were often Recipients also, and so there was some disagreement (e.g. ''compensate you'', ''bring your son back to you'', ''fetch slippers for you'').
- It was not always clear whether an end-point which can be construed either animately or inanimately should be Place or Recipient (e.g. ''doctor'' in ''bring my daughter to a witch doctor''; or ''office'' in ''sent a wire to the main office'').
- Goal was often suggested.

Degree of Agreement
===================

For each dependent annotated we examined what the 'consensus' role was (that is, the plurality choice) among the 21 people who took part. Of the 219 dependents, 4 were judged ''other/unsure'' by a plurality of judges, and 14 resulted in ties (9 of which were Patient/Theme ties). The degree of agreement on a single role ranged from 100% to 22%. The average consensus agreement was 74% (median 75%) - i.e. for an average dependent, three-quarters of linguist respondents agreed on a single role as the correct annotation. Subject dependents were on average more reliable (mean 85% agreement, median 95%) than other dependents (mean 68%, median 64%).

An alternative way to evaluate agreement is by proportion of pairwise agreement.
For example, given three judges, A, B and C, there are three possible agreements: A with B, B with C and C with A. If two of them agree on a single annotation, and the third disagrees, pairwise agreement would be 33% (one agreement/three possible agreements), while majority agreement would be 67% (two same/three judgements). Pairwise agreement ranged from 11% to 100%, and averaged 63% (median 59%). Again, subjects (mean 78%, median 90%) saw considerably more agreement than other dependent types (mean 55%, median 49%).

The degree of agreement varied dramatically by consensus role. Here each role is listed, together with the number of items for which it was the majority choice (n), the mean (and median) majority agreement, followed by the mean (and median) pairwise agreement.

Role           n    Majority      Pairwise
Agent          56   94% (100%)    90% (100%)
Beneficiary    10   84% (90%)     74% (80%)
Experiencer    13   69% (75%)     55% (59%)
Instrument      4   48% (48%)     28% (24%)
Manner          4   67% (67%)     50% (52%)
other/unsure    4   46% (46%)     32% (30%)
Patient        25   62% (58%)     46% (47%)
Percept         3   49% (47%)     33% (32%)
Place          16   86% (91%)     77% (82%)
Reason         14   80% (83%)     70% (69%)
Recipient       6   59% (50%)     41% (29%)
Theme          32   54% (55%)     37% (38%)
tied           14   38% (38%)     28% (29%)
Time           18   95% (100%)    91% (100%)

Both the majority and pairwise measures can be adjusted for what degree of agreement might be expected by chance. Taking the distribution of roles found across all responses to estimate their distribution in the language at large (which is a conservative assumption), we can normalise agreement measures to a scale where 1 signifies full agreement, and 0 signifies only the degree of agreement predicted by chance. By this normalisation the average majority agreement is 0.64 (median 0.66). The same measure calculated by grammatical function was 0.44 (median 0.80, n=77) for subjects, and 0.59 (median 0.54, n=142) for other types of dependents.
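The two agreement measures and the chance adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual procedure: the function names are hypothetical, chance agreement is estimated only for the pairwise measure (as the probability that two votes drawn from the pooled label distribution coincide), and the posting does not spell out how the chance level for the majority measure was derived.

```python
from collections import Counter
from itertools import combinations


def majority_agreement(votes):
    """Share of votes that went to the most frequent label."""
    return max(Counter(votes).values()) / len(votes)


def pairwise_agreement(votes):
    """Share of judge pairs that chose the same label (needs >= 2 votes)."""
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def expected_pairwise(all_votes):
    """Chance that two independent votes match, given pooled label frequencies.

    Assumption: the label distribution over all responses stands in for the
    distribution in the language at large, as the text describes.
    """
    counts = Counter(all_votes)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


def chance_corrected(observed, expected):
    """Rescale agreement so that 1 = full agreement and 0 = chance level."""
    return (observed - expected) / (1 - expected)


# The three-judge example from the text: two judges agree, one disagrees.
votes = ["Agent", "Agent", "Instrument"]
print(majority_agreement(votes))   # two of three votes match
print(pairwise_agreement(votes))   # one of three pairs matches
```

The rescaling in `chance_corrected` has the same form as the kappa statistic, so a value of 0 means agreement no better than the pooled label frequencies alone would produce.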
Using the pairwise measure, the normalised overall agreement was 0.57 (the Kappa statistic, p<10^-12)#3.

#3 See Jean Carletta, 1996: ''Assessing agreement on classification tasks: the kappa statistic'', Computational Linguistics 22(2):249-254; and Siegel & Castellan 1988: ''Nonparametric Statistics for The Behavioural Sciences'', pp.284-290.

Comments on Experimental Design
===============================

Several people commented that the role categories were not defined, and so they could not be sure how to apply them. Our intention in the experiment was to investigate linguists' existing conception of roles, since they are often appealed to in the literature without any explanation. Generally, respondents seem to have applied the roles without much confusion, with the exception of Theme/Patient (the boundary seems not well defined) and Percept (which was used much less than expected - we consider that this *may* have been due to a terminological choice, and that Stimulus might have been more widely chosen).

In particular many felt that Goal and Source were missing. The reasons for their omission were two-fold. Firstly, it was not our objective in this experiment to presuppose a particular categorisation of the spatial and temporal domains (as the range of prepositions available suggests, this is a complex area - consider ''under'', ''behind'', ''towards'', ''at'', ''in'', ''on'', ''away'', ''from'', ''back'', etc. A two or even three-way distinction (Goal/Path/Source) is unlikely to be adequate). Secondly, we feel that systems using Source and Goal often miss essential distinctions, as they are often and variously applied outside the spatial domain by metaphorical extension. In the last section we saw that Goal or Source were variously suggested for dependents that received majority votes of Agent, Place, Time, Reason, and Recipient.

Some respondents commented that the questionnaire was too long.
We agree, but surprisingly a fair proportion finished it, and there was only a minimal order effect. One might expect respondents to become less careful as they proceed through the exercise, and so for agreement to decrease, but only a very slight correlation was found between order and either the majority or pairwise agreement measures (majority: Pearson's r=-.080, 0.6% of variability, p=.24; pairwise: Pearson's r=-.107, 1.1% of variability, p=.115). For comparison, correlation between the two agreement measures was high (r=.982, 96.5% of variability, p<.001).

One respondent questioned why we used popular film scripts. We wanted to use everyday language, and judged scripts to give a good approximation of genuine speech, with readily accessible contexts (since understanding of the context was expected to play a large part in how roles were interpreted). In addition we used arbitrary sentences from the web, as returned by Google.

Another question was on what semantic ''range'' (for want of a better word) was to be considered. Immediate interpretation sometimes encouraged a different role than a wider interpretation, after drawing in more real-world knowledge. This was an issue precisely because we used heavily contextualised materials.

- For example when we read ''they chose him a new form'', ''him'' might be viewed as a Beneficiary. However in this context (Ghostbusters, a supernatural one) it seems that the pragmatically derived meaning is ''they changed him into something else'', in which case ''he'' is more likely a Patient.
- Similarly in ''bring Captain Solo ... to me'', ''me'' might be seen as an endpoint, and so a Place. However in the context of Captain Solo being a prisoner, ''me'' might be a Recipient.
- Can we consider ''us'' in ''they didn't design a survival suit for us'' a Beneficiary, since the proposition is negated? No benefit was received.
- ''What'' and ''you'' in ''What brought you to Casablanca'' might be a Cause and Patient, if viewed literally, but are more likely to be Reason and Agent in the pragmatic interpretation of ''Why did you come to Casablanca''.
- ''you will bring X to me'': while ''you'' would ordinarily be an Agent, one respondent thought it might be an Instrument, since that person was being compelled (i.e. was not volitionally involved).
- ''you hit the wall'' (in context of extreme sports): ''you'' might be seen as an Agent (if you think they chose to hit the wall), Experiencer or Patient (if you think they got hurt hitting the wall), or Theme (if you think they couldn't avoid hitting the wall, but were little affected by it).

Some respondents disagreed with our analysis of sentences and dependents. They believed that certain types of dependent phrase do not bear a thematic role - for some, adverbials presented problems; for others, clausal phrases; for one, prepositional objects (since they are directly governed by the preposition, not the verb). However our approach is that since these are dependent phrases, they have a (semantic) relationship to the verb which deserves to be described.

A more serious problem was disagreement on the boundaries of dependents. Many of the respondents believed that a sentence like ''this lets us see ...'' has three dependents (''this'', ''us'' and ''see ...''), rather than the two of our analysis (''this'' and ''us see ...''). We had not considered this analysis, but consider both to be plausible. In the example above, it is possible to see ''us'' as an Experiencer or perhaps Patient of the ''let'' event. However, if we consider ''let chaos reign'', it is hard to see what relation ''chaos'' has to the verb.

Some objected to the analysis of ''why'' as a dependent of ''take'' in ''why ... students take the SAT ...''. We can see how ''why'' can be viewed as an operator head, with ''... students take the SAT ...'' as its dependent.
However, we also think that it can be validly viewed as a fronted pronominal version of a reason adjunct (e.g. ''students take the SAT because XYZ'', where ''because XYZ'' has an equivalent function to ''why'').

Finally, people also objected to the analysis of ''Because'' as a dependent of ''choose'' in ''Because your father chose me''. They are right - this is a mistake. ''your father chose me'' is in fact a dependent of ''Because''.

Linguistic Field(s): Cognitive Science; Psycholinguistics; Semantics
Subject Language(s): English (eng)