LINGUIST List 19.296
Fri Jan 25 2008
FYI: REG Challenge 2008: First Call for Participation
Editor for this issue: Matthew Lahrman
<matt linguistlist.org>
Directory
1. Anja Belz, REG Challenge 2008: First Call for Participation
Message 1: REG Challenge 2008: First Call for Participation
Date: 25-Jan-2008
From: Anja Belz <A.S.Belz brighton.ac.uk>
Subject: REG Challenge 2008: First Call for Participation
First Call For Participation

Referring Expression Generation Challenge 2008

To be held in conjunction with the 5th International Natural Language Generation Conference (INLG 2008), June 12-14, 2008, Salt Fork, Ohio, USA

Following the success of the Pilot NLG Challenge on Attribute Selection for Generating Referring Expressions (ASGRE) in September 2007, we are organising a second NLG Challenge, the Referring Expression Generation Challenge (REG 2008), to be presented and discussed during a special session at INLG 2008.

While the ASGRE Challenge focused on attribute selection for definite references, the REG Challenge expands the scope of the original to include both attribute selection and realisation, while introducing a new task involving references to named entities in context. The REG Challenge has eight submission tracks and two different data sets. It maintains the ASGRE Challenge's emphasis on openness to alternative task definitions and evaluation methods, and will involve both automatic and task-based evaluation.

Contents of this Call:

1. Background
2. Generation of Referring Expressions
3. REG Challenge Data Sets
4. REG Challenge Tasks and Submission Tracks
5. Evaluation
6. Participation
7. Proceedings and Presentations
8. Important Dates
9. Organisation

1. Background

Over the past few years, the need for comparative, quantitatively evaluated results has been increasingly felt in the field of NLG. Following a number of discussion sessions at NLG meetings, a workshop dedicated to the topic was held with NSF support in Arlington, Va., US, in April 2007 (see http://www.ling.ohio-state.edu/~mwhite/nlgeval07/). At this workshop, a decision was taken to organise a pilot shared task evaluation challenge, focussing on the area of GRE because of the broad consensus that has arisen among researchers on the nature and scope of this problem.
The First NLG Challenge on Attribute Selection for Generating Referring Expressions (ASGRE) was held in Copenhagen in September 2007 in conjunction with the UCNLG+MT Workshop. It was a successful pilot, both in terms of participation and in the variety and quality of submissions received (see http://www.csd.abdn.ac.uk/~agatt/home/pubs/asgre2007.pdf). With 18 initial registrations, and final submissions from six teams comprising 13 researchers, who submitted outputs from 22 different systems, community interest was substantial.

Several aspects of the ASGRE Challenge were intended to promote an approach to shared-task evaluation where community interests feed directly into the nature and evaluation of tasks. This is important in order to counteract a potential narrowing of scope, where a shared task, rather than reflecting community interest, plays a causal role in shaping those interests. The most important of these aspects were:

* A wide range of evaluation criteria, involving both automatic and task-based, intrinsic and extrinsic methods.
* An Open Category Track which enabled researchers to submit reports describing novel approaches involving the shared dataset, while opting out of the competitive element.
* An Evaluation Methods Track for submissions with novel proposals for evaluation of the shared task.
* Self-evaluation: participants computed scores for the development data set, using code supplied by the organisers.

2. Generation of Referring Expressions (GRE):

Since the foundational work of authors such as Appelt, Kronfeld, Grosz, Joshi, Dale and Reiter, GRE has been the subject of intensive research in the NLG community, giving rise to significant consensus on the GRE problem definition, as well as the nature of the inputs and outputs of GRE algorithms. This is particularly true of the subtask of attribute selection for definite referring expressions (REs), perhaps the most widely researched NLG subtask.
A succinct definition of the attribute selection task is given by Bohnet and Dale (2005): ''Given a symbol corresponding to an intended referent, how do we work out the semantic content of a referring expression that uniquely identifies the entity in question?'' This was precisely the task definition in the ASGRE Challenge. The REG Challenge adds tasks on realisation of definite REs and choice of type of RE in discourse context, aiming to cover a larger subset of NLG research interests.

3. REG Challenge Data

TUNA Corpus of Referring Expressions:

The TUNA Corpus consists of a set of human-produced descriptions of objects in a visual domain of pictures of furniture or people, annotated at the semantic level with a domain representation. It was collected during an elicitation experiment, in which one between-subjects condition controlled the use of the location of an object in descriptions (+/-Location). The version of the TUNA data to be used in the REG Challenge will, in addition to attribute sets, include human-produced descriptions, the corresponding pictures of domain objects, and the applicable experimental condition.

While ASGRE participants were never shown the test set outputs, we feel it is more appropriate to use a new test set with unseen inputs as well as unseen outputs. We are therefore creating 50 new corpus items for each subdomain (people and furniture), in experimental conditions that replicate the original TUNA elicitation experiments. This time we are obtaining three descriptions for each test item, which will allow us to compare peer outputs to several human-produced descriptions, resulting in a more reliable assessment of humanlikeness.

The TUNA Corpus is described in greater detail here:

Gatt, A., van der Sluis, I., and van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions using a balanced corpus. Proceedings of the 11th European Workshop on Natural Language Generation, ENLG-07.
See also http://www.csd.abdn.ac.uk/research/evaluation for details of the corpus.

GREC Corpus of named entity references in context:

The GREC corpus consists of just over 2,000 short introductory texts from Wikipedia entries containing about 18,000 annotated referring expressions in total. The annotated references in each text are to a single entity which constitutes the main subject or topic of the text. The texts fall into five domains: cities, countries, rivers, people and mountains.

A subset of the corpus (100 texts), which will serve as the test set for the tasks involving GREC, contains three additional referring expressions for each reference in the original texts, as selected by human participants during an experiment. We are preparing a second test set, also containing multiple human-selected alternatives for each reference. The domain of this test set will be different from the five contained in the corpus, and will not be revealed to participants in advance.

Our use of this corpus represents an effort to extend the scope of the REG Challenge to take into account the effect of discourse context on the form that a referring expression should take.

An earlier version of the GREC corpus (containing just over 1,000 texts) is described in greater detail here:

Belz, A. and Varges, S. (2007). Generation of Repeated References to Discourse Entities. Proceedings of the 11th European Workshop on Natural Language Generation, ENLG-07.

4. REG Challenge Tasks and Submission Tracks

Summary of submission tracks:

1. Task 1 (TUNA-AS): Attribute selection for referring expressions.
2. Task 2 (TUNA-R): Realisation of referring expressions.
3. Task 3 (TUNA-REG): Attribute selection and realisation combined.
4. TUNA Open Track: Any work involving the TUNA data.
5. TUNA Evaluation Methods: Any work involving evaluation of Tasks 1-3.
6. Task 4 (GREC): Named Entity generation: given a referent, a discourse context and a list of possible referring expressions, select the referring expression most appropriate in the context.
7. GREC Open Track: Any work involving the GREC data.
8. GREC Evaluation Methods: Any work involving evaluation of Task 4.

Open Tracks and Evaluation Methods Tracks:

The open tracks act to prevent an overly narrow task definition from dominating in an otherwise varied research field, and the evaluation methods tracks allow researchers to contribute evaluation methods that they consider most appropriate. The idea is that such alternative tasks and evaluation methods will become part of future evaluation events (e.g. the MASI metric included this year was proposed by Advaith Siddharthan, Cambridge University, last year).

Task 1 (TUNA-AS):

This is the same task as in the ASGRE Challenge (i.e. mapping from domain representations to attribute sets), but with a previously unseen test data set which will have multiple reference descriptions. The inclusion of this task will allow participants to try to improve on the 2007 systems (something called a `Progress Test' in the NIST-led MT evaluations), and will allow some researchers to participate who were not able to in 2007. See also http://www.csd.abdn.ac.uk/research/evaluation for details of data and task definition used in the ASGRE Challenge.

Task 2 (TUNA-R):

In Task 2, participants need to create systems that map sets of attribute-value pairs to natural language descriptions. For example, { type:fan, colour:red, size:large } could map to ''the large red fan'' (among other possibilities). Participants can choose to either maximise similarity with the human-produced descriptions in the corpus, or to maximise optimality from the point of view of human comprehension and fast identification of referents (see also evaluation section below).

Task 3 (TUNA-REG):

Task 3 combines Tasks 1 and 2, i.e.
participating systems need to map from a domain representation to a natural language description that describes the target referent. Again, participants can choose which of the evaluation criteria to optimise for.

One important aspect of the TUNA tasks is that the template realiser and some of the participant attribute-selection systems from the ASGRE Challenge will be made available to REG Challenge participants. Thus, teams participating in these tasks can choose to focus on just one of the components (mapping from input domains to attribute sets or mapping from attribute sets to realisations). This is especially relevant to participants in Task 3, where a team might choose to invest more effort in one of the two components, while using an off-the-shelf system for the other. Such reuse of components has been found useful in other evaluation initiatives such as the TC-STAR text-to-speech evaluation initiative.

Task 4 (GREC):

In the shared-task version of the GREC corpus, every main subject reference (MSR) is annotated with a set of possible MSRs. This set is automatically generated by collecting all MSRs that occur in the same text, and applying some generic rules that add a range of default options. The set of possible MSRs along with the surrounding text (including annotations) forms the input, and a single MSR selected for a given slot forms the output. Participating systems need to implement this mapping.

This is a (simplified) example of an input, where the reference output in the corpus is ''Brunei'':

... [Brunei itself; _; it; it itself; that; that itself; the country; the country itself; which; which itself] Brunei, the remnant of a very powerful sultanate, became independent from Great Britain in 1984. ...

The GREC task is to decide, given a set of alternative referring expressions, as well as the context of the reference, which of the alternatives is the most appropriate.
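As an illustration only, the input/output shape of Task 4 can be sketched in Python as below. The `Slot` structure, the `is_first_mention` flag (a crude stand-in for the discourse context) and the `select_msr` rule are all hypothetical: they are not part of the task definition or of any code supplied by the organisers, and the candidate list assumes that ''Brunei'' itself is among the possible MSRs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    """One main subject reference (MSR) slot in a text (hypothetical structure)."""
    candidates: List[str]   # the set of possible MSRs supplied with the input
    is_first_mention: bool  # assumed, highly simplified stand-in for context


def select_msr(slot: Slot, name: str, pronoun: str = "it") -> str:
    """Toy baseline: use the name on first mention, a pronoun afterwards.

    Falls back to the first listed candidate when the preferred form is
    not among the possible MSRs. An empty candidate list is not handled.
    """
    preferred = name if slot.is_first_mention else pronoun
    if preferred in slot.candidates:
        return preferred
    return slot.candidates[0]


# Candidate list modelled on the Brunei example; "Brunei" is assumed present.
candidates = [
    "Brunei", "Brunei itself", "_", "it", "it itself", "that", "that itself",
    "the country", "the country itself", "which", "which itself",
]

print(select_msr(Slot(candidates, is_first_mention=True), name="Brunei"))
print(select_msr(Slot(candidates, is_first_mention=False), name="Brunei"))
```

A real system would replace the single boolean with features of the surrounding text and its annotations; the sketch only shows that the mapping is from (candidate set, context) to one selected expression.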
The main focus in the Humanlikeness evaluation of this task (see following section) will be on the relatively coarse-grained choice between (a) common-noun references (e.g. ''The country''); (b) proper-name references (e.g. ''Brunei''); and (c) pronominal references (e.g. ''it''). Under this conception, a system is considered to have made a correct choice relative to a human if it selects a reference of the same type as the human. Secondarily, a more fine-grained evaluation will be carried out, in which a system's actual REs (rather than their types) will be assessed.

5. Evaluation:

All data sets will be divided into training, development and test data. Participants will compute evaluation scores on the development set (using code provided by us), and the organisers will perform evaluations on the test data set.

We will again use a range of different evaluation methods, including intrinsic and extrinsic, automatically assessed and human-evaluated, as shown in the overview below. Intrinsic evaluations assess properties of peer systems in their own right, whereas extrinsic evaluations assess the effect of a peer system on something that is external to it, such as its effect on human performance at a given task or the added value it brings to an application.
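To make the set-based humanlikeness measures for Task 1 concrete, here is a minimal Python sketch of accuracy, Dice and MASI over attribute sets. It is not the organisers' evaluation code: the representation of attributes as (name, value) pairs, the function names and the treatment of two empty sets as a perfect match are assumptions. MASI follows the usual formulation of Passonneau (2006), i.e. the Jaccard coefficient multiplied by a monotonicity weight (1 for identical sets, 2/3 when one set is a subset of the other, 1/3 when the sets overlap without either containing the other, 0 when they are disjoint). How scores are aggregated over multiple reference descriptions is not specified in this call and is left out.

```python
from typing import FrozenSet, Tuple

Attr = Tuple[str, str]       # assumed representation: (attribute, value)
AttrSet = FrozenSet[Attr]


def accuracy(peer: AttrSet, ref: AttrSet) -> float:
    """1.0 for an exact match between peer and reference sets, else 0.0."""
    return 1.0 if peer == ref else 0.0


def dice(peer: AttrSet, ref: AttrSet) -> float:
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    if not peer and not ref:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(peer & ref) / (len(peer) + len(ref))


def masi(peer: AttrSet, ref: AttrSet) -> float:
    """MASI: Jaccard coefficient times a monotonicity weight."""
    if not peer and not ref:
        return 1.0  # assumption: two empty sets count as identical
    overlap = peer & ref
    jaccard = len(overlap) / len(peer | ref)
    if peer == ref:
        weight = 1.0
    elif peer <= ref or ref <= peer:
        weight = 2 / 3   # one set is a subset of the other
    elif overlap:
        weight = 1 / 3   # overlap, but neither contains the other
    else:
        weight = 0.0     # disjoint sets
    return jaccard * weight


peer = frozenset({("type", "fan"), ("colour", "red")})
ref = frozenset({("type", "fan"), ("colour", "red"), ("size", "large")})

print(accuracy(peer, ref))        # 0.0
print(dice(peer, ref))            # 0.8
print(round(masi(peer, ref), 3))  # 0.444
```

In this example the peer set is a proper subset of the reference, so MASI is (2/3) x (2/3), which shows how the subset case is penalised less heavily than a partial overlap would be.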
Task(s):   Criteria:                Type of evaluation:   Evaluation Methods:
TUNA-AS    Humanlikeness            Intrinsic/automatic   Accuracy, Dice, MASI
           Minimality, Uniqueness   Intrinsic/automatic   Proportion of minimal/unique outputs
TUNA-R     Humanlikeness            Intrinsic/automatic   Accuracy, BLEU, NIST, string-edit distance
TUNA-REG   Ease of comprehension    Extrinsic/human       Self-paced reading in identification experiment
           Referential Clarity      Extrinsic/human       Speed and accuracy in identification experiment
GREC       Humanlikeness            Intrinsic/automatic   Accuracy, BLEU, NIST, string-edit distance
           Ease of comprehension    Extrinsic/human       Self-paced reading in identification experiment
           Referential Clarity      Extrinsic/human       Speed and accuracy in identification experiment
                                    Intrinsic/human       Direct human assessment
           Coherence                Intrinsic/human       Direct human assessment

Extrinsic evaluations:

For TUNA Tasks 2 and 3, we are planning task-based experiments similar to those in the ASGRE Challenge. However, we will present the referring expression and the images of referents separately in two steps, so that subjects are first shown the RE and can then bring up the images when they are ready. We will measure reading speed in the first step (as an estimate of ease of comprehension), and identification speed and identification accuracy (referential clarity) in the second step, for which we will again use the free high-precision software DMDX and Time-DX (Forster & Forster, 2003).

For the GREC task, where there is no obvious set of distractor entities, we envisage an experimental scenario in which participants are asked to decide, for a given NP, whether it refers to the main subject of the text. For example, the task may be to decide for all personal and relative pronouns in a text whether the intended referent is the main subject of the text, or not. We will record identification speed and identification accuracy (referential clarity).
Subjects will self-pace their reading of the texts, which will enable us to also measure reading speed (ease of comprehension). We are currently preparing pilot experiments to test possible approaches.

Intrinsic automatic evaluations:

We will assess humanlikeness, the similarity of peer outputs to sets of human-produced reference `outputs', using a range of automatic metrics. For Task 1, we will again use the Dice coefficient, and additionally accuracy (the proportion of exact matches), and a measure called MASI (Measuring Agreement on Set-valued Items) which is slightly biased in favour of similarity where one set is a subset of the other (Passonneau, 2006). For Tasks 2, 3 and 4, we will use accuracy, string-edit distance, BLEU and NIST. For backwards comparability with the ASGRE Challenge we will also again assess minimality and uniqueness of attribute sets in Task 1.

Intrinsic human-assessed evaluations:

This is the type of human evaluation in the MT-Eval and DUC evaluation campaigns. We will train human evaluators in assessing texts for the two DUC criteria of referential clarity and coherence and will largely follow DUC methodology (Trang Dang, 2006). While the TUNA domain is likely to be too simple for this type of evaluation to show significant differences, we will use it on the more varied GREC domain. Having these additional evaluations will also act as a fall-back in a task type (named entity reference generation) for which there is no evaluation experience to draw upon.

General system evaluation:

Subject to feasibility, (implementations of) the submitted systems may also be compared in terms of their processing efficiency.

6. Participation:

At this point we would like anybody who is potentially interested in participating in the REG Challenge to register via the REG homepage (http://www.nltg.brighton.ac.uk/research/reg08), completing and submitting a preliminary registration form.
Upon preliminary registration, participants will be sent sample data for the four shared tasks and detailed task definitions, including input and output specifications. A Participant's Pack with full details, and a more exhaustive description of the datasets, input and output specifications will subsequently be distributed.

7. Proceedings and Presentations:

The REG Challenge 2008 meeting will be part of INLG'08. There will be a special session in the conference programme for an overview of the participating systems, presentation of evaluation results and open discussion. The participating systems will additionally be presented in the form of 2-page papers in the conference proceedings, and posters during the INLG'08 poster session.

REG Challenge Papers will not undergo a selection procedure with multiple reviews, but the organisers reserve the right to reject material which is not appropriate given the participation guidelines. Page limits are the same for all tracks: papers should not exceed 2 (two) pages in length, including diagrams and bibliography. Authors should follow the INLG'08 style guidelines.

8. Important Dates:

Oct 17, 2007     INLG'08 First Call for papers, including announcement of REG Challenge
Jan 03, 2008     REG Challenge 2008 First Call for Participation; Preliminary registration open; sample data available
Jan 28, 2008     Release of training and development data sets for all tasks
Mar 17, 2008     Test data becomes available
Mar 17-Apr 07    Test data submission period: participants can download test data at any time, but must submit a system report first and must submit outputs within 48 hours
Apr 07, 2008     Final deadline for submission of test data outputs
Apr 07-May 10    Evaluation period
Jun 12, 2008     REG Challenge meeting at INLG'08

9. Organisation:

Anja Belz, NLTG, University of Brighton, UK
Albert Gatt, Computing Science, University of Aberdeen, UK

REG Challenge homepage: http://www.nltg.brighton.ac.uk/research/reg08
REG Challenge email: gre-stec -AT- itri.brighton.ac.uk
Linguistic Field(s): Computational Linguistics