LINGUIST List 32.3744

Wed Dec 01 2021

Calls: Computational Linguistics/USA

Editor for this issue: Everett Green <>

Date: 24-Nov-2021
From: James Pustejovsky <>
Subject: SemEval-2022 Task 09: R2VQ - Competence-based Multimodal Question Answering

Full Title: SemEval-2022 Task 09: R2VQ - Competence-based Multimodal Question Answering
Short Title: R2VQ

Date: 10-Jul-2022 - 15-Jul-2022
Location: Seattle, WA, USA
Contact Person: James Pustejovsky
Meeting Email:
Web Site:

Linguistic Field(s): Computational Linguistics

Call Deadline: 31-Jan-2022

Meeting Description:


Call for Participation:


SemEval-2022 Task 09: R2VQ - Competence-based Multimodal Question Answering

We invite you to participate in the SemEval-2022 Task 9: Competence-based Multimodal Question Answering (R2VQ).
The task is being held as part of SemEval-2022, and all participating teams will be able to publish their system description papers in the proceedings published by the ACL.

When we apply our existing knowledge to new situations, we demonstrate an understanding of how that knowledge is applied through tasks. Viewed over a conceptual domain, this constitutes a competence. Competence-based evaluation is a new approach to designing NLP challenges: rather than focusing on individual tasks, it aims to characterize the underlying operational knowledge a system has of a conceptual domain. In this shared task, we present a challenge reflective of the linguistic and cognitive competencies that humans exercise when speaking and reasoning.

Task Overview
Given the intuition that textual and visual information mutually inform each other in semantic reasoning, we formulate the challenge as a competence-based question answering (QA) task, designed around rich semantic annotation and aligned text-video objects. The task is structured as question-answer pairs that query how well a system understands the semantics of recipes.

We adopt the concept of "question families" as outlined in the CLEVR dataset (Johnson et al., 2017). While some question families naturally transfer over from the VQA domain (e.g., integer comparison, counting), other concepts, such as ellipsis and object lifespan, must be employed to cover the full extent of competency within procedural texts.

Data Content
We have built the R2VQ (Recipe Reading and Video Question Answering) dataset, a dataset consisting of a collection of recipes sourced from and, and labeled according to three distinct annotation layers: (i) Cooking Role Labeling (CRL), (ii) Semantic Role Labeling (SRL), and (iii) aligned image frames taken from creative commons cooking videos downloaded from YouTube. It consists of 1,000 recipes, with 800 to be used as training, and 100 recipes each for validation and testing. Participating systems will be exposed to the aforementioned multimodal training set, and will be asked to provide answers to unseen queries exploiting (i) visual and textual information jointly, or (ii) textual information only.

Task Website and Codalab Submission site:
Mailing List:

Important Dates
Training data available: October 15, 2021

Validation data available: December 3, 2021

Evaluation data ready: December 3, 2021

Evaluation start: January 10, 2022

Evaluation end: January 31, 2022

System Description Paper submissions due: February 23, 2022

Notification to authors: March 31, 2022

James Pustejovsky, Brandeis University,

Jingxuan Tu, Brandeis University,

Marco Maru, Sapienza University of Rome,

Simone Conia, Sapienza University of Rome,

Roberto Navigli, Sapienza University of Rome,

Kyeongmin Rim, Brandeis University,

Kelley Lynch, Brandeis University,

Richard Brutti, Brandeis University,

Eben Holderness, Brandeis University,

Page Updated: 01-Dec-2021