LINGUIST List 30.482

Wed Jan 30 2019

Calls: Computational Linguistics/USA

Editor for this issue: Everett Green

Date: 27-Jan-2019
From: Anna Rogers
Subject: Third Workshop on Evaluating Vector Space Representations for NLP (co-located with NAACL 2019)

Full Title: Third Workshop on Evaluating Vector Space Representations for NLP (co-located with NAACL 2019)
Short Title: RepEval2019

Date: 06-Jun-2019 - 06-Jun-2019
Location: Minneapolis, MN, USA
Contact Person: Anna Rogers
Meeting Email:
Web Site:

Linguistic Field(s): Computational Linguistics

Call Deadline: 06-Mar-2019

Meeting Description:

General-purpose dense word embeddings have come a long way since their boom began in 2013, and they remain the most widely used way of representing words in both industrial and academic NLP systems. However, the problem of finding intrinsic metrics that are predictive of performance on downstream tasks, and that can help to develop better representations, is far from solved. At the sentence level and above, we now have a number of probing tasks and large extrinsic evaluation datasets targeting high-level verbal reasoning, but there is still much to learn about what features make a compositional representation successful. Last but not least, there are no established intrinsic evaluation methods for newer kinds of representations such as ELMo, BERT, or box embeddings.

The third edition of RepEval aims to foster discussion of the above issues and to support the search for high-quality general-purpose representation learning techniques for NLP. We hope to encourage interdisciplinary dialogue by welcoming diverse perspectives on these issues: submissions may focus on properties of the embedding space, performance analysis on various downstream tasks, or approaches based on linguistic and psychological data. In particular, experts from the latter fields are encouraged to contribute analyses of claims previously made in the NLP community.

Research paper submissions may consist of 4-6 pages of content, plus unlimited pages for references. An additional page will be allowed in the camera-ready version for addressing reviewers' comments. Please refer to the NAACL author guidelines for the style files and the policy on double submissions and preprints:

Call for Papers:

RepEval 2019 invites submissions on approaches to extrinsic and intrinsic evaluation of distributional meaning representations, including evaluation motivated by linguistic, psycholinguistic, or neurological evidence. On the extrinsic side, we still know very little about which linguistic phenomena distributional meaning representations should capture for better performance on downstream tasks, and how generalizable such representations can be. Improved diagnostic tests for word and morpheme compositionality and for all kinds of semantic relations, interpretability of distributional meaning representations, and validation of existing approaches in cross-lingual studies (especially with languages of different families) are all topics on which computational linguistics needs to be truly interdisciplinary.

We invite practical proposals for new evaluation techniques that experimentally demonstrate the benefits of the new approach. Submissions may also focus on critical analysis of existing approaches (especially by experts in other domains such as linguistics or psychology) or on methodological caveats: reproducibility, the impact of parameters, the attribution of results to the representation rather than the whole system, and dataset structure, balance, and representativeness.

Paper Submission:

Submission is electronic, using the Softconf START conference management system at

Analysis papers might like to consider the following questions:

- Pros and cons of existing evaluations;
- (Mis)attribution of performance improvements to various elements of the pipeline in complex NLP systems;
- Given a specific downstream application, which existing evaluation (or family of evaluations) is a good predictor of performance improvement?
- Which linguistic/semantic/psychological properties are captured by existing evaluations? Which are not?
- What methodological mistakes were made in the creation of existing evaluation datasets?
- What linguistic/psychological properties of meaning representations are supposed to make them "better", and why?
- A recent tendency is to treat high-level reasoning tasks such as question answering or inference as the "ultimate" evaluation for meaning representations (effectively, a Turing test proxy). How justified is this approach? Should a "good" representation excel at all such tasks (and also all the low-level ones), or specialize? What alternatives do we have?

Proposal papers should introduce a novel method for evaluating representations, accompanied by a proof-of-concept dataset (at least a sample of which should be made available to the reviewers at submission time). The new method should highlight some previously unnoticed properties of the target representations, enable a faster or more cost-effective way of measuring some previously known properties, or demonstrate a significant improvement over previous proposals (e.g., an update to an imbalanced or noisy dataset showing that previous claims were misattributed). Each proposal should explicitly mention:

- Which type of representation it evaluates (e.g. word, sentence, document, contextualized or not), and what properties of that representation it targets;
- For which downstream application(s) it functions as a proxy;
- Any syntactic/semantic/psychological properties it captures, in comparison with previous work;
- If any annotation was performed, what the inter-annotator agreement was, and how cost-effective it would be to scale the annotation up and/or create a similar resource for other languages;
- The permissions to use/release the data.

See the full CFP at

Page Updated: 30-Jan-2019