LINGUIST List 30.4418

Thu Nov 21 2019

Calls: Computational Linguistics/Spain

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 13-Nov-2019
From: Agata Savary <agata.savary@univ-tours.fr>
Subject: PARSEME Shared Task 1.2 on Semi-supervised Identification of Verbal Multiword Expressions

Full Title: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions
Short Title: PARSEME ST 1.2

Date: 13-Sep-2020 - 14-Sep-2020
Location: Barcelona, Spain
Contact Person: Carlos Ramisch
Web Site: http://multiword.sourceforge.net/sharedtask2020

Linguistic Field(s): Computational Linguistics

Call Deadline: 30-Apr-2020

Meeting Description:

MWE-LEX 2020 will host edition 1.2 of the PARSEME shared task on semi-supervised identification of verbal MWEs. This is a follow-up to editions 1.0 (2017) and 1.1 (2018). The latter covered 20 languages and received 17 submissions from 12 teams. Edition 1.2 will feature (a) improved and extended corpora annotated with MWEs, (b) complementary unannotated corpora for unsupervised MWE discovery, and (c) new evaluation metrics focusing on unseen MWEs. Following the synergy with Elexis, our aim is to foster the development of unsupervised methods for MWE lexicon induction, which in turn can be used for identification. Authors may submit system description papers to a special track, following common submission guidelines. Details will be available soon.

Call for Papers:

The third edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include, among others, idioms (to let the cat out of the bag), light-verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (s'évanouir 'to faint' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics including discontinuity, non-compositionality, heterogeneity and syntactic variability.

Previous editions have shown that, while some systems reach high performance (F1 > 0.7) when identifying VMWEs that were seen in the training data, performance on unseen VMWEs is very low. Hence, for this third edition, **emphasis will be put on discovering VMWEs that were not seen in the training data**.
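To make the seen/unseen distinction concrete, here is a minimal Python sketch of an F1 score restricted to unseen VMWEs. It is not the official evaluation script. It assumes, for illustration only, that a VMWE counts as "seen" when its set of lemmas also occurs as an annotated VMWE in the training corpus, and that predictions whose lemma set is seen are left out of the unseen score. The function name `unseen_f1` and the tuple layout are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical representation of one VMWE occurrence:
# (sentence id, token positions in the sentence, lemmas of those tokens)
Vmwe = Tuple[int, FrozenSet[int], FrozenSet[str]]


def unseen_f1(
    train_lemma_sets: Iterable[FrozenSet[str]],
    gold: Iterable[Vmwe],
    pred: Iterable[Vmwe],
) -> Tuple[float, float, float]:
    """Precision, recall and F1 over VMWEs whose lemma set is absent from training.

    Assumption: "unseen" means the lemma set never occurs as a VMWE in the
    training data. The official metric may define this differently.
    """
    seen = set(train_lemma_sets)

    # Keep only occurrences whose lemma set was never annotated in training.
    gold_unseen = {(sent, toks) for sent, toks, lemmas in gold if lemmas not in seen}
    pred_unseen = {(sent, toks) for sent, toks, lemmas in pred if lemmas not in seen}

    # A prediction is correct only if it matches a gold VMWE exactly
    # (same sentence, same token positions).
    true_positives = len(gold_unseen & pred_unseen)
    precision = true_positives / len(pred_unseen) if pred_unseen else 0.0
    recall = true_positives / len(gold_unseen) if gold_unseen else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this sketch, a system that only reproduces expressions memorized from the training corpus gets no credit at all, which is the behavior the new metrics are meant to expose.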

We kindly ask prospective participating teams to register using the expression of interest form:
https://docs.google.com/forms/d/e/1FAIpQLSfcmbd6MmKjFuBxCoaTWGCPGqoH5FoJ-th8IAZk3kh_ECDaZQ/viewform?usp=sf_link

Task updates and questions will be posted on the shared task website:
http://multiword.sourceforge.net/sharedtask2020
and announced on our public mailing list:
http://groups.google.com/group/verbalmwe



Provided data:

For each language, we provide participants with corpora in which VMWEs are annotated according to the 1.1 shared task guidelines (http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1).

On March 18th, we will release, for each language:
- A training corpus manually annotated for VMWEs;
- A development corpus to tune/optimize the systems' parameters;
- A larger raw corpus, to favor semi-supervised and unsupervised methods for VMWE discovery.

On April 28th, we will release, for each language:
- A blind test corpus to be used as input to the systems during the evaluation phase, during which the VMWE annotations will be kept secret.

When available, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided, both for annotated and raw corpora. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).

So far we plan to include data for the following languages:
Bulgarian (BG), German (DE), Greek (EL), Basque (EU), French (FR), Hebrew (HE), Hindi (HI), Croatian (HR), Hungarian (HU), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV).

The amount of annotated data depends on the language.


Tracks:

System results can be submitted in two tracks:
- Closed track: Systems using only the provided training and development data (with VMWE and provided morphosyntactic annotations) and the provided raw corpora.
- Open track: Systems that may or may not use the provided training data, plus any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, other raw corpora, word embeddings and language models trained on external data, etc.). However, the use of corpora from previous shared task editions is strictly forbidden. This track notably includes purely symbolic and rule-based systems.

Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.




Page Updated: 21-Nov-2019