Sat Jun 20 2020

FYI: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions

Date: 19-Jun-2020
From: Marie Candito
Subject: PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions
PARSEME shared task 1.2 - Final call for participation

The third edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying **verbal MWEs** in running text, with **emphasis on discovering VMWEs that were not seen in the training corpus**.

See the shared task website for all additional information:

#### Blind test data and upload of system results

The PARSEME team has prepared corpora in which VMWEs were manually annotated. The provided annotations follow the PARSEME 1.2 guidelines:

On March 23, 2020, we released, for each language:

* a training corpus manually annotated for VMWEs;
* a development corpus to tune/optimize the systems' parameters; and
* a syntactically parsed raw corpus, not annotated for VMWEs, to support semi-supervised and unsupervised methods for VMWE discovery (for each language, the size is between 12 million and 2.5 billion tokens).

On July 1, 2020, we will release, for each language:

* a blind test corpus to be used as input to the systems during the evaluation phase, during which the VMWE annotations will be kept secret.

On July 3, 2020, participants will have to upload their annotated version of the test corpus at

Morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies) are also provided, both for annotated and raw corpora.

The annotated training and development corpora are released in the CUPT format (the CoNLL-U format with an extra column for the MWE annotations). The raw corpora are released in the CoNLL-U format. The blind test corpus will be released in the CUPT format, with an underspecified 11th column to be predicted. Reference annotations for the test corpus will be released after the evaluation phase.
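As a rough illustration of the format described above, the following sketch reads CUPT-style input and extracts the 11th column for each token. It assumes the usual CoNLL-U conventions (tab-separated columns, comment lines starting with `#`, a blank line between sentences). The function names `read_cupt` and `mwe_tags` are hypothetical. The sample sentence and the tag values in its last column (`*`, `1:LVC.full`, `1`) are made up for illustration and are not taken from the released corpora; the official format description on the shared task website remains the reference.

```python
from typing import Iterable, Iterator, List

MWE_COL = 10  # zero-based index of the 11th column (the MWE annotation)


def read_cupt(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (11 column strings per row)."""
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment / metadata line: skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != 11:
                raise ValueError(f"expected 11 columns, got {len(cols)}: {line!r}")
            sentence.append(cols)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def mwe_tags(sentence: List[List[str]]) -> List[str]:
    """Return the content of the 11th column for every token of a sentence."""
    return [row[MWE_COL] for row in sentence]


# Illustrative sample only; the tag encoding in the last column is an assumption.
SAMPLE_ROWS = [
    ["1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"],
    ["2", "took", "take", "VERB", "_", "_", "0", "root", "_", "_", "1:LVC.full"],
    ["3", "a", "a", "DET", "_", "_", "4", "det", "_", "_", "*"],
    ["4", "walk", "walk", "NOUN", "_", "_", "2", "obj", "_", "_", "1"],
]
SAMPLE = (
    "# text = She took a walk\n"
    + "\n".join("\t".join(row) for row in SAMPLE_ROWS)
    + "\n\n"
)

sentences = list(read_cupt(SAMPLE.splitlines()))
print(mwe_tags(sentences[0]))
```

For real data, the same function can be applied to an open file handle, since a text file iterates line by line.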

The trial data, training and dev sets are available on the shared task's release repository:

The raw corpus is available on the corpus initiative website:

Corpora are available for the following languages: German (DE), Greek (EL), Basque (EU), French (FR), Irish (GA), Hebrew (HE), Hindi (HI), Italian (IT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV), Turkish (TR), Chinese (ZH).

The amount of annotated data in the training, development, test, and raw corpus depends on the language.

#### Corpus split

For each language, the annotated sentences are shuffled and split in a way that ensures a minimum of 300 VMWEs in the test set that are unseen in the training + dev sets. This means that the natural sequence of sentences in a document will not be respected in the proposed corpus split. Note that the unseen ratio, that is, the proportion of unseen VMWEs with respect to all VMWEs in the test set, may vary across languages. To guide participants on this hard task, the number and rate of unseen VMWEs for the dev corpora are available on the shared task website. In both tracks, the use of corpora from previous shared task editions and from the PARSEME source repositories is strictly forbidden, as material may have moved during corpus splits.
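The unseen ratio mentioned above can be sketched as follows. This is a minimal illustration, not the official evaluation script. It assumes that a VMWE counts as "seen" when the same multiset of lemmas occurs among the VMWEs of the training + dev sets; that criterion is an assumption made here, and the exact definition should be checked on the shared task website. The names `vmwe_key` and `unseen_ratio` and the toy data are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def vmwe_key(lemmas: Sequence[str]) -> Tuple[str, ...]:
    """Represent a VMWE as its sorted lemmas (an order-insensitive multiset)."""
    return tuple(sorted(lemmas))


def unseen_ratio(
    train_dev_vmwes: Iterable[Sequence[str]],
    test_vmwes: Iterable[Sequence[str]],
) -> Tuple[int, float]:
    """Return (number of unseen test VMWEs, unseen / all test VMWEs)."""
    seen = {vmwe_key(v) for v in train_dev_vmwes}
    test_keys: List[Tuple[str, ...]] = [vmwe_key(v) for v in test_vmwes]
    if not test_keys:
        return 0, 0.0
    unseen = sum(1 for key in test_keys if key not in seen)
    return unseen, unseen / len(test_keys)


# Toy data: each VMWE is given as the list of lemmas of its component tokens.
train_dev = [["take", "walk"], ["make", "decision"]]
test = [["walk", "take"], ["pay", "visit"], ["make", "decision"], ["give", "up"]]

count, ratio = unseen_ratio(train_dev, test)
print(count, ratio)
```

Counting over VMWE occurrences (rather than distinct types) matches the wording "proportion of unseen VMWEs with respect to all VMWEs in the test set", but whether the organizers count occurrences or types is not stated here.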

#### Important dates (updated)

* Jul 01, 2020: blind test corpus released
* Jul 03, 2020: submission of system results
* Jul 09, 2020: announcement of results
* Sep 02, 2020: shared task system description papers due (same as regular papers)
* Oct 16, 2020: notification of acceptance
* Nov 01, 2020: camera-ready system description papers due
* Dec 13, 2020: shared task session at the MWE-LEX 2020 workshop at COLING 2020

Linguistic Field(s): Computational Linguistics

