LINGUIST List 31.1757

Wed May 27 2020

Media: Enhanced Large Scale Colloquial Persian Language Understanding (LSCP) Corpus

Editor for this issue: Everett Green

Date: 22-May-2020
From: Hadi Abdi Khojasteh
Subject: Enhanced Large Scale Colloquial Persian Language Understanding (LSCP) Corpus

I am thrilled to announce our new study on informal language understanding, which will be presented at LREC 2020.
This is the first public contribution of our effort on informal spoken Persian (Farsi) language understanding, and a multilingual corpus for the low-resource, spoken side of the language. A language in its oral form is typically much more dynamic than in its written form. The written variety typically involves a higher level of formality, whereas the spoken form is characterised by numerous contractions and abbreviations. Formal written texts tend to use longer and more complex sentences, since readers can re-read difficult passages if they lose track.

More information about the project is available online, and the corpus is hosted in the LINDAT/CLARIN-CZ repository. LSCP contains approximately 120M sentences derived from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags and sentiment polarity, together with translations into English, German, Czech, Italian and Hindi.

Linguistic Field(s): Computational Linguistics

Subject Language(s): Persian, Iranian (pes)
Language Family(ies): Iranian

Page Updated: 27-May-2020