LINGUIST List 32.801

Wed Mar 03 2021

FYI: First News Text Corpus of Indian English (NTCIE)

Editor for this issue: Everett Green

Date: 03-Mar-2021
From: Niladri Sekhar Dash
Subject: First News Text Corpus of Indian English (NTCIE)

The Linguistic Research Unit (LRU) of the Indian Statistical Institute (ISI), Kolkata, has developed a 'News Text Corpus of Indian English (NTCIE)' from the online version of a widely circulated English newspaper published in Kolkata, India. To date, this is the first corpus of its kind for Indian newspaper English. The corpus contains around 10 million (1 crore) words of running text obtained from news reports published between August and December 2015. The LRU team has processed the corpus and generated a lexical database of 99,37,817 (9,937,817) words, a syntax database of 4,82,532 (482,532) sentences, and a list of 3,07,599 (307,599) tokens after tokenization. In addition, the corpus is POS tagged using the Stanford POS Tagger (v3.6.0-2015-12-09). The corpus has high practical value for machine learning, technology development for Indian English, digital lexicography, education, translation, language planning, discourse analysis, and many other areas. Both the raw and the POS-annotated versions of the corpus are available for commercial and academic purposes, for a fee. The corpus is the product of an entirely self-funded project (July 2016 to December 2020).
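
For readers who want a rough sense of how counts like those above (running words, sentences, and a list of distinct word forms) can be derived from raw text, the following is a minimal sketch in Python. It is not the LRU team's method: the regular-expression tokenizer, the sentence splitter, the function name corpus_stats, and the sample text are all assumptions made for illustration only.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Return rough counts for a raw text (illustrative sketch only)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of letters, digits or apostrophes, lowercased.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Frequency table of word forms; its size is the number of distinct forms.
    freq = Counter(words)
    return {
        "running_words": len(words),
        "sentences": len(sentences),
        "distinct_forms": len(freq),
    }


# Hypothetical sample text, not taken from the NTCIE corpus.
sample = "The minister visited Kolkata. The visit lasted two days."
stats = corpus_stats(sample)
print(stats)
```

A production pipeline would normally use a trained tokenizer and sentence splitter, since abbreviations, decimal numbers, and quoted speech in news text defeat simple patterns like these.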

Interested people may contact Prof. Niladri Sekhar Dash, Head, LRU, ISI, Kolkata.

Thank you.

Linguistic Field(s): Text/Corpus Linguistics

Subject Language(s): English (eng)
Language Family(ies): Indo-European

Page Updated: 03-Mar-2021