LINGUIST List 32.801
Wed Mar 03 2021
FYI: First News Text Corpus of Indian English (NTCIE)
Editor for this issue: Everett Green <everettlinguistlist.org>
Niladri Sekhar Dash <ns_dash
First News Text Corpus of Indian English (NTCIE)
The Linguistic Research Unit (LRU) of the Indian Statistical Institute (ISI), Kolkata, has developed a 'News Text Corpus of Indian English (NTCIE)' from the online edition of a widely circulated English newspaper published in Kolkata, India. To date, this is the first corpus of its kind for Indian newspaper English. The corpus contains around 10 million (1 crore) words of running text drawn from news reports published between August and December 2015. The LRU team has processed the corpus and generated a lexical database of 99,37,817 words, a syntax database of 4,82,532 sentences, and a list of 3,07,599 tokens after tokenization. In addition, the corpus is POS-tagged using the Stanford POS Tagger (v3.6.0-2015-12-09). The corpus has high application value in machine learning, technology development for Indian English, digital lexicography, education, translation, language planning, discourse analysis, and many other areas. Both the raw and the POS-annotated versions of the corpus are available for commercial and academic purposes (for a fee). The corpus is the product of an entirely self-funded project (July 2016 to December 2020).
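The announcement does not describe how the LRU team derived these counts. As a rough illustration only, the Python sketch below shows one way that figures of this kind (running words, sentences, and distinct word forms) might be computed from raw text. The function name `corpus_stats`, the regular expressions, and the reading of the token list as a list of distinct word forms are all assumptions made for this example, not details from the source.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Return rough corpus counts for a piece of raw text (illustrative only)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    # Real news text needs more care (abbreviations, initials, quotations).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Naive word pattern: alphanumeric runs, allowing internal ' or -.
    words = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

    # Distinct word forms, case-folded (an assumption about what is counted).
    forms = Counter(w.lower() for w in words)

    return {
        "running_words": len(words),
        "sentences": len(sentences),
        "distinct_forms": len(forms),
    }


sample = "The minister visited Kolkata. The visit was brief."
stats = corpus_stats(sample)
print(stats)
```

On the two-sentence sample this counts 8 running words, 2 sentences, and 7 distinct forms ("The" and "the" fold together). A production pipeline would normally use a dedicated tokenizer and sentence splitter rather than regular expressions.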
Those interested may contact Prof. Niladri Sekhar Dash, Head, LRU, ISI, Kolkata.
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): English (eng)
Language Family(ies): Indo-European
Page Updated: 03-Mar-2021