LINGUIST List 31.2636

Fri Aug 21 2020

FYI: PELIC - New Publicly Available Learner Corpus

Editor for this issue: Everett Green

Date: 19-Aug-2020
From: Ben Naismith
Subject: PELIC - New Publicly Available Learner Corpus

The ELI Data Mining Group at the University of Pittsburgh is pleased to announce the release of the University of Pittsburgh English Language Institute Corpus (PELIC).

PELIC is a publicly available, 4.2-million-word learner corpus of written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced by more than 1,100 students with a wide range of linguistic backgrounds and proficiency levels. PELIC is longitudinal, offering opportunities for tracking development in a natural classroom setting.

Further information about PELIC and research based on these data can be found at the PELIC homepage:

The entire dataset is available for download at the PELIC GitHub repository, stored as CSV files:

In addition to the data, the PELIC repository contains tools for lexical analysis (concordancing, lexical sophistication, etc.) and tutorials on how to access and analyze the data.
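
As a rough illustration of the kind of concordancing that CSV-based learner data allows, here is a minimal keyword-in-context sketch in Python. It does not use the PELIC tools or the real files: the inline sample data, the column names `student_id` and `text`, and the helper `kwic` are all hypothetical and stand in for whatever the repository actually provides.

```python
import csv
import io
import re

# Hypothetical stand-in for a corpus CSV file; the column names
# ("student_id", "text") are assumptions, not the actual PELIC schema.
SAMPLE_CSV = """student_id,text
s1,"I have a book. The book is on the table."
s2,"She read the book yesterday."
"""


def kwic(rows, keyword, width=2):
    """Return (student_id, left context, keyword, right context) tuples."""
    hits = []
    for row in rows:
        # Naive tokenization: runs of lowercase letters and apostrophes.
        tokens = re.findall(r"[a-z']+", row["text"].lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((row["student_id"], left, tok, right))
    return hits


# Read the sample the same way a downloaded CSV file would be read.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for hit in kwic(rows, "book"):
    print(hit)
```

For a downloaded file, the `io.StringIO` wrapper would be replaced by an opened file handle, and the column names would need to match the actual headers in the repository's CSV files.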

Linguistic Field(s): Corpus Linguistics; Learner Corpora; Longitudinal Corpora; Second Language Acquisition

Subject Language(s): English (eng)

Linguistic Field(s): Text/Corpus Linguistics

Page Updated: 21-Aug-2020