LINGUIST List 31.1626

Fri May 15 2020

FYI: Coronavirus Corpus

Editor for this issue: Everett Green <everett@linguistlist.org>



Date: 15-May-2020
From: Mark Davies <mark_davies@byu.edu>
Subject: Coronavirus Corpus

We are pleased to announce the release of the Coronavirus Corpus:

https://www.english-corpora.org/corona/

The Coronavirus Corpus is designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond, and it is part of the English-Corpora.org suite of corpora, which offer unparalleled insight into genre-based, historical, and dialectal variation in English.

The corpus is currently about 270 million words in size, and it continues to grow by 3-4 million words each day. (For example, there are already 4 million words of text for yesterday, May 14.) At this rate, the corpus may be 500-600 million words in size by August 2020.
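The projection above can be checked with simple arithmetic. The following is a minimal sketch, assuming strictly linear growth and reading "by August" as 1 August 2020; the function name project_size is hypothetical and used only for illustration.

```python
from datetime import date

def project_size(start_words, words_per_day, start, end):
    """Linear projection: current size plus a constant daily increment."""
    days = (end - start).days
    return start_words + words_per_day * days

start = date(2020, 5, 15)    # date of this announcement
target = date(2020, 8, 1)    # assumption: "by August" means 1 August

# Lower and upper bounds from the stated 3-4 million words per day
low = project_size(270_000_000, 3_000_000, start, target)
high = project_size(270_000_000, 4_000_000, start, target)
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} million words")
```

With 78 days between the two dates, this gives roughly 504-582 million words, consistent with the 500-600 million estimate.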

The Coronavirus Corpus allows you to see the frequency of words and phrases in 10-day increments (and even day by day, if desired) since Jan 2020, such as social distancing, flatten the curve, WORK * home, Zoom, Wuhan, hoard*, toilet paper, curbside, pandemic, reopen, defy.
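For readers who want to try this kind of count on their own dated texts, here is a minimal sketch of frequency by 10-day period. It is not how English-Corpora.org computes its charts: the function name freq_by_period, the choice to count periods from 1 January 2020, the simple regex tokenizer, and the per-million normalization are all assumptions made for illustration. A trailing wildcard such as hoard* is approximated with the regular expression hoard\w*.

```python
import re
from collections import defaultdict
from datetime import date

def freq_by_period(docs, pattern, start=date(2020, 1, 1), days=10):
    """Return {period index: hits per million tokens} for (date, text) pairs."""
    rx = re.compile(pattern)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, text in docs:
        period = (day - start).days // days      # 0 = first 10 days, 1 = next 10, ...
        tokens = re.findall(r"\w+", text.lower())
        totals[period] += len(tokens)
        hits[period] += sum(1 for t in tokens if rx.fullmatch(t))
    return {p: hits[p] / totals[p] * 1_000_000 for p in totals if totals[p]}

# Toy input, invented for this example
docs = [
    (date(2020, 1, 5), "people hoard pasta"),
    (date(2020, 1, 15), "hoarding is bad ok"),
]
result = freq_by_period(docs, r"hoard\w*")
print(result)
```

Normalizing per million tokens keeps periods comparable even though the amount of text per period differs.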

You can also look at collocates, to see what is being said about a certain topic, such as (verbs near) virus, or any word near ban (v), stockpile, disinfect*, or remotely. And you can even see and compare the collocates of a word in 10-day periods since Jan 2020.
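As a rough illustration of what a collocate search does, the sketch below counts the words that occur within a fixed window around a node word. It is only a toy: the function name collocates, the window size, and the regex tokenizer are assumptions, and it does no part-of-speech tagging, so a query like "verbs near virus" would need a tagger on top of this.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count words within `span` tokens to the left or right of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Toy input, invented for this example
sample = "stockpile food now stockpile food later"
near = collocates(sample, "stockpile", span=1)
print(near.most_common())
```

Running the same function on texts from two different periods and comparing the resulting counts gives a simple version of the period-by-period collocate comparison.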

As with most online corpora, the Coronavirus Corpus offers re-sortable, PoS-colored Keyword in Context (KWIC) concordance views for any word or phrase.
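For those unfamiliar with the format, a KWIC view lists each occurrence of a word with a few words of context on either side, and re-sorting means ordering those lines by the left or right context. The following is a minimal sketch under simple assumptions (a hypothetical kwic function, a regex tokenizer, no PoS coloring), not the corpus interface itself.

```python
import re

def kwic(text, keyword, width=4):
    """Return (left context, node, right context) for each hit of `keyword`."""
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Toy input, invented for this example
sample = "The virus spread quickly. Officials said the virus mutates slowly."
rows = kwic(sample, "virus", width=2)

# "Re-sort" the concordance lines by the words to the right of the node
for left, node, right in sorted(rows, key=lambda r: r[2]):
    print(f"{left:>20} [{node}] {right}")
```

Sorting on r[0] instead would order the lines by left context.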

You can also compare different time periods, to see how our view of things has changed over time. A few examples: phrases with social * or economic * that were more common in Jan/Feb than in Apr/May, words near BAN or OBEY that were more common in Apr/May than in Jan/Feb, or all nouns that were much more common in late April 2020 than in March 2020.

The corpus allows you to compare across the 20 countries in the corpus (US, UK, Australia, India, etc.), to see what is being said about the coronavirus in each of these countries. You can also quickly and easily create "Virtual Corpora" for particular topics, based on keywords in the text, country, date, publication source, and more.

Finally, full-text data from the corpus will soon be available on a "subscription" basis, where you can download nearly all of the new data every day, week, or month -- just as with the other corpora from English-Corpora.org (see https://www.corpusdata.org).

We hope that the corpus will be of use to you in your research and teaching.

Mark Davies
English-Corpora.org


Linguistic Field(s): Computational Linguistics; Lexicography; Text/Corpus Linguistics

Subject Language(s): English (eng)


Page Updated: 15-May-2020