LINGUIST List 25.1708

Fri Apr 11 2014

FYI: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)

Editor for this issue: Uliana Kazagasheva

Date: 10-Apr-2014
From: Mark Davies
Subject: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)

You can now download full-text data for the following two corpora:

Corpus of Contemporary American English (COCA).
440 million words of downloadable text (190,000 separate texts). Balanced for genre: about 88 million words each of spoken, fiction, magazine, newspaper, and academic. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.

Corpus of Global Web-Based English (GloWbE).
1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which makes it a good source of very informal language.

With the full-text data from either corpus, you will have the actual corpora on your own computer. As a result, you can do many things that would be difficult or impossible with the standard web interface, such as complex and time-consuming syntactic and semantic searches, sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on. You can also generate word frequency lists (e.g. top 100,000 words, by (sub-)genre), collocates (millions of pairs), and n-grams (hundreds of millions of strings).
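
As a rough illustration of the frequency lists and n-grams mentioned above, here is a minimal Python sketch that works on any plain text. The tokenizer, the function names, and the sample sentence are assumptions made for this example; they are not part of the corpus distribution, and a real study would likely use the corpus's own tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens.

    This is a deliberately naive tokenizer (an assumption for this sketch),
    not the tokenization used by the corpora themselves.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_list(items, top=None):
    """Return (item, count) pairs, most frequent first."""
    return Counter(items).most_common(top)

# Hypothetical sample text standing in for a corpus file.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)

word_freq = frequency_list(tokens, top=3)       # word frequency list
bigram_freq = frequency_list(ngrams(tokens, 2), top=1)  # most common 2-gram

print(word_freq)
print(bigram_freq)
```

For corpora of this size it would probably be wiser to read the files line by line and update one Counter incrementally, rather than holding all tokens in memory at once.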

The data comes in three different formats: data for relational databases, word/lemma/PoS (vertical), and linear text (horizontal). When you obtain the data, you have the rights to any and all of these formats.
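
To make the relationship between the vertical and horizontal formats concrete, here is a small Python sketch. It assumes, purely for illustration, that the vertical format has one token per line with three tab-separated columns (word, lemma, PoS); the actual column layout of the downloadable files may differ, and the sample rows and tags below are invented.

```python
from collections import Counter

def parse_vertical(lines):
    """Yield (word, lemma, pos) tuples from tab-separated lines.

    Assumed layout: word<TAB>lemma<TAB>pos, one token per line.
    Lines that do not have exactly three columns are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield tuple(parts)

def to_horizontal(rows):
    """Join the word column back into linear (horizontal) text."""
    return " ".join(word for word, _, _ in rows)

# Invented sample rows standing in for a vertical-format file.
sample_lines = [
    "The\tthe\tat\n",
    "cats\tcat\tnn2\n",
    "slept\tsleep\tvvd\n",
]

rows = list(parse_vertical(sample_lines))
lemma_counts = Counter(lemma for _, lemma, _ in rows)  # frequency by lemma
linear = to_horizontal(rows)

print(linear)
print(lemma_counts.most_common())
```

The same rows could be loaded into a relational database table with word, lemma, and PoS columns, which is roughly what the database format is meant to support.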

Mark Davies

Linguistic Field(s): Computational Linguistics; Lexicography; Text/Corpus Linguistics

Subject Language(s): English (eng)

Page Updated: 11-Apr-2014