LINGUIST List 24.1694

Tue Apr 16 2013

FYI: New Corpus: GloWbE 1.9 Billion Words, 20 Countries

Editor for this issue: Brent Miller

Date: 15-Apr-2013
From: Mark Davies
Subject: New Corpus: GloWbE 1.9 Billion Words, 20 Countries

We have just released a new corpus, which may be of interest to some of you:

GloWbE: Corpus of Global Web-Based English

This new corpus is 1.9 billion words in size, and is based on 1.8 million web pages (including blogs) from 20 different English-speaking countries (US, UK, NZ, India, Hong Kong, etc). GloWbE is 4-5 times as large as COCA, and about 20 times as big as the BNC, and thus yields much richer data for some low-frequency constructions.

The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 different countries. You can also compare any features in two sets of dialects, such as British and American English (in more than 775 million words of text for just these two dialects). Or you could just limit your search to one or two countries (e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million)), and you'll still be searching the largest online corpus for most of these twenty countries.
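Because the country sub-corpora differ so much in size, raw hit counts are not directly comparable across countries; a rate per million words is the usual way to put them on one scale. The following is a minimal sketch of that normalization, not part of the corpus interface. The sub-corpus sizes are the ones quoted above, while the hit counts and the names `per_million` and `hypothetical_hits` are invented purely for illustration.

```python
# Sub-corpus sizes in words, as quoted in this announcement.
CORPUS_SIZES = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Return the frequency of an item per million words of text."""
    return raw_count * 1_000_000 / corpus_size


# Hypothetical raw hit counts for some phrase. These are made-up numbers
# for illustration only, NOT actual GloWbE results.
hypothetical_hits = {"Australia": 296, "South Africa": 180, "Singapore": 43}

# Normalize each country's count by the size of its own sub-corpus.
normalized = {
    country: per_million(count, CORPUS_SIZES[country])
    for country, count in hypothetical_hits.items()
}

# List countries from highest to lowest normalized frequency.
for country, rate in sorted(normalized.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:<13} {rate:5.2f} per million words")
```

With these invented counts, Australia has the most raw hits but South Africa has the highest rate per million words, which is the kind of difference that normalization is meant to expose.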

This new corpus of World English adds nicely to our other corpora, which allow you to examine variation in English in ways that are perhaps not possible with other corpora:

-- historical: COHA, TIME, COCA (recent change), Google Books (Advanced)
-- genres: COCA and BYU-BNC
-- dialects: GloWbE, and side-by-side comparisons of corpora

Mark Davies
Professor of Linguistics / Brigham Young University
Corpus design and use // Linguistic databases
Historical linguistics // Language variation
English, Spanish, and Portuguese

Linguistic Field(s): Computational Linguistics; Lexicography; Text/Corpus Linguistics
Subject Language(s): English (eng)

Page Updated: 16-Apr-2013