
FYI: New Corpus: GloWbE 1.9 Billion Words, 20 Countries


Author: Mark Davies

Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Lexicography

Subject Language(s): English

FYI Body: We have just released a new corpus at corpus.byu.edu, which may be of interest to some of you:

GloWbE: Corpus of Global Web-Based English
http://corpus2.byu.edu/glowbe/

This new corpus is 1.9 billion words in size, and is based on 1.8 million web pages (including blogs) from 20 different English-speaking countries (US, UK, NZ, India, Hong Kong, etc.). GloWbE is 4-5 times as large as COCA and about 20 times as large as the BNC, and thus yields much richer data for some low-frequency constructions.

The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 different countries. You can also compare any feature across two sets of dialects, such as British and American English (more than 775 million words of text for just these two dialects). Or you could limit your search to one or two countries, e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million), and you'll still be searching the largest online corpus for most of these twenty countries.
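Because the country sub-corpora differ so much in size, raw hit counts are not directly comparable across countries. A minimal sketch of per-million-word normalization follows. The sub-corpus sizes are the approximate figures quoted above; the hit counts are invented purely for illustration and are not GloWbE results, and the function and variable names are hypothetical rather than part of any corpus.byu.edu interface.

```python
# Approximate sub-corpus sizes in words, as quoted in this announcement.
CORPUS_SIZES = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


# Hypothetical raw hit counts for some search string.
# These numbers are made up for illustration; they are not corpus data.
hypothetical_hits = {
    "Australia": 2960,
    "South Africa": 1350,
    "Singapore": 430,
}

# Normalize each country's count by the size of its sub-corpus.
normalized = {
    country: per_million(count, CORPUS_SIZES[country])
    for country, count in hypothetical_hits.items()
}

# List countries from highest to lowest normalized frequency.
ranking = sorted(normalized, key=normalized.get, reverse=True)

for country in ranking:
    print(f"{country}: {normalized[country]:.1f} per million words")
```

In this made-up example Australia has the most raw hits, but South Africa ranks first once the counts are scaled by corpus size, which is why normalized frequencies are the safer basis for cross-country comparison.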

This new corpus of World English adds nicely to the other corpora from corpus.byu.edu, which allow you to examine variation in English in ways that are perhaps not possible with other corpora (see http://corpus.byu.edu/variation.asp):

-- historical: COHA, TIME, COCA (recent change), Google Books (Advanced)
-- genres: COCA and BYU-BNC
-- dialects: GloWbE, and side-by-side comparisons of corpora

Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
Corpus design and use // Linguistic databases
Historical linguistics // Language variation
English, Spanish, and Portuguese
