FYI: New Corpus: GloWbE 1.9 Billion Words, 20 Countries

Author: Mark Davies

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics

Subject Language(s): English

FYI Body: We have just released a new corpus at corpus.byu.edu, which may be of interest to some of you:

GloWbE: Corpus of Global Web-Based English

This new corpus is 1.9 billion words in size, and is based on 1.8 million web pages (including blogs) from 20 different English-speaking countries (US, UK, NZ, India, Hong Kong, etc.). GloWbE is 4-5 times as large as COCA, and about 20 times as big as the BNC, and thus yields much richer data for some low-frequency constructions.

The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 different countries. You can also compare any features in two sets of dialects, such as British and American English (in more than 775 million words of text for just these two dialects). Or you could just limit your search to one or two countries (e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million)), and you'll still be searching the largest online corpus for most of these twenty countries.
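
Because the country sub-corpora differ so much in size, raw hit counts from different countries are not directly comparable; a common approach is to normalize them to occurrences per million words. The following is a minimal sketch of that calculation, not a description of how the GloWbE interface itself works. The corpus sizes are the three quoted above; the function name `per_million` and the hit counts are hypothetical, made up purely for illustration.

```python
# Sub-corpus sizes in words, as quoted in the announcement.
CORPUS_SIZE = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_count: int, corpus_words: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_words


# Hypothetical raw counts for some search string.
# These are invented numbers, NOT real GloWbE results.
hypothetical_hits = {
    "Australia": 2960,
    "South Africa": 450,
    "Singapore": 1290,
}

# Normalized frequency per country, rounded to one decimal place.
normalized = {
    country: round(per_million(hits, CORPUS_SIZE[country]), 1)
    for country, hits in hypothetical_hits.items()
}

for country, freq in normalized.items():
    print(f"{country:<13}{freq:>6.1f} per million")
```

With these invented counts, Australia has by far the most raw hits, yet Singapore has the highest normalized frequency, which is why the comparison should be made on the per-million figures.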

This new corpus of World English adds nicely to the other corpora from corpus.byu.edu, which allow you to examine variation in English in ways that are perhaps not possible with other corpora (see http://corpus.byu.edu/variation.asp):

-- historical: COHA, TIME, COCA (recent change), Google Books (Advanced)
-- genres: COCA and BYU-BNC
-- dialects: GloWbE, and side-by-side comparisons of corpora

Mark Davies
Professor of Linguistics / Brigham Young University
Corpus design and use // Linguistic databases
Historical linguistics // Language variation
English, Spanish, and Portuguese
