LINGUIST List 22.2067

Fri May 13 2011

FYI: 155 Billion Word Corpus: American English

Editor for this issue: Danielle St. Jean <danielle@linguistlist.org>


        1.     Mark Davies, 155 Billion Word Corpus: American English

Message 1: 155 Billion Word Corpus: American English
Date: 12-May-2011
From: Mark Davies <mark_davies@byu.edu>
Subject: 155 Billion Word Corpus: American English

We're pleased to announce a new corpus -- the Google Books (American English) corpus: http://googlebooks.byu.edu/

This corpus is based on the American English portion of the Google Books data (see http://ngrams.googlelabs.com and especially http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words (155,000,000,000) in more than 1.3 million books from the 1810s-2000s (including 62 billion words from just 1980-2009).

The corpus has most of the functionality of the other corpora from http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the BNC), including: searching by part of speech, wildcards, and lemma (and thus advanced syntactic searches), synonyms, collocate searches, frequency by decade (tables listing each individual string, or charts for total frequency), comparisons of two historical periods (e.g. collocates of "women" or "music" in the 1800s and the 1900s), and more.

This American English corpus is just one of seven Google Books-based corpora that we hope to create in the next year or two (contingent on funding, which we are applying for in June 2011). If funded, the other corpora will include British English, English from the 1500s-1700s, and corpora of Spanish, French, and German (see the listing at http://ngrams.googlelabs.com/datasets). Each of these corpora will be based on at least 50 billion words of data, and they should represent a nice addition to existing resources.

The Google Books (American English) corpus is freely available at http://googlebooks.byu.edu, and we hope that it is of value to you in your research and teaching.

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
Subject Language(s): English (eng)

Page Updated: 13-May-2011