LINGUIST List 22.2067

Fri May 13 2011

FYI: 155 Billion Word Corpus: American English

Editor for this issue: Danielle St. Jean <danielle@linguistlist.org>

To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.cfm.
        1.     Mark Davies, 155 Billion Word Corpus: American English

Message 1: 155 Billion Word Corpus: American English
Date: 12-May-2011
From: Mark Davies <mark_davies@byu.edu>
Subject: 155 Billion Word Corpus: American English

We're pleased to announce a new corpus -- the Google Books
(American English) corpus: http://googlebooks.byu.edu/

This corpus is based on the American English portion of the Google
Books data (see http://ngrams.googlelabs.com and especially
http://ngrams.googlelabs.com/datasets). It contains 155 *billion* words
(155,000,000,000) in more than 1.3 million books from the 1810s-
2000s (including 62 billion words from just 1980-2009).

The corpus has most of the functionality of the other corpora from
http://corpus.byu.edu (e.g. COCA, COHA, and our interface to the
BNC), including: searching by part of speech, wildcards, and lemma
(and thus advanced syntactic searches), synonyms, collocate
searches, frequency by decade (tables listing each individual string, or
charts for total frequency), and comparisons of two historical periods
(e.g. collocates of "women" or "music" in the 1800s and the 1900s).
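
As a rough illustration of what a "frequency by decade" table involves,
the sketch below sums yearly counts into decade totals. It assumes
tab-separated rows laid out as ngram, year, match_count, page_count,
volume_count (the layout used by the 2009 Google Books n-gram files).
The function name and the sample rows are hypothetical, and this is not
the code behind the corpus interface.

```python
from collections import defaultdict


def frequency_by_decade(rows, target):
    """Sum match counts for `target` per decade.

    `rows` is an iterable of tab-separated strings, assumed to hold
    ngram, year, match_count, page_count, volume_count.
    """
    totals = defaultdict(int)
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if ngram == target:
            decade = (year // 10) * 10  # e.g. 1987 -> 1980
            totals[decade] += match_count
    return dict(sorted(totals.items()))


# Made-up sample rows, for illustration only (not real corpus counts).
sample = [
    "music\t1984\t10\t9\t5",
    "music\t1987\t15\t12\t7",
    "music\t1991\t20\t18\t9",
    "women\t1991\t30\t25\t11",
]

print(frequency_by_decade(sample, "music"))
```

Comparing two historical periods then amounts to running the same
aggregation over each period and setting the totals side by side.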

This American English corpus is just one of seven Google Books-based
corpora that we hope to create in the next year or two (contingent on
funding, which we are applying for in June 2011). If funded, the other
corpora will include British English, English from the 1500s-1700s, and
corpora of Spanish, French, and German (see the listing at
http://ngrams.googlelabs.com/datasets). Each of these corpora will be
based on at least 50 billion words of data, and they should represent a
nice addition to existing resources.

The Google Books (American English) corpus is freely available at
http://googlebooks.byu.edu, and we hope that it is of value to you in
your research and teaching.

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics

Subject Language(s): English (eng)


Page Updated: 13-May-2011
