LINGUIST List 25.1708
Fri Apr 11 2014
FYI: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)
Editor for this issue: Uliana Kazagasheva <ulianalinguistlist.org>
Date: 10-Apr-2014
From: Mark Davies <mark_davies byu.edu>
Subject: Full-Text Corpora: Contemporary American English (COCA) and Global Web-Based English (GloWbE)
At http://corpus.byu.edu/full-text/ you can now download full-text data for the following two corpora:
Corpus of Contemporary American English (COCA).
440 million words of downloadable text (190,000 separate texts). Balanced for genre: about 88 million words each of spoken, fiction, magazine, newspaper, and academic. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.
Corpus of Global Web-Based English (GloWbE).
1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which provide very informal language.
Of course, with the full-text data from either corpus, you will have the actual corpora on your computer. As a result, you can do many things that would be difficult or impossible with the standard web interface, such as complex and time-consuming syntactic and semantic searches, sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on. You can also generate word frequency lists (e.g. top 100,000 words, by (sub-)genre), collocates (millions of pairs), and n-grams (hundreds of millions of strings).
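As a rough illustration of the frequency lists and n-grams mentioned above, here is a minimal Python sketch that counts words and bigrams in a plain-text string. The tokenizer, the function names, and the sample sentence are all assumptions made for this example; they are not part of the corpus distribution, and real corpus files would need their own handling.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out alphabetic tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return all n-grams over the token sequence as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def frequency_tables(text, n=2):
    """Build a word-frequency Counter and an n-gram Counter for one text."""
    tokens = tokenize(text)
    return Counter(tokens), Counter(ngrams(tokens, n))

# Hypothetical sample text standing in for a downloaded corpus file.
sample = "The cat sat on the mat. The cat slept."
word_freq, bigram_freq = frequency_tables(sample)
print(word_freq.most_common(2))
print(bigram_freq.most_common(1))
```

For corpus-sized input you would probably read the files in chunks and update the counters incrementally rather than load everything into memory at once.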
The data comes in three different formats (see samples): data for relational databases, word/lemma/PoS (vertical), and linear text (horizontal). When you obtain the data, you have the rights to any and all of these formats.
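To give a feel for working with the word/lemma/PoS (vertical) format, the Python sketch below counts lemmas, optionally restricted to one part of speech. It assumes one token per line with three tab-separated columns (word, lemma, PoS); the announcement does not specify the actual file layout, so check the samples before relying on this. The function name and the tags in the sample are made up for illustration.

```python
import io
from collections import Counter

def lemma_counts(lines, pos_prefix=None):
    """Count lemmas in word/lemma/PoS rows, optionally keeping only tags with a given prefix.

    Assumes tab-separated columns in the order word, lemma, PoS (an assumption,
    not a documented layout).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        lemma, pos = fields[1], fields[2]
        if pos_prefix is None or pos.startswith(pos_prefix):
            counts[lemma.lower()] += 1
    return counts

# Hypothetical rows standing in for a vertical-format file.
sample = io.StringIO("The\tthe\tat\ndogs\tdog\tnn2\nbarked\tbark\tvvd\nDogs\tdog\tnn2\n")
noun_counts = lemma_counts(sample, pos_prefix="nn")
print(noun_counts.most_common())
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the in-memory sample.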
Mark Davies
http://davies-linguistics.byu.edu/
Linguistic Field(s): Computational Linguistics; Lexicography; Text/Corpus Linguistics
Subject Language(s): English (eng)
Page Updated: 11-Apr-2014