LINGUIST List 21.3572

Wed Sep 08 2010

FYI: Corpus of Historical American English

Editor for this issue: Elyssa Winzeler <elyssa@linguistlist.org>

        1.    Mark Davies, Corpus of Historical American English

Message 1: Corpus of Historical American English
Date: 07-Sep-2010
From: Mark Davies <mark_davies@byu.edu>
Subject: Corpus of Historical American English
We are pleased to announce the release of the 400 million word Corpus of
Historical American English (1810-2009). The corpus has been funded by a
generous grant from the US National Endowment for the Humanities (NEH), and
it is freely available at http://corpus.byu.edu/coha/. COHA is the largest
structured corpus of historical English, and it contains more than 100,000
texts from fiction, popular magazines, newspapers, and non-fiction books,
with the same genre balance decade by decade from the 1810s-2000s.

COHA is also related to other large corpora that we have created or
modified, including the 410 million word Corpus of Contemporary American
English (COCA), the 100 million word TIME Magazine Corpus (1920s-2000s),
the 100 million word British National Corpus (our architecture and
interface), the 100 million word NEH-funded Corpus del Español
(1200s-1900s), and the 45 million word NEH-funded Corpus do Português
(1300s-1900s). For information on these corpora, see http://corpus.byu.edu.

COHA allows you to quickly and easily search the 400 million words of text
from the 1810s-2000s to see how words, phrases and grammatical
constructions have increased or decreased in frequency, how words have
changed meaning over time, and how stylistic changes have taken place in
the language. Users can see the overall (normalized) frequency by decade
and year, as well as the frequency of each matching string, by decade.
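
As a rough illustration of what a normalized frequency is: decades with more
text will show higher raw counts even when a word is no more common, so counts
are usually scaled to a fixed amount of text. The sketch below assumes a
per-million-words scale, which is a common convention and not necessarily the
one COHA uses. The decade sizes, the counts, and the function name
`per_million` are invented for illustration and are not COHA data.

```python
def per_million(raw_count, total_words):
    """Scale a raw count to occurrences per million words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count * 1_000_000 / total_words


# Invented numbers for illustration only; these are NOT actual COHA figures.
decade_sizes = {"1810s": 1_000_000, "1900s": 20_000_000, "2000s": 30_000_000}
raw_counts = {"1810s": 50, "1900s": 400, "2000s": 150}

# Normalize each decade's raw count by that decade's total word count.
normalized = {
    decade: per_million(raw_counts[decade], decade_sizes[decade])
    for decade in decade_sizes
}

for decade, freq in normalized.items():
    print(f"{decade}: {freq:.1f} per million words")
```

With these made-up numbers the raw count is highest in the 1900s, but the
normalized frequency is highest in the 1810s, which is why the two can point
to different trends.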

The following are just a small sample of an unlimited number of queries,
but they should give some idea of what the corpus can do.

* Lexical change: the rise and fall of words and phrases like the following:
- (decrease since the 1800s): bosom, folly, grieved, bestow*, quaint,
beauteous, fellow, sublime, lad, many a time, of no little, for (conj)
- (an increase and then decrease): mustn't, naughty, boyish, agog, toddle,
far-out, famed, wangle, swell (adj), lousy
- (an increase to the present time): a lot of, unleash, sexual, calm down,
screw up, freak out, mommy, skills, frustrating
- (words reflecting historical and cultural shifts): emancipation,
steamship, telegraph, flapper*, fascis*, teenage*, communis*, global warming

* Stylistic change (which gives the flavor of a different time period).
Examples from the 1800s, which have decreased since then, are: [so ADJ as
to V] (so good as to show me), [PRON be but] (they are but the last
examples), [have quite V-ed] (until she had quite finished), [NOUN be that
of] (her dress was that of a beggar), or [a most ADJ NOUN] (a most helpful

* Morphological change: how word roots, prefixes, and suffixes have been
used over time, including comparisons between different periods,
such as -heart- (1800s noble-hearted, 1900s heart-stopping), home- (1800s
homebred, 1900s homeowner), or -able adjectives (1800s placable, 1900s

* Syntactic change (since the corpus is tagged and lemmatized), like [end
up V-ing], [going to V], [V PRON into V-ing] (talked them into going),
phrasal verbs with [up] (make up, show up), post-verbal negation with
[need] (needn't mention), the 'get' passive (get hired), sentence-initial
'hopefully', and semi-modals like [need to] and [have to].

* Semantic change: how the meaning or usage of words has changed over
time, by looking at changes in collocates (co-occurring words), like
[sexual, gay, chip, engine, or web]. This can also signal cultural changes
over time, such as nouns used with [woman] in the 1930s-50s compared to the
1960s-80s (fabrics, hips // liberation, abortion), or nouns used with
[problem] in the 1810s-1920s compared to the 1920s-2000s (railway, trust //
drugs, pollution).

* Lexical change (again): users can also have the corpus generate a list of
words that were used more in one period than another, even when they don't
know what the specific words might be. For example, the corpus can
generate lists of verbs in the 1970s-2000s compared to the 1930s-1960s
(download, recycle // effectuate, redound), adjectives in the 1970s-2000s
and the 1930s-1960s (online, affordable // leftist, communistic), or -ly
adverbs in the 1900s and the 1800s (basically, reportedly // despondingly,

As can be seen, the corpus allows research on a wide range of phenomena in
400 million words of text from the last two centuries of American English.
The corpus is freely available at http://corpus.byu.edu/coha/, and we
invite you to use it for your research and teaching.

Linguistic Field(s): Historical Linguistics; Lexicography; Text/Corpus Linguistics

