LINGUIST List 27.2261

Wed May 18 2016

FYI: BYU Corpora: NOW, CORE, New Interface

Editor for this issue: Ashley Parker <ashley@linguistlist.org>


Date: 17-May-2016
From: Mark Davies <mark_davies@byu.edu>
Subject: BYU Corpora: NOW, CORE, New Interface

New from BYU corpora (http://corpus.byu.edu)

1. New corpus interface (see http://corpus.byu.edu/updates2016.asp?c=n)

The new interface is much more mobile-friendly (smartphones and tablets). It has a cleaner, simpler layout; more helpful "context-sensitive" help files; and a simpler, more intuitive search syntax (e.g. EAT * NOUN = eat the cake, ate some strawberries; =expensive CLOTHES = pricey coat, classy jeans).
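As a rough illustration of how a query such as EAT * NOUN can be read, the sketch below matches a query against a short list of tagged tokens. It is not the BYU implementation: the (word, lemma, part-of-speech) tuples, the small tag set, and the function names are all assumptions made for this example, and the =expensive synonym operator is left out.

```python
# Hypothetical tag set for this sketch; the real corpus tag set is not shown here.
POS_TAGS = {"NOUN", "VERB", "ADJ"}


def token_matches(slot, token):
    """Check one query slot against one (word, lemma, pos) token."""
    word, lemma, pos = token
    if slot == "*":
        return True                      # wildcard: any single word
    if slot in POS_TAGS:
        return pos == slot               # part-of-speech slot, e.g. NOUN
    if slot.isupper():
        return lemma == slot.lower()     # capitalized slot: all forms of a lemma
    return word.lower() == slot.lower()  # otherwise: the exact word form


def find_matches(query, tagged_tokens):
    """Return every token sequence that matches the query, slot by slot."""
    slots = query.split()
    hits = []
    for start in range(len(tagged_tokens) - len(slots) + 1):
        window = tagged_tokens[start:start + len(slots)]
        if all(token_matches(s, t) for s, t in zip(slots, window)):
            hits.append(" ".join(word for word, _, _ in window))
    return hits


# Toy sentence, tagged by hand for the example.
SENTENCE = [
    ("She", "she", "PRON"),
    ("ate", "eat", "VERB"),
    ("some", "some", "DET"),
    ("strawberries", "strawberry", "NOUN"),
]

print(find_matches("EAT * NOUN", SENTENCE))
```

Here the capitalized EAT is treated as a lemma, so it matches "ate" as well as "eat".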

The new interface also lets users quickly create and use virtual corpora, such as texts from Cosmopolitan or Astronomy magazines (COCA), texts dealing with the New Deal from 1932-1938 (COHA), or newspaper articles from September 2015 dealing with the European refugee crisis (NOW). Users can search within a virtual corpus, compare the frequency of words, phrases, and constructions across their different virtual corpora, and extract keywords from a virtual corpus.


2. NOW corpus (http://corpus.byu.edu/now). A corpus of nearly three billion words, from 2010 through ... yesterday. Approximately 4 million words (10,000 articles) are added to the corpus every day (~125 million words a month, 1.5 billion words a year), which means that you are not limited to corpus results from several years (or even decades) ago.
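The growth figures can be checked with a few lines of arithmetic. The sketch below starts from the stated 4 million words a day; the 30-day month and 365-day year are assumptions, which is why the results come out slightly below the rounded figures of ~125 million and 1.5 billion.

```python
WORDS_PER_DAY = 4_000_000  # approximate daily growth stated in the announcement

words_per_month = WORDS_PER_DAY * 30   # assumes a 30-day month
words_per_year = WORDS_PER_DAY * 365   # assumes a 365-day year

print(f"{words_per_month:,} words a month")
print(f"{words_per_year:,} words a year")
```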

The following are just a few examples of what the corpus can do. Click on the "tour" icon at the top of the page for many more examples.

With such an up-to-date corpus, you can look for neologisms like fracklog, swatting, mommy porn, catfishing, trigger warning, and nomophobia; find words occurring with digital NOUN or data NOUN; or search for substrings, such as *fest, *sexual*, *phobia, *alypse, *geddon, or *ware (with the frequency of each word by year or month).
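For readers who want to try this kind of substring search on their own word lists, here is a minimal sketch using Python's standard fnmatch module, whose * wildcard behaves like the one in the patterns above. The toy vocabulary and the function name are assumptions for the example, not NOW corpus data or the corpus's own search engine.

```python
from fnmatch import fnmatchcase


def match_wildcard(pattern, vocabulary):
    """Return the words that match a *-style wildcard pattern, sorted."""
    pattern = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern))


# Hypothetical toy vocabulary; not real corpus data.
TOY_VOCABULARY = [
    "beerfest", "festival", "snowpocalypse", "carmageddon",
    "malware", "software", "nomophobia",
]

for pattern in ("*fest", "*alypse", "*geddon", "*ware"):
    print(pattern, match_wildcard(pattern, TOY_VOCABULARY))
```

Note that *fest matches "beerfest" but not "festival", because the wildcard appears only before the fixed string.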

You can also see the frequency of words and phrases by "week", to see when a particular topic was discussed the most since 2010 (for example: Paris attacks, Ashley Madison, or tsunami). You can also find the keywords for a given day (including yesterday), or by week, month, or year. For example, you can find the keywords for Apr 4 2016 (Panama papers: offshore, taxes) or Mar 22 2016 (Brussels airport: bomb, terrorists).
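The idea of frequency by week can be sketched in a few lines: group dated texts by ISO week and count the occurrences of a phrase in each group. The texts, dates, and function name below are invented for the example, and simple substring counting stands in for whatever tokenization the corpus actually uses.

```python
from collections import Counter
from datetime import date


def weekly_frequency(phrase, dated_texts):
    """Count occurrences of a phrase per (ISO year, ISO week)."""
    needle = phrase.lower()
    counts = Counter()
    for day, text in dated_texts:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        counts[(iso[0], iso[1])] += text.lower().count(needle)
    return counts


# Hypothetical dated texts; not real corpus articles.
TOY_TEXTS = [
    (date(2015, 11, 13), "Reports of the Paris attacks are coming in."),
    (date(2015, 11, 14), "The Paris attacks dominate the news. Paris attacks latest."),
    (date(2015, 11, 16), "World leaders respond to the Paris attacks."),
]

for week, count in sorted(weekly_frequency("Paris attacks", TOY_TEXTS).items()):
    print(week, count)
```

The first two texts fall in the same ISO week, so their counts are added together; the third starts a new week.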

It is also possible to compare across different "sections" of the corpus -- either time or country. For example, *gate (2015-2016 vs 2010-2011: deflategate, dieselgate, etc.), data NOUN (2015-2016 vs 2010-2011: data lake, data grid), or ADJ collocates of Obama (2015-2016 vs 2010-2011).

Finally, you can quickly and easily create and then use "virtual corpora". For example, in just 5-10 seconds you could create a million-word corpus based on texts from September 2015 dealing with refugees in Europe. As discussed in #1 above, you can then search within the virtual corpora, compare the frequency of words and phrases across different virtual corpora, and generate "keyword" lists from a virtual corpus (e.g. asylum, war-torn, resettle).
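One simple way to think about keyword lists is as words that are much more frequent in a virtual corpus than in a reference corpus. The sketch below ranks words by a smoothed relative-frequency ratio. This scoring method is an assumption chosen for illustration (the announcement does not say how the keyword lists are computed), and the token lists and function name are invented.

```python
from collections import Counter


def keywords(target_tokens, reference_tokens, top_n=3, min_count=2):
    """Rank target words by relative frequency against a reference corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    target_total = sum(target.values())
    reference_total = sum(reference.values())

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue                     # skip words too rare to rank
        target_rate = count / target_total
        # Add-one smoothing so unseen reference words do not divide by zero.
        reference_rate = (reference[word] + 1) / (reference_total + 1)
        scores[word] = target_rate / reference_rate

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical token lists standing in for a virtual corpus and a reference corpus.
VIRTUAL = "asylum seekers asylum the the resettle resettle the".split()
REFERENCE = "the the the the weather the report".split()

print(keywords(VIRTUAL, REFERENCE, top_n=2))
```

Common words such as "the" score low because they are frequent in both corpora, while topic words such as "asylum" and "resettle" rise to the top.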


3. CORE corpus (http://corpus.byu.edu/core). This corpus is the result of a US National Science Foundation grant to Douglas Biber, Mark Davies, and Jesse Egbert for "A Linguistic Taxonomy of English Web Registers". The corpus contains more than 50 million words of text from the web, and it carefully categorizes the 50,000 texts into 30+ different web registers (personal blogs, interviews, "description with intent to sell", "how-to" pages, sports reporting, etc.). This is quite different from other very large corpora that simply present huge amounts of data from web pages as giant "blobs", with no real attempt to categorize them into linguistically distinct registers.


We hope that these new resources will be of value to you in your research and teaching.

Best,

Mark Davies
BYU Corpora
--
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


Linguistic Field(s): Computational Linguistics; Lexicography; Text/Corpus Linguistics

Subject Language(s): English (eng)
Language Family(ies): Indo-European

Page Updated: 18-May-2016