LINGUIST List 22.2068

Fri May 13 2011

Sum: Genre-Specific Corpora

Editor for this issue: Danielle St. Jean <daniellelinguistlist.org>

To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.cfm.
1. Marina Santini, Genre-Specific Corpora

Message 1: Genre-Specific Corpora
Date: 11-May-2011
From: Marina Santini <MarinaSantini.MSgmail.com>
Subject: Genre-Specific Corpora

Query for this summary posted in LINGUIST Issue: 22.1852
Editor's Note: Please note that some URLs included in this submission
may carry over onto a second line. If you want to go to a specific
website provided in this submission, please be sure that you have
copied the whole URL.

Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet
Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns,
Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.

Suggested corpora and resources (in English unless stated otherwise;
not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found

- Genre: chats and Switchboard conversations = Samples of the
Switchboard corpus and the NPS Chat corpus are included in the NLTK
data (http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS
Chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a
POS-tagged chat corpus, and the Switchboard corpus
(http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a corpus of
telephone conversations.
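
As a rough illustration of how the POS-tagged NPS Chat sample could be
used once it is installed through NLTK, the sketch below computes the
relative frequency of each tag. It assumes NLTK is installed and the
'nps_chat' data package has been downloaded; the function names
load_nps_posts and tag_profile are made up for this example and are not
part of NLTK.

```python
from collections import Counter


def load_nps_posts():
    """Return NPS Chat posts as lists of (word, tag) pairs via NLTK.

    Assumption: NLTK is installed and the 'nps_chat' data package has
    been fetched beforehand, e.g. with nltk.download('nps_chat').
    """
    from nltk.corpus import nps_chat  # imported lazily; needs NLTK data
    return nps_chat.tagged_posts()


def tag_profile(tagged_posts):
    """Relative frequency of each POS tag across all posts."""
    counts = Counter(tag for post in tagged_posts for _word, tag in post)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}
```

Calling tag_profile(load_nps_posts()) would give one such profile for
the chat sample; the same function works on any iterable of tagged
posts, so profiles of different genres can be compared side by side.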

- The Linguistic Data Consortium has a good deal of telephone
conversation data: many files in a variety of languages. See
text (not for free)

- Genre: blogs = The Corporate weblogs dataset in TREC datasets
(http://ir.dcs.gla.ac.uk/test_collections/) is not for free. Helpful wiki:
- Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:

- The Göteborg Spoken Language Corpus and other corpora in
Swedish (http://spraakbanken.gu.se/)

- Genre: tweets = The twitter corpus associated with the paper
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = MicroBlog track
http://sites.google.com/site/trecmicroblogtrack/ (not for free)

- Genre: Newswires: Reuters' Newswires collections =

- Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/);
categorized Enron emails (http://sgi.nu/enron/corpora.php)
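
For anyone new to email corpora, here is a minimal sketch of reading
one message with the Python standard library. It assumes each message
is available as raw RFC 822 text (headers, a blank line, then the
body), which is how the CMU distribution appears to store them; the
function name summarize_message is invented for this example.

```python
from email import message_from_string
from email.policy import default


def _as_text(header):
    """Convert a header value to a plain str, keeping None for missing headers."""
    return str(header) if header is not None else None


def summarize_message(raw_text):
    """Parse one raw RFC 822 message and return a few basic fields."""
    msg = message_from_string(raw_text, policy=default)
    # Prefer the plain-text part; it may be absent in unusual messages.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return {
        "from": _as_text(msg["From"]),
        "to": _as_text(msg["To"]),
        "subject": _as_text(msg["Subject"]),
        "n_tokens": len(text.split()),  # whitespace tokens, a crude count
    }
```

Applied to every file in a mail folder, this yields a small table of
senders, subjects and message lengths that can serve as a first look at
the genre before any deeper processing.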

- Genre: emails = Junk email corpus

- Genre: FAQs = 200 FAQs

- In terms of words and concepts, there are two main resources for
English. The first is WordNet, originally from Princeton; it is included
in NLTK (and can also be obtained separately). It organizes English
words according to their relationships: synonym, hyponym, part of a
whole, etc. The other resource is the Word Association Norms, available
from the University of South Florida
(http://w3.usf.edu/FreeAssociation/).
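
The WordNet relations mentioned in this item can be explored through
NLTK roughly as follows. This is only a sketch: it assumes NLTK is
installed and the 'wordnet' data package has been downloaded, and the
function names describe_synset and lookup are made up for this example.

```python
def describe_synset(synset):
    """Collect the lemma names and a few relations of one WordNet synset."""
    def names(synsets):
        return sorted(s.name() for s in synsets)

    return {
        "synonyms": sorted(synset.lemma_names()),        # words in the same synset
        "hypernyms": names(synset.hypernyms()),          # more general concepts
        "hyponyms": names(synset.hyponyms()),            # more specific concepts
        "part_meronyms": names(synset.part_meronyms()),  # parts of the whole
    }


def lookup(word):
    """Describe every synset of a word.

    Assumption: the 'wordnet' data package has been fetched beforehand,
    e.g. with nltk.download('wordnet').
    """
    from nltk.corpus import wordnet as wn  # imported lazily; needs NLTK data
    return [describe_synset(s) for s in wn.synsets(word)]
```

For instance, lookup("car") would return one dictionary per sense of
"car", each listing its synonyms, hypernyms, hyponyms and parts.
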
- Article: Hella Koo Finding: Twitter Dialect -
- Genre: tweets = the suggestion is to use the Twitter API to crawl Twitter
- DiscoverText is a program you can use to scoop out Twitter feeds
really easily. Their website is here:
One can do a free 30 day trial and get a bunch of Twitter messages.

- Genre: tweets = The Edinburgh Tweets corpus has been withdrawn:

Linguistic Field(s): Computational Linguistics
                            Text/Corpus Linguistics


Page Updated: 13-May-2011
