Summary Details


Query:   Genre-Specific Corpora
Author:  Marina Santini
Linguistic Field(s):   Computational Linguistics; Text/Corpus Linguistics

Summary:   Editor's Note: Please note that some URLs included in this submission
may carry over onto a second line. If you want to go to a specific
website provided in this submission, please be sure that you have
copied the whole URL.

Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet
Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns,
Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.

Suggested corpora and resources (in English unless stated otherwise;
not all of them are free of charge)

Genre-specific corpora:
- Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)

- Genre: chatlogs = CODIAC chatlogs
(http://data.eol.ucar.edu/codiac/dss/id=92.124;
http://data.eol.ucar.edu/codiac/dss/id=88.044;
http://data.eol.ucar.edu/codiac/dss/id=107.010)

- Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
here:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

- Genre: chats and telephone conversations = Samples of the Switchboard
corpus and the NPS chat corpus are included in NLTK data
(http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS
chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a
POS-tagged chat corpus, and the Switchboard corpus
(http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a corpus of
telephone conversations.

- The Linguistic Data Consortium has a good deal of telephone
conversation data: many files in a variety of languages. See
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text
(not free of charge)

- Genre: blogs = The corporate weblogs dataset in the TREC datasets
(http://ir.dcs.gla.ac.uk/test_collections/) is not free of charge.
Helpful wiki: http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG

- Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/

- The Göteborg Spoken Language Corpus and other corpora in
Swedish (http://spraakbanken.gu.se/)

- Genre: tweets = The Twitter corpus associated with the paper
www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
available here: https://sites.google.com/site/twittersentimenthelp/for-researchers

- Genre: tweets and other microblogs = MicroBlog track
(http://sites.google.com/site/trecmicroblogtrack/) (not free of charge)

- Genre: newswires = Reuters newswire collections
(http://trec.nist.gov/data/reuters/reuters.html)

- Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/);
categorized Enron emails (http://sgi.nu/enron/corpora.php)

- Genre: emails = Junk email corpus
(http://clg.wlv.ac.uk/resources/junk-emails/index.php)

- Genre: FAQs = 200 FAQs
(http://www.itri.brighton.ac.uk/~Marina.Santini/#Download)
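For the NPS chat corpus sample that the list above says ships with NLTK data, a minimal Python sketch of reading a few POS-tagged posts might look like this. It assumes the third-party NLTK library is installed and that its 'nps_chat' data package has been downloaded (for example with nltk.download('nps_chat')). The function name sample_nps_chat_posts is made up for illustration, and the function returns None when either requirement is missing.

```python
from itertools import islice


def sample_nps_chat_posts(limit=3):
    """Return up to `limit` POS-tagged posts from NLTK's NPS chat sample.

    Returns None if NLTK or its 'nps_chat' data package is unavailable.
    (Hypothetical helper, not part of NLTK itself.)
    """
    try:
        from nltk.corpus import nps_chat  # third-party: pip install nltk
        # Each post is a sequence of (token, POS tag) pairs.
        return [list(post) for post in islice(nps_chat.tagged_posts(), limit)]
    except (ImportError, LookupError):
        # ImportError: NLTK not installed; LookupError: data not downloaded.
        return None


posts = sample_nps_chat_posts()
if posts is None:
    print("NLTK or the 'nps_chat' data is missing.")
else:
    for post in posts:
        print(" ".join(f"{token}/{tag}" for token, tag in post))
```

The Switchboard sample can be read in the same way through nltk.corpus.switchboard, given its own data package.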

Resources:
- For words and concepts, there are two main resources for English.
The first is WordNet, originally from Princeton; it is included in NLTK
and can also be obtained separately. It organizes English words
according to their relationships: synonym, hyponym, part of a whole,
etc. The second is the Word Association Norms, available from the
University of South Florida (http://w3.usf.edu/FreeAssociation/).
- Article: Hella Koo Finding: Twitter Dialect -
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
- Genre: tweets = one suggestion is to use the Twitter API to collect
a Twitter dataset.
- DiscoverText is a program that makes it easy to collect Twitter
feeds. Their website is here:
http://discovertext.com/defaultDT2.aspx
A free 30-day trial is available and is enough to gather a sizable set
of Twitter messages.
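
To make the WordNet relations mentioned in the list above concrete (synonym, hyponym, part of a whole), here is a minimal Python sketch using NLTK's WordNet interface. It assumes NLTK is installed and that its 'wordnet' data package has been downloaded. The helper name wordnet_relations is hypothetical, only the first sense of the word is inspected, and the function returns None when NLTK or the data is missing.

```python
def wordnet_relations(word):
    """Look up a few WordNet relations for the first sense of `word`.

    Returns None if NLTK or its 'wordnet' data package is unavailable,
    and an empty dict if WordNet has no entry for the word.
    (Hypothetical helper, not part of NLTK itself.)
    """
    try:
        from nltk.corpus import wordnet as wn  # third-party: pip install nltk
        synsets = wn.synsets(word)
    except (ImportError, LookupError):
        return None
    if not synsets:
        return {}
    first = synsets[0]  # first listed sense only, for brevity
    return {
        "synonyms": first.lemma_names(),
        "hyponyms": [s.name() for s in first.hyponyms()],
        "part_meronyms": [s.name() for s in first.part_meronyms()],
    }


relations = wordnet_relations("tree")
if relations is None:
    print("NLTK or the 'wordnet' data is missing.")
else:
    for relation, values in relations.items():
        print(relation, values[:5])
```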

Note:
Genre: tweets = The Edinburgh Tweets corpus has been withdrawn:
http://demeter.inf.ed.ac.uk/

LL Issue: 22.2068
Date Posted: 13-May-2011