Summary Details
| Query: | Genre-Specific Corpora | |
| Author: | Marina Santini | |
| Linguistic Field(s): | Computational Linguistics; Text/Corpus Linguistics | |
| Summary: |
Editor's Note: Please note that some URLs included in this submission may carry over onto a second line. If you want to go to a specific website provided in this submission, please be sure that you have copied the whole URL.

Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns, Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.

Suggested corpora and resources, in English unless stated otherwise (not all of them are free of charge).

Genre-specific corpora:
- Genre: SMS messages = NUS SMS corpus: http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)
- Genre: chatlogs = CODIAC chatlogs (http://data.eol.ucar.edu/codiac/dss/id=92.124; http://data.eol.ucar.edu/codiac/dss/id=88.044; http://data.eol.ucar.edu/codiac/dss/id=107.010)
- Genre: chatlogs = Many Eyes datasets; some chatlogs can be found here: http://www-958.ibm.com/software/data/cognos/manyeyes/datasets
- Genre: chats and switchboard conversations = Samples of the Switchboard corpus and the NPS chat corpus are in NLTK data (http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a POS-tagged chat corpus, and the Switchboard corpus (http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a telephone conversation corpus.
- The Linguistic Data Consortium has a good deal of telephone conversation data: many files and a variety of languages. See http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text (not free)
- Genre: blogs = The Corporate weblogs dataset in the TREC datasets (http://ir.dcs.gla.ac.uk/test_collections/) is not free. Helpful wiki: http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
- Genre: corporate blogs = It is possible to pull corporate blog feeds or scrape the blogs from this list: http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/
- The Göteborg Spoken Language Corpus and other corpora in Swedish (http://spraakbanken.gu.se/)
- Genre: tweets = The Twitter corpus associated with the paper www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is here: https://sites.google.com/site/twittersentimenthelp/for-researchers
- Genre: tweets and other microblogs = MicroBlog track: http://sites.google.com/site/trecmicroblogtrack/ (not free)
- Genre: newswires = Reuters newswire collections: http://trec.nist.gov/data/reuters/reuters.html
- Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/); categorized Enron emails (http://sgi.nu/enron/corpora.php)
- Genre: emails = Junk email corpus (http://clg.wlv.ac.uk/resources/junk-emails/index.php)
- Genre: FAQs = 200 FAQs (http://www.itri.brighton.ac.uk/~Marina.Santini/#Download)

Resources:
- For words and concepts, there are two main resources for English. The first is WordNet, originally from Princeton; it is included in NLTK (and can also be obtained separately). It organizes English words according to their relationships: synonym, hyponym, part of a whole, etc. The other resource is Word Association Norms, available from the University of South Florida (http://w3.usf.edu/FreeAssociation/).
- Article: Hella Koo Finding: Twitter Dialect - http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
- Genre: tweets = The suggestion is to use the Twitter API to crawl a Twitter dataset.
- DiscoverText is a program that makes it easy to collect Twitter feeds. Their website is here: http://discovertext.com/defaultDT2.aspx A free 30-day trial is available and can be used to collect a set of Twitter messages.

Note: Genre: tweets = The Edinburgh Tweets corpus has been withdrawn: http://demeter.inf.ed.ac.uk/ |
|
| LL Issue: | 22.2068 | |
| Date Posted: | 13-May-2011 | |


