Media: New York Times Article on NYT Annotated Corpus
| Submitter: | Evan Sandhaus |
| Submitter Email: | sandes@nytimes.com |
| Linguistic Field(s): | Computational Linguistics; Semantics; Text/Corpus Linguistics; Lexicography |
| Media Body: |
The NYT Annotated Corpus is available under a noncommercial research license from the Linguistic Data Consortium (LDC). It spans 20 years of the newspaper, from 1987 to 2007 (that's 7,475 issues, to be exact), and includes the text of 1.8 million articles written at The Times. Of these, more than 1.5 million have been manually annotated by The New York Times Index with distinct tags for people, places, topics and organizations drawn from a controlled vocabulary. In addition, 650,000 articles include summaries written by indexers from The New York Times Index. The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF) and includes open source Java tools for parsing the documents into memory-resident objects. You can read more about the corpus at: http://open.blogs.nytimes.com/2009/01/12/fatten-up-your-corpus/ All the best, Evan Sandhaus, Semantic Technologist, Research & Development Operations, New York Times Company |
| Issue Number: | 20.1529 |
| Date Posted: | April 22, 2009 |
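
For readers wondering what parsing an NITF-style XML document into a memory-resident object might look like, here is a minimal sketch that uses only the standard Java DOM parser. It is not the Java tooling bundled with the corpus. The class `NitfSketch`, the holder `Article`, and the inline sample document are all hypothetical, and the sample's element layout is a simplified assumption rather than an excerpt from the corpus.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NitfSketch {

    // Hypothetical, heavily simplified NITF-style document (not taken from the corpus).
    static final String SAMPLE =
          "<nitf>"
        + "<head><title>Example Headline</title>"
        + "<docdata><identified-content>"
        + "<person>Doe, Jane</person>"
        + "<location>New York City</location>"
        + "</identified-content></docdata></head>"
        + "<body><body.content><block><p>First paragraph.</p></block></body.content></body>"
        + "</nitf>";

    /** Hypothetical in-memory holder for the fields this sketch extracts. */
    static final class Article {
        final String title;
        final List<String> people;
        final List<String> locations;

        Article(String title, List<String> people, List<String> locations) {
            this.title = title;
            this.people = people;
            this.locations = locations;
        }
    }

    /** Collects the trimmed text of every element with the given tag name. */
    static List<String> textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    /** Parses an XML string into an Article; malformed input raises IllegalArgumentException. */
    static Article parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> titles = textOf(doc, "title");
            String title = titles.isEmpty() ? "" : titles.get(0);
            return new Article(title, textOf(doc, "person"), textOf(doc, "location"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Article article = parse(SAMPLE);
        System.out.println(article.title + " | " + article.people + " | " + article.locations);
    }
}
```

Real corpus files may carry a document type declaration and a richer tag set than this sample assumes, so the element names would need to be checked against the documentation distributed with the corpus.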


