Media: New York Times Article on NYT Annotated Corpus

Submitter: Evan Sandhaus

Submitter Email:
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics

Media Body: Available under a noncommercial research license from the Linguistic Data
Consortium (LDC), the corpus spans 20 years of newspapers between 1987 and
2007 (that’s 7,475 issues, to be exact). This collection includes the text
of 1.8 million articles written at The Times. Of these, more than 1.5
million have been manually annotated by The New York Times Index with
distinct tags for people, places, topics and organizations drawn from a
controlled vocabulary. In addition, 650,000 articles include summaries
written by indexers from the New York Times Index. The corpus is provided
as a collection of XML documents in the News Industry Text Format and
includes open-source Java tools for parsing documents into memory-resident
objects.

You can read more about the corpus at:

All the best,

Evan Sandhaus
Semantic Technologist
Research & Development Operations
New York Times Company
Issue Number: 20.1529
Date Posted: April 22, 2009