
Media: New York Times Article on NYT Annotated Corpus

Submitter: Evan Sandhaus

Submitter Email: sandes@nytimes.com
Linguistic Field(s): Computational Linguistics
Semantics
Text/Corpus Linguistics
Lexicography

Media Body: Available under a noncommercial research license from the Linguistic Data
Consortium (LDC), the corpus spans 20 years of newspapers between 1987 and
2007 (that’s 7,475 issues, to be exact). The collection includes the text
of 1.8 million articles written at The Times. Of these, more than 1.5
million have been manually annotated by The New York Times Index with
distinct tags for people, places, topics and organizations drawn from a
controlled vocabulary. In addition, 650,000 articles include summaries
written by indexers from The New York Times Index. The corpus is provided
as a collection of XML documents in the News Industry Text Format (NITF) and
includes open-source Java tools for parsing documents into memory-resident
objects.
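
As a rough illustration of what reading one such XML document into memory
might look like, here is a minimal Python sketch. It does not use the Java
tools shipped with the corpus. The sample document is hand-written and
hypothetical, the element names follow general NITF conventions, and the
"full_text" block class and the parse_article helper are assumptions made
for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written NITF-style document (not taken from the corpus).
SAMPLE = """<nitf>
  <head>
    <title>Example Headline</title>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Parse one NITF-style document into a plain dictionary."""
    root = ET.fromstring(xml_text)

    # Take the first <hl1> element as the headline, if there is one.
    headline = ""
    for hl1 in root.iter("hl1"):
        headline = (hl1.text or "").strip()
        break

    # Collect paragraphs from blocks marked as full text
    # (the class name is an assumption of this sketch).
    paragraphs = []
    for block in root.iter("block"):
        if block.get("class") == "full_text":
            for p in block.findall("p"):
                paragraphs.append((p.text or "").strip())

    return {"headline": headline, "paragraphs": paragraphs}


article = parse_article(SAMPLE)
print(article["headline"])
print(len(article["paragraphs"]), "paragraphs")
```

The same approach should carry over to a directory of files by calling
ET.parse on each path, provided the element names are checked against the
actual corpus documentation first.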

You can read more about the corpus at:

http://open.blogs.nytimes.com/2009/01/12/fatten-up-your-corpus/

All the best,

Evan Sandhaus
--
Semantic Technologist
Research & Development Operations
New York Times Company
Issue Number: 20.1529
Date Posted: April 22, 2009
