LINGUIST List 8.1552

Wed Oct 29 1997

FYI: North American News Text Corpus from the LDC

Editor for this issue: Brett Churchill <brett@linguistlist.org>


Directory

  • LDC Office, A New Corpus from the Linguistic Data Consortium

    Message 1: A New Corpus from the Linguistic Data Consortium

    Date: Wed, 29 Oct 1997 13:36:56 EST
    From: LDC Office <ldc@unagi.cis.upenn.edu>
    Subject: A New Corpus from the Linguistic Data Consortium




    Announcing a NEW CORPUS from the LINGUISTIC DATA CONSORTIUM

    North American News Text Corpus

    The Linguistic Data Consortium (LDC) announces the availability of a corpus of North American news text. This corpus is a collection of journalistic text in English from newswire and newspaper sources in the United States.

    The North American News Text corpus is composed of news text that has been marked up using SGML. The text is taken from the following sources:

    Source                         Dates Covered   Approx. #Words (Millions)
    ----------------------------------------------------------------------
    Los Angeles Times &            05/94-08/97      52
      Washington Post
    New York Times News            07/94-12/96     173
      Syndicate
    Reuters News Service           04/94-12/96      85
      (General & Financial)
    Wall Street Journal            07/94-12/96      40
    ----------------------------------------------------------------------

    Both the New York Times and the L.A. Times/Washington Post services actually include a range of other newspaper sources in their syndicated newswires. The L.A. Times/Washington Post material will be found to include the following sources (in lesser amounts) in addition to the two predominant sources:

    Newsday
    The Baltimore Sun
    The Hartford Courant

    The New York Times material will be found to contain the following sources (in lesser amounts), but N.Y. Times articles predominate:

    Bloomberg Business News
    The Boston Globe
    Los Angeles Daily News
    Fort Worth Star-Telegram
    Newsweek
    Cox News Service
    The Arizona Republic
    Seattle Post-Intelligencer
    San Francisco Examiner
    Houston Chronicle
    San Francisco Chronicle
    Economist Newspaper Ltd.
    Hearst Newspapers

    Both of these newswire services also include small numbers of articles from a larger set of miscellaneous sources. The ones listed above appear with some frequency on a daily basis.

    Because of restrictions imposed by the copyright holders of the news text, this corpus is available to 1995, 1996 and 1997 LDC members only. Members who wish to receive this corpus must sign the North American News Text user agreement. This agreement is available on the Linguistic Data Consortium WWW Home Page at URL

    http://www.ldc.upenn.edu/ldc/catalog/index.html.

    If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.