LINGUIST List 14.321

Fri Jan 31 2003

FYI: LSA Bulletin, New LDC Corpus

Editor for this issue: James Yuells <james@linguistlist.org>


Directory

  1. LSA, LSA Bulletin
  2. LDC Office, New LDC Corpus

Message 1: LSA Bulletin

Date: Wed, 29 Jan 2003 11:14:48 -0500
From: LSA <lsa@lsadc.org>
Subject: LSA Bulletin

The December 2002 issue of the LSA Bulletin is now available at the
Linguistic Society of America website: http://www.lsadc.org.

Message 2: New LDC Corpus

Date: Thu, 30 Jan 2003 17:00:21 -0500
From: LDC Office <ldc@ldc.upenn.edu>
Subject: New LDC Corpus


 		 * English Gigaword *


The Linguistic Data Consortium (LDC) is pleased to announce the
availability of the English Gigaword corpus. 

English Gigaword is a comprehensive archive of newswire text data 
in English that has been acquired over several years by the LDC. The
newswire texts are drawn from four international sources:

Agence France Presse English Service
Associated Press Worldstream English Service
The New York Times Newswire Service
The Xinhua News Agency English Service

English Gigaword is the first LDC publication to be distributed on
DVD. Much of the content in this collection has been published
previously by the LDC in a variety of older corpora, particularly
the North American News text corpora (LDC95T21, LDC98T30), the
various TDT corpora, and the AQUAINT text corpus (LDC2002T31). In
addition to this previously published data, the English Gigaword corpus
contains a significant amount of previously unreleased data:
all of the Agence France Presse content, the 1995 and 2001 Xinhua
content, and portions of NYT and APW dating from February 2001
forward.

All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The text formatting is consistent across all sources. The English 
Gigaword corpus has been fully validated by a standard SGML parser 
utility (nsgmls), using a DTD file which is provided as part of this 
publication. 
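
As a rough illustration of the character-set claim above (printable ASCII
and whitespace only), a check along the following lines could be run on
any text read from the corpus. This is only a sketch and not part of the
LDC distribution: the function name and the sample strings are
hypothetical, and it does not replace SGML validation with nsgmls and
the supplied DTD.

```python
import string

# Byte values allowed under the announcement's description:
# printable ASCII characters plus ASCII whitespace.
ALLOWED = frozenset(string.printable.encode("ascii"))


def disallowed_offsets(data: bytes, limit: int = 10) -> list:
    """Return offsets of up to `limit` bytes that are not printable ASCII or whitespace.

    Hypothetical helper for illustration; an empty list means the data
    matches the character set described in the announcement.
    """
    offsets = []
    for i, byte in enumerate(data):
        if byte not in ALLOWED:
            offsets.append(i)
            if len(offsets) >= limit:
                break
    return offsets


if __name__ == "__main__":
    # Made-up sample, not taken from the corpus.
    sample = b"A short line of newswire text.\n"
    print(disallowed_offsets(sample))
```

An empty result means every byte falls in the expected range; any
offsets returned point at bytes worth inspecting by hand.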

For further information, including a link to online documentation,
please visit:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

Institutions that have membership in the LDC during the 2003 
Membership Year will be able to receive this corpus free of charge. 
Nonmembers may license this publication for $2,500. 

			 *
 
If you need additional information before placing your order, or 
would like to inquire about membership in the LDC, please send email to
<ldc@ldc.upenn.edu> or call (215) 573-1275.


-------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
3600 Market Street                Fax:   (215) 573-2175
Suite 810                         email: ldc@ldc.upenn.edu
Philadelphia, PA 19104-2653       www:   http://www.ldc.upenn.edu