LINGUIST List 9.598

Wed Apr 22 1998

FYI: LDC Corpora, Lang Universals

Editor for this issue: Martin Jacobsen <marty@linguistlist.org>


Directory

  • LDC Office, New Corpora from the Linguistic Data Consortium
  • LDC Office, New Corpora from the Linguistic Data Consortium
  • Don Nilsen, Language Universals: Irony, Language Play, Metaphor, Metonymy

    Message 1: New Corpora from the Linguistic Data Consortium

    Date: Mon, 20 Apr 1998 16:43:14 EDT
    From: LDC Office <ldc@unagi.cis.upenn.edu>
    Subject: New Corpora from the Linguistic Data Consortium




    Announcing NEW RELEASES from the Linguistic Data Consortium

    1996 Broadcast News Training Speech Data
    1996 Broadcast News Dev. and Eval. Data
    1996 Broadcast News Transcripts

    The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from the ABC, CNN, and CSPAN television networks and the NPR and PRI radio networks, with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA "Hub-4" Project on continuous speech recognition in the broadcast domain. The speech files are available in a 19-disc training data set, with one additional disc of development data and one additional disc of evaluation data. The following programs are represented in this corpus:

    ABC Nightline
    ABC World Nightly News
    ABC World News Tonight
    CNN Early Edition
    CNN Early Prime News
    CNN Headline News
    CNN Prime Time News
    CNN The World Today
    CSPAN Washington Journal
    NPR All Things Considered
    NPR Marketplace

    Transcripts have been made of all recordings in this publication. They are manually time-aligned to the phrasal level and annotated to identify boundaries between news stories, speaker turn boundaries, and the gender of the speakers. The transcripts are released in SGML format; accompanying documentation and an SGML DTD file are included with the transcription release. The transcripts are available via ftp.

    Because of restrictions imposed by the copyright holders of the news text, these corpora are available to 1997 and 1998 LDC members only. Members who wish to receive these corpora MUST SIGN BOTH THE USC AND THE NPR AGREEMENTS. These agreements are available on the Linguistic Data Consortium WWW Home Page at URL

    http://www.ldc.upenn.edu/ldc/catalog/index.html.

    If you would like to order a copy of these corpora, please email your request to <ldc@unagi.cis.upenn.edu>. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

    Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL:

    http://www.ldc.upenn.edu/

    Information is also available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.
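    The anonymous-ftp convention described above can be sketched in Python with the standard-library ftplib module. This is only an illustration: the host and directory are the ones given in the announcement and may no longer be reachable, and the function names anonymous_credentials and list_ldc_directory are hypothetical helpers, not part of any LDC tooling.

```python
from ftplib import FTP

LDC_FTP_HOST = "ftp.cis.upenn.edu"  # host named in the announcement
LDC_FTP_DIR = "pub/ldc"             # directory named in the announcement


def anonymous_credentials(email):
    """Return (login, password) following the anonymous-ftp convention:
    the login name is "anonymous" and the password is your email address."""
    if "@" not in email:
        raise ValueError("expected an email address to use as the password")
    return ("anonymous", email)


def list_ldc_directory(email, host=LDC_FTP_HOST, directory=LDC_FTP_DIR, timeout=30):
    """Log in anonymously and return the file names under `directory`.

    Assumption: the server still exists and accepts anonymous logins;
    nothing in the announcement guarantees that today.
    """
    user, password = anonymous_credentials(email)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        ftp.cwd(directory)
        return ftp.nlst()
```

    Calling list_ldc_directory("you@example.org") would attempt the login and directory listing; the credentials helper can be checked on its own without any network access.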

    Message 2: New Corpora from the Linguistic Data Consortium

    Date: Mon, 20 Apr 1998 16:42:01 EDT
    From: LDC Office <ldc@unagi.cis.upenn.edu>
    Subject: New Corpora from the Linguistic Data Consortium


    Announcing a NEW RELEASE from the LINGUISTIC DATA CONSORTIUM

    COMLEX English Syntax Lexicon, Version 3.0

    This is a moderately broad coverage English lexicon (with about 38,000 lemmas) developed at New York University under LDC sponsorship. It contains detailed information about the syntactic characteristics of each lexical item, and is particularly detailed in its treatment of subcategorization (complement structures).

    In the current dictionary, nouns have 9 possible features and 9 possible complements; adjectives have 7 features and 14 complements; verbs have 5 features and 92 complements; and adverbs have 11 positional classes and 12 features. The entries for 750 frequent verbs contain 100 tags each, where a tag includes: a pointer to an instance of that verb in a corpus and the subcategorization appropriate for that instance.
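    For readers who want the figures above in one place, here is a small Python sketch that records the inventory and works out the total number of verb tags. The names VerbTag, corpus_pointer, subcategorization, INVENTORY, and total_verb_tags are hypothetical and chosen for illustration; they do not reflect the actual COMLEX file format, which this announcement does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbTag:
    """One tag on a frequent verb, as described in the announcement.

    Both field names are assumptions made for this sketch.
    """
    corpus_pointer: str     # pointer to an instance of the verb in a corpus
    subcategorization: str  # subcategorization appropriate for that instance


# Counts taken directly from the announcement text.
INVENTORY = {
    "noun": {"features": 9, "complements": 9},
    "adjective": {"features": 7, "complements": 14},
    "verb": {"features": 5, "complements": 92},
    "adverb": {"positional_classes": 11, "features": 12},
}

TAGGED_VERBS = 750   # frequent verbs with tagged instances
TAGS_PER_VERB = 100  # tags per frequent verb


def total_verb_tags(verbs=TAGGED_VERBS, tags_per_verb=TAGS_PER_VERB):
    """Total number of tagged verb instances implied by the two counts."""
    return verbs * tags_per_verb
```

    With the stated figures, 750 verbs at 100 tags each come to 75,000 tagged instances in total.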

    This latest version of COMLEX Syntax has been updated to include the adverb classes. We also added diacritics to foreign words, while retaining the unaccented versions, and performed various other updates to correct and supplement our lexical entries. For more details about this revised version, please contact Adam Meyers at New York University (meyers@cs.nyu.edu).

    This release is accompanied by the COMLEX Syntax Text Corpus, Version 2.0. The Text corpus consists of material from the following sources:

    The Brown Corpus, Francis, W. Nelson, 1964 Brown University, Providence

    Wall Street Journal Material, Copyright 1989 Dow Jones, Inc.

    San Jose Mercury News, Copyright 1991 San Jose Mercury News

    Associated Press, Copyright 1988

    Federal Register materials courtesy of IBM; formatted version copyright 1992, University of Pennsylvania

    Computer Library materials copyright owned by Ziff Communications Company and other parties as their respective interests may appear.

    Institutions that have membership in the LDC during the 1998 Membership Year will be able to receive COMLEX Syntax Lexicon 3.0 at no additional charge, in the same manner as all other text and speech corpora published by the LDC. Members who wish to receive this corpus must sign the COMLEX user agreement. This agreement is available on the Linguistic Data Consortium WWW Home Page at URL http://www.ldc.upenn.edu/ldc/catalog/index.html.

    Nonmembers can receive a copy of COMLEX Syntax Lexicon 3.0 for research purposes only for a fee of $1500. If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call (215) 898-0464.

    Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL http://www.ldc.upenn.edu/. Information is also available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.

    Message 3: Language Universals: Irony, Language Play, Metaphor, Metonymy

    Date: Mon, 20 Apr 1998 10:00:44 -0700 (MST)
    From: Don Nilsen <don.nilsen@asu.edu>
    Subject: Language Universals: Irony, Language Play, Metaphor, Metonymy


    In response to Arthur Merin's query on "Verbal Irony as a Language Universal," I have evidence suggesting that it might be, and even more evidence suggesting that Language Play, Metaphor, and Metonymy are language universals. I have bibliographies relating to these areas for anyone out there who is interested in the current research.

    Don L. F. Nilsen 8-)
    <don.nilsen@asu.edu>
    (602) 965-7592; FAX: (602) 965-3451
    Executive Secretary
    International Society for Humor Studies
    English Department
    Arizona State University
    Tempe, AZ 85287-0302