LINGUIST List 9.577

Thu Apr 16 1998

FYI: ELRA Focus, Parser Link, OTA Web Testers

Editor for this issue: Martin Jacobsen <martylinguistlist.org>


Directory

  1. Valerie Mapelli, ELRA Focus - MLCC Multilingual Corpora for Co-operation
  2. Doug Beeferman, Link Grammar Parser: http://www.link.cs.cmu.edu/link/
  3. Oxford Text Archive, Testers wanted for new OTA website

Message 1: ELRA Focus - MLCC Multilingual Corpora for Co-operation

Date: Mon, 6 Apr 1998 13:49:28 +0200 (MET DST)
From: Valerie Mapelli <info-elracalva.net>
Subject: ELRA Focus - MLCC Multilingual Corpora for Co-operation

 
 EUROPEAN LANGUAGE RESOURCES ASSOCIATION
 ELRA Focus
 =====================================

 
 MLCC Multilingual Corpora for Co-operation

A collection of newspaper articles from financial newspapers in 6
languages (Dutch, English, French, German, Italian and Spanish) and a
set of parallel texts in the 9 European Union official languages (as
of 1993)

 =====================================

The current ELRA catalogue lists more than 500 language resources (!)
for speech, written language and terminology work. This message is a
reminder of the availability of one of them, namely the MLCC
Multilingual Corpora for Co-operation.

The MLCC text corpus has two main components - one set to allow
comparable studies to be carried out in different languages and one
set as the basis for translation studies.

The first set is referred to as the Polylingual Document Collection
(ELRA-W0006), a collection of newspaper articles from financial
newspapers in 6 languages (Dutch, English, French, German, Italian and
Spanish). It consists of the following sub-corpora:

Dutch - "Het Financieele Dagblad" - 1992-1993
The corpus contains articles from the Dutch financial newspaper "Het
Financieele Dagblad", editions of 2nd January 1992 through to 24th
December 1993. It contains around 8.5 million words of text.

English - "The Financial Times" - 1993
The corpus contains articles from the British financial newspaper
"The Financial Times", editions from the year 1993. The corpus
contains around 30 million words.

French - "Le Monde" - 1992-1993
A corpus of articles from the French newspaper "Le Monde", consisting
of two years' worth (1992-1993) of articles on financial subjects,
approximately 10 million words.

German - "Handelsblatt" - 1986-1988
This subcorpus consists of articles from the period 02.01.1986 to
15.06.1988. It contains some 33 million words. It may be possible to
obtain more recent articles from "Handelsblatt".

Italian - "Il Sole 24 Ore" - 1992-1993
The corpus described here contains articles from the Italian financial
newspaper "Il Sole 24 Ore" from the year 1992. This corpus contains
some 1.88 million words. The SGML markup was done by the University of
Edinburgh.

Spanish - "Expansion" - 1994
This subcorpus contains articles from the Spanish financial newspaper
"Expansion", editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to
27.12.1994. It contains some 10 million words.
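
As a rough guide to the overall size of the collection, the approximate
figures quoted above can simply be added up. The short sketch below does
that; the dictionary and variable names are illustrative only, and the
figures are the rounded word counts given in this message, not exact
counts.

```python
# Approximate sub-corpus sizes in millions of words, as quoted above.
sizes = {
    "Dutch": 8.5,
    "English": 30.0,
    "French": 10.0,
    "German": 33.0,
    "Italian": 1.88,
    "Spanish": 10.0,
}

# Rough total for the whole Polylingual Document Collection.
total = sum(sizes.values())

# Print each language with its share of the total, largest first.
for language, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{language:8s} {size:6.2f}M  {100 * size / total:5.1f}%")
print(f"{'Total':8s} {total:6.2f}M")
```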

 Price for ELRA members: 
 for research use: 360 ECU 
 for commercial use: 1500 ECU

 Price for non-members:
 for research use: 750 ECU 
 for commercial use: 3200 ECU

The second set is a Multilingual Parallel Corpus (ELRA-W0007)
consisting of translated data in nine European languages: Danish,
Dutch, English, French, German, Greek, Italian, Portuguese and
Spanish. The parallel data, provided by the European Commission,
comprises two sub-corpora from the Official Journal of the European
Communities:

Official Journal of the European Commission, C Series: Written
Questions 1993
Records of questions and answers regarding European Community
matters. The data is regularly published as one section of the C
Series of the Official Journal of the European Communities in all
official languages (previously nine). This corpus contains written
questions asked by members of the European Parliament and the
corresponding answers from the European Commission in 9 parallel
versions. The total size of the corpus is approximately 10.2 million
words (ca. 1.1 million words per language).

Official Journal of the European Commission, Annex: Debates of the
European Parliament 1992-1994
This parallel corpus comprises the records of parliamentary sittings,
published as an annex to the Official Journal of the European
Communities under the title Debates of the European Parliament. The
Parliamentary Debates are a record of what was said by members at the
meeting as well as written input provided to the meeting. The original
data from which the translations are produced consist of a transcript
of the sittings, each member speaking in the language of his
choice. The final version consists of nine parallel versions of the
material. The texts delivered comprise the Debates of Parliament from
January 1992 to July 1994. This sub-corpus contains some 5 to 8
million words per language.

 Price for ELRA members:
 for research use: 120 ECU
 for commercial use: 480 ECU

 Price for non-members:
 for research use: 200 ECU
 for commercial use: 800 ECU

 ********************************************
 For more information, please contact:
 ELRA/ELDA
 55-57 rue Brillat Savarin
 75013 PARIS
 Tel: +33 1 43 13 33 33
 Fax: +33 1 43 13 33 30
 E-mail: info-elracalva.net
 http://www.icp.grenet.fr/ELRA/home.html
 ********************************************

Message 2: Link Grammar Parser: http://www.link.cs.cmu.edu/link/

Date: Thu, 16 Apr 1998 00:26:20 -0400
From: Doug Beeferman <Doug_Beefermancuff.link.cs.cmu.edu>
Subject: Link Grammar Parser: http://www.link.cs.cmu.edu/link/


We would like to draw your attention to the release of the new version
of the Link Grammar Parser, version 3.0.

The Link Grammar Parser is a syntactic parser of English, based on
link grammar, an original theory of English syntax. Given a sequence
of words, the system assigns to it a syntactic structure, composed of
a set of arcs or "links" of different kinds, connecting pairs of
words. The parser has a dictionary of about 60000 word-forms; it has
coverage of a wide variety of syntactic constructions, many idioms,
and capitalization and punctuation phenomena. It is able to make
guesses about the syntactic categories of unknown words based on
context. It is also robust, and can assign structure to sentences even
when it cannot parse them completely.
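
As a rough illustration of the kind of structure described above, a
parse can be modelled as a set of labelled arcs between word
positions. The sketch below is not the parser's API: the Link type, the
example sentence and the link labels are assumptions made purely for
illustration. It also checks the condition, part of link grammar, that
links drawn above the sentence must not cross.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    left: int    # position of the left-hand word
    right: int   # position of the right-hand word
    label: str   # link type (labels here are illustrative only)


def links_cross(a: Link, b: Link) -> bool:
    """Return True if the two arcs would cross when drawn above the words."""
    return (a.left < b.left < a.right < b.right
            or b.left < a.left < b.right < a.right)


def is_planar(links: List[Link]) -> bool:
    """Return True if no pair of links crosses."""
    return not any(
        links_cross(a, b)
        for i, a in enumerate(links)
        for b in links[i + 1:]
    )


# Hypothetical sentence and linkage, written by hand for this example.
words = ["the", "cat", "chased", "a", "mouse"]
linkage = [
    Link(0, 1, "D"),  # determiner - noun
    Link(1, 2, "S"),  # subject - verb
    Link(2, 4, "O"),  # verb - object
    Link(3, 4, "D"),  # determiner - noun
]

for link in linkage:
    print(f"{words[link.left]} --{link.label}-- {words[link.right]}")
print("planar:", is_planar(linkage))
```

The linkages produced by the parser itself carry more information than
this; see the documentation on the website for the actual output format.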

The system is written in C, and runs under Unix and Windows.

Since our last version (version 2.0, in Fall 1995), we have made a
number of improvements to the parser. Its speed is greatly enhanced;
its coverage is significantly improved. We have also incorporated a
"panic mode", which allows the parser to recover some structure on
long sentences in a short amount of time. We have also developed an
API for the system. This allows the parser to be easily integrated
into your own applications.

At the Link Parser website (http://www.link.cs.cmu.edu/link/) you can
try the parser out for yourself. This website also contains more
information and detailed documentation of the parser. You are welcome
to download the system from the website and use it for personal or
academic purposes. If you intend to use it for commercial purposes,
please contact us. Contact information, and information on the Link
Group at Carnegie Mellon, can be found on the Link Group home page at
http://www.link.cs.cmu.edu/


 Davy Temperley     Daniel Sleator       John Lafferty
 dt3columbia.edu    sleatorcs.cmu.edu    laffertycs.cmu.edu

Message 3: Testers wanted for new OTA website

Date: Thu, 16 Apr 1998 10:21:41 +0100 (BST)
From: Oxford Text Archive <archivesable.ox.ac.uk>
Subject: Testers wanted for new OTA website


The Oxford Text Archive is launching a state-of-the-art web service
later in the year, reflecting our new status as a Service Provider for
the UK's national Arts and Humanities Data Service.

Before this web site goes live, we need feedback from all types of
user. So whether you are new to electronic text or an expert in the
field, we invite you to visit our site and use our feedback form to
tell us what you think.

As always, the OTA's homepage remains 

http://ota.ahds.ac.uk/ 

but throughout this period of testing, users will have the option to
visit either our current site, or our new experimental service.

NB: In order to fully appreciate this service, we recommend that you
use either Netscape Navigator 4 or IE 3 (or better).

Features of the new OTA site include:

- an online catalogue of all our texts, whether online or offline
- a facility to create a corpus of texts
- a download facility for TEI encoded texts that allows you to
 choose from a variety of different formats
- online tools to help you prepare your texts in SGML
- a listing of future events, as well as papers from previous
 workshops and conferences
- a FAQ, based on the OTA's 22 years of operation
- a search tool and site map to help you find your way around the site
- an SGML software repository
- "Guides to Good Practice" on the creation and documentation of 
 electronic texts (in preparation)


-----
Oxford Text Archive
http://ota.ahds.ac.uk
infoota.ahds.ac.uk
+44-1865-273 238
13 Banbury Road, Oxford, OX2 6NN, UK

