LINGUIST List 6.990

Thu Jul 20 1995

Sum: Recursos para el espanol (spanish resources)

Editor for this issue: Ann Dizdar <dizdartam2000.tamu.edu>


Directory

  1. Pablo Accuosto, Sum: Recursos para el espanol (spanish resources)

Message 1: Sum: Recursos para el espanol (spanish resources)

Date: Wed, 19 Jul 1995 22:14:00
From: Pablo Accuosto <accuostofing.edu.uy>
Subject: Sum: Recursos para el espanol (spanish resources)

Here is a summary of the replies I received about existing linguistic
resources for Spanish.

Thanks to:

Gerardo Arrarte
Fernando Sanchez Leon
Ruthanna Barnett
Alice Carlberger
Rodrigo Santurio
James L. Fidelholtz
Cesar Romani
Joerge Koch
Jose L. Rodrigo
Martin Beaumont Franowsky
Steve Helmreich
Eduardo A. Martinez Labrada
Mon Alameda
Erik Oltmans

...and many more

------------------------------------------------------------------

The Instituto Cervantes, a Spanish public body devoted mainly to
promoting the Spanish language and the culture of the Spanish-speaking
peoples around the world, carries out various activities aimed at
fostering research on the Spanish language.

Among other activities in the field of language technology, we are
setting up an office whose aim will be to promote the language
industries as applied to Spanish. To that end, we consider it essential
to collect and disseminate information on ongoing activities and on the
linguistic resources available at different research centres.

So far we have carried out a survey of Spanish corpora, existing or
under development, at Spanish research centres, and we have compiled
the results of that survey in a 56-page report, which I will be very
glad to send you. In the future we plan to extend this inventory with
data on other types of linguistic resources, as well as with data from
projects under way in other countries.

.................................................................
: Gerardo Arrarte Carriquiry : E-mail: :
: Programas de Tecnologia Linguistica : g.arrartecervantes.es :
: Instituto Cervantes : :
: Libreros, 23 : Tel: +34 1 885 62 03 :
: E-28801 ALCALA DE HENARES (Madrid) : Fax: +34 1 883 50 10 :
.................................................................


------------------------------------------------------------------


The ITU corpus is available as part of the ECI (European Corpus
Initiative) corpus, which can be obtained through ELSNET. The address is
as follows:

email: elsnetlet.ruu.nl
mail : OTS, Trans 10, 3512 JK, Utrecht, The Netherlands
tel : +31 30 53 6039
fax : +31 30 53 6000
www : http://www.cogsci.ed.ac.uk/elsnet/home.html

It is a trilingual corpus (Spanish, English, French). The version we are
preparing includes hand-corrected morphosyntactic tagging of 1 million
words of the corpus. This version will be in the public domain from
October of this year.

Likewise, the Spanish version of the Xerox tagger will also be in the
public domain on that date.

Our laboratory has other corpora, as you will have seen on the CORPORA
list (I include part of an announcement here):

There are some Spanish corpora that you can retrieve from our
laboratory. They are all documented. The corpora can be downloaded from
the following address:

Host: lola.lllf.uam.es
Login: anonymous
Password: <send your e-mail address>

At this moment, we have a corpus of spoken Spanish in orthographic
transcription

Directory: pub/corpus/oral

And a corpus of written Spanish texts from Argentina and Chile

Directory: pub/corpus/argentina
 pub/corpus/chile

All the corpora include texts on one of the topics you are interested
in. Note that the oral corpus is compressed with the UNIX 'compress'
command, while the other two are .zip files produced with DOS
compression utilities (take a look at the README files).
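
As a rough illustration of the retrieval steps above, the following
Python sketch builds the FTP locations named in the announcement and
shows how an anonymous session might list one directory. The host and
directory names come from the announcement; the function names and the
placeholder e-mail address are assumptions made for the example, and the
server may well no longer be reachable.

```python
from ftplib import FTP

HOST = "lola.lllf.uam.es"          # host given in the announcement
DIRECTORIES = [                    # directories given in the announcement
    "pub/corpus/oral",
    "pub/corpus/argentina",
    "pub/corpus/chile",
]


def corpus_urls(host=HOST, directories=DIRECTORIES):
    """Return one ftp:// URL per corpus directory (hypothetical helper)."""
    return [f"ftp://{host}/{d}" for d in directories]


def list_directory(directory, email="user@example.org", host=HOST):
    """List the files in one directory over anonymous FTP.

    The e-mail address (a placeholder here) is sent as the password, as
    the announcement asks. This needs network access and a live server.
    """
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        return ftp.nlst()


if __name__ == "__main__":
    # Only prints the locations; no connection is attempted here.
    for url in corpus_urls():
        print(url)
```

Once downloaded, the oral corpus would need 'uncompress' and the other
two an unzip utility, as noted above.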


Fernando Sanchez Leon
fsanchezccuam3.uam.es

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

NOTE: More information about the Xerox tagger is available from:

CONSORTIUM FOR LEXICAL RESEARCH
email: lexicalcrl.nmsu.edu
ftp://clr.nmsu.edu

Ftp Directory: members-only/tools/ling-analysis/syntax/xerox-tagger/

This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen
at Xerox, was written in ANSI Common Lisp. It was developed in Franz
Allegro Common Lisp version 4.1 on SunOS 4.x and in Macintosh Common
Lisp 2.0p2. The following are provided: source code, a tokenizer for
plain ASCII English, an English lexicon induced from the Brown corpus,
a table mapping word suffixes to likely ambiguity classes, and an HMM
trained on the odd-numbered sentences in the Brown corpus.
More Info: info/XEROX.
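
The description says the HMM was trained on the odd-numbered sentences
of the Brown corpus, which presumably leaves the even-numbered ones for
testing. A minimal Python sketch of such a split follows. The function
name and the toy sentences are hypothetical, sentences are assumed to be
numbered from 1, and nothing here is taken from the tagger's Common Lisp
source.

```python
def split_odd_even(sentences):
    """Split a list of sentences into (odd-numbered, even-numbered).

    Sentences are counted from 1, so the first sentence is odd-numbered.
    """
    odd = sentences[0::2]   # sentences 1, 3, 5, ...
    even = sentences[1::2]  # sentences 2, 4, 6, ...
    return odd, even


# Toy data standing in for a sentence-segmented corpus.
corpus = ["s1", "s2", "s3", "s4", "s5"]
training, held_out = split_odd_even(corpus)
print(training)  # ['s1', 's3', 's5']
print(held_out)  # ['s2', 's4']
```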

or:

ftp://parcftp.xerox.com/pub/tagger

If you need to install Common Lisp to run it, several good free
implementations can be found at
http://www.cs.rochester.edu/users/staff/miller/alu.html.


--------------------------------------------------------------------


European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A:
Information technology, EU, 26,000 words

ECI1/SPA02A-J:
El Diario Sur, local newspaper from Malaga, belongs to a national
publisher, in existence for 40 years.
Different writing styles, 500,000 words.

ECI2/MUL04/MSP04A-J:
Telecommunication user manual, several hundred thousand words.

ECI2/MUL09/SPA19A:
Xerox ScanWorx user manual, 45,000 words.

ECI2/MUL12/MSP12/MSP12A-C:
Civil law, Switzerland, 600,000 words.

ECI4/SPA03:
Minimally processed by ECI; contains errors and duplication but the
CLEAN and FC files are clean(?)
El Diario Vasco, newspaper
CLEAN files, news, few errors, 300,000 words
FC files, 177,000 words


The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. It contains
more than 4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles available on CD-ROM.
Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley

The EU MULTEXT project is collecting a corpus that will contain parallel
texts from the European Parliament and financial newspaper articles
(Spanish from the Expansion newspaper). Licence agreements for these
data are still being finalized.

The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR are speech and text
corpora, lexicons, NLP programs and tools, and related databases and
systems.

ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator

Multilingual Web pages: http://www.XX.relator.research.ec.org (XX =
two-letter country codes of the EU countries such as de, uk, etc.).
Only speech materials.

Alice Carlberger
alicespeech.kth.se

--------------------------------------------------------------------

We have been working on a Spanish to English Machine Translation
system and so have access to a large corpus of Spanish text and have
developed a tagger for general newspaper articles. Although the
tagger uses proprietary information (Collins Spanish-English on-line
dictionary), we will shortly make the results available on-line. That
is, you will be able to e-mail Spanish texts and they will be returned
tagged with part of speech.

Steve Helmreich
shelmreicrl.nmsu.edu

--------------------------------------------------------------------

Hello,
I am the co-author of a frequency dictionary of Spanish.
...
MON ALAMEDA
CMSFI52vmesa.cpd.uniovi.es

--------------------------------------------------------------------

You may find the Spanish-language list Terminometro electronico useful.

The list address is LATIN-TEFRMOP11.CNUSC.FR
The list server is LISTSERVFRMOP11.CNUSC.FR

Martin Beaumont Franowsky
BEAUMONTDESCO.ORG.PE

--------------------------------------------------------------------

The work of El Colegio de México (the Diccionario del español de
México) has been under way for a long time; the project's principal
investigator is Luis Fernando Lara. He has an Internet account, but I do
not have the address to hand, so here is his snail-mail address:
 Dr. Luis Fernando Lara
 DEM
 El Colegio de México
 Camino al Ajusco
 México, D. F.
 MEXICO.
They have compiled frequency counts from a corpus of approximately 2
million words (if I remember correctly), and they have a program that
assigns words to their part of speech.

James L. Fidelholtz
jfideludlapvms.pue.udlap.mx
jfidelunm.edu

--------------------------------------------------------------------

We work with large language corpora, and we have built tools for
extracting linguistic information:

- REAL, a program for automatic search and extraction of lemmas with
  their context.
- SMORPH, a program for segmentation and morphological tagging of lemmas.

Jose L. Rodrigo
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
josegril.univ-bpclermont.fr
GRIL : GROUPE DE RECHERCHE DANS LES INDUSTRIES DE LA LANGUE
UNIVERSITE BLAISE PASCAL - CLERMONT II
34 Av. Carnot, F - 63037 Clermont-Ferrand Cedex
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rodrigoeucmax.sim.ucm.es
Facultad de Filologia
Universidad Complutense de Madrid

--------------------------------------------------------------------

You might want to check out the AGFL Grammar WorkLab which
also contains a small grammar for the Spanish Noun Phrase.
The author, Paula Maria Santalla, can be contacted through
paulacs.kun.nl. The URL of the AGFL home page is:

http://www.cs.kun.nl/agfl/

Erik Oltmans
Department of Computer Science
University of Nijmegen
Nijmegen, The Netherlands
http://www.cs.kun.nl/agfl/eriko

--------------------------------------------------------------------

The Autonomous University of Nuevo Leon College of Medicine,
Monterrey, Mexico and California State University at
Fullerton (CSUF) make available "Spanish 92" (the 2,000
most frequent words of Spanish) based on ESPAÑOL 92
(E92), a computational linguistic analysis of a million-word
corpus of contemporary Spanish carried out between
1986 and 1992 under a grant from the Secretariat of Public
Education of the Mexican government.
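
To make the idea of a "most frequent words" list concrete, here is a
small Python sketch that counts word forms in a text and returns the top
n. It is only an illustration under simple assumptions (lowercasing,
words defined as runs of letters); the function name and sample sentence
are invented, and it does not describe how E92 was actually compiled.

```python
import re
from collections import Counter

# Runs of letters only (accented letters included, digits and "_" excluded).
WORD = re.compile(r"[^\W\d_]+")


def most_frequent(text, n=2000):
    """Return the n most frequent lowercase word forms with their counts."""
    words = WORD.findall(text.lower())
    return Counter(words).most_common(n)


sample = "la casa de la niña y la casa del niño"
print(most_frequent(sample, n=2))  # [('la', 3), ('casa', 2)]
```

A real frequency list would also have to decide how to treat inflected
forms, proper names and clitics, which this sketch ignores.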


"Spanish 92" is available from the ftp server at CSUF:

ftp wintermute.fullerton.edu
user> anonymous
  pw> usernamehost.domain
 FTP> cd /pub/research/chandler

Prof. R. M. Chandler-Burns
College of Medicine
Autonomous University of Nuevo Leon
Monterrey, MEXICO

Forwarded by:

Gabriel Amores
Departamento de Lengua Inglesa
Universidad de Sevilla

NOTE:

Prof. Chandler-Burns's address is rchandlrccr.dsi.uanl.mx

--------------------------------------------------------------------

CONSORTIUM FOR LEXICAL RESEARCH
email: lexicalcrl.nmsu.edu
ftp://clr.nmsu.edu


Parallel Text in English and Spanish
Pan American Health Organization

Ftp Directory: members-only/corpora/PAHO/

The Pan American Health Organization (PAHO), Conferences and General
Services Division, has kindly allowed this group of sample parallel
texts to be released for NLP research purposes. There are 180 pairs
of text, 360 individual files, which amount to about 8 Mb of data.
The documents cover the general domains of Public Health and Latin
America, but vary greatly in content and in length. Some are short
memos or letters; most are longer reports and conference proceedings.
The Spanish documents retain their Spanish character encoding.
Other formatting commands, such as tabs, centering, italicizing, etc.
have been removed. Special thanks to Dr. Marjorie Leon for her
assistance in making these texts available.

--------------------------------------------------------------------

 The PAPPI System: A Principle-Based Parser


 Announcing the first public release of PAPPI, a Prolog-based
 natural language parser for theories in the
 Principles-and-Parameters framework. PAPPI is designed to run on
 Sun Sparcstations with Quintus Prolog. The PAPPI system includes:

 * An X-Window system-based user interface to the
 underlying Prolog-based parser.

 * A sample implementation of classic GB-theory, based
 on theory described in Lasnik and Uriagereka's textbook
 "A Course in GB Syntax". The implementation also includes
 sets of example sentences and sample parameterization for
 six languages. Currently, these are English, Japanese,
 Dutch, French, Spanish and German. (This software was
 recently demoed at COLING '94.)

 PAPPI is a parser that is designed to be a high-level research
 tool for experimenting with and learning about linguistic
 theory. This release represents just one possible instantiation
 within the Principles-and-Parameters framework. Users are
 encouraged to experiment with and modify the sample principles.

 The PAPPI system represents code written to support research
 work. It is still very much under development. Alternate
 theories (and more sophisticated parsing models) will be made
 publicly available at a later stage. Upcoming releases may
 also support other platforms and may not need Quintus Prolog.

 This is free software developed at the NEC Research Institute,
 Inc., an institute for conducting long-term, fundamental
 research in computer and physical sciences. Comments and
 suggestions for improvement to the system will be gratefully
 accepted! I would also like to hear from those interested in
 extending the system. The PAPPI project also welcomes unencumbered
 software contributions, including (but not limited to) support
 for additional languages, theory and debugging tools.

 The system is available for anonymous ftp as:

 external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.Z

 [Note: X is an alphabetic character denoting the current
 minor release.]

 A .gz compressed version of the same tar file is also
 available as:

 external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.gz

 This version is recommended for installations that have
 GNU gzip.
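
 The naming scheme above can be sketched in Python as a pattern that
 extracts the minor-release letter and suggests a decompression tool.
 This is only an illustration: the function name is hypothetical, the
 release letter "b" used in the example is invented, and accepting both
 upper- and lower-case letters is an assumption, since the announcement
 only says that X is an alphabetic character.

```python
import re

# Matches "Pappi-2.0X.tar.Z" or "Pappi-2.0X.tar.gz", X being one letter.
ARCHIVE = re.compile(r"^Pappi-2\.0([A-Za-z])\.tar\.(Z|gz)$")


def describe_archive(filename):
    """Return (minor_release_letter, tool), or None if the name does not match.

    ".Z" files are undone with UNIX 'uncompress', ".gz" files with 'gunzip'.
    """
    match = ARCHIVE.match(filename)
    if match is None:
        return None
    letter, suffix = match.groups()
    tool = "uncompress" if suffix == "Z" else "gunzip"
    return letter, tool


print(describe_archive("Pappi-2.0b.tar.gz"))  # ('b', 'gunzip')
print(describe_archive("Pappi-2.0b.tar.Z"))   # ('b', 'uncompress')
```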

 Current requirements:

 Sun Sparcstation
 SunOS 4.1.3 or 5.3 (aka Solaris 2.3)
 Quintus Prolog 3.1.4 or 3.1.1 (June 1992)
 Approx. 35MB of disk space (55-70MB to install)

 Contact address:

 Dr. Sandiway Fong
 NEC Research Institute, Inc.
 Princeton NJ 08540
 USA
 Email: sandiwayresearch.nj.nec.com
 Fax: (609) 951-2482

--------------------------------------------------------------------

Please send any other information about Spanish resources to
my e-mail address (I will no longer be subscribed to the list).

Thank you very much!

Pablo Accuosto
Facultad de Ingenieria
Universidad de la Republica
Montevideo - Uruguay

e-mail: accuostofing.edu.uy