Editor for this issue: Ann Dizdar <dizdar@tam2000.tamu.edu>
Here is a summary of the answers I received about available Spanish linguistic resources.

Thanks to: Gerardo Arrarte, Fernando Sanchez Leon, Ruthanna Barnett, Alice Carlberger, Rodrigo Santurio, James L. Fidelholtz, Cesar Romani, Joerge Koch, Jose L. Rodrigo, Martin Beaumont Franowsky, Steve Helmreich, Eduardo A. Martinez Labrada, Mon Alameda, Erik Oltmans ...and many more

--------------------------------------------------------------------

The Instituto Cervantes, a Spanish public body dedicated chiefly to promoting the Spanish language and the culture of Spanish-speaking peoples worldwide, carries out various activities aimed at fostering research on the Spanish language. Among other activities related to the field of language technology, we are setting up an office whose aim will be to promote the language industries as applied to Spanish. To that end, it has been considered essential to collect and disseminate information about ongoing activities and about the linguistic resources available at different research centres.

So far we have carried out a survey of Spanish corpora, existing or under development, at Spanish research centres, and have compiled the resulting data in a 56-page report, which I will be glad to send you. In the future we plan to extend this inventory with data on other types of linguistic resources, as well as with data from projects under way in other countries.

.................................................................
Gerardo Arrarte Carriquiry
Programas de Tecnologia Linguistica
Instituto Cervantes
Libreros, 23
E-28801 ALCALA DE HENARES (Madrid)
E-mail: g.arrarte@cervantes.es
Tel: +34 1 885 62 03
Fax: +34 1 883 50 10
.................................................................

--------------------------------------------------------------------

The ITU corpus is available as part of the ECI (European Corpus Initiative) corpus, which can be obtained through ELSNET. The address is as follows:

email: elsnet@let.ruu.nl
mail : OTS, Trans 10, 3512 JK, Utrecht, The Netherlands
tel : +31 30 53 6039
fax : +31 30 53 6000
www : http://www.cogsci.ed.ac.uk/elsnet/home.html

It is a trilingual corpus (Spanish, English, French). The version we are preparing includes hand-corrected morphosyntactic tagging of 1 million words of the corpus. This version will be in the public domain from October of this year. The Spanish version of the Xerox tagger will also be in the public domain as of that date.

Our laboratory holds other corpora, as you will have seen on the CORPORA list (I include part of an announcement in English):

There are some Spanish corpora that you can retrieve from our laboratory. They are all documented. The corpora can be downloaded from the following address:

Host: lola.lllf.uam.es
Login: anonymous
Password: <send your e-mail address>

At this moment, we have a corpus of spoken Spanish in orthographic transcription:

Directory: pub/corpus/oral

and a corpus of written Spanish texts from Argentina and Chile:

Directory: pub/corpus/argentina
           pub/corpus/chile

All the corpora include texts in one of the topics you are interested in. Note that the oral corpus is compressed with the UNIX 'compress' command, while the other two are .zip files produced with DOS compression utilities (see the README files).

Fernando Sanchez Leon
fsanchez@ccuam3.uam.es

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

NOTE: More information about the Xerox tagger is available from:

CONSORTIUM FOR LEXICAL RESEARCH
email: lexical@
crl.nmsu.edu
ftp://clr.nmsu.edu
Ftp Directory: members-only/tools/ling-analysis/syntax/xerox-tagger/

This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen at Xerox, was written in ANSI Common Lisp. It was developed in Franz Allegro Common Lisp version 4.1 on SunOS 4.x and in Macintosh Common Lisp 2.0p2. The following code is provided: source code, a tokenizer for plain ASCII English, an English lexicon induced from the Brown corpus, a table of mappings from word suffixes to likely ambiguity classes, and an HMM trained on the odd-numbered sentences of the Brown corpus.

More Info: info/XEROX.
or: ftp://parcftp.xerox.com/pub/tagger

If you need to install Common Lisp to run it, several good free implementations can be found at http://www.cs.rochester.edu/users/staff/miller/alu.html.

--------------------------------------------------------------------

European Corpus Initiative corpora available on CD-ROM:

ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words.
ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs to a national publisher, in existence for 40 years. Different writing styles, 500,000 words.
ECI2/MUL04/MSP04A-J: Telecommunication user manual, several hundred thousand words.
ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words.
ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words.
ECI4/SPA03: Minimally processed by ECI; contains errors and duplication, but the CLEAN and FC files are clean(?). El Diario Vasco, newspaper.
    CLEAN files: news, few errors, 300,000 words
    FC files: 177,000 words

The national newspaper ABC has just released a CD-ROM with last year's literary supplement that can be purchased for under $50. More than 4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles, available on CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley.

The EU MULTEXT project is collecting a corpus that will contain parallel texts from the European Parliament and financial newspaper articles (the Spanish ones from the newspaper Expansion). Licence agreements for these data are still being finalized.

The RELATOR language resources server supports the distribution of NLP resources. Currently available through RELATOR: speech and text corpora, lexicons, NLP programs and tools, and related databases and systems.

ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator
Multilingual Web pages: http://www.XX.relator.research.ec.org
(XX = two-letter country codes of the EU countries, such as de, uk, etc.)

Only speech materials.

Alice Carlberger
alice@
speech.kth.se

--------------------------------------------------------------------

We have been working on a Spanish-to-English machine translation system, so we have access to a large corpus of Spanish text and have developed a tagger for general newspaper articles. Although the tagger uses proprietary information (the Collins Spanish-English on-line dictionary), we will shortly make the results available on-line. That is, you will be able to e-mail Spanish texts and they will be returned tagged with part of speech.

Steve Helmreich
shelmrei@crl.nmsu.edu

--------------------------------------------------------------------

Hello, I am the co-author of a frequency dictionary of Spanish. ...

Mon Alameda
CMSFI52@vmesa.cpd.uniovi.es

--------------------------------------------------------------------

The list Terminometro electronico en espanhol may be useful to you.

The list address is LATIN-TE@FRMOP11.CNUSC.FR
The list server is LISTSERV@FRMOP11.CNUSC.FR

Martin Beaumont Franowsky
BEAUMONT@DESCO.ORG.PE

--------------------------------------------------------------------

The work of El Colegio de México (the Diccionario del español de México) has been under way for a long time; the project's principal investigator is Luis Fernando Lara. He has an Internet account, but I do not have the address to hand, so here is his snail-mail address:

Dr. Luis Fernando Lara
DEM
El Colegio de México
Camino al Ajusco
México, D. F.
MÉXICO.

They have produced frequency counts from a corpus of approximately 2 million words (if I remember correctly), and they have a program that assigns words to their part of speech.

James L. Fidelholtz
jfidel@udlapvms.pue.udlap.mx
jfidel@unm.edu

--------------------------------------------------------------------

We work with large language corpora and have built tools for extracting linguistic information:

- REAL, a program for automatic search and extraction of lemmas with their context;
- SMORPH, a program for segmentation and morphological tagging of lemmas.

Jose L. Rodrigo
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
jose@gril.univ-bpclermont.fr
GRIL : GROUPE DE RECHERCHE DANS LES INDUSTRIES DE LA LANGUE
UNIVERSITE BLAISE PASCAL - CLERMONT II
34 Av. Carnot, F - 63037 Clermont-Ferrand Cedex
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rodrigo@eucmax.sim.ucm.es
Facultad de Filologia
Universidad Complutense de Madrid

--------------------------------------------------------------------

You might want to check out the AGFL Grammar WorkLab, which also contains a small grammar for the Spanish noun phrase. The author, Paula Maria Santalla, can be contacted through paula@
cs.kun.nl. The URL of the AGFL home page is: http://www.cs.kun.nl/agfl/

Erik Oltmans
Department of Computer Science
University of Nijmegen
Nijmegen, The Netherlands
http://www.cs.kun.nl/agfl/eriko

--------------------------------------------------------------------

The Autonomous University of Nuevo Leon College of Medicine, Monterrey, Mexico, and California State University at Fullerton (CSUF) make available "Spanish 92" (the 2,000 most frequent words of Spanish), based on ESPA~NOL 92 (E92), a computational linguistic analysis of a million-word corpus of contemporary Spanish carried out between 1986 and 1992 under a grant from the Secretariat of Public Education of the Mexican government.

"Spanish 92" is available from the ftp server at CSUF:

ftp wintermute.fullerton.edu
user> anonymous
pw> username@host.domain
FTP> cd /pub/research/chandler

Prof. R. M. Chandler-Burns
College of Medicine
Autonomous University of Nuevo Leon
Monterrey, MEXICO

Sent by:
Gabriel Amores
Departamento de Lengua Inglesa
Universidad de Sevilla

NOTE: Prof. Chandler-Burns's address is rchandlr@ccr.dsi.uanl.mx

--------------------------------------------------------------------

CONSORTIUM FOR LEXICAL RESEARCH
email: lexical@
crl.nmsu.edu
ftp://clr.nmsu.edu

Parallel Text in English and Spanish
Pan American Health Organization
Ftp Directory: members-only/corpora/PAHO/

The Pan American Health Organization (PAHO), Conferences and General Services Division, has kindly allowed this group of sample parallel texts to be released for NLP research purposes. There are 180 pairs of texts, 360 individual files, which amount to about 8 Mb of data. The documents cover the general domains of Public Health and Latin America, but vary greatly in content and in length. Some are short memos or letters; most are longer reports and conference proceedings. The Spanish documents retain their Spanish character encoding. Other formatting commands, such as tabs, centering, italicizing, etc., have been removed.

Special thanks to Dr. Marjorie Leon for her assistance in making these texts available.

--------------------------------------------------------------------

The PAPPI System: A Principle-Based Parser

Announcing the first public release of PAPPI, a Prolog-based natural language parser for theories in the Principles-and-Parameters framework. PAPPI is designed to run on Sun Sparcstations with Quintus Prolog.

The PAPPI system includes:

* An X-Window system-based user interface to the underlying Prolog-based parser.
* A sample implementation of classic GB-theory, based on the theory described in Lasnik and Uriagereka's textbook "A Course in GB Syntax". The implementation also includes sets of example sentences and sample parameterization for six languages. Currently, these are English, Japanese, Dutch, French, Spanish and German.

(This software was recently demoed at COLING '94.)

PAPPI is a parser that is designed to be a high-level research tool for experimenting with and learning about linguistic theory. This release represents just one possible instantiation within the Principles-and-Parameters framework. Users are encouraged to experiment with and modify the sample principles.
The PAPPI system represents code written to support research work. It is still very much under development. Alternate theories (and more sophisticated parsing models) will be made publicly available at a later stage. Upcoming releases may also support other platforms and may not need Quintus Prolog.

This is free software developed at the NEC Research Institute, Inc., an institute for conducting long-term, fundamental research in computer and physical sciences.

Comments and suggestions for improvement to the system will be gratefully accepted! I would also like to hear from those interested in extending the system. The PAPPI project also welcomes unencumbered software contributions, including (but not limited to) support for additional languages, theory and debugging tools.

The system is available for anonymous ftp as:

external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.Z

[Note: X is an alphabetic character denoting the current minor release.]

A .gz compressed version of the same tar file is also available as:

external.nj.nec.com:/pub/sandiway/Pappi-2.0X.tar.gz

This version is recommended for installations that have GNU gzip.

Current requirements:

Sun Sparcstation
SunOS 4.1.3 or 5.3 (aka Solaris 2.3)
Quintus Prolog 3.1.4 or 3.1.1 (June 1992)
Approx. 35MB of disk space (55-70MB to install)

Contact address:

Dr. Sandiway Fong
NEC Research Institute, Inc.
Princeton NJ 08540 USA
Email: sandiway@research.nj.nec.com
Fax: (609) 951-2482

--------------------------------------------------------------------

Please send any other information about Spanish resources to my e-mail address (I will no longer be subscribed to the list).

Thank you very much!

Pablo Accuosto
Facultad de Ingenieria
Universidad de la Republica
Montevideo - Uruguay
e-mail: accuosto@fing.edu.uy