LINGUIST List 6.977

Mon Jul 17 1995

Sum: Spanish corpora

Editor for this issue: Ann Dizdar


  1. Albert Llorens, summary

Message 1: summary

Date: Fri, 14 Jul 1995 11:48:49
From: Albert Llorens
Subject: summary

Dear all,

Here is a summary of the answers I received to my query on Spanish corpora.
My apologies for the repetitions: I have not had the time to really "summarize".



Albert Llorens
Spanish-English Development Group
Incyta, S.A.
c. Lluis Muntadas 5
08940 Cornella de Llobregat

There is a CD-ROM published by the European Corpus Initiative which includes
a number of texts in several European languages. Among others, it includes
EEC law in Spanish, English and Portuguese, and a Xerox manual in English
and Spanish.

A somewhat more detailed account of the contents of this CD-ROM follows:


ECI1/MUL06/MSP06/SPA16A: Information technology, EU, 26,000 words

ECI1/SPA02A-J: El Diario Sur, local newspaper from Malaga, belongs
to national publisher, in existence for 40 years. Different writing
styles, 500,000 words.

ECI2/MUL04/MSP04A-J: Telecommunication user manual, several 100,000 words.

ECI2/MUL09/SPA19A: Xerox ScanWorx user manual, 45,000 words.

ECI2/MUL12/MSP12/MSP12A-C: Civil law, Switzerland, 600,000 words.

ECI4/SPA03: Minimally processed by ECI; contains errors and
duplication but the CLEAN and FC files seem to be clean.

 El Diario Vasco, newspaper
 CLEAN files, news, few errors, 300,000 words
 FC files, 177,000 words

Apart from the ECI CD-ROM there are the following corpora available:

ftp /pub/corpus/argentina 2 million words
 /pub/corpus/chile 2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:
The CRATER Project: the ITU corpus is in the process of postediting.
The trilingual (French/English/Spanish) corpus has more than 3 million
words and is the so-called "White Book on Telecommunications"
released by the International Telecommunication Union. Fernando et al.
are working with a 1-million-word subcorpus, which will also be
postedited. This corpus, along with the tagger developed for it
and all the resources associated with the tagger,
will be in the public domain in October 1995. There is a lexicon with
more than 35,000 words (full forms, not lemmas), part-of-speech annotated,
that can be used as a starting point in lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's
literary supplement that can be purchased for under $50. It contains more
than 4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles available on
CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of
California, Berkeley.

The EU MULTEXT project is collecting a corpus that will contain
parallel texts from the European Parliament and financial newspaper
articles (Spanish from the newspaper Expansion). Licence agreements
for these data are still being finalized.

The RELATOR language resources server supports distribution of NLP
resources. Currently available through RELATOR are speech and text
corpora, lexicons, NLP programs and tools, and related databases and

Multilingual Web pages:
(XX = two-letter country codes of the EU countries such as de, uk,
etc.) Only speech materials.

A paper by Briscoe et al. reports a 17,000-word tagged corpus. (This is all
the info I have on this paper.)

ftp ://
Spanish tagger, implemented in Common Lisp. It comes with documentation
and works very well. If you need to install Common Lisp to run it, there
are several good free implementations at


A last report.

> 1. /pub/corpus/: a. Oral corpus of Spanish (7 MB, about 2,000,000 words)
> b. Some written corpora of South American Spanish
> 2. The LDC is the best source, but joining costs money.
> 3. The Oxford Text Archive
> 13 Banbury Road
> Oxford OX2 6NN
> fax: +44 865 273275
> Catalogue of over 1300 titles, available in paper
> or electronic form on the Oxford VAX Cluster as OX$DOC:TEXTARCHIVE.LIST and
> OX$TEXTARCHIVE.SGML, from various ListServers, e.g., LISTSERV@BROWNVM (send
> the mail message GET HUMANIST FILELIST for details), by anonymous FTP from
> Internet site ( in the directory pub/ota/public.
> Also, wherever you are, you can send a note to ARCHIVE@VAX.OXFORD.AC.UK
> specifying which form you want.
> Spanish
> a. Literary works, poems.
> 4. 1066108 words (approx.)
> Origin: Grupo EUROTRA, Universidad Autonoma de Madrid
> Contact: Manuel Campos, or
> Fernando Sanchez Leon, Laboratorio de L
> Available: Publicly via anonymous ftp, node,
> directory pub/corpus
> Contents: transcriptions of spoken language (conferences, conversations, etc.)
> 5. 121051 words (approx.)
> Origin: CHILDES (Child Language Data Exchange System) database, Carnegie
> Mellon Univ.
> Contact: Brian MacWhinney,
> Available: Publicly, after prior communication with Brian MacWhinney
> Contents: Database of corpora of parent-child and child-child interactions
> from children speaking.
> 6. 9,000,000 words (approx.)
> Origin: This is the European Corpus Initiative Multilingual Corpus I CD-ROM
> Cost: 20 Pounds
> Contact:
> Available: All use of this corpus is subject to a licence agreement
> The CD-ROM is available in the US from the Linguistic Data Consortium (LDC),
> for members of the LDC or those making a bulk purchase, and otherwise from
> ELSNET, 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND. The cost from ELSNET
> is 20 UK Pounds plus postage, handling and tax where applicable. Ordering
> procedure is detailed in
> 7. University of Barcelona: spoken corpus
