Tue 08 Mar 1994

Sum: French language corpora

Date: Mon, 7 Mar 94 14:50 GMT
Subject: Sum: French language corpora

Here is some information about computerized French language corpora. Heartfelt
thanks to all who responded. Allow 4 to 5 pages to print out this information.
The headings are:

1. The Canadian Hansard Corpus
2. The ACL initiative/ more on Hansard
3. The ARTFL database at Chicago University
4. More information about Hansard and ARTFL
5. The Hachette / OUP corpus
6. The Oxford Text Archive
7. Mike Scott at Liverpool University
8. Le Monde on CD-ROM

Please send me corrections and additions.

Raphael Salkie,
The Language Centre,
University of Brighton,
Falmer, Brighton,
BN1 9PH England

Tel: (0273) 643335 (direct line); (0273) 643337 (Language Centre Office).
Fax: (0273) 690710

1. The Canadian Hansard Corpus

Source: Bruno Maximilian Schulze: <schulzede.uni-stuttgart.ims>

"How about the Hansard Corpus, which is an (sentence) aligned parallel corpus
(French and English) containing debates of the Canadian parliament (total 50
million words)? The Hansard Corpus is available from the ACL/DCI (Association
for Computational Linguistics Data Collection Initiative). You can probably
contact :

 Mark Liberman
 Department of Linguistics
 University of Pennsylvania
 PA 19104

 email: "

2. The ACL initiative/ more on Hansard

Source: Louisa Sadler < >

The Association for Computational Linguistics has a project to compile a
multilingual corpus. Information from: Susan Armstrong-Warwick
<susanch.unige.divsun> , who writes:

"We are in the final stages - the data should go to LDC end of this week to
prepare for pressing the CD-ROM - we have 100 million words in over 20
languages and including parallel versions from
banks and international organizations.

The hansard corpus is distributed by LDC - contact them."

I replied asking what LDC is - no response yet.

3. The ARTFL database at Chicago University

Sources: Stavros Macrakis <macrakisorg.osf>
Angus B. Grieve-Smith:

 A Textual Database

 2000 Texts
 17th-20th Centuries
 Literature, Philosophy, Arts, Sciences...

A Cooperative Project:

 Centre National de la Recherche Scientifique
 The University of Chicago

At present the corpus consists of nearly 2000 texts, ranging from classic works
of French literature to various kinds of non-fiction prose and technical
writing. The eighteenth, nineteenth and twentieth centuries are about equally
represented, with a smaller selection of seventeenth-century texts as well as
some medieval and Renaissance texts. Genres include novels, verse,
journalism, essays, correspondence, and treatises. Subjects include literary
criticism, biology, history, economics, and philosophy. In most cases standard
scholarly editions were used in converting the text into machine-readable
form, and the data include page references to these editions.

The ARTFL Project is supported by a full-time staff at the University of
Chicago. We encourage you to write or call us with any questions you may have
about the project - the availability of texts, operation of the system, the
costs of using the database.

The ARTFL Project
American and French Research on the
Treasury of the French Language
Department of Romance Languages and Literatures
University of Chicago
1050 East 59th Street
Chicago Illinois 60637
(312) 702-8488
electronic mail: "

Access is on a subscription basis. A college buys the right to access the
database online for a year. Subscriptions to Chicago are not available in

4. More information about Hansard and ARTFL

Source: Jane Edwards <edwardsEDU.Berkeley.cogsci>

"I give information regarding the ARTFL and the Hansard Corpus in my survey of
corpora, which is available in compressed form via anonymous ftp from in the
pub directory, as "CorpusSurvey.Z". If you have difficulty obtaining it, let
me know and I'll be happy to email it to you. It
is available in hardcopy in the book "Talking Data: Transcription and Coding in
Discourse Research" edited by myself and Martin Lampert (1993)."

5. The Hachette / OUP corpus

Source: Mark Gide of ARTFL (email address under [3]):

"INaLF Hachette CD-ROM

It is about 300 texts with software for PC systems.
M. Alain Pierrot
Hachette Education
79, boul. Saint Germain
75006 Paris FRANCE
Discotext 1

The product is called Discotext 1 I believe."

The Oxford-Hachette French Dictionary will be launched in April 1994. It is
based in part on this corpus. Information from: Ivan Asquith, Oxford
University Press, Walton St., Oxford, OX2 6DP England. Tel: +44 (0)865
56767; Fax: +44 (0)865 56646. OUP are holding a conference to coincide with
the launch, where lexicographers will talk about how the corpus was used. The
conference is free, including accommodation and food in Oxford - you just pay
for travel (am I the only one who has uneasy feelings about this kind of
freebie by a commercial organisation?).

6. The Oxford Text Archive

A collection of machine-readable texts in many languages. Some of the texts
have unrestricted access, others have copyright restrictions imposed by the
people who deposited them in the archive. FTP or orders by email can be used
to obtain files, once you have registered with the OTA. Information from:

Oxford Text Archive
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN, UK
email:
tel: +44 865 273238
fax: +44 865 273275

7. Mike Scott -- AELSU, English, University of Liverpool <>

"I have a stock of Portuguese (about 400,000 words) and a 4.2 million word
corpus from the Guardian newspaper. I also have some small amounts of French
but want more! I'd be prepared to swap with others who want text (clean ASCII
is my preference, untagged).

I am engaged with researchers in Birmingham & Brazil in setting up a corpus in
Portuguese and English."

8. Le Monde on CD-ROM

The French newspaper Le Monde is available on CD-ROM. The UK distributor is:
Research Publications International, PO Box 45, Reading, Berkshire, RG1 8HF,
England. Tel: +44 (0)734 583247; Fax: +44 (0)734 394334. The same company
has a French office, c/o Office Central de Documentation, 33 rue Linné, 75005
Paris, France. Tel: +33 1 43 37 66 11; Fax: +33 1 45 35 72 04. The US
office is at 12 Lunar Drive/Drawer AB, Woodbridge, CT 06535. Tel: (203) 397
2600; Fax: (203) 397 3893.

This is a commercial company who mainly publish patent and business information
on CD-ROM. They also supply CD-ROMs of the Times and Sunday Times, the
Jerusalem Post, and biographical and bibliographical info on contemporary

In the UK, Le Monde on CD-ROM costs 695 pounds per year, or 495 pounds if your
university also takes the microfilm, or 245 pounds for schools.
