Here is some information about computerized French language corpora. Heartfelt thanks to all who responded. Allow 4 to 5 pages to print out this information. The headings are:

1. The Canadian Hansard Corpus
2. The ACL initiative/ more on Hansard
3. The ARTFL database at Chicago University
4. More information about Hansard and ARTFL
5. The Hachette / OUP corpus
6. The Oxford Text Archive
7. Mike Scott at Liverpool University
8. Le Monde on CD-ROM

Please send me corrections and additions.

Raphael Salkie,
The Language Centre,
University of Brighton,
Falmer, Brighton, BN1 9PH
England
Tel: (0273) 643335 (direct line); (0273) 643337 (Language Centre Office).
Fax: (0273) 690710
Email: RMS3@UK.AC.BRIGHTON.VMS

................................................................................

1. The Canadian Hansard Corpus

Source: Bruno Maximilian Schulze: <schulze@de.uni-stuttgart.ims>

"How about the Hansard Corpus, which is a (sentence) aligned parallel corpus (French and English) containing debates of the Canadian parliament (total 50 million words)? The Hansard Corpus is available from the ACL/DCI (Association for Computational Linguistics Data Collection Initiative). You can probably contact:

Mark Liberman
Department of Linguistics
University of Pennsylvania
Philadelphia PA 19104
email: myl@unagi.cis.upenn.edu"

2. The ACL initiative/ more on Hansard

Source: Louisa Sadler <louisa@essex.ac.uk>

The Association for Computational Linguistics has a project to compile a multilingual corpus. Information from: Susan Armstrong-Warwick <susan@ch.unige.divsun>, who writes:

"We are in the final stages - the data should go to LDC end of this week to prepare for pressing the CD-ROM - we have 100 million words in over 20 languages and including parallel versions from banks and international organizations. The hansard corpus is distributed by LDC - contact them."

I replied asking what LDC is - no response yet.

3. The ARTFL database at Chicago University

Sources: Stavros Macrakis <macrakis@org.osf>
         Angus B. Grieve-Smith: grvsmth@uchicago.edu

"ARTFL
A Textual Database
2000 Texts
17th-20th Centuries
Literature, Philosophy, Arts, Sciences...

A Cooperative Project: Centre National de la Recherche Scientifique and The University of Chicago

At present the corpus consists of nearly 2000 texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. The eighteenth, nineteenth and twentieth centuries are about equally represented, with a smaller selection of seventeenth-century texts as well as some medieval and Renaissance texts. Genres include novels, verse, journalism, essays, correspondence, and treatises. Subjects include literary criticism, biology, history, economics, and philosophy. In most cases standard scholarly editions were used in converting the text into machine-readable form, and the data include page references to these editions.

The ARTFL Project is supported by a full-time staff at the University of Chicago. We encourage you to write or call us with any questions you may have about the project - the availability of texts, operation of the system, the costs of using the database.

The ARTFL Project
American and French Research on the Treasury of the French Language
Department of Romance Languages and Literatures
University of Chicago
1050 East 59th Street
Chicago Illinois 60637
(312) 702-8488
electronic mail: mark@gide.uchicago.edu"

Access is on a subscription basis. A college buys the right to access the database online for a year. Subscriptions to Chicago are not available in Europe.

4. More information about Hansard and ARTFL

Source: Jane Edwards <edwards@EDU.Berkeley.cogsci>

"I give information regarding the ARTFL and the Hansard Corpus in my survey of corpora, which is available in compressed form via anonymous ftp from cogsci.berkeley.edu in the pub directory, as "CorpusSurvey.Z". If you have difficulty obtaining it, let me know and I'll be happy to email it to you. It is available in hardcopy in the book "Talking Data: Transcription and Coding in Discourse Research" edited by myself and Martin Lampert (1993)."

5. The Hachette / OUP corpus

Source: Mark Gide of ARTFL (email address under [3]):

"INaLF Hachette CD-ROM
It is about 300 texts with software for PC systems.

M. Alain Pierrot
Hachette Education
79, boul. Saint Germain
75006 Paris
FRANCE

Discotext 1
The product is called Discotext 1, I believe."

The Oxford-Hachette French Dictionary will be launched in April 1994. It is based in part on this corpus. Information from: Ivan Asquith, Oxford University Press, Walton St., Oxford, OX2 6DP England. Tel: +44 (0)865 56767; Fax: +44 (0)865 56646.

OUP are holding a conference to coincide with the launch, where lexicographers will talk about how the corpus was used. The conference is free, including accommodation and food in Oxford - you just pay for travel (am I the only one who has uneasy feelings about this kind of freebie by a commercial organisation?).

6. The Oxford Text Archive

A collection of machine-readable texts in many languages. Some of the texts have unrestricted access; others have copyright restrictions imposed by the people who deposited them in the archive. Once you have registered with the OTA, you can obtain files by FTP or by ordering them by email. Information from:

Oxford Text Archive
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN, UK
email: archive@ox.ac.uk
tel: +44 865 273238
fax: +44 865 273275

7. Mike Scott -- AELSU, English, University of Liverpool <ms2928@liverpool.ac.uk>

"I have a stock of Portuguese (about 400,000 words) and a 4.2 million word corpus from the Guardian newspaper. I also have some small amounts of French but want more! I'd be prepared to swap with others who want text (clean ASCII is my preference, untagged). I am engaged with researchers in Birmingham & Brazil in setting up a corpus in Portuguese and English."

8. Le Monde on CD-ROM

The French newspaper Le Monde is available on CD-ROM. The UK distributor is: Research Publications International, PO Box 45, Reading, Berkshire, RG1 8HF, England. Tel: +44 (0)734 583247; Fax: +44 (0)734 394334.

The same company has a French office, c/o Office Central de Documentation, 33 rue Linné, 75005 Paris, France. Tel +33 1 43 37 66 11; Fax +33 1 45 35 72 04.

The US office is at 12 Lunar Drive/Drawer AB, Woodbridge, CT 06535. Tel: (203) 397 2600; Fax: (203) 397 3893.

This is a commercial company that mainly publishes patent and business information on CD-ROM. They also supply CD-ROMs of the Times and Sunday Times, the Jerusalem Post, and biographical and bibliographical information on contemporary authors.

Le Monde on CD-ROM costs in the UK 695 pounds per year, or 495 pounds if your university also takes the microfilm, or 245 pounds for schools.