LINGUIST List 12.3165

Sat Dec 22 2001

Sum: Russian Corpora & Natural Language Processing

Editor for this issue: Marie Klopfenstein


  1. Xu Hancheng, NLP and Russian language

Message 1: NLP and Russian language

Date: Thu, 20 Dec 2001 09:38:59 +0800
From: Xu Hancheng
Subject: NLP and Russian language

Dear colleagues,

I posted a message to this list seeking information about Russian
corpora and NLP in general (Linguist 12.2884). I really appreciate the
responses from all of you; I received far more information than I expected.
This summary consists of two parts: information about Russian corpora,
papers, and tools, followed by contact information for the respondents.

I. Russian corpora, papers, NLP tools and systems

1. Uppsala-Tuebingen corpus (Wayles Browne, Daniel Buncic, Dagmar
Divjak, Andrew Hippisley, Ruprecht von Waldenfels)

One of the best-known Russian corpora, with 1 million words. A site well worth visiting.

2. (Serge Sharoff)

An aligned corpus and tools to work with multilingual corpora.

3. (Daniel Buncic)

Elisabeth Seitz's paper "Digital Corpora and Databases: New Horizons
in Slavic Linguistics" read on 19.03.1998 in Ljubljana.

4. (Daniel Buncic)

The Computer Fund of the Russian Language provided by the Russian
Academy of Sciences, available through the TRACTOR project.

5. (Daniel Buncic)

Russian audiotexts and acoustic databases provided by the LiLab at the Bochum
University Seminar of Slavistics.

6. (Daniel Buncic)

The Old Russian texts at the TITUS project (Thesaurus
Indogermanischer Text- und Sprachmaterialien, i.e. Thesaurus of
Indo-European Text and Language Material).

7. "Biblioteka Moshkova" (Daniel Buncic, Ruprecht von Waldenfels)

One of the most popular on-line libraries. Fiction and non-fiction texts can be downloaded.


8. Link collection at the Bonn University Seminar of Slavistics.

9. "A Computational Phonology of Russian" (1999) by Dr. Peter Chew of
Oxford University. It includes a morphological corpus of Russian as one of
its appendices. Available through inter-library loan. (Peter Chew)

10. The corpus used by Apresjan and his team in their work on the
dictionary of synonyms. The corpus contains about 10 million words but is not
available to 'outsiders'. (Dagmar Divjak)

11. Andrew Hippisley carried out a statistical analysis of the nouns.
The dataset is available on-line, together with a readme file
explaining it. (Andrew Hippisley)

12. (Serge Sharoff, Vera Fluhr-Semenova)

I'd advise everyone who is interested in Russian NLP and has not yet
visited the site to have a look. It offers rich information on Russian NLP systems.
Sciper - Societe de Conseil dans le domaine Informatique sur le Pays
de l'Est et Russie (Consulting on Information Technologies of East European Countries and Russia).

13. (Ruprecht von Waldenfels)

14. An on-line morphological parser at the Sergei Starostin homepage,
under the link "Russian dictionaries and morphology" (Alexandre Arkhipov).

15. (Alexandre Arkhipov, Olga Krivnova)

Dr. Olga Krivnova and her colleagues are developing a system for
Russian speech synthesis. Their work also includes the creation of several
Russian speech corpora designed for experimental speech recognition projects.

16. Grigori Sidorov has developed a program for Russian morphological analysis and generation
(including lemmatization). It works with about 100,000 stems (generating about 1,500,000 wordforms).
The dictionary file is smaller than 2 MB. The program is free for scientific
purposes and is available as a DLL or EXE for Windows. (Grigori Sidorov)

I would also like to add what I have found on the Internet:

17. (Russian Virtual Library)

Laboratory for General and Computational Lexicology and Lexicography
of Moscow University. Corpora of Russian newspapers and other corpora.

II. Contact information of respondents

1. Wayles Browne, Assoc. Prof. of Linguistics
Department of Linguistics
Morrill Hall 220, Cornell University
Ithaca, New York 14853, U.S.A.

tel. 607-255-0712 (o), 607-273-3009 (h)
fax 607-255-2044 (write FOR W. BROWNE)

2. Dr. Serge Sharoff
Fakultaet fuer Linguistik und Literaturwissenschaft, 
Universitaet Bielefeld,
Postfach 10 01 31, D-33501 Bielefeld, Germany,
tel: +49-521-1065275; fax: +49-521-1066447

3. Daniel Buncic
Bonn University Seminar of Slavonic Philology
Lennestr. 1, D-53113 Bonn
Phone: +49 228 73-7203
Fax & answering-machine: +49 1212 515081457

4. Peter Chew
Oxford University

5. Dagmar Divjak
Ph.D. candidate in Russian linguistics.

6. Andrew Hippisley

7. Vera Fluhr-Semenova

8. Ruprecht von Waldenfels

9. Alexandre Arkhipov
Moscow State University

10. Olga Krivnova

11. Grigori Sidorov, Ph.D.,
Natural Language Processing Lab,
Center for Computing Research (CIC),
National Polytechnic Institute (IPN),
Av. Juan de Dios Batiz, s/n, esq. Mendizabal, Zacatenco, CP 07738, Mexico
D.F., Mexico
Tel. +52 5729-6000, ext 56618, 56544
Fax +1 (520) 441-18-17, +52 55862936

12. Evelina G. Fedorenko, a student from
Alfonso Caramazza's Lab, Harvard, is also interested in our exchange of
information.

I am looking forward to further exchange of information with all.

Best regards,

Xu Hancheng