LINGUIST List 12.3165

Sat Dec 22 2001

Sum: Russian Corpora & Natural Language Processing

Editor for this issue: Marie Klopfenstein <>


  • Xu Hancheng, NLP and Russian language

    Message 1: NLP and Russian language

    Date: Thu, 20 Dec 2001 09:38:59 +0800
    From: Xu Hancheng <>
    Subject: NLP and Russian language

    Dear colleagues,

    I posted a message to this list seeking information about Russian corpora and NLP in general (Linguist 12.2884). I really appreciate the responses from all of you. I received far more information than I expected. This summary consists of two parts: information about Russian corpora, papers, and tools, followed by contact information for the respondents.

    I. Russian corpora, papers, NLP tools and systems

    1. Uppsala-Tuebingen corpora (Wayles Browne, Daniel Buncic, Dagmar Divjak, Andrew Hippisley, Ruprecht von Waldenfels)

    One of the best-known Russian corpora, with 1 million words. A site well worth visiting.

    2. (Serge Sharoff)

    An aligned corpus and tools to work with multilingual corpora.

    3. (Daniel Buncic)

    Elisabeth Seitz's paper "Digital Corpora and Databases: New Horizons in Slavic Linguistics" read on 19.03.1998 in Ljubljana.

    4. (Daniel Buncic)

    The Computer Fund of the Russian Language provided by the Russian Academy of Sciences, available through the TRACTOR project.

    5. (Daniel Buncic) Russian audiotexts and acoustic databases provided by the LiLab at the Bochum University Seminar of Slavistics.

    6. (Daniel Buncic)

    The Old Russian texts at the TITUS project (Thesaurus Indogermanischer Text- und Sprachmaterialien, i.e. Thesaurus of Indo-European Text and Language Material).

    7. "Biblioteka Moshkova" (Daniel Buncic, Ruprecht von Waldenfels) One of the most popular on-line libraries. Fictional and non-fictional texts can be downloaded.


    8. Link collection at the Bonn University Seminar of Slavistics.

    9. "A Computational Phonology of Russian" (1999) by Dr. Peter Chew of Oxford University. It includes a morphological corpus of Russian as one of its appendices. Available through inter-library loan. (Peter Chew)

    10. The corpus used by Apresjan and his team in their work on the dictionary of synonyms. That corpus contains about 10 million words but is not available to 'outsiders'. (Dagmar Divjak)

    11. Andrew Hippisley carried out a statistical analysis of the nouns. The dataset is available at his site, along with a readme file explaining it. (Andrew Hippisley)

    12. (Serge Sharoff, Vera Fluhr-Semenova)

    I would advise everyone who is interested in Russian NLP and has never been there to have a look at this site. It offers rich information on Russian NLP systems. Sciper - Societe de Conseil dans le domaine Informatique sur le Pays de l'Est et Russie (Consulting on Information Technologies of East European Countries and Russia).

    13. (Ruprecht von Waldenfels) Links.

    14. An on-line morphological parser at the Sergei Starostin homepage, under the link "Russian dictionaries and morphology". (Alexandre Arkhipov)

    15. (Alexandre Arkhipov, Olga Krivnova) Dr. Olga Krivnova and her colleagues are developing a system of Russian speech synthesis. Their work also includes the creation of several Russian speech corpora designed for experimental speech recognition projects.

    16. Grigori Sidorov has developed a program for Russian morphological analysis and generation (including lemmatization). It works with about 100,000 stems (generating about 1,500,000 wordforms). The dictionary file is less than 2 MB. It is free for scientific purposes and is available as a DLL or EXE for Windows. (Grigori Sidorov)

    I would also like to add what I have found on the Internet:

    17. (Russian Virtual Library)

    18. Laboratory for General and Computational Lexicology and Lexicography of Moscow University. Corpora of Russian newspapers and other corpora.

    II. Contact information of respondents

    1. Wayles Browne, Assoc. Prof. of Linguistics, Department of Linguistics, Morrill Hall 220, Cornell University, Ithaca, New York 14853, U.S.A.

    tel. 607-255-0712 (o), 607-273-3009 (h); fax 607-255-2044 (write FOR W. BROWNE)

    2. Dr. Serge Sharoff, Fakultaet fuer Linguistik und Literaturwissenschaft, Universitaet Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany, tel: +49-521-1065275; fax: +49-521-1066447

    3. Daniel Buncic

    Bonn University Seminar of Slavonic Philology, Lennestr. 1, D-53113 Bonn. Phone: +49 228 73-7203; fax & answering machine: +49 1212 515081457

    4. Peter Chew, Oxford University

    5. Dagmar Divjak, Ph.D. candidate in Russian linguistics.

    6. Andrew Hippisley

    7. Vera Fluhr-Semenova

    8. Ruprecht von Waldenfels

    9. Alexandre Arkhipov, Moscow State University

    10. Olga Krivnova

    11. Grigori Sidorov, Ph.D., Natural Language Processing Lab, Center for Computing Research (CIC), National Polytechnic Institute (IPN), Av. Juan de Dios Batiz, s/n, esq. Mendizabal, Zacatenco, CP 07738, Mexico D.F., Mexico. Tel. +52 5729-6000, ext 56618, 56544; fax +1 (520) 441-18-17, +52 55862936

    12. Evelina G. Fedorenko, a student from Alfonso Caramazza's Lab, Harvard, is also interested in our exchange of information.

    I look forward to further exchange of information with all of you.

    Best regards,

    Xu Hancheng