Texts & Corpora

Page Index:
   Corpora
   Electronic Texts
   Text and Corpora Meta Sites



Corpora

  • AnCora Corpora: Syntactically and Semantically Annotated Corpora (Spanish, Catalan). CLiC (Centre for Language and Computation) of the University of Barcelona, together with the Natural Language Processing group of the Polytechnic University of Catalonia, has created two new language technology resources: AnCora-Esp for Spanish and AnCora-Cat for Catalan, consisting of 500,000 words each. They are two treebanks enriched with different kinds of semantic information: 1) each function has its argument and thematic role; 2) each verb belongs to a semantic class according to its event structure and diathesis alternations; 3) each noun has its WordNet sense; and 4) each named entity (persons, organisations, locations, dates, etc.) is identified and categorized. The annotation process has also resulted in two verbal lexicons with approximately 2,000 entries for each language, with information about verbal semantic classes and their syntactic subcategorization, their argument structure and the thematic roles for each sense. The AnCora corpora as well as the derived verbal lexicons (AnCora-Verb) are freely available (queries and downloads) from: http://clic.ub.edu/ancora/.
  • British National Corpus: A 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
  • Buckeye Corpus: This corpus contains high-quality recordings of conversational American English speech from 40 speakers in Columbus, OH, USA. The speech has been orthographically transcribed and phonetically labeled. Currently the audio files and transcriptions for 20 talkers are available.
  • CHAINS: Characterising Individual Speakers: The Chains corpus is a novel speech corpus collected with the primary aim of facilitating research in speaker identification. The corpus features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. The corpus is being made freely available for research purposes.
  • CLIPS, Corpus of Spoken Italian: CLIPS is a corpus of spoken Italian, freely available at www.clips.unina.it. The corpus (audio files, annotation and documentation) is fully downloadable from the website via ftp, free for research purposes. CLIPS consists of about 100 hours of speech, equally represented by female and male voices. A section of the corpus is transcribed orthographically; a smaller section has been phonetically labeled. Recordings were made in 15 Italian cities, selected on the basis of linguistic and socio-economic principles of representativeness: Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia. For each of the 15 cities different text typologies have been included: a) radio and television broadcasts (news, interviews, talk shows); b) dialogue (240 dialogues collected using the map task procedure and the "spot the difference" game; in this set, 30 dialogues are phonetically labeled and 90 orthographically transcribed); c) read speech from non-professional speakers (20 sentences each, covering medium-high frequency Italian words); d) speech over the telephone (conversations between 300 speakers and a simulated hotel desk service operator); e) read speech from 20 professional speakers (160 sentences, covering all phonotactic sequences and medium-high frequency Italian words) recorded in an anechoic chamber. Documentation, corpus collection and annotation follow the EAGLES guidelines.
  • COMPARA - Portuguese-English Parallel Corpus: COMPARA is a bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source texts and translations. Access is free and requires no registration.
  • COSMAS Corpus Archive: The largest German corpus archive, with free-of-charge online search in 1,181 million words of running text (1,846 million words for invited guests).
  • CSLU Spoltech Brazilian Portuguese: The CSLU Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8080 separate utterances. A total of 2540 utterances have been transcribed at the word level (without time alignments), and 5479 utterances have been transcribed at the phoneme level (with time alignments).
  • CSLU: Spelled and Spoken Words: The CSLU: Spelled and Spoken Words corpus consists of spelled and spoken words. 3647 callers were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, 1371 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of 2648 calls has been phonetically labeled.
  • Chinese Gigaword Second Edition: Chinese Gigaword Release Second Edition is a comprehensive archive of newswire text data in Chinese that has been acquired over several years by the LDC. This release includes all of the contents in the first release of the Chinese Gigaword corpus (LDC2003T09), material from one new source, as well as new materials from the other two sources. Thus, the corpus contains three distinct international sources of Chinese newswire - Central News Agency, Taiwan, Xinhua News Agency, and Zaobao. Some minor updates to the documents from the first release have been made.
  • Corpora at ICAME: International Computer Archive of Modern and Medieval English.
  • Corpus del español: An online, searchable corpus of diachronic Spanish texts (100 million words, 13th century to present).
  • Corpus e Lessico di Frequenza dell'Italiano Scritto: CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto) [Corpus and Frequency Lexicon of Written Italian] was produced by Pier Marco Bertinetto (Scuola Normale Superiore, Pisa), Cristina Burani (Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma), Alessandro Laudanna (Università di Salerno; Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma), Lucia Marconi, Daniela Ratti and Claudia Rolando (Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova), and Anna Maria Thornton (Università de L'Aquila). The reference corpus consists of excerpts from newspapers, magazines and books. It includes 3,150,075 lexical occurrences. The corpus was designed as the best approximation to the Italians' average preferred readings, as mirrored by official statistics. The lexicon consists of two main components: the forms repertoire and the lemmas repertoire. In the latter, all identical forms belonging to different lemmas are disambiguated, while syntagmatic words (such as table's leg) are treated as single entries. The lexical lists (both forms and lemmas) are presently available for free download at http://alphalinguistica.sns.it/BancheDati.htm and http://www.istc.cnr.it/material/database/colfis/. They are organized according to a number of possibilities: frequency rank, inverse alphabetical ordering, with or without capital / non-capital distinction, etc. The entire corpus is not yet available. We hope to put it on-line as soon as we obtain the necessary authorizations. The work has been produced with CNR (Consiglio Nazionale delle Ricerche) support. With the help of willing users, this product will hopefully be enriched with further facilities.
  • Croatian National Corpus: The starting point for any linguistic research is a corpus. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus that can be used for all kinds of Croatian-studies, lexicographic and lexicological research.
  • Czech Academic Corpus v. 1.0: The Czech Academic Corpus version 1.0 is a corpus of the Czech language with manual morphological annotation, consisting of approximately 600,000 words in continuous texts.
  • Database of spoken Italian (BADIP): Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS tags and lemmata, and more data are being added continuously. Other corpora of spoken Italian will be included in the database as soon as possible. Access to BADIP is free. The database is part of the LanguageServer of the University of Graz (Austria).
  • Ega XML Lexicon: Digitized, online lexicon of Ega, a language of the Ivory Coast as provided by the late Prof. Eddy Aimé Gbery.
  • El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the web site. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized, written/spoken registers and text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the web site. All documents have been tagged and parsed. Part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the most recently developed tools is El Manchador de Textos, an online resource that “spots” and colors the words or sequences under study; statistical information about the co-occurrences is also available.
  • English Accents and Dialects: Extracts from the Survey of English Dialects and the Millennium Memory Bank document how we spoke and lived in the 20th century.
  • Hebrew Corpus of Arutz7 Newswires: A corpus containing news and articles from Arutz 7 since 2001, updated daily. Text is available in HTML, plain ASCII text, and tokenized text in XML format. It is possible to obtain an XML version of the text morphologically annotated (with all possible analyses) and morphologically disambiguated (with the correct morphological analysis in context). Every day, the front page of Arutz 7 is scanned for updated news and articles, and new material is downloaded. The relevant text is extracted from the downloaded pages and then analyzed for document structure (paragraph, sentence and token segmentation). The texts are then represented in XML. The resources are free but require a username and a password that can be obtained by sending an email to Shlomo Yona.
  • Hellenic National Corpus: HNC is a corpus of written Modern Greek texts, available over the Internet, for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000. It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and various topics (economy, medicine, leisure, art, human sciences etc.). HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek: specific words (e.g. child); lemmas (e.g. child as a lemma produces every inflected type of the word); parts of speech; and combinations of up to three of the above, in which users can specify the distance between lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech). Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference by the users. Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page. Finally, HNC users can make queries concerning word, lemma and/or part-of-speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
  • Het Corpus Gesproken Nederlands: The Corpus Gesproken Nederlands, (Spoken Dutch Corpus), or CGN is a collection of approximately 900 hours of spoken Dutch from Flemish and Dutch speakers. All recordings have been aligned with an orthographic transcription and each word has been given a POS tag and a lemma. Part of the data has been enriched with syntactic, prosodic and/or phonetic information.
  • IPI PAN Corpus of Polish: The 2nd edition of the IPI PAN Corpus of Polish, developed at the Institute of Computer Science of the Polish Academy of Sciences (PAS), is available at the web pages of: - the Institute of Computer Science PAS: http://korpus.pl/en/ - the Institute of Polish Language PAS: http://corpus.ijp-pan.krakow.pl/en/ To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated corpus of Polish available to the public. The whole corpus consists of over 250 million segments (about 200 million orthographic words) and it is not balanced, but a balanced sample of over 30 million segments is also available. These corpora can be directly searched at the above addresses (do read the query syntax cheatsheet at http://korpus.pl/en/cheatsheet/index.html) or downloaded in a binary form to be used with a standalone version of the corpus search engine Poliqarp (announced separately on the 'corpora' list and available from http://korpus.pl/en/).
  • IULA's UPF Textual, plurilingual, specialized Corpus: The main goal of the Corpus project is the construction and exploitation of a textual, plurilingual and specialized corpus. The languages involved are the following: Catalan, Spanish, English, German and French. The areas of interest include: economics, law, computer science, medicine and environmental science. This corpus is the main support for teaching and research at our institute. The research activities envisaged for this corpus include: terminology detection, parallel text alignment, partial parsing, (semi)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.
  • International Corpus of English (British Component): The British Component of the International Corpus of English (ICE-GB) contains one million words of spoken and written British English. The material is fully tagged and parsed and the associated syntactic treebank is searchable with dedicated exploration software. The spoken material can be listened to.
  • Korean Propbank: Korean Propbank is a semantic annotation of the Korean English Treebank Annotations and Korean Treebank version 2.0. Each verb and adjective occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs and adjectives have also been tagged with coarse grained senses. There are two basic components to Korean Propbank: * The Verb Lexicon. A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. 2,749 such files have been created. * The Annotation. There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate tokens. These predicate tokens include all those occurring in over 54 thousand words of the Korean English Treebank Annotations, totaling ~791 KB of uncompressed data. The newswire-verbs.pb file has 23,707 annotated predicate tokens. These predicate tokens include all those occurring in over 131 thousand words of the Korean Treebank version 2.0.
  • Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
  • LumaLiDa - Resources for Child Language: LumaLiDa is a family of database resources for the study of child language. It includes LumaLiDaOn (the Linguistic Diary of Luma, a European Portuguese child), LumaLiDaOnLexicon (the lexicon used by the child in LumaLiDaOn, types and tokens), LumaLiDaAudy (transcribed audio files of child speech), and LumaLiDaAudyLexicon (the lexicon used by the child in LumaLiDaAudy, types and tokens).
  • MDE RT04 Training Data Speech: MDE RT-04 Training Data Speech was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
  • Monguor - Online Texts: Digitized online texts of Monguor, an endangered language spoken in the People's Republic of China, as provided to the E-MELD School of Best Practices by Dr. Wang Xianzhen.
  • N4 NATO Native and Non-Native Speech: The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military oriented database for multilingual and non-native speech processing studies. The NATO Speech and Language Technology group decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries. Speech data was recorded in the Naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada). The material consists of native and non-native speakers using NATO English procedure between ships and reading from a text.
  • Online Dena'ina Qenaga Lexicon: Searchable online wordlist of Dena'ina Qenaga (Tanaina).
  • Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English, which consists of 600,000 words of genre balanced text and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English and the two corpora are distributed together.
  • Penn-Helsinki Parsed Corpus of Middle English: The Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and the Penn Parsed Corpus of Modern British English are syntactically annotated corpora of prose text samples of English from the indicated time periods. Their syntactic annotation (parsing) permits searching, not only for words and word sequences, but also for syntactic structure. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available under certain conditions.
  • Persian Linguistic Database (PLDB): This is the first on-line database for contemporary (Modern) Persian, designed and developed by Dr. S. M. Assi at the Institute for Humanities and Cultural Studies (IHCS), Iran. The database contains a large selection of corpora covering all varieties of the Modern Persian language in the form of running texts. Some of the texts are annotated with grammatical, pronunciation and lemmatisation tags. Dedicated software provides different types of search and statistical listing facilities over the whole database or over any selective corpus made up of a group of texts. The database is constantly improved and expanded.
  • Russian National Corpora: The corpora are designed for anyone interested in a variety of issues related to the Russian language: professional linguists, language teachers, students, foreigners studying the Russian language.
  • SCoSE - Saarbrücken Corpus of Spoken English: The SCoSE consists of five parts: Part 1: Stories; Part 2: Indianapolis Interviews; Part 3: Jokes; Part 4: Complete Conversations; Part 5: Drawing Experiment. You can download each of the five parts as a .pdf file.
  • SMULTRON - The Stockholm Multilingual Treebank: SMULTRON is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences each in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
  • Scandinavië Vertalingen: Translation agency for translations from and into the Scandinavian languages (Swedish, Finnish, Norwegian, Danish).
  • Scottish Corpus of Texts and Speech (SCOTS): SCOTS is an AHRC-funded project, creating a corpus of texts in the languages of Scotland, in the first instance Scots and Scottish English, of all available genres. Spoken texts (orthographic transcription plus accompanying audio/video files) make up 20% of the complete corpus. The corpus is fully searchable online, and the website also contains a description and instructions.
  • Searchable Biao Min Lexicon: The Biao Min Lexicon, housed on the E-MELD site, consists of nearly 3,000 lexical items from Biao Min documentation collected by David Solnit.
  • Searchable Kayardild Lexicon: Searchable lexicon of Kayardild, collected by Dr. Nicholas Evans and hosted by E-MELD.
  • Searchable Mocoví Lexicon: Searchable Mocoví Lexicon, based on data provided and collected by Dr. Verónica Grondona.
  • Searchable Potawatomi Lexicon: Online, searchable Potawatomi lexicon, utilizing data provided to the E-MELD School of Best Practices by Dr. Laura Buszard-Welcher.
  • Searchable Saliba Lexicon: Searchable Lexicon of the Saliba language, utilizing data provided to the E-MELD School of Best Practices by Nancy Morse.
  • Slovak National Corpus: The Slovak National Corpus is being built as a general monolingual corpus. In its first phase (2003), it began compiling written texts originating from 1990 – 2003, containing about 30 million words with lemmatisation, morphological annotation and source (bibliographical and style-genre) annotation. During the second phase (up to 2006), the representative span of written texts will be extended to other periods of the contemporary language (1955 – 2005), reaching 200 million words, and a selected sample will be syntactically annotated. Simultaneously, work will begin on specific sub-corpora of diachronic and dialectological texts, as well as on a terminological and lexicographical database. The Slovak National Corpus is intended primarily for lexicographers (dictionary creation) and complements grammatical and stylistic research (grammar and orthographical handbooks; varieties of the national language and their usage in communication). We expect that it will also find use in schools (preparing orthography, grammar and style textbooks; teaching Slovak as a foreign language). Specific sub-corpora of historical and dialectological texts will help to preserve an important part of our cultural heritage in the long term.
  • Speech Controlled Computing: The Speech Controlled Computing corpus was designed to support the development of small footprint, embedded ASR applications in the domain of voice control for the home. It consists of the recordings of 125 speakers of American English from four regions, three age groups and two gender groups, pronouncing isolated words. The recordings were conducted in a sound-attenuated room, and a high-quality microphone was used. Each speaker read a randomized word list consisting of 2100 words (100 distinct words appearing 21 times each). NOTE: Nonmembers may obtain a commercial rights license to Speech Controlled Computing for US$7000 by signing the LDC User License Agreement for Speech Controlled Computing. For-Profit Membership to the LDC is not required.
  • The Babel English-Chinese Parallel Corpus: The Babel English-Chinese Parallel Corpus consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English words plus 135,493 Chinese words) were collected from the World of English between October 2000 and February 2001 while the remaining 212 texts (132,140 English words plus 151,969 Chinese words) were collected from Time from September 2000 to January 2001. The corpus contains a total of 544,095 words (253,633 English words and 287,462 Chinese words). Both English and Chinese texts are tagged for part of speech. The parallel corpus is aligned at the sentence level. Sentence alignment was done automatically and corrected by hand. The Babel parallel corpus can be accessed freely via the Web-based parallel concordancer at the corpus website.
  • The Bergen Corpus of London Teenage Language (COLT): The Bergen Corpus of London Teenage Language (COLT) is the first large English Corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13 to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.
  • The EMILLE Corpus: The EMILLE Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. The corpus is marked up using CES-compliant SGML, and encoded using Unicode. The EMILLE/CIIL Corpus (http://www.elda.org/catalogue/en/text/W0037.html) is distributed free of charge for use in non-profit-making research only. The EMILLE Lancaster Corpus (http://www.elda.org/catalogue/en/text/W0038.html) is for commercial use only. Both versions are available from the European Language Resources Association.
  • The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages: Version 3.0 of the JRC-Acquis, which has almost tripled in size, is a unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in the 23 official EU languages, with the exception of Irish. The corpus consists of about 23,000 documents per language, averaging 49 million words per language and totalling over one billion words. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is currently available for a subset of 8000 documents in 210 language pair combinations. Pair-wise alignment for all texts in all 231 language pairs will be available soon. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for all types of cross-language research, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
  • The Lancaster Corpus of Mandarin Chinese: The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora for modern British and American English. The corpus samples 15 written text categories, including news, literary texts, academic prose and official documents, published in the P. R. China in the early 1990s, for a total of approximately 1 million words. The same sampling frame and period as FLOB/FROWN were used in LCMC. The texts in the corpus are encoded in Unicode (UTF-8) and marked up in XML. Linguistic annotations undertaken on the corpus include tokenization and part-of-speech tagging. The corpus is suitable for use in both monolingual research into modern Mandarin Chinese and cross-linguistic contrast of Chinese and British/American English. It can be ordered from the European Language Resources Association (http://www.elda.org/catalogue/en/text/W0039.html) or accessed online at the corpus website using the Web-based concordancer or Xaira.
  • Timebank 1.2: The TimeBank 1.2 corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. The most recent information on TimeML is always available at www.timeml.org. TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. Timebank 1.2 is distributed via web download. Nonmembers may license this data at no cost - please note that a signed copy of our generic nonmember user agreement is required.
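
The language-pair figures quoted in the JRC-Acquis entry above follow from counting unordered pairs: n languages give n(n-1)/2 pairs. The short Python sketch below only illustrates that arithmetic. The function name and the placeholder language codes are hypothetical and are not part of the JRC-Acquis distribution. That the 210-pair subset corresponds to 21 languages is an inference from the arithmetic, not something the entry states.

```python
from itertools import combinations


def count_language_pairs(num_languages: int) -> int:
    """Return the number of unordered language pairs among num_languages."""
    return num_languages * (num_languages - 1) // 2


# Hypothetical placeholder codes standing in for the 22 corpus languages.
languages = [f"lang{i:02d}" for i in range(1, 23)]

# Enumerate every unordered pair explicitly and compare with the formula.
pairs = list(combinations(languages, 2))

print(len(pairs))                # pairs among 22 languages
print(count_language_pairs(22))  # same count from the closed-form formula
print(count_language_pairs(21))  # pairs among 21 languages
```

With 22 languages both counts come to 231, matching the full set of language pairs mentioned in the entry, and 21 languages give 210.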


Electronic Texts

  • Albanian Linguistic Corpus: 1. The complete Bible: 874,676 words, of which 29,717 are unique. 2. The Constitution of the Republic of Albania: 27,778 words, of which 3,447 are unique. 3. Voice of America (Albanian language): all news stories back to 2001, 7,061,826 words in total, of which 104,398 are unique. The parts of each news item are wrapped in headline, date, and story tags.
  • Alex: A Catalogue of Electronic Texts on the Internet: A collection of public domain documents from American and English literature as well as Western philosophy.
  • Croatian National Corpus: The starting point for any linguistic research is the corpus. As Croatian does not have a systematically compiled corpus, the objective of this project is to compile and analyse representative Croatian texts, both older and contemporary, in the form of a corpus that can be used for all kinds of Croatistic, lexicographic and lexicological research.
  • El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the web site. The corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers and text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the web site. All documents have been tagged and parsed, and part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable, morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their own corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the most recently developed tools is El Manchador de Textos, an online resource that “spots” and colors the words or sequences under study; statistical information about the co-occurrences is also available.
  • Freiburger Anthologie: The 1,200 best-known German poems in a searchable database.
  • Georgian proverbs: An online book of Georgian proverbs, in the Georgian language.
  • Hellenic National Corpus: HNC is a corpus of written Modern Greek texts, available over the Internet for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000. It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and cover various topics (economy, medicine, leisure, art, human sciences etc.). HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek: specific words (e.g. child); lemmas (e.g. child as a lemma returns every inflected form of the word); parts of speech; and combinations of up to three of the above, in which users can specify the distance between lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech). Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference. Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, query results can be presented as concordances, with the query object centred on the page. Finally, HNC users can make queries concerning word, lemma and/or part-of-speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
  • Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
  • Korpus 2000: The aim of the Korpus 2000 project is to document the use of the Danish language around the year 2000 - in the form of a text corpus in which one can look up words and phrases via this website. The texts that constitute the Korpus 2000 were written mainly between 1998 and 2002.
  • Oxford Text Archive (OTA): Text archive.
  • Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English, which consists of 600,000 words of genre balanced text and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English and the two corpora are distributed together.
  • Project Gutenberg e-texts: Texts online.
  • Sociolingüística Andaluza: Sociolinguistics research group at the Universidad de Sevilla.
  • Textos Hixkaryana: A scanned version of Derbyshire's (1965) text collection of Hixkaryana, with constituent-numbered translations in Portuguese and English. A sincere but unsuccessful attempt has been made to contact the publisher for an exemption of copyright. Publication Information: Derbyshire, Desmond. 1965. Textos Hixkaryana. Belem, Para, Brasil: Conselho Nacional de Pesquisas. Instituto Nacional de Pesquisas da Amazonia. Museo Paraense Emilio Goeldi.
  • Texts in context: A lovely collection of classified, annotated and (partially) downloadable texts from the British Library's collection - good for both teaching and research. Here's the introduction: Texts in Context is a rich and unusual collection of over 400 British Library texts. You can find menus for medieval banquets and handwritten recipes scribbled inside book covers. You can browse the first English dictionary ever written and explore the secret language of the Georgian underworld. You can study the East India Company's shopping lists and practise sentences from colonial phrasebooks. You can learn smugglers' songs, listen to rare dialect recordings, and examine the logbooks of 17th century trading ships.
  • The Aboriginal Studies Electronic Data Archive: The Australian Institute of Aboriginal and Torres Strait Islander Studies holds computer-based (digital) materials about Australian Indigenous languages in the Aboriginal Studies Electronic Data Archive (ASEDA). ASEDA offers a free service of secure storage, maintenance, and distribution of electronic texts relating to these languages.
  • The Babel English-Chinese Parallel Corpus: The Babel English-Chinese Parallel Corpus consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English words plus 135,493 Chinese words) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English words plus 151,969 Chinese words) were collected from Time between September 2000 and January 2001. The corpus contains a total of 541,095 words (253,633 English words and 287,462 Chinese words). Both English and Chinese texts are tagged for part of speech. The parallel corpus is aligned at the sentence level. Sentence alignment was done automatically and corrected by hand. The Babel parallel corpus can be accessed freely via the Web-based parallel concordancer at the corpus website.
  • The Sumerian Text Archive: A growing collection of texts in the Sumerian language.
  • The University of Virginia Electronic Text Center: An on-line archive of tens of thousands of SGML and XML-encoded electronic texts and images with a library service that offers hardware and software suitable for the creation and analysis of text.
  • Tofa Videos and Texts: The Tofa stories available here were recorded by Dr. K. David Harrison in 2000 and 2001, for a project funded by a grant from Volkswagen-Stiftung.
  • Vercial Project: A database of Portuguese texts (medieval and classic).


Text and Corpora Meta Sites

  • ACL SIGLEX: An index of links to publicly available lexical resources (dictionaries and corpora).
  • Bookmarks for Corpus-based Linguists: These links (c. 1,000 of them) are meant mainly for linguists/language teachers, not computational linguists/NLP researchers, so the language-engineering-type links here are definitely not exhaustive.
  • Centre for English Corpus Linguistics: The International Corpus of Learner English is a corpus of writing by higher intermediate to advanced learners. It is the result of over ten years of collaborative activity between a large number of universities internationally. It contains 2.5 million words of EFL writing from learners representing 11 different mother tongue backgrounds (Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, Swedish). The CD-ROM and Handbook are available from http://www.i6doc.com
  • ELRA (European Language Resources Association): The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community.
  • Italian Linguistics: Information (in Italian) on Italian linguistics and corpora.
  • Leiden Armenian Lexical Textbase: Launching the Leiden Armenian Lexical Textbase (http://www.sd-editions.com/LALT/home.html). LALT combines Classical Armenian dictionaries with morphologically analyzed texts. There are some 80,000 Armenian lexemes and ten texts. The complete Nor Bargirk, main sections of Adjarian's Root Dictionary, Bedrossian's Armenian-English Dictionary and other material are integrated in LALT. There is a Greek-Armenian lexicon (20,000 entries), as well as aligned Armenian-Greek texts. LALT is currently open for inspection. In a few months paid subscriptions will be accepted. Conditions for individuals and institutions will be published. LALT will be updated at regular intervals. LALT can also easily integrate additional material and welcomes contributions from other scholars. I have been asked about fonts: LALT is written in XML and uses Unicode. Any Unicode font will be able to display it, provided the font contains the glyphs (screen images) for Armenian and Greek. One such font is Titus Cyberbit, which is used within LALT itself. It is available for free at http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp and general information on Armenian and Unicode may be obtained at http://www.armunicode.org/en/fonts/unicode (Jos Weitenberg)
  • Linguistic Data Consortium: Creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes.
  • Linguistic and Folklore materials from the Kujamaat Jóola: A site which will eventually grow to have an extensive collection of Kujamaat linguistic and folklore materials. It currently contains a dictionary (already listed), two folktales (text, translation and sound) and verses from extemporaneous funeral songs (text, sound, translation, commentary).
  • On-line books FAQ: Public domain sources of Etext available on the Internet.
  • The IViE corpus: An intonationally transcribed corpus covering seven dialects of English from the British Isles. Subjects were secondary school students. The corpus covers short read sentences, a read story, a retold story, map tasks and free conversation.
  • WEBSOM: Large document collections that are automatically organized by the novel WEBSOM method (including Usenet newsgroup sci.lang). An ordered map of the information space is provided.
  • XNLRDF: XNLRDF is a database for the creation and distribution of basic linguistic information for a great number of natural languages so that they can be used for research and development. Linguistic data are inserted through a Web-interface. The linguistic data in the database can be compiled and downloaded in XML by the user.

Page Updated: 16-May-2008

While the LINGUIST List makes every effort to ensure the linguistic relevance of sites listed
on its pages, it cannot vouch for their contents.