
Texts & Corpora



Corpora

·AnCora Corpora: AnCora: Syntactically and Semantically Annotated Corpora (Spanish, Catalan) CLiC (Centre for Language and Computation) of the University of Barcelona, together with the Natural Language Processing group of the Polytechnic University of Catalonia, has created two new language technology resources: AnCora-Esp for Spanish and AnCora-Cat for Catalan, consisting of 500,000 words each. Both are treebanks enriched with different kinds of semantic information: 1) each syntactic function has its argument and thematic role; 2) each verb belongs to a semantic class according to its event structure and diathesis alternations; 3) each noun has its WordNet sense; and 4) each named entity (persons, organisations, locations, dates, etc.) is identified and categorized. The annotation process has also resulted in two verbal lexicons, with approximately 2,000 entries per language, containing information about verbal semantic classes, syntactic subcategorization, argument structure and the thematic roles of each sense. The AnCora corpora as well as the derived verbal lexicons (AnCora-Verb) are freely available (queries and downloads) from: http://clic.ub.edu/ancora/.
·Arquivo Dialetal do CLUP: Arquivo Dialetal do CLUP (Dialectal Archive of the Center of Linguistics of the University of Porto) is a database of recordings of European Portuguese collected during the last two decades, spanning both Mainland Portugal and islands. Apart from the recordings themselves, this resource provides detailed maps, exhaustive narrow phonetic and orthographic transcriptions and information about dialectal phenomena. Arquivo Dialetal do CLUP is a growing project, and we welcome any comments, suggestions and contributions.
·British National Corpus: A 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
·Buckeye Corpus: This corpus contains high-quality recordings of conversational American English speech from 40 speakers in Columbus, OH, USA. The speech has been orthographically transcribed and phonetically labeled. Currently the audio files and transcriptions for 20 talkers are available.
·Centre for English Corpus Linguistics: ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English representing 16 different mother-tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version, published in 2002, not only in its increased size and range of learner populations, but also in its interface, which contains two new functionalities: a built-in concordancer allowing users to search for word forms, lemmas and/or part-of-speech tags, and a breakdown of query results according to learner profile information. The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the learners' countries of origin. There are three types of licence (for non-profit research purposes only): single-user, multiple-user (2-10) and multiple-user (11-25). The corpus can be ordered online at http://www.i6doc.com
·CGL: The corpus of texts known as Grammatici Latini comprises the Latin grammar manuals written between the 2nd and 7th centuries AD and edited by Heinrich Keil in Leipzig from 1855 to 1880. The corpus has several points of interest: by assembling the main sources, it allows for the reconstruction of the history of ideas in Western linguistics. From the Middle Ages on, these texts (Donatus' and Priscian’s artes in particular) were the basis for the later linguistic tradition. The corpus comprises more than 14,000 literary quotations used as grammatical examples, a large number of which are fragments (literary, philosophical) of works now lost, or passages which can be compared with the direct tradition of extant texts. Some tendencies in late Latin are given prominence, such as the proscription of expressive forms foreign to classical use.
·CHAINS: Characterising Individual Speakers: The Chains corpus is a novel speech corpus collected with the primary aim of facilitating research in speaker identification. The corpus features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. The corpus is being made freely available for research purposes.
·Chinese Gigaword Second Edition: Chinese Gigaword Release Second Edition is a comprehensive archive of newswire text data in Chinese that has been acquired over several years by the LDC. This release includes all of the content from the first release of the Chinese Gigaword corpus (LDC2003T09), material from one new source, and new material from the two original sources. The corpus thus contains three distinct international sources of Chinese newswire: Central News Agency (Taiwan), Xinhua News Agency and Zaobao. Some minor updates to the documents from the first release have been made.
·CLIPS, Corpus of Spoken Italian: CLIPS is a corpus of spoken Italian, freely available at www.clips.unina.it. The corpus (audio files, annotation and documentation) is fully downloadable from the website via ftp, free for research purposes. CLIPS consists of about 100 hours of speech, with female and male voices equally represented. A section of the corpus is transcribed orthographically, and a smaller section has been phonetically labeled. Recordings were made in 15 Italian cities, selected on the basis of linguistic and socio-economic principles of representativeness: Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia. For each of the 15 cities different text typologies have been included: a) radio and television broadcasts (news, interviews, talk shows); b) dialogues (240 dialogues collected using the map task procedure and the "spot the difference" game; in this set, 30 dialogues are phonetically labeled and 90 orthographically transcribed); c) read speech from non-professional speakers (20 sentences each, covering medium-high frequency Italian words); d) speech over the telephone (conversations between 300 speakers and a simulated hotel desk service operator); e) read speech from 20 professional speakers (160 sentences, covering all phonotactic sequences and medium-high frequency Italian words) recorded in an anechoic chamber. Documentation, corpus collection and annotation follow the EAGLES guidelines.
·COMPARA - Portuguese-English Parallel Corpus: COMPARA is a bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source texts and translations. Access is free and requires no registration.
·CORIS/CODIS - Corpus of Contemporary Written Italian: An updated version of CORIS/CODIS, the synchronic corpus of written Italian designed and developed at the University of Bologna, is now accessible online. The new version contains around 130 million words and is updated to 2010. The corpus covers a wide range of text varieties, chosen by virtue of their representativeness of contemporary Italian, and ranges from the 1980s to 2010. The following features are now available: part-of-speech and lemma annotation, a user-friendly interface, and the advanced IMS-CWB query language. The corpus is freely accessible online for research purposes only.
·Corpora at ICAME: International Computer Archive of Modern and Medieval English.
·Corpus Artesia - Archivio Testuale del Siciliano Antico: The Artesia Corpus is part of a larger research project, 'Artesia - Archivio testuale del Siciliano Antico' (Text Archive of Ancient Sicilian), a production of the Department of Modern Philology of the University of Catania, in close cooperation with the Centro di Studi Filologici e Linguistici Siciliani, Palermo (http://www.csfls.it). Among the other contributing national research projects, the Opera del Vocabolario Italiano (OVI) provided the software for creating and managing the full-text database. Our aim is to supply a well-structured research tool for the study of Medieval Sicilian (14th-16th centuries) from a Romance perspective and to account for its whole textual production. In particular, Artesia: • makes accessible and searchable a philologically reliable and periodically updated corpus of literary and non-literary Sicilian texts, from the earliest attestations (14th cent.) to the latest (mid-16th cent.); • provides a brief yet scholarly presentation of each author and text; • documents the individual works by putting them into historical and critical context, highlighting relationships with, and comparisons to, other Latin and Romance textual traditions (Catalan, Tuscan, etc.); • makes a fundamental contribution towards the creation of a Medieval Sicilian Dictionary; • publishes philological and linguistic studies concerning Medieval Sicilian, in both electronic and paper format (see Quaderni di Artesia, Catania: Ed.it, http://www.editpress.it). The Database: The full-text database is available online on the OVI website (http://artesia.ovi.cnr.it) and is searchable using GATTOWEB - Gestione degli Archivi Testuali del Tesoro delle Origini, created for the Tesoro della lingua italiana delle Origini (TLIO) (http://www.ovi.cnr.it). It allows advanced word searches, concordance generation and text lemmatization. Moreover, the Artesia Corpus is periodically published on CD-ROM (http://www.editpress.it/0808.htm). The Corpus: The corpus is made up of both literary and non-literary (e.g. documentary) texts dating from the beginning of the 14th century, when the earliest Sicilian texts appeared, to the mid-16th century, when Tuscan replaced Sicilian as the language of the administration. Among the texts belonging to the corpus are: • published texts (especially the Collezione di testi siciliani dei secoli XIV e XV, published by the Centro di Studi Filologici e Linguistici Siciliani, Palermo); • edited texts (for example PhD dissertations and other editions by the University of Catania); • previously unpublished and unedited texts, now made digitally accessible for the first time, specifically for Artesia. As of March 2010, the corpus contains 73 literary texts and 171 documents (1,081,539 tokens). These have been revisited and emended where doubtful readings occurred; such emendations are systematically signalled by means of GATTO notes. The database is progressively expanding to include the complete set of Medieval Sicilian texts, thus constituting a firm platform for the creation of a Medieval Sicilian Dictionary.
·Corpus de Català Contemporani de la Universitat de Barcelona (CCCUB): Spoken language corpora developed for the study of geographical, functional and socio-cultural variation in Catalan. The texts are in .pdf format. The sound files are not yet available through the web, but they have been published on CD-ROM and can be purchased. The CCCUB is also available through RECERCAT (Dipòsit de la Recerca de Catalunya): http://www.recercat.net/handle/2072/8925.
·Corpus de français parlé au Québec (CFPQ): The Corpus de français parlé au Québec (CFPQ) aims to reflect spontaneous Quebec French as used in the 2000s. It is likely to be of help to any researcher interested in variation in French, particularly from lexicological, semantic or pragmatic angles. Its current size is seven sub-corpora, corresponding to more than ten hours of informal conversations among four or five speakers. The information required for optimal exploitation of the data is available on the site.
·Corpus del español: An online, searchable corpus of diachronic Spanish texts (100 million words, 13th century to present).
·Corpus e Lessico di Frequenza dell'Italiano Scritto: CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto) [Corpus and Frequency Lexicon of Written Italian] was produced by Pier Marco Bertinetto (Scuola Normale Superiore, Pisa), Cristina Burani (Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma), Alessandro Laudanna (Università di Salerno and ISTC-CNR), Lucia Marconi, Daniela Ratti and Claudia Rolando (Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR) and Anna Maria Thornton (Università de L'Aquila). The reference corpus consists of excerpts from newspapers, magazines and books and includes 3,150,075 lexical occurrences. The corpus was designed as the best approximation to Italians' average preferred readings, as mirrored by official statistics. The lexicon consists of two main components: the forms repertoire and the lemmas repertoire. In the latter, all identical forms belonging to different lemmas are disambiguated, while syntagmatic words (such as table's leg) are treated as single entries. The lexical lists (both forms and lemmas) are presently available for free download at http://alphalinguistica.sns.it/BancheDati.htm and http://www.istc.cnr.it/material/database/colfis/ They are organized according to a number of possibilities: frequency rank, inverse alphabetical ordering, with or without capital/non-capital distinction, etc. The entire corpus is not yet available; we hope to put it online as soon as we obtain the necessary authorizations. The work was produced with CNR (Consiglio Nazionale delle Ricerche) support. With the help of willing users, this product will hopefully be enriched with further facilities.
·Corpus of Greek Texts: The Corpus of Greek Texts (CGT) is now available via an alternative webpage interface at the University of Athens. Access is free of charge, provided that users register with a valid e-mail address. The Corpus of Greek Texts (CGT) is the first electronic corpus of Greek texts designed for linguistic research in a wide range of Modern Greek genres. CGT includes 30 million words from spoken and written texts produced between 1990 and 2010. It was created in co-operation between the Universities of Athens and Cyprus and was funded by the Research Committee of the University of Cyprus (for more info see www.ucy.ac.cy/sek) and the programme Pythagoras, co-funded by the EU and Greek sources (for more info see http://greekcorpora.isll.uoa.gr/gr/Default.aspx). The alternative webpage interface was funded by the research programme Kapodistrias of the National and Kapodistrian University of Athens (Programme No: 70/4/760, Dionysis Goutsos). The exclusive aim of the Corpus of Greek Texts (CGT) is scientific linguistic research into Greek through language data. Use of the webpage is strictly restricted to academic, non-profit purposes, on the sole precondition that researchers inform the CGT developers of any output in the form of papers, dissertations, presentations or publications arising from its analysis. For acknowledgments, please quote Γούτσος, Δ. (2003). Σώμα Ελληνικών Κειμένων: Σχεδιασμός και υλοποίηση. Πρακτικά του 6ου Διεθνούς Συνεδρίου Ελληνικής Γλωσσολογίας, Πανεπιστήμιο Κρήτης, 18-21 Σεπτεμβρίου 2003. Electronic publication: http://www.philology.uoc.gr/conferences/6thICGL/gr.htm.
·Corpus of Modern Scottish Writing: The Corpus of Modern Scottish Writing makes freely available a wide range of documents in Scots and Scottish English from 1700-1945, ranging from a rare first edition of Robert Burns' poems to letters from the explorer David Livingstone to murder trial transcripts dating back to the 1750s. The free online resource contains texts, digital images and searchable transcriptions, and can be found at http://www.scottishcorpus.ac.uk/cmsw/. The Corpus also features James Hogg's first-ever book - a treatise on diseases of sheep; personal accounts of the 1715 uprising; and letters to Scotland from the French-Indian war and from emigrants to Australia in the 19th century. City Council minutes are also present, along with a work on spiritualism by Arthur Conan Doyle, novels and student disciplinary trials of the 18th century, and a selection of personal letters and diaries that give an invaluable insight into Scottish life in days gone by. The CMSW project has been funded by the Arts and Humanities Research Council (AHRC) and run by the University’s Department of English Language, now part of the School of Critical Studies.
·Corpus of Remarks on the French language (17th Century): The authors of Remarks treat all aspects of usage – pronunciation, spelling, morphology, syntax, vocabulary and style – but drop the traditional format of grammars. This corpus is an indispensable instrument, not only for specialists in 17th-century language and literature, but also for all those interested in the history of the French language and of its codification and standardization. This database contains the classic texts (the remarks of Vaugelas, Ménage and Bouhours); collections which adopt an alphabetical presentation (Alemand, Andry de Boisregard); texts which criticise Vaugelas and call for greater freedom of usage (Dupleix, La Mothe Le Vayer); the volumes which emanate from circles close to the Academy (the Academy's comments on Vaugelas, and its decisions collected by Tallemant); as well as some less prestigious texts (Buffet's observations addressed to a female audience, and the compilation by Macé which completes his general and critical grammar). For easy use and exploitation, the Corpus of remarks on the French language is accompanied by a number of research instruments: full-text search, a thesaurus of authors (5 categories) and of titles (3 categories), and a thesaurus of examples and quotations. Users can compile their own corpus and extract and export results. This set of instruments will promote new research in the history of the French language and of linguistic conceptions.
·Corpus OVI dell'Italiano Antico: For the redaction of the TLIO, the OVI has prepared, and keeps constantly developing, a large textual database, the Corpus OVI dell’Italiano antico, intended to contain all relevant edited texts in any variety of Early Italian written before 1400 A.D. At present, this corpus, which is updated every 4 months, consists of 1978 texts with 21,817,929 words, 443,810 different word forms, 116,224 lemmas and 3,615,478 lemmatized occurrences. For texts not yet lemmatized and awaiting inclusion in the Corpus OVI, an additional corpus has been created, the Corpus TLIO aggiuntivo, which at present contains 306 texts with 1,189,808 words and 71,900 different word forms.
·Corpus TCOF: The 'Traitement de Corpus Oraux en Français' (TCOF) project at ATILF (UMR 7118, Université de Lorraine & CNRS) makes text-to-sound aligned oral corpora (Transcriber) available to the community. The TCOF corpus comprises two broad categories: recordings of adult/child interactions (currently 126 recordings) and recordings of interactions between adults (currently 102 recordings). The corpus is enriched regularly. Access to the data, via the CNRTL site, is facilitated by a search interface that lets users choose corpora according to their objects of study (adults, children, men, women, professional situations, discourse genre, etc.).
·COSMAS Corpus Archive: The largest German corpus archive, with free-of-charge online search in 1,181 million words of running text (1,846 million words for invited guests).
·Croatian Language Corpus: The Croatian Language Corpus is the result of various projects at the Institute of Croatian Language and Linguistics and the Linguistics Department of the University of Zadar. An online interface based on Philologic is available at the given URL. Currently the corpus indexes more than 100k tokens, and the base is growing continuously. It is annotated in TEI XML P5, and its annotation is being enriched with morphological segmentation, lemmatization, phonemic transcription, morphosyntactic annotation and syntactic parses. The online interfaces are subject to change and extension to improve access to various corpus properties.
·Croatian National Corpus: The starting point for any linguistic research is a corpus. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus usable for all kinds of Croatistic, lexicographic and lexicological research.
·CSLU Spoltech Brazilian Portuguese: The CSLU Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8080 separate utterances. A total of 2540 utterances have been transcribed at the word level (without time alignments), and 5479 utterances have been transcribed at the phoneme level (with time alignments).
·CSLU: Spelled and Spoken Words: The CSLU: Spelled and Spoken Words corpus consists of spelled and spoken words. 3647 callers were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, 1371 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of 2648 calls has been phonetically labeled.
·Czech Academic Corpus v. 1.0: The Czech Academic Corpus version 1.0 is a corpus of approximately 600,000 words of continuous Czech text with manual morphological annotation.
·Database of spoken Italian (BADIP): Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, more data are being added continuously. Other corpora of spoken Italian will be included in the database as soon as possible. Access to BADIP is free. The database is part of the LanguageServer of the University of Graz (Austria).
·Digital Archive of the Macedonian Language: The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials. Web address: http://damj.manu.edu.mk. Macedonian Academy of Sciences and Arts, Research Center for Areal Linguistics; project coordinator: Prof. Marjan Markovik (e-mail: marjan@manu.edu.mk).
·Digital Tamil Literature: Searchable Tamil Digital Text Archive.
·dlexDB: dlexDB is a new lexical statistical database for German. It is based on the DWDS-Kerncorpus, a balanced collection - over time and text genre - of 100 million words of texts of the 20th century. dlexDB provides frequencies for types, lemmas, syllables, characters, orthographic neighbors and more. These measures are of considerable interest for research in psycholinguistics and psychology (e.g., studies on visual word recognition) as well as for general linguistics and lexicography. During the course of the project, more levels of linguistic representation will be added.
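(One of the measures listed above, the orthographic-neighbor count, is often defined as Coltheart's N: the number of words of the same length that differ from the target in exactly one letter. The sketch below, in Python, illustrates that definition against any plain word list; the file name and one-word-per-line format are assumptions for illustration, not the actual dlexDB export.)

    # Minimal sketch of an orthographic-neighborhood count (Coltheart's N):
    # words of the same length that differ in exactly one letter position.
    # "wordlist.txt" (one word per line) is a hypothetical placeholder,
    # not a real dlexDB file format.
    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def neighborhood_size(word, lexicon):
        count = 0
        for i in range(len(word)):
            for letter in "abcdefghijklmnopqrstuvwxyzäöüß":
                if letter == word[i]:
                    continue
                candidate = word[:i] + letter + word[i + 1:]
                if candidate in lexicon:
                    count += 1
        return count

    if __name__ == "__main__":
        lexicon = load_words("wordlist.txt")
        for w in ["haus", "maus", "hand"]:
            print(w, neighborhood_size(w, lexicon))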
·DWDS Corpora and Dictionaries: A lexical information system of German, based on very large corpora and dictionaries. It contains the DWDS-Kerncorpus, a balanced collection - over time and text genre - of 100 million words of texts of the 20th century, various newspaper and special corpora (~650 million words of text publicly available).
·Eastern Armenian National Corpus: Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Standard Eastern Armenian (SEA), the language spoken in the Republic of Armenia. EANC is: - a comprehensive corpus with about 90 million tokens - a powerful search engine for making complex lexical morphological queries - a learner’s corpus including English translations for frequent tokens - a diachronic corpus covering SEA texts from the mid-19th century to the present - a mixed corpus consisting of both written discourse and oral discourse - an open-ended corpus with new texts being added continuously - an annotated corpus with morphological and metatext tagging - an open access corpus - an electronic library with full access to over 100 Armenian classic titles Another important feature is the Glossed output: typologists and language learners can now work with a text format similar to interlinear morphological glosses. In this format, wordforms are supplied with lemmas, lexical and grammatical categories, and translations, vertically aligned below each wordform. Also possible is switching to Latin transliteration from the Armenian alphabet.
·EF Cambridge Open Language Database: EFCamDat contains writings submitted to Englishtown, EF's online school, accessed daily by thousands of learners worldwide. The database currently contains 412,000 scripts from 76,000 learners, totalling 32 million words. (More information: http://linguistlist.org/issues/24/24-2935.html#1)
·Ega XML Lexicon: Digitized, online lexicon of Ega, a language of the Ivory Coast as provided by the late Prof. Eddy Aimé Gbery.
·El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the website. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers and text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the website. All documents have been tagged and parsed, and part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable, morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the latest tools is El Manchador de Textos, an online resource that “spots” and highlights the words or sequences under study; statistical information about the co-occurrences is also available.
·English Accents and Dialects: Extracts from the Survey of English Dialects and the Millennium Memory Bank document how we spoke and lived in the 20th century.
·General Corpus of the Modern Mongolian language: The General Corpus of the Modern Mongolian Language (GCML) contains 966 texts and 1,155,583 words. The processor effectively analyzes 97% of the textual word forms, which correspond to 76% of the word forms in the inputs to the GCML concordance.
·German Political Speeches Corpus and Visualization: This corpus consists of speeches by the German Presidents, Chancellors and a few ministers, all gathered from official sources, and it can be freely republished. The two main corpora are released in XML format with metadata; POS tags will be added. There is also a basic visualization tool enabling users to get a first glimpse of the resource.
·German Speech Errors: Collection of 474 German speech errors by Richard Wiese.
·HATII and DCC Release KRYS I Corpus to Aid Research: The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) have released the KRYS I Corpus for genre classification research (http://www.krys-corpus.eu). The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, the corpus is suited to research on automated text classification (TC), digital curation and metadata extraction, natural language processing (NLP) and computational linguistics (CL). Although document genre classification is a promising supporting step in language processing, document management and information retrieval (the linguistic style and vocabulary of a document vary distinctively across genres), there has to date been a severe lack of genre-labelled document corpora with which researchers can experiment; KRYS I is intended to fill that gap. The corpus originated as part of the ongoing semantic metadata extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). That research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether it is a scientific article or a letter) is characterised by its form and structure, an understanding of which facilitates further extraction of metadata from within the document. The website (http://www.krys-corpus.eu) gives a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings on human classification of the documents. Researchers are encouraged to make full use of the corpus and to contribute their own documents to the database; instructions, contact details and a feedback channel are provided on the website.
·HC Corpora: A collection of open text corpora. Covers many different languages.
·Hebrew Corpus of Arutz7 Newswires: A corpus containing news and articles from Arutz 7 since 2001, updated daily. Text is available as HTML, plain ASCII text, and tokenized text in XML format. An XML version of the text is also available morphologically annotated (with all possible analyses) and morphologically disambiguated (with the correct morphological analysis in context). Every day, the front page of Arutz 7 is scanned for updated news and articles and new material is downloaded. The relevant text is extracted from the downloaded pages and then analyzed for document structure (paragraph, sentence and token segmentation). The texts are then represented in XML. The resources are free but require a username and a password, which can be obtained by sending an email to Shlomo Yona.
·Hellenic National Corpus: HNC is a corpus of written Modern Greek texts, available over the Internet for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000. It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and various topics (economy, medicine, leisure, art, human sciences etc.). HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek: - specific words (e.g. child), - lemmas (e.g. child as a lemma produces every inflected form of the word), - parts of speech and - combinations of up to three of the above, in which users can specify the distance between lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech). Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference by the users. Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page. Finally, HNC users can make queries concerning word, lemma and/or part-of-speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
·IFA Dialog Video corpus: The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs of the Spoken Dutch Corpus (CGN), and the procedures and design of the corpus were adapted to make it useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants were selected: good friends, relatives, or long-time colleagues. The participants were allowed to talk about any topic they wanted.
·International Corpus of English (British Component): The British Component of the International Corpus of English (ICE-GB) contains one million words of spoken and written British English. The material is fully tagged and parsed and the associated syntactic treebank is searchable with dedicated exploration software. The spoken material can be listened to.
·IPI PAN Corpus of Polish: The 2nd edition of the IPI PAN Corpus of Polish, developed at the Institute of Computer Science of the Polish Academy of Sciences (PAS), is available at the web pages of: - the Institute of Computer Science PAS: http://korpus.pl/en/ - the Institute of Polish Language PAS: http://corpus.ijp-pan.krakow.pl/en/ To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated corpus of Polish available to the public. The whole corpus consists of over 250 million segments (about 200 million orthographic words) and it is not balanced, but a balanced sample of over 30 million segments is also available. These corpora can be directly searched at the above addresses (do read the query syntax cheatsheet at http://korpus.pl/en/cheatsheet/index.html) or downloaded in a binary form to be used with a standalone version of the corpus search engine Poliqarp (announced separately on the 'corpora' list and available from http://korpus.pl/en/).
·Italian Attribution Corpus: This is a corpus annotated for attribution relations according to an annotation schema developed from the one adopted for the PDTB corpus. It comprises 50 articles drawn from the ISST corpus of Italian. The overall number of tokens is 37,000. Overall, 461 attribution relations are annotated, using MMAX2. The corpus is available for download and research use at: http://homepages.inf.ed.ac.uk/s1052974/resources.php
·IULA's UPF Textual, plurilingual, specialized Corpus: The main goal of the Corpus project is the construction and exploitation of a textual, plurilingual and specialized corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest include economics, law, computer science, medicine and environmental science. This corpus is the main support for teaching and research at our institute. Some of the research activities envisaged for this corpus include terminology detection, parallel text alignment, partial parsing, (semi-)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.
·Korean Propbank: Korean Propbank is a semantic annotation of the Korean English Treebank Annotations and Korean Treebank version 2.0. Each verb and adjective occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs and adjectives have also been tagged with coarse grained senses. There are two basic components to Korean Propbank: * The Verb Lexicon. A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. 2,749 such files have been created. * The Annotation. There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate tokens. These predicate tokens include all those occurring in over 54 thousand words of the Korean English Treebank Annotations, totaling ~791 KB of uncompressed data. The newswire-verbs.pb file has 23,707 annotated predicate tokens. These predicate tokens include all those occurring in over 131 thousand words of the Korean Treebank version 2.0.
·Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
·LumaLiDa - Resources for Child Language: LumaLiDa is a family of database resources for the study of child language. It includes LumaLiDaOn (the Linguistic Diary of Luma, a European Portuguese Child), LumaLiDaOnLexicon (the lexicon used by the child in LumaLiDaOn, types and tokens), LumaLiDaAudy (transcribed audio files of child speech), and LumaLiDaAudyLexicon (the lexicon used by the child in LumaLiDaAudy, types and tokens).
·MDE RT04 Training Data Speech: MDE RT-04 Training Data Speech was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
·Monguor - Online Texts: Digitized online texts of Monguor, an endangered language spoken in the People's Republic of China, as provided to the E-MELD School of Best Practices by Dr. Wang Xianzhen.
·N4 NATO Native and Non-Native Speech: The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military oriented database for multilingual and non-native speech processing studies. The NATO Speech and Language Technology group decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries. Speech data was recorded in the Naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada). The material consists of native and non-native speakers using NATO English procedure between ships and reading from a text.
·NPS Chat Corpus: The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been: 1) Hand privacy masked; 2) Part-of-speech tagged; and 3) Dialogue-act tagged.
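(A copy of this corpus is also packaged with NLTK as nltk.corpus.nps_chat, so the tokenized, POS-tagged and dialogue-act-tagged posts can be inspected in a few lines of Python. The sketch below runs against the NLTK packaging, not the original distribution, and requires the one-time corpus download shown.)

    # Minimal sketch: browsing the NPS Chat Corpus via its NLTK packaging.
    # Requires: pip install nltk, plus the one-time download below.
    import nltk

    nltk.download("nps_chat", quiet=True)
    from nltk.corpus import nps_chat

    posts = nps_chat.posts()            # tokenized posts
    tagged = nps_chat.tagged_posts()    # (word, POS) pairs per post
    print(len(posts), "posts in total")
    print(tagged[0])

    # Dialogue-act classes are stored as the 'class' attribute of each XML post.
    for post in nps_chat.xml_posts()[:5]:
        print(post.get("class"), "->", post.text)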
·Online Dena'ina Qenaga Lexicon: Searchable online wordlist of Dena'ina Qenaga (Tanaina).
·Ossetic National Corpus: The Ossetic National Corpus, which contains about 5 million wordforms, is now freely available online. All the texts in the corpus have been automatically annotated and contain English translations of most lexemes. The percentage of annotated wordforms is more than 75%. The corpus supports automatic Latin transliteration of search results.
·Penn Parsed Corpora of Historical English: The Penn Parsed Corpora of Historical English are a collection of three annotated corpora of historical British English: the Penn-Helsinki Parsed Corpus of Middle English (1.2 million words), the Penn-Helsinki Parsed Corpus of Early Modern English (1.7 million words) and the Penn Parsed Corpus of Modern British English (currently 1 million words). The corpora are genre-balanced and consist of POS-tagged and syntactically annotated text samples, including all of the samples in the Middle and Early Modern English sections of Helsinki Corpus of Historical English (1.1 million words).
·Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English, which consist of 600,000 words of genre-balanced text, plus two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English, and the two corpora are distributed together.
·Persian Linguistic Database (PLDB): This is the first online database for contemporary (Modern) Persian, designed and developed by Dr. S. M. Assi at the Institute for Humanities and Cultural Studies (IHCS), Iran. The database contains a large selected corpus of all varieties of Modern Persian in the form of running texts. Some of the texts are annotated with grammatical, pronunciation and lemmatisation tags. Special, powerful software provides different types of search and statistical listing facilities across the whole database or any selective corpus made up of a group of texts. The database is constantly improved and expanded.
·PMSE: The PetaMem Scripting Environment (PMSE) is a suite of Perl scripts that allows the user to take control of various processes related to the use of corpora. The intent of PMSE is to build a comprehensive toolchain enabling generic work with text corpora, starting with the acquisition of the data and continuing with statistical computation and data visualization.
·Russian National Corpora: The corpus is designed for anyone interested in a variety of issues relating to the Russian language: professional linguists, language teachers, students, and foreigners studying Russian.
·Scandinavië Vertalingen: Translation agency for translations from and into the Scandinavian languages (Swedish, Finnish, Norwegian, Danish).
·SCoSE - Saarbrücken Corpus of Spoken English: The SCoSE consists of five parts: Part 1: Stories; Part 2: Indianapolis Interviews; Part 3: Jokes; Part 4: Complete Conversations; Part 5: Drawing Experiment. Each of the five parts can be downloaded as a .pdf file.
·Scottish Corpus of Texts and Speech (SCOTS): SCOTS is an AHRC-funded project, creating a corpus of texts in the languages of Scotland, in the first instance Scots and Scottish English, of all available genres. Spoken texts (orthographic transcription plus accompanying audio/video files) make up 20% of the complete corpus. The corpus is fully searchable online, and the website also contains a description and instructions.
·Searchable Biao Min Lexicon: The Biao Min Lexicon, housed on the E-MELD site, consists of nearly 3,000 lexical items from Biao Min documentation collected by David Solnit.
·Searchable Kayardild Lexicon: Searchable lexicon of Kayardild, collected by Dr. Nicholas Evans and hosted by E-MELD.
·Searchable Mocoví Lexicon: Searchable Mocoví Lexicon, based on data provided and collected by Dr. Verónica Grondona.
·Searchable Potawatomi Lexicon: Online, searchable Potawatomi lexicon, utilizing data provided to the E-MELD School of Best Practices by Dr. Laura Buszard-Welcher.
·Searchable Saliba Lexicon: Searchable Lexicon of the Saliba language, utilizing data provided to the E-MELD School of Best Practices by Nancy Morse.
·Slovak National Corpus: The Slovak National Corpus is built as a general monolingual corpus. In the first phase (2003) it began to compile written texts originating in the years 1990 – 2003, containing about 30 million words with lemmatisation, morphological and source (bibliographical and style-genre) annotation. During the second phase (up to 2006), the representative span of written texts will be extended to other periods of the contemporary language (1955 – 2005), to the amount of 200 million words, and a selected sample will be syntactically annotated. Simultaneously, specific sub-corpora of diachronic and dialectological texts will begin to be built, as well as a terminological and lexicographical database. The Slovak National Corpus is intended primarily for lexicographers (dictionary creation) and complements grammar and stylistic research (grammar and orthographical handbooks; varieties of the national language and their usage in communication). We expect that it will also find its use at schools (preparation of orthography, grammar and style textbooks; teaching Slovak as a foreign language). Specific sub-corpora of historical and dialectological texts will help to preserve an important part of our cultural heritage in a long-term perspective.
·SMULTRON - The Stockholm Multilingual Treebank: SMULTRON is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences each in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
·Speech Controlled Computing: The Speech Controlled Computing corpus was designed to support the development of small footprint, embedded ASR applications in the domain of voice control for the home. It consists of the recordings of 125 speakers of American English from four regions, three age groups and two gender groups, pronouncing isolated words. The recordings were conducted in a sound-attenuated room, and a high-quality microphone was used. Each speaker read a randomized word list consisting of 2100 words (100 distinct words appearing 21 times each). NOTE: Nonmembers may obtain a commercial rights license to Speech Controlled Computing for US$7000 by signing the LDC User License Agreement for Speech Controlled Computing. For-Profit Membership to the LDC is not required.
·The Bergen Corpus of London Teenage Language (COLT): The Bergen Corpus of London Teenage Language (COLT) is the first large English Corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13 to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.
·The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages: The JRC-Acquis Version 3.0 is a unique and freely available parallel corpus containing European Union (EU) documents of a mostly legal nature; Version 3.0 has almost tripled in size compared with earlier releases. It is available in 22 of the 23 official EU languages (all except Irish). The corpus consists of about 23,000 documents per language, with an average size of 49 million words per language, totalling over one billion words. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is currently available for a subset of 8,000 documents in 210 language pair combinations. Pair-wise alignment for all texts in all 231 language pairs will be available soon. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for all types of cross-language research, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
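(Since the release pairs one XML document per language with separate paragraph-alignment files, a consumer typically joins paragraph IDs from the alignment file to the two monolingual documents. The Python sketch below only illustrates that join under assumed element and attribute names - body/p with an "n" attribute, link with an "xtargets" attribute, and placeholder file names - which are modeled on common aligner output and are not the verified JRC-Acquis schema.)

    # Hypothetical sketch of pairing aligned paragraphs from a JRC-Acquis-style
    # release. Element/attribute names (p/@n, link/@xtargets) and file names
    # are assumptions for illustration only.
    import xml.etree.ElementTree as ET

    def paragraphs(path):
        """Map paragraph id -> text for one monolingual document."""
        root = ET.parse(path).getroot()
        return {p.get("n"): "".join(p.itertext()).strip()
                for p in root.iter("p")}

    def aligned_pairs(align_path, src, tgt):
        """Yield (source_text, target_text) for 1-1 paragraph links."""
        root = ET.parse(align_path).getroot()
        for link in root.iter("link"):
            left, right = link.get("xtargets").split(";")
            if left in src and right in tgt:
                yield src[left], tgt[right]

    if __name__ == "__main__":
        en = paragraphs("doc_en.xml")      # placeholder file names
        de = paragraphs("doc_de.xml")
        for s, t in aligned_pairs("align_en_de.xml", en, de):
            print(s, "|||", t)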
·The SenSem Databank: This databank includes a corpus for Spanish and another for Catalan (http://grial.uab.es/sensem/corpus), and their two respective verb lexicons (http://grial.uab.es/sensem/lexico). These resources can be readily consulted online and also downloaded. In these corpora the semantics of sentences is analyzed as a syntax-lexicon continuum and so the annotation ranges from the lexical level to the sentence level. Constituents are independently annotated regarding different types of information: semantic role, syntagmatic category, and syntactic and semantic function. The verb phrase is also specified in terms of telicity and dynamism and the sentence is specified regarding topicalization or detopicalization of logical subject, aspectuality, modality and polarity. All these values converge in order to create sentence meaning. The two SenSem lexicons embrace 1,200 senses. This description is carried out by means of a definition, the Aktionsart, semantic roles and subcategorization frames (with information about frequency and sentence semantics). In the Spanish lexicon, these senses are organized in 250 lemmas, which constitute the headwords for which 100 sentences from a journalistic register and 20 from a literary register have been randomly selected and manually annotated. The Spanish sentences corresponding to the journalistic register have been translated into Catalan and annotated independently.
·The SenSem Databank: With the information extracted from the two SenSem Corpora, one for Spanish and another one for Catalan, two corresponding lexicons have been created. These resources can be consulted online (http://grial.uab.es/sensem/lexico) and also downloaded (http://grial.uab.es/descarregues.php). In the Spanish lexicon, 250 lemmas have been described. These were selected from the most frequent Spanish verbs in an original corpus made up of 13,000,000 words. In the Catalan lexicon, the number of lemmas is higher (318) because the correspondence between Spanish and Catalan verbs is not one-to-one. The two SenSem lexicons embrace 1,200 senses each, out of which approximately 1,000 are exemplified in the corpus. The sense description is carried out by means of a definition, semantic roles, the WordNet synset and the frequency of each sense in the corpus, differentiating between different registers. We also include the Aktionsart. Moreover, each sense is completed with information extracted from the corpora referring to subcategorization frames and their frequency. In order to describe the patterns, we make use of two levels. In the first level we include the general syntagmatic categories ordered according to the unmarked Spanish word order and we mark those patterns that are pronominal. In the second level these categories are subspecified and semantic roles and syntactic functions are added to the patterns. For each frame we also indicate the sentence semantics that it is associated to, the real order of categories and the adjuncts. Finally, all the sentences of the corpora that exemplify each pattern can be visualized and a graphic shows the annotation of each sentence.
·Timebank 1.2: The TimeBank 1.2 corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. The most recent information on TimeML is always available at www.timeml.org. TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. Timebank 1.2 is distributed via web download. Nonmembers may license this data at no cost - please note that a signed copy of our generic nonmember user agreement is required.
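(To make the tag inventory above concrete, here is a tiny hand-written TimeML-style fragment, parsed with the Python standard library. The fragment and its attribute values are invented for illustration and are not taken from TimeBank.)

    # Illustrative only: a hand-written TimeML-style fragment (not from TimeBank)
    # and a minimal pass over its EVENT and TIMEX3 tags.
    import xml.etree.ElementTree as ET

    SAMPLE = """<TimeML>
      The company <EVENT eid="e1" class="OCCURRENCE">reported</EVENT> losses
      <SIGNAL sid="s1">on</SIGNAL>
      <TIMEX3 tid="t1" type="DATE" value="1998-01-08">January 8, 1998</TIMEX3>.
    </TimeML>"""

    root = ET.fromstring(SAMPLE)
    for ev in root.iter("EVENT"):
        print("EVENT ", ev.get("eid"), ev.get("class"), ev.text)
    for tx in root.iter("TIMEX3"):
        print("TIMEX3", tx.get("tid"), tx.get("type"), tx.get("value"), tx.text)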
·TS Corpus: TS Corpus is a Turkish corpus project: a general-purpose, POS-tagged corpus containing about 491 million tokens (491,360,398). It aims to combine former Turkish computational linguistics studies and other corpus linguistics studies from around the world.
·VOICE: Vienna-Oxford International Corpus of English: The Vienna-Oxford International Corpus of English (VOICE) 1.0 Online is available as a free-of-charge resource for non-commercial research purposes. VOICE comprises naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). The recordings made for VOICE are keyboarded by trained transcribers and stored as a computerized corpus. The speakers recorded in VOICE are experienced ELF speakers from a wide range of first language backgrounds. The ELF interactions recorded cover a range of different speech events in terms of domain (professional, educational, leisure), function (exchanging information, enacting social relationships), and participant roles and relationships (acquainted vs. unacquainted, symmetrical vs. asymmetrical).
·Word Frequency Lists and Dictionary for American English: This site contains what we believe are the most accurate and hopefully the most useful word frequency lists of (American) English. Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 400 million word Corpus of Contemporary American English. You can be sure that the words in these lists and in this dictionary - sorted from most to least frequent - are really the most common ones that you will encounter in the real world. The frequency data comes in a number of different formats: * An eBook containing up to the 20,000 most frequent words, along with the 20-30 most frequent collocates (nearby words) and the synonyms for each word. * A printed book (from Routledge) with the top 5,000 words (including collocates) and thematic lists. * A free word list -- top 5,000 words, but no collocates or synonyms. * Simple word lists of the top 10,000 or 20,000 words, but without collocates or synonyms. * Lists with the top 200-300 collocates for each of the 20,000 words, for up to 5,000,000 node word / collocate pairs.
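(For the downloadable plain word lists, a few lines of Python are enough to load them and look words up. The tab-separated rank/word/part-of-speech/frequency layout assumed below is a guess for illustration only and should be checked against the actual downloaded file.)

    # Hypothetical sketch: loading a downloaded frequency list assumed to be
    # tab-separated as rank<TAB>word<TAB>PoS<TAB>frequency. The real column
    # layout of the distributed files may differ; adjust the indices accordingly.
    import csv

    def load_frequency_list(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) < 4 or not row[0].isdigit():
                    continue  # skip headers or malformed lines
                rank, word, pos, freq = int(row[0]), row[1], row[2], int(row[3])
                table[word.lower()] = {"rank": rank, "pos": pos, "freq": freq}
        return table

    if __name__ == "__main__":
        freqs = load_frequency_list("top5000.txt")   # placeholder file name
        for w in ("the", "language", "corpus"):
            print(w, freqs.get(w))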

Electronic Texts

·Alex: A Catalogue of Electronic Texts on the Internet: A collection of public domain documents from American and English literature as well as Western philosophy.
·Chinese Text Project: The Chinese Text Project is a web-based e-text system designed to present ancient Chinese texts, particularly those relating to Chinese philosophy, in a well-structured and properly cross-referenced manner, making the most of the electronic medium to aid in their study and understanding.
·Croatian National Corpus: The starting point for any linguistic research is a corpus. As Croatian lacks a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts -- both older and contemporary -- in the form of a corpus usable for all kinds of Croatistic, lexicographic and lexicological research.
·Digital Archive of the Macedonian Language: The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials. Web address: http://damj.manu.edu.mk. Macedonian Academy of Sciences and Arts, Research Center for Areal Linguistics. Project coordinator: Prof. Marjan Markovik (e-mail: marjan@manu.edu.mk).
·El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the website. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers as well as text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the website. All documents have been tagged and parsed, and part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable, morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research questions. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the latest tools developed is El Manchador de Textos, an online resource that “spots” and highlights in color the words or sequences under study; statistical information about the co-occurrences is also available.
·Freiburger Anthologie: The 1,200 best-known German poems in a searchable database.
·French Proverbs from 1611: A summary of 1,500 proverbs found in Randle Cotgrave's "A dictionarie of the French and English tongues", published in 1611. Most are in French with English translations; a few English and Latin proverbs are also included.
·HATII and DCC Release KRYS I Corpus to Aid Research: The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research: http://www.krys-corpus.eu. The corpus, consisting of 6,434 documents labelled with document genres, is expected to become a major research resource for text processing and data and information management researchers. In particular, we encourage the use of the corpus for research on: - Automated Text Classification (TC) - Digital curation and metadata extraction - Natural Language Processing (NLP) - Computational Linguistics (CL). Despite the potential of document genre classification as a supporting step in language processing, document management and information retrieval (e.g. the linguistic style and vocabulary of a document vary distinctively across genres), to date there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that HATII and the DCC make the KRYS I Corpus available to researchers around the globe. The corpus originated as part of the ongoing Semantic Metadata Extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether a document is a scientific article or a letter) is characterised by its form and structure, the understanding of which would facilitate further extraction of metadata from within the document. Further details about the development of the KRYS I Corpus are available via the website (http://www.krys-corpus.eu). Specifically, researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents. We encourage researchers to make full use of this corpus for their own research and recommend contributing towards its ongoing development by adding documents to the database. Instructions on how to contribute to the corpus are provided at http://www.krys-corpus.eu. Comments and/or feedback on the KRYS I Corpus are invited; contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
·Hellenic National Corpus: HNC is a corpus of written Modern Greek texts, available over the Internet for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000. It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers, etc.), which belong to different genres (articles, essays, literary works, reports, biographies, etc.) and various topics (economy, medicine, leisure, art, human sciences, etc.). HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek: - specific words (e.g. child), - lemmas (e.g. child as a lemma produces every inflected form of the word), - parts of speech, and - combinations of up to three of the above, in which users can specify the distance between lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech). Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference. Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page. Finally, HNC users can make queries concerning word, lemma and/or part-of-speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
·IFA Dialog Video corpus: The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs, licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make it useful for other researchers of Dutch speech. For this corpus, 20 dialogs of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants -- good friends, relatives, or long-time colleagues -- were selected. The participants were allowed to talk about any topic they wanted.
·Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from the Korean Newswire corpus published by LDC (catalog number LDC2000T45), a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus has many uses, including the training of morphological analyzers, part-of-speech taggers and syntactic parsers.
·Korpus 2000: The aim of the Korpus 2000 project is to document the use of the Danish language around the year 2000 - in the form of a text corpus in which one can look up words and phrases via this website. The texts that constitute the Korpus 2000 were written mainly between 1998 and 2002.
·Linguistic eBooks at Diesel eBook Store: Download linguistic eBooks in multiple formats. There is a large inventory of linguistic eBooks from top industry authors available at the Diesel eBook Store.
·NPS Chat Corpus: The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been: 1) hand privacy-masked; 2) part-of-speech tagged; and 3) dialogue-act tagged.
·Oxford Text Archive (OTA): Text archive.
·Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes text samples from the Helsinki Corpus of Historical English (600,000 words of genre-balanced text) together with two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English, and the two corpora are distributed together.
·Project Gutenberg e-texts: Texts online.
·Sociolingüística Andaluza: Information on a sociolinguistics research group at the Universidad de Sevilla.
·Texts in context: A lovely collection of classified, annotated and (partially) downloadable texts from the British Library's collection - good for both teaching and research. Here's the introduction: Texts in Context is a rich and unusual collection of over 400 British Library texts. You can find menus for medieval banquets and handwritten recipes scribbled inside book covers. You can browse the first English dictionary ever written and explore the secret language of the Georgian underworld. You can study the East India Company's shopping lists and practise sentences from colonial phrasebooks. You can learn smugglers' songs, listen to rare dialect recordings, and examine the logbooks of 17th century trading ships.
·The Aboriginal Studies Electronic Data Archive: The Australian Institute of Aboriginal and Torres Strait Islander Studies holds computer-based (digital) materials about Australian Indigenous languages in the Aboriginal Studies Electronic Data Archive (ASEDA). ASEDA offers a free service of secure storage, maintenance, and distribution of electronic texts relating to these languages.
·The Sumerian Text Archive: A growing collection of texts in the Sumerian language.
·The University of Virginia Electronic Text Center: An on-line archive of tens of thousands of SGML and XML-encoded electronic texts and images with a library service that offers hardware and software suitable for the creation and analysis of text.
·Tofa Videos and Texts: The Tofa stories available here were recorded by Dr. K. David Harrison in 2000 and 2001, for a project funded by a grant from Volkswagen-Stiftung.

Text and Corpora Meta Sites

·ACL SIGLEX: An index of links to publicly available lexical resources (dictionaries and corpora).
·CONCISUS Corpus of Event Summaries: The CONCISUS Corpus is an annotated dataset of comparable Spanish and English event summaries in four application domains. For the time being, the CONCISUS Corpus covers the following domains: aviation accidents, train accidents, earthquakes, and terrorist attacks. The dataset contains comparable summaries, comparable automatic translations, and comparable full documents.
·ELRA (European Language Resources Association): The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community.
·Het Corpus Gesproken Nederlands: The Corpus Gesproken Nederlands (Spoken Dutch Corpus), or CGN, is a collection of approximately 900 hours of spoken Dutch from Flemish and Dutch speakers. All recordings have been aligned with an orthographic transcription, and each word has been given a POS tag and a lemma. Part of the data has been enriched with syntactic, prosodic and/or phonetic information.
·Italian Linguistics: Information (in Italian) on Italian linguistics and corpora.
·korpusy.net: This is a Polish-language site devoted to language corpora. It provides a range of introductory articles on corpora and corpus-based research, links and conference calls, a glossary of common terms, and some downloadable papers and resources.
·Leiden Armenian Lexical Textbase: Launching the Leiden Armenian Lexical Textbase: http://www.sd-editions.com/LALT/home.html. LALT combines Classical Armenian dictionaries with morphologically analyzed texts. There are some 80,000 Armenian lexemes and ten texts. The complete Nor Bargirk, main sections of Adjarian's Root Dictionary, Bedrossian's Armenian-English Dictionary and other material are integrated in LALT. There is also a Greek-Armenian lexicon (20,000 entries) and aligned Armenian-Greek texts. LALT will be updated at regular intervals. LALT can also easily integrate additional material and welcomes contributions from other scholars. I have been asked about fonts: LALT is written in XML and uses Unicode. Any Unicode font will be able to display it, provided the font contains the glyphs (screen images) for Armenian and Greek. One such font is Titus Cyberbit, which is used within LALT itself. It is available for free at http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp. General information on Armenian and Unicode may be obtained at http://www.armunicode.org/en/fonts/unicode. Jos Weitenberg
·Lexicographical Corpus of Portuguese: The Lexicographical Corpus of Portuguese is a database of electronic texts in Portuguese. It contains the electronic transcription and edition of some of the most important dictionaries from the 17th and 18th centuries. The selected texts are generally considered the most important monuments of the Portuguese dictionary tradition for their size, reception and documentary value. - Jerónimo Cardoso, Dictionarium iuventuti studiosae (1562, aliás 1551); Dictionarium ex lusitanico in latinum sermonem (1562); Dictionarium Latinolusitanicum (1569/70); Breve dictionarium vocum ecclesiasticarum (1569); De monetis (1569, aliás 1561) - Pedro de Poiares, Diccionario Lusitanico-Latino de Nomes Proprios (1667) - Bento Pereira, Prosodia Tesouro (1697) - Rafael Bluteau, Vocabulario Portuguez e Latino (1712-1728) - António Franco (F. Pomey), Indiculo Universal (1716). With this project we have made the complete text of those dictionaries available in digital format, and it is now possible to retrieve indexes of all the words found in them.
·Linguistic and Folklore materials from the Kujamaat Jóola: A site which will eventually grow to have an extensive collection of Kujamaat linguistic and folklore materials. It currently contains a dictionary (already listed), two folktales (text, translation and sound) and verses from extemporaneous funeral songs (text, sound, translation, commentary).
·Linguistic Data Consortium: Creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes.
·On-line books FAQ: Public-domain sources of e-texts available on the Internet.
·The IViE corpus: An intonationally transcribed corpus covering seven dialects of English from the British Isles. Subjects were secondary school students. The corpus covers short read sentences, a read story, a retold story, map tasks and free conversation.
·WEBSOM: Large document collections that are automatically organized by the novel WEBSOM method (including Usenet newsgroup sci.lang). An ordered map of the information space is provided.
·XNLRDF: XNLRDF is a database for the creation and distribution of basic linguistic information for a great number of natural languages, so that it can be used for research and development. Linguistic data are inserted through a web interface. The linguistic data in the database can be compiled and downloaded in XML by the user.
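As a minimal sketch of consuming such an XML export with Python's standard-library ElementTree parser, the example below assumes a simple, hypothetical schema in which each language is a <language> element with an ISO-code attribute and a <name> child; the real element and attribute names are defined by the XNLRDF download itself.

```python
import xml.etree.ElementTree as ET

# Minimal sketch for an XNLRDF XML export; the file name and the element and
# attribute names ("language", "iso", "name") are hypothetical placeholders,
# since the actual schema is defined by the XNLRDF download itself.
def list_languages(path="xnlrdf_export.xml"):
    tree = ET.parse(path)
    languages = []
    for lang in tree.getroot().iter("language"):      # hypothetical element name
        languages.append((lang.get("iso"), lang.findtext("name")))
    return languages

if __name__ == "__main__":
    for iso_code, name in list_languages():
        print(iso_code, name)
```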