Texts & Corpora



Corpora

·Arquivo Dialetal do CLUP: Arquivo Dialetal do CLUP (Dialectal Archive of the Center of Linguistics of the University of Porto) is a database of recordings of European Portuguese collected during the last two decades, spanning both mainland Portugal and the islands. Apart from the recordings themselves, this resource provides detailed maps, exhaustive narrow phonetic and orthographic transcriptions, and information about dialectal phenomena. Arquivo Dialetal do CLUP is a growing project, and we welcome any comments, suggestions and contributions.
·British National Corpus: A 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
·Buckeye Corpus: This corpus contains high-quality recordings of conversational American English speech from 40 speakers in Columbus, OH, USA. The speech has been orthographically transcribed and phonetically labeled. Currently the audio files and transcriptions for 20 talkers are available.
·Centre for English Corpus Linguistics: ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English, representing 16 different mother-tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version, published in 2002, not only in its increased size and range of learner populations, but also in its interface, which contains two new functionalities: a built-in concordancer that lets users search for word forms, lemmas and/or part-of-speech tags, and a breakdown of query results according to learner profile information. The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the learners' countries of origin. There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25). The corpus can be ordered online at http://www.i6doc.com
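
As an illustration of what the concordancer searches described above involve, the sketch below is a minimal keyword-in-context (KWIC) lookup over a tiny (word, lemma, POS) tagged sample; the sample data are invented for the example, and this is not the ICLE interface or its data.

```python
# Minimal KWIC concordancer over a (word, lemma, POS) tagged corpus.
# The tiny sample corpus and its tags are invented for illustration only.
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word form, lemma, POS tag)

corpus: List[Token] = [
    ("Learners", "learner", "NNS"), ("often", "often", "RB"),
    ("make", "make", "VBP"), ("interesting", "interesting", "JJ"),
    ("errors", "error", "NNS"), ("when", "when", "WRB"),
    ("they", "they", "PRP"), ("write", "write", "VBP"),
    ("essays", "essay", "NNS"), (".", ".", "."),
]

def kwic(tokens: List[Token], *, form=None, lemma=None, pos=None, window=3):
    """Yield (left context, match, right context) for tokens matching the query."""
    for i, (w, l, p) in enumerate(tokens):
        if form and w.lower() != form.lower():
            continue
        if lemma and l.lower() != lemma.lower():
            continue
        if pos and p != pos:
            continue
        left = " ".join(t[0] for t in tokens[max(0, i - window):i])
        right = " ".join(t[0] for t in tokens[i + 1:i + 1 + window])
        yield left, w, right

# Search by lemma; interfaces of this kind also allow search by form or POS tag.
for left, match, right in kwic(corpus, lemma="error"):
    print(f"{left:>30} [{match}] {right}")
```
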
·CGL: The corpus of texts known as Grammatici Latini comprises the Latin grammar manuals written between the 2nd and 7th centuries AD and edited by Heinrich Keil in Leipzig from 1855 to 1880. The corpus has several points of interest. By assembling the main sources, it allows for the reconstruction of the history of ideas in Western linguistics; from the Middle Ages on, these texts (Donatus' and Priscian's artes in particular) were the basis for the later linguistic tradition. The corpus comprises more than 14,000 literary quotations used as grammatical examples, a large number of which are fragments (literary, philosophical) of works that are now lost, or passages which can be compared with the direct tradition of extant texts. Some tendencies in late Latin are given prominence, such as the proscription of expressive forms foreign to classical use.
·CHAINS: Characterising Individual Speakers: The Chains corpus is a novel speech corpus collected with the primary aim of facilitating research in speaker identification. The corpus features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. The corpus is being made freely available for research purposes.
·Chinese Gigaword Second Edition: Chinese Gigaword Release Second Edition is a comprehensive archive of newswire text data in Chinese that has been acquired over several years by the LDC. This release includes all of the contents in the first release of the Chinese Gigaword corpus (LDC2003T09), material from one new source, as well as new materials from the other two sources. Thus, the corpus contains three distinct international sources of Chinese newswire - Central News Agency, Taiwan, Xinhua News Agency, and Zaobao. Some minor updates to the documents from the first release have been made.
·CLIPS, Corpus of Spoken Italian: CLIPS is a corpus of spoken Italian, freely available at www.clips.unina.it. The corpus (audio files, annotation and documentation) is fully downloadable from the website via ftp, free for research purposes. CLIPS consists of about 100 hours of speech, equally represented by female and male voices. A section of the corpus is transcribed orthographically, and a smaller section has been phonetically labeled. Recordings were made in 15 Italian cities, selected on the basis of linguistic and socio-economic principles of representativeness: Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia. For each of the 15 cities, different text typologies have been included: a) radio and television broadcasts (news, interviews, talk shows); b) dialogue (240 dialogues collected using the map-task procedure and the "spot the difference" game; in this set, 30 dialogues are phonetically labeled and 90 orthographically transcribed); c) read speech from non-professional speakers (20 sentences each, covering medium-to-high frequency Italian words); d) speech over the telephone (conversations between 300 speakers and a simulated hotel desk service operator); e) read speech from 20 professional speakers (160 sentences, covering all phonotactic sequences and medium-to-high frequency Italian words) recorded in an anechoic chamber. Documentation, corpus collection and annotation follow the EAGLES guidelines.
·COMPARA - Portuguese-English Parallel Corpus: COMPARA is a bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source texts and translations. Access is free and requires no registration.
·CORIS/CODIS - Corpus of Contemporary Written Italian: An updated version of CORIS/CODIS, the synchronic corpus of written Italian designed and developed at the University of Bologna, is now accessible online for research purposes. The new version contains around 130 million words and is updated to 2010. The corpus covers a wide range of text varieties, chosen by virtue of their representativeness of contemporary Italian, and ranges from the 1980s to 2010. The following features are now available: annotation for part of speech and lemma, a user-friendly interface, and the advanced IMS-CWB query language. The corpus is freely accessible online for research purposes only.
·Corpora at ICAME: International Computer Archive of Modern and Medieval English.
·Corpus Artesia - Archivio Testuale del Siciliano Antico: The Artesia Corpus is part of a larger research project, 'Artesia - Archivio testuale del Siciliano Antico' (Text Archive of Ancient Sicilian), a production of the Department of Modern Philology of the University of Catania, in close cooperation with the Centro di Studi Filologici e Linguistici Siciliani, Palermo (http://www.csfls.it). Among the other contributing national research projects, the Opera del Vocabolario Italiano (OVI) provided the software for creating and managing the full-text database. Our aim is to supply a well-structured research tool for the study of Medieval Sicilian (14th-16th centuries) from a Romance perspective and to account for its whole textual production. In particular, Artesia: • makes accessible and searchable a philologically reliable and periodically updated corpus of literary and non-literary Sicilian texts, from the earliest attestations (14th cent.) to the latest (mid-16th cent.); • provides a brief yet scholarly presentation of each author and text; • documents the individual works by putting them into a historical and critical context, highlighting relationships with, and comparisons to, different Latin and Romance textual traditions (Catalan, Tuscan, etc.); • makes a fundamental contribution towards a Medieval Sicilian Dictionary; • publishes philological and linguistic studies concerning Medieval Sicilian, in both electronic and paper format (see Quaderni di Artesia, Catania: Ed.it, http://www.editpress.it). The Database: The full-text database is available online on the OVI website (http://artesia.ovi.cnr.it) and is searchable using GATTOWEB - Gestione degli Archivi Testuali del Tesoro delle Origini, created for the Tesoro della lingua italiana delle Origini (TLIO) (http://www.ovi.cnr.it). It allows advanced word searches, concordance generation and text lemmatization. Moreover, the Artesia Corpus is periodically published on CD-ROM (http://www.editpress.it/0808.htm). The Corpus: The corpus is made up of both literary and non-literary (e.g. documentary) texts dating from the beginning of the 14th century, when the earliest Sicilian texts appeared, to the mid-16th century, when Tuscan replaced Sicilian as the language of administration. Among the texts belonging to the corpus are: • published texts (especially the Collezione di testi siciliani dei secoli XIV e XV, published by the Centro di Studi Filologici e Linguistici Siciliani, Palermo); • edited texts (for example, PhD dissertations and other editions by the University of Catania); • previously unpublished and unedited texts, now made digitally accessible for the first time, specifically for Artesia. As of March 2010, the corpus contains 73 literary texts and 171 documents (1,081,539 tokens). These have been extensively revisited and emended where doubtful readings occurred; these emendations have been systematically signalled by means of GATTO notes. The database is progressively expanding to include the complete set of Medieval Sicilian texts, thus constituting a firm platform for the creation of a Medieval Sicilian Dictionary.
·Corpus de Català Contemporani de la Universitat de Barcelona (CCCUB): Spoken language corpora developed for the study of geographical, functional and socio-cultural variation in Catalan. The texts are in PDF format. The sound files are not yet available through the web, but they have been published on CD-ROM and can be purchased. The CCCUB is also available through RECERCAT (Dipòsit de la Recerca de Catalunya): http://www.recercat.net/handle/2072/8925.
·Corpus de français parlé au Québec (CFPQ): The Corpus de français parlé au Québec (CFPQ) aims to reflect spontaneous Quebec French as used in the 2000s. It should be of help to any researcher interested in variation in French, particularly from lexicological, semantic or pragmatic angles. Its current size is seven sub-corpora, corresponding to more than ten hours of informal conversations involving four or five speakers. The information needed for optimal use of the data is available on the site.
·Corpus del español: An online, searchable corpus of diachronic Spanish texts (100 million words, 13th century to present).
·Corpus of Modern Scottish Writing: The Corpus of Modern Scottish Writing makes freely available a wide range of documents in Scots and Scottish English from 1700-1945, ranging from a rare first edition of Robert Burns' poems to letters from the explorer David Livingstone to murder trial transcripts dating back to the 1750s. The free online resource contains texts, digital images and searchable transcriptions, and can be found at: http://www.scottishcorpus.ac.uk/cmsw/ The Corpus also features James Hogg's first ever book - a treatise on diseases of sheep; personal accounts of the 1715 uprising; letters to Scotland from the French-Indian war and from emigrants to Australia in the 19th century. City Council minutes are also present along with a work on spiritualism by Arthur Conan Doyle, novels and student disciplinary trials of the 18th century and a selection of personal letters and diaries that give an invaluable insight into Scottish life in days gone by. The CMSW project has been funded by the Arts and Humanities Research Council (AHRC), and run by the University’s Department of English Language, now part of the School of Critical Studies.
·Corpus of Remarks on the French language (17th Century): The authors of Remarks treat all aspects of usage – pronunciation, spelling, morphology, syntax, vocabulary and style – but drop the traditional format of grammars. This corpus is an indispensable instrument, not only for specialists in 17th-century language and literature, but also for all those interested in the history of the French language and of its codification and standardization. This database contains the classic texts (the remarks of Vaugelas, Ménage and Bouhours); collections which adopt an alphabetical presentation (Alemand, Andry de Boisregard); texts which criticise Vaugelas and call for greater freedom of usage (Dupleix, La Mothe Le Vayer); the volumes which emanate from circles close to the Academy (the Academy's comments on Vaugelas, and its decisions collected by Tallemant); as well as some less prestigious texts (Buffet's observations addressed to a female audience, and the compilation by Macé which completes his general and critical grammar). For easy use and exploitation, the Corpus of Remarks on the French language is accompanied by a number of research instruments: full-text search, a thesaurus of authors (5 categories) and of titles (3 categories), and a thesaurus of examples and quotations. Users can compile their own corpus and extract and export results. This set of instruments will promote new research into the history of the French language and of linguistic conceptions.
·Corpus OVI dell'Italiano Antico: For the TLIO redaction, the OVI has prepared, and keeps constantly developing, a large textual database, the Corpus OVI dell'Italiano antico, intended to contain all relevant edited texts in any variety of Early Italian written before 1400 A.D. At present, this corpus, which is updated every 4 months, consists of 1978 texts with 21,817,929 words, 443,810 different word forms, 116,224 lemmas and 3,615,478 lemmatized occurrences. For texts not yet lemmatized and awaiting inclusion in the Corpus OVI, an additional corpus has been created, the Corpus TLIO aggiuntivo, which at present contains 306 texts with 1,189,808 words and 71,900 different word forms.
·Corpus TCOF: The 'Traitement de Corpus Oraux en Français' (TCOF) project of ATILF (UMR 7118, Université de Lorraine & CNRS) makes text-to-sound aligned oral corpora (Transcriber) available to the community. The TCOF corpus comprises two broad categories: recordings of adult/child interactions (currently 126 recordings) and recordings of interactions between adults (currently 102 recordings). The corpus is enriched regularly. Access to the data, via the CNRTL website, is facilitated by a search interface that lets users choose corpora according to their object of study (adults, children, male, female, professional situations, discourse genre, etc.).
·Croatian Language Corpus: The Croatian Language Corpus is the result of various projects at the Institute of Croatian Language and Linguistics and the Linguistics Department of the University of Zadar. There is an online interface based on PhiloLogic at the given URL. Currently the corpus indexes more than 100k tokens, and the base is growing continuously. It is annotated in TEI XML P5, and its annotation is being enriched with morphological segmentation, lemmatization, phonemic transcription, morphosyntactic annotation and syntactic parses. The online interfaces are subject to change and extension to improve access to various corpus properties.
·Croatian National Corpus: The corpus is the starting point for any linguistic research. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus usable for all kinds of Croatistic, lexicographic and lexicological research.
·CSLU Spoltech Brazilian Portuguese: The CSLU Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8080 separate utterances. A total of 2540 utterances have been transcribed at the word level (without time alignments), and 5479 utterances have been transcribed at the phoneme level (with time alignments).
·CSLU: Spelled and Spoken Words: The CSLU: Spelled and Spoken Words corpus consists of spelled and spoken words. 3647 callers were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, 1371 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of 2648 calls has been phonetically labeled.
·Czech Academic Corpus v. 1.0: The Czech Academic Corpus version 1.0 is a corpus of Czech with manual morphological annotation, consisting of approximately 600,000 words of continuous text.
·Database of spoken Italian (BADIP): Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, more data are being added continuously. Other corpora of spoken Italian will be included in the database as soon as possible. Access to BADIP is free. The database is part of the LanguageServer of the University of Graz (Austria).
·Digital Archive of the Macedonian Language: The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials. Web address: http://damj.manu.edu.mk. Macedonian Academy of Sciences and Arts, Research Center for Areal Linguistics. Project Coordinator: Prof. Marjan Markovik (e-mail: marjan@manu.edu.mk).
·Digital Tamil Literature: Searchable Tamil Digital Text Archive.
·dlexDB: dlexDB is a new lexical statistical database for German. It is based on the DWDS-Kerncorpus, a balanced collection - over time and text genre - of 100 million words of texts of the 20th century. dlexDB provides frequencies for types, lemmas, syllables, characters, orthographic neighbors and more. These measures are of considerable interest for research in psycholinguistics and psychology (e.g., studies on visual word recognition) as well as for general linguistics and lexicography. During the course of the project, more levels of linguistic representation will be added.
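
Among the measures listed for dlexDB are orthographic neighbors. A common operationalization (Coltheart's N) counts the words of the same length that differ from a target in exactly one letter position; the sketch below illustrates that computation on an invented mini word list and is not dlexDB's own code or data.

```python
# Orthographic neighborhood (Coltheart's N): words of the same length that
# differ from the target in exactly one letter position.
# The word list is invented for illustration; dlexDB supplies real frequencies.

def differs_in_one_letter(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def neighborhood(target: str, lexicon: set[str]) -> list[str]:
    return sorted(w for w in lexicon if differs_in_one_letter(target, w))

lexicon = {"haus", "maus", "hals", "hand", "band", "laus", "haut"}
print(neighborhood("haus", lexicon))  # ['hals', 'haut', 'laus', 'maus']
```
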
·DWDS Corpora and Dictionaries: A lexical information system of German, based on very large corpora and dictionaries. It contains the DWDS-Kerncorpus, a balanced collection - over time and text genre - of 100 million words of texts of the 20th century, various newspaper and special corpora (~650 million words of text publicly available).
·Eastern Armenian National Corpus: Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Standard Eastern Armenian (SEA), the language spoken in the Republic of Armenia. EANC is: - a comprehensive corpus with about 90 million tokens - a powerful search engine for making complex lexical morphological queries - a learner’s corpus including English translations for frequent tokens - a diachronic corpus covering SEA texts from the mid-19th century to the present - a mixed corpus consisting of both written discourse and oral discourse - an open-ended corpus with new texts being added continuously - an annotated corpus with morphological and metatext tagging - an open access corpus - an electronic library with full access to over 100 Armenian classic titles Another important feature is the Glossed output: typologists and language learners can now work with a text format similar to interlinear morphological glosses. In this format, wordforms are supplied with lemmas, lexical and grammatical categories, and translations, vertically aligned below each wordform. Also possible is switching to Latin transliteration from the Armenian alphabet.
·EF Cambridge Open Language Database: EFCamDat contains writings submitted to Englishtown, EF’s online school, accessed daily by thousands of learners worldwide. The database currently contains 412,000 scripts from 76,000 learners summing up 32 million words. (More information: http://linguistlist.org/issues/24/24-2935.html#1)
·Ega XML Lexicon: Digitized, online lexicon of Ega, a language of the Ivory Coast as provided by the late Prof. Eddy Aimé Gbery.
·El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the website. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers and text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the website. All documents have been tagged and parsed, and part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable, morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the latest tools developed is El Manchador de Textos, an online resource that "spots" and highlights in color the words or sequences under study; statistical information about the co-occurrences is also available.
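
El Manchador de Textos is described above as "spotting" and coloring the words or sequences under study while reporting co-occurrence statistics. As a rough analogue only, and not the El Grial implementation, the sketch below highlights matches of a small target list in running text using ANSI colors and tallies the hits.

```python
# Rough analogue of a word-"spotting" highlighter: wrap matches of the target
# word forms in ANSI color codes and count hits per target.
# Sample text and targets are invented; this is not the El Grial tool.
import re
from collections import Counter

HIGHLIGHT = "\033[93m{}\033[0m"  # yellow

def highlight(text: str, targets: list[str]) -> tuple[str, Counter]:
    counts: Counter = Counter()
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, targets)) + r")\b",
                         re.IGNORECASE)

    def repl(match: re.Match) -> str:
        counts[match.group(0).lower()] += 1
        return HIGHLIGHT.format(match.group(0))

    return pattern.sub(repl, text), counts

marked, counts = highlight(
    "El corpus contiene textos academicos y textos profesionales.",
    ["textos", "corpus"],
)
print(marked)
print(counts)  # Counter({'textos': 2, 'corpus': 1})
```
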
·General Corpus of the Modern Mongolian language: The General Corpus of the Modern Mongolian language (GCML) contains 966 texts and 1,155,583 words. The processor effectively analyzes 97% of the textual word forms, which correspond to 76% of the word forms in the concordance inputs to the GCML.
·German Political Speeches Corpus and Visualization: This corpus consists of speeches by the German Presidents, Chancellors and a few ministers, all gathered from official sources. It can be freely republished. The two main corpora are released in XML format with metadata. POS-tags will be added to it. There is also a basic visualization tool enabling users to get a first glimpse of the resource.
·German Speech Errors: Collection of 474 German speech errors by Richard Wiese.
·HATII and DCC Release KRYS I Corpus to Aid Research: The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research (http://www.krys-corpus.eu). The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, we encourage the use of the corpus for research in: automated text classification (TC); digital curation and metadata extraction; natural language processing (NLP); and computational linguistics (CL). Despite the potential of document genre classification as a supporting step in language processing, document management, and information retrieval (e.g. the linguistic style and vocabulary of a document vary distinctly across document genres), to date there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that HATII and the DCC make the KRYS I Corpus available to researchers around the globe. The corpus originated as part of the ongoing semantic metadata extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether it is a scientific article or a letter) is characterised by the form and structure of the document, an understanding of which would facilitate further extraction of metadata from within it. Further details about the development of the KRYS I Corpus are available via the website (http://www.krys-corpus.eu); researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents. We encourage researchers to make full use of this corpus for their own research and recommend contributing towards its ongoing development by adding your own documents to the database; instructions are provided at http://www.krys-corpus.eu. Comments and/or feedback on the KRYS I Corpus are invited; contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
·HC Corpora: A collection of open text corpora. Covers many different languages.
·IFA Dialog Video corpus: The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs, licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs of the Spoken Dutch Corpus (CGN). The procedures and design were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants (good friends, relatives, or long-time colleagues) were selected. The participants were allowed to talk about any topic they wanted.
·International Corpus of English (British Component): The British Component of the International Corpus of English (ICE-GB) contains one million words of spoken and written British English. The material is fully tagged and parsed and the associated syntactic treebank is searchable with dedicated exploration software. The spoken material can be listened to.
·Italian Attribution Corpus: This is a corpus annotated for attribution relations according to an annotation schema developed from the one adopted for the PDTB corpus. It comprises 50 articles drawn from the ISST corpus of Italian. The overall number of tokens is 37,000. Overall, 461 attribution relations are annotated, using MMAX2. The corpus is available for download and research use at: http://homepages.inf.ed.ac.uk/s1052974/resources.php
·IULA's UPF Textual, plurilingual, specialized Corpus: The main goal of the Corpus project is the construction and exploitation of a textual, plurilingual and specialized corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest include economics, law, computer science, medicine and environmental science. This corpus is the main support for teaching and research at our institute. Research activities envisaged for this corpus include terminology detection, parallel text alignment, partial parsing, (semi-)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.
·Korean Propbank: Korean Propbank is a semantic annotation of the Korean English Treebank Annotations and Korean Treebank version 2.0. Each verb and adjective occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs and adjectives have also been tagged with coarse grained senses. There are two basic components to Korean Propbank: * The Verb Lexicon. A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. 2,749 such files have been created. * The Annotation. There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate tokens. These predicate tokens include all those occurring in over 54 thousand words of the Korean English Treebank Annotations, totaling ~791 KB of uncompressed data. The newswire-verbs.pb file has 23,707 annotated predicate tokens. These predicate tokens include all those occurring in over 131 thousand words of the Korean Treebank version 2.0.
·Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
·LumaLiDa - Resources for Child Language: LumaLiDa is a family of database resources for the study of child language. It includes LumaLiDaOn (the Linguistic Diary of Luma, a European Portuguese child), LumaLiDaOnLexicon (the lexicon used by the child in LumaLiDaOn, types and tokens), LumaLiDaAudy (transcribed audio files of child speech), and LumaLiDaAudyLexicon (the lexicon used by the child in LumaLiDaAudy, types and tokens).
·MDE RT04 Training Data Speech: MDE RT-04 Training Data Speech was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
·Monguor - Online Texts: Digitized online texts of Monguor, an endangered language spoken in the People's Republic of China, as provided to the E-MELD School of Best Practices by Dr. Wang Xianzhen.
·N4 NATO Native and Non-Native Speech: The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military oriented database for multilingual and non-native speech processing studies. The NATO Speech and Language Technology group decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries. Speech data was recorded in the Naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada). The material consists of native and non-native speakers using NATO English procedure between ships and reading from a text.
·NPS Chat Corpus: The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been: 1) Hand privacy masked; 2) Part-of-speech tagged; and 3) Dialogue-act tagged.
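
A subset of the NPS Chat Corpus is also distributed with NLTK, which makes the dialogue-act and part-of-speech tagging easy to inspect. The sketch below assumes NLTK is installed and the nps_chat data package has been downloaded; it is illustrative and independent of the LDC release.

```python
# Inspect the NPS Chat posts bundled with NLTK: each XML post carries its
# dialogue-act label in the "class" attribute, and tagged_posts() exposes
# the POS tagging. Assumes: pip install nltk, then download the data below.
from collections import Counter

import nltk
from nltk.corpus import nps_chat

nltk.download("nps_chat", quiet=True)

# Distribution of dialogue-act labels across posts.
acts = Counter(post.get("class") for post in nps_chat.xml_posts())
print(acts.most_common(5))

# First POS-tagged post, as a list of (word, tag) pairs.
print(nps_chat.tagged_posts()[0])
```
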
·Ossetic National Corpus: The Ossetic National Corpus, which contains about 5 million wordforms, is now freely available online. All the texts in the corpus have been automatically annotated and contain English translations of most lexemes. The percentage of annotated wordforms is more than 75%. The corpus supports automatic Latin transliteration of search results.
·Penn Parsed Corpora of Historical English: The Penn Parsed Corpora of Historical English are a collection of three annotated corpora of historical British English: the Penn-Helsinki Parsed Corpus of Middle English (1.2 million words), the Penn-Helsinki Parsed Corpus of Early Modern English (1.7 million words) and the Penn Parsed Corpus of Modern British English (currently 1 million words). The corpora are genre-balanced and consist of POS-tagged and syntactically annotated text samples, including all of the samples in the Middle and Early Modern English sections of the Helsinki Corpus of Historical English (1.1 million words).
·Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English (600,000 words of genre-balanced text) and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English, and the two corpora are distributed together.
·PMSE: The PetaMem Scripting Environment (hereafter PMSE) is a suite of Perl scripts that allows the user to take control of various processes related to the use of corpora. The intent of PMSE is to build a comprehensive toolchain enabling generic work with text corpora, starting with data acquisition and continuing with statistical computation and data visualization.
·Russian National Corpora: The corpus is designed for anyone interested in a variety of issues related to the Russian language: professional linguists, language teachers, students, and foreigners studying Russian.
·Scandinavië Vertalingen: Translation agency for translations from and into the Scandinavian languages (Swedish, Finnish, Norwegian, Danish).
·Scottish Corpus of Texts and Speech (SCOTS): SCOTS is an AHRC-funded project, creating a corpus of texts in the languages of Scotland, in the first instance Scots and Scottish English, of all available genres. Spoken texts (orthographic transcription plus accompanying audio/video files) make up 20% of the complete corpus. The corpus is fully searchable online, and the website also contains a description and instructions.
·Slovak National Corpus: The Slovak National Corpus is built as a general monolingual corpus which, in its first phase (2003), began compiling written texts originating from 1990–2003, containing about 30 million words with lemmatisation, morphological and source (bibliographical and style-genre) annotation. During the second phase (up to 2006), the representative span of written texts will be extended to other periods of the contemporary language (1955–2005), to a total of 200 million words, and a selected sample will be syntactically annotated. Simultaneously, specific sub-corpora of diachronic and dialectological texts will begin to be built, as well as a terminological and lexicographical database. The Slovak National Corpus serves primarily lexicographers (dictionary creation) and complements grammatical and stylistic research (grammar and orthography handbooks; varieties of the national language and their usage in communication). We expect that it will also find use in schools (preparation of orthography, grammar and style textbooks; teaching Slovak as a foreign language). Specific sub-corpora of historical and dialectological texts will help preserve an important part of our cultural heritage in the long term.
·Speech Controlled Computing: The Speech Controlled Computing corpus was designed to support the development of small footprint, embedded ASR applications in the domain of voice control for the home. It consists of the recordings of 125 speakers of American English from four regions, three age groups and two gender groups, pronouncing isolated words. The recordings were conducted in a sound-attenuated room, and a high-quality microphone was used. Each speaker read a randomized word list consisting of 2100 words (100 distinct words appearing 21 times each). NOTE: Nonmembers may obtain a commercial rights license to Speech Controlled Computing for US$7000 by signing the LDC User License Agreement for Speech Controlled Computing. For-Profit Membership to the LDC is not required.
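
The reading protocol above has each speaker work through a randomized list in which 100 distinct words each appear 21 times (2,100 prompts in total). Purely as an illustration of that design, and not the corpus's actual procedure or vocabulary, such a prompt list could be generated as follows.

```python
# Build a randomized prompt list of 2,100 items: 100 distinct words x 21 repetitions.
# The word inventory is a placeholder, not the actual corpus vocabulary.
import random

words = [f"word{i:03d}" for i in range(100)]  # placeholder vocabulary
prompts = words * 21                          # 2,100 prompts in total
random.shuffle(prompts)

assert len(prompts) == 2100
print(prompts[:5])
```
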
·The Bergen Corpus of London Teenage Language (COLT): The Bergen Corpus of London Teenage Language (COLT) is the first large English Corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13 to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.
·The SenSem Databank: This databank includes a corpus for Spanish and another for Catalan (http://grial.uab.es/sensem/corpus), and their two respective verb lexicons (http://grial.uab.es/sensem/lexico). These resources can be readily consulted online and also downloaded. In these corpora, the semantics of sentences is analyzed as a syntax-lexicon continuum, so the annotation ranges from the lexical level to the sentence level. Constituents are independently annotated with different types of information: semantic role, syntagmatic category, and syntactic and semantic function. The verb phrase is also specified in terms of telicity and dynamism, and the sentence is specified with regard to topicalization or detopicalization of the logical subject, aspectuality, modality and polarity. All these values converge to create sentence meaning. The two SenSem lexicons comprise 1,200 senses. The sense description is carried out by means of a definition, the Aktionsart, semantic roles and subcategorization frames (with information about frequency and sentence semantics). In the Spanish lexicon, these senses are organized into 250 lemmas, which constitute the headwords for which 100 sentences from a journalistic register and 20 from a literary register have been randomly selected and manually annotated. The Spanish sentences from the journalistic register have been translated into Catalan and annotated independently.
·The SenSem Databank: With the information extracted from the two SenSem Corpora, one for Spanish and another for Catalan, two corresponding lexicons have been created. These resources can be consulted online (http://grial.uab.es/sensem/lexico) and also downloaded (http://grial.uab.es/descarregues.php). In the Spanish lexicon, 250 lemmas have been described. These were selected from the most frequent Spanish verbs in an original corpus of 13,000,000 words. In the Catalan lexicon, the number of lemmas is higher (318) because the correspondence between Spanish and Catalan verbs is not one-to-one. The two SenSem lexicons comprise 1,200 senses each, of which approximately 1,000 are exemplified in the corpus. The sense description is carried out by means of a definition, semantic roles, the WordNet synset and the frequency of each sense in the corpus, differentiating between registers. We also include the Aktionsart. Moreover, each sense is completed with information extracted from the corpora on subcategorization frames and their frequency. To describe the patterns, we make use of two levels. In the first level we include the general syntagmatic categories, ordered according to the unmarked Spanish word order, and mark those patterns that are pronominal. In the second level these categories are subspecified, and semantic roles and syntactic functions are added to the patterns. For each frame we also indicate the sentence semantics it is associated with, the actual order of categories, and the adjuncts. Finally, all the sentences of the corpora that exemplify each pattern can be visualized, and a graphic shows the annotation of each sentence.
·Timebank 1.2: The TimeBank 1.2 corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. The most recent information on TimeML is always available at www.timeml.org. TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. Timebank 1.2 is distributed via web download. Nonmembers may license this data at no cost - please note that a signed copy of our generic nonmember user agreement is required.
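
Because TimeML marks up text with TIMEX3, EVENT, SIGNAL and LINK tags, standard XML tooling is enough to pull out the annotations. The fragment in the sketch below is constructed for the example and is not actual TimeBank data.

```python
# Extract EVENT and TIMEX3 annotations from a TimeML-style fragment.
# The snippet is a constructed example, not actual TimeBank data.
import xml.etree.ElementTree as ET

timeml = """
<TimeML>
  The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> its results
  on <TIMEX3 tid="t1" type="DATE" value="1998-01-08">January 8, 1998</TIMEX3>.
</TimeML>
"""

root = ET.fromstring(timeml)

for event in root.iter("EVENT"):
    print("EVENT", event.get("eid"), event.get("class"), event.text)

for timex in root.iter("TIMEX3"):
    print("TIMEX3", timex.get("tid"), timex.get("type"), timex.get("value"), timex.text)
```
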
·TS Corpus: TS Corpus is a Turkish corpus project: a general-purpose, POS-tagged corpus containing 491 million tokens (491,360,398). TS Corpus aims to bring together earlier Turkish computational linguistics studies and corpus linguistics studies from around the world.
·VOICE: Vienna-Oxford International Corpus of English: The Vienna-Oxford International Corpus of English (VOICE) 1.0 Online is available as a free-of-charge resource for non-commercial research purposes. VOICE comprises naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). The recordings made for VOICE are keyboarded by trained transcribers and stored as a computerized corpus. The speakers recorded in VOICE are experienced ELF speakers from a wide range of first language backgrounds. The ELF interactions recorded cover a range of different speech events in terms of domain (professional, educational, leisure), function (exchanging information, enacting social relationships), and participant roles and relationships (acquainted vs. unacquainted, symmetrical vs. asymmetrical).
·Word Frequency Lists and Dictionary for American English: This site contains what we believe are the most accurate and hopefully the most useful word frequency lists of (American) English. Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 400 million word Corpus of Contemporary American English. You can be sure that the words in these lists and in this dictionary - sorted from most to least frequent - are really the most common ones that you will encounter in the real world. The frequency data comes in a number of different formats: * An eBook containing up to the 20,000 most frequent words, along with the 20-30 most frequent collocates (nearby words) and the synonyms for each word. * A printed book (from Routledge) with the top 5,000 words (including collocates) and thematic lists. * A free word list -- top 5,000 words, but no collocates or synonyms. * Simple word lists of the top 10,000 or 20,000 words, but without collocates or synonyms. * Lists with the top 200-300 collocates for each of the 20,000 words, for up to 5,000,000 node word / collocate pairs.
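
Frequency lists of this kind are typically distributed as simple rank / word / frequency tables. The sketch below assumes a hypothetical tab-separated file of that shape (the layout of the actual COCA-derived lists may differ) and computes how much of the corpus the top N words cover.

```python
# Cumulative text coverage of the top-N words from a frequency list.
# Assumes a hypothetical tab-separated file "freq_list.tsv" with columns
# rank<TAB>word<TAB>frequency; the real lists may be laid out differently.
import csv

def coverage(path: str, top_n: int) -> float:
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for rank, word, freq in csv.reader(f, delimiter="\t"):
            rows.append((int(rank), word, int(freq)))
    total = sum(freq for _, _, freq in rows)
    top = sum(freq for rank, _, freq in rows if rank <= top_n)
    return top / total

# Example: what share of all tokens do the 5,000 most frequent words account for?
# print(f"{coverage('freq_list.tsv', 5000):.1%}")
```
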

Electronic Texts

·Alex: A Catalogue of Electronic Texts on the Internet: A collection of public domain documents from American and English literature as well as Western philosophy.
·Centre for English Corpus Linguistics: ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English, representing 16 different mother-tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version, published in 2002, not only in its increased size and range of learner populations, but also in its interface, which contains two new functionalities: a built-in concordancer that lets users search for word forms, lemmas and/or part-of-speech tags, and a breakdown of query results according to learner profile information. The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the learners' countries of origin. There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25). The corpus can be ordered online at http://www.i6doc.com
·Chinese Text Project: The Chinese Text Project is a web-based e-text system designed to present ancient Chinese texts, particularly those relating to Chinese philosophy, in a well-structured and properly cross-referenced manner, making the most of the electronic medium to aid in their study and understanding.
·Croatian National Corpus: The corpus is the starting point for any linguistic research. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus usable for all kinds of Croatistic, lexicographic and lexicological research.
·Digital Archive of the Macedonian Language: The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials. Web address: http://damj.manu.edu.mk. Macedonian Academy of Sciences and Arts, Research Center for Areal Linguistics. Project Coordinator: Prof. Marjan Markovik (e-mail: marjan@manu.edu.mk).
·El Grial Corpus of Spanish: El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the website. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers and text types (academic, professional, technical, etc.). A detailed description of each corpus is available on the website. All documents have been tagged and parsed, and part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable, morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the latest tools developed is El Manchador de Textos, an online resource that "spots" and highlights in color the words or sequences under study; statistical information about the co-occurrences is also available.
·Freiburger Anthologie: The 1,200 best-known German poems in a searchable database.
·French Proverbs from 1611: A summary of 1,500 proverbs found in Randle Cotgrave's "A dictionarie of the French and English tongues", published in 1611. Most are in French with English translations; a few English and Latin proverbs are also included.
·HATII and DCC Release KRYS I Corpus to Aid Research: The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research (http://www.krys-corpus.eu). The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, we encourage the use of the corpus for research in: automated text classification (TC); digital curation and metadata extraction; natural language processing (NLP); and computational linguistics (CL). Despite the potential of document genre classification as a supporting step in language processing, document management, and information retrieval (e.g. the linguistic style and vocabulary of a document vary distinctly across document genres), to date there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that HATII and the DCC make the KRYS I Corpus available to researchers around the globe. The corpus originated as part of the ongoing semantic metadata extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether it is a scientific article or a letter) is characterised by the form and structure of the document, an understanding of which would facilitate further extraction of metadata from within it. Further details about the development of the KRYS I Corpus are available via the website (http://www.krys-corpus.eu); researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents. We encourage researchers to make full use of this corpus for their own research and recommend contributing towards its ongoing development by adding your own documents to the database; instructions are provided at http://www.krys-corpus.eu. Comments and/or feedback on the KRYS I Corpus are invited; contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
·IFA Dialog Video corpus: The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs, licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs of the Spoken Dutch Corpus (CGN). The procedures and design were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants (good friends, relatives, or long-time colleagues) were selected. The participants were allowed to talk about any topic they wanted.
·Korean Treebank Annotations Version 2.0: The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
·NPS Chat Corpus: The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been: 1) Hand privacy masked; 2) Part-of-speech tagged; and 3) Dialogue-act tagged.
·Oxford Text Archive (OTA): Text archive.
·Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English (600,000 words of genre-balanced text) and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English, and the two corpora are distributed together.
·Project Gutenberg e-texts: Texts online.
·Sociolingüística Andaluza: Research group in sociolinguistics (grupo de investigación en sociolingüística), Universidad de Sevilla.
·The Sumerian Text Archive: A growing collection of texts in the Sumerian language.
·The University of Virginia Electronic Text Center: An on-line archive of tens of thousands of SGML and XML-encoded electronic texts and images with a library service that offers hardware and software suitable for the creation and analysis of text.
·Tofa Videos and Texts: The Tofa stories available here were recorded by Dr. K. David Harrison in 2000 and 2001, for a project funded by a grant from Volkswagen-Stiftung.

Text and Corpora Meta Sites

·Centre for English Corpus Linguistics: ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English, representing 16 different mother-tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version, published in 2002, not only in its increased size and range of learner populations, but also in its interface, which contains two new functionalities: a built-in concordancer that lets users search for word forms, lemmas and/or part-of-speech tags, and a breakdown of query results according to learner profile information. The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the learners' countries of origin. There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25). The corpus can be ordered online at http://www.i6doc.com
·CONCISUS Corpus of Event Summaries: The CONCISUS Corpus is an annotated dataset of comparable Spanish and English event summaries in four application domains. For the time being, the CONCISUS Corpus covers the following domains: aviation accidents, train accidents, earthquakes, and terrorist attacks. The dataset contains comparable summaries, comparable automatic translations, and comparable full documents.
·Digital Archive of the Macedonian Language: The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials. Web address: http://damj.manu.edu.mk. Macedonian Academy of Sciences and Arts, Research Center for Areal Linguistics. Project Coordinator: Prof. Marjan Markovik (e-mail: marjan@manu.edu.mk).
·ELRA (European Language Resources Association): The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community.
·HATII and DCC Release KRYS I Corpus to Aid Research: The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) have released the KRYS I Corpus for genre classification research (http://www.krys-corpus.eu). The corpus consists of 6434 documents labelled with document genres and is intended as a research resource for text processing and data and information management, in particular for automated text classification (TC), digital curation and metadata extraction, natural language processing (NLP), and computational linguistics (CL). Although document genre classification is a promising supporting step in language processing, document management, and information retrieval (the linguistic style and vocabulary of a document vary distinctively across genres), there has to date been a severe lack of genre-labelled document corpora with which researchers can experiment. The corpus originated in the ongoing Semantic Metadata Extraction research at the DCC (http://www.dcc.ac.uk) and HATII (http://www.hatii.arts.gla.ac.uk), which evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether it is a scientific article or a letter) is characterised by its form and structure, an understanding of which facilitates further extraction of metadata from within the document. The website gives a detailed account of the document collection process, the reclassification of the documents in the corpus, and initial findings on human classification of the documents. Researchers are encouraged to use the corpus and to contribute their own documents to the database; instructions for contributing, along with contact details for comments and feedback, are available at http://www.krys-corpus.eu. A genre-classification sketch follows this entry.
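
For orientation, the following is a hedged sketch of the kind of automated genre classification the corpus is meant to support. It assumes the documents have already been exported as plain-text strings paired with genre labels; load_krys_documents() is a hypothetical stand-in for that export step, and the TF-IDF plus logistic-regression pipeline (scikit-learn) is an illustrative baseline, not a method prescribed by the corpus authors.

# Hedged sketch of supervised genre classification over a labelled document set.
# load_krys_documents() is hypothetical: it stands in for however the corpus is
# exported into (text, genre_label) pairs. The model choice is only a baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts, genres = load_krys_documents()   # hypothetical loader

X_train, X_test, y_train, y_test = train_test_split(
    texts, genres, test_size=0.2, stratify=genres, random_state=0
)

clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, min_df=5),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
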
·Het Corpus Gesproken Nederlands: The Corpus Gesproken Nederlands (Spoken Dutch Corpus), or CGN, is a collection of approximately 900 hours of spoken Dutch from Flemish and Dutch speakers. All recordings have been aligned with an orthographic transcription, and each word has been given a POS tag and a lemma. Part of the data has been enriched with syntactic, prosodic and/or phonetic information.
·IFA Dialog Video corpus: The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs, licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs of the Spoken Dutch Corpus (CGN), and its procedures and design were adapted to make it useful for other researchers of Dutch speech. Twenty dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants were selected: good friends, relatives, or long-time colleagues. The participants were allowed to talk about any topic they wanted.
·Italian Linguistics: Information (in Italian) on Italian linguistics and corpora.
·korpusy.net: This is a Polish-language site devoted to language corpora. It provides a range of introductory articles on corpora and corpus-based research, links and conference calls, a glossary of common terms, and some downloadable papers and resources.
·Leiden Armenian Lexical Textbase: The Leiden Armenian Lexical Textbase (LALT, http://www.sd-editions.com/LALT/home.html) combines Classical Armenian dictionaries with morphologically analyzed texts. There are some 80,000 Armenian lexemes and ten texts. The complete Nor Bargirk, main sections of Adjarian's Root Dictionary, Bedrossian's Armenian-English Dictionary and other material are integrated in LALT, together with a Greek-Armenian lexicon (20,000 entries) and aligned Armenian-Greek texts. LALT will be updated at regular intervals; it can easily integrate additional material and welcomes contributions from other scholars. Regarding fonts: LALT is written in XML and uses Unicode, so any Unicode font that contains the glyphs for Armenian and Greek will display it. One such font is Titus Cyberbit, which is used within LALT itself and is available for free at http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp. General information on Armenian and Unicode may be obtained at http://www.armunicode.org/en/fonts/unicode. (Contributed by Jos Weitenberg.) A small Unicode-checking sketch follows this entry.
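
Because LALT is plain XML in Unicode, checking whether a text stream actually contains Armenian or Greek characters (independently of any font) is just a matter of codepoint ranges. The sketch below uses the standard Armenian (U+0530–U+058F) and Greek and Coptic (U+0370–U+03FF) blocks; the sample string is invented for illustration and is not taken from LALT.

# Small sketch: LALT is Unicode XML, so script detection is a codepoint check.
# The sample string is invented for illustration.
import unicodedata

ARMENIAN = range(0x0530, 0x0590)   # Armenian block
GREEK = range(0x0370, 0x0400)      # Greek and Coptic block

def scripts_used(text):
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in ARMENIAN:
            found.add("Armenian")
        elif cp in GREEK:
            found.add("Greek")
    return found

sample = "բառ λόγος"
print(scripts_used(sample))                      # {'Armenian', 'Greek'}
print(hex(ord("բ")), unicodedata.name("բ"))      # 0x562 ARMENIAN SMALL LETTER BEN
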
·Lexicographical Corpus of Portuguese: The Lexicographical Corpus of Portuguese is a database of electronic texts in Portuguese. It contains the electronic transcription and edition of some of the most important dictionaries from the 16th to 18th centuries. The selected texts are generally considered the most important monuments of the Portuguese dictionary tradition for their dimension, reception and documental value: Jerónimo Cardoso, Dictionarium iuventuti studiosae (1562, aliás 1551), Dictionarium ex lusitanico in latinum sermonem (1562), Dictionarium Latinolusitanicum (1569/70), Breve dictionarium vocum ecclesiasticarum (1569), De monetis (1569, aliás 1561); Pedro de Poiares, Diccionario Lusitanico-Latino de Nomes Proprios (1667); Bento Pereira, Prosodia Tesouro (1697); Rafael Bluteau, Vocabulario Portuguez e Latino (1712-1728); António Franco (F. Pomey), Indiculo Universal (1716). With this project the complete text of those dictionaries is made available in digital format, and it is now possible to retrieve indexes of all the words found in them.
·Linguistic and Folklore materials from the Kujamaat Jóola: A site which will eventually grow to have an extensive collection of Kujamaat linguistic and folklore materials. It currently contains a dictionary (already listed), two folktales (text, translation and sound) and verses from extemporaneous funeral songs (text, sound, translation, commentary).
·Linguistic Data Consortium: Creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes.
·On-line books FAQ: Public-domain sources of e-texts available on the Internet.
·The IViE corpus: An intonationally transcribed corpus covering seven dialects of English from the British Isles. Subjects were secondary school students. The corpus covers short read sentences, a read story, a retold story, map tasks and free conversation.