Texts & Corpora
Page Index:
Corpora
Electronic Texts
Text and Corpora Meta Sites
Corpora
- AnCora Corpora:
AnCora: Syntactically and Semantically Annotated Corpora (Spanish, Catalan)
CLiC (Centre for Language and Computation) of the University of Barcelona, together with the Natural Language Processing group of the Polytechnic University of Catalonia, have created two new language technology resources: AnCora-Esp for Spanish and AnCora-Cat for Catalan, consisting of 500,000 words each. They are two treebanks enriched with different kinds of semantic information: 1) each function has its argument and thematic role; 2) each verb belongs to a semantic class according to its event structure and diathesis alternations; 3) each noun has its WordNet sense; and 4) each named entity (i.e. persons, organisations, locations, dates, etc.) is identified and categorized.
The annotation process has also resulted in two verbal lexicons with approximately 2,000 entries for each language with information about verbal semantic classes and their syntactic subcategorization, their argument structure and the thematic roles for each sense.
The AnCora corpora as well as the derived verbal lexicons (AnCora-Verb) are freely available (queries and downloads) from: http://clic.ub.edu/ancora/.
- British National Corpus:
A 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
- Buckeye Corpus:
This corpus contains high-quality recordings of conversational American English speech from 40 speakers in Columbus, OH, USA. The speech has been orthographically transcribed and phonetically labeled. Currently the audio files and transcriptions for 20 talkers are available.
- CHAINS: Characterising Individual Speakers:
The Chains corpus is a novel speech corpus collected with the primary aim of facilitating research in speaker identification. The corpus features approximately 36 speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English. The corpus is being made freely available for research purposes.
- CLIPS, Corpus of Spoken Italian:
CLIPS is a corpus of spoken Italian, freely available at www.clips.unina.it. The corpus (audio files, annotation and documentation) is fully downloadable from the website via ftp, free for research purposes.
CLIPS consists of about 100 hours of speech, equally represented by female and male voices. A section of the corpus is transcribed orthographically, a smaller section has been phonetically labeled. Recordings were made in 15 Italian cities, selected on the basis of linguistic and socio-economic principles of representativeness: Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia.
For each of the 15 cities different text typologies have been included: a) radio and television broadcasts (news, interviews, talk shows); b) dialogue (240 dialogues collected using the map task procedure and the "spot the difference" game; in this set, 30 dialogues are phonetically labeled and 90 orthographically transcribed); c) read speech from non-professional speakers (20 sentences each, covering medium-high frequency Italian words); d) speech over the telephone (conversations between 300 speakers and a simulated hotel desk service operator); e) read speech from 20 professional speakers (160 sentences, covering all phonotactic sequences and medium-high frequency Italian words) recorded in an anechoic chamber.
Documentation, corpus collection and annotation follow the EAGLES guidelines.
- COMPARA - Portuguese-English Parallel Corpus:
COMPARA is a bi-directional parallel corpus based on an open-ended collection of Portuguese-English and English-Portuguese source texts and translations. Access is free and requires no registration.
- COSMAS Corpus Archive:
The largest German corpus archive, with free-of-charge online search in 1,181 million words of running text (1,846 million words for invited guests).
- CSLU Spoltech Brazilian Portuguese:
The CSLU Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8080 separate utterances. A total of 2540 utterances have been transcribed at the word level (without time alignments), and 5479 utterances have been transcribed at the phoneme level (with time alignments).
- CSLU: Spelled and Spoken Words:
The CSLU: Spelled and Spoken Words corpus consists of spelled and spoken words. 3647 callers were prompted to say and spell their first and last names, to say what city they grew up in and what city they were calling from, and to answer two yes/no questions. In order to collect sufficient instances of each letter, 1371 callers also recited the English alphabet with pauses between the letters. Each call was transcribed by two people, and all differences were resolved. In addition, a subset of 2648 calls has been phonetically labeled.
- Centre for English Corpus Linguistics:
ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English representing 16 different mother tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version published in 2002 not only by its increased size and range of learner populations, but also by its interface, which contains two new functionalities: a built-in concordancer allowing users to search for word forms, lemmas and/or part-of-speech tags, and a breakdown of the query results according to the learner profile information.
The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the countries of origin of the learners.
There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25).
The corpus can be ordered online at http://www.i6doc.com
- Chinese Gigaword Second Edition:
Chinese Gigaword Release Second Edition is a comprehensive archive of newswire text data in Chinese that has been acquired over several years by the LDC.
This release includes all of the contents in the first release of the Chinese Gigaword corpus (LDC2003T09), material from one new source, as well as new materials from the other two sources. Thus, the corpus contains three distinct international sources of Chinese newswire: Central News Agency (Taiwan), Xinhua News Agency, and Zaobao.
Some minor updates to the documents from the first release have been made.
- Corpora at ICAME:
International Computer Archive of Modern and Medieval English.
- Corpus de Català Contemporani de la Universitat de Barcelona (CCCUB):
Spoken language corpora developed for the study of geographical, functional and socio-cultural variation in Catalan. The texts are in .pdf. The sound files are not yet available through the web, but they have been published on CD-ROM and can be purchased.
The CCCUB is also available through RECERCAT (Dipòsit de la Recerca de Catalunya): http://www.recercat.net/handle/2072/8925.
- Corpus del español:
An online, searchable corpus of diachronic Spanish texts (100 million words, 13th century to present).
- Corpus e Lessico di Frequenza dell'Italiano Scritto:
CoLFIS (Corpus e Lessico di Frequenza dell'Italiano Scritto)
[Corpus and Frequency Lexicon of Written Italian]
produced by
Pier Marco Bertinetto°, Cristina Burani*, Alessandro Laudanna^*,
Lucia Marconi+, Daniela Ratti+, Claudia Rolando+, Anna Maria Thornton§
° Scuola Normale Superiore, Pisa
* Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma
^ Università di Salerno
+ Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova
§ Università de L'Aquila
The reference corpus consists of excerpts from newspapers, magazines and books. It includes 3,150,075 lexical occurrences. The corpus was designed as the best approximation to the Italians' average preferred readings, as mirrored by official statistics.
The lexicon consists of two main components: the forms repertoire and the lemmas repertoire.
In the latter, all identical forms belonging to different lemmas are disambiguated, while syntagmatic words (such as table's leg) are treated as single entries.
The lexical lists (both forms and lemmas) are presently available for free download at
http://alphalinguistica.sns.it/BancheDati.htm
http://www.istc.cnr.it/material/database/colfis/
They are organized according to a number of possibilities: frequency rank, inverse alphabetical ordering, with or without capital / non-capital distinction, etc.
The entire corpus is not yet available. We hope to put it on-line as soon as we obtain the necessary authorizations.
The work has been produced with CNR (Consiglio Nazionale delle Ricerche) support.
With the help of willing users, this product will hopefully be enriched with further facilities.
- Croatian Language Corpus:
The Croatian Language Corpus is the result of various projects at the Institute of Croatian Language and Linguistics and the Linguistics Department of the University of Zadar. There is an online interface based on Philologic at the given URL. The corpus currently indexes more than 100 k tokens, and the base is growing continuously. It is annotated in TEI XML P5, and its annotation is being enriched with morphological segmentation, lemmatization, phonemic transcription, morphosyntactic annotation and syntactic parses. The online interfaces are subject to change and extension to improve access to various corpus properties.
- Croatian National Corpus:
The starting point for any linguistic research is a corpus. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus that can be used for all kinds of Croatistic, lexicographic and lexicological research.
- Czech Academic Corpus v. 1.0:
The Czech Academic Corpus version 1.0 is a corpus of the Czech language with manual morphological annotation, consisting of approximately 600,000 words in continuous texts.
- Database of spoken Italian (BADIP):
Contains an online edition of the 500,000 word LIP-Corpus. The edition is being enriched with POS-tags and lemmata, and more data are being added continuously. Other corpora of spoken Italian will be included in the database as soon as possible. Access to BADIP is free. The database is part of the LanguageServer of the University of Graz (Austria).
- Digital Archive of the Macedonian Language:
The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials.
web address: http://damj.manu.edu.mk
Macedonian Academy of Sciences and Arts
Research Center for Areal Linguistics
Project Coordinator:
Prof. Marjan Markovik
e-mail: marjan@manu.edu.mk
- Eastern Armenian National Corpus:
Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Standard Eastern Armenian (SEA), the language spoken in the Republic of Armenia.
EANC is:
- a comprehensive corpus with about 90 million tokens
- a powerful search engine for making complex lexical morphological queries
- a learner’s corpus including English translations for frequent tokens
- a diachronic corpus covering SEA texts from the mid-19th century to the present
- a mixed corpus consisting of both written discourse and oral discourse
- an open-ended corpus with new texts being added continuously
- an annotated corpus with morphological and metatext tagging
- an open access corpus
- an electronic library with full access to over 100 Armenian classic titles
Another important feature is the Glossed output: typologists and language learners can now work with a text format similar to interlinear morphological glosses. In this format, wordforms are supplied with lemmas, lexical and grammatical categories, and translations, vertically aligned below each wordform. Also possible is switching to Latin transliteration from the Armenian alphabet.
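The vertically aligned layout described above can be illustrated with a minimal Python sketch. This is not EANC's own code or output format: the function name `format_gloss`, the tier order and the category labels are assumptions made for illustration, and the sample sentence is a Spanish stand-in rather than EANC data.

```python
def format_gloss(tokens):
    """Render (wordform, lemma, category, translation) tuples as aligned tiers.

    Each token becomes one column; each tier (wordforms, lemmas, ...) one line.
    """
    # A column is as wide as the longest field of its token.
    widths = [max(len(field) for field in tok) for tok in tokens]
    lines = []
    for tier in zip(*tokens):  # transpose: one tuple per tier
        cells = (field.ljust(w) for field, w in zip(tier, widths))
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Hypothetical stand-in data (Spanish, not taken from EANC).
sample = [
    ("los", "el", "DET.PL", "the"),
    ("perros", "perro", "N.PL", "dogs"),
    ("ladran", "ladrar", "V.3PL", "bark"),
]
print(format_gloss(sample))
```

Padding every field to the width of its column keeps each lemma, category and translation directly below the wordform it belongs to, which is the property the glossed view relies on.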
- Ega XML Lexicon:
Digitized, online lexicon of Ega, a language of the Ivory Coast as provided by the late Prof. Eddy Aimé Gbery.
- El Grial Corpus of Spanish:
El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish are also available on the website. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized and written/spoken registers and text types (academic, professional, technical, etc.).
A detailed description of each corpus is available on the website. All documents have been tagged and parsed. Part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the latest tools developed is El Manchador de Textos, an online resource that “spots” and colors the words or sequences under study; statistical information about the co-occurrences is also available.
- English Accents and Dialects:
Extracts from the Survey of English Dialects and the Millennium Memory Bank document how we spoke and lived in the 20th century.
- HATII and DCC Release KRYS I Corpus to Aid Research:
The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research.
http://www.krys-corpus.eu
The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, we encourage the use of the corpus for the research of:
- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)
Despite the potential of document genre classification as a supporting step in language processing, document management, and information retrieval (e.g. the linguistic style and the vocabulary of a document vary distinctively across document genres), to date, there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that the Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) make the KRYS I Corpus available to researchers around the globe.
The Corpus originated as part of the ongoing Semantic Metadata Extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and the HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether a document is a scientific article or a letter) is characterised by the form and structure of a document, the understanding of which would facilitate further extraction of metadata from within the document.
Further details about the development of the KRYS I corpus are available via the website (http://www.krys-corpus.eu). Specifically, researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents.
We encourage researchers to make full use of this corpus for their own research activity and recommend that you consider contributing towards the ongoing development of the corpus by adding your own documents to the database. Instructions as to how to contribute to the corpus are provided at http://www.krys-corpus.eu.
Comments and/or feedback on the KRYS I Corpus are invited. Contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
- Hebrew Corpus of Arutz7 Newswires:
A corpus containing news and articles from Arutz 7 since 2001, updated daily. Text is available in HTML, plain ASCII text, and tokenized text in XML format. It is possible to obtain an XML version of the text morphologically annotated (with all possible analyses) and morphologically disambiguated (with the correct morphological analysis in context).
Every day, the front page of Arutz 7 is scanned for updated news and articles, and new material is downloaded. The relevant text is extracted from the downloaded pages and then analyzed for document structure (paragraph, sentence and token segmentation). The texts are then represented in XML.
The resources are free but require a username and a password that can be obtained by sending an email to Shlomo Yona.
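As a rough illustration of the document-structure step described above (paragraph, sentence and token segmentation, then XML output), here is a minimal Python sketch. It is not the corpus's actual pipeline: the element names `document`, `paragraph`, `sentence` and `token` are assumptions, and the punctuation-based sentence splitter is a deliberately naive stand-in for a real segmenter.

```python
import re
import xml.etree.ElementTree as ET


def text_to_xml(text):
    """Segment plain text into paragraphs, sentences and tokens as XML."""
    doc = ET.Element("document")
    # Paragraphs: separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        p = ET.SubElement(doc, "paragraph")
        # Sentences: naive split after ., ! or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            s = ET.SubElement(p, "sentence")
            # Tokens: runs of word characters, or single punctuation marks.
            for tok in re.findall(r"\w+|[^\w\s]", sent):
                ET.SubElement(s, "token").text = tok
    return doc


doc = text_to_xml("First sentence. Second one!\n\nNew paragraph here.")
print(ET.tostring(doc, encoding="unicode"))
```

Because `\w` matches Unicode letters in Python 3, the same token pattern also applies to Hebrew text, although real Hebrew tokenization (prefixed particles, abbreviations) needs considerably more care than this sketch takes.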
- Hellenic National Corpus:
HNC is a corpus of written Modern Greek texts, available over the Internet, for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000.
It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and various topics (economy, medicine, leisure, art, human sciences etc.).
The HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek:
- specific words (e.g. child),
- lemmas (e.g. child as a lemma produces every inflected type of the word),
- parts of speech and
- up to three combinations of all the above, in which users can specify the distance among lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech).
Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference by the users.
Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page.
Finally, HNC users can make queries concerning word, lemma and/or parts of speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
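The concordance view mentioned above, in which the query object is centred, is commonly known as keyword-in-context (KWIC). The Python sketch below shows the general idea only; it is not the HNC interface, the function name `kwic` and its parameters are hypothetical, and it matches exact word forms rather than lemmas or parts of speech.

```python
def kwic(tokens, query, window=3, width=30):
    """Return keyword-in-context lines with the query token in a fixed column."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != query.lower():
            continue
        # Left context, truncated to the column width so alignment holds.
        left = " ".join(tokens[max(0, i - window):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + window])
        # Right-align the left context so every hit starts at the same column.
        lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines


tokens = "the child saw another child in the park".split()
for line in kwic(tokens, "child", window=2, width=12):
    print(line)
```

Right-aligning the left context is what places every hit in the same column, so the contexts on either side can be scanned vertically.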
- IFA Dialog Video corpus:
The IFA Dialog Video corpus is a collection of annotated video recordings of friendly Face-to-Face dialogs licensed under the GNU General Public License (GPLv2). It is modeled on the Face-to-Face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus 20 dialog conversations of 15 minutes were recorded and annotated, in total 5 hours of speech. To stay close to the very useful Face-to-Face dialogs in the CGN, pairs of well acquainted participants, either good friends, relatives, or long-time colleagues, were selected. The participants were allowed to talk about any topic they wanted.
- IPI PAN Corpus of Polish:
The 2nd edition of the IPI PAN Corpus of Polish, developed at the Institute of Computer Science of the Polish Academy of Sciences (PAS), is available at the web pages of:
- the Institute of Computer Science PAS:
http://korpus.pl/en/
- the Institute of Polish Language PAS:
http://corpus.ijp-pan.krakow.pl/en/
To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated corpus of Polish available to the public.
The whole corpus consists of over 250 million segments (about 200 million orthographic words) and it is not balanced, but a balanced sample of over 30 million segments is also available. These corpora can be directly searched at the above addresses (do read the query syntax cheatsheet at http://korpus.pl/en/cheatsheet/index.html) or downloaded in a binary form to be used with a standalone version of the corpus search engine Poliqarp (announced separately on the 'corpora' list and available from http://korpus.pl/en/).
- IULA's UPF Textual, plurilingual, specialized Corpus:
The main goal of the Corpus project is the construction and exploitation of a textual, plurilingual and specialized corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest include economics, law, computer science, medicine and environmental science. This corpus is the main support for teaching and research at our institute. The research activities envisaged for this corpus include terminology detection, parallel text alignment, partial parsing, (semi)automatic extraction of several levels of linguistic information for building computational systems (for example, subcategorization patterns), and language variation studies.
- International Corpus of English (British Component):
The British Component of the International Corpus of English (ICE-GB) contains one million words of spoken and written British English. The material is fully tagged and parsed and the associated syntactic treebank is searchable with dedicated exploration software. The spoken material can be listened to.
- Korean Propbank:
Korean Propbank is a semantic annotation of the Korean English Treebank Annotations and Korean Treebank version 2.0. Each verb and adjective occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs and adjectives have also been tagged with coarse grained senses.
There are two basic components to Korean Propbank:
* The Verb Lexicon. A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. 2,749 such files have been created.
* The Annotation. There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate tokens. These predicate tokens include all those occurring in over 54 thousand words of the Korean English Treebank Annotations, totaling ~791 KB of uncompressed data. The newswire-verbs.pb file has 23,707 annotated predicate tokens. These predicate tokens include all those occurring in over 131 thousand words of the Korean Treebank version 2.0.
- Korean Treebank Annotations Version 2.0:
The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
- LumaLiDa - Resources for Child Language:
LumaLiDa is a family of database resources for the study of child language. It includes LumaLiDaOn (the Linguistic Diary of Luma, a European Portuguese child), LumaLiDaOnLexicon (the lexicon used by the child in LumaLiDaOn, types and tokens), LumaLiDaAudy (transcribed audio files of child speech), and LumaLiDaAudyLexicon (the lexicon used by the child in LumaLiDaAudy, types and tokens).
- MDE RT04 Training Data Speech:
MDE RT-04 Training Data Speech was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.
In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation and standardized spelling, plus sensible conventions for representing speaker turns and identity are further elements in the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.
- Monguor - Online Texts:
Digitized online texts of Monguor, an endangered language spoken in the People's Republic of China, as provided to the E-MELD School of Best Practices by Dr. Wang Xianzhen.
- N4 NATO Native and Non-Native Speech:
The N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military oriented database for multilingual and non-native speech processing studies. The NATO Speech and Language Technology group decided to create a corpus geared towards the study of non-native accents. The group chose naval communications as the common task because it naturally includes a great deal of non-native speech and because there were training facilities where data could be collected in several countries.
Speech data was recorded in the Naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada). The material consists of native and non-native speakers using NATO English procedure between ships and reading from a text.
- NPS Chat Corpus:
The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been:
1) Hand privacy masked;
2) Part-of-speech tagged; and
3) Dialogue-act tagged.
- Online Dena'ina Qenaga Lexicon:
Searchable online wordlist of Dena'ina Qenaga (Tanaina).
- Penn-Helsinki Parsed Corpus of Early Modern English:
The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English, which consists of 600,000 words of genre balanced text and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English and the two corpora are distributed together.
- Penn-Helsinki Parsed Corpus of Middle English:
The Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and the Penn Parsed Corpus of Modern British English are syntactically annotated corpora of prose text samples of English from the indicated time periods. Their syntactic annotation (parsing) permits searching, not only for words and word sequences, but also for syntactic structure. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available under certain conditions.
- Persian Linguistic Database (PLDB):
This is the first on-line database for contemporary (Modern) Persian, designed and developed by Dr. S. M. Assi at the Institute for Humanities and Cultural Studies (IHCS), Iran.
The database contains a very large, selected collection of corpora covering all varieties of the Modern Persian language in the form of running texts. Some of the texts are annotated with grammatical, pronunciation and lemmatisation tags.
Special, powerful software provides different types of search and statistical listing facilities across the whole database or any selective corpus made up of a group of texts.
The database is constantly being improved and expanded.
- Russian National Corpora:
The corpus is designed for anyone interested in a variety of issues related to the Russian language: professional linguists, language teachers, students, and foreigners studying the Russian language.
- SCoSE - Saarbrücken Corpus of Spoken English:
The SCoSE consists of five parts:
Part 1: Stories
Part 2: Indianapolis Interviews
Part 3: Jokes
Part 4: Complete Conversations
Part 5: Drawing Experiment
You can download each of the five parts as a .pdf file.
- SMULTRON - The Stockholm Multilingual Treebank:
SMULTRON is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences each in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
- Scandinavië Vertalingen:
Translation agency for translations from and into the Scandinavian languages (Swedish, Finnish, Norwegian, Danish).
- Scottish Corpus of Texts and Speech (SCOTS):
SCOTS is an AHRC-funded project, creating a corpus of texts in the languages of Scotland, in the first instance Scots and Scottish English, of all available genres. Spoken texts (orthographic transcription plus accompanying audio/video files) make up 20% of the complete corpus. The corpus is fully searchable online, and the website also contains a description and instructions.
- Searchable Biao Min Lexicon:
The Biao Min Lexicon, housed on the E-MELD site, consists of nearly 3,000 lexical items from Biao Min documentation collected by David Solnit.
- Searchable Kayardild Lexicon:
Searchable lexicon of Kayardild, collected by Dr. Nicholas Evans and hosted by E-MELD.
- Searchable Mocoví Lexicon:
Searchable Mocoví Lexicon, based on data provided and collected by Dr. Verónica Grondona.
- Searchable Potawatomi Lexicon:
Online, searchable Potawatomi lexicon, utilizing data provided to the E-MELD School of Best Practices by Dr. Laura Buszard-Welcher.
- Searchable Saliba Lexicon:
Searchable Lexicon of the Saliba language, utilizing data provided to the E-MELD School of Best Practices by Nancy Morse.
- Slovak National Corpus:
The Slovak National Corpus is built as a general monolingual corpus. In the first phase (begun in 2003), it compiled written texts originating from 1990 – 2003, containing about 30 million words with lemmatisation, morphological annotation and source (bibliographical and style-genre) annotation. During the second phase (up to 2006), the representative span of written texts will be extended to other periods of the contemporary language (1955 – 2005), to a total of 200 million words, and a selected sample will be syntactically annotated. At the same time, work will begin on specific sub-corpora of diachronic and dialectological texts, as well as on a terminological and lexicographical database.
The Slovak National Corpus is intended primarily for lexicographers (dictionary creation) and complements grammatical and stylistic research (grammar and orthography handbooks; varieties of the national language and their usage in communication). We expect that it will also be used in schools (preparation of orthography, grammar and style textbooks; teaching Slovak as a foreign language). Specific sub-corpora of historical and dialectological texts will help to preserve an important part of our cultural heritage in the long term.
- Speech Controlled Computing:
The Speech Controlled Computing corpus was designed to support the development of small footprint, embedded ASR applications in the domain of voice control for the home. It consists of the recordings of 125 speakers of American English from four regions, three age groups and two gender groups, pronouncing isolated words. The recordings were conducted in a sound-attenuated room, and a high-quality microphone was used. Each speaker read a randomized word list consisting of 2100 words (100 distinct words appearing 21 times each).
NOTE: Nonmembers may obtain a commercial rights license to Speech Controlled Computing for US$7000 by signing the LDC User License Agreement for Speech Controlled Computing. For-Profit Membership to the LDC is not required.
- The Bergen Corpus of London Teenage Language (COLT):
The Bergen Corpus of London Teenage Language (COLT) is the first large English Corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13 to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.
- The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages:
The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 22 Languages. New: Version 3.0 has almost tripled in size. The JRC-Acquis Version 3.0 is a unique and freely available parallel corpus containing European Union (EU) documents of a mostly legal nature. It is available in 22 of the 23 official EU languages (all except Irish). The corpus consists of about 23,000 documents per language, with an average size of 49 million words per language, totalling over one billion words. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is currently available for a subset of 8,000 documents in 210 language pair combinations. Pair-wise alignment for all texts in all 231 language pairs will be available soon. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software.
The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
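The language-pair figures quoted for the JRC-Acquis are consistent with counting unordered pairs of languages. The following is a minimal sketch of that arithmetic; the function name `language_pairs` is hypothetical and is not part of any JRC-Acquis tooling.

```python
from math import comb


def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs among n_languages languages."""
    return comb(n_languages, 2)


# 22 languages give the 231 pairs mentioned for the full corpus.
print(language_pairs(22))
# 210 equals the pair count for 21 languages; the description does not
# say which languages the currently aligned subset covers.
print(language_pairs(21))
```

The same count can be written as n(n-1)/2, which may help when estimating how alignment work grows as languages are added.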
- Timebank 1.2:
The TimeBank 1.2 corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. The annotation follows the TimeML 1.2.1 specification. The most recent information on TimeML is always available at www.timeml.org.
TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships. TimeBank 1.2 is distributed via web download.
Nonmembers may license this data at no cost - please note that a signed copy of our generic nonmember user agreement is required.
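As a rough illustration of the four tag types, the sketch below counts them in a small XML fragment using Python's standard library. The sample text, attribute values and the function name `count_timeml_tags` are invented for illustration and are simplified; they are not taken from TimeBank and may not validate against the TimeML 1.2.1 specification. TLINK is used here as one example of a link tag.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified fragment; not an excerpt from TimeBank.
SAMPLE = """<TimeML>
<s>The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> the merger
<SIGNAL sid="s1">on</SIGNAL>
<TIMEX3 tid="t1" type="DATE" value="1998-02-13">February 13, 1998</TIMEX3>.</s>
<TLINK lid="l1" relType="IS_INCLUDED" eventID="e1" relatedToTime="t1"/>
</TimeML>"""


def count_timeml_tags(xml_text: str) -> Counter:
    """Count EVENT, TIMEX3, SIGNAL and link elements in an XML string."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if element.tag in ("EVENT", "TIMEX3", "SIGNAL"):
            counts[element.tag] += 1
        elif element.tag.endswith("LINK"):
            # Group all link tags under the generic label used above.
            counts["LINK"] += 1
    return counts


print(dict(count_timeml_tags(SAMPLE)))
```

Real TimeBank files should be checked against the specification at www.timeml.org before relying on any particular attribute name.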
- VOICE: Vienna-Oxford International Corpus of English:
The Vienna-Oxford International Corpus of English (VOICE) 1.0 Online is available as a free-of-charge resource for non-commercial research purposes.
VOICE comprises naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). The recordings made for VOICE are keyboarded by trained transcribers and stored as a computerized corpus.
The speakers recorded in VOICE are experienced ELF speakers from a wide range of first language backgrounds. The ELF interactions recorded cover a range of different speech events in terms of domain (professional, educational, leisure), function (exchanging information, enacting social relationships), and participant roles and relationships (acquainted vs. unacquainted, symmetrical vs. asymmetrical).
|
Electronic Texts
|
- Alex: A Catalogue of Electronic Texts on the Internet:
A collection of public domain documents from American and English literature as well as Western philosophy.
- Centre for English Corpus Linguistics:
ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English representing 16 different mother tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version published in 2002 not only by its increased size and range of learner populations, but also by its interface, which contains two new functionalities: a built-in concordancer allowing users to search for word forms, lemmas and/or part-of-speech tags, and a breakdown of the query results according to the learner profile information.
The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the countries of origin of the learners.
There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25).
The corpus can be ordered online at http://www.i6doc.com
- Chinese Text Project:
The Chinese Text Project is a web-based e-text system designed to present ancient Chinese texts, particularly those relating to Chinese philosophy, in a well-structured and properly cross-referenced manner, making the most of the electronic medium to aid in their study and understanding.
- Croatian National Corpus:
The starting point for any linguistic research is a corpus. As Croatian does not have a systematically compiled corpus, the objective of this project is the compilation and analysis of representative Croatian texts, both older and contemporary, in the form of a corpus that can be used for all kinds of Croatistic, lexicographic and lexicological research.
- Digital Archive of the Macedonian Language:
The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials.
web address: http://damj.manu.edu.mk
Macedonian Academy of Sciences and Arts
Research Center for Areal Linguistics
Project Coordinator:
Prof. Marjan Markovik
e-mail: marjan@manu.edu.mk
- El Grial Corpus of Spanish:
El Grial Corpus of Spanish (www.elgrial.cl) is a growing collection of eight corpora (almost 100 million words) with approximately 700 documents of contemporary Spanish, developed by the members of the Escuela Lingüística de Valparaíso (www.linguistica.cl) at the Pontificia Universidad Católica de Valparaíso, Chile. A tagger and parser for Spanish is also available on the web site. These corpora have been collected under specific methodological principles, identifying specialized/non-specialized, written/spoken registers and text types (academic, professional, technical, etc.).
A detailed description of each corpus is available on the web site. All documents have been tagged and parsed. Part of the data has been enriched with deep syntactic information. To the best of our knowledge, this is currently the largest searchable morphosyntactically annotated and register-diversified corpus of Spanish available to the public, with online tools that help analyze the collected data. Users can define their corpus of study and search the data using a wide variety of resources. Query results are presented in different formats, depending on the kind of research question. El Grial users can make queries concerning word, lemma and/or part-of-speech frequencies. One of the most recently developed tools is El Manchador de Textos, an online resource that “spots” and colors the words or sequences under study; statistical information about the co-occurrences is also available.
- Freiburger Anthologie:
The 1,200 best-known German poems in a searchable database.
- Georgian proverbs:
An online book of Georgian proverbs, in the Georgian language.
- HATII and DCC Release KRYS I Corpus to Aid Research:
The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research.
http://www.krys-corpus.eu
The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, we encourage the use of the corpus for the research of:
- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)
Despite the potential of document genre classification as a supporting step in language processing, document management, and information retrieval (e.g. the linguistic style and the vocabulary of a document vary distinctively across document genres), to date, there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that the Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) make the KRYS I Corpus available to researchers around the globe.
The Corpus originated as part of the ongoing Semantic Metadata Extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and the HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether a document is a scientific article or a letter) is characterised by the form and structure of a document, the understanding of which would facilitate further extraction of metadata from within the document.
Further details about the development of the KRYS I corpus are available via the website (http://www.krys-corpus.eu). Specifically, researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents.
We encourage researchers to make full use of this corpus for their own research activity and recommend that you consider contributing towards the ongoing development of the corpus by adding your own documents to the database. Instructions as to how to contribute to the corpus are provided at http://www.krys-corpus.eu.
Comments and/or feedback on the KRYS I Corpus are invited. Contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
- Hellenic National Corpus:
HNC is a corpus of written Modern Greek texts, available over the Internet, for research use only. It is based on the General Language corpus developed by the Institute of Language and Speech Processing and has been fully available on the Internet since 2000.
It currently contains about 32,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and various topics (economy, medicine, leisure, art, human sciences etc.).
The HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek:
- specific words (e.g. child),
- lemmas (e.g. child as a lemma produces every inflected type of the word),
- parts of speech and
- up to three combinations of all the above, in which users can specify the distance among lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech).
Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference by the users.
Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page.
Finally, HNC users can make queries concerning word, lemma and/or parts of speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available.
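The concordance view described above, with the query object centred, is often called a keyword-in-context (KWIC) display. The following is a minimal sketch of that idea on a plain token list; it is not the HNC's implementation, the function names are hypothetical, and it matches word forms only (no lemma or part-of-speech search).

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


def format_kwic(rows):
    """Right-align the left contexts so the matched word lines up in a column."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left.rjust(pad)}  {match}  {right}" for left, match, right in rows]


tokens = "the child saw another child in the park".split()
for line in format_kwic(kwic(tokens, "child", width=2)):
    print(line)
```

A lemma-based query, as offered by the HNC, would additionally need a lemmatised version of the tokens to match against.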
- IFA Dialog Video corpus:
The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants (good friends, relatives, or long-time colleagues) were selected. The participants were allowed to talk about any topic they wanted.
- Korean Treebank Annotations Version 2.0:
The Korean Treebank Annotations Version 2.0 is an extension of the Korean English Treebank Annotations corpus, LDC2002T26 (2002). It is essentially an electronic corpus of Korean texts annotated with morphological and syntactic information. The original texts for the Korean Treebank 2.0 were selected from The Korean Newswire corpus published by LDC, catalog number LDC2000T45, which is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion of the corpus and includes 647 articles. The annotated corpus can find many uses, including training of morphological analyzers, part-of-speech taggers and syntactic parsers.
- Korpus 2000:
The aim of the Korpus 2000 project is to document the use of the Danish language around the year 2000 - in the form of a text corpus in which one can look up words and phrases via this website. The texts that constitute the Korpus 2000 were written mainly between 1998 and 2002.
- Linguistic eBooks at Diesel eBook Store:
Download linguistic eBooks in multiple formats. There is a large inventory of linguistic eBooks from top industry authors available at the Diesel eBook Store.
- NPS Chat Corpus:
The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been:
1) Hand privacy masked;
2) Part-of-speech tagged; and
3) Dialogue-act tagged.
- Oxford Text Archive (OTA):
Text archive.
- Penn-Helsinki Parsed Corpus of Early Modern English:
The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English. It includes the text samples of the Helsinki Corpus of Historical English, which consists of 600,000 words of genre balanced text and two extension samples of the same size, balanced for genre in the same way. It is a sister corpus of the Penn-Helsinki Parsed Corpus of Middle English and the two corpora are distributed together.
- Project Gutenberg e-texts:
Texts online.
- Sociolingüística Andaluza:
Information on the sociolinguistics research group (Grupo de investigación en sociolingüística) of the Universidad de Sevilla.
- Textos Hixkaryana:
A scanned version of Derbyshire's (1965) text collection of Hixkaryana, with constituent-numbered translations in Portuguese and English. A sincere but unsuccessful attempt has been made to contact the publisher for an exemption of copyright.
Publication Information:
Derbyshire, Desmond. 1965. Textos Hixkaryana. Belem, Para, Brasil: Conselho Nacional de Pesquisas. Instituto Nacional de Pesquisas da Amazonia. Museo Paraense Emilio Goeldi.
- Texts in context:
A lovely collection of classified, annotated and (partially) downloadable texts from the British Library's collection - good for both teaching and research. Here's the introduction:
Texts in Context is a rich and unusual collection of over 400 British Library texts. You can find menus for medieval banquets and handwritten recipes scribbled inside book covers. You can browse the first English dictionary ever written and explore the secret language of the Georgian underworld. You can study the East India Company's shopping lists and practise sentences from colonial phrasebooks. You can learn smugglers' songs, listen to rare dialect recordings, and examine the logbooks of 17th century trading ships.
- The Aboriginal Studies Electronic Data Archive:
The Australian Institute of Aboriginal and Torres Strait Islander Studies holds computer-based (digital) materials about Australian Indigenous languages in the Aboriginal Studies Electronic Data Archive (ASEDA). ASEDA offers a free service of secure storage, maintenance, and distribution of electronic texts relating to these languages.
- The Sumerian Text Archive:
A growing collection of texts in the Sumerian language.
- The University of Virginia Electronic Text Center:
An on-line archive of tens of thousands of SGML and XML-encoded electronic texts and images with a library service that offers hardware and software suitable for the creation and analysis of text.
- Tofa Videos and Texts:
The Tofa stories available here were recorded by Dr. K. David Harrison in 2000 and 2001, for a project funded by a grant from Volkswagen-Stiftung.
|
Text and Corpora Meta Sites
|
- ACL SIGLEX:
An index of links to publicly available lexical resources (dictionaries and corpora).
- Bookmarks for Corpus-based Linguists:
These links (c. 1,000 of them) are meant mainly for linguists/language teachers, not computational linguists/NLP researchers, so the language-engineering-type links here are definitely not exhaustive.
- Centre for English Corpus Linguistics:
ICLEv2 contains 3.7 million words of writing from higher intermediate to advanced learners of English representing 16 different mother tongue backgrounds (Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Turkish, Tswana). It differs from the first version published in 2002 not only by its increased size and range of learner populations, but also by its interface, which contains two new functionalities: a built-in concordancer allowing users to search for word forms, lemmas and/or part-of-speech tags, and a breakdown of the query results according to the learner profile information.
The accompanying ICLEv2 Handbook contains a detailed description of the corpus, a user's manual and an overview of the ELT situation in the countries of origin of the learners.
There are three types of licence (for non-profit research purposes only): single user, multiple-user (2-10) and multiple-user (11-25).
The corpus can be ordered online at http://www.i6doc.com
- Digital Archive of the Macedonian Language:
The Digital Archive of the Macedonian Language is a growing collection of digitized, searchable texts in Modern Macedonian from the nineteenth and twentieth centuries and is completely free for anyone who would like to use, search through, and/or download the materials.
web address: http://damj.manu.edu.mk
Macedonian Academy of Sciences and Arts
Research Center for Areal Linguistics
Project Coordinator:
Prof. Marjan Markovik
e-mail: marjan@manu.edu.mk
- ELRA (European Language Resources Association):
The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community.
- HATII and DCC Release KRYS I Corpus to Aid Research:
The Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) are delighted to announce the release of the KRYS I Corpus for genre classification research.
http://www.krys-corpus.eu
The corpus, consisting of 6434 documents labelled with document genres, is expected to become a major research resource among text processing and data and information management researchers. In particular, we encourage the use of the corpus for the research of:
- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)
Despite the potential of document genre classification as a supporting step in language processing, document management, and information retrieval (e.g. the linguistic style and the vocabulary of a document vary distinctively across document genres), to date, there has been a severe lack of genre-labelled document corpora with which researchers can experiment. It is, therefore, with great pleasure that the Humanities Advanced Technology and Information Institute (HATII) at the University of Glasgow and the Digital Curation Centre (DCC) make the KRYS I Corpus available to researchers around the globe.
The Corpus originated as part of the ongoing Semantic Metadata Extraction research at the Digital Curation Centre (http://www.dcc.ac.uk) and the HATII at the University of Glasgow (http://www.hatii.arts.gla.ac.uk). The metadata extraction research evolved into a study of automated genre classification, reflecting the observation that the genre of a document (e.g. whether a document is a scientific article or a letter) is characterised by the form and structure of a document, the understanding of which would facilitate further extraction of metadata from within the document.
Further details about the development of the KRYS I corpus are available via the website (http://www.krys-corpus.eu). Specifically, researchers will find a detailed account of the document collection process, the reclassification of the documents in the corpus, and the initial findings with regard to human classification of the documents.
We encourage researchers to make full use of this corpus for their own research activity and recommend that you consider contributing towards the ongoing development of the corpus by adding your own documents to the database. Instructions as to how to contribute to the corpus are provided at http://www.krys-corpus.eu.
Comments and/or feedback on the KRYS I Corpus are invited. Contact details can be found on the website. Please feel free to distribute this announcement to any interested colleagues.
- Het Corpus Gesproken Nederlands:
The Corpus Gesproken Nederlands (Spoken Dutch Corpus), or CGN, is a collection of approximately 900 hours of spoken Dutch from Flemish and Dutch speakers. All recordings have been aligned with an orthographic transcription and each word has been given a POS tag and a lemma. Part of the data has been enriched with syntactic, prosodic and/or phonetic information.
- IFA Dialog Video corpus:
The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs licensed under the GNU General Public License (GPLv2). It is modeled on the face-to-face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, for a total of 5 hours of speech. To stay close to the very useful face-to-face dialogs in the CGN, pairs of well-acquainted participants (good friends, relatives, or long-time colleagues) were selected. The participants were allowed to talk about any topic they wanted.
- Italian Linguistics:
Information (in Italian) on Italian linguistics and corpora.
- Leiden Armenian Lexical Textbase:
Launching the Leiden Armenian Lexical Textbase
http://www.sd-editions.com/LALT/home.html
LALT combines Classical Armenian dictionaries with morphologically analyzed texts. There are some 80,000 Armenian lexemes and ten texts. The complete Nor Bargirk, main sections of Adjarian's Root Dictionary, Bedrossian's Armenian-English Dictionary and other material are integrated in LALT. There is a Greek-Armenian lexicon (20,000 entries), and aligned Armenian-Greek texts.
LALT is currently open for inspection. In a few months paid subscriptions will be accepted. Conditions for individuals and institutions will be published.
LALT will be updated at regular intervals. LALT can also easily integrate additional material and welcomes contributions from other scholars.
I have been asked about fonts: LALT is written in XML and uses Unicode. Any Unicode font will be able to display it, provided the font contains the glyphs (screen images) for Armenian and Greek. One such font is Titus Cyberbit, which is used within LALT itself. It is available for free at
http://titus.fkidg1.uni-frankfurt.de/unicode/tituut.asp
General information on Armenian and Unicode may be obtained at
http://www.armunicode.org/en/fonts/unicode
Jos Weitenberg
- Lexicographical Corpus of Portuguese:
The Lexicographical Corpus of Portuguese is a database of electronic texts in Portuguese. It contains the electronic transcription and edition of some of the most important dictionaries from the 16th to the 18th centuries. The selected texts are generally considered the most important monuments of the Portuguese dictionary tradition for their dimension, reception and documental value:
- Jerónimo Cardoso, Dictionarium iuventuti studiosae (1562, aliás 1551); Dictionarium ex lusitanico in latinum sermonem (1562); Dictionarium Latinolusitanicum (1569/70); Breve dictionarium vocum ecclesiasticarum (1569); De monetis (1569, aliás 1561)
- Pedro de Poiares, Diccionario Lusitanico-Latino de Nomes Proprios (1667)
- Bento Pereira, Prosodia Tesouro (1697)
- Rafael Bluteau, Vocabulario Portuguez e Latino (1712-1728)
- António Franco (F. Pomey), Indiculo Universal (1716)
With this project we have made available, in digital format, the complete text of those dictionaries, and it is now possible to retrieve indexes of all the words found in them.
- Linguistic Data Consortium:
Creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes.
- Linguistic and Folklore materials from the Kujamaat Jóola:
A site which will eventually grow to have an extensive collection of Kujamaat linguistic and folklore materials. It currently contains a dictionary (already listed), two folktales (text, translation and sound) and verses from extemporaneous funeral songs (text, sound, translation, commentary).
- NPS Chat Corpus:
The NPS Chat Corpus, Release 1.0 consists of 10,567 posts gathered from various online chat services in accordance with their terms of service. The posts have been:
1) Hand privacy masked;
2) Part-of-speech tagged; and
3) Dialogue-act tagged.
- On-line books FAQ:
Public domain sources of Etext available on the Internet.
- The IViE corpus:
An intonationally transcribed corpus covering seven dialects of English from the British Isles. Subjects were secondary school students. The corpus covers short read sentences, a read story, a retold story, map tasks and free conversation.
- WEBSOM:
Large document collections that are automatically organized by the novel WEBSOM method (including Usenet newsgroup sci.lang). An ordered map of the information space is provided.
- XNLRDF:
XNLRDF is a database for the creation and distribution of basic linguistic information for a great number of natural languages so that they can be used for research and development. Linguistic data are inserted through a Web-interface.
The linguistic data in the database can be compiled and downloaded in XML by the user.
|
|
Page Updated: 19-Nov-2009

While the LINGUIST List makes every effort to ensure the linguistic relevance of sites listed on its pages, it cannot vouch for their contents.
|
|