LINGUIST List 19.3604

Mon Nov 24 2008

FYI: KRYS I Corpus for Genre Classification Research

Editor for this issue: Matthew Lahrman <matt@linguistlist.org>


To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.html.
Directory
        1.    Yunhyong Kim, KRYS I Corpus for Genre Classification Research


Message 1: KRYS I Corpus for Genre Classification Research
Date: 20-Nov-2008
From: Yunhyong Kim <y.kim@hatii.arts.gla.ac.uk>
Subject: KRYS I Corpus for Genre Classification Research

The Humanities Advanced Technology and Information Institute (HATII) at
the University of Glasgow and the Digital Curation Centre (DCC) are
delighted to announce the release of the KRYS I Corpus for genre
classification research.

http://www.krys-corpus.eu

The corpus, consisting of 6434 documents labelled with document genres,
is expected to become a major research resource among text processing
and data and information management researchers. In particular, we
encourage the use of the corpus for research in:

- Automated Text Classification (TC)
- Digital curation and metadata extraction
- Natural Language Processing (NLP)
- Computational Linguistics (CL)

Despite the potential of document genre classification as a supporting
step in language processing, document management, and information
retrieval (e.g. the linguistic style and the vocabulary of a document
vary distinctively across document genres), to date, there has been a
severe lack of genre-labelled document corpora with which researchers
can experiment. It is, therefore, with great pleasure that the
Humanities Advanced Technology and Information Institute (HATII) at the
University of Glasgow and the Digital Curation Centre (DCC) make the
KRYS I Corpus available to researchers around the globe.

The Corpus originated as part of the ongoing Semantic Metadata
Extraction research at the Digital Curation Centre
(http://www.dcc.ac.uk) and HATII at the University of Glasgow
(http://www.hatii.arts.gla.ac.uk). The metadata extraction research
evolved into a study of automated genre classification, reflecting the
observation that the genre of a document (e.g. whether a document is a
scientific article or a letter) is characterised by the form and
structure of a document, the understanding of which would facilitate
further extraction of metadata from within the document.

Further details about the development of the KRYS I corpus are available
via the website (http://www.krys-corpus.eu). Specifically, researchers
will find a detailed account of the document collection process, the
reclassification of the documents in the corpus, and the initial
findings with regard to human classification of the documents.

We encourage researchers to make full use of this corpus for their own
research activity and recommend that you consider contributing towards
the ongoing development of the corpus by adding your own documents to
the database. Instructions as to how to contribute to the corpus are
provided at http://www.krys-corpus.eu.

Comments and/or feedback on the KRYS I Corpus are invited. Contact
details can be found on the website. Please feel free to distribute this
announcement to any interested colleagues.

--
Yunhyong Kim
DCC Curation Resources Researcher
Humanities Advanced Technology and Information Institute (HATII)
University of Glasgow (charity number SC004401)
Glasgow
United Kingdom

Linguistic Field(s): Computational Linguistics
