LINGUIST List 18.1436

Fri May 11 2007

Diss: Computational Ling/Text&Corpus Ling: Santini: 'Automatic Iden...'

Editor for this issue: Hunter Lockwood <hunterlinguistlist.org>


Directory
        1.    Marina Santini, Automatic Identification of Genre in Web Pages


Message 1: Automatic Identification of Genre in Web Pages
Date: 11-May-2007
From: Marina Santini <MarinaSantini.MSgmail.com>
Subject: Automatic Identification of Genre in Web Pages


Institution: University of Brighton
Program: Computational Linguistics
Dissertation Status: Completed
Degree Date: 2007

Author: Marina Santini

Dissertation Title: Automatic Identification of Genre in Web Pages

Linguistic Field(s): Computational Linguistics
                            Text/Corpus Linguistics

Dissertation Director:
Roger Evans
Michael Oakes
Lyn Pemberton
Richard Power

Dissertation Abstract:

The aim of this thesis is to understand how genre is instantiated on the
web, and thereby to develop automatic methods for genre identification in
web pages. The main challenges arise from the interaction of three factors:
(1) the complexity of web pages, (2) the fluidity and the fast-paced
evolution of the web, and (3) the limitations of automatically extractable
features for genre detection. First, genres on the web are instantiated in
web pages, which, from a physical, linguistic and textual point of view,
can be considered documents of a new type, much more unpredictable and
individualised than documents on paper. Second, the web is unstable and
fluid, undergoing a fast-paced evolution, so genre identification is
influenced by phenomena such as the formation of novel genres, genre
hybridism, individualisation, and intra-genre and inter-genre variation.
Finally, automatically extractable features are a poor surrogate for
potentially useful genre-revealing features. These three factors strongly
affect the automatic identification of genre in web pages. Previous work
has disregarded them for the sake of practicality, and built on the
oversimplifying assumption that a web page is to be assigned to only one
genre, relying as little as possible on the linguistic features returned by
NLP tools. By contrast, this thesis argues for the necessity of a more
flexible genre classification scheme, capable of assigning zero, one or
multiple genre labels, and builds as much as possible on the output of NLP
tools. A series of empirical studies is presented that investigates (i) why
a zero-to-multi genre classification scheme would be more appropriate for
classifying web pages, and (ii) to what extent it is possible to implement
this scheme in an automatic system. A new model of zero-to-multi-genre
classification is presented that combines several traditions, incorporating
findings from automatic genre classification, corpus linguistics, genre
analysis, text linguistics and artificial intelligence. This model offers a
more articulated view of genres in web pages. Although such a model cannot
be fully evaluated, given the limitations of the current state of genre
research, experimental results show that its accuracy on single-genre
classification is competitive: about 86%, against 90% for a standard
machine-learning model, under ideal conditions; and about 86%, against 76%
for the standard model, under more realistic conditions.
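
As an illustration only, and not the model developed in the thesis, the idea
of a zero-to-multi-genre scheme can be sketched as follows: each candidate
genre receives its own score for a page, and every genre whose score reaches
a cut-off is assigned, so a page may end up with no label, one label, or
several. The function name, the genre names, the scores and the 0.5 cut-off
are all hypothetical, chosen for the example.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold, best first.

    An empty list means that no genre was assigned (the "zero" case).
    The scores are assumed to come from some per-genre scorer; how they
    are computed is outside the scope of this sketch.
    """
    selected = [genre for genre, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda genre: scores[genre], reverse=True)


# Hypothetical per-genre scores for three web pages.
hybrid_page = {"blog": 0.82, "personal_home_page": 0.64, "faq": 0.10}
plain_page = {"blog": 0.12, "personal_home_page": 0.08, "faq": 0.91}
unclear_page = {"blog": 0.30, "personal_home_page": 0.22, "faq": 0.15}

print(assign_genres(hybrid_page))   # two labels
print(assign_genres(plain_page))    # one label
print(assign_genres(unclear_page))  # no label
```

A single-label classifier would instead force each page into its
highest-scoring genre, which is the simplifying assumption the abstract
argues against.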




