Title: Automatic Identification of Genre in Web Pages
Author: Marina Santini
Homepage: http://www.nltg.brighton.ac.uk/home/Marina.Santini/
Degree Awarded: University of Brighton, Computational Linguistics
Degree Date: 2007
Linguistic Subfield(s): Computational Linguistics; Text/Corpus Linguistics
Director(s): Michael Oakes; Roger Evans; Richard Power; Lyn Pemberton

Abstract:

The aim of this thesis is to understand how genre is instantiated on the
web, and thereby to develop automatic methods for genre identification in
web pages. The main challenges arise from the interaction of three factors:
(1) the complexity of web pages, (2) the fluidity and the fast-paced
evolution of the web, and (3) the limitation of automatically-extractable
features for genre detection. First, genres on the web are instantiated in
web pages, which, from a physical, linguistic and textual point of view,
can be considered documents of a new type, much more unpredictable and
individualised than documents on paper. Second, the web is unstable and
fluid, undergoing a fast-paced evolution, so genre identification is
influenced by phenomena such as the formation of novel genres, genre
hybridism, individualisation, and intra-genre and inter-genre variation.
Finally, automatically-extractable features represent a poor surrogate for
potentially useful genre-revealing features. These three factors strongly
affect the automatic identification of genre in web pages. Previous work
has disregarded them for the sake of practicality, and built on the
oversimplifying assumption that a web page is to be assigned to only one
genre, relying as little as possible on the linguistic features returned by
NLP tools. By contrast, this thesis argues for the necessity of a more
flexible genre classification scheme, capable of assigning zero, one or
multiple genre labels, and builds as much as possible on the output of NLP
tools. A series of empirical studies is presented which investigate (i) why
a zero-to-multi genre classification scheme would be more appropriate for
classifying web pages, and (ii) to what extent it is possible to implement
this scheme in an automatic system. A new model of zero-to-multi-genre
classification is presented that combines several traditions, incorporating
findings from automatic genre classification, corpus linguistics, genre
analysis, textlinguistics and artificial intelligence. This model offers a
more articulated view of genres in web pages. Although such a model cannot
be fully evaluated, given the limitations of the current state of genre
research, experimental results show that its accuracy on single-genre
classification is competitive: about 86%, against 90% for a standard
machine-learning model, in ideal conditions, and about 86%, against 76%
for the standard model, in more realistic conditions.
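
The zero-to-multi-genre scheme described in the abstract can be illustrated with a minimal sketch. This is not the model developed in the thesis: it only assumes that some upstream component has already produced one score per genre for a page, and it keeps every genre whose score reaches a cut-off. The function name `assign_genres`, the genre names, the scores and the 0.5 threshold are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold.

    The result may be empty (no recognised genre), contain one label
    (single-genre page) or several labels (hybrid page).
    The scores and the threshold are illustrative assumptions,
    not values taken from the thesis.
    """
    return sorted(genre for genre, score in scores.items() if score >= threshold)


# Hypothetical per-genre scores for three web pages.
pages = {
    "page_a": {"blog": 0.9, "faq": 0.2},                    # one genre
    "page_b": {"blog": 0.7, "home_page": 0.6, "faq": 0.1},  # hybrid page
    "page_c": {"blog": 0.1, "faq": 0.3},                    # no genre
}

for name, scores in pages.items():
    print(name, assign_genres(scores))
```

A single-label classifier would instead always return the top-scoring genre, which forces a label onto pages such as `page_c` and hides the second genre of pages such as `page_b`.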
Page Updated: 26-Nov-2009
