LINGUIST List 18.1436: Computational Ling/Text&Corpus Ling: Santini: 'Automatic Iden...'

LINGUIST List 18.1436

Fri May 11 2007

Diss: Computational Ling/Text&Corpus Ling: Santini: 'Automatic Iden...'

Editor for this issue: Hunter Lockwood <hunterlinguistlist.org>

Directory 1. Marina Santini, Automatic Identification of Genre in Web Pages

Message 1: Automatic Identification of Genre in Web Pages
Date: 11-May-2007
From: Marina Santini <MarinaSantini.MSgmail.com>
Subject: Automatic Identification of Genre in Web Pages

Institution: University of Brighton
Program: Computational Linguistics
Dissertation Status: Completed
Degree Date: 2007

Author: Marina Santini

Dissertation Title: Automatic Identification of Genre in Web Pages

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics
Dissertation Director:
Roger Evans
Michael Oakes
Lyn Pemberton
Richard Power
Dissertation Abstract:

The aim of this thesis is to understand how genre is instantiated on the web, and thereby to develop automatic methods for genre identification in web pages. The main challenges arise from the interaction of three factors: (1) the complexity of web pages, (2) the fluidity and the fast-paced evolution of the web, and (3) the limitation of automatically-extractable features for genre detection. First, genres on the web are instantiated in web pages, which, from a physical, linguistic and textual point of view, can be considered documents of a new type, much more unpredictable and individualised than documents on paper. Second, the web is unstable and fluid, undergoing a fast-paced evolution, so genre identification is influenced by phenomena such as the formation of novel genres, genre hybridism, individualisation, and intra-genre and inter-genre variation. Finally, automatically-extractable features represent a poor surrogate for potentially useful genre-revealing features. These three factors strongly affect the automatic identification of genre in web pages. Previous work has disregarded them for the sake of practicality, and built on the oversimplifying assumption that a web page is to be assigned to only one genre, relying as little as possible on the linguistic features returned by NLP tools. By contrast, this thesis argues for the necessity of a more flexible genre classification scheme, capable of assigning zero, one or multiple genre labels, and builds as much as possible on the output of NLP tools. A series of empirical studies is presented which investigate (i) why a zero-to-multi genre classification scheme would be more appropriate for classifying web pages, and (ii) to what extent it is possible to implement this scheme in an automatic system. A new model of zero-to-multi-genre classification is presented that combines several traditions, incorporating findings from automatic genre classification, corpus linguistics, genre analysis, textlinguistics and artificial intelligence. This model offers a more articulated view of genres in web pages. Although such a model cannot be fully evaluated, given the limitations of the current state of genre research, experimental results show that its accuracy on single-genre classification is competitive: about 86% vs. 90% for a standard machine-learning model, in ideal conditions; and about 86% vs. 76% in more realistic conditions.
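
As a rough illustration of what a zero-to-multi label assignment could look like in practice, the sketch below keeps every genre whose score reaches a cut-off, so a page may receive no label, one label, or several. It is not the model described in the thesis: the function name assign_genres, the 0.5 threshold, the genre names and the scores are all hypothetical and chosen only for the example.

```python
def assign_genres(scores, threshold=0.5):
    """Return every genre whose score reaches the threshold (possibly none).

    `scores` maps a genre name to a score in [0, 1]. How such scores would be
    produced is outside the scope of this sketch.
    """
    return sorted(genre for genre, score in scores.items() if score >= threshold)


# Hypothetical per-genre scores for three web pages (illustrative values only).
pages = {
    "page_a": {"blog": 0.91, "faq": 0.12, "eshop": 0.08},  # one clear genre
    "page_b": {"blog": 0.62, "faq": 0.10, "eshop": 0.71},  # hybrid page
    "page_c": {"blog": 0.20, "faq": 0.31, "eshop": 0.15},  # no genre fits
}

labels = {name: assign_genres(scores) for name, scores in pages.items()}

for name, assigned in labels.items():
    print(name, assigned if assigned else "(no genre assigned)")
```

A single-label classifier would instead force each page into its highest-scoring genre, which hides both the hybrid case and the case where no genre fits.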