LINGUIST List 18.1436
Fri May 11 2007
Diss: Computational Ling/Text&Corpus Ling: Santini: 'Automatic Iden...'
Editor for this issue: Hunter Lockwood
<hunter linguistlist.org>
Directory
1. Marina Santini, Automatic Identification of Genre in Web Pages
Message 1: Automatic Identification of Genre in Web Pages
Date: 11-May-2007
From: Marina Santini <MarinaSantini.MS gmail.com>
Subject: Automatic Identification of Genre in Web Pages
Institution: University of Brighton
Program: Computational Linguistics
Dissertation Status: Completed
Degree Date: 2007
Author: Marina Santini
Dissertation Title: Automatic Identification of Genre in Web Pages
Linguistic Field(s):
Computational Linguistics
Text/Corpus Linguistics
Dissertation Director:
Roger Evans
Michael Oakes
Lyn Pemberton
Richard Power
Dissertation Abstract:
The aim of this thesis is to understand how genre is instantiated on the web, and thereby to develop automatic methods for genre identification in web pages. The main challenges arise from the interaction of three factors: (1) the complexity of web pages, (2) the fluidity and the fast-paced evolution of the web, and (3) the limitation of automatically-extractable features for genre detection.
First, genres on the web are instantiated in web pages, which, from a physical, linguistic and textual point of view, can be considered documents of a new type, much more unpredictable and individualised than documents on paper. Second, the web is unstable and fluid, undergoing a fast-paced evolution, so genre identification is influenced by phenomena such as the formation of novel genres, genre hybridism, individualisation, and intra-genre and inter-genre variation. Finally, automatically-extractable features represent a poor surrogate for potentially useful genre-revealing features.
These three factors strongly affect the automatic identification of genre in web pages. Previous work has disregarded them for the sake of practicality, and built on the oversimplifying assumption that a web page is to be assigned to only one genre, relying as little as possible on the linguistic features returned by NLP tools. By contrast, this thesis argues for the necessity of a more flexible genre classification scheme, capable of assigning zero, one or multiple genre labels, and builds as much as possible on the output of NLP tools.
A series of empirical studies is presented which investigate (i) why a zero-to-multi genre classification scheme would be more appropriate for classifying web pages, and (ii) to what extent it is possible to implement this scheme in an automatic system.
A new model of zero-to-multi-genre classification is presented that combines several traditions, incorporating findings from automatic genre classification, corpus linguistics, genre analysis, textlinguistics and artificial intelligence. This model offers a more articulated view of genres in web pages. Although such a model cannot be fully evaluated, given the limitations of the current state of genre research, experimental results show that its accuracy on single-genre classification is competitive: about 86% vs. 90% for a standard machine-learning model, in ideal conditions; and about 86% vs. 76% in more realistic conditions.
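To make the idea of zero-to-multi-genre labelling concrete, here is a minimal sketch of one possible way to assign zero, one or several genre labels to a page. It is not the model described in the thesis: the function name `assign_genres`, the genre labels, the scores and the 0.5 threshold are all assumptions made for illustration. The sketch simply keeps every genre whose score reaches a threshold, so a page can end up with no label at all.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold, sorted by name.

    The result may be empty (no genre), a single label, or several labels.
    The threshold value is an illustrative assumption, not taken from the thesis.
    """
    return sorted(genre for genre, score in scores.items() if score >= threshold)


# Hypothetical per-genre scores for three web pages (illustrative values only).
pages = {
    "page_a": {"blog": 0.82, "faq": 0.10, "eshop": 0.05},
    "page_b": {"blog": 0.64, "faq": 0.03, "eshop": 0.71},
    "page_c": {"blog": 0.20, "faq": 0.15, "eshop": 0.30},
}

for name, scores in pages.items():
    print(name, assign_genres(scores))
```

In this sketch, a single-label classifier would be forced to pick exactly one genre for each page, whereas the threshold rule lets a hybrid page receive two labels and a page that fits no known genre receive none.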