The aim of this thesis is to understand how genre is instantiated on the
web, and thereby to develop automatic methods for genre identification in
web pages. The main challenges arise from the interaction of three factors:
(1) the complexity of web pages, (2) the fluidity and the fast-paced
evolution of the web, and (3) the limitations of automatically extractable
features for genre detection. First, genres on the web are instantiated in
web pages, which, from a physical, linguistic and textual point of view,
can be considered documents of a new type, much more unpredictable and
individualised than documents on paper. Second, the web is unstable and
fluid, undergoing a fast-paced evolution, so genre identification is
influenced by phenomena such as the formation of novel genres, genre
hybridism, individualisation, and intra-genre and inter-genre variation.
Finally, automatically extractable features are a poor surrogate for
potentially useful genre-revealing features. These three factors strongly
affect the automatic identification of genre in web pages. Previous work
has disregarded them for the sake of practicality: it builds on the
oversimplifying assumption that a web page belongs to exactly one genre,
and relies as little as possible on the linguistic features returned by
NLP tools. By contrast, this thesis argues for the necessity of a more
flexible genre classification scheme, capable of assigning zero, one or
multiple genre labels, and builds as much as possible on the output of NLP
tools. A series of empirical studies is presented, investigating (i) why
a zero-to-multi-genre classification scheme would be more appropriate for
classifying web pages, and (ii) to what extent it is possible to implement
this scheme in an automatic system. A new model of zero-to-multi-genre
classification is presented that combines several traditions, incorporating
findings from automatic genre classification, corpus linguistics, genre
analysis, textlinguistics and artificial intelligence. This model offers a
more articulated view of genres in web pages. Although such a model cannot
be fully evaluated, given the current state of genre research,
experimental results show that its accuracy on single-genre classification
is competitive: about 86% versus 90% for a standard machine-learning model
in ideal conditions, and about 86% versus 76% in more realistic
conditions.
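
As an illustration only, the idea of assigning zero, one or multiple genre
labels can be sketched as a per-genre threshold over independent scores.
This is not the model developed in the thesis: the function name
assign_genres, the threshold value, and the genre names and scores below
are all hypothetical.

```python
from typing import Dict, List


def assign_genres(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every genre whose score reaches the threshold (possibly none).

    `scores` maps a genre name to a hypothetical confidence in [0, 1];
    how such scores would be produced is outside the scope of this sketch.
    """
    selected = [genre for genre, score in scores.items() if score >= threshold]
    # Strongest label first; ties are broken alphabetically for determinism.
    return sorted(selected, key=lambda genre: (-scores[genre], genre))


# Hypothetical pages: one label, no label, and two labels (a hybrid page).
print(assign_genres({"blog": 0.9, "faq": 0.2}))
print(assign_genres({"blog": 0.1, "faq": 0.3}))
print(assign_genres({"blog": 0.7, "eshop": 0.8, "faq": 0.3}))
```

Scoring each genre independently, rather than picking the single best
label, is what allows a page to receive no label at all or several at once.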