LINGUIST List 27.3671: Diss: Ad hoc and General-Purpose Corpus Construction from Web Sources

Fri Sep 16 2016

Editor for this issue: Kenneth Steimel <ken@linguistlist.org>

Date: 09-Sep-2016
From: Adrien Barbaresi <adrien.barbaresi@oeaw.ac.at>
Subject: Ad hoc and General-Purpose Corpus Construction from Web Sources

Institution: Ecole Normale Supérieure
Program: PhD program, school of linguistics
Dissertation Status: Completed
Degree Date: 2015

Author: Adrien Barbaresi

Dissertation Title: Ad hoc and General-Purpose Corpus Construction from Web Sources

Dissertation URL: https://hal.archives-ouvertes.fr/tel-01167309/

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics

Dissertation Director: Benoît Habert

Dissertation Abstract:

This thesis introduces theoretical and practical reflections on corpus linguistics, computational linguistics, and web corpus construction. More specifically, two different types of corpora from the web, specialized (ad hoc) and general-purpose, are presented and analyzed, including suitable conditions for their creation.

From a historical perspective, several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. Three main phases are distinguished in this evolution: first, the age of copy typing and the establishment of a scientific methodology and tradition regarding corpora; second, the age of digitized text and the further development of corpus linguistics; and third, the arrival of web data and “opportunistic” approaches among researchers. The continuities and changes with respect to the linguistic tradition are laid out.

In the second chapter, methodological insights on automated text scrutiny are presented. Readability studies and automated text classification serve as model examples of methods for finding salient features in order to grasp text characteristics. In conclusion, guiding principles for research practice are listed, and reasons are given for striking a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques.

Third, current research on web corpora is summarized. The chapter opens with notions of “web science”. Then, I examine the issue of data collection, more specifically from the perspective of URL seeds, both for general and for specialized corpora. I distinguish two main approaches to web document retrieval: restricted retrieval, where the documents to be retrieved are listed or even known in advance, and web crawling. I show that the latter should not be deemed too complex for linguists, by summarizing different strategies for finding new documents and discussing their advantages and limitations. Finally, ways to target small fractions of the Web and the related issues are described. In a further section, the notion of web corpus preprocessing is introduced and its salient steps are discussed.

I present my work on web corpus construction in the fourth chapter, with two types of end products: specialized and even niche corpora on the one hand, and general-purpose corpora on the other. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and second, the problem of including valid, desirable documents in a corpus (or document qualification). First, I show that it is possible and even desirable to use sources other than just search engines, which are the current state of the art, and I introduce a light scout approach along with experiments showing that a preliminary analysis and selection of crawl sources is both possible and profitable. Second, I work on document selection, in order to enhance web corpus quality in general-purpose approaches and to perform a suitable quality assessment in the case of specialized corpora. I show that it is possible to use salient features inspired by readability studies along with machine learning approaches in order to improve corpus construction processes. To this end, I select a number of features extracted from the texts and test them on an annotated sample of web texts. Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality.

Page Updated: 16-Sep-2016