LINGUIST List 25.1203
Tue Mar 11 2014
Confs: Computational Linguistics, Text/Corpus Linguistics/Sweden
Editor for this issue:
Xiyan Wang <xiyanlinguistlist.org>
Date: 11-Mar-2014
From: Felix Bildhauer
<felix.bildhauer
fu-berlin.de>
Subject: EACL 2014 Workshop on
Web as Corpus
EACL 2014 Workshop on Web as Corpus
Short Title: WAC-9
Date: 26-Apr-2014 - 26-Apr-2014
Location: Gothenburg, Sweden
Contact: Felix Bildhauer
Meeting URL:
http://www.sigwac.org.uk/wiki/WAC9
Linguistic Field(s): Computational Linguistics;
Text/Corpus Linguistics
Meeting Description:
The 9th Web as Corpus Workshop (WAC-9)
Endorsed by the Special Interest Group of the
ACL on Web as Corpus
(http://www.sigwac.org.uk/)
The World Wide Web has become increasingly
popular as a source of linguistic data, not
only within the NLP communities, but also with
theoretical linguists facing problems of data
sparseness or data diversity. Accordingly, web
corpora continue to gain importance, given
their size and diversity in terms of
genres/text types. However, the field is still
new, and a number of issues in web corpus
construction still need much research
(fundamental and applied),
ranging from questions of corpus design (e.g.,
corpus composition assessment, sampling
strategies and their relation to crawling
algorithms, handling of duplicated material) to
more technical aspects (e.g., efficient
implementation of individual post-processing
steps in document cleansing and linguistic
annotation, or large-scale parallelization to
achieve web-scale corpus construction).
Similarly, the systematic evaluation of web
corpora, for example in the form of task-based
comparisons to traditional corpora, has only
lately shifted into focus.
For almost a decade, the ACL SIGWAC, and
especially the Web as Corpus (WAC) workshops,
have served as a platform for researchers
interested in building and working with
web-derived corpora. Past workshops have been
co-located with major conferences on
computational linguistics and/or corpus
linguistics (such as EACL, LREC, WWW, Corpus
Linguistics). As part of the workshop, we will
have a panel discussion dedicated to the
planning of a shared task for WAC10 (2015),
including the nomination of organizers of the
shared task. The tracks of the shared task will
focus on the quality of web corpus creation
tools, tools for linguistic annotation (at
least lemmatization, possibly also POS tagging,
etc.), and the quality of web corpora
themselves.
Organizing Committee:
Felix Bildhauer, Freie Universität Berlin
Roland Schäfer, Freie Universität Berlin
Program Committee:
Organizing Committee, plus:
Adrien Barbaresi, École Normale Supérieure de
Lyon
Silvia Bernardini, Università di Bologna
Chris Biemann, Technische Universität
Darmstadt
Jesse Egbert, Northern Arizona University
Stefan Evert, Friedrich-Alexander Universität
Erlangen-Nürnberg
Adriano Ferraresi, Università di Bologna
William Fletcher, United States Naval
Academy
Dirk Goldhahn, Universität Leipzig
Adam Kilgarriff, Lexical Computing Ltd.
Anke Lüdeling, Humboldt-Universität zu
Berlin
Alexander Mehler, Goethe-Universität Frankfurt
am Main
Uwe Quasthoff, Universität Leipzig
Paul Rayson, Lancaster University
Serge Sharoff, University of Leeds
Sabine Schulte im Walde, Universität
Stuttgart
Egon Stemle, European Academy of Bolzano
Yannick Versley, Universität Heidelberg
Torsten Zesch, Universität Darmstadt
Stephen Wattam, Lancaster University
Workshop Program:
11:15–11:30
Welcome (Felix Bildhauer & Roland
Schäfer)
11:30–12:00
Finding Viable Seed URLs for Web Corpora: A
Scouting Approach and Comparative Study of
Available Sources (Adrien Barbaresi)
12:00–12:30
Focused Web Corpus Crawling (Roland Schäfer,
Adrien Barbaresi & Felix Bildhauer)
Lunch Break
14:00–14:30
Less Destructive Cleaning of Web Documents by
Using Standoff Annotation (Maik
Stührenberg)
14:30–15:00
Some Issues on the Normalization of a Corpus of
Products Reviews in Portuguese (Magali Sanches
Duran, Lucas Avanço, Sandra Aluísio, Thiago
Pardo & Maria da Graça Volpe Nunes)
15:00–15:30
{bs,hr,sr}WaC - Web Corpora of Bosnian,
Croatian and Serbian (Nikola Ljubešić &
Filip Klubička)
Coffee Break
16:00–16:30
The PAISÀ Corpus of Italian Web Texts (Verena
Lyding, Egon Stemle, Claudia Borghetti, Marco
Brunello, Sara Castagnoli, Felice Dell’Orletta,
Henrik Dittmann, Alessandro Lenci & Vito
Pirrelli)
16:30–17:00
Internet Data in a Study of Language Change and
a Program Helping to Work with Them (Varvara
Magomedova, Natalia Slioussar & Maria
Kholodilova)
17:00–18:00
Discussion
Page Updated: 11-Mar-2014