LINGUIST List 16.1843

Sat Jun 11 2005

Confs: Text/Corpus Ling/Birmingham, UK

Editor for this issue: Amy Wronkowicz <amylinguistlist.org>

        1.    Sebastian Hoffmann, Web as Corpus Workshop/Tutorial (CL2005)

Message 1: Web as Corpus Workshop/Tutorial (CL2005)
Date: 09-Jun-2005
From: Sebastian Hoffmann <sebhoffes.unizh.ch>
Subject: Web as Corpus Workshop/Tutorial (CL2005)

Web as Corpus Workshop/Tutorial (CL2005)

Date: 14-Jul-2005 - 14-Jul-2005
Location: Birmingham, United Kingdom
Contact: Sebastian Hoffmann
Contact Email: sebhoffes.unizh.ch
Meeting URL: http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

Linguistic Field(s): Text/Corpus Linguistics

Meeting Description:

Pre-conference workshop/tutorial
Corpus Linguistics 2005
14th July 2005
Birmingham University, UK


Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff


The World Wide Web is a mine of language data of unprecedented richness and ease
of access (Kilgarriff and Grefenstette, 2003). A growing body of studies has
shown that simple algorithms using Web-based evidence are successful at many
linguistic tasks, often outperforming sophisticated methods based on smaller but
more controlled data sources (e.g., Turney 2001).

However, many fundamental issues about the viability and exploitation of the web
as a linguistic corpus must still be explored, or are just starting to be
tackled. These issues range from word frequency distributions on the web to
efficient handling of massive data sets, to the legal standing of web indexing.

Thus, we believe that research on the web as corpus is currently at a very
exciting stage: increasing evidence points to the enormous potential of the
Internet as a source of linguistic data, but we are still far from having
anything like a working, fully-fledged tool that lets linguists and language
technologists use the web as a corpus.


This full-day workshop and tutorial will provide an introduction to the issues
involved in using the web as a corpus. The emphasis will be practical and
participatory, with presentations of programs addressing particular issues, and
opportunities for all participants to describe their experiences of working with
the web as a source of linguistic data. We shall also aim to establish what the
main challenges ahead are for this young community, and how it should work
collectively to address them.

* General overview of web-as-corpus work
* Building large/general and small/special-purpose web corpora
* Web crawling for linguistic purposes
* (Near-)duplicate detection, boilerplate removal, language identification
* Linguistic annotation
* Working with non-Latin-1 languages
* Indexing and retrieval from large document collections
* Prospective interfaces

Provisional program:

9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of the
workshop, overview of program
10:00-10:45 Tom Emerson (Basis Technology) - Large crawls of the web for
linguistic purposes
10:45-11:15 coffee break
11:15-12:00 Marco Baroni (University of Bologna) and Serge Sharoff (University
of Leeds) - Creating specialized and general corpora using automated search
engine queries
12:00-13:00 Small groups arranged around the participants' research purposes

13:00-14:30 lunch break

14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing web-derived
text (or: Working with very messy data)
15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff
(Lexicography MasterClass) - Indexing and interfaces
16:00-16:30 coffee break
16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) -
Representing genre-specific websites
17:00-17:30 Small groups on "what are critical next steps for Web-as-Corpus?"
17:30-18:10 Plenary: where next?


Registration and accommodation are managed by the main conference organizers.
Please visit:

