LINGUIST List 16.1843
Sat Jun 11 2005
Confs: Text/Corpus Ling/Birmingham, UK
Editor for this issue: Amy Wronkowicz
<amy linguistlist.org>
Directory
1. Sebastian Hoffmann, Web as Corpus Workshop/Tutorial (CL2005)
Message 1: Web as Corpus Workshop/Tutorial (CL2005)
Date: 09-Jun-2005
From: Sebastian Hoffmann <sebhoff es.unizh.ch>
Subject: Web as Corpus Workshop/Tutorial (CL2005)
Web as Corpus Workshop/Tutorial (CL2005)

Date: 14-Jul-2005 - 14-Jul-2005
Location: Birmingham, United Kingdom
Contact: Sebastian Hoffmann
Contact Email: sebhoff es.unizh.ch
Meeting URL: http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

Linguistic Field(s): Text/Corpus Linguistics

Meeting Description:

WEB AS CORPUS
Pre-conference workshop/tutorial
Corpus Linguistics 2005
14th July 2005
Birmingham University, UK
http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

Co-chairs: Marco Baroni, Sebastian Hoffmann, Adam Kilgarriff

Motivation:

The World Wide Web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette, 2003). A growing body of studies has shown that simple algorithms using Web-based evidence are successful at many linguistic tasks, often outperforming sophisticated methods based on smaller but more controlled data sources (e.g., Turney 2001). However, many fundamental issues about the viability and exploitation of the web as a linguistic corpus must still be explored, or are just starting to be tackled. These issues range from word frequency distributions on the web, to the efficient handling of massive data sets, to the legal standing of web indexing. Thus, we believe that research on the web as corpus is currently at a very exciting stage: increasing evidence points to the enormous potential of the Internet as a source of linguistic data, but we are still far removed from anything like a working, fully-fledged tool that lets linguists and language technologists use the web as a corpus.

Contents:

This full-day workshop and tutorial will provide an introduction to the issues involved in using the web as a corpus. The emphasis will be practical and participatory, with presentations of programs addressing particular issues, and opportunities for all participants to describe their experiences of working with the web as a source of linguistic data. We shall also aim to establish what the main challenges lying ahead for this young community are, and how it should work collectively to address them.

* General overview of web-as-corpus work
* Building large/general and small/special-purpose web corpora
* Web crawling for linguistic purposes
* (Near-)duplicate detection, boilerplate removal, language identification
* Linguistic annotation
* Working with non-Latin-1 languages
* Indexing and retrieval from large document collections
* Prospective interfaces

Provisional program:

9:30-10:00 Adam Kilgarriff (Lexicography MasterClass) - Welcome, goals of the workshop, overview of program
10:00-10:45 Tom Emerson (Basis Technology) - Large crawls of the web for linguistic purposes
10:45-11:15 coffee break
11:15-12:00 Marco Baroni (University of Bologna) and Serge Sharoff (University of Leeds) - Creating specialized and general corpora using automated search engine queries
12:00-13:00 Small groups arranged around the participants' research purposes
13:00-14:30 lunch break
14:30-15:15 Sebastian Hoffmann (University of Zurich) - Processing web-derived text (or: Working with very messy data)
15:15-16:00 Stefan Evert (University of Osnabrück) and Adam Kilgarriff (Lexicography MasterClass) - Indexing and interfaces
16:00-16:30 coffee break
16:30-17:00 Alexander Mehler and Rüdiger Gleim (University of Bielefeld) - Representing genre-specific websites
17:00-17:30 Small groups on "What are the critical next steps for Web-as-Corpus activity?"
17:30-18:10 Plenary: where next?

Registration:

Registration and accommodation are managed by the main conference organizers. Please visit: http://www.corpus.bham.ac.uk/conference