|Full Title:||DGfS 2014 Workshop: Web Data as a Challenge for Theoretical Linguistics and Corpus Design|
|Start Date:||05-Mar-2014 - 07-Mar-2014|
|Meeting Description:||Web Data as a Challenge for Theoretical Linguistics and Corpus Design
Workshop at the 36th Annual Conference of the German Linguistic Society (March 5-7, 2014 at Marburg University, Marburg/Lahn, Germany)
Felix Bildhauer (Freie Universität Berlin/SFB632)
Roland Schäfer (Freie Universität Berlin)
Aim of the Workshop:
The huge amounts of linguistic data on the web offer exciting new possibilities for empirically based theoretical linguistics. Compared to traditionally compiled corpora, web-derived linguistic resources can contain more variation as well as more non-standard grammar and writing. In addition, entirely new registers and genres have been reported to emerge on the web. Like spoken language - although clearly distinct from it - the language found on the web can thus challenge linguistic theories that are based mainly on standard written language, as well as the categories assumed within these theories. At the same time, such non-standard features make the data harder to process for computational linguists, and additional care is required when deciding whether to label material as ‘noise’, because some linguists might consider it valuable data.
This workshop aims to bring together researchers working in Theoretical Linguistics and Corpus Linguistics with those who create resources from web data. The primary question of the workshop is: Which new linguistic insights can we derive from web data? Secondarily, we ask how web data is (and how it should be) processed to produce easily accessible high-quality resources and thus facilitate this kind of innovative linguistic research.
Possible subjects for talks include (but are by no means restricted to):
- Theoretically motivated empirical studies of linguistic phenomena in web data
- Work on problems with established linguistic categories specific to certain types of web data (problems with traditional part-of-speech classification, syntactic categories, register and genre classification, etc.)
- Problems of working with web corpora from the user’s perspective in concrete studies (low-quality tokenization, POS tagging, named entity recognition, etc.; availability or lack of metadata)
- Assessments and improvements of the quality of available and newly designed tools and models to process or classify web data
- Approaches to normalization of web data and evaluations of the acceptability of such normalizations from a linguistic perspective
- Sampling of web data (e.g., stratified vs. randomly compiled corpora, linguistic web characterization)
|Linguistic Subfield:||Computational Linguistics; General Linguistics; Text/Corpus Linguistics|