Conference Information

Full Title: Workshop on Corpus-based Quantitative Typology

Short Title: CoQuaT 2013
Location: Leipzig, Germany
Date: 14-Aug-2013
Contact: Thomas Mayer
Meeting Description:

Convenors:

Michael Cysouw (Philipps University of Marburg)
Dirk Goldhahn (University of Leipzig)
Thomas Mayer (Philipps University of Marburg)
Uwe Quasthoff (University of Leipzig)

Invited Speakers:

Östen Dahl (Stockholm University)
Kevin Scannell (Saint Louis University)

Workshop Description:

The number of available (textual) corpora of the world's languages is currently rising at an incredible rate. The aim of this workshop is to bring together researchers working on corpus-based quantitative language comparison and to encourage typological studies that rely on corpus data.

A growing body of research uses corpora to investigate the structure of individual languages. There is also a large body of research on worldwide linguistic diversity, though it is mostly based on information manually extracted from published sources. In contrast, the combination of the two is still rare: there are only a few quantitative typological investigations with a worldwide scope that use corpora to infer cross-linguistic generalizations and insights. Some previous work compiled quantitative data through manual corpus annotation (e.g. Greenberg 1960; Wälchli 2005) or automatically with the help of computer programs (e.g. Mayer and Cysouw 2012). In addition, there is some relevant work using corpora to compare a smaller number of (genealogically related) languages (e.g. Bickel 2003; van der Auwera 2005).

Cross-linguistic corpora, in particular (massively) parallel corpora (cf. Cysouw and Wälchli 2007) or comparable corpora compiled through web crawling (e.g. Scannell 2007; Goldhahn et al. 2012), provide an enormous amount of information about the world's languages. Although such data are often not ideal from a linguistic point of view (involving problems of translationese, or being restricted to special textual genres), it would be a waste not to at least try to use them for comparative linguistic purposes.

One of the reasons for the shortage of quantitative cross-linguistic work is the lack of adequate resources for a representative sample of languages. Consequently, on top of the laborious manual analysis, typologically interested researchers are faced with the time-consuming task of building their own corpora from scratch. One goal of this workshop is therefore to collect (online) resources (especially for lesser-studied languages) and to exchange experiences with crawling texts from the web. Furthermore, we intend to discuss the formats in which cross-linguistic corpora should be made publicly available so that typologists can best benefit from them without violating copyright laws.
Linguistic Subfield: Computational Linguistics; Text/Corpus Linguistics; Typology
LL Issue: 24.1381

