Conference Information

Full Title: Challenges in the Management of Large Corpora + Big Data and Natural Language Processing

Short Title: CMLC 5 + BigNLP 2017
Location: Birmingham, United Kingdom
Start Date: 24-Jul-2017 - 24-Jul-2017
Contact: Piotr Banski
Meeting Description: The CMLC+BigNLP workshop is a joint initiative of two teams who have decided to join forces to organize an event co-located with Corpus Linguistics 2017 in Birmingham. The upcoming meeting continues the successful series of “Challenges in the management of large corpora” events (previously hosted at LREC conferences and CL2015) and is at the same time the second event in the BigNLP series, inaugurated last year at the IEEE Big Data 2016 conference. This year, we wish to explore together common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science.

An increasing amount of text is available in digital format: more historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. The resulting large textual datasets are used across a number of disciplines to answer a wide range of research questions. For these datasets to be maximally useful, careful consideration must be given to their design, collection, cleaning, encoding, annotation, storage, retrieval and curation.

A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online, and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?

An open-access (CC BY-NC-ND) electronic volume of proceedings is planned.

This year’s event focuses on the union of the standard topics of CMLC and BigNLP:

Technical issues:

- Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks (Hadoop, Spark, etc.) for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora

Licensing, legal and privacy issues:

- Licensing models of open and closed data
- Coping with intellectual property restrictions

Linguistic content issues:

- Dealing with the variety of language: multilinguality, historical texts, user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations

Exploitation issues:

- Query languages
- Innovative approaches for aggregation and visualisation of text analytics

Linguistic Subfield: Computational Linguistics; Text/Corpus Linguistics
LL Issue: 28.876

This is a session of the following meeting:
Corpus Linguistics 2017
