LINGUIST List 23.1397
Tue Mar 20 2012
Confs: Text/Corpus Ling/Turkey
Editor for this issue: Amy Brunett
From: Piotr Banski <banskiids-mannheim.de>
Subject: Challenges in the Management of Large Corpora
Challenges in the Management of Large Corpora
Short Title: CMLC
Date: 22-May-2012 - 22-May-2012
Location: Istanbul, Turkey
Contact: Piotr Banski
Meeting URL: http://corpora.ids-mannheim.de/cmlc.html
Linguistic Field(s): Text/Corpus Linguistics
We live in an age where the well-known maxim that ‘the only thing better than data is more data’ no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given the proven methods behind, e.g., the Web-as-Corpus approach or the use of Google’s n-gram collection. Instead, the challenge has shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such /extremely large corpora/, e.g. in order to investigate rare phenomena involving multiple lexical items or to find and represent fine-grained sub-regularities; on the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of the data, management of collections that span multiple volumes or are distributed across several centres, methods to clean the data of non-linguistic intrusions or duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of the data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges for the statistical analysis of such data and metadata.
The half-day LREC-2012 workshop on ‘Challenges in the Management of Large Corpora’ aims to gather leading researchers in the fields of Language Resource creation and Corpus Linguistics for an intensive exchange of expertise, results and ideas.
The workshop will take place at the conference venue, the Lütfi Kirdar Istanbul Exhibition and Congress Centre. Further details will be available in due course from the conference homepage.
Nancy Ide (Vassar College), title TBA
- The AAC Container. Managing Text Resources for Text Studies,
Hanno Biber and Evelyn Breiteneder
- Creating and Managing a large annotated parallel corpora of Indian languages,
Ritesh Kumar, Pinkey Nainwani, Girish Nath Jha and Shiv Bhusan Kaushik
- Introducing the CLARIN-NL Data Curation Service,
Nelleke Oostdijk and Henk van den Heuvel
- Efficient N-gram Language Modeling for Billion Word Web-Corpora,
Lars Bungum and Björn Gambäck
- Evaluating DBMS-based access strategies to very large multi-layer corpora,
Hans Martin Lehmann and Gerold Schneider
- Large Mailing List Corpora: Management, Annotation and Repository,
Damir Ćavar, Helen Aristar-Dry and Anthony Aristar
Deadline for early-bird registration: March 21.
( http://www.lrec-conf.org/lrec2012/?-Registration- )
Workshop: May 22, 2 pm - 6.30 pm.
Page Updated: 20-Mar-2012