Editor for this issue: Andrea Berez <andrea@linguistlist.org>
ACL04 Workshop: Tackling the Challenges of Terascale Human Language Problems
PLEASE NOTE THE CORRECTED DEADLINE!
Short Title: Terascale NLP 2004
Date: 26-Jul-2004 - 26-Jul-2004
Location: Barcelona, Spain
Contact: Rob Malouf
Contact Email: rmalouf@mail.sdsu.edu
Meeting URL: http://www-rohan.sdsu.edu/~malouf/terascale04.html
Linguistic Sub-field: Computational Linguistics
Call Deadline: 18-Apr-2004
Meeting Description: Machine learning methods form the core of most modern speech and language processing technologies. Techniques such as kernel methods, log-linear models, and graphical models are routinely used to classify examples (e.g., to identify the topic of a story), rank candidates (to order a set of parses for some sentence) or assign labels to sequences (to identify named entities in a sentence). While considerable success has been achieved using these algorithms, what has become increasingly clear is that the size and complexity of the problems---in terms of number of training examples, the size of the feature space, and the size of the prediction space---are growing at a much faster rate than our computational resources are, Moore's Law notwithstanding. This raises real questions as to whether our current crop of algorithms will scale gracefully when processing such problems. This workshop will bring researchers together who are interested in meeting the challenges associated with scaling systems for natural language processing.
Even relatively small-scale problems are already costly: training Support Vector Machines to classify phones in the TIMIT speech dataset, for example, will take an estimated six years of CPU time (Salomon et al. 2002). If we wished to move to a larger domain and harness, say, all the speech data emerging from a typical call center, then enormous computational resources would clearly need to be devoted to the task.
Allocating such vast amounts of computational resources is beyond the scope of most current research collaborations, which consist of small groups of people working on isolated tasks using small networks of commodity machines. Dealing with large-scale speech and language problems requires a move away from isolated groups of researchers towards co-ordinated 'virtual organizations'. The terascale problems that are now emerging demand an understanding of how to manage people and resources that may be distributed over many sites.
Evidence of the timeliness of this workshop can be seen at this year's Text Retrieval Conference (TREC), which concluded with the announcement of a new track for next year devoted specifically to scaling information retrieval systems. This clearly demonstrates the community's need for scaling human language technologies.
To tackle the large-scale speech and language problems that arise in realistic tasks, we must address scalable machine learning algorithms that can better exploit the structure of such problems, their computational resource requirements, and the implications for how we carry out research as a community.
Topics include (but are not limited to):
+ exactly scaling existing techniques
+ applying interesting approximations which drastically reduce the amount of required computation yet do not sacrifice much in the way of accuracy
+ using on-line learning algorithms to learn from streaming data sources
+ efficiently retraining models as more data becomes available
+ experience with using very large datasets, applying, for example, Grid computing strategies and technologies
+ techniques for efficiently manipulating enormous volumes of data
+ human factors associated with managing large virtual organizations
+ adapting methods developed for dealing with large-scale problems in other computational sciences, such as physics and biology, to natural language processing