LINGUIST List 22.5043

Wed Dec 14 2011

FYI: Crowdsourcing the Development of Underserved Langs

Editor for this issue: Brent Miller <brentlinguistlist.org>


To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.cfm.
Directory
        1.     Mark Mandel, Crowdsourcing the Development of Underserved Langs


Message 1: Crowdsourcing the Development of Underserved Langs
Date: 13-Dec-2011
From: Mark Mandel <mamandelldc.upenn.edu>
Subject: Crowdsourcing the Development of Underserved Langs

(I am not connected with this project; please do not contact me about it.
-- M. Mandel)

Crowdsourcing the Development of Underserved Language Resources
(http://www.rhok.org/problems/crowdsourcing-development-underserved-language-resources)

Affordable, accessible and sustainable data, tools and technologies in
local languages are necessary if populations in the developing world
are to gain access to the knowledge society and economy, both to
consume and to generate relevant content. This includes access to
appropriate networks and Information
and Communication Technologies (ICTs) supported by adequate
Human Language Technologies (HLT). There is an urgent need to
realize the fundamental rights of the citizens of the world to have
access to information in their language, information that will allow them
to improve their economic situation, their education, their legal rights,
and their health. A major challenge that still faces the development of a
truly inclusive and diverse global information society is the extreme
scarcity of language resources that researchers and practitioners can
use to build HLT for countries in the developing world. Unless
resolved, this issue will prevent the
vast majority of the next billions of the world's citizens, who rely
exclusively on their native languages to consume and produce
information, from participating in the global information society.

This project aims to tackle this challenge by leveraging open content,
mobile technologies and crowdsourcing to create language resources
for the world's underserved languages and to make them available under
open licenses to stimulate research and development in the area of
Human Language Technologies (HLT). The project will use existing
open text repositories (such as Wikipedia) in languages such as Swahili,
Arabic and Urdu, and will create a crowdsourcing mechanism for
developing these text repositories into language corpora. This could
include, for example, tagging each word in the corpus with its part of
speech (a process known as part-of-speech tagging). For this
purpose, a platform can be built to extract sentences from the corpus
and send them to a group of contributors through text messages. Each
contributor can examine a sentence, determine the tag for each word
in it (verb, noun, adjective, etc.) and send the tags back to
the platform. Redundant responses from several contributors will be
used to ensure the correctness of the answers and to flag any potential
errors. Participation in the platform can be encouraged through several
means. For example, contributors may be rewarded for their
participation with mobile credit they can use on their phones, or a
badge system could be applied to acknowledge active contributors.
The participation process could also be structured in a game-like
style.
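
The redundancy check described above could be sketched roughly as
follows. This is only an illustration, not the project's actual
design: the function name aggregate_tags, the min_agreement threshold,
the tag labels and the sample Swahili sentence are all assumptions
made for the example. It takes a per-word majority vote across
contributors and flags words where agreement is too low.

```python
from collections import Counter


def aggregate_tags(tokens, responses, min_agreement=0.6):
    """Combine redundant tag sequences for one sentence by majority vote.

    tokens: the words of the sentence that was sent out.
    responses: one list of tags per contributor.
    min_agreement: hypothetical threshold; a word is flagged for review
        when the winning tag's share of the votes falls below it.
    Returns (consensus_tags, flagged_word_positions).
    """
    # Ignore replies whose length does not match the sentence.
    valid = [r for r in responses if len(r) == len(tokens)]
    if not valid:
        raise ValueError("no usable responses for this sentence")

    consensus, flagged = [], []
    for i in range(len(tokens)):
        votes = Counter(r[i] for r in valid)
        tag, count = votes.most_common(1)[0]
        consensus.append(tag)
        if count / len(valid) < min_agreement:
            flagged.append(i)
    return consensus, flagged


# Hypothetical sentence and replies, for illustration only.
tokens = ["mtoto", "anasoma", "kitabu"]
responses = [
    ["NOUN", "VERB", "NOUN"],
    ["NOUN", "VERB", "NOUN"],
    ["NOUN", "VERB", "ADJ"],
    ["NOUN", "NOUN", "ADJ"],
    ["NOUN", "VERB"],  # wrong length, ignored
]
consensus, flagged = aggregate_tags(tokens, responses)
```

In this sample the contributors split evenly on the third word, so its
position is flagged for review while the first two words are accepted.
A real platform would likely also need to weight contributors by
reliability, which this sketch does not attempt.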

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics





Page Updated: 14-Dec-2011

Supported in part by the National Science Foundation
While the LINGUIST List makes every effort to ensure the linguistic relevance of sites listed on its pages, it cannot vouch for their contents.