Workshop on
The Digitization of Language Data:
The Need for Standards

Overview

This workshop has been organized to initiate an ongoing collaboration between three groups of researchers who need each other's expertise: field linguists and archivists, who are experts in collecting language data, and language engineers, who are experts in digitizing linguistic data. It is designed to suggest ways to build a vitally needed component in modern linguistics: a common infrastructure for the preservation of, and access to, digital linguistic data.

As linguists, we are only too aware that the Internet has not yet fulfilled its promise. A great many useful pieces of linguistic information are available that ten years ago were inaccessible. This has made certain things, such as finding a job or tracking down a colleague, much easier than they once were. At the same time, we know that certain things have not changed. Access to linguistic data is still difficult, or, for many languages, impossible.


This is not for lack of effort. Many organizations are working intensively to make data available on the web, notable amongst them organizations whose aim is to preserve endangered languages, such as The Linguistic Society of America Committee on Endangered Languages and their Preservation (CELP), The Foundation for Endangered Languages (FEL), Terralingua (TL), the International Clearing House for Endangered Languages (ICHEL), and The Endangered Language Fund (ELF). Among the most prominent of the linguistic projects within the United States are:

Significant projects outside the US include:

The establishment of these multiple archives is to be welcomed, since the magnitude of the task requires distributed effort. No one institution can archive all the important data on language—certainly not within the time limits imposed by impending language attrition and by the ongoing deterioration of the existing documentation. Paper, audiotapes, videotapes, and computer diskettes are all prone to degradation and destruction. Moreover, most field notes and grammars currently reside on individual computers, vulnerable to disk crashes as well as file corruption. Some older notes and grammars still exist only in the form of notebooks and file cards. Digital archiving at distributed sites thus offers the best hope for preserving this valuable linguistic material.

But developing all the infrastructure necessary for a digital archive of language data (including delivery mechanisms, formatting guidelines, and supporting software) is a huge task, and institutions which have set up online archives have tended to adopt different strategies for designing that infrastructure. What is lacking is a generalized solution to the problem of archiving language data, and a common infrastructure for the archives.

Without such a common infrastructure, the individual linguist will find it very difficult to identify all the resources pertinent to a given language. To posit an extreme case: the language in question may be classified, or even named, differently in different archives (e.g., Waikurean vs. Guaicuruan, Lappish vs. Sami). The language data may be marked up using different sets of structural tags (e.g., possessive vs. genitive). The texts may have different organizations (e.g., chronological organization vs. frequency organization of the meanings in a dictionary entry). And the files may have different formats because they have been created with incompatible software tools. In this situation, even a linguist with access to resources might not be able to compare them well enough to make reliable linguistic judgments. And if various archives develop different ways of describing and indexing their resources, no central meta-index can easily be developed. The amount of data will defeat a human librarian, and the different formats will defeat a machine.
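To make the problem concrete, here is a minimal sketch in Python of how two archive records that describe the same material can fail to match until their labels are reconciled. The record layout, the `normalize` function, and the two mapping tables are hypothetical and used only for illustration. In particular, the choice of "Sami" and "genitive" as canonical labels is an assumption made for the example, not an agreed standard.

```python
# Hypothetical mapping tables. The canonical labels on the right are
# illustrative choices only, not a proposed or existing standard.
LANGUAGE_NAMES = {"lappish": "Sami", "sami": "Sami"}
STRUCTURAL_TAGS = {"possessive": "genitive", "genitive": "genitive"}


def normalize(record):
    """Return a copy of the record with its language name and tags
    mapped to the canonical labels; unknown labels pass through unchanged."""
    name = record["language"]
    return {
        "language": LANGUAGE_NAMES.get(name.lower(), name),
        "tags": sorted({STRUCTURAL_TAGS.get(t.lower(), t) for t in record["tags"]}),
    }


# Two records for the same material, as two archives might describe it.
archive_a = {"language": "Lappish", "tags": ["possessive"]}
archive_b = {"language": "Sami", "tags": ["genitive"]}

raw_match = archive_a == archive_b
normalized_match = normalize(archive_a) == normalize(archive_b)
print(raw_match, normalized_match)
```

The records compare as equal only after both have passed through the same mapping tables. Such tables are useful only if the archives agree on them, and reaching that agreement is the purpose of a common infrastructure.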

The aim of this workshop, then, is to bring together those researchers who have the knowledge and expertise to contribute to the development of such a common infrastructure. It is not a task which can be accomplished solely by researchers in language engineering and computation. They are not field linguists, and do not understand the unique problems and needs of the field linguists who collect the basic linguistic data. Nor is it a task which can be accomplished by field linguists alone. Their research is not focused upon the design and implementation of complex language engineering systems. It is a task which can only be successfully completed by both of these groups working together; and the workshop will, we hope, initiate a fruitful collaboration between some of the prominent individuals in both groups.

The workshop is sponsored by the LINGUIST List, and is funded by the National Science Foundation of the United States. LINGUIST is part of the Open Language Archives Community (OLAC), which is a new international project also funded by the National Science Foundation. The OLAC project is designed to build means of accessing language archives linked by community-specific metadata and centralized union catalogs. It builds on the Open Archives Initiative and the Dublin Core Metadata Initiative. OLAC was founded at the Workshop on Web-Based Language Documentation and Description, Philadelphia, December 2000. The Language Digitization workshop should be seen as part of the OLAC initiative.

