Workshop on
The Digitization of Language Data:
The Need for Standards

Workshop Description

The Santa Barbara Workshop on the digitization of endangered languages data is only three months away, so we would like to give you a more precise idea of the focus and structure of the workshop, as well as to request some help from you.

If you have read the background material listed on the web site at http://linguistlist.org/~workshop/reading.html, you know that many individuals have become concerned, not only about the impending disappearance of many of the world's languages, but also about the difficulty of effectively preserving documentation on these languages in the absence of a common archive infrastructure.

In the months since we submitted the proposal for this workshop, an initiative has developed which, we feel, offers the best hope for ensuring interoperability among linguistic archives and averting the dangers inherent in a multiplicity of incompatible practices. This is the Open Language Archives Community (OLAC), which was established as part of the Workshop on Web-based Language Documentation and Description (Philadelphia, December 12-15, 2000). The LINGUIST List has become a member of OLAC, and has, in fact, undertaken to develop the central OLAC metadata server and some of the recommendations of best practice for the OLAC community. Furthermore, we have just learned that we will have NSF support in this enterprise.

So we would like to use this Workshop to solicit your input and advice about some of these recommendations. The workshop participant list includes many of the most knowledgeable field linguists and language engineers in the world. The Workshop therefore seems an ideal venue for discussing possible OLAC recommendations of best practice in three crucial areas:

1. Metadata: Metadata is crucial to finding and classifying data in an electronic environment, but no widely accepted standards exist for linguistic metadata. We will discuss the merits of an extended Dublin Core metadata set, the proposal of the Max Planck Institute for metadata, and the plans of the OLAC community in this regard.

2. Markup: There are no standards for linguistic markup. Indeed, there is not even a consensus on what should be marked up in linguistic material. We hope to discuss both morphological markup and text-formatting markup, and to review some of the developing proposals (e.g., the DOBES proposal) for the tagging of linguistic information in digitized texts.

3. Language family classification in an electronic environment: At the moment, there are no standard digital means of coding relationships between languages. We must consider what such a system would comprise, how the linguistic community might come to a working consensus on such a system, and what mechanisms would be necessary to modify the system when new evidence comes to light.

The workshop, therefore, will be both information-oriented and task-oriented, and will include 4 types of events:

Overview (or "State of the Art") Presentations:

Since knowledge of the three topics varies considerably among the workshop participants, as well as the organizers, we feel that most of us would benefit from some presentations introducing the major topics and reviewing work to date. We have solicited overview presentations on metadata, markup, and language family codes, as well as on OLAC, Unicode, and the ethical issues which impact the design of digital archives.

Individual Presentations:

We would also like to hear from those of you who have had experience with these topics in the course of developing your own projects. We earnestly solicit volunteers to present material which you feel is of particular relevance or use to the 3 foci of the workshop, and also to database design for language documentation, since this is also a major objective of the LINGUIST List project. If you would be willing to tell us about your experiences, please let us know so that we can reserve time for you in the Workshop schedule.

Workgroups:

Each of the 3 topics listed above will also be the focus of a workgroup, which will discuss the relevant issues and proposals and then produce a session report, suggesting either specific recommendations or directions for future investigation. We have made some tentative workgroup assignments. If you feel that you have been misplaced, please feel free to contact us, and tell us of your preferences.

Markup Workgroup: Austin, Dwyer, Carnie, Crowhurst, Grondona, Harris, Hammond, Heath, Langendoen, Sands, Whalen, Winter.

Metadata Workgroup: Aristar-Dry, Dench, Donnelly, Furbee, Jahr, Michaelowsky, Ostler, Simons, Wittenberg, Woodbury.

Language Classification/Codes Workgroup: Aristar, Blust, Cook, Gragg, LaPolla, Mithun, Ratliff, Solnit, Starostin, Thwala, Veselinova.

Panels:

The workshop will also serve as a means by which those of us who are involved in new technical projects can learn what the community needs. Language engineering projects often proceed with little input from the working linguist. And yet the de facto standards set by these projects will control the kind of resource retrieval that linguists can do for years to come. To address this problem, we would like to recruit participants to serve on two panels:

1. A field advisory panel: This panel will be asked to comment on the kinds of features they would like to discriminate and be able to search for within linguistic data and documentation -- in other words, what needs to be identified through the use of linguistic markup, and why. We believe that asking linguists to reflect on the research purposes for which they might need texts with linguistic tagging may lead us to think about markup in new ways.

2. A linguists' resource panel: This panel will be asked to comment on the kinds of resources they would like to be able to identify easily in an Internet search, i.e., on the kinds of search parameters (metadata) that digital search and retrieval mechanisms should support. We would also like to know your opinion of existing search tools -- what works, what doesn't, and what kinds of tools you would like to see developed.

Essentially, both of these panels will address issues related to searching -- either searching for resources or searching within them. And in each case we would encourage panelists to consider what they would put on a "wish list" if Santa Claus were a language engineer. Panel members will be asked to prepare a 5- to 10-minute presentation, then to help lead a general discussion of the issues raised.

Summary:

As currently planned, the Workshop will include:

1. Overview presentations on metadata, markup, language family coding, and other background information related to digital archiving.

2. Project presentations and demos related to the foci of the workshop: metadata, markup, language family coding, and database design.

3. Working groups to develop recommendations on the 3 topics.

4. Panels to consider what the linguists who use digital archives want metadata and markup to enable them to do.

We would appreciate your feedback on the plans sketched above and the tentative working group assignments. And we would especially appreciate your volunteering either to give a project presentation or demo (2 above) or to participate in a panel (4).

