About Markup
Traditionally, markup has been defined as systematic annotation designed to reveal a text’s typographical and informational structure. Linguistic markup might be broadly described as annotation representing (a) the grammatical structure of text in the language under study and (b) the structure of documents presenting a linguistic description or analysis of such text. Both kinds of linguistic markup are required when digitizing language documentation such as paradigms, word lists, dictionary entries, and glossed text.
Example: Glossed Text
Markup for interlinearized text, for example, must represent the phonological, morphological, and syntactic structure of the text, together with enough additional information to allow reconstruction of the conventionalized formatting that makes the information intelligible. This is exemplified in the fragment of a Mocovi text (Grondona 1998) given as (2) below:
(2) Glossed Mocovi text fragment
a.  ka/maq          yale  yowito/        ka            lawo/             ka   /na:ko/:
b.  ka-/maq         yale  i+owir+o/      ka            l+awo-r           ka/  Æ+/na:k+o/
c.  class(absn)- ?  man   3ac+arrive+ev  class(absnt)  3poss+family+pcl  and  3ac+say+ev
d.  And the man arrived [to] his relatives and (he) said:
e.  And the man arrived where his relatives were and he said:
Following the format that is conventional in printed grammars, the Mocovi text and its analysis are laid out in five lines: (a) a phonemic transcription in the source language, (b) the underlying morpheme sequence, (c) a morpheme-by-morpheme gloss in a partly technical vocabulary, (d) a literal English translation, and (e) a free translation. Replicating this in electronic text requires that the markup represent:
· The technical terms and abbreviations used in the glossing. Such terms lie at the heart of any linguistic markup; they must be assigned tags and attributes in a way that can support validation and searching.
· Correspondence information that allows a particular morpheme (line b) to be aligned with its gloss (line c) and with a phoneme sequence (line a): e.g., information that the / in lawo/ corresponds to the -r in l+awo-r and is glossed as PCL. [7]
· Distinctions among morphemic boundaries—e.g., information that the ‘-’ delimiter indicates ordinary affixation; the ‘+’ delimiter cliticization.
Even if we do not wish to replicate this conventional 5-line format, however, we still need markup to enable other things we might like to do with a digitized text, e.g. searching the text for linguistic features, or aligning appropriate segments of the text with an accompanying sound file.
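The requirements listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed encoding: the class names `Morpheme` and `Word` and the function `find_gloss` are hypothetical, and the two sample words are taken from lines (a) to (c) of the fragment in (2). Each morpheme carries its underlying form, its gloss label, and the delimiter that precedes it, so that alignment, boundary type, and searching are all recoverable from one structure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Morpheme:
    form: str      # underlying form, as in line (b)
    gloss: str     # gloss label, as in line (c)
    boundary: str  # delimiter preceding this morpheme: "", "+", or "-"


@dataclass
class Word:
    surface: str   # phonemic transcription, as in line (a)
    morphemes: List[Morpheme]

    def morpheme_line(self) -> str:
        """Rebuild the line (b) form from forms and boundaries."""
        return "".join(m.boundary + m.form for m in self.morphemes)

    def gloss_line(self) -> str:
        """Rebuild a gloss string, reusing the line (b) boundaries."""
        return "".join(m.boundary + m.gloss for m in self.morphemes)


def find_gloss(words: List[Word], gloss: str) -> List[Tuple[str, str]]:
    """Return (surface word, morpheme form) pairs carrying a given gloss."""
    return [
        (w.surface, m.form)
        for w in words
        for m in w.morphemes
        if m.gloss == gloss
    ]


# Two words from the fragment; the segmentation follows line (b).
words = [
    Word("yale", [Morpheme("yale", "man", "")]),
    Word(
        "lawo/",
        [
            Morpheme("l", "3poss", ""),
            Morpheme("awo", "family", "+"),
            Morpheme("r", "pcl", "-"),
        ],
    ),
]

print(words[1].morpheme_line())
print(find_gloss(words, "pcl"))
```

Because the sketch stores one delimiter per morpheme, `gloss_line` reuses the delimiters of line (b); a real scheme would have to decide whether the gloss line may carry delimiters of its own, as the printed example does.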
Can we all use different markup schemes?
Yes, and we probably always will, to some extent, since data reflects different language structures and is marked up for different purposes. Various conventions (e.g., the XML namespace convention) allow you to specify, and link to, the markup schema that was used for the data. Most markup schemes (e.g., the Corpus Encoding Standard) also allow users to choose the level of detail they wish to represent in the markup; that is, even if two people use the same markup system, the results may differ because of different levels of specificity.
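As a rough illustration of the namespace convention, the following Python sketch parses a small XML fragment whose glossing elements are bound to a namespace URI. The URI `http://example.org/ns/igt`, the prefix `igt`, and the element and attribute names are all hypothetical; they stand in for whatever schema a project actually declares. Only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI identifying the markup scheme in use.
IGT_NS = "http://example.org/ns/igt"

# A tiny document: the namespace declaration on the root element tells
# a reader (or a program) which scheme the igt:* elements belong to.
doc = """<text xmlns:igt="http://example.org/ns/igt">
  <igt:word surface="yale">
    <igt:morph gloss="man">yale</igt:morph>
  </igt:word>
</text>"""

root = ET.fromstring(doc)

# Map the prefix used in our queries to the namespace URI.
ns = {"igt": IGT_NS}

# Collect the gloss attribute of every morph element in that namespace.
glosses = [m.get("gloss") for m in root.findall(".//igt:morph", ns)]
print(glosses)
```

Elements from a different scheme would carry a different URI, so a program can tell the two apart even when the local element names happen to coincide.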
However, there are reasons to try to reach as much consensus as possible on guidelines for best practice in the markup of linguistic data. First, without compatible markup, two bodies of data are not comparable. The linguistic similarities and differences will be difficult to see even by human inspection, and computationally they are essentially undiscoverable, since no search engine can be expected to "know" that differently named entities are equivalent. Second, a lack of standardization makes data difficult to interpret in and of itself, because a linguist must first learn the markup conventions before he or she can understand a new body of data.
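The comparability problem can be made concrete with a small Python sketch. Everything in it is invented for illustration: the two projects, their tag names, the shared vocabulary, and the sample tokens. The point is only that a search across both data sets works once someone has written down the equivalences by hand, which is the work that shared guidelines are meant to make unnecessary.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical tag sets: two projects name the same categories differently.
TO_SHARED: Dict[str, Dict[str, str]] = {
    "project_a": {"N": "noun", "V": "verb"},
    "project_b": {"NOUN": "noun", "VERB": "verb"},
}


def normalize(project: str, tag: str) -> Optional[str]:
    """Map a project-specific tag to the shared label, or None if unmapped."""
    return TO_SHARED[project].get(tag)


def search(
    corpora: Dict[str, List[Tuple[str, str]]], shared_tag: str
) -> List[Tuple[str, str]]:
    """Return (project, token) pairs whose tag maps to shared_tag."""
    return [
        (project, token)
        for project, tokens in corpora.items()
        for token, tag in tokens
        if normalize(project, tag) == shared_tag
    ]


# Illustrative data only: (token, project-specific tag) pairs.
corpora = {
    "project_a": [("man", "N"), ("arrive", "V")],
    "project_b": [("family", "NOUN"), ("say", "VERB")],
}

print(search(corpora, "noun"))
```

Without the `TO_SHARED` table, a query for "N" finds nothing in the second data set and a query for "NOUN" finds nothing in the first.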
A number of bodies have proposed markup standards for linguistic data; among the first were the Text Encoding Initiative and EAGLES/ISLE (Expert Advisory Group on Language Engineering/International Standards for Language Engineering). If you look at the EAGLES work on linguistic annotation, for example, you will see that the group specified three levels of linguistic annotation: required attributes (noun, adjective), recommended attributes (gender, number, person), and optional or language-specific attributes (countable, mass). But the features specified were intended to cover only European languages, so the system is inadequate for most of the world's endangered languages. More recently, other groups such as LACITO and the DOBES (Documentation of Endangered Languages) project have developed markup systems that seem more satisfactory for non-European languages, and it is these proposals that we will look at in the workshop.
For more information on linguistic annotation systems, with a focus on speech annotation, see the Linguistic Annotation Page of the Linguistic Data Consortium, and also the special issue of Speech Communication focusing on corpus annotation tools.