Using Computers in Linguistics: A Practical Guide

The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Online Appendix: Using Text Encoding



Using Text Encoding to Represent Linguistic Data

One way to represent linguistic data (along with the linguist's analysis of it) is to use a special markup language to encode the information in text files. SGML is the most widely used markup language, and the TEI is an example of a widely accepted encoding scheme developed specifically for linguistic and literary data. The following glossary defines these and other key terms related to text encoding. Basic definitions are supplemented with pointers to further information resources.

encoding

The manner in which information is represented in computer data files. Text encoding refers specifically to the way in which the structural (and even interpretative) information in text is encoded. (See also character encoding.)

markup

Codes added to the stream of an encoded text to signal structure, formatting, or processing commands.
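
As an illustration of markup as codes interleaved with the text itself, the following minimal Python sketch splits an encoded stream into its markup codes and its text. It assumes the angle-bracket tag form used by SGML; the element names `p` and `emph`, the `TAG` pattern, and the `split_stream` function are hypothetical, and the pattern is a simplification that does not handle every construct a real SGML parser would.

```python
import re

# Simplified pattern for start tags (<type>) and end tags (</type>).
# This is an assumption for illustration, not a full SGML tokenizer.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def split_stream(stream):
    """Split an encoded text into ('markup', code) and ('text', chars) pieces."""
    pieces = []
    position = 0
    for match in TAG.finditer(stream):
        # Anything between the previous tag and this one is plain text.
        if match.start() > position:
            pieces.append(("text", stream[position:match.start()]))
        pieces.append(("markup", match.group()))
        position = match.end()
    # Keep any text that follows the last tag.
    if position < len(stream):
        pieces.append(("text", stream[position:]))
    return pieces

SAMPLE = "<p>Markup is <emph>not</emph> text.</p>"

for kind, piece in split_stream(SAMPLE):
    print(kind, repr(piece))
```

Joining only the pieces labeled `text` recovers the unmarked text, which shows that the codes are an added layer rather than part of the text proper.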

generalized markup

The discipline of using markup codes in a text to describe the function or purpose of the elements in the text, rather than their formatting.

A seminal paper on the discipline of generalized markup is:

style sheet

A separate file that is used with a document containing generalized markup to declare how each generalized text element is to be formatted for display.
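
The separation between generalized markup and a style sheet can be sketched in a few lines of Python. Everything named here is hypothetical and chosen only for illustration: the element names (`entry`, `title`, `p`, `term`), the `STYLE_SHEET` table, and the `render` functions. Real style sheet languages are far richer. The sample document uses well-formed, XML-style tags so that Python's standard `xml.etree.ElementTree` module can read it.

```python
import xml.etree.ElementTree as ET

# A document with generalized markup: the tags name each element's
# function, not its appearance. (Element names are hypothetical.)
DOCUMENT = (
    "<entry><title>Text Encoding</title>"
    "<p>Markup describes <term>structure</term>.</p></entry>"
)

# A stand-in "style sheet": one formatting rule per element type.
STYLE_SHEET = {
    "title": str.upper,               # show titles in capitals
    "term": lambda s: "*" + s + "*",  # set off technical terms
}

def render(element, style_sheet):
    """Return the element's text content, formatted by the style sheet."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, style_sheet))
        parts.append(child.tail or "")
    text = "".join(parts)
    rule = style_sheet.get(element.tag)
    return rule(text) if rule else text

def render_document(source, style_sheet):
    """Render each top-level child of the document on its own line."""
    root = ET.fromstring(source)
    return "\n".join(render(child, style_sheet) for child in root)

print(render_document(DOCUMENT, STYLE_SHEET))
```

Because the formatting rules live outside the document, passing a different table to `render_document` changes the display without touching the encoded text.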

SGML

Standard Generalized Markup Language. A method for generalized markup that has been adopted by ISO (the International Organization for Standardization) and is consequently gaining widespread use in the world of computing.

Some introductions to SGML:

For pointers to virtually any resource related to SGML, see The SGML Web Page by Robin Cover. Some highlights:

Here is a glossary of some key SGML terms:

DTD
Document Type Definition. The definition of the markup rules for an SGML document.
tag
A string of characters inserted into a text file to represent a markup code. In SGML, each text element is delimited by a start tag of the form <type> and an end tag of the form </type>.
element
In an SGML file, a single unit of text delimited by a start tag and an end tag. For instance, a paragraph element might be delimited by <p> and </p>.
attribute
In SGML, a qualifier within the start tag for an element which specifies a value for some named property of that element.
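
The terms tag, element, and attribute can be seen at work in a short Python sketch. Python's standard library has no SGML parser, so this example is an approximation: it applies `xml.etree.ElementTree` to a fragment written in XML-style syntax, which follows the same start-tag, end-tag, and attribute conventions described above. The element names `s` and `w`, the attributes `lang` and `pos`, and the `describe` function are hypothetical and are not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a sentence element <s> containing word elements <w>.
# Each start tag carries an attribute (lang or pos) qualifying its element.
FRAGMENT = (
    '<s lang="en"><w pos="det">the</w> '
    '<w pos="noun">dog</w> <w pos="verb">barks</w></s>'
)

def describe(source):
    """List (element type, attributes, leading text) for every element."""
    root = ET.fromstring(source)
    # iter() visits the root first, then its descendants in document order.
    return [(el.tag, dict(el.attrib), el.text or "") for el in root.iter()]

for tag, attributes, text in describe(FRAGMENT):
    print(tag, attributes, repr(text))
```

The sentence element reports empty leading text because its content begins with a nested word element; the words themselves are the text of the `w` elements.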

TEI

Text Encoding Initiative. A joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics to develop SGML-based guidelines for the encoding and analysis of texts.

See The TEI Home Page. Some highlights:

The TEI Guidelines are published in print, on CD-ROM, and online:

  • Sperberg-McQueen, C. M. and Lou Burnard. (1994) Guidelines for the encoding and interchange of machine-readable texts. Chicago and Oxford: Text Encoding Initiative.

Of special interest to linguists are:


This page is part of the preliminary online appendices for the book
Using Computers in Linguistics: A Practical Guide, 1998 (Routledge).

Last modified: May 13, 1997