LINGUIST List 2.295

Friday, 14 June 1991

Disc: Character Coding (Part 1)


Directory

  1. "Charles A. Bigelow", Diacritics, orthography, type design, font format
  2. Paul Hackney, Character encodings

Message 1: Diacritics, orthography, type design, font format

Date: Mon, 10 Jun 91 22:16:36 PDT
From: "Charles A. Bigelow" <bigelow%Sunburn.Stanford.EDURICEVM1.RICE.EDU>
Subject: Diacritics, orthography, type design, font format
The question of whether to have a "closed repertoire" character set
(like the ISO proposal, which fixes single character codes for a
limited but large set of letter+diacritic combinations) or an "open
repertoire" set (like Unicode, which allows arbitrary combinations, but
as multiple codes) does not depend much on modern type font technology
or the current art of type design.

Most major font formats in use today, including PostScript Type 1,
Apple/Microsoft TrueType, and Sun F3 (perhaps also Hewlett Packard /
Agfa CG Intellifont, though I am not certain), actually store most
letter+diacritic combinations as subroutine calls to the separate
elements - letter, diacritic - rather than as a fully formed character
comprising letter and diacritic. That is to say, when a program like a
word-processor calls out the character code for, say, a-acute, the font
looks up the a, and then looks up the acute, and then looks up some
information about where to position the acute over the a, puts the
pieces together, rasterizes the new composite, and hands it over for
display and/or printing.
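
In rough code terms (here as a runnable Python sketch), that
lookup-and-position step works something like the fragment below. The
glyph names, metrics, and offsets are invented purely for illustration;
real formats such as Type 1 and TrueType keep this information in their
own internal structures.

    # Hypothetical glyph and composite tables, for illustration only.
    GLYPHS = {
        "a":     {"advance": 500},     # the base letter
        "acute": {"advance": 0},       # a zero-width floating accent
    }

    COMPOSITES = {
        # composite name: (base glyph, accent glyph, (dx, dy) offset for accent)
        "aacute": ("a", "acute", (75, 180)),
    }

    def render_composite(name):
        """Return the pieces to rasterize and the advance of the composite."""
        base_name, accent_name, (dx, dy) = COMPOSITES[name]
        # Draw the base at the origin and the accent shifted by (dx, dy);
        # the rasterizer then combines the two outlines into one bitmap.
        return [(base_name, 0, 0), (accent_name, dx, dy)], GLYPHS[base_name]["advance"]

    print(render_composite("aacute"))
    # -> ([('a', 0, 0), ('acute', 75, 180)], 500)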

This method of forming composites has two advantages: economy - it
reduces the memory requirements of the font; power - it allows the
potential for arbitrary production of all possible letter + accent
combinations.

The creation of new diacritic combinations doesn't require the skills
of professional type designers. Some brave and ingenious souls may
write PostScript programs to implement the desired combinations and to
assign those combinations to character codes. Or, if a font already
has most floating diacritics (like the Macintosh, or the Microsoft
UGL character set), a "kerning" table can be devised to position the
accents properly over the selected letters. This requires some
planning and arithmetic (a rough sketch of that arithmetic appears
after this paragraph). Another, simpler way is to use a font editing
program, such as Altsys' Fontographer, LetraSet's FontStudio, or
URW's Ikarus M (available on the Macintosh; there are also related
programs for the PC) to get into the font, mix and match letters and
accents for the desired effect, and assign the
results to arbitrary character codes/positions. This requires some time
to learn the rudiments of the editing program, but no training in type
design. In most fonts, the designer has provided most of the common
letters and diacritics. All the user needs is the desire to combine
them. In fact, some users will do a better job of it than the
designers: the designers are not likely to be literate in all the
languages for which they have designed accents, and tend simply to
follow basic rules or various precedents, whereas literate users
often have a better feeling for what makes the graphemes of their
own language discriminable.
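
To give a flavour of the arithmetic mentioned above, here is a small
Python sketch of the kern needed to centre a floating accent over the
letter it follows. The metrics are made up; a real font supplies its own.

    def centering_kern(base_advance, accent_advance):
        """Kern (negative = back up) that centres a floating accent over
        the base letter it follows."""
        # After the base is set, the pen sits at base_advance, so the accent's
        # centre falls at base_advance + accent_advance/2.  We want it at
        # base_advance/2, so we back up by the difference.
        return -(base_advance + accent_advance) / 2

    # An 'o' 500 units wide and a 300-unit-wide circumflex: kern back 400
    # units so the accent's centre sits over the centre of the 'o'.
    print(centering_kern(500, 300))   # -> -400.0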

In the 1950's, French typographers persuaded the English Monotype
Corporation, originators of Times Roman, to reposition several of the
accented characters and to redesign various other characters, to make
a Gallicized version of Times that would be acceptable to the French
literate palate. It is reasonable to suppose that literates of other
languages and orthographies might want something similar.

Moreover, some users may want to design new forms that are not included
in a standard font. Such things don't always look as sleek and polished as
professional work, but they may also have merits that professional
designers would have overlooked. If a new form achieves
acceptance, sooner or later some designer will come along to spiff it
up.

So, the technology of fonts and the art of type design provide the means
for either closed or open character sets. The decision of which to use
is based on other factors, including politics.

-- Chuck Bigelow

Message 2: Character encodings

Date: Tue, 11 Jun 91 10:40:47 +0100
From: Paul Hackney <paulh%cogs.sussex.ac.uk@RICEVM1.RICE.EDU>
Subject: Character encodings
The most commonly used coding for text is the ASCII (American Standard Code for
Information Interchange) character set, which does not provide for characters
containing diacritical marks. As it stands, only 7 of a possible 8 bits are
used, giving 128 encodings (the reason for this is historical - the 8th bit was
used for parity checking). Various extensions are in use (ISO multinational, DEC
multinational, et al.) which use the 8th bit to provide another 128 encodings
containing the commonly used European characters.
However, in my experience, not all terminals, printers and personal
computers support even this limited character set. There are (at least) two
proposals for extending the character set into something that addresses the
rich variety of symbols found in the many languages of the world.
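
To make the bit-counting concrete, here is a small Python illustration of
the 7-bit limit and of how an 8-bit extension uses the high bit. ISO
8859-1 (Latin-1) is used as a representative example; the byte values are
standard, but the choice of example is mine.

    # 'n' fits in the 7-bit ASCII range; e-acute needs the 8th bit in an
    # 8-bit extension such as ISO 8859-1 (Latin-1), where its byte value
    # is 0xE9.
    for name, code in [("n", 0x6E), ("e-acute (Latin-1)", 0xE9)]:
        status = "uses the 8th bit" if code > 0x7F else "plain 7-bit ASCII"
        print(name, hex(code), format(code, "08b"), status)

    # n 0x6e 01101110 plain 7-bit ASCII
    # e-acute (Latin-1) 0xe9 11101001 uses the 8th bit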

In response to John Baima, I must confess ignorance of what a floating
diacritic is. I will therefore limit my comment to an inferred explanation:
a floating diacritic is a character that can be combined with a normal character
(such as ~ [tilde] and n) to form a composite. This method is workable
but limited, in that it merely patches an already impoverished coding system. A
more general solution is to extend the coding to cover more alphabets.
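
As an illustration of the idea (using today's Unicode combining
characters and a modern Python library, purely to show the principle), a
floating tilde following an 'n' can be folded into a single composite
character:

    import unicodedata

    decomposed = "n\u0303"      # 'n' followed by a floating (combining) tilde
    composed = unicodedata.normalize("NFC", decomposed)
    print(decomposed, composed, composed == "\u00F1")
    # both strings display as n-tilde; the comparison prints True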

I recently came across an article in New Scientist (a popular and serious
scientific magazine) which described a new coding system ["Computer code speaks
many tongues", New Scientist, 9 March 1991, pp.28]. Apparently a consortium of
American companies called "Unicode" (inluding IBM, Apple, Sun, ...) have chosen
to represent their character set using a 16 bit code, which will give a possible
65,536 characters. They suggest that 6,000 codes suffice for all the alphabets
of Europe, the Middle East and the Indian subcontinent. Chinese, Japanese and
Korean require about another 18,000 codes. I expect it is arguable whether these
figures are really representative of the characters used and preferred by the
respective nationals. However, one thing is certain: it is not compatible with
our present system. Furthermore, you will need twice as much space to store
text using the current character set, and transmission times will be doubled.
I have come across another system that does not suffer from these limitations,
and is in my opinion a winner by lengths.

Instead of using a fixed-length encoding, the answer is to arrange for the
encoding to expand to two or three bytes when required [Becker, J.D. (1984?)
Multilingual Word Processing, from: Language, Writing, and the Computer:
Readings from Scientific American, pp. 86-96, ISBN 0-7167-1772-7]. This is simply
done by setting aside a few bytes as signals to the computer (or printer, etc.)
and embedding these in the text. The principal signal is one byte that
indicates that the next byte is a code representing the alphabet to be used for
the subsequent text. This gives 255 different alphabets, each containing 255
codes. Compatibility between the system and the current ASCII encoding is
easily achieved by assuming the start of the text is in alphabet "Roman" (i.e.
ASCII). Although most (all?) of the European languages, Cyrillic, Arabic, etc.
can be fully represented with 255 characters, other "alphabets", such as Chinese,
require considerably more encodings. A simple extension to the above scheme
provides an elegant solution. Two 'shift-alphabet' characters in sequence
indicate that the next byte signals which 'super alphabet' is to be used. These
alphabets use a two-byte encoding scheme giving 65,536 possible letters (this is
similar to the proposed "Unicode" system); a 3-byte 'super-super-alphabet'
would allow well over 16 million codes.
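
To make the scheme concrete, here is a short Python sketch of an encoder
along the lines just described. The escape value, the alphabet numbers,
and the sample character codes are arbitrary choices of mine; the scheme
itself only requires that some byte value be reserved as the shift signal.

    # A sketch of the shift-alphabet scheme; all concrete values are invented.
    SHIFT = 0x1B     # reserved "shift-alphabet" byte; alphabet numbers equal to
                     # it are excluded, hence 255 alphabets rather than 256

    def encode(runs):
        """Encode runs of text, where each run is (alphabet, two_byte, codes).
        Alphabet 0 is taken to be "Roman" (ASCII) and is assumed at the start."""
        out = bytearray()
        current = 0
        for alphabet, two_byte, codes in runs:
            if alphabet != current:
                prefix = [SHIFT, SHIFT] if two_byte else [SHIFT]
                out += bytes(prefix + [alphabet])
                current = alphabet
            for c in codes:
                out += c.to_bytes(2, "big") if two_byte else bytes([c])
        return bytes(out)

    # Plain ASCII passes through at one byte per character; switching to a
    # one-byte alphabet costs two signal bytes, and to a two-byte
    # 'super alphabet' three.
    msg = encode([
        (0, False, [ord(c) for c in "cat "]),   # 4 bytes
        (7, False, [0x61, 0x62]),               # 2 signal bytes + 2 bytes
        (3, True,  [0x4E2D, 0x6587]),           # 3 signal bytes + 4 bytes
    ])
    print(len(msg))   # 15 bytes for 8 characters, versus 16 bytes for the
                      # same text under a uniform 16-bit code

Decoding is symmetrical: on seeing the shift byte, the reader switches
alphabet tables before interpreting the bytes that follow, so existing
ASCII text needs no conversion at all.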

I am really convinced that this system should be adopted in preference to the
fixed-length encoding. Unfortunately, "Unicode" appear to be well established and
their proposed system may well become the de facto standard (as they hope).

Paul Hackney

[End Linguist List, Vol. 2, No. 0295]