LINGUIST List 8.343

Sun Mar 9 1997

Qs: Email and accented characters

Editor for this issue: Susan Robinson <suelinguistlist.org>


We'd like to remind readers that the responses to queries are usually best posted to the individual asking the question. That individual is then strongly encouraged to post a summary to the list. This policy was instituted to help control the huge volume of mail on LINGUIST; so we would appreciate your cooperating with it whenever it seems appropriate.

Directory

  1. Ted Harding, Email and accented characters

Message 1: Email and accented characters

Date: Sat, 8 Mar 1997 16:21:29 +0000 (GMT)
From: Ted Harding <Ted.Hardingnessie.mcc.ac.uk>
Subject: Email and accented characters

Folks,

I'm writing to solicit opinion.

The issue is "email and accented characters". This is a can of worms,
and most of you will have experienced some of its many possible
manifestations.

On this occasion I'm interested primarily in French, but the views of
users of other European languages will be welcome. In principle, I'm
also willing to "summarise to the list" in due course, but I fear the
reactions may turn out to be un-summarisable so in the end I may just
say "Thanks to you all".

I apologise for the length. This is part of a major exercise, and at the
end of this posting I shall ask for your comments on rather detailed
explanations of certain things, according to your perception of what I
have written.

Now please read on -- and thanks in advance.

PART 1 -----------------------------------------------------

What triggers this particular query is some correspondence I have been
having with a Professor of French in the UK. According to him, many
people in the French-Studies community are forming the opinion that
there is a method of typing they should learn in order to cope with
the fact that accented characters do not get transmitted and received
"tels quels" by email, so they should be transliterated into
characters which can be so transmitted, including:

{C-cedilla}->G {a-grave}->` {a-circumflex}->b
{c-cedilla}->g {e-grave}->h {e-acute}->i {e-circumflex}->j
{i-circumflex}->n {i-dieresis}->o {o-circumflex}->t
{u-grave}->y {u-circumflex}->{

so that the French sentence (written here without accents -- you have to
imagine that the accents really are present):

 Ca s'ecrit "a bientot, ma chere chatelaine"

becomes, on reception:

 Ga s'icrit "` bienttt, ma chhre chbtelaine"

The story is, that people receiving email composed in French have seen
the like of the above and inferred that this is a deliberate
transliteration. They have therefore consciously adopted it
themselves when typing email in French, to the point of training
secretaries to use it too.

QUESTION 1: BEFORE YOU READ ON: Please note HONESTLY whether or not you
--------- too believe this to be the case. Have you adopted this
	 "transliteration" yourself, or seriously considered doing so?
	 If not now, would this have applied to you in the past?

PART 2 -----------------------------------------------------

While the above strikes me as potentially a prime candidate for the
corpus of Urban Myths, nonetheless it is by no means impossible for
people who are not aware of the technicalities of the Internet, of the
representation of text within computers, and of the inner mechanisms
of email, to proceed on the basis that "these things work in
mysterious ways" and they had better play along with what they see
going on. I can therefore plausibly see it happening.

What really takes place, in the first instance, is that A, who happens
to be adequately equipped in computer terms, types a message in French
and sees it on the screen as it should be (accents and all), and then
sends it off. The message is received by B, who may or may not be
adequately equipped. B sees it transformed as above. The
transformation has been effected by some computer system along the way
(which may be B's own computer), and has nothing to do with A.

However, if B is not aware that this is what happens, then B may
believe that A did it on purpose, for mysterious technical reasons,
leading to the potential for the situation described previously.

QUESTION 2: Are you aware, at least in general terms, of the fact that
--------- computer systems produce such changes of their own accord?
 Do you have some awareness of the technical reasons
	 (primarily the difference between US-ASCII encoding and
	 "extended" ASCII encoding using the so-called "upper ASCII"
	 range of codes)?

PART 3 -----------------------------------------------------

Technically speaking, the basis of the "transliterations" noted above is
the following. These days the "standard packet" of computer information
is one "byte" composed of 8 "bits" (each "bit" may be "0" or "1").
Historically, technical limitations restricted transmission to packets
consisting of 7 "bits" only. With only 7 0/1 "bits", there are only 128
possibilities (corresponding to numbers 0-127), as opposed to 256 with 8
"bits" (corresponding to numbers 0-255).

In the early (teletype) days, the first 32 codes (0-31), and 127, were
reserved for communications and device control (e.g. 4 signalled "End Of
Transmission", 10 rolled the paper up one line, 13 brought the print head
back to the left edge of the paper, 127 meant "delete character"), leaving
95 out of 128. These 95 were then assigned to the upper and lower case
letters of English, the digits 0-9, and various other signs and symbols
such as you see on the standard US computer keyboard. This assignment
became the American Standard Code for Information Interchange, or ASCII.
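
For those who like to check such counts for themselves, here is a small
sketch in Python (my choice of language for illustration; the names
"control" and "printable" are mine, not part of any standard). It simply
divides the 128 seven-bit codes in the way just described.

```python
# The 128 codes available with 7 bits, split as described above.
total = 2 ** 7                                   # 128 possibilities

# Codes 0-31 and 127 were reserved for communications and device control.
control = [c for c in range(total) if c < 32 or c == 127]

# Everything else (32-126) was assigned to letters, digits and symbols.
printable = [c for c in range(total) if 32 <= c <= 126]

print(total, len(control), len(printable))

# Show the characters themselves (the first one, code 32, is the space).
print("".join(chr(c) for c in printable))
```

The first line printed should read "128 33 95", and the second shows the
95 characters you can find on a standard US keyboard.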

QUESTION 3: (a) OK so far? Is this (b) new to you? and, if so, is it
--------- (c) understandable, (d) usefully informative?

When, later, the full 8 bits became the norm, a further 128 possibilities
(128-255) became available. These got used for all sorts of purposes,
including assignment as codes for various characters not present in
English (such as the accented characters of various European Languages).
However, unlike ASCII, there is not a unique generally-adopted convention
for what they should stand for. Instead, there are several different
standard conventions (of which, on any occasion, one may adopt any one).
These include the various International Standards Organisation (ISO)
encodings defined in ISO Standard 8859, and known as ISO-8859-1,
ISO-8859-2, etc.

Here we are concerned with ISO-8859-1 (also known as ISO-Latin-1) which
prescribes numerical codes for various European languages including
French, German and the Scandinavian Languages. However, if you want the
Eastern European languages using a "latin" alphabet (Czech, Polish etc)
then you need ISO-8859-2, if Turkish then ISO-8859-9, Cyrillic 8859-5, and
so on.

Similar encoding systems (but different in detail) are the "IBM Code
Pages". You may have encountered these when setting up a DOS system
on an IBM PC (e.g. code page 437 for English, French, German etc;
857 Turkish, 866 Russian). IBM Code Page 819 "is supposed to be fully
ISO-8859-1 compliant" (but I haven't checked this).
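
To see that one and the same code really does stand for different
characters under different conventions, here is a minimal sketch in
Python. The codec names are the ones Python's standard library uses for
the encodings mentioned above; the choice of code 233 is mine, purely
for illustration.

```python
# One byte, code 233, which ISO-8859-1 assigns to e-acute.
BYTE = bytes([233])

# Python's names for the ISO-8859 parts and IBM code pages mentioned above.
CODECS = ["iso8859_1", "iso8859_2", "iso8859_9", "iso8859_5",
          "cp437", "cp857", "cp866"]

# Interpret the same byte under each convention in turn.
decoded = {name: BYTE.decode(name) for name in CODECS}

for name, ch in decoded.items():
    print(f"{name:10} U+{ord(ch):04X} {ch!r}")
```

The byte never changes; only the table used to interpret it does. Under
ISO-8859-1 it comes out as e-acute, under ISO-8859-5 as a Cyrillic letter.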

QUESTION 4: As Question 3.
---------

In ISO-8859-1, we find the following assignments, for instance:

code 199 -> {C-cedilla}
code 224 -> {a-grave}
code 226 -> {a-circumflex}
and so on.

In ASCII,

code 71 -> G
code 96 -> ` (back-quote/grave accent)
code 98 -> b
and so on.

The relationship between a number (199) exceeding 127 corresponding to an
8-bit byte (11000111) and the number corresponding to the byte obtained by
stripping the 8th bit (counting from the right) (1000111), namely 71,
is that the latter is 128 less than the former. The same effect is
achieved by setting the 8th bit to 0 (01000111) in an 8-bit byte.

Therefore {C-cedilla} -> 199 -> 71 (199 minus 128) -> G.

In this way, the above sentence about "chatelaine" in its first form (with
accents) gets transliterated into its second form simply by subtracting 128
from the code of each character encoded above 127. Or, equivalently, by
setting the 8th bit to 0.
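
The whole "transliteration" can be reproduced in a few lines. What
follows is a sketch in Python, assuming the message was encoded in
ISO-8859-1; the function name strip_eighth_bit is my own invention. The
accented letters are written as \u escapes so that the example itself
stays in pure ASCII.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Set the 8th bit of every byte to 0 (i.e. subtract 128 if it was set)."""
    return bytes(b & 0x7F for b in data)

# {C-cedilla} is code 199 = 11000111; clearing the 8th bit leaves 71 = "G".
print(199 & 0x7F, chr(199 & 0x7F))

# The example sentence from Part 1, this time with its accents present.
original = ("\u00c7a s'\u00e9crit "
            "\"\u00e0 bient\u00f4t, ma ch\u00e8re ch\u00e2telaine\"")

sent = original.encode("iso8859_1")              # what A's system sends
received = strip_eighth_bit(sent).decode("ascii")  # what B ends up seeing

print(received)
```

The last line printed is exactly the "received" form quoted in Part 1.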

QUESTION 5: As Question 3 again.
---------

PART 4 -----------------------------------------------------

Now we come to the question you may already be asking: Why the hell go
through all this transformation stuff for email?

The plain fact is that although the Internet is capable of faithfully
transmitting 8-bit bytes (and in fact routinely does so), the ancient
7-bit standards are still firmly entrenched, especially on the US side of
the Atlantic, to some extent in hardware but more particularly in
software, especially the software installed in network computers which
perform the transmission of email messages from one site to another. In
particular, a standard protocol for email transmission, called SMTP
(Simple Mail Transfer Protocol) which enables two computers to negotiate
with each other for the purpose of sending (by one) and reception (by the
other) of a mail message, is by definition limited to 7 bits.

There is a more recent protocol ESMTP (Extended SMTP) which by definition
allows faithful communication of all 8 bits per byte for the text of a
message. Software which has the capability for ESMTP is quite widely
installed these days, though far from universally.

According to the definitive document on which SMTP implementations are
based (Internic document RFC-821, dated August 1982):

 The mail data may contain any of the 128 ASCII characters. All
 characters are to be delivered to the recipient's mailbox
 including format effectors and other control characters. If
 the transmission channel provides an 8-bit byte (octets) data
 stream, the 7-bit ASCII codes are transmitted right justified
 in the octets with the high order bits cleared to zero.

The immediate effect of this is that when a mail message containing 8-bit
characters with codes above 127 (i.e. with the 8th bit set to 1) is
transmitted or received by software which implements SMTP, the 8th bit is
"cleared to zero" which has the effect of subtracting 128 from the
numerical code, with the results described above. This effect of an
apparent transliteration is in fact a corruption of the message, to put it
bluntly. And this is despite the fact that the Internet, as a
"transmission channel", "provides an 8-bit byte (octets) data stream".

This, at the technical level of the ancient standard SMTP protocol (still
installed on the majority of network computer systems), is the
explanation of the above "transliteration". Therefore whereas user A may
compose and initially send off a message encoded using the full 8 bits per
byte, by the time it gets to B it may well have had the 8th bit stripped,
so that B sees it "transliterated".

On the other hand, documents RFC-1341 (June 1992) and RFC-1425 (Feb 1993),
which define MIME (Multipurpose Internet Mail Extensions) and ESMTP,
specifically allow for full 8-bit bytes in the representation of text in
the "content" part of an email message. Software which complies with ESMTP
will not corrupt the textual content of a message in the above way.

As stated above, ESMTP-capable software is quite widely installed, and if
your end of the chain is ESMTP-capable, and if a sender's end is likewise,
and if the sender's computer system can directly engage your system in an
ESMTP dialogue to negotiate the message transfer, then you will receive
the message as sent. In other circumstances -- you won't!

For the present, the continued existence of software limited to SMTP on
many sites imposes (so 'tis claimed) a "requirement" that mail content
should be transmitted in 7 bits only, for the sake of compatibility with
an ancient standard and to guarantee that any system should be able to
receive any message without loss of information.

Leaving aside the "political" issue that this could be considered a kind of
"Internet convoy principle" (all ships proceed at the speed of the
slowest), thereby holding up progress, application of this constraint
means that message-content originally 8-bit has to be transformed into
7-bit content (i.e. pure ASCII characters) without loss of information.
There are several methods in use by which this is done (such as
uuencoding, base-64-encoding, and the dreaded "quoted-printable"
representation). All of these require software at the sending end in
order to achieve this before the message is sent; and other software at
the receiving end in order to decode the results.
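
By way of illustration only, here is a minimal sketch of two of these
methods, using the quopri and base64 modules of Python's standard
library (uuencoding is left out). It assumes the text is encoded in
ISO-8859-1 before being wrapped for transport.

```python
import base64
import quopri

# "{a-grave} bient{o-circumflex}t", written with escapes to stay in ASCII.
text = "\u00e0 bient\u00f4t"
raw = text.encode("iso8859_1")        # two of these bytes exceed 127

# Sending end: turn the 8-bit content into pure 7-bit ASCII.
qp = quopri.encodestring(raw)         # quoted-printable
b64 = base64.b64encode(raw)           # base-64

print(qp)
print(b64)

# Receiving end: decode, recovering the original bytes exactly.
print(quopri.decodestring(qp) == raw, base64.b64decode(b64) == raw)
```

Both encoded forms contain only codes below 128, so stripping the 8th
bit does them no harm, and decoding restores the accents intact.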

Discussion of these is not part of the purpose of the present posting,
except in the context of pointing out that the corruption resulting from
stripping the 8th bit is not information-conserving: For example, since
{e-acute}->i, we do not know whether reception of "tournis" means that
"tournis" was sent, or whether it was originally "tourn{e-acute}s", except
by inference from context ("avoir le tournis" is unlikely to originally
have been sent as "avoir le tourn{e-acute}s" -- except, of course, in a
message like this one).
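
The loss of information can be shown directly. In this sketch (Python
again, with the same hypothetical strip_eighth_bit function and the same
assumption of ISO-8859-1), two different messages become
indistinguishable once the 8th bit has gone.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Set the 8th bit of every byte to 0."""
    return bytes(b & 0x7F for b in data)

plain = "tournis".encode("iso8859_1")
accented = "tourn\u00e9s".encode("iso8859_1")    # tourn{e-acute}s

# Before transmission the two messages differ ...
differ_before = plain != accented

# ... but after stripping they are byte-for-byte identical.
same_after = strip_eighth_bit(plain) == strip_eighth_bit(accented)

print(differ_before, same_after)
```

Since two inputs give one output, no software at the receiving end can
undo the damage; only the reader's knowledge of French can.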

The troubles generated in this way can be mitigated to some extent by
having more sites upgrade to ESMTP. If your own site is limited to SMTP
then you inevitably have these problems, no matter who mails to you.
If your site has ESMTP, then anyone else whose site has ESMTP should be
able to mail to you without these problems arising. If any of your friends
have SMTP sites, whereby you will have these problems when they mail to
you -- well, feel free to copy this message to them (it is deliberately
written in pure ASCII).

Upgrading to ESMTP is not necessarily going to disrupt Internet email: an
ESMTP site can (indeed must) determine whether the site at the other end
is SMTP or ESMTP, and take action accordingly.
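
For the curious, the "determine and take action" step might be sketched
as follows. Everything here is an assumption made for illustration: the
function names are hypothetical, the server reply is made up, and the
keyword 8BITMIME comes from a companion document (RFC-1426) which I have
not discussed above. It is not a description of any particular mail
program.

```python
def supports_8bit(ehlo_reply: str) -> bool:
    """Look for the 8BITMIME keyword in an (assumed) ESMTP greeting reply."""
    lines = ehlo_reply.splitlines()
    for line in lines[1:]:            # the first line is only the greeting
        # Each line: 3-digit code, then "-" or " ", then a keyword.
        if line.startswith("250") and len(line) > 4:
            keyword = line[4:].split(" ")[0].upper()
            if keyword == "8BITMIME":
                return True
    return False

def choose_transfer(ehlo_reply: str) -> str:
    """Send 8-bit text as it stands if possible, else fall back to 7 bits."""
    return "8bit" if supports_8bit(ehlo_reply) else "quoted-printable"

# Made-up replies from two hypothetical sites.
esmtp_site = "250-mail.example.org greets you\r\n250-8BITMIME\r\n250 SIZE"
smtp_site = "250 mail.example.org"

print(choose_transfer(esmtp_site), choose_transfer(smtp_site))
```

The point of the design is that the sender loses nothing by asking: an
old SMTP site simply does not offer the extension, and the sender falls
back to a 7-bit representation.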

QUESTION 6: As Question 3 again.
--------- Also, has the above explanation and discussion of what goes
 on around the Internet, where email is concerned, clarified
	 for you matters that had been puzzling or mysterious?
	 And is this explanation sufficient, in general terms?

PART 5 -----------------------------------------------------

The Professor of French referred to above, whose original query about
"{e-acute}->i" etc found its way to me and to which I sent some
explanation of the reason why it really occurred, has suggested that
I should prepare a short article for the sake of the French Studies
community, in particular to disabuse any who may really believe that
this transformation is a standard transliteration scheme which they should
adopt when they type.

I had been planning on writing in similar terms to the above (especially
Parts 1-3), not necessarily including everything. However, I feel that
some reference to the wider context such as is described in Part 4, would
also be useful to many (if only to put a stick in their hands to beat
their local computer gurus with).

QUESTION 7: Therefore I would be grateful for any comments on the
--------- suitability, in style and material, of the above for such
 an article, intended for the general majority of Arts
	 readers whose acquaintance with the underlying technicalities
	 may be minimal. Also: Is there anything you think might be
	 useful, which has not been mentioned?

QUESTION 8: Finally, as a contribution to the wider context of the major
--------- exercise, of which the present message is a small facet
 triggered by the Professor's query, I would be very obliged
	 to receive your general comments on the issues of
	 multi-lingual email, both in general and regarding your
	 personal experiences of difficulties and frustrations
	 (including experience of negotiations with your computer
	 gurus).

Ouf! There it is. Once again, sorry for the length. From my point of view
it's important: I hope the above is of some use to some of you.

Many thanks in advance for all contributions, and best wishes to all.

Ted. (Ted.Hardingnessie.mcc.ac.uk)