Review of Expression in Speech

Reviewer: Marianne Jessen
Book Title: Expression in Speech
Book Author: Mark Tatham, Katherine Morton
Publisher: Oxford University Press
Linguistic Field(s): Computational Linguistics
Cognitive Science
Issue Number: 15.1283

Date: Tue, 20 Apr 2004 20:05:40 +0200
From: Marianne Jessen
Subject: Expression in Speech: Analysis and synthesis

AUTHOR: Tatham, Mark; Morton, Katherine
TITLE: Expression in Speech
SUBTITLE: Analysis and synthesis
PUBLISHER: Oxford University Press
YEAR: 2004

Marianne Jessen, Dept of Logopedics, Fachhochschule Fresenius, Idstein.
Michael Jessen, Forensic Speech and Audio Dept, Bundeskriminalamt, Wiesbaden.

''Expression in Speech'' focuses on the issue of how current speech
synthesis systems (e.g. within text-to-speech applications or dialogue
systems) can be improved by adding or enhancing acoustic correlates of
expression. ''Expression'' is seen as a ''manner of speaking, a way of
externalizing feelings, attitudes, and moods - conveying information
about our emotional state'' (p. 39); Tatham and Morton (TM) also use the
term ''tone of voice'' synonymously with expression (p. 65). TM are not
interested in any quick, short-sighted solutions to the issue of
expression in speech synthesis. Instead, before turning to more
concrete implementation design proposals in the latter part of their
book, TM go to great lengths to capture the issue of expression in
speech more generally, including its foundations in the biology and
psychology of emotions and the linguistic pragmatics of emotive
expression in speech. They point out explicitly that the phonetics of
expression in speech is not just a set of salient acoustic correlates
of strong basic emotions overlaid on entire utterances, and that it
should not be synthesized in this manner. Instead, what happens in
natural speech is that often very subtle and blended emotions are
conveyed for only small sections of speech, that there is a complicated
interaction between acoustic and linguistic (choice of lexical items
etc.) cues to emotions, and that the speaker is not just a passive
victim to the biology of emotion and its reflection in speech but that
expression in speech can be modified and adjusted on a cognitive and
sometimes conscious level. This cognitive mediation includes the fact
that the speaker can perceive or infer the reaction of the listener to
the expressive content of his/her speech within the context of the
conversation and is able to make adjustments. TM propose that a speech
synthesis system should be able to model all of these aspects. As for
the incorporation of listener reactions, TM claim that an automatic
speech recognition module can increase the capabilities of the speech
synthesis module. In general, TM emphasize that speech synthesis should
not end with a model of the speaker and her/his expression capabilities
but should ultimately be listener-oriented. This not only would be an
appropriate way of capturing the goal-oriented nature of speech
production on a scientific level but it would also be of commercial
interest - after all, it is the customer who will be the listener of
the synthetic speech.

TM in the final part of their book propose a speech production model
(see Fig. 16.1, p. 365) in which on a ''static plane'' the
phonology/phonetics of a language and their interface is captured as
the set of grammatical/linguistic-phonetic rules and constraints of
speech with ''neutral expression'' (p. 302). In addition to this static
plane there is a ''dynamic prosody/phonology tier'', responsible for
planning utterances and a ''dynamic phonetic tier'', responsible for
rendering utterances. The rendering module receives input from a
''dynamic cognitive phonetics agent'', which supervises and modifies the
rendering process based on contextual and environmental information.
Apparently, while the static components cover what is addressed in most
of current phonology and phonetics, the dynamic components focus on
psycholinguistic and linguistic-pragmatic factors. This model implies a
plea by TM for a broad-sighted view of phonetics, in which
psycholinguistic and pragmatic factors are taken into account, so that
a topic like expression in speech does not assume a marginal role in
phonetics. TM mention that their theory of ''Cognitive Phonetics'' (e.g.
p. 360) is a proposal in that direction. TM make proposals as to how
their speech production model and their account of expression in speech
can be implemented as part of a speech synthesis architecture. Within
this agenda they present a number of XML declarations in which they lay
out a prosodic hierarchy. A [...] node is on top of this
hierarchy, which proceeds further down with prosodic categories such as
[...], [...], and [...] (p. 370). Aside
from the practical aspects of this hierarchy (capturing that
expressions usually have a longer temporal domain, i.e. change less
rapidly than units of linguistic prosody) TM also claim that in the
planning of an utterance the speaker first formulates the ''prosodic
wrapper'' and subsequently the segmental content, contrary to the more
traditional notion that the segmental makeup of an utterance is
planned first and then provided with linguistic and expressive prosody
(pp. 384-386).
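
To make the ''wrapper first, segments second'' idea concrete, here is a minimal Python sketch. It is not TM's notation: the element names `expression`, `intonational_phrase` and `syllable`, the `type` attribute, the value `mildly_pleased`, and the helper functions `build_wrapper` and `fill_segments` are all hypothetical, chosen only for illustration. Under those assumptions, the sketch first builds an empty prosodic tree with the expressive node on top and only afterwards fills the syllable slots with segmental content.

```python
import xml.etree.ElementTree as ET


def build_wrapper(expression_type, syllables_per_phrase):
    """Create the prosodic wrapper before any segmental content exists.

    All element and attribute names are hypothetical placeholders,
    not the declarations used in the book under review.
    """
    # The expressive node spans the whole utterance (longest temporal domain).
    root = ET.Element("expression", {"type": expression_type})
    for count in syllables_per_phrase:
        phrase = ET.SubElement(root, "intonational_phrase")
        for _ in range(count):
            # Empty syllable slots: prosodic structure without segments.
            ET.SubElement(phrase, "syllable")
    return root


def fill_segments(wrapper, syllables):
    """Fill the existing syllable slots with segmental content, in order."""
    slots = list(wrapper.iter("syllable"))
    if len(slots) != len(syllables):
        raise ValueError("wrapper and segmental content do not match in length")
    for slot, text in zip(slots, syllables):
        slot.text = text
    return wrapper


# Step 1: plan the wrapper (two phrases with 2 and 3 syllables).
tree = build_wrapper("mildly_pleased", [2, 3])
# Step 2: only now add the segmental makeup.
fill_segments(tree, ["hel", "lo", "how", "are", "you"])

xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Because the expressive attribute sits on the topmost node, changing it affects the whole utterance without touching any syllable, which is one way to picture the longer temporal domain of expression relative to linguistic prosody.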

Since ''Expression in Speech'' is largely about imagining how speech
synthesis can be improved in the future, let us for illustration
purposes (and fun) beam aboard the Enterprise 1701-D and listen to the
type of (Sci-Fi-projected) speech synthesis found there. (To Tatham and
Morton: this is not to ridicule your book but to cherish its value; to
all who don't like or know Paramount Pictures' Star Trek: The Next
Generation: please skip to the next paragraph.) First there is the
voice of the ship's computer, which everybody can talk to from the
bridge, the elevator and all over the ship. The computer speaks in a voice that
is essentially expressionless. Actually the voice is not fully without
expression: it speaks in an overall friendly manner, which is an
illustration of TM's point that ''all speech is expression-based'' (title
of Chapter 14). But this friendly kind of voice by the computer is
always the same, no matter how inappropriate for the context and how
annoying for the listener. In TM's terms, the node has an
attribute such as ''low-emotion friendly'' as a permanent setting for
every utterance. This kind of inflexible way of including expression in
speech synthesis is what TM argue against. What will probably meet
their expectations, however, is the voice of the unique android
Lieutenant Commander Data. Data is not able to experience emotions but
in his speech and nonverbal behavior is able to express a certain
degree of emotion. He usually cannot express strong and basic emotions;
at least he is not very good at it, although when demanded in
situations like a theater play his expressive abilities in that
direction improve (cf. TM's XML declaration of emotive aspects in
Hamlet's speech, p. 304f.).

According to TM, what is both more difficult and more required of a
speech synthesis system is the ability to express subtle and blended
rather than extreme and basic emotions. What an interactive system
needs in their words is ''less intense expressiveness to increase its
naturalness and credibility'' (p. 90) - a feature certainly met in
Data's speech. Data also meets TM's proposal that a speech synthesis
system should be able to perceive or infer listener reactions and to
relate those reactions to the verbal or vocal expressive content of its
speech with the ability to adjust it. In his regular interactions with
the other crew members Data can for example perceive physical or
verbal/vocal signs of distress in reaction to his behavior and can ask
if he in any way offended the person he talked to. Another point: is
the goal of expressive speech synthesis to model just the expression or
also the physical and perhaps psychological aspects that come with an
emotive reaction, as a stage prior to or interacting with the
expression (TM pp. 277-280 for discussion)? More philosophically: can
or should machines ever cross the body-mind barrier and even be able to
EXPERIENCE emotions? That certainly went wrong with Data's android
brother Lore, who turned into a raving lunatic over his abilities to
experience emotions - but who knows. By the way, being an android, Data
is also the perfect embodiment of an articulatory synthesizer, which
many in the field of speech synthesis think will ultimately be the best
way of doing synthesis.

''Expression in Speech'' in some ways has more the character of a
monograph for the advanced reader than of a basic textbook or handbook
because it presupposes that - or is of maximal value if - the reader is
familiar with or willing to familiarize her/himself elsewhere with the
principles of speech synthesis, with the literature on emotion in
speech, and with background subjects such as phonology or
psycholinguistics. For example, although different speech synthesis
techniques such as formant synthesis, unit-selection synthesis, or
diphone synthesis are all mentioned, discussed and in part illustrated,
the reader still has to turn to other sources when wanting to know how
e.g. formant synthesis works (the distinction between source and filter
parameters, the cascade and the parallel branch, etc.). And although
the most important correlates of emotion in speech that have been
reported in the literature are summarized in the form of tables (pp.
55, 115), TM essentially do not provide a literature overview on this
topic (by mentioning the original sources such as Williams and Stevens
1972 and many others) but cite a few secondary sources, one of which is a
probably not very accessible Ph.D. thesis, to which the interested
reader can turn for further literature. [We want to mention at this
point that there has also been some interesting work on emotion in
speech in Germany including Tischer (1993; with extensive literature
review up to that date), Klasmeyer and Sendlmeier (2000), Burkhardt
(2001; with special reference to emotion in speech synthesis), and
Kienast (2002).]
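
For readers who want a concrete picture of the source/filter distinction mentioned above, the following is a minimal Python sketch of a cascade of second-order digital resonators excited by an impulse train. It is an illustration under stated assumptions, not the implementation of the Klatt synthesizer, HLsyn, or any system discussed in the book: the function names, the sampling rate, the fundamental frequency and the three formant frequency/bandwidth pairs are all invented for the example, and the parallel branch is left out entirely.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at 0 Hz)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Pass a signal through one formant resonator (a filter parameter pair)."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate_hz):
    """Very crude voice source: one unit impulse per pitch period."""
    period = round(sample_rate_hz / f0_hz)
    n = round(duration_s * sample_rate_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def cascade(source, formants, sample_rate_hz):
    """Cascade branch: the output of each resonator feeds the next one."""
    signal = source
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


SAMPLE_RATE = 10000  # Hz, assumed for the example
# Source parameter: fundamental frequency (assumed 100 Hz).
source = impulse_train(100, 0.05, SAMPLE_RATE)
# Filter parameters: illustrative (frequency, bandwidth) pairs in Hz.
vowel = cascade(source, [(700, 60), (1200, 90), (2600, 150)], SAMPLE_RATE)
print(len(vowel))
```

In this picture, changing the impulse spacing alters only the source (pitch), while changing the frequency/bandwidth pairs alters only the filter (vowel quality). These are exactly the kinds of parameters that a formant synthesizer exposes and that concatenative systems do not offer directly.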

The importance of phonology and prosody is mentioned throughout the
book, but except for a few remarks on the Firthian prosodic framework,
metrical phonology and articulatory phonology (pp. 21f.), their theory
of Cognitive Phonetics (pp. 209, 334, etc.), or on the limitations of
Pierrehumbert's intonation model and the ToBI system for speech
synthesis (p. 118), it is not really clear what model of phonology
TM have in mind as background for their work on expression
in speech (e.g. in their production model mentioned above) or whether
they think a combination of models is best for the practical goals at
hand. In our opinion, for example, it would be too harsh a judgment to
question the usefulness of autosegmental phonology for the purpose of
speech synthesis, if this is what TM have in mind (see Clements and
Hertz 1996 for the autosegmental ''Delta'' model of speech synthesis and
its phonological motivation). The unfamiliar reader would need a few
phonology textbooks and perhaps an introduction to the history of
linguistics explaining the differences between British and American
linguistic traditions (e.g. Anderson 1985) to get a perspective.
Regarding psycholinguistics, it would have been useful had TM explained
how their speech production model is similar to or differs from at
least that of Levelt (1989). On the positive side, TM mention quite
a bit of literature on the biology and psychology of emotions. For that
purpose they also provide a bibliography (p. 411f.) following their
list of references.

''Expression in Speech'' is written in a clear and explicit style,
avoiding as much technical language as possible. It also focuses on
some topics and explains them in quite some detail (e.g. what the
syllable-internal constituents are and how hierarchical syllable
structure can be expressed in XML; pp. 372-374). These aspects make the
book again more textbook- than monograph-like, and it has the positive
consequence that it will be understood by many interested persons
outside the specialized emotion-in-speech-synthesis community, which
corresponds to the announcement in the text on the book cover that the
book will be of interest for researchers in linguistics, speech
science, pathology, technology and behavioral or cognitive science. In
some instances, however, clarity and explicit style turn into
redundancy. The book contains 16 chapters, not all of which deal with
separate topics. TM have the habit of bringing up a topic and
explaining some aspects of it, then bringing it up again in a different
chapter with a certain shift in detail or perspective. Some readers
will enjoy this way of arranging the book - and it can be a way of
ultimately grasping the subject matter better than with a more
redundancy-free style - but other readers, who cannot invest the same
amount of time or may wish to concentrate on some aspects while leaving
others, might find it difficult to extract the information they need
without missing something important that occurs elsewhere in the book
(TM do provide a subject and author index, however).

We have two technical comments on speech synthesis. First, to our
knowledge, the HLsyn system by the Sensimetrics company is based on the
revised and expanded parameter set described in Klatt and Klatt (1990)
and not the 1980 model of the Klatt formant synthesizer (p. 239).
Second, it is essentially correct that formant frequencies and
amplitudes (including correlates of articulatory precision) as well as
voice quality parameters cannot be modified with signal processing
methods in concatenative synthesis (see table on p. 237). However,
there has been research and development in that direction, and it
will probably increase strongly in the future, enhanced in part by the
motivation to enable synthesizers to speak with different individual
voices (see e.g. Quatieri and McAulay 1986, d'Alessandro and Doval
1998, Kain and Macon 1998, Stylianou 2001). [Thanks to Karlheinz Stöber
for discussion on that subject and for giving us information on

The few critical comments we made here are essentially about issues of
style and the selection and organization of background information.
They leave untouched our central impression of the book: that it is
extremely useful as a guide to anyone working on the interface between
emotion in speech and speech synthesis. Tatham and Morton offer a
far-sighted perspective on this topic and make explicit many issues the
developer of synthesis systems might not think about at all. In this
sense the book is also a very good example of how the linguist and
phonetician can make valuable contributions to speech technology, and
it shows that in the end the best results will be obtained if speech
technologists and linguists/phoneticians work together.


Anderson, S. R. (1985) Phonology in the twentieth century: theories of
rules and theories of representations, Chicago: The University of
Chicago Press.

Burkhardt, F. (2001) Simulation emotionaler Sprechweise mit
Sprachsynthesesystemen, Aachen: Shaker Verlag.

Clements, G. N. and Hertz, S. R. (1996) An integrated approach to
phonology and phonetics. In Durand, J. and Laks, B. (eds.) Current
trends in phonology: models and methods, pp. 143-173, University of
Salford, European Studies Research Institute.

d'Alessandro, C. and Doval, B. (1998) Experiments in voice quality
modification of natural speech signals: the spectral approach. In: The
Third ESCA/COCOSDA Workshop on Speech Synthesis (on CD).

Kain, A. and Macon, M. (1998) Personalizing a speech synthesizer by
voice adaptation. In: The Third ESCA/COCOSDA Workshop on Speech
Synthesis (on CD).

Kienast, M. (2002) Phonetische Veränderungen in emotionaler
Sprechweise, Aachen: Shaker Verlag.

Klasmeyer, G. and Sendlmeier, W. F. (2000) Voice and emotional states.
In R. D. Kent and M. J. Ball (eds.) Voice quality measurement,
pp. 339-357, San Diego: Singular Publishing Group.

Klatt, D. H. and Klatt, L. C. (1990) Analysis, synthesis, and
perception of voice quality variations among female and male talkers,
Journal of the Acoustical Society of America 87, pp. 820-857.

Levelt, W. J. M. (1989) Speaking: from intention to articulation.
Cambridge, MA: The MIT Press.

Quatieri, T. F. and McAulay, R. J. (1986) Speech transformations based
on sinusoidal representation, IEEE Transactions on Acoustics, Speech,
and Signal Processing, ASSP-34, pp. 1449-1464.

Stylianou, Y. (2001) Applying the harmonic plus noise model in
concatenative speech synthesis, IEEE Transactions on Speech and Audio
Processing, 9, 1, pp. 21-29.

Tischer, B. (1993) Die vokale Kommunikation von Gefühlen. Weinheim:

Williams, C. and Stevens, K. N. (1972) Emotions and speech: some
acoustical correlates, Journal of the Acoustical Society of America 52,

Marianne Jessen is a lecturer at the Department of Logopedics,
Europa-Fachhochschule Fresenius in Idstein, Germany - the first
academically based program in Logopedics in Germany - where she is
responsible for
the section on voice. Her interests include speech under stress, voice
quality, and dysphagia. Michael Jessen works at the Forensic Speech and
Audio Department of the Bundeskriminalamt (Federal Criminal Police
Office) in Wiesbaden, Germany. His interests include voicing and voice
quality, laboratory phonology, and speaker identification.