Date: Sun, 1 Jan 2006 23:14:52 -0500
From: Richard Sproat
Subject: Wired for Speech
AUTHOR: Nass, Clifford; Brave, Scott
TITLE: Wired for Speech
SUBTITLE: How Voice Activates and Advances the Human-Computer Relationship
PUBLISHER: MIT Press
YEAR: 2005
Richard Sproat, Departments of Linguistics and ECE, University of Illinois at Urbana-Champaign
The topic of this book is voice user interfaces, an example of which is the automated system that one interacts with when one calls United Airlines and wishes to check on the arrival or departure time for a flight. Other examples include systems where one speaks to a graphical avatar (a ''talking head'') that serves as an automated information kiosk; or the National Oceanic and Atmospheric Administration's Weather Radio, which presents the weather using a text-to-speech (TTS) synthesizer. In short, a voice user interface is any automated system that allows a user to access information, possibly with automatic speech recognition (ASR) for voice input, and with either prerecorded prompts or TTS technology to produce output.
The book is not about the technology underlying voice user interfaces. Rather, it is about how humans react to them and interact with them in controlled experiments, and how this information should guide the design of the ''persona'' that the interface presents to the world.
The book is divided into fourteen main chapters, and a two-page summary chapter. After a brief stage-setting first chapter, the authors turn in chapters 2-3 to their first topic, namely the gender of voices. How do people react to synthetic male versus female voices? How does gender stereotyping affect people's perception of the quality or believability of a voice user interface system? Can one get around user prejudices by having a ''gender neutral'' voice?
Chapters 4-5 turn to the issue of voice ''personality''. People infer many aspects of other people's personality by the way they talk, and the kinds of words they use. And, as it turns out, people impute ''personalities'' to voice user interfaces. As with gender, preconceived notions about personalities have a strong effect on the user's perception of a voice user interface.
Chapter 6 deals with the issue of regional or foreign accents and perceived ethnicity. Once again people's prejudices about accent and race carry over to machines, even though the notion that a machine has a geographical or ethnic background is obviously absurd.
Chapters 7-8 discuss emotion and how that should be expressed, or not expressed, in voice user interfaces. One of the clear suggestions of this section of the book is that, where possible, it is important for a voice user interface to match its emotion to the (expected) emotional state of the user.
Chapter 9 asks when and how a voice user interface should use multiple voices. A couple of conclusions are drawn: first, if multiple voices are used, they should be matched to the tasks being performed. For example, the authors suggest using an officious sounding voice to guide users through a complex menu system, and a warm friendly sounding voice to reassure users that they are being guided to the right place. Second, despite the common notion of ''voice fonts'' (e.g. Raman 2004), users do not treat different voices the same way as they treat different textual fonts since a change of voice has social implications that a change of font does not.
Chapter 10 deals with the question of whether voice interfaces should say ''I'', and thereby make perceived claims to being human. From the authors' experiments, it seems that systems that use synthetic (TTS) voices should not say ''I''.
Chapter 11 deals with recorded speech versus TTS, and real faces versus synthetic faces, and concludes that people react better to a system that has either a synthetic face speaking with a synthetic voice, or a real face speaking with a real voice; users do not like it when the conditions are crossed.
Chapter 12 argues that it is generally bad to mix obviously recorded speech with obviously synthetic speech: for example, it would be a bad design choice to have a system produce a canned phrase like ''Good Morning Ms.'' using a prerecorded voice, and then finish the utterance with an obviously synthetic voice saying the name of the user. This, at least, is one section of the book that will seem obvious to anyone who has worked on the technology of speech synthesis: we have known for a long time that it is not a good idea to mix high quality prerecorded speech with poor quality synthesis. The chapter also contains a discussion of humor, though it is not obvious how this relates to the main topic of the chapter.
The final two chapters, 13-14, shift the ground from voice (and video) output to voice input. Chapter 13 deals with the issue of how comfortably people will interact when they know, or are constantly reminded, that they are being recorded. A set of experiments with various kinds of attached microphones versus unobtrusive array microphones showed that users who used the less obtrusive microphones were more creative in their responses and more willing to disclose sensitive information. Finally, chapter 14 discusses what systems should do when they misrecognize a user: what are the relative costs and benefits of the system accepting blame (''I'm sorry, I did not understand you'') versus placing blame on the user (''You are speaking too quickly, please slow down.'').
Three main themes run through this book.
The first is simply this: we are ''wired'' for speech. Even though users know they are dealing with an automated system, if the system takes speech as input, or produces speech as output, users cannot help but treat the system as if it were another human, and will apply the same beliefs and prejudices to the automated system as they would to a human that had the same behavior. That is, if a person feels more comfortable with a male human explaining how to operate a complex piece of equipment than with a female, then that prejudice will carry over to an automated assistant that has a female voice. This first point is consistent with previous work from Nass's lab: in general, people seem to treat computers as if they were people, even though they know full well that they are not (Nass, Steuer & Tauber, 1994).
The second theme is that there is no way around the first theme: for instance, we cannot solve the problem of speakers' inherent gender biases by making a system with a voice of ambiguous gender. Users will just think that the system is weird and will react to it worse than if the system clearly indicates that it is ''female'' or ''male''.
Finally, user perceptions of voice user interfaces have direct implications for users' views of whatever service or product the system is trying to sell. Just as a skilled salesman can make a product seem more desirable than it might otherwise seem, so a well-designed voice interface can make claims of a product's value seem more believable.
To place the current research in some historical perspective it is worth noting that Nass's research on ''Computers as Social Actors'' was the inspiration for Microsoft ''Bob'' which, after its demise, led eventually to ''Clippy'', the Microsoft Office automated assistant. Neither of these products has been well received and there has been much discussion of why (e.g., Schwartz, 2003), a topic that would take us beyond the scope of this review. The authors are evidently very proud of their long experience in providing user-interface design advice to corporations: the preface to the book is highly self-laudatory, and contains a fairly long list of consulting contracts that Nass's lab has had with various companies over the years, including such varied companies as BMW, Charles Schwab, General Magic, Macromedia, NTT, Philips and US West.
In the overview above, reference was made to experiments conducted by the authors to validate their claims about design issues for voice user interfaces, and it is worth summarizing one of those experiments just to give a flavor of the kind of research the authors performed. In assessing the importance of gender stereotyping, the authors conducted an experiment in which participants were directed to an online auction site which offered a set of stereotypically male and stereotypically female merchandise, with descriptions from eBay. Descriptions were read to the listeners either with a female voice or a male voice generated with the Festival TTS system (Taylor, Black & Caley, 1998). Subjects were then asked to rate how credible the description they heard was. The results of this experiment (reported on pages 25-27) were that subjects rated the product descriptions as more credible if the gender of the voice matched the ''gender'' of the product.
While the focus of the book is on the use of technology, rather than the technology itself, one cannot forget that any use of a technology presumes some understanding of the technology that is being used. From the technological perspective there are a couple of points of interest about this book. First, I personally found it noteworthy that the majority of the discussion focusses on synthesis rather than recognition. This focus is the opposite of the focus in the speech technology community, where synthesis has long taken a back seat to recognition, and where synthesis has traditionally been regarded as much easier than recognition. But the focus of the current book on synthesis is, after all, natural: although speech recognition is an important part of many voice user interfaces, it is the voice with which the system speaks that gives it its ''personality'' and its apparent human-like qualities.
Second, it is unfortunately the case that the authors do not always seem to understand the technology that they are evaluating. On several occasions they imply that changing voices, changing the emotions of voices, and changing the gender of voices is a straightforward process. This is misleading at a number of levels. First, consider emotion. While Nass and Brave are correct that many of the acoustic correlates of emotion are known, and while it is true that rendering emotion in synthetic speech has been a research topic since Cahn's work (Cahn, 1989), it is still not possible to produce convincing renditions of all emotions. Second, while it might seem as if it should be easy in general to change the voice or the gender used by the system, in practice there are limitations. To understand this, it is necessary to briefly remind the reader of the various methods used to produce speech output in TTS systems. The oldest approach, exemplified by the Klatt synthesizer (Klatt 1980) and its commercial offspring DECTalk, is a fully parametric system where all parameters of the voice, including pitch, formant values, spectral tilt, and many others, are controllable. In such a system it is indeed in principle easy to produce new voices --- but at a cost: the resulting speech sounds distinctly mechanical, largely because we do not have good models of how to control the parameters over time. Such limitations in our understanding have been sidestepped in much of the recent work on ''unit selection'' based methods. These methods, pioneered in work on the CHATR system (Hunt & Black, 1996), and exemplified in commercial systems such as AT&T's ''Natural Voices'', depend upon a huge database of speech from one speaker. During synthesis, a set of units as closely as possible matching the intended utterance is selected on the fly from the database. The resulting speech can sound very good in the best case --- and downright silly in the worst.
But one of the practical discoveries of this work is that the less one fiddles with the speech, the better and more natural the resulting synthesis sounds. This means that modification of speech such as changing the pitch is to be eschewed. The result: if you want a different voice, you have to record a different speaker, and analyze their speech. If you want a different emotion, you have to record your speaker performing speech with that emotion. This is certainly a lot easier to do than it used to be, but at a minimum one is looking at recording an hour's worth of speech. This clearly involves more than turning a few knobs, which is all that Nass and Brave seem to imply is needed.
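For readers unfamiliar with unit selection, the idea described above can be illustrated with a toy sketch. This is not the CHATR or ''Natural Voices'' implementation: the Unit record, the use of average pitch as the only feature, and the particular cost values are all assumptions made for illustration. The sketch picks, for each wanted phone, one recorded unit so that the sum of a ''target cost'' (mismatch with what was asked for) and a ''join cost'' (penalty for splicing units that were not adjacent in the recording) is minimized by dynamic programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded unit in a hypothetical single-speaker database."""
    phone: str    # phone label
    pitch: float  # average pitch in Hz (toy feature, an assumption)
    index: int    # position of the unit in the original recording


def target_cost(unit, wanted_pitch):
    # How far the recorded unit is from the pitch we asked for.
    return abs(unit.pitch - wanted_pitch)


def join_cost(prev, nxt):
    # Units that were adjacent in the recording join for free;
    # any other splice pays a fixed penalty plus the pitch jump.
    if nxt.index == prev.index + 1:
        return 0.0
    return 10.0 + abs(prev.pitch - nxt.pitch)


def select_units(targets, database):
    """Return the lowest-cost unit sequence for targets.

    targets is a list of (phone, wanted_pitch) pairs.
    """
    layers = []
    for phone, _ in targets:
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise ValueError(f"no recorded unit for phone {phone!r}")
        layers.append(candidates)

    # costs[i][u] = (best total cost ending in u, predecessor unit)
    costs = [{u: (target_cost(u, targets[0][1]), None) for u in layers[0]}]
    for i in range(1, len(layers)):
        prev_layer = costs[i - 1]
        layer = {}
        for u in layers[i]:
            best_prev = min(
                prev_layer, key=lambda p: prev_layer[p][0] + join_cost(p, u)
            )
            total = (prev_layer[best_prev][0] + join_cost(best_prev, u)
                     + target_cost(u, targets[i][1]))
            layer[u] = (total, best_prev)
        costs.append(layer)

    # Trace back from the cheapest final unit.
    last = min(costs[-1], key=lambda u: costs[-1][u][0])
    path = [last]
    for i in range(len(costs) - 1, 0, -1):
        last = costs[i][last][1]
        path.append(last)
    path.reverse()
    return path


# A tiny made-up database: one contiguous stretch plus a few stray units.
DATABASE = [
    Unit("h", 120.0, 0), Unit("e", 118.0, 1),
    Unit("l", 115.0, 2), Unit("o", 110.0, 3),
    Unit("h", 200.0, 10), Unit("e", 100.0, 20), Unit("o", 111.0, 30),
]
```

With these toy costs, a request such as select_units([("h", 120.0), ("e", 100.0)], DATABASE) prefers the ''e'' that immediately followed the chosen ''h'' in the recording over the stray ''e'' whose pitch matches exactly, which is the ''less fiddling is better'' behaviour in miniature. It also shows why a new voice or emotion cannot be dialled in: the search can only return what the speaker actually recorded.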
Turning away from technological issues, there are problems with the design of the book itself. As the authors state at the outset, they use endnotes extensively for background information that can be freely skipped by the casual reader. For example, the data and statistical analyses of all the experiments are presented in endnotes, not in the body of the text, which merely summarizes the results. Also, bibliographic references are all given in the endnotes. This design choice has both a good and a bad aspect. It surely helps the non-specialist reader, who will not necessarily be inclined to look at the authors' data in detail, but will be satisfied with the authors' synopsis of the results. But it is annoying for someone who has a technical background in the field, since following up any single point necessitates thumbing to the back of the book. The lack of a standard bibliography is also an extremely bothersome feature of the book.
Negatives aside, this is a book worth reading by anyone interested in speech technology. Those of us who have worked on developing the technology underlying voice user interfaces have traditionally not thought much about the actual design of the end product. Nass and Brave have clearly thought about these issues more than anyone else.
Still, while it is useful to understand what features of speech work best for which applications, we should not lose sight of the fact that the underlying technology is itself immature, and that just building a system that can communicate effectively with inexperienced users is still a challenge. In ''The Restaurant at the End of the Universe'', the second book in Douglas Adams' ''Hitchhiker'' series, Ford Prefect (the Betelgeusian companion of the hero, Arthur Dent) berates the Golgafrincham colonizers of prehistoric Earth for not having made much progress on the invention of the wheel. A marketing consultant fires back at Ford and asks him, if he is so smart, what color it should be.
Cahn, J. 1989. ''Generating Expression in Synthesized Speech.'' Master's thesis. Massachusetts Institute of Technology.
Hunt, A. and Black, A. 1996. ''Unit selection in a concatenative speech synthesis system using a large speech database.'' Proceedings of ICASSP 96, vol 1, pp 373-376, Atlanta, Georgia.
Klatt, D. 1980. ''Software for a cascade/parallel formant synthesizer.'' Journal of the Acoustical Society of America, 67.3, 971-995.
Nass, C., Steuer, J. S. and Tauber, E. 1994. ''Computers are social actors.'' Proceedings of the CHI Conference, 72-77. Boston, MA.
Raman, T. V. 2004. ''Emacspeak -- The Complete Audio Desktop.'' http://emacspeak.sourceforge.net/
Schwartz, L. 2003. ''Why people hate the paperclip: Labels, appearance, behavior and social responses to user interface agents.'' Master's Thesis, Stanford University.
Taylor, P., Black, A. and Caley, R. 1998. ''The architecture of the Festival Speech Synthesis System.'' 3rd ESCA Workshop on Speech Synthesis, pp. 147-151, Jenolan Caves, Australia.
ABOUT THE REVIEWER:
Richard Sproat is a professor in the departments of Linguistics and Electrical & Computer Engineering at the University of Illinois at Urbana-Champaign. His interests include multilingual text processing and speech technology. Prior to coming to the University of Illinois, Sproat worked in industrial research at AT&T Bell Laboratories, with his primary area of research being text-to-speech synthesis. Sproat was one of the main architects of the Bell Labs multilingual text-to-speech synthesizer. He was also involved in the design of the SABLE text-to-speech markup language, a precursor to the W3C's SSML.