EDITORS: Minker, W.; Bühler, Dirk; Dybkjaer, Laila TITLE: Spoken Multimodal Human-Computer Dialogue in Mobile Environments SERIES: Text, Speech and Language Technology 28 YEAR: 2005 PUBLISHER: Springer
Petra Gieselmann, Interactive Systems Lab, University of Karlsruhe
This book contains a collection of 20 papers presenting current work on the development and evaluation of multimodal dialogue systems, with a special emphasis on mobile environments. It is based on publications from an ISCA Tutorial and Research Workshop on Multimodal Dialogue in Mobile Environments held in 2002. The book combines overview chapters of key areas in spoken multimodal dialogue systems with chapters focusing on particular applications or problems in the field. In particular, it deals with the influence of the mobile environment when building and evaluating an application.
SUMMARY
In the introduction, the editors provide a short overview of the book which is divided into three different parts: The first part describes some general issues on multimodal dialogue systems and components. The second part deals with system architectures and some example applications. Finally, the third part concerns evaluation and usability of dialogue systems.
Chapter 1, ''Multimodal Dialogue Systems'' by Alexander Rudnicky, provides an overview of interesting research issues and challenges within multimodal dialogue systems, such as detection of intentional user input, appropriate use of interaction modalities, management of dialogue history and context, use of domain reasoning to incorporate intelligence into the system, and appropriate output planning. Most of these challenges are explained in more detail in the chapters of the first part.
Next is Sadaoki Furui's contribution, ''Speech Recognition Technology in Multimodal/Ubiquitous Computing Environments''. Furui discusses the state of the art in speech recognition and addresses its two major application areas: dialogue systems and speech transcription systems for meeting minutes (cf. Furui 2000). In order to build speech recognition systems that are robust against acoustic and linguistic variation in spontaneous speech, Furui proposes a paradigm shift from speech recognition to speech understanding, so that the meaning of the user’s input is delivered. To this end, world knowledge has to be integrated, and efficient methods to represent, store, retrieve and generally use it have to be evaluated.
In Chapter 3, Satoshi Tamura et al. propose ''A Robust Multimodal Speech Recognition Method Using Optical Flow Analysis''. This chapter is the only one dealing with audio-visual speech recognition. Their approach of coupling acoustic speech signals with visual information, such as lip movements, seems promising. It is robust against visual and acoustic noise and outperforms audio-only speech recognition.
The fourth chapter, ''Feature Functions for Tree-based Dialogue Course Management'' by Klaus Macherey and Hermann Ney, discusses assumptions underlying application-independent dialogue management. In order to create such application-independent dialogue managers, the authors determine the steps which many domains have in common and derive parametrizable data structures for the knowledge of each domain in the form of trees. The tree nodes encode concepts and the edges represent relations between those concepts. During a dialogue session, instance trees are built from the original knowledge tree and a cost function is used to calculate the most direct path through the tree to achieve the user goal. The results of an evaluation of the implemented tree-based dialogue manager in a telephone directory assistance system seem promising.
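To make the idea of a cost-guided path through a concept tree more concrete, the following is a minimal sketch in Python. It is not the feature-function approach of Macherey and Ney: the tree layout, the concept names, the per-concept costs and the helper functions are all assumptions chosen for illustration. The root's children are treated as alternative ways of reaching the user goal, and the manager asks next about the first open concept on the branch with the lowest remaining cost.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A concept in a (hypothetical) knowledge tree."""
    concept: str
    cost: float = 1.0              # assumed cost of asking the user about this concept
    value: Optional[str] = None    # filled in once the user has supplied it
    children: list[Node] = field(default_factory=list)


def remaining_cost(node: Node) -> float:
    """Sum the costs of all leaf concepts in this subtree that are still unfilled."""
    if not node.children:
        return 0.0 if node.value is not None else node.cost
    return sum(remaining_cost(child) for child in node.children)


def first_open_leaf(node: Node) -> Optional[Node]:
    """Return the first unfilled leaf in depth-first order, or None if all are filled."""
    if not node.children:
        return None if node.value is not None else node
    for child in node.children:
        leaf = first_open_leaf(child)
        if leaf is not None:
            return leaf
    return None


def next_concept(root: Node) -> Optional[Node]:
    """Pick the cheapest alternative branch and return its next open concept."""
    branch = min(root.children, key=remaining_cost)
    return first_open_leaf(branch)


# Illustrative directory-assistance tree: a listing can be found by name or by address.
by_name = Node("by_name", children=[Node("last_name"), Node("city")])
by_address = Node("by_address", children=[
    Node("street", cost=1.5), Node("house_number", cost=1.5), Node("city"),
])
root = Node("listing", children=[by_name, by_address])

print(next_concept(root).concept)
```

In this sketch the instance tree is simply the knowledge tree with values filled in as the dialogue proceeds; once the user volunteers a street and a house number, the address branch becomes the cheaper one and the manager switches to it.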
Chapter 5, ''A Reasoning Component for Information-Seeking and Planning Dialogues'' by Dirk Bühler and Wolfgang Minker, presents a logic-based problem assistant for dialogue systems. During the dialogue, this problem assistant processes constraints from the user on one hand and the system’s domain-specific information on the other hand by finite model generation. During the inference process, the assistant generates transparent information for the users so that they can understand the inferences drawn and get a deeper insight into the whole process. This results in a collaborative conflict resolution process which has been illustrated in a meeting appointment system.
Chapter 6, entitled ''A Model for Multimodal Dialogue System Output Applied to an Animated Talking Head'', by Jonas Beskow et al. concerns multimodal output generation and concludes the first part of the book. The authors propose a new markup formalism to annotate the verbal and non-verbal output generated by the dialogue manager for the user. The output specification is XML-based and provides information about the communicative functions of the output, but at the same time does not constrain the realization of these functions within different systems with various output devices and modalities.
The second part about ''System Architecture and Example Implementations'' starts with an ''Overview of System Architecture'' by Andreas Kellner. Since multimodal dialogue systems have recently evolved from speech-only server-based systems such as call-centre or travel information systems (McTear 2002) to personal, mobile interaction partners such as PDAs or handheld computers, a number of new requirements for the underlying architecture arise. Kellner presents two different system architectures facing these challenges: the Galaxy communicator infrastructure used by the US American DARPA, and the SmartKom testbed developed within a German national research project (Wahlster et al. 2001). In addition, Kellner provides an overview of standards for speech-enabled applications, such as VoiceXML and SALT.
Chapter 8, ''XISL: A Modality-Independent MMI Description Language'' by Kouichi Katsurada et al., proposes a description language for the man-machine interface independent of the modalities available. By means of this language, new modalities can be easily added and existing ones modified. They evaluate the descriptive power of this language by implementing an online shopping application with different terminals, such as a PC terminal, a mobile phone terminal, and a PDA terminal.
The ninth chapter, ''A Path to Multimodal Data Services for Telecommunications'' by Georg Niklefeld et al., discusses assumptions about multimodal interfaces for public mobile telecommunications. The authors develop three demonstrators that take into account the technical means currently available on the consumer market. The results of the evaluation show that even quite simple forms of multimodality can bring usability advantages on specific tasks.
Chapter 10 by Roberto Pieraccini et al. deals with ''Multimodal Spoken Dialogue with Wireless Devices''. The authors focus on the design of a user interface that exploits the complementary features of audio and video channels to enhance usability. Given the wide variety of situations in which a user could interact with a mobile device, the appropriate balance of audio and visual interaction differs enormously. Pieraccini and his colleagues evaluate the situations in which particular modalities are or are not preferred within their two implementations for mapping and navigating to points of interest.
Chapter 11, ''The SmartKom Mobile Car Prototype System for Flexible Human-Machine Communication'' by Dirk Bühler and Wolfgang Minker, also considers different contexts of use of mobile devices and likewise deals with map interaction and navigation. Since the system is intended for both car drivers and pedestrians, it has to adapt itself to the current environment while also letting the user adapt it. Particular requirements for modality control are identified.
The next chapter by Dan Bohus and Alexander Rudnicky deals with ''LARRI: A Language-Based Maintenance and Repair Assistant'' and concludes the second part of the book about architectures and sample implementations of multimodal mobile applications. The authors focus on the adaptations necessary to port a dialogue system from one domain to another. As an example, they take a travel-planning system, port it to the aircraft maintenance domain, and evaluate the system with professional aircraft mechanics. The results reveal that in this domain the criterion for success of an interaction is jointly determined by the system and the user. The system has to support the user in tracking various individual tasks and in managing the change of focus between them. In addition, they change the interface design to feature multimodal interaction, where an optimal balance between spoken and graphical output is necessary. Given the large and constantly changing maintenance and repair domain, this stresses the importance of automating the integration of domain information into spoken language-based systems.
The third part about ''Evaluation and Usability'' starts with an ''Overview of Evaluation and Usability'' by Laila Dybkjaer et al. The authors provide an overview of the state of the art and review some initiatives and projects with a main focus on evaluation and usability, such as ATIS (Air Travel Information System), Evalda, EAGLES (Expert Advisory Group on Language Engineering Standards), DISC (Spoken Language Dialogue Systems and Components), the Danish Dialogue Project, and the Paradise Framework (Paradigm for Dialogue System Evaluation). The recent emergence of multimodal, mobile and non-task-oriented systems continues to pose entirely new challenges, such as online user modelling, recognition of user emotions and personality, non-task-oriented dialogues, mobile environments, and modelling of user preferences and priorities. Some of these challenges are addressed in more detail in the following chapters.
The fourteenth chapter, ''Evaluating Dialogue Strategies in Multimodal Dialogue Systems'' by Steve Whittaker and Marilyn Walker, discusses assumptions about information presentation coupled with user models within multimodal systems. The authors present two evaluation methods that could expedite the design process: a Wizard-of-Oz data collection and an overhearer evaluation experiment using logged interactions with the real system. The testbed application offers multimodal map interaction and general city help. By means of the Wizard-of-Oz experiment, user tasks and strategies could be determined and various presentation strategies piloted. Although rich data can be collected by means of this experiment, the dialogue context in which various presentation strategies are generated cannot be carefully controlled. In contrast, the overhearer method allows more control: the participant overhears a series of turns from several dialogues that have previously been logged as successful interactions with the system. The participant then judges the quality of the system’s output and gives general feedback on the interaction. The results of this method reveal that dialogues with individual user models score higher than those with a default user model.
Chapter 15, entitled ''Enhancing the Usability of Multimodal Virtual Co-Drivers'', by Niels Ole Bernsen and Laila Dybkjaer concerns an in-car navigation system. They analyze in detail the following important issues for in-car information systems development: when should the system listen vs. not listen to speech; when to use the in-car display vs. spoken driver-system dialogue; how to identify the present driver to build a user model; how to create an online adaptive user model of the driver.
Chapter 16, ''Design, Implementation and Evaluation of the SENECA Spoken Language Dialogue System'' by Wolfgang Minker et al., also considers an in-car navigation system. The evaluation with users who were driving while carrying out predefined tasks shows that speech input is faster, driving skills are less affected by speech input, and users prefer speech input. However, the task completion rate is considerably lower with speech input compared to manual input because the participants use a variety of out-of-vocabulary words.
The next chapter by Sabine Geldof and Robert Dale deals with ''Segmenting Route Description for Mobile Devices''. The authors also use a navigation system, but focus on route descriptions on mobile devices. Since these descriptions must be easy to understand and remember, and at the same time short enough for the small screens of mobile devices, they propose a summarized route description which can be expanded in a tree-like way to provide more detail. They discuss various segmentation strategies and evaluate their approach by comparing it with a route description presented as a flat numbered list of instructions. The results show that two thirds of the users preferred the tree structure.
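The contrast between a flat instruction list and an expandable, segmented description can be illustrated with a small Python sketch. This is only an illustration of the general idea, not Geldof and Dale's system or segmentation strategy: the `Segment` class, the `render` function, the `+`/`-` markers and the sample route are all invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One summarized part of a route, with its detailed steps hidden by default."""
    summary: str
    steps: list[str]
    expanded: bool = False


def render(route: list[Segment]) -> list[str]:
    """Return the lines to display: summaries always, detailed steps only if expanded."""
    lines = []
    for number, segment in enumerate(route, start=1):
        marker = "-" if segment.expanded else "+"
        lines.append(f"{marker} {number}. {segment.summary}")
        if segment.expanded:
            lines.extend(f"    {step}" for step in segment.steps)
    return lines


# Hypothetical route, made up for illustration.
route = [
    Segment("Leave the campus", ["Exit the car park", "Turn left", "Follow the road to the gate"]),
    Segment("Drive to the station", ["Turn right at the lights", "Continue for 2 km"]),
]

route[0].expanded = True   # the user taps the first summary to see its details
print("\n".join(render(route)))
```

With every segment collapsed the display shows one line per segment, which suits a small screen; expanding a single segment reveals only the steps needed at that point of the journey.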
Chapter 18, ''Effects of Prolonged Use on the Usability of a Multimodal Form-Filling Interface'' by Janienke Sturm et al., considers the ongoing use of a train timetable information system and the development of interaction patterns over time for such a multimodal dialogue system. The user study shows that with practice the users learn to develop interaction patterns that ensure a more reliable and efficient interaction, resulting in a decreased dialogue duration and higher user satisfaction.
The nineteenth chapter, ''User Multitasking with Mobile Multimodal Systems'' by Anthony Jameson and Kerstin Klöckner, discusses the implications of the fact that users often perform several tasks simultaneously while interacting with a mobile system. In their experiment, the participants use a mobile phone while walking. Eye-based and ear-based interaction methods are compared, showing that users’ behaviour is strongly influenced by subjective factors, such as habits from other tasks, experiences with similar phones, dislikes related to the design, and ideas on what is socially acceptable.
Chapter 20 by Sharon Oviatt et al. deals with ''Speech Convergence with Animated Personas''. The authors investigate the convergence of users’ speech with the text-to-speech (TTS) output heard from the dialogue system. In their user study, young children interact with animated marine animals. The output voices are tailored to represent opposite ends of the introvert-extrovert personality spectrum. An analysis of the children’s amplitude, durational features, and dialogue response latencies confirms that they spontaneously adapt basic acoustic-prosodic features of their speech to the TTS. The adaptations are bi-directional and are dynamically readapted when the children are introduced to a new TTS voice. In this way, it might be possible to guide users’ speech to fall within a range easily processed by the speech recognizer without explicitly instructing users.
EVALUATION
''Spoken Multimodal Human-Computer Dialogue in Mobile Environments'' is an interesting and valuable contribution to the NLP community. It offers a good overview of the field of multimodal, mobile man-machine interaction and the different research areas within this field. A shortcoming of the book is that, since the original papers are from 2002, some aspects, especially in the second part on architectures and frameworks, are now out of date. For example, continuous speech recognition on PDAs and handheld computers has meanwhile become possible (cf. e.g. Köhler et al. 2005). Nevertheless, the book also forms a good starting point for students who want to specialize in the area of multimodal mobile dialogue systems.
REFERENCES
Furui, S. (2000). Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Turkey.
Köhler, T., Fügen, C., Stüker, S. & Waibel, A. (2005). Rapid Porting of ASR-Systems to Mobile Devices, Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech), Portugal.
McTear, M. (2002). Spoken Dialogue Technology: Enabling the Conversational Interface, ACM Computing Surveys, volume 34(1), pages 90-169.
Seneff, S., Hurley, E., Lau, R., Pao, C., Schmid, P. & Zue, V. (1998). GALAXY-II: A Reference Architecture for Conversational System Development. Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 931-934, Australia.
Wahlster, W., Reithinger, N. & Blocher, A. (2001). SmartKom: Multimodal Communication with a Life-like Character. Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), pages 1547-1550, Denmark.