EDITORS: Minker, W.; Bühler, Dirk; Dybkjaer, Laila
TITLE: Spoken Multimodal Human-Computer Dialogue in Mobile Environments
SERIES: Text, Speech and Language Technology 28
Petra Gieselmann, Interactive Systems Lab, University of Karlsruhe
This book contains a collection of 20 papers presenting current work in the
field of development and evaluation of multimodal dialogue systems with a
special emphasis on mobile environments. It is based on publications from
an ISCA Tutorial and Research Workshop on Multimodal Dialogue in Mobile
Environments held in 2002. The book combines overview chapters of key
areas in spoken multimodal dialogue systems with chapters focusing on
particular applications or problems in the field. In particular, it deals
with the influence of the mobile environment when building and evaluating
In the introduction, the editors provide a short overview of the book which
is divided into three different parts: The first part describes some
general issues on multimodal dialogue systems and components. The second
part deals with system architectures and some example applications.
Finally, the third part concerns evaluation and usability of dialogue systems.
Chapter 1, ''Multimodal Dialogue Systems'' by Alexander Rudnicky, provides an
overview of interesting research issues and challenges within multimodal
dialogue systems, such as detection of intentional user input, appropriate
use of interaction modalities, management of dialogue history and context,
using domain reasoning for incorporating intelligence into the system, and
appropriate output planning. Most of these challenges are explained in more
detail by the chapters of the first part.
Next is Sadaoki Furui's contribution, ''Speech Recognition Technology in
Multimodal/Ubiquitous Computing Environments''. Furui discusses the state of
the art in speech recognition and addresses the two major application areas
of speech recognition: Dialogue systems and speech transcription systems
for meeting minutes (cf. Furui 2000). In order to build speech recognition
systems robust against acoustic and linguistic variation in spontaneous
speech, Furui proposes a paradigm shift from speech recognition to
understanding so that the meaning of the user’s input is delivered. To this
end, world knowledge has to be integrated and efficient methods to
represent, store, retrieve and generally use it have to be evaluated.
In Chapter 3 Satoshi Tamura et al. propose ''A Robust Multimodal Speech
Recognition Method Using Optical Flow Analysis''. This chapter is the only
one dealing with audio-visual speech recognition. Their approach seems
promising in coupling acoustic speech signals with visual information, such
as lip movements. It is robust against visual and acoustic noise and
outperforms audio-only speech recognition.
The fourth chapter, ''Feature Functions for Tree-based Dialogue Course
Management'' by Klaus Macherey and Hermann Ney, discusses some assumptions
of application-independent dialogue management. In order to be able to
create such application-independent dialogue managers, they determine the
steps which many domains have in common and derive parametrizable data
structures for the knowledge of each domain in the form of trees. The tree
nodes encode concepts and the edges represent relations between those
concepts. During a dialogue session, instance trees are built from the
original knowledge tree and a cost function is used to calculate the most
direct path through the tree to achieve the user goal. The results of an
evaluation with the implemented tree-based dialogue manager in a telephone
directory assistance system seem promising.
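The tree-based idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the toy cost function `open_slots` (which simply counts unfilled leaf concepts) and the directory assistance concepts are all assumptions made for illustration, and the feature functions described in the chapter are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A concept in the knowledge tree; edges to children are relations."""
    concept: str
    children: List["Node"] = field(default_factory=list)
    value: Optional[str] = None  # filled in the per-dialogue instance tree

    def is_leaf(self) -> bool:
        return not self.children


def open_slots(node: Node) -> int:
    """Toy cost (assumption): leaf concepts under `node` still lacking a value."""
    if node.is_leaf():
        return 0 if node.value is not None else 1
    return sum(open_slots(child) for child in node.children)


def next_question(node: Node) -> Optional[str]:
    """Follow the cheapest unfinished branch down to an unfilled leaf."""
    if node.is_leaf():
        return node.concept if node.value is None else None
    pending = [c for c in node.children if open_slots(c) > 0]
    if not pending:
        return None  # every concept is filled: the user goal is reached
    return next_question(min(pending, key=open_slots))


def directory_tree() -> Node:
    """Hypothetical knowledge tree for a directory assistance domain."""
    return Node("request", [
        Node("person", [Node("last_name"), Node("first_name")]),
        Node("location", [Node("city")]),
    ])


# Build an instance tree for one dialogue and fill one slot from user input.
tree = directory_tree()
tree.children[0].children[0].value = "Ney"
print(next_question(tree))  # prints "first_name"
```

Porting such a manager to another domain would, under these assumptions, only require a different knowledge tree, while `next_question` stays unchanged.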
Chapter 5, ''A Reasoning Component for Information-Seeking and Planning
Dialogues'' by Dirk Bühler and Wolfgang Minker, presents a logic-based
problem assistant for dialogue systems. During the dialogue, this problem
assistant processes constraints from the user on the one hand and the system’s
domain-specific information on the other hand by finite model generation.
During the inference process, the assistant generates transparent
information for the users so that they can understand the inferences drawn
and get a deeper insight into the whole process. This results in a
collaborative conflict resolution process which has been illustrated in a
meeting appointment system.
Chapter 6, entitled ''A Model for Multimodal Dialogue System Output Applied
to an Animated Talking Head'', by Jonas Beskow et al. concerns multimodal
output generation and concludes the first part of the book. They propose a
new markup formalism to annotate the verbal and non-verbal output generated
by the dialogue manager to the user. The output specification is XML-based
and provides information about communicative functions of the output, but
at the same time does not constrain the realization of these functions
within different systems with various output devices and modalities.
The second part about ''System Architecture and Example Implementations''
starts with an ''Overview of System Architecture'' by Andreas Kellner. Since
multimodal dialogue systems have recently evolved from speech-only
server-based systems such as call-centre or travel information systems
(McTear 2002) to personal, mobile interaction partners such as PDAs or
handheld computers, a number of new requirements for the underlying
architecture arise. Kellner presents two different system architectures
facing these challenges: the Galaxy communicator infrastructure used by the
US American DARPA, and the SmartKom testbed developed within a German
national research project (Wahlster et al. 2001). In addition, Kellner
provides an overview of standards for speech-enabled applications, such as
VoiceXML and SALT.
Chapter 8, ''XISL: A Modality-Independent MMI Description Language'' by
Kouichi Katsurada et al., proposes a description language for the
man-machine interface independent of the modalities available. By means of
this language, new modalities can be easily added and existing ones
modified. They evaluate the descriptive power of this language by
implementing an online shopping application with different terminals, such
as a PC terminal, a mobile phone terminal, and a PDA terminal.
The ninth chapter, ''A Path to Multimodal Data Services for
Telecommunications'' by Georg Niklfeld et al., discusses some assumptions
of multimodal interfaces for public mobile telecommunications. They develop
three demonstrators taking the current technical means available on the
consumer market into account. The results of the evaluation show that even
quite simple forms of multimodality can bring usability advantages on
Chapter 10 by Roberto Pieraccini et al. deals with ''Multimodal Spoken
Dialogue with Wireless Devices''. They focus on the design of a user
interface that exploits the complementary features of audio and video
channels to enhance usability. Given the wide variety of situations in which
the user could interact with a mobile device, the balance of audio and
visual interaction differs enormously. Pieraccini and his colleagues
evaluate situations when modalities are or are not preferred within their
two implementations for mapping and navigating to points of interest.
Chapter 11, ''The SmartKom Mobile Car Prototype System for Flexible
Human-Machine Communication'' by Dirk Bühler and Wolfgang Minker, also
considers different contexts of use of mobile devices and likewise deals
with map interaction and navigation. Since the system is intended for car
drivers and pedestrians, it has to adapt itself to the current environment
while also letting the user adapt it. Particular requirements for modality control are
The next chapter by Dan Bohus and Alexander Rudnicky deals with ''LARRI: A
Language-Based Maintenance and Repair Assistant'' and concludes the second
part of the book about architectures and sample implementations of
multimodal mobile applications. They focus on adaptations necessary to port
a dialogue system from one domain to another. As an example, they take a
travel-planning system and port it to the aircraft maintenance domain and
evaluate the system with professional aircraft mechanics. The results
reveal that in this domain the criterion for success of an interaction is
jointly determined by the system and the user. The system has to support
the user in tracking various individual tasks and in managing the change of
focus between them. In addition, they change the interface design to
feature multimodal interaction where an optimal balance between spoken and
graphical output is necessary. Given the large and constantly changing
maintenance and repair domain, this stresses the importance
of automating the creation of domain information into spoken language-based
The third part about ''Evaluation and Usability'' starts with an ''Overview of
Evaluation and Usability'' by Laila Dybkjaer et al. They provide an overview
of the state of the art and review some initiatives and projects with a
main focus on evaluation and usability, such as ATIS (Air Travel
Information System), Evalda, EAGLES (Expert Advisory Group on Language
Engineering Standards), DISC (Spoken Language Dialogue Systems and
Components), the Danish Dialogue Project, and the Paradise Framework
(Paradigm for Dialogue System Evaluation). The recent emergence of
multimodal, mobile and non-task-oriented systems continues to pose entirely
new challenges, such as online user modelling, recognition of user emotions
and personality, non-task-oriented dialogues, mobile environments, and
modelling user preferences and priorities. Some of these challenges are
addressed in more detail by the following chapters.
The fourteenth chapter, ''Evaluating Dialogue Strategies in Multimodal
Dialogue Systems'' by Steve Whittaker and Marilyn Walker, discusses some
assumptions of information presentation coupled with user models within
multimodal systems. They present two evaluation methods that could expedite
the design process: a Wizard-of-Oz data collection and an overhearer
evaluation experiment using logged interactions with the real system. The
testbed application consists of a multimodal map interaction and general
city help. By means of the Wizard-of-Oz experiment, user tasks and
strategies could be determined and various presentation strategies piloted.
Although rich data can be collected by means of this experiment, the
dialogue context in which various presentation strategies are generated
cannot be carefully controlled. In contrast, the overhearer method allows
more control: The idea behind this method is that participants overhear
a series of turns from several dialogues that have previously been logged
as successful interactions with the system. The participants then judge the
quality of the system’s output and give general feedback on the
interaction. The results of this method reveal that dialogues with
individual user models score higher than those with a default user model.
Chapter 15, entitled ''Enhancing the Usability of Multimodal Virtual
Co-Drivers'', by Niels Ole Bernsen and Laila Dybkjaer concerns an in-car
navigation system. They analyze in detail the following important issues
for in-car information systems development: when should the system listen
vs. not listen to speech; when to use the in-car display vs. spoken
driver-system dialogue; how to identify the present driver to build a user
model; and how to create an online adaptive user model of the driver.
Chapter 16, ''Design, Implementation and Evaluation of the SENECA Spoken
Language Dialogue System'' by Wolfgang Minker et al., also considers an
in-car navigation system. The evaluation with users who were driving while
carrying out predefined tasks shows that speech input is faster, driving
skills are less affected by speech input, and users prefer speech input.
However, the task completion rate is considerably lower with speech input
compared to manual input because the participants use a variety of
The next chapter by Sabine Geldof and Robert Dale deals with ''Segmenting
Route Description for Mobile Devices''. They also use a navigation system,
but focus on route descriptions on mobile devices. Since these descriptions
must be easy to understand and remember and, given the small screen size
of mobile devices, also short, they propose a
summarized route description which can be expanded in a tree-like way to
provide more detail. They discuss various segmentation strategies and
compare their approach with a route description presented as a flat
numbered list of instructions. The results show that two thirds of the users
preferred the tree structure.
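A summarized description that expands into detail can be pictured as a two-level tree. The following sketch only illustrates that general idea: the `Segment` class, the `render` function and the sample route are hypothetical and do not reproduce the segmentation strategies or the interface from the chapter.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Segment:
    """One summarized stretch of the route plus its detailed instructions."""
    summary: str
    steps: List[str] = field(default_factory=list)


def render(segments: List[Segment], expanded: Set[int]) -> List[str]:
    """Return display lines; only segments listed in `expanded` show steps."""
    lines = []
    for index, segment in enumerate(segments):
        marker = "-" if index in expanded else "+"
        lines.append(f"{marker} {segment.summary}")
        if index in expanded:
            lines.extend(f"    {step}" for step in segment.steps)
    return lines


# Hypothetical route, invented for this example.
route = [
    Segment("Leave the campus", ["Exit the car park", "Turn left at the gate"]),
    Segment("Follow the main road", ["Continue for 2 km", "Pass the station"]),
    Segment("Arrive at the harbour", ["Turn right at the lights"]),
]

# Collapsed view needs three lines; expanding one segment adds its steps.
for line in render(route, expanded={1}):
    print(line)
```

With nothing expanded the whole route fits in three lines, which is the property that matters on a small screen; a flat numbered list would always show all five instructions.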
Chapter 18, ''Effects of Prolonged Use on the Usability of a Multimodal
Form-Filling Interface'' by Janienke Sturm et al., considers the ongoing use
of a train timetable information system and the development of interaction
patterns over time for such a multimodal dialogue system. The user study
shows that with practice the users learn to develop interaction patterns
that ensure a more reliable and efficient interaction, resulting in a
decreased dialogue duration and higher user satisfaction.
The nineteenth chapter, ''User Multitasking with Mobile Multimodal Systems''
by Anthony Jameson and Kerstin Klöckner, addresses the
fact that users often perform several tasks simultaneously while
interacting with a mobile system. In their experiment, the participants use
a mobile phone while walking. Eye-based vs. ear-based interaction methods
are evaluated showing that users’ behaviour is strongly influenced by
subjective factors, such as habits from other tasks, experiences with
similar phones, dislikes related to the design, and ideas on what is
Chapter 20 by Sharon Oviatt et al. deals with ''Speech Convergence with
Animated Personas''. They investigate the convergence of users’ speech with
the text-to-speech (TTS) heard from the dialogue system. In their user
study, young children interact with animated marine animals. The output
voices are tailored to represent opposite ends of the introvert-extrovert
personality spectrum. An analysis of children’s amplitude, durational
features, and dialogue response latencies confirms that they spontaneously
adapt basic acoustic-prosodic features of their speech to the TTS. The
adaptations are bi-directional and are dynamically readapted when the
children are introduced to
a new TTS voice. In this way, it might be possible to guide users’ speech
to fall within a range easily processed by the speech recognizer without
explicitly instructing users.
''Spoken multimodal human-computer dialogue in mobile environments'' is an
interesting and valuable contribution to the NLP community. It offers a
good overview of the field of multimodal, mobile man-machine interaction
and the different research areas within this field. A shortcoming of this
book might be that, since the original papers are from 2002, some aspects,
especially of the second part about architectures and frameworks, are
now out of date. For example, continuous speech
recognition on PDAs and handheld computers has since become possible (cf. e.g.
Köhler et al. 2005). Nevertheless, the book is also a good starting
point for students who want to specialize within the area of multimodal
mobile dialogue systems.
Furui, S. (2000). Speech Recognition Technology in the Ubiquitous/Wearable
Computing Environment, Proceedings of the International Conference on
Acoustics, Speech and Signal Processing (ICASSP), Turkey.
Köhler, T., Fügen, C., Stüker, S. & Waibel, A. (2005). Rapid Porting of
ASR-Systems to Mobile Devices, Proceedings of the 9th European Conference
on Speech Communication and Technology (Interspeech), Portugal.
McTear, M. (2002). Spoken Dialogue Technology: Enabling the Conversational
Interface, ACM Computing Surveys, volume 34(1), pages 90-169.
Seneff, S., Hurley, E., Lau, R., Pao, C., Schmid, P. & Zue, V. (1998).
GALAXY-II: A Reference Architecture for Conversational System Development.
Proceedings of the International Conference on Spoken Language Processing
(ICSLP), pages 931-934, Australia.
Wahlster, W., Reithinger, N. & Blocher, A. (2001). SmartKom: Multimodal
Communication with a Life-like Character. Proceedings of European
Conference on Speech Communication and Technology (EUROSPEECH), pages