Most dialogues are multimodal. When people talk, they use not only their voices, but also facial expressions and other gestures, and perhaps even touch. When computers communicate with people, they use pictures and perhaps sounds, together with textual language, and when people communicate with computers, they are likely to use mouse "gestures" almost as much as words. How are such multimodal dialogues constructed? This is the main question addressed in this selection of papers of the second "Venaco Workshop", sponsored by the NATO Research Study Group RSG-10 on Automatic Speech Processing, and by the European Speech Communication Association (ESCA).

Contributions by: J. Allwood; D. Hill; S. Selcon and R.M. Taylor; S. Oviatt; A. Murray; S. Candelaria de Ram; H. Bunt; D. Sadek; E. Bilange; D. Luzzati; A. Vilnat; R.J. Beun; J. Edwards and D. Sinclair; M. Tatham, K. Morton and E. Lewis; M. Maybury and J. Lee; J.C. Junqua; C. Cuxac; W. Edmondson; D. Teil and Y. Bellik; F. Gavignet, M. Guyomard and J. Siroux; G. Boudreau and C. McCann; J. Lee; B. Gaiffe, J.-M. Pierrel and L. Romary; M. Taylor and D. Waugh; A. Datta; M. Brooke and M. Tomlinson; C. Benoît.