Date: Thu, 08 Apr 2004 17:25:39 +0200
From: Roser Morante
Subject: Current and New Directions in Discourse and Dialogue
EDITOR: van Kuppevelt, Jan; Smith, Ronnie W.
TITLE: Current and New Directions in Discourse and Dialogue
SERIES: Text, Speech and Language Technology 22
PUBLISHER: Kluwer Academic Publishers
YEAR: 2004
ANNOUNCED IN: http://linguistlist.org/issues/15/15-138.html
Roser Morante, Computational Linguistics and AI section, Faculty of Arts, University of Tilburg.
"Current and New Directions in Discourse and Dialogue" is a collection of sixteen papers. Twelve of them are extended versions of the papers presented in the Second SIGdial Workshop on Discourse and Dialogue held in September 2001 in Aalborg, Denmark. The rest are invited papers. As the editors point out, the three main themes addressed in the book are: (i) corpus annotation and analysis: chapters 1, 3, 5; (ii) methodologies for construction of dialogue systems: chapters 2, 6, 10, 12, 15; and (iii) perspectives on various theoretical issues: chapters 4 (communicative intention), 7 (human- computer versus human-human dialogues), 8 (context-based generation), 9 and 11 (clarification requests), 13 (conversational implicatures), 14 (modeling of discourse structure), 16 (role of interruptions).
The fact that the book gathers papers on several research areas related to discourse and dialogue makes it interesting not only for researchers working in specific areas, but also for those who would like to gain a multidimensional view of current research on discourse and dialogue by reading some representative articles. However, the reader should not expect to find machine learning studies in dialogue modelling.
Chapter 1, "Annotations and tools for an activity based spoken language corpus", written by Jens Allwood, Leif Grönqvist, Elisabeth Ahlsén, and Magnus Gumnarson, describes the Spoken Language Corpus of Swedish developed in the Department of Linguistics at Göteborg Universtiy (GSLC) and the various types of tools and analysis that have been developed for work on this corpus. The corpus contains 1.3 million words of naturalistic spoken language data, its original feature being the number of social activities recorded (25). The tools described are the following: TransTool, a tool for transcribing spoken language following a transcription standard (GTS 6.2, MSO6); Corpus Browser, a web interface that makes it possible to search for words, word combinations and phrases; Tractor, a coding tool to create new coding schemas and annotate transcriptions; a toolbox that allows to visualize coding schemas and coding directly in the transcription as a FrameMaker document; TraSA, a tool that calculates 30 statistical measurements; SyncTool, a tool that synchronizes transcriptions with digitized audio/video recordings; and MultiTool, which is a general tool for linguistic annotation and transcribing of dialogs, as well as browsing, searching and counting. The tools allow synchronizing views pertaining to the same point in time in order to show the same sequence from different points of view. The corpus has been analyzed quantitatively and quantitatively. As for the quantitative analysis, a set of automatically derivable properties of the corpus (volume, ratios, special descriptors, lemma, POS, collocations, frequency lists, sequences of part of speech and similarities) has been defined using the information provided by the transcriptions. The qualitative analysis of the corpus has resulted in the development of coding schemas, which include: social activity and communicative act related coding, communication management related coding, grammatical coding and semantic coding.
Chapter 2, "Using direct variant transduction for rapid development of natural spoken interfaces", is written by Hiyan Alshawi and Shona Douglas. The authors present a new approach to constructing interactive spoken interfaces, the Direct Variant Transduction (DVT), that relies on specifying an application with examples and on classification and pattern-matching techniques. The method aims at addressing two bottlenecks in the development of an interface with natural spoken language: coping with language variation and linking natural language to appropriate actions in the application back-end. The authors first outline the characteristics of the method, which adopts several constraints: (i) applications are constructed using a relatively small number of example inputs from which robust interpretation models are compiled automatically; (ii) no intermediate semantic representations are needed; (iii) confirmation queries posed by the system to the user are constructed automatically from the examples; (iv) dialog control should be simple to specify for simple applications, while allowing the flexibility of delegating this control to another module for more complex applications.
Two applications have been built using this method: one to access e-mail and a call-routing application. The authors continue by describing what needs to be provided by the application builder. An application consists of a set of contexts. Each context provides the mapping between user inputs and application actions that are meaningful in a particular stage of interaction between the user and the system. Next they explain variant expansion and the use of classifiers and matchers. Their approach to handling language variation is twofold: first, they use robust recognition, classification and matching techniques, instead of rules. Second, they expand contexts to include variants not included in the original set of examples provided by the application developer. After that they describe their approach to specifying dialogue control. Finally, the authors present quantitative data that, as they point out, indicate that a very small number of training examples can provide useful performance in a call routing application. The results suggest that the DVT method is a viable option for constructing spoken language applications without specialized expertise.
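The core idea of mapping user inputs to the action of the most similar example within a context can be sketched very simply. The following is a hypothetical illustration only: the similarity measure (token overlap) and the toy call-routing context are invented for the example, whereas the actual system uses trained classifiers and pattern matchers:

```python
# Hypothetical sketch of example-based input-to-action routing in the
# spirit of DVT: each context pairs example inputs with actions, and a
# new utterance is routed to the action of its closest example.
# Similarity here is plain Jaccard token overlap, a stand-in for the
# classification and matching techniques the chapter describes.

def tokens(text):
    return set(text.lower().split())

def route(utterance, context):
    """Return the action paired with the most similar example input."""
    def score(example):
        a, b = tokens(utterance), tokens(example)
        return len(a & b) / len(a | b)
    best_example, action = max(context, key=lambda pair: score(pair[0]))
    return action

# A toy call-routing context: (example input, action) pairs.
context = [
    ("read my new messages", "read_mail"),
    ("delete this message", "delete_mail"),
    ("connect me to billing", "route_billing"),
]

print(route("please read my messages", context))  # -> read_mail
```

Variant expansion would then enlarge each context with paraphrases of the developer's examples, so that inputs like the one above match even without a trained model.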
Chapter 3, "An interface for annotating natural interactivity", is written by Niels Ole Bernsen, Laila Dybkjaer, and Mykola Kolodnytsky. The article focuses on spoken dialogue data annotation. The goal of their research is to build a user-friendly general-purpose tool for coding natural interactive communicative behavior in the framework of the NITE project. The authors start by pointing out the need to have a tool, which enables cross-level and cross-modality coding for all the levels of analysis involved in natural interactivity, since this kind of tool is a key-factor in the development of high quality annotated data and coding schemes. They continue by presenting a review of existing natural interactivity coding tools and by introducing the NITE project coding tool specification. Next they describe the NITE WorkBench (NWB) annotation user interface. It aims to support annotation of any kind of phenomena involved in natural interactive communication. The main requirements of the tool are: supporting working with annotation projects, including meta-data; enabling flexible control of raw data files (audio and/or video); supporting annotation of natural interactive communication at any analytical level and across levels and modalities through the use of existing or new coding schemes; enabling users to specify new coding schemes; and enabling information extraction and analysis of annotated data. Finally, they present the Audio-Visual Annotation Interface: the data structure, where project files are used as organizers; the annotation interface, and the coding scheme interface.
Chapter 4, "Managing communicative intentions with collaborative problem solving", is written by Nate Blaylock, James Allen, and George Ferguson. The work is done in the context of producing a conversational agent. In this paper the authors concern themselves with the intention/language interface level, where communicative intentions are converted to and from a high-level semantic form. The paper proposes a descriptive model of dialogue, based on collaborative problem solving, which defines communicative intentions as attempts to modify a shared collaborative problem-solving state between the user and the system. Following the authors, modeling communicative intentions based on collaborative problem solving allows the behavioral reasoning component of a dialogue system to only worry about problem solving and not about linguistic issues. The model covers a range of collaboration paradigms and models dialogue involving planning and acting. The authors start by describing previous work (models of collaborative planning and models of dialogue). Next they introduce their model of dialogue, showing how communicative intentions are defined. In the model they try to enumerate the complete taxonomy of collaborative problem-solving activities. They continue by exemplifying the model with several dialogue examples and by discussing how the model is used in the TRIPS dialogue system. Finally, they put forward some conclusions and future work.
Chapter 5, "Building a discourse-tagged corpus in the framework of rhetorical structure theory", is written by Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. The authors describe their experience in developing the Rhetorical Structure Theory (RST) corpus. Their goal was to conduct large-scale implementation within the framework of a single discourse theory. The corpus contains 385 documents of American English selected from the Penn Treebank, which are hierarchically annotated in the framework of Rhetorical Structure Theory. The corpus can be mined in order to study discourse related phenomena. The discourse structure is built by means of 78 rhetorical relations and three additional relations to impose structure on the tree. The paper describes the selection of theoretical approach, annotation methodology, training, and quality of assurance. A main feature of the corpus is that it is annotated at different levels: leaf-level, text-level, and mid-level analysis, so that the mining of the corpus can be done at different levels. The authors present three examples: comparison of discourse structure at the leaf-level, comparisons of trees for different styles of news reports at the text-level, and examination of relations at the mid-level analysis.
Chapter 6, "An empirical study of speech recognition errors in human computer dialogue", is written by Marc Caravazza. The author investigates the impact of speech recognition errors on a fully implemented dialogue prototype based on a speech acts formalism. The system is a mixed-initiative conversational interface organized around a human like character, with which the user communicates through speech recognition. The software architecture is a pipeline comprising speech recognition, parsing and dialogue. The main goal of parsing is to produce a semantic structure from which speech acts can be identified. The system operates in a bottom-up fashion, it does not include any specific mechanism for error detection or explicit error handling. The author reports findings on the consequences of speech recognition errors on the identification of speech acts, and the conditions under which the system can be robust to those errors. He provides an empirical classification of system reaction to speech recognition errors, and discusses methods for the specific evaluation of the consequences of speech recognition errors. Caravazza proposes that the method should include a combination of speech acts accuracy metrics and concept efficiency metrics. He concludes by saying that there appears to be no simple correlation between robustness to speech recognition errors and the depth of parsing and interpretation, since there are a number of factors that support the robustness of the system to speech recognition errors, like for example the fact that the dialogue control mechanism triggered by the speech act recognition can contribute to repairing the consequences of speech recognition errors.
Chapter 7, "Comparing several aspects of human-computer and human-human dialogues", is written by Christine Doran, John Aberdeen, Laurie Damianos, and Lynette Hirschman. The authors present results of the experiments carried out to compare human-human (HH) and human-computer (HC) interaction in the context of the Communicator Travel task, a DARPA-funded program. In order to begin an empirical exploration of how certain aspects of the dialogue shed light on differences between HC and HH communication, they annotated dialogues (20 HH and 40 HC) from the air travel domain with several sets of tags: dialogue act (CSTAR consortium tags), initiative (which participant has control at the end of the turn), and unsolicited information (this only for HC dialogues). The paper describes the data, the coding of the corpus and the analysis. The authors point out some findings. In general the conversation is more balanced between traveler and expert in the HH setting, in terms of amount of speech, types of dialogue acts and sharing initiative. As for initiative, in the HC data the experts massively dominated in taking the initiative, whereas in the HH data, users and expert shared the initiative relatively equitably. The results of the experiments show that one of the most salient characteristics of the HC data is that they contain many misunderstandings of a sort that are nearly absent in the HH data. They find that misunderstandings prevalent in the HC data can be classified into three groups (hallucinations, mismatches and prompt after fills), and that these misunderstandings can be detected through the combination of a semantic annotation and an automatic algorithm.
Chapter 8, "Full paraphrase generation for fragments in dialogue", is written by Christian Ebert, Shalom Lappin, Howard Gregory, and Nicolas Nicolov. The authors start by pointing out that one major challenge for any dialogue interpretation system is the proper treatment of fragments. In this paper they show how to generate phrases for fragments of dialogues with SHARDS, which is a system for the resolution of fragments in a dialogue, based on a version of HPSG which integrates the situation semantics-based theory of dialogue context given in KOS. The generator uses a template-filler approach and it does not do any deep generation from an underlying semantic representation. Instead it reuses the results of the parse and interpretation process of SHARDS to dynamically compute the templates, and then to update the filter. The innovation of their approach is that they situate generation within the context of dialogue interpretation, specifically fragment resolution. In doing so they are able to eliminate much of the indeterminacy that characterizes classical generation systems by exploiting the rich syntactic and phonological information produced in the course of dialogue interpretation. The paper starts with the presentation of the system and the grammatical background. Next the proposal for generating fragment paraphrases with templates is explained, and, finally, the implementation of SHARDS and the generation component are described.
Chapter 9, "Disentangling public from non-public meaning", is written by Jonathan Ginzburg. The author starts by claiming that analyses of interaction need to characterize not solely 'success conditions', but also 'clarification potential'. In this paper the author illustrates how characterising certain classes of Clarification Requests (CR) can shed light on the problem of distinguishing publicly expressed communicative effects from non-public ones. First he considers the very productive and effective ways of producing CRs relating to the grammatically governed content of an utterance. Then he turns to CRs that pertain to the non-public intentions of a conversational participant, like Whymeta. He demonstrates that Whymeta shows distinct behaviour from CRs that pertain to grammatically governed content, the most prominent feature being that whereas the latter are almost invariably adjacent to the utterances whose clarification they seek, non-adjacency is quite natural for Whymeta. This leads him to establish the distinction between the notion of utterer's content and utterer's plan. The author provides data to reinforce the distinction between Utterer's Content and Utterer's Plan. He provides some background notions from the KOS framework required for his formalization and applies this to explicate Utterer's Content. Finally he considers a previous analysis of Whymeta and develops his own analysis, which involves viewing it as an instance of a metadiscoursive utterance, instead of as a mechanism that clarifies a contextually instantiable goals/plan parameter.
Chapter 10, "Adaptivity and response generation in a spoken dialogue system", is written by Kristiina Jokinen and Graham Wilcock. The paper addresses the issue of how to increase adaptivity in response generation for a spoken dialogue system. Realization strategies for dialogue responses depend on communicative confidence levels and interaction management goals. They first discuss interaction models and naturalness and give concrete examples from a spoken dialogue system in which different forms of surface realization are required in order to achieve interaction management goals. They continue by describing a Java/XML-based generator which produces different realizations of system responses based on agendas specified by the dialogue manager. The way in which the generator chooses between the different realizations is based on detailed specifications of the information status of different concepts, given in an agenda by the dialogue manager component. They then discuss how greater adaptivity can be achieved by using a set of distinct generator agents, each of which is specialized in its realization strategy. This allows a simpler design of each generator agent while increasing the overall system adaptivity to meet the requirements for flexible cooperation in incremental and immediate interactive situations.
Chapter 11, "On the means of clarification in dialogue", is written by Matthew Purver, Jonathan Ginzburg, and Patrick Healey. In this paper they describe an attempt to exhaustively categorise Clarification Requests (CR) forms and readings based on corpus work. CR can take various forms and can be used to request various types of clarification information, but have in common the fact that they are in some sense utterance-anaphoric. Thus the corpus work has the additional aim of identification of the maximum distance between a CR and the utterance being clarified. The authors start by discussing previous work on CR. Then they list the possible CR (non-reprise clarifications, reprise sentences, reprise sluices, reprise fragments, gaps, gap fillers, and conventional) and readings (clausal, constituent, lexical, and corrections) that they identify from corpus analysis. They continue by describing the analysis of the corpus. Finally, they discuss the implications of their results for a possible HPSG analysis of clarification requests and for an ongoing implementation of a clarification-capable dialogue system.
Chapter 12, "Plug and play spoken dialogue processing", is written by Manny Rayner, Johan Boye, Ian Lewin, and Genevieve Gorrell. The paper contains a description of a spoken language dialogue system architecture which supports plug and playable networks of objects. The discussion centres around a concrete prototype system, CANTONA. The main point of plug and play spoken language dialogue is that at any given time the system's dialogue capabilities are determined by the set of devices currently connected; adding new devices dynamically changes its ability to recognise, understand, and respond to commands. The authors first introduce the top- level components and the key interfaces of the CANTONA Plug and Play demonstrator, where all processing is rule-based. Next they describe the general architecture considerations to achieving plug and play functionality in rule-based systems, the rules and hierarchies, the plug and play response generation and the speech recognition and parsing.
Chapter 13, "Conversational implicatures and communication theory, is written by Robert van Rooy. This paper presents an account for implicatures in terms of a mathematical theory of communicaiton. Following the author, from a standard pragmatics perspective conversational implicatures should be accounted for in terms of Grice's maxims of conversation. Neo-Giceans seek to reduce those maxims to the so-called Q and I-principles. In this paper the author argues that: (i) there are major problems for reducing Gricean pragmatics to these two principles, and (ii) that, in fact, it is better to account for implicatures in terms of the principles of (a) optimal relevance and (b) optimal coding. To formulate these principles the author makes use of Shannon's mathematical theory of communication.
Chapter 14, "Reconciling control and discourse structure", is written by Susan E. Strayer, Peter A. Heeman, and Fan Yang. In this paper the authors consider how control, in the sense of initiative, is managed in task-oriented dialogues. They first describe previous work in discourse structure and in control, and they present their coding of the corpus (eight dialogues of the TRAINS corpus) with subdialogues and control tags based on the DAMSL coding schema. Next they explore the relationship between discourse structure and control: (i) they compare control boundaries to subdialogue boundaries using recall and precision; (ii) they look at control inside of discourse segments. Then they explore how control can shift within a subdialogue and find two types of contributions that a speaker can make in a discourse segment: collaborative completions, in which the non-initiator helps the segment initiator achieve their goal, and short contributions to the discourse segment purpose. They find that collaborative completions and co- contributions are exceptions to the general rule that control tends to reside with the same speaker. Based on the results of their study they propose that control is subordinate to the intentional structure. Control is held by the segment initiator. They point out the implications for dialogue management: a system only needs to model intentional structure, from which control follows.
Chapter 15, "The information state approach to dialogue management", is written by David R. Traum and Staffan Larsson. In this paper the authors introduce the information state approach to dialogue management, and show how it can be used to formalize theories of dialogue in a manner suitable for easy implementation. The authors start by defining what dialogue management is. Then they propose two contributions towards solving the problem of dialogue management re- use: first, unifying a view of dialogue management that can help organize the relationship between dialogue theories and implementations. The unifying view includes a proposal to formalize dialogue management functions in terms of information state update. Second, software tools that can help to achieve reusable dialogue systems. They continue by presenting the information state approach. After that they show how the information state approach can be used to help provide reusable components for dialogue system design, separating three layers: basic software engineering layer, dialogue theory layer, task/domain specific layer. Then they describe TrindiKit, a tool that provides the basic software engineering glue that can be used to implement a dialogue manager at a level closer to linguistic theories than other existing toolkits. Next they illustrate some of the systems that have been built using TrindiKit. Finallly they describe how the separation of architecture layers previously defined has led to actual reuse in a number of dialogue systems.
Chapter 16, "Visualizing spoken discourse, is written by Li-Chung Yang. The goal of the study is to look at the distribution of interruptive occurrences in natural speech, and investigate their respective functions and characteristics. It is shown that interruptions are important elements in the interactive character of discourse and in the resolution of issues of cognitive uncertainty and planning. The author analyses what are the different types of interruptions and to what extent are prosodic-acoustic features significant in distinguishing between the different types of interruptions. He distinguishes between cooperative and competitive interruptions. The specific pitch height of the interruption varies with the expression of emotion, signals of attention-getting, and signals of competitiveness. In general, competitive interruptions are marked by a high pitch level and a loud amplitude, expressing the participant's competition for the focus of attention. By contrast, cooperative interruptions are more supportive of the main speaker's floor rights. Because of their non-disruptive nature, they often occur at low or medium pitch levels and they are generally lower in pitch than competitive interruptions. Their amplitude can vary. To conclude the author analyses the implications for dialogue systems.