Speech Widgets

Dan R. Olsen Jr., S. Travis Nielsen, Matt Reimann
Computer Science Department, Brigham Young University, Provo, Utah, 84602, USA
{olsen, nielsent, ratt}@cs.byu.edu

ABSTRACT

Spoken language interfaces are difficult to develop. We have developed a set of widgets for building speech interfaces by composition, similar to toolkits for graphical user interfaces. Our speech widgets presume that users will learn an artificial language that can be universally applied to all applications built with the widgets. We describe lessons learned from having naïve users use the widgets. Experiments with naïve users show that speech widget interfaces have performance times comparable to those of physical button interfaces and that knowledge of the speech widgets transfers from application to application. This learning transfer results in reduced performance times.

Keywords

spoken language interfaces, toolkits, widgets

INTRODUCTION

One of the important future directions for interactive systems is to push human-computer interaction into more physical situations than the simple desktop workstation. One of the most promising of these technologies is speech. In recent years commercial speech recognition software with usable accuracy has become available. Though significant work still remains to increase recognition accuracy, it is now possible to do more extensive work on exactly how spoken language interfaces should work.

Spoken language has some significant advantages over visual-based interfaces. It can be used when either the hands or the eyes are otherwise engaged. Most people can speak desired values faster than they can type them or enter them with a mouse. Speech also has a decided advantage in terms of low power requirements and very small physical form factors. One of our scenarios for the use of interactive speech is a very small device that has a single “push-to-talk” button as its interface along with a speaker and microphone. If Moore’s law makes computing cheap and speech recognition becomes highly accurate, this can become a model for highly functional, unobtrusive interactive devices.

In this paper we explicitly separate spoken language interaction from natural language interaction. One of the arguments in favor of speech interaction has been the naturalness of speech. This has tended to equate spoken language interfaces with natural language understanding. In our work we have separated these two concepts and set aside the natural language portion. We have done this for
two reasons. The first is the difficulty of the natural language interaction problem and the second is a series of lessons learned from developing tools for graphical user interfaces.

One of the problems with spoken language interfaces is that they are invisible. Unlike graphical user interfaces they contain no external affordances to tell a user what they can say so that the system will understand them. Natural language offers the solution of “users will say whatever they want.” This seems very attractive, but has several problems. Natural language, as used when people speak to each other, contains many contextual references to mutually understood world knowledge. Even people speaking the same language but not sharing the same knowledge base have trouble understanding each other. Filling a small interactive device with enough world context to support such an interface remains problematic. When encountering a new device or interface, a user must always enter a discovery process to understand what that interface can do. This is true even when using natural language and human assistance. Our strategy is to provide a standard discovery process that simultaneously teaches the user our restricted user interface language while working with the interface.

Our work is focused on creating tools that designers can use to rapidly develop new applications and is also strongly biased by the XWeb project’s cross-modal interaction goals [8]. Many speech-based tools and development approaches use grammar-based techniques from natural language understanding. Work on Air Travel Information Systems (ATIS) has this characteristic[11]. Simpler tools such as the CSLU Toolkit [9] and Suede[3] use a flowchart model of a conversation for designing sequences of prompts, questions, answers and conversational recovery. The focus of these tools is on designing a conversation for a specific application. Our approach is to build a generalized interactive client, which can browse and edit information for a huge variety of applications. This is very different from the application-specific efforts of Chatter[5], SpeechActs[12], and Wildfire[13].

In developing speech tools we are guided by our experience in building tools for graphical user interfaces. In the very early days of interactive graphics all of our models for interaction were based on linguistic models such as automata or grammars. A driving assumption of the
early days of User Interface Management Systems (UIMS) was that the tools needed the power to express any interactive dialog, because that would allow the designer to craft dialogs that were most appropriate and most effective for a particular need. In graphical user interfaces, this universal dialog approach has been almost completely replaced by the concept of components or widgets. In the widget model, fragments of predefined interaction are packaged in such a manner that they can be easily composed together to form an interface to virtually any application. The widget approach allows many programmers to completely ignore the design of input event syntax by using prepackaged pieces. Unlike linguistic parsing, the widget approach tightly integrates system feedback with user input. The user also benefits because all applications are composed of the same syntactic pieces. The transfer of learning between applications is enhanced because similar pieces behave in similar ways. This component-based approach is the one that we have pursued in our development of speech tools.

Our speech widgets are explicitly not conversational. Our dialog model is that each device has a set of information that the user is either trying to understand or to modify. There is a fixed set of commands that is universal for all applications. Our focus on a small fixed set of commands was inspired by Arons’ Hyperspeech[1] work, where increased user satisfaction and performance were reported when the command set was uniform for all nodes. As such our dialogs are strictly user-initiated rather than the system-initiated or mixed-initiative styles studied by Walker[10].

The paper proceeds by first outlining the forms of interaction imposed by the XWeb model and briefly discussing how such interfaces are specified. We then discuss how the abstract interfaces of XWeb are implemented in a speech-only interactive client. This is followed by a discussion of our formative evaluations with users and the lessons we learned in making our speech widgets usable. We conclude with the results of a limited user study comparing our speech-based techniques against the physical interfaces of common automated devices.

XWEB INTERACTORS

The XWeb architecture is intended to distribute interactive services over the Internet to a wide variety of interactive clients[8]. All information to be manipulated is represented as XML trees. Interactive control of processes and devices is modeled as interactive editing of control state information. For example, a home automation thermostat has several settings that are presented to XWeb via a special server implementation. The user controls the thermostat using an XWeb client to modify those settings.

Interfaces in XWeb are specified in an XView. An XView is also encoded in XML and consists of a set of abstract
interactors. The primitive interactors that manage atomic values are enumeration, number, time, date and text. These can be composed into larger interfaces using the group, list and link interactors. These eight interactors are used to build any XWeb interface. Recently we have added interactors for 2D and 3D spaces, but no speech implementations for these have been attempted. Our earlier work on speech interaction in 2D spaces[7] has not yet been incorporated into XWeb.

Each interactor is tied to a specific data object in an XWeb server’s XML tree. The interactor’s purpose is to present that data to the user and to make modifications to that data as requested by the user. Consider the example of a simple automated thermostat that uses different temperature settings for different times of the day. Such a thermostat could be represented as an XML tree, and a highly abbreviated XView would describe its interface.
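To make the example concrete, the following is a minimal sketch of what such a thermostat data tree might look like. It is illustrative only, not the actual encoding used by the XWeb thermostat server; the element and attribute names (Thermostat, Setting, period, time, temp) are assumptions introduced for this sketch:

   <!-- Hypothetical control-state tree: one Setting per time-of-day temperature change -->
   <Thermostat>
      <Setting period="Wake"   time="6:30"  temp="72"/>
      <Setting period="Leave"  time="8:00"  temp="65"/>
      <Setting period="Return" time="17:30" temp="72"/>
      <Setting period="Sleep"  time="22:00" temp="60"/>
   </Thermostat>

A correspondingly abbreviated XView, again only a sketch, would compose the interactors named above. The tag and attribute names here (Group, List, Enum, Time, Number, ref) are likewise hypothetical stand-ins for whatever syntax XView actually uses to bind an interactor to a node in the server’s tree:

   <!-- Hypothetical XView: group/list interactors compose the primitive interactors -->
   <Group ref="Thermostat">
      <List ref="Setting">
         <Enum ref="period"/>
         <Time ref="time"/>
         <Number ref="temp" min="50" max="90"/>
      </List>
   </Group>

In this sketch each primitive interactor (enumeration, time, number) presents and edits one attribute of a Setting, the list interactor handles the repeated Setting elements, and the group interactor ties the whole view to the thermostat’s data tree, illustrating how a small interface is built entirely by composing the eight standard interactors.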