Speech Widgets

Dan R. Olsen Jr., S. Travis Nielsen, Matt Reimann
Computer Science Department, Brigham Young University, Provo, Utah, 84602, USA
{olsen, nielsent, ratt}@cs.byu.edu

ABSTRACT
Spoken language interfaces are difficult to develop. We have developed a set of widgets for building speech interfaces by composition, similar to toolkits for graphical user interfaces. Our speech widgets presume that users will learn an artificial language that can be universally applied to all applications built with the widgets. We describe lessons learned from having naïve users use the widgets. Experiments with naïve users show that speech widget interfaces have performance times comparable to physical button interfaces and that knowledge of the speech widgets transfers from application to application. This learning transfer results in reduced performance times.

Keywords
spoken language interfaces, toolkits, widgets

INTRODUCTION
One of the important future directions for interactive systems is to push human-computer interaction into more physical situations than the simple desktop workstation. One of the most promising of these technologies is speech. In recent years commercial speech recognition software with usable accuracy has become available. Though significant work still remains to increase recognition accuracy, it is now possible to do more extensive work on exactly how spoken language interfaces should work.

Spoken language has some significant advantages over visually based interfaces. It can be used when either the hands or the eyes are otherwise engaged. Most people can speak desired values faster than they can type them or enter them with a mouse. Speech also has a decided advantage in terms of low power requirements and very small physical form factors. One of our scenarios for the use of interactive speech is a very small device that has a single “push-to-talk” button as its interface along with a speaker and microphone. If Moore’s law makes computing cheap and speech recognition becomes highly accurate, this can become a model for highly functional, unobtrusive interactive devices.

In this paper we explicitly separate spoken language interaction from natural language interaction. One of the arguments in favor of speech interaction has been the naturalness of speech. This has tended to equate spoken language interfaces with natural language understanding. In our work we have separated these two concepts and set aside the natural language portion. We have done this for
two reasons. The first is the difficulty of the natural language interaction problem and the second is a series of lessons learned from developing tools for graphical user interfaces.

One of the problems with spoken language interfaces is that they are invisible. Unlike graphical user interfaces they contain no external affordances to tell users what they can say so that the system will understand them. Natural language offers the solution of “users will say whatever they want.” This seems very attractive, but has several problems. Natural language, as used when people speak to each other, contains many contextual references to mutually understood world knowledge. Even people speaking the same language but not sharing the same knowledge base have trouble understanding each other. Filling a small interactive device with enough world context to support such an interface remains problematic. When encountering a new device or interface a user must always enter a discovery process to understand what that interface can do. This is true even when using natural language and human assistance. Our strategy is to provide a standard discovery process that simultaneously teaches the user our restricted user interface language while working with the interface.

Our work is focused on creating tools that designers can use to rapidly develop new applications and is also strongly biased by the XWeb project’s cross-modal interaction goals [8]. Many speech-based tools and development approaches use grammar-based approaches from natural language understanding. Work on Air Travel Information Systems (ATIS) has this characteristic [11]. Simpler tools such as the CSLU Toolkit [9] and Suede [3] use a flowchart model of a conversation for designing sequences of prompts, questions, answers, and conversational recovery. The focus of these tools is on designing a conversation for a specific application. Our approach is to build a generalized interactive client, which can browse and edit information for a huge variety of applications. This is very different from the application-specific efforts of Chatter [5], SpeechActs [12], and Wildfire [13].

In developing speech tools we are guided by our experience in building tools for graphical user interfaces. In the very early days of interactive graphics all of our models for interaction were based on linguistic models such as automata or grammars. A driving assumption of the
early days of User Interface Management Systems (UIMS) was that the tools needed the power to express any interactive dialog because that would allow the designer to craft dialogs that were most appropriate and most effective for a particular need. In graphical user interfaces, this universal dialog approach has been almost completely replaced by the concept of components or widgets. In the widget model, fragments of predefined interaction are packaged in such a manner that they can be easily composed together to form an interface to virtually any application. The widget approach allows many programmers to completely ignore the design of input event syntax by using prepackaged pieces. Unlike linguistic parsing, the widget approach tightly integrates system feedback with user input. Users also benefit because all applications are composed of the same syntactic pieces. The transfer of learning between applications is enhanced because similar pieces behave in similar ways. This component-based approach is the one that we have pursued in our development of speech tools.

Our speech widgets are explicitly not conversational. Our dialog model is that each device has a set of information that the user is either trying to understand or to modify. There is a fixed set of commands that is universal for all applications. Our focus on a small fixed set of commands was inspired by Arons’ Hyperspeech [1] work, where increased user satisfaction and performance were reported when the command set was uniform for all nodes. As such our dialogs are strictly user-initiated rather than the system-initiated or mixed-initiative styles studied by Walker [10].

The paper proceeds by first outlining the forms of interaction imposed by the XWeb model and briefly discussing how such interfaces are specified. We then discuss how the abstract interfaces of XWeb are implemented in a speech-only interactive client. This is followed by a discussion of our formative evaluations with users and the lessons we learned in making our speech widgets usable. We conclude with the results of a limited user study comparing our speech-based techniques against the physical interfaces of common automated devices.

XWEB INTERACTORS
The XWeb architecture is intended to distribute interactive services over the Internet to a wide variety of interactive clients [8]. All information to be manipulated is represented as XML trees. Interactive control of processes and devices is modeled as interactive editing of control state information. For example, a home automation thermostat has several settings that are presented to XWeb via a special server implementation. The user controls the thermostat using an XWeb client to modify those settings.

Interfaces in XWeb are specified in an XView. An XView is also encoded in XML and consists of a set of abstract
interactors. The primitive interactors that manage atomic values are enumeration, number, time, date, and text. These can be composed into larger interfaces using the group, list, and link interactors. These eight interactors are used to build any XWeb interface. Recently we have added interactors for 2D and 3D spaces, but no speech implementations for these have been attempted. Our earlier work on speech interaction in 2D spaces [7] has not yet been incorporated into XWeb.

Each interactor is tied to a specific data object in an XWeb server’s XML tree. The interactor’s purpose is to present that data to the user and to make modifications to that data as requested by the user. Consider the example of a simple automated thermostat that uses different temperature settings for different times of the day. Such a thermostat can be represented as a small XML tree of settings, and a highly abbreviated XView for it attaches an interactor to each of those settings; both are sketched below. The group, time, and number tags in the XView specify the interactors used to present the thermostat. Information about how these interactors attach to the underlying XML data has been omitted since it does not contribute to understanding our speech dialogs. Details can be found in our documentation [14].
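For concreteness, the following sketches show the flavor of such a specification; the element and attribute names are illustrative only, not the exact XWeb syntax, which is given in [14]. The thermostat’s data might be:

   <!-- illustrative sketch, not the actual XWeb data schema -->
   <thermostat>
      <wakeup time="6:30 AM" temperature="72"/>
      <sleep time="10:30 PM" temperature="60"/>
   </thermostat>

and an abbreviated XView for it might be:

   <!-- illustrative sketch, not the actual XView syntax -->
   <group name="Thermostat">
      <time name="Wake up time"/>
      <number name="Wake up temperature"/>
      <time name="Sleep time"/>
      <number name="Sleep temperature"/>
   </group>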
SPOKEN DIALOGS

The most important aspects of an XWeb speech interactor are its name/context, its current data value, and its location in the interactor tree. Our dialogs are driven by the current interactor, its value, and its context. Our speech client is built using IBM’s ViaVoice and the Java Speech API. For each interactor we generate a context-free grammar that contains the commands that can be applied to that interactor as well as navigation commands for moving around the interactor tree. These grammars constrain the recognizer to exactly those items that are applicable to the current interactor without requiring our implementation to become involved in the speech recognition process. Such restricted grammars also reduce recognition errors by limiting the perplexity.
The interactor specifications are shared among all XWeb clients. For example, the thermostat could also be presented on a workstation using the same XView. In such a case the names of the items would appear next to the widgets on the screen. In many cases it is perfectly acceptable for the screen names and the spoken names to be the same. However, to preserve screen space “Sleep temp.” might be used as the name, which would be inappropriate in a spoken context. All names can therefore have a separate spoken value, as sketched below.
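In the same illustrative (not exact) syntax, a name without and with a separate spoken value might be written as:

   <!-- illustrative sketch; attribute names are assumed -->
   <number name="Sleep temperature"/>

or

   <number name="Sleep temp." spoken="sleep temperature"/>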
Names and contexts

An interactor’s full name is based on its context. For example, if our thermostat were placed in a “Home Automation” group that contained other controls such as sprinklers and lights, then the full name of the wake up time would be “Home Automation, thermostat, wake up time.” This is very tedious to hear repeatedly, but is important in establishing the user’s context. We use a prefix pruning algorithm, which compares the name of the current interactor with that of the last interactor and deletes any common prefix. Thus while one is moving around in the thermostat, the “Home Automation, thermostat” prefix is not spoken. If, however, the user follows a link to a new view, the complete name of the new destination is spoken to provide context. At any time the user can say “where am I” and get the full path name to the current interactor.

Spoken atomic interactors
The core of our speech widgets is the set of atomic interactors: enumeration, number, date, time, and text. Whenever the user moves to a new atomic interactor or that interactor’s value is changed, XWeb will speak a description of that interactor’s current value. All such descriptions for primitive interactors have the form:

   name is set to value

The value syntax depends upon the particular interactor. To modify the current interactor the user says:

   set to value

For example, the system might say “wake up time is set to 7:30 A.M.” and the user can respond “set to 8:00 A.M.” The first rule of our interfaces is that any expression the user hears can also be used as a command. This supports learning by listening. All interactors support many ways to specify their values, but the form spoken by system feedback is always acceptable.

Enumeration
Enumeration interactors consist of a small fixed number of alternatives. An alternative can have a separately defined “spoken” value just as with interactor names. Each alternative can have multiple synonyms that the user can
use when setting the value. Only the primary spoken name is ever used in user feedback.
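For example, a hypothetical fan-mode enumeration might be sketched (again in illustrative rather than exact syntax) as:

   <!-- illustrative sketch; element and attribute names are assumed -->
   <enum name="Fan" spoken="fan mode">
      <choice name="Auto" spoken="automatic"/>
      <choice name="On" spoken="always on">
         <synonym spoken="continuous"/>
      </choice>
   </enum>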
Numbers

Numbers are special because of their units. Our interactor can handle multiple unit systems, each of which may consist of a hierarchy of units. For example, we can define temperatures in both Fahrenheit and Celsius in a number interactor. We might define lengths in miles, feet, and inches as well as meters. This generalization of numbers to multiple levels within multiple units sharply increases the widget’s applicability. Full unit handling is also critical when formatting spoken output for a wide variety of interactive uses. When people speak numbers, the units that those numbers represent are integral to the way they speak them. Trying to express “10 feet 6 inches” without reference to units would be extremely awkward.

The temperature example sketched at the end of this section illustrates our handling of numbers. Each of its two units has a format string that specifies how the numerical unit values should be assembled for output and how they may be spoken on input. There is additional specification (not shown) for how variables like $(degrees) are computed from the data. When a user speaks a number value, the system tries each format in turn until one parses correctly. In that example a user could speak either in Fahrenheit or Celsius. Hierarchic units such as length could be handled similarly by the format “$(ft) feet $(in) inches”. None of the units or their variable names are hard coded. They are defined by separate tags that specify the hierarchy of units as well as the conversions between unit systems. Each unit system takes an integer or real number and breaks it apart into pieces such as degrees, feet, or inches. The pieces are placed in variables. The format strings describe how to assemble those variables into speakable or parseable strings. After parsing a spoken string, the parser places the number values into the variables and the corresponding unit system reassembles them into a numeric value.

In traditional graphical user interfaces a multi-level unit such as a length would be broken into four separate widgets: a text box for feet, the label “feet”, a text box for inches, and the label “inches”. Such fragmenting of the concept of a length makes it very difficult to translate that interaction into speech. XWeb tries to maintain the whole interaction concept and then allows the various clients to break out the dialog pieces. This allows for a natural and
consistent way of speaking about numbers that is not possible by direct translation from the corresponding graphical interface, as in Mercator [6].
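Using the same illustrative syntax, the temperature interactor with its two unit systems might be sketched as follows; the unit definitions and the conversions between them would be specified by separate tags that are not shown:

   <!-- illustrative sketch; element and attribute names are assumed -->
   <number name="Wake up temperature">
      <units name="Fahrenheit" format="$(degrees) degrees Fahrenheit"/>
      <units name="Celsius" format="$(degrees) degrees Celsius"/>
   </number>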
Dates and Times

In an informal survey of interactive applications we found dates and times to be among the most common standard expressions after enumerations and numbers. English also contains quite stylized phrases for speaking about dates and times. For these reasons we have specifically provided dates and times as primitive widgets. It would be very awkward to express a time by first speaking the hour and then moving to the minutes widget and speaking the minutes. We handle dates and times in a similar fashion to numbers, the difference being that there is a fixed set of variable pieces of a date or time phrase. The interface designer specifies a spoken format for assembling the parts of a date or a time into whatever form is appropriate for the context, as sketched below. The interface designer, however, is only crafting the spoken output. For each of date and time our system has preprogrammed grammars for most of the various ways one might speak them. We have omitted the more colloquial expressions such as “half past four.” We did include relative references to the current date such as “yesterday” or “next Tuesday.” The grammars are quite extensive and would take substantial effort for programmers to reimplement on their own. In the widget approach this effort is easily reused.
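A time interactor’s spoken output format might be sketched (illustrative syntax; the variable names are assumed) as:

   <!-- illustrative sketch; the format variables are assumed names -->
   <time name="Wake up time" format="$(hour) $(minute) $(ampm)"/>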
Text

Text is a very important form of information. However, it is quite problematic in speech interaction. The largest problem is the free-form nature of text and the very high recognition perplexity that it imposes. Voice dictation of text has long been a goal and the subject of several studies [2,4]. There are still problems with being able to edit text using only spoken interaction. Because dictation has been so extensively studied and because many problems remain unsolved, our implementation only speaks text. It does not allow the user to make changes. Admittedly this is a serious limitation, but we have found that enumerations overcome many (but not all) of the user interface problems that would otherwise involve creating text. With our tools we have easily built an email reader, but we cannot create new email messages.

Groups
Our primary mechanism for composing interfaces is the group. Each element of the group is a named interactor. Groups can contain groups to create a hierarchy of interactors. In our original implementation users navigated by the “Next” and “Previous” commands or by speaking the name or a synonym of one of the interactors in the group. If the current interactor is itself a group, the “Enter” command
will take the user into the components of that group. The “Exit” command will return the user to the outer context. The “Where am I” command will speak the path name through any groups that the user has entered. Users can also speak the name of any higher group in the hierarchy and return to that location. In the discussion of our formative evaluations, we will show why the four navigation commands were not usable.
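The home automation hierarchy used in our tests might be sketched (illustrative syntax, with most of the interactors omitted) as:

   <!-- illustrative sketch; names follow the devices described in the text -->
   <group name="Home Automation">
      <group name="Thermostat">
         <time name="Wake up time"/>
         <number name="Wake up temperature"/>
      </group>
      <group name="Sprinklers">
         <group name="Days"/>
         <group name="Zone times"/>
      </group>
   </group>

Saying “Home Automation” from inside either device returns the user to the top of this tree.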
Lists

For assembling arbitrary amounts of data we provide the list construct. This is an ordered list of elements, each of which is a group. The elements can be of many types, with each type having its own group interactor. Each such group can have a group summary, which is a string formatted from the elements of the group using the same formatting mechanism used to assemble number strings from their units, as sketched below. Users “Enter” a list and then scroll through the elements of the list using “Next” and “Previous.” In thinking about lists and in reviewing our formative evaluations it is clear that simple scrolling through lists may be functionally complete, but it is a totally inadequate solution. We have deferred the speech handling of lists to a later phase of study and development.
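A list of email messages, of the kind used in the email reader mentioned earlier, might be sketched (illustrative syntax; the summary string uses the same variable mechanism as the number formats) as:

   <!-- illustrative sketch; element, attribute, and variable names are assumed -->
   <list name="Messages">
      <group name="Message" summary="message from $(sender) about $(subject)">
         <text name="Sender"/>
         <text name="Subject"/>
         <text name="Body"/>
      </group>
   </list>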
FORMATIVE EVALUATIONS

In refining our speech widget dialogs we used 35 paid subjects for an average of one hour each. All of our subjects were university students specifically drawn from non-technical majors. Their experience with computers varied widely. Our goal during the formative evaluations was to create speech widgets that users could walk up to and use with virtually no training in the speech interface.

To set up our tasks we purchased a programmable VCR, a smart thermostat, and a sprinkler timer. These are relatively complicated consumer devices with interfaces that are known to confuse users. We then duplicated the functionality of these interfaces in XWeb to produce speech interfaces. We used the actual physical interfaces as a control for evaluating the effectiveness of our speech solutions. These interfaces utilized the enumeration, number, date, time, and group interactors. All of our user tests focused on these features.

Our first design approach was to teach the users about the “What can I say” command and let them work from there, using the speech as the only guidance. About half of the users were able to make slow progress and the others got hopelessly bogged down. We then augmented the training by giving the users a sheet with the fixed command set. This helped, but it was still very slow. Users spent a great deal of time reading carefully and trying to figure out what to do.
In reviewing the videotapes we recognized that the users’ primary problem was in knowing what to say to actually reach the widgets that they needed. Our initial design assumption was that a small set of navigational commands, as indicated in Arons’ work [1], would be the way to go. We had provided “Next/Previous” and “Enter/Exit” as the navigational commands, but users had no concept of how to navigate a tree in this fashion. Once they did catch on they were very slow. The reason we see for the difference between our results and Arons’ is that the Hyperspeech system was exploratory rather than goal directed. The users of Hyperspeech were not trying to accomplish a particular task but rather to explore and experience a space of interesting information. In our home appliance tests, the users were given specific tasks to accomplish. We believe that this difference in goals accounts for our failure to reproduce the Hyperspeech results.

We added a “List” command, which would speak the names of all the interactors in the group. Having heard the list, users could then speak the desired name and go directly to what they wanted. This also was very slow and very confusing. If a user wanted to change the wake up time on a thermostat they would first need to ask “What can I say.” They would then need to recognize that the “List” command can help them find what they need to know (many users failed to make this connection). Then they would need to speak the appropriate interactor name (“Wake up time”). This dialog style was too distant from the problem at hand.

Managing Help
We also noticed that a key problem was the amount of help that was spoken. The longer the help message, the less likely that users would listen or comprehend. In our initial implementation we established the following priorities for the help to be given:

1. How to set the current interactor’s value
2. The Next/Previous commands to navigate
3. The Enter/Exit commands to traverse the tree
4. Other system management commands
Users rarely listened all the way to level 4. By always echoing interactor values in the same form as the user can speak them, and because the recognized syntax for numbers, dates, and times covers the vast majority of the ways that people speak, we found that most of the value-setting help was unnecessary. We reduced the value-setting help to two simple examples, such as: “to change the time say set to 10:00 A.M. or set to 11:30 P.M.” For numbers, dates, and times two simple examples more than sufficed for users to get the idea. For enumerations this was not possible, because the set of choices may not be obvious from prior experience. With enumerations we still
needed to speak all of the options. Shortening the value-setting help improved performance somewhat.

Our next improvement was to eliminate “What can I say” as the way to guide new users. We retained the command, but we modified our dialog to always initiate help whenever the user changed interactors. To users experienced with the interface this can be annoying. Support for barge-in is essential both for usability and to short-circuit unwanted help. We also added the commands “More Help” and “Less Help” that users can use to control the amount of help generated automatically. By automatically offering help, both user confusion and performance times dropped.

Our final help improvement was automatic chunking. In our first implementation all of the help messages that applied in a given context were spoken in highest-to-lowest priority order. This generally created long, monotonous help sequences. Users quickly forgot what they heard and became very frustrated with having to start at the beginning when they missed something. Since our help system has prioritized lists of items to speak, we modified it to speak only three to five items before pausing and offering the user the opportunity to ask for more help. This was much more successful because users could think about what they had just heard and decide whether they had what they needed before getting more help. We found this more effective than slowing down the help messages.

Our key design findings in helping users to learn our interfaces were:

1. Automatically offer help rather than force the user to ask. This is augmented with the user’s ability to reduce the help.
2. Reduce value help to two simple examples and let the users generalize, leaving more extensive information for lower-priority help messages.
3. Chunk the help with offers for more if needed. This gives users time to digest what they have heard before asking for more.
Navigation problems
For most users navigation was the primary source of confusion and the primary drag on performance times. As mentioned previously, the tree-traversal model did not work well at all. We retained the Enter/Exit and Next/Previous commands but placed them much later in the help priority. We instead focused on using interactor names as the primary navigation paradigm. We had originally treated names as a secondary approach because the learning time for the names would not transfer among applications as well as the four tree-navigation commands. However, the studies made it quite clear that command-based navigation alone would not work.
We replaced the “Enter” command for entering a group with the names of that group’s component interactors. When the current interactor is a group, rather than offering the “Enter” command we offer those names as the help. For example, in our sprinkler interface the “Days” interactor would, as its first help message, offer the names “Sunday,” “Monday,” “Tuesday,” etc. To reach Wednesday the user would say “Wednesday” rather than “Enter” followed by “Wednesday.” We then replaced Next and Previous at level 2 help with the chunked names of all of the sibling interactors within a group. This provided the same navigation capability with much less user confusion. Lastly we replaced Exit with the names of all groups on the path from the root of the interactor tree to the current interactor. Thus a user could return to the top of our test tree by saying “Home Automation.”

A very strong result from our formative evaluations was that names take longer to instruct the users, but they are much less confusing and on the whole much faster to use than a fixed set of navigation commands. Though we provided synonyms for many of the item names, they are never mentioned in the help and users did not use them. Since no user ever performed more than 25 simple tasks, they did not acquire a lot of familiarity with the system. We believe that more experienced users who are more comfortable with our dialogs might try a few obvious names rather than wait for the help system. Even with limited experience, some users started to guess rather than wait for the help. Knowing that one can speak the name of an interactor, users might try “morning temperature” rather than “wake up temperature.” With this kind of behavior the synonym mechanism should increase in effectiveness for experienced users.

It is very important to point out that through all of these formative evaluations, few changes were required to the interface specification. All of the exhaustive user-study work went into improving the standard dialogs. This means that developers of new speech interfaces using XWeb will not need to go through this same effort because the results are automatically provided by the speech widget architecture. This is the same usability payoff that comes with graphical widget sets.

COMPARATIVE EVALUATIONS
Once the formative evaluations had led our dialogs to effective performance times, we wanted to test several hypotheses about our widget approach to speech: 1) how do our speech interfaces compare with equivalent physical appliance interfaces, 2) how rapidly do users learn and become familiar with our artificial language, and 3) does users’ knowledge of the artificial language transfer among applications?
Our experiments involved twelve paid subjects, each in two one-hour sessions. The tests involved a lawn sprinkler timer interface and a smart thermostat interface. In the first session each user was run through the IBM ViaVoice training process to train the recognizer to their voice. We were not testing recognizer quality and we wanted the best recognition that we could get. Half the subjects were given the sprinkler timer and half were given the thermostat. Each subject was brought back for a second session 1 to 10 days later and tested with the other interface. In all tests the subjects were told that they were being timed on how fast they could complete the tasks.

Within each group, half were given the physical device first followed by the speech interface and half were given the speech interface followed by the physical device. This was to mitigate learning effects on the tasks themselves. For each device there were eight tasks such as “change the thermostat to wake up at 7:30 A.M.” or “Will the sprinklers run on Wednesday?” The first four tasks were repeated in the second group of four with the values changed. This allowed us to evaluate performance as the user gained experience with the interface.

Comparing physical to speech interfaces
To evaluate speech interfaces relative to physical interfaces we measured the total time required to complete all tasks and the total time to complete the last four tasks. These comparisons give us a measure of overall usability and learnability of the two forms of interface. The second measure contains less of the learning time and more of the actual performance. Average performance times across all subjects are compared in table 1.

                      Thermostat (seconds)   Sprinkler (seconds)
  Physical Interface          334                    385
  Speech Interface            220                    589
  Comparison                  34%                   -53%
  Table 1 - Average time to complete 8 tasks

Users of the manual thermostat had a difficult time learning how to use the device from the instructions on the lid. The speech interface to the thermostat had a flat structure with no nested groups. The sprinkler timer interface had two groups (days of week, and sprinkler zone times) nested within the main group. This hierarchy caused navigation problems for many of the users and reduced their performance times substantially. Since some of the effects have to do with learning the interface, we compared times on the last 4 tasks, where the users had already performed a similar task previously. In this case the relative speeds switched, as shown in table 2.

                      Thermostat (seconds)   Sprinkler (seconds)
  Physical Interface           34                    190
  Speech Interface             74                    152
  Comparison                 -118%                    20%
  Table 2 - Average time for last 4 tasks

The thermostat is physically larger with only two modes (run and program). Once the users got over the learning curve of the device they were more effective than when using speech. On the other hand, the physical sprinkler has a highly moded interface with much scrolling using a minimal number of buttons. Once subjects learned the navigation mechanisms for the speech interface they outperformed the physical interface. However, the standard deviations on this data are quite high.

In a related test we compared a speech interface to the programming capability of a VCR. In this test the users could see the results of the interface rather than having only an audio channel. Speech was 12% faster than the VCR remote, with much less variance than in the sprinkler and thermostat tasks. We cannot draw any strong conclusions other than that speech and physical button interfaces are quite comparable in terms of their performance speed for simple control tasks.

Learning time
Since we are using an artificial language for our speech interfaces, we wanted to know how quickly users would learn the language. To test this, tasks 5 through 8 are identical to tasks 1 through 4 except that the values have changed. By comparing corresponding tasks within a given interface problem we can get a sense of how rapidly users are learning the dialog language. These comparisons are shown in table 3.

                      Thermostat   Sprinkler
  Physical Interface      89%          3%
  Speech Interface        49%         65%
  Table 3 - Drop in time between tasks 1-4 and 5-8

These numbers show a substantial learning effect for the speech interfaces. They also show the users overcoming their initial problems with the physical thermostat. It is quite clear that users can rapidly learn to use the interfaces generated by our speech widgets.

Transfer of widget knowledge
One of our original assertions in taking the speech widget approach was that users would acquire facility with the widget set and then transfer that knowledge to new applications using the same widget set. To measure this we had the two groups of subjects use different applications on different days in the two different orders. This allowed us to compare performance times for subjects who had never used our speech widgets to performance times when they had previously used the speech widgets, but on a different application. In making this comparison it is important to note that the sprinkler interface uses nested groups whereas the thermostat does not. Therefore there should be less of a transfer effect from thermostat to sprinklers (because of the new widget features required) than from sprinklers to thermostat. The time comparisons are shown in table 4.

When subjects used the thermostat after having previously experienced speech widgets they were much faster than first-time users, showing a clear transfer of knowledge among applications. There was also an advantage for the sprinkler speech interface when subjects had prior experience with the widgets. The gain is not as great in the sprinkler case because the subjects encountered navigation issues that they had not seen before. There was no such transfer of experience among the physical interfaces. The evidence is quite clear that users do transfer user interface knowledge about the artificial language used by our widgets.
                              Thermostat   Sprinkler
  1st use of speech widgets       287          635
  2nd use of speech widgets       154          499
  Comparison                      46%          21%
  Table 4 - Speech performance time in seconds

Understanding the knowledge transfer
Having measured the learning effect between applications of the speech widgets, we wanted to understand exactly where the gains were occurring. To do this we coded all activity on the videotapes into six categories:

  Think      time thinking without speaking or listening                 19%
  Confirm    time listening to confirmation of the last command’s result   8%
  Guidance   time listening to help messages                             20%
  Ask        time actually asking for help                               48%
  Do         time speaking a command                                     12%
  Recovery   time recovering from recognition errors                     -6%
  Table 5 - Percentage of drop attributed to each category

One of the first results is that recovery from recognition errors consumed only about 5% of the performance time. It is unclear to us why users spent more time recovering from errors after they were familiar with the widgets. In comparing the coded times of first-time users of the widgets to the same times for subjects who had used the widgets before, we can see where the improvement comes from. Table 5 shows the percentage of the time improvement that is accounted for in each category of activity. It is clear that
when users have already used our speech widgets they spend much less time asking for and listening to help.

In reviewing the coded data we saw that listening to confirmation of command results took 24% of the performance time and 34% of the actual interaction time. This category also showed little improvement with experience. Based on this observation we ran two quick tests of 5 subjects each to determine if visual presentation of command results would improve performance over listening to them. In the first test we showed the users the full XWeb desktop interface, slaved to their speech interface using XWeb’s cross-modal interaction architecture [8]. In the second test we showed only a single line of text displaying just the confirmation. We were hopeful that this minimal display would produce substantial performance gains. The results of these tests are shown in table 6.

                     Speech performance in seconds
                        Thermostat   Sprinkler
  Audio-only                154         499
  Full display              108         266
  One line display          273         408
  Table 6 - Performance times for various presentation modes

The full display performed very well, not only because it reduced confirmation listening time, but because the navigation information was all displayed. The single line display did not do as well and, in the case of the thermostat, did worse than the audio-only interface. In reviewing the video, the reason for the poor performance was that people could not listen to the help and read the confirmation at the same time. They would skip one or the other and thus cause delays in their performance. It is clear that the full UI display greatly augments speech performance but that the single line does not help.

CONCLUSION
The XWeb speech client allows for rapid development of spoken language interfaces. We believe from our GUI tools experience that this approach will work better for most programmers than linguistic approaches. After much usability-guided design work we have a widget set that ordinary people can walk up to and use with little or no instruction. The resulting interfaces have proven to be comparable to existing physical devices. The experimental evidence also shows that users quickly learn the interface’s language and that their knowledge of the widgets transfers from application to application. We believe this to be strong evidence for a widget-based approach to spoken language interfaces.

REFERENCES

[1] Arons, B. “Hyperspeech: Navigating in Speech-Only Hypermedia,” Hypertext ’91, 1991, pp 133-146.
[2] Danis, C. and Karat, J. “Technology-Driven Design of Speech Recognition Systems,” Design of Interactive Systems (DIS 95), 1995, pp 17-24.
[3] Klemmer, S. R., Sinha, A. K., Chen, J., Landay, J. A., Aboobaker, N. and Wang, A. “Suede: a Wizard of Oz Prototyping Tool for Speech User Interfaces,” UIST 2000, ACM Symposium on User Interface Software and Technology, CHI Letters 2(2), pp 1-10.
[4] Lai, J. and Vergo, J. “MedSpeak: Report Creation with Continuous Speech Recognition,” Human Factors in Computing Systems (CHI 97), 1997, pp 431-438.
[5] Marx, M. and Schmandt, C. “Putting People First: Specifying Proper Names in Speech Interfaces,” UIST 94, ACM Symposium on User Interface Software and Technology, 1994, pp 29-37.
[6] Mynatt, E. D. and Edwards, W. K. “Mapping GUIs to Auditory Interfaces,” ACM Symposium on User Interface Software and Technology (UIST 92), 1992, pp 61-70.
[7] Olsen, D. R., Hudson, S. E., Tam, C. M., Conaty, G., Phelps, M. and Heiner, J. M. “Speech Interaction with Graphical User Interfaces,” Human-Computer Interaction: Interact 2001, pp 286-293.
[8] Olsen, D. R., Jefferies, S., Nielsen, T., Moyes, W. and Fredricson, P. “Cross-modal Interaction Using XWeb,” UIST 2000, ACM Symposium on User Interface Software and Technology, CHI Letters 2(2), pp 191-200.
[9] Sutton, S. and Cole, R. “The CSLU Toolkit: Rapid Prototyping of Spoken Language Systems,” User Interface Software and Technology (UIST 97), pp 85-86.
[10] Walker, M. A., Fromer, J., Di Fabbrizio, G., Mestel, C. and Hindle, D. “What Can I Say?: Evaluating a Spoken Language Interface to Email,” Human Factors in Computing Systems (CHI 98), 1998, pp 582-589.
[11] Yang, C. and O’Shaughnessy, D. “Development of the INRS ATIS System,” Intelligent User Interfaces ’93, pp 133-140.
[12] Yankelovich, N., Levow, G. and Marx, M. “Designing SpeechActs: Issues in Speech User Interfaces,” Human Factors in Computing Systems (CHI 95), 1995, pp 369-376.
[13] http://www.wildfire.com/
[14] http://icie.cs.byu.edu/XWeb/