VoiceXML Web Browser Dissertation 2003/2004 Pierre Naquin

Document Name: VoiceXML Web Browser Dissertation
Author Name: Pierre Naquin
Author Student Number: 03120748
Author Course: MSc in Computer Science
Document Status: Final
Submission Date: 10/09/2004

Acknowledgments

I dedicate this piece of work to my parents and all the people who believed in me, to my friends, starting with the greatest of all, Yannis, and to my previous and coming loves. I would like to thank all my friends for their support, the two “girls” in the Postgraduate Administration Room who helped me so much this year, and my teachers of this year, especially Chris Cox and Mark Green, who both helped me a lot with this @°#~%* dissertation! I sincerely hope all of you will be proud of my work. For my “Mamy du train”, who took such great care of me these last five years.

I am conscious that the style of this dissertation is not the well-known, well-tested, standard, approved, classic dissertation style. I apologise in advance for this, but it is much easier for me to write in this style (I know that the easy path is not always the best one, but...) and I have tried to make the dissertation enjoyable to read. I hope I will succeed!

Abstract

The Internet has become more and more accessible to people during the last three or four years. It is an incredible source of cultural and commercial profit. It has been a huge success, and this success is not likely to stop any time soon. Some very smart people known as the W3C (World Wide Web Consortium) have been working very hard to make the web (a common word for almost everything related to the Internet) easier to use for users, web developers and companies by providing standard solutions for almost any interesting feature of the web. One of these standard technologies is VoiceXML. This technology (actually these technologies, because VoiceXML is only one piece of the voice languages castle) was mainly designed to run human/computer dialogs on telephone servers in order to make these dialogs more efficient, less boring and cheaper to create. But what is interesting in a technology is when you try to push it to its limits, to do something a little bit extreme...

In this dissertation, we will look at the creation of a Voice Web Browser using the VoiceXML technology. This browser will give the user access to any HTML data and transform it into voice. We will see all the power of VoiceXML, its limitations, and the problems we face when dealing with very new technologies. We will take special care to make the system understandable, usable, maintainable and configurable for both the user and the administrator of the system. We will use Perl, a fantastic but very bloodcurdling programming language, to dynamically provide the user with (almost; I hate to have to say “almost”!) full access to the power of the Internet directly in his/her ear! Finally, we will be very proud of ourselves but will have to admit that nothing is perfect in this world and that some work could and should be done to make our VoiceXML Web Browser totally perfect.

Table of contents

Introduction

Literature Review ...and thoughts
  Introduction
  Technical Materials
    HTML
    VoiceXML
    Speech Recognition Grammar Specification: SRGS
    Semantic Interpretation for Speech Recognition: SISR / Speech Synthesis Markup Language: SSML
    ECMAScript
    Perl
    Regular Expressions
  Voice Interfaces
  Building Browsers
  Conclusion

Methodology
  Introduction
  The System Functionalities
  Our lovable languages!
    VoiceXML
    Speech Recognition Grammar Specification: SRGS
    Semantic Interpretation for Speech Recognition: SISR
    Speech Synthesis Markup Language: SSML
    ECMAScript
    Who is doing what?
  How does it work?
  System process
  A server side browser...
  Design: Main Loop
  Design: Downloading HTML content from a source web browser
  Design: Making a zip of all the page
  Design: HTML tag processing
  Design: Sending the output to the user
  What action for what tag?

Testing
  Introduction
  Inputs and Outputs
  “Well, but I want more explanations!”
  Initialisation (VoiceXML)
  Initialisation (SRGS)
  Switch between windows
  Listing favourites
  Opening a favourite
  Opening an URL
  Sending links and zips
  Transforming the ... tag
  Transforming the ... tag
  Transforming the ... tag
  Processing a listing: ... and ... tags
  Dealing with forms: the ... tag
  Dealing with forms: the ... tag
  Dealing with forms: the ... tag
  Dealing with forms: the ...
  Transforming the ... tag
  Cows do eat other cows! : Transforming the ... tag
  Conclusion

Guide
  Introduction
  Login in to the system
  Logout off the system
  While browsing a page
  Accessing another URL
  Managing windows
  Favourites
  Sending link to page
  Sending copy of the page
  Following links
  Forms
  Images
  Conclusion

Configuration
  Introduction
  What to find where?
  In the cgi-bin directory
  In the dictionaries directory
  In the user1 directory
  What can I configure if I am an Administrator
  Changing email configuration
  Changing the path of configuration files
  Managing global dictionaries
  Managing users
  Advanced modifications
  What can I configure if I am a user
  Configuration of the user’s information
  Managing your personal dictionary
  Managing your favourites
  The big part: Changing the way the system behave tag by tag
  A word about modularity
  Conclusion

Conclusion

Recommendations
  Introduction
  Now it is your turn to work!
  Testing in real case
  A graphical user-interface for configuration
  A more efficient tag auto-closing system
  Skipping the content of a tag
  Processing <meta> tags
  Titling windows
  Skipping advertising images
  A better processing for everyone
  Information on demand
  Help messages
  Recursive numbering for listing tags
  The “flat mode” for lists
  RSS feeds
  Favourites as web searchers
  Configuring favourites for “good information”

Code in Appendix

Introduction

The World is becoming smaller and smaller with each passing day with the advent of the Internet and modern means of communication. A person living in one corner of the globe can now interact with a person living in the opposite corner with just a mouse and a few clicks. The Internet has become a popular channel of communication and is used not only by ten-year-old kids but also by elderly people. It is a storehouse of rich data and materials of every kind, flavour and variety. The World Wide Web (WWW) is a useful bank where you can find a huge amount of resources and information.

But despite the popularity of Internet technology and its increasing accessibility throughout the world, we cannot deny that we still use the telephone more. The telephone is a great device and the greatest way for people to interact. The reason lies in its ease of use (just pick up the device and talk) and the age-old tradition of communication by voice. Besides that, the increasing popularity of mobile phones will make this tradition last. To access the Internet, one has to sit in front of a computer; to use the telephone, be it a landline or a mobile device, one does not need any kind of connection with a computer. In the case of mobile phones, one does not even have to be at home or in the office. The two communication channels are different and should not really be compared; the purpose of this “comparison” is to set the rationale and theme for this dissertation.

This dissertation is a conscious effort towards the development of a voice-based browser with which the Internet can be made (more) accessible to a huge mass of people. My system aims to let people use an ordinary telephone to browse the Internet, be it from home, the office, outside or anywhere a telephone call can be made. This would be a great evolution and advancement for Internet technology, enabling more people to have access to the Internet without even having to learn how to use a computer.

There are other advantages too. For instance, when one watches television or surfs the Internet, one takes an active part in it and hence cannot carry out any other work simultaneously. This is not the case when one is listening to music. Our visual sense needs focus; our auditory one less so. A voice browsing system lets people carry on with other work while listening to the information provided by the system.

VoiceXML is a markup language designed to describe and process Human-Computer dialogs. Other languages help it in this: the Speech Recognition Grammar Specification (SRGS) language, the Semantic Interpretation for Speech Recognition (SISR) language and the Speech Synthesis Markup Language (SSML).

This dissertation is designed for people who are interested in creating voice applications; people who want to look at more advanced VoiceXML applications after an introduction found on the Internet; or people who want to learn how to dynamically generate VoiceXML content. If you are already familiar with VoiceXML and the other voice platform languages (SRGS, SISR, SSML ...), you will therefore find this dissertation easier to understand. It is also assumed that you are familiar with client-server programming models, server-side scripting and web programming.

Throughout the dissertation we will try to keep the balance between the two approaches we will mainly talk about: the technical approach (how to do things) and the user approach (how to provide the user with a not-too-bad experience; not-too-bad being very good for a voice interface!).

The objective of this dissertation is to develop a dynamic system that makes the Internet accessible to as many people as possible, and to push VoiceXML to its limits. It would also be a new way of accessing the “web”, making it more dynamic and advanced. Finally, it is for me a way to learn and become an advanced developer in the area of voice technologies. This is a challenging dissertation, so challenging that it lets me spend my last night on this island in a computer room!


Literature Review ...and thoughts

Introduction

Building a VoiceXML Web Browser is not easy. It implies a lot of thinking, but it also requires a lot of background knowledge in general programming, web programming, user interfacing, web protocols, client-server programming, scripting, and text processing. We will first talk about the technological aspect. What materials are available about HTML, VoiceXML, SRGS, SISR, SSML, ECMAScript, Perl or Regular Expressions? What are their limits? Then we will move to the user interface aspect. Finally we will try to find some information on the development of browsers (both voice and graphical ones).

Technical materials

Technical documentation is usually not the most difficult information to find... but we will see that this is not always true, especially when it comes to very new languages or technologies.

• HTML

Finding information on HTML is the easiest thing you can imagine. The problem is then to work out which information is valuable to you and which is not. HTML is a formatting language, so whether you have the best book or the worst magazine article about HTML, the only thing you really need is a reference book. As I am a big fan of the whole O’Reilly collection, I use the HTML Pocket Reference by Jennifer Niederst (O’Reilly, 2002). A reference is compulsory even if you know HTML by heart and you are providing your sweetheart with poems full of ... and ... tags! I also have to mention the HTML 4.01 Specification by the World Wide Web Consortium (W3C). It can be found on the W3C website (http://www.w3.org/TR/html4/). I found it a little long and over-explained, but maybe that is because I already know HTML. It is in any case the specification and you should at least try to read it once.

• VoiceXML

Even if not everyone doing web development is aware of this language, VoiceXML is not that new. VoiceXML is already at version 2.0 and the W3C is now working on creating a free implementation of what they call a Voice Browser (the thing I wanted so much while doing this dissertation: a way of simply opening VoiceXML documents!). VoiceXML is a language describing a dialog between a user and a VoiceXML interpreter. VoiceXML is also the root that links all the different elements (different languages) that compose a voice application. The tutorials that can be found on the Internet are usually more an introduction to the language than an explanation of how to do something. I sincerely think they will not make you an advanced VoiceXML developer, but they do their job of bringing new developers to the joy of developing for voice.


Some very new books have been published on the subject. As I did not have the privilege of being able to read them, I will simply list some of them:

VoiceXML: Professional Developer’s Guide
Chetan Sharma, Jeff Kunins (Paperback, 2001) (0-471-41893-5)

Early Adopter VoiceXML
Stephen Breitenbach, Tyler Burd, Nirmal Chidambaram, Eve Astrid Andersson, Xiaofei Tang, Paul Houle, Daniel Newsome, Xiaolan Zhu (Paperback, 2001) (1-861-00562-8)
An overview of the book can be found at: [http://www.developer.com/voice/article.php/1565061]

Definitive VoiceXML
Adam Hocek, David Cuddihy (Paperback, 2002) (0-130-46345-0)

The VoiceXML 2.0 Recommendation is available from the W3C website (http://www.w3.org/TR/2004/REC-voicexml20-20040316/) and I have to admit that I found this recommendation well written and very convenient to use when developing VoiceXML dialogs. I would also recommend Mark Green’s lecture notes on voice applications (Oxford Brookes University, MSc in Computer Science, P08786). His documents describe very well each W3C voice language and how they interact with each other. Finally, very good articles can be found on the web regarding voice computing and especially VoiceXML applications (these articles were published in the USA):

acm.org: VoiceXML for Web-based distributed conversational applications by Kenneth R. Abbott (Apress, 2001) (0001-0782)
[http://delivery.acm.org/10.1145/350000/348985/p53-lucas.html]

acm.org: Mixed-initiative interaction = mixed computation by Naren Ramakrishnan, Robert Capra, Manuel A. Pérez-Quiñones (Apress, 2002) (0362-1340)
[http://delivery.acm.org/10.1145/510000/503042/p119-ramakrishnan.pdf]



• Speech Recognition Grammar Specification: SRGS

The W3C quite recently (March 2004) released the recommendation for this language (http://w3.org/TR/2004/REC-speech-grammar-20040316/) and it is very difficult to find information that does not come directly from the W3C. My personal view of this version of the recommendation is very negative, all the more so because it is the only source of information about the language. The document spends a lot of time giving the same useless examples three or four times and leaves lots of gaps wide open. SRGS is a grammar language that describes which words or patterns of words the speech recognition module of a VoiceXML interpreter should look for. SRGS comes in two forms: the ABNF form and the XML form. For this dissertation I used the XML form. The only information that can be found on SRGS (paper-based or web-based) is bundled with VoiceXML information. Somehow this makes sense, because SRGS by itself is totally useless.

• Semantic Interpretation for Speech Recognition: SISR
• Speech Synthesis Markup Language: SSML

I deliberately chose to group these two languages even though they have only one (but very strong) similarity: they are both what I call documentation disasters. A good example: if you search for SISR on Google, it finds “Société Suisse des Informaticiens – Section Romande” (some kind of Swiss society of computer guys). SISR is a language that is supposed to define the syntax of the content of elements (in SRGS grammar documents). It looks more as if it describes how data recognised by SRGS rules is filled into ECMAScript variables. The W3C status of the document describing this language is Working Draft (they have been working on it since April 2003 without any public modification!) (http://www.w3.org/TR/2003/WD-semantic-interpretation-20030401/). The current version is a disaster, as there are tons of undocumented cases. This is the language that caused me the most trouble in the writing of this dissertation.

SSML also suffers from its very recent introduction but, being much more classical in its approach, its goal is easier to understand. SSML is a formatting language for voice. Its tags are now integrated directly into VoiceXML (in version 2.0), so this language is very easy to use and does not require too much brain overheating. The SSML 1.0 Proposed Recommendation is available on the W3C website: http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/. Last Minute Change: A W3C Recommendation was published for SSML 1.0 on the 7th of September (http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/).

• ECMAScript

ECMAScript is simply (it is almost that simple!) another name for JavaScript. Finding information about ECMAScript is therefore quite easy, as all the JavaScript tutorials, documentation, examples and scripts have their exact equivalent in ECMAScript. ECMAScript is a language standardised by Ecma International. The ECMAScript specification can be found on their web site, but I have to admit that I never read it: http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf. Personally, I use JavaScript Pocket Reference by David Flanagan (O’Reilly, 1998) and JavaScript: The Definitive Guide, also by David Flanagan (O’Reilly, 1998). I like the first one for its brevity: everything is summarised, and the organisation is very clear and well structured. You find your information quickly. The second one is a lot thicker but is very good for solving particular cases... and I do not know how I manage it, but I am always facing particular cases. I would also like to mention the very good paper by Rick Dobson: ECMAScript: the holy standard?. This article describes the advantages of ECMAScript compared with JavaScript.

• Perl

Perl has long been a very popular CGI scripting language. Even if ASP and PHP are nowadays taking serious market share, Perl is still very widely used. Although it is principally known as a CGI scripting language, it is a full programming language; lots of non-web-based programs are written in Perl. I chose Perl as an opportunity to learn a new technology (even though I had been introduced to it during the P08771 module, Oxford Brookes University, MSc in Computer Science) and I discovered a fantastic, huge, very powerful, very dangerous language. This language lets you do everything you will ever want to do and even more... and this is the problem! I used two books for my everyday work: Perl in a Nutshell (2nd Edition) by Stephen Spainhour, Ellen Siever, Nathan Patwardhan (O’Reilly, 2002) and Advanced Perl Programming by Sriram Srinivasan (O’Reilly, 1997). The first book is a reference book and contains lots of information (everything about Perl has simply not been discovered yet!) about the language and the most used modules. The second focuses principally on the complex aspects of Perl programming (complex nested data structures, typeglobs, the power of Perl OO, graphical interfaces ...). If you have always wondered what is so great about Perl, read this book. I would also like to mention Perl Cookbook (2nd Edition) by Tom Christiansen and Nathan Torkington (O’Reilly, 2003), which I unfortunately discovered far too late. If you want to have fun, some people even write poetry in Perl! I really could not believe it when I first saw this. Have a look at http://www.perlmonks.org/index.pl?node=Perl%20Poetry and enjoy! Some of the poems are really nice.

• Regular Expressions

I consider Regular Expressions neither a technology nor a language. For me Regular Expressions look more like a tool... but this tool has its own language, and a fairly complex one! I used only one book to learn about Regular Expressions and how to use them. It is quite exceptional, and it is the first time this has happened to me, but this book is perfect (I said it!): Mastering Regular Expressions by Jeffrey E.F. Friedl (O’Reilly, 1997). This book describes all the aspects of Regular Expressions: how to use them, but also how the different engines work and how to take advantage of their differences; it also stresses the great importance of efficiency.

Voice interfaces

Information on the design of voice interfaces is really difficult to find. This is a fairly well explored area of research, but the results are difficult to find and are in particular not available outside labs. The rules to follow in order to build efficient voice interfaces are not yet written, and lots of people like me suffer from this lack! My first and most accessible material for designing the voice interface was Mary Zajicek’s lecture notes on voice interface design theory (Oxford Brookes University, MSc in Computer Science, P08786). This material presents the real case of a weather report system. It introduces some very interesting ideas, not all of which are applicable to the VoiceXML technology.


But the main problem with this material is that it describes a dialog. The system asks something, the user answers something, and finally the computer tries to respond intelligently to the user. In this case the initiative lies with the computer. It starts the discussion. It restricts the user’s freedom of expression to something it will be able to understand and compute. In our case we are making a system that responds to “orders”, in the sense that the system does not try to influence the user’s choice. The problem then comes when the user “gets lost”: as we (the system) are not waiting for anything in particular, we do not know what the user wants to do.

Some very interesting articles regarding voice interfaces can be found on the web, but they are very theoretical and do not really give solutions: they mostly try to explain how we interact with each other. I will not describe them here as they simply express too many things at a time. A short explanation of these can be found in the bibliography part of this dissertation.

In the end I decided not to overload the system with several different ways of giving orders. There are a lot of reasons for this:

• Efficiency: The more complex the recognition, the more choices the system has to try, which makes the system slower and more fragile.
• Messiness: The more complex the recognition rules, the messier the system becomes; it then becomes less understandable for developers, less understandable for administrators, less maintainable and less improvable. Another vicious circle: you want to improve the system and you end up blocking it!
• Too human: A web browser, however evolved you may want it to be, remains a tool for you. We are not trying to make computer dialogs that simulate nice ladies in a call centre.

Building browsers

Information can be found on how to build browsers, but very little on how to build voice ones. When you think more about it, this does not make much difference. The process is the same: first you download the HTML content that will have to be presented to the user, then you somehow process this file to produce your output. When you continue to think about it, the main problems you have to face are:

• How to recognise what is a tag? Where does it start, where does it end?
• How to process the tag that you finally found?

If you are looking for explanations of how to build web browsers, you will get sad very quickly. The closest available information is from the Mozilla group, which provides developers with the sources of both of its browsers (Mozilla and Firefox) (http://www.mozilla.org/developer/). I have to admit that I was not able to get far enough into it to understand what they were doing, or even what their approach was to finding tags and processing them. So I had to find another solution, which was to look for more specific or technical content, and then I came back to the tool we were talking about before, regular expressions:

• The system uses regular expressions to separate tags from text.
• The tags are processed one by one, and each of them has a different action associated with it.
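The two points above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than the Perl used for the actual system, the pattern `<[^>]+>` is a deliberately naive definition of a tag (a `>` inside an attribute value, a comment or a script would fool it), and the names `split_tags_and_text`, `tag_name`, `ACTIONS` and `render`, as well as the bracketed announcements, are invented for the example.

```python
import re

# Assumption for this sketch: a tag is anything between "<" and the next ">".
TAG_PATTERN = re.compile(r"(<[^>]+>)")


def split_tags_and_text(html):
    """Split HTML into ('tag', ...) and ('text', ...) pieces, in document order."""
    pieces = []
    # Because the pattern has a capturing group, split() keeps the tags
    # in the result list, interleaved with the text between them.
    for chunk in TAG_PATTERN.split(html):
        if not chunk.strip():
            continue  # skip empty strings and whitespace-only text
        kind = "tag" if TAG_PATTERN.fullmatch(chunk) else "text"
        pieces.append((kind, chunk.strip()))
    return pieces


def tag_name(tag):
    """Return the lower-cased element name of a tag such as '<A href=...>' or '</p>'."""
    match = re.match(r"</?\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return match.group(1).lower() if match else ""


# Hypothetical actions: one function per tag name, returning what to say.
ACTIONS = {
    "h1": lambda tag: "[announce a heading]",
    "a": lambda tag: "[announce a link]",
    "br": lambda tag: "[short pause]",
}


def render(html):
    """Process the pieces one by one and collect what would be spoken."""
    spoken = []
    for kind, chunk in split_tags_and_text(html):
        if kind == "text":
            spoken.append(chunk)
        elif not chunk.startswith("</"):  # closing tags are ignored here
            action = ACTIONS.get(tag_name(chunk))
            if action:  # tags without a registered action are skipped
                spoken.append(action(chunk))
    return spoken
```

Keeping the actions in a table rather than in a chain of conditions means that the behaviour for one tag can be changed without touching the loop that walks through the page.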

Conclusion

I am French and therefore a professional complainer.

Some of the technologies and languages used to build this project are really undocumented. The same can be said for the theoretical aspects of building a voice web browser, on both the voice part and the browser part. This problem makes the implementation of complex ideas even more complex and does not help the first-steppers. But I believe that our work (we, the first-steppers!) is for the benefit of everybody, and my soul feels at peace hearing that!

Methodology

Introduction

In this section, I will describe the system: how it should work and how it actually works, which problems I faced and how I solved them. I will start by talking about the functionalities of the system. Then I will introduce our output languages for those of us who are not familiar with them. Finally I will move on to a description of the system’s designs and processes.

The System Functionalities

What functionalities are provided to the user? What, from the user’s point of view, makes the system usable?

Figure: Main criteria that make the system usable.

We will try to provide a solution to each of these criteria. The user is tough, but we are strong enough!

Our lovable languages!

I will try to give you some background on the languages that we are going to use as outputs of our server-side program. It is very important that you keep in mind which language is doing which part of the job. Pay special attention to this, especially because these languages look neither like the programming languages that we use every day nor like the data-structuring languages that XML brought us.

• VoiceXML

The origin of VoiceXML goes back to 1995 and a research project called Phone Markup Language at AT&T Bell Laboratories. In 1998, the W3C organised a conference on voice browsers; the attendees included AT&T, Lucent, Motorola, IBM ...

By this time, AT&T and Lucent had developed different variants of their original Phone Markup Language and IBM was developing its own speech language. Motorola, for its part, had already developed VoxML. Following these developments by separate commercial companies, the VoiceXML Forum was formed to develop and promote a standard voice markup language that developers could use to build conversational applications. The VoiceXML Forum’s main objective was to explore public domain ideas from existing work in the voice browser arena. As the standardisation process for voice browsers developed, the VoiceXML Forum would work with others to find common ground and the right solution for business needs. In 2000, the Forum wrote the VoiceXML 1.0 specification and submitted it to the World Wide Web Consortium for standardisation. In October 2001, the first Working Draft of VoiceXML 2.0 was published by the W3C’s Voice Browser Working Group; VoiceXML 2.0, which became a W3C Recommendation in March 2004, is the latest version of the specification.

VoiceXML is a markup language for describing Human-Computer dialogs, based on XML (Extensible Markup Language). Its first concrete objective is to solve the problem of Human-Computer interaction for voice servers where users have to fill in forms.
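To make the idea of a form-filling dialog concrete, here is a minimal sketch of a server-side program producing a small VoiceXML 2.0 document. It is written in Python rather than the Perl used for the actual system, and the function name `build_dialog`, the form id, the field name and the wording of the prompts are assumptions made for the example. The `boolean` field type is one of the built-in types described in the VoiceXML 2.0 Recommendation; whether a given platform supports it has to be checked.

```python
import xml.etree.ElementTree as ET

# Namespace of VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"


def build_dialog(question, field_name="answer"):
    """Return a VoiceXML document (as text) that asks one yes/no question."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "main"})

    # A field is one slot of the form that the user fills in by voice.
    field = ET.SubElement(form, "field", {"name": field_name, "type": "boolean"})
    prompt = ET.SubElement(field, "prompt")
    prompt.text = question

    # <filled> is executed once the field has received a value.
    filled = ET.SubElement(field, "filled")
    reply = ET.SubElement(filled, "prompt")
    reply.text = "Thank you."

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_dialog("Do you want to hear the headlines?"))
```

Building the document as a tree and serialising it at the end leaves the escaping of characters such as `&` to the library, which matters when the spoken text comes from arbitrary web pages.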

• Speech Recognition Grammar Specification: SRGS

The Speech Recognition Grammar Specification is a language used to describe what a VoiceXML interpreter should listen for. The syntax of the grammar format is presented in two forms, namely an Augmented BNF form and an XML form. The specification makes sure that the two forms can be mapped onto each other, allowing free conversion between them. In this dissertation we will use the second form, the XML one. The W3C decided to provide an XML version of this language to enable grammar developers to use all the power of the tools developed for XML.
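As an illustration of the XML form, the sketch below generates a small SRGS grammar that accepts one phrase out of a list. It is a Python sketch, not the Perl of the actual system; the function name `build_command_grammar`, the rule id `command` and the example phrases are hypothetical. The element and attribute names (`grammar`, `rule`, `one-of`, `item`, `root`, `scope`, `mode`) are those of the SRGS 1.0 Recommendation.

```python
import xml.etree.ElementTree as ET

# Namespace of SRGS 1.0 grammars in XML form.
SRGS_NS = "http://www.w3.org/2001/06/grammar"


def build_command_grammar(commands, rule_id="command", lang="en-GB"):
    """Return an SRGS XML grammar (as text) matching exactly one of `commands`."""
    grammar = ET.Element("grammar", {
        "version": "1.0",
        "xmlns": SRGS_NS,
        "xml:lang": lang,
        "mode": "voice",
        "root": rule_id,  # the rule the recogniser starts from
    })
    rule = ET.SubElement(grammar, "rule", {"id": rule_id, "scope": "public"})

    # <one-of> lists alternatives; each <item> is one phrase to listen for.
    one_of = ET.SubElement(rule, "one-of")
    for phrase in commands:
        item = ET.SubElement(one_of, "item")
        item.text = phrase

    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_command_grammar(["list favourites", "next window"]))
```

Because the grammar is ordinary XML, it can be produced with the same tools as any other XML document, which is the benefit of the XML form mentioned above.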

• Semantic Interpretation for Speech Recognition: SISR

    Semantic Interpretation is useful when it is combined with some other specifications like the SRGS one. Semantic Interpretation provides a way whereby instructions can be attached for the computation of semantic results to a speech recognition grammar. In other words, it define how should a recognised pattern of word be interpreted. SRGS is the pattern, SISR give a sense to this pattern. SISR statements should be valid ECMAScript expressions. Like SRGS, SISR comes in two forms: a ABNF and a XML form. The two versions are in the same way totally mappable. •

    Speech Synthesis Markup Language: SSML

    Speech Synthesis Markup Language is a standard that the W3C is working on, which provides formatting for voice. It has been created to provide a rich XML-based markup language for assisting the generation of synthetic speech. It provides a standard way to control different aspects of speech such as volume, pronunciation, rate and pitch across different synthesis platforms. This language is used to improve the quality of synthesized content and is suitable for “style” developers: it can be seen as the XML counterpart of what CSS is for graphical web interfaces. Using SSML allows the content developer to provide information to the user in two ways: what is said and how it is said.

    Last Minute Change: A W3C Recommendation has been published for SSML 1.0 on the 7th of September (http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/).

    ECMAScript

    ECMAScript is, in practice, another name for JavaScript. More precisely, ECMAScript is an attempt to standardise the JavaScript language that was invented years ago by Netscape. We can say that the language is the same because the implementations of JavaScript and ECMAScript are the same (the companies working on these technologies have no interest in writing different code that would do the same work). In the context of VoiceXML applications, this language is used to perform computations directly on the VoiceXML interpreter (client side).

    Who is doing what?

    A typical workflow between voice languages.

    - The user gives some input in the form of voice.
    - The user’s voice is parsed and tested against the SRGS rules.
    - The elements of the SRGS grammar file are executed. These elements are written in the SISR language.
    - Variables are passed back to the VoiceXML file, which can ask for some ECMAScript code to be executed.
    - VoiceXML decides what output should be given to the user according to its internal rules.
    - Outputs can be made of SSML tags (or not, as in this case); SSML tags specify how the voice should sound.

    How does it work?

    As described in the diagram, the VoiceXML architecture is one piece added to the web application design model. This piece is the VoiceXML Interpreter, which fits between the content server that sends the data as VoiceXML content and the user’s phone (the access medium).

    System process

    I will now explain the designs that I used for my system.

    A server side browser...

    The first thing to understand about the system is that the client-server idea driving normal web architectures is altered when moving to voice: in our case, the browser is actually server-side.

    Normal client-server web architecture versus...

    The system used for our VoiceXML Web Browser.

    As we can see on the diagrams, the architecture needed for such a system is more complex and involves more components:

    - The VoiceXML Interpreter: this piece of the architecture is common to every voice system using VoiceXML. Its role is to perform all the speech synthesis and speech recognition according to the content of the VoiceXML and SRGS data.
    - What was previously the “server side” now acts (for the purpose of our browser) like a proxy server: it makes requests to other web sites (the servers the user wants to access) and then gives the result back to the user (in VoiceXML form). This is why we can say that our browser is server-side.

    Design: Main Loop

    VoiceXML Web Browser Main Loop.

    Each time the system has to process a web page, it starts by downloading it from the source web server. Then it creates a zip copy of the page (with images, CSS files, Java applets, ...) that can be sent by email to the user on request. Then it performs all the tag transformations, depending on the type of tag and on the rules and preferences set by both the user and the administrator. Finally it delivers to the user the VoiceXML document, which will be processed by the VoiceXML Interpreter.
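The four steps of the main loop can be sketched as a single function. This is only an illustration: the real system is written in Perl, and the names `process_page`, `download`, `make_zip` and `transform` are hypothetical placeholders passed in as parameters, not functions from the actual code.

```python
from typing import Callable, Dict


def process_page(url: str,
                 download: Callable[[str], str],
                 make_zip: Callable[[str], bytes],
                 transform: Callable[[str], Dict[str, str]]) -> Dict[str, object]:
    """Run the steps of the main loop for one page (hypothetical sketch)."""
    html = download(url)          # 1. fetch the page from the source web server
    archive = make_zip(html)      # 2. keep a zip copy, to be emailed on request
    documents = transform(html)   # 3. tag transformation -> VoiceXML and SRGS
    # 4. hand everything back; the caller delivers the VoiceXML to the interpreter
    return {"zip": archive, "vxml": documents["vxml"], "srgs": documents["srgs"]}
```

Passing the three steps in as parameters keeps the sketch independent of how each step is really implemented.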

    Design: Downloading HTML content from a source web server

    After seeing the design of the global loop, let us now have a look at every action that has just been described. First, let us look at what is really performed when downloading some HTML content at the user’s request.

    Loop controlling the downloading of HTML content from a web server.

    The first thing the system checks is whether there is any data to be posted to the requested URL (as when the user sends a form). If so, the data is posted. Then the system downloads the answer from the remote web server (our server is, as described before, only acting like a proxy). When the data is received, the system checks the HTTP headers to see whether an HTTP redirection has been requested. If so, it goes to the next URL and continues the loop.
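A minimal sketch of this loop is given below, in Python rather than the Perl of the real system. The `fetch` callable stands in for the actual HTTP client and is an assumption, as are the redirect limit and the choice not to re-post the form data after a redirection.

```python
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urljoin

# A response is modelled as (status code, headers, body).
Response = Tuple[int, Dict[str, str], str]
Fetch = Callable[[str, Optional[Dict[str, str]]], Response]

REDIRECT_CODES = {301, 302, 303, 307}


def download(url: str, fetch: Fetch,
             form_data: Optional[Dict[str, str]] = None,
             max_redirects: int = 10) -> Tuple[str, str]:
    """Return (final_url, body), following HTTP redirections."""
    data = form_data
    for _ in range(max_redirects + 1):
        # Data is posted only if there is some; otherwise this is a plain request.
        status, headers, body = fetch(url, data)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # the Location header may be relative
            data = None                   # assumption: do not re-post after a redirect
            continue
        return url, body
    raise RuntimeError("too many redirections")
```

Injecting `fetch` makes the loop easy to exercise without any network access, which matters here since the system was only tested on localhost.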

    Design: Making a zip of the whole page

    Now we can start the second action of our main loop: saving a copy of the whole web page in zip format.

    Making a complete copy of a web page.

    The data received from the remote web server is analysed to find all the tags that require some processing. For each of these tags, the system works out where the linked content is located and downloads it. Once a piece of content is downloaded, its link in the source file is changed to point to the new location. Finally, when everything is processed, a zip is made from all the files.
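The steps above can be sketched as follows. This is a simplified Python illustration, not the Perl implementation: the regular expression only covers double-quoted `src`, `href` and `background` attributes, the local names `file0`, `file1`, ... and `index.html` are assumptions, and `fetch` is a hypothetical stand-in for the real downloader.

```python
import io
import re
import zipfile
from typing import Callable, Dict
from urllib.parse import urljoin

# Attributes that may point at linked content (a subset, for illustration).
LINK_ATTR = re.compile(r'\b(src|href|background)\s*=\s*"([^"]+)"', re.IGNORECASE)


def zip_page(html: str, base_url: str, fetch: Callable[[str], bytes]) -> bytes:
    """Download linked files, rewrite their links, and zip everything."""
    files: Dict[str, str] = {}  # absolute URL -> local file name

    def relocate(match: "re.Match[str]") -> str:
        absolute = urljoin(base_url, match.group(2))
        if absolute not in files:
            files[absolute] = f"file{len(files)}"
        # Point the attribute at the local copy inside the archive.
        return f'{match.group(1)}="{files[absolute]}"'

    rewritten = LINK_ATTR.sub(relocate, html)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("index.html", rewritten)
        for absolute, name in files.items():
            archive.writestr(name, fetch(absolute))
    return buffer.getvalue()
```

A file referenced several times is downloaded once and all its links point to the same local copy.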

    Design: HTML tag processing

    Third step and the most complex of all: finding and replacing HTML tags.

    HTML tag processing by the VoiceXML Web Browser

    For each tag found in the HTML document (the system uses regular expressions to find them), the system tests whether it is an opening tag, a closing tag or a singleton tag (one that has no closing counterpart). If it is a closing tag, it is deleted from the opened-tag stack. If it is an opening tag, it is added to the opened-tag stack, and the system tries to close the nearest tag that can be closed by the opening of another tag. Then it finds out which function to import and call according to the rules.xml general configuration file and the user’s preferences.xml file. The function is called and its result is added to the output files (the VoiceXML one and the SRGS one).
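The stack handling can be illustrated with the sketch below. It is written in Python, not in the Perl of the real system, and several points are assumptions made for the example: the `SINGLETONS` and `IMPLICIT_CLOSE` tables, the `walk_tags` name, and the choice to close every tag opened after the one being closed. The lookup of handler functions in rules.xml and preferences.xml is left out.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")

# Assumptions for illustration: tags that never have a closing tag, and
# which open tag is implicitly closed when a given tag opens.
SINGLETONS = {"br", "img", "hr", "input"}
IMPLICIT_CLOSE = {"li": {"li"}, "p": {"p"}, "option": {"option"}}


def walk_tags(html: str) -> List[Tuple[str, str]]:
    """Return (event, tag) pairs, where event is open, close or single."""
    events: List[Tuple[str, str]] = []
    stack: List[str] = []
    for closing, name, self_closed in TAG.findall(html):
        name = name.lower()
        if closing:
            if name in stack:
                # Close the tag, and anything still open inside it.
                while stack:
                    top = stack.pop()
                    events.append(("close", top))
                    if top == name:
                        break
        elif self_closed or name in SINGLETONS:
            events.append(("single", name))
        else:
            # An opening tag may implicitly close the nearest open tag.
            if stack and stack[-1] in IMPLICIT_CLOSE.get(name, set()):
                events.append(("close", stack.pop()))
            stack.append(name)
            events.append(("open", name))
    return events
```

In the real system each event would trigger the function selected by the configuration files; here the events are simply collected so that the stack behaviour is visible.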

    Design: Sending the output to the user

    Now that the VoiceXML and the SRGS contents are ready, the web server can then send this content to the user’s VoiceXML Interpreter.

    Process for sending the output to the user’s VoiceXML interpreter.

    When all the tag processing is over, the VoiceXML and SRGS files are saved in the user’s temporary directory and an HTTP redirection is sent, pointing to the VoiceXML document that has just been saved.

    What action for what tag?

    Some tags execute special actions when met. I do not have enough space to describe them here, but you will find in the configuration section an explanation of all the actions that the system can perform depending on how it is configured.

    Testing

    Introduction

    While I was programming, debugging and later testing my system, I really put it through some very hard situations. For demonstration purposes I will only show the transformation of a relatively simple HTML file. I have only been able to test HTML data (HTML files and HTML data generated by CGI scripts) on localhost (meaning my own computer, but through an HTTP server) because of firewall problems with my internet provider; but since the code makes no distinction regarding where the content is located, this should not cause any problem on a well-configured server. I would also like to mention that both output files pass the W3C XML validator. This does not ensure that the content of the files corresponds to what it should be, but it at least certifies that the output files are valid. Because there is absolutely no free implementation of the W3C voice languages, you won’t be able to test the system in a real situation (I wasn’t able to either!). So I will try to explain, looking at the input HTML file, the user’s personal preference file, and both output files (the VoiceXML and the SRGS one), why the system is doing what it is supposed to do.

    Inputs and Outputs

    Here are the inputs and the outputs directly. I’ll explain them just after.

    TESTING THE DISSERTATION

    TESTING THE DISSERTATION

    This time it better work! We have some audience!
    1. It should work.
    2. It should work GREAT!
    3. IT'S SOOOOO GOOD!
    Why do web developer uses table for styling! GRRR!
    1 + 1 = 2 2 + 2 = 3 3 + 3 = 4



    do

    eat

    other

    HTML File in input.

    link Here starts the form number %%ID%% Here ends the form number %%ID%% A large image is there. text field password field

    file field. file fields are not supported by the system. Personal Preference File of the user.



    <script> TESTING THE DISSERTATION TESTING THE DISSERTATION This time it better work! We have some audience! 1 It should work. 2 It should work GREAT! 3 IT'S SOOOOO GOOD! Why do web developer uses table for styling! GRRR! Here start the form number 1 text field fill with $ += $dictionary <script>textValue[0] = text1 ; password field fill with $ = $alphabet

    <script>passwordValue[0] = password1 ; fill with $ += $dictionary <script>textareaValue[0] = replaceReturn(textarea1) ; 1 + 1 = 2 select it <script>checkValue[0].push( checkValuePossible[0][0]) ; 2 + 2 = 3 select it

    <script>checkValue[0].push( checkValuePossible[0][1]) ; 3 + 3 = 4 select it <script>checkValue[0].push( checkValuePossible[0][2]) ; Here ends the form number 1 Cows do eat other cows! <script>

    ]]> <script> <script>

    <script> <script>

    The VoiceXML document in output. Only the indentation has been changed to make the document more readable. This does not affect in any way how the document is treated by the VoiceXML interpreter.





    Brookes $ = 0 ; in a new window switch to window number $.windows.windowNb = $digit ; if ($digit == 0) $.windows.url = " http://localhost/DISSERTATION/tmp.html" ; else $.windows.url = "ERROR" ; list favourites $.listFavorite = 'FILLED' ; $ = 0 ; in a new window open if($newWindowFav == 0) $.openFavorite.windowNb = 0 ; else $.openFavorite.windowNb = 1 ; $.openFavorite.favNb = $favorite ; $ += $alphabet

    $ = 0 ; in a new window open if($newWindowURL == 0) $.openURL.windowNb = 0 ; else $.openUrl.windowNb = 1 ; $.openURL.url = $url ; $ = 0 ; in a new window follow link number if ($newWindowLink == 0) $.links.windowNb = 0 ; else $.links.windowNb = 1 ; if ($digit == 1) $.links.url = "" ; else $.url = "ERROR" ; send url $.sendLink = 'FILLED' ; send all urls $.sendAllLink = 'FILLED' ; send zip of page $.sendZip = 'FILLED' ; send all zips $.sendAllZip = 'FILLED' ; send form number 1 $.form1send = 'FILLED' ; clear form number 1 $.form1clear = 'FILLED' ; fill the number of form 1 with $.form1fill1.type = $fieldType1 ; $.form1fill1.nb = $digit ; $.form1fill1.value = $sentence ;

    select the element number of the number of form 1 $.form1fill2.type = $fieldType2 ; $.form1fill2.nb = $digit ; $.form1fill2.value = $elementNumber ; $ += $dictionary text block text field password field check list list

    The grammar (SRGS) document in output. Only the indentation has been changed to make the document more readable. This does not affect in any way how the document is treated by the grammar interpreter.

    “Well, but I want more explanation!”

    To show you why the system is doing what it is supposed to do, I will take parts of the output files and explain the transformation. We will go through the whole of the output files in this way. I don’t expect you to know VoiceXML and I’ll try my best to make this explanation readable and understandable without requiring too much knowledge of the W3C voice languages; but it’s undeniable that you will go much faster if you know these technologies.

    Initialisation (VoiceXML)



    (1)

    (2)

    (3) <script> VoiceXML code.

    Here, I’m only opening the VoiceXML document (1) (all VoiceXML files start like that) and initialising some variables that I’ll use later (2). The system uses a form-level grammar (3). This implies that the user and the system will have a mixed-initiative dialog. Traditionally (and before VoiceXML), the dialog was driven by the computer: the computer asks a question and the user has to answer. In a mixed-initiative dialog, the computer knows in advance all the information it needs and tries to find this information in anything the user says. This feature of VoiceXML is very important for us because it allows a dialog driven by the user: the user asks a question and the computer answers. I’m also creating an ECMAScript function that will be used to build a string containing the list of all favourites (4). This is the perfect moment to tell you that, for readability, all parts of the code that will be interpreted by the ECMAScript processor are shown in italics.

    Initialisation (SRGS)

    (1) (2)

    SRGS code (grammar file).

    First, I’m opening an SRGS document. As with VoiceXML, all SRGS documents start in the same way (1). Then I define a first special rule that will be used as the root rule, to find every combination of valid sentences (2). The SRGS documentation from the W3C is not clear on how root rules should be handled in the context of a mixed-initiative form. The rule described above might not be necessary, but it makes sure that the system works.

    Switch between windows

    (3) (4) (5) VoiceXML code.

    switch to window number (2) (3) $.windows.windowNb = $digit ; if ($digit == 0) $.windows.url = " http://localhost/DISSERTATION/tmp.html" ; else $.windows.url = "ERROR" ; SRGS and SISR code (grammar file).

    Here, we have an example of a real recognition. The goal of this part is to recognise a sentence like “switch to window number 0”. When the system recognises this sentence, it opens the corresponding “window”. Let’s have a look at the process:

    (1) The system recognises “switch to window number x” (in the grammar file).
    (2) The system then runs the ECMAScript code attached to this rule.
    (3) The variable $.windows is modified (in the grammar file). This automatically fills the windows field (in the VoiceXML file).
    (4) The VoiceXML code attached to this field is executed.
    (5) If the “window” indicated by the number exists (has already been opened), the system puts the URL of the corresponding “window” in the url variable and sends all the needed variables to the server. The server then replies to the user with the processed transformation of the new URL.
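The semantic result built for this sentence can be mirrored in a few lines. The sketch below is a Python illustration of the logic only (the real code is ECMAScript generated inside the grammar file); the `switch_window` function and its `open_windows` list are hypothetical names.

```python
from typing import Dict, List


def switch_window(window_nb: int, open_windows: List[str]) -> Dict[str, object]:
    """Mirror the result built when "switch to window number x" is heard."""
    if 0 <= window_nb < len(open_windows):
        url = open_windows[window_nb]   # the "window" has already been opened
    else:
        url = "ERROR"                   # same sentinel as in the grammar output
    return {"windowNb": window_nb, "url": url}
```

In the generated grammar the list of open windows is unrolled into an if/else chain, one branch per known window, which is why only window 0 appears in the test output.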

    Listing Favourites

    (2) (3)

    (4)

    VoiceXML code.

    list favourites $.listFavorite = 'FILLED' ;

    (1) (2)

    SRGS and SISR code (grammar file).

    This action is much simpler than the previous one. If the grammar recognises “list favourites” (1), it fills the listFavorite field (2) of the VoiceXML document, which then calls the appropriate ECMAScript function listFavorite() (3) that we introduced at the beginning. At the end, we clear the listFavorite field (the listFavorite value) (4) in order to be able to recognise it again... and therefore for the user to be able to ask for his favourites again. I would like to add that line (2) of the grammar is required in this case because we are in a mixed-initiative context. simply says what is inside itself.

    Opening a Favourite

    (5) VoiceXML code.

    Brookes

    (1) (2)

    (3) $ = 0 ; in a new window

    open if($newWindowFav == 0) $.openFavorite.windowNb = 0 ; (4) else $.openFavorite.windowNb = 1 ; $.openFavorite.favNb = $favorite ; SRGS and SISR code (grammar file).

    Nothing is new here, it is just a little more complex. The system recognises sentences like “open brookes in a new window”. The favorite rule is a list of all the favourites configured by the user (1). meaning that the rule will try to recognise one of (and only one of) the elements that are inside itself (2). The newWindowFav rule is there to capture the optional “in a new window” (3). If the user wants to open his/her favourite in a new “window”, the system looks for the next available “window” (4); in our case, “window” number 1. The field is then filled (5) and the variable initialised at the beginning is now used to give its value to the url variable (6).

    Opening a URL

    (5) VoiceXML code.

    (1) (2) $ += $alphabet (4) $ = 0 ; in a new window open if($newWindowURL == 0) $.openURL.windowNb = 0 ;

    else $.openUrl.windowNb = 1 ; $.openURL.url = $url ;

    (3)

    SRGS and SISR code (grammar file).

    The idea is exactly the same as the previous one, but here we have to recognise a URL instead of recognising a number. The system will recognise: “h t t p : / / w w w . g o o g l e . c o m /”. The important points are:

    (1) The repeat attribute permits the system to find more than one letter and therefore lets it recognise the list of letters.
    (2) We are constantly adding the newly found letter to the ones found previously. The $ represents the value that the url rule will return, which will be caught at (3) by the openURL rule.
    (4) The newWindowURL rule recognises the optional “in a new window”.
    (5) The openURL field is filled; the variables are set and sent back to the server.

    The alphabet.grxml file contains a grammar that recognises every possible character of a URL.
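The accumulation done by the url rule can be illustrated as follows. This Python sketch only mirrors the idea of the `$ += $alphabet` statement; the `SPOKEN` table of spoken names for punctuation is a hypothetical stand-in for whatever alphabet.grxml really returns.

```python
from typing import Dict, Iterable

# Hypothetical mapping from spoken tokens to URL characters.
SPOKEN: Dict[str, str] = {"colon": ":", "slash": "/", "dot": "."}


def build_url(tokens: Iterable[str]) -> str:
    """Concatenate the recognised characters one by one."""
    url = ""  # plays the role of "$" in the url rule
    for token in tokens:
        url += SPOKEN.get(token, token)  # same idea as "$ += $alphabet"
    return url
```

For example, the tokens of “h t t p colon slash slash w w w dot g o o g l e dot c o m slash” are assembled into a single URL string that the openURL rule then receives.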

    Sending links and zips

    As they are all done in exactly the same way, I will only explain the first action... and it will be quite short because it’s very simple.

    (2) (3) VoiceXML code. Send a link to the current “window” by email.

    send url $.sendLink = 'FILLED' ;

    (1) (2)

    SRGS and SISR (grammar file). Send a link to the current “window” by email.

    The recognition part can be compared to the one used when we wanted to recognise “list favourites”: when the exact phrase is matched (1), the system fills the variable corresponding to the right VoiceXML field with any data (2). Then the right data is sent (3) to the send.cgi server script. I am adding here screenshots of the web page, first loaded directly from the URL and then decompressed from the zip.

    Before. In here the image can be anywhere in the web server file tree; or even on some other web server.

    After. Here all the files are in the same zip, in the same folder, numbered from file number 0 to file number n. To do this, the system parses the whole HTML file, downloads the files that need to be downloaded, changes all the href, src, background, ... attributes in the HTML file, saves everything, zips everything and finally sends the zip to the user by email.

    As you can see, there is no difference... meaning that the system is doing its job right! It works correctly for images (as we can see here), but also for CSS files, background images, Java applets, Flash animations, JavaScript files, ...

    Transforming the tag TESTING THE DISSERTATION

    (3)

    HTML tag in input.



    (1) (2)

    Preferences for the tag.

    TESTING THE DISSERTATION

    (3)

    VoiceXML code.

    As we can see in the preference file, an action is required when this tag is met (action="on") (1). The action specified is to add a 2-second pause where the tag closes (2). And this is exactly what is done (3). The pause element is not really a VoiceXML tag; it is part of the SSML specification (another one), which specifies how some words should be said (or, in this case, not said). SSML tags can be included directly in VoiceXML documents starting from VoiceXML version 2.0. With this tag, we show that a pause can be specified using a preference file, but the user can also choose any arbitrary text to be spoken before or after a tag.
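How a preference entry drives the output can be sketched as below. This is a Python illustration only: the preference keys (`break_before`, `break_after`, `text_before`, `text_after`) and the `apply_preferences` function are hypothetical, since the exact attribute names of preferences.xml are not reproduced here. The pause is emitted as an SSML `break` element with a `time` attribute.

```python
from typing import Dict


def apply_preferences(tag_text: str, prefs: Dict[str, str]) -> str:
    """Wrap the spoken text of one tag according to the user's preferences."""
    if prefs.get("action") != "on":
        return tag_text  # no action requested: the text is passed through as is

    def pause(key: str) -> str:
        seconds = prefs.get(key)
        return f'<break time="{seconds}s"/>' if seconds else ""

    return (pause("break_before") + prefs.get("text_before", "")
            + tag_text
            + prefs.get("text_after", "") + pause("break_after"))
```

With `{"action": "on", "break_after": "2"}` the text is followed by a 2-second pause, and with `{"action": "off"}` it comes out unchanged, as pure text.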



    Transforming the

    tag

    TESTING THE DISSERTATION

    HTML

    tag in input.



    (1) (2) (3)

    Preferences for the

    tag.

    TESTING THE DISSERTATION

    (4)

    VoiceXML code.

    Here it is the same idea as previously: the tag requires some processing because its action property is equal to “on” (1). The user requires a break of 2 seconds before (2) and after (3) the tag. The output corresponds to the user’s request (4).

    Transforming the

    tag

    TESTING THE DISSERTATION

    HTML

    tag in input.



    (2)

    Preferences for the

    tag.

    TESTING THE DISSERTATION

    (1)

    Pure text in the VoiceXML file.

    You might have wondered, reading the previous section, why the tag looks as if it has not been processed (1). The answer is to be found in the preference file: the user requires no action for this tag (action="off") (2). Of course, the user could have selected, as in the previous section, a pause to be taken or some text to be spoken. From now on I won’t explain the tags that do not require any action. This testing section is already very long... there is no need to make it even longer.

    Processing a listing:
      and
    1. tags
      1. It should work.
      2. It should work GREAT!
      3. IT'S SOOOOO GOOD!

        (3) (4) (5)

      HTML
        and
      1. tags in input. (1) (2)

        Preferences for the
          tag.

          1 It should work. 2 It should work GREAT! 3 IT'S SOOOOO GOOD!

          (3) (4) (5)

          Pure text in the VoiceXML file.

          This case is more interesting: we are trying to handle the numbering of a list. In HTML, there is a whole bunch of tags for doing different listings:
            ,