Enhanced Search and Navigation on Conversational Speech

Frederik Cailliau 1,2
[email protected]

Aude Giraudel 1
[email protected]

1 Sinequa Labs – 12, rue d'Athènes, 75009 Paris, France
2 LIPN, Université Paris-Nord – 99, avenue Jean-Baptiste Clément, 93430 Villetaneuse, France

ABSTRACT
Huge amounts of conversational speech continually flow through call centers worldwide but remain inaccessible. In the context of a French research project, we adapted our industrial search and navigation engine so that it can process conversational speech. Our full text search engine indexes the transcripts produced by an automatic speech recognition system. We adapted our processing at two crucial levels: text analysis and user interface. To tackle the problem of disfluencies, a special language model has been developed for the integrated part-of-speech tagger. This text-based approach enables the use of current named entity recognition and data mining methods. The user interface takes into account the nature of conversational speech documents without leaving the user behind. We will demonstrate the operational search and navigation engine with special emphasis on the user interface. The index will contain a 150h corpus of automatic transcripts and some of the corresponding anonymized audio files.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – linguistic processing; H.5.2 [Information Interfaces and Presentation]: User Interfaces – Ergonomics, Voice I/O.

General Terms
Performance, Design, Reliability, Experimentation.

Keywords
Spoken Document Retrieval, Full Text Search, Interface Design, Spontaneous Speech, Conversational Speech, Spoken Dialogue.

1. INTRODUCTION
There is a growing demand for intelligent analysis of conversational speech. Some major economic players for whom call centers play a key role in their customer relationship management are especially concerned.

The enormous amounts of data that flow through their call centers contain precious information that can increase a company's competitiveness. The system we present enables data discovery through search and navigation technology. It is partly a product of one of the subtasks of Infom@gic, the major project of the French business cluster Cap Digital. The goal of subtask 2.31 is to develop data mining on conversational speech by combining and adapting existing technologies. A consortium was built including the energy supplier EDF, the research laboratory Limsi and three R&D-performing SMEs: Sinequa, Vecsys and Temis. Each actor is specialized in one part of the processing chain. The use case and the role of each actor have been presented in [6], with a special focus on the corpus collection and the transcription technology. This project was the occasion for Sinequa to adapt its NLP-based search technology to process conversational speech. Popular approaches in the domain of information retrieval from speech and existing systems are presented in section 2. We then detail the general architecture of the search engine and give some examples of the adaptations.

2. RELATED WORK
The amount of data stored in digital speech archives increases every day, creating a tremendous need for automatic tools to explore these data the way we have become used to for textual documents. Most search engines, like Yahoo and Google, perform their search on the surrounding text, such as the document title, as if it were metadata. From a data mining and enhanced navigation perspective, this approach is insufficient, as the audio document is presented as a whole with no direct access to its internal structure or content. Most systems for information retrieval on speech documents perform a more in-depth analysis of the audio files and enable a detailed search of the content. They generally use a combination of automatic speech recognition (ASR) and information retrieval. Some examples of such systems are Rough'n'Ready [9], Speechbot [10] and SCAN [4]. The common idea behind these systems is to index and search the transcripts produced by the ASR system. Our system, as explained in the next section, shares the same architecture. The interfaces for search in audio files presented in [8] served as a first source of inspiration for our work. We adapted the interface to our project partners' input, remarks and demands, which led to the interface presented in this paper's last section.

3. GENERAL ARCHITECTURE
The system presented here has been built from an industrial full text search engine, which we adapted to the needs of spoken document retrieval. The chosen approach assumes that the performance of automatic transcription is not critical for information retrieval [1]. We thus favor the production and processing of transcripts from recordings of conversational speech to generate the search engine's index. Figure 1 presents the overall functional architecture of the processing chain. A first process produces automatic transcripts from call center audio recordings. The adapted search engine then uses the transcripts as input for indexing and provides facilities to search and navigate through conversational speech.

Figure 1. System architecture

The Automatic Speech Recognition (ASR) system uses modeling and decoding strategies developed in the Limsi conversational telephone speech system [7]. Prior to transcription, automatic speaker segmentation and tracking keep track of the speaker identity, which is necessary because the recordings are mono-channel. The result is a sequence of non-overlapping, acoustically homogeneous segments corresponding to the speaker turns in the audio document. The automatic speech transcription then performs a speech-to-text conversion, with special care given to the adaptation of the acoustic and language models to improve speech recognition on call center data. Specific work has also been done to facilitate the data mining procedure, including punctuation insertion, so as to produce transcribed corpora that approach a written structure. A first adaptation of the LIMSI conversational telephone speech (CTS) transcription system resulted in a WER of 35% on the development set [6]. After a short description of the transcript corpora, the next sections detail the two modules that had to be re-engineered to adapt our full text search engine to spoken document retrieval.
4. CORPORA
Several corpora have been developed during the project¹. Two transcript corpora have been built by Vecsys by manually transcribing telephone conversations recorded in the EDF call center. The level of manual transcription differs for each corpus: fine for the 20h corpus and fast for the 150h corpus. Both are used by the Limsi to build the acoustic models. All client identification related information has been anonymized. More details on the corpora and the acoustic model, as well as some preliminary automatic transcription results, can be found in [6]. Sinequa has used the fine transcript corpus to adapt its text analysis, as discussed in 5.1. A third corpus, corresponding to the automatic transcripts of the 150h corpus, has been delivered. These will replace the manual transcripts for the interface and indexation tests. Coded in XML, they contain the following data, which are largely exploited in the interface as we explain in section 6. The essential XML tags are speech turns and words. Each speech turn has a speaker id, the product of the speaker identification and tracking performed by the ASR system, since the audio was recorded on one channel. Each word has two attributes: a time code and a confidence score. Punctuation (comma and period) has been inserted.

¹ Access to these corpora is restricted to the project members.
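As a rough illustration of this structure (the tag and attribute names turn, word, speaker, start and conf are hypothetical placeholders, not the schema actually delivered in the project), a transcript of this kind could be read as follows:

```python
# Minimal sketch: a transcript holds speech turns carrying a speaker id, and
# words carrying a time code and a confidence score, as described above.
# Tag and attribute names are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<transcript>
  <turn speaker="spk1">
    <word start="0.42" conf="0.97">bonjour</word>
    <word start="0.91" conf="0.88">madame</word>
  </turn>
  <turn speaker="spk2">
    <word start="1.80" conf="0.62">oui</word>
    <word start="2.05" conf="0.71">bonjour</word>
  </turn>
</transcript>
"""

def iter_words(xml_text):
    """Yield (speaker, word, start_time, confidence) tuples for indexing."""
    root = ET.fromstring(xml_text)
    for turn in root.iter("turn"):
        speaker = turn.get("speaker")
        for w in turn.iter("word"):
            yield speaker, w.text, float(w.get("start")), float(w.get("conf"))

if __name__ == "__main__":
    for speaker, word, start, conf in iter_words(SAMPLE):
        print(f"{speaker}\t{start:6.2f}s\tconf={conf:.2f}\t{word}")
```

The word-level time codes are what later allow the interface of section 6 to synchronize the transcript display with audio playback.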

5. TEXT ANALYSIS
At the core of the search engine, a text analysis module performs several processes: document conversion, word and sentence segmentation, part-of-speech tagging and lemmatization, named entity recognition and semantic document analysis. Its analyses are indexed, can be queried through the interface and are used for navigation purposes.

5.1 Part-of-speech tagging
For lexical disambiguation, the part-of-speech tagger needs a language model and general language lexicons. The model contains contextual rules that are derived from a training corpus by a supervised training module. After disambiguation, a lemma is provided for each word by the general language lexicon.

The main difference between spontaneous speech transcripts and regular written text is the pervasive presence of disfluencies and of discursive elements like interjections and greetings. The only domain-specific lexicon we added contains the names of energy companies active on the French market. Our general language lexicons, usually used to tag written texts, contain 135 interjections but lacked some of the most common ones, like hum, hm, nan, quoi. For this reason we added a lexicon containing only interjections. Some of these word forms were already present as nouns, like hm, abbreviation of the measure hectomètre (hectometer), and quoi, a relative and interrogative pronoun. In a second phase we may add or redefine some word descriptions for discursive elements like bonjour and bonsoir, to avoid tagging them as nouns, which in most contexts they are not.

Most of the disfluencies are characterized by part-word, whole-word or phrase repetitions, and by revisions. Example 1 illustrates most of the disfluency types observed in the corpus.

Example 1: par ce que le fait d'avoir changé le compteur ça euh ça ça a fait que l'hi l'historique n'est plus euh n'est plus euh juste comme il l'était avant parce qu'avant je pense que c'était euh c'était euh c'était juste pour vous
(because the fact of having changed the meter that eh that that makes the his the history isn't eh isn't eh just like it was before because before I think it was eh it was eh it was eh just for you)

All types of disfluency break the canonical syntax observed in written texts. Radio broadcast transcripts, as we observed in a previous project, only marginally share these characteristics: disfluencies are quite rare there, even in interviews. Radio broadcasts can therefore be considered as read speech, whose characteristics are very close to written text. The performance of a written language model on conversational speech is unsatisfactory. For example, it typically tags two successive la as a determiner (the) followed by a noun (the note la), whereas in speech this is almost exclusively a repetition of the determiner (example 2).

Example 2: la la le la facture on l'a pas reçue, là on a reçu un rappel hier
(the the the the bill we didn't get it, but we got a reminder yesterday)

For information retrieval the difference matters, since determiners are considered empty words and are therefore not indexed. Unlike for the radio broadcasts, a new language model has been built because of the observed lexical and syntactic differences. The conversational language model has been derived from the 20h fine transcript corpus, corresponding to about 130,000 words. These transcripts were manually annotated with part-of-speech categories by Sinequa to build the training corpus. Only POS tags have been taken into account by the training module to create the disambiguation rules, limiting the number of learning features. The tagger provides a part-of-speech tag for every word and does not disambiguate unless a disambiguation rule exists; multiple tags for a word are therefore possible.

The performance of the new language model has been evaluated in terms of precision and recall, using the following measures:

Precision = total of correct tags / total of tags
Recall = total of correct tags / total of tokens

For the evaluation, the 98 files of the fine transcript corpus were arbitrarily distributed into 12 sets: 11 sets of 8 files and 1 set of 10 files. Twelve language models have been created, each trained on 11 sets and evaluated on the remaining set. Standard deviation, minimum, maximum and mean values for precision and recall are given in Table 1.

Table 1. Tagging results on manual transcripts

            Min      Max      Mean     Standard deviation
Precision   0.9760   0.9853   0.9817   0.0080
Recall      0.9779   0.9860   0.9826   0.0074

These good results can probably be explained by a limited vocabulary and a limited variety of syntactic structures, producing a language model that is less complicated than one trained on a written corpus. The improvement gained by building a specific language model has been measured by tagging the transcription corpus with a language model trained on a journalistic corpus: this gave a mean precision of 0.9165 and a recall of 0.9986. The tagger performance should also be evaluated on automatic transcripts. The problem here is that an erroneous ASR output contains syntactically and semantically incorrect sentences, making it impossible to build an evaluation corpus of manually tagged automatic transcripts.
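As an illustration of the evaluation protocol described above, the following sketch implements the leave-one-set-out procedure and the two measures; train_model and tag are hypothetical stand-ins for the proprietary tagger, and the multiple-tags-per-token convention mirrors the tagger behavior described above.

```python
# Sketch of the 12-fold evaluation described above. train_model() and tag()
# stand in for the proprietary tagger; gold carries one reference tag per
# token, while the tagger may leave several candidate tags on ambiguous tokens.
from typing import Callable, List, Set, Tuple

Token = str
TaggedSentence = List[Tuple[Token, str]]      # gold: one tag per token
Prediction = List[Tuple[Token, Set[str]]]     # system: possibly several tags

def evaluate(gold: List[TaggedSentence], pred: List[Prediction]) -> Tuple[float, float]:
    correct = total_tags = total_tokens = 0
    for g_sent, p_sent in zip(gold, pred):
        for (_, g_tag), (_, p_tags) in zip(g_sent, p_sent):
            total_tokens += 1
            total_tags += len(p_tags)
            if g_tag in p_tags:
                correct += 1
    precision = correct / total_tags      # total of correct tags / total of tags
    recall = correct / total_tokens       # total of correct tags / total of tokens
    return precision, recall

def cross_validate(file_sets: List[List[TaggedSentence]],
                   train_model: Callable, tag: Callable):
    """Train on 11 sets, evaluate on the held-out one, for each of the 12 sets."""
    scores = []
    for i, held_out in enumerate(file_sets):
        train = [s for j, fs in enumerate(file_sets) if j != i for s in fs]
        model = train_model(train)
        pred = [tag(model, [tok for tok, _ in sent]) for sent in held_out]
        scores.append(evaluate(held_out, pred))
    return scores
```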

5.2 Entity Recognition
Entity recognition is performed by proprietary FSA technology. It essentially serves three high-level functionalities in the search engine: browsing through the indexed pages as if the entities were metadata, intra-document navigation, and reading help by highlighting the different entities. These user-centered tools are of great importance to boost the search and retrieval performance of the search engine, as shown in [5]. They rely heavily on the performance of the entity recognition module, unlike document retrieval, which is reported not to suffer from transcription word error rates of up to 50% [1]. Named entity recognition is only possible when the ASR has correctly transcribed the words to be recognized. If a name is out of vocabulary for the ASR, it is transcribed as one or more other words, making the phrase unintelligible and the original entity impossible to detect. On radio broadcasts, we experienced that ASR errors do have a great impact on entity extraction and therefore considerably, but acceptably, lower the overall performance of an intelligent search engine, especially its navigation capacities [2]. We noted for example a regression in the detection of person names: recall drops from 0.80 on manual transcripts to 0.73 on automatic ones at almost constant precision (0.90 and 0.91). The errors on noun phrase extraction doubled: the mean error rate was 5% on text and 10% on transcripts. ASR errors are at the origin of both degradations. Company and geographic names were essentially extracted using white lists and were therefore not evaluated in the same way.

For call center applications, however, person name extraction is less important: as the agent always asks for the client's references, the client's contact details are clearly known and can be attached to the speech file by post-processing. The same goes for all client-related entities such as address, bank account number, telephone number, etc. More important are the entities that, in view of future data mining, enrich the central part of the conversation. All special vocabulary (e.g. relevé de compteur, a meter reading) and energy-related companies are of great interest for navigation and data mining purposes.
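Purely as an illustration of the white-list part of this process (the actual module relies on proprietary finite-state technology, and the lists below are invented examples of domain vocabulary and company names), a naive matcher could look like this:

```python
# Naive white-list matcher, for illustration only: the real module uses
# proprietary finite-state technology. Lists below are invented examples.
import re

WHITE_LISTS = {
    "COMPANY": ["edf", "gdf"],
    "DOMAIN_TERM": ["relevé de compteur", "facture", "compteur"],
}

def annotate(transcript: str):
    """Return (start, end, label, surface) spans found in a transcript."""
    spans = []
    for label, terms in WHITE_LISTS.items():
        for term in sorted(terms, key=len, reverse=True):   # try longer terms first
            for m in re.finditer(r"\b" + re.escape(term) + r"\b",
                                 transcript, flags=re.IGNORECASE):
                spans.append((m.start(), m.end(), label, m.group(0)))
    return sorted(spans)

print(annotate("oui bonjour j'appelle pour le relevé de compteur et la facture EDF"))
```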

6. USER INTERFACE DESIGN
When retrieving conversational audio documents, the presentation of the results is as important as the analysis and indexing of the documents. Although several systems have been built to retrieve segments of spoken documents, little attention has been paid to developing intelligent interfaces that enhance result presentation and navigation. Starting from the difficulties associated with such a development, we detail the key features of the interface we have developed.

6.1 Difficulties linked to the characteristics of spoken documents
The traditional interface of text-based retrieval systems, which provides a ranked set of documents relevant to a user query, is insufficient for audio because of the problem of scanning and browsing speech data. Given the sequential nature of speech, it is extremely laborious to scan through whole audio contents to identify the relevant segment. Interfaces developed to access spoken documents need to support enhanced navigation in audio contents or transcripts while preserving the relevance of the result set. Spoken audio content, when used in a speech retrieval system based on transcripts, can be delivered as audio or visually. These two display modes are complementary when used together. On the one hand, the audio content preserves the original audio segment, but its sequential structure makes it hard to extract information. On the other hand, transcripts can easily be displayed for scanning, but they contain errors that may disturb users. The challenge is thus to present key elements to the user, either in audio or in textual form, to enhance efficiency in the retrieval task. We therefore need interfaces that allow rapid scanning through audio content while providing multimodal access to relevant information.

6.2 Presentation of the global interface
The interface developed in the system follows the paradigm known as "What you see is almost what you hear" introduced in [11] and favors direct access to the segment containing the relevant information (see Figure 2). In response to a query, the interface provides a set of relevant spoken documents. The search results are displayed as a relevance-ranked list. Each displayed document is composed of three elements: metadata that give access to key information (title, number of speakers, speech duration), a "content panel" that provides transcript segments of the retrieved document, and a "navigation panel" for intra-document navigation, allowing the user to rapidly scan through the matching speech turns of each document.
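A rough sketch of the data each result could carry to feed these three display elements follows; the field names are ours, not the engine's actual API.

```python
# Hypothetical data model for one search result, mirroring the three parts of
# the display described above: metadata, content panel and navigation panel.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    turn_before: str          # previous speech turn (context)
    matching_turn: str        # speech turn containing the query match
    turn_after: str           # following speech turn (context)
    audio_start: float        # time code used to launch audio playback

@dataclass
class ResultDocument:
    title: str                # metadata shown with the result
    num_speakers: int
    duration_s: float
    segments: List[Segment] = field(default_factory=list)  # content panel (slideshow)
    # the navigation panel is one clickable icon per entry in `segments`
```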

6.3 Query results presentation
Automatic transcripts are often hard to read because of possible ASR errors and the syntactic characteristics of spontaneous speech. We therefore provide direct access to the original speech segments. The transcription of the matching speech turn is displayed in its context (previous and following speech turns), and the audio content is directly accessible by moving the mouse over the transcription. Each retrieved document is represented by its segments that contain a matching speech turn. A segment is made of the matching speech turn with the preceding and following speech turns. To avoid overcrowding the interface (one document can easily have 7 or more matching segments), the different segments are displayed as a slideshow. Every document shows a series of clickable icons corresponding to the segments, enabling user-centered intra-document navigation. The slideshow timing is based on human reading abilities, using the ratio between silent and oral reading, the latter being close to the recording time [3]. Query words are highlighted and named entities are colored. This puts the relevant information forward and allows users to rapidly get a semantic view of the proposed document. The audio content is accessible at any time: pointing at a text segment starts playback of the corresponding audio segment. This is very useful when the transcript contains ASR errors and therefore seems odd. We chose to synchronize audio playback with a highlight of the corresponding word in the transcription (the word in bold on the screenshot). In this way we create a synergy between text and audio content.
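As a sketch of these two timing mechanisms, the following snippet derives a segment's display time from a silent/oral reading ratio and picks the word to highlight from the word-level time codes; the ratio value of 2.0 and the word list are assumed illustrative values, not those used in the system.

```python
# Sketch of the two timing mechanisms described above: slideshow display time
# and word highlighting synchronized with audio playback via ASR time codes.
from bisect import bisect_right
from typing import List, Tuple

SILENT_OVER_ORAL_READING = 2.0   # assumption: silent reading ~2x faster than speech

def display_time(segment_audio_duration_s: float) -> float:
    """How long the slideshow displays a segment before moving to the next one."""
    return segment_audio_duration_s / SILENT_OVER_ORAL_READING

def word_at(playback_pos_s: float, words: List[Tuple[float, str]]) -> str:
    """Return the word to highlight, given word start time codes from the ASR."""
    starts = [start for start, _ in words]
    i = bisect_right(starts, playback_pos_s) - 1
    return words[max(i, 0)][1]

words = [(0.0, "la"), (0.4, "facture"), (0.9, "on"), (1.1, "l'a"), (1.3, "pas"), (1.6, "reçue")]
print(display_time(1.9))        # seconds the segment stays on screen
print(word_at(1.2, words))      # -> "l'a"
```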

Figure 2. Search Engine Interface

6.4 Enhanced navigation
6.4.1 Navigation by noun groups and named entities
In order to enhance navigation, lists of extracted noun groups and named entities are presented on the left-hand side of the page. These extractions strongly depend on the user query and are thus contextual. Navigating with these extractions allows the user to make his original query more precise: each extraction is a clickable text that launches a new query based on the previous one. A simple click adds the noun group to the query and sends it to the search engine. The extraction process, combined with statistical weighting, can be seen as a filtering process that narrows down the number of retrieved documents presented in the result set.
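A minimal sketch of this refinement step follows; the function name and the simple space-separated query syntax are assumptions, not the engine's actual query language.

```python
# Illustration of the refinement step: clicking an extracted noun group or
# entity appends it to the current query before the search is re-run.
def refine(current_query: str, clicked_extraction: str) -> str:
    """Build the new, more precise query sent to the engine after a click."""
    return f"{current_query} {clicked_extraction}".strip()

q = "compteur"
q = refine(q, "relevé de compteur")     # user clicks an extracted noun group
q = refine(q, "EDF")                    # then an extracted company name
print(q)                                # -> "compteur relevé de compteur EDF"
```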

6.4.2 Intra-document navigation
The "navigation panel" gives access to all matching speech turns present in a retrieved document. Clicking on the corresponding icon displays the transcription of the matching turn, surrounded by a few speech turns that provide context. Used in combination with the slideshow, which continuously displays the different matching turns, the navigation panel gives the user random access to the audio content, allowing him to go directly to the matching turn of interest. The permanent access to audio content is an advantage in case of transcription errors that may cause comprehension problems. In short, the proposed interface uses transcripts to avoid playing the complete audio file, which is particularly laborious. The highlighting of the key elements and the relevant information enhances rapid scanning of the document. The combination of audio and visual content is particularly beneficial as it provides a complete and rich view of the retrieved documents.

7. CONCLUSION
We presented an operational search engine designed to retrieve spoken documents. The system is adapted from an existing full text search and navigation engine, with special attention to the design of the text analyzer and a new interface to present and navigate spoken documents containing conversational speech. As the implementation has so far used manual transcripts, an evaluation still has to be carried out to confirm that transcription errors are not critical for the information retrieval process. The integration of automatic transcription is under way. In the near future, the Infom@gic project foresees a smarter segmentation of the audio content by detecting topics in the conversations and creating an index composed of structured documents. We then plan to take advantage of this structure to represent topical segmentation in the interface.

8. ACKNOWLEDGMENTS
This work was partly financed by the French business cluster Cap Digital (Infom@gic ST2.31).

9. REFERENCES
[1] Allan, J. 2001. Perspectives on information retrieval and speech. In Proceedings of the SIGIR 2001 Workshop on Information Retrieval Techniques for Speech Applications.
[2] Cailliau, F., Loupy, C. de. 2007. Aides à la navigation dans un corpus de transcriptions d'oral. In Proceedings of TALN 2007, pp. 143-152. Toulouse, France.
[3] Carver, R.P. 1972. Speed readers don't read; they skim. Psychology Today, August, 22-30.
[4] Choi, J., Hindle, D., Hirschberg, J., Pereira, F., Singhal, A. and Whittaker, S. 1999. Spoken content-based audio navigation (SCAN). In Proceedings of ICPhS-99 (International Congress of Phonetic Sciences), San Francisco, California.
[5] Crestan, E., Loupy, C. de. 2004. Browsing help for a faster retrieval. In Proceedings of COLING 2004, pp. 576-582. Geneva, Switzerland.
[6] Garnier-Rizet, M., Adda, G., Cailliau, F., Gauvain, J.-L., Guillemin-Lanne, S., Lamel, L., Vanni, S., Waast-Richard, C. 2008. CallSurf - Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content. In Proceedings of LREC 2008. Marrakech, Morocco.
[7] Gauvain, J.-L., Adda, G., Lamel, L., Lefevre, F., Schwenk, H. 2004. Transcription de la parole conversationnelle. TAL, 45(3). Hermès, Paris, France.
[8] Hürst, W. 2004. User interfaces for speech-based retrieval of lecture recordings. In Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 4470-4477, Chesapeake.
[9] Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L., Schwartz, R. and Srivastava, A. 2000. Speech and language technologies for audio indexing and retrieval. Proceedings of the IEEE, Vol. 88, No. 8, pp. 1338-1353.
[10] Van Thong, J.-M., Moreno, P. J., Logan, B., Fidler, B., Maffey, K. and Moores, M. 2002. Speechbot: An experimental speech-based search engine for multimedia content on the Web. IEEE Transactions on Multimedia, 4(1).
[11] Whittaker, S., Choi, J., Hirschberg, J. and Nakatani, C. 1998. "What you see is almost what you hear": design principles for accessing speech archives. In Proceedings of ICSLP-98, Sydney.