
Structured Audio Player: Supporting Radio Archive Workflows with Automatically Generated Structure Metadata

Martha Larson¹ & Joachim Köhler²

¹ ISLA, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands, [email protected]
² Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany, [email protected]

Abstract

Although techniques to automatically generate metadata have been steadily refined over the past decade, archive professionals at radio broadcasters continue to use conventional audio players in order to screen and annotate radio material. In order to facilitate technology transfer, the archives departments of two large German radio broadcasters, Deutsche Welle and WDR, commissioned Fraunhofer IAIS to develop a prototype audio archive and to investigate the practical aspects of integrating automatically generated metadata into their existing workflows. The project identified the structuring of radio programs as the area in which automatically generated metadata has the clearest potential to support the work of archive staff. This paper discusses the development and performance of the structured audio player, the component of the audio archive system that demonstrates this potential. The automatically generated structure metadata includes speaker boundaries, speaker IDs, speaker gender and identification of audio segments not containing speech. In contrast to similar systems, our prototype was designed, developed and optimized in a project group composed of both archive professionals and multimedia researchers. As a result, important insights were gained into how automatically generated metadata should (and should not) be deployed to support the work of archivists preparing radio content for archiving.

Introduction

Radio broadcasters are moving towards fully digital workflows. The cost of digital storage media, in the past a limiting factor, has fallen to the point that it is no longer necessary to discard any of the content produced. Producers and journalists are integrating more and more recycled content into new productions, and a rising awareness of the value of existing analog archives has led to large retro-digitization projects. The net effect of these trends is that the archive departments at radio broadcasters are faced with a new set of challenges, but are also provided with a new palette of solutions. The challenges include ever-increasing amounts of content to be annotated and rising demand for recycled content. The solutions include an array of audio processing techniques that can be applied to digital radio content to automatically produce metadata. Archive departments at large radio broadcasters are moving “out of the basement” and becoming more closely involved with the production of content (Hans & de Koster, 2004).

It is clear that algorithms for audio processing must be exploited by radio archives in order to keep pace with current developments. The form that this exploitation should take, however, remains far from obvious. As a rule, the development of technology for automatic metadata generation has taken place at universities and laboratories in relative isolation from the principles of information science and the time-tested conventions that guide the daily workflow of archive professionals at large radio broadcasters. Statistical approaches produce metadata that differs in fundamental ways from the human-created annotations, such as summaries, that currently provide the basis for information retrieval in broadcast archives. The difference between human annotations and machine-generated metadata is difficult to grasp without exposure to concrete examples, and experience has shown that generic demonstrators are not task-specific enough to allow archive departments to estimate the potential of automatically generated metadata or to conceptualize how this metadata could support their workflow.

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 - Copyright C.I.D. Paris, France

In order to facilitate the planning of an operational system incorporating automatically generated metadata, two German radio broadcasters, Deutsche Welle and WDR, commissioned a prototype radio archive to be developed by Fraunhofer IAIS. The system was to be based on technologies that have proven themselves in past broadcast news retrieval systems, but with a novel difference: the design and optimization process would lie completely in the hands of a project group comprising not only researchers developing the underlying algorithms, but, critically, also a large number of end users, the archive professionals themselves. The project group met on a regular basis and together defined the functionality of the system, chose the data to be used for evaluation, designed the user interface and carried out qualitative and quantitative tests of system performance.

This paper treats in detail one module of the resulting system, the structured audio player. At the end of the project, the archive professionals deemed the automatically generated structure metadata and the structured audio player to hold the most evident potential for supporting their work. The next sections present the structured audio player, highlighting issues that emerged during the cooperative design process. Finally, results of the evaluation of the quality of the metadata produced and displayed in the structured audio player are presented.

Structured Audio Player

The structured audio player makes the work of archive professionals more efficient because it gives them intelligent control over playback, enabling them to scan quickly through audio files. As can be seen from the screen shot in Figure 1, the player has a browser-based interface that displays a radio program as a list of segments, each labeled speech or non-speech.
Segments labeled speech are further identified by a speaker ID number, which makes it possible to trace a speaker through the program. When one of the segments is clicked, the audio player jumps into the audio and starts playing from that particular point. There are two sets of fast-forward/rewind buttons: high-speed buttons, which move quickly and quietly through the audio, and low-speed buttons, which move at a somewhat slower rate that allows users to listen in as an aid to orientation. The structured audio player is implemented in Flash, which makes possible a tight connection between mouse click and playback.

The project group decided that it is important to use a metadata standard to encode the automatically generated metadata (e.g. time points of segment boundaries and speaker ID labels of segments) in order to guarantee compatibility with other applications. MPEG-7 was chosen as the system standard. Internally, a Xindice¹ database is used to store and provide access to the metadata.

The archive systems currently in use in radio archives departments are coupled with audio players supporting no more than conventional functionality. For this reason, under current practices it is necessary to make a best-guess jump to find the beginning of the next segment. It takes multiple tries to fast forward through a song or commercial and land at the beginning of the immediately following speech segment. The structured audio player significantly reduces such time-wasting guesswork and makes it possible to skim in an informed way. Automatically generated metadata is particularly useful for programs lacking associated production data, which might have included time markers reflecting original cuts. Even if production data is preserved, automatic methods offer a clear advantage in that they are capable of generating a segmentation finer than that of the original production cuts.

¹ http://xml.apache.org/xindice/
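The click-to-jump behavior described above amounts to a lookup from cut-list entries to time points. As a rough illustration only (in Python rather than the Flash/MPEG-7/Xindice stack the prototype actually uses, and with hypothetical field names), the structure metadata can be modeled as a time-sorted segment list:

```python
from dataclasses import dataclass
from bisect import bisect_right
from typing import Optional

@dataclass
class Segment:
    start: float               # segment start time in seconds
    label: str                 # "speech" or "non-speech"
    speaker_id: Optional[int]  # speaker ID for speech segments, None otherwise

def jump_point(segments: list[Segment], clicked_index: int) -> float:
    """Playback position when the user clicks a segment in the cut list."""
    return segments[clicked_index].start

def segment_at(segments: list[Segment], t: float) -> Segment:
    """Find the segment containing playback time t (segments sorted by start)."""
    starts = [s.start for s in segments]
    return segments[bisect_right(starts, t) - 1]

segs = [
    Segment(0.0, "non-speech", None),   # e.g. opening jingle
    Segment(12.5, "speech", 1),         # e.g. moderator
    Segment(47.0, "speech", 2),         # e.g. guest
]

print(jump_point(segs, 2))                 # 47.0
print(segment_at(segs, 20.0).speaker_id)   # 1
```

The same lookup also drives the moving highlight: as playback time advances, `segment_at` identifies which cut-list row to emphasize.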

Figure 1: Interface of the Structured Audio Player

Designing the Structured Audio Player

The project group faced a number of important design decisions during the creation of the interface. It was found that archive professionals prefer to work with an interface closely resembling the metadata database that they currently use. For this reason, the player displays segments as a vertical cut list and makes full use of conventional abbreviations with which all (German-language) archivists are familiar. Archivists prefer to see as much as possible of the cut list at any given time and, for this reason, the representation of each cut is kept as compact as possible. When a cut is activated to start playback, it becomes highlighted. There was a strong preference that the currently playing cut be emphasized by boldface or background color, rather than a size change, which throws off the alignment of the columns. It is important that the highlight moves down the list from one segment to the next as the audio is played, and that the list scrolls to reveal upcoming cuts only when the highlight can move down no further.

As can be seen in Figure 1, the structured audio player also displays keywords, in this case “Johannes Rau”, located in speech recognition transcripts. When the user clicks a keyword, audio playback starts a few seconds before the time code of the keyword. The user can control this offset by setting the “Vorlauf” (lead time) at the left. Applying the same offset to segment playback turned out to confuse users and was eliminated from the design.

It was determined that users should be able to export metadata records in MPEG-7 format from the database, but that, additionally, a simpler method of data transfer should also be supported. Archivists found that the ability to print out the cut list as displayed in the player, or to cut and paste it into another application, would offer an immediate way to integrate automatically generated metadata into their workflow. This very simple but useful functionality was implemented as the button “Druckansicht” (printable view), which can be seen in Figure 1. This button opens the cut list displayed in the Flash interface as a text file in a text editor.

During the implementation and testing of the structured audio player, several important lessons were learned by the project group. We discovered that the skeleton of segments makes it possible for archivists to leverage professional knowledge concerning the structure of the programs. If production metadata indicates that a recording contains an interview with a politician, but not its start time, the structured audio player can help to localize the interview, since it is clear where segments with two different speaker IDs start alternating with characteristic frequency.
In order to exploit such patterns, the group requested that gender labels, which are generated by the speaker-clustering algorithm in its initial step, be included with the speaker ID labels. Furthermore, the project group discovered that the structured audio player can aid in discovering mis-catalogued radio programs, since each program has a characteristic distribution of segment lengths that archivists immediately recognize when the segments are displayed in the player.

Suggestions were collected for future versions of the player. It was deemed important to include a fast-forward button that would jump from segment to segment, playing the first few seconds of each. Also recommended was a user-controlled playback rate, which would make it possible to accelerate audio playback, but not beyond the point of personal comfort or understandability. Finally, during the project we realized that the structured audio player holds great potential to support blind archivists, whose workflow is particularly listening-intensive. Parallel to the final phases of the project, research investigating this aspect of the potential of the structured audio player was conducted (Busche, 2006).

Automatic Generation of Structure Metadata

The metadata needed for the structured representation of radio broadcasts is generated by three algorithms: segmentation, speech vs. non-speech classification, and speaker clustering. The segmentation algorithm generates time markers at each point at which the audio signal changes markedly in quality, typically from music to speech or from one speaker to the next. It makes use of the Bayesian Information Criterion (BIC) as applied in (Chen & Gopalakrishnan, 1998). The BIC is a criterion for model selection that makes it possible to decide whether the neighborhood of a potential segment boundary is better modeled as a single Gaussian process or as two separate Gaussian processes. Where the two models provide a better fit, a segment boundary is hypothesized.
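The BIC decision rule just described can be sketched numerically. The following is a minimal illustration of the ΔBIC comparison for one candidate boundary, not a reproduction of the system's actual features or tuning; the penalty weight `lam` and the synthetic two-dimensional "frames" are assumptions for the example:

```python
import numpy as np

def delta_bic(window: np.ndarray, split: int, lam: float = 1.0) -> float:
    """ΔBIC for a candidate boundary at `split` inside `window` (frames x dims).

    Positive ΔBIC: two separate Gaussians fit better -> boundary hypothesized.
    """
    n, d = window.shape
    left, right = window[:split], window[split:]

    def logdet_cov(x):
        # log-determinant of the sample covariance (rows are frames)
        _, ld = np.linalg.slogdet(np.cov(x, rowvar=False))
        return ld

    # model-complexity penalty for the extra Gaussian (mean + covariance params)
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * len(left) * logdet_cov(left)
            - 0.5 * len(right) * logdet_cov(right)
            - penalty)

rng = np.random.default_rng(0)
speech_a = rng.normal(0.0, 1.0, size=(200, 2))  # e.g. frames of one speaker
speech_b = rng.normal(5.0, 1.0, size=(200, 2))  # e.g. frames of another speaker
changed = np.vstack([speech_a, speech_b])       # true change at frame 200
homogeneous = rng.normal(0.0, 1.0, size=(400, 2))

print(delta_bic(changed, 200) > 0)      # True: boundary detected
print(delta_bic(homogeneous, 200) > 0)  # False: no boundary
```

In the full algorithm this test is slid over the signal, and a time marker is emitted wherever ΔBIC peaks above zero.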
For speech vs. non-speech classification, a maximum likelihood multivariate Gaussian classifier is used, which classifies individual 20 ms frames as either speech or non-speech. The feature vector representing each frame includes conventional features, such as average energy per frame, but also less widely-used features, such as zero-crossing values, which encode discriminative information about frequency without entailing the computational expense associated with an explicit move to the frequency domain. The entire segment is labeled as either speech or non-speech depending on the label predominant among the individual frames. Our initial work with zero-crossing features is reported in (Biatov, Larson & Eickeler, 2002). Details of the segmentation and classification algorithms can be found in (Biatov & Köhler, 2003).

The speaker-clustering algorithm derives from BIC-based clustering, also used in (Chen & Gopalakrishnan, 1998). The algorithm, which also identifies gender, extends this basic principle by making use of additional features that introduce global information into the clustering process. For details of global similarity in BIC-based clustering, see (Biatov & Larson, 2005).

Evaluation of the Audio Processing Algorithms

The system was evaluated on a hand-annotated set of 12 hours of test data drawn from four different radio programs. The test set was chosen to represent the diversity in the types of audio archives departments must process. Above and beyond conventional broadcast news, the test set includes extensive in-studio interviews and telephone interviews, as well as human-interest reports and commercials made colorful by a variety of sound effects and background music. Table 1 summarizes segmentation performance in terms of precision (percent of system boundaries that are correct), recall (percent of reference boundaries identified by the system) and F1-measure (the harmonic mean of precision and recall).

Station  Program      Amount      Precision  Recall  F1-measure
WDR2     Montalk      4 x 60 min  0.87       0.44    0.58
WDR2     Der Tag      4 x 60 min  0.83       0.71    0.77
DW       Wiso         4 x 30 min  0.96       0.68    0.80
DW       Funkjournal  4 x 30 min  0.94       0.67    0.78
         Average      12 hours    0.90       0.63    0.73

Table 1: Evaluation of the Segmentation Algorithm
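The frame-level zero-crossing feature and the majority-vote segment labeling described above can be illustrated with a short sketch. This is a simplified stand-in, not the system's actual feature vector or Gaussian classifier; the sample rate and test tones are assumptions for the example:

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ.

    A cheap surrogate for frequency content: low-frequency (voiced) audio
    crosses zero rarely, high-frequency content often, with no need for an
    explicit transform to the frequency domain.
    """
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def label_segment(frame_labels: list[str]) -> str:
    """Label the whole segment with the label predominant among its frames."""
    return max(set(frame_labels), key=frame_labels.count)

sr = 16000
t = np.arange(int(0.02 * sr)) / sr        # one 20 ms frame at 16 kHz
low = np.sin(2 * np.pi * 200 * t)         # 200 Hz tone: few zero crossings
high = np.sin(2 * np.pi * 4000 * t)       # 4 kHz tone: many zero crossings

print(zero_crossing_rate(low) < zero_crossing_rate(high))        # True
print(label_segment(["speech", "speech", "non-speech", "speech"]))  # speech
```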

Speech vs. non-speech classification and clustering performance is reported in Table 2 in terms of the harmonic mean between reference cluster purity and hypothesis cluster purity.

Station  Program      Amount      Speech vs. Non-speech Classification  Speaker Clustering
WDR2     Montalk      4 x 60 min  0.98                                  0.81
WDR2     Der Tag      4 x 60 min  0.94                                  0.90
DW       Wiso         4 x 30 min  0.91                                  0.84
DW       Funkjournal  4 x 30 min  0.90                                  0.86
         Average      12 hours    0.93                                  0.85

Table 2: Evaluation of the Classification and Clustering Algorithms
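The evaluation measure used in Table 2 can be sketched as follows. This is a plausible reading of "harmonic mean between reference cluster purity and hypothesis cluster purity" with hypothetical function names; the paper does not spell out the exact weighting, so cluster-size weighting here is an assumption:

```python
from collections import Counter

def purity(clusters: list[list[str]]) -> float:
    """Size-weighted average, over clusters, of the majority-label share."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def group_by(labels_a: list[str], labels_b: list[str]) -> list[list[str]]:
    """Partition the labels in labels_b according to the labels in labels_a."""
    groups: dict[str, list[str]] = {}
    for a, b in zip(labels_a, labels_b):
        groups.setdefault(a, []).append(b)
    return list(groups.values())

def clustering_score(reference: list[str], hypothesis: list[str]) -> float:
    """Harmonic mean of hypothesis-cluster purity and reference-cluster purity."""
    hyp_purity = purity(group_by(hypothesis, reference))  # hypothesis clusters
    ref_purity = purity(group_by(reference, hypothesis))  # reference clusters
    return 2 * hyp_purity * ref_purity / (hyp_purity + ref_purity)

# Toy example: one hypothesis cluster (c1) wrongly absorbs a spk2 segment.
ref = ["spk1", "spk1", "spk2", "spk2", "spk2", "music"]
hyp = ["c1",   "c1",   "c2",   "c2",   "c1",   "music"]
print(round(clustering_score(ref, hyp), 2))  # 0.83
```

Taking the harmonic mean penalizes both over-clustering (splitting one speaker across clusters, lowering reference purity) and under-clustering (merging speakers, lowering hypothesis purity).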

Several aspects of the results deserve more detailed comment. As can be seen from the tables, segmentation and speaker clustering performance was rather low for the talk show Montalk. This program is challenging because it involves untrained speakers in an informal interview setting and includes telephone interviews as well as opinion collages combining the voices of multiple speakers recorded on the street. The project group discovered that some mistakes of the system are disturbing to the archive workflow, while others are less disturbing. A minimal disruption occurs if the system fails to separate the individual speakers in an opinion collage.


A significant disruption occurs if the interviewer and guest have similar voice qualities and fail to be distinguished by the system. In the case of Montalk, the low performance of the segmentation algorithm was due to missed segment boundaries, as can be seen from the recall figure listed in Table 1. The missed boundaries were divided between those that are disturbing for the archivist and those that are not.

Conclusions

The structured audio player presented in this paper offers functionality tailored to support archive workflow at radio broadcasters. The project group that designed and tested the player confirmed that it would be an asset for the workflow of archive staff, even though the segmentation, classification and clustering algorithms used to generate the needed structure metadata fall short of flaw-free performance. Further optimization of metadata generation should stress avoiding those mistakes most disturbing to the archive workflow, such as missing boundaries between speakers in an interview or classifying speech with background music or noise as non-speech.

A significant conclusion of the project was that structure metadata provides better support for archive workflows than content metadata (i.e. speech recognition transcripts). A large number of the requests for content received by archives departments are quite abstract, meaning that “full text” search in spoken audio is of little use and archivists must rely on conventional manual annotations to retrieve relevant programs. For example, a typical request would be for an interview in which a famous politician recounts some anecdote from his childhood. The word ‘anecdote’ or ‘childhood’ (or any other obvious indicator) would not occur explicitly in the spoken audio. Archivists feel that audio search will have increased potential in the future when precision rates improve.
Especially when resources are limited, it is more important to concentrate on generating the best possible structure metadata and providing the archive staff with a structured audio player and a high-quality cut list than to provide imperfect search in spoken audio.

Acknowledgements

We would like to acknowledge Deutsche Welle and WDR, who commissioned the project and provided the data. A special thanks goes to the members of the archives departments of DW and WDR for their sustained and productive participation in the project group. The final work for this paper was carried out by the first author while supported by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH, contract IST-033104.

References

Biatov, K. & Köhler, J. (2003) An Audio Stream Classification and Optimal Segmentation for Multimedia Applications. In Proceedings of the 11th ACM International Conference on Multimedia. 211-214.

Biatov, K. & Larson, M. (2005) Speaker Clustering via Bayesian Information Criterion using a Global Similarity Constraint. In Proceedings of SPECOM 2005: 10th International Conference on Speech and Computer.

Biatov, K., Larson, M. & Eickeler, S. (2002) Zero-Crossing-based Temporal Segmentation and Classification of Audio Signals. In Proceedings of the 6th All-Ukrainian International Conference on Signal/Image Processing and Pattern Recognition. 71-74.

Busche, M. (2006) Audiomining: Anforderungen an eine barrierefreie Umsetzung [Audio mining: requirements for an accessible implementation]. Final thesis. Institut für Information und Dokumentation, Potsdam.

Chen, S. & Gopalakrishnan, P. (1998) Clustering via the Bayesian Information Criterion with Applications in Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 2. 645-648.

Hans, N. & de Koster, J. (2004) Taking care of tomorrow before it is too late: A pragmatic archiving strategy. In Proceedings of the 116th Convention of the Audio Engineering Society.
