Pic-A-Topic: Efficient Viewing of Informative TV Contents on Travel, Cooking, Food and More

Tetsuya Sakai‡  Tatsuya Uehara†  Taishi Shimomori†  Makoto Koyama∗  Mika Fukui∗
‡ NewsWatch, Inc.
† Multimedia Laboratory, Toshiba Corporate R&D Center
∗ Knowledge Media Laboratory, Toshiba Corporate R&D Center
[email protected]

Abstract

Pic-A-Topic is a prototype system designed to enable the user to view topical segments of recorded TV shows selectively. By analysing closed captions and electronic program guide texts, it performs topic segmentation and topic sentence selection, and presents a clickable table of contents to the user. Our previous work handled TV shows on travel, and included a user study which suggested that Pic-A-Topic's average segmentation accuracy at that point was possibly indistinguishable from that of manual segmentation. This paper shows that the latest version of Pic-A-Topic is capable of effectively segmenting several TV genres related to travel, cooking, food and talk/variety shows, by means of genre-specific strategies. According to an experiment using 26.5 hours of real Japanese TV shows (25 clips) which subsumes the travel test collection we used earlier (10 clips), Pic-A-Topic's topic segmentation results for non-travel genres are as accurate as those for travel. We adopt an evaluation method that is more demanding than the one we used in our previous work, but even in terms of this strict measurement, Pic-A-Topic's accuracy is around 82% of manual performance on average. Moreover, the fusion of cue phrase detection and vocabulary shift detection is very successful for all the genres that we have targeted.

Introduction

Nowadays, hard disk recorders that can record more than one thousand hours of TV shows are on the market, so that people can watch TV shows at the time of their convenience. But there is a problem: nobody can spend one thousand hours just watching TV! Thus, unless there are ways to let the user handle recorded TV contents efficiently, the recorded contents will eventually be deleted or forgotten before being put to use in any way.

Many researchers have tackled the problem of efficient information access for broadcast news [3, 4, 5, 6, 9, 13, 19, 21], by means of news story segmentation, segment/shot retrieval, topic labelling (i.e., assigning a closed-class category) and so on. Broadcast news is clearly an important type of video contents, especially for professionals and for organisations such as companies and governments. Both timely access to incoming news and retrospective access to news archives are required for such applications.


However, at the personal level, there are many other TV genres that need to be considered: drama, comedy, quiz, sport, music, wildlife, cookery, education, and so on. In fact, one could argue that these types of contents are more important than news for general consumers, as these are the kinds of contents that tend to accumulate in hard disks, waiting to be accessed by the user some day, often in vain.

Among the aforementioned "entertaining" kinds of TV genres, we are currently interested in separable contents. By "separable", we casually mean that a TV show can be broken down into several segments, where each segment is independent enough to provide the user with a useful piece of information. Thus, according to our definition, most factual TV shows are separable, while most dramas and films are not. For separable TV contents, we believe that topic segmentation is useful for solving the aforementioned "hard disk information overload" problem. For example, suppose that there is a recorded TV show that is two hours long and contains several distinct topics. If it is possible to segment the TV show according to topics and provide the user with a clickable table-of-contents interface from which he can select an interesting topic or two, then the user may be able to obtain useful information by viewing the selected segments only, which may only last for several minutes.

As a first step towards enabling selective viewing of separable contents, we introduced Pic-A-Topic, which can handle TV shows on travel, in [17]. The present study reports on its latest version, which can handle a wider variety of TV genres related to travel, cooking, food and talk/variety shows by means of genre-specific topic segmentation strategies. According to an experiment using 26.5 hours of real Japanese TV shows (25 clips) which subsumes the travel test collection we used earlier (10 clips), Pic-A-Topic's topic segmentation results for non-travel genres are as accurate as those for travel. We adopt an evaluation method that is more demanding than the one we used in our previous work, but even in terms of this strict measurement, Pic-A-Topic's accuracy is around 82% of manual performance on average. Moreover, the fusion of cue phrase detection and vocabulary shift detection is very successful for all the genres that we have targeted.

Figure 1 provides a couple of screendumps of Pic-A-Topic's interface. To avoid copyright problems, the figure shows a travel video clip created at Toshiba rather than a real TV show. The green bar at the top of each screen is a static area that displays the title of the clip: "Yume no Tabi (Travel of Dreams)". The other blue bars represent topics automatically generated by Pic-A-Topic, where a topic is casually defined as a video segment that is informative on its own. (This loose definition essentially implies that what constitutes a topic differs from person to person. Thus the evaluation of our topic segmentation task is harder than that of, say, news story segmentation, as we shall see later in this paper.) Each topic is represented by a thumbnail image and a couple of topic sentences extracted automatically from closed caption texts. For example, Topic 5 in Figure 1 discusses a dinner featuring Koshu beef; Topic 6 discusses a walk along a valley; Topic 7 discusses a place called Kakuenho Peak and the famous rocks nearby; Topic 9 discusses a castle; and so on. Rough English translations are provided just for the convenience of the reader.
When a topic is selected by the user, three extra thumbnails (selected using simple heuristics) are shown in order to provide more details of that topic. Then, when he presses an “enter” key, Pic-A-Topic starts playing the video segment. The user can also move directly from Topic N to Topic N + 1 or N − 1 by pressing “skip” buttons.


Figure 1. Sample screendumps of Pic-A-Topic.


The remainder of this paper is organised as follows. First, we discuss some previous work related to Pic-A-Topic. Next, we describe the TV genres and the Japanese TV shows that we have targeted. Next, we describe how the latest version of Pic-A-Topic handles these different genres. Then, we evaluate Pic-A-Topic’s topic segmentation accuracy using the aforementioned data. Finally, we provide conclusions and directions for future research.

Related Work

As mentioned earlier, many researchers have tackled the problem of efficient information access from broadcast news video. Below, we briefly mention some studies that handled TV genres other than news, and point out how our approach differs from them.

Extracting highlights from sports TV programs is a popular research topic, for which audio features [14] or manually transcribed commentaries [22] are often utilised. Aoki, Shimotsuji and Hori [1] used colour and layout analysis for selecting unique keyframes from movies. More recently, Aoki [2] reported on a system that can structuralise variety shows based on shot interactivity, while Hoashi et al. also tackled the topic segmentation problem for news and variety shows using generic audio and video features [8]. Although these approaches are very interesting, we believe that non-textual features alone are not sufficient for identifying topics within a separable and informative TV content. Lack of language analysis also implies that providing topic words or topic sentences [17] to the user is difficult.

Zhang et al. [23] handled non-news video contents such as travelogue material to perform video parsing, but their method is based on shot boundary detection, not topic segmentation. As we argued in our previous work [17], we feel that shot boundaries are not suitable for the purpose of viewing a particular topical segment.

There exist approaches that effectively combine audio, video and textual evidence. Jasinschi et al. [10] report on a combination-of-evidence system that can deal with talk shows. However, what they refer to as "topic segmentation" appears to be segmenting closed captions based on speaker change markers for the purpose of labelling each "closed caption unit" with either financial news or talk show. Nitta and Babaguchi [12] structuralise sports programs by analysing both closed captions and video. Smith and Kanade [20] also combine image and textual evidence to handle news and non-news contents: they first select keyphrases from closed captions based on tf-idf values, and use them as the basis of a video skim. While their keyphrase extraction involves detection of breaks between utterances, it is clear that this does not necessarily correspond to topical boundaries. Thus, even though their Video Skimming interface may be useful for viewing the entire "summary" of a TV program, whether it is also useful for selecting and viewing a particular topical segment or two is arguably an open question.

More recently, Shibata and Kurohashi [18] reported on a "topic identification" system for the cooking domain using linguistic features and colour distribution. However, their "topics" are closed-class states of Hidden Markov Models, namely preparation, sauteing, frying, baking, simmering, boiling, dishing up and steaming. This definition is clearly narrower (or rather, more focused) than ours. It remains to be seen, moreover, how their approach extends to other TV genres.

Target TV Genres and Test Data

In our previous study [17], we used 10 clips of travel TV shows to evaluate Pic-A-Topic's topic segmentation accuracy and user satisfaction. In order to expand Pic-A-Topic's scope, we gathered 15 new representative Japanese TV shows, covering several genres that involve talking


in the studio, cooking and eating. In contrast to cooking shows, a TV show on eating is to do with visiting good restaurants or obtaining good foodstuff, and is not about how to cook dishes. Together with travel TV shows, these TV genres are very popular in Japan, and most of them are separable: small segments of the entire show can often be of use to the user.

Table 1 provides a description of nine well-known Japanese TV series A-I that we selected for constructing our topic segmentation test collection. A total of 25 clips, each lasting between 30 minutes and 2 hours, were collected by recording the broadcast shows and decoding the closed caption information. This collection subsumes the travel collection we used earlier: Series E-H, containing 10 clips.

The second column of Table 1 shows the EPG "genres" provided by TV stations, where each program can have up to three genre tags. The EPG genres form a two-level hierarchy: for example, Series B has only one genre tag, "gourmet/cooking", which is categorised under "information/long variety". We quickly discovered, however, that the EPG genres are not always useful. For example, while both Series B and C are tagged with "gourmet/cooking" only, they in fact have very little in common: Series B is a typical cooking show consisting almost entirely of studio shots, while Series C is a "countdown" show which not only concerns food and restaurants but also "must-visit" spots such as parks and memorials in a particular city in Japan. Note also that Series H and I are tagged with "subtitles", which indicates that subtitle information is available with the programs: the tag has nothing to do with genres. Because of this nature of the EPG genres, Pic-A-Topic uses its own genre taxonomy, which is a little more fine-grained, and classifies a given TV program based on both the EPG genres and the closed caption text using simple pattern-matching heuristics. We shall describe this feature in the next section.

The third column of Table 1 briefly describes each TV series. It can be observed that we are handling a relatively diverse range of TV shows, which suggests that genre-specific topic segmentation strategies may be effective.

How Pic-A-Topic Works

Overview

Figure 2 provides an overview of the Pic-A-Topic system. As can be seen, in addition to the topic segmentation, topic sentence selection and table-of-contents creation modules [17], we now have a pilot genre identification module to handle several different TV genres.

The input to the genre identification module consists of Japanese closed captions and Electronic Program Guide (EPG) text. It identifies the genre of a given TV program according to Pic-A-Topic's own genre taxonomy. Details will follow.

The input to the topic segmentation component consists of Japanese closed captions and EPG text, as well as the result of genre identification. Currently, the EPG data (if available) is used only for the purpose of extracting names of TV celebrities, in order to automatically augment our named entity recognition dictionaries used in vocabulary shift detection. This is because celebrities such as comedians tend to have anomalous names, which named entity recognition tends to overlook (e.g., "Beat Takeshi" and "Papaya Suzuki"). The topic segmentation component performs both cue phrase detection and vocabulary shift detection, and then finally fuses the two results. Details will follow.
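Read as a pipeline, the module flow of Figure 2 can be sketched as follows. This is only an illustrative skeleton: the function names, signatures and return types are our own assumptions rather than Pic-A-Topic's actual interfaces, and the four module implementations are passed in as callables.

from typing import Callable, List, Tuple

def pic_a_topic_pipeline(
    closed_captions: str,
    epg_text: str,
    identify_genre: Callable[[str, str], str],
    segment_topics: Callable[[str, str, str], List[int]],
    select_topic_sentences: Callable[[str, List[int]], List[Tuple[int, int, str]]],
    create_toc: Callable[[List[Tuple[int, int, str]]], object],
) -> object:
    # Genre identification from closed captions and EPG text.
    genre = identify_genre(closed_captions, epg_text)
    # Genre-specific topic segmentation; returns boundary timestamps in ms.
    boundaries = segment_topics(closed_captions, epg_text, genre)
    # Topic sentences per segment: (start_ms, end_ms, sentence) triples (assumed format).
    topics = select_topic_sentences(closed_captions, boundaries)
    # Keyframe selection, thumbnail creation and start/end time adjustment.
    return create_toc(topics)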


Table 1. Nine TV series.

Series A
  EPG genres (in English translation): variety: cooking variety; variety: talk variety; information/long variety: gourmet/cooking
  Typical program: Every week, TV Celebrity S cooks one dish in the studio after watching some video instructions by professional chefs. S invites a guest into the studio, who helps S with the cooking while talking about various matters. At the end of the show, the guest tastes and assesses the dish. Throughout the show, the studio scenes alternate with the chefs' video instructions.

Series B
  EPG genres: information/long variety: gourmet/cooking
  Typical program: An invited chef demonstrates how to cook a few dishes using a particular foodstuff, e.g., cabbage. Typically, a recipe summary is shown at the end of the show.

Series C
  EPG genres: information/long variety: gourmet/cooking
  Typical program: Every week, the program focusses on one particular district in Japan and reports on the top 30 "must-visit" spots such as restaurants, shops, parks and other sightseeing spots. The countdown has several breaks, during which guests make some comments in the studio. The show ends with an original TV commercial promoting the featured district.

Series D
  EPG genres: variety: cooking variety
  Typical program: A pair of TV celebrities, famous for their endless appetite, visits a place in Japan in quest of good food. They not only visit restaurants and hotels, but also go fishing and have other adventures.

Series E
  EPG genres: variety: travel variety
  Typical program: Team A visits a hot springs resort and claims how enjoyable the place is; Team B does the same for a different location; the guests in the studio watch the alternating footage and decide which place they prefer.

Series F
  EPG genres: variety: travel variety
  Typical program: A family visits a hot springs resort: they check in at a hotel; they have a good bath and then a good dinner. The next day they go sightseeing and visit a cafe. One program usually contains between three and six such sequences.

Series G
  EPG genres: variety: travel variety
  Typical program: Very similar to F, except that the program is usually only one hour long. Usually contains one or two travel sequences.

Series H
  EPG genres: variety: travel variety; welfare: subtitles
  Typical program: Similar to F and G, except that travel footage may alternate with studio shots in which the guests look back on their travel. Studio shots are often overlaid on top of travel footage.

Series I
  EPG genres: variety: travel variety; welfare: subtitles
  Typical program: A guest rides a train in Tokyo with no particular purpose; he gets off whenever he feels like it and explores the nearby area. He visits eateries, shrines and anything that comes across his path. It is always a short day-return trip by train.


Figure 2. Pic-A-Topic configuration.

The input to the topic sentence selection component consists of the result of topic segmentation and the raw closed caption text. For each topical segment, this component first selects topic words based on a relevance feedback algorithm used widely in information retrieval. Subsequently, it selects topic sentences based on the weights of the aforementioned topic words. Note that this functionality is different from "topic labelling" [5], which assigns a closed-class category to (say) a news story: topic sentence selection can assign any text extracted from the closed captions, and is therefore more flexible. As the focus of the present study is topic segmentation rather than topic sentence selection, we refer the reader to our previous work [17] for more details on this module.

The input to the table-of-contents creation component consists of the results of topic segmentation and topic sentence selection. This component performs several postprocessing functions, including keyframe selection, thumbnail creation and start/end time adjustment [17]. Because there may be a time lag between the closed-caption timestamps and the actual audio/video, the timestamps output by topic segmentation are heuristically (but automatically) adjusted. An example heuristic would be to avoid playing the video from the middle of an utterance, since this would be neither informative nor user-friendly.
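The exact relevance feedback formula used for topic word selection is not spelled out here (see [17]). The following is a minimal sketch under the assumption of a simple contrastive term weighting between the segment and the rest of the programme; the weighting scheme and the parameter values are ours, not necessarily those of the actual module.

from collections import Counter
from typing import List, Tuple

def select_topic_sentences(segment_sents: List[List[str]],
                           other_sents: List[List[str]],
                           n_words: int = 10,
                           n_sents: int = 2) -> Tuple[List[str], List[List[str]]]:
    # Term frequencies inside the segment and in the rest of the programme.
    seg_tf = Counter(t for sent in segment_sents for t in sent)
    rest_tf = Counter(t for sent in other_sents for t in sent)
    seg_len = max(sum(seg_tf.values()), 1)
    rest_len = max(sum(rest_tf.values()), 1)

    # Topic words: terms relatively more frequent inside the segment
    # (a stand-in for the relevance feedback weighting cited in the paper).
    weight = {t: seg_tf[t] / seg_len - rest_tf[t] / rest_len for t in seg_tf}
    topic_words = dict(sorted(weight.items(), key=lambda kv: -kv[1])[:n_words])

    # Topic sentences: those carrying the largest total topic-word weight.
    ranked = sorted(segment_sents,
                    key=lambda sent: -sum(topic_words.get(t, 0.0) for t in sent))
    return list(topic_words), ranked[:n_sents]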

Genre Identification

As we mentioned earlier, we found that the EPG genres provided by the TV stations do not sufficiently reflect the nature of the TV show contents. We therefore devised our own preliminary TV genre taxonomy, shown in Table 2. As can be seen, we set up individual genres for TV series A-E, as their characteristics were quite different from one another. In contrast, we found that the travel shows are quite similar in nature to one another, despite the fact that they come from different TV stations. (Series I is perhaps an exception, in that it is more about exploring one's neighbourhood in just one day rather than travel. However, we did not set up an individual genre for I, as we found that it was difficult to automatically separate I from the other travel shows using our simple genre identification heuristics described below.) Our current genre identification module is a simple set of hand-crafted if-then rules that rely on the more coarse-grained EPG genres and the closed caption text. Some of the rules are shown in Figure 3.


Table 2. Pic-A-Topic's preliminary genre taxonomy.

genre name       TV series   description
COOK/TALK        A           a cross between a cooking show and a talk show
COOK             B           traditional cooking show
EAT/COUNTDOWN    C           countdown show on food
EAT              D           show on food and restaurants
TRAVEL/COMPETE   E           travel show involving two competing teams
TRAVEL           F, G, H, I  travel show

if EPGgenre includes "gourmet/cooking"
    if EPGgenre includes "talk variety"
        genre = COOK/TALK
    else
        genre = COOK
if EPGgenre includes "cooking variety"
    if closed caption text contains countdown-type expressions (e.g., "No. 1", "Top 30")
        genre = EAT/COUNTDOWN
    else
        genre = EAT

Figure 3. Sample rules for genre identification.

Because of the limited scale of our data set, the classification accuracy of this simple set of rules is currently 100%. However, if we expand our target TV genres, we will have to expand our genre taxonomy accordingly and devise more sophisticated ways of classifying TV shows into genres. This issue will be pursued in our future work.
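For illustration, the rules of Figure 3 can be written as straightforward conditionals. This is a hedged sketch: only the published rules are reproduced, the countdown cues are the examples given in the figure, and the TRAVEL fall-back is our assumption rather than part of the published rule set.

def identify_genre(epg_genres: str, closed_caption_text: str) -> str:
    # Countdown-type expressions; the two examples are those given in Figure 3.
    countdown_cues = ("No. 1", "Top 30")

    if "gourmet/cooking" in epg_genres:
        if "talk variety" in epg_genres:
            return "COOK/TALK"
        return "COOK"
    if "cooking variety" in epg_genres:
        if any(cue in closed_caption_text for cue in countdown_cues):
            return "EAT/COUNTDOWN"
        return "EAT"
    # Assumed fall-back for the remaining (travel) series; the actual rule
    # set is larger than the excerpt shown in Figure 3.
    return "TRAVEL"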

Topic Segmentation

The task of topic segmentation is to take the timestamps in the closed captions as candidates and output a list of selected timestamps that are likely to represent topical boundaries, together with confidence scores. We assume that the number of required topical segments will be given from outside, based on constraints such as the size and the resolution of the TV screen. As mentioned earlier, our topic segmentation module can perform both cue phrase detection and vocabulary shift detection.

Our first approach to topic segmentation, cue phrase detection, relies on Semantic Role Analysis (SRA) [15, 16]. SRA first performs morphological analysis, breaks the text into fragments (which in our case are sentences), and assigns one or more fragment labels to each fragment based on hand-written pattern-matching rules. A heuristically-determined weight is assigned to each rule. We have devised a separate set of cue phrase detection rules for each genre. For example, our fragment labels for the TRAVEL genre include [17]:

CONNECTIVE: This covers expressions such as "as a starter", "at last" and "furthermore". (Hereafter, all examples of Japanese words and phrases will be given in English translation.)


MOVEMENT: This covers verbs such as "head for" and "visit".

TIME ELAPSED: This covers expressions that refer to the passage of time, such as "next morning" and "lunchtime".

Our fragment labels for COOK, on the other hand, include:

CONNECTIVE: This is similar to that for travel.

NEWDISH: This covers expressions that introduce a new dish, such as "main course", "dessert" and "next dish".

RECIPE: This covers expressions that indicate recipes and foodstuff.

For each candidate boundary (i.e., timestamp), cue phrase detection calculates the raw confidence score by summing up the weights of all rules that matched the corresponding sentence. Finally, it obtains the normalised confidence score c by dividing the raw confidence score by the maximum one among all candidates.

The result of cue phrase detection may be used on its own for defining topical segments. In this case, we sieve the topical boundary timestamps before handing them to topic sentence selection: for each timestamp s (in milliseconds) obtained, we examine all its "neighbours" (i.e., timestamps that lie within [s − 30000, s + 30000]), and overwrite its confidence score c with zero if any of the neighbours has a higher confidence score than s. This yields "local optimum" timestamps that are at least 30 seconds apart from one another.

Our second approach to topic segmentation, vocabulary shift detection, is similar in spirit to standard topic segmentation algorithms such as TextTiling [7]. Although these algorithms, originally designed for "written" text, are often directly applied to closed captions (e.g., [11]), our preliminary experiments suggested that they are not satisfactory for analysing closed captions, which mainly consist of dialogue. Since closed captions contain timestamps, our algorithm uses timestamps explicitly and extensively. Moreover, as our preliminary experiments showed that domain-specific knowledge is effective for some of our TV genres, we use named entity recognition tuned specifically for each of these genres. Our algorithm is described below.

We first analyse the closed-caption text and extract morphemes and named entities, which we collectively refer to as terms. We have over one hundred generic named entity classes covering person names, place names, organization names, numbers and so on, originally developed for open-domain question answering [15]. In addition, we have some domain-specific named entity classes for some of the genres. For example, for TRAVEL, we have [17]:

TRAVEL ACTIVITY: This class covers typical activities of a tourist, such as "dinner", "walk" and "rest".

TRAVEL ATTRACTION CLASS: This class covers concepts that represent tourist attractions and events, such as "sightseeing spot", "park", "show" and "festival". Note that this is not for detecting specific instances such as "Tokyo Disneyland".

TRAVEL HOTEL CLASS: This class covers words such as "hotel" and "inn".

TRAVEL BATH CLASS: This class covers words such as "hot spring" and "bath".
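Before turning to the details of vocabulary shift detection, the cue phrase scoring and sieving described above can be sketched as follows. The rule representation (regular-expression patterns with weights) is a simplification of the hand-written SRA rules; the 30-second window is the value given in the text.

import re
from typing import List, Tuple

def cue_phrase_scores(sentences: List[Tuple[int, str]],
                      rules: List[Tuple[str, float]]) -> List[Tuple[int, float]]:
    # Raw score of each candidate boundary (one per closed-caption sentence):
    # the sum of the weights of all rules whose pattern matches the sentence.
    raw = [(ts, sum(w for pat, w in rules if re.search(pat, text)))
           for ts, text in sentences]
    top = max((score for _, score in raw), default=0.0)
    # Normalised confidence c: divide by the maximum raw score.
    return [(ts, score / top if top else 0.0) for ts, score in raw]

def sieve(scored: List[Tuple[int, float]],
          window: int = 30000) -> List[Tuple[int, float]]:
    # Keep only "local optimum" timestamps: a confidence is overwritten with
    # zero if any neighbour within +/-window ms has a higher confidence.
    out = []
    for ts, c in scored:
        beaten = any(abs(other_ts - ts) <= window and other_c > c
                     for other_ts, other_c in scored)
        out.append((ts, 0.0 if beaten else c))
    return out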


Let t denote a term in a given closed-caption text, and s denote a candidate topical boundary (represented by a timestamp in milliseconds). For a fixed window size S, let WL denote the set of terms whose timestamps (by which we actually mean start times) lie within [s − S, s), and WR denote the set of terms whose timestamps lie within [s, s + S). For each t ∈ WL ∪ WR − WL ∩ WR (i.e., each term included in either WL or WR but not both), we define a downweighting factor dw(t) as follows:

dw(t) = \max\{ dw_{domain}(t), dw_{generic}(t), dw_{morph}(t) \}    (1)

where dw_{domain}(t) = DW_{domain} if t is a domain-specific named entity and 0 otherwise; dw_{generic}(t) = DW_{generic} if t is a generic named entity and 0 otherwise; and dw_{morph}(t) = DW_{morph} if t is a single morpheme and 0 otherwise. Here, DW_{domain}, DW_{generic} and DW_{morph} are tuning constants between 0 and 1. Next, for each t ∈ WL ∪ WR − WL ∩ WR whose timestamp is s(t), we compute:

f(t) = 0.5 - 0.5 \cos\left( \pi \left( \frac{s(t) - s}{S} + 1 \right) \right)    (2)

f(t) takes the maximum value of 1 when s(t) = s and the minimum value of 0 when |s(t) − s| = S. That is, f(t) gets smaller as the term moves away (along the timestamp) from the candidate boundary. Meanwhile, for each t ∈ WL ∩ WR, let s_{WL}(t) and s_{WR}(t) denote the timestamps of t that correspond to WL and WR, respectively. (If there are multiple occurrences within the interval covered by WL or WR, then we take the timestamp that is closest to s.) Then we compute:

g(t) = 0.5 - 0.5 \cos\left( \pi \left( \frac{s_{WR}(t) - s_{WL}(t)}{2S} + 1 \right) \right)    (3)

If s_{WL}(t) and s_{WR}(t) are close (i.e., the term occurs just before the candidate boundary and just after it), then g(t) is close to 1. If s_{WR}(t) − s_{WL}(t) is close to 2S (i.e., the two occurrences of the same term are far apart), then g(t) is close to 0.

The term weighting functions f(t) and g(t) have been designed in order to make the algorithm robust to the choice of window size S, which is currently fixed at 30 seconds. We currently do not use global statistics such as idf (e.g., [20]). Thus our vocabulary shift detection algorithm is theoretically applicable to real-time processing of stream data as well.

Based on dw(t), f(t) and g(t), as well as two positive parameters α and β, we compute the novelty of each candidate boundary s as follows:

novelty = \sum_{t \in WR - WL} dw(t) f(t) + \alpha \sum_{t \in WL - WR} dw(t) f(t) - \beta \sum_{t \in WL \cap WR} g(t)    (4)

Thus, a candidate boundary receives a high novelty score if WR has many terms that are not in WL (and vice versa) and if WL and WR have few terms in common. Using α < 1 implies that WR and WL are not treated symmetrically, unlike the cosine-based segmentation methods (e.g., [7, 9]). This corresponds to the intuition that terms that occur after the candidate boundary may be more important than those that occur before it. This feature proved effective for some of our genres. The final confidence score v based on vocabulary shift is given by:

v = \frac{novelty - min_{novelty}}{max_{novelty} - min_{novelty}}    (5)

where max_{novelty} and min_{novelty} are the maximum and minimum values among all the novelty values computed for the closed-caption text.
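A minimal sketch of the novelty computation, Eqs. (1)-(4), is given below (timestamps in milliseconds). The DW constants, the alpha/beta values and the term representation (a dict per term occurrence) are our assumptions for illustration, since the tuned settings are not reported here.

import math
from typing import Dict, List

# Hypothetical values for the tuning constants DW_domain, DW_generic and
# DW_morph; the paper only states that they lie between 0 and 1.
DW_DOMAIN, DW_GENERIC, DW_MORPH = 0.9, 0.7, 0.4

def dw(term: Dict) -> float:
    # Eq. (1): terms matching none of the three categories get 0, as written.
    return max(DW_DOMAIN if term["is_domain_ne"] else 0.0,
               DW_GENERIC if term["is_generic_ne"] else 0.0,
               DW_MORPH if term["is_single_morpheme"] else 0.0)

def f(ts: int, s: int, S: int) -> float:
    # Eq. (2): 1 at the candidate boundary, decaying to 0 at distance S.
    return 0.5 - 0.5 * math.cos(math.pi * ((ts - s) / S + 1.0))

def g(ts_right: int, ts_left: int, S: int) -> float:
    # Eq. (3): close to 1 when the two occurrences straddle the boundary
    # closely, close to 0 when they are almost 2S apart.
    return 0.5 - 0.5 * math.cos(math.pi * ((ts_right - ts_left) / (2.0 * S) + 1.0))

def novelty(terms: List[Dict], s: int, S: int = 30000,
            alpha: float = 0.5, beta: float = 0.5) -> float:
    # Eq. (4). Each term occurrence is a dict with "surface", "start" (ms)
    # and the boolean flags used by dw(); alpha and beta are illustrative.
    WL = [t for t in terms if s - S <= t["start"] < s]
    WR = [t for t in terms if s <= t["start"] < s + S]
    left = {t["surface"] for t in WL}
    right = {t["surface"] for t in WR}
    common = left & right

    score = 0.0
    for t in WR:                       # terms occurring only after the boundary
        if t["surface"] not in common:
            score += dw(t) * f(t["start"], s, S)
    for t in WL:                       # terms occurring only before the boundary
        if t["surface"] not in common:
            score += alpha * dw(t) * f(t["start"], s, S)
    for surface in common:             # terms on both sides: penalty term
        ts_left = max(t["start"] for t in WL if t["surface"] == surface)
        ts_right = min(t["start"] for t in WR if t["surface"] == surface)
        score -= beta * g(ts_right, ts_left, S)
    return score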


Table 3. TV clips and trials.

(a) genre        (b) series  (c) clip IDs     (d) #clips  (e) #judges  (f) #trials ((d)*(e))
COOK/TALK        A           A01, A02, A03    3           2            6
COOK             B           B01, B02, B03    3           2            6
EAT/COUNTDOWN    C           C01, C02, C03    3           1            3
EAT              D           D01, D02, D03    3           2            6
TRAVEL/COMPETE   E           E01              1           2            2
TRAVEL           F           F01, F02, F03    3           2            6
TRAVEL           G           G01, G02, G03    3           2            6
TRAVEL           H           H01, H02, H03    3           2            6
TRAVEL           I           I01, I02, I03    3           2            6
total            -           -                25          -            47

The result of vocabulary shift detection may be used on its own for defining topical segments. Again, sieving is performed in such a case. Finally, the confidence scores based on cue phrase detection and vocabulary shift detection can be fused as follows:

confidence = \gamma \cdot v + (1 - \gamma) \cdot c    (6)

In fact, we fix γ to 0.5. Sieving is performed after the above fusion.
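Equations (5) and (6) amount to a min-max normalisation followed by linear interpolation; a small sketch is given below, where cue_scores and novelty_scores are aligned lists of (timestamp, score) pairs over the same candidate boundaries. The fused scores would then be passed through a sieve such as the one sketched earlier.

from typing import List, Tuple

def fuse_scores(cue_scores: List[Tuple[int, float]],
                novelty_scores: List[Tuple[int, float]],
                gamma: float = 0.5) -> List[Tuple[int, float]]:
    values = [n for _, n in novelty_scores]
    lo, hi = min(values), max(values)
    fused = []
    for (ts, c), (_, n) in zip(cue_scores, novelty_scores):
        # Eq. (5): min-max normalisation of the novelty score.
        v = (n - lo) / (hi - lo) if hi > lo else 0.0
        # Eq. (6): linear fusion; gamma is fixed at 0.5 in the paper.
        fused.append((ts, gamma * v + (1.0 - gamma) * c))
    return fused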

Evaluation of Topic Segmentation

Evaluation Methods

As we mentioned earlier, we collected a total of 25 TV clips that correspond to the TV series shown in Table 1 to evaluate Pic-A-Topic's segmentation accuracy. As shown in Table 3, we collected three clips for each series (except for Series E, due to unavailability), and the clip IDs we assigned to them are shown in Column (c). Since our "topic" is more ill-defined than, say, the notion of "story" in news story segmentation, we employed six judges for manually identifying topical boundaries, and randomly assigned two judges to each clip as shown in Column (e). For Series C, we did not employ a second judge, as we hypothesised that the topical boundaries for this particular series can be determined fairly objectively, since it is merely a systematic top-thirty countdown of restaurants and sightseeing spots. As shown in Column (f), this setting gave us 47 evaluation trials in total.

Our evaluation metric for topic segmentation, which we call relative F1-measure, is described below. Let N(x) and N(y) denote the total number of boundaries (each represented by a timestamp in milliseconds) identified by the two judges x and y, respectively, and let M denote the total number of candidate boundaries output by Pic-A-Topic. Typically, N(x) and N(y) are similar numbers, but M is larger than these two, since Pic-A-Topic currently does not apply thresholding.

We first treat Judge x's set of topical boundaries as the gold standard, and assess the performance of Judge y. A topical boundary b detected by Judge y is counted as correct if there exists a topical boundary b∗ detected by Judge x such that b lies within the interval


[b∗ − 10000, b∗ + 10000]. (At most one boundary can be counted as correct for each gold-standard interval: if there are two or more boundaries that lie within a single gold-standard interval, then only the one that is closest to the center of the interval is counted as correct.) Let n_x(y) denote the number of "correct" boundaries counted in this way. Then, we compute the precision, recall and F1-measure of Judge y as follows:

precision_x(y) = n_x(y) / N(y)
recall_x(y) = n_x(y) / N(x)
F1\text{-}measure_x(y) = \frac{2 \cdot precision_x(y) \cdot recall_x(y)}{precision_x(y) + recall_x(y)}

As we mentioned earlier, however, N(x) and N(y) are generally very similar numbers, if not always identical. Thus, in practice, precision and recall values are very similar, and therefore the F1-measure, which is the harmonic mean of precision and recall, is also very similar to these values. For this reason, this paper focusses on F1-measure. For short, we may refer to F1-measure as "F1".

The above F1-measure represents the performance of Judge y when the topical boundaries set by Judge x are assumed to be the gold standard. This can be regarded as a performance upper bound for our system, although our system may possibly outperform it, especially if the degree of inter-judge agreement is low. To see how well Pic-A-Topic does in comparison to Judge y when Judge x is the gold standard, we first compute the absolute F1-measure of the system as follows. Recall that Pic-A-Topic outputs M (> N(x), N(y)) candidate boundaries with different confidence values. Hence, in order to compare Pic-A-Topic directly with Judge y, we sort the M candidates in decreasing order of confidence and take the top N(y). Then we evaluate these N(y) boundaries in exactly the same way as described above. Thus, let n_x(system) denote the number of "correct" boundaries among the top N(y) candidates output by Pic-A-Topic. Then, its absolute F1 is computed as:

precision_x(system) = n_x(system) / N(y)
recall_x(system) = n_x(system) / N(x)
F1\text{-}measure_x(system) = \frac{2 \cdot precision_x(system) \cdot recall_x(system)}{precision_x(system) + recall_x(system)}
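The boundary matching and absolute F1 computation can be sketched as follows (timestamps in milliseconds). The greedy one-to-one matching is our reading of the "at most one boundary per gold-standard interval" rule; tie-breaking details may differ from the original implementation.

from typing import List

def count_correct(gold: List[int], system: List[int], tol: int = 10000) -> int:
    # A system boundary is correct if it lies within +/-tol ms of a gold
    # boundary; each gold interval can credit at most one system boundary.
    used = set()
    correct = 0
    for b_star in gold:
        candidates = [(abs(b - b_star), i) for i, b in enumerate(system)
                      if i not in used and abs(b - b_star) <= tol]
        if candidates:
            _, best = min(candidates)
            used.add(best)
            correct += 1
    return correct

def absolute_f1(gold_x: List[int], system_ranked: List[int], n_y: int) -> float:
    # Take the top n_y = N(y) system boundaries (by confidence) and score
    # them against Judge x's boundaries, as in the absolute F1 above.
    top = system_ranked[:n_y]
    n_correct = count_correct(gold_x, top)
    precision = n_correct / len(top) if top else 0.0
    recall = n_correct / len(gold_x) if gold_x else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)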

Finally, we directly compare Pic-A-Topic with Judge y, assuming that Judge x defines the gold standard, by computing the relative F1-measure:

RelativeF1\text{-}measure_x(system) = F1\text{-}measure_x(system) / F1\text{-}measure_x(y)

Similarly, RelativeF1\text{-}measure_y(system) can be computed by treating Judge y's boundaries as the gold standard instead. Note that if the two judges agreed perfectly with each other, then their absolute F1 values would equal one, and therefore Pic-A-Topic's relative F1 would equal its absolute F1.

We use macroaveraging rather than microaveraging: that is, we compute performance values for each clip first, and then take the average across all data. This is because we prefer to weight all clips equally rather than to weight all topical boundaries equally.

In our previous work using travel TV shows [17], we employed four judges per clip, and counted a candidate boundary as correct if it agreed with at least one of three judges. Our new evaluation method is arguably more intuitive and substantially more demanding: according to our previous lenient measurement, Pic-A-Topic's mean relative F1 was 82% for travel data [17]; however, according to our new strict measurement, the mean relative F1 of the same system for the same data is only 62%. As we shall see in the next section, the latest version of Pic-A-Topic achieves 82% mean relative F1 even though we use the strict measurement and cover several genres. It should also be noted that the lack of standard test collections for diverse TV genres currently prevents us from conducting more large-scale experiments, although, to our knowledge, our own data set covers more genres than those used in most studies.

Results and Discussions

Figure 4 shows the absolute per-trial F1 values for our data set, using cue phrase detection (c), vocabulary shift detection (v) and a fusion of the two (c + v). For example, A01x represents the trial using Clip A01 with Judge x as the gold standard. The benefit of fusion is clear: the fusion performance is better than either of the pre-fusion performances (i.e., c or v) for as many as 32 trials out of 47. According to a two-tailed sign test, fusion is significantly better than the cue phrase detection and vocabulary shift detection components at α = 0.01.[1]

[1] The use of two judges per clip violates the i.i.d. assumption for our data, so significance tests should probably be performed based on only one of the judges. We have a total of 25 trials based on the first judge: fusion outperforms the two components for 17 trials while hurting only three trials, which is still statistically highly significant.

Figure 5 shows the absolute and relative per-trial F1 values for our data after fusion, together with the absolute F1 values of manual segmentation. It can be observed that Pic-A-Topic actually outperforms Judge y for clips B03, D01 and E01 when Judge x is treated as the gold standard. (Note that Judge y is generally a different person for different clips.) Moreover, Pic-A-Topic's relative F1 values for non-travel genres (Series A-D) are comparable to those for travel (Series E-I). Also, recall that we assigned only one judge per clip for Series C: the manual performance for this particular series is always 100%, and therefore Pic-A-Topic's relative F1 equals its absolute F1 for this series.

Table 4 summarises the above figures from the viewpoint of genres. For example, for the COOK/TALK genre, the mean absolute F1 of cue phrase detection (c) is only 30%, which is 33% in mean relative F1; vocabulary shift detection (v) achieves 57% in mean absolute F1 and 62% in mean relative F1; fusion (c + v) achieves 69% in mean absolute F1 and 74% in mean relative F1. Note that the fusion performance is higher than either of the pre-fusion ones for all genres. For the entire data set, cue phrase detection achieves 49% in mean absolute F1 and 67% in mean relative F1; vocabulary shift detection achieves 39% in mean absolute F1 and 51% in mean relative F1; and fusion achieves 61% in mean absolute F1 and 82% in mean relative F1. Thus the overall performance has also been boosted substantially through fusion.

On the other hand, it can be observed that the performance of vocabulary shift detection for the EAT genre is very low. Moreover, the fused absolute F1 for this genre is only 50%, even though the corresponding relative F1 is 95%. The discrepancy between these two values arises from considerable inter-judge disagreements for the EAT genre (i.e., Series D): Figure 5 shows that it is very hard even for a human judge to produce a topic segmentation output that agrees well with the "gold standard" for Series D. In each episode of Series D, two TV celebrities keep travelling from one place to another looking for something good to eat, and where the boundary lies between Food 1 and Food 2 or Place 1 and Place 2 is indeed controversial. For this kind of TV show, we should probably aim for user-biased topic segmentation rather than for perfect generic


topic segmentation, as our ultimate goal is to aid every user in efficient viewing of recorded TV shows. We would like to tackle this problem in our future work.

We now take a closer look at some of our successful and unsuccessful trials. Table 5 provides the details of our most successful trials, whose relative F1 values were 100% or even higher. The table includes the aforementioned three trials for which Pic-A-Topic actually outperformed manual performance. For example, for Trial B03y, the absolute F1 of Judge x is 77%; the relative F1 of cue phrase detection is 81%, while that of vocabulary shift detection is only 19%; after fusion, Pic-A-Topic's relative F1 for this trial is 100%.

Figure 6 shows how fusion worked for Trial B03y shown in Table 5: the top left graph compares cue phrase detection with Judge y (i.e., the gold standard); the bottom left graph compares vocabulary shift detection with Judge y; the top right graph compares Judge x with Judge y; finally, the bottom right graph compares fusion with Judge y. The vertical axis of each graph represents the timestamps, and the horizontal axis represents the number of topical boundaries identified. For this clip, Judge y identified six boundaries, while Judge x identified seven boundaries, as shown in the top right graph. The two judges agreed on five boundaries, which are indicated by circles on Judge y's graph. Hence, Judge x's recall and precision are 5/6 and 5/7, respectively, and therefore his F1 is 77%, as shown in Table 5. The top left graph of Figure 6 shows that cue phrase detection successfully detected Judge y's boundaries No.1, No.3, No.5 and No.6; the bottom left graph shows that vocabulary shift detection detected boundary No.4 only; finally, the bottom right graph shows that, after fusion, all five of these boundaries were successfully detected. As a result, Pic-A-Topic is on a par with Judge x. This example suggests that cue phrase detection and vocabulary shift detection can be complementary: they can find different correct boundaries.

Table 6 provides the details of our trials for which fusion failed. Let c, v and c + v denote the performance of cue phrase detection, vocabulary shift detection and fusion, respectively; we say that "fusion failed" if c + v < max(c, v). However, the last column of this table shows that fusion is never a disaster: even when it fails, the difference between the fusion performance and max(c, v) is very small. For example, for Trial A03y, the absolute F1 of vocabulary shift detection is 73%, while the corresponding fusion performance is only 67%, which is only 6% lower. Note that the corresponding cue phrase detection performance is only 27%. In summary, fusion rarely hurts performance, and even when it does, it hurts very little.

Figure 7 depicts how fusion fails for Trial D02y shown in Table 6, in a way similar to Figure 6. As the top right graph shows, Judge y (the gold standard) identified as many as 20 topical boundaries, while Judge x identified only 11. Thus, as mentioned earlier, there are many inter-judge disagreements for Series D. The top left graph shows that cue phrase detection managed to detect boundaries No.1, No.2, No.5, No.12, No.13, No.14 and No.15, and therefore its recall and precision are 7/20 and 7/11, respectively. On the other hand, vocabulary shift detection was no good for this clip: it only detected boundary No.2, and therefore its recall and precision are 1/20 and 1/11, respectively. Finally, as the bottom right graph shows, Pic-A-Topic missed boundary No.5 as a result of fusion, and therefore its final recall and precision are 6/20 and 6/11. Hence the absolute F1 for this trial is 39%, as shown in Table 6. In this example, there is clearly too much noise in the vocabulary shift detection output. There certainly is room for improvement for this module.


Figure 4. Cue phrase, vocabulary shift and fusion performance (absolute F1-measure).


Figure 5. Fusion performance (absolute/relative F1-measure).


Table 4. Average per-genre F1-measure performances.

genre            #trials  c(abs)  c(rel)  v(abs)  v(rel)  c+v(abs)  c+v(rel)
COOK/TALK        6        0.30    0.33    0.57    0.62    0.69      0.74
COOK             6        0.74    0.81    0.40    0.42    0.84      0.92
EAT/COUNTDOWN    3        0.62    0.62    0.66    0.66    0.78      0.78
EAT              6        0.38    0.75    0.13    0.26    0.50      0.95
TRAVEL/COMPETE   2        0.43    0.76    0.41    0.72    0.53      0.94
TRAVEL           24       0.50    0.71    0.37    0.52    0.56      0.79
all              47       0.49    0.67    0.39    0.51    0.61      0.82

Table 5. Trials whose relative F1-measure values with fusion are 100% or higher.

genre            trial  manual  c(abs)  c(rel)  v(abs)  v(rel)  c+v(abs)  c+v(rel)
COOK             B03x   0.77    0.77    1.00    0.31    0.40    0.92      1.19
COOK             B03y   0.77    0.62    0.81    0.15    0.19    0.77      1.00
EAT              D01x   0.40    0.40    1.00    0.10    0.25    0.40      1.00
EAT              D01y   0.40    0.40    1.00    0.20    0.50    0.50      1.25
EAT              D03x   0.62    0.31    0.50    0.15    0.24    0.62      1.00
TRAVEL/COMPETE   E01y   0.53    0.46    0.87    0.39    0.74    0.56      1.06

Figure 6. How fusion works for Trial B03y.


Table 6. Fusion failures (Absolute F1-measure).

genre       trial  c(abs)  v(abs)  c+v(abs)  c+v - max(c, v)
COOK/TALK   A03y   0.27    0.73    0.67      -0.06
EAT         D02y   0.45    0.06    0.39      -0.06
TRAVEL      G03y   0.73    0.41    0.69      -0.04
TRAVEL      H02x   0.50    0.32    0.45      -0.05
TRAVEL      I02x   0.53    0.35    0.47      -0.06
TRAVEL      I03x   0.45    0.26    0.42      -0.03

Figure 7. How vocabulary shift and fusion fail for Trial D02y.

Conclusions and Future Work

This paper showed that the latest version of Pic-A-Topic is capable of segmenting several TV genres related to travel, cooking, food and talk/variety shows using genre-specific strategies. Using 26.5 hours of real Japanese TV shows (25 clips), we showed that Pic-A-Topic's topic segmentation results for non-travel genres are as accurate as those for travel. We adopted an evaluation method that is more demanding than the one we used in our previous work, but even in terms of this strict measurement, Pic-A-Topic's accuracy is around 82% of manual performance on average. In our previous work, we used a substantially less effective version of Pic-A-Topic (62% in mean relative F1-measure according to our strict measurement), and conducted a preliminary user evaluation which suggested that Pic-A-Topic's average topic segmentation performance at that point was possibly indistinguishable from a manual one. It is therefore possible that the latest version of Pic-A-Topic provides accuracy that is practically


more than sufficient on average, although more extensive user studies are required to verify this claim. Moreover, it is clear that the fusion of cue phrase detection and vocabulary shift detection is very successful for all the genres that we have targeted, although we need to make vocabulary shift detection more robust to changes of genre.

Our current list of future work includes:

• Expanding our TV genres further and building a genre identification module with a wider coverage;
• Incorporating video and audio features into our topic segmentation algorithm;
• Automatic per-genre, per-user optimisation of topic segmentation parameters, especially those for vocabulary shift detection;
• A large-scale user evaluation of topic segmentation, topic sentence selection and the entire table-of-contents interface;
• Applying Pic-A-Topic modules to other applications, such as selective downloading of video contents for mobile phones.

References

[1] Aoki, H., Shimotsuji, S. and Hori, O. (1996). A Shot Classification Method of Selecting Key-Frames for Video Browsing. In Proceedings of ACM Multimedia '96.
[2] Aoki, H. (2006). High-Speed Topic Organizer of TV Shows Using Video Dialog Detection. Systems and Computers in Japan, 37(6), 44–54.
[3] Boykin, S. and Merlino, A. (2000). Machine Learning of Event Segmentation for News on Demand. Communications of the ACM, 43(2), 35–41.
[4] Chua, T.-S. et al. (2004). Story Boundary Detection in Large Broadcast News Video Archives - Techniques, Experience and Trends. In Proceedings of ACM Multimedia 2004.
[5] Hauptmann, A. G. and Lee, D. (1998). Topic Labeling of Broadcast News Stories in the Informedia Digital Video Library. In Proceedings of ACM Digital Libraries '98.
[6] Hauptmann, A. G. and Witbrock, M. J. (1998). Story Segmentation and Detection of Commercials in Broadcast News Video. In Proceedings of Advances in Digital Libraries '98.
[7] Hearst, M. A. (1994). Multi-Paragraph Segmentation of Expository Text. In Proceedings of ACL '94, 9–16.
[8] Hoashi, K. et al. (2006). Video Story Segmentation Based on Generic Low-Level Features. IEICE Transactions on Information and Systems, J86-D-II-8, 2305–2314.
[9] Ide, I. et al. (2003). Threading News Video Topics. In Proceedings of the ACM SIGMM Workshop on Multimedia Information Retrieval (MIR 2003), 239–246.
[10] Jasinschi, R. S. et al. (2001). Integrated Multimedia Processing for Topic Segmentation and Classification. In Proceedings of IEEE ICIP.
[11] Miyamori, H. and Tanaka, K. (2005). Webified Video: Media Conversion from TV Program to Web Content and their Integrated Viewing Method. In Proceedings of ACM WWW 2005.
[12] Nitta, N. and Babaguchi, N. (2003). Story Segmentation of Broadcasted Sports Videos for Semantic Content Acquisition (in Japanese). IEICE Transactions on Information and Systems, J86-D-II-8, 1222–1233.
[13] Over, P., Kraaij, W. and Smeaton, A. F. (2005). TRECVID 2005 - An Introduction. In Proceedings of TREC 2005.
[14] Rui, Y., Gupta, A. and Acero, A. (2000). Automatically Extracting Highlights for TV Baseball Programs. In Proceedings of ACM Multimedia 2000.
[15] Sakai, T. et al. (2004). ASKMi: A Japanese Question Answering System based on Semantic Role Analysis. In Proceedings of RIAO 2004, 215–231.
[16] Sakai, T. (2005). Advanced Technologies for Information Access. International Journal of Computer Processing of Oriental Languages, 18(2), 95–113.
[17] Sakai, T., Uehara, T., Sumita, K. and Shimomori, T. (2006). Pic-A-Topic: Gathering Information Efficiently from Recorded TV Shows on Travel. AIRS 2006, Lecture Notes in Computer Science 4182, 374–389, Springer-Verlag.
[18] Shibata, T. and Kurohashi, S. (2006). Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models. In COLING/ACL 2006 Main Conference Poster Sessions, 755–762.
[19] Smeaton, A. F. et al. (2004). The Físchlár-News-Stories System: Personalised Access to an Archive of TV News. In Proceedings of RIAO 2004.
[20] Smith, M. A. and Kanade, T. (1998). Video Skimming and Characterization through the Combination of Image and Language Understanding. In Proceedings of IEEE ICCV '98.
[21] Uehara, T., Horikawa, M. and Sumita, K. (2000). Navigation System for News Programs Featuring Direct Access to Desired Scenes (in Japanese). Toshiba Review, 55(10).
[22] Yamada, I. et al. (2006). Automatic Generation of Segment Metadata for Football Games Using Announcer's and Commentator's Commentaries (in Japanese). IEICE Transactions, J89-D(10), 2328–2337.
[23] Zhang, H.-J. et al. (1995). Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution. In Proceedings of ACM Multimedia '95, 15–24.


[11] Miyamori, H. and Tanaka, K. (2005). Webified Video: Media Conversion from TV Program to Web Content and their Integrated Viewing Method. In Proceedings of ACM WWW 2005. [12] Nitta, N. and Babaguchi, N. (2003). Story Segmentation of Broadcasted Sports Videos for Semantic Content Acquisition (in Japanese). IEICE Transactions on Information and Systems, J86-D-II-8, 1222–1233. [13] Over, P., Kraaij, W. and Smeaton, A. F. (2005). TRECVID 2005 - An Introduction. In Proceedings of TREC 2005 Proceedings. [14] Rui, Y., Gupta, A. and Acero, A. (2000). Automatically Extracting Highlights for TV Baseball Programs. In Proceedings of ACM Multimedia 2000. [15] Sakai, T. et al. (2004). ASKMi: A Japanese Question Answering System based on Semantic Role Analysis. In Proceedings of RIAO 2004, 215–231. [16] Sakai, T. (2005). Advanced Technologies for Information Access. International Journal of Computer Processing of Oriental Languages, 18(2), 95–113. [17] Sakai, T., Uehara, T., Sumita, K. and Shimomori, T. (2006). Pic-A-Topic: Gathering Information Efficiently from Recorded TV Shows on Travel. AIRS 2006, Lecture Notes in Computer Science 4182, 374–389, Springer-Verlag. [18] Shibata, T. and Kurohashi, S. (2006). Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models, COLING/ACL 2006 Main Conference Poster Sessions, 755–762. [19] Smeaton, A. F. et al. (2004). The F´ıschl´ar-News-Stories System: Personalised Access to an Archive of TV News. In Proceedings of RIAO 2004. [20] Smith, M. A. and Kanade, T. (1998). Video Skimming and Characterization through the Combination of Image and Language Understanding. In Proceedings IEEE ICCV ’98 Proceedings. [21] Uehara, T., Horikawa, M. and Sumita, K. (2000). Navigation System for News Programs Featuring Direct Access to Desired Scenes (in Japanese). Toshiba Review 55(10). [22] Yamada, I. et al. (2006). Automatic Generation of Segment Metadata for Football Games Using Announcer’s and Commentator’s Commentaries (in Japanese). IEICE Transactions, Vol. J89-D, No. 10, 2328–2337. [23] Zhang. H.-J. et al. (1995). Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution. In Proceedings of ACM Multimedia ’95, 15–24.
