Detecting screen shot images within large-scale video archive

Sebastien Poullot, Shin'Ichi Satoh
NII, Tokyo, Japan
(spoullot,satoh) at nii.ac.jp

Abstract

In this paper a particular issue is addressed: matching still images (screen shots) with videos. A content-based similarity search approach using image queries is proposed. A fast method based on local visual patterns is employed for both matching and indexing. However, we argue that using every frame may limit the scalability of the approach; therefore only keyframes are extracted and used. The main contribution of this paper is an investigation of the trade-off between accuracy and scalability using different keyframe rates for sampling the video database. This trade-off is evaluated on a ground truth using a large reference video database (1,000 hours).

1. Introduction

On the Internet, many professional websites and public blogs include references to broadcast videos, typically as captured images (screen shots). They appear beside reviews of the original show, or are used to express a mood or simply for their aesthetic appeal. Identifying these screen shots is of interest for several reasons: for the protection of rights owners of course, but also for user-centric interest analysis. The text surrounding these images could also be used for automatic video tagging at precise moments. We argue that an image-based approach is more relevant and accurate than a text-based one. Therefore, the asymmetrical content-based problem of matching screen shots against videos is addressed here.

In order to match videos and screen shots, visual features must be extracted from both and then compared. A video can be considered as a stream of images. Comparing the features of a screen shot to every feature of a reference video set is not suitable given the computational cost. An indexing method can be used to tackle this issue. However, if a 100,000-hour reference video database is targeted, including every frame of the videos in the database is not reasonable.

Three major reasons can be given. First, as a video contains at least 24 frames per second (30 in Japan), the database becomes huge; even if an indexing method is used, the database size should be constrained to obtain fast results. Second, temporally close frames are very similar, so any indexing method, generally based on hashing, would suffer severe congestion, inducing much slower retrieval. Third, the results would be cluttered by multiple, very similar answers to a query.

As a remedy, only keyframes are taken from the videos to build the reference database. These keyframes are automatically extracted using global luminance changes. They appear at an average rate of 3,400 per hour (about 1 per second), depending on visual activity. The database size and redundancy are drastically reduced, but another issue arises: a screen shot has little chance of being a keyframe (the frame rate is 30 frames per second). Most of the time, various transformations occur between it and the closest keyframe in the database: some objects or characters may have moved, and the camera as well. Therefore a screen shot and the closest keyframe can be considered as near duplicates. Figure 1 presents two examples from our ground truth.

Our main contribution is to investigate different keyframe samplings of the reference video database in order to evaluate the possible trade-off between accuracy and scalability. For this, we propose to use an indexing method that relies on local patterns, which makes the method invariant to these transformations while offering a fast retrieval solution. This method has already been used for Video Copy Detection (VCD) in a symmetrical framework: a video reference database is built and queried with videos [4]. The main difference is that the symmetrical system uses the temporal aspect of the frames to check the possible visual matchings; this temporal consistency raises the precision. In the asymmetrical case, for one screen shot, only a ranking of matching keyframes is obtained, so the most relevant one should come in first position.

Figure 1. Two queries (left side) and the two closest keyframes found (right side)

The rest of this paper is organized as follows. Section 2 contains a short state of the art of near-duplicate matching and briefly introduces our method. Section 3 describes the method used to improve accuracy and scalability using local pattern indexing and geometrical constraints. Section 4 presents results on a ground truth using various keyframe rates.

2. Near duplicate matching

The main idea relies on the consistency of local visual areas in pictures. If an object appears in a picture, we argue that the local descriptors computed on its surface should be shared with those computed on another picture where it also appears. Moreover, when pictures undergo transformations, neighboring descriptors should be affected in the same way (change of luminance, occluded together or not, for example). Usually, local descriptors (SIFT for instance) are computed around Points of Interest (PoI, e.g. DoG detections), and a matching is then performed between the local descriptions. To improve retrieval accuracy, using the positions of the PoI in the pictures has been widely explored. For example, RANSAC [2] or ICP algorithms [3] can be used for matching two sets of positions. This strengthens the constraints so as to avoid false matchings between sets of descriptors. Precision is then far better, generally with a reasonable recall loss. However, such solutions generally incur a strong computational cost.

In the last 10 years, Bags of Features (BoF, [5]) have attracted a lot of interest from the image community. Here pictures are coded as histograms of visual words (VW). The visual vocabulary is an arbitrary set obtained by k-means (k words) in the local description space (SIFT for instance); k usually reaches millions, depending on the content (volume and variability) of the database. For indexing, an inverted list, where each entry corresponds to one of the VW, is generated. At query time, local descriptors are extracted from the query image and assigned to visual words. The visual histogram of this image is obtained and then compared to the entries of the inverted list corresponding to its VW. Some works also added weak consistency constraints for better accuracy [1]. This method, based on a global feature, has shown high accuracy for various applications and is generally fast, but it is still not scalable.

The method employed here aims to scale up retrieval while keeping high accuracy. For that, a two-layer indexing method is used, based both on the description space (SIFT space) and on the picture plane (2D) for hashing and indexing features. This two-fold hashing reduces the possible bad collisions in the buckets. The next section and Figure 2 give a general overview of the method.
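To make the standard BoF pipeline described above concrete, here is a minimal sketch of visual-word assignment and inverted-list lookup, assuming local descriptors are given as NumPy arrays and a k-means codebook has already been trained. The names assign_words and InvertedIndex are hypothetical, and the cosine scoring is one common choice, not necessarily the one used in [5] or [1].

```python
# Minimal sketch of a bag-of-features inverted index (standard BoF, not the
# two-layer scheme of this paper). Assumes descriptors are NumPy arrays and
# a visual vocabulary (k-means centroids) is already trained.
from collections import defaultdict
import numpy as np

def assign_words(descriptors, vocabulary):
    """Map each local descriptor to the id of its nearest visual word."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

class InvertedIndex:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.postings = defaultdict(list)  # word id -> [(image id, count), ...]
        self.norms = {}                    # image id -> L2 norm of its histogram

    def add(self, image_id, descriptors):
        words, counts = np.unique(assign_words(descriptors, self.vocabulary),
                                  return_counts=True)
        for w, c in zip(words, counts):
            self.postings[int(w)].append((image_id, int(c)))
        self.norms[image_id] = float(np.sqrt((counts ** 2).sum()))

    def query(self, descriptors, top=10):
        words, counts = np.unique(assign_words(descriptors, self.vocabulary),
                                  return_counts=True)
        scores = defaultdict(float)
        for w, c in zip(words, counts):        # only lists of shared words are visited
            for image_id, cnt in self.postings[int(w)]:
                scores[image_id] += float(c) * cnt
        qnorm = float(np.sqrt((counts ** 2).sum()))
        ranked = sorted(((img, s / (qnorm * self.norms[img]))
                         for img, s in scores.items()),
                        key=lambda kv: kv[1], reverse=True)
        return ranked[:top]                    # (image id, cosine score) pairs
```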

Figure 2. Processing chain of the approach

3. Local patterns indexing and filtering

Since this section summarizes the method presented in [4], please refer to that article for further details. A main difference, however, is that the PoI detector and the local descriptor differ: for VCD, a spatio-temporal descriptor was computed around Harris corner positions, but in this asymmetrical problem a spatio-temporal description can no longer be used, so SIFT is used instead. The approach is quite similar to BoF: first a global vector that describes a frame is computed, using a set of visual words and of visual descriptors. The main difference comes from the choice of the vocabulary and of the global description vector.

Visual vocabulary. The size of the visual vocabulary used in BoF is usually set to millions of words. Here the local descriptors are roughly quantized, giving a smaller vocabulary (about 1,000 visual words).

First, P PoIs are localized by DoG (Difference of Gaussians) and local descriptors (SIFT) are extracted around these positions. The SIFT descriptors are randomly projected in order to obtain smaller descriptors (32 dimensions) with better distributions. Each projected vector (PV) is then quantized to a visual word using a grid: the description space is split d = 10 times so as to obtain 1024 cells. A PV thus belongs to one of these cells and is characterized by its cell number c ∈ [0, 2^d − 1]. A picture is globally described as a binary vector Bv of 2^d bits, with the bits corresponding to its PV cell locations set to 1. If two or more PVs have the same quantization, only one bit is set to 1 in the global binary vector (thus not a "bag of words" but a "set of words"). The resulting vector is very compact; moreover, it avoids spurious matchings between textured pictures. To compute the similarity between two of these vectors, the Dice coefficient is used: S_Dice(g1, g2) = 2 |G1 ∩ G2| / (|G1| + |G2|), where Gi is the set of bits set to 1 in the vector gi and |·| denotes set cardinality.

Local patterns. The local patterns group a set of (K + 1) PoIs, a point and its K nearest neighbors, into triplets. These associations are based on spatial locality in the picture plane. First we compute the K-NN of each PoI Pi using their positions (xi, yi). Then associations are built between Pi and its K-NN. For instance, for K = 4, Pi is associated with its first and second NN to form a triplet; Pi, its third NN and its fourth NN are also associated as another triplet. Figure 3 shows the 5 patterns generated between a PoI located on the nose of a character and its 10-NN. Each triangle represents a triplet to be used as a local pattern.
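As an illustration of the frame signature and similarity just described (random projection of SIFT to 32 dimensions, quantization into 2^d cells, a binary "set of words" vector Bv, and the Dice coefficient), here is a rough sketch. The Gaussian projection and the sign-based grid below are simplifying assumptions; the actual projection and grid of [4] are not reproduced.

```python
import numpy as np

D = 10           # the description space is split d = 10 times -> 2**D = 1024 cells
PROJ_DIM = 32    # dimension after random projection

rng = np.random.default_rng(0)
# Fixed random projection from 128-d SIFT to 32-d (assumed Gaussian here).
PROJECTION = rng.standard_normal((128, PROJ_DIM))

def cell_number(sift_descriptor):
    """Quantize one SIFT descriptor to a cell number c in [0, 2**D - 1]."""
    pv = np.asarray(sift_descriptor, dtype=float) @ PROJECTION
    # Simplified grid: one bit per sign of the first D projected coordinates.
    bits = (pv[:D] > 0).astype(np.uint64)
    return int((bits << np.arange(D, dtype=np.uint64)).sum())

def binary_vector(sift_descriptors):
    """'Set of words' signature Bv: bit c is set if any descriptor falls in cell c."""
    bv = np.zeros(2 ** D, dtype=bool)
    for s in sift_descriptors:
        bv[cell_number(s)] = True
    return bv

def dice(bv1, bv2):
    """S_Dice = 2 |G1 ∩ G2| / (|G1| + |G2|), Gi being the bits set to 1 in bvi."""
    total = int(bv1.sum()) + int(bv2.sum())
    if total == 0:
        return 0.0
    return 2.0 * int(np.logical_and(bv1, bv2).sum()) / total
```

Two near-duplicate frames that share many quantized descriptors thus obtain a Dice score close to 1, while frames with disjoint cell sets score 0.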

Figure 3. On the left, the full frame with the different PoIs spotted; on the right, a close-up of the face showing the 5 associations

Each of these associations corresponds to a hash number: if the 3 cell numbers of the SIFT descriptors located at these PoI positions are sorted in ascending order, they give a hashing number (a bucket) among the 2^(3d)/3! possible ones.

The indexing is done as follows: for each picture, the buckets (associations) are computed for each PoI, then the binary global feature Bv is inserted in each of them. For example, if there is no redundancy among the buckets, for P PoIs and 10-NN, Bv is inserted in P × 5 buckets. This redundancy aims to make matchings possible even if, because of the possible transformations, some PoIs disappear or some SIFT descriptors are altered. Finally, in order to improve selectivity and to speed up query processing, another feature is computed on each local pattern.

Filtering. If the PoIs are preserved and the SIFT descriptors match between two pictures, most of the time the pattern is not deformed: its global aspect remains the same. So we propose to add a feature that weakly represents the shape of the pattern. It is the ratio between the longest edge and the shortest one in the triplet (using the L2 distance) and is noted Pwc ∈ [1, √(x² + y²)], where x and y are the width and height of the picture. For each pattern this code is inserted in the corresponding bucket along with Bv.

Querying. At query time, given a picture, its PoIs and SIFT descriptors are extracted, then its patterns are built and a shape code is computed for each one. The corresponding buckets are visited. For each couple (Pwc_i, Bv_i), its shape code Pwc_i is first compared with the query one, Pwc_q. A fuzzy comparison which accepts slight deformations is employed. This filtering rules out some bad matchings. Moreover, it is very fast to compute, far faster than the binary similarity between Bv_q and Bv_i, so it avoids long similarity computations and speeds up the search. Finally, for candidates passing this test, the similarity between Bv_q and Bv_i is computed using the Dice coefficient. The matchings are sorted in descending order of this similarity.
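To make the local-pattern indexing, shape filtering and query steps concrete, here is a rough sketch under simplifying assumptions: PoI positions and their cell numbers are given, the bucket table is a plain in-memory dictionary rather than the structure of [4], the fuzzy shape test is a simple relative tolerance, and dice refers to the helper sketched above. Per keyframe, only the best Dice score is kept here for brevity.

```python
import math
from collections import defaultdict
import numpy as np

def triplets(points, k=10):
    """Group each PoI with consecutive pairs of its k-NN (k/2 triplets per PoI)."""
    pts = np.asarray(points, dtype=float)
    out = []
    for i, p in enumerate(pts):
        nn = np.argsort(np.linalg.norm(pts - p, axis=1))[1:k + 1]  # skip the PoI itself
        for j in range(0, k - 1, 2):                               # (1st,2nd), (3rd,4th), ...
            out.append((i, int(nn[j]), int(nn[j + 1])))
    return out

def bucket_id(cells, d=10):
    """Hash a triplet from the ascending-sorted cell numbers of its 3 descriptors."""
    a, b, c = sorted(cells)
    return (a * 2 ** d + b) * 2 ** d + c   # injective stand-in for the 2**(3d)/3! buckets

def shape_code(p0, p1, p2):
    """Weak shape feature Pwc: ratio of the longest to the shortest triangle edge."""
    edges = [math.dist(p0, p1), math.dist(p1, p2), math.dist(p2, p0)]
    return max(edges) / max(min(edges), 1e-9)

buckets = defaultdict(list)      # bucket id -> [(Pwc, Bv, keyframe id), ...]

def index_keyframe(frame_id, points, cells, bv):
    for i, j, l in triplets(points):
        b = bucket_id((cells[i], cells[j], cells[l]))
        buckets[b].append((shape_code(points[i], points[j], points[l]), bv, frame_id))

def query_screenshot(points, cells, bv, tol=0.2):
    scores = defaultdict(float)
    for i, j, l in triplets(points):
        b = bucket_id((cells[i], cells[j], cells[l]))
        pwc_q = shape_code(points[i], points[j], points[l])
        for pwc_i, bv_i, frame_id in buckets[b]:
            if abs(pwc_i - pwc_q) <= tol * pwc_q:   # cheap fuzzy shape test first
                scores[frame_id] = max(scores[frame_id], dice(bv, bv_i))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```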

4. Experiments

Experiments were performed on an AMD Opteron 880 dual-core machine. To test our approach a ground truth was needed; one was built from Japanese TV archives as follows. A reference database was set up containing 1,015 hours of videos: 145 hours (about 6 days) in a row from each of 7 different Japanese channels. The reference database includes all keyframes extracted from this 1,015-hour set, at an average rate of 1 keyframe per second (the database contains 3,449,006 of them). Then 300 random frames were picked as queries from the videos, not from the keyframe set. A manual check ensured that these frames do not belong to a commercial or a jingle that could be repeated many times, but to a particular program (news, entertainment, drama, movies, etc.). Most of them are not keyframes (about 1 out of 30 is, given the frame frequency of 29.97 frames per second). A returned keyframe is considered correct if it is temporally close to the query, which means that it belongs to the same shot. Only correct returns occurring before rank 50 are considered. Using the rank of the first correct keyframe that appears, the mean average precision can also be computed. The keyframe ratio is then investigated: 4 other databases were built using only 1 out of 2, 3, 4 and 5 keyframes from the first database (Table 1 sums up the database contents and average results). The other parameters are: P
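Finally, a small sketch of the evaluation protocol described above, assuming the shot boundaries of each query are known: a returned keyframe counts as correct only if it falls inside the query's shot and appears before rank 50, and the score averaged over queries is derived from the rank of the first correct keyframe (with a single correct answer counted per query, this reduces to the mean reciprocal rank). Helper names are hypothetical.

```python
def first_correct_rank(ranked_keyframes, shot_start, shot_end, max_rank=50):
    """1-based rank of the first returned keyframe lying inside the query's shot,
    or None if no correct keyframe appears before max_rank."""
    for rank, (frame_id, timestamp) in enumerate(ranked_keyframes[:max_rank], start=1):
        if shot_start <= timestamp <= shot_end:
            return rank
    return None

def mean_precision_at_first_hit(first_ranks):
    """Average of 1/rank of the first correct keyframe over all queries;
    queries with no correct return before rank 50 contribute 0."""
    scores = [0.0 if r is None else 1.0 / r for r in first_ranks]
    return sum(scores) / len(scores) if scores else 0.0
```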