Motion Based Segmentation Using MPEG Streams and Watershed Method

Renan Coudray, Bernard Besserer
Laboratoire Informatique Image Interaction, University of La Rochelle
Av. Michel Crepeau, 17042 La Rochelle Cedex 1, France
E-mail: [email protected], [email protected]

Abstract. Many computer vision applications, such as video indexing, summarization and motion segmentation, require the computation of the motion present in image sequences. In previous work, we presented a technique that performs Global Motion Estimation on MPEG compressed video. This article presents a method that extends the process to allow fast motion-based segmentation of a video: the background and the objects having their own local motion are segmented in real time, and the motion information belonging to each area is also provided. Moreover, some indicators warn when the estimation is not reliable.

1 Introduction

As the amount of archived video continuously grows, the demand for video annotation and metadata generation increases, in order to catalog, sort or categorize the huge number of sequences often stored in digital form [1]. Several approaches to video indexing based on single images (snapshots) have been investigated [2], but analyzing the dynamic behavior of the sequences can improve sequence characterization [3]. Since the MPEG1 and MPEG2 standards are widely used for digital video storage and for digital video broadcasting (DVB [4]), the input data of the presented approach is an MPEG stream [5]. The Global Motion Estimation (GME) defined in MPEG4 or MPEG7 is not exploited because its use is still marginal. Instead, Global Motion Estimation is based upon the existing motion vectors carried along in the MPEG stream for block-based motion compensation. In [6], the model for the estimation was a simplified affine motion model; we present here an extension of the method, fitting the complete affine model. The GME algorithm computes the global translation by forming an accumulation space of the motion vectors in which the different translational movements are separated. From this accumulation space, a motion-based segmentation is carried out with a watershed algorithm. An alternative method, better suited to our sparse data, is also proposed. Finally, visual results are presented and discussed.


[Figure 1 diagram: the motion vector field feeds the spatial derivatives dVx/dx, dVx/dy, dVy/dx and dVy/dy, each plotted as a histogram on the a1, a2, a3 and a4 spaces followed by main-peak detection; after compensation, the compensated motion vectors (Vx, Vy) are plotted as a histogram on the (tx, ty) space, followed by main-peak detection.]

Fig. 1: Scheme of the Global Motion Estimation using affine model

2 Global Motion Estimation

Since MPEG is a data compression standard, the motion compensation vectors stored in MPEG streams are not necessarily accurate with regard to the real motion in the scene: this motion estimation serves to reduce temporal redundancy, not to estimate the real displacements. In a previous work [6], we presented a method to build one motion field per Group Of Pictures (GOP, often 12-15 successive frames) and to reject wrong vectors ([7]; areas with poor visual surface characteristics, like flat shades, often lead to meaningless motion vectors) by using the Discrete Cosine Transform (DCT) coefficients computed for each block in the compressed stream. For the GME (Global Motion Estimation), the motion vectors are plotted in appropriate accumulation spaces, and the estimation is fast. In [6], the motion model used was the simplified affine model (4 parameters). Here, the motion is represented with the complete affine model (6 parameters, Eq. 1, Fig. 1). The same method is still used: each sample (motion vector) contributes to a globally consistent solution, and the most redundant values point to the parameters of the global motion. In a first stage (upper part of Fig. 1), the motion vector spatial derivatives are computed to estimate a first set of deformation parameters (zoom, rotation). In a second stage, all motion vectors are compensated to estimate the remaining translational motion (lower part of Fig. 1).

V = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \qquad (1)

Lecture Notes in Computer Science

V' = V - \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \qquad (2)

As the MPEG motion vectors can be inaccurate, each contribution from a motion vector to an accumulation space is weighted by a Gaussian distribution. This accumulation method gathers close contributions into a single peak within the accumulation space, and a simple search for the most redundant value gives the global parameter. To refine the estimation, for each one-dimensional accumulation space (a1, a2, a3, a4), a second-order polynomial regression is carried out around the maximum; the top of the regression curve (i.e. where its derivative is zero) gives the parameter value. All the motion vectors are then compensated (Eq. 2) before being added up in the two-dimensional accumulation space (tx, ty). In this space, the mode represents the translation parameters of the global motion, and each remaining peak represents an object movement. The extension of the Global Motion Estimation to the complete affine model is the natural continuation of previous work; the main contribution of this paper is the complete investigation of the accumulation space to compute a motion-based segmentation.

3 Motion Based Segmentation

Fig. 2: (tx, ty) accumulation space visualization (Tx and Ty axes from −60 to +60)

Fig. 3: watershed result

Efficient motion-based segmentations usually rely on a dense motion vector field (so-called optical flow). Our concept builds on the work published in [8] and is similar: estimation of the affine movement for each picture area and classification of the obtained parameters. However, with a sparse motion field, the deformation parameters (a1, a2, a3, a4) are difficult to estimate on small areas. Given a video shot, the foremost deformations are caused by camera motion.


So, our compromise is to estimate the global deformation and to classify each vector only on its translation values. Fig. 2 is a representation of our translation accumulation space after the global deformation has been compensated. With a standard watershed method, all the relevant peaks can easily be extracted. The first step is to threshold the accumulated data in order to eliminate noise: all locations (cells) in the accumulation space holding fewer than 5 occurrences are discarded. Then, the accumulation space is inverted, making valleys from peaks (Eq. 3). Finally, the standard watershed algorithm [9] is applied. Fig. 3 presents the result of the watershed; three movements are easily distinguished.

v' = M - v \qquad (3)

where v is one value and M the maximum value of the accumulation space.

For clarity, only the center of our accumulation space is shown in the figures. In fact, the accumulation space is rather large, to enable the segmentation of large motions (−1024 to 1024, with half-pixel precision). This accumulation space is also very sparse, so a recursive method has been developed that gives results equivalent to the watershed. The optimization is not presented in detail, but this approach processes only non-null positions. Basically, the position of each accumulation is registered, and this registration map is used in the next step, making the computation cost fairly independent of the size of the accumulation space. First, the maximum position (mode) within the accumulation space is detected. For each position connected to the maximum, the directional gradient toward the mode is computed. While the gradient is positive, points are assumed to belong to the same peak, and the algorithm is repeated over the neighborhood. At this step, all points belonging to the highest peak are aggregated, and the aggregate is labelled. Prog. 1 presents the code (in C language) which computes the gradient for each position and labels the accumulation space for all the data of the same peak.

Prog. 1: (mx,my) is the maximum position, (x,y) the current position and n the identification number of the peak. histoT is a global array which contains the accumulation space data and mask is the array which contains the label of each position.

    void DilatPeakRec(int x, int y, int mx, int my, __int64 n)
    {
        /* is this point already aggregated? */
        if (!(_mask[x + y * _histoWidth] & n)) {
            /* compute the direction of the maximum position */
            int tx, ty;
            int dx = mx - x, dy = my - y;
            int adx = abs(dx), ady = abs(dy);
            if ((adx == 0) || (adx /* ... the listing is truncated in the
               source at this point: per the text above, the step (tx, ty)
               toward (mx, my) is selected, the directional gradient of
               histoT is tested, and on a positive gradient the cell is
               labelled in _mask and the recursion continues ... */