Differential Evolution Applied to a Multimodal Information Theoretic Optimization Problem

Patricia Besson, Jean-Marc Vesin, Vlad Popovici, and Murat Kunt

Signal Processing Institute, EPFL, Lausanne 1015, Switzerland
{patricia.besson, jean-marc.vesin, vlad.popovici, murat.kunt}@epfl.ch

Abstract. This paper discusses the problems raised by the optimization of a mutual information-based objective function in the context of multimodal speaker detection. As no approximation is used, this function is highly nonlinear and plagued by numerous local minima. Three different optimization methods are compared. The Differential Evolution algorithm is deemed to be the best for the problem at hand and, consequently, is used to perform the speaker detection.

1 Introduction

This paper addresses the optimization of an information theoretic-based function for the detection of the current speaker in audio-video (AV) sequences. A single camera and microphone are used, so that the detection relies on the evaluation of the AV synchronism. As in [1], the information contained in each modality is fused at the feature level, optimizing the audio features with respect to the video ones. The objective function is based on mutual information (MI) and is highly nonlinear, with no analytical formulation of its gradient unless approximations are introduced. The local Powell's method [2] was tried in a first set of experiments, which led to the conclusion that a global approach was better suited. For this purpose, two Evolutionary Algorithms (EAs), the Genetic Algorithm in Continuous Space [3] and Differential Evolution [4], have been applied and their performance compared and analyzed.

After a brief introduction to the optimization problem, the three previously mentioned optimization methods are applied to the problem at hand. A detailed discussion follows the experiments presented in the last part of the paper and suggests that DE is the best choice for solving the given problem.

F. Rothlauf et al. (Eds.): EvoWorkshops 2006, LNCS 3907, pp. 505–509, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Multimodal Speaker Detection Framework

Theoretic framework for multimodal feature extraction. The detection of the current speaker in an AV sequence can be seen as a multimodal classification problem [1]. The goal is to estimate the class membership O ("speaker" or "non-speaker") of the bimodal source S that jointly emits an audio signal A and a video signal V. The probability P_e of assigning S to the wrong class should be minimized. Since a single camera and microphone are used, the detection relies on the evaluation of the synchronism between the AV signals. The estimation of the MI between the AV features can be used as such a classifier. This MI classifier must be provided with suitable features to perform well. Let F_A and F_V be the features extracted from A and V, and let e = I(F_A, F_V)/H(F_A, F_V) ∈ [0, 1] be the features' efficiency coefficient [1] (I and H standing respectively for Shannon's MI and entropy). Then the following inequality can be stated [1]:

    P_e \geq 1 - \frac{e(F_A, F_V) + 1}{\log 2}.    (1)

By minimizing the right-hand term of inequality (1), that is, by maximizing e, we expect to recover in each modality the information that originates from the common source S while discarding the source-independent information present in each signal. This would lead to a more accurate classification. For more details, see [1].

Application to speaker detection. The video features are the magnitude of the optical flow estimated over T frames in the mouth regions (rectangular regions including the lips and the chin). T − 1 video feature vectors F_{V,t} (t = 1, ..., T − 1) are obtained, each element of these vectors being an observation of the random variable (r.v.) F_V. The speech signal is represented as a set of T − 1 vectors C_t, each containing P Mel-Frequency Cepstral Coefficients (MFCCs), discarding the first coefficient.

Audio feature optimization. As mentioned, the goal is to construct better features for classification. The focus is now on the audio features F_{A,t}(α), associated with the r.v. F_A, which are built as a linear combination of the P MFCCs:

    F_{A,t}(\alpha) = \sum_{i=1}^{P} \alpha(i) \cdot C_t(i), \quad \forall t = 1, \ldots, T-1.

The weights α are to be optimized with respect to the Efficiency Coefficient Criterion (ECC):

    \alpha_{opt} = \arg\max_{\alpha} \{ e(F_V, F_A(\alpha)) \}.    (2)

A ∆ECC criterion is introduced to perform only one optimization for two mouths and to take into account the discrepancy between the marginal densities of the video features in each region. If F_V^{M_1} and F_V^{M_2} denote the r.v. associated with regions M1 and M2 respectively, then the optimization problem becomes:

    \alpha_{opt} = \arg\max_{\alpha} \left\{ \left[ e(F_V^{M_1}, F_A(\alpha)) - e(F_V^{M_2}, F_A(\alpha)) \right]^2 \right\}.    (3)

This MI-based optimization criterion requires the joint probability density function (pdf) of F_A and F_V as well as their marginal distributions. To avoid any restrictive assumption, they are estimated using Parzen windowing.
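The following Python sketch illustrates how the efficiency coefficient and the ∆ECC criterion of Eq. (3) could be evaluated from samples. It is not the authors' implementation. Several choices are assumptions made only for illustration: the Gaussian Parzen estimate is sampled on a regular grid and normalised to a discrete distribution (so that e stays in [0, 1]), the bandwidth follows a rule of thumb instead of the iterative adaptation used in the paper, the function names are hypothetical, and the data are synthetic. Only the standard library is used, to keep the example self-contained.

```python
import math
import random
import statistics

def parzen_joint_pmf(x, y, bins=16, h_scale=1.0):
    """Gaussian Parzen estimate of the joint density of (x, y), sampled on a
    bins x bins grid and normalised to a discrete pmf (illustrative assumption)."""
    n = len(x)

    def kernels(s):
        # Rule-of-thumb bandwidth (assumption; the paper adapts it iteratively).
        h = h_scale * 1.06 * (statistics.pstdev(s) or 1.0) * n ** (-0.2)
        lo, hi = min(s) - 3 * h, max(s) + 3 * h
        grid = [lo + (hi - lo) * (k + 0.5) / bins for k in range(bins)]
        # One row per grid point, one column per sample. The kernel's constant
        # factor is dropped because it cancels in the normalisation below.
        return [[math.exp(-0.5 * ((g - v) / h) ** 2) for v in s] for g in grid]

    kx, ky = kernels(x), kernels(y)
    # Product kernel: joint[a][b] is proportional to the density at (grid_x[a], grid_y[b]).
    joint = [[sum(p * q for p, q in zip(ra, rb)) for rb in ky] for ra in kx]
    total = sum(map(sum, joint))
    return [[v / total for v in row] for row in joint]

def entropy(p):
    """Shannon entropy (in nats) of a list of probabilities."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def efficiency(x, y, **kw):
    """Efficiency coefficient e = I(X, Y) / H(X, Y)."""
    joint = parzen_joint_pmf(x, y, **kw)
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    h_xy = entropy([v for row in joint for v in row])
    return (h_x + h_y - h_xy) / h_xy if h_xy > 0 else 0.0

def project(alpha, mfcc):
    """F_A,t(alpha) = sum_i alpha(i) * C_t(i) for every frame t."""
    return [sum(a * c for a, c in zip(alpha, frame)) for frame in mfcc]

def delta_ecc(alpha, mfcc, fv_m1, fv_m2):
    """Squared difference of the two efficiency coefficients, as in Eq. (3)."""
    fa = project(alpha, mfcc)
    return (efficiency(fv_m1, fa) - efficiency(fv_m2, fa)) ** 2

# Synthetic stand-ins for real features (hypothetical data, not from the paper).
rng = random.Random(0)
T, P = 100, 12
mfcc = [[rng.gauss(0, 1) for _ in range(P)] for _ in range(T - 1)]
fv_m1 = [frame[0] + 0.1 * rng.gauss(0, 1) for frame in mfcc]  # mouth correlated with audio
fv_m2 = [rng.gauss(0, 1) for _ in range(T - 1)]               # mouth independent of audio
alpha = [1.0] + [0.0] * (P - 1)                               # all weight on one coefficient
fa = project(alpha, mfcc)
e1, e2 = efficiency(fv_m1, fa), efficiency(fv_m2, fa)
print(f"e(M1) = {e1:.3f}, e(M2) = {e2:.3f}")
```

In this toy setting the region whose video feature is correlated with the audio feature yields the larger efficiency coefficient. An optimizer would then search over alpha, calling delta_ecc as the objective.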

3 Optimization Method

Definition of the optimization problem. The extraction of the optimized audio features requires finding the vector α ∈ R^P that minimizes -∆ECC (Eq. (3)). To restrict the set of possible solutions, additional constraints have been introduced on the α weights:

    0 \leq \alpha(i) \leq 1, \quad \forall i = 1, 2, \ldots, P,    (4)

    \sum_{i=1}^{P} \alpha(i) = 1.    (5)


The optimization problem is highly nonlinear and gradient-free. Indeed, the MI-based objective function is a priori non-convex and is very likely to present a rugged surface. Moreover, it is difficult to obtain an analytical form of its gradient because the form of the pdf of the extracted audio features is unknown. The use of Parzen windows to estimate the pdf reduces the risk of getting trapped in a local minimum by smoothing the cost function. Because a trade-off has to be found between smoothness and accuracy of the distribution estimates, the smoothing parameter is iteratively adapted. Thus, the optimization problem is solved using a multi-resolution scheme (see [5]).

The deterministic Powell's method [2] has been used in a first set of experiments [6]. However, the objective function still exhibited too many local optima for a local optimization method to perform well. A global optimization strategy fulfilling the following requirements turned out to be preferable:

1. Efficiency for highly nonlinear problems without requiring the objective function to be differentiable or even continuous over the search space;
2. Efficiency with cost functions that present a shallow, rough error surface;
3. Ability to deal with real-valued parameters;
4. Efficiency in handling the two constraints defined by Eqs. (4, 5).

Genetic Algorithm in Continuous Space (GACS). An evolutionary approach such as GACS meets the first three requirements while presenting flexibility and simplicity of use in a challenging context. The adaptation developed in [3] deals efficiently with a finite solution domain by relating the genetic operators to the constraints on the solution parameters. The crossover operator is defined such that the child chromosome is guaranteed to lie in the acceptance domain (defined by Eq. (4)) provided its parents are valid.
The mutation is performed by perturbing a randomly selected chromosome element with a zero-mean Gaussian perturbation whose variance σ is defined as a certain fraction of the acceptance domain. The mutation is rejected if the mutated gene lies outside its acceptance domain. To satisfy the constraint defined by Eq. (5), the new population is normalized. Notice that the initial chromosomes are regularly placed in the acceptance domain according to a user-defined number of quantization levels Q [7]. This ensures a better initial exploration of the search space than a random initialization.

This extension of GACS leads to better results than Powell's method. However, the mutation operator appears to be ineffective. The solutions are indeed very close to the search space limits. A high number of mutations are then rejected, resulting in a loss of population diversity and in a premature convergence of the algorithm. The perturbation should adapt to the population evolution and should lead to a better exploration of the search space.

Differential Evolution (DE). Differential Evolution, introduced in [4], is an Evolution Strategy where the perturbation corresponds to the difference of chromosomes (or vectors in this context) randomly selected from the population. In this way, the distribution of the perturbation is determined by the distribution of the vectors themselves and no a priori defined distribution is required. Since this distribution depends primarily on the response of the population vectors to the objective function topography, the biases introduced by DE in the random walk towards the solution match those implicit in the function it is optimizing [8].

The exact algorithm we used is based on the so-called DE/rand/1/bin algorithm [8]. The initial population, however, is generated as done with GACS. The validity of each perturbed vector is verified before starting the decision process. If the element j of a child vector i does not belong to the acceptance domain, it is replaced by the mean between its pre-mutation value and the bound that is violated [8]. This scheme is more efficient than the simple rejection adopted with GACS, since it allows the search space bounds to be approached asymptotically. To handle the second constraint (Eq. (5)), a simple normalization is performed on each child vector, as was done with GACS.
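The DE variant described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Storn's code or the authors' implementation. The initial population is drawn at random here rather than on the regular quantized grid used in the paper. The "pre-mutation value" used in the bound repair is taken to be the corresponding element of the parent (target) vector. The function name de_rand_1_bin and the toy quadratic objective are hypothetical.

```python
import random

def de_rand_1_bin(cost, dim, pop_size=20, F=0.5, CR=1.0, generations=100, seed=0):
    """Minimise cost(alpha) subject to 0 <= alpha(i) <= 1 and sum(alpha) = 1."""
    rng = random.Random(seed)

    def normalise(v):
        # Enforce the sum-to-one constraint; None signals an all-zero vector.
        s = sum(v)
        return [x / s for x in v] if s > 0 else None

    # Random initial population (assumption: the paper uses a regular grid).
    pop = []
    while len(pop) < pop_size:
        cand = normalise([rng.random() for _ in range(dim)])
        if cand is not None:
            pop.append(cand)
    costs = [cost(v) for v in pop]

    for _ in range(generations):
        new_pop, new_costs = list(pop), list(costs)
        for i in range(pop_size):
            # Three mutually distinct vectors, all different from the target i.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)  # at least one element always mutates
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    u = pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                    # Bound repair: midpoint between the parent's value and
                    # the violated bound (assumed reading of the scheme).
                    if u < 0.0:
                        u = 0.5 * (pop[i][j] + 0.0)
                    elif u > 1.0:
                        u = 0.5 * (pop[i][j] + 1.0)
                else:
                    u = pop[i][j]
                trial.append(u)
            trial = normalise(trial)
            if trial is None:
                continue
            c = cost(trial)
            if c <= new_costs[i]:  # greedy one-to-one selection
                new_pop[i], new_costs[i] = trial, c
        pop, costs = new_pop, new_costs

    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]

# Toy objective with its optimum on the boundary of the acceptance domain.
target = [0.6, 0.3, 0.1, 0.0]
best, best_cost = de_rand_1_bin(
    lambda a: sum((x - t) ** 2 for x, t in zip(a, target)), dim=4, generations=200)
print([round(x, 3) for x in best], best_cost)
```

Because differences of vectors that each sum to one sum to zero, an unrepaired mutant already satisfies Eq. (5), so the normalization only changes vectors that went through the bound repair. The midpoint repair lets elements approach a bound step by step, which is the behaviour the text contrasts with simple rejection.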

4 Results

Comparison of the optimization methods. All the test sequences are 4 seconds long, PAL standard (T = 100); 12 MFCCs are computed using 23.22 ms Hamming windows. The three optimization methods are tested on a single-speaker sequence using -ECC (Eq. (2)) as the objective function. σ is fixed to 10% of the acceptance domain for GACS, while the scaling factor F and the crossover probability CR required by the DE algorithm [8] are fixed to 0.5 and 1 respectively (the implementation of the DE algorithm is based on Storn's public domain software [9]). Both algorithms are run for 400 generations on a population of 125 vectors. 33 runs were then performed with the GACS and DE methods, whereas different initial solution guesses were tried for Powell's method. Table 1 summarizes the results.

Table 1. Values of the -ECC cost function for 33 runs under the same conditions, on the same AV sequence

Method   Best Value   Mean Value   Standard Deviation
Powell   -0.0213      -0.0183      0.0047
GACS     -0.0695      -0.0619      0.0052
DE       -0.0788      -0.0773      0.0018

A much better result is obtained using the global optimization schemes instead of the local one. DE reaches the best solution and does so in a more stable way: the standard deviation of its solutions is much smaller than that of the other two methods, giving us more confidence in the results. While the high variation of the solutions found with Powell's method is not a surprise (as it is very sensitive to initial conditions), the instability of the GACS solutions seems intriguing. However, it is less surprising when we analyze the evolution of the algorithm towards the solution: the degeneration of the population, combined with the less systematic exploration of the solution space (especially the boundaries), makes the GACS solutions differ widely from run to run. Both the generation of the perturbation increment using the population itself instead of a predefined probability density and the handling of the out-of-range values allow the DE algorithm to achieve outstanding performance in the context of our problem.

Audiovisual speaker detection results. Five home-grown sequences with two individuals (only one speaking at a time) are now used. The DE optimization method is used to project the MFCCs on a new 1D subspace as defined in Sec. 2, using ∆ECC as the optimization criterion. The measure of the MI between the resulting audio feature vector F_A^{opt} and the video features of each mouth region allows each mouth to be classified as "speaking" (highest value of MI) or "non-speaking" (lowest value of MI). The normalized difference of MI is always in favor of the active speaker, i.e. the correct speaking mouth region is always indicated (see Table 2).

Table 2. Normalized difference between the speaking and the non-speaking mouth regions' MI using the audio features optimized with the −∆ECC cost function

Sequence   1        2        3        4       5
∆I         84.23%   86.27%   95.55%   80.9%   76.15%
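The decision rule stated above (the mouth region with the highest MI is labeled as speaking) can be written as a short Python sketch. The paper does not give the exact normalization behind ∆I, so the formula below, the difference divided by the larger MI value, is an assumption made only for illustration. The function names and the input values are hypothetical and are not taken from the experiments.

```python
def classify_mouths(mi_by_region):
    """Return the label of the region with the highest MI (the 'speaking' mouth)."""
    return max(mi_by_region, key=mi_by_region.get)

def normalised_difference(mi_by_region):
    """Percentage gap between the highest and lowest MI, relative to the
    highest one (assumed normalisation; not specified in the paper)."""
    hi, lo = max(mi_by_region.values()), min(mi_by_region.values())
    return 100.0 * (hi - lo) / hi

# Hypothetical MI values for two mouth regions.
mi = {"M1": 0.40, "M2": 0.05}
speaker = classify_mouths(mi)
gap = normalised_difference(mi)
print(speaker, gap)
```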

5 Conclusions

One central issue in the context of the multimodal speaker detection method described here is the optimization of an objective function based on MI. Since no approximation is made, either of the pdf of the features (estimated from the samples) or of the cost function, the optimization problem turns out to be quite challenging. The performance and limits of three optimization methods, the local Powell's method and the global GACS and DE, have been compared, showing that the intrinsic properties of the DE algorithm make it the best choice for the problem tackled here. As a result, the method is able to detect the current speaker on the five test sequences.

References

1. Butz, T., Thiran, J.P.: From error probability to information theoretic (multi-modal) signal processing. Signal Process. 85 (2005) 875–902
2. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. 2nd edn. Cambridge University Press (1992)
3. Schroeter, P., Vesin, J.M., Langenberger, T., Meuli, R.: Robust parameter estimation of intensity distributions for brain magnetic resonance images. IEEE Trans. Med. Imaging 17(2) (1998) 172–186
4. Storn, R., Price, K.: Differential evolution - a simple and efficient adaptive scheme for global optimization over continuous spaces. J. Global Optim. 11 (1997) 341–359
5. Besson, P., Popovici, V., Vesin, J., Kunt, M.: Extraction of audio features specific to speech using information theory and differential evolution. EPFL-ITS Tech. Rep. 2005-018, EPFL, Lausanne, Switzerland (2005)
6. Besson, P., Kunt, M., Butz, T., Thiran, J.P.: A multimodal approach to extract optimized audio features for speaker detection. In: Proc. EUSIPCO. (2005)
7. Leung, Y.W., Wang, Y.: An orthogonal genetic algorithm with quantization for global numerical optimization. IEEE Trans. Evo. Comp. 5(1) (2001) 41–53
8. Price, K.V.: 6: An Introduction to Differential Evolution. In: New Ideas in Optimization. McGraw-Hill (1999) 79–108
9. Storn, R.: Differential evolution homepage [online]. (Available: http://www.icsi.berkeley.edu/~storn/code.html)