CALIBRATION-FREE GAZE TRACKING USING PARTICLE FILTER

PhiBang Nguyen, Julien Fleureau, Christel Chamaret, Philippe Guillotel
Technicolor, 975 Ave. des Champs Blancs, 35576 Cesson-Sevigne, France
[email protected]

ABSTRACT

This paper presents a novel approach to gaze estimation that uses only a low-cost camera and requires no calibration. The main idea is to exploit the center-bias property of the human gaze distribution to obtain a coarse estimate of the current gaze position, and to use temporal information to refine this rough estimate. We first propose a method for detecting the eye center location, together with a mapping model based on the center-bias effect that converts it to a gaze position. This initial gaze estimate then serves to construct the likelihood model of the eye appearance. The final gaze position is estimated by fusing the likelihood model with the prior information obtained from previous observations within a particle filtering framework. Extensive experiments demonstrate the good performance of the proposed system, with an average estimation error of 3.43°, which outperforms state-of-the-art methods. Furthermore, the low complexity of the proposed system makes it suitable for real-time applications.

Index Terms— HCI, gaze estimation, eye center detection, calibration-free, particle filter

1. INTRODUCTION

Gaze estimation plays an important role in several domains, especially in human-computer interaction, since it provides a more efficient and interactive way of dealing with computers. However, existing gaze sensing systems are still far from being widely used in everyday applications for two main reasons: i) the cost of such systems is still high, essentially due to the embedded devices; ii) most systems require a calibration procedure, either to infer certain person-specific eye parameters (model-based methods) or to regress the mapping function between eye appearances and gaze positions on the screen (appearance-based methods). Such a process is cumbersome, uncomfortable and difficult to perform. Moreover, in some situations, such as consumer home applications (e.g., interactive gaming interfaces, adaptive content selection interfaces), active calibration is almost impossible because the gaze estimation is required to be imperceptible to users.

In the literature, some efforts to avoid calibration have recently been proposed. In [1], Sugano et al. exploit the user's mouse cursor as additional information for an implicit calibration.

Each time the user clicks, the cursor position (taken as the gaze point) and the corresponding eye image features are collected as a labeled sample. The system predicts a new gaze point by linear interpolation from its nearest neighbors among the labeled samples. The method achieves an accuracy of about 5°, but it requires a large number of clicked points (≈ 1500) and is restricted to specific contexts in which there is a physical interaction between the user and the computer.

More recently, researchers have focused their attention on using visual saliency to avoid calibration, since a computational saliency model may be interpreted as a gaze probability distribution and hence exploited for gaze estimation. One of the early attempts in this direction was made by Sugano et al. [2]. The underlying idea is relatively simple: i) the mapping between eye appearances and gaze points is learned by Gaussian process regression (GPR); ii) the gaze samples required to train the GPR are created by repeated random sampling on the saliency maps. To reduce the computational cost, eye images that are similar to each other (as well as the corresponding saliency maps) are averaged, as they are assumed to belong to the same fixation. The method achieves a mean estimation error of 6.3°. Although it is the first attempt to use a computational saliency model for passive calibration, this approach has some limits. Firstly, since it relies entirely on visual saliency (which is not always reliable) to acquire training samples, it may fail in situations where the prediction accuracy of saliency maps is very low (e.g., scenes containing many instantaneous salient regions, or top-down tasks such as object-specific search or text reading). Secondly, before entering operating mode, the system needs an off-line and time-consuming training phase (users have to do 10 minutes of training before each 10-minute test). Thirdly, the number of sampled gaze points (required for the GPR) has to be large enough to approximate the gaze probability distribution, which significantly increases the computational cost.

In [3], Chen et al. also integrate visual saliency into their existing model-based gaze estimation system in order to avoid "personal calibration". Their main idea is based on inference in a Bayesian network relating the stimulus, the gaze and the eye parameters, and on approximating some probability densities by Gaussian distributions in order to estimate the subject's visual axis.

The gaze point is then derived as the intersection between the estimated visual axis and the screen. The algorithm performs better than that of Sugano et al., with a mean error of about 2°, and allows head movements (which are not supported by Sugano et al.'s method). However, these good results come at the expense of requiring multiple IR cameras and light sources in a complex configuration [4]. Furthermore, their method is specific to static images and requires a longer learning time to improve the accuracy, i.e., users need to look at an image for 3-4 seconds to reach an accuracy of about 2°.

In this paper, we develop a novel approach to calibration-free gaze estimation that requires only one low-cost camera, under the simplifying assumption that the user's head is fixed. Unlike the methods in [2] and [3], which rely entirely on saliency to avoid calibration, we instead exploit the center-bias effect. The basic idea is to obtain a rough estimate of the gaze position from the eye center location with the help of the center-bias effect. This gaze position then serves to build the likelihood of the eye appearance. The final gaze position is refined by fusing the eye likelihood with the temporal information obtained from past observations within a particle filtering framework.

2. PROPOSED APPROACH

We formulate the gaze estimation as a probabilistic problem of estimating the posterior probability p(g_t | e_{1:t}) (where g_t and e_t are the current gaze position and eye appearance, respectively). By applying Bayes' rule, we obtain:

p(g_t | e_{1:t}) ∝ p(e_t | g_t) p(g_t | e_{1:t-1}).    (1)
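For completeness, Equation 1 follows from Bayes' rule together with the standard assumption that the current observation depends only on the current state; written out in LaTeX:

    \begin{aligned}
    p(g_t \mid e_{1:t})
      &= \frac{p(e_t \mid g_t, e_{1:t-1})\, p(g_t \mid e_{1:t-1})}{p(e_t \mid e_{1:t-1})} \\
      &\propto p(e_t \mid g_t)\, p(g_t \mid e_{1:t-1}),
    \end{aligned}

since p(e_t | g_t, e_{1:t-1}) = p(e_t | g_t) and the denominator does not depend on g_t.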

In this equation, the posterior density p(g_t | e_{1:t}) can be estimated via the prior probability p(g_t | e_{1:t-1}) (the prediction of the current state g_t given the previous measurements) and the likelihood p(e_t | g_t). Equation 1 characterizes a dynamic system with one state variable g and the observation e. The gaze position distribution is assumed to follow a first-order Markov process, i.e., the current state depends only on the previous state. This assumption clearly holds for fixation and smooth-pursuit eye movements. For a saccadic eye movement, the current gaze position can also be considered as dependent on the previous gaze position if it is modeled by a distribution with a sufficiently large scale, as described in Section 2.1. The particle filtering framework [5] can be adopted to solve this problem. Theoretically, particle filter based methods approximate the posterior density p(g_t | e_{1:t}) in two steps:

1. Prediction: the current state is predicted from the previous observations e_{1:t-1},

p(g_t | e_{1:t-1}) = ∫ p(g_t | g_{t-1}) p(g_{t-1} | e_{1:t-1}) dg_{t-1}.    (2)

2. Update: the estimate of the current state is updated with the incoming observation e_t using Bayes' rule,

p(g_t | e_{1:t}) ∝ p(e_t | g_t) p(g_t | e_{1:t-1}).    (3)

The posterior distribution p(g_t | e_{1:t}) is approximated by a set of N particles {g_t^i}, i = 1, ..., N, associated with weights w_t^i. N is empirically set to 5000, which offers a good trade-off between the computational cost and the desired accuracy (a relatively large number, because the eye gaze may jump over large distances, whereas object movements are often restricted to a smaller area around the previous state). In our experiments, this parameter does not have a critical impact on the system performance (a greater number of particles does not improve the final results). Usually, we cannot draw samples from p(g_t | e_{1:t}) directly, but rather from a so-called "proposal distribution" q(g_t | g_{1:t-1}, e_{1:t}), where q(·) can be chosen under some constraints. The weights are updated by:

w_t^i = w_{t-1}^i · [ p(e_t | g_t^i) p(g_t^i | g_{t-1}^i) ] / q(g_t^i | g_{1:t-1}^i, e_{1:t}).    (4)

We choose p(g_t | g_{t-1}) as the proposal distribution, resulting in a bootstrap filter with a straightforward implementation. In this way, the weight update simply reduces to the computation of the likelihood. To avoid the degeneracy problem, resampling is also adopted: the old set of particles is replaced by a new set of equally weighted particles drawn according to their importance weights.

2.1. State Transition Model

Intuitively, fixation and smooth-pursuit eye movements can be successfully modeled by a distribution whose peak is centered at the previous gaze position g_{t-1} (e.g., a Gaussian distribution). For a saccadic eye movement, a Gaussian distribution centered at the previous gaze position can also be used, but with a much larger scale, to describe the uncertainty of the saccade. Hence, the state transition should be modeled by a mixture of two Gaussian densities. However, for the sake of simplicity, we adopt a single distribution for both types of eye movement:

p(g_t | g_{t-1}) = N(g_{t-1}; diag(σ²)),    (5)

where diag(σ²) is the diagonal covariance matrix containing the variance of each independent variable x_t and y_t (the gaze point being denoted as a two-dimensional vector g_t = (x_t, y_t)). σ² needs to be large enough to cover the whole range of possible gaze positions on the display, so that saccadic eye movements are also modeled. We set σ to one third of the screen dimensions.
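To make the above concrete, here is a minimal sketch in Python of one bootstrap-filter iteration using the Gaussian transition of Equation 5. It is not the authors' implementation and the names are illustrative; the likelihood is kept abstract here and is specified in Section 2.2.

    import numpy as np

    def bootstrap_step(particles, weights, likelihood, sigma, rng):
        """One resample/predict/update iteration of the bootstrap particle filter.

        particles:  (N, 2) array of gaze hypotheses (x, y) in screen coordinates.
        weights:    (N,) normalized importance weights from the previous frame.
        likelihood: callable mapping an (N, 2) array to per-particle likelihoods
                    (Eq. (10) in Section 2.2).
        sigma:      length-2 array of transition standard deviations,
                    set to one third of the screen dimensions.
        rng:        a numpy random Generator, e.g. np.random.default_rng().
        """
        n = len(particles)

        # Resample according to the importance weights to avoid degeneracy.
        idx = rng.choice(n, size=n, p=weights)
        particles = particles[idx]

        # Prediction: propagate through the Gaussian transition model, Eq. (5).
        particles = particles + rng.normal(scale=sigma, size=(n, 2))

        # Update: with q = p(g_t | g_{t-1}), the weight update of Eq. (4)
        # reduces to the likelihood of the new observation.
        weights = likelihood(particles)
        weights = weights / weights.sum()

        # Point estimate of the gaze: weighted mean of the particles.
        gaze = (weights[:, None] * particles).sum(axis=0)
        return particles, weights, gaze

The point estimate is taken here as the weighted mean of the particles; the paper does not state which point estimate is used.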

2.2. Likelihood Model

In the context of object tracking, the likelihood p(e_t | g_t) is often computed from a similarity measure between the current observation and an existing object model. For gaze estimation, we model it as:

p(e_t | g_t) ∝ exp(−λ d(e_t)),    (6)

where λ is the parameter that determines how peaked the distribution is, and d(e_t) = ‖e_t − ê_t‖ denotes a distance measure between the current observation e_t and the estimated eye image ê_t (corresponding to the hypothesized gaze position g_t). Since the system is calibration-free, we do not have access to training images from which ê_t could be estimated. Hence, we propose a simple method to estimate the likelihood via the detection of the eye center position and the center-bias effect. This estimation proceeds in the two steps described below.

2

2

2

4

4

4

6

6

8

8

8

10

10

10

12

12

12

14

14

14

16

16

16

18

18

20

20

2

4

6

8

10

12

14

16

18

20

6

18

20 2

4

6

8

10

12

14

16

18

20

2

4

6

8

10

12

14

16

18

20

2.2.1. Eye Center Localization

The most common approach to locating the eye center relies on the natural circular shape of the eye [6], [7], [8]. In general, shape-based methods require edge detection (or gradient computation) as a preprocessing step, which is very sensitive to noise and shape deformations. Moreover, few of them can be used in real-time applications because of their high computational cost. We propose here a fast and simple eye center localization and tracking scheme using color cues. Since the user's head is fixed, the subject's face is assumed to be roughly bounded by a predefined rectangle (in the real system, we use a face detection algorithm), and the rough eye regions are determined from anthropometric relations. The algorithm consists of the following major steps (a code sketch is given after the list):

1. The eye image is converted into the YCbCr color space.

2. Based on the statistical observation that pixels in the pupil region usually have high values in the Cb component and low values in the Y and Cr components, an eye center heat map HM is computed as

HM(x, y) = Cb(x, y) (1 − Cr(x, y)) (1 − Y(x, y)).    (7)

3. All sub-regions that may correspond to the pupil are then extracted using region growing. To do so, local maxima larger than a predefined threshold T1 are chosen as seed points. 4-connected regions around each seed point are then built by adding all pixels whose values are higher than a predefined threshold T2. The selected points are dynamically added to the set of "candidate points", and the process continues until the whole eye region has been covered. Empirically, we set T1 = 0.98 and T2 = 0.85 to obtain the best performance.

4. The heat map is then smoothed by convolution with a Gaussian kernel. The Gaussian filter is important to stabilize the detection result (i.e., to avoid wrong detections due to eyelids, eyeglasses or reflections).

5. Finally, the eye center location (x_c, y_c) is estimated by a weighted vote over all candidate points:

(x_c, y_c) = [ Σ_{(x,y)∈PR} HM(x, y) · (x, y) ] / [ Σ_{(x,y)∈PR} HM(x, y) ],    (8)

where PR is the set of candidate points.

Fig. 2. Examples of successful eye center detection. All the human faces are blurred for privacy protection.
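The following sketch (hypothetical code, not the authors' implementation) illustrates steps 1-5 on a normalized YCbCr eye patch. The region-growing step is approximated by a 4-connected component analysis, and the Gaussian smoothing scale is an assumed value, since the paper does not specify it.

    import numpy as np
    from scipy.ndimage import gaussian_filter, label

    def eye_center(Y, Cb, Cr, t1=0.98, t2=0.85, sigma=2.0):
        """Sketch of the color-based eye center localization (steps 1-5).

        Y, Cb, Cr: eye-region channels normalized to [0, 1].
        t1, t2:    seed / growing thresholds (values from the paper).
        sigma:     Gaussian smoothing scale in pixels (assumed, not given in the paper).
        """
        # Step 2, Eq. (7): pupil pixels have high Cb and low Y, Cr.
        hm = Cb * (1.0 - Cr) * (1.0 - Y)

        # Step 3 (approximated): region growing from seeds above t1 over pixels
        # above t2, implemented as the 4-connected components of (hm >= t2)
        # that contain at least one seed pixel (hm >= t1).
        regions, _ = label(hm >= t2)
        seed_labels = np.unique(regions[hm >= t1])
        seed_labels = seed_labels[seed_labels > 0]
        candidates = np.isin(regions, seed_labels)
        if not candidates.any():
            return None  # no confident detection in this eye region

        # Step 4: smooth the heat map to reduce the influence of eyelids,
        # glasses and specular reflections.
        hm_s = gaussian_filter(hm, sigma=sigma)

        # Step 5, Eq. (8): heat-map weighted centroid over the candidate points.
        ys, xs = np.nonzero(candidates)
        w = hm_s[ys, xs]
        return float((w * xs).sum() / w.sum()), float((w * ys).sum() / w.sum())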

2.2.2. Conversion from eye center position to gaze position

The gaze distribution is known to be biased toward the center of the screen in free-viewing mode [9], [10]. This effect can be observed in Fig. 1. In this experiment, we record gaze positions with a commercial eye tracker while users engage in three screen viewing activities: movie watching, television watching and web browsing. Four observers are asked to watch 8 video sequences (4 movie clips and 4 television clips, each of 10 minutes). For the web browsing activity, the subjects are free to choose 5 favorite web sites to browse during 10 minutes. The results are then averaged over all stimuli and all subjects. A strong center-bias effect is observed for the movie and television viewing activities, where the gaze positions are concentrated in a very narrow region in the middle of the screen. For the web browsing activity, the center bias is also noticeable despite a larger dispersion of the gaze distribution around the center.

Fig. 1. Average spatial histograms of gaze positions recorded by the SMI gaze tracker for movie viewing (left), television viewing (middle) and web browsing (right).

Based on this statistical property, the "observed gaze" position ĝ_t = (x̂_g, ŷ_g) (normalized into [0, 1]) can be determined from the current eye center coordinates (x_c, y_c) (in the eye image) by the following simple projection model:

x̂_g = 0.5 + 0.5 (x_c − x̄_c) / (A_x σ_{x_c}),
ŷ_g = 0.5 + 0.5 (y_c − ȳ_c) / (A_y σ_{y_c}),    (9)

where
- x̂_g and ŷ_g are the converted gaze coordinates;
- x_c and y_c are the current eye center positions in absolute image coordinates;
- x̄_c, ȳ_c, σ_{x_c} and σ_{y_c} are respectively the means and the standard deviations of x_c and y_c; these parameters are continuously computed and updated during the process;
- A_x and A_y are tuning factors that describe the "scale" of the gaze distribution; they are empirically set to 4, which is large enough to quantify the center-bias level.

The resulting likelihood can now be computed from Equation 6. More precisely, the likelihood value p(e_t | g_t) given the observation e_t decreases exponentially with the distance between g_t and the "observed gaze" position ĝ_{e_t}, which is derived from the eye center location via Equation 9:

p(e_t | g_t) ∝ exp(−λ ‖g_t − ĝ_{e_t}‖).    (10)

The parameter λ in this equation is determined such that p(e_t | g_t) ≈ ε (where ε is a very small positive number) when ‖g_t − ĝ_{e_t}‖ = D, where D is the largest possible error, generally set to the diagonal of the screen.
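As an illustration, the sketch below combines the projection model of Equation 9 with the likelihood of Equation 10, assuming the distance in Equation 10 is the Euclidean norm and that the running statistics of the eye center are simply recomputed from all detections so far. The class and parameter names are illustrative, not taken from the paper.

    import numpy as np

    class GazeObservationModel:
        """Maps an eye-center detection to an 'observed gaze' point (Eq. (9))
        and a per-particle likelihood (Eq. (10)). Illustrative sketch only."""

        def __init__(self, screen_w, screen_h, ax=4.0, ay=4.0, eps=1e-3):
            self.screen = np.array([screen_w, screen_h], dtype=float)
            self.a = np.array([ax, ay])           # tuning factors, set to 4 in the paper
            self.centers = []                     # running history of eye-center detections
            d_max = np.hypot(screen_w, screen_h)  # D: screen diagonal, the largest error
            self.lam = -np.log(eps) / d_max       # so that exp(-lam * D) = eps

        def observed_gaze(self, xc, yc):
            """Eq. (9): map the eye center to a gaze point using its running statistics."""
            self.centers.append((xc, yc))
            c = np.asarray(self.centers, dtype=float)
            mean, std = c.mean(axis=0), c.std(axis=0) + 1e-9  # avoid division by zero early on
            g = 0.5 + 0.5 * (np.array([xc, yc]) - mean) / (self.a * std)
            return g * self.screen                # from [0, 1]^2 to screen coordinates

        def likelihood(self, particles, g_obs):
            """Eq. (10): exponential decay with the distance to the observed gaze."""
            dist = np.linalg.norm(particles - g_obs, axis=1)
            return np.exp(-self.lam * dist)

In the bootstrap step of Section 2, this likelihood would be used with the current observation bound, e.g., likelihood = lambda p: model.likelihood(p, g_obs).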

3. EXPERIMENTAL RESULTS

The proposed eye center detection method is first evaluated independently; the performance of the whole system is then reported and discussed. Thanks to the low complexity of the eye center detection and particle filter methods, the system runs 10 times faster than real time with our multi-threaded implementation.

3.1. Eye Center Detection Evaluation

Although some databases exist for evaluating eye center detection algorithms, most of them contain only gray-scale images. As our method is color-based, we therefore propose a data set consisting of 401 color images of 401 people with different resolutions, illuminations, poses, eye colors and specificities (e.g., wearing glasses). Both eye centers were manually annotated to build the ground truth. Fig. 2 shows qualitative results obtained on a sample of the tested database. We observe that the method successfully handles changes in pose, scale, illumination, resolution and eye color, as well as the presence of glasses. For quantitative results, the normalized error [11] is adopted as the accuracy measure. This measure may be interpreted as follows: i) e ≤ 0.25 corresponds to an error equivalent to the distance between the eye center and the eye corners, ii) e ≤ 0.1 corresponds to the diameter of the iris, and iii) e ≤ 0.05 corresponds to the diameter of the pupil. Regarding the accuracy required for our application (i.e., pupil center detection), we focus on the performance obtained for e ≤ 0.05.

We also compare our method with a recent state-of-the-art algorithm proposed by Valenti et al. [8], which is based on isophote curvature. The two methods are tested on the same database and evaluated with the same accuracy measure described above. Fig. 3 shows the obtained accuracy as a function of the normalized error for the proposed method (Fig. 3a) and for Valenti et al.'s method (Fig. 3b). To give more complete information about the accuracy, the minimum normalized error (computed on the better eye only) as well as the average error over the better and worse estimates are also reported. As can be observed, our method achieves an accuracy comparable with the results obtained by Valenti et al. Despite its simplicity, the method also works well in other test scenarios (indoor environments with luminance changes, people wearing glasses, different eye colors). Furthermore, since the proposed system is built on a probabilistic framework, such an accuracy is sufficient for a rough estimation of the gaze probability map from the eye image; the final gaze position is then refined by fusing this map with the prior information obtained from past observations.

3.2. Gaze Tracking Evaluation

Four movies (the same as those used in [2]) are selected for evaluation: A) "Dreams", B) "Forrest Gump", C) "Nuovo Cinema Paradiso" and D) "2001: A Space Odyssey". From each, we extract a 10-minute sequence by concatenating 2-second clips sampled at equal time intervals from the beginning to the end of the feature-length movie. These sequences are chosen so that our method can be compared consistently with the one in [2]. The screen resolution is 1280×1024 pixels (376 mm × 301 mm). Ten subjects (8 males and 2 females) took part in the tests. A chin rest was used to fix the head position and maintain a constant distance of 610 mm (about twice the screen height) between the subjects and the display. Ground truth data were recorded by an SMI RED gaze tracker with a sampling frequency of 50 Hz; its accuracy is 0.4° according to its technical specification. The user video was captured in parallel by a consumer web camera at 16 Hz.

Fig. 4 shows the demo interface of the proposed gaze tracker. The user video and the video stimulus are displayed at the same time, and the mean errors of the proposed system are shown both as numeric values and as curves over time. The performance of each eye is measured independently to give an idea of the "ocular dominance" effect. From the curves, we can see that the mean errors are rather high at the beginning, due to the inaccuracy of the projection model (the number of samples used for computing the mean and the standard deviation is not yet large enough to represent the statistics of the eye center position). However, as time goes by, the system performance quickly improves.
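For reference, with the geometry stated above (1280×1024 pixels on a 376 mm × 301 mm display viewed from 610 mm), an on-screen error in pixels can be converted to degrees of visual angle as sketched below. The paper does not detail its conversion, so this is only an assumed approximation using the horizontal pixel pitch.

    import math

    def pixel_error_to_degrees(err_px, px_per_mm=1280 / 376.0, distance_mm=610.0):
        """Convert an on-screen gaze error from pixels to degrees of visual angle."""
        err_mm = err_px / px_per_mm
        return math.degrees(math.atan2(err_mm, distance_mm))

    # Example: an error of about 125 px corresponds to roughly 3.4 degrees here.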


Fig. 3. Accuracy as a function of the normalized error for the proposed method (left) and Valenti et al.'s method [8] (right).

The gaze estimation results are detailed in Table 1. As can be seen, the whole system achieves an overall accuracy of 3.43°, whereas Sugano et al.'s method achieves 6.3° on the same dataset [2].¹ Compared with Chen et al.'s method [3], which achieves 1.8° (on a different dataset of static images), our method is less accurate and, furthermore, does not allow free head movement. However, their method is geometric model-based and requires multiple IR cameras as well as IR light sources, which makes their system much more expensive than ours, which only requires a low-cost camera. Without using temporal information, i.e., when the gaze position is computed directly from the conversion model of Equation 9, the system achieves an accuracy of 4.25° using only the eye appearance cues. Note that, for the system to work, the gaze distribution does not need to be confined to a near-center region of the screen: only the mean of the gaze distribution needs to match the screen center, and users can otherwise look at peripheral areas. Since the proposed method is partially based on the center-bias effect (in the design of the projection model), the system should target applications in which the center bias has a sufficiently significant impact on the gaze distribution. Analytic applications such as i) content selection/navigation/recommendation for Video on Demand, ii) gaze-based audience profiling for cinema, and iii) aesthetic quality assessment (learning the aesthetic taste of a person from his or her gaze) can be considered.

¹ Sugano et al. report an accuracy of 3.5° in the journal version of the same work [12]. However, those results are obtained on a different video set containing 80 short trailer clips of 30 s each. Such a database significantly favors their method (which relies on saliency maps to learn a GPR-based gaze estimator), because bottom-up visual attention plays a more important role in trailer sequences, and salient regions are therefore more likely to coincide with actual gaze positions than in a normal video sequence.

Table 1. Gaze estimation errors averaged over 10 subjects, in degrees. Two settings are considered: i) "Eye Only", which estimates p(g_t | e_t) from the eye center location alone using the simple projection model, and ii) "Eye + PF", which estimates p(g_t | e_{1:t}) from the eye appearance within the particle filtering framework.

Sequence   Whole System (Eye + PF)   Eye Only
A          4.18 ± 0.40               4.91 ± 0.25
B          3.10 ± 0.68               4.06 ± 0.41
C          3.07 ± 0.23               3.93 ± 0.25
D          3.39 ± 0.50               4.11 ± 0.46
Average    3.43 ± 0.52               4.25 ± 0.44

Interactive applications such as gaze-based TV remote control could also be envisaged. In specific contexts where subjects perform a monotonous top-down task (e.g., text reading), the performance of the system will obviously degrade, because the horizontal center bias only appears after reading several lines, and the vertical center bias only after reading several pages.

4. CONCLUSION AND FUTURE WORK

We have proposed a new method for calibration-free gaze estimation that uses only a low-cost camera. To avoid calibration, the gaze position is roughly estimated from the eye center position thanks to the center-bias effect. The final gaze position is then obtained by exploiting temporal information from past observations within a particle filtering framework. Regarding future work, we plan to include saliency information as an additional observation to improve the system. Several directions can also be considered to deal with head movements. Assuming that the center-bias effect remains valid when users change their head pose, the existing system can be slightly modified to allow head movements simply by re-initializing the statistics (i.e., the mean and the standard deviation) of the eye center position whenever a sufficiently large head movement is detected.

Another direction is to integrate a head pose tracker into the existing system as an additional cue, which would be fused with the other cues in a principled manner thanks to the particle filtering framework.

Fig. 4. Demo interface of the proposed gaze tracker. On the video stimulus window, the green dot is the ground-truth gaze point and the red dot is the gaze point estimated by the system. The curves display the mean estimation errors over time.

5. REFERENCES

[1] Y. Sugano, Y. Matsushita, Y. Sato, and H. Koike, "An incremental learning method for unconstrained gaze estimation," in Proc. of the 10th ECCV, 2008, pp. 656–667.

[2] Y. Sugano, Y. Matsushita, and Y. Sato, "Calibration-free gaze sensing using saliency maps," in Proc. of the 23rd IEEE CVPR, June 2010.

[3] J. Chen and Q. Ji, "Probabilistic gaze estimation without active personal calibration," in Proc. of IEEE CVPR'11, 2011, pp. 609–616.

[4] E. D. Guestrin and M. Eizenman, "General theory of remote gaze estimation using the pupil center and corneal reflections."

[5] M. Isard and A. Blake, "Condensation – conditional density propagation for visual tracking," IJCV, vol. 29, no. 1, pp. 5–28, 1998.

[6] R. S. Stephens, "Probabilistic approach to the Hough transform," Journal of Image and Vision Computing, vol. 9, no. 1, pp. 66–71, 1991.

[7] L. Pan, W. S. Chu, J. M. Saragih, and F. De la Torre, "Fast and robust circular object detection with probabilistic pairwise voting," IEEE Signal Processing Letters, vol. 18, no. 11, pp. 639–642, 2011.

[8] R. Valenti and T. Gevers, "Accurate eye center location and tracking using isophote curvature," in Proc. of the IEEE CVPR'08, June 2008.

[9] P. H. Tseng, R. Carmi, I. G. M. Cameron, D. P. Munoz, and L. Itti, "Quantifying center bias of observers in free viewing of dynamic natural scenes," Journal of Vision, vol. 9, no. 7, pp. 1–16, July 2009.

[10] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in IEEE ICCV'09, 2009.

[11] O. Jesorsky, K. J. Kirchberg, and R. Frischholz, "Robust face detection using the Hausdorff distance," in Proc. Audio- and Video-based Biometric Person Authentication, 2001, pp. 90–95.

[12] Y. Sugano, Y. Matsushita, and Y. Sato, "Appearance-based gaze estimation using visual saliency," IEEE Trans. on PAMI, vol. PP, no. 99, 2012.