Learning in presence of input noise using the stochastic EM algorithm 

Hichem Snoussi , Abd-Krim Seghouane , Ali Mohammad-Djafari and Gilles Fleury 

Laboratoire des Signaux et Systèmes (L2S), Supélec, Plateau de Moulon, 91192 Gif-sur-Yvette Cedex, France
Service des Mesures, Supélec, Plateau de Moulon, 91192 Gif-sur-Yvette Cedex, France

Abstract. Most learning algorithms rely on the assumption that the input training data contain no noise or uncertainty. However, when collecting data in an identification experiment it may not be possible to avoid noise when measuring the input, and the errors-in-variables model is then more appropriate to describe the data. Learning based on maximum likelihood estimation, however, is far from straightforward because of the large number of unknown parameters. In this paper, to overcome the problems associated with estimating a large number of unknown parameters, the nonlinear errors-in-variables estimation problem is treated in a Bayesian formulation. To compute the necessary maximum a posteriori estimate we use restoration-maximization algorithms in which the true but unknown training inputs are treated as hidden variables. To accelerate the convergence of the algorithm, a modified version of the stochastic EM algorithm is proposed. A simulation example on learning a nonlinear parametric function and an example on learning feedforward neural networks illustrate the effectiveness of the proposed learning method.

INTRODUCTION

Most learning algorithms take into account only the uncertainty in the output when treating the training data set, and therefore rely on the assumption that the input is known exactly. Generally, the training data set is described by the additive-error regression model

y_i = f(x_i; θ) + ε_i    (1)

where ε_i is sampled from a centered Gaussian law of variance σ_ε², (x_i, y_i), i = 1, …, N, are the experimental input-output training pairs, and f(·; θ) can represent a neural network approximator, θ being the neural network parameter vector, for instance a feedforward network with m neurons in the hidden layer.

± ",. § Et²³¡«´!°>µ ± § Et²³¡v´{

§ y  Et¡«+ ¨ yEt¡« Y  

¶ ¨ t E ¡v´{!°>a ±

-



 ¨ @ E ¡v´{ 

(9)

$ n«$ where § and ¨ are vectors of dimension $ , ² is a diagonal matrix of dimension   › Y   ± ¸ 2 and ›  , 5·†#"#%$ , ° µ ±  & µm where - the - diagonal elements are the derivatives ° a ±  & am ± ¸ 2 .  Y ¼ that minimize this sum The vector ¡ Š Y º¹»¡ Š Y- !"#"3¡ Š 2. ¡ K· *² Š



-  °>µ ± { ² Ÿ°>a ± ½± *² > ° µ ± § Ÿ°>a ± ¨ š

(10)

allow the actualization of the estimation of r Y

Y  Š

› Y ¡ Y  Š

(11)
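The linearized update of equations (9)-(10) can be sketched numerically as follows; this is a minimal illustration assuming scalar inputs, known noise variances, and the parametric model used later in the simulation section (all function and variable names are illustrative, not from the paper):

```python
import numpy as np

def f(x, theta):
    # example nonlinear model f(x; theta) = theta1 * exp(-x / theta2)
    return theta[0] * np.exp(-x / theta[1])

def fprime(x, theta):
    # derivative of f with respect to its input x
    return -theta[0] / theta[1] * np.exp(-x / theta[1])

def update_inputs(x_obs, y_obs, x_star, theta, sigma_eps2, sigma_x2):
    """One linearized (Gauss-Newton-style) update of the true-input estimates.

    Linearizing f around x_star makes the criterion quadratic in x*, whose
    minimizer is the element-wise weighted average computed below.
    """
    g = fprime(x_star, theta)                    # diagonal elements of Gamma
    u = y_obs - f(x_star, theta) + g * x_star    # linearized output target
    num = g * u / sigma_eps2 + x_obs / sigma_x2
    den = g * g / sigma_eps2 + 1.0 / sigma_x2
    return num / den
```

Note that when σ_x² is very small the update returns essentially the observed inputs, as the input-noise term dominates the weighted average.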

and therefore that of θ through equation (7). The statistical properties of the estimator of θ obtained by this algorithm have been derived in [8]. The major inconvenience of this algorithm is that the solution depends on the stability of the vector x̂, which is not guaranteed when x* is of high dimension. Therefore, methods which eliminate the nuisance parameters from the likelihood function may be preferable. A standard method of eliminating the nuisance parameters is to adopt a Bayesian approach [9]: the density (4) is multiplied by an appropriate prior distribution to obtain the joint a posteriori distribution of the nuisance parameters and the parameters of interest, and marginalizing it then gives an estimating criterion independent of the nuisance parameters. Bayesian estimators are particularly interesting because they are asymptotically efficient and asymptotically equivalent to the maximum likelihood estimator (under regularity conditions), independently of the imposed prior [10]. Based on this, an estimator and the associated algorithm are proposed in the following section.

THE BAYESIAN APPROACH

Denoting the training data set D = {(x_i, y_i), i = 1, …, N}, the a posteriori distribution of the parameter vectors x* and θ is, according to Bayes' rule,

p(x*, θ | D) ∝ p(D | x*, θ) p(x*, θ)    (12)

where p(D | x*, θ) is the density of the training data set D from which the likelihood function is constructed and p(x*, θ) is the a priori distribution of the unknown parameters. The choice of a non-informative a priori distribution is not an easy task [11]; in the following we retain the flat prior, which is not proper, but we suppose that the function at hand guarantees the existence of the a posteriori distribution, i.e. ∫ p(D | x*, θ) p(x*, θ) dx* dθ < ∞. We note that the a posteriori distribution carries all our knowledge about the inferential problem and is flexible enough to incorporate any additional a priori information concerning the unknown parameters. As mentioned in the previous section, the joint estimation of the parameter of interest θ and the nuisance parameters x* is, if not intractable, computationally demanding. Moreover, the joint estimation of the unknown inputs x* may introduce a bias into the resulting estimate of θ [12]. The Bayesian formulation makes the integration over the undesired inputs possible, which yields the marginal a posteriori distribution of θ:

p(θ | D) ∝ ∫ p(x*, θ | x_{1..N}, y_{1..N}) dx* ∝ p(θ) ∫ p(y_{1..N} | x*, θ) p(x_{1..N} | x*) dx*    (13)
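For a single training pair, the marginalization in (13) can be carried out numerically; the sketch below (the model, grid and quadrature are illustrative choices, not the paper's method) evaluates the marginal likelihood contribution of one pair by integrating the output likelihood against the input-noise density:

```python
import numpy as np

def f(x, theta):
    # illustrative parametric model f(x; theta) = theta1 * exp(-x / theta2)
    return theta[0] * np.exp(-x / theta[1])

def marginal_likelihood(x_i, y_i, theta, sigma_eps, sigma_x, n_grid=2001):
    """Integrate exp(-(y_i - f(x*))^2 / (2 se^2)) * N(x*; x_i, sx^2) over x*
    with a simple Riemann sum on a grid centered at the noisy input x_i."""
    grid = np.linspace(x_i - 6 * sigma_x, x_i + 6 * sigma_x, n_grid)
    lik = np.exp(-(y_i - f(grid, theta)) ** 2 / (2 * sigma_eps ** 2))
    prior = np.exp(-(grid - x_i) ** 2 / (2 * sigma_x ** 2)) / (np.sqrt(2 * np.pi) * sigma_x)
    return np.sum(lik * prior) * (grid[1] - grid[0])
```

Such a quadrature is only practical in low dimension, which is precisely why the paper resorts to restoration-maximization algorithms instead.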

Now, our purpose is the estimation of the parameter θ by maximizing the a posteriori distribution (13):

θ̂ = arg max_θ p(θ | D)    (14)

In most cases, the integration in (13) and the maximization in (14) are not feasible and an explicit solution θ̂ is unreachable. However, given the true inputs x*, the problem turns into a classic supervised learning procedure. This suggests artificially completing the training data set D = {(x_i, y_i), i = 1, …, N} into {(x_i, y_i, x_i*), i = 1, …, N}, where the unknown inputs x* are treated as hidden variables, and consequently using restoration-maximization algorithms such as the EM algorithm [19], an iterative algorithm consisting of two steps:

• E-step: compute the functional
Q(θ; θ^(k−1)) = E[ log p(D, x* | θ) | D, θ^(k−1) ]
• M-step: update the parameter θ by maximizing the functional Q:
θ^(k) = arg max_θ Q(θ; θ^(k−1))

The input and output noises are white, leading to a pointwise computation of the expectations:

Q(θ; θ^(k−1)) = −E[ ∑_{i=1}^N (y_i − f(x_i*; θ))² / (2σ_ε²) | D, θ^(k−1) ] + log p(θ) + C    (15)

where C is a constant independent of θ. The presence of the nonlinear function f makes the E-step difficult, which leads to the use of the stochastic version of the EM algorithm, in which the first step is replaced by sampling from the a posteriori distribution of x*:

• S-step: generate x̃* according to its posterior distribution:
x̃*^(k) ∼ p(x* | D, θ^(k−1))
• M-step: perform the classic supervised learning knowing the inputs x̃* and the outputs y:
θ^(k) = arg min_θ [ ∑_{i=1}^N (y_i − f(x̃_i*; θ))² / (2σ_ε²) − log p(θ) ]
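The two alternating steps can be organized as a small driver loop; a minimal sketch, where sample_xstar and fit_theta are illustrative placeholders for the S- and M-steps:

```python
import numpy as np

def sem(x_obs, y_obs, theta0, sample_xstar, fit_theta, n_iter=100):
    """Stochastic EM driver: alternate an S-step (draw the hidden true
    inputs given the current parameters) and an M-step (supervised fit
    of the parameters on the completed data)."""
    theta = theta0
    trajectory = [theta0]
    for _ in range(n_iter):
        x_star = sample_xstar(x_obs, y_obs, theta)  # S-step
        theta = fit_theta(x_star, y_obs)            # M-step
        trajectory.append(theta)
    return theta, trajectory
```

With a degenerate S-step that returns the observed inputs unchanged, the loop reduces to ordinary repeated supervised learning, which provides a convenient sanity check.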

The stochastic EM algorithm can be generalized [16] by drawing M samples x̃*^(1), …, x̃*^(M) at each iteration k and then, in the M-step,

θ^(k) = arg min_θ [ ∑_{i=1}^N (1/M) ∑_{m=1}^M (y_i − f(x̃_i^{*(m)}; θ))² / (2σ_ε²) − log p(θ) ]    (16)

It appears clearly that when M → +∞, and assuming the ergodicity of the chain (x̃^{*(m)}) [13], this algorithm has the same properties as the exact EM algorithm.
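Once the M draws are stored, the averaged M-step criterion of (16) is direct to evaluate; a sketch assuming a flat prior (function and variable names are illustrative):

```python
import numpy as np

def sem_criterion(theta, y_obs, xstar_draws, f, sigma_eps2):
    """Criterion of Eq. (16) for a flat prior: average the squared residuals
    over the M stored draws of the true inputs, then sum over the data.

    xstar_draws has shape (M, N): M posterior draws of the N true inputs.
    """
    resid2 = (y_obs[None, :] - f(xstar_draws, theta)) ** 2   # shape (M, N)
    return resid2.mean(axis=0).sum() / (2.0 * sigma_eps2)
```

The M-step then minimizes this quantity over θ with any standard optimizer or learning rule.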

Sampling schemes for true inputs

The direct sampling of x* is not an easy task because of the nonlinearity f. However, indirect but exact sampling methods exist, such as the accept-reject procedure and the Markov chain Monte Carlo (MCMC) methods [18]. We briefly recall these two methods below, show their drawbacks when applied to our problem, and propose a modified version which is efficient and fast, and thus suited to our general algorithm.

Accept-Reject method

The first step consists in sampling x_i* from its a posteriori distribution p(x_i* | x_i, y_i, θ), which is proportional to

π(x_i*) = exp( −(y_i − f(x_i*; θ))² / (2σ_ε²) ) exp( −(x_i − x_i*)² / (2σ_x²) ).

Choosing the instrumental distribution q(x_i*) = N(x_i, σ_x²), we can easily uniformly bound the ratio π(x_i*)/q(x_i*) by M = √(2π) σ_x, so the sampling procedure is:

1. Sample z ∼ q(z).
2. Sample u ∼ U[0, 1].
3. If u ≤ π(z)/(M q(z)), accept x_i* = z; else reject z and return to 1.

The drawback of using this method in our case is that the acceptation probability (1/M) ∫ π(x*) dx* depends on the index i. Consequently, besides the randomness of the number of rejections, its law varies across the training set: the algorithm may remain stuck in the S-step because of a single sample at one index i even if the other samples z_j, j ≠ i, are all accepted.
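With this instrumental distribution, the bound M = √(2π)σ_x makes the acceptance test reduce to the output-likelihood factor alone; a sketch of a single accept-reject draw (illustrative names, with an explicit cap on the number of tries to expose the drawback just noted):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_reject_draw(x_i, y_i, f, theta, sigma_eps, sigma_x, max_tries=100000):
    """Accept-reject draw of one true input x_i* from its posterior.

    Proposal: z ~ N(x_i, sigma_x^2). With M = sqrt(2*pi)*sigma_x the test
    u <= pi(z) / (M q(z)) simplifies to u <= exp(-(y_i - f(z))^2 / (2 se^2)).
    """
    for _ in range(max_tries):
        z = rng.normal(x_i, sigma_x)
        if rng.uniform() <= np.exp(-(y_i - f(z, theta)) ** 2 / (2 * sigma_eps ** 2)):
            return z
    raise RuntimeError("no acceptance: the per-point acceptance rate can be very low")
```

When the residual y_i − f(z) is large relative to σ_ε for every plausible z, the acceptance probability collapses and the loop stalls, which is exactly the failure mode described above.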

Metropolis-Hastings method

It consists in sampling from an instrumental distribution q (whose choice is optimized to mimic the target distribution π) and then accepting the sample or keeping the previous one according to an acceptation probability α [14]. The algorithm is then, at iteration k:

• S-step:
a) Sample z ∼ q(z).
b) Accept x*^(n+1) = z with probability
α = min{ 1, [π(z) q(x*^(n))] / [π(x*^(n)) q(z)] },
else x*^(n+1) = x*^(n).
c) Return to a) until convergence of the Markov chain.
d) When the Markov chain converges, put x̃* = x*^(n).
• M-step:
θ^(k) = arg min_θ [ ∑_{i=1}^N (y_i − f(x̃_i*; θ))² / (2σ_ε²) − log p(θ) ]

In the first step, the Markov chain (x*^(n)) has π as its stationary distribution. The convergence of MCMC methods is a known problem studied in the literature [15]; an efficient convergence diagnostic depends on the problem at hand, and there is no general procedure to decide whether the Markov chain has converged. Moreover, even if convergence is attained, we then have many samples (x*^(n)), and it would be more efficient in this case to apply the Monte Carlo EM algorithm, making the SEM algorithm useless.
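A run-to-convergence Metropolis-Hastings draw for one training pair can be sketched as follows; the fixed chain length stands in for a convergence diagnostic, which, as noted above, is problem dependent (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_chain_draw(x_i, y_i, f, theta, sigma_eps, sigma_x, n_steps=500):
    """Metropolis-Hastings chain targeting p(x_i* | x_i, y_i, theta).

    The proposal q = N(x_i, sigma_x^2) is the prior factor of the target,
    so the Hastings ratio reduces to a ratio of output likelihoods.
    """
    def log_lik(z):
        return -(y_i - f(z, theta)) ** 2 / (2 * sigma_eps ** 2)
    state = x_i                      # start the chain at the noisy input
    for _ in range(n_steps):
        z = rng.normal(x_i, sigma_x)
        if np.log(rng.uniform()) <= log_lik(z) - log_lik(state):
            state = z
    return state
```

Every proposal requires a fresh evaluation of f, so running the chain to stationarity inside each SEM iteration is costly, which motivates the modification below.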

Modified Stochastic EM algorithm

In order to avoid the convergence problem at each step of the SEM algorithm, we implement only one step of the Metropolis-Hastings procedure. In classical SEM algorithms, the first step consists in computing a sample from the a posteriori distribution of the hidden variables x*. Such a sample is obtained in the asymptotic regime of the Markov chain formed by the Metropolis-Hastings procedure described above, so the procedure must be repeated long enough to reach convergence. We propose instead to perform only one iteration of the Metropolis-Hastings algorithm at each SEM iteration, based on the hidden variables x̃*^(k−1) sampled in the previous iteration. The algorithm is then, at iteration k:

• S-step:
a) Sample z ∼ q(z).
b) Accept x̃*^(k) = z with probability
α = min{ 1, [π(z) q(x̃*^(k−1))] / [π(x̃*^(k−1)) q(z)] },
else put x̃*^(k) = x̃*^(k−1).
• M-step:
θ^(k) = arg min_θ [ ∑_{i=1}^N (y_i − f(x̃_i*^(k); θ))² / (2σ_ε²) − log p(θ) ]
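Since the proposal is drawn from the prior factor of the posterior, the Hastings ratio reduces to a likelihood ratio, and the single move can be applied to all N training pairs at once; a vectorized sketch (illustrative names):

```python
import numpy as np

def one_step_mh(xstar_prev, x_obs, y_obs, f, theta, sigma_eps, sigma_x, rng):
    """One Metropolis-Hastings move per training pair (vectorized).

    Proposing z ~ N(x_obs, sigma_x^2) cancels the prior factor of the
    target, so acceptance depends only on the change in squared residual.
    """
    z = rng.normal(x_obs, sigma_x)                          # one proposal per pair
    d_new = (y_obs - f(z, theta)) ** 2
    d_old = (y_obs - f(xstar_prev, theta)) ** 2
    accept = rng.uniform(size=x_obs.shape) <= np.exp(-(d_new - d_old) / (2 * sigma_eps ** 2))
    return np.where(accept, z, xstar_prev)
```

Each SEM iteration then costs only two batch evaluations of f, instead of a full MCMC run per training pair.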

Input noise variance estimation

Given the sampled true inputs x̃*, the input noise variance can be estimated by maximizing its a posteriori distribution p(σ_x² | x_{1..N}, y_{1..N}, x̃*_{1..N}), which has the following expression:

p(σ_x² | x_{1..N}, y_{1..N}, x̃*_{1..N}) ∝ p(x_{1..N} | x̃*_{1..N}, σ_x²) p(σ_x²) ∝ σ_x^{−N} exp( −∑_{i=1}^N (x_i − x̃_i*)² / (2σ_x²) ) p(σ_x²)

Choosing a Jeffreys prior for σ_x leads to a degeneracy of the above function, as the sampled inputs x̃* will tend to x and the variance will then go to zero (see [17] for a detailed study of the degeneracy occurrence). Therefore, an inverse Gamma prior is chosen:

p(σ_x²) = IG(α, β)

leading to an inverse Gamma a posteriori distribution IG(α̃, β̃) with α̃ = α + N/2 and β̃ = β + (1/2) ∑_{i=1}^N (x_i − x̃_i*)². The maximum is then attained at σ̂_x² = β̃ / (α̃ + 1).
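The resulting MAP update of the input noise variance is closed form; a sketch, where alpha and beta denote the prior hyperparameters:

```python
import numpy as np

def map_sigma_x2(x_obs, x_star, alpha, beta):
    """Mode of the inverse-Gamma posterior over the input noise variance:
    alpha_post = alpha + N/2, beta_post = beta + 0.5 * sum((x - x*)^2),
    and the mode of IG(a, b) is b / (a + 1)."""
    n = x_obs.size
    alpha_post = alpha + n / 2.0
    beta_post = beta + 0.5 * np.sum((x_obs - x_star) ** 2)
    return beta_post / (alpha_post + 1.0)
```

This update can be interleaved with the S- and M-steps so that the algorithm also runs when the input noise variance is unknown, as in the simulation below.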

Choice of the instrumental distribution

The choice of the instrumental distribution q is crucial to obtain an efficient, quick and easy-to-implement algorithm. In the errors-in-variables special case, a simple and efficient choice is the Gaussian with mean x and covariance σ_x² I:

q(z) = N(x, σ_x² I)    (17)

Thus the acceptation probability α is simply:

α = min{ 1, exp( −[ (y − f(z; θ))² − (y − f(x̃*^(k−1); θ))² ] / (2σ_ε²) ) }    (18)

Note that we do not need knowledge of the function f itself but only its values f(z; θ) and f(x̃*^(k−1); θ) at z and x̃*^(k−1). Therefore, the algorithm can be implemented with any learning architecture, provided we can obtain the values f(z; θ) and f(x̃*^(k−1); θ).

SIMULATION EXAMPLE

Example 1: Parametric learning

To illustrate the performance of the proposed algorithm, we consider the following parametric function:

f(x; θ) = θ_1 exp( −x / θ_2 )    (19)

where we take the original values θ_1 = 2 and θ_2 = 3. We add white Gaussian noise to both the inputs x_i* and the outputs f(x_i*):

y_i = f(x_i*; θ) + ε_i,  x_i = x_i* + δ_i,  i = 1, …, N    (20)

where the standard deviation of the output noise is σ_ε = 0.1 and that of the input noise is σ_x = 0.05. Learning the parameters (θ_1, θ_2) from the noisy inputs (that is, supposing y_i = f(x_i; θ) + ε_i) fails to recover the original values of the parameters despite the low intensity of the input noise. This shows the sensitivity of the classic learning rule to input noise. We then run the SEM algorithm on these data and obtain good results. Figure 1-a shows the Markov chain of the first parameter θ_1; note the fluctuations around the original value. Figure 1-b shows the convergence of the empirical expectation of θ_1. Figures 2-a and 2-b show the same results for the parameter θ_2. In Figures 3-a and 3-b, we plot the evolution of the estimated input variance σ_x² and the corresponding empirical expectation of the Markov chain. We note the convergence around the true value and the success of the algorithm when the input noise variance is

unknown.

Figure 1-a: Evolution of the parameter θ_1 (Markov chain over 6000 iterations, fluctuating around the true value). Figure 1-b: Empirical expectation of θ_1, converging to the true value.

Figure 2-a: Evolution of the parameter θ_2. Figure 2-b: Empirical expectation of θ_2, converging to the true value.

Figure 3-a: Evolution of the input noise variance σ_x². Figure 3-b: Empirical expectation of the input noise variance, converging around the true value.
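The data of the simulation example can be generated as follows; the sample size and the input grid are illustrative choices, while the parameter values and noise levels follow the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Errors-in-variables data for the parametric example
# f(x; theta) = theta1 * exp(-x / theta2).
theta_true = np.array([2.0, 3.0])
sigma_eps, sigma_x = 0.1, 0.05

x_star = np.linspace(0.0, 5.0, 100)                    # true, unobserved inputs
y_clean = theta_true[0] * np.exp(-x_star / theta_true[1])
y = y_clean + rng.normal(0.0, sigma_eps, x_star.size)  # noisy outputs
x = x_star + rng.normal(0.0, sigma_x, x_star.size)     # noisy observed inputs
```

Only the pairs (x, y) are handed to the learning algorithm; x_star is kept solely to evaluate how well the S-step restores the true inputs.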

In classical regression modeling, we try to estimate the mapping f