Multi-components Data, Signal and Image Processing for Biological and Medical Applications

Ali Mohammad-Djafari
Laboratoire des Signaux et Systèmes, UMR 8506 CNRS - CS - Univ Paris Sud, CentraleSupélec, Gif-sur-Yvette
[email protected]
http://djafari.free.fr

January 6, 2017

Summary 3: Data redundancy, Dimensionality Reduction, ...

◮ Redundancy and structure
◮ Dimensionality Reduction
◮ PCA and ICA
◮ PPCA and its extensions
◮ Stationarity / non-stationarity
◮ Discriminant Analysis (DA)
◮ Classification and Clustering
◮ Mixture Models
◮ Factor Analysis
◮ Blind Source Separation

Dimension reduction, PCA, Factor Analysis, ICA

◮ M variables g_1, ..., g_M are observed. They are redundant. Can we express them with N ≤ M factors f_1, ..., f_N? How many factors (Principal Components, Independent Components) can describe the observed data?
◮ Each variable is observed T times. To index them we use t, but this does not necessarily mean time. So we have {g_1(t), ..., g_M(t), t = 1, ..., T}.
◮ We may represent these data either as a vector or as a matrix:
$$ g(t) = \begin{bmatrix} g_1(t) \\ g_2(t) \\ \vdots \\ g_M(t) \end{bmatrix}, \; t = 1, \ldots, T \quad \text{or} \quad G = \begin{bmatrix} g_1(1) & g_1(2) & \cdots & g_1(T) \\ g_2(1) & g_2(2) & \cdots & g_2(T) \\ \vdots & & & \vdots \\ g_M(1) & g_M(2) & \cdots & g_M(T) \end{bmatrix} $$
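As a small illustration of this layout (a sketch with made-up sizes, not part of the original slides), the T observation vectors can be stacked into the M × T matrix G and centred, as assumed for the second-order analysis that follows:

```python
import numpy as np

M, T = 14, 8                      # e.g. 14 gene expressions observed at 8 time points (arbitrary sizes)
rng = np.random.default_rng(0)
observations = [rng.normal(size=M) for _ in range(T)]   # one vector g(t) per index t

G = np.column_stack(observations)            # M x T data matrix with columns g(1), ..., g(T)
G = G - G.mean(axis=1, keepdims=True)        # centre each variable (each row)
print(G.shape)                               # (14, 8)
```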

Dimension reduction, PCA, Factor Analysis, ICA

◮ If we define the N < M factors as
$$ f(t) = \begin{bmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_N(t) \end{bmatrix}, \; t = 1, \ldots, T \quad \text{or} \quad F = \begin{bmatrix} f_1(1) & f_1(2) & \cdots & f_1(T) \\ f_2(1) & f_2(2) & \cdots & f_2(T) \\ \vdots & & & \vdots \\ f_N(1) & f_N(2) & \cdots & f_N(T) \end{bmatrix} $$
where we assume that each factor is a linear combination of the data,
$$ f_j(t) = \sum_{i=1}^{M} b_{ji}\, g_i(t) \quad \text{or inversely} \quad g_i(t) = \sum_{j=1}^{N} a_{ij}\, f_j(t), $$
◮ and if we define the matrices B = {b_ji} and A = {a_ij}, we can write
$$ f(t) = B\, g(t) \quad \text{or} \quad g(t) = A\, f(t) $$
◮ B is called the loading matrix and A the mixing matrix. The ideal case is then B = A^{-1}, but this is not interesting, because we want N < M. We may accept some error:
$$ g(t) = A\, f(t) + \epsilon(t) $$

Dimension reduction, PCA, Factor Analysis

◮ M variables g(t) are observed. They are redundant. Can we express them with N ≤ M factors f? How many factors (Principal Components, Independent Components) can describe the observed data?
$$ f(t) = B\, g(t) \quad \text{or} \quad g(t) = A\, f(t), \qquad \text{or still} \qquad F = BG \quad \text{or} \quad G = AF $$
◮ We assume all the variables to be centred.
◮ PCA uses the second-order statistics:
$$ \mathrm{cov}[g] = A\, \mathrm{cov}[f]\, A^t $$
◮ We want the Principal Components (PC) to be uncorrelated, i.e. cov[f] to be a diagonal matrix:
$$ \mathrm{cov}[f] = \mathrm{diag}[v_1, \cdots, v_N] $$
◮ One solution: estimate $\mathrm{cov}[g] = \sum_{t=1}^{T} g(t)\, g'(t)$ and use it.

Dimension reduction, PCA Algorithms

◮ Forward model: g(t) = A f(t) or G = A F
◮ Classical PCA algorithm:
  ◮ Estimate $\mathrm{cov}[g] = \sum_{t=1}^{T} g(t)\, g'(t)$.
  ◮ Hoping that it is positive definite, compute its SVD: $\mathrm{cov}[g] = A V A^t$.
  ◮ Identify $\widehat{A} = A V^{1/2}$ and compute $\widehat{f} = V^{-1/2} A^t g$.
  ◮ The number of factors is the number of non-zero singular values.
◮ The data can be retrieved exactly by
$$ \widehat{g} = \widehat{A}\, \widehat{f} = A V^{1/2} V^{-1/2} A^t g = A A^t g = g $$
◮ If we keep only K singular values, we make an error which can be used as a criterion to determine the number of factors:
$$ E = \frac{\|\widehat{g} - g\|^2}{\|g\|^2} $$
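A rough NumPy sketch of this algorithm (my own code, not the author's; it assumes centred data in an M × T array G and K no larger than the rank of the covariance). The covariance is normalised by T here, which the slide omits; this only rescales A and f and leaves the reconstruction unchanged:

```python
import numpy as np

def pca_factors(G, K):
    """Classical PCA on centred data G (M x T): keep K principal components."""
    M, T = G.shape
    cov_g = (G @ G.T) / T                       # sample covariance of the observations
    U, s, _ = np.linalg.svd(cov_g)              # for a symmetric PSD matrix this is the eigendecomposition
    A_hat = U[:, :K] * np.sqrt(s[:K])           # loading matrix  A V^{1/2}
    F_hat = (U[:, :K] / np.sqrt(s[:K])).T @ G   # factors  V^{-1/2} A^t g(t)
    G_hat = A_hat @ F_hat                       # reconstruction with K components
    E = np.linalg.norm(G_hat - G) ** 2 / np.linalg.norm(G) ** 2
    return A_hat, F_hat, E
```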

Dimension reduction: PPCA

◮ Forward model: g(t) = A f(t) + ǫ(t) or G = A F + E
◮ Probabilistic PCA:
$$ \mathrm{cov}[g] = A\, \mathrm{cov}[f]\, A^t + \mathrm{cov}[\epsilon] $$
◮ Estimate $\mathrm{cov}[g] = \sum_{t=1}^{T} g(t)\, g'(t)$.
◮ Hoping that it is positive definite, compute its SVD: $\mathrm{cov}[g] = A V A^t$.
◮ Keep the singular values greater than $v_\epsilon$, identify $\widehat{A} = A V^{1/2}$ and compute $\widehat{f} = V^{-1/2} A^t g$.
◮ The number of factors is the number of singular values greater than $v_\epsilon$. The main difficulty is to estimate $v_\epsilon$.

Application on a set of data

◮ Gene-expression time-series data in two organs:
  ◮ Colon: Clock: Rev, Per2, Bmal1; Metabolism: CE2, Top1, UGT, DBP; CC: Wee1, Ccna2, Ccnb2; Apoptosis: Bcl2, Mdm2, Bax, P53
  ◮ Liver: Clock: Rev, Per2, Bmal1; Metabolism: CE2, Top1, UGT, DBP; CC: Wee1, P21; Apoptosis: Bcl2, Mdm2, Bax, P53
  ◮ Physiology: Temperature, Activity, Cortico, Melato
◮ The data are obtained via the COSINOR model:
$$ g(t) = M + \sum_{k=1}^{3} A_k \cos(k\,\omega_0 t + \phi_k), \qquad \omega_0 = \frac{2\pi}{24}, \quad t = 0, 3, 6, \ldots, 21 $$
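A minimal sketch of evaluating this COSINOR model with NumPy (the coefficient values below are placeholders, not the ones estimated in the study):

```python
import numpy as np

def cosinor(t, mesor, amplitudes, phases, period=24.0):
    """g(t) = M + sum_{k=1..3} A_k cos(k*w0*t + phi_k), with w0 = 2*pi/period."""
    w0 = 2.0 * np.pi / period
    t = np.asarray(t, dtype=float)
    return mesor + sum(A * np.cos((k + 1) * w0 * t + phi)
                       for k, (A, phi) in enumerate(zip(amplitudes, phases)))

t = np.arange(0, 24, 3)   # sampling times 0, 3, ..., 21 h
g = cosinor(t, mesor=1.0, amplitudes=[0.8, 0.3, 0.1], phases=[0.0, 1.2, -0.5])
```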

Genes Clock Colon / Liver: Time Series and Fourier Transform

[Figure: time series (left) and Fourier-transform amplitudes (right) of the clock genes Rev, Per2 and Bmal1, in the colon and in the liver.]

Genes Colon Time series: Clock

[Figure: pairwise scatter plots of the colon clock-gene time series (Rev, Per2, Bmal1), with points labelled 1, 2, 3.]

Genes Liver Time series: Clock

[Figure: pairwise scatter plots of the liver clock-gene time series (Rev, Per2, Bmal1), with points labelled 1, 2, 3.]

Factor Analysis: 2 factors: Colon

[Figure: loadings of the colon genes on Component 1 and Component 2, computed from the time series (left) and from the FT amplitudes (right).]

How to determine the number of factors

◮ When N is given:
$$ p(A, f|g) \propto p(g|A, f)\, p(A)\, p(f) $$
◮ Different choices for p(A) and p(f), and different methods to estimate both A and f: JMAP, EM, Variational Bayesian Approximation.
◮ When N is not known:
  ◮ Model selection, Bayesian or Maximum Likelihood methods.
  ◮ To determine the number of factors, we do the analysis with different numbers of factors N and use two criteria: the negative log-likelihood $-\ln p(g|A, N)$ of the observations and the DFE (degrees of freedom error) $\big((N-M)^2 - (N+M)\big)/2$, related to the AIC and BIC model selection criteria.

Factor Analysis: Time series, colon

[Figure: gene loadings for factor analyses of the colon time series with increasing numbers of factors, together with the -log-likelihood and DFE curves used to choose the number of factors.]

Dimension reduction: ML PPCA

◮ Forward model: g(t) = A f(t) + ǫ(t) or G = A F + E
◮ ML PPCA:
$$ \Sigma_g = \mathrm{cov}[g] = A\, \mathrm{cov}[f]\, A^t + v_\epsilon I = A\, \mathrm{diag}[v_f]\, A^t + v_\epsilon I $$
◮ Likelihood:
$$ p(g|A, v_f, v_\epsilon) = \mathcal{N}(g|0, \Sigma_g) \propto \det(\Sigma_g)^{-1/2} \exp\Big[-\frac{1}{2 v_\epsilon}\sum_t \|g(t) - A f(t)\|^2\Big] $$
◮ Alternate minimization of the negative log-likelihood
$$ L(A, v_f, v_\epsilon) = \frac{1}{2}\ln\det\big(A\, \mathrm{diag}[v_f]\, A^t + v_\epsilon I\big) + \frac{1}{2 v_\epsilon}\sum_t \|g(t) - A f(t)\|^2 $$
with respect to its arguments can give the desired solution.
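As a small numerical check of this criterion (the helper and its names are mine, not from the slides), L can be evaluated for a candidate (A, v_f, v_ǫ) and a set of factors:

```python
import numpy as np

def ppca_neg_log_likelihood(G, A, F, v_f, v_eps):
    """L(A, v_f, v_eps) = 0.5*log det(A diag(v_f) A' + v_eps I)
       + 1/(2 v_eps) * sum_t ||g(t) - A f(t)||^2, for data G (M x T) and factors F (N x T)."""
    M = G.shape[0]
    Sigma_g = A @ np.diag(v_f) @ A.T + v_eps * np.eye(M)
    _, logdet = np.linalg.slogdet(Sigma_g)
    residual = G - A @ F
    return 0.5 * logdet + np.sum(residual ** 2) / (2.0 * v_eps)
```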

Dimension reduction, PCA, Factor Analysis, ICA

◮ M variables g(t) are observed. They are redundant. Can we express them with N ≤ M factors f? How many factors (Principal Components, Independent Components) can describe the observed data?
$$ g_i(t) = \sum_{j=1}^{N} a_{ij}\, f_j(t) + \epsilon_i(t) \;\Longleftrightarrow\; g(t) = A\, f(t) + \epsilon(t), \qquad \begin{cases} A: (M \times N) \text{ loading matrix}, \; N \le M \\ f(t): \text{ factors, sources} \end{cases} $$
◮ How to find both A and the factors f(t)?
◮ Three cases:
  ◮ A known, find f
  ◮ f known, find A
  ◮ A and f both unknown

Case 1: A known, find f

$$ g(t) = A\, f(t) + \epsilon(t) $$
◮ First, assume the data iid, so g = A f + ǫ.
◮ Second, assume no noise, so g = A f.
◮ Ideal case, M = N and A invertible: $\widehat{f} = A^{-1} g$.
◮ M < N: A f = g has an infinite number of solutions. Minimum Norm (MN) solution:
$$ \widehat{f} = \arg\min_{\{f : A f = g\}} \big\{ \|f\|^2 \big\} $$
◮ Lagrangian technique: $f = A^t \lambda$ with $A A^t \lambda = g$. If A has full rank, rank(A) = min{M, N}:
$$ \widehat{f} = A^t [A A^t]^{-1} g $$
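A tiny NumPy illustration of the minimum-norm solution for an underdetermined system (M < N; the sizes are arbitrary and A is assumed to have full row rank):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 6                                      # fewer observations than unknowns
A = rng.normal(size=(M, N))
g = rng.normal(size=M)

f_mn = A.T @ np.linalg.solve(A @ A.T, g)         # minimum-norm solution  A'(A A')^{-1} g
print(np.allclose(A @ f_mn, g))                  # the constraint A f = g holds exactly
print(np.allclose(f_mn, np.linalg.pinv(A) @ g))  # same as the pseudo-inverse solution
```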

Case 1: A known, find f: LS

◮ M > N: A f = g may not have any solution. Least Squares (LS) solution:
$$ \widehat{f} = \arg\min_{f} \big\{ \|g - A f\|^2 \big\} $$
◮ The gradient of $J(f) = \|g - A f\|^2$ is $-2 A^t(g - A f)$; setting it to zero gives $A^t A f = A^t g$.
◮ If A has full rank, rank(A) = min{M, N}:
$$ \widehat{f} = [A^t A]^{-1} A^t g $$
◮ General case?
  ◮ Truncated Singular Value Decomposition (TSVD)
  ◮ Regularization
  ◮ Bayesian

Case 1: A known, find f: TSVD

◮ SVD: $A = U S V^t$, with $U U^t = I$, $V^t V = I$, $S$ diagonal.
◮ Left singular vectors: $A A^t u_k = \lambda_k u_k$; right singular vectors: $A^t A\, v_k = \lambda_k v_k$.
◮ Full rank:
$$ \widehat{f} = V S^{-1} U^t g = \sum_{k} v_k \frac{\langle g, u_k\rangle}{s_k} $$
◮ Rank deficient, rank(A) = K < min{M, N}:
$$ \widehat{f} = V S^{+} U^t g = \sum_{k=1}^{K} v_k \frac{\langle g, u_k\rangle}{s_k} $$
◮ Main difficulty: how to decide which singular values are near to zero.
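A short sketch of the truncated-SVD solution (my own helper; the truncation level K, or the threshold on the singular values, has to be supplied, which is exactly the difficulty noted above):

```python
import numpy as np

def tsvd_solve(A, g, K=None, rel_tol=1e-10):
    """Truncated SVD solution  f = sum_{k<=K} v_k <g, u_k> / s_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if K is None:
        K = int(np.sum(s > rel_tol * s[0]))    # keep singular values above a relative threshold
    coeffs = (U[:, :K].T @ g) / s[:K]          # <g, u_k> / s_k
    return Vt[:K].T @ coeffs
```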

Case 1: A known, find f: Regularization

◮ MN: minimize $\|f\|^2$ subject to $A f = g$.
◮ LS: minimize $\|g - A f\|^2$.
◮ MN + LS:
$$ \widehat{f} = \arg\min_{f} \{J(f)\} \quad \text{with} \quad J(f) = \|g - A f\|^2 + \lambda \|f\|^2 $$
◮ The gradient of $J(f)$ is $-2 A^t(g - A f) + 2\lambda f = 0 \;\longrightarrow\; (A^t A + \lambda I) f = A^t g$, so
$$ \widehat{f} = [A^t A + \lambda I]^{-1} A^t g $$
◮ λ → 0 gives back LS.
◮ When M < N, we may also obtain the equivalent form
$$ \widehat{f} = A^t [A A^t + \lambda I]^{-1} g. $$
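A quick NumPy check (with arbitrary sizes) that the two forms of the regularized solution coincide; the second one solves an M × M system instead of an N × N one and is cheaper when M < N:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, lam = 4, 10, 0.1
A = rng.normal(size=(M, N))
g = rng.normal(size=M)

f_primal = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ g)   # [A'A + lam I]^{-1} A' g
f_dual = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(M), g)     # A' [A A' + lam I]^{-1} g
print(np.allclose(f_primal, f_dual))                              # True
```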

Case 1: A known, find f: Bayesian inference

$$ g = A f + \epsilon $$
◮ Prior knowledge on ǫ:
$$ \epsilon \sim \mathcal{N}(\epsilon|0, v_\epsilon I) \;\longrightarrow\; p(g|f, A) = \mathcal{N}(g|A f, v_\epsilon I) \propto \exp\Big[-\frac{1}{2 v_\epsilon}\|g - A f\|^2\Big] $$
◮ Simple prior model for f:
$$ p(f|\alpha) \propto \exp\Big[-\frac{1}{2 v_f}\|f\|^2\Big] $$
◮ Expression of the posterior law:
$$ p(f|g, A) \propto p(g|f, A)\, p(f) \propto \exp\Big[-\frac{1}{2 v_\epsilon} J(f)\Big] \quad \text{with} \quad J(f) = \|g - A f\|^2 + \lambda\|f\|^2, \;\; \lambda = v_\epsilon / v_f $$
◮ Link between MAP estimation and regularization:
$$ \widehat{f} = \arg\max_{f} \{p(f|g, A)\} = \arg\min_{f} \{J(f)\} $$
◮ Solution: $\widehat{f} = (A' A + \lambda I)^{-1} A' g$

Bayesian inference for the sources f when A is known

◮ More general prior model: $p(f) \propto \exp\big[-\frac{\alpha}{2}\Omega(f)\big]$
◮ MAP:
$$ p(f|\theta, g) \;\longrightarrow\; J(f) = \|g - A f\|^2 + \lambda\,\Omega(f), \qquad \lambda = v_\epsilon\,\alpha $$
◮ Optimization of $J(f) = \|g - A f\|^2 + \lambda\,\Omega(f)$ gives $\widehat{f}$.
◮ Different priors = different expressions for Ω(f).
◮ The solution can be obtained using an appropriate optimization algorithm.

MAP estimation with sparsity enforcing priors

◮ Gaussian: $\Omega(f) = \|f\|^2 = \sum_j |f_j|^2$
$$ J(f) = \|g - A f\|^2 + \lambda\|f\|^2 \;\longrightarrow\; \widehat{f} = [A' A + \lambda I]^{-1} A' g $$
◮ Generalized Gaussian: $\Omega(f) = \gamma \sum_j |f_j|^\beta$
◮ Student-t model: $\Omega(f) = \frac{\nu+1}{2} \sum_j \log\big(1 + f_j^2/\nu\big)$
◮ Elastic Net model: $\Omega(f) = \sum_j \big[\gamma_1 |f_j| + \gamma_2 f_j^2\big]$

For an extended list of such sparsity enforcing priors see: A. Mohammad-Djafari, "Bayesian approach with prior models which enforce sparsity in signal and image processing," EURASIP Journal on Advances in Signal Processing, Special issue on Sparse Signal Processing, 2012.

Case 1: A known, find f: Regularization

◮ More general case:
$$ \widehat{f} = \arg\min_{f} \{J(f)\} \quad \text{with} \quad J(f) = \|g - A f\|^2 + \lambda\|D f\|^2 $$
◮ The gradient of $J(f)$ is $-2 A^t(g - A f) + 2\lambda D^t D f = 0 \;\longrightarrow\; (A^t A + \lambda D^t D) f = A^t g$, so
$$ \widehat{f} = [A^t A + \lambda D^t D]^{-1} A^t g $$
◮ L1 regularization:
$$ J(f) = \|g - A f\|_2^2 + \lambda\|f\|_1 \quad \text{with} \quad \|f\|_1 = \sum_j |f_j|. $$
◮ Still more general: $J(f) = \Delta_1(g, A f) + \lambda\,\Delta_2(f, f_0)$
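The L1 criterion has no closed-form minimizer. One common choice (not named on the slide, added here only as an illustration) is the iterative soft-thresholding algorithm (ISTA), sketched below:

```python
import numpy as np

def ista(A, g, lam, n_iter=500):
    """Minimize J(f) = ||g - A f||_2^2 + lam * ||f||_1 by iterative soft thresholding."""
    sigma2 = np.linalg.norm(A, 2) ** 2          # largest singular value of A, squared
    step = 1.0 / (2.0 * sigma2)                 # 1 / Lipschitz constant of the gradient of ||g - A f||^2
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ f - g)          # gradient of the quadratic term
        z = f - step * grad
        f = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)   # soft-thresholding
    return f
```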

Case 2: f known, find A

◮ g = A f + ǫ is linear in f and in A:
$$ g = A f \;\longrightarrow\; g_i = \sum_j a_{ij} f_j \;\longrightarrow\; g = \sum_j a_{*j} f_j $$
◮ For example, with M = N = 2:
$$ \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} = \begin{bmatrix} f_1 & 0 & f_2 & 0 \\ 0 & f_1 & 0 & f_2 \end{bmatrix} \begin{bmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{bmatrix} $$
$$ g = A f = F a \quad \text{with} \quad F = f \odot I, \;\; a = \mathrm{vec}(A) $$
◮ A known, estimation of f: g = A f + ǫ
◮ f known, estimation of A: g = F a + ǫ
◮ Joint estimation of f and A: g = A f + ǫ = F a + ǫ

Cases 1 and 2: Regularization

$$ g = A f \quad \text{or} \quad g = A f = F a $$
◮ A known, find f:
$$ \widehat{f} = \arg\min_{f} \big\{ \|g - A f\|^2 + \lambda\|f\|^2 \big\} = (A' A + \lambda I)^{-1} A' g $$
◮ f known, find A:
$$ \widehat{A} = \arg\min_{A} \big\{ \|g - A f\|^2 + \lambda\|A\|^2 \big\} = g f' (f f' + \lambda I)^{-1} $$
◮ Both A and f are unknown:
$$ (\widehat{f}, \widehat{A}) = \arg\min_{(f, A)} \big\{ \|g - A f\|^2 + \lambda_1\|f\|^2 + \lambda_2\|A\|^2 \big\} $$
◮ Alternate optimisation?
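A minimal sketch of this alternate optimisation over T observations stacked in G (M × T) and F (N × T); the loop and names are mine, following the two ridge updates above:

```python
import numpy as np

def alternate_factorization(G, N, lam_f=0.1, lam_a=0.1, n_iter=50, seed=0):
    """Alternate between  F = (A'A + lam_f I)^{-1} A' G  and  A = G F' (F F' + lam_a I)^{-1}."""
    M, _ = G.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(M, N))                 # initialization matters (see the slides)
    for _ in range(n_iter):
        F = np.linalg.solve(A.T @ A + lam_f * np.eye(N), A.T @ G)
        A = G @ F.T @ np.linalg.inv(F @ F.T + lam_a * np.eye(N))
    return A, F
```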

Estimation of A when the sources f are known

◮ Source separation is a bilinear model:
$$ g = A f = F a: \qquad \begin{bmatrix} g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} = \begin{bmatrix} f_1 & 0 & f_2 & 0 \\ 0 & f_1 & 0 & f_2 \end{bmatrix} \begin{bmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{bmatrix}, \qquad F = f \odot I, \;\; a = \mathrm{vec}(A) $$
◮ The problem is more ill-posed (underdetermined). We absolutely need to impose constraints on the elements or the structure of A, for example:
  ◮ Positivity of the elements
  ◮ Toeplitz or TBT structure
  ◮ Symmetry
  ◮ Sparsity
◮ The same Bayesian approach can then be applied.

Estimation of A when the sources f are known

$$ g = A f + \epsilon = F a + \epsilon $$
◮ Prior on the noise:
$$ p(g|f, A) = \mathcal{N}(g|A f, v_\epsilon I) \propto \exp\Big[-\frac{1}{2 v_\epsilon}\|g - A f\|^2\Big] \propto \exp\Big[-\frac{1}{2 v_\epsilon}\|g - F a\|^2\Big] $$
◮ Simple prior model for a:
$$ p(A|\alpha) \propto \exp\big[-\alpha\|a\|^2\big] \propto \exp\big[-\alpha\|A\|^2\big] $$
◮ Expression of the posterior law:
$$ p(A|g, f) \propto p(g|f, A)\, p(A) \propto \exp\big[-J(A)\big] \quad \text{with} \quad J(A) = \frac{1}{2 v_\epsilon}\|g - A f\|^2 + \alpha\|A\|^2 $$
◮ MAP estimation:
$$ \widehat{a} = (F' F + \lambda I)^{-1} F' g \;\leftrightarrow\; \widehat{A} = g f' (f f' + \lambda I)^{-1} $$
◮ With $g(t) = A f(t) + \epsilon(t)$, $t = 1, \ldots, T$:
$$ \widehat{A} = \Big(\sum_t g(t) f'(t)\Big) \Big(\sum_t f(t) f'(t) + \lambda I\Big)^{-1} $$

Case 3: both f and A unknown

◮ Both A and f are unknown:
$$ (\widehat{f}, \widehat{A}) = \arg\min_{(f, A)} \big\{ \|g - A f\|^2 + \lambda_1\|f\|^2 + \lambda_2\|A\|^2 \big\} $$
◮ Indeterminations:
  ◮ Permutation: A P, P' f
  ◮ Scale: k A, (1/k) f
◮ Alternate optimisation:
$$ \begin{cases} \widehat{f} = \arg\min_{f} \big\{\|g - A f\|^2 + \lambda_1\|f\|^2\big\} = (A' A + \lambda_1 I)^{-1} A' g \\ \widehat{A} = \arg\min_{A} \big\{\|g - A f\|^2 + \lambda_2\|A\|^2\big\} = g f' (f f' + \lambda_2 I)^{-1} \end{cases} $$
◮ Importance of initialization and other constraints such as positivity
  ◮ Non-negative Matrix decomposition

Both A and f unknown: Bayesian approach

Three main steps:
1. Assigning priors p(f) and p(A).
2. Obtaining the expression of the joint posterior:
$$ p(f, A|g) \propto p(g|f, A)\, p(f)\, p(A) $$
3. Doing the computations:
 • Joint optimization of p(f, A|g);
 • MCMC Gibbs sampling methods, which need generation of samples from the conditionals p(f|A, g) and p(A|f, g);
 • Marginalisation and the EM algorithm:
$$ p(A|g) = \int p(f, A|g)\, \mathrm{d}f \;\longrightarrow\; \widehat{A} \;\longrightarrow\; p(f|\widehat{A}, g) \;\longrightarrow\; \widehat{f} $$
 • Bayesian Variational Approximation (BVA) methods, which approximate p(f, A|g) by a separable one, q(f, A) = q1(f) q2(A), and then use it for the estimation of f and A.

Bayesian source separation: both A and f unknown

$$ p(f, A|g, \theta_1, \theta_2, \theta_3) = \frac{p(g|f, A, \theta_1)\, p(f|\theta_2)\, p(A|\theta_3)}{p(g|\theta_1, \theta_2, \theta_3)} $$
◮ Joint estimation (JMAP):
$$ (\widehat{f}, \widehat{A}) = \arg\max_{(f, A)} \big\{ p(f, A|g, \theta_1, \theta_2, \theta_3) \big\} $$
◮ JMAP with Gaussian priors:
$$ (\widehat{f}, \widehat{A}) = \arg\min_{(f, A)} \big\{ \|g - A f\|^2 + \lambda_1\|f\|^2 + \lambda_2\|A\|^2 \big\} $$
◮ Permutation and scale indeterminations: needs good choices for the priors.
◮ Alternate optimisation:
$$ \begin{cases} \widehat{f} = \arg\min_{f} \big\{\|g - A f\|^2 + \lambda_1\|f\|^2\big\} = (A' A + \lambda_1 I)^{-1} A' g \\ \widehat{A} = \arg\min_{A} \big\{\|g - A f\|^2 + \lambda_2\|A\|^2\big\} = g f' (f f' + \lambda_2 I)^{-1} \end{cases} $$
◮ Importance of initialization and constraints such as positivity.

Joint MAP Estimation of A and f with Gaussian priors

$$ g(t) = A\, f(t) + \epsilon(t), \qquad t = 1, \cdots, T, \quad \text{iid} $$
$$ p(f_{1..T}, A|g_{1..T}) \propto p(g_{1..T}|A, f_{1..T}, v_\epsilon)\, p(f_{1..T})\, p(A|A_0, V_0) \propto \prod_t p(g(t)|A, f(t), v_\epsilon)\, p(f(t)|z(t))\, p(A|A_0, V_0) $$
◮ Joint MAP: alternate optimization
$$ \begin{cases} \widehat{f}(t) = (\widehat{A}'\widehat{A} + \lambda_f I)^{-1}\widehat{A}' g(t), & \lambda_f = v_\epsilon/v_f \\ \widehat{A} = \big(\sum_t g(t)\widehat{f}'(t)\big)\big(\sum_t \widehat{f}(t)\widehat{f}'(t) + \lambda_a I\big)^{-1}, & \lambda_a = v_\epsilon/v_a \end{cases} $$
◮ Alternate optimization algorithm: starting from an initial A(0), iterate between the two updates above (f̂(t) given Â, then Â given the f̂(t)) until convergence.

Summary of Bayesian estimation with different levels

◮ Simple Bayesian model and estimation: with fixed hyperparameters θ = (θ1, θ2, θ3), the priors p(f|θ2) and p(A|θ3) are combined with the likelihood p(g|f, A, θ1) to form the posterior p(f, A|g, θ), from which f̂ and Â are inferred.

◮ Full Bayesian model and hyperparameter estimation: a hyper-prior model p(θ|α, β) is placed on θ = (θ1, θ2, θ3); the priors p(f|θ2), p(A|θ3) and the likelihood p(g|f, A, θ1) then lead to the joint posterior p(f, A, θ|g, α, β), from which f̂, Â and θ̂ are inferred.

Summary of Bayesian estimation with different levels

◮ Marginalization over f:
$$ p(f, A|g) \;\longrightarrow\; p(A|g) \;\longrightarrow\; \widehat{A} \;\longrightarrow\; p(f|\widehat{A}, g) \;\longrightarrow\; \widehat{f} $$
◮ Marginalization over A:
$$ p(f, A|g) \;\longrightarrow\; p(f|g) \;\longrightarrow\; \widehat{f} \;\longrightarrow\; p(A|\widehat{f}, g) \;\longrightarrow\; \widehat{A} $$
◮ Joint MAP (alternate optimization of the joint posterior):
$$ p(f, A|g) \;\longrightarrow\; \begin{cases} \widetilde{f} = \arg\max_{f} \big\{p(f, \widetilde{A}|g)\big\} \\ \widetilde{A} = \arg\max_{A} \big\{p(\widetilde{f}, A|g)\big\} \end{cases} \;\longrightarrow\; \widehat{f}, \widehat{A} $$
◮ Other solutions:
  ◮ MCMC for exploring the joint posterior law and computing mean values
  ◮ Variational Bayesian Approximation (VBA)

Variational Bayesian Approximation

◮ Main idea: approximate the joint pdf p(A, f|g), difficult to handle, by a simpler one, for example a separable one $q(A, f|g) = q_1(A)\, q_2(f)$ or even $q_1(A) \prod_j q_{2j}(f_j)$.
◮ Criterion: minimize
$$ \mathrm{KL}(q|p) = \int q \ln\frac{q}{p} = -H(q_1) - H(q_2) - \langle \ln p(A, f|g) \rangle_q $$
◮ Solution obtained by alternate optimization:
$$ \begin{cases} q_1(f) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_2(A)}\big] \\ q_2(A) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_1(f)}\big] \end{cases} $$
◮ Use q1(f) for inferring on f and q2(A) for inferring on A.

VBA and links with JMAP and EM

Three possibilities:
◮ $q_1(f) = \delta(f - \widetilde{f})$ and $q_2(A) = \delta(A - \widetilde{A})$ → JMAP:
$$ \begin{cases} q_1(f) \propto p(f, \widetilde{A}|g) \\ q_2(A) \propto p(\widetilde{f}, A|g) \end{cases} \;\longrightarrow\; \begin{cases} \widetilde{f} = \arg\max_{f} \big\{p(f, \widetilde{A}|g)\big\} \\ \widetilde{A} = \arg\max_{A} \big\{p(\widetilde{f}, A|g)\big\} \end{cases} $$
◮ $q_1(f)$ free and $q_2(A) = \delta(A - \widetilde{A})$ → EM:
$$ \begin{cases} q_1(f) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_2(A)}\big] \propto p(f|\widetilde{A}, g) \\ q_2(A) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_1(f)}\big] \propto \exp\big[Q(A, \widetilde{A})\big] \end{cases} \;\longrightarrow\; \begin{cases} Q(A, \widetilde{A}) = \langle \ln p(f, A|g) \rangle_{p(f|\widetilde{A}, g)} \\ \widetilde{A} = \arg\max_{A} \big\{Q(A, \widetilde{A})\big\} \end{cases} \;\longrightarrow\; \text{EM} $$
◮ $q_1(f)$ and $q_2(A)$ free forms: affordable if exponential families and conjugate priors.

JMAP, EM and VBA

◮ JMAP, alternate optimization algorithm: starting from A(0), iterate
$$ \widetilde{f} = \arg\max_{f} \big\{p(f, \widetilde{A}|g)\big\} \quad\rightleftarrows\quad \widetilde{A} = \arg\max_{A} \big\{p(\widetilde{f}, A|g)\big\} $$
◮ EM: starting from A(0), iterate
$$ q_1(f) = p(f|\widetilde{A}, g) \quad\rightleftarrows\quad Q(A, \widetilde{A}) = \langle \ln p(f, A|g) \rangle_{q_1(f)}, \;\; \widetilde{A} = \arg\max_{A} \big\{Q(A, \widetilde{A})\big\} $$
◮ VBA: starting from A(0) (i.e. an initial q2(A)), iterate
$$ q_1(f) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_2(A)}\big] \quad\rightleftarrows\quad q_2(A) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_1(f)}\big] $$
At convergence, f̂ is obtained from q1(f) and Â from q2(A).

Summary of Bayesian estimation with different levels

◮ Marginalization over f: p(f, A|g) → p(A|g) → Â → p(f|Â, g) → f̂.
◮ Marginalization over A: p(f, A|g) → p(f|g) → f̂ → p(A|f̂, g) → Â.
◮ Joint MAP: alternate optimization of p(f, A|g) over f and A (as above) → f̂, Â.
◮ VBA: approximate p(f, A|g) by q1(f) q2(A) with
$$ q_1(f) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_2(A)}\big], \qquad q_2(A) \propto \exp\big[\langle \ln p(f, A|g) \rangle_{q_1(f)}\big], $$
then infer f̂ from q1(f) and Â from q2(A).

General case and link with other methods

[Graphical model: θ2 generates f(t), θ3 generates A, θ1 generates ǫ(t); f(t), A and ǫ(t) generate g(t), giving the posterior p(f(t), A|g(t), θ).]

$$ p(f_{1..T}, A|g_{1..T}) \propto p(g_{1..T}|A, f_{1..T}, \theta_1)\, p(f_{1..T}|\theta_2)\, p(A|\theta_3) $$
Different scenarios:
◮ IID (strict stationarity): g(t) = A f(t) + ǫ(t) → g = A f + ǫ
◮ Second-order stationarity (Gaussian): all the probability laws are Gaussian
◮ Non-Gaussian but IID: ICA
◮ Non-Gaussian, time dependent, ...

Gaussian white case: PCA, MNF, PMF and NMF

◮ White and Gaussian signals f(t), ǫ(t) → g(t):
$$ g(t) = A\, f(t) + \epsilon(t) \;\longrightarrow\; g = A f + \epsilon $$
◮ Likelihood:
$$ p(g|A, f) = \mathcal{N}(g|A f, \Sigma_\epsilon), \quad p(f) = \mathcal{N}(f|0, \Sigma_f) \;\longrightarrow\; p(g|A) = \mathcal{N}(g|0, A \Sigma_f A' + \Sigma_\epsilon) $$
◮ PCA: estimate $\Sigma_g$ by $\frac{1}{T}\sum_t g(t) g'(t)$, take the SVD and keep all the non-zero singular values: $\Sigma_g = A \Sigma_f A'$.
◮ Minimum Norm Factorization (MNF): estimate $\Sigma_g$, take the SVD and keep all singular values $\ge \sigma_\epsilon$: $\Sigma_g = A \Sigma_f A' + \Sigma_\epsilon$.
◮ Positive Matrix Factorization (PMF): decompose $\Sigma_g$ into positive definite matrices [Paatero & Tapper, 94].
◮ Non-negative Matrix Factorization (NMF): decompose $\Sigma_g$ into non-negative definite matrices [Lee & Seung, 99].

Non-Gaussian white case: ICA

◮ White non-Gaussian signals and exact model (no noise):
$$ f(t) \longrightarrow g(t) \longrightarrow y(t) = A^{-1} g(t) \longrightarrow y(t) = B\, g(t) $$
◮ ICA: find B in such a way that the components of y are the most independent.
◮ Different measures of independence:
  ◮ Entropy: $H(y_i) = -\int p(y_i) \ln p(y_i)\, \mathrm{d}y_i$
  ◮ Relative entropy or Kullback-Leibler divergence:
$$ \mathrm{KL}\Big(p(y) : \prod_i p(y_i)\Big) = \int p(y) \ln \frac{p(y)}{\prod_i p(y_i)}\, \mathrm{d}y $$
◮ Different choices and approximations for p(y_i) → contrast functions, cumulant-based criteria.

Non-Gaussian white case: Maximum Likelihood

◮ White non-Gaussian signals (accounting for noise):
$$ g(t) = A\, f(t) + \epsilon(t) \;\longrightarrow\; g = A f + \epsilon $$
◮ Likelihood:
$$ p(g|A, \Sigma_\epsilon) = \int p(g|A, f, \Sigma_\epsilon)\, p(f)\, \mathrm{d}f $$
◮ ICA (Maximum Likelihood):
$$ \widehat{\theta} = (\widehat{A}, \widehat{\Sigma}_\epsilon) = \arg\max_{\theta} \{p(g|\theta)\} $$
◮ EM iterative algorithm:
$$ \begin{cases} Q(\theta, \theta') = \mathrm{E}\big[\ln p(g, f, \theta) \,\big|\, \theta'\big] \\ \theta' = \arg\max_{\theta} Q(\theta, \theta') \end{cases} $$

General Gaussian case: Joint Estimation of A and f

[Graphical model: hyperparameters v0, V0, vǫ generate f(t), A and ǫ(t), which generate g(t).]

$$ p(f_j(t)|v_{0j}) = \mathcal{N}(f_j(t)|0, v_{0j}) \;\Rightarrow\; p(f(t)|v_0) \propto \exp\Big[-\tfrac{1}{2}\sum_j f_j^2(t)/v_{0j}\Big] $$
$$ p(A_{ij}|0, V_{0ij}) = \mathcal{N}(A_{ij}|0, V_{0ij}) \;\Rightarrow\; p(A|0, V_0) = \mathcal{N}(A|0, V_0), \qquad p(g(t)|A, f(t), v_\epsilon) = \mathcal{N}(A f(t), v_\epsilon I) $$
$$ p(f_{1..T}, A|g_{1..T}) \propto p(g_{1..T}|A, f_{1..T}, v_\epsilon)\, p(f_{1..T}|v_0)\, p(A|0, V_0) \propto \prod_t p(g(t)|A, f(t), v_\epsilon)\, p(f(t)|v_0)\, p(A|0, V_0) $$
$$ p(f(t)|g_{1..T}, A, v_\epsilon, v_0, V_0) = \mathcal{N}(f(t)|\widehat{f}(t), \widehat{\Sigma}), \qquad p(A|g_{1..T}, f_{1..T}, v_\epsilon, v_0, V_0) = \mathcal{N}(A|\widehat{A}, \widehat{V}) $$
Two approaches:
◮ Alternate joint MAP (JMAP) estimation
◮ Bayesian Variational Approximation

Joint Estimation of A and f: Alternate JMAP

◮ Some simplifications: $v_0 = [v_f, \ldots, v_f]'$ (all sources a priori with the same variance $v_f$), $v_\epsilon = [v_\epsilon, \ldots, v_\epsilon]'$ (all noise terms a priori with the same variance $v_\epsilon$), $A_0 = 0$, $V_0 = v_a I$.
$$ p(f(t)|g(t), A, v_\epsilon, v_0) = \mathcal{N}(f(t)|\widehat{f}(t), \widehat{\Sigma}), \quad \begin{cases} \widehat{f}(t) = (A' A + \lambda_f I)^{-1} A' g(t), & \lambda_f = v_\epsilon/v_f \\ \widehat{\Sigma} = (A' A + \lambda_f I)^{-1} \end{cases} $$
$$ p(A|g(t), \widehat{f}(t), v_\epsilon, A_0, V_0) = \mathcal{N}(A|\widehat{A}, \widehat{V}), \quad \begin{cases} \widehat{A} = \big(\sum_t g(t) f'(t)\big)\big(\sum_t f(t) f'(t) + \lambda_a I\big)^{-1}, & \lambda_a = v_\epsilon/v_a \\ \widehat{V} = (F' F + \lambda_f I)^{-1} \end{cases} $$
◮ JMAP: starting from an initial A(0), alternate between $\widehat{f}(t) = (\widehat{A}'\widehat{A} + \lambda_f I)^{-1}\widehat{A}' g(t)$ and $\widehat{A} = \big(\sum_t g(t)\widehat{f}'(t)\big)\big(\sum_t \widehat{f}(t)\widehat{f}'(t) + \lambda_a I\big)^{-1}$ until convergence.

Joint Estimation: Variational Bayesian Approximation

$$ p(f(t), A|g(t)) \;\longrightarrow\; q_1(f(t))\, q_2(A) $$
$$ q_1(f(t)|g(t), \widetilde{A}, v_\epsilon, v_0, V_0) = \mathcal{N}(f(t)|\widetilde{f}(t), \widetilde{\Sigma}), \quad \begin{cases} \widetilde{f}(t) = (\widetilde{A}'\widetilde{A} + \lambda_f \widetilde{V})^{-1}\widetilde{A}' g(t), & \lambda_f = v_\epsilon/v_f \\ \widetilde{\Sigma} = (\widetilde{A}'\widetilde{A} + \lambda_f \widetilde{V})^{-1} \end{cases} $$
$$ q_2(A|g(t), \widetilde{f}(t), v_\epsilon, A_0, V_0) = \mathcal{N}(A|\widetilde{A}, \widetilde{V}), \quad \begin{cases} \widetilde{A} = \big(\sum_t g(t)\widetilde{f}'(t)\big)\big(\sum_t \widetilde{f}(t)\widetilde{f}'(t) + \lambda_a \widetilde{\Sigma}\big)^{-1}, & \lambda_a = v_\epsilon/v_a \\ \widetilde{V} = (\widetilde{F}'\widetilde{F} + \lambda_f \widetilde{\Sigma})^{-1} \end{cases} $$
◮ Algorithm: starting from initial A(0) and V(0), alternate the two updates above (f̂(t) and Σ̂ given Â and V̂, then Â and V̂ given the f̂(t) and Σ̂) until convergence.

Other more complex models

◮ Gaussian iid:
[Graphical model: hyperparameters v0, (A0, V0), vǫ generate f(t), A and ǫ(t), which generate g(t).]
$$ p(f_j(t)|v_{0j}) = \mathcal{N}(0, v_{0j}), \quad p(f(t)|v_0) \propto \exp\Big[-\tfrac{1}{2}\sum_j f_j^2(t)/v_{0j}\Big], \quad p(A|A_0, V_0) = \mathcal{N}(A_0, V_0), \quad p(g(t)|A, f(t), v_\epsilon) = \mathcal{N}(A f(t), v_\epsilon I) $$
◮ Variance-modulated prior model inducing sparsity:
[Graphical model: (αj0, βj0) generate vj(t), which generates f(t); (A0, V0) generates A; (αǫ0, βǫ0) generate vǫ, which generates ǫ(t); f(t), A and ǫ(t) generate g(t).]
$$ p(v_j(t)|\alpha_{j0}, \beta_{j0}) = \mathcal{IG}(\alpha_{j0}, \beta_{j0}), \quad p(f_j(t)|v_j(t)) = \mathcal{N}(0, v_j(t)), \quad p(f(t)|v(t)) \propto \exp\Big[-\tfrac{1}{2}\sum_j f_j^2(t)/v_j(t)\Big] $$
$$ p(A|A_0, V_0) = \mathcal{N}(A_0, V_0), \quad p(g(t)|A, f(t), v_\epsilon) = \mathcal{N}(A f(t), v_\epsilon I), \quad p(v_\epsilon|\alpha_{\epsilon 0}, \beta_{\epsilon 0}) = \mathcal{IG}(\alpha_{\epsilon 0}, \beta_{\epsilon 0}) $$
$$ p(f_{1..T}, A, v_{1..T}, v_\epsilon|g_{1..T}) \propto \prod_t p(g(t)|A, f(t), v_\epsilon)\, p(f(t)|v(t)) \prod_t \prod_j p(v_j(t)|\alpha_{j0}, \beta_{j0})\, p(A|A_0, V_0) $$

Sparse PCA

◮ In classical PCA, FA and ICA, one looks to obtain principal (uncorrelated or independent) components.
◮ In Sparse PCA or FA, one looks for the loading matrix A with the sparsest components.
◮ This can be imposed via the prior p(A). It leads to the selection of fewer variables.

[Figure: gene loadings obtained by PCA and by Sparse PCA (SPCA), showing that SPCA keeps fewer non-zero loadings per component.]

Discriminant Analysis

◮ When we have data and classes, the question to answer is: what are the most discriminant factors?
◮ There are many variants:
  ◮ Linear Discriminant Analysis (LDA),
  ◮ Quadratic Discriminant Analysis (QDA),
  ◮ Exponential Discriminant Analysis (EDA),
  ◮ Regularized LDA (RLDA), ...
◮ One can also ask for the Sparsest Linear Discriminant factors (SLDA).
◮ Deterministic point of view (geometrical distances)
◮ Probabilistic point of view (mixture densities)
◮ Mixture of Gaussians models: each class is modelled by a Gaussian pdf.

Discriminant Analysis: Time series, Colon

[Figure: projections of the colon time series onto the discriminant factors for the three classes (1, 2, 3), and the corresponding gene loadings.]

Sparse Discriminant Analysis: Time series, colon

What are the sparsest discriminant factors?

[Figure: projections and gene loadings for the sparse discriminant factors on the colon time series.]

LDA and SLDA study on time series: 1: before, 2: during, 3: ...

[Figure: LDA-Time and SLDA-Time projections of the data for the three classes.]

Dependency graphs

◮ The main objective here is to show the dependencies between variables.
◮ Three different measures can be used: Pearson ρ, Spearman ρs and Kendall τ.
◮ In this study we used ρs.
◮ A table of pairwise ρs values is computed and displayed in different forms: Hinton diagram, adjacency table and graphical network representation.

[Figure: Hinton diagram, adjacency table and network representation of the pairwise Spearman correlations between the genes.]

Graph of Dependencies: Colon, Class 1

[Figure: Hinton diagrams, adjacency tables and network representations of the gene dependencies for the colon, class 1, computed from the time series (top) and from the FT amplitudes (bottom).]

Classification tools

◮ Supervised classification
  ◮ K nearest neighbours methods
  ◮ Needs training-set data
  ◮ Must be careful to measure the performance of the classification on a different set of data (test set)
◮ Unsupervised classification
  ◮ Mixture models
  ◮ Expectation-Maximization methods
  ◮ Bayesian versions of EM
  ◮ Bayesian Variational Approximation (VBA)

Classification tools

[Figure: example gene-expression time courses (amplitude vs. time in hours) and their grouping by the classification tools.]

Contents

1. Mixture models
2. Different problems related to classification and clustering
   ◮ Training
   ◮ Supervised classification
   ◮ Semi-supervised classification
   ◮ Clustering or unsupervised classification
3. Mixture of Student-t
4. Variational Bayesian Approximation
5. VBA for Mixture of Student-t
6. Conclusion

Mixture models

◮ General mixture model:
$$ p(x|a, \Theta, K) = \sum_{k=1}^{K} a_k\, p_k(x|\theta_k), \qquad 0 < a_k < 1 $$
◮ Same family: $p_k(x|\theta_k) = p(x|\theta_k), \; \forall k$
◮ Gaussian: $p(x|\theta_k) = \mathcal{N}(x|\mu_k, \Sigma_k)$ with $\theta_k = (\mu_k, \Sigma_k)$
◮ Data $X = \{x_n, n = 1, \cdots, N\}$, where each element $x_n$ can be in one of the classes $c_n$.
◮ $a_k = p(c_n = k)$, $a = \{a_k, k = 1, \cdots, K\}$, $\Theta = \{\theta_k, k = 1, \cdots, K\}$, and
$$ p(X, c|a, \Theta) = \prod_{n=1}^{N} p(x_n, c_n|a, \Theta). $$
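As a small illustration (my own toy parameters, Gaussian components), the mixture density can be evaluated directly from this definition:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, a, means, covs):
    """p(x | a, Theta, K) = sum_k a_k N(x | mu_k, Sigma_k)."""
    return sum(a_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
               for a_k, mu_k, S_k in zip(a, means, covs))

a = [0.3, 0.7]                                  # toy two-component mixture in 2-D
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), a, means, covs))
```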

Different problems

◮ Training: given a set of (training) data X and classes c, estimate the parameters a and Θ.
◮ Supervised classification: given a sample x_m and the parameters K, a and Θ, determine its class
$$ k^* = \arg\max_k \{p(c_m = k|x_m, a, \Theta, K)\}. $$
◮ Semi-supervised classification (proportions are not known): given a sample x_m and the parameters K and Θ, determine its class
$$ k^* = \arg\max_k \{p(c_m = k|x_m, \Theta, K)\}. $$
◮ Clustering or unsupervised classification (number of classes K is not known): given a set of data X, determine K and c.

Training

◮ Given a set of (training) data X and classes c, estimate the parameters a and Θ.
◮ Maximum Likelihood (ML):
$$ (\widehat{a}, \widehat{\Theta}) = \arg\max_{(a, \Theta)} \{p(X, c|a, \Theta, K)\}. $$
◮ Bayesian: assign priors p(a|K) and $p(\Theta|K) = \prod_{k=1}^{K} p(\theta_k)$ and write the expression of the joint posterior law:
$$ p(a, \Theta|X, c, K) = \frac{p(X, c|a, \Theta, K)\, p(a|K)\, p(\Theta|K)}{p(X, c|K)} \quad \text{where} \quad p(X, c|K) = \iint p(X, c|a, \Theta, K)\, p(a|K)\, p(\Theta|K)\, \mathrm{d}a\, \mathrm{d}\Theta $$
◮ Infer on a and Θ either as the Maximum A Posteriori (MAP) or the Posterior Mean (PM).

Supervised classification

◮ Given a sample x_m and the parameters K, a and Θ, determine
$$ p(c_m = k|x_m, a, \Theta, K) = \frac{p(x_m, c_m = k|a, \Theta, K)}{p(x_m|a, \Theta, K)} $$
where
$$ p(x_m, c_m = k|a, \Theta, K) = a_k\, p(x_m|\theta_k) \quad \text{and} \quad p(x_m|a, \Theta, K) = \sum_{k=1}^{K} a_k\, p(x_m|\theta_k) $$
◮ Best class k*:
$$ k^* = \arg\max_k \{p(c_m = k|x_m, a, \Theta, K)\} $$
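A small sketch of these two formulas for Gaussian components (toy parameters, not from the study): the posterior class probabilities are the normalised joint terms, and the best class is their argmax:

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_posteriors(x, a, means, covs):
    """p(c = k | x, a, Theta, K) = a_k N(x | mu_k, Sigma_k) / sum_l a_l N(x | mu_l, Sigma_l)."""
    joint = np.array([a_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
                      for a_k, mu_k, S_k in zip(a, means, covs)])
    return joint / joint.sum()

a = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
post = class_posteriors(np.array([1.0, 1.0]), a, means, covs)
k_star = int(np.argmax(post))                 # best class k*
```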

Semi-supervised classification

◮ Given a sample x_m and the parameters K and Θ (not the proportions a), determine the probabilities
$$ p(c_m = k|x_m, \Theta, K) = \frac{p(x_m, c_m = k|\Theta, K)}{p(x_m|\Theta, K)} $$
where
$$ p(x_m, c_m = k|\Theta, K) = \int p(x_m, c_m = k|a, \Theta, K)\, p(a|K)\, \mathrm{d}a \quad \text{and} \quad p(x_m|\Theta, K) = \sum_{k=1}^{K} p(x_m, c_m = k|\Theta, K) $$
◮ Best class k*, for example the MAP solution:
$$ k^* = \arg\max_k \{p(c_m = k|x_m, \Theta, K)\}. $$

Clustering or non-supervised classification

◮ Given a set of data X, determine K and c.
◮ Determination of the number of classes:
$$ p(K = L|X) = \frac{p(X, K = L)}{p(X)} = \frac{p(X|K = L)\, p(K = L)}{p(X)} \quad \text{and} \quad p(X) = \sum_{L=1}^{L_0} p(K = L)\, p(X|K = L), $$
where L_0 is the a priori maximum number of classes and
$$ p(X|K = L) = \iint \prod_n \prod_{k=1}^{L} a_k\, p(x_n, c_n = k|\theta_k)\, p(a|K)\, p(\Theta|K)\, \mathrm{d}a\, \mathrm{d}\Theta $$
◮ When K and c are determined, we can also determine the characteristics a and Θ of those classes.

Mixture of Student-t model

◮ Student-t and its Infinite Gaussian Scaled Model (IGSM):
$$ \mathcal{T}(x|\nu, \mu, \Sigma) = \int_0^\infty \mathcal{N}(x|\mu, z^{-1}\Sigma)\, \mathcal{G}\big(z|\tfrac{\nu}{2}, \tfrac{\nu}{2}\big)\, \mathrm{d}z $$
where
$$ \mathcal{N}(x|\mu, \Sigma) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\big[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\big] = |2\pi\Sigma|^{-\frac{1}{2}} \exp\big[-\tfrac{1}{2}\mathrm{Tr}\{(x - \mu)\Sigma^{-1}(x - \mu)'\}\big] $$
and
$$ \mathcal{G}(z|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, z^{\alpha-1} \exp[-\beta z]. $$
◮ Mixture of Student-t:
$$ p(x|\{\nu_k, a_k, \mu_k, \Sigma_k, k = 1, \cdots, K\}, K) = \sum_{k=1}^{K} a_k\, \mathcal{T}(x_n|\nu_k, \mu_k, \Sigma_k). $$
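A small NumPy sketch (my own helper) that uses exactly this IGSM representation to draw samples from a multivariate Student-t: draw z from the Gamma mixing density, then x from the correspondingly scaled Gaussian:

```python
import numpy as np

def sample_student_t_igsm(nu, mu, Sigma, n_samples, seed=0):
    """Sample T(x | nu, mu, Sigma) via z ~ G(nu/2, nu/2) (rate), then x ~ N(mu, Sigma / z)."""
    rng = np.random.default_rng(seed)
    z = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n_samples)    # rate nu/2 means scale 2/nu
    return np.array([rng.multivariate_normal(mu, Sigma / zi) for zi in z])

x = sample_student_t_igsm(nu=3.0, mu=np.zeros(2), Sigma=np.eye(2), n_samples=5)
```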

Mixture of Student-t model

◮ Introducing $z_{nk}$, $z_k = \{z_{nk}, n = 1, \cdots, N\}$, $Z = \{z_{nk}\}$, $c = \{c_n, n = 1, \cdots, N\}$, $\theta_k = \{\nu_k, a_k, \mu_k, \Sigma_k\}$, $\Theta = \{\theta_k, k = 1, \cdots, K\}$.
◮ Assigning the priors $p(\Theta) = \prod_k p(\theta_k)$, we can write:
$$ p(X, c, Z, \Theta|K) = \prod_n \prod_k a_k\, \mathcal{N}(x_n|\mu_k, z_{n,k}^{-1}\Sigma_k)\, \mathcal{G}\big(z_{nk}|\tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\big)\, p(\theta_k) $$
◮ Joint posterior law:
$$ p(c, Z, \Theta|X, K) = \frac{p(X, c, Z, \Theta|K)}{p(X|K)}. $$
◮ The main task now is to propose approximations to it, in such a way that we can use it easily in all the above-mentioned tasks of classification or clustering.

Variational Bayesian Approximation (VBA)

◮ Main idea: to propose an easy-to-compute approximation q(c, Z, Θ) of p(c, Z, Θ|X, K).
◮ Criterion: KL(q : p).
◮ Interestingly, by noting that p(c, Z, Θ|X, K) = p(X, c, Z, Θ|K)/p(X|K), we have:
$$ \mathrm{KL}(q : p) = -\mathcal{F}(q) + \ln p(X|K) \quad \text{where} \quad \mathcal{F}(q) = \Big\langle \ln \frac{p(X, c, Z, \Theta|K)}{q} \Big\rangle_q $$
is called the free energy of q, and we have the following properties:
 – Maximizing F(q) and minimizing KL(q : p) are equivalent, and F(q) is a lower bound on the log-evidence of the model, ln p(X|K).
 – When the optimum q* is obtained, F(q*) can be used as a criterion for model selection.

VBA: choosing the good families

◮ Using KL(q : p) has the very interesting property that the means computed with q are the same as those we would obtain with p (conservation of the means).
◮ Unfortunately, this is not the case for variances or other moments.
◮ If p is in the exponential family, then, choosing appropriate conjugate priors, the structure of q will be the same and we can obtain fast optimization algorithms.

Hierarchical graphical model

[Figure: graphical representation of the model: the hyperparameters ξ0, (γ0, Σ0), (µ0, η0) and k0 generate (αk, βk), Σk, µk and a, which together with znk generate the data xn.]

VBA for mixture of Student-t

◮ In our case, noting that
$$ p(X, c, Z, \Theta|K) = \prod_n \prod_k p(x_n, c_n, z_{nk}|a_k, \mu_k, \Sigma_k, \nu_k) \prod_k \big[p(\alpha_k)\, p(\beta_k)\, p(\mu_k|\Sigma_k)\, p(\Sigma_k)\big] $$
with
$$ p(x_n, c_n, z_{nk}|a_k, \mu_k, \Sigma_k, \nu_k) = \mathcal{N}(x_n|\mu_k, z_{n,k}^{-1}\Sigma_k)\, \mathcal{G}(z_{nk}|\alpha_k, \beta_k) $$
is separable, on one side in [c, Z] and on the other side in the components of Θ, we propose to use q(c, Z, Θ) = q(c, Z) q(Θ).

VBA for mixture of Student-t

◮ With this decomposition, the expression of the Kullback-Leibler divergence becomes:
$$ \mathrm{KL}\big(q_1(c, Z)\, q_2(\Theta) : p(c, Z, \Theta|X, K)\big) = \sum_{c} \iint q_1(c, Z)\, q_2(\Theta) \ln \frac{q_1(c, Z)\, q_2(\Theta)}{p(c, Z, \Theta|X, K)}\, \mathrm{d}Z\, \mathrm{d}\Theta $$
◮ The expression of the free energy becomes:
$$ \mathcal{F}\big(q_1(c, Z)\, q_2(\Theta)\big) = \sum_{c} \iint q_1(c, Z)\, q_2(\Theta) \ln \frac{p(X, c, Z|\Theta, K)\, p(\Theta|K)}{q_1(c, Z)\, q_2(\Theta)}\, \mathrm{d}Z\, \mathrm{d}\Theta $$

Proposed VBA for the Mixture of Student-t priors model

◮ Using a generalized Student-t, obtained by replacing $\mathcal{G}(z_{n,k}|\tfrac{\nu_k}{2}, \tfrac{\nu_k}{2})$ by $\mathcal{G}(z_{n,k}|\alpha_k, \beta_k)$, it will be easier to propose conjugate priors for $\alpha_k, \beta_k$ than for $\nu_k$:
$$ p(x_n, c_n = k, z_{nk}|a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k, K) = a_k\, \mathcal{N}(x_n|\mu_k, z_{n,k}^{-1}\Sigma_k)\, \mathcal{G}(z_{n,k}|\alpha_k, \beta_k) $$
◮ In the following, denoting $\Theta = \{(a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k), k = 1, \cdots, K\}$, we propose to use the factorized prior laws:
$$ p(\Theta) = p(a) \prod_k \big[p(\alpha_k)\, p(\beta_k)\, p(\mu_k|\Sigma_k)\, p(\Sigma_k)\big] $$
with the following components:
$$ \begin{cases} p(a) = \mathcal{D}(a|k_0), \quad k_0 = [k_0, \cdots, k_0] = k_0 \mathbf{1} \\ p(\alpha_k) = \mathcal{E}(\alpha_k|\zeta_0) = \mathcal{G}(\alpha_k|1, \zeta_0) \\ p(\beta_k) = \mathcal{E}(\beta_k|\zeta_0) = \mathcal{G}(\beta_k|1, \zeta_0) \\ p(\mu_k|\Sigma_k) = \mathcal{N}(\mu_k|\mu_0 \mathbf{1}, \eta_0^{-1}\Sigma_k) \\ p(\Sigma_k) = \mathcal{IW}(\Sigma_k|\gamma_0, \gamma_0\Sigma_0) \end{cases} $$

Proposed VBA for the Mixture of Student-t priors model

where
$$ \mathcal{D}(a|k) = \frac{\Gamma\big(\sum_l k_l\big)}{\prod_l \Gamma(k_l)} \prod_l a_l^{k_l - 1} $$
is the Dirichlet pdf,
$$ \mathcal{E}(t|\zeta_0) = \zeta_0 \exp[-\zeta_0 t] $$
is the Exponential pdf,
$$ \mathcal{G}(t|a, b) = \frac{b^a}{\Gamma(a)}\, t^{a-1} \exp[-bt] $$
is the Gamma pdf, and
$$ \mathcal{IW}(\Sigma|\gamma, \gamma\Delta) = \frac{|\tfrac{1}{2}\Delta|^{\gamma/2} \exp\big[-\tfrac{1}{2}\mathrm{Tr}\{\Delta\Sigma^{-1}\}\big]}{\Gamma_D(\gamma/2)\, |\Sigma|^{\frac{\gamma+D+1}{2}}} $$
is the inverse Wishart pdf. With these prior laws and the likelihood, the joint posterior law is:
$$ p(c, Z, \Theta|X) = \frac{p(X, c, Z, \Theta)}{p(X)}. $$

Expressions of q

$$ q(c, Z, \Theta) = q(c, Z)\, q(\Theta) = \prod_n \prod_k \big[q(c_n = k|z_{nk})\, q(z_{nk})\big] \prod_k \big[q(\alpha_k)\, q(\beta_k)\, q(\mu_k|\Sigma_k)\, q(\Sigma_k)\big]\, q(a) $$
with:
$$ \begin{cases} q(a) = \mathcal{D}(a|\tilde{k}), \quad \tilde{k} = [\tilde{k}_1, \cdots, \tilde{k}_K] \\ q(\alpha_k) = \mathcal{G}(\alpha_k|\tilde{\zeta}_k, \tilde{\eta}_k) \\ q(\beta_k) = \mathcal{G}(\beta_k|\tilde{\zeta}_k, \tilde{\eta}_k) \\ q(\mu_k|\Sigma_k) = \mathcal{N}(\mu_k|\tilde{\mu}, \tilde{\eta}^{-1}\Sigma_k) \\ q(\Sigma_k) = \mathcal{IW}(\Sigma_k|\tilde{\gamma}, \tilde{\gamma}\tilde{\Sigma}) \end{cases} $$
With these choices, we have
$$ \mathcal{F}(q(c, Z, \Theta)) = \langle \ln p(X, c, Z, \Theta|K) \rangle_{q(c, Z, \Theta)} = \prod_k \prod_n \mathcal{F}_{1kn} + \prod_k \mathcal{F}_{2k} $$
$$ \mathcal{F}_{1kn} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(c_n = k|z_{nk})\, q(z_{nk})}, \qquad \mathcal{F}_{2k} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(\theta_k)} $$

VBA Algorithm steps

The updating expressions of the tilded parameters are obtained by following three steps:
◮ E step: optimizing F with respect to q(c, Z) while keeping q(Θ) fixed, we obtain the expressions of $q(c_n = k|z_{nk}) = \tilde{a}_k$ and $q(z_{nk}) = \mathcal{G}(z_{nk}|\tilde{\alpha}_k, \tilde{\beta}_k)$.
◮ M step: optimizing F with respect to q(Θ) while keeping q(c, Z) fixed, we obtain the expressions of $q(a) = \mathcal{D}(a|\tilde{k})$ with $\tilde{k} = [\tilde{k}_1, \cdots, \tilde{k}_K]$, $q(\alpha_k) = \mathcal{G}(\alpha_k|\tilde{\zeta}_k, \tilde{\eta}_k)$, $q(\beta_k) = \mathcal{G}(\beta_k|\tilde{\zeta}_k, \tilde{\eta}_k)$, $q(\mu_k|\Sigma_k) = \mathcal{N}(\mu_k|\tilde{\mu}, \tilde{\eta}^{-1}\Sigma_k)$ and $q(\Sigma_k) = \mathcal{IW}(\Sigma_k|\tilde{\gamma}, \tilde{\gamma}\tilde{\Sigma})$, which gives the updating algorithm for the corresponding tilded parameters.
◮ F evaluation: after each E step and M step, we can also evaluate the expression of F(q), which can be used as a stopping rule for the iterative algorithm.
◮ The final value of F(q) for each value of K, noted F_K, can be used as a criterion for model selection, i.e. the determination of the number of clusters.

Conclusions

◮ Clustering and classification of a set of data are among the most important tasks in statistical research for many applications, such as data mining in biology.
◮ Mixture models, and in particular Mixtures of Gaussians, are classical models for these tasks.
◮ We proposed to use a mixture of generalised Student-t distributions to model the data, via a hierarchical graphical model.
◮ To obtain fast algorithms and be able to handle large data sets, we used conjugate priors wherever possible.
◮ The proposed algorithm has been used for clustering, classification and discriminant analysis of some biological data (cancer-research related), but in this paper we only presented the main algorithm.