The decomposition of a third-order tensor in R block terms of rank-(L,L,1): Model, Algorithms, Uniqueness, Estimation of R and L

Dimitri Nion & Lieven De Lathauwer
K.U. Leuven, Kortrijk campus, Belgium
E-mails: [email protected], [email protected]

TRICAP 2009, Nurià, Spain, June 14th-19th, 2009

Introduction

Tensor decompositions are powerful multi-linear algebra tools that generalize matrix decompositions.

Motivation: an increasing number of applications involve the manipulation of multi-way data, rather than 2-way data.

Key research axes:
 Development of new models/decompositions
 Development of algorithms to compute the decompositions
 Uniqueness of tensor decompositions
 Use of these tools in new applications, or in existing applications where the multi-way nature of the data was ignored until now
 Tensor decompositions under constraints (e.g. imposing non-negativity or specific algebraic structures)

From matrix SVD to tensor HOSVD

Matrix SVD (Y of size I x J, rank R):

Y = U D V^H = d_11 u_1 v_1^H + ... + d_RR u_R v_R^H

Tensor HOSVD (third-order case, Y of size I x J x K):

Y = H ×_1 U ×_2 V ×_3 W,   i.e.   y_ijk = Σ_{l=1..L} Σ_{m=1..M} Σ_{n=1..N} u_il v_jm w_kn h_lmn

 One unitary matrix (U, V, W) per mode.
 H (of size L x M x N) is the representation of Y in the reduced spaces.
 We may have L ≠ M ≠ N.
 H is not diagonal (difference with the matrix SVD).
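In practice, the factors of the (truncated) HOSVD are obtained from the left singular vectors of the three matrix unfoldings, and the core by projecting Y onto them. A minimal numpy sketch of that standard construction (the function name and the truncation sizes L, M, N are illustrative choices, not taken from the slides):

```python
import numpy as np

def hosvd(Y, L, M, N):
    """Truncated HOSVD of a third-order tensor Y (I x J x K).
    Returns the core H (L x M x N) and the mode factors U (I x L), V (J x M), W (K x N)."""
    I, J, K = Y.shape
    # Left singular vectors of the three matrix unfoldings, one per mode
    U = np.linalg.svd(Y.reshape(I, J * K), full_matrices=False)[0][:, :L]
    V = np.linalg.svd(Y.transpose(1, 0, 2).reshape(J, I * K), full_matrices=False)[0][:, :M]
    W = np.linalg.svd(Y.transpose(2, 0, 1).reshape(K, I * J), full_matrices=False)[0][:, :N]
    # Core tensor: H = Y x1 U^H x2 V^H x3 W^H
    H = np.einsum('ijk,il,jm,kn->lmn', Y, U.conj(), V.conj(), W.conj())
    return H, U, V, W

# Going back to the original space: Y ≈ H x1 U x2 V x3 W
# Y_hat = np.einsum('lmn,il,jm,kn->ijk', H, U, V, W)
```

The commented line shows how the core is mapped back to the original space, which is also what the dimensionality-reduction step later in the talk relies on.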

From matrix SVD to PARAFAC

Matrix SVD:

Y = U D V^H = d_11 u_1 v_1^H + ... + d_RR u_R v_R^H

PARAFAC decomposition (Y of size I x J x K):

Y = Y_1 + ... + Y_R = a_1 ∘ b_1 ∘ c_1 + ... + a_R ∘ b_R ∘ c_R

i.e. a sum of R rank-1 tensors, with factor matrices A = [a_1 ... a_R] (I x R), B = [b_1 ... b_R] (J x R), C = [c_1 ... c_R] (K x R).

Tucker form with a diagonal R x R x R core H (h_ijk = 1 if i = j = k, 0 otherwise):

Y = H ×_1 A ×_2 B ×_3 C

Slice-wise, Y is a set of K matrices of the form Y(:,:,k) = A diag(C(k,:)) B^T.
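The slice-wise form is convenient for building or checking a PARAFAC model numerically. A small sketch (the sizes and the random factor matrices are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 6, 5, 4, 3
A, B, C = rng.standard_normal((I, R)), rng.standard_normal((J, R)), rng.standard_normal((K, R))

# Frontal slices Y(:,:,k) = A diag(C(k,:)) B^T, stacked into an I x J x K tensor
Y = np.stack([A @ np.diag(C[k]) @ B.T for k in range(K)], axis=2)

# Equivalent sum of R rank-1 tensors: Y = sum_r a_r o b_r o c_r
Y_check = np.einsum('ir,jr,kr->ijk', A, B, C)
assert np.allclose(Y, Y_check)
```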

From PARAFAC/HOSVD to Block Component Decompositions (BCD) [De Lathauwer and Nion]

BCD in rank-(L_r, L_r, 1) terms:

Y = (A_1 B_1^T) ∘ c_1 + ... + (A_R B_R^T) ∘ c_R,

with A_r of size I x L_r, B_r of size J x L_r and c_r of size K x 1.

BCD in rank-(L_r, M_r, ·) terms:

Y = H_1 ×_1 A_1 ×_2 B_1 + ... + H_R ×_1 A_R ×_2 B_R,

with A_r of size I x L_r, B_r of size J x M_r and H_r of size L_r x M_r x K.

BCD in rank-(L_r, M_r, N_r) terms:

Y = H_1 ×_1 A_1 ×_2 B_1 ×_3 C_1 + ... + H_R ×_1 A_R ×_2 B_R ×_3 C_R,

with A_r of size I x L_r, B_r of size J x M_r, C_r of size K x N_r and H_r of size L_r x M_r x N_r.
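For concreteness, a small numpy construction of a tensor that follows the rank-(L_r, L_r, 1) model exactly; the sizes and ranks below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 8, 7, 6
Ls = [2, 3]                       # ranks L_1, ..., L_R of the R = len(Ls) block terms
R = len(Ls)

A = [rng.standard_normal((I, L)) for L in Ls]
B = [rng.standard_normal((J, L)) for L in Ls]
c = [rng.standard_normal(K) for _ in range(R)]

# Y = sum_r (A_r B_r^T) o c_r : each term has multilinear rank (L_r, L_r, 1)
Y = sum(np.einsum('ij,k->ijk', A[r] @ B[r].T, c[r]) for r in range(R))
```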

Content of this talk: BCD-(L_r, L_r, 1)

Y = (A_1 B_1^T) ∘ c_1 + ... + (A_R B_R^T) ∘ c_R

 Model ambiguities
 Algorithms
 Uniqueness
 Estimation of the parameters L_r (r = 1, ..., R) and R
 An application in telecommunications

BCD-(L_r, L_r, 1): Model ambiguities

For any non-singular L_r x L_r matrices F_r,

Y = (A_1 F_1)(F_1^-1 B_1^T) ∘ c_1 + ... + (A_R F_R)(F_R^-1 B_R^T) ∘ c_R,

so each pair (A_r, B_r) can only be recovered up to such a transformation.

 Unknown matrices:

A = [A_1 ... A_R]   (I x (L_1 + ... + L_R))
B = [B_1 ... B_R]   (J x (L_1 + ... + L_R))
C = [c_1 ... c_R]   (K x R)

 The BCD-(L_r, L_r, 1) is said to be essentially unique if the only ambiguities are:
- arbitrary permutation of the R blocks in A and B and of the R columns of C,
- each block of A and B post-multiplied by an arbitrary non-singular matrix, and each column of C arbitrarily scaled.
= A and B estimated up to multiplication by a block-wise permuted block-diagonal matrix, and C up to multiplication by a permuted diagonal matrix.
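A quick numerical check of this block-wise ambiguity: post-multiplying A_r by a non-singular F and B_r by F^{-T} leaves the term A_r B_r^T unchanged. A self-contained sketch with placeholder sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, L = 8, 7, 2
Ar, Br = rng.standard_normal((I, L)), rng.standard_normal((J, L))
F = rng.standard_normal((L, L))          # arbitrary non-singular L x L matrix

# (A_r F)(B_r F^{-T})^T = A_r F F^{-1} B_r^T = A_r B_r^T,
# so only the column spaces of A_r and B_r are identifiable
assert np.allclose((Ar @ F) @ (Br @ np.linalg.inv(F).T).T, Ar @ Br.T)
```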

BCD-(L_r, L_r, 1): Algorithms

 Usual approach: estimate A, B and C by minimization of

Φ = || Y − Σ_{r=1..R} (A_r B_r^T) ∘ c_r ||_F^2        (∘ = outer product)

 The model is fitted for a given choice of the parameters {L_r, R}.
 Exploit the algebraic structure of the matrix unfoldings of Y, obtained by concatenating the slices of the I x J x K tensor along its three modes:

Y_{I×KJ}  (rows indexed by i),
Y_{J×IK}  (rows indexed by j),
Y_{K×JI}  (rows indexed by k).
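One convenient way to form such unfoldings with numpy; the exact column orderings are a convention, fixed here by the transposition pattern (whatever ordering is chosen must also be used when building the Z matrices of the next slide):

```python
import numpy as np

def unfoldings(Y):
    """Three matrix unfoldings of an I x J x K tensor Y (one possible column-ordering convention)."""
    I, J, K = Y.shape
    Y_IxKJ = Y.transpose(0, 2, 1).reshape(I, K * J)   # rows indexed by i
    Y_JxIK = Y.transpose(1, 0, 2).reshape(J, I * K)   # rows indexed by j
    Y_KxJI = Y.transpose(2, 1, 0).reshape(K, J * I)   # rows indexed by k
    return Y_IxKJ, Y_JxIK, Y_KxJI
```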

BCD-(L_r, L_r, 1): ALS Algorithm

Each unfolding is linear in one factor at a time:

Y_{K×JI} = C · Z_1(B, A)     Φ = || Y_{K×JI} − C · Z_1(B, A) ||_F^2
Y_{J×IK} = B · Z_2(A, C)     Φ = || Y_{J×IK} − B · Z_2(A, C) ||_F^2
Y_{I×KJ} = A · Z_3(C, B)     Φ = || Y_{I×KJ} − A · Z_3(C, B) ||_F^2

Z_1, Z_2 and Z_3 are each built from 2 matrices only and have a block-wise Khatri-Rao product structure.

Initialisation: Â^(0), B̂^(0), k = 1
while Φ^(k−1) − Φ^(k) > ε (e.g. ε = 10^-6):
    (1)  Ĉ^(k) = Y_{K×JI} · [Z_1(B̂^(k−1), Â^(k−1))]^†
    (2)  B̂^(k) = Y_{J×IK} · [Z_2(Â^(k−1), Ĉ^(k))]^†
    (3)  Â^(k) = Y_{I×KJ} · [Z_3(Ĉ^(k), B̂^(k))]^†
    k ← k + 1

where (·)^† denotes the pseudo-inverse, i.e. each update is the least-squares solution of the corresponding linear problem above.
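A minimal numpy sketch of this ALS loop for the equal-L case, BCD-(L,L,1). The function name, the fixed iteration count used in place of the Φ-based stopping test, and the particular unfolding orderings are choices made here for illustration, not taken from the slides:

```python
import numpy as np

def bcd_lll1_als(Y, R, L, n_iter=200, init=None, seed=0):
    """ALS sketch for the BCD-(L,L,1) of a real I x J x K tensor Y.
    Model: Y = sum_r (A_r B_r^T) o c_r, with A = [A_1 ... A_R] (I x RL),
    B = [B_1 ... B_R] (J x RL) and C = [c_1 ... c_R] (K x R)."""
    I, J, K = Y.shape
    rng = np.random.default_rng(seed)
    A, B, C = init if init is not None else (rng.standard_normal((I, R * L)),
                                             rng.standard_normal((J, R * L)),
                                             rng.standard_normal((K, R)))

    def blk(X, r):                       # r-th block of L columns
        return X[:, r * L:(r + 1) * L]

    # Unfoldings (rows indexed by k, j and i respectively)
    Y_kji = Y.transpose(2, 1, 0).reshape(K, J * I)
    Y_jik = Y.transpose(1, 0, 2).reshape(J, I * K)
    Y_ikj = Y.transpose(0, 2, 1).reshape(I, K * J)

    for _ in range(n_iter):
        # (1) C update: Y_kji ~ C Z1, row r of Z1 = vec(A_r B_r^T)
        Z1 = np.stack([(blk(A, r) @ blk(B, r).T).T.ravel() for r in range(R)])
        C = Y_kji @ np.linalg.pinv(Z1)
        # (2) B update: Y_jik ~ B Z2, block r of Z2 built from A_r and c_r
        Z2 = np.vstack([np.einsum('il,k->lik', blk(A, r), C[:, r]).reshape(L, I * K)
                        for r in range(R)])
        B = Y_jik @ np.linalg.pinv(Z2)
        # (3) A update: Y_ikj ~ A Z3, block r of Z3 built from c_r and B_r
        Z3 = np.vstack([np.einsum('k,jl->lkj', C[:, r], blk(B, r)).reshape(L, K * J)
                        for r in range(R)])
        A = Y_ikj @ np.linalg.pinv(Z3)

    Y_hat = sum(np.einsum('ij,k->ijk', blk(A, r) @ blk(B, r).T, C[:, r]) for r in range(R))
    rel_err = np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y)
    return A, B, C, rel_err
```

In practice one would monitor Φ between iterations and stop once the decrease falls below ε, as in the pseudocode above; the init argument is reused in the dimensionality-reduction sketch further below.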

ALS algorithm: problem of swamps

Observation: ALS is fast in many problems, but sometimes a long swamp is encountered before convergence (27000 iterations in the example shown on the slide).

Long swamps typically occur when:
 The loading matrices of the decomposition (i.e. the objective matrices) are ill-conditioned
 The updated matrices become ill-conditioned (impact of the initialization)
 One of the R tensor components in Y = Y_1 + ... + Y_R has a much higher norm than the R−1 others (e.g. the "near-far" effect in telecommunications)

Improvement 1 of ALS: Line Search

Purpose: reduce the length of swamps.
Principle: at each iteration, interpolate A, B and C from their estimates at the 2 previous iterations, and use the interpolated matrices as input of the ALS update.

1. Line Search (search directions):

A^(new) = A^(k−2) + ρ (A^(k−1) − A^(k−2))
B^(new) = B^(k−2) + ρ (B^(k−1) − B^(k−2))
C^(new) = C^(k−2) + ρ (C^(k−1) − C^(k−2))

The choice of ρ is crucial; ρ = 1 annihilates the line-search step (i.e. we get standard ALS).

2. Then the ALS update:

(1)  Ĉ^(k) = Y_{K×JI} · [Z_1(B̂^(new), Â^(new))]^†
(2)  B̂^(k) = Y_{J×IK} · [Z_2(Â^(new), Ĉ^(k))]^†
(3)  Â^(k) = Y_{I×KJ} · [Z_3(Ĉ^(k), B̂^(k))]^†
k ← k + 1
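A sketch of the interpolation step combined with a Bro-style validation (accept the extrapolated point only if it does not increase the fit error); the ρ = k^(1/3) choice follows the LSB variant listed on the next slide, while the helper names and the surrounding loop structure are illustrative:

```python
import numpy as np

def line_search_step(prev, curr, rho):
    """Extrapolate each factor matrix along its last change:
    X_new = X^(k-2) + rho * (X^(k-1) - X^(k-2))."""
    return [Xp + rho * (Xc - Xp) for Xp, Xc in zip(prev, curr)]

def fit_error(Y, A, B, C, R, L):
    """Relative residual of the BCD-(L,L,1) model."""
    Yh = sum(np.einsum('ij,k->ijk', A[:, r*L:(r+1)*L] @ B[:, r*L:(r+1)*L].T, C[:, r])
             for r in range(R))
    return np.linalg.norm(Y - Yh) / np.linalg.norm(Y)

# Inside the ALS loop (k = iteration counter), before the three ALS updates:
# rho = k ** (1.0 / 3.0)                       # LSB choice [Bro, 1997]
# A_new, B_new, C_new = line_search_step([A_km2, B_km2, C_km2], [A_km1, B_km1, C_km1], rho)
# if fit_error(Y, A_new, B_new, C_new, R, L) <= fit_error(Y, A_km1, B_km1, C_km1, R, L):
#     A, B, C = A_new, B_new, C_new            # accept the extrapolated point
# else:
#     A, B, C = A_km1, B_km1, C_km1            # reject: fall back to standard ALS
```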

Improvement 1 of ALS: Line Search - choices of ρ

[Harshman, 1970] "LSH": choose ρ = 1.25.

[Bro, 1997] "LSB": choose ρ = k^(1/3) and validate the LS step only if it decreases the fit.

[Rajih, Comon, 2005] "Enhanced Line Search (ELS)": for real tensors, Φ(A^(new), B^(new), C^(new)) = Φ(ρ) is a 6th-order polynomial in ρ; the optimal ρ is the root that minimizes Φ.

[Nion, De Lathauwer, 2006] "Enhanced Line Search with Complex Step (ELSCS)": for complex tensors, look for the optimal ρ = m·e^{iθ}. We have Φ(A^(new), B^(new), C^(new)) = Φ(m, θ), and m and θ are updated alternately:
- Update m: for θ fixed, ∂Φ(m, θ)/∂m is a 5th-order polynomial in m.
- Update θ: for m fixed, ∂Φ(m, θ)/∂θ reduces to a 6th-order polynomial in t = tan(θ/2).

Improvement 1 of ALS: Line Search - results

Figures (not reproduced): convergence of ALS and ALS+ELS on an "easy" problem (2000 iterations) and a "difficult" problem (27000 iterations).

 ELS  large reduction of the number of iterations at a very low additional complexity w.r.t. standard ALS.

Improvement 2 of ALS: Dimensionality reduction

Idea: first compress Y (I x J x K) to a small core tensor H via Tucker/HOSVD, Y ≈ H ×_1 U ×_2 V ×_3 W, then fit the block decomposition to H instead of Y.

STEP 1: HOSVD of Y (compression).
STEP 2: BCD of the small core tensor H (in the compressed space).
STEP 3: Come back to the original space + a few refinement iterations in the original space.

 Compression  large reduction of the cost per iteration, since the model is fitted in the compressed space.
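A hedged outline of this compress-decompose-expand strategy, reusing the hosvd and bcd_lll1_als helpers sketched earlier (the symbols P, Q, S for the compressed sizes are mine, not from the slides):

```python
# STEP 1: compress Y (I x J x K) to a small core H (P x Q x S) via truncated HOSVD;
# choose P >= R*L, Q >= R*L and S >= R so the core can still carry the model
H, U, V, W = hosvd(Y, P, Q, S)

# STEP 2: fit the BCD-(L,L,1) in the compressed space (cheap, since H is small)
A_c, B_c, C_c, _ = bcd_lll1_als(H, R, L)

# STEP 3: map the factors back to the original space and run a few refinement
# iterations on the full tensor Y
A, B, C, rel_err = bcd_lll1_als(Y, R, L, n_iter=10, init=(U @ A_c, V @ B_c, W @ C_c))
```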

Improvement 3 of ALS: Good initialization

Figure (not reproduced): comparison of ALS and ALS+ELS with three random initializations.

Instead of using random initializations, could we use the observed tensor itself? Yes: for the BCD-(L,L,1), if A and B are full column rank (so I and J have to be large enough), there is an easy way to find a good initialization, in the same spirit as the Direct Trilinear Decomposition (DTLD) used to initialize PARAFAC (not detailed in this talk).

Other algorithms

Existing algorithms for PARAFAC can be adapted to Block Component Decompositions. Examples:
 Levenberg-Marquardt algorithm (Gauss-Newton type method),
 Simultaneous Diagonalization (SD) algorithms  let's say a few words on this technique.

SD for PARAFAC (De Lathauwer, 2006)

 Initial condition to reformulate PARAFAC in terms of SD:  min(IJ, K) ≥ R
 The PARAFAC decomposition can then be computed by solving an SD problem:

M_n = W D_n W^T,   n = 1, ..., R,   with D_n an R x R diagonal matrix.

 Advantage: low complexity (only R matrices of size R x R to diagonalize + direct use of existing fast algorithms designed for SD).
 The SD reformulation yields a uniqueness bound that is generically more relaxed than the Kruskal bound:

K ≥ R   and   R(R−1)/2 ≤ [I(I−1)/2] · [J(J−1)/2]

BCD-(L, L, 1): computation via Simultaneous Diagonalization (Nion & De Lathauwer, 2007)

 Results established for the BCD-(L,L,1), i.e., the same L for the R terms.
 Initial condition to reformulate the BCD-(L,L,1) in terms of SD:  min(IJ, K) ≥ R
 The decomposition can then be computed by solving an SD problem:

M_n = W D_n W^T,   n = 1, ..., R,   with D_n an R x R diagonal matrix.

 Advantage: low complexity (only R matrices of size R x R to diagonalize + direct use of existing fast algorithms designed for SD).
 The SD reformulation yields a new, more relaxed uniqueness bound (next slide).

BCD-(L, L, 1): Uniqueness (Nion & De Lathauwer, 2007)

Sufficient bound 1 [De Lathauwer, 2006]:

LR ≤ IJ   and   min(⌊I/L⌋, R) + min(⌊J/L⌋, R) + min(K, R) ≥ 2(R + 1)        (1)

Sufficient bound 2 [Nion & De Lathauwer, 2007]:

R ≤ min(IJ, K)   and   C_I^{L+1} · C_J^{L+1} ≥ C_{R+L}^{L+1} − R,        (2)

where C_n^k = n! / (k! (n − k)!) is the binomial coefficient.

 The new bound (2) is generically much more relaxed than (1).
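A small sketch that evaluates both sufficient conditions for given dimensions; the function names are mine and condition (2) is written exactly as reconstructed above:

```python
from math import comb

def bound1(I, J, K, L, R):
    """Sufficient uniqueness condition (1) [De Lathauwer, 2006]."""
    return (L * R <= I * J and
            min(I // L, R) + min(J // L, R) + min(K, R) >= 2 * (R + 1))

def bound2(I, J, K, L, R):
    """Sufficient uniqueness condition (2) [Nion & De Lathauwer, 2007]."""
    return (R <= min(I * J, K) and
            comb(I, L + 1) * comb(J, L + 1) >= comb(R + L, L + 1) - R)

# Setting of Example 1 later in the talk: I = J = 12, K = 50, L = 2, R = 3
print(bound1(12, 12, 50, 2, 3), bound2(12, 12, 50, 2, 3))
```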

Concluding remarks on algorithms

 Standard ALS is sometimes slow (swamps).
 ALS+ELS (drastically) reduces the swamp length at low additional complexity.
 Levenberg-Marquardt: very fast convergence and less sensitive to ill-conditioned data, but higher complexity and memory (the Jacobian matrix has IJK rows).
 Simultaneous diagonalization: a very attractive algorithm (low complexity and good accuracy).
 Important practical considerations:
- dimensionality reduction as a pre-processing step (e.g. via Tucker/HOSVD),
- find a good initialization if possible.
 Algorithms have to be adapted to include constraints specific to applications:
- preservation of specific matrix structures (Toeplitz, Vandermonde, etc.),
- Constant Modulus, Finite Alphabet, ... (e.g. in telecom applications),
- non-negativity constraints (e.g. in chemometrics applications).

BCD-(L_r, L_r, 1): estimation of R and L_r

Problem: given a tensor Y, how to estimate the number of terms R and the ranks L_r of the matrices A_r and B_r that yield a reasonable (L_r, L_r, 1) model

Y = (A_1 B_1^T) ∘ c_1 + ... + (A_R B_R^T) ∘ c_R ?

 Criterion 1 (simple approach): examine the singular values of the matrix unfoldings. With N = L_1 + ... + L_R:
- Y_{JI×K} generically has rank R if min(JI, K) ≥ R,
- Y_{IK×J} generically has rank N if min(IK, J) ≥ N,
- Y_{KJ×I} generically has rank N if min(KJ, I) ≥ N.

 If the noise level is not too high and the conditions on the dimensions are satisfied, the number of significant singular values yields an estimate of R and/or N.
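A minimal sketch of Criterion 1 with numpy; deciding which singular values are "significant" is left to inspection or a user-chosen threshold, as on the slide:

```python
import numpy as np

def unfolding_singular_values(Y):
    """Singular values of the three unfoldings of an I x J x K tensor Y.
    The number of significant values estimates R (third-mode unfolding)
    and N = L_1 + ... + L_R (first- and second-mode unfoldings)."""
    I, J, K = Y.shape
    s_R = np.linalg.svd(Y.reshape(I * J, K), compute_uv=False)                       # rank ~ R
    s_N1 = np.linalg.svd(Y.transpose(0, 2, 1).reshape(I * K, J), compute_uv=False)   # rank ~ N
    s_N2 = np.linalg.svd(Y.transpose(1, 2, 0).reshape(J * K, I), compute_uv=False)   # rank ~ N
    return s_R, s_N1, s_N2
```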

CORCONDIA (Core Consistency Diagnostic)

Core idea: PARAFAC can be seen as a particular case of the Tucker model, where the R x R x R core tensor H is diagonal (h_ijk = 1 if i = j = k, 0 otherwise):

Y = H ×_1 A ×_2 B ×_3 C

Method [Bro et al.]:
 Choose a set of plausible values for R.
 For a given test (i.e., for a given R), fit a PARAFAC model and compute the Least Squares estimate Ĥ of the core tensor.
 Measure the diagonality of the core tensor:

C = 100 ( 1 − || H − Ĥ ||_F^2 / R )

 Examine the core consistency measurements to select R.
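A sketch of the core consistency computation, given factors A, B, C from an already fitted PARAFAC model (the fitting itself is assumed to be done elsewhere); the least-squares core is obtained by projecting Y onto the pseudo-inverses of the factors:

```python
import numpy as np

def core_consistency(Y, A, B, C):
    """CORCONDIA for a fitted PARAFAC model with R components.
    Returns 100 * (1 - ||H - H_hat||_F^2 / R), where H is the ideal
    superdiagonal core and H_hat the least-squares Tucker core."""
    R = A.shape[1]
    # Least-squares core: H_hat = Y x1 A^+ x2 B^+ x3 C^+
    H_hat = np.einsum('ijk,li,mj,nk->lmn',
                      Y, np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    H = np.zeros((R, R, R))
    H[np.arange(R), np.arange(R), np.arange(R)] = 1.0    # ideal diagonal core
    return 100.0 * (1.0 - np.linalg.norm(H - H_hat) ** 2 / R)
```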

Block-(L_r, L_r, 1) CORCONDIA

Core idea: the BCD-(L_r, L_r, 1) can be seen as a particular case of the Tucker model, where the core tensor is "block-diagonal":

Y = (A_1 B_1^T) ∘ c_1 + ... + (A_R B_R^T) ∘ c_R = H ×_1 A ×_2 B ×_3 C,

with A = [A_1 ... A_R] (I x N), B = [B_1 ... B_R] (J x N), C = [c_1 ... c_R] (K x R), N = L_1 + ... + L_R, and H the N x N x R core tensor whose r-th frontal slice contains an L_r x L_r identity block in the r-th diagonal position and zeros elsewhere.
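A small numerical check of this equivalence: build the block-diagonal core explicitly and verify that the Tucker form reproduces the sum of rank-(L_r, L_r, 1) terms (sizes below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, Ls = 6, 5, 4, [2, 3]
R, N = len(Ls), sum(Ls)
A, B, C = rng.standard_normal((I, N)), rng.standard_normal((J, N)), rng.standard_normal((K, R))

# Block-diagonal core H (N x N x R): slice r holds an L_r x L_r identity block
H = np.zeros((N, N, R))
ofs = np.concatenate(([0], np.cumsum(Ls)))
for r in range(R):
    H[ofs[r]:ofs[r+1], ofs[r]:ofs[r+1], r] = np.eye(Ls[r])

Y_tucker = np.einsum('lmn,il,jm,kn->ijk', H, A, B, C)
Y_bcd = sum(np.einsum('ij,k->ijk',
                      A[:, ofs[r]:ofs[r+1]] @ B[:, ofs[r]:ofs[r+1]].T, C[:, r])
            for r in range(R))
assert np.allclose(Y_tucker, Y_bcd)
```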

Block-(L_r, L_r, 1) CORCONDIA

 Criterion 2: proceed in a way similar to CORCONDIA for PARAFAC.
- Choose a set of plausible values for R and L_r, r = 1, ..., R.
- For a given test (i.e., for given R and L_r's), fit a BCD-(L_r, L_r, 1) model and compute the Least Squares estimate Ĥ of the core tensor.
- Measure the block-diagonality of the core tensor:

C_COR = 100 ( 1 − || H − Ĥ ||_F^2 / (RL) )

- Examine the multiple core consistency measurements to select the most plausible parameters.

 Criterion 3: as for PARAFAC, it is better to couple Block-CORCONDIA with other criteria, e.g., examination of the relative fit to the (L_r, L_r, 1) model:

C_Fit = 100 ( 1 − || Y − Ŷ ||_F^2 / || Y ||_F^2 )

Block-(L_r, L_r, 1) CORCONDIA

 Example 1: I = 12, J = 12, K = 50, L = 2, R = 3 (L = L_1 = L_2 = L_3)
Complex data (random), SNR = 10 dB.
Test: R_try = {1, 2, 3, 4, 5, 6} and L_try = {1, 2, 3, 4}.
Note: for each (R, L) pair, the decomposition is computed via the ALS+ELS algorithm with 5 different starting points.

C_Fit (rows: R_try = 1, ..., 6; columns: L_try = 1, ..., 4):

R_try = 1:   22.3   36.6   38.1   39.3
R_try = 2:   38.6   66.6   67.8   69.1
R_try = 3:   56.3   91.2   91.3   91.4
R_try = 4:   71.4   91.5   91.7   91.8
R_try = 5:   84.1   91.7   91.9   92.1
R_try = 6:   91.5   92.0   92.3   92.4

C_COR (rows: R_try = 1, ..., 4; columns: L_try = 1, ..., 4):

R_try = 1:   100    100    100    100
R_try = 2:   99     99.8   < 0    < 0
R_try = 3:   98.9   99.4   < 0    < 0
R_try = 4:   84.8   30.9   < 0    < 0