The decomposition of a third-order tensor in R block-terms of rank-(L,L,1):
Model, Algorithms, Uniqueness, Estimation of R and L

Dimitri Nion & Lieven De Lathauwer
K.U. Leuven, Kortrijk campus, Belgium
E-mails: [email protected] [email protected]

TRICAP 2009, Nurià, Spain, June 14th-19th, 2009
Introduction

Tensor decompositions = powerful multi-linear algebra tools that generalize matrix decompositions.

Motivation: an increasing number of applications involves the manipulation of multi-way data, rather than 2-way data.

Key research axes:
- Development of new models/decompositions
- Development of algorithms to compute the decompositions
- Uniqueness of tensor decompositions
- Use of these tools in new applications, or in existing applications where the multi-way nature of the data was ignored until now
- Tensor decompositions under constraints (e.g., imposing non-negativity or specific algebraic structures)
From matrix SVD to tensor HOSVD

Matrix SVD:
Y = U D V^H = d11 u1 v1^H + ... + dRR uR vR^H

Tensor HOSVD (third-order case):
Y = H ×1 U ×2 V ×3 W,   i.e.,   y_ijk = Σ_{l=1}^{L} Σ_{m=1}^{M} Σ_{n=1}^{N} u_il v_jm w_kn h_lmn

- One unitary matrix (U, V, W) per mode.
- H is the representation of Y in the reduced spaces. We may have L ≠ M ≠ N.
- H is not diagonal (difference with the matrix SVD).
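The HOSVD above can be sketched in a few lines of numpy (an illustrative sketch, not code from the talk): each mode's factor is taken as the left singular vectors of the corresponding unfolding, and the core H is the projection of Y onto the three bases. With full (untruncated) factors the reconstruction is exact.

```python
import numpy as np

def unfold(Y, mode):
    # Mode-n unfolding: mode-n fibers become the columns.
    return np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)

def hosvd(Y):
    # Factor of each mode = left singular vectors of the mode-n unfolding.
    U, V, W = (np.linalg.svd(unfold(Y, n), full_matrices=False)[0] for n in range(3))
    # Core: H = Y x1 U^H x2 V^H x3 W^H, i.e. h_lmn = sum_ijk y_ijk u_il* v_jm* w_kn*
    H = np.einsum('ijk,il,jm,kn->lmn', Y, U.conj(), V.conj(), W.conj())
    return H, U, V, W

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 5, 6))
H, U, V, W = hosvd(Y)

# Reconstruction: y_ijk = sum_lmn u_il v_jm w_kn h_lmn
Y_rec = np.einsum('lmn,il,jm,kn->ijk', H, U, V, W)
assert np.allclose(Y, Y_rec)
```

Note that H is a full (non-diagonal) L × M × N core, in contrast with the diagonal D of the matrix SVD.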
From matrix SVD to PARAFAC

Matrix SVD:
Y = U D V^H = d11 u1 v1^H + ... + dRR uR vR^H   (D is diagonal)

PARAFAC decomposition:
Y = a1 ∘ b1 ∘ c1 + ... + aR ∘ bR ∘ cR, a sum of R rank-1 tensors: Y = Y1 + ... + YR.

In Tucker form, the core H is diagonal (if i = j = k, h_ijk = 1; else h_ijk = 0).

Slice-wise, Y is a set of K matrices of the form: Y(:,:,k) = A diag(C(k,:)) B^T.
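The equivalence between the sum-of-rank-1 form and the slice-wise form can be checked numerically (illustrative numpy sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# Sum of R rank-1 tensors: y_ijk = sum_r a_ir b_jr c_kr
Y = np.einsum('ir,jr,kr->ijk', A, B, C)

# Frontal slices: Y(:,:,k) = A diag(C(k,:)) B^T
for k in range(K):
    assert np.allclose(Y[:, :, k], A @ np.diag(C[k]) @ B.T)
```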
From PARAFAC/HOSVD to Block Component Decompositions (BCD) [De Lathauwer and Nion]

BCD in rank-(Lr,Lr,1) terms:
Y = (A1 B1^T) ∘ c1 + ... + (AR BR^T) ∘ cR,  with Ar (I × Lr), Br (J × Lr) and cr (K × 1).

BCD in rank-(Lr,Mr,·) terms:
Y = H1 ×1 A1 ×2 B1 + ... + HR ×1 AR ×2 BR,  with Hr (Lr × Mr × K), Ar (I × Lr) and Br (J × Mr).

BCD in rank-(Lr,Mr,Nr) terms:
Y = H1 ×1 A1 ×2 B1 ×3 C1 + ... + HR ×1 AR ×2 BR ×3 CR,  with Hr (Lr × Mr × Nr), Ar (I × Lr), Br (J × Mr) and Cr (K × Nr).
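A BCD-(Lr,Lr,1) tensor can be built directly from its blocks (a minimal numpy sketch; the block sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 6, 7, 8
L = [2, 3]                       # ranks L_r of the R = 2 block terms
A = [rng.standard_normal((I, Lr)) for Lr in L]
B = [rng.standard_normal((J, Lr)) for Lr in L]
c = [rng.standard_normal(K) for _ in L]

# Y = sum_r (A_r B_r^T) o c_r : each term is a rank-(Lr,Lr,1) tensor
Y = sum(np.einsum('ij,k->ijk', Ar @ Br.T, cr) for Ar, Br, cr in zip(A, B, c))

# Every frontal slice is sum_r c_r(k) * A_r B_r^T, hence has rank <= sum_r L_r
assert Y.shape == (I, J, K)
assert np.linalg.matrix_rank(Y[:, :, 0]) <= sum(L)
```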
Content of this talk

BCD-(Lr,Lr,1):
Y = (A1 B1^T) ∘ c1 + ... + (AR BR^T) ∘ cR

- Model ambiguities
- Algorithms
- Uniqueness
- Estimation of the parameters Lr (r = 1,...,R) and R
- An application in telecommunications
BCD-(Lr,Lr,1): Model ambiguities

Y = (A1 F1 F1^{-1} B1^T) ∘ c1 + ... + (AR FR FR^{-1} BR^T) ∘ cR,  for any non-singular Lr × Lr matrices Fr.

Unknown matrices:
A = [A1 ... AR] (I × Σ Lr),  B = [B1 ... BR] (J × Σ Lr),  C = [c1 ... cR] (K × R).

The BCD-(Lr,Lr,1) is said to be essentially unique if the only ambiguities are:
- an arbitrary permutation of the R blocks in A and B and of the R columns of C;
- each block of A and B post-multiplied by an arbitrary non-singular matrix, and each column of C arbitrarily scaled.

Equivalently: A and B are estimated up to multiplication by a block-wise permuted block-diagonal matrix, and C up to multiplication by a permuted diagonal matrix.
BCD-(Lr,Lr,1): Algorithms

Usual approach: estimate A, B and C by minimization of

Φ = || Y − Σ_{r=1}^{R} (Ar Br^T) ∘ cr ||_F^2   (∘ = outer product)

The model is fitted for a given choice of the parameters {Lr, R}.

Exploit the algebraic structure of the three matrix unfoldings Y_{I×KJ}, Y_{J×IK} and Y_{K×JI}, obtained by concatenating the slices of Y along each mode.
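The exact column ordering of the three unfoldings is a convention; below is one consistent numpy choice (illustrative, not necessarily the ordering used in the original papers):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 3, 4, 5
Y = rng.standard_normal((I, J, K))

# One possible convention for the three unfoldings (only consistency matters):
Y_IxKJ = Y.transpose(0, 2, 1).reshape(I, K * J)   # rows indexed by i
Y_JxIK = Y.transpose(1, 0, 2).reshape(J, I * K)   # rows indexed by j
Y_KxJI = Y.transpose(2, 1, 0).reshape(K, J * I)   # rows indexed by k

# Sanity check: entry (i,j,k) sits at the expected position in each unfolding
i, j, k = 1, 2, 3
assert Y_IxKJ[i, k * J + j] == Y[i, j, k]
assert Y_JxIK[j, i * K + k] == Y[i, j, k]
assert Y_KxJI[k, j * I + i] == Y[i, j, k]
```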
BCD-(Lr,Lr,1): ALS Algorithm

Y_{K×JI} = C · Z1(B, A)     Φ = || Y_{K×JI} − C · Z1(B, A) ||_F^2     (1)
Y_{J×IK} = B · Z2(A, C)     Φ = || Y_{J×IK} − B · Z2(A, C) ||_F^2     (2)
Y_{I×KJ} = A · Z3(C, B)     Φ = || Y_{I×KJ} − A · Z3(C, B) ||_F^2     (3)

Z1, Z2 and Z3 are built from 2 matrices only and have a block-wise Khatri-Rao product structure.

Initialisation: Â(0), B̂(0), k = 1
while Φ(k−1) − Φ(k) > ε (e.g., ε = 10^-6):
    Ĉ(k) = Y_{K×JI} · [Z1(B̂(k−1), Â(k−1))]†     (1)
    B̂(k) = Y_{J×IK} · [Z2(Â(k−1), Ĉ(k))]†       (2)
    Â(k) = Y_{I×KJ} · [Z3(Ĉ(k), B̂(k))]†         (3)
    k ← k + 1
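One conditional update of the loop can be sketched in numpy. This is an illustration under an assumed unfolding convention (Y_{K×JI}[k, j·I+i] = y_ijk), with equal ranks L; on noiseless data the LS update recovers C exactly from the true A and B:

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K, R, L = 6, 7, 8, 2, 2
A = [rng.standard_normal((I, L)) for _ in range(R)]
B = [rng.standard_normal((J, L)) for _ in range(R)]
C = rng.standard_normal((K, R))

# Noiseless BCD-(L,L,1) tensor and its k-mode unfolding (Y_KxJI[k, j*I+i] = y_ijk)
Y = sum(np.einsum('ij,k->ijk', A[r] @ B[r].T, C[:, r]) for r in range(R))
Y_KxJI = Y.transpose(2, 1, 0).reshape(K, J * I)

# Z1(B, A): row r is vec(B_r A_r^T), so that Y_KxJI = C @ Z1
Z1 = np.stack([(B[r] @ A[r].T).reshape(-1) for r in range(R)])
assert np.allclose(Y_KxJI, C @ Z1)

# Conditional LS update of C given (A, B): C = Y_KxJI @ pinv(Z1)
C_hat = Y_KxJI @ np.linalg.pinv(Z1)
assert np.allclose(C_hat, C)
```

The B and A updates have the same shape, with block-structured Z2 and Z3.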
ALS algorithm: the problem of swamps

Observation: ALS is fast in many problems, but sometimes a long swamp is encountered before convergence (in the example shown: 27000 iterations!).

Long swamps typically occur when:
- the loading matrices of the decomposition (i.e., the objective matrices) are ill-conditioned;
- the updated matrices become ill-conditioned (impact of the initialization);
- one of the R tensor components in Y = Y1 + ... + YR has a much higher norm than the R−1 others (e.g., the "near-far" effect in telecommunications).
Improvement 1 of ALS: Line Search

Purpose: reduce the length of swamps.
Principle: at each iteration, interpolate A, B and C from their estimates at the 2 previous iterations and use the interpolated matrices as input of the ALS update.

1. Line Search (search directions):
A(new) = A(k−2) + ρ (A(k−1) − A(k−2))
B(new) = B(k−2) + ρ (B(k−1) − B(k−2))
C(new) = C(k−2) + ρ (C(k−1) − C(k−2))

2. Then ALS update:
Ĉ(k) = Y_{K×JI} · [Z1(B̂(new), Â(new))]†     (1)
B̂(k) = Y_{J×IK} · [Z2(Â(new), Ĉ(k))]†       (2)
Â(k) = Y_{I×KJ} · [Z3(Ĉ(k), B̂(k))]†         (3)
k ← k + 1

The choice of ρ is crucial; ρ = 1 annihilates the LS step (i.e., we get standard ALS).
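The interpolation step is one line per matrix; a small sketch (the helper name is illustrative):

```python
import numpy as np

def line_search_step(X_prev2, X_prev1, rho):
    # X(new) = X(k-2) + rho * (X(k-1) - X(k-2)).
    # rho = 1 returns X(k-1), i.e. the plain ALS iterate (LS step annihilated).
    return X_prev2 + rho * (X_prev1 - X_prev2)

A_prev2 = np.zeros((3, 2))
A_prev1 = np.ones((3, 2))
assert np.allclose(line_search_step(A_prev2, A_prev1, 1.0), A_prev1)   # standard ALS
assert np.allclose(line_search_step(A_prev2, A_prev1, 1.25), 1.25 * np.ones((3, 2)))
```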
Improvement 1 of ALS: choosing the Line Search step ρ

[Harshman, 1970] "LSH": choose ρ = 1.25.

[Bro, 1997] "LSB": choose ρ = k^{1/3} and validate the LS step only if it decreases the fit.

[Rajih, Comon, 2005] "Enhanced Line Search (ELS)": for REAL tensors,
Φ(A(new), B(new), C(new)) = Φ(ρ) is a 6th-degree polynomial in ρ.
The optimal ρ is the real stationary point that minimizes Φ(ρ).

[Nion, De Lathauwer, 2006] "Enhanced Line Search with Complex Step (ELSCS)": for COMPLEX tensors, look for the optimal ρ = m·e^{iθ}. We have Φ(A(new), B(new), C(new)) = Φ(m, θ). Alternate update of m and θ:
- update m: for θ fixed, ∂Φ(m, θ)/∂m is a 5th-order polynomial in m;
- update θ: for m fixed, ∂Φ(m, θ)/∂θ leads to a 6th-order polynomial in t = tan(θ/2).
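In the real ELS algorithm, the coefficients of the polynomial Φ(ρ) are derived in closed form from the current ALS matrices; the sketch below only illustrates the final minimization step (given the coefficients, pick the real stationary point with the smallest Φ):

```python
import numpy as np

def els_rho(phi_coeffs):
    # phi_coeffs: coefficients of the polynomial Phi(rho), highest degree first
    # (the convention of np.polyval / np.poly1d).
    dphi = np.polyder(np.poly1d(phi_coeffs))
    roots = dphi.roots
    real_roots = roots[np.abs(roots.imag) < 1e-10].real
    # Optimal step = real stationary point minimizing Phi
    return min(real_roots, key=lambda r: np.polyval(phi_coeffs, r))

# Toy check with Phi(rho) = (rho - 2)^2 + 1, minimized at rho = 2:
assert abs(els_rho([1.0, -4.0, 5.0]) - 2.0) < 1e-8
```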
Improvement 1 of ALS: Line Search — results

[Figure: convergence curves for an "easy" problem and a "difficult" problem; with ELS, the swamp of 27000 standard-ALS iterations shrinks to about 2000 iterations.]

ELS: large reduction of the number of iterations at a very low additional complexity w.r.t. standard ALS.
Improvement 2 of ALS: Dimensionality reduction

STEP 1: HOSVD of Y (compression): Y = H ×1 U ×2 V ×3 W, with a small core H of size L × M × N.
STEP 2: BCD of the small core tensor H (in the compressed space).
STEP 3: come back to the original space + a few refinement iterations in the original space.

Compression yields a large reduction of the cost per iteration, since the model is fitted in the compressed space.
Improvement 3 of ALS: Good initialization

[Figure: comparison of ALS and ALS+ELS with three random initializations.]

Instead of using random initializations, could we use the observed tensor itself? YES. For the BCD-(L,L,1), if A and B are full column rank (so I and J have to be large enough), there is an easy way to find a good initialization, in the same spirit as the Direct Trilinear Decomposition (DTLD) used to initialize PARAFAC (not detailed in this talk).
Other algorithms

Existing algorithms for PARAFAC can be adapted to Block Component Decompositions. Examples: the Levenberg-Marquardt algorithm (a Gauss-Newton type method) and Simultaneous Diagonalization (SD) algorithms; let us say a few words on the latter technique.

SD for PARAFAC (De Lathauwer, 2006). Initial condition to reformulate PARAFAC in terms of SD:
min(IJ, K) ≥ R

The PARAFAC decomposition can then be computed by solving an SD problem:
M_n = W D_n W^T,  n = 1,...,R,  where D_n is R × R diagonal.

Advantage: low complexity (only R matrices of size R×R to diagonalize + direct use of existing fast algorithms designed for SD). The SD reformulation also yields a uniqueness bound that is generically more relaxed than the Kruskal bound:
K ≥ R  and  [I(I−1)/2] · [J(J−1)/2] ≥ R(R−1)/2
BCD-(L,L,1): computation via Simultaneous Diagonalization (Nion & De Lathauwer, 2007)

Results established for the BCD-(L,L,1), i.e., the same L for the R terms. Initial condition to reformulate the BCD-(L,L,1) in terms of SD:
min(IJ, K) ≥ R

The decomposition can then be computed by solving an SD problem:
M_n = W D_n W^T,  n = 1,...,R,  where D_n is R × R diagonal.

Advantage: low complexity (only R matrices of size R×R to diagonalize + direct use of existing fast algorithms designed for SD). The SD reformulation yields a new, more relaxed uniqueness bound (next slide).
BCD-(L,L,1): Uniqueness (Nion & De Lathauwer, 2007)

Sufficient bound 1 [De Lathauwer, 2006]:
LR ≤ IJ  and  min(⌊I/L⌋, R) + min(⌊J/L⌋, R) + min(K, R) ≥ 2(R + 1)     (1)

Sufficient bound 2 [Nion & De Lathauwer, 2007]:
R ≤ min(IJ, K)  and  C^I_{L+1} · C^J_{L+1} ≥ C^{R+L}_{L+1} − R     (2)

where C^n_k = n! / (k! (n−k)!).

The new bound (2) is generically much more relaxed.
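Bound (2) is easy to evaluate numerically. The sketch below assumes the reconstructed form of the inequality above, with C^n_k the usual binomial coefficient (function name is illustrative):

```python
from math import comb

def bound2_holds(I, J, K, L, R):
    # Sufficient uniqueness condition (2), as reconstructed from the slide:
    # R <= min(IJ, K) and C(I, L+1) * C(J, L+1) >= C(R+L, L+1) - R,
    # where C(n, k) = n! / (k! (n-k)!).
    return (min(I * J, K) >= R
            and comb(I, L + 1) * comb(J, L + 1) >= comb(R + L, L + 1) - R)

# Example: I = J = 8, K = 10, L = 2, R = 4 satisfies the condition
assert bound2_holds(8, 8, 10, 2, 4)
# Example where it fails: I = J = 3, K = 10, L = 2, R = 8
assert not bound2_holds(3, 3, 10, 2, 8)
```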
Concluding remarks on algorithms

- Standard ALS is sometimes slow (swamps).
- ALS+ELS (drastically) reduces the swamp length at a low additional complexity.
- Levenberg-Marquardt: very fast convergence, less sensitive to ill-conditioned data, but higher complexity and memory (the dimensions of the Jacobian matrix grow with IJK).
- Simultaneous diagonalization: a very attractive algorithm (low complexity and good accuracy).

Important practical considerations:
- dimensionality-reduction pre-processing step (e.g., via Tucker/HOSVD);
- find a good initialization if possible.

Algorithms have to be adapted to include constraints specific to applications:
- preservation of specific matrix structures (Toeplitz, Vandermonde, etc.);
- Constant Modulus, Finite Alphabet, ... (e.g., in telecom applications);
- non-negativity constraints (e.g., in chemometrics applications).
BCD-(Lr,Lr,1): estimation of R and Lr

Problem: given a tensor Y, how to estimate the number of terms R and the rank Lr of the matrices Ar and Br that yield a reasonable (Lr,Lr,1) model?

Y = (A1 B1^T) ∘ c1 + ... + (AR BR^T) ∘ cR

Criterion 1 (simple approach): examine the singular values of the matrix unfoldings. With N = Σ_{r=1}^{R} Lr:
- Y (JI × K) generically has rank R if min(JI, K) ≥ R;
- Y (IK × J) generically has rank N if min(IK, J) ≥ N;
- Y (KJ × I) generically has rank N if min(KJ, I) ≥ N.

If the noise level is not too high and the conditions on the dimensions are satisfied, the number of significant singular values yields an estimate of R and/or N.
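Criterion 1 can be checked on noiseless synthetic data (illustrative numpy sketch; the rank is read off as the number of singular values above a relative threshold):

```python
import numpy as np

def estimate_rank(M, tol=1e-8):
    # Number of singular values above tol * largest singular value.
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(5)
I, J, K, R, L = 8, 9, 20, 3, 2            # N = R * L = 6
A = [rng.standard_normal((I, L)) for _ in range(R)]
B = [rng.standard_normal((J, L)) for _ in range(R)]
C = rng.standard_normal((K, R))
Y = sum(np.einsum('ij,k->ijk', A[r] @ B[r].T, C[:, r]) for r in range(R))

# Unfolding with K rows -> rank R; unfoldings with I or J rows -> rank N = sum L_r
assert estimate_rank(Y.transpose(2, 1, 0).reshape(K, -1)) == R
assert estimate_rank(Y.transpose(0, 2, 1).reshape(I, -1)) == R * L
assert estimate_rank(Y.transpose(1, 0, 2).reshape(J, -1)) == R * L
```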
CORCONDIA (Core Consistency Diagnostic)

Core idea: PARAFAC can be seen as a particular case of the Tucker model where the core tensor H is diagonal (if i = j = k, h_ijk = 1; else h_ijk = 0).

Method [Bro et al.]: choose a set of plausible values for R. For a given test (i.e., for a given R), fit a PARAFAC model, compute the Least Squares estimate Ĥ of the core tensor, and measure the diagonality of the core:

C = 100 (1 − ||H − Ĥ||_F^2 / R)

Examine the core consistency measurements to select R.
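The diagnostic can be sketched in numpy (illustrative; the LS core estimate uses the pseudo-inverses of the fitted factors, Ĥ = Y ×1 A† ×2 B† ×3 C†):

```python
import numpy as np

def corcondia(Y, A, B, C):
    # LS estimate of the Tucker core for the fitted factors.
    H_hat = np.einsum('ijk,li,mj,nk->lmn',
                      Y, np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    R = A.shape[1]
    # Ideal PARAFAC core: superdiagonal of ones (h_ijk = 1 iff i = j = k).
    H = np.zeros((R, R, R))
    H[np.arange(R), np.arange(R), np.arange(R)] = 1.0
    return 100.0 * (1.0 - np.linalg.norm(H_hat - H) ** 2 / R)

# For an exact PARAFAC tensor and its true factors, the diagnostic is 100.
rng = np.random.default_rng(6)
I, J, K, R = 5, 6, 7, 3
A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
Y = np.einsum('ir,jr,kr->ijk', A, B, C)
assert abs(corcondia(Y, A, B, C) - 100.0) < 1e-6
```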
Block-(Lr,Lr,1) CORCONDIA

Core idea: the BCD-(Lr,Lr,1) can be seen as a particular case of the Tucker model where the core tensor is "block-diagonal":

Y = H ×1 [A1 ... AR] ×2 [B1 ... BR] ×3 C,

where H is N × N × R, with N = Σ_{r=1}^{R} Lr, and the r-th frontal slice of H is zero except for the identity on the r-th Lr × Lr diagonal block.
Block-(Lr,Lr,1) CORCONDIA

Criterion 2: we can proceed in a way similar to CORCONDIA for PARAFAC. Choose a set of plausible values for R and Lr, r = 1,...,R. For a given test (i.e., for given R and Lr's), fit a BCD-(Lr,Lr,1) model, compute the Least Squares estimate Ĥ of the core tensor, and measure the block-diagonality of the core:

C_COR = 100 (1 − ||H − Ĥ||_F^2 / (RL))

Examine the multiple core consistency measurements to select the most plausible parameters.

Criterion 3: as for PARAFAC, it is better to couple Block-CORCONDIA with other criteria, e.g., examination of the relative fit of the (Lr,Lr,1) model:

C_Fit = 100 (1 − ||Y − Ŷ||_F^2 / ||Y||_F^2)
Block-(Lr,Lr,1) CORCONDIA: Example 1

I = 12, J = 12, K = 50, L = 2, R = 3 (L = L1 = L2 = L3). Complex data (random), SNR = 10 dB.
Test: Rtry = {1,2,3,4,5,6} and Ltry = {1,2,3,4}.
Note: for each (R, L) pair, the decomposition is computed via the ALS+ELS algorithm with 5 different starting points.

C_Fit:

  Rtry\Ltry   1      2      3      4
  1          22.3   36.6   38.1   39.3
  2          38.6   66.6   67.8   69.1
  3          56.3   91.2   91.3   91.4
  4          71.4   91.5   91.7   91.8
  5          84.1   91.7   91.9   92.1
  6          91.5   92.0   92.3   92.4

C_COR:

  Rtry\Ltry   1      2      3      4
  1          100    100    100    100
  2          99     99.8   < 0    < 0
  3          98.9   99.4   < 0    < 0
  4          84.8   30.9   < 0    < 0

The fit keeps increasing slowly with R and L, whereas the core consistency collapses as soon as R or L is overestimated, pointing to the true pair (R, L) = (3, 2).