Computing Nonnegative Matrix Factorizations
Nicolas Gillis, joint work with François Glineur, Robert Luce, Stephen Vavasis, Arnaud Vandaele, Jérémy Cohen
Where is Mons?
Nonnegative Matrix Factorization (NMF)

Given a matrix M ∈ R_+^{p×n} and a factorization rank r ≪ min(p, n), find U ∈ R^{p×r} and V ∈ R^{r×n} such that

    min_{U≥0, V≥0} ||M − UV||²_F = Σ_{i,j} (M − UV)²_{ij}.    (NMF)

NMF is a linear dimensionality reduction technique for nonnegative data:

    M(:, i) ≈ Σ_{k=1}^r U(:, k) V(k, i) for all i,  with M(:, i) ≥ 0, U(:, k) ≥ 0 and V(k, i) ≥ 0.

Why nonnegativity?
→ Interpretability: nonnegativity constraints lead to easily interpretable factors (and a sparse, part-based representation).
→ Many applications: image processing, text mining, hyperspectral unmixing, community detection, clustering, etc.
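As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the (NMF) objective together with the classical Lee-Seung multiplicative updates, which keep both factors nonnegative; the matrix sizes and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 30, 20, 5  # arbitrary sizes for this sketch

# Synthetic nonnegative data with an exact rank-r structure.
M = rng.random((p, r)) @ rng.random((r, n))

def nmf_error(M, U, V):
    """The (NMF) objective ||M - UV||_F^2."""
    return np.linalg.norm(M - U @ V, "fro") ** 2

# Random nonnegative initialization.
U, V = rng.random((p, r)), rng.random((r, n))
e0 = nmf_error(M, U, V)

# Lee-Seung multiplicative updates: nonnegativity is preserved because
# every update multiplies the current factor by a nonnegative ratio.
for _ in range(200):
    V *= (U.T @ M) / (U.T @ U @ V + 1e-12)
    U *= (M @ V.T) / (U @ V @ V.T + 1e-12)
```

After the loop, the objective has decreased from its random starting value; multiplicative updates are simple but are known to converge slowly compared to the schemes discussed later.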
Example 1: Blind hyperspectral unmixing

Figure: Urban hyperspectral image, 162 spectral bands and 307-by-307 pixels.

Problem. Identify the materials and classify the pixels.
Linear mixing model
Example 1: Blind hyperspectral unmixing with NMF

Basis elements recover the different endmembers: U ≥ 0;
Abundances of the endmembers in each pixel: V ≥ 0.
Urban hyperspectral image

Figure: Decomposition of the Urban dataset.
Example 2: topic recovery and document classification

Basis elements recover the different topics;
Weights assign each document to its corresponding topics.
Example 3: feature extraction and classification

The basis elements extract facial features such as eyes, nose and lips.
Outline

1. Computational complexity
2. Standard non-linear optimization schemes and acceleration
3. Exact NMF (M = UV) and its geometric interpretation
4. NMF under the separability assumption
Computational Complexity of NMF
Complexity of NMF

    min_{U ∈ R^{p×r}, V ∈ R^{r×n}} ||M − UV||²_F such that U ≥ 0, V ≥ 0.

For r = 1, the problem is solvable via the Eckart-Young and Perron-Frobenius theorems.

Checking whether there exists an exact factorization M = UV is NP-hard when p, n and r are not fixed (Vavasis, 2009).

Using quantifier elimination (a reformulation with a fixed number of variables):
- Cohen and Rothblum [1991]: (mn)^{O(mr+nr)}, non-polynomial;
- Arora et al. [2012]: (mn)^{O(2^r)}, polynomial (for fixed r);
- Moitra [2013]: (mn)^{O(r²)}, polynomial (for fixed r).
→ not really useful in practice...

This does not imply that rank₊(M) (the minimum r such that M = UV) can be computed in polynomial time, because there is no upper bound on rank₊.
Complexity for other norms

    min_{u ∈ R^p, v ∈ R^n} ||M − uvᵀ||₁ = Σ_{i,j} |M_ij − u_i v_j|.    (ℓ₁ norm)

If M is binary, M ∈ {0,1}^{m×n}, any optimal solution (u*, v*) can be assumed to be binary, that is, (u*, v*) ∈ {0,1}^p × {0,1}^n.

    min_{u ∈ R^p, v ∈ R^n} ||M − uvᵀ||²_W = Σ_{i,j} W_ij (M − uvᵀ)²_ij,    (weighted ℓ₂ norm)

where W is a nonnegative weight matrix. This model can be used when data is missing (W_ij = 0 for missing entries) or entries have different variances (W_ij = 1/σ²_ij).

G., Vavasis, On the Complexity of Robust PCA and ℓ₁-Norm Low-Rank Matrix Approximation, Mathematics of Operations Research, 2018.
G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, SIAM J. Matrix Anal. Appl., 2011.
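For the weighted ℓ₂ model, a small illustrative sketch (not from the talk) of the rank-one case, using alternating weighted least squares, whose closed-form updates follow by zeroing the gradient in u (resp. v); the sizes and the 20% missing-entry rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 12, 9
M = rng.random((m, 1)) @ rng.random((1, n))  # exact rank-1 nonnegative data

# W_ij = 0 marks a missing entry, W_ij = 1 an observed one.
W = (rng.random((m, n)) > 0.2).astype(float)

def weighted_error(M, W, u, v):
    """The weighted l2 objective: sum_ij W_ij (M_ij - u_i v_j)^2."""
    return np.sum(W * (M - np.outer(u, v)) ** 2)

u, v = rng.random(m), rng.random(n)
e0 = weighted_error(M, W, u, v)

# Alternating weighted least squares: each update is the closed-form
# minimizer of the weighted objective in one factor, the other fixed:
# u_i = sum_j W_ij M_ij v_j / sum_j W_ij v_j^2 (and symmetrically for v).
for _ in range(100):
    u = (W * M) @ v / (W @ (v ** 2) + 1e-12)
    v = (W * M).T @ u / (W.T @ (u ** 2) + 1e-12)
```

On this exactly rank-one, mostly observed instance the alternating scheme fits the observed entries almost exactly; the NP-hardness result above concerns the worst case, not such easy instances.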
NMF Algorithms and Acceleration
NMF Algorithms

Given a matrix M ∈ R_+^{m×n} and a factorization rank r ∈ N:

    min_{U ∈ R_+^{m×r}, V ∈ R_+^{r×n}} ||M − UV||²_F = Σ_{i,j} (M − UV)²_{ij}.    (NMF)

This is a difficult non-linear optimization problem with potentially many local minima.

Standard framework:
0. Initialize (U, V). Then, alternately update U and V:
1. Update V ≈ argmin_{X≥0} ||M − UX||²_F.    (NNLS)
2. Update U ≈ argmin_{Y≥0} ||M − YV||²_F.    (NNLS)

Most NMF algorithms come with no guarantees (except convergence to stationary points). The solution is in general highly non-unique: identifiability issues.
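The two NNLS steps of the standard framework can be sketched with SciPy's `nnls` solver (an illustration, not the speaker's implementation; sizes and iteration count are arbitrary):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
m, n, r = 20, 15, 4
M = rng.random((m, r)) @ rng.random((r, n))  # exact rank-r nonnegative data

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||B - AX||_F column by column."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

U, V = rng.random((m, r)), rng.random((r, n))
e0 = np.linalg.norm(M - U @ V, "fro")

for _ in range(20):
    V = nnls_columns(U, M)        # 1. V = argmin_{X>=0} ||M - UX||_F
    U = nnls_columns(V.T, M.T).T  # 2. U = argmin_{Y>=0} ||M - YV||_F
```

Each step solves its NNLS subproblem exactly, so the objective is monotonically nonincreasing; only convergence to a stationary point is guaranteed, as stated above.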
Block coordinate descent method

Use block coordinate descent on the NNLS subproblems → closed-form solutions for the columns of U and rows of V:

    ∀k, U:k* = argmin_{U:k ≥ 0} ||R_k − U:k V_k:||²_F = max(0, R_k V_k:ᵀ / ||V_k:||²₂),

where R_k = M − Σ_{j≠k} U:j V_j:, and similarly for V. This is the so-called HALS algorithm.

It can be accelerated:
1. Gauss-Seidel coordinate descent (Hsieh, Dhillon, 2011).
2. Loop several times over the columns of U / rows of V to perform more iterations at a lower computational cost (Glineur, G., 2012).
3. Randomized shuffling (Chow, Wu, Yin, 2017).
4. Use an extrapolation step: Ŵ^{(k+1)} = W^{(k+1)} + β_k (W^{(k+1)} − W^{(k)}) (Ang, G., 2018).
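The HALS column update can be sketched as follows (a minimal illustration, not the authors' code); note that the residual R_k is never formed explicitly: the rank-one correction is folded into the precomputed products M Vᵀ and V Vᵀ:

```python
import numpy as np

def hals_sweep(M, U, V):
    """One HALS sweep over the columns of U, using the closed form
    U[:, k] <- max(0, R_k V_k:^T / ||V_k:||_2^2),
    with R_k = M - sum_{j != k} U[:, j] V[j, :]."""
    MVt = M @ V.T   # m x r, computed once per sweep
    VVt = V @ V.T   # r x r
    for k in range(U.shape[1]):
        # R_k V_k:^T = (M V^T)[:, k] - U (V V^T)[:, k] + U[:, k] (V V^T)[k, k]
        num = MVt[:, k] - U @ VVt[:, k] + U[:, k] * VVt[k, k]
        U[:, k] = np.maximum(0, num / max(VVt[k, k], 1e-12))
    return U

rng = np.random.default_rng(0)
m, n, r = 20, 15, 4
M = rng.random((m, r)) @ rng.random((r, n))
U, V = rng.random((m, r)), rng.random((r, n))
e0 = np.linalg.norm(M - U @ V, "fro")

for _ in range(50):
    U = hals_sweep(M, U, V)          # update the columns of U
    V = hals_sweep(M.T, V.T, U.T).T  # same update applied to the rows of V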
Illustration on the CBCL face image data set
Exact NMF: Geometry and Extended Formulations
Geometric interpretation of exact NMF

Given M = UV, one can scale M and U so that they become column stochastic, implying that V is column stochastic:

    M = UV ⇐⇒ M′ = M D_M = (U D_U)(D_U⁻¹ V D_M) = U′V′.

The columns of M are convex combinations of the columns of U:

    M:j = Σ_{i=1}^k U:i V_ij  with  Σ_{i=1}^k V_ij = 1 ∀j, V_ij ≥ 0 ∀i,j.

In other terms,

    conv(M) ⊆ conv(U) ⊆ Sⁿ,

where conv(X) is the convex hull of the columns of X, and Sⁿ = {x ∈ Rⁿ | x ≥ 0, Σ_{i=1}^n x_i = 1} is the unit simplex.

Exact NMF ≡ find r points whose convex hull is nested between two given polytopes.
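The scaling argument can be checked numerically; a small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 6, 8, 3
U = rng.random((p, r))
V = rng.random((r, n))
M = U @ V  # exact NMF, all entries positive

# D_M and D_U normalize the columns of M and U to sum to one.
DM = np.diag(1.0 / M.sum(axis=0))
DU = np.diag(1.0 / U.sum(axis=0))

Mp = M @ DM                      # M' = M D_M, column stochastic
Up = U @ DU                      # U' = U D_U, column stochastic
Vp = np.linalg.inv(DU) @ V @ DM  # V' = D_U^{-1} V D_M

# M' = U' V', and V' is automatically column stochastic:
# 1^T M' = 1^T U' V' implies 1^T V' = 1^T since 1^T U' = 1^T.
```

This is exactly the normalization that turns exact NMF into the nested-polytope problem stated above.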
Geometric interpretation of NMF

Example: two nested hexagons (rank(M_a) = 3):

    M_a = (1/a) ·
    [  1     a    2a−1  2a−1   a     1  ]
    [  1     1     a    2a−1  2a−1   a  ]
    [  a     1     1     a    2a−1  2a−1]
    [ 2a−1   a     1     1     a    2a−1]
    [ 2a−1  2a−1   a     1     1     a  ]
    [  a    2a−1  2a−1   a     1     1  ],   a > 1.
Geometric interpretation of NMF

Case 1: a = 2, rank₊(M_a) = 3, col(M) = col(U).

Figure: Δᵖ ∩ col(M₂), conv(M₂) and conv(U).
Geometric interpretation of NMF

Case 2: a = 3, rank₊(M_a) = 4, col(M) = col(U).

Figure: Δᵖ ∩ col(M₃), conv(M₃) and conv(U).
Geometric interpretation of NMF

Case 3: a → +∞, rank₊(M_a) = 5, col(M) ≠ col(U).
An amazing result: NMF and extended formulations

Let P be a polytope P = {x ∈ Rᵏ | b_i − A(i,:)x ≥ 0 for 1 ≤ i ≤ m}, and let the v_j (1 ≤ j ≤ n) be its vertices.

We define the m-by-n slack matrix S_P of P as follows:

    S_P(i, j) = b_i − A(i,:) v_j ≥ 0,   1 ≤ i ≤ m, 1 ≤ j ≤ n.

The hexagon:

    S_P =
    [ 0 0 1 2 2 1 ]
    [ 1 0 0 1 2 2 ]
    [ 2 1 0 0 1 2 ]
    [ 2 2 1 0 0 1 ]
    [ 1 2 2 1 0 0 ]
    [ 0 1 2 2 1 0 ]
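The hexagon's slack matrix is circulant, and a quick numerical check (illustrative) confirms that its usual rank is 3, matching a polygon living in a 2-dimensional affine space:

```python
import numpy as np

# Slack matrix of the regular hexagon:
# circulant with first row [0, 0, 1, 2, 2, 1].
first_row = np.array([0, 0, 1, 2, 2, 1])
SP = np.array([np.roll(first_row, i) for i in range(6)])

print(np.linalg.matrix_rank(SP))  # -> 3
```

The rank is dim(P) + 1 = 3 here, while the nonnegative rank, discussed next, is strictly larger.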
An amazing result: NMF and extended formulations

An extended formulation of P is a higher-dimensional polyhedron Q ⊆ R^{k+p} that (linearly) projects onto P. The minimum number of facets of such a polyhedron is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991). rank₊(S_P) = xp(P).

Proof (one direction). Given P = {x ∈ Rᵏ | b − Ax ≥ 0}, any exact NMF S_P = UV with U ≥ 0, V ≥ 0 provides an explicit extended formulation (with some redundant equalities) of P:

    P = {x | b − Ax ≥ 0} = {x | b − Ax = Uy and y ≥ 0}.

Remark. The slack matrix S_P of P satisfies conv(S_P) = Sᵐ ∩ col(S_P). To get a small factorization, we need to go to a higher-dimensional space: rank(U) > rank(M).
The Hexagon

    S_P =
    [ 0 0 1 2 2 1 ]
    [ 1 0 0 1 2 2 ]
    [ 2 1 0 0 1 2 ]
    [ 2 2 1 0 0 1 ]
    [ 1 2 2 1 0 0 ]
    [ 0 1 2 2 1 0 ]

admits the exact NMF S_P = U V with

    U =
    [ 0 2 0 0 1 ]
    [ 0 2 1 0 0 ]
    [ 0 0 2 1 0 ]
    [ 2 0 1 0 0 ]
    [ 2 0 0 0 1 ]
    [ 0 0 0 1 2 ]

    V =
    [ 1/2  1  1/2   0   0   0  ]
    [  0   0   0   1/2  1  1/2 ]
    [  1   0   0    0   0   1  ]
    [  0   1   0    0   1   0  ]
    [  0   0   1    1   0   0  ]

so that rank(S_P) = 3 ≤ rank₊(S_P) = 5 ≤ min(m, n) = 6.
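This rank-5 exact NMF can be verified directly; the factors below are one valid arrangement of the slide's factorization (rows and inner dimension may be permuted relative to the original layout):

```python
import numpy as np

first_row = np.array([0, 0, 1, 2, 2, 1])
SP = np.array([np.roll(first_row, i) for i in range(6)])  # hexagon slack matrix

U = np.array([[0, 2, 0, 0, 1],
              [0, 2, 1, 0, 0],
              [0, 0, 2, 1, 0],
              [2, 0, 1, 0, 0],
              [2, 0, 0, 0, 1],
              [0, 0, 0, 1, 2]], dtype=float)
V = np.array([[0.5, 1, 0.5, 0,   0, 0  ],
              [0,   0, 0,   0.5, 1, 0.5],
              [1,   0, 0,   0,   0, 1  ],
              [0,   1, 0,   0,   1, 0  ],
              [0,   0, 1,   1,   0, 0  ]])

# Both factors are nonnegative and the product reproduces SP exactly,
# with inner dimension 5 < min(m, n) = 6.
assert np.all(U >= 0) and np.all(V >= 0)
assert np.allclose(U @ V, SP)
```

By Yannakakis' theorem, this certifies that the hexagon has an extended formulation with only 5 facets.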
Some implications

Problem (limits of LP for solving combinatorial problems): given a polytope, what is the most compact way to represent it?

Its extension complexity = the nonnegative rank of its slack matrix.

Key tool: lower-bound techniques for the nonnegative rank.

Ex. The matching problem cannot be solved via a polynomial-size LP.
Rothvoss (2014). The matching polytope has exponential extension complexity, STOC.

This can be generalized to approximations (no poly-size LP can approximate these problems up to some precision).
Braun, Fiorini, Pokutta & Steurer (2012). Approximation limits of linear programs (beyond hierarchies), FOCS.

It also extends to any convex cone, in particular the PSD cone (the so-called PSD-rank). See the survey: Fawzi, Gouveia, Parrilo, Robinson & Thomas, Positive semidefinite rank, Mathematical Programming, 2015.
Exact NMF computation and regular n-gons

Can we use numerical solvers to get insight into these problems? Yes! We have developed a library to compute exact NMFs of small matrices using meta-heuristics.
[V14] Vandaele, G., Glineur & Tuyttens, Heuristics for Exact NMF (2014).

Extension complexity of the octagon?

    rank(S_P) = 3 ≤ rank₊(S_P) = 6 ≤ min(m, n) = 8.
Exact NMF computation and regular n-gons

We observed a special structure in the solutions for regular n-gons, leading to the best known upper bound and closing the gap for some n-gons:

    rank₊(S_n) ≤ 2⌈log₂(n)⌉ − 1  for 2^{k−1} < n ≤ 2^{k−1} + 2^{k−2},
    rank₊(S_n) ≤ 2⌈log₂(n)⌉      for 2^{k−1} + 2^{k−2} < n ≤ 2^k.

[V15] Vandaele, G. & Glineur, On the Linear Extension Complexity of Regular n-gons (2015).

Implication: conic quadratic programming is 'polynomially reducible' to linear programming.
[BTN01] Ben-Tal and Nemirovski (2001). On polyhedral approximations of the second-order cone. Mathematics of Operations Research, 26(2), 193-205.
NMF under the separability assumption
Separability Assumption

Separability of M: there exists an index set K with |K| = r and a matrix V ≥ 0 such that

    M = M(:, K) V,  with U = M(:, K).

[AGKM12] Arora, Ge, Kannan, Moitra, Computing a Nonnegative Matrix Factorization – Provably, STOC 2012.
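A separable matrix is easy to construct and check numerically (an illustrative sketch; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, n_mix = 10, 3, 12
U = rng.random((p, r))      # the r 'pure' columns
H = rng.random((r, n_mix))
H /= H.sum(axis=0)          # mixed columns are convex combinations

# Separable matrix: here the pure columns are placed first, so K = {0, 1, 2}.
M = np.hstack([U, U @ H])
K = list(range(r))
V = np.hstack([np.eye(r), H])

assert np.all(V >= 0)
assert np.allclose(M, M[:, K] @ V)  # M = M(:, K) V
```

Under this assumption, NMF reduces to identifying the index set K, which is what the algorithms below do.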
Applications

In hyperspectral imaging, this is the pure-pixel assumption: for each material, there is a 'pure' pixel containing only that material.
[M+14] Ma et al., A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing, IEEE Signal Processing Magazine 31(1):67-81, 2014.

In document classification: for each topic, there is a 'pure' word used only by that topic (an 'anchor' word).
[A+13] Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

Time-resolved Raman spectra analysis: each substance has a peak in its spectrum where the other spectra are close to zero.
[L+16] Luce et al., Using Separable Nonnegative Matrix Factorization for the Analysis of Time-Resolved Raman Spectra, Appl. Spectrosc., 2016.

Others: video summarization, foreground-background separation.
[ESV12] Elhamifar, Sapiro, Vidal, See all by looking at a few: Sparse modeling for finding representative objects, CVPR 2012.
[KSK13] Kumar, Sindhwani, Near-separable Non-negative Matrix Factorization with ℓ₁ and Bregman Loss Functions, SIAM Data Mining 2015.
Geometric Interpretation

The columns of U are the vertices of the convex hull of the columns of M:

    M(:, j) = Σ_{k=1}^r U(:, k) V(k, j) ∀j, where Σ_{k=1}^r V(k, j) = 1 and V ≥ 0.
Geometric Interpretation with Noise

With noise, the columns of M are only approximately convex combinations of the vertices U:

    M(:, j) ≈ Σ_{k=1}^r U(:, k) V(k, j) ∀j, where Σ_{k=1}^r V(k, j) = 1 and V ≥ 0.

Goal: a theoretical analysis of the robustness to noise of separable NMF algorithms.
Key Parameters: Noise and Conditioning

We assume M = U [I_r, V′] Π + N, where V′ ≥ 0, Π is a permutation and N is the noise.

We will assume that the noise is bounded (but otherwise arbitrary),

    ||N(:, j)||₂ ≤ ε for all j,

and some dependence on the conditioning κ(U) = σ_max(U)/σ_min(U) is unavoidable.
Successive Projection Algorithm (SPA)

0: Initially K = ∅.
For i = 1 : r
    1: Find j* = argmax_j ||M(:, j)||₂.
    2: K = K ∪ {j*}.
    3: M ← (I − uuᵀ)M, where u = M(:, j*)/||M(:, j*)||₂.
end
∼ modified Gram-Schmidt with column pivoting.

Theorem. If ε ≤ O(σ_min(U)/(√r κ²(U))), SPA satisfies

    ||U − M(:, K)|| = max_{1≤k≤r} ||U(:, k) − M(:, K(k))|| ≤ O(ε κ²(U)).

Advantages. Extremely fast, no parameters.
Drawbacks. Requires U to be full rank; the bound is weak.
[GV14] G., Vavasis, Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization, IEEE Trans. Patt. Anal. Mach. Intell. 36(4), pp. 698-714, 2014.
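SPA itself is a few lines of NumPy (an illustrative sketch, not the authors' code). On noiseless separable data with a full-rank U and the pure columns placed first, it recovers K = {0, 1, 2}:

```python
import numpy as np

def spa(M, r):
    """Successive Projection Algorithm: pick the column of maximum norm,
    then project all columns onto its orthogonal complement; repeat."""
    R = np.array(M, dtype=float)
    K = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)  # R <- (I - u u^T) R
    return K

rng = np.random.default_rng(0)
U = rng.random((10, 3))
H = rng.random((3, 20))
H /= H.sum(axis=0)         # convex combinations of the pure columns
M = np.hstack([U, U @ H])  # noiseless separable data

print(sorted(spa(M, 3)))   # -> [0, 1, 2]
```

The maximum of a strictly convex norm over a polytope is attained at a vertex, which is why the argmax step always selects a pure column in the noiseless case.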
Pre-conditioning for More Robust SPA

Observation. Pre-multiplying M preserves separability:

    PM = P(U [I_r, V′] Π + N) = (PU) [I_r, V′] Π + PN.

Ideally, P = U⁻¹ so that κ(PU) = 1 (assuming m = r).

Solving for the minimum-volume ellipsoid centered at the origin and containing all the columns of M (which is SDP representable),

    min_{A ∈ S₊^r} log det(A)⁻¹ such that m_jᵀ A m_j ≤ 1 ∀j,

allows one to approximate U⁻¹: in fact, A* ≈ (UUᵀ)⁻¹.

Theorem. If ε ≤ O(σ_min(U)/(r√r)), preconditioned SPA satisfies

    ||U − M(:, K)|| ≤ O(ε κ(U)).

[GV15] G., Vavasis, SDP-based Preconditioning for More Robust Near-Separable NMF, SIAM J. on Optimization, 2015.
Geometric Interpretation

Figure: Geometric interpretation of the SDP-based preconditioning.

See also Mizutani, Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability, JMLR, 2014.
Synthetic data sets

Each entry of U ∈ R₊^{40×20} is uniform in [0, 1]; each column is normalized.
The other columns of M are the middle points of pairs of columns of U (hence there are C(20, 2) = 190 of them). The noise moves these middle points toward the outside of the convex hull of the columns of U.
Results for the synthetic data sets

Figure: Average fraction of columns correctly extracted as a function of the noise level (for each noise level, 25 matrices are generated).
Combinatorial formulation for separable NMF

We want to find the index set K with |K| = r such that

    M = M(:, K) V.

This is equivalent to finding X ∈ R^{n×n} with r non-zero rows such that

    M = M X.

A combinatorial formulation:

    min_X ||X||_{row,0} such that M = MX or ||M − MX|| ≤ ε.

How to make X row sparse?
A Linear Optimization Model

    min_{X ∈ R₊^{n×n}} trace(X) = ||diag(X)||₁
    such that ||M − MX|| ≤ ε and X_ij ≤ X_ii ≤ 1 for all i, j.

Robustness: noise ε ≤ O(κ⁻¹) ⇒ error ≤ O(ε r κ) [GL14].

This model improves over [B+12]: it is more robust and detects the factorization rank r automatically.

It is equivalent [GL16] to using ||X||_{1,∞} = Σ_{i=1}^d ||X(i, :)||_∞ as a convex surrogate for ||X||_{row,0} [E+12].

[GL14] G., Luce, Robust Near-Separable NMF Using Linear Optimization, JMLR 2014.
[B+12] Bittorf, Recht, Ré, Tropp, Factoring nonnegative matrices with LPs, NIPS 2012.
[E+12] Esser et al., A convex model for NMF and dimensionality reduction on physical space, IEEE Trans. Image Processing, 2012.
[GL16] G., Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
Practical Model and Algorithm

    min_{X ∈ Ω} ||M − MX||²_F + μ tr(X),
    Ω = {X ∈ R^{n×n} | X_ii ≤ 1, w_i X_ij ≤ w_j X_ii ∀i, j}.

We use a fast gradient method (an optimal first-order method):
1. Choose an initial point X⁽⁰⁾, set Y = X⁽⁰⁾ and α₁ ∈ (0, 1).
2. For k = 1, 2, ...
   (a) X⁽ᵏ⁾ = P_Ω(Y − (1/L) ∇f(Y)).
   (b) Y = X⁽ᵏ⁾ + β_k (X⁽ᵏ⁾ − X⁽ᵏ⁻¹⁾), where β_k = α_k(1 − α_k)/(α_k² + α_{k+1}) and α_{k+1} ≥ 0 satisfies α_{k+1}² = (1 − α_{k+1}) α_k².

The projection onto Ω can be done effectively in O(n² log(n)) operations. The total computational cost is O(pn²) operations.
[GL16] G., Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
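The fast gradient scheme can be sketched as follows. This is a simplified illustration, not the [GL16] implementation: the projection only enforces X ≥ 0 and diag(X) ≤ 1 by clipping, standing in for the full projection onto Ω, and μ, the sizes and the iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, mu = 10, 15, 1e-2
M = rng.random((p, n))

L = 2 * np.linalg.norm(M.T @ M, 2)  # Lipschitz constant of the gradient

def grad(X):
    """Gradient of f(X) = ||M - MX||_F^2 + mu * tr(X)."""
    return 2 * M.T @ (M @ X - M) + mu * np.eye(n)

def proj(X):
    # Simplified projection (assumption): X >= 0 and diag(X) <= 1;
    # the full Omega projection also enforces w_i X_ij <= w_j X_ii.
    X = np.maximum(X, 0)
    np.fill_diagonal(X, np.minimum(np.diag(X), 1))
    return X

X = np.zeros((n, n))
Y = X.copy()
alpha = 0.5
for _ in range(300):
    X_new = proj(Y - grad(Y) / L)
    # alpha_{k+1} solves alpha_{k+1}^2 = (1 - alpha_{k+1}) alpha_k^2.
    alpha_new = 0.5 * (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2)
    beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
    Y = X_new + beta * (X_new - X)  # extrapolation step
    X, alpha = X_new, alpha_new

obj = np.linalg.norm(M - M @ X, "fro") ** 2 + mu * np.trace(X)
```

The extrapolation step is what yields the optimal O(1/k²) rate of accelerated gradient methods on this convex problem.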
Hyperspectral unmixing

Table (r = 6): running times in seconds for VCA, VCA-500, SPA, SPA-500, SNPA, SNPA-500, XRAY, XRAY-500, H2NMF, H2NMF-500, FGNSR-500.
Time (s.): 1.02, 0.03, 0.26, ...