Computing Nonnegative Matrix Factorizations

Nicolas Gillis
Joint work with François Glineur, Robert Luce, Stephen Vavasis, Arnaud Vandaele, Jérémy Cohen

Where is Mons?


Nonnegative Matrix Factorization (NMF)

Given a matrix M ∈ R^{p×n}_+ and a factorization rank r ≪ min(p, n), find U ∈ R^{p×r}_+ and V ∈ R^{r×n}_+ such that

    min_{U ≥ 0, V ≥ 0} ‖M − UV‖²_F = ∑_{i,j} (M − UV)²_{ij}.        (NMF)

NMF is a linear dimensionality reduction technique for nonnegative data:

    M(:, i) ≈ ∑_{k=1}^{r} U(:, k) V(k, i)   for all i,

where M(:, i) ≥ 0, U(:, k) ≥ 0 and V(k, i) ≥ 0.

Why nonnegativity?
→ Interpretability: nonnegativity constraints lead to easily interpretable factors (and a sparse, part-based representation).
→ Many applications: image processing, text mining, hyperspectral unmixing, community detection, clustering, etc.
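As a quick illustration (added here, not part of the original slides), a minimal Python sketch computing an NMF with scikit-learn; the data, the rank r and the solver settings are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import NMF

# Illustrative nonnegative data matrix M (p x n) of exact nonnegative rank r.
p, n, r = 50, 100, 5
rng = np.random.default_rng(0)
M = rng.random((p, r)) @ rng.random((r, n))

# Approximate factorization M ~ U V with U >= 0 and V >= 0.
model = NMF(n_components=r, init="nndsvda", max_iter=500)
U = model.fit_transform(M)   # p x r, nonnegative
V = model.components_        # r x n, nonnegative

print(np.linalg.norm(M - U @ V) / np.linalg.norm(M))   # relative Frobenius error
```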

Example 1: Blind hyperspectral unmixing

Figure: Urban hyperspectral image, 162 spectral bands and 307-by-307 pixels.

Problem. Identify the materials and classify the pixels.

Linear mixing model


Example 1: Blind hyperspectral unmixing with NMF

Basis elements allow us to recover the different endmembers: U ≥ 0.
Abundances of the endmembers in each pixel: V ≥ 0.

Urban hyperspectral image

Figure: Decomposition of the Urban dataset.

Example 2: topic recovery and document classification

Basis elements allow us to recover the different topics;
Weights allow us to assign each text to its corresponding topics.

Example 3: feature extraction and classification

The basis elements extract facial features such as eyes, nose and lips.

Outline

1. Computational complexity
2. Standard non-linear optimization schemes and acceleration
3. Exact NMF (M = UV) and its geometric interpretation
4. NMF under the separability assumption

Computational Complexity of NMF


Complexity of NMF

    min_{U ∈ R^{p×r}, V ∈ R^{r×n}} ‖M − UV‖²_F   such that U ≥ 0, V ≥ 0.

For r = 1, the problem is solved by the Eckart–Young and Perron–Frobenius theorems: the best rank-one approximation of a nonnegative matrix can be chosen nonnegative.

Checking whether there exists an exact factorization M = UV is NP-hard (Vavasis, 2009) when p, n and r are not fixed.

Using quantifier elimination (reformulation with a fixed number of variables):
  Cohen and Rothblum [1991]: (mn)^{O(mr+nr)}, non-polynomial;
  Arora et al. [2012]: (mn)^{O(2^r)}, polynomial for fixed r;
  Moitra [2013]: (mn)^{O(r²)}, polynomial for fixed r;
→ not really useful in practice . . .

This does not imply that rank₊(M) (the minimum r such that M = UV) can be computed in polynomial time, because there is no upper bound on rank₊.
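To make the r = 1 case concrete, here is a small numpy sketch (added here): by Eckart–Young the best rank-one approximation is given by the leading singular triplet, and by Perron–Frobenius the leading singular vectors of a nonnegative matrix can be chosen nonnegative, so taking absolute values yields an optimal rank-one NMF.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((30, 40))                    # nonnegative data

# Eckart-Young: the best rank-one approximation uses the leading singular triplet.
W, s, Vt = np.linalg.svd(M, full_matrices=False)
u, sigma, v = W[:, 0], s[0], Vt[0, :]

# Perron-Frobenius: the leading singular vectors can be chosen nonnegative,
# so flipping signs (taking absolute values) preserves optimality.
u, v = np.abs(u), np.abs(v)

U, V = (sigma * u)[:, None], v[None, :]     # nonnegative rank-one factors
print(np.linalg.norm(M - U @ V))            # optimal rank-one approximation error
```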

Complexity for other norms

    min_{u ∈ R^p, v ∈ R^n} ‖M − uvᵀ‖₁ = ∑_{i,j} |M_{ij} − u_i v_j|.        (ℓ1 norm)

If M is binary, M ∈ {0,1}^{m×n}, any optimal solution (u*, v*) can be assumed to be binary, that is, (u*, v*) ∈ {0,1}^p × {0,1}^n.

    min_{u ∈ R^p, v ∈ R^n} ‖M − uvᵀ‖²_W = ∑_{i,j} W_{ij} (M − uvᵀ)²_{ij},        (weighted ℓ2 norm)

where W is a nonnegative weight matrix. This model can be used when data is missing (W_{ij} = 0 for missing entries) or when entries have different variances (W_{ij} = 1/σ²_{ij}).

G., Vavasis, On the Complexity of Robust PCA and ℓ1-Norm Low-Rank Matrix Approximation, Mathematics of Operations Research, 2018.
G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, SIAM J. Matrix Anal. Appl., 2011.
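As an aside (added here), the weighted model can be tackled heuristically by alternating weighted least squares; below is a minimal rank-one sketch with a missing-data mask. The name weighted_rank1 and the data are illustrative, and no optimality is claimed (the problem is NP-hard in general).

```python
import numpy as np

def weighted_rank1(M, W, iters=200, seed=0):
    """Alternating weighted least squares for min_{u,v} sum_ij W_ij (M_ij - u_i v_j)^2."""
    rng = np.random.default_rng(seed)
    u, v = rng.random(M.shape[0]), rng.random(M.shape[1])
    for _ in range(iters):
        u = (W * M) @ v / np.maximum(W @ v**2, 1e-12)       # closed-form update of u
        v = (W * M).T @ u / np.maximum(W.T @ u**2, 1e-12)   # closed-form update of v
    return u, v

# Missing data: W_ij = 0 marks a missing entry, W_ij = 1 an observed one.
M = np.outer(np.arange(1.0, 6.0), np.arange(1.0, 8.0))
W = (np.random.default_rng(1).random(M.shape) > 0.3).astype(float)
u, v = weighted_rank1(M, W)
print(np.sum(W * (M - np.outer(u, v))**2))   # weighted residual on observed entries
```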

NMF Algorithms and Acceleration


NMF Algorithms

Given a matrix M ∈ R^{m×n}_+ and a factorization rank r ∈ N:

    min_{U ∈ R^{m×r}_+, V ∈ R^{r×n}_+} ‖M − UV‖²_F = ∑_{i,j} (M − UV)²_{ij}.        (NMF)

This is a difficult non-linear optimization problem with potentially many local minima.

Standard framework (see the sketch below):
0. Initialize (U, V). Then, alternately update U and V:
1. Update V ≈ argmin_{X ≥ 0} ‖M − UX‖²_F.        (NNLS)
2. Update U ≈ argmin_{Y ≥ 0} ‖M − YV‖²_F.        (NNLS)

Most NMF algorithms come with no guarantees (except convergence to stationary points).

The solution is in general highly non-unique: identifiability issues.
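A minimal sketch of this two-block framework (added here): each NNLS subproblem decouples over the columns of V (resp. the rows of U), so it can be solved column by column with scipy.optimize.nnls. The helper names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||B - A X||_F^2 one column of B at a time."""
    X = np.zeros((A.shape[1], B.shape[1]))
    for j in range(B.shape[1]):
        X[:, j], _ = nnls(A, B[:, j])
    return X

def alternating_nnls(M, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U, V = rng.random((M.shape[0], r)), rng.random((r, M.shape[1]))
    for _ in range(iters):
        V = nnls_columns(U, M)          # update V with U fixed
        U = nnls_columns(V.T, M.T).T    # update U with V fixed (transposed problem)
    return U, V
```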

Block coordinate descent method

Use block-coordinate descent on the NNLS subproblems → closed-form solutions for the columns of U and the rows of V:

    ∀k,  U(:, k)* = argmin_{U(:, k) ≥ 0} ‖R_k − U(:, k) V(k, :)‖²_F = max(0, R_k V(k, :)ᵀ / ‖V(k, :)‖²₂),

where R_k = M − ∑_{j≠k} U(:, j) V(j, :), and similarly for the rows of V. This is the so-called HALS algorithm.

It can be accelerated (a basic implementation is sketched below):
1. Gauss–Seidel coordinate descent (Hsieh, Dhillon, 2011).
2. Loop several times over the columns of U / rows of V to perform more iterations at a lower computational cost (Glineur, G., 2012).
3. Randomized shuffling (Chow, Wu, Yin, 2017).
4. Use an extrapolation step: Ŵ^(k+1) = W^(k+1) + β_k (W^(k+1) − W^(k)) (Ang, G., 2018).
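A basic numpy sketch of the HALS updates above (added here; no acceleration, and safeguards are kept minimal):

```python
import numpy as np

def hals(M, r, iters=200, seed=0, eps=1e-12):
    """Hierarchical Alternating Least Squares for NMF (basic sketch)."""
    rng = np.random.default_rng(seed)
    U, V = rng.random((M.shape[0], r)), rng.random((r, M.shape[1]))
    for _ in range(iters):
        # Closed-form update of each column of U (residual R_k handled implicitly).
        MV, VV = M @ V.T, V @ V.T
        for k in range(r):
            num = MV[:, k] - U @ VV[:, k] + U[:, k] * VV[k, k]
            U[:, k] = np.maximum(0, num / max(VV[k, k], eps))
        # Symmetric update for each row of V.
        UM, UU = U.T @ M, U.T @ U
        for k in range(r):
            num = UM[k, :] - UU[k, :] @ V + UU[k, k] * V[k, :]
            V[k, :] = np.maximum(0, num / max(UU[k, k], eps))
    return U, V
```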

Illustration on the CBCL face image data set


Exact NMF: Geometry and Extended Formulations


Geometric interpretation of exact NMF

Given M = UV, one can scale M and U so that they become column stochastic, which implies that V is column stochastic:

    M = UV  ⟺  M′ = M D_M = (U D_U)(D_U⁻¹ V D_M) = U′ V′.

The columns of M are convex combinations of the columns of U:

    M(:, j) = ∑_{i=1}^{k} U(:, i) V_{ij}   with   ∑_{i=1}^{k} V_{ij} = 1 ∀j,  V_{ij} ≥ 0 ∀ i, j.

In other terms,

    conv(M) ⊆ conv(U) ⊆ Sⁿ,

where conv(X) is the convex hull of the columns of X, and Sⁿ = {x ∈ Rⁿ | x ≥ 0, ∑_{i=1}^{n} x_i = 1} is the unit simplex.

Exact NMF ≡ find r points whose convex hull is nested between two given polytopes.
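A quick numerical check of the scaling argument above (added here): dividing each column of M and U by its sum makes them column stochastic, and the correspondingly rescaled V is automatically column stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = rng.random((6, 3)), rng.random((3, 8))
M = U @ V                                  # exact NMF with nonnegative factors

DM = np.diag(1.0 / M.sum(axis=0))          # scales the columns of M to sum to 1
DU = np.diag(1.0 / U.sum(axis=0))          # scales the columns of U to sum to 1

Mp, Up = M @ DM, U @ DU                    # column-stochastic M' and U'
Vp = np.linalg.inv(DU) @ V @ DM            # V' = DU^{-1} V DM

print(np.allclose(Mp, Up @ Vp))            # True: the factorization is preserved
print(np.allclose(Vp.sum(axis=0), 1.0))    # True: V' is column stochastic
```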

Geometric interpretation of NMF

Example: two nested hexagons (rank(M_a) = 3), with

    M_a = [  1     a    2a−1  2a−1   a     1
             1     1     a    2a−1  2a−1   a
             a     1     1     a    2a−1  2a−1
            2a−1   a     1     1     a    2a−1
            2a−1  2a−1   a     1     1     a
             a    2a−1  2a−1   a     1     1  ],   a > 1.

Case 1: a = 2, rank₊(M_a) = 3, col(M) = col(U).
[Figure: conv(M_2) nested inside conv(U), itself inside Δᵖ ∩ col(M_2).]

Case 2: a = 3, rank₊(M_a) = 4, col(M) = col(U).
[Figure: conv(M_3) nested inside conv(U), itself inside Δᵖ ∩ col(M_3).]

Case 3: a → +∞, rank₊(M_a) = 5, col(M) ≠ col(U).

An amazing result: NMF and extended formulations

Let P be a polytope, P = {x ∈ Rᵏ | b_i − A(i, :) x ≥ 0 for 1 ≤ i ≤ m}, and let the v_j (1 ≤ j ≤ n) be its vertices.

We define the m-by-n slack matrix S_P of P as follows:

    S_P(i, j) = b_i − A(i, :) v_j ≥ 0,   1 ≤ i ≤ m, 1 ≤ j ≤ n.

The hexagon:

    S_P = [ 0 0 1 2 2 1
            1 0 0 1 2 2
            2 1 0 0 1 2
            2 2 1 0 0 1
            1 2 2 1 0 0
            0 1 2 2 1 0 ]

An extended formulation of P is a higher-dimensional polyhedron Q ⊆ R^{k+p} that (linearly) projects onto P. The minimum number of facets of such a polyhedron is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991). rank₊(S_P) = xp(P).

Proof (one direction). Given P = {x ∈ Rᵏ | b − Ax ≥ 0}, any exact NMF S_P = UV with U ≥ 0, V ≥ 0 provides an explicit extended formulation (with some redundant equalities) of P:

    P = {x | b − Ax ≥ 0} = {x | b − Ax = Uy for some y ≥ 0}.

Remark. The slack matrix S_P of P satisfies conv(S_P) = Sᵐ ∩ col(S_P). To get a small factorization, we need to go to a higher-dimensional space: rank(U) > rank(M).

The Hexagon

The hexagon slack matrix S_P above admits an exact NMF S_P = UV with U ∈ R^{6×5}_+ and V ∈ R^{5×6}_+ (entries in {0, 1/2, 1, 2}), so that

    rank(S_P) = 3  <  rank₊(S_P) = 5  <  min(m, n) = 6.

(A numerically verified arrangement of such a factorization is given in the sketch below.)
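The exact layout of the slide's factorization is garbled in this transcript, so here is one arrangement of the same entries that I verified numerically (my reconstruction, not necessarily the ordering used on the slide):

```python
import numpy as np

# Slack matrix of the regular hexagon (rows are cyclic shifts of [0, 0, 1, 2, 2, 1]).
S = np.array([[0, 0, 1, 2, 2, 1],
              [1, 0, 0, 1, 2, 2],
              [2, 1, 0, 0, 1, 2],
              [2, 2, 1, 0, 0, 1],
              [1, 2, 2, 1, 0, 0],
              [0, 1, 2, 2, 1, 0]], dtype=float)

# One exact nonnegative factorization S = U V with inner dimension 5.
U = np.array([[1, 0, 0, 0, 2],
              [0, 0, 1, 0, 2],
              [0, 1, 2, 0, 0],
              [0, 0, 1, 2, 0],
              [1, 0, 0, 2, 0],
              [2, 1, 0, 0, 0]], dtype=float)
V = np.array([[0, 0, 1, 1, 0, 0],
              [0, 1, 0, 0, 1, 0],
              [1, 0, 0, 0, 0, 1],
              [0.5, 1, 0.5, 0, 0, 0],
              [0, 0, 0, 0.5, 1, 0.5]])

print(np.allclose(S, U @ V))        # True: rank+(S) <= 5
print(np.linalg.matrix_rank(S))     # 3: the usual rank is only 3
```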

Some implications

Problem: limits of LP for solving combinatorial problems: given a polytope, what is the most compact way to represent it?

Its extension complexity = the nonnegative rank of its slack matrix.

Key tool: lower-bound techniques for the nonnegative rank.

Example. The matching problem cannot be solved via a polynomial-size LP.
  Rothvoss (2014). The matching polytope has exponential extension complexity, STOC.

This can be generalized to approximations (no polynomial-size LP can approximate these problems up to some precision),
  Braun, Fiorini, Pokutta & Steurer (2012). Approximation limits of linear programs (beyond hierarchies), FOCS.

and to any convex cone, in particular the PSD cone (the so-called PSD-rank).
  See the survey: Fawzi, Gouveia, Parrilo, Robinson & Thomas, Positive semidefinite rank, Mathematical Programming, 2015.

Exact NMF computation and regular n-gons

Can we use numerical solvers to get insight into these problems?

Yes! We have developed a library to compute exact NMFs of small matrices using meta-heuristics.
  [V14] Vandaele, G., Glineur & D. Tuyttens, Heuristics for Exact NMF (2014).

Extension complexity of the octagon?

    rank(S_P) = 3  <  rank₊(S_P) = 6  <  min(m, n) = 8.

Exact NMF computation and regular n-gons

We observed a special structure in the solutions for regular n-gons, leading to the best known upper bound and closing the gap for some n-gons:

    rank₊(S_n) ≤ 2⌈log₂(n)⌉ − 1   for 2^{k−1} < n ≤ 2^{k−1} + 2^{k−2},
    rank₊(S_n) ≤ 2⌈log₂(n)⌉       for 2^{k−1} + 2^{k−2} < n ≤ 2^k.

  [V15] Vandaele, G. & Glineur, On the Linear Extension Complexity of Regular n-gons (2015).

Implication: conic quadratic programming is 'polynomially reducible' to linear programming.
  [BTN01] Ben-Tal and Nemirovski (2001). On polyhedral approximations of the second-order cone. Mathematics of Operations Research, 26(2), 193-205.

NMF under the separability assumption


Separability Assumption

Separability of M: there exists an index set K with |K| = r and V ≥ 0 such that

    M = M(:, K) V,   where U = M(:, K).

[AGKM12] Arora, Ge, Kannan, Moitra, Computing a Nonnegative Matrix Factorization – Provably, STOC 2012.

Applications

In hyperspectral imaging, this is the pure-pixel assumption: for each material, there is a 'pure' pixel containing only that material.
  [M+14] Ma et al., A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing, IEEE Signal Processing Magazine 31(1):67-81, 2014.

In document classification: for each topic, there is a 'pure' word used only by that topic (an 'anchor' word).
  [A+13] Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

Time-resolved Raman spectra analysis: each substance has a peak in its spectrum where the other spectra are (close to) zero.
  [L+16] Luce et al., Using Separable Nonnegative Matrix Factorization for the Analysis of Time-Resolved Raman Spectra, Appl. Spectrosc., 2016.

Others: video summarization, foreground-background separation.
  [ESV12] Elhamifar, Sapiro, Vidal, See all by looking at a few: Sparse modeling for finding representative objects, CVPR 2012.
  [KSK13] Kumar, Sindhwani, Near-separable Non-negative Matrix Factorization with ℓ1 and Bregman Loss Functions, SIAM Data Mining 2015.

Geometric Interpretation

The columns of U are the vertices of the convex hull of the columns of M:

    M(:, j) = ∑_{k=1}^{r} U(:, k) V(k, j)   ∀j,   where ∑_{k=1}^{r} V(k, j) = 1, V ≥ 0.

Geometric Interpretation with Noise

With noise, the columns of M are only approximately convex combinations of the columns of U:

    M(:, j) ≈ ∑_{k=1}^{r} U(:, k) V(k, j)   ∀j,   where ∑_{k=1}^{r} V(k, j) = 1, V ≥ 0.

Goal: a theoretical analysis of the robustness to noise of separable NMF algorithms.

Key Parameters: Noise and Conditioning

We assume M = U [I_r, V′] Π + N, where V′ ≥ 0, Π is a permutation and N is the noise.

We will assume that the noise is bounded (but otherwise arbitrary):

    ‖N(:, j)‖₂ ≤ ε   for all j,

and some dependence on the conditioning κ(U) = σ_max(U)/σ_min(U) is unavoidable.

Successive Projection Algorithm (SPA)

0: Initially K = ∅.
For i = 1 : r
  1: Find j* = argmax_j ‖M(:, j)‖₂.
  2: K = K ∪ {j*}.
  3: M ← (I − uuᵀ) M, where u = M(:, j*)/‖M(:, j*)‖₂.
end

∼ modified Gram–Schmidt with column pivoting (a compact implementation is sketched below).

Theorem. If ε ≤ O(σ_min(U) / (√r κ²(U))), SPA satisfies

    ‖U − M(:, K)‖ = max_{1≤k≤r} ‖U(:, k) − M(:, K(k))‖ ≤ O(ε κ²(U)).

Advantages. Extremely fast, no parameters.
Drawbacks. Requires U to be full rank; the bound is weak.

[GV14] G., Vavasis, Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization, IEEE Trans. Pattern Anal. Mach. Intell. 36(4), pp. 698-714, 2014.
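A compact numpy sketch of SPA as described above (added here; spa is an illustrative name):

```python
import numpy as np

def spa(M, r):
    """Successive Projection Algorithm: returns r column indices of M."""
    R, K = M.astype(float).copy(), []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # column with the largest norm
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                         # project out that direction
    return K

# Toy separable example: the columns of U appear among the columns of M.
rng = np.random.default_rng(0)
U = rng.random((20, 4))
V = rng.dirichlet(np.ones(4), size=30).T     # column-stochastic mixing weights
M = np.hstack([U, U @ V])
print(sorted(spa(M, 4)))                     # expected to recover [0, 1, 2, 3]
```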

Pre-conditioning for More Robust SPA

Observation. Pre-multiplying M preserves separability:

    P M = P (U [I_r, V′] Π + N) = (PU) [I_r, V′] Π + PN.

Ideally, P = U⁻¹ so that κ(PU) = 1 (assuming m = r).

Solving for the minimum-volume ellipsoid centered at the origin and containing all the columns of M (which is SDP-representable),

    min_{A ∈ S^r_+} log det(A)⁻¹   such that   m_jᵀ A m_j ≤ 1 ∀ j,

allows one to approximate U⁻¹: in fact, A* ≈ (UUᵀ)⁻¹.

Theorem. If ε ≤ O(σ_min(U) / (r√r)), preconditioned SPA satisfies ‖U − M(:, K)‖ ≤ O(ε κ(U)).

[GV15] G., Vavasis, SDP-based Preconditioning for More Robust Near-Separable NMF, SIAM J. on Optimization, 2015.
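A hedged sketch of this preconditioning step with cvxpy (added here; it assumes M has already been reduced to r dimensions, e.g. with a truncated SVD, so that m = r, and minvol_preconditioner is an illustrative name):

```python
import numpy as np
import cvxpy as cp

def minvol_preconditioner(M):
    """Minimum-volume ellipsoid {x : x^T A x <= 1} containing the columns of M
    (M assumed r x n); returns P = A^{1/2}, so that P U is well conditioned."""
    r, n = M.shape
    A = cp.Variable((r, r), PSD=True)
    constraints = [cp.quad_form(M[:, j], A) <= 1 for j in range(n)]
    cp.Problem(cp.Minimize(-cp.log_det(A)), constraints).solve()
    w, Q = np.linalg.eigh(A.value)                  # A* ~ (U U^T)^{-1}
    return Q @ np.diag(np.sqrt(np.maximum(w, 0))) @ Q.T

# Usage sketch: run SPA (see above) on P @ Mr, where Mr is the r-dimensional data.
```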

Geometric Interpretation

Figure: Geometric interpretation of the SDP-based preconditioning.

See also Mizutani, Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability, JMLR, 2014.

Synthetic data sets

Each entry of U ∈ R^{40×20}_+ is uniform in [0, 1]; each column is normalized.

The other columns of M are the middle points of pairs of columns of U (hence there are C(20, 2) = 190 of them). The noise moves these middle points toward the outside of the convex hull of the columns of U.
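A sketch of how such a data set can be generated (added here, following the description above; the column normalization, the outward direction and the noise level delta are my modeling choices):

```python
import numpy as np
from itertools import combinations

def synthetic_separable(delta, p=40, r=20, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((p, r))
    U /= U.sum(axis=0)                          # normalize each column
    mids, dirs = [], []
    for i, j in combinations(range(r), 2):      # C(20, 2) = 190 middle points
        m = 0.5 * (U[:, i] + U[:, j])
        mids.append(m)
        dirs.append(m - U.mean(axis=1))         # push away from the centroid
    mids, dirs = np.array(mids).T, np.array(dirs).T
    dirs /= np.linalg.norm(dirs, axis=0)
    M = np.hstack([U, mids + delta * dirs])     # noise moves middle points outward
    return M, U

M, U = synthetic_separable(delta=0.05)
print(M.shape)                                  # (40, 210): 20 vertices + 190 points
```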

Results for the synthetic data sets

Figure: Average of the fraction of columns correctly extracted depending on the noise level (for each noise level, 25 matrices are generated).

Combinatorial formulation for separable NMF

We want to find an index set K with |K| = r such that

    M = M(:, K) V.

This is equivalent to finding X ∈ R^{n×n} with r non-zero rows such that

    M = M X.

A combinatorial formulation:

    min_X ‖X‖_{row,0}   such that   M = MX   (or ‖M − MX‖ ≤ ε).

How can we make X row-sparse?

A Linear Optimization Model

    min_{X ∈ R^{n×n}_+}  trace(X) = ‖diag(X)‖₁
    such that  ‖M − MX‖ ≤ ε  and  X_{ij} ≤ X_{ii} ≤ 1 for all i, j.

Robustness: noise ε ≤ O(κ⁻¹) ⇒ error ≤ O(ε r κ) [GL14].

This model is an improvement over [B+12]: it is more robust and it detects the factorization rank r automatically.

It is equivalent [GL16] to using ‖X‖_{1,∞} = ∑_{i=1}^{d} ‖X(i, :)‖_∞ as a convex surrogate for ‖X‖_{row,0} [E+12].

[GL14] G., Luce, Robust Near-Separable NMF Using Linear Optimization, JMLR 2014.
[B+12] Bittorf, Recht, Ré, Tropp, Factoring nonnegative matrices with LPs, NIPS 2012.
[E+12] Esser et al., A convex model for NMF and dimensionality reduction on physical space, IEEE Trans. Image Processing, 2012.
[GL16] G., Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.

Practical Model and Algorithm

    min_{X ∈ Ω} ‖M − MX‖²_F + µ tr(X),   Ω = {X ∈ R^{n×n} | X_{ii} ≤ 1, w_i X_{ij} ≤ w_j X_{ii} ∀ i, j}.

We use a fast gradient method (an optimal first-order method); a simplified implementation is sketched below:
1. Choose an initial point X^(0), set Y = X^(0) and α₁ ∈ (0, 1).
2. For k = 1, 2, . . .
     X^(k) = P_Ω( Y − (1/L) ∇f(Y) ),
     Y = X^(k) + β_k ( X^(k) − X^(k−1) ),
   where β_k = α_k(1 − α_k)/(α_k² + α_{k+1}) and α_{k+1} ≥ 0 is such that α²_{k+1} = (1 − α_{k+1}) α_k².

The projection onto Ω can be done effectively in O(n² log n) operations.

The total computational cost is O(pn²) operations.

[GL16] G., Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
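A simplified sketch of this fast gradient scheme (added here): for brevity the projection onto Ω is replaced by a projection onto the box [0, 1]^{n×n}, which is only a stand-in for the true O(n² log n) projection used in [GL16].

```python
import numpy as np

def fast_gradient_selfdict(M, mu, iters=300):
    """Accelerated projected gradient for min_X ||M - M X||_F^2 + mu * tr(X),
    with a simplified projection onto the box [0, 1]^{n x n}."""
    n = M.shape[1]
    MtM = M.T @ M
    L = 2 * np.linalg.norm(MtM, 2)                  # Lipschitz constant of the gradient
    grad = lambda X: 2 * (MtM @ X - MtM) + mu * np.eye(n)
    proj = lambda X: np.clip(X, 0.0, 1.0)           # simplified projection
    X, Y, alpha = np.zeros((n, n)), np.zeros((n, n)), 0.5
    for _ in range(iters):
        X_new = proj(Y - grad(Y) / L)
        # alpha_{k+1} >= 0 solves alpha^2 = (1 - alpha) * alpha_k^2.
        alpha_new = 0.5 * (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2)
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
        Y = X_new + beta * (X_new - X)
        X, alpha = X_new, alpha_new
    return X

# The r largest diagonal entries of X indicate the selected columns of M.
```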

Hyperspectral unmixing

Table (truncated in this transcript): running times in seconds for r = 6, comparing VCA, VCA-500, SPA, SPA-500, SNPA, SNPA-500, XRAY, XRAY-500, H2NMF, H2NMF-500 and FGNSR-500.