A novel boosting algorithm for multi-task learning based on the Itakura-Saito distance
Takashi Takenouchi (FUN), Osamu Komori (ISM), Shinto Eguchi (ISM)


1. Itakura-Saito distance, un-normalized model and AdaBoost
2. Multi-task learning using the IS distance
   Experiments
3. Conclusions

Setting of the binary classification problem

Setting
- x ∈ X: input vector
- y ∈ {+1, −1}: class label
- D = {x_i, y_i}_{i=1}^n: dataset
- p̃(y|x) r̃(x): empirical distribution of D

Purpose of the binary classification problem
- To construct a discriminant function F(x) and predict by sgn(F(x))
- To estimate the conditional distribution p(y|x) using the dataset D or the empirical distribution p̃(y|x) r̃(x)

To construct the discriminant function F

Extended model
- Extended model (un-normalized model) of the conditional distribution: q_F(y|x) = exp(F(x) y)
- F(x) is parametrized with θ ∈ R^m
- Note that Σ_{y∈{±1}} q_F(y|x) ≠ 1
- (Normalized) conditional distribution:
    q̄_F(y|x) = q_F(y|x) / Z_F = exp(F(x) y) / (exp(F(x)) + exp(−F(x))),
  where Z_F = Σ_{y∈{±1}} q_F(y|x) = e^{F(x)} + e^{−F(x)} is the normalization constant
  ⇒ logistic model
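The distinction between the un-normalized model q_F and its normalized (logistic) counterpart q̄_F can be made concrete in a few lines of code. The following is a minimal sketch, assuming a linear parametrization F(x) = θᵀx purely for illustration (the slides only require F to be parametrized by θ ∈ R^m):

```python
import numpy as np

def F(x, theta):
    # Discriminant function; a linear parametrization is assumed only for illustration.
    return x @ theta

def q_ext(x, y, theta):
    # Extended (un-normalized) model: q_F(y|x) = exp(F(x) y)
    return np.exp(F(x, theta) * y)

def q_bar(x, y, theta):
    # Normalized model q̄_F(y|x) = q_F(y|x) / Z_F, Z_F = e^{F(x)} + e^{-F(x)}  (logistic model)
    Z = np.exp(F(x, theta)) + np.exp(-F(x, theta))
    return q_ext(x, y, theta) / Z

theta = np.array([0.7, -0.3])
x = np.array([1.0, 2.0])
print(q_ext(x, +1, theta) + q_ext(x, -1, theta))  # generally != 1: not a probability distribution
print(q_bar(x, +1, theta) + q_bar(x, -1, theta))  # exactly 1
```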


To construct the discriminant function F

Extended version of the Kullback-Leibler (KL) divergence
For two extended models p(y|x) and q(y|x),
    KL(p, q; r) = ∫ r(x) Σ_{y∈{±1}} { p(y|x) log [ p(y|x) / q(y|x) ] − p(y|x) + q(y|x) } dx

Estimation with the extended model q_F(y|x) = exp(F(x) y):
- argmin_F KL(p, q̄_F; r) : MLE
- argmin_F KL(p, q_F; r) : ?
- argmin_F KL(q_F, p; r) : ?


Estimation with the extended model

Proposition 1
Assume that the underlying distribution p(y|x) is written as
    p(y|x) = q̄_{F_0}(y|x) = exp(F_0(x) y) / (exp(F_0(x)) + exp(−F_0(x)))
and q_F(y|x) = exp(F(x) y). Then we observe that, in general,
    argmin_F KL(p, q_F; r) ≠ F_0
    argmin_F KL(q_F, p; r) ≠ F_0
The KL divergence is not good for the extended model!!


Itakura-Saito (IS) distance

Itakura-Saito (IS) distance: for two extended models p(y|x) and q(y|x),
    IS(p, q; r) = ∫ r(x) Σ_{y∈{±1}} { log [ q(y|x) / p(y|x) ] − 1 + p(y|x) / q(y|x) } dx

1. Special case of the Bregman divergence
2. IS(p, q; r) ≥ 0
3. IS(p, q; r) = 0 if and only if p = q
4. Frequently used as a discrepancy measure between spectra of signals in signal processing
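In code, the IS distance between two conditional models can be evaluated by replacing the integral over r(x) with an average over sample points. A small sketch, where the sample-average approximation of the integral and the example models are assumptions made for illustration:

```python
import numpy as np

def is_distance(p, q, xs):
    # Itakura-Saito distance between conditional models p(y|x) and q(y|x),
    # with the integral over r(x) approximated by an average over the sample xs.
    total = 0.0
    for x in xs:
        for y in (+1, -1):
            ratio = p(x, y) / q(x, y)
            total += ratio - np.log(ratio) - 1.0
    return total / len(xs)

# Two un-normalized models q_F and q_G with F(x) = 0.8x and G(x) = 0.5x.
qF = lambda x, y: np.exp(0.8 * x * y)
qG = lambda x, y: np.exp(0.5 * x * y)

xs = np.random.default_rng(0).normal(size=200)
print(is_distance(qF, qG, xs))   # strictly positive
print(is_distance(qF, qF, xs))   # zero: IS(p, q; r) = 0 iff p = q
```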

Estimation of F with the extended model and the IS distance

Proposition 2
Assume that the underlying distribution is written as
    p(y|x) = q̄_{F_0}(y|x) = exp(F_0(x) y) / (exp(F_0(x)) + exp(−F_0(x))),
and q_F(y|x) = exp(F(x) y). Then we observe that
    argmin_F IS(p, q_F; r) = F_0
    argmin_F IS(q_F, p; r) = F_0
The IS distance is appropriate for the extended model!!
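Propositions 1 and 2 can be illustrated numerically at a single input x by minimizing the per-point KL and IS objectives over F. The sketch below is only an illustration: the value p(+1|x) = 0.9 and the grid search over F are assumptions, not part of the slides.

```python
import numpy as np

p = {+1: 0.9, -1: 0.1}            # p(y|x) at one fixed x (illustrative values)
F0 = 0.5 * np.log(p[+1] / p[-1])  # the F_0 for which p = q̄_{F_0}

def kl_objective(F):
    # Per-point extended KL(p, q_F) with q_F(y|x) = exp(F y)
    return sum(p[y] * np.log(p[y] / np.exp(F * y)) - p[y] + np.exp(F * y) for y in (+1, -1))

def is_objective(F):
    # Per-point IS(p, q_F) with q_F(y|x) = exp(F y)
    return sum(p[y] * np.exp(-F * y) - np.log(p[y] * np.exp(-F * y)) - 1 for y in (+1, -1))

grid = np.linspace(-3.0, 3.0, 60001)
F_kl = grid[np.argmin([kl_objective(F) for F in grid])]
F_is = grid[np.argmin([is_objective(F) for F in grid])]

print(F0)    # ~ 1.099
print(F_kl)  # ~ 0.390: argmin_F KL(p, q_F; r) != F_0   (Proposition 1)
print(F_is)  # ~ 1.099: argmin_F IS(p, q_F; r)  = F_0   (Proposition 2)
```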


IS distance and AdaBoost

Proposition 3
The IS distance between the empirical distribution p̃ and an extended model q_F(y|x) = exp(F(x) y) is written as
    IS(p̃, q_F; r̃) = Const + (1/n) Σ_{i=1}^n Σ_{y∈{±1}} { F(x_i) y + p̃(y|x_i) / q_F(y|x_i) }
                   = Const + (1/n) Σ_{i=1}^n exp(−F(x_i) y_i)
min_F IS(p̃, q_F; r̃) is equivalent to AdaBoost!!

AdaBoost
- is a binary classification method based on ensemble learning
- combines weak classifiers to construct a strong classifier
- is easy to compute
- has consistency
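Since minimizing IS(p̃, q_F; r̃) reduces (up to a constant) to minimizing the exponential loss (1/n) Σ_i exp(−F(x_i) y_i), the resulting procedure is the familiar stagewise AdaBoost. A minimal sketch with decision stumps as weak classifiers; the stump construction and the number of rounds T are standard choices assumed here, not prescribed by the slides:

```python
import numpy as np

def fit_stump(X, y, w):
    # Weak classifier: decision stump minimizing the weighted classification error.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=50):
    # Stagewise minimization of (1/n) sum_i exp(-F(x_i) y_i).
    n = len(y)
    F = np.zeros(n)                         # F(x_i), initialized to 0
    model = []
    for _ in range(T):
        w = np.exp(-F * y)
        w = w / w.sum()                     # example weights proportional to exp(-F(x_i) y_i)
        err, j, thr, sign = fit_stump(X, y, w)
        err = float(np.clip(err, 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - err) / err)
        F = F + alpha * np.where(X[:, j] > thr, sign, -sign)
        model.append((alpha, j, thr, sign))
    return model                            # final classifier: sign(sum_t alpha_t h_t(x))
```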


Summary of estimation with the extended model

                    Normalized model q̄_F     Extended model q_F
  KL divergence     MLE                       ?
  IS distance       Variant of AdaBoost       AdaBoost

IS + normalized model: Σ_{i=1}^n exp(−2 F(x_i) y_i)
c.f. AdaBoost: Σ_{i=1}^n exp(−F(x_i) y_i)

1. Itakura-Saito distance, un-normalized model and AdaBoost
2. Multi-task learning using the IS distance
   Experiments
3. Conclusions

Multi-task learning

Setting
Assumption: there are multiple (J) binary classification problems (tasks), and the tasks share structure or information.
1. D_k = {x_i^{(k)}, y_i^{(k)}}_{i=1}^{n_k} (k = 1, ..., J): dataset of task k
2. p̃_k(y|x) r̃_k(x): empirical distribution associated with D_k
3. F_k(x): classifier for task k

Purpose of multi-task learning
Improve the performance of the classifiers using the shared information.


Multi-task learning

Two types of approaches for multi-task learning
- Case 1: one task k is the target, and we want to improve the performance of F_k using the shared structure
- Case 2: all tasks are targets, and we want to improve all F_k (k = 1, ..., J) using the shared structure
We propose a method for both approaches based on the IS distance and the un-normalized model.

Proposed method: Case 1

AdaBoost:
- Target task p̃_k(y|x) r̃_k(x) ⇒ argmin_{F_k} IS(p̃_k, q_{F_k}; r̃_k) ⇒ F_k or q_{F_k}(y|x)
- Target task p̃_j(y|x) r̃_j(x) ⇒ argmin_{F_j} IS(p̃_j, q_{F_j}; r̃_j) ⇒ F_j or q_{F_j}(y|x)

If task k is similar to task j, the empirical distribution p̃_k(y|x) r̃_k(x) is similar to p̃_j(y|x) r̃_j(x).
Idea: the un-normalized model q_{F_k}(y|x) should also be similar to q_{F_j}(y|x)!!
    IS(q_{F_k}, q_{F_j}; r) = ∫ r(x) { e^{F_k(x) − F_j(x)} + e^{−F_k(x) + F_j(x)} − 2 } dx


Proposed method: Case 1

To improve the performance of F_k using the fixed F_j (j ≠ k), we incorporate the information of the shared structure into AdaBoost by adding a regularizer defined with the IS distance:
    argmin_{F_k} IS(p_k, q_{F_k}; r_k) + Σ_{j≠k} λ_j IS(q_{F_j}, q_{F_k}; r_k)
λ_j is a regularization constant.

Proposed method: Case 1

Empirical risk function:
    L̄_k(F_k) = (1/n_k) Σ_{i=1}^{n_k} { e^{−F_k(x_i^{(k)}) y_i^{(k)}} + Σ_{j≠k} λ_j ( e^{F_k(x_i^{(k)}) − F_j(x_i^{(k)})} + e^{F_j(x_i^{(k)}) − F_k(x_i^{(k)})} ) }

Sequential optimization of F_k (a minimal code sketch follows below)
1. Initialize F_k.
2. For t = 1, ..., T:
   1. Seek (f_t, α_t) = argmin_{f, α} L̄_k(F_k + α f)
   2. Update the function as F_k ← F_k + α_t f_t
3. Output F_k.
λ_j = 0 (j ≠ k) ⇒ AdaBoost
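One round of this sequential optimization can be sketched as follows: given the current F_k and the fixed F_j of the other tasks (all evaluated at the task-k training inputs), fit a stump and pick α_t by a one-dimensional search on L̄_k. This is only a sketch; choosing the stump from the AdaBoost-style weights of the exponential-loss term and using a grid search for α are assumptions, and `fit_stump` is the hypothetical helper from the AdaBoost sketch above.

```python
import numpy as np

def risk(Fk, F_others, y, lam):
    # Empirical risk L̄_k: exponential loss on task k plus the IS regularizer
    # pulling F_k toward each fixed F_j on the task-k inputs.
    loss = np.exp(-Fk * y)
    for lam_j, Fj in zip(lam, F_others):
        loss = loss + lam_j * (np.exp(Fk - Fj) + np.exp(Fj - Fk))
    return float(loss.mean())

def boosting_round(X, y, Fk, F_others, lam, alphas=np.linspace(-2.0, 2.0, 401)):
    # Fk, F_others[j]: current values of F_k and of the fixed F_j at the task-k inputs.
    w = np.exp(-Fk * y)
    w = w / w.sum()
    _, j, thr, sign = fit_stump(X, y, w)              # hypothetical helper from the AdaBoost sketch
    h = np.where(X[:, j] > thr, sign, -sign)
    risks = [risk(Fk + a * h, F_others, y, lam) for a in alphas]
    alpha_t = float(alphas[int(np.argmin(risks))])
    return Fk + alpha_t * h, (alpha_t, j, thr, sign)
```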

Comparison of regularizers

    L_k(F_k) = IS(p_k, q_{F_k}; r_k) + Σ_{j≠k} λ_j IS(q_{F_k}, q_{F_j}; r_k)

For two models q_F and q_{F+ε} (or q̄_F and q̄_{F+ε}), where ε is a small perturbation term,
    KL(q̄_F, q̄_{F+ε}; r) ≃ ∫ 2 r(x) q̄_F(y|x) (1 − q̄_F(y|x)) ε(x)² dx        (1)
    IS(q̄_F, q̄_{F+ε}; r) ≃ ∫ 2 r(x) Σ_{y∈{±1}} q̄_F(y|x)² ε(x)² dx           (2)
    IS(q_F, q_{F+ε}; r) ≃ ∫ r(x) ε(x)² dx   (proposed method)                (3)


Comparison of regularizers

- The KL divergence (1) focuses on the region where q̄_F(y|x) ≃ 1/2.
- The IS distance (2) focuses on the region where q̄_F(y|x) ≃ 0 or 1.
- KL divergence (1) ≤ IS distance (3) ≤ IS distance (2)
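The three quadratic approximations (1)-(3) can be checked numerically at a single point x (take r(x) = 1 and a small constant perturbation ε); a sketch in which the particular values of F and ε are chosen only for illustration:

```python
import numpy as np

F, eps = 1.2, 0.01                 # illustrative values; r(x) = 1 at a single point x

def q_bar(f):
    # Normalized (logistic) model q̄_f(y|x) for y in {+1, -1}
    Z = np.exp(f) + np.exp(-f)
    return {+1: np.exp(f) / Z, -1: np.exp(-f) / Z}

p, q = q_bar(F), q_bar(F + eps)

kl_bar = sum(p[y] * np.log(p[y] / q[y]) for y in (+1, -1))                 # KL(q̄_F, q̄_{F+ε})
is_bar = sum(p[y] / q[y] - np.log(p[y] / q[y]) - 1 for y in (+1, -1))      # IS(q̄_F, q̄_{F+ε})
is_ext = sum(np.exp(-eps * y) + eps * y - 1 for y in (+1, -1))             # IS(q_F, q_{F+ε})

print(kl_bar, 2 * p[+1] * (1 - p[+1]) * eps**2)       # matches approximation (1)
print(is_bar, 2 * (p[+1]**2 + p[-1]**2) * eps**2)     # matches approximation (2)
print(is_ext, eps**2)                                 # matches approximation (3)
```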

Proposed method: Case 2 (we want to improve all F_k's)

Dataset D_k is the target:
    IS(p̃_k, q_{F_k}; r̃_k) + Σ_{j≠k} λ_j IS(q_{F_j}, q_{F_k}; r̃_k)
We want to improve all classifiers F_1, ..., F_J simultaneously:
    ⇒ argmin_{F_1,...,F_J} Σ_{k=1}^J π_k { IS(p̃_k, q_{F_k}; r̃_k) + Σ_{j≠k} λ_j IS(q_{F_j}, q_{F_k}; r̃_k) }

Algorithm (a minimal code sketch follows below)
1. Initialize F_1, ..., F_J.
2. For t in 1 : T:
   1. Choose an index k at random from {1, ..., J}.
   2. Update F_k by
          argmin_{F_k} IS(p̃_k, q_{F_k}; r̃_k) + Σ_{j≠k} λ_j IS(q_{F_j}, q_{F_k}; r̃_k).
3. Output F_1, ..., F_J.
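The Case 2 procedure simply applies the Case 1 update to a randomly chosen task at each step. A sketch reusing the hypothetical `boosting_round` and stump helpers from the Case 1 sketch above; performing a single boosting round per visit is a simplification assumed here for illustration:

```python
import numpy as np

def evaluate(model, X):
    # Evaluate F(x) = sum_t alpha_t h_t(x) for a list of stumps (alpha, j, thr, sign).
    F = np.zeros(len(X))
    for alpha, j, thr, sign in model:
        F = F + alpha * np.where(X[:, j] > thr, sign, -sign)
    return F

def multitask_boosting(tasks, lam, T=100, seed=0):
    # tasks: list of (X_k, y_k); lam[k]: list of lambda_j for the other tasks (index order, k skipped).
    rng = np.random.default_rng(seed)
    J = len(tasks)
    models = [[] for _ in range(J)]                   # F_1, ..., F_J initialized to 0
    for _ in range(T):
        k = int(rng.integers(J))                      # choose a task at random
        X, y = tasks[k]
        Fk = evaluate(models[k], X)
        F_others = [evaluate(models[j], X) for j in range(J) if j != k]
        _, stump = boosting_round(X, y, Fk, F_others, lam[k])   # one Case 1 round per visit
        models[k].append(stump)
    return models
```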


1. Itakura-Saito distance, un-normalized model and AdaBoost
2. Multi-task learning using the IS distance
   Experiments
3. Conclusions

Experiment

- The weak classifier is the boosting stump, and we compared:
  - the proposed method,
  - classifiers separately learned by AdaBoost with each dataset,
  - classifiers simultaneously learned by AdaBoost with all datasets.
- Initial classifiers are set as F_j(x) = 0 (j = 1, ..., J).
- All hyper-parameters, such as the step number T and λ, were determined by a validation technique.

Experiment

Datasets and decision boundaries (J = 3)
[Figure: scatter plots of Dataset 1, Dataset 2 and Dataset 3 over (x1, x2) ∈ [−1, 1]², each with its decision boundary]

(Training dataset: 400 examples ⇒ 80% training and 20% validation)
⇓ Discriminant function F_j
⇓ Test dataset: 600 examples

Experiment: Test error

A: Proposed method
B: Separately learned AdaBoost
C: AdaBoost learned with all datasets
[Figure: box plots of the test error of methods A, B and C on Dataset 1, Dataset 2 and Dataset 3 (test error roughly between 0.14 and 0.28)]

Experiment: Decision boundary

[Figure: estimated decision boundaries on Dataset 1, Dataset 2 and Dataset 3 for the Common, Proposed and Separate classifiers, plotted over (x1, x2) ∈ [−1, 1]²]

Conclusion

- We discussed properties of the Itakura-Saito distance.
- The IS distance is appropriate for estimation with the un-normalized (extended) model.
- AdaBoost is derived from minimization of the IS distance between the empirical distribution and the extended model.
- We proposed a multi-task learning method by incorporating a regularizer based on the IS distance into AdaBoost.

Proposed method: Case 1

p_k(y|x) r_k(x): underlying distribution of task k
Optimizer of the abstract risk L_k(F_k) = IS(p_k, q_{F_k}; r_k) + Σ_{j≠k} λ_j IS(q_{F_k}, q_{F_j}; r_k):
    F_k* = argmin_{F_k} L_k(F_k)
         = (1/2) log [ ( p_k(+1|x) + Σ_{j≠k} λ_j exp(F_j(x)) ) / ( p_k(−1|x) + Σ_{j≠k} λ_j exp(−F_j(x)) ) ]

F_k*(x) ≥ 0 does not mean p_k(+1|x) ≥ 1/2. ⇒ Not Bayes optimal
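The closed form of F_k* follows from a pointwise stationarity condition of L_k; a short derivation sketch of the step the slide leaves implicit, using the per-point forms of the two IS terms:

```latex
% Per point x, keeping only the terms of L_k that depend on F_k(x):
%   IS(p_k, q_{F_k}; r_k)     contributes   p_k(+1|x) e^{-F_k(x)} + p_k(-1|x) e^{F_k(x)}
%   IS(q_{F_k}, q_{F_j}; r_k) contributes   e^{F_k(x)-F_j(x)} + e^{-F_k(x)+F_j(x)}
\frac{\partial L_k}{\partial F_k(x)}
  = -p_k(+1|x)\,e^{-F_k(x)} + p_k(-1|x)\,e^{F_k(x)}
    + \sum_{j\neq k}\lambda_j\!\left(e^{F_k(x)-F_j(x)} - e^{-F_k(x)+F_j(x)}\right) = 0
\;\Longrightarrow\;
e^{2F_k(x)} = \frac{p_k(+1|x) + \sum_{j\neq k}\lambda_j e^{F_j(x)}}
                   {p_k(-1|x) + \sum_{j\neq k}\lambda_j e^{-F_j(x)}}
\;\Longrightarrow\;
F_k^*(x) = \frac{1}{2}\log
  \frac{p_k(+1|x) + \sum_{j\neq k}\lambda_j e^{F_j(x)}}
       {p_k(-1|x) + \sum_{j\neq k}\lambda_j e^{-F_j(x)}}.
```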

Proposed method: Case 1

Proposition
Let us assume that
1. there is a common conditional distribution p_0(y|x) among the tasks,
2. p_k(y|x) − p_0(y|x) = δ_k(x) y,
3. q̄_{F_j}(y|x) − p_0(y|x) = ε_j(x) y with ||ε_j(x)|| ≪ 1.
Then
    F_k*(x) ≃ (1/2) log [ p_0(+1|x) / p_0(−1|x) ] + (1/2) · ( δ_k(x) + Σ_{j≠k} λ_j ε_j(x) ) / ( 2P + 2P Σ_{j≠k} λ_j ),
where P = √( p_0(+1|x) p_0(−1|x) ).

The difference between F_k*(x) and the classifier defined by the common distribution p_0(y|x) decreases when ε_j(x) is small.