A novel boosting algorithm for multi-task learning based on the Itakura-Saito divergence

Takashi Takenouchi∗, Osamu Komori† and Shinto Eguchi†

∗ Future University Hakodate, [email protected]
† Institute of Statistical Mathematics, Japan

Abstract. In this paper, we propose a novel multi-task learning algorithm based on an ensemble learning method. We consider a specific multi-task learning setting for binary classification problems, in which features are shared among all tasks and every task is a target of performance improvement. We focus on a situation in which the structure shared among the datasets is represented by a divergence between the underlying distributions associated with the tasks. We discuss properties of the proposed method and investigate its validity with numerical experiments.

Keywords: Multi-task learning, Itakura-Saito distance, Un-normalized pseudo model

INTRODUCTION

In the framework of multi-task learning, we assume that there are multiple related tasks (datasets) sharing a common structure, and we can exploit that shared structure to improve the generalization performance of predictors for all tasks [7]. This framework has been successfully applied to various kinds of applications such as medical diagnosis. Most methods exploit similarity among tasks by representing the shared structure as a regularization term [1, 3]. We tackle this problem with a boosting method, which makes it possible to adaptively learn complicated problems at low computational cost. Boosting methods are notable implementations of ensemble learning and try to construct a better classifier by combining weak classifiers without heavy computational cost; AdaBoost is the most popular boosting method, and many variants, including TrAdaBoost for multi-task learning [2], have been developed. In this paper, we first show that AdaBoost can be derived by sequential minimization of the Itakura-Saito (IS) distance between an empirical distribution and an un-normalized pseudo model constructed from a classifier. The IS distance is a special case of the Bregman divergence between positive measures and is frequently used in the field of signal processing. Second, we propose a novel boosting algorithm for multi-task learning based on the IS distance. We employ the IS distance as a discrepancy measure between the pseudo models associated with the tasks and incorporate it as a regularizer into AdaBoost. The proposed method can capture the shared structure, i.e., the relationship between the underlying distributions, by considering the divergence between pseudo models estimated from the constructed classifiers. We discuss statistical properties of the proposed method and investigate the validity of the regularization by the IS distance with small experiments on a synthetic dataset.

SETTINGS

In this study, we focus on binary classification problems. Let x be an input and y ∈ Y = {±1} be a class label. Let us assume that J datasets D_j = {(x_i^{(j)}, y_i^{(j)})}_{i=1}^{n_j} (j = 1, ..., J) are given, and let p_j(y|x) r_j(x) and p̃_j(y|x) r̃_j(x) be the underlying distribution and the empirical distribution associated with the dataset D_j, respectively. Here we assume that each conditional distribution of y given x is written as

    p_k(y|x) = p_0(y|x) + δ_k(x) y,    (1)

where p_0(y|x) is a conditional distribution common to all datasets and δ_k(x) is a perturbation term specific to the dataset D_k. Note that ∑_{y∈Y} δ_k(x) y = 0 because p_k(y|x) is a probability distribution. While a discriminant function F_k is usually constructed using only the dataset D_k, multi-task learning aims to improve the performance of the discriminant function for each dataset D_k with the help of the datasets D_j (j ≠ k). For this purpose, we consider a risk minimization problem defined with an un-normalized pseudo model and the Itakura-Saito (IS) distance, a discrepancy measure frequently used in the field of signal processing.

ITAKURA-SAITO DISTANCE AND PSEUDO MODEL

Let M = { m(y) | 0 ≤ ∑_{y∈Y} m(y) < ∞ } be the space of all positive finite measures over Y. The Itakura-Saito distance between p, q ∈ M is defined as

    IS(p, q; r) = ∫ r(x) ∑_{y∈Y} { log (q(y|x)/p(y|x)) − 1 + p(y|x)/q(y|x) } dx,    (2)

where r(x) is a marginal distribution of x shared by p, q ∈ M. Note that the IS distance is a statistical version of the Bregman divergence [6], and we observe that IS(p, q; r) ≥ 0, with IS(p, q; r) = 0 if and only if p = q.
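As a quick numerical illustration of (2), the following sketch evaluates the inner sum at a single fixed x, dropping the integral over r(x); the measures p and q are made-up values.

```python
import math

def is_distance(p, q):
    """Inner sum of the IS distance (2) at a single fixed x:
    p is a probability distribution and q a positive measure on Y = {+1, -1}."""
    return sum(math.log(q[y] / p[y]) - 1.0 + p[y] / q[y] for y in (+1, -1))

p = {+1: 0.7, -1: 0.3}   # a probability distribution on Y
q = {+1: 1.2, -1: 0.4}   # an un-normalized positive measure on Y

print(is_distance(p, q) >= 0.0)   # True: non-negativity
print(is_distance(p, p) == 0.0)   # True: zero iff p = q
```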

Parameter estimation with the pseudo model

Let q_F(y|x) be an un-normalized pseudo model associated with a function F(x),

    q_F(y|x) = exp(F(x)y).    (3)

Note that q_F(y|x) is not a probability function, i.e., ∑_{y∈Y} q_F(y|x) ≠ 1 in general. If q_F(y|x) is normalized, the model reduces to the classical logistic model

    q̄_F(y|x) = exp(F(x)y) / (exp(F(x)) + exp(−F(x))).    (4)

When the function F is parametrized by θ, the maximum likelihood estimation (MLE) argmax_θ ∑_{i=1}^n log q̄_F(y_i|x_i), or equivalently minimization of the (extended) Kullback-Leibler (KL) divergence, is a powerful tool for estimating θ, and the MLE has properties such as asymptotic consistency and efficiency under some regularity conditions. Here we consider parameter estimation with the pseudo model (3) rather than the normalized model (4).

Proposition 1. Assume that the underlying distribution is written as p(y|x) = q̄_{F_0}(y|x) where F_0 is a function. Then we observe

    argmin_F IS(p, q_F; r) = F_0,  and  argmin_F IS(q_F, p; r) = F_0.    (5)

On the other hand, when we consider estimation based on an extended KL divergence, i.e., argmin_F KL(p, q_F; r) where

    KL(p, q; r) = ∫ r(x) ∑_{y∈Y} { p(y|x) log (p(y|x)/q(y|x)) − p(y|x) + q(y|x) } dx,

we observe the following.

Remark 1. Assume that the underlying distribution is written as p(y|x) = q̄_{F_0}(y|x) where F_0 is a function (F_0 ≠ 0). Then we observe

    argmin_F KL(p, q_F; r) ≠ F_0,  and  argmin_F KL(q_F, p; r) ≠ F_0.    (6)

This shows that the extended KL divergence is not appropriate for estimation with the un-normalized pseudo model.
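Remark 1 can be verified numerically at a single point x. The sketch below (with illustrative probabilities) minimizes the extended KL divergence KL(p, q_F) over a grid of F values; setting the derivative to zero gives sinh(F) = (p(+1) − p(−1))/2, so the minimizer differs from F_0 = (1/2) log(p(+1)/p(−1)).

```python
import math

def kl_ext(p, F):
    """Extended KL divergence between p and the pseudo model q_F at one x."""
    return sum(p[y] * (math.log(p[y]) - F * y) - p[y] + math.exp(F * y)
               for y in (+1, -1))

p = {+1: 0.8, -1: 0.2}                 # p = qbar_{F0} at this x
F0 = 0.5 * math.log(p[+1] / p[-1])     # F0 = 0.6931...

# grid search for argmin_F KL(p, q_F)
grid = [i / 10000.0 for i in range(-30000, 30001)]
F_hat = min(grid, key=lambda F: kl_ext(p, F))

# the stationarity condition gives F_hat = asinh((p(+1) - p(-1)) / 2) != F0
print(F0, F_hat)
```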

Relationship with AdaBoost

The IS distance between the underlying conditional distribution p(y|x) and the un-normalized pseudo model q_F(y|x) is written as

    IS(p, q_F; r) = C + ∫ r(x) ∑_{y∈Y} { F(x)y + p(y|x)/q_F(y|x) } dx
                  = C + ∫ r(x) ∑_{y∈Y} p(y|x) e^{−F(x)y} dx,    (7)

where C is a constant; (7) is equivalent to an abstract loss of AdaBoost except for the constant term. Sequential minimization of an empirical version of (7) is therefore equivalent to the AdaBoost algorithm. Also, [4, 6] discussed that a gradient-based boosting algorithm can be derived from minimization of the KL divergence or the Bregman divergence between the underlying distribution and an un-normalized pseudo model. An important difference between those frameworks and our framework (7) is the employed pseudo model. The pseudo model employed by the previous frameworks assumes a condition called the "consistent data assumption" and is defined with the empirical distribution, implying that the pseudo model varies depending on the dataset. By contrast, the pseudo model (3) employed in (7) is fixed regardless of the dataset, as usual statistical models are. Also, the IS distance between two pseudo models q_F(y|x) and q_{F′}(y|x) is written as

    IS(q_F, q_{F′}; r) = Const. + ∫ r(x) { exp(F(x) − F′(x)) + exp(F′(x) − F(x)) } dx.    (8)

Note that IS(q_{F′}, q_F; r) = IS(q_F, q_{F′}; r) holds for arbitrary q_F and q_{F′}, while the IS distance itself is not necessarily symmetric. Also note that this symmetry does not hold for the normalized models q̄_F and q̄_{F′}.
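The symmetry noted after (8), and its failure for normalized models, can be checked at a single x (the integral over r(x) is dropped; the values of F and F′ below are arbitrary):

```python
import math

def qbar(F, y):
    """Normalized logistic model (4) at one x."""
    return math.exp(F * y) / (math.exp(F) + math.exp(-F))

def is_point(p, q):
    """IS distance summand at a single x for positive measures on {+1, -1}."""
    return sum(math.log(q(y) / p(y)) - 1.0 + p(y) / q(y) for y in (+1, -1))

F, G = 0.4, -1.1
q_F = lambda y: math.exp(F * y)   # un-normalized pseudo models (3)
q_G = lambda y: math.exp(G * y)

# symmetric for pseudo models, cf. (8)
print(abs(is_point(q_F, q_G) - is_point(q_G, q_F)) < 1e-9)   # True

# the symmetry fails for the normalized models (4)
d1 = is_point(lambda y: qbar(F, y), lambda y: qbar(G, y))
d2 = is_point(lambda y: qbar(G, y), lambda y: qbar(F, y))
print(abs(d1 - d2) > 1e-3)                                   # True
```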

PROPOSED METHOD

There are two main frameworks for multi-task learning. In the first, there is a target dataset D_k, and our interest is to construct a discriminant function F_k utilizing the remaining datasets D_j (j ≠ k) or a priori constructed discriminant functions F_j (j ≠ k). In the second, our interest is to simultaneously construct better discriminant functions F_1, ..., F_J from all J datasets D_1, ..., D_J by utilizing the information shared among the datasets.

Case 1

In this subsection, we focus on the first framework above. Let us assume that discriminant functions F_j(x) (j ≠ k) are given, i.e., constructed by an arbitrary binary classification method. Then let us consider the risk function

    L_k(F_k) = IS(p_k, q_{F_k}; r_k) + ∑_{j≠k} λ_j IS(q_{F_k}, q_{F_j}; r_k)    (9)
             = ∫ r_k(x) { ∑_{y∈Y} p_k(y|x) e^{−F_k(x)y} + ∑_{j≠k} λ_j ( e^{F_k(x)−F_j(x)} + e^{F_j(x)−F_k(x)} ) } dx,    (10)

where the λ_j (j ≠ k) are regularization constants. Note that the risk function depends on the functions F_j (j ≠ k). The second term becomes small when the target discriminant function F_k is similar to the functions F_j (j ≠ k) in the sense of the IS distance, and it corresponds to a regularizer incorporating the information shared among datasets into the target function F_k. Note that the marginal distribution r_k is shared in the second term for ease of implementation and simplicity of theoretical analysis. The minimizer F_k^* of the risk function satisfies

    δL_k(F_k)/δF_k(x) |_{F_k = F_k^*}
      = −p_k(+1|x) e^{−F_k^*(x)} + p_k(−1|x) e^{F_k^*(x)} + ∑_{j≠k} λ_j ( e^{F_k^*(x)−F_j(x)} − e^{F_j(x)−F_k^*(x)} ) = 0,

which implies

    F_k^*(x) = (1/2) log [ ( p_k(+1|x) + ∑_{j≠k} λ_j exp(F_j(x)) ) / ( p_k(−1|x) + ∑_{j≠k} λ_j exp(−F_j(x)) ) ],    (11)

or equivalently

    p_k(y|x) = p_{0,k}(y|x) ( 1 + ∑_{j≠k} λ_j exp(−F_j(x)y) ) − p_{0,k}(−y|x) ∑_{j≠k} λ_j exp(F_j(x)y),    (12)

where p_{0,k}(y|x) = exp(F_k^*(x)y) / (exp(F_k^*(x)) + exp(−F_k^*(x))). This can be interpreted as a probabilistic model of asymmetric mislabeling [8, 9]. In (11), the confidence of classification is discounted by the outputs of the remaining discriminant functions when the classifier sgn(F_k^*(x)) makes a decision different from those of sgn(F_j(x)) (j ≠ k). Note that F_k^*(x) ≥ 0 does not imply p_k(+1|x) ≥ 1/2 unless F_j(x) = (1/2) log ( p_k(+1|x)/p_k(−1|x) ) holds.
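The closed form (11) can be checked against the stationarity condition numerically; the probabilities, F_j values, and λ_j below are made-up illustration values for a single point x.

```python
import math

p_plus, p_minus = 0.65, 0.35        # p_k(+1|x), p_k(-1|x)
F_others = [0.3, -0.5]              # F_j(x) for j != k
lams = [0.2, 0.1]                   # regularization constants lambda_j

# closed-form minimizer (11)
num = p_plus + sum(l * math.exp(Fj) for l, Fj in zip(lams, F_others))
den = p_minus + sum(l * math.exp(-Fj) for l, Fj in zip(lams, F_others))
F_star = 0.5 * math.log(num / den)

# functional derivative of the risk (10) evaluated at F_star
grad = (-p_plus * math.exp(-F_star) + p_minus * math.exp(F_star)
        + sum(l * (math.exp(F_star - Fj) - math.exp(Fj - F_star))
              for l, Fj in zip(lams, F_others)))
print(abs(grad) < 1e-10)   # True: (11) solves the stationarity condition
```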

Proposition 2. Let us assume that F_j(x) satisfies

    exp(F_j(x)y) / ( exp(F_j(x)) + exp(−F_j(x)) ) = p_0(y|x) + ε_j(x)y,   ||ε_j(x)|| ≪ 1.    (13)

Then (11) can be approximated as

    F_k^*(x) ≃ (1/2) log ( p_0(+1|x)/p_0(−1|x) ) + (1/2) ( P δ_k(x) + ∑_{j≠k} λ_j ε_j(x) ) / ( P^3 + Λ_k P^2 ),    (14)

where P = √( p_0(+1|x) p_0(−1|x) ) and Λ_k = ∑_{j≠k} λ_j.
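Proposition 2 can also be sanity-checked numerically. The sketch below compares the exact minimizer (11), with p_k = p_0 + δ_k y and the F_j recovered from condition (13), against the first-order approximation (14); all concrete values are illustrative.

```python
import math

p0_plus, p0_minus = 0.6, 0.4         # common structure p_0(y|x) at one x
delta = 0.01                         # delta_k(x)
eps = [0.005, -0.003]                # eps_j(x) for j != k
lams = [0.3, 0.2]                    # lambda_j

# exact minimizer (11): F_j recovered from condition (13)
F_others = [0.5 * math.log((p0_plus + e) / (p0_minus - e)) for e in eps]
num = p0_plus + delta + sum(l * math.exp(F) for l, F in zip(lams, F_others))
den = p0_minus - delta + sum(l * math.exp(-F) for l, F in zip(lams, F_others))
F_exact = 0.5 * math.log(num / den)

# first-order approximation (14)
P = math.sqrt(p0_plus * p0_minus)
Lam = sum(lams)
F_approx = (0.5 * math.log(p0_plus / p0_minus)
            + 0.5 * (P * delta + sum(l * e for l, e in zip(lams, eps)))
            / (P ** 3 + Lam * P ** 2))
print(abs(F_exact - F_approx))   # small: error is second order in the perturbations
```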

We observe that the discrepancy δ_k of F_k^* from the optimal discriminant function associated with the common structure p_0 is moderated by the mixture of the ε_j when the perturbations ε_j are independently and identically distributed. An empirical version of (10) is written as

    L̄_k(F_k) = (1/n_k) ∑_{i=1}^{n_k} [ e^{−F_k(x_i^{(k)}) y_i^{(k)}} + ∑_{j≠k} λ_j ( e^{F_k(x_i^{(k)}) − F_j(x_i^{(k)})} + e^{F_j(x_i^{(k)}) − F_k(x_i^{(k)})} ) ].    (15)

An algorithm is derived by sequential minimization of (15), updating F_k to F_k + α f, i.e., (α, f) = argmin_{α,f} L̄_k(F_k + α f), where f is a weak classifier and α is its coefficient [5].

1. Initialize the function to F_k^0 and define weights for the i-th example with the current function F as

       w_1(i; F) = e^{−F(x_i^{(k)}) y_i^{(k)}} / Z_1(F),
       w_2(i; F) = ∑_{j≠k} λ_j e^{−f(x_i^{(k)}) ( F(x_i^{(k)}) − F_j(x_i^{(k)}) )} / Z_2(F),

   where

       Z_1(F) = ∑_{i=1}^{n_k} e^{−F(x_i^{(k)}) y_i^{(k)}},
       Z_2(F) = ∑_{i=1}^{n_k} ∑_{j≠k} λ_j ( e^{F(x_i^{(k)}) − F_j(x_i^{(k)})} + e^{−F(x_i^{(k)}) + F_j(x_i^{(k)})} ).

2. For t = 1, ..., T:
   (a) Select a weak classifier f_k^t ∈ {±1} which minimizes the quantity

       ε(f) = [ Z_1(F_k^{t−1}) ε_1(f) + Z_2(F_k^{t−1}) ε_2(f) ] / [ Z_1(F_k^{t−1}) + Z_2(F_k^{t−1}) ],    (16)

       where ε_1(f) = ∑_{i=1}^{n_k} w_1(i; F_k^{t−1}) I( f(x_i^{(k)}) ≠ y_i^{(k)} ) and ε_2(f) = ∑_{i=1}^{n_k} w_2(i; F_k^{t−1}).
   (b) Calculate the coefficient of f_k^t as α_k^t = (1/2) log [ (1 − ε(f_k^t)) / ε(f_k^t) ].
   (c) Update the discriminant function as F_k^t = F_k^{t−1} + α_k^t f_k^t.
3. Output F_k^T(x) = F_k^0(x) + ∑_{t=1}^T α_k^t f_k^t(x).

In step 1, F_k^0 is typically initialized as F_k^0(x) = 0. The criterion (16) is a mixture of two quantities: ε_1(f) is the weighted error rate of the classifier f, and ε_2(f) is the sum of the weights w_2, which represents the degree of discrepancy between f and F − F_j; ε_2(f) becomes large when the update by f moves F away from F_j.
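The Case 1 algorithm above can be sketched in code. The following is a minimal illustration rather than the authors' implementation: decision stumps searched over quantile thresholds are one arbitrary choice of weak learner, and the fixed functions F_j are assumed to be precomputed on the training points.

```python
import numpy as np

def stump_predict(X, feat, thr, sign):
    """Weak classifier f(x) in {+1, -1}: a decision stump on one feature."""
    return sign * np.where(X[:, feat] > thr, 1, -1)

def fit_case1(X, y, F_others, lams, T=30, n_thr=10):
    """Sketch of the Case 1 boosting algorithm with decision stumps.
    F_others: arrays with F_j evaluated on the training points (j != k)."""
    n, d = X.shape
    F = np.zeros(n)                  # step 1: F_k^0 = 0 on the training points
    model = []                       # list of (alpha, feat, thr, sign)
    thrs = [np.quantile(X[:, j], np.linspace(0.05, 0.95, n_thr))
            for j in range(d)]
    for _ in range(T):
        Z1_terms = np.exp(-F * y)
        Z1 = Z1_terms.sum()
        Z2 = sum(l * (np.exp(F - Fj) + np.exp(Fj - F)).sum()
                 for l, Fj in zip(lams, F_others))
        best = None
        for feat in range(d):
            for thr in thrs[feat]:
                for sign in (+1, -1):
                    f = stump_predict(X, feat, thr, sign)
                    # eps1: weighted error rate; eps2: discrepancy from the F_j
                    eps1 = (Z1_terms * (f != y)).sum() / Z1
                    eps2 = (sum(l * np.exp(-f * (F - Fj)).sum()
                                for l, Fj in zip(lams, F_others)) / Z2
                            if Z2 > 0 else 0.0)
                    eps = (Z1 * eps1 + Z2 * eps2) / (Z1 + Z2)  # criterion (16)
                    if best is None or eps < best[0]:
                        best = (eps, feat, thr, sign, f)
        eps, feat, thr, sign, f = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)                  # step 2(b)
        F = F + alpha * f                                      # step 2(c)
        model.append((alpha, feat, thr, sign))
    return model

def predict(model, X):
    """Step 3: sign of the combined discriminant function."""
    F = np.zeros(len(X))
    for alpha, feat, thr, sign in model:
        F += alpha * stump_predict(X, feat, thr, sign)
    return np.sign(F)
```

With all λ_j = 0 (or an empty F_others) this reduces to plain AdaBoost with stumps; the quantile grid and the clipping of ε(f) are implementation conveniences not specified in the paper.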

Case 2

In this subsection, we consider simultaneous construction of the discriminant functions F_1, ..., F_J by minimizing the following risk function:

    L(F_1, ..., F_J) = ∑_{j=1}^J π_j L_j(F_j),    (17)

where the π_j (j = 1, ..., J) are constants and L_j is defined in (10). While we can directly minimize the empirical version of (17), the derived algorithm is complicated and computationally heavy. We therefore derive a simplified algorithm that reuses the algorithm of Case 1, in which the target dataset is fixed.

1. Initialize the functions F_1, ..., F_J.
2. For t = 1, ..., T:
   (a) Randomly choose a target index k ∈ {1, ..., J}.
   (b) Update the function F_k by S steps of the Case 1 algorithm, with the functions F_j (j ≠ k) fixed.
3. Output the learned functions F_1, ..., F_J.

Note that the empirical risk function is not necessarily monotonically decreased: minimization of L_k(F_k) trades off the first term against the second (regularization) term, and a decrease of L_k(F_k) does not in general mean a decrease of the regularization term.

COMPARISON OF REGULARIZATION TERMS

The proposed method incorporates a regularization term defined by the IS distance into AdaBoost. In this section, we discuss a property of this regularization term.

Proposition 3. Let us assume that a perturbation function ε(x) satisfies |ε(x)| ≪ 1. Then

    KL(q̄_F, q̄_{F+ε}; r) ≃ ∫ 2 r(x) q̄_F(+1|x) q̄_F(−1|x) ε(x)^2 dx,    (18)
    KL(q_F, q_{F+ε}; r) ≃ ∫ r(x) ε(x)^2 / ( 2 √( q̄_F(+1|x) q̄_F(−1|x) ) ) dx,    (19)
    IS(q̄_F, q̄_{F+ε}; r) ≃ ∫ 2 r(x) ∑_{y∈Y} q̄_F(y|x)^2 ε(x)^2 dx,    (20)
    IS(q_F, q_{F+ε}; r) ≃ ∫ r(x) ε(x)^2 dx.    (21)
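The four approximations in Proposition 3 can be checked numerically at a single x (dropping the integral over r(x)); the values of F and ε below are arbitrary illustrative choices.

```python
import math

F, eps = 0.7, 0.01          # base value of F(x) and a small perturbation

def qbar(F, y):
    return math.exp(F * y) / (math.exp(F) + math.exp(-F))

def kl(p, q):               # extended KL divergence summand at one x
    return sum(p(y) * math.log(p(y) / q(y)) - p(y) + q(y) for y in (+1, -1))

def isd(p, q):              # IS distance summand at one x
    return sum(math.log(q(y) / p(y)) - 1.0 + p(y) / q(y) for y in (+1, -1))

pp, pm = qbar(F, +1), qbar(F, -1)
exact = {
    "(18)": kl(lambda y: qbar(F, y), lambda y: qbar(F + eps, y)),
    "(19)": kl(lambda y: math.exp(F * y), lambda y: math.exp((F + eps) * y)),
    "(20)": isd(lambda y: qbar(F, y), lambda y: qbar(F + eps, y)),
    "(21)": isd(lambda y: math.exp(F * y), lambda y: math.exp((F + eps) * y)),
}
approx = {
    "(18)": 2 * pp * pm * eps ** 2,
    "(19)": eps ** 2 / (2 * math.sqrt(pp * pm)),
    "(20)": 2 * (pp ** 2 + pm ** 2) * eps ** 2,
    "(21)": eps ** 2,
}
for k in exact:
    print(k, exact[k], approx[k])   # each pair agrees up to higher order in eps
```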

[Figure 1 here: three scatter plots (Dataset 1, Dataset 2, Dataset 3) over (x1, x2) ∈ [−1, 1]^2.]

FIGURE 1. The three generated datasets and decision boundaries.

These relations imply that the KL divergence (18) emphasizes the region of inputs x whose conditional probability is nearly equal to 1/2, while the IS distance (20) focuses on the region of x whose conditional probability is nearly equal to 0 or 1. The IS distance between pseudo models (21) is intermediate between (18) and (20).

EXPERIMENTS

In this section, we investigate the performance of the proposed method on a synthetic dataset in the situation described in Case 2. We set the number J of datasets to 3 and assume that the marginal distribution of x is the uniform distribution on [−1, 1]^2 and that the discriminant function F_j (j = 1, 2, 3) associated with each dataset is generated by F_j(x) = c_{j,2}(x_1 − c_{j,1}) − x_2, where c_{j,1} ∼ N(0, 0.2^2) and c_{j,2} ∼ N(1, 0.1^2). In addition, we randomly added contamination noise to the labels y. Under these settings, we generated a training dataset of 400 examples and a test dataset of 600 examples. The generated datasets are shown in Figure 1; each discriminant function and noise structure differs from the other two. We compared the proposed method (A) with AdaBoost learned on each individual dataset (B) and AdaBoost learned on all datasets simultaneously (C). We employed the boosting stump¹ as the weak classifier and fixed π_j = 1/3. A boosting-type method has a hyper-parameter T, the number of boosting steps, and the proposed method additionally has the hyper-parameters λ_j. In the experiment, we assumed λ_j = λ for all j and determined λ and T by validation: we used 80% of the training dataset for training the classifiers and the remaining 20% for validation. We repeated the above procedure 20 times and observed the averaged performance of the methods. Figure 2 shows boxplots of the test errors of each method for the datasets D_j (j = 1, 2, 3). We observe that the proposed method consistently outperforms both the individually learned AdaBoost and AdaBoost learned on all datasets simultaneously, implying that the proposed method can incorporate the information shared among datasets into the classifiers.

¹ A boosting stump is a decision tree with only one node.
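The synthetic setup can be reproduced approximately as follows; since the paper does not specify the contamination rate or scheme, a 10% uniform label-flip rate is assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, noise=0.1):
    """One synthetic task: x uniform on [-1, 1]^2, labels from
    F_j(x) = c_{j,2} (x_1 - c_{j,1}) - x_2, then flipped with prob `noise`
    (the flip rate/scheme is an assumption, not from the paper)."""
    c1 = rng.normal(0.0, 0.2)            # c_{j,1} ~ N(0, 0.2^2)
    c2 = rng.normal(1.0, 0.1)            # c_{j,2} ~ N(1, 0.1^2)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.sign(c2 * (X[:, 0] - c1) - X[:, 1]).astype(int)
    y[rng.random(n) < noise] *= -1       # contamination noise on labels
    return X, y

# J = 3 tasks, 400 training examples each
tasks = [make_task(400) for _ in range(3)]
```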

[Figure 2 here: boxplots of the test error (vertical axis, roughly 0.14 to 0.28) for methods A, B, and C on each of the three datasets.]

FIGURE 2. Boxplots of the test error of each method (A: proposed method, B: AdaBoost learned with the individual dataset, and C: AdaBoost learned with all datasets simultaneously) for the three datasets, over the 20 simulation trials.

CONCLUSIONS

In this paper, we proposed a novel binary classification method for multi-task learning. The proposed method is based on the Itakura-Saito distance and an un-normalized pseudo model. The IS distance between pseudo models can be interpreted as a regularization term incorporating information shared among tasks into the binary classifier for the target dataset. We numerically investigated the performance of the proposed method on a synthetic dataset and verified its effectiveness.

Acknowledgements. This study was partially supported by a Grant-in-Aid for Young Scientists (B), 50403340, from MEXT, Japan.

REFERENCES

1. A. Argyriou, M. Pontil, Y. Ying, and C. A. Micchelli. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems, pages 25–32, 2007.
2. W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pages 193–200. ACM, 2007.
3. A. Evgeniou and M. Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41, 2007.
4. G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. Advances in Neural Information Processing Systems, 14:447, 2002.
5. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. Advances in Neural Information Processing Systems, 12, 1999.
6. N. Murata, T. Takenouchi, T. Kanamori, and S. Eguchi. Information geometry of U-Boost and Bregman divergence. Neural Computation, 16(7):1437–1481, 2004.
7. S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
8. T. Takenouchi and S. Eguchi. Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16(4):767–787, 2004.
9. T. Takenouchi, S. Eguchi, N. Murata, and T. Kanamori. Robust boosting algorithm against mislabeling in multi-class problems. Neural Computation, 20(6):1596–1630, 2008.