Incorporating Prior-Knowledge in Support Vector Machines by Kernel Adaptation

Antoine Veillard*†, Daniel Racoceanu† and Stéphane Bressan*
* School of Computing, National University of Singapore, Singapore 117417
† IPAL CNRS, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
Email: [email protected]

Abstract—SVMs with the general purpose RBF kernel are widely considered as state-of-the-art supervised learning algorithms due to their effectiveness and versatility. However, in practice, SVMs often require more training data than readily available. Prior-knowledge may be available to compensate for this shortcoming provided such knowledge can be effectively passed on to SVMs. In this paper, we propose a method for the incorporation of prior-knowledge via an adaptation of the standard RBF kernel. Our practical and computationally simple approach allows prior-knowledge in a variety of forms, ranging from regions of the input space given as crisp or fuzzy sets to pseudo-periodicity. We show that this method is effective and that the amount of required training data can be greatly decreased, opening the way for new uses of SVMs. We propose a validation of our approach for pattern recognition and regression tasks with publicly available datasets in different application domains.

Keywords: support vector machine; prior-knowledge; kernel; breast cancer

I. MOTIVATION

Support Vector Machines (SVM) with their numerous variants for pattern recognition and scalar regression are often regarded as state-of-the-art supervised learning tools. In particular, the possibility of using them in conjunction with general purpose nonlinear kernels such as the Radial Basis Function (RBF) kernel makes them a versatile and yet powerful out-of-the-box solution. However, the practicality of such supervised learning algorithms is limited by the requirement for quality training data in sufficient amounts. Sometimes, it is possible to compensate for the lack of training data with other prior-knowledge on the problem. In fact, situations where data is scarce but prior-knowledge such as advice from experts or good intuition is abundantly available are rather common. For instance, in the medical field, the collection of adequate data is often difficult due to high costs or ethical issues, but large amounts of knowledge are usually available from well documented medical sources and the experience of specialist doctors.

In this paper, we propose a solution for the incorporation of prior-knowledge into SVM classification and regression based on modifications of the standard RBF kernel (Sect. III). The principle behind our method, based on kernel theory, is to induce appropriate alterations of the kernel distance according to the prior-knowledge. Our method is more versatile and practical than previous approaches by enabling the integration of a wide range of prior-knowledge. In this paper, we deal with prior-knowledge such as advice on the labeling of particular regions of the input space given in a crisp or fuzzy way, and the pseudo-periodicity of the labels w.r.t. a specific variable. Our method also provides transparent control over the amount of prior-knowledge which is incorporated, via the tuning of a single parameter. Our objective is to enable a switch of paradigm for the utilization of SVMs: from an often unrealistic situation where a lot of training data is required, to much fewer labelled data complemented with general advice on the problem. In Sect. IV, we propose an empirical validation of our method in pattern recognition and scalar regression contexts for medical and climatological applications using publicly available data.

II. RELATED WORK

The incorporation of prior-knowledge into SVMs has recently attracted significant interest from the research community. Prior-knowledge usually refers to any information other than what can be inferred from the training data. It can take a wide variety of forms. A well organized general review of the recent work up to 2008 is available from Lauer and Bloch [1]. In this paper, we are interested in prior-knowledge which can be obtained as extra advice from experts on a particular problem. Labeling regions of the input space is one such form of prior-knowledge which has recently been investigated by various research groups. Initially, Mangasarian et al. proposed a framework they called "knowledge based linear programming" for the incorporation of sets as additional constraints into the optimization problem of the Linear Programming SVM [2], [3]. An extension was also proposed for scalar regression [4]. This first solution is restricted to the incorporation of convex regions of the feature space. It is also somewhat unintuitive when nonlinear kernels such as the RBF kernel are used. A refinement addressing this issue and allowing for non-convex sets is proposed in [5], [6]. Simpler alternatives to the rather complex framework by Mangasarian et al. were also proposed by Le and Smola [7] as well as Maclin et al. [8].
An online version following a passive-aggressive framework was recently proposed by Kunapuli et al. [9].

Most of the previous work incorporates the prior-knowledge on sets in the form of additional constraints into the SVM optimization problem. This results in a shifting of the optimum, but the hypothesis space from which the optimum is drawn does not change. In particular, if the hypothesis space does not contain a good solution, shifting the optimum will not be sufficient. In contrast, our method presented in the next section modifies the kernel function, which results in an adaptation of the hypothesis space itself.

III. METHODOLOGY

Throughout this whole section, let X be an arbitrary set referred to as the input space.

A. Notion of Kernel Distance

The goal of this section is to introduce the notion of kernel distance resulting from a central theorem of the theory of positive definite (PD) kernels. Our contribution builds on the adaptation of this kernel distance according to the prior-knowledge. PD kernels are a practical tool widely used in many fields of computer science including machine learning.

Definition 1. A PD kernel over X is a function K : X × X → R such that:
1) K is symmetric.
2) For all N ∈ N, all (x_1, x_2, ..., x_N) ∈ X^N and all (v_1, v_2, ..., v_N) ∈ R^N:

$$\sum_{i=1}^{N} \sum_{j=1}^{N} v_i v_j K(x_i, x_j) \ge 0. \qquad (1)$$
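As an illustration (ours, not part of the paper), condition (1) amounts to requiring that every Gram matrix built from the kernel is positive semi-definite, which can be checked numerically; the helper name below is ours.

import numpy as np

def satisfies_condition_1(gram, tol=1e-10):
    # Condition (1): v^T K v >= 0 for all v, i.e. the symmetric Gram matrix is positive semi-definite
    eigenvalues = np.linalg.eigvalsh(gram)   # eigvalsh assumes a symmetric matrix
    return bool(np.all(eigenvalues >= -tol))

# Example: Gram matrix of the linear kernel K(x1, x2) = <x1, x2> on a few random points
X = np.random.default_rng(0).normal(size=(5, 3))
print(satisfies_condition_1(X @ X.T))        # True: the linear kernel is PD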

Essentially, a PD kernel can be understood as a generalization of the notion of inner product. The following is a central result on PD kernels proved by Nachman Aronszajn in 1950.

Theorem 2. $K : X^2 \to \mathbb{R}$ is PD iff there is a real Hilbert space $(H, \langle\cdot,\cdot\rangle_H)$ and a mapping $\psi : X \to H$ such that $\forall (x_1, x_2) \in X^2$, $K(x_1, x_2) = \langle\psi(x_1), \psi(x_2)\rangle_H$. In addition, H (referred to as the kernel space) and ψ are unique.¹

In other words, a PD kernel is in fact an inner product after embedding the data into a real Hilbert space. Aronszajn's theorem has a very important algorithmic application known as the kernel trick, which consists in the replacement of every inner product evaluation by a kernel evaluation. Indeed, if an algorithm only requires Gram matrices as inputs, then replacing them with the corresponding kernel Gram matrices effectively applies the algorithm to the data embedded in the kernel space H instead of the original input space X, without explicitly knowing the mapping ψ or the space H. Only a PD kernel is required.

¹ Technically speaking, only the RKHS is unique. Nevertheless, embeddings of X into any such Hilbert space H are related by an isometry.

In our experiments, we use the RBF kernel as our basic kernel, which performs a nonlinear embedding into its kernel space H_rbf. With parameter γ > 0, its expression is:

$$K_{rbf}(x_1, x_2) = \exp\left(-\gamma \|x_1 - x_2\|^2\right). \qquad (2)$$

The distance d_H(x_1, x_2) between two data points (x_1, x_2) ∈ X² after embedding them into the kernel space H is referred to as the kernel distance. Using Theorem 2, it can be expressed in terms of kernel products:

$$d_H(x_1, x_2) = \sqrt{K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)}. \qquad (3)$$

Therefore, the RBF kernel distance is:

$$d_{H_{rbf}}(x_1, x_2) = \sqrt{2 - 2K_{rbf}(x_1, x_2)}. \qquad (4)$$

Note that the RBF kernel distance always belongs to the interval $[0, \sqrt{2}]$.
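As an illustrative sketch (ours, not from the paper), the RBF kernel of equation (2) and the induced kernel distance of equations (3)-(4) can be computed as follows; the function names are ours.

import numpy as np

def rbf_kernel_value(x1, x2, gamma):
    # Equation (2): K_rbf(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def kernel_distance(k, x1, x2):
    # Equation (3): d_H(x1, x2) = sqrt(K(x1, x1) + K(x2, x2) - 2 K(x1, x2))
    return np.sqrt(k(x1, x1) + k(x2, x2) - 2.0 * k(x1, x2))

k = lambda a, b: rbf_kernel_value(a, b, gamma=0.5)
print(kernel_distance(k, [0.0, 0.0], [1.0, 1.0]))   # always in [0, sqrt(2)] for the RBF kernel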

B. Machine Learning Algorithms

In this section, we define our standard vocabulary regarding statistical machine learning and present the C-SVM and ε-SVR, two popular forms of SVM used for the validation of our method in Sect. IV. Then, we also explain why our method is justified in comparison with previous approaches for the incorporation of prior-knowledge as sets.

Let Y ⊆ R be referred to as the output or label space. A labeling problem is an unknown probability distribution P with values in (X, Y). The labeling problem is called a classification or pattern recognition problem when Y = {−1, 1} and a scalar regression problem when Y = R. Given a set of labeling models H ⊂ Y^X referred to as the hypothesis set, a learning algorithm selects a labeling model f ∈ H such that f(X) approximates Y as closely as possible when (X, Y) ∼ P.

SVMs are a class of state-of-the-art learning algorithms derived from a statistical learning theory developed by the mathematician Vladimir Vapnik and known as the structural risk minimization principle. They are supervised learning algorithms, i.e. they select f according to a finite training set S_N = (x_i, y_i)_{i=1...N} ∈ (X, Y)^N of N input-output pairs i.i.d. according to P. A common type known as the soft margin SVM solves the following optimization problem:

$$\min_{f \in H,\; b \in \mathbb{R}} \; C \sum_{i=1}^{N} \zeta_i + \|f\|_H^2$$
$$\text{subject to } \phi\big(y_i, f(x_i) + b\big) \le \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \ldots, N \qquad (5)$$

where φ : R² → R is the loss function, C > 0 is the misclassification cost parameter and $(H, \langle\cdot,\cdot\rangle_H)$ is a particular real Hilbert space.

Let f* and b* be solutions of this problem. If the hinge loss function φ_hinge(u, v) = max(1 − uv, 0) is used, then (5) becomes the C-SVM, a standard form of SVM used for pattern recognition. The resulting labeling model is sign(f*(x) + b*). The ε-SVR, a common variant of SVM for scalar regression, corresponds to the ε-insensitive loss function φ_ε(u, v) = max(0, |u − v| − ε). The labeling model of the ε-SVR is f*(x) + b*. The C parameter is a means of controlling overfitting. A specific value of C is usually found with a tuning method such as a grid search associated with n-fold cross-validation.

When H is the kernel space of a PD kernel such as the RBF kernel (i.e. we perform the kernel trick with the RBF kernel), a result known as the representer theorem entails that the optimal solution f* has the following form:

$$f^*(x) = \sum_{i=1}^{N} \alpha_i y_i K_{rbf}(x, x_i), \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, N. \qquad (6)$$
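For concreteness, the following minimal sketch (ours, not from the paper) sets up the baseline C-SVM with the standard RBF kernel and a grid search over C and γ with 2-fold cross-validation, as used later in Sect. IV; the training data and the parameter grids are placeholders of our own choosing.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(16, 2))                                   # placeholder inputs
y_train = np.where(X_train[:, 0] > np.median(X_train[:, 0]), 1, -1)  # placeholder labels in {-1, 1}

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=2)           # 2-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_, 1.0 - search.best_score_)                 # best parameters and CV error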

Therefore, a problem will arise when the training set is too small because the set of all possible such linear combinations may not contain a single good labeling model. In such a case, modifying the optimal linear combination by adding constraints to (5), as in the previous work presented in Sect. II, will not yield a good enough solution. To solve this problem, we propose in the following sections a method modifying the kernel itself, resulting in a different hypothesis set fitting the problem better.

C. Adapted Kernels

In this section, we present our contribution consisting in the adaptation of the RBF kernel using prior-knowledge. The adaptation is performed through the following modification of the RBF kernel:

$$K_a(x_1, x_2) = \big(\lambda + \mu\,\xi(x_1, x_2)\big)\, K_{rbf}(x_1, x_2) \qquad (7)$$

where ξ : X² → [0, 1] is a symmetric function referred to as the prior-knowledge function. λ ∈ [0, 1] and µ ∈ [0, 1] are two positive parameters such that λ + µ = 1 (thus a single parameter in practice) controlling the degree of incorporation of prior-knowledge.

Let us assume for the moment that K_a is PD and let H_a be its kernel space. Using equality (3), the corresponding squared kernel distance is:

$$d_{H_a}(x_1, x_2)^2 = \lambda\, d_{H_{rbf}}(x_1, x_2)^2 + \mu\left[\xi(x_1, x_1) + \xi(x_2, x_2) - 2\xi(x_1, x_2) K_{rbf}(x_1, x_2)\right]. \qquad (8)$$

The above formula shows that the RBF kernel distance is modified depending on the choice of the prior-knowledge function ξ and of the parameter µ (or λ = 1 − µ). The idea is to make this distance greater or smaller in order to increase or decrease the separability of points according to the prior-knowledge.
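As an illustration (our sketch, assuming ξ is supplied as a Gram matrix evaluated on the same points as K_rbf), the adaptation of equation (7) is a simple element-wise operation on kernel matrices, which can then be fed to any SVM implementation accepting precomputed kernels.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def adapted_gram(krbf_gram, xi_gram, mu):
    # Equation (7): K_a = (lambda + mu * xi) * K_rbf, with lambda = 1 - mu
    return (1.0 - mu + mu * xi_gram) * krbf_gram   # element-wise product

# Tiny usage example with an arbitrary xi (here: constant 1, which leaves K_rbf unchanged)
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
krbf = rbf_kernel(X, X, gamma=0.5)
print(adapted_gram(krbf, np.ones((3, 3)), mu=0.5))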

D. Prior-Knowledge Functions

The prior-knowledge function is the framework we propose for the incorporation of prior-knowledge. This solution is quite versatile as it allows for the incorporation of a variety of forms of prior-knowledge. In this paper, we show how to deal with three different forms of prior-knowledge:
1) regions of the input space defined as crisp sets;
2) an extension to fuzzy sets;
3) pseudo-periodicity of the labels w.r.t. a specific variable.
For simplicity, we assume that X = R^n.

1) Crisp Sets: Let A ⊂ X be a region of the input space. The idea is to increase the separability between the elements in A and those in ∁A (the complementary set). Let χ : X → {−1, 1} be an indicator function for the set A such that:

$$\chi(x) = \begin{cases} 1 & \text{if } x \in A \\ -1 & \text{if } x \notin A \end{cases} \qquad (9)$$

The prior-knowledge function itself is defined as:

$$\xi(x_1, x_2) = \frac{\chi(x_1)\chi(x_2) + 1}{2}. \qquad (10)$$

We can verify that K_a is PD by construction. Indeed, any kernel of the form K(x_1, x_2) = f(x_1)f(x_2) is PD. Thus, ξ is PD as a sum of products of PD kernels, and K_a is in turn PD for the same reason. Therefore K_a induces a kernel space H_a, and the corresponding squared kernel distance given in equation (8) becomes, after a simple factorization:

$$d_{H_a}(x_1, x_2)^2 = \lambda\, d_{H_{rbf}}(x_1, x_2)^2 + \frac{\mu}{2}\left[\chi(x_1) - \chi(x_2)\right]^2 + \frac{\mu}{2}\left[\big(\chi(x_1)\chi(x_2) + 1\big)\big(2 - 2K_{rbf}(x_1, x_2)\big)\right] \qquad (11)$$

and by using equality (4):

$$d_{H_a}(x_1, x_2)^2 = \left[\lambda + \frac{\mu}{2}\big(\chi(x_1)\chi(x_2) + 1\big)\right] d_{H_{rbf}}(x_1, x_2)^2 + \frac{\mu}{2}\left[\chi(x_1) - \chi(x_2)\right]^2 \qquad (12)$$
$$= \begin{cases} d_{H_{rbf}}(x_1, x_2)^2 & \text{if } (x_1, x_2) \in A^2 \cup \complement A^2 \\ (1 - \mu)\, d_{H_{rbf}}(x_1, x_2)^2 + 2\mu & \text{otherwise.} \end{cases}$$

Figure 1 shows plots of the adapted kernel distance d_a(x_1, x_2) for n = 1, A = [a, b] and different values of the parameter µ ∈ [0, 1] (recall that λ = 1 − µ). The different relative positions of x_1 and x_2 are covered. We can see that when the two points are in the same set (A or ∁A), the kernel distance between them is the standard RBF kernel distance.


However, when the two points are in different sets, the kernel distance increases by an amount controllable via the parameter µ: from no increase when µ = 0 up to the maximal RBF kernel distance of √2 when µ = 1. We show an application to the diagnosis of breast cancer from morphological parameters of cell nuclei in Sect. IV-A, where we know in advance that cells having a specific morphology should be diagnosed as non-cancerous.

2) Fuzzy Sets: The above formulation of the prior-knowledge function can sometimes prove impractical because one may not know the boundaries of A precisely but instead only have an approximate idea of them. Therefore, we propose an extension of the previous type of knowledge functions to fuzzy sets. In practice, this is done by using a continuous indicator function χ : X → [−1, 1] instead of the binary one used previously. The previous proof of the positive-definiteness of K_a remains valid, as does the formula for the induced kernel distance (12). Figure 2 shows a fuzzified version of the illustration with crisp sets in Fig. 1. We can see that the previously discontinuous transitions are now smooth. Prior-knowledge functions corresponding to fuzzy sets are validated in the same application to the diagnosis of breast cancer in Sect. IV-A.
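A minimal sketch (ours) of the indicator-based prior-knowledge function: the crisp χ of equation (9) and ξ of equation (10) evaluated as a Gram matrix; a continuous χ with values in [−1, 1] can be substituted directly to obtain the fuzzy variant. The helper names and the example region are ours.

import numpy as np

def crisp_chi(X, in_A):
    # Equation (9): chi(x) = 1 if x belongs to A, -1 otherwise; in_A is a boolean membership test
    return np.where(in_A(np.atleast_2d(X)), 1.0, -1.0)

def xi_gram(chi1, chi2):
    # Equation (10): xi(x1, x2) = (chi(x1) * chi(x2) + 1) / 2, for all pairs of points
    return (np.outer(chi1, chi2) + 1.0) / 2.0

# Example with A = [a, b] in one dimension, as in Figure 1
a, b = 0.0, 1.0
chi = crisp_chi(np.array([[-0.5], [0.5], [2.0]]), lambda X: np.all((X >= a) & (X <= b), axis=1))
print(xi_gram(chi, chi))   # 1 for pairs on the same side of the boundary, 0 otherwise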

3) Pseudo-Periodicity: In addition to crisp and fuzzy sets, we also show how to deal with pseudo-periodicity of the labels w.r.t. a specific feature. If the labels are expected to have a pseudo-period of P w.r.t. the j-th feature of the input, we propose the following prior-knowledge function:

$$\xi(x_1, x_2) = \frac{\cos\left(\frac{2\pi}{P}(x_{1,j} - x_{2,j})\right) + 1}{2} \qquad (13)$$

where x_{1,j} (resp. x_{2,j}) is the j-th component of x_1 (resp. x_2).
Figure 1. Adapted kernel distance d_a(x_1, x_2) for n = 1, A = [a, b] and different values of the parameter µ: (a) case x_1 ∈ A; (b) case x_1 ∈ ∁A. Black plots correspond to µ = 0, i.e. the standard RBF kernel, blue plots to µ = 0.5 and red plots to µ = 1.


Figure 2. (a) Fuzzy indicator function and (b)-(d) adapted kernel distance d_a(x_1, x_2) for n = 1: (b) case −1 < χ(x_1) < 1; (c) case χ(x_1) = 1; (d) case χ(x_1) = −1. Different values of µ are used: black plots correspond to µ = 0, i.e. the standard RBF kernel, blue plots to µ = 0.5 and red plots to µ = 1.

ξ can be expanded in the following fashion by the application of a well-known trigonometric formula:

$$\xi(x_1, x_2) = \frac{1}{2}\left[\cos\left(\frac{2\pi}{P}x_{1,j}\right)\cos\left(\frac{2\pi}{P}x_{2,j}\right) + \sin\left(\frac{2\pi}{P}x_{1,j}\right)\sin\left(\frac{2\pi}{P}x_{2,j}\right) + 1\right]. \qquad (14)$$

Therefore, ξ is PD as a sum of PD kernels, and K_a is in turn PD. The modified kernel distance (8) resulting from this prior-knowledge function is:

$$d_{H_a}(x_1, x_2)^2 = \lambda\, d_{H_{rbf}}(x_1, x_2)^2 + \mu\left[2 - \left(\cos\left(\frac{2\pi}{P}(x_{1,j} - x_{2,j})\right) + 1\right) K_{rbf}(x_1, x_2)\right]. \qquad (15)$$

Figure 3 shows plots of the kernel distance according to the relative position of x_1 and x_2 for n = 1 and different values of the parameter µ (with λ = 1 − µ). We can observe a pseudo-periodic increase of the kernel distance in addition to the exponential increase proper to the RBF kernel. Therefore, two points will be relatively further apart in the kernel space if they are not separated by a whole number of pseudo-periods. Again, the magnitude of the modification can be controlled by tuning µ.

We present an application of this type of prior-knowledge function to the scalar regression of climatological data with a pseudo-period of 1 year in Sect. IV-B.


Figure 3. Adapted kernel distance d_a(x_1, x_2) for n = 1 with different values of µ: black plots correspond to µ = 0, i.e. the standard RBF kernel, blue plots to µ = 0.5 and red plots to µ = 1. Pseudo-periods are indicated with vertical dashed lines.

IV. EMPIRICAL EVALUATION

We propose a validation of our method in two different contexts: a pattern recognition context involving the diagnosis of breast cancer and a scalar regression context involving the prediction of monthly average temperatures.

A. Pattern Recognition Application: Diagnosis of Breast Cancer

We use the "Wisconsin Breast Cancer" dataset publicly available at the UCI Machine Learning Repository². Every instance is a vector of features extracted from the segmentation of cell nuclei in a fine needle biopsy image. It contains 569 instances, 357 corresponding to benign cases (non-cancer) and 212 to malignant cases (cancer). For this application, we use two of the features to predict whether a case is benign or malignant: the mean texture and the mean smoothness of the cell nuclei. Both features are normalized in [−1, 1].

² http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Nuclei with homogeneous interiors and smooth contours are an indication of healthy tissue. Accordingly, we assume the following prior-knowledge: if both normalized features are smaller than −0.5, then the case is benign. This translates into a prior-knowledge function constructed from a region of the input space A = [−∞, −0.5]² as described in Sect. III-D.

In the first batch of experiments, we evaluate the prior-knowledge function using the crisp indicator function (corresponding to the set A). Training sets are generated by randomly choosing N instances. Labeling models are then created for each training set using the C-SVM described in Sect. III-B and the adapted RBF kernel described in Sect. III-C. The C and γ parameters are adjusted every time by performing a grid search (the values yielding the best 2-fold cross-validation average are chosen). The models are tested on the 569 − N remaining instances and compared in terms of average misclassification rates.

Figure 4 shows average results over 100 randomly selected training sets. We tried different sizes N of the training sets and different values of the parameter µ ∈ [0, 1] controlling the amount of prior-knowledge incorporated into the kernel. Overall, the adapted kernel outperforms the original RBF kernel (µ = 0), especially when the training set is small: the best rate of improvement over the RBF kernel is 23.89% and is achieved when N = 8 and µ = 1. This rate decreases when N becomes larger, and the adapted kernel is about on a par with the RBF kernel when N = 64. Moreover, we can notice that the optimal µ (in bold in the tables) decreases when N increases: µ = 1 for N = 8, µ = 0.2 for N = 16 and µ = 0.1 for N = 32 or N = 64. This seems to confirm that the prior-knowledge is more important when the training set is small and becomes less useful as more training data is available. In general, µ ≥ 0.5 seems to be a good default value for the parameter µ.
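A sketch (ours) of how such an experiment can be assembled with a precomputed adapted kernel; data loading and normalization are omitted, the region A (both normalized features below −0.5) follows the prior-knowledge above, and γ, µ and C are kept fixed at illustrative values instead of being grid-searched.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def chi_region(X):
    # Crisp indicator of A: +1 if both normalized features are below -0.5, -1 otherwise
    return np.where(np.all(np.asarray(X) < -0.5, axis=1), 1.0, -1.0)

def adapted_gram(X1, X2, gamma, mu):
    # Equations (7) and (10): K_a = (1 - mu + mu * xi) * K_rbf
    xi = (np.outer(chi_region(X1), chi_region(X2)) + 1.0) / 2.0
    return (1.0 - mu + mu * xi) * rbf_kernel(X1, X2, gamma=gamma)

def fit_predict(X_train, y_train, X_test, gamma=1.0, mu=0.5, C=1.0):
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(adapted_gram(X_train, X_train, gamma, mu), y_train)
    return clf.predict(adapted_gram(X_test, X_train, gamma, mu))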


Figure 4. Misclassification rates with a crisp indicator function for different sizes N of the training set and values of µ. Each value is an average over 100 randomly selected training sets. The color convention is: black for N = 8, blue for N = 16, red for N = 32 and green for N = 64.

A second batch of experiments was performed in a similar setting with fuzzified versions of the indicator function. Instead of a discontinuous transition from χ(x) = −1 when x ∉ A to χ(x) = 1 when x ∈ A, the transition is made linear with a slope α. Figure 5 shows average results over 100 random selections for different values of µ ∈ [0, 1] and α. All the mean values are computed for the same 100 randomly selected training sets. The training sample size is fixed to N = 8, a small size which proved to favor the adapted kernel in the previous batch. It appears that fuzzified versions of the prior-knowledge function also perform well, with the adapted kernel clearly improving the results obtained with the standard RBF kernel. This improvement is however generally smaller when the slope is more gentle (especially for α = 2.5), which is consistent with the fact that the prior-knowledge is more coarsely approximated.

Overall, greatly improved results could be obtained on real-life data with training sets which can be considered typically too small for the effective use of such supervised learning tools (N = 8). This could make the use of SVMs practical in a variety of new scenarios where the tedious collection of training data would not be necessary. For instance, just a few examples could be labelled "on the spot" by the user.
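One possible fuzzification is sketched below (ours); the exact ramp used in the experiments is not fully specified in the text, so the signed-margin construction and the way α enters are assumptions.

import numpy as np

def fuzzy_chi(X, threshold=-0.5, alpha=10.0):
    # Linear transition of slope alpha around the boundary of A (both features below the threshold);
    # alpha -> infinity recovers the crisp indicator of equation (9)
    margin = threshold - np.max(np.asarray(X, dtype=float), axis=1)   # > 0 inside A, < 0 outside
    return np.clip(alpha * margin, -1.0, 1.0)

# Points well inside A, near its boundary, and outside A
print(fuzzy_chi(np.array([[-0.9, -0.9], [-0.52, -0.55], [0.3, -0.8]])))   # [ 1.   0.2  -1. ]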


Figure 6. Average results for different values of N and µ. The color convention is: black for N = 50, blue for N = 100, red for N = 200 and green for N = 400.


Figure 5. Average results for N = 8 and different values of µ and α. Rates of improvement are compared to the standard RBF kernel (i.e. µ = 0). The color convention is: black for α = ∞ (crisp indicator function), blue for α = 20, red for α = 10, green for α = 5 and yellow for α = 2.5.

B. Scalar Regression Application: Meteorological Predictions

This second application is based upon publicly available meteorological data from the UK Climate Projections database³. It is a scalar regression problem using the monthly average temperatures measured from January 1914 to December 2006 at the geographic point with coordinates: easting 337500, northing 1032500. A training set of N values from these 1104 monthly averages is used to predict the values of the remaining ones. The only input is the corresponding month. The ε-SVR described in Sect. III-B was used with ε = 0.1. Results are compared in terms of mean error. The rest of the procedure is similar to the one used for the previous pattern recognition application, including the grid search to set C and γ.

³ http://www.metoffice.gov.uk/climatechange/science/monitoring/ukcp09/

Although some variations are usually observed from one year to another, average temperatures have a one year cycle. Accordingly, the prior-knowledge is a pseudo-periodicity of 1 year, incorporated into a prior-knowledge function in the fashion described in Sect. III-D.

Figure 6 shows the average results obtained over 50 randomly selected training sets for different values of N and µ. The overall improvement compared to the standard RBF kernel is very significant, reaching 62.06% for N = 100 and µ = 1. As for the previous application, the rate of improvement is smaller when the training set becomes larger. The incorporation of prior-knowledge radically improves the results even when µ = 0.1, and larger values of µ only yield marginal additional improvements. The best rates of improvement are obtained with large values of µ (µ = 1 for N = 50, 100, 400 and µ = 0.9 for N = 200).
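A sketch (ours) of the regression setting with the pseudo-periodic adaptation, using an ε-SVR with a precomputed adapted kernel; the month index is the only input, the period is 12 months, and synthetic placeholder temperatures stand in for the UKCP data. The fixed values of γ, µ and C are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

def periodic_adapted_gram(t1, t2, gamma, mu, period=12.0):
    # Equations (7) and (13): K_a = (1 - mu + mu * xi) * K_rbf with a pseudo-period of 12 months
    xi = (np.cos(2.0 * np.pi * np.subtract.outer(t1, t2) / period) + 1.0) / 2.0
    krbf = rbf_kernel(t1.reshape(-1, 1), t2.reshape(-1, 1), gamma=gamma)
    return (1.0 - mu + mu * xi) * krbf

rng = np.random.default_rng(0)
months = np.arange(240.0)                                              # placeholder: 20 years of month indices
temps = 10.0 + 7.0 * np.sin(2.0 * np.pi * months / 12.0) + rng.normal(0.0, 1.0, months.size)
train = rng.choice(months.size, size=50, replace=False)                # N = 50 training months

reg = SVR(kernel="precomputed", C=10.0, epsilon=0.1)
reg.fit(periodic_adapted_gram(months[train], months[train], gamma=0.01, mu=0.5), temps[train])
pred = reg.predict(periodic_adapted_gram(months, months[train], gamma=0.01, mu=0.5))
print(np.mean(np.abs(pred - temps)))                                   # mean absolute error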

V. CONCLUSIONS

Our method improves the performance of SVMs and SVRs by a significant margin, which is especially marked when the size of the training set is small. This contributes to broadening the field of application of these state-of-the-art learning tools by enabling their effective use in new situations where the collection of a sufficient amount of labelled data is not possible. In addition, it is very practical, with only a single additional parameter µ which can be tuned in the same fashion as the other SVM/SVR parameters (C, γ or ε). It is also computationally efficient, requiring only the computation of a different kernel matrix.

In the future, we are interested in the extension of our method to other commonly used kernels besides the RBF kernel, and to other forms of prior-knowledge such as the monotonicity of the labels w.r.t. specific features.

REFERENCES

[1] F. Lauer and G. Bloch, "Incorporating prior knowledge in support vector machines for classification: a review," Neurocomputing, vol. 71, no. 7–9, pp. 1578–1594, March 2008.
[2] G. M. Fung, O. L. Mangasarian, and J. W. Shavlik, "Knowledge-based nonlinear kernel classifiers," in Conference on Learning Theory, 2003.
[3] O. L. Mangasarian, "Knowledge-based linear programming," SIAM Journal on Optimization, vol. 15, pp. 375–382, 2004.
[4] O. L. Mangasarian, J. Shavlik, and E. W. Wild, "Knowledge-based kernel approximation," Journal of Machine Learning Research, vol. 5, pp. 1127–1141, 2004.
[5] O. L. Mangasarian and E. W. Wild, "Nonlinear knowledge-based classification," IEEE Transactions on Neural Networks, vol. 10, pp. 1826–1832, 2008.
[6] O. L. Mangasarian and E. W. Wild, "Nonlinear knowledge in kernel approximation," IEEE Transactions on Neural Networks, vol. 18, pp. 300–306, 2007.
[7] Q. V. Le and A. J. Smola, "Simpler knowledge-based support vector machines," in International Conference on Machine Learning, 2006.
[8] R. Maclin, J. Shavlik, T. Walker, and L. Torrey, "A simple and effective method for incorporating advice into kernel methods," in Association for the Advancement of Artificial Intelligence, 2006.
[9] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, and J. W. Shavlik, "Online knowledge-based support vector machines," in European Conference on Machine Learning, 2010.