pRBF Kernels: A Framework for the Incorporation of Task-Specific Properties into Support Vector Methods

Antoine Veillard*†, Daniel Racoceanu*† and Stéphane Bressan*
* School of Computing, National University of Singapore, Singapore 117417
† IPAL CNRS, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
Email: [email protected]

Abstract—The incorporation of prior-knowledge into support vector machines (SVM) in order to compensate for inadequate training data has been the focus of previous research works, and many found a kernel-based approach to be the most appropriate. However, these approaches are better suited to broad domain knowledge (e.g. "sets are invariant to permutations of the elements") than to task-specific properties (e.g. "the weight of a person is cubically related to her height"). In this paper, we present the partially RBF (pRBF) kernels, our original framework for the incorporation of prior-knowledge about correlation patterns between specific features and the output label. pRBF kernels are based upon the tensor-product combination of the standard radial basis function (RBF) kernel with more specialized kernels and provide a natural way to incorporate a commonly available type of prior-knowledge. In addition to a theoretical validation of our framework, we propose a detailed empirical evaluation on real-life biological data which illustrates its ease-of-use and effectiveness. Not only were pRBF kernels able to improve the learning results in general, but they also proved to perform particularly well when the training data set was very small or strongly biased, significantly broadening the field of application of SVMs.

Keywords—prior knowledge; support vector methods; RBF; kernel tensor product;

I. MOTIVATIONS

Support vector methods (SVM), with their numerous variants for classification and scalar regression tasks, are often regarded as the state-of-the-art supervised learning tools. In particular, they are appreciated for their ability to deal with high-dimensional data, their good generalizability to data not yet seen during training, and their compatibility with the kernel trick. Theoretically, the kernel trick is a mapping Φ : X → H of the data from the original feature space X to a real Hilbert space H. Practically, it is implemented by replacing the matrix of pairwise dot products in X, which constitutes the input of the SVM, by the matrix of the inner products in H after applying Φ to the data. In fact, the function Φ performing the actual mapping does not need to be known explicitly.

Instead, only a symmetric and positive semidefinite (PSD) function K : X × X → R known as the kernel is required, where

  K(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle_H

is the inner product in H after mapping of the data. The Hilbert space H is referred to as the reproducing-kernel Hilbert space (RKHS).

The radial basis function (RBF) kernel with adjustable kernel bandwidth is widely considered a good general-purpose kernel for SVMs. Its expression is:

  K_{rbf}(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)    (1)

The implicit mapping to its RKHS H_rbf of infinite dimension is nonlinear, allowing for nonlinear classification or regression. The kernel bandwidth parameter γ > 0 provides an effective and practical way of controlling overfitting during learning. With a large enough γ (narrow kernel bandwidth), any configuration of points in X becomes linearly separable in H_rbf. On the contrary, with a small enough γ (broad kernel bandwidth), points are closely clustered in H_rbf and become harder to distinguish.

The SVM+RBF combination is a versatile yet powerful solution which is often viewed as a good default option. In practice, it is successfully used as a black-box solution in many complex, real-life applications.
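A minimal sketch of the SVM+RBF combination used as a black box is given below. It assumes scikit-learn and a synthetic two-moons dataset; the values of γ and C are illustrative placeholders rather than recommendations from this paper.

```python
# Minimal sketch: SVM + RBF kernel used as a black-box classifier (assumes scikit-learn).
from sklearn.datasets import make_moons              # synthetic, nonlinearly separable data
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K_rbf(x1, x2) = exp(-gamma * ||x1 - x2||^2); gamma controls the kernel bandwidth.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)            # larger gamma = narrower bandwidth = more flexible fit
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```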

However, the volume of training data required to take advantage of nonlinear classifiers such as the SVM+RBF combination can be very high. Unfortunately, many real-world applications may not provide training data in sufficient quantity or quality. Meanwhile, the user often has some understanding of the general properties of the task at hand. For instance, in credit rating, the creditworthiness of a person should be a monotonically increasing function of her salary. As another example, in road safety, the braking distance of a car is quadratically correlated to its total weight. Naturally, one may want to incorporate this kind of highly informative, task-specific property into the learning process to compensate for the insufficient training data.

In this paper, we propose a framework for the incorporation of prior-knowledge into SVMs subsequently referred to as partially RBF (pRBF) kernels. The pRBF kernels are a combination of the standard RBF kernel with other, more specific kernels using kernel tensor products. The main idea is to preserve the power and versatility of the SVM+RBF combination while allowing for the incorporation of properties highly specific to the task. Compared to previous works on the incorporation of prior-knowledge into SVMs, our framework allows for the incorporation of properties more specific to the task itself rather than generic domain properties such as transformation invariances.

First, we present a state-of-the-art review on the incorporation of prior-knowledge into SVMs in Sect. II, with an emphasis on kernel-based methods, and justify the originality of our contribution. Next, a few background notions related to SVMs and the theory of PD kernels are presented in Sect. III. Then, the pRBF kernels are presented in Sect. IV along with a theoretical validation of the design choices. We propose a systematic empirical validation of the method on real-world biological data in Sect. V and provide concluding remarks in Sect. VI.

II. RELATED WORK

The incorporation of prior-knowledge, i.e. information on the problem that cannot be inferred from the training data alone, into SVMs has been the focus of a number of previous research works, which can be distinguished either by the type of prior-knowledge incorporated or by the general method employed. The types of prior-knowledge addressed in previous works can be divided into three broad categories: knowledge on the domain of application, knowledge on the data, and knowledge on the task itself.

Works on domain-specific knowledge are the most numerous. The majority deals with transformation invariances such as rotations in computer vision applications. The earliest example is the virtual samples method [1], consisting in the generation of artificial samples from the original ones. More recent methods such as the tangent distance kernels [2] relate to a replacement of the data points by their equivalence class. Some other works incorporate notions of distances for specific types of objects (other than tuples from R^n), such as sequences for the local alignment kernels [3].

Additional knowledge on the data, such as class imbalances, can be incorporated by attributing class-specific misclassification cost parameters, a method first proposed in [4], or by selecting a kernel based on the scatter of the classes in the RKHS [5].

Previous work on task-specific knowledge was carried out by Mangasarian et al., who proposed the knowledge-based linear programming framework [6], which allows for the incorporation of labelled sets obtained from expert knowledge into the problem in addition to labelled data points. Recently, Veillard et al. [7] proposed a kernel-based method for the incorporation of unlabelled sets which does not require hypotheses on the label.

Our pRBF kernel framework is a kernel-based method for the incorporation of task-specific knowledge.

It is based on the combination of the standard, general-purpose RBF kernel with more specialized kernels using tensor products. It incorporates properties related to correlation patterns between features and labels, such as monotonicity of the model w.r.t. specific features or specific orders of correlation such as "linear", "quadratic" or "cubic". This type of task-specific knowledge is different from what has already been dealt with in previous works (labelled and unlabelled regions). A kernel-based approach is arguably the most natural since it does not alter the statistical meaning of the SVM (as with optimization-based approaches) and does not inflate the size of the problem (as with sample-based approaches).

III. BACKGROUND NOTIONS

This section presents a few background notions necessary to the understanding of our main contribution, the pRBF kernels described in Sect. IV. A short introduction to SVMs is first given in Sect. III-A, followed by a few elements of kernel theory in Sect. III-B.

A. Support vector methods

In this section, we define our standard vocabulary regarding statistical machine learning and introduce basic SVM theory, including the ε-SVR used for the empirical study in Sect. V.

Let X = R^n be referred to as the input or feature space and Y ⊂ R be referred to as the output or label space. A labeling problem is a probability distribution P with values in (X × Y) which is not explicitly known and from which the data are drawn. The labeling problem is qualified as a classification or pattern recognition problem when Y = {−1, 1} and as a scalar regression problem when Y = R. Given a set of labeling models H ⊂ Y^X referred to as the hypothesis set, a learning algorithm selects a labeling model f ∈ H such that f(X) approximates Y as closely as possible when (X, Y) ∼ P.

SVMs are a state-of-the-art class of learning algorithms implementing the structural risk minimization principle elaborated by the mathematician Vladimir Vapnik. They are supervised learning algorithms which select a labeling model f from H according to a finite training set S_N = (x_i, y_i)_{i=1...N} ∈ (X × Y)^N of N input-output pairs i.i.d. according to P. A common type of SVMs, qualified as "soft margin", solves the following optimization problem:

  minimize_{f \in H,\, b \in R}   C \sum_{i=1}^{N} \xi_i + \|f\|_H^2
  subject to   \phi(y_i, f(x_i) + b) \le \xi_i,   i = 1, ..., N
               \xi_i \ge 0,   i = 1, ..., N    (2)

where φ : R² → R is the loss function, C > 0 is the misclassification cost parameter and (H, ⟨·,·⟩_H) is a particular real Hilbert space. It can be viewed as the minimization of a regularized empirical risk on S_N. Let f* and b* be solutions of this problem.

The loss function φ is selected according to the type of learning problem. The C-SVM, a standard form of SVM used for pattern recognition, is obtained when the hinge loss function φ_hinge(u, v) = max(1 − uv, 0) is used. The resulting labeling model is sign(f*(x) + b*). The ε-SVR, a common variant of SVM for scalar regression, corresponds to the ε-insensitive loss function φ_ε(u, v) = max(0, |u − v| − ε). The labeling model of the ε-SVR is f*(x) + b*.

The C parameter is a means of controlling overfitting. A specific value of C is usually found with a tuning method such as a grid search associated with n-fold cross-validation. Details on the practical resolution of (2) and the geometrical interpretation of SVMs can be found in a wide range of excellent tutorials, including the reference book [8], and will not be covered in this paper.

B. Positive-definite kernels and representer theorem

In this section, we define the notion of positive-definite (PD) kernel and present a central result known as the representer theorem which will play a key role in the design of pRBF kernels.

As mentioned in Sect. I, a PD kernel can be understood as a generalization of the dot product to arbitrary real Hilbert spaces. A PD kernel over X is a binary function K : X × X → R which is symmetric and whose Gram matrix is always PSD. When H is the RKHS associated to a PD kernel K, such as the RBF kernel, a result known as the representer theorem entails that the optimal solution f* of (2) has the following form:

  f^*(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i),   0 \le \alpha_i \le C,   i = 1, ..., N    (3)
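The expansion (3) can be checked numerically against an off-the-shelf implementation. The sketch below is an illustration assuming scikit-learn, whose C-SVM stores the products α_i y_i of the support vectors in its dual_coef_ attribute; the dataset and parameter values are arbitrary.

```python
# Numerical check of the representer expansion: f*(x) + b* = sum_i alpha_i y_i K(x, x_i) + b*.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors x_i (all other alphas are 0).
K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)        # K(x, x_i) for all x and support vectors x_i
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_      # representer expansion plus bias b*
print(np.allclose(f_manual, clf.decision_function(X)))      # True: both sides agree
```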

IV. PRBF KERNELS

This section describes our main contribution for the incorporation of task-specific properties into SVMs, referred to as the pRBF kernels. First, we formally define the pRBF kernels and prove that a set of hypotheses guarantees the incorporation of prior-knowledge into SVMs in Sect. IV-A. Next, Sect. IV-B shows how this main result can be exploited to incorporate specific monomial and polynomial correlation patterns between features and labels.

A. Definition and properties

Definition IV.1 (pRBF kernel). Let 1 ≤ m ≤ n − 1. A pRBF kernel over R^n is a function:

  K_a = K_{rbf} \otimes K    (4)

where K_rbf is an RBF kernel over R^{n−m}, K is a PD kernel over R^m and ⊗ is the tensor product. Or equivalently:

  K_a : R^n \times R^n \to R,   (x_1, x_2) \mapsto K_{rbf}(x_{1,1}, x_{2,1}) \times K(x_{1,2}, x_{2,2})    (5)

with x_1 = (x_{1,1}, x_{1,2}) ∈ R^{n−m} × R^m and x_2 = (x_{2,1}, x_{2,2}) ∈ R^{n−m} × R^m.

A pRBF kernel is therefore the tensor product of the RBF kernel with another PD kernel. The idea behind this design is to use the non-RBF part for features with specific prior-knowledge and the RBF part for the other features (usually the largest part). Nevertheless, not all arbitrary properties of the non-RBF portion K will be reflected in the solution. The hypotheses which need to be verified by K are expressed in Theo. IV.2. Let us first introduce the following useful notation. Given a PD kernel K over X and x ∈ X, K_x : X → R is the function defined as:

  \forall t \in X,   K_x(t) := K(x, t) = K(t, x)

Theorem IV.2. Let E be a vector space of real-valued functions over R^m, K be a PD kernel over R^m such that {K_x | x ∈ R^m} ⊂ E, K_a = K_rbf ⊗ K be a pRBF kernel over R^n = R^{n−m} × R^m (m < n) with H_a its RKHS, and S = {(x_1, y_1), ..., (x_N, y_N)} ∈ (R^n × Y)^N be a finite training set. If f̂ : R^{n−m} × R^m → R is a solution of the optimization problem (2) in H_a, then ∀x_0 ∈ R^{n−m}, f̂_{x_0} ∈ E, where:

  \hat{f}_{x_0} : R^m \to R,   x \mapsto \hat{f}(x_0, x)

In plain words, Theo. IV.2 implies that the properties of the non-RBF portion of the pRBF kernel will be inherited by the labeling model if they are preserved by linear combination. An explicit illustration of Theorem IV.2 is given later in Figure 1.

Proof: A tensor product of PD kernels is a PD kernel. Therefore, K_a is a valid PD kernel and the representer theorem introduced in Sect. III-B can be applied. Following (3), there exist (α_1, ..., α_N) ∈ R^N such that, writing x_i = (x_{i,1}, x_{i,2}) ∈ R^{n−m} × R^m:

  \hat{f} = \sum_{i=1}^{N} \alpha_i y_i K_{a, x_i} = \sum_{i=1}^{N} \alpha_i y_i\, K_{rbf, x_{i,1}} \otimes K_{x_{i,2}}    (6)

Then, for x_0 ∈ R^{n−m}:

  \hat{f}_{x_0} = \sum_{i=1}^{N} \alpha_i y_i K_{rbf}(x_{i,1}, x_0)\, K_{x_{i,2}}    (7)

Since α_i y_i K_rbf(x_{i,1}, x_0) ∈ R and K_{x_{i,2}} ∈ E, (7) is a linear combination of terms belonging to E. The stability of E under linear combination completes the proof.
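As an illustration only (this code is not part of the paper), Definition IV.1 can be implemented as a Gram-matrix routine usable as a callable kernel by common SVM libraries such as scikit-learn. The split point m, the choice of the non-RBF kernel and the parameter values below are placeholder choices.

```python
# Sketch of a pRBF kernel K_a = K_rbf (tensor) K: Gram matrix of the tensor-product kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def prbf_gram(X1, X2, m, K_nonrbf, gamma=1.0):
    """Entry (i, j) is K_rbf(x1_i[:n-m], x2_j[:n-m]) * K(x1_i[n-m:], x2_j[n-m:])."""
    X1, X2 = np.asarray(X1), np.asarray(X2)
    rbf_part = rbf_kernel(X1[:, :-m], X2[:, :-m], gamma=gamma)   # RBF on the first n-m features
    spec_part = K_nonrbf(X1[:, -m:], X2[:, -m:])                 # specialized PD kernel on the last m features
    return rbf_part * spec_part                                  # elementwise product = tensor-product kernel

# Example non-RBF part: a homogeneous linear kernel on the last m features.
linear_K = lambda A, B: A @ B.T

# Usage with an SVM accepting a callable kernel, e.g.:
#   from sklearn.svm import SVR
#   svr = SVR(kernel=lambda A, B: prbf_gram(A, B, m=1, K_nonrbf=linear_K, gamma=0.5), epsilon=0.1)
```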

[Figure 1: three regression surfaces, (a) RBF kernel, (b) pRBF kernel (f1), (c) pRBF kernel (f1^2).]

Figure 1. Example of regression with the ε-SVR+pRBF combination. The data is 3-dimensional with 2 features (f1, f2) and 1 output label y. For f2 fixed, y is proportional to f1^2, i.e. the correlation between f1 and y is quadratic. The training data points are indicated with white dots and the test data points with black dots. The red curves drawn on the decision surface are level curves w.r.t. f2. Each graph corresponds to a different monomial expression: f1 for (b) and f1^2 for (c). (a) corresponds to the standard RBF kernel.

B. Polynomial and monomial correlation

In this section, we describe how prior-knowledge corresponding to monomial and polynomial correlations can easily be incorporated into SVMs using the pRBF kernel. This type of property is commonly available in real applications, as suggested in Sect. I and as shown by the empirical validation on real biological data in Sect. V.

The basic idea is to use monomials or polynomials of corresponding degrees for K_x. For instance, if the labels are known to be linearly dependent on the i-th feature and cubically dependent on the j-th, we should choose K_x(x') = x'_i x'_j^3, where x'_i (resp. x'_j) is the i-th (resp. j-th) feature of x'.

Let M_d[x_1, ..., x_M] be the set of multivariate monomials in the variables x_1, ..., x_M of degree exactly d. Note that for d > 0, M_d[x_1, ..., x_M] does not contain 0 and is therefore not a vector space. However, M_d[x_1, ..., x_M] ∪ {0} is a vector space. The set P_d[x_1, ..., x_M] of multivariate polynomials in the variables x_1, ..., x_M of degree at most d is also a vector space. Therefore, Theo. IV.2 guarantees that the solution will preserve the corresponding correlation patterns when K_x is chosen from such algebraic structures.
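As a concrete, hypothetical instance of this recipe, the non-RBF part for a label that is linear in the i-th feature and cubic in the j-th can be taken as K(u, v) = m(u) m(v) with m(x) = x_i x_j^3, so that every section K_x is a scalar multiple of the monomial m. A minimal sketch:

```python
# Sketch: monomial non-RBF part K(u, v) = m(u) * m(v) with m(x) = x_i * x_j**3.
import numpy as np

def monomial_kernel(U, V, i=0, j=1):
    """PD kernel whose sections K_x are proportional to the monomial x_i * x_j^3."""
    m_U = U[:, i] * U[:, j] ** 3                 # m evaluated on each row of U
    m_V = V[:, i] * V[:, j] ** 3                 # m evaluated on each row of V
    return np.outer(m_U, m_V)                    # rank-1 Gram matrix, hence PSD

# Every Gram matrix of this kernel is an outer product m(U) m(V)^T, so the kernel is PD,
# and K_x(t) = m(x) * m(t) stays inside span{m}, as required by Theorem IV.2.
```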

Figure 1 proposes a graphical illustration of regression with the ε-SVR+pRBF combination. The feature space is 2-dimensional with f1 and f2 as features. In this example, the label is known to have a quadratic correlation w.r.t. f1. For instance, the label y could be the braking distance of a car, f1 its total weight and f2 another parameter of the car whose influence on the braking distance is not understood. When the standard RBF kernel is used (Fig. 1a), the resulting decision model fits the training data (white dots) but not the test data (black dots). Using a pRBF kernel with monomials in f1 (Fig. 1b and Fig. 1c) causes the decision model to have the properties predicted by Theo. IV.2, as shown by the level curves w.r.t. f2. Most importantly, the pRBF kernel using the monomial f1^2, i.e. making the correct prior-assumption about the model, can label all the test data correctly, including the data falling out of the range of the training data. Such generalizability of the model outside of the range of the training data is usually not expected from SVMs.

V. APPLICATION: PREDICTION OF ANATOMICAL DATA ON A POPULATION OF ABALONES USING GEOMETRICAL PRIOR-KNOWLEDGE

This application validating the use of pRBF kernels consists in the prediction of the unit weight of abalones (marine gastropod molluscs) from their morphological features. The dataset, publicly available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Abalone), contains data for 4177 abalones. The different morphological parameters are: the length of the abalone, i.e. the longest shell measurement, in centimetres (feature f1); the width of the abalone, perpendicular to the length, in centimetres (feature f2); the height of the abalone, with the meat inside, in centimetres (feature f3); and the number of rings visible on the shell (feature f4). Therefore, a single instance consists of a quintuple (f1, f2, f3, f4, y) with the 4 morphological features of the abalone f1, f2, f3 and f4, and the total weight of the abalone y.

We first present the correlation patterns between features and labels which can be expected a priori in Sect. V-A and show that they are verified by the actual data distribution. The empirical results are presented in Sect. V-B for a random, unbiased selection of the training data and in Sect. V-C for a biased selection of the training data.

A. Prior-knowledge

The prior-knowledge for this problem corresponds to simple geometrical intuition, which suggests that the weight y should be cubically correlated to the length f1, the width f2 or the height f3. Figure 2 represents the weight y of the 4177 abalones plotted against a few monomial combinations of the parameters. The monotonic increase of the weight y w.r.t. the length f1 is clearly visible in Fig. 2a. Figure 2b, exhibiting a linear correlation between f1^3 and y, confirms that the relationship is in fact cubic. In addition, y is monotonically increasing w.r.t. f1 f2 (Fig. 2c) and the relationship between f1 f2 f3 and y is linear (Fig. 2d). Therefore, the above assumption is qualitatively confirmed by the plots. This justifies the use of the pRBF kernel with monomials as the non-RBF portion, in particular monomials of degree 3 in f1, f2 and f3.
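The qualitative check above can be reproduced with a short script. The sketch below assumes the UCI file abalone.data with its usual comma-separated layout (sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, rings); the column names are our own, and the Pearson correlations are only an informal substitute for the plots of Figure 2.

```python
# Sketch: informal check of the cubic relationship between shell dimensions and weight.
import numpy as np
import pandas as pd

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
df = pd.read_csv("abalone.data", header=None, names=cols)   # assumed UCI file layout

y = df["whole_weight"]                                       # output label
f1, f2, f3 = df["length"], df["diameter"], df["height"]     # features used in the paper

# Pearson correlation of y with several monomial combinations of the dimensions;
# the degree-3 monomials are expected to correlate most linearly with y.
for name, m in [("f1", f1), ("f1^3", f1**3), ("f1*f2", f1*f2), ("f1*f2*f3", f1*f2*f3)]:
    print(name, np.corrcoef(m, y)[0, 1])
```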

[Figure 2: four scatter plots of the weight y against (a) f1, (b) f1^3, (c) f1 f2, and (d) f1 f2 f3.]

Figure 2. Weight of the abalones (output label y) against several monomial combinations of length (feature f1), diameter (feature f2) and height (feature f3). The linear and polynomial relationships are clearly visible.

(a) Average error
          f1       f1^2     f1^3     f1 f2    f1 f2 f3   1
N = 5     0.3713   0.2988   0.2284   0.2524   0.2589     0.3502
N = 10    0.2524   0.1776   0.1474   0.1742   0.1591     0.2516
N = 20    0.1604   0.1366   0.1215   0.1325   0.1319     0.1927
N = 40    0.1244   0.1198   0.1056   0.1144   0.1060     0.1543
N = 60    0.1203   0.1088   0.0991   0.1041   0.0975     0.1314
N = 80    0.1068   0.1021   0.0979   0.1001   0.1005     0.1154
N = 100   0.0999   0.0953   0.0920   0.1004   0.0945     0.1100

(b) Average improvement rate
          f1        f1^2      f1^3     f1 f2    f1 f2 f3
N = 5     -0.0602   0.1469    0.3480   0.2794   0.2607
N = 10    -0.0033   0.2939    0.4143   0.3077   0.3675
N = 20    0.1678    0.2911    0.3697   0.3126   0.3154
N = 40    0.1939    0.2235    0.3157   0.2589   0.3128
N = 60    0.0846    0.1720    0.2459   0.2077   0.2577
N = 80    0.0748    0.1156    0.1519   0.1329   0.1292
N = 100   0.0918    0.1337    0.1635   0.0873   0.1411

Figure 3. Average results over 100 randomly selected training sets using the pRBF kernel for different values of N and different monomial expressions. (a) corresponds to mean errors and (b) to improvement rates over the standard RBF kernel (i.e. when the monomial expression is 1).

Accordingly, this batch of experiments uses the pRBF kernel with monomials in f1, f2 and f3. For instance, if we choose the monomial f1 f2, the expression of the pRBF kernel product between the feature vectors x_a = (f_{a,1}, f_{a,2}, f_{a,3}, f_{a,4}) and x_b = (f_{b,1}, f_{b,2}, f_{b,3}, f_{b,4}) is:

  K(x_a, x_b) = \exp\left(-\gamma \left((f_{a,3} - f_{b,3})^2 + (f_{a,4} - f_{b,4})^2\right)\right) \times f_{a,1} f_{a,2} \times f_{b,1} f_{b,2}    (8)

where γ > 0 is the RBF kernel bandwidth parameter.

B. Learning with few data

The type of SVM used was the ε-SVR described in Sect. III-A (with ε = 0.1). Results are compared in terms of average absolute error. Training sets are created by randomly choosing N instances. The C and γ parameters are adjusted every time by performing a grid search (the values yielding the best average results in 5-fold cross-validation are chosen).
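A sketch of this protocol is given below, assuming scikit-learn and a feature matrix X with columns ordered as (f1, f2, f3, f4). It implements the pRBF kernel of (8) and the 5-fold grid search over C and γ; the candidate grids are illustrative and not the ones used for the reported experiments.

```python
# Sketch of the experimental protocol: eps-SVR with the pRBF kernel of (8), grid-searched C and gamma.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def prbf_f1f2(A, B, gamma):
    """Equation (8): RBF on (f3, f4), monomial f1*f2 on the remaining two features."""
    rbf_part = rbf_kernel(A[:, 2:], B[:, 2:], gamma=gamma)
    return rbf_part * np.outer(A[:, 0] * A[:, 1], B[:, 0] * B[:, 1])

def tune_prbf_svr(X, y, Cs=(0.1, 1, 10, 100), gammas=(0.01, 0.1, 1, 10)):
    """Grid search over C and gamma, scored by 5-fold cross-validated mean absolute error."""
    best = None
    for C in Cs:
        for gamma in gammas:
            svr = SVR(kernel=lambda A, B, g=gamma: prbf_f1f2(A, B, g), C=C, epsilon=0.1)
            mae = -cross_val_score(svr, X, y, cv=5,
                                   scoring="neg_mean_absolute_error").mean()
            if best is None or mae < best[0]:
                best = (mae, C, gamma)
    return best  # (cross-validation error, best C, best gamma)
```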

Figure 3 shows a comparison of the results obtained with different pRBF kernels and the standard RBF kernel. Each numerical result is an average value over 100 random iterations. The monomials used for the pRBF kernels were f1, f1^2, f1^3, f1 f2 and f1 f2 f3.

Every pRBF kernel systematically improves on the results of the standard RBF kernel, with the exception of the pRBF kernel with monomial f1, for which the rate of improvement is between −6.02% and 9.18%. The best results are obtained with the degree 3 monomials f1^3 (rate of improvement between 15.19% and 41.45%) and f1 f2 f3 (rate of improvement between 12.92% and 36.75%). The order of the monomials from worst to best is: first the degree 1 monomial f1, which is the worst by far, then the degree 2 monomials f1^2 and f1 f2, and finally the degree 3 monomials f1 f2 f3 and f1^3.

The above order is consistent with the prior-knowledge available on the problem. While a degree of 1 or 2 captures the monotonicity of the relationship between output label and input features, only the degree 3 monomials are a faithful representation of the cubic relationship between dimensions and weight. The fact that degree 2 monomials perform better than degree 1 monomials is also expected, since a quadratic relationship is a better approximation of a cubic relationship than a linear one. Overall, this is a confirmation that the more faithfully the pRBF kernel incorporates the prior-knowledge, the better the results.

The impact in terms of the required amount of training data is significant. In this example, the required amount of training data is divided by more than 4 thanks to the use of the pRBF kernel with proper prior-knowledge. Indeed, the pRBF kernel associated to the monomial f1^3 with N = 10 training samples (average absolute error of 14.74%) performs better than the standard RBF kernel with N = 40 training samples (average absolute error of 15.45%).


C. Learning with biased data

Another batch of similar experiments was conducted with a biased selection of the data instead of the uniformly distributed random selection of Sect. V-B. The training sets are constituted by selecting only infant (sexually immature) abalones, which are on average smaller in size than adult abalones. Infant and adult abalones are used indiscriminately for testing. In practice, this could for instance happen if the abalones used for the training data set were artificially cultivated and could not be given enough time to reach maturity.

Figure 4 presents the numerical results obtained with this second batch of experiments. Again, the pRBF kernels substantially improve on the results obtained with the standard RBF kernel, with the degree 3 monomials offering the best improvements (except for the smallest training set size N = 5, for which f1 f2 performed the best). The best rate of improvement is 35.78%, obtained with the monomial f1 f2 f3 for N = 80.

A notable difference with the case of the unbiased training sets is that improvement rates remain consistently high even when the training set becomes larger (up to 33.73% for N = 100). This shows that the pRBF kernel with prior-knowledge allows for accurate predictions even outside of the range of the training data, which is usually impossible for the standard RBF kernel, thus confirming the observations made in Sect. IV-B, Fig. 1. As a matter of fact, the best result obtained for N = 100 with the pRBF kernel on biased training sets (an average error of 0.1082) is almost on a par with the best result obtained with the pRBF kernel on unbiased training sets (0.0920), whereas the best result obtained with the standard RBF kernel on biased training sets (0.1633) remains considerably worse than its counterpart on unbiased training sets (0.1100).
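For reference, the biased selection can be reproduced roughly as follows, again assuming the UCI abalone.data layout with a sex column coded 'I' for infant abalones; the sampling details are our own simplification of the protocol.

```python
# Sketch: biased training set drawn only from infant abalones, test set drawn from the rest.
import pandas as pd

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
df = pd.read_csv("abalone.data", header=None, names=cols)    # assumed UCI file layout

infants = df[df["sex"] == "I"]                    # sexually immature abalones, smaller on average
N = 40
train = infants.sample(n=N, random_state=0)       # biased training set
test = df.drop(train.index)                       # infants and adults used indiscriminately for testing

features = ["length", "diameter", "height", "rings"]
X_train, y_train = train[features].to_numpy(), train["whole_weight"].to_numpy()
X_test, y_test = test[features].to_numpy(), test["whole_weight"].to_numpy()
```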

(a) Average error
          f1       f1^2     f1^3     f1 f2    f1 f2 f3   1
N = 5     0.4448   0.3266   0.3454   0.3223   0.3412     0.4197
N = 10    0.3519   0.2731   0.2404   0.2909   0.2284     0.3393
N = 20    0.2770   0.2247   0.1847   0.2359   0.1840     0.2761
N = 40    0.2236   0.1938   0.1368   0.1567   0.1590     0.1936
N = 60    0.1653   0.1611   0.1400   0.1427   0.1318     0.1718
N = 80    0.1382   0.1467   0.1258   0.1320   0.1140     0.1775
N = 100   0.1439   0.1240   0.1289   0.1262   0.1082     0.1633

(b) Average improvement rate
          f1        f1^2      f1^3     f1 f2    f1 f2 f3
N = 5     -0.0598   0.2219    0.1771   0.2323   0.1871
N = 10    -0.0374   0.1950    0.2915   0.1427   0.3268
N = 20    -0.0033   0.1861    0.3309   0.1454   0.3333
N = 40    -0.1548   -0.0009   0.2932   0.1904   0.1785
N = 60    0.0379    0.0626    0.1853   0.1696   0.2331
N = 80    0.2215    0.1736    0.2913   0.2567   0.3578
N = 100   0.1189    0.2405    0.2108   0.2272   0.3373

Figure 4. Average results over 100 training sets selected from infant abalones. Conventions are the same as for Fig. 3.

VI. CONCLUSION AND PERSPECTIVES

The pRBF kernel framework proposed in this paper is an innovative use of kernel tensor products for the incorporation of new types of task-specific prior-knowledge into SVMs that proved both effective and practical. With adequate prior-knowledge, a pRBF kernel used in place of the standard RBF kernel can significantly improve the learning performance of SVMs, as shown by the empirical study. In particular, the pRBF kernel was able to considerably reduce the amount of training data required. It also proved able to adequately compensate for strong biases in the training data.

The sharp decrease in quantitative and qualitative requirements brought by the pRBF framework contributes to broadening the field of application of SVMs and paves the way for several interesting new possibilities. In particular, we can look towards the possibility of learning from very small training sets, or from datasets which are strongly biased.

REFERENCES

[1] B. Schölkopf, C. Burges, and V. N. Vapnik, "Incorporating invariances in support vector learning machines," in Proc. International Conference on Artificial Neural Networks. Springer, 1996, pp. 47–52.
[2] B. Haasdonk and D. Keysers, "Tangent distance kernels for support vector machines," in Proc. International Conference on Pattern Recognition, 2002, pp. 864–868.
[3] J.-P. Vert, H. Saigo, and T. Akutsu, "Local alignment kernels for biological sequences," in Kernel Methods in Computational Biology. MIT Press, 2004, pp. 131–153.
[4] K. Veropoulos, C. Campbell, and N. Cristianini, "Controlling the sensitivity of support vector machines," in Proc. International Joint Conference on AI, 1999, pp. 55–60.
[5] L. Wang, Y. Gao, K. L. Chan, P. Xue, and W.-Y. Yau, "Retrieval with knowledge-driven kernel design: an approach to improving SVM-based CBIR with relevance feedback," in Proc. International Conference on Computer Vision, vol. 2, 2005, pp. 1355–1362.
[6] O. L. Mangasarian and E. W. Wild, "Nonlinear knowledge-based classification," IEEE Transactions on Neural Networks, vol. 19, no. 10, pp. 1826–1832, 2008.
[7] A. Veillard, D. Racoceanu, and S. Bressan, "Incorporating prior-knowledge in support vector machines by kernel adaptation," in Proc. International Conference on Tools with Artificial Intelligence, 2011, pp. 591–596.
[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.