Using Kernel Basis with Relevance Vector Machine

Frédéric Suard and David Mercier
{frederic.suard, david.mercier}@cea.fr
CEA, LIST, Laboratoire Intelligence Multi-capteurs et Apprentissage, F-91191 Gif-sur-Yvette, FRANCE.

How to define a Multiple Kernel: Kernel Basis for RVM?

Relevance Vector Machine in a few words

• Probabilistic model, Michael Tipping [Tip00], 2000,
• Sparse solution, kernel machine,
• Decision function:

    f(x) = \sum_{i=1}^{n} w_i \cdot \Phi(x, x_i) + w_0

  with \Phi a kernel function and w_i the coefficient associated with each vector.

• RVM imposes no theoretical constraint on the kernel.
• Which kernel gives the best performance? How to set the kernel parameters? Which features are pertinent?

→ multiple kernel!

Kernel Basis for RVM

• The objective: obtain an independent set of kernels for each vector.
• New definition of \Phi as a Kernel Basis [VB02], the concatenation of k kernel matrices:

    \Phi = [ K_1 \; K_2 \; \dots \; K_k ]

• Decision function of the RVM Kernel Basis:

    f(x) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{i,j} \cdot K_j(x, x_i) + w_0

  which can be rewritten over the n \cdot k columns of \Phi as

    f(x) = \sum_{i=1}^{n \cdot k} w_i \cdot \phi_i(x) + w_0
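To make the Kernel Basis construction concrete, here is a minimal sketch, assuming Gaussian kernels with a few hypothetical bandwidths, of how the n × (n·k) design matrix \Phi can be assembled by concatenating k kernel matrices:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_basis(X, sigmas):
    """Stack k kernel matrices side by side into the n x (n*k) matrix Phi."""
    return np.hstack([gaussian_kernel(X, X, s) for s in sigmas])

X = np.random.randn(100, 7)                      # n = 100 samples, d = 7 features
Phi = kernel_basis(X, sigmas=[0.5, 1.0, 2.0])    # k = 3 candidate kernels
print(Phi.shape)                                 # (100, 300) = n x (n*k)
```

Each column of \Phi is then a candidate basis function \phi_i, and the sparse RVM prior prunes columns individually, which is what allows a different kernel per vector.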

Illustration

⇒ What is the difference between a Multiple Kernel such as a Composite Kernel and a Kernel Basis?
Set of kernels = {K(x, .), K(y, .), K(xy, .)}.

SVM Composite Kernel:

    f(x) = \sum_{i=1}^{n} \alpha_i \sum_{j=1}^{k} \beta_j \cdot K_j(x, x_i) + b

RVM Kernel Basis:

    f(x) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{i,j} \cdot K_j(x, x_i) + w_0

[Figure: with the Composite Kernel, every support vector (e.g. SV20) uses the same weighted combination of kernels; with the Kernel Basis, each relevance vector (RV1, RV2, RV3) can select its own kernel.]
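The contrast between the two decision functions can be spelled out in code. Below is a sketch with made-up weights (the array shapes and values are illustrative): the Composite Kernel shares one global kernel weighting β across all vectors, while the Kernel Basis gives each vector its own row of per-kernel weights.

```python
import numpy as np

# Assumed setup: K is a (k, n, m) array with K[j, i, t] = K_j(x_t, x_i)
# for k kernels, n training vectors and m test points (hypothetical data).

def composite_kernel_predict(K, alpha, beta, b):
    """SVM Composite Kernel: one global weighting beta shared by all vectors."""
    mixed = np.tensordot(beta, K, axes=(0, 0))   # (n, m): sum_j beta_j K_j
    return alpha @ mixed + b                     # sum_i alpha_i (...) + b

def kernel_basis_predict(K, W, w0):
    """RVM Kernel Basis: an individual weight w_{i,j} per (vector, kernel) pair."""
    return np.einsum('ij,jim->m', W, K) + w0     # sum_i sum_j w_{i,j} K_j(x, x_i)

k, n, m = 3, 50, 10
K = np.random.rand(k, n, m)
alpha, beta, b = np.random.randn(n), np.array([0.2, 0.5, 0.3]), 0.1
W, w0 = np.random.randn(n, k), 0.1
print(composite_kernel_predict(K, alpha, beta, b).shape)  # (10,)
print(kernel_basis_predict(K, W, w0).shape)               # (10,)
```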

Application to Feature Selection

Auto MPG

• Dataset: {x_i, y_i}_{i=1..n}, x ∈ R^d,
• Regression task: evaluate the pollution emission of a car,
• 398 cars, 7 features,
• Feature selection: one kernel for each feature,
• Comparison with the state of the art of Composite Kernel: SVM Composite Kernel [BLJ04, RBCG08],
• RVM-KB: one kernel matrix of size n × (n × d); SVM-CK: d kernel matrices of size n × n.
• How to evaluate the feature selection:
  • RVM-KB: number of relevance vectors based on each feature,
  • SVM-CK: weighting coefficient β for each kernel.
• Experimental procedure (see the sketch below):
  • data are normalized,
  • multiple iterations of 4-fold cross-validation to reinforce the results, with the same data for cross comparison,
  • several kernel parameters are tried; the best are retained,
  • SVM: exhaustive list of classifier parameters (C and ε); the best are retained.
• Performance of single kernel and multiple kernel for each algorithm.
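A minimal sketch of this evaluation protocol (repeated 4-fold cross-validation with per-fold normalization), assuming scikit-learn utilities; the mean-absolute-error measure, the repetition count and the (C, ε) grid are illustrative, not the authors' exact setup:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def repeated_cv_error(model_factory, X, y, n_repeats=5, n_folds=4, seed=0):
    """Mean absolute test error over repeated 4-fold cross-validation."""
    errors = []
    for r in range(n_repeats):
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train, test in kf.split(X):
            scaler = StandardScaler().fit(X[train])   # normalize on training folds only
            Xtr, Xte = scaler.transform(X[train]), scaler.transform(X[test])
            model = model_factory().fit(Xtr, y[train])
            errors.append(np.mean(np.abs(model.predict(Xte) - y[test])))
    return float(np.mean(errors))

# Exhaustive search in the spirit of the protocol: try (C, epsilon) pairs, keep the best.
X, y = np.random.randn(200, 7), np.random.randn(200)  # stand-in for the Auto MPG data
best = min((repeated_cv_error(lambda: SVR(C=C, epsilon=e), X, y), C, e)
           for C in (1, 10, 100) for e in (0.01, 0.1, 0.2))
print(best)
```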

Summary of single and multiple kernel performance:

AUTO     Kernel        error   #RV (%)
RVM      Gaussian, 2   2.84     20 (6.7)
RVM-KB   Poly, 3       2.70    101 (4.2)

AUTO     Parameters      Kernel        error   #SV × #K (%)
SVM      ε=0.2, C=100    Gaussian, 3   2.74    272 (91)
SVM-CK   ε=0.1, C=10     Gaussian, 1   2.70    292 × 6 (73)

• Multiple kernel improves the performance,
• RVM is more sparse,
• Feature selection:
  • features 5 and 7 are neglected,
  • SVM-CK gives more importance to kernel 8,
  • RVM-KB shares the weight fairly over kernels 1, 3, 4 and 6.
• The selection is valuable regarding the car consumption objective.

Kernel   Feature(s)      #RV (RVM-KB)   Σ_i |w_ij| (RVM-KB)   β (SVM-CK)
1        # cylinders     18.2500        0.1886                0.0081
2        displacement    13.2500        0.0596                0.0323
3        horsepower      10.2500        0.2772                0.0289
4        weight           7.7500        0.2242                0.1053
5        acceleration     1.0000        0.0029                0
6        model year      38.2500        0.2147                0.0724
7        origin           0             0                     0
8        [1-7]           12.2500        0.0328                0.7530
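The per-kernel quantities reported in the table (#RV and Σ_i |w_ij| for RVM-KB) can be read directly off the learned weight matrix. A sketch, assuming a hypothetical weight array W of shape (n, k) indexed by (vector, kernel):

```python
import numpy as np

def kernel_basis_importance(W, tol=1e-8):
    """Per-kernel statistics from an RVM-KB weight matrix W of shape (n, k)."""
    n_rv = (np.abs(W) > tol).sum(axis=0)   # relevance vectors using each kernel
    weight = np.abs(W).sum(axis=0)         # total weight mass per kernel
    return n_rv, weight / weight.sum()     # normalized, comparable to beta

W = np.random.randn(100, 8) * (np.random.rand(100, 8) > 0.9)  # sparse toy weights
n_rv, weight = kernel_basis_importance(W)
print(n_rv)      # number of vectors retained per feature kernel
print(weight)    # relative weight of each kernel
```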

Boston Housing

• Regression task: estimation of residential house price,
• 506 houses, described by 13 features.
• Summary of single and multiple kernel performance:

BOSTON   Kernel        error   #RV (%)
RVM      Gaussian, 5   3.81     30 (7.8)
RVM-KB   Gaussian, 2   3.22    322 (6.1)

BOSTON   Parameters       Kernel        error   #SV × #K (%)
SVM      ε=0.05, C=100    Gaussian, 3   3.27    373 (98)
SVM-CK   ε=0.01, C=10     Gaussian, 1   3.26    375 × 7 (49)

• Details for the multiple kernel solution:
  • RVM-KB performs better than SVM-CK,
  • some similarities between the two selections:
    • kernel 14 has an important weight,
    • features 2, 4, 7, 8 and 12 are neglected.
  • SVM-CK is able to remove a feature completely (β = 0), but RVM-KB can consider local kernels.

Kernel   Feature(s)       #RV (RVM-KB)   Σ_i |w_ij| (RVM-KB)   β (SVM-CK)
1        Criminal rate     5.7500        0.0050                0.0040
2        Residential %    74.2500        0.0005                0
3        Industrial %     13.0000        0.0014                0
4        River distance   11.7500        0.0006                0
5        % NOx            16.0000        0.0037                0.0104
6        # rooms          15.0000        0.0035                0.1271
7        Age               3.5000        0.0008                0
8        Distance          0             0                     0
9        Road             61.0000        0.0200                0
10       Tax              21.2500        0.0027                0.0181
11       Teachers         10.7500        0.0016                0.0185
12       % blacks          1.2500        0.0003                0
13       Lower salary     14.7500        0.0085                0.0939
14       [1-13]           73.7500        0.9513                0.7281

Blood Transfusion: example of feature selection limits

• Classification task: predict whether a person has donated blood,
• 748 persons, 4 features.
• AUC value for single and multiple kernel:

BLOOD    Kernel         AUC     #RV (%)
RVM      Gaussian, 10   0.744     3 (0.54)
RVM-KB   Poly, 3        0.75    247 (11.02)

BLOOD    Parameters   Kernel         AUC     #SV × #K (%)
SVM      C=10         Gaussian, 10   0.718   299 (55)
SVM-CK   C=100        Linear         0.71    378 × 4 (67)

• Best performance for RVM-KB, with sparsity.
• Feature selection: both methods show the same behavior:

All features are required!

Kernel   Feature     #RV (RVM-KB)   Σ_i |w_ij| (RVM-KB)   β (SVM-CK)
1        Recency      22            0.6678                0.2484
2        Frequency   104            0.7228                0.1983
3        Monetary    104            0.7228                0.2500
4        Time         16            0.6656                0.2499
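For completeness, the AUC values reported above score the ranking induced by the decision function f(x). A minimal sketch with hypothetical labels and scores, using scikit-learn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and decision scores f(x) from a trained RVM-KB or SVM.
y_true = np.array([0, 1, 1, 0, 1, 0])
scores = np.array([-0.3, 0.8, 0.4, 0.1, 0.9, -0.5])
print(roc_auc_score(y_true, scores))  # AUC of the ranking induced by f(x)
```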

Some final words...

• Extension of RVM to Multiple Kernel through the Kernel Basis,
• Performance comparable with the SVM Composite Kernel, but with a reduction of the solution size,
• Application to model selection: parameter selection, feature selection.

Some improvements:
→ How to build the kernel set?
→ How to combine features?
→ Reducing the computational complexity.

http://www.cea.fr

References

[BLJ04]   Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML'04: Proceedings of the Twenty-First International Conference on Machine Learning, page 6, New York, NY, USA, 2004. ACM Press.

[RBCG08]  Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, 2008.

[Tip00]   Michael Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.

[VB02]    Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165-187, 2002.

Using Kernel Basis with Relevance Vector Machine for model selection, ICANN 2009, Springer-Verlag, p. 255-264.