Distance-based kernels for real-valued data

Ll. Belanche ([email protected])
Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain

J.L. Vázquez ([email protected])
Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain

M. Vázquez ([email protected])
Dept. Sistemas Informáticos y Programación, Universidad Complutense de Madrid, 28040 Madrid, Spain

Outline

• Interest in kernel-based machine learning algorithms (e.g. SVMs)

• Distance-based similarity measures for real-valued vectors:
  – A truncated Euclidean similarity measure
  – A self-normalized similarity measure related to the Canberra distance

• It is proved that they are positive semi-definite (p.s.d.)

• Better suited than standard kernels? (e.g. RBF)

• Series of benchmarking experiments

Kernel-based learning methods

• Firm grounds in statistical learning theory
• The Support Vector Machine (SVM) is a very popular tool
• Generally good practical results
• Central to SVMs is the notion of kernel function: a mapping of variables from the original space to a high-dimensional Hilbert space
• The decision function is (see the sketch below)

$$F_{SVM}(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i\, k(x_i, x) + b \right)$$

• Intuitively, the kernel is a function that represents the similarity between two data observations
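A minimal sketch (not from the original slides) of how the decision function above is evaluated once an SVM has been trained; the arrays `sv`, `alpha`, `y` and the bias `b` are placeholders for trainer output, and `k` is any kernel discussed in these slides.

```python
import numpy as np

def svm_decision(x, sv, alpha, y, b, k):
    """Evaluate F_SVM(x) = sgn( sum_i alpha_i * y_i * k(sv_i, x) + b ).

    sv    : (l, d) array of support vectors
    alpha : (l,) Lagrange multipliers
    y     : (l,) labels in {-1, +1}
    b     : bias term
    k     : kernel function k(u, v) -> float
    """
    s = sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y, sv)) + b
    return np.sign(s)
```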

Kernels for vectors of real variables (I)

• Two commonplace kernels exist for real numbers, one of which (popularly known as the RBF kernel) is based on the Euclidean distance
• Conditions for being a kernel function are very precise and relate to the so-called kernel matrix being positive semi-definite (p.s.d.)
• How should the similarity between two vectors of (positive) real numbers be computed?
• Which of these similarity measures are valid kernels?
• Many interesting possibilities come from well-established distances
• There has been little work on this subject, probably due to the widespread use of the initially proposed kernel and the difficulty of proving the p.s.d. property

Kernels for vectors of real variables (II)

• We examine two alternative distance-based similarity measures and show them to be valid p.s.d. kernels:
  1. A truncated version of the standard Euclidean metric in R
  2. A measure inversely related to the Canberra distance
• Relation between the two new kernels and the RBF kernel
• Intuitive semantics
• We establish several results for positive vectors which lead to kernels extending the RBF kernel
• A multidimensional kernel is created as a combination of different one-dimensional distance-based kernels, one for each variable.

Kernels defined on real numbers (I)

• General form (illustrated in code below):

$$k(x, y) = f\left( \sum_{j=1}^{n} g_j\big(d_j(x_j, y_j)\big) \right), \qquad x_j, y_j \in \mathbb{R}^+_0 \tag{1}$$

where the $\{d_j\}_{j=1}^{n}$ are metric distances in $\mathbb{R}^+_0$ and $\{f, g_j\}_{j=1}^{n}$ are appropriate continuous and monotonic functions in $\mathbb{R}^+_0$ making the resulting $k$ a valid p.s.d. kernel.

• An example is the kernel

$$k(x, y) = \exp\left\{ -\frac{\lVert x - y \rVert^2}{2\sigma^2} \right\}, \qquad x, y \in \mathbb{R}^n,\ \sigma \neq 0 \in \mathbb{R}, \tag{2}$$

popularly known as the RBF (or Gaussian) kernel.
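A minimal sketch (ours, not from the slides) of the general form (1) with user-supplied $f$, $g_j$ and $d_j$, followed by a check that it recovers the RBF kernel (2) for a common $\sigma$.

```python
import numpy as np

def composite_kernel(x, y, d, g, f):
    """General form (1): k(x, y) = f( sum_j g_j(d_j(x_j, y_j)) ).

    d, g : lists of per-component distance and transformation functions
    f    : outer function applied to the aggregated value
    """
    s = sum(gj(dj(xj, yj)) for dj, gj, xj, yj in zip(d, g, x, y))
    return f(s)

# Recovering the RBF kernel (2) in 3 dimensions with sigma = 1:
n, sigma = 3, 1.0
d = [lambda a, b: abs(a - b)] * n
g = [lambda z: z**2 / (2 * sigma**2)] * n
u, v = np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5])

k_general = composite_kernel(u, v, d, g, f=lambda s: np.exp(-s))
k_rbf = np.exp(-np.sum((u - v)**2) / (2 * sigma**2))
print(np.isclose(k_general, k_rbf))   # True
```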

Kernels defined on real numbers (II)

• This particular kernel may be obtained by taking $d(x_j, y_j) = |x_j - y_j|$, $g_j(z) = z^2/(2\sigma_j^2)$ for non-zero $\sigma_j^2$, and $f(z) = \exp(-z)$.

• Different scaling parameters $\sigma_j$ for every component.
• This decomposition need not be unique and is not necessarily the most useful for proving the p.s.d. property of the kernel.
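As a one-step check (not on the original slide), substituting these choices into (1) indeed recovers the RBF kernel (2) when all $\sigma_j$ equal a common $\sigma$:

$$f\Big(\sum_{j=1}^{n} g_j\big(d(x_j, y_j)\big)\Big) = \exp\Big(-\sum_{j=1}^{n} \frac{(x_j - y_j)^2}{2\sigma_j^2}\Big) = \exp\Big(-\frac{\lVert x - y\rVert^2}{2\sigma^2}\Big) \quad \text{when } \sigma_j \equiv \sigma.$$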

Kernels defined on real numbers (III)

• We concentrate on upper-bounded metric distances, in which case the partial kernels $g_j(d_j(x_j, y_j))$ are automatically lower-bounded, though this is not a necessary condition in general.

• We focus on the following choices for partial distances:

$$d_{TrE}(x_i, y_i) = \min\{1, |x_i - y_i|\} \qquad \text{(Truncated Euclidean)} \tag{3}$$

$$d_{Can}(x_i, y_i) = \frac{|x_i - y_i|}{x_i + y_i} \qquad \text{(Canberra)} \tag{4}$$
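A minimal sketch of the two partial distances (3) and (4) as one-dimensional functions; the names `d_tre` and `d_can` are ours, and `d_can` assumes $x + y > 0$.

```python
def d_tre(x, y):
    """Truncated Euclidean distance (3): differences above 1 are capped at 1."""
    return min(1.0, abs(x - y))

def d_can(x, y):
    """Canberra distance (4): self-normalised; assumes x, y >= 0 and x + y > 0."""
    return abs(x - y) / (x + y)
```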

Semantics and applicability

• The truncated version of the standard metric can be useful when differences greater than a specified threshold have to be ignored.
• In similarity terms, examples can become more and more similar until they are suddenly indistinguishable.
• It also yields sparser matrices than the standard metric.
• The measure inversely related to the Canberra distance (a) is self-normalised, and (b) scales in a logarithmic fashion: similarity is smaller when the numbers are small than when they are big, being especially sensitive to small changes near zero.
• A simple example: let a variable stand for the number of children; the distance between 7 and 9 is not the same "psychological" distance as that between 1 and 3 (a threefold difference), yet |7 − 9| = |1 − 3|.
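The "number of children" example in numbers, using the hypothetical `d_can` helper sketched above: the absolute differences coincide, but the Canberra distance is four times larger near zero.

```python
print(abs(7 - 9), abs(1 - 3))   # 2 2   (identical absolute differences)
print(d_can(7, 9))              # 0.125 (2 / 16)
print(d_can(1, 3))              # 0.5   (2 / 4)
```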

1. Truncated Euclidean Similarity

Let $x_i$ be an arbitrary finite collection of $n$ different points on the real line, $x_i \in \mathbb{R}$, $i = 1, \dots, n$. We are interested in the $n \times n$ similarity matrix $A = (a_{ij})$ with

$$a_{ij} = 1 - d_{ij}, \qquad d_{ij} = \min\{1, |x_i - x_j|\}, \tag{5}$$

where the usual Euclidean distances have been replaced by truncated Euclidean distances. We can also write

$$a_{ij} = (1 - d_{ij})_+ = \max\{0, 1 - |x_i - x_j|\}. \tag{6}$$

Theorem 1. The matrix $A$ is positive definite (p.d.).

(i) Analytic proof: computation of integrals of certain functions.
(ii) Probabilistic approach: strategy based on random games.
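A numerical sanity check of Theorem 1 (a sketch, not a proof): build $A$ from equations (5) and (6) for a random sample and verify that its smallest eigenvalue is non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=20)              # arbitrary points on the real line

# A_ij = max(0, 1 - |x_i - x_j|), equations (5) and (6)
A = np.maximum(0.0, 1.0 - np.abs(x[:, None] - x[None, :]))

print(np.linalg.eigvalsh(A).min() >= -1e-10)    # expected: True
```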

Analytic proof

We define the bounded functions $X_i(x)$ for $x \in \mathbb{R}$ with value 1 if $|x - x_i| \le 1/2$ and zero otherwise. We calculate the interaction integrals

$$l_{ij} \equiv \int_{\mathbb{R}} X_i(x) X_j(x)\, dx = \big|\, [x_i - 1/2,\, x_i + 1/2] \cap [x_j - 1/2,\, x_j + 1/2] \,\big|,$$

the length of the overlap interval. We have $l_{ij} = 1 - d_{ij}$ if $d_{ij} < 1$, and zero if $|x_i - x_j| \ge 1$. Therefore,

$$l_{ij} = a_{ij} \qquad \text{if } i \ne j.$$

Moreover, for $i = j$ we have

$$\int_{\mathbb{R}} X_i(x) X_j(x)\, dx = \int X_i^2(x)\, dx = 1.$$

We conclude that the matrix $A$ is the interaction matrix of the system of functions $\{X_i : i = 1, \dots, n\}$; its entries are precisely the dot products of these functions in the functional space $L^2(\mathbb{R})$. Since $a_{ij}$ is the dot product of the inputs cast into some Hilbert space, $A$ is, by definition, a p.s.d. matrix.
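A small numerical illustration of the argument (our helper code, not part of the slides): the interval-overlap integrals $l_{ij}$ reproduce the entries of $A$ exactly.

```python
import numpy as np

x = np.array([0.3, 0.9, 2.0, 2.4])

def overlap(a, b):
    """Length of [a - 1/2, a + 1/2] ∩ [b - 1/2, b + 1/2], i.e. the integral l_ij."""
    return max(0.0, (min(a, b) + 0.5) - (max(a, b) - 0.5))

L = np.array([[overlap(a, b) for b in x] for a in x])
A = np.maximum(0.0, 1.0 - np.abs(x[:, None] - x[None, :]))
print(np.allclose(L, A))                         # True
```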

2. Canberra distance-based similarity

We define the Canberra similarity between two points as

$$S_{Can}(x_i, x_j) = 1 - d_{Can}(x_i, x_j), \qquad d_{Can}(x_i, x_j) = \frac{|x_i - x_j|}{x_i + x_j}, \tag{7}$$

where $d_{Can}(x_i, x_j)$ is called the Canberra distance.

Theorem 2. The matrix $A$ formed by the elements $a_{ij} = S_{Can}(x_i, x_j)$ for $x_i, x_j \in \mathbb{R}^+$ is p.s.d.

Proof: omitted!
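Since the proof is omitted here, a quick numerical spot-check of Theorem 2 (a sketch, assuming strictly positive points):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 10.0, size=20)              # strictly positive points

D = np.abs(x[:, None] - x[None, :]) / (x[:, None] + x[None, :])   # d_Can
A = 1.0 - D                                      # S_Can, equation (7)

print(np.linalg.eigvalsh(A).min() >= -1e-10)     # expected: True
```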

New multivariate kernels (I)

Theorem 3. The following function

$$k(x, y) = \exp\left( -\sum_{i=1}^{n} \frac{d(x_i, y_i)}{\sigma_i} \right), \qquad x_i, y_i, \sigma_i \in \mathbb{R}^+,$$

where $d$ is either the Truncated Euclidean or the Canberra distance, is a p.s.d. kernel.

• This result establishes new kernels analogous to the Gaussian RBF kernel but based on alternative metrics.
• The inclusion of the $\sigma_i$ parameters (acting as learning parameters) has the purpose of adding flexibility to the models.
• Computational considerations should not be overlooked: the use of the exponential function considerably increases the computational cost associated with evaluating the kernel.
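A minimal sketch of the kernel in Theorem 3; the function name and layout are ours, and `d` stands for either of the hypothetical `d_tre` or `d_can` helpers above.

```python
import numpy as np

def exp_distance_kernel(x, y, sigma, d):
    """k(x, y) = exp( - sum_i d(x_i, y_i) / sigma_i ), as in Theorem 3.

    x, y  : sequences of positive reals (one value per variable)
    sigma : positive scale parameters, one per variable
    d     : one-dimensional distance, e.g. d_tre or d_can
    """
    s = sum(d(xi, yi) / si for xi, yi, si in zip(x, y, sigma))
    return np.exp(-s)

# Example with the Canberra distance and unit scales:
# exp_distance_kernel([1.0, 7.0], [3.0, 9.0], [1.0, 1.0], d_can)
```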

New multivariate kernels (II)

Let $d(x_i, x_j) = \frac{|x_i - x_j|}{x_i + x_j}$ be the Canberra distance. The Canberra kernel is a valid p.s.d. kernel:

$$k(x, y) = 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{d_i(x_i, y_i)}{\sigma_i}, \qquad \sigma_i \ge 1$$

Let $d(x_i, x_j) = \min\{1, |x_i - x_j|\}$ and, for a real number $a$, let $(a)_+ \equiv 1 - \min(1, a) = \max(0, 1 - a)$. The Truncated Euclidean kernel is a valid p.s.d. kernel:

$$k(x, y) = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{d_i(x_i, y_i)}{\sigma_i} \right)_+, \qquad \sigma_i > 0$$
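Sketches of the two non-exponential kernels above (the names and vectorisation are ours; the Canberra case assumes strictly positive features):

```python
import numpy as np

def canberra_kernel(x, y, sigma):
    """k(x, y) = 1 - (1/n) * sum_i d_Can(x_i, y_i) / sigma_i, with sigma_i >= 1."""
    x, y, sigma = map(np.asarray, (x, y, sigma))
    d = np.abs(x - y) / (x + y)
    return 1.0 - np.mean(d / sigma)

def trunc_euclid_kernel(x, y, sigma):
    """k(x, y) = ( (1/n) * sum_i d_TrE(x_i, y_i) / sigma_i )_+ with (a)_+ = max(0, 1 - a)."""
    x, y, sigma = map(np.asarray, (x, y, sigma))
    d = np.minimum(1.0, np.abs(x - y))
    return max(0.0, 1.0 - np.mean(d / sigma))
```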

Experimental results

Database        RBF    Canberra   Mixed
breast cancer   97.2   97.4       98.0
diabetes        77.5   77.5       78.3
fourclass       100    100        100
german          77.6   76.4       77.7
heart           84.4   84.8       85.6
sonar           90.4   87.0       93.3
svmguide1       97.0   97.2       97.0
svmguide3       85.2   85.2       86.3
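One way such a comparison could be reproduced (a sketch, not the authors' experimental setup): scikit-learn's SVC accepts a callable kernel that returns the Gram matrix, so a multivariate Canberra kernel with unit scales can be plugged in directly. The data loading and the choice of C are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def canberra_gram(X, Y):
    """Gram matrix of exp(-sum_i d_Can) between rows of X and Y (positive features)."""
    D = np.abs(X[:, None, :] - Y[None, :, :]) / (X[:, None, :] + Y[None, :, :])
    return np.exp(-D.sum(axis=2))

# X, y = ...  # any of the benchmark sets above, with features rescaled to be positive
# clf = SVC(kernel=canberra_gram, C=1.0)
# print(cross_val_score(clf, X, y, cv=5).mean())
```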

Conclusions

• We have considered distance-based similarity measures for real-valued vectors of interest in kernel-based machine learning algorithms, like the Support Vector Machine.
• They are alternatives to the widely used RBF kernel (based on the standard metric) for certain problems.
• Possibility of mixed kernels. These distances may be a better choice for data affected by multiplicative noise, skewed data and/or data containing outliers.
• Future work: experimental evaluation (performance, number of support vectors, speed, ...).
• Some rather general results concerning positivity properties have been presented in detail.