Kernelization of vector and matrix algorithms
Wojciech Kotłowski
joint work with:
Manfred Warmuth (UC Santa Cruz), Shuisheng Zhou (Xidian University, China)
Poznań, IDSS, 07.10.2014
Outline

- A linear classifier can be made non-linear by adding new features which are combinations of the original features.
- The kernel trick makes this transformation efficient: independent of the dimension of the new feature space!
- Which algorithms can be "kernelized", i.e. to which algorithms can the kernel trick be applied?
- Necessary and sufficient conditions for kernelization of algorithms:
  - for vector data,
  - for matrix data.
Outline

1. Introduction
2. The vector case
3. The matrix case
4. Representer Theorem
5. Limitation of kernelizable algorithms
Simple start – linear binary classification

[Figure: linearly separable points in the $(x_1, x_2)$ plane, with the normal vector $w$ of the separating hyperplane.]

Training set $S = \{(x_i, \ell_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$.
Linear hypothesis $w = w(S) \in \mathbb{R}^d$.
Prediction on a new instance $x$: $w \cdot x = \sum_j w_j x_j$.

What if data is not close to linear?
[Figure: points in the $(x_1, x_2)$ plane that are not linearly separable.]

Give up on linear classifiers...?
Or better: simply invent new features.
Close to linear in feature space

Embed instances into a feature space:
$$\varphi : \mathbb{R}^d \to \mathbb{R}^N$$

[Figure: original space with axes $x_1, x_2$; feature space with axes $\varphi_1(x), \varphi_2(x)$.]

$$\varphi(x_1, x_2) = \left( \sqrt{x_1^2 + x_2^2},\ \arctan\frac{x_2}{x_1} \right).$$
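To make the embedding concrete, here is a minimal Python sketch (our own illustration, not part of the original slides; numpy assumed, and the name `phi` is ours): points on a circle around the origin, a nonlinear pattern in the original space, map to points with a constant first coordinate in the feature space.

```python
import numpy as np

def phi(x):
    """Polar embedding: phi(x1, x2) = (sqrt(x1^2 + x2^2), arctan(x2 / x1)).
    arctan2 is used as the numerically robust form of arctan(x2 / x1)."""
    x1, x2 = x
    return np.array([np.hypot(x1, x2), np.arctan2(x2, x1)])

# Points on a circle of radius 2: a nonlinear shape in the original space
# becomes a line of constant first coordinate in the feature space.
angles = np.linspace(0.1, 1.5, 5)
circle = [(2 * np.cos(a), 2 * np.sin(a)) for a in angles]
print(np.array([phi(p) for p in circle]))   # first column is constantly 2.0
```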
Regression
[Figure: two scatter plots of the same $(x, y)$ data. Left panel: a linear fit. Right panel: a degree-4 polynomial fit.]

Left panel: prediction $f(x) = w \cdot x$ with features $x = (1, x)$, i.e. $f(x) = w_0 + w_1 x$.
Right panel: prediction $f(x) = w \cdot x$ with features $\varphi(x) = (1, x, x^2, x^3, x^4)$, i.e. $f(x) = w_0 + \sum_{i=1}^4 w_i x^i$.

Both models are linear in feature space!
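A minimal sketch of the right panel's model (our own construction on synthetic data; numpy assumed): once the instances are mapped through $\varphi(x) = (1, x, x^2, x^3, x^4)$, the degree-4 fit is an ordinary linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)    # nonlinear target

# Each row of Phi is phi(x_i) = (1, x_i, x_i^2, x_i^3, x_i^4).
Phi = np.vander(x, N=5, increasing=True)

# f(x) = w . phi(x) is linear in w, so ordinary least squares applies.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("w_0, ..., w_4 =", w)

# Prediction at a new point: f(x) = w_0 + sum_{i=1}^4 w_i x^i.
x_new = 0.5
print(np.vander(np.atleast_1d(x_new), N=5, increasing=True) @ w)
```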
The Kernel Trick [Boser, Guyon & Vapnik, 92]

If $w$ is a linear combination of the training instances:
$$w = \sum_{i=1}^n c_i x_i,$$
then:
$$w \cdot x = \underbrace{\Big( \sum_{i=1}^n c_i x_i \Big)}_{w} \cdot\, x = \sum_{i=1}^n c_i \underbrace{(x_i \cdot x)}_{\text{dot product}}.$$

After embedding $x \mapsto \varphi(x)$:
$$w \cdot \varphi(x) = \underbrace{\Big( \sum_{i=1}^n c_i \varphi(x_i) \Big)}_{w} \cdot\, \varphi(x) = \sum_{i=1}^n c_i \underbrace{\varphi(x_i) \cdot \varphi(x)}_{K(x_i,\, x)},$$
where $K(x_i, x)$ is the kernel function.
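A quick numerical check of the last identity (our own construction: $d = 2$ and the degree-2 polynomial kernel $K(u, v) = (1 + u \cdot v)^2$, whose explicit feature map is known in closed form):

```python
import numpy as np

def K(u, v):
    """Degree-2 polynomial kernel."""
    return (1.0 + u @ v) ** 2

def phi(x):
    """Explicit feature map for d = 2 with phi(u) . phi(v) = K(u, v)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))    # n = 5 training instances
c = rng.standard_normal(5)         # coefficients c_i
x = rng.standard_normal(2)         # new test instance

w = sum(c[i] * phi(X[i]) for i in range(5))            # w = sum_i c_i phi(x_i)
primal = w @ phi(x)                                    # w . phi(x)
kernelized = sum(c[i] * K(X[i], x) for i in range(5))  # sum_i c_i K(x_i, x)
print(np.isclose(primal, kernelized))                  # True
```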
The Kernel Trick
              primal form                    kernelized form
prediction    $f(x) = w \cdot \varphi(x)$    $f(x) = \sum_{i=1}^n c_i K(x_i, x)$
parameters    $N$ (feature space dim.)       $n$ (number of instances)

Training requires the $n \times n$ kernel matrix $K_{ij} = K(x_i, x_j)$.
Testing on $x$ requires $K(x_i, x)$ for all $i$ with $c_i \neq 0$.
Kernel function must be efficiently computable.
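To illustrate the kernelized form concretely, here is a sketch of the kernel perceptron, a standard kernelized algorithm (this particular implementation and its names are our own): training touches the data only through the kernel matrix, and only the instances on which mistakes were made end up with $c_i \neq 0$.

```python
import numpy as np

def kernel_perceptron(K_matrix, labels, epochs=10):
    """Kernel perceptron: train using only the n x n kernel matrix
    K_ij = K(x_i, x_j); labels are in {-1, +1}. Returns coefficients c
    of the hypothesis f(x) = sum_i c_i K(x_i, x)."""
    n = len(labels)
    c = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Current prediction on x_i: f(x_i) = sum_j c_j K(x_j, x_i).
            if labels[i] * (c @ K_matrix[:, i]) <= 0:   # mistake on x_i
                c[i] += labels[i]   # primal update w += l_i phi(x_i) becomes c_i += l_i
    return c
```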
Kernel function – examples

Linear kernel (no embedding, $\varphi(x) = x$):
$$K(x_i, x_j) = x_i \cdot x_j. \qquad \text{(feature space dimension } N = d)$$

Polynomial kernel of degree $k$, with $\varphi(x) = (1, x_1, x_2, \ldots, x_1^2, x_1 x_2, \ldots, x_1^3, x_1^2 x_2, \ldots)$:
$$K(x_i, x_j) = (1 + x_i \cdot x_j)^k. \qquad \text{(feature space dimension } N = O(d^k))$$

RBF kernel ($\varphi(x) \in \mathbb{R}^{\mathcal{X}}$):
$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}. \qquad \text{(feature space dimension } N = \infty)$$
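The three kernels above, written as Gram-matrix routines (a vectorized numpy sketch of our own; function names are ours, and the squared norm in the RBF kernel follows the standard Gaussian-kernel convention):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T                      # K_ij = x_i . z_j

def poly_kernel(X, Z, k=3):
    return (1.0 + X @ Z.T) ** k         # K_ij = (1 + x_i . z_j)^k

def rbf_kernel(X, Z, gamma=1.0):
    # ||x_i - z_j||^2 = ||x_i||^2 - 2 x_i . z_j + ||z_j||^2, vectorized.
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
    return np.exp(-gamma * sq)

X = np.random.default_rng(2).standard_normal((4, 3))   # n = 4 points in R^3
for kernel in (linear_kernel, poly_kernel, rbf_kernel):
    G = kernel(X, X)                    # n x n Gram matrix
    print(kernel.__name__, G.shape, np.allclose(G, G.T))
```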
Kernel function – in general

Theorem
If $K(x_i, x_j)$ is such that for any $n$ and any set of $n$ points $\{x_1, \ldots, x_n\}$, the $n \times n$ kernel matrix $K_{ij} = K(x_i, x_j)$ is symmetric and positive semidefinite, then there exists a feature embedding $\varphi$ such that
$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j).$$

This allows one to construct kernel functions without explicitly constructing the feature space embedding $\varphi$.
The original instances $x$ do not need to be described by features at all (kernels on graphs, signals, images, proteins, etc.).
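The theorem suggests a simple finite-sample sanity check (a sketch of our own): on any sample, a valid kernel must produce a symmetric positive semidefinite Gram matrix.

```python
import numpy as np

def is_valid_gram(G, tol=1e-10):
    """Finite-sample Mercer check: a Gram matrix of a valid kernel must be
    symmetric and positive semidefinite (no eigenvalue below -tol)."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

# Gram matrix of the RBF kernel (gamma = 1) on random points: valid.
X = np.random.default_rng(3).standard_normal((6, 2))
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(is_valid_gram(G))    # True
print(is_valid_gram(-G))   # False: -K(., .) is not a kernel
```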
Good news
Many of our favorite algorithms can be “kernelized”:
Support Vector Machines, Linear Least Squares, Widrow-Hoff, PCA, Simplex Algorithm, ...
Question:
What is the class of algorithms that can be kernelized?
Which algorithms are kernelizable?

Algorithms that make linear predictions with a parameter vector that is a linear combination of the instances?
$$w(S) = \sum_{i=1}^n c_i x_i.$$
⟹ Necessary, but not sufficient.

Algorithms to which the Representer Theorem [Kimeldorf & Wahba, 71] applies?
$$w = \arg\min_{\tilde{w}} \sum_{i=1}^n \mathrm{loss}(\tilde{w} \cdot x_i) + \lambda \|\tilde{w}\|^2$$
The solution $w$ is a linear combination of the instances.
⟹ Sufficient, but not necessary.
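For squared loss, the objective above becomes kernel ridge regression, where the coefficients promised by the Representer Theorem have a closed form; a minimal sketch (our own, assuming the kernel matrix has already been computed):

```python
import numpy as np

def kernel_ridge_coefficients(K_matrix, labels, lam=0.1):
    """Minimize sum_i (w . phi(x_i) - l_i)^2 + lam * ||w||^2.
    The Representer Theorem gives w = sum_i c_i phi(x_i); substituting
    this form yields the closed-form solution c = (K + lam * I)^{-1} l.
    Predictions are then f(x) = sum_i c_i K(x_i, x)."""
    n = len(labels)
    return np.linalg.solve(K_matrix + lam * np.eye(n), labels)
```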
Our contribution

Which algorithms can be kernelized?
- Necessary and sufficient conditions for kernelizability:
  - for vectors,
  - for asymmetric and symmetric matrices.
- Prove new versions of the Representer Theorem.
- Build kernelizable algorithms with matrix parameters.