
Kernelization of vector and matrix algorithms

Wojciech Kotłowski

joint work with:
Manfred Warmuth (UC Santa Cruz), Shuisheng Zhou (Xidian University, China)

Poznań, IDSS, 07.10.2014

Outline

A linear classifier can be made non-linear by adding new features which are combinations of the original features.

The kernel trick makes this transformation efficient: its cost is independent of the dimension of the new feature space!

Which algorithms can be "kernelized", i.e. to which algorithms can the kernel trick be applied?

Necessary and sufficient conditions for kernelization of algorithms:
for vector data,
for matrix data.


Outline

1 Introduction
2 The vector case
3 The matrix case
4 Representer Theorem
5 Limitations of kernelizable algorithms


Simple start – linear binary classification

[Figure: labeled points in the (x1, x2) plane with a linear separator and weight vector w.]

Training set S = {(x_i, ℓ_i)}_{i=1}^n, x_i ∈ R^d.
Linear hypothesis w = w(S) ∈ R^d.
Prediction on new instance x: w · x = Σ_j w_j x_j.

What if data is not close to linear?

[Figure: data in the (x1, x2) plane that is not close to linear.]

Give up on the linear classifier...?
Or better, simply invent new features.


Close to linear in feature space

Embed instances into a feature space:

    φ : R^d → R^N

[Figure: original space with coordinates (x1, x2) vs. feature space with coordinates (φ1(x), φ2(x)).]

    φ(x_1, x_2) = ( √(x_1² + x_2²), arctan(x_2 / x_1) ).
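A minimal sketch of this embedding, assuming NumPy; the toy data, the class labels, and the radius threshold are illustrative assumptions, not taken from the slides. In feature space a threshold on the first coordinate (the radius), i.e. a linear-type rule, separates the classes.

```python
import numpy as np

def phi(X):
    """Embedding from the slide: phi(x1, x2) = (sqrt(x1^2 + x2^2), arctan(x2/x1))."""
    r = np.hypot(X[:, 0], X[:, 1])
    theta = np.arctan2(X[:, 1], X[:, 0])  # arctan2 handles x1 = 0 gracefully
    return np.column_stack([r, theta])

# Toy data (assumed): class +1 inside the unit circle, class -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.where(np.hypot(X[:, 0], X[:, 1]) < 1.0, 1, -1)

# In feature space a threshold on the radius coordinate separates the classes.
F = phi(X)
pred = np.where(F[:, 0] < 1.0, 1, -1)
print("accuracy of the linear rule in feature space:", np.mean(pred == y))
```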


Regression

[Figure: the same (x, y) data fitted by a straight line (left panel) and by a degree-4 polynomial curve (right panel).]

Left: prediction f(x) = w · x with features x = (1, x), i.e. f(x) = w_0 + w_1 x.
Right: prediction f(x) = w · x with features φ(x) = (1, x, x², x³, x⁴), i.e. f(x) = w_0 + Σ_{i=1}^4 w_i x^i.

Both models are linear in feature space!
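A minimal sketch of the two fits, assuming NumPy; the toy data and noise level are illustrative assumptions. The degree-4 fit is an ordinary linear least-squares problem once the features φ(x) = (1, x, x², x³, x⁴) are built.

```python
import numpy as np

def phi(x):
    """Feature map from the slide: phi(x) = (1, x, x^2, x^3, x^4)."""
    return np.column_stack([x ** p for p in range(5)])

# Toy 1-D data (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.shape)

# Both fits are ordinary linear least squares -- only the features differ.
A_lin = np.column_stack([np.ones_like(x), x])          # features (1, x)
w_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)
w_poly, *_ = np.linalg.lstsq(phi(x), y, rcond=None)    # features (1, x, ..., x^4)

rmse = lambda r: np.sqrt(np.mean(r ** 2))
print("linear fit RMSE:  ", rmse(A_lin @ w_lin - y))
print("degree-4 fit RMSE:", rmse(phi(x) @ w_poly - y))
```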



The Kernel Trick [Boser, Guyon & Vapnik, 92]

If w is a linear combination of the training instances:

    w = Σ_{i=1}^n c_i x_i,

then:

    w · x = ( Σ_{i=1}^n c_i x_i ) · x = Σ_{i=1}^n c_i (x_i · x),

so prediction only needs the dot products x_i · x.

After embedding x ↦ φ(x):

    w · φ(x) = ( Σ_{i=1}^n c_i φ(x_i) ) · φ(x) = Σ_{i=1}^n c_i φ(x_i) · φ(x) = Σ_{i=1}^n c_i K(x_i, x),

where K(x_i, x) = φ(x_i) · φ(x) is the kernel function.
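A minimal sketch of the trick, assuming NumPy; the explicit feature map φ and the matching degree-2 polynomial kernel are illustrative choices, not taken from the slides. It checks that the primal prediction w · φ(x), with w built explicitly, coincides with the kernelized prediction Σ_i c_i K(x_i, x).

```python
import numpy as np

def phi(X):
    """Explicit feature map for the degree-2 polynomial kernel (illustration only)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), np.sqrt(2) * x1, np.sqrt(2) * x2,
                            x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def K(A, B):
    """Kernel matching phi above: K(a, b) = (1 + a . b)^2."""
    return (1.0 + A @ B.T) ** 2

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 2))   # training instances x_i
c = rng.standard_normal(5)              # some coefficients c_i
x_new = rng.standard_normal((1, 2))     # new instance x

# Primal form: build w = sum_i c_i phi(x_i) explicitly, then predict w . phi(x).
w = phi(X_train).T @ c
primal = phi(x_new) @ w

# Kernelized form: never build phi; use sum_i c_i K(x_i, x).
kernelized = K(x_new, X_train) @ c

print(primal, kernelized)               # identical up to floating-point error
```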

The Kernel Trick

    primal form                       kernelized form
    f(x) = w · φ(x)                   f(x) = Σ_{i=1}^n c_i K(x_i, x)
    N parameters                      n parameters
    (feature space dim.)              (num. of instances)

Training requires the n × n kernel matrix K_ij = K(x_i, x_j).
Testing on x requires K(x_i, x) for all i with c_i ≠ 0.
Kernel function must be efficiently computable.
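As a concrete end-to-end example of this workflow, a minimal sketch of kernel ridge regression with an RBF kernel follows (a standard choice used purely for illustration; the data, γ, and λ are assumptions, and this particular algorithm is not prescribed by the slides). Training builds the n × n kernel matrix; testing only evaluates K(x_i, x) against the training instances.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Kernel matrix with K(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# Training: solve (K + lambda * I) c = y using the n x n kernel matrix.
lam = 0.1
K_train = rbf(X, X)
c = np.linalg.solve(K_train + lam * np.eye(len(X)), y)

# Testing: f(x) = sum_i c_i K(x_i, x), evaluated on a few new points.
X_test = np.linspace(-3.0, 3.0, 7)[:, None]
f_test = rbf(X_test, X) @ c
print(f_test)
```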


Kernel function – examples

Linear kernel (no embedding, φ(x) = x):
    K(x_i, x_j) = x_i · x_j.
(feature space dimension N = d)

Polynomial kernel of degree k (φ(x) = (1, x_1, x_2, ..., x_1², x_1 x_2, ..., x_1³, x_1² x_2, ...)):
    K(x_i, x_j) = (1 + x_i · x_j)^k.
(feature space dimension N = O(d^k))

RBF kernel (φ(x) ∈ R^X):
    K(x_i, x_j) = e^{−γ ‖x_i − x_j‖}.
(feature space dimension N = ∞)
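A minimal sketch of these three kernels, assuming NumPy; the values of γ and the degree k are illustrative. The point is that each kernel evaluation costs a single O(d) dot product or norm, even when the corresponding feature space is huge or infinite-dimensional.

```python
import numpy as np

def linear_kernel(a, b):
    """K(a, b) = a . b   (feature space dimension N = d)."""
    return a @ b

def poly_kernel(a, b, k=3):
    """K(a, b) = (1 + a . b)^k   (feature space dimension N = O(d^k))."""
    return (1.0 + a @ b) ** k

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||)   (infinite-dimensional feature space)."""
    return np.exp(-gamma * np.linalg.norm(a - b))

# Even for d = 1000 and k = 5 (an explicit feature space of roughly d^k = 10^15
# monomials), each kernel evaluation is a single O(d) computation.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(1000), rng.standard_normal(1000)
print(linear_kernel(a, b), poly_kernel(a, b, k=5), rbf_kernel(a, b))
```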


Kernel function – in general

Theorem
If K(x_i, x_j) is such that for any n and any set of n points {x_1, ..., x_n}, the n × n kernel matrix K_ij = K(x_i, x_j) is symmetric and positive semidefinite, then there exists a feature embedding φ such that

    K(x_i, x_j) = φ(x_i) · φ(x_j).

This allows one to construct kernel functions without explicitly constructing the feature space embedding φ.
The original instances x do not need to be described by features at all (kernels on graphs, signals, images, proteins, etc.).
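A minimal numerical illustration of the condition in the theorem, assuming NumPy; the RBF kernel and the random point set are arbitrary choices. For any such set of points the kernel matrix comes out symmetric with non-negative eigenvalues (up to floating-point error).

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """n x n matrix K_ij = exp(-gamma * ||x_i - x_j||)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-gamma * d)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))      # any set of n points
K = rbf_kernel_matrix(X)

# Symmetric and positive semidefinite: all eigenvalues are (numerically) >= 0.
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```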


Good news

Many of our favorite algorithms can be "kernelized":
Support Vector Machines, Linear Least Squares, Widrow-Hoff, PCA, Simplex Algorithm, ...

Question: What is the class of algorithms that can be kernelized?

Which algorithms are kernelizable?

Algorithms that make linear predictions with a parameter vector which is a linear combination of the instances?

    w(S) = Σ_{i=1}^n c_i x_i.

⟹ Necessary, but not sufficient.

Algorithms to which the Representer Theorem [Kimeldorf & Wahba, 71] applies?

    w = argmin_{w̃} Σ_{i=1}^n loss(w̃ · x_i) + λ ‖w̃‖²

The solution w is a linear combination of the instances.
⟹ Sufficient, but not necessary.
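A minimal sketch of the Representer Theorem at work for squared loss (ridge regression), assuming NumPy; the data and λ are illustrative assumptions. The primal minimizer of Σ_i (w · x_i − y_i)² + λ‖w‖² equals X^T c with c computed from the n × n Gram matrix, i.e. w is indeed a linear combination of the instances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.5
X = rng.standard_normal((n, d))        # rows are the instances x_i
y = rng.standard_normal(n)

# Primal solution of min_w sum_i (w . x_i - y_i)^2 + lam * ||w||^2.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Representer Theorem: the same w is a linear combination of the instances,
# w = sum_i c_i x_i, with c obtained from the n x n Gram (kernel) matrix.
c = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ c

print(np.allclose(w_primal, w_dual))   # True
```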


Our contribution

Which algorithms can be kernelized?
Necessary and sufficient conditions for kernelizability:
Vectors.
Asymmetric and symmetric matrices.
Prove new versions of the Representer Theorem.
Build kernelizable algorithms with matrix parameters.
