Kernelization of vector and matrix algorithms
Wojciech Kotłowski
joint work with:
Manfred Warmuth (UC Santa Cruz), Shuisheng Zhou (Xidian University, China)
Poznań, IDSS, 07.10.2014
Outline

- A linear classifier can be made non-linear by adding new features which are combinations of the original features.
- The kernel trick makes this transformation efficient: independent of the dimension of the new feature space!
- Which algorithms can be "kernelized", i.e. to which algorithms can the kernel trick be applied?
- Necessary and sufficient conditions for kernelization of algorithms:
  - for vector data,
  - for matrix data.
Outline

1. Introduction
2. The vector case
3. The matrix case
4. Representer Theorem
5. Limitation of kernelizable algorithms
Simple start – linear binary classification

[Figure: linearly separable points in the $(x_1, x_2)$ plane, with the normal vector $w$ of the separating hyperplane.]

Training set $S = \{(x_i, \ell_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$.
Linear hypothesis $w = w(S) \in \mathbb{R}^d$.
Prediction on a new instance $x$: $w \cdot x = \sum_j w_j x_j$.

What if data is not close to linear?
[Figure: points in the $(x_1, x_2)$ plane that are not linearly separable.]

Give up on linear classifiers...?
Or better: simply invent new features.
Close to linear in feature space

Embed instances into a feature space:
$$\varphi : \mathbb{R}^d \to \mathbb{R}^N$$

[Figure: original space with axes $x_1, x_2$; feature space with axes $\varphi_1(x), \varphi_2(x)$.]

$$\varphi(x_1, x_2) = \left( \sqrt{x_1^2 + x_2^2},\ \arctan\frac{x_2}{x_1} \right).$$
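To make the embedding concrete, here is a minimal Python sketch (our own illustration, not part of the original slides; numpy assumed, and the name `phi` is ours): points on a circle around the origin, a nonlinear pattern in the original space, map to points with a constant first coordinate in the feature space.

```python
import numpy as np

def phi(x):
    """Polar embedding: phi(x1, x2) = (sqrt(x1^2 + x2^2), arctan(x2 / x1)).
    arctan2 is used as the numerically robust form of arctan(x2 / x1)."""
    x1, x2 = x
    return np.array([np.hypot(x1, x2), np.arctan2(x2, x1)])

# Points on a circle of radius 2: a nonlinear shape in the original space
# becomes a line of constant first coordinate in the feature space.
angles = np.linspace(0.1, 1.5, 5)
circle = [(2 * np.cos(a), 2 * np.sin(a)) for a in angles]
print(np.array([phi(p) for p in circle]))   # first column is constantly 2.0
```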
Regression
[Figure: two scatter plots of the same $(x, y)$ data. Left panel: a linear fit. Right panel: a degree-4 polynomial fit.]

Left panel: prediction $f(x) = w \cdot x$ with features $x = (1, x)$, i.e. $f(x) = w_0 + w_1 x$.
Right panel: prediction $f(x) = w \cdot x$ with features $\varphi(x) = (1, x, x^2, x^3, x^4)$, i.e. $f(x) = w_0 + \sum_{i=1}^4 w_i x^i$.

Both models are linear in feature space!
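A minimal sketch of the right panel's model (our own construction on synthetic data; numpy assumed): once the instances are mapped through $\varphi(x) = (1, x, x^2, x^3, x^4)$, the degree-4 fit is an ordinary linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)    # nonlinear target

# Each row of Phi is phi(x_i) = (1, x_i, x_i^2, x_i^3, x_i^4).
Phi = np.vander(x, N=5, increasing=True)

# f(x) = w . phi(x) is linear in w, so ordinary least squares applies.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("w_0, ..., w_4 =", w)

# Prediction at a new point: f(x) = w_0 + sum_{i=1}^4 w_i x^i.
x_new = 0.5
print(np.vander(np.atleast_1d(x_new), N=5, increasing=True) @ w)
```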
The Kernel Trick [Boser, Guyon & Vapnik, 92]

If $w$ is a linear combination of the training instances:
$$w = \sum_{i=1}^n c_i x_i,$$
then:
$$w \cdot x = \underbrace{\Big( \sum_{i=1}^n c_i x_i \Big)}_{w} \cdot\, x = \sum_{i=1}^n c_i \underbrace{(x_i \cdot x)}_{\text{dot product}}.$$

After embedding $x \mapsto \varphi(x)$:
$$w \cdot \varphi(x) = \underbrace{\Big( \sum_{i=1}^n c_i \varphi(x_i) \Big)}_{w} \cdot\, \varphi(x) = \sum_{i=1}^n c_i \underbrace{\varphi(x_i) \cdot \varphi(x)}_{K(x_i,\, x)},$$
where $K(x_i, x)$ is the kernel function.
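A quick numerical check of the last identity (our own construction: $d = 2$ and the degree-2 polynomial kernel $K(u, v) = (1 + u \cdot v)^2$, whose explicit feature map is known in closed form):

```python
import numpy as np

def K(u, v):
    """Degree-2 polynomial kernel."""
    return (1.0 + u @ v) ** 2

def phi(x):
    """Explicit feature map for d = 2 with phi(u) . phi(v) = K(u, v)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))    # n = 5 training instances
c = rng.standard_normal(5)         # coefficients c_i
x = rng.standard_normal(2)         # new test instance

w = sum(c[i] * phi(X[i]) for i in range(5))            # w = sum_i c_i phi(x_i)
primal = w @ phi(x)                                    # w . phi(x)
kernelized = sum(c[i] * K(X[i], x) for i in range(5))  # sum_i c_i K(x_i, x)
print(np.isclose(primal, kernelized))                  # True
```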
The Kernel Trick
              primal form                    kernelized form
prediction    $f(x) = w \cdot \varphi(x)$    $f(x) = \sum_{i=1}^n c_i K(x_i, x)$
parameters    $N$ (feature space dim.)       $n$ (number of instances)

Training requires the $n \times n$ kernel matrix $K_{ij} = K(x_i, x_j)$.
Testing on $x$ requires $K(x_i, x)$ for all $i$ with $c_i \neq 0$.
Kernel function must be efficiently computable.
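To illustrate the kernelized form concretely, here is a sketch of the kernel perceptron, a standard kernelized algorithm (this particular implementation and its names are our own): training touches the data only through the kernel matrix, and only the instances on which mistakes were made end up with $c_i \neq 0$.

```python
import numpy as np

def kernel_perceptron(K_matrix, labels, epochs=10):
    """Kernel perceptron: train using only the n x n kernel matrix
    K_ij = K(x_i, x_j); labels are in {-1, +1}. Returns coefficients c
    of the hypothesis f(x) = sum_i c_i K(x_i, x)."""
    n = len(labels)
    c = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Current prediction on x_i: f(x_i) = sum_j c_j K(x_j, x_i).
            if labels[i] * (c @ K_matrix[:, i]) <= 0:   # mistake on x_i
                c[i] += labels[i]   # primal update w += l_i phi(x_i) becomes c_i += l_i
    return c
```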
Kernel function – examples

Linear kernel (no embedding, $\varphi(x) = x$):
$$K(x_i, x_j) = x_i \cdot x_j. \qquad \text{(feature space dimension } N = d)$$

Polynomial kernel of degree $k$, with $\varphi(x) = (1, x_1, x_2, \ldots, x_1^2, x_1 x_2, \ldots, x_1^3, x_1^2 x_2, \ldots)$:
$$K(x_i, x_j) = (1 + x_i \cdot x_j)^k. \qquad \text{(feature space dimension } N = O(d^k))$$

RBF kernel ($\varphi(x) \in \mathbb{R}^{\mathcal{X}}$):
$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}. \qquad \text{(feature space dimension } N = \infty)$$
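The three kernels above, written as Gram-matrix routines (a vectorized numpy sketch of our own; function names are ours, and the squared norm in the RBF kernel follows the standard Gaussian-kernel convention):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T                      # K_ij = x_i . z_j

def poly_kernel(X, Z, k=3):
    return (1.0 + X @ Z.T) ** k         # K_ij = (1 + x_i . z_j)^k

def rbf_kernel(X, Z, gamma=1.0):
    # ||x_i - z_j||^2 = ||x_i||^2 - 2 x_i . z_j + ||z_j||^2, vectorized.
    sq = (X**2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z**2).sum(1)[None, :]
    return np.exp(-gamma * sq)

X = np.random.default_rng(2).standard_normal((4, 3))   # n = 4 points in R^3
for kernel in (linear_kernel, poly_kernel, rbf_kernel):
    G = kernel(X, X)                    # n x n Gram matrix
    print(kernel.__name__, G.shape, np.allclose(G, G.T))
```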
Kernel function – in general

Theorem
If $K(x_i, x_j)$ is such that for any $n$ and any set of $n$ points $\{x_1, \ldots, x_n\}$, the $n \times n$ kernel matrix $K_{ij} = K(x_i, x_j)$ is symmetric and positive semidefinite, then there exists a feature embedding $\varphi$ such that
$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j).$$

This allows one to construct kernel functions without explicitly constructing the feature space embedding $\varphi$.
The original instances $x$ do not need to be described by features at all (kernels on graphs, signals, images, proteins, etc.).
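The theorem suggests a simple finite-sample sanity check (a sketch of our own): on any sample, a valid kernel must produce a symmetric positive semidefinite Gram matrix.

```python
import numpy as np

def is_valid_gram(G, tol=1e-10):
    """Finite-sample Mercer check: a Gram matrix of a valid kernel must be
    symmetric and positive semidefinite (no eigenvalue below -tol)."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

# Gram matrix of the RBF kernel (gamma = 1) on random points: valid.
X = np.random.default_rng(3).standard_normal((6, 2))
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(is_valid_gram(G))    # True
print(is_valid_gram(-G))   # False: -K(., .) is not a kernel
```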
Good news
Many of our favorite algorithms can be “kernelized”:
Support Vector Machines, Linear Least Squares, Widrow-Hoff, PCA, Simplex Algorithm, ...
Question:
What is the class of algorithms that can be kernelized?
Which algorithms are kernelizable?

Algorithms that make linear predictions with a parameter vector that is a linear combination of the instances?
$$w(S) = \sum_{i=1}^n c_i x_i.$$
⟹ Necessary, but not sufficient.

Algorithms to which the Representer Theorem [Kimeldorf & Wahba, 71] applies?
$$w = \arg\min_{\tilde{w}} \sum_{i=1}^n \mathrm{loss}(\tilde{w} \cdot x_i) + \lambda \|\tilde{w}\|^2$$
The solution $w$ is a linear combination of the instances.
⟹ Sufficient, but not necessary.
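For squared loss, the objective above becomes kernel ridge regression, where the coefficients promised by the Representer Theorem have a closed form; a minimal sketch (our own, assuming the kernel matrix has already been computed):

```python
import numpy as np

def kernel_ridge_coefficients(K_matrix, labels, lam=0.1):
    """Minimize sum_i (w . phi(x_i) - l_i)^2 + lam * ||w||^2.
    The Representer Theorem gives w = sum_i c_i phi(x_i); substituting
    this form yields the closed-form solution c = (K + lam * I)^{-1} l.
    Predictions are then f(x) = sum_i c_i K(x_i, x)."""
    n = len(labels)
    return np.linalg.solve(K_matrix + lam * np.eye(n), labels)
```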
Our contribution

Which algorithms can be kernelized?
- Necessary and sufficient conditions for kernelizability:
  - for vectors,
  - for asymmetric and symmetric matrices.
- Prove new versions of the Representer Theorem.
- Build kernelizable algorithms with matrix parameters.