
Support Vector Machines: Sequential Multidimensional Subsolver (SMS)

Marcin Orchel

AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland

Abstract

In this paper I will present a new algorithm for solving the Support Vector Machines (SVM) optimization problem. The new algorithm has a simpler form than existing algorithms and has a comparable computational cost. The classical Sequential Minimal Optimization (SMO) algorithm decomposes the SVM problem into two-dimensional subproblems. It was shown in [3] that SVM optimization with decomposition into more than two-dimensional subproblems can be faster. However, existing algorithms for solving multidimensional subproblems are complicated quadratic programming solvers. The proposed Sequential Multidimensional Subsolver (SMS) employs SMO for solving multidimensional subproblems. Tests show that the SVM solver with SMS is generally faster than the SMO algorithm.

1 Support Vector Machines

Support Vector Machine (SVM) [7] is an approach to statistical classification. The idea of SVM is to separate classes with a hyperplane by maximizing the geometric distance between the hyperplane and the nearest vectors.

In this article Support Vector Machines will be used for binary classification. The classifier investigated in this paper is the soft margin classifier with box constraints [2]. This classifier allows for misclassified vectors by adding a penalty term to the objective function. The soft margin classifier uses the kernel trick to separate classes with a nonlinear boundary instead of a hyperplane. Learning the SVM classifier leads to the following quadratic programming (QP) optimization problem:

SVM optimization problem (O1) Maximization of

W(\vec{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_{ij} \alpha_i \alpha_j K_{ij}

with the constraints:

\sum_{i=1}^{l} y_i \alpha_i = 0

0 \le \alpha_i \le C, \quad i \in I = \{1, \ldots, l\}, \quad C > 0

where:

~α is the parameter vector,

l is the size of the vector ~α, i.e. the number of data vectors,

yij := yiyj, yi is the classification value, yi ∈ {−1, 1}, C is the soft margin classifier parameter,

Kij := K(~xi, ~xj) is the kernel function, ~xi and ~xj are data vectors.
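To make the objective concrete, the following minimal sketch (mine, not from the paper) evaluates W(~α) and checks the constraints of O1 with NumPy; the RBF kernel and the random data are only illustrative assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def dual_objective(alpha, y, K):
    # W(alpha) = sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij
    v = y * alpha
    return alpha.sum() - 0.5 * v @ K @ v

def feasible(alpha, y, C, tol=1e-9):
    # equality constraint sum_i y_i alpha_i = 0 and box constraint 0 <= alpha_i <= C
    return abs(np.dot(y, alpha)) < tol and np.all(alpha >= -tol) and np.all(alpha <= C + tol)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))                     # 10 two-dimensional data vectors
    y = np.where(rng.random(10) < 0.5, -1.0, 1.0)
    K = rbf_kernel_matrix(X)
    alpha = np.zeros(10)                             # alpha = 0 is always feasible
    print(dual_objective(alpha, y, K), feasible(alpha, y, C=1.0))
```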

2 SVM decomposition technique

In order to speed up solving the SVM optimization problem, a decomposition technique is used [5]. The decomposition algorithm divides the problem into smaller subproblems.

The set of parameters chosen for a subproblem is called the active set.

The SVM optimization subproblem is defined as:

SVM optimization subproblem (O2) Maximization of

W_2(\vec{\beta}) = \sum_{i=1}^{p} \beta_i + \sum_{i=1, i \notin C}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{p} y_{c_i} \beta_i \sum_{j=1}^{p} y_{c_j} \beta_j K_{c_i c_j}
  - \sum_{i=1}^{p} y_{c_i} \beta_i \sum_{j=1, j \notin C}^{l} y_j \alpha_j K_{c_i j}
  - \frac{1}{2} \sum_{i=1, i \notin C}^{l} \sum_{j=1, j \notin C}^{l} y_{ij} \alpha_i \alpha_j K_{ij}    (1)

with the constraints:

\sum_{i=1}^{p} y_{c_i} \beta_i + \sum_{i=1, i \notin C}^{l} y_i \alpha_i = 0    (2)

0 \le \beta_i \le C, \quad i \in \{1, 2, \ldots, p\}, \quad C > 0

where:

P = {c1, c2, . . . , cp} is the set of indices of parameters chosen for the active set, ci ∈ I, ci ≠ cj for i ≠ j (in the sums above, i ∉ C denotes indices outside the active set),

~β is the subproblem variable vector,

βi is the searched value of the ci parameter.

The vector ~α is a previous solution. It must fulfill the constraints of the O1 problem.
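From (1), the subproblem objective W2 is the full objective W evaluated with the non-active components of ~α held fixed. The hedged sketch below (helper names are mine; the active indices and arrays are placeholders) uses this identity to evaluate and feasibility-check a trial ~β.

```python
import numpy as np

def dual_objective(alpha, y, K):
    # full dual objective W(alpha), as in problem O1
    v = y * alpha
    return alpha.sum() - 0.5 * v @ K @ v

def subproblem_objective(beta, active, alpha, y, K):
    # W2(beta): substitute the trial values beta into the active positions
    # of the previous solution alpha and evaluate the full dual objective.
    trial = alpha.copy()
    trial[active] = beta
    return dual_objective(trial, y, K)

def subproblem_feasible(beta, active, alpha, y, C, tol=1e-9):
    # constraint (2) plus the box constraint on the active variables
    trial = alpha.copy()
    trial[active] = beta
    return abs(np.dot(y, trial)) < tol and np.all(beta >= -tol) and np.all(beta <= C + tol)
```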

3 SVM multidimensional heuristic

The SVM heuristic is responsible for choosing the right parameters for the active set. In order to minimize the computation time of learning SVM, the heuristic should minimize the overall number of iterations. An SVM multidimensional heuristic was proposed in [3]. The default multidimensional heuristic proposed in this article differs from the existing one. It chooses the best p parameters which fulfill optimization conditions based on solving subproblems with the linear constraint included. I proposed this heuristic for the two-parameter case in [4].

We can transform the linear constraint (2) into the form:

\beta_d = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \beta_i - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i    (3)

where d ∈ {1, 2, . . . , p} is an arbitrarily chosen parameter.
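A one-line illustration of (3): given the other active variables and the fixed part of ~α, βd is fully determined by the equality constraint. The function name and arguments below are hypothetical, not taken from the paper.

```python
import numpy as np

def beta_d_from_rest(beta_rest, y_active_rest, y_cd, fixed_sum):
    # Equation (3): beta_d = -y_cd * sum_{i != d} y_ci beta_i - y_cd * sum_{i not in active} y_i alpha_i
    # fixed_sum is the precomputed sum over the non-active indices of y_i * alpha_i.
    return -y_cd * np.dot(y_active_rest, beta_rest) - y_cd * fixed_sum
```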

After substituting βd into (1) we get the following optimization subproblem:

SVM optimization subproblem with linear constraint included (O3) Maximization of:

W_3(\vec{\gamma}) =
  -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i}
  - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i
  + \sum_{i=1, i \ne d}^{p} \gamma_{e_i}
  + \sum_{i=1, i \notin C}^{l} \alpha_i
  - \frac{1}{2} \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big)^2 K_{c_d c_d}
  + \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big) \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} K_{c_d c_i}
  - \frac{1}{2} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} \sum_{j=1, j \ne d}^{p} y_{c_j} \gamma_{e_j} K_{c_i c_j}
  + \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big) \sum_{i=1, i \notin C}^{l} y_i \alpha_i K_{c_d i}
  - \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} \sum_{j=1, j \notin C}^{l} y_j \alpha_j K_{c_i j}
  - \frac{1}{2} \sum_{i=1, i \notin C}^{l} \sum_{j=1, j \notin C}^{l} y_{ij} \alpha_i \alpha_j K_{ij}

with the constraints:

0 \le \gamma_{e_i} \le C, \quad i \in \{1, 2, \ldots, p\} \setminus \{d\}, \quad C > 0

0 \le c = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i \le C, \quad C > 0    (4)

where

~γ is a variable vector with p − 1 elements,

ei = i for i < d, ei = i − 1 for i > d,

γei is the searched value of the ci parameter, c is the searched value of the cd parameter.

The vector ~α is a previous solution. It must fulfill the constraints of the O1 problem.

The partial derivative of W3(~γ) at the point for which γei = αci has the value:

\frac{\partial W_3}{\partial \gamma_{e_k}}(\vec{\gamma}_{old}) = y_{c_k} (E_{c_d} - E_{c_k})    (5)

where

E_i = \sum_{j=1}^{l} y_j \alpha_j K_{ij} - y_i
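A small sketch of (5): the values E can be computed for all indices at once from the current ~α, and the partial derivative at the old point follows directly. The helper names are mine, not the paper's.

```python
import numpy as np

def errors(alpha, y, K):
    # E_i = sum_j y_j alpha_j K_ij - y_i, computed for all i at once
    return K @ (y * alpha) - y

def partial_derivative(E, y, cd, ck):
    # Equation (5): dW3/dgamma_{e_k} at gamma = alpha_old equals y_ck * (E_cd - E_ck)
    return y[ck] * (E[cd] - E[ck])
```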

3.1 Default multidimensional heuristic

In this subsection I will present conditions for optimization possibility. Based on these conditions I will describe the algorithm for choosing the best active set.

The first obvious necessary condition is that at least one of the parameters must change its value. The remaining optimization conditions consist of two parts. The first part consists of conditions based on fulfilling constraint (4), the second part consists of conditions based on partial derivatives. Merging all conditions leads to the overall optimization conditions.

Necessary optimization conditions based on fulfilling (4). (4) must be fulfilled after the changes, hence we can write:

-\alpha_{c_d}^{old} \le \Delta\alpha_{c_d} = -\alpha_{c_d}^{old} - y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \alpha_{c_i}^{new} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i \le C - \alpha_{c_d}^{old}

After substituting:

\alpha_{c_d}^{old} = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \alpha_{c_i}^{old} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i

we get the following condition:

-\alpha_{c_d}^{old} \le \Delta\alpha_{c_d} = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \Delta\alpha_{c_i} \le C - \alpha_{c_d}^{old}    (6)

Theorem 3.1. If the condition (6) is fulfilled, then there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2 defined as:

G1 := {i ∈ {1, 2, . . . , l} : (yi = 1 ∧ αi = 0) ∨ (yi = −1 ∧ αi = C) ∨ (0 < αi < C)}

G2 := {i ∈ {1, 2, . . . , l} : (yi = −1 ∧ αi = 0) ∨ (yi = 1 ∧ αi = C) ∨ (0 < αi < C)}

Note that nonbound parameters are included in both groups.
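A hedged sketch of the group definitions as boolean masks over all indices (NumPy; the tolerance handling is my own addition, not part of the paper):

```python
import numpy as np

def groups(alpha, y, C, tol=1e-12):
    nonbound = (alpha > tol) & (alpha < C - tol)
    # G1: (y = 1, alpha = 0) or (y = -1, alpha = C) or nonbound
    g1 = ((y == 1) & (alpha <= tol)) | ((y == -1) & (alpha >= C - tol)) | nonbound
    # G2: (y = -1, alpha = 0) or (y = 1, alpha = C) or nonbound
    g2 = ((y == -1) & (alpha <= tol)) | ((y == 1) & (alpha >= C - tol)) | nonbound
    return g1, g2
```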

Proof. We will prove that if all parameters belong to only one group, G1 or G2, then condition (6) is not fulfilled. We choose parameters belonging to the G1 group. The proof for the G2 group is similar. The set of chosen parameters does not contain nonbound parameters, because they belong to both groups. If all ci parameters for i ∈ {1, 2, . . . , p} \ {d} do not change, then Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} = 0, therefore Δα_{c_d} = 0. So none of the parameters ci change, which cannot be true. Otherwise the following holds: Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} > 0. If y_{c_d} = 1, then Δα_{c_d} < 0 and α_{c_d}^{old} = 0. The condition (6) becomes 0 ≤ Δα_{c_d} ≤ C, which cannot be true. If y_{c_d} = −1, then Δα_{c_d} > 0 and α_{c_d}^{old} = C. The condition (6) becomes −C ≤ Δα_{c_d} ≤ 0, which cannot be true.

Sufficient optimization conditions based on fulfilling (4).

Theorem 3.2. If there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2, then condition (6) is fulfilled for some parameter changes.

Proof. If none of the chosen two parameters (ca from the G1 group and cb from the G2 group) is the cd parameter, then we can set Δα_{c_a} and Δα_{c_b} to the same values or with inverse signs, in such a way that Δα_{c_d} = 0, thus (6) is fulfilled. If the chosen parameters are the cd parameter from the G1 group and the cb parameter from the G2 group, then when we set all remaining parameter changes to zero the following can hold: Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} < 0. If y_{c_d} = 1, then Δα_{c_d} > 0. If α_{c_d}^{old} = 0, then condition (6) is obviously fulfilled. If 0 < α_{c_d}^{old} < C, then condition (6) is fulfilled when Δα_{c_b} is set to a value close enough to zero. If y_{c_d} = −1, then Δα_{c_d} < 0. If α_{c_d}^{old} = C, then condition (6) is obviously fulfilled. If 0 < α_{c_d}^{old} < C, then condition (6) is fulfilled when Δα_{c_b} is set to a value close enough to zero.

Necessary optimization conditions based on partial derivatives.

Theorem 3.3. If optimization is possible based on partial derivatives, then one of the partial derivatives of the function W3 must fulfill the following condition:

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} > 0 when α_{c_k} = 0

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} < 0 when α_{c_k} = C

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} ≠ 0 when 0 < α_{c_k} < C    (7)

Proof. We will prove that if all partial derivatives of the function W3 do not fulfill condition (7), then optimization is not possible. If (7) is not fulfilled, then the objective function W3 cannot increase its value in any direction. Thus the function W3 cannot increase its value at all.

Corollary 3.4. After substituting (5) into (7) we get:

y_{c_k}(E_{c_d} − E_{c_k}) > 0 when α_{c_k} = 0
y_{c_k}(E_{c_d} − E_{c_k}) < 0 when α_{c_k} = C
y_{c_k}(E_{c_d} − E_{c_k}) ≠ 0 when 0 < α_{c_k} < C

After simplification, when y_{c_k} = 1:

E_{c_k} < E_{c_d} when α_{c_k} = 0
E_{c_k} > E_{c_d} when α_{c_k} = C
E_{c_k} ≠ E_{c_d} when 0 < α_{c_k} < C

When y_{c_k} = −1:

E_{c_k} > E_{c_d} when α_{c_k} = 0
E_{c_k} < E_{c_d} when α_{c_k} = C
E_{c_k} ≠ E_{c_d} when 0 < α_{c_k} < C
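The per-parameter check from Corollary 3.4 can be written directly. The sketch below (function name and tolerance are my own) returns whether optimization is possible for parameter c_k relative to c_d.

```python
def can_improve(E_cd, E_ck, y_ck, alpha_ck, C, tol=1e-12):
    # Corollary 3.4: conditions under which the objective can increase along gamma_{e_k}
    deriv = y_ck * (E_cd - E_ck)           # equation (5)
    if alpha_ck <= tol:                    # at the lower bound: derivative must be positive
        return deriv > tol
    if alpha_ck >= C - tol:                # at the upper bound: derivative must be negative
        return deriv < -tol
    return abs(deriv) > tol                # nonbound: derivative must be nonzero
```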

Sufficient optimization conditions based on partial derivatives.

Theorem 3.5. If one of the partial derivatives of the function W3 fulfills the condition (7), then optimization is possible based on partial derivatives for some parameter changes.

Proof. We can change the parameter which fulfills the condition (7). The remaining parameters, which are attributed to the W3 variables, can stay unchanged. Thus the W3 value will grow.

Overall optimization conditions.

Theorem 3.6. Optimization is possible for some parameter changes, if and only if there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2, and one of the partial derivatives of the function W3 fulfills the condition (7).

Proof. Because of (Th. 3.1), (Th. 3.2), (Th. 3.3) and (Th. 3.5), we only have to prove that the overall optimization condition is the conjunction of the condition based on (4) and the condition based on partial derivatives. This can be shown in terms of multidimensional functions with a set of linear constraints and one nonlinear constraint. The multidimensional function W3 can be optimized when the conditions on the derivatives are fulfilled with respect to the linear conditions. There is additionally only one nonlinear constraint. When it is also fulfilled, then optimization is possible.

Choosing the best active set. In the default multidimensional heuristic we do not rank candidates by the condition based on (4); we only require that this condition be fulfilled.

Using the optimization conditions based on partial derivatives, we choose the two parameters which maximize m_{c_d c_k}, defined as:

when parameter c_k is bound and belongs to the G1 group, then m_{c_d c_k} := E_{c_d} − E_{c_k};

when parameter c_k is bound and belongs to the G2 group, then m_{c_d c_k} := E_{c_k} − E_{c_d};

when parameter c_k is nonbound, then m_{c_d c_k} := |E_{c_k} − E_{c_d}|.

The conclusion from the above is that the best two parameters to optimize are the one with minimal E from the G1 group and the one with maximal E from the G2 group, provided the chosen parameters are different.


Choosing the remaining parameters. One of the already chosen parameters is the c_d parameter. In order to maximize m_{c_d c_k} for the remaining parameters that have to be chosen, we choose them from the group opposite to the group of the c_d parameter: when α_{c_d} belongs to the G1 group, the remaining parameters are chosen from the G2 group; when α_{c_d} belongs to the G2 group, the remaining parameters are chosen from the G1 group.

After choosing the two parameters, either of them can be the c_d parameter. So we choose the remaining parameters either from the G1 group or from the G2 group, and then compare both cases by summing the m values for all chosen parameters. The case with the maximal sum is finally chosen. A sketch of this selection procedure is given below.
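A simplified sketch of the default heuristic, assuming the E values and the group masks from the earlier snippets are available. It picks the pair (minimal E in G1, maximal E in G2) and then fills the remaining p − 2 slots from one of the two groups, comparing the summed m values of both cases; tie handling and the ranking of nonbound parameters are my own simplifications, not the paper's exact procedure.

```python
import numpy as np

def select_active_set(E, g1, g2, p):
    # Best pair: minimal E in G1, maximal E in G2 (they must be different indices).
    i1 = int(np.flatnonzero(g1)[np.argmin(E[g1])])
    i2 = int(np.flatnonzero(g2)[np.argmax(E[g2])])
    if i1 == i2:
        return None  # no improving pair exists
    chosen = {i1, i2}

    def fill(from_mask, e_ref, sign):
        # rank remaining candidates in the given group by m = sign * (E - e_ref)
        cand = [i for i in np.flatnonzero(from_mask) if i not in chosen]
        cand.sort(key=lambda i: sign * (E[i] - e_ref), reverse=True)
        extra = cand[:max(p - 2, 0)]
        score = sum(sign * (E[i] - e_ref) for i in extra)
        return extra, score

    # Case A: c_d is the G1 pick; remaining parameters come from G2 (m = E_ck - E_cd).
    extra_a, score_a = fill(g2, E[i1], +1.0)
    # Case B: c_d is the G2 pick; remaining parameters come from G1 (m = E_cd - E_ck).
    extra_b, score_b = fill(g1, E[i2], -1.0)
    extra = extra_a if score_a >= score_b else extra_b
    return [i1, i2] + extra
```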

4 SMO based subproblem solver

There have been two methods for solving the SVM problem. A new, third method is proposed in this article:

1. solve the SVM problem by a heuristic algorithm with two-parameter subproblems solved analytically (the SMO algorithm)

2. solve the SVM problem by a heuristic algorithm with more-than-two-parameter subproblems solved by quadratic programming solvers

3. solve the SVM problem by a heuristic algorithm with more-than-two-parameter subproblems solved by the SMO algorithm

In the third method, multidimensional subproblems are solved by sequentially finding the analytical solutions of two-parameter subsubproblems.
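The following hedged sketch illustrates the idea of the third method: within an active set, the multidimensional subproblem is attacked by repeated analytic two-variable updates of the classical SMO type. The update formulas below are the standard SMO step, not taken verbatim from the paper; the sweep order and stopping rule are my own simplifications.

```python
import numpy as np

def smo_pair_step(alpha, y, K, E, i, j, C, eps=1e-12):
    # Standard analytic two-variable step: optimize alpha_i, alpha_j while keeping
    # y_i*alpha_i + y_j*alpha_j constant and 0 <= alpha <= C.
    if i == j:
        return False
    s = y[i] * y[j]
    if s > 0:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    if H - L < eps:
        return False
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < eps:
        return False  # skip non-positive curvature in this simplified sketch
    a_j_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    if abs(a_j_new - alpha[j]) < eps:
        return False
    a_i_new = alpha[i] + s * (alpha[j] - a_j_new)
    # Keep the cached errors E consistent with the new alphas.
    E += y[i] * (a_i_new - alpha[i]) * K[:, i] + y[j] * (a_j_new - alpha[j]) * K[:, j]
    alpha[i], alpha[j] = a_i_new, a_j_new
    return True

def sms_subsolver(alpha, y, K, E, active, C, max_sweeps=50):
    # Sweep over all pairs inside the active set until no pair changes.
    for _ in range(max_sweeps):
        changed = False
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                changed |= smo_pair_step(alpha, y, K, E, active[a], active[b], C)
        if not changed:
            break
```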

Comparison of the new algorithm with the second solver. In the second method, subproblems are solved by a quadratic programming solver (for example an interior point method solver, see [6]), whose computation time is independent of the SVM problem size but dependent on the subproblem size: O(p). The third method also solves subproblems in O(p) time. In practice the best times are achieved when the parameter p is small, thus the computation times are comparable, although the proposed solver is less complicated and easier to implement.

Comparison of the new algorithm with the first solver. The second and third solvers choose subproblems with more than two parameters. In practice, SVM problems are computed faster than with the first solver, which chooses two parameters for the active set. This comparison is presented in the tests section below.

5 Testing SVM optimization with Sequential Multidimensional Subsolver

SVM optimization with the Sequential Multidimensional Subsolver will be compared with the SMO algorithm in terms of computation time. The distribution of the computation time comparison will also be presented.

Tests were done on image data sets and stock exchange prediction data sets. Data from images were extracted by taking the indices of every point and its color. Data vectors from images have two dimensions.

The indices of every point are the data vector coefficients, and the classification is equal to 1 if the point color is closer to white, and equal to -1 if the point color is closer to black. Stock exchange prediction data sets were generated from 1-hour market data. Every vector corresponds to one market hour and has two features. The first feature is the percentage price growth from the hour before the previous hour to the previous hour. The second feature is the percentage volume growth from the hour before the previous hour to the previous hour. The classification is set to 1 when there was a price growth from the previous hour, and to -1 when there was a fall. The effects of slippage and order costs were omitted. This model is suited for trading during the day, buying at the beginning of the full hour and selling at the end of the full hour.
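An illustrative reconstruction of the feature construction described above (the array names and the percentage convention are my assumptions; the paper gives no code):

```python
import numpy as np

def hourly_features(price, volume):
    # price[t], volume[t]: hourly closing price and volume, t = 0, 1, ...
    # Feature 1: percentage price growth from hour t-2 to hour t-1.
    # Feature 2: percentage volume growth from hour t-2 to hour t-1.
    # Label: +1 if the price grew from hour t-1 to hour t, otherwise -1.
    f1 = 100.0 * (price[1:-1] - price[:-2]) / price[:-2]
    f2 = 100.0 * (volume[1:-1] - volume[:-2]) / volume[:-2]
    y = np.where(price[2:] > price[1:-1], 1.0, -1.0)
    return np.column_stack([f1, f2]), y
```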

Data standardization. Data were scaled proportionally to the interval [−1, 1] in order to minimize floating point representation errors. This operation was done independently for every feature as follows:

[a, b] → [−1, 1]

The values a and b were extracted from the data set: a is the minimum value and b is the maximum value for the particular feature. When

a ≤ x ≤ b

the value x is changed to x′ in the following way:

x′ = 2(x − a)/(b − a) − 1.
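A minimal sketch of the per-feature scaling described above (NumPy; the function name is mine, and it assumes b > a for every feature):

```python
import numpy as np

def scale_to_unit_interval(X):
    # Map every feature column from [a, b] to [-1, 1] with x' = 2(x - a)/(b - a) - 1.
    a = X.min(axis=0)
    b = X.max(axis=0)
    return 2.0 * (X - a) / (b - a) - 1.0
```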

Test parameters

Tests were done for various kernel functions:

• linear kernel

• polynomial kernel

• Radial Basis Function (RBF) kernel

In all tests C = 1.0, acceptable error e = 0.001.

SVM optimization with the SMS algorithm was tested with a subproblem size of 5. This size was experimentally chosen as the best among the sizes tried.

Detailed test configurations are in Tab. 1.

            conf 1                                  conf 2
Subsolver   SMO                                     SMS
Test .1     Kernel: linear, a = 1                   the same
Test .2     Kernel: polynomial, a = 1.0, dim = 2.0  the same
Test .3     Kernel: RBF, sigma = 1.0                the same

Table 1: Test configurations

Test 1

Test parameters:

• data from images

• 248 cases

• every case has 204 vectors

Comparison results are given in Tab. 2.

Testing feature                           SMO       SMS (5)
Test 1.1
  Number of iterations                    33731     19550
  Number of tests with fewer iterations   1         246 (1)
  Computation time                        92.45     83.46
  Number of tests with shorter times      67        173 (8)
Test 1.2
  Number of iterations                    96962     28156
  Number of tests with fewer iterations   1         247 (0)
  Computation time                        261.17    160.23
  Number of tests with shorter times      39        204 (5)
Test 1.3
  Number of iterations                    41674     18855
  Number of tests with fewer iterations   1         247 (0)
  Computation time                        116.69    89.16
  Number of tests with shorter times      66        176 (6)
All
  Number of iterations                    172367    66561
  Number of tests with fewer iterations   3         740 (1)
  Computation time                        470.30    332.84
  Number of tests with shorter times      172       553 (19)

Table 2: Image data test results

Test 2

Test parameters:

• stock exchange prediction data sets

• 230 securities from the National Association of Securities Dealers Automated Quotations (NASDAQ) Stock Market

• securities have from 389 up to 506 vectors, 1-hour tick data from December 2006 up to now

Comparison results are given in Tab. 3.

Testing feature                           SMO       SMS (5)
Test 2.1
  Number of iterations                    54545     46831
  Number of tests with fewer iterations   4         225 (1)
  Computation time                        500.49    537.83
  Number of tests with shorter times      168       60 (2)
Test 2.2
  Number of iterations                    165787    65472
  Number of tests with fewer iterations   1         229 (0)
  Computation time                        1489.21   779.54
  Number of tests with shorter times      5         222 (3)
Test 2.3
  Number of iterations                    97266     63935
  Number of tests with fewer iterations   6         221 (3)
  Computation time                        889.03    756.50
  Number of tests with shorter times      119       110 (1)
All
  Number of iterations                    317598    176238
  Number of tests with fewer iterations   4         675 (11)
  Computation time                        2878.72   2073.87
  Number of tests with shorter times      292       392 (6)

Table 3: Stock exchange data test results

In the case of the image data sets, SVM with the Sequential Multidimensional Subsolver with a subproblem size of 5 parameters is faster than the SMO algorithm in more than 74% of the tests. The overall score is that SVM with SMS is faster than the SMO algorithm by 17%. In the case of the stock exchange data sets, SVM with SMS is faster than the SMO algorithm in more than 57% of the tests. The overall score is that SVM with SMS is faster than the SMO algorithm by 16%.

6 Conclusions

In this article I presented a new subsolver for finding solutions of SVM subproblems. The proposed Sequential Multidimensional Subsolver (SMS) was compared with the SMO algorithm. Tests show that SVM optimization with the new subsolver is generally faster than the SMO algorithm. From the theoretical comparison with quadratic optimization subsolvers, I concluded that both subsolvers have comparable speed, but the proposed SMS subsolver is less complicated and easier to implement.

7 Acknowledgments

This research is funded by the Polish Ministry of Education and Science, Project No. 3 T11F 010 30. I would like to express my sincere gratitude to Professor Witold Dzwinel and Mr Marcin Kurdziel (AGH University of Science and Technology, Institute of Computer Science) for contributing ideas, discussion and useful suggestions.


References

[1] P.-H. Chen, R.-E. Fan, and C.-J. Lin. A study on SMO-type decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 17:893–908, July 2006.

[2] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, New York, NY, USA, 2000.

[3] T. Joachims. Making large-scale support vector machine learning practical, 1998.

[4] M. Orchel. Support vector machines: Heuristic of Alternatives (HoA). In Signal Processing Symposium Jachranka 2007, 2007.

[5] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII — Proceedings of the 1997 IEEE Workshop, pages 276–285, New York, 1997. IEEE.

[6] R. J. Vanderbei. LOQO: An interior point code for quadratic programming. Optimization Methods and Software, 11:451–484, 1999.

[7] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
