
Support Vector Machines: Sequential Multidimensional Subsolver (SMS)

Marcin Orchel

AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland

Abstract

In this paper I will present a new algorithm for solving the Support Vector Machines (SVM) optimization problem. The new algorithm has a simpler form than existing algorithms and has a comparable computational cost. The classical Sequential Minimal Optimization (SMO) algorithm decomposes the SVM problem into two-dimensional subproblems. It was shown in [3] that SVM optimization with decomposition into more than two-dimensional subproblems can be faster. However, existing algorithms for solving multidimensional subproblems are complicated quadratic programming solvers. The proposed Sequential Multidimensional Subsolver (SMS) employs SMO for solving multidimensional subproblems. Tests show that the SVM solver with SMS is generally faster than the SMO algorithm.

1 Support Vector Machines

Support Vector Machine (SVM) [7] is an approach to statistical classification. The idea of SVM is to separate classes with a hyperplane by maximizing the geometric distance between the hyperplane and the nearest vectors.

In this article Support Vector Machines will be used for binary classification. The classifier investigated in this paper is the soft margin classifier with box constraints [2]. This classifier allows for misclassified vectors by adding a penalty term to the objective function. The soft margin classifier uses the kernel trick to separate classes with a nonlinear boundary instead of a hyperplane. Learning the SVM classifier leads to the following quadratic programming (QP) optimization problem:

SVM optimization problem (O1) Maximization of

W(\vec{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_{ij} \alpha_i \alpha_j K_{ij}

with the constraints:

\sum_{i=1}^{l} y_i \alpha_i = 0

0 \le \alpha_i \le C, \quad i \in I = \{1, \ldots, l\}, \quad C > 0

where:

~α is the parameter vector,

l is the size of the vector ~α, i.e. the number of data vectors,

yij := yiyj, yi is the classification value, yi ∈ {−1, 1}, C is the soft margin classifier parameter,

Kij := K(~xi, ~xj) is the kernel function, ~xi and ~xj are data vectors.
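To make the objective concrete, the following minimal sketch (mine, not from the paper) evaluates W(~α) and checks the constraints of O1 with NumPy; the RBF kernel and the random data are only illustrative assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def dual_objective(alpha, y, K):
    # W(alpha) = sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij
    v = y * alpha
    return alpha.sum() - 0.5 * v @ K @ v

def feasible(alpha, y, C, tol=1e-9):
    # equality constraint sum_i y_i alpha_i = 0 and box constraint 0 <= alpha_i <= C
    return abs(np.dot(y, alpha)) < tol and np.all(alpha >= -tol) and np.all(alpha <= C + tol)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))                     # 10 two-dimensional data vectors
    y = np.where(rng.random(10) < 0.5, -1.0, 1.0)
    K = rbf_kernel_matrix(X)
    alpha = np.zeros(10)                             # alpha = 0 is always feasible
    print(dual_objective(alpha, y, K), feasible(alpha, y, C=1.0))
```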

2 SVM decomposition technique

In order to speed up solving the SVM optimization problem, a decomposition technique is used [5]. The decomposition algorithm divides the problem into smaller subproblems.

The set of parameters chosen for a subproblem is called the active set.

The SVM optimization subproblem is defined as:

SVM optimization subproblem (O2) Maximization of

W_2(\vec{\beta}) = \sum_{i=1}^{p} \beta_i + \sum_{i=1, i \notin C}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{p} y_{c_i} \beta_i \sum_{j=1}^{p} y_{c_j} \beta_j K_{c_i c_j}
  - \sum_{i=1}^{p} y_{c_i} \beta_i \sum_{j=1, j \notin C}^{l} y_j \alpha_j K_{c_i j}
  - \frac{1}{2} \sum_{i=1, i \notin C}^{l} \sum_{j=1, j \notin C}^{l} y_{ij} \alpha_i \alpha_j K_{ij}    (1)

with the constraints:

\sum_{i=1}^{p} y_{c_i} \beta_i + \sum_{i=1, i \notin C}^{l} y_i \alpha_i = 0    (2)

0 \le \beta_i \le C, \quad i \in \{1, 2, \ldots, p\}, \quad C > 0

where:

P = {c1, c2, . . . , cp} is the set of indices of parameters chosen for the active set, ci ∈ I, ci ≠ cj for i ≠ j (in the sums above, i ∉ C denotes indices outside the active set),

~β is the subproblem variable vector,

βi is the searched value of the ci parameter.

The vector ~α is a previous solution. It must fulfill the constraints of the O1 problem.
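From (1), the subproblem objective W2 is the full objective W evaluated with the non-active components of ~α held fixed. The hedged sketch below (helper names are mine; the active indices and arrays are placeholders) uses this identity to evaluate and feasibility-check a trial ~β.

```python
import numpy as np

def dual_objective(alpha, y, K):
    # full dual objective W(alpha), as in problem O1
    v = y * alpha
    return alpha.sum() - 0.5 * v @ K @ v

def subproblem_objective(beta, active, alpha, y, K):
    # W2(beta): substitute the trial values beta into the active positions
    # of the previous solution alpha and evaluate the full dual objective.
    trial = alpha.copy()
    trial[active] = beta
    return dual_objective(trial, y, K)

def subproblem_feasible(beta, active, alpha, y, C, tol=1e-9):
    # constraint (2) plus the box constraint on the active variables
    trial = alpha.copy()
    trial[active] = beta
    return abs(np.dot(y, trial)) < tol and np.all(beta >= -tol) and np.all(beta <= C + tol)
```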

3 SVM multidimensional heuristic

The SVM heuristic is responsible for choosing the right parameters for the active set. In order to minimize the computation time of learning SVM, the heuristic should minimize the overall number of iterations. An SVM multidimensional heuristic was proposed in [3]. The default multidimensional heuristic proposed in this article differs from the existing one. It chooses the best p parameters which fulfill optimization conditions based on solving subproblems with the linear constraint included. I proposed this heuristic for the two-parameter case in [4].

We can transform the linear constraint (2) into the form:

\beta_d = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \beta_i - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i    (3)

where d ∈ {1, 2, . . . , p} is an arbitrarily chosen parameter.
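A one-line illustration of (3): given the other active variables and the fixed part of ~α, βd is fully determined by the equality constraint. The function name and arguments below are hypothetical, not taken from the paper.

```python
import numpy as np

def beta_d_from_rest(beta_rest, y_active_rest, y_cd, fixed_sum):
    # Equation (3): beta_d = -y_cd * sum_{i != d} y_ci beta_i - y_cd * sum_{i not in active} y_i alpha_i
    # fixed_sum is the precomputed sum over the non-active indices of y_i * alpha_i.
    return -y_cd * np.dot(y_active_rest, beta_rest) - y_cd * fixed_sum
```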

After substituting βd into (1) we get the following optimization subproblem:

SVM optimization subproblem with linear constraint included (O3) Maximization of:

W_3(\vec{\gamma}) =
  -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i}
  - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i
  + \sum_{i=1, i \ne d}^{p} \gamma_{e_i}
  + \sum_{i=1, i \notin C}^{l} \alpha_i
  - \frac{1}{2} \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big)^2 K_{c_d c_d}
  + \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big) \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} K_{c_d c_i}
  - \frac{1}{2} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} \sum_{j=1, j \ne d}^{p} y_{c_j} \gamma_{e_j} K_{c_i c_j}
  + \Big( \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} + \sum_{i=1, i \notin C}^{l} y_i \alpha_i \Big) \sum_{i=1, i \notin C}^{l} y_i \alpha_i K_{c_d i}
  - \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} \sum_{j=1, j \notin C}^{l} y_j \alpha_j K_{c_i j}
  - \frac{1}{2} \sum_{i=1, i \notin C}^{l} \sum_{j=1, j \notin C}^{l} y_{ij} \alpha_i \alpha_j K_{ij}

with the constraints:

0 \le \gamma_{e_i} \le C, \quad i \in \{1, 2, \ldots, p\} \setminus \{d\}, \quad C > 0

0 \le c = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \gamma_{e_i} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i \le C, \quad C > 0    (4)

where

~γ is a variable vector with p − 1 elements,

ei = i for i < d, ei = i − 1 for i > d,

γei is the searched value of the ci parameter, c is the searched value of the cd parameter.

The vector ~α is a previous solution. It must fulfill the constraints of the O1 problem.

The partial derivative of W3(~γ) at the point for which γei = αci has the value:

\frac{\partial W_3}{\partial \gamma_{e_k}}(\vec{\gamma}_{old}) = y_{c_k} (E_{c_d} - E_{c_k})    (5)

where

E_i = \sum_{j=1}^{l} y_j \alpha_j K_{ij} - y_i
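A small sketch of (5): the values E can be computed for all indices at once from the current ~α, and the partial derivative at the old point follows directly. The helper names are mine, not the paper's.

```python
import numpy as np

def errors(alpha, y, K):
    # E_i = sum_j y_j alpha_j K_ij - y_i, computed for all i at once
    return K @ (y * alpha) - y

def partial_derivative(E, y, cd, ck):
    # Equation (5): dW3/dgamma_{e_k} at gamma = alpha_old equals y_ck * (E_cd - E_ck)
    return y[ck] * (E[cd] - E[ck])
```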

3.1 Default multidimensional heuristic

In this subsection I will present conditions for optimization possibility. Based on these conditions I will describe the algorithm for choosing the best active set.

The first obvious necessary condition is that at least one of the parameters must change its value. The remaining optimization conditions consist of two parts. The first part consists of conditions based on fulfilling constraint (4), the second part consists of conditions based on partial derivatives. Merging all conditions leads to the overall optimization conditions.

Necessary optimization conditions based on fulfilling (4). (4) must be fulfilled after the changes, hence we can write:

-\alpha_{c_d}^{old} \le \Delta\alpha_{c_d} = -\alpha_{c_d}^{old} - y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \alpha_{c_i}^{new} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i \le C - \alpha_{c_d}^{old}

After substituting:

\alpha_{c_d}^{old} = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \alpha_{c_i}^{old} - y_{c_d} \sum_{i=1, i \notin C}^{l} y_i \alpha_i

we get the following condition:

-\alpha_{c_d}^{old} \le \Delta\alpha_{c_d} = -y_{c_d} \sum_{i=1, i \ne d}^{p} y_{c_i} \Delta\alpha_{c_i} \le C - \alpha_{c_d}^{old}    (6)

Theorem 3.1. If the condition (6) is fulfilled, then there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2 defined as:

G1 := {i ∈ {1, 2, . . . , l} : (yi = 1 ∧ αi = 0) ∨ (yi = −1 ∧ αi = C) ∨ (0 < αi < C)}

G2 := {i ∈ {1, 2, . . . , l} : (yi = −1 ∧ αi = 0) ∨ (yi = 1 ∧ αi = C) ∨ (0 < αi < C)}

Note that nonbound parameters are included in both groups.
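A hedged sketch of the group definitions as boolean masks over all indices (NumPy; the tolerance handling is my own addition, not part of the paper):

```python
import numpy as np

def groups(alpha, y, C, tol=1e-12):
    nonbound = (alpha > tol) & (alpha < C - tol)
    # G1: (y = 1, alpha = 0) or (y = -1, alpha = C) or nonbound
    g1 = ((y == 1) & (alpha <= tol)) | ((y == -1) & (alpha >= C - tol)) | nonbound
    # G2: (y = -1, alpha = 0) or (y = 1, alpha = C) or nonbound
    g2 = ((y == -1) & (alpha <= tol)) | ((y == 1) & (alpha >= C - tol)) | nonbound
    return g1, g2
```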

Proof. We will prove that if all parameters belong to only one group, G1 or G2, then condition (6) is not fulfilled. We choose parameters belonging to the G1 group. The proof for the G2 group is similar. The set of chosen parameters does not contain nonbound parameters, because they belong to both groups. If all ci parameters for i ∈ {1, 2, . . . , p} \ {d} do not change, then Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} = 0, therefore Δα_{c_d} = 0. So none of the parameters ci change, which cannot be true. Otherwise the following holds: Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} > 0. If y_{c_d} = 1, then Δα_{c_d} < 0 and α_{c_d}^{old} = 0. The condition (6) becomes 0 ≤ Δα_{c_d} ≤ C, which cannot be true. If y_{c_d} = −1, then Δα_{c_d} > 0 and α_{c_d}^{old} = C. The condition (6) becomes −C ≤ Δα_{c_d} ≤ 0, which cannot be true.

Sufficient optimization conditions based on fulfilling (4).

Theorem 3.2. If there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2, then condition (6) is fulfilled for some parameter changes.

Proof. If none of the chosen two parameters (ca from the G1 group and cb from the G2 group) is the cd parameter, then we can set Δα_{c_a} and Δα_{c_b} to the same values or with inverse signs, in such a way that Δα_{c_d} = 0, thus (6) is fulfilled. If the chosen parameters are the cd parameter from the G1 group and the cb parameter from the G2 group, then when we set all remaining parameter changes to zero the following can hold: Σ_{i=1, i≠d}^{p} y_{c_i} Δα_{c_i} < 0. If y_{c_d} = 1, then Δα_{c_d} > 0. If α_{c_d}^{old} = 0, then condition (6) is obviously fulfilled. If 0 < α_{c_d}^{old} < C, then condition (6) is fulfilled when Δα_{c_b} is set to a value close enough to zero. If y_{c_d} = −1, then Δα_{c_d} < 0. If α_{c_d}^{old} = C, then condition (6) is obviously fulfilled. If 0 < α_{c_d}^{old} < C, then condition (6) is fulfilled when Δα_{c_b} is set to a value close enough to zero.

Necessary optimization conditions based on partial derivatives.

Theorem 3.3. If optimization is possible based on partial derivatives, then one of the partial derivatives of the function W3 must fulfill the following condition:

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} > 0 when α_{c_k} = 0

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} < 0 when α_{c_k} = C

\frac{\partial W_3(\vec{\gamma})}{\partial \gamma_{e_k}} ≠ 0 when 0 < α_{c_k} < C    (7)

Proof. We will prove that if all partial derivatives of the function W3 do not fulfill condition (7), then optimization is not possible. If (7) is not fulfilled, then the objective function W3 cannot increase its value in any direction. Thus the function W3 cannot increase its value at all.

Corollary 3.4. After substituting (5) into (7) we get:

y_{c_k}(E_{c_d} − E_{c_k}) > 0 when α_{c_k} = 0
y_{c_k}(E_{c_d} − E_{c_k}) < 0 when α_{c_k} = C
y_{c_k}(E_{c_d} − E_{c_k}) ≠ 0 when 0 < α_{c_k} < C

After simplification, when y_{c_k} = 1:

E_{c_k} < E_{c_d} when α_{c_k} = 0
E_{c_k} > E_{c_d} when α_{c_k} = C
E_{c_k} ≠ E_{c_d} when 0 < α_{c_k} < C

When y_{c_k} = −1:

E_{c_k} > E_{c_d} when α_{c_k} = 0
E_{c_k} < E_{c_d} when α_{c_k} = C
E_{c_k} ≠ E_{c_d} when 0 < α_{c_k} < C
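The per-parameter check from Corollary 3.4 can be written directly. The sketch below (function name and tolerance are my own) returns whether optimization is possible for parameter c_k relative to c_d.

```python
def can_improve(E_cd, E_ck, y_ck, alpha_ck, C, tol=1e-12):
    # Corollary 3.4: conditions under which the objective can increase along gamma_{e_k}
    deriv = y_ck * (E_cd - E_ck)           # equation (5)
    if alpha_ck <= tol:                    # at the lower bound: derivative must be positive
        return deriv > tol
    if alpha_ck >= C - tol:                # at the upper bound: derivative must be negative
        return deriv < -tol
    return abs(deriv) > tol                # nonbound: derivative must be nonzero
```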

Sufficient optimization conditions based on partial derivatives.

Theorem 3.5. If one of the partial derivatives of the function W3 fulfills the condition (7), then optimization is possible based on partial derivatives for some parameter changes.

Proof. We can change the parameter which fulfills the condition (7). The remaining parameters, which are attributed to the W3 variables, can stay unchanged. Thus the W3 value will grow.

Overall optimization conditions.

Theorem 3.6. Optimization is possible for some parameter changes, if and only if there exist two parameters ci, where i ∈ {1, 2, . . . , p}, which belong to the opposite groups G1 and G2, and one of the partial derivatives of the function W3 fulfills the condition (7).

Proof. Because of (Th. 3.1), (Th. 3.2), (Th. 3.3) and (Th. 3.5), we only have to prove that the overall optimization condition is the conjunction of the condition based on (4) and the condition based on partial derivatives. This can be shown in terms of multidimensional functions with a set of linear constraints and one nonlinear constraint. The multidimensional function W3 can be optimized when the conditions on the derivatives are fulfilled with respect to the linear conditions. There is additionally only one nonlinear constraint. When it is also fulfilled, then optimization is possible.

Choosing the best active set. In the default multidimensional heuristic we do not rank candidates by the condition based on (4); we only require that this condition be fulfilled.

Using the optimization conditions based on partial derivatives, we choose the two parameters which maximize m_{c_d c_k}, defined as:

when parameter c_k is bound and belongs to the G1 group, then m_{c_d c_k} := E_{c_d} − E_{c_k};

when parameter c_k is bound and belongs to the G2 group, then m_{c_d c_k} := E_{c_k} − E_{c_d};

when parameter c_k is nonbound, then m_{c_d c_k} := |E_{c_k} − E_{c_d}|.

The conclusion from the above is that the best two parameters to optimize are the one with minimal E from the G1 group and the one with maximal E from the G2 group, provided the chosen parameters are different.


Choosing the remaining parameters. One of the already chosen parameters is the c_d parameter. In order to maximize m_{c_d c_k} for the remaining parameters that have to be chosen, we choose them from the group opposite to the group of the c_d parameter: when α_{c_d} belongs to the G1 group, the remaining parameters are chosen from the G2 group; when α_{c_d} belongs to the G2 group, the remaining parameters are chosen from the G1 group.

After choosing the two parameters, either of them can be the c_d parameter. So we choose the remaining parameters either from the G1 group or from the G2 group, and then compare both cases by summing the m values for all chosen parameters. The case with the maximal sum is finally chosen. A sketch of this selection procedure is given below.
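A simplified sketch of the default heuristic, assuming the E values and the group masks from the earlier snippets are available. It picks the pair (minimal E in G1, maximal E in G2) and then fills the remaining p − 2 slots from one of the two groups, comparing the summed m values of both cases; tie handling and the ranking of nonbound parameters are my own simplifications, not the paper's exact procedure.

```python
import numpy as np

def select_active_set(E, g1, g2, p):
    # Best pair: minimal E in G1, maximal E in G2 (they must be different indices).
    i1 = int(np.flatnonzero(g1)[np.argmin(E[g1])])
    i2 = int(np.flatnonzero(g2)[np.argmax(E[g2])])
    if i1 == i2:
        return None  # no improving pair exists
    chosen = {i1, i2}

    def fill(from_mask, e_ref, sign):
        # rank remaining candidates in the given group by m = sign * (E - e_ref)
        cand = [i for i in np.flatnonzero(from_mask) if i not in chosen]
        cand.sort(key=lambda i: sign * (E[i] - e_ref), reverse=True)
        extra = cand[:max(p - 2, 0)]
        score = sum(sign * (E[i] - e_ref) for i in extra)
        return extra, score

    # Case A: c_d is the G1 pick; remaining parameters come from G2 (m = E_ck - E_cd).
    extra_a, score_a = fill(g2, E[i1], +1.0)
    # Case B: c_d is the G2 pick; remaining parameters come from G1 (m = E_cd - E_ck).
    extra_b, score_b = fill(g1, E[i2], -1.0)
    extra = extra_a if score_a >= score_b else extra_b
    return [i1, i2] + extra
```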

4 SMO based subproblem solver

There have been two methods for solving the SVM problem. A new, third method is proposed in this article:

1. solve the SVM problem by a heuristic algorithm with two-parameter subproblems solved analytically (the SMO algorithm)

2. solve the SVM problem by a heuristic algorithm with more-than-two-parameter subproblems solved by quadratic programming solvers

3. solve the SVM problem by a heuristic algorithm with more-than-two-parameter subproblems solved by the SMO algorithm

In the third method, multidimensional subproblems are solved by sequentially finding the analytical solutions of two-parameter subsubproblems.
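The following hedged sketch illustrates the idea of the third method: within an active set, the multidimensional subproblem is attacked by repeated analytic two-variable updates of the classical SMO type. The update formulas below are the standard SMO step, not taken verbatim from the paper; the sweep order and stopping rule are my own simplifications.

```python
import numpy as np

def smo_pair_step(alpha, y, K, E, i, j, C, eps=1e-12):
    # Standard analytic two-variable step: optimize alpha_i, alpha_j while keeping
    # y_i*alpha_i + y_j*alpha_j constant and 0 <= alpha <= C.
    if i == j:
        return False
    s = y[i] * y[j]
    if s > 0:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    if H - L < eps:
        return False
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < eps:
        return False  # skip non-positive curvature in this simplified sketch
    a_j_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    if abs(a_j_new - alpha[j]) < eps:
        return False
    a_i_new = alpha[i] + s * (alpha[j] - a_j_new)
    # Keep the cached errors E consistent with the new alphas.
    E += y[i] * (a_i_new - alpha[i]) * K[:, i] + y[j] * (a_j_new - alpha[j]) * K[:, j]
    alpha[i], alpha[j] = a_i_new, a_j_new
    return True

def sms_subsolver(alpha, y, K, E, active, C, max_sweeps=50):
    # Sweep over all pairs inside the active set until no pair changes.
    for _ in range(max_sweeps):
        changed = False
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                changed |= smo_pair_step(alpha, y, K, E, active[a], active[b], C)
        if not changed:
            break
```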

Comparison of the new algorithm with the second solver. In the second method, subproblems are solved by a quadratic programming solver (for example an interior point method solver, see [6]), whose computation time is independent of the SVM problem size but dependent on the subproblem size: O(p). The third method also solves subproblems in O(p) time. In practice the best times are achieved when the parameter p is small, thus the computation times are comparable, although the proposed solver is less complicated and easier to implement.

Comparison of the new algorithm with the first solver. The second and third solvers choose subproblems with more than two parameters. In practice, SVM problems are computed faster than with the first solver, which chooses two parameters for the active set. This comparison is presented in the tests section below.

5 Testing SVM optimization with Sequential Multidimensional Subsolver

SVM optimization with the Sequential Multidimensional Subsolver will be compared with the SMO algorithm in terms of computation time. The distribution of the computation time comparison will also be presented.

Tests were done on image data sets and stock exchange prediction data sets. Data from images were extracted by taking the indices of every point and its color. Data vectors from images have two dimensions.

The indices of every point are the data vector coefficients, and the classification is equal to 1 if the point color is closer to white, and equal to -1 if the point color is closer to black. Stock exchange prediction data sets were generated from 1-hour market data. Every vector corresponds to one market hour and has two features. The first feature is the percentage price growth from the hour before the previous hour to the previous hour. The second feature is the percentage volume growth from the hour before the previous hour to the previous hour. The classification is set to 1 when there was a price growth from the previous hour, and to -1 when there was a fall. The effects of slippage and order costs were omitted. This model is suited for trading during the day, buying at the beginning of the full hour and selling at the end of the full hour.
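An illustrative reconstruction of the feature construction described above (the array names and the percentage convention are my assumptions; the paper gives no code):

```python
import numpy as np

def hourly_features(price, volume):
    # price[t], volume[t]: hourly closing price and volume, t = 0, 1, ...
    # Feature 1: percentage price growth from hour t-2 to hour t-1.
    # Feature 2: percentage volume growth from hour t-2 to hour t-1.
    # Label: +1 if the price grew from hour t-1 to hour t, otherwise -1.
    f1 = 100.0 * (price[1:-1] - price[:-2]) / price[:-2]
    f2 = 100.0 * (volume[1:-1] - volume[:-2]) / volume[:-2]
    y = np.where(price[2:] > price[1:-1], 1.0, -1.0)
    return np.column_stack([f1, f2]), y
```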

Data standardization. Data were scaled proportionally to the interval [−1, 1] in order to minimize floating point representation errors. This operation was done independently for every feature as follows:

[a, b] → [−1, 1]

The values a and b were extracted from the data set: a is the minimum value and b is the maximum value for the particular feature. When

a ≤ x ≤ b

the value x is changed to x′ in the following way:

x′ = 2(x − a)/(b − a) − 1.
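A minimal sketch of the per-feature scaling described above (NumPy; the function name is mine, and it assumes b > a for every feature):

```python
import numpy as np

def scale_to_unit_interval(X):
    # Map every feature column from [a, b] to [-1, 1] with x' = 2(x - a)/(b - a) - 1.
    a = X.min(axis=0)
    b = X.max(axis=0)
    return 2.0 * (X - a) / (b - a) - 1.0
```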

Test parameters

Tests were done for various kernel functions:

• linear kernel

• polynomial kernel

• Radial Basis Function (RBF) kernel

In all tests C = 1.0, acceptable error e = 0.001.

SVM optimization with the SMS algorithm was tested with a subproblem size of 5. This size was experimentally chosen as the best among the sizes tried.

Detailed test configurations are in Tab. 1.

            conf 1                                  conf 2
Subsolver   SMO                                     SMS
Test .1     Kernel: linear, a = 1                   the same
Test .2     Kernel: polynomial, a = 1.0, dim = 2.0  the same
Test .3     Kernel: RBF, sigma = 1.0                the same

Table 1: Test configurations

Test 1

Test parameters:

• data from images

• 248 cases

• every case has 204 vectors

Comparison results are given in Tab. 2.

Testing feature                           SMO       SMS (5)
Test 1.1
  Number of iterations                    33731     19550
  Number of tests with fewer iterations   1         246 (1)
  Computation time                        92.45     83.46
  Number of tests with shorter times      67        173 (8)
Test 1.2
  Number of iterations                    96962     28156
  Number of tests with fewer iterations   1         247 (0)
  Computation time                        261.17    160.23
  Number of tests with shorter times      39        204 (5)
Test 1.3
  Number of iterations                    41674     18855
  Number of tests with fewer iterations   1         247 (0)
  Computation time                        116.69    89.16
  Number of tests with shorter times      66        176 (6)
All
  Number of iterations                    172367    66561
  Number of tests with fewer iterations   3         740 (1)
  Computation time                        470.30    332.84
  Number of tests with shorter times      172       553 (19)

Table 2: Image data test results

Test 2

Test parameters:

• stock exchange prediction data sets

• 230 securities from the National Association of Securities Dealers Automated Quotations (NASDAQ) Stock Market

• securities have from 389 up to 506 vectors, 1-hour tick data from December 2006 up to now

Comparison results are given in Tab. 3.

Testing feature                           SMO       SMS (5)
Test 2.1
  Number of iterations                    54545     46831
  Number of tests with fewer iterations   4         225 (1)
  Computation time                        500.49    537.83
  Number of tests with shorter times      168       60 (2)
Test 2.2
  Number of iterations                    165787    65472
  Number of tests with fewer iterations   1         229 (0)
  Computation time                        1489.21   779.54
  Number of tests with shorter times      5         222 (3)
Test 2.3
  Number of iterations                    97266     63935
  Number of tests with fewer iterations   6         221 (3)
  Computation time                        889.03    756.50
  Number of tests with shorter times      119       110 (1)
All
  Number of iterations                    317598    176238
  Number of tests with fewer iterations   4         675 (11)
  Computation time                        2878.72   2073.87
  Number of tests with shorter times      292       392 (6)

Table 3: Stock exchange data test results

In the case of the image data sets, SVM with the Sequential Multidimensional Subsolver with a subproblem size of 5 parameters is faster than the SMO algorithm in more than 74% of the tests. The overall score is that SVM with SMS is faster than the SMO algorithm by 17%. In the case of the stock exchange data sets, SVM with SMS is faster than the SMO algorithm in more than 57% of the tests. The overall score is that SVM with SMS is faster than the SMO algorithm by 16%.

6 Conclusions

In this article I presented a new subsolver for finding solutions of SVM subproblems. The proposed Sequential Multidimensional Subsolver (SMS) was compared with the SMO algorithm. Tests show that SVM optimization with the new subsolver is generally faster than the SMO algorithm. From the theoretical comparison with quadratic optimization subsolvers, I concluded that both subsolvers have comparable speed, but the proposed SMS subsolver is less complicated and easier to implement.

7 Acknowledgments

This research is funded by the Polish Ministry of Education and Science, Project No. 3 T11F 010 30. I would like to express my sincere gratitude to Professor Witold Dzwinel and Mr Marcin Kurdziel (AGH University of Science and Technology, Institute of Computer Science) for contributing ideas, discussion and useful suggestions.


References

[1] P.-H. Chen, R.-E. Fan, and C.-J. Lin. A study on SMO-type decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 17:893–908, July 2006.

[2] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, New York, NY, USA, 2000.

[3] T. Joachims. Making large-scale support vector machine learning practical, 1998.

[4] M. Orchel. Support vector machines: Heuristic of Alternatives (HoA). In Signal Processing Symposium Jachranka 2007, 2007.

[5] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII — Proceedings of the 1997 IEEE Workshop, pages 276–285, New York, 1997. IEEE.

[6] R. J. Vanderbei. LOQO: An interior point code for quadratic programming. Optimization Methods and Software, 11:451–484, 1999.

[7] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
