Support Vector Regression as a Classification Problem with a Priori Knowledge in the Form of Detractors for Creating Reduced Models

N/A
N/A
Protected

Academic year: 2021

Share "Support Vector Regression as a Classification Problem with a Priori Knowledge in the Form of Detractors for Creating Reduced Models"

Copied!
8
0
0

Pełen tekst

(1)

Problem with a Priori Knowledge in the Form of Detractors for Creating Reduced Models

Marcin Orchel

AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland, marcin@orchel.pl

Summary. In this article, we propose applying a recently proposed technique of reducing Support Vector Classification models to regression problems. The reduction method creates reduced models by removing support vectors and uses a general formulation of Support Vector Classification with a priori knowledge in the form of detractors. We apply this method to regression problems by using a reformulation of ε-insensitive Support Vector Regression as a special case of the general formulation.

The experiments show that the reduction technique can be successfully applied to regression problems. The tests were performed on various regression data sets and on stock price data from public domain repositories.

Key words: Support Vector Machines, a priori knowledge

1 Introduction

One of the main learning problems is regression estimation. Vapnik [9] proposed a new regression method called ε-insensitive Support Vector Regression (ε-SVR). It belongs to the group of methods called Support Vector Machines (SVM). For estimating indicator functions, the Support Vector Classification (SVC) method was developed. SVM were invented on the basis of statistical learning theory. They are efficient learning methods partly because of the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used for generating nonlinear solutions. Recently, a general formulation of SVC with a priori knowledge in the form of detractors was proposed [5][6]. It has already been used for incorporating sample a priori knowledge, for manipulating the result curve, and for creating reduced models.

In this article, we use a reformulation of ε-SVR as a special case of the generalized SVC. A similar reformulation has already been implemented in LibSVM [1] for solving ordinal classification problems (without a priori knowledge) and ε-SVR. The consequence of this reformulation is that we can apply all the applications of the generalized SVC also to ε-SVR. These are manipulating the regression function, proposed as a manipulation of the decision curve for classification problems in [5], and creating improved reduced models by removing support vectors, proposed for classification problems in [6].

The former was also investigated for the alternative SVR regression method proposed recently [7]. Various methods for reducing the complexity of the output model have been widely investigated [3]. In particular, reduction by removing support vectors was also analyzed in [2] for regression problems. In this article, we create improved reduced models by removing support vectors and using sample a priori knowledge in the reduced models. The knowledge comes from the solution of the original problem obtained before the reduction. It is also possible to extract the knowledge from any source in the form of an analytical function; in particular, it could be the solution of an alternative regression method.

1.1 Introduction to ε-SVR

In regression estimation we consider a set of training examples $x_i$ for $i = 1, \ldots, l$, where $x_i = \left(x_i^1, \ldots, x_i^m\right)$. The $i$-th training example is mapped to $y_i^r \in \mathbb{R}$, and $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of
$$f\left(w^r, b^r, \xi^r, \xi^{r*}\right) = \left\|w^r\right\|^2 + C^r \sum_{i=1}^{l} \left(\xi_i^r + \xi_i^{r*}\right)$$
with constraints $y_i^r - g\left(x_i\right) \le \varepsilon + \xi_i^r$, $g\left(x_i\right) - y_i^r \le \varepsilon + \xi_i^{r*}$, $\xi^r \ge 0$, $\xi^{r*} \ge 0$ for $i \in \{1, \ldots, l\}$, where $g\left(x_i\right) = w^r \cdot x_i + b^r$.

Here $g(x) = w^r \cdot x + b^r$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem. The regression function becomes

$$g(x) = \sum_{i=1}^{l} \left(\alpha_i - \beta_i\right) K\left(x_i, x\right) + b^r, \qquad (1)$$

where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot, \cdot)$ is a kernel function, which is incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF) and sigmoid.

The $i$-th training example is a support vector when $\alpha_i - \beta_i \neq 0$. It can be proved that the set of support vectors contains all training examples which fall outside the ε tube and some of the examples which lie on the ε tube. The conclusion is that the number of support vectors can be controlled by the tube height ε.
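As an illustration of equation (1), here is a minimal sketch in Python, assuming an RBF kernel and dual coefficients already obtained from the dual of OP 1; all names are ours, not taken from the author's implementation:

```python
import numpy as np

def rbf_kernel(xi, x, sigma=0.5):
    # RBF kernel: K(x_i, x) = exp(-||x_i - x||^2 / (2 * sigma^2))
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def regression_function(x, X_train, alpha, beta, b, kernel=rbf_kernel):
    # Equation (1): g(x) = sum_i (alpha_i - beta_i) * K(x_i, x) + b
    return sum((a - bt) * kernel(xi, x) for a, bt, xi in zip(alpha, beta, X_train)) + b

# Toy values; in practice alpha and beta come from solving the dual of OP 1.
X_train = np.array([[0.1], [0.4], [0.7], [0.9]])
alpha = np.array([0.0, 0.3, 0.0, 0.5])
beta = np.array([0.2, 0.0, 0.0, 0.0])
b = 0.1

# Support vectors are exactly the examples with alpha_i - beta_i != 0.
print("number of support vectors:", np.count_nonzero(alpha - beta))
print("g(0.5) =", regression_function(np.array([0.5]), X_train, alpha, beta, b))
```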


1.2 Introduction to Generalized SVC

A 1-norm soft margin SVC optimization problem for $x_i$ with example weights $C_i$ is

OP 2. Minimization of
$$f\left(w^c, b^c, \xi^c\right) = \frac{1}{2}\left\|w^c\right\|^2 + C^c \cdot \xi^c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c$, $\xi^c \ge 0$ for $i \in \{1, \ldots, l\}$, where $C^c \succ 0$, $h\left(x_i\right) = w^c \cdot x_i + b^c$.

A generalized SVC has additional example weights ϕ in the constraints:

OP 3. Minimization of
$$f\left(w^c, b^c, \xi^c\right) = \frac{1}{2}\left\|w^c\right\|^2 + C^c \cdot \xi^c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c + \varphi_i$, $\xi^c \ge 0$ for $i \in \{1, \ldots, l\}$, where $C^c \succ 0$, $\varphi_i \in \mathbb{R}$, $h\left(x_i\right) = w^c \cdot x_i + b^c$.

The weights ϕ are present only in the constraints. When ϕ = 0, OP 3 is equivalent to OP 2. The functional margin for a point $p$ is defined as the value $y_p h(p)$. A value $v$ in functional margin units is equal to $v / \left\|w^c\right\|$. An example with $\varphi_i > 0$ is called a detractor, and $1 + \varphi_i$ can be interpreted as a lower bound on the distance from the example to the decision boundary measured in functional margin units. In this article, we extend the possible values of $\varphi_i$ to negative ones. For $1 + \varphi_i < 0$, an incorrectly classified example, and slack variables omitted for simplicity, we get
$$-y_i^c h\left(x_i\right) / \left\|w^c\right\| \le -\left(1 + \varphi_i\right) / \left\|w^c\right\|,$$
therefore $\left|1 + \varphi_i\right|$ is in this case an upper bound on the distance from the detractor example to the decision boundary measured in functional margin units, and such an example is alternatively called an attractor.
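For a brief numeric illustration (our own example, not taken from the paper): suppose $\left\|w^c\right\| = 2$. A detractor with $\varphi_i = 1$ must satisfy $y_i^c h\left(x_i\right) \ge 1 + \varphi_i = 2$ (ignoring slack), i.e. it must lie at a geometric distance of at least $2 / \left\|w^c\right\| = 1$ from the decision boundary. An attractor with $\varphi_i = -3$ that is incorrectly classified satisfies $-y_i^c h\left(x_i\right) / \left\|w^c\right\| \le -\left(1 + \varphi_i\right) / \left\|w^c\right\| = 1$, i.e. it lies at distance at most 1 on the wrong side of the boundary.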

2 Reformulation of ε-SVR

Here we present a reformulation of OP 1. Every regression training example is duplicated, Fig. 1. Every original training example gets class 1 and the duplicated training example gets class −1, and therefore we get

OP 4. Minimization of
$$f\left(w^c, b^c, \xi^c\right) = \frac{1}{2}\left\|w^c\right\|^2 + C^c \sum_{i=1}^{2l} \xi_i^c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c + \varphi_i$, $\xi^c \ge 0$ for $i \in \{1, \ldots, 2l\}$, where $h\left(x_i\right) = w^c \cdot x_i + b^c$, $\varphi_i = y_i^c y_i^r - \varepsilon - 1$.


The difference from OP 1 is that we added a constant 1/2. OP 4 is a special case of OP 3. Instead of using the decision curve of OP 4 we use the regression function
$$\sum_{i=1}^{2l} y_i^c \alpha_i K\left(x_i, x\right) + b^c = 0 \;\rightarrow\; g(x) = \sum_{i=1}^{2l} y_i^c \alpha_i K\left(x_i, x\right) + b^c.$$

In a typical scenario $\varphi_i < 0$, because ε is close to 0 and $y_i^r$ is less than 1; therefore the interpretation of $\varphi_i$ is similar to that of the slack variables, with the difference that the sum of slack variables is additionally minimized. For an incorrectly classified $i$-th training example with $\varphi_i < 0$, $\left|\varphi_i\right|$ can be interpreted as the radius of a hypersphere in functional margin units, centered at the $i$-th point, that must intersect the margin boundary $y_i h(x) = 1$, Fig. 1. Note that for $\varphi_i > 0$ the interpretation was described in [6]: $\varphi_i$ can be interpreted as the radius of a hypersphere that must not intersect the margin boundary $y_i h(x) = 1$ (in more than one point). Detractors lead to a new type of support vectors that can lie outside the margin [6]. In turn, the attractors presented in this article can lead to examples which lie on the wrong side of the margin (in particular, which are incorrectly classified) but do not belong to the support vectors. Moreover, attractors can lead to a new type of support vectors which have slack variables equal to zero and do not lie on the margin. We can notice the following property of OP 4.

Because every training example is duplicated, for every possible solution $l$ training examples will always be incorrectly classified, except those lying on the classification decision boundary, Fig. 1. An efficient solution of OP 4 based on Sequential Minimal Optimization [8] is the same as described in [5][6].
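To make the reformulation concrete, the following minimal sketch (our own illustration; the variable names and the value of ε are placeholders) builds the classification sample of OP 4 from a regression sample:

```python
import numpy as np

def regression_to_classification(X, y_r, eps=0.1):
    """Duplicate every regression example (x_i, y_i^r); the original copy gets
    class +1, the duplicate gets class -1, and each copy receives the weight
    phi_i = y_i^c * y_i^r - eps - 1, as in OP 4."""
    l = len(y_r)
    X_cls = np.vstack([X, X])                            # 2l inputs, unchanged
    y_cls = np.concatenate([np.ones(l), -np.ones(l)])    # class labels +1 / -1
    phi = y_cls * np.concatenate([y_r, y_r]) - eps - 1.0
    return X_cls, y_cls, phi

# Toy regression data scaled to [0, 1].
X = np.array([[0.0], [0.5], [1.0]])
y_r = np.array([0.2, 0.6, 0.9])
X_cls, y_cls, phi = regression_to_classification(X, y_r, eps=0.05)
print(y_cls)  # [ 1.  1.  1. -1. -1. -1.]
print(phi)    # typically negative, since eps is small and y_r < 1
```

For the +1 copies the constraint of OP 4 reduces to $w^c \cdot x_i + b^c \ge y_i^r - \varepsilon - \xi_i^c$, and for the −1 copies to $w^c \cdot x_i + b^c \le y_i^r + \varepsilon + \xi_i^c$, which recovers the ε-SVR constraints of OP 1.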

3 Reducing a Model by Removing Support Vectors

We use a method of removing support vectors to decrease the ε-SVR model complexity. Reduced models are more suitable for further processing, e.g. they decrease the time of testing new examples. However, reduced models have the disadvantage that their generalization performance can be worse than for the original full models. One of the possibilities for improving the performance is to incorporate a priori knowledge into the reduced models. Reduced models based on removing support vectors were recently proposed for SVC [6] and for ε-SVR [2]. In this article, we propose creating reduced models using a priori knowledge in the form of detractors for regression problems. This method was already proposed in [6], but for classification problems. Because we have already reformulated ε-SVR as a classification problem with a priori knowledge in the form of detractors, creating a reduced model is similar to that described in [6]. First, we solve OP 4, and then detractors are automatically generated from the solution by setting
$$\varphi_i = y_i h\left(x_i\right) - 1 \qquad (2)$$
for all training examples.


Fig. 1. In the left figure, regression data ('+' points) transformed to classification data ('x' points) are depicted. In the right figure, the interpretation of an attractor as a hypersphere is depicted. The attractor is located at (2, 1).

Note that all training examples already have the parameter $\varphi_i$ set; the method therefore adjusts these values according to the found solution.

A reduced model is generated by removing a set of randomly selected support vectors, with a maximal removal ratio of p% of all training vectors, where p is a configurable parameter. Finally, we run the reformulated ε-SVR method with the incorporated a priori knowledge and the reduced data. The reduced solution is used for testing new examples.
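A high-level sketch of the whole reduction procedure is given below. The function solve_generalized_svc stands in for the author's SMO-based solver, which is not publicly specified here; we assume it returns an object exposing decision_function and support_indices, so only the surrounding logic is illustrated:

```python
import numpy as np

def reduce_model(X_cls, y_cls, phi, p, solve_generalized_svc,
                 rng=np.random.default_rng(0)):
    # 1. Solve OP 4 on the full classification data with the original phi.
    model = solve_generalized_svc(X_cls, y_cls, phi)

    # 2. Re-derive detractors from the solution: phi_i = y_i * h(x_i) - 1  (eq. 2).
    phi_new = y_cls * model.decision_function(X_cls) - 1.0

    # 3. Remove a random subset of support vectors, at most p% of all training vectors.
    sv = np.asarray(model.support_indices)
    max_removed = int(np.floor(p / 100.0 * len(y_cls)))
    removed = rng.choice(sv, size=min(max_removed, len(sv)), replace=False)
    keep = np.setdiff1d(np.arange(len(y_cls)), removed)

    # 4. Retrain on the reduced data, carrying the detractors over as a priori knowledge.
    return solve_generalized_svc(X_cls[keep], y_cls[keep], phi_new[keep])
```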

4 Experiments

In the first experiment, we show that the reduced models with knowledge in the form of detractors have better generalization performance than those without that knowledge for various p. The first method does not use knowledge in the form of detractors in reduced models, the second one uses it. In the second experiment, we show that the reduced models have a much shorter time of testing new examples, which depends mainly on the number of support vectors. Note that, for the purpose of a fair comparison, the training data of a reduced model are the same for both methods. We use the author's implementation of the reformulated ε-SVR for both methods.

For all data sets, every feature is scaled linearly to [0, 1], including the output. For ε we use a grid search method for finding the best values.
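A minimal sketch of this preprocessing step (the candidate grid for ε is a placeholder; the paper does not list the exact values):

```python
import numpy as np

def scale_to_unit_interval(A):
    # Linear min-max scaling of every column (features and the output) to [0, 1].
    lo, hi = A.min(axis=0), A.max(axis=0)
    return (A - lo) / np.where(hi > lo, hi - lo, 1.0)

# Candidate epsilon values for the grid search (placeholder values).
eps_grid = [0.001, 0.01, 0.05, 0.1, 0.2]
```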


4.1 Comparing Generalization Performance of Reduced Models

The synthetic samples were generated from particular functions with added Gaussian noise for the output values. The real world data sets were taken from the LibSVM site [1][4], except the stock price data; they originally come from the UCI Machine Learning Repository and the StatLib DataSets Archive. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the sample data set as follows: for every month the output value is the growth/fall compared to the next month, and every feature i is the percent price change between the month and the i-th previous month. In every simulation, training data are randomly chosen and the remaining examples become test data.
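The construction of the stock price data set can be sketched as follows (this is our reading of the description above; the price series used here is synthetic, not the real DJIA data):

```python
import numpy as np

def build_stock_dataset(prices, dim=10):
    """For each month t: feature i is the percent price change between month t
    and month t-i, and the output is the percent growth/fall from month t to
    month t+1 (our interpretation of the paper's description)."""
    X, y = [], []
    for t in range(dim, len(prices) - 1):
        feats = [(prices[t] - prices[t - i]) / prices[t - i] * 100.0
                 for i in range(1, dim + 1)]
        X.append(feats)
        y.append((prices[t + 1] - prices[t]) / prices[t] * 100.0)
    return np.array(X), np.array(y)

# Toy monthly price series (placeholder numbers).
prices = np.array([100.0, 102.0, 101.5, 104.0, 107.0, 106.0, 108.5,
                   110.0, 109.0, 112.0, 115.0, 114.0, 118.0])
X, y = build_stock_dataset(prices, dim=3)
```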

For p = 70 and C = 0.1 the method with knowledge in the form of detractors has better performance in all tests, with a similar number of support vectors (columns im1 and im2 in Table 1). The testing performance improvement varies from 0% to 68%. For variable p the proposed method is also superior; example results for the cadata1 test are depicted in Fig. 2.

4.2 Comparing Testing Speed Performance of Reduced Models

The testing speed for new examples depends mainly on the number of support vectors. With fewer support vectors we achieve a testing time reduction. We can see testing speed improvements (columns t1 and t2) for all tests in Table 1.

Fig. 2. A comparison of two methods of removing support vectors for the cadata1 test case from Table 1 for variable p. On the x axis there is the p parameter as a percentage; on the y axis there is the percent difference in misclassified training examples (left figure) and misclassified testing examples (right figure) between the original method without removing support vectors and the method with the removal procedure applied. The line with '+' points represents the random removal method, while the line with 'x' points represents the proposed removal method with knowledge in the form of detractors.


Table 1. Comparison of the time of testing new examples for an original model and a reduced model with detractors. Column descriptions: name – the name of the test; fun – the function used for generating data, $y_1 = \sum_{i=1}^{\dim} x_i$, $y_2 = \left(\sum_{i=1}^{\dim} x_i\right)^{kerP}$, $y_3 = 0.5 \sum_{i=1}^{\dim} \sin 10 x_i + 0.5$; simT – the number of random simulations, where training examples are randomly selected (or generated) and results are averaged; ker – the kernel (pol – a polynomial kernel); kerP – the kernel parameter (for a polynomial kernel it is the degree, for the RBF kernel it is σ); trs – the training set size; all – the number of all data, i.e. the sum of training and testing data; dm – the dimension of the problem; s1 – the average number of support vectors for the original method; s2 – the average number of support vectors for the method with detractors; im1 – the improvement as a percentage of the method with detractors over the one without, for training data; im2 – the improvement as a percentage of the method with detractors over the one without, for testing data; t1 – the time of testing for an original model (in s); t2 – the time of testing for a reduced model (in s).

name        fun  simT  ker  kerP   trs  all     dm  s1   s2   im1   im2   t1   t2
synthetic1  y1   10    lin  –      180  300000  4   173  53   1     0.7   3.5  2
synthetic2  y2   10    pol  3      180  300000  4   162  52   1.5   2.4   17   9
synthetic3  y3   10    rbf  0.25   180  300000  4   153  53   8.8   4.2   12   7
abalone1    –    10    lin  –      180  4177    8   81   40   5.4   5.2   0    0
abalone2    –    10    pol  3      180  4177    8   155  47   24.3  18.1  0.2  0.1
abalone3    –    10    rbf  0.125  180  4177    8   159  53   14    10.7  0.2  0
cadata1     –    10    lin  –      180  20640   8   144  53   7.6   8.6   0.5  0.2
cadata2     –    10    pol  3      180  20640   8   140  53   9.8   0.2   2.2  0.7
cadata3     –    10    rbf  0.125  180  20640   8   147  54   7.7   5.1   2    0.4
housing1    –    10    lin  –      180  506     13  66   44   24.2  21.8  0    0
housing2    –    10    pol  3      180  506     13  166  53   76.7  68.5  0    0
housing3    –    10    rbf  0.077  180  506     13  177  54   0.2   0.6   0    0
djia1       –    10    lin  –      180  1351    10  74   16   4.5   5.2   0    0
djia2       –    10    pol  3      180  1351    10  121  50   45.5  29.2  0    0
djia3       –    10    rbf  0.1    180  1351    10  166  24   4     2.3   0.2  0

5 Conclusions

We show experimentally that knowledge in the form of detractors allows us to construct reduced ε-SVR regression models with better generalization performance than models without that knowledge. Moreover, we confirm that reduced models lead to a simpler formulation of the regression function and therefore a decreased time of testing new examples.

Acknowledgement. The research is financed by the Polish Ministry of Science and Higher Education, project No. NN519579338. I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for contributing ideas, discussion and useful suggestions.


References

1. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

2. Karasuyama, M., Takeuchi, I., Nakano, R.: Reducing svr support vectors by using backward deletion. In: Proceedings of the 12th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, Part III, KES ’08, pp. 76–83. Springer-Verlag, Berlin, Heidelberg (2008)

3. Keerthi, S.S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. J. Mach. Learn. Res. 7, 1493–1515 (2006)

4. LIBSVM data sets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (2011)

5. Orchel, M.: Incorporating detractors into svm classification. In: K. Cyran, S. Kozielski, J. Peters, U. Stańczyk, A. Wakulicz-Deja (eds.) Man-Machine Interactions, Advances in Intelligent and Soft Computing, vol. 59, pp. 361–369. Springer Berlin Heidelberg (2009)

6. Orchel, M.: Incorporating a priori knowledge from detractor points into support vector classification. In: A. Dobnikar, U. Lotric, B. Šter (eds.) Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science, vol. 6594, pp. 332–341. Springer Berlin Heidelberg (2011)

7. Orchel, M.: Regression based on support vector classification. In: A. Dobnikar, U. Lotric, B. Šter (eds.) Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science, vol. 6594, pp. 353–362. Springer Berlin Heidelberg (2011)

8. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization, pp. 185–208. MIT Press, Cambridge, MA, USA (1999)

9. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
