Regression Based on Support Vector Classification
Marcin Orchel
AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland, marcin@orchel.pl
Abstract. In this article, we propose a novel regression method that is based solely on Support Vector Classification. The experiments show that the new method has comparable or better generalization performance than ε-insensitive Support Vector Regression. The tests were performed on synthetic data, on various publicly available regression data sets, and on stock price data. Furthermore, we demonstrate how a priori knowledge that has already been incorporated into Support Vector Classification for estimating indicator functions can be used directly for a regression problem.
Keywords: Support Vector Machines, a priori knowledge
1 Introduction
One of the main learning problems is regression estimation. Vapnik [6] proposed a regression method called ε-insensitive Support Vector Regression (ε-SVR). It belongs to a group of methods called Support Vector Machines (SVM). For estimating indicator functions, the Support Vector Classification (SVC) method was developed. SVM were developed on the basis of statistical learning theory. They are efficient learning methods partly because of the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used for generating nonlinear solutions.
In this article, we analyze the differences between ε-SVR and SVC. We list some advantages of SVC over ε-SVR: ε-SVR has the additional free parameter ε, and in ε-SVR the minimized term $\|w\|^2$ is responsible for rewarding flat functions, while in SVC the same term has a meaning fully dependent on the training data, since it takes part in finding a maximal margin hyperplane. This is the motivation for the proposed new regression method, which is fully based on SVC. Additionally, the proposed method has an advantage when incorporating a priori knowledge into SVM. Incorporating a priori knowledge into SVM is an important task and has been researched extensively in recent years [2]. In practice, most a priori knowledge is first incorporated into SVC, and additional effort is needed to introduce the same a priori knowledge into ε-SVR. We show by example that a particular type of a priori knowledge already incorporated into SVC can be directly used for a regression problem with the proposed method.
Recently, some attempts have been made to combine SVC with ε-SVR [7]. They differ substantially from the proposed method in that the proposed method is a replacement for ε-SVR.
1.1 Introduction to ε-SVR and SVC
In regression estimation, we consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^r \in \mathbb{R}$, and $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of
$$f(w_r, b_r, \xi_r, \xi_r^*) = \|w_r\|^2 + C_r \sum_{i=1}^{l} \left(\xi_i^r + \xi_i^{r*}\right)$$
with constraints $y_i^r - g(a_i) \le \varepsilon + \xi_i^r$, $g(a_i) - y_i^r \le \varepsilon + \xi_i^{r*}$, $\xi^r \ge 0$, $\xi^{r*} \ge 0$ for $i \in \{1..l\}$, where $g(a_i) = w_r \cdot a_i + b_r$.
The function $g^*(x) = w_r^* \cdot x + b_r^*$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem, and the regression function becomes
$$g^*(x) = \sum_{i=1}^{l} \left(\alpha_i^* - \beta_i^*\right) K(a_i, x) + b_r^*, \qquad (1)$$
where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot,\cdot)$ is a kernel function which is incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid. A kernel function which is a dot product of its variables is called a simple linear kernel. The $i$-th training example is a support vector when $\alpha_i^* - \beta_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall outside the ε-tube, and some of the examples which lie on the ε-tube. The conclusion is that the number of support vectors can be controlled by the tube height ε.
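As an illustration of this last point (not part of the original paper), the following minimal Python sketch fits scikit-learn's SVR, whose solver is based on LibSVM, for several values of ε and reports the resulting number of support vectors; the data set, parameter values, and variable names are assumptions made only for this example.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative 1-D regression data: a noisy sine similar to Fig. 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=100)

# A wider epsilon-tube leaves more points strictly inside the tube,
# so fewer training examples become support vectors.
for eps in (0.01, 0.05, 0.2):
    model = SVR(kernel="rbf", C=10.0, gamma=5.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps:.2f}: {len(model.support_)} support vectors")
```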
For indicator function estimation, we consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^c \in \{-1, 1\}$, and $m$ is the dimension of the problem. The SVC 1-norm soft margin optimization problem is

OP 2. Minimization of
$$f(w_c, b_c, \xi_c) = \|w_c\|^2 + C_c \sum_{i=1}^{l} \xi_i^c$$
with constraints $y_i^c h(a_i) \ge 1 - \xi_i^c$, $\xi^c \ge 0$ for $i \in \{1..l\}$, where $h(a_i) = w_c \cdot a_i + b_c$.
The curve $h^*(x) = w_c^* \cdot x + b_c^* = 0$ is the decision curve of the classification problem. Optimization problem 2 is transformed to an equivalent dual optimization problem, and the decision curve becomes
$$h^*(x) = \sum_{i=1}^{l} y_i^c \alpha_i^* K(a_i, x) + b_c^* = 0, \qquad (2)$$
where $\alpha_i$ are Lagrange multipliers of the dual problem and $K(\cdot,\cdot)$ is a kernel function which is incorporated into the dual problem. Margin boundaries are defined as the two hyperplanes $h(x) = -1$ and $h(x) = 1$; optimal margin boundaries are defined as the two hyperplanes $h^*(x) = -1$ and $h^*(x) = 1$. The $i$-th training example is a support vector when $\alpha_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall below the optimal margin boundaries ($y_i h^*(a_i) < 1$), and some of the examples which lie on the optimal margin boundaries ($y_i h^*(a_i) = 1$).
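Analogously, a short sketch (again an illustration, not taken from the paper) shows that in scikit-learn's SVC every support vector satisfies $y_i h^*(a_i) \le 1$ up to numerical tolerance, while examples strictly above the optimal margin boundaries get zero dual coefficients; the data and parameters are assumed for this example only.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds with labels -1 and +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# y_i * h(a_i) for each training example: all support vectors lie at or
# below 1, and every example strictly below 1 must be a support vector.
margins = y * clf.decision_function(X)
print("support vectors:", len(clf.support_))
print("examples with y*h(x) <= 1 + 1e-6:", int(np.sum(margins <= 1 + 1e-6)))
```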
Comparing the number of free parameters of both methods, ε-SVR has the additional ε parameter. One of the motivations for developing a regression method based on SVC is the flatness property of ε-SVR. The minimized term $\|w_r\|^2$ in ε-SVR is related to the following property of a linear function: for two linear functions $g_1(x) = w_1 \cdot x + b_1$ and $g_2(x) = w_2 \cdot x + b_2$, whenever $\|w_2\| < \|w_1\|$ we say that $g_2(x)$ is flatter than $g_1(x)$. Flatter functions are rewarded by ε-SVR. The flatness property of a linear function is not related to the training examples. This differs from SVC, where the minimized term $\|w_c\|^2$ is related to the training examples, because it is used for finding a maximal margin hyperplane.
2 Regression Based on SVC
We consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^r \in \mathbb{R}$. The proposed regression method (SVCR) is based on the following scheme of finding a regression function:
1. Every training example $a_i$ is duplicated; the output value $y_i^r$ is translated by the value of a parameter ϕ > 0 for the original training example and by −ϕ for the duplicated training example.
2. Every training example is converted to a classification example by incorporating the output as an additional feature and setting class 1 for original training examples and class −1 for duplicated training examples.
3. SVC is run with the classification mappings.
4. The solution of SVC is converted to a regression form.
The above procedure is repeated for different values of ϕ; for a particular ϕ it is depicted in Fig. 1. The best solution among the various values of ϕ is selected based on the mean squared error (MSE) measure. Steps 1 and 2 are sketched in the code below.
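A minimal sketch of steps 1 and 2 (an illustration under the notation above, not the authors' implementation; the helper name make_svcr_data is an assumption introduced here):

```python
import numpy as np

def make_svcr_data(X, y, phi):
    """Convert a regression set (X, y) into the SVCR classification setting.

    Each example is duplicated; the output is shifted by +phi for originals
    and by -phi for duplicates, and the shifted output is appended as the
    (m+1)-th feature. Originals get class +1, duplicates class -1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    up = np.hstack([X, (y + phi)[:, None]])    # originals, class +1
    down = np.hstack([X, (y - phi)[:, None]])  # duplicates, class -1
    X_cls = np.vstack([up, down])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    return X_cls, y_cls
```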
The result of the first step is a set of training mappings for $i \in \{1, \dots, 2l\}$ and $\varphi > 0$:
$$b_i = (a_{i,1}, \dots, a_{i,m}) \to y_i + \varphi \quad \text{for } i \in \{1, \dots, l\},$$
$$b_i = (a_{i-l,1}, \dots, a_{i-l,m}) \to y_{i-l} - \varphi \quad \text{for } i \in \{l+1, \dots, 2l\}.$$
Fig. 1. In the left figure, an example of regression data for the function y = 0.5 sin 10x + 0.5 with Gaussian noise; in the right figure, the regression data translated to classification data. Translated original examples are marked with '+', translated duplicated examples with 'x'. (Both panels span [0, 1] on both axes.)
The parameter $\varphi$ is called the translation parameter. The result of the second step is a set of training mappings for $i \in \{1, \dots, 2l\}$ and $\varphi > 0$:
$$c_i = (b_{i,1}, \dots, b_{i,m}, y_i + \varphi) \to 1 \quad \text{for } i \in \{1, \dots, l\},$$
$$c_i = (b_{i,1}, \dots, b_{i,m}, y_{i-l} - \varphi) \to -1 \quad \text{for } i \in \{l+1, \dots, 2l\}.$$
The dimension of the $c_i$ vectors is $m + 1$. The set of $a_i$ mappings is called the regression data setting; the set of $c_i$ mappings is called the classification data setting. In the third step, we solve OP 2 with the $c_i$ examples. Note that $h^*(x)$ is in implicit form with respect to the last coordinate of $x$. In the fourth step, we have to find an explicit form for the last coordinate; the explicit form is needed, for example, for testing new examples. For a simple linear kernel, the $w_c$ variable of the primal problem is found from the solution of the dual problem as
$$w_c = \sum_{i=1}^{2l} y_i^c \alpha_i c_i .$$
For a simple linear kernel, the explicit form of (2) is
$$x_{m+1} = -\frac{\sum_{j=1}^{m} w_c^j x_j + b_c}{w_c^{m+1}}.$$
The regression solution is $g^*(x) = w_r \cdot x + b_r$, where $w_r^i = -w_c^i / w_c^{m+1}$ for $i = 1..m$ and $b_r = -b_c / w_c^{m+1}$.
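The full linear-kernel pipeline can be sketched as follows (an illustration of the formulas above, not the authors' code; scikit-learn's linear-kernel SVC exposes $w_c$ and $b_c$ through coef_ and intercept_, and the function name is an assumption):

```python
import numpy as np
from sklearn.svm import SVC

def fit_svcr_linear(X, y, phi, C=1.0):
    """SVCR with a simple linear kernel: returns (w_r, b_r) of g*(x) = w_r . x + b_r."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Steps 1-2: duplicate, shift outputs by +/- phi, append the output as a feature.
    X_cls = np.vstack([np.hstack([X, (y + phi)[:, None]]),
                       np.hstack([X, (y - phi)[:, None]])])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    # Step 3: run SVC on the classification setting.
    clf = SVC(kernel="linear", C=C).fit(X_cls, y_cls)
    w_c, b_c = clf.coef_[0], clf.intercept_[0]
    # Step 4: solve w_c . (x, x_{m+1}) + b_c = 0 for the last coordinate.
    return -w_c[:-1] / w_c[-1], -b_c / w_c[-1]

# Predictions on new inputs X_new are then X_new @ w_r + b_r.
```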
For nonlinear kernels, the conversion to an explicit form has some limitations. First, the decision curve can have more than one value of the last coordinate for specific values of the remaining coordinates of $x$, and therefore cannot be converted unambiguously to a function (e.g., a polynomial kernel of degree 2). Second, even when the conversion to a function is possible, there may be no explicit analytical formula (e.g., a polynomial kernel of degree greater than 4), or it may not be easy to find, and hence a special method for finding an explicit formula for the coordinate must be used, e.g., a bisection method. The disadvantage of this solution is a longer testing time for new examples. To overcome these problems, we propose a new kernel type in which the last coordinate appears only inside a linear term. The new kernel is constructed from an original kernel by removing the last coordinate and adding a linear term with the last coordinate. For the most popular kernels, polynomial, radial basis function (RBF), and sigmoid, the conversions are, respectively,
$$(x \cdot y)^d \to \left(\sum_{i=1}^{m} x_i y_i\right)^d + x_{m+1} y_{m+1}, \qquad (3)$$
$$\exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \to \exp\left(-\frac{\sum_{i=1}^{m} (x_i - y_i)^2}{2\sigma^2}\right) + x_{m+1} y_{m+1}, \qquad (4)$$
$$\tanh(x \cdot y) \to \tanh\left(\sum_{i=1}^{m} x_i y_i\right) + x_{m+1} y_{m+1}. \qquad (5)$$
The proposed method of constructing new kernels always generates a function fulfilling Mercer’s condition, because it generates a function which is a sum of two kernels. For the new kernel type, the explicit form of (2) is
$$x_{m+1} = -\frac{\sum_{i=1}^{2l} y_i^c \alpha_i K_r(c_i^r, x_r) + b_c}{\sum_{i=1}^{2l} y_i^c \alpha_i c_i^{m+1}},$$
where $c_i^r = (c_i^1, \dots, c_i^m)$, $x_r = (x^1, \dots, x^m)$, and $K_r$ denotes the original kernel acting on the first $m$ coordinates.
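As an illustration of the kernel construction (4), the callable below computes the modified RBF kernel, i.e. the original RBF acting on the first $m$ coordinates plus a linear term on the last coordinate, and can be passed directly to scikit-learn's SVC, which accepts callable kernels; the function name and the value of σ are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def svcr_rbf_kernel(sigma=0.5):
    """K(x, y) = exp(-||x_r - y_r||^2 / (2 sigma^2)) + x_{m+1} * y_{m+1},
    where x_r, y_r drop the last coordinate (cf. Eq. (4))."""
    gamma = 1.0 / (2.0 * sigma ** 2)

    def kernel(X, Y):
        rbf_part = rbf_kernel(X[:, :-1], Y[:, :-1], gamma=gamma)
        linear_part = np.outer(X[:, -1], Y[:, -1])
        return rbf_part + linear_part

    return kernel

# Step 3 of SVCR with the modified kernel (C and sigma are illustrative):
clf = SVC(C=10.0, kernel=svcr_rbf_kernel(sigma=0.5))
```

After fitting, the explicit form above can be evaluated using the dual coefficients $y_i^c \alpha_i$ (dual_coef_), the support vectors (support_vectors_), and $b_c$ (intercept_).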
2.1 Support Vectors
SVCR runs the SVC method on the duplicated set of examples, and therefore the maximal number of SVC support vectors is 2l. The SVCR algorithm is constructed in such a way that, while searching for the best value of ϕ, the cases for which the number of SVC support vectors is bigger than l are omitted. We prove that in this case the set of SVC support vectors does not contain any two training examples where one is a duplicate of the other; therefore the set of SVC support vectors corresponds to a subset of the $a_i$ training examples.
Let us call a vector lying on the margin boundaries, or below them, an essential margin vector, and denote the set of such vectors by EMV.
Theorem 1. If the $a_i$ examples are not collinear and $|EMV| \le l$, then EMV does not contain duplicates.

Proof (sketch). Assume that EMV contains a duplicate $a_t'$ of the example $a_t$. Let $p(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t$; then the set of EMV examples for which $p(\cdot) \ge 0$ has $r \ge 1$ elements. Let $p'(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t'$; then the set of EMV examples for which $p'(\cdot) \le 0$ has at least $l - r + 1$ elements, and so $|EMV| \ge l + 1$, which contradicts the assumption. ⊓⊔
For the nonlinear case, the same theorem applies in the kernel-induced feature space. It can be proved that the set of support vectors is a subset of the EMV, and therefore the theorem also applies to the set of support vectors. Experiments show that it is rare that, for every value of ϕ checked by SVCR, the set of support vectors has more than l elements. In such a situation, the best solution among those violating the constraint is chosen.
Here we consider how changes in the value of ϕ influence the number of support vectors. First, we can see that for ϕ = 0, $l \le |EMV| \le 2l$. When, for a particular value of ϕ, both classes are separable, then $0 \le |EMV| \le 2l$. By a configuration of essential margin vectors we mean a list of essential margin vectors, each with its distance to one of the margin boundaries.
Theorem 2. For two values $\varphi_1 > 0$ and $\varphi_2 > 0$ with $\varphi_1 > \varphi_2$, for every pair of margin boundaries for $\varphi_2$ there exists a pair of margin boundaries for $\varphi_1$ with the same configuration of essential margin vectors.

Proof (sketch). Consider the EMV for $\varphi_2$ with particular margin boundaries. When increasing the value of ϕ by $\varphi_1 - \varphi_2$, in order to preserve the same configuration of essential margin vectors we extend the margin-bounded region by $\varphi_1 - \varphi_2$ on both sides. ⊓⊔
When increasing the value of ϕ, new sets of essential margin vectors arise, and all sets present for lower values of ϕ remain. When both classes become separable by a hyperplane, further increasing ϕ does not change the collection of sets of essential margin vectors. This suggests that increasing ϕ leads to solutions with fewer support vectors.
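This effect can be observed with a small experiment (an illustration only; the data, parameters, and classification-setting construction follow the sketch in Sect. 2):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=60)

for phi in (0.05, 0.1, 0.2, 0.4):
    # Classification setting of Sect. 2: duplicated examples shifted by +/- phi.
    X_cls = np.vstack([np.hstack([X, (y + phi)[:, None]]),
                       np.hstack([X, (y - phi)[:, None]])])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    clf = SVC(kernel="linear", C=10.0).fit(X_cls, y_cls)
    print(f"phi={phi:.2f}: {len(clf.support_)} support vectors out of {len(X_cls)}")
```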
2.2 Comparison with ε-SVR
Both methods have the same number of free parameters: for ε-SVR, C, kernel parameters, and ε; for SVCR, C, kernel parameters, and ϕ. When using a particular kernel function for ε-SVR and the related kernel function for SVCR, both methods have the same hypothesis space. Both parameters ε and ϕ control the number of support vectors. There is a slight difference between the two methods when we compare configurations of essential margin vectors. For ε-SVR, we define the margin boundaries as the lower and upper ε-tube boundaries. Among various values of ε, every configuration of essential margin vectors is unique. In SVCR, based on Thm. 2, configurations of essential margin vectors are repeated while the value of ϕ increases. This suggests that for particular values of ϕ and ε, the set of configurations of essential margin vectors is richer for SVCR than for ε-SVR.
3 Experiments
First, we compare the performance of SVCR and ε-SVR on synthetic data and on publicly available regression data sets. Second, we show that, by using SVCR, a priori knowledge in the form of detractors, introduced in [5] for classification problems, can be applied to regression problems. For the first part, we use the LibSVM [1] implementation of ε-SVR, and we use LibSVM for solving the SVC problems inside SVCR; we use a version ported to Java. For the second part, we use the author's implementation of SVC with detractors.
For all data sets, every feature is scaled linearly to [0, 1], including the output. For variable parameters such as C, σ for the RBF kernel, ϕ for SVCR, and ε for ε-SVR, we use a grid search to find the best values. The number of values searched by the grid method is a trade-off between accuracy and simulation speed. Note that for particular data sets it is possible to use finer grid searches than for massive tests with a large number of simulations. Preliminary tests confirm that as ϕ is increased, the number of support vectors decreases.
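The grid search over (C, ϕ) can be organized as in the following sketch (an illustrative pattern rather than the authors' experimental code; the grids, the hold-out split, and the fit_svcr_linear helper sketched in Sect. 2 are assumptions):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

def grid_search_svcr(X, y):
    # Hold out a validation split and keep the (C, phi) pair with the lowest MSE.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    grid = ParameterGrid({"C": [0.1, 1.0, 10.0, 100.0],
                          "phi": [0.05, 0.1, 0.2, 0.4]})
    best = None
    for params in grid:
        w_r, b_r = fit_svcr_linear(X_tr, y_tr, params["phi"], C=params["C"])
        mse = mean_squared_error(y_val, X_val @ w_r + b_r)
        if best is None or mse < best[0]:
            best = (mse, params)
    return best
```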
3.1 Synthetic Data Tests
We compare the SVCR and ε-SVR methods on data generated from particular functions, with Gaussian noise added to the output values. We perform tests with a linear kernel on linear functions, with a polynomial kernel on a polynomial function, and with the RBF kernel on the sine function. The tests and results are presented in Table 1. We generally notice slightly worse training performance for SVCR; the reason is that ε-SVR directly minimizes the MSE. We notice fairly good generalization performance for SVCR, which is slightly better than for ε-SVR. We notice a smaller number of support vectors for the SVCR method for linear kernels. For the RBF kernel, SVCR is slightly worse.
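For reference, the sine test case can be generated as below (the sample size, noise level, and use of MinMaxScaler are assumptions consistent with Fig. 1 and the [0, 1] scaling described above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=200)

# Scale every feature and the output linearly to [0, 1].
X = MinMaxScaler().fit_transform(X)
y = MinMaxScaler().fit_transform(y[:, None])[:, 0]
```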
3.2 Real World Data Sets
The real-world data sets were taken from the LibSVM site [1][3], except the stock price data. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the training data as follows: for every month, the output value is the growth/fall relative to the next month, and feature i is the percent price change between the given month and the i-th previous month (sketched below). In every simulation, training data are randomly chosen and the remaining examples become test data. The tests and results are presented in Table 2. For linear kernels, the tests show better generalization performance of the SVCR method; the performance gain on testing data ranges from 0–2%. For the polynomial kernel, we notice better generalization performance of SVCR (performance gain from 68–80%). The number of support vectors is comparable for both methods. For the RBF kernel, the results strongly depend on the data: for two test cases SVCR has better generalization performance (10%).
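The stock-data preprocessing might look as follows (a sketch of the description above; the price series, the number of lag features, and the function name are assumptions):

```python
import numpy as np

def make_djia_dataset(prices, n_lags=6):
    """Build (X, y) from a series of monthly closing prices.

    Feature i is the percent price change between the current month and the
    i-th previous month; the output is the percent change to the next month.
    """
    prices = np.asarray(prices, dtype=float)
    X, y = [], []
    for t in range(n_lags, len(prices) - 1):
        X.append([(prices[t] - prices[t - i]) / prices[t - i]
                  for i in range(1, n_lags + 1)])
        y.append((prices[t + 1] - prices[t]) / prices[t])
    return np.array(X), np.array(y)
```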
Table 1. Description of test cases with results for synthetic data. Column descriptions: a function – the function used for generating data: $y_1 = \sum_{i=1}^{dim} x_i$, $y_4 = \sum_{i=1}^{dim} x_i^{kerP}$, $y_5 = 0.5 \sum_{i=1}^{dim}$