Regression Based on Support Vector Classification
Marcin Orchel
AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland, marcin@orchel.pl
Abstract. In this article, we propose a novel regression method that is based solely on Support Vector Classification. The experiments show that the new method has comparable or better generalization performance than ε-insensitive Support Vector Regression. The tests were performed on synthetic data, on various publicly available regression data sets, and on stock price data. Furthermore, we demonstrate how a priori knowledge that has already been incorporated into Support Vector Classification for estimating indicator functions can be used directly for a regression problem.
Keywords: Support Vector Machines, a priori knowledge
1 Introduction
One of the main learning problems is regression estimation. Vapnik [6] proposed a regression method called ε-insensitive Support Vector Regression (ε-SVR). It belongs to a group of methods called Support Vector Machines (SVM). For estimating indicator functions, the Support Vector Classification (SVC) method was developed. SVM were developed on the basis of statistical learning theory. They are efficient learning methods partly because of the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used for generating nonlinear solutions.
In this article, we analyze the differences between ε-SVR and SVC. We list some advantages of SVC over ε-SVR: ε-SVR has the additional free parameter ε, and in ε-SVR the minimized term $\|w\|^2$ is responsible for rewarding flat functions, while in SVC the same term has a meaning fully dependent on the training data, since it takes part in finding a maximal margin hyperplane. This is the motivation for the proposed new regression method, which is fully based on SVC. Additionally, the proposed method has an advantage when incorporating a priori knowledge into SVM. Incorporating a priori knowledge into SVM is an important task and has been researched extensively in recent years [2]. In practice, most a priori knowledge is first incorporated into SVC, and additional effort is needed to introduce the same a priori knowledge into ε-SVR. We show by example that a particular type of a priori knowledge already incorporated into SVC can be directly used for a regression problem with the proposed method.
Recently, some attempts have been made to combine SVC with ε-SVR [7]. They differ substantially from the proposed method in that the proposed method is a replacement for ε-SVR.
1.1 Introduction to ε-SVR and SVC
In regression estimation, we consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^r \in \mathbb{R}$, and $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of
$$f(w_r, b_r, \xi_r, \xi_r^*) = \|w_r\|^2 + C_r \sum_{i=1}^{l} \left(\xi_i^r + \xi_i^{r*}\right)$$
with constraints $y_i^r - g(a_i) \le \varepsilon + \xi_i^r$, $g(a_i) - y_i^r \le \varepsilon + \xi_i^{r*}$, $\xi^r \ge 0$, $\xi^{r*} \ge 0$ for $i \in \{1..l\}$, where $g(a_i) = w_r \cdot a_i + b_r$.
The function $g^*(x) = w_r^* \cdot x + b_r^*$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem, and the regression function becomes
$$g^*(x) = \sum_{i=1}^{l} \left(\alpha_i^* - \beta_i^*\right) K(a_i, x) + b_r^*, \qquad (1)$$
where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot,\cdot)$ is a kernel function which is incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid. A kernel function which is a dot product of its variables is called a simple linear kernel. The $i$-th training example is a support vector when $\alpha_i^* - \beta_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall outside the ε-tube, and some of the examples which lie on the ε-tube. The conclusion is that the number of support vectors can be controlled by the tube height ε.
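As an illustration of this last point (not part of the original paper), the following minimal Python sketch fits scikit-learn's SVR, whose solver is based on LibSVM, for several values of ε and reports the resulting number of support vectors; the data set, parameter values, and variable names are assumptions made only for this example.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative 1-D regression data: a noisy sine similar to Fig. 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=100)

# A wider epsilon-tube leaves more points strictly inside the tube,
# so fewer training examples become support vectors.
for eps in (0.01, 0.05, 0.2):
    model = SVR(kernel="rbf", C=10.0, gamma=5.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps:.2f}: {len(model.support_)} support vectors")
```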
For indicator function estimation, we consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^c \in \{-1, 1\}$, and $m$ is the dimension of the problem. The SVC 1-norm soft margin optimization problem is

OP 2. Minimization of
$$f(w_c, b_c, \xi_c) = \|w_c\|^2 + C_c \sum_{i=1}^{l} \xi_i^c$$
with constraints $y_i^c h(a_i) \ge 1 - \xi_i^c$, $\xi^c \ge 0$ for $i \in \{1..l\}$, where $h(a_i) = w_c \cdot a_i + b_c$.
The curve $h^*(x) = w_c^* \cdot x + b_c^* = 0$ is the decision curve of the classification problem. Optimization problem 2 is transformed to an equivalent dual optimization problem, and the decision curve becomes
$$h^*(x) = \sum_{i=1}^{l} y_i^c \alpha_i^* K(a_i, x) + b_c^* = 0, \qquad (2)$$
where $\alpha_i$ are Lagrange multipliers of the dual problem and $K(\cdot,\cdot)$ is a kernel function which is incorporated into the dual problem. Margin boundaries are defined as the two hyperplanes $h(x) = -1$ and $h(x) = 1$; optimal margin boundaries are defined as the two hyperplanes $h^*(x) = -1$ and $h^*(x) = 1$. The $i$-th training example is a support vector when $\alpha_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall below the optimal margin boundaries ($y_i h^*(a_i) < 1$), and some of the examples which lie on the optimal margin boundaries ($y_i h^*(a_i) = 1$).
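Analogously, a short sketch (again an illustration, not taken from the paper) shows that in scikit-learn's SVC every support vector satisfies $y_i h^*(a_i) \le 1$ up to numerical tolerance, while examples strictly above the optimal margin boundaries get zero dual coefficients; the data and parameters are assumed for this example only.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds with labels -1 and +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# y_i * h(a_i) for each training example: all support vectors lie at or
# below 1, and every example strictly below 1 must be a support vector.
margins = y * clf.decision_function(X)
print("support vectors:", len(clf.support_))
print("examples with y*h(x) <= 1 + 1e-6:", int(np.sum(margins <= 1 + 1e-6)))
```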
Comparing the number of free parameters of both methods, ε-SVR has the additional ε parameter. One of the motivations for developing a regression method based on SVC is the flatness property of ε-SVR. The minimized term $\|w_r\|^2$ in ε-SVR is related to the following property of a linear function: for two linear functions $g_1(x) = w_1 \cdot x + b_1$ and $g_2(x) = w_2 \cdot x + b_2$, whenever $\|w_2\| < \|w_1\|$ we say that $g_2(x)$ is flatter than $g_1(x)$. Flatter functions are rewarded by ε-SVR. The flatness property of a linear function is not related to the training examples. This differs from SVC, where the minimized term $\|w_c\|^2$ is related to the training examples, because it is used for finding a maximal margin hyperplane.
2 Regression Based on SVC
We consider a set of training vectors $a_i$ for $i = 1..l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_i^r \in \mathbb{R}$. The proposed regression method (SVCR) is based on the following scheme of finding a regression function:
1. Every training example $a_i$ is duplicated; the output value $y_i^r$ is translated by the value of a parameter ϕ > 0 for the original training example and by −ϕ for the duplicated training example.
2. Every training example is converted to a classification example by incorporating the output as an additional feature and setting class 1 for original training examples and class −1 for duplicated training examples.
3. SVC is run with the classification mappings.
4. The solution of SVC is converted to a regression form.
The above procedure is repeated for different values of ϕ; for a particular ϕ it is depicted in Fig. 1. The best solution among the various values of ϕ is selected based on the mean squared error (MSE) measure. Steps 1 and 2 are sketched in the code below.
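A minimal sketch of steps 1 and 2 (an illustration under the notation above, not the authors' implementation; the helper name make_svcr_data is an assumption introduced here):

```python
import numpy as np

def make_svcr_data(X, y, phi):
    """Convert a regression set (X, y) into the SVCR classification setting.

    Each example is duplicated; the output is shifted by +phi for originals
    and by -phi for duplicates, and the shifted output is appended as the
    (m+1)-th feature. Originals get class +1, duplicates class -1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    up = np.hstack([X, (y + phi)[:, None]])    # originals, class +1
    down = np.hstack([X, (y - phi)[:, None]])  # duplicates, class -1
    X_cls = np.vstack([up, down])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    return X_cls, y_cls
```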
The result of the first step is a set of training mappings for $i \in \{1, \dots, 2l\}$ and $\varphi > 0$:
$$b_i = (a_{i,1}, \dots, a_{i,m}) \to y_i + \varphi \quad \text{for } i \in \{1, \dots, l\},$$
$$b_i = (a_{i-l,1}, \dots, a_{i-l,m}) \to y_{i-l} - \varphi \quad \text{for } i \in \{l+1, \dots, 2l\}.$$
Fig. 1. In the left figure, an example of regression data for the function y = 0.5 sin 10x + 0.5 with Gaussian noise; in the right figure, the regression data translated to classification data. Translated original examples are marked with '+', translated duplicated examples with 'x'. (Both panels span [0, 1] on both axes.)
The parameter $\varphi$ is called the translation parameter. The result of the second step is a set of training mappings for $i \in \{1, \dots, 2l\}$ and $\varphi > 0$:
$$c_i = (b_{i,1}, \dots, b_{i,m}, y_i + \varphi) \to 1 \quad \text{for } i \in \{1, \dots, l\},$$
$$c_i = (b_{i,1}, \dots, b_{i,m}, y_{i-l} - \varphi) \to -1 \quad \text{for } i \in \{l+1, \dots, 2l\}.$$
The dimension of the $c_i$ vectors is $m + 1$. The set of $a_i$ mappings is called the regression data setting; the set of $c_i$ mappings is called the classification data setting. In the third step, we solve OP 2 with the $c_i$ examples. Note that $h^*(x)$ is in implicit form with respect to the last coordinate of $x$. In the fourth step, we have to find an explicit form for the last coordinate; the explicit form is needed, for example, for testing new examples. For a simple linear kernel, the $w_c$ variable of the primal problem is found from the solution of the dual problem as
$$w_c = \sum_{i=1}^{2l} y_i^c \alpha_i c_i .$$
For a simple linear kernel, the explicit form of (2) is
$$x_{m+1} = -\frac{\sum_{j=1}^{m} w_c^j x_j + b_c}{w_c^{m+1}}.$$
The regression solution is $g^*(x) = w_r \cdot x + b_r$, where $w_r^i = -w_c^i / w_c^{m+1}$ for $i = 1..m$ and $b_r = -b_c / w_c^{m+1}$.
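The full linear-kernel pipeline can be sketched as follows (an illustration of the formulas above, not the authors' code; scikit-learn's linear-kernel SVC exposes $w_c$ and $b_c$ through coef_ and intercept_, and the function name is an assumption):

```python
import numpy as np
from sklearn.svm import SVC

def fit_svcr_linear(X, y, phi, C=1.0):
    """SVCR with a simple linear kernel: returns (w_r, b_r) of g*(x) = w_r . x + b_r."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Steps 1-2: duplicate, shift outputs by +/- phi, append the output as a feature.
    X_cls = np.vstack([np.hstack([X, (y + phi)[:, None]]),
                       np.hstack([X, (y - phi)[:, None]])])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    # Step 3: run SVC on the classification setting.
    clf = SVC(kernel="linear", C=C).fit(X_cls, y_cls)
    w_c, b_c = clf.coef_[0], clf.intercept_[0]
    # Step 4: solve w_c . (x, x_{m+1}) + b_c = 0 for the last coordinate.
    return -w_c[:-1] / w_c[-1], -b_c / w_c[-1]

# Predictions on new inputs X_new are then X_new @ w_r + b_r.
```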
For nonlinear kernels, the conversion to an explicit form has some limitations. First, the decision curve can have more than one value of the last coordinate for specific values of the remaining coordinates of $x$, and therefore cannot be converted unambiguously to a function (e.g., a polynomial kernel of degree 2). Second, even when the conversion to a function is possible, there may be no explicit analytical formula (e.g., a polynomial kernel of degree greater than 4), or it may not be easy to find, and hence a special method for finding an explicit formula for the coordinate must be used, e.g., a bisection method. The disadvantage of this solution is a longer testing time for new examples. To overcome these problems, we propose a new kernel type in which the last coordinate appears only inside a linear term. The new kernel is constructed from an original kernel by removing the last coordinate and adding a linear term with the last coordinate. For the most popular kernels, polynomial, radial basis function (RBF), and sigmoid, the conversions are, respectively,
$$(x \cdot y)^d \to \left(\sum_{i=1}^{m} x_i y_i\right)^d + x_{m+1} y_{m+1}, \qquad (3)$$
$$\exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \to \exp\left(-\frac{\sum_{i=1}^{m} (x_i - y_i)^2}{2\sigma^2}\right) + x_{m+1} y_{m+1}, \qquad (4)$$
$$\tanh(x \cdot y) \to \tanh\left(\sum_{i=1}^{m} x_i y_i\right) + x_{m+1} y_{m+1}. \qquad (5)$$
The proposed method of constructing new kernels always generates a function fulfilling Mercer’s condition, because it generates a function which is a sum of two kernels. For the new kernel type, the explicit form of (2) is
$$x_{m+1} = -\frac{\sum_{i=1}^{2l} y_i^c \alpha_i K_r(c_i^r, x_r) + b_c}{\sum_{i=1}^{2l} y_i^c \alpha_i c_i^{m+1}},$$
where $c_i^r = (c_i^1, \dots, c_i^m)$, $x_r = (x^1, \dots, x^m)$, and $K_r$ denotes the original kernel acting on the first $m$ coordinates.
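As an illustration of the kernel construction (4), the callable below computes the modified RBF kernel, i.e. the original RBF acting on the first $m$ coordinates plus a linear term on the last coordinate, and can be passed directly to scikit-learn's SVC, which accepts callable kernels; the function name and the value of σ are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def svcr_rbf_kernel(sigma=0.5):
    """K(x, y) = exp(-||x_r - y_r||^2 / (2 sigma^2)) + x_{m+1} * y_{m+1},
    where x_r, y_r drop the last coordinate (cf. Eq. (4))."""
    gamma = 1.0 / (2.0 * sigma ** 2)

    def kernel(X, Y):
        rbf_part = rbf_kernel(X[:, :-1], Y[:, :-1], gamma=gamma)
        linear_part = np.outer(X[:, -1], Y[:, -1])
        return rbf_part + linear_part

    return kernel

# Step 3 of SVCR with the modified kernel (C and sigma are illustrative):
clf = SVC(C=10.0, kernel=svcr_rbf_kernel(sigma=0.5))
```

After fitting, the explicit form above can be evaluated using the dual coefficients $y_i^c \alpha_i$ (dual_coef_), the support vectors (support_vectors_), and $b_c$ (intercept_).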
2.1 Support Vectors
SVCR runs the SVC method on the duplicated set of examples, and therefore the maximal number of SVC support vectors is 2l. The SVCR algorithm is constructed in such a way that, while searching for the best value of ϕ, the cases for which the number of SVC support vectors is bigger than l are omitted. We prove that in this case the set of SVC support vectors does not contain any two training examples where one is a duplicate of the other; therefore the set of SVC support vectors corresponds to a subset of the $a_i$ training examples.
Let us call a vector lying on the margin boundaries, or below them, an essential margin vector, and denote the set of such vectors by EMV.
Theorem 1. If the $a_i$ examples are not collinear and $|EMV| \le l$, then EMV does not contain duplicates.

Proof (sketch). Assume that EMV contains a duplicate $a_t'$ of the example $a_t$. Let $p(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t$; then the set of EMV examples for which $p(\cdot) \ge 0$ has $r \ge 1$ elements. Let $p'(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t'$; then the set of EMV examples for which $p'(\cdot) \le 0$ has at least $l - r + 1$ elements, and so $|EMV| \ge l + 1$, which contradicts the assumption. ⊓⊔
For the nonlinear case, the same theorem applies in the kernel-induced feature space. It can be proved that the set of support vectors is a subset of the EMV, and therefore the theorem also applies to the set of support vectors. Experiments show that it is rare that, for every value of ϕ checked by SVCR, the set of support vectors has more than l elements. In such a situation, the best solution among those violating the constraint is chosen.
Here we consider how changes in the value of ϕ influence the number of support vectors. First, we can see that for ϕ = 0, $l \le |EMV| \le 2l$. When, for a particular value of ϕ, both classes are separable, then $0 \le |EMV| \le 2l$. By a configuration of essential margin vectors we mean a list of essential margin vectors, each with its distance to one of the margin boundaries.
Theorem 2. For two values $\varphi_1 > 0$ and $\varphi_2 > 0$ with $\varphi_1 > \varphi_2$, for every pair of margin boundaries for $\varphi_2$ there exists a pair of margin boundaries for $\varphi_1$ with the same configuration of essential margin vectors.

Proof (sketch). Consider the EMV for $\varphi_2$ with particular margin boundaries. When increasing the value of ϕ by $\varphi_1 - \varphi_2$, in order to preserve the same configuration of essential margin vectors we extend the margin-bounded region by $\varphi_1 - \varphi_2$ on both sides. ⊓⊔
When increasing the value of ϕ, new sets of essential margin vectors arise, and all sets present for lower values of ϕ remain. When both classes become separable by a hyperplane, further increasing ϕ does not change the collection of sets of essential margin vectors. This suggests that increasing ϕ leads to solutions with fewer support vectors.
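This effect can be observed with a small experiment (an illustration only; the data, parameters, and classification-setting construction follow the sketch in Sect. 2):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(60, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=60)

for phi in (0.05, 0.1, 0.2, 0.4):
    # Classification setting of Sect. 2: duplicated examples shifted by +/- phi.
    X_cls = np.vstack([np.hstack([X, (y + phi)[:, None]]),
                       np.hstack([X, (y - phi)[:, None]])])
    y_cls = np.hstack([np.ones(len(X)), -np.ones(len(X))])
    clf = SVC(kernel="linear", C=10.0).fit(X_cls, y_cls)
    print(f"phi={phi:.2f}: {len(clf.support_)} support vectors out of {len(X_cls)}")
```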
2.2 Comparison with ε-SVR
Both methods have the same number of free parameters: for ε-SVR, C, kernel parameters, and ε; for SVCR, C, kernel parameters, and ϕ. When using a particular kernel function for ε-SVR and the related kernel function for SVCR, both methods have the same hypothesis space. Both parameters ε and ϕ control the number of support vectors. There is a slight difference between the two methods when we compare configurations of essential margin vectors. For ε-SVR, we define the margin boundaries as the lower and upper ε-tube boundaries. Among various values of ε, every configuration of essential margin vectors is unique. In SVCR, based on Thm. 2, configurations of essential margin vectors are repeated while the value of ϕ increases. This suggests that for particular values of ϕ and ε, the set of configurations of essential margin vectors is richer for SVCR than for ε-SVR.
3 Experiments
First, we compare the performance of SVCR and ε-SVR on synthetic data and on publicly available regression data sets. Second, we show that, by using SVCR, a priori knowledge in the form of detractors, introduced in [5] for classification problems, can be applied to regression problems. For the first part, we use the LibSVM [1] implementation of ε-SVR, and we use LibSVM for solving the SVC problems inside SVCR; we use a version ported to Java. For the second part, we use the author's implementation of SVC with detractors.
For all data sets, every feature is scaled linearly to [0, 1], including the output. For variable parameters such as C, σ for the RBF kernel, ϕ for SVCR, and ε for ε-SVR, we use a grid search to find the best values. The number of values searched by the grid method is a trade-off between accuracy and simulation speed. Note that for particular data sets it is possible to use finer grid searches than for massive tests with a large number of simulations. Preliminary tests confirm that as ϕ is increased, the number of support vectors decreases.
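The grid search over (C, ϕ) can be organized as in the following sketch (an illustrative pattern rather than the authors' experimental code; the grids, the hold-out split, and the fit_svcr_linear helper sketched in Sect. 2 are assumptions):

```python
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

def grid_search_svcr(X, y):
    # Hold out a validation split and keep the (C, phi) pair with the lowest MSE.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
    grid = ParameterGrid({"C": [0.1, 1.0, 10.0, 100.0],
                          "phi": [0.05, 0.1, 0.2, 0.4]})
    best = None
    for params in grid:
        w_r, b_r = fit_svcr_linear(X_tr, y_tr, params["phi"], C=params["C"])
        mse = mean_squared_error(y_val, X_val @ w_r + b_r)
        if best is None or mse < best[0]:
            best = (mse, params)
    return best
```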
3.1 Synthetic Data Tests
We compare the SVCR and ε-SVR methods on data generated from particular functions, with Gaussian noise added to the output values. We perform tests with a linear kernel on linear functions, with a polynomial kernel on a polynomial function, and with the RBF kernel on the sine function. The tests and results are presented in Table 1. We generally notice slightly worse training performance for SVCR; the reason is that ε-SVR directly minimizes the MSE. We notice fairly good generalization performance for SVCR, which is slightly better than for ε-SVR. We notice a smaller number of support vectors for the SVCR method for linear kernels. For the RBF kernel, SVCR is slightly worse.
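For reference, the sine test case can be generated as below (the sample size, noise level, and use of MinMaxScaler are assumptions consistent with Fig. 1 and the [0, 1] scaling described above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 0.5 * np.sin(10.0 * X[:, 0]) + 0.5 + rng.normal(0.0, 0.05, size=200)

# Scale every feature and the output linearly to [0, 1].
X = MinMaxScaler().fit_transform(X)
y = MinMaxScaler().fit_transform(y[:, None])[:, 0]
```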
3.2 Real World Data Sets
The real-world data sets were taken from the LibSVM site [1][3], except the stock price data. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the training data as follows: for every month, the output value is the growth/fall relative to the next month, and feature i is the percent price change between the given month and the i-th previous month (sketched below). In every simulation, training data are randomly chosen and the remaining examples become test data. The tests and results are presented in Table 2. For linear kernels, the tests show better generalization performance of the SVCR method; the performance gain on testing data ranges from 0–2%. For the polynomial kernel, we notice better generalization performance of SVCR (performance gain from 68–80%). The number of support vectors is comparable for both methods. For the RBF kernel, the results strongly depend on the data: for two test cases SVCR has better generalization performance (10%).
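The stock-data preprocessing might look as follows (a sketch of the description above; the price series, the number of lag features, and the function name are assumptions):

```python
import numpy as np

def make_djia_dataset(prices, n_lags=6):
    """Build (X, y) from a series of monthly closing prices.

    Feature i is the percent price change between the current month and the
    i-th previous month; the output is the percent change to the next month.
    """
    prices = np.asarray(prices, dtype=float)
    X, y = [], []
    for t in range(n_lags, len(prices) - 1):
        X.append([(prices[t] - prices[t - i]) / prices[t - i]
                  for i in range(1, n_lags + 1)])
        y.append((prices[t + 1] - prices[t]) / prices[t])
    return np.array(X), np.array(y)
```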
Table 1. Description of test cases with results for synthetic data. Column descriptions: a function – the function used for generating data: $y_1 = \sum_{i=1}^{dim} x_i$, $y_4 = \sum_{i=1}^{dim} x_i^{kerP}$, $y_5 = 0.5 \sum_{i=1}^{dim}$