Problem with a Priori Knowledge in the Form of Detractors for Creating Reduced Models
Marcin Orchel
AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland, marcin@orchel.pl
Summary. In this article, we propose applying a recently proposed technique for reducing Support Vector Classification models to regression problems. The method creates reduced models by removing support vectors and relies on a general formulation of Support Vector Classification with a priori knowledge in the form of detractors. We apply it to regression problems through a reformulation of ε-insensitive Support Vector Regression as a special case of this general formulation. The experiments show that the reduction technique can be successfully applied to regression problems. The tests were performed on various regression data sets and on stock price data from public domain repositories.
Key words: Support Vector Machines, a priori knowledge
1 Introduction
One of the main learning problems is regression estimation. Vapnik [9] proposed a regression method called ε-insensitive Support Vector Regression (ε-SVR). It belongs to a group of methods called Support Vector Machines (SVM). For estimating indicator functions, the Support Vector Classification (SVC) method was developed. SVM were invented on the basis of statistical learning theory. They are efficient learning methods partly because of the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used for generating nonlinear solutions. Recently, a general formulation of SVC with a priori knowledge in the form of detractors was proposed [5][6]. It has already been used for incorporating sample a priori knowledge, for manipulating the result curve, and for creating reduced models.
In this article, we use a reformulation of ε-SVR as a special case of the generalized SVC. A similar reformulation has already been implemented in LibSVM [1] for solving ordinal classification problems (without a priori knowledge) and ε-SVR. The consequence of this reformulation is that we can carry over all applications of the generalized SVC to ε-SVR. These are manipulating the regression function, proposed as manipulation of the decision curve for classification problems in [5], and creating improved reduced models by removing support vectors, proposed for classification problems in [6].
The former was also investigated for the recently proposed alternative SVR method [7]. Various methods for reducing the complexity of the output model have been widely investigated [3]. In particular, reduction by removing support vectors was also analyzed in [2] for regression problems. In this article, we create improved reduced models by removing support vectors and using sample a priori knowledge in the reduced models. The knowledge comes from the solution of the original problem obtained before the reduction. It is also possible to extract the knowledge from any source in the form of an analytical function; in particular, it could be the solution of an alternative regression method.
1.1 Introduction to ε-SVR
In regression estimation, we consider a set of training examples $x_i$ for $i = 1, \ldots, l$, where $x_i = \left(x_i^1, \ldots, x_i^m\right)$. The $i$-th training example is mapped to $y_i^r \in \mathbb{R}$. The $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of
$$f\left(w_r, b_r, \xi_r, \xi_r^*\right) = \left\|w_r\right\|^2 + C_r \sum_{i=1}^{l} \left(\xi_i^r + \xi_i^{r*}\right)$$
with constraints $y_i^r - g\left(x_i\right) \le \varepsilon + \xi_i^r$, $g\left(x_i\right) - y_i^r \le \varepsilon + \xi_i^{r*}$, $\xi_r \ge 0$, $\xi_r^* \ge 0$ for $i \in \{1..l\}$, where $g\left(x_i\right) = w_r \cdot x_i + b_r$.
The function $g^*(x) = w_r^* \cdot x + b_r^*$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem. The regression function becomes
$$g^*(x) = \sum_{i=1}^{l} \left(\alpha_i^* - \beta_i^*\right) K\left(x_i, x\right) + b_r^*, \qquad (1)$$
where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot,\cdot)$ is a kernel function, which is incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
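The kernel expansion (1) can be checked numerically. The sketch below is an assumption on tooling: the paper uses the author's own solver, while here scikit-learn's `SVR` (with an explicitly chosen RBF width `gamma=0.5` and synthetic data) stands in for it; `dual_coef_` stores the differences $\alpha_i^* - \beta_i^*$ for the support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 1))          # training inputs x_i
y = np.sin(2.0 * np.pi * X[:, 0])                # target outputs y_i

model = SVR(kernel="rbf", gamma=0.5, C=10.0, epsilon=0.05).fit(X, y)

def g_star(x):
    # Hand-built kernel expansion over support vectors, eq. (1):
    # g*(x) = sum_i (alpha*_i - beta*_i) K(x_i, x) + b*_r
    k = np.exp(-0.5 * np.sum((model.support_vectors_ - x) ** 2, axis=1))
    return float(model.dual_coef_[0] @ k + model.intercept_[0])

x_new = np.array([0.3])
assert abs(g_star(x_new) - model.predict(x_new.reshape(1, -1))[0]) < 1e-6
```

The agreement between `g_star` and `model.predict` confirms that the trained model is exactly the sparse expansion (1) over the support vectors.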
The $i$-th training example is a support vector when $\alpha_i^* - \beta_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall outside the $\varepsilon$-tube and some of the examples which lie on the $\varepsilon$-tube. The conclusion is that the number of support vectors can be controlled by the tube height (the $\varepsilon$).
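The dependence of model sparsity on the tube height can be illustrated directly. This is a hedged sketch on synthetic data (a noisy linear trend, an assumption of ours): widening the tube leaves fewer residuals outside it, so the support vector count drops.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(80, 1))
y = X[:, 0] + rng.normal(0.0, 0.1, size=80)      # linear trend + noise (std 0.1)

# Count support vectors for a narrow, medium, and wide epsilon-tube.
n_sv = {eps: len(SVR(kernel="linear", C=1.0, epsilon=eps).fit(X, y).support_)
        for eps in (0.01, 0.1, 0.5)}

# A tube much wider than the noise level needs far fewer support vectors.
assert n_sv[0.01] > n_sv[0.5]
```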
1.2 Introduction to Generalized SVC
A 1-norm soft margin SVC optimization problem for $x_i$ with example weights $C_i$ is

OP 2. Minimization of
$$f\left(w_c, b_c, \xi_c\right) = \frac{1}{2}\left\|w_c\right\|^2 + C_c \cdot \xi_c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c$, $\xi_c \ge 0$ for $i \in \{1..l\}$, where $C_c > 0$, $h\left(x_i\right) = w_c \cdot x_i + b_c$.
A generalized SVC has additional example weights $\varphi$ in the constraints:

OP 3. Minimization of
$$f\left(w_c, b_c, \xi_c\right) = \frac{1}{2}\left\|w_c\right\|^2 + C_c \cdot \xi_c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c + \varphi_i$, $\xi_c \ge 0$ for $i \in \{1..l\}$, where $C_c > 0$, $\varphi_i \in \mathbb{R}$, $h\left(x_i\right) = w_c \cdot x_i + b_c$.

The weights $\varphi$ are only present in the constraints. When $\varphi = 0$, OP 3 is equivalent to OP 2. The functional margin for a point $p$ is defined as the value $y_p h(p)$. A value $v$ in functional margin units is equal to $v / \left\|w_c\right\|$. For an example with $\varphi_i > 0$, called a detractor, $1 + \varphi_i$ can be interpreted as a lower bound on the distance from the example to the decision boundary, measured in functional margin units. In this article, we extend the possible values of $\varphi_i$ to negative ones. For $1 + \varphi_i < 0$, an incorrectly classified example, and slack variables omitted for simplicity, we get
$$-y_i^c h^*\left(x_i\right) / \left\|w_c^*\right\| \le \left(-1 - \varphi_i\right) / \left\|w_c^*\right\|,$$
therefore $\left|1 + \varphi_i\right|$ is in this case an upper bound on the distance from the example to the decision boundary, measured in functional margin units, and such an example is alternatively called an attractor.
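Eliminating the slack variables turns OP 3 into an unconstrained hinge-loss problem, $\min_{w,b}\; \frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 + \varphi_i - y_i (w \cdot x_i + b))$, which a plain subgradient descent can illustrate. This is a didactic sketch, not the authors' solver (the solver, data, step size, and iteration count are all assumptions); it shows how a detractor $\varphi_i > 0$ forces a larger functional margin for that example.

```python
import numpy as np

def generalized_svc(X, y, phi, C=10.0, lr=0.01, iters=5000):
    """Subgradient descent on the hinge-loss form of OP 3."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(iters):
        margins = y * (X @ w + b) - 1.0 - phi     # >= 0 means constraint met
        active = margins < 0.0                    # violated constraints
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w; b -= lr * grad_b
    return w, b

X = np.array([[-1.0, -1.0], [-1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# phi = 0: OP 3 reduces to OP 2; separable data are classified correctly.
w0, b0 = generalized_svc(X, y, phi=np.zeros(4))
assert np.all(np.sign(X @ w0 + b0) == y)

# A detractor on example 2 demands a functional margin of at least 1 + phi_2 = 3,
# so its margin grows relative to the phi = 0 solution.
w1, b1 = generalized_svc(X, y, phi=np.array([0.0, 0.0, 2.0, 0.0]))
assert y[2] * (X[2] @ w1 + b1) > y[2] * (X[2] @ w0 + b0)
```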
2 Reformulation of ε-SVR
Here we present a reformulation of OP 1. Every regression training example is duplicated, Fig. 1. Every original training example gets class 1, and the duplicated training example gets class −1; therefore we get

OP 4. Minimization of
$$f\left(w_c, b_c, \xi_c\right) = \frac{1}{2}\left\|w_c\right\|^2 + C_c \sum_{i=1}^{2l} \xi_i^c$$
with constraints $y_i^c h\left(x_i\right) \ge 1 - \xi_i^c + \varphi_i$, $\xi_c \ge 0$ for $i \in \{1..2l\}$, where $h\left(x_i\right) = w_c \cdot x_i + b_c$, $\varphi_i = y_i^c y_i^r - \varepsilon - 1$.
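The choice $\varphi_i = y_i^c y_i^r - \varepsilon - 1$ makes OP 4's constraint collapse back to the two $\varepsilon$-tube constraints of OP 1: with $\xi = 0$, $y_i^c h(x_i) \ge 1 + \varphi_i$ becomes $h(x_i) \ge y_i^r - \varepsilon$ for the class $+1$ copy and $h(x_i) \le y_i^r + \varepsilon$ for the class $-1$ copy. A numerical check of this equivalence (random data and an arbitrary candidate affine model, both our own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
x = rng.uniform(size=(5, 2)); y_r = rng.uniform(size=5)

# Duplicated classification data set of size 2l: original copies get class +1,
# duplicates get class -1; phi_i = y_i^c * y_i^r - eps - 1 as in OP 4.
X_c = np.vstack([x, x])
y_c = np.concatenate([np.ones(5), -np.ones(5)])
y_r2 = np.concatenate([y_r, y_r])
phi = y_c * y_r2 - eps - 1.0

# Any candidate affine model h(x) = w.x + b, slack set to 0 for the check.
w, b = np.array([0.3, -0.2]), 0.5
h = X_c @ w + b

op4_ok = y_c * h >= 1.0 + phi                     # OP 4 constraint, xi = 0
# Equivalent OP 1 form: +1 copies demand h >= y_r - eps,
#                       -1 copies demand h <= y_r + eps.
tube_ok = np.where(y_c > 0, h >= y_r2 - eps, h <= y_r2 + eps)
assert np.array_equal(op4_ok, tube_ok)
```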
The difference from OP 1 is that we added a constant 1/2. OP 4 is a special case of OP 3. Instead of using the decision curve of OP 4 we use the regression function
$$\sum_{i=1}^{2l} y_i^c \alpha_i^* K\left(x_i, x\right) + b_c^* = 0 \;\rightarrow\; g^*(x) = \sum_{i=1}^{2l} y_i^c \alpha_i^* K\left(x_i, x\right) + b_c^*.$$
In a typical scenario $\varphi_i < 0$, because $\varepsilon$ is close to 0 and $y_i^r$ is less than 1; therefore the interpretation of $\varphi_i$ is similar to the interpretation of the slack variables, with the difference that the sum of slack variables is additionally minimized. For an incorrectly classified $i$-th training example with $\varphi_i < 0$, $\left|\varphi_i\right|$ can be interpreted as the radius of a hypersphere in functional margin units, centered at the $i$-th point, that must intersect the margin boundary $y_i h(x) = 1$, Fig. 1. Note that for $\varphi_i > 0$ the interpretation was described in [6]: $\varphi_i$ can be interpreted as the radius of a hypersphere that must not intersect the margin boundary $y_i h(x) = 1$ (in more than one point). Detractors lead to a new type of support vectors that can lie outside the margin [6]. In turn, attractors presented in this article can lead to examples which lie on the wrong side of the margin (in particular, which are incorrectly classified) but do not belong to the support vectors. Moreover, attractors can lead to a new type of support vectors which have slack variables equal to zero and do not lie on the margin. We can notice the following property of OP 4: because every training example is duplicated, for every possible solution $l$ training examples will always be incorrectly classified, except those lying on the classification decision boundary, Fig. 1. An efficient solution of OP 4 based on Sequential Minimal Optimization [8] is the same as described in [5][6].
3 Reduce a Model by Removing Support Vectors
We use a method of removing support vectors to decrease the ε-SVR model complexity. Reduced models are more suitable for further processing, e.g. they decrease the time of testing new examples. However, reduced models have the disadvantage that generalization performance can be worse than for the original full models. One possibility for improving the performance is to incorporate a priori knowledge into the reduced models. Reduced models based on removing support vectors were recently proposed for SVC [6] and for ε-SVR [2]. In this article, we propose creating reduced models using a priori knowledge in the form of detractors for regression problems. This method was already proposed in [6], but for classification problems. Because we have already reformulated ε-SVR as a classification problem with a priori knowledge in the form of detractors, creating a reduced model is similar to that described in [6]. First, we solve OP 4, and then detractors are automatically generated from the solution by setting
$$\varphi_i = y_i h^*\left(x_i\right) - 1 \qquad (2)$$
Fig. 1. In the left figure, regression data ('+' points) transformed to classification data ('x' points) are depicted. In the right figure, the interpretation of an attractor as a hypersphere is depicted. The attractor is located at (2, 1).
for all training examples. Note that all training examples already have the parameter $\varphi_i$ set; the method adjusts these values according to the found solution.
A reduced model is generated by removing a batch of randomly selected support vectors, with a maximal removal ratio of p% of all training vectors, where p is a configurable parameter. Finally, we run the reformulated ε-SVR method with the incorporated a priori knowledge on the reduced data. The reduced solution is used for testing new examples.
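The reduction loop can be sketched as follows. The sketch is illustrative, not the authors' code: a scikit-learn linear `SVC` on synthetic separable data stands in for the solver, and since scikit-learn has no detractor parameter, the final retraining step is indicated only by passing the $\varphi$ values along with the reduced data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable labels

# Step 1: solve the original (full) problem.
full = SVC(kernel="linear", C=1.0).fit(X, y)

# Step 2: derive detractors from the solution, eq. (2): phi_i = y_i h*(x_i) - 1.
phi = y * full.decision_function(X) - 1.0

# Step 3: drop a random batch of support vectors, at most p% of all vectors.
p = 70
sv = full.support_.copy()
rng.shuffle(sv)
drop = sv[: min(len(sv), int(p / 100 * len(X)))]
keep = np.setdiff1d(np.arange(len(X)), drop)

# Step 4: the generalized SVC solver would be retrained on (X_red, y_red)
# with detractors phi_red; only the data reduction itself is shown here.
X_red, y_red, phi_red = X[keep], y[keep], phi[keep]
assert len(X_red) >= len(X) - int(p / 100 * len(X))
```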
4 Experiments
In the first experiment, we show that reduced models with knowledge in the form of detractors have better generalization performance than reduced models without that knowledge, for various p. The first method does not use knowledge in the form of detractors in reduced models; the second one uses it. In the second experiment, we show that reduced models have a much better time of testing new examples, which depends mainly on the number of support vectors. Note that, for the purpose of a fair comparison, the training data of a reduced model are the same for both methods. We use the author's implementation of the reformulated ε-SVR for both methods.
For all data sets, every feature is scaled linearly to [0, 1], including the output. For ε, we use a grid search method to find the best value.
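This preprocessing can be sketched with standard tools. The concrete grid of ε values, the data, and the use of scikit-learn's `MinMaxScaler` and `GridSearchCV` are all assumptions for illustration; the paper does not specify its grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X_raw = rng.uniform(-5.0, 5.0, size=(60, 2))
y_raw = X_raw.sum(axis=1) + rng.normal(0.0, 0.3, size=60)

# Scale every feature, and the output, linearly to [0, 1].
X = MinMaxScaler().fit_transform(X_raw)
y = MinMaxScaler().fit_transform(y_raw.reshape(-1, 1)).ravel()
assert X.min() == 0.0 and X.max() == 1.0

# Grid search over epsilon (hypothetical grid values).
search = GridSearchCV(SVR(kernel="linear", C=1.0),
                      {"epsilon": [0.01, 0.05, 0.1, 0.2]}, cv=3).fit(X, y)
best_eps = search.best_params_["epsilon"]
```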
4.1 Comparing Generalization Performance of Reduced Model

The synthetic samples were generated from particular functions with Gaussian noise added to the output values. The real-world data sets were taken from the LibSVM site [1][4], except the stock price data; they originally come from the UCI Machine Learning Repository and the StatLib Datasets Archive. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the sample data set as follows: for every month, the output value is the growth/fall compared to the next month. Every feature i is the percent price change between the month and the i-th previous month. In every simulation, the training data are randomly chosen, and the remaining examples become the test data.
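The stock data construction described above can be sketched as follows (the function and variable names are illustrative, not taken from the paper): for each month t, the label records whether the next month's price rises, and feature i is the percent change between month t and the i-th previous month.

```python
def make_stock_rows(prices, dim):
    """Build (features, label) rows from a monthly price series."""
    rows = []
    for t in range(dim, len(prices) - 1):
        # Feature i: percent change between month t and the i-th previous month.
        feats = [(prices[t] - prices[t - i]) / prices[t - i] * 100.0
                 for i in range(1, dim + 1)]
        # Label: growth (+1) or fall (-1) compared to the next month.
        label = 1.0 if prices[t + 1] > prices[t] else -1.0
        rows.append((feats, label))
    return rows

rows = make_stock_rows([100.0, 110.0, 121.0, 115.0], dim=1)
# Month 1: +10% vs the previous month, next month rises  -> label +1.
# Month 2: +10% vs the previous month, next month falls  -> label -1.
assert rows[0][1] == 1.0 and rows[1][1] == -1.0
```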
For p = 70, C = 0.1, the method with knowledge in the form of detractors has better performance in all tests, with a similar number of support vectors (columns im1 and im2), Table 1. The testing performance improvement varies from 0% to 68%. For variable p the proposed method is also superior; example results for the cadata1 test are depicted in Fig. 2.
4.2 Comparing Testing Speed Performance of Reduced Model

The testing speed on new examples depends mainly on the number of support vectors. With fewer support vectors we achieve a testing time reduction. We can see testing speed improvements (columns t1 and t2) for all tests in Table 1.
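The reason testing speed scales with the number of support vectors is visible from the expansion (1): predicting one example requires one kernel evaluation per support vector. A minimal sketch (toy support vectors and coefficients, an RBF kernel with width 1, all our own assumptions) that counts kernel evaluations explicitly:

```python
import numpy as np

def predict(x, sv, coef, b, counter):
    """Evaluate the kernel expansion (1); counter[0] tallies kernel calls."""
    total = b
    for v, c in zip(sv, coef):
        counter[0] += 1                              # one kernel evaluation per SV
        total += c * np.exp(-np.sum((v - x) ** 2))   # RBF kernel, gamma = 1
    return total

x = np.array([0.5, 0.5])
full_sv = [np.array([i / 10.0, 0.0]) for i in range(10)]
red_sv = full_sv[:3]                                 # reduced model: 3 of 10 SVs

c_full, c_red = [0], [0]
predict(x, full_sv, [1.0] * 10, 0.0, c_full)
predict(x, red_sv, [1.0] * 3, 0.0, c_red)
assert c_full[0] == 10 and c_red[0] == 3             # cost is linear in #SVs
```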
Fig. 2. A comparison of two methods of removing support vectors for the cadata1 test case from Table 1 for variable p. On the x axis there is the p parameter as a percentage; on the y axis there is the percent difference in misclassified training examples (left figure) and misclassified testing examples (right figure) between the original method without removing support vectors and the method with the removing procedure applied. The line with '+' points represents the random removing method, while the line with 'x' points represents the proposed removing method with knowledge in the form of detractors.
Table 1. Comparison of the time of testing new examples for an original model and a reduced model with detractors. Column descriptions: name – the name of the test, function – the function used for generating data: $y_1 = \sum_{i=1}^{\dim} x_i$, $y_2 = \sum_{i=1}^{\dim} x_i \cdot \mathrm{kerP}$, $y_3 = 0.5 \sum_{i=1}^{\dim}$