
Balanced Support Vector Regression

Marcin Orchel

Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland, marcin@orchel.pl

Abstract. We propose a novel idea of regression – balancing the distances from a regression function to all examples. We created a method, called balanced support vector regression (balanced SVR), in which we incorporated this idea into support vector regression (SVR) by adding an equality constraint to the SVR optimization problem. We implemented our method for two versions of SVR: ε-insensitive support vector regression (ε-SVR) and δ support vector regression (δ-SVR). We performed preliminary tests comparing the proposed method with SVR on real world data sets and achieved improved generalization performance for suboptimal values of ε and δ, with similar overall generalization performance.

Keywords: Support vector machines, Regression, Prior knowledge

The most popular problems in machine learning are classification and regression. For classification, the goal is to separate data of different classes; for regression, the goal is to fit a function to the data. Support vector machines (SVM) are popular machine learning methods for solving classification and regression problems, invented by Vapnik; the standard soft margin version of SVM for solving classification problems was proposed in [1]. The SVM method for solving regression problems was proposed in [2]. It is the ε-SVR version, which minimizes the distances from a regression function to examples lying outside the ε bands. Recently, an alternative SVR method was proposed [11], called δ-SVR, which transforms the problem of regression to classification and solves the classification problem by using standard SVM. In the same paper, the authors find that δ-SVR has better generalization performance related to δ than ε-SVR with regard to ε, which means that δ-SVR is able to achieve better generalization performance for suboptimal values of δ than ε-SVR for suboptimal values of ε. Such a characteristic could potentially be useful to speed up training by using suboptimal values of ε and δ.

We propose a novel idea of balancing the distances between all examples and a regression function – the sum of distances below the regression function should be equal to the sum of distances above the regression function. The main motivation for this constraint is that it is already fulfilled for the least squares (LS) regression. So the goal is to incorporate balancing into the SVM methods by adding one constraint to the SVM optimization problems.

We use the following notation: the subscript r stands for regression, while the subscript c stands for classification, for example y_i^r; the subscript 'red' near a vector stands for 'reduced', and it denotes a vector without the last coefficient of the original vector.

The outline of the paper is as follows. First, we show the idea of balanced regression. Then we present the new optimization problems for balanced SVR. Finally, we show the experiments and results.

1 Balanced Regression

The standard regression in the probability setting is defined for a joint distribution X, Y based on the conditional mean function

r(X) = E(Y | X) ,   (1)

where X is a multivariate random variable and Y is a random variable. For a set of events e_i for X, for i = 1 . . . n, we can define random variables Y_i − r(e_i), where Y_i = (Y | e_i). After summing them, we can compute the expected value

E( \sum_{i=1}^{n} Y_i − r(e_i) ) = \sum_{i=1}^{n} E(Y_i − r(e_i)) = \sum_{i=1}^{n} E Y_i − r(e_i) = 0 .   (2)

Equation (2) is a necessary condition that must be met by the regression function.

The idea of the proposed method is to use (2) in the sample space as a requirement for the approximated regression function. In practice, we have only sample data: the training data x_i are mapped to regression values y_i^r, respectively, so we do not know the expected values of Y_i. What we can do is to estimate the requirement (2) as

\sum_{i=1}^{n} y_i^r − g(x_i) = 0 ,   (3)

where g(·) is the estimated regression function. The geometric interpretation of (3) is balancing data below and above the regression function – the sum of all distances below the regression function should be equal to the sum of all distances above the regression function. We can imagine that, based only on the condition (3), we can get multiple solutions, some of them not necessarily with a good fit to the data. So the new requirement alone is not enough to build a regression estimator.

We may notice that the LS regression balances the data due to the following proposition.

Proposition 1. The LS regression balances the data.

Proof. The solution of the LS regression for the function w · x + b and the sample data is given by the normal equations in the form

b n + \sum_{j=1}^{m} w_j \sum_{k=1}^{n} x_{jk} = \sum_{k=1}^{n} y_k   (4)

for i = 0, and

\sum_{k=1}^{n} b x_{ik} + \sum_{j=1}^{m} w_j \sum_{k=1}^{n} x_{ik} x_{jk} = \sum_{k=1}^{n} y_k x_{ik}   (5)

for i = 1, . . . , m. We can notice that equation (4) is equivalent to (3).
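A minimal numerical check of Proposition 1 (not part of the original paper; variable names are ours): an ordinary least-squares fit satisfies the balance condition (3), so the residual sum vanishes up to floating-point error.

import numpy as np

# Sketch: verify that an LS fit of g(x) = w . x + b balances the data, i.e.
# that sum_i (y_i - g(x_i)) = 0 as in (3).
rng = np.random.default_rng(0)
n, m = 30, 2
X = rng.uniform(size=(n, m))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.1 * rng.standard_normal(n)

# Solve the normal equations; the offset b is absorbed via a column of ones.
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:m], coef[m]

residual_sum = np.sum(y - (X @ w + b))   # left-hand side of (3)
print(abs(residual_sum))                 # ~1e-15: the LS solution is balanced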

Other regression methods may not balance the data, for example ridge regression or SVR. For the ridge regression methods, we get a balanced solution for a big value of the C parameter, when we are close to the LS solution. For the ε-SVR method, we get solutions close to the LS ones for the L2 norm variant and ε = 0; then the ε-SVR method becomes the ridge regression method.

For the least absolute deviations (LAD) regression, the regression line will pass through some two of the points in the case of unique solutions, so we might not get balanced solutions at all (the LAD regression could be approximated by ε-SVR with the L1 norm, with ε = 0 and a high value of C). We can notice that the balanced solutions are more visually appealing – they seem to be better fitted to the data, Fig. 1. We expect that for more points the solutions will be closer to each other and to the optimal solution.

For ε-SVR, increasing the value of ε generally leads to flatter solutions with a reduced number of support vectors and worse performance; finally, the number of support vectors becomes 0. The δ-SVR overcomes this effect. We expect a similar effect with added balancing. For ε-SVR, when we sum the constraints we have

−nε − \sum_{i=1}^{n} ξ_i^{*r} ≤ \sum_{i=1}^{n} y_i^r − g(x_i) ≤ nε + \sum_{i=1}^{n} ξ_i^r .   (6)

Notice that we have better bounds for balancing when the errors from outside the ε bands are smaller and when ε is smaller.

We measure the performance of balancing by the parameter p_b, which we define as

p_b := | \sum_{i=1}^{n} y_i^r − g(x_i) | .   (7)

For balanced solutions, p_b = 0 (Fig. 1).

We list some advantages of different loss functions for regression estimation.

For a joint distribution of X, Y we can write that

Y = r(X) + h(X, Y) .   (8)

The function r(X) = E(Y | X) minimizes the mean square deviation E(Y − r(X))^2, so the LS loss might be preferred. In [14], page 92, the authors give more statistically motivated reasons for the LS loss function when the conditional distributions P(·|x) have finite variances. For some specific cases, for example symmetric distributions or data with extreme outliers, other losses might be preferred.

In [13], page 80, the authors state that the LS loss is preferred for Gaussian noise due to maximum likelihood estimation, but in practice we do not know the noise model and other losses might be good as well. They also state the disadvantage of the LS loss, that it does not lead to robust estimators. Robustness is a property of the ε-insensitive loss. Moreover, the ε-insensitive loss has the property of returning sparse solutions. Balancing can be seen as a property of the function which minimizes the sum of loss function values for some data.

Fig. 1: Balanced regression (two panels of x–y scatter plots with fitted lines; only the caption is recoverable). Points – examples, solid lines – LS regression solutions, dashed lines – balanced LAD regression solutions, dotted lines – LAD regression solutions. (a) For 3 points, the balancing performance p_b is: for the LS solution p_b = 0, for the balanced LAD regression p_b = 0, for the LAD regression p_b = 0.1. (b) For 30 points, the balancing performance p_b is: for the LS solution p_b = 0, for the balanced LAD regression p_b = 0, for the LAD regression p_b = 0.04.

We can notice that balancing does not change the type of the mentioned loss functions. For the LS loss we have

L(x, y, g(x)) = (y − g(x))^2 .   (9)

We have a function space of functions in the form (24); we can incorporate the balance constraint (3) into (9) by substituting b, and we get the loss in the form

L(x, y, g(x)) = (y − y_0 − g_r(x + x_0))^2 ,   (10)

where x_0, y_0 are some constants and g_r belongs to the set of functions without b, g_r = w · x. So we get the LS loss, but for a different function space and with translated data. Similarly, for the ε-insensitive loss function we get

L(x, y, g(x)) = max(0, |y − y_0 − g_r(x + x_0)| − ε) .   (11)

The δ-SVR transforms the problem from regression to classification and uses the hinge loss for the classification, which is

L(x, y, g(x)) = max(0, 1 − y_c h(x)) ,   (12)

where h(x) is defined in (30) and y_c ∈ {−1, 1}. Motivated by the relation between δ-SVR and ε-SVR stated in [12], we will show that this loss is equivalent to the

dynamic version of the ε-insensitive loss. For the points with class 1 the loss can be written as

max(0, 1 − w_{c,red} · x − w_c^{m+1} (y_r + δ) − b_c) .   (13)

After reformulation we have

max(0, w_c^{m+1} ( 1/w_c^{m+1} − (w_{c,red}/w_c^{m+1}) · x − y_r − δ − b_c/w_c^{m+1} )) .   (14)

We can write

max(0, v (1/v + g(x) − y_r − δ)) ,   (15)

where v = w_c^{m+1}. Similarly for y_c = −1,

max(0, v (1/v − g(x) + y_r − δ)) .   (16)

After merging (15) and (16) we get

|v| max(0, |g(x) − y_r| + sgn(v) (1/v) − δ) .   (17)

So the δ-SVR solves the problem with the hinge loss, which can be converted to the ε-insensitive loss after finding the solutions, by setting

ε = | sgn(v) (1/v) − δ | .   (18)

Finally, we conclude that for δ-SVR the type of the loss function, which can be converted to the ε-insensitive loss, also remains unchanged.

2 Incorporating Balancing into SVR

We add the balance property (3) to two variants of SVR: ε-SVR and δ-SVR. We incorporate (3) by adding an equality constraint to the optimization problems of ε-SVR and δ-SVR (see [12] for detailed descriptions of these optimization problems). We incorporate (3) in a kernel space. After the incorporation we still have convex optimization problems.

The optimization problem for the balanced ε-SVR is

Optimization problem (OP) 1.

min_{w_r, b_r, ξ_r, ξ_r^*}  f(w_r, b_r, ξ_r, ξ_r^*) = (1/2) ||w_r||^2 + C_r \sum_{i=1}^{n} (ξ_i^r + ξ_i^{*r})   (19)

subject to

\sum_{i=1}^{n} y_i^r − g(x_i) = 0 ,   (20)

y_i^r − g(x_i) ≤ ε + ξ_i^r ,   (21)

g(x_i) − y_i^r ≤ ε + ξ_i^{*r} ,   (22)

ξ_r ≥ 0 , ξ_r^* ≥ 0   (23)

for i ∈ {1, . . . , n}, where ε ∈ R,

g(x_i) = w_r · x_i + b_r .   (24)

The g(x) = w_r · x + b_r is the regression function, ε is a parameter, and ξ_i^r, ξ_i^{*r} are slack variables. The difference between OP 1 and the ε-SVR optimization problem, OP 6, is the additional equality constraint (20). We solve OP 1 by using a technique introduced in [12]. The technique allows us to incorporate general equalities in the form

\sum_{i=1}^{s} s_i g(d_i) = e ,   (25)

where s_i are some parameters for which \sum_{i=1}^{s} s_i ≠ 0, d_i are some points, s is the number of the d_i points, e is a parameter, and g is defined in (24). We apply the technique by noticing first that (20) is a special case of (25), for s = n, s_i = 1, d_i = x_i, e = \sum_{i=1}^{n} y_i^r.
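As an illustration only (not the dual-form technique from [12] that the paper actually uses), a minimal primal sketch of OP 1 with a linear kernel can be written as a convex quadratic program; the function name and the use of cvxpy are our assumptions.

import cvxpy as cp

def balanced_eps_svr_linear(X, y, C=1.0, eps=0.1):
    """Primal sketch of OP 1 (balanced eps-SVR) with a linear kernel.

    This directly adds the balance constraint (20) to the standard eps-SVR
    primal; the paper instead works with the dual and a kernel function."""
    n, m = X.shape
    w = cp.Variable(m)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)       # slack above the eps tube, (21)
    xi_star = cp.Variable(n, nonneg=True)  # slack below the eps tube, (22)
    g = X @ w + b
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
    constraints = [
        cp.sum(y - g) == 0,        # balance constraint (20)
        y - g <= eps + xi,         # (21)
        g - y <= eps + xi_star,    # (22)
    ]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value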

The optimization problem for the balanced δ-SVR is

OP 2.

min_{w_c, b_c, ξ_c}  f(w_c, b_c, ξ_c) = (1/2) ||w_c||^2 + C_c \sum_{i=1}^{2n} ξ_i^c   (26)

subject to

\sum_{i=1}^{n} (−w_{c,red} · x_i − b_c) / w_c^{m+1} = \sum_{i=1}^{n} y_i^r ,   (27)

y_i^c (w_{c,red} · x_i + w_c^{m+1} (y_i^r + y_i^c δ) + b_c) ≥ 1 − ξ_i^c ,   (28)

ξ_c ≥ 0   (29)

for i ∈ {1, . . . , 2n}.

The C_c is a parameter. The ξ_i^c are slack variables. We are looking for the decision boundary

h(x) = w_c · x + b_c = 0 .   (30)

The difference between OP 2 and the δ-SVR optimization problem, OP 8, is the additional equality constraint (27). We solve OP 2 by using the technique introduced in [12]. We apply the technique by noticing first that (27) is a special case of (25), for s = n, s_i = 1, d_i = x_i, e = 0. The solution for both optimization problems is derived for the dual forms with the dual variables α_i for i = 1 . . . n and the kernel function K(x, y), where x, y are vectors.
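For illustration, a hedged sketch of the underlying δ-SVR transformation from regression to classification (as described in [9, 11]): each example is shifted up and down by δ in an extra coordinate and labeled ±1, a linear SVC is trained, and the regression function is read off the decision boundary (30). The balance constraint (27) of OP 2 is not enforced here, and scikit-learn's SVC is only a stand-in for the solver derived in this paper; the function name is ours.

import numpy as np
from sklearn.svm import SVC

def delta_svr_linear(X, y, delta=0.1, C=1.0):
    """Sketch of the delta-SVR transformation without the balance constraint."""
    n, m = X.shape
    # Augmented classification data: (x_i, y_i + delta) -> +1, (x_i, y_i - delta) -> -1.
    X_up = np.hstack([X, (y + delta).reshape(-1, 1)])
    X_down = np.hstack([X, (y - delta).reshape(-1, 1)])
    Xc = np.vstack([X_up, X_down])
    yc = np.hstack([np.ones(n), -np.ones(n)])

    clf = SVC(kernel="linear", C=C).fit(Xc, yc)   # hinge loss, linear kernel
    w = clf.coef_.ravel()                          # (w_{c,red}, w_c^{m+1})
    b = clf.intercept_[0]
    w_red, w_last = w[:m], w[m]

    # g(x) solves h(x, g(x)) = 0, i.e. g(x) = (-w_red . x - b) / w^{m+1}.
    def g(x):
        return (-(x @ w_red) - b) / w_last
    return g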

Due to space limits, we provide only a short description of all optimization problems that are used in the derivation of the solutions of OP 1 and OP 2.

OP 3. This is a standard SVM problem for solving classification problems.

OP 4. Another variant of support vector classification (SVC) is the SVC without the offset b_c, analyzed recently in [15]. The optimization problem is the same except for the missing b_c term.

OP 5. The ϕ support vector classification (ϕ-SVC) optimization problem is an extension of the SVC optimization problem, OP 3, where margin knowledge of an example in the form of the margin weights ϕ_i is introduced. It was presented in [7, 8].

OP 6. This is a standard ε-SVR soft case optimization problem in the primal form for solving regression problems.

OP 7. This is a dual form of the SVC without the offset, OP 4.

OP 8. This is the optimization problem for δ-SVR after transformation to the classification problem. The δ-SVR was first introduced in [9]. We use special kernels in δ-SVR as follows:

K(x, y) = K_o(x_red, y_red) + x^{m+1} y^{m+1} ,   (31)

where x and y are (m + 1)-dimensional vectors, x_red = (x^1, . . . , x^m), y_red = (y^1, . . . , y^m), and K_o(·, ·) is the original kernel from which the new one was constructed.
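A direct transcription of the kernel (31), with an RBF kernel assumed as the original kernel K_o (the experiments in Section 3 use the RBF kernel); function names are ours.

import numpy as np

def rbf(u, v, sigma=1.0):
    # Assumed original kernel K_o.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def delta_svr_kernel(x, y, Ko=rbf):
    """Kernel (31): K(x, y) = K_o(x_red, y_red) + x^{m+1} y^{m+1},
    where x and y are (m+1)-dimensional augmented vectors."""
    return Ko(x[:-1], y[:-1]) + x[-1] * y[-1]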

We derive the solution for OP 1 by first defining a special version of (25) for classification problems, that is

\sum_{i=1}^{s} s_i h(d_i) = e ,   (32)

where s_i are some parameters for which \sum_{i=1}^{s} s_i ≠ 0, d_i are some points, s is the number of the d_i points, e is a parameter, and h is defined in (30). The constraint (32) is incorporated into the ϕ-SVC optimization problem, OP 5, for classification. The incorporation leads to the ϕ-SVC optimization problem without the offset, which is the SVC optimization problem without the offset, OP 4, with additional margin weights. Then we incorporate (25) into OP 6 by noticing that ε-SVR is a special case of ϕ-SVC [10], and we have already derived the incorporation for ϕ-SVC. Finally, the incorporation of (25) into δ-SVR leads to the SVC optimization problem without the offset, OP 4.

3 Experiments

We compare the performance of standard SVR with balanced SVR on various real world data sets. We chose all real world data sets for regression from the LibSVM site [6], which originally come from the UCI Machine Learning Repository and Statlog (the YearPredictionMSD data set is reduced to the first 25000 data vectors). Moreover, we chose the rcv1v2 (topics; subsets) data set for multilabel classification, which we convert to a regression problem by predicting the number of classes.

See the details about the data sets in Table 1 and Table 2. We use our own implementation of all methods. For all data sets, every feature is scaled linearly to [0, 1]. We performed all tests with the radial basis function (RBF) kernel.

For variable parameters like C_r and σ for the RBF kernel, we use a double grid search method for finding the best values: first a coarse grid search is performed, then a finer grid search, as described in [4]. The training set size is fixed, and the rest of the data become test data. The standard 5-fold cross-validation is used in the inner loop for finding optimal values of the parameters. After finding optimal values, we run the method on the training data and report results for the test set.

We report the balancing parameter p_b for the SVR methods.
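A sketch of this evaluation protocol for plain ε-SVR using scikit-learn as a stand-in (the paper uses its own implementation, and balanced SVR would require the solver from Section 2); the grid ranges and the function name are our assumptions.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

def evaluate_eps_svr(X, y, eps=0.04, n_train=80, seed=0):
    """Fixed-size training set, 5-fold CV grid search over C and the RBF width,
    then RMSE and the balancing parameter (7) on the held-out test data.
    Features are assumed to be already scaled linearly to [0, 1]."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train, test = idx[:n_train], idx[n_train:]

    # Coarse grid; the paper follows [4] with a second, finer grid around the best point.
    grid = {"C": 2.0 ** np.arange(-5, 16, 2), "gamma": 2.0 ** np.arange(-15, 4, 2)}
    search = GridSearchCV(SVR(kernel="rbf", epsilon=eps), grid, cv=5,
                          scoring="neg_root_mean_squared_error")
    search.fit(X[train], y[train])

    pred = search.best_estimator_.predict(X[test])
    rmse = np.sqrt(mean_squared_error(y[test], pred))
    p_b = abs(np.sum(y[test] - pred))   # balancing parameter (7) on test data
    return rmse, p_b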

We use the Friedman test with the two-tailed Nemenyi post-hoc test for checking statistical significance of the regression error, as suggested in [3, 5]. The statistical procedure is performed for the level of significance equal to 0.05. The critical values for the Friedman test are taken from the statistical table designed specifically for the case with a smaller number of tests or methods, as suggested in [5]. The critical values for the Nemenyi test are taken from the statistical table for the Tukey test, divided by √2.
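A sketch of this statistical comparison using SciPy; note that friedmanchisquare reports a p-value instead of the table lookups described above, and the Nemenyi critical difference below uses the Demšar-style formula with an assumed Studentized range value for four methods at the 0.05 level.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def compare_methods(rmse):
    """Friedman test over an (n_datasets x k_methods) array of RMSE values,
    plus average ranks and the Nemenyi critical difference."""
    n, k = rmse.shape
    stat, p = friedmanchisquare(*[rmse[:, j] for j in range(k)])
    avg_ranks = rankdata(rmse, axis=1).mean(axis=0)   # rank 1 = lowest RMSE

    q_alpha = 3.633 / np.sqrt(2)   # assumed Tukey value for k=4, alpha=0.05, divided by sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))   # Nemenyi critical difference
    return stat, p, avg_ranks, cd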

We compare SVR with balanced SVR for various ε and δ; we choose a fixed size of the training set in the outer loop, which is 80. We report all results in Table 1, Table 2 and Table 3. The general conclusion is that balanced SVR achieves similar overall generalization performance as SVR, with better generalization performance for suboptimal values. The detailed observations are as follows.

– The value of the balancing parameter p_b generally increases for ε-SVR and δ-SVR while increasing the value of ε and δ (columns tamse1, tamse3 in Table 1 and columns tamse1, tamse3 in Table 3).

– The value of the balancing parameter p_b is generally lower for balanced SVR than for standard SVR (columns tamse1, tamse2, tamse3, tamse4 in Table 1 and columns tamse1, tamse2, tamse3, tamse4 in Table 3).

– The generalization performance for various ε and δ is generally slightly better for balanced SVR (rows 0-5, columns rs1, rs2, rs3, rs4 in Table 3). We did not achieve direct statistical significance for this result (columns tsn12 and tsn34), but we can notice that, for example for ε/δ = 0.32, we achieve better generalization performance for balanced ε-SVR than for ε-SVR on all data sets except one (space ga). The Nemenyi statistics are very conservative [3], and we believe that such relations could be found with more sensitive tests. What we can show is that the difference in generalization performance between ε-SVR and δ-SVR can become statistically significant when we replace δ-SVR with balanced δ-SVR, for example for ε/δ = 0.16 (row with id=2, columns tsn13, tsn14).

– The overall generalization performance (row 5 in Table 3) is similar for all methods, without statistically significant differences.


Table 1: Performance of the balanced SVR for real world data per different value of ε/δ, part 1. The numbers in descriptions of the columns mean the methods: 1 - ε-SVR, 2 - balanced ε-SVR, 3 - δ-SVR, 4 - balanced δ-SVR. Column descriptions: id – id of a test, dn – the name of a data set, size – the number of all examples, dim – the dimension of a problem, eps/delta – a value of the parameters ε and δ, trse – root-mean-square error (RMSE) for testing data; the best method is in bold, tamse – the average for the balancing parameter p_b for testing data; the best method is in bold.

id dn size dim eps/delta trse1 trse2 trse3 trse4 tamse1 tamse2 tamse3 tamse4
0 abalone 4177 8 0.01 0.087 0.084 2.782 0.103 0.027 0.008 2.22 0.019
1 abalone 4177 8 0.04 0.084 0.085 0.09 0.09 0.003 0.016 0.017 0.022
2 abalone 4177 8 0.16 0.105 0.097 0.088 0.088 0.054 0.006 0.023 0.008
3 abalone 4177 8 0.32 0.159 0.098 0.095 0.15 0.11 0.005 0.039 0.019
4 abalone 4177 8 0.64 0.147 0.115 0.092 0.093 0.092 0.004 0.007 0.005
5 bodyfat 252 14 0.01 0.019 0.02 0.05 0.043 0.001 0.005 0.011 0.001
6 bodyfat 252 14 0.04 0.043 0.042 0.033 0.03 0.007 0.003 0.009 0.005
7 bodyfat 252 14 0.16 0.092 0.106 0.019 0.018 0.026 0.008 0.003 0.006
8 bodyfat 252 14 0.32 0.151 0.148 0.033 0.035 0.041 0.018 0.007 0.01
9 bodyfat 252 14 0.64 0.183 0.165 0.056 0.056 0.08 0.005 0.004 0.004
10 cadata 20640 8 0.01 0.159 0.17 1.85 0.223 0.009 0.013 0.243 0.009
11 cadata 20640 8 0.04 0.154 0.155 0.154 0.156 0.027 0.011 0.014 0.009
12 cadata 20640 8 0.16 0.151 0.159 0.159 0.15 0.032 0.011 0.007 0.007
13 cadata 20640 8 0.32 0.196 0.162 0.148 0.149 0.075 0.006 0.007 0.009
14 cadata 20640 8 0.64 0.278 0.238 0.18 0.181 0.143 0.003 0.026 0.012
15 housing 506 13 0.01 0.097 0.099 0.133 0.147 0.01 0.014 0.013 0.018
16 housing 506 13 0.04 0.108 0.11 0.111 0.12 0.005 0.006 0.003 0.01
17 housing 506 13 0.16 0.142 0.135 0.095 0.095 0.006 0.011 0.009 0.003
18 housing 506 13 0.32 0.201 0.178 0.107 0.106 0.124 0.008 0.015 0.004
19 housing 506 13 0.64 0.245 0.204 0.134 0.129 0.136 0.005 0.009 0.005
20 mpg 392 7 0.01 0.087 0.086 0.085 0.154 0.015 0.01 0.004 0.014
21 mpg 392 7 0.04 0.076 0.075 0.075 0.073 0.014 0.008 0.013 0.003
22 mpg 392 7 0.16 0.083 0.082 0.079 0.081 0.012 0.0 0.01 0.006
23 mpg 392 7 0.32 0.169 0.115 0.086 0.085 0.103 0.007 0.017 0.012
24 mpg 392 7 0.64 0.225 0.211 0.102 0.099 0.082 0.024 0.004 0.001
25 pyrim 74 26 0.01 0.163 0.165 0.188 0.178 0.057 0.059 0.086 0.076
26 pyrim 74 26 0.04 0.065 0.064 0.075 0.08 0.004 0.003 0.016 0.0
27 pyrim 74 26 0.16 0.1 0.091 0.062 0.064 0.021 0.001 0.017 0.02
28 pyrim 74 26 0.32 0.231 0.217 0.164 0.162 0.096 0.053 0.062 0.06
29 pyrim 74 26 0.64 0.237 0.127 0.078 0.077 0.2 0.002 0.004 0.002
30 space ga 3107 6 0.01 0.044 0.045 0.056 0.057 0.002 0.005 0.001 0.002
31 space ga 3107 6 0.04 0.045 0.045 0.044 0.045 0.001 0.007 0.005 0.004
32 space ga 3107 6 0.16 0.072 0.054 0.047 0.048 0.036 0.001 0.006 0.002
33 space ga 3107 6 0.32 0.063 0.064 0.048 0.048 0.008 0.012 0.003 0.002
34 space ga 3107 6 0.64 0.075 0.063 0.05 0.049 0.041 0.006 0.005 0.001
35 triazines 185 58 0.01 0.195 0.198 0.253 0.205 0.015 0.005 0.044 0.008
36 triazines 185 58 0.04 0.2 0.199 0.213 0.199 0.034 0.008 0.033 0.006
37 triazines 185 58 0.16 0.206 0.213 0.288 0.177 0.004 0.012 0.034 0.005
38 triazines 185 58 0.32 0.213 0.208 0.199 0.202 0.058 0.042 0.033 0.032
39 triazines 185 58 0.64 0.278 0.196 0.179 0.181 0.2 0.031 0.018 0.03


Table 2: Performance of balanced SVR for real world data per different value of ε/δ, cont. of the part 1.

id dn size dim eps/delta trse1 trse2 trse3 trse4 tamse1 tamse2 tamse3 tamse4
40 rcv1 26173 39029 0.01 0.105 0.105 0.108 0.105 0.009 0.012 0.023 0.013
41 rcv1 26173 39029 0.04 0.105 0.105 0.105 0.107 0.003 0.001 0.003 0.001
42 rcv1 26173 39029 0.16 0.113 0.107 0.105 0.105 0.038 0.008 0.003 0.002
43 rcv1 26173 39029 0.32 0.174 0.106 0.109 0.104 0.137 0.01 0.022 0.008
44 rcv1 26173 39029 0.64 0.145 0.107 0.114 0.106 0.098 0.001 0.042 0.002
45 year pred 24989 90 0.01 0.112 0.111 0.112 0.113 0.024 0.015 0.013 0.009
46 year pred 24989 90 0.04 0.11 0.112 0.162 0.124 0.0 0.021 0.109 0.021
47 year pred 24989 90 0.16 0.14 0.134 0.125 0.122 0.064 0.014 0.01 0.01
48 year pred 24989 90 0.32 0.162 0.121 0.113 0.113 0.108 0.007 0.022 0.008
49 year pred 24989 90 0.64 0.213 0.121 0.117 0.123 0.176 0.004 0.01 0.004
50 abalone var 0.084 0.084 0.088 0.088 0.003 0.008 0.023 0.008
51 bodyfat var 0.019 0.02 0.019 0.018 0.001 0.005 0.003 0.006
52 cadata var 0.151 0.155 0.148 0.149 0.032 0.011 0.007 0.009
53 housing var 0.097 0.099 0.095 0.095 0.01 0.014 0.009 0.003
54 mpg var 0.076 0.075 0.075 0.073 0.014 0.008 0.013 0.003
55 pyrim var 0.065 0.064 0.062 0.064 0.004 0.003 0.017 0.02
56 space ga var 0.044 0.045 0.044 0.045 0.002 0.007 0.005 0.004
57 triazines var 0.195 0.196 0.179 0.177 0.015 0.031 0.018 0.005
58 rcv1 var 0.105 0.105 0.105 0.104 0.009 0.001 0.003 0.008
59 year pred var 0.11 0.111 0.112 0.113 0.0 0.015 0.013 0.009

Table 3: Performance of balanced SVR for real world data per different value of ε/δ, part 2. The numbers in descriptions of the columns mean the methods: 1 - ε-SVR, 2 - balanced ε-SVR, 3 - δ-SVR, 4 - balanced δ-SVR. The test with id=0 is for all the tests from Table 1 with ε, δ = 0.01, with id=1 for all the tests with ε, δ = 0.04, with id=2 for all the tests with ε, δ = 0.16, with id=3 for all the tests with ε, δ = 0.32, with id=4 for all the tests with ε, δ = 0.64, and finally with id=5 for variable ε, δ. Column descriptions: rs – an average rank of the method for RMSE; the best method is in bold, tsf – the Friedman statistic for average ranks for RMSE; the significant value is in bold, tsn – the Nemenyi statistic for average ranks for RMSE, reported when the Friedman statistic is significant, the significant value is in bold, tamse – the average for the balancing parameter p_b for testing data; the best method is in bold.

id rs1 rs2 rs3 rs4 tsf tsn12 tsn13 tsn23 tsn14 tsn24 tsn34 tamse1 tamse2 tamse3 tamse4
0 1.5 1.8 3.3 3.4 17.64 −0.52 −3.12 −2.6 −3.29 −2.77 −0.17 0.017 0.015 0.266 0.017
1 2.2 2.3 2.6 2.9 1.8 – – – – – – 0.01 0.008 0.022 0.008
2 3.5 3.2 2.0 1.3 19.08 0.52 2.6 2.08 3.81 3.29 1.21 0.029 0.007 0.012 0.007
3 3.9 2.9 1.6 1.6 22.44 1.73 3.98 2.25 3.98 2.25 0.0 0.086 0.017 0.023 0.016
4 4.0 2.8 1.6 1.6 23.76 2.08 4.16 2.08 4.16 2.08 0.0 0.125 0.009 0.013 0.006
5 2.56 3.11 2.33 2.0 3.53 – – – – – – 0.009 0.011 0.012 0.007


4 Summary

In this paper, we proposed a novel regression method of balancing data, balanced SVR. We provided details on the method and performed experiments. The advantage of the proposed method is an improvement in performance for suboptimal values of ε and δ compared to SVR. In the future, we plan to extend the theoretical analysis and to perform more thorough experiments.

Acknowledgments

I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for discussion and useful suggestions. The research is financed by the National Science Centre, project id 217859, UMO-2013/09/B/ST6/01549, titled "Interactive Visual Text Analytics (IVTA): Development of novel user-driven text mining and visualization methods for large text corpora exploration".

References

[1] Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
[2] Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression machines. In: Neural Information Processing Systems, pp. 155–161 (1996)
[3] Garcia, S., Herrera, F.: An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. J. Mach. Learn. Res. 9, 2677–2694 (Dec 2008)
[4] Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification (2010)
[5] Japkowicz, N., Shah, M. (eds.): Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press (2011)
[6] LibSVM data sets (06 2011)
[7] Orchel, M.: Incorporating detractors into SVM classification. In: Cyran, K., Kozielski, S., Peters, J., Stańczyk, U., Wakulicz-Deja, A. (eds.) Man-Machine Interactions, Advances in Intelligent and Soft Computing, vol. 59, pp. 361–369. Springer Berlin Heidelberg (2009)
[8] Orchel, M.: Incorporating a priori knowledge from detractor points into support vector classification. In: Dobnikar, A., Lotric, U., Šter, B. (eds.) Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science, vol. 6594, pp. 332–341. Springer Berlin Heidelberg (2011)
[9] Orchel, M.: Regression based on support vector classification. In: Dobnikar, A., Lotric, U., Šter, B. (eds.) Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science, vol. 6594, pp. 353–362. Springer Berlin Heidelberg (2011)
[10] Orchel, M.: Support vector regression as a classification problem with a priori knowledge in the form of detractors. In: Czachorski, T., Kozielski, S., Stańczyk, U. (eds.) Man-Machine Interactions 2, Advances in Intelligent and Soft Computing, vol. 103, pp. 353–362. Springer Berlin Heidelberg (2011)
[11] Orchel, M.: Support vector regression based on data shifting. Neurocomputing 96, 2–11 (2012)
[12] Orchel, M.: Incorporating Prior Knowledge into SVM Algorithms in Analysis of Multidimensional Data. Ph.D. thesis, AGH University of Science and Technology (2013)
[13] Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA (2001)
[14] Steinwart, I., Christmann, A.: Support Vector Machines. Information Science and Statistics, Springer (2008)
[15] Steinwart, I., Hush, D.R., Scovel, C.: Training SVMs without offset. J. Mach. Learn. Res. 12, 141–202 (2011)
