Knowledge-Uncertainty Axiomatized Framework with Support Vector Machines for Sparse Hyperparameter Optimization

Academic year: 2021


Marcin Orchel

Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland

Email: morchel@agh.edu.pl

Abstract—We solve a problem of hyperparameter optimization for classification with two objectives: finding solutions that have low generalization error and are simultaneously sparse. We designed a theoretical and practical framework for combining multiple measures for hyperparameter optimization. The theoretical framework is based on selecting measures of knowledge and uncertainty fulfilling axiomatic systems for classification problems. The practical framework is based on solving a multi-objective optimization problem. We validated our model on real world data sets with support vector machines (SVM), and we achieved a substantially smaller number of support vectors with similar generalization performance compared to cross validation with the misclassification error.

I. INTRODUCTION

One of the basic problems of machine learning is classification. One of the best classifiers in terms of accuracy is the SVM [1].

The question in designing an algorithm for SVM is how to select the values of its hyperparameters. Simple techniques like random or grid search have generalization performance similar to more sophisticated meta-heuristics [2]. SVM has problems with achieving sparse solutions [3], measured by the number of terms in the solution. Usually, some post-processing procedure is performed to sparsify the basic solution. Another approach is to modify the procedure for finding the basic solution. The question is how to perform hyperparameter optimization so as to preserve generalization performance and simultaneously improve the sparsity of a solution. Cross validation as a subsampling method is the most effective technique [4]. In cross validation, we compare solutions for different values of hyperparameters with a misclassification error on a validation set. The question is how to redesign cross validation to improve the sparsity of a solution while preserving generalization performance: in particular, how to select measures for a validation set which improve sparsity and estimate the true generalization error. The next problem is how to design a framework for combining multiple measures in cross validation. Computing measures based on statistical bounds for the generalization error on a training set is not as effective as cross validation [5].
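The cross validation baseline described above can be sketched as a plain grid search that scores every (σ, C) pair by the validation misclassification error. The `toy_train` stand-in solver, the grid values, and the toy data below are illustrative assumptions, not the solver used in the paper:

```python
from itertools import product

def misclassification_error(h, val_set):
    # fraction of validation examples that hypothesis h misclassifies
    return sum(1 for x, y in val_set if h(x) != y) / len(val_set)

def grid_search(train, train_set, val_set, sigmas, Cs):
    """Pick (sigma, C) minimizing the validation misclassification error."""
    best = None
    for sigma, C in product(sigmas, Cs):
        h = train(train_set, sigma=sigma, C=C)
        err = misclassification_error(h, val_set)
        if best is None or err < best[0]:
            best = (err, sigma, C)
    return best

def toy_train(train_set, sigma, C):
    # nearest-neighbor stand-in for an SVM solver, ignoring (sigma, C);
    # used only to keep the sketch runnable
    def h(x):
        return min(train_set, key=lambda p: abs(p[0] - x))[1]
    return h

train_set = [(0.0, -1), (1.0, -1), (3.0, 1), (4.0, 1)]
val_set = [(0.5, -1), (3.5, 1)]
err, sigma, C = grid_search(toy_train, train_set, val_set,
                            sigmas=[2.0 ** e for e in range(-3, 4)],
                            Cs=[2.0 ** e for e in range(-3, 4)])
```

The rest of the paper asks what other validation-set measures besides `misclassification_error` should enter this comparison.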

The idea is to use those measures on a validation set in addition to the misclassification error. While the inspiration for selecting particular measures is statistical learning theory, we propose an alternative approach based on axiomatic selection of measures. The advantage of such an approach is that it solves the problem of designing a framework for testing multiple measures. We also address the compatibility of the proposed measures with statistical bounds.

There are two basic approaches to the selection of hyperparameter values: either manual, or automated based on optimization of the validation error [6]. In [2], the authors found that simple approaches like grid search or random search have accuracy similar to more sophisticated approaches like a genetic algorithm. Moreover, it is a common practice to improve grid search by the technique of zooming in (double grid search), implemented in LIBSVM. In [7], the authors proposed a heuristic to select the σ parameter of the radial basis function (RBF) kernel based on distances to the nearest neighbors. For the box constraint C in SVM, they added early stopping in a grid search using the Elbow method. They achieved similar accuracy with improved speed.

The limitation of previous research is that it does not report or optimize the sparsity of a solution.

The methods aimed at returning sparse solutions follow two alternative strategies: they find a sparse approximation of a non-sparse solution, or they modify the SVM algorithm [3]. An example of the latter is a novel algorithm for training SVM that significantly improves the sparsity of a solution [8] without sacrificing generalization performance. Our approach of optimizing the hyperparameter search for sparsity is complementary to both strategies.

We introduce a framework of knowledge and uncertainty for SVM, called knowledge-uncertainty machines (KUM). The framework is based on a multi-objective optimization problem with two goals: maximizing knowledge and minimizing uncertainty for knowledge set objects proposed in [9]. We generalize the regularizer of SVM to an uncertainty measure, and the empirical risk with an averaged hinge loss to a knowledge measure. We use the following techniques: axiomatization of measures and multi-objective optimization. For the practical part, we propose a technique of computing the measures on a validation set for selecting optimal values of hyperparameters.

We aim to select measures that optimize generalization performance and sparsity of a solution. We found that the number of margin errors on a validation set and the radius of a minimal hypersphere improve the sparsity of a solution.

The outline of the article is as follows. First, we introduce a theoretical framework of knowledge and uncertainty. Then we describe an algorithm based on this framework for selecting optimal values of hyperparameters for SVM. Then, we show experiments on real world data sets.

II. PROBLEM DEFINITION

We solve a classification problem.

Definition 1 (Classification problem with a training set). For a universe X of objects, a set C of classes and a set of mappings M_T : X_T ⊂ X → C called a training set T, a classification problem is to find classes for all elements of X.

The set C includes a special class c_0, which means an unknown class.

Definition 2 (Classification problem with hypotheses). For a classification problem with a training set, we additionally define a space of hypotheses H. Each hypothesis is a function h : X → C. Finding a class for all elements of X is replaced by finding a hypothesis.

Solving a problem might be a special case of a classification problem with one class. We could define a task of solving a problem as mapping a class (an object is a solution of a problem) for each object (candidate solution). The training set might be a set of candidate solutions to exclude. We recall a definition of a knowledge set from [9].

Definition 3 (knowledge set). A knowledge set K is a tuple K = (X, C, M_T; U, c); shortly, without environment objects, it is a pair K = (U, c), written U_c. It is a set U ⊂ X of points with the information that every u ∈ U maps to c ∈ C. The c is called the class of a knowledge set.

The difference between a hypothesis and a knowledge set is that a knowledge set defines mappings only for some objects, while a hypothesis does so for all objects. However, in the special case when a knowledge set contains mappings for all data, it becomes a hypothesis. We also define an unknowledge set as a pair U = (U, c_0). We also recall a definition of a knowledge setting.

Definition 4 (knowledge setting). A knowledge setting is a set of knowledge sets {K_1, K_2}, shortly K_{1,2}, where K_1 = (U_1, c_1), K_2 = (U_2, c_2), and c_1, c_2 ∈ C.

We can generalize knowledge settings to multiple knowledge sets, including the case of only one knowledge set. A special type of knowledge setting is one consistent with a training set, T_{1,2}. In the context of SVM, we proposed margin knowledge sets in [9], in which margins are boundaries of knowledge sets and examples without margin errors belong to knowledge sets. We define set operations on knowledge settings as operations on mappings. The difference between knowledge settings is the difference between the corresponding sets of mappings. We define a space of knowledge measures K_K for knowledge settings. Each knowledge measure is a function k : K_{1,2} → R. The goal is to find a knowledge setting with the maximal value of a knowledge measure. We define a space of uncertainty measures U_K for knowledge settings. Each uncertainty measure is a function u : K_{1,2} → R. The goal is to find a knowledge setting with the minimal value of an uncertainty measure. A special type of uncertainty measure is a measure dependent on U_1 and U_2 but independent of mappings. More formally, it is a measure on an uncertain setting U_{1,2}, for example on (X, c_0). We present axioms for knowledge and uncertainty measures below.

Definition 5 (knowledge measure axioms).

Axiom 1 (monotonicity). When K_{1,2} ⊆ L_{1,2}, then

k(K_{1,2}) ≤ k(L_{1,2}). (1)

Axiom 2 (strict monotonicity). When K_{1,2} ⊂ L_{1,2} and (L_{1,2} \ K_{1,2}) ∩ T_{1,2} ≠ ∅, then

k(K_{1,2}) < k(L_{1,2}). (2)

Axiom 3 (non-negativity).

k(K_{1,2}) ≥ 0. (3)

Axiom 4. When K_{1,2} ∩ T_{1,2} ≠ ∅,

k(K_{1,2}) > 0. (4)

Axiom 5 (null empty set). When U_1 = ∅ and U_2 = ∅ for K_{1,2},

k(K_{1,2}) = 0. (5)

Axiom 6 (knowledge outside a training set). When K_{1,2} ∩ T_{1,2} = ∅, then

k(K_{1,2}) = 0. (6)

Axiom 7 (knowledge outside a training set 2). When K_{1,2} ⊂ L_{1,2} and (L_{1,2} \ K_{1,2}) ∩ T_{1,2} = ∅,

k(K_{1,2}) = k(L_{1,2}). (7)

Axiom 8. The maximal value of k exists, that is, k < ∞.

Axiom 9.

k(T_{1,2}) = k(K′_{1,2}) + k(K_{1,2}) (8)

The knowledge measure is not a measure in the measure-theoretic sense; in general, it does not fulfill the countable additivity axiom.

Example 1. The example of a knowledge measure is

k(K_{1,2}) = |T_{1,2} ∩ K_{1,2}|. (9)


The knowledge k can be interpreted as an upper bound on the number of correctly classified training examples included in a knowledge setting. In general, knowledge k is a quality measure. In the context of SVM, it is related to the empirical risk.

Definition 6 (uncertainty measure axioms).

Axiom 10 (monotonicity). When K_{1,2} ⊆ L_{1,2}, then

u(K_{1,2}) ≤ u(L_{1,2}). (10)

Axiom 11 (non-negativity).

u(K_{1,2}) ≥ 0. (11)

Axiom 12 (null empty set). When U_1 = ∅ and U_2 = ∅ for K_{1,2},

u(K_{1,2}) = 0. (12)

Axiom 13 (uncertainty outside a training set). When K_{1,2} ⊄ T_{1,2},

u(K_{1,2}) > 0. (13)

Axiom 14 (uncertainty outside a training set 2). When K_{1,2} ⊂ L_{1,2} and L_{1,2} \ K_{1,2} ⊄ T_{1,2},

u(K_{1,2}) < u(L_{1,2}). (14)

Axiom 15. The maximal value of u exists, that is, u < ∞.

Axiom 16.

u(K_{0,0}) = 0.5 u((X, c_0)) − u(((U_1 ∪ U_2)′, c_0)) (15)

This axiom might be derived from an additional additivity axiom.

Example 2. The example of an uncertainty measure is

u(K_{1,2}) = |K_{1,2}|. (16)

The uncertainty u can be related to uncertain knowledge from a knowledge setting, so the classification can be incorrect. It could also be interpreted as an upper bound on the knowledge in a knowledge setting that might be incorrect.
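The counting measures (9) and (16) can be illustrated on a toy knowledge setting; the points and classes below are arbitrary examples, with each mapping modeled as a (point, class) pair:

```python
# training mappings T_{1,2} and a knowledge setting K_{1,2},
# each modeled as a set of (point, class) mappings
T = {(1, 1), (2, 1), (3, -1), (4, -1)}
K = {(1, 1), (3, -1), (5, 1)}

def k_measure(K, T):
    # knowledge measure (9): mappings shared with the training set
    return len(T & K)

def u_measure(K):
    # uncertainty measure (16): total size of the knowledge setting
    return len(K)
```

Here `k_measure(K, T)` is 2 (two training mappings are covered) while `u_measure(K)` is 3, since the mapping (5, 1) lies outside the training set and contributes uncertainty but no knowledge.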

The uncertainty 0 means that we are sure about correct classification. In the context of SVM, uncertainty is related to the capacity of the model class. The capacity of the model class quantifies the complexity of the model. We prefer a predictor from a simple model class. The complexity of a model can be represented by a regularizer. The interpretation of a regularizer is how much a model is preferred a priori (without observing the training data). Restricting the model class in the first place can also be seen as regularization. An example of such a restriction is the Vapnik-Chervonenkis (VC) dimension.

SVM is based on the maximum margin principle, and its regularizer can be interpreted as maximizing the margin. Both knowledge and uncertainty are related to the idea of structural risk minimization (SRM): minimizing the empirical risk in each model class with a penalty for the capacity of the model class. SVM in fact realizes SRM, with the difference that we penalize the complexity of a model within a model class. SRM is related to regularized risk minimization.

III. KNOWLEDGE-UNCERTAINTY MODEL

We define a solution of the classification problem as a solution of a multi-objective optimization problem for knowledge settings.

Optimization problem (OP) 1.

K_{1,2} ∈ K_s : max k(K_{1,2}), min u(K_{1,2}). (17)

We maximize the knowledge k and minimize the uncertainty u. We are interested in the Pareto optimal set. For a finite set K_s, a solution always exists. The objective functions are lower bounded by (0, 0). Among all knowledge settings with the same knowledge, we minimize the uncertainty; among all knowledge settings with the same uncertainty, we maximize the knowledge. Increasing the size of a knowledge setting (but only for an included knowledge setting) means that we get a knowledge setting with possibly more uncertainty and more knowledge. So generally bigger knowledge settings have bigger uncertainty (but only for included knowledge settings), and it can be hard to obtain big knowledge and small uncertainty.

We are interested only in Pareto optimal solutions. This is an a priori assumption and may be considered a principle.

Principle 1. The best solutions for OP 1 are Pareto optimal solutions.

When we have multiple Pareto optimal solutions, they must be compared using an oracle for knowledge settings in order to obtain the best single solution. By selecting only Pareto optimal solutions, we limit the number of knowledge settings to validate with the oracle.
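A minimal sketch of extracting the Pareto optimal set of OP 1 over a finite family of knowledge settings, assuming the measures k and u are supplied as callables; the example settings are hypothetical (knowledge, uncertainty) value pairs:

```python
def pareto_optimal(settings, k, u):
    """Keep settings not dominated in (max k, min u), cf. OP 1."""
    def dominates(a, b):
        # a is at least as good as b in both objectives, better in one
        return (k(a) >= k(b) and u(a) <= u(b)
                and (k(a) > k(b) or u(a) < u(b)))
    return [s for s in settings
            if not any(dominates(t, s) for t in settings)]

# hypothetical settings given directly as (knowledge, uncertainty) pairs
front = pareto_optimal([(3, 2), (2, 1), (3, 3), (1, 1)],
                       k=lambda s: s[0], u=lambda s: s[1])
```

In this example (3, 3) is dominated by (3, 2) and (1, 1) by (2, 1), so only two settings remain for the oracle to compare.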

Proposition 1. The set of Pareto optimal solutions includes only subsets of a training knowledge setting T_{1,2} included in K_s, assuming axioms Ax. 7 and Ax. 14.

Proof. When we add an element from outside the training set, we possibly add uncertainty and knowledge. But according to Ax. 7, the new knowledge setting has the same knowledge as before adding the element, while the uncertainty is bigger due to Ax. 14. A similar argument applies to adding an element from the training set but with a different class.

Not all subsets of T_{1,2} are Pareto optimal solutions. We could have two separated knowledge settings, where one has more knowledge and less uncertainty. In an online setting, when we add one more object to a training set, we add new potential Pareto optimal solutions. Consider the problem of which knowledge settings to check after adding a new object. We need to compare all old Pareto optimal solutions with all solutions containing the new mapping. Due to Prop. 1, we can limit the solutions to subsets of T_{1,2}. For a special case, when we compare two knowledge settings K_{1,2} ⊂ L_{1,2}, where L_{1,2} is the one with the new mapping, we always get better knowledge due to Ax. 2. When additionally the uncertainty is the same, we need to remove K_{1,2} from the Pareto optimal solution set, and only then, due to Ax. 10. We could further limit the number of sets to check by assuming the optional additivity axioms. Details are omitted due to space limitations.


Due to Prop. 1, in order to be able to generate Pareto optimal solutions on examples from outside a training set, we need to provide an additional structure S for the problem. Usually the structure for data is assumed a priori. Without any assumptions about the structure of the space, we would not be able to generalize, in the sense of Pareto optimal solutions, beyond a training set. The structure gives us more flexibility in defining knowledge and uncertainty measures. So we have a space (X, S) for our classification problem. Because we do not know the space S, we may introduce multiple spaces S_i ∈ S.

In order to solve OP 1, we need to compare the measures k and u on different spaces (X, S_i). So we need an additional principle about the comparability of measures on different spaces.

Principle 2. A knowledge measure k and an uncertainty measure u on different spaces (X, S_i) must be comparable.

We need measures k and u that are invariant to spaces. When the measures depend only on some basic space S_o such that S_o ⊂ S_i, then they are comparable. In the extreme, when a measure depends only on X, it is also comparable. In order to achieve comparability, one strategy might be to compute the maximal (supremum) value of a measure in a given space. Then we could divide the measure by its maximal possible value. We need to be sure that the maximal value exists, for example for an uncertainty measure due to Ax. 15, and for a knowledge measure due to Ax. 8. Due to Ax. 10, we expect the maximal value of u for the biggest possible knowledge setting; we can define such a setting for the whole X. Due to Ax. 8, we expect the maximal value of k for the training set. Another principle is that generally we need more data to get better results. A validation set is not used during training, so it may be valuable for further improvement of a model, in particular for model selection.

IV. RELATION TO SVM

We show how the SVM formulation presented in Appendix A solves OP 1. We define the following optimization problem.

OP 2.

max −Σ_{i=1}^{n} max(0, 1 − y_i h(x_i)), min −2/‖w‖². (18)

Theorem 1. The set of solutions of OP 4 for any C is equivalent to a set of Pareto optimal solutions of OP 2.

Proof. We can reformulate OP 2 as OP 3.

min Σ_{i=1}^{n} max(0, 1 − y_i h(x_i)), min (1/2)‖w‖². (19)

OP 4 is a scalarization of OP 3 using the weighting method [10], page 78. It was also mentioned in [11], page 5, that the equivalent formulation leads to Pareto optimal solutions, because of the convex quadratic function and a linear term. The theorem and proof for the weighting method can be found in [10], page 78.
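The weighting-method argument can be checked numerically on a finite candidate set: for a positive weight C, a minimizer of the scalarized objective C·f1 + f2 is Pareto optimal for (min f1, min f2). The objective pairs below are arbitrary illustrations of (hinge-loss sum, regularizer) values, not SVM solutions:

```python
def scalarized_min(points, C):
    """Weighting method: minimize C*f1 + f2 over candidate objective
    pairs (f1, f2), cf. the scalarization of OP 3 into OP 4."""
    return min(points, key=lambda p: C * p[0] + p[1])

def is_pareto(p, points):
    # no other point is at least as good in both objectives
    return not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)

# arbitrary (hinge-loss sum, regularizer) pairs for illustration
points = [(0, 4), (1, 2), (2, 1), (3, 3)]
results = [scalarized_min(points, C) for C in (0.5, 1.0, 2.0)]
```

Sweeping C traces different points of the Pareto front, which is exactly how the SVM trade-off parameter acts in OP 4.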

We can reformulate OP 2 into the knowledge set model OP 1 as follows. We define the optimization problem for margin knowledge sets M_{1,2} ∈ M_s, where M_s is a space of margin knowledge settings

M_{1,2} = (({x : h(x) > 1}, 1), ({x : h(x) < −1}, −1)). (20)

We define the knowledge measure as

k(M_{1,2}) = k(T_{1,2}) − Σ_{x_i : y_i h(x_i) ≤ 1} max(0, 1 − y_i h(x_i)). (21)

The first term is constant, so it does not change the solution. The second term is a potential knowledge measure on the complement of M_{1,2}. Due to Ax. 9, we need to show which axioms for knowledge measures are fulfilled by the second term. This measure does not fulfill all knowledge axioms. Ax. 1 is not fulfilled in general. Increasing the size of M′_{1,2} by increasing the margin while preserving the solution leads to decreased margin values and possibly to more training examples, which in turn increases the sum of margin values. So in general we cannot say whether the knowledge increases or decreases. However, for final solutions of SVM, due to Thm. 1, Ax. 1 is fulfilled. When no new training examples are added while increasing the margin, Ax. 7 is not fulfilled. We define the following uncertainty measure

u(M_{1,2}) = 0.5 u((X, c_0)) − 2/‖w‖². (22)

The first term is constant, so it does not change the solution. The second term is a potential uncertainty measure. We can notice that it is a measure on ((U_1 ∪ U_2)′, c_0), so it fulfills the axiom Ax. 16.

Proposition 2. The uncertainty measure (22) fulfills the axioms for an uncertainty measure, Def. 6.

Proof. Due to Ax. 16, we need to prove that the second term fulfills the axioms for uncertainty. It fulfills (10): when we decrease the size of a knowledge setting (as a subset), the distance between margins increases, so u defined as in (22) decreases. The other axioms are also fulfilled.

V. MEASURES OF KNOWLEDGE AND UNCERTAINTY

The aim is to define measures of knowledge and uncertainty fulfilling the axioms and comparable across different kernel-induced feature spaces. The measures (21) and (22) are comparable for the same kernel-induced feature space. Consider different spaces S_1 and S_2, for example for different values of the σ parameter of the RBF kernel. The uncertainty measure (22) depends on the space. In order to convert it to an independent measure according to Pr. 2, we define the following measure

u(M_{1,2}) = 0.5 − 2/(‖w‖² R), (23)

where R is the radius of a minimal hypersphere containing all examples in the kernel-induced feature space. We reformulated (22). For the same set of examples, R is constant, so all uncertainty axioms are fulfilled. This measure is related to the VC bound and the radius-margin bound [5]. We might approximate this measure with

u(M_{1,2}) ≈ 1 − 2/R. (24)

The measure (24) depends only on X. In the same space, it is constant. It does not fulfill all axioms from Def. 6, for example Ax. 14. However, when assuming a fixed distance between margins, so ‖w‖ = const, a stronger form of monotonicity is fulfilled, strict monotonicity. This measure is related to the bound on the number of mistakes for an online perceptron algorithm and to the bound for the fat-shattering dimension [12].
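A rough sketch of estimating R without a dedicated solver: the distance from the farthest example to the feature-space centroid. A ball centered at the centroid with this radius contains all examples, so the value upper-bounds the minimal-hypersphere radius that the paper computes with a reformulated OCSVM solver. The RBF kernel and the toy points are assumptions of this sketch:

```python
import math

def rbf(x, z, sigma):
    # Gaussian RBF kernel between two points given as tuples
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def radius_estimate(X, sigma):
    """Distance from the farthest example to the feature-space centroid,
    computed entirely through kernel evaluations (kernel trick)."""
    n = len(X)
    mean_kk = sum(rbf(a, b, sigma) for a in X for b in X) / n ** 2
    r2 = max(
        rbf(a, a, sigma) - 2 * sum(rbf(a, b, sigma) for b in X) / n + mean_kk
        for a in X)
    return math.sqrt(max(r2, 0.0))

R = radius_estimate([(0.0,), (1.0,), (2.0,)], sigma=1.0)
```

For the RBF kernel every feature vector has unit norm, so any such radius estimate stays bounded regardless of data scaling.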

The knowledge measure depends on the space. So we redefine it as

k(M_{1,2}) = k(T_{1,2}) − Σ_{i=1}^{n} sgn max(0, 1 − y_i h(x_i)). (25)

All axioms for knowledge are fulfilled. Because of the sign function, the additivity axiom is also fulfilled. By minimizing the number of examples with a margin error, this measure minimizes the number of support vectors, so we expect sparser solutions. This measure is related to the bounds for a compression scheme, to the bounds for the maximal margin hyperplane for the leave-one-out estimate, and to the Ockham approach of finding compact representations [12]. The number of support vectors is related to the C parameter in the sense that for C = 1/(νn), the parameter 0 < ν < 1 controls the number of support vectors. For a sufficiently big margin, the second term in (25) equals n, so the measure (25) is related to (21) as follows: for the same decision boundary

lim_{‖w‖→0} Σ_{i=1}^{n} max(0, 1 − y_i h(x_i)) = n. (26)

We also use the misclassification error measure defined as

k(F_{1,2}) = k(T_{1,2}) − Σ_{i=1}^{n} sgn max(0, −y_i h(x_i)) (27)

for full knowledge sets (the left and right side of a solution). All knowledge axioms are fulfilled. This measure equals the measure (25) in the limit case ‖w‖ → ∞. It is related to the bound with the VC dimension and to empirical risk minimization. For the knowledge measure (27), all margin knowledge settings with the same decision boundary have the same value of knowledge. We can use only a knowledge measure by artificially setting the uncertainty to a constant value; then, for the knowledge measure (27), we get the same results as for standard cross validation. Because the misclassification error cannot distinguish cases with the same decision boundary, we might add another knowledge measure. Even if we have the same decision boundary, the choice of hyperparameter values matters, because firstly we run the classifier again with the best parameters on the whole training set, and secondly, we perform the internal part of the double grid search. We also define the least squares criterion from least squares support vector machines (LS-SVM) as

Σ_{i=1}^{n} (y_i − h(x_i))² / R. (28)

We added the radius of a minimal hypersphere to create a measure that is more comparable on different spaces, so it is independent of data scaling. This criterion might be interpreted in the knowledge-uncertainty model as a sum of two measures: for examples with a margin error, we get knowledge for a complement of a margin knowledge setting M_{1,2}, and for examples above the margin, we get an uncertainty measure for M_{1,2}. For simplicity, we use the combined criterion as a knowledge measure.
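The validation-set measures can be sketched directly from their definitions. The decision function `h`, the validation points, and the radius value below are all hypothetical; only the second (non-constant) terms of (25) and (27) are computed, since the first terms do not change the comparison:

```python
def sgn(v):
    # sgn as used in (25) and (27): 1 for a positive argument, else 0
    return 1 if v > 0 else 0

def validation_measures(h, val_set, R):
    """ce - misclassification count, cf. the second term of (27);
    me - number of margin errors, cf. the second term of (25);
    lsr - least squares criterion divided by the radius R, cf. (28)."""
    ce = sum(sgn(max(0.0, -y * h(x))) for x, y in val_set)
    me = sum(sgn(max(0.0, 1.0 - y * h(x))) for x, y in val_set)
    lsr = sum((y - h(x)) ** 2 for x, y in val_set) / R
    return ce, me, lsr

h = lambda x: 0.5 * x                      # a toy decision function
val = [(3.0, 1), (1.0, 1), (-4.0, -1), (0.5, -1)]
ce, me, lsr = validation_measures(h, val, R=2.0)
```

Note that `me` counts the point at y·h(x) = 0.5 that `ce` misses: a margin error need not be a misclassification, which is what lets `me` discriminate between models with the same decision boundary.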

We consider grouping of knowledge and uncertainty measures. Usually data are given by specifying n-tuples, so each object is described by a corresponding ordered set of real numbers. This is a reference system (coordinate system). We usually assume that a coordinate system is equipped with a Euclidean vector space. For SVM, we use the transformation of each object to a kernel-induced feature space. We characterize knowledge and uncertainty measures in terms of a minimal space S in which the measure can be defined. For the knowledge measure (21), we need to compute an inner product, so we need a unitary vector space to define it. For the uncertainty measures (22) and (23), we need a normed vector space to compute them. The knowledge measure (25) can be defined in a minimal space X without any additional structure S. So they are comparable as in Pr. 2.

Finally, we define a knowledge-uncertainty hybrid model: a combination of multiple knowledge and uncertainty measures. We compare selected knowledge measures with selected uncertainty measures and count cases for, against, and draws. Based on counting cases, we select the best set of values of hyperparameters. We propose two versions of KUM. The first version, KUM1, is

KUM1 := (K_K, U_K) : {(ce, 1), (me, 1), (lsr, 1)} (29)

where ce is the misclassification error (27) computed on a validation set, lsr is the least squares criterion (28) computed on a validation set with the radius computed on a training set, and me is the number of examples with a margin error (25) computed on a validation set. When U_K = {1}, we take into account only a knowledge measure. The second version is

KUM2 := (K_K, U_K) : {(ce, 1), (me, 1), (lsr, 1), (sv, R)} (30)

The equivalent formulation of the SVM method in the knowledge-uncertainty framework would be ((K_K, U_K) : {(ce, 1)}). Because in the case of a draw we compare ce, for any hybrid model ((K_K, U_K) : {(ce, 1), (k, 1)}), where k is any knowledge measure, we use k when the measure ce is the same for both sets of hyperparameter values. So k is a way to specify a strategy for dealing with draw cases for SVM. The k might increase the number of draw cases when it points to a different set of hyperparameter values than ce.

The algorithm for selecting the best set of hyperparameter values using the knowledge-uncertainty framework is as follows:
1) For a candidate set of hyperparameter values, compute the uncertainty and knowledge measures.
2) Compute Pareto optimal solutions of OP 1 for the candidate set of values and the current best set of values.
3) If the candidate set of values is the unique Pareto optimal solution, replace the current best set of values.
4) If not, use the misclassification error on a validation set for comparison. If the candidate set of values is better, replace the current best set of values.
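The four steps above can be sketched for a single knowledge/uncertainty pair, assuming each candidate carries precomputed validation measures; the candidates and their values below are hypothetical:

```python
from collections import namedtuple

# each candidate carries its hyperparameters and validation measures
Candidate = namedtuple("Candidate", "params k u ce")

def dominates(a, b):
    """a dominates b in (max knowledge k, min uncertainty u)."""
    return a.k >= b.k and a.u <= b.u and (a.k > b.k or a.u < b.u)

def select_best(candidates):
    """Keep a current best; a candidate replaces it when it is the
    unique Pareto optimal solution of the pair (step 3), otherwise
    fall back to the validation misclassification error ce (step 4)."""
    best = candidates[0]
    for cand in candidates[1:]:
        if dominates(cand, best):
            best = cand
        elif not dominates(best, cand) and cand.ce < best.ce:
            best = cand                     # draw resolved by ce
    return best

best = select_best([
    Candidate(params=(1, 1), k=5, u=3, ce=0.20),
    Candidate(params=(1, 2), k=6, u=3, ce=0.18),   # dominates the first
    Candidate(params=(2, 1), k=7, u=5, ce=0.15),   # draw, but lower ce
])
```

The third candidate wins: it neither dominates nor is dominated by the second, so the draw is resolved by the lower misclassification error, mirroring step 4.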

VI. EXPERIMENTS

We use the stochastic gradient descent method from [8], called OLLAWV, for solving the SVM optimization problem; it generates solutions with a significantly smaller number of support vectors than other methods. OLLAWV can achieve linear convergence. We use one-class support vector machines (OCSVM) for computing the radius of a minimal hypersphere containing all points in a kernel-induced feature space. We redesigned OLLAWV for solving OCSVM. We compare our proposed methods against two methods. The first one is SVM. The second one is SVM with a post-sparsifier, which we call post-sparsified support vector machines (SVMPS). SVMPS runs SVM and then the sparsifier, which is the ϕ support vector classification (ϕ-SVC) [13, 14], Appendix A.

We again reformulated OLLAWV for solving ϕ-SVC. We set the margin weights ϕ_i to ϕ_i = y_i h(x_i) − 1.0 for examples for which −0.5 ≤ y_i h(x_i) < 1, where h(x_i) is the margin of the SVM solution. The examples for which y_i h(x_i) < −0.5 are removed; the remaining examples have ϕ_i = 0. A very similar post-sparsifier was proposed in [3]; the differences are in the constants used above. We use ϕ-SVC with a very high value of C, while they used a specialized solver for this problem. They also had a different aim, improving sparsity with possible degradation of generalization performance, while we want to preserve the best generalization performance.

We compare the performance of SVM, SVMPS, KUM1 and KUM2 on the real world data sets described in Table I for binary classification. More details about the data sets are available on the LIBSVM site [15]. We omitted data sets with a high number of features.

We use our own implementation of OLLAWV. For all data sets, every feature is scaled linearly to [0, 1]. We performed all tests with the RBF kernel. The number of hyperparameters to tune is 2: σ and C. For all hyperparameters, we use a double grid search method for finding the best values – first a coarse grid search is performed, then a finer grid search, as described in [16]. On the first level, the range of σ values is from 2^−9 to 2^9, and for C from 2^−9 to 2^14. We use a procedure similar to repeated double cross validation for performance comparison [17]. For the outer loop, we run a modified k-fold cross validation with k = 20, with the training set size set to 80% of all examples and a maximal training set size of 1000 examples. When it is not possible to create the next fold, we shuffle the data and start from the beginning. We use 5-fold cross validation in the inner loop for finding optimal values of the hyperparameters. After that, we run the method on the training data, and we report results on a test set.
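The double grid search can be sketched as a coarse power-of-two sweep followed by a finer sweep zoomed in around the coarse optimum. The step sizes, the zoom factors, and the smooth `score` surface below are assumptions of this sketch, not the exact procedure of [16]:

```python
import math

def double_grid_search(score, exps_sigma=(-9, 9), exps_C=(-9, 14)):
    """Coarse power-of-two sweep over (sigma, C), then a finer sweep
    zoomed in around the coarse optimum; `score` is a validation
    error to be minimized."""
    best_on = lambda grid: min(grid, key=lambda sc: score(*sc))
    coarse = [(2.0 ** a, 2.0 ** b)
              for a in range(exps_sigma[0], exps_sigma[1] + 1, 2)
              for b in range(exps_C[0], exps_C[1] + 1, 2)]
    s0, c0 = best_on(coarse)
    fine = [(s0 * 2.0 ** da, c0 * 2.0 ** db)
            for da in (-1, -0.5, 0, 0.5, 1)
            for db in (-1, -0.5, 0, 0.5, 1)]
    return best_on(fine)

# a hypothetical smooth validation-error surface for illustration
score = lambda s, C: (math.log2(s) - 0.7) ** 2 + (math.log2(C) - 3.2) ** 2
sigma, C = double_grid_search(score)
```

The second level refines the exponent resolution from 2 to 0.5 around the coarse winner, which is the zooming-in idea attributed to LIBSVM in Section I.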

We use the Friedman test with the two-tailed Nemenyi post hoc test for checking statistical significance, as suggested in [18, 19] for comparing classifiers. First, we average the results over the folds of the outer cross validation; then we perform statistical tests on the averaged results. We compute ranks for the averaged misclassification error over the folds. The statistical procedure is performed at a significance level of 0.05. We also use Bayesian statistical tests, which are preferred over null hypothesis significance testing (NHST) [20]. In particular, we use the Bayesian signed rank test implemented in R [21].
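The rank-based part of this procedure can be sketched as a from-scratch Friedman statistic over a table of per-dataset errors, using midranks for ties; the error table below is an arbitrary illustration, not results from Table I:

```python
def friedman_statistic(errors):
    """Friedman chi-square for a table errors[dataset][method]; a lower
    error gets a better (smaller) rank, ties receive midranks."""
    N, k = len(errors), len(errors[0])
    avg_rank = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            midrank = (i + j) / 2 + 1       # average rank of a tie group
            for t in range(i, j + 1):
                avg_rank[order[t]] += midrank / N
            i = j + 1
    chi2 = 12.0 * N / (k * (k + 1)) * sum(
        (r - (k + 1) / 2) ** 2 for r in avg_rank)
    return chi2, avg_rank

# an arbitrary 3-dataset x 3-method error table for illustration
chi2, ranks = friedman_statistic([[0.10, 0.20, 0.30],
                                  [0.12, 0.15, 0.40],
                                  [0.05, 0.06, 0.07]])
```

When the statistic exceeds the critical value at the 0.05 level, the Nemenyi post hoc test is applied to the average ranks, as in the procedure above.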

The results are that all presented methods have similar generalization performance (columns rs, tsf, tsn in Table I) without statistical significance. The methods KUM1 and KUM2 have a smaller number of support vectors (columns sv, svf, svn in Table I), with statistical significance between SVM and both methods. For particular data sets, they are statistically better for 1 and 3 data sets, respectively. There is no statistical significance between SVMPS and KUM1 or KUM2; however, the probability that KUM2 has a smaller number of support vectors is 0.4 (last row in column svn24). KUM1 is statistically better for 6 data sets, and KUM2 for 6 data sets.

VII. DISCUSSION

The technique of multi-objective optimization in the context of SVM has already been proposed. The closest research is [22], where the authors use a bicriteria formulation of SVM for model selection with different objective functions, like modified radius-margin bounds and the number of support vectors. In [11], the authors also proposed a bicriteria formulation of SVM. In [23], the authors proposed an evolutionary multi-objective model and instance selection approach for SVM with a Pareto-based ensemble. In [24], the authors proposed a framework for SVM based on the principle of multi-objective optimization, minimizing the empirical risk and minimizing the model capacity. The proposed framework differs from the above solutions by using a bicriteria formulation of SVM with measures computed on a validation set. We also provide a theoretical axiomatized framework for justifying the measures.

The technique of axiomatization has been used for the concept of risk in finance. The risk measure conditional value-at-risk (CVaR), also known as tail value-at-risk and expected shortfall, can be modeled by a set of axioms for a coherent risk measure [25]. The empirical risk for ν support vector classification (ν-SVC) is related to CVaR. Uncertainty itself has been axiomatized in [26]. Knowledge itself has been axiomatized in the field of epistemic modal logic.

However axioms are based on logic, not on mathematical spaces.

VIII. SUMMARY

We axiomatized knowledge and uncertainty for solving a classification problem with SVM. We created a framework

TABLE I

Performance of SVM, SVMPS, KUM1, and KUM2. The numbers in the column descriptions denote the methods: 1 – SVM, 2 – SVMPS, 3 – KUM1, 4 – KUM2. Column descriptions for all rows except the last two: dn – the name of a data set, s – the number of all examples, d – the dimension of a problem, ce – the mean misclassification error (the best method is in bold), tsn – the Bayesian signed rank probability for the misclassification error (values greater than 0.9 are in bold), sv – the number of support vectors (the smallest number is in bold), svn – the Bayesian signed rank probability for the number of support vectors. For the second-to-last row: tsf – the Friedman statistic for average ranks for the misclassification error (significant values are in bold), svf – the Friedman statistic for average ranks for the number of support vectors, ce – the average rank for the mean misclassification error (the best method is in bold), tsn – the Nemenyi statistic for average ranks for the misclassification error, reported when the Friedman statistic is significant (significant values are in bold), sv – the average rank for the number of support vectors, svn – the Nemenyi statistic for average ranks for the number of support vectors, reported when the Friedman statistic is significant (significant values are in bold). For the last row: tsn – the Bayesian signed rank probability for the misclassification error, svn – the Bayesian signed rank probability for the number of support vectors.

dn s/tsf d/svf ce1 ce2 ce3 ce4 tsn12 tsn13 tsn23 tsn14 tsn24 tsn34 sv1 sv2 sv3 sv4 svn12 svn13 svn23 svn14 svn24 svn34
a1a 24947 123 0.165 0.166 0.163 0.163 0.0 0.02 0.03 0.02 0.03 0.0 366 186 354 346 0.85 0.49 0.1 0.58 0.1 0.46
australian 690 14 0.146 0.149 0.146 0.141 0.0 0.08 0.12 0.32 0.36 0.26 350 53 345 342 0.94 0.43 0.01 0.58 0.01 0.38
breast-cancer 675 10 0.03 0.032 0.027 0.029 0.0 0.06 0.17 0.03 0.1 0.07 80 31 74 57 0.93 0.37 0.08 0.85 0.16 0.87
cod-rna 100000 8 0.053 0.051 0.055 0.056 0.0 0.0 0.0 0.0 0.0 0.0 367 60 362 345 0.95 0.39 0.0 0.57 0.0 0.39
colon-cancer 62 2000 0.15 0.165 0.146 0.162 0.31 0.16 0.3 0.26 0.33 0.14 28 40 28 25 0.0 0.33 0.95 0.75 0.95 0.73
covtype 100000 54 0.277 0.274 0.278 0.278 0.0 0.0 0.0 0.0 0.0 0.0 609 406 549 545 0.81 0.66 0.27 0.78 0.27 0.55
diabetes 768 8 0.233 0.232 0.23 0.231 0.01 0.13 0.14 0.12 0.12 0.0 345 54 331 330 0.95 0.56 0.0 0.62 0.0 0.25
fourclass 862 2 0.001 0.004 0.001 0.001 0.0 0.0 0.05 0.0 0.05 0.0 624 92 625 113 0.95 0.45 0.0 0.95 0.29 0.95
german numer 1000 24 0.248 0.245 0.248 0.246 0.26 0.11 0.19 0.25 0.24 0.1 406 161 400 398 0.94 0.57 0.01 0.63 0.01 0.44
heart 270 13 0.178 0.181 0.181 0.18 0.02 0.08 0.18 0.14 0.22 0.08 112 36 111 107 0.95 0.35 0.01 0.5 0.01 0.46
HIGGS 100000 28 0.42 0.42 0.421 0.419 0.0 0.0 0.0 0.02 0.01 0.09 790 286 781 773 0.95 0.32 0.0 0.51 0.0 0.25
ijcnn1 100000 22 0.084 0.084 0.084 0.085 0.0 0.0 0.0 0.0 0.0 0.0 228 549 223 209 0.09 0.31 0.88 0.42 0.88 0.45
ionosphere sc 350 33 0.061 0.068 0.066 0.069 0.14 0.07 0.27 0.08 0.27 0.01 121 174 72 69 0.01 0.85 0.95 0.89 0.95 0.59
liver-disorders 341 5 0.355 0.35 0.349 0.351 0.29 0.31 0.3 0.28 0.28 0.07 194 39 191 190 0.95 0.38 0.0 0.5 0.0 0.39
madelon 2600 500 0.423 0.421 0.426 0.426 0.08 0.03 0.02 0.04 0.02 0.0 911 782 899 898 0.54 0.36 0.4 0.42 0.4 0.17
mushrooms 8124 111 0.001 0.01 0.001 0.001 0.0 0.0 0.32 0.0 0.32 0.0 527 371 80 48 0.63 0.8 0.75 0.9 0.88 0.89
phishing 5785 68 0.06 0.058 0.061 0.061 0.0 0.0 0.0 0.0 0.0 0.0 347 429 191 186 0.18 0.87 0.86 0.89 0.86 0.31
skin nonskin 51432 3 0.008 0.01 0.008 0.008 0.0 0.0 0.03 0.0 0.02 0.0 447 325 205 140 0.5 0.76 0.66 0.84 0.7 0.63
splice 2990 60 0.125 0.12 0.132 0.131 0.32 0.0 0.02 0.0 0.02 0.09 544 799 373 365 0.0 0.91 0.95 0.92 0.95 0.25
sonar scale 208 60 0.132 0.121 0.155 0.161 0.52 0.29 0.18 0.2 0.13 0.0 66 133 62 62 0.0 0.62 0.95 0.61 0.95 0.16
SUSY 100000 18 0.24 0.243 0.241 0.24 0.0 0.0 0.01 0.0 0.01 0.01 533 152 525 520 0.95 0.46 0.0 0.63 0.0 0.51
svmguide1 6910 4 0.04 0.041 0.039 0.039 0.0 0.0 0.01 0.0 0.0 0.0 159 100 134 129 0.86 0.46 0.2 0.73 0.21 0.52
svmguide3 1243 21 0.19 0.187 0.191 0.19 0.21 0.0 0.06 0.0 0.08 0.0 389 103 389 389 0.95 0.24 0.0 0.38 0.0 0.23
w1a 34703 300 0.023 0.024 0.023 0.024 0.0 0.0 0.0 0.0 0.0 0.0 69 335 55 56 0.0 0.61 0.95 0.72 0.95 0.55
websam unigr 100000 134 0.062 0.061 0.065 0.065 0.0 0.0 0.0 0.0 0.0 0.0 236 418 180 179 0.0 0.87 0.95 0.86 0.95 0.37
3.23 29.41 2.46 2.52 2.51 2.51 – – – – – – 2.94 2.25 2.62 2.19 4.0 2.68 −1.31 5.15 1.15 2.46
0.0 0.0 0.04 0.0 0.0 0.0 0.71 0.93 0.36 0.96 0.4 0.92

for hyperparameter optimization based on multi-objective optimization for the knowledge-uncertainty model. We showed the usefulness of knowledge and uncertainty measures computed on a validation set, specifically the number of margin errors and the radius of the minimal hypersphere, for improving the sparsity of a solution.
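The two validation-set measures named above can be sketched in a few lines. This is an illustration only, under simplifying assumptions that are ours, not the paper's exact definitions: the decision function is taken as linear in input space, and the minimal enclosing hypersphere radius is approximated by the largest distance from the centroid (a crude proxy, since the true minimal enclosing ball generally has a different center).

```python
def margin_errors(points, labels, w, b):
    """Count validation examples inside or beyond the margin,
    i.e. those with y * (w . x + b) < 1."""
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    return sum(1 for x, y in zip(points, labels) if y * (dot(w, x) + b) < 1)

def centroid_radius(points):
    """Proxy for the minimal enclosing hypersphere radius: the largest
    Euclidean distance from the centroid of the points."""
    m = len(points[0])
    center = [sum(x[j] for x in points) / len(points) for j in range(m)]
    return max((sum((x[j] - center[j]) ** 2 for j in range(m))) ** 0.5
               for x in points)

# Toy validation set: two classes separated along the second coordinate.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
labs = [-1, -1, 1, 1]
print(margin_errors(pts, labs, (0.0, 1.0), -1.0))  # all points exactly on the margin
print(centroid_radius(pts))
```

In the proposed framework such quantities would be computed per hyperparameter candidate and traded off against the validation misclassification error.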

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for discussion and useful suggestions. The work is financed by the National Science Centre in Poland, project id 289884, UMO-2015/17/D/ST6/04010, titled “Development of Models and Methods for Incorporating Knowledge to Support Vector Machines”.

APPENDIX A
ϕ-SVC

We have a set of $n$ training vectors $\vec{x}_i$ for $i \in \{1, \ldots, n\}$, where $\vec{x}_i = (x_{1i}, \ldots, x_{mi})$. The $m$ is the dimension of the problem. The ϕ-SVC soft margin optimization problem with the $\|\cdot\|_1$ norm is

OP 4.
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \max\left(0, 1 + \varphi_i - y_i h\left(\vec{x}_i\right)\right) \tag{31}$$

for $i \in \{1, \ldots, n\}$, where $C > 0$, $\varphi_i \in \mathbb{R}$, and $h(\vec{x}_i) = \vec{w} \cdot \phi(\vec{x}_i) + b$.

When $\vec{\varphi} = \vec{0}$, OP 4 is equivalent to support vector classification (SVC). The $h(\vec{x}) = \vec{w} \cdot \phi(\vec{x}) + b = 0$ is the decision curve of the classification problem. Data are from a kernel-induced feature space.
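As a sanity check of the objective in (31), here is a minimal sketch that evaluates it for a linear kernel (so the feature map φ is the identity, an assumption of this illustration); with all $\varphi_i = 0$ it reduces to the standard SVC soft-margin objective, consistent with the equivalence noted above.

```python
def phi_svc_objective(w, b, X, y, phi, C):
    """Value of the phi-SVC soft-margin objective (31) for a linear
    kernel: 0.5 * ||w||^2 + C * sum_i max(0, 1 + phi_i - y_i * h(x_i)),
    where h(x) = w . x + b."""
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    hinge = sum(max(0.0, 1 + p - yi * (dot(w, xi) + b))
                for xi, yi, p in zip(X, y, phi))
    return 0.5 * dot(w, w) + C * hinge

# Two separable points; w = (1, 0), b = 0 puts both exactly on the margin.
X = [(2.0, 0.0), (-2.0, 0.0)]
y = [1, -1]
print(phi_svc_objective((1.0, 0.0), 0.0, X, y, [0.0, 0.0], 1.0))  # 0.5: zero hinge loss
print(phi_svc_objective((1.0, 0.0), 0.0, X, y, [2.0, 2.0], 1.0))  # 2.5: phi widens the required margin
```

Positive $\varphi_i$ values demand a larger functional margin for example $i$ before its hinge term vanishes, which is how the formulation incorporates per-example knowledge.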

REFERENCES

[1] M. F. Delgado, E. Cernadas, S. Barro, and D. G. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.

[2] R. G. Mantovani, A. L. D. Rossi, J. Vanschoren, B. Bischl, and A. C. P. L. F. de Carvalho, “Effectiveness of random search in SVM hyper-parameter tuning,” in 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015, 2015, pp. 1–8.

[3] A. Cotter, S. Shalev-Shwartz, and N. Srebro, “Learning optimally sparse support vector machines,” in Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, 2013, pp. 266–274.

[4] J. Wainer and G. C. Cawley, “Empirical evaluation of resampling procedures for optimising SVM hyperparameters,” J. Mach. Learn. Res., vol. 18, pp. 15:1–15:35, 2017.

[5] K. Duan, S. S. Keerthi, and A. N. Poo, “Evaluation of simple performance measures for tuning SVM hyperparameters,” Neurocomputing, vol. 51, pp. 41–59, 2003.

[6] I. J. Goodfellow, Y. Bengio, and A. C. Courville, Deep Learning, ser. Adaptive Computation and Machine Learning. MIT Press, 2016.

[7] G. Chen, W. Florero-Salinas, and D. Li, “Simple, fast and accurate hyper-parameter tuning in gaussian-kernel SVM,” in 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14-19, 2017, 2017, pp. 348–355.

[8] G. Melki, V. Kecman, S. Ventura, and A. Cano, “OLLAWV: online learning algorithm using worst-violators,” Appl. Soft Comput., vol. 66, pp. 384–393, 2018.

[9] M. Orchel, “Solving classification problems by knowledge sets,” Neurocomputing, vol. 149, pp. 1109–1124, 2015.

[10] K. Miettinen, Nonlinear Multiobjective Optimization, ser. International Series in Operations Research & Management Science. Springer US, 1999.

[11] H. Aytug and S. Sayin, “Exploring the trade-off between generalization and empirical errors in a one-norm SVM,” Eur. J. Oper. Res., vol. 218, no. 3, pp. 667–675, 2012.

[12] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[13] M. Orchel, “Incorporating detractors into SVM classification,” in Man-Machine Interactions, ser. Advances in Intelligent and Soft Computing, K. Cyran, S. Kozielski, J. Peters, U. Stańczyk, and A. Wakulicz-Deja, Eds. Springer Berlin Heidelberg, 2009, vol. 59, pp. 361–369.

[14] M. Orchel, “Incorporating a priori knowledge from detractor points into support vector classification,” in Adaptive and Natural Computing Algorithms, ser. Lecture Notes in Computer Science, A. Dobnikar, U. Lotric, and B. Šter, Eds. Springer Berlin Heidelberg, 2011, vol. 6594, pp. 332–341.

[15] “LIBSVM data sets,” www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, June 2011.

[16] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” 2010.

[17] P. Filzmoser, B. Liebmann, and K. Varmuza, “Repeated double cross validation,” J. Chemometr., vol. 23, no. 4, pp. 160–171, 2009.

[18] N. Japkowicz and M. Shah, Eds., Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.

[19] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.

[20] A. Benavoli, G. Corani, J. Demsar, and M. Zaffalon, “Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis,” J. Mach. Learn. Res., vol. 18, pp. 77:1–77:36, 2017.

[21] J. Carrasco, S. García, M. del Mar Rueda, and F. Herrera, “rNPBST: An R package covering non-parametric and bayesian statistical tests,” in Hybrid Artificial Intelligent Systems - 12th International Conference, HAIS 2017, La Rioja, Spain, June 21-23, 2017, Proceedings, 2017, pp. 281–292.

[22] C. Igel, “Multi-objective model selection for support vector machines,” in Evolutionary Multi-Criterion Optimization, Third International Conference, EMO 2005, Guanajuato, Mexico, March 9-11, 2005, Proceedings, 2005, pp. 534–546.

[23] A. Rosales-Pérez, S. García, J. A. Gonzalez, C. A. C. Coello, and F. Herrera, “An evolutionary multiobjective model and instance selection for support vector machines with pareto-based ensembles,” IEEE Trans. Evolutionary Computation, vol. 21, no. 6, pp. 863–877, 2017.

[24] J. Bi, “Multi-objective programming in SVMs,” in Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003, pp. 35–42.

[25] J. Gotoh, A. Takeda, and R. Yamamoto, “Interaction between financial risk measures and machine learning methods,” Comput. Manag. Science, vol. 11, no. 4, pp. 365–402, 2014.

[26] D. B. Liu, Uncertainty Theory. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 205–234.
