
Clustering by Support Vector Manifold Learning

Marcin Orchel

Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland

Email: morchel@agh.edu.pl

Abstract—We solve a manifold learning problem by searching for hypersurfaces fitted to the data. The method, called support vector manifold learning (SVML), transforms data to a kernel-induced feature space, duplicates points, shifts them in two opposite directions and solves a classification problem using support vector machines (SVM). Then, we cluster data by mapping found hypersurfaces to clusters; the method is called support vector manifold learning clustering (SVMLC). We analyze how the choice of direction of moving points influences the error for fitting to the data. Moreover, we derive the generalization bound with Vapnik-Chervonenkis (VC) dimension for SVML.

The experiments on synthetic and real world data sets show that SVML is better in fitting to the data than one-class support vector machines (OCSVM) and kernel principal component analysis (KPCA) with statistical significance for OCSVM. The SVMLC method has comparable performance in clustering to support vector clustering (SVCL) and KPCA. However, the SVMLC allows for improved grouping of points in the form of manifolds.

I. INTRODUCTION

In machine learning, clustering is an example of unsupervised learning, as opposed to supervised learning like classification and regression. A cluster in Euclidean space is a separated group of points. It can be characterized in multiple ways, for example with 1) a boundary, 2) a center (prototype) (boundaries may be defined by checking the distance), 3) a cluster core (which could also be a cluster). Clustering methods map a point to a cluster. One of the methods is SVCL, which uses OCSVM [1]. It finds boundaries for clusters by predicting the support of a distribution (quantile functions) in a kernel-induced feature space. The k-means method tries to find centers of clusters.

We use another characterization of clusters: the characteristic manifold of a cluster (a main manifold, a principal manifold or a center manifold) [2]. Such a manifold characterizes the structure of a cluster. It generalizes the center of a cluster from a single point to a manifold. Clusters can have arbitrary shapes, such as elongated shapes, thus a center being only one point is not enough to characterize such a cluster. We can discover non-spherical clusters even by center-based clustering methods by using special distance measures, like Mahalanobis

distance, or by mapping data to a kernel-induced feature space (kernel k-means). In the limit case, when clusters are in the form of manifolds and sample data are generated without error, the characteristic manifold is equivalent to a boundary manifold. In another limit case, when clusters are spherical with randomly generated data points, we can only find some artificial characteristic manifolds. Clusters characterized by a manifold could be generated artificially from any cluster by removing the internal part of the cluster. Then the boundary of the cluster, which can be modeled by a manifold, becomes a characteristic manifold (Fig. 1a). Both boundary manifolds and characteristic manifolds are n − 1 dimensional spaces embedded in an n dimensional space (codimension equal to 1), while a center of a cluster is only a zero dimensional space.

We solve clustering problems by directly finding characteristic manifolds; the method is called SVMLC. It solves a multiple manifold learning problem: fit multiple manifolds (hypersurfaces) to data points and generalize to unseen data.

By manifold learning, we mean fitting only one manifold to the data. An n − 1 dimensional manifold is a topological space that locally resembles Euclidean space of dimension n − 1 near each point. In one dimension (n = 2), a figure-eight shaped curve is not a manifold (because it has a singularity). We could generalize manifolds to hypersurfaces (manifolds with singularities), which we define as a set of points fulfilling a nonlinear equation $f(\vec{x}) = 0$, where $f(\vec{x})$ is a scalar function.

The solutions of SVM, as well as of OCSVM and KPCA, are hypersurfaces. For the radial basis function (RBF) kernel, a linear combination of kernel functions is a continuously differentiable function, so the solutions are differentiable manifolds.

Moreover, a function can be converted to a nonlinear equation, so regression is a special case of manifold learning. Manifold learning is often used for dimensionality reduction when the goal is to preserve some property of the data set, like distances between points, after projecting data to the subspace and unfolding a manifold to get new point coordinates.

The idea of SVML is to transform a feature space to a kernel-induced feature space and then fit to the data with a hypothesis space containing only hyperplanes, while generalizing well. This is a special case of a regression problem in which we do not have a defined predictive variable. There exist regression methods that define fitting in terms of perpendicular distances (orthogonal regression) of points to a function, for example total least squares (TLS) methods. They model data with errors in all covariates. The solution of principal component analysis (PCA) minimizes the perpendicular distances to principal components [3]. The δ support vector regression (δ-SVR) method [4, 5] has a similar property, as mentioned in [5]. The δ-SVR duplicates and shifts points in the original feature space in the direction of the predictive variable, so we still need to define a predictive variable. Our hypothesis is that δ-SVR is robust to the selection of direction, so choosing a specific direction should not have a noticeable impact on the performance of fitting data to the hyperplane with δ-SVR. We need a regression method that works completely in a kernel-induced feature space. For example, this is not the case for ε-insensitive support vector regression (ε-SVR), because the ε bands are defined in the original feature space and the kernel-induced feature space is constructed only for the covariates, without the predictive variable. We use a slight modification of δ-SVR which is able to solve a regression problem completely in a kernel-induced feature space. The idea of the modification is to shift points in a kernel-induced feature space instead of the original feature space: we shift points in the direction of an arbitrarily chosen predictive variable in a kernel-induced feature space.

The hypotheses of our work are: 1) SVML is better in recognizing manifolds than OCSVM and KPCA, 2) SVMLC is better in recognizing clusters characterized by manifolds than SVCL and KPCA, 3) SVMLC has comparable performance in recognizing any clusters compared to SVCL and KPCA. Regarding the first hypothesis: the OCSVM is used for finding quantile functions, but it could also be used for recognizing manifolds. Regarding the second hypothesis: the SVCL finds high density regions, so for manifolds that are close to each other, and for small data sets when it is hard to rely on density, SVCL has problems in discriminating the manifolds.

In [6], the authors design a regularization framework for unsupervised learning. In particular, they add a regularization term to PCA. Our method is similar to this approach. The difference is that we define the empirical error in terms of SVM rather than orthogonal distances. The concept of a center plane has already been used, for example in the k-plane clustering method [7].

In [2], the authors use twin SVM for plane-based clustering. They also use the kernel trick, so they look for k center manifolds.

The disadvantage is that the number of clusters is not found automatically by the method. In [8], the authors present a semi-supervised framework for manifold learning, kernel methods and spectral graph theory. The framework can be extended to unsupervised learning, and their approach leads to a regularized spectral clustering. In [9], the authors use the isometric feature mapping (ISOMAP) algorithm for dimensionality reduction as a preprocessor for predicting business failure by using

SVM. In [10], the authors use Laplacian SVM, which combines SVM with a graph Laplacian, for manifold regularization in semi-supervised learning for image classification. Regularizing the graph penalizes rapid changes of the decision function between near samples in the graph. In [11], the authors reduce dimensionality by using manifold learning for extracting variables for least squares support vector machines (LS-SVM).

They test linear manifold learning methods such as PCA and metric multidimensional scaling, as well as nonlinear manifold learning methods such as ISOMAP, local linear embedding, Laplacian eigenmaps and local tangent space alignment. In [12], the authors use a combination of manifold learning with semi-supervised learning and a new formulation with criteria of manifold consistency and the hinge loss of class prediction. A weighted graph is used to capture the manifold structure. In [13], the authors group data into several clusters, and then train a linear SVM in each cluster.

The outline of the article is as follows. First, we introduce SVML. Then we present SVMLC. Then, we describe experiments and conclusions.

II. SVML FOR MANIFOLD LEARNING

We have sample data $\vec{x}_i$ for $i \in \{1, \ldots, n\}$ in a Euclidean space with dimensionality $m$. We also have a function $\varphi(\vec{x})$ which transforms the original feature space to a kernel-induced feature space. We duplicate every training point and shift up the original point, and down the duplicated point, in a kernel-induced feature space. Then, we use support vector classification (SVC). Points moved up get class $1$, and points moved down get class $-1$. We shift a point by a vector $y_i t \varphi(\vec{c})$, where $\vec{c}$ is a shifting point defined in the original feature space, $t$ is a translation parameter, $y_i = 1$ for shifting up, and $y_i = -1$ for shifting down. Shifting in the original feature space would be much trickier, because we would need to fit to the nonlinear manifold. We can compute a kernel function for two shifted data points $\vec{x}_i$ and $\vec{x}_j$ as follows:

$$\left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \left(\varphi(\vec{x}_j) + y_j t \varphi(\vec{c})\right) = \varphi(\vec{x}_i)^T \varphi(\vec{x}_j) + y_j t \varphi(\vec{x}_i)^T \varphi(\vec{c}) + y_i t \varphi(\vec{c})^T \varphi(\vec{x}_j) + y_i y_j t^2 \varphi(\vec{c})^T \varphi(\vec{c}) .$$ (1)–(3)

We can replace the scalar products with a kernel function $K$ and we get

$$K(\vec{x}_i, \vec{x}_j) + y_j t K(\vec{x}_i, \vec{c}) + y_i t K(\vec{c}, \vec{x}_j) + y_i y_j t^2 K(\vec{c}, \vec{c}) .$$ (4)–(5)

We can construct a cross kernel as follows:

$$\left(\varphi(\vec{x}_i) + y_i t \varphi(\vec{c})\right)^T \varphi(\vec{x}) = \varphi(\vec{x}_i)^T \varphi(\vec{x}) + y_i t \varphi(\vec{c})^T \varphi(\vec{x}) ,$$ (6)–(7)

so

$$K(\vec{x}_i, \vec{x}) + y_i t K(\vec{c}, \vec{x}) .$$ (8)
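As an illustration of (4)–(5) and (8), the following sketch (our own; the helper names and the choice of an RBF base kernel are assumptions, not the paper's reference implementation) builds the Gram matrix for the 2n duplicated points and the cross kernel used to evaluate the solution at new points.

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between rows of A (p x m) and rows of B (q x m)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svml_gram(X, c, t, sigma):
    """Gram matrix (4)-(5) for the duplicated points shifted by +/- t*phi(c)."""
    n = X.shape[0]
    y = np.concatenate([np.ones(n), -np.ones(n)])   # +1: shifted up, -1: shifted down
    Xd = np.vstack([X, X])                          # original inputs, duplicated
    K = rbf(Xd, Xd, sigma)                          # K(x_i, x_j)
    Kc = rbf(Xd, c[None, :], sigma).ravel()         # K(x_i, c)
    Kcc = rbf(c[None, :], c[None, :], sigma)[0, 0]  # K(c, c)
    G = (K
         + t * y[None, :] * Kc[:, None]             # y_j t K(x_i, c)
         + t * y[:, None] * Kc[None, :]             # y_i t K(c, x_j)
         + t ** 2 * np.outer(y, y) * Kcc)           # y_i y_j t^2 K(c, c)
    return G, y, Xd

def svml_cross_kernel(Xd, y, c, t, sigma, Xtest):
    """Cross kernel (8): K(x_i, x) + y_i t K(c, x) for test points x."""
    return rbf(Xd, Xtest, sigma) + t * y[:, None] * rbf(c[None, :], Xtest, sigma)
```

For example, G and y could be passed to scikit-learn's SVC(kernel="precomputed"), and predictions on new points would use the transpose of the cross kernel (shape n_test x 2n).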

We use SVC with the new kernel for solving SVML. We have doubled the number of points; in spite of this, the number of support vectors is at most equal to n + 1.


Proof. We can write the part of the solution of the SVC dual optimization problem for two corresponding points $i$ and $i + n$ (original and duplicated), where $y_i = 1$ and $y_{i+n} = -1$, as

$$\alpha_i \left(K(\vec{x}_i, \vec{x}) + t K(\vec{c}, \vec{x})\right) - \alpha_{i+n} \left(K(\vec{x}_i, \vec{x}) - t K(\vec{c}, \vec{x})\right) = \left(\alpha_i - \alpha_{i+n}\right) K(\vec{x}_i, \vec{x}) + \left(\alpha_i + \alpha_{i+n}\right) t K(\vec{c}, \vec{x}) .$$ (9)–(11)

The second term can be summed over all $i$ and we get

$$\sum_{i=1}^{n} \left(\alpha_i + \alpha_{i+n}\right) t K(\vec{c}, \vec{x}) .$$ (12)

It can be interpreted as adding one artificial point $\vec{c}$ to the solution. So, we have one more basis function in the solution, which has its maximal value at $\vec{c}$ and decays far from this point.

The solution of SVML is a differentiable manifold. We design several models for discovering the influence of $\vec{c}$ on the prediction performance. First, we design a model with shifting of a hyperplane. It is the model for a limit case, when we have infinitely many data points.

Proposition II.1. Shifting a hyperplane by any value of $\vec{c}$ gives a new hyperplane which differs from the original only in the free term $b$.

Proof. We can write $\vec{c}$ as a sum of two vectors, one perpendicular to the hyperplane and the other parallel to the hyperplane. The parallel one does not change the hyperplane, only the perpendicular one does. The hyperplane before shifting is $\vec{a} \cdot \vec{x} + b = 0$, and after shifting it is $\vec{a} \cdot (\vec{x} + \vec{c}) + b = 0$, which can be written as $\vec{a} \cdot \vec{x} + \vec{a} \cdot \vec{c} + b = 0$. It means that the shift is equivalent to changing $b$ to $\vec{a} \cdot \vec{c} + b$.

By adding a shift parameter $t$ in front of the vector $\vec{c}$, as $t\vec{c}$, we can get any free term. For example, for $b_f$ we have $\vec{a} \cdot t\vec{c} + b = b_f$, so we get

$$t = \frac{b_f - b}{\vec{a} \cdot \vec{c}} ,$$ (13)

when $\vec{a} \cdot \vec{c} \neq 0$. Otherwise, we get $b = b_f$. It could also be redefined in a kernel-induced feature space for the SVC hyperplane. The conclusion is that it is enough to take into consideration an arbitrarily chosen $\vec{c}$. The next, more accurate model of shifting is with a hyperplane bounded by a hypersphere such that the center of the hypersphere belongs to the hyperplane. So the relevant figure is an n − 2 dimensional hypersphere. We duplicate and shift the n − 2 dimensional hypersphere by $\vec{c}$ and $-\vec{c}$. We have the following lemma about the size of a minimal hypersphere containing both manifolds after shifting.
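A minimal numeric check of (13), with arbitrary example values of our own choosing:

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])      # hyperplane normal vector
b = 0.3                             # original free term
c = np.array([1.0, 1.0, 1.0])       # shifting vector, a . c != 0
b_f = -0.7                          # desired free term

t = (b_f - b) / a.dot(c)            # equation (13)
# shifting by t*c changes the free term to a . (t*c) + b, which equals b_f
assert np.isclose(a.dot(t * c) + b, b_f)
```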

Lemma II.2. After duplicating and shifting an n − 1 dimensional hyperplane constrained by an n − 1 dimensional hypersphere, the maximal distance from the original center of the hypersphere to any point belonging to a shifted n − 2 dimensional hypersphere is attained for a point such that, after projecting this point to the n − 1 dimensional hyperplane (before the shift), the vector from $\vec{0}$ to this point is parallel to the vector from $\vec{0}$ to the projected center of one of the shifted n − 2 dimensional hyperspheres.

Proof. All n − 1 dimensional hyperplanes constrained by an n − 1 dimensional hypersphere are n − 2 dimensional hyperspheres. The vector $\vec{c}$ has two components: the one parallel to the n − 1 dimensional hyperplane is $\vec{c}_m$, the other, perpendicular to the n − 1 dimensional hyperplane, is $\vec{c}_p$. A vector from the original center to any shifted point also has two similar components, and the latter is the same up to the sign for all shifted points. So we are looking for a point which has the maximal length of the first component, that is,

$$\max_{\|\vec{r}\| = R} \|\vec{r} + \vec{c}\| ,$$ (14)

where $\vec{r}$ is a vector parallel to the original n − 1 dimensional hyperplane and $R$ is the radius of the n − 2 dimensional hyperspheres. We get

$$\sqrt{\|\vec{r} + \vec{c}\|^2} = \sqrt{\|\vec{r}\|^2 + 2\vec{r} \cdot \vec{c} + \|\vec{c}\|^2} .$$ (15)

The value varies only with the scalar product, which is maximal when the vectors are parallel.

Using Lem. II.2, we can obtain a formula for $R_n$, the radius of a minimal hypersphere containing both hyperplanes after shifting.

Lemma II.3. The radius $R_n$ of a minimal hypersphere containing both hyperplanes after shifting is equal to

$$R_n = \left\|\vec{c} + R\,\vec{c}_m / \|\vec{c}_m\|\right\| ,$$ (16)

where $\vec{c}_m$ is defined in (17) and $\|\vec{c}_m\| \neq 0$. For $\|\vec{c}_m\| = 0$, we get $R_n = \sqrt{\|\vec{c}\|^2 + R^2}$.

Proof. We assume without loss of generality that the original center of the hypersphere is $\vec{0}$. After projecting $\vec{c}$ to the n − 1 dimensional hyperplane we get a projected vector

$$\vec{c}_m = \vec{c} - \frac{b + \vec{w} \cdot \vec{c}}{\|\vec{w}\|^2}\,\vec{w} .$$ (17)

Because of the assumption about the original center being equal to $\vec{0}$, we have $b = 0$, so

$$\vec{c}_m = \vec{c} - \frac{\vec{w} \cdot \vec{c}}{\|\vec{w}\|^2}\,\vec{w} .$$ (18)

Then, we scale $\vec{c}_m$ so that it has length $R$:

$$\vec{c}_{ms} = R\,\vec{c}_m / \|\vec{c}_m\| .$$ (19)

Finally, we add $\vec{c}$. For the special case when $\|\vec{c}_m\| = 0$, we derive the formula for $R_n$ from the Pythagorean theorem.
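The projection (17)–(19) and the radius (16) can be computed directly; the sketch below is our own illustration (it assumes the original center at the origin, so b = 0 as in (18), and the function name is hypothetical).

```python
import numpy as np

def radius_after_shift(w, c, R):
    """Radius R_n from Lemma II.3 for a shift c of a hyperplane with normal w,
    where R is the radius of the n-2 dimensional hypersphere."""
    c_m = c - (w.dot(c) / w.dot(w)) * w          # projection (18), parallel component
    norm_cm = np.linalg.norm(c_m)
    if np.isclose(norm_cm, 0.0):                 # special case: purely perpendicular shift
        return np.sqrt(c.dot(c) + R ** 2)        # Pythagorean theorem
    c_ms = R * c_m / norm_cm                     # scaling (19)
    return np.linalg.norm(c + c_ms)              # equation (16)
```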

For a shift with a constant $\vec{c}_p$, we get the minimal value of $R_n$ when $\vec{c}_m = \vec{0}$; otherwise $R_n$ is bigger.

We derive generalization bounds with the VC dimension as described in [5]. We increase the shift by $\vec{c}$. The perpendicular distances are increased by $\|\vec{c}_p\|$. For canonical hyperplanes we assume that

$$\|\vec{w}\| \leq D ,$$ (20)
$$1/\|\vec{w}\| \geq 1/D .$$ (21)

After shifting we get

$$1/\|\vec{w}\| \geq 1/D + \|\vec{c}_p\| ,$$ (22)
$$1/\|\vec{w}\| \geq \left(1 + D\|\vec{c}_p\|\right)/D ,$$ (23)
$$\|\vec{w}\| \leq D/\left(1 + D\|\vec{c}_p\|\right) .$$ (24)

We can improve the generalization bounds by using the radius of the minimal hypersphere from Lem. II.3 and (24); then we get an improvement when

$$\frac{\left\|\vec{c} + R\,\vec{c}_m / \|\vec{c}_m\|\right\|^2 D^2}{\left(1 + D\|\vec{c}_p\|\right)^2} \leq R^2 D^2 ,$$ (25)
$$\frac{\left\|\vec{c} + R\,\vec{c}_m / \|\vec{c}_m\|\right\|^2}{\left(1 + D\|\vec{c}_p\|\right)^2} \leq R^2 .$$ (26)

The radius of the minimal hypersphere is for a model with shifted hyperplanes, so for a model with points, a hypersphere containing all points can have a smaller radius and we can further improve the bound for the VC dimension. For the special case when $\|\vec{c}_m\| = 0$, we get improved bounds when

$$\frac{\|\vec{c}\|^2 + R^2}{\left(1 + D\|\vec{c}_p\|\right)^2} \leq R^2 .$$ (27)

The conclusion for the second model is that the shifting operation may be advantageous in terms of generalization bounds.

In the next model, we find a hyperplane that maximizes the margin between two n − 2 dimensional hyperspheres shifted up and down. We can notice that shifting both hyperspheres in the direction of a normal vector to the hyperplane containing each of the hyperspheres ($\|\vec{c}_m\| = 0$) leads to a solution equivalent to the hyperplane that contained the n − 2 dimensional hypersphere before duplicating and shifting.

Proposition II.4. When $\vec{c}_p$ is constant and $2\|\vec{c}_m\| \leq R$, then the solution of maximizing the margin between two n − 2 dimensional hyperspheres is equivalent to the hyperplane that contains the n − 2 dimensional hypersphere before duplicating and shifting.

Proof (sketch). We can construct a hypersphere with the center in a point lying on the boundary of one of the shifted n − 2 dimensional hyperspheres in the direction of $\vec{c}_m$, and with the radius equal to $2\|\vec{c}_p\|$. We are looking for a tangent hyperplane which does not cross the second shifted n − 2 dimensional hypersphere. Such a hyperplane exists only when we shift by $2\|\vec{c}_m\| > R$.

The conclusion is that we need to shift in a direction close to the direction perpendicular to the n − 2 dimensional hypersphere. The next model is with a hyperplane crossing the hypersphere, and with a hyperspherical arc (the curved surface of a hyperspherical cap) defined by the hyperplane. The optimal hyperplane, in terms of minimizing the maximal distance from any point lying on the hyperspherical arc to the hyperplane, is a hyperplane parallel to the hyperplane of the hyperspherical arc. The angle between the direction of such a hyperplane and any point lying on the hyperspherical arc is bounded. All points, for kernels for which $K(\vec{x}, \vec{x})$ is constant such as the RBF kernel, lie on a hypersphere in a kernel-induced feature space. So, we can find a minimal hyperspherical arc which contains all training points. As a value of $\vec{c}$ we can choose any training vector. The influence of $\vec{c}$ on the solution can be low, as presented in Fig. 1c.

In OCSVM, the number of hyperparameters for the RBF kernel is 2, namely C and σ. See OP 1 in Appendix A for details of the version of OCSVM with the C parameter. The SVML solves SVC with a special kernel with the additional hyperparameter t, so the number of hyperparameters is 3.

We compare the performance of manifold learning by measuring the distance of points to the boundary manifold in the case of OCSVM and to the characteristic manifold in the case of SVML. The definition of such a distance in the original feature space would be tricky, so we compare distances in the kernel-induced feature space. For both methods, we use the same kernel-induced feature space, so values are comparable for both methods. The distance between two points in a kernel-induced feature space is

$$\sqrt{\|\varphi(\vec{x}_i) - \varphi(\vec{x}_j)\|^2} = \sqrt{\|\varphi(\vec{x}_i)\|^2 + \|\varphi(\vec{x}_j)\|^2 - 2\varphi(\vec{x}_i) \cdot \varphi(\vec{x}_j)} .$$ (28)–(29)

For kernels for which $K(\vec{x}, \vec{x})$ is constant, such as the RBF kernel, the first two terms are constant, so we have

$$\sqrt{\mathrm{const} - 2K(\vec{x}_i, \vec{x}_j)} .$$ (30)

Moreover, the RBF kernel is a composition of the squared distance between two points with an outer function that is monotonically decreasing. So the distance between two points in a kernel-induced feature space is a monotonically increasing function of the distance between two points in the original feature space.
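For the RBF kernel, $K(\vec{x}, \vec{x}) = 1$, so the constant in (30) equals 2 and the feature-space distance can be computed from kernel values only; a small sketch with our own helper name:

```python
import numpy as np

def rbf_feature_space_distance(x_i, x_j, sigma):
    """Distance (30) between phi(x_i) and phi(x_j) for the RBF kernel."""
    k_ij = np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))
    return np.sqrt(2.0 - 2.0 * k_ij)   # const = K(x,x) + K(x,x) = 2 for the RBF kernel
```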

For SVML, the distance between a point $\vec{r}$ and the hyperplane in a kernel-induced feature space can be computed as

$$\frac{\left|\vec{w}_c \cdot \vec{r} + b_c\right|}{\sqrt{\|\vec{w}_c\|^2}} = \frac{\left|\sum_{i=1}^{n} y_{ci} \alpha_i K(\vec{x}_i, \vec{r}) + b_c\right|}{\sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} y_{ci} y_{cj} \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j)}} .$$ (31)–(32)
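Given the dual coefficients of the fitted classifier, (31)–(32) can be evaluated as in the following sketch (names are ours; `kernel` is any callable returning K(x_i, x_j), `alpha` are the dual coefficients and `y_sv` the labels of the support vectors):

```python
import numpy as np

def distance_to_hyperplane(r, X_sv, y_sv, alpha, b_c, kernel):
    """Distance (31)-(32) from a point r to the hyperplane w_c . phi(x) + b_c = 0."""
    k_r = np.array([kernel(x, r) for x in X_sv])              # K(x_i, r)
    numerator = abs(np.dot(y_sv * alpha, k_r) + b_c)
    K = np.array([[kernel(xi, xj) for xj in X_sv] for xi in X_sv])
    w_norm = np.sqrt((y_sv * alpha) @ K @ (y_sv * alpha))     # ||w_c||
    return numerator / w_norm
```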

The OCSVM is a special case of a method that finds a hypersphere with a minimal radius in a kernel-induced feature space. Because we assume that $K(\vec{x}, \vec{x})$ is constant, we get a degenerate case with a hyperplane instead of a hypersphere, so we compute the distance similarly as for SVML.

For KPCA, we compute the distance between a point $\vec{r}$ and the hyperplane in a kernel-induced feature space similarly as for SVML. The hyperplane is perpendicular to the last eigenvector. For the RBF kernel, the kernel matrix is positive definite, but in numerical computation some eigenvalues may turn out to be zero or negative. So, we search for the last eigenvector with a positive eigenvalue. We adjust the free term by computing the average point in the kernel-induced feature space.

We can reduce dimensionality with KPCA and treat it as regularization, although we lose knowledge about the data, so the performance of fitting to the data may get worse. We could also reduce dimensionality with SVML and OCSVM before clustering. The disadvantage of KPCA is that we get dense solutions and we do not have regularization, so the generalization may be worse. The advantage is that we directly get new point coordinates for an unfolded manifold.
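One way to realize the KPCA distance described above is sketched below. This is our own illustration following the standard KPCA centering; the function name, the eigenvalue threshold and the exact handling of the free term are assumptions, not the authors' implementation.

```python
import numpy as np

def kpca_last_component_distance(K_train, k_test):
    """Distance of test points to the hyperplane through the mean,
    perpendicular to the last KPCA eigenvector with a positive eigenvalue.

    K_train: (n, n) kernel matrix of training points.
    k_test:  (n, q) kernel values K(x_i, x) for q test points.
    """
    n = K_train.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = (K_train - one_n @ K_train - K_train @ one_n
          + one_n @ K_train @ one_n)                   # centered kernel matrix
    evals, evecs = np.linalg.eigh(Kc)                  # ascending eigenvalues
    pos = evals > 1e-10
    lam = evals[pos][0]                                # smallest positive eigenvalue
    v = evecs[:, pos][:, 0]                            # corresponding eigenvector
    # center the test kernel against the training kernel (free-term adjustment)
    k_c = (k_test - one_n @ k_test
           - K_train.mean(axis=0, keepdims=True).T
           + K_train.mean())
    # |projection| of centered test points on the last principal direction
    return np.abs((v / np.sqrt(lam)) @ k_c)
```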

For a model with the hyperspherical arc with points, if the points are closer to each other, then we get a better fit to the data for the OCSVM solution. For the RBF kernel, the radius of the hypersphere containing all points in the kernel-induced feature space is equal to 1. The maximal squared distance between two points in the kernel-induced feature space is 2, but only in the limit. The distance approaches this limit when the value of the kernel function is close to 0. We get values close to 0 when we decrease the value of σ. We get a worse fit if the distance between points is bigger in relation to the radius.

III. SVMLC FOR CLUSTERING BY MANIFOLD LEARNING

The solution of SVML with the RBF kernel is a manifold that consists of multiple separated manifolds. Each manifold is a characteristic manifold for a cluster. The number of manifolds is the number of clusters. Every point belongs to a cluster represented by the nearest characteristic manifold.

This is a difference compared to SVCL, where points inside boundary manifolds belong to clusters. So SVCL is able to directly return outliers, i.e., points that are outside of any cluster, while SVML cannot.

We use the same method for finding clusters for SVCL, SVML and KPCA. We assume that each point belongs to some cluster. In the first step, we map any two points to the same cluster if there exist two points between them with a different sign of the functional margin of the solution curve. We generate points lying on the segment between a pair of points by using the formula

$$\vec{r} = \frac{m \vec{x}_2 + n \vec{x}_1}{m + n} ,$$ (33)

where the point $\vec{r}$ divides the segment between points $\vec{x}_1$ and $\vec{x}_2$ in the ratio $m : n$. We are able to discover non-convex shapes. In the second step, we map the remaining unassigned points to the clusters of the nearest neighbors among the assigned points. We assign a cluster to a testing point as the cluster of the nearest training point. The SVMLC fits manifolds to data, while SVCL finds high density regions. So the SVMLC may be better in finding clusters for structured data in the form of manifolds, with a lack of enough data to discover densities, or with overlapping data for multiple clusters. We get more clusters for SVMLC for smaller values of the σ parameter, similarly as for SVCL.

IV. EXPERIMENTS

We use the sequential minimal optimization (SMO) method for solving the SVM optimization problems for SVML, OCSVM, SVMLC and SVCL. The time complexity of SMO is $O(ln)$ and the space complexity is $O(n)$, where $l$ is the number of iterations of SMO. For labeling clusters, the time complexity is $O(kn^2 m n_s)$, where $k$ is the number of points checked between any two points for the sign, $m$ is the dimension, and $n_s$ is the number of support vectors. We compare SVML with OCSVM on two toy problems for manifold learning (Fig. 1). We define the toy problems by generating data from parametric curves with added error from a one dimensional normal distribution with the same value of σ for all covariates. The OCSVM can fit to the curve only for data samples with small errors and only for simple curves (Fig. 1a).
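Such toy data can be generated, for example, as points sampled from a parametric curve with Gaussian noise added to every coordinate; the circle and Lissajous parameters below are illustrative choices of ours, not the exact settings used for Fig. 1.

```python
import numpy as np

def noisy_curve(curve, n_points, noise_sigma, seed=0):
    """Sample n_points from a parametric curve and add N(0, noise_sigma) noise to all covariates."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    X = np.stack(curve(s), axis=1)
    return X + rng.normal(0.0, noise_sigma, size=X.shape)

circle = lambda s: (np.cos(s), np.sin(s))                  # Fig. 1a-like data
lissajous = lambda s: (np.sin(2.0 * s), np.sin(3.0 * s))   # Fig. 1b-like data
X_circle = noisy_curve(circle, n_points=100, noise_sigma=0.05)
```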

We compare SVMLC with SVCL on a toy problem for clustering (Fig. 2). We have two clusters, each represented by a characteristic manifold being a circle, that are close to each other, with small errors and regularly generated data points. The SVCL cannot correctly recognize both clusters.

We compare the performance of SVMLC, SVCL, KPCA, SVML and OCSVM on the real world data sets described in Table I for binary and multiclass classification from the LibSVM site [14], except data sets with a high number of features. We use our own implementation of SMO for the SVM methods. For all data sets, every feature and the predictive variable are scaled linearly to [0, 1]. We performed all tests with the RBF kernel.

The number of hyperparameters to tune for SVMLC is 3, namely t, σ and C. The SVCL has two hyperparameters, σ and C. The KPCA has only one hyperparameter, σ. For all hyperparameters, we use a double grid search method for finding the best values: first a coarse grid search is performed, then a finer grid search, as described in [15]. For OCSVM, the value of C is greater than or equal to 1/n, where n is the number of training examples. The training set size is fixed to 100 examples; the rest of the data become test data. The standard 5-fold cross validation is used in the inner loop for finding optimal values of the hyperparameters. After finding the optimal values, we run the method on the training data, and we report results for the test set.
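A coarse-then-fine grid search of this kind can be sketched as follows; `cv_error` stands for the 5-fold cross-validation error of a method for given hyperparameter values and is a placeholder of ours, as are the refinement settings.

```python
import numpy as np
from itertools import product

def double_grid_search(cv_error, coarse_grids, refine_factor=4, fine_points=5):
    """Coarse grid search followed by a finer search around the best coarse values."""
    names = list(coarse_grids)
    best = min(product(*coarse_grids.values()),
               key=lambda vals: cv_error(dict(zip(names, vals))))
    fine_grids = {
        name: np.geomspace(v / refine_factor, v * refine_factor, fine_points)
        for name, v in zip(names, best)
    }
    best_fine = min(product(*fine_grids.values()),
                    key=lambda vals: cv_error(dict(zip(names, vals))))
    return dict(zip(names, best_fine))
```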

We validate clustering on classification data. We check whether the whole cluster belongs to the same class. It follows from an assumption that data samples that belong to the same cluster have the same class in a classification problem defined for such data. This is not supervised clustering, because we do not have training data with clusters. We use the following measure of error for clustering:

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (\vec{x}_i, \vec{x}_j) \in \left\{ (\vec{x}_i, \vec{x}_j) : \mathrm{cl}(\vec{x}_i) \neq \mathrm{cl}(\vec{x}_j) \wedge y_i = y_j \;\vee\; \mathrm{cl}(\vec{x}_i) = \mathrm{cl}(\vec{x}_j) \wedge y_i \neq y_j \right\} \right] ,$$ (34)–(35)

where $\mathrm{cl}$ is the number representing the found cluster, $y_i$ are the predefined classes from the classification problem, and $[\cdot]$ is the indicator function. We check if each cluster belongs to only one class; in that case the value of the measure is 0. In other words, we compute the dissimilarity matrix $S_c$ for classification and the dissimilarity matrix $S_d$ for clustering. The error is the number of different values in the two matrices. The definition is similar to the accuracy metric when correct cluster labels are given, and to the Rand statistic. In contrast to the Rand statistic, we count the number of disagreements in the numerator instead of the number of agreements, and instead of the denominator $\binom{n}{2}$ we use $n$. The measure of error can take values from 0 to n. We use this measure also for tuning hyperparameters. We validate the performance of clustering methods on real world data which may not contain any clusters, so for some experiments the optimal number of clusters is 1. The data may not contain manifolds; then SVML and SVMLC find some artificial manifolds. Even in such a case, they may correctly fit to the data and recognize clusters. We skip data sets for which we get 1 cluster for all methods.
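The measure (34)–(35) compares the cluster-based and class-based dissimilarity of all pairs; a direct O(n^2) sketch, with our own function name:

```python
import numpy as np

def clustering_error(cluster_labels, class_labels):
    """Error measure (34)-(35): number of disagreeing pairs divided by n."""
    cl = np.asarray(cluster_labels)
    y = np.asarray(class_labels)
    n = cl.shape[0]
    same_cluster = cl[:, None] == cl[None, :]
    same_class = y[:, None] == y[None, :]
    disagreements = np.sum(same_cluster != same_class)   # counts ordered pairs (i, j)
    return disagreements / n                              # values between 0 and n
```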

We use the Friedman test with the two-tailed Nemenyi post hoc test for checking the statistical significance of the clustering error, as suggested in [16]. The statistical procedure is performed for the level of significance equal to 0.05. The critical values for the Friedman test are taken from the statistical table designed specifically for the case with a smaller number of tests or methods, as suggested in [16]. The critical values for the Nemenyi test are taken from the statistical table for the Tukey test, divided by $\sqrt{2}$.

We performed two experiments. The first experiment is for the clustering problem. We compare SVMLC, SVCL and KPCA. The SVMLC has comparable generalization performance for solving clustering problems (columns trse in Table I), without statistical significance (columns rs, tsf, tsn for the first row in Table II). For KPCA, solutions are not sparse, so the number of nonzero coefficients (column sv3 in Table I) is equal either to the size of the training set or to 1 for the degenerate case with the kernel matrix being diagonal. Due to the degenerate cases, KPCA has the best rank for the number of nonzero coefficients (columns sv1, sv2, sv3 in the first row of Table II). The number of clusters is sometimes 1, because not all classification data sets contain clusters (columns clust in Table I).

The second experiment is for the manifold learning problem. We compare SVML, OCSVM and KPCA. We run the experiments for a few selected σ values. We compare fitting to the data in a kernel-induced feature space. The SVML achieves the best rank in this measure (columns rn in Table I and rs, tsf, tsn for the second row in Table II), with statistical significance between SVML and OCSVM. The SVML has the smallest number of nonzero coefficients (columns sv in the second row of Table II). The SVML is slower compared to OCSVM, especially for big values of C and σ.

V. SUMMARY

The SVML is better in solving manifold learning problems than OCSVM and KPCA, with statistical significance for OCSVM. It also has the smallest number of support vectors. The SVMLC method has comparable performance in solving clustering problems to SVCL and KPCA. The SVMLC can have an advantage for data in the form of manifolds, especially for manifolds that are close to each other, which are hard to distinguish by finding high density regions. The computational performance of SVMLC is limited, because we need to tune one more hyperparameter. We could improve the quality of SVMLC by designing heuristics for selecting directions for shifting in a kernel-induced feature space.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) and Professor Stan Matwin (Dalhousie University, Faculty of Computer Science) for discussion and useful suggestions. The theoretical analysis and the method design are financed by the National Science Centre in Poland, project id 289884, UMO-2015/17/D/ST6/04010, titled "Development of Models and Methods for Incorporating Knowledge to Support Vector Machines", and the implementation of the method is financed by the National Science Centre in Poland, project id 217859, UMO-2013/09/B/ST6/01549, titled "Interactive Visual Text Analytics (IVTA): Development of novel user-driven text mining and visualization methods for large text corpora exploration".

APPENDIX A
ONE-CLASS SVM

We consider a set of n training vectors $\vec{x}_i$ for $i \in \{1, \ldots, n\}$, where $\vec{x}_i = (x_{1i}, \ldots, x_{mi})$. The $m$ is the dimension of the problem. The OCSVM is

Optimization problem (OP) 1.

$$\min_{\vec{w}, b, \vec{\xi}} f\left(\vec{w}, b, \vec{\xi}\right) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i + b$$ (36)

subject to

$$g(\vec{x}_i) \geq 1 - \xi_i ,$$ (37)
$$\xi_i \geq 0$$ (38)

for $i \in \{1, \ldots, n\}$, where

$$g(\vec{x}_i) = \vec{w} \cdot \vec{x}_i + b ,$$ (39)
$$C > 0 .$$ (40)

The differences compared to SVC are the term $b$ in the objective function, and all points having the same class $y_i = 1$.
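For the linear case, OP 1 can be written down directly with a generic convex solver; the sketch below uses cvxpy as an illustration (our choice, not part of the paper), with X an n x m data matrix.

```python
import cvxpy as cp

def linear_ocsvm(X, C):
    """Solve OP 1 (the C-parameterized OCSVM) for the linear case g(x) = w . x + b."""
    n, m = X.shape
    w = cp.Variable(m)
    b = cp.Variable()
    xi = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi) + b)  # (36)
    constraints = [X @ w + b >= 1 - xi,                                    # (37)
                   xi >= 0]                                                # (38)
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```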

REFERENCES

[1] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, “Support vector clustering,” J. Mach. Learn. Res., vol. 2, pp. 125–137, 2001.
[2] Z. Wang, Y.-H. Shao, L. Bai, and N.-Y. Deng, “Twin support vector machine for clustering,” IEEE Trans. Neural Netw. Learning Syst., vol. 26, no. 10, pp. 2583–2588, 2015.
[3] J. Koronacki and J. Ćwik, Statystyczne systemy uczące się. EXIT, 2008.


TABLE I
Performance of SVMLC, SVCL, KPCA, SVML, OCSVM for real world data, part 1. The numbers in the column descriptions denote the methods: 1 - SVMLC, 2 - SVCL, 3 - KPCA (except columns rn1, rn2, rn3, where 1 - SVML, 2 - OCSVM, 3 - KPCA). All columns are for the clustering experiment (except rn1, rn2, rn3, which are for the manifold learning experiment). Column descriptions: dn – the name of a data set, size – the number of all examples, dim – the dimension of a problem, trse – the mean error for the clustering performance measure for testing data, sv – the number of nonzero coefficients (support vectors for SVM methods), clust – the number of clusters, rn – the rank for the performance measure for testing data in the manifold learning experiment.

dn               | size   | dim  | trse1  | trse2  | trse3  | sv1   | sv2   | sv3   | clust1 | clust2 | clust3 | rn1  | rn2 | rn3
australian       | 690    | 14   | 51.876 | 55.337 | 55.605 | 101.0 | 100.0 | 100.0 | 5.0    | 5.0    | 2.0    | 1.67 | 3.0 | 1.33
breast-cancer    | 675    | 10   | 27.272 | 34.241 | 34.241 | 100.0 | 52.0  | 100.0 | 4.0    | 1.0    | 1.0    | 1.25 | 3.0 | 1.75
cod-rna          | 100000 | 8    | 57.144 | 52.366 | 57.419 | 101.0 | 99.0  | 99.0  | 2.0    | 6.0    | 1.0    | 1.25 | 3.0 | 1.75
colon-cancer     | 62     | 2000 | 27.827 | 28.419 | 28.119 | 51.0  | 50.0  | 1.0   | 2.0    | 2.0    | 1.0    | 1.0  | 3.0 | 2.0
fourclass        | 862    | 2    | 52.233 | 53.265 | 61.72  | 48.0  | 100.0 | 100.0 | 4.0    | 6.0    | 1.0    | 1.5  | 3.0 | 1.5
german numer     | 1000   | 24   | 59.438 | 57.411 | 57.411 | 101.0 | 100.0 | 1.0   | 4.0    | 1.0    | 1.0    | 1.67 | 3.0 | 1.33
heart            | 270    | 13   | 54.185 | 56.212 | 54.185 | 101.0 | 100.0 | 100.0 | 7.0    | 4.0    | 1.0    | 1.25 | 3.0 | 1.75
ionosphere scale | 350    | 33   | 41.732 | 51.712 | 41.732 | 57.0  | 99.0  | 1.0   | 1.0    | 4.0    | 1.0    | 1.0  | 3.0 | 2.0
phishing         | 5785   | 68   | 48.322 | 48.322 | 50.0   | 101.0 | 100.0 | 1.0   | 2.0    | 2.0    | 1.0    | 1.0  | 3.0 | 2.0
skin nonskin     | 51432  | 3    | 62.537 | 63.483 | 65.821 | 101.0 | 99.0  | 100.0 | 17.0   | 3.0    | 4.0    | 1.5  | 3.0 | 1.5
sonar scale      | 208    | 60   | 52.332 | 51.353 | 47.012 | 101.0 | 100.0 | 100.0 | 2.0    | 3.0    | 2.0    | 1.0  | 3.0 | 2.0
SUSY             | 100000 | 18   | 53.697 | 52.18  | 53.697 | 101.0 | 100.0 | 1.0   | 1.0    | 3.0    | 1.0    | 1.75 | 3.0 | 1.25
svmguide3        | 1243   | 21   | 71.309 | 71.333 | 72.079 | 101.0 | 100.0 | 1.0   | 11.0   | 4.0    | 1.0    | 2.0  | 3.0 | 1.0
covtype          | 100000 | 53   | 41.029 | 41.295 | 48.764 | 101.0 | 86.0  | 100.0 | 13.0   | 11.0   | 1.0    | 1.67 | 3.0 | 1.33
glass scale      | 213    | 9    | 36.87  | 38.561 | 38.706 | 101.0 | 95.0  | 100.0 | 7.0    | 4.0    | 3.0    | 1.25 | 3.0 | 1.75
iris             | 147    | 4    | 73.891 | 51.466 | 52.7   | 100.0 | 98.0  | 97.0  | 2.0    | 5.0    | 1.0    | 2.0  | 3.0 | 1.0
poker            | 100000 | 10   | 46.177 | 44.594 | 51.278 | 101.0 | 100.0 | 100.0 | 13.0   | 7.0    | 1.0    | 1.25 | 3.0 | 1.75
sensorless       | 58509  | 48   | 20.065 | 20.723 | 27.5   | 101.0 | 100.0 | 100.0 | 2.0    | 4.0    | 1.0    | 1.33 | 3.0 | 1.67
combined         | 98500  | 100  | 51.024 | 50.678 | 47.999 | 101.0 | 100.0 | 100.0 | 3.0    | 1.0    | 4.0    | 1.67 | 3.0 | 1.33
vowel            | 990    | 10   | 21.193 | 18.912 | 32.317 | 101.0 | 100.0 | 100.0 | 4.0    | 6.0    | 1.0    | 2.0  | 3.0 | 1.0
wine scale       | 178    | 13   | 44.296 | 49.407 | 32.156 | 101.0 | 100.0 | 100.0 | 2.0    | 2.0    | 4.0    | 1.67 | 3.0 | 1.33

TABLE II
Performance of SVMLC, SVCL, KPCA, SVML, OCSVM for real world data, part 2. The numbers in the column descriptions denote the methods: 1 - SVMLC, 2 - SVCL, 3 - KPCA for the first row; 1 - SVML, 2 - OCSVM, 3 - KPCA for the second row. The test with id=0 is for all tests from Table I for the clustering experiment. The test with id=1 is for all tests from Table I for the manifold learning experiment. Column descriptions: rs – the average rank of the method for the mean error, tsf – the Friedman statistic for average ranks for the mean error, tsn – the Nemenyi statistic for average ranks for the mean error, reported when the Friedman statistic is significant, sv – the average rank for the number of nonzero coefficients (support vectors for SVM methods).

id | rs1  | rs2  | rs3  | tsf   | tsn12 | tsn13 | tsn23 | sv1  | sv2  | sv3
0  | 1.71 | 1.93 | 2.36 | 4.5   | –     | –     | –     | 2.83 | 1.67 | 1.5
1  | 1.49 | 2.98 | 1.53 | 33.09 | -4.82 | 0.3   | 5.13  | 1.51 | 2.38 | 2.11

[4] M. Orchel, “Regression based on support vector classification,” in Adaptive and Natural Computing Algorithms, ser. Lecture Notes in Computer Science, A. Dobnikar, U. Lotric, and B. Šter, Eds. Springer Berlin Heidelberg, 2011, vol. 6594, pp. 353–362.
[5] M. Orchel, “Support vector regression based on data shifting,” Neurocomputing, vol. 96, pp. 2–11, 2012.
[6] A. J. Smola, S. Mika, B. Schölkopf, and R. C. Williamson, “Regularized principal manifolds,” J. Mach. Learn. Res., vol. 1, pp. 179–209, 2001.
[7] P. S. Bradley and O. L. Mangasarian, “k-plane clustering,” J. Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
[8] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
[9] F. Lin, C. Yeh, and M. Lee, “The use of hybrid manifold learning and support vector machines in the prediction of business failure,” Knowl.-Based Syst., vol. 24, no. 1, pp. 95–101, 2011.
[10] L. Gomez-Chova, G. Camps-Valls, J. Munoz-Mari, and J. Calpe, “Semisupervised image classification with Laplacian support vector machines,” IEEE Geosci. Remote S., vol. 5, no. 3, pp. 336–340, July 2008.
[11] Y. M. Chen, P. Lin, J. Q. He, Y. He, and X. Li, “Combination of the manifold dimensionality reduction methods with least squares support vector machines for classifying the species of sorghum seeds,” Sci. Rep., vol. 6, no. 19917, 2016.
[12] Z. Wu, C.-H. Li, J. Zhu, and J. Huang, “A semi-supervised SVM for manifold learning,” in 18th International Conference on Pattern Recognition (ICPR'06), vol. 2, 2006, pp. 490–493.
[13] Q. Gu and J. Han, “Clustered support vector machines,” in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2013, Scottsdale, AZ, USA, April 29 - May 1, 2013, 2013, pp. 307–315.
[14] “LibSVM data sets,” www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 06 2011.
[15] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” 2010.
[16] N. Japkowicz and M. Shah, Eds., Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.



Fig. 1. Manifold learning. Points—examples. (a) For points generated from a circle. Solid line—solution of OCSVM for C = 1.0, σ = 0.9, dashed line—solution of SVML for C = 100.0, σ = 0.9, t = 0.01, thin dotted line—solution of KPCA for σ = 0.9. (b) For points generated from a Lissajous curve. Solid line—solution of OCSVM for C = 1000.0, σ = 0.5, dashed line—solution of SVML for C = 100000.0, σ = 0.8, t = 0.01, thin dotted line—solution of KPCA for σ = 0.5. (c) Solid line— solution of SVML for ~c = ~0, C = 100.0, σ = 0.9, t = 0.01, dashed line—solution of SVML for random values of ~c, C = 100.0, σ = 0.9, t = 0.01.



Fig. 2. Clustering by manifold learning. Points—examples. (a) Solid line—solution of SVCL for C = 10000.0, σ = 0.35. (b) Solid line—solution of SVMLC for C = 100000.0, σ = 1.1, t = 0.01. (c) Solid line—solution of KPCA.
