
(1)

On the usefulness of one-class classifiers for multi-class problems

Bartosz Krawczyk

(2)

Outline

• One-class classification
• Multi-class decomposition
• One-class decomposition as a special case of one-versus-all
• One-Class Clustering-based Ensemble
• Online one-class ensembles for massive multi-class data streams with concept drift

(3)

One-class vs. multi-class

Multi-class classification aims at establishing a decision boundary that results in the lowest classification error for objects originating from two or more classes.


(5)

One-class vs. multi-class

 In one-class classification (OCC), only data representing a single class is available. This is known as learning in the absence of counterexamples.

(6)

One-class vs. multi-class

 One of the most popular approaches for such a case is to establish a closed decision boundary that encloses the target class and separates it from potential outliers.

(7)

One-class vs. multi-class

 It is assumed that during the exploitation of such a classifier, new objects from the target concept may appear, alongside new objects, unseen during the learning process, that originate from a different distribution(s). Such negative objects are labeled as outliers.

(8)

One-class ensembles

 One-class problems tend to have a complex structure of the target class.

 A proper model selection of one-class classifier for a given dataset can be troublesome and time-consuming.

 For complex data with underlying internal structures, single-model approaches may return overly complex boundaries or become overtrained.

 Ensemble learning allows us to exploit the unique strengths of each model in the pool.

 Ensembles increase the quality and robustness of one-class learning systems.

(9)
(10)

One-class ensembles - challenges

 One may have two different types of one-class ensembles: heterogeneous [1] and homogeneous [2].

 In the case of homogeneous ensembles, the Random Subspace method is usually used, which requires an overproduce-and-select mode.

 We need to choose classifiers that are individually competent and mutually complementary.

 The problem lies in forming efficient competence / quality or diversity measures [3] for the one-class classification task.

[1] Krawczyk B., Woźniak M., Dynamic Classifier Selection for One-Class Classification. Information Fusion (in review).

[2] Krawczyk B., One-Class Classifier Ensemble Pruning and Weighting with Firefly Algorithm. Neurocomputing, 2015, vol.150, part B: 490–500.

[3] Krawczyk B., Woźniak M., Diversity measures for one-class classifier ensembles. Neurocomputing, 2014, vol. 126: 36–44.

(11)

Multi-class decomposition

 When considering a multi-class problem, we often deal with a large number of classes (10 / 20 / 100 / 500 / etc.).

 This poses a significant challenge for the classification system, as it increases the complexity of the decision boundary.

 According to the divide-and-conquer rule it is worthwhile to split the complex problem into a set of simplified subproblems and solve them individually.

Decomposition strategies aim at reducing a multi-class problem to a number of subproblems with a lower number of classes.

Binary decomposition is the most popular one.

 We delegate a local classifier to each subproblem, thus creating a combined classifier.

(12)

Multi-class decomposition

 One-versus-one (OVO) is a strategy in which we consider all possible pairs of classes and build a classifier for each of them.

(13)

Multi-class decomposition

 One-versus-all (OVA) is a strategy in which we use one class as a positive one and aggregate all remaining ones into a negative class.

(14)

Multi-class decomposition

 OVO advantages: a set of simple binary problems that can be solved by local classifiers.

 OVO drawbacks: a large number of base classifiers to be trained (for an M-class problem we need to train M(M−1)/2 base classifiers, e.g., 45 for 10 classes) and a complex aggregation of local decisions.

 OVA advantages: a small number of base classifiers (for an M-class problem we need to train M base classifiers), and each classifier uses all of the samples for training.

 OVA drawbacks: using an aggregation of classes as the negative one introduces artificial imbalance (e.g., for 10 classes we get an imbalance ratio of 1:9).
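As a minimal illustration (not part of the original slides), scikit-learn ships both decomposition schemes out of the box; the sketch below contrasts the number of base models each scheme produces on a 10-class problem.

```python
# Minimal sketch of OVO vs. OVA binary decomposition with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)              # a 10-class problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# OVO: M(M-1)/2 = 45 pairwise base classifiers for M = 10 classes.
ovo = OneVsOneClassifier(LinearSVC(dual=False)).fit(X_tr, y_tr)

# OVA: M = 10 base classifiers, each facing a 1:9 class imbalance.
ova = OneVsRestClassifier(LinearSVC(dual=False)).fit(X_tr, y_tr)

print("OVO accuracy:", ovo.score(X_te, y_te))
print("OVA accuracy:", ova.score(X_te, y_te))
print("OVO base models:", len(ovo.estimators_))  # 45
print("OVA base models:", len(ova.estimators_))  # 10
```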

(15)

Multi-class decomposition

 The question is how to reconstruct the input multi-class problem from a set of local binary outputs.

 One needs to use dedicated classifier combination methods.

 OVA decomposition usually relies on the maximum function and the winner-takes-all approach.

 OVO has a series of dedicated methods, such as pairwise coupling or decision directed acyclic graphs.

 One may also use a more compound combiner, like Error-Correcting Output Codes or Decision Templates.

(16)

Decomposition with one-class

 Is it worthwhile to use one-class classifiers for cases where we have access to counterexamples?

 One may doubt the idea of using OCC for decomposing multi-class datasets: OCC uses only information about the target class, and therefore we discard available, useful information.

 OCC methods have several desirable properties that may aid the process of multi-class decomposition.

 However, we emphasize that OCC is by no means superior to binary classifiers.

 For standard datasets, binary classifiers will perform better, due to their access to counterexamples.

 The applicability of OCC lies in complex data, where standard binary classifiers tend to fail.

(17)

Decomposition with one-class

 To understand the possible advantages of OCC, let us first take a look at the differences between a binary and a one-class classifier.

 While OCC uses less information about the problem being considered, its properties allow it to deal with difficulties embedded in the nature of the data: imbalance, class noise, or inner outliers, to name a few.

(18)

Decomposition with one-class

 Using OCC for decomposing a multi-class dataset is very intuitive: each class is considered independently and a different one-class model is delegated to it. Therefore, for an M-class problem, we get M separate one-class tasks.
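As a minimal sketch of this scheme (not the authors' code), one can train a scikit-learn OneClassSVM per class and reconstruct the multi-class decision with a winner-takes-all rule over the decision scores; the class name OneClassDecomposition and all parameter values are illustrative assumptions.

```python
# Sketch of one-class decomposition: one OneClassSVM per class,
# combined by the maximum of the decision scores (winner-takes-all).
import numpy as np
from sklearn.svm import OneClassSVM

class OneClassDecomposition:
    def __init__(self, nu=0.1, gamma="scale"):
        self.nu, self.gamma = nu, gamma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One model per class, trained only on that class's examples.
        self.models_ = {
            c: OneClassSVM(nu=self.nu, gamma=self.gamma).fit(X[y == c])
            for c in self.classes_
        }
        return self

    def predict(self, X):
        # decision_function: larger means deeper inside the class support.
        scores = np.column_stack(
            [self.models_[c].decision_function(X) for c in self.classes_]
        )
        return self.classes_[scores.argmax(axis=1)]
```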

(19)

Decomposition with one-class

 Learning using only examples from a single class may seem like a disadvantage.

 In practice, it lifts one of the biggest problems related to OVA: the imbalanced object quantity between the target class and the remaining ones [1].

 Additionally, we get smaller ensembles than in the case of OVO.

[1] Krawczyk B., Filipczuk P., Cytological Image Analysis with Firefly Nuclei Detection and Hybrid One-Class

(20)

Experimental analysis

 20 datasets with a large number of classes (4–95).

 Base classifiers: C4.5, SVM, One-Class Parzen, One-Class Mixture of Gaussians, One-Class Support Vector Machine, Support Vector Data Description.

 Combination methods used: OVO (Pairwise coupling and DDAG), OVA (MAX), ECOC and Decision Templates.

 5x2 CV F-test for training / testing / pairwise comparison; Friedman ranking test and Shaffer post-hoc test for comparison over multiple datasets.

 We will present exemplary results; for details, please refer to [1]. A sketch of the Friedman ranking step is given after the reference below.

[1] Krawczyk B., Woźniak M., Herrera F. On the Usefulness of One-Class Classifier Ensembles for
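The ranking step of this protocol can be reproduced with standard tooling. Below is a minimal Python sketch using scipy's Friedman test over a purely hypothetical accuracy table: the numbers are invented for illustration and are not the study's results, and the Shaffer post-hoc test has no scipy implementation, so it is omitted.

```python
# Friedman ranking over multiple datasets; the accuracy table below is
# made up for illustration only.
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: datasets, columns: combination methods (hypothetical values).
acc = np.array([
    [0.81, 0.79, 0.84],   # dataset 1: e.g., OVO-PC, OVA-MAX, DT
    [0.73, 0.70, 0.75],   # dataset 2
    [0.66, 0.61, 0.69],   # dataset 3
    [0.90, 0.88, 0.91],   # dataset 4
])
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# Average rank per method (rank 1 = highest accuracy on that dataset).
ranks = (-acc).argsort(axis=1).argsort(axis=1) + 1
print("average ranks:", ranks.mean(axis=0))
```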

(21)

Experimental analysis

Results for the Auslan dataset with 95 classes, using the Decision Templates combination.

(22)

Experimental analysis

Results of the Shaffer post-hoc test over 20 datasets, comparing different combination methods for the SVDD classifier. The symbol '=' stands for no significant difference, '+' for a situation in which the method on the left is superior, and '−' vice versa.

(23)

Recommendations

 One-class classification is not a universal solution. For standard data, with no difficulties embedded in them, binary decomposition performs much better, as it has access to all the data. However, for more complex problems, OCC is a worthwhile direction.

 One-class decomposition works exceptionally well with datasets having a high number of classes.

 One-class classifiers can deal with problems embedded in the nature of the data: imbalanced distributions, feature and label noise, or a small number of objects.

 When a dataset has a moderate or low number of examples and / or risks being affected by noise / in-class outliers, it is preferable to apply boundary methods, like OCSVM or SVDD. However, when the class objects are plentiful, density-based methods can give a better description of the data.

(24)

Designing one-class classifiers

 The presented decomposition technique relies on the quality of the one-class classifiers used.

 Often classes have a complex structure and contain internal outliers (label noise).

 These two factors may lead to a one-class classifier with so-called "empty regions" – areas within the decision boundary that are not covered by training samples. The classifier may be deemed incompetent in such areas.

 For complex data with underlying internal structures, single-model approaches may return complex boundaries / become overtrained.

 A proper model selection of a one-class classifier for a given dataset can be troublesome and time-consuming.

(25)

OCClustE

 We developed a novel ensemble for forming locally specialised classifiers, named the One-Class Clustering-Based Ensemble (OCClustE).

 This allows us to create base one-class models that work on delegated subspaces of the original decision problem, thus increasing their local competence.

 This approach is rooted in our previous experience with multi-class classification and our original method, Adaptive Splitting and Selection (AdaSS), which selects classifiers for local decision areas of their highest competence [1] [2].

[1] Jackowski K., Krawczyk B. & Woźniak M., Improved Adaptive Splitting and Selection: The Hybrid Training Method of a Classifier Based on a Feature Space Partitioning. International Journal of Neural Systems, 2014, vol. 24, no. 3: 1430007-1 – 1430007-18.

[2] Woźniak M., Krawczyk B., Combined Classifier Based On Feature Space Partitioning. Journal of Applied Mathematics and Computer Science, 2012, vol. 22, no. 4: 855–866.

(26)
(27)

OCClustE

 OCClustE [1] has two important components:

 a method for decomposing the original decision problem into a set of reduced, less complex sub-problems;

 the usage of specific base one-class classifiers trained on the detected areas of competence, which can increase robustness to possible internal outliers and reduce the empty areas within their boundaries.

[1] Krawczyk B., Woźniak M. & Cyganek B., Weighted One-Class Classifier Ensemble Based on Fuzzy Feature

(28)

OCClustE

 To detect compact subgroups within the target class, we employ kernel fuzzy c-means clustering.

 Due to the usage of a kernel, we are able to search for a more atomic and efficient representation of the target concept in different spaces.

 However, fuzzy c-means clustering relies on a user-defined number of clusters.

 To avoid time-consuming trial-and-error settings, we automatically establish the number of clusters with an entropy criterion.

 We select the number of clusters that yields the lowest entropy.

 Of course, this criterion does not always return the optimal solution, but it is automatic and can be used as a starting point for further tuning.
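A minimal sketch of this criterion, assuming the third-party scikit-fuzzy package and plain (non-kernel) fuzzy c-means as a stand-in for the kernelised variant used in the talk; the helper names are ours, not the authors'.

```python
# Entropy criterion for choosing the number of fuzzy c-means clusters:
# try several cluster counts, keep the lowest-entropy partition.
import numpy as np
import skfuzzy as fuzz

def partition_entropy(u):
    # u: membership matrix of shape (n_clusters, n_samples).
    return -np.mean(np.sum(u * np.log(u + 1e-12), axis=0))

def pick_n_clusters(X, c_range=range(2, 10), m=2.0):
    best_c, best_h, best_u = None, np.inf, None
    for c in c_range:
        # skfuzzy expects data as (n_features, n_samples), hence X.T.
        _, u, *_ = fuzz.cluster.cmeans(X.T, c, m, error=1e-5,
                                       maxiter=200, seed=0)
        h = partition_entropy(u)
        if h < best_h:            # keep the lowest-entropy partition
            best_c, best_h, best_u = c, h, u
    return best_c, best_u
```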

(29)

OCClustE

 As a base classifier for our ensemble, we have selected the Weighted One-Class Support Vector Machine (WOCSVM).

 It assigns a weight to each training object to control its influence on the shape of the decision boundary and to filter out outliers and irrelevant objects.

 WOCSVM outputs a more compact decision boundary, thus significantly reducing the size of the empty areas present within it.

(30)

OCClustE

 In the standard WOCSVM, the weights assigned to objects are established on the basis of the distance between the object and the hypersphere centre.

 This, however, requires additional computation and can be costly, especially for big data.

 We propose to utilize the membership functions obtained from fuzzy c-means clustering as object weights.

 Such an approach significantly reduces the computational time and improves the individual accuracy of each base classifier in the ensemble.
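A hedged sketch of this weighting idea: scikit-learn's OneClassSVM accepts a per-object sample_weight, which we use here as a stand-in for the full WOCSVM formulation; fit_weighted_occ is an illustrative helper, not the authors' code.

```python
# Reuse fuzzy memberships as per-object weights of a one-class SVM.
from sklearn.svm import OneClassSVM

def fit_weighted_occ(X_cluster, memberships, nu=0.1):
    # memberships: fuzzy c-means membership of each object in this
    # cluster, already computed -- no extra distance-based pass needed.
    return OneClassSVM(nu=nu, gamma="scale").fit(
        X_cluster, sample_weight=memberships
    )
```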

(31)

OCClustE

 We can use OCClustE for multi-class problems by developing M ensemble models, one for each class [1].

[1] Krawczyk B., Woźniak M. & Cyganek B., Clustering-based ensembles for one-class classification. Information Sciences, 2014, vol. 264: 182-195

(32)

Experimental analysis

 20 datasets with a large number of classes (4–95).

 Base classifiers: OCClustE, C4.5, SVM, One-Class Parzen, One-Class Mixture of Gaussians, One-Class Support Vector Machine, Support Vector Data Description.

 Combination methods used: OVO (Pairwise coupling and DDAG), OVA (MAX), ECOC and Decision Templates.

 5x2 CV F-test for training / testing / pairwise comparison; Friedman ranking test and Shaffer post-hoc test for comparison over multiple datasets.

(33)

Experimental analysis

Results for the Auslan dataset with 95 classes, using the Decision Templates combination.

(34)

Recommendations

 OCClustE is a flexible ensemble framework that can be used with any components for both one-class and multi-class problems.

 It can offer an efficient tool for decomposition with a two-level classifier combination.

 It can easily be run in parallel or in distributed systems.

 It has a higher computational cost than a single-model approach.

(35)

Mining massive data streams

 One-class classification is also an interesting tool for mining data streams - both stationary and those with concept drift.

 It can be used for novelty detection, non-stationary classes, or cases where access to class labels is limited.

(36)

Mining massive data streams

 Recently, we have proposed a novel model for one-class data stream classification: an incremental WOCSVM with forgetting [1] [2].

 It applies an incremental learning mechanism with embedded weight modification. Incoming objects are assigned the highest possible weights, as they represent the current state of the concept.

 Objects from previous iterations have their degree of importance slowly reduced by modifying their weights with our proposed forgetting function (an illustrative sketch follows the references below).

[1] Krawczyk B., Woźniak M., One-Class Classifiers with Incremental Learning and Forgetting for Data Streams with Concept Drift. Soft Computing. DOI: 10.1007/s00500-014-1492-5

[2] Krawczyk B., Woźniak M., Incremental Weighted One-Class Classifier for Mining Stationary Data
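The forgetting formula itself is not reproduced on this slide. As a loudly labelled assumption, the sketch below uses simple exponential decay in its place: it captures the described behaviour (new objects enter with full weight, old ones fade) without claiming to be the exact function from [1].

```python
# Illustrative stand-in for a forgetting function: older objects lose
# weight each iteration, new objects arrive with weight 1.
import numpy as np

def decay_weights(weights, rate=0.9):
    # rate in (0, 1): smaller means faster forgetting.
    return weights * rate

# Usage: maintain a buffer of (objects, weights) across iterations.
weights = np.ones(5)
for t in range(3):
    weights = decay_weights(weights)
print(weights)   # each old object's influence after three iterations
```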

(37)

Mining massive data streams

 We have developed a modification of this method suitable for online learning.

(38)

Mining massive data streams

 We propose to use an ensemble of these classifiers for online mining of massive evolving data streams.
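A rough sketch of this idea under stated assumptions: the stream arrives in chunks, each chunk trains a new one-class member, and members vote with time-decayed weights. This illustrates the ensemble mechanics only; it is not the authors' online WOCSVM ensemble.

```python
# Chunk-based online ensemble of one-class learners with decaying
# member weights (illustrative design, not the authors' method).
import numpy as np
from collections import deque
from sklearn.svm import OneClassSVM

class OnlineOneClassEnsemble:
    def __init__(self, max_members=10, decay=0.9):
        self.members = deque(maxlen=max_members)   # (model, weight)
        self.decay = decay

    def partial_fit(self, X_chunk):
        # Age existing members, then add a model for the newest chunk;
        # the deque drops the oldest member once max_members is reached.
        self.members = deque(
            [(m, w * self.decay) for m, w in self.members],
            maxlen=self.members.maxlen,
        )
        self.members.append((OneClassSVM(nu=0.1).fit(X_chunk), 1.0))

    def decision_function(self, X):
        # Weighted average of member scores; positive = target concept.
        num = sum(w * m.decision_function(X) for m, w in self.members)
        den = sum(w for _, w in self.members)
        return num / den
```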

(39)

Experimental analysis

 12 data streams.

 Reference online ensemble classifiers: Online Bagging with an ADWIN change detector (BAG), Leveraging Bagging (LEV), the Dynamic Weighted Majority (DWM), and the Adaptive Classifier.

(40)

Experimental analysis

 Obtained accuracies:

(41)

Experimental analysis

 Remaining performance measures:

(42)

Experimental analysis

(43)

Conclusions

 One-class classifiers are an attractive tool not only for learning in the absence of counterexamples, but also for decomposing multi-class datasets.

 They are very useful for cases with a high number of classes and difficulties embedded in the nature of the data.

 One should remember that one-class classifiers are not a universal solution for multi-class problems. However, in specific scenarios they may deliver excellent performance.

 In the future, we plan to extend our work on multi-class decomposition by using dynamic classifier selection according to competence, applying novel competence measures based on empty regions, and adapting our methods for mining multi-class imbalanced data.

(44)

The 16th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2015, October 14-16, 2015, Wroclaw, Poland

Submission deadline: 15.05.2015
Website: ideal2015.pwr.edu.pl

The 1st IEEE International Workshop on Classification Problems Embedded in the Nature of Big Data, IEEE CPBD 2015, August 20-22, 2015, Helsinki, Finland

Submission deadline: 20.05.2015
Website: research.comnet.aalto.fi/BDSE2015/cpbd2015/

General chair: Prof. Michał Woźniak

PC chair: Dr. Robert Burduk

Publicity chair: Bartosz Krawczyk

Co-chair:

(45)

The 2nd International Workshop on Machine Learning in Life Sciences, September 11, 2015, Porto, Portugal, within ECML-PKDD 2015.

Submission deadline: 22.06.2015
Machine learning competition website: http://competition.gmum.net/

Co-chair: Bartosz Krawczyk

Co-chair:

(46)

Thank you for your kind attention

bartosz.krawczyk@pwr.edu.pl

www.kssk.pwr.edu.pl/krawczyk
