Introduction Why imbalanced data is difficult? ImGrid ImWeights Summary
Using a grid clustering algorithm
for imbalanced data
Mateusz Lango¹ ²
Instytut Informatyki, Politechnika Poznańska
13 November 2018
¹ Lango M., Brzeziński D., Stefanowski J., ImWeights: Classifying Imbalanced Data Using Local and Neighborhood Information, JMLR Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains co-located with ECML/PKDD, 2018
² Lango M., Brzeziński D., Firlik S., Stefanowski J., Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data, Proceedings of the 20th International Conference on Discovery Science, Kyoto,
Common issue of the presented tasks
- highly skewed class distribution
- classifiers induced for those problems are good at detecting the majority class
- the minority class is of special interest (connecting related parts of a graph, inferring a need for improvement/repair)
Agenda
1 Introduction
2 Why is learning from imbalanced data difficult?
3 Discovering data difficulty factors with grid clustering
4 How to improve imbalanced data classification using grid clustering?
Why is learning from imbalanced data difficult?
An obvious answer: unequal class cardinality,

IR = N− / N+

Classifiers learned with a maximum generality bias optimize accuracy,

η = (TPRate · N+ + TNRate · N−) / (N+ + N−)
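A hypothetical worked example of the accuracy decomposition above (the function name is mine, not from the slides): with 10 minority and 90 majority examples, a trivial classifier that always predicts the majority class has TPRate = 0 and TNRate = 1, yet still reaches 90% accuracy.

```python
def accuracy(tp_rate, tn_rate, n_pos, n_neg):
    """Overall accuracy as a class-size-weighted mean of TPRate and TNRate."""
    return (tp_rate * n_pos + tn_rate * n_neg) / (n_pos + n_neg)

# A classifier that always predicts the majority class:
print(accuracy(0.0, 1.0, 10, 90))  # -> 0.9
```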
An experiment with imbalanced data
[Figure: a sequence of classifiers trained on increasingly difficult imbalanced datasets. The reported metrics range from Acc 100%, Prec 100% on the easiest data, through intermediate cases such as Acc 94.9%, Prec 95.1%, Recall 94.7% and Acc 70.6%, Prec 72.3%, Recall 66.7%, down to datasets where accuracy stays high (Acc 83.3% and 90.9%) while precision falls to 0%.]
Why is learning from imbalanced data really difficult?
- the global imbalance ratio
- minority class examples divided into subconcepts/small disjuncts
- overlap between classes
- presence of many minority class examples inside the majority class region
Local data difficulty factors
The labels of examples are usually established by locally estimating the conditional probability:

p = Pr(y = + | x)

1 ≥ p > 0.7 → safe example
0.7 ≥ p > 0.3 → borderline example
0.3 ≥ p > 0.1 → rare example
0.1 ≥ p ≥ 0 → outlier example
Napierała & Stefanowski proposed to estimate these probabilities with k-NN³.
³ The influence of minority class distribution on learning from imbalanced data, Napierała & Stefanowski, HAIS,
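The k-NN estimation can be sketched as follows. This is a minimal illustration assuming Euclidean distance and the thresholds listed above; the function and variable names are mine, not from the paper.

```python
import numpy as np

def categorize_minority(X, y, minority=1, k=5):
    """Label each minority example as safe/borderline/rare/outlier based on
    the fraction of minority neighbours among its k nearest neighbours."""
    labels = {}
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the example itself
        nn = np.argsort(d)[:k]               # indices of k nearest neighbours
        p = np.mean(y[nn] == minority)       # local estimate of Pr(y = + | x)
        if p > 0.7:
            labels[i] = "safe"
        elif p > 0.3:
            labels[i] = "borderline"
        elif p > 0.1:
            labels[i] = "rare"
        else:
            labels[i] = "outlier"
    return labels
```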
ImGrid algorithm - motivation
- locality-based approaches define difficulties by modelling interactions with majority examples
- however, they do not detect minority subclusters
- limited work on clustering imbalanced data (mainly in the context of resampling methods)
Question: is it possible to construct a clustering approach that simultaneously discovers sub-concepts in complex imbalanced data and categorizes the types of examples inside them?
ImGrid algorithm - general idea
- divide the attribute space into grid cells
- join adjacent cells based on minority class distributions
- label examples according to local difficulty factors
- form minority sub-clusters
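The first step might look like this minimal sketch, assuming numeric attributes scaled by their observed min/max and equally wide bins; `grid_cells` and its exact form are my illustration, not the paper's code.

```python
import numpy as np

def grid_cells(X):
    """Assign each example to a grid cell; the number of intervals per
    dimension is chosen so that cells hold roughly 10 points on average."""
    n, p = X.shape
    b = int(np.ceil((n / 10) ** (1 / p)))   # intervals per dimension
    lo, hi = X.min(axis=0), X.max(axis=0)
    # equally wide intervals; clip so maxima fall into the last bin
    idx = np.floor((X - lo) / (hi - lo) * b).astype(int)
    idx = np.clip(idx, 0, b - 1)
    return [tuple(c) for c in idx]          # one cell id per example
```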
ImGrid - details
A trade-off in grid construction: precision vs. the number of data points needed for probability estimation.
We divide each dimension of the attribute space into ⌈(|D|/10)^(1/p)⌉ equally wide intervals (p = number of attributes), so that cells contain about 10 examples on average.
The algorithm aims at connecting cells that contain similar class distributions → a statistical hypothesis testing framework:
- popular tests: Pearson's χ², Fisher's exact test, Barnard's test
- those tests cannot directly state that the distributions are identical
- a Bayesian test based on the Bayes factor for the beta-binomial conjugate distribution with a non-informative Jeffreys prior is used
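The merge test can be illustrated with a Bayes factor that compares "both cells share one minority proportion" against "each cell has its own proportion", each with a Jeffreys Beta(1/2, 1/2) prior. This is my sketch (all helper names are mine), not the paper's implementation.

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a=0.5, b=0.5):
    """Log marginal likelihood of k minority examples among n
    under a Beta(a, b) prior (Jeffreys prior: a = b = 1/2)."""
    return log_beta(k + a, n - k + b) - log_beta(a, b)

def bayes_factor_same(k1, n1, k2, n2):
    """Bayes factor in favour of one shared proportion (merge the cells)
    over two separate proportions (keep them apart)."""
    log_bf = log_marginal(k1 + k2, n1 + n2) - (log_marginal(k1, n1)
                                               + log_marginal(k2, n2))
    return exp(log_bf)
```

Cells with similar class distributions yield a Bayes factor above 1 (evidence for merging), while clearly different cells yield a value below 1.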
Alternatives
ImScan
- cluster minority examples using DBSCAN
- add majority examples to minority clusters within ε
- label example types according to class proportion
ImKmeans
- cluster minority examples using k-means
Experimental evaluation
Comparison: ImGrid, ImKmeans, ImScan, Napierala
- ImGrid: α ∈ {0.75, 0.80, 0.85, 0.90, 0.95}
- ImScan: ε ∈ {10, 30, 50, 70, 90}, min points ∈ {2}
- ImKmeans: k ∈ [1, 9]
- Napierala: k ∈ {5, 7, 9, 11}
78 synthetic binary classification datasets:
- varying shapes and numbers of minority sub-concepts
- varying proportions of safe, borderline, rare, and outlier cases
- varying example density and sub-concept overlapping
- data generator by Wojciechowski & Wilk
Algorithms evaluated based on:
- minority class clustering (AMI: mutual information adjusted for chance)
- minority example categorization (G-mean)
- processing time (s)
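For reference, G-mean is the geometric mean of sensitivity and specificity; a minimal sketch (the function name is mine). Unlike accuracy, it collapses to 0 when either class is entirely ignored.

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of TPRate (sensitivity) and TNRate (specificity)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return sqrt(tpr * tnr)
```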
ImGrid - results
[Figure⁴: upper panel: minority sub-concepts; lower panel: example types (safe, borderline, rare, outlier)]
⁴ See full results in the paper and on the dedicated Web page; code is available.
ImGrid - results
[Figure: critical distance diagrams comparing ImGrid, ImScan, ImKmeans, and Napierala on AMI, G-mean, and processing time.]
Methods for imbalanced classification
A common categorization of methods for imbalanced classification is the following⁵:
- data-level solutions: oversampling, undersampling, combinations of both
- algorithmic-level solutions: cost-sensitive learning
Resampling methods: a different perspective
global data characteristic → local data characteristic
- Random Oversampling → SMOTE
- Random Undersampling → NCR
- Roughly Balanced Bagging → Hybrid Sampling Bagging
- DataBoost-IM → SMOTEBoost
Can we provide some kind of "halfway" data characteristic? Would this kind of information be helpful in designing resampling methods?
ImWeights
- a new example weighting scheme which exploits local, global, and neighbourhood information about data difficulty
- since ImGrid is a grid approach, it naturally defines a neighbourhood relation between clusters
ImWeights - concept of gravity
- the strength of gravity emitted by a cell is proportional to its safety
- gravity has a bigger impact on examples which lie closer to the emitting cell
ImWeights
The weight of a minority example is calculated using the following formula:

w_x = 1 + f(safety(x)) · (1 + gravity(x))

where safety(x) is the local data characteristic and gravity(x) carries the neighbourhood information.
- f() ∈ [0, 1] is piecewise linear and increases as safety decreases
- f() = 0 for safe examples and f() = 1 for rare and outlier examples
- the gravity term is neglected for safe minority examples
Majority examples receive weights which balance out the weights of the minority examples.
ImWeights - details
- non-linear transformation of safety: f(x) = max{0, min{1, −2.5 · x + 1.75}}
- the influence of gravity on an example's weight is the sum of the gravity emitted by the neighbouring cells
- the gravity emitted by a cell decreases linearly with growing distance from the cell border
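The pieces above can be combined into a short sketch, assuming safety(x) and gravity(x) have already been computed (e.g. by ImGrid); the function names are mine.

```python
def f(safety):
    """Piecewise-linear transform of safety: 0 for safe examples
    (safety >= 0.7), 1 for rare and outlier examples (safety <= 0.3)."""
    return max(0.0, min(1.0, -2.5 * safety + 1.75))

def imweight(safety, gravity):
    """Minority example weight: w = 1 + f(safety) * (1 + gravity).
    The gravity term vanishes for safe examples since f(safety) = 0."""
    return 1.0 + f(safety) * (1.0 + gravity)
```

For instance, an example with f(safety) = 0.875 and gravity = 0.388 receives the weight 1 + 0.875 · (1 + 0.388) = 2.2145, matching the worked calculation in the slides.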
ImWeights - calculations
[Figure: step-by-step computation of the weight of a highlighted (red) minority example:]

w_red = 1 + f(safety) · (1 + gravity) = 1 + 0.875 · (1 + 0.388) = 2.2145
Experiments
Algorithms in the experiment:
- global methods: Random Oversampling
- local methods: Borderline-SMOTE, ADASYN
12 real-world datasets from the UCI repository
Dataset         # examples  # attrib.  IR     Difficulty type
breast-w        699         9          1.90   safe
vehicle         846         18         3.25   safe
new-thyroid     215         5          5.14   safe
pima            768         8          1.87   borderline
haberman        306         3          2.78   borderline
ecoli           336         7          8.60   borderline
transfusion     748         4          3.20   rare
yeast           1484        8          28.10  rare
glass           214         9          12.59  rare
seismic-bumps   2584        11         14.2   rare/outlier
abalone         4177        7          11.47  outlier
balance-scale   625         4          11.76  outlier
Results (Logistic Regression), G-mean

Dataset         Baseline  ImWt.  ROS    B-SM.  ADA.
abalone         0.189     0.744  0.769  0.760  0.766
balance-scale   0.000     0.265  0.328  0.516  0.387
breast-w        0.957     0.962  0.961  0.969  0.967
ecoli           0.169     0.863  0.875  0.841  0.867
glass           0.000     0.673  0.569  0.609  0.578
haberman        0.392     0.640  0.643  0.622  0.650
new-thyroid     0.997     0.989  0.994  0.976  0.994
pima            0.694     0.761  0.752  0.743  0.748
seismic-bumps   0.448     0.582  0.306  0.333  0.344
transfusion     0.504     0.664  0.650  0.656  0.668
vehicle         0.965     0.963  0.962  0.952  0.963
yeast           0.000     0.846  0.831  0.833  0.815
Conclusions
- Simultaneous clustering and categorization of minority examples is effective
- ImGrid offers the best trade-off between clustering and categorization
- We provide a new perspective on combining local and global information when classifying imbalanced data
- ImWeights: a weighting approach which combines local, global, and neighbourhood information via the concept of gravity
- ImWeights achieves promising results, especially on rare and borderline datasets