Introduction Why imbalanced data is difficult? ImGrid ImWeights Summary
Using a grid clustering algorithm
for imbalanced data
Mateusz Lango¹ ²
Instytut Informatyki, Politechnika Poznańska
13 November 2018
¹ Lango M., Brzeziński D., Stefanowski J., ImWeights: Classifying Imbalanced Data Using Local and Neighborhood Information, JMLR Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains co-located with ECML/PKDD, 2018
² Lango M., Brzeziński D., Firlik S., Stefanowski J., Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data, Proceedings of the 20th International Conference on Discovery Science, Kyoto,
Common issue of the presented tasks
- highly skewed class distribution
- classifiers induced for those problems are good at detecting the majority class
- the minority class is of special interest (connecting related parts of a graph, inferring a need for improvement/repair)
Agenda
1 Introduction
2 Why is learning from imbalanced data difficult?
3 Discovering data difficulty factors with grid clustering
4 How to improve imbalanced data classification using grid clustering?
Why is learning from imbalanced data difficult?
An obvious answer: unequal class cardinality,

IR = N− / N+

Classifiers learned with a maximum generality bias optimize accuracy,

η = (TPRate · N+ + TNRate · N−) / (N+ + N−)
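A hypothetical worked example of the accuracy decomposition above (the function name is mine, not from the slides): with 10 minority and 90 majority examples, a trivial classifier that always predicts the majority class has TPRate = 0 and TNRate = 1, yet still reaches 90% accuracy.

```python
def accuracy(tp_rate, tn_rate, n_pos, n_neg):
    """Overall accuracy as a class-size-weighted mean of TPRate and TNRate."""
    return (tp_rate * n_pos + tn_rate * n_neg) / (n_pos + n_neg)

# A classifier that always predicts the majority class:
print(accuracy(0.0, 1.0, 10, 90))  # -> 0.9
```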
An experiment with imbalanced data
[Figure: a sequence of classifiers trained on increasingly difficult imbalanced datasets. The reported metrics range from Acc 100%, Prec 100% on the easiest data, through intermediate cases such as Acc 94.9%, Prec 95.1%, Recall 94.7% and Acc 70.6%, Prec 72.3%, Recall 66.7%, down to datasets where accuracy stays high (Acc 83.3% and 90.9%) while precision falls to 0%.]
Why is learning from imbalanced data really difficult?
- the global imbalance ratio
- minority class examples divided into subconcepts/small disjuncts
- overlap between classes
- presence of many minority class examples inside the majority class region
Local data difficulty factors
The labels of examples are usually established by locally estimating the conditional probability:

p = Pr(y = + | x)

1 ≥ p > 0.7 → safe example
0.7 ≥ p > 0.3 → borderline example
0.3 ≥ p > 0.1 → rare example
0.1 ≥ p ≥ 0 → outlier example
Napierała & Stefanowski proposed to estimate these probabilities with k-NN³.
³ The influence of minority class distribution on learning from imbalanced data, Napierała & Stefanowski, HAIS,
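The k-NN estimation can be sketched as follows. This is a minimal illustration assuming Euclidean distance and the thresholds listed above; the function and variable names are mine, not from the paper.

```python
import numpy as np

def categorize_minority(X, y, minority=1, k=5):
    """Label each minority example as safe/borderline/rare/outlier based on
    the fraction of minority neighbours among its k nearest neighbours."""
    labels = {}
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the example itself
        nn = np.argsort(d)[:k]               # indices of k nearest neighbours
        p = np.mean(y[nn] == minority)       # local estimate of Pr(y = + | x)
        if p > 0.7:
            labels[i] = "safe"
        elif p > 0.3:
            labels[i] = "borderline"
        elif p > 0.1:
            labels[i] = "rare"
        else:
            labels[i] = "outlier"
    return labels
```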
ImGrid algorithm - motivation
- locality-based approaches define difficulties by modelling interactions with majority examples
- however, they do not detect minority subclusters
- limited work on clustering imbalanced data (mainly in the context of resampling methods)
Question: is it possible to construct a clustering approach that simultaneously discovers sub-concepts in complex imbalanced data and categorizes the types of examples inside them?
ImGrid algorithm - general idea
- divide the attribute space into grid cells
- join adjacent cells based on minority class distributions
- label examples according to local difficulty factors
- form minority sub-clusters
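The first step might look like this minimal sketch, assuming numeric attributes scaled by their observed min/max and equally wide bins; `grid_cells` and its exact form are my illustration, not the paper's code.

```python
import numpy as np

def grid_cells(X):
    """Assign each example to a grid cell; the number of intervals per
    dimension is chosen so that cells hold roughly 10 points on average."""
    n, p = X.shape
    b = int(np.ceil((n / 10) ** (1 / p)))   # intervals per dimension
    lo, hi = X.min(axis=0), X.max(axis=0)
    # equally wide intervals; clip so maxima fall into the last bin
    idx = np.floor((X - lo) / (hi - lo) * b).astype(int)
    idx = np.clip(idx, 0, b - 1)
    return [tuple(c) for c in idx]          # one cell id per example
```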
ImGrid - details
A trade-off in grid construction: precision vs. the number of data points needed for probability estimation.
We divide each dimension of the attribute space into ⌈(|D|/10)^(1/p)⌉ equally wide intervals (p = number of attributes), so that cells contain about 10 examples on average.
The algorithm aims at connecting cells that contain similar class distributions → a statistical hypothesis testing framework:
- popular tests: Pearson's χ², Fisher's exact test, Barnard's test
- those tests cannot directly state that the distributions are identical
- a Bayesian test based on the Bayes factor for the beta-binomial conjugate distribution with a non-informative Jeffreys prior is used
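The merge test can be illustrated with a Bayes factor that compares "both cells share one minority proportion" against "each cell has its own proportion", each with a Jeffreys Beta(1/2, 1/2) prior. This is my sketch (all helper names are mine), not the paper's implementation.

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a=0.5, b=0.5):
    """Log marginal likelihood of k minority examples among n
    under a Beta(a, b) prior (Jeffreys prior: a = b = 1/2)."""
    return log_beta(k + a, n - k + b) - log_beta(a, b)

def bayes_factor_same(k1, n1, k2, n2):
    """Bayes factor in favour of one shared proportion (merge the cells)
    over two separate proportions (keep them apart)."""
    log_bf = log_marginal(k1 + k2, n1 + n2) - (log_marginal(k1, n1)
                                               + log_marginal(k2, n2))
    return exp(log_bf)
```

Cells with similar class distributions yield a Bayes factor above 1 (evidence for merging), while clearly different cells yield a value below 1.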
Alternatives
ImScan
- cluster minority examples using DBSCAN
- add majority examples to minority clusters within ε
- label example types according to class proportion
ImKmeans
- cluster minority examples using k-means
Experimental evaluation
Comparison: ImGrid, ImKmeans, ImScan, Napierala
- ImGrid: α ∈ {0.75, 0.80, 0.85, 0.90, 0.95}
- ImScan: ε ∈ {10, 30, 50, 70, 90}, min points ∈ {2}
- ImKmeans: k ∈ [1, 9]
- Napierala: k ∈ {5, 7, 9, 11}
78 synthetic binary classification datasets:
- varying shapes and numbers of minority sub-concepts
- varying proportions of safe, borderline, rare, and outlier cases
- varying example density and sub-concept overlapping
- data generator by Wojciechowski & Wilk
Algorithms evaluated based on:
- minority class clustering (AMI: mutual information adjusted for chance)
- minority example categorization (G-mean)
- processing time (s)
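For reference, G-mean is the geometric mean of sensitivity and specificity; a minimal sketch (the function name is mine). Unlike accuracy, it collapses to 0 when either class is entirely ignored.

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of TPRate (sensitivity) and TNRate (specificity)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return sqrt(tpr * tnr)
```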
ImGrid - results
[Figure⁴: upper panel: minority sub-concepts; lower panel: example types (safe, borderline, rare, outlier)]
⁴ See full results in the paper and on the dedicated Web page; code is available.
ImGrid - results
[Figure: critical distance diagrams comparing ImGrid, ImScan, ImKmeans, and Napierala on AMI, G-mean, and processing time.]
Methods for imbalanced classification
A common categorization of methods for imbalanced classification is the following⁵:
- data-level solutions: oversampling, undersampling, combinations of both
- algorithmic-level solutions: cost-sensitive learning
Resampling methods: a different perspective
global data characteristic → local data characteristic
- Random Oversampling → SMOTE
- Random Undersampling → NCR
- Roughly Balanced Bagging → Hybrid Sampling Bagging
- DataBoost-IM → SMOTEBoost
Can we provide some kind of "halfway" data characteristic? Would this kind of information be helpful in designing resampling methods?
ImWeights
- a new example weighting scheme which exploits local, global, and neighbourhood information about data difficulty
- since ImGrid is a grid approach, it naturally defines a neighbourhood relation between clusters
ImWeights - concept of gravity
- the strength of gravity emitted by a cell is proportional to its safety
- gravity has a bigger impact on examples which lie closer to the emitting cell
ImWeights
The weight of a minority example is calculated using the following formula:

w_x = 1 + f(safety(x)) · (1 + gravity(x))

where safety(x) is the local data characteristic and gravity(x) carries the neighbourhood information.
- f() ∈ [0, 1] is piecewise linear and increases as safety decreases
- f() = 0 for safe examples and f() = 1 for rare and outlier examples
- the gravity term is neglected for safe minority examples
Majority examples receive weights which balance out the weights of the minority examples.
ImWeights - details
- non-linear transformation of safety: f(x) = max{0, min{1, −2.5 · x + 1.75}}
- the influence of gravity on an example's weight is the sum of the gravity emitted by the neighbouring cells
- the gravity emitted by a cell decreases linearly with growing distance from the cell border
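The pieces above can be combined into a short sketch, assuming safety(x) and gravity(x) have already been computed (e.g. by ImGrid); the function names are mine.

```python
def f(safety):
    """Piecewise-linear transform of safety: 0 for safe examples
    (safety >= 0.7), 1 for rare and outlier examples (safety <= 0.3)."""
    return max(0.0, min(1.0, -2.5 * safety + 1.75))

def imweight(safety, gravity):
    """Minority example weight: w = 1 + f(safety) * (1 + gravity).
    The gravity term vanishes for safe examples since f(safety) = 0."""
    return 1.0 + f(safety) * (1.0 + gravity)
```

For instance, an example with f(safety) = 0.875 and gravity = 0.388 receives the weight 1 + 0.875 · (1 + 0.388) = 2.2145, matching the worked calculation in the slides.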
ImWeights - calculations
[Figure: step-by-step computation of the weight of a highlighted (red) minority example:]

w_red = 1 + f(safety) · (1 + gravity) = 1 + 0.875 · (1 + 0.388) = 2.2145
Experiments
Algorithms in the experiment:
- global methods: Random Oversampling
- local methods: Borderline-SMOTE, ADASYN
12 real-world datasets from the UCI repository
Dataset         # examples  # attrib.  IR     Difficulty type
breast-w        699         9          1.90   safe
vehicle         846         18         3.25   safe
new-thyroid     215         5          5.14   safe
pima            768         8          1.87   borderline
haberman        306         3          2.78   borderline
ecoli           336         7          8.60   borderline
transfusion     748         4          3.20   rare
yeast           1484        8          28.10  rare
glass           214         9          12.59  rare
seismic-bumps   2584        11         14.2   rare/outlier
abalone         4177        7          11.47  outlier
balance-scale   625         4          11.76  outlier
Results (Logistic Regression), G-mean

Dataset         Baseline  ImWt.  ROS    B-SM.  ADA.
abalone         0.189     0.744  0.769  0.760  0.766
balance-scale   0.000     0.265  0.328  0.516  0.387
breast-w        0.957     0.962  0.961  0.969  0.967
ecoli           0.169     0.863  0.875  0.841  0.867
glass           0.000     0.673  0.569  0.609  0.578
haberman        0.392     0.640  0.643  0.622  0.650
new-thyroid     0.997     0.989  0.994  0.976  0.994
pima            0.694     0.761  0.752  0.743  0.748
seismic-bumps   0.448     0.582  0.306  0.333  0.344
transfusion     0.504     0.664  0.650  0.656  0.668
vehicle         0.965     0.963  0.962  0.952  0.963
yeast           0.000     0.846  0.831  0.833  0.815
Conclusions
- Simultaneous clustering and categorization of minority examples is effective
- ImGrid offers the best trade-off between clustering and categorization
- We provide a new perspective on combining local and global information when classifying imbalanced data
- ImWeights: a weighting approach which combines local, global, and neighbourhood information via the concept of gravity
- ImWeights achieves promising results, especially on rare and borderline datasets