Introduction | Why is imbalanced data difficult? | ImGrid | ImWeights | Summary

Wykorzystanie gridowego algorytmu grupowania dla danych niezbalansowanych
(Using a grid clustering algorithm for imbalanced data)

Mateusz Lango

Instytut Informatyki, Politechnika Poznańska

13 listopada 2018 (13 November 2018)

1. Lango M., Brzeziński D., Stefanowski J., ImWeights: Classifying Imbalanced Data Using Local and Neighborhood Information, JMLR Proceedings of the 2nd International Workshop on Learning with Imbalanced Domains co-located with ECML/PKDD, 2018
2. Lango M., Brzeziński D., Firlik S., Stefanowski J., Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data, Proceedings of the 20th International Conference on Discovery Science, Kyoto,


Common issue of the presented tasks

- highly skewed class distribution
- classifiers induced for those problems are good at detecting the majority class
- the minority class is of special interest: connecting related parts of a graph, inferring a need of improvement/repair


Agenda

1. Introduction
2. Why is learning from imbalanced data difficult?
3. Discovering data difficulty factors with grid clustering
4. How to improve imbalanced data classification using grid clustering?


Why is learning from imbalanced data difficult?

an obvious answer: unequal class cardinality

IR = N− / N+

classifiers learned with maximum generality bias optimize overall accuracy, which decomposes class-wise as

η = TPRate · N+ / (N+ + N−) + TNRate · N− / (N+ + N−)
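The decomposition above can be checked numerically - a short sketch with illustrative class sizes (the numbers are not from the slides):

```python
def accuracy(tp_rate, tn_rate, n_pos, n_neg):
    """Accuracy decomposed by class: eta = TPRate*N+/N + TNRate*N-/N."""
    n = n_pos + n_neg
    return tp_rate * n_pos / n + tn_rate * n_neg / n

# A classifier that always predicts the majority class:
# TPRate = 0 (misses every minority example), TNRate = 1.
print(accuracy(0.0, 1.0, 10, 90))    # → 0.9: high accuracy, useless minority detection

# A classifier that detects 95% of the minority class (85% of the majority)
# scores lower on accuracy despite being far more useful:
print(accuracy(0.95, 0.85, 10, 90))  # ≈ 0.86
```

This is exactly why accuracy rewards the generality bias on skewed data: the majority term dominates the sum.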

[Slides: "An experiment with imbalanced data" - a sequence of classification plots on increasingly difficult synthetic datasets. Reported metrics per slide: Acc 100%, Prec 100% (several easy configurations); Acc 94.9%, Prec 95.1%, Recall 94.7%; Acc 95.1%; Acc 95.8%, Prec 89%; Acc 97%, Prec 86.9%; Acc 70.6%, Prec 72.3%, Recall 66.7%; Acc 74.8%; and finally Acc 83.3% and Acc 90.9% with Prec 0% - high accuracy while missing the minority class entirely.]

Why is learning from imbalanced data really difficult?

- global imbalance ratio
- minority class examples divided into sub-concepts/small disjuncts
- overlap between classes
- presence of many minority class examples inside the majority class region


Local data difficulty factors

The type labels of examples are usually established by locally estimating the conditional probability:

p = Pr(y = + | x)

1 ≥ p > 0.7 → safe example
0.7 ≥ p > 0.3 → borderline example
0.3 ≥ p > 0.1 → rare example
0.1 ≥ p ≥ 0 → outlier example

Napierała & Stefanowski proposed to estimate these probabilities with k-NN3.

3. The influence of minority class distribution on learning from imbalance data, Napierała & Stefanowski, HAIS,
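The k-NN estimate can be sketched as follows (a minimal illustration using the slide's thresholds; the Euclidean distance and the toy data are assumptions, not the authors' implementation):

```python
import numpy as np

def example_type(x, X, y, k=5):
    """Categorize a minority example x by the fraction of minority
    neighbours (label 1) among its k nearest neighbours in X.
    Assumes x itself is not a row of X (otherwise it would count itself)."""
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every example
    knn = np.argsort(dists)[:k]             # indices of the k nearest neighbours
    p = np.mean(y[knn] == 1)                # local estimate of Pr(y = + | x)
    if p > 0.7:
        return "safe"
    elif p > 0.3:
        return "borderline"
    elif p > 0.1:
        return "rare"
    return "outlier"

# Toy data: a minority cluster near the origin, a majority cluster near (5, 5).
X = np.array([[0., 0.], [0.1, 0.], [0., 0.1], [5., 5.], [5., 5.1], [5.1, 5.]])
y = np.array([1, 1, 1, 0, 0, 0])
print(example_type(np.array([0.05, 0.05]), X, y, k=3))  # "safe"
print(example_type(np.array([5., 5.05]), X, y, k=3))    # "outlier"
```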


ImGrid algorithm - motivation

- Locality-based approaches define difficulties by modelling interactions with majority examples
- However, they do not detect minority sub-clusters
- Limited work on clustering imbalanced data (mainly in the context of resampling methods)
- Question: Is it possible to construct a clustering approach that simultaneously discovers sub-concepts in complex imbalanced data and categorizes the types of examples inside them?


ImGrid algorithm - general idea

- divide the attribute space into grid cells
- join adjacent cells based on minority class distributions
- label examples according to local difficulty factors
- form minority sub-clusters


ImGrid - details

A trade-off in grid construction: precision vs. the number of data points needed for probability estimation.

We divide each dimension of the attribute space into ⌈√(|D|/10)⌉ equally wide intervals.

The algorithm aims at connecting cells that contain similar class distributions → a statistical hypothesis testing framework:
- popular tests: Pearson's χ², Fisher's exact test, Barnard's test
- those tests cannot directly state that the distributions are identical

Therefore a Bayesian test based on the Bayes factor for the beta-binomial conjugate distribution with a non-informative Jeffreys prior is used.
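The Bayesian merge test can be sketched as a Bayes factor comparing "both cells share one minority-class proportion" against "each cell has its own proportion", under a beta-binomial model with the Jeffreys Beta(0.5, 0.5) prior (a minimal illustration; the counts are hypothetical and this is not the authors' implementation):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_same(k1, n1, k2, n2, a=0.5, b=0.5):
    """BF01 for two grid cells with k1/n1 and k2/n2 minority examples:
    H0 = one shared minority proportion, H1 = independent proportions.
    Beta(a, b) is the Jeffreys prior; binomial coefficients cancel in the ratio."""
    log_m0 = log_beta(k1 + k2 + a, n1 + n2 - k1 - k2 + b) - log_beta(a, b)
    log_m1 = (log_beta(k1 + a, n1 - k1 + b) - log_beta(a, b)
              + log_beta(k2 + a, n2 - k2 + b) - log_beta(a, b))
    return exp(log_m0 - log_m1)

# Cells with similar class distributions -> BF01 > 1, evidence for merging:
print(bayes_factor_same(3, 20, 4, 20))   # > 1
# Very different cells -> BF01 < 1, evidence for keeping them separate:
print(bayes_factor_same(1, 20, 15, 20))  # < 1
```

Unlike a frequentist test, BF01 > 1 is direct evidence *for* the distributions being identical, which is exactly what cell merging needs.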



Alternatives

ImScan
- cluster minority examples using DBSCAN
- add majority examples within ε to the minority clusters
- label example types according to the class proportion

ImKmeans
- cluster minority examples using k-means


Experimental evaluation

Comparison: ImGrid, ImKmeans, ImScan, Napierala
- ImGrid: α ∈ {0.75, 0.80, 0.85, 0.90, 0.95}
- ImScan: ε ∈ {10, 30, 50, 70, 90}, min points ∈ {2}
- ImKmeans: k ∈ [1, 9]
- Napierala: k ∈ {5, 7, 9, 11}

78 synthetic binary classification datasets
- varying shapes and numbers of minority sub-concepts
- varying proportions of safe, borderline, rare, outlier cases
- varying example density and sub-concept overlapping
- data generator by Wojciechowski & Wilk

Algorithms evaluated based on:
- minority class clustering (AMI - mutual information adjusted for chance)
- minority example categorization (G-mean)
- processing time (s)
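Of the criteria above, G-mean is easy to state concretely: the geometric mean of the per-class recalls, which is low whenever any class is poorly recognized. A minimal sketch (the toy labels are illustrative; AMI is available in standard clustering libraries):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; for binary data this equals
    sqrt(sensitivity * specificity)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# Perfect on class 0, only half right on class 1: sqrt(1.0 * 0.5)
print(g_mean([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0]))  # ≈ 0.707
```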


ImGrid - results4

[Figure. Upper panel: minority sub-concepts. Lower panel: example types (safe, borderline, rare, outlier).]

4. See full results in the paper and on the dedicated Web page, code is


ImGrid - results

[Critical-difference diagrams comparing the methods on AMI (ImScan, ImKmeans, ImGrid), on G-mean (Napierala, ImScan, ImGrid, ImKmeans), and on processing time (Napierala, ImGrid, ImScan, ImKmeans).]


Methods for imbalanced classification

A common categorization of methods for imbalanced classification is the following5:
- data-level solutions: oversampling, undersampling, combinations of both
- algorithmic-level solutions: cost-sensitive learning


Resampling methods: a different perspective

global data characteristics      local data characteristics
Random Oversampling              SMOTE
Random Undersampling             NCR
Roughly Balanced Bagging         Hybrid Sampling Bagging
DataBoost-IM                     SMOTEBoost

Can we provide some kind of "halfway" data characteristic? Would this kind of information be helpful in designing resampling methods?


ImWeights

- a new example weighting scheme which exploits local, global and neighbourhood information about data difficulty
- since ImGrid is a grid approach, it naturally defines a neighbourhood relation between clusters


ImWeights - the concept of gravity

- The strength of gravity emitted by a cell is proportional to its safety
- The gravity has a bigger impact on examples which lie closer to the emitting cell


ImWeights

The weight of a minority example is calculated using the following formula:

w_x = 1 + f(safety(x)) · (1 + gravity(x))

where safety(x) is the local data characteristic and gravity(x) carries the neighbourhood information.

- f(·) ∈ [0, 1] is piecewise linear and increases as safety decreases
- f(·) = 0 for safe examples and f(·) = 1 for rare and outlier examples
- the gravity term is neglected for safe minority examples
- the majority examples receive weights which balance out the weights of the minority examples


ImWeights - details

- non-linear transformation of safety: f(x) = max{0, min{1, −2.5 · x + 1.75}}
- the influence of gravity on an example's weight is the sum of the gravity emitted by the neighbouring cells
- the gravity emitted by a cell decreases linearly with growing distance from the cell border
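Putting the formula and f together (a minimal sketch; the safety value 0.35 is inferred from the f(safety) = 0.875 that appears in the worked example on the following slides, and gravity is taken as an already-computed input):

```python
def f(safety):
    """Piecewise-linear transform of safety: 1 for rare/outlier examples
    (low safety), 0 for safe examples (high safety)."""
    return max(0.0, min(1.0, -2.5 * safety + 1.75))

def minority_weight(safety, gravity):
    """ImWeights weight of a minority example:
    w_x = 1 + f(safety(x)) * (1 + gravity(x)).
    For safe examples f(safety) = 0, so the gravity term vanishes."""
    fs = f(safety)
    if fs == 0.0:          # safe example -> gravity neglected
        return 1.0
    return 1.0 + fs * (1.0 + gravity)

# Worked example from the slides: f(safety) = 0.875, gravity = 0.388
print(minority_weight(0.35, 0.388))  # ≈ 2.2145 = 1 + 0.875 * 1.388
```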

ImWeights - calculations

[A sequence of figure slides walks through the gravity and weight computation for one ("red") minority example, ending with:]

w_red = 1 + 0.875 · (1 + 0.388) = 2.2145

Experiments

Algorithms in the experiment:
- global methods: Random Oversampling
- local methods: Borderline-SMOTE, ADASYN

12 real-world datasets from the UCI repository:

Dataset        # examples  # attrib.  IR     Difficulty type
breast-w       699         9          1.90   safe
vehicle        846         18         3.25   safe
new-thyroid    215         5          5.14   safe
pima           768         8          1.87   borderline
haberman       306         3          2.78   borderline
ecoli          336         7          8.60   borderline
transfusion    748         4          3.20   rare
yeast          1484        8          28.10  rare
glass          214         9          12.59  rare
seismic-bumps  2584        11         14.2   rare/outlier
abalone        4177        7          11.47  outlier
balance-scale  625         4          11.76  outlier


Results (Logistic Regression), G-mean

Dataset        Baseline  ImWt.  ROS    B-SM.  ADA.
abalone        0.189     0.744  0.769  0.760  0.766
balance-scale  0.000     0.265  0.328  0.516  0.387
breast-w       0.957     0.962  0.961  0.969  0.967
ecoli          0.169     0.863  0.875  0.841  0.867
glass          0.000     0.673  0.569  0.609  0.578
haberman       0.392     0.640  0.643  0.622  0.650
new-thyroid    0.997     0.989  0.994  0.976  0.994
pima           0.694     0.761  0.752  0.743  0.748
seismic-bumps  0.448     0.582  0.306  0.333  0.344
transfusion    0.504     0.664  0.650  0.656  0.668
vehicle        0.965     0.963  0.962  0.952  0.963
yeast          0.000     0.846  0.831  0.833  0.815


Conclusions

- Simultaneous clustering and categorization of minority examples is effective
- ImGrid: the best trade-off between clustering and categorization
- We provide a new perspective on combining local and global information when classifying imbalanced data
- ImWeights: a weighting approach which combines local, global and neighbourhood information via the concept of gravity
- ImWeights achieves promising results, especially on rare and borderline datasets
