
ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 225, 2009

Eugeniusz Gatnar*

MEASURES OF DIVERSITY AND THE CLASSIFICATION ERROR

IN THE MULTIPLE-MODEL APPROACH

The multiple-model approach (model aggregation, model fusion) is most commonly used in classification and regression. In this approach K component (single) models C_1(x), C_2(x), ..., C_K(x) are combined into one global model (ensemble) C*(x), for example using majority voting:

C*(x) = argmax_y Σ_{k=1}^{K} I(C_k(x) = y)   (1)

Tumer and Ghosh (1996) proved that the classification error of the ensemble C*(x) depends on the diversity of the ensemble members. In other words, the higher the diversity of the component models, the lower the classification error of the combined model.

Since several diversity measures for classifier ensembles have been proposed so far, in this paper we present a comparison of the ability of selected diversity measures to predict the accuracy of classifier ensembles.

Key words: Multiple-model approach, Model fusion, Classifier ensemble, Diversity measures.

Several variants of aggregation methods have been developed in the past decade. They differ in two aspects: the way the subsets used to train the component classifiers are formed, and the way the base classifiers are combined.

Generally there are three approaches to obtain the training subsets:

* Ph.D., Chair of Statistics, Katowice University of Economics, Katowice.


• Manipulating training examples, e.g. Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1997) and Arcing (Breiman, 1998).

• Manipulating input features: Random subspaces (Ho, 1998), Random split selection (Amit and Geman, 1997), Random forests (Breiman, 2001).

• Manipulating output values: Adaptive bagging (Breiman, 1999), Error-correcting output coding (Dietterich and Bakiri, 1995).

Given a set of classifiers, they can be combined using one of the following methods:

• averaging methods, e.g. average vote and weighted vote;

• non-linear methods, e.g. majority vote (the most frequent class among the component classifiers' votes is taken as the predicted class), maximum vote, the Borda count method, etc.;

• stacked generalisation, where the classifiers are fitted to training subsamples obtained by leave-one-out cross-validation (Wolpert, 1992).

2. Diversity

The high accuracy of the classifier ensemble C*(x) is achieved if the members of the ensemble are "weak" and diverse. The term "weak" refers to classifiers that have high variance, e.g. classification trees, nearest neighbours, and neural nets.

Diversity among classifiers means that they are different from each other, i.e. they misclassify different examples. This is mostly obtained by using different training subsets, assigning different weights to instances, or selecting different subsets of features (subspaces).

Tumer and Ghosh (1996) proved that the classification error of the ensemble C*(x) depends on the diversity of the ensemble members:

e(C*) = e_B(C*) + [(1 + r(K - 1)) / K] · e(C_i),   (2)

where e_B(C*) is the Bayes error, r is the average correlation coefficient between the errors of the component classifiers, K is the number of base classifiers, and e(C_i) is the error of an individual classifier.

As we can see in formula (2), the ensemble error decreases as the correlation between the component classifiers decreases.
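Reading formula (2) literally, its behaviour is easy to check numerically. A sketch (the function name and example values are ours, and e(C_i) is treated here as the individual error term exactly as it enters the formula):

```python
def ensemble_error(e_bayes, e_single, r, K):
    """Ensemble classification error according to formula (2):
    Bayes error plus the individual error term scaled by
    (1 + r(K-1))/K, where r is the average error correlation."""
    return e_bayes + (1 + r * (K - 1)) / K * e_single

# Perfectly correlated members (r = 1): combining brings no gain.
print(ensemble_error(0.1, 0.2, r=1.0, K=5))  # -> about 0.3
# Uncorrelated members (r = 0): the individual term shrinks K-fold.
print(ensemble_error(0.1, 0.2, r=0.0, K=5))  # -> about 0.14
```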


In general, Sharkey and Sharkey (1997) introduced four levels of diversity:

• Level 1 - no more than one classifier is wrong on each example.

• Level 2 - up to half of the classifiers could be wrong for each example (the majority vote is always correct).

• Level 3 - at least one classifier is correct for each example.

• Level 4 - none of the classifiers is correct for some examples.

The level of diversity among candidate classifiers determines the method by which they should be combined. For example, the majority vote is good for classifiers that exhibit level 1 or 2 diversity. Otherwise, more complex methods, e.g. stacked generalization, are more appropriate.

Several combining methods that take into account the diversity of classifiers have been proposed in the literature. For example, Rosen (1996) presented a combination method that incorporates an error-decorrelation penalty term. It allows component classifiers to make errors which are uncorrelated. Hashem (1999) proposed the use of the relative accuracy of component classifiers as weights in linear combinations of the members of an ensemble. Oza and Tumer (1999) developed a method named input decimation that eliminates features weakly correlated with the class. Zenobi and Cunningham (2001) have used a hill-climbing search for feature subsets that is guided by a diversity measure.

Recently, Melville and Mooney (2004) developed a method called DECORATE that reduces the classification error of the ensemble by increasing diversity. It adds artificial examples to the original training set, i.e. oppositely labelled examples.

3. Measures of diversity

We can simply measure the agreement between two classifiers C_i(x) and C_j(x) as:

Agreement(C_i, C_j) = (1/N) Σ_{n=1}^{N} I(C_i(x_n) = C_j(x_n)),   (3)

but this takes into account both correct and incorrect classifications of the component models.

In order to overcome this drawback we define the "oracle" output (O) of the classifier C_k(x) as:

O_k(x_i) = 1 if C_k(x_i) = y_i, and O_k(x_i) = 0 if C_k(x_i) ≠ y_i.   (4)

In other words, the value O_k(x) = 1 means that the classifier C_k(x) is correct, i.e. it recognizes the true class (y) of the example x, and O_k(x) = 0 means that the classifier is wrong. The relationship between a pair of classifiers is presented in a two-way contingency table (Table 1).

Table 1. Oracle labels for two classifiers

              O_j(x) = 1    O_j(x) = 0
O_i(x) = 1        a             b
O_i(x) = 0        c             d

Source: own study.
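The oracle outputs (4) and the counts a, b, c, d of Table 1 are straightforward to compute. A minimal sketch with our own toy labels:

```python
def oracle(preds, y_true):
    """Oracle output (4): 1 where the classifier is correct, else 0."""
    return [int(p == t) for p, t in zip(preds, y_true)]

def contingency(o_i, o_j):
    """Counts a, b, c, d of Table 1 for two oracle-output vectors."""
    a = sum(1 for u, v in zip(o_i, o_j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(o_i, o_j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(o_i, o_j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(o_i, o_j) if u == 0 and v == 0)
    return a, b, c, d

y      = [0, 1, 1, 0, 1]
pred_i = [0, 1, 0, 0, 1]   # classifier C_i is wrong on the third example
pred_j = [0, 0, 1, 0, 1]   # classifier C_j is wrong on the second example
print(contingency(oracle(pred_i, y), oracle(pred_j, y)))  # -> (3, 1, 1, 0)
```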

The simplest measure of diversity (or rather similarity) of two classifiers is the binary version of Pearson's correlation coefficient:

r(i, j) = (ad - bc) / √[(a + b)(c + d)(a + c)(b + d)]   (5)

Partridge and Yates (1996), and Margineantu and Dietterich (1997) have used a measure named within-set generalization diversity (the kappa statistic):

κ(i, j) = 2(ad - bc) / [(a + b)(b + d) + (a + c)(c + d)]   (6)

It measures the level of agreement between two classifiers with a correction for chance.

Skalak (1996) has proposed the disagreement measure:

DM(i, j) = (b + c) / (a + b + c + d)   (7)

This is the ratio of the number of examples on which one classifier is correct and the other is wrong to the total number of examples.

Giacinto and Roli (2001) have introduced a measure named compound diversity:

CD(i, j) = d / (a + b + c + d)   (8)

This measure is also named the "double-fault measure" because it is the proportion of examples that have been misclassified by both classifiers.

Kuncheva et al. (2000) recommended Yule's Q statistic as a diversity measure:

Q(i, j) = (ad - bc) / (ad + bc)   (9)

Yule's Q statistic is the original measure of dichotomous agreement, designed to be analogous to the correlation coefficient. This measure is pairwise and symmetric and varies between -1 and 1. A value of 0 indicates statistical independence of the classifiers, positive values mean that the classifiers have recognized the same examples correctly, and negative values that the classifiers commit errors on different examples.

Gatnar (2005) proposed the Hamann coefficient:

H(i, j) = [(a + d) - (b + c)] / (a + b + c + d)   (10)

This binary similarity coefficient is simply the difference between the matches and mismatches as a proportion of the total number of entries. It ranges from -1 to 1. A value of 0 indicates an equal number of matches and mismatches, -1 represents perfect disagreement, and 1 perfect agreement.
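Given the counts a, b, c, d, the pairwise measures (5)-(10) reduce to one-line formulas. A sketch under two assumptions: the kappa line uses the standard chance-corrected 2x2 form (cf. Fleiss, 1981), and the counts are assumed non-degenerate so no denominator is zero:

```python
from math import sqrt

def pairwise_measures(a, b, c, d):
    """Pairwise diversity/similarity measures (5)-(10) from Table 1 counts."""
    n = a + b + c + d
    return {
        "r":     (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d)),
        "kappa": 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d)),
        "DM":    (b + c) / n,                        # disagreement (7)
        "CD":    d / n,                              # double fault (8)
        "Q":     (a * d - b * c) / (a * d + b * c),  # Yule's Q (9)
        "H":     ((a + d) - (b + c)) / n,            # Hamann (10)
    }

m = pairwise_measures(4, 1, 2, 3)
print(round(m["Q"], 3), round(m["H"], 3))  # -> 0.714 0.4
```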

In the case of pairwise measures, the overall value of diversity for the ensemble C*(x) is computed as the mean over all pairs:

Diversity(C*) = [2 / (K(K - 1))] Σ_{i=1}^{K-1} Σ_{j=i+1}^{K} Diversity(C_i, C_j)   (11)
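Formula (11) is a plain average over all K(K-1)/2 classifier pairs. A sketch using the disagreement measure as the plugged-in pairwise Diversity (names and toy oracle vectors are ours):

```python
from itertools import combinations

def ensemble_diversity(oracles, measure):
    """Overall diversity (11): the mean of a pairwise measure over all
    K(K-1)/2 classifier pairs, given their oracle-output vectors."""
    pairs = list(combinations(oracles, 2))
    return sum(measure(o_i, o_j) for o_i, o_j in pairs) / len(pairs)

# Disagreement measure (7) computed directly from oracle outputs:
dm = lambda o_i, o_j: sum(u != v for u, v in zip(o_i, o_j)) / len(o_i)
print(ensemble_diversity([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dm))
```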

Several non-pairwise measures have also been developed to estimate the diversity between classifiers C_1(x), C_2(x), ..., C_K(x).


Hansen and Salamon (1990) proposed the measure of difficulty:

DI = (1/K) Σ_{k=1}^{K} (p_y(k) - p̄_y)²   (12)

It is in fact the variance of the variable p_y(k), representing the proportion of classifiers that correctly classify an example x chosen at random.

Partridge and Krzanowski (1997) have introduced the generalized diversity measure:

GD = 1 - P(2) / P(1),   (13)

where:

P(1) = Σ_{k=1}^{K} (k/K) p_k   and   P(2) = Σ_{k=1}^{K} (k/K) [(k - 1)/(K - 1)] p_k,   (14)

and p_k is the probability that k classifiers misclassify an example x chosen at random.
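The generalized diversity (13)-(14) can be computed directly from the failure distribution p_k. A sketch with two boundary cases (our own toy distributions): when failures never coincide GD = 1, and when all classifiers fail together GD = 0:

```python
def generalized_diversity(p, K):
    """Generalized diversity GD, formulas (13)-(14).

    p[k] is the probability that exactly k of the K classifiers
    misclassify a randomly chosen example (p has K + 1 entries,
    for k = 0, 1, ..., K)."""
    p1 = sum(k / K * p[k] for k in range(1, K + 1))
    p2 = sum(k / K * (k - 1) / (K - 1) * p[k] for k in range(1, K + 1))
    return 1 - p2 / p1

# At most one of the 3 classifiers is ever wrong: maximal diversity.
print(generalized_diversity([0.4, 0.6, 0.0, 0.0], K=3))  # -> 1.0
# All 3 classifiers fail on the same examples: no diversity.
print(generalized_diversity([0.4, 0.0, 0.0, 0.6], K=3))  # -> 0.0
```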

Cunningham and Carney (2000) used the entropy function:

E = -(1/N) Σ_{n=1}^{N} Σ_{i} (N_i^n / K) log(N_i^n / K),   (15)

where N_i^n is the number of base classifiers that misclassified x_n to class C_i.

4. Experiments

In order to compare the ability of the diversity measures to predict the accuracy of the combined classifier, we followed the synthetic experiment presented in Kuncheva et al. (2000). We have generated two artificial sets of classifier ensembles of known classification accuracy.

In the first experiment we used a test set of 10 examples (N = 10) and 3 classifiers: C_1(x), C_2(x), C_3(x) (i.e. K = 3), each with the same classification error e(C_i) = 0.4 (6 out of 10 examples are recognized correctly). That gave a total of 28 different combinations of classification results for the test set.

We have used all the measures of diversity mentioned above and majority vote combining. The Pearson's correlation coefficients between the ensemble error and the diversity are presented in Table 2.
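The spirit of these experiments can be approximated by a small Monte-Carlo simulation. The sketch below (entirely our own construction) draws random oracle matrices and correlates the majority-vote error with the mean pairwise disagreement; unlike the paper's setup it does not fix the individual error at 0.4 or enumerate all combinations, so the resulting numbers are only indicative:

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def simulate(K=3, N=10, trials=2000, seed=0):
    """Random oracle matrices (K classifiers x N examples):
    correlate majority-vote error with mean pairwise disagreement."""
    rng = random.Random(seed)
    errors, diversities = [], []
    for _ in range(trials):
        oracles = [[rng.randint(0, 1) for _ in range(N)] for _ in range(K)]
        # Majority vote fails where at most K//2 oracle outputs are 1.
        err = sum(sum(o[n] for o in oracles) <= K // 2 for n in range(N)) / N
        pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
        dm = sum(sum(oracles[i][n] != oracles[j][n] for n in range(N)) / N
                 for i, j in pairs) / len(pairs)
        errors.append(err)
        diversities.append(dm)
    return pearson(errors, diversities)

print(simulate())  # correlation between disagreement and ensemble error
```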


Table 2. Pearson's correlation with the ensemble error in experiment 1

Diversity measure    Correlation with the error
κ                    0.209
DM                   0.387
CD                   0.408
Q                    0.421
H                    0.532
DI                   0.324
GD                   0.543
EN                   0.412

Source: own study.

In the second experiment we generated a test set of 100 examples (N = 100) and also 3 classifiers: C_1(x), C_2(x), C_3(x) (i.e. K = 3), each with the same classification error e(C_i) = 0.4 (60 out of 100 examples are recognized correctly). That gives a total of 36 151 different combinations of classification results.

We have used all the measures of diversity and majority vote combining. The Pearson's correlation coefficients between the ensemble error and the diversity are presented in Table 3.

Table 3. Pearson's correlation with the ensemble error in experiment 2

Diversity measure    Correlation with the error
κ                    0.387
DM                   0.402
CD                   0.478
Q                    0.471
H                    0.598
DI                   0.539
GD                   0.678
EN                   0.578

Source: own study.


5. Conclusions

In this paper we compared the ability of several diversity measures to predict the accuracy of a classifier ensemble.

As the result of the two experiments, we conclude that the Hamann coefficient is the best among the pairwise diversity measures, while in the group of non-pairwise measures the Partridge and Krzanowski measure is the recommended one.

In general, we observed that non-pairwise measures identify the diversity among component models (which is the main determinant of the prediction error) better than the pairwise ones.

References

Breiman L. (1996), Bagging predictors, "Machine Learning", 24, 123-140.

Breiman L. (1998), Arcing classifiers, "Annals of Statistics", 26, 801-849.

Breiman L. (1999), Using adaptive bagging to debias regressions, Technical Report 547, Department of Statistics, University of California, Berkeley.

Breiman L. (2001), Random forests, "Machine Learning", 45, 5-32.

Cunningham P., Carney J. (2000), Diversity versus quality in classification ensembles based on feature selection, [in:] Proceedings of the European Conference on Machine Learning, LNCS, vol. 1810, Springer, Berlin, 109-116.

Dietterich T., Bakiri G. (1995), Solving multiclass learning problems via error-correcting output codes, "Journal of Artificial Intelligence Research", 2, 263-286.

Fleiss J. L. (1981), Statistical methods for rates and proportions, John Wiley and Sons, New York.

Freund Y., Schapire R. E. (1997), A decision-theoretic generalization of on-line learning and an application to boosting, "Journal of Computer and System Sciences", 55, 119-139.

Gatnar E. (2001), Nonparametric method for classification and regression, PWN, Warszawa (in Polish).

Gatnar E. (2005), A diversity measure for tree-based classifier ensembles, [in:] Data analysis and decision support, eds D. Baier, R. Decker, L. Schmidt-Thieme, Springer-Verlag, Heidelberg-Berlin, 30-38.

Giacinto G., Roli F. (2001), Design of effective neural network ensembles for image classification processes, "Image Vision and Computing Journal", 19, 699-707.

Hansen L. K., Salamon P. (1990), Neural network ensembles, "IEEE Transactions on Pattern Analysis and Machine Intelligence", 12, 993-1001.

Ho T. K. (1998), The random subspace method for constructing decision forests, "IEEE Transactions on Pattern Analysis and Machine Intelligence", 20, 832-844.

Kuncheva L., Whitaker C., Shipp D., Duin R. (2000), Is independence good for combining classifiers, [in:] Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, 168-171.

Kuncheva L., Whitaker C. (2003), Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, "Machine Learning", 51, 181-207.

Margineantu M. M., Dietterich T. G. (1997), Pruning adaptive boosting, [in:] Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, San Mateo, 211-218.

Oza N. C., Tumer K. (1999), Dimensionality reduction through classifier ensembles, Technical Report NASA-ARC-IC-1999-126, Computational Sciences Division, NASA Ames Research Center.

Partridge D., Krzanowski W. J. (1997), Software diversity: practical statistics for its measurement and exploitation, "Information and Software Technology", 39, 707-717.

Partridge D., Yates W. B. (1996), Engineering multiversion neural-net systems, "Neural Computation", 8, 869-893.

Sharkey A., Sharkey N. (1997), Diversity, selection, and ensembles of artificial neural nets, [in:] Neural Networks and their Applications, NEURAP-97, 205-212.

Skalak D. B. (1996), The sources of increased accuracy for two proposed boosting algorithms, [in:] Proceedings of the American Association for Artificial Intelligence AAAI-96, Morgan Kaufmann, San Mateo.

Tumer K., Ghosh J. (1996), Analysis of decision boundaries in linearly combined neural classifiers, "Pattern Recognition", 29, 341-348.

Wolpert D. (1992), Stacked generalization, "Neural Networks", 5, 241-259.

Measures of model diversity and the classification error in the multiple-model approach

The multiple-model approach (model aggregation), most often applied in discriminant and regression analysis, consists in combining M component models C_1(x), ..., C_M(x) into one global model C*(x).

Tumer and Ghosh (1996) proved that the classification error of the aggregated model C*(x) depends on the degree of similarity (diversity) of the component models. In other words, the most accurate model C*(x) is composed of models that are maximally dissimilar to one another, i.e. models that classify the same objects in entirely different ways.

Several measures for assessing the similarity (diversity) of component models in the multiple-model approach have been proposed in the literature.

This paper discusses the relationship between known diversity measures and the estimated classification error of the aggregated model.
