ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 206, 2007

Dorota Rozmus*

METHODS OF CLASSIFICATION ERROR DECOMPOSITIONS AND THEIR PROPERTIES

Abstract. The idea of error decomposition originates in regression, where the squared loss function is applied. More recently, several authors have proposed corresponding decompositions for the classification problem, where 0-1 loss is used. The paper presents an analysis of some properties of recently developed decompositions for 0-1 loss.

Key words: classification error, prediction error, error decomposition.

1. INTRODUCTION

The main aim of aggregating models is to decrease prediction error. There are two ways of building such models: we can obtain new training subsets from the original set by selecting features or by selecting objects. A learning algorithm builds a model on the basis of each subset, and then the single models are aggregated¹ into one model (see Fig. 1).
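As an illustration of this scheme (a minimal sketch under assumed choices of data and base learner, not code from the paper), the following builds training subsets by selecting objects with replacement, fits one tree per subset, and aggregates the single models by majority vote:

```python
# Minimal sketch of the aggregation scheme in Fig. 1 (assumed example,
# not the author's original code): training subsets -> single models
# -> majority-vote aggregation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

n_models = 100
models = []
for _ in range(n_models):
    # new training subset obtained by selecting objects (with replacement)
    idx = rng.randint(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def aggregated_predict(x):
    """Aggregate the single models' predictions by majority vote."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return np.bincount(votes).argmax()
```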

The error of the aggregated model, given in formula (1), can be decomposed into three components that show how factors such as the number of single models, the way of obtaining training subsets, the number of objects in the subsets, or the parameters of the single models influence the value of the error:

$\mathrm{Err}_{agr} = E_Z\{E_y[L(y, \hat{y})]\}$ (1)

where $L(y, \hat{y})$ is a loss function.

* M.Sc., Department of Statistics, Karol Adamiecki University of Economics, Katowice.
¹ A learning algorithm is a function which takes as sole input a learning sample and outputs a classification or regression model.


The subscript y in equation (1) denotes that the expectation is taken with respect to noise, and the subscript Z that the expectation is taken with respect to the predictions produced by the single models built by the learning algorithm on the training subsets from the set Z.

[Fig. 1 schema: original training set → set of learning subsets → single models → aggregated model]

Fig. 1. Aggregated model
Source: own research.

2. PREDICTION ERROR DECOMPOSITION

The idea of error decomposition originates in regression, where the squared loss function is applied, and it breaks the error down into three terms: noise (N(x)), bias (B(x)) and variance (D²(x)) (Geman et al. 1992):

$\mathrm{Err}_{agr} = E_Z\{E_y[(y - \hat{y})^2]\} = N(x) + B(x) + D^2(x)$ (2)

Before defining the decomposition components, we should first introduce the ideas of the aggregated model prediction and the optimal model prediction (Domingos 2000).

In general, the dependent variable y is a nondeterministic function of the predictors x, so it is almost impossible to obtain a loss value equal to 0. But for a given problem and a given loss function we may define the best possible model, which ensures the optimal prediction of y:

$y_* = \arg\min_{y'} E_y[L(y, y')]$ (3)

The prediction of the optimal model, for a given example x, is the value of y that minimises the expected value of the loss:

$E_y[L(y, y_*)]$ (4)

In the case of regression such a prediction equals the expected value of y:

$y_* = E_y[y]$ (5)

The prediction of the aggregated model, for any loss function and collection of training subsets in Z, is the prediction that minimises the expected value of the loss:

$y_m = \arg\min_{\hat{y}'} E_Z[L(\hat{y}, \hat{y}')]$ (6)

The aggregated model predicts the value $\hat{y}'$ whose average loss, relative to all the predictions obtained by means of the single models, is minimal. In regression it is the mean of all single-model predictions:

$y_m = E_Z[\hat{y}]$ (7)

The noise of the learning algorithm on an example x in regression is defined in formula (8); it is the loss arising from the difference between the real value of the dependent variable and the value obtained on the basis of the optimal model:

$N(x) = E_y[(y - y_*)^2]$ (8)

It is an unavoidable component of the loss, incurred independently of the model, and is the hypothetical lower bound of the error.

The bias of the learning algorithm on an example x, using the squared loss function, is the systematic loss incurred by the value of the dependent variable obtained on the basis of the optimal model relative to the prediction of the aggregated model:

$B(x) = (y_* - y_m)^2$ (9)

The variance of a learning algorithm on an example x is the average loss incurred by the single predictions relative to the prediction on the basis of the aggregated model:

$D^2(x) = E_Z[(y_m - \hat{y})^2]$ (10)

The sum of these three elements is equal to the value of the prediction error when the squared loss function is applied (Geman et al. 1992).
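The additivity of formulas (8)-(10) can be checked numerically. The sketch below is a minimal illustration under assumed conditions (a synthetic data-generating process and a simple linear learner, none of which come from the paper): it estimates noise, bias and variance at a fixed point x by Monte Carlo simulation and verifies that their sum matches the expected squared error.

```python
# Monte Carlo check of Err = N(x) + B(x) + D^2(x) for squared loss at a
# single point x0 (synthetic example; the true function, noise level and
# learner are assumptions made for illustration only).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)   # assumed true regression function
sigma = 0.3                   # assumed noise standard deviation
x0 = 0.5                      # the example x under study

def train_prediction():
    """Fit a degree-1 polynomial on a fresh training sample, predict at x0."""
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, 1), x0)

preds = np.array([train_prediction() for _ in range(5000)])
y_star = f(x0)                # optimal prediction, formula (5)
y_m = preds.mean()            # aggregated prediction, formula (7)

noise = sigma ** 2                       # N(x),   formula (8)
bias = (y_star - y_m) ** 2               # B(x),   formula (9)
variance = ((preds - y_m) ** 2).mean()   # D^2(x), formula (10)

ys = y_star + rng.normal(0, sigma, preds.size)  # fresh draws of y at x0
err = ((ys - preds) ** 2).mean()                # E_Z{E_y[(y - y-hat)^2]}
print(noise + bias + variance, err)             # agree up to MC error
```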

3. CLASSIFICATION ERROR DECOMPOSITION

In recent years, several authors have tried to apply the idea of error decomposition to classification problems, where 0-1 loss is used. The three components can then be described as below.

The optimal model in classification associates each input x with the most likely class, according to the conditional class probability distribution:

$y_* = \arg\max_{y} P_y(y \mid x)$ (11)

The prediction of the aggregated model in classification is the class receiving the majority of votes among all classes predicted by the single models:

$y_m = \arg\max_{\hat{y}} P_Z(\hat{y} \mid x)$ (12)

Noise in classification is equal to 1 minus the probability of the optimal classification:

$N(x) = 1 - P_y(y_* \mid x)$ (13)

In regression, bias was defined as the difference between the value of the dependent variable predicted by the optimal model and that of the aggregated model; in classification, bias is given by an indicator function:

$B(x) = \mathbf{1}(y_* \neq y_m \mid x)$ (14)

So in classification the biased observations are those for which the classification on the basis of the aggregated model differs from the classification of the optimal model.


The variance of a learner on an example x in classification is equal to 1 minus the probability, among all classifications by single models, of the same classification as obtained by means of the aggregated model:

$D^2(x) = 1 - P_Z(y_m \mid x)$ (15)

But it appears that in classification, by analogy to regression, the sum of the terms so defined is not equal to the value of the classification error (Domingos 2000):

$\mathrm{Err}_{agr} = E_Z\{E_y[\mathbf{1}(y \neq \hat{y} \mid x)]\} \neq N(x) + B(x) + D^2(x)$ (16)
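The inequality in formula (16) can be reproduced in a small numerical example. The sketch below is an assumed two-class setup invented purely for illustration: it computes N(x), B(x) and D²(x) from definitions (13)-(15) and compares their sum with the actual expected 0-1 error.

```python
# Illustration of formula (16): for 0-1 loss the sum N + B + D^2 of
# definitions (13)-(15) need not equal the classification error.
# Two-class example at a single point x; all probabilities are assumptions.
p_y = {0: 0.7, 1: 0.3}   # P_y(y | x): true conditional class distribution
p_z = {0: 0.4, 1: 0.6}   # P_Z(y-hat | x): distribution of single-model votes

y_star = max(p_y, key=p_y.get)       # optimal prediction, formula (11)
y_m = max(p_z, key=p_z.get)          # aggregated prediction, formula (12)

noise = 1 - p_y[y_star]              # N(x),   formula (13)
bias = 1 if y_star != y_m else 0     # B(x),   formula (14)
variance = 1 - p_z[y_m]              # D^2(x), formula (15)

# actual expected 0-1 error: E_Z{E_y[1(y != y-hat)]}
err = sum(p_z[c] * (1 - p_y[c]) for c in p_z)

print(noise + bias + variance, err)  # 0.3 + 1 + 0.4 = 1.7  vs  0.54
```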

More recently, several authors have proposed corresponding decompositions for the classification problem. The differences in the definitions of bias and variance can be attributed to disagreement over the properties that those terms should fulfil in the case of 0-1 loss. The most often used decompositions were proposed by E. B. Kong and T. G. Dietterich (1995), R. Kohavi and D. H. Wolpert (1995), R. Tibshirani (1996), P. Domingos (2000) and L. Breiman (1996, 2000).

As the values of bias and variance differ depending on which decomposition is used, the article discusses how the values of those terms depend on the different proposed definitions and tries to find concepts giving similar or even the same values.

Table 1 shows the main properties of bias and variance depending on the concept of decomposition.

Table 1
Main properties of bias and variance, depending on concept of decomposition

Authors of decomposition    Bias properties       Variance properties
Kong and Dietterich         it may be negative    it may be negative
Tibshirani                  it may be negative    it may be negative
Kohavi and Wolpert          it is nonnegative     it is nonnegative
Breiman I                   it is nonnegative     it is nonnegative
Breiman II                  it is nonnegative     it is nonnegative
Domingos                    it may be negative    it may be negative


4. BIAS, VARIANCE VALUES AND NUMBER OF PREDICTORS SELECTED FOR BUILDING A SINGLE MODEL

A benchmark dataset, "Satimage", with 4435 objects (Blake et al. 1998), was chosen for the experiments. For the training set there is a separate test set with 2000 observations. The objects in this set are fragments (3 x 3 pixels) of the Earth's surface. The whole area has 82 x 100 pixels and each pixel represents an area of 80 x 80 metres. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3 x 3 neighbourhood. Each object belongs to one of six possible classes that denote the way of soil utilisation.

Fig. 2. Bias, variance, classification error according to number of variables randomly selected for building single models
Source: own research.

The aim of the first experiment is to verify how the values of bias and variance, depending on the different concepts of calculation, behave according to the number of variables randomly selected for building the single models. The number of models in the aggregated model was kept constant, equal to 100. Building of the single models started with 5 randomly selected variables, and the procedure was continued, adding one more variable each time, up to the moment when 34 variables were taken. Figure 2 shows the values of bias and variance calculated from the general definitions (formulas (14) and (15)). It is clearly seen that the sum of these two terms is not equal to the value of the classification error: initially this sum is much higher than the error, and later it slowly approaches it.
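A possible reconstruction of this experimental loop is sketched below; it is an assumption-laden illustration, not the author's original code. It uses scikit-learn's BaggingClassifier with feature subsampling (the `estimator` parameter name assumes scikit-learn 1.2 or later) and a synthetic stand-in for the Satimage data.

```python
# Assumed reconstruction of the first experiment: aggregate 100 single
# models, varying the number of randomly selected variables from 5 to 34.
# The synthetic dataset stands in for the original Satimage data
# (36 features, 6 classes, 4435 objects, 2000-object test set).
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4435, n_features=36, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=2000, random_state=0)

for n_vars in range(5, 35):                  # 5, 6, ..., 34 variables
    ens = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=100,                    # constant number of single models
        max_features=n_vars,                 # variables per single model
        bootstrap=False,                     # select features, not objects
        random_state=0,
    ).fit(X_tr, y_tr)
    print(n_vars, 1 - ens.score(X_te, y_te))  # test classification error
```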

Figure 3 shows how bias behaves when calculated by means of the different definitions; it can be said that, generally, they indicate a very similar, increasing tendency. The initially significant differences in values, depending on the way of counting, become less and less apparent as the number of features grows.


Fig. 3. Values of bias calculated by means of different definitions according to number of variables selected for building single models
Source: own research.

It is worth noting that three concepts, those of R. Tibshirani, P. Domingos, and E. B. Kong and T. G. Dietterich, give the same values of bias, which are additionally identical to the bias calculated from the general definition. All the remaining concepts give lower values.

Taking into consideration the values of variance calculated by means of the different proposed definitions, it can be said that, generally, they also show a very similar tendency, but this time the trend is decreasing (Fig. 4). As with bias, the differences in values become less and less significant.



Fig. 4. Values of variance calculated by means of different definitions according to number of variables selected for building single models
Source: own research.

Fig. 5. Bias, variance, classification error according to definitions of R. Tibshirani, P. Domingos, E. B. Kong and T. G. Dietterich


The definitions of R. Tibshirani, P. Domingos, and E. B. Kong and T. G. Dietterich lead to the same, sometimes even negative, values of variance, which lie far from the variance calculated from the general definition, which gives definitely the highest values.

It is also worth analysing the relations between bias and variance within the individual decompositions, because they take different forms. In some concepts, e.g. the decompositions by R. Tibshirani, P. Domingos, and E. B. Kong and T. G. Dietterich (Fig. 5), the variance takes very low values and its trend stabilises from some point; together with it, the bias takes very high values, almost equal to the classification error, whose trend also stabilises.

The remaining decompositions, that is, by R. Kohavi and D. H. Wolpert (Fig. 6) and Breiman's decompositions I and II (Fig. 7 and 8), show very similar relations between the values of bias and variance: a regular increase of bias and a regular decrease of variance. Moreover, in Breiman's decomposition II we can observe an inversion in the relation between the bias and variance values (the lines cross).

Fig. 6. Bias, variance, classification error according to the decomposition of R. Kohavi and D. H. Wolpert


Fig. 7. Bias, variance, classification error according to decomposition I of L. Breiman
Source: own research.

Fig. 8. Bias, variance, prediction error according to decomposition II of L. Breiman
Source: own research.


5. BIAS, VARIANCE VALUES AND NUMBER OF SINGLE MODELS IN AGGREGATED MODEL

The aim of the second experiment is to verify how the values of bias and variance, calculated by means of the different decompositions, behave according to the number of single models in the aggregated one. In the experiment, building of the aggregated model started with 10 single models, and the procedure was continued up to 100, adding 10 more models each time.
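Under the same assumptions as the previous sketch, the second experiment's loop can be approximated as follows; the default tree learner and the synthetic stand-in data are again illustrative choices rather than the original setup.

```python
# Assumed sketch of the second experiment: the number of single models in
# the aggregated model grows from 10 to 100 in steps of 10 (synthetic
# stand-in for the Satimage data, as in the previous sketch).
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4435, n_features=36, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=2000, random_state=0)

for n_models in range(10, 101, 10):          # 10, 20, ..., 100 models
    ens = BaggingClassifier(n_estimators=n_models,  # default tree learner
                            random_state=0).fit(X_tr, y_tr)
    print(n_models, 1 - ens.score(X_te, y_te))      # test classification error
```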

Again, the sum of bias and variance calculated from the general definitions is not equal to the value of the classification error.

Although it is rather difficult to notice a clear increasing or decreasing trend in the values of bias, it can be seen that all the proposed definitions behave in a very similar way (Fig. 9). And again three decompositions, by R. Tibshirani, P. Domingos, and E. B. Kong and T. G. Dietterich, give the same value, which is equal to the value obtained from the general definition.

Fig. 9. Values of bias calculated by means of different definitions according to number of single models in aggregated model



Fig. 10. Values of variance calculated by means of different definitions according to number of single models in aggregated model
Source: own research.

The lack of a clear tendency and the similarity of behaviour can also be observed in the case of the variance values (Fig. 10). The values obtained from the decompositions of R. Tibshirani, P. Domingos, and E. B. Kong and T. G. Dietterich are again the same.

6. CONCLUSIONS

The fact that there are so many ways of defining bias and variance proves that this is quite a complicated problem in classification. Although some proposed definitions give different values, we can still claim that all behave in similar ways. In the case of the research into how the number of variables influences the values of bias and variance, we can speak of a clear increasing trend of bias and a decreasing trend of variance. In the experiment whose aim was to verify how the number of single models influences bias and variance, we can say that even though there is no clear tendency, there are similarities in behaviour.

When choosing one of the proposed ways of defining bias and variance in classification, it should also be noted that different decompositions shape the relation between bias and variance in different ways.


REFERENCES

Blake C., Keogh E., Merz C. J. (1998), UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine.

Breiman L. (1996), Arcing classifiers, Technical Report, Department of Statistics, University of California, Berkeley.

Breiman L. (2000), Randomizing outputs to increase prediction accuracy, "Machine Learning", 40, 3, 229-242.

Domingos P. (2000), A unified bias-variance decomposition for zero-one and squared loss, "Proceedings of the Seventeenth National Conference on Artificial Intelligence", AAAI Press, Austin, 564-569.

Geman S., Bienenstock E., Doursat R. (1992), Neural networks and the bias/variance dilemma, "Neural Computation", 4, 1-58.

Kohavi R., Wolpert D. H. (1995), Bias plus variance decomposition for zero-one loss functions, "Proceedings of the Twelfth International Conference on Machine Learning", Morgan Kaufmann, Tahoe City, 275-283.

Kong E. B., Dietterich T. G. (1995), Error-correcting output coding corrects bias and variance, "Proceedings of the Twelfth International Conference on Machine Learning", Morgan Kaufmann, 313-321.

Tibshirani R. (1996), Bias, variance and prediction error for classification rules, Technical Report, Department of Statistics, University of Toronto, Toronto.

Dorota Rozmus

ANALYSIS OF THE PROPERTIES OF CLASSIFICATION ERROR DECOMPOSITION METHODS

The notion of error decomposition originates in regression, where the squared loss function is applied. Given an object x, for which the true value of the dependent variable is y, the learning algorithm predicts for this object a value $\hat{y}$ on the basis of each training subset from the set of training samples Z. The prediction error can then be decomposed as follows:

$\mathrm{Err}_{agr} = E_Z\{E_y[(y - \hat{y})^2]\} = N(x) + B(x) + D^2(x)$

The noise (N(x)) is the component of the error which cannot be reduced and which is independent of the learning algorithm. It constitutes the hypothetical lower bound of the prediction error.

The bias of the learning algorithm for an object x (B(x)) is the systematic error caused by the difference between the prediction obtained on the basis of the optimal model ($y_*$) and the prediction on the basis of the aggregated model ($y_m$), where $y_*$ and $y_m$ are defined as $y_* = E_y[y]$ and $y_m = E_Z[\hat{y}]$.

The variance for an object x ($D^2(x)$) is the average error resulting from the difference between the prediction on the basis of the aggregated model ($y_m$) and the predictions obtained on the basis of the single models ($\hat{y}$).

Numerous concepts of carrying the idea of decomposition over to the classification problem have also appeared in the literature. The aim of the article is to analyse the properties of the various methods of error decomposition when the 0-1 loss function is applied.
