

ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 206, 2007

Eugeniusz Gatnar*

FEATURE SELECTION AND MULTIPLE MODEL APPROACH IN DISCRIMINANT ANALYSIS

Abstract. Significant improvement of model stability and prediction accuracy in classification and regression can be obtained by using the multiple model approach. In classification, multiple models are built on the basis of training subsets (selected from the training set) and combined into an ensemble or a committee. Then the component models (classification trees) determine the predicted class by voting.

In this paper some problems of feature selection for ensembles will be discussed. We propose a new correlation-based feature selection method combined with the wrapper approach.

Key words: tree-based models, aggregation, feature selection, random subspaces.

1. INTRODUCTION

Tree-based models are popular and widely used because they are simple, flexible and powerful tools for classification and regression. Unfortunately they are not stable, i.e. a small change in a predictor value can lead to a quite different model. To solve this problem, single models C_1(x), ..., C_K(x) are combined into one global model C*(x).

In classification the component models vote for the predicted class.
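In its usual form this majority-voting rule can be written as follows; the indicator function I(·) and the argmax notation are conventional shorthand, not taken from the paper:

$$C^*(\mathbf{x}) = \arg\max_{y} \sum_{k=1}^{K} I\big(C_k(\mathbf{x}) = y\big)$$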

Several variants of aggregation methods have been proposed so far. They manipulate training cases (random sampling) or predictors (random selection), or the values of y (systems of weights), or involve randomness directly (Gatnar 2001).

* Professor, Institute of Statistics, Katowice University of Economics, Katowice, Poland.


The method developed by T. K. Ho (1998) has been called "Random subspaces" (RSM). Each component model C_m(x) in the ensemble is fitted to the training subsample U_m containing all cases from the training set but with randomly selected features. Varying the feature subsets used to fit the component classifiers results in their necessary diversity.

This method is very useful, especially when the data are highly dimensional, some features are redundant, or the training set is small compared to the data dimensionality, as well as when the base classifiers suffer from the "curse of dimensionality".

The RSM uses a parallel classification algorithm, in contrast to boosting or adaptive bagging, which are sequential. It does not require specialised software or any modification of the source code of existing programs.
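Indeed, a few lines suffice. Below is a minimal sketch of the RSM in Python, assuming scikit-learn's DecisionTreeClassifier as the base tree learner (the experiments reported later in this paper used rpart trees); all function and variable names are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, n_models=100, subspace_size=None, seed=0):
    """Fit one tree per randomly drawn feature subset (the random subspace method)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    if subspace_size is None:
        subspace_size = n_features // 2        # Ho (1998): half of the available features
    ensemble = []
    for _ in range(n_models):
        features = rng.choice(n_features, size=subspace_size, replace=False)
        tree = DecisionTreeClassifier().fit(X[:, features], y)
        ensemble.append((features, tree))
    return ensemble

def predict_by_voting(ensemble, X):
    """The component trees determine the predicted class by majority voting."""
    votes = np.array([tree.predict(X[:, features]) for features, tree in ensemble])
    return np.array([Counter(column).most_common(1)[0][0] for column in votes.T])
```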

A disadvantage of the RSM is the problem of finding the optimal number of dimensions for the random subspaces. T. K. Ho (1998) proposed to choose half of the available features, while L. Breiman (2001) suggested the square root of the number of features, or twice that root.

In order to obtain the appropriate number of variables, we need to apply a feature selection procedure to the initial set of variables chosen at random.

2. FEATURE SELECTION FOR ENSEMBLES

The aim of feature selection is to find the best subset of variables. In general there are three approaches to feature selection for ensembles:

• filter methods, which filter undesirable features out of the data before classification,

• wrapper methods, which use the classification algorithm itself to evaluate the usefulness of feature subsets,

• ranking methods, which score individual features.

Filter methods are the most commonly used feature selection methods in statistics. In particular, the correlation-based methods belong here, and they can be divided into three groups: simple correlation-based selection methods, advanced correlation-based selection methods, and contextual merit-based methods.

For example, the method proposed by N. C. Oza and K. Tumer (1999) belongs to the first group. It ranks the features by their correlations with the class. This approach is not effective if there is strong feature interaction.


The correlation feature selection (CFS) method developed by M. Hall (2000) is more advanced because it also takes into account correlations between pairs of features. The CFS value of a set of features F_m is calculated as

$$\mathrm{CFS}(F_m) = \frac{L_m \, \bar{r}_{cj}}{\sqrt{L_m + L_m (L_m - 1)\, \bar{r}_{ij}}} \qquad (2)$$

where:

$\bar{r}_{cj}$ - the average feature-class correlation,

$\bar{r}_{ij}$ - the average feature-feature correlation,

$L_m$ - the number of features in the set $F_m$.
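A direct transcription of formula (2) into code might look as follows; the argument names are illustrative, and the two average correlations are assumed to have been computed beforehand.

```python
import math

def cfs_merit(avg_class_corr, avg_feature_corr, n_features):
    """CFS value of a feature subset, a direct transcription of formula (2)."""
    return (n_features * avg_class_corr) / math.sqrt(
        n_features + n_features * (n_features - 1) * avg_feature_corr)
```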

The wrapper methods generate sets of features. Then they run the classification algorithm using the features in each set and evaluate the resulting models using 10-fold cross-validation. R. Kohavi and G. H. John (1997) proposed a stepwise wrapper algorithm that starts with an empty set of features and adds single features that improve the accuracy of the resulting classifier. Unfortunately, this method is only useful for data sets with a relatively small number of features and for very fast classification algorithms (e.g. trees). In general, the wrapper methods are computationally expensive and very slow.
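A minimal sketch of such a stepwise (forward) wrapper is given below, assuming scikit-learn's cross_val_score and a tree classifier; the greedy stopping rule shown here is an illustrative simplification of Kohavi and John's procedure, not its exact form.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_wrapper_selection(X, y, cv=10):
    """Greedy forward selection: keep adding the single feature that most improves
    cross-validated accuracy; stop when no remaining feature improves it."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {
            f: cross_val_score(DecisionTreeClassifier(), X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best
    return selected
```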

The RELIEF algorithm (Kira, Rendell 1992) is an interesting example of a ranking method for feature selection. It draws instances at random, finds their nearest neighbors, and gives higher weights to features that discriminate the instance from neighbors of different classes. Then those features whose weights exceed a user-specified threshold are selected.
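The following is a rough sketch of the basic RELIEF weighting for numeric features; the number of sampled instances m, the Manhattan distance used to find neighbors, and the [0, 1] rescaling are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def relief_weights(X, y, m=100, seed=0):
    """Basic RELIEF: sample m instances, find the nearest hit (same class) and the
    nearest miss (other class), and reward features that separate the classes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Rescale features to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(p)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(Xs - Xs[i]).sum(axis=1)
        dist[i] = np.inf                                  # exclude the instance itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))     # nearest neighbor of the same class
        miss = np.argmin(np.where(other, dist, np.inf))   # nearest neighbor of another class
        w += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / m
    return w   # features with weights above a user-chosen threshold are kept
```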

3. PROPOSED METHOD

We propose to reduce the dimensionality of random subspaces using a filter method based on the Hellwig heuristic (CFSH). The method is a correlation-based feature selection procedure and consists of two steps.

1. Iterate m = 1 to M:

• choose at random half of the data set features (L/2) to form the training subset U_m,

• determine the best subset F_m of features in U_m according to the Hellwig method,

• grow and prune the tree using the subset F_m.

2. Finally, combine the component trees using majority voting.
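Putting the two steps together, one possible sketch of the procedure is given below. Here su_correlations and hellwig_best_subset stand for the computations defined by formulas (3)-(5) in the following paragraphs (they are sketched there), and the scikit-learn tree with ccp_alpha cost-complexity pruning merely stands in for "grow and prune the tree"; none of these names are the author's own implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cfsh_ensemble(X, y, n_models=100, seed=0):
    """Step 1: for each of the M component models, draw half of the features at
    random, keep the Hellwig-best subset of them and grow a (pruned) tree on it."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    ensemble = []
    for _ in range(n_models):
        drawn = rng.choice(n_features, size=n_features // 2, replace=False)
        r_c, R = su_correlations(X[:, drawn], y)    # correlations of formula (5)
        best = hellwig_best_subset(r_c, R)          # Hellwig selection, formulas (3)-(4)
        features = drawn[best]
        # ccp_alpha stands in for the pruning step; the value is illustrative.
        tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X[:, features], y)
        ensemble.append((features, tree))
    # Step 2: combine the component trees by majority voting, exactly as in the
    # RSM sketch of Section 1 (predict_by_voting).
    return ensemble
```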

The heuristic proposed by Z. Hellwig (1969) takes into account both the class-feature correlation and the correlations between pairs of variables. The best subset of features is selected from among all possible subsets F_1, F_2, ..., F_M as the one that maximises the so-called "integral capacity of information":

$$H(F_m) = \sum_{j=1}^{L_m} h_{mj} \qquad (3)$$

where $L_m$ is the number of features in the subset $F_m$ and $h_{mj}$ is the capacity of information of a single feature $X_j$ in the subset $F_m$:

$$h_{mj} = \frac{r_{cj}^2}{1 + \sum_{i \in F_m,\, i \neq j} |r_{ij}|} \qquad (4)$$

In equation (4), $r_{cj}$ is the class-feature correlation and $r_{ij}$ is a feature-feature correlation.
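Formulas (3) and (4) translate directly into code. In the sketch below, r_c is the vector of class-feature correlations and R the matrix of feature-feature correlations for the candidate features (computed, for example, with the symmetrical uncertainty described next); the exhaustive search over subsets is practical only for small numbers of candidate features, and the names are illustrative.

```python
import numpy as np
from itertools import combinations

def integral_capacity(subset, r_c, R):
    """H(F_m) of formula (3): the sum of the capacities h_mj defined by formula (4)."""
    total = 0.0
    for j in subset:
        denominator = 1.0 + sum(abs(R[i, j]) for i in subset if i != j)
        total += r_c[j] ** 2 / denominator
    return total

def hellwig_best_subset(r_c, R):
    """Return the list of feature indices maximising H(F_m), examining every
    non-empty subset; feasible only for a small number of candidate features."""
    n = len(r_c)
    all_subsets = (list(c) for k in range(1, n + 1) for c in combinations(range(n), k))
    return max(all_subsets, key=lambda s: integral_capacity(s, r_c, R))
```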

The correlations are computed using the formula of the symmetrical uncertainty coefficient (Press et al. 1989) based on the entropy function:

$$r_{ij} = \frac{2\,[E(x_i) + E(x_j) - E(x_i, x_j)]}{E(x_i) + E(x_j)} \qquad (5)$$

because the variable y representing the class is nominal. The measure (5) lies between 0 and 1. If the two variables are independent, it equals 0, and if they are dependent, it equals 1.
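A sketch of coefficient (5) for two discrete variables is given below (continuous features would first be discretised, as noted next); the empirical entropy estimate and the helper that assembles the correlation inputs for formulas (3)-(4) are illustrative additions.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical entropy E(x) of a discrete variable."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def symmetrical_uncertainty(x, y):
    """Formula (5): 2[E(x)+E(y)-E(x,y)] / [E(x)+E(y)], a value between 0 and 1."""
    joint = entropy(list(zip(x, y)))
    ex, ey = entropy(x), entropy(y)
    return 2.0 * (ex + ey - joint) / (ex + ey)

def su_correlations(X, y):
    """Class-feature vector r_c and feature-feature matrix R used in formulas (3)-(4)."""
    p = X.shape[1]
    r_c = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(p)])
    R = np.array([[symmetrical_uncertainty(X[:, i], X[:, j]) for j in range(p)]
                  for i in range(p)])
    return r_c, R
```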

Continuous features have been discretised using the contextual technique of U. M. Fayyad and K. B. Irani (1993).

Unfortunately, maximising formula (3) does not in all cases lead to the most accurate model among all the models generated by the CFSH method, so a further improvement of the aggregated model's accuracy can be achieved by combining the CFSH method with the wrapper approach.

In order to obtain the best subset of features, we propose to choose the best model (in terms of classification error) from among the top 5 models built on feature sets generated by the CFSH method. The algorithm consists of three steps:

1. Choose the top 5 feature subsets F_1, F_2, ..., F_5 that maximize the value of (3).

2. Build the models C(F_1), ..., C(F_5) and calculate the classification error for each of them using the appropriate test set: e[C(F_1)], ..., e[C(F_5)].

3. Choose the subset F* that gives the model with the lowest classification error, i.e. e[C(F*)] = min_m e[C(F_m)].
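A sketch of the three steps follows, reusing integral_capacity from the sketch after formula (4) and assuming scikit-learn's accuracy_score; the explicit enumeration of all subsets and the variable names are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def wrapper_over_cfsh(X_train, y_train, X_test, y_test, r_c, R, top=5):
    """Steps 1-3: rank all subsets by H of formula (3), build a tree on each of the
    top 5, and keep the subset whose model has the lowest test-set error."""
    n = len(r_c)
    subsets = [list(c) for k in range(1, n + 1) for c in combinations(range(n), k)]
    subsets.sort(key=lambda s: integral_capacity(s, r_c, R), reverse=True)   # step 1
    best_subset, best_error = None, np.inf
    for F in subsets[:top]:                                                  # step 2
        model = DecisionTreeClassifier().fit(X_train[:, F], y_train)
        error = 1.0 - accuracy_score(y_test, model.predict(X_test[:, F]))
        if error < best_error:                                               # step 3
            best_subset, best_error = F, error
    return best_subset
```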


4. EXAMPLE

In order to compare the prediction accuracy of ensembles built with different feature selection methods, we used 9 benchmark data sets from the Machine Learning Repository at UCI (Blake et al. 1998). The data sets are characterised in Tab. 1 and the results of the comparisons in Tab. 2. For each data set an aggregated model has been built containing M = 100 component trees¹.

Table 1. Benchmark data sets

Data set            Learning examples   Test examples   Features   Classes
DNA                      2 000              1 186          180         3
Letter                  15 000              5 000           16        26
Satellite                4 435              2 000           36         6
Soybean                    600                 83           35        19
German credit              900                100           24         2
Segmentation             2 000                310           19         8
Sick                     3 400                372           29         2
Anneal                     800                 98           38         5
Australian credit          600                 90           15         2

Classification errors have been estimated on the appropriate test sets and are presented in Tab. 2.

Table 2. Classification errors for the data sets (in %)

Data set            Single model (tree)   CFS     CFSH    Wrapper approach
DNA                        6.40            5.20    4.51         4.78
Letter                    14.00           10.83    5.84         5.34
Satellite                 13.80           14.87   10.32        10.28
Soybean                    8.00            9.34    6.98         7.05
German credit             29.60           27.33   26.92        26.72
Segmentation               3.70            3.37    2.27         2.14
Sick                       1.30            2.51    2.14         2.12
Anneal                     1.40            1.22    1.20         1.20
Australian credit         14.90           14.53   14.10        14.04

¹ In order to grow trees we have used the Rpart procedure written by T. M. Therneau and E. J. Atkinson (1997) for the S-PLUS and R environments.


In this paper we have proposed a modification of the correlation-based feature selection method for classifier ensembles, based on the Hellwig heuristic. The wrapper approach gives more accurate aggregated models than those built with the CFSH correlation-based feature selection method alone.

REFERENCES

Blake C., Keogh E., Merz C. J. (1998), UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine (CA).

Breiman L. (2001), Random forests, "Machine Learning", 45, 5-32.

Fayyad U. M., Irani K. B. (1993), Multi-interval discretisation of continuous-valued attributes, [in:] Proceedings of the XIII International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, 1022-1027.

Gatnar E. (2001), Nieparametryczna metoda dyskryminacji i regresji, Wydawnictwo Naukowe PWN, Warszawa.

Hall M. (2000), Correlation-based feature selection for discrete and numeric class machine learning, [in:] Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco.

Hellwig Z. (1969), On the problem of the optimal selection of predictors, "Statistical Revue", 3-4 (in Polish).

Ho T. K. (1998), The random subspace method for constructing decision forests, "IEEE Transactions on Pattern Analysis and Machine Intelligence", 20, 832-844.

Kira K., Rendell L. (1992), A practical approach to feature selection, [in:] Proceedings of the 9th International Conference on Machine Learning, D. Sleeman, P. Edwards (eds.), Morgan Kaufmann, San Francisco, 249-256.

Kohavi R., John G. H. (1997), Wrappers for feature subset selection, "Artificial Intelligence", 97, 273-324.

Oza N. C., Tumer K. (1999), Dimensionality reduction through classifier ensembles, Technical Report NASA-ARC-IC-1999-126, Computational Sciences Division, NASA Ames Research Center.

Press W. H., Flannery B. P., Teukolsky S. A., Vetterling W. T. (1989), Numerical Recipes in Pascal, Cambridge University Press, Cambridge.

Therneau T. M., Atkinson E. J. (1997), An Introduction to Recursive Partitioning Using the RPART Routines, Mayo Foundation, Rochester.

Eugeniusz Gatnar

FEATURE SELECTION AND THE MULTIPLE MODEL APPROACH IN DISCRIMINANT ANALYSIS

(Summary)

The paper presents the basic issues related to feature selection in the multiple model approach in discriminant analysis.

The multiple model approach consists in building K simple (component) models C_1(x), ..., C_K(x), which are then combined into one aggregated model C*(x), e.g. by majority voting.

The model aggregation methods known from the literature differ mainly in the way the training samples U_1, ..., U_K, on which the component models are built, are generated. One of the simplest is random selection of the features used by the component models.

However, so that these features improve the quality of the models being built, we propose to apply a feature selection method to the features that have been drawn at random. For this purpose, the Hellwig correlation method has been modified.
