Eugeniusz Gatnar*
COMBINING DIFFERENT TYPES OF CLASSIFIERS
ABSTRACT. Model fusion has proved to be a very successful strategy for obtaining accurate models in classification and regression. The key issue, however, is the diversity of the component classifiers, because the classification error of an ensemble depends on the correlation between its members.
The majority of existing ensemble methods combine the same type of models, e.g. trees. In order to promote the diversity of the ensemble members, we propose to aggregate classifiers of different types, because they can partition the same classification space in very different ways (e.g. trees, neural networks and SVMs).
Key words: multiple-model approach, model fusion, classifier ensemble, diversity measures.
I. INTRODUCTION
Fusion of classification models is commonly used to improve classification accuracy. In this approach, $K$ component (base) models $C_1(\mathbf{x}), \ldots, C_K(\mathbf{x})$ are combined into one global model (ensemble) $C^*(\mathbf{x})$, for example using majority voting:

$C^*(\mathbf{x}) = \arg\max_j \left\{ \sum_{k=1}^{K} I\left(C_k(\mathbf{x}) = j\right) \right\}.$   (1)
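As an illustration, the voting rule (1) can be written in a few lines of R; the function name and the toy labels below are our own choices, not part of the paper:

majority_vote <- function(votes) {
  # votes: n x K matrix; column k holds the labels predicted by classifier C_k
  apply(votes, 1, function(row) names(which.max(table(row))))
}

# Toy example: three classifiers voting on four observations
votes <- cbind(c("H",  "UM", "LM", "H"),
               c("H",  "UM", "H",  "H"),
               c("UM", "UM", "LM", "L"))
majority_vote(votes)   # "H" "UM" "LM" "H"

Note that which.max breaks ties in favour of the alphabetically first label; the paper does not specify a tie-breaking rule.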
Tumer and Ghosh (1996) proved that the classification error of the ensemble $C^*(\mathbf{x})$ depends on the diversity of the ensemble members. In other words, the higher the diversity of the component models, the lower the classification error of the combined model.
Fusion is therefore most effective when the members of the ensemble are "weak" and diverse. The term "weak" refers to classifiers that have high variance, e.g. classification trees, nearest neighbors, and neural nets.
Diversity among classifiers means that they are different from each other, i.e. they misclassify different examples. This is obtained by using different training subsets, assigning different weights to instances, or selecting different subsets of features (subspaces).
Several variants of aggregation methods have been developed so far. They differ in two aspects: the way the subsets used to train the component classifiers are formed, and the way the base classifiers are combined. Generally, three approaches have been developed to obtain diversity among component models:
• Manipulating training examples, e.g. Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1997) and Arcing (Breiman, 1998); a minimal sketch of this approach follows the list.
• Manipulating input features: Random subspaces (Ho, 1998), Random split selection (Amit and Geman, 1997), Random forests (Breiman, 2001).
• Manipulating output values: Adaptive bagging (Breiman, 1999), Error-correcting output coding (Dietterich and Bakiri, 1995).
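The first approach can be sketched in R as follows; the number of replicates B and the use of rpart trees are illustrative choices, not the paper's setup:

library(rpart)

# Bagging sketch: grow B trees, each on a bootstrap sample of the data
bagging_trees <- function(formula, data, B = 25) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot, method = "class")
  })
}

# Combine the B trees by majority voting over their predicted labels
predict_bagged <- function(trees, newdata) {
  votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(row) names(which.max(table(row))))
}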
II. BASE CLASSIFIERS
Existing ensemble methods combine the same type of models built for different subsets of observations, e.g. Random Forest developed by Breiman (2001), or different subsets of features, e.g. Feature Subspaces developed by Ho (1998). In order to improve the diversity of the ensemble members, we propose to fuse classifiers of different types.
We used six types of classifiers:

• $k$-Nearest Neighbors;

• Linear Discriminants:

$d_j(\mathbf{x}) = \mathbf{x}^T \hat{\Sigma}^{-1} \hat{\mu}_j - \frac{1}{2}\hat{\mu}_j^T \hat{\Sigma}^{-1} \hat{\mu}_j,$   (2)

and Quadratic Discriminants:

$d_j(\mathbf{x}) = -\frac{1}{2}\log\left|\hat{\Sigma}_j\right| - \frac{1}{2}(\mathbf{x}-\hat{\mu}_j)^T \hat{\Sigma}_j^{-1}(\mathbf{x}-\hat{\mu}_j);$   (3)

• Classification Trees:

$f(\mathbf{x}) = \sum_{k=1}^{K} a_k \, I(\mathbf{x} \in R_k),$   (4)

where $R_k$ are disjoint regions in the feature space:

$I(\mathbf{x} \in R_k) = \prod_{m=1}^{M} I\left(v_m^{(k)} \le x_m \le w_m^{(k)}\right);$   (5)

• Neural Networks with one hidden layer:

$y_j = g_j\left(v_{0j} + \sum_{m=1}^{M} v_{mj} z_m\right),$

where $\mathbf{z} = [z_1, z_2, z_3, \ldots, z_M]$ is the set of variables in the hidden layer:

$z_m = h\left(w_{0m} + \sum_{n=1}^{N} w_{nm} x_n\right),$   (6)

and $h$ is an activation function; we chose the sigmoid function $h(u) = 1/(1+\exp(-u))$ to activate the neurons;

• Support Vector Machines (SVM):

$f(\mathbf{x}) = \sum_{i=1}^{S} b_i y_i K(\mathbf{x}_i, \mathbf{x}) + b_0,$   (7)

where $V = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_S, y_S)\}$ is the set of support vectors and $K$ is a kernel function.
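All six base models are available in standard R packages. The sketch below, built around a hypothetical training frame train with class variable income_class, shows how each of (2)-(7) could be fitted; class, MASS, rpart and nnet are our package choices, as the paper names only e1071:

library(class); library(MASS); library(rpart); library(nnet); library(e1071)

# Hypothetical frame `train`: income_class plus numeric predictors
fit_lda  <- lda(income_class ~ ., data = train)                      # linear discriminants (2)
fit_qda  <- qda(income_class ~ ., data = train)                      # quadratic discriminants (3)
fit_tree <- rpart(income_class ~ ., data = train, method = "class")  # classification tree (4)-(5)
fit_nnet <- nnet(income_class ~ ., data = train, size = 5)           # one hidden layer (6)
fit_svm  <- svm(income_class ~ ., data = train)                      # SVM (7), radial kernel by default
# k-NN has no separate training step; it classifies test points directly:
# knn(train[, -1], test[, -1], cl = train$income_class, k = 3)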
III. EXPERIMENTS

In order to compare the accuracy of the two types of ensembles, we performed a classification of European countries based on the World Bank data and the AMECO database.
The World Bank classifies economies based on gross national income (GNI) per capita (see www.worldbank.org/Home/Data/CountryClassification) into one of four classes:
• H - high income ($10,726 and more),
• UM - upper middle income ($3,466-$10,725),
• LM - lower middle income ($876-$3,465),
• L - low income ($875 or less).
This was the true class for each country in the training set.
The AMECO database is the annual macro-economic database of the European Commission's Directorate General for Economic and Financial Affairs. It contains data for EU-25, the euro area, EU Member States, candidate countries and other OECD countries (United States, Japan, Canada, Switzerland, Norway, Iceland, Mexico, Korea, Australia and New Zealand).
The database contains a selection of about 700 variables, e.g. Gross Savings, Final Consumption Expenditure of General Government, Exports of Goods and Services, Imports of Goods and Services, Unemployment Rate, etc.
We have collected 780 observations in the training set:
• 15 countries (old EU-15 members) observed in the years 1970-2005,
• 14 countries (10 new EU-25 members and 4 candidate countries: Bulgaria, Romania, Turkey, Croatia) observed in the years 1991-2005.
In order to classify the European countries, we started with single classification models, and Figure 1 shows how they divide the classification space. We used 10-fold cross-validation to assess their performance, and the estimated classification errors are presented in Table 1.
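The 10-fold cross-validation we used can be sketched generically; fit_fun and pred_fun stand for any of the six classifiers, and income_class is the hypothetical class column from the previous sketch:

cv_error <- function(data, fit_fun, pred_fun, folds = 10) {
  # Randomly assign each observation to one of the folds
  fold <- sample(rep(seq_len(folds), length.out = nrow(data)))
  errs <- sapply(seq_len(folds), function(f) {
    fit  <- fit_fun(data[fold != f, ])          # train on the other 9 folds
    pred <- pred_fun(fit, data[fold == f, ])    # predict the held-out fold
    mean(pred != data$income_class[fold == f])  # misclassification rate
  })
  mean(errs)
}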
Then we combined classifiers of the same type using bagging, boosting and the random subspace method. Their errors are presented in Table 1.
Table 1. Classification errors for different combining methods

Method             3-NN     LDA      QDA      Tree     Nnet     SVM
Single model (CV)  12.79%   25.13%   22.67%   17.44%   15.38%   19.21%
Bagging            12.88%   23.46%   21.92%   12.31%   13.08%   18.46%
Boosting           12.59%   21.56%   20.05%   13.15%   12.78%   17.63%
Random Subspace    12.21%   20.97%   19.34%   12.23%   11.83%   17.45%
Figure 1. Partition of the classification space by 3-nearest neighbors, trees, neural nets and SVMs (each panel plots the classes H, UM and LM against Gross Savings)
Then we combined classification models of different types using majority voting (1), and we observed a significant improvement in classification accuracy. The results are shown in Table 2. We pruned the trees, while the NNets and SVMs were tuned over supplied parameter ranges with the tune function from the e1071 library in R.
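This heterogeneous ensemble can be sketched by reusing the fitted models and a hypothetical test frame from the earlier sketches; the tune() call mirrors the paper's use of e1071, with an illustrative parameter grid:

# Collect the six models' predicted labels for the test set
preds <- cbind(as.character(knn(train[, -1], test[, -1], cl = train$income_class, k = 3)),
               as.character(predict(fit_lda,  test)$class),
               as.character(predict(fit_qda,  test)$class),
               as.character(predict(fit_tree, test, type = "class")),
               as.character(predict(fit_nnet, test, type = "class")),
               as.character(predict(fit_svm,  test)))
ensemble_pred <- apply(preds, 1, function(row) names(which.max(table(row))))  # voting (1)

# Tuning the SVM over a supplied parameter grid with e1071's tune()
tuned <- tune(svm, income_class ~ ., data = train,
              ranges = list(gamma = 10^(-2:1), cost = 10^(0:2)))
best_svm <- tuned$best.model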
Table 2. Classification errors for combining different models

Ensemble    Error
6 models    11.54%
60 models   10.28%
The ensemble of 60 models contained 10 models of each type, obtained by changing their parameters, e.g. the parameter k for the k-Nearest Neighbors, the size of the Classification Trees, the number of neurons in the hidden layer for Neural Networks, etc.
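The 60-model variant can be sketched by varying one parameter per model type; the parameter values below are illustrative, not those used in the paper:

# Ten neural networks differing in hidden-layer size
nnet_models <- lapply(1:10, function(s)
  nnet(income_class ~ ., data = train, size = s, trace = FALSE))

# Ten trees differing in size (grown with different complexity parameters cp)
tree_models <- lapply(seq(0.001, 0.1, length.out = 10), function(cp)
  rpart(income_class ~ ., data = train, method = "class", cp = cp))

# ...and analogously for k (k-NN), cost/gamma (SVM), etc.; all 60 label
# vectors are then combined with the majority-voting rule (1).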
IV. CONCLUSIONS
In our experiments we have combined classifiers of different types, i.e. Linear and Quadratic Discriminants, Trees, Neural Networks, SVM models and Nearest Neighbors. Then we compared their performance with ensembles formed using standard fusion methods like bagging or boosting.
The obtained results showed that ensembles of classifiers of different types outperformed those of the same type. They are more accurate because the members of the ensemble are diverse and divide the classification space in very different ways.
This approach can be used in difficult domains, e.g. in economics, pattern recognition, medicine, etc.
REFERENCES
Amit Y., Geman D. (1997): Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545-1588.
Breiman L. (1996): Bagging predictors. Machine Learning, 24, 123-140.
Breiman L. (1998): Arcing classifiers. Annals of Statistics, 26, 801-849.
Breiman L. (1999): Using adaptive bagging to debias regressions. Technical Report 547, Department of Statistics, University of California, Berkeley.
Breiman L. (2001): Random forests. Machine Learning, 45, 5-32.
Dietterich T., Bakiri G. (1995): Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.
Freund Y., Schapire R.E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Ho T.K. (1998): The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 832-844.
Tumer K., Ghosh J. (1996): Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29, 341-348.
Eugeniusz Gatnar

COMBINING DIFFERENT TYPES OF CLASSIFICATION MODELS
(Summary)

Model fusion has proved to be a very effective strategy for improving the predictive quality of classification models. The key issue, as follows from the theorem of Tumer and Ghosh (1996), is the degree of diversity of the aggregated models, i.e. the higher the correlation between the classification results of these models, the higher the error.

Most known model-combining methods, e.g. Random Forest proposed by Breiman (2001), aggregate models of the same type in different feature spaces. In order to increase the differences between individual models, this paper proposes combining models of different types that have been built in the same variable space (e.g. classification trees and SVM models).

The experiments used six classes of models: linear and quadratic discriminant models, classification trees, neural networks, and models built with the k-nearest neighbors (k-NN) method and the support vector machines (SVM) method.

The results obtained show that aggregated models formed by combining different types of models are more accurate than those whose component models are of the same type.