Expert Systems With Applications
Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach
Marcin Czajkowski∗, Marek Kretowski
Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, Bialystok 15-351, Poland
Article info
Article history:
Received 23 January 2019; Revised 28 June 2019; Accepted 9 July 2019; Available online 10 July 2019
Keywords: Data mining; Evolutionary algorithms; Decision trees; Underfitting; Gene expression data
Abstract
The problem of underfitting and overfitting in machine learning is often associated with the bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary, multi-test tree approach tailored to this specific application domain. The general idea is to apply gene clusters of varying size, which consist of functionally related genes, in each splitting rule.
This is achieved by using a few simple tests that mimic each other's predictions, together with built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by a multi-objective fitness function that minimizes tree error, split divergence and attribute costs. The evolutionary search for multi-tests in internal nodes, as well as for the overall tree structure, is performed simultaneously.
This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to the domain of molecular biology, including biomarker discovery, finding new gene-gene interactions and high-quality prediction. Extensive experiments carried out on 35 publicly available gene expression datasets show that we managed to significantly improve the accuracy and stability of decision trees. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so the patterns in the predictive structures remain comprehensible.
© 2019 Elsevier Ltd. All rights reserved.
1. Introduction
In machine learning, generalization often refers to the ability of a predictive model to match unseen data (Hastie, Tibshirani, & Friedman, 2009). If the model matches the training set well but fails to predict new instances in the problem area, we are typically dealing with so-called overfitting. This happens when the model concentrates on too much detailed information from the training data, which can occur in the form of noise or accidental fluctuations, and which negatively affects the ability of the model to generalize. Underfitting is the opposite of overfitting: an underfitted model is not complicated enough and focuses too little on the training data. As a result, it can neither fit the training set nor generalize to new data well.
Decision trees (DTs) (Kotsiantis, 2013) are one of the main techniques for discriminant analysis in knowledge discovery. Due to their non-parametric and flexible algorithm, DTs are to some extent prone to overfitting (Sáez, Luengo, & Herrera, 2016; Loh, 2014).
They are also known to be unstable, as small variations in the
∗ Corresponding author.
E-mail addresses: m.czajkowski@pb.edu.pl (M. Czajkowski), m.kretowski@pb.edu.pl (M. Kretowski).
training set can result in different trees and non-repeatable predictions. While this is an unquestionable advantage when using multiple trees, it is a problem when a classifier based on a single tree is used. Both generalization ability and stability can be improved, for example, by learning multiple models from bootstrap samples of the training data, but such an ensemble approach makes the extracted knowledge less understandable.
This paper tackles the problem of underfitting of DTs in the classification of gene expression data. In such data, the ratio of features to observations is very high, which creates serious problems for standard univariate decision trees (Chen, Wang, & Zhang, 2011;
Czajkowski & Kretowski, 2014). The learning algorithms may find tests that perfectly separate the training data, but these splits often correspond to noise. This situation is more likely at intermediate and lower levels of the tree, where the number of instances is reduced with each tree level and may be several orders of magnitude smaller than the number of available features. For this reason, most univariate DT inducers produce fairly simple trees that successfully classify the training data but fail to classify unseen instances (Grzes & Kretowski, 2007). This may lead to underfitting, as only a small number of attributes is used in such trees and, therefore, their models are not complex enough and cause poor generalization (Hastie, Tibshirani, & Friedman, 2009).
Producing larger trees does not solve the problem, because in the case of gene expression data, small trees already classify the training data perfectly. This suggests that the issue lies in split complexity, as little can be gained from larger univariate DTs with this type of data.
A gene cluster is part of a gene family, which is a set of homologous genes within one organism. It is composed of two or more genes found within an organism's DNA that encode similar polypeptides, or proteins, which collectively share a generalized function. It has been shown (Yi, Sze, & Thon, 2007) that polypeptides, or proteins, are also encoded by groups of functionally related genes, not single ones. In addition, the use of information on subgroups of attributes is particularly important in the problem of classification and selection of genomic data (Kar, Sharma, & Maitra, 2015; Wong & Liu, 2010). Therefore, we believe that focusing on a tree split based on gene clusters rather than a single gene not only improves the classifier's generalization ability but also provides interesting patterns that may appear in each multi-test.
This direction of research is continued in our study.
The main contribution of this work is a new evolutionary multi-test tree algorithm called Evolutionary Multi-Test Tree (EMTTree).
It aims to improve single-tree classifiers in terms of prediction accuracy and stability with a redefined and extended multi-test split approach (Czajkowski, Grześ, & Kretowski, 2014). In contrast to existing solutions, we propose the concept of a gene cluster to split instances in each non-terminal node of the tree. Each cluster consists of tests that mimic each other's predictions, and each test is a univariate test built on a selected attribute. The novelty of EMTTree covers:
• an evolutionary tree induction as an alternative to the greedy top-down approach used in our previous works. Thanks to this global approach, we are able to search for the tree structure and the multi-test splits simultaneously, and to resign from the flawed pruning procedure;
• a new algorithm for searching multi-test splits: a specialized EA in combination with local optimizations allows searching for the most uniform multi-tests with the top-ranked genes;
• the introduction of the gene cluster concept to the multi-test, adding a new dimension to its structure: information about the discriminatory power of genes is associated with every univariate test that constitutes a multi-test;
• a unique fitness function that focuses on minimizing the tree error, but not the tree size, which is the standard procedure for DTs. In addition, we incorporate information on gene ranking and the resemblance of splits in order to prevent the predictor from underfitting and overfitting the data, especially in the lower parts of the tree.
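The shape of such a multi-objective fitness can be illustrated with a small sketch. The linear weighting below and its weight values are assumptions for illustration only, not the paper's actual formula; the point is that error, split divergence and attribute cost are jointly minimized rather than tree size.

```python
# Hypothetical sketch of a weighted multi-objective fitness combining the
# three criteria named in the text: classification error, multi-test split
# divergence, and attribute (gene ranking) cost. The linear combination
# and the weights are illustrative assumptions.

def fitness(error_rate, split_divergence, attribute_cost,
            w_err=1.0, w_div=0.3, w_cost=0.1):
    """Lower is better: the evolutionary search minimizes this value."""
    return (w_err * error_rate
            + w_div * split_divergence
            + w_cost * attribute_cost)

# A tree with 5% error, mildly divergent multi-tests and cheap attributes
# scores better (lower) than one with 2% error but highly divergent splits
# built on low-ranked genes.
candidate_a = fitness(0.05, 0.10, 0.20)
candidate_b = fitness(0.02, 0.80, 0.90)
```

Under such a weighting, a slightly less accurate tree can still win if its splits are more uniform and built on higher-ranked genes.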
An extensive set of computational experiments using 35 real-world gene expression datasets has shown that the EMTTree solution now appears to be one of the top decision tree-like classifiers in the field of gene expression data.
The paper is organized as follows. The next section provides a brief background on DTs in the context of gene expression data analysis. Section 3 describes the concept of the multi-test and the proposed evolutionary approach. All experiments are presented in Section 4, and the last section contains conclusions and plans for future work.
2. Background
With the rapid development and popularity of genomic technology, a large number of gene expression datasets have become publicly accessible (Lazar et al., 2012). The availability of these datasets opens up new challenges for existing tools and algorithms. However, traditional solutions often fail due to high features-to-observations ratios and huge gene redundancy.
2.1. Decision trees
Decision trees (also known as classification trees) have a long history in predictive modeling (Kotsiantis, 2013). The success of the tree-based approach can be explained by its ease of use, speed of classification and effectiveness. In addition, the hierarchical structure of the tree, where appropriate tests are applied successively from one node to the next, closely resembles the human way of making decisions.
A DT has a knowledge representation structure made up of nodes and branches, where: each internal node is associated with a test on one or more attributes; each branch represents a test result;
and each leaf (terminal node) is designated by a class label. Most tree-inducing algorithms partition the feature space with axis-parallel hyperplanes. Trees of this type are often called univariate because a test in each non-terminal node usually involves a single attribute, which is selected according to a given goodness of split.
There are also algorithms that apply multivariate tests (Brodley & Utgoff, 1995), based mainly on linear combinations of multiple dependent attributes. An oblique split causes a linear division of the feature space by a non-orthogonal hyperplane. DTs that allow multiple features to be tested in a node are potentially smaller than those limited to single univariate splits, but they have a much higher computational cost and are often difficult to interpret (Brodley & Utgoff, 1995).
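The difference between the two kinds of test can be sketched in a few lines. The thresholds and weights below are invented for illustration and are not taken from any cited system.

```python
# Contrast between an axis-parallel (univariate) test and an oblique
# (linear-combination) test on a feature vector x. Values are illustrative.

def univariate_test(x, attr=0, threshold=0.5):
    # Axis-parallel hyperplane: compares a single attribute to a threshold.
    return x[attr] <= threshold

def oblique_test(x, weights=(0.7, -0.4), bias=0.1):
    # Non-orthogonal hyperplane: a weighted sum of several attributes.
    return sum(w * xi for w, xi in zip(weights, x)) + bias <= 0.0

sample = (0.3, 0.9)
# univariate_test inspects only sample[0]; oblique_test mixes both
# features, dividing the feature space with a tilted hyperplane.
```

The univariate test stays interpretable ("is gene a below 0.5?"), while the oblique test gains expressive power at the cost of readability, which mirrors the trade-off described above.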
Induction of an optimal DT is a known NP-complete problem (Hyafil & Rivest, 1976). As a consequence, practical DT learning algorithms must be heuristically enhanced. The most popular type of tree induction is based on a top-down greedy search (Rokach & Maimon, 2005). It starts with the root node, where the locally optimal split (test) is searched for according to a given measure of optimality. Then the training instances are redirected to the newly created nodes, and this process is repeated for each node until the stop condition is met. Additionally, post-pruning (Esposito, Malerba, & Semeraro, 1997) is usually applied after induction to avoid the problem of overfitting the training data and to improve the generalizing power of the predictive model.
The two most popular representatives of top-down DT inducers are CART (Breiman, Friedman, Olshen, & Stone, 2017) and C4.5 (Quinlan, 1992). The CART system recursively generates a binary tree, and the quality of a split is measured either by the Gini index or the Twoing criterion. The C4.5 algorithm applies multi-way splits instead of a typical binary strategy and uses the gain ratio criterion to split the nodes. Inducing a DT through a greedy strategy is fast and generally efficient in many practical problems, but it usually provides locally optimal solutions.
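The greedy split search at the heart of such inducers can be sketched as follows: for a binary split, enumerate (attribute, threshold) pairs and keep the one minimizing the weighted Gini index, as in CART. The toy data is invented; recursion, stopping and pruning are omitted.

```python
# Minimal sketch of a CART-style greedy univariate split search using the
# Gini index. Binary class labels (0/1); toy data, no recursion or pruning.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of class 1
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_univariate_split(rows, labels):
    best = None  # (weighted_gini, attribute, threshold)
    n = len(rows)
    for attr in range(len(rows[0])):
        for threshold in sorted({r[attr] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best

rows = [(1.0, 5.0), (2.0, 4.0), (8.0, 1.0), (9.0, 2.0)]
labels = [0, 0, 1, 1]
score, attr, threshold = best_univariate_split(rows, labels)
# attribute 0 separates the two classes perfectly at threshold 2.0
```

Each node greedily commits to the locally best split before its children are examined, which is exactly why the resulting tree can be globally suboptimal.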
In order to mitigate some of the negative effects of locally optimal decisions, a wide range of meta-heuristics for the induction of DTs has been examined (Barros, Basgalupp, De Carvalho, & Freitas, 2012; Czajkowski & Kretowski, 2014; 2016). They are able to globally search for the tree structure and the tests in internal nodes.
Such a global induction is of course much more computationally complex, but it can reveal hidden patterns that are often undetectable by greedy methods (Lv, Peng, Chen, & Sun, 2016). Other recent approaches to improving the predictive performance of decision trees include fuzziness (Wang, Liu, Pedrycz, & Zhang, 2015), uncertainties (Cao & Rockett, 2015), discretization (Saremi & Yaghmaee, 2018) and variable selection (Painsky & Rosset, 2017).
2.2. Gene expression data classification with decision trees
Microarrays and RNA-seq analysis can simultaneously measure the expression level of thousands of genes within a particular mRNA sample. The application of a mathematical apparatus and computational tools is indispensable here, since gene expression observations are represented by high-dimensional feature vectors.
However, genomic data is still challenging, and there are several culprits responsible, mainly: (i) Bellman's curse of dimensionality (too many features); (ii) the curse of dataset sparsity (too few samples); (iii) irrelevant and noisy genes; (iv) bias from methodological and technical factors. Each observation is described by a high-dimensional feature vector with a number of features that reaches into the thousands, but the number of observations is rarely higher than 100.
Univariate decision trees represent a white-box approach, and improvements to such models have considerable potential for genomic research and scientific modeling of the underlying processes. There are not many new solutions in the literature that focus on the classification of gene expression data with comprehensible DT models. One of the latest proposals is the FDT (Fuzzy Decision Tree) algorithm (Ludwig, Picek, & Jakobovic, 2018) for classifying gene expression data. The authors compare FDT with the classic DT algorithm (J48) on five popular cancer datasets and have shown some benefits from the use of data uncertainty. An alternative study is presented in Barros, Basgalupp, Freitas, and De Carvalho (2014), where the authors propose an evolutionary DT inducer called HEAD-DT. Detailed experiments carried out on 35 real-world gene expression datasets have shown the superiority of the algorithm in terms of predictive accuracy compared to well-known DT systems such as C4.5 and CART. An expert system has also been proposed to classify gene expression data using gene selection by decision tree (Horng et al., 2009). However, existing attempts have shown that decision tree algorithms often induce classifiers with inferior predictive performance (Barros et al., 2014;
Ge & Wong, 2008). Current DT-inducing algorithms, with their prediction models limited to splits composed of one attribute, use only a fraction of the available information. This results in a tendency to underfit: their models achieve a small error on the training set, but often fail to classify new high-dimensional data well. On the other hand, there are algorithms which apply multivariate tests (Brown, Pittard, & Park, 1996), based mostly on linear combination splits. However, the main flaws of such systems are their huge complexity, and the biological and clinical interpretation of the output models is very difficult, if not impossible.
Nowadays, much more interest is given to trees as sub-learners in an ensemble learning approach, such as Rotation or Random Forests (Chen & Ishwaran, 2012; Lu, Yang, Yan, Xue, & Gao, 2017).
These solutions alleviate the problem of low accuracy by averaging or adaptively merging multiple trees. One recent example is the multi-objective genetic programming-based ensemble of trees proposed in Nag and Pal (2016). The authors present an integrated algorithm for the simultaneous selection of features and classification. However, when modeling is aimed at understanding basic environmental processes, such methods are not so useful, because they generate more complex and less understandable models (Piltaver, Luštrek, Gams, & Martinčić-Ipšić, 2016). Nevertheless, important knowledge can still be drawn from ensemble methods, e.g.
to identify reduced sets of relevant variables in a given microarray (Lazzarini & Bacardit, 2017).
A solution called Multi-Test Decision Tree (MTDT) (Czajkowski et al., 2014) can be placed between one-dimensional and oblique trees. It uses several one-dimensional tests in each node, which on the one hand increases the complexity of the model, and on the other hand still allows for relatively easy interpretation of the decision rules. There were, however, a few other flaws and limitations of MTDT, which were addressed and removed with the proposed EMTTree solution, in particular:
• the lack of flexibility in the structure of the multi-test - the fixed size of multi-tests in all tree nodes;
• the limited search space - only a few of the highest-rated attributes were taken into account when building the multi-test (for performance reasons);
• the high number of crucial parameters to be defined ad hoc, including the size of the multi-test, the number of alternative multi-tests and the homogeneity of multi-tests;
• greedy top-down induction: meta-heuristic searches (Barros et al., 2012) could be expected to improve classification accuracy and reveal new patterns in the data.
In this way, the proposed EMTTree solution can self-adapt its structure to the currently analyzed data. The undoubted strength of our solution is the higher prediction accuracy and improved stability of the model. The minor weaknesses of EMTTree result from using an evolutionary approach, mainly the slow tree induction time and the number of input parameters that can be adjusted. However, gene expression datasets are still relatively small, and as we show in the experimental section, the number of parameters that need to be tuned is small.
2.3. Concept of the multi-test
The general concept of the multi-test split, denoted mt, was introduced for the first time in the Multi-Test Decision Tree (MTDT) algorithm (Czajkowski et al., 2014), which induces a DT in a top-down manner. The main idea was to find a split in each non-terminal node that is composed of several univariate tests that branch out the tree in a similar way. The reason for adding further tests was that the use of a single univariate test based on a single attribute may cause the classifier to underfit the learning data due to the low complexity of the classification rule. Each multi-test consists of a set with at least one univariate test. One test in the set is marked as the primary test (pt), and all remaining tests are called surrogate tests (st). The role of the surrogate tests is to support the division of training instances carried out by the primary test with the use of the remaining features. In order to determine the surrogate tests, we have adopted the solution proposed in the CART system (Breiman et al., 2017). Each surrogate test is constructed on a different attribute and mimics the primary test in terms of which, and how many, observations go to the corresponding branches. In the majority voting that determines the outcome of the multi-test, the individual weights of the tests are equal. This way, surrogate tests have a considerable impact (positive or negative) on multi-test decisions, as they can prevail over the primary test decision. It is also possible that a multi-test without the test with the highest gain ratio can be the most accurate split.
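The voting mechanism described above can be sketched compactly. The attribute indices and thresholds below are illustrative placeholders, not values from MTDT; the sketch only demonstrates how equal-weight surrogates can outvote the primary test.

```python
# Sketch of a multi-test decision: one primary univariate test plus
# surrogate tests on other attributes, combined by equal-weight majority
# voting. All attribute indices and thresholds are invented.

def make_test(attr, threshold):
    # Univariate test: True -> left branch, False -> right branch.
    return lambda x: x[attr] <= threshold

def multi_test_decision(x, tests):
    # Equal-weight majority vote; surrogates can prevail over the primary.
    votes = sum(1 for t in tests if t(x))
    return "left" if votes > len(tests) / 2 else "right"

primary = make_test(0, 0.5)
surrogates = [make_test(1, 1.2), make_test(2, -0.3)]
mt = [primary] + surrogates

# The primary test alone would send this instance left, but both
# surrogates vote right, so the multi-test routes it right.
instance = (0.4, 2.0, 0.0)
```

This is precisely the "considerable impact" of surrogates mentioned above: two dissenting surrogates override the primary test's decision.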
The experimental evaluation (Czajkowski et al., 2014) showed a significant improvement in classification accuracy and a reduction in underfitting compared to popular DT systems. Results from several real gene expression datasets suggest that the knowledge discovered by MTDT is supported by biological evidence in the literature and can be easily understood and interpreted.
Let us consider a binary classification problem, in which a node contains instances from two classes (Class A and Class B) and the instances should be divided into two leaves according to a test. Fig. 1a illustrates the possible assignment of instances to leaves according to a test T performed on a single attribute a. The desired split should place the instances from Class A in the left leaf and the instances from Class B in the right leaf. Each cell represents a single instance with a defined class, and each row shows how the instances are arranged in the leaves after performing the test. From Fig. 1a it is clear that a single test T1 on the attribute a1 has the highest goodness of split, because 13 out of 17 instances are classified correctly. In a typical system, this test would be selected as the split.
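The counting behind "13 out of 17" can be made explicit. The particular assignment of instances to leaves below is invented to match the totals in the example, since the actual instance values in Fig. 1a are not reproduced here.

```python
# Worked version of the goodness-of-split counting in the example above:
# a split is scored by how many instances land in the leaf matching their
# class (Class A belongs in the left leaf, Class B in the right leaf).
# The assignment below is a hypothetical arrangement yielding 13 of 17.

def correctly_placed(assignments):
    # assignments: list of (true_class, leaf) pairs.
    desired = {"A": "left", "B": "right"}
    return sum(1 for cls, leaf in assignments if desired[cls] == leaf)

assignments = ([("A", "left")] * 7 + [("A", "right")] * 2 +
               [("B", "right")] * 6 + [("B", "left")] * 2)
# 7 + 6 = 13 of the 17 instances are classified correctly by this split.
```

A competing test is better under this criterion only if it places more than 13 instances in their desired leaves.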