Expert Systems With Applications
Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach
Marcin Czajkowski∗, Marek Kretowski
Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, Bialystok 15-351, Poland
Article info
Article history:
Received 23 January 2019; Revised 28 June 2019; Accepted 9 July 2019; Available online 10 July 2019
Keywords: Data mining; Evolutionary algorithms; Decision trees; Underfitting; Gene expression data
Abstract
The problem of underfitting and overfitting in machine learning is often associated with the bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary, multi-test tree approach tailored to this specific application domain. The general idea is to apply gene clusters of varying size, which consist of functionally related genes, in each splitting rule.
This is achieved by using a few simple tests that mimic each other's predictions, together with built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by a multi-objective fitness function that minimizes tree error, split divergence and attribute costs. The evolutionary search for multi-tests in internal nodes, as well as for the overall tree structure, is performed simultaneously.
This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to the domain of molecular biology, including biomarker discovery, finding new gene-gene interactions and high-quality prediction. Extensive experiments carried out on 35 publicly available gene expression datasets show that we managed to significantly improve the accuracy and stability of decision trees. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so the patterns in the predictive structures remain comprehensible.
© 2019 Elsevier Ltd. All rights reserved.
1. Introduction
In machine learning, generalization often refers to the ability of a predictive model to match unseen data (Hastie, Tibshirani, & Friedman, 2009). If the model matches the training set well but fails to predict new instances in the problem area, we are typically dealing with so-called overfitting. This happens when the model concentrates on too much detailed information from the training data, which can occur in the form of noise or accidental fluctuations, and which negatively affects the ability of the model to generalize. Underfitting is the opposite of overfitting: an underfitted model is not complicated enough and focuses too little on the training data. As a result, it can neither fit the training set nor generalize to new data well.
Decision trees (DTs) (Kotsiantis, 2013) are one of the main techniques for discriminant analysis in knowledge discovery. Due to their non-parametric and flexible algorithm, DTs are to some extent prone to overfitting (Sáez, Luengo, & Herrera, 2016; Loh, 2014).
They are also known to be unstable, as small variations in the
∗ Corresponding author.
E-mail addresses: m.czajkowski@pb.edu.pl (M. Czajkowski), m.kretowski@pb.edu.pl (M. Kretowski).
training set can result in different trees and non-repeatable predictions. While this is an unquestionable advantage when using multiple trees, it is a problem when a classifier based on a single tree is used. Both generalization ability and stability can be improved, for example, by learning multiple models from bootstrap samples of the training data, but such an ensemble approach makes the extracted knowledge less understandable.
This paper tackles the problem of underfitting of DTs in the classification of gene expression data. In such data, the ratio of features to observations is very high, which creates serious problems for standard univariate decision trees (Chen, Wang, & Zhang, 2011;
Czajkowski & Kretowski, 2014). The learning algorithms may find tests that perfectly separate the training data, but these splits often correspond to noise. This situation is more likely at intermediate and lower levels of the tree, where the number of instances is reduced with each tree level and may be several orders of magnitude smaller than the number of available features. For this reason, most univariate DT inducers produce fairly simple trees that successfully classify the training data but fail to classify unseen instances (Grzes & Kretowski, 2007). This may lead to underfitting, as only a small number of attributes is used in such trees and, therefore, their models are not complex enough and cause poor generalization (Hastie, Tibshirani, & Friedman, 2009).
Producing larger trees does not solve the problem, because in the case of gene expression data, small trees already classify the training data perfectly. This suggests that the issue lies in split complexity, as little can be gained from larger univariate DTs with this type of data.
A gene cluster is part of a gene family, which is a set of homologous genes within one organism. It is composed of two or more genes found within an organism's DNA that encode similar polypeptides, or proteins, which collectively share a generalized function. It has been shown (Yi, Sze, & Thon, 2007) that polypeptides, or proteins, are also encoded by groups of functionally related genes, not single ones. In addition, the use of information on subgroups of attributes is particularly important in the problem of classification and selection of genomic data (Kar, Sharma, & Maitra, 2015; Wong & Liu, 2010). Therefore, we believe that focusing on a tree split based on gene clusters rather than a single gene not only improves the classifier's generalization ability but also provides interesting patterns that may appear in each multi-test.
This direction of research is continued in our study.
The main contribution of this work is a new evolutionary multi-test tree algorithm called Evolutionary Multi-Test Tree (EMTTree).
It aims to improve single-tree classifiers in terms of prediction accuracy and stability with a redefined and extended multi-test split approach (Czajkowski, Grześ, & Kretowski, 2014). In contrast to existing solutions, we propose the concept of a gene cluster to split instances in each non-terminal node of the tree. Each cluster consists of tests that mimic each other's predictions, and each test is a univariate test built on a selected attribute. The novelty of EMTTree covers:
• an evolutionary tree induction as an alternative to the greedy top-down approach used in our previous works. Thanks to this global approach, we are able to search for the tree structure and the multi-test splits simultaneously, and to resign from the flawed pruning procedure;
• a new algorithm for searching multi-test splits: a specialized EA in combination with local optimizations allows searching for the most uniform multi-tests with the top-ranked genes;
• the introduction of the gene cluster concept to the multi-test, adding a new dimension to its structure: information about the discriminatory power of genes is associated with every univariate test that constitutes a multi-test;
• a unique fitness function that focuses on minimizing the tree error, but not the tree size, which is the standard procedure for DTs. In addition, we incorporate information on gene ranking and the resemblance of splits in order to prevent the predictor from underfitting and overfitting the data, especially in the lower parts of the tree.
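The shape of such a multi-objective fitness can be illustrated with a small sketch. The linear weighting below and its weight values are assumptions for illustration only, not the paper's actual formula; the point is that error, split divergence and attribute cost are jointly minimized rather than tree size.

```python
# Hypothetical sketch of a weighted multi-objective fitness combining the
# three criteria named in the text: classification error, multi-test split
# divergence, and attribute (gene ranking) cost. The linear combination
# and the weights are illustrative assumptions.

def fitness(error_rate, split_divergence, attribute_cost,
            w_err=1.0, w_div=0.3, w_cost=0.1):
    """Lower is better: the evolutionary search minimizes this value."""
    return (w_err * error_rate
            + w_div * split_divergence
            + w_cost * attribute_cost)

# A tree with 5% error, mildly divergent multi-tests and cheap attributes
# scores better (lower) than one with 2% error but highly divergent splits
# built on low-ranked genes.
candidate_a = fitness(0.05, 0.10, 0.20)
candidate_b = fitness(0.02, 0.80, 0.90)
```

Under such a weighting, a slightly less accurate tree can still win if its splits are more uniform and built on higher-ranked genes.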
An extensive set of computational experiments using 35 real-world gene expression datasets has shown that the EMTTree solution now appears to be one of the top decision tree-like classifiers in the field of gene expression data.
The paper is organized as follows. The next section provides a brief background on DTs in the context of gene expression data analysis. Section 3 describes the concept of the multi-test and the proposed evolutionary approach. All experiments are presented in Section 4, and the last section contains conclusions and plans for future work.
2. Background
With the rapid development and popularity of genomic technology, a large number of gene expression datasets have become publicly accessible (Lazar et al., 2012). The availability of these datasets opens up new challenges for existing tools and algorithms. However, traditional solutions often fail due to high features-to-observations ratios and huge gene redundancy.
2.1. Decision trees
Decision trees (also known as classification trees) have a long history in predictive modeling (Kotsiantis, 2013). The success of the tree-based approach can be explained by its ease of use, speed of classification and effectiveness. In addition, the hierarchical structure of the tree, where appropriate tests are applied successively from one node to the next, closely resembles the human way of making decisions.
A DT has a knowledge representation structure made up of nodes and branches, where: each internal node is associated with a test on one or more attributes; each branch represents a test result;
and each leaf (terminal node) is designated by a class label. Most tree-inducing algorithms partition the feature space with axis-parallel hyperplanes. Trees of this type are often called univariate because a test in each non-terminal node usually involves a single attribute, which is selected according to a given goodness of split.
There are also algorithms that apply multivariate tests (Brodley & Utgoff, 1995), based mainly on linear combinations of multiple dependent attributes. An oblique split causes a linear division of the feature space by a non-orthogonal hyperplane. DTs that allow multiple features to be tested in a node are potentially smaller than those limited to single univariate splits, but they have a much higher computational cost and are often difficult to interpret (Brodley & Utgoff, 1995).
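The difference between the two kinds of test can be sketched in a few lines. The thresholds and weights below are invented for illustration and are not taken from any cited system.

```python
# Contrast between an axis-parallel (univariate) test and an oblique
# (linear-combination) test on a feature vector x. Values are illustrative.

def univariate_test(x, attr=0, threshold=0.5):
    # Axis-parallel hyperplane: compares a single attribute to a threshold.
    return x[attr] <= threshold

def oblique_test(x, weights=(0.7, -0.4), bias=0.1):
    # Non-orthogonal hyperplane: a weighted sum of several attributes.
    return sum(w * xi for w, xi in zip(weights, x)) + bias <= 0.0

sample = (0.3, 0.9)
# univariate_test inspects only sample[0]; oblique_test mixes both
# features, dividing the feature space with a tilted hyperplane.
```

The univariate test stays interpretable ("is gene a below 0.5?"), while the oblique test gains expressive power at the cost of readability, which mirrors the trade-off described above.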
Induction of an optimal DT is a known NP-complete problem (Hyafil & Rivest, 1976). As a consequence, practical DT learning algorithms must be heuristically enhanced. The most popular type of tree induction is based on a top-down greedy search (Rokach & Maimon, 2005). It starts with the root node, where the locally optimal split (test) is searched for according to a given measure of optimality. Then the training instances are redirected to the newly created nodes, and this process is repeated for each node until the stop condition is met. Additionally, post-pruning (Esposito, Malerba, & Semeraro, 1997) is usually applied after induction to avoid the problem of overfitting the training data and to improve the generalizing power of the predictive model.
The two most popular representatives of top-down DT inducers are CART (Breiman, Friedman, Olshen, & Stone, 2017) and C4.5 (Quinlan, 1992). The CART system recursively generates a binary tree, and the quality of a split is measured either by the Gini index or the Twoing criterion. The C4.5 algorithm applies multi-way splits instead of a typical binary strategy and uses the gain ratio criterion to split the nodes. Inducing a DT through a greedy strategy is fast and generally efficient in many practical problems, but it usually provides locally optimal solutions.
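The greedy split search at the heart of such inducers can be sketched as follows: for a binary split, enumerate (attribute, threshold) pairs and keep the one minimizing the weighted Gini index, as in CART. The toy data is invented; recursion, stopping and pruning are omitted.

```python
# Minimal sketch of a CART-style greedy univariate split search using the
# Gini index. Binary class labels (0/1); toy data, no recursion or pruning.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of class 1
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_univariate_split(rows, labels):
    best = None  # (weighted_gini, attribute, threshold)
    n = len(rows)
    for attr in range(len(rows[0])):
        for threshold in sorted({r[attr] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[attr] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best

rows = [(1.0, 5.0), (2.0, 4.0), (8.0, 1.0), (9.0, 2.0)]
labels = [0, 0, 1, 1]
score, attr, threshold = best_univariate_split(rows, labels)
# attribute 0 separates the two classes perfectly at threshold 2.0
```

Each node greedily commits to the locally best split before its children are examined, which is exactly why the resulting tree can be globally suboptimal.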
In order to mitigate some of the negative effects of locally optimal decisions, a wide range of meta-heuristics for the induction of DTs has been examined (Barros, Basgalupp, De Carvalho, & Freitas, 2012; Czajkowski & Kretowski, 2014; 2016). They are able to globally search for the tree structure and the tests in internal nodes.
Such a global induction is of course much more computationally complex, but it can reveal hidden patterns that are often undetectable by greedy methods (Lv, Peng, Chen, & Sun, 2016). Other recent approaches to improving the predictive performance of decision trees include fuzziness (Wang, Liu, Pedrycz, & Zhang, 2015), uncertainties (Cao & Rockett, 2015), discretization (Saremi & Yaghmaee, 2018) and variable selection (Painsky & Rosset, 2017).
2.2. Gene expression data classification with decision trees
Microarrays and RNA-seq analysis can simultaneously measure the expression level of thousands of genes within a particular mRNA sample. The application of a mathematical apparatus and computational tools is indispensable here, since gene expression observations are represented by high-dimensional feature vectors.
However, genomic data is still challenging, and there are several culprits responsible, mainly: (i) Bellman's curse of dimensionality (too many features); (ii) the curse of dataset sparsity (too few samples); (iii) irrelevant and noisy genes; (iv) bias from methodological and technical factors. Each observation is described by a high-dimensional feature vector with a number of features that reaches into the thousands, but the number of observations is rarely higher than 100.
Univariate decision trees represent a white-box approach, and improvements to such models have considerable potential for genomic research and scientific modeling of the underlying processes. There are not many new solutions in the literature that focus on the classification of gene expression data with comprehensible DT models. One of the latest proposals is the FDT (Fuzzy Decision Tree) algorithm (Ludwig, Picek, & Jakobovic, 2018) for classifying gene expression data. The authors compare FDT with the classic DT algorithm (J48) on five popular cancer datasets and have shown some benefits from the use of data uncertainty. An alternative study is presented in Barros, Basgalupp, Freitas, and De Carvalho (2014), where the authors propose an evolutionary DT inducer called HEAD-DT. Detailed experiments carried out on 35 real-world gene expression datasets have shown the superiority of the algorithm in terms of predictive accuracy compared to well-known DT systems such as C4.5 and CART. An expert system has also been proposed to classify gene expression data using gene selection by decision tree (Horng et al., 2009). However, existing attempts have shown that decision tree algorithms often induce classifiers with inferior predictive performance (Barros et al., 2014;
Ge & Wong, 2008). Current DT-inducing algorithms, with their prediction models limited to splits composed of one attribute, use only a fraction of the available information. This results in a tendency to underfit: their models achieve a small error on the training set, but often fail to classify new high-dimensional data well. On the other hand, there are algorithms which apply multivariate tests (Brown, Pittard, & Park, 1996), based mostly on linear combination splits. However, the main flaws of such systems are their huge complexity, and the biological and clinical interpretation of the output models is very difficult, if not impossible.
Nowadays, much more interest is given to trees as sub-learners in an ensemble learning approach, such as Rotation or Random Forests (Chen & Ishwaran, 2012; Lu, Yang, Yan, Xue, & Gao, 2017).
These solutions alleviate the problem of low accuracy by averaging or adaptively merging multiple trees. One recent example is the multi-objective genetic programming-based ensemble of trees proposed in Nag and Pal (2016). The authors present an integrated algorithm for the simultaneous selection of features and classification. However, when modeling is aimed at understanding basic environmental processes, such methods are not so useful, because they generate more complex and less understandable models (Piltaver, Luštrek, Gams, & Martinčić-Ipšić, 2016). Nevertheless, important knowledge can still be drawn from ensemble methods, e.g.
to identify reduced sets of relevant variables in a given microarray (Lazzarini & Bacardit, 2017).
A solution called Multi-Test Decision Tree (MTDT) (Czajkowski et al., 2014) can be placed between one-dimensional and oblique trees. It uses several one-dimensional tests in each node, which on the one hand increases the complexity of the model, and on the other hand still allows for relatively easy interpretation of the decision rules. There were, however, a few other flaws and limitations of MTDT, which were addressed and removed with the proposed EMTTree solution, in particular:
• the lack of flexibility in the structure of the multi-test - the fixed size of multi-tests in all tree nodes;
• the limited search space - only a few of the highest-rated attributes were taken into account when building the multi-test (for performance reasons);
• the high number of crucial parameters to be defined ad hoc, including the size of the multi-test, the number of alternative multi-tests and the homogeneity of multi-tests;
• greedy top-down induction: meta-heuristic searches (Barros et al., 2012) could be expected to improve classification accuracy and reveal new patterns in the data.
In this way, the proposed EMTTree solution can self-adapt its structure to the currently analyzed data. The undoubted strength of our solution is the higher prediction accuracy and improved stability of the model. The minor weaknesses of EMTTree result from using an evolutionary approach, mainly the slow tree induction time and the number of input parameters that can be adjusted. However, gene expression datasets are still relatively small, and as we show in the experimental section, the number of parameters that need to be tuned is small.
2.3. Concept of the multi-test
The general concept of the multi-test split, denoted mt, was introduced for the first time in the Multi-Test Decision Tree (MTDT) algorithm (Czajkowski et al., 2014), which induces a DT in a top-down manner. The main idea was to find a split in each non-terminal node that is composed of several univariate tests that branch out the tree in a similar way. The reason for adding further tests was that the use of a single univariate test based on a single attribute may cause the classifier to underfit the learning data due to the low complexity of the classification rule. Each multi-test consists of a set with at least one univariate test. One test in the set is marked as the primary test (pt), and all remaining tests are called surrogate tests (st). The role of the surrogate tests is to support the division of training instances carried out by the primary test with the use of the remaining features. In order to determine the surrogate tests, we have adopted the solution proposed in the CART system (Breiman et al., 2017). Each surrogate test is constructed on a different attribute and mimics the primary test in terms of which, and how many, observations go to the corresponding branches. In the majority voting that determines the outcome of the multi-test, the individual weights of the tests are equal. This way, surrogate tests have a considerable impact (positive or negative) on multi-test decisions, as they can prevail over the primary test decision. It is also possible that a multi-test without the test with the highest gain ratio can be the most accurate split.
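The voting mechanism described above can be sketched compactly. The attribute indices and thresholds below are illustrative placeholders, not values from MTDT; the sketch only demonstrates how equal-weight surrogates can outvote the primary test.

```python
# Sketch of a multi-test decision: one primary univariate test plus
# surrogate tests on other attributes, combined by equal-weight majority
# voting. All attribute indices and thresholds are invented.

def make_test(attr, threshold):
    # Univariate test: True -> left branch, False -> right branch.
    return lambda x: x[attr] <= threshold

def multi_test_decision(x, tests):
    # Equal-weight majority vote; surrogates can prevail over the primary.
    votes = sum(1 for t in tests if t(x))
    return "left" if votes > len(tests) / 2 else "right"

primary = make_test(0, 0.5)
surrogates = [make_test(1, 1.2), make_test(2, -0.3)]
mt = [primary] + surrogates

# The primary test alone would send this instance left, but both
# surrogates vote right, so the multi-test routes it right.
instance = (0.4, 2.0, 0.0)
```

This is precisely the "considerable impact" of surrogates mentioned above: two dissenting surrogates override the primary test's decision.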
The experimental evaluation (Czajkowski et al., 2014) showed a significant improvement in classification accuracy and a reduction in underfitting compared to popular DT systems. Results from several real gene expression datasets suggest that the knowledge discovered by MTDT is supported by biological evidence in the literature and can be easily understood and interpreted.
Let us consider a binary classification problem, in which a node contains instances from two classes (Class A and Class B) and the instances should be divided into two leaves according to a test. Fig. 1a illustrates the possible assignment of instances to leaves according to a test T performed on a single attribute a. The desired split should place the instances from Class A in the left leaf and the instances from Class B in the right leaf. Each cell represents a single instance with a defined class, and each row shows how the instances are arranged in the leaves after performing the test. From Fig. 1a it is clear that a single test T1 on the attribute a1 has the highest goodness of split, because 13 out of 17 instances are classified correctly. In a typical system, this test would be selected as the split.
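The counting behind "13 out of 17" can be made explicit. The particular assignment of instances to leaves below is invented to match the totals in the example, since the actual instance values in Fig. 1a are not reproduced here.

```python
# Worked version of the goodness-of-split counting in the example above:
# a split is scored by how many instances land in the leaf matching their
# class (Class A belongs in the left leaf, Class B in the right leaf).
# The assignment below is a hypothetical arrangement yielding 13 of 17.

def correctly_placed(assignments):
    # assignments: list of (true_class, leaf) pairs.
    desired = {"A": "left", "B": "right"}
    return sum(1 for cls, leaf in assignments if desired[cls] == leaf)

assignments = ([("A", "left")] * 7 + [("A", "right")] * 2 +
               [("B", "right")] * 6 + [("B", "left")] * 2)
# 7 + 6 = 13 of the 17 instances are classified correctly by this split.
```

A competing test is better under this criterion only if it places more than 13 instances in their desired leaves.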