
Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach

Marcin Czajkowski, Marek Kretowski

Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, Bialystok 15-351, Poland

article info

Article history:

Received 23 January 2019
Revised 28 June 2019
Accepted 9 July 2019
Available online 10 July 2019

Keywords:

Data mining
Evolutionary algorithms
Decision trees
Underfitting
Gene expression data

abstract

The problem of underfitting and overfitting in machine learning is often associated with a bias-variance trade-off. Underfitting manifests most clearly in tree-based inducers when they are used to classify gene expression data. To improve the generalization ability of decision trees, we introduce an evolutionary, multi-test tree approach tailored to this specific application domain. The general idea is to apply gene clusters of varying size, which consist of functionally related genes, in each splitting rule.

This is achieved by using a few simple tests that mimic each other's predictions and built-in information about the discriminatory power of genes. The tendencies to underfit and overfit are limited by the multi-objective fitness function that minimizes tree error, split divergence and attribute costs. Evolutionary search for multi-tests in internal nodes, as well as for the overall tree structure, is performed simultaneously.

This novel approach, called Evolutionary Multi-Test Tree (EMTTree), may bring far-reaching benefits to the domain of molecular biology, including biomarker discovery, finding new gene-gene interactions and high-quality prediction. Extensive experiments carried out on 35 publicly available gene expression datasets show that we managed to significantly improve the accuracy and stability of the decision tree. Importantly, EMTTree does not substantially increase the overall complexity of the tree, so that the patterns in the predictive structures remain comprehensible.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

In machine learning, generalization often refers to the ability of a predictive model to match unseen data (Hastie, Tibshirani, & Friedman, 2009). If the model matches the training set well but fails to predict new instances in the problem area, we are typically dealing with so-called overfitting. This happens when the model concentrates on too much detailed information from the training data, which can occur in the form of noise or accidental fluctuations, and negatively affects the model's ability to generalize. Underfitting is the opposite of overfitting: an underfitted model is not complicated enough and focuses too little on the training data. As a result, it can neither fit the training set nor generalize to new data well.

Decision trees (DTs) (Kotsiantis, 2013) are one of the main techniques for discriminant analysis in knowledge discovery. Due to their non-parametric and flexible algorithm, DTs are to some extent prone to overfitting (Loh, 2014; Sáez, Luengo, & Herrera, 2016). They are also known to be unstable, as small variations in the training set can result in different trees and non-repeatable predictions. While this is an unquestionable advantage when using multiple trees, it is a problem when a classifier based on a single tree is used. Both generalization ability and stability can be improved, for example, by learning multiple models from bootstrap samples of the training data, but such an ensemble approach makes the extracted knowledge less understandable.

Corresponding author.

E-mail addresses: m.czajkowski@pb.edu.pl (M. Czajkowski), m.kretowski@pb.edu.pl (M. Kretowski).

This paper tackles the problem of underfitting of DTs in the classification of gene expression data. In such data the ratio of features to observations is very high, which creates serious problems for standard univariate decision trees (Chen, Wang, & Zhang, 2011; Czajkowski & Kretowski, 2014). The learning algorithms may find tests that perfectly separate the training data, but these splits often correspond to noise. This situation is more likely at intermediate and lower levels of the tree, where the number of instances is reduced with each tree level and may be several orders of magnitude smaller than the number of available features. For this reason, most univariate DT inducers produce considerably simple trees that successfully classify the training data but fail to classify unseen instances (Grzes & Kretowski, 2007). This may lead to underfitting, as a small number of attributes is used in such trees and, therefore, their models are not complex enough and generalize poorly (Hastie, Tibshirani, & Friedman, 2009).

https://doi.org/10.1016/j.eswa.2019.07.019 0957-4174/© 2019 Elsevier Ltd. All rights reserved.


The production of larger trees does not solve the problem because, in the case of gene expression data, small trees already classify the training data perfectly. This suggests turning to the issue of split complexity instead, as little can be gained from larger univariate DTs with this type of data.

A gene cluster is part of a gene family, which is a set of homologous genes within one organism. It is composed of two or more genes found within an organism's DNA that encode similar polypeptides, or proteins, which collectively share a generalized function. It has been shown (Yi, Sze, & Thon, 2007) that polypeptides, or proteins, are also encoded by a group of functionally related genes rather than a single one. In addition, the use of information on subgroups of attributes is particularly important in the problem of classification and selection of genomic data (Kar, Sharma, & Maitra, 2015; Wong & Liu, 2010). Therefore, we believe that focusing on a tree split based on gene clusters rather than a single gene not only improves the classifier's generalization ability but also provides interesting patterns that may appear in each multi-test. This direction of research is continued in our study.

The main contribution of this work is a new evolutionary multi-test tree algorithm called Evolutionary Multi-Test Tree (EMTTree). It aims to improve single-tree classifiers in terms of prediction accuracy and stability with a redefined and extended multi-test split approach (Czajkowski, Grześ, & Kretowski, 2014). In contrast to existing solutions, we propose the concept of a gene cluster in order to split instances in each non-terminal node of the tree. Each cluster consists of tests that mimic each other's predictions, and each test is a univariate test built on a selected attribute. The novelty of EMTTree covers:

- an evolutionary tree induction as an alternative to the greedy top-down search used in our previous works. Thanks to this global approach we were able to search for the tree structure and multi-test splits simultaneously, and to resign from the flawed pruning procedure;
- a new algorithm for searching multi-test splits: a specialized EA in combination with local optimizations allows searching for the most uniform multi-tests with the top-ranked genes;
- introducing the gene cluster concept to the multi-test and adding a new dimension to its structure: information about the discriminatory power of genes is associated with every univariate test that constitutes a multi-test;
- a unique fitness function that focuses on minimizing the tree error, but not the tree size, which is the standard procedure for DTs. In addition, we incorporate information on gene ranking and resemblance of splits in order to prevent the predictor from underfitting and overfitting to the data, especially in the lower parts of the tree.
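To make the idea of a multi-objective fitness concrete, the following is a minimal Python sketch of one possible weighted-sum formulation. The function name, the aggregation form and the weights `alpha`, `beta`, `gamma` are illustrative assumptions, not the formula actually used by EMTTree:

```python
def emttree_fitness(tree_error, split_divergence, attribute_cost,
                    alpha=1.0, beta=0.1, gamma=0.01):
    """Hypothetical weighted-sum aggregation of the three objectives:
    classification error of the tree, divergence between the tests that
    form a multi-test, and the cost of the attributes used.
    All weights are illustrative defaults, not the paper's parameters."""
    return alpha * tree_error + beta * split_divergence + gamma * attribute_cost

# Lower is better: a tree with smaller error and more uniform
# multi-tests receives a smaller penalty.
print(emttree_fitness(0.10, 0.20, 3.0))
```

A weighted sum is only one way to combine objectives; an evolutionary algorithm could equally rank candidate trees with a Pareto-based scheme.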

An extensive set of computational experiments using 35 real-world gene expression datasets has shown that the EMTTree solution now appears to be one of the top decision tree-like classifiers in the field of gene expression data.

The paper is organized as follows. The next section provides a brief background on DTs in the context of gene expression data analysis. Section 3 describes the concept of the multi-test and the proposed evolutionary approach. All experiments are presented in Section 4, and the last section contains conclusions and plans for future work.

2. Background

With the rapid development and popularity of genomic technology, a large number of gene expression datasets have become publicly accessible (Lazar et al., 2012). The availability of these datasets opens up new challenges for existing tools and algorithms. However, traditional solutions often fail due to high features-to-observations ratios and huge gene redundancy.

2.1. Decision tree

Decision trees (also known as classification trees) have a long history in predictive modeling (Kotsiantis, 2013). The success of the tree-based approach can be explained by its ease of use, speed of classification and effectiveness. In addition, the hierarchical structure of the tree, where appropriate tests are applied successively from one node to the next, closely resembles the human way of making decisions.

A DT has a knowledge representation structure made up of nodes and branches, where: each internal node is associated with a test on one or more attributes; each branch represents the test result; and each leaf (terminal node) is designated by a class label. Most tree-inducing algorithms partition the feature space with axis-parallel hyperplanes. Trees of this type are often called univariate because a test in each non-terminal node usually involves a single attribute, which is selected according to a given goodness of split. There are also algorithms that apply multivariate tests (Brodley & Utgoff, 1995), based mainly on linear combinations of multiple dependent attributes. The oblique split causes a linear division of the feature space by a non-orthogonal hyperplane. DTs that allow multiple features to be tested in a node are potentially smaller than those limited to single univariate splits, but they have a much higher computational cost and are often difficult to interpret (Brodley & Utgoff, 1995).
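The difference between axis-parallel (univariate) and oblique tests can be sketched as two split predicates. This is a minimal illustration, with made-up function names and example values:

```python
def univariate_split(x, attr, threshold):
    """Axis-parallel test: branch on a single attribute of instance x."""
    return x[attr] <= threshold

def oblique_split(x, weights, threshold):
    """Oblique test: branch on a linear combination of attributes,
    i.e. the feature space is divided by a non-orthogonal hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) <= threshold

# An instance with two features: the univariate test looks at one
# axis only, the oblique test at a weighted sum of both.
x = [2.0, 5.0]
print(univariate_split(x, 0, 3.0))         # tests x[0] <= 3.0 -> True
print(oblique_split(x, [1.0, -1.0], 0.0))  # tests x[0] - x[1] <= 0 -> True
```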

Induction of an optimal DT is a known NP-complete problem (Hyafil & Rivest, 1976). As a consequence, practical DT learning algorithms must be heuristically enhanced. The most popular type of tree induction is based on a top-down greedy search (Rokach & Maimon, 2005). It starts with the root node, where the locally optimal split (test) is searched for according to the given measure of optimality. Then the training instances are redirected to the newly created nodes, and this process is repeated for each node until the stop condition is met. Additionally, post-pruning (Esposito, Malerba, & Semeraro, 1997) is usually applied after induction to avoid the problem of overfitting to the training data and to improve the generalizing power of the predictive model. The two most popular representatives of top-down DT inducers are CART (Breiman, Friedman, Olshen, & Stone, 2017) and C4.5 (Quinlan, 1992). The CART system recursively generates a binary tree, and the quality of a split is measured either by the Gini index or the Twoing criterion. The C4.5 algorithm applies multi-way splits instead of a typical binary strategy and uses the gain ratio criterion to split the nodes. Inducing a DT through a greedy strategy is fast and generally efficient in many practical problems, but it usually provides locally optimal solutions.
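As an illustration of the goodness-of-split measures mentioned above, here is a minimal sketch of the Gini index used by CART to score a binary threshold split. The toy data and helper names are made up for the example:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity of the binary split 'value <= threshold';
    lower is better, 0.0 means both branches are pure."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy expression values of one gene across four samples of two classes:
values = [0.2, 0.4, 1.8, 2.1]
labels = ["A", "A", "B", "B"]
print(split_gini(values, labels, 1.0))  # perfect split -> 0.0
```

A greedy inducer would evaluate such a score for every candidate attribute and threshold and pick the minimum, which is exactly where the locally optimal decisions come from.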

In order to mitigate some of the negative effects of locally optimal decisions, a wide range of meta-heuristics for the induction of DTs has been examined (Barros, Basgalupp, De Carvalho, & Freitas, 2012; Czajkowski & Kretowski, 2014; 2016). They are able to globally search for the tree structure and the tests in internal nodes. Such a global induction is of course much more computationally complex, but it can reveal hidden patterns that are often undetectable by greedy methods (Lv, Peng, Chen, & Sun, 2016). Other recent approaches to improving the predictive performance of decision trees include fuzziness (Wang, Liu, Pedrycz, & Zhang, 2015), uncertainties (Cao & Rockett, 2015), discretization (Saremi & Yaghmaee, 2018) and variable selection (Painsky & Rosset, 2017).

2.2. Gene expression data classification with decision trees

Microarrays and RNA-seq analysis can simultaneously measure the expression level of thousands of genes within a particular mRNA sample. The application of a mathematical apparatus and computational tools is indispensable here, since gene expression observations are represented by high-dimensional feature vectors. However, genomic data is still challenging, and there are several culprits responsible, mainly: (i) Bellman's curse of dimensionality (too many features); (ii) the curse of dataset sparsity (too few samples); (iii) irrelevant and noisy genes; (iv) bias from methodological and technical factors. Each observation is described by a high-dimensional feature vector with a number of features that reaches into the thousands, while the number of observations is rarely higher than 100.

Univariate decision trees represent a white-box approach, and improvements to such models have considerable potential for genomic research and scientific modeling of the underlying processes. There are not many new solutions in the literature that focus on the classification of gene expression data with comprehensible DT models. One of the latest proposals is the FDT (Fuzzy Decision Tree) algorithm (Ludwig, Picek, & Jakobovic, 2018) for classifying gene expression data. The authors compare FDT with the classic DT algorithm (J48) on five popular cancer datasets and have shown some benefits from the use of data uncertainty. Alternative studies are presented in Barros, Basgalupp, Freitas, and De Carvalho (2014), where the authors propose an evolutionary DT inducer called HEAD-DT. Detailed experiments carried out on 35 real-world gene expression datasets have shown the superiority of the algorithm in terms of predictive accuracy compared to well-known DT systems such as C4.5 and CART. An expert system has also been proposed to classify gene expression data using gene selection by decision tree (Horng et al., 2009). However, existing attempts have shown that decision tree algorithms often induce classifiers with inferior predictive performance (Barros et al., 2014; Ge & Wong, 2008). Current DT-inducing algorithms, with their prediction models limited to splits composed of one attribute, use only a fraction of the available information. This results in a tendency to underfit, as their models have a small bias on the training set but often fail to classify the new high-dimensional data well. On the other hand, there are algorithms which apply multivariate tests (Brown, Pittard, & Park, 1996) based mostly on linear combination splits. However, the main flaw of such systems is their huge complexity, and the biological and clinical interpretation of the output models is very difficult, if not impossible.

Nowadays, much more interest is given to trees as sub-learners in an ensemble learning approach, such as Rotation or Random Forests (Chen & Ishwaran, 2012; Lu, Yang, Yan, Xue, & Gao, 2017). These solutions alleviate the problem of low accuracy by averaging or adaptive merging of multiple trees. One of the recent examples is the multi-objective genetic programming-based ensemble of trees proposed in Nag and Pal (2016). The authors present an integrated algorithm for simultaneous selection of features and classification. However, when modeling is aimed at understanding basic environmental processes, such methods are not so useful because they generate more complex and less understandable models (Piltaver, Luštrek, Gams, & Martinčić-Ipšić, 2016). Nevertheless, important knowledge can still be drawn from ensemble methods, e.g. to identify reduced sets of relevant variables in a given microarray (Lazzarini & Bacardit, 2017).

A solution called Multi-Test Decision Tree (MTDT) (Czajkowski et al., 2014) can be placed between one-dimensional and oblique trees. It uses several one-dimensional tests in each node, which on the one hand increases the complexity of the model and on the other hand still allows for relatively easy interpretation of the decision rules. There were, however, a few other flaws and limitations of MTDT, which were addressed and removed with the proposed EMTTree solution, in particular:

- the lack of flexibility in the structure of the multi-test - the fixed size of multi-tests in all tree nodes;
- the limited search space - only a few highest-rated attributes were taken into account when building the multi-test (for performance reasons);
- the high number of crucial parameters to be defined ad hoc, including the size of the multi-test, the number of alternative multi-tests and the homogeneity of multi-tests;
- greedy top-down induction: meta-heuristic searches (Barros et al., 2012) could be expected to improve classification accuracy and reveal new patterns in the data.

In this way, the proposed EMTTree solution can self-adapt its structure to the currently analyzed data. The undoubted strength of our solution is the higher prediction accuracy and improved stability of the model. The minor weaknesses of EMTTree result from using an evolutionary approach, mainly the slow tree induction time and a number of input parameters that can be adjusted. However, gene expression datasets are still relatively small and, as we show in the experimental section, the number of parameters that need to be tuned is small.

2.3. Concept of multi-test

The general concept of the multi-test split, denoted mt, was introduced for the first time in the Multi-Test Decision Tree (MTDT) algorithm (Czajkowski et al., 2014), which induces a DT in a top-down manner. The main idea was to find a split in each non-terminal node that is composed of several univariate tests that branch out the tree in a similar way. The reason for adding further tests was that the use of a single univariate test based on a single attribute may cause the classifier to underfit the learning data due to the low complexity of the classification rule. Each multi-test consists of a set with at least one univariate test. One test in the set is marked as the primary test (pt), and all remaining tests are called surrogate tests (st). The role of surrogate tests is to support the division of training instances carried out by the primary test with the use of the remaining features. In order to determine the surrogate tests, we have adopted the solution proposed in the CART system (Breiman et al., 2017). Each surrogate test is constructed on a different attribute and mimics the primary test in terms of which and how many observations go to the corresponding branches. In the majority voting that determines the outcome of the multi-test, the individual weights of all tests are equal. In this way surrogate tests have a considerable impact (positive or negative) on multi-test decisions, as they can prevail over the primary test decision. It is also possible that a multi-test without the test with the highest gain ratio can be the most accurate split.
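The equal-weight majority voting described above can be sketched in a few lines of Python. Representing each univariate test as an (attribute, threshold) pair is an illustrative simplification, not the paper's data structure:

```python
def multi_test_route(instance, tests):
    """Route an instance by equal-weight majority voting of univariate tests.
    'tests' is a list of (attribute_index, threshold) pairs; the first pair
    plays the role of the primary test and the rest are surrogates, but
    every vote counts the same, so surrogates can outvote the primary."""
    votes = sum(1 if instance[attr] <= thr else -1 for attr, thr in tests)
    return "left" if votes > 0 else "right"

# Three-gene instance: the primary test (gene 0) alone would send it
# right, but the two surrogate tests outvote it.
instance = [5.0, 1.0, 9.0]
mt = [(0, 4.0), (1, 2.0), (2, 10.0)]
print(multi_test_route(instance, mt))      # "left"
print(multi_test_route(instance, mt[:1]))  # primary test alone -> "right"
```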

The experimental evaluation (Czajkowski et al., 2014) showed a significant improvement in classification accuracy and a reduction in underfitting compared to popular DT systems. Results from several real gene expression datasets suggest that the knowledge discovered by MTDT is supported by biological evidence in the literature and can be easily understood and interpreted.

Let us consider a binary classification problem in which a node contains instances from two classes (Class A and Class B) and the instances should be divided into two leaves according to a test. Fig. 1a illustrates the possible assignment of instances to leaves according to the test T performed on a single attribute a. The desired split should place the instances from Class A in the left leaf and the instances from Class B in the right leaf. Each cell represents a single instance with a defined class, and each row shows how instances are arranged in the leaves after performing the test. From Fig. 1a it is clear that a single test T1 on the attribute a1 has the highest goodness of split, because 13 out of 17 instances are classified correctly. In a typical system, this test should be selected as a split.
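The 13-out-of-17 goodness of split can be reproduced with a small sketch. The assignments below only recreate the counts from the example, not the actual contents of Fig. 1a:

```python
def split_accuracy(assignments):
    """Fraction of instances routed to the leaf of their own class.
    'assignments' is a list of (true_class, leaf) pairs; the desired
    split sends Class A to the left leaf and Class B to the right."""
    correct = sum(1 for cls, leaf in assignments
                  if (cls == "A" and leaf == "left")
                  or (cls == "B" and leaf == "right"))
    return correct / len(assignments)

# 17 instances: 13 routed to the leaf of their class, 4 misrouted.
assignments = ([("A", "left")] * 7 + [("B", "right")] * 6
               + [("A", "right")] * 2 + [("B", "left")] * 2)
print(split_accuracy(assignments))  # 13/17, about 0.765
```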
