
Multi-test decision tree and its application to microarray data classification

Marcin Czajkowski a,∗, Marek Grześ b, Marek Kretowski a

a Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, 15-351 Bialystok, Poland
b School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada

Article info

Article history: Received 24 June 2013; Received in revised form 11 January 2014; Accepted 30 January 2014

Keywords: Decision trees; Univariate tests; Underfitting; Gene expression data

Abstract

Objective: A desirable property of tools used to investigate biological data is that their models and predictive decisions are easy to understand. Decision trees are particularly promising in this regard due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.

Methods: We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.

Results: Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on 14 datasets by an average of 6%. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature.

Conclusion: This paper introduces a new type of decision tree which is more suitable for solving biological problems. MTDTs are relatively easy to analyze and much more powerful in modeling high-dimensional microarray data than their popular counterparts.

1. Introduction

Decision trees [1,2] are one of the most popular classification techniques in data mining and machine learning. Due to their comprehensible nature, they are particularly useful when the aim of modeling is to understand the underlying processes of the environment. Decision trees are also useful when the data do not satisfy the rigorous assumptions required by more traditional methods [3]. Tree-based classifiers can be successfully applied to solving biological problems [4–6]. Popular techniques for microarray data involve decision tree ensembles like random forests [7] and boosted decision trees [8]. However, existing attempts to apply decision trees to classification using gene expression data showed that single-tree algorithms are not sufficient for inducing competitive classifiers [9,10].

In this paper, we tackle the problem of improving the performance of decision trees on gene expression data, with the constraint of preserving the simplicity of decision trees. Standard techniques for improving the performance of classification algorithms, e.g., ensemble methods, do not satisfy this constraint when applied to decision trees because the resulting classifiers become complex and almost impossible to understand [11,12]. We propose a multi-test approach to decision trees in which several univariate tests can be used to create a single splitting rule in every non-terminal node of the classification tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.

1.1. Gene expression data analysis

Cells represent the basic organizational units of all living organisms. Each cell contains instructions for the creation of proteins and the regulation of processes in a living body. This collection of instructions is contained in the DNA. Each protein has a corresponding gene, which can be seen as a recipe for how to create a given protein. If the gene is expressed, a corresponding protein will be produced [13]. A significant step in genomic research was the ability to monitor the expression level of genes in living cells. Specifically,


cDNA microarrays and high-density oligonucleotide chips allow the expression level of thousands of genes to be monitored simultaneously [14]. The outcome of these diagnostic tests is known as gene expression (or microarray) data.

Microarray data allow for numerous analyses of living organisms. The application of a mathematical apparatus and computational tools is indispensable here, since gene expression observations are represented by high-dimensional feature vectors. The important questions are what kind of outcomes can be expected and what kind of questions can be answered using these tools. The answer comes from two fundamental approaches to mathematical modeling, which are equally important in the case of gene expression data. Scientific modeling attempts to understand the true model that is behind the data generated according to that model. In the case of gene expression data, it is concerned with problems of causal relationships between, for example, genes, or genes and proteins. Technological modeling has different aims. Here, the purpose is to build a model from past data that would be good at predicting future data, regardless of whether the model is close to reality or not [15]. Discriminant analysis is an example of this kind of modeling in a general sense. It has also been widely used in post-genome cancer research studies [16,17].

Gene expression data poses many research challenges, and these are not limited to research areas that are concerned with living organisms. This kind of data is also extremely challenging for computational tools and mathematical modeling [18]. Each observation is described by a high-dimensional feature vector with a number of features that reaches into the thousands, but the number of observations is rarely higher than 100. Therefore, this kind of data requires new computational tools to extract significant and meaningful rules, and some feature selection should be taken into account. Providing a group of the most relevant genes may significantly improve classification performance [19].

1.2. Decision trees

Decision trees (also known as classification trees) represent one of the main techniques for discriminant analysis in data mining and knowledge discovery. They predict the class membership (dependent variable) of an instance using its measurements of predictor variables.

The most popular algorithms for decision tree induction are based on top-down greedy search [20]. First, the test attribute (and the threshold in the case of continuous attributes) is decided for the root node. Instances are passed through the tree from the root node to a leaf node, which provides the classification of a given instance. At each non-terminal node through which the instance passes, one (or more) attribute of the instance is tested and the instance is moved down to the branch that corresponds to an outcome of the test. The process is recursively repeated for each branch. When to stop partitioning and create a leaf node is still one of the major problems in the area.

Classification trees have many advantages that make them applicable in various scenarios, particularly when the data does not satisfy the rigorous assumptions required by more traditional methods. In this paper, the following facts are significant:

• learning of decision trees is fast, even with huge datasets, due to greedy search;

• classification is very fast, flexible, and allows for straightforward approaches to the problem of missing values;

• decision trees are easy to understand and analyze, as they reflect a hierarchical way of human decision making. They are thus the opposite of the 'black-box' approaches where model parameters are not understandable.

This introduction applies to cases in which tests in internal nodes of trees are based on one attribute. There are also algorithms which apply multivariate tests [21,22], based mostly on linear combination splits. Decision trees that allow the testing of multiple features at a node are potentially smaller than those limited to single univariate splits. Additionally, when only one attribute is tested at each node, this may cause replication of specific subtrees in the decision tree [23]. In effect, some features may be tested more than once in the decision tree. However, trees with simple tests are still desirable because experts can understand them. This fact is explicitly emphasized in the related literature. Brodley and Utgoff [24] say: "A small tree with simple tests is most appealing because a human can understand it. There is a tradeoff to consider in allowing multivariate tests: using only univariate tests may result in large trees that are difficult to understand, whereas the addition of multivariate tests may result in trees with fewer nodes, but the test nodes may be more difficult to understand". Our focus is therefore on univariate trees, since they are a 'white-box' technique, which makes them particularly interesting for scientific modeling. It is easy to find explanations for the decisions of univariate classification trees.

1.3. Background and motivation

As stated in the previous section, decision trees with univariate splits are convenient. They are much easier to understand than trees with multivariate splits, and they are much easier to learn from the data. However, traditional algorithms, for example, C4.5 [25] or CART [26], fail to produce decision trees with high classification accuracy on gene expression data. Our previous work with various univariate decision tree algorithms showed that these algorithms produce considerably small trees that perfectly classify the training data but fail to classify unseen instances [10]. Only a small number of attributes is used in such trees, and their model complexity is low (high bias). Therefore, they underfit the training data [2]. Producing bigger trees using standard algorithms such as C4.5 does not solve the problem in the case of gene expression data because small trees often classify the training data perfectly [10]. This indicates that the issue of split complexity should be addressed here, since not much can be gained from bigger univariate decision trees with this kind of data. This line of research is pursued in our paper.

We are motivated by the fact that univariate decision tree induction represents a white-box approach, and improvements of such algorithms have considerable potential for genomic research and scientific modeling of underlying processes. Thus, our goal is to improve the classification accuracy of decision trees and enable more informative analysis of microarray data in a way that keeps the trees easy to understand. Decision trees with multivariate splits or bagging/boosting methods often outperform existing univariate algorithms on gene expression data [9,27,28]. However, those approaches generate complex rules that, from a medical point of view, are more difficult to understand and analyze. Our goal is to increase the complexity of univariate decision trees only to the extent that keeps them easy to understand and makes them more competitive in terms of classification accuracy. We believe that the use of individual univariate splits may cause the classifier to underfit the learning data, since it leads to trees that are not robust enough and do not take information about the other most relevant attributes into account. Our novel technique uses several univariate tests in each internal node to avoid these problems. As multi-test nodes are based on univariate tests, trees learned with this approach will be much easier to analyze than trees with classical multivariate splits.

In this paragraph, we attempt to justify why our approach is suitable for gene expression data and why it may lead to high classification accuracy. Gene expression data is characterized by a very high ratio of features to observations, which poses serious


problems for standard univariate splits. The learning algorithm can easily find a test that separates the training data very well at a given level in the tree, but this split can correspond to noise only. This situation is even more likely at intermediate and lower levels of the tree. For example, assuming that at a given level of the tree there are 20 observations (10 from class A and 10 from class B) and 2 × 10^5 features, the number of possible partitions of this training set (the number of combinations of choosing 10 out of 20 instances) is smaller (the exact number is 184,756) than the number of available features. This makes it likely to find a test, i.e., an attribute and its corresponding threshold, which can split this data perfectly. When there are only 10 observations in the node (with an even class distribution), the number of possible splits is only 252, but the number of attributes is 3 orders of magnitude higher. When the split contains only one univariate test, there is a very high risk of choosing tests that correspond to noise. Thus, our approach is to have more univariate tests in each internal node and to base splitting decisions on a larger number of univariate tests, not necessarily those tests that yield the highest value of the gain ratio [25] or the Gini index [26].
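As a quick sanity check of this counting argument, both quantities can be reproduced directly (a minimal Python sketch; the instance counts are the ones quoted above):

```python
from math import comb

# 20 observations, 10 per class: the number of ways to choose which
# 10 instances fall on one side of a binary partition is C(20, 10).
print(comb(20, 10))  # 184756 -- fewer than the 2 x 10^5 available features

# With only 10 observations (even class distribution), the count of
# possible even partitions drops to C(10, 5).
print(comb(10, 5))   # 252 -- three orders of magnitude below the feature count
```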

1.4. Related work

This paper addresses the issue of test complexity in decision trees. A standard approach in the case of discrete attributes is to associate a branch with each categorical attribute value. Another possibility is to group some attribute values in order to reduce the branching factor. When all values are grouped into two clusters, a binary tree is obtained (e.g., in CART [26]). In the case of continuous attributes, binary splits are used. Here, the standard split compares the value of the attribute with a threshold, and the outcome of such a comparison is binary. Thus, a straightforward way to reduce the tree complexity (in terms of the number of nodes) is to use multiple thresholds in each split on a numerical attribute. This will potentially increase the branching factor of such splits; however, such tests will be more expressive, and the overall number of nodes in the corresponding decision tree will be smaller. This approach was explored by Berzal et al. [29], who proposed multi-way decision trees using multi-way splits.

In [29], a hierarchical clustering of attribute values is combined with the standard greedy decision tree algorithm. Initially, each separate attribute value is treated as an individual interval, and the two most similar adjacent intervals are merged in each step. This process can be repeated until only two intervals are left; this would lead to a binary decision tree. However, the clustering process can be stopped before that. Each time two adjacent intervals are merged, the impurity measure associated with the decision tree is checked. The current interval set is determined according to the highest measure of impurity. This technique is similar to the splitting criterion used to evaluate alternative splits, like the C4.5 gain ratio or the Gini index of diversity. The Berzal approach was not evaluated on gene expression data, and, due to the nature of single-attribute multi-way tests, it would not be sufficient to overcome the high ratio of features to observations in this kind of data.

The specific character of gene expression data and its influence on the process of building decision trees was investigated by Li et al. [30]. Their solution focused on using committees of trees to aggregate the discriminating power of a bigger number of significant rules and to make more reliable predictions. First, all features are ranked according to the gain ratio. Then, the first tree, using the first top-ranked feature in the root node, is built. Next, the second tree, using the second top-ranked feature in the root node, is built, and the process continues until the kth tree, using the kth top-ranked feature, is obtained. The classification of the final committee of k decision trees is governed by weighted voting. It was observed that:

• significant rules often contain features that are globally low-ranked;

• if the construction of a tree is confined to a set of globally top-ranked features, the rules in the resulting tree may be less accurate than rules derived from the whole feature space;

• alternative trees often outperform or compete with the performance of the greedy tree.

This work also supports our decision to use many univariate tests in our multi-test decision tree induction algorithm. In particular, our aim is to make use of features that are globally low-ranked and use them jointly in multi-tests. However, our aim is also to preserve the simplicity of the final decision trees, which is not the case in [30].

Our previous work [10], in which standard decision trees were evaluated on gene expression data, led us to the conclusion that the high ratio of variables to cases may cause the learning algorithm to be misled by randomly occurring dependencies in the training data. This may be disastrous for the learned trees due to the hierarchical nature of the classification process of decision trees. The experiments performed revealed that the size of decision trees built with traditional classification methods, such as C4.5, is relatively small, and such trees do not capture all of the structure available in the data and are additionally misled by noise.

The rest of the paper is organized as follows. In Section 2, we introduce a novel representation for decision trees. Then, our algorithm that learns decision trees in the new representation is presented in Section 3. In Section 4, the proposed approach is experimentally evaluated on real gene expression data. The paper is concluded in the last section, where future work is also discussed.

2. Multi-test decision trees

This paper introduces multi-test decision trees (MTDT) – a new, richer language to represent a decision tree. The overall structure of a multi-test tree does not differ from a standard decision tree, e.g., C4.5 [25]. In a multi-test tree, every split in a non-terminal node is composed of a set of univariate tests and is called a multi-test split. These elementary tests are univariate and are combined in a way that makes our approach substantially different from typical multivariate (e.g., oblique) splits.

During classification, the MTDT splitting criterion is directed by the majority voting mechanism, where all univariate test components have the same importance.

Fig. 1 illustrates a multi-test with three individual attribute tests, {(f1 ≤ 2), (f2 ≤ 5), (f3 ≤ 8)}, that splits the data in the node into two subsets: Class A and Class B. In this particular example, as a result of the majority voting rule, at least 2 out of 3 univariate tests

Fig. 1. An example of a multi-test split which contains a set of univariate tests.


Fig. 2. Graphical representation of a multi-test split that contains 3 single-attribute tests: {(f1 ≤ 2), (f2 ≤ 5), (f3 ≤ 8)}.

determine the decision of the actual multi-test split at the node. The graphical representation of the multi-test example is shown in Fig. 2. Each test that uses feature fi can split the instance space, but only with a boundary that is orthogonal to the fi axis. In our example, if f1 < 2 and f2 < 5, then regardless of the decision of f3, the decision is Class A (light gray region). If f1 > 2 and f2 > 5, then regardless of f3, the decision is Class B (dark gray region). If, on the other hand, f1 and f2 yield a contradiction, the final decision is determined by f3, where f3 > 8 leads to Class B and f3 < 8 to Class A. Certainly, univariate tests can be evaluated in any order. The fact that only univariate tests are used in multi-test splits ensures that MTDT can be treated as a univariate decision tree despite more than one test being used in each split.
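To make the voting mechanism concrete, the sketch below evaluates this example multi-test on single instances. It is a minimal Python illustration under our own conventions (a test is a (feature index, threshold) pair; the function name is ours), not the authors' implementation:

```python
# The example multi-test {(f1 <= 2), (f2 <= 5), (f3 <= 8)}; the first
# entry plays the role of the primary test.
multi_test = [(0, 2.0), (1, 5.0), (2, 8.0)]

def multi_test_decision(instance, tests):
    """Majority vote of the univariate tests; every test has an equal vote.
    A draw is resolved by the primary test tests[0] (see Section 3.2)."""
    votes = sum(1 if instance[f] <= thr else -1 for f, thr in tests)
    if votes == 0:  # possible only for an even number of tests
        f, thr = tests[0]
        return instance[f] <= thr
    return votes > 0

print(multi_test_decision([1.0, 4.0, 9.0], multi_test))  # True: f1, f2 outvote f3
print(multi_test_decision([3.0, 6.0, 7.0], multi_test))  # False: f1, f2 outvote f3
```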

3. Learning multi-test decision trees

The previous section introduced the idea of multi-test decision trees. In principle, learning such trees can take various approaches. In this section, we propose one particular method for learning trees that is based on greedy construction of multi-test splits. In what follows, it is assumed that decision tree learning uses top-down induction, where, at every level of recursion, the top-down algorithm constructs multi-test splits to be used in non-terminal nodes of the decision tree. The procedure to construct those multi-test splits is the core of this section and our contribution. It should be noted that multi-tests could also be used with other types of decision tree learning methods, i.e., algorithms that are not top-down. The concept of top-down induction was introduced in Section 1.2.

3.1. Building multi-test splits

Top-down decision tree learning algorithms have to choose a split (or terminate recursion and create a terminal node with a decision) at every level of recursion, given a subset X of training instances. For this reason, our procedure in Algorithm 1 takes X, the current set of instances, and returns the best multi-test for splitting instances in X. Note that the cardinality of X is non-increasing with every recursive call of the top-down procedure. An additional parameter, W, determines the number of alternative multi-tests that are considered by the procedure before the best multi-test is returned. Specifically, our procedure constructs a set MT = {mt1, mt2, ..., mtW} of W alternative multi-tests and returns the best one according to the gain ratio criterion (Line 9 of Algorithm 1).

Fig. 3. An example search process which determines the best multi-test for a non-terminal node of a multi-test decision tree (MTDT) from the set of potential multi-tests (MT).

Algorithm 1. Multi-test construction.

CreateMultiTest(X, W, N)
1: V ← create all candidate thresholds using X
2: best_primary = argmax_{v ∈ V} GainRatio(v, X)
3: mt1 = BuildMulti-test(best_primary, V, X, N)
4: for i ∈ {2, ..., W} do
5:   MT = {mt1, ..., mt_{i−1}}
6:   next_primary = NextPrimary(V, MT, X)
7:   mt_i = BuildMulti-test(next_primary, V, X, N)
8: end for
9: return argmax_{mt_i} GainRatio(mt_i, X)

The first step of the algorithm determines the set of univariate tests for further evaluation. We only consider the relevant thresholds, called the candidate thresholds [31], which split instances from different classes. In existing algorithms with univariate tests, once the set of possible thresholds is computed (univariate tests correspond to thresholds when continuous features are present), the best threshold is selected according to some priority measure (e.g., the gain ratio criterion), and the univariate test with the highest evaluation is returned. This standard procedure would return the univariate test computed in Line 2 of the algorithm. Our algorithm does additional computation in order to build splits with multiple univariate tests.
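For reference, this threshold-selection step (Lines 1–2 of Algorithm 1) can be written out in a few lines. The sketch below is a self-contained Python illustration of candidate thresholds and the gain ratio criterion; the function names and toy data are ours:

```python
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values of different classes,
    i.e., the relevant thresholds in the sense of [31]."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1] and pairs[i][0] != pairs[i + 1][0]]

def gain_ratio(values, labels, thr):
    """Gain ratio [25] of the binary univariate test: value <= thr."""
    left = [l for v, l in zip(values, labels) if v <= thr]
    right = [l for v, l in zip(values, labels) if v > thr]
    n, nl = len(labels), len(left)
    gain = entropy(labels) - (nl / n) * entropy(left) - ((n - nl) / n) * entropy(right)
    split_info = entropy(['L'] * nl + ['R'] * (n - nl))  # intrinsic split information
    return gain / split_info if split_info > 0 else 0.0

# Toy example: one feature, five instances, two classes.
vals, labs = [0.5, 1.0, 2.5, 3.0, 4.5], ['A', 'A', 'B', 'B', 'B']
best = max(candidate_thresholds(vals, labs), key=lambda t: gain_ratio(vals, labs, t))
print(best)  # 1.75 -- the midpoint that separates the classes perfectly
```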

Each ith multi-test is composed of no more than N univariate tests; the first one is called a primary test (mt_{i,1}), and all remaining N−1 tests are called surrogate tests mt_{i,j}, where 1 < j ≤ N. The parameter denoted as N represents the maximum number of univariate tests that constitute the multi-test. Fig. 3 illustrates the tests that are considered in every execution of Algorithm 1.

The set {mt1, ..., mtW} is constructed as follows. First, mt1 is constructed in Line 3 using the primary univariate test found in Line 2. The actual multi-test is built in the BuildMulti-test function, which identifies which candidate tests should be used as surrogate tests of the primary test that is provided as a parameter. This step is explained in detail in Section 3.1.1. mt1 is a special multi-test because its primary test is the best univariate test according to the priority measure. Primary tests for the remaining multi-tests have to be selected in a way that diversifies the created multi-tests. This process takes place in the NextPrimary function, which is executed in Line 6 and described in Section 3.1.2. Once the primary test for a new multi-test is identified, the BuildMulti-test procedure can be used again (Line 7).

Because of the majority voting mechanism applied during classification, surrogate tests have a considerable impact on multi-test decisions because they can outvote the primary test. It should be noted that this impact can be positive or negative, and it affects the gain ratio of the entire multi-test. Therefore, it is possible that the best multi-test returned by Algorithm 1 may not contain the original univariate test with the highest gain ratio (mt_{1,1}).


This can happen when the voting components of competitive multi-tests mt_i (1 < i ≤ W), taken as a whole, have a higher gain ratio than mt1, despite the fact that mt_{1,1} is the univariate test with the highest gain ratio. This fact justifies our decision to use multi-test decision trees, since they can provide better classification that is more robust against noise.

3.1.1. Multi-test construction

When the function BuildMulti-test is executed, the primary test provided in the first parameter constitutes the first univariate test that will be included in the multi-test, and the goal of this function is to add N−1 surrogate tests. The reason for adding more tests is that applying a single primary test based on one attribute may cause the classifier to underfit the learning data due to the low complexity of the classification rule. Surrogate tests should support the division of the training instances made by the primary test. In other words, the remaining tests (the surrogate tests) of the multi-test should, using the remaining features, branch the tree in a similar way to their primary test.

In order to determine surrogate tests, we have adopted a solution proposed in the CART system [26]. The use of a surrogate variable at a given split results in a similar node impurity measure. It also mimics the chosen split in terms of which and how many observations go to the corresponding branches. Therefore, the measure of similarity between the primary test and the surrogates of the multi-test is given by the number of observations classified in the same way. The parameter b equals the percentage of decisions made by surrogate tests that differ from the primary splitter. The parameter is described in more detail in the next section. In our method, we also consider tests that classify instances in an inverse (opposite) way to their primary test (a high value of the parameter b). For such tests, we reverse the relation between the attribute and the interval midpoint, and recalculate the score. However, this only works if we have a binary classification problem.
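The agreement computation behind this selection can be sketched in a few lines. This is our own minimal rendering of the CART-style surrogate criterion described above, with illustrative names, not the authors' code:

```python
def disagreement(primary_outcomes, candidate_outcomes):
    """Fraction b of training instances that a candidate surrogate routes
    differently from the primary splitter."""
    diffs = sum(p != c for p, c in zip(primary_outcomes, candidate_outcomes))
    return diffs / len(primary_outcomes)

def accept_surrogate(primary_outcomes, candidate_outcomes, b=0.10, binary_problem=True):
    """Accept a candidate if it mimics the primary split within b (default 10%).
    For binary problems, a test splitting in the exactly opposite way is also
    useful once its relation to the threshold is reversed."""
    d = disagreement(primary_outcomes, candidate_outcomes)
    if d <= b:
        return True, False          # accept as-is
    if binary_problem and (1.0 - d) <= b:
        return True, True           # accept with the inverted test
    return False, False

# Example: the candidate disagrees on 1 of 10 instances -> accepted as-is.
primary   = [True, True, False, False, True, False, True, True, False, False]
candidate = [True, True, False, True,  True, False, True, True, False, False]
print(accept_surrogate(primary, candidate))  # (True, False)
```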

3.1.2. Identifying additional primary tests

The NextPrimary function searches for a threshold that will be applied in BuildMulti-test, which is executed to build multi-test mt_i for k < i ≤ W after the first k ≥ 1 multi-tests are constructed.

Two factors should be taken into consideration when choosing the primary test for mt_i. First, new primary tests should be competitors to all existing primary tests. The competitor tests yield a high gain ratio but are not as good as, e.g., the primary test mt_{1,1} used to construct mt1. A significant difference between these tests and surrogate tests is the way in which they are ranked. As shown in the previous subsection, surrogate tests are not evaluated by how much improvement they yield in reducing node impurity but rather by how closely they mimic the split determined by their primary test. The competitor tests, on the other hand, are ranked according to the highest gain ratio. We denote tests as competitor tests if their gain ratio is among the top q highest gain ratio values, where primary tests used in mt_j for j < i are not considered (the default value of q is 10). The experiments described in Section 4.2.3 show that using more competitors (a high q value) leads to the selection of tests with a low gain ratio; this decreases the power of alternative multi-tests. However, decreasing the number of competitor tests (low q) may cause new primary tests to be too similar to those already selected.

The second factor that should be considered is that the same attribute is often listed both as a competitor, i.e., as one of the primary tests, and as a surrogate. This may lead to alternative multi-tests, mt_i, that contain similar or identical univariate tests and do not provide any comparable improvement. Therefore, competitor tests should be diversified in order to diversify the alternative multi-tests. For that reason, the function NextPrimary receives the list of all multi-tests that were created before mt_i. The diversification problem is then solved by requiring that every new primary split must be a competitor for mt_{1,1}, i.e., for the primary test of the first multi-test (determined by the q value introduced in the previous paragraph), and it must also be the worst average surrogate (have the highest average value of parameter b) for all primary tests mt_{k,1} where k < i.

To sum up: the surrogate tests are similar to the primary test; the competitor tests are those that have the highest gain ratio and are different from all previously selected primary tests.
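Under the description above, the competitor-based choice of the next primary test can be condensed as follows. The sketch is illustrative (the names and dictionary-based inputs are ours; gain ratios and pairwise b values are assumed precomputed):

```python
def next_primary(gain_ratios, b_matrix, used, q=10):
    """Pick the next primary test id.

    gain_ratios -- {test_id: gain ratio} for all candidate univariate tests
    b_matrix    -- b_matrix[t][p]: disagreement b of test t w.r.t. primary p
    used        -- ids of primary tests already used by earlier multi-tests
    q           -- size of the competitor pool (default 10, as above)
    """
    # Competitors: the top-q tests by gain ratio, excluding used primaries.
    competitors = sorted((t for t in gain_ratios if t not in used),
                         key=gain_ratios.get, reverse=True)[:q]
    # Diversification: take the competitor that is the worst average
    # surrogate (highest mean b) for all previously chosen primaries.
    return max(competitors,
               key=lambda t: sum(b_matrix[t][p] for p in used) / len(used))

# Toy example: 'g3' wins -- a competitive gain ratio, most different split.
gr = {'g1': 0.90, 'g2': 0.85, 'g3': 0.80}
b  = {'g2': {'g1': 0.05}, 'g3': {'g1': 0.40}}
print(next_primary(gr, b, used={'g1'}))  # g3
```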

3.2. Multi-test size and prediction

The size of the multi-test, i.e., the maximum number of single tests that make up every multi-test, has a strong impact on its performance and the splitting decision. The parameter denoted as N represents the maximum number of univariate tests in a multi-test and is defined by the user. To classify observations, the majority voting mechanism is employed, in which each test has an equal vote. In the case of a draw, the decision is made in accordance with the primary test.

The exact size of the multi-test depends on the difference between the primary and surrogate tests. The main idea of the MTDT is to use a group of similar tests in a single node instead of one test, as seen in the classical approach to univariate decision trees. To avoid discrepancies in the multi-test, surrogate tests should not be added to tests that do not have a proper substitute. An inappropriate set of surrogates may dominate the primary test and deteriorate the splitting criterion. Therefore, surrogate tests added to the multi-test should return no more than b% of decisions (default 10%) that differ from the primary test. Using b = 0% means that surrogate tests can only be added to the multi-test if they split observations in exactly the same way as the corresponding primary splitter. In practice, setting b to 0% rejects almost all surrogates; therefore, it is equivalent to setting the size of the multi-test, N, to 1. In this event, the decision tree would become similar to the tree generated by the C4.5 algorithm because only one attribute would be used in each multi-test. If the value of b is high, then all N−1 surrogates join the multi-test.

4. Experimental results

In this section, the proposed solution is experimentally verified using more than a dozen real microarray datasets. The results of the MTDT algorithm were compared with several popular decision-tree-based systems.

4.1. Setup

The performance of the MTDT classifier was investigated using publicly available microarray datasets described in Table 1. These datasets are from the Kent Ridge Bio-medical Dataset Repository [32] and are related to studies of human cancer, including leukemia, colon tumor, prostate cancer, lung cancer, breast cancer and lymphoma. For datasets that were not pre-divided into training and testing parts, 10-fold stratified cross-validation was applied. By stratified cross-validation, we mean that each fold contains roughly the same proportion of instances with the same class labels. Leave-one-out cross-validation was also considered; however, no significant difference in results was observed with this type of cross-validation. The average score of 10 runs is presented for cross-validated data.
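For reference, this protocol (repeated stratified 10-fold cross-validation) can be reproduced in a few lines of Python. Using scikit-learn here is our assumption, with a standard decision tree standing in for MTDT, since the paper does not publish its cross-validation code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for MTDT

def stratified_cv_accuracy(X, y, clf, n_splits=10, n_repeats=10, seed=0):
    """Mean accuracy over n_repeats runs of stratified n_splits-fold CV;
    each fold keeps roughly the original class proportions."""
    scores = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Usage: stratified_cv_accuracy(X, y, DecisionTreeClassifier())
```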

The classification process for all algorithms was preceded by feature selection using the Relief-F [33] method, which is common in microarray data analysis. In the first step, Relief-F draws instances at random and computes their nearest neighbors


Table 1
Kent Ridge bio-medical gene expression datasets used in experiments.

Dataset                          Abbreviation  Attributes  Classes  Training set     Testing set
Breast Cancer                    BC            24,481      2        34/44            12/7
Central Nervous System           CNS           7129        2        21/39            –
Colon tumor                      CT            6500        2        40/22            –
DLBCL Stanford                   DS            4026        2        24/23            –
DLBCL vs. Follicular Lymphoma    DF            6817        2        58/19            –
DLBCL NIH                        DNH           7399        2        88/72            30/50
Leukemia ALL vs. AML             AML           7129        2        27/11            20/14
Leukemia MLL vs. ALL vs. AML     MLL           12,583      3        20/17/20         4/3/8
Lung Cancer Dana-Farber          LCD           12,600      5        139/21/20/6/17   –
Lung Cancer Brigham              LCB           12,533      2        16/16            15/134
Lung Cancer Univ. of Michigan    LCU           7129        2        86/10            –
Lung Cancer – Toronto, Ontario   LCT           2880        2        24/15            –
Ovarian Cancer NCI PBSII         OC            15,154      2        91/162           –
Prostate Cancer                  PC            12,600      2        52/50            27/8

(default 10). Then, Relief-F adjusts a feature weighting vector to give higher weight to attributes that discriminate the instances from neighbors of different classes. The main benefits of using feature selection are shorter training times, improved model interpretability, and enhanced generalization by reducing overfitting.
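A simplified sketch of this weighting idea is given below; it is our own compact approximation of Relief-F (features assumed scaled to [0, 1], Manhattan distance, no per-class weighting of misses), not the exact algorithm of [33]:

```python
import numpy as np

def relief_f_weights(X, y, n_samples=100, k=10, rng=None):
    """Approximate Relief-F: reward features that differ between an
    instance and its k nearest misses, penalize features that differ
    between the instance and its k nearest hits."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for i in rng.choice(n, size=min(n_samples, n), replace=False):
        dist = np.abs(X - X[i]).sum(axis=1)   # distance to every instance
        dist[i] = np.inf                      # exclude the instance itself
        order = np.argsort(dist)
        hits = [j for j in order if y[j] == y[i]][:k]
        misses = [j for j in order if y[j] != y[i]][:k]
        if hits:
            w -= np.abs(X[hits] - X[i]).mean(axis=0)    # same class: penalize
        if misses:
            w += np.abs(X[misses] - X[i]).mean(axis=0)  # other class: reward
    return w  # keep the features with the largest weights
```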

However, as we mentioned in previous sections, with univariate decision trees applied to microarray data one faces the problem of underfitting the learning data (overfitting is not significant). Hence, there is no need to improve the model interpretability, because the model is already simple; instead, it is useful to retain a larger number of features and use less aggressive feature selection. We tested different numbers of top-ranked attributes/features: 50, 100, 200, 1000, 2000, and also considered no feature selection at all. Reducing the number of attributes down to 200 had no significant influence on the test-set accuracy of the compared classifiers; however, it speeds up the training of all algorithms. Our multi-test algorithm works well without feature selection and with larger numbers of features (200 and over). When the number of top selected attributes is small, MTDT loses its ability to find lower-ranked features (as they were excluded from the data), and its performance is similar to the rest of the tested decision trees. Therefore, the number of selected attributes was arbitrarily limited to the top 1000 to allow MTDT to find low-ranked features.

A statistical analysis of all obtained results was performed using the Friedman test and the corresponding Dunn's multiple comparison test (significance level equal to 0.05), as recommended by Demsar [34].
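The Friedman test itself is available in SciPy; the snippet below shows the shape of such an analysis with placeholder accuracies (not the paper's numbers). A post-hoc test such as Dunn's would then be applied to the per-dataset ranks, e.g., via the scikit-posthocs package:

```python
from scipy.stats import friedmanchisquare

# One accuracy list per classifier, aligned by dataset (placeholder values).
acc_mtdt = [68.4, 72.0, 86.0, 92.1]
acc_c45  = [52.6, 56.7, 80.0, 88.2]
acc_rf   = [68.4, 75.0, 86.7, 92.1]

stat, p_value = friedmanchisquare(acc_mtdt, acc_c45, acc_rf)
print(p_value)  # reject equal performance across classifiers when p < 0.05
```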

4.2. Multi-test decision tree results

4.2.1. Multi-test size

The influence of the multi-test size on the performance of our method was experimentally verified on real gene expression data. Classification algorithms applied to these kinds of data are more likely to underfit because of the small ratio of the number of observations to the number of attributes. The performance of the MTDT classifier was studied with six different values of the parameter N, which stands for the maximum number of univariate tests in the multi-test. It is worth emphasizing that MTDT with a single one-attribute test in a node, N = 1, behaves similarly to the standard C4.5 algorithm. Both algorithms use the gain ratio criterion and pessimistic pruning. There is, however, a slight difference in calculating the exact threshold value; this is described in Section 3.1.

In Table 2, we compare the influence of the multi-test size on accuracy. In all experiments, 1000 attributes were considered, and the algorithm's parameters had their default values of W = 3 and b = 10%. These results revealed that the number of univariate tests used in a single multi-test has a significant impact on the classifier accuracy. According to the Friedman test, there is a statistically significant difference (p-value of 0.0003) in the accuracy of all versions. Based on Dunn's multiple comparison test, there is a statistically significant difference in classification quality between the number of tests in the multi-test, N, equal to 1, and 7, 9, and 11.

Experimental validation performed on 14 datasets showed that the average accuracy of the multi-test algorithms increased by over 3% when N = 3, and by over 6% when N = 7, compared to the base MTDT with N = 1. On only one dataset (BC) was the result of the multi-test algorithm lower than expected, although the overall improvement is noticeable. The reason why the results for BC were better for N = 1 lies in the number of attributes that distinguish the classes. For this dataset, only a few genes are considered markers; therefore, a higher number of surrogates could decrease the MTDT accuracy when the tree is overfit.

Considering the results, we conjecture that underfitting is the main cause of the lower classification accuracy of the MTDT approach with N = 1. Decision trees obtained by the standard algorithm (a single univariate test in a node) are not complex enough. It was also observed that using too many genes in the multi-test may induce more complex rules and overfit the learned trees to the training data.

In order to detect and mitigate the possibility of overfitting in the training phase of our method, we created artificial datasets that were copied from those listed in Table 1; attributes were left exactly the same, but class labels were randomly changed. This technique

Table 2
A comparison of the multi-test decision tree (MTDT) accuracy under different numbers of tests (N) in the multi-test. Dataset abbreviations are used (Table 1). The highest classifier accuracy for each dataset is bolded.

Dataset   N=1     N=3     N=5     N=7     N=9     N=11
BC        68.42   63.15   57.89   52.63   57.89   57.89
CNS       60.50   71.33   72.17   72.00   72.17   74.33
CT        80.40   83.14   85.83   85.97   85.83   83.92
DS        81.75   85.00   85.25   85.55   85.05   86.60
DF        84.82   82.07   83.42   85.01   85.57   85.42
NIH       51.25   60.00   60.00   62.50   63.75   62.50
AML       91.18   85.29   91.18   91.18   91.18   88.23
MLL       86.67   73.33   100.00  100.00  93.33   100.00
LCD       89.41   90.98   91.60   92.12   91.15   90.96
LCB       88.59   95.97   96.64   97.98   98.66   98.66
LCU       97.48   98.04   98.32   98.93   99.78   100.00
LCT       61.42   61.66   63.67   66.83   65.67   62.16
OC        97.04   98.69   98.02   98.34   98.34   98.18
PC        26.47   58.82   61.76   61.76   47.06   44.11
Average   76.10   79.11   81.83   82.20   81.20   80.93


Fig. 4. The influence of the similarity measure b on the classification accuracy of the multi-test decision tree (MTDT) algorithm.

is usually referred to as the Y-randomization test [35]. The MTDT classification accuracy was significantly lower on the randomized data than on the original data (which is desirable in this situation); this indicates that there is no evidence of overfitting in our method.
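A minimal sketch of this check is given below (our own illustration; `evaluate` could be, e.g., the stratified_cv_accuracy function sketched in Section 4.1):

```python
import numpy as np

def y_randomization_gap(X, y, clf, evaluate, n_rounds=10, rng=None):
    """Y-randomization [35]: retrain on identical attributes with permuted
    class labels. A large drop in accuracy versus the original labels
    suggests the model is not merely fitting noise."""
    rng = rng or np.random.default_rng(0)
    true_acc = evaluate(X, y, clf)
    perm_accs = [evaluate(X, rng.permutation(y), clf) for _ in range(n_rounds)]
    return true_acc, float(np.mean(perm_accs))
```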

4.2.2. Surrogate tests

In Section 3.2, it was explained that surrogate tests should not differ by more than b% from primary tests. The experiments performed suggest that surrogate tests added to the multi-test should not differ from the primary test by more than 10%. We consider this the default value for all datasets; however, adequate tuning of this parameter may improve classification accuracy. Fig. 4 presents the influence of the similarity parameter, b, on the performance of the MTDT classifier.

In this figure, b = 0% means that only surrogates that have the same gain ratio as primary tests are accepted (it is almost equivalent to setting N = 1), and a high value of b (in the figure, above 15%) means that all N−1 surrogates join the multi-test. Although the general average over all 14 datasets has the highest accuracy when b = 10%, we may observe that the optimal value of this parameter differs for specific datasets. The score of the MTDT algorithm on the leukemia (MLL) and prostate cancer (PC) datasets increases significantly when there is no restriction on choosing surrogate tests. However, on other datasets, when the surrogate tests are more

Fig. 5. The influence of the similarity measure b on the decision split in the tree node.

Fig. 6. The influence of the number of competitor tests q on the application of multi-tests as a splitting criterion in the tree node.

similar to the primary test, the results are slightly better. An additional comparison of the MTDT performance with a simple baseline based on random selection showed a significant difference in prediction accuracy in favor of the proposed solution.

Fig. 5 presents the impact of surrogates on the decision split. It illustrates the percentage of splits on the testing data for which primary tests were outvoted by their surrogates. We may observe that across all datasets, for the default value of parameter b equal to 10%, the average percentage of splits contradicting the primary test equals 8%. This impact of surrogate tests, together with alternative multi-tests, improves the MTDT average accuracy by up to 6%, from 76.1% to 82.2%.

4.2.3. Alternative multi-tests

The parameters of alternative multi-tests were defined empirically through extensive experiments. Fig. 6 presents the average influence of the number of competitor tests, q, on the performance of alternative multi-tests across all datasets. We can observe what percentage of multi-tests was applied as a splitting criterion in the tree node. It is not surprising that the tree node splits were mostly determined by the multi-tests (mt1) that were built on the primary tests. However, for the default value of the parameter q = 10, over 12% of all splits were made in accordance with the alternative multi-tests mt_i (1 < i ≤ W).

In the experiments, we employed two alternative multi-tests, mt2 and mt3, so the number of multi-tests analyzed in each non-terminal node was equal to 3 (W = 3). Additional experiments showed that employing a higher number of multi-tests, besides significantly increasing the computation time, did not yield any improvement in classification accuracy.

4.2.4. Leukemia MLL vs. ALL vs. AML dataset

In one of our experiments, the dataset from Armstrong [36] was evaluated in more detail. The dataset describes the distinction between leukemia MLL and other conventional ALL subtypes. There are a total of 57 training samples from three classes (20 for ALL, 17 for MLL, and 20 for AML) and 15 test samples (4, 3, and 8, respectively). The MTDT decision trees with N = 1 and N = 7, when evaluated on the training instances, split the data in exactly the same way, and for both values of N the classification accuracy is 100%.

The actual trees are illustrated in Fig. 7, and the confusion matrices are presented in Table 3.


Fig. 7. Multi-test decision tree (MTDT) with N = 1 and N = 7 tests in a single node.

We can observe that although both induced trees have the same structure and classified the training set instances in the same way, their performances on the test set were significantly different. Because both trees have identical primary tests, even without any impact from alternative multi-tests, this is a very good example of the strength of the proposed solution. The reason for the good performance of MTDT with N = 7 in this example can be explained by the impact of surrogates on the multi-test decision. In 6 out of 15 instances, the surrogate tests mt_{1,j} (1 < j ≤ N) in MTDT with N = 7 had to outvote the primary tests mt_{1,1} in the nodes to correctly classify the instances. In this way, we improved the classification accuracy on the Armstrong dataset from 86% to 100%.

4.3. Comparison of MTDT to other classifiers

The comparison of MTDT to other classifiers was also performed. The following classification algorithms were selected for this analysis:

• Decision trees:

1. ADTree (AD) – alternating decision tree [38].
2. BFTree (BF) – best-first decision tree classifier [39].
3. J48 Tree (J48) – pruned C4.5 decision tree [25].
4. Simple C&RT (CT) – a version of the C&RT algorithm that implements minimal cost-complexity pruning [26].

• Decision rule classifiers:

1. JRip (JR) – rule learner with repeated incremental pruning to produce error reduction (RIPPER) [40].

• 'Black box' meta decision trees:

1. Random forest (RF) – an algorithm constructing a forest of random trees [41].
2. Bagging (BG) – a variance-reducing meta classifier [42].
3. Adaboost (ADA) – a boosting algorithm using the AdaBoost M1 method [43].

Table 3
Results for the multi-test decision tree (MTDT) with N = 1 and N = 7 on the dataset Leukemia MLL vs. ALL vs. AML.

MTDT N=1       MTDT N=7       Classified as:
(a) (b) (c)    (a) (b) (c)
 6   2   0      8   0   0     (a) AML
 0   1   2      0   3   0     (b) MLL
 0   2   2      0   0   4     (c) ALL
Accuracy 60%   Accuracy 100%

It is worth noting that besides the 'white box' classifiers, results for meta decision trees are also included. Those methods can generate more complex decision rules and outperform standard approaches. The resulting classifiers are, however, more difficult to understand. Our results show that the proposed MTDT algorithm, which uses simple univariate tests, is highly competitive with 'black box' solutions.

The implementations of the competing algorithms in the Weka package [44] were used in our evaluation. All classifiers, including the MTDT algorithm, were employed with default parameter values on all datasets. The results are presented in Table 4.

Results in Tables 2 and 4 show that MTDT with N = 7 tests in a single node yielded the best average accuracy, 82.20%, over all classification problems. In general, it can be observed that more complex methods like RF, ADA, and BG performed better than standard non-ensemble algorithms, which generate simpler solutions. The proposed MTDT method managed to achieve high accuracy while comprehensible classification rules were maintained via the univariate tests used in the multi-test splits. According to the Friedman test, there is a statistically significant difference (p-value of 0.0215) between the tested classifiers. Based on Dunn's multiple comparison test, there is a statistically significant difference in terms of quality between MTDT (with N = 7) and the BF and J48 trees. The

Table 4
Comparison of classification accuracy of algorithms: ADTree (AD), BFTree (BF), J48 Tree (J48), Simple CART (CT), JRip (JR), Random forest (RF), Bagging (BG), Adaboost (ADA).

Dataset   AD      BF      J48     CT      JR      RF      BG      ADA
BC        42.10   47.36   52.63   68.42   73.68   68.42   63.15   57.89
CNS       63.33   71.66   56.66   73.33   65.00   75.00   71.66   75.00
CT        74.19   75.80   85.48   75.80   74.19   75.80   79.03   79.03
DS        95.74   80.85   87.23   82.97   74.46   95.74   87.23   89.36
DF        88.31   79.22   79.22   83.11   77.92   88.31   85.71   90.90
NIH       50.00   60.00   57.50   62.50   61.25   52.50   58.75   65.00
AML       91.17   91.17   91.17   91.17   94.11   82.35   94.11   91.17
MLL       a       73.33   80.00   73.33   66.66   86.66   100.00  66.66
LCD       a       89.65   91.62   88.17   90.14   92.11   90.64   78.32
LCB       81.87   89.65   81.87   81.87   95.97   93.28   82.55   81.87
LCU       96.87   96.87   98.95   96.87   93.75   98.95   97.91   96.87
LCT       69.23   61.53   58.97   58.97   64.10   66.66   61.53   69.23
OC        99.60   98.02   97.23   98.02   98.81   98.02   97.62   99.20
PC        38.23   44.11   29.41   44.11   32.35   29.41   41.17   41.17
Average   74.22   75.66   74.85   77.05   75.89   78.80   79.36   77.26

a AD can be applied to data with two classes only.


AD classifier was excluded from the statistical analysis, as it could not be applied to multi-class datasets.

5. Discussion

In some cases, multi-test trees could be treated as a consistent representation of traditional univariate decision trees, but this works in one direction only. An MTDT can be transformed into a traditional decision tree, but it is usually impossible to do it the other way round. Furthermore, even if our formulation of multi-test decision trees and traditional univariate trees were isomorphic (they are not, as we explained above), this would not invalidate our research, because we show another representation that is more suitable (according to our results) for the greedy search that is traditionally employed for learning decision trees. A similar relationship exists between decision trees and decision rules. Even though the hypothesis space of decision rules is a superset of the hypothesis space of decision trees, researchers still investigate decision trees because of the various advantages that decision trees can offer.

The importance of particular types of tests that are used to build decision trees may depend on the type of search. The results presented in our paper show that standard greedy top-down learning of decision trees can be significantly improved using multi-test splits. If it were possible to learn the optimal decision tree for a given test representation (which is infeasible on real-life data because the problem is NP-hard) instead of using a greedy algorithm, then one could check, for example, the influence of single and multi-test splits on the exact algorithm. It is likely that single-test splits would be more competitive using alternative search strategies, but at the same time multi-test splits could lead to further improvements.

The current state of the art in decision tree learning uses greedy search in most academic research and industrial applications; thus, our multi-test splits improve learning with that most important type of search. This fact explains, for example, why single-test splits in Fig. 7 were weaker than multi-test splits. Multi-test splits were simply more convenient for top-down learning, and better trees could be learned. In theory, better single-test trees could potentially be obtained for the example in Fig. 7; however, assuming that such trees exist and could be found, a different search algorithm or special tuning of existing algorithms would be required to find them. The same advantage of multi-test splits was observed on the other datasets evaluated in this paper. Theoretically, these observations can be explained using the concept of 'inductive bias' in machine learning, that is, the need to make explicit or implicit assumptions about what kind of model is wanted for a particular problem [46].

In the experiment on the leukemia MLL vs. ALL vs. AML dataset, the decision trees with multi-test size N = 1 and N = 7 have the same structure and the same number of nodes. However, for other values of the parameter N or different datasets, this may not be the case. Differences in the tree structure may occur when alternative multi-tests outperform the multi-test mt1 or surrogate tests outvote the primary test. In spite of the equal tree size between MTDT with N = 1 and N > 1, a larger number of univariate tests in a multi-test generates more complex nodes. Fortunately, the multi-tests contain only univariate tests, which are easy for human experts to understand.

For most of the datasets described in Table 1, biologists have found and published marker genes that are highly correlated with the class distinction. In order to evaluate whether the MTDT results are biologically meaningful or not, we explored whether the genes discovered in the classifier's model are supported by biological evidence in the literature. Our research showed that most of the genes from the MTDT model were also identified in biological publications. For this particular dataset, six out of seven genes that built the MTDT multi-test in the root node were also referred to in article [36] and patent [37]. Attributes that built multi-tests in the lower parts of the MTDT tree usually do not appear in publications, as they concern only small sets of instances. We believe that MTDT is capable of finding not only the most significant groups of marker genes but also low-ranked genes that may also be meaningful when combined.

In the comparison of MTDT to other classifiers, it is worth emphasizing that MTDT with a single binary test in a node, i.e., N = 1, performed similarly to all the remaining 'univariate test' methods. It can be compared to the J48 tree algorithm, as they both use the gain ratio criterion. Their trees in most cases separated the training data perfectly but performed considerably worse on testing instances. This may be caused by an underfitted decision tree model. A slight increase in the number of tests in each split improved the classification accuracy; this can be observed in Table 2. The experimental sections showed that the proposed method leads to highly competitive results. In our tests, MTDT outperformed classical decision trees and decision rule classifiers and was highly competitive with more powerful meta learning algorithms.

Even though several interesting questions from the machine learning point of view are still open, we are convinced that the existing version of the algorithm reported in this paper offers a useful tool for molecular biologists doing exploratory analysis of gene expression data. In their work, biologists rarely rely on out-of-the-box solutions, and tuning an algorithm's parameters is their normal practice. Therefore, our existing MTDT algorithm is a perfect tool for their experiments. By changing the number of components in the multi-test splits of the MTDT, the biologist can obtain a monotone range of decision trees that starts with trees corresponding to C4.5 (where one attribute is tested in every node) and proceeds to higher numbers of tests. A known phenomenon in molecular biology is that there often exist groups of genes (or features in general) that behave in a similar way; biologists call it 'epistasis' [45]. For example, a specific substance, such as melanin, which is produced in a living organism, may require several compounds, where each compound is produced by its corresponding gene, and some compounds require other compounds in order to be produced. If any of these genes is defective, its compound will not be produced, and neither will the substance. A similar phenomenon exists when the features used for data analysis are motifs, i.e., small sequences of DNA. When the numbers of occurrences of motifs are similar in the same piece of the DNA sequence, the features corresponding to those motifs become similar. Our idea of identifying surrogate tests relates to this biological phenomenon, and it can identify these relationships exactly.

6. Conclusion

In this paper, we presented a multi-test decision tree approach to gene expression data classification. A new splitting criterion was introduced with the aim of reducing the underfitting of decision trees on these kinds of data and improving classification accuracy. The proposed solution outperformed, or was highly competitive with, all tested competitors. Evaluation on real microarray data showed that the knowledge discovered by MTDT is supported by biological evidence in the literature. Therefore, biologists can benefit from using this 'white box' approach, as it can build accurate and biologically meaningful models for classification and reveal new regularities in biological data. From the machine learning point of view, our rigorous empirical analysis revealed and evaluated the important algorithmic properties of our method.

The main practical question left open is the autonomous tuning of the test size, N, to particular data. We observed that a data-specific selection of N can significantly improve the performance of our method, although a general, domain-independent value was enough to obtain better results than existing algorithms can achieve. We are currently working on an algorithm that, through internal cross-validation, can set this parameter automatically for particular training data. Another improvement concerns a pre-pruning mechanism that will reduce the size of the multi-test in lower parts of the tree. Our analysis showed that the split subsets may have an incorrect size, which can increase the tree height and lead to data overfitting.

Acknowledgments

The authors thank Wojciech Kwedlo for reviewing the paper and providing constructive feedback. This work was supported by grants S/WI/2/13 and W/WI/1/2013 from Bialystok University of Technology. The second author was supported by a fellowship from the Ontario Ministry of Research and Innovation.

References

[1] Murthy SK. Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery 1997;2:345–89.

[2] Rokach L, Maimon O. Data mining with decision trees: theory and applications. Machine perception and artificial intelligence, vol. 69. Singapore: World Scientific Publishing; 2008.

[3] Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference and prediction. 2nd ed. New York: Springer; 2009.

[4] Che D, Liu Q, Rasheed K, Tao X. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Software tools and algorithms for biological systems. Advances in Experimental Medicine and Biology 2011;696:191–9.

[5] Chen X, Wang M, Zhang H. The use of classification trees for bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2011;1:55–63.

[6] Czajkowski M, Kretowski M. Top scoring pair decision tree for gene expression data analysis. In: Arabnia HR, Tran Q-N, editors. Software tools and algorithms for biological systems. Advances in experimental medicine and biology, vol. 696. 2011. p. 27–35.

[7] Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006;7:3.

[8] Qu Y, Adam BL, Yasui Y, Ward MD, Cazares LH, Schellhammer PF, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clinical Chemistry 2002;48:1835–43.

[9] Ge G, Wong GW. Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics 2008;9:275.

[10] Grześ M, Kretowski M. Decision tree approach to microarray data analysis. Biocybernetics and Biomedical Engineering 2007;27(3):29–42.

[11] Dettling M, Buhlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 2003;19(9):1061–9.

[12] Tan AC, Gilbert D. Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics 2003;2(3):75–83.

[13] Kuo WP, Kim E, Trimarchi J, Jenssen T, Vinterbo SA, Ohno-Machado L. A primer on gene expression and microarrays for machine learning researchers. Journal of Biomedical Informatics 2004;37:293–303.

[14] Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nature Genetics 1999;21:33–7.

[15] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic networks and expert systems: exact computational methods for Bayesian networks. International Statistical Review 2008;76:306–7.

[16] Golub TR, Slonim DK. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.

[17] Yeoh EJ, Ross ME. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002;1(2):133–43.

[18] Sebastiani P, Gussoni E, Kohane IS, Ramoni MF. Statistical challenges in functional genomics. Statistical Science 2003;18:33–70.

[19] Dramiński M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J. Monte Carlo feature selection for supervised classification. Bioinformatics 2008;24(1):110–7.

[20] Rokach L, Maimon O. Top-down induction of decision trees classifiers – a survey. IEEE Transactions on Systems, Man, and Cybernetics – Part C 2005;35(4):476–87.

[21] Brown DE, Pittard CL, Park H. Classification trees with optimal multivariate decision nodes. Pattern Recognition Letters 1996;17:699–703.

[22] Murthy S, Kasif S, Salzberg S. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research 1994;2:1–33.

[23] Pagallo G, Haussler D. Boolean feature discovery in empirical learning. Machine Learning 1990;5:71–99.

[24] Brodley CE, Utgoff PE. Multivariate decision trees. Machine Learning 1995;19:45–77.

[25] Quinlan R. C4.5: programs for machine learning. San Mateo, CA, USA: Morgan Kaufmann; 1993.

[26] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Belmont, CA, USA: Wadsworth International Group; 1984.

[27] Tan PJ, Dowe DL, Dix TI. Building classification models from microarray data with tree-based classification algorithms. In: Orgun MA, Thornton J, editors. AI 2007. Lecture notes in artificial intelligence, vol. 4830. Berlin, Germany: Springer; 2007. p. 589–98.

[28] Hu H, Li J, Wang H, Shi M. A maximally diversified multiple decision tree algorithm for microarray data classification. In: Boden M, Bailey TL, editors. WISB 2006, vol. 73. Darlinghurst, Australia: Australian Computer Society, Inc.; 2006. p. 35–8.

[29] Berzal F, Cubero JC, Marin N, Sanchez D. Building multi-way decision trees with numerical attributes. Information Sciences 2004;165:73–90.

[30] Li J, Liu H, Ng S, Wong L. Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics 2003;19(2):93–102.

[31] Fayyad UM, Irani KB. On the handling of continuous-valued attributes in decision tree generation. Machine Learning 1992;8:87–102.

[32] Kent Ridge bio-medical dataset repository; 2012. http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html (accessed 20.12.12).

[33] Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 2003;53:23–69.

[34] Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2006;7:1–30.

[35] Wold S, Eriksson L, Clementi S. Statistical validation of QSAR results. Chemometrics methods in molecular design, vol. 5. Weinheim, Germany: Wiley-VCH Verlag GmbH; 2008. p. 309–38.

[36] Armstrong SA. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002;30:41–7.

[37] Golub TR, Armstrong SA, Korsmeyer SJ. MLL translocations specify a distinct gene expression profile, distinguishing a unique leukemia. United States patent 20060024734; 2006.

[38] Freund Y, Mason L. The alternating decision tree learning algorithm. In: 16th international conference on machine learning ICML 99, Bled, Slovenia. San Francisco, CA, USA: Morgan Kaufmann; 1999. p. 124–33.

[39] Shi H. Best-first decision tree learning. University of Waikato; 2012 (Master's thesis). http://researchcommons.waikato.ac.nz/bitstream/handle/10289/2317/thesis.pdf (accessed 20.12.12).

[40] Cohen WW. Fast effective rule induction. In: 12th international conference on machine learning ICML 95. San Francisco, CA, USA: Morgan Kaufmann; 1995. p. 115–23.

[41] Breiman L. Random forests. Machine Learning 2001;45(1):5–32.

[42] Breiman L. Bagging predictors. Machine Learning 1996;24(2):123–40.

[43] Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: 13th international conference on machine learning ICML 96. 1996. p. 148–56.

[44] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009;11(1):10–8.

[45] Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Human Molecular Genetics 2002;11(20):2463–8.

[46] Shalev-Shwartz S, Ben-David S. Understanding machine learning: from theory to algorithms. Cambridge University Press; 2014.
