
Multi-test decision tree and its application to microarray data classification

Marcin Czajkowski a,∗, Marek Grześ b, Marek Kretowski a

a Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, 15-351 Bialystok, Poland
b School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada

Article info

Article history: Received 24 June 2013; Received in revised form 11 January 2014; Accepted 30 January 2014

Keywords: Decision trees; Univariate tests; Underfitting; Gene expression data

Abstract

Objective: A desirable property of tools used to investigate biological data is that their models and predictive decisions are easy to understand. Decision trees are particularly promising in this regard due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.

Methods: We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.

Results: Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on 14 datasets by an average of 6%. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature.

Conclusion: This paper introduces a new type of decision tree which is more suitable for solving biological problems. MTDTs are relatively easy to analyze and much more powerful in modeling high-dimensional microarray data than their popular counterparts.

1. Introduction

Decision trees [1,2] are one of the most popular classification techniques in data mining and machine learning. Due to their comprehensible nature, they are particularly useful when the aim of modeling is to understand the underlying processes of the environment. Decision trees are also useful when the data do not satisfy the rigorous assumptions required by more traditional methods [3]. Tree-based classifiers can be successfully applied to solving biological problems [4–6]. Popular techniques for microarray data involve decision tree ensembles like random forests [7] and boosted decision trees [8]. However, existing attempts to apply decision trees to classification using gene expression data showed that single-tree algorithms are not sufficient for inducing competitive classifiers [9,10].

In this paper, we tackle the problem of improving the performance of decision trees on gene expression data, with the constraint of preserving the simplicity of decision trees. Standard techniques for improving the performance of classification algorithms, e.g., ensemble methods, do not satisfy this constraint when applied to decision trees because the resulting classifiers become complex and almost impossible to understand [11,12]. We propose a multi-test approach to decision trees in which several univariate tests can be used to create a single splitting rule in every non-terminal node of the classification tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.

1.1. Gene expression data analysis

Cells represent the basic organizational units of all living organisms. Each cell contains instructions for the creation of proteins and the regulation of processes in a living body. This collection of instructions is contained in the DNA. Each protein has a corresponding gene, which can be seen as a recipe for how to create a given protein. If the gene is expressed, a corresponding protein will be produced [13]. A significant step in genomic research was the ability to monitor the expression level of genes in living cells. Specifically,


cDNA microarrays and high-density oligonucleotide chips allow the expression level of thousands of genes to be monitored simultaneously [14]. The outcome of these diagnostic tests is known as gene expression (or microarray) data.

Microarray data allow for numerous analyses of living organisms. The application of a mathematical apparatus and computational tools is indispensable here, since gene expression observations are represented by high-dimensional feature vectors. The important questions are what kind of outcomes can be expected and what kind of questions can be answered using these tools. The answer comes from two fundamental approaches to mathematical modeling, which are equally important in the case of gene expression data. Scientific modeling attempts to understand the true model that is behind the data generated according to that model. In the case of gene expression data, it is concerned with problems of causal relationships between, for example, genes, or genes and proteins. Technological modeling has different aims. Here, the purpose is to build a model from past data that would be good at predicting future data, regardless of whether the model is close to reality or not [15]. Discriminant analysis is an example of this kind of modeling in a general sense. It has also been widely used in post-genome cancer research studies [16,17].

Gene expression data poses many research challenges, and these are not limited to research areas that are concerned with living organisms. This kind of data is also extremely challenging for computational tools and mathematical modeling [18]. Each observation is described by a high-dimensional feature vector with a number of features that reaches into the thousands, but the number of observations is rarely higher than 100. Therefore, this kind of data requires new computational tools to extract significant and meaningful rules, and some feature selection should be taken into account. Providing a group of the most relevant genes may significantly improve classification performance [19].

1.2. Decision trees

Decision trees (also known as classification trees) represent one of the main techniques for discriminant analysis in data mining and knowledge discovery. They predict the class membership (dependent variable) of an instance using its measurements of predictor variables.

The most popular algorithms for decision tree induction are based on top-down greedy search [20]. First, the test attribute (and the threshold in the case of continuous attributes) is decided for the root node. Instances are passed through the tree from the root node to a leaf node, which provides the classification of a given instance. At each non-terminal node through which the instance passes, one (or more) attribute of the instance is tested and the instance is moved down to the branch that corresponds to an outcome of the test. The process is recursively repeated for each branch. When to stop partitioning and create a leaf node is still one of the major problems in the area.

Classification trees have many advantages that make them applicable in various scenarios, particularly when the data does not satisfy the rigorous assumptions required by more traditional methods. In this paper, the following facts are significant:

• learning of decision trees is fast, even with huge datasets, due to greedy search;

• classification is very fast, flexible, and allows for straightforward approaches to the problem of missing values;

• decision trees are easy to understand and analyze, as they reflect a hierarchical way of human decision making. They are thus the opposite of the 'black-box' approaches where model parameters are not understandable.

This introduction applies to cases in which tests in internal nodes of trees are based on one attribute. There are also algorithms which apply multivariate tests [21,22], based mostly on linear combination splits. Decision trees that allow the testing of multiple features at a node are potentially smaller than those limited to single univariate splits. Additionally, when only one attribute is tested at each node, this may cause replication of specific subtrees in the decision tree [23]. In effect, some features may be tested more than once in the decision tree. However, trees with simple tests are still desirable because experts can understand them. This fact is explicitly emphasized in the related literature. Brodley and Utgoff [24] say: "A small tree with simple tests is most appealing because a human can understand it. There is a tradeoff to consider in allowing multivariate tests: using only univariate tests may result in large trees that are difficult to understand, whereas the addition of multivariate tests may result in trees with fewer nodes, but the test nodes may be more difficult to understand". Our focus is therefore on univariate trees, since they are a 'white-box' technique, which makes them particularly interesting for scientific modeling. It is easy to find explanations for the decisions of univariate classification trees.

1.3. Background and motivation

As stated in the previous section, decision trees with univariate splits are convenient. They are much easier to understand than trees with multivariate splits, and they are much easier to learn from the data. However, traditional algorithms, for example, C4.5 [25] or CART [26], fail to produce decision trees with high classification accuracy on gene expression data. Our previous work with various univariate decision tree algorithms showed that these algorithms produce considerably small trees that perfectly classify the training data but fail to classify unseen instances [10]. Only a small number of attributes is used in such trees, and their model complexity is low (high bias). Therefore, they underfit the training data [2]. Producing bigger trees using standard algorithms such as C4.5 does not solve the problem in the case of gene expression data because small trees often classify the training data perfectly [10]. This indicates that the issue of split complexity should be addressed here, since not much can be gained from bigger univariate decision trees with this kind of data. This line of research is pursued in our paper.

We are motivated by the fact that univariate decision tree induction represents a white-box approach, and improvements of such algorithms have considerable potential for genomic research and scientific modeling of underlying processes. Thus, our goal is to improve the classification accuracy of decision trees and enable more informative analysis of microarray data in a way that keeps the trees easy to understand. Decision trees with multivariate splits or bagging/boosting methods often outperform existing univariate algorithms on gene expression data [9,27,28]. However, those approaches generate complex rules that, from a medical point of view, are more difficult to understand and analyze. Our goal is to increase the complexity of univariate decision trees only to the extent that keeps them easy to understand and makes them more competitive in terms of classification accuracy. We believe that the use of individual univariate splits may cause the classifier to underfit the learning data, since it leads to trees that are not robust enough and do not take information about the other most relevant attributes into account. Our novel technique uses several univariate tests in each internal node to avoid these problems. As multi-test nodes are based on univariate tests, trees learned with this approach will be much easier to analyze than trees with classical multivariate splits.

In this paragraph, we attempt to justify why our approach is suitable for gene expression data and why it may lead to high classification accuracy. Gene expression data is characterized by a very high ratio of features to observations, which poses serious


problems for standard univariate splits. The learning algorithm can easily find a test that separates the training data very well at a given level in the tree, but this split can correspond to noise only. This situation is even more likely at intermediate and lower levels of the tree. For example, assuming that at a given level of the tree there are 20 observations (10 from class A and 10 from class B) and 2 × 10^5 features, the number of possible partitions of this training set (the number of combinations of choosing 10 out of 20 instances) is smaller (the exact number is 184,756) than the number of available features. This makes it likely to find a test, i.e., an attribute and its corresponding threshold, which can split this data perfectly. When there are only 10 observations in the node (with an even class distribution), the number of possible splits is only 252, but the number of attributes is 3 orders of magnitude higher. When the split contains only one univariate test, there is a very high risk of choosing tests that correspond to noise. Thus, our approach is to have more univariate tests in each internal node and to base splitting decisions on a larger number of univariate tests, not necessarily those tests that yield the highest value of the gain ratio [25] or the Gini index [26].
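As a quick sanity check of this counting argument, both quantities can be reproduced directly (a minimal Python sketch; the instance counts are the ones quoted above):

```python
from math import comb

# 20 observations, 10 per class: the number of ways to choose which
# 10 instances fall on one side of a binary partition is C(20, 10).
print(comb(20, 10))  # 184756 -- fewer than the 2 x 10^5 available features

# With only 10 observations (even class distribution), the count of
# possible even partitions drops to C(10, 5).
print(comb(10, 5))   # 252 -- three orders of magnitude below the feature count
```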

1.4. Related work

This paper addresses the issue of test complexity in decision trees. A standard approach in the case of discrete attributes is to associate a branch with each categorical attribute value. Another possibility is to group some attribute values in order to reduce the branching factor. When all values are grouped into two clusters, a binary tree is obtained (e.g., in CART [26]). In the case of continuous attributes, binary splits are used. Here, the standard split compares the value of the attribute with a threshold, and the outcome of such a comparison is binary. Thus, a straightforward way to reduce the tree complexity (in terms of the number of nodes) is to use multiple thresholds in each split on a numerical attribute. This will potentially increase the branching factor of such splits; however, such tests will be more expressive, and the overall number of nodes in the corresponding decision tree will be smaller. This approach was explored by Berzal et al. [29], who proposed multi-way decision trees using multi-way splits.

In [29], a hierarchical clustering of attribute values is combined with the standard greedy decision tree algorithm. Initially, each separate attribute value is treated as an individual interval, and the two most similar adjacent intervals are merged in each step. This process can be repeated until only two intervals are left; this would lead to a binary decision tree. However, the clustering process can be stopped before that. Each time two adjacent intervals are merged, the impurity measure associated with the decision tree is checked. The current interval set is determined according to the highest measure of impurity. This technique is similar to the splitting criterion used to evaluate alternative splits, like the C4.5 gain ratio or the Gini index of diversity. The Berzal approach was not evaluated on gene expression data, and, due to the nature of single-attribute multi-way tests, it would not be sufficient to overcome the high ratio of features to observations in this kind of data.

The specific character of gene expression data and its influence on the process of building decision trees was investigated by Li et al. [30]. Their solution focused on using committees of trees to aggregate the discriminating power of a bigger number of significant rules and to make more reliable predictions. First, all features are ranked according to the gain ratio. Then, the first tree, using the first top-ranked feature in the root node, is built. Next, the second tree, using the second top-ranked feature in the root node, is built, and the process continues until the kth tree, using the kth top-ranked feature, is obtained. The classification of the final committee of k decision trees is governed by weighted voting. It was observed that:

• significant rules often contain features that are globally low-ranked;

• if the construction of a tree is confined to a set of globally top-ranked features, the rules in the resulting tree may be less accurate than rules derived from the whole feature space;

• alternative trees often outperform or compete with the performance of the greedy tree.

This work also supports our decision to use many univariate tests in our multi-test decision tree induction algorithm. In particular, our aim is to make use of features that are globally low-ranked and use them jointly in multi-tests. However, our aim is also to preserve the simplicity of the final decision trees, which is not the case in [30].

Our previous work [10], in which standard decision trees were evaluated on gene expression data, led us to the conclusion that the high ratio of variables to cases may cause the learning algorithm to be misled by randomly occurring dependencies in the training data. This may be disastrous for the learned trees due to the hierarchical nature of the classification process of decision trees. The experiments performed revealed that the size of decision trees built with traditional classification methods, such as C4.5, is relatively small, and such trees do not capture all of the structure available in the data and are additionally misled by noise.

The rest of the paper is organized as follows. In Section 2, we introduce a novel representation for decision trees. Then, our algorithm that learns decision trees in the new representation is presented in Section 3. In Section 4, the proposed approach is experimentally evaluated on real gene expression data. The paper is concluded in the last section, where future work is also discussed.

2. Multi-test decision trees

This paper introduces multi-test decision trees (MTDT) – a new, richer language to represent a decision tree. The overall structure of a multi-test tree does not differ from a standard decision tree, e.g., C4.5 [25]. In a multi-test tree, every split in a non-terminal node is composed of a set of univariate tests and is called a multi-test split. These elementary tests are univariate and are combined in a way that makes our approach substantially different from typical multivariate (e.g., oblique) splits.

During classification, the MTDT splitting criterion is directed by the majority voting mechanism, where all univariate test components have the same importance.

Fig. 1 illustrates a multi-test with three individual attribute tests, {(f1 ≤ 2), (f2 ≤ 5), (f3 ≤ 8)}, that splits the data in the node into two subsets: Class A and Class B. In this particular example, as a result of the majority voting rule, at least 2 out of 3 univariate tests

Fig. 1. An example of a multi-test split which contains a set of univariate tests.


Fig. 2. Graphical representation of a multi-test split that contains 3 single-attribute tests: {(f1 ≤ 2), (f2 ≤ 5), (f3 ≤ 8)}.

determine the decision of the actual multi-test split at the node. The graphical representation of the multi-test example is shown in Fig. 2. Each test that uses feature fi can split the instance space, but only with a boundary that is orthogonal to the fi axis. In our example, if f1 < 2 and f2 < 5, then regardless of the decision of f3, the decision is Class A (light gray region). If f1 > 2 and f2 > 5, then regardless of f3, the decision is Class B (dark gray region). If, on the other hand, f1 and f2 yield a contradiction, the final decision is determined by f3, where f3 > 8 leads to Class B and f3 < 8 to Class A. Certainly, univariate tests can be evaluated in any order. The fact that only univariate tests are used in multi-test splits ensures that MTDT can be treated as a univariate decision tree despite more than one test being used in each split.
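To make the voting mechanism concrete, the sketch below evaluates this example multi-test on single instances. It is a minimal Python illustration under our own conventions (a test is a (feature index, threshold) pair; the function name is ours), not the authors' implementation:

```python
# The example multi-test {(f1 <= 2), (f2 <= 5), (f3 <= 8)}; the first
# entry plays the role of the primary test.
multi_test = [(0, 2.0), (1, 5.0), (2, 8.0)]

def multi_test_decision(instance, tests):
    """Majority vote of the univariate tests; every test has an equal vote.
    A draw is resolved by the primary test tests[0] (see Section 3.2)."""
    votes = sum(1 if instance[f] <= thr else -1 for f, thr in tests)
    if votes == 0:  # possible only for an even number of tests
        f, thr = tests[0]
        return instance[f] <= thr
    return votes > 0

print(multi_test_decision([1.0, 4.0, 9.0], multi_test))  # True: f1, f2 outvote f3
print(multi_test_decision([3.0, 6.0, 7.0], multi_test))  # False: f1, f2 outvote f3
```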

3. Learning multi-test decision trees

The previous section introduced the idea of multi-test decision trees. In principle, learning such trees can take various approaches. In this section, we propose one particular method for learning trees that is based on greedy construction of multi-test splits. In what follows, it is assumed that decision tree learning uses top-down induction, where, at every level of recursion, the top-down algorithm constructs multi-test splits to be used in non-terminal nodes of the decision tree. The procedure to construct those multi-test splits is the core of this section and our contribution. It should be noted that multi-tests could also be used with other types of decision tree learning methods, i.e., algorithms that are not top-down. The concept of top-down induction was introduced in Section 1.2.

3.1. Building multi-test splits

Top-down decision tree learning algorithms have to choose a split (or terminate recursion and create a terminal node with a decision) at every level of recursion, given a subset X of training instances. For this reason, our procedure in Algorithm 1 takes X, the current set of instances, and returns the best multi-test for splitting instances in X. Note that the cardinality of X is non-increasing with every recursive call of the top-down procedure. An additional parameter, W, determines the number of alternative multi-tests that are considered by the procedure before the best multi-test is returned. Specifically, our procedure constructs a set MT = {mt1, mt2, ..., mtW} of W alternative multi-tests and returns the best one according to the gain ratio criterion (Line 9 of Algorithm 1).

Fig. 3. An example search process which determines the best multi-test for a non-terminal node of a multi-test decision tree (MTDT) from the set of potential multi-tests (MT).

Algorithm 1. Multi-test construction.

CreateMultiTest(X, W, N)
1: V ← create all candidate thresholds using X
2: best_primary = argmax_{v ∈ V} GainRatio(v, X)
3: mt1 = BuildMulti-test(best_primary, V, X, N)
4: for i ∈ {2, ..., W} do
5:   MT = {mt1, ..., mt_{i−1}}
6:   next_primary = NextPrimary(V, MT, X)
7:   mt_i = BuildMulti-test(next_primary, V, X, N)
8: end for
9: return argmax_{mt_i} GainRatio(mt_i, X)

The first step of the algorithm determines the set of univariate tests for further evaluation. We only consider the relevant thresholds, called the candidate thresholds [31], which split instances from different classes. In existing algorithms with univariate tests, once the set of possible thresholds is computed (univariate tests correspond to thresholds when continuous features are present), the best threshold is selected according to some priority measure (e.g., the gain ratio criterion), and the univariate test with the highest evaluation is returned. This standard procedure would return the univariate test computed in Line 2 of the algorithm. Our algorithm does additional computation in order to build splits with multiple univariate tests.
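For reference, this threshold-selection step (Lines 1–2 of Algorithm 1) can be written out in a few lines. The sketch below is a self-contained Python illustration of candidate thresholds and the gain ratio criterion; the function names and toy data are ours:

```python
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values of different classes,
    i.e., the relevant thresholds in the sense of [31]."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1] and pairs[i][0] != pairs[i + 1][0]]

def gain_ratio(values, labels, thr):
    """Gain ratio [25] of the binary univariate test: value <= thr."""
    left = [l for v, l in zip(values, labels) if v <= thr]
    right = [l for v, l in zip(values, labels) if v > thr]
    n, nl = len(labels), len(left)
    gain = entropy(labels) - (nl / n) * entropy(left) - ((n - nl) / n) * entropy(right)
    split_info = entropy(['L'] * nl + ['R'] * (n - nl))  # intrinsic split information
    return gain / split_info if split_info > 0 else 0.0

# Toy example: one feature, five instances, two classes.
vals, labs = [0.5, 1.0, 2.5, 3.0, 4.5], ['A', 'A', 'B', 'B', 'B']
best = max(candidate_thresholds(vals, labs), key=lambda t: gain_ratio(vals, labs, t))
print(best)  # 1.75 -- the midpoint that separates the classes perfectly
```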

Each ith multi-test is composed of no more than N univariate tests; the first one is called a primary test (mt_{i,1}), and all remaining N−1 tests are called surrogate tests mt_{i,j}, where 1 < j ≤ N. The parameter denoted as N represents the maximum number of univariate tests that constitute the multi-test. Fig. 3 illustrates the tests that are considered in every execution of Algorithm 1.

The set {mt1, ..., mtW} is constructed as follows. First, mt1 is constructed in Line 3 using the primary univariate test found in Line 2. The actual multi-test is built in the BuildMulti-test function, which identifies which candidate tests should be used as surrogate tests of the primary test that is provided as a parameter. This step is explained in detail in Section 3.1.1. mt1 is a special multi-test because its primary test is the best univariate test according to the priority measure. Primary tests for the remaining multi-tests have to be selected in a way that diversifies the created multi-tests. This process takes place in the NextPrimary function, which is executed in Line 6 and described in Section 3.1.2. Once the primary test for a new multi-test is identified, the BuildMulti-test procedure can be used again (Line 7).

Because of the majority voting mechanism applied during classification, surrogate tests have a considerable impact on multi-test decisions because they can outvote the primary test. It should be noted that this impact can be positive or negative, and it affects the gain ratio of the entire multi-test. Therefore, it is possible that the best multi-test returned by Algorithm 1 may not contain the original univariate test with the highest gain ratio (mt_{1,1}).


This can happen when the voting components of competitive multi-tests mt_i (1 < i ≤ W), taken as a whole, have a higher gain ratio than mt1, despite the fact that mt_{1,1} is the univariate test with the highest gain ratio. This fact justifies our decision to use multi-test decision trees, since they can provide better classification that is more robust against noise.

3.1.1. Multi-test construction

When the function BuildMulti-test is executed, the primary test provided in the first parameter constitutes the first univariate test that will be included in the multi-test, and the goal of this function is to add N−1 surrogate tests. The reason for adding more tests is that applying a single primary test based on one attribute may cause the classifier to underfit the learning data due to the low complexity of the classification rule. Surrogate tests should support the division of the training instances made by the primary test. In other words, the remaining tests (the surrogate tests) of the multi-test should, using the remaining features, branch the tree in a similar way to their primary test.

In order to determine surrogate tests, we have adopted a solution proposed in the CART system [26]. The use of a surrogate variable at a given split results in a similar node impurity measure. It also mimics the chosen split in terms of which and how many observations go to the corresponding branches. Therefore, the measure of similarity between the primary test and the surrogates of the multi-test is given by the number of observations classified in the same way. The parameter b equals the percentage of decisions made by surrogate tests that differ from the primary splitter. The parameter is described in more detail in the next section. In our method, we also consider tests that classify instances in an inverse (opposite) way to their primary test (a high value of the parameter b). For such tests, we reverse the relation between the attribute and the interval midpoint, and recalculate the score. However, this only works if we have a binary classification problem.
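The agreement computation behind this selection can be sketched in a few lines. This is our own minimal rendering of the CART-style surrogate criterion described above, with illustrative names, not the authors' code:

```python
def disagreement(primary_outcomes, candidate_outcomes):
    """Fraction b of training instances that a candidate surrogate routes
    differently from the primary splitter."""
    diffs = sum(p != c for p, c in zip(primary_outcomes, candidate_outcomes))
    return diffs / len(primary_outcomes)

def accept_surrogate(primary_outcomes, candidate_outcomes, b=0.10, binary_problem=True):
    """Accept a candidate if it mimics the primary split within b (default 10%).
    For binary problems, a test splitting in the exactly opposite way is also
    useful once its relation to the threshold is reversed."""
    d = disagreement(primary_outcomes, candidate_outcomes)
    if d <= b:
        return True, False          # accept as-is
    if binary_problem and (1.0 - d) <= b:
        return True, True           # accept with the inverted test
    return False, False

# Example: the candidate disagrees on 1 of 10 instances -> accepted as-is.
primary   = [True, True, False, False, True, False, True, True, False, False]
candidate = [True, True, False, True,  True, False, True, True, False, False]
print(accept_surrogate(primary, candidate))  # (True, False)
```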

3.1.2. Identifying additional primary tests

The NextPrimary function searches for a threshold that will be applied in BuildMulti-test, which is executed to build multi-test mt_i for k < i ≤ W after the first k ≥ 1 multi-tests are constructed.

Two factors should be taken into consideration when choosing the primary test for mt_i. First, new primary tests should be competitors to all existing primary tests. The competitor tests yield a high gain ratio but are not as good as, e.g., the primary test mt_{1,1} used to construct mt1. A significant difference between these tests and surrogate tests is the way in which they are ranked. As shown in the previous subsection, surrogate tests are not evaluated by how much improvement they yield in reducing node impurity but rather by how closely they mimic the split determined by their primary test. The competitor tests, on the other hand, are ranked according to the highest gain ratio. We denote tests as competitor tests if their gain ratio is among the top q highest gain ratio values, where primary tests used in mt_j for j < i are not considered (the default value of q is 10). The experiments described in Section 4.2.3 show that using more competitors (a high q value) leads to the selection of tests with a low gain ratio; this decreases the power of alternative multi-tests. However, decreasing the number of competitor tests (low q) may cause new primary tests to be too similar to those already selected.

The second factor that should be considered is that the same attribute is often listed both as a competitor, i.e., as one of the primary tests, and as a surrogate. This may lead to alternative multi-tests, mt_i, that contain similar or identical univariate tests and do not provide any comparable improvement. Therefore, competitor tests should be diversified in order to diversify the alternative multi-tests. For that reason, the function NextPrimary receives the list of all multi-tests that were created before mt_i. The diversification problem is then solved by requiring that every new primary split must be a competitor for mt_{1,1}, i.e., for the primary test of the first multi-test (determined by the q value introduced in the previous paragraph), and it must also be the worst average surrogate (have the highest average value of parameter b) for all primary tests mt_{k,1} where k < i.

To sum up: the surrogate tests are similar to the primary test; the competitor tests are those that have the highest gain ratio and are different from all previously selected primary tests.
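Under the description above, the competitor-based choice of the next primary test can be condensed as follows. The sketch is illustrative (the names and dictionary-based inputs are ours; gain ratios and pairwise b values are assumed precomputed):

```python
def next_primary(gain_ratios, b_matrix, used, q=10):
    """Pick the next primary test id.

    gain_ratios -- {test_id: gain ratio} for all candidate univariate tests
    b_matrix    -- b_matrix[t][p]: disagreement b of test t w.r.t. primary p
    used        -- ids of primary tests already used by earlier multi-tests
    q           -- size of the competitor pool (default 10, as above)
    """
    # Competitors: the top-q tests by gain ratio, excluding used primaries.
    competitors = sorted((t for t in gain_ratios if t not in used),
                         key=gain_ratios.get, reverse=True)[:q]
    # Diversification: take the competitor that is the worst average
    # surrogate (highest mean b) for all previously chosen primaries.
    return max(competitors,
               key=lambda t: sum(b_matrix[t][p] for p in used) / len(used))

# Toy example: 'g3' wins -- a competitive gain ratio, most different split.
gr = {'g1': 0.90, 'g2': 0.85, 'g3': 0.80}
b  = {'g2': {'g1': 0.05}, 'g3': {'g1': 0.40}}
print(next_primary(gr, b, used={'g1'}))  # g3
```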

3.2. Multi-test size and prediction

The size of the multi-test, i.e., the maximum number of single tests that make up every multi-test, has a strong impact on its performance and the splitting decision. The parameter denoted as N represents the maximum number of univariate tests in a multi-test and is defined by the user. To classify observations, the majority voting mechanism is employed, in which each test has an equal vote. In the case of a draw, the decision is made in accordance with the primary test.

The exact size of the multi-test depends on the difference between the primary and surrogate tests. The main idea of the MTDT is to use a group of similar tests in a single node instead of one test, as seen in the classical approach to univariate decision trees. To avoid discrepancies in the multi-test, surrogate tests should not be added to tests that do not have a proper substitute. An inappropriate set of surrogates may dominate the primary test and deteriorate the splitting criterion. Therefore, surrogate tests added to the multi-test should return no more than b% of decisions (default 10%) that differ from the primary test. Using b = 0% means that surrogate tests can only be added to the multi-test if they split observations in exactly the same way as the corresponding primary splitter. In practice, setting b to 0% rejects almost all surrogates; therefore, it is equivalent to setting the size of the multi-test, N, to 1. In this event, the decision tree would become similar to the tree generated by the C4.5 algorithm because only one attribute would be used in each multi-test. If the value of b is high, then all N−1 surrogates join the multi-test.

4. Experimental results

In this section, the proposed solution is experimentally verified using more than a dozen real microarray datasets. The results of the MTDT algorithm were compared with several popular decision-tree-based systems.

4.1. Setup

The performance of the MTDT classifier was investigated using publicly available microarray datasets described in Table 1. These datasets are from the Kent Ridge Bio-medical Dataset Repository [32] and are related to studies of human cancer, including leukemia, colon tumor, prostate cancer, lung cancer, breast cancer and lymphoma. For datasets that were not pre-divided into training and testing parts, 10-fold stratified cross-validation was applied. By stratified cross-validation, we mean that each fold contains roughly the same proportion of instances with the same class labels. Leave-one-out cross-validation was also considered; however, no significant difference in results was observed with this type of cross-validation. The average score of 10 runs is presented for cross-validated data.
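For reference, this protocol (repeated stratified 10-fold cross-validation) can be reproduced in a few lines of Python. Using scikit-learn here is our assumption, with a standard decision tree standing in for MTDT, since the paper does not publish its cross-validation code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for MTDT

def stratified_cv_accuracy(X, y, clf, n_splits=10, n_repeats=10, seed=0):
    """Mean accuracy over n_repeats runs of stratified n_splits-fold CV;
    each fold keeps roughly the original class proportions."""
    scores = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Usage: stratified_cv_accuracy(X, y, DecisionTreeClassifier())
```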

The classification process for all algorithms was preceded by feature selection using the Relief-F [33] method, which is common in microarray data analysis. In the first step, Relief-F draws instances at random and computes their nearest neighbors


Table 1
Kent Ridge bio-medical gene expression datasets used in experiments.

Dataset                          Abbreviation  Attributes  Classes  Training set     Testing set
Breast Cancer                    BC            24,481      2        34/44            12/7
Central Nervous System           CNS           7129        2        21/39            –
Colon tumor                      CT            6500        2        40/22            –
DLBCL Stanford                   DS            4026        2        24/23            –
DLBCL vs. Follicular Lymphoma    DF            6817        2        58/19            –
DLBCL NIH                        DNH           7399        2        88/72            30/50
Leukemia ALL vs. AML             AML           7129        2        27/11            20/14
Leukemia MLL vs. ALL vs. AML     MLL           12,583      3        20/17/20         4/3/8
Lung Cancer Dana-Farber          LCD           12,600      5        139/21/20/6/17   –
Lung Cancer Brigham              LCB           12,533      2        16/16            15/134
Lung Cancer Univ. of Michigan    LCU           7129        2        86/10            –
Lung Cancer – Toronto, Ontario   LCT           2880        2        24/15            –
Ovarian Cancer NCI PBSII         OC            15,154      2        91/162           –
Prostate Cancer                  PC            12,600      2        52/50            27/8

(default 10). Then, Relief-F adjusts a feature weighting vector to give higher weight to attributes that discriminate the instances from neighbors of different classes. The main benefits of using feature selection are shorter training times, improved model interpretability, and enhanced generalization by reducing overfitting.
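A simplified sketch of this weighting idea is given below; it is our own compact approximation of Relief-F (features assumed scaled to [0, 1], Manhattan distance, no per-class weighting of misses), not the exact algorithm of [33]:

```python
import numpy as np

def relief_f_weights(X, y, n_samples=100, k=10, rng=None):
    """Approximate Relief-F: reward features that differ between an
    instance and its k nearest misses, penalize features that differ
    between the instance and its k nearest hits."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for i in rng.choice(n, size=min(n_samples, n), replace=False):
        dist = np.abs(X - X[i]).sum(axis=1)   # distance to every instance
        dist[i] = np.inf                      # exclude the instance itself
        order = np.argsort(dist)
        hits = [j for j in order if y[j] == y[i]][:k]
        misses = [j for j in order if y[j] != y[i]][:k]
        if hits:
            w -= np.abs(X[hits] - X[i]).mean(axis=0)    # same class: penalize
        if misses:
            w += np.abs(X[misses] - X[i]).mean(axis=0)  # other class: reward
    return w  # keep the features with the largest weights
```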

However, as we mentioned in previous sections, with univariate decision trees applied to microarray data one faces the problem of underfitting the learning data (overfitting is not significant). Hence, there is no need to improve the model interpretability, because the model is already simple; instead, it is useful to retain a larger number of features and use less aggressive feature selection. We tested different numbers of top-ranked attributes/features: 50, 100, 200, 1000, 2000, and also considered no feature selection at all. Reducing the number of attributes down to 200 had no significant influence on the test-set accuracy of the compared classifiers; however, it speeds up the training of all algorithms. Our multi-test algorithm works well without feature selection and with larger numbers of features (200 and over). When the number of top selected attributes is small, MTDT loses its ability to find lower-ranked features (as they were excluded from the data), and its performance is similar to the rest of the tested decision trees. Therefore, the number of selected attributes was arbitrarily limited to the top 1000 to allow MTDT to find low-ranked features.

A statistical analysis of all obtained results was performed using the Friedman test and the corresponding Dunn's multiple comparison test (significance level equal to 0.05), as recommended by Demsar [34].
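The Friedman test itself is available in SciPy; the snippet below shows the shape of such an analysis with placeholder accuracies (not the paper's numbers). A post-hoc test such as Dunn's would then be applied to the per-dataset ranks, e.g., via the scikit-posthocs package:

```python
from scipy.stats import friedmanchisquare

# One accuracy list per classifier, aligned by dataset (placeholder values).
acc_mtdt = [68.4, 72.0, 86.0, 92.1]
acc_c45  = [52.6, 56.7, 80.0, 88.2]
acc_rf   = [68.4, 75.0, 86.7, 92.1]

stat, p_value = friedmanchisquare(acc_mtdt, acc_c45, acc_rf)
print(p_value)  # reject equal performance across classifiers when p < 0.05
```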

4.2. Multi-test decision tree results

4.2.1. Multi-test size

The influence of the multi-test size on the performance of our method was experimentally verified on real gene expression data. Classification algorithms applied to these kinds of data are more likely to underfit because of the small ratio of the number of observations to the number of attributes. The performance of the MTDT classifier was studied with six different values of the parameter N, which stands for the maximum number of univariate tests in the multi-test. It is worth emphasizing that MTDT with a single one-attribute test in a node, N = 1, behaves similarly to the standard C4.5 algorithm. Both algorithms use the gain ratio criterion and pessimistic pruning. There is, however, a slight difference in calculating the exact threshold value; this is described in Section 3.1.

In Table 2, we compare the influence of the multi-test size on accuracy. In all experiments, 1000 attributes were considered, and the algorithm's parameters had their default values of W = 3 and b = 10%. These results revealed that the number of univariate tests used in a single multi-test has a significant impact on the classifier accuracy. According to the Friedman test, there is a statistically significant difference (p-value of 0.0003) in the accuracy of all versions. Based on Dunn's multiple comparison test, there is a statistically significant difference in classification quality between the number of tests in the multi-test, N, equal to 1, and 7, 9, and 11.

Experimental validation performed on 14 datasets showed that the average accuracy of the multi-test algorithms increased by over 3% when N = 3, and by over 6% when N = 7, compared to the base MTDT with N = 1. On only one dataset (BC) was the result of the multi-test algorithm lower than expected, although the overall improvement is noticeable. The reason why the results for BC were better for N = 1 lies in the number of attributes that distinguish the classes. For this dataset, only a few genes are considered markers; therefore, a higher number of surrogates could decrease the MTDT accuracy when the tree is overfit.

Considering the results, we conjecture that underfitting is the main cause of the lower classification accuracy of the MTDT approach with N = 1. Decision trees obtained by the standard algorithm (a single univariate test in a node) are not complex enough. It was also observed that using too many genes in the multi-test may induce more complex rules and overfit the learned trees to the training data.

In order to detect and mitigate the possibility of overfitting in the training phase of our method, we created artificial datasets that were copied from those listed in Table 1; attributes were left exactly the same, but class labels were randomly changed. This technique

Table 2
A comparison of the multi-test decision tree (MTDT) accuracy under different numbers of tests (N) in the multi-test. Dataset abbreviations are used (Table 1). The highest classifier accuracy for each dataset is bolded.

Dataset   N=1     N=3     N=5     N=7     N=9     N=11
BC        68.42   63.15   57.89   52.63   57.89   57.89
CNS       60.50   71.33   72.17   72.00   72.17   74.33
CT        80.40   83.14   85.83   85.97   85.83   83.92
DS        81.75   85.00   85.25   85.55   85.05   86.60
DF        84.82   82.07   83.42   85.01   85.57   85.42
NIH       51.25   60.00   60.00   62.50   63.75   62.50
AML       91.18   85.29   91.18   91.18   91.18   88.23
MLL       86.67   73.33   100.00  100.00  93.33   100.00
LCD       89.41   90.98   91.60   92.12   91.15   90.96
LCB       88.59   95.97   96.64   97.98   98.66   98.66
LCU       97.48   98.04   98.32   98.93   99.78   100.00
LCT       61.42   61.66   63.67   66.83   65.67   62.16
OC        97.04   98.69   98.02   98.34   98.34   98.18
PC        26.47   58.82   61.76   61.76   47.06   44.11
Average   76.10   79.11   81.83   82.20   81.20   80.93


Fig. 4. The influence of the similarity measure b on the classification accuracy of the multi-test decision tree (MTDT) algorithm.

is usually referred to as the Y-randomization test [35]. The MTDT classification accuracy was significantly lower on the randomized data than on the original data (which is desirable in this situation); this indicates that there is no evidence of overfitting in our method.
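A minimal sketch of this check is given below (our own illustration; `evaluate` could be, e.g., the stratified_cv_accuracy function sketched in Section 4.1):

```python
import numpy as np

def y_randomization_gap(X, y, clf, evaluate, n_rounds=10, rng=None):
    """Y-randomization [35]: retrain on identical attributes with permuted
    class labels. A large drop in accuracy versus the original labels
    suggests the model is not merely fitting noise."""
    rng = rng or np.random.default_rng(0)
    true_acc = evaluate(X, y, clf)
    perm_accs = [evaluate(X, rng.permutation(y), clf) for _ in range(n_rounds)]
    return true_acc, float(np.mean(perm_accs))
```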

4.2.2. Surrogate tests

In Section 3.2, it was explained that surrogate tests should not differ by more than b% from primary tests. The experiments performed suggest that surrogate tests added to the multi-test should not differ from the primary test by more than 10%. We consider this the default value for all datasets; however, adequate tuning of this parameter may improve classification accuracy. Fig. 4 presents the influence of the similarity parameter, b, on the performance of the MTDT classifier.

In this figure, b = 0% means that only surrogates that have the same gain ratio as primary tests are accepted (it is almost equivalent to setting N = 1), and a high value of b (in the figure, above 15%) means that all N−1 surrogates join the multi-test. Although the general average over all 14 datasets has the highest accuracy when b = 10%, we may observe that the optimal value of this parameter differs for specific datasets. The score of the MTDT algorithm on the leukemia (MLL) and prostate cancer (PC) datasets increases significantly when there is no restriction on choosing surrogate tests. However, on other datasets, when the surrogate tests are more

Fig. 5. The influence of the similarity measure b on the decision split in the tree node.

Fig. 6. The influence of the number of competitor tests q on the application of multi-tests as a splitting criterion in the tree node.

similar to the primary test, the results are slightly better. An additional comparison of the MTDT performance with a simple baseline based on random selection showed a significant difference in prediction accuracy in favor of the proposed solution.

Fig. 5 presents the impact of surrogates on the decision split. It illustrates the percentage of splits on the testing data for which primary tests were outvoted by their surrogates. We may observe that across all datasets, for the default value of parameter b equal to 10%, the average percentage of splits contradicting the primary test equals 8%. This impact of surrogate tests, together with alternative multi-tests, improves the MTDT average accuracy by up to 6%, from 76.1% to 82.2%.

4.2.3. Alternative multi-tests

The parameters of alternative multi-tests were defined empirically through extensive experiments. Fig. 6 presents the average influence of the number of competitor tests, q, on the performance of alternative multi-tests across all datasets. We can observe what percentage of multi-tests was applied as a splitting criterion in the tree node. It is not surprising that the tree node splits were mostly determined by the multi-tests (mt1) that were built on the primary tests. However, for the default value of the parameter q = 10, over 12% of all splits were made in accordance with the alternative multi-tests mt_i (1 < i ≤ W).

In the experiments, we employed two alternative multi-tests, mt2 and mt3, so the number of multi-tests analyzed in each non-terminal node was equal to 3 (W = 3). Additional experiments showed that employing a higher number of multi-tests, besides significantly increasing the computation time, did not yield any improvement in classification accuracy.

4.2.4. Leukemia MLL vs. ALL vs. AML dataset

In one of our experiments, the dataset from Armstrong [36] was evaluated in more detail. The dataset describes the distinction between leukemia MLL and other conventional ALL subtypes. There are a total of 57 training samples from three classes (20 for ALL, 17 for MLL, and 20 for AML) and 15 test samples (4, 3, and 8, respectively). The MTDT decision trees with N = 1 and N = 7, when evaluated on the training instances, split the data in exactly the same way, and for both values of N the classification accuracy is 100%.

The actual trees are illustrated in Fig. 7, and the confusion matrices are presented in Table 3.


Fig. 7. Multi-test decision tree (MTDT) with N = 1 and N = 7 tests in a single node.

We can observe that although both induced trees have the same structure and classified the training set instances in the same way, their performances on the test set were significantly different. Because both trees have identical primary tests, even without any impact from alternative multi-tests, this is a very good example of the strength of the proposed solution. The reason for the good performance of MTDT with N = 7 in this example can be explained by the impact of surrogates on the multi-test decision. In 6 out of 15 instances, the surrogate tests mt_{1,j} (1 < j ≤ N) in MTDT with N = 7 had to outvote the primary tests mt_{1,1} in the nodes to correctly classify the instances. In this way, we improved the classification accuracy on the Armstrong dataset from 86% to 100%.

4.3. Comparison of MTDT to other classifiers

The comparison of MTDT to other classifiers was also performed. The following classification algorithms were selected for this analysis:

• Decision trees:

1. ADTree (AD) – alternating decision tree [38].
2. BFTree (BF) – best-first decision tree classifier [39].
3. J48 Tree (J48) – pruned C4.5 decision tree [25].
4. Simple C&RT (CT) – a version of the C&RT algorithm that implements minimal cost-complexity pruning [26].

• Decision rule classifiers:

1. JRip (JR) – rule learner with repeated incremental pruning to produce error reduction (RIPPER) [40].

• 'Black box' meta decision trees:

1. Random forest (RF) – an algorithm constructing a forest of random trees [41].
2. Bagging (BG) – a variance-reducing meta classifier [42].
3. Adaboost (ADA) – a boosting algorithm using the AdaBoost M1 method [43].

Table 3
Results for the multi-test decision tree (MTDT) with N = 1 and N = 7 on the dataset Leukemia MLL vs. ALL vs. AML.

MTDT N=1       MTDT N=7       Classified as:
(a) (b) (c)    (a) (b) (c)
 6   2   0      8   0   0     (a) AML
 0   1   2      0   3   0     (b) MLL
 0   2   2      0   0   4     (c) ALL
Accuracy 60%   Accuracy 100%

It is worth noting that besides the 'white box' classifiers, results for meta decision trees are also included. Those methods can generate more complex decision rules and outperform standard approaches. The resulting classifiers are, however, more difficult to understand. Our results show that the proposed MTDT algorithm, which uses simple univariate tests, is highly competitive with 'black box' solutions.

The implementations of the competing algorithms in the Weka package [44] were used in our evaluation. All classifiers, including the MTDT algorithm, were employed with default parameter values on all datasets. The results are presented in Table 4.

Results in Tables 2 and 4 show that MTDT with N = 7 tests in a single node yielded the best average accuracy, 82.20%, over all classification problems. In general, it can be observed that more complex methods like RF, ADA, and BG performed better than standard non-ensemble algorithms, which generate simpler solutions. The proposed MTDT method managed to achieve high accuracy while comprehensible classification rules were maintained via the univariate tests used in the multi-test splits. According to the Friedman test, there is a statistically significant difference (p-value of 0.0215) between the tested classifiers. Based on Dunn's multiple comparison test, there is a statistically significant difference in terms of quality between MTDT (with N = 7) and the BF and J48 trees. The

Table 4
Comparison of classification accuracy of algorithms: ADTree (AD), BFTree (BF), J48 Tree (J48), Simple CART (CT), JRip (JR), Random forest (RF), Bagging (BG), Adaboost (ADA).

Dataset   AD      BF      J48     CT      JR      RF      BG      ADA
BC        42.10   47.36   52.63   68.42   73.68   68.42   63.15   57.89
CNS       63.33   71.66   56.66   73.33   65.00   75.00   71.66   75.00
CT        74.19   75.80   85.48   75.80   74.19   75.80   79.03   79.03
DS        95.74   80.85   87.23   82.97   74.46   95.74   87.23   89.36
DF        88.31   79.22   79.22   83.11   77.92   88.31   85.71   90.90
NIH       50.00   60.00   57.50   62.50   61.25   52.50   58.75   65.00
AML       91.17   91.17   91.17   91.17   94.11   82.35   94.11   91.17
MLL       a       73.33   80.00   73.33   66.66   86.66   100.00  66.66
LCD       a       89.65   91.62   88.17   90.14   92.11   90.64   78.32
LCB       81.87   89.65   81.87   81.87   95.97   93.28   82.55   81.87
LCU       96.87   96.87   98.95   96.87   93.75   98.95   97.91   96.87
LCT       69.23   61.53   58.97   58.97   64.10   66.66   61.53   69.23
OC        99.60   98.02   97.23   98.02   98.81   98.02   97.62   99.20
PC        38.23   44.11   29.41   44.11   32.35   29.41   41.17   41.17
Average   74.22   75.66   74.85   77.05   75.89   78.80   79.36   77.26

a AD can be applied to data with two classes only.


AD classifier was excluded from the statistical analysis, as it could not be applied to multi-class datasets.

5. Discussion

In some cases, multi-test trees could be treated as a consistent representation of traditional univariate decision trees, but this works in one direction only. An MTDT can be transformed into a traditional decision tree, but it is usually impossible to do it the other way round. Furthermore, even if our formulation of multi-test decision trees and traditional univariate trees were isomorphic (they are not, as we explained above), this would not invalidate our research, because we show another representation that is more suitable (according to our results) for the greedy search that is traditionally employed for learning decision trees. A similar relationship exists between decision trees and decision rules. Even though the hypothesis space of decision rules is a superset of the hypothesis space of decision trees, researchers still investigate decision trees because of the various advantages that decision trees can offer.

The importance of particular types of tests that are used to build decision trees may depend on the type of search. The results presented in our paper show that standard greedy top-down learning of decision trees can be significantly improved using multi-test splits. If it were possible to learn the optimal decision tree for a given test representation (which is infeasible on real-life data because the problem is NP-hard) instead of using a greedy algorithm, then one could check, for example, the influence of single and multi-test splits on the exact algorithm. It is likely that single-test splits would be more competitive using alternative search strategies, but at the same time multi-test splits could lead to further improvements.

The current state of the art in decision tree learning uses greedy search in most academic research and industrial applications; thus, our multi-test splits improve learning with that most important type of search. This fact explains, for example, why single-test splits in Fig. 7 were weaker than multi-test splits. Multi-test splits were simply more convenient for top-down learning, and better trees could be learned. In theory, better single-test trees could potentially be obtained for the example in Fig. 7; however, assuming that such trees exist and could be found, a different search algorithm or special tuning of existing algorithms would be required to find them. The same advantage of multi-test splits was observed on the other datasets evaluated in this paper. Theoretically, these observations can be explained using the concept of 'inductive bias' in machine learning, that is, the need to make explicit or implicit assumptions about what kind of model is wanted for a particular problem [46].

In the experiment on the leukemia MLL vs. ALL vs. AML dataset, the decision trees with multi-test size N = 1 and N = 7 have the same structure and the same number of nodes. However, for other values of the parameter N or different datasets, this may not be the case. Differences in the tree structure may occur when alternative multi-tests outperform the multi-test mt1 or surrogate tests outvote the primary test. In spite of the equal tree size between MTDT with N = 1 and N > 1, a larger number of univariate tests in a multi-test generates more complex nodes. Fortunately, the multi-tests contain only univariate tests, which are easy for human experts to understand.

For most of the datasets described in Table 1, biologists have found and published marker genes that are highly correlated with the class distinction. In order to evaluate whether the MTDT results are biologically meaningful or not, we explored whether the genes discovered in the classifier's model are supported by biological evidence in the literature. Our research showed that most of the genes from the MTDT model were also identified in biological publications. For this particular dataset, six out of seven genes that built the MTDT multi-test in the root node were also referred to in article [36] and patent [37]. Attributes that built multi-tests in the lower parts of the MTDT tree usually do not appear in publications, as they concern only small sets of instances. We believe that MTDT is capable of finding not only the most significant groups of marker genes but also low-ranked genes that may also be meaningful when combined.

In the comparison of MTDT to other classifiers, it is worth emphasizing that MTDT with a single binary test in a node, i.e., N = 1, performed similarly to all the remaining 'univariate test' methods. It can be compared to the J48 tree algorithm, as they both use the gain ratio criterion. Their trees in most cases separated the training data perfectly but performed considerably worse on testing instances. This may be caused by an underfitted decision tree model. A slight increase in the number of tests in each split improved the classification accuracy; this can be observed in Table 2. The experimental sections showed that the proposed method leads to highly competitive results. In our tests, MTDT outperformed classical decision trees and decision rule classifiers and was highly competitive with more powerful meta learning algorithms.

Even though several interesting questions from the machine learning point of view are still open, we are convinced that the existing version of the algorithm reported in this paper offers a useful tool for molecular biologists doing exploratory analysis of gene expression data. In their work, biologists rarely rely on out-of-the-box solutions, and tuning an algorithm's parameters is their normal practice. Therefore, our existing MTDT algorithm is a perfect tool for their experiments. By changing the number of components in the multi-test splits of the MTDT, the biologist can obtain a monotone range of decision trees that starts with trees corresponding to C4.5 (where one attribute is tested in every node) and proceeds to higher numbers of tests. A known phenomenon in molecular biology is that there often exist groups of genes (or features in general) that behave in a similar way; biologists call it 'epistasis' [45]. For example, a specific substance, such as melanin, which is produced in a living organism, may require several compounds, where each compound is produced by its corresponding gene, and some compounds require other compounds in order to be produced. If any of these genes is defective, its compound will not be produced, and neither will the substance. A similar phenomenon exists when the features used for data analysis are motifs, i.e., small sequences of DNA. When the numbers of occurrences of motifs are similar in the same piece of the DNA sequence, the features corresponding to those motifs become similar. Our idea of identifying surrogate tests relates to this biological phenomenon, and it can identify these relationships exactly.

6. Conclusion

In this paper, we presented a multi-test decision tree approach to gene expression data classification. A new splitting criterion was introduced with the aim of reducing the underfitting of decision trees on these kinds of data and improving classification accuracy. The proposed solution outperformed, or was highly competitive with, all tested competitors. Evaluation on real microarray data showed that the knowledge discovered by MTDT is supported by biological evidence in the literature. Therefore, biologists can benefit from using this 'white box' approach, as it can build accurate and biologically meaningful models for classification and reveal new regularities in biological data. From the machine learning point of view, our rigorous empirical analysis revealed and evaluated the important algorithmic properties of our method.

The main practical question left open is the autonomous tuning of the test size, N, to particular data. We observed that a data-specific selection of N can significantly improve the performance of our method, although a general, domain-independent value was enough to obtain better results than existing algorithms can achieve. We are currently working on an algorithm that, through internal cross-validation, can set this parameter automatically for particular training data. Another improvement concerns a pre-pruning mechanism that will reduce the size of the multi-test in lower parts of the tree. Our analysis showed that the split subsets may have an incorrect size, which can increase the tree height and lead to data overfitting.

Acknowledgments

The authors thank Wojciech Kwedlo for reviewing the paper and providing constructive feedback. This work was supported by grants S/WI/2/13 and W/WI/1/2013 from Bialystok University of Technology. The second author was supported by a fellowship from the Ontario Ministry of Research and Innovation.

References

[1] Murthy SK. Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery 1997;2:345–89.

[2] Rokach L, Maimon O. Data mining with decision trees: theory and applications. Machine perception and artificial intelligence, vol. 69. Singapore: World Scientific Publishing; 2008.

[3] Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference and prediction. 2nd ed. New York: Springer; 2009.

[4] Che D, Liu Q, Rasheed K, Tao X. Decision tree and ensemble learning algorithms with their applications in bioinformatics. Software tools and algorithms for biological systems. Advances in Experimental Medicine and Biology 2011;696:191–9.

[5] Chen X, Wang M, Zhang H. The use of classification trees for bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2011;1:55–63.

[6] Czajkowski M, Kretowski M. Top scoring pair decision tree for gene expression data analysis. In: Arabnia HR, Tran Q-N, editors. Software tools and algorithms for biological systems. Advances in experimental medicine and biology, vol. 696. 2011. p. 27–35.

[7] Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006;7:3.

[8] Qu Y, Adam BL, Yasui Y, Ward MD, Cazares LH, Schellhammer PF, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clinical Chemistry 2002;48:1835–43.

[9] Ge G, Wong GW. Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics 2008;9:275.

[10] Grześ M, Kretowski M. Decision tree approach to microarray data analysis. Biocybernetics and Biomedical Engineering 2007;27(3):29–42.

[11] Dettling M, Buhlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 2003;19(9):1061–9.

[12] Tan AC, Gilbert D. Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics 2003;2(3):75–83.

[13] Kuo WP, Kim E, Trimarchi J, Jenssen T, Vinterbo SA, Ohno-Machado L. A primer on gene expression and microarrays for machine learning researchers. Journal of Biomedical Informatics 2004;37:293–303.

[14] Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nature Genetics 1999;21:33–7.

[15] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic networks and expert systems: exact computational methods for Bayesian networks. International Statistical Review 2008;76:306–7.

[16] Golub TR, Slonim DK. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.

[17] Yeoh EJ, Ross ME. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002;1(2):133–43.

[18] Sebastiani P, Gussoni E, Kohane IS, Ramoni MF. Statistical challenges in functional genomics. Statistical Science 2003;18:33–70.

[19] Dramiński M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J. Monte Carlo feature selection for supervised classification. Bioinformatics 2008;24(1):110–7.

[20] Rokach L, Maimon O. Top-down induction of decision trees classifiers – a survey. IEEE Transactions on Systems, Man, and Cybernetics – Part C 2005;35(4):476–87.

[21] Brown DE, Pittard CL, Park H. Classification trees with optimal multivariate decision nodes. Pattern Recognition Letters 1996;17:699–703.

[22] Murthy S, Kasif S, Salzberg S. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research 1994;2:1–33.

[23] Pagallo G, Haussler D. Boolean feature discovery in empirical learning. Machine Learning 1990;5:71–99.

[24] Brodley CE, Utgoff PE. Multivariate decision trees. Machine Learning 1995;19:45–77.

[25] Quinlan R. C4.5: programs for machine learning. San Mateo, CA, USA: Morgan Kaufmann; 1993.

[26] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Belmont, CA, USA: Wadsworth International Group; 1984.

[27] Tan PJ, Dowe DL, Dix TI. Building classification models from microarray data with tree-based classification algorithms. In: Orgun MA, Thornton J, editors. AI 2007. Lecture notes in artificial intelligence, vol. 4830. Berlin, Germany: Springer; 2007. p. 589–98.

[28] Hu H, Li J, Wang H, Shi M. A maximally diversified multiple decision tree algorithm for microarray data classification. In: Boden M, Bailey TL, editors. WISB 2006, vol. 73. Darlinghurst, Australia: Australian Computer Society, Inc.; 2006. p. 35–8.

[29] Berzal F, Cubero JC, Marin N, Sanchez D. Building multi-way decision trees with numerical attributes. Information Sciences 2004;165:73–90.

[30] Li J, Liu H, Ng S, Wong L. Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics 2003;19(2):93–102.

[31] Fayyad UM, Irani KB. On the handling of continuous-valued attributes in decision tree generation. Machine Learning 1992;8:87–102.

[32] Kent Ridge bio-medical dataset repository; 2012. http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html (accessed 20.12.12).

[33] Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 2003;53:23–69.

[34] Demsar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 2006;7:1–30.

[35] Wold S, Eriksson L, Clementi S. Statistical validation of QSAR results. Chemometrics methods in molecular design, vol. 5. Weinheim, Germany: Wiley-VCH Verlag GmbH; 2008. p. 309–38.

[36] Armstrong SA. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002;30:41–7.

[37] Golub TR, Armstrong SA, Korsmeyer SJ. MLL translocations specify a distinct gene expression profile, distinguishing a unique leukemia. United States patent 20060024734; 2006.

[38] Freund Y, Mason L. The alternating decision tree learning algorithm. In: 16th international conference on machine learning ICML 99, Bled, Slovenia. San Francisco, CA, USA: Morgan Kaufmann; 1999. p. 124–33.

[39] Shi H. Best-first decision tree learning. University of Waikato; 2012 (Master's thesis). http://researchcommons.waikato.ac.nz/bitstream/handle/10289/2317/thesis.pdf (accessed 20.12.12).

[40] Cohen WW. Fast effective rule induction. In: 12th international conference on machine learning ICML 95. San Francisco, CA, USA: Morgan Kaufmann; 1995. p. 115–23.

[41] Breiman L. Random forests. Machine Learning 2001;45(1):5–32.

[42] Breiman L. Bagging predictors. Machine Learning 1996;24(2):123–40.

[43] Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: 13th international conference on machine learning ICML 96. 1996. p. 148–56.

[44] Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009;11(1):10–8.

[45] Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Human Molecular Genetics 2002;11(20):2463–8.

[46] Shalev-Shwartz S, Ben-David S. Understanding machine learning: from theory to algorithms. Cambridge University Press; 2014.
