Maximum parsimony distance on phylogenetic trees
A linear kernel and constant factor approximation algorithm
Jones, Mark; Kelk, Steven; Stougie, Leen
DOI
10.1016/j.jcss.2020.10.003
Publication date
2021
Document Version
Final published version
Published in
Journal of Computer and System Sciences
Citation (APA)
Jones, M., Kelk, S., & Stougie, L. (2021). Maximum parsimony distance on phylogenetic trees: A linear
kernel and constant factor approximation algorithm. Journal of Computer and System Sciences, 117,
165-181. https://doi.org/10.1016/j.jcss.2020.10.003
Mark Jones a,b,∗, Steven Kelk c, Leen Stougie b,d,e
a Delft Institute of Applied Mathematics, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, the Netherlands
b Centrum Wiskunde & Informatica (CWI), 1098 XG Amsterdam, the Netherlands
c Department of Data Science and Knowledge Engineering (DKE), Maastricht University, 6200 MD Maastricht, the Netherlands
d Vrije Universiteit Amsterdam, 1081 HV Amsterdam, the Netherlands
e INRIA-Erable, France
Article info
Article history: Received 7 April 2020. Received in revised form 23 October 2020. Accepted 26 October 2020. Available online 7 December 2020.
Keywords: Phylogenetics; Maximum parsimony; Fixed parameter tractability; Maximum agreement forest
Abstract
Maximum parsimony distance is a measure used to quantify the dissimilarity of two unrooted phylogenetic trees. It is NP-hard to compute, and very few positive algorithmic results are known due to its complex combinatorial structure. Here we address this shortcoming by showing that the problem is fixed parameter tractable. We do this by establishing a linear kernel, i.e., that after applying certain reduction rules the resulting instance has size that is bounded by a linear function of the distance. As powerful corollaries to this result we prove that the problem permits a polynomial-time constant-factor approximation algorithm; that the treewidth of a natural auxiliary graph structure encountered in phylogenetics is bounded by a function of the distance; and that the distance is within a constant factor of the size of a maximum agreement forest of the two trees, a well studied object in phylogenetics.
© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Phylogenetics is the science of inferring and comparing trees (or more generally, graphs) that represent the evolutionary history of a set $X$ of species [34]. In this article we focus on trees. The inference problem has been comprehensively studied: given only data about the species in $X$ (such as DNA data), construct a phylogenetic tree which optimizes a particular objective function [17,40]. Informally, a phylogenetic tree is simply a tree whose leaves are bijectively labelled by $X$. Due to different objective functions, multiple optima and the phenomenon that certain genomes are the result of several evolutionary paths (rather than just one) we are often confronted with multiple “good” phylogenetic trees [32]. In such cases we wish to formally quantify how dissimilar these trees really are. This leads naturally to the problem of defining and computing the distance between phylogenetic trees [36]. Many such distances have been proposed, some of which can be computed in polynomial time, such as Robinson-Foulds (RF) distance [33], and some of which are NP-hard, such as Subtree Prune and Regraft (SPR) distance [9] or Tree Bisection and Reconnection (TBR) distance [1].
Interestingly, distances are not only relevant as a numerical quantification of difference: they also appear in constructive methods for the inference of phylogenetic networks [20], which generalise trees to graphs, and phylogenetic supertrees, which seek to merge multiple trees into a single summary tree [42]. In recent decades NP-hard phylogenetic distances have attracted quite some attention from the discrete optimization and parameterized complexity communities, see e.g. [12,16].
* Corresponding author at: Delft Institute of Applied Mathematics, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, the Netherlands. E-mail address: M.E.L.Jones@tudelft.nl (M. Jones).
In this article we focus on a relatively new distance measure, maximum parsimony distance, henceforth denoted $d_{MP}$. Let $T_1$ and $T_2$ be two unrooted (i.e. undirected) binary phylogenetic trees, with the same set of leaf labels $X$. Consider an arbitrary assignment of colours (“states”) to $X$; we call such an assignment a character. The parsimony score of $T_1$ with respect to the character is the minimum number of bichromatic edges in $T_1$, ranging over all possible colourings of the internal vertices of $T_1$. The parsimony distance of $T_1$ and $T_2$ is the maximum absolute difference between parsimony scores of $T_1$ and $T_2$, ranging over all characters [18,31].
The distance has several attractive properties; it is a metric, and (unlike e.g. RF distance) it is not confounded by the influence of horizontal evolutionary events [18]. Furthermore, the concept of parsimony, which lies at the heart of $d_{MP}$, is fundamental in phylogenetics since it articulates the idea that explanations of evolutionary history should be no more complex than necessary. Alongside its historical significance for applied phylogenetics [17], the study of character-based parsimony has given rise to many beautiful combinatorial and algorithmic results; we refer to e.g. [37,29,38,2,30] for overviews.
Unfortunately, it is NP-hard to compute $d_{MP}$ [22]. A simple exponential-time algorithm is known [26], which runs in time $O(\phi^n \cdot \mathrm{poly}(n))$, where $|X| = n$ and $\phi \approx 1.618$ is the golden ratio, but beyond this few positive results are known. This is frustrating and surprising, since a number of results link $d_{MP}$ to the well-studied TBR distance, henceforth denoted $d_{TBR}$. Namely, it has been proven that $d_{MP}$ is a lower bound on $d_{TBR}$ [18], which, informally, asks for the minimum number of topological rearrangement operations to transform one tree into the other; an empirical study has suggested that in practice the distances are often very close [23]. Also, $d_{MP}$ has been used to prove the tightness of the best-known kernelization results for $d_{TBR}$ [24,25]. What, exactly, is the relationship between $d_{MP}$ and $d_{TBR}$? This is a pertinent question, which transcends the specifics of TBR distance because, crucially, $d_{TBR}$ can be characterized using the powerful maximum agreement forest abstraction.
Distances based on agreement forests have been intensively and successfully studied in recent years, as the use of the agreement forest abstraction almost always yields fixed parameter tractability and constant-factor approximation algorithms [10], many of which are effective in practice. We refer to [41,39,14,35] for recent overviews of the agreement forest literature, and books such as [15] for an introduction to fixed parameter tractability. In particular, $d_{TBR}$ can be computed in $O(3^{d_{TBR}} \cdot \mathrm{poly}(n))$ time [13], permits a polynomial-time 3-approximation algorithm, and a kernel of size $11 d_{TBR} - 9$ [25].
In contrast, prior to this paper very little was known about $d_{MP}$: nothing was known about the approximability of $d_{MP}$; it was not known whether it is fixed parameter tractable (where $d_{MP}$ is the parameter); and, while, as mentioned above, it is known that $d_{MP} \le d_{TBR}$, it remained unclear how much smaller $d_{MP}$ can be than $d_{TBR}$ in the worst case. Despite promising partial results it even remained unclear whether questions such as “Is $d_{MP} \ge k$?” can be solved in polynomial time when $k$ is a constant [8,23]. This is another important difference with distances such as $d_{TBR}$, where corresponding questions are trivially polynomial-time solvable for fixed $k$. The apparent extra complexity of $d_{MP}$ seems to stem from the unusual max-min definition of the problem, and the fact that unlike $d_{TBR}$, which is based on topological rearrangements of subtrees, $d_{MP}$ is based only on characters.
In this article we take a significant step forward in understanding the deeper complexity of $d_{MP}$ and resolve all of the above questions. Our central result is that we prove that two common polynomial-time reduction rules encountered in phylogenetics, the subtree and chain reductions [1], are sufficient to produce a linear kernel for $d_{MP}$. This means that, after exhaustive application of these rules, which preserve $d_{MP}$, the reduced trees will have at most $\alpha \cdot (d_{MP} + 1)$ leaves, with $\alpha = 560$. The fixed parameter tractability of computing $d_{MP}$ (parameterized by itself) then follows, by solving the kernel using the exact algorithm from [26]. The fact that the reduction rules preserve $d_{MP}$ was already known [23]. However, proving the bound on the size of the reduced trees requires rather involved combinatorial arguments, which have a very different flavour to the arguments typically encountered in the maximum agreement forest literature. The main goal of this article is to present these arguments as clearly as possible, rather than to optimize the resulting constants.
The kernel confirms that questions such as “Is $d_{MP} \ge k$?” can, indeed, be solved in polynomial time: it is striking that here the proof of fixed parameter tractability has preceded the weaker result of polynomial-time solvability for fixed $k$. Next, by producing a modified, constructive version of the bounding argument underpinning the kernelization, we are able to demonstrate a polynomial-time $\alpha(1 + 1/r)$-factor approximation algorithm for computation of $d_{MP}$ for any constant $r$, placing the problem in APX.
A number of other powerful corollaries result from the kernelization. We leverage the fact that the reduction rules also preserve $d_{TBR}$, to show that $1 \le \frac{d_{TBR}}{d_{MP}} \le 2\alpha$, which limits how much smaller $d_{MP}$ can be than $d_{TBR}$. Subsequently, we show that the treewidth of an auxiliary graph structure known as the display graph [11] is bounded by a linear function of $d_{MP}$, resolving an open question posed several times [28,23]. The treewidth bound, and the existence of a non-trivial approximation algorithm for $d_{MP}$, were specified as sufficient conditions for proving the fixed parameter tractability of $d_{MP}$ via Courcelle's Theorem [23]; our linear kernel implies them. Summarising, our central result shows how kernelization can open the gateway to a host of strong auxiliary results and bypass intermediate steps in the algorithm design process.
The structure of the paper is as follows. In Section 2 we give formal definitions and insightful preliminary results. In Section 3 we prove our main result: the linear kernel. The section starts with Subsection 3.1 that gives a high-level overview of how a sequence of lemmas and theorems lead to the kernel, whereas in the rest of the section these lemmas and theorems are proved. Interesting corollaries of the existence of a linear kernel are derived in Section 4: a constant approximation algorithm in Section 4.1; a bound on the ratio between $d_{MP}$ and $d_{TBR}$ in Section 4.2; a bound on the treewidth of the so-called display graph in terms of $d_{MP}$ in Section 4.3. Section 5 concludes with some directions for future research.
Fig. 1. Two unrooted binary phylogenetic trees $T_1, T_2$ on $X = \{a, \dots, g\}$. Solid edges are monochromatic and dashed edges are bichromatic under an optimal extension for the character $\chi : X \to \{\text{red}, \text{blue}\}$, where $\chi(a) = \chi(b) = \chi(c) = \text{red}$, $\chi(d) = \chi(e) = \chi(f) = \chi(g) = \text{blue}$. As there is one bichromatic edge in $T_1$ and two in $T_2$, we have that $l_\chi(T_1) = 1$, $l_\chi(T_2) = 2$, proving that $d_{MP}(T_1, T_2) \ge |1 - 2| = 1$. In fact, it can be verified that no character can cause the parsimony scores of these two trees to differ by more, so $d_{MP}(T_1, T_2) = 1$. We will show in Section 4.2 that $d_{TBR}(T_1, T_2) = 2$, because a maximum agreement forest of these two trees contains three blocks [23]. (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)
2. Definitions and preliminaries
An unrooted binary phylogenetic tree on a set of species (or taxa) $X$ is an undirected tree in which all internal vertices have degree 3, and the degree-1 vertices (the leaves) are bijectively labelled with elements from $X$. For brevity we will refer to unrooted binary phylogenetic trees as phylogenetic trees, or even shorter trees. See Fig. 1 for an example.
Given a set $S \subseteq X$ and a tree $T$ on $X$, we denote by $T[S]$ the spanning subtree on $S$ in $T$, that is, the minimal connected subgraph $T'$ of $T$ such that $T'$ contains every element of $S$. The induced subtree $T|_S$ by $S$ in $T$ is the tree derived from $T[S]$ by suppressing any vertices of degree 2. Given a subset $S \subseteq X$ and a tree $T$ on $X$, we say that $S$ has degree $d$ in $T$ if there are exactly $d$ edges $uv$ in $T$ for which $u$ is in $T[S]$ and $v$ is not; in other words, $d$ is the number of edges separating $T[S]$ from the rest of $T$. We call these edges pending edges of $S$ in $T$.
For two disjoint subsets $S_1, S_2 \subseteq X$, we say $S_1$ and $S_2$ are spanning-disjoint in $T$ if the spanning subtrees $T[S_1]$ and $T[S_2]$ are edge-disjoint. (Observe that as $T$ is binary, this also implies that $T[S_1]$ and $T[S_2]$ are vertex-disjoint.) Similarly, we say a collection $S_1, \dots, S_m$ of subsets of $X$ are spanning-disjoint in $T$ if $S_i, S_j$ are spanning-disjoint in $T$ for any $i \neq j$.
2.1. Characters and parsimony
A character on $X$ is a function $\chi : X \to C$, where $C$ is a set of states. In this paper there is no limit on the size of $C$, in contrast to some contexts where $|C|$ is assumed to be quite small (for example, in genetic data the nucleobases A, C, G, T). Think of the states as colours, say $\{1, 2, \dots, t\} =: [t]$.
For a given character $\chi$ and tree $T$ on $X$, the parsimony score measures how well $T$ fits $\chi$. It is defined in the following way. Call a colouring $\phi : V(T) \to [t]$ an extension of $\chi$ to $T$ if $\phi(x) = \chi(x)$ for all $x \in X$. Denote by $\Delta_T(\phi)$ the number of bichromatic edges $uv$ in $T$, i.e. for which $\phi(u) \neq \phi(v)$. We usually omit the subscript $T$ when the tree is clear from context. The parsimony score for $T$ with respect to $\chi$ is defined as
\[ l_\chi(T) = \min_{\phi} \Delta_T(\phi), \]
where the minimum is taken over all possible extensions $\phi$ of $\chi$ to $T$. An extension $\phi$ that achieves this bound is called an optimal extension of $\chi$ to $T$. An optimal extension, and thus the parsimony score, can be easily computed in polynomial time using dynamic programming or e.g. Fitch's algorithm [19].
Observe that for any $T$ and $\chi$, the parsimony score for $T$ with respect to $\chi$ is at least $|\chi(X)| - 1$, i.e. the number of colours assigned by $\chi$ minus 1. If $l_\chi(T)$ is exactly $|\chi(X)| - 1$, we say that $T$ is a perfect phylogeny for $\chi$.
For trees $T_1, T_2$ and a character $\chi$ on $X$, the parsimony distance with respect to $\chi$ is defined as
\[ d_{MP}^{\chi}(T_1, T_2) = |l_\chi(T_1) - l_\chi(T_2)|. \]
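As an illustration of the definitions above, the following minimal sketch computes $l_\chi(T)$ with Fitch's algorithm and then $d_{MP}^{\chi}(T_1, T_2)$. The adjacency-dictionary representation, the function names and the two example quartet trees are assumptions made for this sketch; they are not taken from the paper. The tree is rooted at an arbitrary leaf so that every internal vertex has exactly two children, and the trees are assumed to be binary with at least two leaves.

```python
def parsimony_score(adj, chi):
    """Fitch's algorithm on an unrooted binary tree (illustrative sketch).

    adj: dict mapping each vertex to the list of its neighbours.
    chi: dict mapping each leaf (taxon) to its state.
    """
    root = next(iter(chi))            # root the tree at an arbitrary leaf
    score = 0

    def fitch_set(v, parent):
        nonlocal score
        if v in chi:                  # leaf: its state is fixed by the character
            return {chi[v]}
        # Binary tree rooted at a leaf: exactly two children per internal vertex.
        left, right = [fitch_set(w, v) for w in adj[v] if w != parent]
        common = left & right
        if common:
            return common
        score += 1                    # the children disagree: one change is needed
        return left | right

    (child,) = adj[root]              # a leaf has exactly one neighbour
    if chi[root] not in fitch_set(child, root):
        score += 1                    # change on the edge leaving the root leaf
    return score


def parsimony_distance_for_character(adj1, adj2, chi):
    """d_MP^chi(T1, T2) = |l_chi(T1) - l_chi(T2)|."""
    return abs(parsimony_score(adj1, chi) - parsimony_score(adj2, chi))


# Hypothetical example: the quartet trees ab|cd and ac|bd (internal vertices u, v).
T1 = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
      "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
T2 = {"a": ["u"], "c": ["u"], "b": ["v"], "d": ["v"],
      "u": ["a", "c", "v"], "v": ["b", "d", "u"]}
chi = {"a": 0, "b": 0, "c": 1, "d": 1}
print(parsimony_score(T1, chi), parsimony_score(T2, chi))  # prints "1 2"
```

The recursion is adequate for small examples; for large trees an iterative post-order traversal would avoid Python's recursion limit.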
Now we are ready to define the maximum parsimony distance between two trees (see also Fig. 1). For two trees $T_1, T_2$ on $X$, the maximum parsimony distance is defined as
\[ d_{MP}(T_1, T_2) = \max_{\chi} d_{MP}^{\chi}(T_1, T_2), \]
where the maximum is taken over all possible characters $\chi$ on $X$ [18,31]. Equivalently, we may write it as
\[ d_{MP}(T_1, T_2) = \max_{\chi} |\Delta_{T_1}(\phi_1) - \Delta_{T_2}(\phi_2)|, \]
where $\phi_1$ is an optimal extension of $\chi$ to $T_1$, and $\phi_2$ an optimal extension of $\chi$ to $T_2$. This measure satisfies the properties of a distance metric on the space of unrooted binary phylogenetic trees [18,31]. For two trees on $n$ taxa it is known that $d_{MP}$ is at most $n - 2\sqrt{n} + 1$ [18]. A weaker bound of $n - 1$ is easily obtained by observing that the parsimony score of a character on a tree is at least 0 and at most $n - 1$.
Given a tree $T$ on $X$ and a colouring $\phi : V(T) \to [t]$, the forest induced by $\phi$ is derived from $T$ by deleting every bichromatic edge under $\phi$. Observe that the number of connected components in the forest induced by $\phi$ is exactly $\Delta(\phi) + 1$, i.e. one more than the number of bichromatic edges.
Lemma 1. If $\chi : X \to [t]$ is a character with $S_i = \chi^{-1}(i) \neq \emptyset$ (i.e. at least one taxon is coloured $i$) for each $i \in [t]$, and $T$ is a tree on $X$, then $l_\chi(T) \ge t - 1$ with equality if and only if $S_1, \dots, S_t$ are spanning-disjoint in $T$.
Proof. To see that $l_\chi(T) \ge t - 1$, consider an optimal extension $\phi$ of $\chi$ to $T$, and let $F$ be the forest induced by $\phi$. As each connected component in $F$ is monochromatically coloured by $\phi$, there must be at least $t$ connected components, and thus $\Delta(\phi) \ge t - 1$, which implies $l_\chi(T) \ge t - 1$.
Now suppose that $S_1, \dots, S_t$ are spanning-disjoint in $T$. Then construct an extension $\phi$ of $\chi$ to $T$ by first setting $\phi(u) = i$ for every vertex $u$ in $T[S_i]$, for each $i \in [t]$. (As the spanning trees are edge-disjoint and thus vertex-disjoint in $T$, this is well-defined.) For any remaining unassigned vertices $v$, if $v$ has a neighbour $u$ for which $\phi(u)$ is defined, then set $\phi(v) = \phi(u)$. Repeat this process until every vertex is assigned a colour by $\phi$. Now observe that by construction, the vertices assigned colour $i$ by $\phi$ form a connected subtree for each $i \in [t]$. Thus the forest induced by $\phi$ has exactly $t$ connected components, and so $\Delta(\phi) = t - 1$.
Finally, suppose $l_\chi(T) = t - 1$, and let $\phi$ be an optimal extension of $\chi$. Then the forest $F$ induced by $\phi$ has exactly $t$ connected components, which implies by the pigeonhole principle that each $S_i$ is a subset of one connected component in $F$. Then as each $S_i$ is contained within a different connected component of $F$, the spanning trees $T[S_i]$ are also contained within these components, and so $S_1, \dots, S_t$ are spanning-disjoint.
2.2. Parameterized complexity and kernelization
A parameterized problem is a problem for which the inputs are of the form $(x, k)$, where $k$ is a non-negative integer, called the parameter. A parameterized problem is fixed-parameter tractable (FPT) if there exists an algorithm that solves any instance $(x, k)$ in $f(k) \cdot |x|^{O(1)}$ time, where $f()$ is a computable function depending only on $k$. A parameterized problem has a kernel of size $g(k)$, where $g()$ is a computable function depending only on $k$, if there exists a polynomial-time algorithm transforming any instance $(x, k)$ into an equivalent instance $(x', k')$, with $|x'|, k' \le g(k)$. If $g(k)$ is a polynomial in $k$ then we call this a polynomial kernel; if $g(k) = O(k)$ then it is a linear kernel. It is well-known that a parameterized problem is fixed-parameter tractable if and only if it has a (not necessarily polynomial) kernel. For more information, we refer the reader to [15].
For a maximization problem $\Pi$ and $\rho \ge 1$, we say $\Pi$ has a constant factor approximation with approximation ratio $\rho$ if there exists a polynomial-time algorithm such that for any instance $\pi$ of $\Pi$, the following inequalities hold, where $\mathrm{opt}(\pi)$ denotes the maximum value of a solution to $\pi$, and $\mathrm{alg}(\pi)$ denotes the value of the solution to $\pi$ returned by the algorithm:
\[ 1 \le \frac{\mathrm{opt}(\pi)}{\mathrm{alg}(\pi)} \le \rho. \]
In this paper we study the following maximization problem:
Maximum Parsimony Distance (dmp)
Input: Two trees $T_1, T_2$ on a set of taxa $X$.
Output: A character $\chi$ on $X$ that maximizes $|l_\chi(T_1) - l_\chi(T_2)|$.
3. Kernel bound
3.1. Overview
In this section we give an overview of the constituent parts of our kernelization result, and how they fit together. The first step is to apply two reduction rules, the Cherry rule and the Chain rule, described in the next section. These rules correspond roughly to reduction rules that often appear in papers on computational phylogenetics. The correctness of these rules was proved in [23]; our contribution is to show that the exhaustive application of these rules grants a linear kernel, as stated in the following theorem.
Theorem 1. There exists a constant $\alpha$ ($\alpha = 560$) for which the following holds. Let $(T_1, T_2)$ be a pair of binary unrooted phylogenetic trees on $X$ that are irreducible under Reduction Rules 1 and 2. Then if $|X| \ge \alpha k$, it holds that $d_{MP}(T_1, T_2) \ge k$, and we can find a witnessing character, i.e. a character $\chi$ yielding $d_{MP}^{\chi}(T_1, T_2) \ge k$, in polynomial time.
This theorem, together with the correctness of the reduction rules as proved in [23], immediately implies a linear kernel for dmp.
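The size of the kernel can be read off from Theorem 1 by contraposition. The short derivation below is a sketch of that step, choosing $k = d_{MP}(T_1, T_2) + 1$; it recovers the bound of $\alpha \cdot (d_{MP} + 1)$ leaves quoted in the introduction.

```latex
% Contrapositive of Theorem 1 for an instance irreducible under Rules 1 and 2:
\[
  d_{MP}(T_1,T_2) < k \;\Longrightarrow\; |X| < \alpha k .
\]
% Choosing k = d_{MP}(T_1,T_2) + 1, the premise always holds, hence
\[
  |X| \;<\; \alpha\bigl(d_{MP}(T_1,T_2) + 1\bigr) \;=\; 560\,\bigl(d_{MP}(T_1,T_2) + 1\bigr).
\]
```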
To show how we prove the theorem, we will need to introduce some terminology as we go.
A quartet $Q$ is any set of 4 elements in $X$. If $T_1|_Q \neq T_2|_Q$, we say that $Q$ is a conflicting quartet for $(T_1, T_2)$. As a crucial step we prove that for any $S$ large enough with respect to the degree of $S$ in both $T_1$ and $T_2$, either there exists a conflicting quartet or one of the reduction rules applies.
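To make the notion of a conflicting quartet concrete, the following sketch determines the topology that a tree induces on a quartet and compares it across two trees. It relies on the standard four-point condition with unit edge lengths: among the three pairings of $\{a, b, c, d\}$, the pairing realised by the tree has the strictly smallest sum of path lengths. The adjacency-dictionary representation and all function names are assumptions of this sketch, not notation from the paper.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search distances (number of edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def quartet_topology(adj, quartet):
    """Return the split ((x, y), (z, w)) induced by the tree on the quartet."""
    a, b, c, d = sorted(quartet)      # canonical order, so results are comparable
    dist = {x: distances_from(adj, x) for x in (a, b, c, d)}
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    # Four-point condition: the true split minimises the sum of the two
    # within-pair path lengths (uniquely so in a binary tree).
    return min(splits,
               key=lambda s: dist[s[0][0]][s[0][1]] + dist[s[1][0]][s[1][1]])


def is_conflicting(adj1, adj2, quartet):
    """True if the two trees induce different topologies on the quartet."""
    return quartet_topology(adj1, quartet) != quartet_topology(adj2, quartet)


# Hypothetical example: the quartet trees ab|cd and ac|bd.
T1 = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
      "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
T2 = {"a": ["u"], "c": ["u"], "b": ["v"], "d": ["v"],
      "u": ["a", "c", "v"], "v": ["b", "d", "u"]}
print(is_conflicting(T1, T2, {"a", "b", "c", "d"}))  # prints "True"
```

Checking all quartets of a set $S$ this way takes polynomial time, although more efficient methods exist.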
Lemma 2. Let $S$ be a subset of $X$ with $d_1$ the degree of $S$ in $T_1$, and $d_2$ the degree of $S$ in $T_2$. If $|S| > 9(d_1 + d_2) - 12$, then either $T_1|_S \neq T_2|_S$ or one of Reduction Rules 1 or 2 applies to $(T_1, T_2)$. In particular if $(T_1, T_2)$ is irreducible under Rules 1 and 2 and $|S| \ge 9(d_1 + d_2) - 11$, then there exists a conflicting quartet $Q \subseteq S$, and such a quartet can be found in polynomial time.
The next result implies that if we have a large enough number of conflicting quartets that are also spanning-disjoint in both $T_1$ and $T_2$, then we are done. While it is intuitively clear that such quartets can be leveraged to create a high parsimony score in one tree, some care has to be taken to keep the parsimony score low in the other tree.
Lemma 3. Let $\mathcal{Q} = \{Q_1, \dots, Q_k\}$ be a set of conflicting quartets for $T_1, T_2$, such that $Q_1, \dots, Q_k$ are spanning-disjoint in $T_1$ and in $T_2$. Then $d_{MP}(T_1, T_2) \ge k$, and we can find a witnessing character in polynomial time.
In combination, Lemmas 2 and 3 allow us to show that $d_{MP}(T_1, T_2) \ge k$ provided that we can find at least $k$ sets $S_1, \dots, S_k$ that are spanning-disjoint in both trees and satisfy the conditions of Lemma 2.
We will find $k$ such sets as part of the construction of a character that witnesses $d_{MP}(T_1, T_2) \ge k$, for any reduced instance with $|X| \ge \alpha k$. In order to construct this character, we first create a partition of $X$ into large subsets, as described by the following lemma.
Lemma 4. Suppose that $|X| \ge 2ct$ for some integers $c$ and $t$, and let $T_1$ be a phylogenetic tree on $X$. Then in polynomial time we can construct a partition $S_1, \dots, S_t$ of $X$ with $S_1, \dots, S_t$ spanning-disjoint in $T_1$, such that $|S_i| \ge c$ for each $i$.
We note that there is a one-to-one correspondence between partitions and characters on $X$, in the following sense. Given a partition $S_1, \dots, S_t$ of $X$, we may define a character $\chi : X \to [t]$ such that $\chi(x) = i$ if $x \in S_i$, for each $i \in [t]$. Call such a character the character defined by $S_1, \dots, S_t$.
Thus let us consider the character $\chi$ on $X$ defined by the partition described by Lemma 4. Since $S_1, \dots, S_t$ are spanning-disjoint in $T_1$, Lemma 1 tells us that the parsimony score of $T_1$ with respect to $\chi$ is exactly $t - 1$.
Lemma 5. Let $\chi$ be the character defined by the partition $S_1, \dots, S_t$ where $S_1, \dots, S_t$ are spanning-disjoint in $T_1$, let $d_1, d_2$ be positive integers such that $d_1 d_2 - d_1 - d_2 > 0$, and assume
\[ t \ge \frac{2 d_1 d_2 + d_1}{d_1 d_2 - d_1 - d_2}\, k. \]
Then either $d_{MP}^{\chi}(T_1, T_2) \ge k$, or in polynomial time we can find a set of indices $i_1, \dots, i_{k'}$ with $k' \ge k$ such that:
• $S_{i_1}, \dots, S_{i_{k'}}$ are spanning-disjoint in $T_2$ (as well as in $T_1$);
• $S_{i_j}$ has degree at most $d_1$ in $T_1$ for each $j \in [k']$; and
• $S_{i_j}$ has degree at most $d_2$ in $T_2$ for each $j \in [k']$.
We will prove Theorem 1 by combining these results in the following way. Fix integers $d_1, d_2$ to be determined later. Assume $(T_1, T_2)$ is irreducible under Reduction Rules 1 and 2, and assume that $|X| \ge 2ct$, where $c = 9(d_1 + d_2) - 11$ and $t \ge \frac{2 d_1 d_2 + d_1}{d_1 d_2 - d_1 - d_2} k$ (this holds if $|X| \ge \alpha k$). By Lemma 4, there exists a partition $S_1, \dots, S_t$ of $X$ with $S_1, \dots, S_t$ spanning-disjoint in $T_1$ and $|S_i| \ge c$ for each $i \in [t]$. Let $\chi$ be the character defined by this partition. By Lemma 5, either $d_{MP}^{\chi}(T_1, T_2) \ge k$, in which case we are done, or we get a set of indices $i_1, \dots, i_{k'}$ with $k' \ge k$ such that $S_{i_1}, \dots, S_{i_{k'}}$ are spanning-disjoint in $T_2$ (as well as in $T_1$), each $S_{i_j}$ has degree at most $d_1$ in $T_1$, and each $S_{i_j}$ has degree at most $d_2$ in $T_2$. But then each $S_{i_j}$ satisfies the conditions of Lemma 2, and therefore for each $j \in [k']$ there exists a conflicting quartet $Q_j \subseteq S_{i_j}$. Moreover, as $S_{i_1}, \dots, S_{i_{k'}}$ are spanning-disjoint in $T_1$ and $T_2$, the quartets $Q_1, \dots, Q_{k'}$ are also spanning-disjoint in $T_1$ and $T_2$. Then Lemma 3 implies that $d_{MP}(T_1, T_2) \ge k' \ge k$.
By setting $d_1 = 4$ and $d_2 = 5$, we get that $\alpha = 560$, giving the desired bound.
In the next subsections we prove each of these lemmas, and then the main theorem, in turn.
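The value $\alpha = 560$ follows from the requirement $|X| \ge 2ct$ by direct substitution. The arithmetic below is a worked check under the assumption that $t$ is taken as small as the condition of Lemma 5 allows, i.e. $t = 4k$.

```latex
\begin{align*}
  c &= 9(d_1 + d_2) - 11 = 9 \cdot 9 - 11 = 70, \\
  \frac{2 d_1 d_2 + d_1}{d_1 d_2 - d_1 - d_2}
    &= \frac{2 \cdot 20 + 4}{20 - 4 - 5} = \frac{44}{11} = 4
    \quad\Longrightarrow\quad t = 4k, \\
  2ct &= 2 \cdot 70 \cdot 4k = 560\,k = \alpha k .
\end{align*}
```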
3.2. Reduction rules
We begin by stating the reduction rules for our kernelization result. In what follows, a pair $(x, y)$ with $x, y \in X$ is a cherry in a tree $T$ if there exists an internal vertex $u$ in $T$ adjacent to both $x$ and $y$. A cherry is also sometimes known in the literature as a sibling-pair. A sequence of leaves $x_1, \dots, x_r \in X$ is a chain in $T$ if there exists a path of internal vertices $p_1, \dots, p_r$ (possibly with $p_1 = p_2$ and possibly with $p_{r-1} = p_r$), such that for each $i \in [r]$, $p_i$ is the internal vertex adjacent to $x_i$. We call $r$ the length of this chain.
Reduction Rule 1. [Cherry reduction rule] If there exist $x, y \in X$ such that $(x, y)$ is a cherry in each of $T_1, T_2$, then replace $(T_1, T_2)$ with $(T_1|_{X \setminus \{x\}}, T_2|_{X \setminus \{x\}})$.
Reduction Rule 2. [Chain reduction rule] If there exists a sequence of leaves $x_1, \dots, x_r \in X$ such that $x_1, \dots, x_r$ is a chain in both $T_1$ and $T_2$, and $r \ge 5$, then replace $(T_1, T_2)$ with $(T_1|_{X \setminus \{x_5, \dots, x_r\}}, T_2|_{X \setminus \{x_5, \dots, x_r\}})$ (thus, the common chain is reduced to length 4).
The correctness of these rules (in the sense that they preserve $d_{MP}$) was previously proved in [23].
Theorem 2. Let $(T_1', T_2')$ be an instance of dmp derived from $(T_1, T_2)$ by an application of Reduction Rules 1 or 2. Then $d_{MP}(T_1, T_2) = d_{MP}(T_1', T_2')$.
Correctness of the chain reduction rule follows from Theorem 3.1 in [23]. Correctness of the cherry reduction rule follows as a subcase of Theorem 4.1 in [23].
Our main contribution is to show that if an instance is reduced by these rules then its size is bounded by a linear function of $d_{MP}$.
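The cherry reduction rule described above can be sketched in a few lines. The code below finds the cherries common to two trees and deletes a leaf, suppressing the resulting degree-2 vertex, which corresponds to passing from $T$ to $T|_{X \setminus \{x\}}$. The adjacency-dictionary representation, the function names and the two example trees are assumptions of this sketch; it also assumes trees with at least four taxa, so that no internal vertex has three leaf neighbours.

```python
def cherries(adj, taxa):
    """Return the cherries of a tree as a set of two-element frozensets."""
    by_parent = {}
    for x in taxa:
        (parent,) = adj[x]            # a leaf has exactly one neighbour
        by_parent.setdefault(parent, []).append(x)
    return {frozenset(leaves) for leaves in by_parent.values()
            if len(leaves) == 2}


def common_cherries(adj1, adj2, taxa):
    """Cherries present in both trees: where Reduction Rule 1 applies."""
    return cherries(adj1, taxa) & cherries(adj2, taxa)


def remove_leaf(adj, x):
    """Return a copy of the tree with leaf x deleted and its parent suppressed."""
    new = {v: list(ns) for v, ns in adj.items() if v != x}
    (p,) = adj[x]
    new[p].remove(x)
    if len(new[p]) == 2:              # suppress the resulting degree-2 vertex
        u, w = new.pop(p)
        new[u][new[u].index(p)] = w
        new[w][new[w].index(p)] = u
    return new


# Hypothetical example: two caterpillar trees on {a, ..., e} sharing the cherry (a, b).
T1 = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
      "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"]}
T2 = {"a": ["u"], "b": ["u"], "d": ["v"], "c": ["w"], "e": ["w"],
      "u": ["a", "b", "v"], "v": ["u", "d", "w"], "w": ["v", "c", "e"]}
taxa = ["a", "b", "c", "d", "e"]
for cherry in common_cherries(T1, T2, taxa):
    x = min(cherry)                   # either leaf of the cherry may be removed
    T1, T2 = remove_leaf(T1, x), remove_leaf(T2, x)
```

After the loop both trees are defined on $\{b, c, d, e\}$. Exhaustive application would repeat the search on the reduced pair until no common cherry remains; the chain rule would be handled analogously.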
3.3. Small degree sets
In this section we prove Lemma 2.
Lemma 2. Let $S$ be a subset of $X$ with $d_1$ the degree of $S$ in $T_1$, and $d_2$ the degree of $S$ in $T_2$. If $|S| > 9(d_1 + d_2) - 12$, then either $T_1|_S \neq T_2|_S$ or one of Reduction Rules 1 or 2 applies to $(T_1, T_2)$. In particular if $(T_1, T_2)$ is irreducible under Rules 1 and 2 and $|S| \ge 9(d_1 + d_2) - 11$, then there exists a conflicting quartet $Q \subseteq S$, and such a quartet can be found in polynomial time.
Proof. Since unrooted binary trees are characterized by their quartets [34, Theorem 6.3.5(iii)], the last statement of the lemma follows directly.
We will show that if $T_1|_S = T_2|_S$ and neither of the reduction rules applies to $(T_1, T_2)$, then $|S| \le 9(d_1 + d_2) - 12$. This implies the main claim of the lemma.
Let us denote $T|_S = T_1|_S = T_2|_S$. Consider the backbone graph of $T|_S$ obtained by deleting all leaves (see Fig. 2 for an example). Let $P_C$ be the set of nodes having degree 1 on the backbone, which we refer to as parents of a cherry in $T|_S$. Let $P_L$ be the set of nodes having degree 2 on the backbone, which we refer to as parents of a leaf of $T|_S$. All remaining vertices on the backbone have degree 3. Thus $|S|$, the total number of leaves of $T|_S$, is $2|P_C| + |P_L|$. We call the path between any two odd-degree vertices on the backbone, having internal nodes only in $P_L$, a side of the backbone.
First notice that for each cherry in $T|_S$, there must exist in $T_1[S]$, the spanning tree on $S$ in $T_1$, or in $T_2[S]$ a node, incident to a pending edge of $S$, between at least one of its two leaves and its corresponding node in $P_C$. Otherwise Reduction Rule 1 can be applied. In particular this implies that $|P_C| \le d_1 + d_2$. Thus at least $|P_C|$ of the $d_1 + d_2$ pending edges must be used for “cutting” the cherries, each of them cutting 1 leaf of a cherry. Let us choose one such leaf from each cherry, and call these the cut-leaves.
After removing cut-leaves, every node in $P_C$ and $P_L$ is now the parent of 1 leaf in $T|_S$. Every side of the backbone contains at most 4 vertices in $P_C$ and $P_L$, unless $T_1[S]$ or $T_2[S]$ has a node of a pending edge of $S$ or a node adjacent to a node of a pending edge on that side. We show that every such pending edge on a side may increase the number of $P_L$-nodes on that side by at most 5 (see Fig. 2). Indeed, suppose a side of the backbone has in total $d$ pending edges of $S$ in both $T_1$ and $T_2$, but more than $4 + 5d$ nodes in $P_L$, i.e. at least $5(d + 1)$. Then $T|_S$ contains a chain of length $5(d + 1)$, which we can split up into $d + 1$ chains of length 5. Clearly at least one of these chains has no pending edge in either $T_1$ or $T_2$, and so $T_1, T_2$ have a common chain of length 5, a contradiction. Thus the total number of nodes from $P_C$ and $P_L$ on a side is at most five times the number of pending edges of $S$ (in $T_1[S]$ or $T_2[S]$) on that side, plus 4. Otherwise Reduction Rule 2 can be applied.
Fig. 2. Example illustration of the backbone of $T|_S = T_1|_S = T_2|_S$ within $T_1$ and $T_2$, where $S = \{s_1, \dots, s_{29}\}$. Edges and vertices of the backbone are in bold. Observe that $T|_S$ has the chain $s_1, \dots, s_9$, but $(T_1, T_2)$ do not have a common chain of length greater than 4, as the leaf $s_5$ has a sibling $a$ in $T_2$.
Given that we already used $|P_C|$ pending edges for cutting the cherries, we have $d_1 + d_2 - |P_C|$ pending edges left to be distributed over the sides. The number of sides on the backbone is the number of edges in an unrooted binary tree with $|P_C|$ leaves, which is $2|P_C| - 3$. Therefore the total number of leaves of $T|_S$ is
\[ |S| = 2|P_C| + |P_L| \le |P_C| + 4(2|P_C| - 3) + 5(d_1 + d_2 - |P_C|) \le 4|P_C| + 5(d_1 + d_2) - 12. \]
Clearly, this attains its largest value if $|P_C| = d_1 + d_2$, in which case $|S| \le 9(d_1 + d_2) - 12$, as was to be proven.
3.4. Combining conflicting quartets
Lemma 3. Let $\mathcal{Q} = \{Q_1, \dots, Q_k\}$ be a set of conflicting quartets for $T_1, T_2$, such that $Q_1, \dots, Q_k$ are spanning-disjoint in $T_1$ and in $T_2$. Then $d_{MP}(T_1, T_2) \ge k$, and we can find a witnessing character in polynomial time.
Proof. For a quartet $Q$ and tree $T$, we say that $T|_Q = ab|cd$ if $Q = \{a, b, c, d\}$ and in $T$ the path between $a$ and $b$ is edge-disjoint from the path between $c$ and $d$. Without loss of generality, we may assume $Q_i = \{a_i, b_i, c_i, d_i\}$, $T_1|_{Q_i} = a_i b_i | c_i d_i$ and $T_2|_{Q_i} = a_i c_i | b_i d_i$ for each $i \in [k]$.
We will show how to build a character $\chi$ with two states, such that $l_\chi(T_1) \le k$, and $l_\chi(T_2) \ge 2k$. This shows that $d_{MP}^{\chi}(T_1, T_2) \ge k$, as required.
The idea is to construct $\chi$ in such a way that, for each quartet $Q_i$, $\chi(a_i) = \chi(b_i) \neq \chi(c_i) = \chi(d_i)$. This will ensure that $l_\chi(T_2)$ is at least $2k$, as $T_2$ will have at least $2k$ edge-disjoint paths (from $a_i$ to $c_i$ and from $b_i$ to $d_i$, for each $i \in [k]$) that each require at least one change in state along some edge.
For each $Q_i$, let $e_{Q_i}$ denote an edge in $T_1$ such that in $T_1[Q_i]$, $e_{Q_i}$ is on the path that separates $\{a_i, b_i\}$ from $\{c_i, d_i\}$.
Now we construct a function $\phi : V(T_1) \to \{\text{red}, \text{blue}\}$ as follows. Start by choosing an arbitrary leaf in $T_1$, say without loss of generality $a_1$, and set $\phi(a_1) = \text{red}$. Now proceed as follows. For any edge $uv$ in $T_1$ such that $\phi(u)$ is defined but $\phi(v)$ is not, we set $\phi(v) = \phi(u)$, unless $uv = e_{Q_i}$ for some $i$. In that case, we set $\phi(v) = \text{blue}$ if $\phi(u) = \text{red}$, and set $\phi(v) = \text{red}$ otherwise.
Now we can let $\chi$ be the restriction of $\phi$ to $X$. By construction, $\phi$ is an extension of $\chi$ to $T_1$ and the number of bichromatic edges under $\phi$ is $\Delta(\phi) = |\{e_{Q_i} : i \in [k]\}| = k$. This is enough to show that $l_\chi(T_1) \le k$.
We now show that $\chi(a_i) = \chi(b_i) \neq \chi(c_i) = \chi(d_i)$, for each $i \in [k]$. To see this, consider the spanning tree $T_1[Q_i]$. By construction, $T_1[Q_i]$ contains the edge $e_{Q_i}$ and $e_{Q_i}$ separates $\{a_i, b_i\}$ from $\{c_i, d_i\}$. Let $u_i, v_i$ be the vertices of $e_{Q_i}$, with $u_i$ the vertex closer to $a_i$ and $b_i$. Note that $T_1[Q_i]$ cannot contain $e_{Q_j}$ for any $j \neq i$, as $T_1[Q_i]$ and $T_1[Q_j]$ are edge-disjoint. It follows that $u_i, a_i, b_i$ are all assigned the same value by $\phi$ and $v_i, c_i, d_i$ are assigned the opposite value. Thus by definition of $\chi$, we have $\chi(a_i) = \chi(b_i) = \phi(u_i) \neq \phi(v_i) = \chi(c_i) = \chi(d_i)$.
It remains to observe that as $Q_1, \dots, Q_k$ are spanning-disjoint in $T_2$, the $a_i - c_i$ and $b_i - d_i$ paths in $T_2$ are pairwise edge-disjoint for all $i \in [k]$. Then as $\chi(a_i) \neq \chi(c_i)$ and $\chi(b_i) \neq \chi(d_i)$, there exist at least $2k$ edges $uv$ in $T_2$ with $\phi_2(u) \neq \phi_2(v)$, for any extension $\phi_2$ of $\chi$ to $T_2$. It follows that $l_\chi(T_2) \ge 2k$, and so $d_{MP}(T_1, T_2) \ge d_{MP}^{\chi}(T_1, T_2) = |l_\chi(T_1) - l_\chi(T_2)| \ge 2k - k = k$.
Since each edge is processed at most once in the construction of $\chi$, it is clear that this construction takes polynomial time.
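The construction in the proof of Lemma 3 can be sketched as follows. For each quartet the code picks one edge of $T_1$ on the path separating $\{a_i, b_i\}$ from $\{c_i, d_i\}$, then propagates a two-state colouring from an arbitrary leaf, flipping the state exactly when one of the chosen edges is crossed. The adjacency-dictionary representation, the function names, and the convention that each quartet is passed as a tuple $(a, b, c, d)$ with $T_1|_Q = ab|cd$ are assumptions of this sketch; states are written 0 and 1 rather than red and blue.

```python
from collections import deque


def tree_path(adj, s, t):
    """Return the unique path from s to t in a tree as a list of vertices."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    path = [t]
    while path[-1] != s:
        path.append(parent[path[-1]])
    return path[::-1]


def separating_edge(adj, a, b, c, d):
    """An edge on the path separating {a, b} from {c, d}, assuming ab|cd."""
    on_ab = set(tree_path(adj, a, b))
    path_ac = tree_path(adj, a, c)
    # The a-c path leaves the a-b path at its last common vertex; the next
    # edge lies on the path that separates the two pairs.
    i = max(j for j, v in enumerate(path_ac) if v in on_ab)
    return frozenset((path_ac[i], path_ac[i + 1]))


def witness_character(adj1, quartets, taxa):
    """Two-state character with chi(a) = chi(b) != chi(c) = chi(d) per quartet."""
    flip = {separating_edge(adj1, *q) for q in quartets}
    start = next(iter(taxa))
    phi = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj1[u]:
            if w not in phi:
                crossed = frozenset((u, w)) in flip
                phi[w] = 1 - phi[u] if crossed else phi[u]
                queue.append(w)
    return {x: phi[x] for x in taxa}   # chi is the restriction of phi to X


# Hypothetical example: a single quartet with T1 = ab|cd.
T1 = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
      "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
chi = witness_character(T1, [("a", "b", "c", "d")], ["a", "b", "c", "d"])
```

The sketch only builds the character; the guarantee $d_{MP}^{\chi}(T_1, T_2) \ge k$ additionally requires the quartets to be conflicting and spanning-disjoint in both trees, as in the statement of the lemma.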
3.5. Constructing an initial partition
In this section we prove Lemma 4.
Lemma 4. Suppose that $|X| \ge 2ct$ for some integers $c$ and $t$, and let $T_1$ be a phylogenetic tree on $X$. Then in polynomial time we can construct a partition $S_1, \dots, S_t$ of $X$ with $S_1, \dots, S_t$ spanning-disjoint in $T_1$, such that $|S_i| \ge c$ for each $i$.
Proof. We prove the claim by induction on $t$. For the base case, if $t = 1$ then we may let $S_1 = X$, and we have the desired partition.
For the inductive step, assume $|X| \ge 2ct$ and that the claim is true for smaller values of $t$. We first fix an arbitrary rooting on $T_1$. That is, choose an arbitrary edge $e$ in $T_1$ and subdivide it with a new (temporary) vertex $r$, then orient all edges in $T_1$ away from $r$. Under this rooting, let $u$ be a lowest vertex in $T_1$ for which $u$ has at least $c$ descendants in $X$. Let $S_t \subseteq X$ be the set of these descendants. Note that since $T_1$ is binary, $|S_t| < 2c$, as otherwise one of the two children of $u$ would be a lower vertex with at least $c$ descendants.
Now consider the induced subtree $T_1|_{X'}$, where $X' = X \setminus S_t$. As $|S_t| < 2c$, we have $|X'| \ge 2c(t - 1)$. Then by the inductive hypothesis, we can construct a partition $S_1, \dots, S_{t-1}$ of $X'$ with $S_1, \dots, S_{t-1}$ spanning-disjoint in $T_1|_{X'}$, such that $|S_i| \ge c$ for each $i$. By construction it is clear that $S_t$ is spanning-disjoint in $T_1$ from $S_1, \dots, S_{t-1}$. Thus $S_1, \dots, S_t$ is the desired partition.
As the construction of $S_t$ can be done in polynomial time and this process is repeated $t \le |X|$ times, the entire process takes polynomial time.
3.6. Well-behaved sets
In this section we prove Lemma 5. We start with an observation:
Observation 1. For any (not necessarily binary) unrooted tree $T$ with $n$ vertices, and any integer $d \ge 1$, the number of vertices in $T$ with degree strictly greater than $d$ is at most $n/d$.¹
Proof. For each vertex $v$ in $T$ let $d(v)$ denote the degree of $v$. Recall that an unrooted tree with $n$ vertices has exactly $n - 1$ edges. It follows that
\[ \sum_{v \in V(T)} d(v) = 2|E(T)| = 2n - 2. \]
Now suppose that $T$ has $m > n/d$ vertices with degree strictly greater than $d$, i.e. at least $d + 1$. The remaining $n - m$ vertices all have degree at least 1, from which it follows that
\[ \sum_{v \in V(T)} d(v) \ge m(d + 1) + n - m = md + n \ge (n/d)\,d + n = 2n, \]
a contradiction.
Lemma 5. Let $\chi$ be the character defined by the partition $S_1, \ldots, S_t$ where $S_1, \ldots, S_t$ are spanning-disjoint in $T_1$, let $d_1, d_2$ be positive integers such that $d_1d_2 - d_1 - d_2 > 0$, and assume
\[ t \ge \frac{2d_1d_2 + d_1}{d_1d_2 - d_1 - d_2}\, k. \]
Then either $d_{MP_\chi}(T_1, T_2) \ge k$, or in polynomial time we can find a set of indices $i_1, \ldots, i_{k'}$ with $k' \ge k$ such that:
• $S_{i_1}, \ldots, S_{i_{k'}}$ are spanning-disjoint in $T_2$ (as well as in $T_1$);
• $S_{i_j}$ has degree at most $d_1$ in $T_1$ for each $j \in [k']$; and
• $S_{i_j}$ has degree at most $d_2$ in $T_2$ for each $j \in [k']$.
Proof. By Lemma 1, $l_\chi(T_1) = t - 1$. If $l_\chi(T_2) \ge t + k - 1$, then $d_{MP_\chi}(T_1, T_2) \ge k$ as required. So we may assume that $l_\chi(T_2) \le t + k - 2$. Let $\delta = l_\chi(T_2) - l_\chi(T_1)$, and observe that $0 \le \delta \le k - 1$.
We now construct a partition $P_1, \ldots, P_s$ of $X$ which is spanning-disjoint in $T_2$ (see Fig. 3 for an illustration). Let $\phi_2$ be an optimal extension of $\chi$ to $T_2$. As $l_\chi(T_2) = l_\chi(T_1) + \delta = t + \delta - 1$, the forest induced by $\phi_2$ has exactly $s$ monochromatic connected components, where $s = t + \delta$. Let $P_1, \ldots, P_s$ be the partition of $X$ formed by taking the intersection of $X$ with the vertex set of each tree in this forest. Observe that by construction $P_1, \ldots, P_s$ are spanning-disjoint in $T_2$, and that furthermore each $P_j$ is a subset of $S_i$ for some $i \in [t]$ (as each element of $P_j$ is assigned the same value by $\phi_2$, and thus by $\chi$).
Now let $I \subseteq [t]$ denote the set of indices $i$ in $[t]$ such that
• $S_i = P_j$ for some $j \in [s]$;
• $S_i$ has degree at most $d_1$ in $T_1$; and
• $S_i$ has degree at most $d_2$ in $T_2$.
Note that since $P_1, \ldots, P_s$ are spanning-disjoint in $T_2$, the sets $\{S_i : i \in I\}$ are also spanning-disjoint in $T_2$. Notice that it is sufficient to prove that $|I| \ge k$, whence any subset of $k$ indices from $I$ satisfies the lemma. We will prove this by providing upper bounds on the number of indices in $[t]$ that do not satisfy the conditions of $I$.
Let $I_0$ denote the set of indices $i \in [t]$ such that $P_j \ne S_i$ for any $j \in [s]$. We first claim that $|I_0| \le \delta$. Indeed, since every $P_j$ is a subset of some $S_i$ and $S_1, \ldots, S_t$ and $P_1, \ldots, P_s$ are both partitions of $X$, we have that for every $i \in I_0$, there exist at least two distinct indices $j, j' \in [s]$ for which $P_j, P_{j'} \subset S_i$. Hence,
\[ s \ge 2|I_0| + |[t] \setminus I_0| = t + |I_0|. \]
Therefore if $|I_0| > \delta$ then $s > t + \delta$, contradicting the definition of $s$. Thus, we have $|I_0| \le \delta$.
Next, let $I_{>d_1}$ denote the set of indices $i \in [t]$ for which $S_i$ has degree greater than $d_1$ in $T_1$. We will show that $|I_{>d_1}| \le t/d_1$. For each $i \in [t]$, compress the spanning subtree $T_1[S_i]$ to a single vertex, and observe that the degree of this vertex is equal to the degree of $S_i$ in $T_1$. Any vertex $u$ which is not part of any $T_1[S_i]$ is merged with one of its neighbours. Note that this merging process can only increase the degrees of the remaining vertices. Call the resulting tree $T_1'$. See Fig. 4. $T_1'$ has $t$ vertices, each of them corresponding to a subset $S_i$, and having degree at least the degree of the corresponding $S_i$ in $T_1$. Now by Observation 1, there are at most $t/d_1$ vertices in $T_1'$ with degree greater than $d_1$. It follows that there are at most $t/d_1$ values of $i \in [t]$ for which $S_i$ has degree greater than $d_1$ in $T_1$, and thus $|I_{>d_1}| \le t/d_1$ as we wanted to show.
Similarly let $J_{>d_2}$ denote the set of indices $j \in [s]$ for which $P_j$ has degree greater than $d_2$ in $T_2$. By similar arguments as used for $I_{>d_1}$ above, we can show that $|J_{>d_2}| \le s/d_2$.
Notice that for any $i \in [t]$, if $i$ is not in $I$, then either $i \in I_0$, or $i \in I_{>d_1}$, or there exists $j \in J_{>d_2}$ such that $S_i = P_j$. We therefore have that
\[ |I| \ge t - |I_0| - |I_{>d_1}| - |J_{>d_2}|. \]
Fig. 3. Illustration of the construction of partition $P_1, P_2, P_3, P_4, P_5$ from $S_1, S_2, S_3$. Solid edges are monochromatic and dashed edges are bichromatic under an optimal extension for $\chi$, where $\chi$ is the character induced by $S_1, S_2, S_3$.
Fig. 4. Illustration of the construction of auxiliary tree $T_1'$, given a partition of $X$ with $S_1 = \{a, b, c\}$, $S_2 = \{d, e, f\}$, $S_3 = \{g, h, i\}$, $S_4 = \{j, k\}$, $S_5 = \{l, m\}$. Note that the internal vertex labelled $u$ is not part of $T_1[S_i]$ for any $i$, so we merge it with an arbitrary adjacent vertex. In this case we merge $u$ into
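Now, using that $t \ge \frac{2d_1d_2 + d_1}{d_1d_2 - d_1 - d_2}\, k$, $s = t + \delta$ and $\delta \le k - 1$, we have:
\[
\begin{aligned}
|I| &\ge t - |I_0| - |I_{>d_1}| - |J_{>d_2}| \\
&\ge t - \delta - t/d_1 - s/d_2 \\
&= t - \delta - t/d_1 - (t + \delta)/d_2 \\
&= \frac{d_1d_2 t - d_1d_2 \delta - d_2 t - d_1 t - d_1 \delta}{d_1d_2} \\
&= \frac{(d_1d_2 - d_1 - d_2)\, t - (d_1d_2 + d_1)\, \delta}{d_1d_2} \\
&\ge \frac{(d_1d_2 - d_1 - d_2)\, t - (d_1d_2 + d_1)(k - 1)}{d_1d_2} \\
&\ge \frac{(2d_1d_2 + d_1)\, k - (d_1d_2 + d_1)(k - 1)}{d_1d_2} \\
&= \frac{d_1d_2 k + d_1d_2 + d_1}{d_1d_2} \\
&> \frac{d_1d_2 k}{d_1d_2} = k,
\end{aligned}
\]
as we needed to prove. To see that $I$ can be constructed in polynomial time, it suffices to observe that the partition $P_1, \ldots, P_s$ can be constructed in polynomial time (as $\phi_2$ can be found in polynomial time), and after this each $S_i$ can be checked for membership in $I$ in polynomial time.
3.7. Proof of Theorem 1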
Lemma 6. Let $d_1, d_2$ be positive integers such that $d_1d_2 - d_1 - d_2 > 0$. Let $(T_1, T_2)$ be a pair of binary unrooted phylogenetic trees on $X$ that are irreducible under Reduction Rules 1 and 2. Then if $|X| \ge 2ct$, where $c = 9(d_1 + d_2) - 11$ and $t = \frac{2d_1d_2 + d_1}{d_1d_2 - d_1 - d_2}\, k$, it holds that $d_{MP}(T_1, T_2) \ge k$, and we can find a witnessing character in polynomial time.
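To make the thresholds in Lemma 6 concrete, the sketch below evaluates $c$, $t$ and $2ct$ with exact rational arithmetic, and checks that with this value of $t$ the lower bound $t - \delta - t/d_1 - (t + \delta)/d_2$ from the proof of Lemma 5 exceeds $k$ for every $\delta \le k - 1$. The function names are hypothetical, and the values $d_1 = 4$, $d_2 = 5$, $k = 3$ are an illustrative assumption rather than a choice stated in the text above.

```python
from fractions import Fraction

def lemma6_thresholds(d1, d2, k):
    """Return (c, t, 2*c*t) as defined in the statement of Lemma 6."""
    denom = d1 * d2 - d1 - d2
    if denom <= 0:
        raise ValueError("Lemma 6 requires d1*d2 - d1 - d2 > 0")
    c = 9 * (d1 + d2) - 11
    t = Fraction(2 * d1 * d2 + d1, denom) * k
    return c, t, 2 * c * t

def lemma5_lower_bound(t, delta, d1, d2):
    """Lower bound t - delta - t/d1 - (t + delta)/d2 on |I| (proof of Lemma 5)."""
    t = Fraction(t)
    return t - delta - t / d1 - (t + delta) / d2

d1, d2, k = 4, 5, 3  # illustrative values, not taken from the text
c, t, size = lemma6_thresholds(d1, d2, k)
print(c, t, size)  # prints: 70 12 1680

# With t at the Lemma 5 threshold, the bound exceeds k for all delta <= k - 1.
for delta in range(k):
    assert lemma5_lower_bound(t, delta, d1, d2) > k
```

For this particular choice, $2ct = 560k$, which agrees with the constant $\alpha = 560$ appearing in Theorem 1. Other admissible pairs $(d_1, d_2)$ give different multiples of $k$.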
Proof. By Lemma 4, there exists a partition $S_1, \ldots, S_t$ of $X$, all spanning-disjoint in $T_1$, and with $|S_i| \ge c$ for all $i \in [t]$. Let $\chi$ be the character defined by $S_1, \ldots, S_t$. If $\chi$ is a witness to $d_{MP}(T_1, T_2) \ge k$, then we may return $\chi$ and we are done. Otherwise, we may apply Lemma 5 to find indices $i_1, \ldots, i_{k'}$ such that:
• $S_{i_1}, \ldots, S_{i_{k'}}$ are all spanning-disjoint in $T_2$ (as well as in $T_1$);
• each $S_{i_j}$ has degree at most $d_1$ in $T_1$; and
• each $S_{i_j}$ has degree at most $d_2$ in $T_2$.
Now for each $S_{i_j}$, we have that $S_{i_j}$ has degree $d_1^j \le d_1$ in $T_1$ and $d_2^j \le d_2$ in $T_2$, and that
\[ |S_{i_j}| \ge c = 9(d_1 + d_2) - 11 \ge 9(d_1^j + d_2^j) - 11, \]
and also that $(T_1, T_2)$ is irreducible under Rules 1 and 2. Thus we may apply Lemma 2, to find a conflicting quartet $Q_j \subseteq S_{i_j}$ for each $i_j$.
Finally, as $S_{i_1}, \ldots, S_{i_{k'}}$ are spanning-disjoint in both $T_1$ and $T_2$, and as each $Q_j$ is a subset of $S_{i_j}$, we have that $Q_1, \ldots, Q_{k'}$ are also spanning-disjoint in both $T_1$ and $T_2$. Therefore we may apply Lemma 3 to find a witnessing character for $d_{MP}(T_1, T_2) \ge k$. As each step of this process takes polynomial time, the construction of a witnessing character takes polynomial time.
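The first step of the proof above relies on the partition from Lemma 4, whose greedy construction can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the tree is assumed to be given as an adjacency dictionary `adj`, the names `root_tree`, `bottom_up_order` and `greedy_partition` are hypothetical, and instead of rebuilding the induced subtree $T_1|X'$ in each round, the sketch keeps one rooting and counts only the leaves that have not yet been assigned.

```python
def root_tree(adj, edge):
    """Root an unrooted tree by subdividing `edge` with a temporary root r."""
    a, b = edge
    root = ("root",)  # temporary vertex; assumed not to clash with real names
    children = {root: [a, b]}
    stack = [(a, b), (b, a)]  # pairs (vertex, parent)
    while stack:
        v, parent = stack.pop()
        children[v] = [w for w in adj[v] if w != parent]
        stack.extend((w, v) for w in children[v])
    return root, children

def bottom_up_order(root, children):
    """List the vertices so that every vertex comes after all its descendants."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    order.reverse()
    return order

def greedy_partition(adj, leaves, c, t):
    """Split `leaves` into t blocks of size >= c, as in the proof of Lemma 4."""
    if len(leaves) < 2 * c * t:
        raise ValueError("Lemma 4 assumes |X| >= 2ct")
    a = next(iter(adj))
    b = next(iter(adj[a]))
    root, children = root_tree(adj, (a, b))
    order = bottom_up_order(root, children)
    remaining = set(leaves)
    blocks = []
    for _ in range(t - 1):
        below = {}  # unassigned leaves below each vertex visited so far
        for v in order:
            if v in remaining:
                below[v] = {v}
            else:
                below[v] = set().union(*(below[w] for w in children[v]))
            if len(below[v]) >= c:
                # No descendant reached c, so v is a lowest such vertex.
                blocks.append(below[v])
                remaining -= below[v]
                break
    blocks.append(remaining)  # base case: the last block takes what is left
    return blocks
```

A call such as `greedy_partition(adj, X, c, t)` returns a list of `t` sets. The sketch only guarantees the size conditions by construction; spanning-disjointness rests on the argument in the proof of Lemma 4 and is not verified by the code.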
It remains to complete the proof of Theorem 1.
Theorem 1. There exists a constant $\alpha$ ($\alpha = 560$) for which the following holds. Let $(T_1, T_2)$ be a pair of binary unrooted phylogenetic trees on $X$ that are irreducible under Reduction Rules 1 and 2. Then if