Daniah Tahir (Uppsala) Ingemar Kaj (Uppsala)

(1)

MATHEMATICA APPLICANDA Vol. 48(1) 2020, p. i–xxviii doi: 10.14708/ma.v48i1.6465

Daniah Tahir (Uppsala) Ingemar Kaj (Uppsala)

Krzysztof Bartoszek

^∗

(Linköping) Marta Majchrzak

^∗

(ód¹)

Paweª Parniewski

^∗

(ód¹) Sebastian Sakowski (ód¹)

Using multitype branching models to analyze bacterial pathogenicity

Abstract We apply multitype continuous time Markov branching models to study pathogenicity in E. coli, a bacterium belonging to the genus Escherichia. First, we examine briey the properties of multitype branching processes and we also survey some fundamental limit theorems regarding the behavior of such models under various conditions. These theorems are then applied to discrete, state dependent models in order to analyze pathogenicity in a published clinical data set consisting of 251 strains of E. coli. We use well established methods, incorporating maximum likelihood techniques, to estimate speciation rates as well as the rates of transition between dierent states of the models. From the analysis, we not only derive new results, but we also verify some preexisting notions about virulent behavior in bacterial strains.

2010 Mathematics Subject Classication: Primary: 60B12; Secondary: 60J85.

Key words and phrases: Markov models, branching processes, limit theorems, virulence factors, E. coli strains.

1. Introduction In this paper, we explore the possibility of utilizing the theory of branching processes into analyzing pathogenicity in bacterial strains. For that purpose, we rst review fundamental properties of multitype, continuous time Markov branching processes as well as their behavior in the long time limit. Then, we apply multitype branching models to examine virulence in E. coli strains, and perform an in depth analysis on the limits of proportions of E. coli in dierent states of the models. The strains used in this study were isolated from human hosts, and obtained from a previously published data set (by

Bartoszek et al.

[4]) of pathogenic and nonpathogenic E. coli bacteria.

∗KB was supported by Vetenskapsrådets grant no. 2017-04951.

∗MM was partially supported by IMB PAS.

∗PP was partially supported by IMB PAS.

(2)

ii

Using multitype branching models to analyze bacterial pathogenicity

In recent years, considerable research has been conducted regarding the use of branching processes to explore various biological phenomena. A two

type Markov branching model, coined as the `binary state speciation and extinction' (BiSSE) model, was proposed by Maddison et al. [12] to assess the impact of binary characters on rates of diversicationthe dierence between speciation and extinction rates. The parameters of the BiSSE model describe speciation and extinction in the two types, as well as the transitions that take place from one type to the other. This model was recently used by Bartoszek et al. [4] in order to estimate parameters from genetic data of E. coli populations, and predict pathogenicity of various virulence factors (VFs)agents that enable bacteria to replicate and spread within the host by damaging and eluding its defences [6].

Bartoszek et al.

used a modied version of the BiSSE model, in which the extinction rates were assumed to be zero. Another Markov branching model, named as the `multistate speciation and extinction' (MuSSE) model, was later introduced by FitzJohn [7] as an extension of the BiSSE model to binary traits with more than two states.

In this paper, we apply the MuSSE model with zero extinction rates to estimate state dependent speciation and transition rates for a real, known collection of pathogenic and nonpathogenic strains of E. coli bacteria (obtained from [4]) that reside in the human digestive and urinary tracts. The bacterial strains are subdivided into four categories depending on whether or not they carry a VF in the gut and bladder of the human host. Lately, a number of researchers have successfully used the discrete state MuSSE model for estimation of various parameters. For instance,

Sachs et al.

(2013) used the MuSSE model to estimate trait-dependent diversication and transition rates amongst states consisting of free-living, mutualistic, parasitic, and dual lifestyle bacteria in the Proteobacteria phylum. Using this model, they in- ferred that proteobacterial mutualist lineages arise from freeliving and parasitic ancestors, but rarely transition back to a parasitic or freeliving status.

Pirie et al. [14] implemented the MuSSE framework to compare diversication rates of 800 species in the plant genus Erica, endemic to ve geographical regionsPalearctic, Tropical African, Madagascan, Drakensberg and Cape.

Arbuckle et al.[1] used the BiSSE and MuSSE models to show that speciation and extinction rates vary across defensive traits in amphibians.

With this study, we ask the following questions:

(a) Is it possible to utilize state dependent branching processes to analyze pathogenic behavior in bacteria?

(b) How do maximum likelihood methods behave when estimating parameters (such as speciation rates and transition rates between states) of multitype models?

(c) Based only on a nite sample, do the estimated parameters provide

reasonable information on the almost sure limits of the proportion of

(3)

D. Tahir, I. Kaj, K. Bartoszek, M. Majchrzak, P. Parniewski, S. Sakowski

iii

bacterial strains, and if so, can they be used further to obtain plausible condence regions for these limits?

To answer these questions, we proceed by rst giving a concise survey of n type branching processes. More specically, we recall fundamental theorems regarding the long time behavior of branching models. These limit theorems are obtained from earlier works of Athreya and Ney [2] and Janson [8]. In order to thoroughly explain the mathematical background, which is perti- nent to understanding the forthcoming biological application, we introduce some technical notation as well. For multitype branching models, a matrix known as the mean ospring matrixwhose entries consist of the net growth and transition rates of the processis of central importance; the limit behavior of the process can be completely characterized by this matrix through its eigenvalues and corresponding eigenvectors. In the coming sections, we evaluate the mean ospring matrix for various submodels of the multitype branching process, and present interesting results associated with the largest eigenvalue of these matrices. After reviewing the general characteristics of multitype branching processes, we apply 4type branching submodels to a known clinical data set of virulent and nonvirulent E. coli strains, and, with the help of the MuSSE model, obtain rates of speciation and transition among the 4 states of the models. Using the aforementioned limit theorems, we also perform an in depth analysis on the limits of proportions of E. coli strains in dierent states. Finally, we compare the results obtained from various models and draw some useful inferences regarding pathogenic behavior in virulent bacteria.

2. General multitype branching processes We consider an ntype continuous time Markov branching process X(t), given by the column vector X(t) = X

₁

(t), . . . , X

_n

(t)

0

, t ≥ 0, where each X

i

(t) , i = 1, . . . , n, represents the number of typei particles at time t. The lifetime of each type is assumed to be exponentially distributed with intensity a

i

, i = 1, . . . , n. We introduce a vector a whose components comprise of a

i

, i.e., a = (a

1

, . . . , a

_n

) . With each type i, we also associate j = (j

1

, . . . , j

_n

) , a vector of nonnegative integer coordinates. Then, the ospring distribution of the n types is specied by the coordinates of p(j), where

p(j) = p

⁽¹⁾

(j), . . . , p

⁽ⁿ⁾

(j)

and P

_j

p

⁽ⁱ⁾

(j) = 1, for all i = 1, . . . , n. Here, p

⁽ⁱ⁾

(j) = p

⁽ⁱ⁾

(j

1

, . . . , j

n

) gives the probability that a typei particle creates j

1

type1 ospring, j

2

type2 ospring, . . ., j

n

typen ospring [3]. Letting s = (s

1

, . . . , s

_n

) , the generating function is recognized as f(s) = f

⁽¹⁾

(s), . . . , f

⁽ⁿ⁾

(s), where, for each i = 1, . . . , n ,

f

⁽ⁱ⁾

(s) = X

j

p

⁽ⁱ⁾

(j)s

^j

= X

j1,..., jn≥0

p

⁽ⁱ⁾

(j

₁

, . . . , j

_n

)s

^j₁¹

· . . . · s

^j_nⁿ

(4)

iv

determines the distribution of the number of various types of ospring pro- duced by a typei particle. Further, we consider the matrix A = {a

ik

: i, k = 1, . . . , n}, where

a

_ik

= a

i

∂f

⁽ⁱ⁾

(s)

∂s

_k

s=(1,...,1)

− δ

_ik

and δ

ik

=

1 if i = k 0 otherwise.

The mean matrix of the branching process is given by M (t) = e

^At

= {m

ik

(t) : i, k = 1, . . . , n},

with m

ik

(t) = E[X

k

(t)|X

i

(t) = 1] [3]. Following [8], we identify the mean ospring matrix A as

A = A

^T

,

where T in the superscript denotes the matrix transpose. We let γ be the largest positive eigenvalue of A, and, let u and v be the left and right normalized column eigenvectors, respectively, of A, corresponding to γ. Thus, u

⁰

A = γu

⁰

and Av = γv.

We now recall some limit results for multitype branching processes that will be used in later sections for the analysis of branching models. These results, numbered

1, 2

and

3

here, are stated formally in Appendix

A

of the paper as Theorems

A.3,A.4

and

A.5, respectively.

1. The fundamental limit result for multitype branching processes sig- nies the existence of a nonnegative random variable W such that e

^−γt

X(t) −→ W v

^a.s.

as t → ∞ [2]. This implies that the number of particles increases to innity at a speed e

^γt

, with the distribution of the types specied by a random multiple of the right eigenvector.

2. For a real nonnegative number N, let T

N

be the rst time when the total size of the branching process reaches a given level N > 0. Then, letting C be the sum of the components of v, and under additional assumptions (see Appendix

A), the asymptotic distribution of types at

T

N

is given by the right normalized eigenvector v in the sense that X(T

_N

)/N −→ v/C

^a.s.

as N → ∞ [8].

3. Under further assumptions on the size of the eigenvalues of the ospring matrix A, as N → ∞, the centered and normalized type distribution has a Gaussian limit distribution in the sense that √

N (X(T

_N

)/N − v/C)

−→ N (0, Σ

D _b

), where N (0, Σ

b

) is a multivariate normal distribution and Σ

_b

is the covariance matrix stated explicitly in Appendix

A

[8].

3. A 4type branching model Consider a 4type, continuous time Markov branching process, X(t) = X

1

(t), . . . , X

4

(t)

0

, t ≥ 0, where each

X

i

(t) , i = 1, . . . , 4, gives the number of typei particles at time t. Let λ

i

(5)

v

1 2

4 3

q

12

q

₂₁

m

₃₂

q

₄₃

q

34

λ

₁

λ

4

λ

₂

λ

3

Figure 1:

Diagrammatic representation of the speciation and transition parameters in the 4type branching model X(t).

represent the speciation rate of typei particles, i = 1, . . . , 4. Further, q

12

and q

₂₁

are the rates of transition from type1 → 2 and type2 → 1, respectively, and similarly, q

34

and q

43

the transition rates from type3 → 4 and type

4 → 3 , respectively. Finally, m

32

is assumed to be the transition rate from type3 to type2. The branching rates of X(t) are thus given as

(x

₁

, x

₂

, x

₃

, x

₄

) 7→



 



 



(x

₁

+ 1, x

₂

, x

₃

, x

₄

) λ

₁

x

₁

(x

₁

, x

₂

+ 1, x

₃

, x

₄

) λ

₂

x

₂

(x

1

, x

2

, x

3

+ 1, x

4

) λ

3

x

3

(x

₁

, x

₂

, x

₃

, x

₄

+ 1) λ

₄

x

₄

(x

₁

− 1, x

₂

+ 1, x

₃

, x

₄

) q

₁₂

x

₁

(x

1

+ 1, x

2

− 1, x

₃

, x

4

) q

21

x

2

(x

₁

, x

₂

, x

₃

− 1, x

₄

+ 1) q

₃₄

x

₃

(x

₁

, x

₂

, x

₃

+ 1, x

₄

− 1) q

₄₃

x

₄

(x

1

, x

2

+ 1, x

3

− 1, x

₄

) m

32

x

3

,

(1)

where the initial state X(0) can be either (0, 0, 1, 0)

⁰

or (0, 0, 0, 1)

⁰

, since type

3 and type4 are the dominating types (see Def.

A.1). Figure 1

summarizes the parameters used in this branching process. The motivation behind choos- ing this particular model is that the results derived here will be applied in Section

4

to obtain results and draw inferences from a real data set of E. coli bacterial strains. Note that a 4type branching model would generally consist of 20 parameters: 4 speciation parameters, 4 extinction parameters and 12 parameters representing transition rates between states. Here we have assumed nonextinction, and we have also set 7 (out of 12) transition parameters to zero.

In order to identify the ospring matrix A for this process, we recall notation from Section

2, and let a = (a1

, . . . , a

₄

) , where

a

1

= λ

1

+ q

12

, a

2

= λ

2

+ q

21

, a

3

= λ

3

+ q

34

+ m

32

, a

4

= λ

4

+ q

43

.

(6)

vi

We recognize the ospring distribution of the four states as p

⁽¹⁾

(2, 0, 0, 0) = λ

₁

/a

₁

, p

⁽¹⁾

(0, 1, 0, 0) = q

₁₂

/a

₁

, p

⁽²⁾

(0, 2, 0, 0) = λ

2

/a

2

, p

⁽²⁾

(1, 0, 0, 0) = q

21

/a

2

,

p

⁽³⁾

(0, 0, 2, 0) = λ

3

/a

3

, p

⁽³⁾

(0, 0, 0, 1) = q

34

/a

3

, p

⁽³⁾

(0, 1, 0, 0) = m

32

/a

3

, p

⁽⁴⁾

(0, 0, 0, 2) = λ

4

/a

4

, p

⁽⁴⁾

(0, 0, 1, 0) = q

43

/a

4

.

Using the above information, we are now able to nd the generating functions:

f

⁽¹⁾

(s

₁

, s

₂

, s

₃

, s

₄

) = (λ

₁

s

²₁

+ q

₁₂

s

₂

)/a

₁

, f

⁽²⁾

(s

1

, s

2

, s

3

, s

4

) = (λ

2

s

²₂

+ q

21

s

1

)/a

2

, f

⁽³⁾

(s

₁

, s

₂

, s

₃

, s

₄

) = (λ

₃

s

²₃

+ q

₃₄

s

₄

+ m

₃₂

s

₂

)/a

₃

,

f

⁽⁴⁾

(s

₁

, s

₂

, s

₃

, s

₄

) = (λ

₄

s

²₄

+ q

₄₃

s

₃

)/a

₄

. The mean ospring matrix A for this process is dened as

A =







δ

₁

q

₂₁

0 0 q

12

δ

2

m

32

0 0 0 δ

3

q

43

0 0 q

₃₄

δ

₄





 ,

where δ

1

= λ

₁

− q

₁₂

, δ

2

= λ

₂

− q

₂₁

, δ

3

= λ

₃

− m

₃₂

− q

₃₄

and δ

4

= λ

₄

− q

₄₃

. Letting

H

₁

= δ

₁

+ δ

₂

, H

₂

= δ

₃

+ δ

₄

, S

1

= p

(δ

1

− δ

₂

)

²

+ 4q

12

q

21

, S

2

= p

(δ

3

− δ

₄

)

²

+ 4q

34

q

43

, the eigenvalues of A, obtained using Maple 18.00 [13], are

γ

₁

= 1

2 (H

₁

+ S

₁

), γ

₂

= 1

2 (H

₁

− S

₁

), γ

₃

= 1

2 (H

₂

+ S

₂

), γ

₄

= 1

2 (H

₂

− S

₂

).

We have

γ

1

≥ max{δ

1

, δ

2

} ≥ min{δ

1

, δ

2

} ≥ γ

₂

, γ

3

≥ max{δ

3

, δ

4

} ≥ min{δ

3

, δ

4

} ≥ γ

₄

. It can be seen that γ

1

≥ 0 and γ

3

≥ 0 for any set of parameters. Also, γ

1

> 0 if at least one of δ

1

and δ

2

is strictly positive, or if both are negative, then one is strictly negative, and a similar result holds for γ

3

. Further, we also require the eigenvectors of A to be used later in the application of limit theorems

A.3A.5. Thus, using again Maple 18.00 [13], the left column eigenvectors of

A , after substantial manual simplication, are found as

u

1

=

¹

2

(G

1

+ S

1

)(γ

1

G

3

+ G

4

)

q

₂₁

m

₃₂

q

₄₃

, γ

1

G

3

+ G

4

q

₄₃

m

₃₂

, γ

1

− λ

₄

+ q

43

q

₄₃

, 1

0

,

(7)

vii

u

2

=

−

¹₂

(−G

1

+ S

1

)(γ

2

G

3

+ G

4

) q

21

m

32

q

43

, γ

2

G

3

+ G

4

q

43

m

32

, γ

2

− λ

₄

+ q

43

q

43

, 1

0

, u

3

=

0, 0, − 1

2q

₄₃

G

2

− S

₂

, 1

0

, u

4

=

0, 0, − 1

2q

₄₃

G

2

+ S

2

, 1

0

, and similarly, the right column eigenvectors are computed as

v

₁

=

1, − 1

2q

₂₁

G

₁

− S

₁

, 0, 0

0

, v

₂

=

1, − 1

2q

₂₁

G

₁

+ S

₁

, 0, 0

0

,

v

₃

=

1, γ

₃

− λ

₁

+ q

₁₂

q

21

, γ

₃

G

₃

+ G

₄

−q

₂₁

m

32

,

1

2

(G

₂

+ S

₂

)(γ

₃

G

₃

+ G

₄

)

−q

₂₁

m

32

q

43

0

,

v

₄

=

1, γ

4

− λ

₁

+ q

12

q

₂₁

, γ

4

G

3

+ G

4

−q

₂₁

m

₃₂

,

1

2

(G

₂

− S

₂

)(γ

₄

G

₃

+ G

₄

)

−q

₂₁

m

₃₂

q

₄₃

0

, where G

1

= δ

₁

− δ

₂

, G

2

= δ

₄

− δ

₃

, G

3

= H

₁

− H

₂

, and G

4

= δ

₃

δ

₄

− δ

₁

δ

₂

− q

43

(m

32

+ q

34

) + q

12

q

21

+ q

43

m

32

. It should be noted that for certain parameter settings, where one of q

21

, q

43

or m

32

vanish, the above expressions do not apply. Instead, alternate formulae have to be found on a case by case basis.

Now let γ = max{γ

1

, γ

₃

} be the largest positive eigenvalue, and let v = (ν

₁

, ν

₂

, ν

₃

, ν

₄

)

⁰

be the normalized form of the corresponding right eigenvector.

Then, from Thm.

A.3

(and as described in limit result

1), there exists a

nonnegative random variable W , such that as t → ∞,

X

₁

(t)e

^−γt

, . . . , X

₄

(t)e

^−γt

0 a.s.

−→ (ν

₁

W, . . . , ν

₄

W )

⁰

.

Moreover, let ν

1

+ ν

₂

+ ν

₃

+ ν

₄

= C > 0 and recall that T

N

is the rst time when the total number of species reaches a level N. Using limit result

2

(or Thm.

A.4), we have that as N → ∞,

X(T

_N

)

N = X

₁

(T

_N

), . . . , X

₄

(T

_N

)

0

N

−→

a.s.

(ν

₁

, . . . , ν

₄

)

⁰

C . (2)

In order to apply the limit result

3

(see also Thm.A.5 in Appendix

A), we need

more information on the variance characteristics of the branching process. For that purpose, consider rst the column vectors, ξ

i

, i = 1, . . . , 4, given as

ξ

1

=

(1, 0, 0, 0)

⁰

with probability λ

1

/a

₁

(−1, 1, 0, 0)

⁰

-"- q

12

/a

1

, ξ

2

=

(0, 1, 0, 0)

⁰

with probability λ

2

/a

2

(1, −1, 0, 0)

⁰

-"- q

₂₁

/a

₂

,

ξ

3

=







(0, 0, 1, 0)

⁰

with probability λ

3

/a

₃

(0, 0, −1, 1)

⁰

-"- q

34

/a

3

(0, 1, −1, 0)

⁰

-"- m

₃₂

/a

₃

,

(8)

viii

ξ

4

=

(0, 0, 0, 1)

⁰

with probability λ

4

/a

4

(0, 0, 1, −1)

⁰

-"- q

₄₃

/a

₄

. Next, dene a matrix B as

B =

4

X

i=1

ν

i

a

i

E(ξ

i

ξ

_i⁰

) =







b

₁

−b

₂

0 0

−b

₂

b

3

−b

₄

0 0 −b

₄

b

₅

−b

₆

0 0 −b

₆

b

₇





 ,

where b

1

= a

₁

ν

₁

+ q

₂₁

ν

₂

, b

2

= q

₁₂

ν

₁

+ q

₂₁

ν

₂

, b

3

= a

₂

ν

₂

+ q

₁₂

ν

₁

+ m

₃₂

ν

₃

, b

4

= m

32

ν

3

, b

5

= a

3

ν

3

+ q

43

ν

4

, b

6

= q

34

ν

3

+ q

43

ν

4

and b

7

= a

4

ν

4

+ q

34

ν

3

. Now, since A is a 4 × 4 matrix with 4 distinct eigenvalues, it is diagonalizable.

Hence, using Eq. (7) from Appendix

A, the matrix ΣI

is specied by

Σ

I

= X

j:γj<^γ₂

X

k:γk<^γ₂

ˆ u

⁰_j

B ˆ u

k

γ − γ

_j

− γ

_k

v ˆ

j

ˆ v

⁰_k

, where column vectors ˆu

i

and ˆv

i

are determined as

ˆ

u

i

= u

_i

u

i

· v

_i

and ˆv

i

= v

i

, i = 1, . . . , 4.

Finally using Eq. (6), the covariance matrix Σ

b

is given by Σ

_b

= 1

C

³

M

₁

Σ

_I

M

₂

,

where M

1

=







ν

₁

− C ν

₁

ν

₁

ν

₁

ν

2

ν

2

− C ν

2

ν

2

ν

₃

ν

₃

ν

₃

− C ν

₃

ν

₄

ν

₄

ν

₄

ν

₄

− C





 and M

2

= M

₁^T

. Under the condition that γ is greater than two times the second largest eigenvalue, and using the central limit theorem

A.5, we have that

√ N

X

1

(T

N

), . . . , X

4

(T

N

)

0

N − (ν

1

, . . . , ν

4

)

⁰

C

D

−→ N (0, Σ

_b

), (3) as N → ∞.

In the following section, we apply the above 4type model to a clinical data set of 251 bacterial strains, and obtain insightful results concerning pathogenicity in E. coli bacteria. We will take N = 251, so that T

N

is the

rst time when the total size of the branching process reaches 251, and hence we can obtain the proportion of strains in various states of the model.

4. Application of the branching model to E. coli strains data

From [4], we obtain an E. coli data set of N = 251 strains, which forms the

(9)

ix

Virulence Factor (VF) N1 N2 N3 N4

astA - heat-stable enterotoxin 1 108 20 7 116 cnf1 - cytotoxic necrotizing factor 1 90 38 3 120

mG - mbrial protein 13 115 120 3

fyuA - pesticin receptor protein 50 78 74 49

hly1 - alpha hemolysin 88 40 5 118

iroN - IroN protein 44 84 36 87

iutA - ferric aerobactin receptor 66 62 77 46

papC - mbrial protein 83 45 33 90

sat - secreted autotransporter toxin 113 15 46 77

Table 1:

A list of various VFs and the number of strains, Ni, in each of the four states obtained from [4]. Note that N1+ . . . + N4= 251for each VF.

tips of a given phylogenetic tree (see Fig. 1 and Fig. 3 in [4]). The tree is

xed and describes the genealogical structure of the strains. The tree tips are grouped into 4 categories: pathogenic and nonpathogenic E. coli found in the human gastrointestinal tract, and, pathogenic and nonpathogenic E. coli found in the human urinary tract. Whether a bacterial strain is pathogenic or not in any environment, depends on whether or not it is positive for carrying a certain VF (such as toxins, invasins, hemolysins, etc.). Hence, the strains are divided into the following 4 states

1 2 3 4

U

0

U

1

K

1

K

0

where

U

₀

: negative for VF in the urinary tract, U

₁

: positive for VF in the urinary tract, K

1

: positive for VF in the digestive tract, K

₀

: negative for VF in the digestive tract.

From the data given in [4], we choose to analyze 9 VFs. These VFs along with the number of strains, N

i

, in each state i, i = 1, . . . , 4, are given in Table

1. We

model this data set by applying a 4type, continuous time Markov branching process X(t) = (U

0

, U

1

, K

1

, K

0

)

⁰

and branching rates as given in Eq. (1).

Figure

1

gives a diagrammatic representation of the parameters used in the model. In the gure, λ

i

, i = 1, . . . , 4, represent speciation rates of strains in various states, q

12

, q

21

, q

34

, q

43

are the transition rates from pathogenic to nonpathogenic states and vice versa, and nally, m

32

represents migration from state K

1

to U

1

. As in [4], extinction rates are set to zero for all 4 states, and, pathogenic and nonpathogenic strains are allowed to transition back and forth in the same environment (urinary tract or intestine). It is well known that E. coli travel from the gastrointestinal tract to the bladder in the human host, and cause urinary tract infections ([5], [10]). Hence, we assume that E.

coli migrate from states K

1

to U

1

.

(10)

x

λ1 λ2 λ3 λ4 q12 q21 q34 q43 m32

1. λ1 λ2 λ3 λ4 q12 q12 q34 q43 m32

2. λ1 λ2 λ3 λ4 q12 q21 q34 q34 m32

3. λ1 λ2 λ3 λ4 q12 q12 q34 q34 m32

4. λ1 λ2 λ3 λ4 0 q21 q34 q43 m32

5. λ1 λ2 λ3 λ4 q12 0 q34 q43 m32

6. λ1 λ2 λ3 λ4 q12 q21 0 q43 m32

7. λ1 λ2 λ3 λ4 q12 q21 q34 0 m32

8. λ1 λ2 λ3 λ4 q12 q12 0 q43 m32

9. λ1 λ2 λ3 λ4 q12 q12 q34 0 m32

10. λ1 λ₂ λ₃ λ₄ 0 q₂₁ q₃₄ q₃₄ m₃₂ 11. λ1 λ₂ λ₃ λ₄ q₁₂ 0 q₃₄ q₃₄ m₃₂ 12. λ1 λ₁ λ₃ λ₄ q₁₂ q₂₁ q₃₄ q₄₃ m₃₂ 13. λ1 λ₂ λ₃ λ₃ q₁₂ q₂₁ q₃₄ q₄₃ m₃₂ 14. λ1 λ₁ λ₃ λ₄ q₁₂ q₁₂ q₃₄ q₄₃ m₃₂ 15. λ1 λ2 λ3 λ3 q12 q21 q34 q34 m32

16. λ1 λ1 λ3 λ3 q12 q12 q34 q34 m32

17. 0 λ2 λ3 λ4 q12 q21 q34 q43 m32

18. λ1 0 λ3 λ4 q12 q21 q34 q43 m32

19. λ1 λ2 0 λ4 q12 q21 q34 q43 m32

20. λ1 λ2 λ3 0 q12 q21 q34 q43 m32

Table 2:

A list of 20 parameter constraints used in the analysis. The rst row gives parameters for a model in which no constraints are applied, i.e., all parameters are allowed to vary freely. The subsequent rows represent models in which at least one constraint is used: a parameter is either constrained to be zero, or set equal to another parameter. In each row succeeding the rst, which is to be used as a reference row in this table, the constrained parameters are highlighted in bold. For example, in the row marked (1), q21= q₁₂, while the remaining parameters are free to vary; in row (2), q43= q₃₄, etc.

All subsequent analysis of the bacterial data set is carried out in R [15].

The discrete MuSSE model [7] is applied to the data set of each VF to obtain estimates for all parameters. This is achieved by making use of the

`make.musse()' function in the R package diversitree [7], which allows for maximum likelihood estimation of the models' parameters. During the parameter estimation analysis of each VF, the initial state for the process is determined by the relative probability of observing type3 and type4 strains, i.e., 0, 0, N

3

/(N

3

+ N

4

), N

4

/(N

3

+ N

4

)

0

. To increase the estimation power of the MuSSE framework, we try out various models in which dierent constraints are applied on the parameters (the extinction rates already being set to zero), as given in Table

2. From the table, it can be seen that parameters

are either set to be zero, or, pairs of parameters are constrained to be equal.

This is done by utilizing the `constrain()' function in the diversitree pack-

age. The most suitable constraint on the parameters is then chosen using the

Bayesian information criterion or BIC [17], that is, for each VF separately,

(11)

xi

VF λ1 λ2 λ3 λ4 q12 q21 q34 q43 m32

astA 26.39 0.000 229.6 0 176.0 904.4 407.4 41.68 65.53 cnf1 5.840 108.3 284.8 8.066 1.185 144.6 341.3 14.92 121.5

mG 4.105 33.75 96.99 3.737 11.74 11.74 3.715 3.715 35.18 fyuA 0.464 51.71 120.7 4.242 0 38.36 47.40 6.445 46.71 hly1 4.630 101.7 265.6 10.58 0 131.1 287.9 17.77 110.4 iroN 1.843 48.50 164.0 0 21.20 60.93 178.0 20.55 51.44 iutA 0.000 64.93 126.6 3.923 49.07 157.0 43.34 5.527 46.21 papC 4.314 100.7 164.9 8.197 1.915 128.5 152.7 16.06 59.41 sat 6.059 165.6 157.4 6.493 0 307.3 102.4 9.691 65.62

Table 3:

Parameter values for all VFs in the branching model X(t).

we choose that combination of constrained (or freely varying) parameter estimates which give the lowest BIC.

Results The parameter values obtained as a result of the analysis are given in Table

3. We conclude that for 6 out of 9 VFs, whenever the parameters are

constrained in some manner, the model is a better t to the given data set, in contrast to when all the parameters are allowed to vary freely. From the parameter values in Table

3, we have that:

(a) λ

2

> λ

1

for 8 out of 9 VFs, and λ

3

> λ

4

for all VFs. Thus, virulent strains of E. coli, in both urinary and digestive environments, speciate at a higher rate than nonvirulent strains.

(b) q

21

≥ q

₁₂

and q

34

≥ q

₄₃

for all VFs. Thus, E.coli bacteria, in both digestive and urinary tracts, lose their pathogenicity at a higher rate as compared to gaining it.

Using the mathematical analyses in previous sections and Appendix

A, we

also nd that in accordance with our assumptions (F1) to (F6), γ

3

> 0 is the largest eigenvalue for all VFs and v

3

> 0 the corresponding right eigenvector.

Let v = (ν

1

, . . . , ν

₄

)

⁰

be the normalized version of v

3

, and let C = ν

1

+. . .+ν

₄

, as before. Then, from Thm.

A.4

and using Eq. (2), we have for N = 251,

(N

1

, . . . , N

4

)

⁰

N

−→ p = (p

a.s. ₁

, . . . , p

4

)

⁰

,

where p = (p

1

, . . . , p

4

)

⁰

:= v/C is the limit of the proportions of E. coli

strains. For all VFs, the values of N

i

/N and p

i

, i = 1, . . . , 4, are given in

Table

4. The sum, p2

+ p

₃

, gives the probability of maintaining each VF in

the E. coli strains. From Table

4, we infer that the probability of maintaining

VFs varies in the strains; it depends on the VF under consideration. Figure

2

compares the value p

2

+ p

₃

, with the sum N

2

/N + N

₃

/N . It can be seen that

E. coli strains carrying VFs mG, fyuA and iutA have a higher probability

of being virulent as compared to strains carrying cnf1, hly1 and sat.

(12)

xii

VF N1/N N2/N N3/N N4/N p1 p2 p3 p4

astA 0.43 0.08 0.03 0.46 0.83 0.15 0.01 0.01 cnf1 0.36 0.15 0.01 0.48 0.60 0.06 0.02 0.32

mG 0.05 0.46 0.48 0.01 0.08 0.45 0.44 0.03 fyuA 0.20 0.31 0.29 0.20 0.35 0.32 0.15 0.18 hly1 0.35 0.16 0.02 0.47 0.51 0.08 0.04 0.37 iroN 0.18 0.33 0.14 0.35 0.52 0.35 0.02 0.11 iutA 0.26 0.25 0.31 0.18 0.34 0.20 0.23 0.23 papC 0.33 0.18 0.13 0.36 0.52 0.10 0.07 0.31 sat 0.45 0.06 0.18 0.31 0.55 0.03 0.09 0.33

Table 4:

Limit values, pi, for the the branching model X(t), and the proportion of strains, Ni/N, in each state, i = 1, . . . , 4. Here, N = 251, Ni is the number of strains in each state i, and pi = νi/C, where νi (i = 1, .., 4) are the components of the normalized eigenvector v and C = P⁴_i=1ν_i.

astA cnf1 fimG fyuA hly1 iroN iutA papC sat

Virulence factors p2+p3 and (N2+N3)N 0.00.20.40.60.81.0

Figure 2:

A bar chart comparing the probabilities, p2+ p3, of maintaining VFs in bacterial strains, with the sum, (N2+ N3)/N, of strain proportions. The values used in this gure are obtained from Table 4. p2+ p3 is represented by grey bars and (N2+ N3)/N by black bars.

Condence regions We now apply Thm.

A.5

to construct condence regions for p the limit of proportions of E. coli strains. Using the expression in Eq. (3), let

D

²

= N X(T f

_N

) N − e v

C

!

0

( e Σ

_b

)

⁻¹

X(T f

_N

) N − e v

C

!

(4)

where vectors and matrices with the rst type removed are denoted by a tilde above them, that is, X(T f

N

) = X

2

(T

N

), . . . , X

4

(T

N

)

0

= (N

2

, . . . , N

4

)

⁰

,

v = (ν e

2

, . . . , ν

4

)

⁰

, and Σ e

_b

= Σ

_b

with the rst row and column removed.

(13)

xiii

VF D² (N2, N3, N4)/N a b c

astA - - - - -

cnf1 10.35 (0.15, 0.01, 0.48) 0.651 0.091 0.046

mG 3.421 (0.46, 0.48, 0.01) 0.352 0.034 0.021 fyuA 5.655 (0.31, 0.29, 0.20) 0.409 0.135 0.101 hly1 8.963 (0.16, 0.02, 0.47) 0.280 0.098 0.060

iroN - - - - -

iutA 5.989 (0.25, 0.31, 0.18) 0.488 0.127 0.061 papC 8.499 (0.18, 0.13, 0.36) 0.362 0.097 0.081 sat 6.589 (0.06, 0.18, 0.31) 0.172 0.098 0.045

Table 5:

Specication of condence ellipsoids for limits of proportions of E. coli strains. For a signicance level α = 0.01, the observed generalized squared distance D², dened by Eq. (4) and (5), is given as D² ≤ χ²₃(0.01) = 11.345. The center of the condence ellipsoids is represented by (N2, N3, N4)/N, while the half-lengths of the axes are denoted by a, b, and c.

Here, it is important to notice that the matrix Σ

b

has rank 3, hence, for the construction of the subsequent condence regions, we have removed the counts for the rst type (alternatively, we can choose to remove any one of the four types) in Eq.(4). From the theory developed in Chapters 4.2 and 5.4 of [9], at a given signicance level α, a 100(1 − α)% joint condence region for pwhich can be thought of as the mean of a multidimensional normal distributionis dened by ellipsoids such that,

D

²

≤ χ

²₃

(α), (5)

where χ

²₃

(α) is the upper αlevel quantile of the χ

²

distribution with 3 degrees of freedom. The quantity D

²

represents the square of the generalized distance from the centre of the condence ellipsoid to a constant density surface. For various VFs, the observed D

²

value is calculated using Eq. (4), and is given in the second column of Table

5. It can be seen that for astA and iroN,

we are unable to nd D

²

, since the conditions of Thm.

A.5

are not met for these two VFs; the second largest eigenvalue γ

1

is greater than γ

3

/2 , i.e., γ

3

/2 < γ

1

< γ

3

. In this case, weak convergence is shown but to a random variable, whose distribution is not characterized, only its existence (Corollary 3.18 in [8]).

The axes of the condence ellipsoids and their relative lengths are determined using the eigenvalues and corresponding eigenvectors of the positive denite matrix Σ e

_b

[9]. Since Σ e

_b

is a 3 × 3 matrix with the rst type removed, we can nd a simultaneous condence ellipsoid for p

i

when i = 2, 3 and 4. Let β , ζ, and η be the positive eigenvalues, and ˆβ, ˆζ, and ˆη, the corresponding right eigenvectors of Σ e

_b

. Then, the axes of the condence ellipsoids centered at X(T f

_N

)/N , are given as

±a ˆ β , ±b ˆ ζ , and ± c ˆη,

(14)

xiv

with a = Dpβ/N, b = Dpζ/N, and c = Dpη/N being the half length of the three axes. For various VFs, the axes lengths can be computed, as shown in Table

5.

Since Thm.

A.5

is an asymptotic result, and our sample size consists of only 251 data points, we must conrm that Thm.

A.5

has been successfully implemented while nding the condence regions in the aforementioned analysis. To achieve this, we check how the observed proportions (N

i

/N ) behave for the observed number of strains (N = 251) by making use of simulated trees. For the estimated model parameters, given in Table

3, we obtain the

clades evolution using the `tree.musse()' function of the diversitree package in R. For various VFs, excluding astA and iroN, we simulate 10000 trees, each with 251 tips. To ensure that the root of all simulated trees is a dominating type (assumption (F5) in Appendix

A), we take the prior distribution to be

concentrated on states 3 and 4, with probabilities equaling the observed relative proportions of types3 and 4. For this, we use the `sample()' function of

cnf1

0100020003000

Frequency

0 10 20 30 40

χ²

D²=10.35

fimG

010002000

Frequency

0 10 20 30 40

χ²

D²=3.421

fyuA

010002000

Frequency

0 10 20

χ²

D²=5.655

hly1

010002000

Frequency

0 10 20 30 40

χ²

D²=8.963

iutA

010002000

Frequency

0 10 20 30

χ²

D²=5.989

papC

010002000

Frequency

0 10 20 30

χ²

D²=8.499

sat

010002000

Frequency

0 10 20 30 40

χ²

D²=6.589

Figure 3:

For various VFs, histograms showing the distribution of generalized squared distances for simulated trees. The values less than χ²3(0.01) = 11.345 are represented by white bars, and those greater than χ²3(0.01)are given in black. Arrows in each histogram represent the observed D² value as obtained in Table5.

(15)

xv

VF Nsim Fsim fsim

astA - - -

cnf1 7958 0.031 0.02

mG 6588 0.236 0.01 fyuA 7276 0.074 0.01 hly1 8073 0.046 0.02

iroN - - -

iutA 7405 0.071 0.01 papC 7967 0.096 0.01 sat 7583 0.128 0.03

Table 6:

Values obtained from simulating 10000 trees: Nsimdenotes the total number of simulations, out of 10000, that are actually used in the analysis, Fsim is that fraction of simulations for which the generalized square distance is greater than the observed D² value (given in Table 5), and fsim is that fraction of simulations for which the generalized square distance is greater than χ²3(0.01) = 11.345.

R which randomly selects either of the two states with desired probabilities.

Out of the 10000 simulations, we take into consideration in the subsequent analysis, only those trees in which the observed tips have at least one obser- vation of type3 or 4. This is due to the essential nonextinction assumption of Thm.

A.5

(cf. Def.

A.2). For each VF, the number of simulations that we

use out of 10000 is denoted by N

sim

, and shown in Table

6. From the 251 tip

counts of each simulated tree, we calculate the proportions of types, and then obtain the D

²

value given in Eq. (4), i.e., the squared generalized distance from the simulated proportions (with the rst coordinate removed) which lie on the surface of an ellipsoid, to the ellipsoid's centre (at v/C). Using the χ e

²

distribution with 3 degrees of freedom, we obtain from Eq. (5), how `far in the tail' the observed D

²

values lie.

The results are shown in the form of histograms in Figure

3. We con-

clude that the observed proportions are close to the ellipsoid's centre, i.e., the quantile corresponds to a level greater than a cuto value of α = 0.01, and thus for the given number of strains, N = 251, we are indeed close to the asymptotic regime. From the histograms, we obtain the fraction of simulations, F

sim

, which give a value of the generalized square distance greater than the observed value of D

²

, as well as the fraction of simulations, f

sim

, for which the generalized squared distance is greater than the critical value, χ

²₃

(0.01) = 11.345 . For each VF, these values are shown in Table

6.

5. Two variations of the branching model We now model the same

E. coli data set [4] by applying another 4type, Markov branching process

Y (t) = (U

₀

, U

₁

, K

₁

, K

₀

)

⁰

with branching rates as shown in Figure

4. All pa-

rameters of speciation and transition are the same as in the original process

X(t), except for the migration rate, m

32

, which is replaced by m

42

in this

case. Since E. coli are assumed to travel from the intestinal to the urinary

(16)

xvi

U

0

1 U

1

2 K

₀

4 3

K

₁

q

12

q

₂₁

q

₄₃

q

34

m

₄₂

λ

₁

λ

₄

λ

₂

λ

₃

Figure 4:

Diagrammatic representation of the branching process Y (t). The four states are similar to the ones in the branching model X(t). The parameters λi, i = 1, . . . , 4, represent speciation rates in each state, q12, q₂₁, q₃₄ and q43 are the transition rates between states, m42 represents migration from state K0 to U1.

tract and cause infections, here we assume that bacteria migrate from the nonvirulent state K

0

to the virulent state U

1

. The mean ospring matrix, ˆ A , for this process is given as

A = ˆ







λ

1

− q

₁₂

q

21

0 0

q

12

λ

2

− q

₂₁

0 m

42

0 0 λ

₃

− q

₃₄

q

₄₃

0 0 q

₃₄

λ

₄

− q

₄₃

− m

₄₂





 ,

with eigenvalues ˆ

γ

₁

= 1 2

λ

₁

− q

₁₂

+ λ

₂

− q

₂₁

+ p

(λ

₁

− q

₁₂

− λ

₂

+ q

₂₁

)

²

+ 4q

₁₂

q

₂₁

, ˆ

γ

₂

= 1 2

λ

₁

− q

₁₂

+ λ

₂

− q

₂₁

− p

(λ

₁

− q

₁₂

− λ

₂

+ q

₂₁

)

²

+ 4q

₁₂

q

₂₁

, ˆ

γ

₃

= 1 2

λ

₃

− m

₄₂

− q

₃₄

+ λ

₄

− q

₄₃

+ p

(λ

₃

+ m

₄₂

− q

₃₄

− λ

₄

+ q

₄₃

)

²

+ 4q

₃₄

q

₄₃

, ˆ

γ

₄

= 1 2

λ

₃

− m

₄₂

− q

₃₄

+ λ

₄

− q

₄₃

− p

(λ

₃

+ m

₄₂

− q

₃₄

− λ

₄

+ q

₄₃

)

²

+ 4q

₃₄

q

₄₃

. Using Maple 18.00 [13], the left and right column eigenvectors of ˆ A are obtained as

ˆ u

1

=

¹

2

(G

1

+ S

1

)(ˆ γ

1

G ˆ

3

+ ˆ G

4

) q

21

m

42

q

34

, (ˆ γ

1

G ˆ

3

+ ˆ G

4

) q

34

m

42

, 1, ˆ γ

1

− λ

₃

+ q

34

q

34

0

,

ˆ u

2

=

¹

2

(G

1

− S

₁

)(ˆ γ

2

G ˆ

3

+ ˆ G

4

) q

21

m

42

q

34

, (ˆ γ

2

G ˆ

3

+ ˆ G

4

) q

34

m

42

, 1, ˆ γ

2

− λ

₃

+ q

34

q

34

0

, ˆ

u

3

=

0, 0, 1, 1

2q

₃₄

G ˆ

2

+ ˆ S

2

0

, u ˆ

4

=

0, 0, 1, 1

2q

₃₄

G ˆ

2

− ˆ S

2

0

,

Daniah Tahir (Uppsala) Ingemar Kaj (Uppsala)