
The notation used here is partially based on the book "Spoken Language Processing: A Guide to Theory, Algorithm and System Development" by Huang, Acero and Hon (section 4.4.2 and chapter 8) [25].

3.1.1. Definitions and notation

Definition 1. Let us define a Markov chain by:

• Ω = {1, 2, ..., N} - the set of states

• π = {πi} - the initial state distribution (size N)

• A = {aij} - the matrix of transition probabilities (size N × N)

A Markov chain is used to model sequences of events. In each step the model is in one of the states from the set Ω. We will denote by Xt the state in the t-th step; hence, X1, X2, ..., XT denotes the sequence of states that the model was in from the first step to step T. By convention, the last step under consideration is usually denoted T, and I will use that notation here.

πi represents the probability that the Markov chain will start in state i, i.e.:

πi = P (X1 = i).

Elements of the transition matrix are defined as follows:

aij = P(Xt = j | Xt−1 = i),

i.e. the element in the i-th row and j-th column reflects the probability of transitioning from state i to state j. It is important to emphasise that this probability does not depend on t itself, i.e. for any pair of states i, j the probability of transitioning from i to j is the same in every step.

In a Markov chain the state in the t-th step depends only on the state in the (t − 1)-th step:

P(Xt | X1, X2, ..., Xt−1) = P(Xt | Xt−1) (3.1)

where P(Xt) should be interpreted as "the probability that at step t the model is in the state Xt". More formally, we could write it as P(Xt = xt), where Xt would be the general notation for the state the model is in at the t-th step, while xt would represent a specific value from the set Ω, observed at step t in the sequence of states under consideration. While this notation is more rigorous, its verbosity can be confusing, and hence I will use the simplified one.

The property described by equation 3.1 is called the Markov property, and processes with that property (i.e. processes that can be modelled by a Markov chain) are called Markov processes.

In general, a Markov chain can be defined with continuous time, but here we will only consider models with discrete time, so I will not present the broader definition.
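To make the definition concrete, here is a minimal sketch (in Python with NumPy; the function name and interface are illustrative, not taken from this dissertation) of sampling a state sequence from a Markov chain (Ω, π, A):

```python
import numpy as np

def sample_markov_chain(pi, A, T, rng=None):
    """Sample X_1, ..., X_T from a Markov chain with initial distribution
    pi (size N) and transition matrix A (N x N); states are 0..N-1."""
    rng = rng or np.random.default_rng()
    X = np.empty(T, dtype=int)
    X[0] = rng.choice(len(pi), p=pi)               # X_1 ~ pi
    for t in range(1, T):
        X[t] = rng.choice(len(pi), p=A[X[t - 1]])  # X_t ~ A[X_{t-1}, :]
    return X

# Example: a two-state chain that tends to stay in its current state.
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(sample_markov_chain(pi, A, T=10))
```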

Definition 2. Let us define a Hidden Markov Model (HMM) by:

• a Markov chain: (Ω, π, A)

• a space S of observable emissions

• characteristics of emissions B

A Hidden Markov Model can be used to simulate Markov processes that have hidden and observed variables. Similarly to a Markov chain, in every step the model is in some state, but the sequence of states is not directly observed. Instead, in every step the model emits some value. Yt will denote the value emitted in the t-th step. Similarly to the notation for the probability of the model being in some state, for simplicity I will denote the probability that in the t-th step the model emitted signal Yt as P(Yt).

Figure 3.1: Example Hidden Markov Model. The circles represent states, and the squares represent values emitted by the states. The green arrows denote transitions between the states. Each green arrow is labelled with a symbol axy, denoting the probability of transition between the given states. Values X1, X2 and X3 are random variables, distributed according to some distribution (either discrete or continuous).

The emission in every step depends on the state the HMM is in, and only on this state, i.e. it is independent of the whole previous sequence of states and emissions. It can be described with the following equation:

P(Yt | X1, ..., Xt, Y1, ..., Yt−1) = P(Yt | Xt) (3.2)

The emissions can be described by a discrete or continuous distribution. I will denote the probability measure as B. In the case of discrete distributions, B can be represented as a matrix with |Ω| rows and |S| columns; element bij would denote the probability of emitting symbol j while the model is in state i. In the case of continuous distributions, B can be written as a set of tuples of parameters, one tuple per state. For example, if all the states emit signals from a Gaussian distribution, which is parametrised by mean µ and standard deviation σ, B would be a set of |Ω| tuples (µi, σi), where i denotes the state whose emissions are described by the parameters. In this dissertation, I will use bi(j) to denote the probability that the HMM will emit symbol j while in state i.

In figure 3.1 I present a schematic representation of an example Hidden Markov Model.

In the following sections, I use θ to denote the set of parameters describing an HMM, specifically θ = {A, B, π} - the matrix of transitions, the probability measure of emissions and the initial state distribution. The number of states, the alphabet of emissions and the family of distributions allowed for emissions will be considered hyperparameters. Whenever it is stated that "parameters of the HMM are unknown", it means the set θ; the hyperparameters will be assumed to be known. Obviously, in many cases they are not, but they require a separate procedure of tuning and adjusting that will not be covered in the following sections. I will write more about choosing the number of states and the emission distribution later, in sections 3.2 and 6.3.
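To illustrate the role of θ = {A, B, π}, below is a sketch of a discrete-emission HMM generating data (Python/NumPy; again an illustrative sketch, with B stored as the |Ω| × |S| matrix described above):

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng=None):
    """Sample hidden states X and emissions Y of length T from a
    discrete-emission HMM with parameters theta = {A, B, pi}."""
    rng = rng or np.random.default_rng()
    N, S = B.shape
    X = np.empty(T, dtype=int)
    Y = np.empty(T, dtype=int)
    X[0] = rng.choice(N, p=pi)
    Y[0] = rng.choice(S, p=B[X[0]])
    for t in range(1, T):
        X[t] = rng.choice(N, p=A[X[t - 1]])   # hidden transition
        Y[t] = rng.choice(S, p=B[X[t]])       # observed emission
    return X, Y
```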

3.1.2. Forward algorithm

Let us have a Hidden Markov Model with known parameters θ and a sequence of emissions Y = Y1, Y2, ..., YT emitted by it. The sequence of states X = X1, X2, ..., XT is unknown.

The forward algorithm answers the question: what is the probability that the model generated the sequence Y? We will denote this probability as P(Y | θ). It can also be interpreted as the likelihood of the model given the observed data, and in that interpretation we will denote it as L(θ|Y). Note that the probability is often denoted as Pθ(Y), to reflect the fact that θ is not a random event or variable, while the likelihood is often denoted as L(Y, θ), i.e. a function of both the observation and the parameters. Throughout this dissertation, however, I will use the first introduced notation.

The forward algorithm is an iterative algorithm. In every iteration t, for every state i we calculate the probability that the model generated the observed sequence up to step t and ended in state i. We will denote this probability, often called the forward probability, as α:

αt(i) = P(Y1, Y2, ..., Yt−1, Yt, Xt = i | θ) (3.3)

The initial step is to calculate this forward probability for every state for the first step:

α1(i) = P(Y1, X1 = i | θ) = P(Y1 | X1 = i, θ) P(X1 = i | θ) = bi(Y1) πi (3.4)

i.e. for the first step it is the probability of emitting symbol Y1 (the first observed symbol) while in state i, multiplied by the probability that the sequence of states started in state i.

For every step t > 1, the probability that the HMM is in state i at step t and has emitted the sequence Y1, ..., Yt is equal to the probability that it emitted signal Yt in state i, times the probability that it emitted the sequence Y1, ..., Yt−1 and arrived in state i at step t:

αt(i) = P(Yt | Xt = i, θ) P(Y1, ..., Yt−1, Xt = i | θ) (3.5)

The first factor is simply equal to bi(Yt). The event in the second factor can be broken down into the following events: the HMM emitted Y1, ..., Yt−1, was in state 1 in step t − 1 and then transitioned from state 1 to state i; and analogously for states 2, 3, ..., N. The probability of such a single event for some state j is equal to the forward probability of state j in step t − 1, times the probability of transition from state j to state i. The probability of the event in the second factor of equation 3.5 is the sum, over all states, of the probabilities of the single events described above. And so, 3.5 can be rewritten as:

αt(i) = bi(Yt) Σ_{j=1}^{N} P(Y1, ..., Yt−1, Xt−1 = j, Xt = i | θ) = bi(Yt) Σ_{j=1}^{N} αt−1(j) aji (3.6)

By combining 3.4 and 3.6 we can calculate the forward probabilities for every step and every state.

Finally, to calculate the probability that the model generated the whole sequence Y, we can simply sum, over all states i, the probabilities that it generated the sequence Y and ended in state i, i.e. sum the forward probabilities of the final step:

P(Y1, ..., YT | θ) = Σ_{i=1}^{N} αT(i) (3.7)
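Equations 3.4, 3.6 and 3.7 translate directly into code. Below is a sketch for a discrete-emission HMM (Python/NumPy, with illustrative names; in a practical implementation the probabilities are scaled or kept in log space to avoid numerical underflow for long sequences):

```python
import numpy as np

def forward(pi, A, B, Y):
    """Forward algorithm: returns the T x N matrix of forward
    probabilities alpha and the likelihood P(Y | theta)."""
    T, N = len(Y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, Y[0]]                      # eq. 3.4
    for t in range(1, T):
        alpha[t] = B[:, Y[t]] * (alpha[t - 1] @ A)  # eq. 3.6
    return alpha, alpha[-1].sum()                   # eq. 3.7
```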

3.1.3. Viterbi algorithm

A Hidden Markov Model with known parameters θ and a sequence Y = Y1, Y2, ..., YT of emissions generated by it are given. The Viterbi algorithm [79] answers the question: what is the most probable sequence of states X = X1, X2, ..., XT that generated the observed sequence Y? In other words, what X maximises P(X, Y | θ)?

The Viterbi algorithm is a dynamic programming algorithm. It is based on a simple observation: if X1, X2, ..., XT is the most probable sequence of states that generated the sequence Y1, Y2, ..., YT, then for any t (1 ≤ t ≤ T) the prefix X1, ..., Xt is also the most probable of all the state sequences for steps 1 to t that end in the state Xt, given the observed data (i.e. the whole sequence Y1, ..., YT). In this section, I will simply refer to the most probable sequence Xa, ..., Xb (given the sequence Y1, ..., YT) as the best sequence.

The algorithm can be viewed as a modified forward algorithm. In the forward algorithm, for every step t we sum up the probabilities of the sequences ending in the state Xt (see equations 3.5 and 3.6); here, we want to remember the exact best sequence that has led to the state Xt.

Let us denote the probability of the best sequence up to step t that ended in state i and generated the observed sequence as Vt(i). It is worth emphasising that it is different from the forward probability αt(i), which is equal to the probability of generating the observed sequence and ending in state i, i.e. to a sum of probabilities over every possible sequence of states, not only the best one.

For the initial step, there is only one possible path to get to each of the states - it is to simply start in that state. Because of this, for the initial step V1(i) is equal to the probability of starting in the state i and emitting the rst observed signal:

V1(i) = πi bi(Y1) (3.8)

For any step t > 1, the best path leading to state i is chosen from N candidate paths: each is the best path ending in some state j in step t − 1, extended by the transition from j to i (with probability aji). In particular, j and i can be the same state. The probability of each of those paths is equal to the product of three probabilities: the probability of the best path to state j (equal to Vt−1(j)), of transitioning to state i (aji), and of emitting the signal observed in the t-th step (bi(Yt)). The best path is then chosen by finding which one yields the highest probability. It is noteworthy that the last factor of the product, i.e. the emission probability, does not influence the choice itself, as it does not depend on the previous path. And so, the induction step can be written as follows:

Vt(i) = bi(Yt) max_{j∈[1,N]} Vt−1(j) aji (3.9)

Additionally, for every step we can remember for which j the maximum was achieved for the given state i:

bestt(i) = arg max_{j∈[1,N]} Vt−1(j) aji (3.10)

So, bestt(i) would denote the state from which the best path goes to state i in step t.

Using 3.8 and 3.9 we can calculate Vt(i) for every step, for every state; and using 3.10 we can additionally remember the exact sequence that has led us to these probabilities.

Once VT(i) are calculated for the last step, we check for which state the maximum is achieved; we conclude that the sequence of states ended in this state with the highest probability:

X̂T = arg max_{i∈[1,N]} VT(i), (3.11)

where X̂t is the t-th element of the most probable sequence of states. Starting from this state and going backwards, we can reconstruct the whole best path:

X̂T−1 = bestT(X̂T)
X̂T−2 = bestT−1(X̂T−1)
...
X̂1 = best2(X̂2) (3.12)

The final best path, i.e. the most probable path to generate the observed sequence of emissions given the parameters θ, is equal to X̂1, ..., X̂T.
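The whole procedure (equations 3.8-3.12) can be sketched as follows (Python/NumPy, discrete emissions, illustrative names; as with the forward algorithm, log probabilities would be used in practice):

```python
import numpy as np

def viterbi(pi, A, B, Y):
    """Viterbi algorithm: the most probable state sequence given Y."""
    T, N = len(Y), len(pi)
    V = np.zeros((T, N))
    best = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, Y[0]]                    # eq. 3.8
    for t in range(1, T):
        cand = V[t - 1][:, None] * A          # cand[j, i] = V_{t-1}(j) a_{ji}
        best[t] = cand.argmax(axis=0)         # eq. 3.10
        V[t] = B[:, Y[t]] * cand.max(axis=0)  # eq. 3.9
    path = np.empty(T, dtype=int)
    path[-1] = V[-1].argmax()                 # eq. 3.11
    for t in range(T - 1, 0, -1):             # backtracking, eq. 3.12
        path[t - 1] = best[t, path[t]]
    return path
```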

3.1.4. Baum-Welch algorithm

Let us have a sequence Y = Y1, Y2, ..., YT of emissions that was emitted by some HMM with unknown parameters θ. The sequence of states X = X1, X2, ..., XT that emitted Y is also unknown. The Baum-Welch algorithm [4] answers the question: what set of parameters is most likely to generate the observed sequence Y ? In other words, what θ maximises the likelihood of the HMM given the observed sequence Y ?

The Baum-Welch algorithm is a special case of the Expectation-Maximisation (EM) algorithm [16]. The EM algorithm can be used to find Maximum Likelihood Estimators of the parameters of a model when we have incomplete data, i.e. part of the data generated by the model is observed and part is hidden. In the case of estimating an HMM's parameters, the observed data is the sequence of emissions, and the sequence of states is hidden.

The EM algorithm operates in iterations; every iteration consists of two steps, called E (Expectation) and M (Maximisation). Intuitively, the algorithm can be described as follows:

1. Initialise parameters θ with some value, denoted θ̂(0).
2. Repeat the following steps until convergence:

• E-step: using the current estimate of θ, denoted θ̂(n), estimate the probability of each possible value of the hidden data; mimic the hidden data with frequencies proportional to the probabilities

• M-step: using the full data (i.e. the observed data and the data generated in the previous step) find the MLE of the parameters θ, denoted θ̂(n+1)

This procedure converges to a local maximum of the likelihood function.
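Schematically, the loop looks as follows (a generic sketch; the e_step and m_step arguments stand for the computations derived in the remainder of this section):

```python
def expectation_maximisation(theta0, Y, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E- and M-steps until the log-likelihood
    stops improving. e_step returns (posteriors, log_likelihood);
    m_step returns updated parameters."""
    theta, prev_ll = theta0, float("-inf")
    for _ in range(max_iter):
        posteriors, ll = e_step(theta, Y)  # expectations over hidden data
        theta = m_step(posteriors, Y)      # MLE on the "completed" data
        if ll - prev_ll < tol:             # stop at (local) convergence
            break
        prev_ll = ll
    return theta
```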

In the M-step we want to find θ̂ that will maximise:

L(θ|Y) = P(Y | θ) (3.13)

where Y is the sequence of emissions (or, more generally, the observed data) and θ denotes the vector of parameters we want to estimate.

It was shown (see [16]) that to maximise the likelihood defined in 3.13 it is sufficient to maximise the auxiliary function Q, defined as:

Q(θ, θ̂(n)) = E_{X|Y;θ̂(n)}[log P(X, Y | θ)] (3.14)

where θ̂(n) is the current estimate of θ, and E_{X|Y;θ}[Z] denotes the expected value of Z over the variable X, under the condition that the value Y was observed, calculated with parameter vector θ. In other words, the Q function is the expectation of the log-likelihood of the complete data.

In every iteration n of the EM algorithm, in the M-step θ̂(n+1) is chosen to maximise the value of Q:

θ̂(n+1) = arg max_θ Q(θ, θ̂(n)) (3.15)

This is equivalent to maximising the likelihood L(θ|Y). The full proof of this equivalence can be found in the paper by Dempster, Laird and Rubin from 1977 [16]. It should be pointed out that this is the first paper that fully defined and formalised the EM algorithm; however, the algorithm itself had already been used for various applications (e.g. the Baum-Welch algorithm was published in 1970 [4], together with a proof of its correctness in the context of HMMs).

The Q function can be further expanded as follows:

Q(θ, θ̂(n)) = E_{X|Y;θ̂(n)}[log P(X, Y | θ)] = Σ_{all X} P(X | Y, θ̂(n)) log P(X, Y | θ) (3.16)

The sum is calculated over all possible sequences of states of length T .

The probability of X, Y can be expressed as a product over all the observations of the probability of transitioning from the previous state to the current one and the probability of emission of the observed symbol:

log P(X, Y | θ) = log Π_{t=1}^{T} a_{Xt−1,Xt} b_{Xt}(Yt) = Σ_{t=1}^{T} log a_{Xt−1,Xt} + Σ_{t=1}^{T} log b_{Xt}(Yt) (3.17)

Note that a and b are part of the θ parameters. Here, for simplicity of notation, we assume that a0,i denotes πi for any state i.

Combining 3.16 and 3.17 we obtain:

Q(θ, θ̂(n)) = Σ_{all X} P(X | Y, θ̂(n)) ( Σ_{t=1}^{T} log a_{Xt−1,Xt} + Σ_{t=1}^{T} log b_{Xt}(Yt) ) =
= Σ_{all X} P(X | Y, θ̂(n)) Σ_{t=1}^{T} log a_{Xt−1,Xt} + Σ_{all X} P(X | Y, θ̂(n)) Σ_{t=1}^{T} log b_{Xt}(Yt) (3.18)

As a and b are independent of each other, they can be optimised separately, using only the terms in which they are present. And so, let us define two separate functions that we will optimise with respect to a and b:

Qa(a, θ̂(n)) = Σ_{all X} P(X | Y, θ̂(n)) Σ_{t=1}^{T} log a_{Xt−1,Xt} = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T} P(Xt−1 = i, Xt = j | Y, θ̂(n)) log aij (3.19)

Qb(b, θ̂(n)) = Σ_{all X} P(X | Y, θ̂(n)) Σ_{t=1}^{T} log b_{Xt}(Yt) = Σ_{i=1}^{N} Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) log bi(Yt) (3.20)

Let us first maximise the Qa function with respect to a. We want to do it with the following constraint:

∀i Σ_{j=1}^{N} aij = 1 (3.21)

i.e. for every state i, the probabilities of transitioning from this state to some state j (for all states j, including j = i) must sum up to 1.

By using Lagrange multipliers, we can derive the following Maximum Likelihood Estimator for a:

âij = Σ_{t=1}^{T} P(Xt−1 = i, Xt = j | Y, θ) / Σ_{k=1}^{N} Σ_{t=1}^{T} P(Xt−1 = i, Xt = k | Y, θ) =
= Σ_{t=1}^{T} P(Xt−1 = i, Xt = j | Y, θ) / Σ_{t=1}^{T} P(Xt−1 = i | Y, θ) (3.22)

To calculate it, let us define two additional auxiliary functions. βt(i) will denote the so-called backward probability. It is defined similarly to the forward probabilities α (see equations 3.4 and 3.6); it is the probability that the HMM emitted the symbols from step t + 1 to the end of the sequence, provided that in step t it was in state i:

βt(i) = P(Yt+1, ..., YT | Xt = i, θ) (3.23)

We can calculate it iteratively, starting from step T and going backwards:

βT(i) = P(∅ | XT = i, θ) = 1 (3.24)

βt(i) = Σ_{j=1}^{N} aij bj(Yt+1) βt+1(j) (3.25)

For the last step (equation 3.24), it is simply the probability of emitting nothing from step T + 1 to T, which is equal to 1 (because the sequence ends at step T). For the preceding steps (equation 3.25), βt(i) is equal to the sum over all states j of the probabilities that the model transitions from state i to state j (which happens with probability aij), in state j emits the signal Yt+1 (which happens with probability bj(Yt+1)), and from step t + 2 onward emits the sequence Yt+2, ..., YT (which happens with probability βt+1(j)).

Recursively we can calculate the backward probabilities for all the steps for all the states.
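The backward recursion mirrors the forward one; a sketch (Python/NumPy, discrete emissions, same conventions as the forward sketch above):

```python
import numpy as np

def backward(A, B, Y):
    """Backward algorithm: beta[t, i] = P(Y_{t+1}, ..., Y_T | X_t = i)."""
    T, N = len(Y), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # eq. 3.24
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])  # eq. 3.25
    return beta
```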

The second auxiliary function that will help us calculate the estimator â is the probability of taking the transition from state i to state j at step t, given the observed sequence and the model's parameters. We will denote it as γt(i, j), and it can be calculated as follows:

γt(i, j) = P(Xt−1 = i, Xt = j | Y, θ) =
= P(Xt−1 = i, Xt = j, Y | θ) / P(Y | θ) =
= (1 / P(Y | θ)) P(Xt−1 = i, Y1, ..., Yt−1 | θ) P(Xt = j | Xt−1 = i, θ) P(Yt | Xt = j, θ) P(Yt+1, ..., YT | Xt = j, θ) =
= αt−1(i) aij bj(Yt) βt(j) / Σ_{k=1}^{N} αT(k) (3.26)

We can now use this auxiliary function in 3.22, i.e. in the derived Maximum Likelihood Estimator for a:

âij = Σ_{t=1}^{T} P(Xt−1 = i, Xt = j | Y, θ) / Σ_{k=1}^{N} Σ_{t=1}^{T} P(Xt−1 = i, Xt = k | Y, θ) =
= Σ_{t=1}^{T} γt(i, j) / Σ_{k=1}^{N} Σ_{t=1}^{T} γt(i, k) (3.27)
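Putting the pieces together, below is a sketch of the E-step quantities γ and the M-step update of the transition matrix (equations 3.26-3.27), reusing the forward and backward sketches above:

```python
import numpy as np

def transition_update(pi, A, B, Y):
    """Compute gamma[t, i, j] = P(X_{t-1}=i, X_t=j | Y, theta) for t >= 1
    (eq. 3.26) and the updated transition matrix A_hat (eq. 3.27)."""
    alpha, likelihood = forward(pi, A, B, Y)
    beta = backward(A, B, Y)
    T, N = len(Y), len(pi)
    gamma = np.zeros((T, N, N))
    for t in range(1, T):
        gamma[t] = (alpha[t - 1][:, None] * A
                    * B[:, Y[t]][None, :] * beta[t][None, :]) / likelihood
    numer = gamma.sum(axis=0)           # numerator: sum_t gamma_t(i, j)
    return gamma, numer / numer.sum(axis=1, keepdims=True)  # eq. 3.27
```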

If the HMM has discrete emissions, we can similarly derive the MLE for b, defining it for every symbol k that Y can take. We want to maximise Qb (equation 3.20) with respect to b, with the following constraint:

∀i Σ_{all k} bi(k) = 1, (3.28)

where the sum is calculated over all the possible values that can be emitted by the HMM.

In other words, for every state i the probabilities of emitting the symbols k must sum up to 1. By using Lagrange multipliers we obtain the following formula:

b̂i(k) = Σ_{t=1}^{T} P(Xt = i | Y, θ) δ(Yt = k) / Σ_{t=1}^{T} P(Xt = i | Y, θ) =
= Σ_{t: Yt=k} Σ_{j=1}^{N} γt(j, i) / Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) (3.29)

where δ(w) is equal to 1 if w is true and 0 otherwise. Note that, with the convention a0,i = πi, P(Xt = i | Y, θ) = Σ_{j=1}^{N} γt(j, i), which gives the second form above.
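For discrete emissions this update can be sketched as follows, using the identity P(Xt = i | Y, θ) = αt(i)βt(i)/P(Y | θ), which follows the same reasoning as equation 3.26 (Python/NumPy, illustrative names; Y is assumed to be a NumPy array of symbol indices):

```python
import numpy as np

def emission_update(alpha, beta, likelihood, Y, n_symbols):
    """M-step update of the emission matrix B (eq. 3.29)."""
    state_post = alpha * beta / likelihood  # state_post[t, i] = P(X_t=i | Y)
    N = alpha.shape[1]
    B_hat = np.zeros((N, n_symbols))
    for k in range(n_symbols):
        B_hat[:, k] = state_post[Y == k].sum(axis=0)  # numerator of eq. 3.29
    return B_hat / state_post.sum(axis=0)[:, None]    # denominator of eq. 3.29
```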

However, in the case of continuous emissions we cannot define b for every possible value of Y. In the following sections I present how to derive the equation for b̂ if we assume that the emissions come from a Gaussian (section 3.1.4.1) or a negative binomial distribution (section 3.1.4.2).

3.1.4.1. Gaussian emissions

Let us assume that for every state i the emitted value comes from a Gaussian distribution with parameters µi (mean) and σi (standard deviation):

Yi ∼ N (µi, σi)

where Yi denotes the random variable emitted when the model is in the state i.

The probability density function is defined as follows:

fYi(y) = (1 / (σi √(2π))) e^{−(1/2) ((y − µi)/σi)²} (3.30)

It is worth noting that Y could be a multivariate random variable, in which case the distribution would be parameterised with a vector of means µ and a covariance matrix Σ;

here for simplicity we assume that Y is univariate. The obtained results can be extended to the multivariate case.

Let us maximise Qb defined in equation 3.20, substituting the bi(k) term with the probability density function:

Qb(b, θ̂(n)) = Σ_{i=1}^{N} Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) log bi(Yt) =
= Σ_{i=1}^{N} Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) log [ (1/(σi √(2π))) e^{−(1/2) ((Yt − µi)/σi)²} ] (3.31)

Let us first maximise Qb with respect to some µj, i.e. we want to find the MLE of the mean for state j. As Qb is differentiable with respect to µ on its whole domain, we will find the estimator by finding the value of µj for which the derivative is equal to zero:

dQb(b, θ̂(n))/dµj = d/dµj [ Σ_{i=1}^{N} Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (−1/2) ((Yt − µi)/σi)² ] =
= (−1/2) Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (1/σj²) d/dµj (Yt − µj)² =
= (−1/2) Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (1/σj²) (−2)(Yt − µj) =
= (1/σj²) Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj) (3.32)

We equate the calculated derivative to zero:

dQb(b, θ̂(n))/dµj = 0
(1/σj²) Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj) = 0
Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) Yt = Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) µj
µ̂j = Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) Yt / Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (3.33)

And so, for any state j the MLE of the mean of emissions is calculated as the mean of the observed emissions, weighted by the posterior probabilities that the given observation was emitted by state j.

Similarly, we will derive the estimator σ̂j for state j:

dQb(b, θ̂(n))/dσj = d/dσj [ Σ_{i=1}^{N} Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) log( (1/(σi √(2π))) e^{−(1/2) ((Yt − µi)/σi)²} ) ] =
= Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) d/dσj [ −log σj − log √(2π) − (1/2) ((Yt − µj)/σj)² ] =
= Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) [ −1/σj − (1/2)(−2) (Yt − µj)²/σj³ ] =
= Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) [ −1/σj + (Yt − µj)²/σj³ ] (3.34)

To find the MLE for σj, we equate the derivative to zero:

dQb(b, θ̂(n))/dσj = 0
Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) [ −1/σj + (Yt − µj)²/σj³ ] = 0
Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) = Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj)²/σj²
σj² Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) = Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj)²
σj² = Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj)² / Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n))
σ̂j = [ Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) (Yt − µj)² / Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) ]^{1/2} (3.35)

Similarly to the result obtained for the mean, the MLE for σj resembles the MLE of the standard deviation of a Gaussian distribution calculated from multiple independent realisations of the same random variable, but the observations here are weighted by the posterior probability that the observation was emitted by state j.

To summarise, in the M-step of the Baum-Welch algorithm the parameters of the Gaussian distribution for each state are updated according to the following equations:

µ̂i = Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) Yt / Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) = Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) Yt / Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) (3.36)

σ̂i = [ Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (Yt − µi)² / Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ]^{1/2} = [ Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) (Yt − µi)² / Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) ]^{1/2} (3.37)
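These two updates can be sketched as follows (Python/NumPy, illustrative names, with state_post[t, i] = P(Xt = i | Y, θ̂(n)) computed as in the discrete case):

```python
import numpy as np

def gaussian_update(state_post, Y):
    """M-step updates of Gaussian emission parameters (eqs. 3.36-3.37).
    state_post: (T, N) posterior state probabilities, Y: (T,) emissions."""
    weights = state_post.sum(axis=0)                      # shared denominator
    mu = (state_post * Y[:, None]).sum(axis=0) / weights  # eq. 3.36
    var = (state_post * (Y[:, None] - mu) ** 2).sum(axis=0) / weights
    return mu, np.sqrt(var)                               # eq. 3.37
```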

3.1.4.2. Negative binomial distribution

Let Yi be a random variable emitted by the HMM in state i, distributed according to a negative binomial distribution with parameters pi and ri. Let us denote it in the following way:

Yi ∼ NB(pi, ri), where:

pi - probability of success; pi ∈ [0, 1]

ri - number of successes; ri ∈ (0, ∞)

Yi - number of failures before ri successes occur; Yi ∈ {0, 1, 2, ...}.

If Y ∼ NB(p, r), then the probability that it will be equal to some y ≥ 0 is given by:

P(Y = y) = (r + y − 1 choose y) p^r (1 − p)^y (3.38)

In the equation above, we assume that r is an integer. The probability can, however, be defined for all non-negative real values of r. In this case the binomial coefficient in equation 3.38 can be replaced with the gamma function:

P(Y = y) = (Γ(r + y) / (y! Γ(r))) p^r (1 − p)^y (3.39)

There are also alternative definitions of the negative binomial distribution, e.g. with p being the probability of success, r the number of failures, and Y ∼ NB(p, r) the number of successes before r failures occur. In this dissertation, however, I use the definitions and notation described above.

We want to find estimators of p and r that maximise the Qb function. Let us first maximise it with respect to p. Similarly to the previous section, we will differentiate Qb with respect to pi and equate the derivative to zero.

dQb(b, θ̂(n))/dpi = d/dpi [ Σ_{j=1}^{N} Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) log bj(Yt) ] =
= d/dpi [ Σ_{j=1}^{N} Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) log( (rj + yt − 1 choose yt) pj^{rj} (1 − pj)^{yt} ) ] =
= d/dpi [ Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (log pi^{ri} + log (1 − pi)^{yt}) ] =
= Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ( d/dpi [ri log pi] + d/dpi [yt log(1 − pi)] ) =
= Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ( ri/pi − yt/(1 − pi) ) (3.40)

Note that we could also use equation 3.39 (the one valid for any non-negative real value of r), as the factor that differs between the two equations does not depend on p and disappears when differentiating with respect to pi.

To find the extremum of the Qb function, we equate the calculated derivative to zero:

dQb(b, θ̂(n))/dpi = 0
Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ( ri/pi − yt/(1 − pi) ) = 0
Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ri/pi = Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) yt/(1 − pi)
(1 − pi) Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ri = pi Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) yt
Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ri = pi Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (yt + ri)
p̂i,MLE = Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ri / Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (yt + ri) =
= Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) ri / Σ_{t=1}^{T} Σ_{j=1}^{N} γt(j, i) (yt + ri) (3.41)
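A sketch of this update (Python/NumPy; state_post follows the conventions of the previous sketches and is an assumption of this illustration, not notation from the original text):

```python
import numpy as np

def nb_p_update(state_post, Y, r):
    """M-step update of the negative binomial p_i (eq. 3.41).
    state_post: (T, N) posteriors, Y: (T,) counts, r: (N,) current r_i."""
    numer = state_post.sum(axis=0) * r                   # sum_t P(X_t=i|Y) r_i
    denom = (state_post * (Y[:, None] + r)).sum(axis=0)  # sum_t P(X_t=i|Y)(y_t + r_i)
    return numer / denom
```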

We will use the same approach to find the Maximum Likelihood Estimator for r. To differentiate Qb with respect to ri we will use the probability mass function defined for all non-negative real values of r, not only integers, i.e. equation 3.39.

dQb(b, θ̂(n))/dri = d/dri [ Σ_{j=1}^{N} Σ_{t=1}^{T} P(Xt = j | Y, θ̂(n)) log( (Γ(rj + yt)/(yt! Γ(rj))) pj^{rj} (1 − pj)^{yt} ) ] =
= Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) d/dri [ log Γ(ri + yt) − log Γ(ri) + ri log pi ] =
= Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) ( (1/Γ(ri + yt)) dΓ(ri + yt)/dri − (1/Γ(ri)) dΓ(ri)/dri + log pi ) (3.42)

Here it will be useful to define the digamma function:

Ψ(k) = Γ'(k) / Γ(k) (3.43)

where Γ'(k) is the derivative of the gamma function. By plugging it into the previous equation, we obtain:

dQb(b, θ̂(n))/dri = Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (Ψ(ri + yt) − Ψ(ri) + log pi) (3.44)

We want to find r̂i for which 3.44 is equal to zero:

Σ_{t=1}^{T} P(Xt = i | Y, θ̂(n)) (Ψ(ri + yt) − Ψ(ri) + log pi) = 0 (3.45)

However, this equation has no closed-form solution. The root needs to be found using other methods, such as Newton's method or an inner EM procedure. In section 3.3 I describe the approach implemented in HERON.
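For illustration, one way to solve equation 3.45 numerically is bracketed root finding on its left-hand side (a sketch assuming SciPy; this is not the HERON implementation, which is described in section 3.3). The bracket is valid when 0 < pi < 1 and at least one observed count is positive, since the left-hand side is then positive for small ri and negative for large ri:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def nb_r_update(state_post_i, Y, p_i, r_max=1e4):
    """Solve eq. 3.45 for r_i. state_post_i: (T,) posteriors P(X_t=i | Y),
    Y: (T,) observed counts, p_i: current estimate of p_i."""
    def lhs(r):  # left-hand side of eq. 3.45 as a function of r_i
        return np.sum(state_post_i * (digamma(r + Y) - digamma(r) + np.log(p_i)))
    return brentq(lhs, 1e-6, r_max)  # bracketed root finding
```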
