Mathematical Statistics
Anna Janicka
Lecture XIV, 1.06.2020
BAYESIAN STATISTICS
Plan for Today
1. Bayesian Statistics
a priori and a posteriori distributions
Bayesian estimation:
Maximum a posteriori probability (MAP)
Bayes Estimator
Bayesian Statistics vs. traditional statistics
Frequentist: unknown parameters are given (fixed), observed data are random
Bayesian: observed data are given (fixed), parameters are random
Bayesian Statistics
Our knowledge about the unknown parameters is described by means of probability distributions, and additional knowledge may affect our description.
Knowledge:
general
specific
Example: coin toss
Bayesian Model
X1, ..., Xn come from distribution Pθ, with density fθ(x) – the conditional density given a specific value of θ (the likelihood function).
P – family of probability distributions Pθ, indexed by the parameter θ∈Θ
General knowledge: distribution Π over the parameter space Θ, given by π(θ) – the so-called a priori/prior distribution of θ:
θ ~ Π
Bayesian Model – cont.
Additional knowledge (specific, contextual) is based on observation. We have a joint distribution of the observations and θ:

f(x1, ..., xn, θ) = f(x1, ..., xn | θ) π(θ)

On this basis we can derive the conditional distribution of θ given the observed data:

π(θ | x1, ..., xn) = f(x1, ..., xn | θ) π(θ) / m(x1, ..., xn),

where m(x1, ..., xn) = ∫_Θ f(x1, ..., xn | θ) π(θ) dθ is the marginal distribution of the observations.
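The recipe above (posterior ∝ likelihood × prior, normalized by the marginal m) can be sketched numerically on a discrete grid of θ values. This is an illustration only, not part of the lecture; the Bernoulli likelihood and all function names are assumed for the example.

```python
# Sketch (illustration, not from the lecture): computing a posterior on a
# discrete grid of theta values, following posterior = likelihood * prior / m.
def grid_posterior(thetas, prior, likelihood):
    joint = [likelihood(t) * p for t, p in zip(thetas, prior)]  # f(x|theta) * pi(theta)
    m = sum(joint)                        # marginal m(x), up to the grid spacing
    return [j / m for j in joint]         # pi(theta | x)

# Assumed example: 3 successes in 5 Bernoulli trials, uniform prior on the grid.
thetas = [i / 100 for i in range(1, 100)]
prior = [1.0] * len(thetas)
post = grid_posterior(thetas, prior, lambda t: t**3 * (1 - t)**2)
print(max(zip(post, thetas))[1])  # posterior mode on the grid, 0.6
```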
Bayesian Model – a posteriori distribution
π(θ | x1, ..., xn) is called the a posteriori/posterior distribution, denoted Πx. The posterior distribution reflects all knowledge: general (initial, from the prior) and specific (based on the observed data). It is the grounds for Bayesian inference and modeling.
Prior and posterior distributions: examples
1. Let X1, ..., Xn be IID r.v. from a 0-1 (Bernoulli) distribution with probability of success θ; let

π(θ) = θ^(α−1) (1−θ)^(β−1) / B(α, β) for θ∈(0,1),

where B(α, β) = ∫_0^1 u^(α−1) (1−u)^(β−1) du = Γ(α)Γ(β) / Γ(α+β)

and Γ(α) = ∫_0^∞ u^(α−1) exp(−u) du = (α−1)Γ(α−1).

Then the posterior distribution is

Beta(Σ_{i=1}^n x_i + α, n − Σ_{i=1}^n x_i + β).

The Beta distribution is the conjugate prior for the Bernoulli distribution. The Beta(α, β) distribution has mean α/(α+β).
[Figures: prior vs. posterior densities]
For a Beta(1,1) prior and data: n=10 and 1, 5, 9 successes
For a Beta(1,1) prior and data: n=100 and 10, 50, 90 successes
For a Beta(10,10) prior and data: n=10 and 1, 5, 9 successes
For a Beta(10,10) prior and data: n=100 and 10, 50, 90 successes
For a Beta(1,5) prior and data: n=10 and 1, 5, 9 successes
For a Beta(1,5) prior and data: n=100 and 10, 50, 90 successes
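The conjugate update behind these figures is a one-liner: a Beta(α, β) prior and s successes in n trials give a Beta(α + s, β + n − s) posterior. A minimal sketch (function names are mine, not from the lecture):

```python
# Sketch: Beta-Bernoulli conjugate update. Under a Beta(alpha, beta) prior,
# s successes in n Bernoulli trials give a Beta(alpha + s, beta + n - s) posterior.
def beta_posterior(alpha, beta, n, s):
    return alpha + s, beta + n - s

def beta_mean(a, b):
    # mean of the Beta(a, b) distribution
    return a / (a + b)

# E.g. the Beta(1,1) (uniform) prior with n=10 trials, as in the first figure:
for s in (1, 5, 9):
    a, b = beta_posterior(1, 1, 10, s)
    print(f"s={s}: posterior Beta({a},{b}), mean {beta_mean(a, b):.3f}")
```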
Prior and posterior distributions: examples (2)
2. Let X1, ..., Xn be IID r.v. from N(θ, σ²), with σ² known; θ ~ N(m, τ²) for m, τ known.
Then the posterior distribution for θ is

N( (n/σ² · X̄ + m/τ²) / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²) ),

i.e. the normal distribution is the conjugate prior for a normal distribution (with known variance).
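The Normal-Normal posterior parameters can be computed directly from this formula; a minimal sketch (function name is mine), checked against the numbers of the worked example later in the lecture (X̄ = 2, n = 5, σ² = 4, prior N(1, 1)):

```python
# Sketch: posterior mean and variance for the Normal-Normal conjugate pair,
# with sigma2 known and a N(m, tau2) prior on theta.
def normal_posterior(xbar, n, sigma2, m, tau2):
    precision = n / sigma2 + 1 / tau2                  # posterior precision
    mean = (n / sigma2 * xbar + m / tau2) / precision  # precision-weighted average
    return mean, 1 / precision                         # (posterior mean, posterior variance)

print(normal_posterior(2.0, 5, 4.0, 1.0, 1.0))  # mean 14/9, variance 4/9
```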
Bayesian Statistics
Based on the Bayes approach, we can
find estimates
find an equivalent of confidence intervals
verify hypotheses
make predictions
Bayesian Most Probable (BMP) / Maximum a posteriori Probability (MAP) estimate
Similar to ML estimation: the argument which maximizes the posterior distribution:

π(θ̂_BMP | x1, ..., xn) = max_θ π(θ | x1, ..., xn),

i.e.

BMP(θ) = θ̂_BMP = argmax_θ π(θ | x1, ..., xn)
BMP: examples
1. Let X1, ..., Xn be IID r.v. from a Bernoulli distr. with prob. of success θ; for θ∈(0,1) take the prior

π(θ) = θ^(α−1) (1−θ)^(β−1) / B(α, β).

We know the posterior distribution: Beta(Σ_{i=1}^n x_i + α, n − Σ_{i=1}^n x_i + β). The mode of a Beta(α, β) distr. is (α−1)/(α+β−2) for α>1, β>1, so we have the maximum for

BMP(θ) = (Σ_{i=1}^n x_i + α − 1) / (n + α + β − 2),

i.e. for 5 successes in 10 trials with a prior U(0,1) (i.e. the Beta(1,1) distr.), we have BMP(θ) = 5/10 = ½,
and for 9 successes in 10 trials with the same prior distr., we have BMP(θ) = 9/10.
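This BMP formula is easy to check in code; a minimal sketch (function name is mine), valid when both posterior Beta parameters exceed 1:

```python
# Sketch: BMP/MAP estimate as the posterior Beta mode. Requires the posterior
# parameters a = alpha + s and b = beta + n - s to both exceed 1.
def bmp_bernoulli(alpha, beta, n, s):
    a, b = alpha + s, beta + n - s      # posterior Beta(a, b)
    return (a - 1) / (a + b - 2)        # mode of Beta(a, b) for a, b > 1

print(bmp_bernoulli(1, 1, 10, 5))  # 0.5
print(bmp_bernoulli(1, 1, 10, 9))  # 0.9
```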
BMP: examples (2)
2. Let X1, ..., Xn be IID r.v. from N(θ, σ²), with σ² known; θ ~ N(m, τ²) for m, τ known.
Then the posterior distr. for θ is

N( (n/σ² · X̄ + m/τ²) / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²) ),

so

BMP(θ) = (n/σ² · X̄ + m/τ²) / (n/σ² + 1/τ²),

i.e. if we have a sample of 5 obs. 1.2; 1.7; 1.9; 2.1; 3.1 from the distr. N(θ, 4) and the prior distr. is θ ~ N(1, 1), then
BMP(θ) = (5/4 · 2 + 1·1)/(5/4 + 1) = 14/9 ≈ 1.56,
and if the prior distr. were θ ~ N(3, 1), then
BMP(θ) = (5/4 · 2 + 1·3)/(5/4 + 1) = 22/9 ≈ 2.44.
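These arithmetic steps can be reproduced directly (a sketch; the function name is mine):

```python
# Sketch: BMP estimate in the normal model, using the lecture's sample
# (sigma2 = 4 known, N(m, 1) prior).
xs = [1.2, 1.7, 1.9, 2.1, 3.1]
n, sigma2, tau2 = len(xs), 4.0, 1.0
xbar = sum(xs) / n  # = 2.0

def bmp_normal(xbar, n, sigma2, m, tau2):
    # posterior mean = mode for a normal posterior
    return (n / sigma2 * xbar + m / tau2) / (n / sigma2 + 1 / tau2)

print(bmp_normal(xbar, n, sigma2, 1.0, tau2))  # 14/9 ≈ 1.556
print(bmp_normal(xbar, n, sigma2, 3.0, tau2))  # 22/9 ≈ 2.444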
Bayes Estimator
An estimation rule which minimizes the posterior expected value of a loss function.
L(θ, a) – loss function; depends on the true value of θ and the decision a.
E.g. if we want to estimate g(θ):
L(θ, a) = (g(θ) − a)² – quadratic loss function
L(θ, a) = |g(θ) − a| – absolute (modulus) loss function
Bayes Estimator – cont.
We can also define the accuracy of an estimate for a given loss function:

acc(Π, ĝ(x)) = E[ L(θ, ĝ(x)) | X = x ] = ∫_Θ L(θ, ĝ(x)) π(θ|x) dθ

(the average loss of the estimator for a given prior distribution and data, i.e. for a specific posterior distribution).
Bayes Estimator – cont. (2)
The Bayes Estimator ĝ_B for a given loss function L(θ, a) is such that

∀x: acc(Π, ĝ_B(x)) = min_a acc(Π, a).

For the quadratic loss function (θ − a)²:

θ̂_B = E(θ | X = x) = E(Πx) (more generally: E(g(θ)|x)).

For the absolute loss function |θ − a|:

θ̂_B = Med(Πx) (the posterior median).
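To see why quadratic loss leads to the posterior mean and absolute loss to the posterior median, one can minimize the expected posterior loss numerically. A sketch under assumed inputs: the posterior is discretized on a grid, and Beta(10, 2) is chosen only as an illustrative (skewed) posterior.

```python
# Sketch: minimizing expected posterior loss on a grid, for an assumed
# Beta(10, 2) posterior. Quadratic loss -> posterior mean (10/12 ≈ 0.833);
# absolute loss -> posterior median (≈ 0.853). Values differ because the
# distribution is skewed.
a, b = 10, 2
grid = [i / 1000 for i in range(1, 1000)]
w = [t**(a - 1) * (1 - t)**(b - 1) for t in grid]  # unnormalized posterior weights
W = sum(w)

def expected_loss(est, loss):
    # acc(Pi, est): average loss under the (discretized) posterior
    return sum(loss(t, est) * wi for t, wi in zip(grid, w)) / W

quad = min(grid, key=lambda e: expected_loss(e, lambda t, e_: (t - e_)**2))
absl = min(grid, key=lambda e: expected_loss(e, lambda t, e_: abs(t - e_)))
print(quad, absl)
```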
Bayes Estimator: Example (1)
1. Let X1, ..., Xn be IID r.v. from a Bernoulli distr. with prob. of success θ; for θ∈(0,1) take the prior

π(θ) = θ^(α−1) (1−θ)^(β−1) / B(α, β).

We know the posterior distribution: Beta(Σ_{i=1}^n x_i + α, n − Σ_{i=1}^n x_i + β), and the Beta(α, β) distr. has mean α/(α+β), so the Bayes Estimator (for quadratic loss) is

θ̂_B = (Σ_{i=1}^n x_i + α) / (n + α + β),

i.e. for 5 successes in 10 trials with a prior U(0,1) (i.e. the Beta(1,1) distr.), we have θ̂_B = 6/12 = ½,
and for 9 successes in 10 trials with the same prior distr., we have θ̂_B = 10/12 = 5/6.
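A minimal sketch of this estimator (function name is mine), reproducing both worked values:

```python
# Sketch: Bayes Estimator under quadratic loss = posterior mean,
# for a Beta(alpha, beta) prior and s successes in n Bernoulli trials.
def bayes_bernoulli(alpha, beta, n, s):
    return (s + alpha) / (n + alpha + beta)

print(bayes_bernoulli(1, 1, 10, 5))  # 6/12 = 0.5
print(bayes_bernoulli(1, 1, 10, 9))  # 10/12 ≈ 0.833
```

Note how the prior pseudo-counts α and β pull the estimate away from the raw proportion s/n, unlike the BMP estimate with a uniform prior.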
Bayes Estimator: examples (2)
2. Let X1, ..., Xn be IID r.v. from N(θ, σ²), with σ² known; θ ~ N(m, τ²) for m, τ known.
Then the posterior distr. for θ is

N( (n/σ² · X̄ + m/τ²) / (n/σ² + 1/τ²), 1 / (n/σ² + 1/τ²) ),

so (as the mean of the posterior)

θ̂_B = (n/σ² · X̄ + m/τ²) / (n/σ² + 1/τ²),

i.e. if we have a sample of 5 obs. 1.2; 1.7; 1.9; 2.1; 3.1 from the distr. N(θ, 4) and the prior distr. is θ ~ N(1, 1), then
θ̂_B = (5/4 · 2 + 1·1)/(5/4 + 1) = 14/9 ≈ 1.56,
and if the prior distr. were θ ~ N(3, 1), then
θ̂_B = (5/4 · 2 + 1·3)/(5/4 + 1) = 22/9 ≈ 2.44.