Mathematical Statistics
Anna Janicka
Lecture XIV, 27.05.2019
BAYESIAN STATISTICS
Plan for Today
1. Chi-squared tests – cont.
2. Bayesian Statistics
a priori and a posteriori distributions
Bayesian estimation: Maximum a posteriori probability (MAP), Bayes Estimator
Chi-squared goodness-of-fit test – reminder.
General form of the test:

$$\chi^2 = \sum \frac{(\text{observed value} - \text{expected value})^2}{\text{expected value}}$$

here:

$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} \qquad \text{or} \qquad \chi^2 = \sum_{i=1}^{k} \frac{\big(N_i - np_i(\hat\theta)\big)^2}{np_i(\hat\theta)}$$

Theorem. If $H_0$ is true, then for $n \to \infty$ the distribution of the $\chi^2$ statistic converges to a chi-squared distribution with $k-1$ degrees of freedom, $\chi^2(k-1)$, or to a chi-squared distribution with $k-d-1$ degrees of freedom, $\chi^2(k-d-1)$ (depending on the dimension $d$ of the unknown parameter $\theta$).
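As a quick illustration (not part of the original slides), a minimal Python sketch of this statistic, with the observed counts and hypothesized probabilities assumed for the example; `scipy.stats.chi2.sf` supplies the p-value:

```python
import numpy as np
from scipy import stats

# Assumed example data: observed counts N_i in k = 4 classes, probabilities under H0.
N = np.array([18, 22, 30, 30])
p = np.array([0.2, 0.2, 0.3, 0.3])
n = N.sum()

chi2 = np.sum((N - n * p) ** 2 / (n * p))    # the chi-squared statistic above
pvalue = stats.chi2.sf(chi2, df=len(N) - 1)  # k - 1 df (no parameters estimated)
print(chi2, pvalue)
```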
Chi-squared goodness-of-fit test – version for continuous distributions
Kolmogorov tests are better, but the chi-squared test may also be used.

Model: $X_1, X_2, \ldots, X_n$ are an IID sample from a continuous distribution.

$H_0$: The distribution is given by F

$H_1$: $\neg H_0$ (i.e. the distribution is different)

It suffices to divide the range of values of the random variable into classes and count the observations. The expected values are known (they follow from F). Then: apply the chi-squared test.
Chi-squared goodness-of-fit test – practical notes
The test should be used for large samples. The expected counts cannot be too small (< 5); if they are smaller, observations should be grouped. The classes in the "continuous" version may be chosen arbitrarily, but it is best if the theoretical probabilities are balanced.
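A minimal sketch of the continuous version, assuming $H_0$: F = N(0, 1); the class boundaries are taken at quantiles of F so the theoretical probabilities are balanced, as recommended in the notes above, and the sample is simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)                     # sample to be tested against H0: F = N(0,1)

k = 8                                        # number of classes
cuts = stats.norm.ppf(np.arange(1, k) / k)   # interior boundaries: equal probabilities 1/k
N_obs = np.bincount(np.searchsorted(cuts, x), minlength=k)
p = np.full(k, 1.0 / k)                      # balanced theoretical probabilities
n = x.size

chi2 = np.sum((N_obs - n * p) ** 2 / (n * p))
pvalue = stats.chi2.sf(chi2, df=k - 1)       # no parameters estimated, so k - 1 df
print(chi2, pvalue)
```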
Chi-squared test of independence
Model: $(X_1, Y_1), \ldots, (X_n, Y_n)$ are an IID sample from a two-dimensional distribution with $r \cdot s$ values (denoted by the set $\{1, \ldots, r\} \times \{1, \ldots, s\}$).

Let the theoretical distribution be

$$p_{ij} = P(X = i, Y = j), \quad i = 1, \ldots, r, \; j = 1, \ldots, s.$$

Denote

$$p_{i\bullet} = \sum_{j=1}^{s} p_{ij}, \qquad p_{\bullet j} = \sum_{i=1}^{r} p_{ij}.$$

We want to verify independence of X and Y:

$H_0$: $p_{ij} = p_{i\bullet} \cdot p_{\bullet j}$ for $i = 1, \ldots, r$, $j = 1, \ldots, s$

$H_1$: $\neg H_0$
Chi-squared test of independence – cont.
The empirical distribution may be summarized by a table (so-called contingency table, or
crosstab)
i \ j    1       2       ...  s       | N_{i•}
1        N_11    N_12    ...  N_1s    | N_{1•}
2        N_21    N_22    ...  N_2s    | N_{2•}
...
r        N_r1    N_r2    ...  N_rs    | N_{r•}
N_{•j}   N_{•1}  N_{•2}  ...  N_{•s}  | n
Chi-squared test of independence – cont. (2)
This is a special case of a goodness-of-fit test with $(r-1) + (s-1)$ parameters to be estimated (the marginal probabilities, estimated by $\hat p_{i\bullet} = N_{i\bullet}/n$ and $\hat p_{\bullet j} = N_{\bullet j}/n$).

The test statistic:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\big(N_{ij} - N_{i\bullet} N_{\bullet j}/n\big)^2}{N_{i\bullet} N_{\bullet j}/n}$$

has a chi-squared distribution with $(r-1)(s-1)$ degrees of freedom (if $H_0$ is true).
Chi-squared test of independence – example
We verify independence of political and musical preferences, at significance level α = 0.05
Source: W. Niemiro
Support X Do not support X Total
Listen to jazz 25 10 35
Listen to rock 20 20 40
Listen to hip-hop 15 10 25
Total 60 40 100
$$\chi^2 = \frac{(25 - 60 \cdot 35/100)^2}{60 \cdot 35/100} + \frac{(20 - 60 \cdot 40/100)^2}{60 \cdot 40/100} + \frac{(15 - 60 \cdot 25/100)^2}{60 \cdot 25/100} + \frac{(10 - 40 \cdot 35/100)^2}{40 \cdot 35/100} + \frac{(20 - 40 \cdot 40/100)^2}{40 \cdot 40/100} + \frac{(10 - 40 \cdot 25/100)^2}{40 \cdot 25/100} \approx 3.57$$

$$\chi^2_{1-0.05}\big((3-1)(2-1)\big) = \chi^2_{0.95}(2) \approx 5.99 > 3.57$$

→ no grounds to reject $H_0$.
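For reference, the same computation in Python; `scipy.stats.chi2_contingency` reproduces the statistic, degrees of freedom and p-value for the table above:

```python
import numpy as np
from scipy import stats

# Contingency table from the slide: rows = jazz/rock/hip-hop, columns = support / do not support.
table = np.array([[25, 10],
                  [20, 20],
                  [15, 10]])

chi2, pvalue, df, expected = stats.chi2_contingency(table, correction=False)
print(chi2, df, pvalue)  # chi2 ≈ 3.57, df = 2, p ≈ 0.17 > 0.05 -> do not reject H0
```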
Bayesian Statistics vs. traditional statistics
Frequentist: unknown parameters are given (fixed), observed data are random
Bayesian: observed data are given (fixed),
parameters are random
Bayesian Statistics
Our knowledge about the unknown parameters is described by means of probability distributions, and additional knowledge may affect our description.
Knowledge:
general and specific
Example: coin toss
Bayesian Model
$X_1, \ldots, X_n$ come from distribution $P_\theta$, with density $f_\theta(x)$ – the conditional density given a specific value of $\theta$ (the likelihood function).

P – family of probability distributions $P_\theta$, indexed by the parameter $\theta \in \Theta$.

General knowledge: a distribution $\Pi$ over the parameter space $\Theta$, given by $\pi(\theta)$ – the so-called a priori/prior distribution of $\theta$,

$$\theta \sim \Pi.$$
Bayesian Model – cont.
Additional knowledge (specific, contextual): based on observation. We have a joint distribution of the observations and $\theta$:

$$f(x_1, x_2, \ldots, x_n, \theta) = f(x_1, x_2, \ldots, x_n \mid \theta)\, \pi(\theta);$$

on this basis we can derive the conditional distribution of $\theta$ (given the observed data):

$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{f(x_1, \ldots, x_n \mid \theta)\, \pi(\theta)}{m(x_1, \ldots, x_n)},$$

where

$$m(x_1, \ldots, x_n) = \int_\Theta f(x_1, \ldots, x_n \mid \theta)\, \pi(\theta)\, d\theta$$

is the marginal distribution of the observations.
Bayesian Model – a posteriori distribution
$\pi(\theta \mid x_1, \ldots, x_n)$ is called the a posteriori/posterior distribution, denoted $\Pi_x$. The posterior distribution reflects all knowledge: general (initial) and specific (based on the observed data). It provides the grounds for Bayesian inference and modeling.
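A minimal numerical sketch of this construction: posterior ∝ likelihood × prior, normalized by the marginal m(x). The Bernoulli data and the Beta(2, 2) prior below are assumed purely for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # assumed 0-1 observations
theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter space (0, 1)

prior = stats.beta.pdf(theta, 2, 2)                         # pi(theta)
lik = theta ** x.sum() * (1 - theta) ** (x.size - x.sum())  # f(x_1,...,x_n | theta)

joint = lik * prior
m = joint.sum() * (theta[1] - theta[0])       # Riemann approximation of m(x_1,...,x_n)
posterior = joint / m                         # pi(theta | x_1,...,x_n) on the grid
print(theta[np.argmax(posterior)])            # grid approximation of the posterior mode
```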
A priori and a posteriori distributions: examples
1. Let $X_1, \ldots, X_n$ be IID r.v. from a 0-1 distribution with probability of success $\theta$; let

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } \theta \in (0,1),$$

where

$$B(\alpha, \beta) = \int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\, du = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

and

$$\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} \exp(-u)\, du, \qquad \Gamma(\alpha) = (\alpha-1)\,\Gamma(\alpha-1);$$

then the posterior distribution is

$$\mathrm{Beta}\Big(\alpha + \sum_{i=1}^{n} x_i,\; \beta + n - \sum_{i=1}^{n} x_i\Big)$$

(a conjugate prior for the Bernoulli distribution). The $\mathrm{Beta}(\alpha, \beta)$ distribution has mean $\alpha/(\alpha+\beta)$.
[Plots: posterior densities for a Beta(1,1) prior with data n = 10 (1, 5, 9 successes) and n = 100 (10, 50, 90 successes); for a Beta(10,10) prior with the same data; for a Beta(1,5) prior with the same data.]
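A short sketch of the conjugate update behind these plots, using the priors and some of the data configurations listed above:

```python
from scipy import stats

# Beta(a, b) prior + n Bernoulli trials with s successes -> Beta(a + s, b + n - s) posterior.
def posterior(a, b, s, n):
    return stats.beta(a + s, b + n - s)

for a, b in [(1, 1), (10, 10), (1, 5)]:   # the priors from the plots
    for n, s in [(10, 5), (100, 50)]:     # two of the data configurations
        print((a, b), (n, s), "posterior mean:", round(posterior(a, b, s, n).mean(), 3))
```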
A priori and a posteriori distributions: examples (2)
2. Let $X_1, \ldots, X_n$ be IID r.v. from $N(\theta, \sigma^2)$, with $\sigma^2$ known; $\theta \sim N(m, \tau^2)$ for $m, \tau$ known. Then the posterior distribution for $\theta$ is

$$N\left( \frac{\frac{n}{\sigma^2}\,\bar{X} + \frac{1}{\tau^2}\,m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}},\; \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} \right)$$

(a conjugate prior for a normal distribution).
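A minimal sketch of this update in Python; the sample is the one used in the worked examples later in the lecture (σ² = 4, prior N(1, 1)):

```python
import numpy as np

# Posterior for theta with known sigma^2: a precision-weighted average of the
# sample mean and the prior mean m, with variance 1 / (n/sigma^2 + 1/tau^2).
def normal_posterior(x, sigma2, m, tau2):
    n = len(x)
    prec = n / sigma2 + 1 / tau2
    mean = ((n / sigma2) * np.mean(x) + m / tau2) / prec
    return mean, 1 / prec

x = [1.2, 1.7, 1.9, 2.1, 3.1]
print(normal_posterior(x, sigma2=4, m=1, tau2=1))  # mean = 14/9 ≈ 1.556
```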
Bayesian Statistics
Based on the Bayes approach, we can:
find estimates,
find an equivalent of confidence intervals,
verify hypotheses,
make predictions.
Bayesian Most Probable (BMP) / Maximum a posteriori Probability (MAP) estimate
Similar to ML estimation: the argument which maximizes the posterior distribution:

$$\pi(\hat\theta_{BMP} \mid x_1, \ldots, x_n) = \max_\theta \pi(\theta \mid x_1, \ldots, x_n),$$

i.e.

$$BMP(\theta) = \hat\theta_{BMP} = \operatorname{arg\,max}_\theta \pi(\theta \mid x_1, \ldots, x_n).$$
BMP: examples
1. Let $X_1, \ldots, X_n$ be IID r.v. from a Bernoulli distribution with probability of success $\theta$; for $\theta \in (0,1)$

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}.$$

We know the posterior distribution: $\mathrm{Beta}\big(\alpha + \sum_{i=1}^{n} x_i,\; \beta + n - \sum_{i=1}^{n} x_i\big)$; the mode of a $\mathrm{Beta}(\alpha, \beta)$ distribution equals $(\alpha-1)/(\alpha+\beta-2)$ for $\alpha > 1$, $\beta > 1$, so we have the maximum for

$$BMP(\theta) = \frac{\sum_{i=1}^{n} x_i + \alpha - 1}{n + \alpha + \beta - 2},$$

i.e. for 5 successes in 10 trials with an a priori U(0,1) distribution (i.e. Beta(1,1)), we have $BMP(\theta) = 5/10 = 1/2$, and for 9 successes in 10 trials with the same a priori distribution, we have $BMP(\theta) = 9/10$.
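A quick check of this formula in plain Python (the helper `bmp` is ad hoc, written for this example):

```python
# Mode of the Beta(a + s, b + n - s) posterior, valid when both parameters exceed 1.
def bmp(a, b, s, n):
    a_post, b_post = a + s, b + n - s
    return (a_post - 1) / (a_post + b_post - 2)

print(bmp(1, 1, 5, 10))  # 0.5  (5 successes in 10 trials, Beta(1,1) prior)
print(bmp(1, 1, 9, 10))  # 0.9  (9 successes in 10 trials, same prior)
```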
BMP: examples (2)
2. Let $X_1, \ldots, X_n$ be IID r.v. from $N(\theta, \sigma^2)$, with $\sigma^2$ known; $\theta \sim N(m, \tau^2)$ for $m, \tau$ known. Then the posterior distribution for $\theta$ is the normal distribution above, whose mode equals its mean, so

$$BMP(\theta) = \frac{\frac{n}{\sigma^2}\,\bar{X} + \frac{1}{\tau^2}\,m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}},$$

i.e. if we have a sample of 5 observations 1.2, 1.7, 1.9, 2.1, 3.1 from the distribution $N(\theta, 4)$ and the a priori distribution is $\theta \sim N(1, 1)$, then $BMP(\theta) = (5/4 \cdot 2 + 1 \cdot 1)/(5/4 + 1) = 14/9 \approx 1.56$; if the a priori distribution were $\theta \sim N(3, 1)$, then $BMP(\theta) = (5/4 \cdot 2 + 1 \cdot 3)/(5/4 + 1) = 22/9 \approx 2.44$.
Bayes Estimator
An estimation rule which minimizes the posterior expected value of a loss function.

$L(\theta, a)$ – loss function; depends on the true value of $\theta$ and the decision $a$.

e.g. if we want to estimate $g(\theta)$:
$L(\theta, a) = (g(\theta) - a)^2$ – quadratic loss function
$L(\theta, a) = |g(\theta) - a|$ – absolute value (modulus) loss function
Bayes Estimator – cont.
We can also define the accuracy of an estimate for a given loss function:

$$acc_\Pi(x, \hat g(x)) = E\big(L(\theta, \hat g(X)) \mid X = x\big) = \int_\Theta L(\theta, \hat g(x))\, \pi(\theta \mid x)\, d\theta$$

(the average loss of the estimator for a given a priori distribution and data, i.e. for a specific posterior distribution)
Bayes Estimator – cont. (2)
The Bayes Estimator $\hat g_B$ for a given loss function $L(\theta, a)$ is such that

$$\forall x \quad acc_\Pi(x, \hat g_B(x)) = \min_a acc_\Pi(x, a).$$

For the quadratic loss function $(\theta - a)^2$:

$$\hat g_B(x) = E(\theta \mid X = x) = E_{\Pi_x}(\theta)$$

(more generally: $E(g(\theta) \mid x)$).

For the absolute value loss function $|\theta - a|$:

$$\hat g_B(x) = \operatorname{Med}_{\Pi_x}(\theta).$$
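As a small illustration, for a Beta(6, 6) posterior (e.g. a Beta(1, 1) prior and 5 successes in 10 trials) the two Bayes estimators coincide by symmetry; `scipy.stats` exposes both directly:

```python
from scipy import stats

post = stats.beta(6, 6)                          # an example posterior distribution
print("mean (quadratic loss):", post.mean())     # 0.5
print("median (absolute loss):", post.median())  # 0.5, equal to the mean by symmetry
```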
Bayes Estimator: Example (1)
1. Let $X_1, \ldots, X_n$ be IID r.v. from a Bernoulli distribution with probability of success $\theta$; for $\theta \in (0,1)$

$$\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}.$$

We know the posterior distribution: $\mathrm{Beta}\big(\alpha + \sum_{i=1}^{n} x_i,\; \beta + n - \sum_{i=1}^{n} x_i\big)$; the $\mathrm{Beta}(\alpha, \beta)$ distribution has mean $\alpha/(\alpha+\beta)$, so the Bayes Estimator (under quadratic loss) is

$$\hat\theta_B = \frac{\sum_{i=1}^{n} x_i + \alpha}{n + \alpha + \beta},$$

i.e. for 5 successes in 10 trials with an a priori U(0,1) (i.e. Beta(1,1)) distribution, we have $\hat\theta_B = 6/12 = 1/2$, and for 9 successes in 10 trials with the same a priori distribution, we have $\hat\theta_B = 10/12 = 5/6$.
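A quick numeric check of this estimator (the helper `bayes_est` is ad hoc):

```python
# Posterior mean (a + s) / (a + b + n) for a Beta(a, b) prior and s successes in n trials.
def bayes_est(a, b, s, n):
    return (a + s) / (a + b + n)

print(bayes_est(1, 1, 5, 10))  # 6/12 = 0.5
print(bayes_est(1, 1, 9, 10))  # 10/12 ≈ 0.833
```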
Bayes Estimator: examples (2)
2. Let $X_1, \ldots, X_n$ be IID r.v. from $N(\theta, \sigma^2)$, with $\sigma^2$ known; $\theta \sim N(m, \tau^2)$ for $m, \tau$ known. Then the a posteriori distribution for $\theta$ is the normal distribution above, so

$$\hat\theta_B = \frac{\frac{n}{\sigma^2}\,\bar{X} + \frac{1}{\tau^2}\,m}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}},$$

i.e. if we have a sample of 5 observations 1.2, 1.7, 1.9, 2.1, 3.1 from the distribution $N(\theta, 4)$ and the a priori distribution is $\theta \sim N(1, 1)$, then $\hat\theta_B = (5/4 \cdot 2 + 1 \cdot 1)/(5/4 + 1) = 14/9 \approx 1.56$; if the a priori distribution were $\theta \sim N(3, 1)$, then $\hat\theta_B = (5/4 \cdot 2 + 1 \cdot 3)/(5/4 + 1) = 22/9 \approx 2.44$.