(1)

OLS, MLE and related topics. Primer.

Katarzyna Bech

Week 1

(2)

Classical Linear Regression Model (CLRM)

The model:

$$y = X\beta + e,$$
and the assumptions:

A1 The true model is $y = X\beta + e$.

A2 $E(e) = 0$.

A3 $\mathrm{Var}(e) = E(ee') = \sigma^2 I_n$.

A4 $X$ is a non-stochastic $n \times k$ matrix with rank $k \le n$.

(3)

Least Squares Estimation

The OLS estimator $\hat\beta$ minimizes the sum of squares:
$$\hat\beta = \arg\min_{\beta} RSS(\beta), \qquad RSS(\beta) = e'e = (y - X\beta)'(y - X\beta).$$
Using the rules of matrix calculus, the FOC is
$$\frac{\partial RSS(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0,$$
and the SOC is
$$\frac{\partial^2 RSS(\beta)}{\partial \beta\, \partial \beta'} = 2X'X,$$
which is clearly positive definite.

(4)

Least Squares Estimation

From the FOC:
$$\hat\beta = (X'X)^{-1}X'y.$$
Properties of the OLS estimator:
$$E[\hat\beta] = \beta + (X'X)^{-1}X'E[e] = \beta,$$
hence it is unbiased, and
$$\mathrm{Var}[\hat\beta] = E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big] = \sigma^2(X'X)^{-1}.$$
Gauss-Markov Theorem: under A1-A4, $\hat\beta_{OLS}$ is BLUE.

(5)

Gauss Markov Theorem Proof

Consider again the model $y = X\beta + e$. Let
$$\tilde\beta_{(k \times 1)} = C_{(k \times n)}\, y_{(n \times 1)}$$
be another linear unbiased estimator.

For $\tilde\beta$ to be unbiased we require
$$E(\tilde\beta) = E(Cy) = E(CX\beta + Ce) = CX\beta = \beta.$$
Thus, we need
$$CX = I_k \;\Rightarrow\; \tilde\beta = CX\beta + Ce = \beta + Ce.$$
We have
$$\mathrm{Var}(\tilde\beta) = E\big((\tilde\beta - \beta)(\tilde\beta - \beta)'\big) = E(Cee'C') = CE(ee')C' = \sigma^2 CC'.$$

(6)

Gauss Markov Theorem Proof

We want to show that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$. Thus,
$$\begin{aligned}
\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) &= \sigma^2\big(CC' - (X'X)^{-1}\big)\\
&= \sigma^2\big(CC' - \underbrace{CX}_{=I_k}(X'X)^{-1}\underbrace{X'C'}_{=I_k}\big)\\
&= \sigma^2\big(CI_nC' - CX(X'X)^{-1}X'C'\big)\\
&= \sigma^2 C\big(I_n - X(X'X)^{-1}X'\big)C'\\
&= \sigma^2 CM_XC'\\
&= \sigma^2 CM_XM_X'C' = \sigma^2 DD',
\end{aligned}$$
where $D = CM_X$. $DD'$ is positive semidefinite, so $\sigma^2 DD' \geq 0$, implying that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$.

(7)

Gauss Markov Theorem Proof

The latter result also applies to any linear combination of the elements of $\beta$, i.e.
$$\mathrm{Var}(c'\tilde\beta) - \mathrm{Var}(c'\hat\beta) \geq 0.$$
We thus have the following corollaries.

Corollary: Under A1-A4, for any vector of constants $c$, the minimum variance linear unbiased estimator of $c'\beta$ in the classical regression model is $c'\hat\beta$, where $\hat\beta$ is the least squares estimator.

Corollary: Each coefficient $\beta_j$ is estimated at least as efficiently by $\hat\beta_j$ as by any other linear unbiased estimator.

(8)

Least Squares Estimation

Clearly, we also need to estimate $\sigma^2$. The most obvious estimator to take is
$$\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'M_Xy}{n},$$
where $M_X = I - P_X = I - X(X'X)^{-1}X'$.

BUT $\hat\sigma^2$ is biased, so we have to define an unbiased estimator
$$s^2 = \frac{n}{n-k}\,\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n-k}.$$
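
A minimal numerical sketch of these formulas on simulated data (all variable names and parameter values are purely illustrative):

```python
# Compute the OLS estimator, the unbiased variance estimate s^2 and the usual standard errors.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # regressors, incl. a constant
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y               # OLS: (X'X)^{-1} X'y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)               # unbiased estimator of sigma^2
var_beta_hat = s2 * XtX_inv                # estimated Var(beta_hat)

print(beta_hat)
print(s2, np.sqrt(np.diag(var_beta_hat)))  # s^2 and the conventional OLS standard errors
```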

(9)

How is econometrics typically taught?

What are the consequences of the violation of the assumptions A1-A4?

(10)

Perfect Multicollinearity

Let's see what happens when A4 is violated, i.e. rank(X) < k (equivalently, the columns of X are linearly dependent).

In this case $X'X$ is not invertible (it is singular), so the ordinary least squares estimator cannot be computed. The parameters of such a regression model are unidentified.

The use of too many dummy variables (the dummy variable trap) is a typical cause of exact multicollinearity. Consider for instance the case where we would like to estimate a wage equation including a dummy for males (MALE_i), a dummy for females (FEMALE_i) as well as a constant (note that MALE_i + FEMALE_i = 1 for all i).

Exact multicollinearity is easily solved by excluding one of the variables from the model.
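
A small sketch of the dummy variable trap on simulated data (the MALE/FEMALE names follow the slide; everything else is illustrative): with a constant plus both dummies the columns of X are linearly dependent, so $X'X$ is singular and $\beta$ is not identified; dropping one dummy restores full rank.

```python
# Dummy variable trap: MALE + FEMALE equals the constant column, so rank(X) < k.
import numpy as np

rng = np.random.default_rng(1)
n = 100
male = rng.integers(0, 2, size=n)
female = 1 - male

X_trap = np.column_stack([np.ones(n), male, female])   # constant + both dummies
print(np.linalg.matrix_rank(X_trap))                   # 2 < 3: rank deficient
print(np.linalg.cond(X_trap.T @ X_trap))               # effectively infinite condition number

X_ok = np.column_stack([np.ones(n), male])             # drop one dummy
print(np.linalg.matrix_rank(X_ok))                     # 2: full column rank, parameters identified
```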

(11)

"Near" Multicollinearity

Imperfect multicollinearity arises when two or more regressors are highly correlated, in the sense that there is a linear function of regressors which is highly correlated with another regressor.

When regressors are highly correlated, it becomes difficult to disentangle the separate effects of the regressors on the dependent variable.

Near multicollinearity is not a violation of the classical linear regression assumptions, so our OLS estimates are still the best linear unbiased estimators. "Best", however, may simply not be all that good.

(12)

"Near" Multicollinearity

The higher the correlation between the regressors becomes, the less precise our estimates will be (note that $\mathrm{Var}(\hat\beta) = \sigma^2(X'X)^{-1}$).

Consider
$$y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + e_i.$$
Then
$$\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ij} - \bar x_j)^2\,(1 - r^2_{x_1,x_2})}.$$
Obviously, as $r^2_{x_1,x_2} \to 1$, the variance increases.
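
The simulation sketch below (illustrative numbers only) evaluates $\sigma^2(X'X)^{-1}$ for increasingly correlated regressors and shows the standard error of $\hat\beta_1$ blowing up as the correlation approaches one:

```python
# Variance inflation: se(beta_1_hat) grows as corr(x1, x2) -> 1.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 500, 1.0

for r in (0.0, 0.5, 0.9, 0.99):
    x1 = rng.normal(size=n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # corr(x1, x2) is approximately r
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma**2 * np.linalg.inv(X.T @ X)            # Var(beta_hat) for known sigma^2
    print(f"r = {r:4.2f}   se(beta_1) = {np.sqrt(var_beta[1, 1]):.4f}")
```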

(13)

"Near" Multicollinearity

Practical consequences:

Although OLS estimators are still BLUE they have large variances and covariances.

Individual t-tests may fail to reject that coefficients are 0, even though they are jointly significant.

Parameter estimates may be very sensitive to one or a small number of observations.

Coefficients may have the "wrong" sign or an implausible magnitude.

How to detect multicollinearity?

High $R^2$ but few significant t-ratios.

High pairwise correlation among explanatory variables.

How to deal with multicollinearity?

A priori information.

Dropping a variable.

Transformation of variables.

(14)

Misspecifying the Set of Regressors - violation of A1

We consider two issues:

Omission of relevant variable (e.g., due to oversight or lack of measurement).

Inclusion of irrelevant variables.

Preview of the results:

In the first case, the estimator is generally biased.

In the second case, there is no bias but the estimator is inefficient.

(15)

Omitted Variable Bias

Consider the following two models:

True model:
$$y = X\beta + Z\delta + e, \qquad E(e) = 0, \qquad E(ee') = \sigma^2 I \qquad (1)$$
and the misspecified model which we estimate:
$$y = X\beta + v. \qquad (2)$$
The OLS estimate of $\beta$ in (2) is
$$\tilde\beta = (X'X)^{-1}X'y.$$

(16)

Omitted Variable Bias

In order to assess the statistical properties of the latter, we have to substitute the true model for $y$, which is given by (1). We obtain
$$\tilde\beta = (X'X)^{-1}X'(X\beta + Z\delta + e) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'e.$$
Taking expectations (recall that $X$ and $Z$ are non-stochastic), we have
$$E(\tilde\beta) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'E(e) = \beta + (X'X)^{-1}X'Z\delta.$$
Thus, in general $\tilde\beta$ is biased.

Interpretation of the bias term:

$(X'X)^{-1}X'Z$ is the matrix of regression coefficients of the omitted variable(s) on all included variables.

$\delta$ is the true coefficient corresponding to the omitted variable(s).
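
A Monte Carlo sketch of this bias formula (the design, names and parameter values are illustrative): the short regression omitting $z$ is estimated repeatedly and the average bias is compared with $(X'X)^{-1}X'Z\delta$.

```python
# Omitted variable bias: simulated average of beta_tilde minus beta versus the analytic bias term.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 300, 2000
beta, delta = np.array([1.0, 0.5]), 0.8

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(scale=0.8, size=n)        # omitted variable, correlated with x
X = np.column_stack([np.ones(n), x])
Z = z.reshape(-1, 1)

analytic_bias = np.linalg.inv(X.T @ X) @ X.T @ Z @ np.array([delta])

estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + z * delta + rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]   # short regression omitting z

print("simulated bias:", estimates.mean(axis=0) - beta)
print("analytic bias: ", analytic_bias)
```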

(17)

Omitted Variable Bias

We can interpret $E(\tilde\beta)$ as the sum of two terms:

$\beta$ is the direct change in $E(y)$ associated with changes in $X$;

$(X'X)^{-1}X'Z\delta$ is the indirect change in $E(y)$ associated with changes in $Z$ ($X$ partially acts as a "proxy" for $Z$ when $Z$ is omitted).

There would be no omitted variable bias if $X$ and $Z$ were orthogonal, i.e. $X'Z = 0$. Trivially, there would also be no bias if $\delta = 0$.

(18)

Example

Consider the standard human capital earnings function
$$\ln w_i = \alpha + \beta_1\, school_i + \beta_2\, exp_i + v_i \qquad (3)$$
and we are interested in $\beta_1$, the return to schooling.

We suspect that the relevant independent variable $ability_i$ is omitted.

What is the likely direction of the bias?

Let $\hat\nu$ denote the OLS coefficient corresponding to schooling in the regression of $ability_i$ on a constant, $school_i$ and $exp_i$. $\hat\nu$ is likely to be positive (note that we are not claiming a causal effect).

On the other hand, consider the true model (which cannot be used since $ability_i$ is not observed)
$$\ln w_i = \alpha + \beta_1\, school_i + \beta_2\, exp_i + \delta\, ability_i + v_i.$$

(19)

Example

Hence, the direction of the bias of $\hat\beta_1$ when (3) is estimated is given by
$$\underbrace{\hat\nu}_{(+)}\;\underbrace{\delta}_{(+)} = +.$$
The return to schooling tends to be overestimated.

(20)

Variance

Taking into account that $X$ and $Z$ are non-stochastic and $E(e) = 0$,
$$\mathrm{Var}(\tilde\beta) = E\big((\tilde\beta - E(\tilde\beta))(\tilde\beta - E(\tilde\beta))'\big) = \sigma^2(X'X)^{-1}.$$
Important: $s^2$ is a biased estimate of the error variance of the true regression. Indeed,
$$s^2 = \frac{\hat v'\hat v}{n - k_x},$$
where $\hat v = y - X\tilde\beta = M_Xy$. By substituting the true model (1) for $y$, we get
$$\hat v = M_X(X\beta + Z\delta + e) = M_X(Z\delta + e).$$
Thus (show it)
$$E(s^2) = \sigma^2 + \frac{\delta'Z'M_XZ\delta}{n - k_x} \neq \sigma^2.$$

(21)

Summary of consequences of omitting a relevant variable

If the omitted variable(s) are correlated with the included variable(s), the parameter estimates are biased.

The disturbance variance σ2 is incorrectly estimated.

As a result, the usual confidence intervals and hypothesis testing procedures are likely to give misleading conclusions.

(22)

Irrelevant Variables

Suppose that the correct model is

y =X β+e (4)

but we instead choose to estimate the ’bigger’model

y =X β+Z δ+v (5)

The OLS estimate of β in (5) are

˜β = (X0MZX) 1X0MZy .

(23)

Irrelevant Variables

In order to assess the statistical properties of the latter, we have to substitute the true model for $y$, which is given by (4). We obtain
$$\tilde\beta = (X'M_ZX)^{-1}X'M_Z(X\beta + e) = \beta + (X'M_ZX)^{-1}X'M_Ze.$$
Taking expectations, we have
$$E(\tilde\beta) = \beta + (X'M_ZX)^{-1}X'M_ZE(e) = \beta.$$
Thus, $\tilde\beta$ is unbiased.

However, it can be shown that
$$\mathrm{Var}(\tilde\beta) \geq \sigma^2(X'X)^{-1},$$
where $\sigma^2(X'X)^{-1}$ is the variance of the 'correct' OLS estimator.

(24)

Summary of consequences of including an irrelevant variable

OLS estimators of the parameters of the "incorrect" model are unbiased, but less precise (larger variance).

Because the price for omitting a relevant variable is so much higher than the price for including irrelevant ones, many econometric theorists have suggested a "General to Specific" modelling strategy. It entails initially including every regressor that may be suspected of being relevant. Then a combination of $R^2$, F and t tests is used to eliminate the least significant regressors, hopefully leading to a "correct" model.

(25)

Ramsey’s RESET test

The most commonly used test to check the specification of the mean function is known as Ramsey's RESET test.

Let $\hat y_i = x_i'\hat\beta$ be the fitted values of the OLS regression. The Ramsey RESET procedure is to run the regression
$$y_i = x_i'\beta + \gamma_1(\hat y_i)^2 + \gamma_2(\hat y_i)^3 + \dots + \gamma_q(\hat y_i)^{q+1} + v_i$$
and perform an F test of the hypothesis
$$H_0: \gamma_1 = \gamma_2 = \dots = \gamma_q = 0.$$
If we cannot reject $H_0$, the model is correctly specified.
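
A sketch of the RESET procedure with $q = 2$ (squares and cubes of the fitted values), implemented as the corresponding F test on simulated data; the function and variable names are illustrative:

```python
# Ramsey's RESET test: add powers of the fitted values and F-test their joint significance.
import numpy as np
from scipy import stats

def reset_test(y, X, max_power=3):
    n = X.shape[0]
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ b
    rss_r = np.sum((y - yhat) ** 2)                           # restricted RSS
    X_aug = np.column_stack([X] + [yhat**p for p in range(2, max_power + 1)])
    b_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]
    rss_u = np.sum((y - X_aug @ b_aug) ** 2)                  # unrestricted RSS
    q = max_power - 1
    df2 = n - X_aug.shape[1]
    F = ((rss_r - rss_u) / q) / (rss_u / df2)
    return F, stats.f.sf(F, q, df2)

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_lin = X @ np.array([1.0, 2.0]) + rng.normal(size=n)         # correctly specified mean
y_sq = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)        # omitted nonlinearity

print(reset_test(y_lin, X))   # large p-value expected: do not reject H0
print(reset_test(y_sq, X))    # small p-value expected: misspecification detected
```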

(26)

Modelling Strategy

Approaches to model building:

"General to speci…c".

"Simple to general".

The motivating factor when building a model should be ECONOMIC THEORY

How do we judge a model to be "good"?

Parsimony
Identifiability
Goodness of fit
Theoretical consistency
Predictive power

(27)

Selecting regressors

It is good practice to select the set of potentially relevant variables on the basis of economic arguments rather than statistical ones.

There is always a small (but not ignorable) probability of drawing the wrong conclusion.

For example, there is always a probability of rejecting the null hypothesis that a coefficient is zero while the null is actually true.

Such type I errors are rather likely to happen if we use a sequence of many tests to select the regressors to include in the model. This process is referred to as data mining.

In presenting your estimation results, it is not a mistake to have insignificant variables included in your specification. Of course, you should be careful about including many multicollinear variables in your model, because then, in the end, almost none of the variables may appear individually significant.

(28)

Spherical Disturbances

Assumption A3 of the Classical Linear Regression Model ensures that the variance-covariance matrix of the errors is

$$E(ee') = \sigma^2 I_n.$$
This implies two things:

$E(e_i^2) = \sigma^2$ for all $i$, i.e. homoskedasticity;

$E(e_ie_j) = 0$ if $i \neq j$, i.e. no serial correlation.

Naturally, testing for heteroskedasticity and for serial correlation forms a very important part of the econometrics of the linear model. Before we do that, however, we need to look at the sources of what the textbooks call 'non-spherical errors' and at what the effects on our model might be.

(29)

Generalized Linear Regression Model

The Generalized Linear Regression Model is just the Classical Linear Regression Model, but with non-spherical disturbances (the

covariance matrix is no longer proportional to the identity matrix).

We consider the model

$$y = X\beta + e,$$
where
$$E(e) = 0 \quad\text{and}\quad E(ee') = \Sigma.$$
$\Sigma$ is any symmetric positive definite matrix.

(30)

Non-spherical disturbances

We are interested in two different forms of non-spherical disturbances:

Heteroscedasticity, with
$$E(e_i^2) = \sigma_i^2 \quad\text{and}\quad E(e_ie_j) = 0 \text{ if } i \neq j,$$
so that each observation has a different variance. In this case,
$$\Sigma = \sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n^2 \end{pmatrix} = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}.$$
Note that $\Omega$ is a diagonal matrix of weights $\omega_i$. Sometimes it is convenient to write $\sigma_i^2 = \sigma^2\omega_i$.

Autocorrelation, with
$$E(e_i^2) = \sigma^2 \quad\text{and}\quad E(e_ie_j) \neq 0 \text{ if } i \neq j.$$

(31)

Heteroscedasticity - graphical intuition

When data are homoscedastic, we expect the residuals to be spread evenly around zero, with roughly the same dispersion across observations (scatter plot omitted).

(32)

Heteroscedasticity - graphical intuition

With heteroscedasticity, we expect the dispersion of the residuals to change systematically across observations, e.g. fanning out as a regressor grows (scatter plot omitted).

(33)

Consequences for OLS estimation

Let's examine the statistical properties of the OLS estimator when $E(ee') = \sigma^2\Omega \neq \sigma^2 I_n$.

Unbiasedness: $\hat\beta$ is still unbiased, since $\hat\beta = \beta + (X'X)^{-1}X'e$ and $E(e) = 0$, so we conclude that $E(\hat\beta) = \beta$.

Efficiency: the variance of $\hat\beta$ changes to
$$\mathrm{Var}(\hat\beta) = E\big((\hat\beta - \beta)(\hat\beta - \beta)'\big) = (X'X)^{-1}X'E(ee')X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}.$$
The consequence is that the standard OLS estimated variance and standard errors (computed by statistical packages) are biased, since they are based on the wrong formula $\sigma^2(X'X)^{-1}$. Standard tests based on them are therefore unreliable.

(34)

Consequences for OLS estimation

Since one of the Gauss-Markov assumptions is violated ($E(ee') \neq \sigma^2 I_n$), the OLS estimator of such models is no longer BLUE (although, remember, it is still unbiased).

Possible remedies:

Construct another estimator which is BLUE; we will call it Generalized Least Squares (GLS).

We stick to OLS, but we compute the correct estimated variance.

(35)

GLS

The idea is to transform the model

$$y = X\beta + e$$

so that the transformed model satisfies the Gauss-Markov assumptions.

We assume for the time being that $\Omega$ is known (a rather unrealistic assumption).

Property: Since $\Omega$ is positive definite and symmetric, there exists a square, nonsingular matrix $P$ such that
$$P'P = \Omega^{-1}.$$
Sketch of proof (spectral decomposition): Since $\Omega$ is symmetric, there exist $C$ and $\Lambda$ such that
$$\Omega = C\Lambda C', \qquad C'C = I_n.$$
Because $\Omega$ is positive definite (a positive definite matrix is always nonsingular):
$$\Omega^{-1} = C\Lambda^{-1}C' = P'P, \qquad\text{with } P = \Lambda^{-1/2}C'.$$

(36)

GLS

We can transform the model $y = X\beta + e$ by premultiplying it by the matrix $P$:
$$Py = PX\beta + Pe,$$
i.e.
$$y^{*} = X^{*}\beta + e^{*}. \qquad (6)$$
This transformed model satisfies the Gauss-Markov assumptions, since
$$E(e^{*}) = E(Pe) = PE(e) = 0$$
and
$$\mathrm{Var}(e^{*}) = E(e^{*}e^{*\prime}) = E(Pee'P') = PE(ee')P' = \sigma^2 P\Omega P' = \sigma^2 P(P'P)^{-1}P' = \sigma^2 PP^{-1}P'^{-1}P' = \sigma^2 I_n.$$

(37)

GLS

Hence, the OLS estimator of the transformed model (6) is BLUE and
$$\hat\beta_{GLS} = (X^{*\prime}X^{*})^{-1}X^{*\prime}y^{*} = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
It is easy to verify that
$$E(\hat\beta_{GLS}) = \beta$$
and
$$\mathrm{Var}(\hat\beta_{GLS}) = \sigma^2(X'\Omega^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}.$$
(I leave it to you as an exercise.)

(38)

GLS

Note: the GLS estimator we have just discussed is useful in any general case of non-spherical disturbances. The general formula is
$$\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y.$$
Now we will specialize it to the heteroscedastic case, i.e. when $\Sigma = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}$.

(39)

Unfeasible GLS

We consider again the model
$$y_i = x_i'\beta + e_i, \qquad \mathrm{Var}(e_i) = \sigma_i^2.$$
Thus, $\mathrm{Var}(e) = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}$.

In this setting, the transformation $P$ is given by
$$P = \mathrm{diag}\Big\{\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_n}\Big\}.$$
It is easy to verify that $(P'P)^{-1} = \Omega = \Sigma$, when we set $\sigma^2 = 1$.

(40)

Unfeasible GLS

The transformed model is given by
$$\frac{y_i}{\sigma_i} = \Big(\frac{x_i}{\sigma_i}\Big)'\beta + \frac{e_i}{\sigma_i}.$$
Thus,
$$\hat\beta_{GLS} = \left(\sum_{i=1}^{n}\frac{x_ix_i'}{\sigma_i^2}\right)^{-1}\sum_{i=1}^{n}\frac{x_iy_i}{\sigma_i^2},$$
which is also called "weighted least squares".

This approach is only available if $\sigma_i^2$ is known.
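
A sketch of this (unfeasible) weighted least squares estimator with known $\sigma_i^2$ on simulated data; it also checks the equivalence with OLS on the transformed observations $y_i/\sigma_i$, $x_i/\sigma_i$:

```python
# Unfeasible GLS / weighted least squares with known heteroscedastic variances.
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma2_i = 0.5 * x**2                                   # known variances sigma_i^2
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2_i))

w = 1.0 / sigma2_i                                      # weights 1 / sigma_i^2
XtWX = (X * w[:, None]).T @ X                           # sum of x_i x_i' / sigma_i^2
XtWy = (X * w[:, None]).T @ y                           # sum of x_i y_i / sigma_i^2
beta_wls = np.linalg.solve(XtWX, XtWy)

# Equivalent route: OLS on the transformed data y_i/sigma_i and x_i/sigma_i.
p = 1.0 / np.sqrt(sigma2_i)
beta_check = np.linalg.lstsq(X * p[:, None], y * p, rcond=None)[0]
print(beta_wls, beta_check)
```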

(41)

Feasible GLS

Problem: $\sigma_i^2$ is not known in general. Therefore, the latter estimator is unfeasible.

We need to use a feasible version of this estimator, which means we need to replace the unknown $\sigma_i^2$ with sensible estimates $\hat\sigma_i^2$:
$$\hat\beta_{FGLS} = \left(\sum_{i=1}^{n}\frac{x_ix_i'}{\hat\sigma_i^2}\right)^{-1}\sum_{i=1}^{n}\frac{x_iy_i}{\hat\sigma_i^2}.$$

(42)

Feasible GLS

The main issue is that we need to know the form of the heteroscedasticity in order to estimate $\sigma_i^2$.

The simplest situation, for instance, would be if we had a conjecture that $\sigma_i^2 = z_i'a$, where the components of $z_i$ are observable. In such a case, a consistent estimator of $\sigma_i^2$ is given by the fitted values of the artificial (so-called skedastic) regression
$$\hat e_i^2 = z_i'a + v_i,$$
where $\hat e_i$ are the residuals of the original regression when estimated by standard OLS.

We can accommodate more complicated functional forms for the heteroscedasticity within this idea (e.g. $\sigma_i^2 = \exp(z_i'a)$).
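
A sketch of feasible GLS under the exponential form mentioned above, $\sigma_i^2 = \exp(z_i'a)$ (simulated data; the choice of $z_i$ and all numbers are illustrative). One convenient way to estimate $a$ is to regress $\ln\hat e_i^2$ on $z_i$ and use the fitted variances as weights:

```python
# Feasible GLS: estimate sigma_i^2 = exp(z_i'a) from the OLS residuals, then reweight.
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x])                 # skedastic regressors z_i (here: const and x)
sigma2_true = np.exp(0.5 + 1.0 * x)
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2_true))

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]         # step 1: OLS and residuals
e_hat = y - X @ b_ols

a_hat = np.linalg.lstsq(Z, np.log(e_hat**2), rcond=None)[0]   # step 2: skedastic regression
sigma2_hat = np.exp(Z @ a_hat)                                # fitted variances

w = 1.0 / sigma2_hat                                          # step 3: weighted least squares
b_fgls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)
print(b_ols, b_fgls)
```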

(43)

White robust standard errors

Rather than estimating $\beta$ by GLS, we can stick to OLS and "correct" the standard errors. We have seen that
$$\mathrm{Var}(\hat\beta_{OLS}) = (X'X)^{-1}X'E(ee')X(X'X)^{-1} = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
and in vector form
$$\mathrm{Var}(\hat\beta_{OLS}) = \left(\sum_{i=1}^{n}x_ix_i'\right)^{-1}\left(\sum_{i=1}^{n}\sigma_i^2 x_ix_i'\right)\left(\sum_{i=1}^{n}x_ix_i'\right)^{-1}.$$
We don't know $\sigma_i^2$, but under general conditions White (1980) showed that the matrix
$$\frac{1}{n}\sum_{i=1}^{n}\hat e_i^2 x_ix_i'$$
consistently estimates $\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2 x_ix_i'$, where $\hat e_i$ are the OLS residuals. Plugging this into the sandwich formula above yields the White heteroscedasticity-robust variance estimator and standard errors.
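
A sketch of the resulting robust standard errors (the HC0 variant) compared with the naive OLS formula, on simulated heteroscedastic data with illustrative names and values:

```python
# White (1980) heteroscedasticity-robust standard errors via the sandwich formula.
import numpy as np

def white_se(y, X):
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = (X * (e**2)[:, None]).T @ X          # sum of e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv                # sandwich: (X'X)^{-1} (sum e_i^2 x_i x_i') (X'X)^{-1}
    return b, e, np.sqrt(np.diag(V))

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=np.abs(x) + 0.5)   # heteroscedastic errors

b, e, se_robust = white_se(y, X)
se_naive = np.sqrt(np.diag((e @ e / (n - 2)) * np.linalg.inv(X.T @ X)))
print(se_naive)      # based on the wrong formula sigma^2 (X'X)^{-1}
print(se_robust)     # heteroscedasticity-consistent
```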

(44)

Detection of heteroscedasticity

Informal methods:
graphical method.

Formal methods:
Goldfeld-Quandt test,
Breusch-Pagan/Godfrey test,
White's general heteroscedasticity test.

(45)

Goldfeld-Quandt test

Suppose that we think that the heteroscedasticity is particularly related to one of the variables, i.e. one of the columns of $X$, say $x_j$. The Goldfeld-Quandt test is based on the following:

1. Order the observations according to the values of $x_j$, starting with the lowest.

2. Split the sample into three parts, of lengths $n_1$, $c$ and $n_2$.

3. Obtain the residuals $\hat e_1$ and $\hat e_2$ from regressions using the first $n_1$ and then the last $n_2$ observations separately.

4. Test $H_0: \sigma_i^2 = \sigma^2$ for all $i$ using the statistic
$$F = \frac{\hat e_1'\hat e_1/(n_1 - k)}{\hat e_2'\hat e_2/(n_2 - k)},$$
with critical values obtained from the $F(n_1 - k, n_2 - k)$ distribution.

Practical advice: use this test for relatively small sample sizes (up to ...).
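
A sketch of the Goldfeld-Quandt procedure on simulated data (the split fractions and all numbers are illustrative); the error variance is made to decrease in $x$ so that the first block, which appears in the numerator of the formula above, carries the larger variance:

```python
# Goldfeld-Quandt test: sort by x_j, drop the middle c observations, compare sub-sample RSS.
import numpy as np
from scipy import stats

def goldfeld_quandt(y, X, sort_col, drop_frac=0.2):
    n, k = X.shape
    order = np.argsort(X[:, sort_col])
    y, X = y[order], X[order]
    c = int(drop_frac * n)                      # middle observations left out
    n1 = (n - c) // 2
    n2 = n - c - n1

    def rss(ys, Xs):                            # residual sum of squares of a sub-regression
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        r = ys - Xs @ b
        return r @ r

    F = (rss(y[:n1], X[:n1]) / (n1 - k)) / (rss(y[-n2:], X[-n2:]) / (n2 - k))
    return F, stats.f.sf(F, n1 - k, n2 - k)     # upper-tail p-value

rng = np.random.default_rng(8)
n = 150
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=2.0 / x)   # variance decreasing in x
print(goldfeld_quandt(y, X, sort_col=1))
```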

(46)

Breusch-Pagan/Godfrey test

Suppose that the source of heteroscedasticity is that
$$\sigma_i^2 = g(\alpha_0 + \tilde z_i'\alpha),$$
where $\tilde z_i$ is the $q \times 1$ vector of regressors for observation $i$, taken from a matrix $Z$ which is $n \times q$. $Z$ can contain some or all of the $X$'s in our CLRM, if desired, although ideally the choice of $Z$ is made on the basis of some economic theory. For computational purposes $Z$ does not contain a constant term.

This test involves an auxiliary regression:
$$v = \tilde Z\gamma + u, \qquad \tilde Z = M^{0}Z,$$
where $M^{0} = I - \frac{\iota\iota'}{\iota'\iota}$ with $\iota = (1, 1, \dots, 1)'$ (so $\tilde Z$ is the demeaned $Z$), $u$ is a vector of iid errors, and the dependent variable is defined by
$$v_i = \frac{\hat e_i^2}{\hat\sigma^2} - 1, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\hat e_i^2.$$

(47)

Breusch-Pagan/Godfrey test

The test is related to an F test in the auxiliary regression of the joint restrictions
$$H_0: \gamma = 0,$$
and the test statistic is given by
$$BP = \frac{1}{2}\,v'\tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'v,$$
which follows a $\chi^2(q)$ distribution in large samples.
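
A sketch of the statistic exactly as written above: $v_i = \hat e_i^2/\hat\sigma^2 - 1$ is regressed on the demeaned $Z$ and BP is half the explained sum of squares (simulated data, illustrative names):

```python
# Breusch-Pagan/Godfrey statistic: BP = 0.5 * v' Z~ (Z~'Z~)^{-1} Z~' v ~ chi2(q) under H0.
import numpy as np
from scipy import stats

def breusch_pagan(y, X, Z):
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    sigma2_hat = e @ e / n
    v = e**2 / sigma2_hat - 1.0                  # dependent variable of the auxiliary regression
    Zt = Z - Z.mean(axis=0)                      # Z tilde = M0 Z (demeaned skedastic regressors)
    BP = 0.5 * (v @ Zt @ np.linalg.inv(Zt.T @ Zt) @ Zt.T @ v)
    return BP, stats.chi2.sf(BP, Z.shape[1])

rng = np.random.default_rng(9)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=np.sqrt(np.exp(0.3 + 0.8 * x)))
Z = x.reshape(-1, 1)                             # conjectured driver of the heteroscedasticity
print(breusch_pagan(y, X, Z))
```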

(48)

White test

For simplicity, assume that the data follow
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i.$$

1. Run your regression and save the residuals (denoted $\hat e_i$, as usual).

2. Run an auxiliary (artificial) regression with the squared residuals as the dependent variable and where the explanatory variables include all the explanatory variables from step 1 plus their squares and cross products:
$$\hat e_i^2 = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i}^2 + \beta_4 x_{2i}^2 + \beta_5 x_{1i}x_{2i} + v_i.$$

3. Obtain $R^2$ from the auxiliary regression in step 2.

4. Construct the test statistic:
$$nR^2 \overset{asy.}{\sim} \chi^2(k-1),$$
where $k$ is the number of regressors in the auxiliary regression, including the constant.

(49)

White test

Note that

$$H_0: \mathrm{Var}(e_i) = \sigma^2 \quad\text{vs.}\quad H_1: \text{heteroscedasticity}.$$
Thus, if the calculated value of the test statistic exceeds the critical chi-square value at the chosen level of significance, we reject the null that the error has a constant variance.

The White test is a large-sample test, so it is expected to work well only when the sample is sufficiently large.
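
A sketch of the four steps above on simulated data (illustrative names and values); the degrees of freedom are taken as the number of slope coefficients in the auxiliary regression:

```python
# White's test: regress squared residuals on regressors, squares and cross products; use n*R^2.
import numpy as np
from scipy import stats

def white_test(y, x1, x2):
    n = len(y)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b) ** 2                                             # step 1: squared residuals

    A = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])  # step 2: auxiliary regressors
    g = np.linalg.lstsq(A, e2, rcond=None)[0]
    u = e2 - A @ g
    R2 = 1.0 - u @ u / np.sum((e2 - e2.mean()) ** 2)                  # step 3: auxiliary R^2
    stat = n * R2                                                     # step 4: test statistic
    df = A.shape[1] - 1                                               # slopes of the auxiliary regression
    return stat, stats.chi2.sf(stat, df)

rng = np.random.default_rng(10)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=np.sqrt(0.5 + x1**2))
print(white_test(y, x1, x2))
```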

(50)

Summary of heteroscedasticity

A crucial CLRM assumption is that $\mathrm{Var}(e_i) = \sigma^2$. If $\mathrm{Var}(e_i) = \sigma_i^2$ instead, we have heteroscedasticity.

OLS estimators are no longer BLUE; still unbiased, but not efficient.

Two remedial approaches to deal with heteroscedasticity:

when $\sigma_i^2$ is known: weighted least squares (WLS);

when $\sigma_i^2$ is unknown: either make an educated guess about the likely pattern of the heteroscedasticity to consistently estimate $\sigma_i^2$ and use it in feasible GLS, or use White's heteroscedasticity-consistent variances and standard errors.

A number of tests are available to detect heteroscedasticity.

(51)

Serial correlation- general case

Now we want to specialize the results we discussed to the model
$$y_t = x_t'\beta + e_t,$$
where $x_t$ is the usual $k \times 1$ vector, $E(e_t) = 0$ and $E(e_te_s) \neq 0$ for some $t \neq s$. Note that autocorrelation (or serial correlation, it's the same thing) is very common in time series data, for instance due to unobserved factors (omitted variables) that are correlated over time. A common form of serial correlation is the autoregressive structure.

(52)

Autocorrelation patterns

There are several forms of autocorrelation, each leading to a different structure for the error covariance matrix $\mathrm{Var}(e) = \sigma^2\Omega$.

We will only consider here AR(1) structures, i.e. $e_t = \rho e_{t-1} + v_t$, where $v_t \sim iid(0, \sigma_v^2)$ and $|\rho| < 1$.

It is easy to show that $E(e_t) = 0$. We can also show that
$$E(ee') = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1}\\ \rho & 1 & \rho & \cdots & \rho^{n-2}\\ \vdots & & & \ddots & \vdots\\ \rho^{n-1} & \rho^{n-2} & \cdots & \cdots & 1\end{pmatrix}.$$

(53)

Autocorrelation patterns

As we discussed, in the case of non-spherical disturbances we can either stick to OLS and correct the standard errors, or we can derive GLS, but the latter requires a feasible version of the matrix $P$ to be implemented. A suitable matrix $P$ can be derived fairly easily in the case of an AR(1) error term as
$$P = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0\\ -\rho & 1 & 0 & \cdots & 0\\ 0 & -\rho & 1 & \cdots & 0\\ \vdots & & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & -\rho & 1\end{pmatrix}.$$
Rather than on the derivation of $P$, we focus on tests for detecting serial correlation.

(54)

Adjust the standard errors when errors are serially correlated

When all other classical regression assumptions are satisfied (in particular, uncorrelatedness between the errors and the regressors), we may decide to use OLS, which is unbiased and consistent but inefficient.

The corrected standard errors can be derived from
$$\mathrm{Var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
using the Newey-West "heteroskedasticity and autocorrelation consistent" (HAC) estimator for $\Sigma$.

(55)

Testing for serial correlation

Informal checks:

plot the residuals and check whether there is any systematic pattern;

obtain the correlogram (graphical representation of the autocorrelations) of the residuals and check whether the calculated autocorrelations are significantly different from zero. From the correlogram you can also "guess" the structure of the autocorrelation.

Tests:
Durbin-Watson d test,
Breusch-Godfrey test.

(56)

Durbin-Watson d test

The first test we consider for $H_0: \rho = 0$ is the Durbin-Watson $d$ statistic,
$$d = \frac{\sum_{t=2}^{T}(\hat e_t - \hat e_{t-1})^2}{\sum_{t=1}^{T}\hat e_t^2},$$
where $\hat e_t$ are the OLS residuals obtained by estimating the parameters.

It is possible to show that for large $T$, $d \to 2 - 2\rho$.

The value of $d$ is given by any statistical package.

If the value of $d$ is "close" to 2, we can conclude that the model is free from serial correlation. We need to establish a precise rejection rule.
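
As a quick numerical sketch (simulated AR(1) errors, illustrative values), $d$ can be computed directly from the OLS residuals and compared with the $2 - 2\hat\rho$ approximation:

```python
# Durbin-Watson d statistic from OLS residuals, and the link d ~ 2 - 2*rho_hat.
import numpy as np

rng = np.random.default_rng(11)
T = 300
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):                                # AR(1) errors with rho = 0.6
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(T), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ b

d = np.sum(np.diff(r) ** 2) / np.sum(r**2)           # Durbin-Watson d
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r**2)      # first-order residual autocorrelation
print(d, 2 - 2 * rho_hat)                            # approximately equal for large T
```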

(57)

Durbin-Watson d test

Unfortunately, establishing a rejection rule for a test of H0 based on d statistic is more complicated than usual.

This is because the critical values depend also on the value of the regressors.

The best we can do is to give upper and lower bounds for the critical values of d , dU and dL, and establish the following rejection rule

If the computed value of d is lower than dL, we reject H0.

If the computed value of d is greater than dU, we fail to reject H0.

If the computed value of d is between dL and dU, we cannot conclude (the outcome is indeterminate).

dL and dU in the statistical tables are given in terms of number of observations and number of regressors.

(58)

Durbin-Watson d test

Limitations: the d test is only valid when the regression contains an intercept, the regressors are non-stochastic (in particular, there is no lagged dependent variable among them), and the error follows an AR(1) scheme.

(59)

Breusch-Godfrey test

Suppose we have the general model
$$y_t = x_t'\beta + e_t, \qquad e_t = \rho e_{t-1} + v_t, \qquad (7)$$
where one or more of the regressors can be a lagged $y$ (such as $y_{t-1}$). If we wish to test $H_0: \rho = 0$, we can:

Estimate the parameters of (7) by OLS and save the residuals $\hat e_t$ for $t = 1, \dots, T$.

Estimate the following residual regression:
$$\hat e_t = x_t'\gamma + \delta\hat e_{t-1} + \nu_t. \qquad (8)$$

From the regression output obtained by performing regression (8), compute $nR^2$ ($n$ is the actual number of observations, which is $T - 1$ in this case since the first one is lost).

Under $H_0$, $nR^2$ is asymptotically $\chi^2$ with one degree of freedom.
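
A sketch of these steps for the AR(1) case on simulated data (illustrative names and values):

```python
# Breusch-Godfrey test: regress e_hat_t on x_t and e_hat_{t-1}; n*R^2 ~ chi2(1) under H0.
import numpy as np
from scipy import stats

def breusch_godfrey(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b                                    # residuals of the original regression
    A = np.column_stack([X[1:], e[:-1]])             # auxiliary regressors: x_t and lagged residual
    g = np.linalg.lstsq(A, e[1:], rcond=None)[0]
    u = e[1:] - A @ g
    R2 = 1.0 - u @ u / np.sum((e[1:] - e[1:].mean()) ** 2)
    n = len(e) - 1                                   # one observation lost
    stat = n * R2
    return stat, stats.chi2.sf(stat, 1)

rng = np.random.default_rng(12)
T = 300
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.5 * e[t - 1] + rng.normal()             # AR(1) errors
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(T), x])
print(breusch_godfrey(y, X))
```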

(60)

Maximum Likelihood Estimation

The most direct way of estimating unknown parameters is known as maximum likelihood estimation. Although the principle can be applied more generally, assume that the data $\{y_i\}_{i=1}^{n}$ are iid copies of the random variable $Y$, which is known to have density function $f(y, \theta)$.

As the $y_i$ are iid, the joint sample density is
$$f(y_1, \dots, y_n, \theta) = \prod_{i=1}^{n} f(y_i, \theta).$$
We then interpret this as a function of the unknown parameters given that we have observed the data, as in
$$L(\theta, y) = f(y_1, \dots, y_n, \theta),$$
called the likelihood. Then we simply ask which values of $\theta$ are most likely given the data, i.e. we maximize the likelihood with respect to $\theta$. Actually, since the logarithm is a monotone function, we define the MLE as the maximizer of the log-likelihood.

(61)

Maximum Likelihood Estimation in NLRM

Consider the model
$$y = X\beta + \varepsilon$$
and let the standard assumptions (A1-A4) hold. Additionally assume that the errors are normally distributed, i.e.
$$A5: \varepsilon \sim N(0, \sigma^2 I_n).$$
Since $y = X\beta + \varepsilon$, we immediately have
$$y \sim N(X\beta, \sigma^2 I_n),$$
because if $w \sim N(\mu, \Sigma)$, then for any $n \times m$ matrix $A$ and $m \times 1$ vector $b$, $z = A'w + b \sim N(A'\mu + b, A'\Sigma A)$.

(62)

Maximum Likelihood Estimation in NLRM

Moreover, we can write down its joint density (multivariate normal)
$$f(y) = \frac{\exp\big\{-\tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\big\}}{(2\pi\sigma^2)^{n/2}},$$
from which we obtain the log-likelihood
$$l(\beta, \sigma^2) = -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi.$$

(63)

Maximum Likelihood Estimation in NLRM

Thus the maximum likelihood estimators must satisfy the FOC (the score vector equals 0 at the MLE), i.e.
$$S(\beta, \sigma^2) = \begin{pmatrix} \frac{\partial l(\beta,\sigma^2)}{\partial\beta} \\[4pt] \frac{\partial l(\beta,\sigma^2)}{\partial\sigma^2} \end{pmatrix} = 0,$$
with
$$\frac{\partial l(\beta, \sigma^2)}{\partial\beta} = \frac{X'y - X'X\beta}{\sigma^2} = 0 \qquad (9)$$
$$\frac{\partial l(\beta, \sigma^2)}{\partial\sigma^2} = \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) - \frac{n}{2\sigma^2} = 0.$$
Solving (9) gives $\hat\beta_{MLE}$, and substituting it into the second equation gives $\hat\sigma^2_{MLE}$:
$$\hat\beta_{MLE} = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'M_Xy}{n}.$$
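
A sketch comparing the closed-form solutions above with a direct numerical maximization of the log-likelihood on simulated data (the log-variance parameterization is only a convenience to keep $\sigma^2$ positive; all names and numbers are illustrative):

```python
# Normal linear regression MLE: closed form versus numerical maximization of l(beta, sigma^2).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(13)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

def neg_loglik(theta):
    beta, log_sigma2 = theta[:k], theta[k]
    sigma2 = np.exp(log_sigma2)
    r = y - X @ beta
    return 0.5 * (r @ r) / sigma2 + 0.5 * n * np.log(sigma2) + 0.5 * n * np.log(2 * np.pi)

res = optimize.minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_numeric, sigma2_numeric = res.x[:k], np.exp(res.x[k])

beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]      # (X'X)^{-1} X'y
sigma2_mle = np.sum((y - X @ beta_mle) ** 2) / n     # divides by n, hence biased
print(beta_numeric, beta_mle)
print(sigma2_numeric, sigma2_mle)
```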

(64)

Maximum Likelihood Estimation in NLRM

The SOC is that the matrix of second-order derivatives (evaluated at the MLE estimators), i.e.
$$H(\beta, \sigma^2)\Big|_{\hat\beta_{MLE},\,\hat\sigma^2_{MLE}} = \begin{pmatrix} \frac{\partial^2 l(\beta,\sigma^2)}{\partial\beta\partial\beta'} & \frac{\partial^2 l(\beta,\sigma^2)}{\partial\beta\partial\sigma^2} \\[4pt] \frac{\partial^2 l(\beta,\sigma^2)}{\partial\sigma^2\partial\beta'} & \frac{\partial^2 l(\beta,\sigma^2)}{\partial(\sigma^2)^2} \end{pmatrix},$$
is negative definite. Note that
$$H(\beta, \sigma^2) = \begin{pmatrix} -\frac{X'X}{\sigma^2} & -\frac{X'y - X'X\beta}{\sigma^4} \\[4pt] -\frac{(X'y - X'X\beta)'}{\sigma^4} & -\frac{1}{\sigma^6}(y - X\beta)'(y - X\beta) + \frac{n}{2\sigma^4} \end{pmatrix} \qquad (10)$$
and
$$H(\hat\beta_{MLE}, \hat\sigma^2_{MLE}) = \begin{pmatrix} -\frac{X'X}{\hat\sigma^2_{MLE}} & 0 \\[4pt] 0 & -\frac{n}{2\hat\sigma^4_{MLE}} \end{pmatrix},$$
which is negative definite.

(65)

Properties of MLE

Clearly
$$\hat\beta_{MLE} \sim N\big(\beta, \sigma^2(X'X)^{-1}\big).$$
Note that the OLS estimator equals $\hat\beta_{MLE}$ under normality. We proved that the OLS estimator is BLUE by the Gauss-Markov Theorem (smallest variance in the class of linear unbiased estimators).

Under normality, we can strengthen our notion of efficiency and show that $\hat\beta_{MLE}$ is BUE (Best Unbiased Estimator) by the Cramer-Rao Lower Bound (the variance of any unbiased estimator is at least as large as the inverse of the information).

(66)

Cramer-Rao Lower Bound

The information is defined as
$$I_n(\beta, \sigma^2) = -E\big[H(\beta, \sigma^2)\big],$$
and given (10), in our case we have
$$I_n(\beta, \sigma^2) = \begin{pmatrix} \frac{X'X}{\sigma^2} & 0 \\[4pt] 0 & \frac{n}{2\sigma^4} \end{pmatrix},$$
so that the inverse of the information is
$$I_n^{-1}(\beta, \sigma^2) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\[4pt] 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$

(67)

Maximum Likelihood Estimation in NLRM

Since we know that $V\big[\hat\beta_{MLE}\big] = \sigma^2(X'X)^{-1}$, clearly $\hat\beta_{MLE}$ is BUE.

Note that $\hat\sigma^2_{MLE}$ is biased, but we can define the unbiased estimator
$$s^2 = \frac{n}{n-k}\,\hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n-k}.$$
Unfortunately, the unbiased $s^2$ is not BUE, since
$$V(s^2) = \frac{2\sigma^4}{n-k} > \frac{2\sigma^4}{n}.$$
Note that as $n \to \infty$, $s^2$ becomes efficient as long as $k$ is fixed.


(69)

Oaxaca-Blinder (1973)

A microeconometric decomposition technique which allows us to study the differences in outcomes between two groups, when these differences come from differences in characteristics (explained variation) and differences in parameters (discrimination/unexplained variation).

Most often used in the literature on wage inequalities (female vs.

male, union vs. nonunion workers, public vs. private sector workers, migrants vs. native workers).

(70)

Oaxaca-Blinder preliminaries

Assume that $y$ is explained by the vector of regressors $x$ as in the linear regression model:
$$y_i = \begin{cases} \beta^{female}x_i + \varepsilon_i^{female}, & \text{if } i \text{ is female}, \\ \beta^{male}x_i + \varepsilon_i^{male}, & \text{if } i \text{ is male}, \end{cases}$$
where $\beta$ also contains the intercept.

Assume that men are privileged.

The difference in mean outcomes is
$$\bar y^{male} - \bar y^{female} = \beta^{male}\bar x^{male} - \beta^{female}\bar x^{female}.$$

(71)

Oaxaca-Blinder

Alternatively:
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{male} + \Delta\beta\,\bar x^{female}$$
or
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{female} + \Delta\beta\,\bar x^{male},$$
where $\Delta x = \bar x^{male} - \bar x^{female}$ and $\Delta\beta = \beta^{male} - \beta^{female}$.

Differences in outcomes come from different characteristics and different parameters (females have worse $x$ and worse $\beta$).

Even more generally:
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{female} + \Delta\beta\,\bar x^{female} + \Delta x\,\Delta\beta = E + C + CE.$$
Problem: which group do we pick as the reference?

(72)

General version of the decomposition

General equation:
$$\bar y^{male} - \bar y^{female} = \underbrace{\beta^{*\prime}(\bar x^{male} - \bar x^{female})}_{\text{difference in characteristics}} + \underbrace{\bar x^{male\prime}(\beta^{male} - \beta^{*}) + \bar x^{female\prime}(\beta^{*} - \beta^{female})}_{\text{difference in parameters: male advantage + female disadvantage}},$$
where $\beta^{*} = \lambda\beta^{male} + (1 - \lambda)\beta^{female}$.
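
A sketch of the two-fold decomposition with male coefficients as the reference ($\lambda = 1$) on simulated data; the wage equation, variable names and magnitudes are purely illustrative:

```python
# Oaxaca-Blinder: gap = explained (characteristics) + unexplained (parameters), male betas as reference.
import numpy as np

rng = np.random.default_rng(14)
n = 2000
female = rng.integers(0, 2, size=n).astype(bool)
educ = rng.normal(12, 2, size=n) - 0.5 * female              # women get slightly less schooling here
wage = 5 + 0.8 * educ - 1.0 * female + rng.normal(size=n)    # and a lower intercept (unexplained gap)

def ols(yv, Xv):
    return np.linalg.lstsq(Xv, yv, rcond=None)[0]

Xm = np.column_stack([np.ones((~female).sum()), educ[~female]])
Xf = np.column_stack([np.ones(female.sum()), educ[female]])
b_m, b_f = ols(wage[~female], Xm), ols(wage[female], Xf)
xbar_m, xbar_f = Xm.mean(axis=0), Xf.mean(axis=0)

gap = wage[~female].mean() - wage[female].mean()
explained = b_m @ (xbar_m - xbar_f)      # difference in characteristics, at male coefficients
unexplained = xbar_f @ (b_m - b_f)       # difference in parameters (unexplained/"discrimination")
print(gap, explained + unexplained)      # identical by construction
print(explained, unexplained)
```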

(73)

General version of the decomposition

$\lambda$ value and interpretation:

$\lambda = 1$: male parameters as the reference.
$\lambda = 0$: female parameters as the reference.
$\lambda = 0.5$: average of the two sets of parameters, Reimers (1983).
$\lambda = \%$ male: parameters weighted by the sample proportion of males, Cotton (1988).
$\beta^{*} =$ pooled: parameters for the whole sample without a gender dummy, Neumark (1988).
$\beta^{*} =$ pooled: parameters for the whole sample with a gender dummy, Fortin (2008).
$\lambda = \%$ female: parameters weighted by the proportion of the opposite group, Słoczyński (2013).

(74)

Some asymptotic results prior to Stochastic Regressors

The idea is to treat results that hold as $n \to \infty$ as approximations for finite $n$.

In particular, we will be interested in whether estimators are consistent and what their asymptotic distribution is.

Asymptotic results are useful in finite samples (finite $n$) because, for instance, we may not be able to show unbiasedness or we may not be able to determine the sampling distribution needed to carry out statistical inference.

(75)

Asymptotic theory- some results

Definition - Convergence in Probability

Let $X_n = \{X_i, i = 1, \dots, n\}$ be a sequence of random variables. $X_n$ converges in probability to $X$ if
$$\lim_{n\to\infty}\Pr(|X_n - X| > \varepsilon) = 0 \quad\text{for any } \varepsilon > 0.$$
You will see convergence in probability written as either $X_n \overset{p}{\to} X$ or $\mathrm{plim}_{n\to\infty} X_n = X$.

(76)

Convergence in Probability

The idea behind this type of convergence is that the probability of an

"unusual" outcome becomes smaller and smaller as the sequence progresses.

Example: Suppose you take a basketball and start shooting free throws. Let $X_n$ be your success probability for the $n$th shot. Initially you are likely to miss a lot, but as time goes on your skill increases and you are more likely to make the shots. After years of practice the probability that you miss becomes smaller and smaller. Thus, as $n \to \infty$ the sequence $X_n$ converges in probability to $X = 100\%$. Note that the probability does not become exactly 100%, as there is always a small probability of missing.

(77)

Consistency

Definition - Consistency

An estimator $\hat\theta$ of $\theta$ is consistent if, as the sample size increases, $\hat\theta$ gets "closer" to $\theta$. Formally, $\hat\theta$ is consistent when
$$\mathrm{plim}_{n\to\infty}\hat\theta = \theta.$$
Useful result - Slutsky Theorem: Let $g(\cdot)$ be a continuous function. Then
$$\mathrm{plim}_{n\to\infty}\, g(X_n) = g\big(\mathrm{plim}_{n\to\infty} X_n\big).$$
Useful result - Chebyshev's inequality: For a random variable $X_n$ with mean $\mu$ and variance $\mathrm{Var}(X_n)$,
$$\Pr(|X_n - \mu| > \varepsilon) \leq \frac{\mathrm{Var}(X_n)}{\varepsilon^2} \quad\text{for any } \varepsilon > 0.$$

(78)

Consistency

A sufficient (but not necessary) condition for consistency:

If $\lim_{n\to\infty} E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$, then
$$\lim_{n\to\infty}\Pr(|\hat\theta - \theta| > \varepsilon) = 0,$$
or, as we write, $\mathrm{plim}_{n\to\infty}\hat\theta = \theta$.

If $\hat\theta$ is unbiased, in order to determine whether $\hat\theta$ is also consistent, we only need to check that $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.

Note that biased estimators can be consistent as long as $\lim_{n\to\infty} E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.

(79)

Unbiasedness versus Consistency

Unbiased, but not consistent: Suppose that given an iid sample $\{X_1, \dots, X_n\}$ we want to estimate the mean of $X$, i.e. $E(X)$. We could use the first observation as the estimator for the mean, so $\hat\theta = X_1$. $\hat\theta$ is unbiased, since $E(\hat\theta) = E(X_1) = E(X)$, because we have an iid sample. However, $\hat\theta$ does not converge to any value, therefore it cannot be consistent.

Biased, but consistent: an alternative estimator for the mean in an iid sample $\{X_1, \dots, X_n\}$ might be $\tilde\theta = \frac{1}{n}\sum_{i=1}^{n} X_i + \frac{1}{n}$. $\tilde\theta$ is biased, since $E(\tilde\theta) = E(X) + \frac{1}{n}$, but
$$\lim_{n\to\infty} E(\tilde\theta) = E(X) \quad\text{and}\quad \mathrm{Var}(\tilde\theta) = \frac{\mathrm{Var}(X)}{n} \to 0 \text{ as } n \to \infty.$$

(80)

Law of Large Numbers

Laws of large numbers provide conditions ensuring that sample moments converge to their population counterparts.

Weak Law of Large Numbers: Let $X_n = \{X_i, i = 1, \dots, n\}$ be an independent and identically distributed (iid) sequence of random variables with $E|X_i| < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} E(X_i).$$
We only consider this version of the WLLN (for iid data). When data are not iid, such as for time series data, stronger conditions are needed.

(81)

Law of Large Numbers

Example: Consider a coin (heads and tails) being flipped. Logic says there are equal chances of getting heads or tails. If the coin is flipped 10 times, there is a good chance that the proportion of heads and tails is not equal. Crudely, the law of large numbers says that as the number of flips increases, the proportion of heads will converge in probability to 0.5.

(82)

Sampling distribution

When we do not know the exact sampling distribution of an estimator (for instance when we do not assume normality of the error term), we may ask whether asymptotics can allow us to infer something about its distribution for large n , so that we are still able to make inference on the estimates.

Definition - Convergence in Distribution

Let $X_n = \{X_i, i = 1, \dots, n\}$ be a sequence of random variables and let $X$ be a random variable with distribution $F_X(x)$. $X_n$ converges in distribution to $X$ if
$$\lim_{n\to\infty}\Pr(X_n \leq x) = F_X(x),$$
and this is written $X_n \overset{d}{\to} X$.

The approximating distribution $F_X(x)$ is called a limiting or asymptotic distribution.

(83)

Useful results

Convergence in probability implies convergence in distribution, i.e.
$$X_n \overset{p}{\to} X \;\Rightarrow\; X_n \overset{d}{\to} X.$$
Convergence in distribution to a constant implies convergence in probability to that constant, i.e.
$$X_n \overset{d}{\to} c \;\Rightarrow\; X_n \overset{p}{\to} c.$$
Suppose that $\{Y_n\}$ is another sequence of random variables and let $g(\cdot)$ be a continuous function. If $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$, then
$$g(X_n, Y_n) \overset{d}{\to} g(X, c).$$

(84)

Central Limit Theorem

The most important example of convergence in distribution is the Central Limit Theorem, which is useful for establishing asymptotic distributions.

If $X_1, \dots, X_n$ is an iid sample from any probability distribution with finite mean $\mu$ and finite variance $\sigma^2$, we have
$$\sqrt{n}(\bar X - \mu) \overset{d}{\to} N(0, \sigma^2),$$
or equivalently
$$\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \overset{d}{\to} N(0, 1).$$
The CLT guarantees that, even if the errors are not normally distributed, but simply iid with zero mean and variance $\sigma^2$, the OLS estimator is asymptotically normal:
$$\sqrt{n}(\hat\beta - \beta) \overset{d}{\to} N\big(0, \sigma^2 Q^{-1}\big), \qquad Q = \mathrm{plim}\,\tfrac{1}{n}X'X.$$

(85)

Limiting distributions of test-statistics

Thus, when the disturbances are iid with zero mean and finite variance, the tests we previously discussed are asymptotically valid and
$$t \overset{d}{\to} N(0, 1) \quad\text{under } H_0,$$
$$W = J\,F \overset{d}{\to} \chi^2_J \quad\text{under } H_0.$$

(86)

Law of iterated expectation

We have a useful result that we will use extensively. Let $X$ and $Y$ be two random variables; then
$$E(Y) = E\big[E(Y|X)\big].$$
A very useful by-product of the Law of Iterated Expectations is
$$E(XY) = E\big[XE(Y|X)\big],$$
sometimes written as
$$E(XY) = E_X\big[XE(Y|X)\big]$$
to indicate that the outer expectation is taken with respect to $X$.
