(1)

OLS, MLE and related topics. Primer.

Katarzyna Bech

Week 1

(2)

Classical Linear Regression Model (CLRM)

The model:

$$y = X\beta + e,$$
and the assumptions:

A1 The true model is $y = X\beta + e$.

A2 $E(e) = 0$.

A3 $\mathrm{Var}(e) = E(ee') = \sigma^2 I_n$.

A4 $X$ is a non-stochastic $n \times k$ matrix with rank $k \le n$.

(3)

Least Squares Estimation

The OLS estimator $\hat\beta$ minimizes the sum of squares:
$$\hat\beta = \arg\min_{\beta} RSS(\beta), \qquad RSS(\beta) = e'e = (y - X\beta)'(y - X\beta).$$
Using the rules of matrix calculus, the FOC is
$$\frac{\partial RSS(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0,$$
and the SOC is
$$\frac{\partial^2 RSS(\beta)}{\partial \beta\, \partial \beta'} = 2X'X,$$
which is clearly positive definite.

(4)

Least Squares Estimation

From the FOC:
$$\hat\beta = (X'X)^{-1}X'y.$$
Properties of the OLS estimator:
$$E[\hat\beta] = \beta + (X'X)^{-1}X'E[e] = \beta,$$
hence it is unbiased, and
$$\mathrm{Var}[\hat\beta] = E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big] = \sigma^2(X'X)^{-1}.$$
Gauss-Markov Theorem: under A1-A4, $\hat\beta_{OLS}$ is BLUE.

(5)

Gauss Markov Theorem Proof

Consider again the model $y = X\beta + e$. Let
$$\tilde\beta_{(k \times 1)} = C_{(k \times n)}\, y_{(n \times 1)}$$
be another linear unbiased estimator.

For $\tilde\beta$ to be unbiased we require
$$E(\tilde\beta) = E(Cy) = E(CX\beta + Ce) = CX\beta = \beta.$$
Thus, we need
$$CX = I_k \;\Rightarrow\; \tilde\beta = CX\beta + Ce = \beta + Ce.$$
We have
$$\mathrm{Var}(\tilde\beta) = E\big((\tilde\beta - \beta)(\tilde\beta - \beta)'\big) = E(Cee'C') = CE(ee')C' = \sigma^2 CC'.$$

(6)

Gauss Markov Theorem Proof

We want to show that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$. Thus,
$$\begin{aligned}
\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) &= \sigma^2\big(CC' - (X'X)^{-1}\big)\\
&= \sigma^2\big(CC' - \underbrace{CX}_{=I_k}(X'X)^{-1}\underbrace{X'C'}_{=I_k}\big)\\
&= \sigma^2\big(CI_nC' - CX(X'X)^{-1}X'C'\big)\\
&= \sigma^2 C\big(I_n - X(X'X)^{-1}X'\big)C'\\
&= \sigma^2 CM_XC'\\
&= \sigma^2 CM_XM_X'C' = \sigma^2 DD',
\end{aligned}$$
where $D = CM_X$. $DD'$ is positive semidefinite, so $\sigma^2 DD' \geq 0$, implying that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) \geq 0$.

(7)

Gauss Markov Theorem Proof

The latter result also applies to any linear combination of the elements of $\beta$, i.e.
$$\mathrm{Var}(c'\tilde\beta) - \mathrm{Var}(c'\hat\beta) \geq 0.$$
We thus have the following corollaries.

Corollary: Under A1-A4, for any vector of constants $c$, the minimum variance linear unbiased estimator of $c'\beta$ in the classical regression model is $c'\hat\beta$, where $\hat\beta$ is the least squares estimator.

Corollary: Each coefficient $\beta_j$ is estimated at least as efficiently by $\hat\beta_j$ as by any other linear unbiased estimator.

(8)

Least Squares Estimation

Clearly, we also need to estimate $\sigma^2$. The most obvious estimator to take is
$$\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'M_Xy}{n},$$
where $M_X = I - P_X = I - X(X'X)^{-1}X'$.

BUT $\hat\sigma^2$ is biased, so we have to define an unbiased estimator
$$s^2 = \frac{n}{n-k}\,\hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n-k}.$$
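
A minimal numerical sketch of these formulas on simulated data (all variable names and parameter values are purely illustrative):

```python
# Compute the OLS estimator, the unbiased variance estimate s^2 and the usual standard errors.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # regressors, incl. a constant
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y               # OLS: (X'X)^{-1} X'y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)               # unbiased estimator of sigma^2
var_beta_hat = s2 * XtX_inv                # estimated Var(beta_hat)

print(beta_hat)
print(s2, np.sqrt(np.diag(var_beta_hat)))  # s^2 and the conventional OLS standard errors
```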

(9)

How is econometrics typically taught?

What are the consequences of the violation of the assumptions A1-A4?

(10)

Perfect Multicollinearity

Let's see what happens when A4 is violated, i.e. rank(X) < k (equivalently, the columns of X are linearly dependent).

In this case $X'X$ is not invertible (it is singular), so the ordinary least squares estimator cannot be computed. The parameters of such a regression model are unidentified.

The use of too many dummy variables (the dummy variable trap) is a typical cause of exact multicollinearity. Consider for instance the case where we would like to estimate a wage equation including a dummy for males (MALE_i), a dummy for females (FEMALE_i) as well as a constant (note that MALE_i + FEMALE_i = 1 for all i).

Exact multicollinearity is easily solved by excluding one of the variables from the model.
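
A small sketch of the dummy variable trap on simulated data (the MALE/FEMALE names follow the slide; everything else is illustrative): with a constant plus both dummies the columns of X are linearly dependent, so $X'X$ is singular and $\beta$ is not identified; dropping one dummy restores full rank.

```python
# Dummy variable trap: MALE + FEMALE equals the constant column, so rank(X) < k.
import numpy as np

rng = np.random.default_rng(1)
n = 100
male = rng.integers(0, 2, size=n)
female = 1 - male

X_trap = np.column_stack([np.ones(n), male, female])   # constant + both dummies
print(np.linalg.matrix_rank(X_trap))                   # 2 < 3: rank deficient
print(np.linalg.cond(X_trap.T @ X_trap))               # effectively infinite condition number

X_ok = np.column_stack([np.ones(n), male])             # drop one dummy
print(np.linalg.matrix_rank(X_ok))                     # 2: full column rank, parameters identified
```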

(11)

"Near" Multicollinearity

Imperfect multicollinearity arises when two or more regressors are highly correlated, in the sense that there is a linear function of regressors which is highly correlated with another regressor.

When regressors are highly correlated, it becomes difficult to disentangle the separate effects of the regressors on the dependent variable.

Near multicollinearity is not a violation of the classical linear regression assumptions, so our OLS estimates are still the best linear unbiased estimators. "Best", however, may simply not be all that good.

(12)

"Near" Multicollinearity

The higher the correlation between the regressors becomes, the less precise our estimates will be (note that $\mathrm{Var}(\hat\beta) = \sigma^2(X'X)^{-1}$).

Consider
$$y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + e_i.$$
Then
$$\mathrm{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ij} - \bar x_j)^2\,(1 - r^2_{x_1,x_2})}.$$
Obviously, as $r^2_{x_1,x_2} \to 1$, the variance increases.
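
The simulation sketch below (illustrative numbers only) evaluates $\sigma^2(X'X)^{-1}$ for increasingly correlated regressors and shows the standard error of $\hat\beta_1$ blowing up as the correlation approaches one:

```python
# Variance inflation: se(beta_1_hat) grows as corr(x1, x2) -> 1.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 500, 1.0

for r in (0.0, 0.5, 0.9, 0.99):
    x1 = rng.normal(size=n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)   # corr(x1, x2) is approximately r
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma**2 * np.linalg.inv(X.T @ X)            # Var(beta_hat) for known sigma^2
    print(f"r = {r:4.2f}   se(beta_1) = {np.sqrt(var_beta[1, 1]):.4f}")
```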

(13)

"Near" Multicollinearity

Practical consequences:

Although OLS estimators are still BLUE they have large variances and covariances.

Individual t-tests may fail to reject that coefficients are 0, even though they are jointly significant.

Parameter estimates may be very sensitive to one or a small number of observations.

Coefficients may have the "wrong" sign or an implausible magnitude.

How to detect multicollinearity?

High $R^2$ but few significant t-ratios.

High pairwise correlation among explanatory variables.

How to deal with multicollinearity?

A priori information.

Dropping a variable.

Transformation of variables.

(14)

Misspecifying the Set of Regressors - violation of A1

We consider two issues:

Omission of relevant variable (e.g., due to oversight or lack of measurement).

Inclusion of irrelevant variables.

Preview of the results:

In the first case, the estimator is generally biased.

In the second case, there is no bias but the estimator is inefficient.

(15)

Omitted Variable Bias

Consider the following two models:

True model:
$$y = X\beta + Z\delta + e, \qquad E(e) = 0, \qquad E(ee') = \sigma^2 I \qquad (1)$$
and the misspecified model which we estimate:
$$y = X\beta + v. \qquad (2)$$
The OLS estimate of $\beta$ in (2) is
$$\tilde\beta = (X'X)^{-1}X'y.$$

(16)

Omitted Variable Bias

In order to assess the statistical properties of the latter, we have to substitute the true model for $y$, which is given by (1). We obtain
$$\tilde\beta = (X'X)^{-1}X'(X\beta + Z\delta + e) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'e.$$
Taking expectations (recall that $X$ and $Z$ are non-stochastic), we have
$$E(\tilde\beta) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'E(e) = \beta + (X'X)^{-1}X'Z\delta.$$
Thus, in general $\tilde\beta$ is biased.

Interpretation of the bias term:

$(X'X)^{-1}X'Z$ is the matrix of regression coefficients of the omitted variable(s) on all included variables.

$\delta$ is the true coefficient corresponding to the omitted variable(s).
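
A Monte Carlo sketch of this bias formula (the design, names and parameter values are illustrative): the short regression omitting $z$ is estimated repeatedly and the average bias is compared with $(X'X)^{-1}X'Z\delta$.

```python
# Omitted variable bias: simulated average of beta_tilde minus beta versus the analytic bias term.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 300, 2000
beta, delta = np.array([1.0, 0.5]), 0.8

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(scale=0.8, size=n)        # omitted variable, correlated with x
X = np.column_stack([np.ones(n), x])
Z = z.reshape(-1, 1)

analytic_bias = np.linalg.inv(X.T @ X) @ X.T @ Z @ np.array([delta])

estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + z * delta + rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]   # short regression omitting z

print("simulated bias:", estimates.mean(axis=0) - beta)
print("analytic bias: ", analytic_bias)
```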

(17)

Omitted Variable Bias

We can interpret $E(\tilde\beta)$ as the sum of two terms:

$\beta$ is the direct change in $E(y)$ associated with changes in $X$;

$(X'X)^{-1}X'Z\delta$ is the indirect change in $E(y)$ associated with changes in $Z$ ($X$ partially acts as a "proxy" for $Z$ when $Z$ is omitted).

There would be no omitted variable bias if $X$ and $Z$ were orthogonal, i.e. $X'Z = 0$. Trivially, there would also be no bias if $\delta = 0$.

(18)

Example

Consider the standard human capital earnings function
$$\ln w_i = \alpha + \beta_1\, school_i + \beta_2\, exp_i + v_i \qquad (3)$$
and we are interested in $\beta_1$, the return to schooling.

We suspect that the relevant independent variable $ability_i$ is omitted.

What is the likely direction of the bias?

Let $\hat\nu$ denote the OLS coefficient corresponding to schooling in the regression of $ability_i$ on a constant, $school_i$ and $exp_i$. $\hat\nu$ is likely to be positive (note that we are not claiming a causal effect).

On the other hand, consider the true model (which cannot be used since $ability_i$ is not observed)
$$\ln w_i = \alpha + \beta_1\, school_i + \beta_2\, exp_i + \delta\, ability_i + v_i.$$

(19)

Example

Hence, the direction of the bias of $\hat\beta_1$ when (3) is estimated is given by
$$\underbrace{\hat\nu}_{(+)}\;\underbrace{\delta}_{(+)} = +.$$
The return to schooling tends to be overestimated.

(20)

Variance

Taking into account that $X$ and $Z$ are non-stochastic and $E(e) = 0$,
$$\mathrm{Var}(\tilde\beta) = E\big((\tilde\beta - E(\tilde\beta))(\tilde\beta - E(\tilde\beta))'\big) = \sigma^2(X'X)^{-1}.$$
Important: $s^2$ is a biased estimate of the error variance of the true regression. Indeed,
$$s^2 = \frac{\hat v'\hat v}{n - k_x},$$
where $\hat v = y - X\tilde\beta = M_Xy$. By substituting the true model (1) for $y$, we get
$$\hat v = M_X(X\beta + Z\delta + e) = M_X(Z\delta + e).$$
Thus (show it)
$$E(s^2) = \sigma^2 + \frac{\delta'Z'M_XZ\delta}{n - k_x} \neq \sigma^2.$$

(21)

Summary of consequences of omitting a relevant variable

If the omitted variable(s) are correlated with the included variable(s), the parameter estimates are biased.

The disturbance variance σ2 is incorrectly estimated.

As a result, the usual confidence intervals and hypothesis testing procedures are likely to give misleading conclusions.

(22)

Irrelevant Variables

Suppose that the correct model is

y =X β+e (4)

but we instead choose to estimate the ’bigger’model

y =X β+Z δ+v (5)

The OLS estimate of β in (5) are

˜β = (X0MZX) 1X0MZy .

(23)

Irrelevant Variables

In order to assess the statistical properties of the latter, we have to substitute the true model for $y$, which is given by (4). We obtain
$$\tilde\beta = (X'M_ZX)^{-1}X'M_Z(X\beta + e) = \beta + (X'M_ZX)^{-1}X'M_Ze.$$
Taking expectations, we have
$$E(\tilde\beta) = \beta + (X'M_ZX)^{-1}X'M_ZE(e) = \beta.$$
Thus, $\tilde\beta$ is unbiased.

However, it can be shown that
$$\mathrm{Var}(\tilde\beta) \geq \sigma^2(X'X)^{-1},$$
where $\sigma^2(X'X)^{-1}$ is the variance of the 'correct' OLS estimator.

(24)

Summary of consequences of including an irrelevant variable

OLS estimators of the parameters of the "incorrect" model are unbiased, but less precise (larger variance).

Because the price for omitting a relevant variable is so much higher than the price for including irrelevant ones, many econometric theorists have suggested a "General to Specific" modelling strategy. It entails initially including every regressor that may be suspected of being relevant. Then a combination of $R^2$, F and t tests is used to eliminate the least significant regressors, hopefully leading to a "correct" model.

(25)

Ramsey’s RESET test

The most commonly used test to check the specification of the mean function is known as Ramsey's RESET test.

Let $\hat y_i = x_i'\hat\beta$ be the fitted values of the OLS regression. The Ramsey RESET procedure is to run the regression
$$y_i = x_i'\beta + \gamma_1(\hat y_i)^2 + \gamma_2(\hat y_i)^3 + \dots + \gamma_q(\hat y_i)^{q+1} + v_i$$
and perform an F test of the hypothesis
$$H_0: \gamma_1 = \gamma_2 = \dots = \gamma_q = 0.$$
If we cannot reject $H_0$, the model is correctly specified.
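
A sketch of the RESET procedure with $q = 2$ (squares and cubes of the fitted values), implemented as the corresponding F test on simulated data; the function and variable names are illustrative:

```python
# Ramsey's RESET test: add powers of the fitted values and F-test their joint significance.
import numpy as np
from scipy import stats

def reset_test(y, X, max_power=3):
    n = X.shape[0]
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ b
    rss_r = np.sum((y - yhat) ** 2)                           # restricted RSS
    X_aug = np.column_stack([X] + [yhat**p for p in range(2, max_power + 1)])
    b_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]
    rss_u = np.sum((y - X_aug @ b_aug) ** 2)                  # unrestricted RSS
    q = max_power - 1
    df2 = n - X_aug.shape[1]
    F = ((rss_r - rss_u) / q) / (rss_u / df2)
    return F, stats.f.sf(F, q, df2)

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_lin = X @ np.array([1.0, 2.0]) + rng.normal(size=n)         # correctly specified mean
y_sq = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)        # omitted nonlinearity

print(reset_test(y_lin, X))   # large p-value expected: do not reject H0
print(reset_test(y_sq, X))    # small p-value expected: misspecification detected
```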

(26)

Modelling Strategy

Approaches to model building:

"General to speci…c".

"Simple to general".

The motivating factor when building a model should be ECONOMIC THEORY

How do we judge a model to be "good"?

Parsimony
Identifiability
Goodness of fit
Theoretical consistency
Predictive power

(27)

Selecting regressors

It is good practice to select the set of potentially relevant variables on the basis of economic arguments rather than statistical ones.

There is always a small (but not ignorable) probability of drawing the wrong conclusion.

For example, there is always a probability of rejecting the null hypothesis that a coefficient is zero while the null is actually true.

Such type I errors are rather likely to happen if we use a sequence of many tests to select the regressors to include in the model. This process is referred to as data mining.

In presenting your estimation results, it is not a mistake to have insignificant variables included in your specification. Of course, you should be careful about including many multicollinear variables in your model, because then, in the end, almost none of the variables may appear individually significant.

(28)

Spherical Disturbances

Assumption A3 of the Classical Linear Regression Model ensures that the variance-covariance matrix of the errors is

$$E(ee') = \sigma^2 I_n.$$
This implies two things:

$E(e_i^2) = \sigma^2$ for all $i$, i.e. homoskedasticity;

$E(e_ie_j) = 0$ if $i \neq j$, i.e. no serial correlation.

Naturally, testing for heteroskedasticity and for serial correlation forms a very important part of the econometrics of the linear model. Before we do that, however, we need to look at the sources of what the textbooks call 'non-spherical errors' and at what the effects on our model might be.

(29)

Generalized Linear Regression Model

The Generalized Linear Regression Model is just the Classical Linear Regression Model, but with non-spherical disturbances (the

covariance matrix is no longer proportional to the identity matrix).

We consider the model

$$y = X\beta + e,$$
where
$$E(e) = 0 \quad\text{and}\quad E(ee') = \Sigma.$$
$\Sigma$ is any symmetric positive definite matrix.

(30)

Non-spherical disturbances

We are interested in two different forms of non-spherical disturbances:

Heteroscedasticity, with
$$E(e_i^2) = \sigma_i^2 \quad\text{and}\quad E(e_ie_j) = 0 \text{ if } i \neq j,$$
so that each observation has a different variance. In this case,
$$\Sigma = \sigma^2\Omega = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n^2 \end{pmatrix} = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}.$$
Note that $\Omega$ is a diagonal matrix of weights $\omega_i$. Sometimes it is convenient to write $\sigma_i^2 = \sigma^2\omega_i$.

Autocorrelation, with
$$E(e_i^2) = \sigma^2 \quad\text{and}\quad E(e_ie_j) \neq 0 \text{ if } i \neq j.$$

(31)

Heteroscedasticity - graphical intuition

When data are homoscedastic, we expect the residuals to be spread evenly around zero, with roughly the same dispersion across observations (scatter plot omitted).

(32)

Heteroscedasticity - graphical intuition

With heteroscedasticity, we expect the dispersion of the residuals to change systematically across observations, e.g. fanning out as a regressor grows (scatter plot omitted).

(33)

Consequences for OLS estimation

Let's examine the statistical properties of the OLS estimator when $E(ee') = \sigma^2\Omega \neq \sigma^2 I_n$.

Unbiasedness: $\hat\beta$ is still unbiased, since $\hat\beta = \beta + (X'X)^{-1}X'e$ and $E(e) = 0$, so we conclude that $E(\hat\beta) = \beta$.

Efficiency: the variance of $\hat\beta$ changes to
$$\mathrm{Var}(\hat\beta) = E\big((\hat\beta - \beta)(\hat\beta - \beta)'\big) = (X'X)^{-1}X'E(ee')X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}.$$
The consequence is that the standard OLS estimated variance and standard errors (computed by statistical packages) are biased, since they are based on the wrong formula $\sigma^2(X'X)^{-1}$. Standard tests based on them are therefore unreliable.

(34)

Consequences for OLS estimation

Since one of the Gauss-Markov assumptions is violated ($E(ee') \neq \sigma^2 I_n$), the OLS estimator of such models is no longer BLUE (although, remember, it is still unbiased).

Possible remedies:

Construct another estimator which is BLUE; we will call it Generalized Least Squares (GLS).

We stick to OLS, but we compute the correct estimated variance.

(35)

GLS

The idea is to transform the model

$$y = X\beta + e$$

so that the transformed model satisfies the Gauss-Markov assumptions.

We assume for the time being that $\Omega$ is known (a rather unrealistic assumption).

Property: Since $\Omega$ is positive definite and symmetric, there exists a square, nonsingular matrix $P$ such that
$$P'P = \Omega^{-1}.$$
Sketch of proof (spectral decomposition): Since $\Omega$ is symmetric, there exist $C$ and $\Lambda$ such that
$$\Omega = C\Lambda C', \qquad C'C = I_n.$$
Because $\Omega$ is positive definite (a positive definite matrix is always nonsingular):
$$\Omega^{-1} = C\Lambda^{-1}C' = P'P, \qquad\text{with } P = \Lambda^{-1/2}C'.$$

(36)

GLS

We can transform the model $y = X\beta + e$ by premultiplying it by the matrix $P$:
$$Py = PX\beta + Pe,$$
i.e.
$$y^{*} = X^{*}\beta + e^{*}. \qquad (6)$$
This transformed model satisfies the Gauss-Markov assumptions, since
$$E(e^{*}) = E(Pe) = PE(e) = 0$$
and
$$\mathrm{Var}(e^{*}) = E(e^{*}e^{*\prime}) = E(Pee'P') = PE(ee')P' = \sigma^2 P\Omega P' = \sigma^2 P(P'P)^{-1}P' = \sigma^2 PP^{-1}P'^{-1}P' = \sigma^2 I_n.$$

(37)

GLS

Hence, the OLS estimator of the transformed model (6) is BLUE and
$$\hat\beta_{GLS} = (X^{*\prime}X^{*})^{-1}X^{*\prime}y^{*} = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y.$$
It is easy to verify that
$$E(\hat\beta_{GLS}) = \beta$$
and
$$\mathrm{Var}(\hat\beta_{GLS}) = \sigma^2(X'\Omega^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}.$$
(I leave it to you as an exercise.)

(38)

GLS

Note: the GLS estimator we have just discussed is useful in any general case of non-spherical disturbances. The general formula is
$$\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y.$$
Now we will specialize it to the heteroscedastic case, i.e. when $\Sigma = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}$.

(39)

Unfeasible GLS

We consider again the model
$$y_i = x_i'\beta + e_i, \qquad \mathrm{Var}(e_i) = \sigma_i^2.$$
Thus, $\mathrm{Var}(e) = \mathrm{diag}\{\sigma_1^2, \dots, \sigma_n^2\}$.

In this setting, the transformation $P$ is given by
$$P = \mathrm{diag}\Big\{\frac{1}{\sigma_1}, \dots, \frac{1}{\sigma_n}\Big\}.$$
It is easy to verify that $(P'P)^{-1} = \Omega = \Sigma$, when we set $\sigma^2 = 1$.

(40)

Unfeasible GLS

The transformed model is given by
$$\frac{y_i}{\sigma_i} = \Big(\frac{x_i}{\sigma_i}\Big)'\beta + \frac{e_i}{\sigma_i}.$$
Thus,
$$\hat\beta_{GLS} = \left(\sum_{i=1}^{n}\frac{x_ix_i'}{\sigma_i^2}\right)^{-1}\sum_{i=1}^{n}\frac{x_iy_i}{\sigma_i^2},$$
which is also called "weighted least squares".

This approach is only available if $\sigma_i^2$ is known.
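
A sketch of this (unfeasible) weighted least squares estimator with known $\sigma_i^2$ on simulated data; it also checks the equivalence with OLS on the transformed observations $y_i/\sigma_i$, $x_i/\sigma_i$:

```python
# Unfeasible GLS / weighted least squares with known heteroscedastic variances.
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma2_i = 0.5 * x**2                                   # known variances sigma_i^2
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2_i))

w = 1.0 / sigma2_i                                      # weights 1 / sigma_i^2
XtWX = (X * w[:, None]).T @ X                           # sum of x_i x_i' / sigma_i^2
XtWy = (X * w[:, None]).T @ y                           # sum of x_i y_i / sigma_i^2
beta_wls = np.linalg.solve(XtWX, XtWy)

# Equivalent route: OLS on the transformed data y_i/sigma_i and x_i/sigma_i.
p = 1.0 / np.sqrt(sigma2_i)
beta_check = np.linalg.lstsq(X * p[:, None], y * p, rcond=None)[0]
print(beta_wls, beta_check)
```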

(41)

Feasible GLS

Problem: $\sigma_i^2$ is not known in general. Therefore, the latter estimator is unfeasible.

We need to use a feasible version of this estimator, which means we need to replace the unknown $\sigma_i^2$ with sensible estimates $\hat\sigma_i^2$:
$$\hat\beta_{FGLS} = \left(\sum_{i=1}^{n}\frac{x_ix_i'}{\hat\sigma_i^2}\right)^{-1}\sum_{i=1}^{n}\frac{x_iy_i}{\hat\sigma_i^2}.$$

(42)

Feasible GLS

The main issue is that we need to know the form of the heteroscedasticity in order to estimate $\sigma_i^2$.

The simplest situation, for instance, would be if we had a conjecture that $\sigma_i^2 = z_i'a$, where the components of $z_i$ are observable. In such a case, a consistent estimator of $\sigma_i^2$ is given by the fitted values of the artificial (so-called skedastic) regression
$$\hat e_i^2 = z_i'a + v_i,$$
where $\hat e_i$ are the residuals of the original regression when estimated by standard OLS.

We can accommodate more complicated functional forms for the heteroscedasticity within this idea (e.g. $\sigma_i^2 = \exp(z_i'a)$).
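
A sketch of feasible GLS under the exponential form mentioned above, $\sigma_i^2 = \exp(z_i'a)$ (simulated data; the choice of $z_i$ and all numbers are illustrative). One convenient way to estimate $a$ is to regress $\ln\hat e_i^2$ on $z_i$ and use the fitted variances as weights:

```python
# Feasible GLS: estimate sigma_i^2 = exp(z_i'a) from the OLS residuals, then reweight.
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x])                 # skedastic regressors z_i (here: const and x)
sigma2_true = np.exp(0.5 + 1.0 * x)
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2_true))

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]         # step 1: OLS and residuals
e_hat = y - X @ b_ols

a_hat = np.linalg.lstsq(Z, np.log(e_hat**2), rcond=None)[0]   # step 2: skedastic regression
sigma2_hat = np.exp(Z @ a_hat)                                # fitted variances

w = 1.0 / sigma2_hat                                          # step 3: weighted least squares
b_fgls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)
print(b_ols, b_fgls)
```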

(43)

White robust standard errors

Rather than estimating $\beta$ by GLS, we can stick to OLS and "correct" the standard errors. We have seen that
$$\mathrm{Var}(\hat\beta_{OLS}) = (X'X)^{-1}X'E(ee')X(X'X)^{-1} = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
and in vector form
$$\mathrm{Var}(\hat\beta_{OLS}) = \left(\sum_{i=1}^{n}x_ix_i'\right)^{-1}\left(\sum_{i=1}^{n}\sigma_i^2 x_ix_i'\right)\left(\sum_{i=1}^{n}x_ix_i'\right)^{-1}.$$
We don't know $\sigma_i^2$, but under general conditions White (1980) showed that the matrix
$$\frac{1}{n}\sum_{i=1}^{n}\hat e_i^2 x_ix_i'$$
consistently estimates $\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2 x_ix_i'$, where $\hat e_i$ are the OLS residuals. Plugging this into the sandwich formula above yields the White heteroscedasticity-robust variance estimator and standard errors.
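
A sketch of the resulting robust standard errors (the HC0 variant) compared with the naive OLS formula, on simulated heteroscedastic data with illustrative names and values:

```python
# White (1980) heteroscedasticity-robust standard errors via the sandwich formula.
import numpy as np

def white_se(y, X):
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = (X * (e**2)[:, None]).T @ X          # sum of e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv                # sandwich: (X'X)^{-1} (sum e_i^2 x_i x_i') (X'X)^{-1}
    return b, e, np.sqrt(np.diag(V))

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=np.abs(x) + 0.5)   # heteroscedastic errors

b, e, se_robust = white_se(y, X)
se_naive = np.sqrt(np.diag((e @ e / (n - 2)) * np.linalg.inv(X.T @ X)))
print(se_naive)      # based on the wrong formula sigma^2 (X'X)^{-1}
print(se_robust)     # heteroscedasticity-consistent
```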

(44)

Detection of heteroscedasticity

Informal methods:
graphical method.

Formal methods:
Goldfeld-Quandt test,
Breusch-Pagan/Godfrey test,
White's general heteroscedasticity test.

(45)

Goldfeld-Quandt test

Suppose that we think that the heteroscedasticity is particularly related to one of the variables, i.e. one of the columns of $X$, say $x_j$. The Goldfeld-Quandt test is based on the following:

1. Order the observations according to the values of $x_j$, starting with the lowest.

2. Split the sample into three parts, of lengths $n_1$, $c$ and $n_2$.

3. Obtain the residuals $\hat e_1$ and $\hat e_2$ from regressions using the first $n_1$ and then the last $n_2$ observations separately.

4. Test $H_0: \sigma_i^2 = \sigma^2$ for all $i$ using the statistic
$$F = \frac{\hat e_1'\hat e_1/(n_1 - k)}{\hat e_2'\hat e_2/(n_2 - k)},$$
with critical values obtained from the $F(n_1 - k, n_2 - k)$ distribution.

Practical advice: use this test for relatively small sample sizes (up to ...).
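
A sketch of the Goldfeld-Quandt procedure on simulated data (the split fractions and all numbers are illustrative); the error variance is made to decrease in $x$ so that the first block, which appears in the numerator of the formula above, carries the larger variance:

```python
# Goldfeld-Quandt test: sort by x_j, drop the middle c observations, compare sub-sample RSS.
import numpy as np
from scipy import stats

def goldfeld_quandt(y, X, sort_col, drop_frac=0.2):
    n, k = X.shape
    order = np.argsort(X[:, sort_col])
    y, X = y[order], X[order]
    c = int(drop_frac * n)                      # middle observations left out
    n1 = (n - c) // 2
    n2 = n - c - n1

    def rss(ys, Xs):                            # residual sum of squares of a sub-regression
        b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        r = ys - Xs @ b
        return r @ r

    F = (rss(y[:n1], X[:n1]) / (n1 - k)) / (rss(y[-n2:], X[-n2:]) / (n2 - k))
    return F, stats.f.sf(F, n1 - k, n2 - k)     # upper-tail p-value

rng = np.random.default_rng(8)
n = 150
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=2.0 / x)   # variance decreasing in x
print(goldfeld_quandt(y, X, sort_col=1))
```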

(46)

Breusch-Pagan/Godfrey test

Suppose that the source of heteroscedasticity is that
$$\sigma_i^2 = g(\alpha_0 + \tilde z_i'\alpha),$$
where $\tilde z_i$ is the $q \times 1$ vector of regressors for observation $i$, taken from a matrix $Z$ which is $n \times q$. $Z$ can contain some or all of the $X$'s in our CLRM, if desired, although ideally the choice of $Z$ is made on the basis of some economic theory. For computational purposes $Z$ does not contain a constant term.

This test involves an auxiliary regression:
$$v = \tilde Z\gamma + u, \qquad \tilde Z = M^{0}Z,$$
where $M^{0} = I - \frac{\iota\iota'}{\iota'\iota}$ with $\iota = (1, 1, \dots, 1)'$ (so $\tilde Z$ is the demeaned $Z$), $u$ is a vector of iid errors, and the dependent variable is defined by
$$v_i = \frac{\hat e_i^2}{\hat\sigma^2} - 1, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\hat e_i^2.$$

(47)

Breusch-Pagan/Godfrey test

The test is related to an F test in the auxiliary regression of the joint restrictions
$$H_0: \gamma = 0,$$
and the test statistic is given by
$$BP = \frac{1}{2}\,v'\tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'v,$$
which follows a $\chi^2(q)$ distribution in large samples.
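
A sketch of the statistic exactly as written above: $v_i = \hat e_i^2/\hat\sigma^2 - 1$ is regressed on the demeaned $Z$ and BP is half the explained sum of squares (simulated data, illustrative names):

```python
# Breusch-Pagan/Godfrey statistic: BP = 0.5 * v' Z~ (Z~'Z~)^{-1} Z~' v ~ chi2(q) under H0.
import numpy as np
from scipy import stats

def breusch_pagan(y, X, Z):
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    sigma2_hat = e @ e / n
    v = e**2 / sigma2_hat - 1.0                  # dependent variable of the auxiliary regression
    Zt = Z - Z.mean(axis=0)                      # Z tilde = M0 Z (demeaned skedastic regressors)
    BP = 0.5 * (v @ Zt @ np.linalg.inv(Zt.T @ Zt) @ Zt.T @ v)
    return BP, stats.chi2.sf(BP, Z.shape[1])

rng = np.random.default_rng(9)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=np.sqrt(np.exp(0.3 + 0.8 * x)))
Z = x.reshape(-1, 1)                             # conjectured driver of the heteroscedasticity
print(breusch_pagan(y, X, Z))
```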

(48)

White test

For simplicity, assume that the data follow
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i.$$

1. Run your regression and save the residuals (denoted $\hat e_i$, as usual).

2. Run an auxiliary (artificial) regression with the squared residuals as the dependent variable and where the explanatory variables include all the explanatory variables from step 1 plus their squares and cross products:
$$\hat e_i^2 = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i}^2 + \beta_4 x_{2i}^2 + \beta_5 x_{1i}x_{2i} + v_i.$$

3. Obtain $R^2$ from the auxiliary regression in step 2.

4. Construct the test statistic:
$$nR^2 \overset{asy.}{\sim} \chi^2(k-1),$$
where $k$ is the number of regressors in the auxiliary regression, including the constant.

(49)

White test

Note that

$$H_0: \mathrm{Var}(e_i) = \sigma^2 \quad\text{vs.}\quad H_1: \text{heteroscedasticity}.$$
Thus, if the calculated value of the test statistic exceeds the critical chi-square value at the chosen level of significance, we reject the null that the error has a constant variance.

The White test is a large-sample test, so it is expected to work well only when the sample is sufficiently large.
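
A sketch of the four steps above on simulated data (illustrative names and values); the degrees of freedom are taken as the number of slope coefficients in the auxiliary regression:

```python
# White's test: regress squared residuals on regressors, squares and cross products; use n*R^2.
import numpy as np
from scipy import stats

def white_test(y, x1, x2):
    n = len(y)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b) ** 2                                             # step 1: squared residuals

    A = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])  # step 2: auxiliary regressors
    g = np.linalg.lstsq(A, e2, rcond=None)[0]
    u = e2 - A @ g
    R2 = 1.0 - u @ u / np.sum((e2 - e2.mean()) ** 2)                  # step 3: auxiliary R^2
    stat = n * R2                                                     # step 4: test statistic
    df = A.shape[1] - 1                                               # slopes of the auxiliary regression
    return stat, stats.chi2.sf(stat, df)

rng = np.random.default_rng(10)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=np.sqrt(0.5 + x1**2))
print(white_test(y, x1, x2))
```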

(50)

Summary of heteroscedasticity

A crucial CLRM assumption is that $\mathrm{Var}(e_i) = \sigma^2$. If $\mathrm{Var}(e_i) = \sigma_i^2$ instead, we have heteroscedasticity.

OLS estimators are no longer BLUE; still unbiased, but not efficient.

Two remedial approaches to deal with heteroscedasticity:

when $\sigma_i^2$ is known: weighted least squares (WLS);

when $\sigma_i^2$ is unknown: either make an educated guess about the likely pattern of the heteroscedasticity to consistently estimate $\sigma_i^2$ and use it in feasible GLS, or use White's heteroscedasticity-consistent variances and standard errors.

A number of tests are available to detect heteroscedasticity.

(51)

Serial correlation- general case

Now we want to specialize the results we discussed to the model
$$y_t = x_t'\beta + e_t,$$
where $x_t$ is the usual $k \times 1$ vector, $E(e_t) = 0$ and $E(e_te_s) \neq 0$ for some $t \neq s$. Note that autocorrelation (or serial correlation, it's the same thing) is very common in time series data, for instance due to unobserved factors (omitted variables) that are correlated over time. A common form of serial correlation is the autoregressive structure.

(52)

Autocorrelation patterns

There are several forms of autocorrelation, each leading to a different structure for the error covariance matrix $\mathrm{Var}(e) = \sigma^2\Omega$.

We will only consider here AR(1) structures, i.e. $e_t = \rho e_{t-1} + v_t$, where $v_t \sim iid(0, \sigma_v^2)$ and $|\rho| < 1$.

It is easy to show that $E(e_t) = 0$. We can also show that
$$E(ee') = \frac{\sigma_v^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1}\\ \rho & 1 & \rho & \cdots & \rho^{n-2}\\ \vdots & & & \ddots & \vdots\\ \rho^{n-1} & \rho^{n-2} & \cdots & \cdots & 1\end{pmatrix}.$$

(53)

Autocorrelation patterns

As we discussed, in the case of non-spherical disturbances we can either stick to OLS and correct the standard errors, or we can derive GLS, but the latter requires a feasible version of the matrix $P$ to be implemented. A suitable matrix $P$ can be derived fairly easily in the case of an AR(1) error term as
$$P = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0\\ -\rho & 1 & 0 & \cdots & 0\\ 0 & -\rho & 1 & \cdots & 0\\ \vdots & & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & -\rho & 1\end{pmatrix}.$$
Rather than on the derivation of $P$, we focus on tests for detecting serial correlation.

(54)

Adjust the standard errors when errors are serially correlated

When all other classical regression assumptions are satisfied (in particular, uncorrelatedness between the errors and the regressors), we may decide to use OLS, which is unbiased and consistent but inefficient.

The corrected standard errors can be derived from
$$\mathrm{Var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1},$$
using the Newey-West "heteroskedasticity and autocorrelation consistent" (HAC) estimator for $\Sigma$.

(55)

Testing for serial correlation

Informal checks:

plot the residuals and check whether there is any systematic pattern;

obtain the correlogram (graphical representation of the autocorrelations) of the residuals and check whether the calculated autocorrelations are significantly different from zero. From the correlogram you can also "guess" the structure of the autocorrelation.

Tests:
Durbin-Watson d test,
Breusch-Godfrey test.

(56)

Durbin-Watson d test

The first test we consider for $H_0: \rho = 0$ is the Durbin-Watson $d$ statistic,
$$d = \frac{\sum_{t=2}^{T}(\hat e_t - \hat e_{t-1})^2}{\sum_{t=1}^{T}\hat e_t^2},$$
where $\hat e_t$ are the OLS residuals obtained by estimating the parameters.

It is possible to show that for large $T$, $d \to 2 - 2\rho$.

The value of $d$ is given by any statistical package.

If the value of $d$ is "close" to 2, we can conclude that the model is free from serial correlation. We need to establish a precise rejection rule.
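
As a quick numerical sketch (simulated AR(1) errors, illustrative values), $d$ can be computed directly from the OLS residuals and compared with the $2 - 2\hat\rho$ approximation:

```python
# Durbin-Watson d statistic from OLS residuals, and the link d ~ 2 - 2*rho_hat.
import numpy as np

rng = np.random.default_rng(11)
T = 300
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):                                # AR(1) errors with rho = 0.6
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(T), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ b

d = np.sum(np.diff(r) ** 2) / np.sum(r**2)           # Durbin-Watson d
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r**2)      # first-order residual autocorrelation
print(d, 2 - 2 * rho_hat)                            # approximately equal for large T
```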

(57)

Durbin-Watson d test

Unfortunately, establishing a rejection rule for a test of H0 based on d statistic is more complicated than usual.

This is because the critical values depend also on the value of the regressors.

The best we can do is to give upper and lower bounds for the critical values of d , dU and dL, and establish the following rejection rule

If the computed value of d is lower than dL, we reject H0.

If the computed value of d is greater than dU, we fail to reject H0.

If the computed value of d is between dL and dU, we cannot conclude (the outcome is indeterminate).

dL and dU in the statistical tables are given in terms of number of observations and number of regressors.

(58)

Durbin-Watson d test

Limitations: the d test is only valid when the regression contains an intercept, the regressors are non-stochastic (in particular, there is no lagged dependent variable among them), and the error follows an AR(1) scheme.

(59)

Breusch-Godfrey test

Suppose we have the general model
$$y_t = x_t'\beta + e_t, \qquad e_t = \rho e_{t-1} + v_t, \qquad (7)$$
where one or more of the regressors can be a lagged $y$ (such as $y_{t-1}$). If we wish to test $H_0: \rho = 0$, we can:

Estimate the parameters of (7) by OLS and save the residuals $\hat e_t$ for $t = 1, \dots, T$.

Estimate the following residual regression:
$$\hat e_t = x_t'\gamma + \delta\hat e_{t-1} + \nu_t. \qquad (8)$$

From the regression output obtained by performing regression (8), compute $nR^2$ ($n$ is the actual number of observations, which is $T - 1$ in this case since the first one is lost).

Under $H_0$, $nR^2$ is asymptotically $\chi^2$ with one degree of freedom.
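
A sketch of these steps for the AR(1) case on simulated data (illustrative names and values):

```python
# Breusch-Godfrey test: regress e_hat_t on x_t and e_hat_{t-1}; n*R^2 ~ chi2(1) under H0.
import numpy as np
from scipy import stats

def breusch_godfrey(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b                                    # residuals of the original regression
    A = np.column_stack([X[1:], e[:-1]])             # auxiliary regressors: x_t and lagged residual
    g = np.linalg.lstsq(A, e[1:], rcond=None)[0]
    u = e[1:] - A @ g
    R2 = 1.0 - u @ u / np.sum((e[1:] - e[1:].mean()) ** 2)
    n = len(e) - 1                                   # one observation lost
    stat = n * R2
    return stat, stats.chi2.sf(stat, 1)

rng = np.random.default_rng(12)
T = 300
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.5 * e[t - 1] + rng.normal()             # AR(1) errors
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(T), x])
print(breusch_godfrey(y, X))
```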

(60)

Maximum Likelihood Estimation

The most direct way of estimating unknown parameters is known as maximum likelihood estimation. Although the principle can be applied more generally, assume that the data $\{y_i\}_{i=1}^{n}$ are iid copies of the random variable $Y$, which is known to have density function $f(y, \theta)$.

As the $y_i$ are iid, the joint sample density is
$$f(y_1, \dots, y_n, \theta) = \prod_{i=1}^{n} f(y_i, \theta).$$
We then interpret this as a function of the unknown parameters given that we have observed the data, as in
$$L(\theta, y) = f(y_1, \dots, y_n, \theta),$$
called the likelihood. Then we simply ask which values of $\theta$ are most likely given the data, i.e. we maximize the likelihood with respect to $\theta$. Actually, since the logarithm is a monotone function, we define the MLE as the maximizer of the log-likelihood.

(61)

Maximum Likelihood Estimation in NLRM

Consider the model
$$y = X\beta + \varepsilon$$
and let the standard assumptions (A1-A4) hold. Additionally assume that the errors are normally distributed, i.e.
$$A5: \varepsilon \sim N(0, \sigma^2 I_n).$$
Since $y = X\beta + \varepsilon$, we immediately have
$$y \sim N(X\beta, \sigma^2 I_n),$$
because if $w \sim N(\mu, \Sigma)$, then for any $n \times m$ matrix $A$ and $m \times 1$ vector $b$, $z = A'w + b \sim N(A'\mu + b, A'\Sigma A)$.

(62)

Maximum Likelihood Estimation in NLRM

Moreover, we can write down its joint density (multivariate normal)
$$f(y) = \frac{\exp\big\{-\tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\big\}}{(2\pi\sigma^2)^{n/2}},$$
from which we obtain the log-likelihood
$$l(\beta, \sigma^2) = -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) - \frac{n}{2}\ln\sigma^2 - \frac{n}{2}\ln 2\pi.$$

(63)

Maximum Likelihood Estimation in NLRM

Thus the maximum likelihood estimators must satisfy the FOC (the score vector equals 0 at the MLE), i.e.
$$S(\beta, \sigma^2) = \begin{pmatrix} \frac{\partial l(\beta,\sigma^2)}{\partial\beta} \\[4pt] \frac{\partial l(\beta,\sigma^2)}{\partial\sigma^2} \end{pmatrix} = 0,$$
with
$$\frac{\partial l(\beta, \sigma^2)}{\partial\beta} = \frac{X'y - X'X\beta}{\sigma^2} = 0 \qquad (9)$$
$$\frac{\partial l(\beta, \sigma^2)}{\partial\sigma^2} = \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) - \frac{n}{2\sigma^2} = 0.$$
Solving (9) gives $\hat\beta_{MLE}$, and substituting it into the second equation gives $\hat\sigma^2_{MLE}$:
$$\hat\beta_{MLE} = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{y'M_Xy}{n}.$$
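
A sketch comparing the closed-form solutions above with a direct numerical maximization of the log-likelihood on simulated data (the log-variance parameterization is only a convenience to keep $\sigma^2$ positive; all names and numbers are illustrative):

```python
# Normal linear regression MLE: closed form versus numerical maximization of l(beta, sigma^2).
import numpy as np
from scipy import optimize

rng = np.random.default_rng(13)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

def neg_loglik(theta):
    beta, log_sigma2 = theta[:k], theta[k]
    sigma2 = np.exp(log_sigma2)
    r = y - X @ beta
    return 0.5 * (r @ r) / sigma2 + 0.5 * n * np.log(sigma2) + 0.5 * n * np.log(2 * np.pi)

res = optimize.minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_numeric, sigma2_numeric = res.x[:k], np.exp(res.x[k])

beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]      # (X'X)^{-1} X'y
sigma2_mle = np.sum((y - X @ beta_mle) ** 2) / n     # divides by n, hence biased
print(beta_numeric, beta_mle)
print(sigma2_numeric, sigma2_mle)
```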

(64)

Maximum Likelihood Estimation in NLRM

The SOC is that the matrix of second-order derivatives (evaluated at the MLE estimators), i.e.
$$H(\beta, \sigma^2)\Big|_{\hat\beta_{MLE},\,\hat\sigma^2_{MLE}} = \begin{pmatrix} \frac{\partial^2 l(\beta,\sigma^2)}{\partial\beta\partial\beta'} & \frac{\partial^2 l(\beta,\sigma^2)}{\partial\beta\partial\sigma^2} \\[4pt] \frac{\partial^2 l(\beta,\sigma^2)}{\partial\sigma^2\partial\beta'} & \frac{\partial^2 l(\beta,\sigma^2)}{\partial(\sigma^2)^2} \end{pmatrix},$$
is negative definite. Note that
$$H(\beta, \sigma^2) = \begin{pmatrix} -\frac{X'X}{\sigma^2} & -\frac{X'y - X'X\beta}{\sigma^4} \\[4pt] -\frac{(X'y - X'X\beta)'}{\sigma^4} & -\frac{1}{\sigma^6}(y - X\beta)'(y - X\beta) + \frac{n}{2\sigma^4} \end{pmatrix} \qquad (10)$$
and
$$H(\hat\beta_{MLE}, \hat\sigma^2_{MLE}) = \begin{pmatrix} -\frac{X'X}{\hat\sigma^2_{MLE}} & 0 \\[4pt] 0 & -\frac{n}{2\hat\sigma^4_{MLE}} \end{pmatrix},$$
which is negative definite.

(65)

Properties of MLE

Clearly
$$\hat\beta_{MLE} \sim N\big(\beta, \sigma^2(X'X)^{-1}\big).$$
Note that the OLS estimator equals $\hat\beta_{MLE}$ under normality. We proved that the OLS estimator is BLUE by the Gauss-Markov Theorem (smallest variance in the class of linear unbiased estimators).

Under normality, we can strengthen our notion of efficiency and show that $\hat\beta_{MLE}$ is BUE (Best Unbiased Estimator) by the Cramer-Rao Lower Bound (the variance of any unbiased estimator is at least as large as the inverse of the information).

(66)

Cramer-Rao Lower Bound

The information is defined as
$$I_n(\beta, \sigma^2) = -E\big[H(\beta, \sigma^2)\big],$$
and given (10), in our case we have
$$I_n(\beta, \sigma^2) = \begin{pmatrix} \frac{X'X}{\sigma^2} & 0 \\[4pt] 0 & \frac{n}{2\sigma^4} \end{pmatrix},$$
so that the inverse of the information is
$$I_n^{-1}(\beta, \sigma^2) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\[4pt] 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$

(67)

Maximum Likelihood Estimation in NLRM

Since we know that $V\big[\hat\beta_{MLE}\big] = \sigma^2(X'X)^{-1}$, clearly $\hat\beta_{MLE}$ is BUE.

Note that $\hat\sigma^2_{MLE}$ is biased, but we can define the unbiased estimator
$$s^2 = \frac{n}{n-k}\,\hat\sigma^2_{MLE} = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n-k}.$$
Unfortunately, the unbiased $s^2$ is not BUE, since
$$V(s^2) = \frac{2\sigma^4}{n-k} > \frac{2\sigma^4}{n}.$$
Note that as $n \to \infty$, $s^2$ becomes efficient as long as $k$ is fixed.


(69)

Oaxaca-Blinder (1973)

A microeconometric decomposition technique which allows us to study the differences in outcomes between two groups, when these differences come from differences in characteristics (explained variation) and differences in parameters (discrimination/unexplained variation).

Most often used in the literature on wage inequalities (female vs.

male, union vs. nonunion workers, public vs. private sector workers, migrants vs. native workers).

(70)

Oaxaca-Blinder preliminaries

Assume that $y$ is explained by the vector of regressors $x$ as in the linear regression model:
$$y_i = \begin{cases} \beta^{female}x_i + \varepsilon_i^{female}, & \text{if } i \text{ is female}, \\ \beta^{male}x_i + \varepsilon_i^{male}, & \text{if } i \text{ is male}, \end{cases}$$
where $\beta$ also contains the intercept.

Assume that men are privileged.

The difference in mean outcomes is
$$\bar y^{male} - \bar y^{female} = \beta^{male}\bar x^{male} - \beta^{female}\bar x^{female}.$$

(71)

Oaxaca-Blinder

Alternatively:
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{male} + \Delta\beta\,\bar x^{female}$$
or
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{female} + \Delta\beta\,\bar x^{male},$$
where $\Delta x = \bar x^{male} - \bar x^{female}$ and $\Delta\beta = \beta^{male} - \beta^{female}$.

Differences in outcomes come from different characteristics and different parameters (females have worse $x$ and worse $\beta$).

Even more generally:
$$\bar y^{male} - \bar y^{female} = \Delta x\,\beta^{female} + \Delta\beta\,\bar x^{female} + \Delta x\,\Delta\beta = E + C + CE.$$
Problem: which group do we pick as the reference?

(72)

General version of the decomposition

General equation:
$$\bar y^{male} - \bar y^{female} = \underbrace{\beta^{*\prime}(\bar x^{male} - \bar x^{female})}_{\text{difference in characteristics}} + \underbrace{\bar x^{male\prime}(\beta^{male} - \beta^{*}) + \bar x^{female\prime}(\beta^{*} - \beta^{female})}_{\text{difference in parameters: male advantage + female disadvantage}},$$
where $\beta^{*} = \lambda\beta^{male} + (1 - \lambda)\beta^{female}$.
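
A sketch of the two-fold decomposition with male coefficients as the reference ($\lambda = 1$) on simulated data; the wage equation, variable names and magnitudes are purely illustrative:

```python
# Oaxaca-Blinder: gap = explained (characteristics) + unexplained (parameters), male betas as reference.
import numpy as np

rng = np.random.default_rng(14)
n = 2000
female = rng.integers(0, 2, size=n).astype(bool)
educ = rng.normal(12, 2, size=n) - 0.5 * female              # women get slightly less schooling here
wage = 5 + 0.8 * educ - 1.0 * female + rng.normal(size=n)    # and a lower intercept (unexplained gap)

def ols(yv, Xv):
    return np.linalg.lstsq(Xv, yv, rcond=None)[0]

Xm = np.column_stack([np.ones((~female).sum()), educ[~female]])
Xf = np.column_stack([np.ones(female.sum()), educ[female]])
b_m, b_f = ols(wage[~female], Xm), ols(wage[female], Xf)
xbar_m, xbar_f = Xm.mean(axis=0), Xf.mean(axis=0)

gap = wage[~female].mean() - wage[female].mean()
explained = b_m @ (xbar_m - xbar_f)      # difference in characteristics, at male coefficients
unexplained = xbar_f @ (b_m - b_f)       # difference in parameters (unexplained/"discrimination")
print(gap, explained + unexplained)      # identical by construction
print(explained, unexplained)
```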

(73)

General version of the decomposition

$\lambda$ value and interpretation:

$\lambda = 1$: male parameters as the reference.
$\lambda = 0$: female parameters as the reference.
$\lambda = 0.5$: average of the two sets of parameters, Reimers (1983).
$\lambda = \%$ male: parameters weighted by the sample proportion of males, Cotton (1988).
$\beta^{*} =$ pooled: parameters for the whole sample without a gender dummy, Neumark (1988).
$\beta^{*} =$ pooled: parameters for the whole sample with a gender dummy, Fortin (2008).
$\lambda = \%$ female: parameters weighted by the proportion of the opposite group, Słoczyński (2013).

(74)

Some asymptotic results prior to Stochastic Regressors

The idea is to treat results that hold as $n \to \infty$ as approximations for finite $n$.

In particular, we will be interested in whether estimators are consistent and what their asymptotic distribution is.

Asymptotic results are useful in finite samples (finite $n$) because, for instance, we may not be able to show unbiasedness or we may not be able to determine the sampling distribution needed to carry out statistical inference.

(75)

Asymptotic theory- some results

Definition - Convergence in Probability

Let $X_n = \{X_i, i = 1, \dots, n\}$ be a sequence of random variables. $X_n$ converges in probability to $X$ if
$$\lim_{n\to\infty}\Pr(|X_n - X| > \varepsilon) = 0 \quad\text{for any } \varepsilon > 0.$$
You will see convergence in probability written as either $X_n \overset{p}{\to} X$ or $\mathrm{plim}_{n\to\infty} X_n = X$.

(76)

Convergence in Probability

The idea behind this type of convergence is that the probability of an

"unusual" outcome becomes smaller and smaller as the sequence progresses.

Example: Suppose you take a basketball and start shooting free throws. Let $X_n$ be your success probability for the $n$th shot. Initially you are likely to miss a lot, but as time goes on your skill increases and you are more likely to make the shots. After years of practice the probability that you miss becomes smaller and smaller. Thus, as $n \to \infty$ the sequence $X_n$ converges in probability to $X = 100\%$. Note that the probability does not become exactly 100%, as there is always a small probability of missing.

(77)

Consistency

Definition - Consistency

An estimator $\hat\theta$ of $\theta$ is consistent if, as the sample size increases, $\hat\theta$ gets "closer" to $\theta$. Formally, $\hat\theta$ is consistent when
$$\mathrm{plim}_{n\to\infty}\hat\theta = \theta.$$
Useful result - Slutsky Theorem: Let $g(\cdot)$ be a continuous function. Then
$$\mathrm{plim}_{n\to\infty}\, g(X_n) = g\big(\mathrm{plim}_{n\to\infty} X_n\big).$$
Useful result - Chebyshev's inequality: For a random variable $X_n$ with mean $\mu$ and variance $\mathrm{Var}(X_n)$,
$$\Pr(|X_n - \mu| > \varepsilon) \leq \frac{\mathrm{Var}(X_n)}{\varepsilon^2} \quad\text{for any } \varepsilon > 0.$$

(78)

Consistency

A sufficient (but not necessary) condition for consistency:

If $\lim_{n\to\infty} E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$, then
$$\lim_{n\to\infty}\Pr(|\hat\theta - \theta| > \varepsilon) = 0,$$
or, as we write, $\mathrm{plim}_{n\to\infty}\hat\theta = \theta$.

If $\hat\theta$ is unbiased, in order to determine whether $\hat\theta$ is also consistent, we only need to check that $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.

Note that biased estimators can be consistent as long as $\lim_{n\to\infty} E(\hat\theta) = \theta$ and $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$.

(79)

Unbiasedness versus Consistency

Unbiased, but not consistent: Suppose that given an iid sample $\{X_1, \dots, X_n\}$ we want to estimate the mean of $X$, i.e. $E(X)$. We could use the first observation as the estimator for the mean, so $\hat\theta = X_1$. $\hat\theta$ is unbiased, since $E(\hat\theta) = E(X_1) = E(X)$, because we have an iid sample. However, $\hat\theta$ does not converge to any value, therefore it cannot be consistent.

Biased, but consistent: an alternative estimator for the mean in an iid sample $\{X_1, \dots, X_n\}$ might be $\tilde\theta = \frac{1}{n}\sum_{i=1}^{n} X_i + \frac{1}{n}$. $\tilde\theta$ is biased, since $E(\tilde\theta) = E(X) + \frac{1}{n}$, but
$$\lim_{n\to\infty} E(\tilde\theta) = E(X) \quad\text{and}\quad \mathrm{Var}(\tilde\theta) = \frac{\mathrm{Var}(X)}{n} \to 0 \text{ as } n \to \infty.$$

(80)

Law of Large Numbers

Laws of large numbers provide conditions ensuring that sample moments converge to their population counterparts.

Weak Law of Large Numbers: Let $X_n = \{X_i, i = 1, \dots, n\}$ be an independent and identically distributed (iid) sequence of random variables with $E|X_i| < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} E(X_i).$$
We only consider this version of the WLLN (for iid data). When data are not iid, such as for time series data, stronger conditions are needed.

(81)

Law of Large Numbers

Example: Consider a coin (heads and tails) being flipped. Logic says there are equal chances of getting heads or tails. If the coin is flipped 10 times, there is a good chance that the proportion of heads and tails is not equal. Crudely, the law of large numbers says that as the number of flips increases, the proportion of heads will converge in probability to 0.5.

(82)

Sampling distribution

When we do not know the exact sampling distribution of an estimator (for instance when we do not assume normality of the error term), we may ask whether asymptotics can allow us to infer something about its distribution for large n , so that we are still able to make inference on the estimates.

Definition - Convergence in Distribution

Let $X_n = \{X_i, i = 1, \dots, n\}$ be a sequence of random variables and let $X$ be a random variable with distribution $F_X(x)$. $X_n$ converges in distribution to $X$ if
$$\lim_{n\to\infty}\Pr(X_n \leq x) = F_X(x),$$
and this is written $X_n \overset{d}{\to} X$.

The approximating distribution $F_X(x)$ is called a limiting or asymptotic distribution.

(83)

Useful results

Convergence in probability implies convergence in distribution, i.e.
$$X_n \overset{p}{\to} X \;\Rightarrow\; X_n \overset{d}{\to} X.$$
Convergence in distribution to a constant implies convergence in probability to that constant, i.e.
$$X_n \overset{d}{\to} c \;\Rightarrow\; X_n \overset{p}{\to} c.$$
Suppose that $\{Y_n\}$ is another sequence of random variables and let $g(\cdot)$ be a continuous function. If $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$, then
$$g(X_n, Y_n) \overset{d}{\to} g(X, c).$$

(84)

Central Limit Theorem

The most important example of convergence in distribution is the Central Limit Theorem, which is useful for establishing asymptotic distributions.

If $X_1, \dots, X_n$ is an iid sample from any probability distribution with finite mean $\mu$ and finite variance $\sigma^2$, we have
$$\sqrt{n}(\bar X - \mu) \overset{d}{\to} N(0, \sigma^2),$$
or equivalently
$$\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \overset{d}{\to} N(0, 1).$$
The CLT guarantees that, even if the errors are not normally distributed, but simply iid with zero mean and variance $\sigma^2$, the OLS estimator is asymptotically normal:
$$\sqrt{n}(\hat\beta - \beta) \overset{d}{\to} N\big(0, \sigma^2 Q^{-1}\big), \qquad Q = \mathrm{plim}\,\tfrac{1}{n}X'X.$$

(85)

Limiting distributions of test-statistics

Thus, when the disturbances are iid with zero mean and finite variance, the tests we previously discussed are asymptotically valid and
$$t \overset{d}{\to} N(0, 1) \quad\text{under } H_0,$$
$$W = J\,F \overset{d}{\to} \chi^2_J \quad\text{under } H_0.$$

(86)

Law of iterated expectation

We have a useful result that we will use extensively. Let $X$ and $Y$ be two random variables; then
$$E(Y) = E\big[E(Y|X)\big].$$
A very useful by-product of the Law of Iterated Expectations is
$$E(XY) = E\big[XE(Y|X)\big],$$
sometimes written as
$$E(XY) = E_X\big[XE(Y|X)\big]$$
to indicate that the outer expectation is taken with respect to $X$.
