Applied Econometrics QEM Meeting 4: Endogenous regressors
Chapter 10 in PoE
Michał Rubaszek
Warsaw School of Economics
Random regressors
The fourth assumption of the LS:
A4 The values of xkt for k = 1, 2, . . . , K are not random and are not linear functions of other explanatory variables.
Today we relax A4 and allow the explanatory variable xt to be random.
We will show that this might affect the properties of the LS estimator.
For simplicity, we will concentrate on a linear model with one explanatory variable.
Random regressors
Remark: if x is random then its value is unknown until the experiment is performed, i.e. y and x are revealed at the same time (which is often the case).
If x is random, the LS assumption E(ε|x) = 0 implies cov(ε, x) = 0, so that x is an exogenous variable (= determined outside the system). In this case we can apply the standard method of estimation: the OLS estimator is unbiased, consistent and efficient.
The problem starts if cov(ε, x) ≠ 0, i.e. when x is endogenous (= determined within the system). In this case the LS estimator is biased and inconsistent, and none of the usual hypothesis tests (t-tests, F-tests) are valid.
Conclusion: when x is random, the relationship between ε and x is the crucial factor when deciding whether LS is appropriate or not.
Endogeneity bias
For the linear regression y = β1 + β2x + ε, since E(y) = β1 + β2E(x), we have:
y − E(y) = β2(x − E(x)) + ε.
Multiplying both sides by (x − E(x)) and taking expectations yields:
cov(y, x) = β2 var(x) + cov(ε, x),
which means that:
β2 = cov(y, x)/var(x) − cov(ε, x)/var(x).
Endogeneity bias
Recall that the LS estimator for a model with one explanatory variable is:
β̂2 = ĉov(y, x)/v̂ar(x)
(the sample covariance over the sample variance), so that it converges to:
β̂2 → β2 + cov(ε, x)/var(x).
Conclusion: if cov(ε, x) ≠ 0, the LS estimator is inconsistent and biased.
The value of this bias (the endogeneity bias) is cov(ε, x)/var(x).
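The convergence result above can be checked with a short simulation (a hypothetical numerical sketch, not from PoE; the data-generating process and parameter values are illustrative assumptions):

```python
import numpy as np

# Hypothetical simulation of the endogeneity bias: x and the error eps
# share a common shock u, so cov(eps, x) = var(u) = 1 and var(x) = 2.
rng = np.random.default_rng(0)
T = 200_000
beta2 = 0.5

u = rng.normal(size=T)                      # common shock
x = u + rng.normal(size=T)                  # var(x) = 2
eps = u + rng.normal(size=T)                # cov(eps, x) = 1
y = 1.0 + beta2 * x + eps

# LS slope = sample cov(y, x) / sample var(x)
b2_ls = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# Theoretical limit: beta2 + cov(eps, x)/var(x) = 0.5 + 1/2 = 1.0
print(b2_ls)
```

With a large T the LS slope settles near the biased limit 1.0, not near the true β2 = 0.5, exactly as the formula for the endogeneity bias predicts.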
Endogeneity bias - multivariate case
LS estimator:
b = (Σ xixi′)⁻¹ Σ xiyi
Substitute yi = xi′β + εi to get:
b = β + (Σ xixi′)⁻¹ Σ xiεi.
If cov(x, ε) ≠ 0 then:
E(Σ xiεi) ≠ 0 and:
E(b) ≠ β
Sources of endogeneity bias
The three most popular sources of endogeneity bias:
i. We analyze a single equation of a bigger simultaneous equations system, where the structure of the system implies cov(ε, x) ≠ 0.
ii. We use an approximation z of the explanatory variable x, so that z = x + ν, where ν is a measurement error. Instead of the true model y = β1 + β2x + ε, we estimate y = β1 + β2z + ε*, where ε* = −β2ν + ε. Notice that:
cov(ε*, z) = cov(−β2ν + ε, x + ν) = −β2 var(ν).
iii. We omit from the regression an important variable z which affects y and is correlated with the explanatory variable x, i.e. cov(z, x) ≠ 0. In this case the effect of the omitted variable z is attributed to the explanatory variable x.
Sources of endogeneity bias: example
Let us consider regression of earnings y on years of schooling x:
y = β1 + β2x + ε
The error term ε embodies all factors other than education that determine earnings, including ability z. It might be argued that individual abilities have a non-negligible effect on education: more gifted persons tend to be better educated. Let us assume that the full model is of the form:
y = β1 + β2x + β3z + ν.
It can be noticed that ε = β3z + ν. The LS estimate of β2 in a model with one explanatory variable converges to:
β̂2 → β2 + β3 cov(x, z)/var(x).
Conclusion: LS estimates might overstate the effect of education on earnings.
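A numerical sketch of this omitted-variable mechanism (simulated data with illustrative parameters, not the mroz.wf1 sample): with β3 = 0.3 and schooling x built partly from ability z, the short regression of y on x alone recovers β2 plus β3·cov(x, z)/var(x).

```python
import numpy as np

# Hypothetical simulation: ability z is omitted from the earnings equation.
rng = np.random.default_rng(1)
T = 200_000
beta2, beta3 = 0.5, 0.3

z = rng.normal(size=T)                      # ability (omitted)
x = 0.8 * z + rng.normal(size=T)            # schooling, correlated with z
y = 1.0 + beta2 * x + beta3 * z + rng.normal(size=T)

# Short regression of y on x only: slope = cov(y, x)/var(x)
b2_short = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# Omitted-variable bias term beta3 * cov(x, z)/var(x), from the same sample
bias = beta3 * np.cov(x, z)[0, 1] / np.var(x, ddof=1)
```

Because cov(x, z) > 0 and β3 > 0 here, the short-regression slope overstates β2, mirroring the schooling example.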
Instrumental variables estimator - intuition
What can we do? Change the method of estimation!
The intuition of IV estimation:
For the regression y = β1 + β2x + ε we have:
y − E(y) = β2(x − E(x)) + ε.
Let us multiply both sides by (z − E(z)) and take expectations:
cov(y, z) = β2 cov(x, z) + cov(ε, z).
If z is not correlated with ε, then:
β̂2 = ĉov(y, z)/ĉov(x, z)
should be a consistent estimator of β2.
Since variable z is used as an instrument, the estimator is called the instrumental variables (IV) estimator.
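The intuition can be sketched numerically (hypothetical simulated data; the setup is an illustrative assumption): an instrument z that shifts x but is independent of the error recovers β2 through ĉov(z, y)/ĉov(z, x), while the LS slope stays biased.

```python
import numpy as np

# Hypothetical simulation: z moves x but is uncorrelated with eps,
# while a common shock u makes x endogenous.
rng = np.random.default_rng(2)
T = 200_000
beta2 = 0.5

z = rng.normal(size=T)                      # instrument
u = rng.normal(size=T)                      # common shock -> endogeneity
x = z + u + rng.normal(size=T)
eps = u + rng.normal(size=T)
y = 1.0 + beta2 * x + eps

b2_ls = np.cov(y, x)[0, 1] / np.var(x, ddof=1)    # biased (limit 0.5 + 1/3)
b2_iv = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]   # consistent for beta2
```

The IV ratio works precisely because cov(ε, z) = 0 kills the extra term in cov(y, z) = β2 cov(x, z) + cov(ε, z).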
Instrumental variables estimator
Recall that the k-th population moment and its sample counterpart are:
μk = E(y^k)   and   μ̂k = (1/T) Σt (yt)^k,
where Σt denotes the sum over t = 1, . . . , T.
The method of moments (MM) estimation procedure equates m population moments to the corresponding m sample moments.
Instrumental variables estimator
For the linear regression with one explanatory variable the LS assumptions state that:
E(ε) = E(y − β1 − β2x) = 0
cov(ε, x) = E(εx) = E(x(y − β1 − β2x)) = 0.
In terms of the two sample moments in the MM estimation procedure:
(1/T) Σt (yt − β̂1 − β̂2xt) = 0
(1/T) Σt xt(yt − β̂1 − β̂2xt) = 0.
The solution is the same as the LS estimates, i.e.:
β̂1 = ȳ − β̂2x̄   and   β̂2 = ĉov(x, y)/v̂ar(x).
Instrumental variables estimator
The advantage of the MM estimation procedure over LS is that we can choose various moments to match. For example, if cov(ε, x) ≠ 0, we can replace E(εx) = 0 with E(εz) = 0, so that:
(1/T) Σt (yt − β̂1 − β̂2xt) = 0
(1/T) Σt zt(yt − β̂1 − β̂2xt) = 0,
for which the solution is:
β̂1 = ȳ − β̂2x̄   and   β̂2 = ĉov(z, y)/ĉov(z, x).
The above estimator is called the Instrumental Variables (IV) estimator (z is used as an instrument).
Instrumental variables estimator
The properties of the IV estimator depend on the choice of z. If the following conditions are met:
i. z does not have a direct impact on y (it is not an explanatory variable)
ii. z is not correlated with the error term (so that E(εz) = 0 holds)
iii. z is strongly (or at least weakly) correlated with x
then the IV estimator is consistent.
Instrumental variables estimator
In the regression with one explanatory variable, the IV estimator satisfies (approximately):
β̂2 ∼ N(β2, σ²/(T ρ̂²xz v̂ar(x))),
where ρ̂xz is the sample correlation between x and z.
Remark: Since 1/ρ̂²xz > 1, the precision of the IV estimator is always lower than the precision of the LS estimator. Notice that the stronger the correlation between z and x, the higher the precision of the IV estimator. If the value of ρ̂xz is high, we speak of strong instruments. For low ρ̂xz the instruments are weak (in this case the IV estimator might be badly biased, even in large samples; moreover, its distribution is not approximately normal).
Instrumental variables estimator: example
Recall the model in which earnings depend on education. In the mroz.wf1 file we have data for a sample of 428 women. The LS estimates (standard errors in parentheses):
ln(wage)ᵢ = −0.185 + 0.109 educᵢ
           (0.185)  (0.014)
indicate that an additional year of education increases wages by about 10.9% (a really impressive rate of return).
Let us apply the IV estimation procedure with the variable mothereduc (mother's education) as an instrument z. mothereduc should be correlated with the explanatory variable (daughter's education), shouldn't have a direct impact on the dependent variable (daughter's earnings), and we assume that it is not correlated with the error term (this might be invalid). The estimation results:
ln(wage)ᵢ = 0.702 + 0.039 educᵢ
           (0.985) (0.038)
are less optimistic: an additional year of education increases wages by only 3.9%.
Remark: The precision of the IV estimates, as measured by the standard errors, is about half that of the LS estimates.
Instrumental variables estimator: generalization
Consider a general linear model:
yt = β0 + β1x1t + . . . + βG xGt [exogenous vars.] + βG+1 xG+1,t + . . . + βK xKt [endogenous vars.] + εt
in which:
• the first G explanatory variables are exogenous
• the remaining K − G variables are potentially endogenous
The generalization of the IV estimation procedure for one variable is to find K − G instruments zG+1,t, . . . , zKt, one for each endogenous variable, and to estimate the model parameters with the formula:
b̂IV = (Σt zt xt′)⁻¹ (Σt zt yt),
where:
xt = [1 x1t . . . xGt xG+1,t . . . xKt]′
zt = [1 x1t . . . xGt zG+1,t . . . zKt]′.
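The formula b̂IV = (Σt zt xt′)⁻¹ (Σt zt yt) is a one-line matrix computation. A minimal sketch on hypothetical simulated data (one exogenous regressor x1 and one endogenous regressor x2 instrumented by z2; all values illustrative):

```python
import numpy as np

# Sketch of b_IV = (sum z_t x_t')^{-1} (sum z_t y_t) on simulated data.
rng = np.random.default_rng(3)
T = 100_000
beta = np.array([1.0, 0.3, 0.5])            # [const, x1, x2]

u = rng.normal(size=T)                      # shock making x2 endogenous
x1 = rng.normal(size=T)                     # exogenous regressor
z2 = rng.normal(size=T)                     # instrument for x2
x2 = z2 + u + rng.normal(size=T)            # endogenous regressor
y = beta[0] + beta[1] * x1 + beta[2] * x2 + u + rng.normal(size=T)

X = np.column_stack([np.ones(T), x1, x2])   # rows are x_t'
Z = np.column_stack([np.ones(T), x1, z2])   # rows are z_t' (exog vars + instrument)

# Z'X = sum z_t x_t' and Z'y = sum z_t y_t
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
```

Note that the exogenous variables (the constant and x1) serve as their own instruments in zt, exactly as in the definitions above.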
Two Stage Least Squares (2SLS)
• The above IV procedure is not the dominant method of estimating a model with endogenous variables
• A more popular method is called the Two Stage Least Squares (2SLS) estimation procedure
Two Stage Least Squares (2SLS)
The 2SLS estimation procedure consists of two stages:
1. For each endogenous variable xk (G < k ≤ K) we estimate by LS the model:
xkt = αk0 + αk1x1t + . . . + αkG xGt [G exogenous vars.] + γk1z1t + . . . + γkL zLt [L instrumental vars.] + νkt
2. The fitted values x̂k from the first-stage regressions are substituted into the original model and we run the LS regression:
yt = β0 + β1x1t + . . . + βG xGt [G exogenous vars.] + βG+1 x̂G+1,t + . . . + βK x̂Kt [K − G fitted values] + εt
Remark 1: The 2SLS residuals should be calculated with the actual values of xk (k > G), not with their fitted values from the first stage.
Remark 2: The fitted value x̂kt for k > G from the first stage of the 2SLS procedure is a linear combination of the exogenous variables and the instruments, so x̂kt is an instrument in the second stage of 2SLS. This is why the 2SLS estimator is also called the IV estimator.
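The two stages can be sketched on hypothetical simulated data (illustrative setup, not the mroz.wf1 sample); in the just-identified case the 2SLS estimates coincide numerically with the IV formula (Z'X)⁻¹Z'y:

```python
import numpy as np

# Sketch of 2SLS: one exogenous regressor x1, one endogenous x2, one instrument z.
rng = np.random.default_rng(4)
T = 50_000
u = rng.normal(size=T)
x1 = rng.normal(size=T)                     # exogenous
z = rng.normal(size=T)                      # instrument for x2
x2 = z + u + rng.normal(size=T)             # endogenous
y = 1.0 + 0.3 * x1 + 0.5 * x2 + u + rng.normal(size=T)

X = np.column_stack([np.ones(T), x1, x2])
Z = np.column_stack([np.ones(T), x1, z])

# Stage 1: LS of the endogenous x2 on the exogenous vars and the instrument
g = np.linalg.lstsq(Z, x2, rcond=None)[0]
x2_hat = Z @ g

# Stage 2: LS of y on the exogenous vars and the fitted values
X_hat = np.column_stack([np.ones(T), x1, x2_hat])
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Just-identified case: identical to the IV formula
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
```

Per Remark 1, valid residuals are e = y − X @ b_2sls (with the actual x2), not y − X_hat @ b_2sls.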
Two Stage Least Squares (2SLS): example
We continue with the model analyzing the relationship between earnings and education (mroz.wf1 file). We add one explanatory variable, which describes experience. The LS estimates are:
ln(wage)ᵢ = −0.400 + 0.109 educᵢ + 0.016 experᵢ
           (0.190)  (0.014)       (0.004)
We apply the 2SLS estimation procedure with the variable mothereduc playing the role of the instrument. The results of the first-stage estimation are:
êducᵢ = 9.99 + 0.270 mothereducᵢ + 0.008 experᵢ
       (0.37) (0.031)             (0.013)
Substituting the fitted values from the above regression into the original model yields the following LS estimates:
ln(wage)ᵢ = 0.302 + 0.054 êducᵢ + 0.015 experᵢ
           (0.499) (0.031)       (0.004)
Remark: The standard errors of the above regression are not valid. The correct standard errors are:
ln(wage)ᵢ = 0.302 + 0.054 educᵢ + 0.015 experᵢ
           (0.477) (0.037)       (0.004)
Identication in the IV estimation procedure
xkt = αk0 + αk1x1t + . . . + αkG xGt [G exogenous vars.] + γk1z1t + . . . + γkL zLt [L instrumental vars.] + νkt
yt = β0 + β1x1t + . . . + βG xGt [G exogenous vars.] + βG+1 x̂G+1,t + . . . + βK x̂Kt [K − G fitted values] + εt
In the IV (or 2SLS) estimation procedure the number of instruments L cannot be smaller than the number of endogenous variables K − G, i.e. L ≥ K − G.
Why? Otherwise there would be a problem of collinearity (at least one explanatory variable would be a linear function of the other explanatory variables).
If L = K − G we say that the model is just identified (or exactly identified).
For L > K − G the model is overidentified.
Identication in the IV estimation procedure
In the case of overidentified models (more instruments than endogenous variables) there are G + L + 1 > K + 1 moment conditions:
(1/T) Σt zt(yt − β̂0 − β̂1x1t − . . . − β̂K xKt) = (1/T) Σt zt et = 0,
where zt = [1 x1t . . . xGt z1t . . . zLt]′, and only K + 1 unknown parameters.
The above system of equations may not have an exact solution. We can, however, reformulate the problem as one of choosing the parameters β̂k so that the sample moments are as close to zero as possible. This is the Generalized Method of Moments (GMM) estimator.
GMM estimator
Let:
ml(b) = (1/T) Σt zlt(yt − b0 − b1x1t − . . . − bK xKt) = (1/T) Σt zlt et
for l = 1, 2, . . . , G + L + 1 conditions and K + 1 unknowns. We can only find an approximation that minimises the expression:
β̂GMM = arg min_b T m(b)′ W m(b) = arg min_b J(b),
where W is a weighting matrix and:
m(b) = [m1(b); m2(b); . . . ; mG+L+1(b)] = (1/T) Σt zt et,
with zt = [1 x1t . . . xGt z1t . . . zLt]′.
For the quadratic form with W = (Σt zt zt′)⁻¹ the objective becomes:
J(b) = (1/T) (Σt et zt′) (Σt zt zt′)⁻¹ (Σt zt et).
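A sketch of the GMM estimator with this weighting matrix on hypothetical simulated, overidentified data (two instruments for one endogenous regressor; all values illustrative). With W = (Σt zt zt′)⁻¹ the minimiser has the closed form (X′ZWZ′X)⁻¹X′ZWZ′y, which coincides with 2SLS:

```python
import numpy as np

# Overidentified GMM sketch: G + L + 1 = 3 moment conditions, K + 1 = 2 unknowns.
rng = np.random.default_rng(5)
T = 50_000
u = rng.normal(size=T)
z1 = rng.normal(size=T)
z2 = rng.normal(size=T)
x = z1 + 0.5 * z2 + u + rng.normal(size=T)  # endogenous via u
y = 1.0 + 0.5 * x + u + rng.normal(size=T)

X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z1, z2])   # rows are z_t'

W = np.linalg.inv(Z.T @ Z)                  # weighting matrix (sum z_t z_t')^{-1}
A = X.T @ Z @ W
b_gmm = np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)

# Value of the objective J(b) at the minimiser
e = y - X @ b_gmm
m = Z.T @ e / T                             # m(b) = (1/T) sum z_t e_t
J = T * (m @ W @ m)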
Orthogonality test
M1: x (G exogenous vars.) + z (L instruments): J1 statistic
M2: M1 + S additional instruments v: J2 statistic
Orthogonality test hypotheses:
H0: E(εv) = 0   H1: E(εv) ≠ 0
Under the null the test statistic:
J2 − J1 ∼ χ²(S)
Hausman test of regressors endogeneity
Trade-off when deciding whether to use LS or 2SLS:
• the LS estimator is more efficient (has a smaller variance)
• the 2SLS estimator is consistent even if regressors are endogenous
We can test for endogeneity with the Hausman test of regressor endogeneity. The hypotheses for the set of potentially endogenous variables are:
H0: xG+1, xG+2, . . . , xK are all exogenous
H1: at least one of xG+1, xG+2, . . . , xK is endogenous
Hausman test of regressors endogeneity
The general idea of the Durbin-Wu-Hausman test (available in EViews) is to check whether additional moment conditions are valid:
M1: x1 (G exogenous vars.) + z (L instruments): J1 statistic
M2: M1 + x2 (K − G regressors that are assumed to be endogenous): J2 statistic.
If the variables in x2 are not endogenous, then the K − G additional moment conditions should be close to zero.
Under the null the test statistic:
H = J2 − J1 ∼ χ²(K − G)
Remark: The results of the Hausman test are conditional on the choice of instruments, so it might happen that for one set of instruments we conclude that x is exogenous, whereas for another set we conclude that it is endogenous.
Hausman test of regressors endogeneity: example
The results of the Hausman test for the exogeneity of the variable educ in the regression discussed in the previous example:
H = 2.69 (p = 0.101)
Conclusion: we cannot reject the null at the 5% significance level.
Remark: Our result does not imply that the null is true.
Hausman test of regressors endogeneity
The Hausman test (from PoE)
For the regression y = β0 + β1x + ε with instruments z1 and z2:
1. LS estimate x = α0 + α1z1 + α2z2 + v
2. Calculate the residuals v̂
3. LS estimate y = β0 + β1x + δv̂ + ε
4. Test the null δ = 0, which states that there is no endogeneity problem
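The four steps above can be sketched with hypothetical simulated data in which x is truly endogenous, so δ should come out clearly significant (setup and parameter values are illustrative assumptions):

```python
import numpy as np

# Regression-based Hausman test sketch: x is endogenous via the shock u.
rng = np.random.default_rng(6)
T = 10_000
u = rng.normal(size=T)
z1 = rng.normal(size=T)
z2 = rng.normal(size=T)
x = z1 + z2 + u + rng.normal(size=T)
y = 1.0 + 0.5 * x + u + rng.normal(size=T)

# Steps 1-2: first-stage LS of x on [1, z1, z2]; keep residuals v_hat
Z = np.column_stack([np.ones(T), z1, z2])
v_hat = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Step 3: augmented LS regression of y on [1, x, v_hat]
Xa = np.column_stack([np.ones(T), x, v_hat])
b = np.linalg.lstsq(Xa, y, rcond=None)[0]

# Step 4: t-statistic for delta (the coefficient on v_hat)
e = y - Xa @ b
s2 = e @ e / (T - Xa.shape[1])
se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[2, 2])
t_delta = b[2] / se                          # large |t| -> reject H0
```

As a by-product of this control-function form, the coefficient on x in step 3 matches the IV estimate of β1.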
Orthogonality test (from PoE)
For the regression y = β0 + β1x + ε with instruments z1 and z2:
1. IV estimate y = β0 + β1x + ε
2. Calculate the residuals ε̂
3. LS estimate ε̂ = γ0 + γ1z1 + γ2z2 + v
4. Test the null of one additional orthogonality condition with TR² ∼ χ²(1)
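The TR² procedure above can be sketched on hypothetical simulated data in which both instruments are valid, so the statistic should behave like a χ²(1) draw and stay small (data-generating process is an illustrative assumption):

```python
import numpy as np

# TR^2 orthogonality test sketch: both z1 and z2 are valid instruments.
rng = np.random.default_rng(7)
T = 10_000
u = rng.normal(size=T)
z1 = rng.normal(size=T)
z2 = rng.normal(size=T)
x = z1 + z2 + u + rng.normal(size=T)
y = 1.0 + 0.5 * x + u + rng.normal(size=T)

X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z1, z2])

# Step 1: IV (2SLS) estimate; residuals use the actual x
A = X.T @ Z @ np.linalg.inv(Z.T @ Z)
b_iv = np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)
e = y - X @ b_iv

# Steps 2-3: LS regression of the residuals on the instruments; compute R^2
g = np.linalg.lstsq(Z, e, rcond=None)[0]
ssr = np.sum((e - Z @ g) ** 2)
sst = np.sum((e - e.mean()) ** 2)
TR2 = T * (1 - ssr / sst)                    # step 4: compare with chi^2(1)
```

A TR² above the χ²(1) critical value (3.84 at the 5% level) would signal that at least one instrument fails the orthogonality condition.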
Weak instruments test
In EViews it is also possible to test whether instruments are weak (low correlation cor(x, z)) using the Cragg-Donald statistic (low values point to weak instruments) together with the Stock-Yogo critical values.