Ridge Regularization for Mean Squared Error Reduction in Regression with Weak Instruments

04/18/2019 ∙ Karthik Rajkumar ∙ Stanford University

In this paper, I show that classic two-stage least squares (2SLS) estimates are highly unstable with weak instruments. I propose a ridge estimator (ridge IV) and show that it is asymptotically normal even with weak instruments, whereas 2SLS is severely distorted and unbounded. I motivate the ridge IV estimator as the solution to a convex optimization problem with a GMM objective function and an L2 penalty. I show theoretically that ridge IV leads to sizable mean squared error reductions and validate these results in a simulation study inspired by the data designs of papers published in the American Economic Review.




1 Introduction

Instrumental variables are widely used in applied economics and other social sciences to establish causal relationships in the absence of experimental variation. Under standard assumptions, instrumental variable (IV) regression estimators are consistent and asymptotically normal. However, when the first stage, i.e. the relationship between the instruments and the endogenous regressor, is weak, inference with IV regression is distorted, and a vast literature has emerged documenting this. In particular, test size is not controlled in finite samples, and several finite-sample corrections of the IV estimator have been proposed to tackle this. These methods attempt to remove small-sample bias in a way that washes out as sample sizes grow.

Many of these methods, however, do so in a static way, which sometimes leads to overcorrection or even having no effect at all. We propose a ridge estimator for IV regression that attempts to alleviate the bias in a way that is tunable to the data. We motivate the estimator by returning to the classical perspective of IV regression as a ratio of two estimands (in the case of a single, just-identified IV), and showing that our approach is equivalent to stabilizing the denominator away from 0, thus avoiding the division-by-zero problem.

The paper is organized as follows. Section 2 provides background on the weak instrument problem and the sensitivity of IV to outliers. Section 3 introduces the data model we use throughout the paper. Section 4 motivates and defines the ridge IV estimator and shows full preservation of efficiency under "large" parameters. Section 5 takes a local asymptotic approach to modeling weak instruments and shows theoretically how ridge IV leads to drastic reductions in mean squared error. Section 6 interprets the ridge IV estimator as the solution to a convex optimization problem with a GMM objective function and an L2 penalty on the coefficient. Section 7 uses a simulation study whose parameters are tuned to be consistent with the data generating processes implied by a sample of papers published in the American Economic Review, and shows how ridge IV can lower MSE in practice. Section 8 concludes.

2 Literature

There is a long literature in econometrics studying weak instruments. Young (2018) raises concerns about the quality of inference with instrumental variables. Specifically, the problems cited are weak or irrelevant instruments, non-iid error processes, and inference distorted by just one or two observations. A major claim in that paper is that IV estimates have larger MSE than OLS (OLS here referring to the structural equation of directly regressing the dependent variable on the endogenous variable, bypassing any instruments). In situations arising in practice, one is unable to tell the two estimates apart, in the sense that IV confidence intervals generally include the OLS estimate anyway, which implies a preference for OLS. This calls into question the utility of traditional econometric tests for endogeneity, such as the Hausman test (e.g. Hausman (1978)).

Andrews et al. (2018) provide a recent survey of weak instrument diagnostics and of inferential methods that are robust to weak instruments. They focus on the case of heteroskedastic and possibly non-iid errors, and make a case for adopting the robust F-statistic proposed by Olea and Pflueger (2013) in the case of a single endogenous regressor. Young (2018) counters this, claiming that such pre-tests do little to accurately diagnose weak instruments in practice.

2.1 The weak instrument problem

The primary problem with weak instruments is that they bias 2SLS towards OLS. This gives tests the wrong size and leads to misleading inference. By far the most popular case in the literature appears to be the just-identified case with a single instrument and heteroskedasticity (Andrews et al., 2018). To combat this, the most common approach used in the literature is some form of the following two-step procedure:

  1. The first-stage F-statistic exceeds a conventional threshold (often taken to be around 10). Here instruments are not considered weak and regular 2SLS inference is used.

  2. The first-stage F-statistic falls below the threshold. Various “weak instrument robust” methods are used.

What are some of these weak instrument robust methods? Hirano and Porter (2015) show that no unbiased or asymptotically unbiased IV estimator exists. Anderson-Rubin confidence sets (Anderson and Rubin, 1949) are optimal in the just-identified case with a single instrument, even with heteroskedasticity. In over-identified models, the conditional likelihood ratio (CLR) test is a good choice in homoskedastic settings, as it is fully robust to weak instruments. Andrews and Armstrong (2017) provide methods that are unbiased when the sign of the first-stage coefficient is known a priori.

Within the scope of this paper, we address the large-MSE critique of 2SLS. We provide a novel estimator, guiding theory, and simulation evidence for the utility of this estimator.

3 Model

We begin with a just-identified setting with one instrument. Our data takes the form

$$y_i = \beta x_i + u_i, \qquad x_i = \pi z_i + v_i, \qquad i = 1, \dots, n. \tag{1}$$

Each datapoint is iid and the instruments are exogenous, i.e. the $z_i$ are independent of $u_i$ and $v_i$. In this notation, $y_i$ is the outcome variable of interest, and $x_i$ is the endogenous variable whose effect on the outcome one is interested in.

The main contention we want to address in this paper is that 2SLS is sensitive because it is a ratio of two estimates, and its p-value does not account for the stochasticity of the denominator. To explain this simply, in the just-identified case, our 2SLS estimator is

$$\hat\beta_{2SLS} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i},$$

which basically is

$$\hat\beta_{2SLS} = \frac{\pi\beta + \xi_1}{\pi + \xi_2},$$

where $\xi_1$ and $\xi_2$ are sampling errors that vanish as $n \to \infty$. When the first stage $\pi$ is close to zero, essentially this is a division-by-zero problem.
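To see the instability concretely, here is a minimal simulation sketch (our own illustration, with arbitrary parameter values, not taken from the paper) of the 2SLS ratio under a weak first stage; the denominator sum hovers near zero and occasionally flips sign, producing extreme estimates.

    import numpy as np

    rng = np.random.default_rng(0)

    def tsls_draw(rng, n=150, beta=1.0, pi=0.02, rho=0.5):
        z = rng.normal(size=n)
        u = rng.normal(size=n)
        # correlate the first-stage error with u to make x endogenous
        v = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)
        x = pi * z + v
        y = beta * x + u
        return np.sum(z * y) / np.sum(z * x)  # 2SLS as a ratio of two sums

    draws = np.array([tsls_draw(rng) for _ in range(2000)])
    print(np.median(draws), draws.min(), draws.max())  # extreme outliers appear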

4 Ridge estimation

To address the weak instrument problem, we propose adding a bias to the denominator of the 2SLS estimator, in the case of a just-identified single instrument. This turns the estimator into

$$\hat\beta_{ridge} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i + \lambda},$$

where $\lambda > 0$ is a tuning parameter, appropriately chosen. This serves the dual objectives of stabilizing the denominator of the IV estimator and also shrinking the coefficient toward zero, thus controlling type-1 error. This naturally motivates a ridge estimator in the just-identified case with multiple instruments too:

$$\hat\beta_{ridge} = \left(Z'X + \lambda I\right)^{-1} Z'Y. \tag{2}$$

Similarly, for an over-identified setting, the ridge estimator modifies to

$$\hat\beta_{ridge} = \left(X'Z (Z'Z)^{-1} Z'X + \lambda I\right)^{-1} X'Z (Z'Z)^{-1} Z'Y. \tag{3}$$

Suppose we allow the penalty parameter of the ridge estimator to vary with sample size. That is, $\lambda$ is indexed by $n$ to give us a full sequence of penalty parameters $\{\lambda_n\}$. In our specific example from Equation (1), this would mean our estimator is

$$\hat\beta_{ridge} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i + \lambda_n}. \tag{4}$$
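As a reference implementation sketch (ours, not the paper's code), the just-identified single-instrument estimator of Equation (4) is one line; setting lam = 0 recovers plain 2SLS.

    import numpy as np

    def ridge_iv(z, x, y, lam):
        """Just-identified ridge IV of Equation (4); lam = 0 gives 2SLS."""
        return np.sum(z * y) / (np.sum(z * x) + lam)

    # usage (illustrative penalty scheme): lam = 0.5 * sqrt(n)
    # beta_hat = ridge_iv(z, x, y, lam=0.5 * np.sqrt(len(z)))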
Before we dive into the asymptotic properties of the ridge estimator, we make a note on the sampling process we assume for the data.

4.1 Sampling assumptions on the data

There are many assumptions that may be used for the data generating process. Suppose we condition on the instruments $z_i$, treating them as "constants" and only assuming the existence of probability limits of their moments. Then the classic OLS asymptotic results hold. What does that mean? Let us formalize this idea.


Assumption 1.

The instruments are “constants” with respect to the error shocks. This is achieved by conditioning on the instruments in hand. Further, under this sampling scheme, we adopt the normalization $\frac{1}{n}\sum_i z_i^2 = 1$.

Now consider the first stage equation:

$$x_i = \pi z_i + v_i.$$

We ignore constants for the time being.

Lemma 1 (CLT for OLS, Version 1).

Under Assumption 1, the CLT for OLS is

$$\sqrt{n}\,(\hat\pi - \pi) \xrightarrow{d} N\!\left(0, \ \sigma_v^2\right), \qquad \hat\pi = \frac{1}{n}\sum_i z_i x_i,$$

where $\sigma_v^2$ is the variance of the first-stage errors and $\hat\pi$ equals the OLS coefficient under our normalization of the instruments.

This is easily proven using the Liapounov CLT. Specifically, we have

$$\hat\pi - \pi = \frac{1}{n}\sum_i z_i v_i,$$

a sum of independent but not identically distributed terms. Thus, the Liapounov CLT tells us that

$$\frac{1}{\sqrt{n}}\sum_i z_i v_i \xrightarrow{d} N\!\left(0, \ \sigma_v^2 \cdot \frac{1}{n}\sum_i z_i^2\right).$$

Using the second moment condition assumed from the sampling, we get the desired result. ∎

However, this is not the result one gets when assuming $z_i$ is fully stochastic!

Assumption 2.

Suppose the instruments $z_i$ are fully stochastic and iid. Without loss of generality, we assume unit variance, $E[z_i^2] = 1$, and also bounded fourth moment, $E[z_i^4] < \infty$.


Lemma 2 (CLT for OLS, Version 2).

Under Assumption 2, the CLT for OLS (with $\hat\pi = \frac{1}{n}\sum_i z_i x_i$ as before) is

$$\sqrt{n}\,(\hat\pi - \pi) \xrightarrow{d} N\!\left(0, \ \sigma_v^2 + \pi^2\left(E[z_i^4] - 1\right)\right).$$

Since $z_i$ is also stochastic, the random variable $z_i x_i - \pi$ looks like

$$z_i x_i - \pi = \pi\,(z_i^2 - 1) + z_i v_i,$$

an iid, mean-zero sequence. Here we may apply the classic Lindeberg-Levy CLT to get

$$\frac{1}{\sqrt{n}}\sum_i \left(z_i x_i - \pi\right) \xrightarrow{d} N\!\left(0, \ \pi^2\,\mathrm{Var}(z_i^2) + \sigma_v^2\right).$$

Rearranging terms, with $\mathrm{Var}(z_i^2) = E[z_i^4] - 1$ under the unit variance normalization, gives us the desired result. ∎

Observe that the results under Assumptions 1 and 2 are different! Indeed, the variance under full stochasticity is larger than in the fixed-instrument case, because there is more randomness to account for.
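The gap between the two lemmas is easy to check numerically. The sketch below (our own, with illustrative parameters) holds one draw of the instruments fixed across replications for Version 1 and redraws them for Version 2; with standard normal instruments, $E[z^4] = 3$, so the two empirical variances should be near $\sigma_v^2$ and $\sigma_v^2 + 2\pi^2$ respectively.

    import numpy as np

    rng = np.random.default_rng(1)
    n, pi, sigma_v, reps = 2000, 1.0, 1.0, 10000

    z_fixed = rng.normal(size=n)
    z_fixed /= np.sqrt(np.mean(z_fixed**2))  # normalize: mean(z^2) = 1 exactly

    def pi_hat(z, rng):
        v = sigma_v * rng.normal(size=len(z))
        return np.mean(z * (pi * z + v))     # (1/n) * sum(z_i * x_i)

    fixed = [np.sqrt(n) * (pi_hat(z_fixed, rng) - pi) for _ in range(reps)]
    stoch = [np.sqrt(n) * (pi_hat(rng.normal(size=n), rng) - pi) for _ in range(reps)]
    print(np.var(fixed), np.var(stoch))      # approx 1.0 vs approx 3.0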

Theorem 4.1 (Consistency of the ridge estimator).

Let $\hat\beta_{ridge}$ be the ridge estimator as defined in Equation (4). Suppose the instrument is not totally irrelevant, i.e. $\pi \neq 0$. Then under either Assumption 1 or Assumption 2, and for a sequence of penalty parameters $\lambda_n = o(n)$, the estimator is consistent. That is,

$$\hat\beta_{ridge} \xrightarrow{p} \beta.$$

We can rewrite the structural equation in the data generating process as

$$y_i = \pi\beta z_i + (u_i + \beta v_i) \equiv \theta z_i + w_i.$$

This is just the reduced form equation. Without loss of generality, we have assumed $\frac{1}{n}\sum_i z_i^2 = 1$ (or $E[z_i^2] = 1$, if using Assumption 2). Then it is clear that

$$\frac{1}{n}\sum_i z_i y_i \xrightarrow{p} \pi\beta.$$

Similarly, the first stage gives us

$$\frac{1}{n}\sum_i z_i x_i \xrightarrow{p} \pi.$$

Since $\lambda_n/n \to 0$, and we have $\pi \neq 0$, convergence in probability allows us to take the ratio of these two estimators, and we get

$$\hat\beta_{ridge} = \frac{\frac{1}{n}\sum_i z_i y_i}{\frac{1}{n}\sum_i z_i x_i + \frac{\lambda_n}{n}} \xrightarrow{p} \frac{\pi\beta}{\pi} = \beta. \qquad \blacksquare$$

The natural next question is, what is the asymptotic distribution of the proposed estimator? We tackle this in the next theorem.

Theorem 4.2 (Asymptotic normality of the ridge estimator, Version 1).

Suppose $\lambda_n = o(\sqrt{n})$. Then the ridge estimator, as defined in Equation (4), is asymptotically normal. That is,

$$\sqrt{n}\,(\hat\beta_{ridge} - \beta) \xrightarrow{d} N(0, V).$$

Further, under Assumption 1, $V = \sigma_u^2/\pi^2$.


First, we examine the reduced form regression. From the central limit theorem for OLS in Lemma 1, we have

$$\sqrt{n}\,(\hat\theta - \pi\beta) \xrightarrow{d} N(0, \sigma_w^2), \qquad \hat\theta = \frac{1}{n}\sum_i z_i y_i,$$

where $\sigma_w^2$ is the homoskedastic error variance of the reduced form regression. That is, it is the variance of the residual term $w_i = u_i + \beta v_i$. Similarly, analyzing the first-stage regression, we have

$$\sqrt{n}\,(\hat\pi - \pi) \xrightarrow{d} N(0, \sigma_v^2)$$

for $\sigma_v^2$, the variance of $v_i$, the errors in the first stage.

Next, we have $n\,\mathrm{Cov}(\hat\theta, \hat\pi) \to \sigma_{wv}$, where $\sigma_{wv}$ is the covariance between $w_i$ and $v_i$. Putting these results together, we get the multivariate result

$$\sqrt{n}\left(\begin{pmatrix}\hat\theta \\ \hat\pi\end{pmatrix} - \begin{pmatrix}\pi\beta \\ \pi\end{pmatrix}\right) \xrightarrow{d} N\!\left(0, \ \begin{pmatrix}\sigma_w^2 & \sigma_{wv} \\ \sigma_{wv} & \sigma_v^2\end{pmatrix}\right).$$

Call this covariance matrix $\Sigma$. Because $\lambda_n/\sqrt{n} \to 0$, we can use Slutsky's theorem to get the same asymptotic distribution for the penalized denominator:

$$\sqrt{n}\left(\begin{pmatrix}\hat\theta \\ \hat\pi + \lambda_n/n\end{pmatrix} - \begin{pmatrix}\pi\beta \\ \pi\end{pmatrix}\right) \xrightarrow{d} N(0, \Sigma).$$

This is the asymptotic distribution of a bivariate estimator. We note that $\hat\beta_{ridge}$ is simply the ratio of the first and the second elements of this bivariate estimator.

Then to obtain the distribution of the ridge estimator, we can use the multivariate delta method, which gives the limiting distribution of a smooth function of an asymptotically normal vector. Consider the bivariate function $g(a, b) = a/b$. Its gradient is given by $\nabla g = (1/b, \ -a/b^2)'$. So the asymptotic variance of the ridge estimator is

$$V = \nabla g' \, \Sigma \, \nabla g = \frac{\sigma_w^2}{\pi^2} - \frac{2\theta\sigma_{wv}}{\pi^3} + \frac{\theta^2\sigma_v^2}{\pi^4}.$$

Here $\theta = \pi\beta$. So the above is

$$V = \frac{\sigma_w^2 - 2\beta\sigma_{wv} + \beta^2\sigma_v^2}{\pi^2},$$

which, using $\sigma_w^2 = \sigma_u^2 + 2\beta\sigma_{uv} + \beta^2\sigma_v^2$ and $\sigma_{wv} = \sigma_{uv} + \beta\sigma_v^2$, simplifies to

$$V = \frac{\sigma_u^2}{\pi^2}.$$

This is the required $V$, and we have our asymptotic distribution. ∎
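A quick Monte Carlo check of Theorem 4.2 (our own sketch, with illustrative parameters and a sub-$\sqrt{n}$ penalty such as $\lambda_n = n^{1/4}$):

    import numpy as np

    rng = np.random.default_rng(2)
    n, beta, pi, rho, reps = 2000, 1.0, 0.8, 0.5, 5000
    lam = n ** 0.25                    # grows slower than sqrt(n)

    draws = []
    for _ in range(reps):
        z = rng.normal(size=n)
        u = rng.normal(size=n)
        v = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)
        x = pi * z + v
        y = beta * x + u
        b = np.sum(z * y) / (np.sum(z * x) + lam)
        draws.append(np.sqrt(n) * (b - beta))

    # empirical variance vs sigma_u^2 / pi^2 with sigma_u = 1;
    # a small residual bias remains since lam/sqrt(n) is not yet 0 at this n
    print(np.var(draws), 1.0 / pi**2)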

Theorem 4.3 (Asymptotic normality of the ridge estimator, Version 2).

Again suppose $\lambda_n = o(\sqrt{n})$. Under Assumption 2, the ridge estimator is asymptotically normal as well, and with the same asymptotic variance, $V = \sigma_u^2/\pi^2$.


From a multivariate central limit theorem as in Lemma 2, we again have asymptotic normality:

$$\sqrt{n}\left(\begin{pmatrix}\hat\theta \\ \hat\pi\end{pmatrix} - \begin{pmatrix}\pi\beta \\ \pi\end{pmatrix}\right) \xrightarrow{d} N(0, \tilde\Sigma).$$

Note that the covariance matrix $\tilde\Sigma$ in this theorem is different from the one using Assumption 1. Let us obtain it.

We have $z_i y_i - \theta = \theta(z_i^2 - 1) + z_i w_i$ and $z_i x_i - \pi = \pi(z_i^2 - 1) + z_i v_i$. We ignore intercepts without loss of generality. Then

$$\mathrm{Var}(z_i y_i) = \theta^2\left(E[z_i^4] - 1\right) + \sigma_w^2, \qquad \mathrm{Var}(z_i x_i) = \pi^2\left(E[z_i^4] - 1\right) + \sigma_v^2$$

from Lemma 2. Finally,

$$\mathrm{Cov}(z_i y_i, z_i x_i) = \theta\pi\left(E[z_i^4] - 1\right) + \sigma_{wv}.$$

From these it is clear that

$$\tilde\Sigma = \Sigma + \left(E[z_i^4] - 1\right)\begin{pmatrix}\theta^2 & \theta\pi \\ \theta\pi & \pi^2\end{pmatrix}.$$

Using the multivariate delta method again as in Theorem 4.2 on this new $\tilde\Sigma$, the extra rank-one term contributes nothing, since $\nabla g'(\theta, \pi)' = \theta/\pi - \theta\pi/\pi^2 = 0$. So we get the variance of the ridge estimator to be

$$V = \frac{\sigma_u^2}{\pi^2},$$

which is remarkably the same result. ∎

We see that ridge IV recovers full efficiency in the "large" coefficient case. From another perspective, this result is uninteresting because ridge IV "does nothing": the rate $\lambda_n = o(\sqrt{n})$ is too slow, so it yields the exact same distribution as the original 2SLS estimator without any penalization. To obtain a novel asymptotic distribution with ridge IV, we need a faster rate. We show that now.

To see what ridge IV is able to do, we set $\lambda_n = \lambda\sqrt{n}$. To be clear, in the previous theorems we used a sub-$\sqrt{n}$ rate, whereas now we use exactly a $\sqrt{n}$ rate for $\lambda_n$. Knight and Fu (2000) show that $\lambda_n = o(n)$ is necessary for consistency of classic ridge regression. We show the same for ridge IV.

Theorem 4.4.

Let $\lambda_n = \lambda\sqrt{n}$, that is $\lambda_n/\sqrt{n} \to \lambda$, for some $\lambda > 0$. Then

$$\sqrt{n}\,(\hat\beta_{ridge} - \beta) \xrightarrow{d} N\!\left(-\frac{\lambda\beta}{\pi}, \ \frac{\sigma_u^2}{\pi^2}\right).$$

Let us look at the case with Assumption 1 first. The proof of Theorem 4.2 gives

$$\sqrt{n}\left(\begin{pmatrix}\hat\theta \\ \hat\pi + \lambda_n/n\end{pmatrix} - \begin{pmatrix}\pi\beta \\ \pi\end{pmatrix}\right) \xrightarrow{d} N\!\left(\begin{pmatrix}0 \\ \lambda\end{pmatrix}, \ \Sigma\right).$$

The only thing that changes is the bias term $\lambda$ in the second coordinate, because of the slower rate of $\lambda_n$: now $\lambda_n/\sqrt{n} \to \lambda$ rather than $0$. Now take the ratio of the two elements of the vector and perform the delta method as in Theorem 4.2. This gives us

$$\sqrt{n}\,(\hat\beta_{ridge} - \beta) \xrightarrow{d} N\!\left(\nabla g'\begin{pmatrix}0 \\ \lambda\end{pmatrix}, \ \nabla g'\,\Sigma\,\nabla g\right) = N\!\left(-\frac{\lambda\beta}{\pi}, \ \frac{\sigma_u^2}{\pi^2}\right),$$

using the same notation as in Theorem 4.2. This gives us our result. ∎

Under this new regime, we see that we recover the same asymptotic variance as the 2SLS estimator, but centered at the wrong mean! That is, the asymptotic bias is $-\lambda\beta/\pi$, rather than 0. What then do we gain from the ridge approach? The next section addresses this point.
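The asymptotic bias of Theorem 4.4 can likewise be checked by simulation (our own sketch; the parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n, beta, pi, lam, reps = 10000, 1.0, 0.8, 0.5, 2000

    vals = []
    for _ in range(reps):
        z = rng.normal(size=n)
        u = rng.normal(size=n)
        v = 0.5 * u + np.sqrt(0.75) * rng.normal(size=n)
        x = pi * z + v
        y = beta * x + u
        b = np.sum(z * y) / (np.sum(z * x) + lam * np.sqrt(n))
        vals.append(np.sqrt(n) * (b - beta))

    print(np.mean(vals), -lam * beta / pi)   # both approx -0.625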

5 The Staiger-Stock critique

Recall that our main motivation for ridge IV is the weak instrument problem. To deal with it more explicitly, we adopt the local asymptotic framework of Staiger and Stock (1997). That is, we model the first-stage coefficient as

$$\pi = \pi_n = \frac{C}{\sqrt{n}}$$

for a fixed constant $C$. This captures a first stage whose strength is small even relative to the sample size, so the problem does not go away with bigger samples. Our next theorem shows the behavior of conventional 2SLS under this notion of weak instruments.

Theorem 5.1.

Suppose our first stage is weak in the Staiger-Stock sense. That is, $\pi_n = C/\sqrt{n}$. Then the 2SLS estimator is unstable and diverges. Specifically, $\hat\beta_{2SLS} - \beta$ converges to a Cauchy-type distribution (a ratio of correlated normals), and so

$$\sqrt{n}\,(\hat\beta_{2SLS} - \beta) \ \text{diverges}.$$



We operate under Assumption 1 here. We have

$$\frac{1}{\sqrt{n}}\sum_i z_i x_i = \frac{1}{\sqrt{n}}\sum_i z_i\left(\pi_n z_i + v_i\right).$$

Consider the sum $\sum_i z_i x_i$. Since $\pi_n$ varies with sample size, $\pi_n = C/\sqrt{n}$, we employ a triangular array argument here. The summand $z_i x_i$ has mean $\pi_n z_i^2$ and variance $z_i^2\sigma_v^2$.

Let $S_n = \sum_i \left(z_i x_i - \pi_n z_i^2\right)$ and $s_n^2 = \sigma_v^2\sum_i z_i^2$. The Lindeberg-Feller Central Limit Theorem states

$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1).$$

Applying this to our setting, we get

$$\frac{1}{\sqrt{n}}\sum_i \left(z_i x_i - \pi_n z_i^2\right) \xrightarrow{d} N\!\left(0, \ \sigma_v^2 \cdot \frac{1}{n}\sum_i z_i^2\right).$$

We have $\frac{1}{n}\sum_i z_i^2 = 1$. So this can be rewritten as

$$\frac{1}{\sqrt{n}}\sum_i \left(z_i x_i - \pi_n z_i^2\right) \xrightarrow{d} N(0, \sigma_v^2).$$

Further, $\frac{1}{\sqrt{n}}\sum_i \pi_n z_i^2 = C\cdot\frac{1}{n}\sum_i z_i^2 = C$. So this further simplifies to

$$\frac{1}{\sqrt{n}}\sum_i z_i x_i \xrightarrow{d} N(C, \sigma_v^2).$$

Applying the Lindeberg-Feller CLT similarly to the reduced form equation, we get

$$\frac{1}{\sqrt{n}}\sum_i \left(z_i y_i - \pi_n\beta z_i^2\right) \xrightarrow{d} N(0, \sigma_w^2).$$

And by similar logic, since $\frac{1}{\sqrt{n}}\sum_i \pi_n\beta z_i^2 = C\beta$, this can be rewritten as

$$\frac{1}{\sqrt{n}}\sum_i z_i y_i \xrightarrow{d} N(C\beta, \sigma_w^2).$$

Following the derivation of the covariance term in Theorem 4.2, we can then derive a joint normality result as follows:

$$\frac{1}{\sqrt{n}}\begin{pmatrix}\sum_i z_i y_i \\ \sum_i z_i x_i\end{pmatrix} \xrightarrow{d} N\!\left(\begin{pmatrix}C\beta \\ C\end{pmatrix}, \ \begin{pmatrix}\sigma_w^2 & \sigma_{wv} \\ \sigma_{wv} & \sigma_v^2\end{pmatrix}\right).$$

We called this covariance matrix $\Sigma$ in Theorem 4.2. With this notation, we can rewrite the joint normality as

$$\frac{1}{\sqrt{n}}\begin{pmatrix}\sum_i z_i y_i \\ \sum_i z_i x_i\end{pmatrix} \xrightarrow{d} N\!\left(C\begin{pmatrix}\beta \\ 1\end{pmatrix}, \ \Sigma\right).$$

Our 2SLS estimator is

$$\hat\beta_{2SLS} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i} = \frac{\frac{1}{\sqrt{n}}\sum_i z_i y_i}{\frac{1}{\sqrt{n}}\sum_i z_i x_i},$$

so we can get its distribution by taking the ratio in the joint normal limit above, which results in a Cauchy-type distribution. Given that the ratio itself has a non-degenerate limiting distribution, $\sqrt{n}$ times its difference from its mean is unstable and diverges. ∎
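The divergence is visible in simulation. In the sketch below (our own illustration, with arbitrary parameters), the first stage is set to $\pi_n = C/\sqrt{n}$; the spread of the 2SLS error stays roughly constant as $n$ grows by a factor of 100, instead of shrinking:

    import numpy as np

    rng = np.random.default_rng(4)
    beta, C, reps = 1.0, 1.0, 2000

    for n in (100, 10000):
        pi_n = C / np.sqrt(n)
        errs = []
        for _ in range(reps):
            z = rng.normal(size=n)
            u = rng.normal(size=n)
            v = 0.5 * u + np.sqrt(0.75) * rng.normal(size=n)
            x = pi_n * z + v
            y = beta * x + u
            errs.append(np.sum(z * y) / np.sum(z * x) - beta)
        errs = np.array(errs)
        # interquartile range stays the same order as n grows: no convergence
        print(n, np.percentile(errs, 75) - np.percentile(errs, 25))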

How does ridge IV help with this? We demonstrate that in the next theorem, which is the most important result of the paper.

Theorem 5.2.

Let $\lambda_n/\sqrt{n} \to \infty$. That is, $\lambda_n = \lambda n^{\delta}$ for some $\lambda > 0$ and $\delta > 1/2$. Then, under Staiger-Stock asymptotics, we have

$$\frac{\lambda_n}{\sqrt{n}}\,\hat\beta_{ridge} \xrightarrow{d} N(C\beta, \sigma_w^2), \qquad \text{and in particular} \quad \hat\beta_{ridge} \xrightarrow{p} 0.$$

The ridge IV estimator is

$$\hat\beta_{ridge} = \frac{\frac{1}{\sqrt{n}}\sum_i z_i y_i}{\frac{1}{\sqrt{n}}\sum_i z_i x_i + \frac{\lambda_n}{\sqrt{n}}}.$$

From Theorem 5.1, we know that $\frac{1}{\sqrt{n}}\sum_i z_i x_i \xrightarrow{d} N(C, \sigma_v^2)$. This implies

$$\frac{1}{\sqrt{n}}\sum_i z_i x_i = O_p(1),$$

and from this it follows that

$$\frac{\sqrt{n}}{\lambda_n}\left(\frac{1}{\sqrt{n}}\sum_i z_i x_i + \frac{\lambda_n}{\sqrt{n}}\right) \xrightarrow{p} 1.$$

Also from Theorem 5.1, we know

$$\frac{1}{\sqrt{n}}\sum_i z_i y_i \xrightarrow{d} N(C\beta, \sigma_w^2).$$

Then taking the ratio of the above two results using Slutsky's theorem, we get

$$\frac{\lambda_n}{\sqrt{n}}\,\hat\beta_{ridge} \xrightarrow{d} N(C\beta, \sigma_w^2).$$

This is the required result. ∎

In the Staiger-Stock regime, ridge IV with an aggressive enough penalization scheme ($\lambda_n/\sqrt{n} \to \infty$) massively lowers the mean squared error of the estimate: the estimator shrinks toward zero instead of following the heavy-tailed 2SLS ratio. In regimes where instruments are not weak, we can still use ridge IV with a more moderate penalization scheme ($\lambda_n = o(\sqrt{n})$) and lose nothing, although in this case the choice of ridge IV over 2SLS is superfluous.

6 Understanding ridge IV

6.1 Interpretation of $\lambda$ as the Lagrange multiplier in a constrained optimization problem

In the simplest case, as in our model in (1), ignoring intercepts, the standard 2SLS (GMM) objective function is given by

$$\hat\beta_{2SLS} = \arg\min_b \ \left(\sum_i z_i (y_i - b x_i)\right)^2.$$

Then the ridge IV solution is given by the following penalized objective function:

$$\hat\beta_{ridge} = \arg\min_b \ \left(\sum_i z_i (y_i - b x_i)\right)^2 + \mu b^2. \tag{5}$$

Clearly, setting the penalization $\mu$ to zero recovers the original 2SLS estimator. What is the relation between $\lambda$ and the Lagrange multiplier $\mu$ we see above? We address this in the next proposition.

Proposition 1 (Objective function of ridge IV).

The objective function of ridge IV is as given in Equation (5). Further, there is a one-to-one relation between $\mu$, the Lagrange multiplier in the objective function, and $\lambda$, the level of penalization in the ridge estimator, given by

$$\mu = \lambda \sum_i z_i x_i.$$


Let the ridge IV objective function be

$$f(b) = \left(\sum_i z_i (y_i - b x_i)\right)^2 + \mu b^2.$$

This is a convex function, so to minimize it, set its derivative with respect to $b$ to zero. This gives us

$$-2\left(\sum_i z_i x_i\right)\left(\sum_i z_i y_i - b\sum_i z_i x_i\right) + 2\mu b = 0.$$

That is,

$$b\left(\Big(\sum_i z_i x_i\Big)^2 + \mu\right) = \Big(\sum_i z_i x_i\Big)\Big(\sum_i z_i y_i\Big).$$

This gives us

$$b = \frac{\left(\sum_i z_i x_i\right)\left(\sum_i z_i y_i\right)}{\left(\sum_i z_i x_i\right)^2 + \mu}.$$

Comparing this form with the definition of the ridge IV estimator in Equation (4), we know

$$\frac{\left(\sum_i z_i x_i\right)\left(\sum_i z_i y_i\right)}{\left(\sum_i z_i x_i\right)^2 + \mu} = \frac{\sum_i z_i y_i}{\sum_i z_i x_i + \lambda}.$$

Rearranging the equation gives us the desired result, $\mu = \lambda\sum_i z_i x_i$. ∎
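Proposition 1 is easy to verify numerically: minimizing the penalized objective in Equation (5) with $\mu = \lambda\sum_i z_i x_i$ reproduces the closed-form ridge IV estimate. A sketch (our own, using scipy's scalar minimizer on simulated data with illustrative parameters):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(5)
    n, beta, pi, lam = 500, 1.0, 0.5, 2.0
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    v = 0.5 * u + np.sqrt(0.75) * rng.normal(size=n)
    x = pi * z + v
    y = beta * x + u

    szx, szy = np.sum(z * x), np.sum(z * y)
    mu = lam * szx                                 # Proposition 1's mapping
    obj = lambda b: (szy - b * szx) ** 2 + mu * b**2
    b_opt = minimize_scalar(obj).x
    print(b_opt, szy / (szx + lam))                # the two should agree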

7 Results

We look at a simulation design where we are interested in seeing how the MSE metric varies with varying levels of aggressiveness in the ridge penalty, $\lambda$. The linear IV model for this simulation is the just-identified single-instrument model of Equation (1), with intercepts included in both equations. The coefficients and sample sizes are chosen to match a study from the AER (Hornung, 2014), and the errors are independent normals.

In the first set of results, we allow the first stage coefficient, $\pi$, to vary. Our simulation study is as follows (a code sketch appears after Figure 1):

  1. Pick a first stage coefficient $\pi$ from 0 to 1.

  2. Simulate 10,000 datasets of N=150 each.

  3. Compute the MSE of the estimated coefficient $\hat\beta$ for several levels of $\lambda$ (including an intercept in the regressions).

Figure 1: MSE of the IV estimator by the level of ridge penalization, $\lambda$: (a) regular 2SLS ($\lambda = 0$), (b) moderate regularization, (c) aggressive regularization. Panel (a) shows the regular 2SLS case, which we compare against; it is extremely unstable when the instrument is weak, i.e. when we are close to 0 on the x-axis. With a moderate level of regularization, as in (b), MSE is reduced by three orders of magnitude. Panel (c) sounds the alarm against too much regularization: bias then becomes a dominant force to vie with, and MSE starts to pick up again.
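The following sketch outlines the first simulation design in code (with our stand-in parameters; the paper's coefficients are calibrated to Hornung (2014), which we do not reproduce here, and intercepts are omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(6)
    N, reps, beta = 150, 10_000, 1.0

    def mse(pi, lam):
        errs = np.empty(reps)
        for r in range(reps):
            z = rng.normal(size=N)
            u = rng.normal(size=N)
            v = 0.5 * u + np.sqrt(0.75) * rng.normal(size=N)
            x = pi * z + v
            y = beta * x + u
            errs[r] = np.sum(z * y) / (np.sum(z * x) + lam) - beta
        return np.mean(errs**2)

    for pi in (0.1, 0.5, 1.0):
        # lam = 0 is plain 2SLS; the positive values are illustrative penalties
        print(pi, [mse(pi, lam) for lam in (0.0, 5.0, 50.0)])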

In the second set of results, we allow the effect size, $\beta$, to vary. This is to show that, for a given first stage strength, regularization may still be useful for very small coefficients. This simulation study is as follows:

  1. Pick an effect size, i.e. $\beta$, from 0 to 3.475.

  2. Simulate 10,000 datasets of N=150 each.

  3. Compute the MSE of the estimated coefficient (including an intercept).

Figure 2: MSE of the IV estimator by the level of ridge penalization, $\lambda$: (a) regular 2SLS ($\lambda = 0$), (b) moderate regularization, (c) aggressive regularization. Panel (a) shows the regular 2SLS case, which we compare against. With a moderate level of regularization, as in (b), MSE may be halved. However, panel (c) sounds the alarm against too much regularization: bias then becomes a dominant force to vie with, particularly when effect sizes are large.

We show three cases here. The case of zero regularization is the classic 2SLS estimator, which has a mean squared error that remains high even when the effect size being studied is small.

8 Conclusion

In this paper, we introduced a novel estimator, called "ridge IV." We motivated it as the solution to the GMM objective function for instrumental variable regression with an additional L2 penalty. In the theoretical case with "large" coefficients, we showed that ridge IV does not hurt estimation, while in the weak first stage case, we showed that it leads to strong improvements in the mean squared error of the estimator. We then validated the theory using simulations inspired by data designs of papers in the American Economic Review.

While this paper is primarily a theoretical contribution to the literature, we outline several avenues for further research. First, it would be helpful to provide a method for tuning the parameter $\lambda$ for a given dataset. Our results operated at the abstract level of big-$O$ rates, but for practical use, more guidance is needed. We would also like to see results on how exactly to perform inference with ridge IV. Explicit demonstration of type-1 error control, for instance, would be very useful.

We would also like to tie the results of ridge IV back to the problem of IV sensitivity to outliers, which is related to instability. We conjecture that ridge IV under an appropriate penalization scheme can address this as well. Finally, we would like to expand the results provided in this paper to the general over-identified case; a natural extension of our estimator is given in Equation (3).


  • T. W. Anderson and H. Rubin (1949) Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics 20 (1), pp. 46–63. Cited by: §2.1.
  • I. Andrews and T. B. Armstrong (2017) Unbiased instrumental variables estimation under known first-stage sign. Quantitative Economics 8 (2), pp. 479–503. Cited by: §2.1.
  • I. Andrews, J. Stock, and L. Sun (2018) Weak instruments in IV regression: theory and practice. Technical report, Mimeo, Harvard University. Cited by: §2.1, §2.
  • J. A. Hausman (1978) Specification tests in econometrics. Econometrica: Journal of the Econometric Society, pp. 1251–1271. Cited by: §2.
  • K. Hirano and J. R. Porter (2015) Location properties of point estimators in linear instrumental variables and related models. Econometric Reviews 34 (6-10), pp. 720–733. Cited by: §2.1.
  • E. Hornung (2014) Immigration and the diffusion of technology: the huguenot diaspora in prussia. American Economic Review 104 (1), pp. 84–122. Cited by: §7.
  • K. Knight and W. Fu (2000) Asymptotics for lasso-type estimators. The Annals of Statistics 28 (5), pp. 1356–1378. Cited by: §4.1.
  • J. L. M. Olea and C. Pflueger (2013) A robust test for weak instruments. Journal of Business & Economic Statistics 31 (3), pp. 358–369. Cited by: §2.
  • D. Staiger and J. H. Stock (1997) Instrumental variables regression with weak instruments. Econometrica: Journal of the Econometric Society, pp. 557–586. Cited by: §5.
  • A. Young (2018) Consistency without inference: instrumental variables in practical application. Unpublished manuscript, London School of Economics and Political Science. Retrieved from: http://personal.lse.ac.uk/YoungA. Cited by: §2, §2.