1 Introduction
Instrumental variables are widely used in applied economics and other social sciences to establish causal relationships in the absence of experimental variation. Under standard assumptions, instrumental variable (IV) regression estimators are consistent and asymptotically normal. However, when the first stage, i.e. the relationship between the instruments and the endogenous variable, is weak, inference with IV regression is distorted, as a vast literature has shown. In particular, test size is not controlled in finite samples, and several finite-sample corrections of the IV estimator have been proposed to address this. These methods attempt to remove bias in small samples in a way that washes out as the sample size grows.
Many of these methods, however, do so in a static way, which sometimes leads to overcorrection or to no effect at all. We propose a ridge estimator for IV regression that alleviates the bias in a way that is tunable to the data. We motivate the estimator by returning to the classical perspective of IV regression as a ratio of two estimands (in the case of a single, just-identified instrument), and by showing that our approach is equivalent to stabilizing the denominator away from 0, thus avoiding the division-by-zero problem.
The paper is organized as follows. Section 2 provides background on the weak instrument problem and on the sensitivity of IV to outliers. Section 3 introduces the data model we use throughout the paper. Section 4 motivates and defines the ridge IV estimator and shows full preservation of efficiency under "large" parameters. Section 5 takes a local asymptotic approach to modeling weak instruments and shows theoretically how ridge IV leads to drastic reductions in mean squared error. Section 6 interprets the ridge IV estimator as the solution to a convex optimization problem with a GMM objective function and an ℓ₂ penalty on the coefficient. Section 7 presents a simulation study whose parameters are tuned to be consistent with the data generating processes implied by a sample of papers published in the American Economic Review, and shows how ridge IV can lower MSE in practice. Section 8 concludes.
2 Literature
There is a long literature in econometrics studying weak instruments. Young (2018) raises concerns about the quality of inference with instrumental variables. Specifically, the problems cited are weak or irrelevant instruments, non-iid error processes, and distortion of inference by one or two observations. A major claim in that paper is that IV estimates have larger MSE than OLS (OLS here referring to directly regressing the dependent variable on the endogenous variable in the structural equation, bypassing any instruments). In situations arising in practice, one is often unable to tell the two estimates apart, in the sense that IV confidence intervals generally include the OLS estimate anyway, which implies a preference for OLS. This calls into question the utility of traditional econometric tests for endogeneity, such as the Hausman test (e.g. Hausman (1978)).
Andrews et al. (2018) provide a recent survey of weak instrument diagnostics and of inferential methods that are robust to weak instruments. They focus on the case of heteroskedastic and possibly non-iid errors, and make a case for the adoption of the robust F-statistic proposed by Olea and Pflueger (2013) in the case of a single endogenous regressor. Young (2018) counters this, claiming that such pretests for detecting weak instruments do little to accurately diagnose the problem in practice.
2.1 The weak instrument problem
The primary problem with weak instruments is that the resulting estimates are biased towards OLS. This gives tests the wrong size and leads to misleading inference. By far the most popular case in the literature appears to be the just-identified case with a single instrument and heteroskedasticity (Andrews et al., 2018). To combat this, the most common approach used in the literature is some form of the following two-stage process:

1. If a first-stage pretest indicates a strong instrument (e.g. a first-stage F-statistic above a conventional threshold): instruments are treated as not weak, and regular 2SLS inference is used.

2. Otherwise: various "weak instrument robust" methods are used.
What are some of these weak instrument robust methods? Hirano and Porter (2015) show there does not exist an unbiased or asymptotically unbiased IV estimator. Anderson–Rubin confidence sets (Anderson and Rubin, 1949) are optimal in the just-identified case with a single instrument, even with heteroskedasticity. In overidentified models, the conditional likelihood ratio (CLR) test is a good test for homoskedastic settings, as it is fully robust to weak instruments. Andrews and Armstrong (2017) provide methods that are unbiased when the sign of the first-stage coefficient is known a priori.
For the scope of this paper, we address the large MSE critique of 2SLS. We provide a novel estimator, guiding theory, as well as simulation evidence for the utility of this estimator.
3 Model
We begin with a just-identified setting with one instrument. Our data takes the form

(1)  y_i = β x_i + ε_i,  x_i = π z_i + ν_i,  i = 1, …, n.

Each datapoint is iid and the instruments are exogenous, i.e. the z_i are independent of ε_i and ν_i. In this notation, y_i is the outcome variable of interest, and x_i is the endogenous variable whose effect on the outcome one is interested in.
The main contention we want to address in this paper is that 2SLS is sensitive because it is a ratio of two estimates, and its p-value does not account for the stochasticity of the denominator. To see this simply, in the just-identified case our 2SLS estimator is

β̂_2SLS = (Σ_i z_i y_i) / (Σ_i z_i x_i),

which basically is

β̂_2SLS = ρ̂ / π̂,

where ρ̂ = (1/n) Σ_i z_i y_i and π̂ = (1/n) Σ_i z_i x_i. When the first stage is weak, π̂ is close to zero; essentially, this is a division-by-zero problem.
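To illustrate, here is a minimal Monte Carlo sketch (all parameter values hypothetical) of how the 2SLS ratio destabilizes as the first-stage moment in the denominator approaches zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def tsls_ratio(z, x, y):
    """Just-identified 2SLS: a ratio of two sample moments."""
    return np.sum(z * y) / np.sum(z * x)

n, beta = 200, 1.0
z = rng.normal(size=n)
nu = rng.normal(size=n)
eps = 0.8 * nu + rng.normal(size=n)   # endogeneity: eps is correlated with nu

for pi in (1.0, 0.01):                # strong vs. weak first stage
    x = pi * z + nu
    y = beta * x + eps
    print(f"pi = {pi}: estimate = {tsls_ratio(z, x, y):.3f}")
```

With a strong first stage the ratio concentrates around β; with π near zero, the denominator Σ_i z_i x_i can itself be near zero and the estimate becomes erratic.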
4 Ridge estimation
To address the weak instrument problem, we propose adding a bias to the denominator of the 2SLS estimator, in the case of a just-identified single instrument. This turns the estimator into

β̂_λ = (Σ_i z_i y_i) / (Σ_i z_i x_i + λ),

where λ ≥ 0 is a tuning parameter, appropriately chosen. This serves the dual objectives of stabilizing the denominator of the IV estimator and shrinking the coefficient toward zero, thus controlling type-1 error. This naturally motivates a ridge estimator in the just-identified case with multiple instruments too:

(2)  β̂_λ = (Z′X + λI)⁻¹ Z′y.
Similarly, for an overidentified setting, the ridge estimator modifies to

(3)  β̂_λ = (X′P_Z X + λI)⁻¹ X′P_Z y,  where P_Z = Z(Z′Z)⁻¹Z′.
Suppose we allow the penalty parameter of the ridge estimator to vary with sample size. That is, λ is indexed by n, giving us a full sequence of penalty parameters (λ_n). In our specific example from equation (1), this means our estimator is

(4)  β̂_{λ_n} = (Σ_i z_i y_i) / (Σ_i z_i x_i + λ_n).
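A minimal sketch of the estimator in equation (4), with a hypothetical data generating process and a hypothetical √n penalty sequence:

```python
import numpy as np

def ridge_iv(z, x, y, lam):
    """Ridge IV for a single instrument: lam >= 0 stabilizes the denominator.
    lam = 0 recovers plain just-identified 2SLS."""
    return np.sum(z * y) / (np.sum(z * x) + lam)

rng = np.random.default_rng(1)
n, beta, pi = 150, 1.0, 0.05          # weak first stage (hypothetical values)
z = rng.normal(size=n)
nu = rng.normal(size=n)
x = pi * z + nu
y = beta * x + 0.8 * nu + rng.normal(size=n)

lam_n = np.sqrt(n)                    # lambda_n = lambda_0 * sqrt(n), lambda_0 = 1
print(ridge_iv(z, x, y, 0.0), ridge_iv(z, x, y, lam_n))
```

Larger values of lam shrink the estimate toward zero; as lam grows without bound, the estimate goes to 0.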
Before we dive into the asymptotic properties of the ridge estimator, we make a note on the sampling process we assume for the data.
4.1 Sampling assumptions on the data
There are many assumptions that may be used for the data generating process. Suppose we condition on the instruments z_i, treating them as "constants" and only assuming the existence of probability limits of their moments. Then the classic OLS asymptotic results hold. Let us formalize this idea. Specifically:
Assumption 1.
The instruments are "constants" with respect to the error shocks. This is achieved by conditioning on the instruments in hand. Further, under this sampling scheme, we assume the normalization (1/n) Σ_i z_i² = 1.
Now consider the first stage equation:

x_i = π z_i + ν_i.

We ignore constants for the time being.
Lemma 1 (CLT for OLS—Version 1).
Under Assumption 1, with π̂ = (1/n) Σ_i z_i x_i, the CLT for OLS is

√n (π̂ − π) →_d N(0, σ_ν²).

Proof.
This is easily proven using the Liapounov CLT. Specifically, we have

√n (π̂ − π) = (1/√n) Σ_i z_i ν_i.

Thus, the Liapounov CLT tells us that

(1/√n) Σ_i z_i ν_i →_d N(0, σ_ν² · (1/n) Σ_i z_i²).

Using the second moment condition assumed from the sampling, (1/n) Σ_i z_i² = 1, we get the desired result. ∎
However, this is not the result one gets when assuming the instruments are fully stochastic!
Assumption 2.
Suppose the instruments z_i are fully stochastic. Without loss of generality, we assume unit variance, E[z_i²] = 1, and also bounded fourth moment, E[z_i⁴] = μ₄ < ∞.
Lemma 2 (CLT for OLS—Version 2).
Under Assumption 2, the CLT for OLS is

√n (π̂ − π) →_d N(0, σ_ν² + π²(μ₄ − 1)).

Proof.
Here we may apply the classic Lindeberg–Levy CLT to the iid terms z_i x_i − π, which have mean 0 and variance Var(z_i x_i) = π²(μ₄ − 1) + σ_ν², to get

(1/√n) Σ_i (z_i x_i − π) →_d N(0, π²(μ₄ − 1) + σ_ν²).

Rearranging terms gives us the desired result. ∎
Observe that the results under Assumptions 1 and 2 are different! Indeed, the variance under full stochasticity is larger than in the fixed-instrument case, because there is more randomness to account for.
Theorem 4.1 (Consistency of the ridge estimator).
Suppose π ≠ 0 and λ_n = o(n). Then β̂_{λ_n} →_p β.
Proof.
We can rewrite the structural equation in the data generating process as

y_i = πβ z_i + u_i,  where u_i = ε_i + β ν_i.

This is just the reduced form equation. Without loss of generality, we have assumed (1/n) Σ_i z_i² = 1 (or E[z_i²] = 1, if using Assumption 2). Then it is clear that

(1/n) Σ_i z_i y_i →_p πβ.

Similarly, the first stage gives us

(1/n) Σ_i z_i x_i →_p π.

Since λ_n/n → 0, we have (1/n)(Σ_i z_i x_i + λ_n) →_p π, and since π ≠ 0, convergence in probability allows us to take the ratio of these two estimators, and we get our result.
∎
The natural next question is, what is the asymptotic distribution of the proposed estimator? We tackle this in the next theorem.
Theorem 4.2 (Asymptotic normality of the ridge estimator—Version 1).
Suppose π ≠ 0 and λ_n = o(√n). Then, under Assumption 1, the ridge estimator is asymptotically normal:

√n (β̂_{λ_n} − β) →_d N(0, V),  where V = σ_ε²/π².

Proof.
First, we examine the reduced form regression. From the Central Limit Theorem for OLS in Lemma 1, with ρ̂ = (1/n) Σ_i z_i y_i, we have

√n (ρ̂ − πβ) →_d N(0, σ_u²),

where σ_u² is the homoskedastic error variance of the reduced form regression, that is, the variance of the residual term u_i = ε_i + β ν_i. Similarly, analyzing the first-stage regression, we have

√n (π̂ − π) →_d N(0, σ_ν²)

for σ_ν², the variance of ν_i, the errors in the first stage.
Next, we have Cov(u_i, ν_i) = σ_{uν} = σ_{εν} + β σ_ν², where σ_{εν} is the covariance between ε_i and ν_i. Putting these results together, we get the multivariate result

√n ((ρ̂, π̂)′ − (πβ, π)′) →_d N(0, Σ),  Σ = [[σ_u², σ_{uν}], [σ_{uν}, σ_ν²]].

Call this covariance matrix Σ. Because λ_n/√n → 0, we can use Slutsky's theorem to get the same asymptotic distribution after adding the penalty to the denominator:

√n ((ρ̂, π̂ + λ_n/n)′ − (πβ, π)′) →_d N(0, Σ).

This is the asymptotic distribution of a bivariate estimator. We note that β̂_{λ_n} is simply the ratio of the first and the second elements of this bivariate estimator.
Then, to obtain the distribution of the ridge estimator, we can use the multivariate delta method, which gives the asymptotic distribution of a smooth function of an asymptotically normal vector. Consider the bivariate function g(a, b) = a/b. Its gradient is given by ∇g(a, b) = (1/b, −a/b²)′. So the asymptotic variance of the ridge estimator is

∇g(πβ, π)′ Σ ∇g(πβ, π).

Here a = πβ and b = π. So the above is

(1/π²) σ_u² − 2(β/π²) σ_{uν} + (β²/π²) σ_ν²,

which, substituting σ_u² = σ_ε² + 2β σ_{εν} + β² σ_ν² and σ_{uν} = σ_{εν} + β σ_ν², simplifies to

V = σ_ε²/π².

This is the required V, and we have our asymptotic distribution.
∎
Theorem 4.3 (Asymptotic normality of the ridge estimator—Version 2).
Again suppose π ≠ 0 and λ_n = o(√n). Under Assumption 2, the ridge estimator is asymptotically normal as well, and with the same asymptotic variance, V = σ_ε²/π².
Proof.
From a multivariate central limit theorem as in Lemma 2, we again have asymptotic normality:

√n ((ρ̂, π̂)′ − (πβ, π)′) →_d N(0, Σ̃).

Note that the covariance matrix Σ̃ in this theorem is different from the one using Assumption 1. Let us obtain it.
We have Var(z_i y_i) = (πβ)²(μ₄ − 1) + σ_u² and Var(z_i x_i) = π²(μ₄ − 1) + σ_ν². We ignore intercepts without loss of generality.
Then,

Cov(z_i y_i, z_i x_i) = π²β(μ₄ − 1) + σ_{uν}.

Using the multivariate delta method again as in Theorem 4.2 on this new Σ̃, the extra terms involving (μ₄ − 1) cancel, and we get the variance of the ridge estimator to be

σ_ε²/π²,

which is remarkably the same result.
∎
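As a sanity check on the delta-method calculation, a small Monte Carlo (hypothetical DGP with stochastic instruments, σ_ε² = 1.64, π = 1) can compare the empirical variance of √n(β̂ − β) to σ_ε²/π²:

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n, beta, pi = 4000, 400, 1.0, 1.0

ests = np.empty(reps)
for r in range(reps):
    z = rng.normal(size=n)
    nu = rng.normal(size=n)
    eps = 0.8 * nu + rng.normal(size=n)       # sigma_eps^2 = 0.64 + 1 = 1.64
    x = pi * z + nu
    y = beta * x + eps
    ests[r] = np.sum(z * y) / np.sum(z * x)   # lam_n = o(sqrt(n)); lam_n = 0 shown

emp_var = n * np.var(ests - beta)
print(emp_var, 1.64 / pi ** 2)                # empirical vs. theoretical V
```

The two printed numbers should be close for a strong first stage and moderate n, consistent with the variance being the same under both sampling assumptions.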
We see that ridge IV recovers full efficiency in the "large" coefficient case. From another perspective, this result is uninteresting because ridge IV "does nothing": the rate λ_n = o(√n) being too slow, it yields the exact same distribution as the original 2SLS estimator without any penalization. To obtain a novel asymptotic distribution with ridge IV, we need a faster-growing penalty. We show that now.
To see what ridge IV is able to do, we set λ_n = λ₀√n. To be clear, in the previous theorems we used a sub-√n rate, whereas now we use exactly a √n rate for λ_n. Knight and Fu (2000) show that the √n rate is the boundary case for classic ridge regression, yielding √n-consistent estimators with a shifted limiting distribution. We show the same for ridge IV.
Theorem 4.4.
Let λ_n ≍ √n; that is, λ_n = λ₀√n for some λ₀ > 0. Then

√n (β̂_{λ_n} − β) →_d N(−λ₀β/π, V).

Proof.
Let us look at the case with Assumption 1 first. The proof of Theorem 4.2 gives the joint normality

√n ((ρ̂, π̂ + λ_n/n)′ − (πβ, π)′) →_d N((0, λ₀)′, Σ).

The only thing that changes is the bias term λ₀ in the second coordinate, because now λ_n/√n → λ₀ rather than 0. Now take the ratio of the two elements of the vector and perform the delta method as in Theorem 4.2. This gives us

√n (β̂_{λ_n} − β) →_d N(−λ₀β/π, V),

using the same notation as in Theorem 4.2. This gives us our result.
∎
Under this new regime, we see that we recover the same asymptotic variance as the 2SLS estimator, but centered at the wrong mean! That is, the asymptotic bias is −λ₀β/π, rather than 0. What then do we gain from the ridge approach? The next section addresses this point.
5 The Staiger–Stock critique
Recall that our main motivation for ridge IV is the weak instrument problem. To deal with it more explicitly, we adopt the local asymptotic framework used in Staiger and Stock (1997). That is, we let the first-stage coefficient drift with the sample size:

π = π_n = c/√n  for some constant c ≠ 0.

This captures the idea that the strength of the first stage is small even relative to the sample size, so the problem does not go away with bigger samples. Our next theorem shows the behavior of conventional 2SLS under this sense of weak instruments.
Theorem 5.1.
Suppose our first stage is weak in the Staiger–Stock sense, that is, π_n = c/√n. Then the 2SLS estimator is unstable and diverges. Specifically, β̂_2SLS converges in distribution to a ratio of correlated normals, a Cauchy-type distribution without finite moments, and so √n (β̂_2SLS − β) diverges.
Proof.
We operate under Assumption 1 here. We have

x_i = π_n z_i + ν_i = (c/√n) z_i + ν_i.

Consider (1/√n) Σ_i z_i x_i. Since π_n varies with the sample size, we employ a triangular array argument here. The summand z_i ν_i has mean 0 and variance z_i² σ_ν².
Let X_{n,i} = z_i ν_i and s_n² = σ_ν² Σ_i z_i². The Lindeberg–Feller Central Limit Theorem states that, under the Lindeberg condition,

(1/s_n) Σ_i X_{n,i} →_d N(0, 1).

Applying this to our setting, we get

(1/√n) Σ_i z_i ν_i →_d N(0, σ_ν²).

We have (1/√n) Σ_i z_i x_i = c · (1/n) Σ_i z_i² + (1/√n) Σ_i z_i ν_i, with (1/n) Σ_i z_i² = 1. So this can be rewritten as

(1/√n) Σ_i z_i x_i →_d N(c, σ_ν²).

Further, the reduced form is y_i = π_n β z_i + u_i with u_i = ε_i + β ν_i. Applying the Lindeberg–Feller CLT similarly to the reduced form equation, we get

(1/√n) Σ_i z_i u_i →_d N(0, σ_u²).

And by similar logic, this can be rewritten as

(1/√n) Σ_i z_i y_i →_d N(cβ, σ_u²).

Following the derivation of the covariance term in Theorem 4.2, we can then derive a joint normality result as follows:

(1/√n) (Σ_i z_i y_i, Σ_i z_i x_i)′ →_d N((cβ, c)′, Σ).

We called this covariance matrix Σ in Theorem 4.2. Our 2SLS estimator is

β̂_2SLS = (Σ_i z_i y_i)/(Σ_i z_i x_i) = [(1/√n) Σ_i z_i y_i] / [(1/√n) Σ_i z_i x_i],

so we can get its distribution by taking the ratio of the two jointly normal limits above, which results in a Cauchy-type ratio-of-normals distribution. Given that β̂_2SLS itself has a nondegenerate limiting distribution, √n times its difference from β is unstable and diverges.
∎
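The heavy, Cauchy-type tails are easy to see by simulation (hypothetical DGP with c = 1, matching the π_n = c/√n drift):

```python
import numpy as np

rng = np.random.default_rng(3)
reps, n, c, beta = 5000, 400, 1.0, 1.0

ests = np.empty(reps)
for r in range(reps):
    z = rng.normal(size=n)
    nu = rng.normal(size=n)
    x = (c / np.sqrt(n)) * z + nu             # Staiger-Stock weak first stage
    y = beta * x + 0.8 * nu + rng.normal(size=n)
    ests[r] = np.sum(z * y) / np.sum(z * x)   # 2SLS ratio

# the median is stable, but extreme quantiles blow up, as for a Cauchy variable
print(np.median(ests), np.percentile(np.abs(ests - np.median(ests)), 99))
```

No matter how large n grows, the ratio keeps a nondegenerate, heavy-tailed distribution, so its sample MSE across replications is erratic.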
How does ridge IV help with this? We demonstrate that in the next theorem, which is the most important result of the paper.
Theorem 5.2.
Let λ_n/√n → ∞; that is, λ_n = λ₀ n^δ for some λ₀ > 0 and δ > 1/2. Then, under Staiger–Stock asymptotics, we have

(λ_n/√n) β̂_{λ_n} →_d N(cβ, σ_u²),  and in particular β̂_{λ_n} →_p 0.

Proof.
The ridge IV estimator is

β̂_{λ_n} = (Σ_i z_i y_i)/(Σ_i z_i x_i + λ_n).

From Theorem 5.1, we know that (1/√n) Σ_i z_i x_i →_d N(c, σ_ν²). This implies

(1/√n) Σ_i z_i x_i = O_p(1),

and from this it follows that

(1/λ_n)(Σ_i z_i x_i + λ_n) = 1 + O_p(√n/λ_n) →_p 1.

Also from Theorem 5.1, we know

(1/√n) Σ_i z_i y_i →_d N(cβ, σ_u²).

Then, taking the ratio of the above two results using Slutsky's theorem, we get

(λ_n/√n) β̂_{λ_n} = [(1/√n) Σ_i z_i y_i] / [(1/λ_n)(Σ_i z_i x_i + λ_n)] →_d N(cβ, σ_u²).

This is the required result.
∎
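A quick Monte Carlo sketch of this result (hypothetical DGP; δ = 0.75 as an example of an aggressive penalty) shows ridge IV taming the 2SLS instability under the Staiger–Stock drift:

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n, c, beta = 5000, 400, 1.0, 1.0
lam_n = n ** 0.75                             # delta = 0.75 > 1/2

tsls = np.empty(reps)
ridge = np.empty(reps)
for r in range(reps):
    z = rng.normal(size=n)
    nu = rng.normal(size=n)
    x = (c / np.sqrt(n)) * z + nu             # weak first stage: pi_n = c/sqrt(n)
    y = beta * x + 0.8 * nu + rng.normal(size=n)
    num, den = np.sum(z * y), np.sum(z * x)
    tsls[r] = num / den
    ridge[r] = num / (den + lam_n)            # ridge IV, as in eq. (4)

print(np.mean((tsls - beta) ** 2), np.mean((ridge - beta) ** 2))
```

The penalty keeps the denominator bounded away from zero, so the ridge estimate shrinks toward 0 and its squared error stays near β², while the unpenalized ratio's sample MSE is dominated by near-zero denominators.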
In the Staiger–Stock regime, ridge IV with an aggressive enough penalization scheme (λ_n/√n → ∞) massively lowers the mean squared error of the estimate. In regimes where instruments are not weak, we can still use ridge IV with a more moderate penalization scheme (λ_n = o(√n)) and lose nothing, although in that case the choice of ridge IV over 2SLS is superfluous.
6 Understanding ridge IV
6.1 Interpretation of λ via a Lagrange multiplier in a constrained optimization problem
In the simplest case, as in our model in (1), ignoring intercepts, the standard 2SLS objective function is given by

Q(β) = (Σ_i z_i (y_i − β x_i))².

Then the ridge IV solution is given by the following objective function:

(5)  Q_μ(β) = (Σ_i z_i (y_i − β x_i))² + μ β².

Clearly, setting the penalization μ to zero recovers the original 2SLS estimator. What is the relation between λ and the Lagrange multiplier μ we see above? We address this in the next proposition.
Proposition 1 (Objective function of ridge IV).
The ridge IV estimator minimizes the objective function given in Equation (5). Further, there is a one-to-one relation between μ, the Lagrange multiplier in the objective function, and λ, the level of penalization in the ridge estimator, given by

μ = λ Σ_i z_i x_i.
Proof.
Let the ridge IV objective function be

Q_μ(β) = (Σ_i z_i (y_i − β x_i))² + μ β².

This is a convex function of β, so to minimize it, set its derivative with respect to β to zero. This gives us

−2 (Σ_i z_i x_i)(Σ_i z_i (y_i − β x_i)) + 2μβ = 0.

That is,

(Σ_i z_i x_i)(Σ_i z_i y_i) = β [(Σ_i z_i x_i)² + μ].

This gives us

β = (Σ_i z_i x_i)(Σ_i z_i y_i) / [(Σ_i z_i x_i)² + μ],

or

β = (Σ_i z_i y_i) / [Σ_i z_i x_i + μ/(Σ_i z_i x_i)].

Comparing this form with the definition of the ridge IV estimator in Equation (4), we see that

λ = μ / (Σ_i z_i x_i).

Rearranging the equation gives us the desired result.
∎
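This algebra can be checked numerically. The sketch below (hypothetical data), taking the one-to-one mapping to be μ = λ Σ_i z_i x_i, verifies that the minimizer of the penalized objective (5) coincides with the ridge IV formula in (4):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 150
z = rng.normal(size=n)
nu = rng.normal(size=n)
x = 0.5 * z + nu
y = 1.0 * x + 0.8 * nu + rng.normal(size=n)

zy, zx = np.sum(z * y), np.sum(z * x)
lam = 10.0

ridge = zy / (zx + lam)                   # ridge IV, as in eq. (4)
mu = lam * zx                             # the claimed mapping between penalties
penalized = (zx * zy) / (zx ** 2 + mu)    # closed-form minimizer of (5)

obj = lambda b: (np.sum(z * (y - b * x))) ** 2 + mu * b ** 2
print(ridge, penalized)
```

The two quantities agree exactly (up to floating-point error), and the objective value at the minimizer is no larger than at nearby points.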
7 Results
We look at a simulation design where we are interested in how MSE varies with the level of aggressiveness of the ridge penalty, λ. The linear IV model for this simulation is the just-identified single-instrument model

y_i = β x_i + ε_i,  x_i = π z_i + ν_i.

This corresponds to our model in equation (1). The coefficients and sample sizes are chosen to match a study from the AER (Hornung, 2014); the instruments and error terms are drawn as independent normals.
In the first set of results, we allow the first-stage coefficient, π, to vary. Our simulation study is as follows:

1. Pick a first-stage coefficient π from 0 to 1.

2. Simulate 10,000 datasets of N = 150 each.

3. Compute the MSE of the estimated coefficient for each level of penalization (the regressions include an intercept).
In the second set of results, we allow the effect size, β, to vary. This is to show that for a given first-stage strength, we may still have some use for regularization when coefficients are very small. This simulation study is as follows:

1. Pick an effect size, i.e. β, from 0 to 3.475.

2. Simulate 10,000 datasets of N = 150 each.

3. Compute the MSE of the estimated coefficient (the regressions include an intercept).
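The steps above can be sketched as follows (a scaled-down illustration: fewer replications, hypothetical coefficient values and penalty schemes, and no intercept):

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_mse(pi, lam_fn, reps=2000, n=150, beta=1.0):
    """MSE of ridge IV over `reps` simulated datasets; lam_fn(n) = 0 gives 2SLS."""
    errs = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        nu = rng.normal(size=n)
        x = pi * z + nu
        y = beta * x + 0.8 * nu + rng.normal(size=n)
        errs[r] = np.sum(z * y) / (np.sum(z * x) + lam_fn(n)) - beta
    return np.mean(errs ** 2)

schemes = {"2SLS": lambda n: 0.0,
           "moderate": lambda n: n ** 0.25,    # sub-sqrt(n) penalty
           "aggressive": lambda n: n ** 0.75}  # faster-than-sqrt(n) penalty

for pi in (0.05, 0.5, 1.0):                    # weak to strong first stage
    mses = {name: simulate_mse(pi, f) for name, f in schemes.items()}
    print(pi, mses)
```

Sweeping π (or β) over a grid and plotting the three MSE curves reproduces the qualitative pattern discussed below.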
We show three cases here. The case of zero regularization is the classic 2SLS estimator, which has a certain level of mean squared error that remains high even when the effect size being studied is small.
8 Conclusion
In this paper, we introduced a novel estimator, called "ridge IV." We motivated it as the solution to the GMM objective function for instrumental variable regression with an additional ridge (ℓ₂) penalty. In the theoretical case with "large" coefficients, we showed that ridge IV does not hurt our estimation, while in the weak first-stage case, we showed that it leads to strong improvements in the mean squared error of the estimate. We then validated the theory using simulations inspired by data designs of papers in the American Economic Review.
While this paper is primarily a theoretical contribution to the literature, we outline several avenues for further research. First, it would be helpful to provide a method for tuning the penalty parameter λ for a given dataset. Our results operated at the abstract level of big-O rates, but for practical use, more information is needed. We would also like to see results on how exactly to perform inference with ridge IV. Explicit demonstration of type-1 error control, for instance, would be very useful.
We would also like to tie the results of ridge IV back to the problem of IV sensitivity to outliers, which is related to instability. We conjecture that ridge IV under an appropriate penalization scheme can address this as well. Finally, we would like to expand the results provided in this paper to the general case of overidentifying instruments; a natural extension of our estimator is given in Equation (3).
References
Anderson, T. W. and H. Rubin (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics 20(1), pp. 46–63.
Andrews, I. and T. B. Armstrong (2017). Unbiased instrumental variables estimation under known first-stage sign. Quantitative Economics 8(2), pp. 479–503.
Andrews, I., J. H. Stock, and L. Sun (2018). Weak instruments in IV regression: theory and practice. Technical report, Mimeo, Harvard University.
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica 46(6), pp. 1251–1271.
Hirano, K. and J. R. Porter (2015). Location properties of point estimators in linear instrumental variables and related models. Econometric Reviews 34(6–10), pp. 720–733.
Hornung, E. (2014). Immigration and the diffusion of technology: the Huguenot diaspora in Prussia. American Economic Review 104(1), pp. 84–122.
Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28(5), pp. 1356–1378.
Montiel Olea, J. L. and C. Pflueger (2013). A robust test for weak instruments. Journal of Business & Economic Statistics 31(3), pp. 358–369.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65(3), pp. 557–586.
Young, A. (2018). Consistency without inference: instrumental variables in practical application. Unpublished manuscript, London School of Economics and Political Science.