Many questions in Economics and Statistics can be posed as an extremum estimation problem:
where is a population loss function induced by the data distribution and dependent on a parameter of interest and a potentially infinite dimensional nuisance parameter . The true value is defined as the minimizer of the population loss evaluated at the true value of the nuisance component . The nuisance component can itself be estimated based on some auxiliary estimation process, whose description depends on the application of interest.
We address the problem of estimating based on a data set of i.i.d. samples, each drawn from distribution and we consider a high-dimensional sparse regime, i.e., we allow the dimension to exceed the sample size : , but require to be sparse:
This framework extends standard semiparametric extremum estimation problems by allowing the finite dimensional parameter to be a high-dimensional sparse vector. Instances of this framework that we investigate in detail in this paper include estimating models defined via conditional moment restrictions with missing data, estimating the utility of agents in games of incomplete information, and estimating treatment effects in a regression model with a nonlinear link function. In all these settings, our work enables estimation in the high-dimensional regime, where among the treatments/features only of them have a non-zero effect on the outcome.
As is typical in semiparametric models, estimating is often a much harder problem in terms of sample complexity than estimating , had we been given oracle access to the true (e.g. estimating requires a non-parametric regression or a high-dimensional regression with very dense parameters). This feature of semiparametric estimation extends even to the high-dimensional regime. Motivated by this observation, the main goal of our work is to develop an estimation algorithm for , whose performance is robust to errors in the estimation of .
A natural way to estimate is via a two-stage procedure, where a first-stage estimate of the nuisance component is plugged into a -regularized sample analog of (1). Namely, we assume the existence of a sample loss function that concentrates around conditional on any first-stage estimate , as the sample size becomes large. Given such an empirical loss function, we propose to estimate by the following two-stage algorithm:
Input: , search set
Our main result is to show that if the loss function satisfies an orthogonality condition with respect to the nuisance component, as well as regularity conditions that are typical in high-dimensional estimation, then the convergence rate of the plug-in regularized extremum estimator presented in Algorithm 1 has only a second-order impact from the first-stage estimation error of , i.e. it depends only on the squared error .
Example 1 (High-Dimensional Heterogeneous Treatment Effects).
To make matters more concrete, let us consider a stylized, yet practically important, model of heterogeneous treatment effect estimation. In particular, consider the following structural model, which corresponds to a high-dimensional extension of the classic Partially Linear Model (PLR):
where is a base treatment variable, is a high-dimensional vector of features/control variables and is an outcome of interest. The target parameter corresponds to a linear parametrization of the heterogeneous treatment effect of on conditional on the features . The features also have a confounding effect, in the sense that they have a direct impact on the outcome apart from determining the treatment effect. This setting falls into our formulation, where are the nuisance components in the estimation of the parameter of interest . Many times, the density of the coefficients and is much larger than the density of , i.e. many variables have a direct effect on the outcome, but do not alter the effect of the treatment. Hence, our goal is to estimate in a manner that does not depend on the support size of the coefficients .
In this particular example, one could estimate via a direct approach, by regressing on , via the Lasso algorithm, i.e. minimizing the regularized loss:
However, with such a direct approach, the convergence rate for the parameter will depend on the support-size of both and . The framework that we establish in this work will show that if instead one invokes our two-stage Algorithm 1 with a slightly modified loss function:
where is a first stage estimate of and is a first stage estimate of the function , then the convergence rate of the resulting estimate is asymptotically independent of the error in the estimation of and , i.e. the density of the coefficients of their linear representations enters only in a non-leading term. The crucial property that enables this result is that the modified loss satisfies an orthogonality condition, which we will define shortly and which renders it insensitive to local perturbations of the nuisance components near the true values of both and . This difference between the two estimation methods is not an artifact of the theoretical analysis, but shows up clearly in their experimental performance, as we show in Figure 1 and Figure 2. The improvement is more pronounced when either the variance of the residual in the treatment equation is small (i.e. a small amount of natural experimentation in the treatment) or the variance of the residual in the outcome equation is small (no significant unobserved heterogeneity in the outcome).
A detailed exposition of the latter result is given in Section 2.4. This approach extends beyond the linear setting to high-dimensional treatment effect estimation with non-linear link functions, i.e. , where we present an orthogonal loss construction, which is novel even in the low-dimensional regime. This generalization is presented in Section 4.3 and a sample experimental performance of our approach for the logistic link, which arises in estimation of discrete choice models, is presented in Figure 3.
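The two-stage idea of Example 1 can be illustrated with a minimal sketch. It assumes linear specifications for both nuisance components, the names `X`, `T`, `Y` for features, treatment and outcome, and a residual-on-residual form of the modified loss; none of these is fixed by the text above. The helper names and regularization levels are hypothetical, and a plain proximal-gradient Lasso stands in for whatever solver one would use in practice.

```python
import numpy as np

def lasso_ista(A, b, lam, n_iter=1000):
    """Minimize (1/(2n))*||b - A w||_2^2 + lam*||w||_1 by proximal gradient."""
    n, p = A.shape
    w = np.zeros(p)
    step = n / np.linalg.norm(A, 2) ** 2  # inverse Lipschitz constant of the smooth part
    for _ in range(n_iter):
        v = w - step * (A.T @ (A @ w - b)) / n
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # soft-thresholding
    return w

def two_stage_orthogonal(X, T, Y, lam_first, lam_second):
    """Sample-split, plug-in Lasso on a residualized loss (illustrative assumption)."""
    n, p = X.shape
    aux, est = np.arange(n // 2), np.arange(n // 2, n)  # auxiliary / estimation samples
    # First stage (auxiliary sample): treatment regression and a direct outcome fit.
    beta_hat = lasso_ista(X[aux], T[aux], lam_first)
    design = np.hstack([T[aux][:, None] * X[aux], X[aux]])
    direct = lasso_ista(design, Y[aux], lam_first)
    theta_prelim, alpha_hat = direct[:p], direct[p:]
    # Plug-in estimates of E[T | X] and E[Y | X] on the estimation sample.
    h_hat = X[est] @ beta_hat
    q_hat = h_hat * (X[est] @ theta_prelim) + X[est] @ alpha_hat
    # Second stage: Lasso of the residualized outcome on residualized treatment times X.
    Z = (T[est] - h_hat)[:, None] * X[est]
    return lasso_ista(Z, Y[est] - q_hat, lam_second)

# Synthetic data: sparse effect heterogeneity, dense confounding (assumed design).
rng = np.random.default_rng(0)
n, p = 2000, 20
theta0 = np.zeros(p)
theta0[0] = 1.0
alpha0 = rng.normal(size=p) / np.sqrt(p)
beta0 = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
T = X @ beta0 + rng.normal(size=n)
Y = T * (X @ theta0) + X @ alpha0 + rng.normal(size=n)

theta_hat = two_stage_orthogonal(X, T, Y, lam_first=0.05, lam_second=0.1)
err = np.linalg.norm(theta_hat - theta0)
```

Here the second-stage regularization level only needs to dominate the noise of the residualized problem; the dense coefficients `alpha0` and `beta0` enter the second stage only through the first-stage residuals.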
Outline of Main Result.
The input to Algorithm 1 consists of the regularization parameter and a search set . Depending on the convexity of the loss in , the algorithm defines either a global or local optimization problem, and we consider both cases in Section 2. In the convex case, we conduct a global search by setting and the regularization parameter
to dominate the gradient of the loss with high probability
The rate at which the gradient of the empirical plugin loss evaluated at the true parameter goes to zero is a proxy for how the noise of the problem decays to zero as the sample size grows. In the non-convex case, we conduct a local search determined by the properties of the loss , which will be discussed later, and set to dominate the gradient of the loss and the local violation of the convexity around . In both cases, the error of the final estimator is proportional to the regularization parameter .
Hence, to understand the impact of the first stage estimation error on the second stage estimate, one crucial aspect is characterizing how this error affects the noise of our second stage estimation problem, as captured by the empirical plugin gradient evaluated at the true value . We define a population loss to be (Neyman)-orthogonal to the nuisance parameter if the pathwise derivative of the gradient of the loss w.r.t , evaluated at the true parameter and nuisance component value, is zero:
In other words, at the true parameter value, local perturbations of the nuisance component around its true value, have a zero first-order effect on the gradient of the loss, i.e.:
As we will show later, in several estimation settings defined via conditional moment restrictions it is always possible to construct such an orthogonal loss.
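The orthogonality property just described can be written out in symbols. The notation is assumed here for illustration: $L_D(\theta, g)$ denotes the population loss, $\theta_0$ and $g_0$ the true parameter and nuisance values, and the sup-norm on the gradient is one natural choice in the sparse setting. The first-order term in the nuisance direction vanishes, so what remains is quadratic in the nuisance error:

```latex
% Pathwise (Gateaux) derivative of the gradient in the direction g - g_0 vanishes:
D_g \nabla_\theta L_D(\theta_0, g_0)[g - g_0]
  := \frac{\partial}{\partial t}
     \nabla_\theta L_D\bigl(\theta_0,\, g_0 + t\,(g - g_0)\bigr)\Big|_{t=0} = 0 .

% Consequently, when the second pathwise derivative is bounded,
\bigl\| \nabla_\theta L_D(\theta_0, g) - \nabla_\theta L_D(\theta_0, g_0) \bigr\|_\infty
  = O\bigl( \| g - g_0 \|^2 \bigr) .
```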
Subsequently, we can use this property to show an analogue of it for the empirical loss. Crucially, this property allows us to set a regularization weight that only depends on , since that suffices for regularization to dominate the noise of the problem. Since the convergence rate of is determined by the required level of regularization, this leads to our desired second-order influence property. Moreover, if the quantity is of lower order than the rate at which the oracle empirical gradient converges to zero, then the estimation error of can essentially be asymptotically ignored. In typical settings, will be of the order of . Hence, the requirement for the oracle convergence property is essentially , which can be achieved by several non-parametric or high-dimensional parametric estimators. Even when is not fast enough to ensure the oracle convergence property, orthogonality still benefits the estimation of , in that it renders it more robust to the nuisance component estimation.
The results of this paper accommodate estimation of- shrinkage estimators, as well as earlier developed tools. The only requirement we impose on is its uniform convergence to the at some slow rate . Crucially, we do not impose further stringent complexity requirements on the function class in which the first stage estimates must lie. We achieve this by a sample splitting approach, already introduced in the low-dimensional regime. We also show that for a particular class of extremum estimators, namely -estimators, the slight statistical inefficiency due to sample splitting can be alleviated via a cross-fitting scheme (see Algorithm 3).
Our formal proof requires further technical steps, primarily addressing the fact that the first-stage estimation error also has an effect on the second-order (strong convexity) properties of the loss function. In the convex setting this translates to a minimal requirement on the rate , so that its effect on the second-order properties of the loss can be ignored after some constant number of samples . This effect is much harder to handle in the non-convex setting, where the effect on the second-order properties of the loss also needs to be dominated by the regularization strength , thereby entering the convergence rate. This seemingly leads to a first-order effect of the nuisance estimation on the target parameter estimation. However, we show how to bypass this problem via a two-step estimation approach, where we first estimate a preliminary at a slow rate and then refine our search set around this preliminary estimate. With this addition, we arrive at an overall estimation algorithm (see Algorithm 2) that enjoys second-order dependence on the estimation of . The sole drawback of this approach is that the requirement on for oracle convergence is stricter than in the convex setting. In particular: , which contains an extra on the right-hand side, as compared to the convex case.
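The sample-splitting and cross-fitting idea can be sketched generically. The function below is a hypothetical helper, not the paper's Algorithm 3: it only shows how each observation receives a nuisance prediction from a model trained on the other folds, with ordinary least squares as a stand-in for the first-stage learner.

```python
import numpy as np

def cross_fit_predictions(X, y, fit, predict, n_folds=2, seed=0):
    """Out-of-fold nuisance predictions: each point is predicted by a model
    trained only on the remaining folds."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    out = np.empty(n)
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit(X[train_idx], y[train_idx])
        out[test_idx] = predict(model, X[test_idx])
    return out

# Stand-in first-stage learner (ordinary least squares); any estimator could be used.
def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ols_predict(coef, X):
    return X @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # noiseless, so out-of-fold predictions are exact
g_hat = cross_fit_predictions(X, y, ols_fit, ols_predict, n_folds=4)
```

Because no observation is used both to fit the nuisance model and to evaluate it, the second-stage loss can be analyzed conditionally on the first-stage estimate, which is what removes the need for complexity restrictions on the nuisance function class.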
Constructing an Orthogonal Loss.
Our main result is presented conditional on having access to an orthogonal loss. One might wonder how one arrives at such a loss from primitives of the model. In Section 3, we show how such an orthogonal loss can always be constructed, via a novel orthogonalization technique, when the model is defined via single-index conditional moment restrictions of the form:
for some real-valued moment function , where the inner product is referred to as the index. Crucially, the parameter enters the moment function only through the index. The latter approach applies to our application on estimation with missing data and estimation in games of incomplete information. For our final application of non-linear treatment effect estimation, we develop a separate partialing-out approach to arrive at an orthogonal loss. This method is an extension and generalization of the loss function presented in Example 1.
We apply our general results to three classical problems in bio-statistics, structural econometrics and causal inference. Concretely, we address estimation of conditional moment models with missing data (Section 4.1), estimation of agent utilities in games of incomplete information (Section 4.2) and estimation of high-dimensional treatment effects in regression problems with nonlinear link functions (Section 4.3). In all these settings, we extend prior work from the low-dimensional target parameter setting to its sparse high-dimensional counterpart. For each setting, we establish concrete conditions on the complexity of the nuisance components that lead to oracle convergence rates for our two-stage estimator.
This paper builds on two bodies of theoretical research within the Statistics and Econometrics literature: (1) orthogonal/debiased machine learning and (2) sparse high-dimensional -estimation and its extensions to non-convex settings. It also draws on examples of models described by conditional moment restrictions. The first literature provides -consistent and asymptotically normal estimates of low-dimensional target parameters in the presence of high-dimensional/highly complex nonparametric nuisance functions. The second literature establishes convergence rates for -penalized -estimation problems in convex and non-convex settings. As for the applications, we illustrate our results by applying them to Conditional Moment Models in the presence of Missing Data, applicable also to models with measurement error and an error-free validation sample, Games of Incomplete Information, and a model of high-dimensional treatment effects with a nonlinear link function, whose linear case was considered in .
2 Plug-in Regularized Extremum Estimation
In this section we derive the convergence rate for the Plug-in Regularized Extremum Estimator, outlined in Algorithm 1, which exhibits second-order impact from the first stage error in the estimation of . We establish sufficient conditions under which this rate can be attained.
We assume that both the estimation sample and the auxiliary sample consist of i.i.d. data points, each drawn from a data generating distribution . We consider an empirical loss function and a population loss that depend on a target parameter and a nuisance component that can either be a finite-dimensional parameter or a function. We assume that belongs to a convex set equipped with some norm , whose choice will be specific to the type of the nuisance parameter and the application of interest. (Footnote 2: For instance, in the case of a finite dimensional parameter , the norm could be some norm of the finite dimensional vector space , and in the case of a vector-valued function , it could be the norm of some finite dimensional vector norm of the random vector , i.e. for some finite dimensional vector norm .)
A leading example of our framework is an -estimation problem, where sample and population losses are defined as the empirical and population expectation of an -estimator loss function , i.e.:
Our results are not specific to the -estimation setting and also apply to loss functions that are not additively separable across samples.
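For the additively separable case described above, the two losses can be written out explicitly. The symbols are assumptions made for illustration: $z_1, \dots, z_n$ are the samples, $D$ is the data distribution, and $\ell$ is the per-sample loss.

```latex
L_S(\theta, g) = \frac{1}{n} \sum_{i=1}^{n} \ell(z_i; \theta, g),
\qquad
L_D(\theta, g) = \mathbb{E}_{z \sim D}\bigl[ \ell(z; \theta, g) \bigr].
```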
Our goal in this section is to establish high probability bounds on the estimation error of Algorithm 1, with respect to either the or the norm. To enable our results we first impose a set of sufficient conditions. Some of these conditions will be of a first order nature (e.g. orthogonality and strong convexity), while others will be easily satisfied under mild smoothness and differentiability properties of the loss functions. We will typically refer to the latter as regularity assumptions.
Our first regularity assumption requires that the first stage estimator achieves some non-trivial rate of convergence to the truth. In particular, Assumption 1 introduces a sequence of nuisance realization sets that contain the first-stage estimator with high probability. As sample size increases, set shrinks around the true value . The shrinkage speed is measured by the rate and is referred to as the first-stage rate. At the end of the section we will characterize bounds on under which the first stage error can be ignored and is not of the same order as the leading error term of the second stage estimation. However, our convergence rate for the second stage will be valid for any rate and will still have a dampened impact from the first stage error, even if this impact is of a leading order. Only in our convex setting, we will impose the mild condition that for our convergence rate to be valid.
Regularity Assumption 1 (Nuisance Parameter Estimation Error).
For any , w.p. at least , the first-stage estimate belongs to a neighborhood of , such that:
Orthogonality of the population loss .
To dampen the impact of the estimation error of the first-stage estimator on the second-stage estimator , we require population loss to be orthogonal with respect to . We call a population loss (Neyman) orthogonal to the nuisance parameter if the pathwise derivative of the loss gradient w.r.t is zero.
Definition 1 (Orthogonal Loss).
Loss function is orthogonal with respect to the nuisance function if the pathwise derivative map of its gradient at ,
exists and , and vanishes at :
Assumption 2 (Orthogonality of Population Loss).
The population loss function is orthogonal.
To guarantee that the impact of the first-stage estimator on the second stage estimator is second-order, we require an extra regularity assumption which is easily satisfied when is sufficiently smooth.
Regularity Assumption 3 (Bounded Hessian of the Gradient of Population Loss w.r.t. Nuisance).
The second order path-wise derivative of the gradient w.r.t the nuisance parameter:
exists and is bounded as:
Convergence of the gradient of the empirical loss
To ensure that the empirical oracle gradient goes to zero in norm, we assume that the gradient of the empirical loss concentrates well around its population analogue for each fixed instance of . Crucially, by using different samples in the first and the second stages of Algorithm 1, we do not require the uniform convergence of over the realization set of the nuisance , and therefore, we do not restrict the complexity of the function class . As a result, one can employ high-dimensional, highly complex methods to estimate .
Regularity Assumption 4 (Convergence Rate of Empirical Gradient).
We assume that for any fixed , there exists a sequence such that converges at rate to zero w.p. . Formally, for any :
This regularity assumption is a mild requirement. For example, in the case of the -estimation problem with a bounded loss gradient , Assumption 4 follows from McDiarmid’s inequality with .
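As a worked instance of such a rate, the sketch below swaps McDiarmid's inequality for Hoeffding's inequality plus a union bound, which suffices in the additively separable case. The symbols $p$ (dimension), $n$ (sample size), $B$ (coordinate-wise bound on the loss gradient) and $\delta$ (failure probability) are assumptions introduced for this illustration.

```latex
% Assume |\partial_j \ell(z; \theta_0, g)| \le B for every coordinate j = 1, \dots, p.
% Hoeffding's inequality applied to each coordinate of the averaged gradient:
\Pr\Bigl( \bigl| \partial_j L_S(\theta_0, g) - \partial_j L_D(\theta_0, g) \bigr| \ge t \Bigr)
  \le 2 \exp\!\Bigl( -\frac{n t^2}{2 B^2} \Bigr).

% Union bound over the p coordinates, then solve 2 p \exp(-n t^2 / (2 B^2)) = \delta for t:
\bigl\| \nabla_\theta L_S(\theta_0, g) - \nabla_\theta L_D(\theta_0, g) \bigr\|_\infty
  \le B \sqrt{\frac{2 \log(2 p / \delta)}{n}}
  \quad \text{with probability at least } 1 - \delta .
```

The dimension enters only logarithmically, which is what makes a regularization weight of this order compatible with the high-dimensional regime.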
Curvature of the loss.
The mere fact that the estimator is a local minimum of the empirical loss is not sufficient to guarantee that is close to . Even if we knew that was an approximate minimizer of the population oracle loss , this would not imply that is close to , unless the loss function has a large curvature within the search set . For a given direction , we measure this curvature of loss function by the symmetric Bregman distance, considered in [1, 16, 15] among others. (Footnote 3: The latter quantity is referred to as the symmetric Bregman distance since it corresponds to a symmetrized version of the Bregman distance, defined as: , which measures how far the value of at is from the value of a linear approximation of when it is linearized around the point . Observe that .)
Definition 2 (Symmetric Bregman distance).
For a differentiable function , define its symmetric Bregman distance as:
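A small numerical check may help fix ideas. It assumes the standard forms: the Bregman distance D(a | b) = f(a) - f(b) - <grad f(b), a - b> and the symmetric Bregman distance H(a, b) = <grad f(a) - grad f(b), a - b>. The function `f` below is an arbitrary smooth convex example chosen for illustration, not a loss from the paper.

```python
import numpy as np

def bregman(f, grad, a, b):
    """Bregman distance D(a | b) = f(a) - f(b) - <grad f(b), a - b>."""
    return f(a) - f(b) - grad(b) @ (a - b)

def symmetric_bregman(grad, a, b):
    """Symmetric Bregman distance H(a, b) = <grad f(a) - grad f(b), a - b>."""
    return (grad(a) - grad(b)) @ (a - b)

# Illustrative smooth convex function: f(t) = sum_j log(1 + exp(t_j)).
def f(t):
    return np.sum(np.logaddexp(0.0, t))

def grad(t):
    return 1.0 / (1.0 + np.exp(-t))  # coordinate-wise sigmoid

a = np.array([0.3, -1.2, 2.0])
b = np.array([-0.5, 0.4, 1.0])

H = symmetric_bregman(grad, a, b)
D_sum = bregman(f, grad, a, b) + bregman(f, grad, b, a)  # symmetrized Bregman distance
```

The two quantities `H` and `D_sum` coincide, since the function values cancel when the two one-sided Bregman distances are added; for a convex `f` both are non-negative.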
Given that the assumptions presented below pertain to the second-order properties of loss functions, they will depend on the overall convexity of the empirical loss in . In the non-convex case, we will conduct a local optimization, where the search set of Algorithm 1 depends on the problem features as discussed below. In addition, our further assumptions will be required to hold uniformly for all directions in an neighborhood around . In the convex case, convexity of ensures that the estimator belongs to a restricted cone
where denotes the support of the true parameter , its complement and by we denote -dimensional vector such that on set of indices and if (similarly for ). Therefore, the uniformity requirement will apply to cone only. For that reason, we formulate our assumptions with respect to a set , subsuming the -neighborhood of in the non-convex case and the restricted cone in the convex case.
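Membership in a restricted cone of this kind is easy to check numerically. The sketch below assumes the common form in which the off-support mass of a direction is at most a constant multiple of its on-support mass; the constant `c = 3.0` is a conventional choice from the Lasso literature and is an assumption, not a value taken from the text.

```python
import numpy as np

def in_restricted_cone(nu, support, c=3.0):
    """Check ||nu_{T^c}||_1 <= c * ||nu_T||_1 for a support set T (c is assumed)."""
    mask = np.zeros(nu.shape[0], dtype=bool)
    mask[list(support)] = True
    return bool(np.abs(nu[~mask]).sum() <= c * np.abs(nu[mask]).sum())

support = [0, 1]
nu_in = np.array([1.0, -0.5, 0.2, 0.1, 0.0])   # mass concentrated on the support
nu_out = np.array([0.1, 0.0, 1.0, 1.0, 1.0])   # mass concentrated off the support

inside = in_restricted_cone(nu_in, support)
outside = in_restricted_cone(nu_out, support)

# A standard consequence inside the cone: ||nu||_1 <= (1 + c) * sqrt(|T|) * ||nu||_2.
l1_bound_holds = bool(
    np.abs(nu_in).sum() <= (1 + 3.0) * np.sqrt(len(support)) * np.linalg.norm(nu_in)
)
```

The last inequality is why curvature conditions restricted to the cone suffice: within it, the ℓ1 and ℓ2 norms of the error are comparable up to a factor depending only on the sparsity level.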
Definition 3 (-Generalized Restricted Strong Convexity on a set ).
A differentiable function satisfies the GRC property with curvature and tolerance parameters on a set if its symmetric Bregman distance satisfies:
If loss function is twice differentiable, then a sufficient condition for the -GRC property is that for all and for all :
Moreover, if the loss is also convex, then a sufficient condition for the GRC property is that the Bregman distance at , satisfies the same lower bound for all , i.e.:
The latter is the condition that was employed in the analysis of , which studied regularized convex loss based estimation.
Assumption 5 ( -GRC Empirical Oracle Loss on set ).
There exist curvature and tolerance parameter sequences such that the empirical oracle loss satisfies the -GRC condition on the set w.p. .
Assumption 5 states that the empirical oracle loss has a positive curvature in all directions , allowing for the violation described by the tolerance parameters . In our further discussion we use the notation and to denote, respectively, symmetric Bregman distances of the empirical and population losses evaluated at parameter values and and nuisance . In many biostatistical and econometric applications, it is plausible to assume that the population loss is strongly convex with no violations (satisfies -GRC). In this case, if the difference between the symmetric Bregman distances of the empirical and population loss converges to zero at rate uniformly over , then -GRC of implies -GRC of . Also, if the empirical oracle loss is twice differentiable, and its Hessian converges at rate uniformly over the set to its population counterpart, then -GRC of implies -GRC of .
Lemma 1 (From Population to Empirical GRC).
Suppose the difference between the sample and the population symmetric Bregman distances normalized by converges uniformly over to zero at rate , i.e., w.p. :
Then, -GRC of implies -GRC of on w.p. .
Lemma 2 (From Population to Empirical GRC with Twice Differentiability).
Suppose that is twice differentiable and its empirical Hessian concentrates uniformly over to its population counterpart at rate , i.e., w.p. :
Then, -GRC of implies -GRC of on w.p. .
Lipschitz in nuisance symmetric Bregman distance.
To control the impact of the first-stage estimation error of on the second-stage estimate , we require a final regularity assumption that the symmetric Bregman distance is Lipschitz in . If the loss is sufficiently smooth in and additional mild requirements hold, Regularity Assumption 6 is satisfied on .
Regularity Assumption 6 (Lipschitz symmetric Bregman distance on ).
The empirical symmetric Bregman distance
satisfies the following Lipschitz condition in uniformly over a set : , w.p. :
The fudge factor is used to account for the fact that the norm of the nuisance space might be defined with respect to the population measure, while the latter assumption is about Lipschitzness of the empirical loss. Hence, typically a slack variable will be required to account for this difference in measures. For the case of sup norms over the data, will be zero.
2.1 Local Optimization of a Non-Convex Loss
To state our main theorem for non-convex losses, we will also make a benign assumption that a preliminary estimator that converges to at some preliminary rate is available. After stating the main theorem, we will discuss how one can easily construct such an estimator by either employing a convex non-orthogonal loss that is readily available in some applications, or even the same orthogonal loss with a sufficiently large search set and a more aggressive regularization weight . The latter implies that one does not really need a separate estimator, but rather needs to repeat the orthogonal estimation process twice with different parameters.
Assumption 7 (Preliminary Estimator).
We assume that there exists a preliminary estimator such that with probability :
Main Theorem for Non-Convex Loss
We are now ready to state our main theorem for the non-convex case. It provides a bound on the and errors of the Plug-in Regularized Estimation procedure.
Theorem 4 (Convergence Rate of the Plug-in Regularized Estimator).
Construction of a preliminary estimator.
Below we provide a three-step algorithm that is a generalization of Algorithm 1, augmented by the construction of a preliminary estimator. We show how such a preliminary estimator can be constructed using the same loss with a more aggressive regularization. In Appendix A.1 we also provide concrete rates when a convex loss is used as a preliminary step, and in Appendix A.2 we provide extra conditions under which a preliminary estimator might not even be required.
Input: Preliminary loss and final loss .
Input: Radii .
Input: Preliminary and final regularization weights , .
Remark 1 (Achieving Second-Order Dependence on ).
Observe that Theorem 4 seemingly implies a first-order dependence of the error in on the first stage error , unless is decaying sufficiently fast (e.g. of order ). We now argue that Theorem 4 in fact enables a second-order, three-step estimation algorithm, outlined in Algorithm 2.
Suppose that we have an orthogonal loss that satisfies all the assumptions of Theorem 4 except the existence of a preliminary estimator. Then we can still apply the theorem with and , for some upper bound on . For instance, if we assume that the true coefficients are all bounded by , then . Then the Theorem states that the resulting estimator achieves a rate in terms of the norm:
Subsequently, we can use as our preliminary estimator and invoke the theorem with the latter rate , which will yield a new estimator that achieves w.p. :
where, to simplify expressions, we made the assumption that . Thus we have recovered a rate that has only a second-order dependence on the first stage error. This result leads to the following corollary.
Corollary 5 (Convergence Rate of Plug-in Regularized Estimator with Preliminary Step).
Let and suppose that Assumptions 1, 2, 3, 4, 5, 6 hold on set with . Then the estimator returned by Algorithm 2 with the orthogonal loss as preliminary and final loss, search radii , and defined in Equation (17) and regularization weights:
satisfies w.p. :
If , then the estimation error of the nuisance component can asymptotically be ignored.
2.2 Global Convergence Under Convexity
The statement below assumes the convexity of the empirical loss with respect to the parameter of interest .
Assumption 8 (Convexity of Empirical Loss).
The empirical loss is a convex function of for any in some neighborhood of the true .
The convexity assumption above allows for a weaker -GRC requirement on the empirical oracle loss than in the non-convex case. In particular, convexity ensures that the error vector belongs to the restricted cone . Therefore, we require Assumptions 5 and 6 to hold on a set , as opposed to a -dimensional ball of radius around as in the non-convex case.
Moreover, observe that in such a cone,