Minimax Semiparametric Learning With Approximate Sparsity

Many objects of interest can be expressed as a linear, mean square continuous functional of a least squares projection (regression). Often the regression may be high dimensional, depending on many variables. This paper gives minimal conditions for root-n consistent and efficient estimation of such objects when the regression and the Riesz representer of the functional are approximately sparse and the sum of the absolute values of the coefficients is bounded. The approximately sparse functions we consider are those where an approximation by some t regressors has root mean square error less than or equal to Ct^{-ξ} for C, ξ > 0. We show that a necessary condition for efficient estimation is that the sparse approximation rate ξ_1 for the regression and the rate ξ_2 for the Riesz representer satisfy max{ξ_1, ξ_2} > 1/2. This condition is stronger than the corresponding condition ξ_1 + ξ_2 > 1/2 for Holder classes of functions. We also show that Lasso-based, cross-fit, debiased machine learning estimators are asymptotically efficient under these conditions. In addition, we show efficiency of an estimator without cross-fitting when the functional depends only on the regressors and the regression sparse approximation rate satisfies ξ_1 > 1/2.


1 Introduction

Many objects of interest can be expressed as a linear, mean square continuous functional of a least squares projection (regression) on a countable set of regressors. Important examples include the covariance between two regression residuals, a coefficient of a partially linear model, average derivatives, average consumer surplus bounds, and the average treatment effect. Often the regression may be high dimensional, depending on many random variables. There may be many covariates of interest for the covariance of two regression residuals or an average derivative. There are often many prices and covariates in the economic demand for some commodity. This variety of important examples motivates estimators of such objects when the regression is high dimensional.

This paper gives minimal conditions for root-n consistent and efficient estimation of such objects under approximate sparsity. We focus on models where the regressors each have second moment equal to one and the sum of the absolute values of the regression and Riesz representer coefficients is finite. The approximately sparse functions we consider are those where an approximation by some t regressors has root mean square error less than or equal to Ct^{-ξ} for C, ξ > 0. We show that a necessary condition for efficient estimation is that the sparse approximation rate ξ_1 for the regression and the rate ξ_2 for the Riesz representer of the linear functional satisfy max{ξ_1, ξ_2} > 1/2. We also show that Lasso-based, cross-fit, debiased machine learning (DML) estimators are optimal under these conditions, being root-n consistent and asymptotically efficient. We find that without cross-fitting these estimators also nearly have the best rate of convergence when max{ξ_1, ξ_2} ≤ 1/2. We also show efficiency without cross-fitting when the projection is a conditional expectation, the functional of interest depends only on the regressors, and ξ_1 > 1/2.

The approximately sparse specification we consider is fundamentally different from other nonparametric specifications such as Holder classes. Approximate sparsity does not require that the identity of the regressors that give the best sparse approximation (the "strong regressors") be known. Instead, approximate sparsity only requires that the strong regressors be included somewhere among the p regressors considered, where p can be much larger than the sample size n. In contrast, other nonparametric specifications, such as Holder classes, lead to strong regressors being prespecified, such as the leading terms in a wavelet basis. The flexibility allowed by approximate sparsity, in not having to specify which are the strong regressors, seems particularly valuable in high dimensional settings where there may be very many regressors, including interaction terms, and there is little knowledge about which regressors are strong. This flexibility motivates our interest in conditions for efficient learning under approximate sparsity.

Our results reveal important differences between the necessary conditions for efficient semiparametric estimation in approximately sparse and Holder classes of functions. The approximately sparse necessary condition max{ξ_1, ξ_2} > 1/2 is stronger than the Holder class necessary condition ξ_1 + ξ_2 > 1/2 that follows from Robins et al. (2009) for the expected conditional covariance and the mean with data missing at random. In this sense attaining asymptotic efficiency under approximate sparsity requires stricter conditions than in Holder classes. One might think of this as a cost for not knowing which regressors are the strong regressors. This cost is in addition to the well known extra log p term, where p is the number of regressors, that shows up in the minimax convergence rate for regressions under approximate sparsity; see Bickel, Ritov, and Tsybakov (2007) and Cai and Guo (2017).

Figure 1 illustrates the difference between the necessary conditions for asymptotic efficiency for approximately sparse and Holder classes of functions. The blue box gives the set {(ξ_1, ξ_2): max{ξ_1, ξ_2} ≤ 1/2} where the necessary conditions for efficient estimation are not satisfied under approximate sparsity. The red triangle gives the set {(ξ_1, ξ_2): ξ_1 + ξ_2 ≤ 1/2} where the necessary conditions for efficient estimation are not satisfied for a Holder class, as in Robins et al. (2009). Because the blue box contains the red triangle, the conditions for existence of an efficient estimator are stronger under approximate sparsity than for a Holder class. One can think of the difference between the blue box and the red triangle as a cost of not knowing which are the strong regressors. Stronger conditions are required under approximate sparsity, where the identity of the strong regressors is not known.

We also show that Lasso-based debiased machine learning estimators are asymptotically efficient under the minimal approximate sparsity condition max{ξ_1, ξ_2} > 1/2. The estimators we consider are special cases of the doubly robust estimators of a linear functional given in Chernozhukov et al. (2016). We base these estimators on Lasso regression and Lasso minimum distance learning of the Riesz representer as in Chernozhukov, Newey, and Singh (2018). The Dantzig learners of Chernozhukov et al. (2018) would also work. We show that with cross-fitting these estimators attain asymptotic efficiency under max{ξ_1, ξ_2} > 1/2 and under additional regularity conditions that are satisfied in the construction of the minimax bound. We also find that the convergence rate of these estimators is nearly the minimax rate when max{ξ_1, ξ_2} ≤ 1/2.

There is a close correspondence between the minimax rate and the behavior of remainder terms in an asymptotic expansion of a doubly robust estimator around the average of the efficient influence function. A dominating remainder term is the product of the mean square norms of the estimation errors for the regression and the Riesz representer. Other remainder terms will be smaller order than this term. By virtue of the sum of the absolute values of the regression and Riesz representer coefficients being bounded, the estimation errors for both the regression and the Riesz representer converge in mean square nearly at the n^{-1/4} rate, as known for Lasso regression from Chatterjee and Jafarov (2015) and for the Riesz representer from Chernozhukov et al. (2018) and Chernozhukov, Newey, and Singh (2018). The minimax rate for the object of interest when max{ξ_1, ξ_2} ≤ 1/2 is nearly the product of these two convergence rates, i.e. the size of the dominating remainder. When max{ξ_1, ξ_2} > 1/2 the mean square estimation error of either the regression or the Riesz representer will converge faster than n^{-1/4} by a power of n, so that the product of the regression and Riesz representer convergence rates is smaller than n^{-1/2}. Other remainder terms will also be o_p(n^{-1/2}), resulting in asymptotic efficiency. In Holder classes efficiency of a doubly robust estimator is not determined by the product of mean square rates for the regression and Riesz representer. A more refined remainder analysis is required for efficiency of a doubly robust estimator of the average conditional covariance for Holder classes when the product of these rates is not o(n^{-1/2}), as shown by Newey and Robins (2018).
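To fix ideas, the debiased moment function and the dominating remainder can be written as follows (generic notation, with γ_0 the regression, α_0 the Riesz representer, and θ_0 the object of interest; this is a sketch of the standard argument rather than the exact expansion used in the proofs):

$$ \psi(w;\theta,\gamma,\alpha) = m(w,\gamma) - \theta + \alpha(x)\{y - \gamma(x)\}, $$

$$ E[\psi(w;\theta_0,\hat\gamma,\hat\alpha)] = -E[\{\hat\gamma(x)-\gamma_0(x)\}\{\hat\alpha(x)-\alpha_0(x)\}], \qquad \big|E[\psi(w;\theta_0,\hat\gamma,\hat\alpha)]\big| \le \|\hat\gamma-\gamma_0\|\,\|\hat\alpha-\alpha_0\|, $$

where the expectation treats the estimators γ̂ and α̂ as fixed (which cross-fitting makes legitimate), the equality uses the Riesz representation E[m(w,γ)] = E[α_0(x)γ(x)] together with the orthogonality of y − γ_0(x) to the approximating space, and the inequality is Cauchy-Schwarz.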

An important feature of our estimation results is that we do not require that the regression be a conditional expectation. We only require that the regression and the Riesz representer are mean square projections on a dictionary of functions. This is important for our efficient estimation results which are based on the semiparametric efficiency bound for linear functionals of a projection given in Chernozhukov et al. (2019). The semiparametric efficiency bound may be different when the projection is required to be a conditional expectation, similarly to Chamberlain’s (1992) analysis of semiparametric regression. In only requiring that the regression is a projection our estimation results generalize those of Chernozhukov, Newey, and Singh (2018). We also generalize Chernozhukov, Newey, and Singh (2019) in allowing for an unbounded Riesz representer, which is important for the Gaussian regressor case used in the derivation of the necessary condition for efficient estimation.

We also consider the role of cross-fitting in efficient estimation. We find that when max{ξ_1, ξ_2} ≤ 1/2 the Lasso-based debiased machine learning estimator attains nearly the optimal rate without cross-fitting. This feature of estimation is different from that for Holder classes, where cross-fitting can improve remainder rates so that asymptotic efficiency is attained under weaker conditions than without cross-fitting; see Newey and Robins (2018). In addition we show efficiency of an estimator without cross-fitting when the functional of interest depends only on the regressors and the regression sparse approximation rate satisfies ξ_1 > 1/2.

The approximately sparse specification we consider is a special case of those of Belloni et al. (2012) and Belloni, Chernozhukov, and Hansen (2014). The class we consider, where an approximation by some t regressors has root mean square error less than or equal to Ct^{-ξ} for C, ξ > 0, turns out to be particularly well suited for deriving necessary conditions for efficient estimation.

The debiased machine learning estimators we consider are based on the zero derivative of the estimating equation with respect to each nonparametric component, as in Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), and Robins et al. (2013). This kind of debiasing is different from bias correcting the regression learner, as in Zhang and Zhang (2014), Belloni, Chernozhukov, and Wang (2014), Belloni, Chernozhukov, and Kato (2015), Javanmard and Montanari (2014a,b; 2015), van de Geer et al. (2014), Neykov et al. (2015), Ren et al. (2015), Jankova and van de Geer (2015, 2016a,b), Bradic and Kolar (2017), and Zhu and Bradic (2018). These two debiasing approaches bear some resemblance when the functional of interest is a coefficient of a partially linear model (as discussed in Chernozhukov et al., 2018), but are quite different for other functionals.

The functionals we consider are different from those analyzed in Cai and Guo (2017). The continuity properties of the functionals we consider provide additional structure that we exploit, namely the Riesz representer, an object that was not considered in Cai and Guo (2017). Targeted maximum likelihood (van der Laan and Rubin, 2006) based on machine learners has been considered by van der Laan and Rose (2011), with large sample theory given by Luedtke and van der Laan (2016), Toth and van der Laan (2016), and Zheng et al. (2016). The DML learners here are relatively simple to implement and analyze and directly target the functionals of interest.

Mean square continuity of the functional of interest does place us squarely in a semiparametric setting where root-n consistent, efficient semiparametric estimation of the object of interest is possible under sufficient regularity conditions; see Jankova and van de Geer (2016a). Our results apply to different objects than those considered by Ning and Liu (2017), who considered machine learning of the efficient score for a parameter of an explicit semiparametric pdf of the data.

Mackey, Syrgkanis, and Zadik (2018) showed that weak sparsity conditions would suffice for root-n consistency of a certain estimator of a partially linear conditional mean when certain variables are independent and non-Gaussian. The estimator given there will not be consistent for the objects and model we consider.

In Section 2 we describe the objects we are interested in. Section 3 gives the minimal conditions for asymptotic efficiency. Section 4 shows that DML estimators are asymptotically efficient under these minimal conditions. Section 5 concludes.

2 Linear Functionals of a Regression

To describe the objects of interest let w denote a data observation and consider a subvector (y, x')' of w, where y is a scalar outcome with finite second moment and x is a covariate vector that takes values in a set X. Let b(x) = (b_1(x), b_2(x), ...)' be a dictionary of functions of the covariates, with each dictionary element b_j(x) having second moment equal to one. Let Γ denote the closure in mean square of the set of linear combinations of the dictionary functions. Denote the least squares projection of y on Γ as

γ_0 = argmin_{γ ∈ Γ} E[{y − γ(x)}^2].

Here γ_0(x) equals the conditional expectation E[y|x] when Γ is the set of all measurable functions of x with finite second moment. In this paper we focus primarily on the case where γ_0 is the projection and do not require that γ_0 be the conditional expectation.

To describe the object of interest let m(w, γ) denote a linear functional of a possible projection γ that depends on a data observation w. The object of interest is

θ_0 = E[m(w, γ_0)].     (2.1)

We focus on functionals where E[m(w, γ)] is a mean square continuous linear functional of γ ∈ Γ. This continuity property is equivalent to the semiparametric variance bound for θ_0 being finite, as discussed in Newey (1994). In this case, the Riesz representation theorem implies existence of α_0 ∈ Γ such that for all γ ∈ Γ,

E[m(w, γ)] = E[α_0(x)γ(x)].

We refer to α_0(x) as the Riesz representer (RR).

There are many important examples of this type of object. A leading example for our results is the average product of two projections, where the observation includes two outcomes y_1 and y_2, γ_j0 denotes the least squares projection of y_j on Γ (j = 1, 2), and θ_0 = E[γ_10(x)γ_20(x)]. This object is part of the covariance between two projection residuals,

E[{y_1 − γ_10(x)}{y_2 − γ_20(x)}] = E[y_1 y_2] − E[γ_10(x)γ_20(x)],

where the equality follows by orthogonality of y_j − γ_j0(x) and Γ. This object is useful in the analysis of covariance while controlling for the regressors in x. Here θ_0 is the part of the covariance that depends on unknown functions.
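One convenient way to cast this example in the framework of equation (2.1) (a sketch consistent with the description above rather than the paper's exact display) is to take

$$ m(w,\gamma) = y_1\,\gamma(x), \qquad \gamma_0 = \gamma_{20}, \qquad E[m(w,\gamma)] = E[\gamma_{10}(x)\,\gamma(x)] \ \text{ for all } \gamma \in \Gamma, $$

so that the Riesz representer is α_0 = γ_10 and the debiased moment function of Section 4 becomes y_1γ_20(x) + γ_10(x){y_2 − γ_20(x)} − θ_0.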

Another interesting example is a weighted average derivative, given by

θ_0 = E[v(x) ∂γ_0(x)/∂x_1],

where v(x) is a weight function and we assume that γ_0(x) is differentiable in x_1. This object summarizes the local effect of one of the regressors on the regression function. Here m(w, γ) = v(x) ∂γ(x)/∂x_1. By integration by parts and projection on Γ, the RR is the projection on Γ of −f(x)^{-1} ∂{v(x)f(x)}/∂x_1, where f(x) is the density of x.
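For completeness, the integration by parts step behind this formula can be sketched as follows (assuming x_1 is continuously distributed and v(x)f(x) vanishes on the boundary of the support):

$$ E\!\left[v(x)\frac{\partial\gamma(x)}{\partial x_1}\right] = \int v(x)\frac{\partial\gamma(x)}{\partial x_1}f(x)\,dx = -\int \gamma(x)\frac{\partial\{v(x)f(x)\}}{\partial x_1}\,dx = E\!\left[-\frac{1}{f(x)}\frac{\partial\{v(x)f(x)\}}{\partial x_1}\,\gamma(x)\right], $$

so for γ in Γ the RR is the projection of this weight on Γ.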

An example from economics is a bound on average consumer surplus. Here y is the share of income spent on a commodity and x = (p_1, z')', where p_1 is the price of the commodity and z includes income z_1, prices of other goods, and other observable variables affecting utility. Let p̌_1 and p̄_1 be lower and upper prices over which the price of the commodity can change, B a bound on the income effect, and v(z) some weight function. The object of interest is

θ_0 = E[ v(z) ∫_{p̌_1}^{p̄_1} (z_1/u) γ_0(u, z) e^{−B(u − p̌_1)} du ],

where z_1 is income and u is a variable of integration. When individual heterogeneity in consumer preferences is independent of x and B is a lower (upper) bound on the derivative of consumption with respect to income across all individuals, then θ_0 is an upper (lower) bound on the weighted average over consumers of exact consumer surplus (equivalent variation) for a change in the price of the first good from p̌_1 to p̄_1; see Hausman and Newey (2016). Here m(w, γ) = v(z) ∫_{p̌_1}^{p̄_1} (z_1/u) γ(u, z) e^{−B(u − p̌_1)} du and the RR is

α_0(x) = v(z) 1{p̌_1 ≤ p_1 ≤ p̄_1} (z_1/p_1) e^{−B(p_1 − p̌_1)} / f(p_1|z),

where f(p_1|z) is the conditional pdf of p_1 given z.
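A sketch of why this is the RR (a change of measure argument; it presumes f(p_1|z) > 0 on [p̌_1, p̄_1], that Γ contains all square-integrable functions of x, and it ignores any additional factors in the exact Hausman and Newey (2016) formula):

$$ E\!\left[v(z)\int_{\check p_1}^{\bar p_1}\frac{z_1}{u}\,\gamma(u,z)\,e^{-B(u-\check p_1)}\,du\right] = E\!\left[\frac{v(z)\,1\{\check p_1\le p_1\le \bar p_1\}\,(z_1/p_1)\,e^{-B(p_1-\check p_1)}}{f(p_1\mid z)}\,\gamma(p_1,z)\right], $$

which has the form E[α_0(x)γ(x)] with α_0 as displayed above.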

3 A Convergence Rate Lower Bound

We now introduce the parameter space for approximately sparse models. For any constants C, ξ > 0, we define the approximately sparse class as the set of coefficient vectors whose absolute values sum to at most C and whose best t-term approximation has root mean square error at most Ct^{-ξ} for every t (see the display below). The construction of this notion of approximate sparsity is motivated by the series approximation idea. Consider a Holder class of functions. Under standard approximation theory, functions in such a class admit a series expansion (using an appropriate basis) whose first t terms approximate the function with root mean square error bounded by a multiple of t^{-ξ}, where ξ is determined by the smoothness and the dimension. In this case the vector of coefficients belongs to the approximately sparse class with rate ξ. Hence, the approximately sparse class extends the notion of a Holder class with approximation rate ξ. In particular, approximate sparsity assumes the existence of a best t-sparse approximation without specifying the order/direction/location of this approximation. Notice that the approximately sparse class shrinks as ξ increases; similarly, the Holder class shrinks as the order increases.
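One way to formalize the class just described, with b = (b_1, b_2, ...)' the dictionary of Section 2 and ∥·∥ the mean square norm (the label A(ξ, C) is used here only for reference), is

$$ \mathcal{A}(\xi, C) = \Big\{\, \gamma = b'\beta \;:\; \sum_{j}|\beta_j| \le C \ \text{ and }\ \min_{\|\tilde\beta\|_0 \le t}\ \big\|\gamma - b'\tilde\beta\big\| \le C\,t^{-\xi}\ \text{ for all integers } t \ge 1 \,\Big\}, $$

combining an ℓ_1 bound on the coefficients with the requirement that some (unspecified) t regressors approximate γ with root mean square error at most Ct^{-ξ}.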

Henceforth we let ξ_1 denote the approximately sparse approximation rate for the regression and ξ_2 the approximately sparse rate for the Riesz representer of the functional of interest.

3.1 Expected conditional covariance

We observe i.i.d. data w_i = (y_{1i}, y_{2i}, x_i')'. We consider the expected conditional covariance; we are interested in E[Cov(y_1, y_2 | x)]. Clearly,

E[Cov(y_1, y_2 | x)] = E[y_1 y_2] − E[E(y_1|x)E(y_2|x)].

The first term can always be estimated at the rate n^{-1/2}. We focus on the rate for the second term, as in Section 2. We now show that even if the data is known to be jointly Gaussian with mean zero and the covariance of x is known to be the identity matrix, the requirement of max{ξ_1, ξ_2} > 1/2 is necessary, where ξ_1 and ξ_2 are the approximate sparsity rates of the two regressions E(y_1|x) and E(y_2|x).

Assumption 1

Suppose that w = (y_1, y_2, x')' is jointly Gaussian with mean zero and E[xx'] is the identity matrix. Write E(y_1|x) = x'β_1 and E(y_2|x) = x'β_2. Moreover, the distribution of the data is indexed by β = (β_1, β_2), where β_1 and β_2 are the coefficient vectors of the two regressions.

Under Assumption 1, we focus on the second term

E[E(y_1|x)E(y_2|x)] = β_1'β_2.

For any constants C, ξ_1, ξ_2 > 0, we define the parameter space as the set of β = (β_1, β_2) such that β_1 is approximately sparse with rate ξ_1 and β_2 is approximately sparse with rate ξ_2, where the ℓ_1 bounds and approximation constants are given by C. For β = (β_1, β_2) in this parameter space, we define the functional θ(β) = β_1'β_2.

Let I(Θ) be the set of confidence intervals for θ(β) that are valid uniformly over a parameter space Θ. We are interested in the shortest expected length achievable over a space Θ_1 by confidence intervals that are valid over a possibly larger space Θ_2 (see the display below). If this benchmark depends on Θ_2 instead of Θ_1, then there is no adaptivity between Θ_1 and Θ_2. If Θ_1 = Θ_2, then the benchmark is the minimax expected length over Θ_2. The primary goal is to study this object with Θ_1 and Θ_2 being approximately sparse parameter spaces whose rates satisfy ξ_1' ≥ ξ_1 and ξ_2' ≥ ξ_2, where (ξ_1', ξ_2') are the rates for Θ_1 and (ξ_1, ξ_2) are the rates for Θ_2. This means Θ_1 ⊆ Θ_2. We will assume that we are in a high-dimensional setting by imposing the condition that there exists a constant c > 0 such that p ≥ n^{1+c}.
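In symbols, following the adaptivity framework of Cai and Guo (2017) (the notation L* and I_α below is ours, introduced only to make the benchmark explicit):

$$ L^*(\Theta_1,\Theta_2) \;=\; \inf_{\mathrm{CI}\in \mathcal I_\alpha(\Theta_2)}\ \sup_{\beta\in\Theta_1} E_\beta\big[\mathrm{length}(\mathrm{CI})\big], $$

where I_α(Θ_2) is the set of confidence intervals with coverage at least 1 − α uniformly over Θ_2. Adaptivity between Θ_1 ⊆ Θ_2 would mean that L*(Θ_1, Θ_2) is of smaller order than L*(Θ_2, Θ_2).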


Theorem 1: Let Assumption 1 hold. Consider approximately sparse parameter spaces Θ_1 ⊆ Θ_2 with rates as above. Assume that there exists a constant c > 0 such that p ≥ n^{1+c}. If max{ξ_1, ξ_2} ≤ 1/2, then the benchmark expected length over Θ_1 of confidence intervals that are valid over Θ_2 is bounded below by a multiple of a rate that is of larger order than n^{-1/2} and depends only on (ξ_1, ξ_2), where the multiplying constant depends only on the constants introduced above.


Theorem 1 has two important implications. First, max{ξ_1, ξ_2} > 1/2 is a necessary condition for obtaining the parametric rate n^{-1/2}. When we choose Θ_1 = Θ_2, the benchmark is the minimax expected length over Θ_2, and hence Theorem 1 implies that this minimax length is of larger order than n^{-1/2}. This means that when max{ξ_1, ξ_2} ≤ 1/2, the parametric rate for estimation is impossible; if this were possible, then one can construct a confidence interval with expected width of order n^{-1/2} by simply choosing an interval that centers at this n^{-1/2}-consistent estimator with radius of the order n^{-1/2}.

Second, Theorem 1 implies that adaptivity to the rate of approximation is not possible. Notice that in Theorem 1 the lower bound only depends on the rates (ξ_1, ξ_2) of the larger space Θ_2 and has nothing to do with the rates (ξ_1', ξ_2') of the smaller space Θ_1. This means that any confidence interval that is valid over Θ_2 with max{ξ_1, ξ_2} ≤ 1/2 cannot have expected width of order n^{-1/2} even at points in a smaller parameter space Θ_1, no matter how small Θ_1 is. Hence, there does not exist a confidence interval that satisfies both of the following properties: (1) being valid over Θ_2 with max{ξ_1, ξ_2} ≤ 1/2 and (2) having expected width of order n^{-1/2} on a smaller (potentially much smoother) space Θ_1. One implication is that it is not possible to distinguish between max{ξ_1, ξ_2} ≤ 1/2 and max{ξ_1', ξ_2'} > 1/2 from the data. Consequently, in order to obtain the root-n rate on the smaller space, the condition max{ξ_1', ξ_2'} > 1/2 cannot be tested in the data.

It is also worth noting that there is no adaptivity between the ordered class and the non-ordered class. The ordered class has the same setting, except that the requirement that some t regressors achieve approximation error Ct^{-ξ} is replaced by the requirement that the first t dictionary elements do so (see the display below). The ordered class is directly related to the Holder class, for which the approximation error from including the first few terms can be controlled. In contrast, the non-ordered class only requires that the approximation error be controlled once some t terms are included, without specifying which terms.
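In the same illustrative notation as above, the ordered counterpart is

$$ \mathcal{A}^{\mathrm{ord}}(\xi, C) = \Big\{\, \gamma = b'\beta \;:\; \sum_j |\beta_j| \le C \ \text{ and }\ \Big\|\gamma - \sum_{j \le t}\beta_j b_j\Big\| \le C\,t^{-\xi}\ \text{ for all integers } t \ge 1 \,\Big\}, $$

so that A^{ord}(ξ, C) ⊆ A(ξ, C) for every ξ and C.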

To see the lack of adaptivity between the ordered class and the non-ordered class, simply notice that every ordered class is contained in the corresponding non-ordered class. It is not hard to see that the proof of Theorem 1 still holds when the smaller space Θ_1 is taken to be an ordered class. Therefore, the conclusion of Theorem 1 remains valid when Θ_1 is replaced by any ordered class. This lack of adaptivity means that when we are given an ordering scheme, it is not possible to test this scheme for the purpose of improving inference efficiency for the expected conditional covariance once max{ξ_1, ξ_2} ≤ 1/2.

3.2 Partial linear models and average derivatives

Suppose that we observe independent copies of w = (y, d, x')' from

y = dθ + x'β + ε,     (3.1)

where θ is a scalar parameter, d is a scalar regressor, and ε has mean zero and is independent of (d, x). Assume that (d, x')' is jointly Gaussian with mean zero and E[xx'] equal to the identity matrix, and write d = x'δ + u with u independent of x. Hence, the distribution of the data is indexed by (θ, β, δ) together with the error variances. For constants C, ξ_1, ξ_2 > 0, we define the following parameter space: the set of such parameters for which β is approximately sparse with rate ξ_1, δ is approximately sparse with rate ξ_2, and the remaining parameters are bounded by a constant C_1, where C_1 is a constant. (Other constants such as C are the same as before.)

We notice that the conditional covariance model can be written in the partial linear form. Assume that (y_1, y_2, x) has the distribution indexed by β = (β_1, β_2) as in Assumption 1. Then by straightforward algebra, we can see that the data can be written as in (3.1) with y = y_1 and d = y_2, where θ = Cov(y_1, y_2|x)/Var(y_2|x), the coefficient vector in (3.1) is β_1 − θβ_2, and ε = y_1 − θy_2 − x'(β_1 − θβ_2). It turns out that this relationship allows us to translate the lower bound in Theorem 1 to a lower bound for partial linear models.
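The algebra is the usual Gaussian conditioning step (a sketch under Assumption 1):

$$ E[y_1 \mid y_2, x] = x'\beta_1 + \theta\,(y_2 - x'\beta_2), \qquad \theta = \frac{\mathrm{Cov}(y_1, y_2 \mid x)}{\mathrm{Var}(y_2 \mid x)}, $$

so that y_1 = θy_2 + x'(β_1 − θβ_2) + ε with ε = y_1 − E[y_1 | y_2, x] independent of (y_2, x) by joint normality.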


Theorem 2: Let ξ_1 and ξ_2 satisfy max{ξ_1, ξ_2} ≤ 1/2. Consider the model in (3.1) with parameters in the space defined above. Assume that there exists a constant c > 0 such that p ≥ n^{1+c}. Then the expected length of any confidence interval in the set of confidence intervals for θ in (3.1) with uniform coverage over this parameter space is bounded below by a multiple of a rate of larger order than n^{-1/2}.


By Theorem 2 the condition max{ξ_1, ξ_2} > 1/2 is also a necessary condition for attaining the root-n rate in partial linear models. The same adaptivity discussions apply. We would also like to point out that although Theorems 1 and 2 bound the expected length of confidence intervals, the rates are not due to the possibility of the length taking extreme values with a small probability, and in fact stronger results are proved in the appendix. For example, the bound on the expected length can be replaced by the statement that the length exceeds the same rate with probability bounded below by a constant.

We would like to point out that the average derivative is a harder problem than the partial linear model, and hence the lower bound in Theorem 2 applies to the average derivative problem. To see this, consider a regression function of (d, x). A special case is when the partial derivative with respect to d is constant. In this special case, the average derivative problem becomes learning a coefficient in a partial linear model. By Theorem 2, even in this special problem, max{ξ_1, ξ_2} > 1/2 is a necessary condition for attaining the parametric rate. Therefore, in general, one needs to impose max{ξ_1, ξ_2} > 1/2 to obtain the root-n rate for the average derivative problem.

In this Section we have shown that when max{ξ_1, ξ_2} ≤ 1/2 an estimator of the object of interest can converge no faster than the rate of Theorem 1. In the next Section we give estimators that attain root-n consistency and the semiparametric efficiency bound when max{ξ_1, ξ_2} > 1/2. In this way the next Section will show that the attainable rate of convergence is n^{-1/2} when max{ξ_1, ξ_2} > 1/2 and is nearly the rate of Theorem 1 when max{ξ_1, ξ_2} ≤ 1/2.

4 Asymptotic Efficiency of Debiased Machine Learning

We consider debiased machine learners (DML) of θ_0 like those of Chernozhukov, Newey, and Singh (2018) under approximate sparsity, where γ_0 is estimated by Lasso regression and α_0 by Lasso minimum distance. Let the data be w_1, ..., w_n, assumed to be i.i.d. Let I_1, ..., I_L be a partition of the observation index set {1, ..., n} into L distinct subsets of about equal size. Let γ̂_ℓ(x) = b(x)'β̂_ℓ and α̂_ℓ(x) = b(x)'ρ̂_ℓ be estimators constructed from the observations that are not in I_ℓ as follows. The Lasso regression estimator is given by

β̂_ℓ = argmin_β { (1/n_ℓ) Σ_{i∉I_ℓ} [y_i − b(x_i)'β]^2 + 2r_ℓ Σ_j |β_j| },     (4.1)

where r_ℓ is a penalty degree, n_ℓ is the number of observations not in I_ℓ, and we will make assumptions about r_ℓ below. Let M̂_ℓ be a vector with jth component M̂_ℓj = (1/n_ℓ) Σ_{i∉I_ℓ} m(w_i, b_j). The Lasso minimum distance estimator of α_0 has the form

ρ̂_ℓ = argmin_ρ { −2M̂_ℓ'ρ + ρ'Ĝ_ℓρ + 2r̃_ℓ Σ_j |ρ_j| },   Ĝ_ℓ = (1/n_ℓ) Σ_{i∉I_ℓ} b(x_i)b(x_i)',     (4.2)

as given in Chernozhukov, Newey, and Singh (2018). The estimator of θ_0 is then given by

θ̂ = (1/n) Σ_{ℓ=1}^{L} Σ_{i∈I_ℓ} { m(w_i, γ̂_ℓ) + α̂_ℓ(x_i)[y_i − γ̂_ℓ(x_i)] }.     (4.3)

Here we give sufficient conditions for θ̂ to be asymptotically efficient, meaning that

√n (θ̂ − θ_0) = (1/√n) Σ_{i=1}^{n} ψ_0(w_i) + o_p(1),   ψ_0(w) = m(w, γ_0) − θ_0 + α_0(x)[y − γ_0(x)].     (4.4)

Here ψ_0(w) is the efficient influence function of the object θ_0 when γ_0 is a least squares projection, as shown by Chernozhukov, Newey, Robins, and Singh (2019). Because θ_0 is nonparametric, being a functional of a distribution that is unrestricted except for regularity conditions, this influence function is unique.
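As a concrete illustration, the following is a minimal Python sketch of the cross-fit estimator in equations (4.1)-(4.3). It is not the authors' code: the penalty levels, the simple coordinate descent solver used for the minimum distance step, the choice of dictionary, and the simulated expected conditional covariance example are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_minimum_distance(G, M, penalty, n_iter=500):
    """Coordinate descent for min_rho rho'G rho - 2 M'rho + 2*penalty*||rho||_1,
    a Lasso minimum distance problem for the Riesz representer coefficients."""
    p = M.shape[0]
    rho = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = M[j] - G[j] @ rho + G[j, j] * rho[j]   # partial residual without coordinate j
            rho[j] = np.sign(r) * max(abs(r) - penalty, 0.0) / G[j, j]
    return rho

def dml_linear_functional(y, B, mB, n_folds=5, reg_gamma=0.05, reg_alpha=0.05, seed=0):
    """Cross-fit debiased estimate of theta_0 = E[m(w, gamma_0)].

    y  : (n,) outcome entering the regression step
    B  : (n, p) dictionary b(x_i) evaluated at the covariates
    mB : (n, p) array with entry (i, j) = m(w_i, b_j), so m(w_i, b'c) = mB[i] @ c
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    psi = np.zeros(n)
    for idx in folds:
        out = np.ones(n, dtype=bool)
        out[idx] = False                              # observations not in the held-out fold
        # (4.1) Lasso regression of y on the dictionary
        beta = Lasso(alpha=reg_gamma, fit_intercept=False, max_iter=10000).fit(B[out], y[out]).coef_
        # (4.2) Lasso minimum distance estimate of the Riesz representer
        G = B[out].T @ B[out] / out.sum()
        M = mB[out].mean(axis=0)
        rho = lasso_minimum_distance(G, M, reg_alpha)
        # (4.3) doubly robust moment evaluated on the held-out fold
        psi[idx] = mB[idx] @ beta + (B[idx] @ rho) * (y[idx] - B[idx] @ beta)
    theta = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(n)                 # influence-function standard error
    return theta, se

# Illustrative use: the expected conditional covariance term E[gamma_10(x) gamma_20(x)]
# with m(w, gamma) = y1 * gamma(x), so that mB[i, j] = y1_i * b_j(x_i).
rng = np.random.default_rng(1)
n, p = 1000, 100
X = rng.normal(size=(n, p))                           # dictionary = the regressors themselves
y1 = X[:, 0] + rng.normal(size=n)                     # gamma_10(x) = x_1
y2 = 0.5 * X[:, 0] + rng.normal(size=n)               # gamma_20(x) = 0.5 x_1, so theta_0 = 0.5
theta_hat, se = dml_linear_functional(y2, X, X * y1[:, None])
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}")
```

The example uses m(w, γ) = y_1γ(x), so the debiased moment in (4.3) is y_1γ̂(x) + α̂(x){y_2 − γ̂(x)}; estimating a different linear functional only changes how the array mB is computed.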

We make the following assumption about the dictionary and the functional evaluated at the elements of the dictionary:


Assumption 2: The dictionary elements b_j(x), the Riesz representer α_0(x), and the functional values m(w, b_j) are uniformly subgaussian, and there is C > 0 such that E[m(w, b_j)^2] ≤ C log(n) for all j.


The log(n) term can be replaced by any positive number that goes to infinity with the sample size. We include such a term for simplicity. It could be dropped for some of the results with a modification to include statements that certain remainder events happen with small probability.

We also impose a slightly stronger condition than mean square continuity of E[m(w, γ)] in γ, as well as some moment existence conditions.


Assumption 3: i) There is C > 0 such that E[m(w, γ)^2] ≤ C E[γ(x)^2] for all γ ∈ Γ; ii) there are constants q and C such that E[|m(w, b_j)|^q] ≤ C uniformly in j, such that y and γ_0(x) have finite qth moments, and such that α_0(x) has finite qth moment.


The moment boundedness and existence conditions in this Assumption are automatically satisfied if m(w, b_j) has, uniformly in j, bounded moments of all orders and y, γ_0(x), and α_0(x) have moments of all orders, as they do in the Gaussian case in the lower bound.

The following is a useful bias condition that will be satisfied under approximate sparsity. For a function a(x) let ||a|| denote the mean square norm ||a|| = sqrt(E[a(x)^2]).


Assumption 4: There are C > 0 and coefficient vectors β̄ and ρ̄ with Σ_j |β̄_j| ≤ C and Σ_j |ρ̄_j| ≤ C such that ||γ_0 − b'β̄|| and ||α_0 − b'ρ̄|| both vanish faster than n^{-1/4}.


When there is β̄ with bounded ℓ_1 norm such that ||γ_0 − b'β̄|| shrinks faster than some power of 1/p, then the rate condition for γ_0 will be satisfied when p grows faster than some power of n, and similarly for α_0. For example, if ||γ_0 − b'β̄|| ≤ Cp^{-ξ_1} then the rate condition for γ_0 is satisfied if p grows faster than n^{1/(4ξ_1)}.

Let β denote a coefficient vector and B_C denote the subset of coefficient vectors β such that Σ_j |β_j| ≤ C. Let β_L and ρ_L denote the population Lasso approximations, i.e. the minimizers over B_C of the population analogs of the Lasso objectives in equations (4.1) and (4.2). The β_L and ρ_L are population Lasso approximations to the true coefficient vectors. These approximations will generally be sparse, with the number of nonzero elements growing at rates that are determined by the degree of approximate sparsity, as in Chernozhukov, Newey, and Robins (2018).

In the next two assumptions we impose approximate sparsity and a population sparse eigenvalue condition for either the regression or Riesz representer.


Assumption 5: γ_0 is approximately sparse with ξ_1 > 1/2, and there is c > 0 such that for all large enough n the population Gram matrix E[b(x)b(x)'] satisfies a sparse eigenvalue condition: its submatrices corresponding to index sets whose size is of the order of the number of nonzero elements of β_L have smallest eigenvalue at least c.


Assumption 6: α_0 is approximately sparse with ξ_2 > 1/2, and there is c > 0 such that for all large enough n the population Gram matrix E[b(x)b(x)'] satisfies a sparse eigenvalue condition: its submatrices corresponding to index sets whose size is of the order of the number of nonzero elements of ρ_L have smallest eigenvalue at least c.


The next result shows efficiency of DML and gives a convergence rate for the case where max{ξ_1, ξ_2} ≤ 1/2.


Theorem 3: Suppose that Assumptions 2-4 are satisfied. If either Assumption 5 or Assumption 6 is satisfied, then θ̂ is asymptotically efficient, i.e. equation (4.4) holds.

If neither Assumption 5 nor Assumption 6 is satisfied, then θ̂ converges at a rate that is within a log(n) factor of the minimax rate of Section 3.


Here we see that θ̂ is a semiparametric efficient learner under the regularity conditions of Assumptions 2-4 and the minimal approximate sparsity condition in either Assumption 5 or 6. These Assumptions also include a population sparse eigenvalue condition that we take as a regularity condition for efficient estimation. This condition is automatically satisfied in the orthonormal Gaussian regressor case used in the derivation of the lower bound. The other regularity conditions are also satisfied in that case, so that the asymptotic efficiency result of Theorem 3 is sharp.

This result improves upon those of Chernozhukov, Newey, and Singh (2018) in only requiring the regression to be a projection and in allowing γ_0(x) and α_0(x) to be unbounded. Allowing for such unbounded γ_0 and α_0 is necessary for the result to cover the model used in the construction of the lower bound of Section 3, where the Riesz representer is Gaussian. The specification of γ_0 as a projection rather than a conditional expectation means that heteroskedasticity need not be corrected for to obtain an efficient semiparametric estimator of regression functionals, and it explicitly allows for misspecification, with the projection being the best least squares approximation to the conditional expectation. In such a misspecified case θ_0 can be interpreted as a pseudo true value that is the functional of interest evaluated at the projection.

The convergence rate attained by θ̂ when max{ξ_1, ξ_2} ≤ 1/2, i.e. without Assumption 5 or 6, is slower than the minimax rate of Section 3 by a log(n) factor. Here log(n) could be replaced by any sequence going to infinity, so that the rate can be arbitrarily close to the minimax rate.

The source of the rate in the second conclusion of Theorem 3 is the product remainder ||γ̂_ℓ − γ_0|| ||α̂_ℓ − α_0|| that appears in the proof of Theorem 3. All other remainder terms have order smaller than this product. The fact that this product remainder leads to the minimax rate is different from the Holder smooth case. Newey and Robins (2018) obtained better rates for the Holder smooth case by working directly with the remainder E[{γ̂_ℓ(x) − γ_0(x)}{α̂_ℓ(x) − α_0(x)}] rather than the upper bound obtained by applying the Cauchy-Schwarz inequality to it.

Cross-fitting is not as vital to attaining the best rate of convergence in the approximately sparse case as it is for Holder classes. As shown by Newey and Robins (2018) for the Holder case, cross-fitting reduces the size of remainder terms and results in asymptotic efficiency in important cases. It turns out that cross-fitting is not necessary to attain nearly the best rate for the approximately sparse case when max{ξ_1, ξ_2} ≤ 1/2. To demonstrate this we consider a DML estimator without cross-fitting. Let γ̂ and α̂ be exactly as described above except that they are estimated from the whole sample rather than the observations not in I_ℓ. The estimator without cross-fitting is given by

θ̃ = (1/n) Σ_{i=1}^{n} { m(w_i, γ̂) + α̂(x_i)[y_i − γ̂(x_i)] }.
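Continuing the illustrative Python sketch given after equation (4.4), the no-cross-fitting version simply fits both nuisance estimators on the full sample (again only a sketch, reusing the simulated X, y1, y2 and the helper lasso_minimum_distance defined there):

```python
# No-cross-fitting variant: full-sample Lasso regression and Riesz representer.
beta_full = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000).fit(X, y2).coef_
rho_full = lasso_minimum_distance(X.T @ X / n, (X * y1[:, None]).mean(axis=0), 0.05)
theta_tilde = np.mean((X * y1[:, None]) @ beta_full
                      + (X @ rho_full) * (y2 - X @ beta_full))
```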


Theorem 4: If Assumptions 2-4 are satisfied, then θ̃ converges at the rate given in the second conclusion of Theorem 3, i.e. within a log(n) factor of the minimax rate of Section 3.


There are many interesting examples of θ_0 where m(w, γ) depends only on the regressors, so that m(w, γ) = m(x, γ) for all w. These examples include the bound on average surplus and the average derivative. It turns out that cross-fitting is not necessary for asymptotic efficiency in these cases when the regression is a conditional expectation and ξ_1 > 1/2. Let γ̂ and α̂ be exactly as described above except that they are estimated from the whole sample rather than the observations not in I_ℓ.


Theorem 5: If m(w, γ) depends only on x for all γ, Assumptions 2-5 are satisfied, and α_0(x) is bounded, then θ̃ is asymptotically efficient, i.e. equation (4.4) holds with θ̃ in place of θ̂.