# Minimax rates of ℓ_p-losses for high-dimensional linear regression models with additive measurement errors over ℓ_q-balls

We study minimax rates of estimation for high-dimensional linear regression with additive measurement errors under the ℓ_p-losses (1 ≤ p < ∞), where the regression parameter is weakly sparse. Our lower and upper bounds agree up to constant factors, implying that the proposed estimator is minimax optimal.


## 1 Introduction

Consider the standard linear regression model

$$
y_i = \langle \beta^*, X_{i\cdot} \rangle + \epsilon_i, \quad \text{for } i = 1, 2, \dots, m, \tag{1}
$$

where $\beta^* \in \mathbb{R}^n$ is the unknown parameter, $\epsilon_i$ is the observation noise, and $\{(X_{i\cdot}, y_i)\}_{i=1}^m$ are i.i.d. observations, which are assumed to be fully observed in standard formulations. However, this assumption is not realistic for many applications, in which the covariates can only be measured imprecisely and one can only observe the pairs $\{(Z_{i\cdot}, y_i)\}_{i=1}^m$ instead, where the $Z_{i\cdot}$'s are corrupted versions of the corresponding $X_{i\cdot}$'s; see, e.g., Carroll et al. (2006). This is known as the measurement error model in the literature.
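To make the setting concrete, the following sketch simulates model (1) with additively corrupted covariates. All dimensions and noise levels are illustrative choices, not values from the paper.

```python
import numpy as np

# Simulate y_i = <beta*, X_i.> + eps_i while observing only Z = X + W.
# Sizes and noise levels below are arbitrary illustrative assumptions.
rng = np.random.default_rng(0)
m, n = 200, 50
sigma_x, sigma_w, sigma_eps = 1.0, 0.5, 0.1

beta_star = np.zeros(n)
beta_star[:5] = rng.normal(size=5)        # a sparse true parameter

X = sigma_x * rng.normal(size=(m, n))     # true covariates (unobserved)
W = sigma_w * rng.normal(size=(m, n))     # additive measurement error
Z = X + W                                 # corrupted covariates (observed)
y = X @ beta_star + sigma_eps * rng.normal(size=m)
```

Note that the response is generated from the clean covariates $X$, while any estimator may use only the observed pair $(Z, y)$.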

Estimation in the presence of measurement errors has attracted interest for a long time. In 1987, Bickel and Ritov first studied linear measurement error models and proposed an efficient estimator (Bickel and Ritov, 1987). Stefanski and Carroll then investigated generalized linear measurement error models and constructed consistent estimators (Stefanski and Carroll, 1987). Extensive results have also been established on parameter estimation and variable selection in both parametric and nonparametric settings; see Huwang and Hwang (2002); Tsiatis and Ma (2004); Delaigle and Meister (2007) and references therein.

Recently, in the high-dimensional setting (i.e., $m \ll n$), Loh and Wainwright studied sparse linear regression in which the covariates are corrupted by additive errors, missing data, or dependence. Although the proposed estimator involves solving a nonconvex optimization problem, they proved that its global and stationary points are statistically consistent; see Loh and Wainwright (2012b, 2015), respectively. The proposed estimator was also shown to be minimax optimal in the additive error case under the ℓ₂-loss, assuming that the true parameter $\beta^*$ is exactly sparse, that is, has at most $s$ nonzero elements (Loh and Wainwright, 2012a). However, the exact sparsity assumption may be too restrictive in real applications. For instance, in image processing, the wavelet coefficients of natural images typically exhibit an exponential decay, but are rarely exactly zero (see, e.g., Mallat (1989)). Other applications include signal processing, medical imaging reconstruction, remote sensing and so on. Hence, it is necessary to investigate the minimax rate of estimation when the exact sparsity assumption does not hold.

In this study, we consider the sparse high-dimensional linear model with additive errors. Assuming that the regression parameter is weakly sparse, we establish minimax rates of estimation in terms of ℓ_p-losses. The proposed estimator is also shown to be minimax optimal in the ℓ₂-loss.

## 2 Problem setup

Recall the standard linear regression model (1). One of the main types of measurement errors is the additive error. Specifically, for each $i$, we observe $Z_{i\cdot} = X_{i\cdot} + W_{i\cdot}$, where $W_{i\cdot}$ is a random vector independent of $X_{i\cdot}$ with mean 0 and known covariance matrix $\Sigma_w$. Throughout this paper, we assume that, for $i = 1, 2, \dots, m$, the vectors $X_{i\cdot}$, $W_{i\cdot}$ and the noise $\epsilon_i$ are Gaussian with mean 0 and covariance matrices $\Sigma_x = \sigma_x^2 I_n$, $\Sigma_w = \sigma_w^2 I_n$ and $\sigma_\epsilon^2$, respectively, and we write $\sigma_z^2 := \sigma_x^2 + \sigma_w^2$ for simplicity.

Following a line of past works (Loh and Wainwright, 2012b, 2015), we fix $i \in \{1, \dots, m\}$ and use $\Sigma_z = \Sigma_x + \Sigma_w$ to denote the covariance matrix of $Z_{i\cdot}$. Let $(\hat{\Gamma}, \hat{\Upsilon})$ denote estimators for $(\Sigma_x, \Sigma_x \beta^*)$ that depend only on the observed data $(Z, y)$. As discussed in Loh and Wainwright (2012b), an appropriate choice of the surrogate pair for the additive error case is given by

$$
\hat{\Gamma} := \frac{Z^\top Z}{m} - \Sigma_w \quad \text{and} \quad \hat{\Upsilon} := \frac{Z^\top y}{m}.
$$
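As a sanity check, the surrogate pair is unbiased: $\mathbb{E}[\hat{\Gamma}] = \Sigma_x$ and $\mathbb{E}[\hat{\Upsilon}] = \Sigma_x \beta^*$. A quick Monte-Carlo sketch (with illustrative scalar covariances and sizes) confirms this:

```python
import numpy as np

# Monte-Carlo check that the surrogate pair is (approximately) unbiased:
# E[Z'Z/m - Sigma_w] = Sigma_x and E[Z'y/m] = Sigma_x beta*.
# Scalar covariances and sizes are illustrative assumptions.
rng = np.random.default_rng(1)
m, n = 20000, 10
sigma_x, sigma_w, sigma_eps = 1.0, 0.5, 0.1

beta_star = np.zeros(n)
beta_star[0] = 1.0
X = sigma_x * rng.normal(size=(m, n))
Z = X + sigma_w * rng.normal(size=(m, n))
y = X @ beta_star + sigma_eps * rng.normal(size=m)

Gamma_hat = Z.T @ Z / m - sigma_w**2 * np.eye(n)   # estimates Sigma_x
Upsilon_hat = Z.T @ y / m                          # estimates Sigma_x beta*

print(np.abs(Gamma_hat - sigma_x**2 * np.eye(n)).max())   # small
print(np.abs(Upsilon_hat - sigma_x**2 * beta_star).max())  # small
```

Naively using $Z^\top Z / m$ without subtracting $\Sigma_w$ would instead estimate $\Sigma_z$, which is why the correction term is essential.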

Instead of assuming that the regression parameter $\beta^*$ is exactly sparse, we use a weaker notion to characterize its sparsity. Specifically, we assume that, for $q \in (0, 1]$ and a radius $R_q > 0$, $\beta^* \in B_q(R_q)$, where

$$
B_q(R_q) := \Big\{ \beta \in \mathbb{R}^n : \|\beta\|_q^q = \sum_{j=1}^n |\beta_j|^q \le R_q \Big\}.
$$

Note that $q = 0$ corresponds to the case where $\beta^*$ is exactly sparse, while $q \in (0, 1]$ corresponds to the case of weak sparsity, which enforces a certain decay rate on the ordered elements of $\beta^*$. Throughout this paper, we fix $q \in (0, 1]$ unless otherwise specified. Without loss of generality, we also assume that $\|\beta^*\|_2 \le 1$ and define $B_2(1) := \{\beta \in \mathbb{R}^n : \|\beta\|_2 \le 1\}$. Then we have $\beta^* \in B_q(R_q) \cap B_2(1)$.
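For intuition, a polynomially decaying vector illustrates weak sparsity: every entry is nonzero, yet the vector lies in a small ℓ_q-ball. The decay exponent and radius below are illustrative choices.

```python
import numpy as np

# beta_j = j^{-3} is weakly sparse: no entry is exactly zero, yet for
# q = 1/2 we have ||beta||_q^q = sum_j j^{-3/2} < 2.7 for every n
# (the series converges to the Riemann zeta value ~2.61). The decay
# exponent and radius are illustrative.
n, q = 10_000, 0.5
beta = np.arange(1, n + 1, dtype=float) ** (-3.0)

lq_q = np.sum(np.abs(beta) ** q)       # ||beta||_q^q
print(lq_q)                            # below 2.7, so beta is in B_q(2.7)
print(np.count_nonzero(beta))          # equals n: beta is not exactly sparse
```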

In order to estimate the regression parameter, one considers an estimator $\hat{\beta}$, which is a measurable function of the observed data $(Z, y)$. In order to assess the quality of $\hat{\beta}$, one introduces a loss function $L(\hat{\beta}, \beta^*)$, which represents the loss incurred by the estimator $\hat{\beta}$ when the true parameter is $\beta^*$. Finally, in the minimax formalism, we aim to choose an estimator that minimizes the worst-case loss

$$
\min_{\hat{\beta}} \max_{\beta^* \in B_q(R_q) \cap B_2(1)} L(\hat{\beta}, \beta^*).
$$

Specifically, we shall consider the ℓ_p-losses for $1 \le p < \infty$ as follows:

$$
L_p(\hat{\beta}, \beta^*) := \|\hat{\beta} - \beta^*\|_p^p.
$$

We then impose some conditions on the observed matrix $Z$. The first assumption requires that the columns of $Z$ are bounded in ℓ₂-norm.

###### Assumption 1 (Column normalization).

There exists a constant $\kappa_c > 0$ such that

$$
\frac{1}{\sqrt{m}} \max_{j=1,2,\dots,n} \|Z_{\cdot j}\|_2 \le \kappa_c.
$$
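For Gaussian designs this assumption is mild: the sketch below (with illustrative sizes) shows that the empirical column-normalization constant concentrates near $\sigma_z$, so the assumption holds with $\kappa_c$ slightly above $\sigma_z$ with high probability.

```python
import numpy as np

# Empirical check of the column-normalization assumption: with i.i.d.
# N(0, sigma_z^2) entries, max_j ||Z_.j||_2 / sqrt(m) concentrates near
# sigma_z. Sizes are illustrative.
rng = np.random.default_rng(2)
m, n, sigma_z = 5000, 200, 1.0
Z = sigma_z * rng.normal(size=(m, n))

kappa_emp = np.linalg.norm(Z, axis=0).max() / np.sqrt(m)
print(kappa_emp)   # close to sigma_z = 1
```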

Our second assumption imposes a lower bound on the restricted eigenvalue of $\hat{\Gamma}$.

###### Assumption 2 (Restricted eigenvalue condition).

There exist a constant $\kappa_l > 0$ and a function $\tau_l(R_q, m, n)$ such that for all $\beta \in \mathbb{R}^n$,

$$
\beta^\top \hat{\Gamma} \beta \ge \kappa_l \|\beta\|_2^2 - \tau_l(R_q, m, n).
$$

Previous research has shown that Assumptions 1 and 2 are satisfied by a wide range of random matrices with high probability; see, e.g., Raskutti et al. (2010) and Loh and Wainwright (2012b).

## 3 Main results

Let $P_\beta$ denote the distribution of $y$ in the linear model with additive errors when the parameter is $\beta$ and $Z$ is observed. The following lemma gives the Kullback-Leibler (KL) divergence between the distributions induced by two different parameters $\beta$ and $\beta'$, which is useful for establishing the lower bound. Recall that for two distributions $P$ and $Q$ which have densities $p$ and $q$ with respect to some base measure $\mu$, the KL divergence is defined by $D(P\|Q) := \int p \log(p/q)\, d\mu$.

###### Lemma 1.

In the additive error setting, for any $\beta, \beta' \in \mathbb{R}^n$ with $\|\beta\|_2 = \|\beta'\|_2 = 1$, we have

$$
D(P_\beta \| P_{\beta'}) \le \frac{\sigma_x^4}{2\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)} \|Z(\beta - \beta')\|_2^2.
$$
###### Proof.

For each fixed $i$, by the model setting, $(y_i, Z_{i\cdot})$ is jointly Gaussian with mean 0, and by computing the covariances, one has that

$$
\begin{bmatrix} y_i \\ Z_{i\cdot} \end{bmatrix}
\sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \beta^\top \Sigma_x \beta + \sigma_\epsilon^2 & \beta^\top \Sigma_x \\ \Sigma_x \beta & \Sigma_x + \Sigma_w \end{bmatrix} \right).
$$

Then it follows from standard results on the conditional distribution of Gaussian variables that

$$
y_i \mid Z_{i\cdot} \sim \mathcal{N}\big( \beta^\top \Sigma_x \Sigma_z^{-1} Z_{i\cdot},\; \beta^\top(\Sigma_x - \Sigma_x \Sigma_z^{-1}\Sigma_x)\beta + \sigma_\epsilon^2 \big). \tag{2}
$$

Now assume that $\beta$ and $\beta'$ are not both 0; otherwise, the conclusion holds trivially. Since $P_\beta$ is a product distribution of the conditional laws $P_\beta(y_i \mid Z_{i\cdot})$ over all $i = 1, \dots, m$, we have from (2) that

$$
\begin{aligned}
D(P_\beta \| P_{\beta'})
&= \mathbb{E}_{P_\beta}\left[ \log\frac{P_\beta(y)}{P_{\beta'}(y)} \right] \\
&= \mathbb{E}_{P_\beta}\left[ \frac{m}{2}\log\Big(\frac{\sigma_{\beta'}^2}{\sigma_\beta^2}\Big) - \frac{\|y - Z\Sigma_z^{-1}\Sigma_x\beta\|_2^2}{2\sigma_\beta^2} + \frac{\|y - Z\Sigma_z^{-1}\Sigma_x\beta'\|_2^2}{2\sigma_{\beta'}^2} \right] \\
&= \frac{m}{2}\log\Big(\frac{\sigma_{\beta'}^2}{\sigma_\beta^2}\Big) + \frac{m}{2}\Big(\frac{\sigma_\beta^2}{\sigma_{\beta'}^2} - 1\Big) + \frac{1}{2\sigma_{\beta'}^2}\|Z\Sigma_z^{-1}\Sigma_x(\beta - \beta')\|_2^2,
\end{aligned} \tag{3}
$$

where $\sigma_\beta^2 := \beta^\top(\Sigma_x - \Sigma_x\Sigma_z^{-1}\Sigma_x)\beta + \sigma_\epsilon^2$, and $\sigma_{\beta'}^2$ is given analogously. Since $\Sigma_x = \sigma_x^2 I_n$, $\Sigma_z = \sigma_z^2 I_n$, and $\|\beta\|_2 = 1$ by the assumptions, we have that

$$
\sigma_\beta^2 = \Big(\sigma_x^2 - \frac{\sigma_x^4}{\sigma_z^2}\Big)\|\beta\|_2^2 + \sigma_\epsilon^2 = \frac{\sigma_x^2\sigma_w^2}{\sigma_z^2} + \sigma_\epsilon^2.
$$

Substituting this equality into (3), and noting that the first two terms vanish because $\sigma_\beta^2 = \sigma_{\beta'}^2$, yields that

$$
D(P_\beta \| P_{\beta'}) = \frac{\sigma_x^4}{2\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)} \|Z(\beta - \beta')\|_2^2.
$$

The proof is complete. ∎
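The lemma can be checked numerically in the isotropic case: conditionally on $Z$, $P_\beta$ is Gaussian with mean $(\sigma_x^2/\sigma_z^2)Z\beta$ and covariance $\sigma_\beta^2 I_m$, so the KL divergence between $P_\beta$ and $P_{\beta'}$ has the closed form $\|\mu_\beta - \mu_{\beta'}\|_2^2 / (2\sigma_\beta^2)$, which matches the expression in the lemma. All parameter values below are illustrative.

```python
import numpy as np

# Numerical check of Lemma 1 with Sigma_x = sigma_x^2 I, Sigma_z = sigma_z^2 I
# and unit-norm beta, beta'. KL between Gaussians with equal covariance is
# ||mu - mu'||^2 / (2 sigma_beta^2); compare against the lemma's closed form.
rng = np.random.default_rng(3)
m, n = 50, 20
sigma_x, sigma_w, sigma_eps = 1.0, 0.5, 0.3
sigma_z2 = sigma_x**2 + sigma_w**2

beta = rng.normal(size=n);  beta /= np.linalg.norm(beta)
beta2 = rng.normal(size=n); beta2 /= np.linalg.norm(beta2)
Z = rng.normal(size=(m, n))

sigma_beta2 = sigma_x**2 * sigma_w**2 / sigma_z2 + sigma_eps**2
mu_diff = (sigma_x**2 / sigma_z2) * Z @ (beta - beta2)
kl_direct = np.sum(mu_diff**2) / (2 * sigma_beta2)

# Closed form from Lemma 1 (an equality in this setting):
kl_lemma = (sigma_x**4 / (2 * sigma_z2 * (sigma_x**2 * sigma_w**2
            + sigma_z2 * sigma_eps**2))) * np.sum((Z @ (beta - beta2))**2)

print(np.isclose(kl_direct, kl_lemma))   # True
```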

###### Theorem 1 (Lower bound on ℓp-loss).

In the additive error setting, suppose that the observed matrix $Z$ satisfies Assumption 1 with constant $\kappa_c$. Then for any $p \in [1, \infty)$, there exists a constant $c_{q,p}$ depending only on $q$ and $p$ such that, with probability at least 1/2, the minimax ℓ_p-loss over the ℓ_q-ball is lower bounded as

$$
\min_{\hat{\beta}} \max_{\beta^* \in B_q(R_q) \cap B_2(1)} \|\hat{\beta} - \beta^*\|_p^p
\ge c_{q,p}\left[\frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)}{\sigma_x^4\kappa_c^2}\right]^{\frac{p-q}{2}} R_q \left(\frac{\log n}{m}\right)^{\frac{p-q}{2}}.
$$
###### Proof.

Let $M_p(\delta)$ denote the cardinality of a maximal $\delta$-packing of the ball $B_q(R_q) \cap B_2(1)$ in the ℓ_p metric, with elements $\{\beta^1, \dots, \beta^{M_p(\delta)}\}$. We follow the standard technique (Yang and Barron, 1999) of reducing the lower bound for estimation to a multi-way hypothesis testing problem:

$$
\mathbb{P}\left( \min_{\hat{\beta}} \max_{\beta^* \in B_q(R_q)\cap B_2(1)} \|\hat{\beta} - \beta^*\|_p^p \ge \frac{\delta^p}{2^p} \right) \ge \min_{\tilde{\beta}} \mathbb{P}(B \ne \tilde{\beta}), \tag{4}
$$

where $B$ is a random variable uniformly distributed over the packing set, and $\tilde{\beta}$ is an estimator taking values in the packing set. It then follows from Fano's inequality (Yang and Barron, 1999) that

$$
\mathbb{P}(B \ne \tilde{\beta}) \ge 1 - \frac{I(y; B) + \log 2}{\log M_p(\delta)}, \tag{5}
$$

where $I(y; B)$ is the mutual information between the random variable $B$ and the observation vector $y$. It now remains to upper bound $I(y; B)$. Let $N_2(\epsilon)$ be the minimal cardinality of an $\epsilon$-covering of $B_q(R_q)$ in the ℓ₂-norm. Following the procedure of Yang and Barron (1999), the mutual information is upper bounded as

$$
I(y; B) \le \log N_2(\epsilon) + \max_{\beta, \beta'} D(P_\beta \| P_{\beta'}). \tag{6}
$$

Let $\mathrm{absconv}_q(Z/\sqrt{m})$ denote the ℓ_q-convex hull of the rescaled columns of the observed matrix $Z$, that is,

$$
\mathrm{absconv}_q(Z/\sqrt{m}) := \left\{ \frac{1}{\sqrt{m}}\sum_{j=1}^n \theta_j Z_{\cdot j} \,\Big|\, \theta \in B_q(R_q) \right\},
$$

where the normalization by $\sqrt{m}$ is used for convenience. Since $Z$ satisfies Assumption 1, Raskutti et al. (Unpublished results, Lemma 4) is applicable, and we conclude that there exists a covering set $\{\beta^1, \dots, \beta^{N_2(\epsilon)}\}$ such that, for all $\beta \in B_q(R_q)$, there exist some index $i$ and some constant $c$ such that $\frac{1}{\sqrt{m}}\|Z(\beta - \beta^i)\|_2 \le c\,\kappa_c\,\epsilon$. Combining this inequality with Lemma 1 and (6), one has that the mutual information is upper bounded as

$$
I(y; B) \le \log N_2(\epsilon) + \frac{\sigma_x^4}{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)}\, m c^2 \kappa_c^2 \epsilon^2.
$$

Thus we obtain by (5) that

$$
\mathbb{P}(B \ne \tilde{\beta}) \ge 1 - \frac{\log N_2(\epsilon) + \frac{\sigma_x^4}{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)}\, m c^2 \kappa_c^2 \epsilon^2 + \log 2}{\log M_p(\delta)}. \tag{7}
$$

It remains to choose the packing and covering radii (i.e., $\delta$ and $\epsilon$, respectively) such that (7) is bounded away from zero, say bounded below by $1/2$. For simplicity, denote $\sigma^2 := \frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)}{\sigma_x^4}$. Suppose that we choose the pair $(\delta, \epsilon)$ such that

$$
\begin{aligned}
\frac{c^2 m}{\sigma^2}\kappa_c^2\epsilon^2 &\le \log N_2(\epsilon), \quad\text{and} &\text{(8a)} \\
\log M_p(\delta) &\ge 6 \log N_2(\epsilon). &\text{(8b)}
\end{aligned}
$$

As long as $N_2(\epsilon) \ge 2$, it is guaranteed that

$$
\mathbb{P}(B \ne \tilde{\beta}) \ge 1 - \frac{2\log N_2(\epsilon) + \log 2}{6\log N_2(\epsilon)} \ge \frac{1}{2}, \tag{9}
$$

as desired. It remains to determine values of the pair $(\delta, \epsilon)$ satisfying (8). By Raskutti et al. (Unpublished results, Lemma 3), we know that if $\frac{c^2 m}{\sigma^2}\kappa_c^2\epsilon^2 \le L_{q,2} R_q^{\frac{2}{2-q}}(1/\epsilon)^{\frac{2q}{2-q}}\log n$ for some constant $L_{q,2}$ depending only on $q$, then (8a) is satisfied. Thus, we can choose $\epsilon$ satisfying

$$
\epsilon^{\frac{4}{2-q}} = L_{q,2}\, R_q^{\frac{2}{2-q}}\, \frac{\sigma^2}{c^2\kappa_c^2}\, \frac{\log n}{m}. \tag{10}
$$

It also follows from Raskutti et al. (Unpublished results, Lemma 3) that if $\delta$ is chosen such that

$$
U_{q,p}\left[ R_q^{\frac{p}{p-q}} \Big(\frac{1}{\delta}\Big)^{\frac{pq}{p-q}} \log n \right] \ge 6 L_{q,2}\left[ R_q^{\frac{2}{2-q}} \Big(\frac{1}{\epsilon}\Big)^{\frac{2q}{2-q}} \log n \right], \tag{11}
$$

for some constant $U_{q,p}$ depending only on $q$ and $p$, then (8b) holds. Combining (10) and (11), one has that

$$
\begin{aligned}
\delta^p &\le \left[\frac{U_{q,p}}{6L_{q,2}}\right]^{\frac{p-q}{q}} \Big(\epsilon^{\frac{4}{2-q}}\Big)^{\frac{p-q}{2}} R_q^{\frac{2-p}{2-q}} \\
&= L_{q,2}^{\frac{p-q}{2}}\left[\frac{U_{q,p}}{6L_{q,2}}\right]^{\frac{p-q}{q}} R_q \left[\frac{\sigma^2}{c^2\kappa_c^2}\frac{\log n}{m}\right]^{\frac{p-q}{2}}.
\end{aligned}
$$

Combining this inequality with (9) and (4), we obtain that there exists a constant $c_{q,p}$ depending only on $q$ and $p$ such that

$$
\mathbb{P}\left( \min_{\hat{\beta}} \max_{\beta^* \in B_q(R_q)\cap B_2(1)} \|\hat{\beta} - \beta^*\|_p^p \ge c_{q,p} R_q \left[\frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_\epsilon^2)}{\sigma_x^4\kappa_c^2}\frac{\log n}{m}\right]^{\frac{p-q}{2}} \right) \ge \frac{1}{2}.
$$

The proof is complete. ∎

Note that the probability 1/2 in Theorem 1 is just a standard convention; it may be made arbitrarily close to 1 by choosing the universal constants suitably.

###### Theorem 2 (Upper bound on ℓ2-loss).

In the additive error setting, suppose that $\hat{\Gamma}$ satisfies Assumption 2 with constant $\kappa_l$ and $\tau_l(R_q, m, n) \le c_1 R_q\big(\frac{\log n}{m}\big)^{1-q/2}$ for a universal constant $c_1$. Then there exist universal constants $c_2, c_3, c_4$ and a constant $c_q$ depending only on $q$ such that, with probability at least $1 - c_2\exp(-c_3\log n)$, the minimax ℓ₂-loss over the ℓ_q-ball is upper bounded as

$$
\min_{\hat{\beta}} \max_{\beta^* \in B_q(R_q)\cap B_2(1)} \|\hat{\beta} - \beta^*\|_2^2 \le c_q\left[\frac{\sigma_z^{2-q}(\sigma_w + \sigma_\epsilon)^{2-q} + \kappa_l^{1-q}}{\kappa_l^{2-q}}\right] R_q \left(\frac{\log n}{m}\right)^{1-q/2}. \tag{12}
$$
###### Proof.

It suffices to find an estimator of $\beta^*$ that has small ℓ₂-norm error with high probability. We consider the following estimator:

$$
\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in B_q(R_q)\cap B_2(1)} \left\{ \frac{1}{2}\beta^\top\hat{\Gamma}\beta - \hat{\Upsilon}^\top\beta \right\}. \tag{13}
$$

It is worth noting that (13) involves solving a nonconvex optimization problem, since the constraint set is nonconvex for $q < 1$ and $\hat{\Gamma}$ may fail to be positive semidefinite when $m < n$. Since $\beta^* \in B_q(R_q)\cap B_2(1)$, it follows from the optimality of $\hat{\beta}$ that $\frac{1}{2}\hat{\beta}^\top\hat{\Gamma}\hat{\beta} - \hat{\Upsilon}^\top\hat{\beta} \le \frac{1}{2}{\beta^*}^\top\hat{\Gamma}\beta^* - \hat{\Upsilon}^\top\beta^*$. Define $\hat{\Delta} := \hat{\beta} - \beta^*$, and thus $\hat{\Delta} \in B_q(2R_q)$. Then one has that

$$
\hat{\Delta}^\top \hat{\Gamma} \hat{\Delta} \le 2\langle \hat{\Delta}, \hat{\Upsilon} - \hat{\Gamma}\beta^* \rangle.
$$

This inequality, together with the assumption that $\hat{\Gamma}$ satisfies Assumption 2, implies that

$$
\kappa_l\|\hat{\Delta}\|_2^2 - \tau_l(R_q, m, n) \le 2\langle \hat{\Delta}, \hat{\Upsilon} - \hat{\Gamma}\beta^* \rangle \le 2\|\hat{\Delta}\|_1 \|\hat{\Upsilon} - \hat{\Gamma}\beta^*\|_\infty. \tag{14}
$$

It then follows from Loh and Wainwright (2012b, Lemma 2) that there exist universal constants $c_2, c_3, c_4$ such that, with probability at least $1 - c_2\exp(-c_3\log n)$,

$$
\|\hat{\Upsilon} - \hat{\Gamma}\beta^*\|_\infty \le c_4\sigma_z(\sigma_w + \sigma_\epsilon)\|\beta^*\|_2\sqrt{\frac{\log n}{m}} \le c_4\sigma_z(\sigma_w + \sigma_\epsilon)\sqrt{\frac{\log n}{m}}, \tag{15}
$$

where the second inequality uses $\|\beta^*\|_2 \le 1$.

Combining (14) and (15), one has that

$$
\kappa_l\|\hat{\Delta}\|_2^2 \le 2c_4\sigma_z(\sigma_w + \sigma_\epsilon)\sqrt{\frac{\log n}{m}}\,\|\hat{\Delta}\|_1 + \tau_l(R_q, m, n).
$$

Introduce the shorthand $\sigma := \sigma_z(\sigma_w + \sigma_\epsilon)$. Recall that $\hat{\Delta} \in B_q(2R_q)$ and $\tau_l(R_q, m, n) \le c_1 R_q\big(\frac{\log n}{m}\big)^{1-q/2}$. It then follows from Raskutti et al. (2011, Lemma 5) (with $\tau = \frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log n}{m}}$) and the assumption that

$$
\|\hat{\Delta}\|_2^2 \le \sqrt{2R_q}\left(\frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log n}{m}}\right)^{1-q/2}\|\hat{\Delta}\|_2 + 2R_q\left(\frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log n}{m}}\right)^{2-q} + \frac{c_1}{\kappa_l}R_q\left(\frac{\log n}{m}\right)^{1-q/2}.
$$

Therefore, by solving this inequality with $\|\hat{\Delta}\|_2$ viewed as the indeterminate, we obtain that there exists a constant $c_q$ depending only on $q$ such that (12) holds with probability at least $1 - c_2\exp(-c_3\log n)$. The proof is complete. ∎
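Computing (13) is nonconvex in general. As an illustration of how such a program can be attacked in practice, the sketch below runs projected gradient descent on the surrogate objective for the tractable $q = 1$ case (an ℓ₁-ball constraint, in the spirit of Loh and Wainwright's formulation); the oracle choice of radius and all problem sizes are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection onto the l1-ball (sort-based algorithm)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Projected gradient descent on the surrogate objective
#   0.5 * b' Gamma b - Upsilon' b   subject to ||b||_1 <= radius.
# Illustrative data; the radius is set to ||beta*||_1 (an oracle choice).
rng = np.random.default_rng(4)
m, n, sigma_w, sigma_eps = 2000, 100, 0.2, 0.1
beta_star = np.zeros(n)
beta_star[:3] = [0.5, -0.4, 0.3]

X = rng.normal(size=(m, n))
Z = X + sigma_w * rng.normal(size=(m, n))
y = X @ beta_star + sigma_eps * rng.normal(size=m)

Gamma = Z.T @ Z / m - sigma_w**2 * np.eye(n)   # surrogate Gamma-hat
Upsilon = Z.T @ y / m                          # surrogate Upsilon-hat

beta = np.zeros(n)
radius = np.abs(beta_star).sum()
step = 0.1
for _ in range(500):
    beta = project_l1(beta - step * (Gamma @ beta - Upsilon), radius)

err = np.linalg.norm(beta - beta_star)
print(err)   # small relative to ||beta*||_2
```

Here $m > n$, so $\hat{\Gamma}$ is typically positive definite and the iteration converges; in the truly high-dimensional regime $m < n$ the objective is nonconvex, which is exactly the difficulty addressed by Loh and Wainwright (2012b, 2015).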

###### Remark 1.

(i) The lower and upper bounds on the minimax rates depend on the triple $(m, n, R_q)$, the error levels $(\sigma_w, \sigma_\epsilon)$, and the observed matrix $Z$, as shown in Theorems 1 and 2. Specifically, setting $p = 2$ in Theorem 1, the lower and upper bounds agree up to constant factors, establishing the optimal minimax rate in the additive error case.

(ii) Note that when $p = 2$ and $q = 0$ (i.e., the exactly sparse case with $R_0 = s$), the minimax rate scales as $\frac{s\log n}{m}$. In the regime where $n \ge s^{1+\gamma}$ for some constant $\gamma > 0$, this rate is equivalent to $\frac{s\log(n/s)}{m}$ (up to constant factors), which recaptures the same scaling as in Loh and Wainwright (2012a).
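The equivalence claimed in (ii) is elementary; for instance, when $n = s^2$ we have $\log(n/s) = \frac{1}{2}\log n$ exactly, as the following check confirms.

```python
import numpy as np

# When n = s^2, log(n/s)/log(n) = 1/2, so s*log(n)/m and s*log(n/s)/m
# agree up to a factor of 2.
for s in [10, 100, 1000]:
    n = s**2
    print(s, np.log(n / s) / np.log(n))   # ~0.5 for every s
```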

## 4 Conclusion

We focused on the information-theoretic limitations of estimation for sparse linear regression with additive errors under the high-dimensional scaling. Further research may generalize the current result to sub-Gaussian matrices with non-diagonal covariances, or other types of measurement errors, such as the multiplicative error.

## References

• Bickel and Ritov (1987) Bickel, P. J., Ritov, Y., 1987. Efficient estimation in the errors in variables model. Ann. Statist. 15 (2), 513–540.
• Carroll et al. (2006) Carroll, R. J., Ruppert, D., Stefanski, L. A., Crainiceanu, C. M., 2006. Measurement error in nonlinear models: A modern perspective, second ed. Chapman & Hall/CRC, Boca Raton, Florida.
• Delaigle and Meister (2007) Delaigle, A., Meister, A., 2007. Nonparametric regression estimation in the heteroscedastic errors-in-variables problem. J. Amer. Statist. Assoc. 102 (480), 1416–1426.

• Huwang and Hwang (2002) Huwang, L., Hwang, J. G., 2002. Prediction and confidence intervals for nonlinear measurement error models without identifiability information. Statist. Probab. Lett. 58 (4), 355–362.

• Loh and Wainwright (2012a) Loh, P.-L., Wainwright, M. J., 2012a. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In: IEEE International Symposium on Information Theory Proceedings. pp. 2601–2605.
• Loh and Wainwright (2012b) Loh, P.-L., Wainwright, M. J., 2012b. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Statist. 40 (3), 1637–1664.
• Loh and Wainwright (2015) Loh, P.-L., Wainwright, M. J., 2015. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16 (1), 559–616.
• Mallat (1989) Mallat, S. G., 1989. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11 (7), 674–693.
• Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., Yu, B., 2010. Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11 (Aug), 2241–2259.
• Raskutti et al. (2011) Raskutti, G., Wainwright, M. J., Yu, B., 2011. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE Trans. Inform. Theory 57 (10), 6976–6994.
• Raskutti et al. (Unpublished results) Raskutti, G., Wainwright, M. J., Yu, B., Unpublished results. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. arXiv preprint arXiv:0910.2042.
• Stefanski and Carroll (1987) Stefanski, L. A., Carroll, R. J., 1987. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika 74 (4), 703–716.
• Tsiatis and Ma (2004) Tsiatis, A. A., Ma, Y. Y., 2004. Locally efficient semiparametric estimators for functional measurement error models. Biometrika 91 (4), 835–848.
• Yang and Barron (1999) Yang, Y. H., Barron, A., 1999. Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 (5), 1564–1599.