Consider the standard linear regression model
$$y_i = \langle x_i, \beta^* \rangle + \epsilon_i, \quad i = 1, \dots, n, \tag{1}$$
where $\beta^* \in \mathbb{R}^d$ is the unknown parameter and $\{(x_i, y_i)\}_{i=1}^n$ are i.i.d. observations, which are assumed to be fully observed in standard formulations. However, this assumption is unrealistic in many applications, where the covariates can only be measured imprecisely and one observes the pairs $\{(z_i, y_i)\}_{i=1}^n$ instead, where the $z_i$'s are corrupted versions of the corresponding $x_i$'s; see, e.g., Carroll et al. (2006). This is known as the measurement error model in the literature.
Estimation in the presence of measurement errors has long attracted considerable interest. Bickel and Ritov (1987) first studied linear measurement error models and proposed an efficient estimator. Stefanski and Carroll (1987) then investigated generalized linear measurement error models and constructed consistent estimators. Extensive results have also been established on parameter estimation and variable selection in both parametric and nonparametric settings; see Huwang and Hwang (2002); Tsiatis and Ma (2004); Delaigle and Meister (2007) and references therein.
Recently, in the high-dimensional setting (i.e., $d \gg n$), Loh and Wainwright studied sparse linear regression in which the covariates are corrupted by additive errors, missing data, or dependence. Although the proposed estimator involves solving a nonconvex optimization problem, they proved that both the global and the stationary points are statistically consistent; see Loh and Wainwright (2012b, 2015), respectively. The proposed estimator was also shown to be minimax optimal in the additive error case under the $\ell_2$-loss, assuming that the true parameter is exactly sparse, that is, has at most $k$ nonzero elements (Loh and Wainwright, 2012a). However, the “exact sparse” assumption may be too restrictive in real applications. For instance, in image processing, wavelet coefficients of images typically exhibit an exponential decay but need not be exactly zero (see, e.g., Mallat (1989)). Other applications include signal processing, medical imaging reconstruction, remote sensing and so on. Hence, it is necessary to investigate the minimax rate of estimation when the “exact sparse” assumption does not hold.
In this study, we consider the sparse high-dimensional linear model with additive errors. Assuming that the regression parameter is weakly sparse, we establish the minimax rates of estimation in terms of $\ell_p$-losses. The proposed estimator is also shown to be minimax optimal in the $\ell_2$-loss.
2 Problem setup
Recall the standard linear regression model (1). One of the main types of measurement errors is the additive error. Specifically, for each $i = 1, \dots, n$, we observe $z_i = x_i + w_i$, where $w_i$ is a random vector independent of $x_i$ with mean 0 and known covariance matrix $\Sigma_w$. Throughout this paper, we assume that, for $i = 1, \dots, n$, the vectors $x_i$, $w_i$ and $\epsilon_i$ are Gaussian with mean 0 and covariance matrices $\sigma_x^2 I_d$, $\sigma_w^2 I_d$ and $\sigma_\epsilon^2$, respectively, and we write $\sigma_z^2 = \sigma_x^2 + \sigma_w^2$ for simplicity.
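Following a line of past works (Loh and Wainwright, 2012b, 2015), we fix $i \in \{1, \dots, n\}$ and use $\Sigma_x$ to denote the covariance matrix of $x_i$. Let $(\hat{\Gamma}, \hat{\gamma})$ denote the estimators for $(\Sigma_x, \Sigma_x \beta^*)$ that depend only on the observed data $\{(z_i, y_i)\}_{i=1}^n$. As discussed in Loh and Wainwright (2012b), an appropriate choice of the surrogate pair for the additive error case is given by
$$\hat{\Gamma} = \frac{Z^\top Z}{n} - \Sigma_w, \qquad \hat{\gamma} = \frac{Z^\top y}{n},$$
where $Z \in \mathbb{R}^{n \times d}$ is the matrix with rows $z_i^\top$ and $y = (y_1, \dots, y_n)^\top$.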
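As a concrete illustration of the surrogate pair, the following NumPy sketch simulates the additive error model and forms $\hat{\Gamma} = Z^\top Z / n - \Sigma_w$ and $\hat{\gamma} = Z^\top y / n$, the additive-error surrogates of Loh and Wainwright (2012b). The function name `additive_surrogates`, the isotropic covariances and all numerical values are illustrative assumptions rather than part of the analysis.

```python
import numpy as np

def additive_surrogates(Z, y, Sigma_w):
    """Return (Gamma_hat, gamma_hat) for the additive error model.

    Gamma_hat = Z^T Z / n - Sigma_w estimates Cov(x_i);
    gamma_hat = Z^T y / n estimates Cov(x_i) @ beta_star.
    """
    n = Z.shape[0]
    Gamma_hat = Z.T @ Z / n - Sigma_w
    gamma_hat = Z.T @ y / n
    return Gamma_hat, gamma_hat

# Illustrative simulation (all values are assumptions chosen for the demo).
rng = np.random.default_rng(0)
n, d = 20000, 5
sigma_x, sigma_w, sigma_eps = 1.0, 0.5, 0.1
beta_star = np.array([1.0, -0.5, 0.25, 0.0, 0.0])

X = sigma_x * rng.standard_normal((n, d))   # unobserved clean covariates
W = sigma_w * rng.standard_normal((n, d))   # additive measurement error
Z = X + W                                   # observed, corrupted covariates
y = X @ beta_star + sigma_eps * rng.standard_normal(n)

Gamma_hat, gamma_hat = additive_surrogates(Z, y, sigma_w**2 * np.eye(d))

# Both deviations should be small here because n is much larger than d.
err_Gamma = np.max(np.abs(Gamma_hat - sigma_x**2 * np.eye(d)))
err_gamma = np.max(np.abs(gamma_hat - sigma_x**2 * beta_star))
print(err_Gamma, err_gamma)
```

In this low-dimensional demo the surrogates concentrate around their population targets; the high-dimensional regime studied in the paper behaves differently because $Z^\top Z / n$ is then rank deficient.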
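Instead of assuming that the regression parameter is exactly sparse, we use a weaker notion to characterize the sparsity of $\beta^*$. Specifically, we assume that for $q \in [0, 1]$ and a radius $R_q > 0$, $\beta^* \in B_q(R_q)$, where
$$B_q(R_q) := \Big\{ \beta \in \mathbb{R}^d : \|\beta\|_q^q = \sum_{j=1}^d |\beta_j|^q \le R_q \Big\}.$$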
Note that $q = 0$ corresponds to the case that $\beta^*$ is exactly sparse, while $q \in (0, 1]$ corresponds to the case of weak sparsity, which enforces a certain decay rate on the ordered elements of $\beta^*$. Throughout this paper, we fix $q \in [0, 1]$, and assume that $\beta^* \in B_q(R_q)$ unless otherwise specified. Without loss of generality, we also assume that and define . Then we have .
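The difference between exact and weak sparsity can be made concrete with a small sketch. It assumes the usual definition of the $\ell_q$-ball, $\{\beta : \sum_j |\beta_j|^q \le R_q\}$, with the sum read as the number of nonzero entries when $q = 0$; the helper names `lq_ball_value` and `in_lq_ball` and the decaying test vector are hypothetical and chosen only for illustration.

```python
import numpy as np

def lq_ball_value(beta, q):
    """Return sum_j |beta_j|^q, read as the number of nonzeros when q = 0."""
    beta = np.asarray(beta, dtype=float)
    if q == 0:
        return float(np.count_nonzero(beta))
    return float(np.sum(np.abs(beta) ** q))

def in_lq_ball(beta, q, radius):
    """Check membership in the l_q-ball {beta : sum_j |beta_j|^q <= radius}."""
    return lq_ball_value(beta, q) <= radius

# A vector with polynomially decaying entries: no entry is zero, so it is
# not exactly sparse, yet it lies in a small l_q-ball for q = 0.5.
d = 1000
beta = np.arange(1, d + 1, dtype=float) ** (-4.0)

print(lq_ball_value(beta, 0))     # number of nonzero entries
print(lq_ball_value(beta, 0.5))   # sum of |beta_j|^0.5 = sum of j^(-2)
```

The vector above has all $d$ entries nonzero, so it fails any exact sparsity constraint with a small radius, while its $\ell_{1/2}$ value stays below 2 regardless of $d$.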
In order to estimate the regression parameter, one considers an estimator $\hat{\beta}$, which is a measurable function of the observed data $\{(z_i, y_i)\}_{i=1}^n$. In order to assess the quality of $\hat{\beta}$, one introduces a loss function $L(\hat{\beta}, \beta^*)$, which represents the loss incurred by the estimator $\hat{\beta}$ when the true parameter is $\beta^* \in B_q(R_q)$. Finally, in the minimax formalism, we aim to choose an estimator that minimizes the following worst-case loss
$$\max_{\beta^* \in B_q(R_q)} L(\hat{\beta}, \beta^*).$$
Specifically, we shall consider the $\ell_p$-losses for $p \in [1, \infty)$ as follows
$$L(\hat{\beta}, \beta^*) = \|\hat{\beta} - \beta^*\|_p^p.$$
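We then impose some conditions on the observed matrix $Z$. The first assumption requires that the columns of $Z$ are bounded in $\ell_2$-norm.
Assumption 1 (Column normalization).
There exists a constant $0 < \kappa_c < \infty$ such that
$$\frac{1}{\sqrt{n}} \max_{j = 1, \dots, d} \|Z_j\|_2 \le \kappa_c,$$
where $Z_j$ denotes the $j$-th column of $Z$.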
Our second assumption imposes a lower bound on the restricted eigenvalue of.
Assumption 2 (Restricted eigenvalue condition).
There exists a constant and a function such that for all ,
3 Main results
Let $P_\beta$ denote the distribution of $y$ in the linear model with additive errors, when $\beta$ is given and $Z$ is observed. The following lemma gives the Kullback-Leibler (KL) divergence between the distributions induced by two different parameters $\beta, \beta' \in B_q(R_q)$, which is used to establish the lower bound. Recall that for two distributions $P$ and $Q$ which have densities $p$ and $q$ with respect to some base measure $\mu$, the KL divergence is defined by $D(P \| Q) = \int p \log (p / q) \, d\mu$.
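Since the distributions compared in the lemma are Gaussian, the KL divergence has a closed form. The sketch below, which is an illustration and not taken from the paper, evaluates the standard closed form for two univariate Gaussians and checks it against a direct numerical evaluation of $\int p \log(p/q)\,d\mu$. The function names and the integration grid are assumptions made for the demo.

```python
import numpy as np

def kl_gaussian(m1, s1, m2, s2):
    """KL divergence D(N(m1, s1^2) || N(m2, s2^2)) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

def kl_numeric(m1, s1, m2, s2, lo=-30.0, hi=30.0, num=600001):
    """Approximate the integral of p * log(p / q) by a Riemann sum."""
    t = np.linspace(lo, hi, num)
    log_p = -0.5 * ((t - m1) / s1) ** 2 - np.log(s1 * np.sqrt(2.0 * np.pi))
    log_q = -0.5 * ((t - m2) / s2) ** 2 - np.log(s2 * np.sqrt(2.0 * np.pi))
    dt = t[1] - t[0]
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dt)

closed = kl_gaussian(0.3, 1.2, -0.5, 0.9)
approx = kl_numeric(0.3, 1.2, -0.5, 0.9)
print(closed, approx)
```

For a product distribution over independent observations, the KL divergence is the sum of the per-observation divergences, which is the property the proof of the lemma relies on.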
Lemma 1.
In the additive error setting, for any , we have
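For each $i$ fixed, by the model setting, $(y_i, z_i)$ is jointly Gaussian with mean 0, and by computing the covariances, one has that
$$\begin{pmatrix} y_i \\ z_i \end{pmatrix} \sim N\left( 0, \begin{pmatrix} \beta^\top \Sigma_x \beta + \sigma_\epsilon^2 & \beta^\top \Sigma_x \\ \Sigma_x \beta & \Sigma_x + \Sigma_w \end{pmatrix} \right).$$
Then it follows from standard results on the conditional distribution of Gaussian variables that
$$y_i \mid z_i \sim N\left( \beta^\top \Sigma_x (\Sigma_x + \Sigma_w)^{-1} z_i, \ \beta^\top \big( \Sigma_x - \Sigma_x (\Sigma_x + \Sigma_w)^{-1} \Sigma_x \big) \beta + \sigma_\epsilon^2 \right). \tag{2}$$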
Now assume that and are not both 0; otherwise, the conclusion holds trivially. Since is a product distribution of over all , we have from (2) that
where , and is given analogously. Since , , and by the assumptions, we have that
Substituting this equality into (3) yields that
The proof is completed. ∎
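The Gaussian conditioning step used in the proof above can be checked numerically. The sketch below computes the conditional mean and variance of $y_i$ given $z_i$ from the general conditioning formula and compares them with the simplified expressions obtained when the covariances are isotropic, $\Sigma_x = \sigma_x^2 I_d$ and $\Sigma_w = \sigma_w^2 I_d$. The isotropy, the function names and the numerical values are assumptions made for this illustration.

```python
import numpy as np

def conditional_params(beta, z, Sigma_x, Sigma_w, sigma_eps):
    """Mean and variance of y_i given z_i under joint Gaussianity.

    Model: y_i = <x_i, beta> + eps_i and z_i = x_i + w_i, with x_i, w_i
    and eps_i independent, zero-mean Gaussian.
    """
    Sigma_z = Sigma_x + Sigma_w                     # Cov(z_i)
    cov_yz = Sigma_x @ beta                         # Cov(z_i, y_i)
    var_y = beta @ Sigma_x @ beta + sigma_eps**2    # Var(y_i)
    mean = cov_yz @ np.linalg.solve(Sigma_z, z)
    var = var_y - cov_yz @ np.linalg.solve(Sigma_z, cov_yz)
    return float(mean), float(var)

def conditional_params_isotropic(beta, z, sigma_x, sigma_w, sigma_eps):
    """Same quantities when Sigma_x = sigma_x^2 I and Sigma_w = sigma_w^2 I."""
    sigma_z2 = sigma_x**2 + sigma_w**2
    mean = sigma_x**2 / sigma_z2 * (beta @ z)
    var = sigma_x**2 * sigma_w**2 / sigma_z2 * (beta @ beta) + sigma_eps**2
    return float(mean), float(var)

rng = np.random.default_rng(1)
d = 6
beta, z = rng.standard_normal(d), rng.standard_normal(d)
sigma_x, sigma_w, sigma_eps = 1.3, 0.7, 0.4

general = conditional_params(beta, z, sigma_x**2 * np.eye(d),
                             sigma_w**2 * np.eye(d), sigma_eps)
isotropic = conditional_params_isotropic(beta, z, sigma_x, sigma_w, sigma_eps)
print(general, isotropic)
```

In the isotropic case the conditional variance depends on the parameter only through its squared $\ell_2$-norm, which is why the measurement error inflates the effective noise level seen by any estimator.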
Theorem 1 (Lower bound on $\ell_p$-loss).
In the additive error setting, suppose that the observed matrix satisfies Assumption 1 with . Then for any $p \in [1, \infty)$, there exists a constant depending only on $q$ and $p$ such that, with probability at least 1/2, the minimax $\ell_p$-loss over the $\ell_q$-ball is lower bounded as
Let denote the cardinality of a maximal packing of the ball $B_q(R_q)$ in the $\ell_p$ metric with elements . We follow the standard technique (Yang and Barron, 1999) to reduce the lower bound on the estimation error to a multi-way hypothesis testing problem as follows
where, and is an estimator taking values in the packing set. It then follows from Fano’s inequality (Yang and Barron, 1999) that
where is the mutual information between the random variable and the observation vector . It now remains to upper bound the mutual information . Let be the minimal cardinality of an -covering of in -norm. From the procedure of Yang and Barron (1999), the mutual information is upper bounded as
Let denote the -convex hull of the rescaled columns of the observed matrix , that is,
where the normalization is used for convenience. Since satisfies Assumption 1, Lemma 4 of Raskutti et al. (Unpublished results) applies and shows that there exists a set such that for all , there exists some index and some constant such that . Combining this inequality with Lemma 1 and (6), one has that the mutual information is upper bounded as
Thus we obtain by (5) that
It remains to choose the packing and covering set radii (i.e., and , respectively) such that (7) is strictly above zero, say bounded below by . For simplicity, denote . Suppose that we choose the pair such that
As long as , it is guaranteed that
as desired. It remains to determine the values of the pair satisfying (8). By Raskutti et al. (Unpublished results, Lemma 3), we know that if for some constant depending only on , then (8a) is satisfied. Thus, we can choose satisfying
Also it follows from Raskutti et al. (Unpublished results, Lemma 3) that if is chosen as
The proof is complete. ∎
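The role of Fano's inequality in the proof above can be illustrated numerically. The sketch below assumes the standard form of the inequality for a hypothesis drawn uniformly from $M$ candidates, namely that the error probability of any test is at least $1 - (I + \log 2)/\log M$, where $I$ is the mutual information in nats. The function name and the numerical values are hypothetical; they are not the constants used in the proof.

```python
import math

def fano_error_lower_bound(mutual_information, num_hypotheses):
    """Fano lower bound on the error probability of an M-ary test.

    Uses P(error) >= 1 - (I + log 2) / log M, clipped at 0, with the
    mutual information I measured in nats.
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_information + math.log(2.0)) / math.log(num_hypotheses)
    return max(0.0, bound)

# Example: the packing has M = 2^16 elements and the mutual information
# is one quarter of log M.
M = 2 ** 16
info = 0.25 * math.log(M)
print(fano_error_lower_bound(info, M))
```

The bound stays away from zero only when the log-cardinality of the packing dominates the mutual information, which is the balance that the choice of packing and covering radii is meant to achieve.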
Note that the probability in Theorem 1 is just a standard convention, and it may be made arbitrarily close to 1 by choosing the universal constants suitably.
Theorem 2 (Upper bound on $\ell_2$-loss).
In the additive error setting, suppose that for a universal constant , satisfies Assumption 2 with and . Then there exist universal constants and a constant depending only on such that, with probability at least , the minimax $\ell_2$-loss over the $\ell_q$-ball is upper bounded as
It suffices to find an estimator for $\beta^*$ that has small $\ell_2$-norm error with high probability. We consider the estimator as follows
It is worth noting that (13) involves solving a nonconvex optimization problem when . Since , it follows from the optimality of that . Define , and thus . Then one has that
This inequality, together with the assumption that satisfies Assumption 2, implies that
It then follows from Loh and Wainwright (2012b, Lemma 2) that there exist universal constants such that, with probability at least ,
Introduce the shorthand . Recall that . It then follows from Raskutti et al. (2011, Lemma 5) (with ) and the assumption that
Therefore, by solving this inequality with the indeterminate viewed as , we obtain that there exists a constant depending only on such that, (12) holds with probability at least . The proof is complete. ∎
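The nonconvexity mentioned in the proof above can be seen directly from the surrogate matrix. The sketch below assumes the additive-error surrogate $\hat{\Gamma} = Z^\top Z / n - \Sigma_w$ of Loh and Wainwright (2012b) with $\Sigma_w = \sigma_w^2 I_d$, and assumes that the objective contains the quadratic form $\beta^\top \hat{\Gamma} \beta$. When $n < d$, the matrix $\hat{\Gamma}$ has negative eigenvalues, so such a quadratic is not convex. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 100                      # high-dimensional regime: n < d
sigma_x, sigma_w = 1.0, 0.5

# Observed covariates: clean Gaussian design plus additive Gaussian error.
Z = sigma_x * rng.standard_normal((n, d)) + sigma_w * rng.standard_normal((n, d))
Gamma_hat = Z.T @ Z / n - sigma_w**2 * np.eye(d)

# Z^T Z / n has rank at most n < d, so it has zero eigenvalues, and
# subtracting sigma_w^2 * I shifts them to -sigma_w^2 < 0.
eigvals = np.linalg.eigvalsh(Gamma_hat)
print(eigvals.min())
```

The smallest eigenvalue equals $-\sigma_w^2$ up to rounding, which is why a restricted eigenvalue condition, holding only over a suitable set of directions, is needed in place of positive definiteness.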
(i) The lower and upper bounds for the minimax rates depend on the triple $(n, d, R_q)$, the error level, and the observed matrix , as shown in Theorems 1 and 2. Specifically, by setting $p = 2$ in Theorem 1, the lower and upper bounds agree up to constant factors, showing the optimal minimax rates in the additive error case.
(ii) Note that when $q = 0$ and $p = 2$ (i.e., the exact sparse case), the minimax rate scales as . In the regime when for some constant , the rate is equivalent to (up to constant factors), which recaptures the same scaling as in Loh and Wainwright (2012a).
We focused on the information-theoretic limitations of estimation for sparse linear regression with additive errors under the high-dimensional scaling. Further research may generalize the current result to sub-Gaussian matrices with non-diagonal covariances, or other types of measurement errors, such as the multiplicative error.
- Bickel and Ritov (1987) Bickel, P. J., Ritov, Y., 1987. Efficient estimation in the errors in variables model. Ann. Statist. 15 (2), 513–540.
- Carroll et al. (2006) Carroll, R. J., Ruppert, D., Stefanski, L. A., Crainiceanu, C. M., 2006. Measurement error in nonlinear models: A modern perspective, second ed. Chapman & Hall/CRC, Boca Raton, Florida.
- Delaigle and Meister (2007) Delaigle, A., Meister, A., 2007. Nonparametric regression estimation in the heteroscedastic errors-in-variables problem. J. Amer. Statist. Assoc. 102 (480), 1416–1426.
- Huwang and Hwang (2002) Huwang, L., Hwang, J. G., 2002. Prediction and confidence intervals for nonlinear measurement error models without identifiability information. Statist. Probab. Lett. 58 (4), 355–362.
- Loh and Wainwright (2012a) Loh, P.-L., Wainwright, M. J., 2012a. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In: IEEE International Symposium on Information Theory Proceedings. pp. 2601–2605.
- Loh and Wainwright (2012b) Loh, P.-L., Wainwright, M. J., 2012b. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Statist. 40 (3), 1637–1664.
- Loh and Wainwright (2015) Loh, P.-L., Wainwright, M. J., 2015. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16 (1), 559–616.
- Mallat (1989) Mallat, S. G., 1989. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11 (7), 674–693.
- Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., Yu, B., 2010. Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11 (Aug), 2241–2259.
- Raskutti et al. (2011) Raskutti, G., Wainwright, M. J., Yu, B., 2011. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Trans. Inform. Theory 57 (10), 6976–6994.
- Raskutti et al. (Unpublished results) Raskutti, G., Wainwright, M. J., Yu, B., Unpublished results. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. arXiv preprint arXiv:0910.2042.
- Stefanski and Carroll (1987) Stefanski, L. A., Carroll, R. J., 1987. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika 74 (4), 703–716.
- Tsiatis and Ma (2004) Tsiatis, A. A., Ma, Y. Y., 2004. Locally efficient semiparametric estimators for functional measurement error models. Biometrika 91 (4), 835–848.
- Yang and Barron (1999) Yang, Y. H., Barron, A., 1999. Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 (5), 1564–1599.