Model Agnostic High-Dimensional Error-in-Variable Regression

02/28/2019 ∙ by Anish Agarwal, et al. ∙ MIT

We consider the problem of high-dimensional error-in-variable regression where we only observe a sparse, noisy version of the covariate data. We propose an algorithm that utilizes matrix estimation (ME) as a key subroutine to de-noise the corrupted data, and then performs ordinary least squares regression. When the ME subroutine is instantiated with hard singular value thresholding (HSVT), our results indicate that if the number of samples scales as ω(ρ^{-4} r log^5(p)), then our in- and out-of-sample prediction error decays to 0 as p → ∞; ρ represents the fraction of observed data, r is the (approximate) rank of the true covariate matrix, and p is the number of covariates. As an important byproduct of our approach, we demonstrate that HSVT with regression acts as implicit ℓ_0-regularization since HSVT aims to find a low-rank structure within the covariance matrix. Thus, we can view the sparsity of the estimated parameter as a consequence of the covariate structure rather than a model assumption as is often considered in the literature. Moreover, our non-asymptotic bounds match (up to log^4(p) factors) the best guaranteed sample complexity results in the literature for algorithms that require precise knowledge of the underlying model; we highlight that our approach is model agnostic. In our analysis, we obtain two technical results of independent interest: first, we provide a simple bound on the spectral norm of random matrices with independent sub-exponential rows with randomly missing entries; second, we bound the max column sum error -- a nonstandard error metric -- for HSVT. Our setting enables us to apply our results to applications such as synthetic control for causal inference, time series analysis, and regression with privacy. It is important to note that the existing inventory of methods is unable to analyze these applications.




1 Introduction


We consider error-in-variable regression in the high-dimensional regime. Both the covariate matrix and the model parameters are unobserved. The responses are only partially observed: given a “sample set” of row indices, we observe the subset of the response variables indexed by that set. Rather than observing the true covariates, we are given access to a sparse, noisy version of them. Specifically, each entry of the observed matrix equals the true covariate plus noise when it is observed, which happens with probability ρ, and an unknown value otherwise. Given the observed responses and the corrupted covariates, the goal is to predict the expected responses.

Matrix estimation for pre-processing.

In the classical regression setup, the covariates are assumed to be fully observed and noiseless, i.e., the observed covariates coincide with the true ones; if the number of covariates p is small, then ordinary least squares (OLS) is sufficient in solving the problem; if, however, p is large and can exceed the number of samples, then regularization methods (e.g., Lasso) can accurately recover a (sparse) parameter vector. However, most modern datasets of interest are both high-dimensional and corrupted by noisy, partial observations. In the last decade or so, matrix estimation has emerged as a powerful, model agnostic method for recovering a structured matrix from its noisy and sparse observations. Therefore, it stands to reason that in the error-in-variable setup, where we observe corrupted covariates instead of the true ones, we can produce a good estimate of the true covariates from the observations via matrix estimation. Using this estimate, we can then apply OLS to recover the underlying model parameter.

In this paper, our focus is to investigate the properties of such an approach. Specifically, we consider the three-step procedure (see Algorithm 1): (1) obtain an estimate of the true covariates by applying matrix estimation (see Section 3.1) on the observed covariates; (2) run OLS on the estimated covariates (restricting their rows to the sample set) and the observed responses to produce a parameter estimate; (3) output the resulting predictions as an estimate for the expected responses. Our analysis proves that these predictions achieve small in- and out-of-sample prediction errors with a nearly optimal number of samples, indicating that matrix estimation can be a general-purpose data pre-processing step to obtain estimates of the latent covariates for the purposes of prediction. Further, our testing (out-of-sample) error analysis (see Appendix 6.3) demonstrates that matrix estimation with OLS effectively performs implicit regularization. This highlights another benefit of applying matrix estimation as a pre-processing step.

1.1 Contributions

Model and algorithm.

We describe a model for high-dimensional error-in-variable regression (see Section 2), where we simultaneously allow for missing data (Property 2.4) and sub-gaussian or sub-exponential noise in the covariate measurements (Property 2.2). A key contribution is in utilizing a natural three-step algorithm (see Algorithm 1) for this setting. Our proposed algorithm is model agnostic and does not require knowledge (or an estimate) of the noise covariance, as is commonly assumed in the literature (cf. [Loh and Wainwright(2012), Datta and Zou(2017), Rosenbaum and Tsybakov(2013)]). Despite this generality, our algorithm achieves a vanishing (in- and out-of-sample) prediction error with the number of required samples scaling comparably to the state-of-the-art results in the literature. We highlight again that our analysis holds for the weaker requirement of sub-exponential noise compared to the sub-gaussianity assumption typically made in the literature.

Finite sample analysis of prediction error.

We provide new finite sample prediction error bounds for high-dimensional linear regression with corrupted covariates (see Theorem 4.1). This bound holds for any matrix estimation algorithm, and is particularly useful if its max column sum error (MCSE) (Definition 4.1) is small. For concreteness, we instantiate the matrix estimation subroutine with hard singular value thresholding (HSVT) (Section 3.1), and provide finite-sample analysis for both training (Corollary 4.2.1) and testing error (Theorem 4.2.3). A key contribution is that if the underlying matrix is (approximately) low-rank, then both errors decay to 0 as p → ∞ when the number of samples scales as ω(ρ^{-4} r log^5(p)), where r is the (approximate) rank of the true covariate matrix (Propositions 4.2.2 and 4.2.2). By Theorem 4.2.3 (specifically, Lemma 6.3), we show that pre-processing the observed covariates with HSVT and then performing OLS is a form of implicit regularization, giving similar Rademacher complexity bounds as that of regression with ℓ_0-regularization (a nonconvex program). As discussed in Section 4.4, the sample complexity of our model agnostic algorithm is comparable to the best known sample complexities of model aware methods, cf. [Loh and Wainwright(2012), Datta and Zou(2017), Rosenbaum and Tsybakov(2013)].

Technical results of independent interest.

We highlight two technical results. First, a spectral norm bound for random matrices whose rows are (1) independent (entry-wise independence is not necessary), and (2) the Hadamard product of sub-exponential and Bernoulli random vectors (see Properties 2.2 and 2.4); Theorem 4.3.1 bounds how this spectral norm scales. Second, we prove a high probability bound on the MCSE of the estimate produced from HSVT (see Theorem 4.3.2). We note that MCSE is a more stringent error metric than the Frobenius norm, the standard metric of choice in the literature (see Section 1.2 for details). The study of MCSE is motivated by Theorem 4.1, as it provides a general upper bound for the prediction error in error-in-variable regression.


In Section 5, we provide important applications of error-in-variable regression and how they fit within our framework. In the first two applications, synthetic control and time series analysis, covariate noise is a natural formation (e.g., in time series forecasting, future predictions are made using past noisy observations). Both topics are of immense importance in fields such as econometrics, signal processing, machine learning, finance, and retail. The third application, regression with privacy, is rapidly becoming a crucial topic in machine learning, where practitioners strive to make predictions with highly sensitive data. Here, it is typical to purposefully inject Laplacian noise into the covariates so the underlying dataset is differentially private. This, again, neatly fits into our framework.

1.2 Related works

Matrix estimation.

Over the past decade, matrix estimation has spurred tremendous theoretical and empirical research across numerous fields, including recommendation systems (cf. [Keshavan et al.(2010a), Keshavan et al.(2010b), Negahban and Wainwright(2011), Chen and Wainwright(2015), Chatterjee(2015), Lee et al.(2016), Candès and Tao(2010), Recht(2011), Davenport et al.(2014)]), social network analysis (cf. [Abbe and Sandon(2015a), Abbe and Sandon(2015b), Abbe and Sandon(2016), Anandkumar et al.(2013), Hopkins and Steurer(2017)]), and graph learning (graphon estimation) (cf. [Airoldi et al.(2013), Zhang et al.(2015), Borgs et al.(2015), Borgs et al.(2017)]). Traditionally, the end goal is to recover the underlying mean matrix from an incomplete and noisy sampling of its entries; the quality of the estimate is often measured through the Frobenius norm. Further, entry-wise independence and sub-gaussian noise are typically assumed. A key property of many matrix estimation methods is that they are model agnostic (i.e., the de-noising procedure does not change with the noise assumptions); this makes such methods desirable for our purposes. We build upon recent developments by advocating that matrix estimation can be a vital pre-processing subroutine in solving high-dimensional error-in-variable regression. To theoretically analyze the effectiveness of matrix estimation as a pre-processing procedure for linear regression, we study a nonstandard error metric, the MCSE, a stronger error metric than the Frobenius norm (appropriately normalized). Further, we only require independence across rows (e.g., measurements), and allow for a broader class of noise distributions (e.g., sub-exponential). This allows our model and algorithm to connect to important modern applications such as differential privacy, where adding Laplacian noise (a sub-exponential random variable) to the data is a standard tool in preserving privacy within databases. Thus, our algorithm can serve as a useful tool for interacting with highly sensitive datasets.

Error-in-variable regression.

There exists a rich body of work regarding high-dimensional error-in-variable regression (cf. [Loh and Wainwright(2012)], [Datta and Zou(2017)], [Rosenbaum and Tsybakov(2010)], [Rosenbaum and Tsybakov(2013)], [Belloni et al.(2017b)], [Belloni et al.(2017a)], [Chen and Caramanis(2012)], [Chen and Caramanis(2013)], [Kaul and Koul(2015)]). Two common threads of these works include: (1) a sparsity assumption on the model parameter; (2) error bounds with convergence rates for estimating the model parameter under different norms. Some notable works closest to our setup include [Loh and Wainwright(2012)], [Datta and Zou(2017)], [Rosenbaum and Tsybakov(2013)]. We focus the comparison of our work on these few papers.

In [Loh and Wainwright(2012)], a non-convex ℓ_1-penalization algorithm is proposed based on the plug-in principle to handle covariate measurement errors. However, the authors consider additive and multiplicative (with randomly missing data as a special instance) noise models separately and design different plug-in estimators in each setting. Under both noise models, [Loh and Wainwright(2012)] assume that the observed covariate matrix is sub-gaussian, and that a bound on the unknown vector to be estimated is known. Arguably the most crucial difference is that for the additive noise setting, they additionally require knowledge of the unobserved noise covariance matrix, and the estimator they design changes based on their assumption of it.

[Datta and Zou(2017)] builds upon [Loh and Wainwright(2012)], but proposes a convex formulation of Lasso. Although the algorithm introduced does not require knowledge of a bound on the unknown parameter vector, similar assumptions on the covariates and noise (e.g., sub-gaussianity and access to the noise covariance) are made. This renders their algorithm not model agnostic. In fact, many works (e.g., [Rosenbaum and Tsybakov(2010)], [Rosenbaum and Tsybakov(2013)], [Belloni et al.(2017b)]) require either the noise covariance to be known or the structure of the noise to be such that it admits a data-driven estimator for its covariance matrix. This is so because these algorithms rely on correcting the bias of a matrix that we do not need to compute.

A key difference with the above works is that they aim to estimate the model parameter exactly, while we aim to achieve low training/testing prediction error. Learning the parameter is important, but proving low training/testing error is vital in guaranteeing good predictions when out-of-sample measurements are sparse and noisy. To the best of our knowledge, the above works do not provide a formal method to de-noise the observed covariates.

In summary, from a model standpoint, our work analyzes a more general setting: we allow the covariates to be simultaneously corrupted by noise (including sub-exponential noise) and missing data. Algorithmically, we propose a model agnostic estimator that (1) does not change depending on the underlying model; and (2) provably generalizes beyond the training set in predicting the expected response values via a de-noising process of the observed covariates.

Principal Component Regression (PCR).

Recall our approach to handling covariates with measurement errors is a two-step procedure that first utilizes a general matrix estimation method to de-noise and impute the observed covariates, and then performs linear regression to make predictions. Our analysis focuses on when the matrix estimation subroutine is HSVT (Section 3.1). In this case, our algorithm is similar to that of Principal Component Regression (PCR) (cf. [Bair et al.(2006)], [Jolliffe(1982)]). Thus, a contribution of our work is in motivating that PCR-like methods are an effective tool for solving error-in-variable regression in high dimensions. In particular, we provide finite sample analysis (Propositions 4.2.2 and 4.2.2) and (in- and out-of-sample) prediction error bounds (Corollary 4.2.1 and Theorem 4.2.3) that demonstrate the efficacy of PCR-like methods. As stated in Section 1.1, it is worth recalling that our analysis indicates that PCR serves as a form of implicit regularization (Proposition D), i.e., taking a low-rank approximation of the observed covariates and then performing OLS gives similar Rademacher complexity bounds as that of regression with ℓ_0-regularization.

2 Problem Setup

Standard formulations of prediction problems assume the independent variables (covariates) are noiseless and fully observed. However, these assumptions often fail to accurately describe modern applications where the covariates are commonly noisy, missing, and/or dependent. Here, we study these issues in the context of high-dimensional linear regression.

2.1 Structural Assumptions for Covariates

Consider the matrix of true covariates, where the number of predictors p can possibly exceed the number of rows. We assume that its entries are bounded.

Property 2.1

There exists an absolute constant that bounds the magnitude of every entry of the true covariate matrix.

Rather than directly observing the true covariates, we are given access to a corrupted version. The perturbed covariate matrix is a random matrix obtained by adding a noise matrix to the deterministic covariates, where the noise matrix is a random matrix of independent, mean-zero rows. Before we introduce the properties assumed on the noise matrix, we first define an important class of random variables/vectors.

For any α ≥ 1, we define the ψ_α-norm of a random variable in the usual Orlicz sense. If this norm is finite, we call the random variable a ψ_α-random variable. More generally, we say a random vector is a ψ_α-random vector if all of its one-dimensional marginals are ψ_α-random variables for any fixed vector. We define the ψ_α-norm of a random vector as the supremum of the ψ_α-norms of its one-dimensional marginals over the unit sphere, where the marginals are taken with respect to the standard inner product. Note that α = 2 and α = 1 represent the class of sub-gaussian and sub-exponential random variables/vectors, respectively. We now impose the following properties on the noise matrix. For notational convenience, we refer to the noise matrix by its individual entries, rows, and columns; in general, for any matrix, we use the same indexing for its rows and columns.
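For reference, the norms described in words above are conventionally written as follows. This is the standard textbook form of the ψ_α (Orlicz) norm, offered as an assumption about the intended definition; the symbols η, t, v, and d are illustrative choices rather than notation confirmed by the text.

```latex
% Scalar random variable \eta, for \alpha \ge 1:
\|\eta\|_{\psi_\alpha} \;=\; \inf\Big\{ t > 0 \;:\; \mathbb{E}\exp\big(|\eta|^{\alpha} / t^{\alpha}\big) \le 2 \Big\}

% Random vector \eta \in \mathbb{R}^{d}, via its one-dimensional marginals:
\|\eta\|_{\psi_\alpha} \;=\; \sup_{v \in \mathbb{S}^{d-1}} \big\| \langle \eta, v \rangle \big\|_{\psi_\alpha}
```

With α = 2 this recovers the usual sub-gaussian norm, and with α = 1 the sub-exponential norm.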

Property 2.2

The noise matrix has independent, mean-zero ψ_α-rows for some α ≥ 1, i.e., the ψ_α-norm of every row is bounded by a finite constant.

Property 2.3

for all .

2.2 Missing Data

In addition to the noise perturbations, we allow for missing data within our observed covariate matrix. In particular, we observe a “masked” version of the perturbed covariate matrix, i.e., each entry is observed with some probability ρ, independently of the other entries. This is made formal by the following property.

Property 2.4

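Each entry of the observed matrix is sampled independently: it equals the corresponding entry of the perturbed covariate matrix with probability ρ, and an unknown value otherwise.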
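The corruption model of Sections 2.1 and 2.2 can be simulated in a few lines. The sketch below is illustrative only: it assumes Laplace noise (one example of sub-exponential noise), encodes the unknown value as NaN, and uses hypothetical names (`corrupt_covariates`, `noise_scale`) that do not come from the text.

```python
import numpy as np


def corrupt_covariates(A, noise_scale, rho, rng):
    """Return a noisy, masked copy of A; unobserved entries are NaN (assumed encoding)."""
    # Additive, mean-zero noise. Laplace is one sub-exponential choice (an assumption).
    noise = rng.laplace(loc=0.0, scale=noise_scale, size=A.shape)
    X = A + noise
    # Each entry is revealed independently with probability rho.
    observed = rng.random(A.shape) < rho
    return np.where(observed, X, np.nan)


rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(200, 50))   # bounded true covariates
Z = corrupt_covariates(A, noise_scale=0.1, rho=0.7, rng=rng)
observed_fraction = float(np.mean(~np.isnan(Z)))
```

The observed fraction concentrates around ρ, which is why the empirical proportion of revealed entries is a natural plug-in for the unknown observation probability.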

2.3 Response Variables

We consider a response associated with each covariate vector. Formally, each row of the true covariate matrix has an associated random response variable. We consider the setting where the responses follow a linear model: each response equals the inner product of its true covariate vector with the vector of unknown parameters, plus response noise with the following property.

Property 2.5

The response noise is a random vector with independent, mean-zero entries whose variances are uniformly bounded.

2.4 Problem Statement and Model Recap

In summary, we observe all (noisy) covariates. However, we only observe a subset of the corresponding response values. Using these sample points, we aim to produce a regression estimator so that our prediction estimates are close to the unknown expected response values associated with every data point, whether or not its response was observed. We will evaluate our algorithm based on its prediction error. Specifically, we assess the quality of our estimate in terms of its (1) mean-squared training (empirical) error over the sample set, referred to as (6), and (2) mean-squared test error over all (observed and unobserved) entries, referred to as (7).

Note that the sample set is the set of locations for which we observe a response. This is the set of sample indices that will be used to learn a model parameter in our algorithm (Algorithm 1). In that sense, we call the error over this set the “training error”. For convenience, we summarize all of our model assumptions in Table 1.

Assumption on                      Stated in
Covariates                         Property 2.1
Covariate Noise (ψ_α-norm)         Property 2.2
Covariate Noise (Covariance)       Property 2.3
Covariate Masking                  Property 2.4
Response Noise                     Property 2.5
Table 1: Summary of Model Assumptions

For any matrix and index set, we consider the submatrix formed by stacking the rows of the matrix indexed by that set. We are particularly interested in the case where the index set is the sample set, i.e., the set of locations for which we observe a response. In this case, the submatrix is formed by concatenating the rows of the matrix associated with the sample set. We define the corresponding submatrices of the true, perturbed, observed, and estimated covariate matrices in the same way. The superscript indicating the index set is sometimes omitted whenever what the matrix represents is clear from the context. Finally, we use poly(·) to denote a function that scales at most polynomially in its arguments.

3 Algorithm

3.1 Matrix Estimation & Hard Singular Value Thresholding

Our proposed algorithm to solve the error-in-variable regression problem relies on a “blackbox” matrix estimation procedure as an important subroutine, which we define as follows:

A matrix estimation algorithm takes as input a matrix that is a partially observed, noisy version of the true covariate matrix, and outputs an estimate of the true covariate matrix.


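For concreteness, we describe one of the most commonly used matrix estimation subroutines, hard singular value thresholding (HSVT). For any threshold, we define the HSVT map, which shaves off the input matrix’s singular values below that threshold. Specifically, given the singular value decomposition of the input matrix, the map keeps each singular value that is at least the threshold, sets the rest to zero, and recombines the result with the original singular vectors; the selection is expressed with an indicator function.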
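The thresholding map described above can be sketched with NumPy's SVD. This is a minimal illustration, assuming that singular values equal to the threshold are kept; the names `hsvt` and `threshold` are chosen here for illustration.

```python
import numpy as np


def hsvt(M, threshold):
    """Hard singular value thresholding: drop singular values below `threshold`."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s >= threshold                     # indicator on each singular value
    # Rebuild the matrix from the retained singular triplets only.
    return (U[:, keep] * s[keep]) @ Vt[keep, :]


M = np.diag([3.0, 1.0])
M_low = hsvt(M, threshold=2.0)    # only the singular value 3 survives
M_same = hsvt(M, threshold=0.5)   # both singular values survive
```

The output always has rank equal to the number of singular values at or above the threshold, so the threshold directly controls the rank of the de-noised matrix.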

HSVT with missing data.


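Consider the singular value decomposition of the observed matrix, together with the proportion of observed entries in it. Using an HSVT subroutine, the estimator of the true covariate matrix is obtained by thresholding this singular value decomposition and accounting for the proportion of observed entries.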


3.2 “Error-in-Variable” Regression via Matrix Estimation

We can now formally state our “Matrix Estimation Regression” method in Algorithm 1.

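Input : observed covariates, observed responses over the sample set, matrix estimation subroutine
Output : predicted responses
1: De-noise and impute the observed covariates to obtain an estimate of the true covariates.
2: Form the sub-matrix from the rows of the estimate associated with the sample set.
3: Perform linear regression of the observed responses on this sub-matrix to obtain a parameter estimate.
4: Define the predictions by applying the parameter estimate to the full estimated covariate matrix.
Algorithm 1 Matrix Estimation Regression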

We fix a particular solution of the regression in Step 3. In the classical regime (at least as many samples as covariates), this reduces to the least squares solution. In the high-dimensional setup (more covariates than samples), this yields one solution in the row span of the sub-matrix from Step 2. Our main result (Theorem 4.1) holds for any matrix estimation algorithm that bounds the max column sum error (MCSE) (refer to Definition 4.1). However, given that MCSE is a nonstandard error metric for matrix estimation, we instantiate it for HSVT in Theorem 4.2.1.
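The steps of Algorithm 1 with HSVT as the matrix estimation subroutine can be sketched as follows. Several details are assumptions rather than statements from the text: unobserved entries are encoded as NaN and filled with zeros before the SVD, the thresholded matrix is rescaled by the inverse of the observed fraction, and the regression uses the minimum-norm least-squares solution returned by `np.linalg.lstsq`. The threshold value and all names are hypothetical.

```python
import numpy as np


def hsvt(M, threshold):
    """Keep only singular components of M with singular value >= threshold."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s >= threshold
    return (U[:, keep] * s[keep]) @ Vt[keep, :]


def me_regression(Z, y_obs, omega, threshold):
    """De-noise Z, fit least squares on rows `omega`, and predict for every row."""
    observed = ~np.isnan(Z)
    # Assumption: plug-in estimate of the observation probability.
    rho_hat = max(observed.mean(), 1.0 / Z.size)
    # Assumption: zero-fill missing entries, threshold, then rescale by 1 / rho_hat.
    A_hat = hsvt(np.where(observed, Z, 0.0), threshold) / rho_hat
    # lstsq returns the minimum-norm least-squares solution, which lies in the
    # row span of the training sub-matrix when covariates outnumber samples.
    beta_hat, *_ = np.linalg.lstsq(A_hat[omega], y_obs, rcond=1e-10)
    return A_hat @ beta_hat, beta_hat, A_hat


rng = np.random.default_rng(0)
N, p, r, n = 300, 40, 3, 150
A = rng.uniform(-1, 1, (N, r)) @ rng.uniform(-1, 1, (p, r)).T   # low-rank covariates
beta_star = rng.normal(size=p)
Z = np.where(rng.random((N, p)) < 0.8, A + 0.1 * rng.normal(size=(N, p)), np.nan)
omega = rng.choice(N, size=n, replace=False)                     # sample set
y_obs = A[omega] @ beta_star + 0.1 * rng.normal(size=n)

y_hat, beta_hat, A_hat = me_regression(Z, y_obs, omega, threshold=12.0)
```

Note that the procedure never uses the noise distribution or its covariance, which is the sense in which it is model agnostic; the only tuning input in this sketch is the singular value threshold.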

4 Main Results

We state our main results and discuss consequences for noisy and missing data. The proof of Theorem 4.1 is presented in Appendix F. Proofs for the other Theorems can be found in Section 6.

4.1 Prediction Error Bounds for General Matrix Estimation Methods

We present Theorem 4.1, which bounds the training MSE (refer to (6)) for a general matrix estimation algorithm. It is important to note that this quantity is also key in bounding the testing MSE (refer to (7)), as seen by Theorem 4.2.3. As a tool for analysis, we introduce an auxiliary notion of error, the so-called max column sum error (MCSE). For an estimator of and a set , we define the max column sum error (MCSE) of over as


It is easily seen that MCSE is a stronger metric than the conventional Frobenius norm bound (for an estimator of a matrix, the appropriately normalized Frobenius norm error is at most its MCSE); thus, any known Frobenius norm lower bounds immediately hold for the MCSE as well. See Appendix F for details on the MCSE metric.
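The relation between the two metrics can be checked numerically. The sketch below computes a realized (non-averaged) max column sum of squared errors over a row set and compares it with the squared Frobenius error divided by the number of columns. The exact normalization and the expectation in the formal definition are not reproduced here, so the function names and scaling should be read as assumptions.

```python
import numpy as np


def max_column_sum_error(A_hat, A, omega):
    """Largest per-column sum of squared errors over the rows in omega."""
    sq_err = (A_hat[omega] - A[omega]) ** 2
    return float(sq_err.sum(axis=0).max())


def frobenius_sq_error(A_hat, A, omega):
    """Squared Frobenius norm of the error over the rows in omega."""
    return float(((A_hat[omega] - A[omega]) ** 2).sum())


A = np.zeros((3, 2))
A_hat = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 1.0]])
omega = [0, 1, 2]

mcse = max_column_sum_error(A_hat, A, omega)      # column sums are 5 and 1
frob = frobenius_sq_error(A_hat, A, omega)        # total squared error is 6
avg_column_error = frob / A.shape[1]
```

Because a maximum over columns is never smaller than the average over columns, a bound on the max column sum error always implies a bound on the column-averaged Frobenius error, but not the other way around.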

The following theorem provides a general upper bound on the training MSE of our estimate of the underlying signal in terms of the model parameter , variance of the response noise , and properties of the covariate estimate .

[Training MSE for general ME methods] Let and be the estimator of produced by a general “matrix estimation regression” method described in Algorithm 1. Assume Property 2.5 holds. Then, the training prediction error of our algorithm satisfies


The proof of Theorem 4.1 is deferred until Appendix F.

4.2 Prediction Error Bounds for HSVT

Here, we instantiate our matrix estimation subroutine with HSVT, and provide non-asymptotic upper bounds for both the training and testing MSE. Specifically, we upper bound the MCSE for HSVT in Theorem 4.2.1; combining this with Theorem 4.1 yields an upper bound on training error, cf. Corollary 4.2.1. Also, we analyze the generalization error to provide an upper bound on the testing MSE, which can be found in Theorem 4.2.3.

4.2.1 Training Error

If we apply HSVT in the de-noising procedure of Algorithm 1, then we obtain an explicit upper bound on MCSE as stated in Theorem 4.2.1. In order to bound the MCSE of HSVT, we first describe the role of the thresholding hyper-parameter . Let


be the singular value decomposition of with its singular values arranged in descending order. We reserve to denote the -th singular value of throughout this exposition. We may partition the principal components of at the threshold as


Then, . We define , i.e.,


Before we present our results, we define a quantity which we will refer to multiple times in our theorem statements:


where is an absolute constant that depends only on .

[Main Theorem 1: MCSE upper bound for HSVT] Given and , let and . Let as defined above. Suppose the following conditions hold:

  1. Properties 2.1, 2.2 for some , 2.3, 2.4, and 2.5

  2. satisfies

  3. .

We define

where is an absolute constant. Then there exists an absolute constant such that


Theorem 4.2.1 follows as an immediate consequence of Theorem 4.3.2. The full details of the proof (including the technical lemmas used in the proof and their proofs) can be found in Appendix B.

[Training MSE for HSVT] Suppose the conditions of Theorem 4.2.1 hold. Then for some (the same constant as in Theorem 4.2.1),


The result follows immediately from plugging (16) of Theorem 4.2.1 into (11) of Theorem 4.1.

4.2.2 Interpretation of Training Error Results

We now provide an interpretation of Corollary 4.2.1 with two exemplar scenarios. For that purpose, we present Corollary 4.2.2, which is simplified from Corollary 4.2.1 to succinctly convey the essence of the resulting training error bound. The proof of Corollary 4.2.2 and Propositions 4.2.2 and 4.2.2 can be found in Appendix C.

[Simplified Version of Corollary 4.2.1] Let the conditions of Theorem 4.2.1 hold. Suppose . Let and . Then,

Case 1: Low Rank, Evenly Spaced Singular Values.

Here, we assume the underlying covariate matrix is low-rank and its signal is evenly spaced out amongst its nonzero singular values .

Let conditions of Theorem 4.2.1 hold. Suppose: (1) ; (2) ; (3) where ; (4) . Let . If


then as .

Case 2: Geometrically Decaying Singular Values.

We now let be an approximately low-rank matrix with geometrically decaying singular values. Let denote the -th canonical basis vector. Recall that are the left and right singular vectors of , respectively, for . Let the conditions of Theorem 4.2.1 hold. Suppose: (1) for some ; (2) for where ; (3) ; (4) for all and all ; (5) ; and (6) for a sufficiently large . Let . If


then as .

We provide justification for Condition (4) in Proposition 4.2.2. From the proof of the proposition, we see that . Suppose that at least one for is aligned with , a canonical basis vector in , for some , i.e., . Then, it is easily seen that scales as . Hence, a certain structural assumption is needed to avoid this predicament.

Assuming an “incoherence” type structural assumption, as is common in the literature (cf. [Candes and Romberg(2007)]), can help us to achieve for all . Condition (4) is exactly such an assumption where only the mass of the residual right singular vectors associated with the geometrically decaying tail singular values (i.e., for ) needs to be evenly distributed amongst its entries. Note that we do not impose any structural assumptions on the descriptive singular vectors (i.e., for ).

4.2.3 Testing Error

We now proceed to demonstrate how instantiating our meta-algorithm with HSVT (Algorithm 1) affects our ability to learn and generalize. Let the sample set be a set of locations chosen uniformly at random and without replacement from all row indices. (In most setups, generalization error bounds are stated in settings where the samples are drawn i.i.d., whereas we sample our locations uniformly at random and without replacement; as argued in [Barak and Moitra(2016)], the sampling model differences are negligible.) As previously stated, our goal is to show that our hypothesis (defined by the learned parameter) is close to the unknown expected response value associated with all data points. We sketch the proof of Theorem 4.2.3 in Section 6.3. Full details are found in Appendix D.

[Main Theorem 2: Testing MSE for HSVT] Given and , let and . Let as defined above. Suppose the following conditions hold:

  1. Conditions of Theorem 4.2.1

  2. There exists a constant such that for any hypothesis (including ), .

Let . Then,


Here, the quantity is defined as in (17) (more precisely, they are equivalent up to constant factors, i.e., only the value of the constant in (17) changes).

4.3 Technical Results (of Independent Interest)

In this subsection, we present two important technical contributions in Theorems 4.3.1 and 4.3.2, which are both utilized in the derivation of Theorem 4.2.1. These results can also be lifted and applied in more general settings and, thus, could be of interest in their own right.

4.3.1 Spectral Norm Bounds for Random Matrices

[Main Technical Result 1] Suppose Properties 2.1, 2.2 for some , 2.3 and 2.4 hold. Then for any ,


with probability at least . Here, is an absolute constant that depends only on . We sketch the proof of Theorem 4.3.1 in Section 6.1. Full details are found in Appendix E.

4.3.2 High-probability Max Column -norm Error Bound via HSVT

[Main Technical Result 2] Given and , let . Let as defined above. Suppose the following conditions hold:

  1. Properties 2.1, 2.2 for some , 2.3, and 2.4

  2. satisfies

  3. .

Then, with probability at least ,


Here, is an absolute constant. We sketch the proof of Theorem 4.3.2 in Section 6.2. Full details are found in Appendix F.

4.4 Useful Comparisons

Here, we compare Propositions 4.2.2 and 4.2.2 against a few well known results in the high-dimensional error-in-variables regression literature. In order to facilitate a proper comparison, we first state the model assumptions made in [Loh and Wainwright(2012)]. [Datta and Zou(2017)] operates in a similar setting and builds upon the work of [Loh and Wainwright(2012)] (see Section 1.2 for more details).

As previously mentioned, [Loh and Wainwright(2012)] assumes the model parameter is sparse, and the covariates are corrupted such that only a corrupted version is observed. However, they assume that the observed matrix is generated by either an additive or a multiplicative (with missing data as a special instance) noise model; it is important to note that they do not consider both noise models simultaneously. In the former, they assume that (1) the covariates and the noise are random matrices of i.i.d. mean-zero sub-gaussian rows, and (2) the noise covariance matrix is known (although the authors of [Loh and Wainwright(2012)] argue that it can be estimated from data, it is unclear how to obtain a data-driven estimate when only one data set is readily available; however, if multiple replicates of the data are accessible, then a data-driven estimation procedure is achievable). Under these assumptions, a consistent estimation of the model parameter is achieved if


where denotes the minimum singular value of . In the setting of multiplicative i.i.d. Bernoulli noise (randomly missing data), the authors make the same assumptions on the covariates, with the noise matrix now playing the role of the missing data matrix. Consistent estimation of the model parameter is achieved (for some ) if


Again, we highlight that their algorithm changes based on the noise model assumption, i.e., they design different plug-in estimators for different scenarios.

[Datta and Zou(2017)] proposes a convex formulation of Lasso to handle measurement errors in the covariates (assumed to be deterministic). Note this setup differs from [Loh and Wainwright(2012)], which utilizes a non-convex ℓ_1-penalization and assumes the rows of the covariate matrix are drawn i.i.d. from some fixed distribution. However, both works design algorithms that depend on plug-in estimators tailor-made for different settings; in fact, the authors of [Datta and Zou(2017)] use the estimators proposed in [Loh and Wainwright(2012)]. Moreover, the same assumptions in the additive and i.i.d. missing data models are made. Consequently, both works achieve comparable statistical error bounds.

Although our aim is to minimize the prediction error, we can compare our sample complexity results in Propositions 4.2.2 and 4.2.2 against [Loh and Wainwright(2012)] (since similar bounds are derived in other works (cf. [Datta and Zou(2017)], [Rosenbaum and Tsybakov(2013)])). With respect to model setup, we (1) make no assumptions on ; (2) allow to be a random matrix of independent -rows (this includes sub-exponential noise); (3) allow to be simultaneously corrupted by noise and missing data. Algorithmically, we (1) do not require knowledge of or ; (2) do not change our estimation procedure based on the model assumptions. For a more fair comparison, however, one can imagine that the rows of our covariate matrix were sampled i.i.d. from an isotropic distribution, similar to that described in [Loh and Wainwright(2012)]. The conditions of Proposition 4.2.2 then apply and its sample complexity matches (24) and (25) up to factors as seen by (19). It is worth noting we have an identical dependence on .

Another difference to highlight is the dependence on in (24) and (25). This leads to a weak result for the setup in Proposition 4.2.2 with geometrically decaying singular values since gets arbitrarily small as and grow. However, as (20) of Proposition 4.2.2 indicates, our algorithm and associated analysis do not suffer in this setup with regard to sample complexity. Specifically, by applying HSVT, our bound in Corollary 4.2.2 demonstrates how the choice of leads to a precise tradeoff between the signal captured and model misspecification, as seen by and .

In short, despite less restrictive assumptions and our algorithm having minimal knowledge of the underlying model, we achieve comparable sample complexity bounds and guarantee prediction consistency (both in- and out-of-sample). While we do not learn specifically, our analysis allows us to accurately predict response values associated with noisy, missing covariates outside the sample set .

5 Applications

5.1 Synthetic Control

Problem formulation.

Synthetic control is a popular method for comparative case studies and policy evaluation in econometrics to predict a counterfactual for a unit of interest after its exposure to a treatment. To do so, a synthetic treatment unit is constructed using a combination of so-called “donor” units. Proposed by [Abadie and Gardeazabal(2003)], it has been analyzed in [Amjad et al.(2018)Amjad, Shah, and Shen], [Abadie et al.(2010)Abadie, Diamond, and Hainmueller], [Doudchenko and Imbens(2016)], [Hsiao et al.(2018)Hsiao, Wan, and Xie], [Athey and Imbens(2016)], [Athey et al.(2017)Athey, Bayati, Doudchenko, and Imbens]. A canonical example is in [Abadie et al.(2010)Abadie, Diamond, and Hainmueller], where the unit of interest is California, the donor pool is all other states in the U.S., and the treatment is Proposition 99; the goal is to isolate the effect of Proposition 99 on cigarette consumption in California. Formally, denotes the true donor matrix where is the number of observed time periods and the number of donors. We observe sparse, noisy observations . represents pre-treatment indices (time periods) and the pre-treatment responses for the exposed unit. Its counterfactual is denoted as for . To estimate , [Amjad et al.(2018)Amjad, Shah, and Shen] performs linear regression to learn , which represents the “correct” combination of donors to form a synthetic treatment unit. Thus, helps predict the response values for the exposed unit if the treatment had never been administered, i.e., for . The goal is to have low prediction error between and for .

How it fits our framework.

In [Amjad et al.(2018)Amjad, Shah, and Shen], the authors propose to de-noise the observed donor matrix to obtain , and then learn from and , the pre-treatment responses for the exposed and donor units, respectively. Moreover, they assume that is low-rank. It is easily seen that this setup is a special instance of our framework; more specifically, our training MSE bound (Corollary 4.2.1) is tighter than Corollary 4 of [Amjad et al.(2018)Amjad, Shah, and Shen], and we provide a testing MSE bound (Theorem 4.2.3), which is missing in their work. We highlight that existing methods in the error-in-variable regression literature are able to construct a synthetic California (expressed via ), but are unable to predict the counterfactual observations for .
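The de-noise-then-regress procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hsvt` and `synthetic_control`, the use of NaN to mark missing entries, the rescaling by the empirical observed fraction, and the toy data are all hypothetical choices made for the example, and the rank is assumed to be supplied by the user.

```python
import numpy as np

def hsvt(Z, rank):
    """Zero-fill NaNs, keep the top `rank` singular values, rescale by the observed fraction."""
    mask = ~np.isnan(Z)
    rho_hat = max(mask.mean(), 1.0 / Z.size)  # guard against an all-missing matrix
    U, s, Vt = np.linalg.svd(np.where(mask, Z, 0.0), full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank] / rho_hat

def synthetic_control(Z, y_pre, rank):
    """De-noise all donor rows, fit OLS on the pre-treatment rows, predict every period."""
    A_hat = hsvt(Z, rank)
    n_pre = len(y_pre)
    beta_hat, *_ = np.linalg.lstsq(A_hat[:n_pre], y_pre, rcond=None)
    return A_hat @ beta_hat, beta_hat

# Hypothetical toy data: 40 time periods, 15 donors, rank-2 donor matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 15))
beta_star = rng.normal(size=15)
n_pre = 25
Z = A + 0.1 * rng.normal(size=A.shape)       # additive noise on the donors
Z[rng.random(A.shape) > 0.8] = np.nan        # roughly 20% of entries missing
y_pre = A[:n_pre] @ beta_star + 0.1 * rng.normal(size=n_pre)

y_hat, beta_hat = synthetic_control(Z, y_pre, rank=2)
post_mse = np.mean((y_hat[n_pre:] - A[n_pre:] @ beta_star) ** 2)
print(f"post-treatment MSE on toy data: {post_mse:.3f}")
```

Rows after the first `len(y_pre)` play the role of post-treatment periods, so `y_hat[n_pre:]` is the counterfactual estimate. De-noising the pre- and post-treatment rows jointly is a design choice of this sketch.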

5.2 Time Series Analysis

Problem formulation.

We follow the formulation in [Agarwal et al.(2018)Agarwal, Amjad, Shah, and Shen]. Specifically, consider a discrete-time setting with representing the time index and representing the latent discrete-time time series of interest. For each and with probability , the random variable such that

is observed. Under this setting, the two objectives are: (1) interpolation, i.e., estimate

for all ; and (2) extrapolation, i.e., forecast for . The underlying time series is denoted as . Similarly, the imputation and forecasting estimates are denoted as , respectively. [Footnote 7: Note that the forecasting estimator can rely only on past values to make a prediction.] The quality of the estimates is evaluated by: . Please refer to that paper for full details and notation.

How it fits our framework.

In [Agarwal et al.(2018)Agarwal, Amjad, Shah, and Shen], the authors first transform the sparse, noisy observations into a matrix of dimension , where and . Analogously, we denote with entries . They then perform matrix estimation to estimate the underlying time series, i.e., . We see that imputation fits our framework, as good imputation performance is equivalent to a small Frobenius norm difference between and . This is achieved by Theorem 4.3.2 as MCSE is a stronger bound than the Frobenius norm error (after appropriate normalization). Forecasting also fits into our framework since it equates to a small prediction error in our setup; here, refers to for and refers to the preceding values of the time series. [Footnote 8: We spare inconsequential details on how and are constructed to ensure that the setups of both papers are the same.] Thus, good training prediction error is equivalent to the squared difference between (i.e., and being small for . This is guaranteed by Corollary 4.2.1, if the underlying matrix is low-rank or approximately low-rank. This condition is satisfied for the three time series models listed in ([Agarwal et al.(2018)Agarwal, Amjad, Shah, and Shen], Section 5): (i) linear recurrent formulae; (ii) time series with compact support; (iii) sum of sublinear trends (and additive mixtures thereof). By Theorem 4.2.3, we generalize their results by proving that the prediction error is small for future unseen data, i.e., for .
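To make the series-to-matrix reduction concrete, the following sketch stacks a series into a matrix, imputes it by rank truncation, and fits a forecaster by regressing the last row on the de-noised preceding rows. It rests on several assumptions that are not taken from the text: a layout in which each column holds one block of `L` consecutive observations, NaN as the marker for unobserved values, a user-supplied rank, and the hypothetical function names `page_matrix`, `hsvt`, `impute`, and `fit_forecaster`.

```python
import numpy as np

def page_matrix(x, L):
    """Stack a series into an L x (len(x) // L) matrix, one block of L values per column."""
    x = np.asarray(x, dtype=float)
    n_cols = len(x) // L
    return x[: n_cols * L].reshape(n_cols, L).T

def hsvt(Z, rank):
    """Zero-fill NaNs, keep the top `rank` singular values, rescale by the observed fraction."""
    mask = ~np.isnan(Z)
    rho_hat = max(mask.mean(), 1.0 / Z.size)
    U, s, Vt = np.linalg.svd(np.where(mask, Z, 0.0), full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank] / rho_hat

def impute(x_obs, L, rank):
    """Imputation: de-noise the stacked matrix and unstack it back into a series."""
    return hsvt(page_matrix(x_obs, L), rank).T.ravel()

def fit_forecaster(x_obs, L, rank):
    """Forecasting: regress the last row on the de-noised preceding L - 1 rows."""
    M = page_matrix(x_obs, L)
    features = hsvt(M[:-1], rank).T        # one feature vector per column of M
    seen = ~np.isnan(M[-1])                # fit only where the target was observed
    beta, *_ = np.linalg.lstsq(features[seen], M[-1, seen], rcond=None)
    return beta

# A sinusoid obeys a linear recurrence, so its stacked matrix has rank 2.
t = np.arange(200)
x = np.sin(0.3 * t)
x_obs = x + 0.05 * np.random.default_rng(1).normal(size=t.size)
x_hat = impute(x_obs, L=10, rank=2)
beta = fit_forecaster(x_obs, L=10, rank=2)
```

Given the `L - 1` most recent (de-noised) values as a vector `v`, the one-step forecast in this sketch is `v @ beta`. The sinusoid illustrates model class (i) above, linear recurrent formulae.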

5.3 Regression with Privacy

Problem formulation.

With the advent of large datasets, analysts must maximize the accuracy of their queries and simultaneously protect sensitive information. An important notion of privacy is that of differential privacy; this requires that the outcome of a query of a database cannot greatly change due to the presence or absence of any individual data record (cf. [Dwork and Roth(2014)]). This guarantees that little can be learned about any particular record. Suppose denotes the true, fixed database of sensitive individual records. We consider the setting where an analyst is allowed to ask two types of queries of the data: (1) querying for individual data records, i.e., for ; (2) querying for a linear combination of an individual’s covariates, i.e. . A typical example would be where is the genomic information for patient and is the outcome of a clinical study. The aim in such a setup is to be able to produce both in- and out-of-sample predictions while preserving each patient’s privacy.

How it fits our framework.

A typical way to achieve differential privacy is to add Laplacian noise to queries. This naturally fits our framework as we allow for sub-exponential noise in (which includes Laplacian noise) and (see Properties 2.2 and 2.5). Our setup even allows for a significant fraction of the query response to be masked (see Property 2.4). Specifically, whenever an analyst queries for , the answer is returned as where and is independent Laplacian noise. Similarly, when an analyst queries for the response variable , he or she observes , where is again independent Laplacian noise. This guarantees that every individual’s data remains differentially private. Nevertheless, by the results in Section 4, the analyst can still accurately learn valuable global statistics (e.g., the average over ) about the data.
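The query mechanism described above can be sketched as follows. This is an illustrative assumption about how the pieces fit together, not a specification from the text: the function names `private_record_query` and `private_response_query` are hypothetical, NaN marks a masked entry, and the Laplace `scale` is left as a free parameter rather than calibrated to any particular privacy budget.

```python
import numpy as np

def private_record_query(A, i, rho, scale, rng):
    """Return record i with independent Laplace noise; each entry is revealed w.p. rho."""
    noisy = A[i] + rng.laplace(0.0, scale, size=A.shape[1])
    revealed = rng.random(A.shape[1]) < rho
    return np.where(revealed, noisy, np.nan)   # NaN marks a masked entry

def private_response_query(A, beta, i, scale, rng):
    """Return the linear combination for record i plus independent Laplace noise."""
    return A[i] @ beta + rng.laplace(0.0, scale)

# Hypothetical database: 100 records with 8 covariates each.
rng = np.random.default_rng(2)
A = rng.normal(size=(100, 8))
beta = rng.normal(size=8)

# What the analyst sees after querying every record and response once.
Z = np.vstack([private_record_query(A, i, rho=0.7, scale=0.5, rng=rng) for i in range(100)])
y = np.array([private_response_query(A, beta, i, scale=0.5, rng=rng) for i in range(100)])
```

The analyst only ever holds `Z` and `y`, which have the same form as the noisy, partially observed covariates and noisy responses assumed throughout the paper, so the de-noise-then-regress estimator can be applied to them directly.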

6 Proof Sketches

In this section, we provide proof sketches for the main theorems of this work, i.e., Theorems 4.2.3, 4.3.1 and 4.3.2. [Footnote 9: We do not sketch the proof of the first main theorem, Theorem 4.2.1, because it follows as a direct consequence of Theorem 4.3.2. However, its proof can be found in Appendix B.] The order in which we present these sketches has been chosen so as to allow for a sequential reading of the proofs.

6.1 Supporting Lemmas of Theorem 4.3.1: Spectral Norm Bound for Random Matrices

Outline. We begin by presenting Proposition 6.1, which holds for general random matrices . We note that this result depends on two quantities: (1) and (2) for all . We then instantiate and present Lemmas 6.1 and 6.1, which bound (1) and (2), respectively, for our choice of . Theorem 4.3.1 follows immediately from the above results (the proofs of which are found in Appendix E).