# Swapping Variables for High-Dimensional Sparse Regression with Correlated Measurements

We consider the high-dimensional sparse linear regression problem of accurately estimating a sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that the standard cadre of computationally tractable sparse regression algorithms---such as the Lasso, Orthogonal Matching Pursuit (OMP), and their extensions---perform poorly when the measurement matrix contains highly correlated columns. To address this shortcoming, we develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until convergence. SWAP is surprisingly effective in handling measurement matrices with high correlations. In fact, we prove that SWAP outputs the true support, the locations of the non-zero entries in the sparse vector, under a relatively mild condition on the measurement matrix. Furthermore, we show that SWAP can be used to boost the performance of any sparse regression algorithm. We empirically demonstrate the advantages of SWAP by comparing it with several state-of-the-art sparse regression algorithms.

There are no comments yet.

## Authors

• 7 publications
• 71 publications
• ### Sampling Requirements and Accelerated Schemes for Sparse Linear Regression with Orthogonal Least-Squares

The Orthogonal Least Squares (OLS) algorithm sequentially selects column...
08/08/2016 ∙ by Abolfazl Hashemi, et al. ∙ 0

• ### Compressed Sparse Linear Regression

High-dimensional sparse linear regression is a basic problem in machine ...
07/25/2017 ∙ by Shiva Prasad Kasiviswanathan, et al. ∙ 0

• ### Keeping greed good: sparse regression under design uncertainty with application to biomass characterization

In this paper, we consider the classic measurement error regression scen...
07/08/2012 ∙ by David J. Biagioni, et al. ∙ 0

• ### Robust High Dimensional Sparse Regression and Matching Pursuit

We consider high dimensional sparse regression, and develop strategies a...
01/12/2013 ∙ by Yudong Chen, et al. ∙ 0

• ### Sufficient principal component regression for pattern discovery in transcriptomic data

Methods for global measurement of transcript abundance such as microarra...
07/05/2021 ∙ by Lei Ding, et al. ∙ 0

• ### Inference in High-Dimensional Linear Regression via Lattice Basis Reduction and Integer Relation Detection

We focus on the high-dimensional linear regression problem, where the al...
10/24/2019 ∙ by David Gamarnik, et al. ∙ 0

• ### High-dimensional subset recovery in noise: Sparsified measurements without loss of statistical efficiency

We consider the problem of estimating the support of a vector β^* ∈R^p b...
05/20/2008 ∙ by Dapo Omidiran, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many machine learning and statistics applications involve recovering a high-dimensional sparse (or approximately sparse) vector

given a small number of linear observations , where is a known measurement matrix and is the observation noise. Depending on the problem of interest, the unknown sparse vector could encode relationships between genes (Segal et al., 2003), power line failures in massive power grid networks (Zhu and Giannakis, 2012), sparse representations of signals (Candès et al., 2006; Duarte et al., 2008), or edges in a graphical model (Meinshausen and Bühlmann, 2006; Ravikumar et al., 2010), to name just a few applications. The simplest, but still very useful, setting is when the observations can be approximated by a sparse linear combination of the columns of a measurement matrix weighted by the non-zero entries of the unknown sparse vector. Sparse regression algorithms can be used to estimate the sparse vector, and subsequently the location of the non-zero entries.

In this paper, we study sparse regression in a setting where current state-of-the-art methods falter or fail. One of the key reasons why current methods fail is because of high correlations between the columns of the measurement matrix . For example, if there exists a column that is nearly linearly dependent on the columns indexed by the locations of the non-zero entries in , then many sparse regression algorithms will falsely select .

There are many situations where it is natural for to contain correlated columns. In signal processing, certain signals may admit a sparse representation in a basis, where the basis elements (columns of ) can be significantly correlated with each other (Elad and Aharon, 2006). Such sparse representations are useful in signal processing tasks including denoising and compression (Elad, 2010). In functional magnetic resonance imaging (fMRI) data, measurements from neighboring voxels can be significantly correlated (Varoquaux et al., 2012). This can lead to inaccurate understanding of the connectivity between regions of the brain. In gene expression data analysis, expression values of genes that are in the same pathway may be significantly correlated with each other (Segal et al., 2003). For example, Figure 1 shows the pairwise correlations in three popular gene expression datasets. Each pixel in the image is the absolute value of the normalized inner product of one gene expression with another gene expression, i.e., . The higher/lower pixel intensities correspond to genes that have higher/lower pairwise correlations. We clearly see that all three examples contain a large number of high pixel intensities. This means that the gene expression values are highly correlated with each other, which may to lead to inaccurate identification of genes that are most relevant to understanding a disease.

### 1.1 \Sw: Overview

In this paper, we develop and analyze , a simple greedy algorithm for sparse regression with correlated measurements. The input to is an estimate of the support of the unknown sparse vector , i.e., the location of its non-zero entries. The main idea behind

is to iteratively perturb the estimate of the support by swapping variables. The swapping is done in such a way that a loss function is minimized in each iteration of

. In this way, seeks to estimate a support, in a greedy manner, that minimizes a loss function. The main reason why is able to handle correlations is because even if an intermediate estimate contains a variable that is not in the true support, we show that can swap this variable with a true variable under relatively mild conditions on .

As an example, suppose we initialize using a set . We note that could either be a random subset or it could be the support estimated by some other sparse regression algorithm. As we discuss later, selecting to be the output of a sparse regression algorithm has both statistical and computational advantages. Starting with the support , as illustrated in Figure 2, iterates to an intermediate support by swapping a variable with a variable such that . The swapping is performed only if is superior to in terms of a given loss function and continues until convergence of the loss function. The iterations can be summarized as follows:

 S(1)⟶S(2)={S(1)∖i}∪{i′}⟶S(3)={S(2)∖i}∪{i′}⟶⋯⋯⟶S(r).

Naturally, the choice of plays an important role in the performance of . In particular, we prove the following two results:

1. When misses at most one entry from the true support, outputs the true support under very mild conditions (see Theorem 4.3.1).

2. When misses more than one entry from the true support, outputs the true support support under conditions that depend on certain correlations among the columns of (see Theorems 4.3.2, 4.3.3, and 4.3.4).

Property (P1) shows that can be used to post-process the output of any sparse regression algorithm without sacrificing for performance. Property (P2) shows that can be used to estimate the true support of even when a sparse regression algorithm outputs a support set that differs significantly from the true support. The particular condition in (P2), defined in Section 4, highlights the role of . In particular, a sparse regression algorithm that can identify a larger fraction of the true support can potentially tolerate higher correlations among the columns of .

We demonstrate the empirical performance of SWAP on synthetic and real gene expression data. In particular, we use several state-of-the-art sparse regression algorithms to initialize SWAP. For every initialization, we demonstrate that SWAP leads to improved support recovery performance. For example, in Figure 3, we plot the true positive rate (TPR), i.e. , as the amount of correlation between the columns of increases. See Section 5.1 for details about the experimental setup. The dashed lines in the plots correspond to a standard sparse regression algorithm, while the solid lines with markers correspond to a algorithm. In most cases, we observe that the solid lines lie above the dashed line with the same color, which shows that is able to improve the performance of sparse regression algorithms. Furthermore, we show that initializing using other sparse regression algorithms has computational benefits as opposed to initializing using a random subset. In particular, if a sparse regression algorithm can partially estimate the support, then needs a smaller number of iterations to converge to the true support when compared to the case when is initialized with a random support.

### 1.2 Related Work

Current state-of-the-art computationally tractable sparse regression algorithms are only reliable under weak correlations among the columns of the measurement matrix. These correlations are quantified either using the irrepresentability condition (Zhao and Yu, 2006; Tropp and Gilbert, 2007; Meinshausen and Bühlmann, 2006; Wainwright, 2009a)

and/or various forms of the restricted eigenvalue (RE) conditions

(Meinshausen and Yu, 2009; Bickel et al., 2009). See Bühlmann and Van De Geer (2011) for a comprehensive review of such methods and the related conditions.

To the best of our knowledge, among the current methods in the literature, multi-stage methods (Wasserman and Roeder, 2009; Meinshausen and Yu, 2009; van de Geer et al., 2011; Javanmard and Montanari, 2013) are most appropriate for handing correlated variables in regression. An example of a multi-stage method is thresholded Lasso (TLasso), which first applies Lasso, a popular sparse regression algorithm proposed in Tibshirani (1996), and then thresholds the absolute value of the estimated non-zero entries to output a final support. Theoretically, TLasso requires an RE based condition for accurate support recovery that is stronger than the RE based condition required by . Moreover, TLasso, and other multi-stage methods, typically require more computations, since they require that computationally intensive model selection methods, such as cross-validation, be applied more than once to produce the final estimate.

When the columns in are highly correlated with other, exact support recovery may not be feasible, even with . In such cases, it is instead desirable to obtain a superset of the true support such that the superset is as small as possible. Several methods have been proposed in the literature for estimating such a superset; see Zou and Hastie (2005); She (2010); Grave et al. (2011); Huang et al. (2011); Bühlmann et al. (2013) for some examples. The main idea in these methods is to select all the highly correlated variables, even if only one of these variables is actually in the true support. In its current form, is designed for exact support recovery in the presence of correlated measurements. However, just as improves the performance of standard sparse regression algorithms for exact support recovery, we believe that suitable modifications of can improve the various superset estimation methods to deal with highly correlated measurements.

can be interpreted as a genetic algorithm

for solving a combinatorial optimization problem

(Melanie, 1999). In prior work, Fannjiang and Liao (2012) empirically show the superior performance of a slightly different version of for handing correlated measurements. However, no theory was given in Fannjiang and Liao (2012) to understand its superior performance. Our main contribution in this paper is to develop performance guarantees for , and thereby demonstrate the advantages of using for sparse regression with correlated measurements. A portion of the results in this paper appeared in Vats and Baraniuk (2013).

## 2 \Sw: Problem Formulation

In this section, we formulate the sparse regression problem and introduce the relevant notation and assumptions that will be used throughout the paper. We assume that , referred to as the observations, and , referred to as the measurement matrix, are known and related to each other by the linear model

 y=Xβ∗+w, (1)

where is the unknown sparse vector that we seek to estimate and is measurement noise. Unless mentioned otherwise, we assume the following throughout this paper:

1. The vector is -sparse with the location of the non-zero entries given by the set . It is common to refer to as the support of , and we adopt this notation throughout the paper. Furthermore, we refer to a variable in the support as an active variable and refer to a variable outside the support as an inactive variable.

2. The matrix is fixed with normalized columns, i.e., for all , where . In practice, normalization can easily be done by scaling and accordingly.

3. The entries of

are i.i.d. zero-mean sub-Gaussian random variables with parameter

so that . The sub-Gaussian condition on is common in the literature and allows for a wide class of noise models, including Gaussian, symmetric Bernoulli, and bounded random variables (Vershynin, 2010).

4. The number of observations is greater than or equal to , i.e., . As we shall see later, this assumption is important to compute the estimates of .

5. The number of observations and the sparsity level are all allowed to grow to infinity as goes to infinity. In the literature, this is referred to as the high-dimensional framework.

For any set , we associate a loss function, , that represents the cost associated with estimating by the set . An appropriate loss function for the linear problem in (1) is the least-squares loss, which is given by

 L(S;y,X)\defnminα∈\R|S|∥y−XSα∥22=\bnΠ⊥[S]y\bn22, (2)

where refers to the matrix that only includes the columns indexed by and is the orthogonal projection onto the kernel of the matrix . Assumption (A4) is required so that the inverse of exists. Throughout this paper, we mainly study the problem of estimating , since, once has been estimated, an estimate of can be easily computed by solving a constrained least-squares problem. The subsequent error in estimating is given by following proposition. Let be an estimate of such that . For any , the constrained least-squares estimator,

, satisfies the following bound with probability at least

:

 ∥ˆβ−ˆβ∗∥22≤ckσ2τn/(nρ2k), (3)

where is a universal positive constant and is the minimum eigenvalue of . Choosing , Proposition 2 shows that if the support can be estimated accurately with high probability, then the -error in estimating can be upper bounded, with high probability, by . Clearly, this bound converges to as long as . The proof of Proposition 2 follows from standard analysis of sub-Gaussian random variables.

Having established a bound on the estimation error given the true support, we now solely focus on the problem of estimating the true support . A classical method is to seek a support that minimizes the loss . This method, in general, may require a search over an exponentially number of possible supports. The main goal in this paper is to design an algorithm that searches the space of all possible supports in a computationally tractable manner. Furthermore, we are interested in establishing conditions under which the algorithm can accurately estimate , and as a result , under much broader conditions on the measurement matrix than those imposed by current state-of-the-art methods.

The rest of the paper is organized as follows. Section 3 presents the algorithm. Section 4 analyzes and proves conditions under which leads to accurate support recovery. Section 5 presents numerical simulations. Section 6 concludes the paper.

## 3 \Sw: Detailed Description

In this section, we describe the algorithm to find a support that minimizes a loss function. Suppose that we are given an estimate, say , of the true support and let be the corresponding least-squares loss (see (2)). We want to transition to another estimate that is closer (in terms of the number of true variables), or equal, to . The algorithm transitions from to an in the following manner:

 Swap every i∈S(1) with i′∈(S(1))c and compute the loss L(1)i,i′=L({S(1)∖i}∪i′;y,X).

If , then there exists a support that has a lower loss than . Subsequently, we find and let . We repeat the above steps to find a sequence of supports , where has the property that . In other words, we stop when perturbing by one variable increases or does not change the resulting loss. These steps are summarized in Algorithm 1. We next make several remarks regarding .

### 3.1 Selecting the Initial Support S(1)

The main input to is the initial support , which also implicitly specifies the desired sparsity level of the estimated support. Recall that is the unknown number of non-zero entries in . If is known, then can be initialized using the output of some other sparse regression algorithm. In this way, can boost the performance of other sparse regression algorithms.

To illustrate the advantages of , we set up a sparse regression problem with , , , and , where

is the variance of the measurement noise

in (1). The measurement matrix is chosen such that the pairwise correlations, i.e., the matrix , is given by the left most figure in Figure 4. We simulate three different observation vectors for three different realizations of the noise . We use as a wrapper around four popular sparse regression algorithms: Lasso (Tibshirani, 1996), thresholded Lasso (TLasso) (van de Geer et al., 2011), CoSaMP (Needell and Tropp, 2009), and FoBa (Zhang, 2011). The three plots in Figure 4 show how the true positive rate (TPR) changes in the intermediate steps of the algorithm for the three different realizations of the noise, where TPR is the total number of active variables in an estimate divided by the total number of active variables. For all the four different sparse regression algorithms, we observe that the TPR eventually increases when stops. This demonstrates that is able to replace inactive variables in the initial estimates with active variables. Furthermore, we generally observe that using with a support that contains more active variables is better for estimating the true support. For example, TLasso is able to estimate more than half of the active variables accurately. Using with TLasso leads to accurate support recovery such that terminates after only a few iterations. Our theoretical analysis in Section 4 sheds some light into this phenomenon. In particular, we show that the sufficient conditions for accurate support recovery for become weaker as the number of active variables in the initial support increase.

When is not known, which is the case in many applications, can be easily used in conjunction with other sparse regression algorithms to compute a solution path, i.e., a list of all possible estimates of the support over different sparsity levels. Once a solution path is obtained, model selection methods, such as cross-validation or stability selection (Meinshausen and Bühlmann, 2010), can be applied to estimate the support.

### 3.2 Computational Complexity

The main computational step in (Algorithm 1) is Line 2, where the loss is computed for all possible swaps . If , then clearly such computations need to be done in each iteration of the algorithm. Using properties of the orthogonal projection matrix (see Lemma D) we have that for any ,

 Π[S]=Π[S∖i]+(Π⊥[S∖i]Xi)(Π⊥[S∖i]Xi)TXTiΠ⊥[S∖i]Xi,i∈S. (4)

To compute , we need to compute the orthogonal projection matrix . Once is computed, can be easily computed for all using the rank one update in (4). Thus, effectively, the computational complexity of Line 2 is roughly , where is the complexity of computing a projection matrix of rank . Using state-of-the-art matrix inversion algorithms (Coppersmith and Winograd, 1990), .

There are several ways to significantly improve the computational complexity of at the expense of slightly degraded the performance. One straightforward way, which was used in Fannjiang and Liao (2012), is to restrict the number of swaps using the pairwise correlations among the columns of . Another method is to first estimate a superset of the support, using methods such as in Bühlmann et al. (2013); Vats (2014), and then restrict the swaps to only lie in the estimated superset. Since the main goal in this paper is to study the statistical properties of , we will address the computational aspects of in future work.

### 3.3 Swapping Several Variables

A natural generalization of is to swap a group of variables with another group of variables. The computational complexity of generalized , which we refer to as , is roughly , where is the complexity of computing a rank projection matrix. Clearly, for large, is not tractable in its naïve form. In particular, as increases, the complexity of approaches the complexity of the exponential search algorithm that searches among all possible supports of size .

### 3.4 Comparison to Other Algorithms

differs significantly from other greedy algorithms for sparse regression. When is known, the main distinctive feature of is that it always maintains a -sparse estimate of the support. Note that the same is true of the computationally intractable exhaustive search algorithm. Other competing algorithms, such as forward-backward (FoBa) (Zhang, 2011) and CoSaMP (Needell and Tropp, 2009), usually estimate a sparse vector with higher sparsity level and iteratively remove variables until variables are selected. The same is true for multi-stage algorithms (Zhang, 2009; Wasserman and Roeder, 2009; Zhang, 2010; van de Geer et al., 2011). Intuitively, as we shall see in Section 4, by maintaining a support of size , the performance of only depends on correlations among the columns of the matrix , where is of size at most and includes the true support. In contrast, for other sparse regression algorithms, .

## 4 Theoretical Analysis of \Sw

In this section, we present the main theoretical results of the paper that identifies the conditions under which performs accurate support recovery. Section 4.1 analyzes the performance of an exhaustive search decoder to understand the performance guarantees of sparse regression when given no restrictions on the computational complexity. Section 4.2 defines important parameters that are used to state the main theoretical results in Section 4.3. Section 4.4 presents an example to highlight the advantages of using . Section 4.5 presents some general remarks regarding the main results and discusses some extensions.

### 4.1 Exhaustive Search Decoder

Recall that we seek to estimate the support set such that the observations is a linear combination of the columns in the matrix . Before presenting a detailed theoretical analysis of , we first study the exhaustive search decoder that estimates as follows:

 (ESD): ˆS=minS∈ΩkL(S;y,X),

where is the set of all possible supports of size . Thus, the ESD method searches among all possible supports of size to find the support that minimizes the loss. In the literature, the ESD method is also referred to as the -norm solution. Understanding the performance of ESD is important to analyzing the performance of , since, if does not minimize the loss, then will most likely not estimate accurately. Before stating the theorem, we define the following two parameters:

 ρk+ℓ \defninf{∥Xθ∥22n∥θ∥22:∥θ∥0≤k+ℓ,S∗⊆\supp(θ)}, (5) βmin \defnmini∈S∗|βi|. (6)

The parameter is the eigenvalue of certain diagonal blocks of the matrix of size that includes the block . The parameter is the minimum absolute value of the non-zero entries in . Both (or its different variations) and are known to be crucial in determining the performance of sparse regression algorithms.

Consider the linear model in (1) and suppose (A1)–(A5) holds. Let be the ESD estimate with and let . If , where , then as . Proposition 4.1 specifies the scaling on the number of observations to ensure that ESD estimates the true support in the high-dimensional setting. The proof of Proposition 4.1 is outlined in Appendix A. The steps in the proof mirror those of the ESD analysis in Wainwright (2009b). The main difference in the proof comes from the assumption that

is fixed, as opposed to being sampled from a Gaussian distribution. The dependence on

in Proposition 4.1 is particularly interesting, since, to the best of our knowledge, the performance of state-of-the-art computationally tractable methods for sparse regression depend either on certain correlations among the columns of or on the restricted eigenvalues for some . Since for , ESD leads to accurate support recovery for a much broader class of measurement matrices . However, ESD is computationally intractable, and so it is desirable to devise algorithms that are computationally tractable and offer similar performance guarantees.

### 4.2 Some Important Parameters

In this section, we collect some important parameters that determine the performance of . We have already defined the minimum absolute value of the non-zero entries, , in (6) and the restricted eigenvalue (RE) in (5). Another form of the restricted eigenvalue used in our analysis is defined as follows:

 ρk,ℓ \defninf{∥Xθ∥22n∥θ∥22:∥θ∥0≤k,|S∗∩\supp(θ)|≥ℓ}. (7)

The parameter defined above is the minimum eigenvalue of certain blocks of the matrix of size that includes the blocks , where is a subset of of size at least . It is clear that smaller values of or correspond to correlated columns in the matrix .

Next, we define two parameters that characterize the correlations between the columns of the matrix and the columns of the matrix , where recall that is the true support of the unknown sparse vector . For a set that contains all supports of size with atleast active variables from , define as

 γd\defnmaxS∈Ωk,d∖S∗mini∈(S∗)c∩S∥∥∥ΣS∖ii,¯S(ΣS∖i¯S,¯S)−1∥∥∥21ΣS∖ii,i,¯S=S∗∖S, (8)

where . The matrix is the pairwise correlations between vectors that are projected onto the kernel of the matrix . When , we write . Popular sparse regression algorithms, such as the Lasso and the OMP, can perform accurate support recovery when . We will show in Section 4.3 that can perform accurate support recovery when . Although the form of is similar to , there are several key differences, which we highlight as follows:

• Since contains all supports such that , it is clear that is the -norm of a vector, where . In contrast, is the -norm of a vector. If indeed , i.e., accurate support recovery is possible using the Lasso, then can be initialized by the output of the Lasso. In this case, and also outputs the true support as long as minimizes the loss function. We make this statement precise in Theorem 4.3.1. Thus, it is only when that the parameter plays a role in the performance of .

• The parameter directly computes correlations between the columns of . In contrast, computes correlations between the columns of when projected onto the null space of a matrix , where .

• Note that is computed by taking a maximum over supports in the set and a minimum over inactive variables in each support. The reason that the minimum appears in is because we choose to swap variables that result in the minimum loss. In contrast, is computed by taking a maximum over all inactive variables. This minimum is what allows to tolerate higher correlations among columns of when compared to other sparse regression algorithms.

In addition to characterizing the performance of using the parameter , we also make use of the following parameter:

 νd\defnmaxS∈Ωk,d∖S∗minimaxj,j′∥∥∥ΣS∖{i,j}i,¯Sj(ΣS∖{i,j}¯Sj,¯Sj)−1∥∥∥21+∥∥∥ΣS∖{i,j}j′,¯Sj(ΣS∖{i,j}¯S,¯Sj)−1∥∥∥21min{ΣS∖{i,j}i,i,ΣS∖{i,j}j′,j′}, (9)

where , , , and . We will see in Theorem 4.3.4 that in the noiseless case, i.e., , ensures that only swaps an active variable with another active variable or swaps an inactive variable with another inactive variables. This enables us to show that the sample complexity of can be at par with that of the exhaustive search decoder.

### 4.3 Statement of the Main Results

In this section, we state the main theoretical results that characterize the performance of . A summary of the results in this section is given as follows:

• Theorem 4.3.1 (Section 4.3.1): Identifies sufficient conditions for accurate support recovery when is initialized by a support that is either equal to the true support or differs from the true support by one variable. This theorem motivates the use of as a method to boost the performance of any sparse regression algorithm.

• Theorem 4.3.2 (Section 4.3.2): Identifies sufficient conditions for accurate support recovery when is used with a sparse regression algorithm that outputs a support that differs from the true support by more than one variable.

• Theorem 4.3.3 (Section 4.3.3): Weakens the sufficient conditions in Theorem 4.3.2 by assuming certain properties of the initial support.

• Theorem 4.3.4 (Section 4.3.4): Shows that can perform accurate support recovery under the same scaling on the number of observations as the ESD algorithm in Section 4.1.

Throughout this Section, we assume that is initialized by a support of size and is the output of .

#### 4.3.1 Initialization with the True Support S∗

We claim that can be used as a wrapper around other sparse regression algorithms to boost their performance. However, for this to be true, it is important to show that does not spoil the results of other sparse regression algorithms.

Consider the linear model in (1) and suppose (A1)–(A5) holds. Let the input to be such that and . If

 n>4+log(k2(p−k))c2β2minρ2k/2, (10)

where , then as . If minimizes the loss among all supports of size , then clearly outputs when initialized with a support such that and . From Theorem 4.1, the conditions stated in the result guarantee that minimizes the loss with high probability. Theorem 4.3.1 states that if the input to falsely detects at most one variable, then is high-dimensional consistent when given a sufficient number of observations . In particular, the condition on in (10) is mainly enforced to guarantee that the true support minimizes the loss function. This condition is weaker than the sufficient conditions required for computationally tractable sparse regression algorithms. For example, the method FoBa is known to be superior to other methods such as the Lasso and the OMP. Zhang (2011) show that FoBa requires observations for high-dimensional consistent support recovery, where the choice of , which is greater than , depends on the correlations among the matrix . In contrast, the condition in (10), which reduces to , is weaker since for and . This shows that if a sparse regression algorithm can accurately estimate the true support, then does not introduce any false positives and also outputs the true support. Furthermore, if a sparse regression algorithm falsely detects one variable, then can potentially recover the correct support. Thus, we conclude that using can only improve the chances of recovering the true support.

#### 4.3.2 Initialization with an Arbitrary Support

We now consider the more interesting case when is initialized by a support that falsely detects more than one variable. In this case, will clearly need more than one iteration to recover the true support. Furthermore, to ensure that the true support can be recovered, we need to impose some additional assumptions on the measurement matrix . The particular condition we enforce depends on the parameter defined in (8). As mentioned in Section 4.2, captures the correlations between the columns of and the columns of . To simplify the statement in the next Theorem, define the function as

 g(δ,ρ,c)=(δ−1)+2c(√δ+1/√ρ)+2c2.

Let the input to be such that and . If for a constant such that , , , and , then as . Theorem 4.3.2 says that if is initialized by any support of size , then outputs the true support in the high-dimensional setting as long as and satisfy the conditions stated in the theorem. It is easy to see that in the noiseless case, i.e., when , the condition required for accurate support recovery reduces to . As discussed in Section 4.2, there is reason to believe that this condition may be much weaker than the conditions imposed by other sparse regression algorithms.

The proof of Theorem 4.3.2, outlined in Appendix B, relies on imposing conditions on each support such that that there exists a swap so that the loss can be decreased. Clearly, if such a property holds for each support, except , then will output the true support since (i) there are only a finite number of possible supports, and (ii) each iteration of results in a different support. The dependence on in the expression for the number of observations arises from applying the union bound over all supports of size .

#### 4.3.3 Initialization with the Output of a Sparse Regression Algorithm

The condition in Theorem 4.3.2 is independent of the initialization , which is why the sample complexity, i.e., the number of observations required for consistent support recovery, scales as . To reduce the sample complexity, we can impose additional conditions on the support that is used to initialize . One such condition is to assume that has certain optimality conditions over a subset of the variables from the true support. In particular, define the event as

 Ek,d={L(S(1);y,X)

where contains all supports of size that contain at most active variables from . The event is the set of outcomes for which the loss associated with is less than the loss associated with all supports that contain a smaller or equal number of active variables than . If we assume that , then all iterations of will lie in the set . Furthermore, by a simple counting argument, . This leads to the following theorem.

Let the input to be such that , , and . If for a constant such that , , , and , then as .

The proof of Theorem 4.3.3 follows easily from the proof of Theorem 4.3.2. The main consequence of Theorem 4.3.3 is that if a sparse regression algorithm can achieve consistent partial support recovery, then the conditions needed for support recovery using are weakened. Moreover, as the number of active variables in increases, the sufficient conditions required for accurate support recovery using become weaker since for .

#### 4.3.4 Achieving the optimal sample complexity

One drawback of the theoretical analysis presented so far is that the conditions require that the number of observations scale as , where . The reason for the dependence on is that, once we assume that , there are possible number of supports that can encounter. To ensure that does not make an error, we use the union bound over all these sets to bound the probability of making an error. In practice, however, once the support set is fixed, the total number of possible supports that can visit can be much smaller than . The exact number of possible supports will depend on the correlations between the columns of . In the next theorem, we show that under additional assumptions on , can achieve similar sample complexity as the exhaustive searcher decoder.

Let be the input to such that and . If for a constant such that , , and , then as . The proof of Theorem 4.3.4 is similar to the proof of Theorem 4.3.2 and is outlined in Appendix C. The condition ensures that, when is initialized with an appropriate support of size , then the iterations only swap an active variable with an active variable or swap an inactive variable with an active variable. This allows us to upper bound the total number of possible supports that can visit from to . We note that in numerical simulations, we observed that even when the iterations swapped an inactive variable with an inactive variable, performed accurate support recovery. Thus, not allowing an inactive variable to be swapped with an inactive variable is rather restrictive. We believe that that a more involved analysis, that allows a constant number of swaps of an inactive with an inactive, can further weaken the condition in Theorem 4.3.4 with the number of observations satisfying for a constant that controls the maximum number of inactive with inactive swaps.

### 4.4 Example

In this section, we present a simple example that highlights the advantages of using to handle correlated measurement matrices. The particular example we use is motivated from Javanmard and Montanari (2013). Recall the linear model (1) and suppose that is given as follows:

 Σi,j=⎧⎪⎨⎪⎩1i=jai=p,j∈S∗ or j=p,i∈S∗0otherwise, (12)

where refers to the entry of the matrix , is the unknown support that we seek to estimate given and , and is chosen such that is positive definite. From (12), we see that the column of is correlated with the columns indexed by the true support. This correlation causes standard sparse linear regression algorithms to estimate the incorrect support. For example, a simple calculation shows that the Lasso can only estimate accurately when . The calculations in Javanmard and Montanari (2013) show that multi-stage methods can estimate accurately when .

To find the range of values of for which can accurately estimate , we need to compute the parameter defined in (8). In particular, as shown in Theorem 4.3.2, when , can perform accurate support recovery as long . A simple calculation shows that . This means that can estimate the true support as long and is chosen such that is positive definite.

The above example also demonstrates how can be used with other sparse regression algorithms. In particular, if , then Lasso outputs the true support. In this case, will also return the true support. However, when , then Lasso is no longer accurate. In this case, can be used with the output of the Lasso algorithm to estimate the true support. As we shall see in the numerical simulations, one of the main advantages of using with other sparse regression algorithms is that the number of iterations needed for to converge can be significantly reduced.

### 4.5 Discussion

In this section, we make some general remarks regarding the theoretical results and discuss some possible extensions of our results.

###### Remark

(Unknown Sparsity Level) In all the results, we assumed that is known. However, in practice, is typically unknown and must be selected using an appropriate model selection algorithm. The theoretical analysis for the case when is not known is not within the scope of this paper. However, we note that recent work has shown that, under appropriate conditions, can be estimated exactly with high probability by computing the solution path of any high-dimensional consistent sparse regression algorithm (Vats and Baraniuk, 2014).

###### Remark

(Number of Iterations) Our theoretical results specified sufficient conditions on the scaling of the number of observations and sufficient conditions on the measurement matrix for accurate support recovery. An open question is to study the number of iterations required by to converge.

###### Remark

(Random Measurement Matrices) In all the results, we assumed that the measurement matrix is deterministic. An interesting extension will be to study the case when the rows of are sampled from a random vector. Such an analysis will have applications in compressed sensing (Donoho, 2006) and Gaussian graphical model selection (Meinshausen and Bühlmann, 2006).

###### Remark

(High Correlations) Our results do not explicitly take into account the correlations among the columns of , but simply improve upon the conditions required by other sparse regression algorithms. We note that in settings where exact support recovery is not feasible, we cannot expect to output the true support. This may happen, for example, when the conditions in Theorem 4.1 are not satisfied. In such cases, it may be desirable to select a superset of the true support instead of the true support. Our future work will study how can be used for this problem by using in conjunction with algorithms, such as those proposed in Bühlmann et al. (2013), to select an appropriate superset of the true support.

## 5 Numerical Simulations

In this section, we illustrate the performance of when initialized by several popular sparse regression algorithms. Section 5.1 presents synthetic data results and Section 5.2 presents pseudo real data results.

### 5.1 Synthetic Data

To illustrate the advantages of , we use the following two examples:

1. We sample the rows of from a Gaussian distribution with mean zero and covariance . The covariance is block-diagonal with blocks of size . The entries in each block are specified as follows: for and for . This construction of the measurement matrix is motivated from Bühlmann et al. (2013). The true support is chosen such that each variable in the support is assigned to a different block. The non-zero entries in are chosen to have magnitude one with sign randomly chosen to be either positive or negative (with equal probability). We let , , , , and .

2. We sample from the same distribution as described in (E1). The only difference is that the true support is chosen such that five different blocks contain active variables and each chosen block contains four active variables. The rest of the parameters are also the same.

In both (E1) and (E2), as increases, the strength of correlations between the columns increase. Furthermore, from the construction, it is clear that the restricted eigenvalue parameter for (E1) is, in general, greater than the restricted eigenvalue parameter of (E2). Thus, (E1) requires a smaller number of observations than (E2) for accurate sparse regression.

#### 5.1.1 Sparse Regression Algorithms

We use the following sparse regression algorithms to initialize :

• Lasso (Tibshirani, 1996)

• Thresholded Lasso (TLasso) (van de Geer et al., 2011)

• Forward-Backward (FoBa) (Zhang, 2011)

• CoSaMP (Needell and Tropp, 2009)

• Marginal Regression (MaR)

• Random

TLasso is a two-stage algorithm where the first stage applies Lasso and the second stage selects the top largest (in magnitude) variables from the Lasso estimate. In our implementation, we applied Lasso using -fold cross-validation. FoBa uses a combination of a forward and a backwards algorithm. CoSaMP is an iterative greedy algorithm. MaR selects the support by choosing the largest variables in . Random selects a random subset of size . We use the notation S-TLasso to refer to the algorithm that uses TLasso as an initialization for . A similar notation follows for other algorithms. Finally, when running the sparse regression algorithms, we assume that is known. In practice, model selection algorithms can be used to select an appropriate . Since our main goal is to illustrate the performance of , and thereby validate our theoretical results in Section 4, we only show results for the case when is assumed to be known. Finally, all our results are reported over trials.

#### 5.1.2 Dependence on the Degree of Correlation

Figures 3 and 7 plot the mean TPR for (E1) and (E2), respectively, as the parameter increases from to . The dashed lines correspond to a standard sparse regression algorithm, while the solid lines correspond to a based algorithm. In most cases, we see that, for the same color, the solid lines lie above the dashed lines. This shows, as predicted by our theory, that is able to improve the performance of other regression algorithms. Furthermore, we observe that TLasso has the best performance among all algorithms, and S-TLasso improves the performance of TLasso. However, the computational complexity of TLasso is higher than that of other algorithms since it requires two stages of model selection.

As expected, the performance of the algorithms degrades as the correlations increase. Moreover, when the correlations are extremely high, then the difference between TLasso and S-TLasso is insignficant. This suggests that for such cases, it might be more suitable to use sparse regression algorithms, such as those in Zou and Hastie (2005); She (2010); Grave et al. (2011); Huang et al. (2011); Bühlmann et al. (2013), which are designed to estimate a superset of the true support.

Figures 5 and 7 show boxplots of the difference between the TPR of a based algorithm minus the TPR of a standard algorithm. A positive difference means that is able to improve the performance of a sparse regression algorithm. For Lasso, FoBa, and CoSaMP, we see that the difference is generally positive, even for large values of . For TLasso, we see that the difference is generally positive for and . For greater values of , there does not seem to be any advantages of using . This is likely due to the correlations being extremely high so that the true support no longer minimizes the loss function.

#### 5.1.3 Number of Iterations and Dependence on the Number of Observations

Figure 8 plots the mean number of iterations required by the based algorithms as the parameter varies from to . As expected, the number of iterations generally increases as the correlations among the columns increase. We also observe that FoBa and TLasso require the smallest number of iterations to converge. This is primarily because both these algorithms are able to estimate a large fraction of the true support.

Figure 9 plots the mean TPR for (E1) and (E2) when and the number of observations vary from to . We clearly see that based algorithms outperform the standard algorithms. For simplicity, we only plot results for TLasso and FoBa.

### 5.2 Pseudo Real Data

In this section, we present results for the case when the measurement matrix corresponds to gene expression data. Since we do not have any ground truth, we simulate and . For this reason, we refer to this simulation as pseudo real data. The two simulation settings are as follows:

• Leukemia: This dataset contains gene expression values from patients (Golub et al., 1999). For computational reasons, we select only genes to obtain a measurement matrix . We let , , and . As seen in Section 5.1, the choice of the support plays an important role in determining the performance of a sparse regression algorithm. To select the support, we cluster the columns using kmeans to identify clusters. Next, we obtain the support by random selecting five clusters and then selecting exactly two variabes from each cluster,

• Prostate Cancer: This dataset contains gene expression values from patients (Singh et al., 2002). The selection of , , and is the same as in the Leukemia data. The only difference is that and the selected support contains three variables from one cluster (as opposed to the two chosen in the Leukemia data).

Figure 10 plots the mean TPR over realizations versus the sparsity level of the estimated support for several sparse regression algorithms. In the figures, we also compare to the elastic net (ENET) algorithm (Zou and Hastie, 2005), which is known to be suitable for regression with correlated variables. Note that ENET requires two regularization parameters. In our simulations, we run ENET for a two-dimensional grid of regularization parameters and select a support for each sparsity level that results in the smallest loss. We only compare ENET to TLasso and FoBa, since we know from Section 5.1 that these algorithms are superior to other regression algorithms in this situation. From the figures, it is clear that, after a certain sparsity level, the -based algorithms perform better than non--based algorithms. Furthermore, we clearly see that performs significantly better than ENET.

## 6 Conclusions

Sparse regression is an important tool for analyzing a wide array of high-dimensional datasets. However, standard computationally tractable algorithms can output erroneous results in the presence of correlated measurements. In this paper, we developed and analyzed a simple sparse regression algorithm, called , that can be used with any regression algorithm to boost its performance. The main idea behind is to iteratively swap variables, starting with an initial estimate of the support, until a loss function cannot be reduced any further. We have theoretically justified the use of with other regression algorithms and quantified the conditions on the measurements that guarantee accurate support recovery. Using numerical simulations on synthetic and real gene expression data, we have shown how boosted the performance of several state-of-the-art sparse regression algorithms. Our work in this paper motivates several interesting extensions of , some of which were discussed in Section 4.5. Two interesting extensions of , not discussed in Section 4.5, are to extend

for logistic regression and to extend

for estimating structured sparse vectors.

## Acknowledgements

Thanks to Aswin Sankaranarayanan, Christoph Studer, and Eric Chi for valuable feedback and discussions. This work was supported by an Institute for Mathematics and Applications (IMA) Postdoctoral Fellowship and by the Grants NSF IIS-1124535, CCF-0926127, CCF-1117939; ONR N00014-11-1-0714, N00014-10-1-0989; and ARO MURI W911NF-09-1-0383.

## Appendix A Proof of Proposition 4.1

Recall that . Analyzing the exhaustive search decoder (ESD) for is equivalent to finding conditions under which the following holds:

 ||Π⊥[S∗]y||22

Using the properties of the orthogonal projection, it is easy to see that . Furthermore, we have

 ||Π⊥[S]y||22=ξTΠ⊥[S]ξ+wTΠ⊥[S]w+2ξTΠ⊥[S]w,

where . Substituting the above into (13), we have that (13) is equivalent to showing that the following holds:

 minS∈Ωk∖S∗⎡⎢ ⎢⎣ξTΠ⊥[S]ξ+wT(Π⊥[S]−Π⊥[S∗])w+2ξTΠ⊥[S]wW(S)⎤⎥ ⎥⎦>0.

To find conditions under which the above holds, we first lower bound . Using properties of projection matrices, and using arguments in