 # Computing Estimators of Dantzig Selector type via Column and Constraint Generation

We consider a class of linear-programming based estimators for reconstructing a sparse signal from linear measurements. Specific formulations of the reconstruction problem considered here include the Dantzig selector, basis pursuit (for the case in which the measurements contain no errors), and the fused Dantzig selector (for the case in which the underlying signal is piecewise constant). Despite being estimators central to sparse signal processing and machine learning, solving these linear programming problems for large scale instances remains a challenging task, thereby limiting their usage in practice. We show that classic constraint- and column-generation techniques from large scale linear programming, when used in conjunction with a commercial implementation of the simplex method, and initialized with the solution from a closely related Lasso formulation, yield solutions with high efficiency in many settings.


## 1 Introduction

We consider the prototypical problem of sparse signal recovery from linear measurements [candes2007, tibshirani1996regression, chen1994]: given a model matrix $X \in \mathbb{R}^{n \times p}$ with $n$ samples and $p$ features, the response $y \in \mathbb{R}^n$ is generated via the model

$$ y = X\beta^* + \epsilon, $$

where $\beta^* \in \mathbb{R}^p$ is sparse (that is, has few nonzero entries) and the errors $\epsilon_i$ are i.i.d. Gaussian with mean zero and variance $\sigma^2$ (i.e., $\epsilon_i \sim N(0, \sigma^2)$). We consider the case in which the number of variables is much larger than the number of samples ($p \gg n$), and our task is to estimate $\beta^*$ from $(y, X)$, exploiting the knowledge that $\beta^*$ is sparse.

We assume throughout that the columns of $X$ have been standardized to have mean zero and unit $\ell_2$-norm. The $\ell_1$-norm $\|\beta\|_1$ is often used as a convex surrogate for the cardinality of $\beta$, which is a count of the number of nonzero elements in $\beta$. The celebrated Dantzig Selector [candes2007] approximates $\beta^*$ by minimizing $\|\beta\|_1$ subject to a constraint on the maximal absolute correlation between the features and the vector of residuals (given by $X^T(y - X\beta)$). The optimization problem for this recovery task is as follows:

$$ (\ell_1\text{-DS}) \qquad \min_\beta \ \|\beta\|_1 \quad \text{s.t.} \quad \|X^T(y - X\beta)\|_\infty \le \lambda, \tag{1} $$

where $\lambda \ge 0$ controls the data-fidelity term. Ideally, the value of $\lambda$ should be such that the unknown signal vector $\beta^*$ is feasible, that is, $\|X^T(y - X\beta^*)\|_\infty \le \lambda$ holds (with high probability, say). The constraint in (1) can be interpreted as a bound on the $\ell_\infty$-norm of the gradient of the least squares loss $\frac12\|y - X\beta\|_2^2$. An appealing property of the Dantzig Selector estimator is that it is invariant under orthogonal transformations of the data.

Problem (1) can be reformulated as a linear program (LP), and thus solved via standard LP algorithms and software (for example, commercial solvers like Gurobi and Cplex) for instances of moderate size. As pointed out in [becker2011], efficient algorithms for $\ell_1$-DS are scarce:

“…Typical modern solvers rely on interior-point methods which are somewhat problematic for large scale problems, since they do not scale well with size.”

Although important progress has been made on algorithms for $\ell_1$-DS in subsequent years (see, for example, [becker2011, lu2012, wang2012, Pang:fastclime-package, JMLR:v16:li15a]), large-scale instances of (1) (with $p$ of a million or more) still cannot be solved. The main goal of our work is to improve the current toolkit for solving $\ell_1$-DS and related problems, bringing to bear some underutilized classical tools from large scale linear programming.
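To make the LP reformulation concrete, the following sketch solves a tiny instance of (1) with `scipy.optimize.linprog` by splitting $\beta$ into its positive and negative parts. For brevity it forms the dense matrix $X^TX$, which is exactly what the memory-efficient formulations discussed later avoid; the problem sizes, random seed, and value of $\lambda$ are illustrative choices, not those used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, k = 30, 60, 3                          # illustrative sizes
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)               # unit-norm columns, as in the text
beta_true = np.zeros(p)
beta_true[:k] = rng.standard_normal(k)
y = X @ beta_true + 0.05 * rng.standard_normal(n)
lam = 0.1                                    # illustrative regularization level

# Split beta = beta_plus - beta_minus with beta_plus, beta_minus >= 0, so that
# ||beta||_1 = 1^T (beta_plus + beta_minus) and (1) becomes a standard-form LP.
G = X.T @ X
Xty = X.T @ y
c = np.ones(2 * p)
A_ub = np.block([[ G, -G],                   #  X^T X beta <= lam * 1 + X^T y
                 [-G,  G]])                  # -X^T X beta <= lam * 1 - X^T y
b_ub = np.concatenate([lam + Xty, lam - Xty])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
beta_hat = res.x[:p] - res.x[p:]
```

At the returned solution the correlation constraint $\|X^T(y - X\hat\beta)\|_\infty \le \lambda$ holds to solver tolerance.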

The Dantzig Selector is closely related to the Lasso [tibshirani1996regression], which combines a least squares data-fidelity term with an $\ell_1$-norm penalty on $\beta$. While the Lasso and Dantzig Selector yield different solutions [efron2007discussion], for suitably chosen regularization parameters, they both lead to estimators with similar statistical properties in terms of estimation error, under suitable assumptions on $X$, $\beta^*$, and $\epsilon$ (see [bickel2009simultaneous]). A version of the Lasso that has the same objective as (1) is

$$ \min_\beta \ \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \delta, \tag{2} $$

where $\delta \ge 0$ is a parameter that places a budget on the data fidelity, defined here as the $\ell_2$-norm of the residuals. The explicit constraint on data fidelity makes this formulation appealing, but it poses computational challenges [becker2011] because of the difficulty of projecting onto the constraint set. The following alternative, unconstrained version has become the most popular formulation of the Lasso:

$$ \text{(Lasso)} \qquad \min_\beta \ \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \tag{3} $$

where $\lambda \ge 0$ is a regularization parameter that controls the $\ell_1$-norm of the solution. (This is the formulation that we refer to as "Lasso" in the remainder of the paper.) There are several highly efficient algorithms for (3) (see, for example, [beck2009, friedman2010, wright2009sparse]), making it an extremely effective tool in the context of sparse learning based on $\ell_1$-minimization.
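One of the simplest such algorithms, proximal gradient descent with soft-thresholding (ISTA), can be sketched in a few lines. This is only a crude illustration; glmnet and the other cited solvers are far more sophisticated, and the sizes and seed below are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding: the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    """Minimize (1/2)||y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft_threshold(b + X.T @ (y - X @ b) / L, lam / L)
    return b

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 80))
X /= np.linalg.norm(X, axis=0)
beta_true = np.zeros(80); beta_true[:4] = 2.0
y = X @ beta_true + 0.01 * rng.standard_normal(40)
lam = 0.1
b_hat = lasso_ista(X, y, lam)
```

The soft-thresholding step is what produces exact zeros in the iterates, giving the sparse solutions exploited throughout this paper.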

### 1.1 Algorithms for Lasso and Dantzig Selector: Fixing the Gap in Performance.

Common algorithms for solving (3) are based on proximal gradient methods [beck2009, wright2009sparse], coordinate descent [friedman2010], or homotopy methods [efron2004least]. Several efficient implementations of these methods are publicly available. There are key differences between the Lasso and $\ell_1$-DS in terms of computational properties and associated solvers: $\ell_1$-DS is essentially an LP, whereas the Lasso is a convex quadratic program (QP). Although LPs are generally thought to be easier to solve than QPs of a similar size, the Lasso QP can be solved with remarkable efficiency, at least when the solution is quite sparse and the matrix $X$ has felicitous properties.

While first-order optimization algorithms [becker2011, boyd2011] have also led to good algorithms for $\ell_1$-DS, they are still much slower than Lasso solvers. To illustrate: to compute a path of 100 solutions for a representative instance, glmnet [friedman2010] takes a few seconds with minimal memory requirements on a modest desktop computer. On the same dataset (and machine), solving $\ell_1$-DS for a path of 100 values of $\lambda$ by the parametric simplex method of [Pang:fastclime-package] takes several minutes and requires gigabytes of memory. The software package flare [JMLR:v16:li15a], based on the Alternating Direction Method of Multipliers (ADMM) [boyd2011], has prohibitive memory requirements and would not run on a modest desktop machine. The differences between solvers grow with problem size. As a consequence of the difficulties of solving the LP formulation, $\ell_1$-DS remains somewhat under-utilized, in spite of its excellent statistical properties. This paper seeks to address the striking difference in computational performance between the Lasso and the Dantzig Selector by proposing efficient methods for the latter. We make use of classical techniques from optimization: column generation and constraint generation. These techniques were first proposed as early as 1958 [ford1958suggested, dantzig1960decomposition] in the context of solving large scale LPs but, to our knowledge, have not been applied to $\ell_1$-DS or its relatives discussed below.

Our approach exploits the sparsity that is typically present in the solution of $\ell_1$-DS: at optimality, an optimal $\beta$ will have few nonzeros. If we can identify the nonzero components efficiently, we may avoid having to solve a full LP formulation that includes all components of $\beta$. Column generation starts by selecting a subset of components of $\beta$ and solving a reduced version of (1) that includes only these components (that is, it fixes the components of $\beta$ that are not selected to zero). If the optimality conditions for the full problem are not satisfied by the solution of the reduced LP, more components of $\beta$ are added to the formulation in a controlled way, and a new reduced LP is solved, using the previous solution as a warm start. The process is repeated until optimality for the full LP is attained. Whenever new components are added to the reduced LP, we add new columns to the constraint matrix, hence the name column generation.

We also make use of another key property of $\ell_1$-DS: redundancy of the constraints in (1). Typically, the number of components of $X^T(y - X\beta)$ that are at their bounds of $\lambda$ and $-\lambda$ at the solution is small, of the same order as the number of nonzero components of $\beta$ at the solution. This observation suggests a procedure in which we solve a reduced LP with just a subset of constraints enforced. We then check the constraints that were not included in this formulation to see if they are violated by the solution of the reduced LP. If so, we add (some of) these violated constraints to the LP formulation, and solve the modified reduced LP. This process is repeated until a solution of the original problem is obtained. This procedure is known as constraint generation.

While column generation and constraint generation are commonly used as separate entities to solve large scale LPs, it makes sense to use them jointly in solving $\ell_1$-DS, and a combination of the two strategies can be implemented with little complication. The procedures can benefit from good initial guesses of the nonzero components of $\beta$ and the active constraint set for (1). We use efficient Lasso solvers to obtain these initial guesses.

### 1.2 Other Examples

Several other examples of sparse linear models are also amenable to column and constraint generation techniques. These include basis pursuit [chen1994] and a Dantzig selector version of one-dimensional total variation denoising (also known as the fused Lasso) [mammen1997locally, tibshirani2005]. Each of these problems can be formulated as a linear program and, like $\ell_1$-DS, they are computationally challenging.

##### Basis Pursuit.

The noiseless version of sparse approximation, popularly known as Basis Pursuit [chen1994], is given by the following optimization problem:

$$ \min_\beta \ \|\beta\|_1 \quad \text{s.t.} \quad y = X\beta, \tag{4} $$

which can be formulated as an LP. This problem can be interpreted as a limiting version of (3) as $\lambda \to 0^+$. It may be tempting to solve (3) for a small value of $\lambda$ (possibly with warm-start continuation) to obtain a solution to (4). However, this approach is often inefficient in practice, because obtaining accurate solutions to the Lasso becomes increasingly expensive as $\lambda \to 0^+$. It has been pointed out in [donoho2009message] that solving (4) to high accuracy using existing convex optimization solvers or specialized iterative algorithms is a daunting task for large instances. Our own experiments show that current solvers based on ADMM fail to solve (4) at problem sizes that our proposed approach, described below, handles within 3-4 minutes. Our approach relies on column generation, exploiting the familiar observation that the solution of (4) is sparse.

##### Fused Dantzig Selector.

The Fused Lasso [tibshirani2005], or the total variation penalty [rudin1992nonlinear], is a commonly used $\ell_1$-based penalty that encourages the solution to be (approximately) piecewise constant. The unconstrained formulation of this problem is

$$ \min_\beta \ \tfrac12\|y - X\beta\|_2^2 + \lambda\|D^{(0)}\beta\|_1, \tag{5} $$

where $\lambda \ge 0$ is a regularization parameter and $D^{(0)} \in \mathbb{R}^{(p-1)\times p}$ is the first order difference operator matrix, defined by

$$ D^{(0)}\beta = (\beta_2 - \beta_1,\ \beta_3 - \beta_2,\ \dots,\ \beta_p - \beta_{p-1})^T, \tag{6} $$

which represents differences between successive components of $\beta$. As we show in Section 2.4, (5) can be expressed as a Lasso problem of the standard form (3), with a modified model matrix $\tilde{X}$ and response $\tilde{y}$. (We define $\tilde{X}$ and $\tilde{y}$ as follows. Let $M$ be the full-rank matrix obtained by stacking the row vector $e_1^T = (1, 0, \dots, 0)$ on top of $D^{(0)}$, and define $H = M^{-1}$. For any index set $S$, let $H_S$ be the submatrix of $H$ containing the columns indexed by $S$, and let $P_S$ denote the projection operator onto the column space of $XH_S$. We set $\tilde{X} = (I - P_A)XH_B$ and $\tilde{y} = (I - P_A)y$, with $A = \{1\}$ and $B = \{2, \dots, p\}$.) This suggests a natural Dantzig Selector analog of the fused Lasso problem:

$$ \min_{\alpha\in\mathbb{R}^{p-1}} \ \|\alpha\|_1 \quad \text{s.t.} \quad \|\tilde{X}^T(\tilde{y} - \tilde{X}\alpha)\|_\infty \le \lambda, \tag{7} $$

a formulation that is amenable to the column and constraint generation methods developed in this paper. In the special case of $X = I$, the problem has additional structure that we can exploit to solve very large instances, well beyond the capabilities of alternative methods.

### 1.3 Related work and Contributions

The Dantzig Selector formulations presented here, namely (1), (4), and (7), can all be expressed as LPs and solved with interior point methods or simplex-based methods, as implemented in commercial solvers (Gurobi, Cplex, Mosek, XPress, etc., several of which are free for academic use) or open-source codes (GLPK, lpsolve, among others). Specialized implementations for $\ell_1$-DS and Basis Pursuit have been investigated for several years. An interior point method was used in [candes2007]; a first-order method for a regularized version of $\ell_1$-DS was described in [becker2011] (the authors add a small ridge penalty to the objective and optimize the dual via gradient methods); and methods based on ADMM were discussed in [lu2012, wang2012, boyd2011]. Using a homotopy continuation approach, [james2009] extend the framework of LARS [efron2004least] to find the solution path of $\ell_1$-DS, which is piecewise linear in $\lambda$. Homotopy continuation methods applicable to $\ell_1$-DS have also been proposed by [asif2009, brauer2018primal, pang2017], but these works do not appear to use column and constraint generation methods, which are the main focus of our work. For the $\ell_1$-DS problem, the algorithms of [asif2009, brauer2018primal, pang2017] compute the full matrix $X^TX$ at the outset. This operation is memory-intensive, so these approaches can handle values of $p$ only up to a few thousands on a modest desktop computer.

Our methods solve the problems (1), (4), and (7) at a single value of the regularization parameter, but they can be extended to solve these problems on a grid of regularization parameters via a warm-start continuation strategy. We show that the classical tools of column and constraint generation can be effective in solving large-scale instances of these problems. Our work is related to the proposal of [dedieu2019solving], who explored column and constraint generation to solve regularized linear SVM problems (with a hinge loss) that can be expressed as LPs. (Regularizers considered in [dedieu2019solving] include the $\ell_1$-norm, group $\ell_1$-norm, and the Slope penalty.) Our Dantzig Selector problems have structural properties different from the SVM problems. They also have the unique advantage that they can be initialized using the Lasso. This fact plays an important role in the practical computational efficiency of our approaches.

Our methods are based on the simplex algorithm, which is better at making use of available warm-start information than interior-point methods. A memory-friendly version of Gurobi's simplex solver, applied to an LP formulation of (1) that avoids formation of $X^TX$ by using auxiliary variables, works well for $\ell_1$-DS with $n$ in the hundreds and $p$ in the thousands. In fact, this approach can be faster than some specialized algorithms [pang2017, becker2011, lu2012]. We show that simplex performance can be improved substantially by using column and constraint generation on larger problems in which the underlying solution is sufficiently sparse. We refer to our framework as Dantzig-Linear Programming (DantzigLP for short). Because we use a simplex engine as the underlying solver, a primal-dual solution is available at optimality. If we decide to terminate the algorithm early due to computational budget constraints, our framework delivers a certificate of suboptimality. DantzigLP can solve large instances of the $\ell_1$-DS, Basis Pursuit, and Fused Lasso problems within a few minutes and with reasonable memory requirements; to our knowledge, problems of this scale are beyond the capabilities of current solvers. A Julia implementation of our DantzigLP framework can be found at https://github.com/atzheng/DantzigLP.

##### Notation.

We denote $[n] := \{1, \dots, n\}$. The identity matrix is denoted by $I$ (with dimension understood from the context). When operating on vector-valued operands $u$ and $v$, the inequality $u \le v$ denotes elementwise comparison. For any matrix $A$ and index sets $I$ and $J$, we denote by $A_{I,J}$ the submatrix of $A$ that consists of the rows of $A$ indexed by $I$ and the columns of $A$ indexed by $J$. The notation $A_{*,J}$ denotes a submatrix consisting of all rows of $A$ but only the columns indexed by $J$ (a similar convention applies to $A_{I,*}$). For a vector $u$ and a set $J$, $u_J$ denotes the subvector of $u$ restricted to the indices in $J$. The notation $\mathbf{1}$ denotes the vector of all ones, whose length is defined by the context.

## 2 Methodology

### 2.1 Column and Constraint Generation for Large-Scale LP

Column generation [ford1958suggested, dantzig1960decomposition, bertsimas1997] is a classical tool for solving large scale LPs with a large number of variables and a relatively small number of constraints, when we anticipate an optimal solution with few nonzero coefficients. The basic idea is simple: we solve a small LP involving just a subset of columns, and incrementally add columns into the model, re-solving the LP after each addition, until optimality conditions for the original problem are satisfied. Constraint generation is used when the number of constraints is large relative to the number of variables, and we expect a relatively small subset of the constraints to be active at optimality.

For the sake of completeness, we provide an overview of these techniques in this section, referring the reader to [bertsimas1997] for a more detailed treatment.

Given problem data $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^n$, and a decision variable $x \in \mathbb{R}^n$, we consider the LP (Primal-Full), whose dual (Dual-Full) has decision variable $v \in \mathbb{R}^m$. We assume that $A$ has full rank.

$$ \text{(Primal-Full)} \qquad \min_x \ c^T x \quad \text{s.t.} \quad Ax \ge b, \ \ x \ge 0; $$

$$ \text{(Dual-Full)} \qquad \max_v \ b^T v \quad \text{s.t.} \quad A^T v \le c, \ \ v \ge 0. $$

We will assume that (Primal-Full) has a finite optimal solution. By LP duality theory, the dual also has a solution with the same optimal objective value.

The solutions of (Primal-Full) and (Dual-Full) can be derived from the solutions to reduced problems of the following form, for some index sets $I \subseteq [m]$ and $J \subseteq [n]$:

$$ \text{(Primal)} \qquad \min_{x_J} \ c_J^T x_J \quad \text{s.t.} \quad A_{I,J}\, x_J \ge b_I, \ \ x_J \ge 0; $$

$$ \text{(Dual)} \qquad \max_{v_I} \ b_I^T v_I \quad \text{s.t.} \quad A_{I,J}^T v_I \le c_J, \ \ v_I \ge 0. $$

The subsets $I$ and $J$ are not known in advance; the simplex method can be viewed as a search for these subsets in which typically one element is changed at each iteration. Sufficient conditions for the solutions $x_J$ of (Primal) and $v_I$ of (Dual) to be extendible to solutions of (Primal-Full) and (Dual-Full) are that:

$$ A_{i,J}\, x_J \ge b_i \ \ \text{for all } i \in [m]\setminus I; \qquad A_{I,j}^T v_I \le c_j \ \ \text{for all } j \in [n]\setminus J. \tag{8} $$

If these conditions are satisfied, we obtain solutions $x^*$ and $v^*$ of (Primal-Full) and (Dual-Full), respectively, by setting $x^*_J = x_J$ and $x^*_j = 0$ for all $j \in [n]\setminus J$, and $v^*_I = v_I$ and $v^*_i = 0$ for all $i \in [m]\setminus I$.

Column and constraint generation are techniques for systematically expanding the sets $I$ and $J$ until the optimality conditions (8) are satisfied by the solutions of (Primal) and (Dual). At each iteration, we solve a reduced problem of this form, then seek indices $i \in [m]\setminus I$ and $j \in [n]\setminus J$ for which the conditions (8) are violated. Some of these indices are added to the sets $I$ and $J$, and the new (slightly larger) versions of (Primal) and (Dual) are solved using a variant of the simplex method, typically warm-started from the previous reduced problem. An outline of the approach is shown in Algorithm 1.
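The loop just described can be sketched compactly. The sketch below assumes SciPy's HiGHS-based `linprog` (whose `ineqlin.marginals` field supplies the dual variables for the inequality constraints); the tiny LP and the starting index sets are illustrative, and a serious implementation would warm-start the simplex solver rather than re-solve from scratch.

```python
import numpy as np
from scipy.optimize import linprog

def colcon_generation(A, b, c, I, J, max_rounds=50, tol=1e-8):
    """Solve min c^T x s.t. A x >= b, x >= 0 by column and constraint
    generation, starting from row set I and column set J (a sketch of Algorithm 1)."""
    I, J = sorted(set(I)), sorted(set(J))
    for _ in range(max_rounds):
        # Reduced LP over rows I and columns J; linprog wants A_ub x <= b_ub.
        res = linprog(c[J], A_ub=-A[np.ix_(I, J)], b_ub=-b[I],
                      bounds=(0, None), method="highs")
        x = np.zeros(A.shape[1]); x[J] = res.x
        v = np.zeros(A.shape[0]); v[I] = -res.ineqlin.marginals  # duals for rows I
        # Optimality checks (8) on the rows and columns left out of the reduced LP.
        Iv = [i for i in range(A.shape[0]) if i not in I and A[i] @ x < b[i] - tol]
        Jv = [j for j in range(A.shape[1]) if j not in J and A[:, j] @ v > c[j] + tol]
        if not Iv and not Jv:
            return x, v                      # optimal for the full LP
        I = sorted(set(I) | set(Iv)); J = sorted(set(J) | set(Jv))
    raise RuntimeError("no convergence within max_rounds")

# Tiny example: three constraints, three variables; start from strict subsets.
A = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
b = np.array([1.0, 0.5, -1.0])
c = np.array([1.0, 1.0, 5.0])
x_opt, v_opt = colcon_generation(A, b, c, I=[1], J=[0, 2])
```

On this example, the first reduced LP violates the first constraint, which is then generated; the second reduced LP already satisfies all of the checks (8).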

Many variants are possible within this framework. One could define the sets $I_v$ and $J_v$ to contain only the smallest valid index, or the most-violated index. More commonly, $I_v$ and $J_v$ are chosen to have cardinality greater than one where possible. In large-scale problems, not all checks in (8) are even performed (analogous to partial pricing in the simplex method). When $A$ is too large to store in memory, for example, we can stream columns of $A$ to calculate the quantities $A_{I,j}^T v_I$ until enough have been calculated to define $J_v$.

When Algorithm 1 is implemented with $I = [m]$ but $J$ a strict subset of $[n]$, it reduces to column generation. (In this case, $I_v$ is empty at every iteration.) Similarly, when $I$ is a strict subset of $[m]$ but $J = [n]$, Algorithm 1 reduces to constraint generation.

The success of constraint and column generation hinges on the ability to generate initial guesses for $I$ and $J$ that require few additional iterations of Algorithm 1 to identify the solutions of (Primal-Full) and (Dual-Full).

##### The DantzigLP Framework.

Combining column and constraint generation techniques with methods for finding good initializations for the column and constraint sets $J$ and $I$, we develop DantzigLP, a general framework for solving large-scale versions of the Dantzig Selector-type problems described in Section 1. Initializations for $J$ and/or $I$ are typically obtained by solving the Lasso variant of a given problem. This basic approach can be tailored to a large range of problems, and in many cases the problem structure admits fast algorithms for both initialization and column and constraint generation.

All of the Dantzig-type problems described in Section 1 (except for Basis Pursuit) have a tunable regularization parameter $\lambda$. Practitioners often wish to compute the estimator for a grid of $\lambda$ values specified a priori. DantzigLP makes this process efficient by leveraging a simplex-based solver's warm start capabilities. Given the solution for one value of $\lambda$, DantzigLP can efficiently find the solution for a neighboring value of $\lambda$ (within the column and constraint generation framework). Repeating this process yields a path of solutions.

### 2.2 The Dantzig Selector

We show how the DantzigLP framework applies to $\ell_1$-DS. We present an LP formulation for $\ell_1$-DS; the primal (9) and its corresponding dual (10) are as follows:

$$ \begin{aligned} \min_{\beta^+,\,\beta^-,\,r} \ & \sum_{i\in[p]} (\beta_i^+ + \beta_i^-) \qquad\qquad (9) \\ \text{s.t.} \ \ & -\lambda\mathbf{1} \le X^T r \le \lambda\mathbf{1} \\ & r = y - X(\beta^+ - \beta^-) \\ & \beta^+, \beta^- \ge 0; \end{aligned} $$

$$ \begin{aligned} \max_{\nu^+,\,\nu^-,\,\alpha} \ & -\sum_{i\in[p]} \lambda(\nu_i^+ + \nu_i^-) - \alpha^T y \qquad (10) \\ \text{s.t.} \ \ & -\mathbf{1} \le X^T\alpha \le \mathbf{1} \\ & X(\nu^+ - \nu^-) + \alpha = 0 \\ & \nu^+, \nu^- \ge 0. \end{aligned} $$

The primal problem (9) has decision variables $\beta^+, \beta^- \ge 0$, denoting the positive and negative parts of $\beta$, and $r$, corresponding to the residual vector. The dual variables $\nu^+$ and $\nu^-$ correspond to the inequality constraints $X^T r \le \lambda\mathbf{1}$ and $X^T r \ge -\lambda\mathbf{1}$, respectively, and $\alpha$ corresponds to the equality constraint $r = y - X(\beta^+ - \beta^-)$. At optimality, the following complementarity conditions hold:

$$ \nu^+ \circ (X^T r - \lambda\mathbf{1}) = 0 \qquad \text{and} \qquad \nu^- \circ (X^T r + \lambda\mathbf{1}) = 0, $$

where "$\circ$" denotes componentwise multiplication. Therefore, $\nu_i^+ = 0$ whenever $X_{*,i}^T r < \lambda$, and $\nu_i^- = 0$ whenever $X_{*,i}^T r > -\lambda$. Formulation (9) does not require computation and storage of the memory-intensive matrix $X^TX$; this is avoided by introducing the auxiliary variable $r$. Moreover, the related Lasso problem (3) gives us reason to expect that this problem is a good candidate for both constraint and column generation. Optimality conditions for the solution $\beta^L$ of (3) can be written as follows:

$$ r^L := y - X\beta^L, \qquad X_{*,j}^T r^L \in \begin{cases} \{\lambda\} & \text{if } \beta^L_j > 0 \\ [-\lambda, \lambda] & \text{if } \beta^L_j = 0 \\ \{-\lambda\} & \text{if } \beta^L_j < 0. \end{cases} \tag{11} $$

This suggests that, for the Lasso solution at least, the number of active constraints in (9) is similar to the number of nonzero components in $\beta^L$, which is typically small. If the solution of the Dantzig selector has similar properties to the Lasso solution, then we would expect both the number of nonzero components in the solution of (9) and the number of active constraints to be small relative to the dimensions of the problem. (The papers [james2009] and [asif2010] demonstrate conditions under which the Dantzig and Lasso solution paths, traced as functions of the regularization parameter $\lambda$, are in fact identical.)

For a subset $J$ of columns of $X$ and a subset $I$ of rows of $X^T$ (note that $I$ and $J$ need not be the same), we define a reduced column and constraint version of (9) as follows:

$$ \begin{aligned} (\mathrm{DS}(I,J)) \qquad \min_{\beta^+,\,\beta^-,\,r} \ & \sum_{j\in J} (\beta_j^+ + \beta_j^-) \\ \text{s.t.} \ \ & y - X_{*,J}(\beta_J^+ - \beta_J^-) = r \\ & (X_{*,I})^T r \le \lambda\mathbf{1}, \quad (X_{*,I})^T r \ge -\lambda\mathbf{1} \\ & \beta^+, \beta^- \ge 0. \end{aligned} $$

Our constraint and column generation strategy for solving $\ell_1$-DS solves problems of this form at each iteration, initializing $I$ and $J$ from the Lasso solution, and alternately expanding $I$ and $J$ until a solution of $\ell_1$-DS is identified. We initialize $J$ to be the subset of components $j$ for which $\beta^L_j \neq 0$, while $I$ is the set of indices $i$ such that $|X_{*,i}^T r^L| = \lambda$. The strategy is specified as Algorithm 2.
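The initialization step can be sketched as follows. The crude proximal-gradient Lasso solver here is only a stand-in for glmnet/Lasso.jl, and the near-activity tolerance, sizes, and seed are illustrative choices.

```python
import numpy as np

def lasso_prox_grad(X, y, lam, iters=2000):
    """Crude proximal-gradient Lasso solver (stand-in for glmnet / Lasso.jl)."""
    L = np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        g = b + X.T @ (y - X @ b) / L
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return b

def initial_sets(X, y, lam, tol=1e-4):
    """Initial column set J (nonzeros of the Lasso solution) and constraint
    set I (near-active correlation constraints |X_i^T r| close to lam)."""
    b = lasso_prox_grad(X, y, lam)
    r = y - X @ b
    J = np.flatnonzero(np.abs(b) > tol)
    I = np.flatnonzero(np.abs(X.T @ r) > lam - tol)
    return I, J

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 200))
X /= np.linalg.norm(X, axis=0)
beta_true = np.zeros(200); beta_true[:5] = 3.0
y = X @ beta_true + 0.01 * rng.standard_normal(50)
I, J = initial_sets(X, y, lam=0.2)
```

By the Lasso optimality conditions (11), every nonzero coordinate of the Lasso solution yields a (near-)active correlation constraint, so the two sets are closely linked.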

Note that each time we solve $\mathrm{DS}(I,J)$, we warm-start from the previous instance. The violation checks that define $I_v$ and $J_v$ can be replaced by relaxed versions involving a tolerance $\epsilon > 0$. That is, we can define

$$ I_v \leftarrow \{\, i \in [p]\setminus I : |X_{*,i}^T r| > \lambda + \epsilon \,\}, \qquad J_v \leftarrow \{\, j \in [p]\setminus J : |X_{*,j}^T \alpha| > 1 + \epsilon \,\}. \tag{12} $$

(We use a small positive value of $\epsilon$ in our experiments.)

##### Computing a path of solutions.

We can extend Algorithm 2 to solve $\ell_1$-DS for a path of values $\lambda_1 > \lambda_2 > \cdots > \lambda_K$, in decreasing order. We first obtain the Lasso solution for the smallest value $\lambda_K$, which typically corresponds to the densest solution in the Lasso path. (This strategy, which is the opposite of that used in solvers for Lasso (see, for example, [wright2009sparse]), reduces the overhead of continuously updating our LP model with new columns and constraints as we move across the $\lambda$-path.) The Lasso solution can be used to supply initial guesses of the index sets $I$ and $J$ for the first value $\lambda_1$. The final index sets $I$ and $J$ for $\lambda_k$ can then be used as initial guesses for $\lambda_{k+1}$, and so on. Optimal basis information (basis matrices and their factorizations) for each value of $\lambda$ can also be carried over to the next value.

##### Existing approaches.

Many interesting approaches have been presented to solve the $\ell_1$-DS problem. [becker2011] presents a general framework that solves a regularized version of the problem with first-order gradient-based methods, but these methods do not scale well. [lu2012, wang2012] use ADMM for $\ell_1$-DS, which may lead to large feasibility violations. Homotopy methods were presented in [pang2017] and were shown to outperform existing algorithms for $\ell_1$-DS. Other homotopy-based methods appear in [asif2009, brauer2018primal]. All the homotopy algorithms [asif2009, brauer2018primal, pang2017] compute the matrix $X^TX$ at the outset; this memory-intensive computation precludes the possibility of solving large scale instances. Column and constraint generation has not been considered in these aforementioned papers.

Computational results of our methods are presented in Section 3.

### 2.3 Basis Pursuit

We study how our DantzigLP framework can be applied to (4), which admits the following LP representation. (Recall our assumption that the feasible set of (4) is nonempty.)

$$ \text{(BP-Full)} \qquad \min_{\beta^+,\,\beta^-} \ \sum_{j\in[p]} (\beta_j^+ + \beta_j^-) \quad \text{s.t.} \quad y = X(\beta^+ - \beta^-), \ \ \beta^+ \ge 0, \ \beta^- \ge 0. $$
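A tiny instance of (BP-Full) can be solved directly with `scipy.optimize.linprog`, again via the positive/negative split of $\beta$; the sizes and seed are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, p = 20, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -2.0, 1.0]
y = X @ beta_true                         # noiseless measurements

# (BP-Full): min 1^T (b+ + b-)  s.t.  X (b+ - b-) = y,  b+, b- >= 0
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
beta_hat = res.x[:p] - res.x[p:]
```

Since `beta_true` is itself feasible, the LP optimum has $\ell_1$-norm no larger than $\|\beta^*\|_1$; this full-size LP is exactly what the column generation scheme below avoids building at scale.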

Consider a subset $J \subseteq [p]$ of features and a restriction of (BP-Full) to the indices in $J$. We obtain the following reduced primal (BPP(J)) and dual (BPD(J)):

$$ \text{(BPP(J))} \qquad \min_{\beta_J^+,\,\beta_J^-} \ \sum_{j\in J} (\beta_j^+ + \beta_j^-) \quad \text{s.t.} \quad y = X_{*,J}(\beta_J^+ - \beta_J^-), \ \ \beta_J^+, \beta_J^- \ge 0; $$

$$ \text{(BPD(J))} \qquad \max_{v} \ y^T v \quad \text{s.t.} \quad -\mathbf{1} \le (X_{*,J})^T v \le \mathbf{1}. $$

For the column generation procedure, we need a subset $J$ (preferably of small size) for which (BPP(J)) is feasible. Accordingly, we seek an approximation to the largest value of $\lambda$ for which the Lasso yields a feasible solution for (BPP(J)). We find such a value by solving the Lasso for a sequence of decreasing values of $\lambda$, checking after each solve whether the resulting solution is feasible for (BPP(J)), and if so, defining $J$ to be the support of this solution.

If we cannot find a set $J$ for which (BPP(J)) is feasible, we append additional columns to the current $J$ to obtain a feasible solution. (If the entries of $X$ are drawn from a continuous distribution, then any $J$ with $|J| \ge n$ will lead to a feasible solution for (BPP(J)), with probability one. In all our experiments, the Lasso continuation approach did lead to a $J$ for which (BPP(J)) was feasible; note that the Lasso path often leads to solutions for which the number of nonzeros exceeds $n$.) As before, the Lasso continuation approach is used just to obtain a good initialization of $J$ for our column generation framework, not to obtain a solution for BP.

The approach is summarized in Algorithm 3, and computational results are presented in Section 3.

### 2.4 The Fused Dantzig Selector

#### 2.4.1 Signal estimation

We discuss how the DantzigLP framework can be extended to the Dantzig analog of (5) with $X = I$ (the identity matrix); problem (5) then becomes

$$ \min_\beta \ \tfrac12\|y - \beta\|_2^2 + \lambda\|D^{(0)}\beta\|_1, \tag{13} $$

where is defined in (6). To express this problem in Lasso form, we define the matrix (where ), which has full rank. Its inverse is

 Hi,j=⎧⎨⎩1ifj=1i−jifi>j0otherwise. (14)
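A quick numerical sanity check of this inverse relationship, assuming (as in Section 2.4.2) that the stacked matrix consists of $e_1^T$ on top of the first-order difference operator $D^{(0)}$ and that $H$ is the lower-triangular matrix of ones:

```python
import numpy as np

p = 6
# First-order difference operator from (6): (D0 @ b)_i = b_{i+1} - b_i.
D0 = np.diff(np.eye(p), axis=0)            # shape (p-1, p)
e1 = np.eye(p)[[0], :]                     # row vector e_1^T
M = np.vstack([e1, D0])                    # the full-rank p x p matrix
H = np.tril(np.ones((p, p)))               # lower-triangular matrix of ones
# H undoes M: cumulative sums invert first differences (given beta_1).
assert np.allclose(H @ M, np.eye(p)) and np.allclose(M @ H, np.eye(p))
```

Intuitively, $M$ maps a signal to its first entry plus its successive differences, and $H$ reconstructs the signal by cumulative summation.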

We can now rewrite (13) in terms of the variables $\alpha = M\beta$, and recover the solution of (13) by setting $\beta = H\alpha$. Defining $A = \{1\}$ and $B = \{2, \dots, p\}$, and writing $H_A$ for the first column of $H$ and $H_B$ for the remaining columns, we write (13) as follows:

$$ \min_{\alpha_A,\,\alpha_B} \ \tfrac12\|y - H_A\alpha_A - H_B\alpha_B\|_2^2 + \lambda\|\alpha_B\|_1. \tag{15} $$

This formulation differs slightly from the standard Lasso problem in that the penalty term excludes $\alpha_A$. The Dantzig analog of (15) is

$$ \min_{\alpha_A,\,\alpha_B} \ \|\alpha_B\|_1 \quad \text{s.t.} \quad \|H_B^T(y - H_A\alpha_A - H_B\alpha_B)\|_\infty \le \lambda, \quad H_A^T(y - H_A\alpha_A - H_B\alpha_B) = 0. \tag{16} $$

(Note the constraint $H_A^T(y - H_A\alpha_A - H_B\alpha_B) = 0$, which arises as an optimality condition for $\alpha_A$ in (15).) Recalling that $\beta = H\alpha$ (equivalently, $M\beta = \alpha$), and introducing auxiliary variables $\beta$, $r$, and $g$, we rewrite (16) as follows:

$$ \begin{aligned} \min_{r,\,\alpha^+,\,\alpha^-,\,\beta,\,g} \ & \sum_{i\in B} (\alpha_i^+ + \alpha_i^-) \qquad\qquad (17) \\ \text{s.t.} \ \ & D^{(0)}\beta = \alpha^+ - \alpha^- \\ & r = y - \beta \\ & M^T g = r \\ & g_A = 0 \\ & g_B \le \lambda\mathbf{1}, \quad g_B \ge -\lambda\mathbf{1} \\ & \alpha^+, \alpha^- \ge 0. \end{aligned} $$

We apply column and constraint generation to formulation (17): column generation because of sparsity in $\alpha$, and constraint generation because few of the constraints are expected to be active at optimality. The formulation (17) exploits the following key characteristics:

• $M$ and $D^{(0)}$ are banded. Writing the constraints in terms of these matrices rather than the dense triangular matrix $H$ yields a constraint matrix with $O(p)$ nonzero entries. (A direct LP representation for $\ell_1$-DS with $\tilde{X}$ as in (9) would lead to a constraint matrix with $O(p^2)$ nonzeros.)

• Each component of $\alpha^+$ and $\alpha^-$ appears in exactly one equality constraint, reducing the cost of computing each reduced cost to $O(1)$ time, much cheaper than the corresponding cost for a general $\ell_1$-DS problem, which requires an inner product of the form $X_{*,j}^T\alpha$.

• We can check each constraint violation in $O(1)$ time, compared to a general $\ell_1$-DS problem, which requires the operation $X_{*,i}^T r$ for each constraint.

To obtain a good initialization for (17), we use the solution of (13), which can be computed efficiently via dynamic programming [johnson2013] in $O(p)$ operations.

#### 2.4.2 Beyond signal estimation: Regression

We now consider problem (5) with a general design matrix $X$.

As in (15), we introduce the new variable $\alpha = M\beta$. For an index set $S$, we let $H_S$ be the submatrix of $H$ containing the columns indexed by $S$, and we let $P_S$ denote the projection operator onto the column space of $XH_S$. With this notation in place (and $A = \{1\}$, $B = \{2, \dots, p\}$ as before), we rewrite (5) in standard Lasso form with model matrix $\tilde{X} = (I - P_A)XH_B$ and response $\tilde{y} = (I - P_A)y$. The corresponding Dantzig Selector problem (7) is an instance of the $\ell_1$-DS problem with problem data $(\tilde{y}, \tilde{X})$. Since (7) lacks the structure of the signal estimation problem in Section 2.4.1 (where $X = I$), we apply the DantzigLP procedure (Algorithm 2) to this problem, with a special initialization. The initial point is obtained by solving (5) using a proximal gradient method [beck2009]. Each step of this method requires calculation of the proximal map defined by

$$ \Theta(u^k) := \arg\min_\beta \ \frac{L}{2}\Big\|\beta - \Big(u^k - \frac{1}{L}\nabla f(u^k)\Big)\Big\|_2^2 + \lambda\|D^{(0)}\beta\|_1, $$

where $f(u) := \frac12\|y - Xu\|_2^2$ and $L$ is the largest eigenvalue of $X^TX$. (This can be computed via the power method, or by computing the largest singular value of $X$.) The operator $\Theta$ is computed via dynamic programming [johnson2013], which is highly efficient. If we set $u^{k+1} = \beta^k := \Theta(u^k)$, we get the usual (unaccelerated) proximal gradient algorithm. We use the accelerated variant, which enjoys an improved convergence rate. It sets $u^{k+1} = \beta^k + \frac{t_k - 1}{t_{k+1}}(\beta^k - \beta^{k-1})$, where $\beta^k := \Theta(u^k)$ and $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$. The sequence is initialized with $t_1 = 1$ and $u^1 = \beta^0$.

This proximal gradient approach to solving (5) is much faster than a coordinate descent procedure [friedman2010] applied to the Lasso reformulation of (5) with problem data $(\tilde{y}, \tilde{X})$.

## 3 Computational Results

This section presents computational results showing the performance of our proposed DantzigLP framework for the $\ell_1$-DS problem (Section 3.1), Basis Pursuit (Section 3.2), and the Fused Dantzig Selector (Section 3.3).

### 3.1 Computational experience with Dantzig Selector

We implement DantzigLP in Julia (our Julia/JuMP implementation can be found at https://github.com/atzheng/DantzigLP), using Gurobi's dual simplex solver as the LP solver and Lasso.jl (a Julia implementation of glmnet [friedman2010]) as the Lasso solver. (We use Gurobi version 7.5 in our experiments. All computations were performed on a Mac Pro desktop machine with a 2.7GHz 12-Core Intel Xeon E5; unless otherwise specified, the memory budget was 64GB of RAM.) At each iteration of column and constraint generation, we add up to 30 columns with the most negative reduced costs, and up to 50 of the most violated constraints. Unless stated otherwise, we solve problems to within a small positive tolerance for both column and constraint generation violations.

#### 3.1.1 Experiments on synthetic datasets

##### Data Generation.

Our first set of experiments was performed on synthetic data. The rows of $X$ are drawn from a multivariate Gaussian distribution with mean zero and covariance matrix $\Sigma$. We then sparsify $X$ (sparsification may destroy the correlation structure among the columns of $X$) by setting its entries to 0 independently with a fixed probability, and finally normalize so that the columns of $X$ have unit $\ell_2$-norm. To generate the true $\beta^*$, we choose a support set $S \subseteq [p]$ and set the entries of $\beta^*_S$ as i.i.d. draws from a standard Gaussian distribution; the remaining components are set to zero. We then generate $y = X\beta^* + \epsilon$ with $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma$ is chosen so as to achieve a signal-to-noise ratio (SNR) of 10. (We define SNR as the ratio of the variance of the signal $X\beta^*$ to the noise variance $\sigma^2$.)

A popular method for the ℓ1-DS problem is based on ADMM [lu2012, wang2012, boyd2011]. In Figure 1, we compare DantzigLP with flare [lu2012], a publicly available implementation of ADMM, plotting the feasibility violation and the absolute difference between the objective value and its optimal value as a function of runtime. (We use the absolute value in the objective measure because an infeasible point can have a smaller ℓ1-norm than the optimal solution.) We find that ADMM is slow by both measures (an observation also made by [pang2017]) and that DantzigLP is much faster. Each path in Figure 1 represents a single simulated problem instance with data generated by the means above, for a fixed choice of problem dimensions and λ. Since ADMM has difficulty finding a solution with an absolute feasibility violation of less than 0.01 in many cases, we do not consider it further in the experiments below.

Figure 1: Feasibility and objective violations as a function of runtime for DantzigLP (including time required for Lasso initialization) and the ADMM implementation Flare, for the ℓ1-DS problem. In all instances, DantzigLP reaches an optimal solution with zero feasibility violation within a few iterations. In comparison, Flare takes much longer to improve the objective and feasibility violations, often failing to converge even after hundreds of seconds.
##### Comparison with PSM.

The recently proposed parametric simplex-based solver PSM [pang2017] is a state-of-the-art method for the ℓ1-DS problem. In the next set of tests, we compare the following approaches.

1. PSM, as implemented in the companion R package fastclime, for a path of 50 logarithmically spaced values of λ.

2. Gurobi (dual simplex method) applied to the full LP model (9) for a single value of λ. We denote these results by “Full LP (Single).”

3. DantzigLP applied to (9) for the same 50 λ-values as in the PSM tests. We denote these results by “DantzigLP (Path).”

4. DantzigLP applied to (9) for a single value of λ. We denote these results by “DantzigLP (Single).”

Note that PSM always computes a full path of solutions via homotopy, even if the solution is required at just one value of λ. The times shown for our DantzigLP methods include the time required to compute a Lasso path, usually between 0.1 and 1 seconds (solve times vary across the settings considered).
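For concreteness, a log-spaced λ-path of the kind used here can be generated as follows. In this sketch, `lambda_max = ||X^T y||_inf` is the smallest λ for which β = 0 is feasible for the ℓ1-DS constraints; the lower endpoint `eps * lambda_max` is our illustrative choice, not necessarily the paper's.

```python
import numpy as np

def lambda_path(X, y, n_lambda=50, eps=0.01):
    """Log-spaced path of n_lambda values, from lambda_max
    (the smallest lambda for which beta = 0 is feasible for the
    Dantzig Selector constraints) down to eps * lambda_max."""
    lam_max = np.abs(X.T @ y).max()
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_lambda)
```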

Table 1 shows that when computing a path of 50 λ-values, PSM is usually outperformed by DantzigLP (Path). When computing a solution to ℓ1-DS at a single λ, PSM is outperformed by solving the full LP model with Gurobi (denoted by “Full LP”), with DantzigLP (Path) faster still.

We observe that PSM works well on instances with small p (in the hundreds), but its performance deteriorates as p increases. PSM computes the whole matrix X⊤X, leading to large memory requirements by comparison with “Full LP” (which in our formulation does not require computation of X⊤X) and also DantzigLP. Of all the methods, DantzigLP has the lowest memory requirements, as it generates new columns and constraints only as necessary.

##### Varying sparsity and correlation in X.

We next explore the sensitivity of the runtimes of the various approaches to sparsity in X and to correlations between the columns of X, which are controlled by the sparsification probability and the correlation parameter, respectively. Table 2 (Left) shows that the DantzigLP and Full LP approaches exploit sparsity well (runtimes decrease with increasing sparsity), while PSM does not benefit from sparsity, possibly because the product X⊤X (which it computes) remains dense. For the case of dense X, Table 2 (Right) shows that increasing correlations between the columns lead to improved runtimes for all algorithms. (The solutions tend to be sparser in these cases.) DantzigLP remains the clear winner.

##### Dependence on sparsity in solution β.

In the experiments above, we considered a sequence of λ-values in the range described earlier. It is well known that smaller values of λ correspond to denser solutions β. To understand better the dependence of the runtimes of the various algorithms on the sparsity of β, we tried smaller values of λ, showing the results in Table 3. Runtimes for DantzigLP increase with density in β, as expected, mostly because the Lasso solution has a very different support from the solution of ℓ1-DS, requiring DantzigLP to generate many more columns than it would for larger λ-values. For still smaller values of λ (results not shown), DantzigLP can be even slower than the vanilla Gurobi implementation for the Full LP, due to the overhead of column generation, although its memory consumption is still much smaller. Computation time for PSM also increases as λ decreases, though not as much as DantzigLP in relative terms. Still, DantzigLP performance remains superior for most interesting values of λ.

##### Importance of the components of DantzigLP.

The DantzigLP framework consists of several components: initialization provided by the Lasso, column generation, constraint generation, and the simplex LP solver. To understand the roles played by these components, we compare several variants in which one or more of them are omitted. Figure 2 compares the following five variants of DantzigLP.

1. DantzigLP: The complete framework described above.

2. Random Init.: Rather than using a Lasso initialization, we initialize with a random subset of the columns of X.

3. Constraint Gen.: The initial set of constraints is obtained from the Lasso initialization, but all columns are included from the start. That is, only constraint generation is enabled.

4. Column Gen.: Here, the initial set of columns is obtained from the Lasso initialization, but all constraints are included from the start. That is, only column generation is enabled.

5. Full LP: The full LP model solved with Gurobi, without column or constraint generation.

The boxplots of Figure 2 show the runtime distribution over 20 randomly generated ℓ1-DS instances with n = 1,000 and p = 10,000, solved for a single value of λ. The figure highlights the importance of all elements of DantzigLP. We note that column generation alone provides no performance gains over the Full LP approach, due to the overhead of generating new columns (and restarting our simplex-based LP solver). However, we see improvements when constraint generation is also introduced.

Figure 2: Comparison of 5 variants of DantzigLP (n=1,000, p=10,000), showing the importance of all components of the approach: Lasso initialization, column generation, constraint generation, and the simplex solver.

#### 3.1.2 Experiments on real datasets

We demonstrate the performance of DantzigLP on real datasets with large n and p. Due to memory constraints (our maximum was 64GB), DantzigLP was the only one of our tested algorithms able to solve these problems. We consider a path of 100 log-spaced λ-values. Results are displayed in Table 4.

The “AmazonLarge” dataset from [hazimeh2018] has as its goal the prediction of helpfulness scores for product reviews based on data from Amazon’s Grocery and Gourmet Food dataset. The “Boston1M” dataset consists of 104 covariates obtained from polynomial expansions of features in the Boston House Prices dataset ([harrison1978]), augmented with 1000 random permutations of each column.

### 3.2 Results for Basis Pursuit

We present some numerical results illustrating the performance of the DantzigLP procedure on the basis pursuit problem (4). The matrix X is generated as in Section 3.1.1. We set y = Xβ, where β is sparse with nonzero elements chosen i.i.d. as in Section 3.1.1. We compare DantzigLP against two other algorithms: the ADMM approach of [boyd2011], as implemented in the ADMM R package, and solving the full LP model (BP-Full). In this example, we set a memory cap of 16GB of RAM for all experiments.
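For reference, the basis pursuit LP itself is straightforward to state. The sketch below is our own illustration using SciPy, not the Gurobi-based implementation in the paper; it solves BP via the standard variable split b = u - v with u, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||b||_1 subject to X b = y as an LP.
    With b = u - v and u, v >= 0, the objective ||b||_1 becomes
    sum(u) + sum(v) at any optimal solution."""
    n, p = X.shape
    c = np.ones(2 * p)                 # objective: sum(u) + sum(v)
    A_eq = np.hstack([X, -X])          # equality constraints: X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    assert res.status == 0, res.message
    return res.x[:p] - res.x[p:]
```

Dense LP formulations like this one require storing the full constraint matrix, which is exactly the memory bottleneck that column and constraint generation avoids.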

Table 5 shows runtimes in seconds for a number of instances. We see that DantzigLP performs much better than the competing algorithms in terms of runtime, particularly on the larger instances. This also comes with large savings in memory: the other algorithms were unable to solve the largest problems shown due to violation of the 16GB memory bound, whereas DantzigLP can solve problems an order of magnitude larger. For smaller instances, the overhead of generating columns can cause DantzigLP to underperform the baselines in certain cases. The DantzigLP runtimes in Table 5 include the runtime of the Lasso initialization step, which is the main performance bottleneck; the Lasso accounts for about 80% of the runtime on these instances.

While DantzigLP can obtain solutions of high accuracy, ADMM often has difficulty in doing so. The ADMM column in Table 5 reports the minimum of the time to converge to within a fixed tolerance of the true objective and the time to complete 10,000 ADMM iterations (after which we terminate the algorithm). As the “ADMM (% Converged)” column shows, for larger problem sizes none of the instances converge to that tolerance within the allotted iteration bound.

### 3.3 Computational experience with Fused Dantzig Selector

##### Signal estimation.

We first consider the Fused Dantzig Selector for the signal estimation case of Section 2.4.1. We generate a piecewise constant signal, with discontinuities (knots) chosen at random and a random jump in the signal level at each knot. We add noise at SNR = 10 to the signal. We solve (17) at a single value of λ, chosen to correspond to the true signal.
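A signal of the kind described above can be generated as in the sketch below. This is our own illustration: the knot-location and jump distributions (uniform locations, standard Gaussian jumps) are assumptions, not necessarily the paper's choices.

```python
import numpy as np

def piecewise_constant_signal(N=1000, n_knots=5, snr=10.0, seed=0):
    """Sketch: piecewise-constant signal theta with random knot
    locations and Gaussian jumps, plus Gaussian noise scaled so
    that Var(theta) / sigma^2 equals the given SNR."""
    rng = np.random.default_rng(seed)
    knots = np.sort(rng.choice(np.arange(1, N), size=n_knots, replace=False))
    theta = np.zeros(N)
    level, start = 0.0, 0
    for knot in list(knots) + [N]:
        theta[start:knot] = level
        level += rng.standard_normal()   # jump in the level at each knot
        start = knot
    sigma = np.sqrt(theta.var() / snr)   # noise level matching the SNR
    y = theta + sigma * rng.standard_normal(N)
    return theta, y
```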

In Table 6, we compare our DantzigLP framework (we solve each instance to within a fixed tolerance and add up to 40 columns or constraints per iteration of column or constraint generation) to directly solving the full version of (17); Gurobi’s dual simplex solver is used in both cases. The formulation (17) allows solution of problems several orders of magnitude larger than the Dantzig Selector problem (9), even when column/constraint generation is not used. The DantzigLP framework improves modestly on this, by factors of 2–3 for large problems with a small number of knots. The runtime for the full LP formulation is insensitive to the number of knots, while the runtime of DantzigLP grows when the number of knots is large. This is not surprising, as more knots imply a denser solution. Solving the Fused Lasso (required for initialization) takes less than one second across all instances.