# M-estimation with the Trimmed l1 Penalty

We study high-dimensional M-estimators with the trimmed ℓ_1 penalty. While the standard ℓ_1 penalty incurs bias (shrinkage), the trimmed ℓ_1 penalty leaves the h largest entries penalty-free. This family of estimators includes the Trimmed Lasso for sparse linear regression and its counterpart for sparse graphical model estimation. The trimmed ℓ_1 penalty is non-convex, but unlike other non-convex regularizers such as SCAD and MCP, it is not amenable and therefore prior analyses cannot be applied. We characterize the support recovery of the estimates as a function of the trimming parameter h. Under certain conditions, we show that for any local optimum, (i) if the trimming parameter h is smaller than the true support size, all zero entries of the true parameter vector are successfully estimated as zero, and (ii) if h is larger than the true support size, the non-relevant parameters of the local optimum have smaller absolute values than relevant parameters and hence relevant parameters are not penalized. We then bound the ℓ_2 error of any local optimum. These bounds are asymptotically comparable to those for non-convex amenable penalties such as SCAD or MCP, but enjoy better constants. We specialize our main results to linear regression and graphical model estimation. Finally, we develop a fast, provably convergent optimization algorithm for the trimmed regularizer problem. The algorithm has the same rate of convergence as difference of convex (DC)-based approaches, but is faster in practice and finds better objective values than recently proposed algorithms for DC optimization. Empirical results further demonstrate the value of ℓ_1 trimming.


## 1 Introduction

We consider high-dimensional estimation problems, where the number of variables $p$ can be much larger than the number of observations $n$. In this regime, consistent estimation can be achieved by imposing low-dimensional structural constraints on the estimation parameters. Sparsity is a prototypical structural constraint, where at most a small set of parameters can be non-zero.

A key class of sparsity-constrained estimators is based on regularized M-estimators using convex penalties, with the $\ell_1$ penalty by far the most common. In the context of linear regression, the Lasso estimator Tibshirani (1996) solves an $\ell_1$-regularized (or constrained) least squares problem, and has strong statistical guarantees, including prediction error consistency van de Geer and Buhlmann (2009), consistency of the parameter estimates in some norm van de Geer and Buhlmann (2009); Meinshausen and Yu (2009); Candes and Tao (2007), and variable selection consistency Meinshausen and Bühlmann (2006); Wainwright (2009a); Zhao and Yu (2006). In the context of sparse Gaussian graphical model (GMRF) estimation, the graphical Lasso estimator minimizes the Gaussian negative log-likelihood regularized by the $\ell_1$ norm of the (off-diagonal) entries of the concentration matrix Yuan and Lin (2007); Friedman et al. (2007); Bannerjee et al. (2008). Strong statistical guarantees for this estimator have been established (see Ravikumar et al. (2011) and references therein).

Recently, there has been significant interest in M-estimators with non-convex penalties, including SCAD and MCP penalties Fan and Li (2001); Breheny and Huang (2011); Zhang et al. (2010); Zhang and Zhang (2012). In particular, Zhang and Zhang (2012) establish consistency for the global optima of least-squares problems with certain non-convex penalties. Loh and Wainwright (2015) show that under some regularity conditions on the penalty, any stationary point of the objective function lies within statistical precision of the underlying parameter vector, and thus provide $\ell_1$- and $\ell_2$-error bounds for any stationary point. Compared to convex penalties, perhaps the strongest point in favor of non-convex regularization is made by Loh and Wainwright (2017), who proved that for a class of amenable non-convex regularizers with vanishing derivative away from the origin (including SCAD and MCP), any stationary point is able to recover the parameter support without requiring the typical incoherence conditions needed for convex penalties.

In this paper, we study a family of M-estimators with trimmed $\ell_1$ regularization, which leaves the $h$ largest parameters unpenalized. This family includes as special cases the recently proposed Trimmed Lasso estimator (Gotoh et al. (2017); Bertsimas et al. (2017)) and its counterpart for sparse graphical model estimation, which we call Graphical Trimmed Lasso. This work complements efforts of Yang et al. (2016), who analyze statistical benefits of trimming losses. We apply the trimming mechanism to separable components of the regularizer.

We present the first statistical analysis of M-estimators with trimmed regularization. (The Trimmed Lasso has been studied from an optimization perspective and with respect to its connections with existing penalties, but has not been analyzed from a statistical standpoint.) These estimators are non-convex, but unlike SCAD and MCP regularizers, they are not amenable and hence the analyses of Loh and Wainwright (2015, 2017) cannot be applied. Our main theoretical result shows that if the trimming parameter $h$ is smaller than the true support size, then for any local optimum of the resulting non-convex program all the zero entries of the true parameter vector are successfully estimated as zero; while if $h$ is larger than the true support size, the non-relevant parameters of the local optimum have smaller absolute values than relevant parameters and hence relevant parameters are not penalized. In addition to $\ell_\infty$ error bounds, we provide $\ell_2$ error bounds. These are asymptotically the same as those for amenable regularized problems such as SCAD or MCP, but have better constants and do not require the additional constraint $\|\theta\|_1 \le R$, where $R$ is a safety radius. We specialize our main results and derive corollaries for the special cases of linear regression and graphical model estimation. To optimize the trimmed regularized problem we develop and analyze a specialized algorithm, which performs better than recent methods based on difference of convex (DC) functions optimization Khamaru and Wainwright (2018). Experiments on simulated and real data demonstrate the value of $\ell_1$ trimming compared to SCAD, MCP and vanilla $\ell_1$ penalties.

Beyond $\ell_1$ regularization, the trimming strategy can be seamlessly applied to other decomposable regularizers, including group-sparsity promoting regularization Tropp et al. (2006); Zhao et al. (2009); Yuan and Lin (2006); Jacob et al. (2009). Our work therefore motivates a future line of research on trimming a wide class of regularizers.

## 2 Problem Setup and the Trimmed Regularizer

Trimming has typically been applied to the loss function of M-estimators. We can handle outliers and heavy-tailed noise by trimming the $h$ observations with the largest residuals in terms of a loss function $\mathcal{L}$: given a collection of $n$ samples, $\mathcal{D} = \{Z_1, \dots, Z_n\}$, we solve the problem

$$\min_{\theta,\; w \in \{0,1\}^n} \; \sum_{i=1}^n w_i \, \mathcal{L}(\theta; Z_i) \quad \text{s.t.} \quad \sum_{i=1}^n w_i = n - h,$$

which trims outliers (see Yang et al. (2016) and references therein).

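Here, we consider a family of M-estimators with trimmed $\ell_1$ regularization for general high-dimensional problems. We trim the entries of $\theta$ that incur the largest penalty using the following program:

$$\min_{\theta \in \Omega,\; w \in [0,1]^p} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda \sum_{j=1}^p w_j |\theta_j| \quad \text{s.t.} \quad \mathbf{1}^\top w \ge p - h. \tag{1}$$

where $\Omega$ denotes the parameter space (e.g., $\mathbb{R}^p$ for linear regression). Defining the order statistics of the parameter $|\theta_{(1)}| \ge |\theta_{(2)}| \ge \dots \ge |\theta_{(p)}|$, we can partially minimize over $w$ (setting $w_j$ to $0$ or $1$ based on the size of $|\theta_j|$), and rewrite the reduced version of problem (1) in $\theta$ alone:

$$\min_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda \mathcal{R}(\theta; h) \tag{2}$$

where the regularizer $\mathcal{R}(\theta; h) := \sum_{j=h+1}^p |\theta_{(j)}|$ is the sum of the $p - h$ smallest absolute entries of $\theta$. The constrained version of (2) is equivalent to minimizing a loss subject to a sparsity constraint Gotoh et al. (2017):

$$\min_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) \quad \text{s.t.} \quad \|\theta\|_0 \le h.$$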

For statistical analysis, we focus on the reduced problem (2). When optimizing, we exploit the structure of (1), treating the weights $w$ as auxiliary optimization variables. This gives us both a new fast algorithm and an analysis technique that is not based on the DC structure of (2).
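The partial minimization over the weights can be made concrete with a short NumPy sketch. The function names `trimmed_l1` and `optimal_weights` are hypothetical and only illustrate the definition of the trimmed regularizer as the sum of the $p - h$ smallest absolute entries; they are not taken from the authors' code.

```python
import numpy as np


def trimmed_l1(theta, h):
    """Trimmed l1 regularizer: sum of the p - h smallest absolute entries."""
    a = np.sort(np.abs(np.ravel(theta)))      # ascending order
    return a[: a.size - h].sum()              # drop the h largest entries


def optimal_weights(theta, h):
    """Weights minimizing sum_j w_j |theta_j| over w in [0,1]^p with 1^T w >= p - h.

    The minimum is attained by setting w_j = 0 on the h largest |theta_j|
    and w_j = 1 elsewhere (ties are broken arbitrarily by argsort).
    """
    a = np.abs(np.ravel(theta))
    w = np.ones(a.size)
    if h > 0:
        w[np.argsort(a)[-h:]] = 0.0
    return w
```

With `h = 0` the function `trimmed_l1` reduces to the ordinary $\ell_1$ norm, and for any `h` the value `optimal_weights(theta, h) @ np.abs(theta)` agrees with `trimmed_l1(theta, h)`.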

Prior art (e.g., Tibshirani (1996); Negahban et al. (2012); Loh and Wainwright (2015, 2017)) derives the estimation upper bounds for diverse sparsely regularized estimators motivated by many real applications. In this paper, we mainly consider two typical examples using the trimmed regularizers, but the results generalize.

#### Example: sparse linear models.

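In high-dimensional linear regression problems, we have $n$ observation pairs of a real-valued target $y_i \in \mathbb{R}$ and its covariates $x_i \in \mathbb{R}^p$ in a linear relationship:

$$y = X\theta^* + \epsilon. \tag{3}$$

Here, $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\epsilon_1, \dots, \epsilon_n$ are independent observation noises. The goal is to estimate the $k$-sparse vector $\theta^* \in \mathbb{R}^p$. According to the framework (2), we use the least squares loss function with the trimmed $\ell_1$ regularizer (instead of the standard $\ell_1$ norm in Lasso Tibshirani (1996)):

$$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \|X\theta - y\|_2^2 + \lambda \mathcal{R}(\theta; h). \tag{4}$$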

#### Example: sparse graphical models.

Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables Lauritzen (1996), using undirected graphs to encode conditional independence conditions among variables.

In high-dimensional settings, graph sparsity constraints are particularly pertinent for estimating GGMs. The most widely used estimator, the graphical Lasso, minimizes the negative Gaussian log-likelihood regularized by the $\ell_1$ norm of the entries (or the off-diagonal entries) of the precision matrix (see Yuan and Lin (2007); Friedman et al. (2007); Bannerjee et al. (2008)). In our framework, we replace the $\ell_1$ norm with its trimmed version:

$$\min_{\Theta \in \mathcal{S}^p_{++}} \; \operatorname{trace}(\widehat{\Sigma}\Theta) - \log\det(\Theta) + \lambda \mathcal{R}(\Theta_{\text{off}}; h) \tag{5}$$

where $\mathcal{S}^p_{++}$ denotes the convex cone of symmetric and strictly positive definite matrices, $\widehat{\Sigma}$ is the sample covariance matrix, and $\mathcal{R}(\Theta_{\text{off}}; h)$ is the sum of the $p(p-1) - h$ smallest absolute off-diagonal entries of $\Theta$.
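As an illustration of objective (5), the following sketch evaluates it for a given candidate precision matrix. The function name and argument names are hypothetical. It also assumes that $h$ counts off-diagonal entries over both triangles of the symmetric matrix, so each edge contributes two entries, which the text does not spell out.

```python
import numpy as np


def graphical_trimmed_lasso_objective(Theta, Sigma_hat, lam, h):
    """Evaluate trace(Sigma_hat @ Theta) - logdet(Theta) + lam * R(Theta_off; h).

    Assumption: h counts entries among all p(p-1) off-diagonal positions.
    Returns +inf when Theta is not strictly positive definite.
    """
    p = Theta.shape[0]
    if np.linalg.eigvalsh(Theta).min() <= 0:
        return np.inf                                  # outside the PD cone
    _, logdet = np.linalg.slogdet(Theta)
    off = np.sort(np.abs(Theta[~np.eye(p, dtype=bool)]))   # off-diagonals, ascending
    penalty = off[: off.size - h].sum()                # drop the h largest
    return np.trace(Sigma_hat @ Theta) - logdet + lam * penalty
```

Setting `h = 0` gives the usual graphical Lasso objective with an off-diagonal $\ell_1$ penalty.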

## 3 Theoretical Guarantees of Trimmed Regularization

Our goal is to estimate the true $k$-sparse parameter vector (or matrix) $\theta^*$ that is the minimizer of the expected loss: $\theta^* := \arg\min_{\theta \in \Omega} \mathbb{E}[\mathcal{L}(\theta)]$. We use $S$ to denote the support set of $\theta^*$, namely the set of its non-zero entries (i.e., $S := \{j : \theta^*_j \neq 0\}$). In this section, we derive upper bounds on the estimation error and establish support set recovery under the following standard assumptions:

(C-1) The loss function $\mathcal{L}$ is differentiable and convex.

(C-2) (Restricted strong convexity on $\theta$) Let $\mathbb{D}$ be the set of possible error vectors on the parameter $\theta$. Then, for all $\Delta \in \mathbb{D}$,

$$\big\langle \nabla\mathcal{L}(\theta^* + \Delta) - \nabla\mathcal{L}(\theta^*),\, \Delta \big\rangle \ge \kappa_l \|\Delta\|_2^2 - \tau_1 \frac{\log p}{n} \|\Delta\|_1^2,$$

where $\kappa_l$ is a “curvature” parameter, and $\tau_1$ is a “tolerance” constant.

Note that the convex loss function in general cannot be strongly convex in the high-dimensional setting ($p > n$). (C-2) imposes strong curvature only in some limited directions where the ratio $\|\Delta\|_1 / \|\Delta\|_2$ is small. This condition has been extensively studied and is known to hold for several popular high-dimensional problems (see Raskutti et al. (2010); Negahban et al. (2012); Loh and Wainwright (2015) for instance). The convexity condition on $\mathcal{L}$ in (C-1) can be relaxed by introducing an additional mild constraint, as shown in Loh and Wainwright (2017). However, in this paper we focus on convex losses for clarity.
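For the least squares loss in (4), the left-hand side of the restricted strong convexity condition can be written out explicitly. The short derivation below is a standard computation, included only as an illustration; the symbols $\Delta$ (error vector), $\kappa_l$ and $\tau_1$ are notational choices made here for the quantities named in the condition.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \tfrac{1}{n}\,\|X\theta - y\|_2^2,
\qquad
\nabla\mathcal{L}(\theta) = \tfrac{2}{n}\, X^\top (X\theta - y), \\
\big\langle \nabla\mathcal{L}(\theta^* + \Delta) - \nabla\mathcal{L}(\theta^*),\, \Delta \big\rangle
&= \tfrac{2}{n}\, \Delta^\top X^\top X \Delta
 = \tfrac{2}{n}\, \|X\Delta\|_2^2 .
\end{aligned}
```

For this loss the condition therefore asks that $\tfrac{2}{n}\|X\Delta\|_2^2 \ge \kappa_l \|\Delta\|_2^2 - \tau_1 \tfrac{\log p}{n}\|\Delta\|_1^2$. When $p > n$, the matrix $X$ has a non-trivial null space, so the left-hand side vanishes for some $\Delta \neq 0$; the tolerance term is what allows the inequality to hold for such directions, which have a large ratio $\|\Delta\|_1 / \|\Delta\|_2$.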

We begin with the $\ell_\infty$ bound. Toward this, we adopt the primal-dual witness (PDW) technique, specifically devised for the trimmed regularizer $\mathcal{R}(\theta; h)$. Note that a line of work uses the PDW technique to show support set recovery for the $\ell_1$ regularizer (Wainwright, 2009c; Yang et al., 2015) as well as for amenable non-convex regularizers (Loh and Wainwright, 2017). However, even though the trimmed regularizer is also symmetric and concave, it is not amenable. The key step of PDW is to build the restricted program. Let $T$ be an arbitrary subset of $\{1, \dots, p\}$ whose size is $h$. Denoting $U := S \cup T$ and $V := S \setminus T$, we consider the following restricted program:

$$\hat\theta \in \arg\min_{\theta \in \mathbb{R}^{U}:\; \theta \in \Omega} \; \mathcal{L}(\theta) + \lambda \mathcal{R}(\theta; h) \tag{6}$$

where we fix $\theta_j = 0$ for all $j \in U^c$. We further construct the dual variable $\hat z$ to satisfy the zero sub-gradient condition

$$\nabla\mathcal{L}(\hat\theta) + \lambda \hat z = 0 \tag{7}$$

where $\hat z = (0, \hat z_V, \hat z_{U^c})$ for $\hat z_V \in \partial \|\hat\theta_V\|_1$ (after re-ordering indices properly) and $\hat\theta = (\hat\theta_U, 0)$. Note that we suppress the dependency on $T$ in $\hat z$ and $\hat\theta$ for clarity. In order to derive the final statement, we will establish the strict dual feasibility of $\hat z_{U^c}$, i.e., $\|\hat z_{U^c}\|_\infty < 1$.

The following theorem describes our main theoretical result concerning any local optimum of the non-convex program (2). The theorem guarantees under strict dual feasibility that non-relevant parameters of a local optimum have smaller absolute values than relevant parameters and hence relevant parameters are not penalized (as long as $h$ is set larger than $k$).

Theorem. Consider the problem with the trimmed regularizer (2) that satisfies (C-1) and (C-2). Let $\tilde\theta$ be any local minimum of (2) with a sample size and . Suppose that:

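1. given any selection of $T \subseteq \{1, \dots, p\}$ s.t. $|T| = h$, the dual vector $\hat z$ from the PDW construction (7) satisfies the strict dual feasibility with some $\delta \in (0, 1]$,

$$\|\hat z_{U^c}\|_\infty \le 1 - \delta \tag{8}$$

where $U$ is the union of the true support $S$ and $T$,

2. letting , the minimum absolute value $\theta^*_{\min} := \min_{j \in S} |\theta^*_j|$ is lower bounded by

$$\frac{1}{2}\theta^*_{\min} \ge \big\|(Q_{UU})^{-1} \nabla\mathcal{L}(\theta^*)_{U}\big\|_\infty + \lambda\, |\!|\!| (Q_{UU})^{-1} |\!|\!|_\infty \tag{9}$$

where $|\!|\!| \cdot |\!|\!|_\infty$ denotes the maximum absolute row sum of the matrix.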

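Then, we have

1. for every pair $j_1 \in S$, $j_2 \in S^c$, we have $|\tilde\theta_{j_1}| > |\tilde\theta_{j_2}|$,

2. if $h < k$, all $j \in S^c$ are successfully estimated as zero and we have

$$\|\tilde\theta - \theta^*\|_\infty \le \big\|(Q_{SS})^{-1} \nabla\mathcal{L}(\theta^*)_{S}\big\|_\infty + \lambda\, |\!|\!| (Q_{SS})^{-1} |\!|\!|_\infty, \tag{10}$$

3. if $h \ge k$, at least the smallest (in absolute value) $p - h$ entries in $S^c$ are exactly zero, and we have a simpler (possibly tighter) bound

$$\|\tilde\theta - \theta^*\|_\infty \le \big\|(Q_{\hat U \hat U})^{-1} \nabla\mathcal{L}(\theta^*)_{\hat U}\big\|_\infty \tag{11}$$

where $\hat U$ is defined as the $h$ largest absolute entries of $\tilde\theta$ including $S$.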

We will derive the actual bounds on terms involving , and in the corollaries for actual problems (for instance will be upper bounded by and we can choose accordingly). Though conditions (8) and (9) seem more stringent than in the $h = 0$ case (which is the Lasso), we will see in the corollaries that they are uniformly upper bounded for all selections of $T$, with asymptotically the same probability as in the $h = 0$ case.

Note also that if $h$ is set as $0$, the results recover those of the regular $\ell_1$ norm. Furthermore, by the first statement in the theorem, if $h < k$, the set of unpenalized entries only contains relevant feature indices and some relevant features are not penalized. If $h \ge k$, it includes all relevant indices (and some non-relevant indices). In this case, the second term in (10) disappears, but the remaining term, $\|(Q_{\hat U \hat U})^{-1} \nabla\mathcal{L}(\theta^*)_{\hat U}\|_\infty$ in (11), increases as $h$ gets larger. Moreover, the condition that will be violated as $h$ approaches . While we do not know the true sparsity a priori in many problems, we implicitly assume that we can set $h$ appropriately (e.g., by cross-validation).

Now we turn to the $\ell_2$ estimation bound under the same conditions:

Corollary. Consider the problem with a trimmed regularizer (2) where all conditions in Theorem 3 hold. Then, for any local minimum $\tilde\theta$ of (2), the parameter estimation error in terms of the $\ell_2$ norm is upper bounded as: for some constant $C$,

$$\|\tilde\theta - \theta^*\|_2 \le C \sqrt{\frac{\max\{k, h\} \log p}{n}}. \tag{12}$$

The bound in (12) follows directly: for any local optimum, the size of its support is guaranteed to be smaller than or equal to $\max\{k, h\}$ by Theorem 3, hence we can apply known results (e.g., Negahban et al. (2012)) to obtain the $\ell_2$ bound for the restricted program (6). From Theorem 3 and Corollary 3, we can observe that the estimation bounds are asymptotically the same as those for amenable regularized problems such as SCAD or MCP: of order $\sqrt{\log p / n}$ in $\ell_\infty$ norm and $\sqrt{k \log p / n}$ in $\ell_2$ norm. However, the constant for those regularizers might be large since it involves term (instead of for the trimmed ). In addition, those non-convex regularizers require the additional constraint $\|\theta\|_1 \le R$ in their optimization problems for the theoretical guarantees, introducing additional assumptions on $\theta^*$ and the tuning parameter $R$.
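The scaling in (12) can be sanity-checked against the $\ell_\infty$ bounds with an elementary norm inequality. The sketch below is a heuristic consistency check and not the authors' proof: it assumes the error vector $\Delta := \tilde\theta - \theta^*$ has at most $m := \max\{k, h\}$ non-zero coordinates, as the theorem's statements suggest, and that the $\ell_\infty$ error is of order $\sqrt{\log p / n}$ with a generic constant $c$.

```latex
\|\Delta\|_2
 = \Big( \sum_{j \in \operatorname{supp}(\Delta)} \Delta_j^2 \Big)^{1/2}
 \le \sqrt{m}\, \|\Delta\|_\infty
 \le \sqrt{m} \cdot c\,\sqrt{\frac{\log p}{n}}
 = c\,\sqrt{\frac{\max\{k, h\}\, \log p}{n}} .
```

The rate matches (12), with $k$ in place of $\max\{k, h\}$ whenever $h \le k$.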

Now we apply our main theorem to popular high dimensional problems introduced in Section 2: sparse linear regression and sparse graphical model, but due to the space constraint, the results for sparse graphical models are provided in the supplementary materials.

#### Sparse least squares.

Motivated by the information theoretic bound for arbitrary methods, all previous analyses of sparse linear regression assume $n \ge c_0 k \log p$ for a sufficiently large constant $c_0$. We also assume , provided .

Corollary. Consider the model (3) where $\epsilon$ is sub-Gaussian. Suppose that we solve the program (4) with the selection of for some constant and satisfying

1. the sample covariance matrix $\hat\Gamma = \frac{1}{n} X^\top X$ satisfies the condition: for any selection of $T \subseteq \{1, \dots, p\}$ s.t. $|T| = h$,

$$|\!|\!| (\hat\Gamma_{UU})^{-1} |\!|\!|_\infty \le c_1, \qquad \max\big\{\lambda_{\max}(\hat\Gamma_{U^c U^c}),\; \lambda_{\max}\big((\hat\Gamma_{UU})^{-1}\big)\big\} \le c_6, \qquad |\!|\!| \hat\Gamma_{U^c U} (\hat\Gamma_{UU})^{-1} |\!|\!|_\infty \le \eta, \tag{13}$$

where $\lambda_{\max}$ is the maximum singular value of a matrix.

Further suppose that $\frac{1}{2}\theta^*_{\min}$ is lower bounded by $c_3 \sqrt{\log p / n} + \lambda c_1$ for some constant $c_3$. Then with high probability at least , any local minimum $\tilde\theta$ of (4) satisfies

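1. for every pair $j_1 \in S$, $j_2 \in S^c$, we have $|\tilde\theta_{j_1}| > |\tilde\theta_{j_2}|$,

2. if $h < k$, all $j \in S^c$ are successfully estimated as zero and we have

$$\|\tilde\theta - \theta^*\|_\infty \le c_3 \sqrt{\frac{\log p}{n}} + \lambda c_1, \tag{14}$$

3. if $h \ge k$, at least the smallest $p - h$ entries in $S^c$ are exactly zero and we have

$$\|\tilde\theta - \theta^*\|_\infty \le c_3 \sqrt{\frac{\log p}{n}}. \tag{15}$$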

The conditions in Corollary 3 have also been studied in previous work. In particular, (13) is known as an incoherence condition for sparse least squares estimators Wainwright (2009b). All conditions may be shown to hold with high probability via standard concentration bounds for sub-Gaussian matrices. It is important to note that in the case of the Lasso, estimation will fail if the incoherence condition is violated Wainwright (2009b). Unlike the Lasso, we confirm by simulations in Section 5 that the trimmed problem (4) can succeed even when this condition is not met.
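To make the incoherence quantity in (13) tangible, the sketch below computes the maximum absolute row sum of the cross-covariance block times the inverse of the block on $U$, for a given covariance matrix and index set. The function name `incoherence` is hypothetical, and the equicorrelated covariance in the usage note is an assumed example rather than the design used in the paper's experiments.

```python
import numpy as np


def incoherence(Gamma, U):
    """Max absolute row sum of Gamma[Uc, U] @ inv(Gamma[U, U]).

    Gamma is a symmetric (sample or population) covariance matrix and
    U is an iterable of column indices; Uc is its complement.
    """
    p = Gamma.shape[0]
    U = np.asarray(sorted(U))
    Uc = np.setdiff1d(np.arange(p), U)
    if Uc.size == 0:
        return 0.0
    G_UU = Gamma[np.ix_(U, U)]
    G_UcU = Gamma[np.ix_(Uc, U)]
    # Solve instead of forming the inverse; valid because G_UU is symmetric.
    M = np.linalg.solve(G_UU, G_UcU.T).T
    return np.abs(M).sum(axis=1).max()
```

For example, with an equicorrelated matrix `Gamma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))` and `|U| = k`, the value is `k * rho / (1 + (k - 1) * rho)`, which stays below 1 for any `rho` in `[0, 1)`.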

## 4 Optimization

We develop and analyze a block coordinate descent algorithm for solving objective (2). By leaving the weights as an explicit block rather than projecting them out, we give the weights more freedom to vary before settling down. We can also analyze the algorithm using the structure of (2) instead of relying on the DC formulation for (2). The approach is detailed in Algorithm 1, and the convergence analysis is summarized in Theorem 4.

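Consider the general objective function

$$\min_{\theta, w} \; F(\theta, w) := f(\theta) + \lambda \sum_{i=1}^d w_i\, r_i(\theta) + \delta(w \mid \mathcal{S})$$

where $\delta(\cdot \mid \mathcal{S})$ is the convex indicator function of the set $\mathcal{S}$. Let $r(\theta) := [r_1(\theta), \dots, r_d(\theta)]^\top$.

Assumption. We assume that (a) $f$ is a smooth closed convex function with an $L$-Lipschitz continuous gradient; (b) the $r_i$ are convex; and (c) $\mathcal{S}$ is a closed convex set and $F$ is bounded below.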

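Theorem. If Assumption 4 (a-c) holds, the iterates generated by Algorithm 1 satisfy

$$\begin{aligned}
\frac{1}{\eta}(\theta^k - \theta^{k+1}) + \big(\nabla f(\theta^{k+1}) - \nabla f(\theta^k)\big) &\in \nabla f(\theta^{k+1}) + \lambda \sum_{i=1}^d w_i\, \partial r_i(\theta^{k+1}), \\
\frac{1}{\tau}(w^k - w^{k+1}) &\in r(\theta^{k+1}) + \partial\delta(w^{k+1} \mid \mathcal{S}).
\end{aligned}$$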

Moreover, define , if we choose the step size , we have,

$$\min_k G_k \le \frac{1}{K} \sum_{k=1}^K G_k \le \frac{1}{K}\big(F(\theta^1) - F^*\big),$$

which gives a sublinear rate of convergence with respect to the optimality condition.

Problem (2) satisfies Assumption 4, and so Algorithm 1 converges at a sublinear rate as measured using $G_k$.
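The optimality conditions above suggest a proximal gradient step in $\theta$ followed by a projected gradient step in $w$. The sketch below is one reading of such a scheme for the trimmed Lasso problem (4) and is not a transcription of the authors' Algorithm 1. The function names, the step size choices ($\eta = 1/L$ and the default `tau`), the zero/one initialization, and the bisection-based projection onto $\mathcal{S} = \{w \in [0,1]^p : \mathbf{1}^\top w \ge p - h\}$ are all assumptions made for illustration.

```python
import numpy as np


def project_weights(v, h, iters=100):
    """Euclidean projection of v onto {w in [0,1]^p : sum(w) >= p - h}."""
    target = v.size - h
    w = np.clip(v, 0.0, 1.0)
    if w.sum() >= target:
        return w                      # box projection is already feasible
    # Otherwise the sum constraint is active: find the smallest shift nu >= 0
    # with sum(clip(v + nu, 0, 1)) >= target, by bisection on nu.
    lo, hi = 0.0, 1.0 - v.min()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.clip(v + mid, 0.0, 1.0).sum() >= target:
            hi = mid
        else:
            lo = mid
    return np.clip(v + hi, 0.0, 1.0)


def objective(X, y, theta, w, lam):
    """F(theta, w) = (1/n) ||X theta - y||^2 + lam * sum_j w_j |theta_j|."""
    return np.sum((X @ theta - y) ** 2) / X.shape[0] + lam * np.dot(w, np.abs(theta))


def trimmed_lasso_bcd(X, y, lam, h, tau=1.0, iters=300):
    """Alternate a prox-gradient step in theta and a projected step in w."""
    n, p = X.shape
    eta = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1/L for the least squares loss
    theta, w = np.zeros(p), np.ones(p)
    history = []
    for _ in range(iters):
        z = theta - eta * (2.0 / n) * (X.T @ (X @ theta - y))
        # Weighted soft-thresholding: prox of eta * lam * sum_j w_j |theta_j|.
        theta = np.sign(z) * np.maximum(np.abs(z) - eta * lam * w, 0.0)
        # Projected step in w; lam is absorbed into tau here.
        w = project_weights(w - tau * np.abs(theta), h)
        history.append(objective(X, y, theta, w, lam))
    return theta, w, history
```

With these choices both half-steps are descent steps for $F$, so the recorded objective values are non-increasing; whether the iterates recover the true support depends on $\lambda$, $h$ and the data.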

To show the efficiency of Algorithm 1, we conduct a small numerical experiment comparing it with (Khamaru and Wainwright, 2018, Algorithm 2). The authors proposed multiple approaches for DC programs; the prox-type algorithm (Algorithm 2) did particularly well for subset selection (Khamaru and Wainwright, 2018, Figure 2).

We generate Lasso simulation data with variables of dimension , and samples. The number of nonzero elements in the true generating variable is 10. We take , and apply both Algorithm 1 and (Khamaru and Wainwright, 2018, Algorithm 2). Results are shown in Figure 1. The per-iteration progress of the methods is comparable, but Algorithm 1 continues at a linear rate to a lower value of the objective, while (Khamaru and Wainwright, 2018, Algorithm 2) cannot get past a certain local minimum. This comparison is very brief; we leave a detailed study comparing Algorithm 1 with DC-based algorithms to future work focusing on algorithms, along with further analysis of Algorithm 1 and its variants under the Kurdyka-Łojasiewicz assumption Attouch et al. (2013).

## 5 Experimental Results

#### Simulations for Lasso.

We run two experiments with least squares linear regression. For all simulations we consider the regularization parameter range for all comparison methods, and we fix the MCP and SCAD parameters to be 2.5 and 3.0, respectively (since the results are not sensitive to them). We use two classes of matrices and introduced in Loh and Wainwright (2017). For each simulation, we compare the probability of recovering the correct support and the estimation error as a function of iterations , and we check the consistency of stationary points.

In our first simulation, we generate i.i.d. observations from where , with = 0.7. Note that this specific choice of satisfies the incoherence condition. We give non-zero values at only random positions with distribution , and the response variables are generated by , where . Figure 2 summarizes our first simulation results. In the first row, we set respectively and increase the number of samples, . We observe that the probability of correct support recovery for the Trimmed Lasso is higher than for the standard Lasso with any number of samples in all cases. In the second row, the first graph shows that our estimator with the Trimmed $\ell_1$-penalty converges to the same stationary point with the correct support regardless of initialization, which agrees with our Corollary 3. Moreover, our Trimmed $\ell_1$-penalty is superior to the MCP and SCAD penalties in terms of the error plots.

In our second simulation, we replace the covariance matrix with , which does not satisfy the incoherence condition. ( is a matrix with $1$’s on the diagonal, ’s in the first positions of the row and column, and $0$’s everywhere else.) We use . Figure 3 illustrates the second simulation results. Since does not satisfy the incoherence condition, the vanilla Lasso fails all the time, so we focus on comparisons between the Trimmed $\ell_1$-penalty and the other non-convex penalties, MCP and SCAD. In the first row, we observe that the Trimmed Lasso slightly outperforms the MCP and SCAD penalties in terms of probability of successful support recovery in all cases. As in the first simulation, the second row shows that our estimator with the Trimmed $\ell_1$-penalty has the smallest errors among the non-convex penalties and recovers the correct support consistently.

Due to space constraints, experiments on sparse Gaussian Graphical Models and on real data are provided as supplementary materials.

#### Simulations for Gaussian Graphical Models.

We consider the “diamond” graph example described in Ravikumar et al. (2011) (Section 3.1.1) to assess the performance of Graphical Trimmed Lasso when the incoherence condition holds and when it is violated. Specifically, we consider a graph , with vertex set and with all edges except . We consider a family of true covariance matrices with diagonal entries for all ; off-diagonal elements for all edges ; ; and finally the entry corresponding to the non-edge is set as . We analyze the performance of Graphical Trimmed Lasso under two settings: As discussed in Ravikumar et al. (2011), if the incoherence condition is satisfied ; if it is violated. Under both settings, we report the probability of successful support recovery based on 100 replicate experiments for and and compare it with Graphical Lasso, Graphical SCAD and Graphical MCP (the MCP and SCAD parameters were set to 2.5 and 3.0, as varying these did not affect the results significantly). For each method and replicate experiment, success is declared if the true support is recovered for at least one value of along the solution path. We can see that for a wide range of values for the trimming parameter, Graphical Trimmed Lasso outperforms the SCAD and MCP alternatives regardless of whether the incoherence condition holds or not. In addition, its probability of success is always superior to that of vanilla Graphical Lasso, which fails to recover the true support when the incoherence condition is violated.
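The success criterion described above (the true support is recovered for at least one regularization value along the solution path) can be expressed as a small helper. This is a sketch with hypothetical function names and an assumed numerical tolerance `tol` for deciding which entries count as non-zero; for a matrix-valued estimate, indices refer to the flattened array.

```python
import numpy as np


def support(estimate, tol=1e-8):
    """Indices (into the flattened array) of entries with magnitude above tol."""
    flat = np.abs(np.ravel(estimate))
    return {int(i) for i in np.flatnonzero(flat > tol)}


def success_on_path(path, true_support, tol=1e-8):
    """True if any estimate on the solution path has exactly the true support."""
    target = {int(i) for i in true_support}
    return any(support(est, tol) == target for est in path)


def success_probability(paths, true_support, tol=1e-8):
    """Fraction of replicate experiments (one path each) declared successful."""
    return float(np.mean([success_on_path(p, true_support, tol) for p in paths]))
```

Here `paths` would hold one list of estimates per replicate, with one estimate per value of the regularization parameter.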

## 6 Concluding Remarks

In this paper we have studied high-dimensional M-estimators with the trimmed $\ell_1$ penalty. By leaving the $h$ largest parameter entries penalty-free, these estimators alleviate the bias incurred by the vanilla $\ell_1$ penalty. Our theoretical results in terms of support recovery and error bounds hold for any local optimum and are competitive with other non-convex approaches. In addition, they indicate a perhaps surprising robustness of the procedure with respect to the trimming parameter $h$. These findings were corroborated by extensive simulation experiments. As future work we plan to generalize our study to the trimming of other decomposable regularizers such as mixed norms, and to unsupervised approaches such as convex clustering with fusion penalties Radchenko and Mukherjee (2017).

We have also developed a provably convergent customized algorithm for the trimmed problem. The algorithm and analysis technique are based on problem structure rather than a simple DC structure, and appear to give promising numerical results. We expect that the approach will be useful for more general regularizers; a thorough comparison to DC-based approaches is left to future work.

## References

• Attouch et al. (2013) Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
• Bannerjee et al. (2008) O. Bannerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485–516, March 2008.
• Bertsimas et al. (2017) Dimitris Bertsimas, Martin S Copenhaver, and Rahul Mazumder. The trimmed lasso: Sparsity and robustness. arXiv preprint arXiv:1708.04527, 2017.
• Breheny and Huang (2011) Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.
• Candes and Tao (2007) E. J. Candes and T. Tao. The dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
• Cross and Jain (1983) G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25–39, 1983.
• Fan and Li (2001) J. Fan and R. Li. Variable selection via non-concave penalized likelihood and its oracle properties. Jour. Amer. Stat. Ass., 96(456):1348–1360, December 2001.
• Friedman et al. (2007) J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
• Gotoh et al. (2017) Jun-ya Gotoh, Akiko Takeda, and Katsuya Tono. Dc formulations and algorithms for sparse optimization problems. Mathematical Programming, pages 1–36, 2017.
• Hassner and Sklansky (1978) M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.
• Ising (1925) E. Ising. Beitrag zur theorie der ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
• Jacob et al. (2009) L. Jacob, G. Obozinski, and J. P. Vert. Group Lasso with Overlap and Graph Lasso. In International Conference on Machine Learning (ICML), pages 433–440, 2009.
• Khamaru and Wainwright (2018) Koulik Khamaru and Martin J Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. arXiv preprint arXiv:1804.09629, 2018.
• Lauritzen (1996) S.L. Lauritzen. Graphical models. Oxford University Press, USA, 1996.
• Loh and Wainwright (2015) P. Loh and M. J. Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research (JMLR), 16:559–616, 2015.
• Loh and Wainwright (2017) P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. Annals of Statistics, 45(6):2455–2482, 2017.
• Manning and Schutze (1999) C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
• Meinshausen and Bühlmann (2006) N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
• Meinshausen and Yu (2009) N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.
• Negahban et al. (2012) S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
• Oh and Deasy (2014) Jung Hun Oh and Joseph O. Deasy. Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm. BMC Bioinformatics, 15(S-7):S5, 2014.
• Radchenko and Mukherjee (2017) Peter Radchenko and Gourab Mukherjee. Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527–1546, 2017.
• Raskutti et al. (2010) G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research (JMLR), 99:2241–2259, 2010.
• Ravikumar et al. (2011) P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
• Ripley (1981) B. D. Ripley. Spatial statistics. Wiley, New York, 1981.
• Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
• Tropp et al. (2006) J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, 86:572–602, April 2006. Special issue on ”Sparse approximations in signal and image processing”.
• van de Geer and Buhlmann (2009) S. van de Geer and P. Buhlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
• Wainwright (2009a) M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, May 2009a.
• Wainwright (2009b) M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Info. Theory, 55:2183–2202, 2009b.
• Wainwright (2009c) M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, May 2009c.
• Woods (1978) J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, October 1978.
• Yang et al. (2015) E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research (JMLR), 16:3813–3847, 2015.
• Yang et al. (2016) Eunho Yang, Aurelie Lozano, and Aleksandr Aravkin. High-dimensional trimmed estimators: A general framework for robust structured estimation. arXiv preprint arXiv:1605.08299, 2016.
• Yuan and Lin (2006) M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68(1):49–67, 2006.
• Yuan and Lin (2007) M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
• Zhang and Zhang (2012) Cun-Hui Zhang and Tong Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, pages 576–593, 2012.
• Zhang et al. (2010) Cun-Hui Zhang et al. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
• Zhao and Yu (2006) P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.
• Zhao et al. (2009) P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

## Appendix A Sparse graphical models

We also derive the corollary for the trimmed Graphical lasso (5). Following the proof strategy derived in Loh and Wainwright (2017), we assume throughout that the sample size scales with the row sparsity of the true parameter, the inverse covariance , which is a milder condition than in other works ( scaling with , the number of nonzero entries of ).

Consider the program (5) where the ’s are drawn from a sub-Gaussian distribution, and the sample size is greater than . Suppose further that we choose and satisfying

1. for any selection of ,

   $$
   |\!|\!|(\TTh \otimes \TTh)_{\USupNonreg\USupNonreg}|\!|\!|_\infty \le \cLSone, \quad
   \max\Big\{ |\!|\!|\GG_{\USupNonregC\USupNonregC}|\!|\!|_\infty,\ |\!|\!|(\GG_{\USupNonreg\USupNonreg})^{-1}|\!|\!|_\infty \Big\} \le \cLSsix, \quad \text{and} \quad
   |\!|\!|(\TTh^{-1} \otimes \TTh^{-1})_{\USupNonregC\USupNonreg}\big((\TTh^{-1} \otimes \TTh^{-1})_{\USupNonreg\USupNonreg}\big)^{-1}|\!|\!|_\infty \le \eta. \tag{16}
   $$

Further suppose that is lower bounded by for some constant . Then, with probability at least , any local minimum of (4) has the following properties:

1. for every pair , we have ,

2. if , all are successfully estimated as zero and we have

   $$
   \|\LTh - \TTh\|_\infty \le \cLSthree\sqrt{\frac{\log p}{n}} + 2\lambda\cLSone \tag{17}
   $$

3. if , at least the smallest entries in are exactly zero and we have

   $$
   \|\LTh - \TTh\|_\infty \le \cLSthree\sqrt{\frac{\log p}{n}}. \tag{18}
   $$

Note that the condition (1) is the incoherence condition studied in Ravikumar et al. (2011), and the results are consistent with the sparse linear model case above and comparable to their counterparts for or regularized Glasso (Loh and Wainwright, 2017).
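As a rough numerical illustration of the incoherence quantity in (16), the sketch below evaluates the induced ℓ∞ norm (maximum absolute row sum) of the cross block times the inverse of the diagonal block of the Kronecker product of the inverse of a precision matrix with itself. The function name `incoherence`, the example matrix, the chosen index set, and the row-major flattening of index pairs are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def incoherence(Theta, U_pairs):
    """Induced l_inf norm of Gamma[Uc, U] @ inv(Gamma[U, U]),
    where Gamma = kron(inv(Theta), inv(Theta)).
    Index pairs (j, k) are flattened row-major (an assumption)."""
    p = Theta.shape[0]
    Sigma = np.linalg.inv(Theta)
    Gamma = np.kron(Sigma, Sigma)
    U = np.array(sorted(j * p + k for j, k in U_pairs))
    Uc = np.setdiff1d(np.arange(p * p), U)
    M = Gamma[np.ix_(Uc, U)] @ np.linalg.inv(Gamma[np.ix_(U, U)])
    # For a 2-D array, ord=np.inf gives the maximum absolute row sum.
    return np.linalg.norm(M, ord=np.inf)

# Hypothetical example: tridiagonal precision matrix, U = its support.
Theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
U_pairs = [(j, k) for j in range(3) for k in range(3) if Theta[j, k] != 0]
eta_hat = incoherence(Theta, U_pairs)
print(eta_hat)
```

The incoherence condition asks this value to stay below a constant for every admissible index set; the sketch only evaluates it for one hand-picked set.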

## Appendix B Proofs

### B.1 Proof of Theorem 3

We extend the standard primal-dual witness (PDW) technique (Wainwright, 2009c; Yang et al., 2015; Loh and Wainwright, 2017) to the trimmed regularizers. For any fixed , we construct a primal and dual witness pair with strict dual feasibility. Specifically, given the fixed , consider the following program:

$$
\operatorname*{minimize}_{\theta \in \Omega} \ \L(\theta; \Data) + \lambda \sum_{j \in \NonregC} |\theta_j|. \tag{19}
$$

Note that the program (19) is convex (under 1), where the regularizer is only effective over entries in the (fixed) . We construct the primal and dual pair following (6) and (7). The following lemma guarantees, under strict dual feasibility, that any solution of (19) has the same sparsity structure on as . Moreover, since the restricted program (7) is strictly convex as shown in the lemma below, we can conclude that is the unique minimum point of the restricted program (19) given .
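To make the structure of (19) concrete, here is a minimal sketch of a proximal gradient solver for the special case of a least-squares loss, where the ℓ1 penalty acts only on a fixed set of coordinates. The least-squares loss with a 1/(2n) scaling, the function names, and the synthetic data are assumptions for illustration; this is not the paper's Algorithm 1, which also updates the weights.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding; t may hold per-entry thresholds."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve_fixed_set_lasso(X, y, lam, penalized, n_iter=500):
    """Proximal gradient for
    (1/(2n)) ||y - X theta||^2 + lam * sum_{j in penalized} |theta_j|."""
    n, p = X.shape
    # Step size 1/L, with L the largest eigenvalue of X^T X / n.
    step = n / np.linalg.norm(X, 2) ** 2
    # Unpenalized entries get threshold 0, i.e. plain gradient steps.
    thresh = step * lam * penalized.astype(float)
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, thresh)
    return theta

# Hypothetical example: the first two coordinates are left unpenalized.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:2] = [2.0, -1.5]
y = X @ theta_true + 0.1 * rng.standard_normal(n)
penalized = np.ones(p, dtype=bool)
penalized[:2] = False
theta_hat = solve_fixed_set_lasso(X, y, lam=0.5, penalized=penalized)
```

Because the penalized set is fixed, the objective is convex and the iteration is ordinary ISTA with coordinate-dependent thresholds; the non-convexity of the trimmed problem comes only from choosing which coordinates to leave unpenalized.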

Suppose that there exists a primal optimal solution for (19) with associated sub-gradient (or dual) such that . Then any optimal solution of (19) will satisfy for all .

The lemma follows directly from basic properties of convex optimization problems, as developed in existing works using PDW (Wainwright, 2009c; Yang et al., 2015). Note that even though the original problem with the trimmed regularizer is not convex, (19) given is convex. Therefore, by complementary slackness, we have . Hence any optimal solution of (19) will satisfy for all , since the associated (absolute) sub-gradient is strictly smaller than 1 by the assumption in the statement.

[Section A.2 of (Loh and Wainwright, 2017)] Under 2, the loss function is strictly convex on and hence is invertible if .

Now from the definition of , we have

$$
\Q(\Gth - \Tth) = \nabla\L(\Gth) - \nabla\L(\Tth) \tag{20}
$$

where is decomposed as . Then by the invertibility of in Lemma B.1 and the zero sub-gradient condition in (7) we have

$$
\Gth_{\USupNonreg} - \Tth_{\USupNonreg} = (\Q_{\USupNonreg\USupNonreg})^{-1}\big(-\nabla\L(\Tth)_{\USupNonreg} - \lambda\Gz_{\USupNonreg}\big). \tag{21}
$$

Since both and are zero vectors, we obtain

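$$
\begin{aligned}
\|\Gth - \Tth\|_\infty &= \big\|(\Q_{\USupNonreg\USupNonreg})^{-1}\big(-\nabla\L(\Tth)_{\USupNonreg} - \lambda\Gz_{\USupNonreg}\big)\big\|_\infty \\
&\le \big\|(\Q_{\USupNonreg\USupNonreg})^{-1}\nabla\L(\Tth)_{\USupNonreg}\big\|_\infty + \lambda\,|\!|\!|(\Qs)^{-1}|\!|\!|_\infty.
\end{aligned} \tag{22}
$$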

Therefore, under the assumption on in the statement, the selection of in which there exists some s.t. , , and , yields a solution that contradicts (2). The strict dual feasibility condition for this specific choice of (along with Lemma B.1) guarantees that there is no local minimum for that choice of . Hence, (22) guarantees that for every pair such that and , we have (since ). Note that this statement holds for any valid selection of . This immediately implies that any local minimum of (2) satisfies this property as well, as in the statement.

Finally, turning to the bound when , we have since all entries in are not penalized, as shown above. In this case, becomes a zero vector (since is empty in the construction of ), and the bound in (22) becomes tighter:

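$$
\begin{aligned}
\|\Gth - \Tth\|_\infty &= \big\|(\Q_{\USupNonreg\USupNonreg})^{-1}\big(-\nabla\L(\Tth)_{\USupNonreg} - \lambda\Gz_{\USupNonreg}\big)\big\|_\infty \\
&\le \big\|(\Q_{\USupNonreg\USupNonreg})^{-1}\nabla\L(\Tth)_{\USupNonreg}\big\|_\infty,
\end{aligned} \tag{23}
$$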

as claimed.
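The penalty-free case can be checked numerically for a least-squares loss. With no penalty acting on the retained index set, the restricted solution is ordinary least squares on those columns, and its error equals the inverse restricted Hessian applied to the negative gradient at the true parameter, which is the quantity bounded in (23). The sketch below assumes the loss (1/(2n))‖y − Xθ‖², for which the Hessian block is X_UᵀX_U/n and the gradient at the truth is −Xᵀε/n; the index set `U` and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 8
U = np.array([0, 1, 2])              # hypothetical retained index set
theta_star = np.zeros(p)
theta_star[U] = [1.5, -2.0, 0.7]     # true parameter, supported on U
X = rng.standard_normal((n, p))
eps = 0.1 * rng.standard_normal(n)
y = X @ theta_star + eps

X_U = X[:, U]
# Restricted, unpenalized solution: least squares on the columns in U.
theta_hat_U = np.linalg.lstsq(X_U, y, rcond=None)[0]

# Hessian block and gradient of the assumed loss at theta_star, restricted to U.
Q_UU = X_U.T @ X_U / n
grad_U = -(X_U.T @ eps) / n

lhs = theta_hat_U - theta_star[U]
rhs = np.linalg.solve(Q_UU, -grad_U)
sup_error = np.max(np.abs(lhs))
print(np.allclose(lhs, rhs), sup_error)
```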

### B.2 Proof of Corollary 3

The proof of our corollary is similar to that of Corollary 1 of Loh and Wainwright (2017), who derive the result for -amenable regularizers. Here we only describe the parts that need to be modified from Loh and Wainwright (2017).

In order to utilize the theorems in the main paper, we need to establish the RSC condition 2 and the strict dual feasibility (8). First, the RSC is known to hold with high probability, as shown in several previous works such as Lemma B.2.

[Corollary 1 of Loh and Wainwright (2015)] The RSC condition in 2 for linear models holds with high probability with and , under sub-Gaussian assumptions in the statement.

In order to show the remaining strict dual feasibility condition of our PDW construction, we consider (20) (by the zero-subgradient and the definition of ) in the block form:

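$$
\begin{bmatrix}
\Q_{\Nonreg\Nonreg} & \Q_{\Nonreg\DSupNonreg} & \Q_{\Nonreg\USupNonregC} \\
\Q_{\DSupNonreg\Nonreg} & \Q_{\DSupNonreg\DSupNonreg} & \Q_{\DSupNonreg\USupNonregC} \\
\Q_{\USupNonregC\Nonreg} & \Q_{\USupNonregC\DSupNonreg} & \Q_{\USupNonregC\USupNonregC}
\end{bmatrix}
\begin{bmatrix}
\Gth_{\Nonreg} - \Tth_{\Nonreg} \\
\Gth_{\DSupNonreg} - \Tth_{\DSupNonreg} \\
0
\end{bmatrix}
+
\begin{bmatrix}
\nabla\L(\Tth)_{\Nonreg} \\
\nabla\L(\Tth)_{\DSupNonreg} \\
\nabla\L(\Tth)_{\USupNonregC}
\end{bmatrix}
+ \lambda
\begin{bmatrix}
0 \\
\Gz_{\DSupNonreg} \\
\Gz_{\USupNonregC}
\end{bmatrix}
= 0. \tag{24}
$$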

By simple manipulation, we can obtain

$$
\Gz_{\USupNonregC} = \frac{1}{\lambda}\Big\{ -\nabla\L(\Tth)_{\USupNonregC} - \Q_{\USupNonregC\USupNonreg}(\Q_{\USupNonreg\USupNonreg})^{-1}\big(-\nabla\L(\Tth)_{\USupNonreg} - \lambda\Gz_{\USupNonreg}\big) \Big\}. \tag{25}
$$

Here note that our construction of PDW guarantees the bound in (22). In the case of (4), since we have and where , we need to show below that

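$$
\begin{aligned}
\Gz_{\USupNonregC} &\le \frac{1}{\lambda}\Big\{ -\GG_{\USupNonregC\USupNonreg}\Tth_{\USupNonreg} + \hat\gamma_{\USupNonregC} + \GG_{\USupNonregC\USupNonreg}\Tth_{\USupNonreg} - \GG_{\USupNonregC\USupNonreg}(\GG_{\USupNonreg\USupNonreg})^{-1}\hat\gamma_{\USupNonreg} \Big\} + |\!|\!|\GG_{\USupNonregC\USupNonreg}(\GG_{\USupNonreg\USupNonreg})^{-1}|\!|\!|_\infty \\
&\le \frac{1}{\lambda}\Big\{ \hat\gamma_{\USupNonregC} - \GG_{\USupNonregC\USupNonreg}(\GG_{\USupNonreg\USupNonreg})^{-1}\hat\gamma_{\USupNonreg} \Big\} + \eta
\end{aligned} \tag{26}
$$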

for the strict dual feasibility from (25). As derived in Loh and Wainwright (2017), we can write

$$
\Big\| \hat\gamma_{\USupNonregC} - \GG_{\USupNonregC\USupNonreg}(\GG_{\USupNonreg\USupNonreg})^{-1}\hat\gamma_{\USupNonreg} \Big\|_\infty = \bigg\| \frac{\X_{\USupNonregC}^\top \Pi\, \e}{n} \bigg\|_\infty \tag{27}
$$

where is an orthogonal projection matrix on : .
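The identity (27) can be sanity-checked numerically. The sketch below assumes the usual choices from the sparse linear regression literature: the sample covariance XᵀX/n, the cross term Xᵀy/n, a true parameter supported inside the retained index set, and the projection I − X_U(X_UᵀX_U)⁻¹X_Uᵀ onto the orthogonal complement of the span of the retained columns. These definitions, the index set `U`, and the data are assumptions for illustration, since the corresponding symbols are not recoverable from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 8
U = np.array([0, 1, 2])                  # hypothetical retained index set
Uc = np.setdiff1d(np.arange(p), U)
theta_star = np.zeros(p)
theta_star[U] = [1.0, -0.5, 2.0]         # true parameter, supported on U
X = rng.standard_normal((n, p))
eps = 0.2 * rng.standard_normal(n)
y = X @ theta_star + eps

Gamma = X.T @ X / n                      # assumed sample covariance
gamma = X.T @ y / n                      # assumed cross term

# Left-hand side of (27), before taking the sup-norm.
lhs = gamma[Uc] - Gamma[np.ix_(Uc, U)] @ np.linalg.solve(Gamma[np.ix_(U, U)], gamma[U])

# Assumed projection onto the orthogonal complement of span(X_U).
X_U = X[:, U]
Pi = np.eye(n) - X_U @ np.linalg.solve(X_U.T @ X_U, X_U.T)
rhs = X[:, Uc].T @ Pi @ eps / n

print(np.allclose(lhs, rhs))
```

The check works because the projection annihilates the columns in the retained set, so only the noise term survives; the same contraction property of the projection is what the norm bound in (28) uses.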

For any , we define such that . Then we have

$$
\|u_j\|_2^2 = \bigg\| \frac{\Pi\, \X_{\USupNonregC} e_j}{n} \bigg\|_2^2 \le \bigg\| \frac{\X_{\USupNonregC} e_j}{n} \bigg\|_2^2 \le \frac{\cLSsix}{n}. \tag{28}
$$

Hence by the sub-Gaussian tail bounds followed by a union bound, we can conclude that

$$
\Big\| \hat\gamma_{\USupNonregC} - \GG_{\USupNonregC\USupNonreg}(\GG_{\USupNonreg\USupNonreg})^{-1}\hat\gamma_{\USupNonreg} \Big\|_\infty \le C\sqrt{\frac{\log p}{n}} \tag{29}
$$

with probability at least for all selections of . We can thus establish strict dual feasibility for any selection of with high probability, provided , and now turn to bounds. From (10), we have

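$$
\Big\| (\GG_{\USupNonreg\USupNonreg})^{-1}\big(\GG_{\USupNonreg\USupNonreg}\Tth_{\USupNonreg} - \hat\gamma_{\USupNonreg}\big) \Big\|_\infty = \Bigg\| \bigg(\frac{\X_{\USupNonreg}^\top \X_{\USupNonreg}}{n}\bigg)^{-1} \bigg(\frac{\X_{\USupNonreg}^\top \e}{n}\bigg) \Bigg\|_\infty. \tag{30}
$$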

Then for , we define such that . Since for any selection of , is bounded as follows:

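$$
\|v_j\|_2^2 = \frac{1}{n^2} \Bigg\| \X_{\USupNonreg} \bigg(\frac{\X_{\USupNonreg}^\top \X_{\USupNonreg}}{n}\bigg)^{-1} e_j \Bigg\|_2^2 = \frac{1}{n} \Bigg| e_j^\top \bigg(\frac{\X_{\USupNonreg}^\top \X_{\USupNonreg}}{n}\bigg)^{-1} e_j \Bigg| \le \frac{\cLSsix}{n}. \tag{31}
$$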

Similarly by the sub-Gaussian tail bound and a union bound over , we can obtain

$$
\Big\| (\GG_{\USupNonreg\USupNonreg})^{-1}\big(\GG_{\USupNonreg\USupNonreg}\Tth_{\USupNonreg} - \hat\gamma_{\USupNonreg}\big) \Big\|_\infty \le C\sqrt{\frac{\log p}{n}} \tag{32}
$$

with probability at least .
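A small simulation can give a feel for the √(log p / n) scaling in (32). The sketch below draws Gaussian designs and noise, forms the sup-norm of the inverse restricted sample covariance applied to X_Uᵀε/n, and divides it by σ√(log p / n). Gaussian data, the choice of the first k columns as the retained set, and the function name `sup_norm_ratio` are assumptions for illustration; the corollary itself only requires sub-Gaussian tails.

```python
import numpy as np

def sup_norm_ratio(n, p, k=5, sigma=1.0, seed=0):
    """Ratio of ||(X_U^T X_U / n)^{-1} X_U^T eps / n||_inf
    to sigma * sqrt(log p / n), with U = first k columns (hypothetical)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    eps = sigma * rng.standard_normal(n)
    X_U = X[:, :k]
    err = np.linalg.solve(X_U.T @ X_U / n, X_U.T @ eps / n)
    return np.max(np.abs(err)) / (sigma * np.sqrt(np.log(p) / n))

# The ratio should stay of constant order as n grows.
ratios = [sup_norm_ratio(n, p=50) for n in (100, 400, 1600)]
print(ratios)
```

If the bound holds with a moderate constant, the printed ratios remain bounded while the raw error shrinks with n; a single seed is of course only suggestive, not a verification of the high-probability statement.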

### B.3 Proof of Corollary A

As in the proof of Corollary 3, the proof procedure is quite similar to that of Corollary 4 of Loh and Wainwright (2017). The derivation of upper bounds on in Loh and Wainwright (2017) extends seamlessly to upper bounds on for any selection of , mainly because the required upper bounds involve the entry-wise maximum on the true support, and the entry-wise maximum in this case is uniformly upper bounded for all entries.

Specifically, it computes the upper bound of from the fact that . This actually holds for any selection of . Similarly, it computes the upper bound of by Hölder’s inequality and the definition of matrix induced norms: , which clearly holds for any index beyond . Finally, is shown to be upper bounded by the fact that .

The remaining proof of this result directly follows similar lines to the proof of Corollary 4 in Loh and Wainwright (2017).

### B.4 Proof of Theorem 4

From Algorithm 1, we obtain the relation

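$$
\begin{aligned}
\frac{1}{\eta}(\theta^k - \theta^{k+1}) + \big(\nabla f(\theta^{k+1}) - \nabla f(\theta^k)\big) &\in \nabla f(\theta^{k+1}) + \lambda \sum_{i=1}^{d} w_i\, \partial r_i(\theta^{k+1}) \\
\frac{1}{\tau}(w^k - w^{k+1}) &\in r(\theta^{k+1}) + \partial\delta(w^{k+1} \mid \cS)
\end{aligned}
$$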