 # An Efficient Semi-smooth Newton Augmented Lagrangian Method for Elastic Net

Feature selection is an important and active research area in statistics and machine learning. The Elastic Net is often used to perform selection when the features present non-negligible collinearity or practitioners wish to incorporate additional known structure. In this article, we propose a new Semi-smooth Newton Augmented Lagrangian Method to efficiently solve the Elastic Net in ultra-high dimensional settings. Our new algorithm exploits both the sparsity induced by the Elastic Net penalty and the sparsity due to the second order information of the augmented Lagrangian. This greatly reduces the computational cost of the problem. Using simulations on both synthetic and real datasets, we demonstrate that our approach outperforms its best competitors by at least an order of magnitude in terms of CPU time. We also apply our approach to a Genome Wide Association Study on childhood obesity.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The advent of big data, with applications involving massive numbers of predictors, has made feature selection a central research area in statistics and machine learning. Many regularization approaches have been developed to solve this problem, such as Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), adaptive Lasso (Zou, 2006), constrained Lasso (Gaines et al., 2018), etc. While effective in many settings, Lasso has strong limitations in scenarios characterized by very high collinearities among features. To tackle this issue, Zou and Hastie (2005) introduced the Elastic Net, which penalizes both the and the squared

norm of the coefficients; the former induces sparsity, while the latter regularizes coefficient estimates mitigating variance inflation due to collinearity. The Elastic Net minimization is formulated as

 (1)

where and are the number of features and observations, respectively,

is the response vector,

is the standardized design matrix, and is the coefficient vector. Many algorithms exists to efficiently solve (1), such as accelerated proximal gradient (Li and Lin, 2015), FISTA (Beck and Teboulle, 2009), distributed ADMM (Boyd et al., 2011) and coordinate descent (Tseng and Yun, 2009; Friedman et al., 2010) – see Boyd and Vandenberghe (2004) for a more exhaustive list. We develop a new Semi-smooth Newton Augmented Lagrangian method for the Elastic Net (SsNAL-EN) in ultra-high-dimensional settings – where the number of features is much larger than the number of observations. Other versions of SsNAL have been recently introduced by Li et al. (2018) to solve regular Lasso, and then extended to constrained Lasso by Deng and So (2019).

SsNAL-EN exploits the sparsity induced by the second order information of the dual augmented Lagrangian to dramatically reduce computational costs. Moreover, the structure of the Elastic Net penalty guarantees a super-linear convergence of both the augmented Lagrangian algorithm and its inner sub-problem, just as in the Lasso case. Therefore, very few iterations are needed to solve both instances with high accuracy.

We implemented an efficient version of our method in python and benchmarked it against two different versions of the coordinate descent algorithm – one implemented in the python package sklearn and one implemented in the R package glmnet. glmnet is written in fortran and is considered the gold standard for fitting sparse generalized linear models. We benchmarked our method also against three advanced solvers which implement screening rules to improve computational performance: the R package biglasso (written in C), and the python packages Gap Safe Rules (Ndiaye et al., 2017) and celer (Massias et al., 2018). Simulation results demonstrate the comparative efficiency of our new approach.

We also applied SsNAL-EN to the Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) study (Paul et al., 2014; Craig et al., 2019), which examines risk factors for childhood obesity. We investigate the association between single nucleotide polymorphisms (SNPs) and two scalar outcomes – a conditional weight gain score (CWG) and Body Mass Index (BMI) – detecting SNPs that may affect obesity risk in children.

The remainder of the article is organized as follows. In Section 2, we introduce some preliminaries on Fenchel conjugate functions and proximal operators. In Section 3, we describe our new method. In Section 4, we present simulations on both synthetic data and real datasets, and we apply our method to the INSIGHT study. In Section 5, we provide final remarks and discuss future developments. The python code for SsNAL-EN and for our simulations is available at github.comtobiaboschissnalelastic. Data from the INSIGHT study is sensitive and privacy protected, and hence not provided.

## 2 Preliminaries Figure 1: The left panel shows penalty functions (solid lines) and their conjugate functions (dotted lines) for the Lasso (brown) and the Elastic Net (blue). The central panel shows proxσp (red line) and proxp∗/σ (green line) for the Lasso. The right panel shows proxσp (red line) and proxp∗/σ (green line) for the Elastic Net. In all panels the vertical dotted lines represent the interval [−λ1,λ1], and we consider λ1=λ2=σ=1 and x,z∈R.

SsNAL-EN relies on Fenchel conjugate functions (Fenchel, 1949) and proximal operators (Rockafellar, 1976a, b) – which are well know in the optimization field and used in a wide range of problems and applications. In this section, we briefly introduce these mathematical objects, providing definitions and basic properties.

### 2.1 Fenchel conjugate functions

Fenchel conjugate functions (or Legendre transformations) allow one to easily define the dual problem and the Lagrangian dual function (Boyd and Vandenberghe, 2004). Let be a convex set and . The conjugate function of is defined as , where . As an example, the conjugate function of the Lasso penalty is

 p∗(z)=1{∥z∥∞≤λ}={0∥z∥∞≤λ1∞o.w.. (2)

Our first result gives a closed form for the Elastic Net penalty conjugate function, which is pivotal to solve the augmented Lagrangian problem.

###### Proposition 1.

Let be the Elastic Net penalty, then

 p∗(z)=12λ2n∑i=1⎧⎪⎨⎪⎩(zi−λ1)2,zi≥λ10,|zi|<λ1(zi+λ1)2,zi≤−λ1. (3)

A proof of this proposition, which follows the same lines as that of Dünner et al. (2016), is provided in Supplement A. The left panel of Figure 1 depicts Lasso and Elastic Net penalties and their conjugate functions when . While for the Lasso is an indicator function, for the Elastic Net is a continuous differentiable function equal to in the interval .

### 2.2 Proximal operators

Let be a lower semi-continuous convex function. The proximal operator of at with parameter , denoted as , is defined as

 proxσf(x)=argmint(f(t)+(2σ)−1∥t−x∥22) (4)

Parikh et al. (2014) interpret this as an approximated gradient step for . Indeed, when is differentiable and is sufficiently small, we have . To implement SsNAL-EN, we need the proximal operator of the penalty function and of its conjugate . Given the first, the second is obtained through the Moreau decomposition: for .

As an example, the proximal operator of the Lasso penalty is the soft-tresholding operator (Parikh et al., 2014). In particular, for each component of , we have

 (5)

To obtain the proximal operator of the Elastic Net penalty , one composes and the soft-thresholding operator obtaining

 (6)

The central and right panels of Figure 1 depict and for the Lasso and Elastic Net penalties, respectively, when . Just as in the Lasso case, the Elastic Net proximal operator induces sparsity in the interval . Outside this interval, the Elastic Net operator still grows linearly, but with a slope smaller than the Lasso one, due to the presence of the scaling factor .

## 3 Methodology

In this section we describe our new method and its implementation. SsNAL-EN focuses on the augmented Lagrangian of the dual formulation of (1), exploiting the sparsity of its second order information to greatly reduce computational cost.

Consider the continuous differentiable function and the closed proper function . The Elastic Net minimization (1) can be expressed as

 minx(h(Ax)+p(x)). (P)

From Boyd and Vandenberghe (2004), a possible dual formulation of (P) is

 minx−(h∗(y)+p∗(z)) | ATy+z=0, (D)

where and are the dual variables, and and are the conjugate functions of and . In particular, (Dünner et al., 2016) and is given in Proposition 1. The Augmented Lagrangian function (Fenchel, 1949) associated with (D) is

 (7)

where is the Lagrange multiplier and penalizes the dual constraint’s violation. To study the optimality of the primal and dual problems and the convergence of our method, we also introduce the Karush-Kuhn-Tucker (KKT) conditions associated with (D), which are:

 ∇h∗(y)−Ax=0,∇p∗(z)−x=0,ATy+z=0, (8)

where . Finding the closed form of is not essential for our method. Boyd and Vandenberghe (2004) proved that solves the KKT (8) if and only if and are the optimal solutions of (D) and (P), respectively.

### 3.1 The augmented Lagrangian problem

As described in (Rockafellar, 1976a), one can find the optimal solution of (D) by solving the Augmented Lagrangian (AL) method described in Algorithm 1. The critical part here is solving the inner sub-problem (9). Based on Li et al. (2018), for a given , an approximate solution can be computed simultaneously as

 ¯y=argminyLσ(y | ¯z,x),¯z=argminzLσ(z | ¯y,x). (13)

This leads to our second result (a proof again is provided in Supplement B).

###### Proposition 2.

Define . Then, for the Elastic Net we have:

 (14)

and are given in (6). Note that is a continuous differentiable function. As we will see in more detail next, in order to solve (9) one has to minimize with respect to or, equivalently, find the solution of .

### 3.2 A Semi-smooth Newton method to solve (17)

To solve the augmented Lagrangian sub-problem (9), we propose the Semi-smooth Newton (SsN) method described in Algorithm 1, where denotes the generalized Hessian of at . SsN updates and iteratively; -updates follow the rule in Proposition 2 and -updates consist of minimizing through one Newton step. The main challenge is the computational cost of solving the linear system (11) – which we substantially reduce exploiting the sparse structure of . Since is continuous and differentiable, its gradient is

 ∇ψ(y)=∇h∗(y)−Aproxσp(x−σATy), (15)

where . We can define the operator

 ^∂2ψ(y):=∇2h∗(y)+σA∂proxσp(x−σATy)AT, (16)

where (the identity matrix) and is the Clarke subdifferential (Clarke, 1990). If we choose , then . Moreover, from Hiriart-Urruty et al. (1984), we have for every in the domain of . It follows that, as long as is properly chosen, solving (11) is equivalent to solving .

We now illustrate the pivotal role of in SsNAL-EN and how it can induce sparsity in the linear system (11). Let be the diagonal matrix with entries

 qii=11+σλ2{1∣∣(x−σATy)i∣∣>σλ10o.w.. (17)

It is easy to verify that . Let , and be the cardinality of . Given the structure of , we have , where and is the sub-matrix of restricted to the columns in . The system (11) thus becomes:

 (Im+κAJATJ)d=−∇ψ(y). (18)

Note is positive semidefinite because both and are. Using the Cholesky factorization the total cost of solving the linear system reduces from to . This includes computing () and the Cholesky factorization (). Because of the sparsity induced by the Elastic Net, is usually much smaller than – causing substantial computational gains. Even when is very large , one can still solve the linear system efficiently. Furthermore, if , which is often the case when the Elastic Net solution is sparse, one can factorize an (instead of ) matrix using the Sherman-Morrison-Woodbury formula:

 (19)

In this case the total cost of solving the linear system is further reduced from to . To achieve this efficiency, we leverage both the sparsity produced by the penalty and the sparsity inherent to the second order operator . Finally, if in the first iterations of the algorithm and are both larger than , we can further improve computing performance by solving (11) approximately with the conjugate gradient method.

A full convergence analysis for both the Augmented Lagrangian and the Semi-smooth Newton method is provided in Supplement C. Given the super-linear rate, both methods require just a few iterations to converge, as we will see in Section 4. In practice, to determine the convergence of AL and SnN, we check the residuals of the third and first KKT in (8), respectively, i.e:

 (20)

### 3.3 Parameter tuning

To guide the choice of , we consider three different quantitative criteria: k-fold Cross Validation () (Tibshirani, 1996), Generalized Cross Validation () (Jansen, 2015), and Extended Bayesian Information Criterion (-) (Chen and Chen, 2012) – which modifies the standard BIC to also include the number of features . While effective in a wide range of scenarios, can be very computationally expensive because it requires solving additional Elastic Net problems for each value of . In contrast, and - are computed directly from the original solution as:

 (21)

where is the residual sum of squares associated with the solution , and

are the Elastic Net degrees of freedom. Indicating with

the active set of , we have (Tibshirani et al., 2012). Before computing the criteria, we de-bias Elastic Net estimates by fitting standard least squares on the selected features. Albeit naive, this approach is effective when (Belloni et al., 2014; Zhao et al., 2017). To ensure efficiency of the parameter tuning component of our code, we implement some important refinements. We start from values of very close to , i.e. the smallest value of which gives a solution with 0 active components. These values are associated to very sparse solutions, which are fast to compute. When we move to the next value of , we use the solution at the previous value for initialization (warm-start): the new solution is very close to the previous one and can be computed very quickly – usually SsNAL-EN converges in just one iteration. Finally, we allow the user to fix the maximum number of active features: when this number is reached, no further or values are explored.

## 4 Simulation study and INSIGHT data Figure 2: Parameter tuning criteria for the INSIGHT data as cλ varies on the horizontal axis. Top and bottom row refer to the CWG and BMI regressions, respectively. For each, from left to right, we have: number of selected features, 10-fold cv, gcv, and e-bic. We consider three values of α: 0.9 (blue line), 0.8 (red line), 0.6 (green line).

In this section, we demonstrate the gains provided by SsNAL-EN on simulated data and on some commonly used reference data sets. Furthermore, we employ our new method to perform feature selection in a study of genetic variants associated to childhood obesity.

### 4.1 Simulation settings and results

We benchmark SsNAL-EN against two different versions of the coordinate descent algorithm, one implemented in the python package sklearn and one implemented in the R package glmnet which is written in fortran. We tested other algorithms, such as ADMM and the proximal gradient algorithm. The computational burden of these algorithms for the Elastic Net, just like for the Lasso, is more than two orders of magnitude larger than the one of our approach – see Li et al. (2018) (we are not reporting results here since they could not even complete most of the instances).

Our simulated data is generated as follows. The entries of the design matrix

are drawn from a standard normal distribution. We compute the response vector as

, where is a sparse vector with non-zeros values all equal to , and are error terms. We fix as to have a signal to noise ratio snr . SsNAL-EN is run with the tolerance fixed at 1e-6, and in (12) set to 0.2. We start from e-3 and increase it by a factor of 5 every iteration. If we start from smaller values of , the algorithm needs more iterations to converge, while if is too large, SsNAL-EN does not converge to the optimal solution. We consider three different scenarios characterized by the following values of

We set , , where , , and . Note that for glmnet and sklearn we need to divide by since in the objective function (1) they divide the square loss for the number of observations. sim1, sim2 and sim3 present an increasing level of sparsity. The sparser is the problem, the larger is the weight we attribute to relative to with the different choices of .

Table 1 reports the CPU time of SsNAL-EN, sklearn and glment. For each scenario, we consider different values of and we select the largest which gives a solution with active components. SsNAL-EN is the fastest algorithm in every instance. The gain with respect to other algorithms increases for larger ’s and sparser scenarios, where SsNAL-EN is between 20 and 30 times faster than glmnet and more than 60 times faster than sklearn.

We also test all the solvers on some widely studied reference data sets from the LIBSVM library (Chang and Lin, 2011). In particular, we consider housing8, bodyfat8, and traizines4. For each data set, we create a very large number of features including all terms in a polynomial basis expansion (Huang et al., 2010)

. The number reported after the name indicates the order of the expansion. The terms in the polynomial expansions are highly collinear, making these examples suitable for the Elastic Net. To gauge the level of collinearity we compute the largest eigenvalue of

and we normalize it by the number of features . We indicate this number as . Note how is close to 1 for all the simulated data sets (Table 1) and much larger for the data sets created with the polynomial expansions (Table 2). Table 2 also reports the CPU time of SsNAL-EN, sklearn and glment for the latter. For each algorithm we select the two values of which give an active set of cardinality and , respectively. SsNAL-EN is again the fastest algorithm in every instance. For higher values of and , it is more than times faster than sklearn – the other python-coded algorithm. Furthermore, unlike sklearn, the performance of SsNAL-EN is not affected by the choice of . Finally, Tables 1 and 2 report the total number of iterations needed by SsNAL-EN to converge. In all cases convergence is reached in no more than iteration. Notably, does affect convergence; if we decrease its value, giving more weigh to the norm penalty, convergence is generally reached with just iterations. We investigated prediction performance. Results are not reported since the three methods solves the same objective function and converge to the same solution.

Additional results are described in the Supplementary Material. In Supplement D.1, we report computing time standard errors over 20 replications of the same scenario based on sim1. Results confirm the competitiveness of our method. SsNAL-EN

has the lowest mean computing time and comparable standard errors with respect to

sklearn and glmnet. In Supplement D.2, we investigate different values of , , and . The relative gain of SsNAL-EN with respect to sklearn is even larger for bigger values of and smaller values of . In Supplement D.3, we benchmark SsNAL-EN against three advanced solvers which implement screening rules: the R package biglasso (written in C), and the python packages Gap Safe Rules (GSR) and celer. SsNAL-EN is faster than all the competitors in very sparse scenarios ( selected features). In intermediate scenarios () biglasso and SsNAL-EN are comparable and slightly faster than the other algorithms. In non-sparse scenarios () biglasso and celer are about 2 times faster than all other solvers – as expected, in this case, SsNAL-EN cannot exploit sparsity and looses part of its efficiency. Finally, in Supplement D.4, we report computing time for a solution path where multiple values of are considered. We tested SsNAL-EN, textttsklearn, glmnet and biglasso, which have a solution path implementation for . Again, SsNAL-EN outperforms the other solvers in almost every instance, being at least 10 times faster than textttsklearn. The results reported in the Supplement provide further evidence in support of our method – also considering that some competitors, such as glmnet and biglasso, are highly optimized (in particular for a solution path search), and many use screening, which increases speed but may not find the global minimizer.

### 4.2 INSIGHT application

Here, we apply SsNAL-EN to data from the Intervention Nurses Start Infants Growing on Healthy Trajectories (INSIGHT) study (Paul et al., 2014). One goal of INSIGHT is to investigate genetic variants that affect the risk of childhood obesity. In particular, we look for relevant Single Nucleotide Polymorphisms (SNPs). SNPs have been recently related to obesity phenotypes by several Genome-Wide Association Studies (GWASs), e.g Locke et al. (2015). We analyze two different outcome measurements: Conditional Weight Gain (CWG), which describes the change in weight between birth and six months (Taveras et al., 2009), and Body Mass Index at age 3 (BMI). After preprocessing, the design matrix for CWG comprises SNPs and observations, and that for BMI SNPs and observations.

Figure 2 displays the parameter tuning criteria for both responses. All three criteria considered identify just one dominant SNP. gcv and e-bic present a second elbow corresponding to an active set of 13 SNPs for CWG, and 6 SNPs for BMI. SNPs and estimated coefficients are reported in Table 3. The table also contains the chromosome, the position and the gene associated with a SNP (when available) according to the U.S. National Library of Medicine website. In depth biological interpretations of these results are beyond the scope of the present manuscript. However, we can still highlight some facts. Both responses point towards very parsimonious models and the two active sets of 13 and 6 SNPs do not overlap – suggesting that, despite their sizeable correlation (0.545), CWG and BMI may be at least partially driven by different mechanisms. The two top SNPs for CWG () and BMI () are different and located on different chromosomes (4 and 11, respectively), and they do not appear to co-occur on INSIGHT individuals (their correlation is only 0.004). While we could not link to any gene, – the dominant SNP for BMI – is located in the well known NTM gene. According to the NHGRI-EBI GWAS Catalog, NTM has been identified in a wide range of GWASs connecting it to BMI, food addiction, and other obesity-related traits. e.g (Kichaev et al., 2019; Pulit et al., 2019). Finally, we note that also , which is the SNP with the second largest estimated coefficient for CWG, is located in a known gene; ARID5B. Again according to NHGRI-EBI GWAS Catalog, ARID5B is associated to various traits, including some which could be related to obesity (e.g., waste to hip ratio).

## 5 Conclusions and future work

We developed a new efficient Semi-smooth Newton method to solve the Elastic Net problem in high-dimensional settings. This method wisely exploits the sparsity induced by the penalty and the sparsity inherent to the augmented Lagrangian problem. We provided effective algorithms to solve both the augmented Lagrangian problem and the inner-subproblem, proving the super-linear convergence of both. Simulation results based on synthetic and real data show a very large gain in computational efficiency with respect to the best existing algorithms. We also used our method in a GWAS application, identifying genetic variants associated with childhood obesity.

In the near future, we plan to adapt SsNAL-EN to the function-on-scalar regression setting. This framework is particularly relevant for genetic studies where very large number of SNPs are regressed against outcomes suitable for Functional Data Analysis (Cremona et al., 2019). For instance, in addition to CWG and BMI, the INSIGHT data contains also longitudinal growth information on children (Craig et al., 2019). A functional version of SsNAL-EN will involve more complex penalties and build upon the results presented in this article.

We present a new, mathematically sound, practically interpretable and computationally very efficient approach to perform selection of relevant features in very high-dimensional problems. With the advent of big data, the proposed methodology may have a concrete and substantial impact in a wide range of scientific fields and applications – including (but by no means limited to) biomedical research. We have presented an application to a Genome-Wide Association Studies – such studies, investigating association between complex human diseases and genetic variants, are now ubiquitous and ever growing in size. Our method is perfectly suited to aid in the detection of genetic risk factors underlying important health pathologies. From a technical point of view, our work straddles statistics, optimization and computer science – integrating multiple STEM components and paving the way to the development of yet more sophisticated methodologies. Driven by scientific questions based on the use of complex disease phenotypes as outcomes in GWAS, we plan to extend our approach beyond the Elastic Net to tackle feature selection in function-on-scalar regression problems.

This work was partially fund by NSF DMS-1712826 and the Huck Institutes of Life Sciences at Penn State. The authors thank Kateryna Makova for sharing the INSIGHT data, Sarah Craig and Ana Maria Kenney for the fruitful discussions on the data, and Ludovica Delpopolo for her valuable help in the algorithm implementation.

## References

• Beck and Teboulle (2009) Beck, A. and M. Teboulle (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1), 183–202.
• Belloni et al. (2014) Belloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81(2), 608–650.
• Boyd et al. (2011) Boyd, S., N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3(1), 1–122.
• Boyd and Vandenberghe (2004) Boyd, S. and L. Vandenberghe (2004). Convex optimization. Cambridge university press.
• Chang and Lin (2011) Chang, C.-C. and C.-J. Lin (2011).

Libsvm: A library for support vector machines.

ACM transactions on intelligent systems and technology (TIST) 2(3), 1–27.
• Chen and Chen (2012) Chen, J. and Z. Chen (2012). Extended bic for small-n-large-p sparse glm. Statistica Sinica, 555–574.
• Clarke (1990) Clarke, F. H. (1990). Optimization and nonsmooth analysis, Volume 5. Siam.
• Correa et al. (1992) Correa, R., A. Jofre, and L. Thibault (1992). Characterization of lower semicontinuous convex functions. Proceedings of the American Mathematical Society, 67–72.
• Craig et al. (2019) Craig, S. J. C., A. M. Kenney, J. Lin, I. M. Paul, L. L. Birch, J. Savage, M. E. Marini, F. Chiaromonte, M. L. Reimherr, and K. D. Makova (2019). Polygenic risk score based on weight gain trajectories is a strong predictor of childhood obesity. bioRxiv, 606277.
• Cremona et al. (2019) Cremona, M. A., H. Xu, K. D. Makova, M. Reimherr, F. Chiaromonte, and P. Madrigal (2019). Functional data analysis for computational biology. Bioinformatics (Oxford, England) 35(17), 3211.
• Deng and So (2019) Deng, Z. and A. M.-C. So (2019). An efficient augmented lagrangian based method for constrained lasso. arXiv preprint arXiv:1903.05006.
• Dontchev and Rockafellar (2009) Dontchev, A. L. and R. T. Rockafellar (2009). Implicit functions and solution mappings. Springer Monographs in Mathematics. Springer 208.
• Dünner et al. (2016) Dünner, C., S. Forte, M. Takáč, and M. Jaggi (2016). Primal-dual rates and certificates. arXiv preprint arXiv:1602.05205.
• Fan and Li (2001) Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456), 1348–1360.
• Fenchel (1949) Fenchel, W. (1949). On conjugate convex functions. Canadian Journal of Mathematics 1(1), 73–77.
• Friedman et al. (2010) Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software 33(1), 1.
• Gaines et al. (2018) Gaines, B. R., J. Kim, and H. Zhou (2018). Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics 27(4), 861–871.
• Hiriart-Urruty et al. (1984) Hiriart-Urruty, J.-B., J.-J. Strodiot, and V. H. Nguyen (1984). Generalized hessian matrix and second-order optimality conditions for problems withc 1, 1 data. Applied mathematics and optimization 11(1), 43–56.
• Huang et al. (2010) Huang, L., J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik (2010). Predicting execution time of computer programs using sparse polynomial regression. In Advances in neural information processing systems, pp. 883–891.
• Jansen (2015) Jansen, M. (2015). Generalized cross validation in variable selection with and without shrinkage. Journal of statistical planning and inference 159, 90–104.
• Kichaev et al. (2019) Kichaev, G., G. Bhatia, P.-R. Loh, S. Gazal, K. Burch, M. K. Freund, A. Schoech, B. Pasaniuc, and A. L. Price (2019). Leveraging polygenic functional enrichment to improve gwas power. The American Journal of Human Genetics 104(1), 65–75.
• Li and Lin (2015) Li, H. and Z. Lin (2015). Accelerated proximal gradient methods for nonconvex programming. In Advances in neural information processing systems, pp. 379–387.
• Li et al. (2018) Li, X., D. Sun, and K.-C. Toh (2018). A highly efficient semismooth newton augmented lagrangian method for solving lasso problems. SIAM Journal on Optimization 28(1), 433–458.
• Locke et al. (2015) Locke, A. E., B. Kahali, S. I. Berndt, A. E. Justice, T. H. Pers, F. R. Day, C. Powell, S. Vedantam, M. L. Buchkovich, J. Yang, et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature 518(7538), 197.
• Luque (1984) Luque, F. J. (1984). Asymptotic convergence analysis of the proximal point algorithm. SIAM Journal on Control and Optimization 22(2), 277–293.
• Massias et al. (2018) Massias, M., A. Gramfort, and J. Salmon (2018). Celer: a fast solver for the lasso with dual extrapolation. arXiv preprint arXiv:1802.07481.
• Ndiaye et al. (2017) Ndiaye, E., O. Fercoq, A. Gramfort, and J. Salmon (2017). Gap safe screening rules for sparsity enforcing penalties. The Journal of Machine Learning Research 18(1), 4671–4703.
• Parikh et al. (2014) Parikh, N., S. Boyd, et al. (2014). Proximal algorithms. Foundations and Trends® in Optimization 1(3), 127–239.
• Paul et al. (2014) Paul, I. M., J. S. Williams, S. Anzman-Frasca, J. S. Beiler, K. D. Makova, M. E. Marini, L. B. Hess, S. E. Rzucidlo, N. Verdiglione, J. A. Mindell, et al. (2014). The intervention nurses start infants growing on healthy trajectories (insight) study. BMC pediatrics 14(1), 184.
• Pulit et al. (2019) Pulit, S. L., C. Stoneman, A. P. Morris, A. R. Wood, C. A. Glastonbury, J. Tyrrell, L. Yengo, T. Ferreira, E. Marouli, Y. Ji, et al. (2019). Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of european ancestry. Human molecular genetics 28(1), 166–174.
• Robinson (1981) Robinson, S. M. (1981). Some continuity properties of polyhedral multifunctions. In Mathematical Programming at Oberwolfach, pp. 206–214. Springer.
• Rockafellar (1976a) Rockafellar, R. T. (1976a). Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research 1(2), 97–116.
• Rockafellar (1976b) Rockafellar, R. T. (1976b). Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14(5), 877–898.
• Rockafellar and Wets (2009) Rockafellar, R. T. and R. J.-B. Wets (2009). Variational analysis, Volume 317. Springer Science & Business Media.
• Taveras et al. (2009) Taveras, E. M., S. L. Rifas-Shiman, M. B. Belfort, K. P. Kleinman, E. Oken, and M. W. Gillman (2009). Weight status in the first 6 months of life and obesity at 3 years of age. Pediatrics 123(4), 1177–1183.
• Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
• Tibshirani et al. (2012) Tibshirani, R. J., J. Taylor, et al. (2012). Degrees of freedom in lasso problems. The Annals of Statistics 40(2), 1198–1232.
• Touchette (2005) Touchette, H. (2005). Legendre-fenchel transforms in a nutshell. URL http://www. maths. qmul. ac. uk/~ ht/archive/lfth2. pdf.
• Tseng and Yun (2009) Tseng, P. and S. Yun (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 117(1-2), 387–423.
• Zhao et al. (2017) Zhao, S., A. Shojaie, and D. Witten (2017). In defense of the indefensible: A very naive approach to high-dimensional inference. arXiv preprint arXiv:1705.05543.
• Zhao et al. (2010) Zhao, X.-Y., D. Sun, and K.-C. Toh (2010). A newton-cg augmented lagrangian method for semidefinite programming. SIAM Journal on Optimization 20(4), 1737–1765.
• Zhou and So (2017) Zhou, Z. and A. M.-C. So (2017). A unified approach to error bounds for structured convex optimization problems. Mathematical Programming 165(2), 689–728.
• Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association 101(476), 1418–1429.
• Zou and Hastie (2005) Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology) 67(2), 301–320.

## Appendix A Proof of Proposition 1

Consider , with . To compute , we use the following result (Touchette, 2005):

 p∗(z)=zT¯x−p(¯x), (A.1)

where . To find one has to solve , which is equivalent to solve for given . In our case, we have:

 p∗(z)=zT¯x−λ1∥¯x∥1−λ22∥¯x∥22 (A.2)

If we compute , and we set it equal to 0, we obtain:

 ~xi=⎧⎨⎩(zi−λ1)/λ2,xi>0∈(zi−λ1[−1,1])/λ2,xi=0(zi+λ1)/λ2,xi<0, (A.3)

where at . To find , we need to take into account the fact that (Touchette, 2005) and transform consequently. The -th component of is given by:

 ¯xi=⎧⎨⎩(zi−λ1)/λ2,zi≥λ10,|zi|<λ1(zi+λ1)/λ2,zi≤−λ1. (A.4)

By (A.2), we can compute the -th component of as . We have:

 (A.5)

Now, we just need to show:

 (A.6)

But, we have:

 (A.7)

## Appendix B Proof of Proposition 2

#### Part 2

We start proving the second part of the proposition, i.e:

 ¯z=proxp∗/σ(x/σ−AT¯y). (B.1)

If we take the derivative with respect to of and we set it equal to , we get:

 xσ−AT¯y−z=∇p∗(z)σ. (B.2)

By the sub-gradient characterization of the proximal operators (Correa et al., 1992), we know:

 u=proxf(t)  if and only if  t−u∈∂f(u) (B.3)

Set , , and . The second part of (B.3) is true by (B.2) – in our case is differentiable so we have strictly equal to. The first term of (B.3), gives us (B.1).

#### Part 1

For the first part of the preposition, we have to prove:

 ψ(y)=h∗(y)+1+σλ22σ∥∥proxσp(x−σATy)∥∥22−12σ∥x∥22. (B.4)

Note, by Moreau decomposition, we have: . By definition, , i.e:

 ψ(y)=h∗(y)+p∗(¯z)−xT(ATy+¯z)+σ2∥∥ATy+¯z∥∥22==h∗(y)+p∗(¯z)−xT(ATy+xσ−ATy−1σproxσp(x−σATy))++σ2∥∥∥ATy+xσ−ATy−1σproxσp(x−σATy)∥∥∥22==h∗(y)+p∗(¯z)+12σ∥∥proxσp(x−σATy)∥∥22−12σ∥x∥22. (B.5)

We need to compute , i.e. . In the Lasso (Li and Lin, 2015) and the constrained Lasso (Deng and So, 2019), is an indicator function and is equal to 0. This is not our case. Let define . Composing the second equation in (6) and (3), we have:

 p∗(proxp∗/σ(t/σ))=12λ2n∑i=1⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩(tiλ2+λ11+σλ2−λ1)2,ti≥σλ10,|ti|<σλ1(tiλ2−λ11+σλ2+λ1)2,ti≤−σλ1==λ221(1+σλ2)2n∑i=1⎧⎪⎨⎪⎩(ti−σλ1)2,ti≥σλ10,|ti|<σλ1(ti+σλ1)2,ti≤−σλ1==by (???)=λ22n∑i=1(proxσp(ti))2 (B.6)

Therefore, we have . Plugging it in (B.5), we prove (B.4).

## Appendix C Convergence Analysis

### c.1 Inexact Augmented Lagrangian Method

To state the global convergence of Algorithm 1 and the super-linear convergence of the solution , we refer to theorem 3.2 and theorem 3.3 in Li et al. (2018) – which are in turned based on the fundamental results presented in Rockafellar (1976a, b), and Luque (1984). Here, we just need verify that the theorems assumptions hold. In particular, we met the assumptions on , since it is the same for Lasso and Elastic Net, and it is always possible to implement the stopping criteria for the local convergence analysis described in Li et al. (2018) – Section 3. The main challenge is to verify that the operators and satisfy the error bound condition, since they are different from the Lasso case.

Given the closed proper convex function in the objective (1), and the convex-concave lagrangian function in (7), we define the maximal monotone operators and as in Rockafellar (1976a):

 Tf(x)=∂f(x),Tl(y,z,x)={(y′,z′,x′)|(y′,z′,−x′)∈∂l(y,z,x)}. (C.1)

We have to show that and are metric subregular Dontchev and Rockafellar (2009), or equivalently that they satisfy the error bound condition Robinson (1981), also called growth condition Luque (1984).

In particular, we say that a multivalue mapping satisfies the error bound condition at with modulus if and there exists such that if with , then

 dist(x,F−1(y))≤κdist(y,F(x)). (C.2)

The regularity of comes from Zhou and So (2017): since is Lipschitz continuous and has a polyhedral epigraph, satisfies the error bound condition. Verifying the bound condition for in the Lasso problem is not straightforward. However, in the Elastic Net case, we can use some known results given the special form of in (3). First, note that is a piecewise linear-quadratic function. Thus, we can apply Proposition 12.30 in Rockafellar and Wets (2009) and state that the subgradient mapping is piecewise polyhedral. Finally, from Robinson (1981) we know that polyhedral multifunctions satisfy the error bound condition for any point . This proves the regularity of and, therefore, the super-linear convergence of the method.

### c.2 Semi-smooth Newton Method

Again, to state the super-linear convergence of the sequence produced by the Semi-smooth Newton Method in Algorithm 1, we can use theorem 3.6 in Li et al. (2018), which is based on the crucial results of Zhao et al. (2010). All the assumptions are easy to verify. and in (6) are semi-smooth functions. By proposition 3.3 in Zhao et al. (2010), defined in (11) is a descent direction. Finally, recall the definition of :

 V:=Im+σAQAT,

with being the diagonal matrix in (17). We have and positive semidefinite.

## Appendix D Simulation Study

We ran all simulations on a MacBookPro with 3.3 GHz DualCore Intel Core i7 processor and 16GB ram. We reran all python simulations using openblas and mkl as blas systems, with threads=1,2 and openmp, with threads=1,4. In all scenarios, times match those reported in the paper that are obtained considering openblas with 2 threads and openmp with 4 threads.