A Unified Dynamic Approach to Sparse Model Selection

10/08/2018 ∙ by Chendi Huang, et al. ∙ Peking University The Hong Kong University of Science and Technology 0

Sparse model selection is ubiquitous from linear regression to graphical models where regularization paths, as a family of estimators upon the regularization parameter varying, are computed when the regularization parameter is unknown or decided data-adaptively. Traditional computational methods rely on solving a set of optimization problems where the regularization parameters are fixed on a grid that might be inefficient. In this paper, we introduce a simple iterative regularization path, which follows the dynamics of a sparse Mirror Descent algorithm or a generalization of Linearized Bregman Iterations with nonlinear loss. Its performance is competitive to glmnet with a further bias reduction. A path consistency theory is presented that under the Restricted Strong Convexity (RSC) and the Irrepresentable Condition (IRR), the path will first evolve in a subspace with no false positives and reach an estimator that is sign-consistent or of minimax optimal ℓ_2 error rate. Early stopping regularization is required to prevent overfitting. Application examples are given in sparse logistic regression and Ising models for NIPS coauthorship.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In high dimensional statistics and machine learning, the data

is often assumed to be generated from a statistical model with a sparse parameter , and the purpose is to estimate typically via the following optimization approach,



is a loss function depending on data

and parameter , usually based on likelihood, and is a penalty function. For simplicity, we shall omit the dependence on for the loss when it is clear from the context.

Example 1 (Sparse linear regression model).

Let be a fixed design matrix, and ,

with ’s i.i.d. drawn from , and sparse. Let

be the loss function for data and parameter (intercept and linear parameter ), as well as the Lasso penalty . For model selection consistency, Zhao and Yu (2006); Wainwright (2009)

showed it under Restricted Strong Convexity (RSC) and Irrepresentable Condition (IRR); under a weaker restricted eigenvalue condition,

Bickel et al. (2009) established the -error at minimax optimal rates.

Example 2 (Sparse logistic regression model).

Let , and ,

with sparse. Ravikumar et al. (2010) considered creftype 1.1 with the loss function for data and parameter


as well as . They also showed its selection/estimation consistency.

Example 3 (Sparse Ising model).

are drawn from whose population satisfies


where ,111We assume and is symmetric. and is sparse. Ravikumar et al. (2010) studied sparse Ising model creftype 1.3 by the so-called neighborhood-based logistic regression, based on the discussion on sparse logistic models in their paper. Specifically, despite the difficulty to deal with the whole by using likelihood-based loss functions of Ising model, they noticed that

Thus each corresponds to a sparse logistic regression problem, i.e. Example 2, with replaced by . Thus they learned (by regularized logistic regression) for each , instead of dealing with directly. Xue et al. (2012) considered creftype 1.1 with the loss being the negative composite conditional log-likelihood


can be penalty, SCAD penalty or other positive penalty function defined on . Alternatively Sohl-Dickstein et al. (2011)

proposed an approach of Minimum Probability Flow (MPF) which in the case of Ising model uses the following loss


The minimizer of this function is a reasonable estimator of . However their work did not treat sparse models in high-dimensional setting. When facing sparse Ising model, one may consider creftype 1.1 with the loss being the expression in creftype 1.5 and , which is not seen in literature to the best of our knowledge.

Example 4 (Sparse Gaussian graphical model).

are drawn from a multivariate Gaussian distribution with covariance

, and the precision matrix is assumed to be sparse. Yuan and Lin (2007); Ravikumar et al. (2008) studied creftype 1.1, with the loss function being the negative scaled log-likelihood, and the penalty being the sum of the absolute values of the off-diagonal entries of the precision matrix.

In general, Negahban et al. (2009) provided a unified framework for analyzing the statistical consistency of the estimators derived by solving creftype 1.1 with a proper choice of . However in practice, since is unknown, one typically needs to compute the regularization path as regularization parameter varies on a grid, e.g. the lars (Efron et al., 2004) or the coordinate descent in glmnet. Such regularization path algorithms can be inefficient in solving many optimization problems.

In this paper, we look at the following three-line iterative algorithm which, despite its simplicity, leads to a novel unified scheme of regularization paths for all cases above, equationparentequation


where , can be arbitrary and is naturally set , step size and are parameters whose selection to be discussed later, and the shrinkage operator is defined element-wise as . Such an algorithm is easy for parallel implementation, with linear speed-ups demonstrated in experiment Section 3 below.

To see the regularization paths returned by the iteration, Figure 1 compared it against the glmnet. Such simple iterative regularization paths exhibit competitive or even better performance than the Lasso regularization paths by glmnet in reducing the bias and improving the accuracy (Section 3.2 for more details).

How does this simple iteration algorithm work?

There are two equivalent views on algorithm creftypecap 1.6. First of all, it can be regarded as a mirror descent algorithm (MDA) (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003; Nemirovski, 2012)

where is the bregman divergence associated with , i.e. defined by


Now set involving a Ridge () penalty on and an elastic net type ( and ) penalty on . Hence and where . With this, the optimization in MDA leads to creftypecap 1.6a and


which is equivalent to creftypecap 1.6b. There has been extensive studies on the convergence (), which are however not suitable for statistical estimate above as such convergent solutions lead to overfitting estimators.

Figure 1: Top: path comparison between by glmnet (left) and by GLBI (middle), for logistic models with true parameters (right). Bottom: path comparison of by glmnet (left), by GLBI + composite loss (middle left), and by GLBI + MPF loss (middle right), for Ising models with true parameters (right). In both cases, glmnet selects biased estimates while GLBI finds more accurate ones.

An alternative dynamic view may lead to a deeper understanding of the regularization path. In fact for Example 1, creftype 1.6 reduces to the Linearized Bregman Iteration (LBI) proposed by Yin et al. (2008) and analyzed by Osher et al. (2016) via its limit differential inclusions. It shows that equipped with the standard conditions as Lasso, an early stopping rule can find a point on the regularization path of creftype 1.6 with the same sign pattern as true parameter (sign-consistency) and gives the unbiased oracle estimate, hence better than Lasso or any convex regularized estimates which are always biased. This can be generalized to our setting where creftypecap 1.8 is a discretization of the following dynamics equationparentequation


It is a restricted gradient flow (differential inclusion) where has its sparse support controlled by . As , it gives a sequence of estimates by minimizing with the sign pattern of restricted on . Thus if an estimator has the same sign pattern as , it must returns the unbiased oracle estimator which is optimal. So it is natural to ask if there is a point on the path (or ) which meets the sparsity pattern of true parameter . This is the path consistency problem to be addressed in this paper. In Section 2, we shall present a theoretical framework as an answer, and Section 3 gives more applications, including Ising model learning for NIPS coauthorship.

Note that for Example 2, creftype 1.6 reduces to the linearized Bregman iterations for logistic regression proposed by Shi et al. (2013) without a study of statistical consistency. A variable splitting scheme in comparison to generalized Lasso is studied in Huang et al. (2016) which shows improved model selection consistency in some scenarios. Hence in this paper, we shall call the general form creftype 1.6 as Generalized Linear Bregman Iterations (GLBI), in addition to (sparse) Mirror Descent flows.

2 Path Consistency of GLBI

Let denotes the true parameter, with sparse . Define () as the index set corresponding to nonzero entries of , and be its complement. Let , and when drops. Let the oracle estimator be


which is an optimal estimate of . GLBI starts within the oracle subspace (), and we are going to prove that under an Irrepresentable Condition (IRR) the dynamics will evolve in the oracle subspace with high probability before the stopping time , approaching the oracle estimator exponentially fast due to the Restricted Strong Convexity (RSC). Thus if all the true parameters are large enough, then we can identify their sign pattern correctly; otherwise, such a stopping time still finds an estimator (possibly with false positives) at minimax optimal error rate. Furthermore, if the algorithm continues beyond the stopping time, it might escape the oracle subspace and eventually reach overfitted estimates. Such a picture is illustrated in Figure 2.

Figure 2: An illustration of global dynamics of the algorithm in this paper.

Hence, it is helpful to define the following oracle dynamics: equationparentequation


with . Let .

2.1 Basic Assumptions

Now we are ready to state the general assumptions that can be reduced to existing ones. We write

Assumption 1 (Restricted Strong Convexity (RSC)).

There exist , such that for any , and for any on the line segment between and , or on the line segment between and ,

Assumption 2 (Irrepresentable Condition (IRR)).

There exist and such that


Remark 1.

For sparse linear regression problem (Example 1) with no intercept ( drops), creftypecap 1 reduces to . The lower bound is exactly the RSC proposed in linear problems. Although the upper bound is not needed in linear problems, it arises in the analysis for logistic problem by Ravikumar et al. (2010) (see (A1) in Section 3.1 in their paper). Besides, is constant and creftypecap 2 reduces to

which is true with high probability, as long as the classical Irrepresentable Condition (Zhao and Yu, 2006) holds along with and is large, since by creftype C.6,

Remark 2.

For sparse logistic regression problem (Example 2), we have the following proposition stating that creftypecap 1 and 2 hold with high probability under some natural setting, along with condition creftype 2.3. See its proof in Appendix D. A slightly weaker condition compared to creftype 2.3b, and a same version of creftype 2.3c, can be found in Ravikumar et al. (2010), where are discrete.

Proposition 1.

In Example 2, we suppose ’s are i.i.d. drawn from some , where . Then there exist constants , such that creftypecap 1 and 2 hold with probability not less than , as long as is sufficiently large and equationparentequation


2.2 Path Consistency Theorem

Theorem 1 (Consistency of GLBI).

Under creftypecap 1 and 2, suppose , and such that


Define as in creftype C.3. We have the following properties.

No-false-positive: For all , the solution path of GLBI has no false-positive, i.e. .

Sign consistency: If and


then .

consistency: The error

The proof of Theorem 1 is collected in Appendix C, which largely follows the analysis of differential inclusion creftypecap 1.6, given in Appendix B, as its discretization.

Remark 3.

For sparse linear regression problem (Example 1), with high probability we have

Hence pick satisfying creftype 2.4, by Theorem 1 the sign consistency is guaranteed at if

The error bound reaches the minimax optimal rate:

Remark 4.

For sparse logistic regression problem (Example 2), with high probability we have

Hence the sign consistency is guaranteed at some if

(meeting Condition (19) in Ravikumar et al. (2010)). The error rate is minimax optimal.

3 Experiments

As for the setting of algorithm parameters: should be large, and then is automatically calculated based on (as long as , such that is positive in creftype C.3). In practice, a small can prevent the iterations from oscillations.

3.1 Efficiency of Parallel Computing

Osher et al. (2016) has elaborated that LBI can easily be implemented in parallel and distributed manners, and applied on very large-scale datasets. Likewise, GLBI can be parallelized in many usual applications. We now take the logistic model Example 2 as an example to explain the details. The iteration creftype 1.6 (generally taking ) can be written as equationparentequation


where such that


where ’s are submatrices stored in a distributed manner on a set of networked workstations. The sizes of ’s are flexible and can be chosen for good load balancing. Let each workstation hold data and , and variables and which are parts of and summands of , respectively. The iteration creftype 3.1 is carried out as

where the all-reduce summation step collects inputs from and then returns the sum to all the workstations. It is the sum of

-dimensional vectors. Therefore, the communication cost is independent of

no matter how the all-reduce step is implemented. It is important to note that the algorithm is not changed at all. particularly, increasing , does not increase the number of iterations. So the parallel implementation is truly scalable.

If denotes the time cost of a single GLBI run with workstations under the same dataset and the same algorithmic settings, it is expected that . Here we show this by an example. Construct a logistic model in Section 3.2 with , and three settings for : (I) , (II) , (III) . For each setting, we run our parallelized version of GLBI algorithm written in C++, with , where is the maximal such that , and the path is early stopped at the -th iteration. The recorded ’s are shown in Figure 3. The left panel shows (in seconds) while the right panel shows , for . We see truly , which is expected in our parallel and distributed treatment. When is large, our package can deal with very large scale problems.

Figure 3: Time cost illustration for logistic model with three settings (black for Setting (I), blue for Setting (II) and red for Setting (III)). In each setting, the left panel shows while the right panel shows , for .

3.2 Application: Logistic Model

We do independent experiments, in each of which we construct a logistic model (Example 2), and then compare GLBI with other methods. Specifically, suppose that has a support set without loss of generality.

are independent, each has a uniform distribution on

. Each row of is i.i.d. sampled from , where is a Toeplitz matrix satisfying . When and are determined, we generate as in Example 2.

After getting the sample , consider GLBI creftype 1.6 and optimization creftype 1.1, both with logistic loss creftype 1.2. For GLBI, set . For creftype 1.1, apply a grid search for differently penalized problems, for which we use glmnet – a popular package available in Matlab/R that can be applied on regularization for sparse logistic regression models.

For each algorithm, we use -fold () cross validation (CV) to pick up an estimator from the path calculate based on the smallest CV estimate of prediction error. Specifically, we split the data into roughly equal-sized parts. For a certain position on paths ( for GLBI, or for glmnet) and , we obtain a corresponding estimator based on the data with the

-th part removed, use the estimator to build a classifier, and get the mis-classification error on the

-th part. Averaging the value for , we obtain the CV estimate of prediction error, for the obtained estimator corresponding to a certain position on paths. Among all positions, we pick up the estimator producing the smallest CV estimate of prediction error. Besides, we calculate AUC (Area Under Curve), for evaluating the path performance of learning sparsity patterns without choosing a best estimator.

Results for are summarized in Table 1. We see that in terms of CV estimate of prediction error, GLBI is generally better than glmnet. Besides, GLBI is competitive with regularization method in variable selection, in terms of AUC. Similar observations for more settings are listed in Table 4, 5 and 6 in Appendix G. Apart from these tables, we can also see the outperformance of CV estimate of prediction error Figure 5 in Appendix G, while in that figure we can see that GLBI further reduces bias, as well as provides us a relatively good estimator with small prediction error if a proper early stopping is equipped.

AUC prediction error
GLBI glmnet GLBI glmnet
Table 1: Comparisons between GLBI and glmnet, for logistic models with . For each algorithm, we run independent experiments.

3.3 Application: Ising Model with 4-Nearest-Neighbor Grid

GLBI1 GLBI2 glmnet
2nd order MDC
GLBI1 GLBI2 glmnet
Table 2: Comparisons of GLBI1 (GLBI + composite), GLBI2 (GLBI + MPF), and glmnet, for Ising models with . For each algorithm, we run independent experiments.

We do independent experiments, in each of which we construct an ising model (Example 3), and then compare GLBI with other methods. Specifically, construct an 4-nearest neighbor grid (with aperiodic boundary conditions) to be graph , with node set and edge set . The distribution of a random vector is given by creftype 1.3 (), where ’s and ’s are i.i.d. and each has a uniform distribution on . Let represents samples drawn from the distribution of via Gibbs sampling.

After getting the sample , consider GLBI1 (GLBI with composite loss creftype 1.4), GLBI2 (GLBI with MPF loss creftype 1.5) and optimization creftype 1.1 with logistic loss (see Example 3 for neighborhood-based logistic regression applied on Ising models, or see Ravikumar et al. (2010)). For GLBI1 and GLBI2, set . For creftype 1.1, apply a grid search for differently penalized problems; we still use glmnet.

For each algorithm, we calculate the AUC (Area Under Curve), popular for evaluating the path performance of learning sparsity patterns. Besides, we apply -fold () cross validation (CV) to pick up an estimator from the path, with the largest CV estimate of 2nd order marginal distribution correlation (2nd order MDC) in the same way as the CV process done in Section 3.2, here the 2nd order MDC, defined in the next paragraph, is calculated based on two samples of the same size: the -th part original data, and the newly sampled Ising model data (based on learned parameters) with the same size of the -th part.

For any sample matrix , we construct , the 2nd marginal empirical distribution matrix of , defined as follows. , where


For any sample matrices with the same sample size, we call the correlation between and the 2nd order marginal distribution correlation (2nd order MDC). This value is expected to be large as well as close to if come from the same model.

Results for are summarized in Table 2. GLBI with composite/MPF loss are competitive with or better than glmnet. Similar observations are listed in Table 7 in Appendix G.

3.4 Application: Coauthorship Network in NIPS

Consider the information of papers and authors in Advances in Neural Information Processing Systems (NIPS) 1987–2016, collected from https://www.kaggle.com/benhamner/nips-papers. After preprocessing (e.g. author disambiguity), for simplicity, we restrict our analysis on the most productive authors (Table 3) in the largest connected component of a coauthorship network that two authors are linked if they coauthored at least 2 papers (Coauthorship (2)). The first panel of Figure 4 shows this coauthorship network with edge width in proportion to the number of coauthored papers. There are papers authored by at least one of these persons.

01 Michael Jordan 16 Inderjit Dhillon
02 Bernhard Schölkopf 17 Ruslan Salakhutdinov
03 Geoffrey Hinton 18 Tong Zhang
04 Yoshua Bengio 19 Thomas Griffiths
05 Zoubin Ghahramani 20 David Blei
06 Terrence Sejnowski 21 Rémi Munos
07 Peter Dayan 22 Joshua Tenenbaum
08 Alex Smola 23 Lawrence Carin
09 Andrew Ng 24 Eric Xing
10 Francis Bach 25 Richard Zemel
11 Michael Mozer 26 Martin Wainwright
12 Pradeep Ravikumar 27 Yoram Singer
13 Tommi Jaakkola 28 Han Liu
14 Klaus-Robert Müller 29 Satinder Singh
15 Yee Teh 30 Christopher Williams
Table 3: Most productive authors in the largest connected component of Coauthorship (2).

Let the -th entry of be if the -th person is involved in the authors of the -th paper, and otherwise. Now we fit the data by a sparse Ising model creftype 1.3 with parameter . Note that indicates that and are conditional independent on coauthorship, given all the other authors; implies that and coauthored more often than their averages, while says the opposite.

Figure 4: Top left: NIPS coauthorship network, with edge width in proportion to the number of coauthored papers. Top right: a learned graph picked from the path of GLBI1. Bottom left: from GLBI2. Bottom right: from glmnet. Green edges indicate positive conditional dependence of coauthorship – the probability of coauthoring a paper significantly increases the authors’ average behavior, while red edges indicating the negative coauthorship. Edge widths show the strength of such a relationship.

The right three panels in Figure 4 compares some sparse Ising models chosen from three regularization paths at a similar sparsity level (the percentage of learned edges over the complete graph, here about ): GLBI1 (GLBI with composite loss), GLBI2 (GLBI with MPF loss), and regularization (glmnet), respectively. For more learned graphs from these paths, see Figure 6 in Appendix G. In GLBI1 and GLBI2, set .

We see that all the learned graphs capture some important coauthorships, such as Pradeep Ravikumar (12) and Inderjit Dhillon (16) in a thick green edge in all the three learned graphs, indicating that they collaborated more often than separately for NIPS. Besides, the most productive author Michael Jordan (01) has coauthored with a lot of other people, but is somewhat unlikely to coauthor with several other productive scholars like Yoshua Bengio (04), Terrence Sejnowski (06), etc., indicating by the red edges between Jordan and those people. Further note the edge widths in the second and third graphs are significantly larger than those in the fourth graph, implying that at a similar sparsity level, GLBI tends to provide an estimator with larger absolute values of entries than that by glmnet. That is because under similar sparsity patterns, GLBI may give low-biased estimators.


This work is supported in part by National Basic Research Program of China (Nos. 2015CB85600, 2012CB825501), NNSF of China (Nos. 61370004, 11421110001), HKRGC grant 16303817, as well as grants from Tencent AI Lab, Si Family Foundation, Baidu BDI, and Microsoft Research-Asia.


  • Beck and Teboulle (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167–175.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. Ann. Statist., 37(4), 1705–1732.
  • Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.
  • Huang et al. (2016) Huang, C., Sun, X., Xiong, J., and Yao, Y. (2016). Split lbi: An iterative regularization path with structural sparsity. In Advances in Neural Information Processing Systems (NIPS) 29, pages 3369–3377.
  • Negahban et al. (2009) Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. K. (2009). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems (NIPS) 22, pages 1348–1356.
  • Nemirovski (2012) Nemirovski, A. (2012). Tutorial: Mirror descent algorithms for large-scale deterministic and stochastic convex optimization. Conference on Learning Theory (COLT).
  • Nemirovski and Yudin (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and Method Efficiency in Optimization. New York: Wiley. Nauka Publishers, Moscow (in Russian), 1978.
  • Osher et al. (2016) Osher, S., Ruan, F., Xiong, J., Yao, Y., and Yin, W. (2016). Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2), 436–469.
  • Ravikumar et al. (2008) Ravikumar, P., Raskutti, G., Wainwright, M., and Yu, B. (2008). Model selection in Gaussian graphical models: High-dimensional consistency of l1-regularized MLE. In Advances in Neural Information Processing Systems (NIPS), volume 21.
  • Ravikumar et al. (2010) Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319.
  • Shi et al. (2013) Shi, J. V., Yin, W., and Osher, S. J. (2013). A new regularization path for logistic regression via linearized bregman.
  • Sohl-Dickstein et al. (2011) Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. (2011). Minimum Probability Flow Learning. ICML ’11, pages 905–912, New York, NY, USA. ACM.
  • Wainwright (2009) Wainwright, M. J. (2009). Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using L1-Constrained Quadratic Programming (Lasso). Information Theory, IEEE Transactions on, 55(5), 2183–2202.
  • Xue et al. (2012) Xue, L., Zou, H., and Cai, T. (2012). Nonconcave penalized composite conditional likelihood estimation of sparse ising models. The Annals of Statistics, 40(3), 1403–1429.
  • Yin et al. (2008) Yin, W., Osher, S., Darbon, J., and Goldfarb, D. (2008). Bregman iterative algorithms for compressed sensing and related problems. SIAM Journal on Imaging Sciences, 1(1), 143–168.
  • Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007). Model Selection and Estimation in the Gaussian Graphical Model. Biometrika, 94, 19–35.
  • Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Machine Learning Research, 7, 2541–2567.

Appendix A GLBISS and GISS: Limit Dynamics of GLBI

Consider a differential inclusion called Generalized Linearized Bregman Inverse Scale Space (GLBISS), the limit dynamics of GLBI when the step size . This will help understanding GLBI, and the proof on sign consistency as well as consistency of GLBISS can be moved to the case of GLBI with slight modifications.

Specifically, noting by the following Moreau decomposition


GLBI has an equivalent form equationparentequation


where . Taking , and , creftype A.2 can be viewed as a forward Euler discretization of a differential inclusion called Generalized Linearized Bregman Inverse Scale Space (GLBISS) equationparentequation


where . Next taking , we reach the following Generalized Bregman Inverse Scale Space (GISS). equationparentequation


where . Following the same spirit of Osher et al. (2016), it is transparent to obtain the existence and uniqueness of the solution paths of GISS and GLBISS under mild conditions; and for GISS and GLBISS, is non-increasing for , while for GLBI, is non-increasing for if .

Appendix B Path Consistency of GLBISS

Now we aim to prove the path consistency of GLBISS, which will shed light on proving the path consistency of GLBI in Appendix C. Define the following Oracle Dynamics of GLBISS, which is viewed as a version of creftype A.3 with known: equationparentequation