is a loss function depending on dataand parameter , usually based on likelihood, and is a penalty function. For simplicity, we shall omit the dependence on for the loss when it is clear from the context.
Example 1 (Sparse linear regression model).
Let be a fixed design matrix, and ,
with ’s i.i.d. drawn from , and sparse. Let
showed it under Restricted Strong Convexity (RSC) and Irrepresentable Condition (IRR); under a weaker restricted eigenvalue condition,Bickel et al. (2009) established the -error at minimax optimal rates.
Example 2 (Sparse logistic regression model).
Example 3 (Sparse Ising model).
are drawn from whose population satisfies
where ,111We assume and is symmetric. and is sparse. Ravikumar et al. (2010) studied sparse Ising model creftype 1.3 by the so-called neighborhood-based logistic regression, based on the discussion on sparse logistic models in their paper. Specifically, despite the difficulty to deal with the whole by using likelihood-based loss functions of Ising model, they noticed that
Thus each corresponds to a sparse logistic regression problem, i.e. Example 2, with replaced by . Thus they learned (by regularized logistic regression) for each , instead of dealing with directly. Xue et al. (2012) considered creftype 1.1 with the loss being the negative composite conditional log-likelihood
can be penalty, SCAD penalty or other positive penalty function defined on . Alternatively Sohl-Dickstein et al. (2011)
proposed an approach of Minimum Probability Flow (MPF) which in the case of Ising model uses the following loss
The minimizer of this function is a reasonable estimator of . However their work did not treat sparse models in high-dimensional setting. When facing sparse Ising model, one may consider creftype 1.1 with the loss being the expression in creftype 1.5 and , which is not seen in literature to the best of our knowledge.
Example 4 (Sparse Gaussian graphical model).
are drawn from a multivariate Gaussian distribution with covariance, and the precision matrix is assumed to be sparse. Yuan and Lin (2007); Ravikumar et al. (2008) studied creftype 1.1, with the loss function being the negative scaled log-likelihood, and the penalty being the sum of the absolute values of the off-diagonal entries of the precision matrix.
In general, Negahban et al. (2009) provided a unified framework for analyzing the statistical consistency of the estimators derived by solving creftype 1.1 with a proper choice of . However in practice, since is unknown, one typically needs to compute the regularization path as regularization parameter varies on a grid, e.g. the lars (Efron et al., 2004) or the coordinate descent in glmnet. Such regularization path algorithms can be inefficient in solving many optimization problems.
In this paper, we look at the following three-line iterative algorithm which, despite its simplicity, leads to a novel unified scheme of regularization paths for all cases above, equationparentequation
where , can be arbitrary and is naturally set , step size and are parameters whose selection to be discussed later, and the shrinkage operator is defined element-wise as . Such an algorithm is easy for parallel implementation, with linear speed-ups demonstrated in experiment Section 3 below.
To see the regularization paths returned by the iteration, Figure 1 compared it against the glmnet. Such simple iterative regularization paths exhibit competitive or even better performance than the Lasso regularization paths by glmnet in reducing the bias and improving the accuracy (Section 3.2 for more details).
How does this simple iteration algorithm work?
There are two equivalent views on algorithm creftypecap 1.6. First of all, it can be regarded as a mirror descent algorithm (MDA) (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003; Nemirovski, 2012)
where is the bregman divergence associated with , i.e. defined by
Now set involving a Ridge () penalty on and an elastic net type ( and ) penalty on . Hence and where . With this, the optimization in MDA leads to creftypecap 1.6a and
which is equivalent to creftypecap 1.6b. There has been extensive studies on the convergence (), which are however not suitable for statistical estimate above as such convergent solutions lead to overfitting estimators.
An alternative dynamic view may lead to a deeper understanding of the regularization path. In fact for Example 1, creftype 1.6 reduces to the Linearized Bregman Iteration (LBI) proposed by Yin et al. (2008) and analyzed by Osher et al. (2016) via its limit differential inclusions. It shows that equipped with the standard conditions as Lasso, an early stopping rule can find a point on the regularization path of creftype 1.6 with the same sign pattern as true parameter (sign-consistency) and gives the unbiased oracle estimate, hence better than Lasso or any convex regularized estimates which are always biased. This can be generalized to our setting where creftypecap 1.8 is a discretization of the following dynamics equationparentequation
It is a restricted gradient flow (differential inclusion) where has its sparse support controlled by . As , it gives a sequence of estimates by minimizing with the sign pattern of restricted on . Thus if an estimator has the same sign pattern as , it must returns the unbiased oracle estimator which is optimal. So it is natural to ask if there is a point on the path (or ) which meets the sparsity pattern of true parameter . This is the path consistency problem to be addressed in this paper. In Section 2, we shall present a theoretical framework as an answer, and Section 3 gives more applications, including Ising model learning for NIPS coauthorship.
Note that for Example 2, creftype 1.6 reduces to the linearized Bregman iterations for logistic regression proposed by Shi et al. (2013) without a study of statistical consistency. A variable splitting scheme in comparison to generalized Lasso is studied in Huang et al. (2016) which shows improved model selection consistency in some scenarios. Hence in this paper, we shall call the general form creftype 1.6 as Generalized Linear Bregman Iterations (GLBI), in addition to (sparse) Mirror Descent flows.
2 Path Consistency of GLBI
Let denotes the true parameter, with sparse . Define () as the index set corresponding to nonzero entries of , and be its complement. Let , and when drops. Let the oracle estimator be
which is an optimal estimate of . GLBI starts within the oracle subspace (), and we are going to prove that under an Irrepresentable Condition (IRR) the dynamics will evolve in the oracle subspace with high probability before the stopping time , approaching the oracle estimator exponentially fast due to the Restricted Strong Convexity (RSC). Thus if all the true parameters are large enough, then we can identify their sign pattern correctly; otherwise, such a stopping time still finds an estimator (possibly with false positives) at minimax optimal error rate. Furthermore, if the algorithm continues beyond the stopping time, it might escape the oracle subspace and eventually reach overfitted estimates. Such a picture is illustrated in Figure 2.
Hence, it is helpful to define the following oracle dynamics: equationparentequation
with . Let .
2.1 Basic Assumptions
Now we are ready to state the general assumptions that can be reduced to existing ones. We write
Assumption 1 (Restricted Strong Convexity (RSC)).
There exist , such that for any , and for any on the line segment between and , or on the line segment between and ,
Assumption 2 (Irrepresentable Condition (IRR)).
There exist and such that
For sparse linear regression problem (Example 1) with no intercept ( drops), creftypecap 1 reduces to . The lower bound is exactly the RSC proposed in linear problems. Although the upper bound is not needed in linear problems, it arises in the analysis for logistic problem by Ravikumar et al. (2010) (see (A1) in Section 3.1 in their paper). Besides, is constant and creftypecap 2 reduces to
For sparse logistic regression problem (Example 2), we have the following proposition stating that creftypecap 1 and 2 hold with high probability under some natural setting, along with condition creftype 2.3. See its proof in Appendix D. A slightly weaker condition compared to creftype 2.3b, and a same version of creftype 2.3c, can be found in Ravikumar et al. (2010), where are discrete.
2.2 Path Consistency Theorem
Theorem 1 (Consistency of GLBI).
Define as in creftype C.3. We have the following properties.
No-false-positive: For all , the solution path of GLBI has no false-positive, i.e. .
Sign consistency: If and
consistency: The error
As for the setting of algorithm parameters: should be large, and then is automatically calculated based on (as long as , such that is positive in creftype C.3). In practice, a small can prevent the iterations from oscillations.
3.1 Efficiency of Parallel Computing
Osher et al. (2016) has elaborated that LBI can easily be implemented in parallel and distributed manners, and applied on very large-scale datasets. Likewise, GLBI can be parallelized in many usual applications. We now take the logistic model Example 2 as an example to explain the details. The iteration creftype 1.6 (generally taking ) can be written as equationparentequation
where such that
where ’s are submatrices stored in a distributed manner on a set of networked workstations. The sizes of ’s are flexible and can be chosen for good load balancing. Let each workstation hold data and , and variables and which are parts of and summands of , respectively. The iteration creftype 3.1 is carried out as
where the all-reduce summation step collects inputs from and then returns the sum to all the workstations. It is the sum of
-dimensional vectors. Therefore, the communication cost is independent ofno matter how the all-reduce step is implemented. It is important to note that the algorithm is not changed at all. particularly, increasing , does not increase the number of iterations. So the parallel implementation is truly scalable.
If denotes the time cost of a single GLBI run with workstations under the same dataset and the same algorithmic settings, it is expected that . Here we show this by an example. Construct a logistic model in Section 3.2 with , and three settings for : (I) , (II) , (III) . For each setting, we run our parallelized version of GLBI algorithm written in C++, with , where is the maximal such that , and the path is early stopped at the -th iteration. The recorded ’s are shown in Figure 3. The left panel shows (in seconds) while the right panel shows , for . We see truly , which is expected in our parallel and distributed treatment. When is large, our package can deal with very large scale problems.
3.2 Application: Logistic Model
We do independent experiments, in each of which we construct a logistic model (Example 2), and then compare GLBI with other methods. Specifically, suppose that has a support set without loss of generality.
are independent, each has a uniform distribution on. Each row of is i.i.d. sampled from , where is a Toeplitz matrix satisfying . When and are determined, we generate as in Example 2.
After getting the sample , consider GLBI creftype 1.6 and optimization creftype 1.1, both with logistic loss creftype 1.2. For GLBI, set . For creftype 1.1, apply a grid search for differently penalized problems, for which we use glmnet – a popular package available in Matlab/R that can be applied on regularization for sparse logistic regression models.
For each algorithm, we use -fold () cross validation (CV) to pick up an estimator from the path calculate based on the smallest CV estimate of prediction error. Specifically, we split the data into roughly equal-sized parts. For a certain position on paths ( for GLBI, or for glmnet) and , we obtain a corresponding estimator based on the data with the
-th part removed, use the estimator to build a classifier, and get the mis-classification error on the-th part. Averaging the value for , we obtain the CV estimate of prediction error, for the obtained estimator corresponding to a certain position on paths. Among all positions, we pick up the estimator producing the smallest CV estimate of prediction error. Besides, we calculate AUC (Area Under Curve), for evaluating the path performance of learning sparsity patterns without choosing a best estimator.
Results for are summarized in Table 1. We see that in terms of CV estimate of prediction error, GLBI is generally better than glmnet. Besides, GLBI is competitive with regularization method in variable selection, in terms of AUC. Similar observations for more settings are listed in Table 4, 5 and 6 in Appendix G. Apart from these tables, we can also see the outperformance of CV estimate of prediction error Figure 5 in Appendix G, while in that figure we can see that GLBI further reduces bias, as well as provides us a relatively good estimator with small prediction error if a proper early stopping is equipped.
3.3 Application: Ising Model with 4-Nearest-Neighbor Grid
|2nd order MDC|
We do independent experiments, in each of which we construct an ising model (Example 3), and then compare GLBI with other methods. Specifically, construct an 4-nearest neighbor grid (with aperiodic boundary conditions) to be graph , with node set and edge set . The distribution of a random vector is given by creftype 1.3 (), where ’s and ’s are i.i.d. and each has a uniform distribution on . Let represents samples drawn from the distribution of via Gibbs sampling.
After getting the sample , consider GLBI1 (GLBI with composite loss creftype 1.4), GLBI2 (GLBI with MPF loss creftype 1.5) and optimization creftype 1.1 with logistic loss (see Example 3 for neighborhood-based logistic regression applied on Ising models, or see Ravikumar et al. (2010)). For GLBI1 and GLBI2, set . For creftype 1.1, apply a grid search for differently penalized problems; we still use glmnet.
For each algorithm, we calculate the AUC (Area Under Curve), popular for evaluating the path performance of learning sparsity patterns. Besides, we apply -fold () cross validation (CV) to pick up an estimator from the path, with the largest CV estimate of 2nd order marginal distribution correlation (2nd order MDC) in the same way as the CV process done in Section 3.2, here the 2nd order MDC, defined in the next paragraph, is calculated based on two samples of the same size: the -th part original data, and the newly sampled Ising model data (based on learned parameters) with the same size of the -th part.
For any sample matrix , we construct , the 2nd marginal empirical distribution matrix of , defined as follows. , where
For any sample matrices with the same sample size, we call the correlation between and the 2nd order marginal distribution correlation (2nd order MDC). This value is expected to be large as well as close to if come from the same model.
3.4 Application: Coauthorship Network in NIPS
Consider the information of papers and authors in Advances in Neural Information Processing Systems (NIPS) 1987–2016, collected from https://www.kaggle.com/benhamner/nips-papers. After preprocessing (e.g. author disambiguity), for simplicity, we restrict our analysis on the most productive authors (Table 3) in the largest connected component of a coauthorship network that two authors are linked if they coauthored at least 2 papers (Coauthorship (2)). The first panel of Figure 4 shows this coauthorship network with edge width in proportion to the number of coauthored papers. There are papers authored by at least one of these persons.
|01 Michael Jordan||16 Inderjit Dhillon|
|02 Bernhard Schölkopf||17 Ruslan Salakhutdinov|
|03 Geoffrey Hinton||18 Tong Zhang|
|04 Yoshua Bengio||19 Thomas Griffiths|
|05 Zoubin Ghahramani||20 David Blei|
|06 Terrence Sejnowski||21 Rémi Munos|
|07 Peter Dayan||22 Joshua Tenenbaum|
|08 Alex Smola||23 Lawrence Carin|
|09 Andrew Ng||24 Eric Xing|
|10 Francis Bach||25 Richard Zemel|
|11 Michael Mozer||26 Martin Wainwright|
|12 Pradeep Ravikumar||27 Yoram Singer|
|13 Tommi Jaakkola||28 Han Liu|
|14 Klaus-Robert Müller||29 Satinder Singh|
|15 Yee Teh||30 Christopher Williams|
Let the -th entry of be if the -th person is involved in the authors of the -th paper, and otherwise. Now we fit the data by a sparse Ising model creftype 1.3 with parameter . Note that indicates that and are conditional independent on coauthorship, given all the other authors; implies that and coauthored more often than their averages, while says the opposite.
The right three panels in Figure 4 compares some sparse Ising models chosen from three regularization paths at a similar sparsity level (the percentage of learned edges over the complete graph, here about ): GLBI1 (GLBI with composite loss), GLBI2 (GLBI with MPF loss), and regularization (glmnet), respectively. For more learned graphs from these paths, see Figure 6 in Appendix G. In GLBI1 and GLBI2, set .
We see that all the learned graphs capture some important coauthorships, such as Pradeep Ravikumar (12) and Inderjit Dhillon (16) in a thick green edge in all the three learned graphs, indicating that they collaborated more often than separately for NIPS. Besides, the most productive author Michael Jordan (01) has coauthored with a lot of other people, but is somewhat unlikely to coauthor with several other productive scholars like Yoshua Bengio (04), Terrence Sejnowski (06), etc., indicating by the red edges between Jordan and those people. Further note the edge widths in the second and third graphs are significantly larger than those in the fourth graph, implying that at a similar sparsity level, GLBI tends to provide an estimator with larger absolute values of entries than that by glmnet. That is because under similar sparsity patterns, GLBI may give low-biased estimators.
This work is supported in part by National Basic Research Program of China (Nos. 2015CB85600, 2012CB825501), NNSF of China (Nos. 61370004, 11421110001), HKRGC grant 16303817, as well as grants from Tencent AI Lab, Si Family Foundation, Baidu BDI, and Microsoft Research-Asia.
- Beck and Teboulle (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167–175.
- Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. Ann. Statist., 37(4), 1705–1732.
- Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.
- Huang et al. (2016) Huang, C., Sun, X., Xiong, J., and Yao, Y. (2016). Split lbi: An iterative regularization path with structural sparsity. In Advances in Neural Information Processing Systems (NIPS) 29, pages 3369–3377.
- Negahban et al. (2009) Negahban, S., Yu, B., Wainwright, M. J., and Ravikumar, P. K. (2009). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems (NIPS) 22, pages 1348–1356.
- Nemirovski (2012) Nemirovski, A. (2012). Tutorial: Mirror descent algorithms for large-scale deterministic and stochastic convex optimization. Conference on Learning Theory (COLT).
- Nemirovski and Yudin (1983) Nemirovski, A. and Yudin, D. (1983). Problem complexity and Method Efficiency in Optimization. New York: Wiley. Nauka Publishers, Moscow (in Russian), 1978.
- Osher et al. (2016) Osher, S., Ruan, F., Xiong, J., Yao, Y., and Yin, W. (2016). Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2), 436–469.
- Ravikumar et al. (2008) Ravikumar, P., Raskutti, G., Wainwright, M., and Yu, B. (2008). Model selection in Gaussian graphical models: High-dimensional consistency of l1-regularized MLE. In Advances in Neural Information Processing Systems (NIPS), volume 21.
- Ravikumar et al. (2010) Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319.
- Shi et al. (2013) Shi, J. V., Yin, W., and Osher, S. J. (2013). A new regularization path for logistic regression via linearized bregman.
- Sohl-Dickstein et al. (2011) Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. (2011). Minimum Probability Flow Learning. ICML ’11, pages 905–912, New York, NY, USA. ACM.
- Wainwright (2009) Wainwright, M. J. (2009). Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using L1-Constrained Quadratic Programming (Lasso). Information Theory, IEEE Transactions on, 55(5), 2183–2202.
- Xue et al. (2012) Xue, L., Zou, H., and Cai, T. (2012). Nonconcave penalized composite conditional likelihood estimation of sparse ising models. The Annals of Statistics, 40(3), 1403–1429.
- Yin et al. (2008) Yin, W., Osher, S., Darbon, J., and Goldfarb, D. (2008). Bregman iterative algorithms for compressed sensing and related problems. SIAM Journal on Imaging Sciences, 1(1), 143–168.
- Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007). Model Selection and Estimation in the Gaussian Graphical Model. Biometrika, 94, 19–35.
- Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Machine Learning Research, 7, 2541–2567.
Appendix A GLBISS and GISS: Limit Dynamics of GLBI
Consider a differential inclusion called Generalized Linearized Bregman Inverse Scale Space (GLBISS), the limit dynamics of GLBI when the step size . This will help understanding GLBI, and the proof on sign consistency as well as consistency of GLBISS can be moved to the case of GLBI with slight modifications.
Specifically, noting by the following Moreau decomposition
GLBI has an equivalent form equationparentequation
where . Taking , and , creftype A.2 can be viewed as a forward Euler discretization of a differential inclusion called Generalized Linearized Bregman Inverse Scale Space (GLBISS) equationparentequation
where . Next taking , we reach the following Generalized Bregman Inverse Scale Space (GISS). equationparentequation
where . Following the same spirit of Osher et al. (2016), it is transparent to obtain the existence and uniqueness of the solution paths of GISS and GLBISS under mild conditions; and for GISS and GLBISS, is non-increasing for , while for GLBI, is non-increasing for if .