
Optimal minibatch and step sizes for SAGA
Recently it has been shown that the step sizes of a family of variance-reduced gradient methods, the JacSketch methods, depend on an expected smoothness constant. In particular, if this expected smoothness constant could be calculated a priori, one could safely set much larger step sizes, resulting in much faster convergence. We fill in this gap and provide simple closed-form expressions for the expected smoothness constant, together with careful numerical experiments verifying these bounds. Using these bounds, and since the SAGA algorithm belongs to the JacSketch family, we suggest a new standard practice for setting the step size and minibatch size for SAGA that is competitive with a numerical grid search. Furthermore, we can now show that the total complexity of the SAGA algorithm decreases linearly in the minibatch size up to a predefined value: the optimal minibatch size. This is a rare result in the stochastic variance-reduced literature, previously shown only for the Katyusha algorithm. Finally, we conjecture that the same holds for many other stochastic variance-reduced methods, and that our bounds and analysis of the expected smoothness constant are key to extending these results.
01/31/2019 ∙ by Nidham Gazagnadou, et al.
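The minibatch SAGA update discussed above can be sketched as follows. This is an illustrative implementation on a least-squares objective, not the paper's code; the choice of `step` and `batch_size`, which the paper's expected smoothness bounds are meant to guide, is left to the caller.

```python
import numpy as np

def saga_minibatch(A, y, step, batch_size, n_iters=500, seed=0):
    """Minibatch SAGA on least squares, f_i(x) = 0.5 * (a_i @ x - y_i)**2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    J = np.zeros((n, d))          # table of the last stored gradient per f_i
    g_avg = J.mean(axis=0)        # running average of the table
    for _ in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        g_new = (A[idx] @ x - y[idx])[:, None] * A[idx]   # fresh minibatch gradients
        g_old = J[idx]
        # SAGA direction: unbiased estimate of the full gradient.
        v = g_new.mean(axis=0) - g_old.mean(axis=0) + g_avg
        x -= step * v
        g_avg += (g_new - g_old).sum(axis=0) / n          # keep the average in sync
        J[idx] = g_new
    return x
```

A conservative classical choice such as `step = 1 / (3 * Lmax)`, with `Lmax` the largest per-sample smoothness constant, makes the iteration converge; the paper's point is that sharper constants allow larger steps.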

Screening Rules for Lasso with Non-Convex Sparse Regularizers
Leveraging the convexity of the Lasso problem, screening rules help accelerate solvers by discarding irrelevant variables during the optimization process. However, because they provide better theoretical guarantees for identifying relevant variables, several non-convex regularizers for the Lasso have been proposed in the literature. This work is the first to introduce a screening-rule strategy into a non-convex Lasso solver. The approach we propose is based on an iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between MM iterations. In addition to improving solver efficiency, we provide guarantees that the inner solver identifies the zero components of its critical point in finite time. Our experimental analysis illustrates the significant computational gain brought by the new screening rule compared to classical coordinate-descent or proximal gradient descent methods.
02/16/2019 ∙ by Alain Rakotomamonjy, et al.

Concomitant Lasso with Repetitions (CLaR): beyond averaging multiple realizations of heteroscedastic noise
Sparsity-promoting norms are frequently used in high-dimensional regression. A limitation of Lasso-type estimators is that the regularization parameter depends on the noise level, which varies between datasets and experiments. Estimators such as the concomitant Lasso address this dependence by jointly estimating the noise level and the regression coefficients. As sample sizes are often limited in high-dimensional regimes, simplified heteroscedastic models are customary. However, in many experimental applications, data is obtained by averaging multiple measurements. This helps reduce the noise variance, yet it dramatically reduces sample sizes, preventing refined noise modeling. In this work, we propose an estimator that can cope with complex heteroscedastic noise structures by using non-averaged measurements and a concomitant formulation. The resulting optimization problem is convex and, thanks to smoothing theory, amenable to state-of-the-art proximal coordinate descent techniques that can leverage the expected sparsity of the solutions. Practical benefits are demonstrated on simulations and on neuroimaging applications.
02/07/2019 ∙ by Quentin Bertrand, et al.

Dual Extrapolation for Sparse Generalized Linear Models
Generalized Linear Models (GLM) form a wide class of regression and classification models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector AutoRegressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and helping to design competitive working set algorithms.
07/12/2019 ∙ by Mathurin Massias, et al.
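One practical consequence of the abstract above is that recent dual iterates (residuals, in the Lasso case) can be combined to build a tighter feasible dual point. The sketch below uses an Anderson-style combination of the last few residuals followed by the standard rescaling that enforces dual feasibility; the coefficient construction is illustrative and not the paper's exact procedure.

```python
import numpy as np

def extrapolated_dual_point(R, X, lam):
    """Sketch of dual-point construction for the Lasso.

    R: array of shape (K, n) holding the K most recent residuals y - X @ beta_t.
    Combines them with weights solving a small least-squares system over the
    successive differences, then rescales so ||X.T @ theta||_inf <= 1."""
    U = np.diff(R, axis=0).T                 # residual differences, shape (n, K-1)
    # Weights c minimizing ||U @ c|| subject to sum(c) = 1 (normal equations).
    z = np.linalg.solve(U.T @ U + 1e-12 * np.eye(U.shape[1]), np.ones(U.shape[1]))
    c = z / z.sum()
    r_acc = R[1:].T @ c                      # extrapolated residual
    # Rescale to a feasible dual point of the Lasso dual problem.
    return r_acc / max(lam, np.max(np.abs(X.T @ r_acc)))
```

By construction the returned point satisfies the dual constraint, so it always yields a valid duality-gap certificate; the VAR behavior shown in the paper is what makes the extrapolated point tighter than plain residual rescaling.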

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression
In high dimension, it is customary to consider Lasso-type estimators to enforce sparsity. For standard Lasso theory to hold, the regularization parameter should be proportional to the noise level, yet the latter is generally unknown in practice. A possible remedy is to consider estimators, such as the Concomitant/Scaled Lasso, that jointly optimize over the regression coefficients and the noise level, making the choice of the regularization independent of the noise level. However, when data from different sources are pooled to increase sample size, or when dealing with multimodal datasets, noise levels typically differ and new dedicated estimators are needed. In this work we provide new statistical and computational solutions to deal with such heteroscedastic regression models, with an emphasis on functional brain imaging with combined magneto- and electroencephalographic (M/EEG) signals. Adopting the formulation of Concomitant Lasso-type estimators, we propose a jointly convex formulation to estimate both the regression coefficients and the (square root of the) noise covariance. When our framework is instantiated with decorrelated noise, it leads to an efficient algorithm whose computational cost is no higher than for the Lasso and Concomitant Lasso, while addressing more complex noise structures. Numerical experiments demonstrate that our estimator yields improved prediction and support identification while correctly estimating the (square root of the) noise covariance. Results on multimodal neuroimaging problems with M/EEG data are also reported.
05/27/2017 ∙ by Mathurin Massias, et al.

From safe screening rules to working sets for faster Lasso-type solvers
Convex sparsity-promoting regularizations are ubiquitous in modern statistical learning. By construction, they yield solutions with few nonzero coefficients, which correspond to saturated constraints in the dual optimization formulation. Working set (WS) strategies are generic optimization techniques that consist in solving simpler problems that only consider a subset of constraints, whose indices form the WS. Working set methods therefore involve two nested iterations: the outer loop defines the WS and the inner loop calls a solver on the subproblems. For the Lasso estimator, a WS is a set of features, while for the Group Lasso it refers to a set of groups. In practice, WSs are generally small in this context, so the associated feature Gram matrix can fit in memory. Here we show that the Gauss-Southwell rule (a greedy strategy for block coordinate descent techniques) leads to fast solvers in this case. Combined with a working set strategy based on an aggressive use of so-called Gap Safe screening rules, we propose a solver achieving state-of-the-art performance on sparse learning problems. Results are presented on Lasso and multi-task Lasso estimators.
03/21/2017 ∙ by Mathurin Massias, et al.
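The Gauss-Southwell rule mentioned above can be sketched on the Lasso: instead of cycling over coordinates, each step greedily updates the coordinate whose exact coordinate minimization would move the most. This is a minimal illustration (the "GS-r" variant of the rule), not the paper's full working-set solver.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def gauss_southwell_cd_lasso(X, y, lam, n_iters=500):
    """Greedy (Gauss-Southwell) coordinate descent for the Lasso."""
    n, d = X.shape
    L = np.sum(X ** 2, axis=0)          # per-coordinate Lipschitz constants
    beta = np.zeros(d)
    r = y.copy()                         # residual y - X @ beta, kept up to date
    for _ in range(n_iters):
        grad = -X.T @ r
        # Exact coordinate minimizers for every coordinate; pick the biggest move.
        new = soft_threshold(beta - grad / L, lam / L)
        j = int(np.argmax(np.abs(new - beta)))
        r += X[:, j] * (beta[j] - new[j])
        beta[j] = new[j]
    return beta
```

Computing all candidate updates costs a full gradient per step, which is why, as the abstract notes, the rule pays off when restricted to a small working set whose Gram matrix fits in memory.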

On the benefits of output sparsity for multilabel classification
The multilabel classification framework, where each observation can be associated with a set of labels, has attracted tremendous attention over recent years. Modern multilabel problems are typically large-scale in terms of the number of observations, features, and labels, and the number of labels can even be comparable to the number of observations. In this context, different remedies have been proposed to overcome the curse of dimensionality. In this work, we aim at exploiting output sparsity by introducing a new loss, called the sparse weighted Hamming loss. This proposed loss can be seen as a weighted version of classical ones, where active and inactive labels are weighted separately. Leveraging the influence of sparsity in the loss function, we provide improved generalization bounds for the empirical risk minimizer, a suitable property for large-scale problems. For this new loss, we derive rates of convergence linear in the underlying output sparsity rather than linear in the number of labels. In practice, minimizing the associated risk can be performed efficiently by using convex surrogates and modern convex optimization algorithms. We provide experiments on various real-world datasets demonstrating the pertinence of our approach when compared to non-weighted techniques.
03/14/2017 ∙ by Evgenii Chzhen, et al.
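The key idea of weighting active and inactive labels separately can be written in a few lines. The weight names below are illustrative, not the paper's notation.

```python
import numpy as np

def sparse_weighted_hamming(y_true, y_pred, w_active, w_inactive):
    """Weighted Hamming loss in the spirit of the abstract: errors on active
    (label = 1) and inactive (label = 0) entries are weighted separately."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    miss_active = (y_true == 1) & (y_pred != 1)     # missed active labels
    miss_inactive = (y_true == 0) & (y_pred != 0)   # spurious predictions
    return w_active * miss_active.mean() + w_inactive * miss_inactive.mean()
```

Setting both weights equal recovers the classical (unweighted) Hamming loss up to a constant factor, which is why the sparse weighted loss can be seen as a generalization.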

Gap Safe screening rules for sparsity enforcing penalties
In high-dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique called screening rules proposes to ignore some variables during the optimization, leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly, the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as ℓ_1 or ℓ_1/ℓ_2 norms. Our technique allows one to safely discard more variables than previously proposed safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computations) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, Lasso, Sparse-Group Lasso, multi-task Lasso, binary and multinomial logistic regression, etc., we report significant speedups compared to previously proposed safe rules on all tested datasets.
11/17/2016 ∙ by Eugene Ndiaye, et al.
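For the Lasso, the Gap Safe test above takes a particularly simple form: from any primal point one builds a feasible dual point, computes the duality gap, and discards feature j whenever |x_j^T theta| plus a gap-dependent radius times ||x_j|| stays below 1. The following is a minimal sketch of that test, not the paper's full implementation.

```python
import numpy as np

def gap_safe_screen_lasso(X, y, lam, beta):
    """Gap Safe rule for the Lasso at a primal point beta.

    Returns a boolean mask: True means the feature is provably inactive
    at the optimum and can be safely discarded."""
    res = y - X @ beta
    # Feasible dual point by residual rescaling.
    theta = res / max(lam, np.max(np.abs(X.T @ res)))
    primal = 0.5 * res @ res + lam * np.abs(beta).sum()
    dual = 0.5 * y @ y - 0.5 * np.sum((y - lam * theta) ** 2)
    gap = max(primal - dual, 0.0)
    radius = np.sqrt(2 * gap) / lam
    scores = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0)
    return scores < 1.0
```

As the solver converges the gap shrinks, the safe radius contracts, and more features are discarded, which is what makes the rule effective inside coordinate descent.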

Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression
In high-dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation, and performance. It is customary to consider an ℓ_1 penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold, this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled Lasso, Square-root Lasso, and Concomitant Lasso estimation, for instance, and could be of interest for confidence sets or uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification, coined the Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than for the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm combined with safe screening rules, eliminating irrelevant features early for speed.
06/08/2016 ∙ by Eugene Ndiaye, et al.
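The joint optimization over the regression parameter and the noise level can be sketched by alternating two steps: the noise level has a closed-form update (clipped below by a smoothing bound, which is the stabilization the abstract refers to), and the coefficients are updated by coordinate descent with an effective regularization proportional to the current noise estimate. This is an illustrative alternating scheme, not the paper's exact solver.

```python
import numpy as np

def smoothed_concomitant_lasso(X, y, lam, sigma_min, n_iters=100):
    """Alternating sketch for min over (beta, sigma) of
    ||y - X beta||^2 / (2 n sigma) + sigma / 2 + lam * ||beta||_1,
    with sigma constrained to stay above sigma_min."""
    n, d = X.shape
    L = np.sum(X ** 2, axis=0)
    beta = np.zeros(d)
    r = y.copy()                                   # residual y - X @ beta
    for _ in range(n_iters):
        # Closed-form noise update, clipped at the smoothing bound.
        sigma = max(np.linalg.norm(r) / np.sqrt(n), sigma_min)
        for j in range(d):                         # one cyclic CD pass on beta
            r += X[:, j] * beta[j]
            u = X[:, j] @ r / L[j]
            beta[j] = np.sign(u) * max(abs(u) - n * sigma * lam / L[j], 0.0)
            r -= X[:, j] * beta[j]
    sigma = max(np.linalg.norm(r) / np.sqrt(n), sigma_min)
    return beta, sigma
```

Since the per-pass cost matches plain Lasso coordinate descent and the sigma update is O(n), the overall cost stays of the same order as a Lasso solve, consistent with the abstract's claim.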

Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions
In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the nodes of a graph defining the communication topology of the network. This general problem finds applications in ranking, distance metric learning, and graph inference, among others. We propose new gossip algorithms based on dual averaging which aim at solving such problems in both synchronous and asynchronous settings. The proposed framework is flexible enough to deal with constrained and regularized variants of the optimization problem. Our theoretical analysis reveals that the proposed algorithms preserve the convergence rate of centralized dual averaging up to an additive bias term. We present numerical simulations on Area Under the ROC Curve (AUC) maximization and metric learning problems which illustrate the practical interest of our approach.
06/08/2016 ∙ by Igor Colin, et al.
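A synchronous gossip dual-averaging round can be sketched on toy local objectives f_i(x) = 0.5 * ||x - c_i||^2 (the paper treats the harder pairwise-function case; this simplification and all names below are illustrative): each node mixes its dual variable with its neighbors' through a doubly stochastic gossip matrix W, accumulates its local gradient, and maps back to a primal iterate.

```python
import numpy as np

def gossip_dual_averaging(C, W, step=0.1, n_rounds=300):
    """Synchronous gossip dual averaging on f_i(x) = 0.5 * ||x - c_i||^2.

    C: (n_nodes, d) array of local targets c_i.
    W: (n_nodes, n_nodes) doubly stochastic gossip matrix."""
    n_nodes, d = C.shape
    Z = np.zeros((n_nodes, d))           # dual (accumulated-gradient) variables
    X = np.zeros((n_nodes, d))           # primal iterates, one row per node
    for _ in range(n_rounds):
        Z = W @ Z + (X - C)              # gossip mixing + local gradients
        X = -step * Z                    # primal map (quadratic prox, no regularizer)
    return X
```

With W doubly stochastic, the network-average iterate follows the centralized dual-averaging dynamics and drives the average of the node iterates toward the consensus minimizer (the mean of the c_i), while a constant step leaves the per-node bias the analysis in the paper quantifies.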

GAP Safe Screening Rules for Sparse-Group Lasso
In high-dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation, or performance. In some contexts, it is natural to handle more refined structures than pure sparsity, such as group sparsity. The Sparse-Group Lasso has recently been introduced in the context of linear regression to enforce sparsity both at the feature level and at the group level. We adapt to the Sparse-Group Lasso recent safe screening rules that discard irrelevant features/groups early in the solver. Such rules have led to important speedups for a wide range of iterative methods. Thanks to duality gap computations, we provide new safe screening rules for the Sparse-Group Lasso and show significant gains in terms of computing time for a coordinate descent implementation.
02/19/2016 ∙ by Eugene Ndiaye, et al.
Joseph Salmon
Assistant Professor at TELECOM ParisTech, Associate member at INRIA Parietal, Visiting assistant professor at UW, Statistics department.