
Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization
We propose a new randomized coordinate descent method for a convex optimization template with broad applications. Our analysis relies on a novel combination of four ideas applied to the primal-dual gap function: smoothing, acceleration, homotopy, and coordinate descent with non-uniform sampling. As a result, our method is the first coordinate descent method whose convergence rate guarantees are the best known under a variety of common structure assumptions on the template. We provide numerical evidence supporting the theoretical results, with a comparison to state-of-the-art algorithms.
11/09/2017 ∙ by Ahmet Alacaoglu, et al.
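
A minimal sketch of one of the four ingredients above, coordinate descent with non-uniform sampling, applied to a plain smooth least-squares problem; the smoothing, acceleration and homotopy steps of the actual method are omitted, and the function name and constants are illustrative.

```python
import numpy as np

def nonuniform_coordinate_descent(A, b, n_iters=1000, seed=0):
    """Minimize 0.5 * ||A x - b||^2 by randomized coordinate descent,
    sampling coordinate j with probability proportional to its
    coordinate-wise Lipschitz constant L_j = ||A[:, j]||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    L = np.sum(A ** 2, axis=0)       # coordinate-wise Lipschitz constants
    probs = L / L.sum()              # non-uniform sampling distribution
    x = np.zeros(n)
    residual = A @ x - b
    for _ in range(n_iters):
        j = rng.choice(n, p=probs)
        grad_j = A[:, j] @ residual  # partial derivative along coordinate j
        step = grad_j / L[j]         # exact minimization along coordinate j
        x[j] -= step
        residual -= step * A[:, j]
    return x
```

Sampling proportionally to L_j puts more effort on the coordinates along which the objective varies fastest, which is the rationale behind non-uniform sampling.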

Generalized Concomitant Multi-Task Lasso for sparse multimodal regression
In high dimension, it is customary to consider Lasso-type estimators to enforce sparsity. For standard Lasso theory to hold, the regularization parameter should be proportional to the noise level, yet the latter is generally unknown in practice. A possible remedy is to consider estimators, such as the Concomitant/Scaled Lasso, which jointly optimize over the regression coefficients as well as over the noise level, making the choice of the regularization independent of the noise level. However, when data from different sources are pooled to increase sample size, or when dealing with multimodal datasets, noise levels typically differ and new dedicated estimators are needed. In this work we provide new statistical and computational solutions to deal with such heteroscedastic regression models, with an emphasis on functional brain imaging with combined magneto- and electroencephalographic (M/EEG) signals. Adopting the formulation of Concomitant Lasso-type estimators, we propose a jointly convex formulation to estimate both the regression coefficients and the (square root of the) noise covariance. When our framework is instantiated to decorrelated noise, it leads to an efficient algorithm whose computational cost is not higher than for the Lasso and Concomitant Lasso, while addressing more complex noise structures. Numerical experiments demonstrate that our estimator yields improved prediction and support identification while correctly estimating the noise (square root) covariance. Results on multimodal neuroimaging problems with M/EEG data are also reported.
05/27/2017 ∙ by Mathurin Massias, et al.

Gap Safe screening rules for sparsity enforcing penalties
In high dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique, called screening rules, proposes to ignore some variables during the optimization, leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly, the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as ℓ_1 or ℓ_1/ℓ_2 norms. Our technique allows one to safely discard more variables than previously considered safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computation) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, Lasso, Sparse-Group Lasso, multi-task Lasso, binary and multinomial logistic regression, etc., we report significant speedups compared to previously proposed safe rules on all tested datasets.
11/17/2016 ∙ by Eugene Ndiaye, et al.
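
To make the mechanism concrete, here is a hedged sketch of the Gap Safe sphere test in the plain Lasso case (objective 0.5‖y − Xβ‖^2 + λ‖β‖_1): from any primal iterate one rescales the residual into a dual-feasible point, computes the duality gap, and safely discards every feature whose correlation with all dual points in the resulting sphere stays below 1. The function name is illustrative; the paper covers general GLMs and block penalties.

```python
import numpy as np

def gap_safe_screening(X, y, beta, lam):
    """Return a boolean mask of features that can be safely discarded
    for the Lasso  min_b 0.5*||y - X b||^2 + lam*||b||_1,
    given any primal iterate `beta` (Gap Safe sphere test)."""
    residual = y - X @ beta
    # Dual-feasible point obtained by rescaling the residual.
    scale = max(lam, np.max(np.abs(X.T @ residual)))
    theta = residual / scale
    # Primal and dual objectives, and the duality gap.
    primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
    dual = 0.5 * (y @ y - lam ** 2 * np.sum((theta - y / lam) ** 2))
    gap = max(primal - dual, 0.0)
    # Radius of the safe sphere centered at theta.
    radius = np.sqrt(2 * gap) / lam
    col_norms = np.linalg.norm(X, axis=0)
    # Feature j is safely inactive if |x_j^T theta| + radius * ||x_j|| < 1.
    return np.abs(X.T @ theta) + radius * col_norms < 1.0
```

The mask can be recomputed every few passes of a (block) coordinate descent solver, shrinking the problem as the duality gap decreases.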

Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression
In high dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation and performance. It is customary to consider the ℓ_1 penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold, this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled Lasso, Square-root Lasso, Concomitant Lasso estimation, for instance, and could be of interest for confidence sets or uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than the one for the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm combined with safe screening rules that achieve speed by eliminating irrelevant features early.
06/08/2016 ∙ by Eugene Ndiaye, et al.
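
A minimal sketch of the joint estimation idea, assuming the scalar-noise formulation min over β and σ ≥ σ_0 of ‖y − Xβ‖^2/(2nσ) + σ/2 + λ‖β‖_1: coordinate descent on β alternates with a closed-form noise update clipped below by σ_0 for numerical stability. The plain cyclic inner loop and function names are illustrative simplifications; the actual solver also uses safe screening rules.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def smoothed_concomitant_lasso(X, y, lam, sigma0=1e-3, n_epochs=100):
    """Alternating scheme for
        min_{beta, sigma >= sigma0}  ||y - X beta||^2 / (2 n sigma)
                                     + sigma / 2 + lam * ||beta||_1,
    a jointly convex formulation estimating coefficients and noise level."""
    n, p = X.shape
    L = np.sum(X ** 2, axis=0)            # squared column norms
    beta = np.zeros(p)
    residual = y.copy()
    sigma = max(np.linalg.norm(residual) / np.sqrt(n), sigma0)
    for _ in range(n_epochs):
        # Coordinate descent on beta for the current sigma:
        # equivalent to a Lasso with penalty  n * sigma * lam.
        for j in range(p):
            if L[j] == 0.0:
                continue
            old = beta[j]
            z = old + X[:, j] @ residual / L[j]
            beta[j] = soft_threshold(z, n * sigma * lam / L[j])
            if beta[j] != old:
                residual -= (beta[j] - old) * X[:, j]
        # Closed-form noise update, clipped below for stability.
        sigma = max(np.linalg.norm(residual) / np.sqrt(n), sigma0)
    return beta, sigma
```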

GAP Safe Screening Rules for Sparse-Group Lasso
In high dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation or performance. In some contexts, it is natural to handle more refined structures than pure sparsity, such as for instance group sparsity. The Sparse-Group Lasso has recently been introduced in the context of linear regression to enforce sparsity both at the feature level and at the group level. We adapt recent safe screening rules, which discard irrelevant features/groups early in the solver, to the case of the Sparse-Group Lasso. Such rules have led to important speedups for a wide range of iterative methods. Thanks to dual gap computations, we provide new safe screening rules for the Sparse-Group Lasso and show significant gains in terms of computing time for a coordinate descent implementation.
02/19/2016 ∙ by Eugene Ndiaye, et al.
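
A sketch of the group-level test in the simpler plain Group Lasso case with unit group weights, with theta and radius computed from the duality gap as in the Lasso sketch given earlier: a group can be safely ignored when the worst-case correlation of its columns with any dual point in the Gap Safe sphere stays below the dual constraint. The actual Sparse-Group Lasso test in the paper additionally soft-thresholds the within-group correlations, which is not reproduced here.

```python
import numpy as np

def gap_safe_group_test(X, theta, radius, groups):
    """Plain Group Lasso (unit weights) group-level test: discard group g if
        ||X_g^T theta||_2 + radius * ||X_g||_2 < 1,
    where theta is a dual-feasible point, radius the Gap Safe sphere radius
    and ||X_g||_2 the spectral norm of the group's columns."""
    discard = []
    for g, cols in enumerate(groups):
        Xg = X[:, cols]
        corr = np.linalg.norm(Xg.T @ theta)
        op_norm = np.linalg.norm(Xg, 2)   # largest singular value of X_g
        if corr + radius * op_norm < 1.0:
            discard.append(g)
    return discard
```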

GAP Safe screening rules for sparse multi-task and multi-class models
High dimensional regression benefits from sparsity promoting regularizations. Screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization, hence speeding up solvers. When the procedure is proven not to discard features wrongly, the rules are said to be safe. In this paper we derive new safe rules for generalized linear models regularized with ℓ_1 and ℓ_1/ℓ_2 norms. The rules are based on duality gap computations and spherical safe regions whose diameters converge to zero. This allows one to safely discard more variables, in particular for low regularization parameters. The GAP Safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for multi-task Lasso, binary and multinomial logistic regression, demonstrating significant speed-ups on all tested datasets with respect to previous safe rules.
06/11/2015 ∙ by Eugene Ndiaye, et al.

Mind the duality gap: safer rules for the Lasso
Screening rules allow one to discard irrelevant variables early from the optimization in Lasso problems, or their derivatives, making solvers faster. In this paper, we propose new versions of the so-called safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screen out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.
05/13/2015 ∙ by Olivier Fercoq, et al.

Accelerated, Parallel and Proximal Coordinate Descent
We propose a new stochastic coordinate descent method for minimizing the sum of convex functions each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method is proposed. In the special case when the number of processors is equal to the number of coordinates, the method converges at the rate 2ω̅L̅R^2/(k+1)^2, where k is the iteration counter, ω̅ is an average degree of separability of the loss function, L̅ is the average of Lipschitz constants associated with the coordinates and individual functions in the sum, and R is the distance of the initial point from the minimizer. We show that the method can be implemented without the need to perform full-dimensional vector operations, which is the major bottleneck of existing accelerated coordinate descent methods. The fact that the method depends on the average degree of separability, and not on the maximum degree of separability, can be attributed to the use of new safe large stepsizes, leading to an improved expected separable over-approximation (ESO). These are of independent interest and can be utilized in all existing parallel stochastic coordinate descent algorithms based on the concept of ESO.
12/20/2013 ∙ by Olivier Fercoq, et al.
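
A small worked example of the stated rate: solving 2ω̅L̅R^2/(k+1)^2 ≤ ε for k gives the iteration count implied by the bound; the problem constants below are placeholders.

```python
import math

def approx_iterations(omega_bar, L_bar, R, eps):
    """Smallest k with 2 * omega_bar * L_bar * R**2 / (k + 1)**2 <= eps."""
    return max(0, math.ceil(math.sqrt(2 * omega_bar * L_bar * R ** 2 / eps) - 1))

# Placeholder constants: omega_bar = 10, L_bar = 1.0, R = 5.0, eps = 1e-4.
print(approx_iterations(10, 1.0, 5.0, 1e-4))  # -> 2236
```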

Parallel coordinate descent for the Adaboost problem
We design a randomised parallel version of Adaboost based on previous studies on parallel coordinate descent. The algorithm uses the fact that the logarithm of the exponential loss is a function with coordinate-wise Lipschitz continuous gradient in order to define the step lengths. We provide a proof of convergence for this randomised Adaboost algorithm and a theoretical parallelisation speedup factor. We finally provide numerical examples on learning problems of various sizes showing that the algorithm is competitive with existing approaches, especially for large-scale problems.
10/07/2013 ∙ by Olivier Fercoq, et al.
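
A sketch of the quantity those step lengths are built from, under illustrative names: for f(x) = log(mean_i exp(−(Ax)_i)), each partial derivative is a softmax-weighted average of a column of the margin matrix, and max_i A[i, j]^2 is one simple valid coordinate-wise Lipschitz bound (the weighted variance of the column entries is at most their largest square). The randomised parallel scheme itself is not reproduced here.

```python
import numpy as np

def adaboost_coordinate_step(A, x, j):
    """One serial coordinate step on f(x) = log(mean_i exp(-(A x)_i)),
    the logarithm of the exponential (AdaBoost) loss; A[i, j] is the
    margin of weak learner j on example i."""
    margins = A @ x
    w = np.exp(-(margins - margins.min()))   # numerically stabilised weights
    w /= w.sum()                             # softmax over examples
    grad_j = -(w @ A[:, j])                  # partial derivative along coordinate j
    L_j = np.max(A[:, j] ** 2)               # simple coordinate-wise Lipschitz bound
    x_new = x.copy()
    if L_j > 0.0:
        x_new[j] -= grad_j / L_j             # gradient step with step length 1 / L_j
    return x_new
```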

Smooth minimization of nonsmooth functions with parallel coordinate descent methods
We study the performance of a family of randomized parallel coordinate descent methods for minimizing the sum of a nonsmooth and a separable convex function. The problem class includes as special cases L1-regularized L1 regression and the minimization of the exponential loss (the "AdaBoost problem"). We assume the input data defining the loss function is contained in a sparse m × n matrix A with at most ω nonzeros in each row. Our methods need O(nβ/τ) iterations to find an approximate solution with high probability, where τ is the number of processors and β = 1 + (ω−1)(τ−1)/(n−1) for the fastest variant. The O(·) notation hides dependence on quantities such as the required accuracy and confidence levels and the distance of the starting iterate from an optimal point. Since β/τ is a decreasing function of τ, the method needs fewer iterations when more processors are used. Certain variants of our algorithms perform on average only O(nnz(A)/n) arithmetic operations during a single iteration per processor and, because β decreases when ω does, fewer iterations are needed for sparser problems.
09/23/2013 ∙ by Olivier Fercoq, et al.
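
A tiny numeric illustration of the stated complexity, with the constants hidden by the O(·) notation ignored and placeholder problem sizes: β = 1 + (ω−1)(τ−1)/(n−1) and the iteration bound n·β/τ, which decreases whenever more processors are used and whenever the data is sparser (smaller ω).

```python
def eso_beta(n, omega, tau):
    """ESO parameter beta = 1 + (omega - 1) * (tau - 1) / (n - 1)."""
    return 1.0 + (omega - 1) * (tau - 1) / (n - 1)

def relative_iterations(n, omega, tau):
    """Iteration count up to the constants hidden in the O(n * beta / tau) bound."""
    return n * eso_beta(n, omega, tau) / tau

n, omega = 10**6, 50        # placeholder dimension and per-row sparsity
for tau in (1, 8, 64, 512):
    print(tau, relative_iterations(n, omega, tau))
```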

Safe Grid Search with Optimal Complexity
Popular machine learning estimators involve regularization parameters that can be challenging to tune, and standard strategies rely on grid search for this task. In this paper, we revisit the techniques of approximating the regularization path up to a predefined tolerance ϵ in a unified framework and show that its complexity is O(1/ϵ^(1/d)) for uniformly convex losses of order d>0 and O(1/√(ϵ)) for Generalized Self-Concordant functions. This framework encompasses least-squares but also logistic regression (a case that, as far as we know, was not handled as precisely by previous works). We leverage our technique to provide refined bounds on the validation error as well as a practical algorithm for hyperparameter tuning. The latter has a global convergence guarantee when targeting a prescribed accuracy on the validation set. Last but not least, our approach helps relieve the practitioner of the (often neglected) task of selecting a stopping criterion when optimizing over the training set: our method automatically calibrates it based on the targeted accuracy on the validation set.
10/12/2018 ∙ by Eugene Ndiaye, et al.