
Online optimization and regret guarantees for non-additive long-term constraints
We consider online optimization in the 1-lookahead setting, where the objective does not decompose additively over the rounds of the online game. The resulting formulation enables us to deal with non-stationary and/or long-term constraints, which arise, for example, in online display advertising problems. We propose an online primal-dual algorithm for which we obtain dynamic cumulative regret guarantees. They depend on the convexity and the smoothness of the non-additive penalty, as well as terms capturing the smoothness with which the residuals of the non-stationary and long-term constraints vary over the rounds. We conduct experiments on synthetic data to illustrate the benefits of the non-additive penalty and show vanishing regret convergence on live traffic data collected by a display advertising platform in production.
02/17/2016 ∙ by Rodolphe Jenatton, et al.

Adaptive Algorithms for Online Convex Optimization with Long-term Constraints
We present an adaptive online gradient descent algorithm to solve online convex optimization problems with long-term constraints, which are constraints that need to be satisfied when accumulated over a finite number of rounds T, but can be violated in intermediate rounds. For some user-defined trade-off parameter β ∈ (0, 1), the proposed algorithm achieves cumulative regret bounds of O(T^max(β, 1−β)) and O(T^(1−β/2)) for the loss and the constraint violations, respectively. Our results hold for convex losses and can handle arbitrary convex constraints without requiring knowledge of the number of rounds in advance. Our contributions improve over the best known cumulative regret bounds of Mahdavi et al. (2012), which are respectively O(T^(1/2)) and O(T^(3/4)) for general convex domains, and respectively O(T^(2/3)) and O(T^(2/3)) when further restricting to polyhedral domains. We supplement the analysis with experiments validating the performance of our algorithm in practice.
12/23/2015 ∙ by Rodolphe Jenatton, et al.
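As a concrete illustration of the primal-dual machinery behind such guarantees, here is a minimal sketch of online gradient descent with a dual variable penalizing accumulated constraint violation. The toy losses, the constraint g(x) = x − 0.5 ≤ 0, and the step size η_t = 1/√t are illustrative choices in the spirit of Mahdavi et al. (2012), not the paper's adaptive algorithm.

```python
import numpy as np

# Hedged sketch: primal-dual online gradient descent for a long-term
# constraint. The paper's adaptive step sizes are replaced here by a
# simple eta_t = 1/sqrt(t). Toy setup: losses f_t(x) = (x - c_t)^2 and
# a constraint g(x) = x - 0.5 <= 0 that only needs to hold on average
# over the T rounds, not at every round.

def primal_dual_ogd(costs):
    x, lam = 0.0, 0.0
    iterates = []
    for t, c in enumerate(costs, start=1):
        eta = 1.0 / np.sqrt(t)
        grad_f = 2.0 * (x - c)                 # gradient of the round-t loss
        grad_g = 1.0                           # gradient of g(x) = x - 0.5
        x = float(np.clip(x - eta * (grad_f + lam * grad_g), -1.0, 1.0))
        lam = max(0.0, lam + eta * (x - 0.5))  # dual ascent on the constraint
        iterates.append(x)
    return np.array(iterates)

rng = np.random.default_rng(0)
xs = primal_dual_ogd(rng.uniform(0.0, 1.0, size=1000))
print(np.mean(xs - 0.5))  # average long-term constraint violation
```

The dual variable lam grows while the constraint is violated on average, tilting subsequent primal steps back toward the feasible region.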

Sparse and spurious: dictionary learning with noise and outliers
A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical successes in various fields ranging from image to audio processing, only a few theoretical arguments support this empirical evidence. In particular, sparse coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, with high probability, sparse coding admits a local minimum around the reference dictionary generating the signals. Our study takes into account the case of overcomplete dictionaries, noisy signals, and possible outliers, thus extending previous work limited to noiseless settings and/or undercomplete dictionaries. The analysis we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the number of atoms, the sparsity and the number of observations.
07/19/2014 ∙ by Rémi Gribonval, et al.
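The non-convex procedure analyzed in the paper above can be sketched as an alternation between a sparse coding step on the codes and a gradient step on the dictionary. All dimensions, step sizes, and the regularization weight below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch of sparse dictionary learning:
#   min_{D, A}  ||Y - D A||_F^2 / 2 + lam * ||A||_1,
# alternating an ISTA step on the codes A with a gradient step on D.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def learn_dictionary(Y, n_atoms, lam=0.5, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    A = np.zeros((n_atoms, n))
    for _ in range(n_iter):
        L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the fit
        A = soft_threshold(A - D.T @ (D @ A - Y) / L, lam / L)  # ISTA step
        step = 0.1 / max(np.linalg.norm(A, 2) ** 2, 1.0)
        D -= step * (D @ A - Y) @ A.T              # gradient step on D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A

Y = np.random.default_rng(1).standard_normal((20, 100))
D, A = learn_dictionary(Y, n_atoms=30)
print(np.mean(A != 0))  # fraction of nonzero codes (sparsity induced by lam)
```

Note the non-convexity: the objective is convex in A for fixed D and vice versa, but not jointly, which is why local minima, as studied above, are the relevant object of analysis.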

On The Sample Complexity of Sparse Dictionary Learning
In the synthesis model, signals are represented as sparse combinations of atoms from a dictionary. Dictionary learning describes the acquisition process of the underlying dictionary for a given set of training samples. While ideally this would be achieved by optimizing the expectation of the factors over the underlying distribution of the training data, in practice the necessary information about the distribution is not available. Therefore, in real-world applications it is achieved by minimizing an empirical average over the available samples. The main goal of this paper is to provide a sample complexity estimate that controls to what extent the empirical average deviates from the cost function. This estimate then yields a bound on the accuracy of the representation obtained with the learned dictionary. The presented approach exemplifies the general results proposed by the authors in "Sample Complexity of Dictionary Learning and other Matrix Factorizations" (Gribonval et al.) and gives more concrete bounds on the sample complexity of dictionary learning. We cover a variety of sparsity measures employed in the learning procedure.
03/20/2014 ∙ by Matthias Seibert, et al.

Sample Complexity of Dictionary Learning and other Matrix Factorizations
Many modern tools in machine learning and signal processing, such as sparse dictionary learning, principal component analysis (PCA), non-negative matrix factorization (NMF), K-means clustering, etc., rely on the factorization of a matrix obtained by concatenating high-dimensional vectors from a training collection. While the idealized task would be to optimize the expected quality of the factors over the underlying distribution of training vectors, it is achieved in practice by minimizing an empirical average over the considered collection. The focus of this paper is to provide sample complexity estimates to uniformly control how much the empirical average deviates from the expected cost function. Standard arguments imply that the performance of the empirical predictor also exhibits such guarantees. The level of genericity of the approach encompasses several possible constraints on the factors (tensor product structure, shift-invariance, sparsity, ...), thus providing a unified perspective on the sample complexity of several widely used matrix factorization schemes. The derived generalization bounds behave proportionally to √(log(n)/n) with respect to the number of samples n for the considered matrix factorization techniques.
12/13/2013 ∙ by Rémi Gribonval, et al.

Local stability and robustness of sparse dictionary learning in the presence of noise
A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical successes in various fields ranging from image to audio processing, only a few theoretical arguments support this empirical evidence. In particular, sparse coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, with high probability, sparse coding admits a local minimum around the reference dictionary generating the signals. Our study takes into account the case of overcomplete dictionaries and noisy signals, thus extending previous work limited to noiseless settings and/or undercomplete dictionaries. The analysis we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the number of atoms, the sparsity and the number of observations.
10/02/2012 ∙ by Rodolphe Jenatton, et al.

Structured sparsity through convex optimization
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through regularization by the ℓ_1-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the ℓ_1-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection.
09/12/2011 ∙ by Francis Bach, et al.
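A minimal numerical sketch of the structured norms discussed above (the group definitions are illustrative, not tied to the paper's applications): the ℓ_1-norm corresponds to singleton groups, while overlapping groups yield a structured penalty.

```python
import numpy as np

# Group norm Omega(w) = sum over groups g of ||w_g||_2.
# With singleton groups it reduces to the l1 norm; disjoint or
# overlapping groups encode structural prior knowledge.

def group_norm(w, groups):
    return sum(np.linalg.norm(w[g]) for g in groups)

w = np.array([1.0, -2.0, 0.0, 3.0])
singletons = [[0], [1], [2], [3]]
overlapping = [[0, 1], [1, 2, 3]]

print(group_norm(w, singletons))   # equals ||w||_1 = 6.0
print(group_norm(w, overlapping))  # sqrt(5) + sqrt(13)
```

Used as a regularizer, this penalty tends to zero out entire groups at once, which is what produces the structured sparsity patterns described above.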

Optimization with Sparsity-Inducing Penalties
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection, but numerous extensions have now emerged, such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present, from a general perspective, optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted ℓ_2-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view.
08/03/2011 ∙ by Francis Bach, et al.
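The workhorse of the proximal methods surveyed above is the proximal operator, which for the ℓ_1-norm has a closed-form soft-thresholding solution. A minimal sketch (the example vector and threshold are arbitrary):

```python
import numpy as np

# prox_{t*||.||_1}(v) = argmin_x ||x - v||^2 / 2 + t * ||x||_1
# has the closed-form soft-thresholding solution below, which is the
# elementary step of ISTA/FISTA-style proximal gradient methods.

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

v = np.array([3.0, -0.5, 1.2, -2.0])
print(prox_l1(v, 1.0))  # entries shrunk by 1.0; those below 1.0 in magnitude become 0
```

Because this operator is cheap and exact, a proximal method pays essentially the cost of gradient descent while handling the non-smooth penalty.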

Multiscale Mining of fMRI data with Hierarchical Structured Sparsity
Inverse inference, or "brain reading", is a recent paradigm for analyzing functional magnetic resonance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive variables related to brain activation maps, this approach aims at decoding brain activity. Inverse inference takes into account the multivariate information between voxels and is currently the only way to assess how precisely some cognitive information is encoded by the activity of neural populations within the whole brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, since there are far more features than samples, i.e., more voxels than fMRI volumes. To address this problem, different methods have been proposed, such as, among others, univariate feature selection, feature agglomeration and regularization techniques. In this paper, we consider a sparse hierarchical structured regularization. Specifically, the penalization we use is constructed from a tree that is obtained by spatially-constrained agglomerative clustering. This approach encodes the spatial structure of the data at different scales into the regularization, which makes the overall prediction procedure more robust to inter-subject variability. The regularization used induces the selection of spatially coherent predictive brain regions simultaneously at different scales. We test our algorithm on real data acquired to study the mental representation of objects, and we show that the proposed algorithm not only delineates meaningful brain regions but also yields better prediction accuracy than reference methods.
05/02/2011 ∙ by Rodolphe Jenatton, et al.

Convex and Network Flow Optimization for Structured Sparsity
We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ_2- or ℓ_∞-norms over groups of variables. Whereas much effort has been put into developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: on the one hand, we show that the proximal operator associated with a sum of ℓ_∞-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
04/11/2011 ∙ by Julien Mairal, et al.

Proximal Methods for Hierarchical Sparse Coding
Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and we propose in this paper efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ_1-norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally organize in a pre-specified arborescent structure, leading to better performance in the reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.
09/11/2010 ∙ by Rodolphe Jenatton, et al.
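The composition result described above can be sketched numerically. Assuming groups ordered from leaves to root (the small two-leaf tree and unit weights below are an illustrative example, not the paper's implementation), the proximal operator of the tree-structured norm Σ_g λ_g ||w_g||_2 is obtained by composing elementary group soft-thresholdings in that order:

```python
import numpy as np

# Elementary proximal operator of t * ||w_g||_2 restricted to one group:
# zero the group if its norm is below t, otherwise shrink it radially.
def group_soft_threshold(v, idx, t):
    out = v.copy()
    g = v[idx]
    n = np.linalg.norm(g)
    out[idx] = 0.0 if n <= t else g * (1.0 - t / n)
    return out

# Composing these operators from leaves to root computes the prox of
# the tree-structured norm exactly, in one pass over the groups.
def tree_prox(v, ordered_groups, t):
    for idx, weight in ordered_groups:  # children before their ancestors
        v = group_soft_threshold(v, idx, t * weight)
    return v

v = np.array([1.0, 3.0, 0.2, 0.1])
groups = [([0, 1], 1.0), ([2, 3], 1.0), ([0, 1, 2, 3], 1.0)]  # two leaves, then root
w = tree_prox(v, groups, 0.5)
print(w)  # the small leaf group is zeroed; the surviving entries are shrunk
```

This is what yields the complexity linear in the number of atoms: each group is visited once, and each visit costs time proportional to the group size.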
Rodolphe Jenatton
Senior Machine Learning Scientist at Amazon.com, Inc. since 2016; Machine Learning Scientist at Amazon since 2014; Software engineer / Data scientist at Criteo, 2013-2014; Postdoctoral fellow at École Polytechnique, 2011-2012; PhD student at INRIA - École Normale Supérieure, 2008-2011; Software engineer / Data scientist at Dynamic Capital Management, 2007-2008