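Vectors are denoted by bold lower case letters and matrices by upper case ones. We define for $q \geq 1$ the $\ell_q$-norm of a vector $\mathbf{x}$ in $\mathbb{R}^n$ as $\|\mathbf{x}\|_q := (\sum_{i=1}^n |x_i|^q)^{1/q}$, where $x_i$ denotes the $i$-th coordinate of $\mathbf{x}$, and $\|\mathbf{x}\|_\infty := \max_{i=1,\dots,n} |x_i| = \lim_{q \to \infty} \|\mathbf{x}\|_q$. We also define the $\ell_0$-penalty as the number of nonzero elements in a vector: $\|\mathbf{x}\|_0 := \#\{i \text{ s.t. } x_i \neq 0\} = \lim_{q \to 0^+} \sum_{i=1}^n |x_i|^q$. (Note that it would be more proper to write $\|\mathbf{x}\|_0^0$ instead of $\|\mathbf{x}\|_0$ to be consistent with the traditional notation $\|\mathbf{x}\|_q$. However, for the sake of simplicity, we will keep this notation unchanged in the rest of the paper.) We consider the Frobenius norm of a matrix $\mathbf{X}$ in $\mathbb{R}^{m \times n}$: $\|\mathbf{X}\|_F := (\sum_{i=1}^m \sum_{j=1}^n X_{ij}^2)^{1/2}$, where $X_{ij}$ denotes the entry of $\mathbf{X}$ at row $i$ and column $j$. For an integer $n > 0$, and for any subset $J \subseteq \{1,\dots,n\}$, we denote by $\mathbf{x}_J$ the vector of size $|J|$ containing the entries of a vector $\mathbf{x}$ in $\mathbb{R}^n$ indexed by $J$, and by $\mathbf{X}_J$ the matrix in $\mathbb{R}^{m \times |J|}$ containing the $|J|$ columns of a matrix $\mathbf{X}$ in $\mathbb{R}^{m \times n}$ indexed by $J$.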
2 Loss Functions
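We consider in this paper convex optimization problems of the form
$$\min_{\mathbf{w} \in \mathbb{R}^p} f(\mathbf{w}) + \lambda \Omega(\mathbf{w}), \qquad (1)$$
where $f: \mathbb{R}^p \to \mathbb{R}$ is a convex differentiable function and $\Omega: \mathbb{R}^p \to \mathbb{R}$ is a sparsity-inducing (typically nonsmooth and non-Euclidean) norm.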
In supervised learning, we predict outputs $y$ in $\mathcal{Y}$ from observations $\mathbf{x}$ in $\mathcal{X}$; these observations are usually represented by $p$-dimensional vectors with $\mathcal{X} = \mathbb{R}^p$. In this supervised setting, $f$ generally corresponds to the empirical risk of a loss function $\ell: \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$. More precisely, given $n$ pairs of data points $\{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathbb{R}^p \times \mathcal{Y};\ i = 1,\dots,n\}$, we have for linear models $f(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^n \ell(y^{(i)}, \mathbf{w}^\top \mathbf{x}^{(i)})$. (In Section 5, we consider extensions to non-linear predictors through multiple kernel learning.) Typical examples of differentiable loss functions are the square loss for least squares regression, i.e., $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$ with $y$ in $\mathbb{R}$, and the logistic loss
$$\ell(y, \hat{y}) = \log(1 + e^{-y \hat{y}})$$
for logistic regression, with $y$ in $\{-1, 1\}$. Clearly, several loss functions of interest are non-differentiable, such as the hinge loss $\ell(y, \hat{y}) = (1 - y\hat{y})_+$ or the absolute deviation loss $\ell(y, \hat{y}) = |y - \hat{y}|$, for which most of the approaches we present in this monograph would not be applicable or would require appropriate modifications. Given the tutorial character of this monograph, we restrict ourselves to smooth functions $f$, which we consider to be a reasonably broad setting, and we refer the interested reader to appropriate references in Section id1. We refer the readers to  for a more complete description of loss functions.
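As an illustration of the empirical risk for linear models described above, the following is a minimal NumPy sketch. The function names (`square_loss`, `logistic_loss`, `empirical_risk`) are hypothetical and chosen only for this example; the rows of `X` are assumed to hold the observations.

```python
import numpy as np

def square_loss(y, y_hat):
    # Square loss: 0.5 * (y - y_hat)^2, applied elementwise.
    return 0.5 * (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    # Logistic loss log(1 + exp(-y * y_hat)) for labels y in {-1, +1},
    # evaluated with logaddexp for numerical stability.
    return np.logaddexp(0.0, -y * y_hat)

def empirical_risk(w, X, y, loss):
    # f(w) = (1/n) * sum_i loss(y_i, w^T x_i), with observations as rows of X.
    return float(np.mean(loss(y, X @ w)))
```

Both losses are smooth in the prediction, which is the setting assumed for $f$ in the rest of the text.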
Penalty or constraint?
Given our convex data-fitting term $f(\mathbf{w})$, we consider in this paper adding a convex penalty $\lambda \Omega(\mathbf{w})$. Within such a convex optimization framework, this is essentially equivalent to adding a constraint of the form $\Omega(\mathbf{w}) \leq \mu$. More precisely, under weak assumptions on $f$ and $\Omega$ (on top of convexity), from Lagrange multiplier theory (see , Section 4.3) $\mathbf{w}$ is a solution of the constrained problem for a certain $\mu > 0$ if and only if it is a solution of the penalized problem for a certain $\lambda \geq 0$. Thus, the two regularization paths, i.e., the set of solutions when $\lambda$ and $\mu$ vary, are equivalent. However, there is no direct mapping between corresponding values of $\lambda$ and $\mu$. Moreover, in a machine learning context, where the parameters $\lambda$ and $\mu$ have to be selected, for example through cross-validation, the penalized formulation tends to be empirically easier to tune, as the performance is usually quite robust to small changes in $\lambda$, while it is not robust to small changes in $\mu$. Finally, we could also replace the penalization with a norm by a penalization with the squared norm. Indeed, following the same reasoning as for the non-squared norm, a penalty of the form $\lambda \Omega(\mathbf{w})^2$ is “equivalent” to a constraint of the form $\Omega(\mathbf{w})^2 \leq \mu$, which itself is equivalent to $\Omega(\mathbf{w}) \leq \mu^{1/2}$, and thus to a penalty of the form $\lambda' \Omega(\mathbf{w})$, for $\lambda' \neq \lambda$. Thus, using a squared norm, as is often done in the context of multiple kernel learning (see Section 5), does not change the regularization properties of the formulation.
3 Sparsity-Inducing Norms
In this section, we present various norms as well as their main sparsity-inducing effects. These effects may be illustrated geometrically through the singularities of the corresponding unit balls (see Figure 4).
Sparsity through the $\ell_1$-norm.
When one knows a priori that the solutions $\mathbf{w}^\star$ of problem (1) should have a few non-zero coefficients, $\Omega$ is often chosen to be the $\ell_1$-norm, i.e., $\Omega(\mathbf{w}) = \sum_{j=1}^p |w_j|$. This leads for instance to the Lasso or basis pursuit with the square loss and to $\ell_1$-regularized logistic regression (see, for instance, [76, 128]) with the logistic loss. Regularizing by the $\ell_1$-norm is known to induce sparsity in the sense that a number of coefficients of $\mathbf{w}^\star$, depending on the strength of the regularization, will be exactly equal to zero.
In some situations, the coefficients of $\mathbf{w}^\star$ are naturally partitioned in subsets, or groups, of variables. This is typically the case when working with ordinal variables. (Ordinal variables are integer-valued variables encoding levels of a certain feature, such as levels of severity of a certain symptom in a biomedical application, where the values do not correspond to an intrinsic linear scale: in that case it is common to introduce a vector of binary variables, each encoding a specific level of the symptom, that collectively encodes this single feature.) It is then natural to select or remove simultaneously all the variables forming a group. A regularization norm explicitly exploiting this group structure, or $\ell_1$-group norm, can be shown to improve the prediction performance and/or interpretability of the learned models [62, 84, 107, 117, 142, 156]. Arguably the simplest group norm is the so-called $\ell_1/\ell_2$-norm:
$$\Omega(\mathbf{w}) := \sum_{g \in \mathcal{G}} d_g \|\mathbf{w}_g\|_2, \qquad (2)$$
where $\mathcal{G}$ is a partition of $\{1,\dots,p\}$, $(d_g)_{g \in \mathcal{G}}$ are some strictly positive weights, and $\mathbf{w}_g$ denotes the vector in $\mathbb{R}^{|g|}$ recording the coefficients of $\mathbf{w}$ indexed by $g$ in $\mathcal{G}$. Without loss of generality we may assume all weights $(d_g)_{g \in \mathcal{G}}$ to be equal to one (when $\mathcal{G}$ is a partition, we can rescale the values of $\mathbf{w}$ appropriately). As defined in Eq. (2), $\Omega$ is known as a mixed $\ell_1/\ell_2$-norm. It behaves like an $\ell_1$-norm on the vector $(\|\mathbf{w}_g\|_2)_{g \in \mathcal{G}}$ in $\mathbb{R}^{|\mathcal{G}|}$, and therefore, $\Omega$ induces group sparsity. In other words, each $\|\mathbf{w}_g\|_2$, and equivalently each $\mathbf{w}_g$, is encouraged to be set to zero. On the other hand, within the groups $g$ in $\mathcal{G}$, the $\ell_2$-norm does not promote sparsity. Combined with the square loss, it leads to the group Lasso formulation [142, 156]. Note that when $\mathcal{G}$ is the set of singletons, we retrieve the $\ell_1$-norm. More general mixed $\ell_1/\ell_q$-norms for $q > 1$ are also used in the literature (using $q = 1$ leads to a weighted $\ell_1$-norm with no group-sparsity effects):
$$\Omega(\mathbf{w}) = \sum_{g \in \mathcal{G}} d_g \|\mathbf{w}_g\|_q := \sum_{g \in \mathcal{G}} d_g \Big\{ \sum_{j \in g} |w_j|^q \Big\}^{1/q}.$$
In practice though, the $\ell_1/\ell_2$- and $\ell_1/\ell_\infty$-settings remain the most popular ones. Note that using $\ell_\infty$-norms may have the undesired effect of favoring solutions with many components of equal magnitude (due to the extra non-differentiabilities away from zero). Grouped $\ell_1$-norms are typically used when extra knowledge is available regarding an appropriate partition, in particular in the presence of categorical variables with orthogonal encoding, for multi-task learning where joint variable selection is desired, and for multiple kernel learning (see Section 5).
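To make the mixed-norm definition concrete, here is a small sketch that evaluates a weighted $\ell_1/\ell_q$ group norm for a given list of index groups. The helper name `group_norm` and its signature are assumptions made for this illustration; groups are given as lists of zero-based indices and the weights default to one.

```python
import numpy as np

def group_norm(w, groups, weights=None, q=2):
    # Mixed l1/lq norm: sum over groups g of d_g * ||w_g||_q.
    # `groups` is a list of index lists; `weights` holds the d_g (default: all ones).
    w = np.asarray(w, dtype=float)
    if weights is None:
        weights = [1.0] * len(groups)
    return float(sum(d * np.linalg.norm(w[list(g)], ord=q)
                     for g, d in zip(groups, weights)))
```

With singleton groups the function returns the $\ell_1$-norm, in line with the remark above, and passing `q=np.inf` gives the $\ell_1/\ell_\infty$ variant.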
Norms for overlapping groups: a direct formulation.
In an attempt to better encode structural links between variables at play (e.g., spatial or hierarchical links related to the physics of the problem at hand), recent research has explored the setting where $\mathcal{G}$ in Eq. (2) can contain groups of variables that overlap [9, 65, 67, 74, 121, 157]. In this case, if the groups span the entire set of variables, $\Omega$ is still a norm, and it yields sparsity in the form of specific patterns of variables. More precisely, the solutions $\mathbf{w}^\star$ of problem (1) can be shown to have a set of zero coefficients, or simply zero pattern, that corresponds to a union of some groups $g$ in $\mathcal{G}$. This property makes it possible to control the sparsity patterns of $\mathbf{w}^\star$ by appropriately defining the groups in $\mathcal{G}$. Note that here the weights $d_g$ should not be taken equal to one (see  for more details). This form of structured sparsity has notably proven to be useful in various contexts, which we now illustrate through concrete examples:
One-dimensional sequence: Given $p$ variables organized in a sequence, if we want to select only contiguous nonzero patterns, we represent in Figure 1 the set of groups $\mathcal{G}$ to consider. In this case, we have $|\mathcal{G}| = O(p)$. Imposing the contiguity of the nonzero patterns is for instance relevant in the context of time series, or for the diagnosis of tumors, based on the profiles of arrayCGH. Indeed, because of the specific spatial organization of bacterial artificial chromosomes along the genome, the set of discriminative features is expected to have specific contiguous patterns.
Two-dimensional grid: In the same way, assume now that the $p$ variables are organized on a two-dimensional grid. If we want the possible nonzero patterns to be the set of all rectangles on this grid, the appropriate groups $\mathcal{G}$ to consider can be shown (see ) to be those represented in Figure 2. In this setting, we have $|\mathcal{G}| = O(\sqrt{p})$. Sparsity-inducing regularizations built upon such group structures have resulted in good performances for background subtraction [63, 87, 89], topographic dictionary learning [73, 89], wavelet-based denoising, and for face recognition with corruption by occlusions.
Hierarchical structure: A third interesting example assumes that the variables have a hierarchical structure. Specifically, we consider that the $p$ variables correspond to the nodes of a tree $\mathcal{T}$ (or a forest of trees). Moreover, we assume that we want to select the variables according to a certain order: a feature can be selected only if all its ancestors in $\mathcal{T}$ are already selected. This hierarchical rule can be shown to lead to the family of groups displayed in Figure 3.
This resulting penalty was first used in ; since then, this group structure has led to numerous applications, for instance, wavelet-based denoising [15, 63, 70, 157], hierarchical dictionary learning for both topic modeling and image restoration [69, 70], log-linear models for the selection of potential orders of interaction in a probabilistic graphical model, bioinformatics, to exploit the tree structure of gene networks for multi-task regression, and multi-scale mining of fMRI data for the prediction of some cognitive task. More recently, this hierarchical penalty was shown to be effective for template selection in natural language processing.
Extensions: The possible choices for the sets of groups $\mathcal{G}$ are not limited to the aforementioned examples. More complicated topologies can be considered, for instance, three-dimensional spaces discretized in cubes or spherical volumes discretized in slices; see  for an application to neuroimaging that pursues this idea. Moreover, directed acyclic graphs that extend the trees presented in Figure 3 have notably proven to be useful in the context of hierarchical variable selection [9, 121, 157].
Norms for overlapping groups: a latent variable formulation.
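The family of norms defined in Eq. (2) is adapted to intersection-closed sets of nonzero patterns. However, some applications exhibit structures that can be more naturally modelled by union-closed families of supports. This idea was developed in [65, 106] where, given a set of groups $\mathcal{G}$, the following latent group Lasso norm was proposed:
$$\Omega_{\text{union}}(\mathbf{w}) := \min_{\mathbf{v} \in \mathbb{R}^{p \times |\mathcal{G}|}} \sum_{g \in \mathcal{G}} d_g \|\mathbf{v}^g\|_q \quad \text{s.t.} \quad \sum_{g \in \mathcal{G}} \mathbf{v}^g = \mathbf{w}, \quad \forall g \in \mathcal{G},\ v^g_j = 0 \text{ if } j \notin g.$$
The idea is to introduce latent parameter vectors $\mathbf{v}^g$ constrained each to be supported on the corresponding group $g$, which should explain $\mathbf{w}$ linearly and which are themselves regularized by a usual $\ell_1/\ell_q$-norm. $\Omega_{\text{union}}$ reduces to the usual $\ell_1/\ell_q$-norm when groups are disjoint and therefore provides a different generalization of the latter to the case of overlapping groups than the norm considered in the previous paragraphs. In fact, it is easy to see that solving Eq. (1) with the norm $\Omega_{\text{union}}$ is equivalent to solving
$$\min_{(\mathbf{v}^g)_{g \in \mathcal{G}}} f\Big(\sum_{g \in \mathcal{G}} \mathbf{v}^g\Big) + \lambda \sum_{g \in \mathcal{G}} d_g \|\mathbf{v}^g\|_q \quad \text{s.t.} \quad \forall g \in \mathcal{G},\ v^g_j = 0 \text{ if } j \notin g,$$
and setting $\mathbf{w} = \sum_{g \in \mathcal{G}} \mathbf{v}^g$. This last equation shows that using the norm $\Omega_{\text{union}}$ can be interpreted as implicitly duplicating the variables belonging to several groups and regularizing with a weighted $\ell_1/\ell_q$-norm for disjoint groups in the expanded space. It should be noted that a careful choice of the weights is much more important in the situation of overlapping groups than in the case of disjoint groups, as it influences possible sparsity patterns.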
This latent variable formulation pushes some of the vectors $\mathbf{v}^g$ to zero while keeping others with no zero components, hence leading to a vector $\mathbf{w}$ with a support which is in general the union of the selected groups. Interestingly, it can be seen as a convex relaxation of a non-convex penalty encouraging similar sparsity patterns which was introduced by . Moreover, this norm can also be interpreted as a particular case of the family of atomic norms, which were recently introduced by .
Graph Lasso. One type of a priori knowledge commonly encountered takes the form of a graph defined on the set of input variables, which is such that connected variables are more likely to be simultaneously relevant or irrelevant; this type of prior is common in genomics where regulation, co-expression or interaction networks between genes (or their expression level) used as predictors are often available. To favor the selection of neighbors of a selected variable, it is possible to consider the edges of the graph as groups in the previous formulation (see [65, 112]).
Patterns consisting of a small number of intervals. A quite similar situation occurs when one knows a priori (typically for variables forming sequences such as time series, strings or polymers) that the support should consist of a small number of connected subsequences. In that case, one can consider the sets of variables forming connected subsequences (or connected subsequences of length at most $k$) as the overlapping groups.
Multiple kernel learning.
For most of the sparsity-inducing terms described in this paper, we may replace real variables and their absolute values by pre-defined groups of variables with their Euclidean norms (we have already seen such examples with $\ell_1/\ell_2$-norms), or more generally, by members of reproducing kernel Hilbert spaces. As shown in Section 5, most of the tools that we present in this paper are applicable to this case as well, through appropriate modifications and borrowing of tools from kernel methods. These tools have applications in particular in multiple kernel learning. Note that this extension requires tools from convex analysis presented in Section 4.
In learning problems on matrices, such as matrix completion, the rank plays a similar role to the cardinality of the support for vectors. Indeed, the rank of a matrix $\mathbf{M}$ may be seen as the number of non-zero singular values of $\mathbf{M}$. The rank of $\mathbf{M}$ however is not a continuous function of $\mathbf{M}$, and, following the convex relaxation of the $\ell_0$-pseudo-norm into the $\ell_1$-norm, we may relax the rank of $\mathbf{M}$ into the sum of its singular values, which happens to be a norm, and is often referred to as the trace norm or nuclear norm of $\mathbf{M}$, and which we denote by $\|\mathbf{M}\|_*$. As shown in this paper, many of the tools designed for the $\ell_1$-norm may be extended to the trace norm. Using the trace norm as a convex surrogate for rank has many applications in control theory, matrix completion [1, 131], multi-task learning, or multi-label classification, where low-rank priors are adapted.
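The analogy between rank and trace norm on one side, and $\ell_0$ and $\ell_1$ on the other, can be checked numerically. The sketch below (with the hypothetical helper names `trace_norm` and `numerical_rank`, and an arbitrary tolerance `tol`) computes both quantities from the singular values.

```python
import numpy as np

def trace_norm(M):
    # Sum of the singular values of M (the l1-norm of the spectrum).
    return float(np.linalg.svd(M, compute_uv=False).sum())

def numerical_rank(M, tol=1e-10):
    # Number of singular values above a tolerance (the "l0" of the spectrum).
    return int(np.sum(np.linalg.svd(M, compute_uv=False) > tol))
```

For a rank-one matrix $\mathbf{a}\mathbf{b}^\top$ the trace norm equals $\|\mathbf{a}\|_2 \|\mathbf{b}\|_2$; NumPy also exposes the same quantity as `np.linalg.norm(M, 'nuc')`.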
Sparsity-inducing properties: a geometrical intuition.
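Although we consider in Eq. (1) a regularized formulation, we could equivalently focus, as discussed in the paragraph on penalties and constraints, on a constrained problem, that is,
$$\min_{\mathbf{w} \in \mathbb{R}^p} f(\mathbf{w}) \quad \text{such that} \quad \Omega(\mathbf{w}) \leq \mu, \qquad (4)$$
for some $\mu \in \mathbb{R}_+$. The set of solutions of Eq. (4) parameterized by $\mu$ is the same as that of Eq. (1), as described by some value of $\lambda_\mu$ depending on $\mu$ (e.g., see Section 3.2 in ). At optimality, the negative gradient of $f$ evaluated at any solution $\hat{\mathbf{w}}$ of (4) is known to belong to the normal cone of $\mathcal{B} = \{\mathbf{w} \in \mathbb{R}^p;\ \Omega(\mathbf{w}) \leq \mu\}$ at $\hat{\mathbf{w}}$. In other words, for sufficiently small values of $\mu$, i.e., so that the constraint is active, the level set of $f$ for the value $f(\hat{\mathbf{w}})$ is tangent to $\mathcal{B}$.
As a consequence, the geometry of the ball $\mathcal{B}$ is directly related to the properties of the solutions $\hat{\mathbf{w}}$. If $\Omega$ is taken to be the $\ell_2$-norm, then the resulting ball $\mathcal{B}$ is the standard, isotropic, “round” ball that does not favor any specific direction of the space. On the other hand, when $\Omega$ is the $\ell_1$-norm, $\mathcal{B}$ corresponds to a diamond-shaped pattern in two dimensions, and to a pyramid in three dimensions. In particular, $\mathcal{B}$ is anisotropic and exhibits some singular points due to the extra non-smoothness of $\Omega$. Moreover, these singular points are located along the axes of $\mathbb{R}^p$, so that if the level set of $f$ happens to be tangent at one of those points, sparse solutions are obtained. We display in Figure 4 the balls $\mathcal{B}$ for the $\ell_1$-, $\ell_2$-, and two different grouped $\ell_1/\ell_2$-norms.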
The design of sparsity-inducing norms is an active field of research and similar tools to the ones we present here can be derived for other norms. As shown in Section id1, computing the proximal operator readily leads to efficient algorithms, and for the extensions we present below, these operators can be efficiently computed.
In order to impose prior knowledge on the support of the predictor, the norms based on overlapping $\ell_1/\ell_\infty$-norms can be shown to be convex relaxations of submodular functions of the support, and further ties can be made between convex optimization and combinatorial optimization (see  for more details). Moreover, similar developments may be carried through for norms which try to enforce that the predictors have many equal components and that the resulting clusters have specific shapes, e.g., contiguous in a pre-defined order; see some examples in Section id1, and, e.g., [11, 33, 87, 135, 144] and references therein.
4 Optimization Tools
The tools used in this paper are relatively basic and should be accessible to a broad audience. Most of them can be found in classical books on convex optimization [18, 20, 25, 105], but for self-containedness, we present here a few of them related to non-smooth unconstrained optimization. In particular, these tools allow the derivation of rigorous approximate optimality conditions based on duality gaps (instead of relying on weak stopping criteria based on small changes or low-norm gradients).
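Given a convex function $g: \mathbb{R}^p \to \mathbb{R}$ and a vector $\mathbf{w}$ in $\mathbb{R}^p$, let us define the subdifferential of $g$ at $\mathbf{w}$ as
$$\partial g(\mathbf{w}) := \{\mathbf{z} \in \mathbb{R}^p \mid g(\mathbf{w}) + \mathbf{z}^\top(\mathbf{w}' - \mathbf{w}) \leq g(\mathbf{w}') \text{ for all vectors } \mathbf{w}' \in \mathbb{R}^p\}.$$
The elements of $\partial g(\mathbf{w})$ are called the subgradients of $g$ at $\mathbf{w}$. Note that all convex functions defined on $\mathbb{R}^p$ have non-empty subdifferentials at every point. This definition admits a clear geometric interpretation: any subgradient $\mathbf{z}$ in $\partial g(\mathbf{w})$ defines an affine function $\mathbf{w}' \mapsto g(\mathbf{w}) + \mathbf{z}^\top(\mathbf{w}' - \mathbf{w})$ which is tangent to the graph of the function $g$ (because of the convexity of $g$, it is a lower-bounding tangent). Moreover, there is a bijection (one-to-one correspondence) between such “tangent affine functions” and the subgradients, as illustrated in Figure 5.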
Subdifferentials are useful for studying nonsmooth optimization problems because of the following proposition (whose proof is straightforward from the definition):
Proposition .1 (Subgradients at Optimality)
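For any convex function $g: \mathbb{R}^p \to \mathbb{R}$, a point $\mathbf{w}$ in $\mathbb{R}^p$ is a global minimum of $g$ if and only if the condition $\mathbf{0} \in \partial g(\mathbf{w})$ holds.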
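Note that the concept of subdifferential is mainly useful for nonsmooth functions. If $g$ is differentiable at $\mathbf{w}$, the set $\partial g(\mathbf{w})$ is indeed the singleton $\{\nabla g(\mathbf{w})\}$, where $\nabla g(\mathbf{w})$ is the gradient of $g$ at $\mathbf{w}$, and the condition $\mathbf{0} \in \partial g(\mathbf{w})$ reduces to the classical first-order optimality condition $\nabla g(\mathbf{w}) = \mathbf{0}$. As a simple example, let us consider the following optimization problem
$$\min_{w \in \mathbb{R}} \frac{1}{2}(x - w)^2 + \lambda |w|.$$
Applying the previous proposition and noting that the subdifferential $\partial |\cdot|$ is $\{+1\}$ for $w > 0$, $\{-1\}$ for $w < 0$ and $[-1, 1]$ for $w = 0$, it is easy to show that the unique solution admits a closed form called the soft-thresholding operator, following a terminology introduced in ; it can be written
$$w^\star = \begin{cases} 0 & \text{if } |x| \leq \lambda, \\ \big(1 - \frac{\lambda}{|x|}\big) x & \text{otherwise,} \end{cases} \qquad (5)$$
or equivalently $w^\star = \operatorname{sign}(x)(|x| - \lambda)_+$, where $\operatorname{sign}(x)$ is equal to $1$ if $x > 0$, $-1$ if $x < 0$ and $0$ if $x = 0$. This operator is a core component of many optimization techniques for sparse estimation, as we shall see later. Its counterpart for non-convex optimization problems is the hard-thresholding operator. Both of them are presented in Figure 6. Note that similar developments could be carried through using directional derivatives instead of subgradients (see, e.g., ).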
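The closed form of the soft-thresholding operator is short enough to state in code. The following sketch applies it elementwise with NumPy; the function name `soft_threshold` is an assumption of this example.

```python
import numpy as np

def soft_threshold(x, lam):
    # Elementwise solution of min_w 0.5 * (x - w)^2 + lam * |w|:
    # sign(x) * max(|x| - lam, 0).
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

Entries whose magnitude is at most `lam` are set exactly to zero, while the others are shrunk towards zero by `lam`; this is the mechanism by which the $\ell_1$ penalty produces exact zeros.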
Dual norm and optimality conditions.
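The next concept we introduce is the dual norm, which is important for studying sparsity-inducing regularizations [9, 67, 100]. It notably arises in the analysis of estimation bounds, and in the design of working-set strategies as will be shown in Section 15. The dual norm $\Omega^*$ of the norm $\Omega$ is defined for any vector $\mathbf{z}$ in $\mathbb{R}^p$ by
$$\Omega^*(\mathbf{z}) := \max_{\mathbf{w} \in \mathbb{R}^p} \mathbf{z}^\top \mathbf{w} \quad \text{such that} \quad \Omega(\mathbf{w}) \leq 1. \qquad (6)$$
Moreover, the dual norm of $\Omega^*$ is $\Omega$ itself, and as a consequence, the formula above holds also if the roles of $\Omega$ and $\Omega^*$ are exchanged. It is easy to show that in the case of an $\ell_q$-norm, $q \in [1, +\infty]$, the dual norm is the $\ell_{q'}$-norm, with $q'$ in $[1, +\infty]$ such that $\frac{1}{q} + \frac{1}{q'} = 1$. In particular, the $\ell_1$- and $\ell_\infty$-norms are dual to each other, and the $\ell_2$-norm is self-dual (dual to itself).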
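The duality between $\ell_q$- and $\ell_{q'}$-norms with $1/q + 1/q' = 1$ can be illustrated with a few lines of NumPy. The names `dual_exponent` and `dual_norm_lq` below are hypothetical helpers introduced only for this sketch.

```python
import numpy as np

def dual_exponent(q):
    # Conjugate exponent q' with 1/q + 1/q' = 1, for q in [1, +inf].
    if q == 1:
        return np.inf
    if np.isinf(q):
        return 1.0
    return q / (q - 1.0)

def dual_norm_lq(z, q):
    # Dual norm of the l_q-norm evaluated at z, i.e. the l_{q'}-norm of z.
    return float(np.linalg.norm(np.asarray(z, dtype=float), ord=dual_exponent(q)))
```

By definition of the dual norm, $\mathbf{z}^\top \mathbf{w} \leq \Omega^*(\mathbf{z})$ for every $\mathbf{w}$ with $\Omega(\mathbf{w}) \leq 1$, which is Hölder's inequality in this case.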
Proposition .2 (Optimality Conditions for Eq. (1))
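Let us consider problem (1) where $\Omega$ is a norm on $\mathbb{R}^p$. A vector $\mathbf{w}$ in $\mathbb{R}^p$ is optimal if and only if $-\frac{1}{\lambda} \nabla f(\mathbf{w}) \in \partial \Omega(\mathbf{w})$ with
$$\partial \Omega(\mathbf{w}) = \begin{cases} \{\mathbf{z} \in \mathbb{R}^p;\ \Omega^*(\mathbf{z}) \leq 1\} & \text{if } \mathbf{w} = \mathbf{0}, \\ \{\mathbf{z} \in \mathbb{R}^p;\ \Omega^*(\mathbf{z}) = 1 \text{ and } \mathbf{z}^\top \mathbf{w} = \Omega(\mathbf{w})\} & \text{otherwise.} \end{cases} \qquad (7)$$
Computing the subdifferential of a norm is a classical course exercise and its proof will be presented in the next section, in Remark .1. As a consequence, the vector $\mathbf{0}$ is a solution if and only if $\Omega^*(\nabla f(\mathbf{0})) \leq \lambda$. Note that this shows that for all $\lambda$ larger than $\Omega^*(\nabla f(\mathbf{0}))$, $\mathbf{w} = \mathbf{0}$ is a solution of the regularized optimization problem (hence this value is the start of the non-trivial regularization path).
These general optimality conditions can be specialized to the Lasso problem
$$\min_{\mathbf{w} \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_1, \qquad (8)$$
where $\mathbf{y}$ is in $\mathbb{R}^n$, and $\mathbf{X}$ is a design matrix in $\mathbb{R}^{n \times p}$. With Eq. (7) in hand, we can now derive necessary and sufficient optimality conditions: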
Proposition .3 (Optimality Conditions for the Lasso)
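A vector $\mathbf{w}$ is a solution of the Lasso problem (8) if and only if
$$\forall j = 1,\dots,p, \quad \begin{cases} |\mathbf{X}_j^\top(\mathbf{y} - \mathbf{X}\mathbf{w})| \leq n\lambda & \text{if } w_j = 0, \\ \mathbf{X}_j^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = n\lambda \operatorname{sign}(w_j) & \text{if } w_j \neq 0, \end{cases} \qquad (9)$$
where $\mathbf{X}_j$ denotes the $j$-th column of $\mathbf{X}$, and $w_j$ the $j$-th entry of $\mathbf{w}$.
Proof. We apply Proposition .2. The condition $-\frac{1}{\lambda} \nabla f(\mathbf{w}) \in \partial \|\mathbf{w}\|_1$ can be rewritten: $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) \in n\lambda\, \partial \|\mathbf{w}\|_1$, which is equivalent to: (i) if $\mathbf{w} = \mathbf{0}$, $\|\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})\|_\infty \leq n\lambda$ (using the fact that the $\ell_\infty$-norm is dual to the $\ell_1$-norm); (ii) if $\mathbf{w} \neq \mathbf{0}$, $\|\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})\|_\infty = n\lambda$ and $\mathbf{w}^\top \mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = n\lambda \|\mathbf{w}\|_1$. It is then easy to check that these conditions are equivalent to Eq. (9).
As we will see in Section 16, it is possible to derive from these conditions interesting properties of the Lasso, as well as efficient algorithms for solving it. We have presented a useful duality tool for norms. More generally, there exists a related concept for convex functions, which we now introduce.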
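The Lasso optimality conditions can be turned into a simple numerical check. The sketch below assumes the Lasso objective $\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$; the function names `lasso_kkt_violation` and `lasso_lambda_max` are hypothetical. The first returns the largest violation of the coordinate-wise conditions, and the second returns the smallest $\lambda$ for which $\mathbf{w} = \mathbf{0}$ is a solution, namely $\|\mathbf{X}^\top \mathbf{y}\|_\infty / n$.

```python
import numpy as np

def lasso_kkt_violation(w, X, y, lam):
    # Largest violation of the Lasso optimality conditions at w.
    n = X.shape[0]
    corr = X.T @ (y - X @ w)          # X_j^T (y - X w) for every column j
    active = w != 0
    viol_active = np.abs(corr - n * lam * np.sign(w))     # must vanish on the support
    viol_zero = np.maximum(np.abs(corr) - n * lam, 0.0)   # |corr_j| <= n*lam off the support
    return float(np.max(np.where(active, viol_active, viol_zero)))

def lasso_lambda_max(X, y):
    # Dual norm of the gradient of the data-fitting term at w = 0.
    return float(np.max(np.abs(X.T @ y)) / X.shape[0])
```

A return value of zero (up to numerical tolerance) from `lasso_kkt_violation` certifies optimality of `w`; in practice one would compare it to a small threshold.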
4.1 Fenchel Conjugate and Duality Gaps
Let us denote by $f^*$ the Fenchel conjugate of $f$, defined by
$$f^*(\mathbf{z}) := \sup_{\mathbf{w} \in \mathbb{R}^p} [\mathbf{z}^\top \mathbf{w} - f(\mathbf{w})].$$
Fenchel conjugates are particularly useful to derive dual problems and duality gaps. (For many of our norms, conic duality tools would suffice; see, e.g., .) Under mild conditions, the conjugate of the conjugate of a convex function is itself, leading to the following representation of $f$ as a maximum of affine functions:
$$f(\mathbf{w}) = \max_{\mathbf{z} \in \mathbb{R}^p} [\mathbf{z}^\top \mathbf{w} - f^*(\mathbf{z})].$$
In the context of this tutorial, it is notably useful to specify the expression of the conjugate of a norm. Perhaps surprisingly and misleadingly, the conjugate of a norm is not equal to its dual norm, but corresponds instead to the indicator function of the unit ball of its dual norm. More formally, let us introduce the indicator function $\iota_{\Omega^*}$ such that $\iota_{\Omega^*}(\mathbf{z})$ is equal to $0$ if $\Omega^*(\mathbf{z}) \leq 1$ and $+\infty$ otherwise. Then, we have the following well-known result, which appears in several textbooks (e.g., see Example 3.26 in ):
Proposition .4 (Fenchel Conjugate of a Norm)
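Let $\Omega$ be a norm on $\mathbb{R}^p$. The following equality holds for any $\mathbf{z} \in \mathbb{R}^p$:
$$\sup_{\mathbf{w} \in \mathbb{R}^p} [\mathbf{z}^\top \mathbf{w} - \Omega(\mathbf{w})] = \iota_{\Omega^*}(\mathbf{z}) = \begin{cases} 0 & \text{if } \Omega^*(\mathbf{z}) \leq 1, \\ +\infty & \text{otherwise.} \end{cases}$$
Proof. On the one hand, assume that the dual norm of $\mathbf{z}$ is greater than one, that is, $\Omega^*(\mathbf{z}) > 1$. According to the definition of the dual norm (see Eq. (6)), and since the supremum is taken over the compact set $\{\mathbf{w} \in \mathbb{R}^p;\ \Omega(\mathbf{w}) \leq 1\}$, there exists a vector $\mathbf{w}$ in this ball such that $\Omega^*(\mathbf{z}) = \mathbf{z}^\top \mathbf{w} > 1$. For any scalar $t \geq 0$, consider $\mathbf{v} = t\mathbf{w}$ and notice that
$$\mathbf{z}^\top \mathbf{v} - \Omega(\mathbf{v}) = t[\mathbf{z}^\top \mathbf{w} - \Omega(\mathbf{w})] \geq t[\Omega^*(\mathbf{z}) - 1],$$
which grows without bound as $t \to +\infty$ and shows that when $\Omega^*(\mathbf{z}) > 1$, the Fenchel conjugate is unbounded. Now, assume that $\Omega^*(\mathbf{z}) \leq 1$. By applying the generalized Cauchy-Schwarz inequality, we obtain for any $\mathbf{w}$
$$\mathbf{z}^\top \mathbf{w} - \Omega(\mathbf{w}) \leq \Omega^*(\mathbf{z})\, \Omega(\mathbf{w}) - \Omega(\mathbf{w}) \leq 0.$$
Equality holds for $\mathbf{w} = \mathbf{0}$, and the conclusion follows.
An important and useful duality result is the so-called Fenchel-Young inequality (see ), which we will shortly illustrate geometrically: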
Proposition .5 (Fenchel-Young Inequality)
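Let $\mathbf{w}$ be a vector in $\mathbb{R}^p$, $f$ be a function on $\mathbb{R}^p$, and $\mathbf{z}$ be a vector in the domain of $f^*$ (which we assume non-empty). We have then the following inequality
$$f(\mathbf{w}) + f^*(\mathbf{z}) \geq \mathbf{w}^\top \mathbf{z},$$
with equality if and only if $\mathbf{z}$ is in $\partial f(\mathbf{w})$.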
We can now illustrate geometrically the duality principle between a function and its Fenchel conjugate in Figure 4.1.
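Remark .1. With Proposition .4 in place, we can formally (and easily) prove the relationship in Eq. (7) that makes explicit the subdifferential of a norm. Based on Proposition .4, we indeed know that the conjugate of $\Omega$ is $\iota_{\Omega^*}$. Applying the Fenchel-Young inequality (Proposition .5), we have
$$\mathbf{z} \in \partial \Omega(\mathbf{w}) \iff \big[\mathbf{z}^\top \mathbf{w} = \Omega(\mathbf{w}) + \iota_{\Omega^*}(\mathbf{z})\big],$$
which leads to the desired conclusion.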
For many objective functions, the Fenchel conjugate admits closed forms, and can therefore be computed efficiently . Then, it is possible to derive a duality gap for problem (1) from standard Fenchel duality arguments (see ), as shown in the following proposition:
Proposition .6 (Duality for Problem (1))
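If $f^*$ and $\Omega^*$ are respectively the Fenchel conjugate of a convex and differentiable function $f$ and the dual norm of $\Omega$, then we have
$$\max_{\mathbf{z} \in \mathbb{R}^p:\ \Omega^*(\mathbf{z}) \leq \lambda} -f^*(\mathbf{z}) \ \leq\ \min_{\mathbf{w} \in \mathbb{R}^p} f(\mathbf{w}) + \lambda \Omega(\mathbf{w}). \qquad (10)$$
Moreover, equality holds as soon as the domain of $f$ has non-empty interior.
This result is a specific instance of Theorem 3.3.5 in . In particular, we use the fact that the conjugate of a norm $\Omega$ is the indicator function $\iota_{\Omega^*}$ of the unit ball of the dual norm $\Omega^*$ (see Proposition .4). If $\mathbf{w}^\star$ is a solution of Eq. (1), and $\mathbf{w}, \mathbf{z}$ in $\mathbb{R}^p$ are such that $\Omega^*(\mathbf{z}) \leq \lambda$, this proposition implies that we have
$$f(\mathbf{w}) + \lambda \Omega(\mathbf{w}) \ \geq\ f(\mathbf{w}^\star) + \lambda \Omega(\mathbf{w}^\star) \ \geq\ -f^*(\mathbf{z}). \qquad (11)$$
The difference between the left and right term of Eq. (11) is called a duality gap. It represents the difference between the value of the primal objective function $f(\mathbf{w}) + \lambda \Omega(\mathbf{w})$ and a dual objective function $-f^*(\mathbf{z})$, where $\mathbf{z}$ is a dual variable. The proposition says that the duality gap for a pair of optima $\mathbf{w}^\star$ and $\mathbf{z}^\star$ of the primal and dual problem is equal to $0$. When the optimal duality gap is zero one says that strong duality holds. In our situation, the duality gap for the pair of primal/dual problems in Eq. (10) may be decomposed as the sum of two non-negative terms (as a consequence of the Fenchel-Young inequality):
$$\big(f(\mathbf{w}) + f^*(\mathbf{z}) - \mathbf{w}^\top \mathbf{z}\big) + \lambda \big(\Omega(\mathbf{w}) + \mathbf{w}^\top(\mathbf{z}/\lambda) + \iota_{\Omega^*}(\mathbf{z}/\lambda)\big).$$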
It is equal to zero if and only if the two terms are simultaneously equal to zero.
Duality gaps are important in convex optimization because they provide an upper bound on the difference between the current value of an objective function and the optimal value, which makes it possible to set proper stopping criteria for iterative optimization algorithms. Given a current iterate $\mathbf{w}$, computing a duality gap requires choosing a “good” value for $\mathbf{z}$ (and in particular a feasible one). Given that at optimality, $\mathbf{z}(\mathbf{w}^\star) = \nabla f(\mathbf{w}^\star)$ is the unique solution to the dual problem, a natural choice of dual variable is $\mathbf{z} = \min\big(1, \frac{\lambda}{\Omega^*(\nabla f(\mathbf{w}))}\big) \nabla f(\mathbf{w})$, which reduces to $\mathbf{z}(\mathbf{w}^\star)$ at the optimum and therefore yields a zero duality gap at optimality.
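Note that in most formulations that we will consider, the function $f$ is of the form $f(\mathbf{w}) = \psi(\mathbf{X}\mathbf{w})$ with $\psi: \mathbb{R}^n \to \mathbb{R}$ and $\mathbf{X} \in \mathbb{R}^{n \times p}$ a design matrix. Indeed, this corresponds to linear prediction on $\mathbb{R}^p$, given $n$ observations $\mathbf{x}^{(i)}$, $i = 1,\dots,n$, and the predictions $\mathbf{X}\mathbf{w} = (\mathbf{w}^\top \mathbf{x}^{(i)})_{i=1,\dots,n}$. Typically, the Fenchel conjugate of $\psi$ is easy to compute, while the design matrix $\mathbf{X}$ makes it hard to compute $f^*$ (it would require computing the pseudo-inverse of $\mathbf{X}$). For the least-squares loss with output vector $\mathbf{y} \in \mathbb{R}^n$, we have $\psi(\mathbf{u}) = \frac{1}{2}\|\mathbf{y} - \mathbf{u}\|_2^2$ and $\psi^*(\boldsymbol{\beta}) = \frac{1}{2}\|\boldsymbol{\beta}\|_2^2 + \boldsymbol{\beta}^\top \mathbf{y}$. For the logistic loss, we have $\psi(\mathbf{u}) = \sum_{i=1}^n \log(1 + \exp(-y_i u_i))$ and $\psi^*(\boldsymbol{\beta}) = \sum_{i=1}^n (1 + \beta_i y_i)\log(1 + \beta_i y_i) - \beta_i y_i \log(-\beta_i y_i)$ if $-\beta_i y_i \in [0, 1]$ for all $i$, and $+\infty$ otherwise. In that case, Eq. (1) can be rewritten as
$$\min_{\mathbf{w} \in \mathbb{R}^p,\ \mathbf{u} \in \mathbb{R}^n} \psi(\mathbf{u}) + \lambda \Omega(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{u} = \mathbf{X}\mathbf{w}, \qquad (12)$$
and equivalently as the optimization of the Lagrangian
$$\min_{\mathbf{w} \in \mathbb{R}^p,\ \mathbf{u} \in \mathbb{R}^n}\ \max_{\boldsymbol{\alpha} \in \mathbb{R}^n}\ \psi(\mathbf{u}) + \lambda \Omega(\mathbf{w}) + \lambda \boldsymbol{\alpha}^\top(\mathbf{X}\mathbf{w} - \mathbf{u}), \qquad (13)$$
which is obtained by introducing the Lagrange multiplier $\lambda\boldsymbol{\alpha}$ for the constraint $\mathbf{u} = \mathbf{X}\mathbf{w}$. The corresponding Fenchel dual (Fenchel conjugacy naturally extends to this case; see Theorem in  for more details) is then
$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} -\psi^*(\lambda \boldsymbol{\alpha}) \quad \text{such that} \quad \Omega^*(\mathbf{X}^\top \boldsymbol{\alpha}) \leq 1, \qquad (14)$$
which does not require any inversion of $\mathbf{X}^\top \mathbf{X}$ (which would be required for computing the Fenchel conjugate of $f$). Thus, given a candidate $\mathbf{w}$, we consider $\boldsymbol{\alpha} = \frac{1}{\lambda}\min\big(1, \frac{\lambda}{\Omega^*(\mathbf{X}^\top \nabla\psi(\mathbf{X}\mathbf{w}))}\big) \nabla\psi(\mathbf{X}\mathbf{w})$, which is feasible for (14), and can get an upper bound on optimality using primal (12) and dual (14) problems. Concrete examples of such duality gaps for various sparse regularized problems are presented in appendix D of , and are implemented in the open-source software SPAMS (http://www.di.ens.fr/willow/SPAMS/), which we have used in the experimental section of this paper.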
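As a worked instance of such a duality gap, the sketch below treats the $\ell_1$-regularized least-squares problem with $\psi(\mathbf{u}) = \frac{1}{2}\|\mathbf{y} - \mathbf{u}\|_2^2$ (so the primal objective is $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$, without the $1/n$ factor), and uses $\psi^*(\boldsymbol{\beta}) = \frac{1}{2}\|\boldsymbol{\beta}\|_2^2 + \boldsymbol{\beta}^\top\mathbf{y}$. The dual candidate is $\nabla\psi(\mathbf{X}\mathbf{w})/\lambda$ rescaled so that $\|\mathbf{X}^\top\boldsymbol{\alpha}\|_\infty \leq 1$. The function name `lasso_duality_gap` is an assumption of this example and is not taken from SPAMS.

```python
import numpy as np

def lasso_duality_gap(w, X, y, lam):
    # Primal: 0.5 * ||y - X w||^2 + lam * ||w||_1.
    # Dual:   -psi*(lam * alpha)  subject to  ||X^T alpha||_inf <= 1,
    # with psi*(beta) = 0.5 * ||beta||^2 + beta^T y.
    grad = X @ w - y                               # gradient of psi at u = X w
    dual_norm = np.max(np.abs(X.T @ grad))         # l_inf norm, dual to l_1
    scale = 1.0 if dual_norm <= lam else lam / dual_norm
    alpha = scale * grad / lam                     # feasible dual candidate
    beta = lam * alpha
    primal = 0.5 * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))
    dual = -(0.5 * beta @ beta + beta @ y)
    return float(primal - dual)
```

The returned value is non-negative for any `w` and can serve as a stopping criterion: an iterative solver may stop once the gap falls below a chosen tolerance.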
4.2 Quadratic Variational Formulation of Norms
Several variational formulations are associated with norms, the most natural one being the one that results directly from (6) applied to the dual norm:
$$\Omega(\mathbf{w}) = \max_{\mathbf{z} \in \mathbb{R}^p} \mathbf{w}^\top \mathbf{z} \quad \text{s.t.} \quad \Omega^*(\mathbf{z}) \leq 1.$$
However, another type of variational form is quite useful, especially for sparsity-inducing norms; among other purposes, as it is obtained by a variational upper-bound (as opposed to a lower-bound in the equation above), it leads to a general algorithmic scheme for learning problems regularized with this norm, in which the difficulties associated with optimizing the loss and that of optimizing the norm are partially decoupled. We present it in Section id1. We introduce this variational form first for the $\ell_1$- and $\ell_1/\ell_2$-norms and subsequently generalize it to norms that we call subquadratic norms.
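The case of the $\ell_1$- and $\ell_1/\ell_2$-norms.
The two basic variational identities we use are, for $a, b > 0$,
$$2ab = \inf_{\eta \in \mathbb{R}_+^*} \eta^{-1} a^2 + \eta b^2,$$
where the infimum is attained at $\eta = a/b$, and, for $\mathbf{a} \in \mathbb{R}_+^p$,
$$\Big(\sum_{i=1}^p a_i\Big)^2 = \inf_{\boldsymbol{\eta} \in (\mathbb{R}_+^*)^p} \sum_{i=1}^p \frac{a_i^2}{\eta_i} \quad \text{s.t.} \quad \sum_{i=1}^p \eta_i = 1.$$
The last identity is a direct consequence of the Cauchy-Schwarz inequality:
$$\sum_{i=1}^p a_i = \sum_{i=1}^p \frac{a_i}{\sqrt{\eta_i}} \cdot \sqrt{\eta_i} \ \leq\ \Big(\sum_{i=1}^p \frac{a_i^2}{\eta_i}\Big)^{1/2} \Big(\sum_{i=1}^p \eta_i\Big)^{1/2}.$$
The infima in the previous expressions can be replaced by a minimization if the function $q: \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}_+$ with $q(x, y) = \frac{x^2}{y}$ is extended in $(0, 0)$ using the convention “0/0=0”, since the resulting function is a proper closed convex function. (This extension is in fact the function $\tilde{q}: (x, y) \mapsto \min\big\{t \in \mathbb{R}_+ \mid \begin{bmatrix} t & x \\ x & y \end{bmatrix} \succeq 0\big\}$.) We will use this convention implicitly from now on. The minimum is then attained when equality holds in the Cauchy-Schwarz inequality, that is for $\sqrt{\eta_i} \propto a_i/\sqrt{\eta_i}$, which leads to $\eta_i = \frac{a_i}{\|\mathbf{a}\|_1}$ if $\mathbf{a} \neq \mathbf{0}$ and $0$ else.
Introducing the simplex $\triangle_p = \{\boldsymbol{\eta} \in \mathbb{R}_+^p \mid \sum_{i=1}^p \eta_i = 1\}$, we apply these variational forms to the $\ell_1$- and $\ell_1/\ell_2$-norms (with non overlapping groups) with $\Omega(\mathbf{w}) = \sum_{g \in \mathcal{G}} \|\mathbf{w}_g\|_2$ and $|\mathcal{G}| = m$, so that we obtain directly:
$$\|\mathbf{w}\|_1^2 = \min_{\boldsymbol{\eta} \in \triangle_p} \sum_{i=1}^p \frac{w_i^2}{\eta_i}, \qquad \|\mathbf{w}\|_1 = \min_{\boldsymbol{\eta} \in \mathbb{R}_+^p} \frac{1}{2} \sum_{i=1}^p \Big[\frac{w_i^2}{\eta_i} + \eta_i\Big],$$
$$\Omega(\mathbf{w})^2 = \min_{\boldsymbol{\eta} \in \triangle_m} \sum_{g \in \mathcal{G}} \frac{\|\mathbf{w}_g\|_2^2}{\eta_g}, \qquad \Omega(\mathbf{w}) = \min_{\boldsymbol{\eta} \in \mathbb{R}_+^m} \frac{1}{2} \sum_{g \in \mathcal{G}} \Big[\frac{\|\mathbf{w}_g\|_2^2}{\eta_g} + \eta_g\Big].$$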
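The variational identity for the squared $\ell_1$-norm, $\|\mathbf{w}\|_1^2 = \min_{\boldsymbol{\eta}} \sum_i w_i^2/\eta_i$ over the simplex, can be verified numerically. In the sketch below, `eta_star` returns the minimizer $\eta_i = |w_i|/\|\mathbf{w}\|_1$ and `quadratic_bound` evaluates the quadratic upper bound with the convention 0/0 = 0; both names are hypothetical.

```python
import numpy as np

def eta_star(w):
    # Minimizer over the simplex: eta_i = |w_i| / ||w||_1 (zero vector if w = 0).
    a = np.abs(np.asarray(w, dtype=float))
    s = a.sum()
    return a / s if s > 0 else np.zeros_like(a)

def quadratic_bound(w, eta):
    # sum_i w_i^2 / eta_i with the convention 0/0 = 0 and x/0 = +inf for x != 0.
    w = np.asarray(w, dtype=float)
    eta = np.asarray(eta, dtype=float)
    out = np.zeros_like(w)
    pos = eta > 0
    out[pos] = w[pos] ** 2 / eta[pos]
    out[(~pos) & (w != 0)] = np.inf
    return float(out.sum())
```

For a fixed $\boldsymbol{\eta}$ the bound is a weighted squared $\ell_2$-norm of $\mathbf{w}$, which is what makes this form convenient for alternating schemes that update $\mathbf{w}$ and $\boldsymbol{\eta}$ in turn.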
Quadratic variational forms for subquadratic norms.
The variational form of the $\ell_1$-norm admits a natural generalization for certain norms that we call subquadratic norms. Before we introduce them, we review a few useful properties of norms. In this section, we will denote by $|\mathbf{w}|$ the vector $(|w_1|, \dots, |w_p|)^\top$.
Definition .1 (Absolute and monotonic norm)
We say that:
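A norm $\Omega$ is absolute if for all $\mathbf{v} \in \mathbb{R}^p$, $\Omega(\mathbf{v}) = \Omega(|\mathbf{v}|)$.
A norm $\Omega$ is monotonic if for all $\mathbf{v}, \mathbf{w} \in \mathbb{R}^p$ s.t. $|v_i| \leq |w_i|$, $i = 1,\dots,p$, it holds that $\Omega(\mathbf{v}) \leq \Omega(\mathbf{w})$.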
These definitions are in fact equivalent (see, e.g., ):
A norm is monotonic if and only if it is absolute.
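Proof. If $\Omega$ is monotonic, the fact that $|\mathbf{v}| = \big||\mathbf{v}|\big|$ implies $\Omega(\mathbf{v}) = \Omega(|\mathbf{v}|)$ so that $\Omega$ is absolute.
If $\Omega$ is absolute, we first show that $\Omega^*$ is absolute. Indeed,
$$\Omega^*(\boldsymbol{\kappa}) = \max_{\mathbf{w} \in \mathbb{R}^p,\ \Omega(|\mathbf{w}|) \leq 1} \mathbf{w}^\top \boldsymbol{\kappa} = \max_{\mathbf{w} \in \mathbb{R}^p,\ \Omega(|\mathbf{w}|) \leq 1} |\mathbf{w}|^\top |\boldsymbol{\kappa}| = \Omega^*(|\boldsymbol{\kappa}|).$$
Then if $|\mathbf{v}| \leq |\mathbf{w}|$, since $\Omega^*(\boldsymbol{\kappa}) = \Omega^*(|\boldsymbol{\kappa}|)$,
$$\Omega(\mathbf{v}) = \max_{\boldsymbol{\kappa} \in \mathbb{R}^p,\ \Omega^*(|\boldsymbol{\kappa}|) \leq 1} |\mathbf{v}|^\top |\boldsymbol{\kappa}| \ \leq\ \max_{\boldsymbol{\kappa} \in \mathbb{R}^p,\ \Omega^*(|\boldsymbol{\kappa}|) \leq 1} |\mathbf{w}|^\top |\boldsymbol{\kappa}| = \Omega(\mathbf{w}),$$
which shows that $\Omega$ is monotonic.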
We now introduce a family of norms, which have recently been studied in .
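Definition .2 ($H$-norm)
Let $H$ be a compact convex subset of $\mathbb{R}_+^p$, such that $H \cap (\mathbb{R}_+^*)^p \neq \emptyset$, we say that $\Omega_H$ is an $H$-norm if $\Omega_H(\mathbf{w})^2 = \min_{\boldsymbol{\eta} \in H} \sum_{i=1}^p \frac{w_i^2}{\eta_i}$.
The next proposition shows that $\Omega_H$ is indeed a norm and characterizes its dual norm.
$\Omega_H$ is a norm and $\Omega_H^*(\boldsymbol{\kappa})^2 = \max_{\boldsymbol{\eta} \in H} \sum_{i=1}^p \eta_i \kappa_i^2$.
Proof. First, since $H$ contains at least one element whose components are all strictly positive, $\Omega_H$ is finite on $\mathbb{R}^p$. Symmetry, nonnegativity and homogeneity of $\Omega_H$ are straightforward from the definitions. Definiteness results from the fact that $H$ is bounded. $\Omega_H$ is convex, since it is obtained by minimization over $\boldsymbol{\eta}$ in a jointly convex formulation. Thus $\Omega_H$ is a norm. Finally,
$$\frac{1}{2} \Omega_H^*(\boldsymbol{\kappa})^2 = \max_{\mathbf{w} \in \mathbb{R}^p} \mathbf{w}^\top \boldsymbol{\kappa} - \frac{1}{2} \Omega_H(\mathbf{w})^2 = \max_{\mathbf{w} \in \mathbb{R}^p} \max_{\boldsymbol{\eta} \in H} \mathbf{w}^\top \boldsymbol{\kappa} - \frac{1}{2} \mathbf{w}^\top \operatorname{Diag}(\boldsymbol{\eta})^{-1} \mathbf{w}.$$
The form of the dual norm follows by maximizing w.r.t. $\mathbf{w}$.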
We finally introduce the family of norms that we call subquadratic.
Definition .3 (Subquadratic Norm)
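Let $\Omega$ and $\Omega^*$ be a pair of absolute dual norms. Let $\bar{\Omega}^*$ be the function defined as $\bar{\Omega}^*: \boldsymbol{\kappa} \mapsto [\Omega^*(|\boldsymbol{\kappa}|^{1/2})]^2$ where we use the notation $|\boldsymbol{\kappa}|^{1/2} = (|\kappa_1|^{1/2}, \dots, |\kappa_p|^{1/2})^\top$. We say that $\Omega$ is subquadratic if $\bar{\Omega}^*$ is convex.
With this definition, we have:
If $\Omega$ is subquadratic, then $\bar{\Omega}^*$ is a norm, and denoting by $\bar{\Omega}$ the dual norm of the latter, we have:
$$\Omega(\mathbf{w}) = \frac{1}{2} \min_{\boldsymbol{\eta} \in \mathbb{R}_+^p} \sum_{i=1}^p \frac{w_i^2}{\eta_i} + \bar{\Omega}(\boldsymbol{\eta}), \qquad \Omega(\mathbf{w})^2 = \min_{\boldsymbol{\eta} \in H} \sum_{i=1}^p \frac{w_i^2}{\eta_i} \quad \text{with } H = \{\boldsymbol{\eta} \in \mathbb{R}_+^p \mid \bar{\Omega}(\boldsymbol{\eta}) \leq 1\}.$$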