 # Sparse Modeling for Image and Vision Processing

In recent years, a large amount of multi-disciplinary research has been conducted on sparse models and their applications. In statistics and machine learning, the sparsity principle is used to perform model selection---that is, automatically selecting a simple model among a large collection of them. In signal processing, sparse coding consists of representing data with linear combinations of a few dictionary elements. Subsequently, the corresponding tools have been widely adopted by several scientific communities such as neuroscience, bioinformatics, or computer vision. The goal of this monograph is to offer a self-contained view of sparse modeling for visual recognition and image processing. More specifically, we focus on applications where the dictionary is learned and adapted to data, yielding a compact representation that has been successful in various contexts.


### 1.1 Early concepts of parsimony in statistics

A large number of statistical procedures can be formulated as maximum likelihood estimation. Given a statistical model with parameters $\theta$ in $\mathbb{R}^p$, it consists of minimizing with respect to $\theta$ an objective function representing the negative log-likelihood of observed data. Assuming for instance that we observe $n$ independent samples $z_1, \ldots, z_n$ of the (unknown) data distribution, we need to solve

$$
\min_{\theta \in \mathbb{R}^p} \Big[ L(\theta) \triangleq -\sum_{i=1}^n \log P_\theta(z_i) \Big], \tag{1.1}
$$

where $P_\theta$ is some probability distribution parameterized by $\theta$.

Simple methods such as ordinary least squares can be written as (1.1). Consider for instance data points $z_i$ that are pairs $(y_i, x_i)$, where $y_i$ is an observation in $\mathbb{R}$ and $x_i$ is a vector in $\mathbb{R}^p$, and assume that there exists a linear relation $y_i = x_i^\top \theta + \varepsilon_i$, where $\varepsilon_i$ is an approximation error for observation $i$. Under a model where the $\varepsilon_i$'s are independent and identically normally distributed with zero mean, Eq. (1.1) is equivalent to a least squares problem:²

²Note that the Gaussian noise assumption is not necessary to justify the ordinary least squares formulation; it is only sufficient to interpret it as maximum likelihood estimation. In fact, as long as the conditional expectation $\mathbb{E}[y\,|\,x]$ is linear, the ordinary least squares estimator is statistically consistent under mild assumptions.

$$
\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^n \frac{1}{2} \left(y_i - x_i^\top \theta\right)^2.
$$

To prevent overfitting and to improve the interpretability of the learned model, it was suggested in early work that a solution involving only a few model variables could be more appropriate than an exact solution of (1.1); in other words, a sparse solution involving only—let us say—$k$ variables might be desirable in some situations. Unfortunately, such a strategy yields two difficulties: first, it is not clear a priori how to choose $k$; second, finding the best subset of $k$ variables is NP-hard in general (Natarajan, 1995). The first issue was addressed with several criteria for controlling the trade-off between the sparsity of the solution $\theta$ and the adequacy of the fit to training data. For the second issue, approximate computational techniques have been proposed.

##### Mallows’s Cp, AIC, and BIC.

For the ordinary least squares problem, Mallows (1964, 1966) introduced the $C_p$-statistic, later generalized by Akaike (1973) with the Akaike information criterion (AIC), and then by Schwarz (1978) with the Bayesian information criterion (BIC). Using $C_p$, AIC, or BIC is equivalent to solving the following $\ell_0$-penalized maximum likelihood estimation problem

$$
\min_{\theta \in \mathbb{R}^p} L(\theta) + \lambda \|\theta\|_0, \tag{1.2}
$$

where $\lambda$ depends on the chosen criterion (see Hastie et al., 2009), and $\|\theta\|_0$ is the $\ell_0$-penalty, which counts the number of non-zero entries of $\theta$. Similar formulations have also been derived by using the minimum description length (MDL) principle for model selection (Rissanen, 1978; Barron et al., 1998). As shown by Natarajan (1995), the problem (1.2) is NP-hard, and approximate algorithms are necessary unless the number of variables $p$ is very small.

##### Forward selection and best subset selection for least squares.

To obtain an approximate solution of (1.2), a classical approach is the forward selection technique, a greedy algorithm that solves a sequence of maximum likelihood estimation problems computed on a subset of variables. After every iteration, a new variable is added to the subset according to the chosen sparsity criterion in a greedy manner. Some variants allow backward steps—that is, a variable can possibly exit the active subset after an iteration. The algorithm is presented in more detail in Section 5.1 and seems to be due to Efroymson (1960), according to Hocking (1976). Other approaches considered in the 70's also include the leaps and bounds technique of Furnival and Wilson (1974), a branch-and-bound algorithm providing the exact solution of (1.2) with exponential worst-case complexity.
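As an illustration of the greedy principle, here is a minimal Python sketch for the least-squares case (the function name and the correlation-based selection rule are our simplifications; real implementations such as those discussed in Section 5.1 use efficient rank-one updates):

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy forward selection sketch: at each step, add the variable most
    correlated with the current residual, then refit least squares on the
    active set."""
    n, p = X.shape
    active = []
    residual = y.copy()
    for _ in range(k):
        # score each remaining variable by |correlation| with the residual
        scores = [abs(X[:, j] @ residual) if j not in active else -np.inf
                  for j in range(p)]
        active.append(int(np.argmax(scores)))
        # refit least squares on the current active set
        theta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ theta
    return active, theta
```

The sketch recomputes a full least-squares fit at every step for clarity; a practical implementation would update the QR or Cholesky factorization incrementally.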

### 1.2 Wavelets in signal processing

In signal processing, problems similar to those in statistics have been studied in the context of wavelets. In a nutshell, a wavelet basis represents a set of functions that are essentially dilated and shifted versions of each other. Unlike the Fourier basis, wavelets have the interesting property of being localized both in the space and frequency domains, and of being suitable for multi-resolution analysis of signals (Mallat, 1989).

The concept of parsimony is central to wavelets. When a signal is “smooth” in a particular sense (see Mallat, 2008), it can be well approximated by a linear combination of a few wavelets. Specifically, the signal is close to an expansion where only a few coefficients are non-zero, and the resulting compact representation has effective applications in estimation and compression. The wavelet theory is well developed for continuous signals, e.g., in the Hilbert space $L^2(\mathbb{R})$, but also for discrete signals in $\mathbb{R}^p$, making it suitable for modern digital image processing.

Since the first wavelet was introduced by Haar (1910), much research has been devoted to designing a wavelet set that is adapted to particular signals such as natural images. After a long quest for good orthogonal bases such as the one proposed by Daubechies (1988), a series of works focused on wavelet sets whose elements are not linearly independent. This resulted in a large number of variants, such as steerable wavelets (Simoncelli et al., 1992), curvelets (Candès and Donoho, 2002), contourlets (Do and Vetterli, 2005), or bandlets (Le Pennec and Mallat, 2005). For the purpose of our monograph, one concept related to sparse estimation is particularly important; it is called wavelet thresholding.

##### Sparse estimation and wavelet thresholding.

Let us consider a discrete signal represented by a vector $x$ in $\mathbb{R}^p$ and an orthogonal wavelet basis set $D = [d_1, \ldots, d_p]$—that is, satisfying $D^\top D = I$, where $I$ is the identity matrix. Approximating $x$ by a sparse linear combination of wavelet elements can be formulated as finding a sparse vector $\alpha$ in $\mathbb{R}^p$, say with $k$ non-zero coefficients, that minimizes

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le k. \tag{1.3}
$$

The sparse decomposition problem (1.3) is an instance of the best subset selection formulation presented in Section 1.1, where $\alpha$ represents the model parameters, demonstrating that similar topics arise in statistics and signal processing. However, whereas (1.3) is NP-hard for general matrices $D$ (Natarajan, 1995), we have assumed $D$ to be orthogonal in the context of wavelets. As such, (1.3) is equivalent to

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \left\|D^\top x - \alpha\right\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le k,
$$

and admits a closed form. Let us indeed define the vector $\beta \triangleq D^\top x$ in $\mathbb{R}^p$, corresponding to the exact non-sparse decomposition of $x$ onto $D$—that is, we have $x = D\beta$ since $D$ is orthogonal. To obtain the best $k$-sparse approximation, we denote by $\mu$ the $k$-th largest value among the set $\{|\beta[1]|, \ldots, |\beta[p]|\}$; the solution $\alpha^{\mathrm{ht}}$ of (1.3) is obtained by applying to $\beta$ an operator called “hard-thresholding” and defined as

$$
\alpha^{\mathrm{ht}}[i] = \mathbf{1}_{|\beta[i]| \ge \mu}\, \beta[i] = \begin{cases} \beta[i] & \text{if } |\beta[i]| \ge \mu, \\ 0 & \text{otherwise}, \end{cases} \tag{1.4}
$$

where $\mathbf{1}_{|\beta[i]| \ge \mu}$ is the indicator function, equal to $1$ if $|\beta[i]| \ge \mu$ and $0$ otherwise. In other words, the hard-thresholding operator simply sets to zero coefficients of $\beta$ whose magnitude is below the threshold $\mu$. The corresponding procedure, called “wavelet thresholding”, is simple and effective for image denoising, even though it does not perform as well as the recent state-of-the-art techniques presented in Section 3. When an image $x$ is noisy, e.g., corrupted by white Gaussian noise, and $\mu$ is well chosen, $D\alpha^{\mathrm{ht}}$ is a good estimate of the clean original image. The terminology “hard” is defined in contrast to an important variant called the “soft-thresholding operator”, which was introduced by Donoho and Johnstone (1994) in the context of wavelets:³

³Note that the soft-thresholding operator appears in fact earlier in the statistics literature (see Efron and Morris, 1971; Bickel, 1984), but it was used there for a different purpose.

$$
\alpha^{\mathrm{st}}[i] \triangleq \operatorname{sign}(\beta[i]) \max(|\beta[i]| - \lambda, 0) = \begin{cases} \beta[i] - \lambda & \text{if } \beta[i] \ge \lambda, \\ \beta[i] + \lambda & \text{if } \beta[i] \le -\lambda, \\ 0 & \text{otherwise}, \end{cases} \tag{1.5}
$$

where $\lambda$ is a parameter playing the same role as $\mu$ in (1.4). Not only does the operator set small coefficients of $\beta$ to zero, but it also reduces the magnitude of the non-zero ones. Both operators are illustrated and compared to each other in Figure 1.1. Interestingly, whereas $\alpha^{\mathrm{ht}}$ is the solution of (1.3) when $\mu$ corresponds to the entry of $\beta$ with $k$-th largest magnitude, $\alpha^{\mathrm{st}}$ is in fact the solution of the following sparse reconstruction problem with the orthogonal matrix $D$:

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1. \tag{1.6}
$$

This formulation will be the topic of the next section for general non-orthogonal matrices. Similar to statistics, where choosing the parameter $k$ of the best subset selection was difficult, automatically selecting the best thresholds $\mu$ or $\lambda$ has been a major research topic (see, e.g., Donoho and Johnstone, 1994, 1995; Chang et al., 2000a, b).
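Both operators are one-liners in practice; here is a minimal numpy sketch (function names are ours), where hard-thresholding keeps the $k$ coefficients of largest magnitude as in (1.4) and soft-thresholding implements (1.5):

```python
import numpy as np

def hard_threshold(beta, k):
    """Hard-thresholding (1.4): keep the k coefficients of largest
    magnitude and set all others to zero."""
    alpha = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]   # indices of the k largest |beta[i]|
    alpha[keep] = beta[keep]
    return alpha

def soft_threshold(beta, lam):
    """Soft-thresholding (1.5): shrink every coefficient towards zero by
    lam, zeroing those with |beta[i]| <= lam."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```

For an orthogonal $D$, denoising then amounts to computing $\beta = D^\top x$, thresholding, and reconstructing $D\alpha$.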

##### Structured group thresholding.

Wavelet coefficients have a particular structure since the basis elements are dilated and shifted versions of each other. It is for instance possible to define neighborhood relationships for wavelets whose spatial supports are close to each other, or hierarchical relationships between wavelets with the same or similar localization but different scales. For one-dimensional signals, we present in Figure 1.2 a typical organization of wavelet coefficients on a tree, with arrows representing such relations. For two-dimensional images, the structure is slightly more involved and the coefficients are usually organized as a collection of quadtrees (see Mallat, 2008, for more details); we present such a configuration in Figure 1.3.

Figure 1.3: Wavelet coefficients displayed for the image Lena using the orthogonal basis of Daubechies (1988). A few coefficients representing a low-resolution version of the image are displayed in the top-left corner. Wavelets corresponding to this low-resolution image are obtained by filtering the original image with shifted versions of a low-pass filter called the “scaling function” or “father wavelet”. The rest of the coefficients are organized into three quadtrees (on the right, on the left, and on the diagonal). Each quadtree is obtained by filtering the original image with a wavelet at three different scales and at different positions. The value zero is represented by the grey color; negative values appear in black, and positive values in white. The wavelet decomposition and this figure have been produced with the software package matlabPyrTools developed by Eero Simoncelli and available here: http://www.cns.nyu.edu/~lcv/software.php.

A natural idea has inspired the recent concept of group sparsity that will be presented in the next section; it consists in exploiting the wavelet structure to improve thresholding estimators. Specifically, it is possible to use neighborhood relations between wavelet basis elements to define groups of coefficients that form a partition $\mathcal{G}$ of $\{1, \ldots, p\}$, and to use a group-thresholding operator (Hall et al., 1999; Cai, 1999) defined for every group $g$ in $\mathcal{G}$ as

$$
\alpha^{\mathrm{gt}}[g] \triangleq \begin{cases} \left(1 - \dfrac{\lambda}{\|\beta[g]\|_2}\right) \beta[g] & \text{if } \|\beta[g]\|_2 \ge \lambda, \\ 0 & \text{otherwise}, \end{cases} \tag{1.7}
$$

where $\beta[g]$ is the vector of size $|g|$ recording the entries of $\beta$ whose indices are in $g$. By using such an estimator, groups of neighboring coefficients are set to zero together when their joint $\ell_2$-norm falls below the threshold $\lambda$. Interestingly, even though the next interpretation does not appear in early work about group-thresholding (Hall et al., 1999; Cai, 1999), it is possible to view $\alpha^{\mathrm{gt}}$ with $\beta = D^\top x$ as the solution of the following penalized problem

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|\alpha[g]\|_2, \tag{1.8}
$$

where the closed-form solution (1.7) holds because $D$ is orthogonal (see Bach et al., 2012a). Such a formulation will be studied in the next section for general matrices.
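As a minimal illustration, the group-thresholding operator (1.7) can be sketched as follows (the function name and the encoding of $\mathcal{G}$ as a list of index arrays are ours):

```python
import numpy as np

def group_threshold(beta, groups, lam):
    """Group-thresholding (1.7): each group of coefficients is jointly
    shrunk, or jointly set to zero when its l2-norm falls below lam.
    `groups` is a list of index arrays forming a partition of {0,...,p-1}."""
    alpha = np.zeros_like(beta)
    for g in groups:
        norm = np.linalg.norm(beta[g])
        if norm >= lam:
            alpha[g] = (1.0 - lam / norm) * beta[g]
    return alpha
```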

Finally, other ideas for exploiting both structure and wavelet parsimony have been proposed. One is a coding scheme called “zero-tree” wavelet coding (Shapiro, 1993), which uses the tree structure of wavelets to force all descendants of zero coefficients to be zero as well. Equivalently, a coefficient can be non-zero only if its parent in the tree is non-zero, as illustrated in Figure 1.2. This idea was later revisited in a more general context by Zhao et al. (2009). Other complex models have been used as well for modeling interactions between coefficients: we can mention the application of hidden Markov models (HMM) to wavelets by Crouse et al. (1998) and the Gaussian scale mixture model of Portilla et al. (2003).

### 1.3 Modern parsimony: the ℓ1-norm and other variants

The era of “modern” parsimony corresponds probably to the use of convex optimization techniques for solving feature selection or sparse decomposition problems. Even though the $\ell_1$-norm was introduced for that purpose in geophysics (Claerbout and Muir, 1973; Taylor et al., 1979), it was popularized in statistics with the Lasso estimator of Tibshirani (1996) and independently in signal processing with the basis pursuit formulation of Chen et al. (1999). Given observations $x$ in $\mathbb{R}^n$ and a matrix of predictors $D$ in $\mathbb{R}^{n \times p}$, the Lasso consists of learning a linear model $x \approx D\alpha$ by solving the following quadratic program:

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_1 \le \mu. \tag{1.9}
$$

As detailed in the sequel, the $\ell_1$-norm encourages the solution $\alpha$ to be sparse, and the parameter $\mu$ is used to control the trade-off between data fitting and the sparsity of $\alpha$. In practice, reducing the value of $\mu$ indeed leads to sparser solutions in general, i.e., with more zeroes, even though there is no formal relation between the sparsity of $\alpha$ and its $\ell_1$-norm for general matrices $D$.

The basis pursuit denoising formulation of Chen et al. (1999) is relatively similar, but the $\ell_1$-norm is used as a penalty instead of a constraint. It can be written as

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1, \tag{1.10}
$$

which is essentially equivalent to (1.9) from a convex optimization perspective; in fact, (1.10) is also often called “Lasso” in the literature. Given some data $x$, matrix $D$, and parameter $\mu$, we indeed know from Lagrange multiplier theory (see, e.g., Borwein and Lewis, 2006; Boyd and Vandenberghe, 2004) that for every solution $\alpha^\star$ of (1.9), there exists a parameter $\lambda$ such that $\alpha^\star$ is also a solution of (1.10). We note, however, that there is no direct mapping between $\mu$ and $\lambda$, and thus the choice of formulation (1.9) or (1.10) should be made according to how easy it is to select the parameters $\mu$ or $\lambda$. For instance, one may prefer (1.9) when a priori information about the $\ell_1$-norm of the solution is available. In Figure 1.4, we illustrate the effect of changing the value of the regularization parameter $\lambda$ on the solution of (1.10) for two datasets. When $\lambda = 0$, the solution is dense; in general, increasing $\lambda$ sets more and more variables to zero. However, the relation between $\lambda$ and the sparsity of the solution is not exactly monotonic: in a few cases, increasing $\lambda$ yields a denser solution.
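For general non-orthogonal matrices $D$, (1.10) no longer has a closed form, but it can be solved by alternating a gradient step on the quadratic term with soft-thresholding. The following is a minimal ISTA (proximal-gradient) sketch, a basic instance of the algorithms covered in Section 5.2 (the function name and step-size choice are ours):

```python
import numpy as np

def ista(D, x, lam, n_iter=500):
    """Minimal ISTA sketch for the penalized formulation (1.10):
    a gradient step on 0.5*||x - D alpha||^2 followed by soft-thresholding,
    with step size 1/L, L being the largest eigenvalue of D^T D."""
    p = D.shape[1]
    alpha = np.zeros(p)
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)
        z = alpha - grad / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return alpha
```

When $D$ is orthogonal, a single iteration recovers the soft-thresholding solution (1.5), which gives a quick sanity check.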

Another “equivalent” formulation consists of finding a sparse decomposition under a reconstruction constraint:

$$
\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \text{s.t.} \quad \|x - D\alpha\|_2^2 \le \varepsilon. \tag{1.11}
$$

This formulation can be useful when a priori knowledge about the noise level is available, making the parameter $\varepsilon$ easy to choose. The link between (1.10) and (1.11) is similar to the link between (1.10) and (1.9).

For noiseless problems, Chen et al. (1999) have also introduced a formulation simply called “basis pursuit” (without the terminology “denoising”), defined as

$$
\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \text{s.t.} \quad x = D\alpha, \tag{1.12}
$$

which is related to (1.10) in the sense that the set of solutions of (1.10) converges to the solutions of (1.12) when $\lambda$ converges to $0$, whenever the linear system $x = D\alpha$ is feasible. These four formulations (1.9)–(1.12) have gained a large success beyond the statistics and signal processing communities. More generally, the $\ell_1$-norm has been used as a regularization function beyond the least-squares context, leading to problems of the form

$$
\min_{\alpha \in \mathbb{R}^p} f(\alpha) + \lambda \|\alpha\|_1, \tag{1.13}
$$

where $f$ is a loss function. In the rest of this section, we will present several variants of the $\ell_1$-norm, but before that, we will try to understand why such a penalty is appropriate for sparse estimation.

##### Why does the ℓ1-norm induce sparsity?

Even though we have claimed that there is no rigorous relation between the sparsity of $\alpha$ and its $\ell_1$-norm in general, intuition about the sparsity-inducing effect of the $\ell_1$-norm may be obtained from several viewpoints.

###### Analytical point of view.

In the previous section about wavelets, we have seen that when $D$ is orthogonal, the $\ell_1$-decomposition problem (1.10) admits the analytic closed-form solution (1.5) obtained by soft-thresholding. As a result, whenever the magnitude of the inner product $d_i^\top x$ is smaller than $\lambda$ for an index $i$, the corresponding variable $\alpha^\star[i]$ is equal to zero. Thus, the number of zeroes of the solution $\alpha^\star$ monotonically increases with $\lambda$.

For non-orthogonal matrices $D$, such a monotonic relation no longer formally holds; in practice, however, the sparsity-inducing property of the $\ell_1$-penalty remains effective, as illustrated in Figure 1.4. Some intuition about this fact can be gained by studying optimality conditions for the general $\ell_1$-regularized problem (1.13), where $f$ is a differentiable convex function. The following lemma details these conditions.

**Lemma 1.3** (Optimality conditions for $\ell_1$-regularized problems)**.** A vector $\alpha^\star$ in $\mathbb{R}^p$ is a solution of (1.13) if and only if

$$
\forall i = 1, \ldots, p, \quad \begin{cases} -\nabla f(\alpha^\star)[i] = \lambda \operatorname{sign}(\alpha^\star[i]) & \text{if } \alpha^\star[i] \neq 0, \\ |\nabla f(\alpha^\star)[i]| \le \lambda & \text{otherwise}. \end{cases} \tag{1.14}
$$
###### Proof.

A proof using the classical concept of subdifferential from convex optimization can be found in, e.g., Bach et al. (2012a). Here, we provide instead an elementary proof using the simpler concept of directional derivative for nonsmooth functions, defined, when the limit exists, as

$$
\nabla g(\alpha, \kappa) \triangleq \lim_{t \to 0^+} \frac{g(\alpha + t\kappa) - g(\alpha)}{t},
$$

for a function $g$ at a point $\alpha$ in $\mathbb{R}^p$ and a direction $\kappa$ in $\mathbb{R}^p$. For convex functions $g$, directional derivatives always exist, and a classical optimality condition for $\alpha^\star$ to be a minimum of $g$ is that $\nabla g(\alpha^\star, \kappa)$ be non-negative for all directions $\kappa$ (Borwein and Lewis, 2006). Intuitively, this means that one cannot find any direction $\kappa$ such that an infinitesimal move along $\kappa$ from $\alpha^\star$ decreases the value of the objective. When $g$ is differentiable, the condition is equivalent to the classical optimality condition $\nabla g(\alpha^\star) = 0$.

We can now apply the directional derivative condition to the function $\alpha \mapsto f(\alpha) + \lambda \|\alpha\|_1$, which is equivalent to

$$
\forall \kappa \in \mathbb{R}^p, \quad \nabla f(\alpha^\star)^\top \kappa + \lambda \sum_{i=1}^p \begin{cases} \operatorname{sign}(\alpha^\star[i])\, \kappa[i] & \text{if } \alpha^\star[i] \neq 0, \\ |\kappa[i]| & \text{otherwise} \end{cases} \;\ge\; 0. \tag{1.15}
$$

It is then easy to show that (1.15) holds for all $\kappa$ if and only if the inequality holds for the specific values $\kappa = e_i$ and $\kappa = -e_i$ for all $i$, where $e_i$ is the vector in $\mathbb{R}^p$ with zeroes everywhere except for the $i$-th entry, which is equal to one. This immediately provides an equivalence between (1.15) and (1.14). ∎

Lemma 1.3 is interesting from a computational point of view (see Section 5.2), but it also tells us that when $\lambda \ge \|\nabla f(0)\|_\infty$, the conditions (1.14) are satisfied for $\alpha^\star = 0$, the sparsest solution possible.
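This consequence of Lemma 1.3 is easy to check numerically for the quadratic loss $f(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2$, for which $\nabla f(0) = -D^\top x$. The sketch below (random data; all names are ours) verifies that with $\lambda = \|D^\top x\|_\infty$, no small perturbation of $0$ improves the penalized objective:

```python
import numpy as np

# For f(alpha) = 0.5*||x - D alpha||^2, grad f(0) = -D^T x, so alpha = 0
# satisfies the conditions (1.14) as soon as lam >= ||D^T x||_inf.
np.random.seed(1)
D = np.random.randn(20, 5)
x = np.random.randn(20)
lam = np.max(np.abs(D.T @ x))            # smallest lam for which 0 is optimal

def objective(alpha):
    return 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.sum(np.abs(alpha))

f0 = objective(np.zeros(5))
for _ in range(1000):
    delta = 1e-3 * np.random.randn(5)    # random small perturbations of 0
    assert objective(delta) >= f0 - 1e-9  # 0 is a global minimizer
```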

###### Physical point of view.

In image processing or computer vision, the word “energy” often denotes the objective function of a minimization problem; it is indeed common in physics to have complex systems that stabilize at a configuration of minimum potential energy. The negative of the energy’s gradient represents a force, a terminology we will borrow in this paragraph. Consider for instance the one-dimensional $\ell_1$-regularized estimation problem

$$
\min_{\alpha \in \mathbb{R}} \frac{1}{2} (\beta - \alpha)^2 + \lambda |\alpha|, \tag{1.16}
$$

where $\beta$ is a positive constant. Whenever $\alpha$ is non-zero, the $\ell_1$-penalty is differentiable with derivative $\lambda \operatorname{sign}(\alpha)$. When interpreting this objective as an energy minimization problem, the $\ell_1$-penalty can be seen as applying a force driving $\alpha$ towards the origin with constant intensity $\lambda$. Consider now instead the squared $\ell_2$-penalty, also called the regularization of Tikhonov (1963), or ridge regression regularization (Hoerl and Kennard, 1970):

$$
\min_{\alpha \in \mathbb{R}} \frac{1}{2} (\beta - \alpha)^2 + \frac{\lambda}{2} \alpha^2. \tag{1.17}
$$

The derivative of the quadratic energy $\frac{\lambda}{2}\alpha^2$ is $\lambda \alpha$. It can be interpreted as a force that also points to the origin, but with linear intensity $\lambda |\alpha|$. Therefore, the force corresponding to the ridge regularization can be arbitrarily strong when $\alpha$ is large, but it fades away when $\alpha$ gets close to zero. As a result, the squared $\ell_2$-regularization does not have a sparsity-inducing effect. From an analytical point of view, we have seen that the solution of (1.16) is zero when $\beta$ is smaller than $\lambda$. In contrast, the solution of (1.17) admits the closed form $\alpha^\star = \beta / (1 + \lambda)$; thus, regardless of the parameter $\lambda$, the solution is never zero.
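The two closed forms can be compared side by side in a short sketch (function names are ours):

```python
import numpy as np

# Closed-form solutions of the scalar problems (1.16) and (1.17):
# l1 gives soft-thresholding, ridge gives a rescaling that is never zero.
def l1_solution(beta, lam):
    return np.sign(beta) * max(abs(beta) - lam, 0.0)

def ridge_solution(beta, lam):
    return beta / (1.0 + lam)

beta, lam = 0.3, 0.5
assert l1_solution(beta, lam) == 0.0        # small beta is snapped to zero
assert ridge_solution(beta, lam) != 0.0     # ridge only shrinks: 0.3 / 1.5
```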

We present a physical example illustrating this phenomenon in Figure 1.5. We use springs whose potential energy is known to be quadratic, and objects with a gravitational potential energy that is approximately linear on the Earth’s surface.

###### Geometrical point of view.

The sparsity-inducing effect of the $\ell_1$-norm can also be interpreted by studying the geometry of the $\ell_1$-ball $\{\alpha \in \mathbb{R}^p : \|\alpha\|_1 \le \mu\}$. More precisely, understanding the effect of the Euclidean projection onto this set is important: in simple cases where the design matrix $D$ is orthogonal, the solution of (1.9) can indeed be obtained by the projection

$$
\min_{\alpha \in \mathbb{R}^p} \frac{1}{2} \|\beta - \alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_1 \le \mu, \tag{1.18}
$$

where $\beta = D^\top x$. When $D$ is not orthogonal, a classical algorithm for solving (1.9) is the projected gradient method (see Section 5.2), which performs a sequence of projections (1.18) for different values of $\beta$. Note that how to solve (1.18) efficiently is well studied; it can be achieved in linear time with a divide-and-conquer strategy (Brucker, 1984; Duchi et al., 2008).
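Here is a sketch of the simpler sort-based $O(p \log p)$ variant of this projection (following the scheme popularized by Duchi et al., 2008; the linear-time divide-and-conquer version avoids the sort):

```python
import numpy as np

def project_l1_ball(beta, mu):
    """Euclidean projection onto the l1-ball of radius mu: find the
    soft-threshold level theta such that the thresholded vector has
    l1-norm exactly mu (sort-based variant)."""
    if np.sum(np.abs(beta)) <= mu:
        return beta.copy()                  # already inside the ball
    u = np.sort(np.abs(beta))[::-1]         # magnitudes in decreasing order
    css = np.cumsum(u)
    rho = np.max(np.nonzero(u * np.arange(1, len(u) + 1) > css - mu)[0])
    theta = (css[rho] - mu) / (rho + 1.0)   # optimal threshold level
    return np.sign(beta) * np.maximum(np.abs(beta) - theta, 0.0)
```

Note that the projection itself is a soft-thresholding, with a threshold chosen from the data rather than fixed in advance.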

In Figure 1.6, we illustrate the effect of the $\ell_1$-norm projection and compare it to the case of the $\ell_2$-norm. The corners of the $\ell_1$-ball lie on the main axes and correspond to sparse solutions; two of them are represented by red and green dots. Most strikingly, a large part of the space in the figure, represented by the red and green regions, ends up on these corners after projection. In contrast, the set of points that is projected onto the blue dot is simply the blue line; the blue dot corresponds in fact to a dense solution. Therefore, the figure illustrates that the $\ell_1$-ball in two dimensions encourages solutions to be on its corners. The $\ell_2$-ball, in contrast, is isotropic and treats every direction equally. In Figure 1.7, we represent these two balls in three dimensions, where we can make similar observations.

More formally, we can mathematically characterize our remarks about Figure 1.6. Consider a non-zero point $y$ in $\mathbb{R}^p$ on the surface of the unit $\ell_1$-ball, and define the set $N \triangleq \{z \in \mathbb{R}^p : \Pi(z) = y\}$, where $\Pi$ is the projection operator onto the $\ell_1$-ball. Examples of pairs $(y, N)$ have been presented in Figure 1.6; for instance, when $y$ is the red or green dot, $N$ is respectively the red or green region. It is particularly informative to study how $N$ varies with $y$, which is the focus of the next proposition.

**Proposition** (Characterization of the set $N$)**.** For a non-zero vector $y$ on the surface of the unit $\ell_1$-ball, the set $N$ defined in the previous paragraph can be written as $N = y + K$, where $K$ is a polyhedral cone of dimension $l + 1$, with $l$ denoting the number of zero entries of $y$.

###### Proof.

A classical theorem (see Bertsekas, 1999, Proposition B.11) allows us to rewrite $N$ as

$$
N = \{z \in \mathbb{R}^p : \forall \|x\|_1 \le 1, \; (z - y)^\top (x - y) \le 0\} = y + K,
$$

where $y + K$ denotes the Minkowski sum between the set $\{y\}$ and the cone $K$ defined as

$$
K \triangleq \{d \in \mathbb{R}^p : \forall \|x\|_1 \le 1, \; d^\top (x - y) \le 0\}.
$$

Note that in the optimization literature, $K$ is often called the “normal cone” to the unit $\ell_1$-ball at the point $y$ (Borwein and Lewis, 2006). Equivalently, we have

$$
K = \left\{d \in \mathbb{R}^p : \max_{\|x\|_1 \le 1} d^\top x \le d^\top y\right\} = \{d \in \mathbb{R}^p : \|d\|_\infty \le d^\top y\}, \tag{1.19}
$$

where we have used the fact that the quantity $\max_{\|x\|_1 \le 1} d^\top x$, called the dual norm of the $\ell_1$-norm, is equal to $\|d\|_\infty$ (see Bach et al., 2012a). Note now that according to Hölder’s inequality, we also have $d^\top y \le \|d\|_\infty \|y\|_1 = \|d\|_\infty$ in Eq. (1.19). Therefore, the inequalities are in fact equalities. It is then easy to characterize the vectors $d$ for which $d^\top y = \|d\|_\infty$, and it is possible to show that $K$ is simply the set of vectors $d$ satisfying $d[j] = \|d\|_\infty \operatorname{sign}(y[j])$ for all $j$ such that $y[j] \neq 0$.

This would be sufficient to conclude the proposition, but it is also possible to pursue the analysis and exactly characterize $K$ by finding a set of generators.⁴ (⁴A collection of vectors $d_1, \ldots, d_m$ are called generators for a cone $K$ when $K$ consists of all positive combinations of these vectors—that is, $K = \{\sum_{j=1}^m \gamma_j d_j : \gamma_j \ge 0\}$. In that case, we use the notation $K = \operatorname{cone}(d_1, \ldots, d_m)$.) Let us define the vector $s$ in $\mathbb{R}^p$ that carries the sparsity pattern of $y$; more precisely, $s[j] = \operatorname{sign}(y[j])$ for all $j$ such that $y[j] \neq 0$, and $s[j] = 0$ otherwise. Let us also denote by $\{i_1, \ldots, i_l\}$ the set of indices corresponding to the $l$ zero entries of $y$, and by $e_i$ the binary vector in $\mathbb{R}^p$ whose entries are all zero except the $i$-th one, which is equal to $1$. Then, after a short calculation, we can geometrically characterize the polyhedral cone $K$:

$$
K = \operatorname{cone}\left(s,\; s - e_{i_1},\; s + e_{i_1},\; s - e_{i_2},\; s + e_{i_2},\; \ldots,\; s - e_{i_l},\; s + e_{i_l}\right),
$$

where the notation “cone” is defined in footnote 4. ∎

It is now easy to see that the set $N$ “grows” with the number $l$ of zero entries in $y$, and that $K$ lives in a subspace of dimension $l + 1$ for every non-zero vector $y$. For example, when $l = 0$—that is, $y$ is a dense vector (e.g., the blue point in Figure 1.6(a))—$K$ is simply a half-line.

To conclude, the geometrical intuition to gain from this section is that the Euclidean projection onto a convex set encourages solutions on singular points, such as edges or corners for polytopes. This principle indeed applies beyond the $\ell_1$-norm. For instance, we illustrate the regularization effect of the $\ell_\infty$-norm in Figure 1.8; the corners of its ball have coordinates of equal magnitude.

##### Non-convex regularization.

Even though it is well established that the $\ell_1$-norm encourages sparse solutions, it remains only a convex proxy of the $\ell_0$-penalty. Both in statistics and signal processing, other sparsity-inducing regularization functions have been proposed, in particular continuous relaxations of $\ell_0$ that are non-convex (Frank and Friedman, 1993; Fan and Li, 2001; Daubechies et al., 2010; Gasso et al., 2009). These functions use a non-decreasing concave function $\varphi : \mathbb{R}^+ \to \mathbb{R}$, and the sparsity-inducing penalty is defined as

$$
\psi(\alpha) \triangleq \sum_{i=1}^p \varphi(|\alpha[i]|).
$$

For example, the $\ell_q$-penalty uses $\varphi : t \mapsto t^q$ with $0 < q < 1$ (Frank and Friedman, 1993), or a smooth approximation $\varphi : t \mapsto (t + \varepsilon)^q$; the reweighted-$\ell_1$ algorithm of Fazel (2002), Fazel et al. (2003), and Candès et al. (2008) implicitly uses $\varphi : t \mapsto \log(t + \varepsilon)$. These penalties typically lead to intractable estimation problems, but approximate solutions can be obtained with continuous optimization techniques (see Section 5.3).

The sparsity-inducing effect of the penalties $\psi$ is known to be stronger than that of $\ell_1$. As shown in Figure 1.9(a), the magnitude of the derivative of $\varphi$ grows as one approaches zero, because of its concavity. Thus, in the one-dimensional case, the penalty can be interpreted as a force driving $\alpha$ towards the origin with increasing intensity as $\alpha$ gets closer to zero. In terms of geometry, we also display the $\ell_q$-ball in Figure 1.9(b), with the same red, blue, and green dots as in Figure 1.6. The part of the space that is projected onto the corners of the $\ell_q$-ball is larger than for $\ell_1$. Interestingly, the geometrical structure of the red and green regions is also more complex; their combinatorial nature makes the projection problem onto the $\ell_q$-ball more involved when $q < 1$.

##### The elastic-net.

To cope with some instability issues of the estimators obtained with $\ell_1$-regularization, Zou and Hastie (2005) have proposed to combine the $\ell_1$- and $\ell_2$-norms into a penalty called the elastic-net:

$$
\psi(\alpha) \triangleq \|\alpha\|_1 + \gamma \|\alpha\|_2^2.
$$

The effect of this penalty is illustrated in Figure 1.10. Compared to Figure 1.6, we observe that the red and green regions are smaller for the elastic-net penalty than for $\ell_1$; the sparsity-inducing effect is thus less aggressive than the one obtained with $\ell_1$.
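When the design is orthogonal (so the penalized problem separates entrywise), the elastic-net still admits a closed form: soft-thresholding followed by a rescaling. A sketch under that assumption (the function name is ours):

```python
import numpy as np

def elastic_net_prox(beta, lam, gamma):
    """Entrywise minimizer of 0.5*(beta - a)^2 + lam*(|a| + gamma*a^2):
    soft-threshold by lam, then rescale by 1/(1 + 2*lam*gamma).
    (Setting the derivative to zero for a > 0 gives
    a = (beta - lam) / (1 + 2*lam*gamma).)"""
    st = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
    return st / (1.0 + 2.0 * lam * gamma)
```

With `gamma = 0` the operator reduces to plain soft-thresholding, recovering the $\ell_1$ case.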

##### Total variation.

The anisotropic total variation penalty (Rudin et al., 1992) for one-dimensional signals is simply the $\ell_1$-norm of finite differences,

$$
\psi(\alpha) \triangleq \sum_{i=1}^{p-1} |\alpha[i+1] - \alpha[i]|,
$$

which encourages piecewise constant signals. It is also known in statistics under the name of “fused Lasso” (Tibshirani et al., 2005). The penalty can easily be extended to two-dimensional signals, and has been widely used for regularizing inverse problems in image processing (Chambolle, 2005).
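Computing the one-dimensional penalty is a single line with numpy; as a small illustration (the examples are ours), a piecewise constant signal has a much smaller total variation than an oscillating one of similar range:

```python
import numpy as np

def total_variation(alpha):
    """Anisotropic 1-D total variation: the l1-norm of finite differences."""
    return np.sum(np.abs(np.diff(alpha)))

piecewise = np.array([1.0, 1.0, 1.0, 4.0, 4.0])   # one jump of height 3
wiggly = np.array([1.0, 2.0, 1.0, 2.0, 1.0])      # four unit jumps
assert total_variation(piecewise) == 3.0
assert total_variation(wiggly) == 4.0
```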

##### Group sparsity.

In some cases, variables are organized into predefined groups forming a partition $\mathcal{G}$ of $\{1, \ldots, p\}$, and one looks for a solution $\alpha$ such that variables belonging to the same group of $\mathcal{G}$ are set to zero together. For example, such groups have appeared in Section 1.2 about wavelets, where $\mathcal{G}$ could be defined according to neighborhood relationships between wavelet coefficients. When it is known beforehand that a problem’s solution only requires a few groups of variables to explain the data, a regularization function that automatically selects the relevant groups has been shown to improve the prediction performance or to provide a solution with a better interpretation (Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2009; Huang and Zhang, 2010). The group sparsity principle is illustrated in Figure 1.11(b).

An appropriate regularization function to obtain a group-sparsity effect is known as the “Group-Lasso” penalty and is defined as

$$
\psi(\alpha) = \sum_{g \in \mathcal{G}} \|\alpha[g]\|_q, \tag{1.20}
$$

where $\|\cdot\|_q$ is either the $\ell_2$- or the $\ell_\infty$-norm. To the best of our knowledge, such a penalty appears in the early work of Grandvalet and Canu (1999) and Bakin (1999) for $q = 2$, and of Turlach et al. (2005) for $q = \infty$. It was later popularized by Yuan and Lin (2006).

The function $\psi$ in (1.20) is a norm, thus convex, and can be interpreted as the $\ell_1$-norm of the vector of group norms $\left[\|\alpha[g]\|_q\right]_{g \in \mathcal{G}}$ of size $|\mathcal{G}|$. Consequently, the sparsity-inducing effect of the $\ell_1$-norm is applied at the group level. The penalty is highly related to the group-thresholding approach for wavelets, since the group-thresholding estimator (1.7) is linked to $\psi$ through Eq. (1.8).
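Evaluating the penalty (1.20) is straightforward; a sketch for $q = 2$ (the function name and group encoding are ours):

```python
import numpy as np

def group_lasso_penalty(alpha, groups, q=2):
    """Group-Lasso penalty (1.20): the sum of the l_q-norms of the groups,
    i.e., an l1-norm applied to the vector of group norms."""
    return sum(np.linalg.norm(alpha[g], ord=q) for g in groups)
```

For instance, with groups $\{1,2\}$ and $\{3,4\}$ and a vector whose second group is zero, only the first group contributes to the penalty, mirroring the group-level sparsity effect described above.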

In Figure 1.13(a), we visualize the unit ball of a Group-Lasso norm in three dimensions, obtained when $\mathcal{G}$ contains two groups, one with the first two variables and one with the third. The ball has two singularities: the top and bottom corners, corresponding to solutions where the first two variables are simultaneously set to zero, and the middle circle, corresponding to solutions where the third variable only is set to zero. As expected, the geometry of the ball induces the group-sparsity effect.

##### Structured sparsity.

Group sparsity is a first step towards the more general idea that a regularization function can encourage sparse solutions with a particular structure. This notion is called structured sparsity and has been introduced from a large number of different points of view (Zhao et al., 2009; Jacob et al., 2009; Jenatton et al., 2011a; Baraniuk et al., 2010; Huang et al., 2011). To some extent, it follows the concept of group-thresholding introduced in the wavelet literature, which we have presented in Section 1.2. In this paragraph, we briefly review some of these works; for a more detailed review, we refer the reader to Bach et al. (2012b).

Some penalties are non-convex. For instance, Huang et al. (2011) and Baraniuk et al. (2010) propose two different combinatorial approaches based on a predefined set $\mathcal{G}$ of possibly overlapping groups of variables. These penalties encourage solutions whose support is the union of a small number of groups, but they lead to NP-hard optimization problems. Other penalties are convex. In particular, Jacob et al. (2009) introduce a sparsity-inducing norm that is exactly a convex relaxation of the penalty of Huang et al. (2011), even though the two approaches were developed independently at the same time. As a result, the convex penalty of Jacob et al. (2009) encourages a similar structure to that of Huang et al. (2011).

Following a different direction, the Group-Lasso penalty (1.20) has been considered in settings where the groups are allowed to overlap (Zhao et al., 2009; Jenatton et al., 2011a). As a consequence, variables belonging to the same groups are encouraged to be set to zero together. Zhao et al. (2009) proposed it for hierarchical structures, with the following rule: whenever two groups $g$ and $h$ are in $\mathcal{G}$, they should either be disjoint, or one should be included in the other. Examples of such hierarchical group structures are given in Figures 1.11 and 1.12. The effect of the penalty is to encourage sparsity patterns that are rooted subtrees. Equivalently, a variable can be non-zero only if its parent in the tree is non-zero, which is the main property of the zero-tree coding scheme introduced in the wavelet literature (Shapiro, 1993) and already illustrated in Figure 1.2.
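The hierarchical rule of Zhao et al. (2009) — any two groups are either disjoint or nested — is easy to verify programmatically. The short sketch below (illustrative, not from the monograph) checks the rule for an arbitrary family of groups; the example groups encode a small rooted tree where each group is a node together with its descendants.

```python
def is_hierarchical(groups):
    """Check Zhao et al. (2009)'s rule: any two groups in G are either
    disjoint or one is included in the other (a laminar family)."""
    sets = [frozenset(g) for g in groups]
    for i, g in enumerate(sets):
        for h in sets[i + 1:]:
            if g & h and not (g <= h or h <= g):
                return False
    return True

# Rooted-tree structure: each group is a node together with its descendants.
tree_groups = [{1, 2, 3, 4, 5}, {2, 4, 5}, {3}, {4}, {5}]
print(is_hierarchical(tree_groups))      # True
print(is_hierarchical([{1, 2}, {2, 3}]))  # False: overlap without nesting
```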

Finally, Jenatton et al. (2011a) have extended the hierarchical penalty of Zhao et al. (2009) to more general group structures, for example when variables are organized on a two-dimensional grid, encouraging neighboring variables to be set to zero simultaneously. We conclude this brief presentation of structured sparsity with Figure 1.13, where we present the unit balls of some sparsity-inducing norms. Each of them exhibits singularities and encourages particular sparsity patterns.

##### Spectral sparsity.

Another form of parsimony has been devised in the spectral domain (Fazel et al., 2001; Srebro et al., 2005). For estimation problems where model parameters are matrices, the rank has been used as a natural regularization function. The rank of a matrix is equal to the number of its non-zero singular values, and thus it can be interpreted as the $\ell_0$-penalty of the matrix spectrum. Unfortunately, due to this combinatorial nature, rank penalization typically leads to intractable optimization problems.

A natural convex relaxation has been introduced in the control theory literature by Fazel et al. (2001) and consists of computing the $\ell_1$-norm of the spectrum—that is, simply the sum of the singular values. The resulting penalty appears under different names, the most common ones being the trace, nuclear, or Schatten norm. It is defined for a matrix $A$ in $\mathbb{R}^{m \times p}$ with $m \geq p$ as

$$\|A\|_* \triangleq \sum_{i=1}^{p} s_i(A),$$

where $s_i(A)$ is the $i$-th singular value of $A$. Traditional applications of the trace norm in machine learning are matrix completion and collaborative filtering (Pontil et al., 2007; Abernethy et al., 2009). These problems have become popular with the need for scalable recommender systems for video streaming providers. The goal is to infer movie preferences for each customer, based on their partial movie ratings. Typically, the matrix is of size $m \times n$, where $m$ is the number of movies and $n$ is the number of users. Each user gives a score to a few movies, corresponding to some entries of the matrix, and the recommender system tries to infer the missing values. Similar techniques have also recently been used in other fields, such as genomics, to infer missing genetic information (Chi et al., 2013).
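The two spectral operations involved here — evaluating the trace norm and soft-thresholding the spectrum (the proximal operator of the trace norm, which is the basic building block of convex matrix-completion solvers) — fit in a few lines of NumPy. This is a minimal illustrative sketch, not code from the monograph:

```python
import numpy as np

def trace_norm(A):
    """Nuclear/trace norm: the sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()

def svt(A, lam):
    """Singular-value thresholding: proximal operator of lam * trace_norm.
    Soft-thresholds the spectrum, zeroing out small singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

A = np.diag([3.0, 1.0, 0.2])
print(trace_norm(A))                  # 3.0 + 1.0 + 0.2 = 4.2
B = svt(A, 0.5)
print(np.linalg.matrix_rank(B))       # 2: the smallest singular value is removed
```

Just as entrywise soft-thresholding is the proximal operator of the $\ell_1$-norm, `svt` applies the same shrinkage to the spectrum, which is why the trace norm induces low-rank solutions.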

### 1.4 Dictionary learning

We have previously presented various formulations where a signal $x$ in $\mathbb{R}^m$ is approximated by a sparse linear combination of a few columns of a matrix $D$ in $\mathbb{R}^{m \times p}$. In the context of signal and image processing, this matrix is often called a dictionary and its columns atoms. As seen in Section 1.2, a large amount of work has been devoted in the wavelet literature to designing good dictionaries adapted to natural images.

In neuroscience, Olshausen and Field (1996, 1997) have proposed a significantly different approach to sparse modeling, consisting of adapting the dictionary to training data. Because natural images are too large for learning a full matrix $D$, they chose to learn the dictionary on small natural image patches, and demonstrated that their method could automatically discover interpretable structures. We discuss this topic in more detail in Section 2.

The motivation of Olshausen and Field (1996, 1997) was to show that the structure of natural images is related to classical theories of the mammalian visual cortex. Later, dictionary learning found numerous applications in image restoration, and was shown to significantly outperform off-the-shelf bases for signal reconstruction (see, e.g., Elad and Aharon, 2006; Mairal et al., 2008c, 2009; Protter and Elad, 2009; Yang et al., 2010a).

Concretely, given a dataset of $n$ training signals $x_1, \ldots, x_n$, dictionary learning can be formulated as the following minimization problem:

$$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{p \times n}} \sum_{i=1}^{n} \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\,\psi(\alpha_i), \tag{1.21}$$

where $A = [\alpha_1, \ldots, \alpha_n]$ carries the decomposition coefficients of the signals $x_1, \ldots, x_n$, $\psi$ is a sparsity-inducing regularization function, and $\mathcal{C}$ is typically chosen as the following set:

$$\mathcal{C} \triangleq \{D \in \mathbb{R}^{m \times p} : \forall j,\ \|d_j\|_2 \leq 1\}.$$

To be more precise, Olshausen and Field (1996) proposed several choices for $\psi$; their experiments were for instance conducted with the $\ell_1$-norm, or with a smooth function that has an approximate sparsity-inducing effect. The constraint set $\mathcal{C}$ was also not explicitly modeled in the original dictionary learning formulation; instead, the algorithm of Olshausen and Field (1996) includes a mechanism to control and rescale the $\ell_2$-norm of the dictionary elements. Indeed, without such a mechanism, the norm of $D$ would grow arbitrarily large, leading to small values for the coefficients $\alpha_i$ and making the penalty $\psi$ ineffective.

The number of samples $n$ is typically large, whereas the signal dimension $m$ is small. The number of dictionary elements $p$ is often chosen larger than $m$—in that case, the dictionary is said to be overcomplete—even though a choice $p < m$ leads to reasonable results in many applications. For instance, a typical setting consists of small natural image patches, a dictionary of a few hundred elements, and a much larger number of training patches.
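To fix ideas, here is a toy alternating-minimization sketch for Eq. (1.21) with $\psi$ the $\ell_1$-norm: a few ISTA iterations update the codes $A$, then a gradient step followed by column renormalization (the projection onto $\mathcal{C}$) updates $D$. This is a minimal illustration under our own choices of step sizes and iteration counts, not the algorithm of Olshausen and Field or any method from the monograph:

```python
import numpy as np

def soft(x, t):
    """Entrywise soft-thresholding, the proximal operator of the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def dict_learn(X, p, lam=0.1, n_iter=20, seed=0):
    """Toy alternating minimization for (1.21) with psi = l1-norm."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)               # start with unit-norm atoms
    A = np.zeros((p, n))
    for _ in range(n_iter):
        # Sparse coding step: a few ISTA iterations on all codes at once.
        L = np.linalg.norm(D, 2) ** 2            # Lipschitz const. of the gradient
        for _ in range(10):
            A = soft(A - (D.T @ (D @ A - X)) / L, lam / L)
        # Dictionary step: one gradient step, then projection onto C.
        D -= (D @ A - X) @ A.T / (np.linalg.norm(A @ A.T, 2) + 1e-8)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)   # enforce ||d_j||_2 <= 1
    return D, A

X = np.random.default_rng(1).standard_normal((8, 50))
D, A = dict_learn(X, p=12)
print(round(np.mean(A == 0), 2))   # a substantial fraction of exact zeros in A
```

The two sub-problems are each convex (in $A$ for fixed $D$, and in $D$ for fixed $A$), which is what makes alternating schemes natural here; Section 5.5 of the monograph covers proper algorithms for solving (1.21).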

A large part of this monograph is related to dictionary learning, and thus we only briefly discuss this matter in this introduction. Section 2 is indeed devoted to unsupervised learning techniques for natural image patches, including dictionary learning; Sections 3 and 4 present a large number of applications in image processing and computer vision; how to solve (1.21) is explained in Section 5.5 about optimization.

##### Matrix factorization point of view.

An equivalent representation of (1.21) is the following regularized matrix factorization problem:

$$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{p \times n}} \frac{1}{2}\|X - DA\|_F^2 + \lambda\,\Psi(A), \tag{1.22}$$

where $X \triangleq [x_1, \ldots, x_n]$ and $\Psi(A) \triangleq \sum_{i=1}^{n} \psi(\alpha_i)$. Even though reformulating (1.21) as (1.22) is simply a matter of using different notation, seeing dictionary learning as a matrix factorization problem opens up interesting perspectives. In particular, it makes obvious some links with other unsupervised learning approaches such as non-negative matrix factorization (Paatero and Tapper, 1994), clustering techniques such as K-means, and others (see Mairal et al., 2010a). These links will be further developed in Section 2.
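The equivalence between the two formulations rests on a single identity: the squared Frobenius norm of a matrix is the sum of the squared $\ell_2$-norms of its columns. A quick numerical check (illustrative only) of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 7))   # columns x_1, ..., x_n
D = rng.standard_normal((5, 3))
A = rng.standard_normal((3, 7))   # columns alpha_1, ..., alpha_n

# Sum of per-signal residuals, as in (1.21)...
per_signal = sum(0.5 * np.linalg.norm(X[:, i] - D @ A[:, i]) ** 2
                 for i in range(X.shape[1]))
# ...equals the Frobenius data-fitting term of (1.22).
frobenius = 0.5 * np.linalg.norm(X - D @ A, 'fro') ** 2
print(np.isclose(per_signal, frobenius))  # True
```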

##### Risk minimization point of view.

Dictionary learning can also be seen from a machine learning point of view. Indeed, it can be written as

$$\min_{D \in \mathcal{C}} \left\{ f_n(D) \triangleq \frac{1}{n} \sum_{i=1}^{n} L(x_i, D) \right\},$$

where $L$ is a loss function defined as

$$L(x, D) \triangleq \min_{\alpha \in \mathbb{R}^p} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\,\psi(\alpha).$$

The quantity $L(x, D)$ should be small if $D$ is “good” at representing the signal $x$ in a sparse fashion, and large otherwise. Then, $f_n$ is called the empirical cost.

However, as pointed out by Bottou and Bousquet (2008), one is usually not interested in the exact minimization of the empirical cost for a fixed $n$, which may lead to overfitting on the training data, but instead in the minimization of the expected cost, which measures the quality of the dictionary on new unseen data:

$$f(D) \triangleq \mathbb{E}_x[L(x, D)] = \lim_{n \to \infty} f_n(D) \quad \text{a.s.},$$

where the expectation is taken relative to the (unknown) probability distribution of the data, and “a.s.” denotes almost sure convergence.
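The convergence of the empirical cost to the expected cost is easy to observe numerically. In the sketch below (an illustration under our own simplifying assumptions, not an experiment from the monograph), we take $D$ to be the identity, in which case the lasso inside $L(x, D)$ has the closed-form solution $\alpha = \mathrm{soft}(x, \lambda)$, and we evaluate $f_n$ on increasingly many Gaussian samples:

```python
import numpy as np

def soft(x, t):
    """Entrywise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# For D = identity, L(x, D) is solved exactly by alpha = soft(x, lam).
lam = 0.5
rng = np.random.default_rng(0)
f_n = {}
for n in (2_000, 200_000):
    x = rng.standard_normal(n)                    # n one-dimensional "signals"
    a = soft(x, lam)                              # closed-form lasso solutions
    f_n[n] = np.mean(0.5 * (x - a) ** 2 + lam * np.abs(a))

# The empirical costs computed on different sample sizes nearly coincide,
# consistent with f_n(D) -> f(D) almost surely.
print(abs(f_n[2_000] - f_n[200_000]) < 0.05)
```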

The expected risk minimization formulation is interesting since it paves the way to stochastic optimization techniques when a large amount of data is available (Mairal et al., 2010a) and to theoretical analysis (Maurer and Pontil, 2010; Vainsencher et al., 2011; Gribonval et al., 2013), which are developed in Sections 5.5 and 1.6, respectively.

##### Constrained variants.

Following the original formulation of Olshausen and Field (1996, 1997), we have chosen to present dictionary learning where the regularization function is used as a penalty, even though it can also be used as a constraint as in (1.9). Then, natural variants of (1.21) are

$$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{p \times n}} \sum_{i=1}^{n} \frac{1}{2}\|x_i - D\alpha_i\|_2^2 \quad \text{s.t.} \quad \psi(\alpha_i) \leq \mu, \tag{1.23}$$

or

$$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{p \times n}} \sum_{i=1}^{n} \psi(\alpha_i) \quad \text{s.t.} \quad \|x_i - D\alpha_i\|_2^2 \leq \varepsilon. \tag{1.24}$$

Note that (1.23) and (1.24) are not equivalent to (1.21). For instance, problem (1.23) can be reformulated using a Lagrangian function (Boyd and Vandenberghe, 2004) as

 minD∈Cn∑i=1(maxλ