### 1.1 Early concepts of parsimony in statistics

A large number of statistical procedures can be formulated as maximum likelihood estimation. Given a statistical model with parameters $\theta$ in a set $\Theta$, it consists of minimizing with respect to $\theta$ an objective function representing the negative log-likelihood of observed data. Assuming for instance that we observe $n$ independent samples $\mathbf{x}_1,\ldots,\mathbf{x}_n$ of the (unknown) data distribution, we need to solve

(1.1) $\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} -\log P_\theta(\mathbf{x}_i),$

where $P_\theta$ is some probability distribution parameterized by $\theta$.

Simple methods such as ordinary least squares can be written as (1.1). Consider for instance data points $\mathbf{x}_i$ that are pairs $(y_i, \mathbf{z}_i)$, where $y_i$ is an observation in $\mathbb{R}$ and $\mathbf{z}_i$ is a vector in $\mathbb{R}^p$, and assume that there exists a linear relation $y_i = \mathbf{z}_i^\top \theta + \varepsilon_i$, where $\varepsilon_i$ is an approximation error for observation $i$. Under a model where the $\varepsilon_i$'s are independent and identically normally distributed with zero mean, Eq. (1.1) is equivalent to the least squares problem:^{2}

$\min_{\theta \in \mathbb{R}^p} \frac{1}{2n}\sum_{i=1}^{n} \left(y_i - \mathbf{z}_i^\top \theta\right)^2.$

^{2}2Note that the Gaussian noise assumption is not necessary to justify the ordinary least squares formulation; it is only sufficient to interpret it as maximum likelihood estimation. In fact, as long as the conditional expectation of $y_i$ given $\mathbf{z}_i$ is linear, the ordinary least squares estimator is statistically consistent under mild assumptions.
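As a quick numerical illustration of this equivalence (our own sketch, not from the text; the data and variable names are ours), the maximum likelihood estimate under i.i.d. Gaussian noise is simply the ordinary least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
Z = rng.normal(size=(n, p))                       # predictors z_i as rows
theta_true = np.array([2.0, -1.0, 0.5])
y = Z @ theta_true + 0.01 * rng.normal(size=n)    # y_i = z_i^T theta + eps_i

# Maximum likelihood under i.i.d. Gaussian noise = ordinary least squares:
# theta_hat minimizes sum_i (y_i - z_i^T theta)^2.
theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

With low noise, the estimate recovers the generating parameters closely.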

To prevent overfitting and to improve the interpretability of the learned model, it was suggested in early work that a solution involving only a few model variables could be more appropriate than an exact solution of (1.1); in other words, a sparse solution involving only—let us say—$k$ variables might be desirable in some situations. Unfortunately, such a strategy yields two difficulties: first, it is not clear a priori how to choose $k$; second, finding the best subset of $k$ variables is NP-hard in general (Natarajan, 1995). The first issue was addressed with several criteria for controlling the trade-off between the sparsity of the solution and the adequacy of the fit to training data. For the second issue, approximate computational techniques have been proposed.

##### Mallows's $C_p$, AIC, and BIC.

For the ordinary least squares problem, Mallows (1964, 1966) introduced the $C_p$-statistic, later generalized by Akaike (1973) with the Akaike information criterion (AIC), and then by Schwarz (1978) with the Bayesian information criterion (BIC). Using $C_p$, AIC, or BIC is equivalent to solving the $\ell_0$-penalized maximum likelihood estimation problem

(1.2) $\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} -\log P_\theta(\mathbf{x}_i) + \lambda \|\theta\|_0,$

where $\lambda$ depends on the chosen criterion (see Hastie et al., 2009),
and $\|\theta\|_0 \triangleq \#\{j : \theta_j \neq 0\}$ is the $\ell_0$-penalty, counting
the number of non-zero entries of $\theta$. Similar formulations have also
been derived by using the minimum description length (MDL) principle for model
selection (Rissanen, 1978; Barron et al., 1998). As shown by Natarajan (1995), the
problem (1.2) is NP-hard, and approximate algorithms are necessary
unless the number of variables $p$ is very small.
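To make the combinatorial nature of the problem concrete, the following sketch (ours, not from the text) solves best subset selection exactly by enumerating all supports of size $k$ for the least squares loss. The cost grows as $\binom{p}{k}$, which is only feasible for very small problems:

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exact best subset selection: try every size-k support (exponential in p)
    and keep the one with the smallest residual sum of squares."""
    best_err, best_support, best_coef = np.inf, None, None
    for support in itertools.combinations(range(X.shape[1]), k):
        sub = X[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        err = np.sum((y - sub @ coef) ** 2)
        if err < best_err:
            best_err, best_support, best_coef = err, support, coef
    return best_support, best_coef

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4]       # only variables 1 and 4 are active
support, coef = best_subset(X, y, k=2)
```

On this noiseless toy problem the exact search recovers the true support.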

##### Forward selection and best subset selection for least squares.

To obtain an approximate solution of (1.2), a classical approach is the
forward selection technique, which is a greedy algorithm that solves a sequence
of maximum likelihood estimation problems computed on a subset of variables.
After every iteration, a new variable is added to the subset according to the
chosen sparsity criterion in a greedy manner. Some variants allow backward
steps—that is, a variable can possibly exit the active subset after an
iteration. The algorithm is presented in more detail in
Section 5.1 and seems to be due
to Efroymson (1960), according to Hocking (1976). Other approaches
considered in the 70's also include the *leaps and bounds* technique
of Furnival and Wilson (1974), a branch-and-bound algorithm providing the exact
solution of (1.2) with exponential worst-case complexity.
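The greedy procedure described above can be sketched in a few lines (our own illustrative code, for the least squares loss; practical implementations add stopping criteria and backward steps):

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy forward selection: at each iteration, add the variable that
    most reduces the residual sum of squares of the least-squares fit."""
    active = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(X.shape[1]):
            if j in active:
                continue
            trial = active + [j]
            coef, *_ = np.linalg.lstsq(X[:, trial], y, rcond=None)
            err = np.sum((y - X[:, trial] @ coef) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        active.append(best_j)
    return sorted(active)

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = 4.0 * X[:, 0] + 2.0 * X[:, 7]       # two active variables, no noise
selected = forward_selection(X, y, k=2)
```

Unlike the exhaustive search, each iteration costs only $O(p)$ least squares fits, at the price of no optimality guarantee in general.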

### 1.2 Wavelets in signal processing

In signal processing, problems similar to those in statistics have been studied in the context of wavelets. In a nutshell, a wavelet basis represents a set of functions that are essentially dilated and shifted versions of each other. Unlike the Fourier basis, wavelets have the interesting property of being localized both in the space and frequency domains, and of being suited to the multi-resolution analysis of signals (Mallat, 1989).

The concept of parsimony is central to wavelets. When a signal $x$ is
"smooth" in a particular sense (see Mallat, 2008), it can be well
approximated by a linear combination of a few wavelets. Specifically, $x$ is
close to an expansion $\sum_i \alpha_i d_i$, where the $d_i$'s are wavelet basis
elements and only a few coefficients $\alpha_i$ are non-zero; the resulting compact
representation has effective applications in estimation and compression. The wavelet
theory is well developed for continuous signals, *e.g.*, $x$ chosen in the Hilbert
space $L^2(\mathbb{R})$, but also for discrete signals $x$ in $\mathbb{R}^n$, making it
suitable for modern digital image processing.

Since the first wavelet was introduced by Haar (1910), much research has
been devoted to designing a wavelet set that is adapted to particular
signals such as natural images. After a long quest for good
orthogonal bases such as the one proposed by Daubechies (1988), a series
of works has focused on wavelet sets whose elements are not linearly
independent. This resulted in a large number of variants, such as steerable
wavelets (Simoncelli et al., 1992), curvelets (Candès and Donoho, 2002),
contourlets (Do and Vetterli, 2005), or bandlets (Le Pennec and Mallat, 2005).
For the purpose of our monograph, one concept related to sparse estimation is
particularly important; it is called *wavelet thresholding*.

##### Sparse estimation and wavelet thresholding.

Let us consider a discrete signal represented by a vector $x$ in $\mathbb{R}^n$ and an orthogonal wavelet basis set $D = [d_1, \ldots, d_n]$ in $\mathbb{R}^{n \times n}$—that is, satisfying $D^\top D = I$, where $I$ is the identity matrix. Approximating $x$ by a sparse linear combination of wavelet elements can be formulated as finding a sparse vector $\alpha$ in $\mathbb{R}^n$, say with $k$ non-zero coefficients, that minimizes

(1.3) $\min_{\alpha \in \mathbb{R}^n} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \leq k.$

The sparse decomposition problem (1.3) is an instance of the best subset selection formulation presented in Section 1.1, where $\alpha$ represents the model parameters, demonstrating that similar topics arise in statistics and signal processing. However, whereas (1.3) is NP-hard for general matrices $D$ (Natarajan, 1995), we have assumed $D$ to be orthogonal in the context of wavelets. As such, (1.3) is equivalent to

$\min_{\alpha \in \mathbb{R}^n} \|D^\top x - \alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \leq k,$

and admits a closed form. Let us indeed define the vector $\beta \triangleq D^\top x$ in $\mathbb{R}^n$, corresponding to the exact non-sparse decomposition of $x$ onto $D$—that is, we have $x = D\beta$ since $D$ is orthogonal. To obtain the best $k$-sparse approximation, we denote by $\mu$ the $k$-th largest value among the set $\{|\beta_1|, \ldots, |\beta_n|\}$, and the solution $\alpha^{\text{ht}}$ of (1.3) is obtained by applying to $\beta$ an operator called "hard-thresholding" and defined as

(1.4) $\alpha^{\text{ht}}_i = \beta_i \, \mathbf{1}\!\left(|\beta_i| \geq \mu\right) \quad \text{for } i = 1, \ldots, n,$

where $\mathbf{1}(\cdot)$ is the indicator function, which is equal to $1$ if $|\beta_i| \geq \mu$ and $0$ otherwise.
In other words, the hard-thresholding operator simply sets to zero the coefficients
of $\beta$ whose magnitude is below the threshold $\mu$. The corresponding
procedure, called "wavelet thresholding", is simple and effective for image
denoising, even though it does not perform as well as the recent state-of-the-art
techniques presented in Section 3. When an image $x$ is
noisy, *e.g.*, corrupted by white Gaussian noise, and $k$ is well chosen, the
estimate $D\alpha^{\text{ht}}$ is a good estimate of the clean original image.
The terminology "hard" is defined in contrast to an important variant called
the "soft-thresholding operator" $\alpha^{\text{st}}$, which was introduced by Donoho and Johnstone (1994) in the context of wavelets:^{3}

^{3}3Note that the soft-thresholding operator appears in fact earlier in the statistics literature (see Efron and Morris, 1971; Bickel, 1984), but it was used there for a different purpose.

(1.5) $\alpha^{\text{st}}_i = \operatorname{sign}(\beta_i)\max\!\left(|\beta_i| - \lambda, 0\right) \quad \text{for } i = 1, \ldots, n,$

where $\lambda$ is a parameter playing the same role as $\mu$ in (1.4). Not only does the operator set small coefficients of $\beta$ to zero, but it also reduces the magnitude of the non-zero ones. Both operators are illustrated and compared to each other in Figure 1.1. Interestingly, whereas $\alpha^{\text{ht}}$ is the solution of (1.3) when $\mu$ corresponds to the entry of $\beta$ with $k$-th largest magnitude, $\alpha^{\text{st}}$ is in fact the solution of the following sparse reconstruction problem with the orthogonal matrix $D$:

(1.6) $\min_{\alpha \in \mathbb{R}^n} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1.$

This formulation will be the topic of the next section for general non-orthogonal
matrices. Similar to statistics, where choosing the parameter $k$ of the
best subset selection was difficult, automatically selecting the best
thresholds $\mu$ or $\lambda$ has been a major research
topic (see, *e.g.*, Donoho and Johnstone, 1994, 1995; Chang et al., 2000a, b).
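The two operators are straightforward to implement; here is a minimal sketch (ours) with a small numerical example:

```python
import numpy as np

def hard_threshold(beta, mu):
    """Eq. (1.4): keep entries whose magnitude is at least mu, zero the rest."""
    return beta * (np.abs(beta) >= mu)

def soft_threshold(beta, lam):
    """Eq. (1.5): shrink magnitudes by lam; entries below lam become zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta = np.array([3.0, -0.5, 1.5, -2.0, 0.2])
```

With threshold 1.0, hard-thresholding keeps the three large coefficients unchanged, while soft-thresholding keeps the same support but shrinks each surviving magnitude by 1.0.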

##### Structured group thresholding.

Wavelet coefficients have a particular structure since the basis elements are dilated and shifted versions of each other. It is for instance possible to define neighborhood relationships for wavelets whose spatial supports are close to each other, or hierarchical relationships between wavelets with the same or similar localization but different scales. For one-dimensional signals, we present in Figure 1.2 a typical organization of wavelet coefficients on a tree with arrows representing such relations. For two-dimensional images, the structure is slightly more involved and the coefficients are usually organized as a collection of quadtrees (see Mallat, 2008, for more details); we present such a configuration in Figure 1.3.

A natural idea has inspired the recent concept of *group sparsity*
that will be presented in the next section; it consists of exploiting the wavelet
structure to improve thresholding estimators. Specifically, it is possible to
use neighborhood relations between wavelet basis elements to define groups of
coefficients that form a partition $\mathcal{G}$ of $\{1, \ldots, n\}$, and to use a
group-thresholding operator (Hall et al., 1999; Cai, 1999) defined for every
group $g$ in $\mathcal{G}$ as

(1.7) $\alpha^{\text{gt}}_g = \beta_g \max\!\left(1 - \frac{\lambda}{\|\beta_g\|_2}, 0\right),$

where $\beta_g$ is the vector of size $|g|$ recording the entries of $\beta$ whose indices are in $g$. By using such an estimator, groups of neighbor coefficients are set to zero together when their joint $\ell_2$-norm $\|\beta_g\|_2$ falls below the threshold $\lambda$. Interestingly, even though the next interpretation does not appear in early work about group-thresholding (Hall et al., 1999; Cai, 1999), it is possible to view $\alpha^{\text{gt}}$ as the solution of the following penalized problem

(1.8) $\min_{\alpha \in \mathbb{R}^n} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|\alpha_g\|_2,$

where the closed-form solution (1.7) holds because $D$ is orthogonal (see Bach et al., 2012a). Such a formulation will be studied in the next section for general matrices.
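The group-thresholding rule of Eq. (1.7) can also be sketched in a few lines (our own illustrative code; the groups here are arbitrary, not derived from an actual wavelet tree):

```python
import numpy as np

def group_threshold(beta, groups, lam):
    """Eq. (1.7): scale each group's coefficients by (1 - lam/||beta_g||_2)_+,
    which zeroes the whole group when its l2-norm is below lam."""
    alpha = np.zeros_like(beta)
    for g in groups:
        norm = np.linalg.norm(beta[g])
        if norm > lam:
            alpha[g] = (1.0 - lam / norm) * beta[g]
    return alpha

beta = np.array([3.0, 4.0, 0.3, -0.4])
groups = [[0, 1], [2, 3]]
alpha = group_threshold(beta, groups, lam=1.0)
```

The first group (norm 5) is shrunk but survives; the second group (norm 0.5) falls below the threshold and is zeroed entirely.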

Finally, other ideas for exploiting both structure and wavelet parsimony have been proposed. One is a coding scheme called "zero-tree" wavelet coding (Shapiro, 1993), which uses the tree structure of wavelets to force all descendants of zero coefficients to be zero as well. Equivalently, a coefficient can be non-zero only if its parent in the tree is non-zero, as illustrated in Figure 1.2. This idea has been revisited later in a more general context by Zhao et al. (2009). Other complex models have been used as well for modeling interactions between coefficients: we can mention the application of hidden Markov models (HMM) to wavelets by Crouse et al. (1998) and the Gaussian scale mixture model of Portilla et al. (2003).

### 1.3 Modern parsimony: the $\ell_1$-norm and other variants

The era of "modern" parsimony corresponds probably to the use of convex optimization techniques for solving feature selection or sparse decomposition problems. Even though the $\ell_1$-norm was introduced for that purpose in geophysics (Claerbout and Muir, 1973; Taylor et al., 1979), it was popularized in statistics with the Lasso estimator of Tibshirani (1996) and independently in signal processing with the basis pursuit formulation of Chen et al. (1999). Given a vector of observations $y$ in $\mathbb{R}^n$ and a matrix of predictors $X$ in $\mathbb{R}^{n \times p}$, the Lasso consists of learning a linear model $w$ by solving the following quadratic program:

(1.9) $\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \leq \mu.$

As detailed in the sequel, the $\ell_1$-norm constraint encourages the solution $w$ to
be sparse, and the parameter $\mu$ is used to control the trade-off between data
fitting and the sparsity of $w$. In practice, reducing the value of $\mu$
indeed leads in general to sparser solutions, *i.e.*, with more zeroes, even
though there is no formal relation between the sparsity of $w$ and
its $\ell_1$-norm for general matrices $X$.

The basis pursuit denoising formulation of Chen et al. (1999) is relatively similar, but the $\ell_1$-norm is used as a penalty instead of a constraint. It can be written as

(1.10) $\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|_2^2 + \lambda \|w\|_1,$

which is essentially equivalent to (1.9) from a convex optimization
perspective, and in fact (1.10) is also often called "Lasso" in
the literature.
Given some data $y$, matrix $X$, and parameter $\mu$, we indeed know
from Lagrange multiplier theory (see, *e.g.*, Borwein and Lewis, 2006; Boyd and Vandenberghe, 2004)
that for every solution $w^\star$ of (1.9), there exists a
parameter $\lambda \geq 0$ such that $w^\star$ is also a solution
of (1.10). We note, however, that there is no direct mapping
between $\mu$ and $\lambda$,
and thus the choice of formulation (1.9) or (1.10) should
be made according to how easy it is to select the parameter $\mu$ or $\lambda$.
For instance, one may prefer (1.9) when a priori information
about the $\ell_1$-norm of the solution is available.
In Figure 1.4, we illustrate the effect of changing the value of the
regularization parameter $\lambda$ on the solution of (1.10) for
two datasets. When $\lambda = 0$, the solution is dense; in general,
increasing $\lambda$ sets more and more variables to zero. However, the
relation between $\lambda$ and the sparsity of the solution is not exactly
monotonic: in a few cases, increasing $\lambda$ yields a denser solution.
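The overall trend is easy to reproduce numerically. The sketch below (ours) solves (1.10) with the iterative soft-thresholding algorithm (ISTA), one of the classical solvers discussed later in the monograph, and counts non-zeros as $\lambda$ grows:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Solve (1.10) by iterative soft-thresholding (ISTA): a gradient step
    on the quadratic term followed by the soft-thresholding operator."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - X.T @ (X @ w - y) / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
w_true = np.concatenate([[3.0, -2.0], np.zeros(18)])   # 2 active variables
y = X @ w_true + 0.1 * rng.normal(size=60)

nnz = [np.count_nonzero(lasso_ista(X, y, lam)) for lam in (0.1, 5.0, 50.0)]
```

On this example, a tiny $\lambda$ leaves most variables non-zero, while a moderate $\lambda$ recovers the two-variable support.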

Another “equivalent” formulation consists of finding a sparse decomposition under a reconstruction constraint:

(1.11) $\min_{w \in \mathbb{R}^p} \|w\|_1 \quad \text{s.t.} \quad \|y - Xw\|_2^2 \leq \varepsilon.$

This formulation can be useful when we have a priori knowledge about the noise level and the parameter $\varepsilon$ is easy to choose. The link between (1.10) and (1.11) is similar to the link between (1.10) and (1.9).

For noiseless problems, Chen et al. (1999) have also introduced a formulation simply called “basis pursuit” (without the terminology “denoising”), defined as

(1.12) $\min_{w \in \mathbb{R}^p} \|w\|_1 \quad \text{s.t.} \quad y = Xw,$

which is related to (1.10) in the sense that the set of solutions of (1.10) converges to the set of solutions of (1.12) when $\lambda$ converges to $0$, whenever the linear system $y = Xw$ is feasible. These four formulations (1.9)–(1.12) have gained a large success beyond the statistics and signal processing communities. More generally, the $\ell_1$-norm has been used as a regularization function beyond the least-squares context, leading to problems of the form

(1.13) $\min_{w \in \mathbb{R}^p} f(w) + \lambda \|w\|_1,$

where $f$ is a loss function. In the rest of this section, we will present several variants of the $\ell_1$-norm, but before that, we will try to understand why such a penalty is appropriate for sparse estimation.

##### Why does the $\ell_1$-norm induce sparsity?

Even though we have claimed that there is no rigorous relation between the sparsity of $w$ and its $\ell_1$-norm in general, intuition about the sparsity-inducing effect of the $\ell_1$-norm may be obtained from several viewpoints.

###### Analytical point of view.

In the previous section about wavelets, we have seen that when the design matrix is orthogonal, the $\ell_1$-regularized problem (1.10) admits an analytic closed-form solution (1.5) obtained by soft-thresholding. As a result, whenever the magnitude of the inner product $|x_j^\top y|$ between the $j$-th predictor and the observations is smaller than $\lambda$, the corresponding variable $w^\star_j$ is equal to zero. Thus, the number of zeroes of the solution monotonically increases with $\lambda$.

For non-orthogonal matrices $X$, such a monotonic relation does not formally
hold anymore; in practice, however, the sparsity-inducing property of
the $\ell_1$-penalty remains effective, as illustrated in
Figure 1.4. Some intuition about this fact can be gained by studying optimality
conditions for the general $\ell_1$-regularized problem (1.13)
where $f$ is a convex differentiable function. The following lemma details these conditions.
[Optimality conditions for $\ell_1$-regularized problems]

A vector $w$ in $\mathbb{R}^p$ is a solution of (1.13) if and only if

(1.14) for all $j$ in $\{1, \ldots, p\}$: $\quad \nabla f(w)_j = -\lambda \operatorname{sign}(w_j)$ if $w_j \neq 0$, and $|\nabla f(w)_j| \leq \lambda$ otherwise.

###### Proof.

A proof using the classical concept of subdifferential from convex optimization
can be found in, *e.g.*, Bach et al. (2012a). Here, we provide instead an elementary proof
using the simpler concept of directional derivative for nonsmooth functions,
defined as, when the limit exists,

$\nabla g(w; z) \triangleq \lim_{t \to 0^+} \frac{g(w + tz) - g(w)}{t},$

for a function $g$ at a point $w$ in $\mathbb{R}^p$ and a direction $z$ in $\mathbb{R}^p$. For convex functions $g$, directional derivatives always exist, and a classical optimality condition for $w$ to be a minimum of $g$ is to have $\nabla g(w; z)$ non-negative for all directions $z$ (Borwein and Lewis, 2006). Intuitively, this means that one cannot find any direction $z$ such that an infinitesimal move along $z$ from $w$ decreases the value of the objective. When $g$ is differentiable, the condition is equivalent to the classical optimality condition $\nabla g(w) = 0$.

We can now apply the directional derivative condition to the function $g : w \mapsto f(w) + \lambda \|w\|_1$, which is equivalent to

(1.15) for all $z$ in $\mathbb{R}^p$: $\quad \nabla f(w)^\top z + \lambda \sum_{j : w_j \neq 0} \operatorname{sign}(w_j) z_j + \lambda \sum_{j : w_j = 0} |z_j| \geq 0.$

It is then easy to show that (1.15) holds for all $z$ if and only if the inequality holds for the specific values $z = e_j$ and $z = -e_j$ for all $j$, where $e_j$ is the vector in $\mathbb{R}^p$ with zeroes everywhere except for the $j$-th entry that is equal to one. This immediately provides an equivalence between (1.15) and (1.14). ∎
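The conditions of the lemma are easy to verify numerically. The sketch below (ours) uses an orthogonal design, for which the soft-thresholding solution is known in closed form, and checks both branches of (1.14):

```python
import numpy as np

rng = np.random.default_rng(4)
# Orthogonal design: X^T X = I, so the solution of (1.13) with
# f(w) = 0.5*||y - Xw||^2 is the soft-thresholding of X^T y.
X, _ = np.linalg.qr(rng.normal(size=(30, 30)))
y = rng.normal(size=30)
lam = 0.7
beta = X.T @ y
w = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

grad = X.T @ (X @ w - y)      # gradient of f at the candidate solution
# Conditions (1.14): grad_j = -lam*sign(w_j) where w_j != 0,
# and |grad_j| <= lam where w_j == 0.
nz = w != 0
ok_nonzero = np.allclose(grad[nz], -lam * np.sign(w[nz]))
ok_zero = np.all(np.abs(grad[~nz]) <= lam + 1e-12)
```

Both checks pass, confirming that the soft-thresholded vector satisfies the optimality conditions exactly.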

###### Physical point of view.

In image processing or computer vision, the word "energy" often denotes the objective function of a minimization problem; it is indeed common in physics to have complex systems that stabilize at a configuration of minimum potential energy. The negative of the energy's gradient represents a force, a terminology we will borrow in this paragraph. Consider for instance a one-dimensional $\ell_1$-regularized estimation problem,

(1.16) $\min_{w \in \mathbb{R}} \frac{1}{2}(u - w)^2 + \lambda |w|,$

where $u$ is the observation and $\lambda$ is a positive constant. Whenever $w$ is non-zero, the $\ell_1$-penalty is
differentiable with derivative $\lambda \operatorname{sign}(w)$. When interpreting this
objective as an energy minimization problem, the $\ell_1$-penalty can be seen
as applying *a force driving $w$ towards the origin with constant
intensity $\lambda$*. Consider now instead the squared $\ell_2$-penalty, also
called regularization of Tikhonov (1963), or ridge regression regularization
(Hoerl and Kennard, 1970):

(1.17) $\min_{w \in \mathbb{R}} \frac{1}{2}(u - w)^2 + \frac{\lambda}{2} w^2.$

The derivative of the quadratic energy $\frac{\lambda}{2}w^2$ is
$\lambda w$. It can be interpreted as *a force that also points
to the origin but with linear intensity $\lambda |w|$*. Therefore, the force
corresponding to the ridge regularization can be arbitrarily strong
when $w$ is large, but it fades away when $w$ gets close to zero.
As a result, the squared $\ell_2$-regularization does not have a
sparsity-inducing effect. From an analytical point of view, we have
seen that the solution of (1.16) is zero when $|u|$ is smaller
than $\lambda$. In contrast, the solution of (1.17) admits the
closed form $w^\star = u/(1 + \lambda)$. And thus, regardless of the
parameter $\lambda$, the solution is never zero unless $u$ is.
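The contrast between the two closed forms can be stated in two lines of code (our own sketch of the 1-D solutions just discussed):

```python
import numpy as np

def l1_solution(u, lam):
    """Closed form of (1.16): soft-thresholding of u; exactly zero when |u| <= lam."""
    return np.sign(u) * max(abs(u) - lam, 0.0)

def ridge_solution(u, lam):
    """Closed form of (1.17): plain shrinkage u/(1+lam), never exactly zero."""
    return u / (1.0 + lam)
```

For $u = 0.4$ and $\lambda = 0.5$, the $\ell_1$ solution is exactly zero, while the ridge solution is merely shrunk.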

We present a physical example illustrating this phenomenon in Figure 1.5. We use springs whose potential energy is known to be quadratic, and objects with a gravitational potential energy that is approximately linear on the Earth’s surface.

###### Geometrical point of view.

The sparsity-inducing effect of the $\ell_1$-norm can also be interpreted by studying the geometry of the $\ell_1$-ball $\{w \in \mathbb{R}^p : \|w\|_1 \leq \mu\}$. More precisely, understanding the effect of the Euclidean projection onto this set is important: in simple cases where the design matrix $X$ is orthogonal, the solution of (1.9) can indeed be obtained by the projection

(1.18) $\min_{w \in \mathbb{R}^p} \|\beta - w\|_2^2 \quad \text{s.t.} \quad \|w\|_1 \leq \mu,$

where $\beta \triangleq X^\top y$. When $X$ is not orthogonal, a classical algorithm for solving (1.9) is the projected gradient method (see Section 5.2), which performs a sequence of projections (1.18) for different values of $\beta$. Note that how to solve (1.18) efficiently is well studied; it can be achieved in $O(p)$ operations with a divide-and-conquer strategy (Brucker, 1984; Duchi et al., 2008).
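For illustration, here is the classical sort-based projection algorithm (our sketch; it runs in $O(p \log p)$, whereas the divide-and-conquer variant cited above achieves $O(p)$):

```python
import numpy as np

def project_l1_ball(v, mu):
    """Euclidean projection of v onto the l1-ball of radius mu:
    find the soft-threshold level theta so the result has l1-norm mu."""
    if np.abs(v).sum() <= mu:
        return v.copy()                       # already inside the ball
    u = np.sort(np.abs(v))[::-1]              # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - mu)[0][-1]
    theta = (css[rho] - mu) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

w = project_l1_ball(np.array([3.0, -1.0, 0.2]), mu=2.0)
```

Note how the projection itself is a soft-thresholding, consistent with the sparsifying behavior described next: here the point $(3, -1, 0.2)$ lands exactly on a corner of the ball.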

In Figure 1.6, we illustrate the effect of the $\ell_1$-norm projection and compare it to the case of the $\ell_2$-norm. The corners of the $\ell_1$-ball lie on the main axes and correspond to sparse solutions; two of them are represented by red and green dots. Most strikingly, a large part of the space in the figure, represented by the red and green regions, ends up on these corners after projection. In contrast, the set of points that is projected onto the blue dot, which corresponds to a dense solution, is simply the blue line. Therefore, the figure illustrates that the $\ell_1$-ball in two dimensions encourages solutions to be on its corners. The $\ell_2$-ball, in contrast, is isotropic and treats every direction equally. In Figure 1.7, we represent these two balls in three dimensions, where we can make similar observations.

More formally, we can mathematically characterize our remarks
about Figure 1.6. Consider a point $z$ in $\mathbb{R}^p$ on the surface of
the $\ell_1$-ball of radius $\mu$, and define the set $E_z \triangleq \{u \in \mathbb{R}^p : \Pi(u) = z\}$, where $\Pi$ is the projection operator onto the $\ell_1$-ball.
Examples of pairs $(z, E_z)$ have been presented in Figure 1.6;
for instance, when $z$ is the red or green dot, $E_z$ is respectively the red
or green region. It is particularly informative to study how $E_z$ varies
with $z$, which is the focus of the next proposition.
[Characterization of the set $E_z$]

For a non-zero vector $z$ in $\mathbb{R}^p$, the set $E_z$ defined in the
previous paragraph can be written as $E_z = z + N_z$, where $N_z$
is a polyhedral cone of dimension $p - \|z\|_0 + 1$.

###### Proof.

A classical theorem (see Bertsekas, 1999, Proposition B.11) allows us to rewrite $E_z$ as

$E_z = z + N_z,$

where $z + N_z$ denotes the Minkowski sum between the set $\{z\}$ and the cone $N_z$ defined as

$N_z \triangleq \{u \in \mathbb{R}^p : u^\top (w - z) \leq 0 \ \text{for all } w \text{ s.t. } \|w\|_1 \leq \mu\}.$

Note that in the optimization literature, $N_z$ is often called the "normal cone" to the $\ell_1$-ball at the point $z$ (Borwein and Lewis, 2006). Equivalently, we have

(1.19) $N_z = \{u \in \mathbb{R}^p : \mu \|u\|_\infty \leq u^\top z\},$

where we have used the fact that the quantity $\max_{\|w\|_1 \leq \mu} u^\top w$, involving the dual norm of the $\ell_1$-norm, is equal to $\mu \|u\|_\infty$ (see Bach et al., 2012a). Note now that according to Hölder's inequality, we also have $u^\top z \leq \|u\|_\infty \|z\|_1 = \mu \|u\|_\infty$ in Eq. (1.19). Therefore, the inequalities are in fact equalities. It is then easy to characterize the vectors $u$ such that $u^\top z = \mu \|u\|_\infty$, and it is possible to show that $N_z$ is simply the set of vectors $u$ satisfying $u_j = \operatorname{sign}(z_j)\|u\|_\infty$ for all $j$ such that $z_j \neq 0$.

This would be sufficient to conclude the proposition, but it is also possible
to pursue the analysis and exactly characterize $N_z$ by finding a set of generators.^{4} Let us
define the vector $s$ in $\mathbb{R}^p$ that carries the sparsity pattern
of $z$; more precisely, $s_j = \operatorname{sign}(z_j)$ for all $j$ such
that $z_j \neq 0$, and $s_j = 0$ otherwise. Let us also define the set $J$ of
indices corresponding to the zero entries of $z$,
and, for $j$ in $J$, the binary vector $e_j$ in $\mathbb{R}^p$ whose entries are all zero but
the $j$-th one that is equal to $1$.
Then, after a short calculation, we can geometrically characterize
the polyhedral cone $N_z$:

$N_z = \operatorname{cone}\left(\{s\} \cup \{s + e_j, \, s - e_j : j \in J\}\right),$

where the notation "cone" is defined in footnote 4. ∎

^{4}4A collection of vectors $u_1, \ldots, u_r$ are called generators for a cone $K$ when $K$ consists of all positive combinations of the vectors $u_1, \ldots, u_r$—in other words, $K = \{\sum_{i=1}^r \gamma_i u_i : \gamma_i \geq 0\}$. In that case, we use the notation $K = \operatorname{cone}(u_1, \ldots, u_r)$.

It is now easy to see that the set $E_z$ "grows" with the number of zero
entries of $z$, and that $N_z$ lives in a subspace of dimension $p - \|z\|_0 + 1$ for every
non-zero vector $z$. For example, when $\|z\|_0 = p$—that is, $z$ is a dense
vector (*e.g.*, the blue point in Figure 1.6(a))—$N_z$ is simply a
half-line.

To conclude, the geometrical intuition to gain from this section is that
*the Euclidean projection onto a convex set encourages solutions on
singular points, such as edges or corners for polytopes*. Such a principle
indeed applies beyond the $\ell_1$-norm. For instance, we illustrate the
regularization effect of the $\ell_\infty$-norm, whose ball has corners with
coordinates of the same magnitude, in Figure 1.8.

##### Non-convex regularization.

Even though it is well established that the $\ell_1$-norm encourages sparse solutions, it remains only a convex proxy of the $\ell_0$-penalty. Both in statistics and signal processing, other sparsity-inducing regularization functions have been proposed, in particular continuous relaxations of $\ell_0$ that are non-convex (Frank and Friedman, 1993; Fan and Li, 2001; Daubechies et al., 2010; Gasso et al., 2009). These functions use a non-decreasing concave function $\varphi : \mathbb{R}^+ \to \mathbb{R}$, and the sparsity-inducing penalty is defined as

$\psi(w) \triangleq \sum_{j=1}^{p} \varphi(|w_j|).$

For example, the $\ell_q$-penalty with $q < 1$ uses $\varphi : x \mapsto x^q$ (Frank and Friedman, 1993); the reweighted-$\ell_1$ algorithm of Fazel (2002); Fazel et al. (2003); Candès et al. (2008) implicitly uses $\varphi : x \mapsto \log(x + \varepsilon)$ for a small $\varepsilon > 0$. These penalties typically lead to intractable estimation problems, but approximate solutions can be obtained with continuous optimization techniques (see Section 5.3).
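The reweighted-$\ell_1$ scheme in particular admits a compact sketch: each outer iteration solves a weighted Lasso whose weights $1/(|w_j| + \varepsilon)$ come from linearizing the concave log penalty at the current iterate. The code below (ours; the inner solver is plain ISTA, and all parameter values are illustrative) shows the idea:

```python
import numpy as np

def reweighted_l1(X, y, lam, eps=0.1, n_outer=5, n_inner=300):
    """Reweighted-l1 sketch: alternate between updating the weights
    1/(|w_j|+eps) and solving the resulting weighted Lasso with ISTA."""
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_outer):
        weights = 1.0 / (np.abs(w) + eps)
        for _ in range(n_inner):           # ISTA on the weighted problem
            z = w - X.T @ (X @ w - y) / L
            w = np.sign(z) * np.maximum(np.abs(z) - lam * weights / L, 0.0)
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 12))
y = X @ np.concatenate([[2.0], np.zeros(11)])   # one active variable
w = reweighted_l1(X, y, lam=0.5)
```

Large coefficients receive small weights and are barely shrunk, while small coefficients are penalized heavily, which is exactly the "stronger than $\ell_1$" effect discussed next.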

The sparsity-inducing effect
of the $\ell_q$-penalties with $q < 1$ is known to be stronger than that of $\ell_1$. As shown in
Figure 1.9(a), the magnitude of the derivative of $\varphi$ grows
when one approaches zero because of its concavity. Thus, in the one-dimensional case,
the penalty can be interpreted as *a force driving $w$ towards the origin
with increasing intensity when $w$ gets closer to zero.*
In terms of geometry, we also display the $\ell_q$-ball in
Figure 1.9(b), with the same red, blue, and green dots as in
Figure 1.6. The part of the space that is projected onto the corners of
the $\ell_q$-ball is larger than for $\ell_1$.
Interestingly, the geometrical structures of the red and green regions are also
more complex. Their combinatorial nature makes the projection problem onto
the $\ell_q$-ball more involved when $q < 1$.

##### The elastic-net.

To cope with some instability issues of the estimators obtained with $\ell_1$-regularization, Zou and Hastie (2005) have proposed to combine the $\ell_1$- and $\ell_2$-norms into a penalty called the elastic-net:

$\psi(w) \triangleq \|w\|_1 + \gamma \|w\|_2^2,$

where $\gamma \geq 0$ balances the two terms. The effect of this penalty is illustrated in Figure 1.10. Compared to Figure 1.6, we observe that the red and green regions are smaller for the elastic-net penalty than for $\ell_1$. The sparsity-inducing effect is thus less aggressive than the one obtained with $\ell_1$.
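In the one-dimensional (or orthogonal-design) case, the elastic-net still admits a closed form, which makes its milder behavior explicit: soft-thresholding followed by an extra multiplicative shrinkage. A small sketch (ours; we fold the trade-off parameter into the penalty as $\lambda(|w| + \gamma w^2)$):

```python
import numpy as np

def elastic_net_prox(beta, lam, gamma):
    """Minimizer of 0.5*(beta - w)^2 + lam*(|w| + gamma*w^2):
    soft-threshold by lam, then shrink by 1/(1 + 2*lam*gamma)."""
    soft = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
    return soft / (1.0 + 2.0 * lam * gamma)

out = elastic_net_prox(np.array([2.0, 0.3, -1.5]), lam=0.5, gamma=1.0)
```

Small inputs are still zeroed exactly (the $\ell_1$ part), while surviving coefficients are additionally shrunk toward zero (the $\ell_2$ part).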

##### Total variation.

The anisotropic total variation penalty (Rudin et al., 1992) for one-dimensional signals is simply the $\ell_1$-norm of finite differences,

$\psi(w) \triangleq \sum_{j=1}^{p-1} |w_{j+1} - w_j|,$

which encourages piecewise-constant signals. It is also known in statistics under the name of "fused Lasso" (Tibshirani et al., 2005). The penalty can easily be extended to two-dimensional signals, and it has been widely used for regularizing inverse problems in image processing (Chambolle, 2005).
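The penalty itself is a one-liner, and a small example (ours) shows why it favors piecewise-constant signals: an oscillating signal with the same range pays a much larger cost than a signal with a single jump.

```python
import numpy as np

def anisotropic_tv(w):
    """1-D anisotropic total variation: l1-norm of the finite differences."""
    return np.abs(np.diff(w)).sum()

piecewise = np.array([1.0, 1.0, 1.0, 4.0, 4.0])   # one jump of height 3
wiggly = np.array([1.0, 4.0, 1.0, 4.0, 1.0])      # four jumps of height 3
```

The piecewise-constant signal has total variation 3, the oscillating one 12.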

##### Group sparsity.

In some cases, variables are organized into predefined groups forming a
partition $\mathcal{G}$ of $\{1, \ldots, p\}$, and one is looking for a
solution such that *variables belonging to the same group
of $\mathcal{G}$ are set to zero together.* For example, such groups have appeared in
Section 1.2 about wavelets, where $\mathcal{G}$ could be defined
according to neighborhood relationships of wavelet coefficients.
Then, when it is known beforehand that a problem solution only requires a few groups
of variables to explain the data, a regularization function that automatically
selects the relevant groups has been shown to improve the prediction performance
or to provide a solution with a better interpretation (Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2009; Huang and Zhang, 2010).
The group sparsity principle is illustrated in Figure 1.11(b).

An appropriate regularization function to obtain a group-sparsity effect is known as the "Group-Lasso" penalty and is defined as

(1.20) $\psi(w) \triangleq \sum_{g \in \mathcal{G}} \|w_g\|_q,$

where $\|\cdot\|_q$ is either the $\ell_2$- or the $\ell_\infty$-norm. To the best of our knowledge, such a penalty appears in the early work of Grandvalet and Canu (1999) and Bakin (1999) for $q = 2$, and of Turlach et al. (2005) for $q = \infty$. It was popularized later by Yuan and Lin (2006).

The function $\psi$ in (1.20) is a norm, thus convex, and can be interpreted as the $\ell_1$-norm of the vector $[\|w_g\|_q]_{g \in \mathcal{G}}$ of size $|\mathcal{G}|$. Consequently, the sparsity-inducing effect of the $\ell_1$-norm is applied at the group level. The penalty is highly related to the group-thresholding approach for wavelets, since the group-thresholding estimator (1.7) is linked to (1.20) through Eq. (1.8).

In Figure 1.13(a), we visualize the unit ball of a Group-Lasso norm in three dimensions obtained when $\mathcal{G}$ contains two groups, one with the first two variables and one with the third. The ball has two singularities: the top and bottom corners, corresponding to solutions where the first two variables are simultaneously set to zero, and the middle circle, corresponding to solutions where the third variable only is set to zero. As expected, the geometry of the ball induces the group-sparsity effect.

##### Structured sparsity.

Group-sparsity is a first step towards the more general idea that a regularization
function can encourage sparse solutions with a particular structure. This
notion is called *structured sparsity* and has been introduced from a
large number of different points of
view (Zhao et al., 2009; Jacob et al., 2009; Jenatton et al., 2011a; Baraniuk et al., 2010; Huang et al., 2011). To some extent, it
follows the concept of group-thresholding introduced in the wavelet
literature, which we have presented in Section 1.2.
In this paragraph, we briefly review some of these works; for a more
detailed review, we refer the reader to Bach et al. (2012b).

Some penalties are non-convex. For instance, Huang et al. (2011)
and Baraniuk et al. (2010) propose two different combinatorial approaches based on a
predefined set of possibly overlapping groups of variables.
These penalties encourage solutions whose support is contained in the *union of a
small number of groups*, but they lead to NP-hard optimization problems.
Other penalties are convex. In particular, Jacob et al. (2009) introduce a sparsity-inducing
norm that is exactly a convex relaxation of the penalty of Huang et al. (2011),
even though these two approaches were independently developed at the same time.
As a result, the convex penalty of Jacob et al. (2009) encourages a structure
similar to that of Huang et al. (2011).

By following a different direction, the Group-Lasso penalty (1.20) has been considered in the case where the groups are allowed to overlap (Zhao et al., 2009; Jenatton et al., 2011a). As a consequence, variables belonging to the same groups are encouraged to be set to zero together. This was proposed for hierarchical structures by Zhao et al. (2009) with the following rule: whenever two groups $g$ and $g'$ are in $\mathcal{G}$, they should either be disjoint, or one should be included in the other. Examples of such hierarchical group structures are given in Figures 1.11 and 1.12. The effect of the penalty is to encourage sparsity patterns that are rooted subtrees. Equivalently, a variable can be non-zero only if its parent in the tree is non-zero, which is the main property of the zero-tree coding scheme introduced in the wavelet literature (Shapiro, 1993), and already illustrated in Figure 1.2.

Finally, Jenatton et al. (2011a) have extended the hierarchical penalty of Zhao et al. (2009) to more general group structures, for example when variables are organized on a two-dimensional grid, encouraging neighboring variables to be simultaneously set to zero. We conclude this brief presentation of structured sparsity with Figure 1.13, where we present the unit balls of several sparsity-inducing norms. Each of them exhibits singularities and encourages particular sparsity patterns.

##### Spectral sparsity.

Another form of parsimony has been devised in the spectral domain (Fazel et al., 2001; Srebro et al., 2005). For estimation problems where model parameters are matrices, the rank has been used as a natural regularization function. The rank of a matrix $W$ is equal to the number of non-zero singular values, and thus it can be interpreted as the $\ell_0$-penalty of the matrix spectrum. Unfortunately, due to its combinatorial nature, the rank penalization typically leads to intractable optimization problems.

A natural convex relaxation has been introduced in the control theory literature by Fazel et al. (2001) and consists of computing the $\ell_1$-norm of the spectrum—that is, simply the sum of the singular values. The resulting penalty appears under different names, the most common ones being the trace, nuclear, or Schatten norm. It is defined for a matrix $W$ in $\mathbb{R}^{p \times k}$ with $k \leq p$ as

$\|W\|_* \triangleq \sum_{i=1}^{k} \sigma_i(W),$

where $\sigma_i(W)$ is the $i$-th singular value of $W$. Traditional applications of the trace norm in machine learning are matrix completion or collaborative filtering (Pontil et al., 2007; Abernethy et al., 2009). These problems have become popular with the need for scalable recommender systems for video streaming providers. The goal is to infer movie preferences for each customer, based on their partial movie ratings. Typically, the rows of the matrix correspond to movies and the columns to users; each user gives a score for a few movies, corresponding to some observed entries of the matrix, and the recommender system tries to infer the missing values. Similar techniques have also recently been used in other fields, such as in genomics to infer missing genetic information (Chi et al., 2013).
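The spectral analogue of soft-thresholding, often called singular value thresholding, is the proximal operator of the trace norm: soft-threshold the singular values, which can zero some of them and hence reduce the rank. A short sketch (ours, with synthetic low-rank data):

```python
import numpy as np

def svt(W, lam):
    """Prox of the trace norm: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(6)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))   # rank-2 matrix
W = A + 0.01 * rng.normal(size=(8, 8))                  # plus small noise
W_hat = svt(W, lam=1.0)
```

The noisy matrix is numerically full rank, but thresholding the small singular values recovers a low-rank estimate, exactly as soft-thresholding sparsifies a vector.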

### 1.4 Dictionary learning

We have previously presented various formulations where a signal $x$
in $\mathbb{R}^m$ is approximated by a sparse linear combination of a few columns of
a matrix $D$ in $\mathbb{R}^{m \times p}$. In the context of signal and image processing,
this matrix is often called the *dictionary* and its columns the *atoms*. As
seen in Section 1.2, a large amount of work has been devoted in
the wavelet literature to designing a good dictionary adapted to natural
images.

In neuroscience, Olshausen and Field (1996, 1997) have proposed a significantly different approach to sparse modeling, consisting of adapting the dictionary to training data. Because the size of natural images is too large for learning a full matrix $\mathbf{D}$, they have chosen to learn the dictionary on natural image patches, *e.g.*, of size $16 \times 16$ pixels, and have demonstrated that their method could automatically discover interpretable structures. We discuss this topic in more detail in Section 2.

The motivation of Olshausen and Field (1996, 1997) was to show that the
structure of natural images is related to classical theories of the mammalian
visual cortex. Later, dictionary learning found numerous applications in
image restoration, and was shown to significantly
outperform off-the-shelf bases for signal reconstruction (see,
*e.g.*, Elad and Aharon, 2006; Mairal et al., 2008c, 2009; Protter and Elad, 2009; Yang et al., 2010a).

Concretely, given a dataset of $n$ training signals $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^m$, dictionary learning can be formulated as the following minimization problem:

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \; \sum_{i=1}^{n} \frac{1}{2} \left\| \mathbf{x}_i - \mathbf{D} \boldsymbol{\alpha}_i \right\|_2^2 + \lambda \psi(\boldsymbol{\alpha}_i), \tag{1.21}$$

where $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_n]$ carries the decomposition coefficients of the signals $\mathbf{x}_1, \ldots, \mathbf{x}_n$, $\psi$ is a sparsity-inducing regularization function, and $\mathcal{C}$ is typically chosen as the following set:

$$\mathcal{C} \triangleq \left\{ \mathbf{D} \in \mathbb{R}^{m \times p} \; : \; \|\mathbf{d}_j\|_2 \leq 1 \;\; \text{for all } j = 1, \ldots, p \right\}.$$

To be more precise, Olshausen and Field (1996) proposed several choices for $\psi$; their experiments were for instance conducted with the $\ell_1$-norm, or with the smooth function $\psi(\boldsymbol{\alpha}) = \sum_{j=1}^{p} \log(1 + \boldsymbol{\alpha}[j]^2)$, which has an approximate sparsity-inducing effect. The constraint $\mathbf{D} \in \mathcal{C}$ was also not explicitly modeled in the original dictionary learning formulation; instead, the algorithm of Olshausen and Field (1996) includes a mechanism to control and rescale the $\ell_2$-norm of the dictionary elements. Indeed, without such a mechanism, the norm of $\mathbf{D}$ would grow arbitrarily large, leading to small values for the coefficients and making the penalty $\psi$ ineffective.
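For intuition, the alternating strategy commonly used to address this kind of formulation can be sketched as follows. This is a minimal illustration added for this presentation, assuming the $\ell_1$-norm for $\psi$ and plain ISTA for the sparse coding step; it is not the original algorithm of Olshausen and Field (1996):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, D, lam, n_iter=100):
    """ISTA for min_A 0.5 ||X - D A||_F^2 + lam ||A||_1, with D fixed."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)           # gradient of the quadratic term
        A = soft_threshold(A - grad / L, lam / L)
    return A

def dict_update(X, A):
    """Least-squares dictionary update, columns projected onto the set C."""
    # Solve min_D ||X - D A||_F^2, then enforce ||d_j||_2 <= 1.
    D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(A.shape[0]))
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms

def dictionary_learning(X, p=8, lam=0.1, n_outer=20, seed=0):
    """Alternate between sparse coding and dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)         # start with unit-norm atoms
    A = np.zeros((p, X.shape[1]))
    for _ in range(n_outer):
        A = sparse_code(X, D, lam)         # convex in A for fixed D
        D = dict_update(X, A)              # convex in D for fixed A
    return D, A
```

Each outer iteration alternates a convex sparse coding step (dictionary fixed) and a convex dictionary update (coefficients fixed), projecting the atoms back onto the unit Euclidean ball; efficient algorithms for both steps are discussed in Section 5.5.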

The number of samples $n$ is typically large, whereas the signal dimension $m$ is small. The number of dictionary elements $p$ is often chosen larger than $m$; in that case, the dictionary is said to be *overcomplete*, even though a choice $p \leq m$ often leads to reasonable results in many applications. For instance, a typical setting would be to have $m = 10 \times 10 = 100$ pixels for natural image patches, a dictionary of size $p = 256$, and more than $100{,}000$ training patches.

A large part of this monograph is related to dictionary learning, and thus we only briefly discuss this matter in this introduction. Section 2 is indeed devoted to unsupervised learning techniques for natural image patches, including dictionary learning; Sections 3 and 4 present a large number of applications in image processing and computer vision; how to solve (1.21) is explained in Section 5.5 about optimization.

##### Matrix factorization point of view.

An equivalent representation of (1.21) is the following regularized matrix factorization problem:

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \; \frac{1}{2} \left\| \mathbf{X} - \mathbf{D}\mathbf{A} \right\|_{\mathrm{F}}^2 + \lambda \Psi(\mathbf{A}), \tag{1.22}$$

where $\mathbf{X} \triangleq [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ in $\mathbb{R}^{m \times n}$ and $\Psi(\mathbf{A}) \triangleq \sum_{i=1}^{n} \psi(\boldsymbol{\alpha}_i)$. Even though reformulating (1.21) as (1.22) is simply a matter of using different notation, seeing dictionary learning as a matrix factorization problem opens up interesting perspectives. In particular, it makes obvious some links with other unsupervised learning approaches such as non-negative matrix factorization (Paatero and Tapper, 1994), clustering techniques such as K-means, and others (see Mairal et al., 2010a). These links will be further developed in Section 2.

##### Risk minimization point of view.

Dictionary learning can also be seen from a machine learning point of view. Indeed, dictionary learning can be written as

$$\min_{\mathbf{D} \in \mathcal{C}} \; f_n(\mathbf{D}) \triangleq \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{x}_i, \mathbf{D}),$$

where $L$ is a loss function defined as

$$L(\mathbf{x}, \mathbf{D}) \triangleq \min_{\boldsymbol{\alpha} \in \mathbb{R}^p} \; \frac{1}{2} \|\mathbf{x} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda \psi(\boldsymbol{\alpha}).$$

The quantity $L(\mathbf{x}, \mathbf{D})$ should be small if $\mathbf{D}$ is “good” at representing the signal $\mathbf{x}$ in a sparse fashion, and large otherwise. Then, $f_n$ is called the *empirical cost*.

However, as pointed out by Bottou and Bousquet (2008), one is usually not interested in the exact minimization of the empirical cost $f_n(\mathbf{D})$, which may lead to overfitting on the training data, but instead in the minimization of the expected cost, which measures the quality of the dictionary on new unseen data:

$$f(\mathbf{D}) \triangleq \mathbb{E}_{\mathbf{x}}\left[ L(\mathbf{x}, \mathbf{D}) \right] = \lim_{n \to \infty} f_n(\mathbf{D}) \quad \text{a.s.},$$

where the expectation is taken relative to the (unknown) probability distribution of the data.^{5}

^{5} We use “a.s.” to denote almost sure convergence.

The expected risk minimization formulation is interesting since it paves the way to stochastic optimization techniques when a large amount of data is available (Mairal et al., 2010a) and to theoretical analysis (Maurer and Pontil, 2010; Vainsencher et al., 2011; Gribonval et al., 2013), which are developed in Sections 5.5 and 1.6, respectively.
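To make the stochastic optimization perspective concrete, here is a toy projected stochastic gradient sketch on the expected cost, added here for illustration only and again assuming the $\ell_1$-norm for $\psi$; the online algorithm of Mairal et al. (2010a) is more refined than this plain SGD:

```python
import numpy as np

def lasso_ista(x, D, lam, n_iter=100):
    """Sparse code one signal: min_a 0.5 ||x - D a||^2 + lam ||a||_1 (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = a - D.T @ (D @ a - x) / L
        a = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return a

def sgd_dictionary(sample_x, m, p, lam=0.1, eta=0.05, n_steps=500, seed=0):
    """Projected SGD on the expected cost f(D) = E[L(x, D)].

    `sample_x` is any callable returning a fresh signal in R^m; at the sparse
    code a, (D a - x) a^T is a gradient of D -> L(x, D) (Danskin's theorem).
    """
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    for t in range(n_steps):
        x = sample_x()
        a = lasso_ista(x, D, lam)
        D -= eta / (1.0 + 0.01 * t) * np.outer(D @ a - x, a)  # gradient step
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)       # project onto C
    return D
```

Each iteration touches a single signal, so the memory footprint does not grow with the number of training samples, which is the main appeal of stochastic methods in the large-scale regime.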

##### Constrained variants.

Following the original formulation of Olshausen and Field (1996, 1997), we have chosen to present dictionary learning where the regularization function $\psi$ is used as a penalty, even though it can also be used as a constraint, as in (1.9). Then, natural variants of (1.21) are

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \; \sum_{i=1}^{n} \frac{1}{2} \|\mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i\|_2^2 \quad \text{s.t.} \quad \psi(\boldsymbol{\alpha}_i) \leq \mu \;\; \text{for all } i, \tag{1.23}$$

or

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \; \sum_{i=1}^{n} \psi(\boldsymbol{\alpha}_i) \quad \text{s.t.} \quad \|\mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i\|_2^2 \leq \varepsilon \;\; \text{for all } i. \tag{1.24}$$

Note that (1.23) and (1.24) are not equivalent to (1.21). For instance, problem (1.23) can be reformulated using a Lagrangian function (Boyd and Vandenberghe, 2004) as

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \; \sum_{i=1}^{n} \frac{1}{2} \|\mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i\|_2^2 + \lambda_i \psi(\boldsymbol{\alpha}_i),$$

where each $\lambda_i$ is a Lagrange multiplier associated with the $i$-th constraint; the multipliers may differ from one signal to another, whereas (1.21) uses a single parameter $\lambda$ shared by all signals.