1.1 Early concepts of parsimony in statistics
A large number of statistical procedures can be formulated as maximum likelihood estimation. Given a statistical model with parameters $\theta$ in $\mathbb{R}^p$, it consists of minimizing with respect to $\theta$ an objective function representing the negative log-likelihood of observed data. Assuming for instance that we observe $n$ independent samples $x^1, \ldots, x^n$ of the (unknown) data distribution, we need to solve
$$\min_{\theta \in \mathbb{R}^p} \; -\frac{1}{n} \sum_{i=1}^n \log P_\theta(x^i), \qquad (1.1)$$
where $P_\theta$ is some probability distribution parameterized by $\theta$.
Simple methods such as ordinary least squares can be written as (1.1). Consider for instance data points $x^i$ that are pairs $(y^i, \mathbf{z}^i)$, where $y^i$ is an observation in $\mathbb{R}$ and $\mathbf{z}^i$ is a vector in $\mathbb{R}^p$, and assume that there exists a linear relation $y^i = \mathbf{z}^{i\top}\theta + \varepsilon^i$, where $\varepsilon^i$ is an approximation error for observation $i$. Under a model where the $\varepsilon^i$'s are independent and identically normally distributed with zero mean, Eq. (1.1) is equivalent to a least squares problem:
$$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{2n} \sum_{i=1}^n (y^i - \mathbf{z}^{i\top}\theta)^2.$$
Note that the Gaussian noise assumption is not necessary to justify the ordinary least squares formulation; it is only sufficient to interpret it as maximum likelihood estimation. In fact, as long as the conditional expectation of $y$ given $\mathbf{z}$ is linear, the ordinary least squares estimator is statistically consistent under mild assumptions.
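As a minimal numerical sketch of this equivalence, the Gaussian maximum likelihood estimate reduces to an ordinary least squares fit; all names below (`Z`, `y`, `theta`) are illustrative and not taken from the text.

```python
import numpy as np

# Ordinary least squares as maximum likelihood under Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 3
Z = rng.standard_normal((n, p))           # predictors z^i stacked as rows
theta_true = np.array([1.0, -2.0, 0.5])
y = Z @ theta_true + 0.01 * rng.standard_normal(n)  # y^i = z^i' theta + eps^i

# Minimizing the negative Gaussian log-likelihood amounts to solving the
# least squares problem, done here with numpy's least-squares routine.
theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

With such a small noise level, the estimate recovers the true parameters to high accuracy.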
To prevent overfitting and to improve the interpretability of the learned model, it was suggested in early work that a solution involving only a few model variables could be more appropriate than an exact solution of (1.1); in other words, a sparse solution involving only—let us say—$k$ variables might be desirable in some situations. Unfortunately, such a strategy raises two difficulties: first, it is not clear a priori how to choose $k$; second, finding the best subset of $k$ variables is NP-hard in general (Natarajan, 1995). The first issue was addressed with several criteria for controlling the trade-off between the sparsity of the solution and the adequacy of the fit to training data. For the second issue, approximate computational techniques have been proposed.
Mallows's $C_p$, AIC, and BIC.
For the ordinary least squares problem, Mallows (1964, 1966) introduced the $C_p$-statistic, later generalized by Akaike (1973) with the Akaike information criterion (AIC), and then by Schwarz (1978) with the Bayesian information criterion (BIC). Using $C_p$, AIC, or BIC is equivalent to solving the penalized maximum likelihood estimation problem
$$\min_{\theta \in \mathbb{R}^p} \; -\frac{1}{n} \sum_{i=1}^n \log P_\theta(x^i) + \lambda \|\theta\|_0, \qquad (1.2)$$
where $\|\theta\|_0$—the $\ell_0$-penalty—counts the number of non-zero entries of $\theta$, and $\lambda$ depends on the chosen criterion (see Hastie et al., 2009). Similar formulations have also been derived by using the minimum description length (MDL) principle for model selection (Rissanen, 1978; Barron et al., 1998). As shown by Natarajan (1995), the problem (1.2) is NP-hard, and approximate algorithms are necessary unless $p$ is very small.
Forward selection and best subset selection for least squares.
To obtain an approximate solution of (1.2), a classical approach is the forward selection technique, a greedy algorithm that solves a sequence of maximum likelihood estimation problems computed on a subset of variables. After every iteration, a new variable is added to the subset according to the chosen sparsity criterion in a greedy manner. Some variants allow backward steps—that is, a variable can possibly exit the active subset after an iteration. The algorithm is presented in more detail in Section 5.1 and seems to be due to Efroymson (1960), according to Hocking (1976). Other approaches considered in the 1970s also include the leaps and bounds technique of Furnival and Wilson (1974), a branch-and-bound algorithm providing the exact solution of (1.2) with exponential worst-case complexity.
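The greedy strategy above can be sketched in a few lines for the least squares case: at each step, add the variable that most reduces the residual sum of squares. This is an illustrative implementation, not Efroymson's exact procedure (which also involves entry and exit tests).

```python
import numpy as np

def forward_selection(X, y, k):
    # Greedily build a subset of k columns of X by adding, at each step,
    # the variable whose inclusion minimizes the residual sum of squares.
    n, p = X.shape
    active = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            theta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ theta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return sorted(active)

# Demo on synthetic data whose response depends on two variables only.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.01 * rng.standard_normal(100)
selected = forward_selection(X, y, 2)
```

On this well-conditioned example the greedy method recovers the true support, but in general it offers no optimality guarantee for (1.2).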
1.2 Wavelets in signal processing
In signal processing, problems similar to those of statistics have been studied in the context of wavelets. In a nutshell, a wavelet basis represents a set of functions that are essentially dilated and shifted versions of each other. Unlike Fourier bases, wavelets have the interesting properties of being localized both in the space and frequency domains, and of being suitable for the multi-resolution analysis of signals (Mallat, 1989).
The concept of parsimony is central to wavelets. When a signal is “smooth” in a particular sense (see Mallat, 2008), it can be well approximated by a linear combination of a few wavelets. Specifically, the signal is close to a wavelet expansion in which only a few coefficients are non-zero, and the resulting compact representation has effective applications in estimation and compression. The wavelet theory is well developed for continuous signals, e.g., signals in the Hilbert space $L^2(\mathbb{R})$, but also for discrete signals in $\mathbb{R}^m$, making it suitable for modern digital image processing.
Since the first wavelet was introduced by Haar (1910), much research has been devoted to designing a wavelet set that is adapted to particular signals such as natural images. After a long quest for good orthogonal bases such as the one proposed by Daubechies (1988), a series of works has focused on wavelet sets whose elements are not linearly independent. This resulted in a large number of variants, such as steerable wavelets (Simoncelli et al., 1992), curvelets (Candès and Donoho, 2002), contourlets (Do and Vetterli, 2005), or bandlets (Le Pennec and Mallat, 2005). For the purpose of our monograph, one concept related to sparse estimation is particularly important; it is called wavelet thresholding.
Sparse estimation and wavelet thresholding.
Let us consider a discrete signal represented by a vector $x$ in $\mathbb{R}^m$ and an orthogonal wavelet basis set $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_m]$—that is, satisfying $\mathbf{D}^\top \mathbf{D} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Approximating $x$ by a sparse linear combination of wavelet elements can be formulated as finding a sparse vector $\alpha$ in $\mathbb{R}^m$, say with $k$ non-zero coefficients, that minimizes
$$\min_{\alpha \in \mathbb{R}^m} \|x - \mathbf{D}\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le k. \qquad (1.3)$$
The sparse decomposition problem (1.3) is an instance of the best subset selection formulation presented in Section 1.1, where $\alpha$ represents the model parameters, demonstrating that similar topics arise in statistics and signal processing. However, whereas (1.3) is NP-hard for general matrices (Natarajan, 1995), we have assumed $\mathbf{D}$ to be orthogonal in the context of wavelets. As such, (1.3) is equivalent to
$$\min_{\alpha \in \mathbb{R}^m} \|\mathbf{D}^\top x - \alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le k,$$
and admits a closed-form solution. Let us indeed define the vector $\beta \triangleq \mathbf{D}^\top x$ in $\mathbb{R}^m$, corresponding to the exact non-sparse decomposition of $x$ onto $\mathbf{D}$—that is, we have $x = \mathbf{D}\beta$ since $\mathbf{D}$ is orthogonal. To obtain the best $k$-sparse approximation, we denote by $\mu$ the $k$-th largest value among the set $\{|\beta_1|, \ldots, |\beta_m|\}$, and the solution of (1.3) is obtained by applying to $\beta$ an operator called “hard-thresholding” and defined as
$$\alpha^{\mathrm{ht}}_j \triangleq \mathbf{1}(|\beta_j| \ge \mu)\, \beta_j \quad \text{for } j = 1, \ldots, m, \qquad (1.4)$$
where $\mathbf{1}(\cdot)$ is the indicator function, which is equal to $1$ if its argument is true and $0$ otherwise. In other words, the hard-thresholding operator simply sets to zero the coefficients of $\beta$ whose magnitude is below the threshold $\mu$. The corresponding procedure, called “wavelet thresholding”, is simple and effective for image denoising, even though it does not perform as well as recent state-of-the-art techniques presented in Section 3. When an image is noisy, e.g., corrupted by white Gaussian noise, and $\mu$ is well chosen, the estimate $\mathbf{D}\alpha^{\mathrm{ht}}$ is a good estimate of the clean original image. The terminology “hard” is defined in contrast to an important variant called the “soft-thresholding operator”, which was introduced by Donoho and Johnstone (1994) in the context of wavelets (note that the soft-thresholding operator appears in fact earlier in the statistics literature—see Efron and Morris, 1971; Bickel, 1984—but it was used there for a different purpose):
$$\alpha^{\mathrm{st}}_j \triangleq \operatorname{sign}(\beta_j)\max(|\beta_j| - \lambda, 0) \quad \text{for } j = 1, \ldots, m, \qquad (1.5)$$
where $\lambda$ is a parameter playing the same role as $\mu$ in (1.4). Not only does the operator set small coefficients of $\beta$ to zero, but it also reduces the magnitude of the non-zero ones. Both operators are illustrated and compared to each other in Figure 1.1. Interestingly, whereas $\alpha^{\mathrm{ht}}$ is the solution of (1.3) when $\mu$ corresponds to the entry of $\beta$ with $k$-th largest magnitude, $\alpha^{\mathrm{st}}$ is in fact the solution of the following sparse reconstruction problem involving the orthogonal matrix $\mathbf{D}$:
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - \mathbf{D}\alpha\|_2^2 + \lambda \|\alpha\|_1. \qquad (1.6)$$
This formulation will be the topic of the next section for general non-orthogonal matrices. Similar to statistics, where choosing the parameter $k$ of the best subset selection was difficult, automatically selecting the best thresholds $\mu$ or $\lambda$ has been a major research topic (see, e.g., Donoho and Johnstone, 1994, 1995; Chang et al., 2000a,b).
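Both thresholding operators are one-liners in practice; here is a minimal sketch following the definitions above, with illustrative names and thresholds.

```python
import numpy as np

def hard_threshold(beta, mu):
    # Keep entries with magnitude at least mu; zero out the rest (cf. Eq. 1.4).
    return beta * (np.abs(beta) >= mu)

def soft_threshold(beta, lam):
    # Shrink magnitudes by lam, zeroing entries below the threshold (cf. Eq. 1.5).
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta = np.array([3.0, -0.5, 1.2, -2.5])
alpha_ht = hard_threshold(beta, 1.0)  # small entries removed, others kept
alpha_st = soft_threshold(beta, 1.0)  # small entries removed, others shrunk
```

Note how soft-thresholding also shrinks the surviving coefficients, whereas hard-thresholding leaves them untouched.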
Structured group thresholding.
Wavelet coefficients have a particular structure since the basis elements are dilated and shifted versions of each other. It is for instance possible to define neighborhood relationships for wavelets whose spatial supports are close to each other, or hierarchical relationships between wavelets with the same or similar localization but different scales. For one-dimensional signals, we present in Figure 1.2 a typical organization of wavelet coefficients on a tree with arrows representing such relations. For two-dimensional images, the structure is slightly more involved and the coefficients are usually organized as a collection of quadtrees (see Mallat, 2008, for more details); we present such a configuration in Figure 1.3.
A natural idea has inspired the recent concept of group sparsity that will be presented in the next section; it consists of exploiting the wavelet structure to improve thresholding estimators. Specifically, it is possible to use neighborhood relations between wavelet basis elements to define groups of coefficients that form a partition $\mathcal{G}$ of $\{1, \ldots, m\}$, and to use a group-thresholding operator (Hall et al., 1999; Cai, 1999) defined for every group $g$ in $\mathcal{G}$ as
$$\alpha^{\mathrm{gt}}[g] \triangleq \beta[g]\, \max\!\Big(1 - \frac{\lambda}{\|\beta[g]\|_2}, 0\Big), \qquad (1.7)$$
where $\beta[g]$ is the vector of size $|g|$ recording the entries of $\beta$ whose indices are in $g$. By using such an estimator, groups of neighboring coefficients are set to zero together when their joint $\ell_2$-norm falls below the threshold $\lambda$. Interestingly, even though the next interpretation does not appear in early work about group-thresholding (Hall et al., 1999; Cai, 1999), it is possible to view $\alpha^{\mathrm{gt}}$ as the solution of the following penalized problem:
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - \mathbf{D}\alpha\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \|\alpha[g]\|_2. \qquad (1.8)$$
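The group-thresholding rule can be sketched directly from its definition: each group is zeroed jointly when its Euclidean norm falls below the threshold, and shrunk otherwise. The group structure and names below are illustrative.

```python
import numpy as np

def group_threshold(beta, groups, lam):
    # Apply the group-thresholding operator: for each group g, keep
    # beta[g] scaled by max(1 - lam/||beta[g]||_2, 0).
    alpha = np.zeros_like(beta)
    for g in groups:
        norm = np.linalg.norm(beta[g])
        if norm > lam:
            alpha[g] = beta[g] * (1.0 - lam / norm)
    return alpha

# Demo: the first group survives (shrunk), the second is zeroed jointly.
beta = np.array([3.0, 4.0, 0.1, 0.1])
alpha_gt = group_threshold(beta, [[0, 1], [2, 3]], lam=1.0)
```

Here the first group has norm 5 and is rescaled by 0.8, while the second group's joint norm is below the threshold and both of its entries vanish together.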
Finally, other ideas for exploiting both structure and wavelet parsimony have been proposed. One is a coding scheme called “zero-tree” wavelet coding (Shapiro, 1993), which uses the tree structure of wavelets to force all descendants of zero coefficients to be zero as well. Equivalently, a coefficient can be non-zero only if its parent in the tree is non-zero, as illustrated in Figure 1.2. This idea has been revisited later in a more general context by Zhao et al. (2009). Other complex models have been used as well for modeling interactions between coefficients: we can mention the application of hidden Markov models (HMM) to wavelets by Crouse et al. (1998) and the Gaussian scale mixture model of Portilla et al. (2003).
1.3 Modern parsimony: the $\ell_1$-norm and other variants
The era of “modern” parsimony corresponds probably to the use of convex optimization techniques for solving feature selection or sparse decomposition problems. Even though the $\ell_1$-norm was introduced for that purpose in geophysics (Claerbout and Muir, 1973; Taylor et al., 1979), it was popularized in statistics with the Lasso estimator of Tibshirani (1996) and independently in signal processing with the basis pursuit formulation of Chen et al. (1999). Given observations $y$ in $\mathbb{R}^n$ and a matrix of predictors $X$ in $\mathbb{R}^{n \times p}$, the Lasso consists of learning a linear model $y \approx X\beta$ by solving the following quadratic program:
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le \mu. \qquad (1.9)$$
As detailed in the sequel, the $\ell_1$-norm encourages the solution to be sparse, and the parameter $\mu$ is used to control the trade-off between data fitting and the sparsity of $\beta$. In practice, reducing the value of $\mu$ indeed leads in general to sparser solutions, i.e., with more zeroes, even though there is no formal relation between the sparsity of $\beta$ and its $\ell_1$-norm for general matrices $X$.
The basis pursuit denoising formulation of Chen et al. (1999) is relatively similar, but the $\ell_1$-norm is used as a penalty instead of a constraint. It can be written as
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad (1.10)$$
which is essentially equivalent to (1.9) from a convex optimization perspective, and in fact (1.10) is also often called “Lasso” in the literature. Given some data $y$, matrix $X$, and parameter $\mu$, we indeed know from Lagrange multiplier theory (see, e.g., Borwein and Lewis, 2006; Boyd and Vandenberghe, 2004) that for every solution of (1.9), there exists a parameter $\lambda$ such that this solution is also a solution of (1.10). We note, however, that there is no direct mapping between $\mu$ and $\lambda$, and thus the choice of formulation (1.9) or (1.10) should be made according to how easy it is to select the parameter $\mu$ or $\lambda$. For instance, one may prefer (1.9) when a priori information about the $\ell_1$-norm of the solution is available. In Figure 1.4, we illustrate the effect of changing the value of the regularization parameter $\lambda$ on the solution of (1.10) for two datasets. When $\lambda = 0$, the solution is dense; in general, increasing $\lambda$ sets more and more variables to zero. However, the relation between $\lambda$ and the sparsity of the solution is not exactly monotonic: in a few cases, increasing $\lambda$ yields a denser solution.
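Problem (1.10) can be solved with the iterative soft-thresholding algorithm (ISTA) discussed later in the optimization section; the short sketch below alternates a gradient step on the quadratic term with a soft-thresholding step, with illustrative names and a fixed step size.

```python
import numpy as np

def soft_threshold(beta, t):
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    # Iterative soft-thresholding for (1/2)||y - X beta||^2 + lam ||beta||_1:
    # gradient step with step size 1/L, then soft-threshold by lam/L.
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = soft_threshold(beta - X.T @ (X @ beta - y) / L, lam / L)
    return beta
```

For an orthogonal design, the iterate converges immediately to the soft-thresholding solution of the previous section, which provides a simple sanity check.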
Another “equivalent” formulation consists of finding a sparse decomposition under a reconstruction constraint:
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2^2 \le \varepsilon. \qquad (1.11)$$
This formulation can be useful when a priori knowledge about the noise level is available, making the parameter $\varepsilon$ easy to choose. The link between (1.10) and (1.11) is similar to the link between (1.10) and (1.9).
For noiseless problems, Chen et al. (1999) have also introduced a formulation simply called “basis pursuit” (without the terminology “denoising”), defined as
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad y = X\beta, \qquad (1.12)$$
which is related to (1.10) in the sense that the set of solutions of (1.10) converges to the solutions of (1.12) when $\lambda$ converges to $0^+$, whenever the linear system $y = X\beta$ is feasible. These four formulations (1.9)–(1.12) have gained a large success beyond the statistics and signal processing communities. More generally, the $\ell_1$-norm has been used as a regularization function beyond the least-squares context, leading to problems of the form
$$\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda\|\beta\|_1, \qquad (1.13)$$
where $f$ is a loss function. In the rest of this section, we will present several variants of the $\ell_1$-norm, but before that, we will try to understand why such a penalty is appropriate for sparse estimation.
Why does the $\ell_1$-norm induce sparsity?
Even though we have claimed that there is no rigorous relation between the sparsity of $\beta$ and its $\ell_1$-norm in general, intuition about the sparsity-inducing effect of the $\ell_1$-norm may be obtained from several viewpoints.
Analytical point of view.
In the previous section about wavelets, we have seen that when the matrix $X$ is orthogonal, the $\ell_1$-decomposition problem (1.10) admits the analytic closed-form solution (1.5) obtained by soft-thresholding. As a result, whenever the magnitude of the inner product between $y$ and the $j$-th column of $X$ is smaller than $\lambda$, the corresponding variable $\beta_j$ is equal to zero. Thus, the number of zeroes of the solution monotonically increases with $\lambda$.
For non-orthogonal matrices $X$, such a monotonic relation does not formally hold anymore; in practice, the sparsity-inducing property of the $\ell_1$-penalty remains effective, as illustrated in Figure 1.4. Some intuition about this fact can be gained by studying optimality conditions for the general $\ell_1$-regularized problem (1.13), where $f$ is a differentiable function. The following lemma details these conditions.
[Optimality conditions for $\ell_1$-regularized problems]
A vector $\beta$ in $\mathbb{R}^p$ is a solution of (1.13) if and only if
$$\text{for all } j = 1, \ldots, p, \quad \begin{cases} \nabla f(\beta)_j = -\lambda \operatorname{sign}(\beta_j) & \text{if } \beta_j \neq 0, \\ |\nabla f(\beta)_j| \le \lambda & \text{otherwise.} \end{cases} \qquad (1.14)$$
A proof using the classical concept of subdifferential from convex optimization can be found in, e.g., Bach et al. (2012a). Here, we provide instead an elementary proof using the simpler concept of directional derivative for nonsmooth functions, defined, when the limit exists, as
$$\nabla g(\beta; u) \triangleq \lim_{t \to 0^+} \frac{g(\beta + t u) - g(\beta)}{t},$$
for a function $g$ at a point $\beta$ in $\mathbb{R}^p$ and a direction $u$ in $\mathbb{R}^p$. For convex functions $g$, directional derivatives always exist, and a classical optimality condition for $\beta$ to be a minimum of $g$ is to have $\nabla g(\beta; u)$ non-negative for all directions $u$ (Borwein and Lewis, 2006). Intuitively, this means that one cannot find any direction $u$ such that an infinitesimal move along $u$ from $\beta$ decreases the value of the objective. When $g$ is differentiable, the condition is equivalent to the classical optimality condition $\nabla g(\beta) = 0$.
We can now apply the directional derivative condition to the function $g : \beta \mapsto f(\beta) + \lambda\|\beta\|_1$, which is equivalent to
$$\text{for all } u \in \mathbb{R}^p, \quad \nabla f(\beta)^\top u + \lambda \sum_{j : \beta_j \neq 0} \operatorname{sign}(\beta_j)\, u_j + \lambda \sum_{j : \beta_j = 0} |u_j| \ge 0. \qquad (1.15)$$
It is then easy to show that (1.15) holds for all $u$ if and only if the inequality holds for the specific values $u = e_j$ and $u = -e_j$ for all $j$, where $e_j$ is the vector in $\mathbb{R}^p$ with zeroes everywhere except for the $j$-th entry, which is equal to one. This immediately provides the equivalence between (1.15) and (1.14). ∎
Physical point of view.
In image processing or computer vision, the word “energy” often denotes the objective function of a minimization problem; it is indeed common in physics to have complex systems that stabilize at a configuration of minimum potential energy. The negative of the energy's gradient represents a force, a terminology we will borrow in this paragraph. Consider for instance a one-dimensional $\ell_1$-regularized estimation problem
$$\min_{\beta \in \mathbb{R}} \frac{1}{2}(u - \beta)^2 + \lambda|\beta|, \qquad (1.16)$$
where $u$ is an observation and $\lambda$ is a positive constant. Whenever $\beta$ is non-zero, the $\ell_1$-penalty is differentiable with derivative $\lambda \operatorname{sign}(\beta)$. When interpreting this objective as an energy minimization problem, the $\ell_1$-penalty can be seen as applying a force driving $\beta$ towards the origin with constant intensity $\lambda$. Consider now instead the squared $\ell_2$-penalty, also called the regularization of Tikhonov (1963), or ridge regression regularization (Hoerl and Kennard, 1970):
$$\min_{\beta \in \mathbb{R}} \frac{1}{2}(u - \beta)^2 + \frac{\lambda}{2}\beta^2. \qquad (1.17)$$
The derivative of the quadratic energy is $\lambda\beta$. It can be interpreted as a force that also points to the origin, but with linear intensity $\lambda|\beta|$. Therefore, the force corresponding to the ridge regularization can be arbitrarily strong when $|\beta|$ is large, but it fades away when $\beta$ gets close to zero. As a result, the squared $\ell_2$-regularization does not have a sparsity-inducing effect. From an analytical point of view, we have seen that the solution of (1.16) is zero when $|u|$ is smaller than $\lambda$. In contrast, the solution of (1.17) admits the closed form $\beta^\star = u/(1+\lambda)$: regardless of the parameter $\lambda$, the solution is never zero.
We present a physical example illustrating this phenomenon in Figure 1.5. We use springs whose potential energy is known to be quadratic, and objects with a gravitational potential energy that is approximately linear on the Earth’s surface.
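The contrast between the two closed forms can be checked numerically. The sketch below assumes the one-dimensional objectives (1.16) and (1.17) with the scaling conventions written above; function names are illustrative.

```python
def l1_solution(u, lam):
    # Minimizer of (1/2)(u - b)^2 + lam*|b|: soft-thresholding of u.
    if abs(u) <= lam:
        return 0.0
    return u - lam if u > 0 else u + lam

def ridge_solution(u, lam):
    # Minimizer of (1/2)(u - b)^2 + (lam/2)*b^2: shrinkage, never zero.
    return u / (1.0 + lam)
```

For a small observation `u = 0.5` and `lam = 1.0`, the `l1` solution is exactly zero while the ridge solution is merely shrunk to `0.25`, illustrating why only the former induces sparsity.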
Geometrical point of view.
The sparsity-inducing effect of the $\ell_1$-norm can also be interpreted by studying the geometry of the $\ell_1$-ball $\{\beta \in \mathbb{R}^p : \|\beta\|_1 \le \mu\}$. More precisely, understanding the effect of the Euclidean projection onto this set is important: in simple cases where the design matrix $X$ is orthogonal, the solution of (1.9) can indeed be obtained by the projection
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|\beta_{\mathrm{ls}} - \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le \mu, \qquad (1.18)$$
where $\beta_{\mathrm{ls}} \triangleq X^\top y$ is the unconstrained least squares estimate in the orthogonal case. When $X$ is not orthogonal, a classical algorithm for solving (1.9) is the projected gradient method (see Section 5.2), which performs a sequence of projections (1.18) for different values of $\beta_{\mathrm{ls}}$. Note that how to solve (1.18) efficiently is well studied; it can be achieved in $O(p)$ operations with a divide-and-conquer strategy (Brucker, 1984; Duchi et al., 2008).
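A simple $O(p \log p)$ sorting-based version of this projection is easy to write; the linear-time divide-and-conquer variants cited above rely on the same thresholding principle. All names below are illustrative.

```python
import numpy as np

def project_l1_ball(v, mu):
    # Euclidean projection of v onto {beta : ||beta||_1 <= mu}.
    if np.sum(np.abs(v)) <= mu:
        return v.copy()  # already inside the ball
    u = np.sort(np.abs(v))[::-1]          # magnitudes, sorted descending
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - mu)[0][-1]   # largest valid index
    theta = (css[rho] - mu) / (rho + 1.0)         # soft threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

The projection of a point outside the ball lands exactly on its surface, often on a low-dimensional face, which is the geometric phenomenon discussed next.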
In Figure 1.6, we illustrate the effect of the $\ell_1$-norm projection and compare it to the case of the $\ell_2$-norm. The corners of the $\ell_1$-ball are on the main axes and correspond to sparse solutions; two of them are represented by red and green dots. Most strikingly, a large part of the space in the figure, represented by the red and green regions, ends up on these corners after projection. In contrast, the set of points that is projected onto the blue dot is simply the blue line; the blue dot corresponds in fact to a dense solution. Therefore, the figure illustrates that the $\ell_1$-ball in two dimensions encourages solutions to be on its corners. In the case of the $\ell_2$-norm, the ball is isotropic and treats every direction equally. In Figure 1.7, we represent these two balls in three dimensions, where we can make similar observations.
More formally, we can mathematically characterize our remarks about Figure 1.6. Consider a point $\beta$ in $\mathbb{R}^p$ on the surface of the $\ell_1$-ball of radius $\mu$, and define the set $\mathcal{S}(\beta) \triangleq \{z \in \mathbb{R}^p : \Pi(z) = \beta\}$, where $\Pi$ denotes the Euclidean projection onto the $\ell_1$-ball. Examples of pairs $(\beta, \mathcal{S}(\beta))$ have been presented in Figure 1.6; for instance, when $\beta$ is the red or green dot, $\mathcal{S}(\beta)$ is respectively the red or green region. It is particularly informative to study how $\mathcal{S}(\beta)$ varies with $\beta$, which is the focus of the next proposition.
[Characterization of the set $\mathcal{S}(\beta)$]
For a non-zero vector $\beta$ in $\mathbb{R}^p$ on the surface of the $\ell_1$-ball of radius $\mu$, the set $\mathcal{S}(\beta)$ defined in the previous paragraph can be written as $\beta + \mathcal{N}(\beta)$, where $\mathcal{N}(\beta)$ is a polyhedral cone of dimension $p - \|\beta\|_0 + 1$.
A classical theorem (see Bertsekas, 1999, Proposition B.11) allows us to rewrite $\mathcal{S}(\beta)$ as
$$\mathcal{S}(\beta) = \beta + \mathcal{N}(\beta),$$
where $+$ denotes the Minkowski sum between the set $\{\beta\}$ and the cone $\mathcal{N}(\beta)$ defined as
$$\mathcal{N}(\beta) \triangleq \{z \in \mathbb{R}^p : z^\top(w - \beta) \le 0 \ \text{for all } w \ \text{with } \|w\|_1 \le \mu\}.$$
Note that in the optimization literature, $\mathcal{N}(\beta)$ is often called the “normal cone” to the $\ell_1$-ball at the point $\beta$ (Borwein and Lewis, 2006). Equivalently, we have
$$\mathcal{N}(\beta) = \{z \in \mathbb{R}^p : z^\top \beta \ge \mu \|z\|_\infty\}, \qquad (1.19)$$
where we have used the fact that the quantity $\max_{\|w\|_1 \le 1} z^\top w$, called the dual norm of the $\ell_1$-norm, is equal to $\|z\|_\infty$ (see Bach et al., 2012a). Note now that according to Hölder's inequality, we also have $z^\top \beta \le \|z\|_\infty \|\beta\|_1 = \mu\|z\|_\infty$ in Eq. (1.19). Therefore, the inequalities are in fact equalities. It is then easy to characterize the vectors $z$ achieving these equalities, and it is possible to show that $\mathcal{N}(\beta)$ is simply the set of vectors $z$ satisfying $z_j = \|z\|_\infty \operatorname{sign}(\beta_j)$ for all $j$ such that $\beta_j \neq 0$.
This would be sufficient to conclude the proposition, but it is also possible to pursue the analysis and exactly characterize $\mathcal{N}(\beta)$ by finding a set of generators. (A collection of vectors $u_1, \ldots, u_k$ are called generators for a cone $\mathcal{K}$ when $\mathcal{K}$ consists of all positive combinations of the vectors $u_i$—in other words, $\mathcal{K} = \{\sum_{i=1}^k \gamma_i u_i : \gamma_i \ge 0\}$. In that case, we use the notation $\mathcal{K} = \operatorname{cone}(u_1, \ldots, u_k)$.) Let us define the vector $s$ in $\{-1, 0, +1\}^p$ that carries the sparsity pattern of $\beta$—more precisely, $s_j = \operatorname{sign}(\beta_j)$ for all $j$ such that $\beta_j \neq 0$, and $s_j = 0$ otherwise. Let us also define the set $J$ of indices corresponding to the zero entries of $\beta$, and, for $j$ in $J$, the binary vector $e_j$ in $\mathbb{R}^p$ whose entries are all zero but the $j$-th one, which is equal to $1$. Then, after a short calculation, we can geometrically characterize the polyhedral cone:
$$\mathcal{N}(\beta) = \operatorname{cone}\big(s, \{s + e_j\}_{j \in J}, \{s - e_j\}_{j \in J}\big). \; ∎$$
It is now easy to see that the set $\mathcal{S}(\beta)$ “grows” with the number of zero entries in $\beta$, and that $\mathcal{N}(\beta)$ lives in a subspace of dimension $p - \|\beta\|_0 + 1$ for every non-zero vector $\beta$. For example, when $\|\beta\|_0 = p$—that is, $\beta$ is a dense vector (e.g., the blue point in Figure 1.6(a))—$\mathcal{N}(\beta)$ is simply the half-line $\{\gamma s : \gamma \ge 0\}$.
To conclude, the geometrical intuition to gain from this section is that the Euclidean projection onto a convex set encourages solutions on singular points, such as edges or corners for polytopes. Such a principle indeed applies beyond the $\ell_1$-norm. For instance, we illustrate the regularization effect of the $\ell_\infty$-norm in Figure 1.8; the corners of its ball have coordinates of equal magnitude.
Even though it is well established that the $\ell_1$-norm encourages sparse solutions, it remains only a convex proxy of the $\ell_0$-penalty. Both in statistics and signal processing, other sparsity-inducing regularization functions have been proposed, in particular continuous relaxations of $\ell_0$ that are non-convex (Frank and Friedman, 1993; Fan and Li, 2001; Daubechies et al., 2010; Gasso et al., 2009). These functions use a non-decreasing concave function $\varphi : \mathbb{R}^+ \to \mathbb{R}$, and the sparsity-inducing penalty is defined as
$$\psi(\beta) \triangleq \sum_{j=1}^p \varphi(|\beta_j|).$$
For example, the $\ell_q$-penalty, with $0 < q < 1$, uses $\varphi : t \mapsto t^q$ (Frank and Friedman, 1993), or a smooth approximation of it; the reweighted-$\ell_1$ algorithm of Fazel (2002); Fazel et al. (2003); Candès et al. (2008) implicitly uses $\varphi : t \mapsto \log(t + \varepsilon)$ for a small $\varepsilon$. These penalties typically lead to intractable estimation problems, but approximate solutions can be obtained with continuous optimization techniques (see Section 5.3).
The sparsity-inducing effect of the $\ell_q$ penalties with $q < 1$ is known to be stronger than that of $\ell_1$. As shown in Figure 1.9(a), the magnitude of the derivative of $\varphi$ grows when one approaches zero because of its concavity. Thus, in the one-dimensional case, the penalty can be interpreted as a force driving $\beta$ towards the origin with increasing intensity as $\beta$ gets closer to zero. In terms of geometry, we also display the $\ell_q$-ball in Figure 1.9(b), with the same red, blue, and green dots as in Figure 1.6. The part of the space that is projected onto the corners of the $\ell_q$-ball is larger than that for $\ell_1$. Interestingly, the geometrical structure of the red and green regions is also more complex. Their combinatorial nature makes the projection problem onto the $\ell_q$-ball more involved when $q < 1$.
To cope with some instability issues of the estimators obtained with the $\ell_1$-regularization, Zou and Hastie (2005) have proposed to combine the $\ell_1$- and $\ell_2$-norms with a penalty called the elastic-net:
$$\psi(\beta) \triangleq \lambda\|\beta\|_1 + \frac{\gamma}{2}\|\beta\|_2^2.$$
The effect of this penalty is illustrated in Figure 1.10. Compared to Figure 1.6, we observe that the red and green regions are smaller for the elastic-net penalty than for $\ell_1$. The sparsity-inducing effect is thus less aggressive than the one obtained with $\ell_1$ alone.
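In one dimension, the elastic-net estimator has a simple closed form combining the two effects: soft-thresholding from the $\ell_1$ part, then a ridge-like rescaling. The sketch below assumes the penalty $\lambda|b| + (\gamma/2)b^2$ with the scaling convention written above; names are illustrative.

```python
def elastic_net_prox(u, lam, gamma):
    # Minimizer of (1/2)(u - b)^2 + lam*|b| + (gamma/2)*b^2, obtained
    # from the optimality condition b - u + lam*sign(b) + gamma*b = 0:
    # soft-threshold by lam, then shrink by 1/(1 + gamma).
    st = max(abs(u) - lam, 0.0)
    st = st if u >= 0 else -st
    return st / (1.0 + gamma)
```

Small inputs are still set exactly to zero (the $\ell_1$ effect), while surviving coefficients are shrunk more than with $\ell_1$ alone (the $\ell_2$ effect).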
The anisotropic total variation penalty (Rudin et al., 1992) for one-dimensional signals is simply the $\ell_1$-norm of finite differences,
$$\psi(\beta) \triangleq \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j|,$$
which encourages piecewise-constant signals. It is also known in statistics under the name of “fused Lasso” (Tibshirani et al., 2005). The penalty can easily be extended to two-dimensional signals, and has been widely used for regularizing inverse problems in image processing (Chambolle, 2005).
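The piecewise-constant preference is immediate from the definition: a signal pays only for its jumps. A minimal sketch, with illustrative signals:

```python
import numpy as np

def tv_penalty(beta):
    # Anisotropic total variation of a 1-D signal: the l1-norm of its
    # finite differences.
    return np.sum(np.abs(np.diff(beta)))

piecewise = np.array([0.0, 0.0, 1.0, 1.0, 1.0])    # one jump of height 1
oscillating = np.array([0.0, 1.0, 0.0, 1.0, 0.0])  # four jumps of height 1
```

Both signals have the same amplitude, but the oscillating one is four times more expensive under the penalty.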
In some cases, variables are organized into predefined groups forming a partition $\mathcal{G}$ of $\{1, \ldots, p\}$, and one is looking for a solution such that variables belonging to the same group are set to zero together. For example, such groups have appeared in Section 1.2 about wavelets, where they could be defined according to neighborhood relationships of wavelet coefficients. Then, when it is known beforehand that a problem solution only requires a few groups of variables to explain the data, a regularization function automatically selecting the relevant groups has been shown to improve the prediction performance or to provide a solution with a better interpretation (Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2009; Huang and Zhang, 2010). The group sparsity principle is illustrated in Figure 1.11(b).
An appropriate regularization function for obtaining a group-sparsity effect is known as the “Group-Lasso” penalty and is defined as
$$\psi(\beta) \triangleq \sum_{g \in \mathcal{G}} \|\beta[g]\|_q, \qquad (1.20)$$
where $\|\cdot\|_q$ is either the $\ell_2$- or $\ell_\infty$-norm, and $\beta[g]$ records the entries of $\beta$ indexed by $g$. To the best of our knowledge, such a penalty appears in the early work of Grandvalet and Canu (1999) and Bakin (1999) for $q = 2$, and of Turlach et al. (2005) for $q = \infty$. It has been popularized later by Yuan and Lin (2006).
The function in (1.20) is a norm, thus convex, and can be interpreted as the $\ell_1$-norm of the vector $[\|\beta[g]\|_q]_{g \in \mathcal{G}}$ of size $|\mathcal{G}|$. Consequently, the sparsity-inducing effect of the $\ell_1$-norm is applied at the group level. The penalty is highly related to the group-thresholding approach for wavelets, since the group-thresholding estimator (1.7) is linked to it through Eq. (1.8).
In Figure 1.13(a), we visualize the unit ball of a Group-Lasso norm obtained when $\mathcal{G}$ contains two groups, one with the first two variables and one with the third. The ball has two kinds of singularities: the top and bottom corners, corresponding to solutions where the variables of the first group are simultaneously set to zero, and the middle circle, corresponding to solutions where the variable of the second group only is set to zero. As expected, the geometry of the ball induces the group-sparsity effect.
Group-sparsity is a first step towards the more general idea that a regularization function can encourage sparse solutions with a particular structure. This notion is called structured sparsity and has been introduced under a large number of different points of view (Zhao et al., 2009; Jacob et al., 2009; Jenatton et al., 2011a; Baraniuk et al., 2010; Huang et al., 2011). To some extent, it follows the concept of group-thresholding introduced in the wavelet literature, which we have presented in Section 1.2. In this paragraph, we briefly review some of these works, but for a more detailed review, we refer the reader to Bach et al. (2012b).
Some penalties are non-convex. For instance, Huang et al. (2011) and Baraniuk et al. (2010) propose two different combinatorial approaches based on a predefined set of possibly overlapping groups of variables. These penalties encourage solutions whose support is in the union of a small number of groups, but they lead to NP-hard optimization problems. Other penalties are convex. In particular, Jacob et al. (2009) introduce a sparsity-inducing norm that is exactly a convex relaxation of the penalty of Huang et al. (2011), even though these two approaches were independently developed at the same time. As a result, the convex penalty of Jacob et al. (2009) encourages a similar structure to that of Huang et al. (2011).
Following a different direction, the Group-Lasso penalty (1.20) has been considered in the case where the groups are allowed to overlap (Zhao et al., 2009; Jenatton et al., 2011a). As a consequence, variables belonging to the same groups are encouraged to be set to zero together. The penalty was proposed for hierarchical structures by Zhao et al. (2009) with the following rule: whenever two groups are in $\mathcal{G}$, they should either be disjoint, or one should be included in the other. Examples of such hierarchical group structures are given in Figures 1.11 and 1.12. The effect of the penalty is to encourage sparsity patterns that are rooted subtrees. Equivalently, a variable can be non-zero only if its parent in the tree is non-zero, which is the main property of the zero-tree coding scheme introduced in the wavelet literature (Shapiro, 1993), and already illustrated in Figure 1.2.
Finally, Jenatton et al. (2011a) have extended the hierarchical penalty of Zhao et al. (2009) to more general group structures, for example when variables are organized on a two-dimensional grid, encouraging neighboring variables to be simultaneously set to zero. We conclude this brief presentation of structured sparsity with Figure 1.13, where we present the unit balls of some sparsity-inducing norms. Each of them exhibits singularities and encourages particular sparsity patterns.
For estimation problems where model parameters are matrices, the rank has been used as a natural regularization function. The rank of a matrix is equal to its number of non-zero singular values, and thus it can be interpreted as the $\ell_0$-penalty of the matrix spectrum. Unfortunately, due to the combinatorial nature of $\ell_0$, rank penalization typically leads to intractable optimization problems.
A natural convex relaxation has been introduced in the control theory literature by Fazel et al. (2001); it consists of computing the $\ell_1$-norm of the spectrum—that is, simply the sum of the singular values. The resulting penalty appears under different names, the most common ones being the trace, nuclear, or Schatten norm. It is defined for a matrix $A$ in $\mathbb{R}^{m \times n}$ as
$$\|A\|_* \triangleq \sum_{i=1}^{\min(m, n)} s_i(A),$$
where $s_i(A)$ is the $i$-th singular value of $A$. Traditional applications of the trace norm in machine learning are matrix completion and collaborative filtering (Pontil et al., 2007; Abernethy et al., 2009). These problems have become popular with the need for scalable recommender systems for video streaming providers. The goal is to infer movie preferences for each customer based on their partial movie ratings. Typically, the matrix is of size $m \times n$, where $m$ is the number of movies and $n$ is the number of users. Each user gives a score for a few movies, corresponding to some entries of the matrix, and the recommender system tries to infer the missing values. Similar techniques have also recently been used in other fields, such as in genomics to infer missing genetic information (Chi et al., 2013).
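The trace norm is one line of numpy; as an additional sketch, we include singular-value soft-thresholding, the proximal step used by many trace-norm-regularized solvers (a standard tool in that literature, though not detailed in the text). Names are illustrative.

```python
import numpy as np

def trace_norm(A):
    # Sum of singular values: the l1-norm of the spectrum.
    return np.sum(np.linalg.svd(A, compute_uv=False))

def svd_soft_threshold(A, lam):
    # Soft-threshold the spectrum: the proximal operator of lam * ||.||_*.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

A = np.outer([3.0, 4.0], [1.0, 0.0])  # rank-one matrix, singular value 5
```

Just as soft-thresholding a vector sets small coefficients to zero, soft-thresholding the spectrum sets small singular values to zero, producing low-rank estimates.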
1.4 Dictionary learning
We have previously presented various formulations where a signal $x$ in $\mathbb{R}^m$ is approximated by a sparse linear combination of a few columns of a matrix $\mathbf{D}$ in $\mathbb{R}^{m \times k}$. In the context of signal and image processing, this matrix is often called a dictionary and its columns atoms. As seen in Section 1.2, a large amount of work has been devoted in the wavelet literature to designing a good dictionary adapted to natural images.
In neuroscience, Olshausen and Field (1996, 1997) have proposed a significantly different approach to sparse modeling, consisting of adapting the dictionary to training data. Because the size of natural images is too large for learning a full matrix $\mathbf{D}$, they have chosen to learn the dictionary on small natural image patches, and have demonstrated that their method could automatically discover interpretable structures. We discuss this topic in more detail in Section 2.
The motivation of Olshausen and Field (1996, 1997) was to show that the structure of natural images is related to classical theories of the mammalian visual cortex. Later, dictionary learning found numerous applications in image restoration, and was shown to significantly outperform off-the-shelf bases for signal reconstruction (see, e.g., Elad and Aharon, 2006; Mairal et al., 2008c, 2009; Protter and Elad, 2009; Yang et al., 2010a).
Concretely, given a dataset of $n$ training signals $x^1, \ldots, x^n$ in $\mathbb{R}^m$, dictionary learning can be formulated as the following minimization problem:
$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{k \times n}} \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\|x^i - \mathbf{D}\alpha^i\|_2^2 + \lambda \psi(\alpha^i), \qquad (1.21)$$
where $\mathbf{A} = [\alpha^1, \ldots, \alpha^n]$ carries the decomposition coefficients of the signals $x^1, \ldots, x^n$, $\psi$ is a sparsity-inducing regularization function, and $\mathcal{C}$ is typically chosen as the following set:
$$\mathcal{C} \triangleq \{\mathbf{D} \in \mathbb{R}^{m \times k} : \|\mathbf{d}_j\|_2 \le 1 \ \text{for all } j = 1, \ldots, k\}.$$
To be more precise, Olshausen and Field (1996) proposed several choices for $\psi$; their experiments were for instance conducted with the $\ell_1$-norm, or with smooth functions that have an approximate sparsity-inducing effect. The constraint $\mathbf{D} \in \mathcal{C}$ was also not explicitly modeled in the original dictionary learning formulation; instead, the algorithm of Olshausen and Field (1996) includes a mechanism to control and rescale the $\ell_2$-norm of the dictionary elements. Indeed, without such a mechanism, the norm of $\mathbf{D}$ would arbitrarily grow to infinity, leading to small values for the coefficients $\alpha^i$ and making the penalty $\psi$ ineffective.
The number of samples $n$ is typically large, whereas the signal dimension $m$ is small. The number of dictionary elements $k$ is often chosen larger than $m$—in that case, the dictionary is said to be overcomplete—even though a choice $k < m$ often leads to reasonable results in many applications. For instance, typical settings use small natural image patches, a dictionary of a few hundred elements, and a large number of training patches.
A large part of this monograph is related to dictionary learning, and thus we only briefly discuss this matter in this introduction. Section 2 is indeed devoted to unsupervised learning techniques for natural image patches, including dictionary learning; Sections 3 and 4 present a large number of applications in image processing and computer vision; and how to solve (1.21) is explained in Section 5.5 about optimization.
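As a toy preview of the optimization section, here is a minimal alternating-minimization sketch of (1.21) with an $\ell_1$ penalty on the codes: a few ISTA steps on the coefficients, then a projected gradient step on the dictionary. All names, step sizes, and iteration counts are illustrative, not the algorithms studied later.

```python
import numpy as np

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def dictionary_learning(X, k, lam, n_iter=20):
    m, n = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)     # start with unit-norm atoms
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # Sparse coding step: a few ISTA iterations on the coefficients A.
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(10):
            A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
        # Dictionary update: one gradient step, then project onto the set C
        # by renormalizing columns whose norm exceeds one.
        L_D = np.linalg.norm(A, 2) ** 2 + 1e-12
        D -= (D @ A - X) @ A.T / L_D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```

Each alternation decreases the objective of (1.21), but the joint problem is non-convex, so only convergence to a stationary point can be expected.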
Matrix factorization point of view.
An equivalent representation of (1.21) is the following regularized matrix factorization problem:
$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{k \times n}} \frac{1}{2n}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \Psi(\mathbf{A}), \qquad (1.22)$$
where $\mathbf{X} = [x^1, \ldots, x^n]$ and $\Psi(\mathbf{A}) \triangleq \frac{1}{n}\sum_{i=1}^n \psi(\alpha^i)$. Even though reformulating (1.21) as (1.22) is simply a matter of using different notation, seeing dictionary learning as a matrix factorization problem opens up interesting perspectives. In particular, it makes obvious some links with other unsupervised learning approaches such as non-negative matrix factorization (Paatero and Tapper, 1994), clustering techniques such as K-means, and others (see Mairal et al., 2010a). These links will be further developed in Section 2.
Risk minimization point of view.
Dictionary learning can also be seen from a machine learning point of view. Indeed, it can be written as
$$\min_{\mathbf{D} \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^n L(x^i, \mathbf{D}),$$
where $L$ is a loss function defined as
$$L(x, \mathbf{D}) \triangleq \min_{\alpha \in \mathbb{R}^k} \frac{1}{2}\|x - \mathbf{D}\alpha\|_2^2 + \lambda \psi(\alpha).$$
The quantity $L(x, \mathbf{D})$ should be small if $\mathbf{D}$ is “good” at representing the signal $x$ in a sparse fashion, and large otherwise. The resulting objective function is then called the empirical cost.
However, as pointed out by Bottou and Bousquet (2008), one is usually not interested in the exact minimization of the empirical cost for a fixed $n$, which may lead to overfitting on the training data, but instead in the minimization of the expected cost, which measures the quality of the dictionary on new unseen data:
$$f(\mathbf{D}) \triangleq \mathbb{E}_x[L(x, \mathbf{D})] = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n L(x^i, \mathbf{D}) \ \text{a.s.},$$
where the expectation is taken relative to the (unknown) probability distribution of the data, and “a.s.” denotes almost sure convergence.
The expected risk minimization formulation is interesting since it paves the way to stochastic optimization techniques when a large amount of data is available (Mairal et al., 2010a) and to theoretical analysis (Maurer and Pontil, 2010; Vainsencher et al., 2011; Gribonval et al., 2013), which are developed in Sections 5.5 and 1.6, respectively.
Following the original formulation of Olshausen and Field (1996, 1997), we have chosen to present dictionary learning with the regularization function $\psi$ used as a penalty, even though it can also be used as a constraint, as in (1.9). Then, natural variants of (1.21) are