Jojic, Frey and Kannan [jojic] introduced a probabilistic generative image model called an epitome. Intuitively, the epitome is a small image that summarizes the content of a larger one, in the sense that for any patch from the large image there should be a similar one in the epitome. This is an intriguing notion, which has been applied to image reconstruction tasks [jojic], and epitomes have also been extended to the video domain [cheung], where they have been used in denoising, superresolution, object removal and video interpolation. Other successful applications of epitomes include location recognition [ni, chu].
Aharon and Elad [aharon2] have introduced an alternative formulation within the sparse coding framework called image-signature dictionary, and applied it to image denoising. Their formulation unifies the concept of epitome and dictionary learning [elad, field] by allowing an image patch to be represented as a sparse linear combination of several patches extracted from the epitome (Figure 1). The resulting sparse representations are highly redundant (there are as many dictionary elements as overlapping patches in the epitome), with dictionaries represented by a reasonably small number of parameters (the number of pixels in the epitome). Such a representation has also proven to be useful for texture synthesis [peyre].
In a different line of work, some research has been focusing on learning shift-invariant dictionaries [mailhe, thiagarajan], in the sense that it is possible to use dictionary elements with different shifts to represent signals, exhibiting patterns that may appear several times at different positions. While this is different from the image-signature dictionaries of Aharon and Elad [aharon2], the two ideas are related, and as shown in this paper, such a shift invariance can be achieved by using a collection of smaller epitomes. In fact, one of our main contributions is to unify the frameworks of epitome and dictionary learning, and establish the continuity between dictionaries, dictionaries with shift invariance, and epitomes.
We propose a formulation based on the concept of epitomes/image-signature dictionaries introduced by [aharon2, jojic], which makes it possible to learn a collection of epitomes, and which is generic enough to be used with epitomes that may have different shapes, or with different dictionary parameterizations. We present this formulation for the specific case of image patches for simplicity, but it applies to spatio-temporal blocks in a straightforward manner.
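The following notation is used throughout the paper: we define for $q \geq 1$ the $\ell_q$-norm of a vector $x$ in $\mathbb{R}^m$ as $\|x\|_q = (\sum_{i=1}^{m} |x_i|^q)^{1/q}$, where $x_i$ denotes the $i$-th coordinate of $x$. If $B$ is a matrix in $\mathbb{R}^{m \times n}$, $b^i$ will denote its $i$-th row, while $b_j$ will denote its $j$-th column. As usual, $b_{i,j}$ will denote the entry of $B$ at the $i$-th row and $j$-th column. We consider the Frobenius norm of $B$: $\|B\|_F = (\sum_{i=1}^{m} \sum_{j=1}^{n} b_{i,j}^2)^{1/2}$.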
2 Proposed Approach
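Given a set of $n$ training image patches of size $m$ pixels, represented by the columns of a matrix $X = [x_1, \ldots, x_n]$ in $\mathbb{R}^{m \times n}$, the classical dictionary learning formulation, as introduced by [field] and revisited by [elad, mairal7], tries to find a dictionary $D = [d_1, \ldots, d_p]$ in $\mathbb{R}^{m \times p}$ such that each signal $x_i$ can be represented by a sparse linear combination of the columns of $D$. More precisely, the dictionary $D$ is learned along with a matrix of decomposition coefficients $A = [\alpha_1, \ldots, \alpha_n]$ in $\mathbb{R}^{p \times n}$ such that $x_i \approx D\alpha_i$ for every signal $x_i$. Following [mairal7], we consider the following formulation:
$$\min_{D \in \mathcal{D},\; A \in \mathbb{R}^{p \times n}} \; \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_{i=1}^{n} \|\alpha_i\|_1, \tag{1}$$
where the quadratic term ensures that the vectors $x_i$ are close to the approximation $D\alpha_i$, the $\ell_1$-norm induces sparsity in the coefficients $\alpha_i$ (see, e.g., [chen, tibshirani]), and $\lambda$ controls the amount of regularization. To prevent the columns of $D$ from being arbitrarily large (which would lead to arbitrarily small values of the $\alpha_i$), the dictionary $D$ is constrained to belong to the convex set $\mathcal{D}$ of matrices in $\mathbb{R}^{m \times p}$ whose columns have an $\ell_2$-norm less than or equal to one:
$$\mathcal{D} = \{ D \in \mathbb{R}^{m \times p} \;:\; \|d_j\|_2 \leq 1 \ \text{for all } j = 1, \ldots, p \}.$$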
As will become clear shortly, this constraint is not adapted to dictionaries extracted from epitomes, since overlapping patches cannot be expected to all have the same norm. Thus we introduce an unconstrained formulation equivalent to Eq. (1):
$$\min_{D \in \mathbb{R}^{m \times p},\; A \in \mathbb{R}^{p \times n}} \; \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_{j=1}^{p} \|d_j\|_2 \, \|\alpha^j\|_1, \tag{2}$$
where $\alpha^j$ denotes the $j$-th row of $A$.
This formulation removes the constraint $D \in \mathcal{D}$ from Eq. (1), and replaces the $\ell_1$-norm by a weighted $\ell_1$-norm. As shown in Appendix A, Eq. (1) and Eq. (2) are equivalent in the sense that a solution of Eq. (1) is also a solution of Eq. (2), and for every solution of Eq. (2), a solution of Eq. (1) can be obtained by normalizing the columns of the dictionary to unit norm. To the best of our knowledge, this equivalent formulation is new, and is key to learning an epitome with $\ell_1$-regularization: the use of a convex regularizer (the $\ell_1$-norm) that empirically provides better-behaved dictionaries than $\ell_0$ (where the $\ell_0$ pseudo-norm counts the number of non-zero elements in a vector) for denoising tasks (see Table 1) differentiates us from the ISD formulation of [aharon2]. To prevent degenerate solutions in the dictionary learning formulation with the $\ell_1$-norm, it is important to constrain the dictionary elements with the $\ell_2$-norm. Whereas such a constraint can easily be imposed in classical dictionary learning, its extension to epitome learning is not straightforward, and the original ISD formulation is not compatible with convex regularizers. Eq. (2) is an equivalent unconstrained formulation, which lends itself well to epitome learning.
We can now formally introduce the general concept of an epitome as a small image of size $\sqrt{M} \times \sqrt{M}$, encoded (for example in row order) as a vector $E$ in $\mathbb{R}^M$. We also introduce a linear operator $\varphi: \mathbb{R}^M \to \mathbb{R}^{m \times p}$ that extracts all overlapping patches from the epitome $E$, and rearranges them into the columns of a matrix of $\mathbb{R}^{m \times p}$, the integer $p$ being the number of such overlapping patches. Concretely, we have $p = (\sqrt{M} - \sqrt{m} + 1)^2$. In this context, $\varphi(E)$ can be interpreted as a traditional flat dictionary with $p$ elements, except that it is generated by a small number $M$ of parameters compared to the $pm$ parameters of the flat dictionary. Our approach thus generalizes to a much wider range of epitomic structures using any mapping $\varphi$ that admits fast projections on $\mathrm{Im}\,\varphi$. The functions $\varphi$ we have used so far are relatively simple, but give a framework that easily extends to families of epitomes, shift-invariant dictionaries, and plain dictionaries. This list is not exhaustive, which naturally opens up new perspectives. The only assumption we make is that $\varphi$ is a linear operator of rank $M$ (i.e., $\varphi$ is injective). The fact that a dictionary $D$ is obtained from an epitome is characterized by the fact that $D$ is in the image $\mathrm{Im}\,\varphi$ of the linear operator $\varphi$. Given a dictionary $D$ in $\mathrm{Im}\,\varphi$, the unique (by injectivity of $\varphi$) epitome representation can be obtained by computing the inverse of $\varphi$ on $\mathrm{Im}\,\varphi$, for which a closed form using pseudo-inverses exists as shown in Appendix B.
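The patch-extraction operator described above can be illustrated with a minimal NumPy sketch. The function name `phi`, the square shapes, and the row-major ordering of patches are assumptions made for illustration only; for a square epitome of side s and square patches of side w, there are (s - w + 1)^2 overlapping patches.

```python
import numpy as np

def phi(epitome, patch_side):
    """Stack every overlapping square patch of a square epitome as a column."""
    side = epitome.shape[0]
    n_pos = side - patch_side + 1            # patch positions along each axis
    cols = [
        epitome[r:r + patch_side, c:c + patch_side].ravel()
        for r in range(n_pos)
        for c in range(n_pos)
    ]
    return np.stack(cols, axis=1)            # shape (m, p)

E = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 epitome, 36 parameters
D = phi(E, 3)                                 # 3x3 patches: 9 rows, 16 columns
```

In this toy case the flat dictionary would have 9 x 16 = 144 free entries, whereas the epitome generates all of them from 36 parameters.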
Our goal being to adapt the epitome to the training image patches, the general minimization problem can therefore be expressed as follows:
$$\min_{D \in \mathrm{Im}\,\varphi,\; A \in \mathbb{R}^{p \times n}} \; \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_{j=1}^{p} \|d_j\|_2 \, \|\alpha^j\|_1. \tag{3}$$
There are several motivations for such an approach. As discussed above, the choice of the function $\varphi$ lets us adapt this technique to different problems such as multiple epitomes or any other type of dictionary representation. This formulation is therefore deliberately generic. In practice, we have mainly focused on two simple cases in the experiments of this paper: a single epitome [jojic] (or image signature dictionary [aharon2]) and a set of epitomes. Furthermore, we have now reduced epitome learning to a more traditional and well-studied problem: dictionary learning. We will therefore use the techniques and algorithms developed in the dictionary learning literature to solve the epitome learning problem.
3 Basic Algorithm
As for classical dictionary learning, the optimization problem of Eq. (3) is not jointly convex in $(D, A)$, but is convex with respect to $D$ when $A$ is fixed and vice-versa. A block-coordinate descent scheme that alternates between the optimization of $D$ and $A$, while keeping the other parameter fixed, has emerged as a natural and simple way for learning dictionaries [elad, engan], which has proven to be relatively efficient when the training set is not too large. Even though the formulation remains nonconvex and therefore this method is not guaranteed to find the global optimum, it has proven experimentally to be good enough for many tasks [elad].
We therefore adopt this optimization scheme as well, and detail the different steps below. Note that other algorithms such as stochastic gradient descent (see [aharon2, mairal7]) could be used as well, and in fact can easily be derived from the material of this section. However, we have chosen not to investigate this kind of technique for the sake of simplicity. Indeed, stochastic gradient descent algorithms are potentially more efficient than the block-coordinate scheme mentioned above, but require the (sometimes non-trivial) tuning of a learning rate.
3.1 Step 1: Optimization of $A$ with $D$ Fixed.
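In this step of the algorithm, $D$ is fixed, so the constraint $D \in \mathrm{Im}\,\varphi$ is not involved in the optimization of $A$. Furthermore, note that updating the matrix $A$ consists of solving $n$ independent optimization problems with respect to each column $\alpha_i$. For each of them, one has to solve a weighted-$\ell_1$ optimization problem. Let us consider the update of a column $\alpha_i$ of $A$.
We introduce the matrix $\Gamma = \mathrm{diag}(\|d_1\|_2, \ldots, \|d_p\|_2)$, and define $D' = D\Gamma^{-1}$. If $\Gamma$ is non-singular, we show in Appendix A that the relation $\alpha_i^\star = \Gamma^{-1}\alpha_i'^\star$ holds, where
$$\alpha_i'^\star = \arg\min_{\alpha' \in \mathbb{R}^p} \frac{1}{2}\|x_i - D'\alpha'\|_2^2 + \lambda\|\alpha'\|_1, \qquad \alpha_i^\star = \arg\min_{\alpha \in \mathbb{R}^p} \frac{1}{2}\|x_i - D\alpha\|_2^2 + \lambda \sum_{j=1}^{p} \|d_j\|_2 \, |\alpha_j|.$$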
This shows that the update of each column can easily be obtained with classical solvers for $\ell_1$-decomposition problems. To that effect, we use the LARS algorithm [efron], implemented in the software accompanying [mairal7].
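The change of variables used in this step can be sketched as follows. This is only an illustration: it swaps in a plain ISTA loop for the LARS solver used in the paper, and the names `ista_lasso` and `weighted_l1_code`, the toy data, and the value of `lam` are all assumptions.

```python
import numpy as np

def ista_lasso(x, D, lam, n_iter=500):
    """Plain ISTA for min_a 0.5*||x - D a||_2^2 + lam*||a||_1 (stand-in for LARS)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2       # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * D.T @ (D @ a - x)         # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return a

def weighted_l1_code(x, D, lam):
    """Solve the weighted-l1 problem through the column-normalized dictionary."""
    norms = np.linalg.norm(D, axis=0)            # diagonal of Gamma (assumed non-zero)
    a_prime = ista_lasso(x, D / norms, lam)      # standard lasso with D' = D Gamma^{-1}
    return a_prime / norms                       # alpha = Gamma^{-1} alpha'

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 12)) * rng.uniform(0.5, 3.0, size=12)  # columns of unequal norm
x = rng.normal(size=8)
alpha = weighted_l1_code(x, D, lam=0.1)
```

Any standard lasso solver could replace `ista_lasso`; the only point of the sketch is that the weights are absorbed by rescaling the columns before the solve and the coefficients after it.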
Since our optimization problem is invariant by multiplying $D$ by a scalar and $A$ by its inverse, we then proceed to a renormalization to ensure numerical stability and prevent the entries of $D$ and $A$ from becoming too large: we rescale $D$ by a positive scalar and $A$ by its inverse.
Since the image of $\varphi$ is a vector space, $D$ stays in the image of $\varphi$ after the normalization. And as noted before, it does not change the value of the objective function.
3.2 Step 2: Optimization of $D$ with $A$ Fixed.
We use a projected gradient descent algorithm [bertsekas] to update $D$. The objective function $f$ minimized during this step can be written as:
$$f(D) = \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_{j=1}^{p} \|d_j\|_2 \, \|\alpha^j\|_1, \tag{4}$$
where $A$ is fixed, and we recall that $\alpha^j$ denotes its $j$-th row. The function $f$ is differentiable, except when a column of $D$ is equal to zero, which we assume without loss of generality not to be the case. Suppose indeed that a column $d_j$ of $D$ is equal to zero. Then, without changing the value of the cost function of Eq. (3), one can set the corresponding row $\alpha^j$ to zero as well, which results in a function $f$ defined in Eq. (4) that no longer depends on $d_j$. We have, however, not observed such a situation in our experiments.
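The function $f$ can therefore be considered as differentiable, and one can easily compute its gradient as:
$$\nabla f(D) = -(X - DA)A^\top + D\Delta,$$
where $\Delta$ is defined as $\Delta = \mathrm{diag}\big(\lambda \|\alpha^1\|_1 / \|d_1\|_2, \ldots, \lambda \|\alpha^p\|_1 / \|d_p\|_2\big)$.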
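As a sanity check on the gradient of the dictionary-update objective, the sketch below assumes the objective has the form f(D) = 0.5 ||X - DA||_F^2 + lam * sum_j ||d_j||_2 ||alpha^j||_1 (quadratic fit plus the weighted l1 penalty of this section) and implements the corresponding closed-form gradient. The function names and random test data are illustrative assumptions.

```python
import numpy as np

def objective(D, X, A, lam):
    """f(D) = 0.5*||X - D A||_F^2 + lam * sum_j ||d_j||_2 * ||alpha^j||_1."""
    col_norms = np.linalg.norm(D, axis=0)     # ||d_j||_2 for each column
    row_l1 = np.abs(A).sum(axis=1)            # ||alpha^j||_1 for each row of A
    return 0.5 * np.sum((X - D @ A) ** 2) + lam * np.sum(col_norms * row_l1)

def gradient(D, X, A, lam):
    """Closed-form gradient, valid only when no column of D is zero."""
    delta = lam * np.abs(A).sum(axis=1) / np.linalg.norm(D, axis=0)
    return -(X - D @ A) @ A.T + D * delta     # D * delta scales column j by delta_j

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 20))
D = rng.normal(size=(5, 7))
A = rng.normal(size=(7, 20))
G = gradient(D, X, A, lam=0.3)
```

Comparing `np.sum(G * H)` with a central finite difference of `objective` along a random direction `H` is a quick way to validate such a formula before plugging it into a projected gradient loop.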
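To use a projected gradient descent, we now need a method for projecting $D$ onto the convex set $\mathrm{Im}\,\varphi$, and the update rule becomes:
$$D \leftarrow \Pi_{\mathrm{Im}\,\varphi}\big[D - \rho \nabla f(D)\big],$$
where $\Pi_{\mathrm{Im}\,\varphi}$ is the orthogonal projector onto $\mathrm{Im}\,\varphi$, and $\rho$ is a gradient step, chosen with a line-search rule, such as the Armijo rule [bertsekas].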
Interestingly, in the case of the single epitome (and in fact in any other extension where $\varphi$ is a linear operator that extracts some patches from a parameter vector $E$), this projector admits a closed form: let us consider the linear operator $\varphi^*: \mathbb{R}^{m \times p} \to \mathbb{R}^M$, such that for a matrix $D$ in $\mathbb{R}^{m \times p}$, a pixel of the epitome $\varphi^*(D)$ is the average of the entries of $D$ corresponding to this pixel value. We give the formal form of this operator in Appendix B, and show the following results:
(i) $\varphi^*$ is indeed linear,
(ii) $\Pi_{\mathrm{Im}\,\varphi} = \varphi \circ \varphi^*$.
With this closed form of $\Pi_{\mathrm{Im}\,\varphi}$ in hand, we now have an efficient algorithmic procedure for performing the projection. Our method is therefore quite generic, and can adapt to a wide variety of functions $\varphi$. Extending it to the case where $\varphi$ is not linear, but still injective and with an efficient method to project on $\mathrm{Im}\,\varphi$, will be the topic of future work.
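The averaging operator and the resulting projection can be sketched in a few lines of NumPy. The helper names `phi`, `phi_star` and `project`, the square shapes, and the row-major patch ordering are assumptions for illustration; the sketch simply averages, for each epitome pixel, all dictionary entries that map to it, and then re-extracts the patches.

```python
import numpy as np

def phi(E, w):
    """Extract all overlapping w x w patches of a square epitome as columns."""
    n = E.shape[0] - w + 1
    return np.stack([E[r:r + w, c:c + w].ravel()
                     for r in range(n) for c in range(n)], axis=1)

def phi_star(D, side, w):
    """Average, per epitome pixel, all dictionary entries that map to it."""
    n = side - w + 1
    total = np.zeros((side, side))
    count = np.zeros((side, side))
    for j in range(D.shape[1]):
        r, c = divmod(j, n)                   # same patch ordering as phi
        total[r:r + w, c:c + w] += D[:, j].reshape(w, w)
        count[r:r + w, c:c + w] += 1.0
    return total / count

def project(D, side, w):
    """Orthogonal projection onto the set of epitome-generated dictionaries."""
    return phi(phi_star(D, side, w), w)

rng = np.random.default_rng(2)
D = rng.normal(size=(9, 16))                  # arbitrary dictionary of 3x3 atoms
P = project(D, side=6, w=3)                   # nearest dictionary generated by a 6x6 epitome
```

Projecting twice gives the same result as projecting once, and applying `phi_star` to `phi(E, w)` returns `E`, which matches the two properties stated in the text.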
4 Improvements
We present in this section several improvements to our basic framework, which either improve the convergence speed of the algorithm, or generalize the formulation.
4.1 Accelerated Gradient Method for Updating $D$.
A first improvement is to accelerate the convergence of the update of $D$ using an accelerated gradient technique [beck, nesterov]. These methods, which build upon early works by Nesterov [nesterov2], have attracted a lot of attention recently in machine learning and signal processing, especially because of their fast convergence rate (which is proven to be optimal among first-order methods), and their ability to deal with large, possibly nonsmooth problems.
Whereas the value of the objective function with classical gradient descent algorithms for solving smooth convex problems is guaranteed to decrease with a convergence rate of $O(1/k)$, where $k$ is the number of iterations, other algorithmic schemes have been proposed with a convergence rate of $O(1/k^2)$ with the same cost per iteration as classical gradient algorithms [beck, nesterov2, nesterov]. The difference between these methods and gradient descent algorithms is that two sequences of parameters are maintained during this iterative procedure, and that each update uses information from past iterations. This leads to theoretically better convergence rates, which are often also better in practice.
For its simplicity, we have chosen the FISTA algorithm of Beck and Teboulle [beck], which includes a practical line-search scheme for automatically tuning the gradient step. In our experiments, FISTA indeed converged significantly faster than the projected gradient descent algorithm.
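The two-sequence structure of such accelerated schemes can be sketched as below. This is a simplified illustration, not the variant used in the paper: it uses a constant step instead of the line search of [beck], and the toy problem (least squares restricted to a linear subspace) and all names are assumptions.

```python
import numpy as np

def fista(grad, project, x0, step, n_iter=200):
    """Projected accelerated gradient with a constant step (no line search)."""
    x = project(x0)
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_next = project(y - step * grad(y))               # projected gradient step at y
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # extrapolation using the past iterate
        x, t = x_next, t_next
    return x

# Toy problem: least squares restricted to the subspace spanned by the columns of Q.
rng = np.random.default_rng(3)
B, c = rng.normal(size=(30, 10)), rng.normal(size=30)
Q, _ = np.linalg.qr(rng.normal(size=(10, 4)))              # orthonormal basis of the subspace
step = 1.0 / np.linalg.norm(B, 2) ** 2                     # 1 / Lipschitz constant
x_hat = fista(lambda v: B.T @ (B @ v - c), lambda v: Q @ (Q.T @ v), np.zeros(10), step)
```

In the epitome setting, `grad` would be the gradient of the dictionary-update objective and `project` the projector onto the image of the patch-extraction operator.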
4.2 Multi-Scale Version
To improve the results without increasing the computing time, we have also implemented a multi-scale approach that exploits the spatial nature of the epitome. Instead of directly learning an epitome of size $M$, we first learn an epitome of a smaller size on a reduced image with correspondingly smaller patches, and after upscaling, we use the resulting epitome as the initialization for the next scale. In practice, we iterate this process two to three times. The procedure is illustrated in Figure 2. Intuitively, learning smaller epitomes is an easier task than directly learning a large one, and such a procedure provides a good initialization for learning a large epitome.
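Multi-scale Epitome Learning.
Input: number of scales, ratio between each scale, random initialization for the first scale.
for each scale, from the coarsest to the finest, do
    Given the rescaling of the image for the ratio of the current scale,
    and the corresponding patches,
    initialize with the epitome of the previous scale (upscaled), or with the random initialization at the first scale,
    learn the epitome of the current scale from these patches and this initialization.
end for
Output: learned epitome.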
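The coarse-to-fine loop can be sketched as follows. Everything here is a hypothetical scaffold: `learn_epitome` stands for any single-scale learner (taking an image and an initial epitome, and extracting patches of the size appropriate for that scale), and nearest-neighbour upscaling via `np.kron` is just one simple choice, not necessarily the one used in the paper.

```python
import numpy as np

def upscale(E, ratio):
    """Nearest-neighbour upscaling by an integer ratio (one simple choice)."""
    return np.kron(E, np.ones((ratio, ratio)))

def multiscale_epitome(images_coarse_to_fine, learn_epitome, E0, ratio=2):
    """Run a hypothetical single-scale learner from the coarsest to the finest scale."""
    E = E0
    for k, image in enumerate(images_coarse_to_fine):
        if k > 0:
            E = upscale(E, ratio)            # previous result initializes this scale
        E = learn_epitome(image, E)          # refine at the current scale
    return E

# Placeholder learner used only to exercise the loop: it adds 1 to its initialization.
dummy_learner = lambda image, E_init: E_init + 1.0
result = multiscale_epitome([None, None, None], dummy_learner, np.zeros((2, 2)))
```

With three scales and a ratio of 2, a 2x2 initialization ends up as an 8x8 epitome, which shows how the epitome size grows with the image resolution.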
4.3 Multi-Epitome Extension
Another improvement is to consider not a single epitome but a family of epitomes in order to learn dictionaries with some shift invariance, which has been the focus of recent work [mailhe, thiagarajan]. Note that different types of structured dictionaries have also been proposed with the same motivation for learning shift-invariant features in image classification tasks [kavukcuoglu2], but in a significantly different framework (the structure in the dictionaries learned in [kavukcuoglu2] comes from a different sparsity-inducing penalization).
As mentioned before, we are able to learn a set of epitomes instead of a single one by changing the function $\varphi$ introduced earlier. The vector $E$ now contains the pixels (parameters) of several small epitomes, and $\varphi$ is the linear operator that extracts all overlapping patches from all epitomes. In the same way, the projector on $\mathrm{Im}\,\varphi$ is still easy to compute in closed form, and the rest of the algorithm stays unchanged. Other “epitomic” structures could easily be used within our framework, even though we have limited ourselves for simplicity to the case of single and multiple epitomes of the same size and shape.
The multi-epitome version of our approach can be seen as an interpolation between a classical dictionary and a single epitome. Indeed, defining a multitude of epitomes of the same size as the considered patches is equivalent to working with a dictionary. Defining a large number of epitomes slightly larger than the patches is equivalent to working with shift-invariant dictionaries. In Section 5, we experimentally compare these different regimes for the task of image denoising.
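The three regimes described above can be made concrete with a small sketch in which the number of dictionary columns is held fixed. The helper names and the specific sizes (3x3 patches, 16 columns) are illustrative assumptions.

```python
import numpy as np

def phi_single(E, w):
    """Extract all overlapping w x w patches of one square epitome as columns."""
    n = E.shape[0] - w + 1
    return np.stack([E[r:r + w, c:c + w].ravel()
                     for r in range(n) for c in range(n)], axis=1)

def phi_multi(epitomes, w):
    """Concatenate the patch dictionaries of several square epitomes."""
    return np.concatenate([phi_single(E, w) for E in epitomes], axis=1)

rng = np.random.default_rng(4)
w = 3
# 16 epitomes of patch size: one atom each, i.e. a plain dictionary.
flat = phi_multi([rng.normal(size=(3, 3)) for _ in range(16)], w)
# 4 epitomes slightly larger than the patch: 4 shifted atoms each.
shifted = phi_multi([rng.normal(size=(4, 4)) for _ in range(4)], w)
# 1 large epitome: all 16 atoms share the same 36 parameters.
single = phi_multi([rng.normal(size=(6, 6))], w)
```

All three dictionaries have 16 columns, but they are generated by 144, 64 and 36 parameters respectively, which illustrates the trade-off between flexibility and compactness.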
4.4 Initialization
Because of the nonconvexity of the optimization problem, the question of the initialization is an important issue in epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale, the problem remains. Whereas classical flat dictionaries can naturally be initialized with prespecified dictionaries such as an overcomplete DCT basis (see [elad]), the epitome does not admit such a natural choice. In all the experiments (unless stated otherwise), we use as the initialization a single epitome (or a collection of epitomes), common to all experiments, which is learned using our algorithm, initialized with a Gaussian low-pass filtered random image, on a set of random patches extracted from natural images (all different from the test images used for denoising).
5 Experimental Validation
We provide in this section qualitative and quantitative validation. We first study the influence of the different model hyperparameters on the visual aspect of the epitome before moving to an image denoising task. We choose to represent the epitomes as images in order to visualize more easily the patches that will be extracted to form the images. Since epitomes contain negative values, they are arbitrarily rescaled to a fixed intensity range for display.
In this section, we will work with several images, which are shown in Figure 4.
5.1 Influence of the Initialization
In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained.
The difference in contrast may be due to the scaling of the data in the displaying process. This experiment illustrates that different initializations lead to visually different epitomes. Whereas this property might not be desirable, the classical dictionary learning framework also suffers from this issue, and yet it has led to successful applications in image processing [elad].
5.2 Influence of the Size of the Patches
The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of the same size are learned with different patch sizes.
As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.
5.3 Influence of the Number of Epitomes
We present in this section an experiment where the number of learned epitomes varies, while keeping the same number of columns in $D$. The epitomes learned on the image barbara for the different settings are shown in Figure 7. When the number of epitomes is small, we observe in the epitomes some discontinuities between texture areas with different visual characteristics, which is not the case when learning several independent epitomes.
5.4 Application to Denoising
In order to evaluate the performance of epitome learning in various regimes (single epitome, multiple epitomes), we use the same methodology as [aharon2], which relies on the successful denoising method first introduced by [elad]. Let us consider first the classical problem of restoring a noisy image $y$ in $\mathbb{R}^n$ which has been corrupted by a white Gaussian noise of standard deviation $\sigma$. We denote by $y_i$ in $\mathbb{R}^m$ the patch of $y$ centered at pixel $i$ (with any arbitrary ordering of the image pixels).
The method of [elad] proceeds as follows:
1. Learn a dictionary $D$ adapted to all overlapping patches from the noisy image $y$.
2. Approximate each noisy patch $y_i$ using the learned dictionary with a greedy algorithm called orthogonal matching pursuit (OMP) [mallat4], in order to obtain a clean estimate of every patch of $y$. This amounts to addressing a sparse approximation problem in which $D\alpha_i$ is a clean estimate of the patch $y_i$, the sparsity of $\alpha_i$ is measured by its $\ell_0$ pseudo-norm $\|\alpha_i\|_0$, and a regularization parameter, chosen following [elad], controls the trade-off.
3. Since every pixel in $y$ admits many clean estimates (one estimate for every patch the pixel belongs to), average the estimates.
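The greedy approximation in the second step can be sketched with a basic orthogonal matching pursuit. This is a minimal illustration, not the implementation used in the experiments: the stopping rule (squared residual at most `eps`), the normalization of correlations by the atom norms (atoms extracted from an epitome need not have unit norm), and all names are assumptions.

```python
import numpy as np

def omp(y, D, eps, max_atoms=None):
    """Greedy OMP: add atoms until the squared residual is at most eps."""
    p = D.shape[1]
    max_atoms = p if max_atoms is None else max_atoms
    norms = np.linalg.norm(D, axis=0)
    used = np.zeros(p, dtype=bool)
    support, coef = [], np.zeros(0)
    residual = y.astype(float).copy()
    while residual @ residual > eps and len(support) < max_atoms:
        scores = np.abs(D.T @ residual) / norms    # correlation with normalized atoms
        scores[used] = -np.inf                     # never select an atom twice
        j = int(np.argmax(scores))
        used[j] = True
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # refit on the support
        residual = y - D[:, support] @ coef
    alpha = np.zeros(p)
    alpha[np.array(support, dtype=int)] = coef
    return alpha

rng = np.random.default_rng(5)
D = rng.normal(size=(10, 15)) * rng.uniform(0.5, 2.0, size=15)
y = 1.5 * D[:, 2] - 2.0 * D[:, 5]                  # signal built from two atoms
alpha = omp(y, D, eps=1e-10)
```

In a denoising pipeline, `D @ alpha` would serve as the clean estimate of the patch, and the estimates of overlapping patches would then be averaged pixel by pixel.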
Quantitative results for the single epitome and for multi-scale multi-epitomes are presented in Table 1 on six images and five levels of noise. We evaluate the performance of the denoising process by computing the peak signal-to-noise ratio (PSNR) for each pair of images. For each level of noise, we have selected the best regularization parameter over all six images, and have then used it in all the experiments. The PSNR values are averaged over experiments with different noise realizations. The mean standard deviation (in dB) is the same for the single epitome and the multi-scale multi-epitomes.
We see from this experiment that the formulation we propose is competitive with that of [aharon2]. Learning multiple epitomes instead of a single one seems to provide better results, which might be explained by the lack of flexibility of the single epitome representation. Evidently, these results are not as good as recent state-of-the-art denoising algorithms such as [dabov2, mairal8], which exploit more sophisticated image models. But our goal is to illustrate the performance of epitome learning on an image reconstruction task, in order to better understand these formulations.
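For reference, the PSNR between a clean image and its estimate can be computed as in the sketch below. The peak value of 255 (8-bit images) is an assumption, since the text does not state the intensity range used in the experiments.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)            # mean squared error over all pixels
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.zeros((8, 8))
noisy = clean + 5.0                            # constant error of 5 gray levels
value = psnr(clean, noisy)
```

A higher PSNR means a smaller mean squared error; halving the root-mean-square error increases the PSNR by about 6 dB.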
6 Conclusion
We have introduced in this paper a new formulation and an efficient algorithm for learning epitomes in the context of sparse coding, extending the work of Aharon and Elad [aharon2], and unifying it with recent work on shift-invariant dictionary learning. Our approach is generic, can interpolate between these two regimes, and can possibly be applied to other formulations. Future work will extend our framework to the video setting, to other image processing tasks such as inpainting, and to learning image features for classification or recognition tasks, where shift invariance has proven to be a key property for achieving good results [kavukcuoglu2]. Another direction we are pursuing is to find a way to encode other invariance properties through different mapping functions $\varphi$.
This work was partly supported by the European Community under the ERC grants "VideoWorld" and "Sierra".
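Appendix A: $\ell_1$-Norm and Weighted $\ell_1$-Norm
In this appendix, we show the equivalence between the two minimization problems introduced in Section 3.1.
Let us denote
$$F(D, \alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda \sum_{j=1}^{p} \|d_j\|_2 \, |\alpha_j|, \qquad G(D', \alpha') = \frac{1}{2}\|x - D'\alpha'\|_2^2 + \lambda \|\alpha'\|_1.$$
Let us define $\alpha'$ in $\mathbb{R}^p$ and $D'$ in $\mathbb{R}^{m \times p}$ such that $D' = D\Gamma^{-1}$, and $\alpha' = \Gamma\alpha$, where $\Gamma = \mathrm{diag}(\|d_1\|_2, \ldots, \|d_p\|_2)$. The goal is to show that $\alpha^\star = \Gamma^{-1}\alpha'^\star$, where:
$$\alpha^\star = \arg\min_{\alpha \in \mathbb{R}^p} F(D, \alpha), \qquad \alpha'^\star = \arg\min_{\alpha' \in \mathbb{R}^p} G(D', \alpha').$$
We clearly have: $D\alpha = D'\alpha'$. Furthermore, since $\Gamma\alpha = \alpha'$, we have:
$$\sum_{j=1}^{p} \|d_j\|_2 \, |\alpha_j| = \|\Gamma\alpha\|_1 = \|\alpha'\|_1.$$
Therefore $F(D, \alpha) = G(D', \Gamma\alpha)$ for every $\alpha$, and since $\Gamma$ is non-singular, the map $\alpha \mapsto \Gamma\alpha$ is a bijection, so that the minimizers of the two problems correspond through $\alpha^\star = \Gamma^{-1}\alpha'^\star$.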
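Appendix B: Projection on $\mathrm{Im}\,\varphi$
In this appendix, we show how to compute the orthogonal projection on the vector space $\mathrm{Im}\,\varphi$. Let us denote by $R_j$ the binary matrix in $\{0,1\}^{m \times M}$ that extracts the $j$-th patch from $E$. Note that with this notation, the matrix $R_j^\top$ is a binary matrix corresponding to a linear operator that takes a patch of size $m$ and places it at the location $j$ in an epitome of size $M$ which is zero everywhere else. We therefore have $\varphi(E) = [R_1 E, \ldots, R_p E]$.
We denote by $\varphi^*: \mathbb{R}^{m \times p} \to \mathbb{R}^M$ the linear operator defined as
$$\varphi^*(D) = \Big(\sum_{j=1}^{p} R_j^\top R_j\Big)^{-1} \Big(\sum_{j=1}^{p} R_j^\top d_j\Big),$$
which creates an epitome of size $M$ such that each pixel contains the average of the corresponding entries in $D$. Indeed, the $M \times M$ matrix $\sum_{j=1}^{p} R_j^\top R_j$ is diagonal and the entry $i$ on the diagonal is the number of entries in $D$ corresponding to the pixel $i$ in the epitome.
Denoting by $R$ the matrix obtained by stacking $R_1, \ldots, R_p$ vertically, which is an $mp \times M$ matrix, we have $\mathrm{vec}(\varphi(E)) = RE$, where $\mathrm{vec}(D) = [d_1^\top, \ldots, d_p^\top]^\top$, which is the vector of size $mp$ obtained by concatenating the columns of $D$, and also $\varphi^*(D) = (R^\top R)^{-1} R^\top \mathrm{vec}(D)$.
Since $\mathrm{vec}(\mathrm{Im}\,\varphi) = \mathrm{Im}\,R$ and $\mathrm{vec}(\varphi(\varphi^*(D))) = R(R^\top R)^{-1}R^\top \mathrm{vec}(D)$, which is an orthogonal projection onto $\mathrm{Im}\,R$, we obtain the following two properties, which are useful in our framework and classical in signal processing with overcomplete representations ([mallat]):
$\varphi^*$ is the inverse function of $\varphi$ on $\mathrm{Im}\,\varphi$: $\varphi^* \circ \varphi = \mathrm{Id}$.
$\varphi \circ \varphi^*$ is the orthogonal projector on $\mathrm{Im}\,\varphi$.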
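The matrix form of these properties can be checked numerically on a small example. The sketch below builds a stacked binary extraction matrix for a toy 5x5 epitome with 2x2 patches; the function name, sizes, and row-major orderings are assumptions made for illustration.

```python
import numpy as np

def extraction_matrix(side, w):
    """Stack the binary patch-extraction matrices into one (m*p) x M matrix."""
    n = side - w + 1
    idx = np.arange(side * side).reshape(side, side)   # pixel index of each epitome position
    rows = [idx[r:r + w, c:c + w].ravel() for r in range(n) for c in range(n)]
    R = np.zeros((len(rows) * w * w, side * side))
    R[np.arange(R.shape[0]), np.concatenate(rows)] = 1.0  # one 1 per row: the pixel it reads
    return R

R = extraction_matrix(side=5, w=2)
gram = R.T @ R                                  # diagonal: number of patches covering each pixel
P = R @ np.linalg.solve(gram, R.T)              # R (R^T R)^{-1} R^T
```

Here `gram` is diagonal with entries between 1 (corner pixels) and 4 (interior pixels), and `P` is symmetric and idempotent, as expected of an orthogonal projector.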