Task-Driven Dictionary Learning

09/27/2010, by Julien Mairal et al.

Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in large-scale settings, and is well suited to supervised and semi-supervised classification, as well as regression tasks for data that admit sparse representations.


1 Introduction

The linear decomposition of data using a few elements from a learned dictionary instead of a predefined one—based on wavelets [1] for example—has recently led to state-of-the-art results in numerous low-level signal processing tasks such as image denoising [2, 3, 4], audio processing [5, 6], as well as classification tasks [7, 8, 9, 10, 11, 12]. Unlike decompositions based on principal component analysis (PCA) and its variants, these sparse models do not impose that the dictionary elements be orthogonal, allowing more flexibility to adapt the representation to the data.

Concretely, consider a vector x in R^m. We say that it admits a sparse approximation over a dictionary D in R^{m×p} when one can find a linear combination of a "few" columns from D that is "close" to the vector x. Experiments have shown that modeling signals with such sparse decompositions (sparse coding) is very effective in many signal processing applications [13]. For natural images, predefined dictionaries based on various types of wavelets [1] have been used for this task. Initially introduced by Olshausen and Field [14] for modeling the spatial receptive fields of simple cells in the mammalian visual cortex, the idea of learning the dictionary from data instead of using off-the-shelf bases has been shown to significantly improve signal reconstruction [2]. Although some of the learned dictionary elements may sometimes "look like" wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice.

This classical data-driven approach to dictionary learning is well adapted to reconstruction tasks, such as restoring a noisy signal. These dictionaries, which are good at reconstructing clean signals but bad at reconstructing noise, have indeed led to state-of-the-art denoising algorithms [2, 3, 4]. Unsupervised dictionary learning has also been used for purposes other than pure signal reconstruction, such as classification [5, 7, 11, 12, 15], but recent works have shown that better results can be obtained when the dictionary is tuned to the specific task (and not just the data) it is intended for. Duarte-Carvajalino and Sapiro [16] have for instance proposed to learn dictionaries for compressed sensing, and in [8, 9, 10] dictionaries are learned for signal classification. In this paper, we refer to this general approach as task-driven dictionary learning.

Whereas purely data-driven dictionary learning has been shown to be equivalent to a large-scale matrix factorization problem that can be effectively addressed with several methods [14, 17, 18, 19], its task-driven counterpart has proven to be much more difficult to optimize. Presenting a general efficient framework for various task-driven dictionary learning problems is the main topic of this paper. Even though it is different from existing machine learning approaches, it shares similarities with many of them.

For instance, Blei et al. [20] have proposed to learn a latent topic model intended for document classification. In a different context, Argyriou et al. [21] introduced a convex formulation for multi-task classification problems where an orthogonal linear transform of input features is jointly learned with a classifier. Learning compact features has also been addressed in the literature of neural networks, with restricted Boltzmann machines (RBMs) and convolutional neural networks for example (see [22, 23, 24, 25, 26] and references therein). Interestingly, the question of learning the data representation in an unsupervised or supervised way has also been investigated for these approaches. For instance, a supervised topic model is proposed in [27], and tuning latent data representations for minimizing a cost function is often achieved with backpropagation in neural networks [28].

1.1 Contributions

This paper makes three main contributions:

  • It introduces a supervised formulation for learning dictionaries adapted to various tasks instead of dictionaries only adapted to data reconstruction.

  • It shows that the resulting optimization problem is smooth under mild assumptions, and empirically that stochastic gradient descent addresses it efficiently.

  • It shows that the proposed formulation is well adapted to semi-supervised learning, can exploit unlabeled data when they admit sparse representations, and leads to state-of-the-art results for various machine learning and signal processing problems.

1.2 Notation

Vectors are denoted by bold lower-case letters and matrices by upper-case ones. For q ≥ 1, we define the ℓ_q-norm of a vector x in R^m as ||x||_q = (Σ_{i=1}^m |x[i]|^q)^{1/q}, where x[i] denotes the i-th entry of x, and ||x||_∞ = max_i |x[i]| = lim_{q→∞} ||x||_q. We also define the ℓ_0-pseudo-norm as the number of nonzero elements in a vector. We consider the Frobenius norm of a matrix X in R^{m×n}: ||X||_F = (Σ_{i=1}^m Σ_{j=1}^n X[i,j]^2)^{1/2}. We also write, for a sequence of vectors x_t and scalars u_t, x_t = O(u_t) when there exists a constant K independent of t so that for all t, ||x_t||_2 ≤ K u_t, and use a similar notation for matrices (note that for finite-dimensional vector spaces, the choice of norm is irrelevant). When Λ is a finite set of indices, x_Λ denotes the vector containing the values of x corresponding to the indices in Λ. Similarly, when X is a matrix in R^{m×n} and Λ ⊆ {1, …, n}, X_Λ is the matrix containing the columns of X corresponding to the indices in Λ.

The rest of this paper is organized as follows: Section 2 briefly recalls the classical data-driven dictionary learning framework. Section 3 is devoted to our new task-driven framework, and Section 4 to efficient algorithms for addressing the corresponding optimization problems. Section 5 presents several dictionary learning experiments for signal classification, signal regression, and compressed sensing.

2 Data-Driven Dictionary Learning

Classical dictionary learning techniques for sparse coding [14, 17, 18] consider a finite training set of signals x_1, …, x_n in R^m and minimize the empirical cost function

   f_n(D) ≜ (1/n) Σ_{i=1}^n ℓ_u(x_i, D),

with respect to a dictionary D in R^{m×p}, each column representing a dictionary element. Here, ℓ_u is a loss function such that ℓ_u(x, D) should be small if D is "good" at representing the signal x in a sparse fashion. As emphasized by the index u of ℓ_u, this optimization problem is unsupervised. Note that, in this setting, overcomplete dictionaries with p > m are allowed. As others (see, e.g., [18]), we define ℓ_u(x, D) as the optimal value of a sparse coding problem. In this work, we choose the elastic-net formulation of [29]:

(1)   ℓ_u(x, D) ≜ min_{α in R^p} (1/2)||x − Dα||_2^2 + λ_1 ||α||_1 + (λ_2/2)||α||_2^2,

where λ_1 and λ_2 are regularization parameters. When λ_2 = 0, this leads to the ℓ_1 sparse decomposition problem, also known as basis pursuit [13], or Lasso [30]. Here, our choice of the elastic-net formulation over the Lasso is mainly for stability reasons. Using a parameter λ_2 > 0 makes the problem of Eq. (1) strongly convex and, as shown later in this paper, ensures that its unique solution is Lipschitz with respect to x and D with a constant depending on λ_2. Whereas the stability of this solution is not necessarily an issue when learning a dictionary for a reconstruction task, it has turned out to be empirically important in some of our experiments with other tasks.
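The elastic-net sparse coding problem of Eq. (1) can be minimized with a simple proximal-gradient (ISTA) iteration; the numpy sketch below is purely illustrative (the function name and step-size choice are ours), whereas the paper itself relies on a modified LARS solver, discussed in Section 4.

```python
import numpy as np

def elastic_net_ista(x, D, lam1=0.1, lam2=0.01, n_iter=500):
    """ISTA sketch of the elastic-net sparse coding problem:
    min_a 0.5*||x - D a||_2^2 + lam1*||a||_1 + 0.5*lam2*||a||_2^2."""
    L = np.linalg.norm(D.T @ D, 2) + lam2  # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x) + lam2 * a
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # soft-thresholding
    return a
```

The soft-thresholding step is what produces exact zeros in the solution, i.e., true sparsity.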

To prevent the ℓ_2-norm of D from being arbitrarily large, which would lead to arbitrarily small values of α, it is common to constrain its columns d_1, …, d_p to have ℓ_2-norms less than or equal to one. We call D the convex set of matrices satisfying this constraint:

(2)   D ≜ {D in R^{m×p} s.t. for all j in {1, …, p}, ||d_j||_2 ≤ 1}.
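The orthogonal projection onto this constraint set simply rescales the columns whose norm exceeds one; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def project_dictionary(D):
    # Orthogonal projection onto the set of matrices whose columns have
    # l2-norm at most one: rescale only the offending columns.
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)
```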

As pointed out by Bottou and Bousquet [31], one is usually not interested in a perfect minimization of the empirical cost f_n(D), but instead in the minimization with respect to D of the expected cost

(3)   f(D) ≜ E_x[ℓ_u(x, D)] = lim_{n→∞} f_n(D) a.s.,

where the expectation is taken relative to the (unknown) probability distribution p(x) of the data, and is supposed to be finite.[1] In practice, dictionary learning problems often involve a large amount of data. For instance, when the vectors x represent image patches, n can be up to several millions in a single image. In this context, online learning techniques have been shown to be very efficient for obtaining a stationary point of this optimization problem [19]. Our paper exploits this stochastic setting as well, by proposing to minimize an expected cost corresponding to a supervised dictionary learning formulation, which we now present.

[1] We use "a.s." (almost sure) to denote convergence with probability one.

3 Proposed Formulation

We introduce in this section a general framework for learning dictionaries adapted to specific supervised tasks, e.g., classification, as opposed to the unsupervised formulation of the previous section, and present different extensions along with possible applications.

3.1 Basic Formulation

Obtaining a good performance in classification tasks is often related to the problem of finding a good data representation. Sparse decompositions obtained with data-driven learned dictionaries have been used for that purpose in [5] and [7], showing promising results for audio data and natural images. We present in this section a formulation for learning a dictionary in a supervised way for prediction tasks (regression or classification).

Concretely, given a learned dictionary D, a vector x in R^m can be represented as Dα*(x, D), where

(4)   α*(x, D) ≜ argmin_{α in R^p} (1/2)||x − Dα||_2^2 + λ_1 ||α||_1 + (λ_2/2)||α||_2^2

is the elastic-net solution [29].

Now suppose that we are interested in predicting a variable y in Y from an observation x in X, where Y is either a finite set of labels for classification tasks, or a subset of R^q for some integer q for regression tasks. Suppose first that D is fixed and obtained with the unsupervised formulation of Eq. (3). Using α*(x, D) as features for predicting the variable y, we can learn model parameters W by solving:

(5)   min_{W in W} f(W) + (ν/2)||W||_F^2,

where W is a convex set, ν is a regularization parameter, and f(W) is defined as

(6)   f(W) ≜ E_{y,x}[ℓ_s(y, W, α*(x, D))],

where ℓ_s is a convex loss function, the index s of ℓ_s indicating that the loss is adapted to a supervised learning problem. The expectation is taken with respect to the unknown probability distribution p(y, x) of the data. Depending on the setting (classification or regression), several loss functions can be used, such as the square, logistic, or hinge loss from support vector machines (see [32] and references therein), possibly with a nonlinear kernel, as done in [7]. Note that we do not specify yet the size of the matrix W, since it will depend on the chosen application, as illustrated later in Section 3.3.

The formulation of Eq. (5) exploits features α*(x, D), where D is given, possibly learned with the unsupervised formulation from Section 2. However, Mairal et al. [9], and Bradley and Bagnell [10], have shown that better results can be achieved when the dictionary is obtained in a fully supervised setting, tuned for the prediction task. We now introduce the task-driven dictionary learning formulation, which consists of jointly learning W and D by solving

(7)   min_{D in D, W in W} f(D, W) + (ν/2)||W||_F^2,

where D is the set of constraints defined in Eq. (2), and f has the form

(8)   f(D, W) ≜ E_{y,x}[ℓ_s(y, W, α*(x, D))].

The main difficulty of this optimization problem comes from the non-differentiability of α*(x, D), which is the solution of the nonsmooth optimization problem (4). Bradley and Bagnell [10] have tackled this difficulty by introducing a smooth approximation of the sparse regularization which leads to smooth solutions, allowing the use of implicit differentiation to compute the gradient of the cost function they have considered. This approximation encourages some coefficients in α* to be small, but does not produce true zeros; it can be used when "true" sparsity is not required. In a different formulation, Mairal et al. [9] have used a nonsmooth sparse regularization, but relied on heuristics to tackle the optimization problem. We show in Section 4 that better optimization tools than these heuristics can be used, while keeping a nonsmooth regularization for computing α*(x, D).

Small variants of this formulation can also be considered: non-negativity constraints may be added on α* and D, leading to a supervised version of nonnegative matrix factorization [33], regularized with a sparsity-inducing penalty. The function ℓ_s could also take extra arguments such as D and x instead of just y, W, and α*. For simplicity, we have omitted these possibilities, but the formulations and algorithms we present in this paper can be easily extended to these cases.

Before presenting extensions and applications of the formulation we have introduced, let us first discuss the assumptions under which our analysis holds.

3.1.1 Assumptions

From now on, we suppose that:

  (A) The data (y, x) admits a probability density p with a compact support. This is a reasonable assumption in audio, image, and video processing applications, where it is imposed by the data acquisition process, and where values returned by sensors are bounded. To simplify the notation later in the paper, we suppose from now on that X and Y are compact.[2]

  (B) When Y is a subset of a finite-dimensional real vector space, p is continuous and ℓ_s is twice continuously differentiable.

  (C) When Y is a finite set of labels, for all y in Y, p(y, ·) is continuous and ℓ_s(y, ·) is twice continuously differentiable.[3]

[2] Even though images are acquired in practice after a quantization process, it is a common assumption in image processing to consider pixel values in a continuous space.

[3] For a given value of y and a function g, g(y, ·) denotes the function which associates to a vector x the value g(y, x).

Assumptions (B) and (C) allow us to use several loss functions, such as the square, logistic, or softmax losses.

3.2 Extensions

We now present two extensions of the previous formulations. The first one includes a linear transform of the input data, and the second one exploits unlabeled data in a semi-supervised setting.

3.2.1 Learning a Linear Transform of the Input Data

In this section, we add to our basic formulation a linear transform of the input features, represented by a matrix Z. Our motivation for this is twofold: it can be appealing to reduce the dimension of the feature space via such a linear transform, and/or it can make the model richer by increasing the number of free parameters. The resulting formulation is the following:

(9)   min_{D in D, W in W, Z in Z} f(D, W, Z) + (ν_1/2)||W||_F^2 + (ν_2/2)||Z||_F^2,

where ν_1 and ν_2 are two regularization parameters, Z is a convex set, and

(10)   f(D, W, Z) ≜ E_{y,x}[ℓ_s(y, W, α*(Zx, D))].

It is worth noticing that the formulations of Eq. (7) and Eq. (9) can also be extended to the case of a cost function depending on several dictionaries involving several sparse coding problems, such as the one used in [8] for signal classification. Such a formulation is not developed here for simplicity reasons, but algorithms to address it can be easily derived from this paper.

3.2.2 Semi-supervised Learning

As shown in [7], sparse coding techniques can be effective for learning good features from unlabeled data. The extension of our task-driven formulation to the semi-supervised learning setting is natural and takes the form

(11)   min_{D in D, W in W} (1 − μ) E_{y,x}[ℓ_s(y, W, α*(x, D))] + μ E_x[ℓ_u(x, D)] + (ν/2)||W||_F^2,

where the second expectation is taken with respect to the marginal distribution of x. The function ℓ_u is the loss function defined in Eq. (1), and μ in [0, 1] is a new parameter controlling the trade-off between the unsupervised learning cost function and the supervised learning one.

3.3 Applications

For illustration purposes, we present a few applications of our task-driven dictionary learning formulations. Our approach is of course not limited to these examples.

3.3.1 Regression

In this setting, Y is a subset of a q-dimensional real vector space, and the task is to predict variables y in Y from the observation of vectors x in X. A typical application is for instance the restoration of clean signals y from observed corrupted signals x. Classical signal restoration techniques often focus on removing additive noise or solving inverse linear problems [34]. When the corruption results from an unknown nonlinear transformation, we formulate the restoration task as a general regression problem. This is the case, for example, in the experiment presented in Section 5.3.

We define the task-driven dictionary learning formulation for regression as follows:

(12)   min_{W in W, D in D} (1/2) E_{y,x}[||y − W α*(x, D)||_2^2] + (ν/2)||W||_F^2.

At test time, when a new signal x is observed, the estimate of the corresponding variable y provided by this model is W α*(x, D) (plus possibly an intercept, which we have omitted here for simplicity reasons). Note that we propose here to use the square loss for measuring the difference between y and its estimate W α*(x, D), but any other twice differentiable loss can be used.

3.3.2 Binary Classification

In this section and in the next one, we propose to learn dictionaries adapted to classification tasks. Our approach follows the formulation presented in [9], but is slightly different and falls into our task-driven dictionary learning framework. In this setting, the set Y is equal to {−1, +1}. Given a vector x, we want to learn the parameters w in R^p of a linear model to predict y in Y, using the sparse representation α*(x, D) as features, and jointly optimize D and w. For instance, using the logistic regression loss, our formulation becomes

(13)   min_{D in D, w in R^p} E_{y,x}[log(1 + e^{−y wᵀ α*(x, D)})] + (ν/2)||w||_2^2.

Once D and w have been learned, a new signal x is classified according to the sign of wᵀ α*(x, D). For simplicity reasons, we have omitted the intercept in the linear model, but it can easily be included in the formulation. Note that instead of the logistic regression loss, any other twice differentiable loss can be used as well.
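For illustration, the logistic loss on a sparse code and the associated decision rule can be sketched in a few lines of numpy (the function names are ours; the intercept is omitted, as in the formulation above):

```python
import numpy as np

def logistic_loss(y, w, alpha):
    # l_s(y, w, alpha) = log(1 + exp(-y * w^T alpha)) with y in {-1, +1};
    # logaddexp keeps the computation numerically stable
    return np.logaddexp(0.0, -y * (w @ alpha))

def classify(w, alpha):
    # A new signal is classified according to the sign of w^T alpha
    return 1 if w @ alpha >= 0 else -1
```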

As suggested in [9], it is possible to extend this approach with a bilinear model by learning a matrix W so that a new vector x is classified according to the sign of xᵀ W α*(x, D). In this setting, our formulation becomes

(14)   min_{D in D, W in R^{m×p}} E_{y,x}[log(1 + e^{−y xᵀ W α*(x, D)})] + (ν/2)||W||_F^2.

This bilinear model requires learning mp parameters, as opposed to the p parameters of the linear one. It is therefore richer and can sometimes offer better classification performance when the linear model is not rich enough to explain the data, but it might be more subject to overfitting.

Note that we have naturally presented the binary classification task using the logistic regression loss, but as we have experimentally observed, the square loss is also an appropriate choice in many situations.

3.3.3 Multi-class Classification

When Y is a finite set of labels {1, …, q} with q > 2, extending the previous formulation to the multi-class setting can be done in several ways, which we briefly describe here. The simplest possibility is to use a set of binary classifiers, as presented in Section 3.3.2, in a "one-vs-all" or "one-vs-one" scheme. Another possibility is to use a multi-class cost function such as the softmax function, to find q linear predictors w_1, …, w_q in R^p such that, for a vector x in X, the quantities w_yᵀ α*(x, D) are encouraged to be greater than w_kᵀ α*(x, D) for all k ≠ y. Yet another possibility is to turn the multi-class classification problem into a regression one and consider that Y is a set of q binary vectors of dimension q, such that the i-th vector has a 1 on its i-th coordinate and 0 elsewhere. This allows using the regression formulation of Section 3.3.1 to solve the classification problem.
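The binary-vector encoding described last is a standard one-hot scheme; a small numpy sketch (function names ours) of the encoding and of the corresponding decoding rule at test time:

```python
import numpy as np

def one_hot_targets(labels, q):
    # Encode class i as the binary vector with a 1 on its i-th coordinate,
    # turning multi-class classification into a regression problem
    Y = np.zeros((len(labels), q))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def decode_predictions(Y_pred):
    # Recover a class label from a real-valued prediction: largest coordinate wins
    return np.argmax(Y_pred, axis=1)
```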

3.3.4 Compressed Sensing

Let us consider a signal x in R^m; the theory of compressed sensing [35, 36] tells us that under certain assumptions, the vector x can be recovered exactly from a few measurements Zx, where Z is called a "sensing" matrix with a number of rows r ≪ m. Unlike classical signal processing methods, such a linear transformation is sometimes included physically in the data acquisition process itself [37], meaning that a sensor can provide the measurements Zx without directly measuring x at any moment.

In a nutshell, the recovery of x has been proven to be possible when x admits a sparse representation on a dictionary D, and the sensing matrix Z is incoherent with D, meaning that the rows of Z are sufficiently uncorrelated with the columns of D (see [35, 36] for more details).[4] To ensure that this condition is satisfied, Z is often chosen as a random matrix, which is incoherent with any dictionary with high probability.

[4] The assumption of "incoherence" between D and Z can be replaced with a different but related hypothesis called the restricted isometry property. Again, the reader should refer to [35, 36] for more details.

The choice of a random matrix is appealing for many reasons. In addition to the fact that it provides theoretical guarantees of incoherence, it is well suited to the case where m is large, making it impossible to store a deterministic matrix Z in memory, whereas it is sufficient to store the seed of a random process to generate a random matrix. On the other hand, large signals can often be cut into smaller parts that still admit sparse decompositions, e.g., image patches, which can be treated independently with a smaller deterministic matrix Z. When this is the case, or when m has a reasonable size, the question of whether to use a deterministic matrix Z or a random one arises, and it has been empirically observed that learned matrices Z can outperform random projections: for example, it is shown in [38] that classical dimensionality reduction techniques such as principal component analysis (PCA) or independent component analysis (ICA) can do better than random projections in noisy settings, and in [16] that jointly learning sensing matrices and dictionaries can do even better in certain cases. A Bayesian framework for learning sensing matrices in compressed sensing applications is also proposed in [39].

Following the latter authors, we study the case where Z is not random but learned at the same time as the dictionary, and introduce a formulation which falls into our task-driven dictionary learning framework:

(15)   min_{D in D, W, Z} (1/2) E_{y,x}[||y − W α*(Zx, D)||_2^2] + (ν_1/2)||W||_F^2 + (ν_2/2)||Z||_F^2,

where we learn D, W, and Z so that the variable y should be well reconstructed when encoding the "sensed" signal Zx with the dictionary D. In a noiseless setting, y is naturally set to the same value as x. In a noisy setting, x can be a corrupted version of y.

After having presented our general task-driven dictionary learning formulation, we present next a strategy to address the corresponding nonconvex optimization problem.

4 Optimization

We first show that the cost function of our basic formulation (7) is differentiable and compute its gradient. Then, we refine the analysis for the different variations presented in the previous section, and describe an efficient online learning algorithm to address them.

4.1 Differentiability of f

We analyze the differentiability of f, as defined in Eq. (7), with respect to its two arguments D and W. We consider here the case where Y is a compact subset of a finite-dimensional real vector space, but all proofs and formulas are similar when Y is a finite set of labels. The purpose of this section is to show that, even though the sparse coefficients α* are obtained by solving a non-differentiable optimization problem, f is differentiable on D × W, and one can compute its gradient.

The main argument in the proof of Propositions 1 and 2 below is that, although the function α*(x, D) is not differentiable, it is uniformly Lipschitz continuous, and differentiable almost everywhere. The only points where α* is not differentiable are points where the set of nonzero coefficients of α* changes (we always denote this set by Λ in this paper). Considering optimality conditions of the elastic-net formulation of Eq. (1), these points are easy to characterize. The details of the proof have been relegated to the Appendix (Lemma 1 and Proposition 3) for readability purposes. With these results in hand, we then show that f admits a first-order Taylor expansion, meaning that it is differentiable, the sets where α* is not differentiable being negligible in the expectation from the definition of f in Eq. (8). We can now state our main result:

Proposition 1 (Differentiability and gradients of f)

Assume λ_2 > 0 and assumptions (A), (B), and (C). Then, the function f defined in Eq. (7) is differentiable, and

(16)   ∇_W f(D, W) = E_{y,x}[∇_W ℓ_s(y, W, α*)],
       ∇_D f(D, W) = E_{y,x}[−D β* α*ᵀ + (x − D α*) β*ᵀ],

where α* is short for α*(x, D), and β* is a vector in R^p that depends on (y, x, W, D) with

(17)   β*_{Λᶜ} = 0  and  β*_Λ = (D_Λᵀ D_Λ + λ_2 I)^{−1} ∇_{α_Λ} ℓ_s(y, W, α*),

where Λ denotes the indices of the nonzero coefficients of α*(x, D).
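Given a computed sparse code, the auxiliary vector β* of Eq. (17) reduces to a small linear system on the active set; a numpy sketch (function name ours; grad_ls stands for the gradient of ℓ_s with respect to α):

```python
import numpy as np

def beta_star(D, alpha, grad_ls, lam2):
    # beta* is zero outside the active set Lambda of alpha (Eq. (17));
    # on Lambda it solves (D_L^T D_L + lam2 I) beta_L = grad_ls restricted to Lambda
    Lam = np.flatnonzero(alpha)
    beta = np.zeros_like(alpha)
    if Lam.size:
        DL = D[:, Lam]
        A = DL.T @ DL + lam2 * np.eye(Lam.size)
        beta[Lam] = np.linalg.solve(A, grad_ls[Lam])
    return beta
```

Note that the system has the size of the active set, which is small when α* is sparse.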

The proof of this proposition is given in the Appendix. We have shown that the function f defined in Eq. (7) is smooth, and computed its gradients. The same can be done for the more general formulation of Eq. (10):

Proposition 2 (Differentiability, extended formulation)

Assume λ_2 > 0 and assumptions (A), (B), and (C). Then, the function f defined in Eq. (10) is differentiable. The gradients of f are

(18)   ∇_W f(D, W, Z) = E_{y,x}[∇_W ℓ_s(y, W, α*)],
       ∇_D f(D, W, Z) = E_{y,x}[−D β* α*ᵀ + (Zx − D α*) β*ᵀ],
       ∇_Z f(D, W, Z) = E_{y,x}[D β* xᵀ],

where α* is short for α*(Zx, D), and β* is defined as in Eq. (17).

The proof is similar to the one of Proposition 1 in the Appendix, and uses similar arguments.

4.2 Algorithm

Stochastic gradient descent algorithms are typically designed to minimize functions whose gradients have the form of an expectation, as in Eq. (16). They have been shown to converge to stationary points of (possibly nonconvex) optimization problems under a few assumptions that are a bit stricter than the ones satisfied in this paper (see [31] and references therein).[5] As noted in [19], these algorithms are generally well suited to unsupervised dictionary learning when their learning rate is well tuned.

[5] As often done in machine learning, we use stochastic gradient descent in a setting where it is not guaranteed to converge in theory, but it has proven to behave well in practice, as shown in our experiments. The convergence proof of Bottou [31] for non-convex problems indeed assumes three-times differentiable cost functions.

The method we propose here is a projected first-order stochastic gradient algorithm (see [40]), given in Algorithm 1. It sequentially draws i.i.d. samples (y_t, x_t) from the probability distribution p(y, x). Obtaining such i.i.d. samples may be difficult since the density p(y, x) is unknown. As a first approximation, the vectors (y_t, x_t) are obtained in practice by cycling over a randomly permuted training set, which is often done in similar machine learning settings [41].

0:  Input: p(y, x) (a way to draw i.i.d. samples of p), λ_1, λ_2, ν (regularization parameters), D in D (initial dictionary), W in W (initial parameters), T (number of iterations), t_0, ρ (learning rate parameters).
1:  for t = 1 to T do
2:     Draw (y_t, x_t) from p(y, x).
3:     Sparse coding: compute α* := α*(x_t, D) using a modified LARS [42].
4:     Compute the active set: Λ := {j in {1, …, p} : α*[j] ≠ 0}.
5:     Compute β*: set β*_{Λᶜ} = 0 and β*_Λ = (D_Λᵀ D_Λ + λ_2 I)^{−1} ∇_{α_Λ} ℓ_s(y_t, W, α*).
6:     Choose the learning rate ρ_t.
7:     Update the parameters by a projected gradient step:
          W ← Π_W[W − ρ_t (∇_W ℓ_s(y_t, W, α*) + ν W)],
          D ← Π_D[D − ρ_t (−D β* α*ᵀ + (x_t − D α*) β*ᵀ)],
       where Π_W and Π_D are respectively orthogonal projections on the sets W and D.
8:  end for
9:  return D (learned dictionary) and W (model parameters).
Algorithm 1 Stochastic gradient descent algorithm for task-driven dictionary learning.
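One iteration of Algorithm 1 can be sketched in numpy for the square-loss instantiation ℓ_s(y, W, α) = (1/2)||y − Wα||_2^2; this is an illustrative sketch under our own choices (ISTA instead of the modified LARS for sparse coding, function names ours), not the paper's implementation:

```python
import numpy as np

def sparse_code(x, D, lam1, lam2, n_iter=300):
    # ISTA solver for the elastic-net sparse code alpha*(x, D)
    L = np.linalg.norm(D.T @ D, 2) + lam2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - (D.T @ (D @ a - x) + lam2 * a) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)
    return a

def tddl_step(x, y, D, W, lam1, lam2, nu, rho):
    """One projected stochastic-gradient step of Algorithm 1 with the square loss."""
    a = sparse_code(x, D, lam1, lam2)
    Lam = np.flatnonzero(a)
    beta = np.zeros_like(a)
    if Lam.size:                              # Eq. (17) on the active set
        DL = D[:, Lam]
        g = -(W.T @ (y - W @ a))              # gradient of l_s w.r.t. alpha
        beta[Lam] = np.linalg.solve(DL.T @ DL + lam2 * np.eye(Lam.size), g[Lam])
    # Gradient steps from Eq. (16), then projection onto the constraint sets
    W = W - rho * (-np.outer(y - W @ a, a) + nu * W)
    D = D - rho * (np.outer(-D @ beta, a) + np.outer(x - D @ a, beta))
    D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project columns onto unit ball
    return D, W
```

Looping this step over a randomly permuted training set reproduces the structure of the algorithm above.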

At each iteration, the sparse code α*(x_t, D) is computed by solving the elastic-net formulation of [29]. We have chosen to use the LARS algorithm, a homotopy method [42], which was originally developed to solve the Lasso formulation, that is, the case λ_2 = 0, but which can be modified to solve the elastic-net problem. Interestingly, it admits an efficient implementation that provides a Cholesky decomposition of the matrix (D_Λᵀ D_Λ + λ_2 I)^{−1} (see [29, 42]) as well as the solution α*. In this setting, β* can be obtained without having to solve a new linear system from scratch.

The learning rate ρ_t is chosen according to a heuristic rule. Several strategies have been presented in the literature (see [28, 43] and references therein). A classical setting uses a learning rate of the form ρ/t, where ρ is a constant.[6] However, such a learning rate is known to decrease too quickly in many practical cases, and one sometimes prefers a learning rate of the form ρ/(t + t_0), which requires tuning two parameters. In this paper, we have chosen a learning rate of the form min(ρ, ρ t_0/t), that is, a constant learning rate ρ during the first t_0 iterations, and an annealing strategy of the form ρ t_0/t when t > t_0, a strategy used by [43] for instance. Finding good parameters ρ and t_0 also requires in practice a good heuristic. The one we have used successfully in all our experiments is t_0 = T/10, where T is the total number of iterations. We then try several values of ρ during a few hundred iterations and keep the one that gives the lowest error on a small validation set.

[6] A 1/t-asymptotic learning rate is usually used for proving the convergence of stochastic gradient descent algorithms [31].
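The chosen schedule is a one-liner; a sketch (function name ours):

```python
def learning_rate(t, rho, t0):
    # Constant rate rho while t <= t0, then anneal as rho * t0 / t
    return min(rho, rho * t0 / t)
```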

In practice, one can also improve the convergence speed of our algorithm with a mini-batch strategy, that is, by drawing η > 1 samples at each iteration instead of a single one. This is a classical heuristic in stochastic gradient descent algorithms and, in our case, it is further motivated by the fact that solving η elastic-net problems with the same dictionary D can be accelerated by the precomputation of the matrix Dᵀ D when η is large enough. Such a strategy is also used in [19] for the classical data-driven dictionary learning approach. In practice, the value η = 200 has given good results in all our experiments (a value found to be good for the unsupervised setting as well).

As with many algorithms tackling non-convex optimization problems, our method for learning supervised dictionaries can lead to poor results if it is not well initialized. The classical unsupervised approach of dictionary learning presented in Eq. (3) has been found empirically to be better behaved than the supervised one, and easy to initialize [19]. We therefore have chosen to initialize our dictionary D by addressing the unsupervised formulation of Eq. (3) using the SPAMS toolbox [19].[7] With this initial dictionary D in hand, we optimize with respect to W the cost function of Eq. (5), which is convex. This procedure gives us a pair (D, W) of parameters which are used to initialize Algorithm 1.

[7] http://www.di.ens.fr/willow/SPAMS/

4.3 Extensions

Here, we present the slight modifications to Algorithm 1 that are necessary to address the two extensions discussed in Section 3.2.

The last step of Algorithm 1 updates the parameters D and W according to the gradients presented in Eq. (18). Modifying the algorithm to address the formulation of Section 3.2.1 also requires updating the parameter Z according to the gradient from Proposition 2:

   Z ← Π_Z[Z − ρ_t (D β* x_tᵀ + ν_2 Z)],

where Π_Z denotes the orthogonal projection on the set Z.

The extension to the semi-supervised formulation of Section 3.2.2 assumes that one can draw samples from the marginal distribution p(x). This is done in practice by cycling over a randomly permuted set of unlabeled vectors. Extending Algorithm 1 to this setting requires the following modifications: at every iteration, we draw one pair (y_t, x_t) from p(y, x) and one sample x'_t from p(x). We proceed exactly as in Algorithm 1, except that we also compute α*' := α*(x'_t, D), and replace the update of the dictionary D by

(19)   D ← Π_D[D − ρ_t ((1 − μ)(−D β* α*ᵀ + (x_t − D α*) β*ᵀ) + μ(−(x'_t − D α*') α*'ᵀ))],

where the term −(x'_t − D α*') α*'ᵀ is in fact the gradient ∇_D ℓ_u(x'_t, D), as shown in [19].

5 Experimental Validation

Before presenting our experiments, we briefly discuss the question of choosing the parameters of our approach.

5.1 Choosing the Parameters

Performing cross-validation on the parameters λ_1, λ_2 (elastic-net parameters), ν (regularization parameter), and p (size of the dictionary) would of course be cumbersome, and we use a few simple heuristics to either reduce the search space for these parameters or fix some of them arbitrarily. We have proceeded in the following way:

  • Since we want to exploit sparsity, we often set  to , even though is necessary in our analysis for proving the differentiability of our cost function. This has proven to give satisfactory results in most experiments, except for the experiment of Section 5.5, where choosing a small positive value for was necessary for our algorithm to converge.

  • We have empirically observed that natural image patches (preprocessed to have zero mean and unit ℓ2-norm) are usually well reconstructed with values of the sparsity parameter λ in a narrow range (such as the value used in [9] for instance), and that one only needs to test a few values around it.

  • When there is a lot of training data, which is often the case for natural image patches, the regularization of the model parameters becomes unnecessary, and the corresponding parameter can be arbitrarily set to a small value for normalized input data. When there are not many training points, this parameter is set by cross-validation.

  • We have also observed that a larger dictionary usually yields better performance, at a higher computational cost. Setting the size of the dictionary is therefore often a trade-off between result quality and efficiency. In our experiments, we typically try a few dictionary sizes spanning an order of magnitude.

We show in this section several applications of our method to real problems, starting with handwritten digit classification, then moving to the restoration of images damaged by an unknown nonlinear transformation, digital art authentication, and compressed sensing.

5.2 Handwritten Digits Classification

We choose in this section a discriminative task that consists of classifying digits from the MNIST [44] and USPS [45] handwritten digit datasets. MNIST is made of 70,000 28×28 images, 60,000 for training and 10,000 for testing, each of them containing one handwritten digit. USPS is composed of 7,291 training images and 2,007 test images of size 16×16.

We choose to address this multiclass classification problem with a one-vs-all strategy, learning independent dictionaries and classifiers per class, using the formulation of Section 3.3.2. This approach has proven in our case to be more scalable than learning a single large dictionary with a multiclass loss, while providing very good results. In this experiment, the Lasso [30] is preferred to the elastic-net formulation [29], so the ℓ2 elastic-net parameter is set to 0. All the digits are preprocessed to have zero mean and are normalized to have unit ℓ2-norm. We try a small grid of values for λ, for the reasons mentioned in the previous section, and a logarithmic grid for the model regularization parameter. To choose the parameters, we hold out the last part of the MNIST training set for validation and train on the remaining digits. For USPS, we proceed in a similar way, keeping a fraction of the training set for validation. Note that a full cross-validation scheme might give better results, but it would be computationally more expensive.
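The one-vs-all decision rule amounts to taking, for each test digit, the class whose binary dictionary-plus-classifier model scores highest; a minimal sketch with hypothetical score values:

```python
import numpy as np

def ova_predict(scores):
    """One-vs-all decision.

    `scores` is an (n_samples, n_classes) array, where column c holds the
    output of the binary classifier trained for class c (one dictionary and
    one classifier per class); the predicted label is the highest-scoring class.
    """
    return np.argmax(scores, axis=1)

# Hypothetical scores for two test digits and three classes.
scores = np.array([[0.1, 0.9, -0.2],
                   [0.7, 0.3,  0.4]])
labels = ova_predict(scores)   # -> array([1, 0])
```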

Most effective digit recognition techniques in the literature use features with shift-invariance properties [24, 46]. Since our formulation is less sophisticated than, for instance, the convolutional network architecture of [24] and does not enjoy such properties, we have chosen to artificially augment the size of our training set by considering versions of the digits that are shifted by one pixel in every direction. This is of course not an optimal way of introducing shift invariance in our framework, but it is fairly simple.
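This augmentation can be sketched as follows; the example assumes one-pixel shifts in the four axis directions (border pixels wrap around here, whereas zero-padding may be closer to what the paper does):

```python
import numpy as np

def augment_with_shifts(images):
    """Return the original images plus copies shifted by one pixel in each of
    the four directions -- a crude way to inject shift invariance."""
    shifted = [images]
    for axis, step in [(1, 1), (1, -1), (2, 1), (2, -1)]:
        shifted.append(np.roll(images, step, axis=axis))
    return np.concatenate(shifted, axis=0)

digits = np.zeros((10, 28, 28))            # toy batch of MNIST-sized digits
augmented = augment_with_shifts(digits)    # shape (50, 28, 28): 5x the data
```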

After choosing the parameters using the validation set (and of course none of the test set), we retrain our model on the full training set. Each experiment is performed with a fixed number of iterations of our algorithm with a small mini-batch size. To study the influence of the dictionary size, we report the performance achieved on the test set for different dictionary sizes on the two datasets. We observe that learning the dictionary in a supervised way significantly improves the performance of the classification. Moreover, our method achieves state-of-the-art results on MNIST with a 0.54% error rate, which is similar to the error rate of [24].888It is also shown in [24] that better results can be achieved by considering deformations of the training set. Our error rate on USPS is slightly behind the error rate of [46].

              unsupervised                  supervised
MNIST    5.27   3.92   2.95   2.36      .96   .73   .57   .54
USPS     8.02   6.03   5.13   4.58     3.64  3.09  2.88  2.84
Table 1: Test error in percent of our method for the digit recognition task, for different (increasing) dictionary sizes.

Our second experiment follows [24], where only a small percentage of the training samples are actually labeled. We use the semi-supervised formulation of Section 3.2.2, which exploits unlabeled data. Unlike the first experiment, where the parameters are chosen using a validation set, and following [24], we make a few arbitrary choices: we reuse the values of λ, ν, and the dictionary size chosen by the validation procedure in the previous experiment. The dictionaries associated with each digit class are initialized using the unsupervised formulation of Section 2. To measure the performance of our algorithm for different values of μ, we use a continuation strategy: starting from a large value of μ, we sequentially decrease it, learning with a fixed number of iterations for each new value of μ. We report the corresponding error rates in Figure 1, showing that our approach offers a competitive performance similar to [24]. Indeed, the best error rates of our method on the smallest sets of labeled data are similar to those reported in [24] with the same sets of labeled data.

Figure 1: Error rates on MNIST when using only a fraction of the labeled data, for various values of μ.

5.3 Learning a Nonlinear Image Mapping

To illustrate our method in the context of regression, we consider a classical image processing task called “inverse halftoning”. With the development of several binary display technologies in the 70s (including, for example, printers and PC screens), the problem of converting a grayscale continuous-tone image into a binary one that looks perceptually similar to the original one (“halftoning”) was posed to the image processing community. Examples of halftoned images obtained with the classical Floyd-Steinberg algorithm [47] are presented in the second column of Figure 2, with original images in the first column. Restoring these binary images to continuous-tone ones (“inverse halftoning”) has become a classical problem (see [48] and references therein).

Unlike most image processing approaches that address this second problem by explicitly modeling the halftoning process, we formulate it as a signal regression problem, without exploiting any prior on the task. We use a database of 36 images: 24 are high-quality images from the Kodak PhotoCD dataset999http://r0k.us/graphics/kodak/ and are used for training, and 12 are classical images often used for evaluating image processing algorithms;101010The list of these images can be found in [4], where they are used for the problem of image denoising. the first four (house, peppers, cameraman, lena) are used for validation and the remaining eight for testing.

We apply the Floyd-Steinberg algorithm implemented in the LASIP Matlab toolbox111111http://www.cs.tut.fi/~lasip/ to the grayscale continuous-tone images in order to build our training/validation/testing sets. We extract all pairs of corresponding patches from the original/halftoned images in the training set, which provides us with a database of several million patches. We then use the “signal regression” formulation of Eq. (12) to learn a dictionary D and model parameters W, by performing two passes of our algorithm over these training pairs.

At this point, we have learned how to restore a small patch from an image, but not yet how to restore a full image. Following other patch-based approaches to image restoration [2], we extract from a test image all patches including overlaps, and restore each patch independently so that we get different estimates for each pixel (one estimate for each patch the pixel belongs to). The estimates are then averaged to reconstruct the full image. Estimating each patch independently and then averaging the estimates is of course not optimal, but it gives very good results in many image restoration tasks (see, e.g., [2, 4]). The final image is then post-processed using the denoising algorithm of [4] to remove possible artefacts.
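The overlapping-patch averaging step can be sketched as follows; `restore_patch` stands in for the learned regression model applied to one patch, and the patch size is hypothetical:

```python
import numpy as np

def restore_image(img, patch, restore_patch):
    """Restore `img` by estimating every overlapping `patch` x `patch` block
    independently with `restore_patch` and averaging the per-pixel estimates,
    each pixel getting one estimate per patch it belongs to."""
    h, w = img.shape
    acc = np.zeros_like(img, dtype=float)   # sum of estimates per pixel
    cnt = np.zeros_like(img, dtype=float)   # number of estimates per pixel
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            est = restore_patch(img[i:i + patch, j:j + patch])
            acc[i:i + patch, j:j + patch] += est
            cnt[i:i + patch, j:j + patch] += 1.0
    return acc / cnt

# Sanity check: with an identity "restorer", averaging returns the input.
img = np.arange(36, dtype=float).reshape(6, 6)
out = restore_image(img, 3, lambda p: p)
```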

To evaluate our method quantitatively, we measure how well it reconstructs the continuous-tone images of the test set from the halftoned ones. To reduce the number of hyperparameters of our model, we have made a few arbitrary choices: we use the Lasso formulation for encoding the signals, setting the ℓ2 elastic-net parameter to 0. With millions of training samples, our model is unlikely to overfit, and the regularization parameter ν is also set to a small value. The remaining free parameters of our approach are the size of the patches, the size of the dictionary, and the regularization parameter λ. We have optimized these parameters by measuring the mean-squared reconstruction error on the validation set, trying several patch and dictionary sizes, and determining λ by first trying values on a coarse logarithmic scale and then refining this parameter on a finer scale. Since the test procedure is slightly different from the training one (the test includes an averaging step to restore a full image, whereas the training does not), we have further refined the value of λ, trying a few values on an additive scale around the previous choice, and selected the one that gave the best result on the validation set.

Note that the largest dictionary has been chosen, and better results could potentially be obtained using an even larger dictionary, but this would imply a higher computational cost. Examples of results are presented in Figure 2. Halftoned images are binary but look perceptually similar to the original images. Reconstructed images have very few artefacts and most details are well preserved. We report in Table 2 a quantitative comparison between our approach and several from the literature, including the state-of-the-art algorithm of [48], which until now had the best results on this dataset. Even though our method does not explicitly model the transformation, it leads to better results in terms of PSNR.121212Denoting by MSE the mean squared error for images whose intensities are between 0 and 255, the PSNR is defined as PSNR = 10 log₁₀(255²/MSE) and is measured in dB. A gain of 1 dB reduces the MSE by approximately 20%. We also present in Figure 3 the results obtained by applying our algorithm to various binary images found on the web, for which we do not know the ground truth and which have not necessarily been obtained with the Floyd-Steinberg algorithm. The results are qualitatively rather good.
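The PSNR reported in Table 2 is a standard quantity; a small helper, assuming 8-bit intensities in [0, 255]:

```python
import numpy as np

def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB: PSNR = 10 * log10(255**2 / MSE)."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# A gain of 1 dB corresponds to dividing the MSE by 10**0.1 (about 1.26),
# i.e. reducing it by roughly 20%.
```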

From this experiment, we conclude that our method is well suited to (at least, some) nonlinear regression problems on natural images, paving the way to new applications of sparse image coding.

Figure 2: From left to right: Original images, halftoned images, reconstructed images. Even though the halftoned images (center column) perceptually look relatively close to the original images (left column), they are binary. Reconstructed images (right column) are obtained by restoring the halftoned binary images. Best viewed by zooming on a computer screen.
Validation set Test set
Image 1 2 3 4 5 6 7 8 9 10 11 12
FIHT2 30.8 25.3 25.8 31.4 24.5 28.6 29.5 28.2 29.3 26.0 25.2 24.7
WInHD 31.2 26.9 26.8 31.9 25.7 29.2 29.4 28.7 29.4 28.1 25.6 26.4
LPA-ICI 31.4 27.7 26.5 32.5 25.6 29.7 30.0 29.2 30.1 28.3 26.0 27.2
SA-DCT 32.4 28.6 27.8 33.0 27.0 30.1 30.2 29.8 30.3 28.5 26.2 27.6
Ours 33.0 29.6 28.1 33.0 26.6 30.2 30.5 29.9 30.4 29.0 26.2 28.0
Table 2: Inverse halftoning experiments. Results are reported in PSNR (higher is better). SA-DCT refers to [48], LPA-ICI to [49], FIHT2 to [50] and WInHD to [51]. Best results for each image are in bold.
Figure 3: Results on various binary images publicly available on the Internet. No ground truth is available for these images from old computer games, and the algorithm that has generated these images is unknown. Input images are on the left. Restored images are on the right. Best viewed by zooming on a computer screen.

5.4 Digital Art Authentication

Recognizing authentic paintings from imitations using statistical techniques has been the topic of a few recent works [52, 53, 54]. Classical methods compare, for example, the kurtosis of wavelet coefficients between a set of authentic paintings and imitations [52], or involve more sophisticated features [53]. Recently, Hugues et al. [54] have considered a dataset of authentic paintings from Pieter Bruegel the Elder, and imitations.131313The origin of these paintings is assessed by art historians. They have proposed to learn dictionaries for sparse coding, and to use the kurtosis of the sparse coefficients as discriminative features. We use their dataset, which they kindly provided to us.141414It would have been interesting to use the datasets used in [52, 53], but they are not publicly available.

The supervised dictionary learning approach we have presented is designed for classifying relatively small signals, and should not be directly applicable to the classification of large images, for which classical computer vision approaches based on bags of words may be better adapted (see [12, 55] for such approaches). However, we show that, for this particular dataset, a simple voting scheme based on the classification of small image patches with our method leads to good results.

The experiment we carry out consists of finding which painting is authentic, and which one is fake, in a pair known to contain one of each.151515This task is of course considerably easier than classifying each painting as authentic or fake. We do not claim to propose a method that readily applies to forgery detection. We proceed in a leave-one-out fashion, removing for testing one authentic and one imitation painting from the dataset, and learning on the remaining ones. Since the dataset is small and does not have an official training/test split, we measure a cross-validation score, testing all possible pairs. We consider color image patches, without any pre-processing, and classify each patch from the test images independently. Then, we use a simple majority vote among the test patches to decide which image in the test pair is the authentic one, and which one is the imitation.161616Note that this experimental setting is different from [54], where only authentic paintings are used for training (and not imitations). We therefore do not make quantitative comparisons with this work.
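The voting scheme itself is simple; a sketch with hypothetical per-patch classifier scores (the sign convention, positive meaning "authentic", is an assumption):

```python
import numpy as np

def pick_authentic(scores_a, scores_b):
    """Given per-patch classifier scores for the two paintings of a test pair
    (positive score = patch voted authentic), return 0 if painting A wins the
    majority vote over its patches, else 1."""
    frac_a = np.mean(scores_a > 0)   # fraction of A's patches voted authentic
    frac_b = np.mean(scores_b > 0)
    return 0 if frac_a >= frac_b else 1

# Painting A has 2/3 of its patches voted authentic, B only 1/3.
winner = pick_authentic(np.array([0.2, 0.5, -0.1]),
                        np.array([-0.3, 0.4, -0.2]))
```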

For each pair of authentic/imitation paintings, we build a dataset containing patches from the authentic images and patches from the imitations. We use the formulation of Eq. (13) for binary classification, and arbitrarily fix the size of the dictionaries. Since the training set is large, we set the regularization parameter ν to a small value, choose the Lasso formulation for decomposing the patches by setting the ℓ2 elastic-net parameter to 0, and cross-validate on the parameter λ, first trying values on a coarse grid and then refining the result on a finer logarithmic scale. We also compare Eq. (13) with the logistic regression loss and the basic formulation of Eq. (5), where the dictionary is learned in an unsupervised way.

For classifying individual patches, the cross-validation score of the supervised formulation slightly improves upon that of the unsupervised one. The task of classifying small image patches independently is difficult, since there is significant overlap between the two classes. On the other hand, finding the imitation in a pair of (authentic, imitation) paintings with the voting scheme is easier: the unsupervised formulation fails for only one pair, whereas the supervised one has always given the right answer in our experiments.

5.5 Compressed Sensing

Table 3: Compressed sensing experiment on small natural image patches. The mean squared error (MSE) measured on a test set is reported for each scenario (sensing matrix obtained as RANDOM, SL1, PCA, or SL2; dictionary obtained as DCT, UL, or SL), with standard deviations obtained by reproducing each experiment several times, randomizing the algorithm initializations and the sampling of the training images. Each patch is normalized to have unit ℓ2 norm, and the mean squared reconstruction error is rescaled for readability purposes. Here, m is the number of rows of the sensing matrix. The different scenarios vary in the way the sensing matrix and the dictionary are learned (fixed, unsupervised, supervised). See the main text for details.

In this experiment, we apply our method to the problem of learning dictionaries and projection matrices for compressed sensing. As explained in Section 3.3.4, our formulation and the conclusions of this section hold for relatively small signals, where the sensing matrix can be stored in memory and learned. Thus, we consider here small patches of natural images. To build our training/validation/test sets, we have chosen the Pascal VOC 2006 database of natural images [56]: a first set of images is used for training, a second for validation, and the remaining images are kept for testing. Images are downsampled so that the JPEG compression artefacts present in this dataset become visually imperceptible, thereby improving its quality for our experiment.

We compare different settings where the task is to reconstruct a patch x from an observation Zx with many fewer dimensions (a small set of linear measurements), where Z is a sensing matrix. For simplicity, we only consider here the noiseless case. At test time, as explained in Section 3.3.4, our estimate of x is Wα, where α is the sparse code of the observation Zx in a dictionary D adapted to the measurements, and W can be interpreted here as a dictionary for representing x. We evaluate several strategies for obtaining the sensing matrix:

  • We consider the case, which we call RANDOM, where the entries of the sensing matrix are i.i.d. samples from a Gaussian distribution. Since the purpose of the sensing matrix is to reduce the dimensionality of the input data, it is also natural to consider the case where it is obtained by principal component analysis on natural image patches (PCA). Finally, we also learn it with the supervised learning formulation of Eq. (15) (SL), but consider the case where it is initialized randomly (SL1) or by PCA (SL2).

  • The dictionary D can either be fixed or learned. A typical setting would be to fix it to a discrete cosine transform (DCT) matrix, often used in signal processing applications [2]. It can also be learned with an unsupervised formulation (UL), or a supervised one (SL).

  • The dictionary W used for reconstructing the signal is always learned in a supervised way.
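The PCA strategy above can be sketched in a few lines of numpy; the patch data here is synthetic and the sizes are hypothetical:

```python
import numpy as np

def pca_sensing_matrix(patches, m):
    """Build a sensing matrix whose rows are the m leading principal
    directions of the training patches (each row of `patches` is one
    vectorized patch)."""
    X = patches - patches.mean(axis=0)           # center the data
    cov = X.T @ X / len(X)                       # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    return eigvecs[:, ::-1][:, :m].T             # top-m components as rows

patches = np.random.default_rng(0).standard_normal((500, 100))  # toy 10x10 patches
Z = pca_sensing_matrix(patches, 20)              # 20 linear measurements
```

By construction the rows of this sensing matrix are orthonormal, unlike the Gaussian RANDOM case.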

Since our training set is very large (several million patches), we arbitrarily fix the regularization parameters of the model to small values. We measure reconstruction errors with dictionaries of various levels of overcompleteness by trying several dictionary sizes. The remaining free parameters are the regularization parameters λ1 and λ2 for computing the coefficients α, for which we try values on small grids. Unlike the experiments of Section 5.3, it is absolutely necessary in this setting to use λ2 > 0 (in accordance with the theory), since using a zero value for this parameter has led to instabilities and prevented our algorithm from converging. Each learning procedure is performed by our algorithm in one pass over millions of patches randomly extracted from our training images. The pair of parameters that gives the lowest reconstruction error on the validation set is selected, and we report in Table 3 the result obtained on a test set of patches randomly extracted from the test images. The conclusions of this compressed sensing experiment on natural image patches are the following:

  • When the sensing matrix is initialized as a Gaussian random matrix (case RANDOM), learning it together with the dictionary significantly improves the reconstruction error (case SL1). A similar observation was made in [16].

  • Results obtained with PCA are in general much better than those obtained with random projections, which is consistent with the conclusions of [38].

  • However, PCA alone does better than SL1. When PCA is used for initializing our supervised formulation, results can be slightly improved (case SL2). This illustrates either the limits of the non-convex optimization procedure, or the fact that PCA is particularly well adapted to this problem.

  • Learned dictionaries (cases UL and SL) outperform classical DCT dictionaries.
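The sense/decode/reconstruct pipeline evaluated above can be sketched as follows; the matrices here are random placeholders for the learned ones, the sizes and the regularization value are hypothetical, and a plain Lasso decoder stands in for the elastic-net one used in the paper:

```python
import numpy as np

def ista(y, D, lam, n_iter=200):
    """Sparse code of measurement y in dictionary D (lasso, solved by ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a -= D.T @ (D @ a - y) / L         # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
n, m, p = 100, 20, 128                     # signal dim, measurements, atoms
Z = rng.standard_normal((m, n)) / np.sqrt(m)  # RANDOM sensing matrix
D = rng.standard_normal((m, p))            # stands in for the learned code dictionary
W = rng.standard_normal((n, p))            # stands in for the learned reconstruction dictionary

x = rng.standard_normal(n)                 # a test signal (noiseless case)
alpha = ista(Z @ x, D, lam=0.1)            # decode the m linear measurements
x_hat = W @ alpha                          # estimate of x, as in Section 3.3.4
```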

6 Conclusion

We have presented in this paper a general formulation for learning sparse data representations tuned to specific tasks. Unlike classical approaches that learn a dictionary adapted to the reconstruction of the input data, our method learns features in a supervised way. We have shown that this approach is effective for solving classification and regression tasks in a large-scale setting, allowing the use of millions of training samples, and is able to successfully exploit unlabeled data when only a few labeled samples are available. Experiments on handwritten digit classification, nonlinear inverse image mapping, digital art authentication, and compressed sensing have shown that our method leads to state-of-the-art results for several real problems. Future work will include adapting our method to various image processing problems such as image deblurring and image super-resolution, and to other inverse problems.

Acknowledgments

This work was supported in part by ANR under grant MGA ANR-07-BLAN-0311 and by the European Research Council (SIERRA Project). The authors would like to thank J.M. Hugues, D.J. Graham, and D.N. Rockmore for providing us with the Bruegel dataset used in Section 5.4, and Y-Lan Boureau and Marc’Aurelio Ranzato for providing their experimental setting for the digit recognition task.

Appendix A Proofs and Lemmas

Before giving the proof of Proposition 1, we present two general results on the elastic net formulation of [29].

Lemma 1 (Optimality conditions of the elastic net)

The vector α is a solution of Eq. (4) if and only if, for all indices j,

(20)

Denoting by Λ the active set, we also have

(21)

where the corresponding sign vector carries the signs of the nonzero entries of α.

Proof.  Equation (20) can be obtained by considering the subgradient optimality conditions, as done in [57] for the case λ2 = 0. These can be written as

where the last term denotes the subdifferential of the ℓ1 norm evaluated at α. A classical result (see [58], page 238) is that the elements of this subdifferential are the vectors whose j-th entry equals sign(α_j) if α_j ≠ 0, and has magnitude at most 1 otherwise. This directly gives Eq. (20). The equalities in Eq. (20) define a linear system whose solution is Eq. (21).  
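The conditions of Eq. (20) are easy to verify numerically. The sketch below (hypothetical sizes, with our own elastic-net coordinate-descent solver rather than any code from the paper) checks both the equality condition on the active set and the inequality condition elsewhere:

```python
import numpy as np

def elastic_net_cd(x, D, lam1, lam2, n_iter=2000):
    """Coordinate descent for 0.5||x - D a||^2 + lam1 ||a||_1 + 0.5 lam2 ||a||^2."""
    p = D.shape[1]
    a = np.zeros(p)
    col_sq = (D ** 2).sum(axis=0)
    r = x.copy()                             # residual x - D a, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            rho = D[:, j] @ r + col_sq[j] * a[j]
            new = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (col_sq[j] + lam2)
            r += D[:, j] * (a[j] - new)
            a[j] = new
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 50))
x = rng.standard_normal(30)
lam1, lam2 = 1.0, 0.1
a = elastic_net_cd(x, D, lam1, lam2)

g = D.T @ (x - D @ a)                        # correlations with the residual
active = a != 0
# Eq. (20): equality with lam1 * sign on the active set, inequality elsewhere.
assert np.allclose(g[active] - lam2 * a[active],
                   lam1 * np.sign(a[active]), atol=1e-5)
assert np.all(np.abs(g[~active]) <= lam1 + 1e-5)
```

Strong convexity from the ℓ2 term (lam2 > 0) is what makes the solution unique, mirroring the role this term plays in the analysis.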
The next proposition exploits these optimality conditions to characterize the regularity of .

Proposition 3 (Regularity of the elastic net solution)

Assume λ2 > 0 and Assumption (A). Then,

  1. The function is uniformly Lipschitz on .

  2. Let be in , be a positive scalar and be a vector in , and define as the set of vectors satisfying for all in ,

    (22)

    where is shorthand for .

    Then, there exists independent of , and so that for all in , the function  is twice continuously differentiable on , where and denote the open balls of radius respectively centered on and .

Proof.  The first point is proven in [19]. The proof uses the strong convexity induced by the elastic-net term when λ2 > 0, and the compactness assumption (A).

For the second point, we study the differentiability of on sets that satisfy conditions which are more restrictive than the optimality conditions of Eq. (20). Concretely, let  be in , and  be in . The set characterizes the vectors  so that has the same signs as  (and same set of zero coefficients), and satisfies the conditions of Eq. (20), but with two additional constraints: (i) The magnitude of the non-zero coefficients in should be greater than . (ii) The inequalities in Eq. (20) should be strict with a margin . The reason for imposing these assumptions is to restrict ourselves to points  in  that have a stable active set—that is, the set of non-zero coefficients  of  should not change for small perturbations of , when is in .

Proving that there exists a constant satisfying the second point is then easy (if a bit technical): Let us suppose that is not empty (the case when it is empty is trivial). Since  is uniformly Lipschitz with respect to , so are the quantities and , for all in . Thus, there exists independent of and such that for all satisfying and , we have for all in ,

where is short-hand for , and is therefore in . It is then easy to show that the active set of and the signs of are stable on , and that is given by the closed form of Eq. (21). is therefore twice differentiable on .  
With this proposition in hand, we can now present the proof of Proposition 1:

Proof.  The differentiability of the cost function with respect to W is easy to establish, using only the compactness of the parameter sets and the fact that the loss is twice differentiable. We will therefore focus on showing differentiability with respect to D, which is more difficult since the sparse code is not differentiable everywhere.

Given a small perturbation of D, we compute

(23)

where the second-order term comes from the fact that the sparse code is uniformly Lipschitz and the parameter sets are compact.

Let us now choose W and D in their respective parameter sets. We have characterized in Proposition 3 the differentiability of the sparse code on some subsets of the signal space. We consider the set

and denoting by our probability measure, it is easy to show with a few calculations that . Using the constant defined in Proposition 3, we obtain that . Since , the set can be neglected (in the formal sense) when integrating with respect to  in the expectation of Eq. (23), and it is possible to show that