1 Introduction
Representation of signals as sparse linear combinations of a basis set is popular in the signal/image processing and machine learning communities. In this representation, a sample $\mathbf{y}$ is described by a linear combination $\mathbf{x}$ of a sparse number of columns in a dictionary $\mathbf{D}$, such that $\mathbf{y} = \mathbf{D}\mathbf{x}$. Significant theoretical progress has been made to determine the necessary and sufficient conditions under which recovery of the sparsest representation using a predefined $\mathbf{D}$ is guaranteed [3, 27, 4]. Recent sparse coding methods achieve state-of-the-art results for various visual tasks, such as face recognition [29]. Instead of minimizing the $\ell_0$ norm of $\mathbf{x}$, these methods solve relaxed versions of the originally NP-hard problem, which we will refer to as traditional sparse coding (TSC). However, it has been empirically shown that adapting $\mathbf{D}$ to underlying data can improve upon state-of-the-art techniques in various restoration and denoising tasks [6, 23]. This adaptation is made possible by solving a sparse matrix factorization problem, which we refer to as dictionary learning. Learning $\mathbf{D}$ is done by alternating between TSC and dictionary updates [1, 8, 20, 15]. For an overview of TSC, dictionary learning, and some of their applications, we refer the reader to [28, 7].
In this paper, we address the problem of discriminative dictionary learning (DDL), where $\mathbf{D}$ is viewed as a linear mapping between the original data space and the space of sparse representations, whose dimensionality is usually higher. In DDL, we seek an optimal mapping that yields faithful sparse representation and allows for maximal discriminability between labeled data. These two objectives are seldom complementary and tend to conflict in many cases; thus, classification can be viewed as a regularizer for reliable representation and vice versa. From both viewpoints, this regularization is important to prevent overfitting to the labeled data. Therefore, instead of optimizing both objectives independently, we seek joint optimization. In the case of sparse linear representation, the problem of DDL was recently introduced and developed in [19, 21, 22], under the name supervised dictionary learning (SDL).
In this paper, we denote the problem as DDL instead of SDL, since DDL inherently includes the semi-supervised case. SDL is also addressed in a recent work on task-driven dictionary learning [18]. The form of the optimization problem in SDL is shown in Eq. (1). The objective is a linear combination of a representation cost and a classification cost, where the latter depends on the data labels and the classifier parameters.
(1) 
Although [22, 21] use multiple dictionaries, learning a single dictionary allows for sharing of features among labeled classes, lower computational cost, and less risk of overfitting. As a result, our proposed method learns a single dictionary $\mathbf{D}$. Here, we note that [13] addresses a similar problem, where the dictionary is predefined and the classification cost is the Fisher criterion. Despite their merits, SDL methods have the following drawbacks. (i) Most methods use limited forms for the classification cost (e.g. softmax applied to reconstruction error). Consequently, they cannot generalize to incorporate popular classification costs, such as the exponential loss used in AdaBoost or the hinge loss in SVMs. (ii) Previous SDL methods weight the training samples and the classifiers uniformly by setting a fixed mixing coefficient according to cross-validation. This biases their cost functions to samples that are badly represented or misclassified. As such, they are more sensitive to outlier, noisy, and mislabeled training data. (iii) From an optimization viewpoint, the SDL objective functions are quite involved, especially due to the use of the softmax function for multi-class discrimination.
Contributions:
Our proposed DDL framework addresses the previous issues by learning a linear map $\mathbf{D}$ that allows for maximal class discrimination in the labeled data when using linear classification. (i) We show that this framework is applicable to a general family of classification cost functions, including those used in popular boosting methods. (ii) Since we pose DDL in a probabilistic setting, the representation-classification tradeoff and the weighting of training samples correspond to MAP parameters that are estimated in a data-driven fashion, which avoids parameter tuning. (iii) Since we decouple the dictionary $\mathbf{D}$ and the classifiers $\mathbf{W}$, the representations act as the only liaisons between classification and representation. In fact, this is why well-studied methods in dictionary learning and TSC can be easily incorporated in solving the DDL problem. This avoids involved optimization techniques. Our framework is efficient, general, and modular, so that any improvement or theoretical guarantee on individual modules (i.e. TSC or dictionary learning) can be seamlessly incorporated.
The paper is organized as follows. In Section 2, we describe the probabilistic representation and classification models in our DDL framework and how they are combined in a MAP setting. Section 3 presents the learning methodology that estimates the MAP parameters and shows how inference is done. In Section 4, we validate our framework by applying it to digit classification and face recognition and showing that it achieves state-of-the-art performance on benchmark datasets.
2 Overview of DDL Framework
In this section, we give a detailed description of the probabilistic models used for representation and classification. Our optimization framework, formulated in a standard MAP setup, seeks to maximize the likelihood of the given labeled data coupled with priors on the model parameters.
2.1 Representation and Classification Models
We assume that each $M$-dimensional data sample $\mathbf{y}$ can be represented as a sparse linear combination of $K$ dictionary atoms with additive Gaussian noise of diagonal covariance: $\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{n}$, where $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Here, we view the sparse representation $\mathbf{x}$ as a latent variable of the representation model. In training, we assume that the training samples are represented by this model. However, test samples can be contaminated by various types of noise that need not be zero-mean Gaussian in nature. In testing, we have: $\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{e} + \mathbf{n}$, where we constrain any auxiliary noise $\mathbf{e}$ (e.g. occlusion) to be sparse in nature without modeling its explicit distribution. This constraint is used in the error correction method for sparse representation in [27]. The representation in testing is identical to the one in training, except that the dictionary in testing is augmented by identity, i.e. $[\mathbf{D} \;\; \mathbf{I}]$. In both cases, the likelihood of observing a specific $\mathbf{y}$ is modeled as a Gaussian: $p(\mathbf{y} \mid \mathbf{D}, \mathbf{x}, \sigma) \propto \sigma^{-M} \exp\left(-\|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 / (2\sigma^2)\right)$. Since a single dictionary is used to represent samples belonging to different classes, sharing of features is allowed among classes, which simplifies the learning process.
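The relationship between the training and testing models can be checked numerically. The following is a minimal sketch with made-up toy values; the helper names `augment_with_identity` and `synthesize` are illustrative and do not come from the paper. It shows that a sample corrupted by a sparse error is reproduced exactly when the identity-augmented dictionary is applied to the stacked vector of sparse code and error (Gaussian noise is omitted for clarity).

```python
def augment_with_identity(D):
    """Return [D | I] for a dictionary stored as a list of M rows."""
    m = len(D)
    return [row + [1.0 if r == c else 0.0 for c in range(m)]
            for r, row in enumerate(D)]


def synthesize(D, coeffs):
    """Compute D @ coeffs for a matrix stored as a list of rows."""
    return [sum(d * c for d, c in zip(row, coeffs)) for row in D]


# Toy dictionary with M = 2 dimensions and K = 3 atoms (hypothetical values).
D = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
x = [2.0, 0.0, 0.0]   # sparse code
e = [0.0, 3.0]        # sparse gross error, e.g. occlusion

# Testing model written explicitly: y = D x + e.
y = [a + b for a, b in zip(synthesize(D, x), e)]

# Same sample written with the augmented dictionary: y = [D I] [x; e].
D_aug = augment_with_identity(D)
y_aug = synthesize(D_aug, x + e)
```

Because the error is absorbed into extra coefficients, any sparse coding routine that works with the training dictionary can in principle be reused at test time by passing it the augmented dictionary.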
To model the classification process, we assume that each data sample corresponds to a label vector $\mathbf{l} \in \{-1, +1\}^C$, which encodes the class membership of this sample, where $C$ is the total number of classes. In our experiments, only one value in $\mathbf{l}$ is $+1$. We apply a linear classifier (or equivalently a set of additively boosted linear classifiers) to the sparse representations in a one-vs-all classification setup. The probabilistic classification model is shown in Eq. (2), where $f$ is the classification cost function. Note that appending $1$ to $\mathbf{x}$ intrinsically adds a bias term to each classifier $\mathbf{w}_j$. Due to the linearity of the classifier, discrimination of the $j$-th class is completely determined by the scalar cost function $f(z_j)$, where $z_j = l_j \mathbf{w}_j^T \mathbf{x}$. This function quantifies the cost of assigning label $l_j$ to representation $\mathbf{x}$ using the classifier $\mathbf{w}_j$. For now, we do not specify the functional form of $f$. In Section 3, we show that most forms of $f$ used in practice are easily incorporated into our DDL framework. Since we seek effective class discrimination, we expect low classification cost for the given representations. Therefore, by arranging all linear classifiers in matrix $\mathbf{W}$, the event $\mathbf{l} \mid \mathbf{x}, \mathbf{W}$ can be modeled as a product of $C$ independent exponential distributions parameterized by $\lambda_j$ for $j = 1, \ldots, C$. By denoting $\mathbf{w}_j$ as the classifier of the $j$-th class, we have:
(2)
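To make the classification model concrete, here is a small sketch of the negative log-probability of one label vector under a product of independent exponential distributions. It assumes that each distribution is parameterized by its mean, so that a cost $c$ has density $\frac{1}{\lambda} e^{-c/\lambda}$; this parameterization, the function name `label_neg_log_likelihood`, and the toy costs are assumptions made for illustration, and the exact form of Eq. (2) is not reproduced here.

```python
import math


def label_neg_log_likelihood(costs, lambdas):
    """Negative log-probability of one label vector.

    costs[j]   : classification cost of the j-th one-vs-all classifier (>= 0)
    lambdas[j] : assumed mean of the exponential distribution for class j
    """
    return sum(math.log(lam) + c / lam for c, lam in zip(costs, lambdas))


# Hypothetical per-class costs for a well classified and a misclassified sample.
well_classified = label_neg_log_likelihood([0.1, 0.2], [1.0, 1.0])
misclassified = label_neg_log_likelihood([2.0, 3.0], [1.0, 1.0])
```

Under this assumption, low classification costs translate directly into a high label likelihood, which matches the stated expectation of low cost for discriminative representations.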
2.2 Overall Probabilistic Model
To formalize notation, we consider a training set of $N$ data samples in $\mathbb{R}^M$ that are columns of the data matrix $\mathbf{Y}$. The $i$-th column of the label matrix $\mathbf{L}$ is the label vector corresponding to the $i$-th data sample. Here, we assume that there are $K$ atoms in the dictionary $\mathbf{D}$, where $K$ is a fixed integer that is application-dependent. Typically, . Note that there have been recent attempts to determine an optimal $K$ for a given dataset [24]. For our experiments, $K$ is kept fixed and its optimization is left for future work. The representation matrix $\mathbf{X}$ is a sparse matrix, whose columns represent the sparse codes of the data samples using dictionary $\mathbf{D}$. The linear classifiers are columns in matrix $\mathbf{W}$. We denote $\boldsymbol{\sigma}$ and $\boldsymbol{\lambda}$ as the representation and classification parameters respectively.
In what follows, we combine the representation and classification models from the previous section in a unified framework that will allow for the joint MAP estimation of the unknowns: $\mathbf{D}$, $\mathbf{X}$, $\mathbf{W}$, $\boldsymbol{\sigma}$, and $\boldsymbol{\lambda}$. By making the standard assumption that the posterior probability consists of a dominant peak, we determine the required MAP estimates by maximizing the product of the data likelihood, the label likelihood, and the priors on the model parameters. Here, we make the simplifying assumption that the priors of the dictionary and representations are uniform. To model the priors of $\boldsymbol{\sigma}$ and $\boldsymbol{\lambda}$ and to avoid using hyperparameters, we choose the objective non-parametric Jeffreys prior, which has been shown to perform well for classification and regression tasks [9]. Therefore, we obtain $p(\sigma_i) \propto 1/\sigma_i$ and $p(\lambda_j) \propto 1/\lambda_j$. The motivations behind the selection of these priors are that (i) the representation prior encourages a low variance representation (i.e. the training data should properly fit the proposed representation model) and that (ii) the classification prior encourages a low mean (and variance)^1 classification cost (i.e. the training data should be properly classified using the proposed classification model). By minimizing the sum of the negative log-likelihood of the data and labels and the negative log priors, MAP estimation requires solving the optimization problem in Eq. (3), where $l_{ij}$ represents the label of the $i$-th training sample with respect to the $j$-th class.
^1 The mean and variance of an exponential distribution with parameter $\lambda$ are $\lambda$ and $\lambda^2$ respectively.
To encode the sparse representation model, we explicitly enforce sparsity on $\mathbf{X}$ by requiring that each representation satisfies $\|\mathbf{x}_i\|_0 \le T$. An alternative for obtaining sparse representations is to assume that $\mathbf{x}_i$ follows a Laplacian prior, which leads to an $\ell_1$ regularizer in the objective. While this sparsifying regularizer alleviates some of the complexity of Eq. (3), it leads to the problem of selecting proper parameters for these Laplacian priors. Note that recent efforts have been made to find optimal estimates of these Laplacian parameters in the context of sparse coding [11, 30, 2]. However, to avoid additional parameters, we choose the form in Eq. (3), where the first two terms of the objective correspond to the representation cost and the last two to the classification cost.
(3) 
In the following section, we show that Eq. (3) can be solved for a general family of cost functions using well-known and well-studied techniques in TSC and dictionary learning. In other words, developing specialized optimization methods and performing parameter tuning are not required.
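For orientation, the following is one form that the MAP objective takes when the modeling choices described in this section are written out. It is a sketch under stated assumptions, not a verbatim copy of Eq. (3): it assumes isotropic Gaussian noise with per-sample standard deviation $\sigma_i$, exponential distributions parameterized by their mean $\lambda_j$, Jeffreys priors $1/\sigma_i$ and $1/\lambda_j$, and a sparsity level denoted $T$; additive constants are dropped.

```latex
\min_{\mathbf{D},\,\mathbf{X},\,\mathbf{W},\,\boldsymbol{\sigma},\,\boldsymbol{\lambda}}
\;\; \sum_{i=1}^{N} \left[ \frac{\|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2}{2\sigma_i^2}
      + (M+1)\log\sigma_i \right]
\; + \; \sum_{j=1}^{C} \left[ \frac{1}{\lambda_j} \sum_{i=1}^{N}
      f\!\left(l_{ij}\,\mathbf{w}_j^{T}\mathbf{x}_i\right) + (N+1)\log\lambda_j \right]
\quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le T \;\; \forall i
```

The first bracket collects the Gaussian negative log-likelihood ($M \log\sigma_i$ plus the scaled residual) and the negative log prior ($\log\sigma_i$); the second bracket does the same for the exponential label model, where each of the $N$ samples contributes $\log\lambda_j$ and the prior contributes one more.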
3 Learning Methodology
Since the objective function and sparsity constraints in Eq. (3) are nonconvex, we decouple the dependent variables by resorting to a blockwise coordinate descent method (alternating optimization). At each iteration, only a subset of variables is updated at a time. Clearly, learning is decoupled from learning , if and are fixed. Next, we identify the four basic update procedures in our DDL framework. In what follows, we denote the estimate of variable at iteration as .
3.1 Classifier Update
Since the classification terms in Eq. (3) are decoupled from the representation terms and independent of each other, each classifier can be learned separately. In this paper, we focus on four popular forms of $f$, as shown in Figure 1(a): (i) the square loss, optimized by the boosted square leverage method [5], (ii) the exponential loss, optimized by the AdaBoost method [10], (iii) the logistic loss, optimized by the LogitBoost method [10], and (iv) the hinge loss, optimized by the SVM method. Since additive boosting of linear classifiers yields a linear classifier, we allow for seamless incorporation of additive boosting, which is a novel contribution.
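As a point of reference, the four losses named above can be written as functions of the margin $z = l\,\mathbf{w}^T\mathbf{x}$. The sketch below uses their standard textbook forms; it is an assumption that these match the exact expressions plotted in Figure 1(a), and the function names are chosen here for illustration.

```python
import math


def square_loss(z):
    return (1.0 - z) ** 2


def exponential_loss(z):
    return math.exp(-z)


def logistic_loss(z):
    return math.log1p(math.exp(-z))


def hinge_loss(z):
    return max(0.0, 1.0 - z)


def margin(label, w, x):
    """Margin of a linear classifier w on code x for a label in {-1, +1}."""
    return label * sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical classifier and sparse code (last entry 1.0 acts as the bias input).
z = margin(+1, [0.5, -1.0, 0.25], [2.0, 0.0, 1.0])
losses = {name: fn(z) for name, fn in [("square", square_loss),
                                        ("exponential", exponential_loss),
                                        ("logistic", logistic_loss),
                                        ("hinge", hinge_loss)]}
```

All four penalize small or negative margins; they differ in how quickly the penalty grows for badly misclassified samples, which matters for the weighting discussion in Section 3.2.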
3.2 Discriminative Sparse Coding
In this section, we describe how well-known and well-studied TSC algorithms (e.g. Orthogonal Matching Pursuit (OMP)) are used to update the sparse representations $\mathbf{X}$ from their previous estimates. This is done by solving the problem in Eq. (4), which we refer to as discriminative sparse coding (DSC). DSC requires the sparse code to not only reliably represent the data sample but also to be discriminable by the one-vs-all classifiers. Here, we denote $\mathbf{l}_i$ as the label vector of the $i$-th data element (i.e. the $i$-th column of $\mathbf{L}$). The superscripts are omitted from variables not being updated to facilitate readability. Here, we note that DSC, as defined here, is a generalization of the functional form used in [13].
(4) 
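Since DSC is built on top of a TSC solver, it may help to see what such a solver looks like. Below is a minimal, dependency-free sketch of plain Orthogonal Matching Pursuit with a fixed sparsity level. It is ordinary OMP, not the DSC procedure itself and not the Batch-OMP variant mentioned later; the helper names, the list-of-columns data layout, and the toy dictionary are assumptions made for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve(A, b):
    """Solve the small square system A c = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (M[i][n] - sum(M[i][j] * c[j] for j in range(i + 1, n))) / M[i][i]
    return c


def omp(atoms, y, sparsity):
    """Greedy sparse coding; atoms is a list of unit-norm dictionary columns."""
    support, coeffs = [], []
    residual = y[:]
    for _ in range(sparsity):
        # Select the unused atom most correlated with the current residual.
        k = max((j for j in range(len(atoms)) if j not in support),
                key=lambda j: abs(dot(atoms[j], residual)))
        support.append(k)
        # Least-squares refit on the selected atoms via the normal equations.
        G = [[dot(atoms[a], atoms[b]) for b in support] for a in support]
        rhs = [dot(atoms[a], y) for a in support]
        coeffs = solve(G, rhs)
        residual = [yi - sum(c * atoms[a][i] for c, a in zip(coeffs, support))
                    for i, yi in enumerate(y)]
    x = [0.0] * len(atoms)
    for a, c in zip(support, coeffs):
        x[a] = c
    return x


# Toy dictionary with four unit-norm atoms in three dimensions.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
y = [2.0, 0.0, -1.0]
code = omp(atoms, y, 2)
```

The sketch does not guard against linearly dependent atoms or an early zero residual; a production solver would add both checks.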
Solving Eq. (4):
The complexity of this solution depends on the nature of . However, it is easy to show that, by applying a projected Newton gradient descent method to Eq. (4), DSC can be formulated as a sequence of TSC problems, if is strictly convex. At each Newton iteration, a quadratic local approximation of the cost function is minimized. If we denote and as the first and second derivatives of respectively and , the quadratic approximation of around is . Since is a strictly positive function, we can complete the square to get . By replacing this approximation in Eq. (4), the objective function at the Newton iteration is: . In fact, this objective takes the form of a TSC problem and, thus, can be solved by any TSC algorithm. Here, is formed by the columnwise concatenation of and we define for . Also, we define the diagonal weight matrix , where weights the classifier. Based on this derivation, the same TSC algorithm (e.g. OMP) can be used to solve the DSC problem iteratively, as illustrated in Algorithm 1. The convergence of this algorithm is dependent on whether the TSC algorithm is capable of recovering the sparsest solution at each iteration. Although this is not guaranteed in general, the convergence of TSC algorithms to the sparsest solution has been shown to hold, when the solution is sparse enough even if the dictionary atoms are highly correlated [3, 27, 12, 4]. In our experiments, we see that the DSC objective is reduced sequentially and convergence is obtained in almost all cases. Furthermore, we provide a Stop Criterion (threshold on the relative change in solution) for the premature termination of Algorithm 1 to avoid needless computation.
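The completing-the-square step can be verified numerically. The sketch below takes the value and the first two derivatives of a strictly convex cost at a point `z0` and returns the weight, target, and constant of an equivalent weighted squared error. The function name `complete_the_square` and the choice of the exponential loss for the check are illustrative assumptions; the paper's exact scaling of these terms inside Eq. (4) is not reproduced.

```python
import math


def complete_the_square(f0, f1, f2, z0):
    """Rewrite f0 + f1*(z - z0) + 0.5*f2*(z - z0)**2
    as 0.5*weight*(z - target)**2 + const."""
    if f2 <= 0:
        raise ValueError("needs a strictly positive second derivative")
    target = z0 - f1 / f2
    const = f0 - f1 * f1 / (2.0 * f2)
    return f2, target, const


# Check with the exponential loss f(z) = exp(-z), expanded around z0 = 0.
z0 = 0.0
f0, f1, f2 = math.exp(-z0), -math.exp(-z0), math.exp(-z0)
weight, target, const = complete_the_square(f0, f1, f2, z0)

z = 0.3
taylor = f0 + f1 * (z - z0) + 0.5 * f2 * (z - z0) ** 2
surrogate = 0.5 * weight * (z - target) ** 2 + const
```

Because the surrogate is a weighted squared error in $z$, which is linear in the sparse code, each classifier contributes one extra weighted row to a least-squares problem. This is why the same TSC solver can be reused at every Newton iteration.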
Popular Forms of $f$:
Here, we focus on particular forms of $f$, namely the four functions in Section 3.1. Before proceeding, we need to replace the traditional hinge cost with a strictly convex approximation. We use the smooth hinge approximation introduced by [17], which can approximate the traditional hinge arbitrarily closely. As seen before, $f'$ and $f''$ are the only functions that play a role in the DSC solution. Only one iteration of Algorithm 1 is needed when the square cost is used, since it is already quadratic. For all other $f$, at each iteration of DSC, the impact of the $j$-th classifier on the overall cost (or equivalently on updating the sparse code) is determined by a weight that is influenced by two terms. (i) It is inversely proportional to $\lambda_j$. So, a classifier with a smaller mean training cost (i.e. higher training set discriminability) yields more impact on the solution. (ii) It is proportional to $f''$, the second derivative at the previous solution. In this case, the impact of the classifier is determined by the type of classification cost used. In Figure 1(b), we plot the relationship between $f''$ and $f$ for all four types. For the square and hinge functions, $f''$ and $f$ are independent; thus, a classifier yielding high sample discriminability (low $f$) is weighted the same as one yielding low discriminability. For the exponential case, the relationship is linear and positively correlated; thus, the lower a classifier’s sample discriminability is, the higher its weight. This implies that the sparse code will be updated to correct for classifiers that misclassified the training sample in the previous iteration. Clearly, this makes representation sensitive to samples that are “hard” to classify as well as outliers. This sensitivity is overcome when the logistic cost is used. Here, the relationship is positively correlated for moderate costs but negatively correlated for high costs. This is consistent with the theoretical argument that LogitBoost should outperform AdaBoost when training data is noisy or mislabeled.
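The contrast between the exponential and logistic costs can be reproduced with a few lines. The sketch below evaluates the cost and its second derivative at three margins, using the standard forms $e^{-z}$ and $\log(1 + e^{-z})$; these forms and the chosen margins are assumptions made for illustration, and the smooth hinge of [17] is left out because its expression is not given here.

```python
import math


def exponential(z):
    """Return (f(z), f''(z)) for f(z) = exp(-z); the two coincide."""
    cost = math.exp(-z)
    return cost, cost


def logistic(z):
    """Return (f(z), f''(z)) for f(z) = log(1 + exp(-z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return math.log1p(math.exp(-z)), s * (1.0 - s)


# Well classified, borderline, and badly misclassified margins.
margins = [2.0, 0.0, -4.0]
exp_pairs = [exponential(z) for z in margins]
log_pairs = [logistic(z) for z in margins]
```

For the exponential cost the curvature grows with the cost without bound, so a badly misclassified (possibly mislabeled) sample receives a very large weight. For the logistic cost the curvature peaks at the borderline margin and shrinks again as the cost becomes large, which limits the influence of such samples.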
3.3 Unsupervised Dictionary Learning
When the remaining variables are fixed, $\mathbf{D}$ can be updated by any unsupervised dictionary learning method. In our experiments, we use the K-SVD algorithm, since it avoids expensive matrix inversion operations required by other methods. Also, efficient versions of K-SVD have recently been developed [25]. By alternating between TSC and dictionary updates (SVD operations), K-SVD iteratively reduces the overall representation cost and generates a dictionary with normalized atoms and the corresponding sparse representations. In our case, the representations are known a priori, so only a single iteration of the K-SVD algorithm is required. For more details, we refer the readers to [1].
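To illustrate the dictionary update, the following sketch performs the K-SVD update of a single atom: it forms the residual of the samples that use the atom, with that atom's contribution removed, and replaces the atom and its coefficients by the best rank-one approximation of that residual. Power iteration stands in for a full SVD to keep the example dependency-free; that substitution, the helper names, the row-major matrix layout, and the toy data are all assumptions made for illustration.

```python
import math


def leading_singular_pair(E, iters=100):
    """Power iteration: return unit vector u and vector v with E ~= u v^T."""
    rows, cols = len(E), len(E[0])
    v = [1.0] * cols  # assumes this start is not orthogonal to the solution
    u = [0.0] * rows
    for _ in range(iters):
        u = [sum(E[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
        v = [sum(E[r][c] * u[r] for r in range(rows)) for c in range(cols)]
    return u, v


def ksvd_atom_update(Y, D, X, k):
    """Update atom k of D (M x K) and the nonzeros of row k of X (K x N) in place."""
    users = [i for i, coeff in enumerate(X[k]) if coeff != 0.0]
    if not users:
        return
    # Residual of the samples that use atom k, without atom k's own contribution.
    E = [[Y[r][i] - sum(D[r][j] * X[j][i] for j in range(len(X)) if j != k)
          for i in users] for r in range(len(Y))]
    u, v = leading_singular_pair(E)
    for r in range(len(D)):
        D[r][k] = u[r]
    for i, coeff in zip(users, v):
        X[k][i] = coeff


# Toy data: samples 0 and 2 are multiples of the direction (0.6, 0.8).
Y = [[3.0, 0.0, 6.0],
     [4.0, 2.0, 8.0]]
D = [[1.0, 0.0],
     [0.0, 1.0]]
X = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
ksvd_atom_update(Y, D, X, 0)
```

The support of each sparse code is left unchanged by this update; only the atom and the nonzero coefficients that multiply it are refit.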
3.4 Parameter Estimation and Initialization
The use of the Jeffreys prior for $\boldsymbol{\sigma}$ and $\boldsymbol{\lambda}$ yields simple update equations for both parameters. These variables estimate the sample representation variance and the mean/variance of the classification cost respectively. Since the overall update scheme is iterative, proper initialization is needed. In our experiments, we initialize $\mathbf{D}$ to a randomly selected subset of training samples (uniformly chosen from the different classes) or to random zero-mean Gaussian vectors, followed by column-wise normalization. Interestingly, both schemes produce similar dictionaries, although the randomized scheme requires more iterations for convergence. The representations $\mathbf{X}$ are computed by TSC using the initialized $\mathbf{D}$. Initializing the remaining variables uses the update schemes above. Algorithm 2 summarizes the overall DDL framework.
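As an illustration of how simple such updates can be, the sketch below derives closed-form estimates under explicit assumptions: isotropic Gaussian noise with per-sample standard deviation $\sigma_i$, exponential distributions parameterized by their mean $\lambda_j$, and Jeffreys priors $1/\sigma_i$ and $1/\lambda_j$. Setting the derivative of $(M+1)\log\sigma_i + \|\mathbf{r}_i\|_2^2/(2\sigma_i^2)$ to zero gives $\sigma_i^2 = \|\mathbf{r}_i\|_2^2/(M+1)$, and doing the same for $(N+1)\log\lambda_j + \sum_i c_{ij}/\lambda_j$ gives $\lambda_j = \sum_i c_{ij}/(N+1)$. These expressions and the function names are a derivation made here, not a quotation of the paper's update equations.

```python
import math


def update_sigma_sq(y, Dx):
    """Assumed MAP estimate of the noise variance of one sample."""
    M = len(y)
    residual_sq = sum((a - b) ** 2 for a, b in zip(y, Dx))
    return residual_sq / (M + 1)


def update_lambda(class_costs):
    """Assumed MAP estimate of the mean classification cost of one class."""
    N = len(class_costs)
    return sum(class_costs) / (N + 1)


def lambda_objective(lam, class_costs):
    """Terms of the negative log-posterior that involve one lambda_j."""
    return (len(class_costs) + 1) * math.log(lam) + sum(class_costs) / lam


# Hypothetical costs of one classifier over N = 3 training samples.
costs = [0.5, 1.0, 1.5]
lam_hat = update_lambda(costs)
sigma_sq_hat = update_sigma_sq([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
```

A quick sanity check is that `lambda_objective` is larger at nearby values than at `lam_hat`, which confirms that the closed form is a minimizer under the stated assumptions.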
3.5 Inference
After learning and , we describe how the label of a test sample is inferred. We seek the class that maximizes , where is the label vector of assuming it belongs to class . By marginalizing with respect to and assuming a single dominant representation exists, is the class that maximizes , as in Eq. (5). The inner maximization problem is exactly a DSC problem where is the hypothesized label vector. Here, we use the testing representation model to account for dense errors (e.g. occlusion), thus, augmenting by identity. Computing involves independent DSC problems. To reduce computational cost, we solve a single TSC problem instead: . In this case, .
(5) 
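The simplified inference rule can be sketched as follows. Once a single sparse code has been computed for the test sample, the representation term is the same for every hypothesized class, so only the classification costs need to be compared. The sketch assumes label vectors with $+1$ for the hypothesized class and $-1$ elsewhere, costs weighted by $1/\lambda_j$, and the exponential loss for the toy check; these choices and the name `infer_class` are assumptions made for illustration.

```python
import math


def infer_class(scores, lambdas, loss):
    """Pick the class whose hypothesized label vector has the lowest total cost.

    scores[j] is w_j^T x for the single sparse code x of the test sample.
    """
    best_class, best_cost = None, float("inf")
    for c in range(len(scores)):
        # Hypothesized label vector: +1 for class c, -1 for every other class.
        cost = sum(loss((1.0 if j == c else -1.0) * s) / lam
                   for j, (s, lam) in enumerate(zip(scores, lambdas)))
        if cost < best_cost:
            best_class, best_cost = c, cost
    return best_class


def exp_loss(z):
    return math.exp(-z)


# Hypothetical classifier responses for a three-class problem.
predicted = infer_class([-1.5, 2.0, -0.5], [1.0, 1.0, 1.0], exp_loss)
```

With equal $\lambda_j$ this reduces to choosing the classifier with the strongest positive response; unequal $\lambda_j$ give more say to classifiers that were more reliable on the training set.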
Implementation Details:
There are several ways to speed up computation and allow for quicker convergence. (i) The DSC update step is the most computationally expensive operation in Algorithm 2. This is mitigated by using a greedy TSC method (Batch-OMP instead of $\ell_1$ minimization methods) and exploiting the inherent parallelism of DDL (e.g. doing DSC updates in parallel). (ii) Selecting suitable initializations for $\mathbf{D}$ and the DSC solutions can dramatically speed up convergence. For example, choosing $\mathbf{D}$ from the training set leads to a smaller number of DDL iterations than choosing $\mathbf{D}$ randomly. Also, we initialize DSC solutions at a given DDL iteration with those from the previous iteration. Moreover, the DDL framework is easily extended to the semi-supervised case, where only a subset of training samples is labeled. The only modification to be made here is to use TSC (instead of DSC) to update the representations of unlabeled samples.
4 Experimental Results
In this section, we provide empirical analysis of our DDL framework when applied to handwritten digit classification () and face recognition (). Digit classification is a standard machine learning task with two popular benchmarks, the USPS and MNIST datasets. The digit samples in these two datasets have been acquired under different conditions or written using significantly different handwriting styles. To alleviate this problem, we use the alignment and error correction technique for TSC that was introduced in [26]. This corrects for gross errors that might occur (e.g. due to thickening of handwritten strokes or reasonable rotation/translation). Consequently, we do not need to augment the training set with shifted versions of the training images, as done in [18]. Furthermore, we apply DDL to face recognition, which is a machine vision problem where sparse representation has made a big impact. We use the Extended Yale B (EYALEB) benchmark for evaluation. To show that learning $\mathbf{D}$ in a discriminative fashion improves upon traditional dictionary learning, we compare our method against a baseline that treats representation and classification independently. In the baseline, $\mathbf{D}$ and $\mathbf{X}$ are estimated using K-SVD, $\mathbf{W}$ is learned using $\mathbf{X}$ and $\mathbf{L}$ directly, and a winner-take-all classification strategy is used. Our framework is general, so we do not expect to outperform methods that use domain-specific features and machinery. However, we do achieve results comparable to the state of the art. Also, we show that our DDL framework significantly outperforms the baseline. In all our experiments, we set and and initialize $\mathbf{D}$ to elements in the training set.
Digit Classification:
The USPS dataset comprises training and test images, each of pixels (). We plot the test error rates of the baseline for the four classifier types and for a range of and values in Figure 2. Beneath each plot, we indicate the values of and that yield minimum error. This is a common way of reporting SDL results [18, 19, 21, 22]. Interestingly, the square loss classifier leads to the lowest error and the best generalization. For comparison, we plot the results of our DDL method in Figure 3. Clearly, our method achieves a significant improvement of over the baseline, and and over the SDL methods in [19] and [18] respectively. Our results are comparable to the state-of-the-art performance () [16]. This result shows that adapting to the underlying data and class labels yields a dictionary that is better suited for classification. Increasing leads to an overall improvement of performance because representation becomes more reliable. However, we observe that beyond , this improvement is insignificant. The square loss classifier achieves the lowest performance and the logistic classifier achieves the highest. The variations of error with are similar for all the classifiers. Error steadily decreases until an “optimal” value is reached. Beyond this value, performance deteriorates due to overfitting. Future work will study how to automatically predict this optimal value from training data, without resorting to cross-validation.
In Figure 4, we plot the learned parameters $\boldsymbol{\sigma}$ (in histogram form) and $\boldsymbol{\lambda}$ for a typical DDL setup. We observe that the form of these plots does not significantly change when the training setting is changed. We notice that the histogram fits the form of the Jeffreys prior, $p(\sigma) \propto 1/\sigma$. Most of the $\sigma$ values are close to zero, which indicates reliable reconstruction of the data. On the other hand, the $\lambda$ parameters take on similar values for most classes, except the “0” digit class that contains a significant amount of variation and thus the highest classification cost. Note that these values tend to be inversely proportional to the classification performance of their corresponding linear classifiers. We provide a visualization of the learned $\mathbf{D}$ in the supplementary material. Interestingly, we observe that the dictionary atoms resemble digits in the training set and that the number of atoms that resemble a particular class is inversely proportional to the accuracy of that class’s binary classifier. This occurs because a “hard” class contains more intra-class variations, requiring more atoms for representation.
The MNIST dataset comprises training and test images, each of pixels (). We show the baseline and DDL test error rates in Table 1. We train each classifier type using the and values that achieved minimum error for that classifier on the USPS dataset. Compared to the baseline, we observe an improvement in performance similar to that in the USPS case. Also, our results are comparable to state-of-the-art performance () for this dataset [14].
Face Recognition:
The EYALEB dataset comprises images of individuals, each of pixels, which we downsample by an order of (). Using a classification setup similar to [29] with and , we record the classification results in Table 1, which lead to implications similar to those in our previous experiments. Interestingly, DDL achieves similar results to the robust sparse representation method of [29], which uses all training samples () as atoms in . This shows that learning a discriminative can reduce the dictionary size by as much as , without significant loss in performance.
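Table 1: Baseline and DDL classification results on the MNIST and EYALEB datasets for the four classifier types (SQ, EXP, LOG, HINGE).
            MNIST (digit classification)      EYALEB (face recognition)
            SQ     EXP    LOG    HINGE        SQ     EXP    LOG    HINGE
BASELINE
DDL         1.41   1.28   1.01   0.72         8.89   7.82   7.57   7.30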
5 Conclusions
This paper addresses the problem of discriminative dictionary learning by jointly learning a sparse linear representation model and a linear classification model in a MAP setting. We develop an optimization framework that is capable of incorporating a diverse family of popular classification cost functions and solvable by a sequence of update operations that build on well-known and well-studied methods in sparse representation and dictionary learning. Experiments on standard datasets show that this framework outperforms the baseline and achieves state-of-the-art performance.
References
 [1] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representations. In IEEE Transactions on Signal Processing, volume 54, 2006.
 [2] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. ArXiv e-prints, 2008.
 [3] M. Davenport and M. Wakin. Analysis of Orthogonal Matching Pursuit using the restricted isometry property. IEEE Transactions on Information Theory, 56(9):4395–4401, 2010.
 [4] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell^1$ minimization. Proc. of the National Academy of Sciences, 100(5):2197–202, 2003.
 [5] N. Duffy and D. Helmbold. Boosting methods for regression. Journal of Machine Learning Research, 47(2):153–200, 2002.
 [6] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–45, 2006.
 [7] M. Elad, M. Figueiredo, and Y. Ma. On the Role of Sparse and Redundant Representations in Image Processing. Proceedings of the IEEE, 98(6):972–982, 2010.
 [8] K. Engan, S. Aase, and J. Husoy. Frame based signal compression using method of optimal directions (mod). In IEEE Intern. Symp. Circ. Syst., 1999.
 [9] M. Figueiredo. Adaptive Sparseness using Jeffreys’ Prior. NIPS, 1:697–704, 2002.

 [10] J. Friedman, R. Tibshirani, and T. Hastie. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
 [11] R. Giryes, M. Elad, and Y. Eldar. Automatic parameter setting for iterative shrinkage methods. In IEEE Convention of Electrical and Electronics Engineers in Israel, pages 820–824, 2009.
 [12] R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12):3320–3325, 2004.
 [13] K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, pages 609–616, 2006.
 [14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multistage architecture for object recognition? ICCV, pages 2146–2153, 2009.
 [15] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In ICML, 2010.
 [16] D. Keysers, J. Dahmen, T. Theiner, and H. Ney. Experiments with an extended tangent distance. ICPR, 1(2):38–42, 2000.
 [17] N. Loeff and A. Farhadi. Scene discovery by matrix factorization. ECCV, pages 451–464, 2008.
 [18] J. Mairal, F. Bach, and J. Ponce. Task-Driven Dictionary Learning. ArXiv e-prints, Sept. 2010.
 [19] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS, 2008.
 [20] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. ICML, pages 1–8, 2009.
 [21] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR, 2008.
 [22] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative sparse image models for classspecific edge detection and image interpretation. ECCV, pages 43–56, 2008.
 [23] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241, 2008.
 [24] R. Mazhar and P. Gader. EK-SVD: Optimized dictionary design for sparse representations. In ICPR, pages 1–4, 2008.
 [25] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. CS Technion Technical Report, pages 1–15, 2008.
 [26] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, and Y. Ma. Towards a practical face recognition system: Robust registration and illumination by sparse representation. In CVPR, pages 597 –604, 2009.
 [27] J. Wright and Y. Ma. Dense error correction via $\ell^1$-minimization. In IEEE Transactions on Information Theory, number 2, pages 3033–3036, 2010.

 [28] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse Representation for Computer Vision and Pattern Recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.
 [29] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 31(2):210–27, 2009.
 [30] H. Zou. The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association, 101:1418–1429, 2006.