1 Introduction
Sparse representation has been successfully applied to various image processing and computer vision problems, such as image denoising and image restoration. Dictionary learning is one way of obtaining sparse representations for signals with no precisely known model. The resulting sparse representation as a linear combination of atoms varies according to the type of dictionary learning technique: Synthesis Dictionary Learning (SDL) or Analysis Dictionary Learning (ADL).
In contrast to SDL, which assumes that the signal of interest can be recovered by a dictionary with corresponding sparse coefficients, ADL is based on applying the dictionary to the data to yield sparse coefficients.
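The contrast between the two models can be made concrete with a small numerical sketch (illustrative numpy code; the dimensions, the dictionaries `D` and `Omega`, and the signal are all synthetic assumptions, not values from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthesis model: a signal x is rebuilt FROM a dictionary D and a
# sparse code u, i.e. x = D @ u (the dictionary "synthesizes" x).
D = rng.standard_normal((20, 50))            # overcomplete synthesis dictionary
u_syn = np.zeros(50)
u_syn[[3, 17, 42]] = rng.standard_normal(3)  # only 3 active atoms
x = D @ u_syn                                # signal generated by the model

# Analysis model: the dictionary is APPLIED TO the signal, and it is the
# product Omega @ x that is required to be sparse (the co-sparse model).
Omega = rng.standard_normal((50, 20))        # analysis dictionary (operator)
u_ana = Omega @ x                            # analysis coefficients
# In ADL, Omega itself is learned so that u_ana has many near-zero entries.
```

Here a random `Omega` of course yields dense coefficients; learning `Omega` is precisely what drives them toward sparsity.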
Due to the success of dictionary learning in image restoration problems, task-driven dictionary learning methods are of great interest in many inference problems, such as image classification. There are broadly two strategies for task-driven dictionary learning. The first is to learn multiple class-specific sub-dictionaries to make the dictionary more structured and to increase the overall discrimination between classes [1, 2, 3, 4]. In this structured setting, the atoms in the dictionary are associated with their own class labels, and the class label of a new image is decided by comparing the reconstruction errors obtained from the different classes. The second strategy is to learn a single dictionary shared by all classes and to jointly learn a universal classifier that enforces more discriminative sparse representations [5, 6]. All of the above techniques have been developed and implemented in the SDL framework, while ADL has only recently received increasing attention [7]. To the best of our knowledge, none of the standard ADL algorithms, such as analysis K-SVD [8] or the Sparse Null Space (SNS) pursuit [9], has addressed the task-driven ADL problem. Shekhar et al. [10] applied ADL together with an SVM to digit and face recognition, and demonstrated that ADL is more stable under noise and occlusion while remaining competitive with SDL. Guo et al. [11] integrated local topological structures and discriminative sparse labels into ADL to obtain a nearest-neighbor image classifier.
Inspired by these past efforts and by the efficient encoding of ADL, we propose to integrate structured subspace regularization and supervised learning into an ADL model, yielding a more structured, discriminative, and efficient approach to image classification. It has been shown, for example in the context of sparse subspace clustering [12], that the sparse representations of the data within a class share a low-dimensional subspace. We therefore introduce a structuring block-diagonal matrix to achieve these localized subspaces of the sparse codes. This yields more coherent within-class sparse representations and more disparate between-class representations. To induce additional robustness in the sought sparse representation, a one-against-all regression-based classifier is jointly learned, and the resulting optimization functional is solved by a linearized alternating direction method (ADM) [13]. This approach is computationally more efficient than analysis K-SVD [8] and SNS pursuit [9]. Moreover, a great advantage of our algorithm is its extremely short online encoding and classification time. Our experiments demonstrate that our method achieves a better overall performance than synthesis dictionary approaches.
The balance of this paper is organized as follows: In Section 2, we state and formulate the problem. We discuss the resulting solution to the optimization problem in Section 3. The experimental validation and results are comprehensively presented in Section 4. We finally provide some concluding comments in Section 5.
2 Structured Analysis Dictionary Learning
Notation: Uppercase and lowercase letters respectively denote matrices and vectors. The transpose and inverse of a matrix are denoted by the superscripts T and -1, as in M^T and M^{-1}. M_{ij} represents the i-th element in the j-th column of the matrix M.
2.1 ADL Formulation
Given a data matrix X ∈ R^{n×m}, the originally formulated ADL [8] problem seeks a representation frame Ω ∈ R^{p×n} (the analysis dictionary) together with a sparse coefficient set U ∈ R^{p×m}:

min_{Ω,U} ‖ΩX − U‖_F² + λ‖U‖₁  s.t.  Ω ∈ 𝒲,  (1)

where λ > 0 and 𝒲 is a nontrivial solution set.
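As a remark on tractability: for a fixed analysis dictionary Ω, the U-subproblem of this ℓ1-penalized form decouples entrywise and is solved in closed form by soft-thresholding. A minimal numpy sketch (the function names and the λ/2 threshold assume the formulation above):

```python
import numpy as np

def soft_threshold(A, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def update_codes(Omega, X, lam):
    """For fixed Omega, the subproblem
       min_U ||Omega X - U||_F^2 + lam * ||U||_1
    has the closed-form solution U = soft_threshold(Omega X, lam / 2)."""
    return soft_threshold(Omega @ X, lam / 2.0)
```

This closed-form step is one reason the online encoding cost of analysis models is so low: encoding a new sample is a matrix product followed by a thresholding.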
2.2 Mitigating Inter-Class Feature Interference
The basic idea of our algorithm is to employ the representation U to obtain a classifier. To reduce the impact of inter-class common atoms on the discriminative power of ADL, we propose two additional constraints on U: (1) a structural map of U to minimize the interference of inter-class common features, and (2) a minimization of the classification error.
(1) Structural Mapping of U: This constraint is enforced by imposing that each class belongs to a subspace defined by the span of the associated coefficients. This improves the consistency of the analysis representations within a class and enhances the divergence between different classes. A block-diagonal matrix Q ∈ R^{s×m} is hence introduced in the training phase, where s is the length of the structured representation. Each diagonal block represents a class, and each column of Q is the structured representation for the corresponding data point of that class. This constraint may be relaxed by an error term, to be jointly minimized with the ADL functional,
‖Q − GU‖_F² ≤ ε,  (2)

where G ∈ R^{s×p} is a matrix to be learned, with Q ∈ R^{s×m} and U ∈ R^{p×m}, and ε is the tolerance.
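The block-diagonal target Q can be built directly from the training labels. The sketch below assumes each diagonal block is filled with ones and has a fixed per-class size; the paper only prescribes the block-diagonal structure, so the block contents and the helper name are illustrative:

```python
import numpy as np

def build_block_diagonal_Q(labels, block_size):
    """Build the structuring target Q with one diagonal block per class.
    Column i of Q is the structured representation assigned to sample i;
    it is supported only on the block of rows owned by sample i's class."""
    labels = np.asarray(labels)
    n_classes = int(labels.max()) + 1
    Q = np.zeros((n_classes * block_size, labels.size))
    for i, c in enumerate(labels):
        Q[c * block_size:(c + 1) * block_size, i] = 1.0
    return Q
```

Columns of samples from the same class then share a common support block, which is what induces the localized per-class subspaces of the sparse codes.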
(2) Minimal Classification Error: The second constraint is a classification error used as a feedback term in the learning of Ω and U. A regression-based classifier is applied to the structured representations in this term. We write it as

‖H − WGU‖_F² ≤ δ,  (3)

where δ is also a tolerance, W ∈ R^{C×s} is the classifier, and H ∈ R^{C×m} is the label matrix, with C denoting the number of classes. H_{ci} = 1 if image i belongs to class c; otherwise, H_{ci} = 0.
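In the notation assumed above, the label matrix H and the resulting one-against-all decision rule can be sketched as follows (W is the linear classifier, G the structuring transformation, U the analysis codes):

```python
import numpy as np

def build_label_matrix(labels, n_classes):
    """Label matrix H with H[c, i] = 1 if image i belongs to class c, else 0."""
    labels = np.asarray(labels)
    H = np.zeros((n_classes, labels.size))
    H[labels, np.arange(labels.size)] = 1.0
    return H

def classify(W, G, U):
    """One-against-all regression rule: score every class for each column
    of U and pick the row of W @ (G @ U) with the largest response."""
    return np.argmax(W @ (G @ U), axis=0)
```

At test time this rule costs only two matrix products and an argmax per batch, which is consistent with the short online classification time claimed for the method.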
2.3 Structured ADL Formulation
To ensure that the structure of each image class is preserved with minimal interference between classes, the tolerance errors must also be minimized. Using Eqs. (2) and (3) together with the minimization of the tolerance errors, the resulting formulation of our structured ADL is written as

min_{Ω,U,G,W} ‖ΩX − U‖_F² + λ‖U‖₁ + λ₁‖Q − GU‖_F² + λ₂‖H − WGU‖_F²  s.t.  ‖ω_i‖₂ = 1, i = 1, …, p,  (4)

where ω_i is the i-th row of Ω, and λ₁ and λ₂ are the penalty coefficients. Recall that GU is the structured representation, G is the structuring transformation, H is the classifier label matrix, W is the linear classifier, and λ is a tuning parameter.
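For reference, the objective of Eq. (4) can be evaluated directly in the notation assumed here (the row-normalization constraint on Ω is handled separately):

```python
import numpy as np

def sadl_objective(Omega, U, G, W, X, Q, H, lam, lam1, lam2):
    """Value of the structured-ADL objective of Eq. (4):
    data fit + l1 sparsity + structuring error + classification error."""
    return (np.linalg.norm(Omega @ X - U, 'fro') ** 2
            + lam * np.abs(U).sum()
            + lam1 * np.linalg.norm(Q - G @ U, 'fro') ** 2
            + lam2 * np.linalg.norm(H - W @ (G @ U), 'fro') ** 2)
```

Such a function is useful mainly for monitoring convergence of the alternating solver across iterations.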
3 Algorithmic Solution
The objective function in Eq. (4), on account of its non-convexity, is transformed into an augmented Lagrangian formulation: introducing the auxiliary variable V = GU with the associated dual variable Y, and after straightforward calculations that lead to the elimination of the intermediate variables, we obtain the following expression for this function:

L(Ω, U, V, G, W; Y) = ‖ΩX − U‖_F² + λ‖U‖₁ + λ₁‖Q − V‖_F² + λ₂‖H − WV‖_F² + ⟨Y, V − GU⟩ + (μ/2)‖V − GU‖_F²,  (5)
where λ₁, λ₂, and μ are the tuning parameters. Then, to minimize the objective functional in Eq. (5), we first randomly initialize the analysis dictionary Ω and the two linear transformations G and W. The sparse representation U is initialized to the zero matrix.
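The alternating minimization can be illustrated with a simplified gradient-based sketch in the assumed notation. This is a plain proximal-gradient alternation for exposition only, not the paper's linearized ADM; the single step size `eta` and the initialization scales are assumptions:

```python
import numpy as np

def soft(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sadl_train(X, Q, H, p, lam, lam1, lam2, eta=1e-3, n_iter=200, seed=0):
    """Simplified alternating scheme in the spirit of Algorithm 1:
    proximal update of U, gradient steps on Omega, G, W, with row
    renormalization of Omega enforcing the unit-norm row constraint."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    s, C = Q.shape[0], H.shape[0]
    Omega = rng.standard_normal((p, n)) / np.sqrt(n)
    G = rng.standard_normal((s, p)) / np.sqrt(p)
    W = rng.standard_normal((C, s)) / np.sqrt(s)
    U = np.zeros((p, m))
    for _ in range(n_iter):
        # U-step: gradient on the smooth terms, then the l1 prox.
        grad_U = (2 * (U - Omega @ X)
                  + 2 * lam1 * G.T @ (G @ U - Q)
                  + 2 * lam2 * G.T @ W.T @ (W @ G @ U - H))
        U = soft(U - eta * grad_U, eta * lam)
        # Omega-step: gradient on the data-fit term, then row normalization.
        Omega -= eta * 2 * (Omega @ X - U) @ X.T
        Omega /= np.linalg.norm(Omega, axis=1, keepdims=True)
        # G- and W-steps: gradients of the structuring and classifier terms.
        G -= eta * (2 * lam1 * (G @ U - Q) @ U.T
                    + 2 * lam2 * W.T @ (W @ G @ U - H) @ U.T)
        W -= eta * 2 * lam2 * (W @ G @ U - H) @ (G @ U).T
    return Omega, U, G, W
```

The linearized ADM of [13] replaces these plain gradient steps with linearized proximal updates and dual ascent on Y, which is what yields the efficiency reported below.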
The step sizes η₁, η₂, and η₃ are the parameters for the learning rate. We then alternately update each variable while fixing the others, as summarized in Algorithm 1.
4 Experiments and Results
We evaluate our proposed SADL method on four popular visual classification datasets which have been widely used in previous work and have known performance benchmarks: the Extended YaleB [14] face dataset, the AR [15] face dataset, the Caltech101 [16] object categorization dataset, and the Scene15 [17] scene image dataset. The features of these four datasets are extracted with the same settings as in [6].
In our experiments, we provide a comparative evaluation of our proposed technique against three state-of-the-art techniques, reporting classification accuracy as well as training and testing times. The testing time is defined as the average processing time to classify a single image. For a fair comparison, we measure the performance of all algorithms using the same dictionary size on each dataset, and average over 10 realizations. Among the competing methods, ADL+SVM [10] is a baseline, SRC [1] is the classical Sparse Representation based Classification, and LC-KSVD [6] is an SDL approach that jointly learns a discriminative dictionary and a universal classifier. In our tables, the accuracy in parentheses with a citation is the one reported in the original paper; the difference between our implementation's accuracy and the reported one may be caused by different partitions of the training and testing samples.
4.1 Face Recognition
Extended YaleB: This face dataset contains in total 2414 frontal face images of 38 persons under various illumination and expression conditions, as illustrated in Fig. 1. Each Extended YaleB face image is represented by a fixed-length feature vector. We randomly choose half of the images for training, and the rest for testing. The dictionary size is set to 570 atoms.
Table 1: Classification results on the Extended YaleB dataset.

Methods        Accuracy (%)     Training (s)  Testing (s)
ADL+SVM [10]   82.91            91.78         1.13
SRC [1]        80.50            no need       3.74
LC-KSVD [6]    94.56 (95 [6])   234.67        1.63
SADL           94.91            51.29         2.72
The classification results, training and testing times are summarized in Table 1. Our proposed SADL method achieves the highest classification accuracy in our tests, only marginally lower than the accuracy reported for LC-KSVD in [6]. It is, moreover, substantially more efficient than the others in terms of numerical complexity and classification time.
For a more thorough evaluation, we compare SADL with LC-KSVD for different dictionary sizes, and display the classification accuracy in Fig. 2. We ran our experiments for dictionary sizes of 32, 128, 224, 320, 416, 512, 608, 704, 800, 896, 992, and 1216 (the full training size). SADL exhibits a more stable performance than LC-KSVD. In particular, the accuracy of LC-KSVD decreases significantly when the dictionary size approaches the full training sample size, while our method achieves a markedly higher classification accuracy than LC-KSVD when the dictionary size is small. The significant decrease in accuracy may be caused by trivial solutions of the dictionary in SDL.
AR: The AR face dataset has 2600 color images of 50 females and 50 males with more facial variations than the Extended YaleB database, such as different illumination conditions, expressions, and facial disguises, as shown in Fig. 1. Each person has about 26 images. The AR face feature dimension is 540. For each person, 20 images are randomly selected as the training set and the other 6 images for testing. The dictionary size of the AR dataset is set to 500 atoms.
The classification performances are summarized in Table 2. Our proposed SADL achieves a higher classification accuracy than the others, and is about 10,000 times faster than SRC and LC-KSVD in the testing phase.
4.2 Object Recognition
The Caltech101 dataset has 101 categories of different objects and 1 non-object category. Most categories contain around 50 images; Fig. 3 gives some examples. The standard bag-of-words + spatial pyramid matching (SPM) framework [17] is used to compute the SPM features, and PCA is then adopted to reduce the dimension of each SPM feature to 3000. The dictionary size is set to 510.
Table 3: Classification results on the Caltech101 dataset.

Methods        Accuracy (%)  Training (s)  Testing (s)
ADL+SVM [10]   54.93         447.80        7.75
SRC [1]        67.70         no need       4.34
LC-KSVD [6]    71.79         487.61        1.35
SADL           72.36         773.66        8.10
We evaluate all methods with a dictionary size of 510. The classification performances are summarized in Table 3. Our proposed SADL again achieves the highest accuracy of the lot, and again has a short testing time, around 10,000 times faster than LC-KSVD.
4.3 Scene Classification
The Scene15 dataset contains a total of 15 categories of different scenes, each with around 200 images; examples are shown in Fig. 4. Proceeding as for the Caltech101 dataset, we compute the SPM features for the scene images, and each scene image is reduced to a 3000-dimensional feature by PCA. We randomly pick 100 images per class as training data, and use the rest of the images as testing data. The settings and steps follow [6]. The dictionary size is set to 450.
Table 4: Classification results on the Scene15 dataset.

Methods        Accuracy (%)      Training (s)  Testing (s)
ADL+SVM [10]   49.35             110.47        1.14
SRC [1]        91.80             no need       4.06
LC-KSVD [6]    98.83 (92.9 [6])  270.93        1.26
SADL           98.16             121.02        9.23
The classification performances are summarized in Table 4. Our accuracy is slightly lower than that of LC-KSVD, but higher than SRC, ADL+SVM, and the accuracy originally reported for LC-KSVD. In the testing phase, however, our method is superior to the others; its testing time is about 10,000 times shorter than that of LC-KSVD.
5 Conclusion
We proposed an image classification method referred to as structured analysis dictionary learning (SADL). To obtain SADL, we incorporated a structured subspace (cluster) model into an enhanced ADL framework, where each class is represented by a structured subspace. The enhancement of ADL was realized by constraining the learning with a classification fidelity term on the sparse coefficients. Our formulated optimization problem was efficiently solved by the linearized ADM method, in spite of its non-convexity due to bilinearity. Taking advantage of the analysis dictionary, our method achieves a significantly faster testing time.
References
[1] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[2] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3501–3508.
[3] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang, "Fisher discrimination dictionary learning for sparse representation," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 543–550.
[4] Zhaowen Wang, Jianchao Yang, Nasser Nasrabadi, and Thomas Huang, "A max-margin perspective on sparse representation-based classification," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1217–1224.
[5] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R. Bach, "Supervised dictionary learning," in Advances in Neural Information Processing Systems, 2009, pp. 1033–1040.
[6] Zhuolin Jiang, Zhe Lin, and Larry S. Davis, "Label consistent K-SVD: Learning a discriminative dictionary for recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2651–2664, 2013.
[7] Sangnam Nam, Mike E. Davies, Michael Elad, and Rémi Gribonval, "The cosparse analysis model and algorithms," Applied and Computational Harmonic Analysis, vol. 34, no. 1, pp. 30–56, 2013.
[8] Ron Rubinstein, Tomer Peleg, and Michael Elad, "Analysis K-SVD: A dictionary-learning algorithm for the analysis sparse model," IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 661–677, 2013.
[9] Xiao Bian, Hamid Krim, Alex Bronstein, and Liyi Dai, "Sparsity and nullity: Paradigms for analysis dictionary learning," SIAM Journal on Imaging Sciences, vol. 9, no. 3, pp. 1107–1126, 2016.
[10] Sumit Shekhar, Vishal M. Patel, and Rama Chellappa, "Analysis sparse coding models for image-based classification," in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 5207–5211.
[11] Jun Guo, Yanqing Guo, Xiangwei Kong, Man Zhang, and Ran He, "Discriminative analysis dictionary learning," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[12] Ehsan Elhamifar and René Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[13] Zhouchen Lin, Risheng Liu, and Zhixun Su, "Linearized alternating direction method with adaptive penalty for low-rank representation," in Advances in Neural Information Processing Systems, 2011, pp. 612–620.
[14] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[15] A. M. Martinez and R. Benavente, "The AR face database," CVC Technical Report, no. 24, June 1998.
[16] Li Fei-Fei, Rob Fergus, and Pietro Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
[17] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE, 2006, vol. 2, pp. 2169–2178.
[18] Wen Tang, Ives Rey Otero, Hamid Krim, and Liyi Dai, "Analysis dictionary learning for scene classification," in Statistical Signal Processing Workshop (SSP), 2016 IEEE. IEEE, 2016, pp. 1–5.
[19] Shahin Mahdizadehaghdam, Liyi Dai, Hamid Krim, Erik Skau, and Han Wang, "Image classification: A hierarchical dictionary learning approach," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2597–2601.