Analysis Dictionary Learning based Classification: Structure for Robustness

by   Wen Tang, et al.
NC State University

A discriminative structured analysis dictionary is proposed for the classification task. A union-of-subspaces (UoS) structure is integrated into conventional analysis dictionary learning to enhance its discriminative capability. A simple classifier is simultaneously included in the formulated functional to ensure a more consistent classification. The solution is efficiently obtained by the linearized alternating direction method of multipliers. Moreover, a distributed structured analysis dictionary learning method is presented to address large-scale datasets. It trains the structured analysis dictionaries independently per group (class) on different machines, cores or threads, and therefore avoids a high computational cost. A consensus structured analysis dictionary and a global classifier are jointly learned in the distributed approach to safeguard the discriminative power and the efficiency of classification. Experiments demonstrate that our method achieves comparable or better performance than state-of-the-art algorithms in a variety of visual classification tasks. In addition, the training and testing computational complexity are greatly reduced.



I Introduction

Sparse representation has had great success in dealing with various problems in image processing and computer vision, such as image denoising and image restoration. To obtain such sparse representations with an unknown precise model, Dictionary Learning is one choice because it results in a linear combination of sparse dictionary atoms. There are two different types of dictionary learning methods: Synthesis Dictionary Learning (SDL) and Analysis Dictionary Learning (ADL).

In recent years, SDL has been widely studied [1, 2, 3], while ADL has received little attention. SDL supposes that a signal lies in a sparse latent subspace and can be recovered by an associated dictionary. The local structures of the signal are well preserved in the optimal synthesis dictionary [4, 5, 6]. In contrast, ADL assumes that a signal can be transformed into a latent sparse subspace by its corresponding dictionary. In other words, ADL produces a sparse representation by applying the dictionary as a transform to a signal. The atoms in an analysis dictionary can be interpreted as local filters, as first mentioned in [7]. When the dictionary is known, sparse representations can be obtained simply by an inner product operation. Such fast coding makes ADL more attractive than SDL in applications. The contrast between SDL and ADL is shown in Fig. 1.

Fig. 1: SDL reconstructs the data from the dictionary and the sparse representations. ADL applies the dictionary to the data and results in the sparse representations. The sparsity penalty can be either the $\ell_0$ norm or the $\ell_1$ norm. SDL and ADL are equivalent to each other if and only if both dictionaries are square matrices.

The success of dictionary learning in image processing problems has spurred much interest in task-driven dictionary learning methods for inference applications, such as image classification. The task of classification aims to assign the correct label to an observed image, which requires a much more discriminative capacity of either the dictionary or the sparse representation. To address this issue, supervised learning is often invoked when using single-view SDL [8] so as to maximize the distances between the sparse representations of any two distinct classes. In addition, multi-view dictionary learning methods [9, 10, 11, 12] were developed to include more information about each class.

Supervised single-view dictionary learning methods generally follow one of two strategies. The first strategy is to learn multiple dictionaries, or class-specific dictionaries, for different classes [13, 14, 15, 16]. The advantage of learning multiple dictionaries is that these dictionaries characterize specific patterns and structures of each class and enhance the distances between different classes. The minimum reconstruction errors of the various dictionaries are subsequently used to assign labels to new incoming images. In [14], Ramirez et al. learned class-specific dictionaries with a penalty on the common atoms. Yang et al. [15] then learned class-specific dictionaries and jointly applied a Fisher criterion to the associated sparse representations, thereby enhancing the distances between classes. A large-margin method was proposed in [16] to increase the divergence of sparse representations for the class-specific dictionaries. However, as the number of classes increases, training class-specific dictionaries while regularizing the distances between dictionaries becomes too complex and time consuming. Even though a distributed cluster could reduce the time complexity of training dictionaries, it is difficult for a distributed algorithm to communicate with each independent cluster and to accommodate the other regularizations for class-specific dictionary learning.

Another strategy is to learn a shared dictionary for all classes together with a universal classifier [8, 17]. Such joint dictionary learning enforces more discriminative sparse representations. Compared with class-specific dictionary learning, this strategy makes it simpler to learn the dictionary and classifier, and easier to test unknown images. In [8], Mairal et al. integrated a linear classifier into a sparse representation during the dictionary learning phase. Jiang et al. then included a linear classifier and a label-consistent regularization term to enforce more consistent sparse representations in each class [17]. When large data sets are on hand, memory and computational limitations emerge, and online learning or distributed solutions are required as a viable strategy.

Although the techniques mentioned above are all based on SDL, ADL has gradually received more attention [18]. Building on the seminal work of Rubinstein et al. [19], which proposed analysis K-SVD to learn an analysis dictionary, Li et al. [20] learned an analysis dictionary with an additional inner product term of sparse coefficients to increase its discriminative power. In addition, reducing the computational complexity has been addressed in recent methods. Zhang et al. [21] used a Recursive Least Squares method to accelerate dictionary learning by updating a dictionary based on the dictionaries of the previous iterations. Li et al. [22] and Dong et al. [23] proposed Simultaneous codeword optimization (SimCo) related algorithms, which update multiple atoms in each iteration and add an extra incoherence regularization term to avoid linear dependency among dictionary atoms. On the other hand, Li et al. [24, 25] used norm instead of norm to have stronger sparsity and mathematically guaranteed a strong convergence. Bian et al. [26] proposed the Sparse Null Space (SNS) pursuit to search for the optimal analysis dictionary.

However, all of these methods are proposed for the original problem of learning an analysis dictionary. To the best of our knowledge, few attempts have been made at task-driven ADL. For example, Shekhar et al. [27] learned an analysis dictionary and subsequently trained an SVM for digit and face recognition tasks. Their results demonstrate that ADL is more stable than SDL under noise and occlusion, and achieves a competitive performance. Guo et al. [28] integrated local topological structures and discriminative sparse labels into ADL and separately classified images by a Nearest Neighbor classifier. Instead of performing ADL and SVM in separate steps, Wang [29] alternately optimized ADL and SVM together to classify different patterns. In [30], Wang et al. used a K-SVD based technique to solve a joint learning of ADL and a linear classifier. Additionally, a hybrid design based on both SDL and ADL is considered in [31] and [32]. A multi-view ADL was proposed in [33], which separately learns analysis dictionaries for different views and a marginalized classifier for fusing the semantic information of each view.

Inspired by these past works, and taking advantage of efficient coding by ADL, we propose a supervised ADL with a shared dictionary and a universal classifier. In addition to the classifier, a structured subspace regularization is included in the ADL model to obtain a more structured, discriminative and efficient approach to image classification. We refer to this approach as Structured Analysis Dictionary Learning (SADL). Since Sparse Subspace Clustering [34] has shown that visual data in a class or category can be well captured and localized by a low dimensional subspace, and the sparse representations of the data within a class similarly share a low dimensional subspace, a structured representation is introduced to achieve a distinct representation of each class. This achieves more coherence for within-class sparse representations and more disparity for between-class representations. When sorted by class, these representations can, as shown later, be viewed as a block-diagonal matrix. For robustness of the sought sparse representations, we simultaneously learn a one-against-all regression-based classifier. The resulting optimization function is solved by a linearized alternating direction method (ADM) [35]. This approach leads to a more computationally efficient solution than that of analysis K-SVD [19] and of SNS pursuit [26]. Additionally, a great advantage of our algorithm is its extremely short online encoding and classification time for an incoming observed image: in contrast to the SDL encoding procedure, ADL obtains a sparse representation by a simple matrix multiplication of the learned dictionary and the testing data. Experiments demonstrate that our method achieves an overall better performance than the synthesis dictionary approach.
Good accuracy is achieved in scene and object classification with a simple classifier, and the best performance on facial recognition problems is approached at a remarkably low computational complexity. Moreover, experiments also show that our approach has a more stable performance than that of SDL. Even when the dictionary size is reduced to lower the memory demand, our performance is still outstanding. To address large datasets, a distributed structured analysis dictionary learning algorithm is also developed, which preserves the same properties as those of structured analysis dictionary learning (SADL). Experiments also show that when the dataset is sufficiently large, the distributed algorithm achieves as high a performance as SADL.

Our main contributions are as follows:

  • Both a structured representation and a classification error regularization term are introduced to the conventional ADL formulation to improve classification results. A multiclass classifier and an analysis dictionary are jointly learned.

  • The optimal solution provided by the linearized ADM is significantly faster than other existing techniques for non-convex and non-smooth optimization.

  • An extremely short classification time is offered by our algorithms, as they entail encoding by a mere matrix multiplication for a simple classification procedure.

  • A distributed structured analysis dictionary learning algorithm is also presented.

The balance of this paper is organized as follows: we state and formulate the problem of SADL and its distributed form in Section II. The resulting solutions to the optimization problems, along with the classification procedure, are described in Sections III and IV. In Section V, we analyze the convergence and complexity of our methods. The comprehensive experimental validation and results are then presented in Section VI. Some comments and future works are finally provided in Section VII.

II Structured Analysis Dictionary Learning

II-A Notation

Uppercase and lowercase letters respectively denote matrices and vectors throughout the paper. The transpose and inverse of matrices are represented by the superscripts $T$ and $-1$, such as $A^T$ and $A^{-1}$. The identity matrix and all-zero matrix are respectively denoted as $I$ and $0$. $A_{ij}$ represents the $i$-th element in the $j$-th column of matrix $A$.

II-B Structured ADL Method

II-B1 ADL Formulation

The conventional ADL problem [19] aims at obtaining a representation frame with a sparse coefficient set based on the data matrix .


where and is a large class of non-trivial solutions.
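For orientation, one common way to write an analysis dictionary learning problem of this kind is sketched below. The notation is assumed here purely for illustration: $\Omega$ for the analysis dictionary, $X$ for the data matrix, $U$ for the sparse coefficients, $\lambda$ for a sparsity weight, and $\mathcal{W}$ for a constraint set that excludes trivial solutions such as $\Omega = 0$. The exact objective and constraint set of the formulation above may differ.

```latex
\min_{\Omega,\,U} \; \frac{1}{2}\,\lVert U - \Omega X \rVert_F^2 + \lambda\,\lVert U \rVert_1
\quad \text{s.t.} \quad \Omega \in \mathcal{W}
```

In this form the first term asks the transformed data $\Omega X$ to stay close to the coefficients $U$, while the $\ell_1$ term promotes sparsity in $U$.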

II-B2 Mitigating Inter-Class Feature Interference

The basic idea of our algorithm is to take advantage of the stability to perturbations and of the fast encoding of ADL. Since there is no reconstruction term in the conventional ADL, and to secure an efficient classification, the representation $U$ is used to obtain a classifier in a supervised learning mode. To strengthen the discriminative power of ADL, it is desirable to minimize the impact of inter-class common features. We therefore propose two additional constraints on $U$ by way of:

  • Minimizing interference of inter-class common features by a structural map of $U$.

  • Minimizing the classification error.

Structural Mapping of U

The first constraint ensures, in particular, that the representation of each sample in the same class belongs to a subspace defined by the span of the associated coefficients. This imposes the distinction among the classes, improves the identification of each class, and efficiently enhances the divergence between classes. Specifically, we introduce a block-diagonal matrix as shown below,

where is the length of the structured representation. Each diagonal block of this matrix represents the subspace of one class, forcing each class to remain distinct from the others with a consistent intra-class representation. Each column is a structured representation for the corresponding data point, which is pre-defined from the training labels. The matrix is not necessarily uniformly block-diagonal, and the order of samples is not important, so long as the structured representation corresponds to a given class. To mildly relax the constraint, and integrate it into the ADL function, we write


where is a matrix to be learned with and , is the tolerance.
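As a minimal sketch of how such a block-diagonal structured matrix could be built from training labels, the NumPy snippet below assigns one block of rows to each class and marks, for every sample, the rows of its own class. The function name `build_structured_matrix`, the `block_rows` parameter and the all-ones block values are illustrative assumptions, not details confirmed by the text.

```python
import numpy as np

def build_structured_matrix(labels, block_rows):
    """Return H with H[r, j] = 1 when row r lies in the block of sample j's class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)                      # sorted class identifiers
    H = np.zeros((block_rows * len(classes), len(labels)))
    for k, c in enumerate(classes):
        rows = slice(k * block_rows, (k + 1) * block_rows)
        H[rows, labels == c] = 1.0                   # fill the block for class c
    return H

# Five samples from three classes, two structured rows per class.
H = build_structured_matrix([0, 0, 1, 2, 2], block_rows=2)
```

When the samples are sorted by class, as here, the nonzero entries form diagonal blocks. With unsorted labels the same columns appear in a different order, which matches the remark that the order of samples is not important.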

Minimal Classification Error

To maintain an audit track on the desired representation, we include a classification error to make the representation discriminative and learn an optimal regularization. This is written as


where is the tolerance, is a linear transform, and the label matrix is defined as

and is the number of classes.
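A common convention for such a label matrix, assumed here, is one-hot encoding: entry (i, j) equals 1 when sample j belongs to class i and 0 otherwise. The sketch below builds it with NumPy; the function name `label_matrix` and the integer class coding starting at 0 are illustrative assumptions.

```python
import numpy as np

def label_matrix(labels, num_classes):
    """One-hot label matrix of shape (num_classes, num_samples)."""
    labels = np.asarray(labels)
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0   # one nonzero entry per column
    return Y

# Four samples with classes 0, 2, 1, 2.
Y = label_matrix([0, 2, 1, 2], num_classes=3)
```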

II-B3 Structured ADL Formulation

To account for all these constraints, and to avoid overfitting by regularizing the structuring transformation and the linear classifier, we can rewrite the ADL optimization problem as


where and are the penalty coefficients, and are tuning parameters. Recall is the structured representation, is the structuring transformation, is the classifier label, and is the linear classifier.

The formulated optimization function in Eq. (4) provides an analysis dictionary driven by the latent structure of the data yielding an improved discriminative sparse representation among numerous classes.
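A regularized least-squares classifier term of this kind admits a closed-form minimizer. As a minimal sketch, assume the classifier is fit by ridge regression of a one-hot label matrix `Y` on sparse codes `U`, that is, by minimizing ||Y - W U||_F^2 + delta * ||W||_F^2. The minimizer is then W = Y U^T (U U^T + delta I)^{-1}. The names `U`, `Y`, `W`, `delta` and `ridge_classifier` are illustrative assumptions, and the update used inside the full algorithm may carry additional terms.

```python
import numpy as np

def ridge_classifier(U, Y, delta):
    """Minimize ||Y - W U||_F^2 + delta * ||W||_F^2 over W."""
    m = U.shape[0]
    gram = U @ U.T + delta * np.eye(m)       # symmetric positive definite for delta > 0
    # Solve W @ gram = Y @ U.T without forming an explicit inverse.
    return np.linalg.solve(gram, U @ Y.T).T

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 30))                  # 8-dimensional codes, 30 samples
Y = np.eye(3)[:, rng.integers(0, 3, size=30)]     # one-hot labels for 3 classes
W = ridge_classifier(U, Y, delta=0.1)
```

Solving the linear system is usually preferable to computing the inverse explicitly, since it is cheaper and numerically more stable.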

II-B4 Distributed Structured ADL Formulation

In order to handle large datasets, we propose a distributed Structured ADL method. Since both the discriminative structure and the efficient classification need to be preserved, we introduce a global analysis dictionary, a global structuring transformation and a global classifier. In pursuing a distributed ADL, we ensure that the global variables share information with each distributed dictionary cluster, thereby ensuring that the global analysis dictionary, the structured transform and the classifier respectively reach a consensus,


Together with the consensus penalties, the distributed SADL is formulated as


where represents the th independent cluster, , , and are respectively the local analysis dictionary, sparse representation, structuring transformation and classifier of the th cluster, and , , are respectively the global analysis dictionary, structuring transformation and classifier. The global variables will be applied to the same efficient classification scheme as the one of SADL.

III Algorithmic Solution

III-A SADL Algorithm

Due to the non-convexity of the objective function in Eq. (4), an augmented Lagrange formulation with dual variables , and is adopted to seek an optimal solution. The augmented Lagrangian is then written as,


where is a tuning parameter. To iteratively seek the optimal solution in Eq. (7), the analysis dictionary and the two linear transformations are first randomly initialized. The sparse representation is initialized as the zero matrix. In the following equations, Eq. (8) - Eq. (22), auxiliary variables are introduced to guarantee the convergence of the algorithm. Variables whose superscripts do not include parentheses are temporary variables of an intermediate step in the calculation. The variables are alternately updated while fixing the others, resulting in the following steps:

(1) Fix , , and , and update :


where is the element-wise soft thresholding operator, and , , and are as follows:


(2) Fix , , and , and update :


(3) Fix , , and , and update :


(4) Fix , , and , and update :


The analytical solution of Eq. (17) can be regularized as


where is also a tuning parameter, chosen in the usual way.

(5) Fix , , , and , and update :


(6) Fix , , , and , and update :


The dual variables , are updated as


In contrast to previous ADL techniques, which train a dictionary by iterating over a single row of the dictionary, i.e., one atom, to avoid a trivial solution, we update a set of rows in a single step at each iteration. A fast convergence rate of the algorithm is also guaranteed by linearized ADM [35] together with the closed-form solution for the dictionary given in Eq. (18). The proposed SADL algorithm, for which the code has been released, is summarized in Algorithm 1.

Input: Training data , diagonal block matrix , class labels , penalty coefficients , parameters and maximum iteration ;
Output: Analysis dictionary , sparse representation , and linear transformations and ;
Initialize , , and as random matrices, and initialize as a zero matrix;
while not converged and the maximum iteration is not reached do
     Increment the iteration counter;
     Update by (8);
     Update by (12);
     Update by (15);
     Update by (18);
     Update by (19);
     Update by (20);
     Update by (21);
     Update by (22);
end while
Algorithm 1 Structured Analysis Dictionary Learning
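Step (1) of the algorithm relies on an element-wise soft thresholding operator. Its standard definition shrinks every entry toward zero by a threshold and sets entries whose magnitude is below the threshold to zero. A minimal NumPy sketch follows; the names `soft_threshold`, `V` and `tau` are illustrative, and the threshold actually used in the update depends on the algorithm's parameters.

```python
import numpy as np

def soft_threshold(V, tau):
    """Element-wise soft thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

V = np.array([[-2.0, -0.3],
              [0.1, 1.5]])
S = soft_threshold(V, tau=0.5)   # entries with |v| <= 0.5 become zero
```

This operator is the proximal map of the scaled $\ell_1$ norm, which is why it appears whenever an $\ell_1$ sparsity penalty is minimized in closed form.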

III-B Distributed SADL Algorithm

The distributed SADL is similarly expressed in the augmented Lagrangian function as


To minimize such an objective function, each variable is alternately updated while fixing the others. The distributed SADL algorithm is presented in Algorithm 2.

Input: Training data , diagonal block matrix , class labels , penalty coefficients , parameters and maximum iteration ;
Output: Analysis dictionary , linear transformations and ;
Initialize , , , , , and as random matrices, initialize as a zero matrix, and set as a randomly selected partition of with ;
while not converged and the maximum iteration is not reached do
     for each cluster do (this loop can be parallelized or distributed across different clusters)
          Normalize by
     end for
     Normalize by
end while
Algorithm 2 Distributed SADL
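The initialization of the distributed algorithm assigns the training samples to clusters through a randomly selected partition. A minimal sketch of such a partition is given below. The function name `random_partition`, the fixed seed and the near-equal group sizes are illustrative assumptions; the local updates that each cluster then performs are not shown.

```python
import numpy as np

def random_partition(num_samples, num_clusters, seed=0):
    """Split sample indices 0..num_samples-1 into disjoint random groups."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)          # shuffle all sample indices
    return np.array_split(perm, num_clusters)    # near-equal disjoint groups

# Example: 3060 training samples spread over 4 clusters.
groups = random_partition(num_samples=3060, num_clusters=4)
```

Each index array in `groups` can be used to select the columns of the training data handled by one cluster, so the per-cluster loop can run on separate workers.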

IV Classification Procedure

Both SADL and Distributed SADL have the same classification procedure, because the global analysis dictionary, transforming matrix and classifier are obtained from the algorithms. With the analysis dictionary in hand, an observed image can be quickly and sparsely encoded by a single matrix multiplication. This is in stark contrast to SDL, for which a sparse representation is obtained by solving a non-smooth optimization problem, and it highlights the remarkable improvement ADL provides. Our proposed SADL, which naturally enjoys the same encoding properties as ADL, efficiently yields a structured sparse representation of the signal as well. Figure 2 shows an example of the structured representations obtained from the Scene 15 dataset.

Fig. 2: Structured representations on Scene 15 Dataset
Fig. 3: Classification output on Scene 15 Dataset

As shown, the result reflects the desired block diagonal structure. The ultimate desired classification goal of is accomplished by . Figure 3 depicts for the example in Figure 2 where the horizontal axis is image index, and the vertical axis reflects the class labels, which are computed according to,


shown as the brightest ones in Figure 3.
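The classification procedure can be sketched in a few lines. The snippet assumes that a test feature vector is encoded by one multiplication with the learned analysis dictionary and that the predicted label is the index of the largest classifier response. The names `Omega`, `W`, `x` and `classify` are illustrative assumptions, and the random matrices merely stand in for learned ones.

```python
import numpy as np

def classify(x, Omega, W):
    """Encode x with the analysis dictionary, then pick the top-scoring class."""
    u = Omega @ x                 # analysis encoding: one matrix-vector product
    scores = W @ u                # one response per class
    return int(np.argmax(scores)), u

rng = np.random.default_rng(0)
Omega = rng.standard_normal((12, 6))   # stand-in for a learned dictionary
W = rng.standard_normal((3, 12))       # stand-in for a learned 3-class classifier
x = rng.standard_normal(6)             # stand-in for a test feature vector
label, u = classify(x, Omega, W)
```

No iterative solver appears at test time, which is the source of the short encoding and classification time discussed above.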

V Convergence

Since we have used linearized ADM method to solve our nonconvex objective function, are introduced as the auxiliary variables. We additionally have the following

Theorem 1.

Suppose that . There exist positive values only depending on the initialization such that for the sequence converges to the following set of bounded feasible stationary points of the Lagrangian (here the norm is any norm that is continuous with respect to the two-norm of the components, for example the sum of their two-norms; also, the function is treated as a (convex) function of , which is constant with respect to other components than ):

where is the smooth part of , i.e.,

According to Theorem 1, if we initialize large enough, Algorithm 1 not only converges, but also generates the variable sequences with a final convergence to the stationary points. The proof of Theorem 1 can be found in Appendix A.

VI Experiments and Results

We now evaluate our proposed SADL method on five popular visual classification datasets that have been widely used in previous works and have known performance benchmarks. They include the Extended YaleB face dataset [36], the AR face dataset [37], the Caltech 101 object categorization dataset [38], the Caltech 256 object dataset [39], and the Scene 15 scene image dataset [40].

In our experiments, we provide a comparative evaluation of six state-of-the-art techniques and our proposed technique, covering classification accuracy as well as training and testing times. All our experiments and competing algorithms are implemented in Matlab 2015b on a server with a 2.30GHz Intel(R) Xeon(R) CPU. For a fair comparison, we measure the performance of each algorithm by repeating the experiment over 10 realizations. The testing time is defined as the average processing time to classify a single image. In our tables, the accuracy in parentheses with the associated citation is the one reported in the original paper. The difference between the accuracy we obtained and the originally reported one might be caused by different splits of the training and testing samples.

VI-A Parameter Settings

In our proposed SADL method, and maximum iteration are tuning parameters. The parameter controls the contribution of the sparsity, and the parameter controls the learned analysis dictionary, while is the maximum iteration number. We replace and by their expressions, and insert them in the optimization formula. We choose for all the experiments , and dictionary size by 10-fold cross validation on each dataset. In addition, we also optimally tuned the parameters of all competing methods to ensure their best performance.

VI-B State-of-the-art Methods

We compare our proposed SADL and Distributed SADL (DSADL) with the following competing techniques. The first one is a baseline, which uses the ADL method to learn a sparse representation and subsequently trains a Support Vector Machine (SVM) to classify images based on such sparse representations (ADL+SVM) [27]. A penalty term is included to avoid similar atoms and minimize false positives. The second one is the classical Sparse Representation based Classification (SRC) [13]. For this method, we do not need to train a dictionary. Instead, we use the training images as the atoms of the dictionary. In the testing phase, we obtain the sparse coefficients based on such a dictionary. The third technique is a state-of-the-art dictionary learning method, called Label Consistent K-SVD (LC-KSVD) [17], which forces the category labels to be consistent with classification. We select LC-KSVD2 in [17] for comparison, because it has a better classification performance. The fourth method is Discriminative Analysis Dictionary Learning (DADL) [28], which incorporates a topological structure and distinct class representations into the ADL framework in order to make each class discriminative. A 1-nearest-neighbor classifier is then used to assign the label. The fifth technique, Class-aware Analysis Dictionary Learning (CADL) [29], learns class-specific analysis dictionaries and jointly learns a universal SVM based on the concatenated class-specific coefficients of each class. Finally, we compare our method with the Synthesis K-SVD based Analysis Dictionary Learning (SK-SVDADL) [30], which jointly learns ADL and a linear classifier and is solved by the K-SVD method.

VI-C Extended YaleB

Fig. 4: Extended YaleB Dataset Examples

The Extended YaleB face dataset contains in total 2414 frontal face images of 38 persons under various illumination and expression conditions, as illustrated in Figure 4. Due to such illumination and expression variation, YaleB is intended to test the robustness to intra-class variation. Each person has about 64 images, each cropped to $192 \times 168$ pixels. We project each face image onto an $n$-dimensional random face feature vector. The projection is performed by a randomly generated matrix with a zero-mean normal distribution whose rows are $\ell_2$-normalized. This procedure is similar to the one in [17]. In our experiment, $n$ is 504, i.e., each Extended YaleB face image is reduced to a 504-dimensional feature vector. Then, we randomly choose half of the images for training, and the rest for testing. The dictionary size is set to 1216 atoms, and .
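The random-face projection described above can be sketched as follows: draw a zero-mean Gaussian matrix, normalize each row, and multiply it with the vectorized images. The function name `random_face_features`, the use of unit $\ell_2$ norm for the rows, the fixed seed and the small toy sizes are illustrative assumptions; the experiments use 504-dimensional features.

```python
import numpy as np

def random_face_features(images, n_features, seed=0):
    """Project vectorized images (num_pixels x num_images) to n_features dimensions."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((n_features, images.shape[0]))   # zero-mean Gaussian entries
    P /= np.linalg.norm(P, axis=1, keepdims=True)            # give each row unit l2 norm
    return P @ images, P

# Toy example: ten 32 x 28 "images" reduced to 50 features each.
rng = np.random.default_rng(1)
images = rng.random((32 * 28, 10))
features, P = random_face_features(images, n_features=50)
```

The same projection matrix must be reused for training and test images so that both live in the same feature space.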

Methods          Classification Accuracy (%)   Training Time (s)   Testing Time (s)
ADL+SVM [27]     82.91%                        91.78               1.13
SRC [13]         96.51%                        No Need             3.66
LC-KSVD [17]     83.31% (96.7% [17])           123.07              1.60
DADL [28]        97.35% (97.7% [28])           10.05               4.55
CADL [29]        97.05%                        130.83              9.72
SK-SVDADL [30]   96.14% (96.9% [30])           113.78              1.34
SADL             96.35%                        39.23               7.61
TABLE I: Classification Results on Extended YaleB Dataset

The classification results and the training and testing times are summarized in Table I. Although the accuracy of the SADL method is slightly lower than that of SRC, DADL and CADL, it is still comparable, and it is higher than that of SK-SVDADL, LC-KSVD and ADL+SVM. SADL is substantially more efficient than the others in terms of numerical complexity.

For a more thorough evaluation, we compare SADL with LC-KSVD, CADL and SK-SVDADL for different dictionary sizes, and display the classification accuracy and training times in Figures 5 and 6, which are based on the average of ten realizations. We ran our experiments for dictionary sizes of 38, 152, 266, 380, 494, 608, 722, 836, 950, 1064, 1178 and 1216 (the full training size). The ADL methods, SADL, SK-SVDADL and CADL, exhibit a more stable accuracy than LC-KSVD, which is an SDL method. In particular, the accuracy of LC-KSVD decreases significantly when the dictionary size approaches the training sample size. This decrease may be caused by the trivial solution of the dictionary in SDL. In addition, when the dictionary size is small, our method has a much higher classification accuracy than LC-KSVD and an accuracy very similar to that of SK-SVDADL. As the dictionary size increases, SADL achieves a better accuracy than SK-SVDADL and approaches the accuracy of CADL. Although the accuracy of SADL is slightly lower than that of CADL, the SADL method is much faster than LC-KSVD, SK-SVDADL and CADL in the training phase, especially when the dictionary size becomes larger.

Fig. 5: Classification Accuracy versus Dictionary Size
Fig. 6: Training Time versus Dictionary Size

VI-D AR Face

Fig. 7: AR Dataset Examples

The AR Face dataset has 2600 color images of 50 females and 50 males with more facial variations than the Extended YaleB database, such as different illumination conditions, expressions and facial disguises. AR is also used to test the robustness to large intra-class variation. Each person has about 26 images of size $165 \times 120$. Figure 7 shows some sample images of faces with sunglasses or scarves. The features of the AR face images are extracted in the same way as those of the Extended YaleB face images, but we project each image to a 540-dimensional feature vector, similarly to the setting in [17]. For each person, 20 images are randomly selected as a training set and the other 6 images are used for testing. The dictionary size of the AR dataset is set to 2000 atoms, , and .

Methods          Classification Accuracy (%)   Training Time (s)   Testing Time (s)
ADL+SVM [27]     90.40%                        218.54              9.10
SRC [13]         97.10%                        No Need             7.41
LC-KSVD [17]     87.78% (97.8% [17])           169.35              2.00
DADL [28]        98.32% (98.7% [28])           47.76               2.68
CADL [29]        98.52% (98.8% [29])           313.37              1.34
SK-SVDADL [30]   97.38% (97.7% [30])           113.78              1.34
SADL             97.17%                        32.60               1.33
TABLE II: Classification Results on AR Dataset

The classification results as well as the training and testing times are summarized in Table II. Compared with the other methods, our proposed SADL achieves a comparable result with the fastest training and testing times. Its classification accuracy is lower than that of DADL, CADL and SK-SVDADL, and higher than that of SRC and LC-KSVD. However, our method is about 1000 times faster than SRC and LC-KSVD in the testing phase, and 10 times faster than DADL and SK-SVDADL. Although SADL is only slightly faster than CADL in testing, its training time is one-tenth of that of CADL.

VI-E Caltech 101

Fig. 8: Caltech 101 Dataset Examples

The Caltech 101 dataset has 101 different categories of objects and one non-object category. Most categories have around 50 images. Figure 8 gives some examples from the Caltech 101 dataset. Since this dataset is left-right aligned and rotated, Caltech 101 contains many different intra-class scaling variations, color pattern diversity and inter-class common features. We extract dense Scale-invariant Feature Transform (SIFT) descriptors for each image from $16 \times 16$ patches with a 6-pixel step. Then, we apply a spatial pyramid method [40] to the dense SIFT features with three segmentation sizes, $1 \times 1$, $2 \times 2$ and $4 \times 4$, to capture the objects' features at different scales. At the same time, a codebook of size 1024 is trained by $k$-means clustering for the spatial pyramid features. The spatial pyramid features of each subregion are then concatenated together as a vector to represent one image. Due to the sparse nature of the spatial pyramid features, we use PCA to reduce each feature to 3000 dimensions. In our experiment, 30 images per class are randomly chosen as training data, and the other images are used as testing data. All the steps and settings follow [17]. The dictionary size is set to 3060, and .
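The PCA reduction step of this feature pipeline can be sketched with a singular value decomposition of the centered feature matrix. The function name `pca_reduce`, the column-per-sample layout and the toy sizes are illustrative assumptions; the dense SIFT extraction, codebook training and spatial pyramid pooling that precede it are not shown.

```python
import numpy as np

def pca_reduce(F, dim):
    """Reduce features F (num_features x num_samples) to dim dimensions by PCA."""
    mean = F.mean(axis=1, keepdims=True)
    Fc = F - mean                                    # center each feature
    # Left singular vectors of the centered data are the principal directions.
    left, _, _ = np.linalg.svd(Fc, full_matrices=False)
    basis = left[:, :dim]                            # top-dim principal directions
    return basis.T @ Fc, basis, mean

# Toy example: 25 samples with 40 features each, reduced to 5 dimensions.
rng = np.random.default_rng(0)
F = rng.standard_normal((40, 25))
Z, basis, mean = pca_reduce(F, dim=5)
```

Test features should be centered with the training `mean` and projected with the training `basis`, so that training and test data share one reduced space.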

Methods          Classification Accuracy (%)   Training Time (s)   Testing Time (s)
ADL+SVM [27]     66.75%                        1943.47             1.33
SRC [13]         70.70%                        No Need             4.34
LC-KSVD [17]     73.67% (73.6% [17])           2144.90             2.49
DADL [28]        71.77% (74.6% [28])           233.49              7.90
CADL [29]        76.83% (75% [29])             9896.46             4.86
SK-SVDADL [30]   73.39% (74.4% [30])           182.71              2.49
SADL             74.45%                        847.50              4.76
DSADL            73.49%                        -                   8.10
TABLE III: Classification Results on Caltech 101 Dataset

The classification results and the training and testing times are summarized in Table III. Our proposed SADL achieves the second-highest accuracy, at roughly one-tenth of the training time of CADL, which obtains the highest accuracy. SADL again has the shortest encoding time, around 10000 times faster than LC-KSVD and 10 times faster than DADL and SK-SVDADL. Note that the distributed SADL (DSADL) uses only 510 atoms, yet it still achieves a comparable result with the fastest testing time.

Fig. 9: Distributed SADL on Caltech 101: is the number of clusters used. is centralized. Training set is divided into groups.

The parameters in DSADL are set as follows: and the penalty coefficients of the communication cost . Figure 9 shows that when the number of groups increases, the accuracy is lower at first because each independent local variable is trained on fewer samples. Once the communication between the global variables and the local independent variables is strengthened, however, the performance rises quickly to a high accuracy. This demonstrates that Distributed SADL can obtain stable and excellent performance even when the number of groups is large.

Fig. 10: Training time of Distributed SADL on Caltech 101: the horizontal axis is the number of clusters, and the vertical axes show the real training time in seconds. The left (blue) vertical axis is the initialization time of DSADL, and the right (orange) vertical axis is the time for the alternating variable updates, i.e., the while loop in Algorithm 2.
| | Initialization (s) | Variables Updating (s) | Total Training Time (s) | # Training Samples of Each Cluster |
| --- | --- | --- | --- | --- |
| 1 Cluster | 17.04 | 6471.9 | 6488.94 | 3060 |
| 2 Clusters | 2.62 | 3572.9 | 3575.52 | 1530 |
| 4 Clusters | 0.89 | 3235.5 | 3236.39 | 765 |
| 6 Clusters | 0.78 | 3194.0 | 3194.78 | 510 |
| 10 Clusters | 0.72 | 3148.6 | 3149.32 | 306 |

TABLE IV: Training Time and # Training Samples on Caltech 101 Dataset

To further study the efficiency of Distributed SADL, we conduct an experiment with different numbers of clusters, which is shown in Fig. 10. For fairness, we use only one CPU core to run the centralized SADL, while the 2-cluster experiment uses 2 cores to run DSADL, the 4-cluster experiment uses 4 cores, and so on. The training time and the number of training samples of each cluster are averaged over 10 realizations and are listed in Table IV. It is worth noting that the training time in Table III is based on all 28 cores of the CPU, while the training time in Table IV uses only one core for the centralized SADL and one core per cluster for DSADL.

We separate the DSADL algorithm into two parts: an initialization part and a variable updating part. The initialization part corresponds to line 1 of Algorithm 2, and the variable updating part spans lines 2 to 21, i.e., the while loop. The initialization part consists of simple matrix assignments, while the variable updating part involves heavier matrix computations, such as multiplication and inversion. Fig. 10 shows that the running times of both the initialization part and the variable updating part decrease quickly as the number of clusters increases. Both curves flatten when more clusters are used, because the number of training samples in each cluster becomes small enough to affect how effectively each CPU core is used. Because Algorithm 2 has three global communication terms, which are computed after the individual dictionaries, transform matrices and classifiers are updated, the training time with 2 clusters is slightly more than half the running time with 1 cluster (centralized). However, these three terms are not expensive, and Algorithm 2 with 2 clusters is still 1.8 times faster than the centralized one. We observed that the more clusters we use, the more training time is saved. Moreover, the larger the dataset is, the more training time is saved.
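The speedup figures quoted above follow directly from the total training times in Table IV. The short script below recomputes them; the per-core efficiency (speedup divided by the number of clusters) is a standard metric added here for illustration and is not reported in the paper.

```python
# Total training times (s) from Table IV, keyed by number of clusters.
total_time = {1: 6488.94, 2: 3575.52, 4: 3236.39, 6: 3194.78, 10: 3149.32}

def speedup(clusters):
    """Ratio of the centralized (1-cluster) time to the time with `clusters` clusters."""
    return total_time[1] / total_time[clusters]

def parallel_efficiency(clusters):
    """Speedup per cluster; 1.0 would mean perfectly linear scaling."""
    return speedup(clusters) / clusters

for n in sorted(total_time):
    print(f"{n:2d} clusters: speedup {speedup(n):.2f}, "
          f"efficiency {parallel_efficiency(n):.2f}")
```

For 2 clusters the ratio rounds to the 1.8 stated in the text. The speedup keeps growing with more clusters, but only marginally beyond 4, which matches the flattening of the curves in Fig. 10.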

VI-F Caltech 256

Fig. 11: Caltech 256 Dataset Examples

Caltech 256 is a larger object dataset, which includes 256 object categories and one clutter category. There are 30608 images in total, with various object locations, poses, and sizes. Figure 11 shows examples from the Caltech 256 dataset, in which each category has at least 80 images. Note that the images in Caltech 256 are not rotated or aligned. Thus, the dataset contains large intra-class diversity and inter-class similarity, such as object scale, object rotation and common patterns. The features of the Caltech 256 images are extracted as the output of the last layer before the fully connected layer of ResNet-50 [41], with the weights trained on ImageNet. The dimension of each feature is . We randomly sample 15 images from each category for training, and test on the rest. To train the Distributed SADL, the dictionary size is set to 3855, the dataset is divided into subsets (i.e., in Algorithm 2), , and .

| Methods (training samples) | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
| --- | --- | --- | --- |
| ADL+SVM(15)[27] | 66.66% | 3501.44 | 7.67 |
| LC-KSVD(15)[17] | 73.37% | 3118.76 | 3.00 |
| DADL[28] | 72.20% | 417.06 | 5.42 |
| CADL[29] | 75.25% | 5586.21 | 4.83 |
| SK-SVDADL[30] | 73.35% | 334.31 | 3.28 |
| CNN Features(15)[42] | 65.70%[42] | - | - |
| SADL(15) | 75.36% | 4829.01 | 2.79 |
| DSADL(15) | 74.38% | - | 2.79 |
| ResFeats-50(30)[43] | 75.40%[43] | | |

TABLE V: Classification Results on Caltech 256 Dataset

We use Caltech 256 to test both SADL and Distributed SADL. Among the dictionary learning methods, our SADL achieves the highest accuracy, and our Distributed SADL achieves comparable performance with an extremely fast testing time, even though the dimension of the features is increased. For reference, we also compare our method with two network-based methods [42, 43]. In [42], Zeiler et al. constructed a convolutional network pre-trained on ImageNet, and then learned an adapted convolutional network for Caltech 256 based on the features of the former network. With 15 training samples per class, our accuracy is higher than this CNN result. ResFeats-50 [43] is a more recent convolutional network method with 50 layers, trained with 30 samples per category. Although ResFeats-50 uses twice as many training samples as ours, our result is still very comparable.

VI-G Scene 15

Fig. 12: Scene 15 Dataset Examples

The Scene 15 dataset contains a total of 15 categories of different scenes, and each category has around 200 images. Examples are shown in Figure 12. As different scenes contain many common components, and different components also share a large number of common features, the Scene 15 dataset exhibits a remarkable amount of inter-class similarity. Proceeding as for the Caltech 101 dataset, we compute the spatial pyramid features for the scene images. A four-level spatial pyramid (i.e., each image is divided into grids of , , and ) and a codebook of size 200 are used. The final features are obtained by applying PCA to reduce the dimension of the spatial pyramid features to . We randomly pick 100 images per class as training data, and use the remaining images as testing data. The settings and steps follow [17]. The dictionary size is set to 1500, and .

| Methods | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
| --- | --- | --- | --- |
| ADL+SVM[27] | 80.55% | 484.41 | 1.73 |
| SRC[13] | 91.80% | No Need | 4.06 |
| LC-KSVD[17] | 98.83% (92.9%[17]) | 390.22 | 1.81 |
| DADL[28] | 97.81% (98.3%[28]) | 33.03 | 4.62 |
| CADL[29] | 98.49% (98.6%[29]) | 4637.80 | 6.02 |
| SK-SVDADL[30] | 96.84% (97.4%[30]) | 66.79 | 1.06 |
| SADL | 98.50% | 174.20 | 2.41 |

TABLE VI: Classification Results on Scene 15 Dataset

The classification results and the training and testing times are summarized in Table VI. Our accuracy is slightly lower than that of LC-KSVD, but still higher than that of all the other methods. Moreover, our testing time is the fastest, and our training time is shorter than that of CADL, LC-KSVD and ADL+SVM.

VI-H Comparative Evaluation

To investigate the effect of the different constraints in the optimization problem in Eq. (4), we learn ADL with each of the two constraints in turn, neglecting the other, and test the resulting algorithms on 10 realizations. The results on the 5 datasets are compared with SADL in Table VII. As there is no linear classifier when training ADL with only the constraint , we assign the class labels for an observed image by . For ADL with only the constraint , labels of images are assigned by the classifier .
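Since the two label-assignment formulas are not reproduced in this text, the sketch below only illustrates the general shape such rules can take with an analysis dictionary. Everything in it is an assumption made for illustration: the hard-thresholding encoder, the equal-sized class blocks, the energy-based rule in `label_by_structure`, and the names `omega` and `w`. It should not be read as the authors' exact procedure.

```python
import numpy as np

def encode(omega, x, threshold):
    """Analysis-style encoding: one matrix product followed by hard thresholding.

    The thresholding step is an assumption; the key point is that no
    iterative sparse coding is needed at test time.
    """
    u = omega @ x
    return np.where(np.abs(u) >= threshold, u, 0.0)

def label_by_structure(u, atoms_per_class):
    """Hypothetical structure-only rule: pick the class whose block of
    coefficients carries the most energy (assumes equal-sized blocks)."""
    blocks = u.reshape(-1, atoms_per_class)
    return int(np.argmax(np.sum(blocks ** 2, axis=1)))

def label_by_classifier(w, u):
    """Hypothetical classifier-only rule: largest response of a linear classifier."""
    return int(np.argmax(w @ u))

# Toy example: 2 classes, 2 atoms per class, 3-dimensional input.
omega = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.5]])
w = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
u = encode(omega, np.array([0.1, 0.2, 2.0]), threshold=0.5)
```

In both variants the cost of labelling a test sample is dominated by a single matrix-vector product, which is consistent with the short testing times reported for the analysis-dictionary methods.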

| ADL + constraints | YaleB | AR | Caltech 101 | Scene 15 | Caltech 256 |
| --- | --- | --- | --- | --- | --- |
| | 95.37% | 96.70% | 74.44% | 98.46% | 75.05% |
| | 95.52% | 97.12% | 74.45% | 98.36% | 75.35% |
| SADL | 96.35% | 97.13% | 74.45% | 98.50% | 75.36% |

TABLE VII: Classification Results of Different Constraints Comparison

The results show that the two constraints behave similarly on their own, with the universal classifier performing slightly better, while learning with both constraints jointly achieves the best result. The results also support our goal of mitigating the inter-class common features. Accordingly, our algorithm achieves better performance on Scene 15 and Caltech 256, which have many common features among classes.

VII Conclusion

We proposed an image classification method referred to as Structured Analysis Dictionary Learning (SADL). To obtain SADL, we impose a structured subspace (cluster) model on the enhanced ADL method, where each class is represented by a structured subspace. The enhancement of ADL is realized by constraining the learning with a classification fidelity term on the sparse coefficients. Our formulated optimization problem was efficiently solved by the linearized ADM, in spite of its non-convexity due to bilinearity. Taking advantage of the analysis dictionary, our method achieves a significantly faster testing time. Furthermore, a Distributed SADL (DSADL) was also proposed to address the scalability problem. Both the discriminative structure and the fast testing phase are well preserved in DSADL. Even when the algorithm was run across many clusters, the performance was still stable and comparable to that of the centralized SADL.

Our experiments demonstrate that our approach achieves performance at least comparable to, and often better than, state-of-the-art techniques on five well-known datasets, with training and testing times that are superior by orders of magnitude.

A possible future direction for improving our method is to leverage the discriminative nature of the synthesis dictionary together with the efficiency of the analysis dictionary. This could yield greater discriminative power while retaining high efficiency.

Appendix A Proof of Algorithm 1

Take the Lagrangian function


Our algorithm can be written as the one in Alg. 3.

At each iteration , compute:



Algorithm 3 Linearized ADM for Structured Analysis Dictionary Learning

Let us proceed by introducing two simple lemmas:

Lemma 2.

Consider a differentiable function with an Lipschitz continuous derivative and another arbitrary convex function . For any arbitrary point define

where is a step size and

Then, we have

where .
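Because the formulas of the lemma are not reproduced in this text, the following sketch does not restate it. Instead it numerically checks a standard textbook sufficient-decrease property of a single proximal gradient step, which is the kind of statement such lemmas formalize. The choice of an l1 penalty for the convex function, the least-squares smooth term, and all variable names are assumptions made for illustration; the constants may differ from those in Lemma 2.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_step(x, grad_f, step, lam):
    """One step x+ = prox_{step * lam * ||.||_1}(x - step * grad_f(x))."""
    return soft_threshold(x - step * grad_f(x), step * lam)

# Toy problem (assumed): f(x) = 0.5 * ||A x - b||^2, g(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 0.1
lipschitz = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of grad f
step = 1.0 / lipschitz

def objective(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def grad_f(x):
    return A.T @ (A @ x - b)

x = rng.standard_normal(10)
x_next = proximal_gradient_step(x, grad_f, step, lam)

# With step <= 1/L, the objective decreases by at least ||x+ - x||^2 / (2 * step).
decrease = objective(x) - objective(x_next)
bound = np.sum((x_next - x) ** 2) / (2.0 * step)
```

The inequality `decrease >= bound` holds for any starting point as long as the step size does not exceed the reciprocal of the Lipschitz constant, which is why the step size in linearized schemes is tied to that constant.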


Notice that by the definition of the proximal operator , there exists a subgradient such that

where . Then, we have