I Introduction
Sparse representation has had great success in dealing with various problems in image processing and computer vision, such as image denoising and image restoration. To obtain such sparse representations with an unknown precise model, Dictionary Learning is one choice because it results in a linear combination of sparse dictionary atoms. There are two different types of dictionary learning methods: Synthesis Dictionary Learning (SDL) and Analysis Dictionary Learning (ADL).
In recent years, SDL has been widely studied [1, 2, 3], while ADL has received little attention. SDL supposes that a signal lies in a sparse latent subspace and can be recovered by an associated dictionary. The local structures of the signal are well preserved in the optimal synthesis dictionary [4, 5, 6]. In contrast, ADL assumes that a signal can be transformed into a latent sparse subspace by its corresponding dictionary. In other words, ADL produces a sparse representation by applying the dictionary as a transform to a signal. The atoms in an analysis dictionary can be interpreted as local filters, as first mentioned in [7]. When the dictionary is known, sparse representations can be obtained simply by an inner product operation. Such fast coding makes ADL more attractive than SDL in applications. The contrast between SDL and ADL is shown in Fig. 1.
The success of dictionary learning in image processing problems has spurred much interest in task-driven dictionary learning methods for inference applications, such as image classification. The task of classification aims to assign the correct label to an observed image, which requires a much greater discriminative capacity of either the dictionary or the sparse representation. To address this issue, supervised learning is often invoked when using single-view SDL [8] so as to maximize the distances between the sparse representations of any two distinct classes. In addition, multi-view dictionary learning methods [9, 10, 11, 12] were developed to include more information about each class.
For supervised single-view dictionary learning, there are generally two strategies. The first strategy is to learn multiple dictionaries or class-specific dictionaries for different classes [13, 14, 15, 16]. The advantage of learning multiple dictionaries is that these dictionaries characterize specific patterns and structures of each class and enhance the distances between different classes. The minimum reconstruction errors of the various dictionaries are subsequently used to assign labels to new incoming images. In [14], Ramirez et al. learned class-specific dictionaries with a penalty for the common atoms. Yang et al. [15] then learned class-specific dictionaries and jointly applied a Fisher criterion to the associated sparse representations, thereby enhancing the distances between classes. A large-margin method was proposed in [16] to increase the divergence of sparse representations for the class-specific dictionaries. However, as the number of classes increases, training class-specific dictionaries while regularizing the distances between dictionaries becomes too complex and time consuming. Even though a distributed cluster could reduce the time complexity of training dictionaries, it is difficult for the distributed algorithm to communicate with each independent cluster and to accommodate the other regularizations for class-specific dictionary learning.
Another strategy is to learn a shared dictionary for all classes together with a universal classifier [8, 17]. Such joint dictionary learning enforces more discriminative sparse representations. Compared with class-specific dictionary learning, this strategy makes it simpler to learn the dictionary and classifier, and easier to classify unknown images. In [8], Mairal et al. integrated a linear classifier into the sparse representation during the dictionary learning phase. Jiang et al. then included a linear classifier and a label-consistent regularization term to enforce more consistent sparse representations in each class [17]. When large datasets are involved, memory and computational limitations emerge, and online learning or distributed solutions are required as a viable strategy.
Although the techniques mentioned above are all based on SDL, ADL has gradually received more attention [18]. Building on the seminal work of Rubinstein et al. [19], which proposed analysis K-SVD to learn an analysis dictionary, Li et al. [20] added an inner product term of sparse coefficients to ADL to increase its discriminative power. In addition, reducing the computational complexity has been addressed in recent methods. Zhang et al. [21] used a Recursive Least Squares method to accelerate dictionary learning by updating a dictionary based on the dictionaries in the previous iterations. Li et al. [22] and Dong et al. [23] proposed Simultaneous codeword optimization (SimCo) related algorithms, which update multiple atoms in each iteration and add an extra incoherence regularization term to avoid linear dependency among dictionary atoms. On the other hand, Li et al. [24, 25] used norm instead of norm to have stronger sparsity and mathematically guaranteed a strong convergence. Bian et al. [26] proposed the Sparse Null Space (SNS) pursuit to search for the optimal analysis dictionary. However, all of these methods address the original problem of learning an analysis dictionary. To the best of our knowledge, few attempts have been made at task-driven ADL. For example, Shekhar et al. [27] learned an analysis dictionary and subsequently trained an SVM for digit and face recognition tasks. Their results demonstrate that ADL is more stable than SDL under noise and occlusion, and achieves a competitive performance. Guo et al. [28] integrated local topological structures and discriminative sparse labels into ADL and separately classified images by a Nearest Neighbor classifier. Instead of performing ADL and SVM in separate steps, Wang [29] alternately optimizes ADL and SVM together to classify different patterns. In [30], Wang et al. use a K-SVD based technique to solve a joint learning of ADL and a linear classifier. Additionally, a hybrid design based on both SDL and ADL is considered in [31] and [32]. A multi-view ADL was proposed in [33], which separately learns analysis dictionaries for different views and a marginalized classifier for fusing the semantic information of each view.
Inspired by these past works, and taking advantage of efficient coding by ADL, we propose a supervised ADL with a shared dictionary and a universal classifier. In addition to the classifier, a structured subspace regularization is also included in the ADL model to obtain a more structured, discriminative and efficient approach to image classification. We refer to this approach as Structured Analysis Dictionary Learning (SADL). Since Sparse Subspace Clustering [34] has shown that visual data in a class or category can be well captured and localized by a low-dimensional subspace, and the sparse representations of the data within a class similarly share a low-dimensional subspace, a structured representation is introduced to achieve a distinct representation of each class. This achieves more coherence for within-class sparse representations and more disparity for between-class representations. When sorted by the order of classes, these representations, as shown later, can be viewed as a block-diagonal matrix. For robustness of the sought sparse representations, we simultaneously learn a one-against-all regression-based classifier. The resulting optimization function is solved by a linearized alternating direction method (ADM) [35].
This approach leads to a more computationally efficient solution than that of analysis K-SVD [19] and of SNS pursuit [26]. Additionally, a great advantage of our algorithm is its extremely short online encoding and classification time for an incoming observed image. In contrast to the SDL encoding procedure, ADL obtains a sparse representation by a simple matrix multiplication of the learned dictionary and the testing data. Experiments demonstrate that our method achieves an overall better performance than the synthesis dictionary approach. Good accuracy is achieved in scene and object classification with a simple classifier, and face recognition performance is competitive with the best at a remarkably low computational complexity. Moreover, experiments also show that our approach has a more stable performance than that of SDL. Even when the dictionary size is reduced to lower the memory demand, our performance remains strong. To address large datasets, a distributed structured analysis dictionary learning algorithm is also developed, which preserves the same properties as structured analysis dictionary learning (SADL). Experiments also show that when the dataset is sufficiently large, the distributed algorithm achieves as high a performance as SADL.
Our main contributions are as follows:

Both a structured representation and a classification error regularization term are introduced to the conventional ADL formulation to improve classification results. A multiclass classifier and an analysis dictionary are jointly learned.

The optimal solution provided by the linearized ADM is significantly faster than other existing techniques for nonconvex and nonsmooth optimization.

An extremely short classification time is offered by our algorithms, as they entail encoding by a mere matrix multiplication for a simple classification procedure.

A distributed structured analysis dictionary learning algorithm is also presented.
The balance of this paper is organized as follows: we state and formulate the problem of SADL and its distributed form in Section II. The resulting solutions to the optimization problems along with the classification procedure are described in Sections III and IV. In Section V, we analyze the convergence and complexity of our methods. A comprehensive experimental validation and its results are then presented in Section VI. Some comments and future work are finally provided in Section VII.
II Structured Analysis Dictionary Learning
II-A Notation
Uppercase and lowercase letters respectively denote matrices and vectors throughout the paper. The transpose and inverse of matrices are represented as the superscripts and , such as and. The identity matrix and all-zero matrix are respectively denoted as and 0. represents the th element in the th column of matrix .
II-B Structured ADL Method
II-B1 ADL Formulation
The conventional ADL problem [19] aims at obtaining a representation frame with a sparse coefficient set based on the data matrix .
(1) 
where and is a large class of nontrivial solutions.
II-B2 Mitigating Inter-Class Feature Interference
The basic idea of our algorithm is to take advantage of the stability to perturbations and of the fast encoding of ADL. Since there is no reconstruction term in the conventional ADL, and to secure an efficient classification, the representation is used to obtain a classifier in a supervised learning mode. To strengthen the discriminative power of ADL, it is desirable to minimize the impact of inter-class common features. We therefore propose two additional constraints on by way of:

Minimizing interference of inter-class common features by a structural map of .

Minimizing the classification error.
Structural Mapping of U
The first constraint ensures that the representations of all samples in the same class belong to a subspace defined by the span of the associated coefficients. This imposes a distinction among the classes, improves the identification of each class, and efficiently enhances the divergence between classes. Specifically, we introduce a block-diagonal matrix as shown below,
where is the length of the structured representation. Each diagonal block in represents the subspace of one class, forcing each class to remain distinct from the others with a consistent intra-class representation. Each column is a structured representation for the corresponding data point, which is predefined from the training labels. This matrix is not necessarily uniformly block-diagonal, and the order of samples is not important, so long as the structured representation corresponds to a given class. To mildly relax the constraint and integrate it into the ADL function, we write
(2) 
where is a matrix to be learned with and , is the tolerance.
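The predefined structured representation described above can be illustrated with a minimal sketch. The helper name `structured_targets`, the parameter `block_len`, the uniform block length, and the choice of filling each block with ones are all assumptions made for illustration; the text only requires that each column carry a structure tied to its class.

```python
import numpy as np

def structured_targets(labels, num_classes, block_len):
    """Build a block-structured target matrix, one column per sample.

    Hypothetical construction: a sample of class c gets ones in rows
    [c*block_len, (c+1)*block_len) and zeros elsewhere (assumed fill value).
    """
    labels = np.asarray(labels)
    H = np.zeros((num_classes * block_len, labels.size))
    for j, c in enumerate(labels):
        H[c * block_len:(c + 1) * block_len, j] = 1.0
    return H

# Six toy samples sorted by class: two of class 0, one of class 1, three of class 2.
labels = [0, 0, 1, 2, 2, 2]
H = structured_targets(labels, num_classes=3, block_len=2)
print(H.astype(int))
```

With samples sorted by class the printed matrix is block-diagonal; with any other ordering the columns are simply permuted, which matches the remark that the order of samples is not important.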
Minimal Classification Error
To keep the desired representation on track, we include a classification error term, which makes the representation discriminative and lets us learn an optimal regularization. This is written as
(3) 
where is the tolerance, is a linear transform, and the label matrix is defined as and is the number of classes.
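As an illustration of a label matrix for a linear classifier, the sketch below builds a one-hot matrix with one row per class and one column per sample. The one-hot encoding and the helper name `label_matrix` are assumptions; the exact definition used in the paper is not reproduced in the text above.

```python
import numpy as np

def label_matrix(labels, num_classes):
    """One-hot label matrix of shape (num_classes, num_samples) (assumed encoding)."""
    labels = np.asarray(labels)
    Y = np.zeros((num_classes, labels.size))
    # Entry (c, j) is 1 when sample j belongs to class c.
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

Y = label_matrix([0, 0, 1, 2, 2, 2], num_classes=3)
print(Y.astype(int))
```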
II-B3 Structured ADL Formulation
To account for all these constraints and to avoid overfitting by regularizing and , we can rewrite the ADL optimization problem as
(4) 
where and are the penalty coefficients, and are tuning parameters. Recall is the structured representation, is the structuring transformation, is the classifier label, and is the linear classifier.
The formulated optimization function in Eq. (4) provides an analysis dictionary driven by the latent structure of the data yielding an improved discriminative sparse representation among numerous classes.
II-B4 Distributed Structured ADL Formulation
In order to handle large datasets, we propose a distributed Structured ADL method. Since both the discriminative structure and the efficient classification need to be preserved, we introduce a global analysis dictionary, a global structuring transformation and a global classifier. In pursuing a distributed ADL, we ensure that the global variables share information with each distributed dictionary cluster, thereby ensuring that the global analysis dictionary, the structured transform and the classifier respectively reach a consensus,
(5) 
Together with the consensus penalties, the distributed SADL is formulated as
(6) 
where represents the th independent cluster, , , and are respectively the local analysis dictionary, sparse representation, structuring transformation and classifier of the th cluster, and , , are respectively the global analysis dictionary, structuring transformation and classifier. The global variables will be applied to the same efficient classification scheme as the one of SADL.
III Algorithmic Solution
III-A SADL Algorithm
Due to the nonconvexity of the objective function in Eq. (4), an augmented Lagrange formulation with dual variables , and is adopted to seek an optimal solution. The augmented Lagrangian is then written as,
(7) 
where is a tuning parameter. To iteratively seek the optimal solution in Eq. (7), the analysis dictionary and two linear transformations and are first randomly initialized. The sparse representation is initialized as , the zero matrix. In the following equations, Eq. (8) to Eq. (22), the auxiliary variables , , and are introduced to guarantee the convergence of the algorithm. Variables whose superscripts are not enclosed in parentheses are temporary variables of intermediate steps in the calculation. The variables are alternately updated, each while fixing the others, resulting in the following steps:
(1) Fix , , and , and update :
(8) 
where is the elementwise soft thresholding operator, and , , and are as follows:
(9) 
(10) 
(11) 
(2) Fix , , and , and update :
(12) 
(13) 
(14) 
(3) Fix , , and , and update :
(15) 
(16) 
(4) Fix , , and , and update :
(17) 
The analytical solution of Eq. (17) can be regularized as
(18) 
where is also a tuning parameter, chosen in the usual way.
(5) Fix , , , and , and update :
(19) 
(6) Fix , , , and , and update :
(20) 
The dual variables , are updated as
(21) 
(22) 
In contrast to previous ADL techniques, which train a dictionary by updating a single row of the dictionary, i.e., one atom, at a time to avoid a trivial solution, we update a set of rows in a single step at each iteration. A fast convergence rate of the algorithm is also guaranteed by linearized ADM [35] and by the closed-form solution for the dictionary given in Eq. (18). The proposed SADL algorithm is summarized in Algorithm 1. The code is released at https://github.com/wtang0512/DemoofSADL.
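Two building blocks named in the steps above, the elementwise soft thresholding operator and a regularized closed-form dictionary update, can be sketched as follows. This is not the authors' released code. The soft thresholding formula is the standard one; the dictionary update assumes, purely for illustration, that the subproblem is a least-squares fit of the representation by the transformed data plus a squared Frobenius penalty with weight `delta`, which may differ from Eq. (17) and Eq. (18). The names `soft_threshold`, `ridge_dictionary_update`, `Omega`, `U`, `X` and `delta` are hypothetical.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Elementwise soft thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def ridge_dictionary_update(U, X, delta):
    """Closed-form minimizer of 0.5*||U - Omega X||_F^2 + 0.5*delta*||Omega||_F^2.

    Assumed subproblem (not necessarily Eq. (17)); the solution is
    Omega = U X^T (X X^T + delta I)^{-1}.
    """
    A = X @ X.T + delta * np.eye(X.shape[0])   # symmetric positive definite for delta > 0
    # Omega A = U X^T is solved through the transposed system A Omega^T = X U^T.
    return np.linalg.solve(A, X @ U.T).T

# Toy check of the thresholding operator.
Z = np.array([[-2.0, -0.3, 0.0], [0.5, 1.5, 3.0]])
S = soft_threshold(Z, tau=1.0)

# Toy check of the dictionary update: recover a known Omega from U = Omega X.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))           # 5 features, 40 samples
Omega_true = rng.standard_normal((8, 5))   # 8 atoms, one per row
Omega = ridge_dictionary_update(Omega_true @ X, X, delta=1e-8)
```

Because the update solves for all rows of the dictionary in one linear solve, it reflects the point made above that a set of rows is updated in a single step rather than one atom at a time.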
III-B Distributed SADL Algorithm
The distributed SADL is similarly expressed in the augmented Lagrangian function as
(23) 
To minimize such an objective function, each variable is alternately updated while fixing the others. The distributed SADL algorithm is presented in Algorithm 2.
IV Classification Procedure
Both SADL and Distributed SADL have the same classification procedure because the global analysis dictionary , transforming matrix and classifier are obtained from the algorithms. With the analysis dictionary in hand, an observed image can be quickly and sparsely encoded as . This is in stark contrast to SDL, for which a sparse representation is obtained by solving a nonsmooth optimization as: and highlights the remarkable improvement ADL provides. Our proposed SADL, which naturally enjoys the same encoding properties as ADL, efficiently yields a structured sparse representation of the signal as well. Figure 2 shows an example of the structured representations obtained from the Scene 15 dataset.
As shown, the result reflects the desired block diagonal structure. The ultimate desired classification goal of is accomplished by . Figure 3 depicts for the example in Figure 2 where the horizontal axis is image index, and the vertical axis reflects the class labels, which are computed according to,
(24) 
shown as the brightest ones in Figure 3.
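The test-time procedure described in this section can be sketched in a few lines. The sketch assumes that encoding is the product of the analysis dictionary with the data, that class scores are the product of the linear classifier with the encoding, and that the label is the index of the largest score (the "brightest" entry of each column); the names `classify`, `Omega`, `W` and `X` are hypothetical stand-ins for the learned quantities.

```python
import numpy as np

def classify(Omega, W, X):
    """Encode the columns of X by one matrix product, then pick the top-scoring class.

    Omega: (atoms, features) analysis dictionary
    W:     (classes, atoms) linear classifier
    X:     (features, samples) test data, one sample per column
    """
    U = Omega @ X                  # analysis encoding: no optimization at test time
    scores = W @ U                 # (classes, samples) score map
    return np.argmax(scores, axis=0)

# Toy example with 2 features, 2 atoms and 3 classes.
Omega = np.eye(2)
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.0, -1.0], [0.0, 3.0, -1.0]])
predicted = classify(Omega, W, X)
print(predicted)
```

The whole procedure costs two matrix products and one argmax per batch, which is the reason the encoding and classification time is so short compared with solving a nonsmooth problem per test sample.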
V Convergence
Since we have used linearized ADM method to solve our nonconvex objective function, are introduced as the auxiliary variables. We additionally have the following
Theorem 1.
Suppose that . There exist positive values only depending on the initialization such that for the sequence converges to the following set of bounded feasible stationary points of the Lagrangian (the norm is any norm that is continuous with respect to the two-norm of the components, for example the sum of their two-norms; also, the function is treated as a (convex) function of , which is constant with respect to components other than ):
where is the smooth part of , i.e.,
According to Theorem 1, if we initialize large enough, Algorithm 1 not only converges, but also generates the variable sequences with a final convergence to the stationary points. The proof of Theorem 1 can be found in Appendix A.
VI Experiments and Results
We now evaluate our proposed SADL method on five popular visual classification datasets that have been widely used in previous works and have known performance benchmarks. They include the Extended YaleB face dataset [36], the AR face dataset [37], the Caltech 101 object categorization dataset [38], the Caltech 256 object dataset [39], and the Scene 15 scene image dataset [40].
In our experiments, we provide a comparative evaluation of six state-of-the-art techniques and our proposed technique, including classification accuracy as well as training and testing times. All our experiments and competing algorithms are implemented in Matlab 2015b on a server with a 2.30GHz Intel(R) Xeon(R) CPU. For a fair comparison, we measure the performance of each algorithm by repeating the experiment over 10 realizations. The testing time is defined as the average processing time to classify a single image. In our tables, the accuracy in parentheses with the associated citation is the one reported in the original paper. The difference between the accuracy we obtain and the originally reported one might be caused by different splits of the training and testing samples.
VI-A Parameter Settings
In our proposed SADL method, and maximum iteration are tuning parameters. The parameter controls the contribution of the sparsity, and the parameter controls the learned analysis dictionary, while is the maximum iteration number. We replace and by their expressions, and insert them in the optimization formula. We choose for all the experiments , and dictionary size by 10-fold cross validation on each dataset. In addition, we also optimally tuned the parameters of all competing methods to ensure their best performance.
VI-B State-of-the-art Methods
We compare our proposed SADL and Distributed SADL (DSADL) with the following competing techniques. The first one is a baseline, which uses the ADL method to learn a sparse representation and subsequently trains a Support Vector Machine (SVM) to classify images based on such sparse representations (ADL+SVM) [27]. A penalty term is included to avoid similar atoms and minimize false positives. The second one is the classical Sparse Representation based Classification (SRC) [13]. For this method, we do not need to train a dictionary. Instead, we use the training images as the atoms in the dictionary. In the testing phase, we obtain the sparse coefficients based on such a dictionary. The third technique is a state-of-the-art dictionary learning method, called Label Consistent K-SVD (LC-KSVD) [17], which forces the category labels to be consistent with classification. We select LC-KSVD2 from [17] for comparison, because it has a better classification performance. The fourth method is Discriminative Analysis Dictionary Learning (DADL) [28], which incorporates a topological structure and distinct class representations into the ADL framework in order to make each class discriminative. A 1-nearest-neighbor classifier is then used to assign the label. The fifth technique, Class-aware Analysis Dictionary Learning (CADL) [29], learns class-specific analysis dictionaries and jointly learns a universal SVM based on the concatenated class-specific coefficients of each class. Finally, we compare our method with the Synthesis K-SVD based Analysis Dictionary Learning (SKSVDADL) [30], which jointly learns ADL and a linear classifier and is solved by the K-SVD method.
VI-C Extended YaleB
The Extended YaleB face dataset contains in total 2414 frontal face images of 38 persons under various illumination and expression conditions, as illustrated in Figure 4. Due to such illumination and expression variation, YaleB is intended to test the robustness to intra-class variation. Each person has about 64 images, each cropped to pixels. We project each face image onto an n-dimensional random face feature vector. The projection is performed by a randomly generated matrix with a zero-mean normal distribution whose rows are normalized. This procedure is similar to the one in [17]. In our experiment, n is 504, i.e., each Extended YaleB face image is reduced to a 504-dimensional feature vector. Then, we randomly choose half of the images for training, and the rest for testing. The dictionary size is set to 1216 atoms, and .

| Methods | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
|---|---|---|---|
| ADL+SVM [27] | 82.91% | 91.78 | 1.13 |
| SRC [13] | 96.51% | No Need | 3.66 |
| LC-KSVD [17] | 83.31% (96.7% [17]) | 123.07 | 1.60 |
| DADL [28] | 97.35% (97.7% [28]) | 10.05 | 4.55 |
| CADL [29] | 97.05% | 130.83 | 9.72 |
| SKSVDADL [30] | 96.14% (96.9% [30]) | 113.78 | 1.34 |
| SADL | 96.35% | 39.23 | 7.61 |
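The random-face projection used for the Extended YaleB features can be sketched as below. This is a hedged illustration, not the authors' preprocessing code: the function name `random_face_features`, the seed handling, the toy image size, and the choice of the Euclidean norm for row normalization are assumptions (the norm is not legible in the text above).

```python
import numpy as np

def random_face_features(images, n_features, seed=0):
    """Project flattened images (one per column) with a row-normalized Gaussian matrix."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n_features, images.shape[0]))   # zero-mean normal entries
    R /= np.linalg.norm(R, axis=1, keepdims=True)            # unit Euclidean norm per row (assumed)
    return R @ images, R

# Ten toy "images" of 32x32 pixels, flattened to columns (toy size, not the dataset's).
images = np.random.default_rng(1).random((32 * 32, 10))
features, R = random_face_features(images, n_features=504)
print(features.shape)
```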
The classification results and the training and testing times are summarized in Table I. Although the accuracy of the SADL method is slightly lower than that of SRC, DADL and CADL, it is still comparable, and it is higher than that of SKSVDADL, LC-KSVD and ADL+SVM. SADL is substantially more efficient than the others in terms of numerical complexity.
For a more thorough evaluation, we compare SADL with LC-KSVD, CADL and SKSVDADL for different dictionary sizes, and display the classification accuracy and training times in Figures 5 and 6, which are based on the average of ten realizations. We ran our experiments for dictionary sizes of 38, 152, 266, 380, 494, 608, 722, 836, 950, 1064, 1178 and 1216 (the full training set size). The ADL methods, SADL, SKSVDADL and CADL, exhibit a more stable accuracy than LC-KSVD, an SDL method. In particular, the accuracy of LC-KSVD significantly decreases when the dictionary size approaches the training sample size. This significant decrease in accuracy may be caused by the trivial solution of the dictionary in SDL. In addition, our method clearly has a much higher classification accuracy than LC-KSVD and an accuracy very similar to that of SKSVDADL when the dictionary size is small. As the dictionary size increases, SADL achieves a better accuracy than SKSVDADL and approaches the accuracy of CADL. Although the accuracy of SADL is slightly lower than that of CADL, SADL is much faster than LC-KSVD, SKSVDADL and CADL in the training phase, especially when the dictionary size becomes larger.
VI-D AR Face
The AR Face dataset has 2600 color images of 50 females and 50 males with more facial variations than the Extended YaleB database, such as different illumination conditions, expressions and facial disguises. AR is also used to test the robustness to large intra-class variation. Each person has about 26 images of size . Figure 7 shows some sample images of faces with sunglasses or scarves. The features of the AR face images are extracted in the same way as those of the Extended YaleB face images, but we project each image to a dimensional feature vector, similarly to the setting in [17]. For each person, 20 images are randomly selected as a training set and the other 6 images are used for testing. The dictionary size of the AR dataset is set to 2000 atoms, , and .
| Methods | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
|---|---|---|---|
| ADL+SVM [27] | 90.40% | 218.54 | 9.10 |
| SRC [13] | 97.10% | No Need | 7.41 |
| LC-KSVD [17] | 87.78% (97.8% [17]) | 169.35 | 2.00 |
| DADL [28] | 98.32% (98.7% [28]) | 47.76 | 2.68 |
| CADL [29] | 98.52% (98.8% [29]) | 313.37 | 1.34 |
| SKSVDADL [30] | 97.38% (97.7% [30]) | 113.78 | 1.34 |
| SADL | 97.17% | 32.60 | 1.33 |
The classification results as well as the training and testing times are summarized in Table II. Compared with the other methods, our proposed SADL achieves a comparable result with the fastest training and testing time. The classification accuracy is lower than that of DADL, CADL and SKSVDADL, and higher than that of SRC and LC-KSVD. However, our method is about 1000 times faster than SRC and LC-KSVD in the testing phase, and 10 times faster than DADL and SKSVDADL. Although SADL is only slightly faster than CADL in testing, its training time is one-tenth that of CADL.
VI-E Caltech 101
The Caltech 101 dataset has 101 different categories of different objects and one non-object category. Most categories have around 50 images. Figure 8 gives some examples from the Caltech 101 dataset. Since the images in this dataset are left-right aligned and rotated, Caltech 101 contains many intra-class scaling variations, diverse color patterns and inter-class common features. We extract dense Scale-invariant Feature Transform (SIFT) descriptors for each image from patches and with a pixels step. Then, we apply a spatial pyramid method [40] to the dense SIFT features with three segmentation sizes , , and to capture the objects' features at different scales. At the same time, a size codebook is trained by k-means clustering for spatial pyramid features. Spatial pyramid features of each subregion are then concatenated together as a vector to represent one image. Due to the sparse nature of the spatial pyramid features, we use PCA to reduce each feature to dimensions. In our experiment, 30 images per class are randomly chosen as training data, and the other images are used as testing data. All the steps and settings follow [17]. The dictionary size is set to 3060, and .
| Methods | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
|---|---|---|---|
| ADL+SVM [27] | 66.75% | 1943.47 | 1.33 |
| SRC [13] | 70.70% | No Need | 4.34 |
| LC-KSVD [17] | 73.67% (73.6% [17]) | 2144.90 | 2.49 |
| DADL [28] | 71.77% (74.6% [28]) | 233.49 | 7.90 |
| CADL [29] | 76.83% (75% [29]) | 9896.46 | 4.86 |
| SKSVDADL [30] | 73.39% (74.4% [30]) | 182.71 | 2.49 |
| SADL | 74.45% | 847.50 | 4.76 |
| DSADL | 73.49% | - | 8.10 |
The classification results and the training and testing times are summarized in Table III. Our proposed SADL achieves the second highest accuracy, while costing only one-tenth of the training time of CADL, which obtains the highest accuracy. SADL again has the shortest encoding time, which is around 10000 times faster than LC-KSVD and 10 times faster than DADL and SKSVDADL. Note that the distributed SADL (DSADL) used only 510 atoms, but it still achieves a comparable result with the fastest testing time.
The parameters in DSADL are set as follows: and the penalty coefficients of the communication cost . Figure 9 shows that when the number of groups is increased, the accuracy is lower at first because of the smaller training sample size of each independent variable. However, once the communication between the global variables and the local independent variables is enhanced, the performance rises very quickly to a high generalized accuracy. This demonstrates that Distributed SADL can also obtain a very stable and excellent performance even when the number of groups is large.
|  | Initialization (s) | Variables Updating (s) | Total Training Time (s) | #Training Samples of Each Cluster |
|---|---|---|---|---|
| 1 Cluster | 17.04 | 6471.9 | 6488.94 | 3060 |
| 2 Clusters | 2.62 | 3572.9 | 3575.52 | 1530 |
| 4 Clusters | 0.89 | 3235.5 | 3236.39 | 765 |
| 6 Clusters | 0.78 | 3194.0 | 3194.78 | 510 |
| 10 Clusters | 0.72 | 3148.6 | 3149.32 | 306 |
To further study the efficiency of distributed SADL, we conduct an experiment based on different numbers of clusters, which is shown in Fig. 10. For fairness, we first utilize only one core of our CPU to run SADL, while the 2-cluster experiment uses 2 cores to run DSADL, the 4-cluster experiment uses 4 cores, and so on. The training time and the number of training samples of each cluster are averaged over 10 realizations and are listed in Table IV. It is worth noting that the training time in Table III is based on all 28 cores of the CPU, while the training time in Table IV is based on only one core of the CPU per cluster. We separate the DSADL algorithm into two parts: an initialization part and a variable updating part. The initialization part corresponds to line 1 in Algorithm 2, and the variable updating part runs from line 2 to line 21, i.e., the while loop. The initialization part consists of simple matrix assignments, while the variable updating part involves more matrix computation, such as multiplication and inversion. Fig. 10 shows that the running times of both the initialization part and the variable updating part decrease quickly as the number of clusters increases. The slopes of both curves decrease when more clusters are used, which is due to the fact that the number of training samples in each cluster becomes small enough to affect the computational capability of each CPU core. Since there are three global communication terms in Algorithm 2 after updating the individual dictionaries, transforming matrices and classifiers, the training time with 2 clusters is slightly more than half the running time of 1 cluster (centralized). However, these three terms are not expensive, and Algorithm 2 with 2 clusters is still 1.8 times faster than the centralized one. We observed that the more clusters we use, the more training time is saved. Moreover, the larger the dataset is, the more training time is saved.
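The speedup figures quoted above follow directly from the total training times in Table IV. The short computation below reproduces them; the variable names are illustrative, and the only inputs are the table values.

```python
# Total training time in seconds per number of clusters, copied from Table IV.
total_time = {1: 6488.94, 2: 3575.52, 4: 3236.39, 6: 3194.78, 10: 3149.32}

baseline = total_time[1]   # centralized run (1 cluster)
speedup = {k: baseline / t for k, t in total_time.items()}

for k, s in speedup.items():
    print(f"{k:2d} cluster(s): {s:.2f}x faster than centralized")
```

The ratio for 2 clusters rounds to 1.8, matching the text, and the ratios for 4, 6 and 10 clusters grow only slowly beyond 2, which is consistent with the flattening curves described for Fig. 10.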
VI-F Caltech 256
The Caltech 256 is a relatively larger object dataset, which includes 256 object categories and one clutter category. There are 30608 images in total with various object locations, poses, and sizes. Figure 11 shows examples from the Caltech 256 dataset, in which each category has at least 80 images. Note that Caltech 256 includes no rotation or alignment characteristics. Thus, it contains large intra-class diversity and inter-class similarity, such as object scale, object rotation and common patterns. The features of Caltech 256 images are extracted by using the output features of the last layer before the fully connected layer of ResNet50 [41] with the weights trained on ImageNet. The dimension of each feature is . We randomly sample 15 images from each category for training, and test on the rest. To train the Distributed SADL, the dictionary size is set to 3855, the dataset is divided into subsets (i.e., in Algorithm 2), , and .

| Methods (training samples) | Classification Accuracy (%) | Training Time (s) | Testing Time (s) |
|---|---|---|---|
| ADL+SVM(15) [27] | 66.66% | 3501.44 | 7.67 |
| LC-KSVD(15) [17] | 73.37% | 3118.76 | 3.00 |
| DADL [28] | 72.20% | 417.06 | 5.42 |
| CADL [29] | 75.25% | 5586.21 | 4.83 |
| SKSVDADL [30] | 73.35% | 334.31 | 3.28 |
| CNN Features(15) [42] | 65.70% [42] | - | - |
| SADL(15) | 75.36% | 4829.01 | 2.79 |
| DSADL(15) | 74.38% | - | 2.79 |
| ResFeats50(30) [43] | 75.40% [43] | - | - |
We use Caltech 256 to test both SADL and Distributed SADL. Among the methods trained with 15 samples per class, our SADL achieves the highest accuracy, and our Distributed SADL achieves a comparable accuracy; both have the fastest testing time among the compared methods, even though the dimension of the features is increased. For reference, we also compare our method with two network-based methods [42, 43]. In [42], Zeiler et al. constructed a convolutional network pre-trained on ImageNet, and then learned an adapted convolutional network for Caltech 256 based on the features of the former network. With 15 training samples per class, our accuracy is higher than the CNN result. ResFeats-50 [43] is a more recent convolutional network method with 50 layers, trained with 30 samples per category. Although ResFeats-50 uses twice as many training samples as ours, our result is still very comparable.
VI-G Scene 15
The Scene 15 dataset contains a total of 15 categories of different scenes, and each category has around 200 images. Examples are shown in Figure 12. As different scenes contain many common components, and different components also share a large number of common features, the Scene 15 dataset exhibits a remarkable amount of interclass similarity. Proceeding as for the Caltech 101 dataset, we compute the spatial pyramid features for scene images. A four-level spatial pyramid (i.e., each image is gridded into , , and ) and a codebook of size 200 are used. The final features are obtained by applying PCA to reduce the dimension of the spatial pyramid features to . We randomly pick 100 images per class as training data, and use the remaining images as testing data. The settings and steps follow [17]. The dictionary size is set to 1500, and .
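The per-class random split described above (100 training images per class, the rest for testing) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_per_class`, the fixed seed, and the toy label list with exactly 200 images per class are all assumptions.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, seed=0):
    """Randomly pick n_train indices per class for training; the rest are test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]      # copy so the grouping is left untouched
        rng.shuffle(idxs)
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return train, test


# Toy stand-in for Scene 15: 15 classes with exactly 200 images each (assumed).
labels = [c for c in range(15) for _ in range(200)]
train_idx, test_idx = split_per_class(labels, n_train=100)
print(len(train_idx), len(test_idx))
```

Calling the function with a different `seed` for each run would give the independent realizations over which results are averaged.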
Methods        Classification Accuracy (%)  Training Time (s)  Testing Time (s)
ADL+SVM[27]    80.55%                       484.41             1.73
SRC[13]        91.80%                       No Need            4.06
LC-KSVD[17]    98.83% (92.9%[17])           390.22             1.81
DADL[28]       97.81% (98.3%[28])           33.03              4.62
CADL[29]       98.49% (98.6%[29])           4637.80            6.02
SKSVDADL[30]   96.84% (97.4%[30])           66.79              1.06
SADL           98.50%                       174.20             2.41
The classification results and the training and testing times are summarized in Table VI. Our accuracy is slightly lower than that of LC-KSVD, but is still higher than that of all other methods. Our testing time is faster than those of SRC, DADL and CADL, and our training time is faster than those of CADL, LC-KSVD and ADL+SVM.
VI-H Comparative Evaluation
To investigate the effect of the different constraints in the optimization problem in Eq. (4), we learn ADL while neglecting one of the two constraints at a time, and test the resulting algorithms on 10 realizations. The results on 5 datasets are compared with SADL in Table VII. As there is no linear classifier when training ADL with only the constraint , we assign the class label of an observed image by . For ADL with only the constraint , the labels of images are assigned by the classifier .
ADL + constraints  YaleB  AR  Caltech 101  Scene 15  Caltech 256
95.37%  96.70%  74.44%  98.46%  75.05%  
95.52%  97.12%  74.45%  98.36%  75.35%  
SADL  96.35%  97.13%  74.45%  98.50%  75.36% 
The results show that the two constraints behave similarly when used individually, with the universal classifier performing slightly better, while jointly learning with both constraints achieves the best result. The results also support our goal of mitigating the interclass common features. Accordingly, our algorithm achieves a better performance on Scene 15 and Caltech 256, which have many common features among classes.
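To make the classifier-based labeling concrete, the following sketch shows how a test phase of this kind can be organized: the analysis dictionary is applied as a transform, small coefficients are suppressed, and a linear classifier picks the class with the largest score. Everything here is an assumption for illustration: the names `Omega`, `W`, and `classify`, the use of hard thresholding as the sparsifying step, and the toy matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def hard_threshold(V, t):
    """Zero out entries whose magnitude is below t (assumed sparsifying step)."""
    out = V.copy()
    out[np.abs(out) < t] = 0.0
    return out


def classify(Omega, W, X, t=0.0):
    """Label each column of X via a linear classifier on its analysis coefficients."""
    U = hard_threshold(Omega @ X, t)   # coefficients, shape (atoms, samples)
    scores = W @ U                     # class scores, shape (classes, samples)
    return np.argmax(scores, axis=0)


# Toy example: 3 atoms, 2 classes, 2 test samples stored as columns of X.
Omega = np.eye(3)
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
X = np.array([[2.0, 0.0],
              [0.1, 1.0],
              [0.0, 1.0]])

labels = classify(Omega, W, X, t=0.5)
print(labels)
```

The test phase costs only two matrix products and an elementwise operation per batch, which is the source of the fast testing times attributed to analysis dictionaries.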
VII Conclusion
We proposed an image classification method referred to as Structured Analysis Dictionary Learning (SADL). To obtain SADL, we impose a structured subspace (cluster) model on the enhanced ADL method, where each class is represented by a structured subspace. The enhancement of ADL is realized by constraining the learning with a classification fidelity term on the sparse coefficients. Our formulated optimization problem was efficiently solved by the linearized ADM method, in spite of its nonconvexity due to bilinearity. Taking advantage of the analysis dictionary, our method achieves a significantly faster testing time. Furthermore, a Distributed SADL (DSADL) was also proposed to address the scalability problem. Both the discriminative structure and the fast testing phase are well preserved in DSADL. Even when the algorithm was run on many clusters, the performance remained stable and comparable to that of the centralized SADL.
Our experiments demonstrate that our approach has at least a comparable, and often a better, performance than state-of-the-art techniques on five well-known datasets and achieves superior training and testing times by orders of magnitude.
A possible future direction for improving our method could be to leverage the discriminative nature of the synthesis dictionary and the efficiency of the analysis dictionary together. This could yield greater discriminative power together with high efficiency.
Appendix A Proof of Algorithm 1
At each iteration, compute the updates in Eqs. (26) to (33).
Let us proceed by introducing two simple lemmas:
Lemma 2.
Consider a differentiable function with an Lipschitz continuous derivative and another arbitrary convex function . For any arbitrary point define
where is a step size and
Then, we have
where .
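As a numerical illustration of the kind of statement such a lemma makes, the sketch below checks the standard sufficient-decrease property of a single proximal gradient step: for f with an L-Lipschitz gradient, convex g, and step size mu, the point x_new = prox_{mu g}(x - mu grad f(x)) satisfies F(x_new) <= F(x) - (1/mu - L/2) ||x_new - x||^2 with F = f + g. The concrete choices here are assumptions for illustration only: f is a least-squares term, g is an l1 penalty whose proximal operator is soft thresholding, and all names and sizes are hypothetical rather than taken from the lemma.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def objective(x, A, b, lam):
    """F(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))


def prox_grad_step(x, A, b, lam, mu):
    """One proximal gradient step with step size mu."""
    grad = A.T @ (A @ x - b)                 # gradient of the smooth part f
    return soft_threshold(x - mu * grad, mu * lam)


rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 0.1

L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of grad f
mu = 1.0 / L                                 # step size (assumed choice)

x = rng.standard_normal(10)
x_new = prox_grad_step(x, A, b, lam, mu)

decrease = objective(x, A, b, lam) - objective(x_new, A, b, lam)
bound = (1.0 / mu - L / 2.0) * np.sum((x_new - x) ** 2)
print(decrease >= bound)
```

The inequality uses only the Lipschitz continuity of the gradient of f and the convexity of g, so it does not require f itself to be convex.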
Proof.
Notice that by the definition of the proximal operator , there exists a subgradient such that
where . Then, we have