I Introduction
Sparse representation (sparse coding) [1], which uses a few atoms from a dictionary to construct a signal, is a popular method to acquire, compress and represent signals. In particular, a vector $y$ has a sparse representation on a dictionary $D$ when the corresponding coefficient vector $x$ is sparse, i.e., the majority of its coefficients are zero. There are many successful applications of sparse representation, such as face recognition [2], image denoising [3], blind source separation [4], and feature selection [5], [6]. It is well known that sparse representation is computationally intensive [43]. The properties of the dictionary can affect the sparse representation significantly, so constructing a proper dictionary is important. There are two major approaches to dictionary construction. The first is the analytic approach, in which DCT bases, wavelets, curvelets and other nonadaptive functions are used as atoms. The second is the learning-based approach, such as unsupervised learning for dictionary construction [7] and online dictionary learning [9], [32], which uses machine learning methods to construct the dictionary.

Active learning [12], [13] is a machine learning [14], [15] paradigm in which the learning algorithm [11] selects certain unlabeled data for labeling through interactive queries to the user (or another information source). It is generally effective when labeled data are scarce compared with unlabeled data. In active learning, the number of samples needed to learn a concept (or hypothesis) can be much smaller than the number required in classic supervised learning. There are three common query strategies in active learning: (1) query by uncertainty sampling [16], in which the learner selects the sample whose label it is most uncertain about; (2) query by committee [17], in which the query sample is the one on which a committee of models disagrees most; and (3) query by generalization error [18], in which the query sample has the largest generalization error. In our work, the reconstruction error and the sparse representation based classification error are used as the query criteria of active learning.

The contributions of this paper are summarized as follows:

A new dictionary learning method via active learning in sparse representation is presented. In particular, the reconstruction errors and the sparse representation based classification errors are used as query criteria in the active learning.

The proposed dictionary is applied to UCI [19] data sets (binary-category and multiple-category) and a face data set. The reconstruction and classification results are obtained in the process of sparse coding.

The proposed dictionary learning method is compared with other methods, including clustering-based dictionaries, dictionary learning with structure incoherence, and the whole-training-data dictionary. The experimental results on the UCI data sets and the face data set demonstrate the capability and efficiency of the proposed method.
The rest of the paper is organized as follows. Section II presents related work on dictionary learning, sparse representation based classification and query strategies in active learning. Section III presents the proposed dictionary learning via active learning. Section IV presents the experiments on the UCI data sets. Section V presents the comparison results on the Extended Yale B data set. Section VI concludes the paper and discusses future plans.
II Related Work
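Assume a training data set T whose labels are L. Let $y$ denote a signal, $D$ the dictionary and $x$ the coefficient vector of $y$ on $D$. In the original form of sparse representation, the sparsity of $x$ is defined as the number of its nonzero elements. This constraint is called the $\ell_0$ norm:
$$\min_{x} \|x\|_0 \quad \text{s.t.} \quad y = Dx \tag{1}$$
However, solving (1) is NP-hard. Fortunately, under the Restricted Isometry Property (RIP) constraints, the solution under the $\ell_0$ norm is equivalent to the solution under the $\ell_1$ norm. The sparse representation can then be expressed as:
$$\min_{x} \|x\|_1 \quad \text{s.t.} \quad y = Dx \tag{2}$$
Under the $\ell_1$ norm, the problem becomes a convex optimization problem, and the regularized least squares method [21] is used to solve it:
$$\min_{x} \|y - Dx\|_2^2 + \lambda \|x\|_1 \tag{3}$$
where $\lambda > 0$ is the regularization parameter.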
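As a minimal illustration of the $\ell_1$-regularized least squares problem, the sketch below uses the iterative shrinkage-thresholding algorithm (ISTA). This is a swapped-in, simpler solver, not the interior-point method of [21]; the names `ista`, `soft_threshold` and `objective`, the symbols `D`, `y`, `lam`, and the toy data are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(D, y, lam=0.1, n_iter=500):
    """Approximately minimise ||y - D x||_2^2 + lam * ||x||_1 with ISTA."""
    # Lipschitz constant of the gradient of the quadratic term: 2 * sigma_max(D)^2.
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - y)
        x = soft_threshold(x - grad / lipschitz, lam / lipschitz)
    return x


def objective(D, y, x, lam):
    """Value of the l1-regularised least squares objective at x."""
    return float(np.sum((y - D @ x) ** 2) + lam * np.sum(np.abs(x)))


# Toy problem (hypothetical data): a signal built from 3 of 50 random atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = D @ x_true

x_hat = ista(D, y, lam=0.05, n_iter=2000)
print("objective at zero :", objective(D, y, np.zeros(50), 0.05))
print("objective at x_hat:", objective(D, y, x_hat, 0.05))
```

With step size $1/L$ the objective does not increase from one iteration to the next, which makes ISTA a convenient reference when a dedicated solver is not available, although it converges more slowly than interior-point or accelerated methods.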
The choice of the dictionary is key to effective sparse representation, and there is a large body of research on this topic. In the analytic approach, predefined functions are used to construct the dictionary. Curvelets [22], which track the shape of the discontinuity set, supply efficient and near-optimal representations of smooth objects. Shearlets [23], which are obtained from the action of dilations, translations, and shear transformations, exhibit geometric and mathematical properties suited to image representation. Bandelets [24], which specify the geometry as a vector field, improve image compression and noise reduction performance.
In the learning-based approach, machine learning methods are used to construct the dictionary from the training data. The method of optimal directions (MOD) [25] uses the least-squares error to update the dictionary iteratively:
$$D^{(k+1)} = Y X^{(k)\top} \left( X^{(k)} X^{(k)\top} \right)^{-1} \tag{4}$$
where $k$ is the iteration index, $Y$ is the training data matrix and $X^{(k)}$ is the sparse vector matrix based on the $k$th dictionary $D^{(k)}$. In K-SVD [26], the atoms in the dictionary are updated sequentially; the method is related to k-means, and each atom is modified based on the examples associated with it. Online dictionary learning [9], which is based on stochastic approximations, adapts the dictionary to large data sets with millions of samples. In efficient sparse coding algorithms [27], two least-squares optimization problems ($\ell_1$-norm regularized and $\ell_2$-norm constrained) are solved iteratively:
$$\min_{D, X} \|Y - DX\|_F^2 + \lambda \sum_i \|x_i\|_1 \quad \text{s.t.} \quad \|d_j\|_2^2 \le c \;\; \forall j \tag{5}$$
where $x_i$ is the $i$th column of $X$, $d_j$ is the $j$th atom of $D$, and $\lambda$ and $c$ are predefined parameters. In the learning process, this approach optimizes either the dictionary $D$ or the sparse vector matrix $X$ while the other is fixed.
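The least-squares dictionary update used by MOD can be sketched in a few lines. The function name `mod_update`, the small ridge term added for numerical stability, and the synthetic matrices are assumptions for illustration; the sparse coding step that would alternate with this update is omitted.

```python
import numpy as np


def mod_update(Y, X, ridge=1e-10):
    """One MOD step: the least-squares dictionary D = Y X^T (X X^T)^{-1}."""
    # Small ridge term (an assumption here) guards against a singular X X^T.
    gram = X @ X.T + ridge * np.eye(X.shape[0])
    # gram is symmetric, so solving gram Z = X Y^T gives Z = D^T.
    return np.linalg.solve(gram, X @ Y.T).T


# Synthetic check (hypothetical data): recover a known dictionary from exact codes.
rng = np.random.default_rng(1)
D_true = rng.standard_normal((5, 8))                             # 8 atoms of dimension 5
X = rng.standard_normal((8, 40)) * (rng.random((8, 40)) < 0.5)   # sparse codes
Y = D_true @ X
D_new = mod_update(Y, X)
print("max abs deviation from D_true:", np.abs(D_new - D_true).max())
```

In a full alternating scheme this update would be followed by re-coding every training sample on the new dictionary, and the atoms are commonly renormalized to unit length between the two steps.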
In sparse coding, the reconstruction error is the difference between the original testing data $y$ and the result of its sparse representation $D\hat{x}$, where $\hat{x}$ is the sparse coefficient vector of $y$ on the dictionary $D$:
$$e(y) = \|y - D\hat{x}\|_2 \tag{6}$$
It is an important criterion for evaluating the quality of the dictionary.
Recently, sparse representation based classification (SRC) [20] was proposed and successfully applied to face image classification. In SRC, the class-wise reconstruction errors are used to classify testing data. For each class $i$, a function $\delta_i$ is defined that selects the sparse coefficients associated with the $i$th category and sets the others to zero. The SRC process can then be expressed as:
$$\text{class}(y) = \arg\min_i \|y - D\,\delta_i(\hat{x})\|_2 \tag{7}$$
The classification accuracy is used as another criterion to evaluate the quality of the dictionary. The process of SRC is shown in Algorithm 1.
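The class-wise residual rule of SRC can be sketched as follows. The sketch assumes that a sparse code `x` of the test sample `y` on the dictionary `D` has already been computed by some $\ell_1$ solver; `src_classify`, `atom_labels` and the small numeric example are hypothetical names and data.

```python
import numpy as np


def src_classify(D, atom_labels, x, y):
    """Return the class with the smallest class-wise residual and all residuals."""
    atom_labels = np.asarray(atom_labels)
    residuals = {}
    for c in np.unique(atom_labels):
        x_c = np.where(atom_labels == c, x, 0.0)      # keep only class-c coefficients
        residuals[c.item()] = float(np.linalg.norm(y - D @ x_c))
    return min(residuals, key=residuals.get), residuals


# Four atoms of dimension 3: the first two belong to class 0, the last two to class 1.
D = np.array([[1.0, 0.8, 0.0, 0.1],
              [0.0, 0.6, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.4]])
atom_labels = [0, 0, 1, 1]
x = np.array([0.0, 0.1, 0.7, 0.3])     # code dominated by the class-1 atoms
y = D @ x
label, residuals = src_classify(D, atom_labels, x, y)
print(label, residuals)
```

Because the code puts most of its weight on the class-1 atoms, reconstructing `y` from those atoms alone leaves a much smaller residual than reconstructing it from the class-0 atom, so the sample is assigned to class 1.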
All active learning methods involve assessing the informativeness of the training data. The most informative sample is chosen according to one of several query strategies [28]. Query by uncertainty sampling [16] and query by generalization error [18] are two classical strategies for active learning.
In uncertainty sampling, an active learner selects the sample whose label it is least confident about. This method is usually straightforward for probabilistic learning models. Generally, the queried sample is decided by the posterior probability:
$$x^*_{LC} = \arg\max_{x} \; 1 - P_\theta(\hat{y} \mid x) \tag{8}$$
where $x$ here denotes a candidate sample and $\hat{y} = \arg\max_{y} P_\theta(y \mid x)$, i.e., the classification is decided by the highest posterior probability under the model $\theta$. However, the least confident method only considers the most probable label and ignores the information about the remaining labels. The margin sampling method is proposed in [29] as a multi-class uncertainty criterion:
$$x^*_{M} = \arg\min_{x} \; P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x) \tag{9}$$
where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable classification labels under the model $\theta$. Margin sampling aims to utilize the posterior probability of the second most likely label. Samples with small margins are difficult for the classifier to decide, so obtaining their true labels improves the discriminative capability of the model effectively. The most popular uncertainty sampling strategy uses entropy to evaluate the training data:
$$x^*_{H} = \arg\max_{x} \; -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x) \tag{10}$$
where $y_i$ ranges over the possible labels. Entropy-based queries have been successfully applied to complex structured samples, such as trees [30] and sequences [31].
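The three uncertainty measures can disagree on which sample to query, as the small sketch below illustrates. The function names and the posterior probabilities in `pool` are made up for the example; in practice the probabilities would come from the current model.

```python
import math


def least_confident(p):
    """One minus the probability of the most likely label."""
    return 1.0 - max(p)


def margin(p):
    """Gap between the two most probable labels (small means uncertain)."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]


def entropy(p):
    """Shannon entropy of the posterior distribution (natural logarithm)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


# Hypothetical posteriors over three labels for three unlabeled samples.
pool = {
    "a": [0.5, 0.5, 0.0],
    "b": [0.4, 0.3, 0.3],
    "c": [0.9, 0.05, 0.05],
}

query_lc = max(pool, key=lambda k: least_confident(pool[k]))
query_margin = min(pool, key=lambda k: margin(pool[k]))
query_entropy = max(pool, key=lambda k: entropy(pool[k]))
print(query_lc, query_margin, query_entropy)
```

Here the least confident and entropy criteria both pick sample "b", whose probability mass is spread over all three labels, while margin sampling picks sample "a", whose two leading labels are tied.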
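In the case of generalization error, the key idea is to estimate the future error of the model after a new labeled sample $\langle x, y_i \rangle$ is added to the training data. The samples with the minimal expected error are selected in active learning. One typical method is to minimize the expected 0/1-loss:
$$x^*_{0/1} = \arg\min_{x} \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}(\hat{y} \mid x^{(u)}) \right) \tag{11}$$
where $\theta^{+\langle x, y_i \rangle}$ stands for the new model after the new data $\langle x, y_i \rangle$ is added to the labeled training set, and the inner sum runs over the $U$ samples $x^{(u)}$ in the unlabeled pool.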
III Active Dictionary Learning (ADL)
In the active learning scenario, the query data are usually the most informative data in the unlabeled data pool. In dictionary learning for sparse representation, if the atoms in the dictionary are the most informative samples in the training data, the dictionary will be representative and meaningful in the coding process.
(12) 
(13) 
(14) 
In the proposed ADL method, the reconstruction error and the sparse representation based classification error are used to select the most informative data in the training data:
(15) 
This idea shares the properties of query by uncertainty sampling [16] and query by generalization error [18] in the active learning paradigm.
Algorithm 2 shows the details of the proposed method. First, a random dictionary is established for testing. Then the reconstruction error and the SRC error are calculated for each training sample. This process is repeated K times, which reduces the selection bias. Finally, the total errors accumulated over the K iterations are recorded and used to select the atoms. The classification result is either right or wrong, so the classification error is a binary output. In our method, we use the mean of the reconstruction errors to normalize the classification error; the details are shown in step 10 of Algorithm 2. is an empirical index, which can be modified with different requirements. In our experiment, and are used for empirical study.
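The selection procedure described above can be sketched roughly as follows. This is only a sketch under several assumptions, not a transcription of Algorithm 2: the random dictionary is drawn from the training samples, a plain ISTA loop stands in for the sparse coding tool, the binary classification error is scaled by the mean reconstruction error times a hypothetical `weight` parameter, and the samples with the largest accumulated error are kept as atoms. All function and parameter names are illustrative.

```python
import numpy as np


def sparse_code(D, y, lam=0.05, n_iter=200):
    """ISTA for min ||y - D x||_2^2 + lam ||x||_1 (stand-in l1 solver)."""
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = x - 2.0 * D.T @ (D @ x - y) / lipschitz
        x = np.sign(v) * np.maximum(np.abs(v) - lam / lipschitz, 0.0)
    return x


def src_label(D, atom_labels, x, y):
    """SRC decision: class whose atoms give the smallest residual."""
    classes = np.unique(atom_labels)
    res = [np.linalg.norm(y - D @ np.where(atom_labels == c, x, 0.0)) for c in classes]
    return classes[int(np.argmin(res))]


def adl_select(T, labels, n_atoms, K=5, rand_size=None, weight=1.0, seed=0):
    """Rank training samples (columns of T) by error accumulated over K random dictionaries."""
    rng = np.random.default_rng(seed)
    n = T.shape[1]
    rand_size = rand_size or n_atoms
    total = np.zeros(n)
    for _ in range(K):
        idx = rng.choice(n, size=rand_size, replace=False)    # random dictionary
        D, d_labels = T[:, idx], labels[idx]
        rec = np.empty(n)
        wrong = np.empty(n)
        for j in range(n):
            x = sparse_code(D, T[:, j])
            rec[j] = np.linalg.norm(T[:, j] - D @ x)           # reconstruction error
            wrong[j] = float(src_label(D, d_labels, x, T[:, j]) != labels[j])
        # Binary classification error scaled by the mean reconstruction error.
        total += rec + weight * rec.mean() * wrong
    # Assumption: the samples with the largest accumulated error become atoms.
    return np.argsort(total)[::-1][:n_atoms]


# Hypothetical two-class toy data: 30 samples of dimension 4.
rng = np.random.default_rng(0)
T = np.hstack([rng.normal(1.0, 1.0, (4, 15)), rng.normal(-1.0, 1.0, (4, 15))])
T /= np.linalg.norm(T, axis=0)
labels = np.array([0] * 15 + [1] * 15)
atoms = adl_select(T, labels, n_atoms=6, K=3)
print("selected training indices:", atoms)
```

Accumulating the errors over several random dictionaries is what reduces the dependence of the ranking on any single random draw.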
The proposed method is related to AdaBoost.M1 [32]. In AdaBoost.M1, the weights of the training data are initialized uniformly, and a sequence of weak classifiers is then applied to the training data using the corresponding weights. At each step, the training data that were misclassified at the previous step have their weights increased, whereas the weights are decreased for the training data that were classified correctly. The update process is:
(16) 
where the weights are normalized at each step. In our method, the reconstruction errors and the classification errors obtained with the different random dictionaries are used to rank the training data, which is similar to the weight updates in AdaBoost.M1.
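For reference, one reweighting step of AdaBoost.M1 can be sketched as below, using the form given by Freund and Schapire [32], in which correctly classified samples are multiplied by $\beta = \epsilon / (1 - \epsilon)$ and the weights are then renormalized. Whether this matches the exact notation of (16) is an assumption; the function name and the five-sample example are illustrative.

```python
def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights: current sample weights (summing to 1)
    correct: booleans, True where the weak classifier was right
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error
    if not 0.0 < eps < 0.5:
        raise ValueError("AdaBoost.M1 requires a weighted error in (0, 0.5)")
    beta = eps / (1.0 - eps)
    # Shrink the weights of correctly classified samples, keep the others.
    new = [w * beta if ok else w for w, ok in zip(weights, correct)]
    z = sum(new)                                                # normalisation constant
    return [w / z for w in new]


n = 5
w0 = [1.0 / n] * n                                              # uniform initial weights
w1 = adaboost_m1_update(w0, [True, True, True, True, False])
print(w1)
```

After normalization the single misclassified sample carries half of the total weight, so its weight has increased while the weights of the correctly classified samples have decreased, as described in the text.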
IV Experiments on UCI Data Sets
The experiments on different UCI [19] data sets are presented in this section. Both binary-category and multi-category data sets are used. The proposed dictionary is more effective on the multi-category data sets, which are normally difficult in applications. The reconstruction and classification performance is shown for each data set.
IV-A UCI Data Sets
Nine UCI data sets are used in the experiments. Detailed information is shown in Table I, namely the size of the data, the number of categories and the number of features. “Car evaluation” and “vowel recognition” are binary-category data sets. “Contraceptive method choice”, “wine” and “cardiotocography” have three categories. The rest are classic multi-category data sets.
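Table I: UCI data sets used in the experiments.

| Name | Feature number | Total size | Category number |
| --- | --- | --- | --- |
| car evaluation | 6 | 1728 | 2 |
| vowel recognition | 10 | 528 | 2 |
| contraceptive method choice | 9 | 1473 | 3 |
| wine | 13 | 178 | 3 |
| cardiotocography | 22 | 2126 | 3 |
| glass | 10 | 214 | 7 |
| image segmentation | 19 | 2310 | 7 |
| libras movement | 90 | 360 | 15 |
| breast tissue | 9 | 106 | 6 |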
IV-B Experiment Setting
For each data set, 5-fold cross-validation is used: 80% of the data are used as training data to establish the dictionary, and the remaining 20% are left for testing. In order to investigate the performance of different dictionary sizes, the dictionary size ranges from 10% (0.1) to 50% (0.5) of the size of the training data. The dictionary based on active learning is then trained according to Algorithm 2.
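To make the setting concrete, the short sketch below computes the approximate training-set size for one fold and the resulting dictionary sizes. The function name, the use of integer division for the held-out fold, and the rounding of fractional sizes are assumptions for illustration.

```python
def dictionary_sizes(total_size, n_folds=5, rates=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Training-set size for one fold and the dictionary size at each rate."""
    n_test = total_size // n_folds            # one fold is held out for testing
    n_train = total_size - n_test
    # Rounding to the nearest integer is an assumption, not taken from the paper.
    return n_train, {rate: round(rate * n_train) for rate in rates}


n_train, sizes = dictionary_sizes(1728)       # total size of "car evaluation" in Table I
print(n_train, sizes)
```

For the “car evaluation” data set this gives 1383 training samples per fold, so a dictionary size rate of 0.1 corresponds to roughly 138 atoms.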
In order to show the effectiveness of the proposed method, related methods are used for comparison. Two clustering-based dictionary learning methods, the self-organizing map (SOM) based dictionary and the neural gas (NGAS) based dictionary, are included. SOM and NGAS are classic unsupervised learning methods that can maintain the topological properties of the training data. Some sparse coding applications with NGAS are discussed in [34]. The centers trained by SOM and NGAS are used as atoms in our experiments, and the labels of the centers are decided by 5-nearest-neighbor voting. The whole-training-data dictionary is used as the reference in the experiments. For classification, the SRC method is used with all the dictionary learning models. Our sparse coding tool is from [21]. The SOM and NGAS implementations are from the SOM Toolbox [35].

For simplicity, ADL, SOMD, and NGASD denote active dictionary learning, the SOM based dictionary, and the NGAS based dictionary, respectively. The whole-training-data dictionary is abbreviated as WD. It is important to note that WD contains all the training data in its dictionary.
IV-C Results and Discussion
In this section, the comparison results for reconstruction and classification are shown. We list the detailed performance on 4 data sets: “car evaluation” is a binary-category data set, “contraceptive method choice” and “cardiotocography” are three-category data sets, and “breast tissue” has more than three categories. The performance is reported for dictionary size rates of 0.1, 0.2, 0.3, 0.4 and 0.5, where the rate is the number of atoms in the dictionary relative to the total number of training data. The reconstruction errors are the average errors on the testing data, and they have different scales because of the different properties of the data sets. The average results over all 9 data sets are then shown for a comprehensive performance comparison.
Figure 1 shows the performance on the “car evaluation” data. For the reconstruction results in the left subfigure, ADL has relatively small reconstruction errors. When the dictionary size rate is larger than 0.4, the reconstruction errors of ADL are comparable to those of WD. For the classification performance, ADL always has the highest accuracy. When the dictionary size rate is larger than 0.3, the results of ADL reach the level of the WD results.
Figure 2 shows the performance on the three-category data “contraceptive method choice”. For the reconstruction results, the curve of ADL lies below those of all the other dictionary learning methods, and the errors are comparable with those of WD when the dictionary size rates are 0.4 and 0.5. For the classification results in the right subfigure, ADL is competitive with NGASD from the rate 0.2 onward.
The results on another three-category data set, “cardiotocography”, are shown in Figure 3. In the left subfigure, ADL has small reconstruction errors comparable to those of WD. For the classification results, ADL has higher accuracies when the dictionary size rates are 0.3, 0.4 and 0.5.
Figure 4 shows the performance on the “breast tissue” data. For the reconstruction results, the error curve of ADL lies below those of all the other models from the dictionary size rate 0.2 onward. It is interesting to note that the error of WD is high. For the classification performance, ADL ranks first compared with SOMD and NGASD.
In the above figures, ADL has shown some advantages in reconstruction and classification compared with the related methods. Table II lists the average performance over all 9 data sets in our experiments. In detail, the mean accuracy for each dictionary size is shown for the 3 dictionary learning models. The results show that the accuracy of ADL ranks first when the dictionary size rate is larger than 0.1. For the reconstruction, as different data sets have different scales, it is not meaningful to take the average reconstruction error. Instead, considering the mean ranks among the 3 dictionary learning models, we observe that ADL always ranks first across the different dictionary size rates.
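Table II: Mean classification accuracy (%) over the 9 UCI data sets for different dictionary size rates.

| Dictionary size (rate) | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
| --- | --- | --- | --- | --- | --- |
| SOMD | 44.14 | 50.18 | 53.19 | 56.12 | 50.90 |
| NGASD | 46.95 | 46.47 | 46.24 | 46.28 | 44.51 |
| ADL | 46.12 | 58.28 | 67.92 | 70.62 | 74.19 |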
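The ranking of the three models by mean accuracy can be checked directly from the values in Table II, as in the sketch below. Note that this ranks the averaged accuracies; it is not the per-data-set mean rank, which would require the individual results. The helper name `ranks_at` is illustrative.

```python
# Mean accuracies (%) copied from Table II, one value per dictionary size rate.
rates = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracy = {
    "SOMD":  [44.14, 50.18, 53.19, 56.12, 50.90],
    "NGASD": [46.95, 46.47, 46.24, 46.28, 44.51],
    "ADL":   [46.12, 58.28, 67.92, 70.62, 74.19],
}


def ranks_at(col):
    """Rank the models (1 = best) by mean accuracy at one dictionary size rate."""
    order = sorted(accuracy, key=lambda m: accuracy[m][col], reverse=True)
    return {m: order.index(m) + 1 for m in accuracy}


for i, rate in enumerate(rates):
    print(rate, ranks_at(i))

# Average of the per-rate ranks for each model.
mean_rank = {
    m: sum(ranks_at(i)[m] for i in range(len(rates))) / len(rates)
    for m in accuracy
}
print(mean_rank)
```

ADL is second to NGASD at the rate 0.1 and first at every larger rate, which agrees with the statement that its accuracy ranks first when the dictionary size rate is larger than 0.1.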
V Experiments on Face Recognition
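Table III: Classification accuracy of the methods reported in [36] on the Extended Yale B data set.

| Methods | NN | D-KSVD | DLSI | SVM | DLSI* | SRC | FDDL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 0.62 | 0.75 | 0.85 | 0.89 | 0.89 | 0.90 | 0.92 |

Table IV: Classification accuracy of ADL for different dictionary sizes.

| Dictionary Size | 600 | 650 | 700 | 750 |
| --- | --- | --- | --- | --- |
| ADL Accuracy | 0.86 | 0.87 | 0.87 | 0.90 |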
Recently, Fisher Discrimination Dictionary Learning (FDDL) was proposed [36] and successfully applied to the Extended Yale B face data set [37]. This data set contains 2414 face images of 38 persons (about 64 images per person). In the FDDL setting, 20 images are selected from each person for the dictionary, so there are 760 images in the training data and the rest are testing data. Six competing methods are reported in [36] for comparison, and the results are shown in Table III: SRC (sparse representation based classification); two classical classifiers, NN (nearest neighbor) and SVM (linear support vector machine); and two recently proposed dictionary learning based classification methods, D-KSVD (discriminative K-SVD) [40] and DLSI (dictionary learning with structure incoherence) [41]. Note that two variants of DLSI (DLSI and DLSI*) are used in that work. The number of dictionary atoms is 760 in the methods related to sparse representation (such as SRC, D-KSVD and DLSI).

When we use ADL on the same data set, we keep the same experimental setting as FDDL (Eigenfaces with dimension 300), and the outputs are averaged over 5 runs. However, we do not select the training data from each class (person); instead, we randomly choose 760 images from the data set as the training data. There are two reasons. First, choosing training data from each class is labor-intensive, especially for a huge data set. Second, we want to test the robustness of our method in the imbalanced learning framework [42].
Figure 5 shows the source images of the atoms for one person when the ADL dictionary has 600 atoms. In the ADL dictionary, each atom has dimension 300 after the Eigenface process, and we trace back to display the original face image for each atom. The selected face images are representative of different facial expressions and lighting conditions. We show our classification results with dictionary sizes of 600, 650, 700 and 750 in Table IV. The accuracies we obtain are higher than those of NN, D-KSVD and DLSI, and they are at the same level as those of SVM, DLSI* and SRC. The results of ADL are slightly lower than that of FDDL. However, our method uses only one dictionary to handle all the classes, whereas FDDL uses many dictionaries based on the different classes.
VI Conclusion
A novel dictionary learning method based on active learning is proposed in this paper. It applies an iterative search criterion to the training data. Compared with classic active learning, which actively searches for data to label, ADL actively searches for data to establish the dictionary. The comparisons with other dictionary learning methods provide a comprehensive study. The results show that ADL is effective in reconstruction and classification while using a small dictionary and randomly sampled training data. In certain cases, ADL with a small dictionary can achieve performance comparable to that of the whole-training-data dictionary.
Theoretical analysis and more data sets are needed to extend the proposed dictionary learning method. How to balance the reconstruction error and the classification error is still an open question in sparse representation. Our method has been shown to be effective for this problem in dictionary learning. Beyond this, we plan to adapt our method to large data set applications, such as social network analysis [38], [39] and human group recognition [33].
References
 [1] Wang, F. and Li, P.: Compressed Nonnegative Sparse Coding. IEEE 10th International Conference on Data Mining (ICDM), pp. 1103-1108, 2010.
 [2] Yang, J., Yu, K., Gong, Y., and Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. IEEE Conference on CVPR, pp. 1794-1801, 2009.
 [3] Elad, M. and Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), pp. 3736-3745, 2006.
 [4] Li, Y., Amari, S., Cichocki, A., Ho, D.W.C. and Xie, S.: Underdetermined blind source separation based on sparse representation. IEEE Transactions on Signal Processing, vol. 54, no. 2, pp. 423-437, 2006.
 [5] Xu, J., Yang, G., Man, H. and He, H.: L1 Graph based on sparse coding for Feature Selection. Lecture Notes in Computer Science, vol. 7951, pp. 594-601, 2013.
 [6] Xu, J., Yin, Y., Man, H. and He, H.: Feature Selection Based on Sparse Imputation. International Joint Conference on Neural Networks (IJCNN), 2012.
 [7] Sprechmann, P. and Sapiro, G.: Dictionary learning and sparse coding for unsupervised clustering. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010.
 [8] Xu, J. and Man, H.: Dictionary learning Based on Laplacian Score in Sparse Coding. Lecture Notes in Computer Science, vol. 6871, pp. 253-264, 2011.
 [9] Mairal, J., Bach, F., Ponce, J., and Sapiro, G.: Online Dictionary Learning for Sparse Coding. International Conference on Machine Learning, 2009.
 [10] Cohn, D., Atlas, L. and Ladner, R.: Improving generalization with active learning. Machine Learning, 15(2), pp. 201-221, 1994.
 [11] Wang, J., He, H., Cao, Y., Xu, J. and Zhao, D.: A Hierarchical Neural Network Architecture for Classification. Lecture Notes in Computer Science, vol. 7367, pp. 37-46, 2012.
 [12] Zheng, Y., Scott, S. and Deng, K.: Active Learning from Multiple Noisy Labelers with Varied Costs. IEEE 10th International Conference on Data Mining (ICDM), pp. 639-648, 2010.
 [13] Wang, X. and Davidson, I.: Active Spectral Clustering. IEEE 10th International Conference on Data Mining (ICDM), pp. 561-568, 2010.
 [14] Xu, J., He, H., and Man, H.: DCPE Co-Training: Co-Training Based on Diversity of Class Probability Estimation. International Joint Conference on Neural Networks (IJCNN), pp. 1-7, 2010.
 [15] Xu, J., He, H., and Man, H.: DCPE Co-Training for Classification. Neurocomputing, vol. 86, pp. 75-85, 2012.
 [16] Lewis, D. and Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. International Conference on Machine Learning (ICML), pp. 148-156, 1994.
 [17] Seung, H.S., Opper, M. and Sompolinsky, H.: Query by committee. ACM Workshop on Computational Learning Theory, pp. 287-294, 1992.
 [18] Sun, B., Ng, W.W.Y., Yeung, D.S. and Wang, J.: Localized Generalization Error Based Active Learning for Image Annotation. IEEE International Conference on Systems, Man and Cybernetics, pp. 60-65, 2008.
 [19] Frank, A. and Asuncion, A.: UCI machine learning repository, http://archive.ics.uci.edu/ml, 2010.
 [20] Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S. and Ma, Y.: Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), pp. 210-227, 2009.
 [21] Kim, S., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D.: An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4), pp. 606-617, 2007.
 [22] Candès, E.J. and Donoho, D.L.: Curvelets: A Surprisingly Effective Nonadaptive Representation for Objects With Edges. Curve and Surface Fitting: Saint-Malo, Nashville, TN: Vanderbilt University Press, pp. 105-120, 1999.
 [23] Labate, D., Lim, W., Kutyniok, G. and Weiss, G.: Sparse multidimensional representation using shearlets. SPIE: Wavelets XI, vol. 5914, pp. 254-262, 2005.
 [24] LePennec, E. and Mallat, S.: Sparse geometric image representations with bandelets. IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 423-438, 2005.
 [25] Engan, K., Aase, S.O., and Husøy, J.H.: Multi-frame compression: theory and design. Signal Processing, vol. 80, pp. 2121-2140, 2000.
 [26] Aharon, M., Elad, M. and Bruckstein, A.M.: The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11), pp. 4311-4322, 2006.
 [27] Lee, H., Battle, A., Raina, R. and Ng, A.Y.: Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19, pp. 801-808, 2007.
 [28] Settles, B.: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2010.
 [29] Scheffer, T., Decomain, C., and Wrobel, S.: Active hidden Markov models for information extraction. Proceedings of the International Conference on Advances in Intelligent Data Analysis (CAIDA), pp. 309-318. Springer-Verlag, 2001.
 [30] Hwa, R.: Sample selection for statistical parsing. Computational Linguistics, 30(3), pp. 73-77, 2004.
 [31] Settles, B., Craven, M., and Friedland, L.: Active learning with real annotation costs. Proceedings of the NIPS Workshop on Cost-Sensitive Learning, pp. 1-10, 2008.
 [32] Freund, Y. and Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, pp. 119-139, 1997.
 [33] Yin, Y., Yang, G., Xu, J. and Man, H.: Small Group Human Activity Recognition. International Conference on Image Processing (ICIP), 2012.
 [34] Labusch, K., Barth, E. and Martinetz, T.: Sparse Coding Neural Gas: Learning of Overcomplete Data Representations. Neurocomputing, vol. 72, pp. 1547-1555, 2009.
 [35] Kohonen, T.: SOM TOOLBOX, Software available at http://www.cis.hut.fi/projects/somtoolbox/, 2005.
 [36] Yang, M., Zhang, L., Feng, X. and Zhang, D.: Fisher Discrimination Dictionary Learning for Sparse Representation. Proceedings of the International Conference on Computer Vision (ICCV), 2011.
 [37] Georghiades, A.S., Belhumeur, P.N. and Kriegman, D.J.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, 2001.
 [38] Govindan, P., Xu, J., Hill, S., Eliassi-Rad, T., and Volinsky, C.: Local Structural Features Threaten Privacy across Social Networks. The 5th Workshop on Information in Networks, New York, NY, September 2013.
 [39] Hill, S., Benton, A. and Xu, J.: Social Media-based Social TV Recommender System. 22nd Workshop on Information Technologies and Systems, 2012.
 [40] Zhang, Q. and Li, B.: Discriminative K-SVD for Dictionary Learning in Face Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
 [41] Ramirez, I., Sprechmann, P. and Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
 [42] He, H. and Garcia, E.A.: Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.
 [43] Xu, J., Yang, G., Yin, Y., Man, H. and He, H.: Sparse-Representation-Based Classification with Structure-Preserving Dimension Reduction. Cognitive Computation, vol. 6, no. 3, pp. 608-621, 2014.
 [44] Xu, J., Yang, G., Yin, Y. and Man, H.: Sparse Representation for Classification with Structure Preserving Dimension Reduction. International Conference on Machine Learning (ICML) Workshop on Structured Sparsity: Learning and Inference, Bellevue, WA, USA, 2011.