1 Introduction
Sparsity is an attribute characterizing a wide range of natural and man-made signals [1], and it has played a vital role in the success of many machine learning algorithms and techniques, such as compressed sensing [2], matrix factorization [3], sparse coding [4], dictionary learning [5, 6], sparse autoencoders [7], Restricted Boltzmann Machines (RBMs) [8] and Independent Component Analysis (ICA) [9]. Among these, ICA transforms an observed multidimensional random vector into sparse components that are statistically as independent from each other as possible. Specifically, to estimate the independent components, a general principle is the maximization of nongaussianity [9]. This is based on the central limit theorem: a sum of independent random variables is closer to Gaussian than any of the original random variables, so nongaussianity indicates independence. Meanwhile, sparsity is one form of nongaussianity
[10], and it is dominant in natural images. Maximizing sparseness in natural images is therefore essentially equivalent to maximizing nongaussianity. Thus, ICA has been successfully applied to learn sparse representations for classification tasks by maximizing sparsity [11]. However, standard ICA has two main drawbacks. 1) ICA is sensitive to whitening, an important preprocessing step for extracting efficient features, and it is difficult to exactly whiten high-dimensional data. For example, an input image of 100×100 pixels can be exactly whitened by principal component analysis (PCA), but this requires solving the eigendecomposition of a 10,000×10,000 covariance matrix. 2) It is hard for ICA to learn an overcomplete basis (that is, one in which the number of basis vectors is greater than the dimensionality of the input data), whereas Coates et al. [12]
have shown that several approaches with an overcomplete basis, e.g., sparse autoencoders [7], K-means [12] and RBMs [8], obtain improved classification performance. This puts ICA at a disadvantage compared to these methods. Both drawbacks are mainly due to the hard orthonormality constraint in standard ICA. Mathematically, that is $WW^\top = I$, which is utilized to prevent degenerate solutions for the basis matrix $W$, where each basis vector is a row of $W$. This orthonormality cannot be satisfied when $W$ is overcomplete. Specifically, the optimization problem of standard ICA is generally solved by gradient descent, where $W$ is orthonormalized at each iteration by symmetric orthonormalization, i.e., $W \leftarrow (WW^\top)^{-1/2}W$, which does not work for overcomplete learning. In addition, although alternative orthonormalization methods could be employed to learn an overcomplete basis, they are not only expensive to compute but may also suffer from an accumulation of errors.
To address the above issues, Q.V. Le et al. [13] replaced the orthonormality constraint with a robust soft reconstruction cost for ICA (RICA). Thus, RICA can learn sparse representations with a highly overcomplete basis, even on unwhitened data. However, this model remains a linear technique and cannot discover nonlinear relationships among input data. Additionally, as an unsupervised method that does not consider the association between a training sample and its class, RICA may not be sufficient for classification tasks.
Recall that, to explore nonlinear features, the kernel trick [14] can be used to nonlinearly project the input data into a high-dimensional feature space. Therefore, we develop a kernel extension of RICA (kRICA) to represent data with nonlinear structure. In addition, to bring in label information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint, namely dkRICA. Particularly, this constraint jointly maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which leads to a structured basis consisting of basis subsets corresponding to the class labels. Each subset then represents its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, so the obtained sparse representation carries more discriminative power.
It is important to note that this work is fundamentally based on our previous work DRICA [15]. In comparison to DRICA, we further improve our work as follows:
1) By taking advantage of the kernel trick, we replace the linear projection with nonlinear one to capture the nonlinear features. Experimental results show that our kernel extension usually further improves the image classification accuracy.
2) The discriminative capability of the basis is further enhanced by simultaneously maximizing the homogeneous representation cost and minimizing the inhomogeneous representation cost. Thus, we obtain a set of more discriminative basis vectors that are forced to represent their own classes well but the others poorly. Experiments show that this basis can further boost image classification performance.
3) In the experiments, we conduct comprehensive analysis for our proposed method, e.g., the effects of different parameters and kernels for image classification, experiment settings, and the similarity comparative analysis.
The rest of the paper is organized as follows. In Section 2, we revisit related works on sparse coding and RICA, and describe the connection between them. Then we give a brief review of reconstruction ICA in Section 3. Section 4 introduces the details of our proposed kRICA, including its optimization problem and implementation. By incorporating the discrimination constraint, kRICA is further extended to supervised learning in Section 5. Section 6 presents extensive experimental results on image classification. Finally, we conclude our work in Section 7.
2 Related Work
In this section, we will review some related work in the following aspects: (1) Sparse coding and its applications; (2) Connection between RICA and sparse coding; (3) The other kernel sparse representation algorithms.
Sparse coding is an unsupervised method for reconstructing a given signal by selecting a relatively small subset of basis vectors from an overcomplete basis set, while keeping the reconstruction error as small as possible. Because of its plausible statistical theory [16], sparse coding has attracted increasing attention in the computer vision field and has been successfully used in many computer vision applications, e.g., image classification [17, 18, 19, 20] and image restoration [21]. This success is largely due to two factors: 1) The sparsity characteristic exists ubiquitously in many computer vision applications. For example, in image classification, image components can be sparsely reconstructed using similar components of other images from the same class [17]. Another example is face recognition: a test face image can be accurately reconstructed by a few training images from the same category [20]. Consequently, sparsity is the foundation for these sparse-coding-based applications.
2) Images are often corrupted by noise, which may arise from sensor imperfection, poor illumination or communication errors. Sparse coding can effectively select the related basis vectors to reconstruct the clean image, and can deal with noise by allowing reconstruction error and promoting sparsity. Therefore, sparse coding has been successfully applied to image denoising [22], image restoration [21], etc.
Similar to sparse coding, ICA with a reconstruction cost (RICA) [13] also can learn highly overcomplete sparse representation. In addition, in [13], it has been shown that RICA is mathematically equivalent to sparse coding if using explicit encoding and ignoring the norm ball constraint.
The above-mentioned studies only seek sparse representations of the input data in the original data space, and are thus unable to represent data with nonlinear structure. To solve this problem, Yang et al. [23] developed a two-phase kernel ICA algorithm: whitened kernel principal component analysis (KPCA) plus ICA. Differently from [23], another solution [24] uses a contrast function based on canonical correlations in a reproducing kernel Hilbert space. However, neither of these methods can learn an overcomplete sparse representation of nonlinear features, due to the orthonormality constraint. To find such a representation, Gao et al. [25, 26] presented a kernel sparse coding method (KSR) in a high-dimensional feature space, but as an unsupervised approach this work fails to utilize class information. Additionally, in Section 4.3, we will show that our proposed kernel extension of RICA (kRICA) is equivalent to KSR under certain conditions.
3 Reconstruction ICA
Since sparsity is one form of nongaussianity, maximization of sparsity for ICA is equivalent to maximization of independence [10]. Given the unlabeled data set $\{x^{(i)}\}_{i=1}^{N}$, where $x^{(i)} \in \mathbb{R}^n$, the optimization problem of standard ICA [9] is generally defined as
$$\min_{W}\ \sum_{i=1}^{N}\sum_{j=1}^{k} g\big(W_j x^{(i)}\big) \quad \text{s.t.}\quad WW^\top = I \qquad (1)$$
where $g$ is a nonlinear convex function, $W \in \mathbb{R}^{k \times n}$ is the basis matrix, $k$ is the number of basis vectors, $W_j$ is the $j$-th row (basis vector) of $W$, and $I$ is the identity matrix. Additionally, the orthonormality constraint $WW^\top = I$ is traditionally utilized to prevent the basis vectors in $W$ from becoming degenerate. Meanwhile, a good general-purpose smooth penalty is $g(\cdot) = \log\cosh(\cdot)$ [10]. However, as pointed out above, the orthonormality constraint makes it difficult for standard ICA to learn an overcomplete basis. In addition, ICA is sensitive to whitening. These drawbacks prevent ICA from scaling to high-dimensional data. Consequently, RICA [13] used a soft reconstruction cost to replace the orthonormality constraint in ICA. Applying this replacement to Equation (1), RICA can be formulated as the following unconstrained problem
$$\min_{W}\ \frac{\lambda}{N}\sum_{i=1}^{N}\big\|W^\top W x^{(i)} - x^{(i)}\big\|_2^2 + \sum_{i=1}^{N}\sum_{j=1}^{k} g\big(W_j x^{(i)}\big) \qquad (2)$$
where the parameter $\lambda$ trades off reconstruction against sparsity. By swapping the orthonormality constraint for a reconstruction penalty, RICA can learn sparse representations even on unwhitened data when $W$ is overcomplete.
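The unconstrained RICA objective can be sketched directly in NumPy. This is a hedged sketch under our reading of Equation (2), using the smooth penalty $g(s) = \log\cosh(s)$; all variable names are ours:

```python
import numpy as np

# Sketch of the RICA objective (Eq. (2)):
#   (lambda/N) * sum_i ||W^T W x_i - x_i||_2^2 + sum_{i,j} g(W_j x_i)
# with the smooth sparsity penalty g(s) = log(cosh(s)).
def rica_objective(W, X, lam=0.1):
    N = X.shape[1]
    recon = W.T @ (W @ X) - X                  # soft reconstruction residual
    recon_cost = (lam / N) * np.sum(recon ** 2)
    sparsity = np.sum(np.log(np.cosh(W @ X)))  # smooth L1-like penalty
    return recon_cost + sparsity

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 50))   # 16-dim data, 50 samples
W = rng.standard_normal((32, 16))   # overcomplete: 32 basis vectors
print(rica_objective(W, X) > 0)     # True
```

Note that nothing constrains $W$ here: the reconstruction term alone discourages degenerate bases, which is what makes the overcomplete case tractable.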
Furthermore, since the $\ell_1$ penalty is not sufficient to learn invariant features [10], RICA [13, 27] replaces it with an $L_2$ pooling penalty, which encourages pooling units to group similar features together so as to achieve complex invariances such as scale and rotational invariance. Pooling also promotes sparsity in feature learning. Particularly, $L_2$ pooling [28, 29] is a two-layered network with a square nonlinearity in the first layer and a square-root nonlinearity in the second layer:
$$g\big(Wx^{(i)}\big) = \sum_{j=1}^{k}\sqrt{\epsilon + H_j\big(Wx^{(i)}\big)^2} \qquad (3)$$
where $H_j$ is the $j$-th row of the spatial pooling matrix $H$, which is fixed to uniform weights, and $\epsilon$ is a small constant to prevent division by zero.
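The two-layer structure of the pooling penalty can be sketched as follows (our own minimal construction; the pooling matrix here groups pairs of adjacent rows, whereas the paper uses a fixed spatial pooling matrix):

```python
import numpy as np

# L2 pooling penalty (Eq. (3)): square nonlinearity in the first layer,
# square root of each pooled group in the second; eps prevents division
# by zero in the gradient.
def l2_pooling_penalty(W, X, H, eps=1e-8):
    S = (W @ X) ** 2                     # first layer: squared responses
    return np.sum(np.sqrt(H @ S + eps))  # second layer: pooled square roots

rng = np.random.default_rng(5)
X = rng.standard_normal((16, 20))
W = rng.standard_normal((8, 16))
# Toy pooling matrix with uniform weights: each row pools two features.
H = np.kron(np.eye(4), np.ones((1, 2))) / 2.0
print(l2_pooling_penalty(W, X, H) > 0)  # True
```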
Nevertheless, RICA cannot represent data with nonlinear structure, due to its intrinsic linearity. In addition, the model simply learns an overcomplete basis set with a reconstruction cost and does not consider the association between a training sample and its class, which may be insufficient for classification tasks. To address these problems, on one hand we develop a kernel extension of RICA to find sparse representations of nonlinear features; on the other hand, we learn a more discriminative basis than unsupervised RICA by bringing in class information, which facilitates better performance of sparse representation in classification tasks.
4 Kernel Extension for RICA
Motivated by the success of the kernel trick in capturing nonlinear structure in data [14], we propose a kernel version of RICA, called kRICA, to learn sparse representations of nonlinear features.
4.1 Model Formulation
Suppose there is a kernel function $k(x, y) = \phi(x)^\top \phi(y)$ induced by a high-dimensional feature mapping $\phi: \mathbb{R}^n \rightarrow \mathcal{F}$. Given two data points $x$ and $y$, $k(x, y)$ represents a nonlinear similarity between them. The mapping $\phi$ takes the data and basis from the original data space to the feature space as follows.
$$x^{(i)} \rightarrow \phi\big(x^{(i)}\big), \qquad w_j \rightarrow \phi(w_j), \quad j = 1, \ldots, k \qquad (4)$$
Furthermore, by substituting the mapped data and basis into Equation (2), we can get the following objective function of kRICA.
$$\min_{W}\ \frac{\lambda}{N}\sum_{i=1}^{N}\big\|W_\phi^\top W_\phi\,\phi(x^{(i)}) - \phi(x^{(i)})\big\|_2^2 + \sum_{i=1}^{N}\sum_{j=1}^{k} g\big(k(w_j, x^{(i)})\big), \quad W_\phi = [\phi(w_1); \ldots; \phi(w_k)] \qquad (5)$$
Due to its excellent performance in many computer vision applications [14, 25], the Gaussian kernel, i.e., $k(x, y) = \exp(-\|x - y\|_2^2 / (2\sigma^2))$, is used in this study. Thus, the norm ball constraints on the basis in RICA can be removed, owing to $\|\phi(w_j)\|_2^2 = k(w_j, w_j) = 1$.
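The reason the norm ball constraint becomes unnecessary is that every mapped vector automatically has unit norm under the Gaussian kernel, as this small check illustrates (names ours):

```python
import numpy as np

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.random.default_rng(2).standard_normal(8)
# k(x, x) = exp(0) = 1, so ||phi(x)||^2 = 1 for every x: the mapped basis
# vectors already lie on the unit sphere in the feature space.
print(gaussian_kernel(x, x))  # 1.0
```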
In addition, we perform kernel principal component analysis (KPCA) in the feature space for data whitening similar to [23], which makes the problem of ICA estimation simpler and better conditioned [10]. When data is whitened, there exists a close relationship between kernel ICA [23] and kRICA. Regarding this relationship, we have the following Lemma:
Lemma 4.1. When the input data set is whitened in the feature space, the reconstruction cost is equivalent to the orthonormality cost $\|W_\phi W_\phi^\top - I\|_F^2$.
where $\|\cdot\|_F$ is the Frobenius norm. Lemma 4.1 shows that kernel ICA's hard orthonormality constraint and kRICA's reconstruction cost are equivalent when the data is whitened, yet kRICA can learn an overcomplete sparse representation of nonlinear features whereas kernel ICA cannot, due to the orthonormality constraint. Please see Appendix A for a detailed proof.
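The idea behind the lemma can be checked numerically in the original (linear) space: once the data is whitened so that $\frac{1}{N}XX^\top = I$, the reconstruction cost coincides with the orthonormality cost. The setup below (square basis, ZCA-style whitening) is our own sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 500, 6
A = rng.standard_normal((d, N))
# ZCA whitening: make (1/N) * X X^T = I exactly.
vals, vecs = np.linalg.eigh(A @ A.T / N)
X = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ A

W = rng.standard_normal((d, d))
recon = np.sum((W.T @ W @ X - X) ** 2) / N   # reconstruction cost
ortho = np.sum((W @ W.T - np.eye(d)) ** 2)   # orthonormality cost
print(np.allclose(recon, ortho))  # True
```

The equality follows from $\frac{1}{N}\|(W^\top W - I)X\|_F^2 = \mathrm{tr}\big((W^\top W - I)^2\big)$ under whitening, which for a square $W$ equals $\|WW^\top - I\|_F^2$.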
4.2 Implementation
Equation (5) is an unconstrained optimization problem. To solve it, we rewrite the objective as follows
(6)  
where $w_j$ and $w_l$ denote rows of the basis $W$, and $H_{mj}$ is an element of the pooling matrix $H$. Since the rows of $W$ appear inside the kernel $k(\cdot, \cdot)$, it is very hard to directly apply the optimization methods used in RICA, e.g., L-BFGS and CG [30], to compute the optimal basis. Thus, we alternately optimize each row of the basis instead. With respect to each updated row $w_j$ of $W$, the derivative of the objective is
(7)  
Then, to compute the optimal $w_j$, we set the derivative to zero. Since $w_j$ is contained in the kernel, it is challenging to solve Equation (7) exactly, so we seek an approximate solution instead. Inspired by the fixed point algorithm [25], to update $w_j$ in the $(t+1)$-th iteration we use the result from the $t$-th iteration to evaluate the kernel part. In addition, following [25], we use k-means to initialize the basis. Denoting $w_j$ in the $t$-th iteration by $w_j^{(t)}$, Equation (7) with respect to $w_j^{(t+1)}$ becomes
When all the remaining rows are fixed, the problem becomes a linear equation in $w_j$, which can be solved straightforwardly.
4.3 Connection between kRICA and KSR
It is clear there is a close connection between the proposed kRICA and KSR [25]. Similar to kRICA, KSR attempts to find the sparse representation of nonlinear features in a high dimensional feature space and its optimization problem is
$$\min_{W,\,z}\ \sum_{i=1}^{N}\big\|\phi(x^{(i)}) - W_\phi^\top z^{(i)}\big\|_2^2 + \lambda\sum_{i=1}^{N}\big\|z^{(i)}\big\|_1 \qquad (8)$$
where $z^{(i)}$ is the sparse representation of sample $x^{(i)}$. There are two major differences between them.
(1) KSR optimizes an explicit code $z^{(i)}$ for each input sample rather than using the direct encoding $z^{(i)} = W_\phi\phi(x^{(i)})$. Since the objective of Equation (8) in KSR is not convex, the basis and the sparse codes must be optimized alternately.
(2) KSR employs the simple $\ell_1$ penalty to promote sparsity, while kRICA uses $L_2$ pooling instead, which forces pooling units to group similar features together to achieve invariance while still promoting sparsity.
5 Supervised Kernel RICA
Given labeled training data, our goal is to utilize class information to learn a structured basis set, which consists of basis subsets corresponding to different class labels. Each subset will then represent its own class well but not the others. Thus, to learn such a basis, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint, namely dkRICA.
Mathematically, when the sample $x^{(i)}$ is labeled as $l \in \{1, \ldots, c\}$, where $c$ is the total number of classes, we can utilize the class information to learn a structured basis set $W = [W_1; \ldots; W_c]$, where $W_l$ is the basis subset that represents samples of the $l$-th class well rather than the others, $m$ is the number of basis vectors in each subset, and $k = mc$. Let $z^{(i)}$ denote $W_\phi\phi(x^{(i)})$, which can be regarded as the sparse representation of sample $x^{(i)}$ [13].
5.1 Discrimination constraint
Since we aim to utilize class information to learn a structured basis, we hope that a sample labeled $l$ will be reconstructed only by the basis subset $W_l$, with the coefficients concentrated on that subset. To achieve this goal, an inhomogeneous representation cost constraint [15, 31] was previously utilized to minimize the inhomogeneous representation coefficients of $z$, i.e., the coefficients corresponding to basis vectors not belonging to $W_l$. However, this constraint only minimizes the inhomogeneous coefficients and fails to maximize the homogeneous ones, which is not sufficient to learn an optimal structured basis. Consequently, to learn such a basis, we introduce a discrimination constraint that jointly maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost. Mathematically, we define the homogeneous cost as $F_h(z)$ and the inhomogeneous cost as $F_i(z)$. Specifically, $F_h$ and $F_i$ are
$$F_h\big(z^{(i)}\big) = \big\|Pz^{(i)}\big\|_2^2, \qquad F_i\big(z^{(i)}\big) = \big\|Qz^{(i)}\big\|_2^2 \qquad (9)$$
where $P$ and $Q$ select the homogeneous and inhomogeneous representation coefficients of $z^{(i)}$, respectively. For example, assuming $x^{(i)}$ is labeled as the first class ($l = 1$) and $c = 3$, $P$ and $Q$ can be respectively defined as follows.
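To make the selectors concrete, here is a hypothetical construction of our own (with $c = 3$ classes and $m = 2$ basis vectors per subset) using complementary 0/1 diagonal matrices:

```python
import numpy as np

# P_l keeps the coefficients of the label's own subset (homogeneous);
# Q_l = I - P_l keeps the remaining coefficients (inhomogeneous).
def selectors(label, num_classes, per_class):
    mask = np.zeros(num_classes * per_class)
    mask[label * per_class:(label + 1) * per_class] = 1.0
    P = np.diag(mask)          # homogeneous selector
    Q = np.eye(len(mask)) - P  # inhomogeneous selector
    return P, Q

P, Q = selectors(label=0, num_classes=3, per_class=2)
z = np.arange(6, dtype=float)  # a toy coefficient vector
print(P @ z)  # [0. 1. 0. 0. 0. 0.]
print(Q @ z)  # [0. 0. 2. 3. 4. 5.]
```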
Intuitively, we can define the discrimination constraint function as $f(z) = F_i(z) - F_h(z)$, which encourages the sparse representation in terms of the basis matrix $W$ to concentrate on the basis subset $W_l$. However, this function is nonconvex and unstable. To address the problem, we propose to incorporate an elastic term $\eta\|z\|_2^2$ into $f$. Thus, $f$ is defined as
$$f\big(z^{(i)}\big) = \big\|Qz^{(i)}\big\|_2^2 - \big\|Pz^{(i)}\big\|_2^2 + \eta\big\|z^{(i)}\big\|_2^2 \qquad (10)$$
It can be proved that if $\eta > 1$, $f$ is strictly convex in $z$; please see Appendix B for a detailed proof. The constraint (10) simultaneously maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which leads to a structured basis consisting of basis subsets corresponding to the class labels. Each subset then represents its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, so the obtained representations carry more discriminative power.
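A hedged sketch of this constraint (our reconstruction of Eq. (10)) also makes the convexity argument concrete: with complementary 0/1 diagonal selectors, the Hessian $2(Q^\top Q - P^\top P + \eta I)$ has eigenvalues $2(\eta - 1)$ and $2(\eta + 1)$, both positive when $\eta > 1$:

```python
import numpy as np

# f(z) = ||Q z||^2 - ||P z||^2 + eta * ||z||^2  (Eq. (10), our reading).
def discrimination_cost(z, P, Q, eta=1.1):
    return np.sum((Q @ z) ** 2) - np.sum((P @ z) ** 2) + eta * np.sum(z ** 2)

P = np.diag([1.0, 1.0, 0.0, 0.0])  # homogeneous selector (class subset)
Q = np.eye(4) - P                  # inhomogeneous selector
eta = 1.1
H = 2.0 * (Q.T @ Q - P.T @ P + eta * np.eye(4))  # Hessian of f
print(np.all(np.linalg.eigvalsh(H) > 0))  # True: strictly convex for eta > 1
```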
By incorporating the discrimination constraint into the kRICA framework (dkRICA), we can get the following objective function
$$\min_{W}\ \frac{\lambda}{N}\sum_{i=1}^{N}\big\|W_\phi^\top W_\phi\,\phi(x^{(i)}) - \phi(x^{(i)})\big\|_2^2 + \sum_{i=1}^{N} g\big(z^{(i)}\big) + \gamma\sum_{i=1}^{N} f\big(z^{(i)}\big) \qquad (11)$$
where $\lambda$ and $\gamma$ are scalars controlling the relative contributions of the corresponding terms. Given a test sample, Equation (11) means that the learned basis set can sparsely represent it with nonlinear structure while requiring its homogeneous representation coefficients to be as large as possible and its inhomogeneous coefficients as small as possible. Following kRICA, the optimization problem (11) can be solved by the fixed point algorithm proposed above.
6 Experiments
In this section, we first introduce the feature extraction pipeline for image classification. Then, we evaluate the performance of kRICA and dkRICA for image classification on three public datasets: Caltech 101 [32], CIFAR-10 [12] and STL-10 [12]. Furthermore, we study the selection of tuning parameters and kernel functions for our method. Finally, we give the similarity matrix to further illustrate the performance of kRICA and dkRICA.
6.1 Feature Extraction for Classification
Given an input image patch (with multiple channels), kRICA transforms it into a new representation in the feature space; the patch width is termed the 'receptive field size'. For an image, we obtain its features by estimating the representation for each 'sub-patch' of the input image, following the same setting as [13]. To reduce the dimensionality of the image representation, we use a pooling method similar to [13] to form a reduced-dimensional pooled representation for image classification. Given the pooled feature for each image, we use a linear SVM for classification.
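The patch-then-pool pipeline can be sketched as follows. This is a much-simplified sketch of our own: it uses a plain linear encoding and 2×2 quadrant average pooling on a single-channel image, whereas the actual pipeline encodes each sub-patch with the learned kernel basis and follows the pooling of [13]:

```python
import numpy as np

# Encode every receptive-field sub-patch with a basis W, then average-pool
# the codes over a 2x2 grid of image quadrants.
def pooled_representation(image, W, rf=4):
    h, w = image.shape
    codes = np.empty((h - rf + 1, w - rf + 1, W.shape[0]))
    for i in range(h - rf + 1):
        for j in range(w - rf + 1):
            patch = image[i:i + rf, j:j + rf].ravel()
            codes[i, j] = W @ patch  # linear encoding z = W x (stand-in)
    gh, gw = codes.shape[0] // 2, codes.shape[1] // 2
    quads = [codes[:gh, :gw], codes[:gh, gw:], codes[gh:, :gw], codes[gh:, gw:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quads])

rng = np.random.default_rng(4)
img = rng.standard_normal((8, 8))
W = rng.standard_normal((10, 16))  # 10 basis vectors for 4x4 patches
print(pooled_representation(img, W).shape)  # (40,): 4 quadrants x 10 codes
```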
6.2 Classification on Caltech 101
The Caltech 101 dataset consists of 9,144 images divided among 101 object classes and 1 background class, including animals, vehicles, etc. Following the common experimental setup [17], we run our algorithm with 15 and 30 training images per category using 10×10 receptive fields. Comparison results are shown in Table I. We compare our classification accuracy with ScSPM [17], D-KSVD [6], LC-KSVD [19], RICA [13], KICA [23], KSR [25] and DRICA [15]. In addition, in order to compare with DRICA, we incorporate the discrimination constraint (10) into the RICA framework (2), namely dRICA.
Table I shows that kRICA and dkRICA outperform the other competing approaches.
6.3 Classification on CIFAR10
The CIFAR-10 dataset includes 10 categories and 60,000 32×32 color images in all, with 6,000 images per category, such as airplane, automobile, truck and horse. There are 50,000 training images and 10,000 testing images: 1,000 images from each class are randomly selected as test images, and the remaining 5,000 images per class serve as training images. In this experiment, we fix the size of the basis set to 4,000 with 6×6 receptive fields, following [12]. We compare our approach with RICA, K-means (Triangle, 4000 features) [12], KSR, DRICA and dRICA, among others.
Table II shows the effectiveness of our proposed kRICA and dkRICA.
Model  Accuracy 

Improved Local Coord. Coding [18]  74.5% 
Conv. Deep Belief Net (2 layers) [33]  78.9% 
Sparse autoencoder [12]  73.4% 
Sparse RBM [12]  72.4% 
Kmeans (Hard) [12]  68.6% 
Kmeans (Triangle) [12]  77.9% 
Kmeans (Triangle, 4000 features) [12]  79.6% 
RICA [13]  81.4% 
KICA [23]  78.3% 
KSR [25]  82.6% 
DRICA [15]  82.1% 
dRICA  82.9% 
kRICA  83.4% 
dkRICA  84.5% 
6.4 Classification on STL10
The STL-10 dataset contains 10 classes (e.g., airplane, dog, monkey and ship), where each image is a 96×96 color image. The dataset is divided into 500 training images per class (10 predefined folds), 800 test images per class, and 100,000 unlabeled images for unsupervised learning. In our experiments, we set the size of the basis set to 1,600 with 8×8 receptive fields, in the same manner as described in [13]. Table III shows the classification results of raw pixels [12], K-means, RICA, KSR, DRICA, dRICA, kRICA and dkRICA.
As can be seen, dRICA achieves better performance than DRICA on all of the above datasets. This is because DRICA only minimized the inhomogeneous representation cost for structured basis learning, while dRICA simultaneously maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which makes the learned sparse representation more discriminative. Although both DRICA and dRICA introduce class information, unsupervised kRICA still performs better than both, which indicates that kRICA gains discriminative power for classification by representing data with nonlinear structure. Additionally, since kRICA uses $L_2$ pooling instead of the $\ell_1$ penalty to achieve feature invariance, it outperforms KSR. Furthermore, dkRICA achieves better performance than kRICA in all cases by bringing in class information.
We also investigate the effect of basis size for kRICA and dkRICA on the STL-10 dataset. In our experiments, we try seven sizes: 50, 100, 200, 400, 800, 1200 and 1600. As shown in Fig. 1, the classification accuracies of dkRICA and kRICA continue to increase as the basis size grows to 1600, with only slight gains beyond a basis size of 800. Notably, dkRICA outperforms all the other algorithms at every basis size.
Model  Accuracy 

Raw pixels [12]  31.8% 
Kmeans(Triangle 1600 features) [12]  51.5% 
RICA(8x8 receptive fields) [13]  51.4% 
RICA(10x10 receptive fields) [13]  52.9% 
KICA [23]  51.1% 
KSR [25]  54.4% 
DRICA [15]  54.2% 
dRICA  54.8% 
kRICA  55.2% 
dkRICA  56.9% 
6.5 Tuning Parameter and Kernel Selection
In the experiments, the tuning parameters of kRICA and dkRICA, i.e., $\lambda$, $\gamma$ and $\sigma$ in the objective function, are selected by cross validation to avoid overfitting. More specifically, we set these parameters experimentally as follows.
The effect of $\lambda$: The parameter $\lambda$ balances the sparsity term against reconstruction, and it is an important factor in kRICA. To facilitate parameter selection, we experimentally investigate how the performance of kRICA varies with $\lambda$ on the STL-10 dataset in Fig. 2. Fig. 2 shows the value of $\lambda$ at which kRICA achieves its best performance, and we fix $\lambda$ to that value for the STL-10 data. In addition, we test the accuracy of RICA under the same sparsity weight; our nonlinear kRICA consistently outperforms linear RICA across the range of $\lambda$. The values for the Caltech and CIFAR-10 data are set experimentally in the same way.
The effect of $\gamma$: The parameter $\gamma$ controls the weight of the discrimination constraint term. When $\gamma = 0$, the supervised dkRICA optimization problem reduces to the unsupervised kRICA problem. Fig. 3 shows the relationship between the weight of the discrimination constraint term and classification accuracy on STL-10, and we fix $\gamma$ to the best-performing value for the STL-10 data. In particular, dRICA achieves better performance than DRICA over a wide range of $\gamma$ values. This is because DRICA only minimizes the inhomogeneous representation cost, while dRICA jointly optimizes both the homogeneous and inhomogeneous representation costs for basis learning, which makes the learned sparse representations more discriminative. Furthermore, by representing data with nonlinear structure, dkRICA gains additional discriminative power and outperforms both algorithms. The values for the Caltech and CIFAR-10 data are set in the same way.
The effect of $\sigma$: When using the Gaussian kernel in kRICA, it is vital to select the kernel parameter $\sigma$, which affects the image classification accuracy. Fig. 4 shows the relationship between $\sigma$ and classification accuracy on the STL-10 dataset, and we fix $\sigma$ to the best-performing value for the STL-10 data. The values for the Caltech and CIFAR-10 data are set experimentally in the same way.
We also investigate the effect of different kernels on kRICA for image classification, i.e., the Polynomial kernel, the Inverse Distance kernel, the Inverse Square Distance kernel and the Exponential Histogram Intersection kernel. Following [26], we set $b = 3$ for the Polynomial kernel and $b = 1$ for the others. Table IV reports the classification performance of the different kernels on the STL-10 dataset; the Gaussian kernel outperforms the other kernels, so we employ the Gaussian kernel in our studies.
Kernel  Accuracy 

Polynomial kernel  54.2% 
Inverse Distance kernel  38.3% 
Inverse Square Distance kernel  47.6% 
Exponential Histogram Intersection kernel  36.5% 
Gaussian kernel  56.9% 
6.6 Similarity Analysis
In the sections above, we have shown the effectiveness of kRICA and dkRICA for image classification. To further illustrate their performance, we first choose 90 images from three classes of Caltech 101, 30 images per class. We then compute the similarity between the sparse representations of these images for RICA, kRICA and dkRICA, respectively. Fig. 5 shows the resulting similarity matrices, where each element is the similarity between the sparse representations of images $i$ and $j$, measured via Euclidean distance. Since a good sparse representation method makes representations from the same class more similar, its similarity matrix should be block-diagonal. Fig. 5 shows that nonlinear kRICA is more discriminative than linear RICA, and dkRICA performs best by bringing in class information.
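One way such a similarity matrix can be computed (a sketch of our own; the paper does not specify the exact similarity function beyond Euclidean distance) is a Gaussian of the pairwise distances between sparse codes, so that same-class blocks light up:

```python
import numpy as np

# Pairwise similarity matrix over sparse codes (columns of Z).
def similarity_matrix(Z, sigma=1.0):
    d2 = np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0)  # squared dists
    return np.exp(-d2 / (2 * sigma ** 2))

# Two nearly identical codes and one distant code: the matrix should show
# high similarity inside the "class" and low similarity across it.
Z = np.array([[1.0, 1.1, 5.0],
              [0.0, 0.1, 5.0]])
S = similarity_matrix(Z)
print(S[0, 1] > S[0, 2])  # True
```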
7 Conclusions
In this paper, we propose a kernel ICA model with a reconstruction constraint (kRICA) to capture nonlinear features. To bring in class information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint. This constraint leads to a structured basis consisting of basis subsets corresponding to different class labels, where each subset represents its own class well but not the others. Furthermore, data samples belonging to the same class have similar representations, so the obtained sparse representation carries more discriminative power. Experiments conducted on standard datasets demonstrate the effectiveness of the proposed method.
Appendix A Proof of Lemma 4.1
Proof.
Since the input data set is whitened in the feature space by KPCA, we have $\frac{1}{N}\Phi\Phi^\top = I$, where $I$ is the identity matrix and $\Phi = [\phi(x^{(1)}), \ldots, \phi(x^{(N)})]$. Furthermore, by expanding the reconstruction cost we can obtain the stated equivalence, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and the derivation employs the matrix property $\mathrm{tr}(AB) = \mathrm{tr}(BA)$. Thus, the reconstruction cost is equivalent to the orthonormality cost when the data is whitened in the feature space.
Appendix B Proof of the convexity of
We rewrite the Equation (10) as
(12)  
Then, we can obtain its Hessian matrix with respect to $z$.
(13) 
Without loss of generality, we assume
After some derivations, we have where
The convexity of $f$ depends on whether its Hessian matrix is positive definite [34]. Meanwhile, a symmetric matrix $A$ is positive definite if and only if $v^\top A v > 0$ for all nonzero vectors $v$ [35], where $v^\top$ denotes the transpose. Considering the upper-left submatrix of the Hessian, we have
Furthermore, we can get
Define function , and when , it is easy to verify that
Hence the Hessian matrix is positive definite for $\eta > 1$, which guarantees that $f$ is convex in $z$.
Acknowledgments
We thank Wende Dong for helpful discussions, and acknowledge Quoc V. Le for providing the RICA code.
References
 [1] E. Candes and M. Wakin, “An introduction to compressive sampling,” IEEE Signal Proc. Mag., vol. 25, no. 2, pp. 21 –30, march 2008.
 [2] D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [3] D. Lee, H. Seung et al., “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
 [4] B. Olshausen et al., “Emergence of simplecell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
 [5] M. Aharon, M. Elad, and A. Bruckstein, “Ksvd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.

 [6] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in Proc. Comput. Vis. Pattern Recognit., 2010.
 [7] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Proc. Adv. Neural Inform. Process. Syst., vol. 19, 2007, p. 153.
 [8] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
 [9] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. Wileyinterscience, 2001, vol. 26.
 [10] A. Hyvärinen, J. Hurri, and P. Hoyer, Natural image statistics. Springer, 2009, vol. 1.
 [11] A. Hyvärinen, P. Hoyer, and M. Inki, “Topographic independent component analysis,” Neural Comput., vol. 13, no. 7, pp. 1527–1558, 2001.
 [12] A. Coates, H. Lee, and A. Ng, “An analysis of singlelayer networks in unsupervised feature learning,” in Proc. AISTATS, 2010.
 [13] Q. Le, A. Karpenko, J. Ngiam, and A. Ng, “Ica with reconstruction cost for efficient overcomplete feature learning,” in Proc. Adv. Neural Inform. Process. Syst., 2011.
 [14] J. ShaweTaylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge university press, 2004.
 [15] Y. Xiao, Z. Zhu, S. Wei, and Y. Zhao, “Discriminative ica model with reconstruction constraint for image classification,” in Proc. ACM Multimedia, 2012, pp. 929–932.
 [16] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1norm solution is also the sparsest solution,” Commun. Pur. Appl. Math., vol. 59, no. 6, pp. 797–829, 2006.
 [17] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 1794 –1801.
 [18] K. Yu and T. Zhang, “Improved local coordinate coding using local tangents,” in Proc. Int. Conf. Mach. Learn., 2010.
 [19] Z. Jiang, Z. Lin, and L. Davis, “Learning a discriminative dictionary for sparse coding via label consistent ksvd,” in Proc. Comput. Vis. Pattern Recognit., 2011, pp. 1697–1704.
 [20] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
 [21] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Nonlocal sparse models for image restoration,” in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 2272–2279.
 [22] M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer, 2010.
 [23] J. Yang, X. Gao, D. Zhang, and J. Yang, “Kernel ica: An alternative formulation and its application to face recognition,” Pattern Recognit., vol. 38, no. 10, pp. 1784–1787, 2005.
 [24] F. Bach and M. Jordan, “Kernel independent component analysis,” J. Mach. Learn. Res., vol. 3, pp. 1–48, 2003.
 [25] S. Gao, I. Tsang, and L. Chia, “Kernel sparse representation for image classification and face recognition,” in Proc. ECCV, 2010, pp. 1–14.
 [26] S. Gao, I. W.H. Tsang, and L.T. Chia, “Sparse representation with kernels,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 423–434, feb. 2013.
 [27] Q. Le, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, “Building highlevel features using largescale unsupervised learning,” in Proc. Int. Conf. Mach. Learn., 2012.
 [28] Y. LeCun, “Learning invariant feature hierarchies,” in Proc. ECCV. Workshops and Demonstrations, 2012, pp. 496–505.
 [29] Y.L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 111–118.
 [30] M. Schmidt, “minfunc,” 2005.
 [31] J. Yang, J. Wang, and T. Huang, “Learning the sparse representation for classification,” in Proc. ICME, 2011, pp. 1–6.
 [32] L. FeiFei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Comput. Vis. Image Und., vol. 106, no. 1, pp. 59–70, 2007.

 [33] A. Krizhevsky, “Convolutional deep belief networks on CIFAR-10,” Unpublished manuscript, 2010.
 [34] S. Boyd and L. Vandenberghe, Convex optimization, 2004.
 [35] G. H. Golub and C. F. Van Loan, Matrix computations, 1996.