Sparsity is an attribute shared by a wide range of natural and man-made signals, and it has played a vital role in the success of many machine learning algorithms and techniques, such as compressed sensing, matrix factorization, sparse coding, dictionary learning [5, 6], sparse auto-encoders, Restricted Boltzmann Machines (RBMs), and Independent Component Analysis (ICA).
Among these, ICA transforms an observed multidimensional random vector into sparse components that are statistically as independent from each other as possible. To estimate the independent components, the general principle is the maximization of non-gaussianity. This principle rests on the central limit theorem: a sum of independent random variables is closer to gaussian than any of the original variables, so the most non-gaussian directions are the most independent ones. Meanwhile, sparsity is one form of non-gaussianity, and it is the dominant form in natural images; maximizing sparseness in natural images is therefore essentially equivalent to maximizing non-gaussianity. Thus, ICA has been successfully applied to learn sparse representations for classification tasks by maximizing sparsity. However, standard ICA has two main drawbacks.
1) ICA is sensitive to whitening, an important preprocessing step for extracting efficient features, and it is difficult to exactly whiten high-dimensional data with standard ICA. For example, an input image of size 100x100 pixels can in principle be exactly whitened by principal component analysis (PCA), but this requires the eigen-decomposition of a 10,000x10,000 covariance matrix.
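To make the cost of whitening concrete, the following is a minimal PCA-whitening sketch (our illustration, not code from the paper): the eigen-decomposition below is cubic in the data dimensionality, which is why it is tractable for small patches but not for full 100x100 images whose covariance is 10,000x10,000.

```python
import numpy as np

# Illustrative PCA whitening on small (8x8 = 64-dim) patches.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))      # 500 samples, 64 dimensions
X = X - X.mean(axis=0)                  # center each dimension

cov = X.T @ X / X.shape[0]              # 64 x 64 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # O(n^3) eigen-decomposition
eps = 1e-5                              # regularizer for small eigenvalues
W_white = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

Xw = X @ W_white                        # whitened data
# After whitening, the covariance is (approximately) the identity.
print(np.allclose(Xw.T @ Xw / Xw.shape[0], np.eye(64), atol=1e-2))
```

For a full 100x100 image the same procedure would require decomposing a 10,000x10,000 matrix, which is the bottleneck the text refers to.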
2) It is hard for ICA to learn an over-complete basis (that is, one in which the number of basis vectors is greater than the dimensionality of the input data). Yet Coates et al. have shown that several approaches with over-complete bases, e.g., sparse auto-encoders, K-means, and RBMs, obtain improved classification performance. This puts ICA at a disadvantage compared to these methods.
Both drawbacks are mainly due to the hard orthonormality constraint in standard ICA. Mathematically, the constraint is WW^T = I, which is utilized to prevent degenerate solutions for the basis matrix W, where each basis vector is a row of W. However, this orthonormality cannot be satisfied when W is over-complete. Specifically, the optimization problem of standard ICA is generally solved by gradient descent, where W is orthonormalized at each iteration by symmetric orthonormalization, i.e., W <- (WW^T)^{-1/2} W, which does not work for over-complete learning. Although alternative orthonormalization methods could be employed to learn an over-complete basis, they are not only expensive to compute but may also suffer from the accumulation of errors.
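The failure mode can be seen directly. Below is a sketch (our illustration) of symmetric orthonormalization W <- (WW^T)^{-1/2} W: it works when W has full row rank, but when W is over-complete, WW^T is rank-deficient and its inverse square root does not exist.

```python
import numpy as np

def sym_orthonormalize(W):
    # inverse square root of W W^T via eigen-decomposition
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 16))          # complete/under-complete: works
W = sym_orthonormalize(W)
print(np.allclose(W @ W.T, np.eye(10)))    # rows are now orthonormal

W_over = rng.standard_normal((32, 16))     # over-complete: W W^T has rank 16,
# so (W W^T)^{-1/2} does not exist; its spectrum contains (near-)zeros.
vals = np.linalg.eigvalsh(W_over @ W_over.T)
print(np.min(vals) < 1e-8)                 # singular: orthonormalization fails
```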
To address the above issues, Q.V. Le et al. replaced the orthonormality constraint with a robust soft reconstruction cost for ICA (RICA). RICA can thus learn sparse representations with a highly over-complete basis even on unwhitened data. However, this model is still a linear technique and therefore cannot discover nonlinear relationships among the input data. Additionally, as an unsupervised method, RICA does not consider the association between a training sample and its class, which may be insufficient for classification tasks.
Recall that, to explore nonlinear features, the kernel trick can be used to nonlinearly project the input data into a high-dimensional feature space. We therefore develop a kernel extension of RICA (kRICA) to represent data with nonlinear structure. In addition, to bring in label information, we further extend the unsupervised kRICA to a supervised one, namely d-kRICA, by introducing a discrimination constraint. This constraint jointly maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which leads to a structured basis consisting of basis subsets corresponding to the class labels. Each subset then sparsely represents its own class well but not the others. Furthermore, data samples belonging to the same class obtain similar representations, so the resulting sparse representation carries more discriminative power.
It is important to note that this work builds on our previous work DRICA. In comparison to DRICA, we improve our work as follows:
1) By taking advantage of the kernel trick, we replace the linear projection with a nonlinear one to capture nonlinear features. Experimental results show that our kernel extension usually further improves image classification accuracy.
2) The discriminative capability of the basis is further enhanced by maximizing the homogeneous representation cost in addition to minimizing the inhomogeneous representation cost. Thus, we obtain a set of more discriminative basis vectors that are forced to represent their own classes well but the others poorly. Experiments show that this basis further boosts image classification performance.
3) In the experiments, we conduct a comprehensive analysis of the proposed method, e.g., the effects of different parameters and kernels on image classification, the experimental settings, and a comparative similarity analysis.
The rest of the paper is organized as follows. In Section 2, we revisit related works on sparse coding and RICA, and describe the connection between them. Then we give a brief review of reconstruction ICA in Section 3. Section 4 introduces the details of our proposed kRICA, including its optimization problem and implementation. By incorporating the discrimination constraint, kRICA is further extended to supervised learning in Section 5. Section 6 presents extensive experimental results on image classification. Finally, we conclude our work in Section 7.
2 Related Work
In this section, we review related work in the following aspects: (1) sparse coding and its applications; (2) the connection between RICA and sparse coding; (3) other kernel sparse representation algorithms.
Sparse coding is an unsupervised method for reconstructing a given signal by selecting a relatively small subset of basis vectors from an over-complete basis set while keeping the reconstruction error as small as possible. Because of its sound statistical foundations, sparse coding has attracted increasing attention in the computer vision field and has been successfully applied to many computer vision tasks, e.g., image classification [17, 18, 19, 20] and image restoration. This success is largely due to two factors:
1) The sparsity characteristic exists ubiquitously in many computer vision applications. For example, in image classification, the components of an image can be sparsely reconstructed using similar components of other images from the same class. Another example is face recognition: a test face image can be accurately reconstructed by a few training images from the same category. Sparsity is therefore the foundation of these sparse-coding-based applications.
2) Images are often corrupted by noise, which may arise from sensor imperfections, poor illumination, or communication errors. Sparse coding can effectively select the relevant basis vectors to reconstruct the clean image, and it can deal with noise by allowing reconstruction error while promoting sparsity. It has therefore been successfully applied to image denoising, image restoration, etc.
Similar to sparse coding, ICA with a reconstruction cost (RICA) can also learn highly over-complete sparse representations. In addition, it has been shown that RICA is mathematically equivalent to sparse coding when using explicit encoding and ignoring the norm-ball constraint.
The above-mentioned studies only seek sparse representations of the input data in the original data space and are thus unable to represent data with nonlinear structure. To solve this problem, Yang et al. developed a two-phase kernel ICA algorithm: whitened kernel principal component analysis (KPCA) followed by ICA. In a different approach, a contrast function based on canonical correlations in a reproducing kernel Hilbert space was proposed. However, neither of these methods can learn over-complete sparse representations of nonlinear features, owing to the orthonormality constraint. To find such representations, Gao et al. [25, 26] presented a kernel sparse coding method (KSR) in a high-dimensional feature space, but as an unsupervised approach it fails to utilize class information. In Section 4.3, we show that our proposed kernel extension of RICA (kRICA) is equivalent to KSR under certain conditions.
3 Reconstruction ICA
Since sparsity is one form of non-gaussianity, maximizing sparsity in ICA is equivalent to maximizing independence. Given an unlabeled data set of m samples, each an n-dimensional vector, the optimization problem of standard ICA is generally defined as
where g is a nonlinear convex function, W is the basis matrix, k is the number of basis vectors, the j-th row of W is the j-th basis vector, and
I is the identity matrix. The orthonormality constraint WW^T = I is traditionally utilized to prevent the basis vectors in W from becoming degenerate. A good general-purpose smooth penalty is g(.) = log(cosh(.)).
However, as pointed out above, the orthonormality constraint makes it difficult for standard ICA to learn an over-complete basis. In addition, ICA is sensitive to whitening. These drawbacks prevent ICA from scaling to high-dimensional data. Consequently, RICA replaces the orthonormality constraint in ICA with a soft reconstruction cost. Applying this replacement to Equation (2), RICA can be formulated as the following unconstrained problem
where the parameter trades off reconstruction against sparsity. By swapping the orthonormality constraint for a reconstruction penalty, RICA can learn sparse representations even on unwhitened data when W is over-complete.
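A minimal numeric sketch of the RICA objective may help; this is our illustration (names and the log-cosh penalty are our choices), not the authors' implementation:

```python
import numpy as np

def rica_objective(W, X, lam=0.1):
    """W: k x n basis (k may exceed n), X: n x m data matrix."""
    m = X.shape[1]
    recon = np.sum((W.T @ W @ X - X) ** 2) / m      # soft reconstruction cost
    sparsity = np.sum(np.log(np.cosh(W @ X))) / m   # smooth L1-like penalty
    return lam * recon + sparsity

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))
W = rng.standard_normal((32, 16)) * 0.1   # over-complete basis is allowed
print(rica_objective(W, X) >= 0.0)        # both terms are non-negative
```

Because the orthonormality constraint is gone, nothing prevents k > n, which is exactly the over-complete regime standard ICA cannot handle.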
Furthermore, since the L1 penalty is not sufficient to learn invariant features, RICA [13, 27] replaces it with an L2 pooling penalty, which encourages pooling units to group similar features together so as to achieve complex invariances such as scale and rotational invariance. Pooling also promotes sparsity in feature learning. In particular, L2 pooling [28, 29] is a two-layered network with a square nonlinearity in the first layer and a square-root nonlinearity in the second layer:
where each row of the spatial pooling matrix H is fixed to uniform weights, and epsilon is a small constant that prevents division by zero.
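The two-layer pooling penalty described above can be sketched as follows (our illustration; the group size and names are our choices, with H fixed to uniform weights as in the text):

```python
import numpy as np

def l2_pooling_penalty(W, X, group=4, eps=1e-6):
    S = W @ X                      # first layer: linear filter responses
    Sq = S ** 2                    # square nonlinearity
    k = W.shape[0]
    # uniform pooling matrix: each row sums one group of adjacent features
    H = np.zeros((k // group, k))
    for j in range(k // group):
        H[j, j * group:(j + 1) * group] = 1.0
    return np.sum(np.sqrt(H @ Sq + eps))   # second layer: square root

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 50))
W = rng.standard_normal((32, 16)) * 0.1
print(l2_pooling_penalty(W, X) > 0.0)
```

Features pooled within the same group share one square-root term, so the penalty encourages them to activate together, which is the grouping effect that yields invariance.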
Nevertheless, RICA cannot represent data with nonlinear structure due to its intrinsic linearity. In addition, the model simply learns an over-complete basis under a reconstruction cost and does not consider the association between a training sample and its class, which may be insufficient for classification tasks. To address these problems, on the one hand we develop a kernel extension of RICA to find sparse representations of nonlinear features; on the other hand, we bring in class information to learn a basis that is more discriminative than that of unsupervised RICA, which facilitates better performance of the sparse representation in classification tasks.
4 Kernel Extension for RICA
Motivated by the ability of the kernel trick to capture nonlinear structure in data, we propose a kernel version of RICA, called kRICA, to learn sparse representations of nonlinear features.
4.1 Model Formulation
Suppose there is a kernel function k(x, y) = phi(x)^T phi(y) induced by a high-dimensional feature mapping phi. Given two data points x and y, k(x, y) represents a nonlinear similarity between them. The mapping phi maps the data and basis from the original data space to the feature space as follows.
Furthermore, by substituting the mapped data and basis into Equation (2), we can get the following objective function of kRICA.
Owing to its excellent performance in many computer vision applications [14, 25], the Gaussian kernel, i.e., k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), is used in this study. The norm-ball constraints on the basis in RICA can then be removed, since k(x, x) = 1 implies that every mapped point has unit norm.
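The unit-norm property that lets the norm-ball constraint be dropped is easy to check numerically; this is a small sketch of ours, not code from the paper:

```python
import numpy as np

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
# k(x, x) = 1 for every x, so each mapped point phi(x) has unit norm:
# ||phi(x)||^2 = k(x, x) = 1.
def gaussian_kernel(x, y, sigma=1.0):
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)
print(np.isclose(gaussian_kernel(x, x), 1.0))  # ||phi(x)|| = 1
print(0.0 < gaussian_kernel(x, y) <= 1.0)      # similarity lies in (0, 1]
```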
In addition, we perform kernel principal component analysis (KPCA) in the feature space for data whitening, which makes the ICA estimation problem simpler and better conditioned. When the data is whitened, there is a close relationship between kernel ICA and kRICA, captured by the following lemma:
Lemma 4.1 When the input data set is whitened in the feature space, the reconstruction cost is equivalent to the orthonormality cost ||WW^T - I||_F^2,
where ||.||_F is the Frobenius norm. Lemma 4.1 shows that kernel ICA's hard orthonormality constraint and kRICA's reconstruction cost are equivalent when the data is whitened; yet kRICA can learn over-complete sparse representations of nonlinear features, whereas kernel ICA cannot, due to the orthonormality constraint. See Appendix A for a detailed proof.
4.2 Optimization
Equation (5) is an unconstrained convex optimization problem. To solve it, we rewrite the objective as follows
where the relevant rows of the basis W and the elements of the pooling matrix H appear explicitly. Since each row of W appears inside the kernel, it is very hard to directly apply the optimization methods used in RICA, e.g., L-BFGS and CG, to compute the optimal basis. Instead, we alternately optimize each row of the basis. With respect to the row of W being updated, the derivative of the objective is
Then, to compute the optimal row, we set its derivative to zero. Since the row itself appears inside the kernel, it is challenging to solve Equation (7) directly, so we seek an approximate solution instead. Inspired by the fixed-point algorithm, when updating a row at the current iteration we use its value from the previous iteration to evaluate the kernel term, and we initialize the basis with k-means. With this substitution, Equation (7) becomes
When all the remaining rows are fixed, the problem reduces to a linear equation in the row being updated, which can be solved straightforwardly.
4.3 Connection between kRICA and KSR
There is a close connection between the proposed kRICA and KSR. Similar to kRICA, KSR attempts to find sparse representations of nonlinear features in a high-dimensional feature space, and its optimization problem is
where each code is the sparse representation of the corresponding sample. Nevertheless, there are two major differences between the methods.
(1) KSR utilizes an explicit encoding for the sparse representation of each input sample. Since the objective of Equation (8) in KSR is not jointly convex, the basis and the sparse codes must be optimized alternately.
(2) KSR employs the simple L1 penalty to promote sparsity, while kRICA uses L2 pooling instead, which forces the pooling units to group similar features together to achieve invariance while still promoting sparsity.
5 Supervised Kernel RICA
Given labeled training data, our goal is to utilize class information to learn a structured basis set consisting of basis subsets corresponding to different class labels, such that each subset sparsely represents its own class well but not the others. To learn such a basis, we further extend the unsupervised kRICA to a supervised one, namely d-kRICA, by introducing a discrimination constraint.
Mathematically, when each sample is labeled with one of C classes (C being the total number of classes), we can further utilize the class information to learn a structured basis set in which each basis subset represents the samples of its own class well rather than those of other classes, with the same number of basis vectors in each subset. The coefficients of a sample with respect to this structured basis can then be regarded as its sparse representation.
5.1 Discrimination constraint
Since we aim to utilize class information to learn a structured basis, we hope that a sample of a given class will be reconstructed only by the basis subset of that class. To achieve this, an inhomogeneous representation cost constraint [15, 31] was previously utilized to minimize the inhomogeneous representation coefficients, i.e., the coefficients corresponding to basis vectors outside the sample's own subset. However, this constraint focuses only on minimizing the inhomogeneous coefficients and fails to maximize the homogeneous ones, which is not sufficient for learning an optimal structured basis. Consequently, we introduce a discrimination constraint that jointly maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost. Specifically, the two costs are
where the matrices P and Q select the homogeneous and inhomogeneous representation coefficients of the sample, respectively. For example, assuming equally sized basis subsets and C = 3 classes, P and Q can be defined as follows.
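One plausible construction of these selectors (our sketch; the 0-indexed labels, subset size, and function names are our choices for illustration) is:

```python
import numpy as np

def selectors(label, C=3, t=2):
    """Selection matrices for C classes with t basis vectors per subset."""
    k = C * t                        # total number of basis vectors
    own = np.zeros(k)
    own[label * t:(label + 1) * t] = 1.0
    P = np.diag(own)                 # homogeneous selector: own subset
    Q = np.eye(k) - P                # inhomogeneous selector: the rest
    return P, Q

s = np.array([0.9, 0.8, 0.1, 0.0, 0.05, 0.0])  # codes for a class-0 sample
P, Q = selectors(label=0)
print(np.allclose(P + Q, np.eye(6)))           # the selectors partition s
print(np.linalg.norm(P @ s) ** 2 > np.linalg.norm(Q @ s) ** 2)
# for a discriminative basis, the homogeneous cost dominates
```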
Intuitively, we could define the discrimination constraint as the difference between the inhomogeneous and homogeneous costs, so that the sparse representation in terms of the basis matrix concentrates on the sample's own basis subset. However, this constraint is non-convex and unstable. To address the problem, we incorporate an elastic term, and the constraint is defined as
It can be proved that the constraint is strictly convex in the representation when the elastic weight is large enough; see Appendix B for a detailed proof. Constraint (10) simultaneously maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which leads to a structured basis consisting of basis subsets corresponding to the class labels. Each subset then sparsely represents its own class well but not the others. Furthermore, data samples belonging to the same class obtain similar representations, and thereby the new representations carry more discriminative power.
By incorporating the discrimination constraint into the kRICA framework (d-kRICA), we can get the following objective function
where the scalars control the relative contributions of the corresponding terms. Given a test sample, Equation (11) means that the learned basis set can sparsely represent it in the nonlinear feature space while demanding that its homogeneous representation be as large as possible and its inhomogeneous representation as small as possible. Following kRICA, the optimization problem (11) can be solved by the fixed-point algorithm proposed above.
6 Experiments
In this section, we first introduce the feature extraction used for image classification. Then we evaluate the performance of kRICA and d-kRICA for image classification on three public datasets: Caltech 101, CIFAR-10, and STL-10. Furthermore, we study the selection of tuning parameters and kernel functions for our method. Finally, we present similarity matrices to further illustrate the performance of kRICA and d-kRICA.
6.1 Feature Extraction for Classification
Given an input image patch (with channels) whose side length is termed the 'receptive field size', kRICA transforms it to a new representation in the feature space. For a full image (with channels), following the same setting as in prior work, we obtain a feature map by estimating the representation of every 'sub-patch' of the input image. To reduce the dimensionality of the image representation, we use a similar pooling method to form a reduced pooled representation for image classification. Given the pooled feature of each image, we use a linear SVM for classification.
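The pipeline above (encode every receptive-field sub-patch, then pool over a spatial grid) can be sketched as follows. This is our illustration on a single-channel toy image with a 2x2 pooling grid; the linear encoder stands in for the learned kRICA representation.

```python
import numpy as np

def extract_features(img, W, rf=6):
    """Encode every rf x rf sub-patch of img with W, then 2x2 sum-pool."""
    h, w = img.shape
    rows = []
    for i in range(h - rf + 1):
        for j in range(w - rf + 1):
            patch = img[i:i + rf, j:j + rf].ravel()
            rows.append(W @ patch)            # encode one sub-patch
    k = W.shape[0]
    grid = np.array(rows).reshape(h - rf + 1, w - rf + 1, k)
    # sum-pool each quadrant -> 4 * k dimensional image representation
    hh, ww = grid.shape[0] // 2, grid.shape[1] // 2
    quads = [grid[:hh, :ww], grid[:hh, ww:], grid[hh:, :ww], grid[hh:, ww:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))           # single-channel toy image
W = rng.standard_normal((50, 36)) * 0.1       # 50 filters on 6x6 patches
print(extract_features(img, W).shape)         # (200,) = 4 quadrants x 50
```

The resulting fixed-length vector is what would be fed to the linear SVM.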
6.2 Classification on Caltech 101
The Caltech 101 dataset consists of 9,144 images divided among 101 object classes and 1 background class, including animals, vehicles, etc. Following the common experimental setup, we run our algorithm with 15 and 30 training images per category, using 10x10 receptive fields. Comparison results are shown in Table I. We compare our classification accuracy with ScSPM, D-KSVD, LC-KSVD, RICA, KICA, KSR, and DRICA. In addition, in order to compare with DRICA, we incorporate the discrimination constraint (10) into the RICA framework (2), namely d-RICA.
Table I shows that kRICA and d-kRICA outperform the other competing approaches.
6.3 Classification on CIFAR-10
The CIFAR-10 dataset comprises 10 categories, such as airplane, automobile, truck, and horse, with 60,000 32x32 color images in total, 6,000 per category. There are 50,000 training images and 10,000 test images: 1,000 images from each class are randomly selected as test images and the remaining 5,000 per class serve as training images. In this experiment, we fix the size of the basis set to 4,000 with 6x6 receptive fields. We compare our approach with RICA, K-means (Triangle, 4000 features), KSR, DRICA, d-RICA, etc.
Table II shows the effectiveness of our proposed kRICA and d-kRICA.
| Method | Accuracy |
| Improved Local Coord. Coding | 74.5% |
| Conv. Deep Belief Net (2 layers) | 78.9% |
| Sparse auto-encoder | 73.4% |
| Sparse RBM | 72.4% |
| K-means (Hard) | 68.6% |
| K-means (Triangle) | 77.9% |
| K-means (Triangle, 4000 features) | 79.6% |
6.4 Classification on STL-10
The STL-10 dataset contains 10 classes (e.g., airplane, dog, monkey, ship), where each image is a 96x96-pixel color image. The dataset is divided into 500 training images (10 pre-defined folds) and 800 test images per class, plus 100,000 unlabeled images for unsupervised learning. In our experiments, we set the size of the basis set to 1,600 and use 8x8 receptive fields, in the same manner as described previously.
As can be seen, d-RICA achieves better performance than DRICA on all of the above datasets. This is because DRICA only minimizes the inhomogeneous representation cost for structured basis learning, while d-RICA simultaneously maximizes the homogeneous representation cost and minimizes the inhomogeneous one, which makes the learned sparse representation carry more discriminative power. Although both DRICA and d-RICA use class information, unsupervised kRICA still performs better than both, which shows that representing data with nonlinear structure contributes more discriminative power for classification. Additionally, since kRICA uses L2 pooling instead of the L1 penalty to achieve feature invariance, it outperforms KSR. Finally, d-kRICA performs better than kRICA in all cases by bringing in class information.
We also investigate the effect of basis size for kRICA and d-kRICA on the STL-10 dataset, trying seven sizes: 50, 100, 200, 400, 800, 1200, and 1600. As shown in Fig. 1, the classification accuracies of d-kRICA and kRICA continue to increase as the basis size grows to 1,600, with only slight gains beyond a basis size of 800. Notably, d-kRICA outperforms all the other algorithms throughout.
| Method | Accuracy |
| Raw pixels | 31.8% |
| K-means (Triangle, 1600 features) | 51.5% |
| RICA (8x8 receptive fields) | 51.4% |
| RICA (10x10 receptive fields) | 52.9% |
6.5 Tuning Parameter and Kernel Selection
In the experiments, the tuning parameters of kRICA and d-kRICA in the objective function are selected by cross-validation to avoid over-fitting. More specifically, we set these parameters experimentally as follows.
The effect of the sparsity weight: this parameter weights the sparsity term and is an important factor in kRICA. To facilitate parameter selection, we experimentally investigate in Fig. 2 how the performance of kRICA varies with it on the STL-10 dataset, and we adopt the best-performing value for STL-10. In addition, we test the accuracy of RICA under the same sparsity weight; our nonlinear kRICA consistently outperforms linear RICA across the range of values. The values for the Caltech and CIFAR-10 data are set experimentally in the same way.
The effect of the discrimination weight: this parameter controls the weight of the discrimination constraint term; when it is zero, the supervised d-kRICA optimization problem reduces to the unsupervised kRICA problem. Fig. 3 shows the relationship between this weight and classification accuracy on STL-10, and we adopt the best-performing value for STL-10. In particular, d-RICA outperforms DRICA over a wide range of values, because DRICA only minimizes the inhomogeneous representation cost, while d-RICA jointly optimizes both the homogeneous and inhomogeneous costs for basis learning, making the learned sparse representations more discriminative. Furthermore, by representing data with nonlinear structure, d-kRICA carries more discriminative power and outperforms both algorithms. The values for the Caltech and CIFAR-10 data are set in the same way.
The effect of the kernel bandwidth: when using the Gaussian kernel in kRICA, it is vital to select the bandwidth sigma, which affects image classification accuracy. Fig. 4 shows the relationship between sigma and classification accuracy on the STL-10 dataset, from which we choose the value for STL-10. The values for the Caltech and CIFAR-10 data are set experimentally in the same way.
We also investigate the effect of different kernels on kRICA in image classification, i.e., the Polynomial kernel, the Inverse Distance kernel, the Inverse Square Distance kernel, and the Exponential Histogram Intersection kernel. Following previous work, we set b = 3 for the Polynomial kernel and b = 1 for the others. Table IV reports the classification performance of the different kernels on the STL-10 dataset; the Gaussian kernel outperforms the others, so we employ the Gaussian kernel in our studies.
| Kernel | Accuracy |
| Inverse Distance kernel | 38.3% |
| Inverse Square Distance kernel | 47.6% |
| Exponential Histogram Intersection kernel | 36.5% |
6.6 Similarity Analysis
In the sections above, we have shown the effectiveness of kRICA and d-kRICA for image classification. To further illustrate their performance, we first choose 90 images from three classes of Caltech 101, 30 images per class. We then compute the similarity between the sparse representations of these images for RICA, kRICA, and d-kRICA, respectively. Fig. 5 shows the corresponding similarity matrices, where each element is the similarity between the sparse representations of a pair of images, measured via Euclidean distance. Since a good sparse representation method makes representations of the same class more similar, its similarity matrix should be block-wise. Fig. 5 shows that nonlinear kRICA carries more discriminative power than linear RICA, and d-kRICA performs best by bringing in class information.
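A sketch of how such a similarity matrix can be built from sparse codes (our construction: a Gaussian of the pairwise Euclidean distance, so that same-class images form bright diagonal blocks):

```python
import numpy as np

def similarity_matrix(S):
    """S: m x k matrix, one sparse representation per row."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)  # pairwise ||si - sj||^2
    return np.exp(-d2)                                   # similarity in (0, 1]

rng = np.random.default_rng(0)
# toy codes: two classes of 3 images each, clustered around class templates
S = np.vstack([rng.standard_normal((3, 8)) * 0.1 + 2.0,
               rng.standard_normal((3, 8)) * 0.1 - 2.0])
M = similarity_matrix(S)
print(M[0, 1] > M[0, 4])   # within-class similarity exceeds between-class
```

A block-wise structure in M (high values only inside same-class blocks) is exactly the signature of a discriminative representation that Fig. 5 visualizes.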
7 Conclusion
In this paper, we propose a kernel ICA model with a reconstruction constraint (kRICA) to capture nonlinear features. To bring in class information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint. This constraint leads to a structured basis consisting of basis subsets corresponding to different class labels, such that each subset sparsely represents its own class well but not the others. Furthermore, data samples belonging to the same class obtain similar representations, and the resulting sparse representation thereby carries more discriminative power. Experiments on standard datasets demonstrate the effectiveness of the proposed method.
Appendix A Proof of Lemma 4.1
Since the input data set is whitened in the feature space by KPCA, we have
where I is the identity matrix. Furthermore, we can obtain
where tr(.) denotes the trace of a matrix, and the derivation employs the matrix property tr(AB) = tr(BA). Thus, the reconstruction cost is equivalent to the orthonormality constraint when the data is whitened in the feature space.
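For concreteness, the chain of equalities behind Lemma 4.1 can be sketched as follows (notation ours: W is the k x n basis in the feature space, and whitening gives Phi(X)Phi(X)^T = I):

```latex
\begin{align*}
\|W^{\top}W\Phi(X) - \Phi(X)\|_F^2
  &= \operatorname{tr}\!\big((W^{\top}W - I)\,\Phi(X)\Phi(X)^{\top}\,(W^{\top}W - I)\big) \\
  &= \operatorname{tr}\!\big((W^{\top}W - I)^2\big)
     && \text{(whitening: } \Phi(X)\Phi(X)^{\top} = I\text{)} \\
  &= \operatorname{tr}(WW^{\top}WW^{\top}) - 2\operatorname{tr}(WW^{\top}) + n
     && \text{(using } \operatorname{tr}(AB) = \operatorname{tr}(BA)\text{)} \\
  &= \|WW^{\top} - I\|_F^2 + (n - k),
\end{align*}
```

so the two costs differ only by the constant n - k and are minimized by the same W.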
Appendix B Proof of the convexity of
We rewrite Equation (10) as
Then we can obtain its Hessian matrix with respect to the representation.
Without loss of generality, we assume
After some derivations, we obtain the following expression, where
The convexity of the constraint depends on whether its Hessian matrix, i.e., the matrix above, is positive definite. The matrix is positive definite if and only if the quadratic form it induces is positive for all nonzero vectors. Considering an upper-left submatrix of appropriate size, we have
Furthermore, we can get
Defining the corresponding scalar function, under the stated condition on the elastic weight it is easy to verify that
the function is positive. Thus, the Hessian matrix is positive definite, which guarantees that the constraint is convex in the representation.
Acknowledgments
We thank Wende Dong for helpful discussions, and acknowledge Quoc V. Le for providing the RICA code.
References
-  E. Candes and M. Wakin, "An introduction to compressive sampling," IEEE Signal Proc. Mag., vol. 25, no. 2, pp. 21-30, Mar. 2008.
-  D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
-  D. Lee, H. Seung et al., “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
-  B. Olshausen et al., “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
-  M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.
-  Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. Comput. Vis. Pattern Recognit., 2010.
-  Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Proc. Adv. Neural Inform. Process. Syst., vol. 19, 2007, p. 153.
-  G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
-  A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. Wiley-interscience, 2001, vol. 26.
-  A. Hyvärinen, J. Hurri, and P. Hoyer, Natural image statistics. Springer, 2009, vol. 1.
-  A. Hyvärinen, P. Hoyer, and M. Inki, “Topographic independent component analysis,” Neural Comput., vol. 13, no. 7, pp. 1527–1558, 2001.
-  A. Coates, H. Lee, and A. Ng, “An analysis of single-layer networks in unsupervised feature learning,” in Proc. AISTATS, 2010.
-  Q. Le, A. Karpenko, J. Ngiam, and A. Ng, “Ica with reconstruction cost for efficient overcomplete feature learning,” in Proc. Adv. Neural Inform. Process. Syst., 2011.
-  J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge university press, 2004.
-  Y. Xiao, Z. Zhu, S. Wei, and Y. Zhao, “Discriminative ica model with reconstruction constraint for image classification,” in Proc. ACM Multimedia, 2012, pp. 929–932.
-  D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Commun. Pur. Appl. Math., vol. 59, no. 6, pp. 797–829, 2006.
-  J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 1794 –1801.
-  K. Yu and T. Zhang, “Improved local coordinate coding using local tangents,” in Proc. Int. Conf. Mach. Learn., 2010.
-  Z. Jiang, Z. Lin, and L. Davis, “Learning a discriminative dictionary for sparse coding via label consistent k-svd,” in Proc. Comput. Vis. Pattern Recognit., 2011, pp. 1697–1704.
-  J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
-  J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” in Proc. Comput. Vis. Pattern Recognit., 2009, pp. 2272–2279.
-  M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer, 2010.
-  J. Yang, X. Gao, D. Zhang, and J. Yang, “Kernel ica: An alternative formulation and its application to face recognition,” Pattern Recognit., vol. 38, no. 10, pp. 1784–1787, 2005.
-  F. Bach and M. Jordan, “Kernel independent component analysis,” J. Mach. Learn. Res., vol. 3, pp. 1–48, 2003.
-  S. Gao, I. Tsang, and L. Chia, “Kernel sparse representation for image classification and face recognition,” in Proc. ECCV, 2010, pp. 1–14.
-  S. Gao, I. W.-H. Tsang, and L.-T. Chia, “Sparse representation with kernels,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 423–434, feb. 2013.
-  Q. Le, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, “Building high-level features using large-scale unsupervised learning,” in Proc. Int. Conf. Mach. Learn., 2012.
-  Y. LeCun, “Learning invariant feature hierarchies,” in Proc. ECCV. Workshops and Demonstrations, 2012, pp. 496–505.
-  Y.-L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 111–118.
-  M. Schmidt, “minfunc,” 2005.
-  J. Yang, J. Wang, and T. Huang, “Learning the sparse representation for classification,” in Proc. ICME, 2011, pp. 1–6.
-  L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” Comput. Vis. Image Und., vol. 106, no. 1, pp. 59–70, 2007.
-  A. Krizhevsky, "Convolutional deep belief networks on CIFAR-10," Unpublished manuscript, 2010.
-  S. Boyd and L. Vandenberghe, Convex optimization, 2004.
-  G. H. Golub and C. F. Van Loan, Matrix computations, 1996.