Robust Sparse Coding via Self-Paced Learning

09/10/2017 ∙ by Xiaodong Feng, et al. ∙ 0

Sparse coding (SC) is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications. However, most existing sparse coding algorithms are nonconvex and are thus prone to becoming stuck into bad local minima, especially when there are outliers and noisy data. To enhance the learning robustness, in this paper, we propose a unified framework named Self-Paced Sparse Coding (SPSC), which gradually include matrix elements into SC learning from easy to complex. We also generalize the self-paced learning schema into different levels of dynamic selection on samples, features and elements respectively. Experimental results on real-world data demonstrate the efficacy of the proposed algorithms.



There are no comments yet.


page 1

page 2

page 3

page 4

page 5

page 6

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Learning effective representations for high dimensional data plays an essential role in machine learning tasks for various types of data, such as text, image, speech and video data. Among these techniques, sparse coding is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications, especially computer vision tasks, including supervised image classification

[Wright et al.2009], semi-supervised image classification [Lu et al.2015], image clustering [Yang et al.2014, Feng, Wu, and Zhou2017], image restoration [Dong et al.2015], etc. Usually, in these areas, the signals studied are image feature or instance, and sparse coding is a powerful tool for feature/image reconstruction[Yang et al.2009, Feng, Wu, and Zhou2017] and similarity measurement [Yan and Wang2009, Yang et al.2014]. Besides, it has also been successfully applied in text mining [Li et al.2015], speech signal processing [Zhen et al.2016], recommendation [Feng et al.2016], etc.

Basically, given a data matrix , consisting of data points measured on dimensions,

is a vector representing the

-th data point. Sparse Coding (SC) [Lee et al.2006] aims at extracting a dictionary (also called a base) that consists of a set of atom items (called a basis), and converts data instances into sparse coordinates with respect to the dictionary. In other words, it decomposes the data matrix into a dictionary matrix and a sparse code matrix , so that each data sample can be represented as a sparse linear combination of atom items (i.e., ) in the dictionary with corresponding sparse coefficient vector . Formally, the sparse coding problem can be described as the following optimization problem:


where denotes the matrix Frobenius norm (i.e., , ), denotes the norm of a given vector (i.e., , ), and is the sparsity regularization parameter that represents the tradeoff between reconstruction error and sparsity.

Naturally, the optimization of sparse coding in (1) is not jointly convex for both and , but it is convex in (while holding fixed) and convex in (while holding fixed) separately. Thus, it is usually solved iteratively by alternatingly minimizing over and , while holding the other fixed. In this way, when is fixed, (1) becomes an -regularized non-smooth convex optimization problem and many efforts have been done for solving this type of problem [Yang et al.2010]. When is fixed, it is a well-defined constrained least squares problem, which can be solved efficiently by existing optimization methods [Lee et al.2006].

Though we can optimize SC by an alternative method, there is a big limitation caused by the non-convexity of the objective in (1

). This deficiency will drive the current SC optimization approach get stuck into bad local minima, especially when many noises and outliers exist in the datasets. To relieve the affect of non-convexity, a heuristic approach is to run the algorithm multiple times with different initializations or learning rates, and pick the best results based on cross-validation. However, this strategy is ad hoc and generally inconvenient to implement in unsupervised setting, since there is no straightforward criterion for choosing a proper solution. Thus, for robust SC optimization, our goal is to alleviate the influence of the noisy and confusing data, as the confusing data usually correspond to the highly nonlinear local patterns which is hardly learnable for the model space, and the noisy ones are the outliers that should not be learned, such as noisy pixels in images, or noisy words in texts. Typically, this learning robustness will be achieved by a well sample selection to distinguish the reliable samples from the confusing ones. Fortunately, a recently studied learning regime, named

- (SPL) [Kumar, Packer, and Koller2010] is making effort for this issue. The core idea of SPL is to learn the model with easier samples first and then gradually process more complex ones, which is simulating the learning principle of humans/animals. This learning mechanism is empirically demonstrated to be beneficial for some learning tasks [Xu, Tao, and Xu2015, Zhao et al.2015, Li et al.2017b], where it can in avoid bad local minima and achieve a better generalization result. Therefore, incorporating it into SC is expected to alleviate the local minimum issue of alternative learning.

To enhance the learning robustness, in this paper, we propose a unified framework named - (SPSC). With an adaptive pace from easy to hard in optimization, SPSC can dynamically introduce the data values into learning from easy ones to complex ones. Specifically, for different levels of dynamic selection, we extends SPSC into three variants, that is - SPSC, - SPSC and - SPSC. We also present a simple yet effective algorithm framework for solving the proposed SPSC problems. Experimental results on the real-world clean datasets and those with corruption demonstrate the effectiveness of the proposed approach, compared to the state-of-the-arts.

Related Work

In this section, we will review related works on sparse coding and self-paced learning, as our work - lies in cross road of these two fields.

Sparse Coding

Sparse coding consists of two components, that is, the optimization of the corresponding sparse codes (as dictionary is known) and the learning of optimal dictionary (as sparse code is given). For sparse code optimization, when the sparsity of codes is measured as -norm, it can be solved by methods such as matching pursuit, orthogonal matching pursuit, basis pursuit, etc; and multiple approaches are presented for -norm ionization, such as coordinate descent [Wu and Lange2008], interior-point method [Kim et al.2007], active set method [Lee et al.2007], and augmented lagrange multiplier method [Lin, Chen, and Ma2010]

The dictionary is usually automatically learned from data rather than by a predetermined dictionary. Typically, sparse coding is suitable to reconstruct data, and consequently variants of traditional sparse coding are presented for different purposes: (i) Structure regularized sparse coding. Structured sparse coding methods exploit different structure priors among the data. [Sun et al.2014] incorporate additional structured sparsity to the sparse coding objective functions, which leads to the promising performances on the applications. Graph/Hypergrpah regularized sparse coding [Zheng et al.2011, Gao, Tsang, and Chia2013, Feng, Wu, and Zhou2017] are proposed to preserve the intrinsic manifold structure among high dimensional data objects, and impose Laplacian regularization as a manifold regularization on sparse coefficients. (ii) Supervised/Semi-supervised sparse coding. In this group, label information of a training dataset is leveraged to learn a discriminative dictionary for better classification performances. Approaches in [M. Yang and Feng2011] attempt to learn multiple dictionaries or category-specific dictionaries to promote discrimination between classes. Others in [Z. Jiang and Davis2012] learn a dictionary shared by all classes by merging or selecting atom items from an initially large dictionary or incorporates discriminative terms into the objective function during training [Zhang and Li2010, Z. Jiang and Davis2013, Zheng and Tao2015]. (iii) Muli-modal sparse coding. In this group, traditional sparse coding approaches can be extended to benefit from the information encoded in multiple sources for cross-domain sparse representation [Jing et al.2014, Li et al.2017a].

Most of the current SC models are developed on nonconvex optimization problems, and thus always encounter the local minimum problem, especially when there are noisy and outlier data. We thus aim to alleviate this issue by advancing it into the SPL framework.

Self-Paced Learning

Inspired by the intrinsic cognitive principle of humans/animals, Bengio et al. Ref:bengio2009curriculum initialized a new learning paradigm called (CL), the core idea of which is to learn a model by gradually including samples into training from easy to complex for training non-convex objectives. This learning process is beneficial for alleviating local solation in non-convex optimization, as supported by empirical evaluation [Basu and Christensen2013]. However, it requires a priori identification to identify the easy and the hard samples in advance, which is difficult in real-world applications. To overcome the limitation due to the heuristic sample easiness ranking as in CL, Kumar et al. Ref:Kumar2010Self proposed a concise model, called - (SPL) to introduce a regularizor into the learning objective. SPL can automatically optimize an appropriate curriculum by the model itself with a weighted loss term on all samples and a general SPL regularizer imposed on sample weights. Jiang et al. Ref:jiang2015self presented a more comprehensive model learning underlying SPL/CL, as a general optimization problem as

s.t. (2)

where a training set ;

denotes the loss function which calculates the cost between the true label

and estimated label

under a model with as its model parameter to learn; denotes the weight variables reflecting the samples confidence; corresponds to a self-paced regularizer function for the sample selection scheme; is a feasible region that encodes the information of a predetermined curriculum; is a parameter to control the learning pace, also referred as ”pace age”. For the self-paced regularizer function , Jiang et al. Ref:jiang2015self abstracted three necessary condition should be satisfied.

By sequentially optimizing the model with gradually increasing pace parameter on the SPL regularizer, more samples can be automatically discovered in a pure self-paced way. The number of samples selected at each iteration is determined by a weight that is gradually annealed such that later iterations introduce more samples. The algorithm converges when all samples have been considered and the objective function cannot be improved further. By virtue of its generality, SPL have been considered in a broad spectrum of latent variable models, such as matrix factorization [Zhao et al.2015], clustering [Xu, Tao, and Xu2015], multi-task learning [Li et al.2017b], dictionary learning [Tang, Yang, and Gao2012, Xu et al.2016]. Opposite to existed self-paced dictionary learning, in this paper we try to utilize the generalized SPL sheila and implement a soft sample selection accordingly rather than a heuristic hard strategy sample selection in [Tang, Yang, and Gao2012]. Also, in [Xu et al.2016]

, it only address the issue of bad local minimum of non-convex sparse coding optimization by assigning weights to different samples of training data. However, due to the complexity and noisy nature of real data, the confidence levels of features of the dataset may also vary (eg., features extracted with different approaches for a set of images should have different confidence level), and the confidence levels of the specific values in the same sample also vary (eg., in image denoising task, noisy pixels should have lower confidence level comparing to the uncorrupted ones in the same image). Thus, considering a unified self-paced learning for sparse coding may be the further direction.

Self-Paced Sparse Coding

The Formulation of SPSC

Generally, the objective of SC is to learn a dictionary and representation from such that it can well reconstruct the data and can follow some prior knowledge as in (1). Inspired by the fact that humans often learn concepts from the easiest to the hardest as in SL and SPL, we incorporate the easy-to-hard strategy operated on the samples into the learning process of SC, as - (SPSC). Thus, following the framework of SPL in (Self-Paced Learning), the new learning objective function of SPSC can be formulated as:

s.t. (3)

where denotes the weight vector imposed on samples in ; let ’’ denote element-wise square root of a matrix; ‘’ denotes the diagonal matrix with diagonal entries as those in the vector, that is,

It is noted that, different from the original SPL learning framework as in (Self-Paced Learning

), as sparse coding is an originally unsupervised learning, the loss function

corresponds to , which measures the reconstruction error between and estimated after learning.

SPL Regularizer

For regularizer , the original SPL [Kumar, Packer, and Koller2010] adopted negative -norm as

Under this regularizer, with fixed , the global optimal is calculated by the optimization as


Then, the solution can be written as a hard threshold schema:


Instead of hard weighting, in our experiments, we utilize the linear soft weighting regularizer [Jiang et al.2015] due to its relatively easy implementation and well adaptability to complex scenarios. This regularizer penalizes the sample weights linearly in terms of the loss. Specifically, we have


(6) is convex with respect to , and thus the global minimum can be obtained with analytical solution for the linear soft weighting, as,


It is seen that if an sample can be well reconstructed (), it will be selected as an easy sample () and the weight is decreasing with respect to the reconstruction error, or otherwise unselected (). The parameter controls the pace at which the model learns new samples, and physically it can be interpreted as the ”age” of the model. When is smaller, more samples will be unselected as their corresponding losses are more likely to be greater than . As becomes larger, more samples with larger losses will be gradually considered.

Regularization on Sparse Codes

Prior relational knowledge among the samples is very useful when learning new representations. To embed this knowledge into sparse coding, we will adopt the newly proposed manifold regularizer, called hypergraph consistency regularization (HC) [Feng, Wu, and Zhou2017], which is empirically shown to perform better in sparse coding. Specifically, it tries to reconstruct the hyperedge incidence matrix using sparse codes in addition to the Laplcaican regulation on as


where, is the incidence matrix of hypergraph which consists of a set of vertices , and a set of hyperedges . The hypergraph can be constructed using the original data of to record locally geometrical structure of the original data; , if ; , otherwise. More generally, can also be defined in a probabilistic way [Huang et al.2010], such that . denotes the Laplacian matrix defined as (

is the identity matrix);

is the weight matrix of a hyergraph to measure how close each pair of two vertices in the manifold space are, which can be defined as follows (with normalization): where and respectively denote the diagonal matrices of hyperedge degrees and vertex degrees, and denotes a diagonal matrix of the hyperedge weights. By minimizing , the incidence matrix can be well reconstructed and the consistency of the hypergraph structure on sparse code space is guaranteed.

SPL Selection on Different Levels

It is seen that SPSC in (The Formulation of SPSC) is implementing sample-level selection during the learning process, where the sample can be an image instance, a text, or a user in various tasks. Accordingly, we note the SPSC in (The Formulation of SPSC) as - SPSC (SPSC). Usually, a sample is represented by a multi-dimensional vector with each element as a partial description of . As in in (The Formulation of SPSC), each element is treated the same, that is, equal confidence for model learning. However, due to the complexity and noisy nature of real data, the confidence levels of the specific element in the same sample also vary (eg., in image denoising task, noisy pixels should have lower confidence level comparing to the uncorrupted ones in the same image). Also the confidence levels of features could be also different (eg., features extracted with different approaches for a set of images should corresponds to different confidence level), We then show how to incorporate the easy-to-hard strategy operated on elements and features into the learning process of SC.

- Spsc

When operating SPSC on element-level learning, the loss for each element and could be defined as . Then the learning objective of - SPSC (SPSC) becomes the following formulation:

s.t. (9)

Where represents the matrix composing of which denotes the weights imposed on the observed elements of ; ‘’ denotes element-wise multiplication, that is,

Here denotes the -th row vector of dictionary . Thus, is the self-paced regularizer determining the elements to be selected in learning. To defines , the SPL regularizer in (6) can be easily generalized as


Accordingly, the optimal solution of can be obtained by


- Spsc

When operating SPSC on feature-level learning, the learning objective of - SPSC (SPSC) is as the following optimization:

s.t. (12)

where denotes the weight vector imposed on different features in ; the loss of each feature corresponds to , which measures the reconstruction error of each feature and represents the -th row vector in , that is, a feature value across all samples. The SPL regularizer and optimal weight canaliculation in (6) and (7) can be easily adopted to SPSC, as the formulations in (- SPSC) and (The Formulation of SPSC) are mathematically equivalent to each other.

Optimization of SPSC

Optimization Algorithm

Comparing the proposed three formulations of SPSC in (The Formulation of SPSC), (- SPSC) and (- SPSC), SPSC in (- SPSC) can be seen as a general form of these three. Thus, in the following we will show how to optimize (- SPSC). The optimization problem is not jointly convex, so we will use alternative search strategy (ASS) to solve it, suggested in [Kumar, Packer, and Koller2010]. Following ASS, we can alternatively optimizes while keeping the other set of variables fixed.

(i) Solve for : With fixed and , optimization of selection weight matrix can be solved under (11).

(ii) Solve for : With fixed codes and weight matrix , the optimization problem for dictionary writes:


This problem is a Quadratically Constrained Quadratic Program (QCQP) which can be solved using Lagrangian dual [Lee et al.2006].

(iii) Solve for : With fixed dictionary and weight matrix , combing (- SPSC) and (8), the optimization problem for codes corresponds to the weighted SC:


This problem is a -regularized convex optimization and off-the-shelf algorithms can be employed for solving it, such as feature-sign algorithm in [Gao, Tsang, and Chia2013]. The whole process is summarized in Algorithm 1.

Self-Paced Sparse Coding (SPSC)

Input :

 Data matrix and a hypergraph with and as its corresponding incidence matrix and weight matrix; nonnegative trade-off parameters and stepsize .  1: Initializing: solve the SC problem with all elements selected () to obtain optimal and ; calculate the loss of all elements;   initialize self-paced parameters ; 2:   3:        Update via (11).  4:        Update via Solving (13).  5:        Update via Solving (14).  6:        Compute current loss .  7:        , . /* updating the learning pace */  8: convergence.   9: return ,

The algorithm converges when all samples have been considered and the objective function cannot be improved further. It is seen from the algorithm that the number of samples selected at each iteration is determined by a weight that will be gradually annealed such that later iterations introduce more elements. For optimization of SPSC or SPSC, in Step 6, it can be easily solved by respectively calculating - loss or - loss defined above. For weight update in in Step 3, all weights of elements are set equal to the weights of corresponding sample or feature, i.e. or . During each iteration, an convex minimization problem is solved for each V-step, B-step and S-step, and the global optimum solution is obtained for each sub-step. Thus the overall objective is non-increasing with the iterations and Algorithm 1 is guaranteed to convergence.

Experiment Results

In this section, we extensively evaluate the proposed approach when applied to image clustering task on two real datasets. Experimental results demonstrate the correctness and effectiveness of the proposed model.

Experiment Setup


Two widely used real-world image dataset benchmarks are considered in the experiments111We downloaded these datasets from:

  • COIL20. The COIL20 image library contains images of objects, with images per object. The size of each image is , with grey levels per pixel. Images of the objects were taken at angle intervals of . Each image is represented by a -dimensional vector.

  • CMU-PIE. The CMU-PIE face database contains facial images of subjects in various angles, facial expressions, and lighting conditions. The size of each image is , with grey levels per pixel. Thus, each image is represented by a -dimensional vector. In our experiments, we fix the angle and the facial expression to select 21 images under different lighting conditions for each subject, totaling images.


We compared the proposed approach with several state-of-the-art methods.

  • Original and SC. The “original” method is to cluster image objects in the original data space (denoted as ’OS’). Sparse coding [Lee et al.2006] is the basic sparse coding method without any regularization except sparsity constraints.

  • LSC, LSC and CSC. (LSC) [Zheng et al.2011] and (LSC) [Gao, Tsang, and Chia2013] add a graph/hyperGraph Laplacian regularization term to the original sparse coding framework, while CSC [Feng, Wu, and Zhou2017] (hypergraph consistent sparse coding) integrates a hypergraph incidence consistency regularization term.

  • SPSC, SPSC and SPSC. These are proposed methods as in (The Formulation of SPSC), (- SPSC) and (- SPSC) respectively.

Experimental Setting

We use -means clustering to group image datasets and compare the clustering results with two commonly used metrics, i.e., clustering accuracy (ACC) and normalized mutual information (NMI). Both ACC and NMI range from 0 to 1, while ACC reveals the clustering accuracy (also called purity) and NMI indicates whether the different clustering sets are identical (NMI = 1) or independent (NMI = 0). To ensure stability of results, the direct -means clustering on the sparse code space is respectively repeated 30 times, each time with a random set of initial cluster centers. The average ACC and NMI for different methods over these 30 repetitions on each dataset will be reported.

0 0.6120 0.7353 0.6022 0.7291 0.6967 0.8035 0.7320 0.8729 0.7828 0.8999 0.7889 0.9043 0.7863 0.8973
0.2 0.6081 0.7323 0.5847 0.7151 0.6742 0.8025 0.7319 0.8668 0.7878 0.8953 0.7580 0.8989 0.7726 0.8895
0.4 0.6043 0.7303 0.5807 0.7172 0.6888 0.8068 0.7258 0.8820 0.7737 0.8920 0.7576 0.8886 0.7511 0.8826
0.6 0.5968 0.7294 0.5688 0.7187 0.6806 0.8068 0.7237 0.8726 0.7655 0.8963 0.7545 0.8843 0.7563 0.8826
0.8 0.5819 0.7132 0.5868 0.7115 0.6705 0.7927 0.7122 0.8540 0.7636 0.8851 0.7474 0.8665 0.7484 0.8740
1 0.5879 0.7103 0.5794 0.7151 0.6790 0.7903 0.7042 0.8493 0.7624 0.8807 0.7433 0.8737 0.7449 0.8752
Table 1: Clustering performance on COIL20 dataset using different methods
0 0.5656 0.5252 0.5076 0.5074 0.6643 0.6411 0.6678 0.6679 0.6814 0.6871 0.6816 0.6791 0.6887 0.6792
0.2 0.5649 0.5234 0.5252 0.5056 0.6551 0.6452 0.6670 0.6611 0.6898 0.6823 0.6765 0.6675 0.6785 0.6820
0.4 0.5555 0.5192 0.5298 0.5054 0.6554 0.6393 0.6613 0.6673 0.6762 0.6825 0.6728 0.6527 0.6689 0.6767
0.6 0.5426 0.5084 0.5189 0.4932 0.6420 0.6212 0.6423 0.6363 0.6659 0.6717 0.6550 0.6538 0.6546 0.6615
0.8 0.5442 0.5021 0.5175 0.4940 0.6381 0.6171 0.6440 0.6314 0.6682 0.6604 0.6512 0.6471 0.6557 0.6539
1 0.5362 0.4939 0.5128 0.4888 0.6239 0.6130 0.6362 0.6326 0.6509 0.6580 0.6382 0.6477 0.6503 0.6511
Table 2: Clustering performance on MNIST dataset using different methods

For each dataset, we normalize each image vector into a unit -norm as the input of comared sparse coding algorithms. The dictionary size of all these models is set to be , since several recent works on sparse coding have advocated the use of overcomplete representations for images [Olshausen and others1996, Zheng et al.2011]. Thus, after sparse coding, each image is represented as a -dimension vector, and will be used as the input of -means clustering (-means). The graph in LSC and hypergraph in CSC and SPSC are constructed in the original feature space after unit-norm normalization. Specifically, The -nearest neighbor-based graph (or hypergraph) (with by Euclidean distance) is used to characterize the intrinsic manifold with a binary weighting scheme as suggested in [Zheng et al.2011, Gao, Tsang, and Chia2013].

The parameters are tuned as follows. For SC, the sparsity weight is tuned in the range of to get the best performance (i.e., the highest average of ACC in 30 runs), and the best value is used for all sparse coding framework-based methods. For LSC, the Laplacian regularization weight is tuned in the range of . For CSC, we respectively fix the best values of and , and tune . For SPSC, we use the best , and of CSC; and we initialize to a value when of the elements (samples, features) are selected in the first iteration and set stepsize .

To test the robustness of proposed SPSC to corruption using COIL20 dataset. For each image with pixel value representation, we add white Gaussian noise according to , where

is the noise following a normal distribution with zero average and

is set to be the standard deviation over entries in data matrix

, is a corruption ratio which increase from to in intervals of . Clearly, white Gaussian noise is additive. Figure. 2 shows some samples on COIL20 dataset.

Figure 1: COIL20 images with corruptions by Gaussian noise.

Results and Discussion

Tables 1 and 2 report the performance of compared methods on clustering. They show the following points. 1) All three variants of proposed SPSC perform better than other baselines on both ACC and NIM, the power of self-paced learning is demonstrated. 2) Among three variants of proposed SPSC, - SPSC (SPSC) is more robust than - SPSC and -, especially when the noise level is larger. We think that this is because the - SPSC is implementing on the dynamic selection on the smallest granularity. 3) When the noise is on a low level (), the clustering performance will be not severely affected. and however when the noise level is high (), the clustering performance will sharply decrease (we implement experiments for but don’t show it due to the lower accuracy).

Discussion on the Weight Learning

In the experiment, we record the weight matrix in each iteration in Algorithm 1. To show the process of dynamic selection of samples, we plot the correlation between noise level (the mean absolute value of noise over each sample) and the corresponding weight in each iteration, as in Figure. 2.

Figure 2: The correlation between noise level and weight in each run.

We can see that, the values in the weight matrix are becoming larger after each iteration and they keep a high value after about 15 iterations, following the mechanism of self-paced learning that more elements are selected after each iteration of the optimization process. Also, there is a negative correlation between noise level and the corresponding weight in the beginning of iterations, means the algorithm of SPSC is learning samples with lower noise corruption (easier samples) first.


Under the framework of alternative optimization of sparse coding, the non-convexity usually drive the optimization approach get stuck into bad local minima, especially when many noises and outliers exist in the datasets. To relieve the affect of non-convexity, we propose a unified framework named - (SPSC) by incorporating Self-paced learning methodology with Sparse coding as well as manifold regularization. For different levels of dynamic selection, we extends SPSC into three variants, that is - SPSC, - SPSC and - SPSC. The effectiveness of proposed SPSC was demonstrated by experiments on real image dataset and it with noise. The proposed method shows its advantage over current SC methods on more accurately approximating the ground truth clustering, and we will evaluate it on other tasks, such as classification and recommendation.


  • [Basu and Christensen2013] Basu, S., and Christensen, J. 2013. Teaching classification boundaries to humans. In AAAI, 109–115.
  • [Bengio et al.2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML, 41–48. ACM.
  • [Dong et al.2015] Dong, W.; Li, X.; Ma, Y.; and Shi, G. 2015. Image restoration via bayesian structured sparse coding. In ICIP, 4018–4022.
  • [Feng et al.2016] Feng, X.; Sharma, A.; Srivastava, J.; Wu, S.; and Tang, Z. 2016. Social network regularized sparse linear model for top-n recommendation. EAAI 51(C):5–15.
  • [Feng, Wu, and Zhou2017] Feng, X.; Wu, S.; and Zhou, W. 2017. Multi-hypergraph consistent sparse coding. ACM TIST 8(6):75.
  • [Gao, Tsang, and Chia2013] Gao, S.; Tsang, I.; and Chia, L. 2013. Laplacian sparse coding, hypergraph laplacian sparse coding, and applications. IEEE TPAMI 35:92–104.
  • [Huang et al.2010] Huang, Y.; Liu, Q.; Zhang, S.; and Metaxas, D. N. 2010. Image retrieval via probabilistic hypergraph ranking. In CVPR, 3376–3383.
  • [Jiang et al.2015] Jiang, L.; Meng, D.; Zhao, Q.; Shan, S.; and Hauptmann, A. G. 2015. Self-paced curriculum learning. In AAI, 2694–2700.
  • [Jing et al.2014] Jing, X. Y.; Hu, R. M.; Wu, F.; Chen, X. L.; Liu, Q.; and Yao, Y. F. 2014. Uncorrelated multi-view discrimination dictionary learning for recognition. In AAAI, 2787–2795.
  • [Kim et al.2007] Kim, S. J.; Koh, K.; Lustig, M.; Boyd, S.; and Gorinevsky, D. 2007. An interior-point method for large-scale l 1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617.
  • [Kumar, Packer, and Koller2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NPIS, 1189–1197.
  • [Lee et al.2006] Lee, H.; Battle, A.; Raina, R.; and Ng, A. Y. 2006. Efficient sparse coding algorithms. NIPS 801–808.
  • [Lee et al.2007] Lee, H.; Battle, A.; Raina, R.; and Ng, A. Y. 2007. Efficient sparse coding algorithms. In NPIS, 801–808.
  • [Li et al.2015] Li, P.; Bing, L.; Lam, W.; Li, H.; and Liao, Y. 2015.

    Reader-aware multi-document summarization via sparse coding.

    In IJCAI, 1270–1276.
  • [Li et al.2017a] Li, B.; Yuan, C.; Xiong, W.; Hu, W.; Peng, H.; Ding, X.; and Maybank, S. 2017a. Multi-view multi-instance learning based on joint sparse representation and multi-view dictionary learning. IEEE TPAMI PP(99):1–1.
  • [Li et al.2017b] Li, C.; Yan, J.; Wei, F.; Dong, W.; Liu, Q.; and Zha, H. 2017b. Self-paced multi-task learning. In AAAI, 2175–2181.
  • [Lin, Chen, and Ma2010] Lin, Z.; Chen, M.; and Ma, Y. 2010. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.
  • [Lu et al.2015] Lu, Z.; Gao, X.; Wang, L.; Wen, J.; and Huang, S. 2015.

    Noise-robust semi-supervised learning by large-scale sparse coding.

    In AAAI, 2828–2834.
  • [M. Yang and Feng2011] M. Yang, D. Z., and Feng, X. 2011. Fisher discrimination dictionary learning for sparse representation. In ICCV, 543–550.
  • [Olshausen and others1996] Olshausen, B. A., et al. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–609.
  • [Sun et al.2014] Sun, X.; Qu, Q.; Nasrabadi, N. M.; and Tran, T. D. 2014. Structured priors for sparse-representation-based hyperspectral image classification. IEEE Geoscience & Remote Sensing Letters 11(7):1235–1239.
  • [Tang, Yang, and Gao2012] Tang, Y.; Yang, Y.-B.; and Gao, Y. 2012. Self-paced dictionary learning for image classification. In ACM MM, 833–836. ACM.
  • [Wright et al.2009] Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009.

    Robust face recognition via sparse representation.

    IEEE TPAMI 31:210–227.
  • [Wu and Lange2008] Wu, T. T., and Lange, K. 2008. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 224–244.
  • [Xu et al.2016] Xu, D.; Song, J.; Alameda-Pineda, X.; Ricci, E.; and Sebe, N. 2016. Multi-paced dictionary learning for cross-domain retrieval and recognition. In ICPR, 3228–3233.
  • [Xu, Tao, and Xu2015] Xu, C.; Tao, D.; and Xu, C. 2015. Multi-view self-paced learning for clustering. In IJCAI, 3974–3980.
  • [Yan and Wang2009] Yan, S., and Wang, H. 2009. Semi-supervised learning by sparse representation. In SDM, 792–801. SIAM.
  • [Yang et al.2009] Yang, J.; Yu, K.; Gong, Y.; and Huang, T. 2009. Linear spatial pyramid matching using sparse coding for image classification. 1794–1801.
  • [Yang et al.2010] Yang, A. Y.; Sastry, S. S.; Ganesh, A.; and Ma, Y. 2010. Fast -minimization algorithms and an application in robust face recognition: A review. In ICIP, 1849–1852.
  • [Yang et al.2014] Yang, Y.; Yang, J.; Yang, J.; Wang, J.; Chang, S.; and Huang, T. S. 2014. Data clustering by laplacian regularized -graph. In AAAI, 3148–3149.
  • [Z. Jiang and Davis2012] Z. Jiang, G. Z., and Davis, L. S. 2012. Submodular dictionary learning for sparse coding. In CVPR, 3418–3425.
  • [Z. Jiang and Davis2013] Z. Jiang, Z. L., and Davis, L. S. 2013. Label consistent k-svd: learning a discriminative dictionary for recognition. IEEE TPMAI 35:2651–2664.
  • [Zhang and Li2010] Zhang, Q., and Li, B. 2010. Discriminative k-svd for dictionary learning in face recognition. In CVPR, 2691–2698.
  • [Zhao et al.2015] Zhao, Q.; Meng, D.; Jiang, L.; Xie, Q.; Xu, Z.; and Hauptmann, A. G. 2015. Self-paced learning for matrix factorization. In AAAI, 3196–3202.
  • [Zhen et al.2016] Zhen, L.; Peng, D.; Yi, Z.; Xiang, Y.; and Chen, P. 2016. Underdetermined blind source separation using sparse coding. IEEE TNNLS PP(99):1–7.
  • [Zheng and Tao2015] Zheng, H., and Tao, D. 2015. Discriminative dictionary learning via fisher discrimination K-SVD algorithm. Neurocomputing 162:9–15.
  • [Zheng et al.2011] Zheng, M.; Bu, J.; Chen, C.; Wang, C.; Zhang, L.; Qiu, G.; and Cai, D. 2011. Graph regularized sparse coding for image representation. IEEE TIP 20:1327–1336.