1 Introduction
Clustering aims to learn hidden data patterns and group similar structures in an unsupervised way. While many classical clustering algorithms have been proposed, such as K-means, Gaussian mixture model (GMM) clustering [5], maximum-margin clustering [31] and information-theoretic clustering [19], most only work well when the data dimensionality is low. Since high-dimensional data exhibits dense grouping in low-dimensional embeddings [23], researchers have been motivated to first project the original data into a low-dimensional subspace [24] and then cluster on the feature embeddings. Among many feature embedding learning methods, sparse codes [30] have proven to be robust and efficient features for clustering, as verified by many [8, 34].

Effectiveness and scalability are two major concerns in designing a clustering algorithm under Big Data scenarios [6]. Conventional sparse coding models rely on iterative approximation algorithms, whose inherently sequential structure, together with their data-dependent complexity and latency, often constitutes a major bottleneck in computational efficiency [12]. This also makes it difficult to jointly optimize the unsupervised feature learning and the supervised task-driven steps [20]. Such a joint optimization usually has to rely on solving a complex bi-level optimization problem [4], as in [29], which constitutes another efficiency bottleneck. Moreover, to effectively model and represent datasets of growing sizes, sparse coding needs to refer to larger dictionaries [17]. Since the inference complexity of sparse coding increases more than linearly with respect to the dictionary size [29], the scalability of sparse coding-based clustering turns out to be quite limited.

To overcome these limitations, we are motivated to introduce deep learning into clustering, an application that has received little attention so far. The advantages of deep learning come from its large learning capacity, its linear scalability with the aid of stochastic gradient descent (SGD), and its low inference complexity [3]. Feed-forward networks can be naturally tuned jointly with task-driven loss functions. On the other hand, generic deep architectures [15] largely ignore problem-specific formulations and prior knowledge. As a result, one may encounter difficulties in choosing optimal architectures, interpreting their working mechanisms, and initializing the parameters.

In this paper, we demonstrate how to incorporate the sparse coding-based pipeline into deep learning models for clustering. The proposed framework takes advantage of both sparse coding and deep learning. Specifically, the feature learning layers are inspired by the graph-regularized sparse coding inference process, via reformulating iterative algorithms [12] into a feed-forward network, named TAGnet. Those layers are then jointly optimized with the task-specific loss functions from end to end. Our technical novelty and merits are summarized as follows:

As a deep feed-forward model, the proposed framework provides an extremely efficient inference process and high scalability to large-scale data. It allows learning more descriptive features than conventional sparse codes.

By further enforcing auxiliary clustering tasks on the hierarchy of features, we develop DTAGnet and observe further performance boosts on the CMU Multi-PIE dataset [13].
2 Related Work
2.1 Sparse coding for clustering
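Assume data samples $X = [x_1, x_2, \ldots, x_n]$, where $x_i \in \mathbb{R}^{m \times 1}$ and $i = 1, 2, \ldots, n$. They are encoded into sparse codes $A = [a_1, a_2, \ldots, a_n]$, where $a_i \in \mathbb{R}^{p \times 1}$ and $i = 1, 2, \ldots, n$, using a learned dictionary $D = [d_1, d_2, \ldots, d_p]$, where $d_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, p$, are the learned atoms. The sparse codes are obtained by solving the following convex optimization ($\lambda$ is a constant):

$$A = \arg\min_{A} \frac{1}{2} \|X - DA\|_F^2 + \lambda \sum_{i} \|a_i\|_1 \qquad (1)$$

In [8], the authors suggested that the sparse codes can be used to construct the similarity graph for spectral clustering [22]. Furthermore, to capture the geometric structure of local data manifolds, graph-regularized sparse codes are suggested in [34, 32] by solving:

$$A = \arg\min_{A} \frac{1}{2} \|X - DA\|_F^2 + \lambda \sum_{i} \|a_i\|_1 + \frac{\alpha}{2} \mathrm{Tr}(A L A^T) \qquad (2)$$

where $L$ is the graph Laplacian matrix and can be constructed from a pre-chosen pairwise similarity (affinity) matrix $P$. More recently, in [29], the authors suggested simultaneously learning feature extraction and discriminative clustering, by formulating a task-driven sparse coding model [20]. They showed that such joint methods consistently outperformed their non-joint counterparts.

2.2 Deep learning for clustering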
In [26], the authors explored the possibility of employing deep learning in graph clustering. They first learned a nonlinear embedding of the original graph by an auto-encoder (AE), followed by a K-means algorithm on the embedding to obtain the final clustering result. However, that approach neither exploits more adapted deep architectures nor performs any task-specific joint optimization. In [7], a deep belief network (DBN) [14] with nonparametric clustering was presented. As a generative graphical model, a DBN provides faster feature learning, but is less effective than AEs in terms of learning discriminative features for clustering. In [27], the authors extended the semi non-negative matrix factorization (Semi-NMF) model [18] to a Deep Semi-NMF model, whose architecture resembles stacked AEs. Our proposed model is substantially different from all these previous approaches, due to its unique task-specific architecture derived from sparse coding domain expertise, as well as the joint optimization with clustering-oriented loss functions.

3 Model Formulation
The proposed pipeline consists of two blocks. As depicted in Fig. 1 (a), it is trained end-to-end in an unsupervised way. It includes a feed-forward architecture, termed Task-specific And Graph-regularized Network (TAGnet), to learn discriminative features, and the clustering-oriented loss function.
3.1 TAGnet: Task-specific And Graph-regularized Network
Different from generic deep architectures, TAGnet is designed to take advantage of the successful sparse code-based clustering pipelines [34, 29]. It aims to learn features that are optimized under clustering criteria, while encoding the graph constraints of (2) to regularize the target solution. TAGnet is derived from the following theorem:

Theorem 3.1. The optimal sparse code $A$ from (2) is the fixed point of

$$A^{k+1} = h_{\lambda/N}\left[\left(I - \frac{1}{N} D^T D\right) A^k - A^k \left(\frac{\alpha}{N} L\right) + \frac{1}{N} D^T X\right] \qquad (3)$$

where $h_\theta$ is an element-wise shrinkage function parameterized by $\theta$:

$$[h_\theta(u)]_i = \mathrm{sign}(u_i)\,(|u_i| - \theta_i)_+ \qquad (4)$$

and $N$ is an upper bound on the largest eigenvalue of $D^T D$.

The complete proof of Theorem 3.1 can be found in the supplementary material. Theorem 3.1 outlines an iterative algorithm to solve (2). Under quite mild conditions [2], after $A$ is initialized, one may repeat the shrinkage and thresholding process in (3) until convergence. Moreover, the iterative algorithm can alternatively be expressed as the block diagram in Fig. 1 (b), where

$$W = \frac{1}{N} D^T, \quad S = I - \frac{1}{N} D^T D, \quad \theta = \frac{\lambda}{N} \qquad (5)$$

In particular, we define the new operator “$\times L$”: $A \rightarrow -\frac{\alpha}{N} A L$, where the input $A$ is multiplied by the pre-fixed $L$ from the right side and scaled by the constant $-\frac{\alpha}{N}$.
By time-unfolding and truncating Fig. 1 (b) to a fixed number of iterations (two by default; we also tested larger values of 3 or 4, but they did not bring noticeable performance improvements in our clustering cases), we obtain the TAGnet form in Fig. 1 (a). $W$, $S$ and $\theta$ are all to be learnt jointly from data. $S$ and $\theta$ are tied weights for both stages (out of curiosity, we also tried an architecture that treats $W$, $S$ and $\theta$ in both stages as independent variables, and found that sharing parameters improves the performance). It is important to note that the output of TAGnet is not necessarily identical to the sparse codes predicted by solving (2). Instead, the goal of TAGnet is to learn a discriminative embedding that is optimal for clustering.
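The unfolding described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: it assumes a LISTA-style layout in which an initial shrinkage of the projected input is followed by `stages` blocks, each adding the learnable recurrent term and the fixed graph branch before shrinking again. The function names (`shrink`, `init_from_dictionary`, `tagnet_forward`) and the argument layout are hypothetical.

```python
import numpy as np

def shrink(u, theta):
    """Element-wise shrinkage: sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def init_from_dictionary(D, lam, N):
    """Map sparse coding quantities (D, lam, N) to network parameters, as in (5)."""
    p = D.shape[1]
    W = D.T / N                         # input projection
    S = np.eye(p) - (D.T @ D) / N       # recurrent weights shared by all stages
    theta = np.full((p, 1), lam / N)    # one threshold per code dimension
    return W, S, theta

def tagnet_forward(X, W, S, theta, L, alpha, N, stages=2):
    """Truncated, unfolded forward pass (assumed layout).

    X: (m, n) data, L: (n, n) graph Laplacian computed in advance.
    Returns the final features and the list of per-stage features.
    """
    B = W @ X                           # projected input, reused by every stage
    Z = shrink(B, theta)                # first estimate (one iteration from zero)
    feats = [Z]
    for _ in range(stages):
        graph_branch = -(alpha / N) * (Z @ L)   # fixed graph branch, never trained
        Z = shrink(B + S @ Z + graph_branch, theta)
        feats.append(Z)
    return Z, feats
```

Under this layout, a two-stage network initialized from a dictionary reproduces exactly three shrinkage iterations started from a zero code; training then moves `W`, `S` and `theta` away from that starting point.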
To facilitate training, we further rewrite (4) as:

$$[h_\theta(u)]_i = \theta_i \cdot \mathrm{sign}(u_i)\left(\frac{|u_i|}{\theta_i} - 1\right)_+ \qquad (6)$$

Eqn. (6) indicates that the original neuron with trainable thresholds can be decomposed into two linear scaling layers plus a unit-threshold neuron. The weights of the two scaling layers are diagonal matrices defined by $\theta$ and its element-wise reciprocal, respectively.

A notable component in TAGnet is the $\times L$ branch of each stage. The graph Laplacian $L$ can be computed in advance. In the feed-forward process, a $\times L$ branch takes the intermediate $Z^k$ ($k$ = 1, 2) as the input, and applies the “$\times L$” operator defined above. The output is aggregated with the output from the learnable $S$ layer. In back-propagation, $L$ is not altered. In this way, the graph regularization is effectively encoded in the TAGnet structure as a prior.
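The decomposition of the thresholded neuron into two diagonal scaling layers around a unit-threshold neuron can be checked numerically. The sketch below assumes strictly positive thresholds so that the reciprocal is defined; both function names are hypothetical.

```python
import numpy as np

def shrink(u, theta):
    """Shrinkage with trainable thresholds theta."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def shrink_decomposed(u, theta):
    """Same function built from two scaling layers and a unit-threshold neuron."""
    scaled = u / theta                                   # scaling by the reciprocal of theta
    unit = np.sign(scaled) * np.maximum(np.abs(scaled) - 1.0, 0.0)  # threshold fixed at 1
    return theta * unit                                  # scaling by theta
```

With this form the thresholds appear only as ordinary multiplicative weights, which is what lets a standard back-propagation implementation update them.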
An appealing highlight of (D)TAGnet lies in its effective and straightforward initialization strategy. With sufficient data, many recent deep networks train well from random initializations without pre-training. However, it has been discovered that poor initializations hamper the effectiveness of first-order methods (e.g., SGD) in certain cases [25]. For (D)TAGnet, it is much easier to initialize the model in the right regime. That benefits from the analytical relationships between sparse coding and the network hyperparameters defined in (5): we can initialize deep models from the corresponding sparse coding components, which are easier to obtain. Such an advantage becomes much more important when the training data is limited.

3.2 Clustering-oriented loss functions
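Assume $K$ clusters, and let $\omega = [\omega_1, \ldots, \omega_K]$ be the set of parameters of the loss function, where $\omega_i$ corresponds to the $i$-th cluster, $i = 1, 2, \ldots, K$. In this paper, we adopt the following two forms of clustering-oriented loss functions.
One natural choice of the loss function is extended from the popular softmax loss, and takes the entropy-like form:

$$C(A, \omega) = -\sum_{i=1}^{n} \sum_{j=1}^{K} p_{ij} \log p_{ij} \qquad (7)$$

where $p_{ij}$ denotes the probability that sample $x_i$ belongs to cluster $j$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, K$:

$$p_{ij} = p(j \mid \omega, a_i) = \frac{e^{-\omega_j^T a_i}}{\sum_{l=1}^{K} e^{-\omega_l^T a_i}} \qquad (8)$$

In testing, the predicted cluster label of input $a_i$ is determined using the maximum likelihood criterion based on the predicted $p_{ij}$.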
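The entropy-like loss and the test-time labeling rule can be sketched as follows. The sketch assumes that the cluster probabilities are a softmax over the negated inner products between cluster parameters and features; the sign convention, the function names, and the small `eps` guard inside the logarithm are assumptions made for illustration.

```python
import numpy as np

def cluster_probs(A, omega):
    """A: (p, n) features, omega: (p, K) cluster parameters -> (n, K) probabilities."""
    scores = -(omega.T @ A).T                     # assumed sign convention
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(A, omega, eps=1e-12):
    """Sum over samples of the entropy of their cluster-assignment distribution."""
    P = cluster_probs(A, omega)
    return float(-(P * np.log(P + eps)).sum())

def predict(A, omega):
    """Assign each sample to its most probable cluster."""
    return cluster_probs(A, omega).argmax(axis=1)
```

The loss is largest (n log K) when every sample is assigned uniformly to all clusters and approaches zero as the assignments become confident, so minimizing it pushes the features towards well-separated groups.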
The maximum margin clustering (MMC) approach was proposed in [31]. MMC finds a way to label the samples by running an SVM implicitly, such that the SVM margin obtained is maximized over all possible labelings [33]. By referring to the MMC definition, the authors of [29] designed the max-margin loss:

$$C(A, \omega) = \frac{\lambda}{2} \|\omega\|^2 + \sum_{i=1}^{n} C(a_i, \omega) \qquad (9)$$

In the above equation, the loss for an individual sample $a_i$ is defined as:

$$C(a_i, \omega) = \max\left(0,\; 1 + f^{r_i}(a_i) - f^{y_i}(a_i)\right), \quad y_i = \arg\max_{j} f^{j}(a_i), \quad r_i = \arg\max_{j \neq y_i} f^{j}(a_i), \quad f^{j}(a_i) = \omega_j^T a_i \qquad (10)$$

where $\omega_j$ is the prototype for the $j$-th cluster. In testing, the predicted cluster label of input $a_i$ is determined by the weight vector that achieves the maximum $\omega_j^T a_i$.

Model Complexity. The proposed framework can handle large-scale and high-dimensional data effectively via the stochastic gradient descent (SGD) algorithm. In each step, the back-propagation procedure requires only operations of order O() [12]. The training algorithm takes O() time ( is a constant in terms of the total numbers of epochs, stage numbers, etc.). In addition, SGD is easy to parallelize, so the model can be trained efficiently using GPUs.
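For the max-margin loss of this section, a hedged NumPy sketch is given below. It assumes a multi-class hinge form in which each sample is charged for the gap between its best-scoring prototype and the runner-up, plus a squared-norm penalty on the prototypes; the penalty weight `reg` and the function names are hypothetical, and at least two clusters are required.

```python
import numpy as np

def max_margin_loss(A, omega, reg):
    """A: (p, n) features, omega: (p, K) prototypes, reg: penalty weight (assumed)."""
    scores = omega.T @ A                       # (K, n): score of each prototype per sample
    top2 = np.sort(scores, axis=0)[-2:]        # row 0: runner-up score, row 1: best score
    hinge = np.maximum(0.0, 1.0 + top2[0] - top2[1])
    return 0.5 * reg * np.sum(omega ** 2) + float(hinge.sum())

def predict(A, omega):
    """Label each sample by the prototype with the largest inner product."""
    return (omega.T @ A).argmax(axis=0)
```

A sample contributes nothing once its best prototype beats every other prototype by a margin of at least one, which is the behaviour the implicit SVM view asks for.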
3.3 Connections to Existing Models
There is a close connection between sparse coding and neural networks. In [12], a feed-forward neural network, named LISTA, is proposed to efficiently approximate the sparse code $a$ of an input signal $x$, which is obtained by solving (1) in advance. The LISTA network learns the hyperparameters as a general regression model from training data to their pre-solved sparse codes using back-propagation.

LISTA overlooks the useful geometric information among data points [34], and could therefore be viewed as a special case of TAGnet in Fig. 1 when $\alpha$ = 0 (i.e., removing the $\times L$ branches). Moreover, LISTA aims to approximate the “optimal” sparse codes pre-obtained from (1), and therefore requires the estimation of $D$ and the tedious pre-computation of $A$. The authors did not exploit its potential in supervised and task-specific feature learning.

4 A Deeper Look: Hierarchical Clustering by DTAGnet
Deep networks are well known for their capability to learn semantically rich representations in hidden layers [10]. In this section, we investigate how the intermediate features $Z^k$ in TAGnet (Fig. 1 (a)) can be interpreted, and further utilized to improve the model, for specific clustering tasks. Compared to related non-deep models [29], such a hierarchical clustering property is another unique advantage of being deep.
Our strategy is mainly inspired by the algorithmic framework of deeply supervised nets [16]. As in Fig. 2, our proposed Deeply-Task-specific And Graph-regularized Network (DTAGnet) brings in additional deep feedback, by associating a clustering-oriented local auxiliary loss $C_k$ ($k$ = 1, 2) with each stage. Such an auxiliary loss takes the same form as the overall loss $C$, except that the expected cluster number may be different, depending on the auxiliary clustering task to be performed. DTAGnet back-propagates errors not only from the overall loss layer, but also simultaneously from the auxiliary losses.
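One way to picture the deeply supervised objective is as a sum of the overall loss on the final features and one auxiliary loss per stage, each with its own parameters and cluster count. The sketch below makes two assumptions that the text does not state: the losses are combined by a weighted sum (equal weights by default), and the entropy-like loss is used for every term. All names are hypothetical.

```python
import numpy as np

def entropy_loss(Z, omega, eps=1e-12):
    """Entropy-like clustering loss on features Z (p, n) with parameters omega (p, K)."""
    scores = -(omega.T @ Z).T
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)
    return float(-(P * np.log(P + eps)).sum())

def dtagnet_loss(A, omega, stage_feats, aux_omegas, aux_weights=None):
    """Overall loss on the final features A plus one auxiliary loss per stage.

    stage_feats: list of intermediate features, one (p, n) array per stage.
    aux_omegas:  list of (p, K_k) parameters; K_k may differ from stage to stage.
    aux_weights: optional per-stage weights (assumed; defaults to 1.0 each).
    """
    if aux_weights is None:
        aux_weights = [1.0] * len(stage_feats)
    total = entropy_loss(A, omega)
    for Z, omega_k, beta in zip(stage_feats, aux_omegas, aux_weights):
        total += beta * entropy_loss(Z, omega_k)
    return total
```

Because the total is a sum, its gradient with respect to the shared layers is the sum of the gradients of the individual terms, which matches the description of errors flowing back from all loss layers simultaneously.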
While seeking the optimal performance of the target clustering, DTAGnet is also driven by two auxiliary tasks that are explicitly targeted at clustering specific attributes. It enforces a constraint on each hidden representation to directly make a good cluster prediction. In addition to the overall loss, the introduction of auxiliary losses gives another strong push towards discriminative and sensible features at each individual stage. As discovered in the classification experiments in [16], the auxiliary loss both acts as feature regularization to reduce generalization errors and results in faster convergence. We also find in Section 5 that every intermediate feature $Z^k$ is indeed most suited for its targeted task.

In [27], a Deep Semi-NMF model was proposed to learn hidden representations that lend themselves to an interpretation of clustering according to different attributes. The authors considered the problem of mapping facial images to their identities. A face image also contains attributes like pose and expression that help identify the person depicted. In their experiments, the authors found that by further factorizing this mapping in a way that each factor adds an extra layer of abstraction, the deep model could automatically learn latent intermediate representations that are suited for clustering identity-related attributes. Although there is a clustering interpretation, those hidden representations are not specifically optimized in the clustering sense. Instead, the entire model is trained with only the overall reconstruction loss, after which clustering is performed using K-means on the learnt features. Consequently, their clustering performance is not satisfactory. Our study shares a similar observation and motivation with [27], but proceeds in a more task-specific manner by optimizing the auxiliary clustering tasks jointly with the overall task.
5 Experiment Results
5.1 Datasets and measurements
We evaluate the proposed model on three publicly available datasets:

MNIST [34] consists of a total of 70,000 quasi-binary, handwritten digit images, with digits 0 to 9. The digits are normalized and centered in fixed-size images of 28 × 28.

CMU Multi-PIE [13] contains around 750,000 images of 337 subjects, captured under varied laboratory conditions. A unique property of CMU Multi-PIE is that each image comes with labels for the identity, illumination, pose and expression attributes. That is why CMU Multi-PIE is chosen in [27] to learn multi-attribute features (Fig. 2) for hierarchical clustering. In our experiments, we follow [27] and adopt a subset of 13,230 images of 147 subjects in 5 different poses and 6 different emotions. Notably, we do not pre-process the images with the piece-wise affine warping used by [27] to align these images.

COIL-20 [21] contains 1,440 gray-scale images of size 32 × 32, covering 20 objects (72 images per object). The images of each object were taken 5 degrees apart.
Although the paper only evaluates the proposed method on image datasets, the methodology itself is not limited to image data. We apply two widely used measures to evaluate the clustering performance: the accuracy and the Normalized Mutual Information (NMI) [34, 8]. We follow the convention of many clustering works [34, 32, 29], and do not distinguish training from testing. We train our models on all available samples of each dataset, and report the clustering performance on them as our testing results. Results are averaged over 5 independent runs.
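The two measures can be computed as in the following sketch. Two choices here are assumptions rather than details taken from the text: accuracy is obtained by searching for the best one-to-one matching between predicted clusters and ground-truth classes (done by brute force, so only practical for a handful of clusters; an assignment solver would be the usual choice at scale), and NMI is normalized by the larger of the two entropies, which is one of several conventions in use.

```python
import itertools
import numpy as np

def contingency(y_true, y_pred):
    """Count matrix: rows are ground-truth classes, columns are predicted clusters."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, c = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    C = np.zeros((t.max() + 1, c.max() + 1), dtype=np.int64)
    np.add.at(C, (t, c), 1)
    return C

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples matched under the best one-to-one cluster/class mapping."""
    C = contingency(y_true, y_pred)
    if C.shape[0] < C.shape[1]:
        C = C.T                                   # make rows the larger side
    rows, cols = C.shape
    best = max(sum(C[r, c] for c, r in enumerate(perm))
               for perm in itertools.permutations(range(rows), cols))
    return best / len(y_true)

def nmi(y_true, y_pred):
    """Mutual information divided by max(H(true), H(pred)) (assumed normalization)."""
    pij = contingency(y_true, y_pred).astype(float)
    pij /= pij.sum()
    pi = pij.sum(axis=1, keepdims=True)
    pj = pij.sum(axis=0, keepdims=True)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()
    h_true = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
    h_pred = -(pj[pj > 0] * np.log(pj[pj > 0])).sum()
    denom = max(h_true, h_pred)
    return float(mi / denom) if denom > 0 else 1.0
```

Both measures are invariant to how the cluster indices are numbered, which is why a predicted labeling that merely permutes the true labels scores perfectly.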
5.2 Experiment settings
The proposed networks are implemented using the cuda-convnet package [15]. The network uses two stages by default. We apply a constant learning rate of 0.01 with no momentum to all trainable layers. The batch size is 128. In particular, to encode graph regularization as a prior, we fix $L$ during model training by setting its learning rate to 0. Experiments run on a workstation with 12 Intel Xeon 2.67GHz CPUs and 1 GTX680 GPU. The training takes approximately 1 hour on the MNIST dataset. We also observe that the training time of our model scales approximately linearly with the amount of data.
In our experiments, we set the default value of $\alpha$ to 5 and $p$ to 128, and choose $\lambda$ from [0.1, 1] by cross-validation (the default values of $\alpha$ and $p$ are inferred from the related sparse coding literature [34], and validated in experiments). A dictionary $D$ is first learned from $X$ by K-SVD [1]. $W$, $S$ and $\theta$ are then initialized based on (5). $L$ is also pre-calculated from $P$, which is formulated by the Gaussian kernel with bandwidth $\delta$ ($\delta$ is also selected by cross-validation). After obtaining the output $A$ from the initial (D)TAGnet models, $\omega$ (or $\{\omega_k\}$) can be initialized by minimizing (7) or (9) over $A$ (or $\{Z^k\}$).
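The pre-computation of the graph Laplacian from a Gaussian-kernel affinity can be sketched as below. The exact kernel scaling and the use of the unnormalized Laplacian (degree matrix minus affinity) are assumptions for illustration, as are the function names and the `delta` parameter name.

```python
import numpy as np

def gaussian_affinity(X, delta):
    """X: (m, n) samples as columns -> (n, n) affinity exp(-||x_i - x_j||^2 / delta^2)."""
    sq = (X ** 2).sum(axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)  # squared distances
    P = np.exp(-d2 / delta ** 2)
    np.fill_diagonal(P, 0.0)        # self-similarity cancels out in the Laplacian anyway
    return P

def graph_laplacian(P):
    """Unnormalized graph Laplacian: degree matrix minus affinity."""
    return np.diag(P.sum(axis=1)) - P
```

With this Laplacian, the trace term Tr(A L A^T) equals half the affinity-weighted sum of squared distances between codes, so a larger regularization weight pulls the codes of similar samples closer together.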
5.3 Comparison experiments and analysis
5.3.1 Benefits of the task-specific deep architecture
We denote the proposed model of TAGnet plus the entropy-minimization loss (EML) (7) as TAGnet-EML, and the one plus the maximum-margin loss (MML) (9) as TAGnet-MML. We include the following comparison methods:

We refer to the initializations of the proposed joint models as their “Non-Joint” counterparts, denoted as NJ-TAGnet-EML and NJ-TAGnet-MML (NJ short for non-joint), respectively.

We design a Baseline Encoder (BE), which is a fully-connected feed-forward network consisting of three hidden layers of dimension $p$ with ReLU neurons. The BE has the same parameter complexity as TAGnet (except for the diagonal scaling layers, each of which contains only $p$ free parameters and is thus ignored). The BEs are also tuned by EML or MML in the same way, denoted as BE-EML or BE-MML, respectively. We intend to verify our important argument that the proposed model benefits from the task-specific TAGnet architecture, rather than just the large learning capacity of generic deep models.

We compare the proposed models with their closest “shallow” competitors, i.e., the joint optimization methods of graph-regularized sparse coding and discriminative clustering in [29]. We re-implement their work using both the (7) and (9) losses, denoted as SC-EML and SC-MML (SC short for sparse coding). Since the authors of [29] already showed that SC-MML outperforms classical methods such as MMC and graph methods, we do not compare with them again.
As revealed by the full comparison results in Table 1, the proposed task-specific deep architectures outperform the others by a noticeable margin. The underlying domain expertise guides the data-driven training in a more principled way. In contrast, the “general-architecture” baseline encoders (BE-EML and BE-MML) produce much worse (even the worst) results. Furthermore, it is evident that the proposed end-to-end optimized models outperform their “non-joint” counterparts. For example, on the MNIST dataset, TAGnet-MML surpasses NJ-TAGnet-MML by around 4% in accuracy and 5% in NMI.
By comparing TAGnet-EML/TAGnet-MML with SC-EML/SC-MML, we draw a promising conclusion: adopting a more parameterized deep architecture allows a larger feature learning capacity compared to conventional sparse coding. Although similar points are well made in many other fields [15], we are interested in a closer comparison of the two. Fig. 3 plots the clustering accuracy and NMI curves of TAGnet-EML/TAGnet-MML on the MNIST dataset against the number of iterations. Each model is well initialized at the very beginning, and the clustering accuracy and NMI are computed every 100 iterations. At first, the clustering performance of the deep models is even slightly worse than that of the sparse coding methods, mainly because the initialization of TAGnet hinges on a truncated approximation of graph-regularized sparse coding. After a small number of iterations, the performance of the deep models surpasses that of the sparse coding ones, and continues rising monotonically until reaching a higher plateau.
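Table 1: Accuracy and NMI of all compared methods on the three datasets.

| Dataset | Metric | TAGnet-EML | TAGnet-MML | NJ-TAGnet-EML | NJ-TAGnet-MML | BE-EML | BE-MML | SC-EML | SC-MML | Deep Semi-NMF |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | Acc | 0.6704 | 0.6922 | 0.6472 | 0.5052 | 0.5401 | 0.6521 | 0.6550 | 0.6784 | - |
| MNIST | NMI | 0.6261 | 0.6511 | 0.5624 | 0.6067 | 0.5002 | 0.5011 | 0.6150 | 0.6451 | - |
| CMU Multi-PIE | Acc | 0.2176 | 0.2347 | 0.1727 | 0.1861 | 0.1204 | 0.1451 | 0.2002 | 0.2090 | 0.17 |
| CMU Multi-PIE | NMI | 0.4338 | 0.4555 | 0.3167 | 0.3284 | 0.2672 | 0.2821 | 0.3337 | 0.3521 | 0.36 |
| COIL-20 | Acc | 0.8553 | 0.8991 | 0.7432 | 0.7882 | 0.7441 | 0.7645 | 0.8225 | 0.8658 | - |
| COIL-20 | NMI | 0.9090 | 0.9277 | 0.8707 | 0.8814 | 0.8028 | 0.8321 | 0.8850 | 0.9127 | - |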
5.3.2 Effects of graph regularization
In (2), the graph regularization term imposes stronger smoothness constraints on the sparse codes with a larger $\alpha$. The same holds for TAGnet. We investigate how the clustering performance of TAGnet-EML/TAGnet-MML is influenced by various $\alpha$ values. From Fig. 4, we observe the same general tendency on all three datasets. As $\alpha$ increases, the accuracy/NMI first rises and then decreases, with the peak appearing for $\alpha$ in [5, 10]. As an interpretation, the local manifold information is not sufficiently encoded when $\alpha$ is too small ($\alpha$ = 0 completely disables the $\times L$ branch of TAGnet, and reduces it to the LISTA network [12] fine-tuned by the losses). On the other hand, when $\alpha$ is large, the sparse codes are “over-smoothed” and lose discriminative ability. Note that similar phenomena are also reported in other relevant literature, e.g., [34, 29].
Furthermore, comparing Fig. 4 (a)-(f), it is noteworthy how differently graph regularization behaves on the three datasets. The COIL-20 dataset is the most sensitive to the choice of $\alpha$. Increasing $\alpha$ from 0.01 to 50 leads to an improvement of more than 10% in terms of both accuracy and NMI. This verifies the significance of graph regularization when training samples are limited [32]. On the MNIST dataset, both models obtain a gain of up to 6% in accuracy and 5% in NMI by tuning $\alpha$ from 0.01 to 10. However, unlike COIL-20, which almost always favors larger $\alpha$, the model performance on the MNIST dataset not only saturates but is even significantly hampered when $\alpha$ continues rising to 50. The CMU Multi-PIE dataset shows moderate improvements of around 2% in both measures. It is not as sensitive to $\alpha$ as the other two. This might be due to the complex variability in the original images, which makes the graph unreliable for estimating the underlying manifold geometry. We suspect that more sophisticated graphs may help alleviate the problem, and will explore this in future work.
5.3.3 Scalability and robustness
On the MNIST dataset, we re-conduct the clustering experiments with the cluster number ranging from 2 to 10, using TAGnet-EML/TAGnet-MML. Fig. 5 shows how the clustering accuracy and NMI change with the number of clusters. The clustering performance changes smoothly and robustly as the task scale changes.
To examine the proposed models’ robustness to noise, we add Gaussian noise, whose standard deviation $\sigma$ ranges from 0 (noiseless) to 0.3, and re-train our MNIST model. Fig. 6 indicates that both TAGnet-EML and TAGnet-MML possess a certain robustness to noise. When $\sigma$ is less than 0.1, there is little visible performance degradation. While TAGnet-MML consistently outperforms TAGnet-EML in all experiments (as MMC is well known to be highly discriminative [31]), it is interesting to observe in Fig. 6 that the latter is slightly more robust to noise than the former. This is perhaps owing to the probability-driven loss form (7) of EML, which allows for more flexibility.

5.4 Hierarchical clustering on CMU Multi-PIE
As observed, CMU Multi-PIE is very challenging for the basic identity clustering task. However, it comes with several other attributes: pose, expression, and illumination, which could be of assistance in our proposed DTAGnet framework. In this section, we apply a setting similar to that of [27] on the same CMU Multi-PIE subset, by setting pose clustering as the Stage I auxiliary task, and expression clustering as the Stage II auxiliary task. (In fact, although claimed to be applicable to multiple attributes, [27] only examined the first-level features for pose clustering without considering expressions, since it relied on a warping technique to pre-process images, which removes most of the expression variability.) In that way, we target $C_1$ at 5 clusters, $C_2$ at 6 clusters, and finally $C$ at 147 clusters.
The training of DTAGnet-EML/DTAGnet-MML follows the same process as described above, except for considering the extra back-propagated gradients from task $C_k$ in Stage $k$ ($k$ = 1, 2). After that, we test each stage separately on its targeted task. In DTAGnet, each auxiliary task is also jointly optimized with its intermediate feature $Z^k$, which differentiates our methodology substantially from [27]. It is thus no surprise to see in Table 2 that each auxiliary task obtains much better performance than in [27]. (Table 2 of [27] reports that the best accuracy of the pose clustering task falls around 28%, using the most suited layer features.) Most notably, the performance of the overall identity clustering task receives a substantial boost of around 7% in accuracy. We also test DTAGnet-EML/DTAGnet-MML with only $C_1$ or $C_2$ kept. Experiments verify that as auxiliary tasks are added gradually, the overall task keeps benefiting. Those auxiliary tasks, when enforced together, can also reinforce each other.
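Table 2: Clustering accuracy of DTAGnet on CMU Multi-PIE with different auxiliary tasks (P: pose, E: expression, I: identity; "-" means no auxiliary task at that stage).

| Method | Stage I Task | Stage I Acc | Stage II Task | Stage II Acc | Overall Task | Overall Acc |
|---|---|---|---|---|---|---|
| DTAGnet-EML | - | - | - | - | I | 0.2176 |
| DTAGnet-EML | P | 0.5067 | - | - | I | 0.2303 |
| DTAGnet-EML | - | - | E | 0.3676 | I | 0.2507 |
| DTAGnet-EML | P | 0.5407 | E | 0.4027 | I | 0.2833 |
| DTAGnet-MML | - | - | - | - | I | 0.2347 |
| DTAGnet-MML | P | 0.5251 | - | - | I | 0.2635 |
| DTAGnet-MML | - | - | E | 0.3988 | I | 0.2858 |
| DTAGnet-MML | P | 0.5538 | E | 0.4231 | I | 0.3021 |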
One might wonder which matters more for the performance boost: the deeply task-specific architecture that brings extra discriminative feature learning, or the proper design of auxiliary tasks that capture the intrinsic data structure characterized by attributes?
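Table 3: Overall clustering accuracy of DTAGnet on CMU Multi-PIE with varying target cluster numbers for the auxiliary tasks.

| Method | #clusters in Stage I | #clusters in Stage II | Overall Accuracy |
|---|---|---|---|
| DTAGnet-EML | 4 | 4 | 0.2827 |
| DTAGnet-EML | 8 | 8 | 0.2813 |
| DTAGnet-EML | 12 | 12 | 0.2802 |
| DTAGnet-EML | 20 | 20 | 0.2757 |
| DTAGnet-MML | 4 | 4 | 0.3030 |
| DTAGnet-MML | 8 | 8 | 0.3006 |
| DTAGnet-MML | 12 | 12 | 0.2927 |
| DTAGnet-MML | 20 | 20 | 0.2805 |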
To answer this important question, we vary the target cluster number in either $C_1$ or $C_2$, and re-conduct the experiments. Table 3 reveals that more auxiliary tasks, even those without any straightforward task-specific interpretation (e.g., partitioning the Multi-PIE subset into 4, 8, 12 or 20 clusters hardly makes semantic sense), may still help gain better performance. A plausible explanation is that they simply promote more discriminative feature learning in a low-to-high, coarse-to-fine scheme. In fact, this observation is complementary to the conclusion found in classification [16]. On the other hand, at least in this specific case, the models seem to achieve their best performance as the target cluster numbers of the auxiliary tasks get closer to the ground truth (5 and 6 here). We conjecture that, when properly “matched”, every hidden representation in each layer is in fact most suited for clustering the attributes corresponding to the layer of interest. The whole model resembles the problem of sharing low-level feature filters among several relevant high-level tasks in convolutional networks [11], but in a distinct context.
We hence conclude that the deeply-supervised fashion proves helpful for deep clustering models, even when there are no explicit attributes for constructing a practically meaningful hierarchical clustering problem. However, it is preferable to exploit those attributes when available, as they lead to not only superior performance but also more clearly interpretable models. The learned intermediate features can potentially be utilized for multi-task learning [28].
6 Conclusion
In this paper, we present a deep learning-based clustering framework. Trained from end to end, it features a task-specific deep architecture inspired by sparse coding domain expertise, which is then optimized under clustering-oriented losses. Such a well-designed architecture leads to more effective initialization and training, and significantly outperforms generic architectures of the same parameter complexity. The model can be further interpreted and enhanced by introducing auxiliary clustering losses on the intermediate features. Extensive experiments verify the effectiveness and robustness of the proposed models.
References
[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP, 2006.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127, 2009.
[4] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.
[5] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI, 22(7):719–725, 2000.
[6] S. Chang, W. Han, J. Tang, G. Qi, C. Aggarwal, and T. S. Huang. Heterogeneous network embedding via deep architectures. In ACM SIGKDD, 2015.
[7] G. Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.
[8] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang. Learning with l1 graph for image analysis. IEEE TIP, 19(4), 2010.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[11] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
[12] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, pages 399–406, 2010.
[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5), 2010.
[14] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[16] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
[17] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801–808, 2006.
[18] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, pages 362–371. IEEE, 2006.
[19] X. Li, K. Zhang, and T. Jiang. Minimum entropy clustering and applications to gene expression analysis. In CSB, pages 142–151. IEEE, 2004.
[20] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE TPAMI, 34(4):791–804, 2012.
[21] S. A. Nene, S. K. Nayar, H. Murase, et al. Columbia object image library (COIL-20). Technical report.
[22] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. NIPS, 2:849–856, 2002.
[23] F. Nie, D. Xu, I. W. Tsang, and C. Zhang. Spectral embedded clustering. In IJCAI, pages 1181–1186, 2009.
[24] V. Roth and T. Lange. Feature selection in clustering problems. In NIPS, 2003.
[25] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147, 2013.
[26] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, 2014.
[27] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller. A deep semi-NMF model for learning hidden representations. In ICML, pages 1692–1700, 2014.
[28] Y. Wang, D. Wipf, Q. Ling, W. Chen, and I. Wassail. Multi-task learning for subspace segmentation. In ICML, 2015.
[29] Z. Wang, Y. Yang, S. Chang, J. Li, S. Fong, and T. S. Huang. A joint optimization framework of sparse coding and discriminative clustering. In IJCAI, 2015.
[30] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, 2009.
[31] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, pages 1537–1544, 2004.
[32] Y. Yang, Z. Wang, J. Yang, J. Wang, S. Chang, and T. S. Huang. Data clustering by Laplacian regularized l1-graph. In AAAI, 2014.
[33] B. Zhao, F. Wang, and C. Zhang. Efficient maximum margin clustering via cutting plane algorithm. In SDM, 2008.
[34] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai. Graph regularized sparse coding for image representation. IEEE TIP, 20(5):1327–1336, 2011.