1 Introduction
Clustering methods are important techniques for exploratory data analysis, with wide applications ranging from data mining Han2011DataMining ; Berkhin2006DataMiningSurvery and dimension reduction Boutsidis2015RDR to segmentation Shi2000Ncuts . Their aim is to partition data points into clusters so that data in the same cluster are similar to each other while data in different clusters are dissimilar. Approaches to achieve this aim include partitional methods such as k-means and k-medoids, hierarchical methods like agglomerative and divisive clustering, methods based on density estimation such as DBSCAN ester1996density , and recent methods based on finding density peaks such as CFSFDP Rodriguez2016CFSFDP and LDPS Li2016LDPS .

Image clustering Ahmed2016review
is a special case of clustering analysis that seeks to find compact, object-level models from many unlabeled images. Its applications include automatic visual concept discovery Lee2011easy , content-based image retrieval, and image annotation. However, image clustering is a hard task mainly owing to the following two reasons: 1) images are often of high dimensionality, which significantly affects the performance of clustering methods such as k-means Ding2007AdaptiveDR , and 2) objects in images usually have two-dimensional or three-dimensional local structures which should not be ignored when exploring the local structure information of the images.

To address these issues, many representation learning methods have been proposed for image feature extraction as a preprocessing step. Traditionally, various hand-crafted features such as SIFT Lowe1999SIFT , HOG Dalal2005HOG , NMF Hong2016jointNMF , and (geometric) CW-SSIM similarity Sampat2009CWSSIM ; Li2016GCWSSIM have been used to encode the visual information. Recently, many approaches have been proposed to combine clustering methods with deep neural networks (DNNs), which have shown a remarkable performance improvement over hand-crafted features Krizhevsky2012DCNN . Roughly speaking, these methods can be categorized into two groups: 1) sequential methods that apply clustering on the learned DNN representations, and 2) unified approaches that jointly optimize the deep representation learning and clustering objectives.

In the first group, a deep (convolutional) neural network, such as a deep belief network (DBN) Hinton2006DBN or stacked autoencoders Tian2014graph , is first trained in an unsupervised manner to approximate the nonlinear feature embedding from the raw image space to the embedded feature space (usually low-dimensional). Then, k-means, spectral clustering, or agglomerative clustering can be applied to partition the feature space. However, since the feature learning and clustering are separated from each other, the learned DNN features may not be reliable for clustering.
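As a concrete illustration of this sequential pipeline, the sketch below runs k-means on a fixed embedding; the random features are merely a stand-in for the output of a pre-trained DBN or stacked autoencoder, so everything here is illustrative rather than the paper's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical encoder output: n samples embedded into a d-dimensional space.
# In the sequential pipeline this embedding would come from an unsupervised
# pre-trained network; random features stand in for illustration only.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))

# Stage two of the pipeline: partition the embedded feature space.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
print(labels.shape)  # (1000,)
```

Because the two stages are decoupled, nothing in the k-means step feeds back into the embedding, which is exactly the separation issue the unified approaches below try to fix.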
There are a few recent methods in the second group which take the separation issue into consideration. In Xie2015DEC , the authors proposed deep embedded clustering, which simultaneously learns feature representations with stacked autoencoders and cluster assignments with soft k-means by minimizing a joint loss function. In Yang2016JULE , joint unsupervised learning was proposed to learn deep convolutional representations and agglomerative clustering jointly using a recurrent framework. In Liu2016IEC , the authors proposed an infinite ensemble clustering framework that integrates deep representation learning and ensemble clustering. The key insight behind these approaches is that good representations are beneficial for clustering and, conversely, clustering results can provide supervisory signals for representation learning. Thus, two factors, designing a proper representation learning model and designing a suitable unified learning objective, will greatly affect the performance of these methods.

In this paper, we follow these recent advances and propose a unified clustering method named Discriminatively Boosted Clustering (DBC) for image analysis, based on fully convolutional autoencoders (FCAE). See Fig. 1 for an overview of the framework. We first introduce a fully convolutional encoder-decoder network for fast and coarse image feature extraction. We then discard the decoder part and add a soft k-means model on top of the encoder to make a unified clustering model. The model is jointly trained with gradually boosted discrimination, where high-score assignments are highlighted and low-score ones are de-emphasized. Our main contributions are summarized as follows:

We propose a fully convolutional autoencoder (FCAE) for image feature learning. The FCAE is composed of convolution-type layers (convolution and deconvolution layers) and pool-type layers (pooling and unpooling layers). By adding batch normalization (BN) layers to each of the convolution-type layers, we can train the FCAE in an end-to-end way. This avoids the tedious and time-consuming layer-wise pre-training stage adopted in traditional stacked (convolutional) autoencoders. To the best of our knowledge, this is the first attempt to learn a deep autoencoder in an end-to-end manner.

We propose a discriminatively boosted clustering (DBC) framework based on the learned FCAE and an additional soft k-means model. We train the DBC model in a self-paced learning procedure, where deep representations of raw images and cluster assignments are jointly learned. This overcomes the separation issue of traditional clustering methods that directly use features learned from autoencoders.

We show that the FCAE can learn better features for clustering than raw images on several image datasets, including MNIST, USPS, COIL20 and COIL100. Besides, with discriminatively boosted learning, the FCAE-based DBC can outperform several state-of-the-art analogous methods based on k-means and deep autoencoders.
The remaining part of this paper is organized as follows. Some related work including stacked (convolutional) autoencoders, deconvolutional neural networks, and joint feature learning and clustering are briefly reviewed in Section 2. Detailed descriptions of the proposed FCAE and DBC are presented in Section 3. Experimental results on several real datasets are given in Section 4 to validate the proposed methods. Conclusions and future works are discussed in Section 5.
2 Related work
Stacked autoencoders Vincent2010SDAE ; Baldi2012autoencoders ; Bengio2013review ; Hinton2006DBN ; Hinton2006RBM ; Bengio2007layerwise have been studied in past years for unsupervised deep feature extraction and nonlinear dimension reduction. Their extensions for dealing with images are stacked convolutional autoencoders Masci2011SCAE ; Lee2009CDBN . Most of these methods involve a two-stage training procedure Bengio2007layerwise : one stage is layer-wise pre-training and the other is overall fine-tuning. A significant drawback of this procedure is that the layer-wise pre-training is time-consuming and tedious, especially when the base layer is a Restricted Boltzmann Machine (RBM) rather than a traditional autoencoder, or when the overall network is very deep.
Recently, there has been an attempt to discard the layer-wise pre-training procedure and train a deep autoencoder-type network in an end-to-end way. In Noh2015deconvolution , a deep deconvolution network is learned for image segmentation. The input of the architecture is an image and the output is a segmentation mask. The network achieves state-of-the-art performance compared with analogous methods thanks to three factors: 1) introducing a deconvolution layer and an unpooling layer Zeiler2014visualizing ; Mohan2014deconvolution ; Zeiler2011deconvolution to recover the original image size for the segmentation mask, 2) applying batch normalization Ioffe2015BN to each convolution layer and each deconvolution layer to reduce the internal covariate shift, which not only makes end-to-end training possible but also speeds up the process, and 3) adopting an encoder pre-trained on large-scale datasets, such as the VGG16 model Simonyan2015VGG . The success of this architecture suggests that it is possible to design an end-to-end training procedure for fully convolutional autoencoders.
Clustering has also been studied in past years based on independent features extracted from autoencoders (see, e.g., Ding2007AdaptiveDR ; Tian2014graph ; Huang2014DEN ; Song2013AEC ). Recently, there have been attempts to combine autoencoders and clustering in a unified framework Xie2015DEC ; Yang2016DCN . In Xie2015DEC , the authors proposed Deep Embedded Clustering (DEC), which learns deep representations and cluster assignments jointly. DEC uses a deep stacked autoencoder to initialize the feature extraction model and a Kullback-Leibler divergence loss to fine-tune the unified model. In Yang2016DCN , the authors proposed the Deep Clustering Network (DCN), a joint dimensionality reduction and k-means clustering framework, where the dimensionality reduction model is based on deep neural networks. Although these methods have achieved some success, they are not well suited for dealing with high-dimensional images due to the use of stacked autoencoders rather than convolutional ones. This motivates us to design a unified clustering framework based on convolutional autoencoders.

3 Proposed methods
In this section, we propose a unified image clustering framework with fully convolutional autoencoders and a soft k-means clustering model (see Fig. 1). The framework contains two parts: part I is a fully convolutional autoencoder (FCAE) for fast and coarse image feature extraction, and part II is a discriminatively boosted clustering (DBC) method which is composed of a fully convolutional encoder and a soft k-means categorizer. The DBC takes an image as input and exports soft assignments as output. It can be jointly trained with a discriminatively boosted distribution assumption, which makes the learned deep representations more suitable for the top categorizer. Our idea is very similar to self-paced learning Lee2011easy , where the easiest instances are focused on first and more complex objects are introduced progressively. In the following subsections, we explain the detailed implementation of this idea.
3.1 Fully convolutional autoencoder for image feature extraction
Traditional deep convolutional autoencoders adopt a greedy layer-wise training procedure for feature transformations. This can be tedious and time-consuming when dealing with very deep neural networks. To address this issue, we propose a fully convolutional autoencoder architecture which can be trained in an end-to-end manner. Part I of Fig. 1 shows an example of FCAE on the MNIST dataset. It has the following features:
 Fully Convolutional

As pointed out in Masci2011SCAE , max-pooling layers are crucial for learning biologically plausible features in convolutional architectures. Thus, we adopt convolution layers along with max-pooling layers to make a fully convolutional encoder (FCE). Since the down-sampling operations in the FCE reduce the size of the output feature maps, we use the unpooling layer introduced in Noh2015deconvolution to recover the feature maps. As a result, unpooling layers along with deconvolution layers (see Noh2015deconvolution ) are adopted to make a fully convolutional decoder (FCD).

 Symmetric

The overall architecture is symmetric around the feature layer. In practice, it is suggested to use an odd number of layers; otherwise, it is ambiguous to define the feature layer. Besides, fully connected layers (dense layers) should be avoided in the architecture since they destroy the local structure of the feature layer.
 Normalized

The depth of the whole network grows with the input image size. This could make the network very deep if the original image has a very large width or height. To overcome this problem, we adopt the batch normalization (BN) Ioffe2015BN strategy to reduce the internal covariate shift and speed up the training. The BN operation is performed after each convolutional layer and each deconvolutional layer except for the last output layer. As pointed out in Noh2015deconvolution , BN is critical for optimizing fully convolutional neural networks.
FCAE utilizes the two-dimensional local structure of the input images and reduces the redundancy in parameters compared with stacked autoencoders (SAEs). Besides, FCAE differs from conventional SAEs in that its weights are shared among all locations within each feature map, which preserves spatial locality.
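A minimal PyTorch sketch of such an FCAE for 28×28 MNIST images follows; the layer sizes match the MNIST column of Table 2 in Section 4, while the ReLU activations and other details are our assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FCAE(nn.Module):
    """Sketch of a fully convolutional autoencoder: BN after every
    convolution-type layer except the output, unpooling paired with
    the max-pooling indices from the encoder."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 6, 5), nn.BatchNorm2d(6), nn.ReLU())
        self.pool1 = nn.MaxPool2d(2, return_indices=True)
        self.conv2 = nn.Sequential(nn.Conv2d(6, 16, 5), nn.BatchNorm2d(16), nn.ReLU())
        self.pool2 = nn.MaxPool2d(2, return_indices=True)
        self.feat = nn.Sequential(nn.Conv2d(16, 120, 4), nn.BatchNorm2d(120))
        # decoder mirrors the encoder: deconvolution + unpooling
        self.defeat = nn.Sequential(nn.ConvTranspose2d(120, 16, 4), nn.BatchNorm2d(16), nn.ReLU())
        self.unpool2 = nn.MaxUnpool2d(2)
        self.deconv2 = nn.Sequential(nn.ConvTranspose2d(16, 6, 5), nn.BatchNorm2d(6), nn.ReLU())
        self.unpool1 = nn.MaxUnpool2d(2)
        self.deconv1 = nn.ConvTranspose2d(6, 1, 5)  # no BN on the output layer

    def forward(self, x):
        x, idx1 = self.pool1(self.conv1(x))      # 28 -> 24 -> 12
        x, idx2 = self.pool2(self.conv2(x))      # 12 -> 8 -> 4
        z = self.feat(x)                         # 4 -> 1x1x120 feature layer
        x = self.unpool2(self.defeat(z), idx2)   # 1 -> 4 -> 8
        x = self.unpool1(self.deconv2(x), idx1)  # 8 -> 12 -> 24
        return z, self.deconv1(x)                # 24 -> 28

z, recon = FCAE()(torch.randn(2, 1, 28, 28))
print(z.shape, recon.shape)  # torch.Size([2, 120, 1, 1]) torch.Size([2, 1, 28, 28])
```

The symmetry is visible in the forward pass: every size-reducing operation in the encoder has a size-restoring counterpart in the decoder.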
3.2 Discriminatively boosted clustering
Once FCAE has been trained, we can extract features with the encoder part to serve as the input of a categorizer. This strategy is used in many clustering methods based on autoencoders, such as GraphEncoder Tian2014graph , deep embedding networks Huang2014DEN , and autoencoder-based clustering Song2013AEC . These approaches treat the autoencoder as a preprocessing step which is designed separately from the subsequent clustering step. However, the representations learned in this way can be ambiguous for clustering, and the clusters may be unclear (see the initial stage in Fig. 2).
To address this issue, we propose a self-paced approach that unifies feature learning and clustering in a single framework (see Part II in Fig. 1). We throw away the decoder of the FCAE and add a soft k-means model on top of the feature layer. To train the unified model, we trust easier samples first and then gradually utilize new samples of increasing complexity. Here, an easier sample (see the regions labelled 2, 3 and 4 in Fig. 2) is one that quite certainly belongs to a specific cluster, and a harder sample (see region 1 in Fig. 2) is one that is likely to be categorized into multiple clusters. Fig. 2 illustrates the difference between these samples at different learning stages of DBC.
There are three challenging questions in the learning problem of DBC which will be answered in the following subsections:

How to choose a proper criterion to determine the easiness or hardness of a sample?

How to transform harder samples into easier ones?

How to learn from easier samples?
3.2.1 Easiness measurement with the soft k-means scores
We follow DEC Xie2015DEC and adopt the t-distribution-based soft assignment to measure the easiness of a sample. The t-distribution was investigated in Maaten2008tSNE to deal with the crowding problem of low-dimensional data distributions. Under the t-distribution kernel, the soft score (or similarity) between the feature $z_i$ and the cluster center $\mu_j$ is

(1) $q_{ij} = \dfrac{\left(1 + \|z_i - \mu_j\|^2/\alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2/\alpha\right)^{-\frac{\alpha+1}{2}}}$, s.t. $\sum_j q_{ij} = 1$.

Here, $\alpha$ is the degree of freedom of the t-distribution, fixed to a constant in practice. The most important reason for choosing the t-distribution kernel is that it has a longer tail than the well-known heat kernel (i.e., the Gaussian kernel). Thus, we do not need to pay much attention to parameter estimation (see Maaten2008tSNE ), which is a hard task in unsupervised learning.

3.2.2 Boosting easiness with discriminative target distribution
We transform harder examples into easier ones by boosting the higher-score assignments while bringing down those with lower scores. This can be achieved by constructing an underlying target distribution $P$ from $Q$ as follows:

(2) $p_{ij} = \dfrac{q_{ij}^{1+\gamma}/f_j}{\sum_{j'} q_{ij'}^{1+\gamma}/f_{j'}}$, s.t. $\sum_j p_{ij} = 1$,

where $\gamma > 0$ is a boosting factor and $f_j$ is a per-cluster normalization factor.
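The soft scores and the boosted targets can be sketched in a few lines of NumPy; the boosting rule shown here (raising each score to an assumed power 1+gamma, balancing per cluster, then re-normalizing each row) is one concrete form consistent with the surrounding description, not necessarily the paper's exact formula:

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    """Student's t-kernel soft scores q between features z (n, d)
    and cluster centers mu (K, d), rows normalized to sum to 1."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def boosted_target(q, gamma=1.0):
    """Sharpen high scores and suppress low ones; the exponent 1+gamma
    and the per-cluster balancing are assumptions of this sketch."""
    w = q ** (1.0 + gamma)
    w = w / w.sum(axis=0, keepdims=True)   # per-cluster normalization factor
    return w / w.sum(axis=1, keepdims=True)  # rows sum to 1 again

rng = np.random.default_rng(0)
z, mu = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
q = soft_assign(z, mu)
p = boosted_target(q)
print(q.shape, p.shape)  # (5, 4) (5, 4)
```

In each row of `p`, the largest entry of `q` is amplified relative to the others, which is exactly the "boosting easiness" effect described above.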
Suppose we can ideally learn from the soft scores (denoted as $Q^{(t)}$) to the assumptive target distribution (denoted as $P^{(t)}$) each time. Then we can generate a learning chain as follows:

$Q^{(0)} \to P^{(0)} = Q^{(1)} \to P^{(1)} = Q^{(2)} \to \cdots$
The following two properties can be observed from the chain:
Property 1 If $q_{ij}^{(0)} = 1/K$ for any $j$ and some sample $i$, then $q_{ij}^{(t)} = 1/K$ for all $j$ and all time steps $t$.
Proof Under the condition, and by (2), we can deduce that $p_{ij}^{(0)} = 1/K$ for all $j$. By the chain, this is equivalent to the fact that $q_{ij}^{(1)} = 1/K$. Thus, the conclusion follows recursively for all $t$. ∎
Property 2 If there exists a $j^*$ such that $q_{ij^*}^{(0)} > q_{ij}^{(0)}$ for all $j \neq j^*$, then $q_{ij^*}^{(t)} \to 1$ as $t \to \infty$.
Proof By (2) we have

$\dfrac{q_{ij}^{(t)}}{q_{ij^*}^{(t)}} = \left(\dfrac{q_{ij}^{(0)}}{q_{ij^*}^{(0)}}\right)^{(1+\gamma)^t}$.

By the assumption $q_{ij^*}^{(0)} > q_{ij}^{(0)}$, it is seen that $q_{ij}^{(0)}/q_{ij^*}^{(0)} < 1$ for any $j \neq j^*$. On the other hand, since $\gamma > 0$, we have $(1+\gamma)^t \to \infty$. Thus,

$\dfrac{q_{ij}^{(t)}}{q_{ij^*}^{(t)}} \to 0 \quad \text{for all } j \neq j^*$.

Since the number of clusters $K$ is finite, we have $\sum_{j \neq j^*} q_{ij}^{(t)}/q_{ij^*}^{(t)} \to 0$. Finally, with the constraint $\sum_j q_{ij}^{(t)} = 1$, we obtain $q_{ij^*}^{(t)} \to 1$. ∎
Property 1 tells us that the hardest samples (which have equal probability of being assigned to the different clusters) will always remain the hardest ones. However, in practical applications, such examples hardly exist.

Property 2 shows that initially non-discriminative samples can be boosted gradually to become definitely discriminative. As a result, we obtain the desired features for k-means clustering.

Note that the boosting factor controls the speed of the learning process. A larger boosting factor makes the learning process faster than a smaller one. However, it may boost some falsely categorized samples too quickly at the initial stages and thus make their features irrecoverable at later stages.
Besides, it can be helpful to balance the data distribution at different learning stages. In Xie2015DEC , the authors proposed to normalize the boosted assignments to prevent large clusters from distorting the hidden feature space. This can be achieved by dividing each boosted score by a per-cluster normalization factor.
3.2.3 Learning with the Kullback-Leibler divergence loss
In the last subsection, it was assumed that we could learn from $Q$ to the boosted target distribution $P$. This aim can be achieved with a joint Kullback-Leibler (KL) divergence loss, that is,

(3) $L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \dfrac{p_{ij}}{q_{ij}}.$
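This joint loss can be transcribed directly; the small eps guard below is an implementation detail added for numerical safety, not part of the formulation:

```python
import numpy as np

def kl_loss(p, q, eps=1e-12):
    """Joint KL divergence L = sum_i sum_j p_ij * log(p_ij / q_ij)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Row 1: a confident target far from the current scores -> contributes loss.
# Row 2: target equals the scores -> contributes nothing.
p = np.array([[0.9, 0.1], [0.5, 0.5]])
q = np.array([[0.6, 0.4], [0.5, 0.5]])
print(kl_loss(p, q))  # positive, driven entirely by the first row
```

Minimizing this quantity pulls each row of `q` toward its boosted target row in `p`, which is precisely the learning step assumed in the chain above.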
Fig. 3 gives an example of the joint loss in the two-cluster case, where $\ell_{ij} = p_{ij}\log(p_{ij}/q_{ij})$ is the loss generated by a sample with respect to the $j$-th cluster ($j = 1$ or $2$). The regions marked in Fig. 3 roughly correspond to the regions marked in Fig. 2.
Intuitively, the loss has the following main features:

For an ambiguous (or hard) sample (i.e., $q_{ij} \approx 1/K$ for all $j$), its loss is near zero according to Property 1. Therefore, it will not be seriously treated in the learning process. (Region 1)

For a well-categorized sample (i.e., there exists a $j$ such that $q_{ij} > 1/K$), its loss will be much greater than zero, and thus it will be treated more seriously. (Regions 2 and 3)

For a definitely well-categorized sample (i.e., there exists a $j$ such that $q_{ij} \approx 1$), its loss will be near zero. This means that its features do not need to be changed much more. (Region 4)
3.2.4 Training algorithm
In this section, we summarize the overall training procedure of the proposed method in Algorithm 1 and Algorithm 2, which implement the framework shown in Fig. 1. Here, $T$ denotes the maximum number of learning epochs, $T_u$ the maximum number of updating iterations in each epoch, and $m$ the mini-batch size. The encoder part of FCAE is $f_{\theta_e}: X \to Z$, parameterized by $\theta_e$, and the decoder part is $g_{\theta_d}: Z \to X$, parameterized by $\theta_d$. FCAE is trained with the reconstruction objective

(M1) $\min_{\theta_e, \theta_d} \sum_i \left\| x_i - g_{\theta_d}(f_{\theta_e}(x_i)) \right\|^2$,

and DBC is trained with the joint KL divergence objective

(M2) $\min_{\theta_e, \{\mu_j\}} \mathrm{KL}(P \,\|\, Q)$.
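The alternating structure of the procedure (fix a boosted target per epoch, then run several KL-minimization iterations) can be sketched as follows; the linear encoder, the squared-score boosting rule, and all hyper-parameters are illustrative stand-ins rather than the paper's actual settings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-ins: a linear encoder replaces the convolutional one for brevity,
# and mu holds K = 3 learnable cluster centers of the soft k-means layer.
encoder = nn.Linear(20, 5)
mu = nn.Parameter(torch.randn(3, 5))
opt = torch.optim.SGD(list(encoder.parameters()) + [mu], lr=0.1)
x = torch.randn(64, 20)

for epoch in range(3):                          # outer self-paced epochs
    with torch.no_grad():                       # fix the boosted target P per epoch
        z = encoder(x)
        q = (1 + torch.cdist(z, mu) ** 2) ** -1  # t-kernel scores (alpha = 1)
        q = q / q.sum(1, keepdim=True)
        p = q ** 2 / q.sum(0)                   # assumed boosting rule, balanced
        p = p / p.sum(1, keepdim=True)
    for _ in range(10):                         # inner KL-minimization iterations
        z = encoder(x)
        q = (1 + torch.cdist(z, mu) ** 2) ** -1
        q = q / q.sum(1, keepdim=True)
        loss = (p * (p.clamp_min(1e-12) / q.clamp_min(1e-12)).log()).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
print(float(loss))
```

Because `p` is recomputed only once per outer epoch, the inner loop chases a stationary target, mirroring the epoch/iteration split between the two algorithms.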
4 Experiments
In this section, we present experimental results on several real datasets to evaluate the proposed methods in comparison with several state-of-the-art methods. To this end, we first introduce the evaluation benchmarks and then present visualization results of the inner features, the learned FCAE weights, the frequency histograms of soft assignments during the learning process, and the features embedded in a low-dimensional space. We also give some ablation studies with respect to the boosting factor, the normalization factor, and the FCAE initializations.
4.1 Evaluation benchmarks
Datasets We evaluate the proposed FCAE and DBC methods on two handwritten digit image datasets, MNIST (http://yann.lecun.com/exdb/mnist/) and USPS (http://www.cs.nyu.edu/~roweis/data.html), and two multi-view object image datasets, COIL20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil20.php) and COIL100 (http://www.cs.columbia.edu/CAVE/software/softlib/coil100.php). The size of the datasets, the number of categories, the image sizes, and the number of channels are summarized in Table 1.
Dataset  #Samples  #Categories  Image Size  #Channels 

MNIST  70000  10  28×28  1 
USPS  11000  10  16×16  1 
COIL20  1440  20  128×128  1 
COIL100  7200  100  128×128  3 
Evaluation metrics Two standard metrics are used to evaluate the experimental results, as explained below.

Accuracy (ACC) Xie2015DEC . Given the ground-truth labels $\{l_i\}_{i=1}^n$ and the predicted cluster assignments $\{c_i\}_{i=1}^n$, ACC measures the best matching accuracy:

$\mathrm{ACC} = \max_{m} \dfrac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n}$,

where $m$ ranges over all possible one-to-one mappings between the labels of the predicted clusters and the ground-truth labels. The optimal mapping can be efficiently computed using the Hungarian algorithm Kuhn1995Hungarian .

Normalized mutual information (NMI) Cai2011NMI . From the information-theoretic point of view, NMI can be interpreted as

$\mathrm{NMI}(l, c) = \dfrac{\mathrm{MI}(l, c)}{\max\{H(l), H(c)\}}$,

where $H(\cdot)$ is the entropy and $\mathrm{MI}(l, c)$ is the mutual information of $l$ and $c$.
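Both metrics can be computed with standard tools: SciPy's `linear_sum_assignment` performs the Hungarian matching needed for ACC, and scikit-learn provides NMI (its default normalizer is the arithmetic mean of the entropies rather than the maximum, a difference that vanishes when the two entropies are equal, as in this toy example):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def cluster_acc(y_true, y_pred):
    """ACC: best one-to-one match between predicted clusters and
    ground-truth labels, found with the Hungarian algorithm."""
    K = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((K, K), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                      # co-occurrence matrix
    row, col = linear_sum_assignment(-count)  # maximize total matched count
    return count[row, col].sum() / len(y_true)

# Cluster indices are permuted relative to the labels but otherwise
# consistent, so both metrics should ignore the relabelling.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(cluster_acc(y_true, y_pred))                   # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```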
Network architectures Table 2 shows the network architectures of the encoder parts for the different datasets. The decoder parts exactly mirror the encoder parts. We use max-pooling in all the experiments. The spatial size of all the feature layers is 1×1. No padding is used in the convolutional layers except for the USPS dataset, whose padding size is 1. For each dataset, the first row of Table 2 lists the kernel size and the number of output channels of each layer, and the second row lists the resulting output size and number of channels.
datasets  conv1  pool1  conv2  pool2  conv3  pool3  conv4  pool4  features 

MNIST  5, 6  2  5, 16  2      4, 120 
  24, 6  12, 6  8, 16  4, 16      1, 120 
USPS  3, 20  2  3, 20  2      4, 160 
  16, 20  8, 20  8, 20  4, 20      1, 160 
COIL  9, 20  2  5, 20  2  5, 20  2  5, 40  2  4, 320 
  120, 20  60, 20  56, 20  28, 20  24, 20  12, 20  8, 40  4, 40  1, 320 
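The output sizes in Table 2 can be sanity-checked with the usual arithmetic: an unpadded stride-1 convolution maps size n to n − k + 1, and a 2×2 stride-2 pooling maps n to n/2. Tracing the MNIST column:

```python
def conv_out(n, k, pad=0):
    """Spatial output size of a stride-1 convolution with kernel k."""
    return n + 2 * pad - k + 1

def pool_out(n):
    """Spatial output size of a 2x2, stride-2 max-pooling layer."""
    return n // 2

# MNIST column of Table 2: conv(5) -> pool -> conv(5) -> pool -> conv(4)
n, sizes = 28, []
for layer, k in [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2), ("conv", 4)]:
    n = conv_out(n, k) if layer == "conv" else pool_out(n)
    sizes.append(n)
print(sizes)  # [24, 12, 8, 4, 1]
```

The final size of 1 confirms that the feature layer collapses each map to a single spatial location, as required by the 1×1 feature layer noted above.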
Compared methods To validate the effectiveness of FCAE and DBC, we compare them with the following state-of-the-art methods based on k-means and deep autoencoders.

KMS is the baseline method that applies the k-means algorithm on raw images.

DAEKMS Xie2015DEC uses deep autoencoders for feature extraction and then applies k-means for the subsequent clustering.

AEC Song2013AEC is a variant of DAEKMS that simultaneously optimizes the data reconstruction error and representation compactness.

IEC Liu2016IEC incorporates the deep representation learning and ensemble clustering.

DEC Xie2015DEC simultaneously learns the feature representations and cluster centers using deep autoencoders and soft k-means, respectively.

DEN Huang2014DEN learns clustering-oriented representations by utilizing deep autoencoders and manifold constraints.

DCN Yang2016DCN jointly applies dimensionality reduction and k-means clustering.

FCAEKMS (our algorithm) adopts FCAE for feature extraction and applies k-means for the subsequent clustering.

DBC (our algorithm) uses Algorithm 2 for training a unified clustering method.
Results and analysis Table 3 summarizes the benchmark results on the MNIST dataset. The k-means method performs badly on raw images. However, based on the end-to-end trained FCAE features, k-means achieves comparable results to DAEKMS, which uses greedily layer-wise trained deep autoencoder features. Moreover, with the additional joint training, DBC outperforms FCAEKMS and beats all the other compared methods in terms of ACC and NMI.
Tables 4-6 show the benchmarks on USPS, COIL20 and COIL100, respectively. Similar to the observations on the MNIST handwritten digits dataset, DBC outperforms FCAEKMS by a large margin on the USPS handwritten digits dataset. On the COIL datasets, DBC obtains slightly better results than FCAEKMS.
On the handwritten digits datasets, the number of samples is much larger than the number of categories. This causes the distributions of the FCAE features to be closely related, so many ambiguous samples may occur. As a result, discriminative boosting makes sense on these datasets, and DBC performs much better than FCAEKMS. On the COIL datasets, DBC takes little advantage of the discriminative boosting procedure since the FCAE features are already very discriminative for clustering; thus, there are very few ambiguous samples whose easiness needs to be boosted.
Metric  KMS  AEC  IEC  FCAEKMS  DBC 

ACC  0.535  0.715  0.767  0.667  0.743 
NMI  0.531  0.651  0.641  0.645  0.724 
Metric  KMS  DEN  FCAEKMS  DBC 

ACC  0.592  0.725  0.787  0.793 
NMI  0.767  0.870  0.882  0.895 
Metric  KMS  IEC  FCAEKMS  DBC 

ACC  0.506  0.546  0.766  0.775 
NMI  0.772  0.787  0.897  0.905 
4.2 Visualization
One of the advantages of fully convolutional neural networks is that we can naturally visualize the inner activations (or features) and the trained weights (or filters) in a two-dimensional space Masci2011SCAE . Besides, we can monitor the learning process of DBC by drawing frequency histograms of the assignment scores. In addition, t-SNE can be applied to the embedded features to visualize the manifold structures in a low-dimensional space. Finally, we show some typical falsely categorized samples generated by our algorithm.
4.2.1 Visualization of the inner activations and learned filters
In Fig. 4, we visualize the inner activations of FCAE on the MNIST dataset with three digits: 1, 5, and 9. As shown in the figure, the activations in the feature layer are very sparse. Besides, the deconvolutional layers gradually recover details of the pooled feature maps and finally give a rough description of the original image. This indicates that FCAE can learn clustering-friendly features while keeping the key information for image reconstruction.
Fig. 5 visualizes the learned filters of FCAE on the MNIST dataset. It was observed in Masci2011SCAE that stacked convolutional autoencoders trained on noisy inputs (30% binomial noise) with a max-pooling layer can learn localized, biologically plausible filters. In our architecture, even without adding noise, the learned deconvolutional filters are nontrivial Gabor-like filters with visually appealing shapes. This is due to the use of the max-pooling and unpooling operations. As discussed in Masci2011SCAE , max-pooling layers are an elegant way of enforcing sparse codes, which are required to deal with the over-complete representations of convolutional architectures.
4.2.2 Monitoring the learning process
We use frequency histograms of the soft assignment scores to monitor the learning process of DBC. Fig. 6 shows the histograms of the scores assigned to the first cluster at different learning epochs on the MNIST test dataset (a subset of the MNIST dataset with 10,000 samples). At early epochs, most of the scores are near 0.1, which is the random-guess probability since there are 10 clusters. As the learning procedure goes on, some higher-score samples are discriminatively boosted and their scores become larger than the others. As a result, the cluster tends to "believe" in these higher-score samples and consequently makes the scores of the others smaller (approaching zero). Finally, the scores assigned to the cluster become two-side polarized: samples with very high scores are thought to definitely belong to the first cluster, and the others, with very small scores, belong to other clusters.


4.2.3 Embedding learned features in a low dimensional space
We visualize the distribution of the learned features in a two-dimensional space with t-SNE Maaten2008tSNE . Fig. 7 shows the embedded features of the MNIST test dataset at different epochs. At the initial epoch, the features learned with FCAE are not very discriminative for clustering. As shown in Fig. 7(a), the features of digits 3, 5, and 8 are closely related, and the same happens with digits 4, 7, and 9. At the second epoch, the distribution of the learned features becomes much more compact locally. Besides, the features of digit 7 move far away from those of digits 4 and 9, and similarly, the features of digit 8 move away from those of digits 3 and 5. As the learning procedure goes on, the hardest digits to categorize (4 vs. 9 and 3 vs. 5) are mostly well categorized after enough discriminative boosting epochs. This observation is consistent with the results shown in Subsection 4.2.2.
4.2.4 Visualization of falsely categorized examples
In Fig. 8, we show the top falsely categorized examples with very high maximum soft assignment scores. It can be observed that it is very hard to distinguish some ground-truth digits 4, 7, and 9 even with human experience. Many digits 7 are written with transverse lines in their middle and may therefore look ambiguous to the clustering algorithm. Besides, some ground-truth images are themselves confusing, such as those shown with a gray background.
4.3 Discussions
In this section, we make some ablation studies on the learning process with respect to different boosting factors, different normalization methods, and different initialization models generated by FCAE.
4.3.1 Impact of the boosting factor
Fig. 9(a) shows the ACC and NMI curves for different values of the boosting factor. With a small boosting factor, the learning process is very slow and takes a long time to terminate. On the contrary, when the factor is very large, the learning process is very fast at the initial stages. However, this can result in falsely boosting the scores of some ambiguous samples; the model then learns too much from false information, and the performance is not satisfactory. With a moderate boosting factor, the ACC and NMI curves grow reasonably and progressively.
4.3.2 Impact of the balance normalization
In DEC Xie2015DEC , the authors pointed out that balance normalization plays an important role in preventing large clusters from distorting the hidden feature space. To study this issue, we compare three normalization strategies: 1) constant normalization, used as a baseline; 2) normalization by the per-cluster sum of the original soft assignment scores, which is adopted in DEC; and 3) normalization by the per-cluster sum of the boosted soft assignment scores. Fig. 9(b) presents the ACC and NMI curves against the epoch under these settings. Initially, the normalization does not affect ACC and NMI very much. However, constant normalization easily gets stuck at early stages. Normalization by the original scores has a certain power of preventing the distortion. Our normalization strategy, by the boosted scores, gives the best performance among the three, because it directly reflects the changes of the boosted scores.
4.3.3 Impact of the FCAE initialization
To investigate the impact of the FCAE initialization on DBC, we compare the performance of DBC with three different initialization models: 1) random initialization, 2) initialization with a half-trained FCAE model, and 3) initialization with a sufficiently trained FCAE model. The comparison results are shown in Fig. 9(c). As illustrated in the figure, DBC performs well with all three models, even when the initialization is random. However, if the FCAE model is not sufficiently trained, the resulting DBC model will be suboptimal.
5 Conclusions and future works
In this paper, we proposed FCAE and DBC to deal with image representation learning and image clustering, respectively. Benchmarks on several visual datasets show that our methods achieve superior performance compared with analogous methods. Besides, the visualizations show that the proposed learning algorithm implements the idea proposed in Section 3.2. Some issues to be considered in the future include: 1) adding suitable constraints on FCAE to deal with natural images, and 2) scaling the algorithm to large-scale datasets such as the ImageNet dataset.
Acknowledgement
This work was supported in part by NNSF of China under grants 61379093, 61602483 and 61603389. We thank Shuguang Ding, Xuanyang Xi, Lu Qi and Yanfeng Lu for valuable discussions.
Appendix

A. Derivation of (4)

We use the chain rule for the deduction. First, set

(6) $u_{ij} = \left(1 + \|z_i - \mu_j\|^2/\alpha\right)^{-\frac{\alpha+1}{2}}$, so that $q_{ij} = u_{ij} / \sum_{j'} u_{ij'}$.

Then it follows that

(7) $\dfrac{\partial L}{\partial q_{ij}} = -\dfrac{p_{ij}}{q_{ij}}$.

Now set

(8) $d_{ij} = \|z_i - \mu_j\|^2$,

so

(9) $\dfrac{\partial u_{ij}}{\partial d_{ij}} = -\dfrac{\alpha+1}{2\alpha}\left(1 + d_{ij}/\alpha\right)^{-1} u_{ij}$.

Further, combining (7) and (9) with $\sum_k p_{ik} = 1$, we have

(10) $\dfrac{\partial L}{\partial d_{ij}} = \dfrac{\alpha+1}{2\alpha}\left(1 + d_{ij}/\alpha\right)^{-1}\left(p_{ij} - q_{ij}\right)$,

and

(11) $\dfrac{\partial d_{ij}}{\partial z_i} = 2\left(z_i - \mu_j\right)$.

Combine the above expressions to get the required result

(13) $\dfrac{\partial L}{\partial z_i} = \dfrac{\alpha+1}{\alpha}\sum_j \left(1 + d_{ij}/\alpha\right)^{-1}\left(p_{ij} - q_{ij}\right)\left(z_i - \mu_j\right)$.

B. Derivation of (5)

The argument is symmetric: with $\partial d_{ij}/\partial \mu_j = -2(z_i - \mu_j)$ and summing over the samples instead of the clusters, we obtain

$\dfrac{\partial L}{\partial \mu_j} = -\dfrac{\alpha+1}{\alpha}\sum_i \left(1 + d_{ij}/\alpha\right)^{-1}\left(p_{ij} - q_{ij}\right)\left(z_i - \mu_j\right)$.
References
 (1) J. Han, J. Pei, M. Kamber. Data Mining: Concepts and Techniques. Elsevier, 2011.
 (2) P. Berkhin. A survey of clustering data mining techniques. In: Grouping Multidimensional Data, pp. 25-71, 2006.
 (3) C. Boutsidis, A. Zouzias, M. Mahoney, and P. Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
 (4) J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
 (5) M. Ester, H.-P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, vol. 96, no. 34, pp. 226-231, 1996.
 (6) A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
 (7) F. Li, H. Qiao and B. Zhang. Effective deterministic initialization for k-means-like methods via local density peaks searching. arXiv:1611.06777, 2016.
 (8) N. Ahmed. Recent review on image clustering. IET Image Processing, vol. 9, no. 11, pp. 1020-1032, 2015.
 (9) Y.-J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. IEEE Conference on CVPR, pp. 1721-1728, 2011.
 (10) C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and K-means clustering. Proc. 24th International Conference on Machine Learning, pp. 521–528, 2007.
 (11) D. Lowe. Object recognition from local scale-invariant features. Proc. 7th International Conference on Computer Vision, vol. 2, pp. 1150–1157, 1999.
 (12) N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. Proc. Computer Vision and Pattern Recognition, vol. 1, pp. 886–893, 2005.
 (13) S. Hong, J. Choi, J. Feyereisl, B. Han and L. S. Davis. Joint image clustering and labeling by matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, pp. 1411–1424, 2016.
 (14) M. Sampat, Z. Wang, S. Gupta, A. Bovik and M. Markey. Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2385–2401, 2009.
 (15) F. Li, X. Huang, H. Qiao and B. Zhang. A new manifold distance for visual object categorization. The 12th World Congress on Intelligent Control and Automation, pp. 2232–2236, 2016.
 (16) A. Krizhevsky, I. Sutskever and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
 (17) G. Hinton, S. Osindero and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 (18) F. Tian, B. Gao, Q. Cui, E. Chen and T. Liu. Learning deep representations for graph clustering. AAAI, pp. 1293–1299, 2014.
 (19) J. Xie, R. Girshick and A. Farhadi. Unsupervised deep embedding for clustering analysis. Proc. 33rd International Conference on Machine Learning, pp. 478–487, 2016.
 (20) J. Yang, D. Parikh and D. Batra. Joint unsupervised learning of deep representations and image clusters. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 (21) H. Liu, M. Shao, S. Li and Y. Fu. Infinite ensemble for image clustering. Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1745–1754, 2016.
 (22) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
 (23) P. Baldi. Autoencoders, unsupervised learning, and deep architectures. ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 37–50, 2012.
 (24) Y. Bengio, A. Courville and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1789–1828, 2013.
 (25) G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, vol. 313, no. 5786, pp. 504–507, 2006.
 (26) Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, vol. 19, pp. 153–160, 2007.
 (27) J. Masci, U. Meier, D. Ciresan and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. International Conference on Artificial Neural Networks, pp. 52–59, 2011.
 (28) H. Lee, R. Grosse, R. Ranganath and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proc. 26th Annual International Conference on Machine Learning, pp. 609–616, 2009.
 (29) H. Noh, S. Hong and B. Han. Learning deconvolution network for semantic segmentation. Proc. IEEE International Conference on Computer Vision, pp. 1520–1528, 2015.
 (30) M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. European Conference on Computer Vision, pp. 818–833, 2014.
 (31) R. Mohan. Deep deconvolutional networks for scene parsing. arXiv:1411.4101, 2014.
 (32) M. Zeiler, G. Taylor and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. 2011 International Conference on Computer Vision, pp. 2018–2025, 2011.
 (33) S. Ioffe, and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
 (34) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 (35) P. Huang, Y. Huang, W. Wang and L. Wang. Deep embedding network for clustering. ICPR, pp. 1532–1537, 2014.
 (36) C. Song, F. Liu, Y. Huang, et al. Auto-encoder based data clustering. Iberoamerican Congress on Pattern Recognition, pp. 117–124, 2013.
 (37) L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
 (38) H. Kuhn. The Hungarian method for the assignment problem. 50 Years of Integer Programming 1958–2008, pp. 29–47, 2010.
 (39) D. Cai, X. He and J. Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 902–913, 2011.
 (40) B. Yang, X. Fu, N. D. Sidiropoulos and M. Hong. Towards k-means-friendly spaces: simultaneous deep learning and clustering. arXiv:1610.04794, 2016.