arXiv Paper: Progressive Cluster Purification for Unsupervised Feature Learning
In unsupervised feature learning, sample specificity based methods ignore inter-class information, which deteriorates the discriminative capability of representation models. Clustering based methods are prone to errors when exploring the complete class boundary information because each cluster inevitably contains class inconsistent samples. In this work, we propose a novel clustering based method which, by iteratively excluding class inconsistent samples during progressive cluster formation, alleviates the impact of noise samples in a simple-yet-effective manner. Our approach, referred to as Progressive Cluster Purification (PCP), implements progressive clustering by gradually reducing the number of clusters during training, while the sizes of clusters continuously expand consistently with the growth of model representation capability. With a well-designed cluster purification mechanism, it further purifies clusters by filtering noise samples, which facilitates the subsequent feature learning by utilizing the refined clusters as pseudo-labels. Experiments demonstrate that the proposed PCP improves the baseline method by significant margins. Our code will be available at https://github.com/zhangyifei0115/PCP.
Representation learning has achieved unprecedented success in various computer vision tasks. This can be attributed to the availability of large-scale datasets with precise label annotations and convolutional neural networks (CNNs) capable of absorbing the annotation information [27, 32]. Nevertheless, precisely annotating samples on large-scale datasets is laborious, expensive, or even impractical.
As an alternative, unsupervised feature learning (UFL), which is usually based on data completeness, sample distribution or instance similarity, has attracted increasing attention. Completeness-based methods typically leverage the intrinsic structure of the input data by forcing models to predict the hidden part in well-designed pretext tasks, such as context prediction [18], jigsaw puzzles, counting, split-brain and rotations. Distribution-based methods use Auto-Encoders, Variational Auto-Encoders and Generative Adversarial Nets to approximate the distribution of raw data. Both kinds of methods focus on intra-class distributions but ignore the inter-class information of samples, so the discriminative capacity of the extracted features is not fully explored.
As one of the most representative similarity-based UFL methods, DeepCluster (DC) iteratively assigns samples to clusters according to their similarity in the feature space and then uses the assignments as pseudo category labels to train a feature extractor. However, DC is hampered by a significant number of class inconsistent instances (noise samples) within clusters, which can seriously deteriorate the representation learning performance. To avoid the impact of clustering noise, instance recognition (IR) treats each sample as an independent class and achieves impressive performance on commonly used benchmarks. Nevertheless, it completely misses the intrinsic inter-sample class information, which limits the discriminative capability of the learned features. To improve discriminative capability, the anchor neighbourhood discovery (AND) method adopts a divide-and-conquer strategy to find local neighbours under the constraint of cluster consistency. However, the sample noise problem is still not solved systematically, which makes the learned features less representative.
In this study, we propose a simple-yet-effective UFL method, termed Progressive Cluster Purification (PCP), to alleviate the impact of sample noise and estimate reliable sample labels. PCP is rooted in simple k-means clustering while introducing a Progressive Clustering (PC) strategy and a Cluster Purification (CP) module to refine the clustering assignments. With iterative training, reliable class consistent samples converge together so that clear and purified pseudo labels are well estimated while discriminative features are learned in a progressive self-guided manner. As shown in Fig. 1, IR, DC and our PCP take three different strategies to learn discriminative features in an unsupervised manner, where PCP shows its superiority in a more stable and efficient training procedure and a richer feature representation capability.
Progressive Clustering is implemented by reducing the number of clusters from the number of samples towards the number of categories during the multiple iterations of clustering so that the size of each cluster progressively increases. In this way, intra-cluster variance increases consistently with the growth of model representation capability so that the class inconsistent samples could be controlled at a reasonable level, which drives the model to improve the discriminability of extracted features.
Progressive Clustering reduces noise at its source. Meanwhile, Cluster Purification is applied to the clustering results in each epoch, with the aim of explicitly obtaining stable and reliable clusters. CP consists of two noise processing procedures: unreliable sample filtering and unstable sample filtering. With unreliable sample filtering, instances far away from cluster centroids are identified as noise samples, each of which is temporarily regarded as a distinct class. Thereafter, to improve the robustness of CP, we further introduce unstable sample filtering, a voting strategy that reassigns class consistent instances based on the preceding filtering results. With PCP, the obtained pseudo labels contain purified and clear class discriminative information that facilitates the successive feature learning procedure.
The contributions of this work are summarized as follows:
We propose the simple-yet-effective Progressive Cluster Purification (PCP) method, which implements progressive clustering (PC) by iteratively expanding cluster sizes and progressively enforcing feature discrimination power.
A novel cluster purification (CP) mechanism is designed to exclude class inconsistent instances, which further aggregates the discrimination power of features and speeds up the convergence of unsupervised learning.
With a warm-up training strategy, PCP improves the baseline method by significant margins on commonly used benchmarks.
From a general perspective, unsupervised feature learning methods can be classified into three categories: completeness-based methods, distribution-based methods and similarity-based methods.
Completeness-based methods. Methods of this kind usually follow self-supervision, designing pretext tasks that hide some information of the examples to construct a prediction task. By training models to predict the hidden information, rich feature representations can be learned. Pretext tasks include context prediction, colorization, jigsaw puzzles, counting, split-brain and rotations. More recently, FeatureDecoupling proposes combining rotation and its unrelated part, and VCP proposes to fill the blank with video options. Although completeness-based methods can obtain proper feature representations for some specific tasks, they have a low performance bound, as the learning procedure focuses on optimizing pretext tasks. In addition, it is unclear whether these pretext tasks are suitable for downstream tasks in new domains.
Distribution-based methods. The main idea of distribution-based methods is to approximate the distribution of raw data with encoders [12] or auto-encoders [12]; deep Boltzmann machines, variational auto-encoders and generative adversarial networks have also been explored for feature learning. Despite their effectiveness, these methods ignore the inter-class information of samples, so the discriminative power of features is not fully explored.
Similarity-based methods. In the deep learning era, similarity-based feature learning can be implemented by associating clustering with CNNs [4, 29, 2, 33]. As a representative method, DC iteratively groups features and uses the assignments as pseudo labels to train the feature representation. DC combines self-supervised learning and clustering so that each compensates for the disadvantages of the other. To alleviate the impact of noise samples, Exemplar CNN and IR treat each sample as an independent class. The recent invariant information clustering method learns a neural network classifier from scratch with solely unlabelled data samples, with the objective of maximizing the mutual information [24, 1] between the class assignments of each pair. AND adopts a divide-and-conquer method to find local neighbours under the constraint of cluster consistency.
Similarity-based methods, such as DC, have set strong baselines for UFL. However, the significant performance gap between UFL and supervised feature learning indicates that the task remains challenging when representation models and sample clusters must be estimated simultaneously. Existing methods remain challenged by the significant number of noise samples during clustering, which deteriorates the performance of representation models.
In unsupervised feature learning, clustering based methods are susceptible to noisy supervision caused by inevitable class inconsistent samples. However, their superiority in reasoning about class boundaries, the so-called class conceptualization, should not be neglected. Inspired by this, we propose the Progressive Cluster Purification approach, which aims to alleviate the negative propagation of noisy supervision during feature learning. As shown in Fig. 2, PCP consists of three components, namely progressive clustering, cluster purification and feature learning, which cooperate with each other in a circular way. PC assigns samples to distinct clusters in the feature space constructed during feature learning, and the assignments are purified by CP to generate the subsequent supervision for feature learning.
Given an image data set without any class labels, pseudo-labels produced by PCP are used as supervision to train the parameters of a CNN. At each epoch, the learned model maps an input image to a point in a feature space. The main components of the proposed PCP are detailed in the following paragraphs.
IR evidences that neural networks, to some extent, can learn class discriminative feature representation with solely instance-level supervision, which regards individual instances as distinct classes. Inspired by this, we go one step further and infer that deep models can extract the underlying class information under supervision of different granularity, from instance-wise to class-wise. More importantly, in the early phase of feature learning, the network has not yet converged, so clustering with few centroids would gather a significant number of class inconsistent samples within each cluster. With contaminated pseudo labels, inferior class conceptualization will further hinder the growth of model representation capability.
To alleviate this situation, we propose progressive clustering, which improves unsupervised class conceptualization by shrinking the number of clusters from the number of samples towards the number of true categories (Fig. 2). Consistently with the growth of the model's representation capability, the clusters gradually expand while fewer noise samples are introduced. As the total number of samples is very large, the number of clusters at each epoch is decided by a linear decline of its logarithm over the training epochs (Eq. (1)), where the total number of training epochs is 200 by default.
As shown in Eq. (1), the cluster number declines fast in early epochs and more slowly in later epochs. It begins with the number of samples and, as the true category number is usually unknown, it is held fixed once it reaches a final value N, which is a hyperparameter. Table III shows that PCP is not as sensitive to prior knowledge of the cluster number as DC is.
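The log-linear decline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cluster_schedule` and the argument `decay_epochs` (the epoch at which the final cluster number N is reached) are hypothetical, and the exact form of Eq. (1) may differ.

```python
import math


def cluster_schedule(epoch, num_samples, final_clusters, decay_epochs):
    """Number of clusters at a given epoch (illustrative sketch).

    The logarithm of the cluster number is interpolated linearly from
    log(num_samples) at epoch 0 to log(final_clusters) at decay_epochs
    (decay_epochs is an assumed parameter), then held fixed.
    """
    if decay_epochs <= 0:
        raise ValueError("decay_epochs must be positive")
    if epoch >= decay_epochs:
        return final_clusters
    log_start = math.log(num_samples)
    log_end = math.log(final_clusters)
    fraction = epoch / decay_epochs
    clusters = math.exp(log_start + fraction * (log_end - log_start))
    # Never go below the final cluster number because of rounding.
    return max(final_clusters, round(clusters))


if __name__ == "__main__":
    for e in (0, 1, 50, 99, 100, 150):
        print(e, cluster_schedule(e, 50000, 1000, 100))
```

Because the interpolation is linear in the logarithm, the cluster number shrinks by a constant factor per epoch, so the absolute drop is large in early epochs and small in later ones, which matches the behaviour stated in the text.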
With PC, the feature space becomes compact and more class-consistent, which facilitates class conceptualization. To make the clusters more reliable and stable, so that feature learning is supervised by clear pseudo-labels with less noise, we design a Cluster Purification (CP) mechanism which consists of unreliable sample filtering and unstable sample filtering.
Unreliable Sample Filtering. Samples near cluster centroids share higher apparent similarity and are therefore more likely to belong to the same class. Based on this observation, we discard the samples far away from the centroids and temporarily regard each of them as a distinct class in the subsequent learning procedure. As shown in the lower right of Fig. 2, the pink area within the dashed circle represents the reliable district; hollow points indicate samples to be excluded, while solid points remain. We define an unreliable sample filtering ratio for each cluster: the higher the ratio, the fewer samples remain in a cluster. When the ratio is close to 1, PCP degenerates to IR. In other words, the filtering ratio builds a bridge between sample specificity based methods and clustering based methods, so that PCP benefits from clear pseudo supervision while exploiting more inter-class information. For each cluster in the feature space, we denote the samples that remain as the set of reliable class consistent samples and the discarded samples as the corresponding set of noise samples. Every sample of the cluster belongs to exactly one of the two sets. Thereafter, these preliminary purification results are fed to the successive purification procedure.
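A small sketch may help make the filtering ratio concrete. It assumes Euclidean distance to the centroid and treats the ratio as the fraction of each cluster that is discarded; the function name `filter_unreliable` and these choices are illustrative assumptions rather than the paper's exact procedure.

```python
import math


def filter_unreliable(features, assignments, centroids, ratio):
    """Return one boolean per sample: True if the sample stays reliable.

    In each cluster, the samples farthest from the centroid are dropped.
    `ratio` is the assumed fraction of the cluster that is discarded.
    """
    reliable = [False] * len(features)
    for k, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == k]
        # Sort members from nearest to farthest (Euclidean distance is an assumption).
        members.sort(key=lambda i: math.dist(features[i], centroid))
        n_keep = math.ceil((1.0 - ratio) * len(members))
        for i in members[:n_keep]:
            reliable[i] = True
    return reliable


if __name__ == "__main__":
    feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 9.0]]
    labels = [0, 0, 0, 1, 1, 1]
    cents = [[0.0, 0.0], [5.0, 5.0]]
    print(filter_unreliable(feats, labels, cents, 0.5))
```

With a ratio of 1 no sample stays in any cluster, so every sample becomes its own class, which is the IR limit mentioned in the text; with a ratio of 0 the clusters are used unchanged.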
Unstable Sample Filtering. Easily distinguished samples are likely to be consistently assigned to the same cluster at different iterations of clustering. In contrast, perplexing samples receive inconsistent and unstable clustering assignments. Inspired by this, we propose a voting function which uses the previous clustering results to quantitatively estimate the class consistency of two samples in the same cluster. For each cluster, we take the sample closest to the centroid as its reference sample. The voting score of a sample in the cluster is then calculated from the progressive clustering results of the past several epochs. In the voting function, a decay rate weights the contribution of each past epoch, and each epoch contributes a positive or negative vote depending on the pseudo labels assigned in that epoch.
As shown in the lower right of Fig. 2, we set different thresholds for the reliable set and the noise set, in order to discard samples in the reliable set with low voting scores (hollow points) and pull back samples in the noise set with high voting scores (solid points). The two sets are refined accordingly, and the refined noise set and reliable set are regarded as mixed instance-level and cluster-level pseudo labels for feature learning.
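The voting and thresholding steps can be illustrated with the following sketch. Everything specific in it is an assumption made for illustration: a vote of +1 when a sample shared the pseudo label of its cluster's reference sample in a past epoch and -1 otherwise, a geometric decay that gives the most recent epoch the largest weight, and the names `voting_score`, `refine_sets`, `drop_below` and `pull_above`. The paper's exact voting function may differ.

```python
def voting_score(sample_history, reference_history, decay):
    """Decayed agreement between a sample and the reference sample.

    Both histories list pseudo labels from oldest to newest epoch.
    Assumed vote: +1 if the labels agree in an epoch, -1 otherwise.
    """
    score = 0.0
    weight = 1.0
    pairs = zip(reversed(sample_history), reversed(reference_history))
    for sample_label, reference_label in pairs:
        score += weight * (1.0 if sample_label == reference_label else -1.0)
        weight *= decay  # older epochs count less
    return score


def refine_sets(reliable, noise, scores, drop_below, pull_above):
    """Apply two assumed thresholds to the reliable and noise sets."""
    refined_reliable = {i for i in reliable if scores[i] >= drop_below}
    refined_reliable |= {i for i in noise if scores[i] > pull_above}
    refined_noise = (set(reliable) | set(noise)) - refined_reliable
    return refined_reliable, refined_noise


if __name__ == "__main__":
    print(voting_score([3, 3, 3], [3, 3, 3], 0.5))
    print(refine_sets({0, 1}, {2, 3}, {0: 1.75, 1: -0.5, 2: 1.5, 3: -1.0}, 0.0, 1.0))
```

Samples that end up in the refined noise set would keep instance-level pseudo labels, while samples in the refined reliable set share the pseudo label of their cluster.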
IR involves a non-parametric softmax classifier that minimizes the negative log-likelihood objective function on the training set, where each sample is regarded as a distinct class. The probability of an input being recognized as a given training example is computed by a softmax over the similarities between the input feature and the features of all training examples, where a temperature parameter controls the concentration degree of the distribution. Following IR, we develop an instance-wise supervision loss and a cluster-wise supervision loss at each epoch.
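As a rough illustration of these quantities, the sketch below computes a non-parametric softmax over stored features and two negative log-likelihood terms. The softmax follows the usual IR formulation with dot-product similarity; the cluster-wise term, which sums the probability mass of all members of the sample's cluster, is an assumed form chosen for illustration and is not taken from the paper. All function names are hypothetical.

```python
import math


def softmax_probabilities(feature, memory, temperature):
    """Probability of `feature` being recognized as each stored example."""
    logits = [sum(a * b for a, b in zip(feature, m)) / temperature for m in memory]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_loss(probs, index):
    """Negative log-likelihood of the sample's own instance class."""
    return -math.log(probs[index])


def cluster_loss(probs, member_indices):
    """Assumed cluster-wise loss: NLL of the summed mass of cluster members."""
    return -math.log(sum(probs[j] for j in member_indices))


if __name__ == "__main__":
    memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    probs = softmax_probabilities([1.0, 0.0], memory, 0.5)
    print(instance_loss(probs, 0), cluster_loss(probs, [0, 2]))
```

Under this assumed form, a sample in the refined noise set would be trained with `instance_loss`, and a sample in the refined reliable set with `cluster_loss` over the indices of its cluster.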
Warm-up. In early training epochs, the cluster number decreases dramatically, which challenges the stability and convergence of feature learning. In addition, early clustering may lead the network to focus on low-level features. To alleviate this situation, a warm-up training strategy is implemented by adding a branch that shares the same architecture and weights with the original PCP branch to assist its early learning. Following IS, the branch is trained in a sample specific manner, aiming to learn features that are invariant to random data augmentation and spread out across instances. After a certain number of warm-up epochs, the additional IS branch is discarded so that the PCP branch can improve further. The figures in Table III and Table IV clearly show the effect of the warm-up training strategy.
To show the advantages of PCP over existing UFL methods, we compare them from three aspects including reliability, stability, and efficiency.
Reliability. With PC and CP, PCP explores the underlying class information while mitigating the impact of clustering errors. As shown in Fig. 3, PCP uses an outside-in filtering strategy for class conceptualization, which generates reliable class consistent clusters by discarding noise samples. In contrast, AND takes an inside-out expanding strategy, forming class concepts by merging neighbourhoods. However, if the neighbourhood size is large, it is likely to absorb noise samples, which causes drift and confuses feature learning; if the neighbourhood size is small, the efficiency of model learning is reduced (Table IV).
Stability. With a simple-yet-effective clustering correction mechanism, PCP no longer depends on approximating the number of classes in the training set, a dependence from which other clustering based methods suffer considerably. The learning procedure of PCP also converges more stably (Fig. 1 and Fig. 4).
Efficiency. With the warm-up strategy, PCP extracts discriminative features in early training epochs, which not only improves the reliability and stability of feature learning but also results in fast convergence of training. Meanwhile, with the quick decrease of the cluster number, PCP is forced to perform class conceptualization efficiently. In contrast, AND uses the smallest neighbourhood size to avoid class drift, which limits its learning efficiency.
Datasets. The learned feature representation is validated on image classification and object detection. For image classification, we use CIFAR, which contains 50,000/10,000 train/test images of size 32 × 32; CIFAR10/CIFAR100 has 10/100 object classes with 6,000/600 images per class. We also use ImageNet100, which is selected from ImageNet with 100 classes. For fine-grained classification, we use the CUB-200-2011 dataset, which consists of 5,994/5,794 train/test images of 200 bird species. For object detection, we use PASCAL VOC 2007, which contains 20 classes, with 5,011 images in the trainval set and 4,952 images in the test set.
Experimental Setting. Following the setting of [28, 30], the learning rate is initialized and then scaled down at two fixed epochs during training. The momentum, weight decay, batch size and temperature parameter are fixed for network training, and the embedding feature space has 128 dimensions. For fair comparison, we train PCP models with 200 epochs (a round) as default, and use a fixed k for kNN evaluation. To make the voting function reliable, we start cluster purification only after a number of training epochs, and the voting function takes the clustering results of several past epochs with two thresholds. For IS branch warm-up, the weight of the IS loss decreases continuously over a range of early epochs, while the PCP loss has its own weight. For multiple rounds of training, we apply warm-up in the first round only; the remaining rounds run without warm-up and with the same learning rate schedule. One-off AND is implemented with 2 rounds, with the first round as warm-up. Unless stated otherwise, the reported results are obtained on CIFAR10 with ResNet18, and kNN performance is reported as top-1 accuracy.
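The continuously decreasing weight of the warm-up branch can be pictured with a simple schedule. The sketch below assumes a linear decay between two epochs; the function name `warmup_weight`, its parameters and the values in the usage example are hypothetical placeholders, not the settings used in the experiments.

```python
def warmup_weight(epoch, start_epoch, end_epoch, start_weight, end_weight):
    """Weight of the warm-up (IS) loss at a given epoch.

    Assumed shape: constant before start_epoch, linear in between,
    constant after end_epoch.
    """
    if end_epoch <= start_epoch:
        raise ValueError("end_epoch must be greater than start_epoch")
    if epoch <= start_epoch:
        return start_weight
    if epoch >= end_epoch:
        return end_weight
    fraction = (epoch - start_epoch) / (end_epoch - start_epoch)
    return start_weight + fraction * (end_weight - start_weight)


if __name__ == "__main__":
    # Placeholder values chosen only for illustration.
    for e in (0, 5, 10, 20):
        print(e, warmup_weight(e, 0, 10, 1.0, 0.0))
```

Once the weight reaches its final value, the warm-up branch no longer contributes and can be dropped, in line with the description of the warm-up strategy.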
Evaluation Metrics. We use linear classification and weighted kNN to evaluate features from different layers. For linear classification, we implement the classifier as a fully connected (FC) layer, which is optimized by minimizing the cross-entropy loss. For weighted kNN, given a test image, we calculate the cosine similarity between its feature and the feature of each sample in the training set. The set of top-k nearest neighbours is then used to predict the class by weighted voting, where each neighbour votes for the class it belongs to. The predicted label of the test image is the class with the highest voting score among all categories in the test set.
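The weighted kNN evaluation can be sketched as below. The neighbour weight exp(similarity / temperature) follows the common IR-style evaluation and is an assumption here, as are the function names and the toy values; feature vectors are assumed to be non-zero.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def weighted_knn_predict(test_feature, train_features, train_labels, k, temperature):
    """Predict a label by weighted voting among the k most similar samples."""
    scored = [
        (cosine(test_feature, f), label)
        for f, label in zip(train_features, train_labels)
    ]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for similarity, label in top_k:
        # Assumed weighting: closer neighbours get exponentially larger votes.
        votes[label] += math.exp(similarity / temperature)
    return max(votes, key=votes.get)


if __name__ == "__main__":
    train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
    labels = [0, 0, 1, 1, 2]
    print(weighted_knn_predict([1.0, 0.05], train, labels, k=3, temperature=0.1))
```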
For fair comparison with DC, we drop the cluster number (N) down to 100 to evaluate the effect of the PCP components. In Table I, the DC method achieves an accuracy of 73.6% by taking the clustering result as pseudo labels. PC improves the accuracy by 3.3%, unreliable sample filtering gains a further 2%, and unstable sample filtering gains another 2.7%.
Cluster Purification. In Table II, we explore the effect of CP on PCP in detail. As the unreliable sample filtering ratio increases from 0 to 0.99, the classification accuracy of PCP with only unreliable sample filtering consistently increases from 76.9% to 81.7%. This can be mainly attributed to the supervision offered by clustering containing less noise. However, with a higher ratio, few samples remain in each cluster, which limits the ability of the model to discover inter-sample relationships. When unstable sample filtering is added, a lower ratio leads to more noise samples and a higher ratio pushes the model close to IR, so the best performance appears at intermediate ratios. PCP not only keeps the cluster size but also effectively corrects erroneous assignments, which raises the accuracy at low ratios compared with using unreliable sample filtering alone. This further verifies the effect of unstable sample filtering. In the following experiments, we set the ratio to 0.5 as default.
Progressive Clustering. As shown in Table III, the performance of DC drops dramatically from 82.9% to 56.9% as the cluster number (N) decreases from 10,000 to 5, which is below the true class number of 10. In contrast, with the class conceptualization and noise sample correction provided by PC and CP, the performance of PCP is insensitive to N, which demonstrates the superior reliability and stability of PCP. With warm-up, PCP outperforms DC by a significant margin of 4.1% when N is equal to 1k (set as default in the following experiments).
|Method||Round 1||Round 2||Round 3||Round 4||Round 5|
|AND (N = 1)||80.0||83.9||85.0||85.9||86.3|
|AND (N = 5)||80.6||84.2||85.3||85.6||85.9|
Curriculum Learning. To enhance the discriminative power of learned features, AND adopts curriculum learning with five training rounds to select class-consistent neighbourhoods. In Table IV we can see that PCP converges faster and performs better than AND, and that PCP with warm-up converges faster still and performs even better, which shows the superior efficiency of PCP. The visualization in Fig. 4 shows that the features learned by PCP steadily focus on objects, which demonstrates the superior stability of PCP.
Class Conceptualization. To show the effect of class conceptualization, we visualize the feature distribution of the whole CIFAR10 test set by embedding the 128-dimensional feature space into a 2-dimensional space with t-SNE. In Fig. 5 (a-c), both DC and PCP models are trained with N = 10, and we can see that the feature representations learned by DC and AND are less discriminative, as many features of different classes are mixed up. In contrast, the features extracted by PCP are better assigned to different clusters.
To quantitatively show the superiority of PCP, we cluster the learned features and identify the class pseudo-label of each cluster as the category to which the majority of features in the cluster belong. As shown in Fig. 5 (d-f), PCP forms 10 clusters dominated by samples from 10 different classes. For DC, 3 categories are missed because 3 other categories each dominate two clusters. AND misses one class and shows the same proportions for two others. Each category can be easily distinguished in PCP, while DC and AND look slightly confusing. We therefore argue that better class conceptualization is essential to improving the performance of unsupervised clustering methods.
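The majority-vote labelling used in this analysis is straightforward to sketch. The helper names `dominant_classes` and `missed_classes` and the toy labels are illustrative assumptions; ties between classes are not handled specially.

```python
from collections import Counter


def dominant_classes(cluster_ids, true_labels):
    """Map each cluster to the class held by most of its samples."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}


def missed_classes(cluster_ids, true_labels):
    """Classes that do not dominate any cluster."""
    dominant = set(dominant_classes(cluster_ids, true_labels).values())
    return sorted(set(true_labels) - dominant)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1, 2, 2]
    classes = ["cat", "cat", "dog", "cat", "cat", "bird", "bird", "bird"]
    print(dominant_classes(clusters, classes))
    print(missed_classes(clusters, classes))
```

In the toy example one class dominates two clusters, so another class is missed, which is the failure pattern described for DC in the text.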
To further evaluate the proposed PCP approach, we conduct experiments on image classification and fine-grained classification to compare with baseline methods. The backbone is ResNet18 and the classification is evaluated by kNN. The results of IR/IS/AND are directly transcribed from the corresponding references. DC is reimplemented with 1000 clusters. As shown in Table V, the PCP model achieves the best performance with 84.7% accuracy, and after 5 rounds its accuracy of 87.3% outperforms AND by 1%. Considering that the fully-supervised accuracy is only 93.1%, the performance of PCP is remarkable for the unsupervised setting. We further compare our method with MoCo (rerunning the code at https://github.com/facebookresearch/moco) on ImageNet100 under the same settings; the performance of PCP is higher than that of MoCo (Table VII).
For generalization verification, we then compare image classification accuracy on CIFAR10 and CIFAR100 with a standard AlexNet backbone followed by a weighted kNN and a linear classifier, each evaluated on features from a specific layer. The figures in Table VI clearly show that PCP outperforms the other methods, with an especially significant margin on CIFAR100 evaluated by kNN.
Fine-grained Classification. PCP is further evaluated on the more challenging fine-grained classification task on the CUB-200-2011 dataset, with the cluster number N and the batch size set for this dataset. As shown in Table VIII, PCP significantly outperforms IR by 5.3%, DC by 3.8% and AND by 2.5%.
Object detection experiments are conducted to evaluate the effectiveness of the features learned by PCP for down-stream tasks. We implement Faster R-CNN  with a ResNet18 backbone pre-trained on the CIFAR100 dataset and fine-tuned on the PASCAL VOC 2007 dataset.
For the pretrained ResNet18 backbone, the first residual stage and all the batch normalization layers are frozen. Fine-tuning runs for 30 epochs in total; the learning rate is first set to 0.001 and divided by 10 after 25 epochs, and the image batch size is fixed at 8. Compared with the randomly initialized feature model, which is hard to converge, SOTA models significantly improve the mAP performance (Table IX). PCP achieves 39.8% mAP, which outperforms AND by 2.9%. This shows the effectiveness of the features learned by PCP for down-stream tasks.
In this paper, we proposed a simple-yet-effective Progressive Cluster Purification (PCP) method for unsupervised feature learning. To alleviate the impact of noise samples, we designed the Progressive Clustering (PC) strategy, which gradually expands the cluster size consistently with the growth of the model representation capability, and the Cluster Purification (CP) mechanism, which significantly reduces unreliable and unstable noise samples in each cluster. As a result, PCP mitigates the dependency on prior knowledge of the cluster number and is able to reliably, stably and efficiently learn discriminative and representative features. Extensive experiments on object classification and detection benchmarks demonstrated that the proposed PCP approach improves the classical clustering method and provides a fresh insight into the unsupervised learning problem.
A. Coates and A. Ng, "Learning Feature Representations with K-Means", in Neural Networks, 2012.
P. Vincent, H. Larochelle, Y. Bengio and P. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders", in ICML, 2008.