Online Deep Clustering for Unsupervised Representation Learning

06/18/2020 ∙ by Xiaohang Zhan, et al.

Joint clustering and feature learning methods have shown remarkable performance in unsupervised representation learning. However, a training schedule that alternates between feature clustering and network parameter updates leads to unstable learning of visual representations. To overcome this challenge, we propose Online Deep Clustering (ODC), which performs clustering and network update simultaneously rather than alternately. Our key insight is that the cluster centroids should evolve steadily so that the classifier is updated stably. Specifically, we design and maintain two dynamic memory modules, i.e., a samples memory to store samples' labels and features, and a centroids memory for centroids evolution. We break down the abrupt global clustering into steady memory update and batch-wise label re-assignment. The process is integrated into network update iterations. In this way, labels and the network evolve shoulder-to-shoulder rather than alternately. Extensive experiments demonstrate that ODC stabilizes the training process and boosts the performance effectively. Code:




1 Introduction

Figure 1: (a) Online Deep Clustering (ODC) seeks to reduce the discrepancy in training mechanism between Deep Clustering (DC) and supervised classification via integrating clustering process into network update iterations. ODC training is both unsupervised and uninterrupted. (b) Compared to DC, ODC updates labels continuously rather than in a pulsating manner, enabling the representations to evolve steadily. The loss curves (only initial 32 epochs for clarity) show the stability of ODC. After training, the loss is decreased to around 2.0 for ODC while 2.9 for DC.

Unsupervised representation learning [7, 43, 62, 39, 9, 27, 23, 15, 59] aims at learning transferable image or video representations without manual annotations. Among them, clustering-based representation learning methods [21, 55, 56, 2, 3] emerge as a promising direction in this area. Different from recovering-based approaches [43, 62, 39, 15], clustering-based methods require little domain knowledge [2] while achieving encouraging performances. Compared to contrastive representation learning [19, 18, 4] that captures merely intra-image invariance, clustering-based methods are able to explore inter-image similarity. Unlike conventional clustering that is typically performed on fixed features [58, 57], these works jointly optimize clustering and feature learning.

While evaluations of early works [55, 56] are mostly performed on small datasets, Deep Clustering [2] (DC) proposed by Caron et al. is the first attempt to scale up clustering-based representation learning. DC alternates between deep feature clustering and CNN parameters update. In particular, at the start of each epoch, it performs off-line clustering algorithms on the entire dataset to obtain pseudo-labels as the supervision for the next epoch. Off-line clustering inevitably permutes the assigned labels in different epochs, i.e., even if some of the clusters do not change, their indices after clustering will be permuted randomly. As a result, parameters in the classifier cannot be inherited from the last epoch and they have to be randomly initialized before each epoch. The mechanism introduces training instability and exposes representations to a high risk of representation corruption. As shown in Figure 1 (a), network update in DC is interrupted by feature extraction and clustering in each epoch. This is in contrast to the conventional supervised classification that is performed in an uninterrupted manner using fixed labels, where an iteration consists of forward and backward propagations of the network.

In this work, we seek to devise a joint clustering and feature learning paradigm with high stability. To reduce the discrepancy of training mechanism between DC and supervised learning, we decompose the clustering process into mini-batch-wise label update, and integrate this update process into iterations of network update. Based on this intuition, we propose Online Deep Clustering (ODC) for joint clustering and feature learning. Specifically, an ODC iteration consists of forward and backward propagations, label re-assignment, and centroids update. For label update, ODC reuses the features in the forward propagation, thus avoiding additional feature extraction. To facilitate online label re-assignment and centroids update, we design and maintain two dynamic memory modules, i.e., samples memory to store samples’ labels and features, and centroids memory for centroids evolution. In this way, ODC is trained in an uninterrupted manner similar to supervised classification, while no manual annotation is required. During the training process, labels and network parameters evolve shoulder-to-shoulder, rather than alternately. Since labels are updated in each iteration continuously and instantly, the classifier in the CNN also evolves more steadily, resulting in a much steadier loss curve as shown in Figure 1 (b).

While ODC alone achieves compelling unsupervised representation learning performance on various benchmarks, it can also be used to fine-tune models that have been trained using other unsupervised learning approaches. Extensive experiments show that the steadiness of ODC helps it outperform DC as an unsupervised fine-tuning tool. We summarize our contributions as follows:

1) we propose ODC that learns image representations in an unsupervised manner with high stability. 2) ODC also serves as a unified unsupervised fine-tuning scheme that further improves previous self-supervised representation learning approaches. 3) Promising performances are observed on different benchmarks, indicating the great potential of joint clustering and feature learning.

2 Related Work

Unsupervised Representation Learning. Many unsupervised visual representation learning algorithms are based on generative models, which usually use a latent representation bottleneck to reconstruct input images. Existing generation-based models include Auto-Encoders [47, 28], Restricted Boltzmann Machines [20, 30, 44], Variational Auto-Encoders [24] and Generative Adversarial Networks [16], some of which have shown powerful ability in generating images or videos [6, 29, 60, 1, 48, 46]. By learning to generate examples, these models can learn meaningful latent representations that can be used for downstream tasks [9, 12, 10].

Another popular form of unsupervised representation learning is self-supervised learning, where a pretext task is designed to derive proxy labels from raw data. Representations are learned by encouraging a CNN to predict the proxy labels from the data. Various pretext tasks have been explored, e.g., predicting relative patch locations within an image [7], solving jigsaw puzzles [39], colorizing grayscale images [62, 26], inpainting of missing pixels [43], cross-channel prediction [63], counting visual primitives [40], predicting image rotations [15], and multiview contrastive learning [45]. For videos, self-derived supervision signals come from temporal continuity [37, 22, 51, 34, 31, 36, 53] or motion consistency [42, 35, 50, 49, 59].

Figure 2: Each ODC iteration mainly contains four steps: 1. forward to obtain a compact feature vector; 2. read labels from the samples memory and perform back-propagation to update the CNN; 3. update samples memory by updating features and assigning new labels; 4. update centroids memory by recomputing the involved centroids.

Joint Clustering and Feature Learning. Clustering-based unsupervised representation learning is of particular interest recently. Various methods are proposed to jointly optimize feature learning and image clustering. Notably, these methods have shown great potential in learning unsupervised features on small datasets [55, 56, 11, 32]. To scale up to large datasets like ImageNet [5], Caron et al. [2] propose DeepCluster to cluster features and update CNN with subsequent assigned pseudo-labels for each epoch. In a subsequent study, Caron et al. [3] propose DeeperCluster to leverage self-supervision and clustering, and validate the representation learning ability of their approaches on non-curated data. Although deep clustering methods are capable of learning good representations from large-scale unlabeled data, the alternating update of feature clustering and CNN parameters update leads to instability in training.

Improvements to Self-supervised Learning. Some works aim at improving previous self-supervised learning approaches from different perspectives. For instance, Larsson et al. [27] give a first in-depth analysis on colorization as a pretext task and provide some insights on improving its effectiveness. Mundhenk et al. [38] explore a set of methods to avoid some trivial shortcuts like chromatic aberration on context-based self-supervised learning. Noroozi et al. [41] improve the performance of self-supervised models using a clustering-based knowledge transfer method that allows a deeper network during pre-training. Wang et al. [52] and Doersch et al. [8] exploit multiple cues contained in different pretext tasks to improve self-supervised models. Recently, some works [25, 17] have extensively studied the architectures and scaling ability of existing self-supervised approaches. Complementary to these works, ODC serves as a flexible and unified unsupervised fine-tuning scheme to boost general self-supervised learning methods, although it can also be used alone to perform unsupervised representation learning from scratch.

3 Methodology

In the following sub-sections, we first discuss the differences between the proposed ODC and the conventional DC [2] in Sec. 3.1. We then recommend some useful strategies to maintain stable cluster sizes while using ODC in Sec. 3.2. We finally explain how one can use ODC for unsupervised fine-tuning (Sec. 3.3) and give the implementation details of ODC (Sec. 3.4).

3.1 Online Deep Clustering

We first discuss the basic idea of DC [2] and then detail the proposed ODC. To learn representations, DC alternates between off-line feature clustering and network back-propagation with pseudo-labels. The off-line clustering process requires deep feature extraction on the entire training set, followed by a global clustering algorithm, e.g., K-Means clustering. The global clustering permutes the pseudo labels vastly, requiring the network to adapt to new labels rapidly in the subsequent epoch.
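The effect of index permutation can be illustrated with a toy example. The sketch below uses hypothetical pseudo-labels (not data from the paper) for two consecutive rounds of off-line clustering that produce the same grouping under different cluster indices. The helper names `same_partition` and `index_agreement` are illustrative only.

```python
def index_agreement(labels_a, labels_b):
    """Fraction of samples whose cluster index is identical in both runs."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


def same_partition(labels_a, labels_b):
    """True if both labelings group the samples identically, up to renaming."""
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each index in one run must map to exactly one index in the other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Hypothetical pseudo-labels from two consecutive epochs of off-line clustering.
epoch_1 = [0, 0, 1, 1, 2, 2]
epoch_2 = [2, 2, 0, 0, 1, 1]  # identical groups, permuted indices

print(same_partition(epoch_1, epoch_2))   # True
print(index_agreement(epoch_1, epoch_2))  # 0.0
```

Although the grouping is unchanged, no sample keeps its index, so a classifier row trained for index 0 in one epoch would be scored against a different group in the next.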

Framework Overview. Different from DC, ODC does not require an extra feature extraction process. Besides, labels evolve smoothly alongside the network parameter updates. This is made possible by the newly introduced samples and centroids memories. As shown in Fig. 2, the samples memory stores features and pseudo-labels of the entire dataset, while the centroids memory stores the features of class centroids, i.e., the mean feature of all samples in a class. A “class” here represents a temporary cluster that evolves continuously during training. Labels and network parameters are updated simultaneously during uninterrupted iterations of ODC. Specific techniques, including loss re-weighting and dealing with small clusters, are introduced to prevent ODC from getting stuck in trivial solutions.

An ODC Iteration. Assume that we are given a randomly initialized network $f_\theta(\cdot)$ along with a linear classifier $g_w(\cdot)$; the goal is to train the backbone parameters $\theta$ to produce highly discriminative representations. To prepare for ODC, the samples and centroids memories are initialized via a global clustering process, e.g., K-Means. Next, one can perform uninterrupted ODC iteratively.

An ODC iteration contains four steps. First, given a batch of input images $x$, the network maps the images into compact feature vectors $F = f_\theta(x)$. Second, we read pseudo-labels for this batch from the samples memory. With the pseudo-labels, we update the network with stochastic gradient descent to solve the following problem:

$$\min_{\theta, w} \frac{1}{B} \sum_{n=1}^{B} \ell\left(g_w\left(f_\theta(x_n)\right), y_n\right),$$

where $\ell$ is the classification loss, $y_n$ is the current pseudo label from the samples memory, and $B$ denotes the size of each mini-batch. Third, $F$ after L2 normalization is reused to update the samples memory:

$$F_m(x) \leftarrow m \frac{f_\theta(x)}{\lVert f_\theta(x) \rVert_2} + (1 - m) F_m(x),$$

where $F_m(x)$ is the feature of $x$ in the samples memory, and $m \in (0, 1]$ is a momentum coefficient. Simultaneously, each involved sample is assigned a new label by finding the nearest centroid following:

$$\min_{y \in \{1, \ldots, C\}} \lVert F_m(x) - C_y \rVert_2^2,$$

where $C_y$ denotes the centroid feature of class $y$. Finally, the involved centroids, including those in which new members join, and those from which old members leave, are recorded. They are updated every $k$ iterations by averaging the features of all samples belonging to their corresponding centroid.
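The third and fourth steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the momentum coefficient weights the newly computed (L2-normalized) feature, uses lists instead of tensors, and all function and variable names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def update_samples_memory(memory_feats, memory_labels, centroids,
                          batch_ids, batch_feats, momentum):
    """Step 3: momentum-update stored features, then re-assign labels.

    Returns the set of clusters that gained or lost a member.
    """
    touched = set()
    for idx, feat in zip(batch_ids, batch_feats):
        new = l2_normalize(feat)
        old = memory_feats[idx]
        # Assumed convention: `momentum` weights the new feature.
        memory_feats[idx] = [momentum * n + (1.0 - momentum) * o
                             for n, o in zip(new, old)]
        new_label = min(range(len(centroids)),
                        key=lambda c: sq_dist(memory_feats[idx], centroids[c]))
        if new_label != memory_labels[idx]:
            touched.update((memory_labels[idx], new_label))
            memory_labels[idx] = new_label
    return touched


def update_centroids(memory_feats, memory_labels, centroids, touched):
    """Step 4: recompute each touched centroid as the mean of its members."""
    for c in touched:
        members = [f for f, y in zip(memory_feats, memory_labels) if y == c]
        if members:  # an emptied cluster keeps its previous centroid
            centroids[c] = [sum(col) / len(members) for col in zip(*members)]


# Toy usage with two samples, two clusters, and a hypothetical momentum of 1.0.
feats = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
cents = [[1.0, 0.0], [0.0, 1.0]]
touched = update_samples_memory(feats, labels, cents, [0], [[0.0, 2.0]], momentum=1.0)
update_centroids(feats, labels, cents, touched)
```

In a training loop, the sets returned by `update_samples_memory` would be accumulated across iterations, and `update_centroids` called only at the chosen update interval.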

3.2 Handling Clustering Distribution in ODC

Loss Re-weighting. To prevent the training from collapsing into a few huge clusters, DC adopts uniform sampling before each epoch. However, for ODC, the number of samples over the clusters changes in each iteration. Using uniform sampling requires one to re-sample the entire dataset in each iteration, a process that is deemed redundant and costly. We propose an alternative approach, i.e., re-weighting the loss according to the number of samples in each class. To verify their equivalence, we implement a DC model with loss re-weighting and empirically find that the performance remains unchanged when the weight follows $w_c \propto 1/\sqrt{N_c}$, where $N_c$ denotes the number of samples in class $c$. Hence, we adopt the same loss re-weighting formulation for ODC. With loss re-weighting, samples in smaller clusters contribute more towards backpropagation, thus pushing the decision boundary farther to accept more potential samples.
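A minimal sketch of such per-class weights follows. It assumes weights inversely proportional to the square root of the cluster size, with the exponent exposed as the hypothetical parameter `power`. The normalization, which makes the weights of non-empty classes average to one, is an additional assumption made only to keep the loss scale comparable.

```python
from collections import Counter


def loss_weights(labels, num_classes, power=0.5):
    """Per-class loss weights proportional to 1 / N_c**power.

    Empty classes get weight 0; the remaining weights are rescaled so that
    they average to 1 (an assumption of this sketch).
    """
    counts = Counter(labels)
    raw = [counts[c] ** -power if counts[c] else 0.0
           for c in range(num_classes)]
    non_empty = sum(1 for c in range(num_classes) if counts[c])
    scale = non_empty / sum(raw) if non_empty else 0.0
    return [w * scale for w in raw]


# Toy usage: class 0 has four samples, class 1 has one, class 2 is empty.
weights = loss_weights([0, 0, 0, 0, 1], num_classes=3)
print(weights)
```

With this choice the single-sample class receives twice the weight of the four-sample class, so its samples contribute more to the gradient.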

Dealing with Small Clusters. Loss re-weighting helps to prevent the formation of huge clusters. Nevertheless, we still face the risk of having some small clusters collapsing into empty clusters. To overcome this problem, we propose to process and eliminate extremely small clusters in advance, before they collapse. Denote the set of normal clusters, whose sizes are larger than a threshold, as $\mathcal{C}_n$, and the set of small clusters, whose sizes are not, as $\mathcal{C}_s$. For each $c \in \mathcal{C}_s$, we first assign the samples in $c$ to the nearest centroids in $\mathcal{C}_n$ to make $c$ empty. Next, we split the largest cluster $c_{max} \in \mathcal{C}_n$ into two sub-clusters by K-Means and randomly choose one of the sub-clusters as the new $c$. We repeat the process until all clusters belong to $\mathcal{C}_n$. Though this process alters some clusters abruptly, it only affects the small portion of samples involved in this process.
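The procedure above can be sketched as follows. This is an illustrative version under stated assumptions: a cluster counts as small when it has fewer than `min_size` members, the split uses a tiny hand-written 2-means, and a `max_rounds` guard (not part of the description) stops the loop if the data cannot satisfy the threshold. All names are hypothetical.

```python
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_means(points, iters=10, rng=random):
    """Tiny 2-means; returns a 0/1 assignment per point."""
    centers = rng.sample(points, 2)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min((0, 1), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = mean(members)
    return assign


def fix_small_clusters(feats, labels, centroids, min_size,
                       rng=random, max_rounds=100):
    """Empty each small cluster, then refill it with part of the largest one."""
    for _ in range(max_rounds):
        sizes = [labels.count(c) for c in range(len(centroids))]
        small = [c for c, n in enumerate(sizes) if n < min_size]
        normal = [c for c, n in enumerate(sizes) if n >= min_size]
        if not small or not normal:
            return
        c = small[0]
        # 1) Move the members of c to the nearest normal centroid.
        for i, y in enumerate(labels):
            if y == c:
                labels[i] = min(normal,
                                key=lambda k: sq_dist(feats[i], centroids[k]))
        # 2) Split the largest normal cluster; one random half becomes c.
        sizes = [labels.count(k) for k in range(len(centroids))]
        largest = max(normal, key=lambda k: sizes[k])
        ids = [i for i, y in enumerate(labels) if y == largest]
        assign = two_means([feats[i] for i in ids], rng=rng)
        chosen = rng.choice((0, 1))
        for i, a in zip(ids, assign):
            if a == chosen:
                labels[i] = c
        # 3) Refresh the two centroids that changed.
        for k in (c, largest):
            members = [feats[i] for i, y in enumerate(labels) if y == k]
            if members:
                centroids[k] = mean(members)


# Toy usage: cluster 1 holds a single sample and is below min_size=2.
feats = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 0.0], [10.0, 1.0], [11.0, 0.0], [0.1, 0.1]]
labels = [0, 0, 0, 0, 0, 0, 1]
centroids = [[5.0, 0.3], [0.1, 0.1]]
fix_small_clusters(feats, labels, centroids, min_size=2, rng=random.Random(0))
```

Only the samples of the small cluster and of the split cluster change labels, which matches the observation that the abrupt change is confined to a small portion of the data.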

Dimensionality Reduction. Some of the backbone networks map an image to a high-dimensional vector, e.g., AlexNet produces 4,096-dimensional features and ResNet-50 yields 2,048-dimensional features, leading to high space and time complexities in subsequent clustering. DC performed PCA on features across the entire dataset to reduce dimension. However, for ODC, the features of different samples have varying timestamps, leading to incompatible statistics among samples. Hence, PCA is not applicable anymore. It is also costly to perform PCA in each iteration. We therefore add a non-linear head layer of {fc-bn-relu-dropout-fc-relu} to reduce high dimensional features into 256 dimensions. It is jointly tuned during ODC iterations. The head layer is removed for downstream tasks.

3.3 ODC for Unsupervised Fine-tuning

Compared with self-supervised learning approaches that tend to capture intra-image semantics, clustering-based methods focus more on inter-image information. Hence, DC and ODC are naturally complementary to previous self-supervised learning approaches. As DC and ODC are not restricted to a specifically designed objective, like rotation angle or color prediction, they readily serve as an unsupervised fine-tuning scheme to boost the performance of existing self-supervised approaches. In this paper, we study the effectiveness of DC and ODC as a fine-tuning process with initialization from different self-supervised learning methods.

3.4 Implementation Details

Data Pre-processing. We use ImageNet that contains 1.28M images without labels for training. Images are first randomly cropped to a resolution of 224x224, with augmentation including random flipping and rotation. DC adopts a Sobel filter on the images to avoid exploiting color as a shortcut. Such a pre-processing step requires the downstream tasks to include the Sobel layer as well, which potentially limits its application. We find that strong color jittering shows the same effect as a Sobel filter in avoiding shortcuts, while it allows normal RGB images as inputs. Specifically, we adopt a PyTorch-style color jitter transform that perturbs brightness, contrast, saturation, and hue. Besides, we randomly convert images to grayscale with a fixed probability. The random color jittering and grayscale conversion applied to training samples randomize the similarity measured in color. This discourages the network from exploiting trivial information from color.

Training of ODC. We use ResNet-50 as our backbone. Considering that most early works use AlexNet, we also perform experiments on AlexNet for comparison. Following [2], we use the AlexNet architecture without Local Response Normalization and add batch normalization layers. The ODC models for AlexNet and ResNet-50 are trained from scratch. The batch size is 512, allocated to 8 GPUs. The learning rate is held constant for 400 epochs, with separate values for AlexNet and ResNet-50, and is then decayed for a further 40 epochs. Following DC, the number of clusters is set as 10,000, which is 10 times larger than the annotated number of classes of ImageNet. The momentum coefficient is kept fixed throughout training. The threshold to identify small clusters is set as 20. Varying this threshold does not affect the results significantly, provided that it does not exceed the average number of samples in a cluster. The centroids memory is updated every 10 iterations. The centroids update frequency constitutes a trade-off between learning efficacy and efficiency. In our experiments, we observe that as long as the frequency is restricted to a reasonable range, the performance of ODC is not sensitive to it.

4 Experiments

4.1 Evaluation on Unsupervised Representation

| Method (AlexNet) | ImageNet conv1 | ImageNet conv2 | ImageNet conv3 | ImageNet conv4 | ImageNet conv5 | Places conv1 | Places conv2 | Places conv3 | Places conv4 | Places conv5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Places labels [63] | - | - | - | - | - | 22.1 | 35.1 | 40.2 | 43.3 | 44.6 |
| ImageNet labels [63] | 19.3 | 36.3 | 44.2 | 48.3 | 50.5 | 22.7 | 34.8 | 38.4 | 39.4 | 38.7 |
| Random [63] | 11.6 | 17.1 | 16.9 | 16.3 | 14.1 | 15.7 | 20.3 | 19.8 | 19.1 | 17.5 |
| Context [7] | 16.2 | 23.3 | 30.2 | 31.7 | 29.6 | 19.7 | 26.7 | 31.9 | 32.7 | 30.9 |
| ContextEncoder [43] | 14.1 | 20.7 | 21.0 | 19.8 | 15.5 | 18.2 | 23.2 | 23.4 | 21.9 | 18.4 |
| Jigsaw [39] | 19.2 | 30.1 | 34.7 | 33.9 | 28.3 | 23.0 | 32.1 | 35.5 | 34.8 | 31.3 |
| Colorization [62] | 13.1 | 24.8 | 31.0 | 32.6 | 31.8 | 22.0 | 28.7 | 31.8 | 31.3 | 29.7 |
| SplitBrain [63] | 17.7 | 29.3 | 35.4 | 35.2 | 32.8 | 21.3 | 30.7 | 34.0 | 34.1 | 32.5 |
| Counting [40] | 18.0 | 30.6 | 34.3 | 32.5 | 25.7 | 23.3 | 33.9 | 36.3 | 34.7 | 29.6 |
| NPID [54] | 16.8 | 26.5 | 31.8 | 34.1 | 35.6 | 18.8 | 24.3 | 31.9 | 34.5 | 33.6 |
| Rotation [15] | 18.8 | 31.7 | 38.7 | 38.2 | 36.5 | 21.5 | 31.0 | 35.1 | 34.6 | 33.7 |
| DeepCluster [2] | 12.9 | 29.2 | 38.2 | 39.8 | 36.1 | 18.6 | 30.8 | 37.0 | 37.5 | 33.1 |
| AET [61] | 19.2 | 32.8 | 40.6 | 39.7 | 37.7 | 22.1 | 32.9 | 37.1 | 36.2 | 34.7 |
| Rot-Decouple [14] | 19.3 | 33.3 | 40.8 | 41.8 | 44.3 | 22.9 | 32.4 | 36.6 | 37.3 | 38.6 |
| LA [65] | 14.9 | 30.1 | 35.7 | 39.4 | 40.2 | 17.1 | 32.2 | 36.5 | 38.3 | 37.8 |
| CMC [45] | 18.3 | 33.7 | 38.3 | 40.5 | 42.8 | - | - | - | - | - |
| ODC (Ours) | 19.6 | 32.8 | 40.4 | 41.4 | 37.3 | 24.0 | 33.2 | 38.3 | 38.4 | 35.5 |

Table 1: AlexNet linear classification on ImageNet and Places. We report top-1 center-crop accuracy. Numbers for other methods are obtained either from [63] or from their original papers. The highest performance in each layer is in bold, and the second highest performance in each layer is underlined. SplitBrain and CMC have half the number of parameters.

| Method (ResNet-50) | ImageNet conv1 | ImageNet conv2 | ImageNet conv3 | ImageNet conv4 | ImageNet conv5 | Places conv1 | Places conv2 | Places conv3 | Places conv4 | Places conv5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Places labels [17] | - | - | - | - | - | 16.7 | 32.3 | 43.2 | 54.7 | 62.3 |
| ImageNet labels [17] | 11.6 | 33.3 | 48.7 | 67.9 | 75.5 | 14.8 | 32.6 | 42.1 | 50.8 | 52.5 |
| Random [17] | 9.6 | 13.7 | 12.0 | 8.0 | 5.6 | 12.9 | 16.6 | 15.5 | 11.6 | 9.0 |
| Jigsaw [17] | 12.4 | 28.0 | 39.9 | 45.7 | 34.2 | 15.1 | 28.8 | 36.8 | 41.2 | 34.4 |
| Colorization [17] | 10.2 | 24.1 | 31.4 | 39.6 | 35.2 | 14.7 | 27.4 | 32.7 | 37.5 | 34.8 |
| NPID [54] | 15.3 | 18.8 | 24.9 | 40.6 | 54.0 | 18.1 | 22.3 | 29.7 | 42.1 | 45.5 |
| Rotation [25] | 41.7 (best layer) | | | | | 38.1 (best layer) | | | | |
| BigBiGAN [10] | 55.4 (best layer) | | | | | - | | | | |
| DeepCluster [2] | 14.4 | 29.6 | 39.9 | 52.2 | 50.3 | 19.3 | 31.9 | 39.0 | 46.1 | 43.6 |
| LA [65] | 9.3 | 23.2 | 38.0 | 48.6 | 58.8 | 18.3 | 31.5 | 39.2 | 46.3 | 49.1 |
| CMC [45] | 58.4 (best layer) | | | | | - | | | | |
| ODC (Ours) | 14.8 | 31.6 | 42.5 | 55.7 | 57.6 | 21.4 | 35.0 | 41.3 | 47.4 | 49.3 |

Table 2: ResNet-50 linear classification on ImageNet and Places. We report top-1 center-crop accuracy. Numbers cited to [17] and [25] are produced by third-party studies, and the DeepCluster numbers are produced by our re-implementation. Numbers for other methods are taken from their original papers. The highest performance in each layer is in bold, and the second highest performance in each layer is underlined. CMC has half the number of parameters.

After pre-training the ODC model, we evaluate the quality of unsupervised features on standard downstream tasks including ImageNet classification, Places205 [64] classification, VOC2007 [13] SVM classification, and VOC2007 low-shot classification. We provide the details of each benchmark and present our results as follows.

Re-implementation of Deep Clustering. Since the original paper of DC does not include ResNet-50, we implement a DC model with ResNet-50. The DC model adopts the same data augmentations as ODC, except that DC applies a Sobel filter on images. For fair comparisons, the training hyper-parameters of DC are the same as those of ODC, except for one setting for which we empirically find a different value to be more suitable for DC.

ImageNet Classification. Following the setup in Zhang et al. [63], we keep the backbone, including all convolution and batch normalization layers, frozen, and train a 1000-way linear classifier on features from different depths of convolutional layers. The features are mapped to around 9000 dimensions via average pooling. We train all models for 100 epochs in total, using SGD with a momentum of 0.9 and a batch size of 256. The learning rate is initialized as 0.01 and decayed by a factor of 10 after every 30 epochs. Other hyper-parameters are set following Goyal et al. [17]. We report top-1 center-crop accuracy on the official validation split of ImageNet.
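The step schedule described above (initial rate 0.01, divided by 10 after every 30 epochs, 100 epochs in total) can be written as a small function. This is a sketch; the function and parameter names are illustrative.

```python
def linear_eval_lr(epoch, base_lr=0.01, step=30, factor=10.0):
    """Learning rate at a given 0-indexed epoch under a step decay schedule."""
    return base_lr / factor ** (epoch // step)


# Print the rate at the start of each stage and at the final epoch.
for epoch in (0, 30, 60, 90, 99):
    print(epoch, linear_eval_lr(epoch))
```

Over 100 epochs the rate therefore takes four values, with the last stage covering epochs 90 to 99.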

For AlexNet, as shown in Table 1, ODC has a consistent improvement over DC in all conv layers, with the largest improvement (6.7%) observed in conv1 layer. The performance in conv1 layer surpasses the ImageNet pre-trained model. With regard to the best-performing layer, ODC achieves 41.4% on conv4 layer, outperforming the latest LA [65], ranking only second to Rot-Decoupling [14]. Though ODC does not outperform Rot-Decoupling in its best performing layer, it provides a complementary perspective to rotation based methods.

ODC also scales well with deeper architectures. For ResNet-50, as shown in Table 2, ODC achieves 57.6% center-crop accuracy in the conv5 layer, which is 5.4% higher than the best performing layer of the re-implemented DC. Compared with the concurrent state-of-the-art method LA [65], our method produces competitive results. Though the result of conv5 is slightly lower than LA, ODC outperforms LA from conv1 to conv4 layers by large margins. We observe a consistent performance increase from shallower layers to deeper layers, indicating that ODC makes full use of all residual layers.

Places205 Classification. Following Zhang et al. [63], to test the generalization ability on other domains, we also transfer the learned models to the Places205 dataset, which contains 2.45M images of 205 scene categories. Similar to the experiments on ImageNet, we train a 205-way linear classifier on top of each frozen convolutional layer on the train split of Places205, and report top-1 center-crop accuracy on the standard validation split. The evaluation setting and hyper-parameters are the same as those in the ImageNet classification task.

The results in Table 1 show that ODC with AlexNet as the backbone outperforms DC in all layers as well. ODC surpasses all previous works on conv1, conv3 and conv4 layers. Similar to the observation in the ImageNet classification task, ODC scales well on deeper architectures when it is transferred to Places205 with ResNet-50. As shown in Table 2, in all layers, ODC surpasses all previous works, with the largest margin (3.1%) to the runner-up observed in conv2 layer. With regard to the best performing layer, ODC reaches 49.3% center-crop accuracy in the conv5 layer, surpassing the re-implemented DC by 3.2% in the respective best layer. We observe the superiority of ODC in conv1 and conv2 layers over the supervised model using either Places labels or ImageNet labels. The transfer performance of our method in the Places205 classification task indicates that representations learned by ODC can generalize well to different domains from ImageNet.

VOC2007 SVM Classification. To further evaluate the generalization of learned features, we perform experiments on the VOC2007 transfer learning task that resembles real applications with smaller datasets. Following [17], we train linear SVMs on features extracted from the frozen backbone on the “trainval” split of VOC2007 and evaluate on the test split. We follow the same test setting and hyper-parameters used in [17], and report the best performing layers of different methods for ResNet-50. The results in Table 3 show that ODC surpasses previous approaches by a significant margin on the VOC2007 SVM classification task. With ODC, we achieve 78.2% mAP, which is 9.1% higher than DC. However, we also note that there is still a significant 9.8% performance gap between our ODC and the supervised model pre-trained with ImageNet labels, leaving room for further exploration.

Low-shot VOC2007 SVM Classification. Following [17], we also transfer our learned representations to a low-shot setting of VOC2007 SVM classification to test the quality of features when there are few training examples per category. We vary the number of positive samples in each class and train linear SVMs on the frozen ResNet-50 backbone using the same setting as in VOC2007 SVM classification. We use the standard “trainval” split of VOC2007 in training and the test split in testing. We report the mean average precision (mAP) across five independent samples for various low-shot values in Figure 3. The final mAP results shown in Table 3 are the averages over all low-shot values and all independent runs. ODC has a consistent improvement over DC for each shot, with the performance gap further increasing when more positive examples per class are allowed. We also observe that the performance gap between ODC and the supervised model pre-trained with ImageNet labels gradually narrows as the number of training shots increases. Table 3 shows that ODC achieves 57.1% mAP in low-shot SVM classification on VOC2007, 10.2% higher than its counterpart DC. The low-shot results of ODC in this benchmark suggest that the features learned through ODC generalize well to low-shot classification.

| Method (ResNet-50) | Best layer | VOC07 SVM (% mAP) | VOC07 SVM Low-shot (% mAP) |
|---|---|---|---|
| ImageNet labels | 5 | 88.0 | 75.4 |
| Random | 1 | 9.6 | 12.7 |
| Jigsaw [39] | 4 | 64.5 | 39.2 |
| Colorization [62] | 4 | 55.6 | 33.3 |
| Rotation [15] | 4 | 67.4 | 41.0 |
| DeepCluster [2] | 5 | 69.1 | 46.9 |
| ODC (Ours) | 5 | 78.2 | 57.1 |

Table 3: ResNet-50 SVM classification and low-shot SVM classification mAP on VOC07. Numbers for the re-implemented DeepCluster and for ODC are produced by us. Numbers for other methods are taken from [17].

Figure 3: Low-shot Image Classification on VOC07 with linear SVMs trained and tested on the features from the best layer respectively for each method. We show the average performance for each shot across five runs.

4.2 Further Analysis

In this section, we further analyze our ODC model from different perspectives.

ODC as a Fine-tuning Scheme. The high efficiency enables ODC to easily serve as a rapid unsupervised fine-tuning scheme. To assess the fine-tuning ability of ODC, we also use our reimplemented DC to fine-tune other self-supervised models. The improvements over different self-supervised approaches are shown in Table 4. Compared with DC, we observe that ODC boosts the performance of each self-supervised approach by a significant margin. With ODC fine-tuning, we achieve 16.7% improvements for Col., 9.9% for Jig., 7.1% for Rot., and 7.9% for DC, respectively, on the VOC2007 SVM classification benchmark. By contrast, DC also yields fine-tuning improvements but lags far behind ODC.
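The quoted improvements follow directly from the VOC07 SVM mAP values reported in Table 4. The short check below recomputes them; the dictionary and function names are illustrative.

```python
# VOC07 SVM mAP (%) for ResNet-50, copied from Table 4.
original = {"Col.": 55.6, "Jig.": 64.5, "Rot.": 67.4, "DC": 69.1}
dc_finetuned = {"Col.": 61.2, "Jig.": 68.5, "Rot.": 68.6, "DC": 70.0}
odc_finetuned = {"Col.": 72.3, "Jig.": 74.4, "Rot.": 74.5, "DC": 77.0}


def gains(finetuned, baseline):
    """Absolute mAP improvement of each fine-tuned model over its baseline."""
    return {k: round(finetuned[k] - baseline[k], 1) for k in baseline}


print(gains(odc_finetuned, original))  # {'Col.': 16.7, 'Jig.': 9.9, 'Rot.': 7.1, 'DC': 7.9}
print(gains(dc_finetuned, original))   # {'Col.': 5.6, 'Jig.': 4.0, 'Rot.': 1.2, 'DC': 0.9}
```

DC fine-tuning improves every initialization as well, but by 0.9 to 5.6 points, compared with 7.1 to 16.7 points for ODC.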

| Fine-tuning | Col. [62] | Jig. [39] | Rot. [15] | DC [2] |
|---|---|---|---|---|
| Original | 55.6 | 64.5 | 67.4 | 69.1 |
| DC [2] | 61.2 | 68.5 | 68.6 | 70.0 |
| ODC | 72.3 | 74.4 | 74.5 | 77.0 |

Table 4: Improvements over previous self-supervised approaches. Each model is fine-tuned for 120 epochs. We report VOC07 SVM classification mAP for ResNet-50. Pre-trained models marked are provided by [17], hence the original results are also taken from [17]. For methods marked, we reimplement them to obtain the results.

Figure 4: Influence of centroids update frequency (left) and minimal small cluster size (right) on the quality of features learned by ODC. We study these hyper-parameters on uniformly sampled 90K ImageNet within 300 random classes. We report mAP on VOC07 SVM classification task with ResNet-50.

Influence of the Hyper-parameters. The hyper-parameters of ODC include the frequency of updating the centroids memory and the minimal size of clusters. To study the influence of these two hyper-parameters, we train models with 90K images that are uniformly sampled from the original 1.28M ImageNet dataset, and evaluate the performance on the VOC2007 SVM classification benchmark. Figure 4 (left) shows the influence of the update frequency of the centroids memory. We observe no significant decrease in the performance of ODC when the update frequency becomes lower, indicating that our method is insensitive to this hyper-parameter provided that it is within a reasonable range. The influence of the minimal size of small clusters is shown in Figure 4 (right). The results show that a large threshold (i.e., 160) on cluster size leads to a performance drop. The result is not surprising. A cluster whose size is smaller than the minimal size is identified as a “small cluster”. An overly frequent processing of such small clusters (see Sec. 3.2) introduces instability in feature learning. A large threshold would also group images that should not belong to the same class. It is noteworthy that ODC does not experience a significant change in performance within a reasonable range of minimal cluster sizes.

Figure 5: The ratio of changed labels in each batch gradually declines, indicating ODC tends to be stable during training.

Stability and Convergence. Figure 1 already demonstrates the superior stability of ODC over DC from the aspect of the loss curve. In Figure 5, we show the training stability and convergence of ODC throughout the full training iterations. To measure the stability of our models, we record the ratio of samples whose labels are changed in a batch. Intuitively, fewer label switchings suggest a higher stability. We report the ratio when different backbones are trained from scratch with ODC. The curves begin with the highest label-switching ratio, i.e., nearly 100% of samples in a batch experience a switch in their labels. Gradually, the label-switching ratio declines and converges to a relatively low value. Though there is always a small portion of samples altering their labels at last, ODC reaches a stable state.
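The stability measure described above is simply the fraction of samples in a batch whose label differs before and after re-assignment. A minimal sketch follows; the function name is illustrative.

```python
def label_change_ratio(old_labels, new_labels):
    """Fraction of samples in a batch whose pseudo-label changed."""
    if len(old_labels) != len(new_labels):
        raise ValueError("label lists must have the same length")
    if not old_labels:
        return 0.0
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    return changed / len(old_labels)


# Toy batch of four samples, two of which switch clusters.
print(label_change_ratio([0, 1, 2, 3], [0, 1, 5, 6]))  # 0.5
```

A ratio near 1 corresponds to the start of training, where almost every sample switches, and a low ratio to the stable state reached later.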

Training on Long-Tailed Data. In all previous experiments, we train our models on the class-balanced ImageNet dataset. To evaluate the learning efficacy of ODC on long-tailed data, we perform experiments on downsampled long-tail ImageNet following [33]. Specifically, we randomly downsample 300 classes with 100K images from the original ImageNet dataset to make different levels of long-tail ImageNet datasets, where the ratio of the largest class to the smallest class ranges from 1 (the non-long-tail level) to 64 (the highest long-tail level). Figure 6 shows the performance of ODC trained on different levels of long-tail ImageNet. We observe no significant performance drop even under highly long-tailed conditions, suggesting the robustness of our method on long-tailed data.

Visualization of Clusters. We visualize some selected clusters as shown in Figure 7. Since the number of clusters is much larger than that of the original annotations, there will certainly be some clusters that represent new semantics beyond the annotated classes. We find new classes, e.g., “hand” and “feet”, and new relations, e.g., “animal in cage”, “person holds dog” and “person leads dog with a rope”, that are discovered by ODC. The phenomenon reveals the potential of unsupervised learning to capture new semantics beyond manual annotations.

Figure 6: The efficacy of ODC trained on downsampled 300-class 100K long-tail ImageNet, with the ratio of the size of largest class to smallest class ranging from 1 (non-long-tail) to 64 (highly long-tail). We report mAP on VOC07 SVM task with ResNet-50.
Figure 7: A selection of clusters. Each row represents a cluster. Apart from the clusters that represent existing classes in ImageNet annotations, shown in the green box, we also find some new classes discovered by ODC. For example, the two rows in the blue box group “hand” and “feet” respectively, although neither “hand” nor “feet” is a category in ImageNet annotations. Surprisingly, ODC also groups images with similar relations between objects. As shown in the orange box, the clusters represent “animal in cage”, “person holds dog” and “person leads dog with a rope” respectively.

5 Conclusion

We have proposed an effective joint clustering and feature learning paradigm for unsupervised representation learning. The proposed approach, Online Deep Clustering (ODC), attains effective and stable unsupervised training of deep neural networks, via decomposing feature clustering and integrating the process into iterations of network update. ODC performs compellingly as an unsupervised representation learning scheme alone. It can also be used to fine-tune and substantially improve previous self-supervised learning methods.


This work is supported by the SenseTime-NTU Collaboration Project, Singapore MOE AcRF Tier 1 (2018-T1-002-056), NTU SUG, NTU NAP, the Max Planck-NTU Joint Lab for Artificial Senses and Data Science and Artificial Intelligence Research Lab. We thank Yue Zhao for his participation in discussing the idea.


  • [1] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. Cited by: §2.
  • [2] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, Cited by: §1, §1, §2, §3.1, §3.4, §3, Table 1, Table 2, Table 3, Table 4.
  • [3] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In ICCV, pp. 2959–2968. Cited by: §1, §2.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.
  • [6] E. L. Denton, S. Chintala, R. Fergus, et al. (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In NeurIPS, pp. 1486–1494. Cited by: §2.
  • [7] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §1, §2, Table 1.
  • [8] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In ICCV, pp. 2051–2060. Cited by: §2.
  • [9] J. Donahue, P. Krähenbühl, and T. Darrell (2017) Adversarial feature learning. In ICLR, Cited by: §1, §2.
  • [10] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544. Cited by: §2, Table 2.
  • [11] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, pp. 766–774. Cited by: §2.
  • [12] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §2.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338. Cited by: §4.1.
  • [14] Z. Feng, C. Xu, and D. Tao (2019) Self-supervised representation learning by rotation feature decoupling. In CVPR, pp. 10364–10374. Cited by: §4.1, Table 1.
  • [15] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §1, §2, Table 1, Table 3, Table 4.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680. Cited by: §2.
  • [17] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §2, §4.1, §4.1, §4.1, Table 2, Table 3, Table 4.
  • [18] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.
  • [19] O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §1.
  • [20] G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural computation 18 (7), pp. 1527–1554. Cited by: §2.
  • [21] H. Huang, C. C. Loy, and X. Tang (2016) Unsupervised learning of discriminative attributes and visual representations. In CVPR, pp. 5175–5184. Cited by: §1.
  • [22] D. Jayaraman and K. Grauman (2016) Slow and steady feature analysis: higher order temporal coherence in video. In CVPR, pp. 3852–3861. Cited by: §2.
  • [23] S. Jenni and P. Favaro (2018) Self-supervised feature learning by learning to spot artifacts. In CVPR, Cited by: §1.
  • [24] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [25] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005. Cited by: §2, Table 2.
  • [26] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In ECCV, pp. 577–593. Cited by: §2.
  • [27] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In CVPR, Cited by: §1, §2.
  • [28] Q. V. Le (2013) Building high-level features using large scale unsupervised learning. In ICASSP, Cited by: §2.
  • [29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4681–4690. Cited by: §2.
  • [30] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609–616. Cited by: §2.
  • [31] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, pp. 667–676. Cited by: §2.
  • [32] R. Liao, A. Schwing, R. Zemel, and R. Urtasun (2016) Learning deep parsimonious representations. In NeurIPS, pp. 5076–5084. Cited by: §2.
  • [33] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In CVPR, Cited by: §4.2.
  • [34] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. In ICCV, pp. 4463–4471. Cited by: §2.
  • [35] A. Mahendran, J. Thewlis, and A. Vedaldi (2018) Cross pixel optical-flow similarity for self-supervised learning. In ACCV, pp. 99–116. Cited by: §2.
  • [36] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In ECCV, pp. 527–544. Cited by: §2.
  • [37] H. Mobahi, R. Collobert, and J. Weston (2009) Deep learning from temporal coherence in video. In ICML, pp. 737–744. Cited by: §2.
  • [38] T. N. Mundhenk, D. Ho, and B. Y. Chen (2018) Improvements to context based self-supervised learning. In CVPR, pp. 9339–9348. Cited by: §2.
  • [39] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §1, §2, Table 1, Table 3, Table 4.
  • [40] M. Noroozi, H. Pirsiavash, and P. Favaro (2017) Representation learning by learning to count. In ICCV, pp. 5898–5906. Cited by: §2, Table 1.
  • [41] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In CVPR, pp. 9359–9367. Cited by: §2.
  • [42] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In CVPR, pp. 2701–2710. Cited by: §2.
  • [43] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §1, §2, Table 1.
  • [44] Y. Tang, R. Salakhutdinov, and G. Hinton (2012) Robust Boltzmann machines for recognition and denoising. In CVPR, pp. 2264–2271. Cited by: §2.
  • [45] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2, Table 1, Table 2.
  • [46] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: decomposing motion and content for video generation. In CVPR, pp. 1526–1535. Cited by: §2.
  • [47] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §2.
  • [48] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In NeurIPS, pp. 613–621. Cited by: §2.
  • [49] J. Walker, C. Doersch, A. Gupta, and M. Hebert (2016) An uncertain future: forecasting from static images using variational autoencoders. In ECCV, pp. 835–851. Cited by: §2.
  • [50] J. Walker, A. Gupta, and M. Hebert (2015) Dense optical flow prediction from a static image. In ICCV, pp. 2443–2451. Cited by: §2.
  • [51] X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In ICCV, pp. 2794–2802. Cited by: §2.
  • [52] X. Wang, K. He, and A. Gupta (2017) Transitive invariance for self-supervised visual representation learning. In ICCV, pp. 1329–1338. Cited by: §2.
  • [53] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In CVPR, pp. 8052–8060. Cited by: §2.
  • [54] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742. Cited by: Table 1, Table 2.
  • [55] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML, Cited by: §1, §1, §2.
  • [56] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In CVPR, Cited by: §1, §1, §2.
  • [57] L. Yang, X. Zhan, D. Chen, J. Yan, C. C. Loy, and D. Lin (2019) Learning to cluster faces on an affinity graph. In CVPR, pp. 2298–2306. Cited by: §1.
  • [58] X. Zhan, Z. Liu, J. Yan, D. Lin, and C. C. Loy (2018) Consensus-driven propagation in massive unlabeled data for face recognition. In ECCV, Cited by: §1.
  • [59] X. Zhan, X. Pan, Z. Liu, D. Lin, and C. C. Loy (2019) Self-supervised learning via conditional motion propagation. In CVPR, Cited by: §1, §2.
  • [60] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pp. 5907–5915. Cited by: §2.
  • [61] L. Zhang, G. Qi, L. Wang, and J. Luo (2019) AET vs. AED: unsupervised representation learning by auto-encoding transformations rather than data. In CVPR, pp. 2547–2555. Cited by: Table 1.
  • [62] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §1, §2, Table 1, Table 3, Table 4.
  • [63] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In CVPR, Cited by: §2, §4.1, §4.1, Table 1.
  • [64] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using Places database. In NeurIPS, pp. 487–495. Cited by: §4.1.
  • [65] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In ICCV, pp. 6002–6012. Cited by: §4.1, §4.1, Table 1, Table 2.