
LSD-C: Linearly Separable Deep Clusters

We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of the minibatch based on a similarity metric. Then it regroups in clusters the connected samples and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets together with a binary cross-entropy loss on the predictions that the associated pairs of samples belong to the same cluster. This way, the feature representation of the network will evolve such that similar samples in this feature space will belong to the same linearly separated cluster. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.



1 Introduction

The need for large scale labelled datasets is a major obstacle to the applicability of deep learning to problems where labelled data cannot be easily obtained. Methods such as clustering, which are unsupervised and thus do not require any kind of data annotation, are in principle more easily applicable to new problems. Unfortunately, standard clustering algorithms Comaniciu and Meer (1979); Ester et al. (1996); MacQueen and others (1967); Pearson (1894) usually do not operate effectively on raw data and require designing new data embeddings specifically for each new application. Thus, there is significant interest in automatically learning an optimal embedding while clustering the data, a problem sometimes referred to as simultaneous data clustering and representation learning. Recent works have demonstrated this for challenging data such as images Ji et al. (2019); Xie et al. (2016) and text Jiang et al. (2016); Sarfraz et al. (2019). However, most of these methods work with a constrained output space, which usually coincides with the space of discrete labels or classes being estimated, and are therefore forced to operate directly at the level of the cluster semantics.

In this paper, we relax this limitation by introducing a novel clustering method, Linearly Separable Deep Clustering (LSD-C). This method operates in the feature space computed by a deep network and builds on three ideas. First, the method extracts mini-batches of input samples and establishes pairwise pseudo labels (connections) for each pair of samples in the mini-batch. Differently from prior art, this is done in the space of features computed by the penultimate layer of the deep network instead of the final output layer, which maps data to discrete labels. From these pairwise labels, the method learns to regroup the connected samples into clusters by using a clustering loss which forces the clusters to be linearly separable. We empirically show in section 4.2 that this relaxation already significantly improves clustering performance.

Second, we initialize the model by means of a self-supervised representation learning technique. Prior work has shown that these techniques can produce features with excellent linear separability Chen et al. (2020); Gidaris et al. (2018); He et al. (2019) that are particularly useful as initialization for downstream tasks such as semi-supervised and few-shot learning Gidaris et al. (2019); Rebuffi et al. (2020); Zhai et al. (2019).

Third, we make use of very effective data combination techniques such as RICAP Takahashi et al. (2018) and MixUp Zhang et al. (2017) to produce composite data samples and corresponding pseudo labels, which are then used at the pairwise comparison stage. In section 4 we show that training with such composite samples and pseudo labels greatly improves the performance of our method, and is in fact the key to good performance in some cases.

We comprehensively evaluate our method on popular image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K. Our method almost always outperforms competitors on all datasets, establishing new state-of-the-art clustering results. The rest of the paper is organized as follows. We first review the most relevant works in section 2. Next, we develop the details of our proposed method in section 3, followed by the experimental results, ablation studies and analysis in section 4. Our code is publicly available at

2 Related work

Deep clustering. Clustering has been a long-standing problem in the machine learning community, including well-known algorithms such as K-means MacQueen and others (1967), mean-shift Comaniciu and Meer (1979), DBSCAN Ester et al. (1996) or Gaussian Mixture models Pearson (1894). Furthermore, it can also be combined with other techniques to achieve very diverse tasks like novel category discovery Han et al. (2019); Fontanel et al. (2020) or semantic instance segmentation De Brabandere et al. (2017) among others. With the advances of deep learning, more and more learning-based methods have been introduced in the literature Genevay et al. (2019); Ghasedi Dizaji et al. (2017); Haeusser et al. (2018); Huang et al. (2019); Jiang et al. (2016); Li et al. (2018); Shaham et al. (2018); Xie et al. (2016); Yang et al. (2017). Among them, DEC Xie et al. (2016) is one of the most promising methods. It is a two stage method that jointly learns the feature embedding and cluster assignment. The model is first pretrained with an autoencoder using reconstruction loss, after which the model is trained by constructing a sharpened version of the soft cluster assignment as pseudo target. This method inspired a few following works such as IDEC Guo et al. (2017a) and DCED Guo et al. (2017b). JULE Yang et al. (2016) is a recurrent deep clustering framework that jointly learns the feature representation with an agglomerative clustering procedure; however, it requires tuning a number of hyper-parameters, limiting its practical use. More recently, several methods have been proposed based on mutual information Chen et al. (2016); Hu et al. (2017); Ji et al. (2019). Among them, IIC Ji et al. (2019) achieves the current state-of-the-art results on image clustering by maximizing the mutual information between two transformed counterparts of the same image. Closer to our work is the DAC Chang et al. (2017) method, which considers clustering as a binary classification problem. By measuring the cosine similarity between predictions, pairwise pseudo labels are generated from the most confident positive or negative pairs. With the generated pairwise pseudo labels, the model can then be trained by a binary cross-entropy loss. DAC can learn the feature embedding as well as the cluster assignment in an end-to-end manner. Our work significantly differs from DAC as it generates pairwise predictions from a less constrained feature space using similarity techniques not limited to cosine distance.

Self-supervised representation learning. Self-supervised representation learning has recently attracted a lot of attention. Many effective self-supervised learning methods have been proposed in the literature Asano et al. (2019); Caron et al. (2018); Chen et al. (2020); Gidaris et al. (2020, 2018); He et al. (2019). DeepCluster Caron et al. (2018) learns feature representation by classification using the pseudo labels generated from K-means on the learned features in each training epoch. RotNet Gidaris et al. (2018) randomly rotates an image, and learns to predict the applied rotations. Very recently, contrastive learning based methods MoCo He et al. (2019) and SimCLR Chen et al. (2020) have achieved state-of-the-art self-supervised representation performance, surpassing the representation learnt using ImageNet labels. Self-supervised learning has also been applied in few-shot learning Gidaris et al. (2019), semi-supervised learning Rebuffi et al. (2020); Zhai et al. (2019) and novel category discovery Han et al. (2020), where it successfully boosts performance. In this work we make use of the provably well-conditioned feature space learnt by self-supervised learning methods to initialize our network and avoid degenerate cases.

Pairwise pseudo labeling. Pairwise similarity between pairs of samples has been widely used in the literature for dimension reduction or clustering (e.g., t-SNE Maaten and Hinton (2008), FINCH Sarfraz et al. (2019)). Several methods have shown the effectiveness of using pairwise similarity to provide pseudo labels on-the-fly to train deep convolutional neural networks. In Hsu et al. (2019), a binary classifier is trained to provide pairwise pseudo labels to train a multi-class classifier. In Han et al. (2020), ranking statistics are used to obtain pairwise pseudo labels on-the-fly for the task of novel category discovery. In Sarfraz et al. (2019), pairwise connections between data points, obtained by finding nearest neighbours, are used to cluster images using CNN features. In our method, we compute pairwise labels from a neural network embedding. This way we generate pseudo labels for each pair in each mini-batch and learn cluster assignment without any supervision.

3 Method

Figure 1: Overview of LSD-C. Pairwise labels are extracted at the feature level. They are then used in a clustering loss after the linear classifier. This way, the feature maps will evolve such that connected samples will be grouped in linearly separated clusters. The MSE loss acts as a regularizer and enforces the consistency of the cluster predictions when data augmentation is applied.

Our method is divided into three stages: (i) self-supervised pre-training, (ii) pairwise connection and clustering, and (iii) data composition. We provide an overview of our pipeline in figure 1. Our method processes each input data batch in two steps, by extracting features $Z \in \mathbb{R}^{B \times D}$ by means of a neural network $\Phi$, followed by estimating posterior class probabilities $P \in \mathbb{R}^{B \times C}$ by means of a linear layer and softmax non-linearity. We use the symbol $P'$ to denote the class predictions for the same mini-batch with data augmentation (random transformations) applied to it. We use the letters $D$, $C$ and $B$ to denote the feature space dimension, the number of clusters and the mini-batch size. We now detail each component of LSD-C.

3.1 Self-supervised pretraining

As noted in the introduction, traditional clustering methods require handcrafted or pretrained features. More recently, methods such as Ji et al. (2019) have combined deep learning and clustering to learn features and clusters together; even so, these methods usually still require ad hoc pre-processing steps (e.g. Sobel filtering Caron et al. (2018); Ji et al. (2019)) and extensive hyperparameter tuning. In our method we address this issue and avoid bad local minima in our clustering results by initializing our representation by means of self-supervised learning. In practice, this amounts to training our model on a pretext task (detailed in section 4) and then retaining and freezing the earlier layers of the model when applying our clustering algorithm. As reported in Chen et al. (2020); Gidaris et al. (2018), the features obtained from self-supervised pre-training are linearly separable with respect to typical semantic image classes. This property is particularly desirable in our context and also motivates our major design choice: since the feature space of a self-supervised pre-trained network is linearly separable, it is easier to operate directly on it to discriminate between different clusters.

3.2 Pairwise labeling

A key idea in our method is the choice of space where the pairwise data connections are established: we extract pairwise labels at the level of the data representation rather than at the level of the class predictions. The latter is a common design choice, used in DAC Chang et al. (2017) to establish pairwise connections between data points and in DEC Xie et al. (2016) to match the current label posterior distribution to a sharper version of itself.

The collection of pairwise labels between samples in a mini-batch is given by the adjacency matrix of an undirected graph whose nodes are the samples and whose edges encode their similarities. DAC Chang et al. (2017) generates pseudo labels by checking if the output of the network is above or under certain thresholds. The method of Lee (2013) proceeds similarly in the semi-supervised setting. In our method, as we work instead at the feature space level, the pairwise labeling step is a separate process from class prediction and we are free to choose any similarity to establish our adjacency matrix $A$. We denote with $z_i$ and $z_j$ the feature vectors for samples $i$ and $j$ in a mini-batch, obtained from the penultimate layer of the neural network $\Phi$. We also use the symbol $A_{ij}$ to denote the value of the adjacency matrix for the pair of samples $(i, j)$. Next, we describe the different types of pairwise connections considered in this work and summarize them in table 1.

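Cosine and $\ell_2$ similarity.

Let $\tau$ be a threshold hyperparameter and define the similarity $s_{ij} = \langle z_i, z_j \rangle$ (cosine) or the distance $d_{ij} = \|z_i - z_j\|_2$ (Euclidean), where $\langle \cdot, \cdot \rangle$ denotes the dot product between $\ell_2$-normalized vectors. We then define $A_{ij} = \mathbb{1}\{s_{ij} \geq \tau\}$ for the cosine similarity and $A_{ij} = \mathbb{1}\{d_{ij} \leq \tau\}$ for the Euclidean distance, where $\mathbb{1}$ is the indicator function. These definitions connect neighbor samples but do not account well for the local structure of the data. Indeed, it is not obvious that the cosine similarity or Euclidean distance would establish good data connections in feature space.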
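As an illustration of the thresholding step described above, the following is a minimal sketch in plain Python. The function names `cosine_adjacency` and `l2_adjacency` are hypothetical, and computing the Euclidean distance on normalized vectors is an assumption made here for the example rather than something the text specifies.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_adjacency(features, tau):
    """Connect samples whose cosine similarity is at least tau."""
    z = [l2_normalize(v) for v in features]
    n = len(z)
    return [[1 if sum(a * b for a, b in zip(z[i], z[j])) >= tau else 0
             for j in range(n)] for i in range(n)]


def l2_adjacency(features, tau):
    """Connect samples whose Euclidean distance is at most tau.

    Assumption: distances are measured between normalized vectors,
    so they lie in the range [0, 2].
    """
    z = [l2_normalize(v) for v in features]
    n = len(z)
    return [[1 if math.dist(z[i], z[j]) <= tau else 0
             for j in range(n)] for i in range(n)]


features = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]]
adj_cos = cosine_adjacency(features, tau=0.9)
adj_l2 = l2_adjacency(features, tau=0.5)
```

In this toy batch the first two vectors point in nearly the same direction and are connected under both rules, while the third is connected only to itself.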

Symmetric SNE.

A possible solution to alleviate the previous issue is to use the symmetric SNE similarity introduced in t-SNE Maaten and Hinton (2008). This similarity is based on the conditional probability $P_{j|i}$ of picking $j$ as neighbor of $i$ under a Gaussian distribution assumption. We make a further assumption compared to Maaten and Hinton (2008) of an equal variance for every sample in order to speed up the computation of pairwise similarities and define:

$$P_{j|i} = \frac{\exp(-\|z_i - z_j\|_2^2 / T)}{Z_i}, \qquad Z_i = \sum_{k \neq i} \exp(-\|z_i - z_k\|_2^2 / T) \tag{1}$$

$$A_{ij} = \mathbb{1}\left\{ \frac{P_{j|i} + P_{i|j}}{2} \geq \tau \right\} = \mathbb{1}\left\{ \frac{\exp(-\|z_i - z_j\|_2^2 / T)}{H(Z_i, Z_j)} \geq \tau \right\}, \qquad H(Z_i, Z_j) = \frac{2 Z_i Z_j}{Z_i + Z_j} \tag{2}$$

As shown in equation (1), we introduce a temperature hyperparameter $T$ and we call $Z_i$ the partition function for sample $i$. Then the associated adjacency matrix in equation (2) can be written as a function of the $\ell_2$ distance between samples and, in the denominator, of the harmonic mean $H(Z_i, Z_j)$ of the partition functions. As a result, if sample $i$ or $j$ has many close neighbours, it will reduce the symmetric SNE similarity and possibly prevent a connection between samples $i$ and $j$. Such a phenomenon is shown on the two moons toy dataset in figure 2.
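The symmetric SNE connection rule can be sketched as follows. This is a hedged illustration, not the authors' implementation: the name `sne_adjacency`, the exact placement of the temperature (dividing the squared distance), the averaging of the two conditional probabilities, and the choice to connect every sample to itself are all assumptions made for the example.

```python
import math


def sne_adjacency(features, tau, temperature):
    """Threshold the symmetrised Gaussian neighbour probability.

    Assumptions: equal variance for all samples, affinity
    exp(-||z_i - z_j||^2 / temperature), and diagonal entries set to 1.
    """
    n = len(features)
    # Unnormalised Gaussian affinities, excluding self-affinity.
    w = [[0.0 if i == j else
          math.exp(-math.dist(features[i], features[j]) ** 2 / temperature)
          for j in range(n)] for i in range(n)]
    # Partition function of each sample: sum of its affinities.
    z = [sum(row) for row in w]
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Average of P(j|i) and P(i|j); equivalently the affinity
                # divided by the harmonic mean of z[i] and z[j].
                sym = 0.5 * (w[i][j] / z[i] + w[i][j] / z[j])
                adj[i][j] = int(sym >= tau)
    return adj


points = [[0.0], [1.0], [2.0], [10.0]]
adj_sne = sne_adjacency(points, tau=0.1, temperature=1.0)
```

In this toy example the isolated point at 10 has a tiny partition function, so it is still linked to its nearest neighbour, whereas the points at 0 and 2 are not linked to each other because the point at 1 absorbs most of their probability mass.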

k-nearest neighbors.

We also propose a similarity based on $k$-nearest neighbours (kNN) Cover and Hart (1967) where the samples $i$ and $j$ are connected if $i$ is in the $k$-nearest neighbours of $j$ or if $j$ is in the $k$-nearest neighbours of $i$. With this similarity, the hyperparameter is the minimum number of neighbours $k$ and not the threshold $\tau$.
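A minimal sketch of this symmetric kNN rule is given below. The function name `knn_adjacency` and the use of the Euclidean distance to rank neighbours are assumptions for the example; a sample is not counted as its own neighbour.

```python
import math


def knn_adjacency(features, k):
    """Connect i and j if either is among the k nearest neighbours of the other."""
    n = len(features)
    neighbours = []
    for i in range(n):
        # Rank all other samples by Euclidean distance to sample i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(features[i], features[j]))
        neighbours.append(set(order[:k]))
    return [[1 if (j in neighbours[i] or i in neighbours[j]) else 0
             for j in range(n)] for i in range(n)]


points = [[0.0], [1.0], [3.0], [10.0]]
adj_knn = knn_adjacency(points, k=1)
```

Because the rule is an "or" of the two directions, each sample ends up with at least k connections, which is why k acts as a minimum number of neighbours rather than an exact count.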

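Similarity   Adjacency matrix
$\ell_2$ dist.   $A_{ij} = \mathbb{1}\{\|z_i - z_j\|_2 \leq \tau\}$
SNE          $A_{ij} = \mathbb{1}\{(P_{j|i} + P_{i|j}) / 2 \geq \tau\}$
Cosine       $A_{ij} = \mathbb{1}\{\langle z_i, z_j \rangle \geq \tau\}$
kNN          $A_{ij} = \mathbb{1}\{i \in N_k(j) \text{ or } j \in N_k(i)\}$
Table 1: Pairwise labeling with adjacency matrices based on different similarities. $\tau$ is the thresholding hyperparameter for $\ell_2$, SNE and Cosine. The number of neighbours $k$ is kNN’s hyperparameter, and $N_k(i)$ denotes the $k$ nearest neighbours of sample $i$.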
Figure 2: Pairwise connections on the two moons toy data. Panels, from left to right: (a) raw data, (b) $\ell_2$ dist., (c) kNN, (d) SNE. We apply our algorithm with different connection techniques on a toy dataset shown in (a) where each color represents a class. We use the different connection techniques of table 1 such that there are 650 undirected edges for each similarity. Compared to $\ell_2$ distance and SNE, kNN produces neighbourhoods of similar sizes and every sample is connected. SNE captures the local structure of the data: most of the connections are at the external tails of the moons where there are fewer points.

3.3 Clustering loss and data composition

Now that we have established pairwise connections between each pair of samples in the mini-batch, we will use the adjacency matrix $A$ as target for a binary cross-entropy loss. Denoting with $p_{ij}$ the probability that samples $i$ and $j$ belong to the same cluster, we wish to optimize the clustering loss:

$$L = -\sum_{i,j} \left[ A_{ij} \log p_{ij} + (1 - A_{ij}) \log (1 - p_{ij}) \right] \tag{3}$$

The left term of this loss aims at maximizing the number of connected samples (i.e. $A_{ij} = 1$) within a cluster and the right term at minimizing the number of non-connected samples within it (namely, the edges of the complement of the similarity graph, for which $A_{ij} = 0$). Hence the second term prevents the formation of a single, large cluster that would contain all samples.

The next step is to model $p_{ij}$ by using the linear classifier predictions $P_i$ and $P_j$ of samples $i$ and $j$. As seen in equation (4), for a fixed number of clusters $C$, the probability of samples $i$ and $j$ belonging to the same cluster can be rewritten as a sum of probabilities over the possible clusters. For simplicity, we assume that samples $i$ and $j$ are independent:

$$p_{ij} = \sum_{c=1}^{C} P_{ic} P_{jc} = P_i^\top P_j \tag{4}$$

This way, the pairwise comparisons between samples appear only at the loss level and we can thus use the standard forward and backward passes of deep neural networks where each sample is treated independently. By plugging equation (4) in equation (3) and by replacing $P_j$ with $P'_j$ to form pairwise comparisons between the mini-batch and its augmented version, we obtain our final clustering loss $L_{clust}$:

$$L_{clust} = -\sum_{i,j} \left[ A_{ij} \log (P_i^\top P'_j) + (1 - A_{ij}) \log (1 - P_i^\top P'_j) \right] \tag{5}$$
A similar loss is used in Hsu et al. (2019) but with supervised pairwise labels to transfer a multi-class classifier across tasks. It is also reminiscent of DAC Chang et al. (2017), but differs from the latter because the DAC loss does not contain a dot product between probability vectors but between normalized probability vectors. Hence DAC optimizes a Bhattacharyya distance whereas we optimize a standard binary cross-entropy loss.
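To make the pairwise binary cross-entropy concrete, here is a small sketch in plain Python. The name `clustering_loss`, the averaging over all pairs, and the clamping constant `eps` are assumptions introduced for the example; the text only specifies that the same-cluster probability is the dot product of the two cluster prediction vectors and that the adjacency matrix provides the targets.

```python
import math


def clustering_loss(adj, probs, probs_aug, eps=1e-7):
    """Pairwise binary cross-entropy with adjacency entries as targets.

    probs[i] and probs_aug[j] are softmax outputs for sample i and for the
    augmented sample j. Assumptions: mean over all pairs, eps clamping.
    """
    n = len(probs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Probability that i and j fall in the same cluster.
            p_same = sum(a * b for a, b in zip(probs[i], probs_aug[j]))
            p_same = min(max(p_same, eps), 1.0 - eps)
            total -= (adj[i][j] * math.log(p_same)
                      + (1 - adj[i][j]) * math.log(1.0 - p_same))
    return total / (n * n)


adj = [[1, 0], [0, 1]]
separated = [[0.9, 0.1], [0.1, 0.9]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]

loss_separated = clustering_loss(adj, separated, separated)
loss_uniform = clustering_loss(adj, uniform, uniform)
loss_collapsed = clustering_loss(adj, collapsed, collapsed)
```

The collapsed case, where both samples are assigned to one cluster although they are not connected, receives the largest loss, which illustrates how the second term discourages a single cluster containing all samples.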

Figure 3: Illustration for equation 6 of a pairwise target between the "pure" image and the composite image with . In this case, the resulting pairwise target equals .

In practice $L_{clust}$ can be used in combination with effective data augmentation techniques such as RICAP Takahashi et al. (2018) and MixUp Zhang et al. (2017). These methods combine the images from the minibatch and use a weighted combination of the labels of the original images as new target for the cross-entropy loss. We denote with $\sigma_k$ the $k$-th permutation of the samples in the minibatch; RICAP and MixUp require 4 and 2 permutations respectively. RICAP creates a new minibatch of composite images by patching together random crops from the 4 permutations of the original minibatch, whereas MixUp produces a new minibatch by taking a linear combination with random weights from 2 permutations. The new target for a composite image is then obtained by taking a linear combination of the labels in the recombined images, weighted by area proportions in RICAP and the mixing weights in MixUp. These techniques were proposed for the standard supervised classification setting, so we adapt them here to clustering. In order to do so, we propose to perform a pairwise labeling between the composite images and the raw original images. Both minibatches of original and composite images are fed to the network. Then, as illustrated in figure 3, the pairwise label between a composite image and a raw image is the linear combination of the pairwise labels between the components of both. To sum up, to obtain the pairwise labels between a minibatch and its composite version we just need to extract the adjacency matrix $A$ of the minibatch and then do a linear combination of the adjacency matrix with the different column permutations $\sigma_k$, weighted by the corresponding combination weights $w_k$:

$$A'_{ij} = \sum_{k} w_k \, A_{i \, \sigma_k(j)} \tag{6}$$
Regarding the predicted probability of the ‘pure’ image $i$ and the composite image $j$ being in the same cluster, we take the dot product between their respective cluster predictions $P_i$ and $P'_j$.
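The construction of pairwise targets between raw and composite samples can be sketched as below. The name `composite_adjacency` and the use of a single set of weights shared by the whole minibatch are assumptions for the example; the weights stand for the MixUp mixing coefficients or the RICAP area proportions.

```python
def composite_adjacency(adj, permutations, weights):
    """Soft pairwise targets between raw sample i and composite sample j.

    Composite sample j is assumed to be built from the raw samples
    permutations[k][j], combined with weights[k] (weights sum to 1).
    """
    n = len(adj)
    return [[sum(w * adj[i][perm[j]] for perm, w in zip(permutations, weights))
             for j in range(n)] for i in range(n)]


# Samples 1 and 2 are connected to each other; sample 0 is on its own.
adj = [[1, 0, 0],
       [0, 1, 1],
       [0, 1, 1]]
# MixUp-style composition: composite j = 0.75 * x[j] + 0.25 * x[(j + 1) % 3].
perms = [[0, 1, 2], [1, 2, 0]]
targets = composite_adjacency(adj, perms, weights=[0.75, 0.25])
```

For instance, the composite built from samples 1 and 2 keeps a target of 1 with raw sample 1 because both of its components are connected to it, while the composite built from samples 0 and 1 gets a target of 0.75 with raw sample 0.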

3.4 Overall loss

The overall loss we optimise is given by

$$L = L_{clust} + \omega(t) \, L_{cons} \tag{7}$$

where $\omega(t) = \lambda e^{-5 (1 - t / T_r)^2}$ is the ramp-up function proposed in Laine and Aila (2017); Tarvainen and Valpola (2017) with $t$ the current training step, $T_r$ the ramp-up length and $\lambda \in \mathbb{R}_+$. $L_{cons}$ is a consistency constraint which requires the model to produce the same prediction for an image and its augmented version. We use it in our method in a similar way as semi-supervised learning techniques Laine and Aila (2017); Miyato et al. (2018); Sajjadi et al. (2016); Tarvainen and Valpola (2017), i.e. as a regularizer to provide consistent predictions. This differs significantly from clustering methods like IIC Ji et al. (2019) and IMSAT Hu et al. (2017) where augmentations are used as a main clustering cue by maximizing the mutual information between different versions of an image. Instead, as commonly done in semi-supervised learning, we use the Mean Squared Error (MSE) between predictions as the consistency loss.
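The way the consistency term is weighted during training can be sketched as follows. The exponential ramp-up exp(-5 (1 - t/T)^2) is the form used by Laine and Aila (2017); the function names, the scaling of the ramp-up by a constant `lam`, and the per-sample averaging of the squared error are assumptions made for this example.

```python
import math


def ramp_up(step, ramp_up_length, lam):
    """Sigmoid-shaped ramp-up: small at step 0, equal to lam after ramp_up_length."""
    t = min(max(step / ramp_up_length, 0.0), 1.0)
    return lam * math.exp(-5.0 * (1.0 - t) ** 2)


def mse_consistency(probs, probs_aug):
    """Squared error between predictions of a batch and its augmented version,
    averaged over samples (the normalisation is an assumption)."""
    n = len(probs)
    return sum((a - b) ** 2
               for p, q in zip(probs, probs_aug)
               for a, b in zip(p, q)) / n


def overall_loss(clust, cons, step, ramp_up_length, lam):
    """Clustering loss plus ramped-up consistency loss."""
    return clust + ramp_up(step, ramp_up_length, lam) * cons


weight_start = ramp_up(0, 100, 5.0)
weight_mid = ramp_up(50, 100, 5.0)
weight_end = ramp_up(100, 100, 5.0)
cons = mse_consistency([[0.5, 0.5]], [[1.0, 0.0]])
total = overall_loss(1.0, cons, step=100, ramp_up_length=100, lam=5.0)
```

Early in training the consistency term is almost switched off, so the clustering loss drives the predictions; the regularizer then gains weight as the predictions become more reliable.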

4 Experiments


We conduct experiments on five popular benchmarks which we use to compare our method against recent state-of-the-art approaches whenever results are available. We use four image datasets and one text dataset to illustrate the versatility of our approach to different types of data. We use MNIST LeCun et al. (1998), CIFAR 10 Krizhevsky and Hinton (2009), CIFAR 100-20 Krizhevsky and Hinton (2009) and STL 10 Coates et al. (2011) as image datasets. These datasets cover a wide range of image varieties, from $28 \times 28$ pixel grey scale digits in MNIST to higher resolution images from STL 10. CIFAR 100-20 is derived from the original CIFAR 100: we consider only the 20 meta classes for evaluation, as is common practice Ji et al. (2019). Finally we also evaluate our method on a text dataset, Reuters 10K Lewis et al. (2004). Reuters 10K contains 10,000 English news stories labelled with 4 classes. Each story is represented by 2,000 tf-idf features. For all datasets we suppose the number of classes to be known.

Experimental details.

We use ResNet-18 He et al. (2016) for all the datasets except two. For MNIST we use a model inspired from VGG-4 Simonyan and Zisserman (2014), described in Ji et al. (2019) and for Reuters 10K we consider a simple DNN of dimension 2000–500–500–2000–4 described in Xie et al. (2016). We train with a batch size of 256 for all experiments. We use the SGD optimizer with momentum Sutskever et al. (2013) and weight decay for every dataset except Reuters 10K, where we use Adam Kingma and Ba (2014) with a different weight decay. When comparing with other methods in table 2 and table 3, we run our method using 10 different seeds and report average and standard deviation on each dataset to measure the robustness of our method with respect to initialization. As is common practice Ji et al. (2019), we train and test the methods on the whole dataset (this is acceptable given that the method uses no supervision). Further experimental details about data augmentation and training are available in the appendix.

Evaluation metrics.

We take the commonly used clustering accuracy (ACC) as evaluation metric. ACC is defined as

$$\mathrm{ACC} = \max_{g \in \mathrm{Sym}(C)} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ y_i = g(c_i) \}$$

where $y_i$ and $c_i$ respectively denote the ground-truth class label and the clustering assignment obtained by our method for each sample $x_i$ of the $N$ samples in the dataset. $\mathrm{Sym}(C)$ is the group of permutations with $C$ elements and following other clustering methods we use the Hungarian algorithm Kuhn (1955) to optimize the choice of permutation.
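The clustering accuracy can be illustrated with the short sketch below. Note that it swaps in a brute-force search over all permutations in place of the Hungarian algorithm, which gives the same optimum but is only practical for a handful of clusters; the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels, assignments, num_clusters):
    """Best accuracy over all one-to-one mappings from clusters to classes.

    Brute force over permutations (O(C!)); a Hungarian solver such as
    scipy.optimize.linear_sum_assignment is the scalable alternative.
    """
    # counts[c][y]: number of samples assigned to cluster c with true class y.
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for y, c in zip(labels, assignments):
        counts[c][y] += 1
    best = max(sum(counts[c][perm[c]] for c in range(num_clusters))
               for perm in permutations(range(num_clusters)))
    return best / len(labels)


labels = [0, 0, 1, 1, 2, 2]
assignments = [1, 1, 0, 0, 2, 0]
acc = clustering_accuracy(labels, assignments, num_clusters=3)
```

Here the best mapping sends cluster 1 to class 0, cluster 0 to class 1 and cluster 2 to class 2, so five of the six samples count as correct.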

4.1 Results on standard benchmarks

Method                               CIFAR 10      CIFAR 100-20   STL 10        MNIST
K-means MacQueen and others (1967)   22.9          13.0           19.2          57.2
JULE Yang et al. (2016)              27.2          13.7           27.7          96.4
DEC Xie et al. (2016)                30.1          18.5           35.9          84.3
DAC Chang et al. (2017)              52.2          23.8           47.0          97.8
IIC Ji et al. (2019)                 61.7          25.7           59.6          99.2
Ours                                 81.7 ± 0.9    42.3 ± 1.0     66.4 ± 3.2    98.6 ± 0.5
Table 2: Comparison with other methods (clustering accuracy in %). Our method almost always reaches state-of-the-art performance by a large margin. Note that Ji et al. (2019) report best results over all the heads while we report the mean and standard deviation over ten different initializations. This further shows that our method is overall stable and robust to initialization.

We compare our method with the K-means MacQueen and others (1967) baseline and recent clustering methods. In table 2, we report results on image datasets. We use RotNet Gidaris et al. (2018) self-supervised pre-training for each dataset on all the data available (e.g. including the unlabelled set in STL 10). Our method outperforms the others by a large margin. For example, our method achieves 81.7% on CIFAR 10, while the previous state-of-the-art method IIC Ji et al. (2019) gives 61.7%. On CIFAR 10, our method also outperforms the leading semi-supervised learning technique FixMatch Sohn et al. (2020) in its one label per class setting. Similarly, on CIFAR 100-20 and STL 10, our method outperforms other clustering approaches respectively by 16.6 and 6.8 points. On MNIST, our method and IIC both achieve a very low error rate around 1%.

These results clearly show the effectiveness of our approach. Unlike the previous state-of-the-art method IIC, which requires Sobel filtering and a very large batch size during training, our method does not require such preprocessing and works with a common batch size. We also note that our method is robust to different initializations, with a maximum standard deviation of 3.2 points across the datasets in table 2.

To analyse further the results on CIFAR 10, we can look at the confusion matrix resulting from our model’s predictions. We note that most of the errors are due to the ‘cat’ and ‘dog’ classes being confused. If we retain only the confident samples with prediction above

(around of the samples), the accuracy rises to

. We assume that the two classes ‘cat’ and ‘dog’ are more difficult to discriminate due to their visual similarity.

In table 3, we also evaluate our method on the document classification dataset Reuters 10K to show its versatility. We compare with different approaches than in table 2 as clustering methods developed for text are seldom evaluated on image datasets like CIFAR and vice versa. Following existing approaches applied to Reuters 10K, we pretrain the deep neural network by training a denoising autoencoder on the dataset Jiang et al. (2016). Our method works notably better than the K-means baseline, and is on par with the best-performing methods FINCH Sarfraz et al. (2019) and VaDE Jiang et al. (2016). Most notably, one run of our method established state-of-the-art results of 83.5%, 2 points above the current best model.

Method                               Reuters 10K
K-means MacQueen and others (1967)   52.4
IMSAT Hu et al. (2017)               71.9
DEC Xie et al. (2016)                72.2
VaDE Jiang et al. (2016)             79.8
FINCH Sarfraz et al. (2019)          81.5
Ours                                 79.0 ± 4.3
Table 3: Results on Reuters 10K (clustering accuracy in %). Our method performs on average on par with the state of the art. Note that for the best seed we reach state-of-the-art results of 83.5%.

4.2 Ablation studies

In order to analyze the effects of the different components of our method, we conduct a three-part ablation study on CIFAR 10 and CIFAR 100-20. First, we compare the impact of different possible pairwise labeling methods in the feature space. Second, as one of our key contributions is the choice of the space where the pairwise labeling is performed, we test doing so at the level of features and at the level of predictions (i.e. after the linear classifier but before the softmax layer like DEC Xie et al. (2016) or DAC Chang et al. (2017)). Third, we analyse the importance of data augmentation in clustering raw images. Results are reported in table 4 and discussed next.

Pairwise similarity.

We compare, in feature space, pairwise labeling methods based on $\ell_2$ distance, cosine similarity, kNN and symmetric SNE as described in table 1. For kNN, we set the number of neighbors $k$ to 20 and 10 for CIFAR 10 and CIFAR 100-20 respectively. For the cosine similarity, we use respectively thresholds 0.9 and 0.95. For the $\ell_2$ distance, we ran a grid search between 0 and 2 to find an optimal threshold. For SNE, we set the threshold to 0.01 and the temperature to 1 and 0.5, for CIFAR 10 and CIFAR 100-20 respectively. Further details about the hyperparameters are available in the supplementary. We observe that kNN, SNE and cosine similarity perform very well on CIFAR 10 with values around 81%. It is interesting to note that cosine similarity performs noticeably worse than kNN and SNE on CIFAR 100-20, at least 6 points lower. We also notice that the $\ell_2$ distance performs consistently worse than the other labeling methods. We can conclude that kNN and SNE are the best labeling methods empirically, with consistent performance on these two datasets.

Feature space embedding.

Instead of using these labeling methods before the linear classifier, we apply them after it. In this case, our overall approach becomes more similar to standard pseudo-labeling methods such as Chang et al. (2017); Lee (2013); Xie et al. (2016), which aim to match the network predictions output with a ‘sharper’ version of it. We observe that the performance drops considerably for all labeling methods with an average decrease of 16.3 points for CIFAR 10 and 10.6 points for CIFAR 100-20. Hence, this shows empirically that where pseudo labeling is applied plays a major role in clustering effectiveness and that labeling at the feature space level is noticeably better than doing so at the prediction space level.

Data augmentation.

We compare RICAP, MixUp, and the case without data composition (denoted as None). As can be seen in table 4, data composition is crucial for CIFAR 10 where RICAP and MixUp surpass None by respectively 28.0 and 21.6 points. On CIFAR 100-20, the differences are smaller but using data composition still brings a clear improvement with a 6.9 points increase when using RICAP. Interestingly, RICAP clearly outperforms MixUp in both cases.

               Pairwise labeling                  Using the pred. space     Data augmentation
               $\ell_2$   Cosine   kNN    SNE     Cosine   kNN    SNE       RICAP   MixUp   None
CIFAR 10       70.2       81.1     81.7   81.5    63.7     64.7   67.0      81.7    75.3    53.7
CIFAR 100-20   26.1       34.4     42.3   40.4    20.4     32.8   30.4      42.3    37.1    35.4
Table 4: Ablation study (clustering accuracy in %). We analyse the effect of different pairwise labeling methods but also the impact of where the labeling is done (feature vs prediction space). We also show the paramount importance of data augmentation for clustering some datasets like CIFAR 10.

5 Conclusions

We have proposed a novel deep clustering method, LSD-C. Our method establishes pairwise connections at the feature space level among different data points in a mini-batch. These on-the-fly pairwise connections are then used as targets by our loss to regroup samples into clusters. In this way, our method can effectively learn the feature representation together with the cluster assignment. In addition, we also combine recent self-supervised representation learning with our clustering approach to bootstrap the representation before clustering begins. Finally, we adapt data composition techniques to the pairwise connections setting, resulting in a very large performance boost. Our method substantially outperforms existing approaches in various public benchmarks, including CIFAR 10/100-20, STL 10, MNIST and Reuters 10K.

Broader Impact

Our method considers the task of unsupervised clustering from unlabeled data. We mainly consider two types of data: images and text documents. While we make significant advances in terms of clustering accuracy compared to previous work, we believe the data we used to be at low risk since we consider datasets that have been widely used in the community, sometimes for decades.

While the data we used are not at risk, we believe there is an inherent risk of misuse with clustering, particularly when it is learnt from raw data. As with any learning algorithm, the clustering also depends on the bias of the data and could lead to misinformation or misinterpretation of results obtained from our model.

However, we believe our method, and clustering in general, to be of interest for future years as it would reduce the need for heavy data annotation and processing.

6 Acknowledgments

We thank Kevin Scaman for his very useful comments. This work is supported by the EPSRC Programme Grant Seebibyte EP/M013774/1, Mathworks/DTA DFR02620, and ERC IDIU-638009.


  • [1] Y. M. Asano, C. Rupprecht, and A. Vedaldi (2019) Self-labelling via simultaneous clustering and representation learning. In Proc. ICLR, Cited by: §2.
  • [2] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proc. ECCV, Cited by: §2, §3.1.
  • [3] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan (2017) Deep adaptive image clustering. In Proc. ICCV, Cited by: §2, §3.2, §3.2, §3.3, §4.2, §4.2, Table 2.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv. Cited by: §1, §2, §3.1.
  • [5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Proc. NIPS, Cited by: §2.
  • [6] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, Cited by: §4.
  • [7] D. Comaniciu and P. Meer (1979) Mean shift: a robust approach toward feature space analysis. PAMI. Cited by: §1, §2.
  • [8] T. Cover and P. Hart (1967) Nearest neighbor pattern classification. IEEE transactions on information theory. Cited by: §3.2.
  • [9] B. De Brabandere, D. Neven, and L. Van Gool (2017) Semantic instance segmentation with a discriminative loss function. arXiv. Cited by: §2.
  • [10] M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Cited by: §1, §2.
  • [11] D. Fontanel, F. Cermelli, M. Mancini, S. R. Bulò, E. Ricci, and B. Caputo (2020) Boosting deep open world recognition by clustering. arXiv. Cited by: §2.
  • [12] A. Genevay, G. Dulac-Arnold, and J. Vert (2019) Differentiable deep clustering with cluster size constraints. arXiv. Cited by: §2.
  • [13] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proc. ICCV, Cited by: §2.
  • [14] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2019) Boosting few-shot visual learning with self-supervision. In Proc. ICCV, Cited by: §1, §2.
  • [15] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2020) Learning representations by predicting bags of visual words. arXiv. Cited by: §2.
  • [16] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In Proc. ICLR, Cited by: Appendix A, §1, §2, §3.1, §4.1, LSD-C: Linearly Separable Deep Clusters.
  • [17] X. Guo, L. Gao, X. Liu, and J. Yin (2017) Improved deep embedded clustering with local structure preservation. In IJCAI, Cited by: §2.
  • [18] X. Guo, X. Liu, E. Zhu, and J. Yin (2017) Deep clustering with convolutional autoencoders. In International conference on neural information processing, Cited by: §2.
  • [19] P. Haeusser, J. Plapp, V. Golkov, E. Aljalbout, and D. Cremers (2018) Associative deep clustering: training a classification network with no labels. In German Conference on Pattern Recognition, Cited by: §2.
  • [20] K. Han, S. Rebuffi, S. Ehrhardt, A. Vedaldi, and A. Zisserman (2020) Automatically discovering and learning new visual categories with ranking statistics. In Proc. ICLR, Cited by: Appendix A, §2, §2.
  • [21] K. Han, A. Vedaldi, and A. Zisserman (2019) Learning to discover novel visual categories via deep transfer clustering. In Proc. ICCV, Cited by: §2.
  • [22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv. Cited by: §1, §2.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, Cited by: Appendix A, §4.
  • [24] Y. Hsu, Z. Lv, J. Schlosser, P. Odom, and Z. Kira (2019) Multi-class classification without multi-class labels. In Proc. ICLR, Cited by: §2, §3.3.
  • [25] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self-augmented training. In Proc. ICML, Cited by: §2, §3.4, Table 3.
  • [26] G. Huang, H. Larochelle, and S. Lacoste-Julien (2019) Centroid networks for few-shot clustering and unsupervised few-shot classification. arXiv. Cited by: §2.
  • [27] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proc. ICCV, Cited by: §1, §2, §3.1, §3.4, §4, §4, §4.1, Table 2.
  • [28] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou (2016) Variational deep embedding: an unsupervised and generative approach to clustering. arXiv. Cited by: §1, §2, §4.1, Table 3.
  • [29] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proc. ICLR, Cited by: §4.
  • [30] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §4.
  • [31] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly. Cited by: §4.
  • [32] S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In Proc. ICLR, Cited by: §3.4.
  • [33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.
  • [34] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Cited by: §3.2, §4.2.
  • [35] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li (2004) Rcv1: a new benchmark collection for text categorization research. Journal of machine learning research. Cited by: §4.
  • [36] F. Li, H. Qiao, and B. Zhang (2018) Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognition 83, pp. 161–173. Cited by: §2.
  • [37] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research. Cited by: §2, §3.2.
  • [38] J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281–297. Cited by: Appendix C, §1, §2, §4.1, Table 2, Table 3.
  • [39] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence. Cited by: §3.4.
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Proc. NIPS, Cited by: Appendix A.
  • [41] K. Pearson (1894) Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London Series A. Cited by: §1, §2.
  • [42] S. Rebuffi, S. Ehrhardt, K. Han, A. Vedaldi, and A. Zisserman (2020) Semi-supervised learning with scarce annotations. In Proc. CVPR Workshop, Cited by: §1, §2.
  • [43] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems, Cited by: §3.4.
  • [44] S. Sarfraz, V. Sharma, and R. Stiefelhagen (2019) Efficient parameter-free clustering using first neighbor relations. In Proc. CVPR, Cited by: §1, §2, §4.1, Table 3.
  • [45] U. Shaham, K. Stanton, H. Li, B. Nadler, R. Basri, and Y. Kluger (2018) Spectralnet: spectral clustering using deep neural networks. In Proc. ICLR, Cited by: §2.
  • [46] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv. Cited by: §4.
  • [47] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. arXiv. Cited by: §4.1.
  • [48] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In Proc. ICML, Cited by: §4.
  • [49] R. Takahashi, T. Matsubara, and K. Uehara (2018) RICAP: random image cropping and patching data augmentation for deep cnns. In Asian Conference on Machine Learning, Cited by: Appendix A, §1, §3.3.
  • [50] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Proc. NIPS, Cited by: §3.4.
  • [51] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In Proc. ICML, Cited by: §1, §2, §3.2, §4, §4.2, §4.2, Table 2, Table 3.
  • [52] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In Proc. ICML, Cited by: §2.
  • [53] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proc. CVPR, Cited by: §2, Table 2.
  • [54] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer (2019) S4l: self-supervised semi-supervised learning. In Proc. ICCV, Cited by: §1, §2.
  • [55] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv. Cited by: Appendix A, §1, §3.3.

Appendix A Implementation details

Self-supervised pretraining. We train RotNet (Gidaris et al., 2018), i.e. predicting the rotation applied to the image among four possibilities (0°, 90°, 180°, and 270°), on all datasets with the same configuration. Following the authors’ released code, we train for 200 epochs using a step-wise learning rate that starts at 0.1 and is divided by 5 at epochs 60, 120, and 160.
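The schedule and the rotation pretext task described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the released RotNet code: the function names `step_lr`, `rotate90` and `rotation_pretext` are hypothetical, epochs are assumed to be 0-indexed, and each drop is assumed to take effect at the start of the listed epoch.

```python
ROTATIONS = (0, 90, 180, 270)  # the four rotation classes, in degrees


def step_lr(epoch, base_lr=0.1, milestones=(60, 120, 160), factor=5.0):
    """Learning rate in effect at a given epoch (assumed 0-indexed)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


def rotate90(image, k):
    """Rotate a 2D image (list of rows) by k * 90 degrees counter-clockwise."""
    image = [list(row) for row in image]
    for _ in range(k % 4):
        # Transposing and then reversing the row order rotates by 90 degrees.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def rotation_pretext(image):
    """Return the four rotated copies of an image with their class labels."""
    return [(rotate90(image, k), k) for k in range(len(ROTATIONS))]
```

For instance, `step_lr(100)` gives the rate after the first drop (0.1 / 5), and `rotation_pretext(image)` yields four (image, label) pairs with labels 0 to 3.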

Main LSD-C models. After the self-supervised pretraining step, following Han et al. (2020), we freeze the first three macro-blocks of the ResNet-18 (He et al., 2016), as the RotNet training provides robust early filters. We then train the last macro-block and the linear classifier using our clustering method. For all the experiments, we use a batch size of 256. We summarize in Table 5 all the hyperparameters for the different datasets and labeling methods.

Optimizer Ramp-up Cosine SNE kNN
Type Epochs LR steps LR init T Temp k
CIFAR 10 SGD 220 [140, 180] 0.1 5 100 0.9 0.01 1.0 20
CIFAR 100-20 SGD 200 170 0.1 25 150 0.95 0.01 0.5 10
STL 10 SGD 200 [140, 180] 0.1 5 50 - 0.01 0.5 -
MNIST SGD 15 - 0.1 5 50 - - - 10
Reuters 10K Adam 75 - 0.001 25 100 - - - 5
Table 5: Hyperparameters. Optimizer, ramp-up function and parameters of different labeling methods on different datasets.

Data augmentation techniques. We showed in the main paper that data composition techniques like RICAP (Takahashi et al., 2018) and MixUp (Zhang et al., 2017) are highly beneficial to our method. For RICAP, we follow the authors’ instructions and sample the width and height of the crops for each minibatch permutation from a Beta(0.3, 0.3) distribution. Regarding MixUp, we note that using a Beta(0.3, 0.3) distribution for the mixing weight works better in our case than the Beta(1.0, 1.0) advised for CIFAR 10 in the MixUp paper. Furthermore, we have to decrease the weight decay to make MixUp work.
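To make the two sampling steps concrete, here is a minimal sketch in plain Python. It follows the standard RICAP and MixUp formulations (a Beta-sampled boundary splitting the image into four patches with area-proportional weights, and a Beta-sampled convex combination of two inputs). The helper names `ricap_layout`, `mixup_weight` and `mixup` are hypothetical, and the sketch does not show how the composition is adapted to the pairwise connections used by our loss.

```python
import random


def ricap_layout(width, height, beta=0.3, rng=random):
    """Sample a RICAP boundary; return the four patch sizes and area weights."""
    w = int(round(width * rng.betavariate(beta, beta)))
    h = int(round(height * rng.betavariate(beta, beta)))
    # Patches: top-left, top-right, bottom-left, bottom-right.
    sizes = [(w, h), (width - w, h), (w, height - h), (width - w, height - h)]
    # Each patch contributes in proportion to the area it covers.
    weights = [pw * ph / (width * height) for pw, ph in sizes]
    return sizes, weights


def mixup_weight(alpha=0.3, rng=random):
    """Sample the MixUp mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, x_b, lam):
    """Convex combination of two flattened inputs with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
```

With Beta(0.3, 0.3), most sampled values fall close to 0 or 1, so one patch (or one of the two mixed images) usually dominates the composed sample.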

Miscellaneous. Our method is implemented with PyTorch 1.2.0 (Paszke et al., 2019). Our experiments were run on NVIDIA Tesla M40 GPUs and can run on a single GPU with 12 GB of RAM.

Appendix B Confusion matrices on CIFAR 10

In Figure 4, we show confusion matrices on CIFAR 10 to analyse how our clustering method performs on the different classes. We notice that there are 8 confident clusters with a very high clustering accuracy of 94.0% for confident samples. The "dog" and "cat" clusters are not well identified, possibly due to the large intra-class variation of the samples.

(a) All samples
(b) Confident samples
Figure 4: Confusion matrices on CIFAR 10 using our method with kNN labeling. Figure 4(a) shows that most of the errors are due to the "cat" and "dog" classes. When taking the samples with prediction above 0.9 (60% of the samples) in Figure 4(b), there are fewer than 2000 predictions on classes "cat" and "dog" whereas there are more than 3500 for each of the other classes. Our method manages to ignore the problematic classes when taking the confident samples. Indeed, the accuracy for confident samples is 94.0%.
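The selection of confident samples and the resulting confusion matrix can be sketched as follows. This is a minimal illustration in plain Python with hypothetical helper names and toy data. It assumes that cluster indices have already been matched to class indices (for example with the Hungarian method used for evaluation), so that a prediction is correct when the two indices agree.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def confident_indices(probs, threshold=0.9):
    """Indices of samples whose highest predicted probability exceeds threshold."""
    return [i for i, p in enumerate(probs) if max(p) > threshold]


def confusion_matrix(pred, true, num_classes):
    """Rows are true classes, columns are predicted (already matched) clusters."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, true):
        matrix[t][p] += 1
    return matrix


def accuracy(pred, true):
    """Fraction of samples whose predicted index equals the true index."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


# Toy data: four samples, two classes.
probs = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98], [0.91, 0.09]]
labels = [0, 1, 1, 1]

pred = [argmax(p) for p in probs]
keep = confident_indices(probs)
acc_all = accuracy(pred, labels)
acc_conf = accuracy([pred[i] for i in keep], [labels[i] for i in keep])
```

In this toy example the low-confidence second sample is dropped, and accuracy on the remaining three samples is higher than on all four.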

Appendix C Additional ablation studies

We report in Table 6 the results of some additional ablation studies that evaluate the impact of further components of our method. First, we apply K-means (MacQueen et al., 1967) on the feature space of the pretrained RotNet model and observe very poor performance on CIFAR 10 and CIFAR 100-20. We conclude that, before training with our clustering loss, the desired clusters are not yet separated in the feature space. After training with our clustering loss, the clusters can be successfully separated. Moreover, if we only use the clustering loss and drop the consistency MSE loss, the performance decreases on CIFAR 10 and CIFAR 100-20 by 1.5 and 1.3 points respectively, showing that the MSE loss provides a moderate but clear gain to our method. Finally, replacing the linear classifier by a 2-layer classifier (which corresponds to a non-linear separation of clusters in the feature space) results in a small improvement on CIFAR 10 but a clear decrease of 1.9 points on CIFAR 100-20. Hence, using a linear classifier provides more consistent results across datasets.

                K-means + RotNet    Ours (kNN)    Ours (kNN) w/o MSE    Ours (kNN) w/ non-lin.
CIFAR 10        14.3                81.7          80.2                  82.0
CIFAR 100-20    9.1                 40.5          39.2                  38.6
Table 6: Additional ablation studies. From the first column, we observe that the desired clusters are not yet separated in the feature space after the RotNet pretraining. Comparing the second and third columns shows that the MSE consistency loss provides a boost of more than 1 point to our method. Finally, we see that using a non-linear classifier harms the performance on CIFAR 100-20.
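For reference, the K-means baseline in the first column amounts to running Lloyd's algorithm on fixed feature vectors. The sketch below is a generic, minimal version in plain Python and not the implementation used in our experiments. The function names, the random initialisation from data points, and the stopping rule are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
    return new_assign, centroids
```

Running `kmeans(features, k)` on a list of feature vectors returns one cluster index per sample, which can then be scored with the same clustering accuracy as the other columns.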