LSD-C: Linearly Separable Deep Clustering
We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of a minibatch based on a similarity metric. It then groups the connected samples into clusters and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets for a binary cross-entropy loss on the predicted probability that each pair of samples belongs to the same cluster. In this way, the feature representation of the network evolves so that similar samples in feature space fall into the same linearly separated cluster. Our method draws inspiration from recent semi-supervised learning practice and combines our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as on the document classification dataset Reuters 10K.
The need for large-scale labelled datasets is a major obstacle to the applicability of deep learning to problems where labelled data cannot be easily obtained. Methods such as clustering, which are unsupervised and thus do not require any kind of data annotation, are in principle more easily applicable to new problems. Unfortunately, standard clustering algorithms Comaniciu and Meer (1979); Ester et al. (1996); MacQueen and others (1967); Pearson (1894) usually do not operate effectively on raw data and require designing new data embeddings specifically for each new application. Thus, there is significant interest in automatically learning an optimal embedding while clustering the data, a problem sometimes referred to as simultaneous data clustering and representation learning. Recent works have demonstrated this for challenging data such as images Ji et al. (2019); Xie et al. (2016) and text Jiang et al. (2016); Sarfraz et al. (2019). However, most of these methods work with a constrained output space, which usually coincides with the space of discrete labels or classes being estimated, therefore forcing the methods to work directly at the level of the semantics of the clusters.
In this paper, we relax this limitation by introducing a novel clustering method, Linearly Separable Deep Clustering (LSD-C). This method operates in the feature space computed by a deep network and builds on three ideas. First, the method extracts mini-batches of input samples and establishes pairwise pseudo labels (connections) for each pair of samples in the mini-batch. Differently from prior art, this is done in the space of features computed by the penultimate layer of the deep network instead of the final output layer, which maps data to discrete labels. From these pairwise labels, the method learns to regroup the connected samples into clusters by using a clustering loss which forces the clusters to be linearly separable. We empirically show in section 4.2 that this relaxation alone already significantly improves clustering performance.
Second, we initialize the model by means of a self-supervised representation learning technique. Prior work has shown that these techniques can produce features with excellent linear separability Chen et al. (2020); Gidaris et al. (2018); He et al. (2019) that are particularly useful as initialization for downstream tasks such as semi-supervised and few-shot learning Gidaris et al. (2019); Rebuffi et al. (2020); Zhai et al. (2019).
Third, we make use of very effective data combination techniques such as RICAP Takahashi et al. (2018) and MixUp Zhang et al. (2017) to produce composite data samples and corresponding pseudo labels, which are then used at the pairwise comparison stage. In section 4 we show that training with such composite samples and pseudo labels greatly improves the performance of our method, and is in fact the key to good performance in some cases.
We comprehensively evaluate our method on popular image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K. Our method almost always outperforms competitors on all datasets, establishing new state-of-the-art clustering results. The rest of the paper is organized as follows. We first review the most relevant works in section 2. Next, we develop the details of our proposed method in section 3, followed by the experimental results, ablation studies and analysis in section 4. Our code is publicly available at https://github.com/srebuffi/lsd-clusters.
Deep clustering. Clustering has been a long-standing problem in the machine learning community, including well-known algorithms such as K-means MacQueen and others (1967), mean-shift Comaniciu and Meer (1979), DBSCAN Ester et al. (1996) and mixture models Pearson (1894). Furthermore, it can also be combined with other techniques to achieve very diverse tasks like novel category discovery Han et al. (2019); Fontanel et al. (2020) or semantic instance segmentation De Brabandere et al. (2017), among others. With the advances of deep learning, more and more learning-based methods have been introduced in the literature Genevay et al. (2019); Ghasedi Dizaji et al. (2017); Haeusser et al. (2018); Huang et al. (2019); Jiang et al. (2016); Li et al. (2018); Shaham et al. (2018); Xie et al. (2016); Yang et al. (2017). Among them, DEC Xie et al. (2016) is one of the most promising methods. It is a two-stage method that jointly learns the feature embedding and cluster assignment. The model is first pretrained with an autoencoder using a reconstruction loss, after which it is trained by constructing a sharpened version of the soft cluster assignment as pseudo target. This method inspired follow-up works such as IDEC
Guo et al. (2017a) and DCED Guo et al. (2017b). JULE Yang et al. (2016) is a recurrent deep clustering framework that jointly learns the feature representation with an agglomerative clustering procedure; however, it requires tuning a number of hyper-parameters, limiting its practical use. More recently, several methods have been proposed based on mutual information Chen et al. (2016); Hu et al. (2017); Ji et al. (2019). Among them, IIC Ji et al. (2019) achieves the current state-of-the-art results on image clustering by maximizing the mutual information between two transformed counterparts of the same image. Closer to our work is the DAC Chang et al. (2017) method, which considers clustering as a binary classification problem. By measuring the cosine similarity between predictions, pairwise pseudo labels are generated from the most confident positive or negative pairs. With these pairwise pseudo labels, the model can then be trained with a binary cross-entropy loss. DAC learns the feature embedding as well as the cluster assignment in an end-to-end manner. Our work differs significantly from DAC as it generates pairwise predictions from a less constrained feature space, using similarity techniques not limited to the cosine distance.
Self-supervised representation learning. Self-supervised representation learning has recently attracted a lot of attention. Many effective self-supervised learning methods have been proposed in the literature Asano et al. (2019); Caron et al. (2018); Chen et al. (2020); Gidaris et al. (2020, 2018); He et al. (2019). DeepCluster Caron et al. (2018) learns a feature representation by classification, using the pseudo labels generated from K-means on the learned features at each training epoch. RotNet Gidaris et al. (2018) randomly rotates an image and learns to predict the applied rotation. Very recently, the contrastive learning based methods MoCo He et al. (2019) and SimCLR Chen et al. (2020) have achieved state-of-the-art self-supervised representation performance, surpassing the representation learnt using ImageNet labels. Self-supervised learning has also been applied to few-shot learning Gidaris et al. (2019), semi-supervised learning Rebuffi et al. (2020); Zhai et al. (2019) and novel category discovery Han et al. (2020), successfully boosting their performance. In this work, we make use of the well-conditioned feature space learnt by a self-supervised learning method to initialize our network and avoid degenerate cases.

Pairwise pseudo labeling. Pairwise similarity between pairs of samples has been widely used in the literature for dimensionality reduction or clustering (e.g., t-SNE Maaten and Hinton (2008), FINCH Sarfraz et al. (2019)). Several methods have shown the effectiveness of using pairwise similarity to provide pseudo labels on-the-fly to train deep convolutional neural networks. In Hsu et al. (2019), a binary classifier is trained to provide pairwise pseudo labels to train a multi-class classifier. In Han et al. (2020), ranking statistics are used to obtain pairwise pseudo labels on-the-fly for the task of novel category discovery. In Sarfraz et al. (2019), the pairwise connections between data points, obtained by finding nearest neighbours, are used to cluster images using CNN features. In our method, we compute pairwise labels from a neural network embedding. This way we generate pseudo labels for each pair in each mini-batch and learn cluster assignment without any supervision.

Our method is divided into three stages: (i) self-supervised pre-training, (ii) pairwise connection and clustering, and (iii) data composition. We provide an overview of our pipeline in figure 1. Our method processes each input data batch in two steps, by extracting features z = Φ(x) by means of a neural network Φ
, followed by estimating posterior class probabilities p by means of a linear layer and softmax non-linearity. We use the symbol p̂ to denote the class predictions for the same mini-batch with data augmentation (random transformations) applied to it. We use the letters d, C and B to denote the feature space dimension, the number of clusters and the mini-batch size. We now detail each component of LSD-C.

As noted in the introduction, traditional clustering methods require handcrafted or pretrained features. More recently, methods such as Ji et al. (2019) have combined deep learning and clustering to learn features and clusters together; even so, these methods usually still require ad hoc pre-processing steps (e.g. Sobel filtering Caron et al. (2018); Ji et al. (2019)
) and extensive hyperparameter tuning. In our method, we address this issue and avoid bad local minima in our clustering results by initializing our representation by means of self-supervised learning. In practice, this amounts to training our model on a pretext task (detailed in section 4) and then retaining and freezing the earlier layers of the model when applying our clustering algorithm. As reported in Chen et al. (2020); Gidaris et al. (2018), the features obtained from self-supervised pre-training are linearly separable with respect to typical semantic image classes. This property is particularly desirable in our context and also motivates our major design choice: since the feature space of a self-supervised pre-trained network is linearly separable, it is easier to directly operate on it to discriminate between different clusters.

A key idea in our method is the choice of the space where the pairwise data connections are established: we extract pairwise labels at the level of the data representation rather than at the level of the class predictions. The latter is a common design choice, used in DAC Chang et al. (2017) to establish pairwise connections between data points and in DEC Xie et al. (2016) to match the current label posterior distribution to a sharper version of itself.
The collection of pairwise labels between samples in a mini-batch is given by the adjacency matrix A of an undirected graph whose nodes are the samples and whose edges encode their similarities. DAC Chang et al. (2017) generates pseudo labels by checking if the output of the network is above or under certain thresholds. The method of Lee (2013) proceeds similarly in the semi-supervised setting. In our method, as we work instead at the feature space level, the pairwise labeling step is a separate process from class prediction and we are free to choose any similarity to establish our adjacency matrix A. We denote with z_i and z_j the feature vectors for samples i and j in a mini-batch, obtained from the penultimate layer of the neural network Φ. We also use the symbol a_ij to denote the value of the adjacency matrix A for the pair of samples (i, j). Next, we describe the different types of pairwise connections considered in this work and summarize them in table 1.

Let τ be a threshold hyperparameter and define s_ij = ⟨z_i, z_j⟩ (cosine, where ⟨·,·⟩ denotes the dot product between ℓ2-normalized vectors) or s_ij = −‖z_i − z_j‖ (Euclidean). We then define a_ij = 1[s_ij > τ], where 1[·] is the indicator function. These definitions connect neighbouring samples but do not account well for the local structure of the data. Indeed, it is not obvious that the cosine similarity or Euclidean distance would establish good data connections in feature space.
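As a concrete sketch of these thresholded labelings (in numpy, with hypothetical function and parameter names; the paper's actual implementation is in PyTorch):

```python
import numpy as np

def pairwise_adjacency(z, metric="cosine", threshold=0.9):
    """Binary adjacency matrix from a mini-batch of feature vectors.

    z: (B, d) array of penultimate-layer features.
    For "cosine", connect pairs whose cosine similarity exceeds `threshold`;
    for "l2", connect pairs whose Euclidean distance is below `threshold`.
    """
    if metric == "cosine":
        zn = z / np.linalg.norm(z, axis=1, keepdims=True)
        sim = zn @ zn.T                      # dot products of unit vectors
        a = (sim > threshold).astype(np.float64)
    elif metric == "l2":
        d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        a = (np.sqrt(d2) < threshold).astype(np.float64)
    else:
        raise ValueError(metric)
    np.fill_diagonal(a, 1.0)                 # each sample is its own neighbour
    return a
```

The threshold plays the role of τ above; tightening it sparsifies the graph, loosening it risks merging clusters.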
A possible solution to alleviate the previous issue is to use the symmetric SNE similarity introduced in t-SNE Maaten and Hinton (2008). This similarity is based on the conditional probability p(j|i) of picking sample j as a neighbour of sample i under a Gaussian distribution assumption. We make a further assumption compared to Maaten and Hinton (2008) of an equal variance for every sample in order to speed up the computation of pairwise similarities, and define:
p(j|i) = exp(−‖z_i − z_j‖² / T) / Z_i,   where   Z_i = Σ_{k≠i} exp(−‖z_i − z_k‖² / T)   (1)

a_ij = 1[ exp(−‖z_i − z_j‖² / T) / H(Z_i, Z_j) > τ ],   where   H(Z_i, Z_j) = 2 Z_i Z_j / (Z_i + Z_j)   (2)
As shown in equation (1), we introduce a temperature hyperparameter T and we call Z_i the partition function for sample i. Then the associated adjacency matrix in equation (2) can be written as a function of the ℓ2 distance between samples and, in the denominator, of the harmonic mean H(Z_i, Z_j) of the partition functions. As a result, if sample i or j has many close neighbours, it will reduce the symmetric SNE similarity and possibly prevent a connection between samples i and j. Such a phenomenon is shown on the two moons toy dataset in figure 2.

We also propose a similarity based on k-nearest neighbours (kNN) Cover and Hart (1967), where the samples i and j are connected if z_i is among the k nearest neighbours of z_j, or if z_j is among the k nearest neighbours of z_i. With this similarity, the hyperparameter is the minimum number of neighbours k and not the threshold τ.

Table 1 summarizes the four pairwise labeling methods considered in this work: ℓ2 distance, SNE, cosine and kNN.
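Under the same notational assumptions, the SNE and kNN labelings could be sketched as follows (numpy, hypothetical function names):

```python
import numpy as np

def sne_adjacency(z, temperature=1.0, threshold=0.01):
    """Symmetric SNE labeling: threshold exp(-||z_i - z_j||^2 / T) divided by
    the harmonic mean of the partition functions Z_i, Z_j (equations (1)-(2))."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    e = np.exp(-d2 / temperature)
    np.fill_diagonal(e, 0.0)                 # a sample is not its own neighbour
    partition = e.sum(axis=1)                # Z_i
    s = e * (1.0 / partition[:, None] + 1.0 / partition[None, :]) / 2.0
    a = (s > threshold).astype(np.float64)
    np.fill_diagonal(a, 1.0)
    return a

def knn_adjacency(z, k=2):
    """kNN labeling: connect i and j if one is among the k nearest neighbours
    of the other (the 'or' makes the adjacency matrix symmetric)."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # exclude self from the search
    nn = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest neighbours
    a = np.zeros_like(d2)
    a[np.repeat(np.arange(len(z)), k), nn.ravel()] = 1.0
    a = np.maximum(a, a.T)                   # symmetrise
    np.fill_diagonal(a, 1.0)
    return a
```

Note how, in the SNE variant, a point with many close neighbours has a large partition function, which shrinks its similarities and can prune its connections, as discussed above.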
Now that we have established pairwise connections between each pair of samples in the mini-batch, we use the adjacency matrix A as target for a binary cross-entropy loss. Denoting with q_ij the probability that samples i and j belong to the same cluster, we wish to optimize the clustering loss:
L_BCE = −(1/B²) Σ_{i,j} [ a_ij log q_ij + (1 − a_ij) log(1 − q_ij) ]   (3)
The left term of this loss aims at maximizing the number of connected samples (i.e. pairs with a_ij = 1) within a cluster, and the right term at minimizing the number of non-connected samples within it (namely, the edges of the complement of the similarity graph). Hence the second term prevents the formation of a single, large cluster that would contain all samples.
The next step is to model q_ij using the linear classifier predictions p_i and p_j of samples i and j. As seen in equation (4), for a fixed number of clusters C, the probability of samples i and j belonging to the same cluster can be rewritten as a sum of probabilities over the possible clusters. For simplicity, we assume that samples i and j are independent. This way, the pairwise comparison between samples appears only at the loss level and we can thus use the standard forward and backward passes of deep neural networks, where each sample is treated independently. By plugging equation (4) into equation (3) and by replacing p_j with p̂_j to form pairwise comparisons between the mini-batch and its augmented version, we obtain our final clustering loss L_clu:
q_ij = Σ_{c=1}^{C} P(c_i = c, c_j = c) = Σ_{c=1}^{C} p_{i,c} p_{j,c} = p_i · p_j   (4)

L_clu = −(1/B²) Σ_{i,j} [ a_ij log(p_i · p̂_j) + (1 − a_ij) log(1 − p_i · p̂_j) ]   (5)
A similar loss is used in Hsu et al. (2019), but with supervised pairwise labels, to transfer a multi-class classifier across tasks. It is also reminiscent of DAC Chang et al. (2017), but differs from the latter because the DAC loss contains a dot product not between probability vectors but between normalized probability vectors. Hence DAC optimizes a Bhattacharyya distance, whereas we optimize a standard binary cross-entropy loss.
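A minimal numpy sketch of the clustering loss in equation (5) (hypothetical names; the clipping constant is our own addition for numerical stability):

```python
import numpy as np

def clustering_loss(p, p_aug, a, eps=1e-7):
    """Pairwise binary cross-entropy of equation (5).

    p:     (B, C) softmax cluster predictions for the mini-batch.
    p_aug: (B, C) predictions for the augmented version of the same mini-batch.
    a:     (B, B) pairwise pseudo-label (adjacency) matrix.
    q[i, j] = p_i . p_aug_j is the predicted probability that samples i and j
    belong to the same cluster.
    """
    q = np.clip(p @ p_aug.T, eps, 1.0 - eps)
    bce = -(a * np.log(q) + (1.0 - a) * np.log(1.0 - q))
    return bce.mean()
```

Because q_ij is just a dot product of per-sample predictions, the pairwise structure lives only in the loss, and standard per-sample forward/backward passes suffice.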
In practice, L_clu can be used in combination with effective data augmentation techniques such as RICAP Takahashi et al. (2018) and MixUp Zhang et al. (2017). These methods combine the images of the minibatch and use a weighted combination of the labels of the original images as the new target for the cross-entropy loss. We denote with π_k a permutation of the samples in the minibatch; RICAP and MixUp require 4 and 2 permutations respectively. RICAP creates a new minibatch of composite images by patching together random crops from the 4 permutations of the original minibatch, whereas MixUp produces a new minibatch by taking a linear combination, with random weights, of 2 permutations. The new target for a composite image is then obtained by taking a linear combination of the labels of the recombined images, weighted by area proportions in RICAP and by the mixing weights in MixUp. These techniques were proposed for the standard supervised classification setting, so we adapt them here to clustering. To do so, we propose to perform a pairwise labeling between the composite images and the raw original images. Both minibatches of original and composite images are fed to the network. Then, as illustrated in figure 3, the pairwise label between a composite image and a raw image is the linear combination of the pairwise labels between the components of both. To sum up, to obtain the pairwise labels between a minibatch and its composite version, we just need to extract the adjacency matrix A of the minibatch and then take a linear combination of A with its columns permuted by the different permutations π_k:
ã_ij = Σ_k w_k a_{i, π_k(j)}   (6)

where the w_k are the combination weights (area proportions for RICAP, mixing weights for MixUp).
Regarding the predicted probability of a ‘pure’ image i and a composite image j being in the same cluster, we take the dot product p_i · p̃_j between their respective cluster predictions p_i and p̃_j.
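The composite pseudo labels of equation (6) amount to permuting the columns of the adjacency matrix and mixing, e.g. (numpy sketch, hypothetical names):

```python
import numpy as np

def composite_pairwise_labels(a, perms, weights):
    """Pairwise targets between raw and composite (RICAP/MixUp) images.

    a:       (B, B) adjacency matrix of the raw mini-batch.
    perms:   index permutations used to build the composite mini-batch
             (4 for RICAP, 2 for MixUp).
    weights: mixing weights (crop-area proportions for RICAP), summing to 1.
    The target between raw image i and composite image j combines the labels
    between i and each component of j, weighted by that component's weight.
    """
    return sum(w * a[:, perm] for w, perm in zip(weights, perms))
```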
The overall loss we optimise is given by
L = L_clu + w(t) L_MSE   (7)
where
w(t) = λ exp(−5 (1 − t/T)²)   (8)
and w(t) is the ramp-up function proposed in Laine and Aila (2017); Tarvainen and Valpola (2017), with t the current training step, T the ramp-up length and λ the maximum consistency weight. L_MSE is a consistency constraint which requires the model to produce the same prediction for an image and its augmented version. We use it in our method in a similar way as semi-supervised learning techniques Laine and Aila (2017); Miyato et al. (2018); Sajjadi et al. (2016); Tarvainen and Valpola (2017), i.e. as a regularizer to encourage consistent predictions. This differs significantly from clustering methods like IIC Ji et al. (2019) and IMSAT Hu et al. (2017), where augmentations are used as the main clustering cue by maximizing the mutual information between different versions of an image. Instead, as commonly done in semi-supervised learning, we use the Mean Squared Error (MSE) between predictions as the consistency loss.
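The ramp-up weight of equation (8) follows the sigmoid-shaped schedule of Laine and Aila (2017); a sketch (parameter names are ours):

```python
import math

def rampup_weight(step, rampup_length, max_weight):
    """w(t) = max_weight * exp(-5 * (1 - t/T)^2) for t < T, then max_weight.
    Starts near zero so the MSE consistency term does not disturb early
    training, and saturates at max_weight after `rampup_length` steps."""
    if step >= rampup_length:
        return max_weight
    t = step / rampup_length
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)
```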
We conduct experiments on five popular benchmarks, which we use to compare our method against recent state-of-the-art approaches whenever results are available. We use four image datasets and one text dataset to illustrate the versatility of our approach across different types of data. We use MNIST LeCun et al. (1998), CIFAR 10 Krizhevsky and Hinton (2009), CIFAR 100-20 Krizhevsky and Hinton (2009) and STL 10 Coates et al. (2011) as image datasets. These datasets cover a wide range of image varieties, ranging from small grey-scale digits in MNIST to higher resolution images in STL 10. CIFAR 100-20 is derived from the original CIFAR 100 by considering only its 20 meta classes for evaluation, as is common practice Ji et al. (2019). Finally, we also evaluate our method on a text dataset, Reuters 10K Lewis et al. (2004). Reuters 10K contains 10,000 English news articles labelled with 4 classes. Each article is represented by 2,000 tf-idf features. For all datasets, we assume the number of classes to be known.
We use ResNet-18 He et al. (2016) for all the datasets except two. For MNIST we use a model inspired by VGG-4 Simonyan and Zisserman (2014), described in Ji et al. (2019), and for Reuters 10K we consider a simple DNN of dimension 2000–500–500–2000–4, described in Xie et al. (2016). We train with a batch size of 256 for all experiments. We use the SGD optimizer with momentum Sutskever et al. (2013) and weight decay for every dataset except Reuters 10K, where we instead use Adam Kingma and Ba (2014) with a different weight decay. When comparing with other methods in table 2 and table 3, we run our method using 10 different seeds and report the average and standard deviation on each dataset to measure the robustness of our method with respect to initialization. As is common practice Ji et al. (2019), we train and test the methods on the whole dataset (this is acceptable given that the method uses no supervision). Further experimental details about data augmentation and training are available in the appendix.

We take the commonly used clustering accuracy (ACC) as evaluation metric. ACC is defined as
ACC = max_{σ ∈ S_C} (1/N) Σ_{i=1}^{N} 1[ y_i = σ(ŷ_i) ]   (9)
where y_i and ŷ_i respectively denote the ground-truth class label and the clustering assignment obtained by our method for each sample i in the dataset, and N is the number of samples. S_C is the group of permutations of C elements and, following other clustering methods, we use the Hungarian algorithm Kuhn (1955) to optimize the choice of permutation σ.
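The ACC metric of equation (9) can be sketched as follows (brute force over permutations, fine for small C; the paper uses the Hungarian algorithm Kuhn (1955), which scales to larger C):

```python
from itertools import permutations

import numpy as np

def clustering_accuracy(y_true, y_pred, n_clusters):
    """Best accuracy over all permutations mapping cluster ids to class ids."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    best = 0.0
    for sigma in permutations(range(n_clusters)):
        mapped = np.array([sigma[c] for c in y_pred])  # relabel clusters
        best = max(best, float((mapped == y_true).mean()))
    return best
```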
K-means MacQueen and others (1967) | JULE Yang et al. (2016) | DEC Xie et al. (2016) | DAC Chang et al. (2017) | IIC Ji et al. (2019) | Ours | |
---|---|---|---|---|---|---|
CIFAR 10 | 22.9 | 27.2 | 30.1 | 52.2 | 61.7 | 81.7 ± 0.9 |
CIFAR 100-20 | 13.0 | 13.7 | 18.5 | 23.8 | 25.7 | 42.3 ± 1.0 |
STL 10 | 19.2 | 27.7 | 35.9 | 47.0 | 59.6 | 66.4 ± 3.2 |
MNIST | 57.2 | 96.4 | 84.3 | 97.8 | 99.2 | 98.6 ± 0.5 |
We compare our method with the K-means MacQueen and others (1967) baseline and recent clustering methods. In table 2, we report results on image datasets. We use RotNet Gidaris et al. (2018) self-supervised pre-training for each dataset on all the data available (e.g. including the unlabelled set in STL 10). Our method significantly outperforms the others by a large margin. For example, our method achieves 81.7% on CIFAR 10, while the previous state-of-the-art method IIC Ji et al. (2019) gives 61.7%. On CIFAR 10, our method also outperforms the leading semi-supervised learning technique FixMatch Sohn et al. (2020) in its one label per class setting. Similarly, on CIFAR 100-20 and STL 10, our method outperforms other clustering approaches by 16.6 and 6.8 points respectively. On MNIST, our method and IIC both achieve a very low error rate of around 1%.
These results clearly show the effectiveness of our approach. Unlike the previous state-of-the-art method IIC, which requires applying Sobel filtering and using a very large batch size during training, our method does not require such preprocessing and works with a common batch size. We also note that our method is robust to different initializations, with a maximum standard deviation of 3.2 points across all datasets.
To analyse further the results on CIFAR 10, we can look at the confusion matrix resulting from our model's predictions. We note that most of the errors are due to the ‘cat’ and ‘dog’ classes being confused. If we retain only the samples with confident predictions, the accuracy rises further. We assume that the two classes ‘cat’ and ‘dog’ are more difficult to discriminate due to their visual similarity.
In table 3, we also evaluate our method on the document classification dataset Reuters 10K to show its versatility. We compare with different approaches than in table 2, as clustering methods developed for text are seldom evaluated on image datasets like CIFAR, and vice versa. Following existing approaches applied to Reuters 10K, we pretrain the deep neural network by training a denoising autoencoder on the dataset Jiang et al. (2016). Our method works notably better than the K-means baseline, and is on par with the best-performing methods FINCH Sarfraz et al. (2019) and VaDE Jiang et al. (2016). Most notably, one run of our method established state-of-the-art results of 83.5%, 2 points above the current best model.

K-means MacQueen and others (1967) | IMSAT Hu et al. (2017) | DEC Xie et al. (2016) | VaDE Jiang et al. (2016) | FINCH Sarfraz et al. (2019) | Ours |
---|---|---|---|---|---|---
Reuters 10K | 52.4 | 71.9 | 72.2 | 79.8 | 81.5 | 79.0 ± 4.3 |
In order to analyze the effects of the different components of our method, we conduct a three-part ablation study on CIFAR 10 and CIFAR 100-20. First, we compare the impact of different possible pairwise labeling methods in the feature space. Second, as one of our key contributions is the choice of the space where the pairwise labeling is performed, we test doing so at the level of features and at the level of predictions (i.e. after the linear classifier but before the softmax layer, like DEC Xie et al. (2016) or DAC Chang et al. (2017)). Third, we analyse the importance of data augmentation in clustering raw images. Results are reported in table 4 and discussed next.

We compare, in feature space, pairwise labeling methods based on the ℓ2 distance, cosine similarity, kNN and symmetric SNE, as described in table 1. For kNN, we set the number of neighbours to 20 and 10 for CIFAR 10 and CIFAR 100-20 respectively. For the cosine similarity, we use thresholds of 0.9 and 0.95 respectively. For the ℓ2 distance, we ran a grid search between 0 and 2 to find an optimal threshold. For SNE, we set the threshold to 0.01 and the temperature to 1 and 0.5, for CIFAR 10 and CIFAR 100-20 respectively. Further details about the hyperparameters are available in the supplementary material. We observe that kNN, SNE and cosine similarity perform very well on CIFAR 10, with values around 81%. It is interesting to note that cosine similarity performs noticeably worse than kNN and SNE on CIFAR 100-20, with around 6 points less. We also notice that the ℓ2 distance performs consistently worse than the other labeling methods. We can conclude that kNN and SNE are empirically the best labeling methods, with consistent performance on these two datasets.
Instead of using these labeling methods before the linear classifier, we apply them after it. In this case, our overall approach becomes more similar to standard pseudo-labeling methods such as Chang et al. (2017); Lee (2013); Xie et al. (2016), which aim to match the network output with a ‘sharper’ version of itself. We observe that the performance drops considerably for all labeling methods, with an average decrease of 16.3 points for CIFAR 10 and 10.6 points for CIFAR 100-20. This shows empirically that the space where pseudo labeling is applied plays a major role in clustering effectiveness, and that labeling at the feature space level is noticeably better than doing so at the prediction space level.
We compare RICAP, MixUp, and the case without data composition (denoted as None). As can be seen in table 4, data composition is crucial for CIFAR 10, where RICAP and MixUp surpass None by 28.0 and 21.6 points respectively. On CIFAR 100-20, the differences are smaller, but using data composition still brings a clear improvement, with a 6.9 point increase when using RICAP. Interestingly, RICAP clearly outperforms MixUp in both cases.
 | Pairwise labeling | | | | Using the pred. space | | | Data augmentation | | |
---|---|---|---|---|---|---|---|---|---|---
 | ℓ2 | Cosine | kNN | SNE | Cosine | kNN | SNE | RICAP | MixUp | None |
CIFAR 10 | 70.2 | 81.1 | 81.7 | 81.5 | 63.7 | 64.7 | 67.0 | 81.7 | 75.3 | 53.7 |
CIFAR 100-20 | 26.1 | 34.4 | 42.3 | 40.4 | 20.4 | 32.8 | 30.4 | 42.3 | 37.1 | 35.4 |
We have proposed a novel deep clustering method, LSD-C. Our method establishes pairwise connections at the feature space level among different data points in a mini-batch. These on-the-fly pairwise connections are then used as targets by our loss to regroup samples into clusters. In this way, our method can effectively learn the feature representation together with the cluster assignment. In addition, we combine recent self-supervised representation learning with our clustering approach to bootstrap the representation before clustering begins. Finally, we adapt data composition techniques to the pairwise connections setting, resulting in a very large performance boost. Our method substantially outperforms existing approaches on various public benchmarks, including CIFAR 10/100-20, STL 10, MNIST and Reuters 10K.
Our method considers the task of unsupervised clustering from unlabeled data. We mainly consider two types of data: images and text documents. While we make significant advances in terms of clustering accuracy compared to previous work, we believe the data we used to be at low risk, since we consider datasets that have been widespread in the community for, in some cases, decades.

While the data we used are not at risk, we believe there is an inherent risk of misuse with clustering, particularly when learnt from raw data. Like any learning algorithm, clustering depends on the data bias and could lead to misinformation or misinterpretation of the results obtained from our model.

However, we believe our method, and clustering in general, to be of interest in future years, as it would reduce the need for heavy data annotation and processing.
We thank Kevin Scaman for his very useful comments. This work is supported by the EPSRC Programme Grant Seebibyte EP/M013774/1, Mathworks/DTA DFR02620, and ERC IDIU-638009.
Self-supervised pretraining. We train the RotNet Gidaris et al. (2018) (i.e. predicting which of four rotations was applied to the image: 0, 90, 180, or 270 degrees) on all datasets with the same configuration. Following the authors' released code, we train for 200 epochs using a step-wise learning rate starting at 0.1, which is then divided by 5 at epochs 60, 120, and 160.
Main LSD-C models. After the self-supervised pretraining step, following Han et al. (2020) we freeze the first three macro-blocks of the ResNet-18 He et al. (2016) as the RotNet training provides robust early filters. We then train the last macro-block and the linear classifier using our clustering method. For all the experiments, we use a batch size of 256. We summarize in table 5 all the hyperparameters for the different datasets and labeling methods.
 | Optimizer | | | | Ramp-up | | Cosine | SNE | | kNN |
---|---|---|---|---|---|---|---|---|---|---
 | Type | Epochs | LR steps | LR init | λ | T | τ | τ | Temp | k |
CIFAR 10 | SGD | 220 | [140, 180] | 0.1 | 5 | 100 | 0.9 | 0.01 | 1.0 | 20 |
CIFAR 100-20 | SGD | 200 | 170 | 0.1 | 25 | 150 | 0.95 | 0.01 | 0.5 | 10 |
STL 10 | SGD | 200 | [140, 180] | 0.1 | 5 | 50 | - | 0.01 | 0.5 | - |
MNIST | SGD | 15 | - | 0.1 | 5 | 50 | - | - | - | 10 |
Reuters 10K | Adam | 75 | - | 0.001 | 25 | 100 | - | - | - | 5 |
Data augmentation techniques. We showed in the main paper that data composition techniques like RICAP Takahashi et al. (2018) and MixUp Zhang et al. (2017) are highly beneficial to our method. For RICAP, we follow the authors' instructions to sample the width and height of crops for each minibatch permutation using a Beta(0.3, 0.3) distribution. Regarding MixUp, we note that using a Beta(0.3, 0.3) distribution for the mixing weight works better in our case than the Beta(1.0, 1.0) advised for CIFAR 10 in the MixUp paper. Furthermore, we have to decrease the weight decay to make MixUp work.
Miscellaneous. Our method is implemented with PyTorch 1.2.0 Paszke et al. (2019). Our experiments were run on NVIDIA Tesla M40 GPUs and can run on a single GPU with 12 GB of RAM.
In figure 4, we show some confusion matrices on CIFAR 10 to analyse how our clustering method performs on the different classes. We notice that there are 8 confident clusters with a very high clustering accuracy on confident samples. The "dog" and "cat" clusters are not well identified, possibly due to a large intra-class variation of the samples.
We report in table 6 the results of additional ablation studies evaluating the impact of further components of our method. First, we apply K-means MacQueen and others (1967) on the feature space of the pretrained RotNet model and note very poor performance on CIFAR 10 and CIFAR 100-20. We can conclude that, before training with our clustering loss, the desired clusters are not yet separated in the feature space; after training with our clustering loss, the clusters can be successfully separated. Moreover, if we only use the clustering loss and drop the consistency MSE loss, the performance decreases on both CIFAR 10 and CIFAR 100-20, by 1.5 and 1.3 points respectively, showing that the MSE provides a moderate but clear gain to our method. Finally, if we replace the linear classifier by a 2-layer classifier (i.e. allowing a non-linear separation of clusters in the feature space), we obtain a small improvement on CIFAR 10 but a clear decrease of 1.9 points on CIFAR 100-20. Hence using a linear classifier provides more consistent results across datasets.
K-means + RotNet | Ours (kNN) | Ours (kNN) w/o MSE | Ours (kNN) w/ non-lin. | |
---|---|---|---|---|
CIFAR 10 | 14.3 | 81.7 | 80.2 | 82.0 |
CIFAR 100-20 | 9.1 | 40.5 | 39.2 | 38.6 |