As a fundamental task in unsupervised machine learning, clustering aims to group similar data points into the same clusters based on a similarity metric in the absence of labels. Clustering algorithms are helpful in many applications, including but not limited to image segmentation [segment], anomaly detection [anomaly], medical analysis [medical], and data retrieval [retriv1, retriv2], and are often considered an essential pre-processing step in data mining tasks [mining]. Clustering has been an active field of research in machine learning, and a flurry of clustering frameworks has been proposed over time. Traditional methods, such as k-means [kmeans] and fuzzy c-means [FCM], can achieve promising performance on lower-dimensional datasets [review_old]. Still, they fail to cluster large-scale or high-dimensional data accurately, because they assume that the input data are already represented by an efficient feature vector, which is generally not valid in the case of high-dimensional data.
Following the recent success of deep models in handling unstructured and high-dimensional data such as images, researchers have turned their attention to developing deep learning algorithms for clustering [reviewClustermain]. Most early works relied on the power of deep models in learning a generic representation from the input [DEN]. Since this representation is not particularly suited for clustering, this approach yields sub-optimal results. More recently, deep clustering methods have been proposed to tackle this issue [DEC]. They jointly train a representation with the clustering task to learn a feature space in which the data can be more explicitly clustered into different groups. As a result, these models managed to achieve a reliable performance on complex computer vision datasets.
In the past few years, self-supervised learning [Rotate], particularly contrastive learning, has established itself as a state-of-the-art approach to representation learning [SimCLR]. Contrastive methods generate transformed versions of the input samples and attempt to learn a representation in which augmentations of the same data point are closer together and further away from other data points. They have managed to outperform other baselines, particularly in tasks related to computer vision, such as image classification [SimCLR], image anomaly detection [Hojjati2022SelfSupervisedAD], and object recognition [Objectrecognition]. Encouraged by these results, several studies have attempted to apply contrastive learning to the clustering task. An early attempt by Li et al. [CC] showed that contrastive clustering could significantly outperform other baselines on benchmark datasets. Despite these improvements, contrastive clustering and the majority of other deep clustering methods do not consider the interrelationship between data samples, which commonly leads to sub-optimal clustering [DCSS]. However, it is challenging to incorporate information about the cross-relationship of different instances in an unsupervised setting where we do not have access to data labels.
In this paper, inspired by the CC method, we propose a novel clustering method, C3, which employs a trained latent representation to discover similarities between samples and trains the algorithm's networks so that similar instances that form a cluster move closer together. In the remainder of this paper, we describe our idea more formally. Then, we carry out a series of extensive analyses to show that our scheme can significantly improve the clustering performance, and we try to explain how it achieves such an enhancement. We summarize the contributions of this work as follows:
We propose a new contrastive loss function that incorporates the positive pairs discovered through cross-instance similarities in order to learn a more reliable representation space.
Through extensive experiments, we show that our proposed scheme can significantly outperform the current state-of-the-art while adding almost no extra computation, and that this improvement results from considering the similarities between data samples.
We offer insight into the behavior of our model, discuss the intuition behind how it improves the clustering performance, and support this intuition with relevant experiments.
2 Related Works
Deep learning-based clustering methods can be categorized into two groups [reviewCluster]: (I) models that use deep networks to embed the data into a lower-dimensional representation and then apply a traditional clustering method such as k-means to the new representation, and (II) algorithms that jointly train the neural network for extracting features and optimizing the clustering results.
In order to achieve a more clustering-friendly representation, previous studies have added regularization terms and constraints to the loss function of neural networks. For example, Huang et al. [DEN] proposed the deep embedding network (DEN), which imposes a locality-preserving constraint and a group sparsity constraint on the latent representation of the autoencoder. These two constraints reduce the intra-cluster distances and increase the inter-cluster distances to improve the clustering performance. In another work, Peng et al. [PARTY] proposed deep subspace clustering with sparsity prior (PARTY), which enhances the clustering efficiency of the autoencoder (AE) by incorporating a structured prior in order to consider the relationship between different samples. Sadeghi and Armanfard [DML] proposed Deep Multi-Representation Learning for Data Clustering (DML), which uses a general autoencoder for instances that are easily clustered, along with separate AEs for difficult-to-cluster data, to improve the performance.
More recent works jointly train the neural network with the clustering objective to further improve the clustering performance. For instance, the deep clustering network (DCN) [DCN] uses the k-means objective as the clustering loss and jointly optimizes it with the loss of an autoencoder. Analogously, deep embedded clustering (DEC) [DEC] first embeds the data into a lower-dimensional space by minimizing the reconstruction loss. Then, it iteratively updates the encoder part of the AE by optimizing a Kullback-Leibler (KL) divergence [KL] loss between the soft assignments and adjusted target distributions. Following the success of DEC, a series of improved algorithms have been developed. For instance, improved deep embedded clustering with local structure preservation (IDEC) [IDEC] jointly optimizes the clustering loss and the AE loss to preserve the local structure of the data, IDECF [IDECF] adds a fuzzy c-means network to improve the auxiliary cluster assignment of IDEC during training, and deep embedded clustering with data augmentation (DEC-DA) [DECDA] applies the DEC method along with a data augmentation strategy to improve the performance.
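To make the DEC-style objective mentioned above more concrete, the following is a minimal sketch, assuming the commonly cited formulation of DEC: a Student's t kernel for the soft assignments and squared, frequency-normalized assignments for the target distribution. The function names, the toy embeddings, and the choice `alpha=1.0` are illustrative assumptions and are not taken from this paper.

```python
import math

def soft_assignments(points, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedded point i to centroid j."""
    q = []
    for z in points:
        row = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened targets p[i][j] proportional to q[i][j]^2 / f_j, with f_j = sum_i q[i][j]."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)

# Toy 2-D embeddings and two centroids (illustrative values only).
points = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0]]
centroids = [[0.0, 0.0], [2.0, 2.0]]
q = soft_assignments(points, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In this sketch the target distribution puts more mass on the cluster that each point already favors, so minimizing the KL term pushes the encoder toward more confident assignments.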
Several other methods design auxiliary tasks for learning an efficient representation. For example, JULE [JULE] applies agglomerative clustering to learn the data representation and cluster assignments. In another algorithm, named invariant information clustering (IIC) [IIC], the mutual information between the cluster assignments of a pair of samples is maximized.
Recently, researchers have turned their attention to self-supervised learning (SSL) models for clustering. For example, MMDC (multi-modal deep clustering) [MMDC] improves the clustering accuracy by solving the proxy task of predicting the rotation. SCAN (semantic clustering by adopting nearest neighbors) [SCAN] first obtains a high-level feature representation using self-supervised learning and then improves the clustering performance by incorporating the nearest neighbor prior.
Contrastive learning is a self-supervised learning paradigm that learns data representations by minimizing the distance between the augmentations of the same sample while pushing them away from other instances. SimCLR [SimCLR] is an example of a contrastive model for learning representations from images that can achieve performance on par with supervised methods. In the past couple of years, researchers have increasingly utilized contrastive models for solving tasks such as clustering. Zhong et al. proposed deep robust clustering (DRC) [DRC], in which one contrastive loss decreases the intra-class variance and another contrastive loss increases the inter-class distance. Contrastive clustering (CC) [CC] improves the clustering performance by jointly performing instance-level and cluster-level contrastive learning.
Given an unlabelled dataset $\mathcal{X}$ and a predefined number of clusters $M$, the goal of the clustering problem is to partition $\mathcal{X}$ into $M$ disjoint groups.
To realize this goal, our proposed method follows a two-step training procedure:
First, we train the algorithm's networks using an existing reliable clustering framework. In this paper, we choose CC [CC], as it has shown state-of-the-art performance in image clustering. Such initial training provides a latent space that is a reasonably reliable lower-dimensional representation space for the data samples. In the following, we refer to this space as the z-space.
Second, we use the initialized z-space to form the positive and negative pairs needed by the proposed contrastive loss function. The pairs are formed based on cosine similarities in the z-space. By minimizing the proposed loss, we significantly improve the ability of the z-space to provide an effective data representation, by drawing similar points closer together and pushing them away from the rest of the batch samples. As shown in Figure 1, such a training process significantly boosts the clustering performance.
We first map the data into a latent representation using an encoder network $f(\cdot)$, so that $h_i = f(x_i; \theta)$ with $h_i \in \mathbb{R}^{d}$. Here, $d$ is the dimension of the latent representation, which is usually smaller than the input size, and $\theta$ denotes the parameters of the neural network, which are tuned during the training phase. Ideally, the latent space should be suitable for clustering while preserving the important characteristics of the input $x_i$. We utilize the self-supervised contrastive clustering (CC) [CC] method to learn such a representation.
Like other contrastive learning methods, CC applies two data augmentations $T^a$ and $T^b$, sampled randomly from a pool of transformations $\mathcal{T}$, to form a positive pair $(x_i^a, x_i^b)$ for a sample $x_i$, where $x_i^a = T^a(x_i)$ and $x_i^b = T^b(x_i)$. These transformations preserve the important information of the original sample, and human eyes easily perceive their association. However, their pixel values differ significantly, and by learning to create a link between them, the neural network learns to focus on the important and consistent patterns of the data. In this paper, we use the same augmentations as those employed in CC [CC], namely resized crop, gray-scale, horizontal flip, color jittering, and Gaussian blur.
In the next step, the network extracts the representations of the augmented samples, i.e., $h_i^a = f(x_i^a)$ and $h_i^b = f(x_i^b)$. After extracting the representations, we conduct instance-level and cluster-level contrastive learning.
Instance-level contrastive learning pulls together the representations of positive pairs and pushes them away from negative ones. Since no prior label is given in unsupervised settings, we treat each sample as an individual class, and the two augmentations of the same sample are considered positive, while the rest of the batch is considered negative. More specifically, for a given sample $x_i$, $(x_i^a, x_i^b)$ forms its sole positive pair, and the rest of the samples are considered negative. In practice [SimCLR], instead of directly applying contrastive learning to the representation $h_i$, we first map it to another subspace using a multi-layer perceptron $g_I(\cdot)$ to obtain $z_i = g_I(h_i)$ and then minimize the following loss function:
where $\tau_I$ is the temperature parameter that defines the degree of attraction and repulsion between samples. If we normalize the $z_i$s, the similarity metric becomes the inner product $z_i^\top z_j$.
The final loss function is defined as the average of the loss for all positive pairs across the batch:
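Written out in the standard form used by SimCLR and CC, the per-sample loss and its batch average would read as follows. The notation is an assumption made for this sketch: $z_i^a$ and $z_i^b$ are the projections of the two views of sample $i$, $s(\cdot,\cdot)$ is the cosine similarity, $\tau_I$ is the instance-level temperature, and $N$ is the batch size.

```latex
\ell_i^a = -\log
  \frac{\exp\left(s(z_i^a, z_i^b)/\tau_I\right)}
       {\sum_{j=1}^{N}\left[\exp\left(s(z_i^a, z_j^a)/\tau_I\right)
                          + \exp\left(s(z_i^a, z_j^b)/\tau_I\right)\right]},
\qquad
\mathcal{L}_{ins} = \frac{1}{2N}\sum_{i=1}^{N}\left(\ell_i^a + \ell_i^b\right)
```

Here $\ell_i^b$ is defined symmetrically, with the roles of the two views exchanged.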
The reason that $z$ is used instead of $h$ in Eq. (1) is that minimizing the contrastive loss directly on the representation $h$ might cause it to drop essential information. Previous studies have also shown empirically that minimizing the loss on $z$ leads to better results [SimCLR].
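The instance-level objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the default temperature `tau=0.5` are assumptions, and the denominator sums over every view in the batch, whereas practical implementations typically mask the anchor's similarity with itself.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def instance_loss(z_a, z_b, tau=0.5):
    """Average instance-level contrastive loss over both views of a batch.

    z_a[i] and z_b[i] are the projections of two augmentations of sample i.
    """
    z_a = [normalize(v) for v in z_a]
    z_b = [normalize(v) for v in z_b]
    n = len(z_a)
    total = 0.0
    for anchors, others in ((z_a, z_b), (z_b, z_a)):
        for i in range(n):
            # The only positive for anchor i is the other view of the same sample.
            positive = math.exp(dot(anchors[i], others[i]) / tau)
            # Every view in the batch contributes to the denominator.
            denominator = sum(
                math.exp(dot(anchors[i], view[j]) / tau)
                for view in (anchors, others) for j in range(n))
            total += -math.log(positive / denominator)
    return total / (2 * n)

# Toy projections: the second view either agrees with the first or is swapped.
z_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = instance_loss(z_a, [[0.9, 0.1], [0.1, 0.9]])
swapped = instance_loss(z_a, [[0.1, 0.9], [0.9, 0.1]])
```

When the two views of each sample agree, the loss is lower than when they are swapped, which is the behavior the instance-level term is meant to reward.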
Decoupled from the instance-level contrastive learning network $g_I(\cdot)$, we train another network $g_C(\cdot)$ that maps the h-space onto an $M$-dimensional c-space. Mathematically speaking, if we denote the output of $g_C(\cdot)$ under the first augmentation for a batch of $N$ samples as $C^a$, then $C^a \in \mathbb{R}^{N \times M}$. Intuitively, in this subspace, the $m$-th element of each row corresponds to the probability of the sample belonging to the $m$-th cluster.
If we denote the $m$-th column of $C^a$ by $c_m^a$, we can consider the corresponding column $c_m^b$ under the second augmentation as its positive pair while treating the other columns as negatives. Then, similar to the instance-level loss, we define a contrastive loss function to pull the positive pair together and push it away from the negatives as follows:
Minimizing the above loss can lead to the trivial solution of mapping most data points to the same cluster. To avoid this, CC minimizes the negative entropy of the cluster assignment probabilities, defined in Eq. (5), where the probability of each cluster is estimated from the cluster assignments of the samples in the batch.
The final cluster-level contrastive loss is calculated as:
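Under assumed notation, and following the cluster-level formulation of CC, the three ingredients described above can be sketched as follows. Here $c_m^a$ and $c_m^b$ are the $m$-th columns of the assignment matrices $C^a$ and $C^b$, $s(\cdot,\cdot)$ is the cosine similarity, $\tau_C$ is the cluster-level temperature, and $k \in \{a, b\}$ indexes the two views; all of these symbols are assumptions made for this reconstruction.

```latex
\hat{\ell}_m^a = -\log
  \frac{\exp\left(s(c_m^a, c_m^b)/\tau_C\right)}
       {\sum_{n=1}^{M}\left[\exp\left(s(c_m^a, c_n^a)/\tau_C\right)
                          + \exp\left(s(c_m^a, c_n^b)/\tau_C\right)\right]}
\\[6pt]
H(C) = -\sum_{m=1}^{M}\left[P(c_m^a)\log P(c_m^a) + P(c_m^b)\log P(c_m^b)\right],
\qquad
P(c_m^k) = \frac{1}{\lVert C^k \rVert_1}\sum_{i=1}^{N} C^k_{im}
\\[6pt]
\mathcal{L}_{clu} = \frac{1}{2M}\sum_{m=1}^{M}\left(\hat{\ell}_m^a + \hat{\ell}_m^b\right) - H(C)
```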
The final loss of CC is the sum of the instance-level and cluster-level losses, i.e., $\mathcal{L}_{CC} = \mathcal{L}_{ins} + \mathcal{L}_{clu}$.
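The cluster-level part of this objective can also be sketched in plain Python. The code below is only an illustration under stated assumptions: the function names, the default temperature `tau=1.0`, and the toy assignment matrices are hypothetical, and the cosine helper assumes that no column is all zeros.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def columns(matrix):
    return [list(col) for col in zip(*matrix)]

def cluster_contrastive(c_a, c_b, tau=1.0):
    """Contrast the M columns (cluster representations) of two N x M assignment matrices."""
    cols_a, cols_b = columns(c_a), columns(c_b)
    m = len(cols_a)
    total = 0.0
    for anchors, others in ((cols_a, cols_b), (cols_b, cols_a)):
        for i in range(m):
            positive = math.exp(cosine(anchors[i], others[i]) / tau)
            denominator = sum(math.exp(cosine(anchors[i], view[j]) / tau)
                              for view in (anchors, others) for j in range(m))
            total += -math.log(positive / denominator)
    return total / (2 * m)

def assignment_entropy(c):
    """Entropy of the batch-level cluster usage P(m) = sum_i c[i][m] / sum(c)."""
    mass = [sum(col) for col in columns(c)]
    total = sum(mass)
    return -sum((x / total) * math.log(x / total) for x in mass if x > 0)

def cluster_loss(c_a, c_b, tau=1.0):
    # Subtracting the entropy penalizes collapsing every sample into one cluster.
    entropy = assignment_entropy(c_a) + assignment_entropy(c_b)
    return cluster_contrastive(c_a, c_b, tau) - entropy

# Two samples, two clusters: balanced usage versus a near-collapsed assignment.
balanced = [[0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1], [0.8, 0.2]]
loss_balanced = cluster_loss(balanced, balanced)
loss_collapsed = cluster_loss(collapsed, collapsed)
```

The balanced assignment uses both clusters equally, so its entropy term is larger and its total loss is lower than that of the collapsed assignment.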
As the first step in our method, we initialize our networks by training them to minimize the CC loss. If we run this initialization process for a sufficient number of epochs, a partially reliable z-space is obtained. However, the z-space obtained by minimizing the CC loss is still sub-optimal, as CC does not consider the cross-sample relationships in its training phase.
Since we do not have access to the data labels to identify similar samples, we use the cosine similarity of the sample representations in the z-space to measure data similarities. If, for a pair of instances, the similarity $z_i^\top z_j$ is greater than or equal to a threshold, quantified by the hyperparameter $\zeta$, we consider those samples to be similar and pull them closer together by minimizing the loss function defined below:
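where $\mathbb{1}\{\cdot\}$ denotes the indicator function. Note that since the $z_i$s are normalized, the inner product $z_i^\top z_j$ equals their cosine similarity. Analogous to $\tilde{\ell}_i^a$, we define $\tilde{\ell}_i^b$, which considers the similarity of $z_i^b$ and the other samples in the batch.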
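Based on the description above, the per-sample C3 loss can be reconstructed roughly as follows. The notation is assumed for this sketch: $z_i^a$ and $z_j^k$ are normalized z-space representations, $k \in \{a, b\}$ indexes the two views, $\zeta$ is the similarity threshold, and the temperature $\tau_I$ is carried over from Eq. (1) by analogy.

```latex
\tilde{\ell}_i^a = -\log
  \frac{\sum_{k \in \{a,b\}} \sum_{j=1}^{N}
          \mathbb{1}\left\{ z_i^{a\top} z_j^{k} \geq \zeta \right\}
          \exp\left( z_i^{a\top} z_j^{k} / \tau_I \right)}
       {\sum_{k \in \{a,b\}} \sum_{j=1}^{N}
          \exp\left( z_i^{a\top} z_j^{k} / \tau_I \right)}
```

Compared with Eq. (1), the only structural change is in the numerator, which now accumulates every pair whose similarity reaches the threshold instead of the single augmented pair.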
The main reason we use the z-space rather than the h-space for similarity measurements is that computing inner products in the z-space requires fewer mathematical operations, since the dimension of z is lower than that of h.
Comparing the numerators of Eq. (1) and Eq. (9) confirms that C3 allows the networks to be trained with many more positive pairs, whereas CC allows only one positive pair to appear in the numerator. This is a valuable property of C3, because treating all other batch samples as negatives (as CC does) is a misleading assumption: samples of the same cluster should be considered positive pairs and pulled together rather than repelled. It is worth noting that, in terms of the number of required operations, computing the C3 loss needs almost no extra computation compared to the CC loss; C3 only needs to find the positive pairs and perform a few more summations in the numerator, equal in number to the extra positive pairs.
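The thresholded numerator discussed above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the temperature `tau=0.5`, the thresholds, and the toy vectors are assumptions chosen for illustration, and the anchor's similarity with itself is kept in both the numerator and the denominator.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def c3_instance_loss(z_a, z_b, zeta, tau=0.5):
    """Thresholded contrastive loss averaged over all 2N views of a batch.

    Every view whose cosine similarity with the anchor is at least zeta
    is treated as a positive and added to the numerator.
    """
    views = [normalize(v) for v in z_a] + [normalize(v) for v in z_b]
    total = 0.0
    for anchor in views:
        sims = [sum(a * b for a, b in zip(anchor, other)) for other in views]
        weights = [math.exp(s / tau) for s in sims]
        # The anchor itself has similarity ~1, so the numerator stays positive
        # as long as zeta is below 1.
        numerator = sum(w for s, w in zip(sims, weights) if s >= zeta)
        total += -math.log(numerator / sum(weights))
    return total / len(views)

# Toy batch: samples 0 and 1 point in similar directions, sample 2 does not.
z_a = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
z_b = [[0.95, 0.1], [1.0, 0.1], [0.1, 0.9]]
loss_moderate = c3_instance_loss(z_a, z_b, zeta=0.6)
loss_strict = c3_instance_loss(z_a, z_b, zeta=0.99)
loss_all_positive = c3_instance_loss(z_a, z_b, zeta=-1.0)
```

Raising the threshold removes terms from the numerator and therefore increases the loss value, while a threshold of -1 makes every pair positive and drives the loss to zero.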
In summary, in our framework, we initialize the networks with CC. This is mainly to obtain a partially reliable z-space that is used in the following step to create more positive pairs for the C3 loss. In the second step, we improve the ability of the z-space to provide reliable representations by identifying similar samples and then bringing them closer together to form clusters with more explicit boundaries. Pseudocode of the proposed C3 method is presented in Algorithm 1.
4 Experiments and Discussions
In this section, we demonstrate the effectiveness of our proposed scheme by conducting rigorous experimental evaluations. We evaluated our method on five challenging computer vision benchmark datasets: CIFAR-10, CIFAR-100 [Cifar], ImageNet-10, ImageNet-Dogs [Imagenet], and Tiny-ImageNet [Tinyimage]. Table 1 gives a summarized description of each dataset. For CIFAR-10 and CIFAR-100, we combined the training and test splits. Also, for CIFAR-100, instead of the 100 classes, we used the 20 super-classes as the ground truth. To evaluate the performance, we use three metrics commonly used in clustering, namely clustering accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) [reviewCluster].
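For readers unfamiliar with these metrics, the sketch below shows one common way to compute ACC and NMI in plain Python. It rests on several assumptions: ACC is computed by brute force over one-to-one mappings from clusters to classes (practical only for a handful of clusters; real evaluations usually rely on the Hungarian algorithm), there are no more predicted clusters than classes, and NMI uses arithmetic-mean normalization. The function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from predicted clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(y_true, y_pred):
    """NMI with arithmetic-mean normalization: 2 I(T; P) / (H(T) + H(P))."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    mi = sum((c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
             for (t, p), c in joint.items())
    denom = entropy(y_true) + entropy(y_pred)
    return 2 * mi / denom if denom > 0 else 1.0

# Cluster ids are arbitrary, so a relabelled perfect clustering still scores 1.
y_true = [0, 0, 1, 1, 2, 2]
perfect = [2, 2, 0, 0, 1, 1]
noisy = [0, 0, 0, 1, 2, 2]
acc_perfect = clustering_accuracy(y_true, perfect)
acc_noisy = clustering_accuracy(y_true, noisy)
nmi_perfect = normalized_mutual_information(y_true, perfect)
nmi_noisy = normalized_mutual_information(y_true, noisy)
```

Both metrics are invariant to how the cluster indices are numbered, which is why they are suitable for comparing cluster assignments against class labels.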
4.1 Implementation Details
For the sake of fair comparison, for all datasets, we used ResNet34 [resnet] as the backbone of our encoder, which is the same architecture that previous algorithms have adopted. We set the output dimension of the instance-level projection head to 128 for all datasets, and the output dimension of the cluster-level contrastive head to the number of classes, $M$, in each dataset. Following CC, we adopt its temperature parameters for the instance-level and cluster-level networks, and for the first step we use the Adam optimizer [kingma:adam] with CC's learning rate and batch size, training for 1000 epochs. In the second step, the Adam optimizer is used again, and the networks are trained for 20 epochs. The experiments are run on an NVIDIA Tesla V100 32 GB GPU.
4.2 Comparison with the State-of-the-art
Table 2 shows the results of our proposed method on the benchmark datasets, compared to the state-of-the-art and some common traditional clustering methods. For CC, we ran the code provided by its authors on all datasets, and the results are indicated by (*). As is evident in this table, our proposed method significantly outperforms all other baselines on all datasets. Quantitatively, compared to the second-best algorithm (i.e., CC), C3 improves the ACC, NMI, and ARI on CIFAR-10, CIFAR-100, ImageNet-10, ImageNet-Dogs, and Tiny-ImageNet; the per-dataset figures are given in Table 2. The main difference between our framework and other baselines, such as CC, is that we exploit an additional source of information, i.e., the similarity between samples, to further enhance the learned representation for clustering. We believe this is the main reason our method's performance is superior to that of the baselines.
As opposed to the CC method, which misses the global patterns present in each data cluster, C3 takes such patterns into account (at least partially) by employing the cross-instance data similarities. In this way, C3 implicitly reduces the intra-class distance while maximizing the inter-class distance, which is what an efficient grouping technique would do in the presence of data labels. This can be confirmed by visualizing the clusters before and after applying the C3 loss. As Figures 1 and 3 show, the samples are reasonably well clustered at the end of the first training step using the CC loss, i.e., before the start of training with the C3 loss. However, some of the difficult clusters are mixed at the boundaries, and the clusters obtained by minimizing only the CC loss are spread out and contain a considerable number of misclustered samples. After training with the proposed C3 loss in the second step, we observe that the new cluster space is much more reliable, and individual clusters are densely populated while being distant from each other.
4.3 Convergence Analysis
The results of Section 4.2 demonstrate the superiority of our proposed scheme. Now, we analyze C3's convergence and computational complexity to evaluate at what cost it achieves such an improvement over other baselines. We plotted the trend of the clustering accuracy and NMI for four datasets during the training epochs in Figure 4. We can readily confirm that, although we train the C3 step for only 20 epochs, the graphs quickly converge to a settling point, which corresponds to the peak performance. Also, we can observe that both ACC and NMI improve throughout the C3 training phase on all datasets. The performance at the start of the C3 step corresponds to the final performance of the CC algorithm. These figures clearly show that C3 improves the clustering quality and corroborate the results presented in Section 4.2.
One may (wrongly) think that the better performance of C3 compared to CC is because it is trained for 20 more epochs. (Note that in all our experiments, as suggested by the CC authors, we trained the CC networks for 1000 epochs. We train the C3 networks for 1020 epochs, i.e., 1000 epochs for the first step and 20 epochs for the second step.) We reject this argument and support our position by training the CC networks for the same number of epochs as C3, i.e., 1020 epochs. We observe that CC does not improve when trained for an extra 20 epochs. In contrast, when we continue training the networks with the C3 loss for only 20 more epochs, significant improvements are observed. This shows that the superior performance of C3 is not due to the extra epochs but to its objective function, which helps the network discover patterns that are complementary to those that CC extracts. The result of this experiment on CIFAR-10 is shown in Figure 5.
4.4 How does C3 loss improve the clusters?
As we saw in Figure 3, the improvement that C3 achieves is mainly because it reduces the distance between instances of the same cluster while repelling them from other clusters. We can justify this observation by considering the loss function of C3, i.e., Eq. (9). In this function, the indicator term $\mathbb{1}\{z_i^\top z_j \geq \zeta\}$ specifies that if the cosine similarity of two samples is at least $\zeta$, they should be moved even closer to each other. At the beginning of the second phase of C3, since the z-space has already been trained with CC during the first step, we can assume that points that are very similar to each other have a cosine similarity above this threshold, so they move even closer together when the C3 loss is minimized. For instance, take two points $z_1$ and $z_2$ with a cosine similarity larger than $\zeta$, and assume that a third point $z_3$ has a similarity greater than $\zeta$ with $z_2$, but its similarity with $z_1$ is smaller than $\zeta$. According to the loss function, $z_1$ and $z_2$, as well as $z_2$ and $z_3$, are forced to become closer, but this is not the case for $z_1$ and $z_3$. However, these two points will also implicitly move closer to each other because their distances to $z_2$ are reduced. As the training continues, at some point, the similarity of $z_1$ and $z_3$ may also pass the threshold $\zeta$. Therefore, as similar pairs move closer to each other during training, a series of new connections is formed, and the cluster becomes denser. To support this hypothesis, we plotted the average of the loss function and the average number of positive pairs per data sample in Figures 6-a and 6-b, respectively. We can observe that the number of positive pairs increases exponentially during training until it settles to form the final clusters. Corresponding to this exponential increase, we can see that the loss decreases, and the network learns a representation in which clusters are distanced from each other while the samples of each cluster are packed together.
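The chaining effect described above can be reproduced with a toy example in two dimensions. The sketch below is purely illustrative: the angles, the threshold `zeta = 0.7`, and the helper names are hypothetical, and the "after" configuration simply assumes that the two outer points have each moved 20 degrees toward the middle one.

```python
import math

def unit(angle_deg):
    """Unit vector at the given angle, so dot products are cosine similarities."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

def positive_pairs(points, zeta):
    """Index pairs (i, j), i < j, whose cosine similarity reaches the threshold."""
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sim = points[i][0] * points[j][0] + points[i][1] * points[j][1]
            if sim >= zeta:
                pairs.append((i, j))
    return pairs

zeta = 0.7  # hypothetical threshold for this toy example

# Neighbours are 40 degrees apart (cos 40 ~ 0.77), the outer points 80 (cos 80 ~ 0.17).
before = [unit(0), unit(40), unit(80)]
# Pulling each outer point 20 degrees toward the middle one leaves them 40 degrees apart.
after = [unit(20), unit(40), unit(60)]

pairs_before = positive_pairs(before, zeta)  # -> [(0, 1), (1, 2)]
pairs_after = positive_pairs(after, zeta)    # -> [(0, 1), (0, 2), (1, 2)]
```

The pair of outer points is never attracted directly, yet it crosses the threshold once both points have moved toward their shared neighbour, which mirrors the growth in positive pairs reported in Figure 6-b.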
We can also deduce from this experiment that the number of positive pairs is related to the number of classes in each dataset. For example, for a given augmented batch size, Tiny-ImageNet, which has 200 classes, is expected to contain only a few samples of each class per batch and therefore only a few positive pairs per sample, which is why we do not see the same sharp increase for Tiny-ImageNet as for the other datasets.
4.5 Effect of Hyperparameter $\zeta$
Our method, C3, introduces a new hyperparameter $\zeta$, a threshold for identifying similar samples. Throughout the experiments, we kept $\zeta$ fixed, which yielded consistent results across most datasets. Now, we carry out an experiment in which we change $\zeta$ and record the performance. Note that since $z_i^\top z_j \in [-1, 1]$, we can technically change $\zeta$ from -1 to 1. Intuitively, for a small or negative value of $\zeta$, most points in the z-space will be considered similar, and the resulting clusters will not be reliable. Therefore, in our experiment, we change $\zeta$ from 0.4 to 0.9 in 0.1 increments for CIFAR-10. For Tiny-ImageNet, as there are many clusters, we consider stricter thresholds. We then plot the accuracy, NMI, average loss, and average number of positive pairs per sample. The resulting plots are shown in Figure 7.
In the CIFAR-10 experiments, in Figure 7-a and Figure 7-b, we see that for the smallest values of $\zeta$, accuracy and NMI actually decrease during the C3 training. This is because such a threshold is too lenient and treats points that are not in the same cluster as similar. We can confirm this explanation by looking at Figure 7-d: for smaller values of $\zeta$, there are more positive pairs per sample on average. As we increase $\zeta$, the performance begins to improve. For the largest values of $\zeta$, the performance does not change significantly during training. This is because such a threshold is strict, and the number of positive pairs shows that only a few instances are identified as similar during training.
Comparing the results of the CIFAR-10 and Tiny-ImageNet experiments shows that the appropriate value of $\zeta$ also depends on the number of clusters. Since Tiny-ImageNet has 200 classes, a smaller value of $\zeta$ might cause two or more clusters to merge, which would decrease the accuracy. Therefore, we should choose a stricter threshold to improve the performance on this dataset.
In Figure 7-c and Figure 7-g, the average loss plot also conveys interesting observations about the behavior of $\zeta$. We can see that for smaller values of $\zeta$, the loss converges exponentially to the minimum, but for larger $\zeta$, the rate is much slower. This may be because a smaller $\zeta$ considers most points to be similar and of the same class, and can therefore yield the trivial solution of treating all points as similar and mapping them onto one central point. In the extreme case of $\zeta = 1$, the loss function becomes similar to that of CC, and therefore, we do not obtain any major improvement. In contrast, if we set $\zeta = -1$, the loss considers all points to be positive, and the numerator and denominator of Eq. (9) become equal. Therefore, the loss function becomes zero and the network does not train.
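The degenerate case at the lower end of the range can be checked directly. Assuming the C3 loss has the thresholded form described for Eq. (9), with normalized representations $z_i^a$, $z_j^k$ and a temperature $\tau_I$ (notation assumed for this sketch), setting $\zeta = -1$ gives:

```latex
\zeta = -1:\quad
\mathbb{1}\left\{ z_i^{a\top} z_j^{k} \geq -1 \right\} = 1 \quad \forall\, j, k
\;\Longrightarrow\;
\tilde{\ell}_i^a
  = -\log \frac{\sum_{k}\sum_{j} \exp\left( z_i^{a\top} z_j^{k} / \tau_I \right)}
               {\sum_{k}\sum_{j} \exp\left( z_i^{a\top} z_j^{k} / \tau_I \right)}
  = -\log 1 = 0
```

Since the loss is identically zero for every set of parameters, its gradient also vanishes, which is why no training takes place in this setting.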
Following the above discussion, we suggest a moderate value of $\zeta$, which offers a good balance. However, the choice of $\zeta$ might be influenced by the number of clusters in the data. If we have a large number of clusters, it would be better to choose a large $\zeta$. On the other hand, if the data has a small number of clusters, a smaller (but not too small) $\zeta$ is preferred since it trains faster. In our experiments, we used the same value of $\zeta$ for all datasets except Tiny-ImageNet, which has 200 classes and for which we used a larger value.
In this paper, we proposed C3, an algorithm for contrastive data clustering that incorporates the similarity between different instances to form a better representation for clustering. We experimentally showed that our method can significantly outperform the state-of-the-art on five challenging computer vision datasets. In addition, through further experiments, we evaluated different aspects of our algorithm and provided several intuitions on how and why our proposed scheme helps in learning a more cluster-friendly representation. The focus of this work was on image clustering, but our idea can also be applied to clustering other data types, such as text and categorical data, in future work.