Information based Deep Clustering: An experimental study

10/03/2019 ∙ by Jizong Peng, et al. ∙ 0

Recently, two methods have shown outstanding performance for clustering images and jointly learning the feature representation. The first, called Information Maximizing Self-Augmented Training (IMSAT), maximizes the mutual information between input and clusters while using a regularization term based on virtual adversarial examples. The second, named Invariant Information Clustering (IIC), maximizes the mutual information between the clustering of a sample and its geometrically transformed version. These methods use mutual information in distinct ways and leverage different kinds of transformations. This work proposes a comprehensive analysis of transformations and losses for deep clustering, where we compare numerous combinations of these two components and evaluate how they interact with one another. Results suggest that mutual information between a sample and its transformed representation leads to state-of-the-art performance for deep clustering, especially when used jointly with geometrical and adversarial transformations.




1 Introduction

In deep learning, supervised methods have shown excellent performance, sometimes even surpassing the human level [23, 11]. However, these methods require large datasets with fully annotated data, which can be unaffordable in many cases [43]. Instead, unsupervised methods can learn from data without annotations, which is very appealing given the large amount of data that can easily be collected from social media and other sources [39]. In this paper, we are interested in the unsupervised learning problem of deep clustering, which consists in learning to group data into clusters, while at the same time finding the representation that best explains the data. Jointly learning to group data (clustering) and represent data (representation learning) is an ill-posed problem which can lead to poor or degenerate solutions [46, 45, 5]. A principled way to avoid most of these problems is mutual information [35]. Mutual information is a powerful approach for clustering because it does not make assumptions about the data distribution and reduces the problem of mode collapse, where most of the data is grouped in a single large cluster [5].

Figure 1: Information Clustering Components. We decompose information-based clustering into two main components: the loss and the pairwise transformations applied to the input samples. We experimentally found that correctly combining the losses and transformations of IMSAT and IIC leads to superior results.

In recent publications, two papers obtained outstanding results for deep clustering by using mutual information in different ways. The first one, IMSAT [14], maximizes the mutual information between input data and the cluster assignment, and regularizes it with virtual adversarial samples [30] by imposing that the original sample and the adversarial sample should have similar cluster assignment probability distributions (by minimizing their KL divergence). The second one, IIC [16], maximizes the mutual information of the cluster assignment between a sample and the same sample after applying a geometrical transformation.

Both algorithms are based on a mutual information loss (applied either to input and output for IMSAT, or to output and transformed output for IIC) which relies on pairwise associations of an image and its transformed version (either by a geometrical transformation or an adversarial one). In this paper, we aim to analyze and better understand these algorithms by decomposing them into two basic building blocks: the information-based loss and the transformation used. As illustrated in Fig. 1, these two blocks can be combined in multiple and non-trivial ways. By following this strategy, we build a family of clustering approaches of which IMSAT and IIC are special cases. We evaluate the different instances of this family of methods on three datasets of different difficulty, and perform an in-depth analysis of the obtained results.

Our empirical analysis shows that: i) maximizing the mutual information between an image clustering distribution and its transformed version seems to be more robust than other approaches when dealing with challenging datasets; ii) the combined use of different transformations seems beneficial (e.g. geometrical and adversarial) as they capture complementary characteristics of images; iii) the optimal way of including a transformation in the loss differs across transformations. For instance, we found that combining geometrical transformations in the mutual information loss with adversarial transformations used as regularization leads to improved accuracy.

Finally, we show in the experimental results that deep clustering is a convenient way to perform unsupervised learning since the final task is closely related to classification. Moreover, when the representation is used for supervised tasks, it largely outperforms other approaches.

In the remainder of this paper, we first introduce related work in section 2, with special attention to the two mentioned methods. Then, in section 3, we propose the experimental protocol by defining the different components of our experiments. Finally, we report results in section 4 and draw conclusions in section 5.

2 Related work

Mutual information

Mutual information is an information-theoretic criterion to measure the dependency between two random variables $X$ and $Y$. It is defined as the KL divergence between the joint distribution $p(X, Y)$ of two variables and the product of their marginals: $I(X, Y) = D_{KL}\big(p(X, Y) \,\|\, p(X)\,p(Y)\big)$.
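As a small worked example of this definition, the following sketch computes the mutual information of a discrete joint distribution stored as a 2-D table. The function name `mutual_information` and the two toy tables are illustrative assumptions, not part of the original study.

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """Mutual information (in nats) of a discrete joint distribution table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalise so the entries sum to 1
    p_x = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_y = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    # KL divergence between the joint and the product of the marginals
    ratio = (joint + eps) / (p_x * p_y + eps)
    return float(np.sum(joint * np.log(ratio)))

# Hypothetical toy tables: an independent pair and a fully dependent pair.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

print(mutual_information(independent))  # close to 0
print(mutual_information(dependent))    # close to log(2), about 0.693
```

The independent table gives a value near zero because its joint equals the product of its marginals, while the diagonal table reaches the maximum possible for two balanced binary variables.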

The criterion of maximizing mutual information for clustering was first introduced in [4], as the firm but fair criterion. In this case the mutual information between input data (i.e., image or representation) and output categorical distribution is maximized, under the belief that the class distribution can be deduced given the input. This principle is extended in [21], in which mutual information is maximized together with an explicit regularization loss. This helps to avoid overly complex decision boundaries.

In [42, 38, 41], mutual information is used to align inputs with different modalities, because with different modalities normal distances are meaningless. Mutual information is also used as regularization in a semi-supervised setting [29]. Recently, DeepInfoMax [13] simultaneously estimates and maximizes mutual information between images and learned high-level representations. However, estimating the mutual information of images is hard and requires complex techniques.

Finally, two recent techniques for deep clustering based on mutual information are IMSAT [14] and IIC [15]. As these are the starting point of this study, they will be analyzed in more detail in section 3.

Self-supervised approaches

Self-supervised learning has recently emerged as a way to learn representations from non-annotated data. The main principle is to turn the unsupervised task into a supervised one by defining pseudo labels that are automatically generated by a pretext task without involving any human annotations [17]. The network trained with the pretext task is then used as the initialization for downstream tasks, such as image classification, semantic segmentation and object detection. It has been shown that a good pretext task can be beneficial and help to improve the performance of the downstream task [36, 8, 33, 6, 1]. Without access to label information, the pretext task is usually defined based on the data structure and is somewhat related to, but different from, the downstream tasks. Various pretext tasks have been proposed and investigated. Similarity-based methods [33, 6] design a pretext task that lets the network learn the semantic similarities between image patches. Likewise, [32, 1, 44] trained a network to recognize the spatial relationship between image patches. See [17] for a complete survey on such methods.

If the task we want to learn with non-annotated data is classification, we argue in this work that the best pretext task is clustering. With clustering, the pretext task is very close to the downstream task, which is classification. In fact, clustering aims to group the data in a meaningful way and therefore split the data into categories. If the samples in these categories are not only visually but also semantically similar, classification and clustering become the same task. In other words, with clustering, there is no need for a downstream task. By assigning the most likely category to each cluster, our clustering method becomes a classifier. In the experimental evaluation, we will compare the performance of our information-based clustering with state-of-the-art self-supervised learning approaches. To do so, we make two simple assumptions: i) the sample distribution per class is known (normally uniform) and ii) the exact number of classes, which corresponds to the number of clusters, is also known.

Clustering approaches

Clustering was studied long before the deep learning era. K-means [19] and GMM algorithms [2] were popular choices given representative features. Recently, much progress has been made by jointly training a network that performs feature extraction together with a clustering loss [45, 26, 10, 6]. Deep Embedded Clustering (DEC) [45] is a representative method that uses an auto-encoder as the network architecture and a cluster-assignment hardening loss for regularization. Li et al. [26] proposed a similar network architecture but with a boosted discrimination module to gradually enforce cluster purity. DEPICT [10] improved the clustering algorithm's scalability by explicitly leveraging class distributions as prior information. DeepCluster [6] is an end-to-end algorithm that jointly trains a ConvNet with K-means and groups high-level features into pseudo labels. Those pseudo labels are in turn used to retrain the network after each iteration. In this work, we focus on two state-of-the-art information-based clustering approaches [14, 15] and analyze their different components and how they can be combined in a meaningful way.

3 Information-based clustering

3.1 Overview

As shown in Fig. 1, we consider information-based clustering as a family of methods having two main components: the mutual information loss and the transformations used. The maximization of the mutual information aims to produce meaningful groups of data, i.e. clusters with a similar representation and with an even number of samples. On the other hand, transformations are used to make the learned representation locally smooth and to ease the optimization, in a similar way to data augmentation. For these two components, we consider and evaluate different possible choices and their combinations.

3.2 Mutual Information losses

$I(Y, \tilde{Y})$: This formulation, introduced in IIC [15], maximizes the mutual information between the clustering assignment variable $Y$ and the clustering assignment $\tilde{Y}$ of a transformed sample. Mutual information is defined as the KL divergence between the joint probability of two variables and the product of their marginals [4]:

$$I(Y, \tilde{Y}) = D_{KL}\big(p(Y, \tilde{Y}) \,\|\, p(Y)\,p(\tilde{Y})\big) \qquad (1)$$

It measures the information shared by the two variables. If the variables are independent, the mutual information is zero because the joint is equal to the product of the marginals. Thus, we need to estimate the joint probability $p(Y, \tilde{Y})$ of the clustering assignment $Y$ and its transformed version $\tilde{Y}$, as well as the marginals $p(Y)$ and $p(\tilde{Y})$. Writing $p(\tilde{Y}|X)$ for the network output on the transformed version of $X$, the marginals are defined as

$$p(Y) = \mathbb{E}_{X}\big[p(Y|X)\big], \qquad p(\tilde{Y}) = \mathbb{E}_{X}\big[p(\tilde{Y}|X)\big] \qquad (2)$$

and can be empirically estimated by averaging the outputs over mini-batches. For the joint, we compute the product between $p(Y|X)$ and $p(\tilde{Y}|X)$ for each sample (an outer product over cluster indices) and marginalize over $X$:

$$p(Y, \tilde{Y}) = \mathbb{E}_{X}\big[p(Y|X)\,p(\tilde{Y}|X)^{\top}\big] \qquad (3)$$

For each sample, the joint probability of $Y$ and $\tilde{Y}$ is $p(Y|X)\,p(\tilde{Y}|X)^{\top}$. For a single sample, by construction, the joint and the product of the marginals coincide. However, when marginalizing the joint over samples $X$, the final $p(Y, \tilde{Y})$ will be different from $p(Y)\,p(\tilde{Y})$.

This formulation maximizes the predictability of one variable given the other. It is different from enforcing a small KL divergence between two distributions because: i) it does not force the two distributions to be the same, but only to contain the same information. For instance, one distribution can be transformed by an invertible operation without altering the mutual information. ii) it penalizes distributions that do not have uniform marginals, i.e. all the clusters should contain an even number of samples.
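The estimation procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `p` and `p_t` are hypothetical `(batch, clusters)` arrays holding the softmax outputs for the original and transformed samples, and the function name is ours rather than the authors'.

```python
import numpy as np

def iic_mutual_information(p, p_t, eps=1e-12):
    """Estimate the mutual information between cluster assignments of a
    batch of samples (p) and of their transformed versions (p_t)."""
    n = p.shape[0]
    joint = p.T @ p_t / n                      # mean of per-sample outer products
    p_y = joint.sum(axis=1, keepdims=True)     # marginal of the original assignment
    p_yt = joint.sum(axis=0, keepdims=True)    # marginal of the transformed assignment
    log_ratio = np.log(joint + eps) - np.log(p_y + eps) - np.log(p_yt + eps)
    return float(np.sum(joint * log_ratio))

# Confident, balanced and consistent assignments give the maximum, log(4).
confident = np.eye(4)
print(iic_mutual_information(confident, confident))

# Uninformative (uniform) outputs carry no shared information.
uniform = np.full((4, 4), 0.25)
print(iic_mutual_information(uniform, uniform))
```

In training, the negative of this quantity would be minimized. The two toy batches illustrate the property noted in the text: the value is high only when assignments are both consistent across the pair and spread evenly over clusters.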
$I(X, Y)$: We consider the MI formulation used as loss in IMSAT [14]. It connects the input image distribution $X$ with the output cluster assignment $Y$ of the neural network. As the input of the neural network is a continuous vector, estimating its probability distribution is hard and we cannot directly use equ. (1). Instead, in IMSAT the mutual information between input and output is computed as:

$$I(X, Y) = H(Y) - H(Y|X) \qquad (4)$$

In this formulation, the mutual information is easy to compute because it is the difference between the entropy of the output $H(Y)$ and the conditional entropy $H(Y|X)$. Both quantities can be approximated in a mini-batch stochastic gradient descent setting. $H(Y)$ is approximated as the entropy of the average probability distribution over the given samples, while $H(Y|X)$ is approximated as the average of the conditional entropy of each sample. As entropy is a non-linear operation, the two quantities are different. $H(Y)$ is maximized when the probability of each cluster is the same, i.e. the average output assigns the same probability to each cluster. On the other hand, $H(Y|X)$ is minimized when, for each sample $x$, $p(Y|x)$ has most of its probability mass assigned to a single cluster, i.e. the model is certain about a given choice. The two combined mean that the model has to choose just one cluster for each sample and that, globally, each cluster should contain the same number of samples. In case the class distribution is not uniform, another distribution can be enforced by penalizing the KL divergence between $p(Y)$ and the desired class prior. In our experiments, we limit ourselves to a uniform class distribution. Notice that, in IMSAT, the authors add a hyper-parameter $\mu$ to the MI formulation: $\mu H(Y) - H(Y|X)$. This parameter is set to the minimum value that ensures that the data is evenly distributed over all clusters. We follow the same approach as in the original paper.
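The mini-batch approximation described above can be illustrated with the following sketch. The function names and the weighting argument `mu` (standing in for the hyper-parameter mentioned in the text) are assumptions made for illustration, and `p` is a hypothetical `(batch, clusters)` array of softmax outputs.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def imsat_mutual_information(p, mu=1.0):
    """Mini-batch estimate of mu * H(Y) - H(Y|X)."""
    h_y = entropy(p.mean(axis=0))      # entropy of the average prediction
    h_y_given_x = entropy(p).mean()    # average of the per-sample entropies
    return float(mu * h_y - h_y_given_x)

balanced = np.eye(4)                       # confident and evenly spread
uniform = np.full((4, 4), 0.25)            # every sample is undecided
collapsed = np.tile([1.0, 0, 0, 0], (4, 1))  # everything in one cluster

print(imsat_mutual_information(balanced))   # close to log(4)
print(imsat_mutual_information(uniform))    # close to 0
print(imsat_mutual_information(collapsed))  # close to 0
```

Only the confident and balanced batch obtains a high value: the undecided batch is penalized through the conditional entropy, and the collapsed batch through the marginal entropy.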
Regularization: In the context of this work, we call regularization an additional loss that penalizes differences between the outputs of the model for the original image and the transformed image. While for $I(X, Y)$ this regularization is fundamental for good results, for $I(Y, \tilde{Y})$ it is an optional step to strongly enforce a transformation. In this work, as we have access to the clustering probability distribution $p(Y|x)$ of a sample $x$, we use the KL divergence as penalty term: $D_{KL}\big(p(Y|x) \,\|\, p(\tilde{Y}|x)\big)$. Although in unsupervised settings there is no real difference between loss and regularization, because in both cases the training is performed without annotations, here we consider this KL divergence as a regularization because it cannot be used alone for clustering. It needs at least another term, such as $H(Y)$, to enforce an even distribution of samples in the clusters.
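A minimal sketch of such a KL penalty is given below, assuming `p` and `p_t` are hypothetical `(batch, clusters)` arrays of output probabilities for the original and transformed samples. The last example illustrates why the penalty cannot be used alone: a collapsed solution satisfies it perfectly.

```python
import numpy as np

def kl_regularizer(p, p_t, eps=1e-12):
    """Mean KL divergence between the outputs for original samples (p)
    and for their transformed versions (p_t)."""
    kl = np.sum(p * (np.log(p + eps) - np.log(p_t + eps)), axis=1)
    return float(kl.mean())

p = np.array([[0.9, 0.1], [0.2, 0.8]])
p_t = np.array([[0.6, 0.4], [0.5, 0.5]])
print(kl_regularizer(p, p_t))   # positive: the two outputs disagree
print(kl_regularizer(p, p))     # close to 0: identical outputs

# A degenerate model that puts every sample in cluster 0 also has zero penalty.
collapsed = np.tile([1.0, 0.0], (2, 1))
print(kl_regularizer(collapsed, collapsed))  # close to 0
```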

3.3 Transformations

Transformations seem to be a key component of MI-based clustering. To be useful, any transformation needs to change the appearance of the image (in terms of pixels) while maintaining its semantic content, i.e. the class of the image. We can think of transformations as a pseudo ground-truth that helps to train the model. In this work, we consider three kinds of transformations: Geometrical, Adversarial and Mixup.
Geometrical: Geometrical transformations are the image transformations that are normally used for data augmentation. As in [16], we use random crop, resize at multiple scales, horizontal flip, and color jitter with the same range of parameters that is normally used in data augmentation. Note that some transformations can actually change the class of an image. For instance, on MNIST [25], a dataset of handwritten digits, a crop of a 6 can zoom in on the lower circle and look very similar to a 0. We discuss this problem further in the experimental results. In IIC, geometrical transformations are used directly in the mutual information loss; in our experiments, we also tested their usefulness for regularization.
Adversarial: Adversarial samples [47] are samples that are slightly modified by an adversarial noise which is usually unnoticeable by the human eye, but can induce a neural network to misclassify an example. Recently, methods based on adversarial examples have attracted a lot of attention because they can easily fool machine learning algorithms and thus represent a threat to any system using machine learning [40, 7]. It has been shown [27] that adding those samples during training can help to improve the robustness of the classifier. In this study, we use Virtual Adversarial Training (VAT) [30], an extension of adversarial attacks that can also be used for non-labelled samples and has shown promising results for fully-supervised, semi-supervised [30] and unsupervised learning [14]. The adversarial noise $r_{adv}$ can be found as the value within a certain neighbourhood of radius $\epsilon$ that maximizes the distance between the probability distribution of the original sample $x$ and the transformed sample $x + r$:

$$r_{adv} = \arg\max_{r:\ \|r\|_2 \le \epsilon} D\big(p(Y|x),\ p(Y|x + r)\big) \qquad (5)$$

where $D$ is a divergence function, usually defined as the KL divergence. In practice, equ. (5) can be optimized in order to find $r_{adv}$ with a few iterations of the power method [18]. Note that we could also experiment with adversarial geometrical transformations as in [37], but we leave this direction as future work.
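To make the power-method approximation concrete, here is a sketch on a toy model. It is not the network used in the paper: we assume a hypothetical linear-softmax classifier `softmax(x @ W)`, for which the gradient of the KL divergence with respect to the perturbation has a closed form, so no automatic differentiation is needed. The names `vat_perturbation`, `epsilon`, `xi` and `n_power` are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def vat_perturbation(x, W, epsilon=1.0, xi=1e-2, n_power=1, seed=0):
    """Approximate adversarial noise for the toy model softmax(x @ W)."""
    rng = np.random.default_rng(seed)
    p = softmax(x @ W)                       # clean prediction, held fixed
    d = rng.normal(size=x.shape)             # random starting direction
    for _ in range(n_power):
        d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
        q = softmax((x + xi * d) @ W)        # prediction on a slightly moved input
        # Gradient of KL(p || q) w.r.t. the perturbation for a linear-softmax model
        d = (q - p) @ W.T
    d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
    return epsilon * d                       # rescale to the allowed radius

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))                  # hypothetical inputs
W = rng.normal(size=(3, 4))                  # hypothetical weights
r_adv = vat_perturbation(x, W, epsilon=0.5)
print(np.linalg.norm(r_adv, axis=1))         # each row has norm close to 0.5
print(kl(softmax(x @ W), softmax((x + r_adv) @ W)))
```

With a deep network, the closed-form gradient line would be replaced by a backward pass through the model; the rest of the loop stays the same.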
Mixup: This is a simple data augmentation technique that has proven successful for supervised learning [48]. It consists in creating a new sample $\tilde{x}$ and label $\tilde{y}$ by linearly combining two training samples $x_i$ and $x_j$ (e.g. images) and labels $y_i$ and $y_j$ (e.g. class probabilities):

$$\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda)\, y_j \qquad (6)$$

$\lambda$ is the mixing coefficient and is normally sampled from a Beta distribution. Although very simple and effective, Mixup has received multiple criticisms because it is clear that the generated images do not represent real samples. However, the Beta distribution used has most of its mass near 0 and 1, which means that in most cases the mixed samples look very similar to one of the two samples, but with a structured noise coming from the other image. This transformation differs from the previous ones because it requires two input samples to generate a new one. Thus, to use it in our family of algorithms, we had to adapt it. For $I(Y, \tilde{Y})$, as before, $Y$ is the expected output of real samples $X$, while $\tilde{Y}$ is now the output associated with mixup samples generated using the same real samples in combination with other, randomly selected samples. Mixup can also be used as direct regularization (see next section). In this case, the first output distribution is associated with real samples while the second is associated with mixup samples built as above.
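The unsupervised adaptation described above, where each real sample is mixed with a randomly selected partner from the same batch, could look like the following sketch. The function name and the value `alpha=0.2` are assumptions for illustration; the text does not state the parameters of the sampling distribution.

```python
import numpy as np

def mixup_batch(x, alpha=0.2, seed=0):
    """Mix every sample of a batch with a randomly selected partner."""
    rng = np.random.default_rng(seed)
    # One mixing coefficient per sample, broadcast over the remaining axes
    lam_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    lam = rng.beta(alpha, alpha, size=lam_shape)
    partner = rng.permutation(x.shape[0])     # index of the second sample
    x_mix = lam * x + (1.0 - lam) * x[partner]
    return x_mix, lam, partner

images = np.random.default_rng(1).random((8, 3, 32, 32))  # hypothetical batch
mixed, lam, partner = mixup_batch(images)
print(mixed.shape)     # (8, 3, 32, 32)
print(lam.ravel())     # with alpha < 1, values tend to lie near 0 or 1
```

The network outputs on `images` and on `mixed` would then play the roles of the two distributions that are paired in the loss or in the regularizer.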

4 Experiments

Our main experiment evaluates, on three datasets (presented below), the two identified components of information-based clustering: information-based losses and image transformations. To do so, we summarize all our experiments in two main tables in which only one component (either the losses or the transformations) is considered and the other is treated as a hyper-parameter that we want to optimize. For completeness, we report all results with all combinations of the different components in the supplementary material. In the following tables, for each result, we added a code that indicates how to find that result in the complete set of experiments. In the second part of our experiments, we present other interesting findings that are not visible from the two tables. We first compare the performance of different combinations of mutual information losses and transformations, showing that only some of them make sense and can produce meaningful results. Then, we show that the proposed clustering can be used to initialize the parameters of a network. We compare our best model with DeepInfoMax on the task of linear supervised classification using the representation learned by the respective methods. Finally, we visualize the clusters of our models on the three datasets.

4.1 Datasets

We evaluate the different methods on 3 datasets:

  • MNIST dataset [25] of hand-written digit classification consists of 60,000 training images and 10,000 validation images. 10 classes are evenly distributed in both train and test sets. Following common practice, we mix the training and test set to form a large training set. During training, we do not show any ground truth information, while for testing, we use image annotations to find a mapping between true class label and cluster assignment, thus assessing the clustering performance by the classification accuracy.

  • CIFAR10 [22] is a popular dataset consisting of 60,000 32×32 color images in 10 classes, with 6,000 images per class. Similar to the MNIST dataset, we mix the 50,000 training images with the 10,000 test images to build a larger dataset for clustering.

  • SVHN [31] is a real-world image dataset for digit recognition, consisting of 73,257 digits for training, 26,032 digits for testing. Images come from natural scene images. We adopt the previously described strategy to use this dataset too.

4.2 Evaluation Metric

Our method groups samples into clusters. If the grouping is meaningful, it should be related to the dataset classes. Thus, in most of our experiments, we use classification accuracy as a measure of the clustering quality. This makes sense because the final aim of this approach is precisely to produce a classifier without using training labels. This accuracy is based on the best possible one-to-one mapping (using the Hungarian method [24]) between clustering assignment and ground truth label (assuming they share the same number of classes). We run the experiments 3 times with different initializations and report mean and standard deviation values.
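This evaluation can be sketched as follows, using SciPy's `linear_sum_assignment` to solve the one-to-one matching. The function name `clustering_accuracy` and the toy labels are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, n_classes):
    """Accuracy under the best one-to-one mapping from clusters to classes."""
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    for cluster, label in zip(y_pred, y_true):
        counts[cluster, label] += 1           # co-occurrence of cluster and class
    # The solver minimises cost, so negate the counts to maximise matches
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
print(clustering_accuracy(y_true, [2, 2, 0, 0, 1, 1], 3))  # 1.0: a pure relabelling
print(clustering_accuracy(y_true, [2, 2, 0, 0, 1, 0], 3))  # 5 of 6 samples matched
```

Because cluster indices are arbitrary, the first prediction is scored as perfect even though no index agrees with the ground-truth label.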

4.3 Implementation Details

In order to provide a fair comparison, we use the same network for a given dataset across methods. For both the MNIST and SVHN datasets, we borrow the setting of IIC [16], using a VGG-based convolutional network as our backbone network. For CIFAR-10, we use a ResNet-34 [12] based network. It is worth mentioning that, in the original IMSAT paper [14], the network consisted of just two fully connected layers applied to pre-trained features on CIFAR-10 or GIST features [34] on SVHN. Instead, in this work, we want to compare all results on the same convolutional architecture and without pre-trained models or any hand-crafted features.

To boost performance, as in [16], we use two additional procedures. The first one, over-clustering, consists in using more clusters than the number of classes in the training data. This can help to find sub-classes and therefore reduce the intra-class variability in each cluster. The second consists in splitting the last layer of the network into multiple final layers (called heads in [16]), each producing its own clustering. This can increase diversity and acts as a simplified form of ensembling. Combining these two techniques can substantially boost the final performance of the clustering approach. However, they also increase the computational cost of the model. Thus, to evaluate all configurations in the same setting, we use a basic model without over-clustering or multiple final layers. For our best configuration, however, we retrained it with 5 final layers with 10 clusters (as many as the number of classes) and another 5 final layers with 50 clusters for MNIST and SVHN, or 70 clusters for CIFAR10.

Loss                       MNIST       CIFAR10      SVHN
I(X,Y)                     42.6±3.9    16.0±1.0     15.5±1.5
I(X,Y) + KL                97.7±0.3    22.6±0.5     19.7±3.1
I(Y,Ỹ)                     98.0±0.0    31.9±1.1     28.0±4.0
I(Y,Ỹ) + KL                97.6±0.1    36.6±0.5     28.9±5.7
I(X,Y) + I(Y,Ỹ) + KL       90.4±6.2    37.5±12.9    34.3±3.9
Table 1: Mutual Information Losses. We consider the information-based losses presented in section 3.2 and their combinations, and report results on the validation sets of the three datasets. The complete table with all experiments can be found in the supplementary material. The letters beside every accuracy value refer to the corresponding row in the supplementary material table.

4.4 Mutual Information losses

In table 1, we consider the two ways of using mutual information for clustering as explained in section 3.2. As we want to factor out the transformations in this analysis, reported results are obtained with the best performing transformations. For more details, see table 1 in the supplementary material. On MNIST, most of the methods perform quite well. Only $I(X, Y)$ alone does not obtain good results and, in fact, it obtains the lowest accuracy on all datasets (row 1). In contrast, if we induce similarity between the clustering probabilities of a sample and its transformed version with KL divergence (row 2; this configuration corresponds to IMSAT when the transformation is Adversarial), results improve considerably. This is because we can consider inducing similarity on the cluster probabilities as a form of self-supervision. Thus, not leveraging this source of information drastically reduces the quality of the results. When using $I(Y, \tilde{Y})$ instead, results are already good without KL divergence (row 3; this configuration corresponds to IIC when the transformation is geometrical). This is because, in this formulation, the mutual information directly considers the transformations. However, explicitly adding a KL term further improves results on CIFAR10 and SVHN (row 4). This shows that enforcing similarity between transformed samples with mutual information and with KL divergence are complementary strategies, and therefore using them jointly helps to further improve performance. Finally, combining the two different information losses together with the KL divergence (row 5) gives the best results on two of the three datasets.

Transformation   MNIST       CIFAR10     SVHN
Geometric        97.9±0.0    31.9±1.1    28.0±3.9
Adversarial      97.7±0.3    18.4±0.3    17.1±2.0
Mixup            53.2±5.6    20.1±1.2    17.1±1.7
Geo. + Adv.      93.6±7.4    36.6±0.5    28.9±5.8
Adv. + Mixup     97.2±0.3    29.8±1.5    19.1±0.0
Geo. + Mixup     98.0±0.0    35.9±1.4    26.3±12.4
All              97.6±0.1    29.8±1.5    26.8±2.1
Table 2: Transformations. The reported results show the performance of each transformation associated with the best information-based loss. The complete table with all experiments can be found in the supplementary material. The letters beside every accuracy value refer to the corresponding row in the supplementary material table.

4.5 Transformations

In table 2, we compare the three different kinds of transformations that are described in section 3.3. From the table, we can see that the best single transformation seems to be geometric. Compared to Adversarial, the gap is small on MNIST, but it becomes larger on the other datasets. For Mixup, performance is much lower than for geometric, but slightly better than for adversarial on CIFAR10. We also tested possible combinations of transformations. In this case, the most promising seems to be Geometric + Adversarial, although Geometric + Mixup also performs well. Finally, using the three transformations jointly does not seem to help. Thus, our best configuration is Geometric + Adversarial transformations.

Transform   Loss           MNIST        CIFAR10     SVHN
None        I(X,Y)         42.6±3.9     16.0±1.0    15.5±1.5
Geo         I(X,Y) + KL    43.8±47.8    14.5±0.7    17.4±1.5
Geo         I(Y,Ỹ)         97.8±0.0     31.9±1.1    28.0±4.0
Weak Geo    I(X,Y) + KL    56.7±0.5     22.2±0.5    32.3±18.4
Weak Geo    I(Y,Ỹ)         25.1±2.5     18.5±0.6    19.3±5.0
VAT         I(X,Y) + KL    97.7±0.3     18.4±0.4    17.1±2.0
VAT         I(Y,Ỹ)         11.3±0.0     17.4±0.4    17.0±1.9
IVAT        I(Y,Ỹ)         65.6±7.4     15.8±2.6    16.9±1.9
Table 3: Matching geometrical transformations and losses. We show that each transformation should be associated with the correct loss in order to obtain good results.

4.6 Matching losses and transformations

So far, we have analyzed different ways of clustering data from the point of view of information loss and transformations. However, this analysis does not show how information loss and transformations are combined in our experiments. Different transformations can be matched with the information loss in distinct ways, and only some of these configurations make sense and provide good results. In table 3, we consider this problem for Geometrical and Adversarial transformations.
Geometrical: For Geometrical transformations, $I(Y, \tilde{Y})$ works very well (row 3). However, when we try to use the same transformations in the KL regularization of $I(X, Y)$ (row 2), results are poor, sometimes even inferior to the mutual information loss without any regularization (row 1). We believe this is due to the different ways these transformations are used in the two cases. In the case of $I(Y, \tilde{Y})$, we impose that the network outputs for an example and its transformed version should have maximum mutual information. In contrast, with the KL regularization, we explicitly want the two output distributions to be similar. Thus, the latter is a much stronger constraint than the former. Moreover, we analyzed the geometrical transformations used and discovered that they sometimes do not respect the basic rule of preserving the semantic meaning of the image, i.e. its class. For instance, in MNIST, a 6 with a crop can easily become a 0. We hypothesize that these ambiguities can confuse the training when enforcing geometrical consistency with KL divergence. Instead, if we only require maximizing the information between the original image and the transformation, clustering works well. To verify this hypothesis, we ran a second experiment with weaker geometrical transformations (just an image crop in which the new image is a few pixels smaller than the original), such that an image and its transformed version maintain the same class on all datasets. This new transformation, even if very weak, helps the KL divergence to obtain better results (row 4), while using such a weak transformation in $I(Y, \tilde{Y})$ does not help much (row 5).
Adversarial: Adversarial transformations are different from the other kinds of transformations because they depend on the actual loss that the network is optimizing. For instance, in $I(X, Y)$ with KL regularization, the adversarial transformation works well and helps to improve the performance of the method on all datasets (row 6). However, if we use the same adversarial transformations with $I(Y, \tilde{Y})$, results are very poor (row 7). We assume that this is due to a mismatch of losses. In fact, adversarial samples are generated to be adversarial to the KL divergence between the original sample and the noisy sample (see equ. (5)). However, with $I(Y, \tilde{Y})$, we optimize the mutual information between the outputs for the input and its adversarial transformation. So, the adversarial transformation is actually adversarial for KL divergence, but not for mutual information, and therefore it will not really help the clustering. Instead, if we use an adversarial sample generated against the mutual information loss, results improve. We still generate this new kind of adversarial samples with equ. (5), but using mutual information between the original sample and the transformed sample as distance $D$. We call this new transformation IVAT, for information-based VAT. Results of this experiment are shown in the last row of the table. The obtained accuracy is much better on MNIST than with VAT under the same loss, while slightly lower on CIFAR10 and SVHN.

(a) IIC with geometrical transformations.
(b) IMSAT with geometric transformations.
Figure 2: IIC and IMSAT with geometrical transformations. We visually compare, for the three datasets, the second and third rows of table 3, which correspond to our re-implementation of IIC and IMSAT but with geometrical transformations. Each row represents a class in the dataset. Images have been randomly selected.
Training DA Accuracy
Scratch no 72.9%
Pretrained no 80.6%
Scratch yes 81.6%
Pretrained yes 82.8%
Table 4: Clustering for pre-training. We evaluate how our best model can be used as initialization for supervised learning.
Method FC Conv Y
VAE  [20] 42.1 53.8 39.6
AAE  [28] 43.3 55.2 37.8
BiGAN  [9] 38.4 56.4 44.9
DeepInfoMax  [13] 54.1 63.3 49.6
IIC (our impl.) 73.6 78.7 59.4
Ours 75.4 78.9 64.7
Table 5: Supervised Linear Classifier. To compare our model with self-supervised approaches, we extract a representation from the fully connected layer (FC), the penultimate convolutional layer (Conv) and the output (Y) of our clustering network and use it for supervised training on CIFAR10.

4.7 Clustering for pre-training

In table 4, we consider the effect of using the network trained for clustering as pre-training for fully-supervised training on the same dataset (CIFAR10). When we train without any data augmentation, the initialization based on clustering gives an important boost in performance, going from an accuracy of 72.9% for a network trained from scratch to 80.6% for a network whose weights are initialized with our best model. When the network is trained with data augmentation, the gap between the two initializations is reduced to 1.2 points (81.6% versus 82.8%). This is probably because most of the useful information brought by the clustering pre-trained model concerns image transformations. Thus, when data augmentation is added, that information is included directly in the training.
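The gains discussed above follow directly from the entries of table 4. The short sketch below recomputes them; the dictionary layout and the function name `pretraining_gain` are illustrative choices, and only the four accuracies come from the table.

```python
# Accuracies (%) copied from table 4, keyed by (initialization, data augmentation).
table4 = {
    ("scratch", False): 72.9,
    ("pretrained", False): 80.6,
    ("scratch", True): 81.6,
    ("pretrained", True): 82.8,
}


def pretraining_gain(with_da):
    """Accuracy points gained by clustering-based initialization over training from scratch."""
    gain = table4[("pretrained", with_da)] - table4[("scratch", with_da)]
    return round(gain, 1)


print(pretraining_gain(False))  # 7.7 points without data augmentation
print(pretraining_gain(True))   # 1.2 points with data augmentation
```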

4.8 Train a Linear Classifier

As we perform clustering, the final results are already groups of samples that are visually similar. If the clustering works well, these groups should represent classes. Thus, in our experiments, no additional supervised learning is needed for evaluation. In contrast, methods based on self-supervision normally need a final evaluation step in which supervision is used to learn the final classifier. An evaluation often used for these methods consists in training a linear classifier, in a fully-supervised way, on top of the learned representation. In order to compare with methods based on self-supervision, we use our learned representation to train a linear classifier. We argue that, as clustering is very similar to the final classification task, results of our method will be better than those of other approaches. Table 5 compares DeepInfoMax [13], a recent self-supervised method based on mutual information, with two versions of our clustering approach. We also report results of other methods from [13]. We report the accuracy obtained by a linear classifier trained on the fully-connected layer (FC), the penultimate convolutional layer (Conv) and the output (Y), which has 10 values in our case and 64 for DeepInfoMax. As expected, our best model outperforms DeepInfoMax on all three representations: according to table 5, the gap is about 21 points for FC, 16 for Conv and 15 for the output Y. The reduced gap in the output is explained by the low dimensionality of the final output (10) for our model.
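The linear evaluation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the training setup used for table 5: the features are a made-up toy set standing in for frozen FC, Conv or Y activations, the optimizer is plain full-batch gradient descent, and the names `train_linear_probe`, `predict` and `accuracy` are hypothetical. The representation is kept fixed and only the weights of a softmax classifier are learned.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def scores(W, b, x):
    """Linear scores W x + b, one per class."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200, seed=0):
    """Fit a softmax classifier on fixed (frozen) features by gradient descent."""
    rng = random.Random(seed)
    dim, n = len(features[0]), len(features)
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(W, b, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient w.r.t. score c
                gb[c] += err / n
                for j in range(dim):
                    gW[c][j] += err * x[j] / n
        for c in range(num_classes):
            b[c] -= lr * gb[c]
            for j in range(dim):
                W[c][j] -= lr * gW[c][j]
    return W, b


def predict(W, b, x):
    s = scores(W, b, x)
    return s.index(max(s))


def accuracy(W, b, features, labels):
    return sum(predict(W, b, x) == y for x, y in zip(features, labels)) / len(labels)


# Toy "Y" features: soft cluster assignments whose cluster index differs from the class label.
feats = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
         [0.2, 0.7, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
labels = [2, 2, 0, 0, 1, 1]
W, b = train_linear_probe(feats, labels, num_classes=3)
print(accuracy(W, b, feats, labels))
```

When the clusters already coincide with the classes, as in this toy set, the linear classifier only has to learn the mapping between cluster indices and class labels.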

4.9 Visualization of the clusters

In Fig. 2 we visually compare the clustering performance of the two different mutual information losses with geometrical transformations (as in rows 2 and 3 of table 3). Each row represents a cluster found in an unsupervised way. If the samples in each row belong to the same class, the clustering has found a meaningful grouping and the associated classification accuracy will be high. From visual inspection, and consistently with the accuracies in table 3, we observe that the clustering based on the IIC loss with geometrical transformations performs better than the clustering based on the IMSAT loss with the same transformations.
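The link between pure rows and high accuracy can be illustrated with a small sketch. It assumes that accuracy is computed under the best one-to-one mapping between clusters and classes, the usual convention for clustering accuracy. For brevity the mapping is found by brute force over all permutations, which is only practical for a handful of clusters; the Hungarian method [24] solves the same assignment problem efficiently. The function names and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(cluster_ids, class_ids, k):
    """Best accuracy over all one-to-one cluster-to-class mappings (brute force, k! cases)."""
    counts = [[0] * k for _ in range(k)]  # counts[c][y]: samples of class y in cluster c
    for c, y in zip(cluster_ids, class_ids):
        counts[c][y] += 1
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in permutations(range(k))
    )
    return best / len(cluster_ids)


def rows_by_cluster(cluster_ids, k):
    """Indices of the samples assigned to each cluster, one list per row of a grid."""
    rows = [[] for _ in range(k)]
    for i, c in enumerate(cluster_ids):
        rows[c].append(i)
    return rows


# Toy example: cluster 2 contains one sample of the wrong class.
clusters = [0, 0, 1, 1, 2, 2, 2]
classes = [2, 2, 0, 0, 1, 1, 0]
print(clustering_accuracy(clusters, classes, 3))
print(rows_by_cluster(clusters, 3))
```

Here two of the three rows are pure and the third contains a single intruder, so six of the seven samples are counted as correct under the best mapping.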

5 Conclusions

In this paper, we have presented an in-depth analysis of two very popular information-based clustering approaches. We first consider the two approaches as a combination of an MI-based loss and a class of transformations. This has led us to a more general formulation of information-based clustering in which we can combine different loss formulations and different transformations. From our empirical evaluation of these different configurations, we conclude that different transformations require different losses for optimal performance. Our best configuration is then a clustering that maximizes the mutual information between input and output as well as the mutual information between a sample and its geometrical transformation, and which enforces with KL divergence the similarity between a sample and its adversarial transformation. In our experiments, this configuration appears to outperform previous work on clustering as well as other self-supervised approaches.


References

  • [1] U. Ahsan, R. Madhok, and I. Essa (2019) Video jigsaw: unsupervised learning of spatiotemporal context for video action recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 179–189. Cited by: §2.
  • [2] J. D. Banfield and A. E. Raftery (1993) Model-based gaussian and non-gaussian clustering. Biometrics, pp. 803–821. Cited by: §2.
  • [3] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062. Cited by: §2.
  • [4] J. S. Bridle, A. J. R. Heading, and D. J. C. MacKay (1992) Unsupervised classifiers, mutual information and ‘phantom targets’. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), pp. 1096–1101. Cited by: §2, §2, §3.2.
  • [5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. CoRR abs/1807.05520. Cited by: §1.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2, §2.
  • [7] E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh (2018) Sentinet: detecting physical attacks against deep learning systems. arXiv preprint arXiv:1812.00292. Cited by: §3.3.
  • [8] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: §2.
  • [9] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: Table 5.
  • [10] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5736–5745. Cited by: §2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, Washington, DC, USA, pp. 1026–1034. ISBN 978-1-4673-8391-2. Cited by: §1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.3.
  • [13] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2, §4.8, Table 5.
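  • [14] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning (ICML). Cited by: §1, §2, §2, §3.2, §3.3, §4.3.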
  • [15] X. Ji, J. F. Henriques, and A. Vedaldi (2018) Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653. Cited by: §2, §2, §3.2.
  • [16] X. Ji, J. F. Henriques, and A. Vedaldi (2018) Invariant information distillation for unsupervised image segmentation and clustering. CoRR abs/1807.06653. Cited by: §1, §3.3, §4.3, §4.3.
  • [17] L. Jing and Y. Tian (2019) Self-supervised visual feature learning with deep neural networks: a survey. arXiv preprint arXiv:1902.06162. Cited by: §2.
  • [18] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre (2010) Generalized power method for sparse principal component analysis. Journal of Machine Learning Research 11 (Feb), pp. 517–553. Cited by: §3.3.
  • [19] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), pp. 881–892. Cited by: §2.
  • [20] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: Table 5.
  • [21] A. Krause, P. Perona, and R. G. Gomes (2010) Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 775–783. Cited by: §2.
  • [22] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: 2nd item.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §1.
  • [24] H. W. Kuhn (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1–2), pp. 83–97. Cited by: §4.2.
  • [25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.3, 1st item.
  • [26] F. Li, H. Qiao, and B. Zhang (2018) Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognition 83, pp. 161–173. Cited by: §2.
  • [27] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §3.3.
  • [28] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: Table 5.
  • [29] V. Manohar, D. Povey, and S. Khudanpur (2015) Semi-supervised maximum mutual information training of deep neural network acoustic models. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §2.
  • [30] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2019-08) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993. ISSN 0162-8828. Cited by: §1, §3.3.
  • [31] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: 3rd item.
  • [32] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.
  • [33] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §2.
  • [34] A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §4.3.
  • [35] L. Paninski (2003-06) Estimation of entropy and mutual information. Neural Comput. 15 (6), pp. 1191–1253. ISSN 0899-7667. Cited by: §1.
  • [36] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: §2.
  • [37] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas (2018) Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2226–2234. Cited by: §3.3.
  • [38] J. P. Pluim, J. A. Maintz, and M. A. Viergever (2003) Mutual-information-based registration of medical images: a survey. IEEE transactions on medical imaging 22 (8), pp. 986–1004. Cited by: §2.
  • [39] F. Schroff, A. Criminisi, and A. Zisserman (2011-04) Harvesting image databases from the web. IEEE Trans. Pattern Anal. Mach. Intell. 33 (4), pp. 754–766. ISSN 0162-8828. Cited by: §1.
  • [40] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation. Cited by: §3.3.
  • [41] P. Thévenaz and M. Unser (2000) Optimization of mutual information for multiresolution image registration. IEEE Transactions on Image Processing 9 (12), pp. 2083–2099. Cited by: §2.
  • [42] P. Viola and W. M. Wells III (1997) Alignment by maximization of mutual information. International journal of computer vision 24 (2), pp. 137–154. Cited by: §2.
  • [43] C. Vondrick, D. Patterson, and D. Ramanan (2013-01) Efficiently scaling up crowdsourced video annotation. Int. J. Comput. Vision 101 (1), pp. 184–204. ISSN 0920-5691. Cited by: §1.
  • [44] C. Wei, L. Xie, X. Ren, Y. Xia, C. Su, J. Liu, Q. Tian, and A. L. Yuille (2019) Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1910–1919. Cited by: §2.
  • [45] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 478–487. Cited by: §1, §2.
  • [46] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. CoRR abs/1604.03628. Cited by: §1.
  • [47] X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems. Cited by: §3.3.
  • [48] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §3.3.