It is very expensive to label a dataset with respect to a particular task. Consider the alternative where a user, instead of labelling a dataset, specifies a simple set of class-preserving transformations or ‘augmentations’. For example, lighting changes will not change a dog into a cat. Is it possible to learn a model that produces a useful representation by leveraging a set of such augmentations? Such a representation would need to be good at capturing salient information about the image, and enable downstream tasks to be done efficiently. If the representation were a discrete labelling which groups the dataset into clusters, an obvious choice of downstream task is unsupervised clustering that ideally should match the clusters that would be obtained by direct labelling, without ever having been learnt on explicitly labelled data.
Approaches that learn from augmentations typically involve learning neural networks that map augmentations of the same image to similar representations. This is a reasonable approach to take, as the variations induced by many common image augmentations often align with the invariances we would require a method to have.
In particular, a number of earlier works target maximising mutual information (MI) between augmentations [24, 15, 26, 16, 1]. By targeting high MI between representations computed from distinct augmentations of images, useful representations can be learned that capture the invariances induced by the augmentations. We are particularly interested in a form of representation that is a discrete labelling of the data, as this is particularly parsimonious. This labelling can be seen as a clustering procedure, where MI can be computed and assessment can be done directly using the learned labelling, as opposed to via an auxiliary network trained post hoc.
1.1 Suboptimal mutual information maximisation
We argue and show that in many cases the MI objective is not maximised effectively in existing work due to the combination of:
(1) Greedy optimisation algorithms used to train neural networks, such as stochastic gradient descent (SGD), that potentially target local optima; and
(2) A limited set of data augmentations that can result in the existence of multiple local optima to the MI maximisation objective.
SGD is greedy in the sense that early-found high-gradient features can dominate and so networks will tend to learn easier-to-compute locally-optimal representations (for example, one that can be computed using fewer neural network layers) over those that depend on complex features.
By way of example, in natural images, average colour is an easy-to-compute characteristic, whereas object type is not. If the augmentation strategy preserves average colour, then a reasonable mapping need only compute average colour information, and high MI between image representations will be obtained.
1.2 Dealing with greedy solutions
A number of heuristic solutions, such as Sobel edge-detection [3, 16] as a pre-processing step, have been suggested to remove or alter the features in images that may cause trivial representations to be learned. However, this is a symptomatic treatment and not a solution. In the work presented herein, we acknowledge that greedy SGD can get stuck in local optima of the MI maximisation objective because of limited data augmentations. Instead of trying to prevent a greedy solution, we let our DHOG model learn this representation, but also require it to learn a second distinct representation. Specifically, we minimise the MI between these two representations so that the latter cannot rely on the same features used by the earlier one. We extend this idea by adding additional representations, each time requiring the latest to be distinct from all previous representations.
Learning a set of representations by encouraging them to have low MI, while still maximising the original augmentation-driven MI objective for each representation, is the core idea behind Deep Hierarchical Object Grouping (DHOG). We define a mechanism to produce a set of hierarchically-ordered solutions (in the sense of easy-to-hard orderings, not tree structures). DHOG is able to better maximise the original MI objective between augmentations since each representation must correspond to a unique local optimum. Our contributions are:
We identify the suboptimal MI maximisation problem: maximising MI between data augmentations using neural networks and stochastic gradient descent (SGD) produces a substantially suboptimal solution. (We show this by finding higher mutual information solutions using DHOG, rather than by any analysis of the solutions themselves.) We reason that the SGD learning process settles on easy-to-compute solutions early in learning and optimises for these as opposed to leveraging the capacity of deep and flexible networks. We give plausible explanations and demonstrations for why this is the case, and show, with improved performance on a clustering task, that we can explicitly avoid these suboptimal solutions.
We mitigate this problem by introducing DHOG: the first robust neural network image grouping method to learn diverse and hierarchically arranged sets of discrete image labellings (Section 3) by explicitly modelling, accounting for, and avoiding spurious local optima, requiring only simple data augmentations, and needing no Sobel edge detection.
We show a marked improvement over the current state-of-the-art for standard benchmarks in image clustering for CIFAR-10, CIFAR-100-20 (a 20-way class grouping of CIFAR-100), and SVHN; we set a new accuracy benchmark on CINIC-10; and show the utility of our method on STL-10 (Section 4).
To be clear, DHOG still learns to map data augmentations to similar representations as this is imperative to the learning process. The difference is that the DHOG framework enables a number of intentionally distinct data labellings to be learned, arranged hierarchically in terms of source feature complexity.
1.4 Assessment task: clustering
For this work, our focus is on finding higher MI representations; we then assess the downstream capability on the ground truth task of image classification, meaning that we can either (1) learn a representation that must be ‘decoded’ via an additional learning step, or (2) produce a discrete labelling that requires no additional learning. Clustering methods offer a direct comparison and require no labels for learning a mapping from the learned representation to class labels. Instead, labels are only required to assign groups to their corresponding classes, and no learning is done using these. Therefore, our comparisons are with respect to state-of-the-art clustering methods.
2 Related Work
Contrastive predictive coding [24] (CPC) models a 2D latent space using an autoregressive model and defines a predictive setup to maximise MI between distinct spatial locations. Deep InfoMAX [15] (DIM) does not maximise MI across a set of data augmentations, but instead uses mutual information neural estimation [2] and negative sampling to balance maximising MI between global representations and local representations. Local MI maximisation encourages compression of spurious elements, such as noise, that are inconsistent across blocks of an image. Augmented multiscale Deep InfoMAX [1] (AMDIM) incorporates MI maximisation across data augmentations and multiscale comparisons.
Clustering approaches are more directly applicable for comparison with DHOG because they explicitly learn a discrete labelling. The authors of deep embedding for clustering (DEC) [27] focused their attention on jointly learning an embedding suited to clustering and a clustering itself. They argued that the notion of distance in the feature space is crucial to a clustering objective. Joint unsupervised learning of deep representations and image clusters (JULE) [28] provided supervisory signal for representation learning. Some methods [12, 11] employ autoencoder architectures along with careful regularisation of cluster assignments to (1) ensure sufficient information retention, and (2) avoid cluster degeneracy (i.e., mapping all images to the same class).
Deep adaptive clustering [4] (DAC) recasts the clustering problem as binary pairwise classification, pre-selecting comparison samples via feature cosine distances. A constraint on the DAC system allows for a one-hot encoding that avoids cluster degeneracy. Another mechanism for dealing with degeneracy is to use a standard clustering algorithm, such as $k$-means, to iteratively group on learned features. This approach is used by DeepCluster [3].
Associative deep clustering (ADC) [13] uses the idea that associations in the embedding space are useful for learning. Clustering was facilitated by learning a network to associate data with (pseudo-labelled) centroids. They leveraged augmentations by encouraging samples to output similar cluster probabilities.
Deep comprehensive correlation mining [26] (DCCM) constructs a sample correlation graph for pseudo-labels and maximises the MI between augmentations, and the MI between local and global features for each augmentation. While many of the aforementioned methods estimate MI in some manner, invariant information clustering [16] (IIC) directly defines the MI using the $C$-way softmax output (i.e., the probability of belonging to class $c$), and maximises this over data augmentations to learn clusters. They effectively avoid degenerate solutions because MI maximisation implicitly targets marginal entropy. We use the same formulation for MI; details can be found in Section 3.
DHOG is an approach for obtaining jointly trained multi-level representations as discrete labellings, arranged in a simple-to-complex hierarchy. Later representations in the hierarchy have low MI with earlier representations.
Each discrete labelling is computed by a separate ‘head’ within a neural network – Figure 1 shows the architecture. A head is a unit that computes a multivariate class probability vector. By requiring independence amongst heads, a diversity of solutions to the MI maximisation objective can be found. The head that best maximises MI between augmentations typically aligns better with a ground truth task that also relies on complex features (e.g., classification).
Figure 1 demonstrates the DHOG architecture and training principles. There are shared model weights (ResNet blocks 1, 2, and 3) and head-specific weights (the MLP layers and ResNet blocks 4 to 8). For the sake of brevity, we abuse notation and use $\mathrm{MI}(\mathbf{z}, \mathbf{z}')$ between labelling probability vectors as an overloaded shorthand for the mutual information $\mathrm{MI}(c, c')$ between the labelling random variables $c$ and $c'$ that have probability vectors $\mathbf{z}$ and $\mathbf{z}'$ respectively.
Any branch of the DHOG architecture (from the input to any head $i$) can be regarded as a single neural network. These are trained to maximise the MI between the label variables at each head for different augmentations; i.e. between label variables with probability vectors $\mathbf{z}_i(\mathbf{x}')$ and $\mathbf{z}_i(\mathbf{x}'')$ for augmentations $\mathbf{x}'$ and $\mathbf{x}''$. Four augmentations are shown in Figure 1, and the MI is maximised pairwise between all pairs. This process can be considered pulling the mapped representations together.
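Following IIC [16], we compute the MI directly from the label probability vectors within a minibatch. Let $\mathbf{z}_i(\mathbf{x}')$ and $\mathbf{z}_i(\mathbf{x}'')$ denote the random probability vectors at head $i$ associated with sampling a data item $\mathbf{x}$ and its augmentations $\mathbf{x}'$, $\mathbf{x}''$, and passing those through the network. Then we can compute the mutual information between labels associated with each augmentation using
$$\mathrm{MI}_{\mathrm{aug}}(c_i(\mathbf{x}'), c_i(\mathbf{x}'')) = \mathrm{Tr}\left(\mathbb{E}[\mathbf{z}_i(\mathbf{x}')\,\mathbf{z}_i^{\top}(\mathbf{x}'')]^{\top} \log \mathbb{E}[\mathbf{z}_i(\mathbf{x}')\,\mathbf{z}_i^{\top}(\mathbf{x}'')]\right) - \mathbb{E}[\mathbf{z}_i^{\top}(\mathbf{x}')] \log \mathbb{E}[\mathbf{z}_i(\mathbf{x}')] - \mathbb{E}[\mathbf{z}_i^{\top}(\mathbf{x}'')] \log \mathbb{E}[\mathbf{z}_i(\mathbf{x}'')] \tag{1}$$
where $\mathrm{Tr}$ is the matrix trace, logarithms are computed element-wise, and expectations are over data samples and augmentations of each sample. In practice we compute an empirical estimate of this MI based on samples from a minibatch.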
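The minibatch estimate described above can be illustrated with a minimal sketch in plain Python. It assumes the standard identity $\mathrm{MI} = \sum_{a,b} P_{ab} \log\left(P_{ab} / (P_a P_b)\right)$, where $P$ is the average outer product of paired probability vectors. The function names and the toy inputs are hypothetical, and the symmetrisation of $P$ used by IIC is omitted for brevity.

```python
import math

def joint_distribution(z_a, z_b):
    """Average outer product of paired probability vectors: an estimate of E[z_a z_b^T]."""
    n, k = len(z_a), len(z_a[0])
    joint = [[0.0] * k for _ in range(k)]
    for p, q in zip(z_a, z_b):
        for r in range(k):
            for c in range(k):
                joint[r][c] += p[r] * q[c] / n
    return joint

def mutual_information(joint, eps=1e-12):
    """MI (in nats) of a joint label distribution given as a k x k matrix."""
    rows = [sum(row) for row in joint]          # marginal of the first labelling
    cols = [sum(col) for col in zip(*joint)]    # marginal of the second labelling
    mi = 0.0
    for r, row in enumerate(joint):
        for c, p in enumerate(row):
            if p > eps:                         # treat 0 * log 0 as 0
                mi += p * math.log(p / (rows[r] * cols[c]))
    return mi

# Two augmentations of four items, two clusters, confident and consistent assignments.
z_first = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
z_second = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mi_consistent = mutual_information(joint_distribution(z_first, z_second))

# Uniform (uninformative) assignments carry no information about each other.
z_uniform = [[0.5, 0.5]] * 4
mi_uniform = mutual_information(joint_distribution(z_uniform, z_uniform))
```

In this toy case the consistent, balanced labelling attains the maximum of $\log 2$ nats for two clusters, while the uniform labelling gives zero.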
3.1 Distinct heads
What makes DHOG different from other methods is that each head is encouraged to compute unique solutions via cross-head MI minimisation. For a minibatch of images, the particular labelling afforded by any head is trained to have low MI with other heads (see Figure 1). We assume multiple viable groupings/clusterings because of natural patterns in the data. By encouraging low MI between heads, these must capture different patterns in the data.
Concepts such as brightness, average colours, low-level patterns, etc., are axes of variation that are reasonable to group by, and which maximise the MI objective to some degree. Complex ideas, such as shape, typically require more processing. Greedy optimisation may not discover these groupings without explicit encouragement. Unfortunately, the groupings that tend to rely on complex features are most directly informative of likely class boundaries. In other words, without a mechanism to explore viable patterns in the data (like our cross-head MI minimisation, Section 3.2), greedy optimisation will avoid finding them.
3.2 Cross-head MI minimisation
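Our approach to addressing suboptimal MI maximisation is to encourage unique solutions at sequential heads (see Figure 1), which rely on different features in the data. We can compute and minimise the MI across heads $i$ and $j$ using:
$$\mathrm{MI}_{\mathrm{head}}(c_i(\mathbf{x}), c_j(\mathbf{x})) = \mathrm{Tr}\left(\mathbb{E}[\mathbf{z}_i(\mathbf{x})\,\mathbf{z}_j^{\top}(\mathbf{x})]^{\top} \log \mathbb{E}[\mathbf{z}_i(\mathbf{x})\,\mathbf{z}_j^{\top}(\mathbf{x})]\right) - \mathbb{E}[\mathbf{z}_i^{\top}(\mathbf{x})] \log \mathbb{E}[\mathbf{z}_i(\mathbf{x})] - \mathbb{E}[\mathbf{z}_j^{\top}(\mathbf{x})] \log \mathbb{E}[\mathbf{z}_j(\mathbf{x})] \tag{2}$$
Logarithms are element-wise, and expectations are over the data and augmentations. Note that $\mathbf{z}_i(\mathbf{x})$ and $\mathbf{z}_j(\mathbf{x})$ are each computed from the same data augmentation $\mathbf{x}$. We estimate this from each minibatch sample. This process can be thought of as pushing the heads apart.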
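To illustrate what cross-head MI measures, the following sketch (plain Python, with hypothetical names and hand-made one-hot outputs) evaluates the MI between the labellings of two heads on the same inputs. It is a toy illustration under the assumption that both heads use the same number of labels, not the training code.

```python
import math

def label_mi(z_i, z_j):
    """MI (nats) between the labellings of two heads evaluated on the same inputs."""
    n, k = len(z_i), len(z_i[0])
    joint = [[sum(p[r] * q[c] for p, q in zip(z_i, z_j)) / n for c in range(k)]
             for r in range(k)]
    rows = [sum(row) for row in joint]
    cols = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (rows[r] * cols[c]))
               for r, row in enumerate(joint)
               for c, p in enumerate(row) if p > 0.0)

head_a = [[1, 0], [1, 0], [0, 1], [0, 1]]   # groups items {0, 1} versus {2, 3}
head_b = [[1, 0], [1, 0], [0, 1], [0, 1]]   # same grouping as head_a
head_c = [[1, 0], [0, 1], [1, 0], [0, 1]]   # groups items {0, 2} versus {1, 3}

mi_same = label_mi(head_a, head_b)       # high: the two heads are redundant
mi_distinct = label_mi(head_a, head_c)   # zero: the groupings are independent
```

Minimising this quantity therefore favours pairs such as `head_a` and `head_c`, which split the data along different patterns.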
3.3 Aligning assignments
When computing and subsequently minimising Equation 2, a degenerate solution must be accounted for: two different heads can effectively minimise this form of MI computation while producing identical groupings of the data, because of the way MI is computed in practice. Such a solution arises simply when the order of the labels is permuted for each head, but consistently so across the data. We use the Hungarian method [18] to choose the best match between heads, effectively mitigating spurious MI computation. This step can be computationally expensive when the number of labels is large, but is imperative to the success of DHOG.
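The matching step can be sketched as follows. This is an illustrative stand-in only: it searches all label permutations by brute force, which is feasible for a handful of labels, whereas the Hungarian method solves the same assignment problem in polynomial time. The function name, the use of hard labels, and the agreement-count criterion are assumptions made for the example.

```python
from itertools import permutations

def best_label_matching(labels_a, labels_b, k):
    """Return perm such that relabelling b -> perm[b] maximises agreement with labels_a."""
    counts = [[0] * k for _ in range(k)]      # counts[a][b]: co-occurrences of the labels
    for a, b in zip(labels_a, labels_b):
        counts[a][b] += 1
    best = max(permutations(range(k)),
               key=lambda perm: sum(counts[perm[b]][b] for b in range(k)))
    return list(best)

# Two heads that produce the same grouping under different label orders.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]
perm = best_label_matching(labels_a, labels_b, k=3)
aligned_b = [perm[b] for b in labels_b]
```

After alignment the two labellings coincide, exposing that the heads have found the same grouping.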
3.4 Hierarchical arrangement
Requiring $k$ heads (where $k = 8$ here) to produce unique representations is not necessarily the optimal method to account for suboptimal MI maximisation. Instead, what we do is encourage a simple-to-complex hierarchy structure to the heads, defined according to cross-head comparisons made using Equation 2. The hierarchy enables a reference mechanism through which later representations can be encouraged toward relying on complex and abstract features in the data.
Figure 1 shows 8 heads, three of which are computed from representations owing to early residual blocks of the network. The hierarchical arrangement is created by only updating head-specific weights according to comparisons made with earlier heads. In practice this is done by stopping the appropriate gradients during training (as marked in the figure). For example, when computing the MI between assignments using head $i$ and those using another head $j$, gradient back-propagation into head $i$ is allowed when $j < i$ but not when $j > i$. In other words, when learning to produce $\mathbf{z}_i$, the network is encouraged to produce a head that is distinct from heads ‘lower’ on the hierarchy. Those ‘higher’ on the hierarchy do not affect the optimisation, however. Extending this concept for $i = 1, \ldots, 8$ gives rise to the idea of the hierarchical arrangement.
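The gradient-stopping rule can be summarised as a boolean mask over head pairs. The sketch below is a hypothetical helper using 0-based head indices; it only records which comparisons are allowed to update which head, and does not perform any back-propagation.

```python
def update_mask(num_heads):
    """mask[i][j] is True when the MI between head i and head j may update head i."""
    return [[j < i for j in range(num_heads)] for i in range(num_heads)]

mask = update_mask(4)
# Number of earlier heads that each head is pushed away from.
comparisons_per_head = [sum(row) for row in mask]
```

The first head receives no cross-head signal at all, so it is free to settle on the easiest solution, while each later head is compared against every head before it.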
Initial experiments showed clearly that if this hierarchical complexity routine was ignored, the gains owing to cross-head MI minimisation were reduced.
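The part of the objective producing high MI representations by ‘pulling’ together discrete labellings from augmentations is Equation 1 normalised over the $k$ heads:
$$\mathrm{MI}_{\mathrm{pull}} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MI}_{\mathrm{aug}}(c_i(\mathbf{x}'), c_i(\mathbf{x}'')) \tag{3}$$
The quantity used to ‘push’ heads apart is Equation 2 normalised per head:
$$\mathrm{MI}_{\mathrm{push}} = \sum_{i=1}^{k} \frac{1}{i} \sum_{j < i} \mathrm{MI}_{\mathrm{head}}(c_i(\mathbf{x}), c_j(\mathbf{x})) \tag{4}$$
where each cross-head MI term is scaled by the head index, $i$, since that directly tracks the number of comparisons made for each head. $i$ scales up the hierarchy, such that the total $\mathrm{MI}_{\mathrm{head}}$ associated with any head is scaled according to the number of comparisons. Scaling ensures that head-specific weight updates are all equally important. The final optimisation objective is:
$$\theta^{*} = \arg\max_{\theta}\; \mathrm{MI}_{\mathrm{pull}} - \alpha\,\mathrm{MI}_{\mathrm{push}} \tag{5}$$
where $\theta$ are the network parameters implicitly included in the MI computations, and $\alpha$ is a hyper-parameter we call the cross-head MI-minimisation coefficient. We ran experiments in Section 4 with $\alpha = 0$ as an ablation study.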
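As a rough numerical sketch of how the ‘pull’ and ‘push’ terms combine, the function below takes already-computed MI values and returns the scalar to be maximised. It assumes that the pull term is a mean over heads, that the push term for each head sums its MI with earlier heads and divides by its 1-based head index, and that the cross-head coefficient multiplies the push term with a negative sign. All names and numbers are hypothetical.

```python
def dhog_objective(mi_aug, mi_head, alpha):
    """Combine per-head MI values into a single scalar to maximise.

    mi_aug[i]     : MI between augmentations at head i (0-based)
    mi_head[i][j] : MI between head i and an earlier head j < i
    alpha         : cross-head MI-minimisation coefficient
    """
    k = len(mi_aug)
    pull = sum(mi_aug) / k
    push = 0.0
    for i in range(1, k):                       # head 0 has no earlier head
        push += sum(mi_head[i][j] for j in range(i)) / (i + 1)   # divide by 1-based index
    return pull - alpha * push

mi_aug = [0.6, 0.8, 1.0]
mi_head = [[0.0, 0.0, 0.0],
           [0.2, 0.0, 0.0],
           [0.1, 0.3, 0.0]]
with_push = dhog_objective(mi_aug, mi_head, alpha=0.5)
ablation = dhog_objective(mi_aug, mi_head, alpha=0.0)
```

Setting the coefficient to zero leaves only the averaged augmentation MI, which corresponds to the ablation setting.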
3.6 Design and training choices
Figure 1 shows the architecture and training design. It is based on a ResNet-18 backbone, where each residual block has two layers (with a skip connection over these). Blocks 1 to 3 have 64, 128, and 256 units, respectively. Each parallel final block (4 to 8, here) has 512 units. Each MLP has a single hidden layer of width 200. Although the parallel repetition of entire block structures for each head is cumbersome, our earlier experiments showed that this was an important model flexibility. We used four data augmentation repeats with a batch size of 220.
DHOG maximises MI between discrete labellings from different data augmentations. This is equivalent to a clustering and is similar to IIC. There are, however, key differences. In our experiments:
We train for 1000 epochs with a cosine annealing learning rate schedule, as opposed to a fixed learning rate for 2000 epochs.
We do not use Sobel edge-detection or any other arbitrary preprocessing as a fixed processing step.
We make use of the fast auto-augment CIFAR-10 data augmentation strategy (for all tested datasets) found by [19]. We then randomly apply grayscale after these augmentations, and take random square crops of sizes 64 and 20 pixels for STL-10 and all other datasets, respectively.
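For reference, a cosine annealing schedule of the kind mentioned in the list above can be sketched as below. The text does not state the learning rates or whether restarts are used, so the single-cycle form, the function name, and the values `lr_max=1e-3` and `lr_min=0.0` are purely illustrative assumptions.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Single-cycle cosine decay from lr_max (epoch 0) to lr_min (final epoch)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term

# Hypothetical peak learning rate; 1000 epochs as stated in the text.
schedule = [cosine_annealing_lr(e, 1000, lr_max=1e-3) for e in (0, 500, 1000)]
```

The rate starts at its peak, passes through half of it at the midpoint, and decays smoothly to the minimum at the end of training.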
The choice of data augmentation is important, and we acknowledge that for a fair comparison to IIC the same augmentation strategy must be used. The ablation of any DHOG-specific loss (when $\alpha = 0$) largely recreates the IIC approach but with augmentations, network and head structure matched to DHOG; this enables a fair comparison between an IIC and DHOG approach.
Since STL-10 has much more unlabelled data of a similar but broader distribution than the training data, the idea of ‘overclustering’ was used by [16]; they used more clusters than the number of classes (70 versus 10 in this case). We repeat each head with an overclustering head that does not play a part in the cross-head MI minimisation. The filter widths are doubled for STL-10. We interspersed the training data evenly and regularly through the minibatches.
To determine the DHOG cross-head MI-minimisation coefficient, $\alpha$, we carried out a non-exhaustive hyper-parameter search using only CIFAR-10 images (without the labels), assessing performance on a held-out validation set sampled from the training data. This did not use the evaluation data in any way.
Once learned, the optimal head can be identified either using the highest MI, or using a small set of labelled data. Alternatively all heads can be used as different potential alternatives, with posthoc selection of the best head according to some downstream task. The union of all the head probability vectors could be used as a compressed data representation. In this paper the head that maximises the normalised mutual information on the training data is chosen. This is then fully unsupervised, as with the head selection protocol of IIC. We also give results for the best posthoc head to show the potential for downstream analysis.
We first show the functionality of DHOG using a toy problem (Section 4.1) to illustrate how DHOG can find distinct and viable groupings of the same data. In Section 4.2 we give results to show the superiority of DHOG on real images.
4.1 Toy problem
Figure 2: Toy problem with 4 sets of 2D Gaussian distributed points. The network must learn 2 groups. The probability of being in either is given by the background. Without DHOG the network simply learns a single solution, while DHOG encourages a number of unique solutions, from which it can select the solution with highest mutual information.
Figure 2 demonstrates a simple 2D toy problem to show how DHOG can learn a variety of solutions. The data is generated from 4 separate 2D Gaussians, so there are four possible groups. The number of clusters was set to 2 in order to create a circumstance where different groupings were possible. The augmentations were generated by adding Gaussian noise to the samples. Without DHOG (a) the network computes the same solution for each head. With DHOG (b) each head produces a unique solution. Even though the single solution found in (a) might very well be the sought-after solution, it is not necessarily the solution that maximises the MI between augmentations of data. This is evidence of the suboptimal MI maximisation problem: the network, learned using SGD, latched onto this local optimum throughout training. DHOG is able to identify different, better, optima because the earlier local optima are already identified.
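A data set of this kind can be generated with a few lines of plain Python. The blob centres, spreads, noise scale and sample counts below are assumptions chosen for illustration; the text does not state the values used for Figure 2.

```python
import random

def make_toy_data(points_per_blob=50, spread=0.3, aug_noise=0.1, num_augs=4, seed=0):
    """Sample four 2D Gaussian blobs and noisy 'augmentations' of every point."""
    rng = random.Random(seed)
    centres = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]  # hypothetical layout
    data = []
    for blob, (cx, cy) in enumerate(centres):
        for _ in range(points_per_blob):
            x, y = rng.gauss(cx, spread), rng.gauss(cy, spread)
            # Each augmentation adds independent Gaussian noise to the sample.
            augs = [(x + rng.gauss(0.0, aug_noise), y + rng.gauss(0.0, aug_noise))
                    for _ in range(num_augs)]
            data.append({"blob": blob, "point": (x, y), "augmentations": augs})
    return data

toy = make_toy_data()
```

With two clusters and four blobs laid out on a square, left versus right, top versus bottom, and diagonal pairings are all viable groupings, which is the ambiguity the toy problem is designed to expose.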
4.2 Real images
We evaluated DHOG on CIFAR-10 [17], CIFAR-100-20 [17], CINIC-10 [7] (an extension of CIFAR-10 using images from ImageNet [8] of similar classes), street view house numbers [22] (SVHN), and STL-10 [5]. For CINIC-10 only the standard training set of 90000 images (without labels) was used for training.
Table 1: Unsupervised clustering results for each of the five datasets. Rows for each dataset include: $k$-means on pixels; DHOG ($\alpha = 0$, ablation); DHOG, unsup.; DHOG, best.
Table 1 gives the accuracy, normalised mutual information (NMI), and the adjusted Rand index (ARI) between remapped assignments and classification targets. Before assessment a labelling-to-class remapping was computed using the training data and the Hungarian method [18]. The results listed for DHOG correspond to the average over 3 seeded runs. In terms of all measured metrics DHOG outperformed other relevant fully-unsupervised clustering methods, with accuracy improvements on CIFAR-10, CIFAR-100-20, and SVHN. No Sobel edge-detection was used to account for trivial solutions. The DHOG network converged in half the training time (compared to IIC).
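As a reminder of how one of the reported metrics behaves, the sketch below computes NMI between a predicted labelling and class targets in plain Python. Several normalisations are in common use; this example assumes normalisation by the arithmetic mean of the two entropies, which may differ from the variant used for Table 1. Function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a list of hard labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalised_mutual_information(pred, target):
    """NMI with arithmetic-mean normalisation; invariant to label permutations."""
    n = len(pred)
    joint = Counter(zip(pred, target))
    p_count, t_count = Counter(pred), Counter(target)
    mi = sum((c / n) * math.log(c * n / (p_count[p] * t_count[t]))
             for (p, t), c in joint.items())
    denom = 0.5 * (entropy(pred) + entropy(target))
    return mi / denom if denom > 0 else 0.0

target = [0, 0, 1, 1, 2, 2]
nmi_perfect = normalised_mutual_information([2, 2, 0, 0, 1, 1], target)  # permuted labels
nmi_single = normalised_mutual_information([0] * 6, target)              # degenerate cluster
```

Because NMI ignores the naming of clusters, it needs no remapping step, whereas accuracy does.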
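We used a fully unsupervised post hoc head selection according to the normalised mutual information (NMI) between the labellings of augmented training data. The selected heads almost always corresponded with the head that maximised the NMI with $\mathbf{y}$, where $\mathbf{y}$ are user-defined class labels. This means that the DHOG framework produces data groupings that: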
Better maximise the widely used MI objective (between mappings of data augmentations);
Also correspond better with the challenging underlying object classification test objective.
It is only on STL-10 that DHOG never beat the current state-of-the-art. This may be owing to the need for an STL-10-specific hyper-parameter search. Our aim was to show that the simple hierarchical ordering of heads in DHOG improves performance. The difference between STL-10 with and without the MI cross-head minimisation term (controlled by $\alpha$) shows a marked improvement. Again, note that DHOG uses no preprocessing such as Sobel edge detection to deal with easy solutions to the MI objective.
The advantage of a hierarchical ordering is particularly evident when considering the ablation study: with ($\alpha > 0$) and without ($\alpha = 0$) cross-head MI minimisation. Figure 3 (a) and (b) are accuracy versus head curves, showing that without cross-head MI minimisation later heads converge to similar solutions.
Figure 4 and the confusion matrix in Figure 5 (b) show the classes the final learned network confuses in CIFAR-10. Compare this to the confusion matrix in Figure 5 (a), where $\alpha = 0$, and note the greater prevalence of cross-class confusion.
Table 2 shows images that yielded the highest probability for each class, and an average of the top 10 images, for both an early and a late head. Since the fast auto-augment strategy is broad, the difference between easy-to-compute and complex features is often nuanced. In this case, the grouped images from the earlier head are more consistent in terms of colour or simple patterns. It is easier to ascribe notions of ‘mostly blue background’ (class 3) or ‘light blob with three dark spots’ (class 6) for the earlier head. This is also exemplified by the average images: the consistency of images in the early head makes the detail of the average images clearer, whereas those from the later head are difficult to discern owing to diversity amongst the samples.
We presented deep hierarchical object grouping (DHOG): a method that leverages the challenges faced by current data augmentation-driven unsupervised representation learning methods that maximise mutual information. Learning a good representation of an image using data augmentations is limited by the user, who chooses the set of plausible data augmentations but who is also unable to cost-effectively define an ideal set of augmentations. We argue and show that learning using greedy optimisation typically causes models to get stuck in local optima, since the data augmentations fail to fully describe the sought after invariances to all task-irrelevant information.
We address this pitfall via a simple-to-complex ordered sequence of representations. DHOG works by minimising mutual information between these representations such that those later in the hierarchy are encouraged to produce unique and independent discrete labellings of the data (w.r.t. earlier representations). Therefore, later heads avoid becoming stuck in the same local optima of the original mutual information objective (between augmentations, applied separately to each head). Our tests showed that DHOG resulted in improvements on CIFAR-10, CIFAR-100-20, and SVHN, without using preprocessing such as Sobel edge detection.
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
This research was part-funded by a Huawei DDMPLab Innovation Research Grant DDMPLab5800191.
[1] (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.
[2] Mine: mutual information neural estimation. Proceedings of the thirty-fifth International Conference on Machine Learning, Vol. 80, pp. 531–540.
[3] Deep clustering for unsupervised learning of visual features. Proceedings of the European Conference on Computer Vision, pp. 132–149.
[4] (2017) Deep adaptive image clustering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5879–5887.
[5] An analysis of single-layer networks in unsupervised feature learning. Proceedings of the fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223.
[6] (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719.
[7] (2018) CINIC-10 is not imagenet or cifar-10. arXiv preprint arXiv:1810.03505.
[8] (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[9] Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 38 (9), pp. 1734–1747.
[10] (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp. 766–774.
[11] (2018) Deep $k$-means: jointly clustering with $k$-means and learning representations. arXiv preprint arXiv:1806.10069.
[12] (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5736–5745.
[13] Associative deep clustering: training a classification network with no labels. German Conference on Pattern Recognition, pp. 18–32.
[14] (2016) Identity mappings in deep residual networks. In Proceedings of fourteenth European Conference on Computer Vision, pp. 630–645.
[15] (2019) Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations.
[16] (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874.
[17] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[18] (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97.
[19] (2019) Fast autoaugment. In Advances in Neural Information Processing Systems.
[20] (1988) Self-organization in a perceptual network. Computer 21 (3), pp. 105–117.
[21] (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605.
[22] Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
[23] (2019) On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
[24] (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[25] (2014) Optimized cartesian k-means. IEEE Transactions on Knowledge and Data Engineering 27 (1), pp. 180–192.
[26] (2019) Deep comprehensive correlation mining for image clustering. In International Conference on Computer Vision.
[27] Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487.
[28] (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156.