1 Introduction
We are given a set of observations of arbitrary type that has already been grouped by some external mechanism. While we suppose these initial clusters are coherent, the dataset might have been overpartitioned. In this case it would be desirable to unite clusters whose observations are “similar” to each other. However, we are not interested in a domain-dependent similarity measure to decide which clusters should be merged, but seek a method that can be used for arbitrary types of data. An intuitive principle that holds across domains is that clusters whose observations cannot be discriminated from each other should be merged. For a pair of clusters, discriminability can be measured by splitting both clusters into a training and a validation subset, using the training subsets to estimate the parameters of a classifier and the validation subsets to approximate the classifier’s accuracy. Given an appropriate choice of classifier, a high accuracy suggests that the two clusters can be discriminated, while a low accuracy suggests that they cannot. This corresponds to approximating the Bayes accuracy on the binary classification dataset defined by the two clusters. Since the Bayes accuracy is related to the total variation distance (TVD) between the class-conditional distributions, the procedure can alternatively be seen as estimating the TVD between the clusters’ underlying distributions.
In our setting distances have to be calculated between all pairs of clusters. Sequentially training one classifier per pair would not be practical even for a small number of initial clusters. Neural networks allow the classification tasks between all pairs of clusters to be solved in parallel. A naïve implementation requires an output neuron for every pair of clusters. An output layer of that size would require the storage of too many parameters. With our optimization merely one output neuron per cluster is needed, which makes the method applicable to a higher number of initial clusters.

To apply this method an external mechanism is needed that partitions the dataset. An obvious possibility is that another clustering algorithm provides the overclustering. We will show empirically that this is feasible even for subsets of the challenging ImageNet dataset. However, the data does not necessarily have to be partitioned by a clustering algorithm. Consider for example a collection of videos in which objects are tracked across frames. Each object then has associated with it a set of images showing it from different viewpoints and under the other transformations that the object undergoes during tracking. To perform object discovery, objects of the same category that have been tracked in different videos have to be identified ([osep_2017]). Since each object is represented by a set of images, the proposed method could be naturally applied to this step.
Our contributions can be summarized as follows:

We show how neural networks can be used to estimate pairwise TVDs between sets of observations in parallel.

The number of sets of observations between which TVDs can be estimated simultaneously is increased by reducing the required number of output neurons from one per pair of sets to one per set.

Empirically we demonstrate that the algorithm is suited to merge realistically obtained overclusterings of a challenging image dataset. The algorithm compares favorably to a strong baseline that is specific to the image domain.

One advantage of neural networks is that their architecture can be adapted to the respective domain to obtain a better-performing classifier, e.g. by introducing invariances to certain transformations. We show how our distance estimates, which ultimately depend on the suitability of the classifier for the domain, benefit in a similar way from an appropriate choice of architecture.
2 Related Work
This work is inspired by a recent paper by ([gutmann_2018]) on likelihood-free inference. There the goal is to find the parameters for which a model generates data as close as possible to the real data. For this purpose a discrepancy measure between the real and generated data is required. ([gutmann_2018]) propose to train a classifier between real and generated data. The accuracy of the classifier computed on held-out data can then directly be used as an optimizable discrepancy measure. Like us, they motivate their method with the relation of the Bayes accuracy to the TVD. This connection was established a long time ago ([blackwell_1951]). The relationship between f-divergences, to which the TVD belongs, and surrogate losses of optimal classifiers has been studied in a more general way ([nguyen_2009]). It has also been shown that accuracy estimates can be used as a test statistic for two-sample hypothesis testing ([lopezpaz_2017]). These works differ from ours in their goals, and in none of them was it necessary to efficiently compute distances between multiple distributions in parallel.

3 Methods
3.1 Bayes Accuracy and TVD
First we repeat the connection between the accuracy of the Bayes classification rule and the TVD between the class-conditional distributions, which are in our case the clusters’ distributions. We will do this in a similar way as ([gutmann_2018]), but we will not assume equally sized sets of observations and therefore replace ordinary accuracy by balanced accuracy, so that the class priors cancel out. We have sampled two sets of observations from two unknown distributions:
(1) 
(2) 
We begin by constructing a binary classification dataset that assigns class label 0 to all observations in the first set and class label 1 to all observations in the second set:
(3) 
In classification the goal is to predict a class label from observations ([allofstatistics, elements, p. 349, p. 9]). This can be formalized by a classification rule that maps an observation to its class label. The performance of the classification rule on the dataset can be evaluated by the balanced accuracy:
(4) 
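As an illustration of the balanced accuracy described above, the following minimal sketch averages the per-class accuracies for binary labels. The function name and the example data are hypothetical and not taken from the paper.

```python
def balanced_accuracy(labels, predictions):
    """Mean of the per-class accuracies for binary labels in {0, 1}."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for y, y_hat in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == y_hat)
    if total[0] == 0 or total[1] == 0:
        raise ValueError("both classes must be present")
    # Each class contributes one half, regardless of its size.
    return 0.5 * (correct[0] / total[0] + correct[1] / total[1])


# Hypothetical imbalanced dataset: eight observations of class 0, two of class 1.
labels = [0] * 8 + [1] * 2
always_zero = [0] * 10
print(balanced_accuracy(labels, always_zero))
```

A rule that always predicts the larger class reaches an ordinary accuracy of 0.8 on this data, but its balanced accuracy stays at chance level, which is why the class priors no longer matter.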
We denote as balanced Bayes rule the function which attains the best possible balanced accuracy. Let the pair of random variables be distributed according to random draws of observation-label pairs from the dataset, such that and , then the balanced Bayes rule is given by ([menon_2013]):
(5) 
The TVD between the two distributions is defined as
(6) 
The expected balanced accuracy of the balanced Bayes rule is connected to the TVD between the class-conditional distributions ([nguyen_2009, gutmann_2018]):
(7) 
Since the condition in can be rewritten as the statement above can be proven like in ([gutmann_2018, Appendix A]) but without relying on . Further details are given in the supplementary material.
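The connection can be checked numerically on a small discrete example. The sketch below assumes two hypothetical distributions over three outcomes, uses the standard identity that the TVD of discrete distributions is half the sum of absolute probability differences, and takes the balanced Bayes rule to predict label 1 wherever the second distribution assigns more mass. Function names and numbers are illustrative only.

```python
def total_variation_distance(p, q):
    """TVD between two discrete distributions given as probability lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def balanced_bayes_accuracy(p0, p1):
    """Expected balanced accuracy of the rule 'predict 1 where p1 > p0'."""
    acc0 = sum(a for a, b in zip(p0, p1) if not b > a)  # class 0 predicted as 0
    acc1 = sum(b for a, b in zip(p0, p1) if b > a)      # class 1 predicted as 1
    return 0.5 * (acc0 + acc1)


# Hypothetical class-conditional distributions over three outcomes.
p0 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.3, 0.5]
tvd = total_variation_distance(p0, p1)
acc = balanced_bayes_accuracy(p0, p1)
print(tvd, acc, 0.5 + 0.5 * tvd)
```

In this example the balanced Bayes accuracy equals one half plus one half of the TVD, so a TVD estimate can be read off an accuracy estimate by rescaling.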
3.2 Approximating the Bayes Accuracy
We do not have access to the balanced Bayes rule, as this would require knowledge of the underlying distributions, but we can learn an approximation of it with an arbitrary classification algorithm ([allofstatistics]). We will restrict ourselves to empirical risk minimization with neural networks. To account for our balanced objective function, we employ cost-sensitive learning, as this was shown to be superior to alternative approaches to imbalanced classification ([menon_2013]). Here a loss suited for classification is modified by reweighting the contributions of observations with class label 0 (negative) and observations with class label 1 (positive):
(8) 
The loss depends on the real label, the predicted label, an indicator function and the weights of the positive and the negative class. Calculating the balanced accuracy with the same data with which the neural network’s parameters were learned would result in a biased estimate. This problem can be circumvented via cross-validation ([allofstatistics, p. 362]), i.e. learning with a randomly sampled subset of the dataset and calculating the balanced accuracy with the remaining, unseen data.
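A minimal sketch of such a cost-sensitive loss is given below. It assumes a logistic (cross-entropy) loss and class weights inversely proportional to the class sizes, so that both classes contribute equally. This particular weighting and the function names are assumptions for illustration, since the exact form of equation 8 is not reproduced here.

```python
import math


def class_weights(labels):
    """Assumed weighting: each class receives a total weight of one half."""
    n0 = labels.count(0)
    n1 = labels.count(1)
    return {0: 0.5 / n0, 1: 0.5 / n1}


def weighted_log_loss(labels, probabilities, weights):
    """Cross-entropy in which every observation is scaled by its class weight.

    probabilities[i] is the predicted probability that observation i has label 1.
    """
    eps = 1e-12
    loss = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss -= weights[y] * (math.log(p) if y == 1 else math.log(1.0 - p))
    return loss


labels = [0] * 8 + [1] * 2
weights = class_weights(labels)
print(weighted_log_loss(labels, [0.5] * 10, weights))
```

With these weights the two positive observations together carry as much weight as the eight negative ones, which mirrors the balanced accuracy used for evaluation.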
3.3 Parallel Computation
Since distances need to be estimated between all pairs of clusters, one classification task per pair needs to be solved. Sequentially training one neural network per task would be intractable and therefore we seek a more efficient solution. Recall that a multi-label classification problem can be transformed into one binary classification task per label and vice versa. Since our classification tasks are indeed binary, they might be solved by a single neural network within the framework of multi-label classification, with one label per task. More specifically, our neural network outputs a square matrix with one row and one column per cluster, where the entry in a given row and column can be used as classification rule for the dataset defined by the corresponding pair of clusters. However, there are two differences that set the scenario here apart from ordinary multi-label classification.
First, when an observation is fed into the neural network that originated from neither of the two clusters of a pair, then the score for that pair is meaningless to the underlying classification task and should thus be excluded from the loss calculation. Formulated differently, for an observation originating from a given cluster only the row and the column belonging to that cluster are relevant for the loss computation. This concept is visualized in Figure 1.
The second difference is that a balanced loss needs to be calculated as illustrated in the last section. Naturally, different weight coefficients need to be applied to the classification tasks corresponding to different pairs of clusters. For that purpose the weight is made dependent on the pair of clusters:
(9) 
The total loss can then be computed as:
(10)  
The index variable ranges either over the row or over the column, depending on which of the two terms is considered. Each observation contributes to multiple classification tasks in parallel: it appears with class label 0 in the classification tasks associated with its cluster’s row of output neurons and with class label 1 in the classification tasks associated with its cluster’s column of output neurons, as can be seen in Figure 1. The range of the index variable excludes the observation’s own cluster, since the classification task between a cluster and itself is degenerate and therefore omitted from the loss calculation. As a result some classification tasks are superfluous, which explains the normalization constant at the beginning of equation 10.
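The selection of relevant output entries can be sketched as follows. For an observation from a given cluster, the hypothetical helper below lists the matrix entries that enter the loss together with the class label the observation takes in each task. The zero-based indexing and the function name are assumptions for illustration.

```python
def relevant_tasks(cluster, num_clusters):
    """Output-matrix entries that contribute to the loss for one observation.

    Returns a list of ((row, column), label) tuples. Entries in the row of the
    observation's cluster carry label 0, entries in its column carry label 1.
    The diagonal entry (cluster, cluster) is skipped because that task is
    degenerate.
    """
    tasks = []
    for other in range(num_clusters):
        if other == cluster:
            continue
        tasks.append(((cluster, other), 0))  # row of the observation's cluster
        tasks.append(((other, cluster), 1))  # column of the observation's cluster
    return tasks


# Hypothetical example: an observation from cluster 1 out of 3 clusters.
for entry, label in relevant_tasks(1, 3):
    print(entry, label)
```

All other entries of the output matrix are ignored for this observation, so only two entries per other cluster enter the loss.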
If for each observation only the relevant row and column are computed, we get an algorithm that is efficient from a run-time viewpoint. Despite that optimization, one output neuron per entry of the output matrix is needed, which requires storing a weight matrix with one row per output neuron and one column per unit of the last hidden layer. For a large number of clusters and a realistically sized hidden layer, the required memory reaches the terabyte range, which far exceeds the capacities of current GPUs.
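The storage argument can be made concrete with a short calculation. The sketch below assumes 32-bit weights, a hypothetical number of 10,000 clusters and a hypothetical hidden layer of 512 units. None of these values are taken from the paper. It compares an output layer with one neuron per matrix entry to a layer with only one neuron per cluster, as considered in the following paragraph.

```python
def weight_matrix_bytes(num_outputs, hidden_size, bytes_per_weight=4):
    """Memory needed for a dense output layer without bias terms."""
    return num_outputs * hidden_size * bytes_per_weight


num_clusters = 10_000  # hypothetical
hidden_size = 512      # hypothetical

naive = weight_matrix_bytes(num_clusters * num_clusters, hidden_size)
reduced = weight_matrix_bytes(num_clusters, hidden_size)

print(f"one output per matrix entry: {naive / 1e9:.1f} GB")
print(f"one output per cluster:      {reduced / 1e6:.2f} MB")
```

The quadratic layer grows by a factor equal to the number of clusters relative to the linear one, which is what makes it infeasible on a single GPU for large overclusterings.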
Therefore a way has to be found to reduce the storage requirements of the weight tensor, while still respecting the objective of balanced classification. Imagine a two-step algorithm that first calculates, for an observation, a score for each cluster corresponding to the likelihood that the observation originated from that cluster. In the second step these scores can be used to decide for each pair of clusters which of the two the observation more likely originated from. Specifically, this can be done by comparing the scores associated with the two clusters: when one cluster’s score is higher than the other’s, then it is more likely that the observation originated from the former, and vice versa. Instead of an output matrix merely an output vector with one element per cluster is parameterized. To compute an entry of the output matrix from the pre-activation output of this new output layer, we redefine it:
(11) 
where the sigmoid activation function is applied to a difference between two outputs of the new output layer. We can still calculate scores for the classification tasks associated with every pair of clusters by using unparameterized differences between activations of the new output layer. Since we still have access to the output matrix, the calculation of the total loss does not have to be altered.

4 Experiments
We identified five key criteria for a distance estimation algorithm such as the one we propose here, which we want to explore experimentally:

1. It should be able to handle small initial clusters.

2. It should be able to handle noise in the initial overclustering.

3. It should scale to large datasets without requiring special hardware, i.e. allow for a large number of initial clusters.

4. It should be easy to apply and in particular not require unrealistic amounts of hyperparameter tuning.

5. It should work for various types of datasets.
Obviously, an algorithm that can still accurately calculate distances when initial clusters are small (1.) has an increased range of applications. The initial grouping of the dataset might contain noisy associations, especially when another clustering algorithm is used to provide the overclustering. Therefore an algorithm that is robust to such noise (2.) is preferred. It should be possible to apply the algorithm to datasets with many observations (3.), which makes it necessary to estimate distances between a bigger number of initial clusters, if we assume a constant size for the initial clusters. If our algorithm required tedious tuning of hyperparameters, it would not be interesting from a practitioner’s viewpoint (4.). Further if a procedure depends on the correct setting of hyperparameters, there must at least exist an objective way to select those.
In Section 4.1.1 an experiment is presented that uses artificially created overclusterings of the whole ImageNet dataset to test the method’s scalability and its ability to deal with small initial clusters and noise in the initial clustering. The experiment in Section 4.1.2 compares the proposed method with state-of-the-art image representations on the task of merging realistically obtained overclusterings of ImageNet subsets. The method’s applicability to different domains is assessed in the experiment in Section 4.2 by running it on a point cloud dataset.
All of our experiments were conducted on a GTX 1080 Ti graphics card with 11 GB of VRAM and implemented with the PyTorch framework ([paszke_2017]). We used Adam ([kingma_2014]) for optimization, as it is reported to require less tuning of the learning rate and momentum hyperparameters than comparable methods.

4.1 ImageNet
ImageNet is a large-scale natural image dataset consisting of approximately 1.3 million observations ([deng_2009]). Each image is manually annotated to belong to one of 1000 categories. For all experiments on ImageNet the convolutional architecture ResNet18 was used ([he_2016]).
4.1.1 Artificially Created Overclusterings
In this first series of experiments ImageNet’s annotations are used to artificially create overclusterings. This lets us explore the influence of the size of the initial clusters on the quality of the estimated distances. Additionally, this setting allows us to introduce noise into the overclustering. Depending on the noise ratio, the initial clusters’ purity is artificially decreased by moving observations to clusters with different categories. More specifically, a fraction of the observations equal to the noise ratio is selected globally at random. The selected observations are then moved to a different cluster. A check ensures that the category of the target cluster is different from the observation’s category. This procedure does not strictly enforce that every cluster contains exactly this fraction of noisy observations. On the contrary, some clusters will contain more noisy observations than others, but on average the purity of the resulting clustering will be equal to one minus the noise ratio. At the same time we can test the algorithm’s scalability by setting the initial cluster size to a small value, which in turn results in a high number of initial clusters.
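The noise procedure can be sketched as follows. The data layout (a list mapping each observation to its cluster and a list mapping each cluster to its category) and the function name are assumptions for illustration. Initially pure clusters are assumed, so an observation’s category is that of its original cluster.

```python
import random


def inject_noise(assignment, cluster_category, noise_ratio, seed=0):
    """Move a random fraction of observations to clusters of another category.

    assignment[i] is the cluster index of observation i before noise is added,
    cluster_category[c] is the category of cluster c. At least two categories
    must exist, otherwise no valid target cluster can be found.
    """
    rng = random.Random(seed)
    n = len(assignment)
    noisy = list(assignment)
    # Select the observations globally, not per cluster.
    selected = rng.sample(range(n), round(noise_ratio * n))
    for i in selected:
        own_category = cluster_category[assignment[i]]
        # Only clusters with a different category are valid targets.
        candidates = [c for c, cat in enumerate(cluster_category)
                      if cat != own_category]
        noisy[i] = rng.choice(candidates)
    return noisy


# Hypothetical setup: four clusters of 50 observations, two categories.
cluster_category = [0, 0, 1, 1]
assignment = [c for c in range(4) for _ in range(50)]
noisy = inject_noise(assignment, cluster_category, noise_ratio=0.1)
kept = sum(cluster_category[a] == cluster_category[b]
           for a, b in zip(assignment, noisy))
print(kept / len(assignment))
```

Because observations are drawn globally, individual clusters may lose or receive different numbers of observations, while the overall fraction of misplaced observations matches the noise ratio.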
In our scenario, where we want to use the distances for merging an overclustering, it is less important whether the estimated TVDs are close to the actual TVDs between the clusters’ distributions. Rather, to get meaningful merge decisions it suffices that the approximated distances are smaller for pairs of clusters with the same category than for pairs of clusters with different categories. The relative ordering, not the approximation accuracy, is relevant. Let the result of our algorithm be a distance matrix whose entries are the distances between pairs of initial clusters. We consider two sets of cluster pairs, the pairs of clusters with the same majority category and the pairs of clusters with different majority categories:
(12)  
where the majority category of a cluster is defined as:
(13) 
This definition involves the set of labels of the given dataset and the annotated label of each observation. Now two random variables linking the above sets to their distances can be defined:
(14)  
The quality of the distances can then be defined as the chance that a randomly selected pair of clusters with the same majority category has a lower distance than a randomly selected pair of clusters from different majority categories:
(15) 
If we construct a binary classification dataset by assigning label 0 to the pairs with the same majority category and label 1 to the pairs with different majority categories, and treat the distances as scores, then the quality corresponds to the area under the receiver operating characteristic curve (AUROC) for that classification dataset and can thus be calculated efficiently ([fawcett_2006]).

Further, it is interesting to monitor the accuracies of our classification rules over the course of neural network training to see whether they correlate with the quality of the distances. For that purpose we generate a summary statistic of the individual accuracies of the classification tasks between each pair of clusters by averaging them:
(16) 
where the average runs over the indices of the upper triangle of the distance matrix.
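The quality measure from equation 15 can be sketched directly from its verbal definition. The version below compares every same-category distance with every different-category distance. Counting ties as one half is the usual AUROC convention and is an assumption here, as are the function name and the example values.

```python
def distance_quality(same_distances, different_distances):
    """Probability that a same-category pair is closer than a different-category pair.

    Ties count as one half, which makes the value equal to the AUROC obtained
    when the distances are used as scores for the label 'different category'.
    """
    wins = 0.0
    for s in same_distances:
        for d in different_distances:
            if s < d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(same_distances) * len(different_distances))


# Hypothetical distances for three same-category and two different-category pairs.
print(distance_quality([0.1, 0.2, 0.6], [0.5, 0.9]))
```

This direct comparison is quadratic in the number of pairs. A rank-based AUROC computation gives the same value more efficiently when many cluster pairs are involved.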
Results. The initial cluster size was set to 50, 125 and 200 and the noise ratio was set to 0, 0.1 and 0.3. In Figure 2(a) it can be seen that the distances’ quality increases with bigger initial cluster sizes and with purer initial clusterings. The quality values are so close to 1 that in Figure 2(a) they have to be plotted on a logarithmic scale. Even for the most challenging configuration, an initial cluster size of 50 and a noise ratio of 0.3, the quality maintains a high value of approximately 97%. This indicates that despite considerable noise and a small initial cluster size the distance measure provides reliable guidance for merge decisions. For an initial cluster size of 50 approximately 26000 initial clusters were created, which means that distances between hundreds of millions of cluster pairs have to be computed. Despite that, our algorithm can still be run on our GPU with 11 GB VRAM. Without the optimization from equation 11, fewer than 200 initial clusters could be handled with the same hardware and neural network architecture.
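The optimization from equation 11 replaces the parameterized output matrix by differences of per-cluster scores passed through a sigmoid. A minimal sketch is shown below. The sign convention, in which the entry in row i and column j is large when the observation more likely comes from cluster j, is an assumption chosen to match the labeling of rows (label 0) and columns (label 1) in Section 3.3, and the names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pairwise_scores(cluster_scores):
    """Build the full score matrix from one pre-activation score per cluster.

    Entry [i][j] is sigmoid(score_j - score_i), so no additional parameters
    are needed beyond the per-cluster output layer.
    """
    k = len(cluster_scores)
    return [[sigmoid(cluster_scores[j] - cluster_scores[i]) for j in range(k)]
            for i in range(k)]


# Hypothetical scores for an observation that most likely stems from cluster 0.
matrix = pairwise_scores([2.0, 0.0, -1.0])
for row in matrix:
    print([round(v, 3) for v in row])
```

Entries that mirror each other across the diagonal sum to one, so the tasks for the pairs (i, j) and (j, i) are answered consistently, and the diagonal is always one half.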
For each combination of initial cluster size and noise ratio we additionally monitored how the quality of the distances and the average accuracy evolve during neural network training. As expected, the longer the network is trained, the higher the average accuracy becomes, as can be seen in Figure 2. This indicates that the learned classification rules approximate the balanced Bayes rule better as training progresses. The quality of the distance measure, which ultimately depends on how well the Bayes rule is approximated, therefore increases with the average accuracy.
4.1.2 Realistic Overclusterings
In the previous experiments overclusterings were created artificially by randomly subdividing categories into smaller clusters. Under such ideal circumstances the true TVD between all clusters with different majority categories is maximal and the true TVD between all clusters with the same majority category is minimal, which makes the proposed method naturally adequate. For a realistically obtained overclustering the situation might not be that clear-cut, and therefore we study in the following series of experiments whether the proposed method can be used to merge clusters whose distributions might only partially overlap.
We experiment on 10 subsets of ImageNet, where each subset is created by randomly selecting 10 categories from ImageNet. Because of ImageNet’s high diversity, these subsets can be seen as datasets in their own right. Experimenting on subsets of ImageNet makes it easier to obtain overclusterings that correspond to the ground truth annotations, which makes it possible to evaluate the performed merges. For the same reason we opt for a partial clustering as this further increases purity.
To calculate an overclustering for each of the datasets we first use the state-of-the-art representation learning method DeepCluster ([caron_2018]), which has been trained in an unsupervised manner on the whole ImageNet dataset, to convert images into feature vectors that are better suited for clustering than raw observations. Then we use Algorithm 1 on the retrieved feature vectors to get initial clusters of a fixed size. Any clustering algorithm could be used here instead, as long as it can be configured to return clusters with a minimum size. For this experiment it is however beneficial to have control over the initial cluster sizes to eliminate this source of variation, whose influence has already been explored in the previous experiment, and therefore Algorithm 1 is used.
We proceed by analyzing the overclusterings returned by Algorithm 1. As can be seen in Table 1, the clusters have an average purity of about 90% on all datasets. Because of that it makes sense to associate each cluster with the category of the majority of its observations. Although only a fraction of the observations is contained in each clustering, there are at least 6 unique majority categories in every dataset, as is shown in Table 1.
Table 1: Average cluster purity and number of unique majority categories per dataset.
Dataset   1      2      3      4      5      6      7      8      9      10
Purity    91.2%  91.8%  90.3%  92.8%  94.5%  91.7%  89.0%  89.4%  95.3%  96.3%
Unique    8      8      6      7      7      9      9      8      8      7
Finally, we merge the overclusterings of each dataset hierarchically according to our TVD estimates. Our algorithm will be instantiated after every merge decision to compute distances between all pairs of clusters. Since the DeepCluster features were capable of providing us with a pure initial clustering it makes sense to incorporate them into a baseline. More specifically, we use the average Euclidean distance between clusters’ feature vectors as a sensible alternative to guide the merge process.
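The baseline distance can be sketched as follows. The sketch interprets “average Euclidean distance between clusters’ feature vectors” as the mean over all cross-cluster pairs of feature vectors (average linkage). This interpretation, the function name and the toy vectors are assumptions, since the exact definition is not spelled out here.

```python
import math


def average_euclidean_distance(features_a, features_b):
    """Mean Euclidean distance over all pairs with one vector from each cluster."""
    total = sum(math.dist(a, b) for a in features_a for b in features_b)
    return total / (len(features_a) * len(features_b))


# Hypothetical two-dimensional feature vectors for two small clusters.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0)]
print(average_euclidean_distance(cluster_a, cluster_b))
```

In a hierarchical merge procedure the pair of clusters with the smallest such distance would be merged first, in the same way as the pair with the smallest TVD estimate is merged for the proposed method.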
Since the initial clusters have on average a purity of about 90%, each cluster can be associated with the category of the majority of its observations. Two clusters are merged correctly if they have the same majority category:
(17) 
To compare the two alternatives, we keep track of the number of correctly merged clusters after a fixed number of merge decisions. Given the pair of clusters merged in each step, the number of correct merges after a given number of merges is:
(18) 
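The evaluation in equations 17 and 18 can be sketched as follows. Each merge step is represented by the annotated labels of the two merged clusters. The representation, the function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def majority_category(labels):
    """Most frequent annotated label in a cluster (ties resolved by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]


def correct_merges(merge_steps, num_steps):
    """Number of the first num_steps merges that join clusters of equal majority category.

    merge_steps is a list of (labels_a, labels_b) tuples in merge order, where
    each element holds the annotated labels of the observations in one cluster.
    """
    return sum(
        majority_category(a) == majority_category(b)
        for a, b in merge_steps[:num_steps]
    )


# Hypothetical merge history with one correct and one incorrect merge.
steps = [
    (["cat", "cat", "dog"], ["cat", "cat"]),
    (["dog", "dog"], ["cat", "dog", "cat"]),
]
print(correct_merges(steps, 1), correct_merges(steps, 2))
```

The annotated labels are only used for this evaluation. The merge decisions themselves rely solely on the estimated distances.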
Results. Figure 4 shows the number of correct merges after each merge step, averaged over all datasets. It can be seen that using our proposed method to calculate distances leads, for every number of merge steps, to a higher number of correct merge decisions than the baseline of using average Euclidean distances between DeepCluster feature vectors. We further compared both alternatives on a per-dataset basis. Figure 5 shows that for each dataset using our method leads to a higher number of correct merge decisions. Our method is generally capable of merging the overclusterings of all datasets. On half of the datasets all merge decisions are even correct. This indicates that the distributions of initial clusters with the same majority category indeed overlap significantly.
4.2 Point Cloud: ModelNet40
The success of our method greatly depends on how well the learned classification rules approximate the corresponding balanced Bayes rules. There does not exist any classifier that outperforms every alternative under all circumstances ([wolpert_1996]). It is however customary to build classifiers suited for specific domains in order to boost performance. For neural networks this customization can be done by designing an appropriate architecture. In the following experiments we want to verify the generality of our idea by testing it on a different type of dataset and at the same time study the influence of the neural network architecture on the quality of the estimated distances.
For that purpose we experiment with the ModelNet40 ([wu_2015]) dataset, which consists of category-annotated 3D objects. An object is represented as a list of 3D points that have been sampled from the object’s surface. We chose ModelNet40 because its point cloud data is invariant to permutations, i.e. point clouds that are identical to each other up to the ordering of their points describe the same object. The requirement to deal efficiently with permutation-invariant data is however more general and can be found in other domains as well, for example in graph classification. It has been shown that architectures with a built-in invariance to input permutations achieve significantly better classification accuracies on this dataset.
Analogous to the first experiment, the dataset’s annotations are used to artificially create overclusterings with various initial cluster sizes. We compare the PointNet++ architecture, which has built-in permutation invariance ([qi_2017]) and achieves good results on ModelNet40, to a baseline architecture with a single hidden layer.
Results. In Figure 5(a) we can see that the estimated distances have a significantly higher quality when using the PointNet++ architecture than when using the baseline architecture. The difference is so pronounced that PointNet++ with even outperforms the baseline architecture with by a large margin. This confirms the hypothesis that the choice of architecture has an effect on the viability of this method. Once again we can see in Figure 5(b) that the average accuracy correlates with the quality of the estimated distances. Thus the average accuracy can be used to guide the choice of architecture.
5 Discussion
We have shown that the quality of the estimated distances depends directly on the performance of the classifiers as measured by the average accuracy. Fortunately, the method requires the estimation of accuracies on held-out validation sets anyway, and therefore the computation of the average accuracy comes at almost no cost. Note that its computation does not require labeled data. The availability of the average accuracy allows for a principled choice of hyperparameters like learning rate and momentum, to which neural network training can be sensitive. It could even be used to guide the selection of the network’s architecture, as was demonstrated with the ModelNet40 experiment. In the first experiment we have seen that the quality of the estimated distances keeps improving as long as the average accuracy is increasing, and therefore the training process can be terminated early whenever the average accuracy stops improving.
An advantage of this method is that it inherits the strengths of neural networks. As a result we were able to show that it also works for highdimensional datasets like ImageNet. In particular the flexibility that comes with the choice of an architecture makes the proposed technique interesting for a wide range of domains. While details like the neural network’s architecture might need to be adapted to the domain of interest, the overall principle to compute distances between sets of observations via classification makes sense for arbitrary domains.
In the second experiment we used hierarchical clustering to merge the overclustering, but the calculated distances could be also fed as input to other clustering techniques like spectral algorithms. We used our algorithm in the second experiment after each merge decision to compute distances, i.e. a neural network had to be trained from scratch for each merge step. To handle larger datasets in future work more efficient alternatives could be explored.
In this paper we focused on the balanced accuracy of an ordinary neural network as a distance, which corresponds to the TVD between the class-conditional distributions. Alternatively, by forcing the neural network to conform to a Lipschitz constraint, Wasserstein distances might be computed instead ([sriperumbudur_2009]).
6 Conclusion
We presented a principled method for merging overclusterings in arbitrary domains. Neural networks are used to efficiently estimate TVDs between all pairs of clusters in parallel. Empirically it has been shown that the method is viable for challenging, high-dimensional datasets. The procedure inherits its strengths from neural networks, which can be adapted to the domain of interest via the choice of their architecture. Only due to a computational trick, with which the required number of output neurons could be reduced from one per pair of clusters to one per cluster, does the method become applicable to big datasets on consumer hardware. In future work the method could be studied in various contexts where overclusterings are generated naturally. Further, with adequate regularization of the neural network, alternative distances like the Wasserstein distance could be calculated instead.
References
Appendix A Relationship Between Balanced Bayes Accuracy and TVD
For what follows it helps to express the balanced Bayes rule (see equation 5) via the class-conditional distributions:
(19)  
To show the relationship between the TVD and the expected balanced accuracy of the balanced Bayes rule, we first pull out the constant term:
(20)  
Next a set is introduced such that the normalized sums in the equation above correspond to the fractions of observations which belong to this set. We now take the expectation over both sets of observations and use that the expectation of the binary function equals the probability of the set ([gutmann_2018]):
(21) 
Therefore we can write:
(22) 
It follows from equation 19 that and therefore we have ([pollard_2001, p. 60]):
(23)  
Overall we conclude that:
(24) 
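For orientation, the chain of steps above can be summarized in one possible notation. All symbols below are assumptions introduced for this summary and are not necessarily those of the original equations: $P_0, P_1$ denote the class-conditional distributions with densities $p_0, p_1$, $h^*$ the balanced Bayes rule, $\operatorname{bacc}$ the balanced accuracy and $A$ the set on which $h^*$ predicts label 1.

```latex
\begin{align}
h^*(x) &= \mathbb{1}\left[p_1(x) > p_0(x)\right],
  \qquad A = \{x : p_1(x) > p_0(x)\} \\
\mathbb{E}\left[\operatorname{bacc}(h^*)\right]
  &= \tfrac{1}{2}\left[P_0(A^{c}) + P_1(A)\right] \\
  &= \tfrac{1}{2} + \tfrac{1}{2}\left[P_1(A) - P_0(A)\right] \\
  &= \tfrac{1}{2} + \tfrac{1}{2}\sup_{B}\left|P_1(B) - P_0(B)\right|
   = \tfrac{1}{2} + \tfrac{1}{2}\operatorname{TVD}(P_0, P_1)
\end{align}
```

The last step uses that the supremum in the definition of the TVD is attained on the set where one density exceeds the other.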