1 Introduction
With modern supervised learning methods, machines can recognize thousands of visual categories with high reliability; in fact, machines can outperform individual humans when performance depends on extensive domain-specific knowledge, as required for example to recognize hundreds of species of dogs in ImageNet [11]. However, it is also clear that machines are still far behind human intelligence in some fundamental ways. A prime example is the fact that good recognition performance can only be obtained if computer vision algorithms are manually supervised. Modern machine learning methods have little to offer in an open-world setting, in which image categories are not defined a priori, or for which no labelled data is available. In other words, machines lack the ability to structure data automatically, understanding concepts such as object categories without external supervision.
In this paper, we study the problem of discovering and recognizing visual categories automatically. However, rather than considering a fully unsupervised setting, we assume that the machine already possesses certain knowledge about some of the categories in the world. Then, given additional images that belong to new categories, the problem is to tell how many new categories there are and to learn to recognize them. The aim is to guide this process by transferring knowledge from the old classes to the new ones (see fig. 1).
This approach is motivated by the following observation. Unlike existing machine learning models, a child can easily tell an unseen animal category (e.g., bird) after learning a few other (seen) animal categories (e.g., cat, dog); and an adult wandering around a zoo or wildlife park can effortlessly discover new categories of animals (e.g., okapi) based on the many categories previously learnt. In fact, while we can manually annotate some categories in the world, we cannot annotate them all, even in relatively restricted settings. For example, consider the problem of recognizing products in supermarkets for the purpose of market research: hundreds of new products are introduced every week and providing manual annotations for all is hopelessly expensive. However, an algorithm can draw on knowledge of several thousand products in order to discover new ones as soon as they enter the data stream.
This problem lies at the intersection of three widely-studied areas: semi-supervised learning [7], transfer learning [24, 37], and clustering [1]. However, it has not been extensively addressed in any of them. In semi-supervised learning, labelled and unlabelled data contain the same categories, an assumption that is not valid in our case. Moreover, semi-supervised learning has been shown to perform poorly if the unlabelled data is contaminated with new categories [23], which is problematic in our case. In transfer learning [24], a model may be trained on one set of categories and then fine-tuned to recognize different categories, but both source and target datasets are annotated, whereas in our case the target dataset is unlabelled. Our problem is more similar to clustering [1], extensively studied in machine learning, since the goal is to discover classes without supervision. However, our goal is also to leverage knowledge of other classes to improve the discovery of the new ones. Since classes are a high-level abstraction, discovering them automatically is challenging, and perhaps impossible, since there are many criteria that could be used to cluster data (e.g., we may equally well cluster objects by color, size, or shape). Knowledge about some classes is not only a realistic assumption, but also indispensable to narrow down the meaning of clustering.
Our contribution is a method that can discover and learn new object categories in unlabelled data while leveraging knowledge of related categories. This method has two components. The first is a modification of a recent deep clustering approach, Deep Embedded Clustering (DEC) [38], that can cluster data while learning a data representation at the same time. The purpose of the modification is to allow clustering to be guided by the known classes. We also extend the algorithm by introducing a representational bottleneck, temporal ensembling, and consistency, which boost its performance considerably.
However, this method still requires knowing the number of new categories in the unlabelled data, which is not a realistic assumption in many applications. So, the second component is a mechanism to estimate the number of classes. This also transfers knowledge from the set of known classes. The idea is to use part of the known classes as a probe set, adding them to the unlabelled set while pretending that part of them are unlabelled, and then running the clustering algorithm described above on the extended unlabelled dataset. This allows us to cross-validate the number of classes to pick, according to the clustering accuracy on the probe set as well as a cluster quality index on the unlabelled set, resulting in a reliable estimate of the true number of unlabelled classes.
We empirically demonstrate the strength of our approach on public benchmarks such as ImageNet, OmniGlot, CIFAR-100, CIFAR-10, and SVHN, outperforming competitors by a substantial margin in all cases. Our code can be found at http://www.robots.ox.ac.uk/~vgg/research/DTC.
2 Related work
Our work is related to semi-supervised learning, transfer learning, and clustering. These three areas are widely studied, and it is beyond the scope of this paper to review all of them. Below we briefly review the most representative and related works in each area.
Semi-supervised learning (SSL) [7, 23, 25] aims to solve a closed-set classification problem in which part of the data is labelled while the rest is not. In the context of SSL, both the labelled data and unlabelled data share the same categories, an assumption that does not hold in our case. A comprehensive study of recent SSL methods can be found in [23]. The consistency-based SSL methods (e.g., [19, 34]) have been shown to achieve promising results. Laine and Aila [19] proposed to incorporate the unlabelled data during training by enforcing consistency between the predictions of a data sample and its transformed counterpart, which they call the Π model, or between the current prediction and the temporal ensembling prediction. Instead of keeping a temporal ensembling prediction, Tarvainen and Valpola [34] proposed to maintain a temporal ensembling model, and to enforce consistency between the predictions of the main model and those of the temporal ensembling model.
In transfer learning [24, 37, 33], a model is first trained on one labelled dataset, and then fine-tuned on another labelled dataset containing different categories. Our case is similar to transfer learning in the sense that we also transfer knowledge from a source dataset to a target dataset, though our target dataset is unlabelled. With the advent of deep learning, the most common form of transfer learning nowadays is to fine-tune models pretrained on ImageNet [11] for specific tasks with labelled data. However, in our case, no labels are available for the new task.
Clustering [1] has long been studied in machine learning. A number of classic methods (e.g., k-means [21], mean-shift [9]) have been widely applied in many applications. Recently, there have been more and more works on clustering in the deep learning literature (e.g., [38, 6, 12, 39, 40]). Among them, Deep Embedded Clustering (DEC) [38] appears to be one of the most promising learning-based clustering approaches. It can simultaneously cluster the data and learn a proper data representation. It is trained in two phases. The first phase trains an autoencoder using a reconstruction loss, and the second phase fine-tunes the encoder of the autoencoder with an auxiliary target distribution. However, it does not take the available labelled data of seen categories into account, so its performance is still far from satisfactory.
Our work is also related to metric learning [30, 31, 29] and domain adaptation [36]. In fact, we build on metric learning, as the latter is used for initialization. However, most metric learning methods are unable to exploit unlabelled data, while our method can automatically adjust the embedding space on unlabelled data. More importantly, our task requires producing a partition of the data (a discrete decision), whereas metric learning only produces a continuous data embedding, and converting the latter to discrete classes is often not trivial. Domain adaptation aims to resolve the domain discrepancy between source and target datasets (e.g., digital SLR camera images vs web camera images), while generally assuming a shared class space. Thus, the source and target data are on different manifolds. In our case, the unlabelled data belongs to novel categories without any labels, and lies on the same manifold as the labelled data, which is a more practical but more challenging scenario.
To our knowledge, the works most closely related to ours are [15] and [16], in that they consider novel visual category discovery as a deep transfer clustering task. In [15], Hsu et al. introduced a Constrained Clustering Network (CCN) which is trained in two stages. In the first stage, a binary classification model is trained on labelled data to measure the pairwise similarity of images. In the second stage, a clustering model is trained on unlabelled data by using the output of the binary classification model as supervision. The network is trained with a Kullback-Leibler divergence based contrastive loss (KCL). In [16], the CCN is improved by replacing KCL with a new loss called Meta Classification Likelihood (MCL). In addition, Huang et al. [17] recently introduced Centroid Networks for few-shot clustering, which cluster $K \times M$ unlabelled images into $K$ clusters with $M$ images each after training on labelled data.
3 Deep transfer clustering
We propose a method for data clustering: given as input an unlabelled dataset $D^u = \{x_i^u\}$, usually of images, the goal is to produce as output class assignments $y_i^u \in \{1, \dots, K\}$, where the number of different classes $K$ is unknown. Since there can be multiple equally-valid criteria for clustering data, making a choice depends on the application. Thus, we also assume we have a labelled dataset $D^l = \{(x_i^l, y_i^l)\}$ where class assignments $y_i^l \in \{1, \dots, L\}$ are known.
The classes in this labelled set differ, in identity and number, from the classes in the unlabelled set. Hence the goal is to learn from the labelled data not its specific classes, but what properties make a good class in general, so that this knowledge can be used to discover new classes and their number in the unlabelled data.
We propose a method with two components. The first is an extension of a deep clustering algorithm that can transfer knowledge from a known set of classes to a new one (section 3.1); the second is a method to reliably estimate the number of unlabelled classes (section 3.2).
3.1 Transfer clustering and representation learning
At its core, our method is based on a deep clustering algorithm that clusters the data while simultaneously learning a good data representation. We extract this representation by applying a neural network $f_\theta$ to the data, obtaining embedding vectors $z = f_\theta(x) \in \mathbb{R}^d$. The representation is initialised using the labelled data, and then fine-tuned using the unlabelled data. This is done via deep embedded clustering (DEC) of [38] with three important modifications: the method is extended to account for labelled data, to include a tight bottleneck to improve generalization, and to incorporate temporal ensembling and consistency, which also contribute to its stability and performance. An overview of our approach is given in algorithm 1.
3.1.1 Joint clustering and representation learning
In this section we summarise DEC [38], as this algorithm lies at the core of our approach. In DEC, similar to k-means, clusters are represented by a collection of vectors or prototypes $U = \{\mu_k, k = 1, \dots, K\}$ representing the cluster "centers". However, differently from k-means, the goal is not only to determine the clusters, but also to learn the data representation $f_\theta$.
Naively combining representation learning, which is a discriminative task, and clustering, which is a generative one, is challenging. For instance, directly minimizing the k-means objective function would immediately collapse the learned representation vectors to the closest cluster centers. DEC [38] addresses this problem by slowly annealing cluster centers and data representation.
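In order to do so, let $p(k|i)$ be the probability of assigning data point $i \in \{1, \dots, N\}$ to cluster $k \in \{1, \dots, K\}$, where $N$ is the number of data points being clustered and $z_i$ is the embedding of point $i$. DEC uses the following parameterization of this conditional distribution, obtained by assuming a Student's $t$ distribution with $\alpha$ degrees of freedom:
$$p(k|i) \propto \left(1 + \frac{\|z_i - \mu_k\|^2}{\alpha}\right)^{-\frac{\alpha + 1}{2}}. \qquad (1)$$
Further assuming that data indices are sampled uniformly (i.e. $p(i) = 1/N$), we can write the joint distribution $p(i, k) = p(k|i)/N$.
In order to anneal to a good solution, instead of maximizing the likelihood of the model $p$ directly, we match the model to a suitably-shaped distribution $q$. This is done by minimizing the KL divergence between the joint distributions $q(i, k) = q(k|i)/N$ and $p(i, k) = p(k|i)/N$, given by
$$E(q) = \mathrm{KL}(q \,\|\, p) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} q(k|i) \log \frac{q(k|i)}{p(k|i)}. \qquad (2)$$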
It remains to show how to construct the target distribution $q$ as a progressively sharper version of the current distribution $p$. Concretely, this is done by setting
$$q(k|i) \propto p(k|i) \cdot p(i|k).$$
In this manner the assignment of image $i$ to cluster $k$ is reinforced when the current distribution $p$ assigns a high probability of going from $i$ to $k$ as well as of going from $k$ to $i$. The latter has an equalization effect, as the probability of sampling data point $i$ in cluster $k$ is high only if the cluster is not too large. Using Bayes rule for $p(i|k)$, the expression can be rewritten as
$$q(k|i) \propto \frac{p(k|i)^2}{\sum_{i=1}^{N} p(k|i)}. \qquad (3)$$
Hence the target distribution is constructed by first raising $p(k|i)$ to the second power, which sharpens it, and then normalizing by the frequency per cluster, which balances it.
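As an illustration of eqs. 1 to 3, the following minimal sketch computes the Student's t soft assignments, the sharpened target distribution, and the KL objective for a handful of toy 2-D embeddings. It is not the released implementation: the function names, the toy data, and the choice alpha = 1 are assumptions made here for illustration only.

```python
import math

def soft_assign(z, centers, alpha=1.0):
    """Eq. 1: Student's t soft assignments p(k|i), normalised over clusters."""
    p = []
    for zi in z:
        row = []
        for mu in centers:
            d2 = sum((a - b) ** 2 for a, b in zip(zi, mu))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        s = sum(row)
        p.append([v / s for v in row])
    return p

def target_distribution(p):
    """Eq. 3: square p, divide by the soft cluster frequency, renormalise per point."""
    n_clusters = len(p[0])
    freq = [sum(row[k] for row in p) for k in range(n_clusters)]
    q = []
    for row in p:
        w = [row[k] ** 2 / freq[k] for k in range(n_clusters)]
        s = sum(w)
        q.append([v / s for v in w])
    return q

def kl_loss(q, p):
    """Eq. 2: KL divergence between the joints q(k|i)/N and p(k|i)/N."""
    n = len(p)
    return sum(qk * math.log(qk / pk)
               for q_row, p_row in zip(q, p)
               for qk, pk in zip(q_row, p_row) if qk > 0) / n

# Toy embeddings (assumed data): two tight groups and one ambiguous point.
z = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0], [1.0, 1.2]]
centers = [[0.0, 0.0], [2.0, 2.0]]
p = soft_assign(z, centers)
q = target_distribution(p)
loss = kl_loss(q, p)
```

On this toy input every row of `q` is more peaked than the corresponding row of `p`, which is the sharpening effect described above; the loss is what a gradient step on the embeddings and centers would reduce.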
In practice, eq. 2 is minimized in an alternate-optimization fashion. Namely, fixing a target distribution $q(k|i)$, the representation $f_\theta$ is optimized using stochastic gradient descent or a similar method to minimize eq. 2 for a certain number of iterations, usually corresponding to a complete sweep over the available training data (an epoch). Equation 3 is then used to sharpen the target distribution and the process is repeated.
Transferring knowledge from known categories.
The clustering algorithm described above is entirely unsupervised. However, our goal is to aid the discovery of new classes by leveraging a certain number of known classes. We capture such information in the image representation $f_\theta$, which is pretrained on the labelled dataset using a metric learning approach. In order to train $f_\theta$, one can use the cross-entropy loss, the triplet loss or the prototypical loss, depending on which supervised approach works best for the specific data.
Bottleneck.
Algorithm 1 requires an initial setting for the cluster centers $U$. We obtain this initialization by running the k-means algorithm on the set of features $Z$ extracted from the unlabelled data. However, we found this step to perform much better after introducing a step of dimensionality reduction in the feature representation. To this end, PCA is applied to the feature vectors $Z$, resulting in a dimensionality reduction layer $\hat{z} = Az + b$. Importantly, we retain a number of components equal to the number of unlabelled classes $K$, so that $A \in \mathbb{R}^{K \times d}$. This linear layer is then added permanently as the head of the deep network, and its parameters $A$ and $b$ are further fine-tuned during clustering together with the other parameters.
3.1.2 Temporal ensembling and consistency
The key idea of DEC is to slowly anneal clusters to learn a meaningful partition of the data. Here, we propose a modification of DEC that can further improve the smoothness of the annealing process via temporal ensembling [19]. To apply temporal ensembling to DEC, the clustering models computed at different epochs are aggregated by maintaining an exponential moving average (EMA) of the previous distributions.
In more detail, we first accumulate the network predictions $p$ into an ensemble prediction $P$ via
$$P^t(k|i) = \beta \cdot P^{t-1}(k|i) + (1 - \beta) \cdot p^t(k|i), \qquad (4)$$
where $\beta$ is a momentum term controlling how far the ensemble reaches into training history, and $t$ indicates the time step. To correct the zero initialization of the EMA [19], $P^t$ is rescaled to obtain the smoothed model distribution
$$\tilde{p}^t(k|i) = \frac{1}{1 - \beta^t} \cdot P^t(k|i). \qquad (5)$$
Equation 5 is plugged into eq. 3 to obtain a new target distribution $\tilde{q}^t(k|i)$. In turn, this defines a variant of eq. 2 that is then optimized to learn the model.
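The update of eqs. 4 and 5 can be sketched in a few lines. This is a minimal illustration rather than the released code: the function name `update_ensemble` and the momentum value 0.6 are assumptions chosen here, and predictions are stored as plain nested lists.

```python
def update_ensemble(P_prev, p_t, t, beta=0.6):
    """One EMA step (eq. 4) followed by zero-initialisation correction (eq. 5).

    P_prev: accumulated predictions from the previous epoch (all zeros before epoch 1).
    p_t:    current network predictions p(k|i), one row per data point.
    t:      1-based epoch index; beta is an assumed momentum value.
    """
    P_t = [[beta * a + (1.0 - beta) * b for a, b in zip(row_prev, row_new)]
           for row_prev, row_new in zip(P_prev, p_t)]
    p_smooth = [[v / (1.0 - beta ** t) for v in row] for row in P_t]
    return P_t, p_smooth

# If the network outputs the same predictions every epoch, the corrected
# ensemble should reproduce them despite the zero initialisation.
p_epoch = [[0.7, 0.3], [0.2, 0.8]]
P = [[0.0, 0.0], [0.0, 0.0]]
for t in range(1, 4):
    P, p_smooth = update_ensemble(P, p_epoch, t)
```

Without the division by $1 - \beta^t$, the early ensemble predictions would be biased towards zero and would not sum to one per data point.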
Consistency constraints have been shown to be effective in SSL (e.g., [19, 34]). A consistency constraint can be incorporated by enforcing the predictions of a data sample and its transformed counterpart (which can be obtained by applying data transformations such as random cropping and horizontal flipping to the original data sample) to be close (known as the Π model in SSL), or by enforcing the prediction of a data sample and its temporal ensemble prediction to be close. Such consistency constraints can also be used to improve our method. After introducing consistency, the loss in eq. 2 becomes
$$E(q) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} q(k|i) \log \frac{q(k|i)}{p(k|i)} + \omega(t) \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \|p(k|i) - p'(k|i)\|^2, \qquad (6)$$
where $p'(k|i)$ is either the prediction of the transformed sample or the temporal ensemble prediction $\tilde{p}^t(k|i)$, and $\omega(t)$ is a ramp-up function as used in [19, 34] to gradually increase the weight of the consistency constraint from 0 to 1.
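A small sketch of how the consistency term and its ramp-up weight might be combined with the clustering loss is given below. The Gaussian-shaped ramp-up $\exp(-5(1 - s)^2)$ follows the form used in the temporal ensembling literature [19]; its use here, the ramp length, and all function names are assumptions for illustration, not a statement of the exact schedule used in the experiments.

```python
import math

def rampup(t, ramp_len):
    """Weight that grows smoothly from about 0 (t = 0) to 1 (t >= ramp_len)."""
    if ramp_len <= 0:
        return 1.0
    s = min(max(t / ramp_len, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - s) ** 2)

def consistency_term(p, p_ref):
    """Mean squared difference between predictions, averaged over all N*K entries."""
    n, k = len(p), len(p[0])
    return sum((a - b) ** 2
               for row, ref in zip(p, p_ref)
               for a, b in zip(row, ref)) / (n * k)

def total_loss(kl, p, p_ref, t, ramp_len):
    """Clustering loss plus the ramped-up consistency penalty."""
    return kl + rampup(t, ramp_len) * consistency_term(p, p_ref)

# p_ref plays the role of the prediction for a transformed sample
# or of the temporal ensemble prediction.
p = [[0.7, 0.3], [0.2, 0.8]]
p_ref = [[0.6, 0.4], [0.2, 0.8]]
loss_early = total_loss(0.5, p, p_ref, t=0, ramp_len=80)
loss_late = total_loss(0.5, p, p_ref, t=80, ramp_len=80)
```

Early in training the penalty is almost switched off, so unreliable predictions do not dominate; by the end of the ramp it contributes with full weight.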
3.2 Estimating the number of classes
So far, we have assumed that the number of classes in the unlabelled data is known, but this is usually not the case in real applications. Here we propose a new approach to estimate the number of classes in the unlabelled data by making use of labelled probe classes. The probe classes are combined with the unlabelled data and the resulting set is clustered using k-means multiple times, varying the number of classes. The resulting clusters are then examined by computing two quality indices, one of which checks how well the probe classes, for which ground-truth data is available, have been identified. The number of categories is then estimated to be the one that maximizes these quality indices.
In more detail, we first split the $L$ known classes of the labelled set $D^l$ into a probe subset $D^l_r$ of $L_r$ classes and a training subset $D^l \setminus D^l_r$ containing the remaining $L - L_r$ classes. The $L - L_r$ classes are used for supervised feature representation learning, while the $L_r$ probe classes are combined with the unlabelled data $D^u$ for class number estimation. We then further split the $L_r$ probe classes into a subset $D^l_{ra}$ of $L^a_r$ classes and a subset $D^l_{rv}$ of $L^v_r$ classes, which we call anchor probe set and validation probe set respectively. We then run a constrained (semi-supervised) k-means on $D^l_r \cup D^u$ to estimate the number of classes in $D^u$. Namely, during k-means, we force images in the anchor probe set $D^l_{ra}$ to map to clusters following their ground-truth labels, while images in the validation probe set $D^l_{rv}$ are considered as additional "unlabelled" data. We launch this constrained k-means multiple times by sweeping the number of total categories $C$ in $D^l_r \cup D^u$, and measure the constrained clustering quality on $D^l_r \cup D^u$. We consider two quality indices, given below, for each value of $C$. The first measures the cluster quality in the labelled validation probe set $D^l_{rv}$, whereas the second measures the quality in the unlabelled data $D^u$. Each index is used to determine an optimal number of classes and the results are averaged. Finally, k-means is run one last time with this value as the number of classes, and any outlier cluster in $D^u$, defined as containing less than a fraction $\tau$ of the mass of the largest cluster, is dropped. The details are given in algorithm 2.
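The two core ingredients of this procedure, a k-means whose anchor points are pinned to their ground-truth clusters and the averaging of the two per-index optima, can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: anchor classes are numbered 0, 1, ... and there is at least one of them, extra centers are initialised by a deterministic farthest-point rule, the function names are hypothetical, and the final subtraction of the probe classes simply reflects that the swept value counts all categories in the probe plus unlabelled data.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def constrained_kmeans(points, anchor_labels, n_clusters, n_iter=20):
    """k-means in which anchor points (label not None) never change cluster."""
    n_anchor = len({a for a in anchor_labels if a is not None})
    free = [x for x, a in zip(points, anchor_labels) if a is None]
    # Anchor clusters start at their class means; remaining centers are the
    # free points farthest from all centers chosen so far (assumed init).
    centers = [mean([x for x, a in zip(points, anchor_labels) if a == k])
               for k in range(n_anchor)]
    while len(centers) < n_clusters:
        centers.append(max(free, key=lambda x: min(math.dist(x, c) for c in centers)))
    assign = []
    for _ in range(n_iter):
        assign = [a if a is not None else
                  min(range(n_clusters), key=lambda k: math.dist(x, centers[k]))
                  for x, a in zip(points, anchor_labels)]
        for k in range(n_clusters):
            members = [x for x, c in zip(points, assign) if c == k]
            if members:
                centers[k] = mean(members)
    return assign

def estimate_unlabelled_classes(acc_by_c, cvi_by_c, n_probe_classes):
    """Average the best total class count under each index, then remove probes."""
    c_acc = max(acc_by_c, key=acc_by_c.get)
    c_cvi = max(cvi_by_c, key=cvi_by_c.get)
    return round((c_acc + c_cvi) / 2) - n_probe_classes

# Toy data: two anchor classes, one validation-probe-like point near class 0,
# and two unlabelled groups around x = 20 and x = 30.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (0.2, 0.5),
          (20, 0), (20, 1), (30, 0), (30, 1)]
anchor_labels = [0, 0, 1, 1, None, None, None, None, None]
assign = constrained_kmeans(points, anchor_labels, n_clusters=4)
k_est = estimate_unlabelled_classes({12: 0.7, 14: 0.9, 16: 0.8},
                                    {12: 0.30, 14: 0.35, 16: 0.50}, 5)
```

In a full run, `acc_by_c` and `cvi_by_c` would be filled by repeating the constrained clustering for each candidate total number of categories and evaluating the two indices on the validation probe set and on the unlabelled data respectively.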
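Cluster quality indices. We measure our clustering for class number estimation with two indices. The first index is the average clustering accuracy (ACC), which is applicable to the $L^v_r$ labelled classes in the validation probe set $D^l_{rv}$ and is given by
$$\mathrm{ACC} = \max_{g \in \mathrm{Sym}(L^v_r)} \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\bar{y}_i = g(y_i)\}, \qquad (7)$$
where $\bar{y}_i$ and $y_i$ denote the ground-truth label and clustering assignment for each of the $N$ data points $x_i$ in the validation probe set, and $\mathrm{Sym}(L^v_r)$ is the group of permutations of $L^v_r$ elements (as a clustering algorithm recovers clusters in an arbitrary order).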
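For a handful of classes, the maximisation over permutations in eq. 7 can be carried out by brute force, as in the sketch below. The function name is an assumption, and enumerating permutations is only practical for a small number of clusters; a linear assignment solver (Hungarian algorithm) is the usual choice at scale.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best agreement between labels and cluster ids over all relabellings."""
    labels = sorted(set(y_true) | set(y_pred))
    best_hits = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # cluster id -> class label
        hits = sum(1 for t, c in zip(y_true, y_pred) if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]  # cluster ids are arbitrary; one point is misplaced
acc = clustering_accuracy(y_true, y_pred)
```

Here the best relabelling swaps cluster ids 0 and 1, leaving a single misassigned point out of six.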
The other index is a cluster validity index (CVI) [2] which, by capturing notions such as intra-cluster cohesion vs inter-cluster separation, is applicable to the unlabelled data $D^u$. There are several CVI metrics, such as Silhouette [26], Dunn [13], Davies–Bouldin [10], and Calinski-Harabasz [5]; while no metric is uniformly the best, the Silhouette index generally works well [2, 3], and we found it to be a good choice for our case too. This index is given by
$$\mathrm{CVI} = \sum_{x \in D^u} \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}, \qquad (8)$$
where $x$ is a data sample, $a(x)$ is the average distance between $x$ and all other data samples within the same cluster, and $b(x)$ is the smallest average distance of $x$ to all points in any other cluster (of which $x$ is not a member).
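A direct, quadratic-time sketch of the Silhouette index is shown below for small toy inputs. It reports the mean over samples, which differs from a plain sum only by a constant factor and therefore selects the same number of clusters; the function name, the handling of singleton clusters (they contribute zero), and the requirement of at least two clusters are assumptions of this sketch.

```python
import math

def silhouette(points, labels):
    """Mean silhouette value; assumes at least two clusters are present."""
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    total = 0.0
    for x, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # singleton cluster: contributes 0 by convention
        # a(x): mean distance to the other members of x's own cluster.
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b(x): smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(x, y) for y in other) / len(other)
                for k, other in clusters.items() if k != c)
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that mix the two groups
```

A partition that respects the two groups scores close to 1, while one that mixes them scores below 0, which is why the index can rank candidate numbers of clusters without any labels.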
4 Experimental results
We assess two scenarios over multiple benchmarks: first, where the number of new classes is known, for OmniGlot, ImageNet, CIFAR-10, CIFAR-100 and SVHN; and second, where the number of new classes is unknown, for OmniGlot, ImageNet and CIFAR-100. For the unknown scenario we separate a probe set from the labelled classes.
4.1 Data and experimental details
OmniGlot [20]. This dataset contains 1,623 handwritten characters from 50 different alphabets. It is divided into a 30-alphabet (964 characters) subset called the background set and a 20-alphabet (659 characters) subset called the evaluation set. Each character is considered as one category and has 20 example images. We use the background and evaluation sets as labelled and unlabelled data, respectively. To experiment with an unknown number of classes, we randomly hold out 5 alphabets from the background set (169 characters in total) to use as probes for algorithm 2, leaving the remaining 795 characters to learn the feature extractor.
ImageNet [11]. ImageNet contains 1,000 classes and about 1,000 example images per class. We follow [35] and split the data into two subsets containing 882 and 118 classes respectively. Following [15, 16], we consider the 882-class subset as labelled data, and use three randomly sampled 30-class subsets from the remaining 118-class subset as unlabelled data. To experiment with an unknown number of classes, we randomly hold out 82 classes from the 882-class subset as probes, leaving the remaining 800 classes for training the feature extractor.
CIFAR-10/CIFAR-100 [18]. CIFAR-10 contains 50,000 training images and 10,000 test images from 10 classes. Each image has a size of $32 \times 32$. We split the training images into labelled and unlabelled subsets. In particular, we consider the images of the first 5 categories (i.e., airplane, automobile, bird, cat, deer) as the labelled set, and those of the remaining 5 categories (i.e., dog, frog, horse, ship, truck) as the unlabelled set. CIFAR-100 is similar to CIFAR-10, except that it has 10 times fewer images per class. We consider the first 80 classes as labelled data, and the last 10 classes as unlabelled data, leaving 10 classes as the probe set for category number estimation on unlabelled data.
SVHN [22]. SVHN contains 73,257 images of digits for training and 26,032 images for testing. We split the 73,257 training digits into labelled and unlabelled subsets. Namely, we consider the images of digits 0-4 as the labelled set, and the images of digits 5-9 as the unlabelled set. The labelled set contains 45,349 images, while the unlabelled set contains 27,908 images.
Evaluation metrics. We adopt the conventionally used clustering accuracy (ACC) and normalized mutual information (NMI) [32] to evaluate the clustering performance of our approach. Both metrics take values in the range $[0, 1]$, and higher values mean better performance. We measure the error in the estimation of the number of novel categories as $|K_{gt} - K_{est}|$, where $K_{gt}$ and $K_{est}$ denote the ground-truth and estimated number of categories, respectively.
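For reference, NMI can be computed from label counts alone, as in the sketch below. It normalises the mutual information by the geometric mean of the two entropies, which is one common convention; whether the reported numbers use exactly this normalisation is an assumption here, and the function name is illustrative.

```python
import math
from collections import Counter

def nmi(y_true, y_pred):
    """Mutual information divided by sqrt(H(y_true) * H(y_pred))."""
    n = len(y_true)
    count_u, count_v = Counter(y_true), Counter(y_pred)
    count_uv = Counter(zip(y_true, y_pred))
    mi = sum(c / n * math.log(c * n / (count_u[u] * count_v[v]))
             for (u, v), c in count_uv.items())
    h_u = -sum(c / n * math.log(c / n) for c in count_u.values())
    h_v = -sum(c / n * math.log(c / n) for c in count_v.values())
    if h_u == 0 or h_v == 0:
        # Degenerate case: a single class or cluster on one side.
        return 1.0 if h_u == h_v else 0.0
    return mi / math.sqrt(h_u * h_v)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # clusters carry no label info
```

Like ACC, NMI is invariant to how cluster ids are named, so no explicit matching between clusters and classes is needed.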
Network architectures. For a fair comparison, we follow [15, 16] and use a 6-layer VGG-like architecture [27] for OmniGlot and CIFAR-100, and a ResNet-18 [14] for ImageNet and all other datasets.
Training configurations. OmniGlot is widely used in the context of few-shot learning due to the very large number of categories it contains and the small number of example images per category. Hence, in order to train the feature extractor on the background set of OmniGlot we use the prototypical loss [28], one of the best methods for few-shot learning. We train the feature extractor with a batch size of 200, forming batches by randomly sampling 20 categories and including 10 images per category. For each category, 5 images are used as supporting data to calculate the prototypes while the remaining 5 images are used as query samples. We use the Adam optimizer with a learning rate of 0.001 for 200 epochs. We then fine-tune the feature extractor and train the bottleneck and the cluster centers for each alphabet in the evaluation set. For warm-up (in algorithm 1), the Adam optimizer is used with a learning rate of 0.001, and the model is trained for 10 epochs without updating the target distribution. Afterwards, training continues for another 90 epochs, updating the target distribution once per epoch. For ImageNet and other datasets, which are widely used in supervised image classification tasks, we pretrain the feature extractor using the cross-entropy loss on the labelled subsets. Following common practice, we then remove the last layer of the classification network and use the rest of the model as our feature extractor.
In our experiment on ImageNet, we take the pretrained ImageNet classification network of [16] as our initial feature extractor. For the case when the number of novel categories is unknown, we train an 800-class ImageNet classification network as our initial feature extractor. We use SGD with an initial learning rate of 0.1, which is divided by 10 every 30 epochs, for 90 epochs. For warm-up, the feature extractor, together with the bottleneck and cluster centers, is trained for 10 epochs by SGD with a learning rate of 0.1; then, we train for a further 60 epochs, updating the target distribution once per epoch. Experiments on other datasets follow a similar configuration. Our results on all datasets are averaged over 10 runs, except ImageNet, which is averaged over 3 runs using different unlabelled subsets following [15, 16].
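The ImageNet schedule just described can be written down compactly. The helper names below are hypothetical and the snippet only encodes the stated numbers (learning rate 0.1 divided by 10 every 30 epochs over 90 pretraining epochs, then 10 warm-up epochs followed by 60 epochs with a per-epoch target update); it is a sketch of the schedule, not of the training code.

```python
def step_lr(epoch, base_lr=0.1, step=30, factor=10.0):
    """Learning rate at a 0-indexed epoch: divided by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))

def updates_target(epoch, warmup_epochs=10):
    """True once warm-up is over, i.e. when the target distribution is refreshed."""
    return epoch >= warmup_epochs

# Pretraining of the classification network: 90 epochs with step decay.
pretrain_lrs = [step_lr(e) for e in range(90)]

# Clustering stage: 10 warm-up epochs, then 60 epochs with target updates.
n_target_updates = sum(updates_target(e) for e in range(10 + 60))
```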
4.2 Learning with a known number of categories
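Table 1: Comparison of DTC variants when the number of novel categories is known.
| Method | CIFAR-10 ACC | CIFAR-10 NMI | CIFAR-100 ACC | CIFAR-100 NMI | SVHN ACC | SVHN NMI | OmniGlot ACC | OmniGlot NMI | ImageNet ACC | ImageNet NMI |
|---|---|---|---|---|---|---|---|---|---|---|
| k-means [21] | 65.5% | 0.422 | 66.2% | 0.555 | 42.6% | 0.182 | 77.2% | 0.888 | 71.9% | 0.713 |
| DTC-Baseline | 74.9% | 0.572 | 72.1% | 0.630 | 57.6% | 0.348 | 87.9% | 0.933 | 78.3% | 0.790 |
| DTC-Π | 87.5% | 0.735 | 70.6% | 0.605 | 60.9% | 0.419 | 89.0% | 0.949 | 76.7% | 0.767 |
| DTC-TE | 82.8% | 0.661 | 72.8% | 0.634 | 55.8% | 0.353 | 87.8% | 0.931 | 78.2% | 0.791 |
| DTC-TEP | 75.2% | 0.591 | 72.5% | 0.632 | 55.4% | 0.329 | 87.8% | 0.932 | 78.3% | 0.791 |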
In table 1 we compare variants of our Deep Transfer Clustering (DTC) approach with the temporal ensembling and consistency constraints introduced in section 3.1.2, namely, DTC-Baseline (our model trained using the DEC loss), DTC-Π (our model trained using the DEC loss with a consistency constraint between the predictions of a sample and its transformed counterpart), DTC-TE (our model trained using the DEC loss with a consistency constraint between the current prediction and the temporal ensemble prediction of each sample), and DTC-TEP (our model trained using the DEC loss with targets constructed from temporal ensemble predictions). We only apply the standard data augmentation of random crop and horizontal flip in our experiments. To measure the performance of the metric-learning-based initialization, we also show the results of k-means [21] on the features of unlabelled data produced by our feature extractor trained with the labelled data. k-means shows reasonably good results on clustering the unlabelled data using the model trained on labelled data, indicating that the model can transfer useful information to cluster data of unlabelled novel categories. All variants of our approach substantially outperform k-means, showing that our approach can effectively fine-tune the feature extractor and cluster the data. DTC-Π appears to be the most effective variant for CIFAR-10, SVHN, and OmniGlot. The consistency constraints make a large improvement over DTC-Baseline for CIFAR-10 (e.g., ACC from 74.9% to 87.5%) and SVHN (e.g., ACC from 57.6% to 60.9%). When it comes to the more challenging datasets, CIFAR-100 and ImageNet, DTC-TE and DTC-TEP appear to be the most effective, with ACC of 72.8% and 78.3% respectively.
We visualize the t-SNE projection of our learned features on the unlabelled subset of CIFAR-10 in fig. 2. It can be seen that our learned representation is sufficiently discriminative for the different novel classes, showing that our approach can effectively discover novel categories. We also show some failure cases, where there is some confusion between dogs and horse heads (due to a similar pose and color) in the green selection, and between trucks and ships in the orange selection (the trucks are either parked next to the sea, or have a color similar to that of the sea).
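Table 2: Comparison with other methods on OmniGlot and ImageNet when the number of novel categories is known.
| Method | OmniGlot ACC | OmniGlot NMI | ImageNet ACC | ImageNet NMI |
|---|---|---|---|---|
| k-means [21] | 21.7% | 0.353 | 71.9% | 0.713 |
| LPNMF [4] | 22.2% | 0.372 | 43.0% | 0.526 |
| LSC [8] | 23.6% | 0.376 | 73.3% | 0.733 |
| KCL [15] | 82.4% | 0.889 | 73.8% | 0.750 |
| MCL [16] | 83.3% | 0.897 | 74.4% | 0.762 |
| Centroid Networks [17] | 86.6% | - | - | - |
| DTC | 89.0% | 0.949 | 78.3% | 0.791 |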
We compare our approach with traditional methods as well as state-of-the-art learning-based methods on OmniGlot and ImageNet in table 2. We use the same 6-layer VGG-like architecture as KCL [15], MCL [16] and Centroid Networks [17] for comparison on OmniGlot, and use the same ResNet-18 as KCL and MCL for comparison on ImageNet. The results of traditional methods are those reported in [16] using raw images for OmniGlot and pretrained features for ImageNet. All these methods are applied by assuming the number of categories to be known. It is worth noting that Centroid Networks [17] also assumes the clusters to be of uniform size. This assumption, although not practical in real applications, is beneficial when experimenting with OmniGlot, since each category contains exactly 20 images. For both datasets, our method outperforms existing methods in both ACC and NMI (e.g., 89.0% vs 86.6% ACC and 0.949 vs 0.897 NMI on OmniGlot). Unlike KCL and MCL, our method does not need to maintain an extra model to provide a pseudo-supervision signal for the clustering model.
In addition, we compare with KCL and MCL on CIFAR-10, CIFAR-100, and SVHN in table 3, based on their officially-released code. Our method consistently outperforms KCL and MCL on these datasets, which further verifies the effectiveness of our approach.
4.3 Finding the number of novel categories
We now experiment under the more challenging (and realistic) scenario where the number of categories in the unlabelled data is unknown. KCL and MCL assume the number of categories to be a large value (i.e., 100) instead of estimating the number of categories explicitly. By contrast, we choose to estimate the number of categories before running the transfer clustering algorithm using algorithm 2 (with the same setting for all our experiments) and only then apply algorithm 1 to find the clusters. Results for novel category number estimation are reported in table 4. The average error is less than 5 for all three datasets, which validates the effectiveness of our approach. In table 5, we show the clustering results on OmniGlot and ImageNet for algorithm 1, with these estimates for the number of novel categories, and also compare with other methods. The results of traditional methods are those reported in [16] using raw images for OmniGlot and pretrained features for ImageNet. On both datasets, our approach achieves the best results, outperforming the previous state of the art by 6.8% and 6.1% ACC on OmniGlot and ImageNet respectively.
We also experiment with KCL and MCL using our estimated number of clusters on OmniGlot and ImageNet (see table 6). With this augmentation, both KCL and MCL improve significantly in terms of ACC, and are similar in terms of NMI, indicating that our category number estimation method can also be beneficial for other methods. Our method still significantly outperforms the augmented KCL and MCL on all metrics.
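Table 4: Estimation of the number of novel categories.
| Data | GT | Ours | Error |
|---|---|---|---|
| OmniGlot | 20-47 | 22-51 | 4.60 |
| ImageNet | {30, 30, 30} | {34, 31, 32} | 2.33 |
| CIFAR-100 | 10 | 11 | 1 |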
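As a quick arithmetic check, the error column for ImageNet and CIFAR-100 can be reproduced by averaging the absolute difference between the ground-truth and estimated numbers of categories over the listed subsets. The helper name is an assumption; the numbers are those of the table.

```python
def mean_abs_error(gt, est):
    """Average absolute difference between ground-truth and estimated counts."""
    return sum(abs(g - e) for g, e in zip(gt, est)) / len(gt)

imagenet_error = mean_abs_error([30, 30, 30], [34, 31, 32])  # (4 + 1 + 2) / 3
cifar100_error = mean_abs_error([10], [11])
print(round(imagenet_error, 2), cifar100_error)  # 2.33 1.0
```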
OmniGlot  ImageNet  
Method  ACC  NMI  ACC  NMI 
means [21]  18.9%  0.464  34.5%  0.671 
LPNMF [4]  16.3%  0.498  21.8%  0.500 
LSC [8]  18.0%  0.500  33.5%  0.655 
KCL [15]  78.1%  0.874  65.2%  0.715 
MCL [16]  80.2%  0.893  71.5%  0.765 
DTC  87.0%  0.945  77.6%  0.786 
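To make the two reported metrics concrete, the following is a minimal sketch of how ACC and NMI can be computed for a clustering result. It is an illustration under stated assumptions, not the evaluation code behind the tables: ACC is taken to be the accuracy under the best one-to-one mapping between clusters and classes, found here by brute force over permutations (practical only for a handful of clusters; the Hungarian algorithm is the usual choice at scale), and NMI is normalised by the geometric mean of the two entropies, which is one of several common conventions. All function names and the toy labels are ours.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad the shorter list with None so every cluster/class gets a partner.
    size = max(len(classes), len(clusters))
    classes += [None] * (size - len(classes))
    clusters += [None] * (size - len(clusters))
    pair_counts = Counter(zip(y_pred, y_true))
    # Brute force over all mappings: fine for a toy example only.
    best_hits = max(
        sum(pair_counts[(c, k)] for c, k in zip(clusters, perm))
        for perm in permutations(classes)
    )
    return best_hits / len(y_true)


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(y_true, y_pred):
    """NMI = I(true; pred) / sqrt(H(true) * H(pred)) (assumed normalisation)."""
    n = len(y_true)
    h_true, h_pred = entropy(y_true), entropy(y_pred)
    if h_true == 0 or h_pred == 0:
        # Degenerate case: at least one labelling has a single group.
        return 1.0 if h_true == h_pred else 0.0
    n_true, n_pred = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mutual_info = sum(
        (c / n) * math.log(c * n / (n_true[t] * n_pred[p]))
        for (t, p), c in joint.items()
    )
    return mutual_info / math.sqrt(h_true * h_pred)


# Toy example: cluster ids are arbitrary, and one class-0 instance is misplaced.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 0, 0, 2, 2, 2]
acc = clustering_accuracy(y_true, y_pred)
nmi = normalized_mutual_information(y_true, y_pred)
print(f"ACC = {acc:.3f}, NMI = {nmi:.3f}")
```

Both metrics are invariant to how the cluster ids are numbered, which is why they are suitable for comparing clustering methods that have no access to class names.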
4.4 Transfer from ImageNet pretrained model
The most common approach to transfer learning with modern deep convolutional neural networks is to use ImageNet pretrained models. Here, we explore the potential of leveraging an ImageNet pretrained model to transfer features for novel category discovery. In particular, we take the ImageNet pretrained model as our feature extractor and apply our transfer clustering model to a new dataset. We experiment with CIFAR10, and the results are shown in table 7. Instead of considering only part of the categories as unlabelled data, we consider the whole CIFAR10 training set as unlabelled data here. As before, our deep transfer clustering model equipped with temporal ensembling or consistency constraints consistently outperforms k-means and our baseline model. DTC performs best in terms of ACC and DTC-TE performs best in terms of NMI. We also experimented with SVHN, but without much success. This is likely due to the small correlation between ImageNet and SVHN. This result is consistent with that of semi-supervised learning (SSL) [23]. Using an ImageNet pretrained model, SSL can achieve reasonably good performance on CIFAR10, but not on SVHN, which shows that the correlation between source data and target data is important for SSL. Our results corroborate that, to successfully transfer knowledge from pretrained models for deep transfer clustering, the labelled data and unlabelled data should be closely related.

Table 7: Clustering results on CIFAR10 with an ImageNet pretrained feature extractor.

| Method | ACC | NMI |
|---|---|---|
| k-means [21] | 71.0% | 0.639 |
| DTC-Baseline | 76.9% | 0.729 |
| DTC | 78.9% | 0.753 |
| DTC-TE | 78.5% | 0.755 |
| DTC-TEP | 77.4% | 0.734 |

5 Conclusion
We have introduced a simple and effective approach for novel visual category discovery in unlabelled data, by considering it as a deep transfer clustering problem. Our method can simultaneously learn a data representation and cluster the unlabelled data of novel visual categories, while leveraging knowledge of related categories in labelled data. We have also proposed a novel method to reliably estimate the number of categories in unlabelled data by transferring cluster prior knowledge using labelled probe data. We have thoroughly evaluated our method on public benchmarks, and it substantially outperformed state-of-the-art techniques in both known and unknown category number cases, demonstrating the effectiveness of our approach.
Acknowledgments.
We are grateful to EPSRC Programme Grant Seebibyte EP/M013774/1 and ERC StG IDIU-638009 for support.
References
 [1] (2013) Data clustering: algorithms and applications. CRC Press. Cited by: §1, §2.
 [2] (2012) An extensive comparative study of cluster validity indices. Pattern Recognition. Cited by: §3.2.
 [3] (1998) Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B. Cited by: §3.2.
 [4] (2009) Locality preserving nonnegative matrix factorization. In IJCAI, Cited by: Table 2, Table 5.
 [5] (1974) A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods. Cited by: §3.2.
 [6] (2017) Deep adaptive image clustering. In ICCV, Cited by: §2.
 [7] (2006) Semi-supervised learning. MIT Press. Cited by: §1, §2.
 [8] (2011) Large scale spectral clustering with landmark-based representation. In AAAI, Cited by: Table 2, Table 5.
 [9] (2002) Mean shift: a robust approach toward feature space analysis. IEEE TPAMI. Cited by: §2.
 [10] (1979) A cluster separation measure. IEEE TPAMI. Cited by: §3.2.
 [11] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §1, §2, §4.1.
 [12] (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV, Cited by: §2.
 [13] (1974) Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics. Cited by: §3.2.
 [14] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
 [15] (2018) Learning to cluster in order to transfer across domains and tasks. In ICLR, Cited by: Table 8, Appendix F, §2, §4.1, §4.1, §4.1, §4.2, Table 2, Table 3, Table 5, Table 6.
 [16] (2019) Multiclass classification without multiclass labels. In ICLR, Cited by: Table 8, Appendix F, §2, §4.1, §4.1, §4.1, §4.2, §4.3, Table 2, Table 3, Table 5, Table 6.
 [17] (2019) Centroid networks for few-shot clustering and unsupervised few-shot classification. In arXiv preprint arXiv:1902.08605, Cited by: §2, §4.2, Table 2.
 [18] (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §4.1.
 [19] (2017) Temporal ensembling for semi-supervised learning. In ICLR, Cited by: §2, §3.1.2, §3.1.2, §3.1.2.
 [20] (2015) Human-level concept learning through probabilistic program induction. Science. Cited by: §4.1.
 [21] (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Cited by: §2, §4.2, Table 1, Table 2, Table 5, Table 7.
 [22] (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.1.
 [23] (2018) Realistic evaluation of deep semi-supervised learning algorithms. In NIPS, Cited by: §1, §2, §4.4.
 [24] (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering. Cited by: §1, §2.
 [25] (2019) Semi-supervised learning with scarce annotations. In ArXiv eprints, Cited by: §2.
 [26] (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. Cited by: §3.2.
 [27] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §4.1.
 [28] (2017) Prototypical networks for few-shot learning. In NIPS, Cited by: §4.1.
 [29] (2016) Improved deep metric learning with multi-class N-pair loss objective. In NIPS, Cited by: §2.
 [30] (2017) Deep metric learning via facility location. In CVPR, Cited by: §2.
 [31] (2016) Deep metric learning via lifted structured feature embedding. In CVPR, Cited by: §2.
 [32] (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions.. JMLR. Cited by: §4.1.
 [33] (2018) A survey on deep transfer learning. In International Conference on Artificial Neural Networks, Cited by: §2.
 [34] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, Cited by: §2, §3.1.2.
 [35] (2016) Matching networks for one shot learning. In NIPS, Cited by: §4.1.
 [36] (2018) Deep visual domain adaptation: a survey. Neurocomputing. Cited by: §2.
 [37] (2016) A survey of transfer learning. Journal of Big Data. Cited by: §1, §2.
 [38] (2016) Unsupervised deep embedding for clustering analysis. In ICML, Cited by: §1, §2, §3.1.1, §3.1.1, §3.1.
 [39] (2017) Towards k-means-friendly spaces: simultaneous deep learning and clustering. In ICML, Cited by: §2.
 [40] (2016) Joint unsupervised learning of deep representations and image clusters. In CVPR, Cited by: §2.
Appendices
Appendix A Bottleneck dimension
As described in section 3.1.1 of our paper, we introduce a bottleneck layer to reduce the dimension of the learned representation (e.g., that of ResNet18, which is used in the experiment). To evaluate different choices of the bottleneck dimension, we experiment with the 10-class unlabelled subset of CIFAR100 by varying it. In particular, we train the Ours-Baseline model with different bottleneck dimensions. The ACC and NMI are shown in fig. 3. It can be seen that our model is not very sensitive to the choice of the bottleneck dimension, especially when it is slightly larger than the number of unlabelled categories (10 in this experiment). We find that setting is a good choice for the bottleneck, since it gives the best ACC and NMI in our experiment.
Appendix B Number of clusters
Our transfer clustering model requires the number of novel categories to be known. However, this is not always the case in real applications. When it is unknown, we can use the algorithm introduced in section 3.2 to estimate it. For the 10-class unlabelled subset of CIFAR100, our algorithm gives an estimate of 12, which is very close to the ground truth (i.e., 10). We show the results of setting different numbers of clusters for our transfer clustering model in fig. 4. It can be seen that both ACC and NMI decrease if the number of clusters differs from the ground truth. However, overestimating the number is preferable to underestimating it, since ACC and NMI decrease faster with smaller numbers than with larger ones.
Appendix C Representation visualization
Figure 7 panels: (a) init; (b)-(l) epochs 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.
We visualize the learned representation of the images in the unlabelled 5-class subset of CIFAR10 (i.e., dog, frog, horse, ship, truck, with class indices 5-9) by projecting them to 2D space with t-SNE in fig. 7. In fig. 7 (a), we show the representation of unlabelled data obtained by the feature extractor pretrained on labelled data. The novel classes cannot be properly distinguished, and there are no clear boundaries among them. In fig. 7 (b)-(l), we show the evolution of the representation at different checkpoints during learning with our model. The representation becomes more and more discriminative for different classes, demonstrating that our approach can effectively discover novel categories (see fig. 7 (l) vs fig. 7 (a)).
Appendix D Evolving of soft clustering assignment
Figure 6 panels: (a) GT; (b, c) init; (d, e) epoch 20; (f, g) epoch 60; (h, i) epoch 100.
As discussed in section 3.1.1 of our manuscript, the training of our model is driven by the self-evolving soft clustering assignment. By constructing the target distribution from the prediction (soft clustering assignment), our model can gradually learn to discover novel categories. We show how the soft clustering assignment evolves on the unlabelled subset of CIFAR10 in fig. 6. Instances with IDs 0-9, 10-19, 20-29, 30-39, and 40-49 are images of dog, frog, horse, ship, and truck, respectively. If the images are clustered perfectly, each row in the soft assignment figure will form a one-hot vector, and the predictions will form a vertical bar for each of the 10-instance clusters (see fig. 6 (a) for the ground truth). As can be seen in fig. 6 (b), the predictions, which are obtained using the pretrained feature extractor, are rather noisy at the beginning. For example, the model seems unable to distinguish the two classes of instances 0-19 in fig. 6 (b, d), while after training, our model can perfectly cluster them into two distinct clusters (see fig. 6 (h)). Similarly, images 40-49 are initially mostly assigned, wrongly and with high confidence, to the same cluster as images 30-39. After training, our model clusters them correctly. Meanwhile, we can see that the constructed target distribution is more 'confident' and discriminative than the prediction, indicating that it can serve as a proper learning target.
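The sharpening effect described above can be illustrated with a small sketch. We assume here the target construction of deep embedded clustering [38], in which each soft assignment is squared, divided by the soft frequency of its cluster, and renormalised per instance; the exact form used by our model is the one given in section 3.1.1, and the function name and the toy prediction matrix below are purely illustrative.

```python
def target_distribution(pred):
    """Sharpen soft assignments: square, divide by soft cluster frequency, renormalise.

    `pred` is a list of rows, one per instance, each row summing to 1.
    """
    n_clusters = len(pred[0])
    # Soft cluster frequencies: column sums of the prediction matrix.
    freq = [sum(row[k] for row in pred) for k in range(n_clusters)]
    target = []
    for row in pred:
        weights = [row[k] ** 2 / freq[k] for k in range(n_clusters)]
        total = sum(weights)
        target.append([w / total for w in weights])
    return target


# Toy predictions for four instances over three clusters (made-up values).
pred = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
target = target_distribution(pred)
for p_row, t_row in zip(pred, target):
    print([round(v, 3) for v in p_row], "->", [round(v, 3) for v in t_row])
```

In this toy example every row of the target keeps the same most likely cluster as the prediction but puts more mass on it, which is the sense in which the target is more 'confident' than the prediction it is built from.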
Appendix E t-SNE visualization with images
Appendix F Detailed category number estimation results on OmniGlot
KCL [15] and MCL [16] assume the number of categories to be a large value (i.e., 100) instead of estimating the number of categories explicitly. After running their clustering method, they finally estimate the number of categories by identifying the clusters with a number of assigned instances larger than a certain threshold. By contrast, we choose to estimate the number of categories before running the transfer clustering algorithm using algorithm 2 in our main paper (with for all our experiments) and only then apply algorithm 1 to find the clusters. The results of our estimator for OmniGlot are shown in table 8, where we also compare with the results of MCL and KCL as reported in [16]. Our method achieves a lower average error than the other methods (4.60 vs. 5.10 for the next best), which validates the effectiveness of our approach.
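The post-hoc counting strategy attributed to KCL and MCL above can be sketched in a few lines. This is only an illustration of the idea of counting sufficiently populated clusters; the function name, the threshold value, and the toy assignments are hypothetical and are not taken from the released code of either method.

```python
from collections import Counter


def count_populated_clusters(assignments, min_size):
    """Count the clusters that received more than `min_size` instances."""
    sizes = Counter(assignments)
    return sum(1 for size in sizes.values() if size > min_size)


# Toy hard assignments: many cluster slots may be available, but here only
# three clusters are well populated and two contain a few stray instances.
assignments = [0] * 40 + [1] * 35 + [2] * 30 + [3] * 2 + [4] * 1
estimate = count_populated_clusters(assignments, min_size=5)
print(estimate)
```

Such an estimate depends on the choice of threshold and is only available after clustering has finished, whereas our estimator is run before the transfer clustering step.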
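Table 8: Category number estimation results on OmniGlot.

| Alphabet | GT | KCL [15] | MCL [16] | Ours |
|---|---|---|---|---|
| Angelic | 20 | 26 | 22 | 23 |
| Atemayar Q. | 26 | 34 | 26 | 25 |
| Atlantean | 26 | 41 | 25 | 34 |
| Aurek_Besh | 26 | 28 | 22 | 34 |
| Avesta | 26 | 32 | 23 | 31 |
| Ge_ez | 26 | 32 | 25 | 31 |
| Glagolitic | 45 | 45 | 36 | 46 |
| Gurmukhi | 45 | 43 | 31 | 34 |
| Kannada | 41 | 44 | 30 | 40 |
| Keble | 26 | 28 | 23 | 25 |
| Malayalam | 47 | 47 | 35 | 42 |
| Manipuri | 40 | 41 | 33 | 39 |
| Mongolian | 30 | 36 | 29 | 33 |
| Old Church S. | 45 | 45 | 38 | 51 |
| Oriya | 46 | 49 | 32 | 33 |
| Sylheti | 28 | 50 | 30 | 22 |
| Syriac_Serto | 23 | 38 | 24 | 26 |
| Tengwar | 25 | 41 | 26 | 28 |
| Tibetan | 42 | 42 | 34 | 43 |
| ULOG | 26 | 40 | 27 | 33 |
| Avg. error | - | 6.35 | 5.10 | 4.60 |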
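The averages in the last row of table 8 appear consistent with the mean absolute difference between the estimated and ground-truth category numbers over the 20 alphabets. The short check below recomputes them from the per-alphabet entries of the table; the function name is ours, and reading the reported error as a mean absolute difference is an assumption, although the three values it yields match the table.

```python
# Per-alphabet category counts copied from table 8, in row order
# (Angelic, Atemayar Q., ..., Tibetan, ULOG).
gt = [20, 26, 26, 26, 26, 26, 45, 45, 41, 26, 47, 40, 30, 45, 46, 28, 23, 25, 42, 26]
kcl = [26, 34, 41, 28, 32, 32, 45, 43, 44, 28, 47, 41, 36, 45, 49, 50, 38, 41, 42, 40]
mcl = [22, 26, 25, 22, 23, 25, 36, 31, 30, 23, 35, 33, 29, 38, 32, 30, 24, 26, 34, 27]
ours = [23, 25, 34, 34, 31, 31, 46, 34, 40, 25, 42, 39, 33, 51, 33, 22, 26, 28, 43, 33]


def average_error(estimates, truth):
    """Mean absolute difference between estimated and true category numbers."""
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)


for name, estimates in [("KCL", kcl), ("MCL", mcl), ("Ours", ours)]:
    print(f"{name}: {average_error(estimates, gt):.2f}")
# Prints 6.35 for KCL, 5.10 for MCL and 4.60 for Ours.
```

The same function applied to the three ImageNet subsets in table 4 (estimates 34, 31, 32 against a ground truth of 30 each) gives 7/3, i.e. the reported 2.33.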