
Online Unsupervised Learning of Visual Representations and Categories

Real world learning scenarios involve a nonstationary distribution of classes with sequential dependencies among the samples, in contrast to the standard machine learning formulation of drawing samples independently from a fixed, typically uniform distribution. Furthermore, real world interactions demand learning on-the-fly from few or no class labels. In this work, we propose an unsupervised model that simultaneously performs online visual representation learning and few-shot learning of new categories without relying on any class labels. Our model is a prototype-based memory network with a control component that determines when to form a new class prototype. We formulate it as an online Gaussian mixture model, where components are created online with only a single new example, and assignments do not have to be balanced, which permits an approximation to natural imbalanced distributions from uncurated raw data. Learning includes a contrastive loss that encourages different views of the same image to be assigned to the same prototype. The result is a mechanism that forms categorical representations of objects in nonstationary environments. Experiments show that our method can learn from an online stream of visual input data and is significantly better at category recognition compared to state-of-the-art self-supervised learning methods.





1 Introduction

Human and artificial agents learn from large amounts of unlabeled data through a continuous stream of experience that has strong temporal correlations and hierarchical structure. The stream of experience is determined by the fact that individuals operate in distinct physical environments, such as home and office. Transitions between environments occur on a coarse time scale relative to the rate of encounters with objects in the environment. The distribution over objects is strongly conditioned on the environment: frozen pizza and milk are in the supermarket, computer monitors and desks at the office. The distribution may also be nonstationary; for example, a visitor may be infrequent, but when they are around, they’re encountered frequently. And finally, the distribution can be highly unbalanced: an individual may interact with their co-workers daily but the boss only occasionally.

The goal of our research is to tackle the challenging problem of online unsupervised representation learning in the setting of environments with naturalistic structure. We desire a learning algorithm that will facilitate the categorization of objects encountered in the environment, with few or zero category labels. In representation learning, methods often evaluate their ability to classify from the representation using either supervised linear readout or unsupervised clustering over the full dataset, both of which are typically done in a separate post-hoc evaluation phase. An important aim of our work is to produce object-category predictions throughout training and evaluation, and to allow these predictions to guide subsequent categorization.

Unsurprisingly, the structure of natural environments contrasts dramatically with the standard scenario typically assumed by many machine learning algorithms: mini-batches of independent and identically distributed (iid) samples from a well-curated dataset. In unsupervised visual representation learning, the most successful methods rely on iid samples. Contrastive-based objectives (Chen et al., 2020a; He et al., 2020) typically assume that each instance in the mini-batch forms its own instance class, throwing away the potential similarity between instances. Clustering-based learning frameworks (Caron et al., 2018; Asano et al., 2020; Caron et al., 2020) often assume that the set of cluster centroids remain relatively stable and that the clusters are balanced in size. Unfortunately, none of these assumptions necessarily hold true in a naturalistic online streaming setting. The performance of methods will suffer as a consequence. Contrastive approaches could eventually fail if examples of the same class are pushed apart; clustering approaches will behave erratically with nonstationary and imbalanced class distributions.

To make progress on the challenge of unsupervised visual representation learning and categorization in a naturalistic setting, we propose the online unsupervised prototypical network, which performs learning of visual representations and object categories simultaneously in a single-stage process. Class prototypes are created via an online clustering procedure, and a contrastive loss (van den Oord et al., 2018) is used to encourage different views of the same image to be assigned to the same cluster. Notably, our online clustering procedure is more flexible relative to other clustering-based representation learning algorithms, such as DeepCluster (Caron et al., 2018) and SwAV (Caron et al., 2020): our model performs learning and inference as an online Gaussian mixture model, where clusters can be created online with only a single new example, and cluster assignments do not have to be balanced, which permits an approximation to natural imbalanced distributions from uncurated raw data.

We train and evaluate our algorithm on a recently proposed naturalistic learning dataset, RoamingRooms (Ren et al., 2021), which uses imagery collected from a virtual agent walking through different rooms. Unlike the experiments in Ren et al. (2021), our training is done without using any labeled data, and hence it is online and unsupervised. We compare to the state-of-the-art unsupervised methods SimCLR (Chen et al., 2020a) and SwAV (Caron et al., 2020). Because they rely on sampling a large batch from an offline dataset, their performance drops significantly when they are trained on smaller non-iid episodes. In contrast, our method can directly learn with online streaming data from small episodes, without requiring an example buffer, and surprisingly, we even outperform these strong methods trained offline with large batches of iid data. We also use RoamingOmniglot (Ren et al., 2021) as a benchmark to investigate the effect of imbalanced classes, and we find that our method is very robust to an imbalanced distribution of classes. For ImageNet-based datasets with large intra-class variance, our method is on par with standard self-supervised approaches such as SimCLR when using the same batch size. Finally, qualitative visualizations confirm that our online clustering procedure can automatically discover new concepts and categories by grouping a few similar instances together in the learned embedding space.

2 Related Work

Self-supervised learning.

Self-supervised learning methods discover rich and informative visual representations without using class labels. Instance-based approaches aim to learn invariant representations of each image under different transformations (van den Oord et al., 2018; Misra and van der Maaten, 2020; Tian et al., 2020; He et al., 2020; Chen et al., 2020a, b; Grill et al., 2020; Chen and He, 2021). They typically work well under iid learning with large batch sizes, which contrasts with realistic learning scenarios. Our method is also related to clustering-based approaches, which compute clusters on top of the learned embedding and use the cluster assignment to self-supervise the embedding network (Caron et al., 2018; Zhan et al., 2020; Asano et al., 2020; Caron et al., 2020; Li et al., 2021). The self-supervision loss is found to be more powerful than the traditional reconstruction loss (Rao et al., 2019; Greff et al., 2017; Joo et al., 2020). To compute the cluster assignment, DeepCluster (Caron et al., 2018; Zhan et al., 2020) and PCL (Li et al., 2021) use the $k$-means algorithm whereas SeLa (Asano et al., 2020) and SwAV (Caron et al., 2020) use the Sinkhorn-Knopp algorithm (Cuturi, 2013). However, they typically assume a fixed number of clusters, and Sinkhorn-Knopp further assumes a balanced assignment as an explicit constraint. In contrast, our online clustering procedure is more flexible, as it can create new clusters on-the-fly with only a single new example and does not assume balanced cluster assignments.

Representation learning from video.

There has also been a surge of interest in leveraging video data to learn visual representations (Wang and Gupta, 2015; Orhan et al., 2020; Pathak et al., 2017; Zhu et al., 2020; Xiong et al., 2021). Wang and Gupta (2015) proposed to sample positive example pairs from a video track of the same object; Orhan et al. (2020) divided the video into equal chunks and trained a network to predict the chunk index. Pathak et al. (2017) learned visual representations using low-level motion-based grouping cues, and more recently, Xiong et al. (2021) used optical flow to encourage regions that move together to have consistent representations across frames. These approaches all sample video subsequences uniformly over the entire dataset, whereas our model directly learns from an online stream of data. Our model also does not assume that inputs must be adjacent frames in the video.

Online and incremental representation learning.

Our work is also related to online and continual representation learning (Rebuffi et al., 2017; Castro et al., 2018; Rao et al., 2019; Jerfel et al., 2019; Javed and White, 2019; Hayes et al., 2020). Incremental learning (Rebuffi et al., 2017; Castro et al., 2018) studies learning a set of classes incrementally. Continual mixture models (Rao et al., 2019; Jerfel et al., 2019) designate a categorical latent variable that can be dynamically allocated for a new environment. Our model has a similar mixture latent variable setup but one major difference is that we operate on example-level rather than task-level. OML (Javed and White, 2019) is a meta-learning framework that learns the representation to support online learning at test time. Streaming learning (Hayes et al., 2020, 2019) aims to perform representation learning online. Most of the works here except Rao et al. (2019) assume that the training data is fully supervised. Our prototype memory also resembles the replay buffer for experience replay methods (Buzzega et al., 2020; Kim et al., 2020), but we store the hidden feature prototypes instead of the input examples.

Latent variable modeling on sequential data.

Our model also relates to a family of latent variable generative models for sequential data (Johnson et al., 2016; Krishnan et al., 2015; He et al., 2018; Denton and Fergus, 2018; Zhu et al., 2020). Both our model and these approaches aim to infer latent variables with temporal structure. However, instead of using input reconstruction as the main learning objective, our model leverages self-supervised contrastive learning to avoid learning difficulty in the decoder generator.

Online mixture models.

Our clustering module is related to the literature on online mixture models, e.g., Hughes and Sudderth (2013); Pinto and Engel (2015); Song and Wang (2005). Typically, these are designed for fast and incremental learning of clusters without having to recompute clustering over the entire dataset. Despite presenting a similar online clustering algorithm, our goal is to jointly learn both online clusters and input representations.

Few-shot learning.

Our model is able to recognize a new class with only one or a few examples, which relates to the literature on few-shot learning (Li et al., 2003; Lake et al., 2015). Our prototype-based memory is also inspired by the Prototypical Network (Snell et al., 2017) and its variants, e.g., IMP (Allen et al., 2019) and Online ProtoNet (Ren et al., 2021). Ren et al. (2018); Huang et al. (2019) demonstrated improved performance by using unlabeled examples. More recent methods can entirely remove the reliance on class labels in the training pipeline, using self-supervised learning (Hsu et al., 2019; Gidaris et al., 2019; Antoniou and Storkey, 2019; Khodadadeh et al., 2019; Medina et al., 2020).

Classical few-shot learning, however, relies on episodes of equal number of training and test examples from a fixed number of new classes. Incremental few-shot learning approaches (Gidaris and Komodakis, 2018; Ren et al., 2019) consider both old and new classes, while Meta-Dataset (Triantafillou et al., 2020) considers episodes with varying numbers of examples and classes. Ren et al. (2021) proposed a new setup that incrementally accumulates new classes and re-visits old classes over a sequence of inputs. We evaluate our algorithm on a similar setup; however, unlike that work, our proposed algorithm learns visual representations without relying on any class labels.

Human category learning.

Our work is inspired by human learning settings and relevant models from cognitive science. Lake et al. (2009) proposed an online category learning framework using online mixture models on top of fixed representations. SUSTAIN (Love et al., 2004) is another computational model for human category learning that can perform both supervised and unsupervised clustering of categories. Mack et al. (2016) further augmented the SUSTAIN model such that categories can be learned in a goal-oriented manner. In contrast to these human learning models, our model learns the mapping between images to categories in an end-to-end fashion.

3 Online Unsupervised Prototypical Networks

Figure 1: Our proposed online unsupervised prototypical network (OUPN). Left: OUPN learns directly from an online visual stream. Images are processed by a deep neural network to extract representations. Representations are stored and clustered in a prototype memory. Similar features are aggregated in a concept and new concepts can be dynamically created if the current feature vector is different from all existing concepts. Right: The network learning uses self-supervision that encourages different augmentations of the same frame to have consistent cluster assignments.

In this section, we introduce our proposed model, online unsupervised prototypical networks (OUPN). We study the online categorization setting, where the model receives an input $\mathbf{x}_t$ at every time step $t$, and predicts both a categorical variable $\hat{y}_t$ to indicate the object class and also a binary variable $\hat{u}_t$ to indicate whether this is a known or new class. OUPN uses a network $h$ to encode the input to obtain embedding $\mathbf{z}_t = h(\mathbf{x}_t; \theta)$, where $\theta$ represents the learnable parameters of the encoder network. $\mathbf{z}_t$ is then processed by a prototype memory which predicts $(\hat{y}_t, \hat{u}_t)$.

3.1 Prototype memory

We formulate our prototype memory as a probabilistic mixture model, where each cluster corresponds to a Gaussian distribution $f(\mathbf{z}; \mathbf{p}, \sigma^2)$, with mean $\mathbf{p}$, a constant isotropic variance $\sigma^2$ shared across all clusters, and mixture weights $w$:

$$p(\mathbf{z}; P) = \sum_k w_k \, f(\mathbf{z}; \mathbf{p}_k, \sigma^2).$$

Throughout a sequence, the number of components evolves as the model makes an online decision of when to create a new cluster or remove an old one. We assume that the prior distribution for the Bernoulli variable $u$ is constant ($u_0 \equiv \Pr(u = 1)$), and the prior for a new cluster is uniform over the entire space ($z_0 \equiv \Pr(\mathbf{z} \mid u = 1)$). In the following, we formulate our prototype memory as an approximation to an online EM algorithm. The full derivation is included in the supplementary materials.

3.1.1 E-step

Upon seeing the current input $\mathbf{z}_t$, the online clustering procedure needs to predict the cluster assignment or initiate a new cluster in the E-step.

Inferring cluster assignments.

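The categorical variable $\hat{y}_t$ infers the cluster assignment of the current input example with regard to the existing clusters:

$$\hat{y}_{t,k} \equiv \Pr(y_t = k \mid \mathbf{z}_t, u = 0) = \operatorname{softmax}_k\!\left(\log w_k - d(\mathbf{z}_t, \mathbf{p}_{t,k}) / \tau\right),$$

where $w_k$ is the mixing coefficient of cluster $k$, $d(\cdot, \cdot)$ is the distance function, and $\tau$ is an independent learnable temperature parameter that is related to the cluster variance.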

Inference on unknown classes.

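The binary variable $\hat{u}_t$ estimates the probability that the current input belongs to a new cluster:

$$\hat{u}_t \equiv \Pr(u_t = 1 \mid \mathbf{z}_t) = \operatorname{sigmoid}\!\left(\left(\min_k d(\mathbf{z}_t, \mathbf{p}_{t,k}) - \beta\right) / \gamma\right),$$

where $\beta$ and $\gamma$ are separate learnable parameters related to the new-cluster prior $z_0$ and the Bernoulli prior $u_0$, allowing us to predict different confidence levels for unknown and known classes.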
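
As a rough illustration of the two E-step predictions, the sketch below computes a softmax over per-cluster logits and a sigmoid for the new-cluster probability. The functional forms used here (logit $\log w_k - d/\tau$, and a sigmoid of $(\min_k d - \beta)/\gamma$), the squared Euclidean distance, and all function and parameter names are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def e_step(z, prototypes, weights, tau=1.0, beta=1.0, gamma=1.0):
    """Return (y_hat, u_hat) for embedding z; needs at least one prototype.

    y_hat: soft assignment over existing clusters (sums to 1).
    u_hat: probability that z belongs to a new cluster.
    Assumed forms: logit_k = log w_k - d(z, p_k) / tau and
    u_hat = sigmoid((min_k d(z, p_k) - beta) / gamma).
    """
    dists = [sq_dist(z, p) for p in prototypes]
    logits = [math.log(w) - d / tau for w, d in zip(weights, dists)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    u_hat = 1.0 / (1.0 + math.exp(-(min(dists) - beta) / gamma))
    return y_hat, u_hat
```

An embedding close to an existing prototype yields a peaked `y_hat` and a small `u_hat`; an embedding far from every prototype pushes `u_hat` toward 1.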

3.1.2 M-step

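Here we infer the posterior distribution of the prototypes $\Pr(\mathbf{p}_{t,k} \mid \mathbf{z}_{1:t})$. We formulate an efficient recursive online update, similar to Kalman filtering, incorporating the evidence of the current input $\mathbf{z}_t$ and avoiding re-clustering the entire input history. We define $\hat{\mathbf{p}}_{t,k}$ as the posterior estimate of the mean of the $k$-th cluster at time step $t$, and $\hat{c}_{t,k}$ as the estimate of the inverse variance.

Updating prototypes.

Suppose that in the E-step we have determined that $y_t = k$. Then the posterior distribution of the $k$-th cluster after observing $\mathbf{z}_t$ is:

$$\Pr(\mathbf{p}_{t,k} \mid \mathbf{z}_{1:t}, y_t = k) \propto \Pr(\mathbf{z}_t \mid \mathbf{p}_{t,k}, y_t = k) \int \Pr(\mathbf{p}_{t,k} \mid \mathbf{p}') \, \Pr(\mathbf{p}' \mid \mathbf{z}_{1:t-1}) \, d\mathbf{p}'.$$

The transition probability distribution $\Pr(\mathbf{p}_{t,k} \mid \mathbf{p}')$ is a zero-mean Gaussian whose variance is set by a constant $\rho$ that we define to be the memory decay coefficient. In fact, $\hat{c}_{t,k}$ can be viewed as a count variable for the number of elements in each estimated cluster, subject to the decay factor $\rho$ over time, under the additional simplifying assumptions given in the supplementary derivation. The resulting memory update decays each count by $\rho$ before adding the contribution of the current input, and moves the estimated mean toward $\mathbf{z}_t$ accordingly.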

Adding and removing prototypes.

Our prototype memory is a collection of tuples $\{(\hat{\mathbf{p}}_{t,k}, \hat{c}_{t,k})\}$ of cluster means and counts. We convert the probability that an observation belongs to a new cluster into a decision: if $\hat{u}_t$ exceeds a threshold $\alpha$, we create a new cluster in our prototype memory. Due to the decay factor $\rho$, our estimate of a cluster's count $\hat{c}_{t,k}$ can decay to zero over time, which is appropriate for modeling nonstationary environments. In practice, we keep a maximum number of $K$ clusters, and if the memory is full and we try to add another cluster at time $t$, we simply pop out the least relevant prototype $\hat{\mathbf{p}}_{t,k'}$, where $k' = \arg\min_k \hat{c}_{t,k}$.
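
The bookkeeping described above can be sketched as a small prototype memory. This is a minimal sketch under stated assumptions: the update is written as a decayed running average (counts are multiplied by the decay factor, and each mean moves toward the new embedding in proportion to its assignment weight), which is one natural reading of the count interpretation. The class name `PrototypeMemory`, its methods, and the default decay and threshold values are hypothetical.

```python
class PrototypeMemory:
    """Hypothetical prototype memory holding (mean, count) pairs."""

    def __init__(self, max_size=150, decay=0.99, threshold=0.5):
        self.max_size = max_size    # maximum number of clusters kept
        self.decay = decay          # memory decay coefficient (assumed value)
        self.threshold = threshold  # new-cluster threshold (assumed value)
        self.means = []
        self.counts = []

    def update(self, z, y_hat, u_hat):
        """Decay all counts, then create a cluster or update existing ones."""
        self.counts = [self.decay * c for c in self.counts]
        if not self.means or u_hat > self.threshold:
            self._create(z)
            return
        for k, y in enumerate(y_hat):
            self.counts[k] += y
            rate = y / self.counts[k]  # step size shrinks as the count grows
            self.means[k] = [m + rate * (zi - m)
                             for m, zi in zip(self.means[k], z)]

    def _create(self, z):
        """Add a new cluster; evict the lowest-count cluster when full."""
        if len(self.means) >= self.max_size:
            k = min(range(len(self.counts)), key=self.counts.__getitem__)
            del self.means[k], self.counts[k]
        self.means.append(list(z))
        self.counts.append(1.0)
```

Because every count is decayed at each step, a cluster that stops receiving assignments gradually becomes the eviction candidate, which matches the intent of recycling the least used prototype.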

3.2 Representation learning

A primary goal of our learning algorithm is to learn good visual representations through this online categorization process. In the beginning, the encoder network is randomly initialized, and the prototype memory will not produce accurate class predictions since the representations are not informative. Our overall representation learning objective has three terms:

$$\mathcal{L} = \mathcal{L}_{\text{self}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}} + \lambda_{\text{new}} \mathcal{L}_{\text{new}},$$

with weighting coefficients $\lambda_{\text{ent}}$ and $\lambda_{\text{new}}$. This loss function drives the learning of the main network parameters $\theta$, as well as other learnable control parameters $\beta$, $\gamma$, and $\tau$. We explain each term in detail below.

  1. Self-supervised loss: Inspired by recent self-supervised representation learning approaches, we apply augmentations on $\mathbf{x}_t$, and encourage the clustering assignments to match across different views. Self-supervision follows three steps: First, the model makes a prediction on the augmented view, and obtains $\hat{y}_t$ and $\hat{u}_t$ (E-step). Second, it updates the prototype memory according to the prediction (M-step). Third, to create a learning target, we query the original view again, and obtain $\tilde{y}_t$ to supervise the cluster assignment of the augmented view, $\hat{y}'_t$, as in distillation (Hinton et al., 2015):

     $$\mathcal{L}_{\text{self}} = -\frac{1}{T} \sum_t \sum_k \tilde{y}_{t,k} \log \hat{y}'_{t,k}.$$

     Note that both $\tilde{y}_t$ and $\hat{y}'_t$ are produced after the M-step so we can exclude the “unknown” class in the representation learning objective. We here introduce a separate temperature parameter $\tilde{\tau}$ to control the entropy of the mixture assignment $\tilde{y}_t$.

  2. Entropy loss: In order to encourage more confident predictions we introduce another loss function that controls the entropy of the original prediction $\hat{y}_t$, produced in the initial E-step:

     $$\mathcal{L}_{\text{ent}} = -\frac{1}{T} \sum_t \sum_k \hat{y}_{t,k} \log \hat{y}_{t,k}.$$

  3. New cluster loss: Lastly, our learning formulation also includes a loss for initiating new clusters, $\mathcal{L}_{\text{new}}$. We define it to be a Beta prior on the expected $\hat{u}_t$, and we introduce a hyperparameter $\mu$ to control the expected number of clusters:

     $$\mathcal{L}_{\text{new}} = -\log \Pr(\mathbb{E}[\hat{u}_t]).$$

     This acts as a regularizer on the total number of prototypes: if the system is too aggressive in creating prototypes, then it does not learn to merge instances of the same class; while if it is too conservative, the representations can collapse to a trivial solution.
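
The three terms can be sketched for a single episode as follows. This is an illustrative sketch, not the authors' code: it assumes a cross-entropy form for the self-supervised term, a plain entropy for the second term, and a Beta prior parameterized by a mean `mu` and a hypothetical concentration `strength`, evaluated at the episode-average new-cluster probability. The weights `lam_ent` and `lam_new` and all names are assumptions.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    """-sum_k target_k * log(pred_k) for one time step."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def episode_losses(y_tilde, y_aug, y_orig, u_hat, mu=0.5, strength=4.0,
                   lam_ent=1.0, lam_new=1.0):
    """Combine the three terms over an episode of T steps.

    y_tilde: pseudo-labels from the original view (after the memory update).
    y_aug:   predictions for the augmented view.
    y_orig:  predictions from the initial E-step.
    u_hat:   per-step new-cluster probabilities.
    """
    T = len(u_hat)
    loss_self = sum(cross_entropy(t, p) for t, p in zip(y_tilde, y_aug)) / T
    loss_ent = sum(cross_entropy(p, p) for p in y_orig) / T
    # Beta(a, b) prior with mean mu, evaluated at the average u_hat.
    a, b = mu * strength, (1.0 - mu) * strength
    u_bar = min(max(sum(u_hat) / T, 1e-6), 1.0 - 1e-6)
    log_beta = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + (a - 1.0) * math.log(u_bar)
                + (b - 1.0) * math.log(1.0 - u_bar))
    loss_new = -log_beta
    total = loss_self + lam_ent * loss_ent + lam_new * loss_new
    return total, (loss_self, loss_ent, loss_new)
```

In a real training loop these quantities would be differentiable tensors so that gradients reach the encoder; plain floats are used here only to keep the sketch dependency-free.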

Relation to Online ProtoNet.

The formulation of our probabilistic prototype memory is similar to Online ProtoNet (Ren et al., 2021). However, there are several main differences. First, we consider a decay term that can handle nonstationary mixtures, which is related to the variance of the transition probability. Second, our new cluster creation is unsupervised, whereas in Ren et al. (2021) only labeled examples lead to new clusters. Most importantly, our representation learning objective is also entirely unsupervised, whereas Ren et al. (2021) relies on a supervised loss.

Full algorithm.

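Let $\Theta = \{\theta, \beta, \gamma, \tau\}$ denote the union of the learnable parameters. Algorithm 1 outlines our proposed learning algorithm. The full list of hyperparameters is included in Appendix B.

Algorithm 1 Online Unsupervised Prototypical Learning
  repeat
     for $t = 1, \dots, T$ do
        Observe new input $\mathbf{x}_t$.
        Encode input, $\mathbf{z}_t \leftarrow h(\mathbf{x}_t; \theta)$.
        Compare to existing prototypes: $(\hat{y}_t, \hat{u}_t) \leftarrow \text{E-step}(\mathbf{z}_t, P)$.
        if $\hat{u}_t \le \alpha$ then
           Assign $\mathbf{z}_t$ to existing prototypes: $P \leftarrow \text{M-step}(\mathbf{z}_t, P, \hat{y}_t)$.
        else
           Recycle the least used prototype if $P$ is full.
           Create a new prototype $P \leftarrow P \cup \{(\mathbf{z}_t, 1)\}$.
        end if
        Compute pseudo-labels: $\tilde{y}_t \leftarrow \text{E-step}(\mathbf{z}_t, P; \tilde{\tau})$.
        Augment a view: $\mathbf{x}'_t \leftarrow \text{augment}(\mathbf{x}_t)$.
        Encode the augmented view: $\mathbf{z}'_t \leftarrow h(\mathbf{x}'_t; \theta)$.
        Compare the augmented view to existing prototypes: $\hat{y}'_t \leftarrow \text{E-step}(\mathbf{z}'_t, P)$.
        Compute the self-supervision loss: $\mathcal{L}_{\text{self}} \leftarrow \mathcal{L}_{\text{self}} - \frac{1}{T} \sum_k \tilde{y}_{t,k} \log \hat{y}'_{t,k}$.
        Compute the entropy loss: $\mathcal{L}_{\text{ent}} \leftarrow \mathcal{L}_{\text{ent}} - \frac{1}{T} \sum_k \hat{y}_{t,k} \log \hat{y}_{t,k}$.
        Compute the average probability of creating new prototypes, $\bar{u} \leftarrow \bar{u} + \frac{1}{T} \hat{u}_t$.
     end for
     Compute the new cluster loss: $\mathcal{L}_{\text{new}} \leftarrow -\log \Pr(\bar{u})$.
     Sum up losses: $\mathcal{L} \leftarrow \mathcal{L}_{\text{self}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}} + \lambda_{\text{new}} \mathcal{L}_{\text{new}}$.
     Update parameters: $\Theta \leftarrow \text{optimize}(\mathcal{L}, \Theta)$.
  until convergence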

It is worth noting that if we create a new prototype at every time step, then our algorithm is similar to standard contrastive learning with an instance-based InfoNCE loss (Chen et al., 2020a; He et al., 2020). All the losses can be computed online without having to store additional examples other than the collection of prototypes.

4 Experiments

In this section, we evaluate our proposed learning algorithm on a set of visual learning tasks and evaluate the quality of the output categories. Different from prior work on visual representation learning, we focus on online non-iid image sequences to highlight the merit of our method. Our source code is released at:

Implementation details.

Throughout our experiments, we make two changes to the model inference procedure defined above. First, we use cosine similarity instead of negative squared Euclidean distance for computing the mixture logits. Cosine similarity is bounded and is found to be more stable to train. Second, when we perform cluster inference, we treat the mixing coefficients $w_k$ as constant and uniform, as otherwise we find that the representations may collapse into a single large cluster.
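
A minimal sketch of the modified logits is shown below. With constant, uniform mixing coefficients the log-weight term is identical for every cluster and cancels in the softmax, so only the scaled cosine similarity remains. The function name and the temperature value are illustrative assumptions.

```python
import math

def cosine_logits(z, prototypes, tau=0.1):
    """Per-cluster logits from cosine similarity with uniform mixing weights.

    The uniform log-weight term is omitted because it is the same for
    every cluster. tau is an illustrative temperature.
    """
    def norm(v):
        # Fall back to 1.0 so that an all-zero vector does not divide by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    nz = norm(z)
    return [sum(a * b for a, b in zip(z, p)) / (nz * norm(p)) / tau
            for p in prototypes]
```

Since cosine similarity lies in $[-1, 1]$, the logits are bounded by $\pm 1/\tau$, which is one way to see why this choice can be more stable than an unbounded squared distance.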

Evaluation setup.

Although our training is entirely unsupervised, we can evaluate our model under both unsupervised and supervised settings. During evaluation we present our model with a sequence of new images (labeled or unlabeled) and we would like to see how well the model is able to output a successful grouping of the inputs. For the unsupervised setting, there is no class label associated with each image, and our model directly predicts $\hat{y}_t$. For the supervised setting, following Ren et al. (2021), we give the model the groundtruth label $y_t$ after the prediction of $\hat{y}_t$, so each prediction can make use of the labels of earlier inputs.

Figure 2: An example subsequence of the RoamingRooms dataset, consisting of consecutive glimpses of an online agent roaming in an indoor environment. The task is to recognize the object instances.

Table 1: Novel object recognition performance on RoamingRooms
Method                                AMI     AP
Online ProtoNet (Ren et al., 2021)    79.02   89.94
Random Network                        28.25   11.68
SimCLR (Chen et al., 2020a)           59.20   65.47
SwAV (Caron et al., 2020)             54.55   62.70
SwAV+Queue (Caron et al., 2020)       58.70   65.83
OUPN (Ours)                           78.16   84.86

Table 2: Comparison to SimCLR and SwAV with larger batch sizes on RoamingRooms
Method              Batch Size   # GPUs   AMI     AP
SimCLR              50           1        59.20   65.47
SimCLR              100          2        63.08   72.03
SimCLR              200          4        64.88   76.16
SimCLR (iid)        1024         8        73.52   83.63
SwAV                50           1        54.55   62.70
SwAV                100          2        60.18   70.80
SwAV                200          4        63.73   75.83
SwAV (iid)          1024         8        66.37   78.76
SwAV+Queue          50           1        58.70   65.83
SwAV+Queue          100          2        61.67   72.33
SwAV+Queue          200          4        65.58   76.57
SwAV+Queue (iid)    1024         8        71.04   80.67
OUPN (Ours)         50           1        78.16   84.86

Evaluation metrics.

We use the following unsupervised and supervised metrics.

  • Adjusted mutual information (AMI): In the unsupervised setting, we use the mutual information metric to evaluate the similarity between our prediction $\hat{y}_{1:T}$ and the groundtruth class ID $y_{1:T}$. Since our online clustering method admits a threshold parameter $\alpha$ to control the number of output clusters, we can sweep the value of $\alpha$ to maximize the AMI score, to make the score threshold invariant: $\text{AMI}_{\max} = \max_{\alpha} \text{AMI}(y, \hat{y}(\alpha))$.

  • Average precision (AP): In the supervised setting, we followed the evaluation procedure in Ren et al. (2021) and used average precision that combines both accuracy for predicting known classes as well as unknown ones.
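
The threshold sweep used for the AMI metric can be sketched generically as below. Both callables are placeholders supplied by the caller: `predict_fn` stands for running the clustering readout at a given threshold, and `score_fn` for the clustering score. All names are hypothetical.

```python
def sweep_threshold(predict_fn, score_fn, labels, thresholds):
    """Return (best_score, best_threshold) over candidate thresholds.

    predict_fn(threshold) -> predicted cluster IDs for the evaluation sequence.
    score_fn(labels, preds) -> clustering score, higher is better.
    """
    best_score, best_threshold = float("-inf"), None
    for alpha in thresholds:
        score = score_fn(labels, predict_fn(alpha))
        if score > best_score:
            best_score, best_threshold = score, alpha
    return best_score, best_threshold
```

If scikit-learn is available, `sklearn.metrics.adjusted_mutual_info_score` takes true and predicted labels in that order and could be passed as `score_fn`.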

Competitive methods.

We compare our proposed model with the following state-of-the-art self-supervised visual representation learning methods. Since none of these competing methods is designed to output classes from a few examples, we use our online clustering procedure to read out categories from their learned representations.

  • SimCLR (Chen et al., 2020a) is a representative contrastive learning method with an instance-based objective that tries to classify an image instance among other augmented views of the same batch of instances. It relies on a large batch size and is often trained on well-curated datasets such as ImageNet (Deng et al., 2009).

  • SwAV (Caron et al., 2020) is a representative contrastive learning method with a clustering-based objective. It has stronger performance than SimCLR on ImageNet. The clustering is achieved through Sinkhorn-Knopp, which assumes balanced assignment, and prototypes are learned by gradient descent.

  • SwAV+Queue is a SwAV variant with an additional example queue. This setup is proposed in Caron et al. (2020) to deal with small training batches. Since our episodes are relatively small (50-150), we use a queue here to see if it can improve the SwAV baseline. The queue size is set to around 2000 to mimic a large batch setting.

4.1 Indoor home environments

We first evaluate the algorithm using the RoamingRooms dataset (Ren et al., 2021) where the images are collected from indoor environments (Chang et al., 2017) using a random walking agent. The dataset contains 1.22M image frames and 7K instance classes from 6.9K random walk episodes collected using 1.2K panoramas. Each image is resized to $120 \times 160 \times 3$. We use a maximum of 50 frames for training due to memory constraints, and all methods are evaluated on the test set with a maximum of 100 frames per episode. Each frame has an object annotation with its segmentation mask and the task here is to recognize the object instance IDs. Since our task is classification, we also send the segmentation mask as an additional channel in the input. An example episode is shown in Fig. 2.

Figure 3: Image retrieval results on RoamingRooms. On each row, the leftmost image is the query image, and the top-9 retrieved images are shown on the right, with the cosine similarity score in the top left corner. Retrievals of the same instance are shown in green.

Figure 4: An example subsequence of an episode sampled from the RoamingOmniglot dataset

Table 3: Results on RoamingOmniglot
Method                                AMI     AP
Pretrain-Supervised                   84.48   93.83
Online ProtoNet (Ren et al., 2021)    89.64   92.58
Random Network                        17.66   17.01
SimCLR (Chen et al., 2020a)           59.06   73.50
SwAV (Caron et al., 2020)             62.09   75.93
SwAV+Queue (Caron et al., 2020)       67.25   81.96
OUPN (Ours)                           84.42   92.84

Table 4: Results on RoamingImageNet
Method                                AMI     AP
Pretrain-Supervised                   25.04   23.43
Online ProtoNet (Ren et al., 2021)    23.59   17.59
Random Network                        6.20    1.60
SimCLR (Chen et al., 2020a)           16.45   9.67
SwAV (Caron et al., 2020)             15.58   5.98
SwAV+Queue (Caron et al., 2020)       14.77   5.65
OUPN (Ours)                           17.04   9.21

Figure 5: Robustness to imbalanced distributions by adding distractor classes (Omniglot episodes mixed with MNIST images). Performance is relative to the original performance and a random baseline.

Figure 6: Visualization of learned categories in RoamingOmniglot in a test episode. Different colors denote the ground-truth class IDs.
Implementation details.

We use a ResNet-12 (Oreshkin et al., 2018) as the encoder network, and we train our method over 80k 50-frame episodes (4M image frames total), using a single NVIDIA 1080-Ti GPU. For iid baselines, we train using large batches of 1024 images randomly sampled over all episodes, for a total of 40 epochs, where each epoch contains 1000 update steps (40M image frames total), using 8 GPUs with 128 images per GPU. We follow the same procedure of image augmentation as SimCLR (Chen et al., 2020a). We use 150 prototypes. More implementation details can be found in Appendix B.
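
The frame counts quoted above follow from simple arithmetic, which the short check below makes explicit. The variable names are illustrative; the numbers are taken from the paragraph.

```python
# Frames seen by the online method: episodes x frames per episode.
online_frames = 80_000 * 50

# Frames seen by the iid baselines: epochs x steps per epoch x batch size.
iid_batch = 8 * 128                      # 8 GPUs x 128 images per GPU
iid_frames = 40 * 1000 * iid_batch

print(online_frames)               # 4000000
print(iid_batch)                   # 1024
print(iid_frames)                  # 40960000
print(iid_frames / online_frames)  # 10.24
```

Under these figures the iid baselines process roughly ten times as many image frames as the online method.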


Our main results are shown in Table 1. Although both SimCLR and SwAV have shown promising results with large-batch learning on ImageNet, their performance here is relatively weak compared to the supervised baseline. Adding a queue slightly improves SwAV; however, since the examples in the queue cannot be used to compute gradients, the gradient updates still suffer from the nonstationary distribution. In contrast, our method OUPN shows impressive performance on this benchmark: it almost matches the supervised learner in AMI, and reaches almost 95% of the performance of the supervised learner in AP.
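
The relative figures can be checked directly from the reported RoamingRooms scores. The snippet assumes that the Online ProtoNet row is the supervised reference; the dictionary names are illustrative.

```python
supervised = {"AMI": 79.02, "AP": 89.94}   # Online ProtoNet row
oupn = {"AMI": 78.16, "AP": 84.86}         # OUPN row

# OUPN score as a percentage of the supervised score, per metric.
relative = {k: 100.0 * oupn[k] / supervised[k] for k in supervised}
for metric, pct in relative.items():
    print(f"{metric}: {pct:.1f}% of the supervised score")
```

The ratios come out to roughly 98.9% for AMI and 94.4% for AP.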

To illustrate the impact of our online non-iid episodes, we increase the batch size for SimCLR and SwAV, from 50 to 200, at the cost of using multiple GPUs training in parallel. The results are shown in Table 2. We also report oracles using large iid batches of size 1024; these are considered oracles because as we increase the batch size to infinity, the non-iid setting will converge to an iid one. Interestingly, we can see that both SimCLR and SwAV experience a huge drop in performance when training with the online non-iid episodes. Moreover, our method using online non-iid episodes is able to outperform the iid large-batch versions of SimCLR and SwAV, which use 8× the computational resources of ours.

Visualization on image retrieval.

To verify the usefulness of the learned representation, we ran an image retrieval visualization over 100 test episodes in Fig. 3, and we also provide the cosine similarity score for each retrieved image. The top retrieved images are all from the same instance as the query image, and our model sometimes achieves perfect recall. This confirms that our model can handle a certain level of view angle changes. We also investigated the missed examples and found that these are taken from more distinct view angles. For example, the missed examples in the bottom row are viewed from the back of the flower. However, note that we only computed the pairwise similarity score for retrieval, and therefore it is possible that the actual clusters are more inclusive as the model incrementally computes the average embedding as the prototype. More visualizations are provided in Appendix D.

4.2 Handwritten characters

We also evaluated our method on a different task of recognizing handwritten characters using Omniglot (Lake et al., 2015). In this experiment, images are not being organized in a video-like sequence, and models have to reason about conceptual similarity between images in order to learn grouping. Furthermore, since this is a more controllable setup, we can verify our hypothesis on class imbalance by performing manipulations on the episode distribution.

Our episodes are sampled from the RoamingOmniglot dataset (Ren et al., 2021), where each episode consists of five different contexts, and in each context, classes are sampled from a Chinese restaurant process. We use the 150-image episodes, with a maximum of 50 classes per episode, with images.
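To make the class sampling concrete, the following is a minimal sketch of drawing class labels from a plain Chinese restaurant process within a single context. It is not the RoamingOmniglot sampler of Ren et al. (2021): the function name, the concentration parameter `alpha` and its value, and the way the class cap is enforced are all assumptions for illustration.

```python
import random


def sample_crp_labels(num_items, alpha=1.0, max_classes=50, seed=0):
    """Draw class labels from a Chinese restaurant process (illustrative only).

    An existing class k is chosen with weight equal to its current count,
    and a new class is opened with weight alpha until max_classes is reached.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items assigned to class k so far
    labels = []
    for _ in range(num_items):
        weights = list(counts)
        if len(counts) < max_classes:
            weights.append(alpha)  # option of opening a new class
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)  # a new class was opened
        counts[k] += 1
        labels.append(k)
    return labels


labels = sample_crp_labels(150, alpha=1.0, max_classes=50)
```

Under such a process, classes that already appeared often are more likely to appear again, which produces the imbalanced class frequencies discussed in this section.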

Implementation details.

We use a four-layer ConvNet (Vinyals et al., 2016) as the encoder, and train our model over 80k episodes (12M image frames total). We use 150 prototypes with . More experimental details are in Appendix B.


The results are reported in Tab. 4. Our model is able to significantly reduce the gap between supervised and unsupervised models, outperforming SimCLR and SwAV.

Effect of imbalanced distribution.

We further study the effect of imbalanced cluster sizes by manipulating the class distribution in the training episodes. In the first setting, we randomly replace Omniglot images with MNIST digits, with probability from 0% to 100%. For example, at a 50% rate, an MNIST digit is over 300 times more likely to appear than any Omniglot character class, so the episodes are composed half of frequent classes and half of infrequent classes. In the second setting, we randomly replace Omniglot images with MNIST digit 1 images, which makes the imbalance even greater. Results of the two settings are shown in Fig. 6, and our method is shown to be more robust under imbalanced distributions than the iid versions of SimCLR and SwAV. Compared to clustering-based methods like SwAV, our prototypes can be dynamically created and updated with no constraints on the number of elements per cluster. Compared to instance-based methods like SimCLR, our prototypes also sample the contrastive pairs more equally during learning. We hypothesize that these model aspects contribute to the differences in robustness.
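The replacement procedure in the first setting can be sketched as below. This is a hypothetical illustration rather than our data pipeline: the function name `mix_streams`, the string placeholders standing in for images, and the uniform choice among frequent items are assumptions.

```python
import random


def mix_streams(rare_items, frequent_items, replace_prob, seed=0):
    """Replace each rare item by a random frequent item with probability replace_prob."""
    rng = random.Random(seed)
    mixed = []
    for item in rare_items:
        if rng.random() < replace_prob:
            mixed.append(rng.choice(frequent_items))  # e.g. an MNIST digit
        else:
            mixed.append(item)  # keep the original Omniglot image
    return mixed


# Placeholders stand in for images: many rare classes, ten frequent classes.
omniglot = ["omniglot_%d" % i for i in range(100)]
mnist = ["mnist_%d" % d for d in range(10)]
episode = mix_streams(omniglot, mnist, replace_prob=0.5)
```

Restricting `frequent_items` to a single class corresponds to the second setting, where one class dominates the stream.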

Visualization of learned categories.

We visualize the learned categories in RoamingOmniglot in Fig. 6 using t-SNE (Van der Maaten and Hinton, 2008). Different colors represent different ground-truth classes. Our method is able to learn meaningful embeddings and roughly group items of similar semantic meaning together. More visualizations are provided in Appendix D.

Ablation studies.

We study the effect of certain hyperparameters of our model in Tab. 5, 6 and 7, using the validation sets of RoamingRooms and RoamingOmniglot. The memory size has little effect on the final performance, whereas the decay factor matters more. As shown in Tab. 7, the new cluster loss is necessary, since model performance drops significantly when its coefficient is below 0.5. More studies on the effect of further hyperparameters, including the Beta mean, are included in Appendix C.2.

4.3 ImageNet images

Lastly, we evaluate on episodes composed of ImageNet images, using the RoamingImageNet dataset (Ren et al., 2021), which is sampled with a similar structure to RoamingOmniglot. It is roughly on a similar scale to RoamingRooms (around 1.3 million images), but images in each class here share fewer common features for clustering. We use this benchmark as a stress test for our method to see whether it can still learn semantic concepts with a large intra-class variance. At the same time, this benchmark lacks the realistic nature of an online real world agent, and therefore it is not the main focus of our paper.

Implementation details.

We use a ResNet-12 (Oreshkin et al., 2018) and train it with 80k 150-frame episodes. We use 600 prototypes with . More implementation details are in the Appendix B.


Results are shown in Tab. 4. Although ImageNet is a more challenging benchmark, our model is still comparable to SimCLR using small non-iid batches, and achieves the best performance on the AMI score. We include an additional comparison to SimCLR and SwAV in their traditional large-batch settings in Appendix C.3, where SimCLR and SwAV still achieve better performance at the cost of substantially more computation.

                    RoamingOmniglot    RoamingRooms
Mem. size           AMI      AP        AMI      AP
50                  89.19    95.12     75.33    82.42
100                 90.54    95.83     76.70    83.51
150                 90.24    95.92     77.07    84.00
200                 90.36    95.68     76.81    84.45
250                 89.87    95.69     77.83    84.33
Table 5: Effect of mem. size

                    RoamingOmniglot    RoamingRooms
Decay rate          AMI      AP        AMI      AP
0.9                 51.12    64.19     65.07    75.50
0.95                79.78    89.30     74.33    81.92
0.99                89.43    95.54     76.97    84.05
0.995               90.80    95.90     77.78    85.02
0.999               86.27    93.69     38.89    39.37
Table 6: Effect of decay rate

                    RoamingOmniglot    RoamingRooms
New cluster loss    AMI      AP        AMI      AP
0.0                 38.26    93.40     19.49    73.93
0.1                 86.60    93.50     67.25    71.69
0.5                 89.89    95.28     78.04    84.85
1.0                 90.06    95.81     77.59    84.36
2.0                 88.74    95.73     77.62    84.72
Table 7: Effect of the new cluster loss

5 Conclusion

Taking steps towards the online and nonstationary learning process of real world agents, we develop an online unsupervised algorithm for learning visual representations and categories. Unlike standard self-supervised learning, our category learning is embedded in a probabilistic clustering module that is learned jointly with the representation encoder. Our clustering is more flexible and supports learning of new categories with very few examples. At the same time, we leverage self-supervised learning to acquire semantically meaningful representations. Our method is evaluated on both synthetic and realistic image sequences and outperforms state-of-the-art self-supervised learning algorithms using online non-iid data inputs.


  • Allen et al. (2019) Kelsey R. Allen, Evan Shelhamer, Hanul Shin, and Joshua B. Tenenbaum. Infinite mixture prototypes for few-shot learning. In ICML, 2019.
  • Antoniou and Storkey (2019) Antreas Antoniou and Amos J. Storkey. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. CoRR, abs/1902.09884, 2019.
  • Asano et al. (2020) Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
  • Buzzega et al. (2020) Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In NeurIPS, 2020.
  • Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
  • Castro et al. (2018) Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In ECCV, 2018.
  • Chang et al. (2017) Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from RGB-D data in indoor environments. In 3DV, 2017.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a.
  • Chen and He (2021) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. 2021.
  • Chen et al. (2020b) Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. CoRR, abs/2003.04297, 2020b.
  • Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Denton and Fergus (2018) Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In ICML, 2018.
  • Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
  • Gidaris et al. (2019) Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. In ICCV, 2019.
  • Greff et al. (2017) Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NIPS, 2017.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - A new approach to self-supervised learning. In NeurIPS, 2020.
  • Hayes et al. (2019) Tyler L. Hayes, Nathan D. Cahill, and Christopher Kanan. Memory efficient experience replay for streaming learning. In ICRA, 2019.
  • Hayes et al. (2020) Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, and Christopher Kanan. REMIND your neural network to prevent catastrophic forgetting. In ECCV, 2020.
  • He et al. (2018) Jiawei He, Andreas M. Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In ECCV, 2018.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
  • Hsu et al. (2019) Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. In ICLR, 2019.
  • Huang et al. (2019) Gabriel Huang, Hugo Larochelle, and Simon Lacoste-Julien. Centroid networks for few-shot clustering and unsupervised few-shot classification. CoRR, abs/1902.08605, 2019.
  • Hughes and Sudderth (2013) Michael C. Hughes and Erik B. Sudderth. Memoized online variational inference for dirichlet process mixture models. In NIPS, 2013.
  • Javed and White (2019) Khurram Javed and Martha White. Meta-learning representations for continual learning. In NeurIPS, 2019.
  • Jerfel et al. (2019) Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A. Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. In NeurIPS, 2019.
  • Johnson et al. (2016) Matthew J. Johnson, David Duvenaud, Alexander B. Wiltschko, Ryan P. Adams, and Sandeep R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
  • Joo et al. (2020) Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul Moon. Dirichlet variational autoencoder. Pattern Recognit., 107:107514, 2020.
  • Khodadadeh et al. (2019) Siavash Khodadadeh, Ladislau Bölöni, and Mubarak Shah. Unsupervised meta-learning for few-shot image classification. In NeurIPS, 2019.
  • Kim et al. (2020) Chris Dongjoo Kim, Jinseo Jeong, and Gunhee Kim. Imbalanced continual learning with partitioning reservoir sampling. In ECCV, 2020.
  • Krishnan et al. (2015) Rahul G. Krishnan, Uri Shalit, and David A. Sontag. Deep kalman filters. CoRR, abs/1511.05121, 2015.
  • Lake et al. (2009) Brenden M. Lake, Gautam K. Vallabha, and James L. McClelland. Modeling unsupervised perceptual category learning. IEEE Trans. Auton. Ment. Dev., 1(1):35–43, 2009.
  • Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Li et al. (2003) Fei-Fei Li, Robert Fergus, and Pietro Perona. A bayesian approach to unsupervised one-shot learning of object categories. In ICCV, 2003.
  • Li et al. (2021) Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. 2021.
  • Love et al. (2004) Bradley C Love, Douglas L Medin, and Todd M Gureckis. Sustain: a network model of category learning. Psychological review, 111(2):309, 2004.
  • Mack et al. (2016) Michael L. Mack, Bradley C. Love, and Alison R. Preston. Dynamic updating of hippocampal object representations reflects new conceptual knowledge. Proceedings of the National Academy of Sciences, 113(46):13203–13208, 2016. ISSN 0027-8424. doi: 10.1073/pnas.1614048113.
  • Medina et al. (2020) Carlos Medina, Arnout Devos, and Matthias Grossglauser. Self-supervised prototypical transfer learning for few-shot classification. CoRR, abs/2006.11325, 2020.
  • Misra and van der Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
  • Oreshkin et al. (2018) Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: task dependent adaptive metric for improved few-shot learning. In NIPS, 2018.
  • Orhan et al. (2020) A. Emin Orhan, Vaibhav V. Gupta, and Brenden M. Lake. Self-supervised learning through the eyes of a child. In NeurIPS, 2020.
  • Pathak et al. (2017) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
  • Pinto and Engel (2015) Rafael C. Pinto and Paulo Martins Engel. A fast incremental gaussian mixture model. CoRR, abs/1506.04422, 2015.
  • Rao et al. (2019) Dushyant Rao, Francesco Visin, Andrei A. Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. In NeurIPS, 2019.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl: Incremental classifier and representation learning. In CVPR, 2017.
  • Ren et al. (2018) Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
  • Ren et al. (2019) Mengye Ren, Renjie Liao, Ethan Fetaya, and Richard S. Zemel. Incremental few-shot learning with attention attractor networks. In NeurIPS, 2019.
  • Ren et al. (2021) Mengye Ren, Michael L. Iuzzolino, Michael C. Mozer, and Richard S. Zemel. Wandering within a world: Online contextualized few-shot learning. In ICLR, 2021.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
  • Song and Wang (2005) Mingzhou Song and Hongbin Wang. Highly efficient incremental estimation of gaussian mixture models for online data stream clustering. In Intelligent Computing: Theory and Applications III, volume 5803, pages 174–183. International Society for Optics and Photonics, 2005.
  • Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.
  • Triantafillou et al. (2020) Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2020.
  • van den Oord et al. (2018) Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(11), 2008.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • Wang and Gupta (2015) Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
  • Xiong et al. (2021) Yuwen Xiong, Mengye Ren, Wenyuan Zeng, and Raquel Urtasun. Self-supervised representation learning from flow equivariance. In ICCV, 2021.
  • Zhan et al. (2020) Xiaohang Zhan, Jiahao Xie, Ziwei Liu, Yew-Soon Ong, and Chen Change Loy. Online deep clustering for unsupervised representation learning. In CVPR, 2020.
  • Zhu et al. (2020) Yizhe Zhu, Martin Renqiang Min, Asim Kadav, and Hans Peter Graf. S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. In CVPR, 2020.

Appendix A Method Derivation

A.1 E-step

Inferring cluster assignments.

The categorical variable infers the cluster assignment of the current input example with regard to the existing clusters.


where is the mixing coefficient of cluster and is the distance function, and is the logits. In our experiments, ’s are kept as constant and is an independent learnable parameter.
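The assignment step described above can be sketched as a softmax over per-cluster logits that combine a log mixing coefficient with a scaled negative distance. This is a minimal sketch under stated assumptions rather than our implementation: the squared Euclidean distance, the name `tau` for the learnable scale, its value, and the function names are all illustrative choices, and the mixing coefficients default to a constant uniform value.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_clusters(z, prototypes, tau=0.1, mixing=None):
    """Soft assignment of embedding z to existing prototypes.

    logit_k = log(mixing_k) - distance(z, prototype_k) / tau
    The result is the softmax of the logits over clusters.
    """
    if mixing is None:
        mixing = [1.0 / len(prototypes)] * len(prototypes)  # constant, uniform
    logits = [math.log(w) - squared_distance(z, p) / tau
              for w, p in zip(mixing, prototypes)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = assign_to_clusters([0.0, 0.0], [[0.0, 0.0], [1.0, 1.0]])
```

A smaller `tau` sharpens the assignment toward the closest prototype, while a larger value spreads probability mass across clusters.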

Inference on unknown classes.

The binary variable estimates the probability that the current input belongs to a new cluster:


where and is the input dimension. In our implementation, we use here instead of since we found leads to better and more stable training performance empirically. It can be derived as a lower bound:


where , . To make learning more flexible, we directly make and as independent learnable parameters so that we can control the confidence level for predicting unknown classes.
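One plausible reading of this step, offered only as a sketch, is a sigmoid applied to the shifted and scaled distance to the closest prototype, with two free parameters controlling where and how sharply the decision switches. The functional form, the names `beta` and `gamma`, their default values, and the squared Euclidean distance are assumptions for illustration and are not taken from the equations of this appendix.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of distance function)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def new_cluster_probability(z, prototypes, beta=1.0, gamma=1.0):
    """Probability that z belongs to a new cluster (illustrative form).

    Uses the distance to the closest prototype; beta shifts the decision
    point and gamma controls how sharp the transition is.
    """
    if not prototypes:
        return 1.0  # nothing stored yet, so the input must start a new cluster
    d_min = min(squared_distance(z, p) for p in prototypes)
    return sigmoid((d_min - beta) / gamma)


p_near = new_cluster_probability([0.0, 0.0], [[0.0, 0.0]])
p_far = new_cluster_probability([3.0, 0.0], [[0.0, 0.0]])
```

In this form, inputs close to an existing prototype receive a low new-cluster probability and distant inputs receive a high one.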

A.2 M-step

Here we infer the posterior distribution of the prototypes . We formulate an efficient recursive online update, similar to Kalman filtering, by incorporating the evidence of the current input and avoiding re-clustering the entire input history. We define as the estimate of the posterior mean of the -th cluster at time step , and is the estimate of the posterior variance.

Updating prototypes.

Suppose that in the E-step we have determined that . Then the posterior distribution of the -th cluster after observing is:


If we assume that the transition probability distribution is a zero-mean Gaussian with variance , where is some constant that we defined to be the memory decay coefficient, then the posterior estimates are:


If , and , , it turns out we can formulate the update equation as follows, and can be viewed as a count variable for the number of elements in each estimated cluster, subject to the decay factor over time:


If , then the prototype posterior distribution simply gets diffused at timestep :


Finally, our posterior estimates at time are computed by taking the expectation over :


Since is also our estimate on the number of elements in each cluster, we can use it to estimate the mixture weights,


Note that in our experiments the mixture weights are not used and we assume that each cluster has an equal mixture probability.
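The recursive update can be illustrated with a decayed count per cluster and a running weighted average for each prototype. The sketch below is an assumption-laden illustration, not the derivation above: it treats the assignment weights as given soft responsibilities, and the names `update_prototypes`, `mixture_weights`, `responsibilities`, and `decay` are hypothetical.

```python
def update_prototypes(z, prototypes, counts, responsibilities, decay=0.995):
    """One online update of prototypes and decayed counts (illustrative).

    count_k     <- decay * count_k + r_k
    prototype_k <- prototype_k + (r_k / count_k) * (z - prototype_k)
    A cluster with r_k = 0 keeps its prototype while its count decays.
    """
    new_prototypes, new_counts = [], []
    for p, c, r in zip(prototypes, counts, responsibilities):
        c_new = decay * c + r
        if c_new > 0.0:
            step = r / c_new
            p_new = [pi + step * (zi - pi) for pi, zi in zip(p, z)]
        else:
            p_new = list(p)
        new_prototypes.append(p_new)
        new_counts.append(c_new)
    return new_prototypes, new_counts


def mixture_weights(counts):
    """Estimate mixture weights as normalized counts."""
    total = sum(counts)
    if total == 0.0:
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


protos, counts = update_prototypes(
    [2.0, 0.0], [[0.0, 0.0], [4.0, 4.0]], [1.0, 1.0], [1.0, 0.0], decay=1.0)
```

With `decay` set to 1.0 the update reduces to an ordinary running mean; values below 1.0 let recent examples weigh more, which matches the role of the memory decay coefficient.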

Appendix B Experiment Details

We provide additional implementation details in Tab. 8, 9 and 10.

Hyperparameter                    Values
Random crop area range            0.08 - 1.0
Random color strength             0.5
Backbone                          ResNet-12 (Oreshkin et al., 2018)
Num channels                      [32, 64, 128, 256]
init                              0.1
init                              -12.0
init                              1.0
Num prototypes                    150
Memory decay                      0.995
Sequence length / batch size      50 (100 eval)
Beta mean                         0.5
Entropy loss                      0.0
New cluster loss                  0.5
Threshold                         0.5
Pseudo label temperature ratio    0.1
Learning rate schedule            [40k, 60k, 80k]
Learning rate                     [1e-3, 1e-4, 1e-5]
Optimizer                         Adam
Table 8: Hyperparameter settings for RoamingRooms

Hyperparameter                    Values
Random crop area range            0.08 - 1.0
Random color                      None
Backbone                          Conv-4 (Vinyals et al., 2016)
Num channels                      [64, 64, 64, 64]
init                              0.1
init                              -12.0
init                              1.0
Num prototypes                    150
Memory decay                      0.995
Sequence length / batch size      150
Beta mean                         0.5
Entropy loss                      1.0
New cluster loss                  1.0
Threshold                         0.5
Pseudo label temperature ratio    0.2
Learning rate schedule            [40k, 60k, 80k]
Learning rate                     [1e-3, 1e-4, 1e-5]
Optimizer                         Adam
Table 9: Hyperparameter settings for RoamingOmniglot

Hyperparameter                    Values
Random crop area range            0.08 - 1.0
Random color strength             0.5
Backbone                          ResNet-12 (Oreshkin et al., 2018)
Num channels                      [32, 64, 128, 256]
init                              0.1
init                              -12.0
init                              1.0
Num prototypes                    600
Memory decay                      0.99
Sequence length / batch size      150
Beta mean                         0.5
Entropy loss                      0.5
New cluster loss                  0.5
Threshold                         0.5
Pseudo label temperature ratio    0.0 (i.e. one-hot pseudo labels)
Learning rate schedule            [40k, 60k, 80k]
Learning rate                     [1e-3, 1e-4, 1e-5]
Optimizer                         Adam
Table 10: Hyperparameter settings for RoamingImageNet

B.1 Metric Details

For each method, we used the same nearest centroid algorithm for online clustering. For unsupervised readout, at each timestep, if the closest centroid is within the threshold, we assign the new example to that cluster; otherwise we create a new cluster. For supervised readout, we assign examples based on the class label, and we create a new cluster if and only if the label is a new class. Both readouts provide a sequence of class IDs, and we use the following metrics to compare the predicted class IDs with the ground-truth class IDs. Both metrics are designed to be threshold invariant.
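The unsupervised readout can be sketched as follows. This is an illustrative version rather than the evaluation code: the squared Euclidean distance, the running-mean centroid update, and the function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_nearest_centroid(embeddings, threshold):
    """Assign each embedding to the closest centroid or open a new cluster.

    Returns one cluster ID per input, in arrival order.
    """
    centroids, sizes, labels = [], [], []
    for z in embeddings:
        best_k, best_d = None, float("inf")
        for k, c in enumerate(centroids):
            d = squared_distance(z, c)
            if d < best_d:
                best_k, best_d = k, d
        if best_k is not None and best_d <= threshold:
            # Join the closest cluster and update its centroid as a running mean.
            sizes[best_k] += 1
            n = sizes[best_k]
            centroids[best_k] = [ci + (zi - ci) / n
                                 for ci, zi in zip(centroids[best_k], z)]
            labels.append(best_k)
        else:
            # No centroid is close enough: create a new cluster.
            centroids.append(list(z))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


stream = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
ids = online_nearest_centroid(stream, threshold=1.0)
```

Sweeping `threshold` changes how readily new clusters are created, which is why the reported metrics are computed in a threshold-invariant way.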


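For unsupervised evaluation, we consider the adjusted mutual information score. Suppose we have two clusterings $U = \{U_i\}$ and $V = \{V_j\}$, where $U_i$ and $V_j$ are sets of example IDs, and $N$ is the total number of examples. $U$ and $V$ can be viewed as discrete probability distributions over cluster IDs. Therefore, the mutual information score between $U$ and $V$ is:

$$\mathrm{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log \frac{N \, |U_i \cap V_j|}{|U_i| \, |V_j|}.$$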
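The mutual information between two clusterings can be computed directly from two lists of cluster IDs. The sketch below is a plain-Python illustration, not the evaluation code used in our experiments; the function names are hypothetical, natural logarithms are assumed, and the chance-adjustment step is not included.

```python
import math
from collections import Counter


def mutual_information(labels_u, labels_v):
    """Mutual information between two clusterings given as equal-length ID lists."""
    n = len(labels_u)
    count_u = Counter(labels_u)
    count_v = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))  # sizes of the pairwise overlaps
    mi = 0.0
    for (u, v), n_uv in joint.items():
        mi += (n_uv / n) * math.log(n * n_uv / (count_u[u] * count_v[v]))
    return mi


def entropy(labels):
    """Entropy of the cluster-ID distribution of one clustering."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


truth = [0, 0, 1, 1]
same_up_to_renaming = mutual_information(truth, [1, 1, 0, 0])
unrelated = mutual_information(truth, [0, 1, 0, 1])
```

Mutual information is invariant to renaming cluster IDs, which is what makes it suitable for comparing predicted clusters against ground-truth classes without a label mapping.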


The adjusted MI score normalizes the range between 0 and 1, and subtracts the baseline from random chance:


where $H$ denotes the entropy function, and $\mathbb{E}[\mathrm{MI}]$ is the expected mutual information by chance. Finally, we sweep the threshold to get a threshold invariant score:


For supervised evaluation, we used the AP metric. The AP metric is also threshold invariant, and it takes both the class prediction and the unknown score into account. First, it sorts all predictions by their unknown score in ascending order. Then it checks whether each class prediction is correct. For the N top ranked instances in the sorted list, it computes precision@N and recall@N among the known instances.

  • precision@ = ,

  • recall@ = ,

where is the true number of known instances among the top N instances. Finally, AP is computed as the area under the curve of (y=precision@N, x=recall@N). For more details, see Appendix A.3 of Ren et al. (2021).

Appendix C Additional Experimental Results

C.1 Comparison to Reconstruction-Based Methods

We additionally provide Tab. 11 to show a comparison with CURL in the iid setting. We used the same MLP architecture and applied it on the Omniglot dataset using the same data split. Reconstruction-based methods lag far behind self-supervised learning methods. Our method is on par with SimCLR and SwAV.

Method                         3-NN Error      5-NN Error      10-NN Error
VAE (Joo et al., 2020)         92.34 ± 0.25    91.21 ± 0.18    88.79 ± 0.35
SBVAE (Joo et al., 2020)       86.90 ± 0.82    85.10 ± 0.89    82.96 ± 0.64
DirVAE (Joo et al., 2020)      76.55 ± 0.23    73.81 ± 0.29    70.95 ± 0.29
CURL (Rao et al., 2019)        78.18 ± 0.47    75.41 ± 0.34    72.51 ± 0.46
SimCLR (Chen et al., 2020a)    44.35 ± 0.55    42.99 ± 0.55    44.93 ± 0.55
SwAV (Caron et al., 2020)      42.66 ± 0.55    42.08 ± 0.55    44.78 ± 0.55
OUPN (Ours)                    43.75 ± 0.55    42.13 ± 0.55    43.88 ± 0.55
Table 11: Unsupervised IID Learning on Omniglot using MLP

C.2 Additional studies on hyperparameters

Tab. 12, 13, 14, and 15 show additional studies on the effect of the cluster creation threshold, the temperature parameter, the entropy loss parameter, and the Beta prior mean, using the validation sets of RoamingOmniglot and RoamingRooms. The Beta mean is computed as follows: suppose and are the parameters of the Beta distribution, and is the mean. We fix and .

                RoamingOmniglot    RoamingRooms
Threshold       AMI      AP        AMI      AP
0.3             82.75    90.57     52.60    58.71
0.4             81.59    90.94     59.69    66.11
0.5             89.65    95.22     77.96    84.34
0.6             87.01    93.87     64.65    69.49
0.7             86.08    92.94     66.60    73.54
Table 12: Effect of threshold

                RoamingOmniglot    RoamingRooms
Temperature     AMI      AP        AMI      AP
0.05            89.23    95.01     77.44    84.38
0.10            89.71    95.21     77.89    84.99
0.20            89.78    95.31     77.82    84.57
0.50            89.40    95.13     76.81    83.90
1.00            89.62    95.16     0.00     19.91
Table 13: Effect of the temperature parameter

                RoamingOmniglot    RoamingRooms
Entropy loss    AMI      AP        AMI      AP
0.00            82.45    90.66     76.64    84.11
0.25            87.31    93.85     76.61    83.16
0.50            87.98    94.21     75.46    81.78
0.75            88.77    94.74     74.76    79.91
1.00            89.70    95.14     75.32    80.29
Table 14: Effect of the entropy loss parameter

                RoamingOmniglot    RoamingRooms
Beta mean       AMI      AP        AMI      AP
0.3             84.14    93.19     68.75    72.58
0.4             86.59    93.10     69.19    73.86
0.5             89.89    95.24     77.61    84.64
0.6             85.93    93.81     64.21    73.23
0.7             26.22    92.08     48.58    64.28
Table 15: Effect of mean of the Beta prior

C.3 Large Batch IID Results on RoamingOmniglot and RoamingImageNet

In Tab. 16 and 17, we provide results on training SimCLR and SwAV under the traditional setting of using large iid batches, whereas our model is still trained on non-iid batches. We are able to outperform these iid batch models on RoamingOmniglot, but there is still a gap on RoamingImageNet. It is worth mentioning that both SimCLR and SwAV are well tuned towards large-batch settings on ImageNet, and in order to run these experiments on RoamingImageNet, both used 8 GPUs in parallel, which is much more computationally intensive than our model. This also suggests that our method has some limitations in handling larger intra-class variance, and that it is more challenging to assume a clustering latent structure on top of these embeddings.

Batch size AMI AP
Random Network - 17.66 17.01
SimCLR (Chen et al.