
How does the degree of novelty impact semi-supervised representation learning for novel class retrieval?

by Quentin Leroy, et al.

Supervised representation learning with deep networks tends to overfit the training classes, and generalization to novel classes is a challenging question. It is common to evaluate a learned embedding on held-out images of the same training classes. In real applications, however, data comes from new sources and novel classes are likely to arise. We hypothesize that incorporating unlabelled images of novel classes into the training set in a semi-supervised fashion would benefit the efficient retrieval of novel-class images compared to a vanilla supervised representation. To verify this hypothesis comprehensively, we propose an original evaluation methodology that varies the degree of novelty of the novel classes by partitioning the dataset category-wise either randomly, or semantically, i.e. by minimizing the shared semantics between base and novel classes. This evaluation procedure trains a representation blindly to any novel-class labels and evaluates the frozen representation on the retrieval of base or novel classes. We find that a vanilla supervised representation falls short on the retrieval of novel classes, all the more so when the semantic gap is larger. Semi-supervised algorithms partially bridge this performance gap, but there is still much room for improvement.





1. Introduction

Deep neural networks have become the de facto approach for a wide variety of tasks and data modalities, whether image (oquab2014; razavian2014a), text (vaswani2017; devlin2019), speech (hannun2014; zhang2020) or video (aytar2016; sun2019a; alayrac2020; arnab2021). In this study we focus on image data and retrieval tasks. The major impediment is the sheer amount of annotated images necessary to achieve this feat (russakovsky2015a). Several lines of research have been tackling this issue and have brought new ideas that help reduce the amount of annotation necessary: active learning refers to algorithms that most effectively select the unlabeled samples for an annotator to annotate (settles2010); few-shot learning refers to algorithms that adapt most rapidly to new classes given a few annotated support samples (vinyals2017; wang2019); self-supervised learning refers to algorithms that learn from the samples themselves without any annotation (doersch2015; doersch2017; jing2019); and semi-supervised learning refers to algorithms that effectively leverage unlabeled data while also training with supervised data (chapelle2006; laine2016; tarvainen2017; phama2021).

Deep networks provide a strong baseline for image representation in image retrieval applications (babenko2014; babenko2015), whether for visual instance retrieval or, more broadly, visual category retrieval. Image retrieval, a long tradition in the computer vision community, is the task of retrieving those images in the search corpus that depict the same visual instance as the query. Visual object category retrieval evaluates more broadly the retrieval of images containing the same visual object category as the query, allowing for much more appearance variation between the query and the correct images.

In this paper, we are interested in the retrieval of an object category unseen beforehand in the training corpus used to train the deep embedding. We term this challenging task novel class retrieval; it aims to reflect the setting of real-world applications where incoming image data contains visual content that is novel compared to the curated image corpora used for training deep networks. This setting exists in various applications. Take for example the corpus of digitized material of a cultural institution, containing a large variety of historical photographs, drawings, paintings and manuscripts of various eras. Choosing a generic embedding that supports efficient retrieval of these original materials is a difficult question: the materials can be queried by users in unanticipated ways, and understanding such new types of material without additional human annotation remains an open problem.

We focus on the question: which learning algorithm allows for the most efficient retrieval of novel classes? We devise an original methodology for comparing deep embeddings on the task of novel class retrieval: we partition an image corpus category-wise, keeping a set of base classes for training the deep embedding and leaving aside a set of novel classes for evaluating it. We thus obtain a strictly disjoint label space between the training classes and the evaluation classes. The base and novel sets are further split into a training and a test set. More importantly, in each instance of our experiments we carry out two kinds of category splits: a random split partitioning the label space at random, and a more carefully crafted semantic split that minimizes the information overlap between base and novel categories to the extent possible (details in Section 3).

Our contributions are summarised as follows: 1) we devise an evaluation methodology to evaluate a deep embedding on the task of novel class retrieval, 2) we simulate new visual content by hiding supervision from the deep network and are able to vary the degree of novelty of the novel classes, and 3) we compare several deep learning algorithms from different paradigms (supervised, self-supervised, semi-supervised) on the task of novel class retrieval on three different datasets (CIFAR10, CIFAR100, ImageNet100).

The remainder of the paper is organized as follows: the next section reviews the lines of research related to our evaluation methodology. In Section 3 we detail how we construct a benchmark relevant to novel class retrieval by masking whole categories from the training process and varying the degree of novelty. Next, we present the algorithms we implemented and the different datasets and category splits used in the experiments. Finally, we present our experimental results and analysis.

2. Related Work

Novel Class Retrieval

We aim to benchmark training algorithms for the task of retrieving samples of classes unseen during training. This task is similar to novel class discovery (han2021a; zhong2021), where the goal is to cluster the whole set of images of novel classes: instead of clustering, we evaluate with retrieval, so we do not need to know the number of novel classes or to estimate it. The goal in few-shot learning (vinyals2017; snell2017; oreshkin2019; wang2019) is to adapt most rapidly to novel classes with a small set of supervised support samples of those novel classes: in contrast, we evaluate class retrieval, that is, returning the most images of the class of a given single-image query. Metric learning (kulis2013; krause2013; bell2015; schroff2015; song2016; musgrave2020) benchmarks usually report performance on retrieval tasks of novel classes; the datasets for those benchmarks (krause2013; song2016; wah2011) also have class-disjoint training and testing sets, but do not keep a split along the image axis as we do (see Fig. 1). In the testing phase we report retrieval performance on test images of base and novel classes among both base- and novel-class images, while metric learning does not keep test samples of base classes during testing (which prevents evaluating the performance decrease relative to base classes).

Semi-Supervised Setting

It is often pointed out that labels are expensive to obtain, which is why training data is typically abundant in unlabeled samples and lacking in labeled samples. Semi-supervised learning algorithms (chapelle2006) are often claimed to work efficiently in that setting (laine2016; tarvainen2017; li2021b; phama2021), some methods showing impressive performance with a quantity of labels as low as one per class for CIFAR10 (sohn2020a), or 1% of labeled data for ImageNet (zhai2019). We investigate an overlooked setting where the labeled pool and the unlabeled pool are class-disjoint, meaning that the unlabeled data are known to come from classes different from those seen during training. The authors of (oliver2018) noted that this should be part of an effective benchmark for evaluating semi-supervised learning algorithms. They ran a set of experiments on CIFAR10 where they varied the number of classes present in both labeled and unlabeled data and showed that performance can decrease in some cases. We take this idea further in the following ways: 1) we keep the labeled and unlabeled pools completely class-disjoint, 2) we experiment on CIFAR10, CIFAR100 and ImageNet100, and 3) we vary the degree of novelty between labeled and unlabeled pools by carrying out both a random split and a semantic split of the label space.

Partitioning the label space semantically

We experiment with novel class retrieval on classification datasets: CIFAR10, CIFAR100 (krizhevsky2009) and a subset of ImageNet (russakovsky2015a). For CIFAR100 we can use the super-classes, which group together classes that are visually alike: we borrow the splits from (oreshkin2019) to devise the semantic split. For ImageNet we can use the WordNet hierarchy (fellbaum1998); we chose a subset of the high-level categories used for tieredImageNet (ren2018) (details in Section 5.1).

3. Proposed Evaluation Methodology: Varying the Degree of Novelty

3.1. Random and Semantic Class-disjoint Data Splits

Figure 1. We partition the label space into base and novel classes, holding out the set of novel classes for evaluation. Contrary to metric learning benchmarks, we also keep a train/test partition along the image axis, allowing for a comparative performance evaluation on base and novel classes on the test images. We train a varied set of algorithms with supervision only on base classes, optionally adding images of novel classes from the train partition without supervision.
Random Semantic
Table 1. Illustration of the two training schemes: We train algorithms with supervision on base classes (top row) and algorithms that incorporate images of the novel classes without supervision (bottom row). When the label space is partitioned randomly (left column airplane/cat vs. automobile/dog) the semantic gap between base and novel classes is narrower than when the label space is partitioned semantically (right column airplane/automobile vs. cat/dog).

We are given an image set partitioned into a training set and a testing set (Fig. 1). The label space is partitioned into two parts: a set of base classes and a set of novel classes. The training set is accordingly partitioned into a labeled part, comprising annotated images of base classes, and an unlabeled part, comprising unannotated images of novel classes. The test set is partitioned the same way into base-class and novel-class images.

An image x is encoded as z = f(x), where f is a neural network whose parameters are trained on the labeled base-class images and optionally on the unlabeled novel-class images. The labels of the novel-class training images are never used for training the network.

In each dataset, two kinds of experiments are carried out, involving either a random split or a semantic split. The random split divides the set of classes randomly. The semantic split divides the set of classes in a way that minimizes the information overlap between the base and the novel classes. Table 1 illustrates how the semantic split increases the degree of novelty compared to the random split with classes of CIFAR10. We evaluate the trained network via image retrieval by running a similarity search among the whole testing set and report R-Precision for queries from base classes and from novel classes separately (see Section 5 for the results).
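To make the protocol concrete, the four data quadrants of Fig. 1 can be sketched in a few lines of Python (a hypothetical helper for illustration only; the function name, the toy label list and the choice of base classes are ours, not the authors'):

```python
import random

def make_splits(labels, base_classes, test_fraction=0.2, seed=0):
    """Partition sample indices into the four quadrants of Fig. 1:
    (base vs. novel class) x (train vs. test image)."""
    rng = random.Random(seed)
    splits = {"base_train": [], "base_test": [], "novel_train": [], "novel_test": []}
    for idx, y in enumerate(labels):
        side = "base" if y in base_classes else "novel"
        part = "test" if rng.random() < test_fraction else "train"
        splits[side + "_" + part].append(idx)
    return splits

# Toy corpus with 4 classes; a random split could keep {0, 2} as base classes.
labels = [i % 4 for i in range(200)]
splits = make_splits(labels, base_classes={0, 2})
```

Only the base-class training quadrant carries labels during training; the novel-class training quadrant may be fed to the network as unlabeled data, and both test quadrants are used for the retrieval evaluation.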

3.2. Evaluation via Image Retrieval: R-Precision

We want to evaluate how well a visual representation transfers to unseen classes without any additional training, contrary to other evaluation methods that allow additional training on the novel-class labels (linear probes (doersch2015), few-shot tasks (gidaris2019), kNN classifiers (wu2018)). Evaluation via clustering requires the embeddings to be clustered and is dependent on the clustering algorithm used (musgrave2020). We have therefore chosen an evaluation metric that requires neither additional training nor the choice of a clustering algorithm.


Visual object category retrieval aims to retrieve, from a search set (the whole test set in our case), examples of the same category as a given query. A retrieval function maps a query image to a ranked list of similar images. The similarity between two images x and x' is evaluated by a dot product in the L2-normalized feature space (also known as cosine similarity):

sim(x, x') = ⟨ f(x) / ||f(x)|| , f(x') / ||f(x')|| ⟩

For an image x, the retrieval function returns the list of the K most similar images, ranked by sim in decreasing order. In the metric learning literature (song2016), image retrieval performance is usually evaluated by Recall@K (jegou2011), with small values of K. We evaluate instead with Precision@K, the proportion of correct hits among the K retrieved items, where K is set to the total number of correct hits. With the cut-off rank K equal to the size of the query's class, the precision is equal to the recall; this metric is known as R-Precision.
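A minimal sketch of this metric (our own illustrative implementation, assuming embeddings are given as NumPy arrays):

```python
import numpy as np

def r_precision(query_vec, query_label, gallery_vecs, gallery_labels):
    """R-Precision of one query: precision among the R most similar gallery
    images under cosine similarity, with R the size of the query's class."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q                                  # cosine similarities
    relevant = np.asarray(gallery_labels) == query_label
    r = int(relevant.sum())                       # cut-off rank = class size
    top_r = np.argsort(-sims)[:r]                 # indices of the r best hits
    return float(relevant[top_r].mean())          # precision == recall here

# Two well-separated classes: a class-0 query retrieves class 0 perfectly.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
score = r_precision(np.array([1.0, 0.05]), 0, gallery, [0, 0, 1, 1])
```

In the benchmark, this score would be averaged over all test queries, separately for base-class and novel-class queries.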

4. Algorithms Studied

Table 2. Supervised algorithms train a network with images of base classes only, with supervision. Unsupervised algorithms train a network with images of base classes, without supervision. Semi-supervised algorithms train a network with images of base classes with supervision and images of novel classes without supervision. At testing time all the heads are discarded; we take as visual representation the output of the last average pooling layer of the backbone.

In this section we present the algorithms we implemented and tested in our experiments. We experimented with three training schemes: supervised (four algorithms), unsupervised (two algorithms) and semi-supervised (two algorithms). The architectures of the algorithms are illustrated in Table 2. In the following we briefly introduce each algorithm.

4.1. Supervised Algorithms


Classification: Vanilla

This model optimizes the cross-entropy loss on base classes only:

L_CE = − (1 / |B|) Σ_{(x, y) ∈ B} log p_y(x)

where B denotes the set of labeled base-class training images and p_y(x) the softmax probability the classification head assigns to class y.
Triplet Loss

This model optimizes the triplet loss (schroff2015) on base classes only:

L_triplet = Σ_{(a, p, n) ∈ T} max(0, d(f(a), f(p)) − d(f(a), f(n)) + m)

where T is the set of all triplets (anchor, positive, negative) mined from the labeled base-class images, m is a margin hyperparameter and d is the squared Euclidean distance between embeddings. In practice we optimize with SGD and mine semi-hard triplets within a batch.
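The mining rule can be illustrated with a small numeric sketch (hypothetical helper functions, not the training code; plain-Python embeddings for readability):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def is_semihard(anchor, positive, negative, margin=0.2):
    """Semi-hard negative (schroff2015): farther from the anchor than the
    positive, but still inside the margin, so the hinge is active but finite:
    d(a,p) < d(a,n) < d(a,p) + margin."""
    d_ap, d_an = sq_dist(anchor, positive), sq_dist(anchor, negative)
    return d_ap < d_an < d_ap + margin
```

For example, with anchor [0, 0] and positive [0.1, 0], a negative at [0.3, 0] is semi-hard, while a distant negative at [1, 0] already satisfies the margin and contributes zero loss.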

Classification With Triplet Loss

This model jointly optimizes the cross-entropy loss and the triplet loss:

L = L_CE + L_triplet

The first head classifies into the base classes while an additional head reduces the dimension and is used to compute the triplet loss.

Supervised Contrastive: SupContrast

This model (khosla2021) is based on a siamese architecture. It extends self-supervised contrastive methods (he2019; chen2020; grill2020a) to the fully supervised setting: in addition to constructing positive pairs from random augmentations of the same image, it also uses the class labels to construct positive pairs. Similar to the triplet loss, the contrastive loss pulls positive pairs together and pushes negative pairs apart. We refer the reader to the paper (khosla2021) for details.
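The label-driven pair construction can be sketched as follows (our simplified illustration; the augmentation-view positives and the contrastive loss itself are omitted, and the function name is ours):

```python
def supcon_positives(labels):
    """Label-driven positive sets as used by SupContrast: for each anchor i
    in the batch, every other sample sharing i's label is a positive."""
    return {i: [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
            for i in range(len(labels))}

# In a batch with labels [0, 0, 1, 0], anchor 0 has positives 1 and 3.
pos = supcon_positives([0, 0, 1, 0])
```

All remaining batch samples act as negatives for the anchor, exactly as in self-supervised contrastive learning.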

4.2. Unsupervised Algorithms

Predicting Image Rotations: RotNet

This model (gidaris2018) is an instance of a self-supervised algorithm (jing2019). It rotates the input image by one of four possible rotations, and the head classifies the result into the four possible rotations. The model is trained to correctly classify the rotation applied to the input. It is a 4-class classification problem optimized with a cross-entropy loss:

L_rot = − (1 / (4 |X|)) Σ_{x ∈ X} Σ_{r ∈ R} log p_r(x_r)

where R = {0°, 90°, 180°, 270°} is the set of possible rotations and x_r denotes the input rotated by r.
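The pretext task amounts to expanding each image into four rotated views paired with rotation labels 0-3; a toy sketch on a 2D grid (our own illustration; real inputs are image tensors and the rotation is an image transform):

```python
def rotation_batch(image):
    """Build the 4-way rotation pretext task for one image: the four rotated
    copies (0/90/180/270 degrees) and their rotation-class labels 0..3.
    `image` is a 2D grid given as a list of rows."""
    def rot90(img):  # rotate a 2D grid by 90 degrees counter-clockwise
        return [list(col) for col in zip(*img)][::-1]
    views, rot_labels = [], []
    current = image
    for r in range(4):
        views.append(current)
        rot_labels.append(r)
        current = rot90(current)
    return views, rot_labels

views, rot_labels = rotation_batch([[1, 2], [3, 4]])
```

The network is then trained with a standard 4-way cross-entropy on these (view, rotation-label) pairs, requiring no class annotation at all.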


Simple Siamese Representation Learning: SimSiam

This model (chen2021c) is another instance of a self-supervised algorithm (jing2019). It is based on a siamese architecture and is trained to maximize the similarity between two randomly augmented views of an input image. Despite being simpler than related methods (he2019; chen2020; grill2020a), it shows competitive results for unsupervised visual representation learning (chen2021c). We trained our own models on base classes only. We refer the reader to the paper (chen2021c) for details.

4.3. Semi-Supervised Algorithms


FixMatch

This semi-supervised model (sohn2020a) feeds an unlabeled image through a weak augmentation and a strong augmentation: the weakly augmented input is used for pseudo-labeling and the strongly augmented input is used for computing a cross-entropy loss against the pseudo-label. Note that the classification head classifies into the base classes: the pseudo-labels are among the base classes, and the pseudo-labeled unlabeled images are classified into the base classes. The pseudo-labeling module is a threshold applied to the probability of the most confident class: only those images that are classified most confidently contribute to the cross-entropy loss. FixMatch has two main hyperparameters: the number of unlabeled images per labeled image, and the confidence threshold for pseudo-labeling. We refer the reader to (sohn2020a) for details.
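The confidence gate can be sketched as follows (illustrative only; the function name is ours, and the 0.95 default follows the threshold recommended in the FixMatch paper):

```python
def pseudo_label(probs, threshold=0.95):
    """FixMatch-style confidence gate on the weakly-augmented prediction:
    return the argmax class if its probability clears the threshold,
    else None (the image is dropped from the unsupervised loss this step)."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best if probs[best] >= threshold else None

assert pseudo_label([0.97, 0.02, 0.01]) == 0   # confident: kept
assert pseudo_label([0.50, 0.30, 0.20]) is None  # uncertain: dropped
```

In the class-disjoint setting studied here, every pseudo-label is necessarily wrong in a semantic sense, since the unlabeled images belong to none of the base classes; the gate only controls which base class the novel image gets mapped to.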

Classification With Rotation: CwRot

We propose to train a network in a semi-supervised fashion by adding an auxiliary self-supervised loss for the unsupervised novel-class samples. We append a second head to the network, tasked with predicting the rotation applied to the input image as in the RotNet model (gidaris2018). This model optimizes the loss:

L = L_CE + L_rot

where L_CE is computed on the labeled base-class images and L_rot on the unlabeled novel-class images.
The choice of this auxiliary self-supervised objective was motivated by a series of works that successfully applied it to improve few-shot classification (gidaris2019; su2019; chen2019a), robustness (hendrycks2019), pre-training for novel class discovery (han2021a), image generation (chen2019b; lucic2019) and semi-supervised learning (zhai2019). It is in fact a simplified version of the method proposed in (zhai2019), with the consistency regularization and pseudo-labeling components of the training scheme removed.
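Numerically, the combined objective just sums two cross-entropies over different pools; a self-contained sketch (our own simplified illustration with plain-Python logits; the real model shares one backbone with two heads):

```python
import math

def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one sample from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def cwrot_loss(base_logits, base_labels, rot_logits, rot_labels):
    """CwRot objective sketch: supervised base-class cross-entropy on labeled
    base images plus rotation cross-entropy on unlabeled novel images."""
    sup = sum(cross_entropy(l, y) for l, y in zip(base_logits, base_labels)) / len(base_labels)
    rot = sum(cross_entropy(l, y) for l, y in zip(rot_logits, rot_labels)) / len(rot_labels)
    return sup + rot
```

Because the rotation head needs no class labels, the novel-class images contribute gradient signal to the shared backbone without their labels ever being used.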

5. Experimental Results

5.1. Datasets

We experiment on CIFAR10, CIFAR100 (krizhevsky2009) and a subset of ImageNet (russakovsky2015a). For each dataset, we split the label space randomly and semantically. The splits are done once and for all.


CIFAR10

For the random split, we keep half of the classes at random as base classes and the remaining classes as novel classes. In the resulting split, vehicles and animals both ended up among the base classes, and vehicles and animals both ended up among the novel classes. For the semantic split, we also partitioned the label space evenly into base classes and novel classes; we chose a partitioning in which the base classes contain vehicles as well as animals, while the novel classes consist only of animals and no vehicle.


CIFAR100

CIFAR100 contains 100 classes that can be grouped into 20 super-classes. For the random split, we keep a random subset of the classes as base classes and the remaining classes as novel classes. Note that for the random split, the super-classes are spread randomly among the base and the novel classes. For example, the people super-class has 3 classes among the base classes (baby, girl, man) and 2 classes among the novel classes (boy, woman). For the semantic split, we follow the splits of Few-Shot CIFAR100 introduced in (oreshkin2019): we set the base classes to the train classes and the novel classes to the val and test classes. Note that for the semantic split, each super-class lies entirely either in the base classes or in the novel classes. For example, the people super-class has all its classes among the novel classes.


ImageNet100

We experiment on a subset of 100 classes of ImageNet (russakovsky2015a), which we call ImageNet100 in the following. We keep 16 of the 34 high-level categories devised by the authors of tieredImageNet (ren2018) using the WordNet hierarchy: 8 categories descending from the artefact synset (motor vehicle, craft, durables, garment, musical instrument, game equipment, furnishing, tool) and 8 categories descending from the animal synset (ungulate, primate, feline, working dog, saurian, aquatic bird, insect, aquatic vertebrate), with 6-7 ImageNet classes per high-level category. For the random split, we keep 3-4 classes of each high-level category at random as base classes so as to obtain 50 base classes; the 50 remaining classes are the novel classes. For the semantic split, we keep the artefact classes as base classes and the animal classes as novel classes. Unlike tieredImageNet, we keep all training images for training and use the validation images for testing; like tieredImageNet, we resize the images to 84x84 resolution.

5.2. Implementation details

All networks share the same ResNet18 (he2016a) backbone. On CIFAR10 and CIFAR100, the input images are 32x32 in resolution; we use a first convolutional layer with stride 1, not followed by a max pooling layer. For ImageNet100, the images are stored on disk at 84x84 resolution and are resized to 224x224 before being fed to the network: in this case we used the standard ResNet18 architecture, where the first layers reduce the spatial dimensions.

We reimplemented Vanilla, Triplet, RotNet, CwT and CwRot from scratch. For Vanilla, Triplet, CwT and CwRot, the networks are optimized with SGD with momentum and weight decay, with an initial learning rate dropped by a constant factor at regular epoch intervals. For RotNet, we followed the guidelines from the paper (gidaris2018). For Triplet and CwT, in all cases we use a fixed margin, mine semi-hard (schroff2015) negative samples, and use a fixed embedding dimension.

For the other three algorithms (SimSiam, SupContrast, FixMatch) we used readily available PyTorch implementations and made some modifications. We adapted each codebase by modifying the backbone network to match the ResNet18 used for the other baseline algorithms. For SupContrast on ImageNet100 we reduced the number of epochs to 200, decaying the initial learning rate by 10 at epochs 150, 170 and 190. For FixMatch on ImageNet100, we trained for a fixed number of iterations with a fixed number of unlabeled images per labeled image. We kept the recommended hyperparameters in every other case. We would like to emphasize that we did not optimize the hyperparameters for the best performance, but instead used sensible defaults, following the recommendations of ResNet (he2016a) and of the original paper of each algorithm.

5.3. Results and analysis

Algo CIFAR10-Random CIFAR10-Semantic
Base Novel Base Novel
Vanilla 65.147 26.380 68.469 19.758
Triplet 71.479 20.512 74.890 16.760
CwT 72.336 20.391 74.389 15.438
SupContrast 28.954 22.726
RotNet 28.476 17.379 27.663 16.732
SimSiam 19.340 19.048 20.518 17.298
CwRot 72.816 37.832 77.680
FixMatch 57.807 79.178 27.419
Table 3. R-Precision on base and novel classes of CIFAR10 for a random split (left columns) and a semantic split (right columns). Top rows are supervised algorithms. Middle rows are unsupervised algorithms. Bottom rows are semi-supervised algorithms. Best R-Precision is marked in bold.
Algo CIFAR100-Random CIFAR100-Semantic
Base Novel Base Novel
Vanilla 33.339 15.351 39.778 7.891
Triplet 33.895 10.015 43.745 3.846
CwT 10.180 4.131
SupContrast 42.981 14.094 49.000 6.678
RotNet 5.293 3.906 6.149 2.392
SimSiam 5.287 4.521 6.085 3.863
CwRot 34.257 17.525 39.962
FixMatch 38.771 47.905 9.444
Table 4. R-Precision on base and novel classes of CIFAR100 for a random split (left columns) and a semantic split (right columns). Top rows are supervised algorithms. Middle rows are unsupervised algorithms. Bottom rows are semi-supervised algorithms. Best R-Precision is marked in bold.
Algo ImageNet100-Random ImageNet100-Semantic
Base Novel Base Novel
Vanilla 35.202 18.351 35.388 9.931
Triplet 36.680 12.018 27.138 3.905
CwT 14.090 4.813
SupContrast 33.495 15.241 34.667 7.738
SimSiam 7.984 7.867 8.359 6.306
RotNet 6.789 7.194 8.469 5.194
CwRot 37.287 36.373
FixMatch 38.090 19.322 35.230 11.050
Table 5. R-Precision on base and novel classes of ImageNet100 for a random split (left columns) and a semantic split (right columns). Top rows are supervised algorithms. Middle rows are unsupervised algorithms. Bottom rows are semi-supervised algorithms. Best R-Precision is marked in bold.

Retrieval performance is reported in Table 3 (CIFAR10), Table 4 (CIFAR100) and Table 5 (ImageNet100). Read horizontally, we can compare the performance between base and novel classes: in all cases the performance is degraded on novel classes, even more so for the semantic split. Read vertically, we can compare the performance between the algorithms.

Supervised algorithms

The metric learning algorithms (Triplet and CwT) fit the base classes better at the cost of poorer performance on the novel classes compared to Vanilla. The benefit of Triplet over Vanilla on base classes is most noteworthy on CIFAR10 and slight on CIFAR100 and ImageNet100, indicating that the simpler and faster Vanilla algorithm is a better choice on more fine-grained datasets. The contrastive model SupContrast performs the best among the supervised baselines on base classes; on novel classes it also performs the best on CIFAR10, but this does not hold on CIFAR100 and ImageNet100.

Unsupervised algorithms

RotNet and SimSiam do not show strong results, even on the base classes they were trained on, and we cannot draw a consistent conclusion as to which algorithm is best in general. However, we note that for the semantic split, RotNet and SimSiam sometimes outperform Triplet and CwT on novel classes. For example, on ImageNet100-Semantic it is actually better to use an unsupervised method than a metric learning method.

Semi-Supervised algorithms

The semi-supervised models mitigate the performance difference between base and novel classes, and they surpass the supervised baselines by a greater margin on the semantic split than on the random split. We note that CwRot performs consistently better than Vanilla on all datasets, on both base and novel classes.

As for FixMatch, it is consistently better than Vanilla on novel classes. It compares more favorably to CwRot on novel classes for the random split than for the semantic split. We argue that this is because, during training, it learns to classify novel-class images into the set of base classes; since the novel classes are less visually similar to the base classes for the semantic split than for the random split, it leverages shared visual patterns between base and novel classes less efficiently. This comparison between CwRot and FixMatch on the semantic split is evidence that semi-supervised methods such as FixMatch, which rely on labeled and unlabeled images coming from the same set of classes, are less efficient in the setting where an entire set of images from novel classes comes into play.

5.4. T-SNE Embedding Visualizations

CIFAR10-Random CIFAR10-Semantic
Base Novel Base Novel
Table 6. T-SNE visualizations of the embeddings of base and novel classes of CIFAR10 for the random split (left columns) and for the semantic split (right columns). The semi-supervised algorithms (bottom rows) better separate the novel classes than the supervised algorithms (top rows). The colors encode class membership. Best viewed in color.

We show in Table 6 T-SNE visualizations of the CIFAR10 embeddings for four algorithms. The visualizations support the conclusions of the previous section. In particular, only the semi-supervised methods succeed in structuring the novel classes in the case of the semantic split (with CwRot providing a better class separation than FixMatch).

6. Conclusion

We presented a method to evaluate novel class retrieval. We argued that existing benchmarks for semi-supervised representation learning algorithms lack a setting where unlabeled data are from novel classes.

We experimented with a variety of representation learning algorithms and showed evidence that semi-supervised learning algorithms mitigate the performance drop on novel classes. Yet there is still much room for improvement on the novel class retrieval task: semi-supervised algorithms still fall short in retrieving images of novel classes in the random split setting (low degree of novelty), and even more so in the semantic split setting (higher degree of novelty).