Boosting Supervision with Self-Supervision for Few-shot Learning

06/17/2019 ∙ Jong-Chyi Su et al. ∙ University of Massachusetts Amherst and Cornell University

We present a technique to improve the transferability of deep representations learned on small labeled datasets by introducing self-supervised tasks as auxiliary loss functions. While recent approaches for self-supervised learning have shown the benefits of training on large unlabeled datasets, we find improvements in generalization even on small datasets and when combined with strong supervision. Learning representations with self-supervised losses reduces the relative error rate of a state-of-the-art meta-learner by 5-25% on several few-shot learning benchmarks, as well as that of off-the-shelf deep networks on standard classification tasks when trained from scratch. We find that the benefits of self-supervision increase with the difficulty of the task. Our approach utilizes the images within the dataset to construct self-supervised losses and hence is an effective way of learning transferable representations without relying on any external training data.


1 Introduction

We humans can quickly learn new concepts from limited training data, but current machine learning algorithms cannot. We are able to do this by relying on our past “visual experience”. Recent work attempts to emulate this “visual experience” by training a feature representation to classify a training dataset well, with the hope that the resulting representation generalizes not just to unseen examples of the same classes but to novel classes as well. This has indeed been the case when “deep representations” learned on massive image classification datasets are applied to novel image classification tasks on related domains. This intuition also underlies recent work on few-shot learning or meta-learning. This approach has two related shortcomings: because we are training the feature representation to classify a training dataset, it may end up discarding information that might be useful for classifying unseen examples or novel classes; meanwhile, the images of these base classes themselves contain a wealth of useful information about the domain that is discarded.

Learning from images alone without any class labels falls under the umbrella of self-supervised learning. The key idea is to learn about statistical regularities (e.g., likely spatial relationship between patches, likely orientations between images) that might be a cue to semantics. However, these techniques require enormous amounts of data to match fully-supervised approaches trained on large datasets. It is still unclear whether self-supervised learning can help in the low data regime. In particular, can these techniques help prevent overfitting to training datasets and improve performance on new examples and novel classes?

We answer this question in the affirmative. We experiment both with standard supervised training on small labeled datasets and with few-shot transfer to novel classes. We show that, with no additional training data, adding a self-supervised task as an auxiliary task significantly improves accuracy in both settings and across datasets. In particular, adding self-supervised losses leads to a 5-25% relative reduction in the classification error rate of a state-of-the-art meta-learner snell2017prototypical on several image classification datasets (see Section 4). We observe that the benefits are greater when the task is more challenging, such as classification among more classes, or from greyscale or low-resolution images. On standard classification too, we find that off-the-shelf networks trained from scratch on these datasets have lower error rates when trained with self-supervised losses. We conclude that self-supervision as an auxiliary task is beneficial for learning generalizable feature representations.

2 Related work

Our work is related to a number of recent approaches for learning robust visual representations, specifically few-shot learning and self-supervised learning. Most self-supervised learning works investigate if the features from pre-training on a large unlabeled dataset can be transferred, while our work focuses on the benefit of using self-supervision as an auxiliary task in the low training data regime.

Few-shot learning

Few-shot learning aims to learn representations that generalize well to novel classes for which only a few labeled images are available. To this end, many meta-learning methods have been proposed that simulate the learning and evaluation of representations using a base learner by sampling many few-shot tasks. These include optimization-based base learners (e.g., MAML finn2017model , gradient unrolling ravi2016optimization , closed-form solvers bertinetto2018meta , convex learners lee2019meta , etc.). Others such as matching networks vinyals2016matching use a nearest neighbor classifier, prototypical networks (ProtoNet) snell2017prototypical use a nearest class mean classifier mensink2013distance , and the approach of Garcia and Bruna garcia2017few uses a graph neural network to model higher-order relationships in the data. Another class of techniques (e.g., gidaris2018dynamic ; qi2018low ; qiao2018few ) models the mapping between training examples and classifier weights using a feed-forward network. On standard benchmarks for few-shot learning (e.g., miniImageNet and CIFAR) these techniques have shown better generalization than simply training a network for classification on the base classes.

While the literature is vast and growing, a recent study chen2019closerfewshot has shown that meta-learning is less effective when domain shifts between training and novel classes are larger, and that the differences between meta-learning approaches are diminished when deeper network architectures are used. We build our experiments on top of this work and show that auxiliary self-supervised tasks provide additional benefits on realistic datasets and state-of-the-art deep networks (Section 4).

Self-supervised learning

Human labels are expensive to collect and hard to scale up. To this end, there has been increasing research interest to investigate learning representations from unlabeled data. In particular, the image itself already contains structural information which can be utilized. Self-supervised learning approaches attempt to capture this.

There is a rich line of work on self-supervised learning. One class of methods removes part of the visual data (e.g., color information) and tasks the network with predicting what has been removed from the rest (e.g., greyscale images) in a discriminative manner larsson2016learning ; zhang2016colorful ; zhang2017split . Doersch et al. proposed to predict the relative position of patches cropped from images doersch2015unsupervised . Wang et al. used the similarity of patches obtained from tracking as a self-supervised task wang2015unsupervised and later combined position and similarity predictions of patches wang2017transitive . Other tasks include predicting noise bojanowski2017unsupervised , clusters caron2018deep ; wu2018unsupervised , counts noroozi2017representation , missing patches (i.e., inpainting) pathak2016context , motion segmentation labels pathak2017learning , etc. Doersch et al. doersch2017multi proposed to jointly train four different self-supervised tasks and found it to be beneficial. Recent works goyal2019scaling ; kolesnikov2019revisiting have compared various self-supervised learning tasks at scale and concluded that solving jigsaw puzzles and predicting image rotations are among the most effective, motivating the choice of tasks in our experiments.

The focus of most prior works on self-supervised learning is to supplant traditional fully supervised representation learning with unsupervised learning on large unlabeled datasets for downstream tasks. In contrast, our work focuses on few-shot classification setting and shows that self-supervision helps in the low training data regime without relying on any external dataset.

Multi-task learning

Our work is related to multi-task learning, a class of techniques that train on multiple task objectives together to improve each one. Previous works in the computer vision literature have shown moderate benefits by combining tasks such as edge, normal, and saliency estimation for images, or part segmentation and detection for humans kokkinos2017ubernet ; maninis2019attentive ; ren2018cross . Unfortunately, in many cases some tasks can be detrimental to others, resulting in reduced performance. Moreover, acquiring additional task labels is expensive. In contrast, self-supervised tasks require no additional labeling effort, and we find that they often improve the generalization of the learned representations.

3 Method

Our framework augments standard supervised losses with those derived from self-supervised tasks to improve representation learning, as seen in Figure 1. We are given a training dataset $\mathcal{D} = \{(x_i, y_i)\}$ consisting of pairs of images $x_i$ and labels $y_i$. A feed-forward CNN $f(x)$ maps the input to an embedding space, which is then mapped to the label space using a classifier $g$. The overall prediction function from the input to the labels can be written as $g \circ f(x)$. Learning consists of estimating the functions $f$ and $g$ that minimize a loss $\ell$ on the predicted labels on the training data, along with suitable regularization $\mathcal{R}$ over $f$ and $g$, and can be written as:

$$\mathcal{L}_s := \sum_i \ell\big(g \circ f(x_i),\, y_i\big) + \mathcal{R}(f, g).$$

For example, for classification tasks a commonly used loss $\ell$ is the cross-entropy (softmax) loss over the labels, and a suitable regularizer is the L2 norm of the parameters of the functions. For transfer learning, $g$ is discarded and relearned on training data for a new task. The combination $g \circ f$ can be further fine-tuned if necessary.

In addition to the supervised loss, we consider self-supervised losses based on labeled pairs $(\hat{x}, \hat{y})$ that can be systematically derived from the inputs $x$ alone. Figure 1 shows two examples: the jigsaw task rearranges the input image and uses the index of the permutation as the target label, while the rotation task uses the angle of the rotated image as the target label. A separate function $h$ is used to predict these labels from the shared feature backbone $f$, and the self-supervised loss can be written as:

$$\mathcal{L}_{ss} := \sum_i \ell\big(h \circ f(\hat{x}_i),\, \hat{y}_i\big) + \mathcal{R}(f, h).$$

Our final loss function combines the two losses:

$$\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss},$$

and thus the self-supervised losses act as a data-dependent regularizer for representation learning. Below we describe the various losses we consider for image classification tasks.

Figure 1: Overview of our approach. Our framework combines supervised losses derived from class labels with self-supervised losses derived from the images in the dataset to learn a feature representation. Since the feature backbone is shared, the self-supervised tasks act as an additional regularizer. In our experiments, the features are learned either in a meta-learning framework for few-shot transfer or for standard classification (shown above), with jigsaw puzzle or rotation tasks as self-supervision.

3.1 Supervised losses ($\mathcal{L}_s$)

We consider two supervised loss functions depending on whether the representation learning is aimed at standard classification where training and test tasks are identical, or transfer where the test task is different from training.

Standard classification loss. For standard classification we assume that the test set consists of novel images from the same distribution as the training set. We use the standard cross-entropy (softmax) loss computed between the posterior label distribution predicted by the model and the target label.

Few-shot transfer task loss. Here we assume abundant labeled data on base classes and limited labeled data on novel classes (in the terminology of hariharan2017low ). We use losses commonly used in meta-learning frameworks for few-shot learning snell2017prototypical ; sung2018learning ; vinyals2016matching . In particular, we base our approach on prototypical networks snell2017prototypical , which perform episodic training and testing over sampled datasets in stages called meta-training and meta-testing. During the meta-training stage, we randomly sample N classes from the base set, then select a support set with K images per class and a query set with additional images from the same classes. We call this an N-way K-shot classification task. The embeddings are trained to predict the labels of the query set conditioned on the support set using a nearest class mean (prototype) model. The objective is to minimize the prediction loss on the query set. Once training is complete, class prototypes are recomputed on the novel dataset and query examples are classified based on their distances to the class prototypes. The model is related to distance-based learners such as matching networks vinyals2016matching and metric learning based on label similarity koch2015siamese .
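
For concreteness, the episode loss of this nearest-prototype model can be sketched as follows. This is a minimal PyTorch sketch, assuming the embeddings have already been computed by the shared backbone; the function and argument names are ours, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support_emb, support_labels, query_emb, query_labels, n_way):
    """Episode loss for a nearest class mean (prototype) model.

    support_emb: (n_way * k_shot, d) embeddings of the support images
    query_emb:   (n_query, d) embeddings of the query images
    Labels are integers in [0, n_way).
    """
    # Prototype of each class = mean embedding of its support examples.
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                                  # (n_way, d)

    # Queries are classified by squared Euclidean distance to the prototypes.
    dists = torch.cdist(query_emb, prototypes) ** 2    # (n_query, n_way)
    log_probs = F.log_softmax(-dists, dim=1)
    return F.nll_loss(log_probs, query_labels)
```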

3.2 Self-supervised losses ($\mathcal{L}_{ss}$)

We consider two losses motivated by a recent large-scale comparison of the effectiveness of various self-supervised learning tasks goyal2019scaling .

Jigsaw puzzle task loss. In this case the input image is tiled into a 3×3 grid of regions and the tiles are permuted randomly to obtain an input $\hat{x}$. The target label $\hat{y}$ is the index of the permutation. The index (one of 9! possibilities) is reduced to one of 35 following the procedure outlined in noroozi2016unsupervised , which groups the possible permutations based on their Hamming distance as a way to control the difficulty of the task.
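
A sketch of how a jigsaw training example might be constructed is given below. The permutation set is a random stand-in to keep the sketch self-contained, whereas the paper follows noroozi2016unsupervised and selects the 35 reference permutations by Hamming distance; the helper name is ours.

```python
import random
import torch

# Stand-in for the 35 reference permutations of the nine tiles; the paper
# selects them by maximal Hamming distance (noroozi2016unsupervised), random
# permutations are used here only to make the sketch runnable on its own.
random.seed(0)
PERMUTATIONS = [random.sample(range(9), 9) for _ in range(35)]

def make_jigsaw_example(patches):
    """patches: tensor of shape (9, C, h, w), the 3x3 tiles in raster order.
    Returns the shuffled tiles and the permutation index used as the target label."""
    label = random.randrange(len(PERMUTATIONS))
    order = torch.tensor(PERMUTATIONS[label])
    return patches[order], label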

Rotation task loss. In this case the input image is rotated by an angle $\theta \in \{0°, 90°, 180°, 270°\}$ to obtain $\hat{x}$, and the target label $\hat{y}$ is the index of the angle.
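
A corresponding sketch for constructing a rotation example (helper name is ours):

```python
import random
import torch

def make_rotation_example(image):
    """image: tensor of shape (C, H, W).
    Applies one of {0, 90, 180, 270} degree rotations; the index of the
    chosen angle is the self-supervised target label."""
    label = random.randrange(4)                        # 0..3 -> 0/90/180/270 degrees
    rotated = torch.rot90(image, k=label, dims=(1, 2))
    return rotated, label
```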

In both cases we use the cross-entropy (softmax) loss between the target and prediction.

3.3 Stochastic sampling and training

We use the following strategy for combining the two losses during stochastic training. The same batch of images sampled for the supervised loss (Section 3.1) is used for the self-supervised task. After the two forward passes, one for the supervised task and one for the self-supervised task, the two losses are combined for computing gradients using back-propagation. Several recent works have proposed ways to stabilize multi-task training, e.g., balancing the losses by gradient normalization chen2017gradnorm , by uncertainty estimates kendall2018multi , or by multi-objective optimization sener2018multi . However, simply combining the two losses with equal weights performed well, and we use this protocol for all our experiments.
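
A minimal sketch of one training step under this equal-weighting scheme is shown below; `backbone`, `classifier`, `ss_head`, and `make_ss_batch` are hypothetical stand-ins for the shared feature extractor $f$, the supervised head $g$, the self-supervised head $h$, and the jigsaw/rotation batch construction.

```python
import torch
import torch.nn.functional as F

def training_step(backbone, classifier, ss_head, optimizer, images, labels, make_ss_batch):
    """One step of joint training: supervised + self-supervised loss with equal weights."""
    # Forward pass 1: supervised loss on the labeled batch.
    logits = classifier(backbone(images))
    loss_sup = F.cross_entropy(logits, labels)

    # Forward pass 2: self-supervised loss on the same images.
    ss_inputs, ss_labels = make_ss_batch(images)       # e.g. rotated images + angle indices
    ss_logits = ss_head(backbone(ss_inputs))
    loss_ss = F.cross_entropy(ss_logits, ss_labels)

    # Equal weighting of the two losses.
    loss = loss_sup + loss_ss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_sup.item(), loss_ss.item()
```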

4 Experiments

We present results for representation learning for few-shot transfer learning (Section 4.1) and standard classification tasks (Section 4.2). We begin by describing the datasets and details of our experiments.

Datasets

We select a diverse set of image classification datasets for our experiments: Caltech-UCSD birds cub , Stanford dogs dogs , Oxford flowers flowers , Stanford cars cars , and FGVC aircrafts aircrafts . Each of these datasets contains between 100 and 200 classes with a few thousand images, as seen in Table 1. For few-shot learning we split the classes into three disjoint sets: base, validation, and novel; for each class, all of its images in the dataset are assigned to the corresponding set. The model is trained on the base set, the validation set is used to select the model with the best performance, and the model is then tested on the novel set given a few examples per class. For birds, we use the same split as chen2019closerfewshot , where the {base, val, novel} sets have {100, 50, 50} classes respectively. For the other datasets we use the same ratio for splitting the classes. For standard classification, we follow the original train and test split of each dataset.

Setting Set Stats Birds Cars Aircrafts Dogs Flowers
Few-shot transfer Base # class 100 98 50 60 51
# images 5885 8023 5000 10337 4129
Val # class 50 49 25 30 26
# images 2950 4059 2500 5128 2113
Novel # class 50 49 25 30 25
# images 2953 4103 2500 5115 1947
Standard classification Train # class 200 196 100 120 102
# images 5994 8144 6667 12000 2040
Test # class 200 196 100 120 102
# images 5794 8041 3333 8580 6149
Table 1: Example images and dataset statistics. For few-shot experiments (top rows) the classes are split into base, val, and novel sets. Image representations learned on the base set are evaluated on the novel set, while the val set is used for cross-validation. For standard classification experiments we use the standard training and test splits provided with the datasets. These datasets vary in the number of classes but are orders of magnitude smaller than the ImageNet dataset.

Experimental details on few-shot learning experiments

We follow the best practices and codebase for few-shot learning laid out in Chen et al. chen2019closerfewshot (https://sites.google.com/view/a-closer-look-at-few-shot/). In particular, we use a ProtoNet snell2017prototypical with ResNet18 he2016deep as the feature backbone. Following chen2019closerfewshot we use 5-way (classes), 5-shot (examples per class) episodes with 16 query images for training. The models are trained with the ADAM kingma2014adam optimizer with a learning rate of 0.001 for 40,000 episodes. Once training is complete (based on validation error), we report the mean accuracy and 95% confidence interval over 600 test experiments. In each test experiment, N classes are selected from the novel set, and for each class 5 support images and 16 query images are used. We report results for testing with both 5-way and 20-way classification. During training, especially for the jigsaw puzzle task, we found it beneficial to not track the running mean and variance of the batch normalization layers, and instead estimate them for each batch independently. We hypothesize that this is because the inputs contain both full-sized images and small patches, which might have different statistics. At test time we do the same.
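
In PyTorch, one way to realize this per-batch normalization is to disable running-statistics tracking in every batch-normalization layer of the backbone. The sketch below assumes a torchvision ResNet18; the helper name is ours.

```python
import torch.nn as nn
from torchvision.models import resnet18

def use_batch_statistics_only(model):
    """Force every BatchNorm layer to normalize with the current batch's mean
    and variance (at both train and test time) rather than tracked running
    estimates, by removing the running buffers."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False
            m.running_mean = None
            m.running_var = None
    return model

backbone = use_batch_statistics_only(resnet18())
```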

Experimental details on standard classification experiments

For standard classification we once again train a ResNet18 network from scratch. All models are trained with the ADAM optimizer with a learning rate of 0.001 for 400 epochs and a batch size of 16. These parameters were found to be nearly optimal on the birds dataset and we keep them fixed for all other datasets. We track the running statistics of the batch normalization layers for the softmax baselines following the conventional setting, i.e., without the self-supervised loss, but do not track these statistics when training with self-supervision, for the reasons discussed above.

Architectures for self-supervised tasks

The ResNet18 produces a 512-dimensional feature vector for each input, and we add a fully-connected (fc) layer with 512 units on top of this. For the jigsaw puzzle task, the nine patches of an input yield nine 512-dimensional feature vectors, which are concatenated. This is followed by a fc layer projecting the feature vector from 4608 to 4096 dimensions, and a fc layer with a 35-dimensional output corresponding to the 35 permutations of the jigsaw task.

For the rotation task, the 512-dimensional output of the ResNet18 is passed through fc layers with {512, 128, 128, 4} units, where the final predictions correspond to the four rotation angles. Between each pair of fc layers, a ReLU activation (i.e., max(0, x)) and a dropout layer with dropout probability of 0.5 are added.
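
The two heads, as described, might look like the following sketch. The layer sizes follow the text; the placement of ReLU/dropout around the jigsaw fc layers is an assumption, since the text only specifies them for the rotation head.

```python
import torch
import torch.nn as nn

class JigsawHead(nn.Module):
    """Predicts which of the 35 reference permutations was applied."""
    def __init__(self):
        super().__init__()
        self.patch_fc = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5))
        self.head = nn.Sequential(
            nn.Linear(9 * 512, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 35),
        )

    def forward(self, patch_features):        # (batch, 9, 512) backbone features
        z = self.patch_fc(patch_features)     # per-patch fc layer
        return self.head(z.flatten(1))        # concatenate the 9 patches -> 35 logits

class RotationHead(nn.Module):
    """Predicts which of the four rotation angles was applied."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 4),
        )

    def forward(self, features):              # (batch, 512) backbone features
        return self.head(features)
```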

Experimental details on image sampling and data augmentation

Experiments in chen2019closerfewshot found that data augmentation has a large impact on generalization performance for few-shot learning. We follow their procedure for all our experiments, described as follows. For classification, images are first resized so that the shorter edge is 224 pixels while maintaining the aspect ratio, from which a central crop of 224×224 is obtained. For predicting rotations, we use the same cropping method and then rotate the images. For jigsaw puzzles, we first randomly crop a 255×255 region from the original image with random scaling, then split it into a 3×3 grid, from which a random crop of size 64×64 is picked from each region. We use PyTorch paszke2017automatic for our experiments.
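
A sketch of this preprocessing with torchvision transforms is given below; the random scaling range for the jigsaw crop is not specified above and is left as an argument, and the helper names are ours.

```python
import torchvision.transforms as T

# Standard classification and rotation inputs: resize the shorter edge to 224,
# then take a central 224x224 crop.
base_transform = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])

def jigsaw_patches(pil_image, scale_range):
    """Randomly crop a 255x255 region (with random scaling over `scale_range`,
    left unspecified here), split it into a 3x3 grid, and take a random
    64x64 crop from each of the nine tiles."""
    region = T.RandomResizedCrop(255, scale=scale_range)(pil_image)
    to_patch = T.Compose([T.RandomCrop(64), T.ToTensor()])
    tile = 255 // 3                                    # 85x85 tiles
    return [to_patch(region.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)))
            for r in range(3) for c in range(3)]
```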

4.1 Results on few-shot learning

We first present results on few-shot learning where the models are trained on base set then transferred and tested on a novel set. Table 2 shows the accuracies of various models for 5-way 5-shot classification (top half), and 20-way 5-shot classification (bottom half). Our baseline ProtoNet (first row in each half) reproduces the results for the birds dataset presented in chen2019closerfewshot (in their Table A5). This was reported to be the top-performing method on this dataset and others.

On 5-way classification, using jigsaw puzzles as an auxiliary task gives the best performance, improving the ProtoNet baseline on all five datasets. Specifically, it reduces the relative error rate by 19.6%, 8.0%, 4.6%, 15.9%, and 27.8% on birds, cars, aircrafts, dogs, and flowers datasets respectively. Predicting rotations also improves the ProtoNet baseline on birds, cars, and dogs. We further test the models on the 20-way classification task, as shown in the bottom half of the table. The relative improvements using self-supervision are greater in this setting.

We also test the accuracy of the model trained only with self-supervision. Compared to the randomly initialized model (“None” rows, Table 2), training the network to predict rotations gives around 2% to 7% improvements on all five datasets, while solving jigsaw puzzles only improves on aircrafts and flowers.

To see if self-supervision is beneficial for harder classification tasks, we conduct experiments on degraded images. For cars and aircrafts, we use low-resolution images where the images are down-sampled by a factor of four and up-sampled back to 224×224 with bilinear interpolation. For the natural categories we discard color and perform experiments with greyscale images. Low-resolution images are considerably harder to classify for man-made categories, while color information is most useful for natural categories (see su2017adapting for a discussion of how color and resolution affect fine-grained classification results). The results for this setting are shown in Table 3. On the birds and dogs datasets, the improvements using self-supervision are higher compared to color images. Similarly, on the cars and aircrafts datasets with low-resolution images, the improvement goes up from 0.7% to 2.2% and from 0.4% to 2.0% respectively.

Our initial experiments on miniImageNet showed small improvements, from 75.2% for the ProtoNet baseline to 75.9% when training with the jigsaw puzzle loss. We found that, perhaps owing to the low-resolution images in this dataset, the jigsaw puzzle task was significantly harder to train (we observed low training accuracy for the self-supervised task). Perhaps other forms of self-supervision might be beneficial here.

Loss Birds Cars Aircrafts Dogs Flowers
5-way 5-shot
ProtoNet 87.29±0.48 91.69±0.43 91.39±0.40 82.95±0.55 89.19±0.56
ProtoNet + Jigsaw 89.80±0.42 92.43±0.41 91.81±0.38 85.66±0.49 92.16±0.43
Jigsaw 25.73±0.48 25.26±0.46 38.79±0.61 24.27±0.45 50.53±0.73
ProtoNet + Rotation 89.39±0.44 92.32±0.41 91.38±0.40 84.25±0.53 88.99±0.52
Rotation 33.09±0.63 29.37±0.53 29.54±0.54 27.25±0.50 49.44±0.69
None 26.71±0.48 25.24±0.46 28.10±0.47 25.33±0.47 42.28±0.75
20-way 5-shot
ProtoNet 69.31±0.30 78.65±0.32 78.58±0.25 61.62±0.31 75.41±0.28
ProtoNet + Jigsaw 73.69±0.29 79.12±0.27 79.06±0.23 65.44±0.29 79.16±0.26
Jigsaw 8.14±0.14 7.09±0.12 15.44±0.19 7.11±0.12 25.72±0.24
ProtoNet + Rotation 72.85±0.31 80.01±0.27 78.38±0.23 63.41±0.30 73.87±0.28
Rotation 12.85±0.19 9.32±0.15 9.78±0.15 8.75±0.14 26.32±0.24
None 9.29±0.15 7.54±0.13 8.94±0.14 7.77±0.13 22.55±0.23
Table 2: Performance on the few-shot transfer task. The mean accuracy (%) and the 95% confidence interval over 600 randomly chosen test experiments are reported for various combinations of loss functions. The top part shows the accuracy on 5-way 5-shot classification tasks, while the bottom part shows the same on 20-way 5-shot. Adding the jigsaw puzzle loss as self-supervision to the ProtoNet loss gives the best performance across all five datasets on the 5-way results. On 20-way classification, the improvements are even larger. The rows labelled “None” indicate results with a randomly initialized network.
Loss Birds Cars Aircrafts Dogs Flowers
Greyscale Low-resolution Low-resolution Greyscale Greyscale
5-way 5-shot
ProtoNet 82.24±0.59 84.75±0.52 85.03±0.53 80.66±0.59 86.08±0.55
ProtoNet + Jigsaw 85.40±0.55 86.96±0.52 87.07±0.47 83.63±0.50 87.59±0.54
20-way 5-shot
ProtoNet 60.76±0.35 64.72±0.33 64.10±0.27 57.35±0.29 69.69±0.26
ProtoNet + Jigsaw 65.73±0.33 68.64±0.33 68.28±0.27 61.16±0.29 71.64±0.27
Table 3: Performance on the few-shot transfer task with degraded inputs. Accuracies are reported on the novel set for 5-way 5-shot and 20-way 5-shot classification with degraded versions of the standard datasets, with and without the jigsaw puzzle loss. The loss of color or resolution makes the task more challenging, as seen by the drop in the performance of the baseline ProtoNet. However, the improvements from the jigsaw puzzle loss are larger in comparison to the results presented in Table 2.

4.2 Results on fine-grained classification

Next, we present results on the standard classification task. Here we investigate whether the self-supervised tasks can help representation learning for deep networks (a ResNet18) trained from scratch using only the images and labels in the training set. Table 4 shows the accuracy using various combinations of loss functions. Training with self-supervision improves accuracy on all the datasets we tested. On birds, cars, and dogs, predicting rotations gives 4.1%, 3.1%, and 3.0% improvements, while on aircrafts and flowers, the jigsaw puzzle loss gives boosts of 0.9% and 3.6% respectively.

Loss Birds Cars Aircrafts Dogs Flowers
Softmax 47.0 72.6 69.9 51.4 72.8
Softmax + Jigsaw 49.2 73.2 70.8 53.5 76.4
Softmax + Rotation 51.1 75.7 70.0 54.4 73.5
Table 4: Performance on the standard classification task. Per-image accuracy (%) on the test set is reported. Using self-supervision improves the accuracy of a ResNet18 network trained from scratch over the baseline of supervised training with the cross-entropy (softmax) loss on all five datasets.

4.3 Visualization of learned models

To better understand what causes the representations to generalize better, we visualize which pixels contribute most to the correct classification for various models. In particular, for each image and model we compute the gradient of the logits (predictions before softmax) for the correct class with respect to the input image. The magnitude of the gradient at each pixel is a proxy for its importance and is visualized as a “saliency map”. Figure 2 shows these maps for various images and models trained with and without self-supervision. The self-supervised models appear to focus more on the foreground regions, as seen by the amount of bright pixels within the bounding box. One hypothesis is that self-supervised tasks force the model to rely less on background features, which might be accidentally correlated with the class labels. For fine-grained recognition, localization indeed improves performance when training from few examples (see WertheimerCVPR2019 for a contemporary evaluation of the role of localization in few-shot learning).
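
The saliency computation described above can be sketched as follows, assuming `model` maps a batch of images to class logits; taking the maximum gradient magnitude over channels is our choice for visualization.

```python
import torch

def saliency_map(model, image, target_class):
    """Gradient of the correct-class logit with respect to the input pixels;
    the per-pixel gradient magnitude serves as the saliency value."""
    x = image.unsqueeze(0).clone().requires_grad_(True)    # (1, C, H, W)
    logits = model(x)                                      # predictions before softmax
    logits[0, target_class].backward()
    return x.grad[0].abs().max(dim=0).values               # (H, W) map, max over channels
```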

Figure 2: Saliency maps for various images and models. For each image we visualize the magnitude of the gradient with respect to the correct class for models trained with various loss functions. The magnitudes are scaled to the same range for easier visualization. The models trained with self-supervision often have lower energy on the background regions when there is clutter. We highlight a few examples with blue borders and the bounding-box of the object for each image is shown in red.

5 Conclusion

We showed that self-supervision improves the transferability of representations for few-shot learning tasks on a range of image classification datasets. Surprisingly, we found that self-supervision is beneficial even when the number of images used for self-supervision is small, orders of magnitude smaller than in previously reported results. This has the practical benefit that the images within small datasets can be used for self-supervision without relying on a large-scale external dataset. Future work could investigate whether additional unlabeled data within the domain, or a combination of various self-supervised losses, can be used to further improve generalization. Future work could also investigate how and when self-supervision improves generalization by empirically analyzing transferability across a range of self-supervised and supervised tasks achille2019task2vec ; zamir2018taskonomy .

References

  • (1) Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task Embedding for Meta-Learning. arXiv preprint arXiv:1902.03545, 2019.
  • (2) Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
  • (3) Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 517–526. JMLR. org, 2017.
  • (4) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), pages 132–149, 2018.
  • (5) Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Wang, and Jia-Bin Huang. A Closer Look at Few-shot Classification. In International Conference on Learning Representations (ICLR), 2019.
  • (6) Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257, 2017.
  • (7) Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), pages 1422–1430, 2015.
  • (8) Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In International Conference on Computer Vision (ICCV), pages 2051–2060, 2017.
  • (9) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • (10) Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
  • (11) Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4367–4375, 2018.
  • (12) Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235, 2019.
  • (13) Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In International Conference on Computer Vision (ICCV), pages 3018–3027, 2017.
  • (14) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • (15) Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7482–7491, 2018.
  • (16) Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
  • (17) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (18) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • (19) Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6129–6138, 2017.
  • (20) Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. arXiv preprint arXiv:1901.09005, 2019.
  • (21) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D Object Representations for Fine-Grained Categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
  • (22) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision (ECCV), 2016.
  • (23) Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Computer Vision and Pattern Recognition (CVPR), 2019.
  • (24) Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • (25) Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. arXiv preprint arXiv:1904.08918, 2019.
  • (26) Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE transactions on pattern analysis and machine intelligence, 35(11):2624–2637, 2013.
  • (27) M-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1447–1454, 2006.
  • (28) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), pages 69–84. Springer, 2016.
  • (29) Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pages 5898–5906, 2017.
  • (30) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • (31) Deepak Pathak, Ross B. Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • (32) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • (33) Hang Qi, Matthew Brown, and David G Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5822–5830, 2018.
  • (34) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7229–7238, 2018.
  • (35) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
  • (36) Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 762–771, 2018.
  • (37) Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 527–538, 2018.
  • (38) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.
  • (39) Jong-Chyi Su and Subhransu Maji. Adapting models to signal degradation using distillation. In British Machine Vision Conference (BMVC), 2017.
  • (40) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1199–1208, 2018.
  • (41) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems (NeurIPS), pages 3630–3638, 2016.
  • (42) Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In International Conference on Computer Vision (ICCV), 2015.
  • (43) Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In International Conference on Computer Vision (ICCV), 2017.
  • (44) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • (45) Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In Computer Vision and Pattern Recognition (CVPR), 2019.
  • (46) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • (47) Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3712–3722, 2018.
  • (48) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
  • (49) Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1058–1067, 2017.