AI researchers have organically broken down the ambitious task of enabling intelligence into smaller, better-defined sub-tasks, such as classification over uniformly distributed data, few-shot learning, and transfer learning, and have worked towards developing effective models for these sub-tasks. The last decade has witnessed staggering progress in each of these sub-tasks. Despite this progress, solutions tailored to these sub-tasks are not yet ready to be deployed in broader real-world settings.
Consider a general recognition system, a key component in many downstream applications of computer vision. One would expect such a system to recognize a variety of categories without knowing a priori the number of samples for each category (e.g., few-shot or not), to adapt to samples from novel categories, to efficiently utilize the available hardware resources, and to be flexible about how and when to spend its resources on updating its model. Today's state-of-the-art models make far too many assumptions about the expected data distributions and volumes of supervision, and are unlikely to do well in such an unconstrained setting.
To revitalize progress in building learning systems, it is paramount that we set up benchmarks that encourage us to design algorithms exhibiting the desired attributes of an ML system in the wild. But what are the desired properties of such an ML system? We posit that a learning system capable of being deployed in the wild should have the following attributes: (1) Sequential Learning - In many application domains, data streams in. Sustained learners must be capable of processing data sequentially. At a given time step, the system may be required to make a prediction or may have the opportunity to update itself based on available labels and resources. (2) Flexible Learning - The system must be capable of making decisions that affect its learning, including updating itself at every time step vs in batches, processing every data point vs being selective, performing a large number of updates at once vs spreading them out evenly, etc. (3) Efficient Learning - Practical systems must take into account energy consumption over the duration of their entire lifetime. ML systems in the wild are not just performing inference but also constantly updating themselves; they must be efficient in terms of their learning and execution strategies. (4) Open World Learning - Such systems must be capable of learning beyond the restrictions of closed-world settings. They must constantly be on the lookout to classify new data into existing categories or choose to create a new one. (5) X-Shot Learning - These recognition systems must be capable of dealing with categories that have many instances as well as ones that have very few. Systems that excel in just one paradigm but fail in another will not be effective in many real-world settings.
With these properties in mind, we present iN thE wilD (Ned) (Figure 1), a unified sequential learning framework built to train and evaluate learning systems capable of being deployed in real-world scenarios, in contrast to existing isolated sub-tasks (Figure 1). Ned is agnostic to the underlying dataset. It consumes any source of supervised data, samples it to ensure a wide distribution of the number of instances per category, and presents it sequentially to the learning algorithm, ensuring that new categories are gradually introduced. At each time step, the learner is presented with a new data instance which it must classify. Following this, it is also provided the label corresponding to this new data point. It may then choose to process this data and possibly update its beliefs. Beyond this, Ned enforces no restrictions on the learning strategy. Ned evaluates systems in terms of accuracy and compute throughout the lifetime of the system. As a result, systems that learn sporadically and efficiently stand out from more traditional systems.
In this work, we present Ned-Imagenet, comprising data from ImageNet-22K. The sequential data presented to learners is drawn from a heavy-tailed distribution of 1000 classes (250 of which overlap with the popular ImageNet-1K dataset). Importantly, the data presented has no overlapping instances with the ImageNet-1K dataset, allowing models to leverage advances made over the past few years on this popular dataset. These choices allow us to study learners' effectiveness on old and new categories, as well as on rare and common classes.
We evaluate a large number of models and learning strategies on Ned-Imagenet: models that pretrain on ImageNet-1K and fine-tune on the sequential data using diverse strategies, models that only consume the sequential data, models drawn from the few-shot learning literature, MoCo [He et al., 2019], deep models with simple nearest-neighbor heads, and our own proposed baseline. We present some surprising findings. For example, we find that higher-capacity networks actually overfit less to training classes; popular few-shot methods have 40% lower accuracy than simple baselines; and MoCo networks forget the pretrain classes while training on a mixture of new and pretrain classes, performing rather poorly. Finally, our simple proposed baseline, Prototype Tuning, outperforms all other evaluated methods.
2 Attributes of an ML System in the Wild
We now present the desired attributes of a pragmatic machine learning system in the wild and place these in the context of related work.
Systems that learn in the wild must be capable of processing data as it commonly appears: sequentially. As data presents itself, such a system must be able to produce inferences as well as sequentially update itself. Interestingly, few-shot learning and open-world learning are natural consequences of learning in a sequential manner. When an instance from a new category first presents itself, the system must determine that it belongs to a new category; the next time an instance from this category appears, the system must be capable of one-shot learning, and so forth. Learning in a sequential manner is the core objective of continual learning and is a longstanding challenge [Thrun, 1996]. Several setups have been proposed over the years [Li and Hoiem, 2017, Kirkpatrick et al., 2017, Rebuffi et al., 2017, Harrison et al., 2019, Aljundi et al., 2019, Riemer et al., 2019, Aljundi et al., 2018] to evaluate systems' abilities to learn continuously; these primarily focus on catastrophic forgetting, a phenomenon where models drastically lose accuracy on old tasks when trained on new tasks. The most common setup sequentially presents data from each task and then evaluates on current and previous tasks [Li and Hoiem, 2017, Kirkpatrick et al., 2017, Rebuffi et al., 2017]. Recent variants [Harrison et al., 2019, Aljundi et al., 2019, Riemer et al., 2019, He et al., 2020] learn continuously in a task-free sequential setting but focus on slowly accumulating knowledge throughout a stream and do not include open-world and few-shot challenges.
Effective systems in the wild must be flexible in their learning strategies. They must make decisions over the course of their lifetime regarding what data to train with, what to ignore, how long to train, when to train, and what to optimize [Cho et al., 2013]. Note that choosing a learning strategy is distinct from merely updating model parameters. This flexibility is in contrast to previous learning paradigms such as supervised, few-shot, and continual learning, which typically impose fixed, preset restrictions on learners.
Update strategies must optimize for accuracy but should be constrained by the resources available to the system. Pragmatic systems must measure compute over the duration of their entire lifetime, accounting not just for inference cost [Rastegari et al., 2016, Howard et al., 2017, Kusupati et al., 2020] but also for update cost [Evci et al., 2019]. The trade-off between accuracy and lifetime compute will help researchers design appropriate systems.
Open world learning
Systems in the wild must be capable of learning in an open-world setting, where the classes, and even the number of classes, are not known to the learner. Each time a data point is encountered, the system must determine whether it belongs to an existing category or to a new one. Previous works have explored the problem of open-world recognition [Liu et al., 2019, Bendale and Boult, 2015]. Out-of-distribution detection is also a well-studied problem and is a special case of unseen-class detection in which the distribution is static. We benchmark two previous works, Hendrycks and Gimpel [2016] and Liu et al. [2019], alongside our own proposed baseline in the Ned framework.
A pragmatic recognition system must be able to deal with a varying number of instances per class. As noted above, this is a natural consequence of a sequential data presentation. While learning from large datasets has received the most attention [Russakovsky et al., 2015, Lin et al., 2014], few-shot learning has also become quite popular [Ravi and Larochelle, 2017, Finn et al., 2017, Snell et al., 2017, Oreshkin et al., 2018, Sun et al., 2019, Du et al., 2020, Li and Hoiem, 2017, Hariharan and Girshick, 2017]. The experimental setup for few-shot learning is typically a k-shot n-way evaluation. Models are trained on base classes during "meta-training" and then tested on novel classes during "meta-testing". We argue that the k-shot n-way evaluation is too restrictive. First, k-shot n-way assumes that data distributions during meta-testing are uniform, an unrealistic assumption in practice. Second, these setups only evaluate very specific "ways" and "shots". In contrast, Ned evaluates methods across a spectrum of shots and ways. We find that popular few-shot methods [Finn et al., 2017, Snell et al., 2017, Liu et al., 2019] overfit to 5-way and fewer-than-10-shot scenarios.
3 Ned: Setup
Ned is a unified learning and evaluation framework for ML systems that can learn sequentially, flexibly, and efficiently in open-world settings with data drawn from varying distributions. It is agnostic to the underlying task and source data. In this work, we instantiate such a framework for the task of image classification.
Ned provides a stream of data to a learning system, which consists of a model and a training strategy. At each time step, the system sees one data point and must classify it as one of the existing classes or as an unseen one (a (k+1)-way classification at each step, where k is the number of known classes at the current time step). After inference, the system is provided with the label for that data instance. The learner decides when to train during the stream, using previously seen data according to its training strategy. We evaluate systems using a suite of metrics, including the overall and mean per-class accuracies over the stream, along with the total compute required for training and inference. Algorithm 1 provides details of the Ned evaluation.
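The evaluation loop described above can be sketched as follows, assuming a hypothetical learner interface (`predict`, `observe`, `wants_update`, `update`); the names are illustrative, not Ned's actual API.

```python
# Sketch of the Ned evaluation loop (Algorithm 1), with an assumed
# learner interface. A class counts as "seen" only once its label
# has been observed; before that, the correct answer is "unseen".
def run_ned_stream(learner, stream):
    """stream yields (x, y) pairs; the label y arrives only after prediction."""
    known = set()          # classes whose labels have been observed so far
    correct = total = 0
    for t, (x, y) in enumerate(stream):
        pred = learner.predict(x)       # (k+1)-way: a known class or "unseen"
        target = y if y in known else "unseen"
        correct += (pred == target)
        total += 1
        learner.observe(x, y)           # label revealed only after inference
        known.add(y)
        if learner.wants_update(t):     # the learner chooses when to train
            learner.update()            # e.g., replay over previously seen data
    return correct / total
```

A learner is free to update after every sample, in periodic batches, or never; Ned only meters the compute each choice incurs.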
In this paper we evaluate methods under the Ned framework using a subset of the ImageNet-22K [Deng et al., 2009] dataset. Traditionally, few-shot learning has used datasets like Omniglot [Lake et al., 2011] and MiniImagenet [Vinyals et al., 2016], and continual learning has focused on MNIST [LeCun, 1998] and CIFAR [Krizhevsky et al., 2009]. Some recent continual learning works have used Split-ImageNet [Wen et al., 2020]. The aforementioned datasets are mostly small-scale. We use the large ImageNet-22K dataset as our repository of images in order to present new challenges to existing models. We show surprising results for established methods when evaluated on Ned-Imagenet in Section 5.
Images that a pragmatic system may encounter in the real world typically follow a heavy-tailed distribution. However, benchmark datasets such as ImageNet [Deng et al., 2009] have a uniform distribution. Recently, datasets like INaturalist [Van Horn et al., 2018] and LVIS [Gupta et al., 2019] have advocated for more realistic distributions of data. We follow suit and draw our sequences from a heavy-tailed dataset.
The data consists of a pretraining dataset and sequences of images. For pretraining we use the standard ImageNet-1K [Russakovsky et al., 2015]. This allows us to leverage existing models built by the community as pre-trained checkpoints. The sequences' images come from ImageNet-22K after removing ImageNet-1K's images. Each sequence contains images from 1000 classes, 750 of which do not appear in ImageNet-1K. We refer to the overlapping 250 classes as Pretrain classes and the remaining 750 as Novel classes. Each sequence is constructed by randomly sampling images from a heavy-tailed distribution over these 1000 classes, so head classes contribute many samples and tail classes contribute few. The sequence allows us to study the effect of methods on combinations of pretrain vs novel and head vs tail classes. In Table 1, we show results obtained for one sequence; Appendix C shows results across all sequences. More comprehensive statistics on the data and sequences can be found in Appendix A.
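The heavy-tailed sequence construction can be sketched as follows; the class count, stream length, and Zipf-style exponent below are illustrative assumptions for demonstration, not Ned's actual sampling parameters.

```python
import random

# Illustrative sketch of drawing a heavy-tailed sequential stream.
# Class frequencies follow a Zipf-like law: low-index (head) classes
# appear far more often than high-index (tail) classes.
def build_stream(num_classes=100, length=5000, exponent=1.5, seed=0):
    rng = random.Random(seed)
    weights = [1.0 / (c + 1) ** exponent for c in range(num_classes)]
    classes = rng.choices(range(num_classes), weights=weights, k=length)
    return classes  # in Ned, each entry would be an (image, label) pair

stream = build_stream()
```

With these weights, the head class dominates the stream while many tail classes contribute only a handful of samples, mimicking the pretrain/novel × head/tail cross-sections studied below.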
Before deploying the system into the sequential phase, we train our model with the pretraining data (ImageNet-1K). Our experiments in Sec 5 reveal interesting new findings about pretraining in supervised settings vs pretraining with self-supervised objectives like MoCo.
We use the following evaluation metrics in Ned.
Accuracy: The accuracy over all elements in the sequence.
Mean Per Class Accuracy: The accuracy for each class in the sequence averaged over all classes.
Total Compute: The total number of multiply-accumulate operations for all updates and evaluations accrued over the sequence, measured in GMACs (giga MACs).
Unseen Class Detection - AUROC: The area under the receiver operating characteristic for the detection of samples that are from unseen classes. Unseen classes are defined as classes that were not in pretrain and have not yet been encountered in the sequence.
Cross-Sectional Accuracies: The mean accuracy among classes in the sequence that belong to one of 4 subcategories: 1) Pretrain-Head: head classes (those with many samples) that were included in pretraining; 2) Pretrain-Tail: tail classes (those with few samples) that were included in pretraining; 3) Novel-Head: head classes that were not included in pretraining; 4) Novel-Tail: tail classes that were not included in pretraining.
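The first two metrics diverge under heavy-tailed class distributions, which is why both are reported; a minimal sketch (the helper below is illustrative, not Ned's exact implementation):

```python
from collections import defaultdict

# Overall accuracy weights every sample equally; mean per-class accuracy
# weights every class equally, so tail classes count as much as head classes.
def overall_and_mean_per_class(preds, labels):
    per_class = defaultdict(lambda: [0, 0])   # class -> [correct, total]
    for p, y in zip(preds, labels):
        per_class[y][0] += (p == y)
        per_class[y][1] += 1
    overall = sum(c for c, _ in per_class.values()) / len(labels)
    mean_pc = sum(c / n for c, n in per_class.values()) / len(per_class)
    return overall, mean_pc

# A model that nails the head class but misses a tail class gets a high
# overall accuracy but a lower mean per-class accuracy:
overall, mean_pc = overall_and_mean_per_class(
    preds=[0, 0, 0, 0, 1, 9],      # wrong only on the single class-2 sample
    labels=[0, 0, 0, 0, 1, 2])      # overall = 5/6, mean per-class = 2/3
```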
4 Methods

Standard training and fine-tuning
We test standard neural network training (updating all parameters in the network) and fine-tuning (updating only the final linear layer while freezing the feature layers) to evaluate how well these traditional techniques learn new classes and improve throughout the sequence. We use standard batch training with a fixed learning rate for both methods. For both methods, we add a randomly initialized classifier vector when an unseen class is encountered. In Appendix B we show the effects of the number of layers trained during fine-tuning.
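The classifier-expansion step can be sketched with a linear head over frozen features; the initialization scale, learning rate, and class interface below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Linear classification head that grows a new, randomly initialized row
# whenever an unseen class is encountered (the fine-tuning setting:
# features are frozen, only the head is updated).
class ExpandingLinearHead:
    def __init__(self, feat_dim, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.W = np.zeros((0, feat_dim))       # one row per known class

    def add_class(self):
        # randomly initialized vector for a newly encountered class
        new_row = self.rng.normal(scale=0.01, size=(1, self.W.shape[1]))
        self.W = np.vstack([self.W, new_row])

    def logits(self, feats):
        return feats @ self.W.T

    def sgd_step(self, feats, label, lr=0.1):
        # single cross-entropy gradient step on the head only
        z = self.logits(feats)
        p = np.exp(z - z.max())
        p /= p.sum()
        p[label] -= 1.0                        # d(loss)/d(logits)
        self.W -= lr * np.outer(p, feats)
```

Standard training would instead backpropagate the same loss through the feature layers as well, which is why its update cost (GMACs) is far higher.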
Nearest Class Mean (NCM)
Previous works have found that NCM performs similarly to state-of-the-art few-shot methods, and hence we include it in our evaluation. NCM in the context of deep learning performs 1-nearest-neighbor search in feature space with the centroid of each class as a neighbor. Each class mean $c_k$ is constructed as the average feature embedding of all examples in class $k$: $c_k = \frac{1}{|S_k|} \sum_{x \in S_k} f(x)$, where $S_k$ is the set of examples belonging to class $k$ and $f(x)$ is the deep feature embedding of $x$. Class probabilities are calculated as the softmax of negative distances between $f(x)$ and the class means: $p(y = k \mid x) = \frac{\exp(-d(f(x), c_k))}{\sum_{k'} \exp(-d(f(x), c_{k'}))}$. (Eq. 1)
In our implementation, the neural network is trained with a linear classifier using a cross-entropy loss. During the sequential phase, the classification layer is removed and the feature layers are frozen.
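A minimal NCM head over frozen features might look like the following, maintaining running class means and scoring by a softmax over negative distances as in Eq. 1 (Euclidean distance here for simplicity; the exact implementation may differ):

```python
import numpy as np

# Nearest Class Mean over frozen deep features. Means are maintained as
# running sums, so a per-sample update is essentially free — which is why
# NCM can be updated after every stream sample at negligible compute.
class NearestClassMean:
    def __init__(self):
        self.sums, self.counts = {}, {}

    def update(self, feat, label):
        self.sums[label] = self.sums.get(label, 0.0) + feat
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict_proba(self, feat):
        labels = sorted(self.sums)
        means = np.stack([self.sums[k] / self.counts[k] for k in labels])
        d = np.linalg.norm(means - feat, axis=1)   # distance to each class mean
        z = np.exp(-d - (-d).max())                # stable softmax of -d (Eq. 1)
        return labels, z / z.sum()
```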
We propose a simple baseline that is a hybrid between NCM and fine-tuning and is similar to the weight imprinting technique proposed by Qi et al. [2018]. In Prototype Tuning, we perform NCM until a certain number of examples per class has been observed; we then fine-tune the class means given new examples. The motivation for this baseline is that NCM has better accuracy in the low-data regime, while fine-tuning has higher accuracy in the abundant-data regime. Combining the two methods achieves higher accuracy than all other baselines.
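The two phases can be sketched as follows, assuming class prototypes are imprinted as linear-classifier weights and then fine-tuned with cross-entropy gradient steps (the switch point and learning rate are illustrative, not the paper's exact settings):

```python
import numpy as np

# Phase 1: classifier weights are the class prototypes (NCM behaviour).
def prototype_tuning_weights(feats_by_class):
    return np.stack([np.mean(f, axis=0) for f in feats_by_class])

# Phase 2: treat the prototypes as a linear head and take gradient steps.
def finetune_step(W, feat, label, lr=0.1):
    z = W @ feat
    p = np.exp(z - z.max())
    p /= p.sum()
    p[label] -= 1.0                 # cross-entropy gradient w.r.t. logits
    return W - lr * np.outer(p, feat)
```

Because the fine-tuning phase starts from the prototypes rather than from random weights, the head retains NCM's low-data accuracy while gaining fine-tuning's high-data accuracy.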
We benchmark two few-shot methods in the Ned framework: a) MAML [Finn et al., 2017] and b) Prototypical Networks [Snell et al., 2017]. Both methods are common baselines in the few-shot learning literature and can be tailored to Ned with minor modifications. Prototypical Networks find a deep feature embedding in which samples from the same class are close, calculate class centroids as the mean embedding of each class's examples, and perform inference according to Eq. 1. The parameters of the network are meta-trained with a cross-entropy loss according to the k-shot, n-way routine. The distinction between NCM and Prototypical Networks is that the latter uses meta-training with nearest neighbors to learn the feature embedding while NCM uses standard batch training with a linear classifier.
MAML is an initialization-based approach which uses second-order optimization to learn parameters that can be quickly fine-tuned to a given task. The gradient update for MAML is: $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$, where $\theta_i'$ are the parameters after making a gradient update given by $\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})$. We adapt MAML to Ned by pretraining the model according to the above objective, then fine-tuning during the sequential phase.
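The second-order nature of this update can be illustrated on a 1-D quadratic loss $L_i(\theta) = (\theta - t_i)^2$, where differentiating through the inner step contributes a $(1 - 2\alpha)$ factor; this toy example is ours, not the paper's.

```python
# Toy illustration of MAML's second-order outer gradient for
# per-task losses L_i(theta) = (theta - t_i)^2.
def maml_outer_grad(theta, t_inner, t_outer, alpha):
    # inner adaptation: theta' = theta - alpha * dL_inner/dtheta
    theta_p = theta - alpha * 2.0 * (theta - t_inner)
    # outer loss L_outer(theta'), differentiated w.r.t. the ORIGINAL theta;
    # the (1 - 2*alpha) factor comes from d(theta')/d(theta)
    return 2.0 * (theta_p - t_outer) * (1.0 - 2.0 * alpha)

def outer_loss(theta, t_inner, t_outer, alpha):
    theta_p = theta - alpha * 2.0 * (theta - t_inner)
    return (theta_p - t_outer) ** 2
```

A finite-difference check on `outer_loss` confirms the analytic gradient, i.e., that backpropagating through the inner update (rather than treating $\theta'$ as a constant) is what makes MAML second-order.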
Out-of-Distribution (OOD) methods
We evaluate the baseline proposed by Hendrycks and Gimpel [2016] along with our feature embedding baseline. The Hendrycks and Gimpel baseline thresholds the maximum probability output of the softmax classifier to determine whether a sample is in or out of distribution. Formally, for a threshold $\tau$, a sample $x$ is determined to be out of distribution when $\max_k p(y = k \mid x) < \tau$. The motivation is that the model is less likely to assign high probability values to samples that do not belong to a known class. This method is evaluated using the area under the receiver operating characteristic (AUROC), and we do likewise for our OOD evaluations.
Our proposed Minimum Distance Thresholding (MDT) baseline utilizes the minimum distance from the sample to all class means. Metric-based methods in few-shot learning have demonstrated that distance in feature space is a reasonable estimate of visual similarity. For class means $c_k$ and distance function $d$, a sample $x$ is determined to be out of distribution when $\min_k d(f(x), c_k) > \tau$. We use cosine distance for all methods. MDT obtains the highest AUROC of all methods [Liu et al., 2019, Hendrycks and Gimpel, 2016] we evaluate under Ned.
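Both detection scores can be sketched together with a rank-based AUROC; the helper names and toy inputs are illustrative (for MSP, the negated score would be fed to the AUROC so that higher always means "more likely unseen"):

```python
import numpy as np

# Max softmax probability (Hendrycks and Gimpel): low max-probability
# suggests the sample belongs to an unseen class.
def msp_score(probs):
    return probs.max(axis=1)

# Minimum Distance Thresholding (MDT): cosine distance to the nearest
# class mean; a large minimum distance suggests an unseen class.
def mdt_score(feats, class_means):
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return (1.0 - f @ m.T).min(axis=1)

# AUROC as P(score of an unseen sample > score of a seen sample),
# counting ties as 0.5 — equivalent to the area under the ROC curve.
def auroc(scores_unseen, scores_seen):
    s_u = np.asarray(scores_unseen)[:, None]
    s_s = np.asarray(scores_seen)[None, :]
    return ((s_u > s_s) + 0.5 * (s_u == s_s)).mean()
```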
For all experiments that require training, excluding those in Figure 4-a/b, we train each model for 4 epochs every 5,000 samples received. An epoch includes training over all previously seen data in the sequence. We chose these training settings based on the experiments in Figure 4-a/b: we found that training for 4 epochs every 5,000 samples balanced sufficient accuracy against reasonable computational cost. Note that we update the nearest class mean after every sample for applicable methods because the update frequency does not affect computational cost. For few-shot methods that leverage meta-training we used 5-shot 20-way, with the exception of MAML, which we meta-trained with 5-shot 5-way to reduce computational cost. We found that the number of shots and ways did not significantly affect accuracy.
For Prototype Tuning and Weight Imprinting, we transition from NCM to fine-tuning after 10,000 samples. We chose to do so because we observed that the accuracy of NCM saturated at this point in the sequence. During the fine-tuning phase, we train for 4 epochs every 5,000 samples with a learning rate of 0.01. For OLTR [Liu et al., 2019], we update the memory and train the model for 4 epochs every 200 samples for the first 10,000 samples, then train 4 epochs every 5,000 samples with a learning rate of 0.1.
We use the PyTorch [Paszke et al., 2019] pretrained models for the supervised ImageNet-1K pretrained ResNet18 and ResNet50. We use the models from VINCE [Gordon et al., 2020] for the MoCo [He et al., 2019] self-supervised ImageNet-1K pretrained ResNet18 and ResNet50. MoCo-ResNet18 and MoCo-ResNet50 achieve top-1 validation accuracies of 44.7% and 65.2% respectively and were trained for 200 epochs.
5 Experiments and Analysis
| Method | Pretraining | Backbone | Novel-Head | Pretrain-Head | Novel-Tail | Pretrain-Tail | Mean Per-Class | Overall | GMACs |
|---|---|---|---|---|---|---|---|---|---|
| (a) Prototypical Networks [Snell et al., 2017] | Meta | Conv-4 | 5.00 | 9.58 | 0.68 | 1.30 | 3.31 | 7.82 | 0.06 |
| (b) Prototypical Networks [Snell et al., 2017] | Meta | R18 | 8.64 | 16.98 | 6.99 | 12.74 | 9.50 | 11.14 | 0.15 |
| (c) MAML [Finn et al., 2017] | Meta | Conv-4 | 2.86 | 2.02 | 0.15 | 0.10 | 1.10 | 3.64 | 0.06 / 2.20 |
| (d) New-Meta Baseline [Chen et al., 2020b] | Sup./Meta | R18 | 40.47 | 67.03 | 27.53 | 53.87 | 40.23 | 47.62 | 0.16 / 5.73 |
| (e) OLTR [Liu et al., 2019] | MoCo | R18 | 34.60 | 33.74 | 13.38 | 9.38 | 22.68 | 39.92 | 0.16 / 6.39 |
| (f) OLTR [Liu et al., 2019] | Sup. | R18 | 40.83 | 40.00 | 17.27 | 13.85 | 27.77 | 45.06 | 0.16 / 6.39 |
| (g) Fine-Tune | MoCo | R18 | 5.55 | 46.05 | 0.03 | 25.52 | 10.60 | 18.89 | 0.16 / 5.73 |
| (h) Fine-Tune | Sup. | R18 | 43.41 | 77.29 | 23.56 | 58.77 | 41.54 | 53.80 | 0.16 / 5.73 |
| (i) Standard Training | MoCo | R18 | 26.63 | 45.02 | 9.63 | 20.54 | 21.12 | 35.60 | 11.29 |
| (j) Standard Training | Sup. | R18 | 38.51 | 68.14 | 16.90 | 43.25 | 33.99 | 49.46 | 11.29 |
| (m) Prototype Tuning | MoCo | R18 | 28.36 | 42.05 | 7.98 | 14.39 | 19.90 | 37.18 | 0.16 / 5.73 |
| (n) Prototype Tuning | Sup. | R18 | 46.59 | 74.32 | 24.87 | 48.12 | 42.86 | 57.02 | 0.16 / 5.73 |
| (o) Fine-Tune | MoCo | R50 | 3.25 | 69.63 | 0.09 | 56.03 | 16.71 | 20.63 | 0.36 / 13.03 |
| (p) Fine-Tune | Sup. | R50 | 47.78 | 82.06 | 27.53 | 66.42 | 46.24 | 57.95 | 0.36 / 13.03 |
| (q) Standard Training | MoCo | R50 | 26.82 | 42.12 | 10.50 | 21.08 | 21.32 | 35.44 | 38.36 |
| (r) Standard Training | Sup. | R50 | 43.89 | 74.50 | 21.54 | 50.69 | 39.48 | 54.10 | 38.36 |
| (u) Prototype Tuning | MoCo | R50 | 28.86 | 54.03 | 7.02 | 20.82 | 21.89 | 40.13 | 0.36 / 13.03 |
| (v) Prototype Tuning | Sup. | R50 | 48.98 | 77.78 | 28.14 | 57.97 | 48.74 | 58.80 | 0.36 / 13.03 |
We find that the Ned framework both confirms previous findings and provides new insights about established few-shot techniques, self-supervised methods, and the relationship between network capacity and generalization. Additionally, we find that our proposed baselines, Prototype Tuning and Minimum Distance Thresholding (MDT), outperform all other baselines we evaluate.
Table 1 shows the performance of the suite of methods (outlined in Sec 4) across accuracy and compute metrics. We report the Overall accuracy, Mean-Per-Class accuracy as well as accuracy sliced into four buckets: Head and Tail Pretrain classes (present in the ImageNet-1K dataset) as well as Head and Tail Novel classes (present only in the sequential data).
Standard training vs fine-tuning - Standard training (Table 1-j) provides a respectable overall accuracy of 49.46% with a ResNet18 backbone. Fine-tuning (Table 1-h) provides a boost over standard training on all accuracy metrics while also being significantly cheaper in terms of compute.
Nearest Class Mean - NCM (Table 1-l) provides overall accuracy comparable to standard training with very little compute. Note that NCM is the best-performing method for Novel-Tail classes.
Prototype Tuning - Our proposed approach (Table 1-n) provides the best overall accuracy and is also very light in terms of compute. This is observed through the entire duration of the sequence (Figure 2-a).
Network capacity and generalization - Other few-shot works indicate that smaller networks avoid overfitting to the classes they train on [Sun et al., 2019, Oreshkin et al., 2018, Snell et al., 2017, Finn et al., 2017, Ravi and Larochelle, 2017]. Hence, to avoid overfitting, few-shot methods such as Prototypical Networks and MAML have used 4-layer convolutional networks. Rows (o)-(v) in Table 1 and Figure 2-b show the effect of moving to larger ResNet50 and DenseNet161 backbones with NCM. Interestingly, in Ned we see the opposite effect: the generalization of learned representations to Novel classes significantly increases with the size of the network! We find that meta-training is primarily responsible for the overfitting we see in larger networks, which we discuss further in the few-shot methods section.
Representation learning using self-supervision - We observe surprising behavior from MoCo [He et al., 2019] in the Ned setting, in contrast to results on other downstream tasks. Across a suite of methods, MoCo backbones are hugely inferior to supervised backbones. For instance, Table 1-g vs Table 1-h shows a drastic 35% drop in overall accuracy. In fact, on Novel-Tail classes, accuracy drops to almost 0. Figure 3-a shows the progress of MoCo backbones over the entire sequence. Interestingly, the accuracy on Pretrain classes sharply decreases to almost 0% at the start of standard training, suggesting that MoCo networks struggle to simultaneously learn pretrain and novel classes. We conjecture that this difficulty is induced by learning with a linear classification layer. We believe this to be the case because NCM with MoCo (Table 1-k) generalizes well to novel classes while retaining accuracy on the pretrain classes.
Few-shot methods - Few-shot methods are designed to perform in the low-data regime; one might therefore expect Prototypical Networks and MAML to perform well in Ned. However, we find that Prototypical Networks and MAML (Table 1-a,b,c and Figure 2-a) fail to scale to the more difficult Ned setting even when using comparable architectures. This suggests that the k-shot n-way setup is not a sufficient evaluation for systems that need to learn across a spectrum of "shots" and "ways". Note that these few-shot methods are extremely light in terms of compute.
Additionally, we observe from the results for Prototypical Networks and NCM (which differ only in that Prototypical Networks use meta-training while NCM uses standard training) that meta-training causes larger networks to drastically overfit to the training distribution and explains why our results contradict those from prior few-shot works.
The New-Meta Baseline is identical to NCM in implementation, except that a phase of meta-training is performed after pretraining. We find that the additional meta-training improves the performance of the New-Meta Baseline in the Novel-Tail category but lowers overall accuracy compared to NCM.
Unseen class detection - For our proposed baselines, we measure the AUROC for detecting unseen classes throughout the sequence. The ROC curves are presented in Figure 3-b. The Hendrycks and Gimpel [2016] baseline achieves an AUROC of 0.59, OLTR [Liu et al., 2019] achieves 0.78, and our feature-based method (MDT) achieves 0.85. The effectiveness of our baseline supports our hypothesis that distances in feature space are an effective gauge of visual difference, aligning with findings from past works (Tian et al., Snell et al., Zhang et al.).
We evaluate the accuracy and total compute cost of varying update frequencies and training epochs (Figure 4). We conduct our experiments with fine-tuning (Figure 4-a) and standard training (Figure 4-b) on a ResNet18 model with supervised pretraining.
We find that in the high-MAC regime, the MACs expended during training primarily determine the overall accuracy. In other words, training frequently for a small number of epochs is comparable to training infrequently for a large number of epochs when total compute is high. However, different update strategies result in significantly different accuracies in the low-MAC regime, further evidence that these procedural parameters should be inferred by the learner for different settings. Additionally, we find that fine-tuning and standard training behave differently as total compute increases: for fine-tuning (Figure 4-a) the accuracy asymptotically increases with total training, while for standard training (Figure 4-b) performance decreases after an optimal amount of total training.
In this work we introduce Ned, an encompassing learning and evaluation framework that 1) encourages an integration of solutions across many sub-fields including supervised classification, few-shot learning, meta-learning, continual learning, and efficient ML, 2) offers more flexibility for learners to specify various parameters of their learning procedure, such as when and how long to train, 3) incorporates the total cost of updating and inference as the main anchoring constraint, and 4) can cope with the streaming, fluid, and open nature of the real world. Ned is designed to foster research in devising algorithms tailored toward building more pragmatic ML systems in the wild. This study has already resulted in discoveries (see Section 5) that contradict the findings of less realistic or smaller-scale experiments, which emphasizes the need to move towards more pragmatic setups like Ned. We hope Ned promotes more research at the intersection of decision making and model training, providing more freedom for learners to decide on their own procedural parameters. In this paper, we study various methods and settings in the context of supervised image classification, one of the most explored problems in ML. While we do not make design decisions specific to image classification, incorporating other mainstream tasks into Ned is an immediate next step. Throughout the experiments in this paper, we impose some restrictive assumptions on Ned. Relaxing these assumptions in order to bring Ned even closer to the real world is another immediate step for future work. For example, we currently assume that Ned has access to labels as the data streams in. One exciting future direction is to add semi- and un-supervised settings to Ned.
This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, 67102239 and gifts from Allen Institute for Artificial Intelligence. We thank Jae Sung Park and Mitchell Wortsman for insightful discussions and Daniel Gordon for the pretrained MoCo weights.
- Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.
- Selfless sequential learning. arXiv preprint arXiv:1806.05421.
- Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1893–1902.
- A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
- A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
- A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390.
- Dynamic learning model update of hybrid-classifiers for intrusion detection. The Journal of Supercomputing 64 (2), pp. 522–526.
- ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434.
- Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.
- Watching the world go by: representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990.
- LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–5364.
- Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision.
- Continuous meta-learning without tasks. In Advances in Neural Information Processing Systems.
- Incremental learning in online scenario. arXiv preprint arXiv:2003.13191.
- Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
- Learning multiple layers of features from tiny images.
- Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning.
- One shot learning of simple visual concepts. CogSci.
- The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
- Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2537–2546.
- TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5822–5830.
- Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pp. 7645–7655.
- XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
- Optimization as a model for few-shot learning. In International Conference on Learning Representations.
- iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
- Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
- Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412.
- Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems.
- Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539.
- Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069.
- The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778.
- Matching networks for one shot learning. In Advances in Neural Information Processing Systems.
- SimpleShot: revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623.
- BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In International Conference on Learning Representations.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Appendix A Dataset Information
The five sequences we pair with NED are constructed from ImageNet-22K [Deng et al., 2009]. Two sequences (1-2) are used for validation, and three (3-5) are used for testing. Each sequence contains 1,000 classes: 250 appear in ImageNet-1K [Russakovsky et al., 2015] (pretrain classes) and 750 appear only in ImageNet-22K (novel classes). For the test sequences, we randomly select the classes without replacement to ensure that the sequences do not overlap. The validation sequences share pretrain classes because there are not enough pretrain classes (1,000 in total) to partition among all five sequences. We randomly distribute the number of images per class according to Zipf's law (Figure 5). For classes without enough images, we fit the Zipfian distribution as closely as possible, which causes the slight variation in sequence statistics seen in Table 2.
| Sequence # | Number of Images | Min # of Class Images | Max # of Class Images |
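The Zipfian allocation described above can be sketched as follows. This is a minimal illustration, not the paper's dataset-construction code: the exponent, class count, and image total below are placeholder values, and the paper's actual Zipf parameter is not reproduced here.

```python
import numpy as np

def zipf_class_sizes(num_classes, total_images, exponent=1.0):
    """Allocate image counts per class following a Zipfian profile
    (frequency proportional to 1 / rank**exponent)."""
    ranks = np.arange(1, num_classes + 1)
    weights = 1.0 / ranks ** exponent
    weights /= weights.sum()
    sizes = np.floor(weights * total_images).astype(int)
    sizes[0] += total_images - sizes.sum()  # give the rounding remainder to the head class
    return sizes

# Placeholder totals; a real sequence would cap each class at its available images
sizes = zipf_class_sizes(num_classes=1000, total_images=100000)
```

Capping a class at its available images (as the paper does for classes with too few samples) would perturb this ideal profile, which is what produces the slight variation in sequence statistics.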
Appendix B Training Depth for Fine Tuning
We explored how training depth affects the accuracy of a model on new, old, common, and rare classes. For this set of experiments, we vary the number of trained layers when fine-tuning ResNet18 for 4 epochs every 5,000 samples with a learning rate of 0.01 on Sequence 2 (validation). The results are reported in Table 3. We found that training more layers leads to greater accuracy on new classes and lower accuracy on pretrain classes. However, the number of fine-tuned layers did not significantly affect overall accuracy, so for our results on the test sequences (3-5) we only report fine-tuning of one layer (Table 1).
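The protocol above can be sketched schematically. The layer names mimic a ResNet-18 layout but are illustrative placeholders; the actual training loop is not reproduced here.

```python
# Parameter groups of a ResNet-18-like model, input side to output side
RESNET18_BLOCKS = ["conv1", "layer1", "layer2", "layer3", "layer4", "fc"]

def trainable_layers(layers, depth):
    """Layers updated when fine-tuning `depth` layers counted from the
    output side; depth=1 trains only the classifier head."""
    return set(layers[-depth:])

def finetune_events(stream_length, interval=5000, epochs=4):
    """(sample index, epochs) pairs matching the 'fine-tune for 4 epochs
    every 5,000 samples' schedule over a stream of `stream_length` samples."""
    return [(i, epochs) for i in range(interval, stream_length + 1, interval)]
```

In a PyTorch implementation, freezing would amount to setting `requires_grad = False` on every parameter group outside `trainable_layers`.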
Appendix C Results For Other Sequences
We report the mean and standard deviation for all performance metrics across test sequences 3-5 in Table 4. Note that the standard deviation is relatively low, indicating that the methods behave consistently across the randomized sequences.
| New-Meta Baseline [cite] | Sup/Meta | R18 | 41.73±0.57 | 66.54±2.37 | 27.54±1.13 | 53.69±0.97 | 39.32±0.71 | 47.74±0.63 |
Appendix D Prototypical Network Experiments
We benchmarked our implementation of Prototypical Networks on few-shot baselines to verify that it is correct. We trained on both MiniImageNet and full ImageNet-1K, and evaluated on the MiniImageNet test set and on NED (Sequence 2). We found results comparable to those reported in the original Prototypical Networks paper [Snell et al., 2017] (Table 5).
| Prototypical Networks (ours) | Conv-4 | MiniImageNet | 69.2 | 7.54 |
| Prototypical Networks (ours) | Conv-4 | ImageNet (Train) | 42.7 | 7.82 |
| Prototypical Networks [Snell et al., 2017] | Conv-4 | MiniImageNet | 68.2 | - |
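The core of a Prototypical Network evaluation can be sketched as below. This is a simplified numpy illustration on toy features, not our benchmarked implementation; function names and the toy data are ours.

```python
import numpy as np

def prototypes(features, labels):
    """Class means (prototypes) of support-set features, as in Prototypical Networks."""
    classes = np.unique(labels)
    return classes, np.stack([features[labels == c].mean(axis=0) for c in classes])

def nearest_prototype(query, classes, protos):
    """Label each query with its nearest prototype under squared Euclidean distance."""
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

# Toy check: two well-separated clusters of support features
support = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels = np.array([0, 0, 1, 1])
classes, protos = prototypes(support, labels)
preds = nearest_prototype(np.array([[0., 0.2], [9., 10.]]), classes, protos)
```

In the actual method, `features` would be embeddings from the Conv-4 backbone rather than raw vectors, and training backpropagates a softmax over negative distances.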
Appendix E Out-of-Distribution Ablation
In this section we report AUROC and F1 for MDT and maximum softmax across all baselines. In Section 5 we only included OLTR, MDT with NCM, and standard training with maximum softmax (the Hendrycks baseline). Additionally, we visualize the accuracy curves for in-distribution and out-of-distribution samples as the rejection threshold varies (Figure 6). All OOD experiments presented in Figure 6 and Table 6 were run using ResNet18. Maximum Distance Thresholding (MDT) generally works better than maximum softmax when applied to most methods.
The results of NCM and prototypical tuning using softmax and cosine similarity, compared with OLTR, are shown in Table 6. The F1-scores are low due to the large imbalance between positive and negative classes: there are only 750 unseen-class (positive) datapoints relative to the number of negative datapoints. Table 6 shows that cosine similarity (MDT) outperforms softmax and the OLTR model for most methods.
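The two scoring rules compared above can be sketched as follows. This is an illustrative numpy version on toy inputs; the function names, threshold, and data are ours, not the paper's implementation.

```python
import numpy as np

def max_softmax_score(logits):
    """Maximum softmax probability (Hendrycks baseline); a low score
    suggests the sample is out-of-distribution."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stabilization
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def cosine_mdt_score(features, class_means):
    """MDT-style score with cosine similarity: similarity to the closest
    class mean; a low score suggests the sample is out-of-distribution."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return (f @ m.T).max(axis=1)

# A confident in-distribution sample vs. an uncertain one
scores = max_softmax_score(np.array([[10., 0.], [0.1, 0.0]]))
is_ood = scores < 0.9  # reject everything below an illustrative threshold

# Cosine variant: one feature aligned with a class mean, one orthogonal to it
mdt = cosine_mdt_score(np.array([[1., 0.], [0., 1.]]), np.array([[2., 0.]]))
```

Sweeping the rejection threshold over either score traces out the in- and out-of-distribution accuracy curves shown in Figure 6.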
Appendix F Weight Imprinting and Prototype Tuning
Weight Imprinting [Qi et al., 2018] is conceptually very similar to Prototype Tuning: both combine NCM with fine-tuning. However, we find a few key differences that greatly affect performance in the NED framework. The first difference is that Prototype Tuning uses a standard linear layer for classification, while Weight Imprinting uses a cosine classifier with a learnable scaling temperature for the softmax. During the fine-tuning phase, the linear classifier significantly outperforms Weight Imprinting as well as a Euclidean classifier and a regular cosine classifier (Table 7). Note that the learnable scaling temperature does improve Weight Imprinting over the plain cosine classifier by 15% in overall accuracy, but it remains 9% below Prototype Tuning. We evaluated Weight Imprinting with the scaling temperature initialized at 1, 2, 4, and 8. The second difference lies in how the nearest class means (NCMs) are constructed: Weight Imprinting calculates the NCMs over the entire dataset and then fine-tunes them on that same data, whereas Prototype Tuning calculates NCMs only from the small portion of data at the beginning of the stream, then fine-tunes on future data without recalculating the NCMs. Overall, we find that Prototype Tuning requires less hyper-parameter tuning and significantly outperforms Weight Imprinting in the NED framework (Table 7).
| Weight Imprinting (s = 1) | Sup | R18 | 36.58 | 63.39 | 9.32 | 21.80 | 26.85 | 46.35 |
| Weight Imprinting (s = 2) | Sup | R18 | 36.58 | 63.39 | 9.32 | 21.80 | 26.85 | 46.35 |
| Weight Imprinting (s = 4) | Sup | R18 | 40.32 | 67.46 | 15.35 | 34.18 | 32.69 | 48.51 |
| Weight Imprinting (s = 8) | Sup | R18 | 31.18 | 32.66 | 34.77 | 28.94 | 32.56 | 46.67 |
| Prototype Tuning (Cosine) | Sup | R18 | 33.90 | 18.22 | 4.84 | 1.88 | 11.72 | 31.81 |
| Prototype Tuning (Euclidean) | Sup | R18 | 43.40 | 66.32 | 21.66 | 42.06 | 37.19 | 51.62 |
| Prototype Tuning (Linear) | Sup | R18 | 48.56 | 71.41 | 24.16 | 47.51 | 41.32 | 56.79 |
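The classifier-head distinction between the two methods can be sketched as follows. This is a toy numpy illustration under our own naming; `s` is the scaling temperature discussed above, and the imprinted means and query are made-up data.

```python
import numpy as np

def linear_logits(x, W, b):
    """Standard linear head, as used by Prototype Tuning."""
    return x @ W.T + b

def cosine_logits(x, W, s=4.0):
    """Cosine classifier with scaling temperature s, as in Weight Imprinting.
    W can be 'imprinted' directly from class means (NCMs), since both
    weights and features are L2-normalized before the dot product."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return s * (xn @ wn.T)

# Imprint weights from two class means, then score one query
class_means = np.array([[1., 0.], [0., 1.]])
query = np.array([[0.9, 0.1]])
cos_pred = cosine_logits(query, class_means).argmax(axis=1)
lin_pred = linear_logits(query, class_means, np.zeros(2)).argmax(axis=1)
```

Because cosine logits are bounded by ±s, the temperature directly controls how peaked the softmax can become, which is consistent with the sensitivity to the initialization of s seen in Table 7.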