In the Wild: From ML Models to Pragmatic ML Systems

07/06/2020 ∙ Matthew Wallingford et al. ∙ University of Washington, Allen Institute for Artificial Intelligence

Enabling robust intelligence in the wild entails learning systems that offer uninterrupted inference while affording sustained training, with varying amounts of data and supervision. Such a pragmatic ML system should be able to cope with the openness and flexibility inherent in the real world. The machine learning community has organically broken down this challenging task into manageable sub-tasks such as supervised, few-shot, continual, and self-supervised learning, each affording distinctive challenges and a unique set of methods. Notwithstanding this amazing progress, the restricted and isolated nature of these sub-tasks has resulted in methods that excel in one setting but struggle to extend beyond it. To foster the research required to extend ML models to ML systems, we introduce a unified learning and evaluation framework - iN thE wilD (NED). NED is designed to be a more general paradigm by loosening the restrictive design decisions of past settings (e.g. the closed-world assumption) and imposing fewer restrictions on learning algorithms (e.g. predefined train and test phases). The learners can infer the experimental parameters themselves by optimizing for both accuracy and compute. In NED, a learner receives a stream of data and makes sequential predictions while choosing how to update itself, adapting to data from novel categories, and dealing with changing data distributions, all while optimizing the total amount of compute. We evaluate a large set of existing methods across several sub-fields using NED and present surprising yet revealing findings about modern-day techniques. For instance, prominent few-shot methods break down in NED, achieving dramatic drops of over 40% accuracy relative to simple baselines, and Momentum Contrast obtains 35% lower accuracy than supervised pretraining when adapting to novel classes. We also show that a simple baseline outperforms existing methods on NED.

1 Introduction

AI researchers have organically broken down the ambitious task of enabling intelligence into smaller and better defined sub-tasks such as classification for uniformly distributed data, few-shot learning, transfer learning, etc. and worked towards developing effective models for these sub-tasks. The last decade has witnessed staggering progress in each of these sub-tasks. Despite this progress, solutions tailored towards these sub-tasks are not yet ready to be deployed into broader real-world settings.

Consider a general recognition system, a key component in many downstream applications of computer vision. One would expect such a system to recognize a variety of categories without knowing a priori the number of samples for each category (e.g. few shot or not), to adapt to samples from novel categories, to efficiently utilize the available hardware resources, and to be flexible on how and when to spend its resources on updating its model. Today’s state of the art models make far too many assumptions about the expected data distributions and volumes of supervision and aren’t likely to do well in such an unconstrained setting.

To revitalize progress in building learning systems, it’s paramount that we set up benchmarks that encourage us to design algorithms that facilitate the desired attributes of an ML system in the wild. But what are the desired properties of such an ML system? We posit that a learning system capable of being deployed in the wild should have the following attributes: (1) Sequential Learning - In many application domains the data streams in. Sustained learners must be capable of processing data sequentially. At a given time step, the system may be required to make a prediction or may have the ability to update itself based on available labels and resources. (2) Flexible Learning - The system must be capable of making decisions that affect its learning, including updating itself at every time step vs in batches, processing every data point vs being selective, performing a large number of updates at once vs spreading them out evenly, etc. (3) Efficient Learning - Practical systems must take into account energy consumption over the duration of their entire lifetime. ML systems in the wild are not just performing inference but also constantly updating themselves; they must be efficient in terms of their learning and execution strategies. (4) Open World Learning - Such systems must be capable of learning beyond the restrictions of closed world settings. They must constantly be on the lookout to classify new data into existing categories or choose to create a new one. (5) X-Shot Learning - These recognition systems must be capable of dealing with categories that have many instances as well as ones that have very few. Systems that excel in just one paradigm but fail in another will not be effective in many real-world settings.

Figure 1: Comparison of supervised (top-left), continual (top-middle), and few-shot learning (top-right) with Ned (bottom). The learner (grey box) accumulates data (dotted path), trains on given data (dark nodes), then is evaluated (light nodes). The size of the node indicates the scale of the training or evaluation. Each color represents a different data distribution.

With these properties in mind, we present iN thE wilD (Ned) (Figure 1) - a unified sequential learning framework built to train and evaluate learning systems capable of being deployed in real-world scenarios. This is in contrast to existing isolated sub tasks (Figure 1). Ned is agnostic to the underlying dataset. It consumes any source of supervised data, samples it to ensure a wide distribution of the number of instances per category, and presents it sequentially to the learning algorithm, ensuring that new categories are gradually introduced. At each time step, the learner is presented with a new data instance which it must classify. Following this, it is also provided the label corresponding to this new data point. It may then choose to process this data and possibly update its beliefs. Beyond this, Ned enforces no restrictions on the learning strategy. Ned evaluates systems in terms of accuracy and compute throughout the lifetime of the system. As a result, systems that can learn sporadically and efficiently stand out from our more traditional systems.

In this work, we present Ned-Imagenet, comprising data from ImageNet-22K. The sequential data presented to learners is drawn from a heavy-tailed distribution over 1000 classes (250 of which overlap with the popular ImageNet-1K dataset). Importantly, the data presented has no overlapping instances with the ImageNet-1K dataset, allowing models to leverage advances made over the past few years on this popular dataset. These choices allow us to study learners’ effectiveness on old and new categories as well as rare and common classes.

We evaluate a large number of models and learning strategies on Ned-Imagenet. This includes models that pre-train on ImageNet-1K and finetune on the sequential data using diverse strategies, models that only consume the sequential data, models drawn from the few-shot learning literature, MoCo [He et al., 2019], deep models with simple nearest neighbor heads, and our own proposed baseline; and we present some surprising findings. For example, we find that higher-capacity networks actually overfit less to training classes; popular few-shot methods have 40% lower accuracy than simple baselines; and MoCo networks forget the pretrain classes while training on a mixture of new and pretrain classes and perform rather poorly. Finally, our simple proposed baseline, Prototype Tuning, outperforms all other evaluated methods.

2 Attributes of an ML System in the Wild

We now present the desired attributes of a pragmatic machine learning system in the wild and place these in the context of related work.

Sequential learning

Systems that learn in the wild must be capable of processing data as it commonly appears, sequentially. As data presents itself, such a system must be able to produce inferences as well as sequentially update itself. Interestingly, few-shot learning and open-world learning are natural consequences of learning in a sequential manner. When an instance from a new category first presents itself, the system must determine that it belongs to a new category; the next time an instance from this category appears, the system must be capable of one-shot learning, and so forth. Learning in a sequential manner is the core objective of continual learning and is a longstanding challenge [Thrun, 1996]. Several setups have been proposed over the years [Li and Hoiem, 2017, Kirkpatrick et al., 2017, Rebuffi et al., 2017, Harrison et al., 2019, Aljundi et al., 2019, Riemer et al., 2019, Aljundi et al., 2018] in order to evaluate systems’ abilities to learn continuously and primarily focus on catastrophic forgetting, a phenomenon where models drastically lose accuracy on old tasks when trained on new tasks. The most common setup sequentially presents data from each task then evaluates on current and previous tasks [Li and Hoiem, 2017, Kirkpatrick et al., 2017, Rebuffi et al., 2017]. Recent variants [Harrison et al., 2019, Aljundi et al., 2019, Riemer et al., 2019, He et al., 2020] learn continuously in a task-free sequential setting but focus on slowly accumulating knowledge throughout a stream and do not include open-world & few-shot challenges.

Flexible learning

Effective systems in the wild must be flexible in their learning strategies. They must make decisions over the course of their lifetime regarding what data to train with, what to ignore, how long to train, when to train, and what to optimize [Cho et al., 2013]. Note the difference between learning strategies and updating model parameters. This is in contrast to previous learning paradigms such as supervised, few-shot, and continual learning which typically impose fixed and preset restrictions on learners.

Efficient learning

Update strategies must optimize for accuracy but should be constrained to the resources available to the system. Pragmatic systems must measure this compute over the duration of their lifetime; not just measuring inference cost [Rastegari et al., 2016, Howard et al., 2017, Kusupati et al., 2020] but also update cost [Evci et al., 2019]. The trade-off between accuracy and lifetime compute will help researchers design appropriate systems.
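
To make the bookkeeping concrete, the sketch below keeps a running multiply-accumulate (MAC) counter over both inference and update steps. The LifetimeComputeMeter class, the per-forward cost, and the backward-to-forward ratio are illustrative assumptions, not the accounting used later in the paper.

import torch  # not required here, but the meter is meant to sit alongside a PyTorch model

class LifetimeComputeMeter:
    """Rough running total of lifetime compute, reported in GMACs."""

    def __init__(self, macs_per_forward: float, backward_multiplier: float = 2.0):
        self.macs_per_forward = macs_per_forward        # assumed MACs for one inference pass
        self.backward_multiplier = backward_multiplier  # common rough approximation: backward ~ 2x forward
        self.total_macs = 0.0

    def log_inference(self, num_samples: int = 1) -> None:
        self.total_macs += num_samples * self.macs_per_forward

    def log_update(self, num_samples: int) -> None:
        # one training step = forward + backward over the batch
        self.total_macs += num_samples * self.macs_per_forward * (1.0 + self.backward_multiplier)

    @property
    def gmacs(self) -> float:
        return self.total_macs / 1e9

# Example with a hypothetical model costing 1.8e9 MACs per image:
meter = LifetimeComputeMeter(macs_per_forward=1.8e9)
meter.log_inference()               # one streamed prediction
meter.log_update(num_samples=128)   # one training step on a batch of 128
print(f"lifetime compute so far: {meter.gmacs:.2f} GMACs")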

Open world learning

Systems in the wild must be capable of learning in an open-world setting - where the classes, and even the number of classes, are not known to the learner. Every time data is encountered, they must be capable of determining whether it belongs to an existing category or to a new one. Previous works have explored the problem of open-world recognition [Liu et al., 2019, Bendale and Boult, 2015]. Out-of-distribution detection is also a well-studied problem and is a case of unseen class detection where the distribution is static. We benchmark two previous works, Hendrycks and Gimpel [2016] and Liu et al. [2019], alongside our own proposed baseline in the Ned framework.

X-shot learning

A pragmatic recognition system must be able to deal with a varying number of instances per class. As noted above, this is a natural consequence of sequential data presentation. While learning from large datasets has received the most attention Russakovsky et al. [2015], Lin et al. [2014], few-shot learning has also become quite popular [Ravi and Larochelle, 2017, Finn et al., 2017, Snell et al., 2017, Oreshkin et al., 2018, Sun et al., 2019, Du et al., 2020, Li and Hoiem, 2017, Hariharan and Girshick, 2017]. The experimental setup for few-shot learning is typically an n-shot, k-way evaluation. Models are trained on base classes during "meta-training" and then tested on novel classes in "meta-testing". We argue that the n-shot, k-way evaluation is too restrictive. First, n-shot k-way assumes that data distributions during meta-testing are uniform, an unrealistic assumption in practice. Second, these setups only evaluate very specific "ways" and "shots". In contrast, Ned evaluates methods across a spectrum of shot and way numbers. We find that popular few-shot methods Finn et al. [2017], Snell et al. [2017], Liu et al. [2019] overfit to 5-way and less-than-10-shot scenarios.

3 Ned: Setup

1: Input: Task T (a stream of labeled samples)
2: Input: A learning system: model f, training strategy S
3:
4: function Ned(T, f, S)
5:     E ← [ ]  (evaluations)
6:     D ← [ ]  (datapoints)
7:     C ← 0  (counter of operations)
8:
9:     while not done do
10:          Sample (x, y) from T
11:          ŷ ← f(x)  (c_inf operations)
12:          n ← flag showing whether y is a new, unseen class
13:          E.insert((ŷ, y, n))
14:          D.insert((x, y))
15:          Update f using S with D  (c_train operations)
16:          C ← C + c_inf + c_train
17:     end while
18:
19:     return E, C
20: end function
Algorithm 1 Ned Evaluation

Ned is a unified learning and evaluation framework for ML systems that can learn sequentially, flexibly, and efficiently in open-world settings with data from varying distributions. It is agnostic to the underlying task and source data. In this work, we create such a framework for the task of image classification.

Procedure

Ned provides a stream of data to a learning system, which consists of a model and a training strategy. At each time step, the system sees one data point and must classify it as either one of the existing classes or as an unseen one (an (n+1)-way classification at each step, where n is the number of known classes at the current time step). After inference, the system is provided with the label for that data instance. The learner decides when to train during the stream, using previously seen data, according to its training strategy. We evaluate systems using a suite of metrics including the overall and mean per-class accuracies over the stream, along with the total compute required for training and inference. Algorithm 1 provides details for the Ned evaluation.
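
For concreteness, a minimal Python sketch of this evaluation loop (mirroring Algorithm 1) is given below. The stream and learner objects, their predict/maybe_update interface, and the MAC-counting hooks are placeholders we assume for illustration; they are not the released framework's API.

def ned_evaluate(stream, learner):
    evaluations = []      # (prediction, label, was_unseen) tuples
    seen_classes = set()  # classes encountered so far (plus any pretrain classes)
    datapoints = []       # all previously observed (x, y) pairs
    total_macs = 0.0      # lifetime compute, in MACs

    for x, y in stream:
        # 1) Predict: one of the known classes, or "unseen".
        prediction, inference_macs = learner.predict(x)
        was_unseen = y not in seen_classes
        evaluations.append((prediction, y, was_unseen))

        # 2) Reveal the label and store the datapoint.
        seen_classes.add(y)
        datapoints.append((x, y))

        # 3) The learner decides whether/how to update itself on the data so far.
        update_macs = learner.maybe_update(datapoints)

        total_macs += inference_macs + update_macs

    return evaluations, total_macs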

Data

In this paper we evaluate methods under the Ned framework using a subset of ImageNet-22K [Deng et al., 2009] dataset. Traditionally, few-shot learning has used datasets like Omniglot [Lake et al., 2011] & MiniImagenet [Vinyals et al., 2016] and continual learning has focused on MNIST [LeCun, 1998] & CIFAR [Krizhevsky et al., 2009]. Some recent continual learning works have used Split-ImageNet [Wen et al., 2020]. The aforementioned datasets are mostly small-scale. We use the large ImageNet-22K dataset as our repository of images in order to present new challenges to existing models. We show surprising results for established methods when evaluated on Ned-Imagenet in section 5.

Images that a pragmatic system may encounter in the real-world typically follow a heavy-tail distribution. However, benchmark datasets such as ImageNet Deng et al. [2009] have a uniform distribution. Recently, datasets like INaturalist [Van Horn et al., 2018] and LVIS [Gupta et al., 2019] have advocated for more realistic distributions of data. We follow suit and draw our sequences from a heavy-tailed dataset.

The data consists of a pretraining dataset and five sequences of images. For pretraining we use the standard ImageNet-1K [Russakovsky et al., 2015]. This allows us to leverage existing models built by the community as pre-trained checkpoints. The sequences’ images come from ImageNet-22K after removing ImageNet-1K’s images. Each sequence contains images from 1000 classes, 750 of which do not appear in ImageNet-1K. We refer to the overlapping 250 classes as Pretrain classes and the remaining 750 as Novel classes. Each sequence is constructed by randomly sampling images from a heavy-tailed distribution over these 1000 classes and contains roughly 90,000 samples (Table 2), where head classes contain at least 50 samples and tail classes contain fewer than 50. The sequences allow us to study the effect of methods on combinations of pretrain vs novel, and head vs tail classes. In Table 1, we show results obtained for one sequence, and Appendix C shows results across all test sequences. More comprehensive statistics on the data and sequences can be found in Appendix A.

Pretraining

Supervised pretraining [He et al., 2016] using large annotated datasets like ImageNet [Deng et al., 2009] facilitates the transfer of learnt representations to help data-scarce downstream tasks. Unsupervised learning methods like autoencoders [Tschannen et al., 2018, Rao et al., 2019] and more recent self-supervision methods [Jing and Tian, 2020] like Momentum Contrast (MoCo) [He et al., 2019] and SimCLR [Chen et al., 2020a] have begun to produce representations that rival those of supervised learning and achieve similar accuracies on various tasks.

Before deploying the system into the sequential phase, we train our model with pretraining data (ImageNet-1K). Our experiments in Sec 5 reveal interesting and new findings about pretraining in supervised settings vs pretraining with self-supervised objectives like MoCo.

Evaluation metrics

We use the following evaluation metrics in Ned.
Accuracy: The accuracy over all elements in the sequence.
Mean Per Class Accuracy: The accuracy for each class in the sequence, averaged over all classes.
Total Compute: The total number of multiply-accumulate operations for all updates and evaluations accrued over the sequence, measured in GMACs (giga MACs).
Unseen Class Detection - AUROC: The area under the receiver operating characteristic for the detection of samples that are from unseen classes. Unseen classes are defined as classes that were not in pretrain and have not yet been encountered in the sequence.
Cross-Sectional Accuracies: The mean accuracy among classes in the sequence that belong to one of 4 subcategories: 1) Pretrain-Head: classes with at least 50 samples that were included in pretraining; 2) Pretrain-Tail: classes with fewer than 50 samples that were included in pretraining; 3) Novel-Head: classes with at least 50 samples that were not included in pretraining; 4) Novel-Tail: classes with fewer than 50 samples that were not included in pretraining.
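
As an illustration, the accuracy and detection metrics above can be computed from the per-step evaluation log roughly as follows; the helper names and the use of scikit-learn's roc_auc_score are our choices, not part of the framework.

import numpy as np
from sklearn.metrics import roc_auc_score

def overall_accuracy(preds, labels):
    preds, labels = np.asarray(preds), np.asarray(labels)
    return (preds == labels).mean()

def mean_per_class_accuracy(preds, labels):
    preds, labels = np.asarray(preds), np.asarray(labels)
    # Accuracy within each class, then an unweighted average over classes.
    accs = [(preds[labels == c] == c).mean() for c in np.unique(labels)]
    return float(np.mean(accs))

def unseen_class_auroc(unseen_scores, is_unseen):
    # unseen_scores: higher means "more likely to be an unseen class"
    # is_unseen: 1 if the true class had not been seen before this step
    return roc_auc_score(is_unseen, unseen_scores)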

4 Methods

Standard training and fine-tuning

We test standard neural network training (updating all parameters in the network) and fine-tuning (updating only the final linear layer, not the feature layers) to evaluate how well these traditional techniques learn new classes and improve throughout the sequence. We use standard batch-training with a fixed learning rate for both methods. For both methods, we add a randomly initialized weight vector to the classification layer when an unseen class is encountered. In Appendix B we show the effects of the number of layers trained during fine-tuning.
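
A minimal sketch of how a linear classification head might be grown when an unseen class arrives; the module layout and initialization details are our assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class ExpandableLinearHead(nn.Module):
    """Linear classifier whose output dimension grows as new classes appear."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)

    @torch.no_grad()
    def add_class(self) -> None:
        # Append a randomly initialized row (and bias entry) for the new class.
        old = self.fc
        new = nn.Linear(old.in_features, old.out_features + 1).to(old.weight.device)
        new.weight[: old.out_features] = old.weight
        new.bias[: old.out_features] = old.bias
        self.fc = new

When an unseen label arrives in the stream, head.add_class() would be called before the next update; the newly added row plays the role of the randomly initialized vector mentioned above.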

Nearest Class Mean (NCM)

Recently, multiple works [Chen et al., 2019, Tian et al., 2020, Wang et al., 2019] have found that NCM performs similarly to state-of-the-art few-shot methods and hence we include it in our evaluation. NCM in the context of deep learning performs 1-nearest-neighbor search in feature space with the centroid of each class as a neighbor. Each class mean μ_c is constructed as the average feature embedding of all examples in class c: μ_c = (1/|S_c|) Σ_{x ∈ S_c} f(x), where S_c is the set of examples belonging to class c and f(x) is the deep feature embedding of x. Class probabilities are calculated as the softmax of negative distances between f(x) and the class means:

p(y = c | x) = exp(−d(f(x), μ_c)) / Σ_{c′} exp(−d(f(x), μ_{c′}))     (1)

In our implementation, the neural network is trained with a linear classifier using a cross-entropy loss. During the sequential phase, the classification layer is removed and the feature layers are frozen.
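
A compact sketch of such an NCM classifier over frozen features; the running-mean bookkeeping and the use of squared Euclidean distance here are illustrative choices.

import torch
import torch.nn.functional as F

class NearestClassMean:
    """1-nearest-neighbor over class centroids in a frozen feature space."""

    def __init__(self):
        self.sums = {}    # class id -> running sum of feature embeddings
        self.counts = {}  # class id -> number of examples seen

    def update(self, class_id: int, feature: torch.Tensor) -> None:
        if class_id not in self.sums:
            self.sums[class_id] = torch.zeros_like(feature)
            self.counts[class_id] = 0
        self.sums[class_id] += feature
        self.counts[class_id] += 1

    def class_probabilities(self, feature: torch.Tensor) -> dict:
        # Softmax over negative distances to each centroid, as in Eq. 1.
        class_ids = list(self.sums.keys())
        means = torch.stack([self.sums[c] / self.counts[c] for c in class_ids])
        dists = ((means - feature) ** 2).sum(dim=1)
        probs = F.softmax(-dists, dim=0)
        return dict(zip(class_ids, probs.tolist()))

Because the centroids are just running averages, they can be refreshed after every streamed sample at negligible cost, which is why update frequency does not affect the reported compute for NCM-style methods.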

Prototype Tuning

We propose a simple baseline that is a hybrid between NCM and fine-tuning and is similar to the weight imprinting technique proposed by Qi et al. [2018]. In Prototype Tuning, we perform NCM until a certain number of examples per class has been observed; we then fine-tune the class means given new examples. The motivation for such a baseline is that NCM has better accuracy in the low-data regime, but fine-tuning has higher accuracy in the abundant-data regime. Combining the two methods achieves higher accuracy than all other baselines.
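
A minimal sketch of the idea, assuming class means are available from the NCM phase: initialize a linear classifier's rows from the centroids and then fine-tune on subsequent data. The transition point and optimizer settings shown are illustrative; the settings we actually use are given in the implementation details.

import torch
import torch.nn as nn

def linear_head_from_class_means(class_means: torch.Tensor) -> nn.Linear:
    """Build a linear classifier whose weight rows are the NCM centroids.

    class_means: tensor of shape (num_classes, feature_dim).
    """
    num_classes, feature_dim = class_means.shape
    head = nn.Linear(feature_dim, num_classes)
    with torch.no_grad():
        head.weight.copy_(class_means)  # one row per class prototype
        head.bias.zero_()
    return head

# Hypothetical usage during the stream:
# head = linear_head_from_class_means(ncm_centroids)          # after the NCM phase
# optimizer = torch.optim.SGD(head.parameters(), lr=0.01)
# ...then fine-tune the head with cross-entropy on newly arriving data.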

Few-shot methods

We benchmark two few-shot methods in the Ned framework: a) MAML [Finn et al., 2017] and b) Prototypical Networks [Snell et al., 2017]. Both methods are common baselines in the few-shot learning literature and can be tailored to Ned with minor modifications. Prototypical Networks find a deep feature embedding in which samples from the same class are close, compute each class centroid as the mean embedding of that class's examples, and perform inference according to Eq 1. The parameters of the network are meta-trained with a cross-entropy loss according to the n-shot, k-way routine. The distinction between NCM and Prototypical Networks is that the latter uses meta-training with nearest neighbors to learn the feature embedding while NCM uses standard batch-training with a linear classifier.

MAML is an initialization-based approach which uses second-order optimization to learn parameters that can be quickly fine-tuned to a given task. The gradient update for MAML is θ ← θ − β ∇_θ Σ_τ L_τ(f_{θ′_τ}), where θ′_τ are the parameters after making a gradient update on task τ, given by θ′_τ = θ − α ∇_θ L_τ(f_θ). We adapt MAML to Ned by pretraining the model according to the above objective and then fine-tuning during the sequential phase.
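
The sketch below shows one second-order MAML meta-update for a single linear model, following the update rule above; the model shape, task sampler, and learning rates are illustrative assumptions, not the configuration used in our experiments.

import torch
import torch.nn.functional as F

def maml_meta_step(weight, bias, tasks, inner_lr=0.01, outer_lr=0.001):
    """weight: (num_classes, feature_dim) tensor with requires_grad=True; bias likewise.
    tasks: iterable of (support_x, support_y, query_x, query_y) tensors."""
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:
        # Inner step: adapt parameters on the support set (keep the graph).
        support_loss = F.cross_entropy(support_x @ weight.t() + bias, support_y)
        g_w, g_b = torch.autograd.grad(support_loss, (weight, bias), create_graph=True)
        w_adapted, b_adapted = weight - inner_lr * g_w, bias - inner_lr * g_b

        # Outer objective: loss of the adapted parameters on the query set.
        meta_loss = meta_loss + F.cross_entropy(query_x @ w_adapted.t() + b_adapted, query_y)

    # Meta-update: differentiate through the inner update (second order).
    g_w, g_b = torch.autograd.grad(meta_loss, (weight, bias))
    with torch.no_grad():
        weight -= outer_lr * g_w
        bias -= outer_lr * g_b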

Out-of-Distribution (OOD) methods

We evaluate the baseline proposed by Hendrycks and Gimpel [2016] along with our feature embedding baseline. The Hendrycks and Gimpel baseline thresholds the maximum probability output of the softmax classifier to determine whether a sample is in or out of distribution. Formally, for a threshold δ, a sample x is determined to be out of distribution when max_c p(y = c | x) < δ. The motivation is that the model is less likely to assign high probability values to samples that do not belong to a known class. This method is evaluated using the area under the receiver operating characteristic (AUROC), and we do likewise for our OOD evaluations.

Our proposed Minimum Distance Thresholding (MDT) baseline utilizes the minimum distance from the sample to all class means. Metric-based methods in few-shot learning have demonstrated that distance in feature space is a reasonable estimate of visual similarity. For class means μ_c and distance function d, a sample x is determined to be out of distribution when min_c d(f(x), μ_c) > δ. We use cosine distance for all the methods. MDT obtains the highest AUROC of all methods [Liu et al., 2019, Hendrycks and Gimpel, 2016] we evaluate under NED.
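
The two scoring rules can be sketched as follows; the feature extractor and threshold are placeholders, and scores are oriented so that higher means more likely to be unseen.

import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor) -> torch.Tensor:
    # Hendrycks & Gimpel-style score: a low maximum softmax probability
    # suggests the sample is out of distribution.
    return 1.0 - F.softmax(logits, dim=-1).max(dim=-1).values

def mdt_score(features: torch.Tensor, class_means: torch.Tensor) -> torch.Tensor:
    # Minimum Distance Thresholding: minimum cosine distance from the sample's
    # feature embedding (B, D) to any class mean (C, D); a large minimum
    # distance suggests the sample belongs to an unseen class.
    sims = F.cosine_similarity(features.unsqueeze(1), class_means.unsqueeze(0), dim=-1)
    return (1.0 - sims).min(dim=-1).values

# A sample is flagged as unseen when its score exceeds a chosen threshold delta:
# is_unseen = mdt_score(feats, means) > delta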

Implementation Details

For all experiments that require training, excluding those in Figure 4-a/b, we train each model for 4 epochs every 5,000 samples received. An epoch includes training over all previously seen data in the sequence. We chose these training settings based on the experiments in Figure 4-a/b: training for 4 epochs every 5,000 samples balanced sufficient accuracy against reasonable computational cost. Note that we update the nearest class mean after every sample for applicable methods because the update frequency does not affect computational cost. For few-shot methods that leverage meta-training we used 5-shot 20-way, with the exception of MAML, which we meta-trained with 5-shot 5-way to reduce computational cost. We found that the shot and way did not significantly affect accuracy.

For Prototype Tuning and Weight Imprinting, we transition from NCM to fine-tuning after 10,000 samples. We chose to do so because we observed that the accuracy of NCM saturated at this point in the sequence. During the fine-tuning phase, we train for 4 epochs every 5,000 samples with a learning rate of 0.01. For OLTR [Liu et al., 2019], we update the memory and train the model for 4 epochs every 200 samples for the first 10,000 samples, then train 4 epochs every 5,000 samples with a learning rate of 0.1.

We use the PyTorch [Paszke et al., 2019] pretrained models for the supervised ImageNet-1K pretrained ResNet18 and ResNet50. We use the models from VINCE [Gordon et al., 2020] for the MoCo [He et al., 2019] self-supervised ImageNet-1K pretrained ResNet18 and ResNet50. MoCo-ResNet18 and MoCo-ResNet50 achieve top-1 validation accuracies of 44.7% and 65.2% respectively and were trained for 200 epochs.

5 Experiments and Analysis

Method | Pretrain Strategy | Backbone | Novel-Head (>50) | Pretrain-Head (>50) | Novel-Tail (<50) | Pretrain-Tail (<50) | Mean Per-Class | Overall | GMACs
(a) Prototypical Networks [Snell et al., 2017] Meta Conv-4   5.00   9.58   0.68   1.30   3.31   7.82   0.06
(b) Prototypical Networks [Snell et al., 2017] Meta R18   8.64 16.98   6.99 12.74   9.50 11.14   0.15
(c) MAML [Finn et al., 2017] Meta Conv-4   2.86   2.02   0.15   0.10   1.10   3.64 0.06 / 2.20
(d) New-Meta Baseline [Chen et al., 2020b] Sup./Meta R18 40.47 67.03 27.53 53.87 40.23 47.62 0.16 / 5.73
(e) OLTR [Liu et al., 2019] MoCo R18 34.60 33.74 13.38 9.38 22.68 39.92 0.16 / 6.39
(f) OLTR [Liu et al., 2019] Sup. R18 40.83 40.00 17.27 13.85 27.77 45.06 0.16 / 6.39
(g) Fine-Tune MoCo R18   5.55 46.05   0.03 25.52 10.60 18.89 0.16 / 5.73
(h) Fine-Tune Sup. R18 43.41 77.29 23.56 58.77 41.54 53.80 0.16 / 5.73
(i) Standard Training MoCo R18 26.63 45.02   9.63 20.54 21.12 35.60 11.29
(j) Standard Training Sup. R18 38.51 68.14 16.90 43.25 33.99 49.46 11.29
(k) NCM MoCo R18 19.24 31.12 14.40 21.95 18.99 22.90   0.15
(l) NCM Sup. R18 42.35 72.69 31.72 56.17 43.44 48.62   0.15
(m) Prototype Tuning MoCo R18 28.36 42.05   7.98 14.39 19.90 37.18 0.16 / 5.73
(n) Prototype Tuning Sup. R18 46.59 74.32 24.87 48.12 42.86 57.02 0.16 / 5.73
(o) Fine-Tune MoCo R50   3.25 69.63   0.09 56.03 16.71 20.63 0.36 / 13.03
(p) Fine-Tune Sup. R50 47.78 82.06 27.53 66.42 46.24 57.95 0.36 / 13.03
(q) Standard Training MoCo R50 26.82 42.12 10.50 21.08 21.32 35.44 38.36
(r) Standard Training Sup. R50 43.89 74.50 21.54 50.69 39.48 54.10 38.36
(s) NCM MoCo R50 30.58 55.01 24.10 45.37 32.75 36.14   0.35
(t) NCM Sup. R50 45.58 78.01 35.94 62.90 47.75 52.19   0.35
(u) Prototype Tuning MoCo R50 28.86 54.03   7.02 20.82 21.89 40.13 0.36 / 13.03
(v) Prototype Tuning Sup. R50 48.98 77.78 28.14 57.97 48.74 58.80 0.36 / 13.03
Table 1: Performance of the suite of methods (outlined in Section 4) across accuracy and compute metrics. We present several variants of accuracy - Overall, Mean Per-Class, as well as accuracy bucketed into 4 categories: Novel-Head, Novel-Tail, Pretrain-Head and Pretrain-Tail (Pretrain refers to classes present in the ImageNet-1K dataset and available at pre-training). R18 and R50 refer to ResNet18 and ResNet50 backbones. Sup. refers to Supervised and MoCo refers to [He et al., 2019]. The best technique for a given model backbone on every metric is underlined and the overall best is in bold. Some methods could leverage caching of representations for efficiency, so both GMAC values are reported. GMACs do not include those incurred during pretraining.

We find that the Ned framework both confirms previous findings and provides new insights about established few-shot techniques, self-supervised methods, and the relationship between network capacity and generalization. Additionally, we find that our proposed baselines, Prototype Tuning and Minimum Distance Thresholding (MDT), outperform all other baselines we evaluate.

Table 1 shows the performance of the suite of methods (outlined in Sec 4) across accuracy and compute metrics. We report the Overall accuracy, Mean-Per-Class accuracy as well as accuracy sliced into four buckets: Head and Tail Pretrain classes (present in the ImageNet-1K dataset) as well as Head and Tail Novel classes (present only in the sequential data).

Standard training vs fine tuning - Standard training (Table 1-j) provides a respectable overall accuracy of 49.46% with a ResNet18 backbone. Fine-tuning (Table 1-h) provides a nice boost over standard training for all accuracy metrics while also being significantly cheaper in terms of compute.

Nearest Class Mean - NCM (Table 1-l) provides comparable overall accuracies to standard training with very little compute. Note that NCM is the best performing method for Novel-Tail classes.

Prototype Tuning - Our proposed approach (Table 1-n) provides the best overall accuracy and is also very light in terms of compute. This is observed through the entire duration of the sequence (Figure 2-a).

Network capacity and generalization - Other few-shot works indicate that smaller networks avoid overfitting to the classes they train on [Sun et al., 2019, Oreshkin et al., 2018, Snell et al., 2017, Finn et al., 2017, Ravi and Larochelle, 2017]. Hence, to avoid overfitting few-shot methods such as Prototypical Networks and MAML have used 4-layer convolutional networks. Rows (o)-(v) in Table 1 and Figure 2-b show the effect of moving to larger ResNet50 and DenseNet161 backbones with NCM. Interestingly, in Ned, we see the opposite effect. The generalization of learned representations to Novel classes significantly increases with the size of the network! We find that meta-training is primarily responsible for the overfitting we see in larger networks which we discuss further in the few shot methods section.

Representation learning using self-supervision - We observe surprising behavior from MoCo [He et al., 2019] in the Ned setting, in contrast to results on other downstream tasks. Across a suite of methods, MoCo backbones are hugely inferior to supervised backbones. For instance, Table 1-g vs Table 1-h shows a drastic 35% drop in overall accuracy. In fact, on Novel-Tail classes, accuracy drops to almost 0. Figure 3-a shows the progress of MoCo backbones over the entire sequence. Interestingly, the accuracy on Pretrain classes sharply decreases to almost 0% at the start of standard training, suggesting that MoCo networks struggle to simultaneously learn pretrain and novel classes. We conjecture that this difficulty is induced by learning with a linear classification layer. We believe this to be the case because NCM with MoCo (Table 1-k) generalizes well enough to novel classes while retaining accuracy on the pretrain classes.

Figure 2: Plot (a) compares the rolling accuracy of various methods over the stream of data. Note that Prototype Tuning performs the best at all stages of the stream. Plot (b) compares the accuracy of NCM on novel classes for various network architectures. Surprisingly, deeper networks overfit less to the pretrain classes.

Few shot methods - Few-shot methods are designed to perform in the low data regime; therefore one might expect Prototypical Networks and MAML to perform well in Ned. However, we find that Prototypical Networks and MAML (Table 1-a,b,c and Figure 2-a) fail to scale to the more difficult Ned setting even when using comparable architectures. This suggests that the n-shot k-way setup is not a sufficient evaluation for systems that need to learn across a spectrum of "shots" and "ways". Note that these few-shot methods are extremely light in terms of compute.

Additionally, we observe from the results for Prototypical Networks and NCM (which differ only in that Prototypical Networks use meta-training while NCM uses standard training) that meta-training causes larger networks to drastically overfit to the training distribution and explains why our results contradict those from prior few-shot works.

The New-Meta-Baseline is the same as NCM in implementation except that a phase of meta-training is done after pretraining. We find that the additional meta-training improves the performance of New-Meta-Baseline in the novel-tail category but overall lowers the accuracy compared to NCM.

Unseen class detection - For our proposed baselines, we measure the AUROC for detecting unseen classes throughout the sequence. The ROC curves are presented in Figure 3-b. The Hendrycks and Gimpel [2016] baseline achieves 0.59 (AUROC), OLTR [Liu et al., 2019] achieves 0.78 (AUROC), and our feature-based method (MDT) achieves 0.85 (AUROC). The effectiveness of our baseline verifies our hypothesis that distances in feature space are an effective gauge of visual difference, aligning with findings from past works Tian et al. [2020], Snell et al. [2017], Zhang et al. [2018].

Figure 3: Plot (a) shows the difference between supervised and MoCo pretraining, especially in the initial stages of the sequence. Plot (b) compares the ROC curves for out-of-distribution methods.

Update Strategies

We evaluate the accuracy and total compute cost of varying update frequencies and training epochs (Figure 4). We conduct our experiments with fine-tuning (Figure 4-a) and standard training (Figure 4-b) on a ResNet18 model with supervised pretraining.

We find that the MACs expended during training primarily determine the overall accuracy in the high-MAC regime. In other words, training frequently for a small number of epochs is comparable to training infrequently for a large number of epochs in the high-MAC regime. However, different update strategies result in significantly different accuracies in the low-MAC regime, providing more evidence that these procedural parameters should be inferred by the learner for different settings. Additionally, we find that fine-tuning and standard training behave differently as the total MACs increase. In the case of fine-tuning (Figure 4-a) the accuracy asymptotically increases with total training, while for standard training (Figure 4-b) the performance decreases after an optimal amount of total training.

Figure 4: (a) and (b) compare the accuracy and MACs for various update strategies when fine-tuning and standard training, respectively. The two methods diverge in the high MAC regime where standard training decreases in accuracy if trained for too many epochs.

6 Conclusion

In this work we introduce Ned, an encompassing learning and evaluation framework that 1) encourages an integration of solutions across many sub-fields including supervised classification, few-shot learning, meta-learning, continual learning, and efficient ML, 2) offers more flexibility for learners to specify various parameters of their learning procedure, such as when and how long to train, 3) incorporates the total cost of updating and inference as the main anchoring constraint, and 4) can cope with the streaming, fluid and open nature of the real world. Ned is designed to foster research in devising algorithms tailored toward building more pragmatic ML systems in the wild. This study has already resulted in discoveries (see Section 5) that contradict the findings of less realistic or smaller-scale experiments, which emphasizes the need to move towards more pragmatic setups like Ned. We hope Ned promotes more research at the intersection of decision making and model training to provide more freedom for learners to decide on their own procedural parameters. In this paper, we study various methods and settings in the context of supervised image classification, one of the most explored problems in ML. While we do not make design decisions specific to image classification, incorporating other mainstream tasks into Ned is an immediate next step. Throughout the experiments in this paper, we impose some restrictive assumptions on Ned. Relaxing these assumptions in order to get Ned even closer to the real world is another immediate step for future work. For example, we currently assume that Ned has access to labels as the data streams in. One exciting future direction is to add semi- and un-supervised settings to Ned.

Acknowledgements

This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, 67102239 and gifts from Allen Institute for Artificial Intelligence. We thank Jae Sung Park and Mitchell Wortsman for insightful discussions and Daniel Gordon for the pretrained MoCo weights.

References

  • R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019) Task-free continual learning. Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §2.
  • R. Aljundi, M. Rohrbach, and T. Tuytelaars (2018) Selfless sequential learning. arXiv preprint arXiv:1806.05421. Cited by: §2.
  • A. Bendale and T. Boult (2015) Towards open world recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1893–1902. Cited by: §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020a) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §3.
  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. arXiv preprint arXiv:1904.04232. Cited by: §4.
  • Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell (2020b) A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390. Cited by: Table 1.
  • J. Cho, T. Shon, K. Choi, and J. Moon (2013) Dynamic learning model update of hybrid-classifiers for intrusion detection. The Journal of Supercomputing 64 (2), pp. 522–526. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Appendix A, §3, §3, §3.
  • S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei (2020) Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434. Cited by: §2.
  • U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen (2019) Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134. Cited by: §2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §2, §4, Table 1, §5.
  • D. Gordon, K. Ehsani, D. Fox, and A. Farhadi (2020) Watching the world go by: representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990. Cited by: §4.
  • A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5356–5364. Cited by: §3.
  • B. Hariharan and R. Girshick (2017) Low-shot visual recognition by shrinking and hallucinating features. Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §2.
  • J. Harrison, A. Sharma, C. Finn, and M. Pavone (2019) Continuous meta-learning without tasks. Advances in neural information processing systems. Cited by: §2.
  • J. He, R. Mao, Z. Shao, and F. Zhu (2020) Incremental learning in online scenario. arXiv preprint arXiv:2003.13191. Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1, §3, §4, Table 1, §5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §2, §4, §4, §5.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.
  • L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §3.
  • A. Kusupati, V. Ramanujan, R. Somani, M. Wortsman, P. Jain, S. Kakade, and A. Farhadi (2020) Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning, Cited by: §2.
  • B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum (2011) One shot learning of simple visual concepts.. CogSci. Cited by: §3.
  • Y. LeCun (1998) The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. Cited by: §3.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §2, §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.
  • Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2537–2546. Cited by: §2, §2, §4, §4, Table 1, §5.
  • B. Oreshkin, P. R. López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731. Cited by: §2, §5.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.
  • H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5822–5830. Cited by: Appendix F, §4.
  • D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell (2019) Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pp. 7645–7655. Cited by: §3.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §2.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. International Conference on Learning Representations. Cited by: §2, §5.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.
  • M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to learn without forgetting by maximizing transfer and minimizing inference. Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: Appendix A, §2, §3.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: Table 5, Appendix D, §2, §4, Table 1, §5, §5.
  • Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: §2, §5.
  • S. Thrun (1996) Is learning the n-th thing any easier than learning the first?. Advances in neural information processing systems. Cited by: §2.
  • Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. arXiv preprint arXiv:2003.11539. Cited by: §4, §5.
  • M. Tschannen, O. Bachem, and M. Lucic (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069. Cited by: §3.
  • G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8769–8778. Cited by: §3.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. Advances in neural information processing systems. Cited by: §3.
  • Y. Wang, W. Chao, K. Q. Weinberger, and L. van der Maaten (2019) SimpleShot: revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623. Cited by: §4.
  • Y. Wen, D. Tran, and J. Ba (2020) BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In International Conference on Learning Representations, External Links: Link Cited by: §3.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §5.

Appendix A Dataset Information

The five sequences we pair with Ned are constructed from ImageNet-22K [Deng et al., 2009]. Two sequences (1-2) are for validation, and three (3-5) are for testing. Each sequence contains 1,000 classes; 250 of which are in ImageNet-1K [Russakovsky et al., 2015] (pretrain classes) and 750 of which are only in ImageNet-22K (novel classes). For the test sequences, we randomly select the classes without replacement to ensure that the sequences do not overlap. The validation sequences share pretrain classes because there are not enough pretrain classes (1000) to partition among five sequences. We randomly distribute the number of images per class according to Zipf’s law (Figure 5). For classes without enough images, we fit the Zipfian distribution as closely as possible, which causes a slight variation in the sequence statistics seen in Table 2.
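
A sketch of how such a heavy-tailed allocation of samples per class might be drawn; the Zipf exponent is an illustrative placeholder (the exact parameter is not reproduced here), while the per-class cap of 961 and floor of 1 follow Table 2.

import numpy as np

def zipfian_class_sizes(num_classes=1000, total_samples=90000,
                        exponent=1.0, max_per_class=961, min_per_class=1):
    # Unnormalized Zipf weights: the class at rank r gets weight 1 / r**exponent.
    ranks = np.arange(1, num_classes + 1)
    weights = 1.0 / ranks ** exponent
    sizes = np.round(total_samples * weights / weights.sum()).astype(int)
    # Clip to the images actually available per class and a floor of one image.
    return np.clip(sizes, min_per_class, max_per_class)

sizes = zipfian_class_sizes()
print(sizes[:5], sizes[-5:], sizes.sum())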

Figure 5: The distribution of samples over the classes for Sequences 1 - 5. Classes with fewer than 50 samples are considered tail classes and classes with 50 or more samples are considered head classes for the purpose of reporting.
Sequence # Number of Images Min # of Class Images Max # of Class Images
1 89030   1 961
2 87549 21 961
3 90133 14 961
4 86988   6 892
5 89921 10 961
Table 2: Statistics for the sequences of images used in NED. Sequences 1-2 are for validation and Sequence 3-5 are for testing. The images from ImageNet-22k are approximately fit to a Zipfian distribution with 250 classes overlapping with ImageNet-1k and 750 new classes.

Appendix B Training Depth for Fine Tuning

We explored how training depth affects the accuracy of a model on new, old, common, and rare classes. For this set of experiments, we vary the number of trained layers when fine-tuning for 4 epochs every 5,000 samples on ResNet18 with a learning rate of 0.01 on Sequence 2 (validation). The results are reported in Table 3. We found that training more layers leads to greater accuracy on new classes and lower accuracy on pretrain classes. However, we observed that the number of fine-tuning layers did not significantly affect overall accuracy so for our results on the test sequences (3-5) we only report fine-tuning of one layer (Table 1).

# of Layers | Novel-Head (>50) | Pretrain-Head (>50) | Novel-Tail (<50) | Pretrain-Tail (<50) | Mean Per-Class | Overall
1 41.32 80.96 17.13 66.52 39.19 56.87
2 41.55 80.79 17.40 67.03 39.43 56.79
3 45.82 78.59 19.08 59.52 40.73 57.23
4 46.96 75.44 19.87 53.97 40.39 57.04
5 46.76 75.72 19.97 54.04 40.41 57.04
Table 3: The results for fine-tuning various numbers of layers with a learning rate of 0.01 on Sequence 2. Training more layers generally results in higher accuracy on novel classes, but lower accuracy on pretrain classes. The trade-off between novel and pretrain accuracy balances out, so the overall accuracy is largely unaffected by the depth of training.

Appendix C Results For Other Sequences

We report the mean and standard deviation for all performance metrics across test sequences 3-5 in Table 4. Note that the standard deviation is relatively low, so the methods are consistent across the randomized sequences.

Method | Pretrain | Backbone | Novel-Head (>50) | Pretrain-Head (>50) | Novel-Tail (<50) | Pretrain-Tail (<50) | Mean Per-Class | Overall
Prototypical Networks Sup./Meta Conv-4 5.02±0.05 9.71±0.11 0.64±0.01 1.27±0.04 3.25±0.03 7.82±0.09
Prototypical Networks Meta R18 8.72±0.09 16.84±0.14 7.06±0.03 12.98±0.04 9.46±0.08 11.19±0.12
New-Meta Baseline [Chen et al., 2020b] Sup./Meta R18 41.73±0.57 66.54±2.37 27.54±1.13 53.69±0.97 39.32±0.71 47.74±0.63
MAML Meta Conv-4 2.93±0.01 2.02±0.02 0.15±0.01 0.1±0.01 1.11±0.02 3.64±0.06
Fine-Tune MoCo R18 5.31±0.24 45.95±1.27 0.03±0 26.23±0.88 10.64±0.23 18.52±0.98
Fine-Tune Sup. R18 43.2±0.65 74.55±2.53 22.79±1.21 59.63±1.02 40.9±0.73 53.06±0.65
Standard Training MoCo R18 26.9±0.27 42.39±3.04 9.1±0.74 21.11±0.51 20.76±0.32 34.85±0.75
Standard Training Sup. R18 38.82±0.49 65.88±2.32 16.15±0.83 44.3±0.91 33.63±0.38 48.81±0.57
NCM MoCo R18 19.31±0.06 30.02±1.69 14.21±0.46 22.06±0.52 18.86±0.13 22.14±1.24
NCM Sup. R18 41.68±0.65 70.05±2.29 31.24±0.86 57.23±0.97 42.87±0.62 47.89±0.76
OLTR MoCo R18 41.47±0.03 31.48±0.01 17.48±0.01 9.81±0.01 22.03±0 38.33±0.01
OLTR Sup. R18 51.19±0.37 37.02±0.51 24.14±0.14 13.77±0.24 27.6±0.28 44.46±0.44
Prototype Tuning MoCo R18 41.47±3.54 31.48±0.4 17.48±0.49 9.81±0.12 22.03±0.44 38.33±1.24
Prototype Tuning Sup. R18 46.36±2.31 69.79±0.4 24.38±1.43 46.82±0.82 41.34±0.48 54.17±0.74
Fine-Tune MoCo R50 45.95±0.26 5.31±0.32 26.23±0.07 0.03±1.74 10.64±0.21 18.52±1.02
Fine-Tune Sup. R50 47.59±0.65 80.14±1.71 26.69±0.97 66.92±1.4 45.62±0.6 57.48±0.47
Standard Training MoCo R50 43.93±0.73 71.72±3.18 20.84±0.92 51.43±0.68 38.94±0.9 53.45±1.73
Standard Training Sup. R50 47.59±0.45 80.14±2.59 26.69±0.79 66.92±1.91 45.62±0.47 57.48±0.56
NCM MoCo R50 30.15±0.48 53.84±1.05 23.99±0.53 44.11±1.11 32.27±0.92 35.45±0.61
NCM Sup. R50 45.46±0.95 76.55±1.77 35.47±0.82 65.62±1.57 47.77±0.65 52.22±0.55
Prototype Tuning MoCo R50 28.46±3.04 40.42±1.33 7.57±2.15 14.36±4.14 19.54±2.63 32.07±2.37
Prototype Tuning Sup. R50 49.24±1.55 75.78±1.84 26.67±2.17 55.63±2.31 44.15±1.44 57.68±1.02
Table 4: Averaged results for all methods evaluated on Sequences 3-5. See Table 1 for the computational cost (GMACs) for each method and more information about each column.

Appendix D Prototypical Network Experiments

We benchmarked our implementation of Prototypical Networks on few-shot baselines to verify that it is correct. We ran experiments for training on both MiniImageNet and regular ImageNet-1k and tested our implementation on the MiniImageNet test set and NED (Sequence 2). We found comparable results to those reported by the original Prototypical Networks paper [Snell et al., 2017] (Table 5).

Method | Backbone | Train Set | MiniImageNet (5-Way 5-Shot) | NED
Prototypical Networks Conv - 4 MiniImageNet 69.2 7.54
Prototypical Networks Conv - 4 ImageNet (Train) 42.7 7.82
Prototypical Networks Conv - 4 MiniImageNet 68.2 -
Table 5: Results for our implementation of Prototypical Networks on MiniImageNet and NED. The final row shows the MiniImageNet result reported by Snell et al. [2017].

Appendix E Out-of-Distribution Ablation

Figure 6: The accuracy for the in-distribution (IND) and out-of-distribution (OOD) samples as the threshold for considering a sample out-of-distribution varies, shown for (a) NCM + MDT, (b) Prototype Tuning + MDT, (c) Fine-Tune + Max Logits, (d) OLTR, (e) NCM + Softmax, (f) Prototype Tuning + Softmax, (g) Fine-Tune + Softmax, and (h) Full Train + Softmax. The horizontal axis is the threshold value, and the vertical axis is the accuracy. Intersection of the IND and OOD curves at a higher accuracy generally indicates better out-of-distribution detection for a given method.

In this section we report AUROC and F1 for MDT and softmax for all baselines. In Section 5 we only included OLTR, MDT with NCM, and standard training with maximum softmax (the Hendrycks and Gimpel baseline). Additionally, we visualize the accuracy curves for in-distribution and out-of-distribution samples as the rejection threshold varies (Figure 6). All the OOD experiments presented in Figure 6 and Table 6 were run using ResNet18. Minimum Distance Thresholding (MDT) generally works better than maximum softmax when applied to most methods.

The results of NCM and Prototype Tuning using softmax and cosine similarity, in comparison to OLTR, are shown in Table 6. The F1 scores are low due to the large imbalance between positive and negative classes: there are only 750 unseen-class datapoints compared to the far more numerous negative datapoints. Table 6 shows that cosine similarity (MDT) is better than softmax or the OLTR model for most methods.

Metric | NCM + Softmax | NCM + MDT | Prototype Tuning + Softmax | Prototype Tuning + MDT | Standard Training + Softmax | Standard Training + Max Logits | Fine-Tune + Softmax | Fine-Tune + Max Logits | OLTR
AUROC 0.07 0.85 0.76 0.92 0.59 0.53 0.68 0.72 0.78
F1 0.01 0.20 0.04 0.20 0.03 0.02 0.06 0.10 0.27
Table 6: The out-of-distribution performance for each method on sequence 5. We report the AUROC and the F1 score achieved by choosing the best possible threshold value.

Appendix F Weight Imprinting and Prototype Tuning

Weight Imprinting [Qi et al., 2018] is conceptually very similar to Prototype Tuning. Both use a combination of NCM with fine-tuning; however, we find there are a few key differences that greatly affect performance in the NED framework. The first difference is that Prototype Tuning uses a standard linear layer for classification while Weight Imprinting utilizes a cosine classifier with a learnable scaling temperature for the softmax. We find that during the fine-tuning phase the linear classifier significantly outperforms Weight Imprinting as well as a Euclidean classifier and a regular cosine classifier (Table 7). Note that the learnable scaling temperature does improve Weight Imprinting over the cosine classifier by 15% in overall accuracy, but it is still 9% lower than Prototype Tuning. We evaluated Weight Imprinting with the scaling temperature initialized at 1, 2, 4, and 8. The other aspect in which Weight Imprinting differs from Prototype Tuning is the construction of the nearest class means (NCMs). Weight Imprinting calculates the NCMs for the entire dataset and then fine-tunes the NCMs on that same data, whereas Prototype Tuning calculates NCMs only for the small portion of data at the beginning of the stream, then fine-tunes on future data and does not recalculate the NCMs. Overall, we find that Prototype Tuning requires less hyper-parameter tuning and significantly outperforms Weight Imprinting in the NED framework (Table 7).

Method | Pretrain | Backbone | Novel-Head (>50) | Pretrain-Head (>50) | Novel-Tail (<50) | Pretrain-Tail (<50) | Mean Per-Class | Overall
Weight Imprinting (s = 1) Sup R18 36.58 63.39 9.32 21.80 26.85 46.35
Weight Imprinting (s = 2) Sup R18 36.58 63.39 9.32 21.80 26.85 46.35
Weight Imprinting (s = 4) Sup R18 40.32 67.46 15.35 34.18 32.69 48.51
Weight Imprinting (s = 8) Sup R18 31.18 32.66 34.77 28.94 32.56 46.67
Prototype Tuning (Cosine) Sup R18 33.90 18.22 4.84 1.88 11.72 31.81
Prototype Tuning (Euclidean) Sup R18 43.40 66.32 21.66 42.06 37.19 51.62
Prototype Tuning (Linear) Sup R18 48.56 71.41 24.16 47.51 41.32 56.79
Table 7: Comparison of Weight Imprinting and Prototype Tuning with different classifiers and initial temperatures. Prototype Tuning with a linear layer performs significantly better than all other variants.