Generative Low-Shot Network Expansion

by   Adi Hayat, et al.

Conventional deep learning classifiers are static in the sense that they are trained on a predefined set of classes and learning to classify a novel class typically requires re-training. In this work, we address the problem of Low-Shot network expansion learning. We introduce a learning framework which enables expanding a pre-trained (base) deep network to classify novel classes when the number of examples for the novel classes is particularly small. We present a simple yet powerful hard distillation method where the base network is augmented with additional weights to classify the novel classes, while keeping the weights of the base network unchanged. We show that since only a small number of weights needs to be trained, the hard distillation excels in low-shot training scenarios. Furthermore, hard distillation avoids detriment to classification performance on the base classes. Finally, we show that low-shot network expansion can be done with a very small memory footprint by using a compact generative model of the base classes training data with only a negligible degradation relative to learning with the full training set.



There are no comments yet.


page 1

page 2

page 3

page 4


Incremental Few-Shot Learning with Attention Attractor Networks

Machine learning classifiers are often trained to recognize a set of pre...

Label Hallucination for Few-Shot Classification

Few-shot classification requires adapting knowledge learned from a large...

Dual-Teacher Class-Incremental Learning With Data-Free Generative Replay

This paper proposes two novel knowledge transfer techniques for class-in...

Few-Shot Few-Shot Learning and the role of Spatial Attention

Few-shot learning is often motivated by the ability of humans to learn n...

Generating Classification Weights with GNN Denoising Autoencoders for Few-Shot Learning

Given an initial recognition model already trained on a set of base clas...

Learning without Memorizing

Incremental learning (IL) is an important task aimed to increase the cap...

Impact of base dataset design on few-shot image classification

The quality and generality of deep image features is crucially determine...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In many real-life scenarios, a fast and simple classifier expansion is required to extend the set of classes that a deep network can classify. For example, consider a cleaning robot trained to recognize a number of objects. After deployment, the robot is likely to encounter novel objects which it was not trained to classify. It is desired to be able to update and expand the robot classifier to classify novel objects. In such a scenario, the update should be a simple procedure, based on a small collection of images captured in a non-controlled setting. This scenario is illustrated in Figure 1: we wish to update and expand a robot classifier to classify novel classes in the deployed setting.

The low-shot network update should be fast and without requiring access to the entire training set of previously learned data. A common solution to classifier expansion is fine-tuning the network [1]. However fine-tuning requires collecting sufficient examples of the novel classes, in addition to keeping a large amount of base training data in memory, to avoid catastrophic forgetting [2]. In striking contrast, for some tasks, humans are capable of instantly learning novel categories. Using one or only a few training examples, humans are able to learn a novel class, without compromising previously learned abilities or having access to training examples from all previously learned classes.

Fig. 1: Problem Setup: We desire to be able to adapt the base classifier to the Novel classes in the deployed setting.

We consider the classifier expansion problem under the following constraints:

  1. Low-shot: very few samples of the novel classes are available.

  2. No forgetting: preserving classification performance on the base classes.

  3. Small memory footprint: no access to the base classes training data.

In this work, we introduce a low-shot network expansion technique, augmenting the capability of an existing (base) network trained on base classes by training additional parameters that enable to classify novel classes.

To satisfy low-shot along with no-forgetting constraints, we present a hard distillation

framework. Distillation in neural networks


is a process for training a target network to imitate another network. A loss function is added to the target network so that its output matches the output of the mimicked network. In standard

soft distillation, the trained network is allowed to deviate from the mimicked network. Whereas hard distillation enforces that the output of the trained network for base classes matches the output of the mimicked network as a hard constraint. Network expansion with hard distillation yields a larger network, distilling the knowledge of the base network in addition to the augmented capacity to classify novel classes. We show that in the case of low-shot (only 1–15 examples of a novel class), hard distillation outperforms soft distillation.

To maintain a small memory footprint, we refrain from saving the entire training set. Instead, we present a compact generative model, consisting of a collection of generative models fitted in the feature space to each of the base classes. We use a Gaussian Mixture Model (GMM) with a small number of mixtures, and show it inflicts a minimal degradation in classification accuracy. Sampling from the generative GMM model is fast, reducing the low-shot training time and allowing fast expansion of the network.

We define a benchmark for low-shot network expansion. The benchmark is composed of a series of tests of increasing complexity. We perform a comprehensive set of experiments on this challenging benchmark, comparing the performance of the proposed to alternative methods.

Ii Related Works

A common solution to the class-incremental learning problem is to use a Nearest-Neighbors (NN) based classifier in feature space. A significant advantage of an NN-based classifier is that it can be easily extended to classify a novel class, even when only a single example of the class is available (one-shot learning). However NN-based classifiers keep in the memory significant amount data. [4]

proposed to use Nearest Class Mean (NCM) classifier, where each class is represented by a single prototype example which is the mean feature vector of all class examples. One major disadvantage of NCM and NN-based methods is that they are based on a fixed feature representation of the data. To overcome this problem

[4] proposed to learn a new distance function in the feature space using metric learning.

The Incremental Classifier and Representation Learning (iCaRL) method [5] aims to solve the class-incremental learning problem using the Nearest-Mean-of-Exemplars classifier method. Feature representation is updated and the class means are recomputed from a small stored number of representative examples of the base classes. During the feature representation update, the network parameters are updated by minimizing a combined classification and distillation loss. The iCaRL method was introduced as a class-incremental learning method for large training sets.

In [6] a Squared Gradient Magnitude regularization technique was proposed that improves the fixed feature representation for low-shot scenarios. They also propose to hallucinate additional training examples from the novel classes. In contrast, we present a method which aims to maximize the performance in low-shot network expansion given a fixed representation.

In Progressive Network [7], new tasks are learned without affecting the performance of old tasks by freezing the parameters of old tasks and expanding the network with additional layers to solve new tasks. Progressive learning [8] solves the problem of online sequential learning in extreme learning machines (ELM). The purpose of their work is to incrementally learn the last fully-connected layer of the network. In [9] was proposed an incremental learning technique which augments the base network with additional parameters in the last fully connected layer to classify novel classes. Similar to iCaRL, it performs soft distillation by learning all parameters of the network. The phantom sampling for hallucinating data from past distribution modeled with Generative Adversarial Networks was used instead of keeping historical training data.

In this work, we propose a solution that borrows ideas from the freeze-and-expand paradigm, improved feature representation learning, network distillation and modeling past data with a generative model. We propose to expand the last fully connected layer of a base network to classify novel classes. Moreover, the deeper layers may be also expanded to improve the feature representation. However, in contrast to previous methods [5, 8], we do not retrain the base network parameters, but only train the expanded parts of the network. The extended feature representation is learned from samples of base and novel classes. Finally, in order to avoid keeping all of the historical training data, we use a GMM of the feature space as a generative model for the base classes.

Iii The proposed method

Assume a deep neural network is trained on

base classes with the full set of training data. This base network can be partitioned into two subnetworks: a feature extraction network and a classification network. The feature extraction network

maps an input sample into a feature representation . The classification network maps feature vectors

into a vector of approximated class posterior probabilities

which correspond to each one of classes. The whole network can be represented as composition of two networks .

In the following, we discuss how the pre-learned feature representation of feature extraction network can be leveraged to classify additional classes in a low-shot scenario with only relatively minor changes to the classification subnetwork.

Iii-a Expansion of the last layer of classification subnetwork

First, we discuss how to expand the classification network to classify one additional class. We can expand from a -class classifier into class classifier by adding a new weight vector to the last FC layer. Thus, the class probability is , where is a new normalization factor for classes. We would like to preserve classification accuracy on the base classes to avoid catastrophic forgetting. To that end, during training we constrain to optimize of the weights, while the vectors are kept intact. We refer to this paradigm as hard distillation. By preserving the base classes weight vectors, we guarantee that as a result of the last classification layer expansion the only new errors that can appear are between the novel class and the base classes, but not among the base classes. Moreover, the small number of newly learned parameters helps avoid over-fitting, which is especially important in low-shot scenarios.

Similarly, we can expand the classification network to classify more than one novel class.

Fig. 2: (a) The proposed Gen-LSNE overview, generating feature representation of base classes to train the expansion. (b) Training the last two layers, learning shared representation in addition to the per novel class weights expansion: are samples of feature vector generations of base class , are the novel class feature vector measurements, are the number of input features to , are the number of input feature to before the expansion.

Iii-B Deep Feature GMM - Generative model for base classes

Due to the small memory footprint constraint, we are unable to keep the entire training data of the base classes. As an alternative, we can use a generative model of the base classes and during training draw samples from the model. There are various approaches to this task, such as GAN [10], VAE [11], Pixel CNN [12]

, or conventional methods of non-parametric kernel density estimation

[13]. However, it is usually hard to generate accurate samples from past learned distributions in the image domain, and these methods still require a significant amount of memory to store the model network parameters. Furthermore, since training typically requires thousands of samples, we prefer a generative model that allows fast sampling to reduce the low-shot phase training time.

In our work, we use the Gaussian Mixture Model (GMM) density estimator as an approximate generative model of the data from the base classes. However, instead of approximating the generative distribution of the image data, we approximate a class conditional distribution of its feature representation. Thus, we model a GMM , where is the number of mixtures for each base class. In order to satisfy the small memory footprint constraint, we use a GMM which assumes feature independence, i.e., the covariance matrix of each Gaussian mixture is diagonal. We denote this model as Deep Feature GMM. If we have classes, and the feature vectors dimensionality is , the memory requirements for storing information about base classes is . The feature representation , which we learn a generative model for, can be from the last fully connected layer or from deeper layers. In Section IV-E

, we evaluate the effectiveness of the use of the Deep Features GMM, showing that despite its compact representation, there is a minimal degradation in accuracy when training a classifier based only on data that is generated from the Deep Features GMM, compared to the accuracy obtained on the full training data.

Iii-C Low-Shot Training

We apply standard data augmentation (random crop, horizontal flip, and color noise) to the input samples of the novel classes and create 100 additional samples variants from each of the novel class samples. These samples are passed through the feature extraction network to obtain their corresponding feature representation. Note that new samples and their augmented variants are passed through only once.

As described in Section III-A, we expand the classification subnetwork and train the expanded network to classify novel classes in addition to the base classes. Figure 2(a) illustrates the proposed method in the case where is the last fully connected layer. As mentioned above, we only learn the dimensional vector , which augments the weight matrix of the FC layer.

Each training batch is composed of base classes feature vectors drawn from the Deep Features GMM models learned from the base classes training data and the available samples of a novel class. The training batch is balanced to have an equal number of generations/samples per class.

Since the forward and backward passes are carried out by only the last FC layers, each iteration can be done very rapidly. We use SGD with gradient dropout (see below) to learn . More specifically, the weights update at step is done by:

where is the momentum factor, is the learning rate and is a binary random mask with probability of being ( is randomly generated at each iteration throughout the low-shot training). That is, the gradient update is applied to a random subset of the learned weights. In Section IV-C we demonstrate the contribution of the gradient dropout when only a few novel labeled samples are available.

Iii-D Expansion of Deeper Layers for Learning Representation

The procedure described in the previous subsections expands the last classification layer but does not change the feature representation space. In some cases, especially in those which the novel classes are similar to the base classes, it is desirable to update and expand the feature representation.

To expand the feature representation, we add new parameters to deeper layers of the network. This, of course, requires an appropriate expansion of all subsequent layers. To satisfy the hard distillation constraints, we enforce that the feature representation expansion does not affect the network output for the base classes. All weights in subsequent layers which connects the expanded representation to the base classes are set to zero and remain unchanged during learning. In Figure  2(b) we demonstrate an expansion of two last fully connected layers. The

weight matrix is zero padded to adjust to the new added weights in

. Only the expansion to uses the newly added features in . The details of the representation learning expansion can be found in Supplementary Materials (Section S3).

Iv Experiments

In this section, we evaluate the proposed low-shot network expansion method on several classification tasks. We design a benchmark which measures the performance of several alternative low-shot methods in scenarios that resemble real-life problems, starting with easier tasks (Scenario 1) to harder tasks (Scenario 2 & 3). In each experiment, we use a standard dataset that is partitioned into base classes and novel classes. We define three scenarios:

  • []

  • Scenario 1, Generic novel classes: unconstrained novel and base classes which can be from different domains.

  • Scenario 2, Domain specific with similar novel classes: base and novel classes are drawn from the same domain and the novel classes share visual similarities among themselves.

  • Scenario 3, Domain specific with similar base and novel classes: base and novel classes are drawn from the same domain and each novel class shares visual similarities with one of the base classes.

In each scenario we define five base classes (learned using the full train set) and up to five novel classes, which should be learned from up to 15 samples only. We compare the proposed method to several alternative methods for low-shot learning described in Section IV-B.

Iv-a Datasets for Low-Shot Network Expansion scenarios

Dataset for Scenario 1

For the task of generic classification of the novel classes, we use the ImageNet dataset

[14], such that the selected classes were not part of the ILSVRC2012 1000 classes challenge. Each class has at least 1000 training images and 250 test images per class. We randomly selected 5 partitions of 5 base classes and 5 novel classes.

Dataset for Scenario 2 and Scenario 3

For these scenarios, we use the UT-Zappos50K [15] shoes dataset for fine-grained classification. We choose 10 classes representing different types of shoes each having more than 1,000 training images and 250 test images.

To define similarity between the chosen classes, we fine-tune the base network (VGG-19 [16]

) on the selected classes with the full dataset, and we use the confusion matrix as a measure of similarity between classes. Using the defined similarities, we randomly partition the 10 classes to 5 base and 2 novel classes, where for Scenario 2 we enforce similarity between novel classes, and for Scenario 3 we enforce similarity between novel and base classes. The confusion matrix is presented in Figure S2(b) in Supplementary Materials.

Iv-B Evaluated Methods

In the proposed method we use the VGG-19 network [16] trained on ImageNet ILSVRC2012 [14] 1000 classes as a feature extraction subnetwork . In all three scenarios for training the classification subnetwork on the base classes, we fine-tune the last two fully-connected layers of VGG-19 on the 5 selected base classes, while freezing the rest of the layers of .

We denote the method proposed in Section III as Generative Low-Shot Network Expansion: Gen-LSNE. We compare our proposed method to NCM [4], and to the


method which is an extension of NCM and the soft distillation based method inspired by iCaRL method [5], adapted for the low-shot scenario.

Iv-B1 NCM & Prototype-kNN

We compare the proposed method to NCM classifier proposed by [4]. Additionally, we extend the NCM classifier by using multiple prototypes for each class, as in the Prototype-kNN classifier [13]. Both NCM and Prototype-kNN are implemented in a fixed feature space of the FC2 layer of the VGG-19 network. In our implementation of the Prototype-kNN, we fit a Deep Features GMM model with 20 mixtures for each of the base classes. We extract feature representation of all of the available samples from the novel classes. The Deep Features GMM centroids of the base feature vectors and the novel feature vectors of the samples are considered as prototypes of each class. We set for Prototype-kNN classifier to be the smallest number of prototypes per class (the number of prototypes in the novel classes is lower than the number of mixtures in the base classes). The Prototype-kNN classification rule is the majority vote among nearest neighbors of the query sample. If the majority vote is indecisive, that is, there are two or more classes with the same number of prototypes among the nearest neighbors of the query image, we repeat classification with .

Iv-B2 Low-Shot with Soft Distillation

We want to measure the benefit of the hard distillation constraint in the low-shot learning scenario. Thus, we formulate a soft distillation based method inspired by iCaRL [5] and methods described by [9] and [8] as an alternative to the proposed method.

In the iCaRL method, feature representation is updated by re-training the whole representation network. Since in low-shot scenario we have only a small number of novel class samples, updating the whole representation network is infeasible. Using the soft distillation method, we adapt to the low-shot scenario by updating only the last two fully connected layers , but still use a combination of distillation and classification loss as in the iCaRL method.

The iCaRL method stores a set of prototype images and uses the Nearest Mean Exemplar (NME) classifier at the final classification stage. In order to provide a fair comparison with the hard distillation method and uphold our memory restriction, we avoid storing prototypes in the image domain and use the proposed Deep-Features GMM as a generative model for the base-classes.

To summarize, soft distillation applies a distillation loss and allows the layers to adjust to the new data, while the proposed hard-distillation freezes and trains only the new (expanded) parameters without using a distillation loss. We denote the soft distillation based methods as Soft-Dis in the presented results.

Iv-B3 Gradient Dropout

In Section III-C we proposed using gradient dropout regularization on SGD as a technique to improve convergence and overcome over-fitting in a low-shot scenario. We perform ablation experiments to assess the importance of the gradient dropout and train using both soft distillation (Soft-Dis) and proposed hard distillation (Gen-LSNE) with and without gradient dropout regularization.

Iv-C Results: Expansion of the last fully connected layer

Scenario 1 : Base + Novel Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 19.81 23.01 19.89 20.29 19.25
NCM 21.3 9.84 8.89 7.92 7.71
Soft-Dis+GradDrop 21.46 12.04 9.45 7.48 6.42
Soft-Dis 21.31 12.41 9.82 7.48 6.5
Gen-LSNE+GradDrop 15.21 9.82 8.72 7.77 7.54
Gen-LSNE 17.11 9.82 8.46 7.15 6.64
Scenario 2 : Base + Novel Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 34.12 40.08 33.08 32.95 30.65
NCM 29.52 22.04 20.79 20.11 19.37
Soft-Dis+GradDrop 27.58 22.95 20.79 19.02 17.48
Soft-Dis 28.31 24.85 21.99 20.53 18.13
Gen-LSNE+GradDrop 26.68 21.15 20.58 19.36 18.5
Gen-LSNE 27.06 21.1 19.79 18.53 17.4
Scenario 3 : Base + Novel Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 32.72 38.3 33.21 31.06 29.57
NCM 29.48 21.98 22.78 21.53 20.79
Soft-Dis+GradDrop 30.17 24.0 22.64 20.86 18.4
Soft-Dis 30.22 24.45 23.36 20.98 18.78
Gen-LSNE+GradDrop 24.23 20.44 20.42 19.45 18.37
Gen-LSNE 25.8 20.83 20.35 19.39 17.77

TABLE I: Expansion of last fully connected layer: Top-1 Test Error on the proposed-method, Prototype-kNN, NCM and Soft-Dis: the average Test Error on all 7 classes (base + novel).
Scenario 1: Generic novel classes

In this experiment, the base classification network is trained on five base classes and then expanded to classify two novel classes chosen at random. For each of the five class partitions (Section IV-A), we perform five trials by randomly drawing two novel classes from five novel classes available in the partition. The results are an average of 25 trials. The results of this experiment are presented in Table  I(a). In Table IV(a) we present detailed results of the test error on the base and novels classes apart. Prototype-kNN and the Soft-Dis methods perform better on the base classes. However, our method is significantly better on the novel classes and the overall test error is considerably improved, particularly when the number of samples is small. In addition, we see the significant gain in accuracy delivered by the gradient dropout when the number of novel samples is lower than 3 samples. Furthermore, gradient dropout also improves the results of the Soft-Dis method.

NCM generally performs considerably better than Prototype-kNN in the Low-Shot scenario, despite the use of less information from the base classes. However, NCM is unable to effectively utilize more novel samples when they are available. Gen-LSNE significantly outperforms NCM with a single novel sample, and overall outperforms all the tested method with nine and below samples per novel class.

Scenario 2 & 3: Domain specific with similar novel-to-novel and novel-to-base classes

As described in Section  IV-A.b, in each scenario we have 5 partitions with five base classes and two novel classes. The results are an average of 5 trials. The result of the experiments are presented in Table I(b,c). In Scenario-2 & Scenario-3 we see that the proposed method consistently outperforms the Soft-Dis, NCM and Prototype-kNN methods. Training Gen-LSNE with gradient dropout improves results in cases with 1 & 3 novel samples per class, especially in Scenario-3. In Table IV(b,c) we present detailed results of the test error on base and novels classes apart.

Iv-D Results: Expansion of Deeper Layers for Learning Representation

In this section, we explore the effect of the expansion of deeper layers, as described in Section III-D. We partition the datasets as defined in IV-A to five base and five novel classes, and we test a 10 classes classification task. We expand the feature representation which is obtained after layer with 5 new features. The size of the feature representation after the FC1 layer of VGG-19 is of dimension 4k. Thus, is expanded with new weights. The results are averaged over 5 trails (randomly selecting the base/novel classes). Table  II shows the results obtained, we denote +5Inner as the experiments with the additional five shared representation features.

We see a marginal gain in Scenario 1. However, we observe a significant gain in Scenario 2 and 3 when the number of samples increases (especially Scenario 2).

Scenario 1: Generic novel classes
Base + Novel Top-1 Test Error(%)
Method /#Samples 1 2 3 5 7 9 11 15
Prototype-KNN 37.76 38.49 36.98 36.66 35.36 33.97 33.45 33.23
NCM 36.08 22.79 17.18 15.54 14.96 13.7 13.37 13.17
Soft-Dis 37.39 26.34 20.46 15.6 14.04 12.18 11.4 10.62
Soft-Dis+5Inner 37.71 26.32 20.83 15.69 14.09 12.16 11.32 10.59
Gen-LSNE 28.51 20.92 17.15 14.55 13.35 11.98 11.27 10.9
Gen-LSNE+5Inner 28.8 21.0 17.14 14.46 13.15 11.7 11.13 10.44

Scenario 2: Domain specific with similar novel classes
Base + Novel Top-1 Test Error(%)
Method /#Samples 1 2 3 5 7 9 11 15
Prototype-KNN 41.28 40.37 39.63 39.96 39.11 37.31 38.06 36.42
NCM 41.61 36.6 32.74 29.16 28.02 27.49 27.57 27.24
Soft-Dis 41.0 38.49 36.75 31.26 28.96 27.58 27.01 25.78
Soft-Dis+5Inner 41.3 38.43 37.18 31.53 28.98 27.39 27.06 25.57
Gen-LSNE 38.52 35.95 33.2 29.09 27.47 26.42 26.71 25.87
Gen-LSNE+5Inner 39.11 36.45 33.95 28.89 26.74 25.62 25.76 24.99
Scenario 3: Domain specific with similar class in base
Base + Novel Top-1 Test Error(%)
Method /#Samples 1 2 3 5 7 9 11 15
Prototype-KNN 48.69 46.81 48.7 47.96 45.02 43.72 44.17 44.35
NCM 48.91 38.62 34.47 31.98 30.72 29.91 29.34 29.49
Soft-Dis 49.47 42.8 39.49 35.18 32.64 29.97 29.39 27.75
Soft-Dis+5Inner 49.52 42.64 39.45 35.35 32.42 30.22 29.36 27.81
Gen-LSNE 41.0 34.29 32.5 29.93 28.43 27.26 26.52 25.93
Gen-LSNE+5Inner 42.07 34.37 33.45 30.07 28.09 26.58 26.12 25.7
TABLE II: Expansion of Deeper Layers for Learning Representation: showing performance obtained with learning additional 5 shared inner features, +5Inner marks the addition of the shared expanded features: (a) averaged results on Scenario 1 , (b) averaged results on Scenario 2, (c) averaged results on Scenario 3.

Iv-E Results: Deep-features GMM Evaluation

In the Deep-features GMM evaluation experiment, we feed the full training data to the base network and collect the feature vectors before , i.e., two FC layers before the classification output. We fit a GMM model to the feature vectors of each of the base classes with a varying number of mixtures. We train the two last FC layers of the base network from randomly initialized weights, where the training is based on generating feature vectors from the fitted GMM. We measure the top-1 accuracy on the test set of the networks trained with GMM models and the base network trained with full training data on the datasets defined in IV-A. The difference in top-1 accuracy between the network trained with full data and the networks trained with GMM models represent degradation caused by compressing the data with a simple generative model. The results of the experiment presented in the Table III demonstrate that learning with samples from GMM models commonly causes only a negligible degradation relative to learning with a full training set. Together with the Hard-Distillation constraint. Deep-Feature GMM is sufficient to imitate the presence of the inaccessible base class data.

Top-1 Accuracy(%)
Dataset /# Mixtures Full 1 10 20 40 60

95.3 91.94 94.03 94.19 94.03 94.57
imagenet-group2-base 98.0 93.83 97.04 96.63 96.54 97.37
imagenet-group3-base 98.2 94.40 96.81 97.45 97.09 96.52
imagenet-group4-base 98.8 95.60 98.16 98.01 98.30 98.58
imagenet-group5-base 99.0 97.26 98.26 98.01 98.01 98.26
ut-zap-scenario3-base 89.5 73.23 85.34 85.10 85.50 85.50
ut-zap-scenario2-novel 86.5 81.81 80.59 78.97 78.92 81.27
ut-zap-scenario2-base 91.9 82.15 87.73 88.45 91.16 90.68
TABLE III: Deep-Features GMM Evaluation: Full stands for Fine tuning FC7,FC8 with the full training data.

V Concluding Remarks

We have introduced Gen-LSNE , a technique for low-shot network expansion. The method is based on hard-distillation, where pre-trained base parameters are kept intact, and only a small number of parameters are trained to accommodate the novel classes. We presented and evaluated the advantages of hard-distillation: (i) it gains significant increased accuracy (up to ) on the novel classes, (ii) it minimizes forgetting: less than drop in accuracy on the base classes, (iii) a small number of trained parameters avoids over-fitting, and (iv) the training for the expansion is fast. We have demonstrated that our method excels when only a few novel images are provided, rendering our method practical and efficient for a quick deployment of the network expansion.

We have also presented Deep–Features GMM for effective base class memorization. This computationally and memory efficient method allows training the network from a generative compact feature-space representation of the base classes, without storing the entire training set. Finally, we have shown that the learned representation can be extended based on Low-Shot novel observations to support better discrimination of novel classes.

In the future, we would like to continue exploring hard-distillation methods and extremely low-shot classifier expansion for robotic applications, aspiring towards human-level low-shot learning.

Vi Supplementary Materials

=0mu plus 0mu Supplementary materials can be found at:

Base Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 6.27 3.01 2.4 2.34 2.31
NCM 2.28 3.23 3.73 4.57 5.0
Soft-Dis+GradDrop 2.09 2.27 2.43 2.7 2.98
Soft-Dis 2.21 2.36 2.41 2.65 2.85
Gen-LSNE+GradDrop 2.64 3.75 4.3 4.84 5.43
Gen-LSNE 2.34 2.96 3.5 3.99 4.33

Novel Top-1 Test Error(%)
1 3 5 9 15
53.69 73.03 63.64 65.17 61.59
68.83 26.36 21.79 16.28 14.47
69.86 36.44 27.01 19.43 15.0
69.06 37.53 28.34 19.55 15.62
46.62 25.0 19.75 15.1 12.82
54.02 26.96 20.85 15.03 12.43


Base Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 19.2 23.05 18.08 18.35 15.61
NMC 12.51 13.42 14.29 14.53 14.79
Soft-Dis+GradDrop 10.04 10.22 10.42 10.58 10.85
Soft-Dis 11.41 11.82 11.6 12.22 11.73
Gen-LSNE+GradDrop 10.57 11.75 12.57 12.88 13.24
Gen-LSNE 10.44 10.98 11.87 11.97 12.33

Novel Top-1 Test Error(%)
1 3 5 9 15
71.43 82.65 70.57 69.43 68.26
72.04 43.6 37.02 34.06 30.82
71.42 54.8 46.72 40.12 34.06
70.56 57.42 47.96 41.28 34.14
66.96 44.66 40.58 35.58 31.64
68.62 46.4 39.58 34.94 30.08


Base Top-1 Test Error(%)
Method /NumSamples 1 3 5 9 15
Prototype-KNN 17.83 16.22 13.48 11.49 10.63
NMC 9.75 12.3 14.15 15.65 16.24
Soft-Dis+GradDrop 7.76 7.89 8.11 8.7 8.88
Soft-Dis 7.82 7.9 8.1 8.53 8.75
Gen-LSNE+GradDrop 7.87 9.93 11.38 12.21 12.59
Gen-LSNE 7.67 8.7 10.29 10.88 11.57

Novel Top-1 Test Error(%)
1 3 5 9 15
69.93 93.5 82.52 79.98 76.91
78.8 46.18 44.34 36.23 32.16
86.2 64.29 58.97 51.25 42.21
86.23 65.81 61.52 52.11 43.85
65.12 46.72 43.04 37.57 32.82
71.13 51.15 45.49 40.64 33.26


TABLE IV: Base & Novel Accuracy Apart: Top-1 average test error rate on the proposed-method, Prototype-kNN, NCM and Soft-Dis base classes (5 from 7) and novel classes (2 from 7) apart: (a) Scenario 1, Generic novel classes (b) Scenario 2, Domain specific with similar novel classes (c) Scenario 3, Domain specific with similar base and novel classes.