In many real-life scenarios, a fast and simple classifier expansion is required to extend the set of classes that a deep network can classify. For example, consider a cleaning robot trained to recognize a number of objects. After deployment, the robot is likely to encounter novel objects which it was not trained to classify. It is desirable to update and expand the robot's classifier so that it can classify these novel objects. In such a scenario, the update should be a simple procedure, based on a small collection of images captured in a non-controlled setting. This scenario is illustrated in Figure 1.
The low-shot network update should be fast and should not require access to the entire training set of previously learned data. A common solution to classifier expansion is fine-tuning the network. However, fine-tuning requires collecting sufficient examples of the novel classes, in addition to keeping a large amount of base training data in memory, to avoid catastrophic forgetting. In striking contrast, for some tasks, humans are capable of instantly learning novel categories. Using one or only a few training examples, humans are able to learn a novel class without compromising previously learned abilities or having access to training examples from all previously learned classes.
We consider the classifier expansion problem under the following constraints:
Low-shot: very few samples of the novel classes are available.
No forgetting: preserving classification performance on the base classes.
Small memory footprint: no access to the base classes training data.
In this work, we introduce a low-shot network expansion technique, augmenting the capability of an existing (base) network trained on base classes by training additional parameters that enable it to classify novel classes.
To satisfy the low-shot and no-forgetting constraints, we present a hard distillation framework. Distillation in neural networks is a process for training a target network to imitate another network. A loss function is added to the target network so that its output matches the output of the mimicked network. In standard soft distillation, the trained network is allowed to deviate from the mimicked network, whereas hard distillation enforces, as a hard constraint, that the output of the trained network for base classes matches the output of the mimicked network. Network expansion with hard distillation yields a larger network, distilling the knowledge of the base network in addition to the augmented capacity to classify novel classes. We show that in the low-shot case (only 1–15 examples of a novel class), hard distillation outperforms soft distillation.
To maintain a small memory footprint, we refrain from saving the entire training set. Instead, we present a compact generative model, consisting of a collection of generative models fitted in the feature space to each of the base classes. We use a Gaussian Mixture Model (GMM) with a small number of mixtures, and show that it incurs only a minimal degradation in classification accuracy. Sampling from the GMM is fast, reducing the low-shot training time and allowing fast expansion of the network.
We define a benchmark for low-shot network expansion. The benchmark is composed of a series of tests of increasing complexity. We perform a comprehensive set of experiments on this challenging benchmark, comparing the performance of the proposed method to alternative methods.
II Related Works
A common solution to the class-incremental learning problem is to use a Nearest-Neighbors (NN) based classifier in feature space. A significant advantage of an NN-based classifier is that it can be easily extended to classify a novel class, even when only a single example of the class is available (one-shot learning). However, NN-based classifiers keep a significant amount of data in memory.
Mensink et al. proposed to use a Nearest Class Mean (NCM) classifier, where each class is represented by a single prototype, which is the mean feature vector of all class examples. One major disadvantage of NCM and NN-based methods is that they rely on a fixed feature representation of the data. To overcome this problem, it was proposed to learn a new distance function in the feature space using metric learning.
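The NCM idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the cited work: the function names, the toy two-dimensional features and the Euclidean distance are all assumptions made for the example.

```python
import math

def class_means(features_by_class):
    """Return the mean feature vector (prototype) of every class."""
    means = {}
    for label, vectors in features_by_class.items():
        dim = len(vectors[0])
        means[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return means

def ncm_predict(means, query):
    """Assign the query to the class whose mean is nearest (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(means[label], query))

def add_class(means, label, vectors):
    """Extend the classifier with a novel class; existing prototypes are untouched."""
    means[label] = class_means({label: vectors})[label]

# Hypothetical toy features for two base classes.
base = {"cup": [[0.0, 0.0], [0.2, 0.0]], "shoe": [[4.0, 4.0], [4.0, 4.2]]}
means = class_means(base)
add_class(means, "sock", [[0.0, 4.0]])  # one-shot: a single example of the novel class
prediction = ncm_predict(means, [0.3, 3.5])
```

Adding a class only adds one prototype, which is why such classifiers extend easily to one-shot settings; the feature space itself stays fixed.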
The Incremental Classifier and Representation Learning (iCaRL) method aims to solve the class-incremental learning problem using the Nearest-Mean-of-Exemplars classifier. The feature representation is updated and the class means are recomputed from a small number of stored representative examples of the base classes. During the feature representation update, the network parameters are updated by minimizing a combined classification and distillation loss. The iCaRL method was introduced as a class-incremental learning method for large training sets.
Hariharan and Girshick proposed a Squared Gradient Magnitude regularization technique that improves the fixed feature representation for low-shot scenarios. They also propose to hallucinate additional training examples from the novel classes. In contrast, we present a method which aims to maximize the performance in low-shot network expansion given a fixed representation.
In Progressive Networks, new tasks are learned without affecting the performance of old tasks by freezing the parameters of old tasks and expanding the network with additional layers to solve new tasks. Progressive learning solves the problem of online sequential learning in extreme learning machines (ELM); the purpose of that work is to incrementally learn the last fully-connected layer of the network. Another incremental learning technique augments the base network with additional parameters in the last fully connected layer to classify novel classes. Similar to iCaRL, it performs soft distillation by learning all parameters of the network. Instead of keeping historical training data, it uses phantom sampling, which hallucinates data from the past distribution modeled with Generative Adversarial Networks.
In this work, we propose a solution that borrows ideas from the freeze-and-expand paradigm, improved feature representation learning, network distillation and modeling past data with a generative model. We propose to expand the last fully connected layer of a base network to classify novel classes. Moreover, the deeper layers may also be expanded to improve the feature representation. However, in contrast to previous methods [5, 8], we do not retrain the base network parameters, but only train the expanded parts of the network. The extended feature representation is learned from samples of base and novel classes. Finally, in order to avoid keeping all of the historical training data, we use a GMM of the feature space as a generative model for the base classes.
III The Proposed Method
Assume a deep neural network is trained on $K$ base classes with the full set of training data. This base network can be partitioned into two subnetworks: a feature extraction network and a classification network. The feature extraction network $f_{rep}$ maps an input sample $x$ into a feature representation $v \in \mathbb{R}^{N}$. The classification network $f_{cls}$ maps feature vectors $v$ into a vector of approximated class posterior probabilities $P(k|v)$, one for each of the $K$ classes. The whole network can be represented as a composition of the two networks, $f_{net}(x) = f_{cls}(f_{rep}(x))$.
In the following, we discuss how the pre-learned feature representation of the feature extraction network can be leveraged to classify additional classes in a low-shot scenario with only relatively minor changes to the classification subnetwork.
III-A Expansion of the last layer of classification subnetwork
First, we discuss how to expand the classification network to classify one additional class. We can expand it from a $K$-class classifier into a $(K+1)$-class classifier by adding a new weight vector $w_{K+1}$ to the last FC layer. Thus, for a feature vector $v$, the class probability is $P(k|v) = \frac{1}{Z}\exp(w_k^{\top} v)$, where $Z = \sum_{j=1}^{K+1}\exp(w_j^{\top} v)$ is a new normalization factor for $K+1$ classes. We would like to preserve classification accuracy on the base classes to avoid catastrophic forgetting. To that end, during training we constrain the optimization to the weights $w_{K+1}$ only, while the vectors $w_1, \dots, w_K$ are kept intact. We refer to this paradigm as hard distillation. By preserving the base-class weight vectors, we guarantee that, as a result of the last classification layer expansion, the only new errors that can appear are between the novel class and the base classes, but not among the base classes. Moreover, the small number of newly learned parameters helps avoid over-fitting, which is especially important in low-shot scenarios.
Similarly, we can expand the classification network to classify more than one novel class.
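The last-layer expansion can be illustrated with a small sketch, given here in plain Python under simplifying assumptions: a bias-free linear softmax layer, plain gradient descent on the cross-entropy loss, and hypothetical toy weights and features. It is not the authors' code; it only shows that training the appended weight vector leaves the base vectors, and therefore the relative ranking among base classes, unchanged.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, v):
    """Class posteriors of a linear softmax layer (bias omitted for brevity)."""
    return softmax([sum(wi * vi for wi, vi in zip(w, v)) for w in weights])

def expand_and_train(base_weights, samples, labels, lr=0.1, steps=200):
    """Append one weight vector and train only it; base vectors stay frozen."""
    new_w = [0.0] * len(base_weights[0])
    new_idx = len(base_weights)
    for _ in range(steps):
        for v, y in zip(samples, labels):
            p = predict_proba(base_weights + [new_w], v)
            # Cross-entropy gradient w.r.t. the new vector: (p_new - 1[y is novel]) * v
            g = p[new_idx] - (1.0 if y == new_idx else 0.0)
            new_w = [w - lr * g * vi for w, vi in zip(new_w, v)]
    return base_weights + [new_w]

# Hypothetical 2-class base layer and toy features; class 2 is the novel class.
base = [[2.0, 0.0], [0.0, 2.0]]
samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
labels = [0, 1, 2]
expanded = expand_and_train(base, samples, labels)
```

Because the base logits are identical before and after expansion, the ratio between any two base-class probabilities is preserved; only the shared normalization factor changes.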
III-B Deep Feature GMM - Generative model for base classes
Due to the small memory footprint constraint, we are unable to keep the entire training data of the base classes. As an alternative, we can use a generative model of the base classes and draw samples from the model during training. There are various approaches to this task, such as GAN, VAE, Pixel CNN, or conventional methods of non-parametric kernel density estimation. However, it is usually hard to generate accurate samples from past learned distributions in the image domain, and these methods still require a significant amount of memory to store the model network parameters. Furthermore, since training typically requires thousands of samples, we prefer a generative model that allows fast sampling to reduce the low-shot phase training time.
In our work, we use the Gaussian Mixture Model (GMM) density estimator as an approximate generative model of the data from the base classes. However, instead of approximating the generative distribution of the image data, we approximate a class-conditional distribution of its feature representation. Thus, for each base class $k$ we model a GMM $P(v|k) = \sum_{i=1}^{M} \pi_{k,i}\, \mathcal{N}(v; \mu_{k,i}, \Sigma_{k,i})$, where $M$ is the number of mixtures for each base class. In order to satisfy the small memory footprint constraint, we use a GMM which assumes feature independence, i.e., the covariance matrix of each Gaussian mixture is diagonal. We denote this model as Deep Feature GMM. If we have $K$ classes and the feature vectors have dimensionality $N$, the memory requirement for storing information about the base classes is $O(KMN)$. The feature representation $v$, for which we learn a generative model, can be taken from the last fully connected layer or from deeper layers. In Section IV-E, we evaluate the effectiveness of the Deep Features GMM, showing that despite its compact representation, there is only a minimal degradation in accuracy when training a classifier solely on data generated from the Deep Features GMM, compared to the accuracy obtained on the full training data.
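To make the storage cost and the sampling step concrete, the following sketch counts the stored scalars of a diagonal-covariance GMM and draws feature vectors from one. It is an illustration in plain Python; the function names and the toy two-component model are hypothetical, and fitting the mixture (for example with EM) is deliberately left out.

```python
import random

def gmm_param_count(num_classes, num_mixtures, feat_dim):
    """Stored scalars: per component a mean (N), a diagonal variance (N) and a weight (1)."""
    return num_classes * num_mixtures * (2 * feat_dim + 1)

def sample_diag_gmm(weights, means, variances, rng):
    """Draw one feature vector from a diagonal-covariance GMM."""
    # Pick a component according to the mixture weights, then sample each
    # feature independently (this is what the diagonal covariance means).
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, var ** 0.5) for m, var in zip(means[i], variances[i])]

rng = random.Random(0)
# Hypothetical 2-component model of one base class in a 3-dimensional feature space.
weights = [0.7, 0.3]
means = [[0.0, 1.0, 2.0], [5.0, 5.0, 5.0]]
variances = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
batch = [sample_diag_gmm(weights, means, variances, rng) for _ in range(4)]

# Illustrative sizes: 5 base classes, 20 mixtures, 4096-dimensional features.
print(gmm_param_count(5, 20, 4096))  # 819300
```

The count grows linearly in the number of classes, mixtures and feature dimensions, and is independent of the number of training images.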
III-C Low-Shot Training
We apply standard data augmentation (random crop, horizontal flip, and color noise) to the input samples of the novel classes and create 100 additional variants of each novel class sample. These samples are passed through the feature extraction network to obtain their corresponding feature representations. Note that new samples and their augmented variants are passed through the feature extraction network only once.
As described in Section III-A, we expand the classification subnetwork and train the expanded network to classify novel classes in addition to the base classes. Figure 2(a) illustrates the proposed method in the case where the classification subnetwork is the last fully connected layer. As mentioned above, we only learn the weight vector of the novel class, whose dimension equals that of the feature representation and which augments the weight matrix of the FC layer.
Each training batch is composed of base-class feature vectors, drawn from the Deep Features GMM models learned from the base-class training data, and of the available samples of a novel class. The training batch is balanced to have an equal number of generated or real samples per class.
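The batch composition described above could be assembled roughly as follows. This is a sketch with hypothetical names: `base_samplers` stands in for per-class Deep Features GMM samplers (here simple Gaussians), and the novel features are assumed to be drawn with replacement from the few available (augmented) samples.

```python
import random
from collections import Counter

def balanced_batch(base_samplers, novel_features, per_class, rng):
    """Return shuffled (feature, label) pairs with `per_class` items for every class."""
    batch = []
    for label, sampler in base_samplers.items():
        # Base classes: generate fresh feature vectors from the class model.
        batch += [(sampler(rng), label) for _ in range(per_class)]
    for label, feats in novel_features.items():
        # Novel classes: few real features exist, so draw with replacement.
        batch += [(rng.choice(feats), label) for _ in range(per_class)]
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
base_samplers = {
    "base_0": lambda r: [r.gauss(0.0, 1.0), r.gauss(0.0, 1.0)],
    "base_1": lambda r: [r.gauss(3.0, 1.0), r.gauss(3.0, 1.0)],
}
novel_features = {"novel_0": [[-2.0, 2.0], [-2.5, 1.5]]}
batch = balanced_batch(base_samplers, novel_features, per_class=4, rng=rng)
counts = Counter(label for _, label in batch)
```

Balancing per class keeps the many generated base features from overwhelming the handful of novel samples in each gradient step.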
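Since the forward and backward passes are carried out by only the last FC layers, each iteration can be done very rapidly. We use SGD with gradient dropout (see below) to learn the new weights. More specifically, the weight update at step $t$ is done by:
$$\Delta w^{(t+1)} = \mu\, \Delta w^{(t)} - \eta\, M^{(t)} \odot \nabla_{w} L^{(t)}, \qquad w^{(t+1)} = w^{(t)} + \Delta w^{(t+1)},$$
where $w^{(t)}$ denotes the learned weights at step $t$, $L^{(t)}$ is the training loss, $\odot$ is the element-wise product, $\mu$ is the momentum factor, $\eta$ is the learning rate and $M^{(t)}$ is a binary random mask with probability $p$ of each entry being 1 ($M^{(t)}$ is randomly generated at each iteration throughout the low-shot training). That is, the gradient update is applied to a random subset of the learned weights. In Section IV-C we demonstrate the contribution of the gradient dropout when only a few novel labeled samples are available.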
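A single update step with gradient dropout might look like the sketch below. It assumes classical (heavy-ball) momentum and a mask that multiplies only the gradient term element-wise; the parameter name `keep_prob` and the default-free signature are choices made for this illustration rather than details taken from the text.

```python
import random

def sgd_gradient_dropout_step(w, velocity, grad, lr, momentum, keep_prob, rng):
    """One momentum-SGD step in which each gradient entry is kept with probability keep_prob."""
    # Binary mask, regenerated at every call (i.e. at every iteration).
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in w]
    # Masked entries receive no gradient this step; momentum still carries them.
    velocity = [momentum * u - lr * m * g for u, m, g in zip(velocity, mask, grad)]
    w = [wi + u for wi, u in zip(w, velocity)]
    return w, velocity

rng = random.Random(0)
w, velocity = [1.0, 2.0], [0.0, 0.0]
w, velocity = sgd_gradient_dropout_step(
    w, velocity, grad=[0.5, -0.5], lr=0.1, momentum=0.9, keep_prob=0.5, rng=rng
)
```

With `keep_prob=1.0` the step reduces to ordinary momentum SGD, and with `keep_prob=0.0` only the momentum term moves the weights, which makes the two limiting cases easy to check.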
III-D Expansion of Deeper Layers for Learning Representation
The procedure described in the previous subsections expands the last classification layer but does not change the feature representation space. In some cases, especially those in which the novel classes are similar to the base classes, it is desirable to update and expand the feature representation.
To expand the feature representation, we add new parameters to deeper layers of the network. This, of course, requires an appropriate expansion of all subsequent layers. To satisfy the hard distillation constraints, we enforce that the feature representation expansion does not affect the network output for the base classes. All weights in subsequent layers which connect the expanded representation to the base classes are set to zero and remain unchanged during learning. In Figure 2(b) we demonstrate an expansion of the two last fully connected layers. The weight matrix of the subsequent layer is zero-padded to accommodate the newly added features, and only the part of that layer added for the novel classes uses the newly added features. The details of the representation learning expansion can be found in Supplementary Materials (Section S3).
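The zero-padding constraint can be checked with a tiny example. The sketch below uses hypothetical toy weights and a single linear layer; it only demonstrates that when the base-class rows are padded with zeros, the newly added features cannot change the base-class logits, whatever value those features take.

```python
def linear(weights, x):
    """Logits of a bias-free linear layer: one row of weights per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def expand_last_layer(base_rows, novel_row, num_new_features):
    """Zero-pad base rows so new features cannot influence base-class logits."""
    padded = [row + [0.0] * num_new_features for row in base_rows]
    if len(novel_row) != len(padded[0]):
        raise ValueError("novel_row must cover base and new features")
    return padded + [novel_row]

base_rows = [[1.0, -1.0], [0.5, 2.0]]   # 2 base classes, 2 base features
novel_row = [0.1, 0.2, 3.0]             # novel class sees base + 1 new feature
expanded = expand_last_layer(base_rows, novel_row, num_new_features=1)

features = [2.0, 1.0]
new_feature = [7.0]                     # arbitrary value of the added feature
base_logits = linear(base_rows, features)
expanded_logits = linear(expanded, features + new_feature)
```

Only the novel-class row carries non-zero weights on the added feature, so training those weights (and the layer producing the feature) leaves the base-class outputs untouched.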
In this section, we evaluate the proposed low-shot network expansion method on several classification tasks. We design a benchmark which measures the performance of several alternative low-shot methods in scenarios that resemble real-life problems, starting with easier tasks (Scenario 1) to harder tasks (Scenario 2 & 3). In each experiment, we use a standard dataset that is partitioned into base classes and novel classes. We define three scenarios:
Scenario 1, Generic novel classes: unconstrained novel and base classes which can be from different domains.
Scenario 2, Domain specific with similar novel classes: base and novel classes are drawn from the same domain and the novel classes share visual similarities among themselves.
Scenario 3, Domain specific with similar base and novel classes: base and novel classes are drawn from the same domain and each novel class shares visual similarities with one of the base classes.
In each scenario we define five base classes (learned using the full train set) and up to five novel classes, which should be learned from up to 15 samples only. We compare the proposed method to several alternative methods for low-shot learning described in Section IV-B.
IV-A Datasets for Low-Shot Network Expansion scenarios
Dataset for Scenario 1
For the task of generic classification of the novel classes, we use the ImageNet dataset, selecting classes that were not part of the ILSVRC2012 1000-class challenge. Each class has at least 1000 training images and 250 test images. We randomly selected 5 partitions of 5 base classes and 5 novel classes.
Dataset for Scenario 2 and Scenario 3
For these scenarios, we use the UT-Zappos50K shoes dataset for fine-grained classification. We choose 10 classes representing different types of shoes, each having more than 1,000 training images and 250 test images.
To define similarity between the chosen classes, we fine-tune the base network (VGG-19) on the selected classes with the full dataset, and we use the confusion matrix as a measure of similarity between classes. Using the defined similarities, we randomly partition the 10 classes to 5 base and 2 novel classes, where for Scenario 2 we enforce similarity between novel classes, and for Scenario 3 we enforce similarity between novel and base classes. The confusion matrix is presented in Figure S2(b) in Supplementary Materials.
IV-B Evaluated Methods
In the proposed method we use the VGG-19 network trained on the ImageNet ILSVRC2012 1000 classes as a feature extraction subnetwork. In all three scenarios, to train the classification subnetwork on the base classes, we fine-tune the last two fully-connected layers of VGG-19 on the 5 selected base classes, while freezing the rest of the layers.
We denote the method proposed in Section III as Generative Low-Shot Network Expansion: Gen-LSNE. We compare our proposed method to NCM, to the Prototype-kNN method, which is an extension of NCM, and to a soft distillation based method inspired by the iCaRL method and adapted for the low-shot scenario.
IV-B1 NCM & Prototype-kNN
We compare the proposed method to the NCM classifier. Additionally, we extend the NCM classifier by using multiple prototypes for each class, as in the Prototype-kNN classifier. Both NCM and Prototype-kNN are implemented in a fixed feature space of the FC2 layer of the VGG-19 network. In our implementation of the Prototype-kNN, we fit a Deep Features GMM model with 20 mixtures for each of the base classes. We extract the feature representation of all of the available samples from the novel classes. The Deep Features GMM centroids of the base feature vectors and the novel feature vectors of the samples are considered as prototypes of each class. We set $k$ for the Prototype-kNN classifier to be the smallest number of prototypes per class (the number of prototypes in the novel classes is lower than the number of mixtures in the base classes). The Prototype-kNN classification rule is the majority vote among the $k$ nearest neighbors of the query sample. If the majority vote is indecisive, that is, there are two or more classes with the same number of prototypes among the $k$ nearest neighbors of the query image, we repeat the classification with a different value of $k$.
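The Prototype-kNN voting rule can be sketched as follows. This is an illustrative plain-Python version with hypothetical toy prototypes; in particular, the value used when the vote is indecisive (`fallback_k=1`, i.e. falling back to the single nearest prototype) is an assumption made here, not a detail stated in the text.

```python
import math
from collections import Counter

def prototype_knn(prototypes, query, k, fallback_k=1):
    """Majority vote among the k nearest prototypes; prototypes are (vector, label) pairs."""
    ranked = sorted(prototypes, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in ranked[:k]).most_common()
    tie = len(votes) > 1 and votes[0][1] == votes[1][1]
    if tie and k != fallback_k:
        # Indecisive vote: repeat with another k (fallback_k=1 is an assumption).
        return prototype_knn(prototypes, query, fallback_k, fallback_k)
    return votes[0][0]

# Hypothetical prototypes: GMM centroids for base classes "a" and "b",
# raw sample features for the novel class "c".
prototypes = [
    ([0.0, 0.0], "a"), ([0.0, 1.0], "a"),
    ([5.0, 5.0], "b"), ([5.0, 6.0], "b"),
    ([2.0, 0.0], "c"), ([2.0, 1.0], "c"),
]
# k is the smallest number of prototypes held by any class.
k = min(Counter(label for _, label in prototypes).values())
tied_query_label = prototype_knn(prototypes, [0.9, 0.0], k)
```

In the tied query above, the two nearest prototypes belong to classes "a" and "c", so the vote is repeated with the fallback value and the nearest prototype decides.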
IV-B2 Low-Shot with Soft Distillation
We want to measure the benefit of the hard distillation constraint in the low-shot learning scenario. Thus, we formulate a soft distillation based method, inspired by iCaRL and related distillation methods, as an alternative to the proposed method.
In the iCaRL method, the feature representation is updated by re-training the whole representation network. Since in the low-shot scenario we have only a small number of novel class samples, updating the whole representation network is infeasible. In the soft distillation method, we adapt to the low-shot scenario by updating only the last two fully connected layers, but still use a combination of distillation and classification loss as in the iCaRL method.
The iCaRL method stores a set of prototype images and uses the Nearest Mean Exemplar (NME) classifier at the final classification stage. In order to provide a fair comparison with the hard distillation method and uphold our memory restriction, we avoid storing prototypes in the image domain and use the proposed Deep-Features GMM as a generative model for the base-classes.
To summarize, soft distillation applies a distillation loss and allows the fine-tuned layers to adjust to the new data, while the proposed hard distillation freezes the base parameters and trains only the new (expanded) parameters without using a distillation loss. We denote the soft distillation based methods as Soft-Dis in the presented results.
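For intuition about what the soft distillation objective penalizes, the sketch below computes a classification term and a distillation term for one sample. It is a generic cross-entropy formulation chosen for illustration (the old network's base-class posteriors serve as soft targets for the expanded network restricted to base classes); the exact loss used by iCaRL or by the Soft-Dis baseline may differ, and all names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distillation_terms(new_logits, label, old_base_logits):
    """Return (classification, distillation) loss terms for one sample."""
    probs = softmax(new_logits)
    classification = -math.log(probs[label])
    # Compare the expanded network, restricted to base classes, with the old network.
    num_base = len(old_base_logits)
    target = softmax(old_base_logits)
    restricted = softmax(new_logits[:num_base])
    distillation = -sum(q * math.log(p) for q, p in zip(target, restricted))
    return classification, distillation

old = [2.0, 0.0]
_, d_same = soft_distillation_terms([2.0, 0.0, -1.0], 0, old)   # base logits preserved
_, d_drift = soft_distillation_terms([0.5, 1.0, 3.0], 2, old)   # base logits drifted
```

When the base logits are preserved exactly, as hard distillation enforces by construction, the distillation term sits at its minimum (the entropy of the old posteriors), so no explicit distillation loss is needed in that setting.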
IV-B3 Gradient Dropout
In Section III-C we proposed using gradient dropout regularization in SGD as a technique to improve convergence and overcome over-fitting in a low-shot scenario. We perform ablation experiments to assess the importance of the gradient dropout, training both the soft distillation (Soft-Dis) method and the proposed hard distillation (Gen-LSNE) method with and without gradient dropout regularization.
IV-C Results: Expansion of the last fully connected layer
[Table: Scenario 1, Base + Novel Top-1 Test Error (%)]
Scenario 1: Generic novel classes
In this experiment, the base classification network is trained on five base classes and then expanded to classify two novel classes chosen at random. For each of the five class partitions (Section IV-A), we perform five trials by randomly drawing two novel classes from the five novel classes available in the partition. The results are an average of 25 trials and are presented in Table I(a). In Table IV(a) we present detailed results of the test error on the base and novel classes separately. Prototype-kNN and the Soft-Dis methods perform better on the base classes. However, our method is significantly better on the novel classes and the overall test error is considerably improved, particularly when the number of samples is small. In addition, we see a significant gain in accuracy delivered by the gradient dropout when the number of novel samples is lower than 3. Furthermore, gradient dropout also improves the results of the Soft-Dis method.
NCM generally performs considerably better than Prototype-kNN in the low-shot scenario, despite using less information from the base classes. However, NCM is unable to effectively utilize more novel samples when they are available. Gen-LSNE significantly outperforms NCM with a single novel sample, and overall outperforms all the tested methods with nine or fewer samples per novel class.
Scenario 2 & 3: Domain specific with similar novel-to-novel and novel-to-base classes
As described in Section IV-A.b, in each scenario we have 5 partitions with five base classes and two novel classes. The results are an average of 5 trials and are presented in Table I(b,c). In Scenario 2 and Scenario 3 we see that the proposed method consistently outperforms the Soft-Dis, NCM and Prototype-kNN methods. Training Gen-LSNE with gradient dropout improves results in cases with 1 and 3 novel samples per class, especially in Scenario 3. In Table IV(b,c) we present detailed results of the test error on the base and novel classes separately.
IV-D Results: Expansion of Deeper Layers for Learning Representation
In this section, we explore the effect of the expansion of deeper layers, as described in Section III-D. We partition the datasets as defined in Section IV-A into five base and five novel classes, and we test a 10-class classification task. We expand the feature representation with 5 new features. The feature representation after the FC1 layer of VGG-19 has dimension 4k, and the layer is expanded with the corresponding new weights. The results are averaged over 5 trials (randomly selecting the base/novel classes). Table II shows the results obtained; we denote by +5Inner the experiments with the additional five shared representation features.
We see a marginal gain in Scenario 1. However, we observe a significant gain in Scenario 2 and 3 when the number of samples increases (especially Scenario 2).
[Table panels: Scenario 1: Generic novel classes; Scenario 2: Domain specific with similar novel classes; Scenario 3: Domain specific with similar class in base]
IV-E Results: Deep-features GMM Evaluation
In the Deep-features GMM evaluation experiment, we feed the full training data to the base network and collect the feature vectors two FC layers before the classification output. We fit a GMM model to the feature vectors of each of the base classes with a varying number of mixtures. We train the two last FC layers of the base network from randomly initialized weights, where the training is based on generating feature vectors from the fitted GMM. We measure the top-1 accuracy on the test set of the networks trained with GMM models and of the base network trained with full training data on the datasets defined in Section IV-A. The difference in top-1 accuracy between the network trained with full data and the networks trained with GMM models represents the degradation caused by compressing the data with a simple generative model. The results of the experiment, presented in Table III, demonstrate that learning with samples from GMM models commonly causes only a negligible degradation relative to learning with a full training set. Together with the hard distillation constraint, the Deep-Feature GMM is sufficient to imitate the presence of the inaccessible base class data.
[Table: Dataset / # Mixtures; columns: Full, 1, 10, 20, 40, 60]
V Concluding Remarks
We have introduced Gen-LSNE, a technique for low-shot network expansion. The method is based on hard distillation, where pre-trained base parameters are kept intact, and only a small number of parameters are trained to accommodate the novel classes. We presented and evaluated the advantages of hard distillation: (i) it gains significantly increased accuracy on the novel classes, (ii) it minimizes forgetting, with only a small drop in accuracy on the base classes, (iii) the small number of trained parameters avoids over-fitting, and (iv) the training for the expansion is fast. We have demonstrated that our method excels when only a few novel images are provided, rendering it practical and efficient for a quick deployment of the network expansion.
We have also presented the Deep-Features GMM for effective base class memorization. This computationally and memory efficient method allows training the network from a compact generative feature-space representation of the base classes, without storing the entire training set. Finally, we have shown that the learned representation can be extended based on low-shot novel observations to support better discrimination of novel classes.
In the future, we would like to continue exploring hard-distillation methods and extremely low-shot classifier expansion for robotic applications, aspiring towards human-level low-shot learning.
VI Supplementary Materials
Supplementary materials can be found at: https://github.com/adihayat/Gen-LSNE-Supplementary/blob/master/sup.pdf
References
[1] C. Kading, E. Rodner, A. Freytag, and J. Denzler, “Fine-tuning deep neural networks in continuous learning scenarios,” in ACCV Workshop on Interpretation and Visualization of Deep Neural Nets, 2016.
[2] R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in Cognitive Sciences, 1999.
[3] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Deep Learning Workshop, NIPS, 2014.
[4] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Distance-based image classification: Generalizing to new classes at near zero cost,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013.
[5] S. Rebuffi, A. Kolesnikov, and C. H. Lampert, “iCaRL: Incremental classifier and representation learning,” in CVPR, 2017.
[6] B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” arXiv:1606.02819, 2016.
[7] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv:1606.04671, 2016.
[8] R. Venkatesan and M. J. Er, “A novel progressive learning technique for multi-class classification,” Neurocomputing, 2016.
[9] R. Venkatesan, H. Venkateswara, S. Panchanathan, and B. Li, “A strategy for an uncompromising incremental learner,” arXiv:1705.00744, 2017.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in NIPS, 2014.
[11] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in NIPS, 2016.
[12] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” arXiv:1606.05328, 2016.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York Inc., 2001.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, 2015.
[15] A. Yu and K. Grauman, “Fine-Grained Visual Comparisons with Local Learning,” in CVPR, June 2014.
[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.