Brain-Inspired Model for Incremental Learning Using a Few Examples

by Ali Ayub et al.
Penn State University

Incremental learning attempts to develop a classifier that learns continuously from a stream of data segregated into different classes. Deep learning approaches suffer from catastrophic forgetting when learning classes incrementally. We propose a novel approach to incremental learning, inspired by the concept learning model of the hippocampus, that represents each image class as a set of centroids and does not suffer from catastrophic forgetting. Classification of a test image is accomplished using the distances of the test image to its n closest centroids. We further demonstrate that our approach can learn incrementally from only a few examples per class. Evaluations on three class-incremental learning benchmarks (Caltech-101, CUBS-200-2011 and CIFAR-100), in both the incremental and few-shot incremental settings, show state-of-the-art classification accuracy over all learned classes.




1 Introduction

Humans can continuously learn new concepts over their lifetime. In contrast, modern machine learning systems often must be trained on batches of data [14, 26]. Applied to the task of object recognition, incremental learning is an avenue of research that seeks to develop systems capable of continually updating the learned model as new data arrives [18]. Incremental learning gradually increases an object classifier's breadth by training it to recognize new object classes [10].

This paper examines a sub-type of incremental learning known as class-incremental learning. Class-incremental learning first learns a small subset of classes and then incrementally expands that set with new classes. Importantly, the final model is evaluated on a single blended test set covering all learned classes, a protocol known as single-headed evaluation [7]. To paraphrase Rebuffi et al. [35], a class-incremental learning algorithm must:

  1. Be trainable from a stream of data that includes instances of different classes at different times;

  2. Offer a competitively accurate multi-class classifier for any classes it has observed thus far;

  3. Be bounded in, or only grow slowly with respect to, memory and computational requirements as the number of training classes increases.

Incremental learning may offer a number of important advantages over batch learning  [12]. Specifically, because the classes and data used to train the model do not need to be determined a priori, an incremental machine learning system may offer advantages in terms of adaptability. Moreover, Flesch et al. [12] demonstrate that incremental learning improves human learning by fostering the development of higher order representations that promote task segregation.

Creating high-accuracy machine learning systems that incrementally learn, however, is a hard problem. One simple way to create an incremental learner is to tune the model to the data of the new classes. This approach, however, causes the model to forget the previously learned classes and the overall classification accuracy decreases, a phenomenon known as catastrophic forgetting [2, 13, 15, 21, 24, 29]. To overcome this problem, most existing class-incremental learning methods avoid it altogether by storing a portion of the training data from the earlier learned classes and retraining the model (often a neural network) on a mixture of the stored data and new data containing the new classes [35, 6, 18, 40, 45, 46]. These approaches are, however, neither scalable nor biologically inspired: when humans learn new visual objects they do not forget the visual objects they have previously learned, nor must they relearn these previously known objects.

With respect to class-incremental learning, this paper contributes a state-of-the-art method termed Centroid-Based Concept Learning (CBCL). CBCL uses a fixed data representation (ResNet [16] pre-trained on ImageNet for object-centric image datasets) for feature extraction. After feature extraction, CBCL applies a brain-inspired framework, a variant of the agglomerative clustering proposed in [3] (denoted Agg-Var in this paper), to the feature vectors of the training data of each image class to create a set of centroids (concepts) for that class. To predict the label of a test image, similar to [3], the distances of the test image's feature vector to the closest centroids are used. Since CBCL stores the centroids of each class independently of the other classes, the decrease in overall classification accuracy is not catastrophic when new classes are learned. CBCL is tested on three incremental learning benchmarks (Caltech-101 [11], CUBS-200-2011 [44], CIFAR-100 [22]) and outperforms the state-of-the-art methods by a significant margin.

We seek to develop a practical incremental learning system that would allow human users to incrementally teach a robot different classes of objects. To be practical for human users, an incremental learner should require only a few instances of labeled data per class. Most class-incremental learning methods [35, 45, 18] require hundreds or thousands of instances per class. In contrast, our approach achieves superior accuracy with only a few instances of labeled training data per class. Evaluations on three incremental learning benchmarks show that it outperforms state-of-the-art methods that do not store old class data, even when using only 5 or 10 training examples per class. The main contributions of this paper are:

  1. A novel, brain-inspired class-incremental learning approach is proposed.

  2. An approach that outperforms the state-of-the-art methods by significant margins of 38.02%, 20.23% and 10.6% on the three benchmark datasets listed above.

  3. A centroid reduction method is proposed that bounds the memory footprint without a significant loss in classification accuracy.

  4. Our approach results in state-of-the-art accuracy even when applied to few-shot incremental learning.

2 Related Work

The related work is divided into two categories: traditional approaches that use a fixed data representation and class-incremental approaches that use deep learning.

2.1 Traditional Methods

Early incremental learning approaches used SVMs [9]. For example, Ruping [37] creates an incremental learner by storing support vectors from previously learned classes and using a mix of old and new support vectors to classify new data. Most of these early approaches did not fulfill the criteria for class-incremental learning, and many required old class data to be available when learning new classes [23, 32, 33, 34].

Another set of early approaches uses a fixed data representation with a Nearest Class Mean (NCM) classifier for incremental learning [30, 31, 36]. The NCM classifier computes a single centroid for each class, the mean of all the feature vectors of that class's training images. To predict the label of a new image, NCM assigns it the class label of the closest centroid. NCM avoids catastrophic forgetting by using centroids: each class centroid is computed using only the training data of that class, so even if the classes are learned incrementally, the centroids of previous classes are not affected when new classes are learned. These early approaches, however, use SIFT features [27], hence their classification accuracy is not comparable to current deep learning approaches, as shown in [35].
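The NCM scheme described above is compact enough to sketch directly. The class name and method names below are illustrative assumptions, not code from any of the cited papers; feature vectors are plain NumPy arrays.

```python
import numpy as np

class NearestClassMean:
    """Nearest Class Mean (NCM) classifier sketch: one centroid (mean
    feature vector) per class; prediction picks the closest centroid."""

    def __init__(self):
        self.centroids = {}  # class label -> mean feature vector

    def learn_class(self, label, features):
        # Only this class's data is needed, so earlier centroids are
        # untouched when a new class arrives -- the property that lets
        # NCM avoid catastrophic forgetting.
        self.centroids[label] = np.mean(features, axis=0)

    def predict(self, x):
        labels = list(self.centroids)
        dists = [np.linalg.norm(x - self.centroids[y]) for y in labels]
        return labels[int(np.argmin(dists))]
```

Because `learn_class` never reads other classes' centroids, classes can arrive in any order without affecting previously learned ones.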

2.2 Deep Learning Methods

Deep learning methods have produced excellent results on many vision tasks because of their ability to jointly learn task-specific features and classifiers [5, 14, 26, 41]. Deep learning approaches suffer from catastrophic forgetting on incremental learning tasks. Essentially, classification accuracy rapidly decreases when learning new classes [2, 13, 15, 21, 24, 29]. Various approaches have been proposed recently to deal with catastrophic forgetting for task-incremental and class-incremental learning [1, 35].

For task-incremental learning, a model is trained incrementally on different datasets and, during evaluation, is tested on each dataset separately [7]. Task-incremental learning uses multi-headed evaluation, in which the task identity is known at prediction time; this has been shown [7] to be a much easier problem than the class-incremental learning considered in this paper.

2.2.1 Class-Incremental Learning Methods

Most recent class-incremental learning methods rely upon storing a fraction of the old class data when learning a new class [35, 18, 6, 45, 7]. iCaRL [35] combines knowledge distillation [17] and NCM for class-incremental learning: it uses the old class data while learning a representation for the new classes and uses the NCM classifier to classify old and new classes. EEIL [6] improves iCaRL with an end-to-end learning approach. Hou et al. [18] use cosine normalization, a less-forget constraint and inter-class separation to reduce the data imbalance between old and new classes. The main issue with these approaches is the need to store old class data, which is not practical when the memory budget is limited. To the best of our knowledge, only two approaches do not use old class data and operate within a fixed memory budget: LWF-MC [35] and LWM [10]. LWF-MC is simply the implementation of LWF [25] for class-incremental learning. LWM uses an attention distillation loss and a teacher model trained on old class data for better performance than LWF-MC. Although both approaches meet the conditions for class-incremental learning proposed in [35], their performance is inferior to approaches that store old class data [35, 6, 45].

An alternative set of approaches increases the number of layers in the network when learning new classes [39, 43]. Another novel approach [47] grows a tree structure to incorporate new classes incrementally. These approaches have the drawback of rapidly increasing memory usage as new classes are added.

Some researchers have also focused on using a deep network pre-trained on ImageNet as a fixed feature extractor for incremental learning. Belouadah et al. [4] use a pre-trained network for feature extraction and then train shallow networks for classification while incrementally learning classes; they also store a portion of old class data. The main issue with their approach is that they test it on the ImageNet dataset using a feature extractor that has already been trained on ImageNet, which skews their results. FearNet [20] uses a ResNet-50 pre-trained on ImageNet for feature extraction and a brain-inspired dual-memory system that requires storing the feature vectors and covariance matrices of the old class images; the feature vectors and covariance matrices are further used to generate augmented data during learning. Our approach does not store any base class data; it uses a ResNet pre-trained on ImageNet for feature extraction, but we do not test our approach on ImageNet.

Figure 1: For each new image class in a dataset, the feature extractor generates the CNN features of all the training images in the class; the Agg-Var clustering algorithm then generates a set of centroids, which are concatenated with the centroids of previously learned classes; the complete set of centroids is used to classify unlabeled test images.

3 Methodology

The main focus of this paper is not on representation learning. Our approach uses a pre-trained ResNet to extract features which are then fed to CBCL. Following the notation of [35], CBCL learns from a class-incremental stream of sample sets X^1, X^2, …, where each set X^y = {x_1^y, …, x_{N_y}^y} contains the N_y training samples of class y.

The subsections below first briefly explain our method for class-incremental learning. Next, we explore how the memory footprint can be managed by restricting the total number of centroids. Finally, we demonstrate the use of our approach to learn incrementally from only a few examples per class.

3.1 Agg-Var Clustering-Based Classification

The complete architecture of our approach is depicted in Figure 1  [3]. Once the data for a new class becomes available, the first step in CBCL is the generation of feature vectors from the images of the new class using a fixed feature extractor. The proposed architecture can work with any type of image feature extractor or even for non-image datasets with appropriate feature extractors. In this paper, for the task of object-centric image classification, we use CNNs (ResNet [16]) pre-trained on ImageNet [38] as feature extractors.

In the learning phase, for each new image class y, Agg-Var clustering is applied to the feature vectors of all the training images in the class. Inspired by the EpCon model of concept learning in the hippocampus [28], Agg-Var clustering begins by creating one centroid from the first image in the training set of class y. Next, for each subsequent image in the training set, the feature vector x_i^y (for the ith image) is generated and compared, using the Euclidean distance, to all the centroids of class y. If the distance of x_i^y to the closest centroid is below a pre-defined distance threshold D, the closest centroid is updated as a weighted mean of that centroid and the feature vector x_i^y. If the distance between the ith image and the closest centroid is greater than the distance threshold D, a new centroid is created for class y and set equal to the feature vector of the ith image. The result of this process is a collection of centroids for class y, C^y = {c_1^y, …, c_{N*_y}^y}, where N*_y is the number of centroids for class y. This process is applied to the sample set of each class incrementally, as it becomes available, to obtain a collection of centroids for all the classes in a dataset. It should be noted that using the same distance threshold for different classes can yield a different number of centroids per class, depending on the similarity among the images in each class. Hence, we only need to tune a single parameter (D) to get the optimal number of centroids (those that yield the best validation accuracy) for each class. Note that our approach calculates the centroids of each class separately. Thus, the performance of our approach is not strongly impacted when the classes are presented incrementally.
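The centroid-creation loop described above can be sketched as follows. This is my reading of the text rather than the authors' implementation: the function name is hypothetical, and the weighted-mean update assumes each centroid tracks how many feature vectors it has absorbed.

```python
import numpy as np

def agg_var_centroids(features, dist_threshold):
    """Agg-Var clustering sketch for ONE class.

    features: (N, d) array of feature vectors for a single class.
    Returns the list of centroids produced for that class.
    """
    centroids = [features[0].astype(float).copy()]
    counts = [1]  # number of feature vectors merged into each centroid
    for x in features[1:]:
        dists = [np.linalg.norm(x - c) for c in centroids]
        j = int(np.argmin(dists))
        if dists[j] < dist_threshold:
            # Update closest centroid with a weighted mean of the centroid
            # and the new feature vector.
            centroids[j] = (centroids[j] * counts[j] + x) / (counts[j] + 1)
            counts[j] += 1
        else:
            # Too far from every existing centroid: start a new one.
            centroids.append(x.astype(float).copy())
            counts.append(1)
    return centroids
```

A single threshold thus yields more centroids for visually diverse classes and fewer for homogeneous ones, as the text notes.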

Similar to the approach used in [3], to predict the label y* of a test image we use the feature extractor to generate a feature vector x. Next, the Euclidean distance is calculated between x and the centroids of all the classes observed so far. Based on the calculated distances, we select the n closest centroids to the unlabeled image. The contribution of each of the n closest centroids to the determination of the test image's class is a conditional summation:


Pred_y = Σ_{j=1}^{n} [y_j = y] · (1/D(c_j, x))        (1)

where Pred_y is the prediction weight of class y, y_j is the category label of the jth closest centroid c_j, [y_j = y] equals 1 when y_j = y and 0 otherwise, and D(c_j, x) is the Euclidean distance between c_j and the feature vector x of the test image. The prediction weights of all the image classes observed so far are first initialized to zero. Then, for the n closest centroids, the prediction weights of the classes those centroids belong to are updated using equation (1). The prediction weight of each class is further multiplied by the inverse of the total number of images in the training set of that class, to manage class imbalance. The test image is assigned the class label with the highest prediction weight. Algorithm 1 describes this weighted voting scheme for classification.

Algorithm 1: Weighted voting scheme for classification

Input: x, feature vector of the test image
Require: n, number of closest centroids used for prediction
Require: class centroid sets C^1, …, C^t
Output: predicted label y*

1: {c_1, …, c_n} ← n closest centroids to x from C^1 ∪ … ∪ C^t
2: for y = 1, …, t do Pred_y ← 0
3: for j = 1, …, n do Pred_{y_j} ← Pred_{y_j} + 1/D(c_j, x)
4: y* ← argmax_y Pred_y
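The weighted voting scheme described above can be sketched in NumPy. Names are hypothetical; the optional `class_sizes` argument implements the inverse-training-set-size rescaling mentioned in the text.

```python
import numpy as np

def predict_label(x, centroids, labels, n_closest, class_sizes=None):
    """Weighted voting over the n closest centroids (sketch).

    x: test feature vector; centroids: (M, d) array of ALL stored
    centroids across classes; labels: length-M list giving the class
    of each centroid; class_sizes: optional {class: training-set size}.
    """
    dists = np.linalg.norm(centroids - x, axis=1)
    order = np.argsort(dists)[:n_closest]
    pred = {}
    for j in order:
        # Each close centroid votes for its class, weighted by 1/distance.
        w = 1.0 / max(dists[j], 1e-12)
        pred[labels[j]] = pred.get(labels[j], 0.0) + w
    if class_sizes is not None:
        # Rescale by inverse training-set size to counter class imbalance.
        for y in pred:
            pred[y] /= class_sizes[y]
    return max(pred, key=pred.get)
```

The `1e-12` floor guards against a zero distance when the test vector coincides with a centroid.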

3.2 Centroid Reduction

The memory footprint is an important consideration for an incremental learning algorithm [35]; real system implementations have limited memory available. We therefore propose a method that restricts the number of centroids while attempting to maintain the system's classification accuracy.

Algorithm 2: CBCL centroid reduction step

Input: current class centroid sets C^1, …, C^t
Require: K_max, maximum number of centroids
Require: N_new, number of centroids for the new classes
Output: reduced class centroid sets

1: N_total ← Σ_y |C^y|
2: for y = 1, …, t do
3:     N_y^r ← ⌊|C^y| (K_max − N_new)/N_total⌋
4:     C^y ← k-means(C^y, N_y^r)

Assume that a system can store a maximum of K_max centroids. Currently the system has stored N_total centroids for t classes. For the next batch of classes the system needs to store N_new more centroids, but N_total + N_new > K_max. Hence, the system needs to reduce the stored centroids to K_max − N_new. Rather than reducing the number of centroids of each class equally, CBCL reduces the centroids of each class in proportion to the number of centroids the class currently has. The reduced number of centroids for each class is calculated as the whole number:

N_y^r = ⌊N_y (K_max − N_new)/N_total⌋        (2)

where N_y^r is the number of centroids for class y after reduction. Rather than simply removing the extra centroids from each class, CBCL applies k-means clustering [19] to the centroid set of each class to cluster it into N_y^r centroids. Thus CBCL clusters the closest centroids together to create the new centroids, keeping as much information as possible about the previous classes. Algorithm 2 describes the centroid reduction step. Results on the benchmark datasets show the effectiveness of our centroid reduction approach (Section 4).
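The reduction step can be sketched as below. The proportional allocation and a tiny self-contained k-means (rather than a library call) are one plausible rendering of the procedure described above, not the authors' exact code.

```python
import numpy as np

def kmeans(points, k, iters=25, seed=0):
    """Plain k-means on a small set of centroid vectors."""
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest mean, then recompute the means.
        d = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members):
                means[j] = members.mean(axis=0)
    return means

def reduce_centroids(class_centroids, k_max, n_new):
    """Shrink stored centroids so that k_max - n_new slots remain
    (sketch of the centroid reduction step).

    class_centroids: {class label: list/array of centroid vectors}.
    """
    n_total = sum(len(c) for c in class_centroids.values())
    budget = k_max - n_new
    reduced = {}
    for y, cents in class_centroids.items():
        cents = np.asarray(cents, dtype=float)
        # Each class keeps a share proportional to its current count.
        keep = max(1, int(len(cents) * budget / n_total))
        reduced[y] = cents if keep >= len(cents) else kmeans(cents, keep)
    return reduced
```

Merging nearby centroids via k-means, instead of discarding them, preserves a coarse summary of the dropped centroids.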

3.3 Few-Shot Incremental Learning

For a few-shot learning problem, an algorithm is evaluated on n-shot, k-way tasks: a model is given a total of n examples per class for k classes during training. After the training phase, the model is evaluated on a small number of test samples (usually 15 per class for 1-shot, 5-shot and 10-shot learning) for each of the k classes.

For an n-shot incremental learning setting, we propose to train a model on n examples per class for the k classes in each increment. The training data for the previously learned classes is not available to the model during the current increment. After training, the model is tested on the complete test set of all the classes learned so far. Although this problem becomes more difficult with each increment, we hypothesize that our approach will perform well even in the 5-shot and 10-shot incremental learning cases, because even a limited number of instances per class generates centroids covering most of the class's concept. Few-shot incremental learning is potentially important for applications in which humans incrementally teach a system such as a robot; in such cases, the human is unlikely to be willing to provide more than a few examples of a class.
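The n-shot incremental protocol described above amounts to the evaluation loop below. The function name and the learner interface (`learn_class`/`predict`) are hypothetical; any classifier exposing that interface can be plugged in.

```python
def nshot_incremental_eval(learner, increments, test_data, n_shot):
    """Train on n_shot examples per class, one batch of classes at a
    time; after each increment test on ALL classes seen so far
    (single-headed evaluation).

    increments: list of dicts {label: (num_examples, d) feature array}.
    test_data:  dict {label: (num_test, d) feature array}.
    Returns per-increment accuracies over all classes learned so far.
    """
    seen, accs = [], []
    for batch in increments:
        for label, feats in batch.items():
            learner.learn_class(label, feats[:n_shot])  # only n_shot used
            seen.append(label)
        correct = total = 0
        for label in seen:  # every class seen so far, old and new
            for x in test_data[label]:
                correct += int(learner.predict(x) == label)
                total += 1
        accs.append(correct / total)
    return accs
```

Note that old training data is never revisited; only the learner's stored state carries earlier classes forward.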

4 Experiments

We evaluate CBCL on three standard class-incremental learning datasets: Caltech-101 [11], CUBS-200-2011 [44] and CIFAR-100 [22]. First, we explain the datasets and the implementation details. CBCL is then compared to state-of-the-art methods for class-incremental learning and evaluated on 5-shot and 10-shot incremental learning. Finally, we perform an ablation study to analyze the contribution of each component of our approach.

Dataset CIFAR-100 CUBS-200-2011 Caltech-101
# classes 100 100 100
# training images 500 80% of data 80% of data
# testing images 100 20% of data 20% of data
# classes/batch 2, 5, 10, 20 10 10
Table 1: Statistical details of the datasets in our experiments, same as in [10] for a fair comparison. Number of training and test images reported are for each class in the dataset.

4.1 Datasets

CBCL was evaluated on the three datasets used in [10]. LWM [10] was also tested on the iLSVRC-small (ImageNet) dataset, but since our feature extractor is pre-trained on ImageNet, a comparison on this dataset would not be fair. Caltech-101 [11] contains 8,677 images of 101 object categories, with 40 to 800 images per category. CUBS-200-2011 [44] contains 11,788 images of 200 categories of birds. CIFAR-100 [22] consists of 60,000 images belonging to 100 object classes, with 500 training images and 100 test images per class. The number of classes, train/test split sizes and number of classes per training batch are described in Table 1. The classes that compose a batch were selected randomly. For the 5-shot and 10-shot incremental learning experiments, only the number of training images per class in Table 1 was changed, to 5 and 10 respectively; the other statistics remain the same.

Similar to [10], top-1 accuracy was used for evaluation. We also report the average incremental accuracy, which is the average of the classification accuracies achieved in all the increments  [35].
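For concreteness, average incremental accuracy is simply the mean of the per-increment top-1 accuracies; a minimal helper (name is illustrative):

```python
def average_incremental_accuracy(per_increment_accs):
    """Mean of the top-1 accuracies obtained after each increment [35]."""
    return sum(per_increment_accs) / len(per_increment_accs)
```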

Figure 2: Average and standard deviation of classification accuracies (%) on the CIFAR-100 dataset with (a) 2, (b) 5, (c) 10, (d) 20 classes per increment, over 10 executions. Average incremental accuracies are shown in parentheses. (For the other methods, results are taken from the respective papers; different papers report results for different increment settings. Best viewed in color.)

We compare CBCL against FearNet using the evaluation metrics provided in [20]. The model's ability to retain base knowledge is measured as Ω_base = (1/(T−1)) Σ_{i=2}^{T} α_{base,i}/α_ideal, where α_{base,i} is the accuracy of the model, after increment i, on the classes learned in the first increment, α_ideal is the accuracy of a multi-layer perceptron trained offline (69.9%), and T is the total number of increments. The model's ability to recall new information is evaluated as Ω_new = (1/(T−1)) Σ_{i=2}^{T} α_{new,i}, where α_{new,i} is the accuracy of the model on the classes learned in increment i. Lastly, we evaluate the model on all test data as Ω_all = (1/(T−1)) Σ_{i=2}^{T} α_{all,i}/α_ideal, where α_{all,i} is the accuracy of the model on all the classes learned up to increment i.

Because CBCL’s learning time is much shorter than the time required to train a neural network, we are able to run all our experiments 10 times randomizing the order of the classes. We report the average classification accuracy and standard deviation over these ten runs.

4.2 Implementation Details

The Keras deep learning framework [8] was used to implement all of the neural network models. For the Caltech-101 [11] and CUBS-200-2011 [44] datasets, a ResNet-18 [16] model pre-trained on the ImageNet [38] dataset was used for feature extraction, and for CIFAR-100 a ResNet-34 [16] model pre-trained on ImageNet was used. These model architectures are consistent with [10] for a fair comparison. For the experiments on the CIFAR-100 dataset the model was allowed to store up to 7,500 centroids (requiring 3.87 MB, versus 84 MB for an extra ResNet-34 teacher model as in [10]). Furthermore, compared to methods that store 2,000 images of previous classes [35, 6, 4], the 7,500 centroids of our approach require less memory than 2,000 complete images. For Caltech-101 the stored centroids required 0.5676 MB (versus 45 MB for a ResNet-18 teacher model as in [10]) and for CUBS-200-2011, 0.4128 MB. For a fair comparison with FearNet [20], we use a ResNet-50 pre-trained on ImageNet as the feature extractor in that experiment.

For each batch of new classes, the hyper-parameters D (the distance threshold) and n (the number of closest centroids used for classification) are tuned using cross-validation. We use only the previously learned centroids and the training data of the new classes for hyper-parameter tuning.

4.3 Results on CIFAR-100 Dataset

On the CIFAR-100 dataset, our method is compared to seven other methods: finetuning (FT), LWM [10], LWF-MC [35], iCaRL [35], EEIL [6], BiC [45] and FearNet [20]. FT simply takes the network trained on previous classes and adapts it to the new incoming classes. LWM extends LWF [25] and uses an attention distillation loss for class-incremental learning; it does not store any data from the previously learned classes, but it does require storing an extra teacher model trained on the previous classes. LWF-MC uses a distillation loss during the training phase. iCaRL also uses the distillation loss for representation learning, but stores exemplars of previous classes and uses the NCM classifier for classification. EEIL improves iCaRL with an end-to-end learning approach that also uses the distillation loss and keeps exemplars of the old classes. BiC likewise uses exemplars of the old classes and adds a bias-correction layer after the fully connected layer of the network to correct for the bias towards the new classes. FearNet uses a ResNet-50 pre-trained on ImageNet for feature extraction and a brain-inspired dual-memory model; it stores the feature vectors and covariance matrices of old class images and also uses a generative model for data augmentation. We also compare the classification accuracy after learning all the classes to an upper bound (68.6%) given by a ResNet-34 trained on the entire CIFAR-100 dataset in one batch.

Figure 2 compares CBCL to the first six of the seven methods mentioned above, with 2, 5, 10 and 20 classes per increment. Even though a fair comparison with CBCL is only possible for FT, LWF-MC and LWM, since they are the only approaches that do not store exemplars of the old classes, CBCL outperforms all six methods in all increment settings. The difference in classification accuracy between CBCL and the other methods increases as the number of classes learned increases; moreover, the difference is larger for smaller increments. Unlike the other methods, CBCL's performance remains the same regardless of the number of classes in each increment (the final accuracy after 100 classes is 60% for all increment settings).

Methods 2 classes 5 classes 10 classes 20 classes
CBCL 5-Shot 56.97 55.61 54.65 53.81
CBCL 10-Shot 61.90 61.45 61.30 60.71
Table 2: CBCL’s performance on 5-shot and 10-shot incremental learning settings in terms of average incremental accuracy (%) on CIFAR-100 dataset with 2, 5, 10 and 20 classes per increment.

Table 2 shows the average incremental accuracy of CBCL for 5-shot and 10-shot incremental learning. Compared with the average incremental accuracies of the other six methods (Figure 2) for 2, 5 and 10 classes per increment (which use all 500 training examples per class), CBCL with only 5 or 10 training examples per class still outperforms the methods that do not store old class data (FT, LWF-MC, LWM), while being only slightly inferior to the methods that do. For 20 classes per increment, CBCL's average incremental accuracy in the 5-shot and 10-shot settings is slightly inferior to the other methods. Furthermore, training a ResNet-34 in the 5-shot and 10-shot settings with all the training data available in a single batch yields only 8.22% and 12.15% accuracy, respectively. These results clearly show the effectiveness of CBCL in the few-shot incremental learning setting.

Evaluation Metric FearNet CBCL CBCL 5-Shot CBCL 10-Shot
Ω_base 0.927 1.025 0.754 0.830
Ω_new 0.824 1.020 0.778 0.870
Ω_all 0.947 1.025 0.778 0.870
Table 3: Comparison with FearNet on CIFAR-100. Ω_base, Ω_new and Ω_all are all normalized by the offline multi-layer perceptron (MLP) baseline (69.9%) reported in [20]. A value greater than 1 means that the average incremental accuracy of the model is higher than that of the offline MLP.

Table 3 compares CBCL with FearNet [20] using the metrics proposed in [20]. FearNet does not report the number of classes per increment for its results, hence we report results of CBCL for the most difficult increment setting (2 classes per increment). CBCL clearly outperforms FearNet on all three metrics (Ω_base, Ω_new, Ω_all) by a significant margin when using all training examples per class. For 10-shot incremental learning, CBCL outperforms FearNet (which uses all the training examples per class) on Ω_new, but is slightly inferior on Ω_base and Ω_all. For the 5-shot incremental learning setting, the results of CBCL are inferior to FearNet (using all the training examples), but the drop in accuracy is not drastic. It should be noted that even for the 10-shot and 5-shot settings, the MLP baseline used in the calculation of Ω_base, Ω_new and Ω_all was trained on all the training data of each class in a single batch. We also trained a ResNet-50 for 5-shot and 10-shot learning with all the class training data available in one batch; the test accuracies were 8.49% and 12.21%, respectively. CBCL outperforms this few-shot baseline by a remarkable margin in both settings, demonstrating that it is extremely effective for few-shot incremental learning.

# Classes FT LWM CBCL (Ours)
10 (base)
Table 4: Comparison with FT and LWM [10] on Caltech-101 dataset in terms of classification accuracy (%) with 10 classes per increment. Average and standard deviation of classification accuracies per increment are reported

4.4 Results on Caltech-101 Dataset

For the Caltech-101 dataset, CBCL was compared to finetuning (FT) and LWM [10] with learning increments of 10 classes per batch (Table 4); FT and LWM were introduced in Subsection 4.3. CBCL outperforms FT and LWM by a significant margin, and the difference continues to increase as more classes are learned. FT performs worst, with a classification accuracy after 100 classes of about one fourth of the base accuracy (a decrease of 69.52%). LWM improves on FT; nevertheless, its accuracy after 100 classes is almost half of the base accuracy (a decrease of 49.36%). For CBCL the decrease in accuracy is only 11.31% after incrementally learning 100 classes. The average incremental accuracies for FT, LWM, CBCL, CBCL with 5-shot incremental learning and CBCL with 10-shot incremental learning are 43.84%, 62.67%, 90.61%, 87.70% and 89.92%, respectively. Hence, CBCL improves over the current best method (LWM) by a margin of 27.94% in average incremental accuracy when the complete training set is used. Even for 5-shot and 10-shot incremental learning, CBCL outperforms LWM by margins of 25.03% and 27.25%, respectively.

4.5 Results on CUBS-200-2011 Dataset

For the CUBS-200-2011 dataset we again compare our approach to FT and LWM with learning increments of 10 classes per batch (Table 5). The classification accuracy of CBCL is greater than that of FT and LWM after the base 10 classes, and the margin grows with each learning increment. The accuracy of FT decreases by 81.77% after 10 increments and that of LWM by 64.65%, whereas the decrease for CBCL after 10 increments is significantly lower (38.08%). The average incremental accuracies for FT, LWM, CBCL, CBCL with 5-shot incremental learning and CBCL with 10-shot incremental learning are 37.73%, 57.087%, 67.85%, 56.16% and 63.80%, respectively. As for Caltech-101, CBCL improves over the current best method (LWM) by a significant margin (10.763%) in average incremental accuracy. Furthermore, even in the 10-shot incremental learning setting CBCL improves over LWM, and it is only slightly below LWM in the 5-shot setting.

Table 5: Comparison with FT and LWM [10] on the CUBS-200-2011 dataset in terms of classification accuracy (%) with 10 classes per increment, from the 10-class base onward. Average and standard deviation of classification accuracies per increment are reported.

4.6 Ablation Study

We performed an ablation study to examine the contribution of each component of our approach to the overall system's accuracy. These experiments were performed on the CIFAR-100 dataset with increments of 10 classes and a fixed memory budget of centroids, using all of the training data per class. We report the average incremental accuracy for each experiment.

This ablation study investigates the effect of the following components: the feature extractor, the clustering approach, the number of centroids used for classification, and the centroid reduction technique. Hybrid versions of CBCL were created to ablate each of these components. hybrid1 uses VGG-16 [42] pre-trained on the Places365 dataset [48] as the feature extractor; this extractor is suited to indoor scene datasets, a task quite different from the object-centric image classification considered in this paper. hybrid2 uses traditional agglomerative clustering and hybrid3 uses k-means clustering to generate the centroids for all image classes. hybrid4 uses only the single closest centroid for classification (the same as an NCM classifier). hybrid5 simply removes the extra centroids when the memory limit is reached rather than applying the proposed centroid reduction technique. Lastly, hybrid6 uses an NCM classifier with the ImageNet pre-trained feature extractor. Apart from the changed component, all other components in each hybrid are the same as in CBCL.
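The difference between hybrid4 and full CBCL comes down to the prediction rule. A minimal sketch of classification from the n closest centroids is shown below; the distance-weighted voting rule here is an illustrative assumption, not necessarily the paper's exact formula, and with n=1 it reduces to the nearest-class-mean (NCM) rule of hybrid4:

```python
import numpy as np

def predict(x, centroids, labels, n=1):
    """Classify feature vector x using the n closest centroids.

    centroids: (m, d) array of stored centroids
    labels:    (m,) array giving the class of each centroid
    With n=1 this is the single-closest-centroid (NCM) rule of hybrid4.
    """
    dists = np.linalg.norm(centroids - x, axis=1)
    nearest = np.argsort(dists)[:n]
    votes = {}
    for i in nearest:
        # each close centroid votes for its class, weighted by 1/distance
        votes[labels[i]] = votes.get(labels[i], 0.0) + 1.0 / (dists[i] + 1e-12)
    return max(votes, key=votes.get)

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 1, 1])
print(predict(np.array([0.8, 0.9]), centroids, labels, n=2))  # 1
```

Using several nearby centroids makes the prediction robust when a class is represented by multiple clusters in feature space, which is what the roughly 1.15% gain of CBCL over hybrid4 reflects.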

Methods Accuracy (%)
hybrid1 44.1
hybrid2 59.2
hybrid3 60.0
hybrid4 68.7
hybrid5 69.0
hybrid6 58.25
CBCL 69.85
Table 6: Effect on average incremental accuracy of switching off each component of CBCL separately. All of the hybrids perform worse than CBCL, demonstrating that each component contributes to CBCL's best results.

Table 6 shows the results of the ablation study. All of the hybrid methods are less accurate than the complete CBCL algorithm. The most drastic decrease in accuracy is for hybrid1, in which the feature extractor is trained for a different task. These results support [48], which notes that Places365 features are inferior to ImageNet features on object-centric image datasets. hybrid2 and hybrid3 achieve similar average incremental accuracies, both significantly inferior to CBCL; this difference reflects the effectiveness of the Agg-Var clustering algorithm for object-centric image classification. hybrid4 achieves slightly lower accuracy than CBCL, showing that using multiple centroids for classification yields a gain of about 1.15%. Finally, hybrid5's accuracy is the closest to CBCL's. This small difference may reflect the fact that the memory budget is large enough that the algorithm does not need to reduce centroids until the last increment; the slight change in that final increment does not affect the average incremental accuracy over all 10 increments by a significant margin. The effectiveness of our centroid reduction technique is more apparent with smaller memory budgets: for a limit of K=3000 centroids, the average incremental accuracies of CBCL and hybrid5 are 67.5% and 64.0%, respectively. Lastly, hybrid6 again shows a drastic decrease in accuracy because it uses a single centroid to represent each class. Overall, this ablation study indicates that the most important components of CBCL are the task-specific feature extractor and Agg-Var clustering: hybrid4 and hybrid5, which retain both, still achieve average incremental accuracies higher than the state-of-the-art methods.
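To make the hybrid5 comparison concrete, here is a hedged sketch of the centroid-reduction idea: when the number of stored centroids exceeds a budget K, repeatedly merge the closest pair of centroids with a count-weighted mean instead of simply discarding the extras (hybrid5's behavior). The merging rule below is an illustrative assumption, not necessarily the paper's exact algorithm, and class bookkeeping is omitted for brevity:

```python
import numpy as np

def reduce_centroids(centroids, counts, K):
    """Merge the closest centroid pairs until at most K centroids remain.

    centroids: list of d-dimensional vectors
    counts:    number of images represented by each centroid
    """
    centroids = [np.asarray(c, dtype=float) for c in centroids]
    counts = list(counts)
    while len(centroids) > K:
        # find the closest pair of centroids
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge j into i with a count-weighted mean, then drop j
        total = counts[i] + counts[j]
        centroids[i] = (counts[i] * centroids[i] + counts[j] * centroids[j]) / total
        counts[i] = total
        del centroids[j], counts[j]
    return centroids, counts

cents, cnts = reduce_centroids([[0.0], [0.1], [5.0]], [10, 10, 10], K=2)
print(len(cents), cents[0])  # 2 centroids; 0.0 and 0.1 merge into [0.05]
```

Merging preserves information from every stored cluster, whereas discarding extras loses it outright, which is consistent with the 67.5% vs. 64.0% gap observed at the K=3000 budget.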

5 Conclusion

In this paper we have proposed a novel approach (CBCL) for class-incremental learning that does not store previous class data. The centroid-based representation of classes not only produces state-of-the-art results but also opens novel avenues for future research, such as few-shot incremental learning. Although CBCL offers superior accuracy to other incremental learners, its accuracy is still lower than that of single-batch learning on the entire training set. Future versions of CBCL will seek to match the accuracy of single-batch learning.

CBCL contributes methods that may one day allow for real-world incremental learning from a human to an artificial system. Few-shot incremental learning, in particular, holds promise as a method by which humans could conceivably teach robots about important task-related objects. Our upcoming work will focus on this problem.


  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In The European Conference on Computer Vision (ECCV), September 2018.
  • [2] Bernard Ans, Stéphane Rousset, Robert M. French, and Serban Musca. Self-refreshing memory in artificial neural networks: learning temporal sequences without catastrophic forgetting. Connection Science, 16(2):71–99, 2004.
  • [3] Ali Ayub and Alan Wagner. CBCL: Brain inspired model for RGB-D indoor scene classification. arXiv:1911.00155, 2019.
  • [4] Eden Belouadah and Adrian Popescu. Deesil: Deep-shallow incremental learning. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
  • [5] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug 2013.
  • [6] Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In The European Conference on Computer Vision (ECCV), September 2018.
  • [7] Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In The European Conference on Computer Vision (ECCV), September 2018.
  • [8] François Chollet et al. Keras, 2015.
  • [9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, Sept. 1995.
  • [10] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [11] Li Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, April 2006.
  • [12] Timo Flesch, Jan Balaguer, Ronald Dekker, Hamed Nili, and Christopher Summerfield. Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences, 115(44):E10313–E10322, 2018.
  • [13] Robert M. French. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 335–340, 1994.
  • [14] Ross Girshick. Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [15] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211, 2013.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [17] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [18] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [19] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
  • [20] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. In International Conference on Learning Representations, 2018.
  • [21] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521–3526, 2017.
  • [22] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009. Technical report, University of Toronto.
  • [23] I. Kuzborskij, F. Orabona, and B. Caputo. From n to n+1: Multiclass transfer incremental learning. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3358–3365, June 2013.
  • [24] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4652–4662. Curran Associates, Inc., 2017.
  • [25] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, Dec 2018.
  • [26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [27] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
  • [28] Michael L. Mack, Bradley C. Love, and Alison R. Preston. Building concepts one episode at a time: The hippocampus and concept formation. Neuroscience Letters, 680:31–38, 2018.
  • [29] Michael McCloskey and Neil J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:104–169, 1989.
  • [30] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, Nov 2013.
  • [31] Thomas Mensink, Jakob J. Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
  • [32] M. D. Muhlbaier, A. Topalis, and R. Polikar. Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks, 20(1):152–168, Jan 2009.
  • [33] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. Curriculum learning of multiple tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [34] R. Polikar, L. Upda, S. S. Upda, and V. Honavar. Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(4):497–508, Nov 2001.
  • [35] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [36] Marko Ristin, Matthieu Guillaumin, Juergen Gall, and Luc Van Gool. Incremental learning of ncm forests for large-scale image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [37] S. Ruping. Incremental learning with support vector machines. In Proceedings 2001 IEEE International Conference on Data Mining, pages 641–642, Nov 2001.
  • [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, Dec. 2015.
  • [39] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv:1606.04671, 2016.
  • [40] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [41] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 568–576, Cambridge, MA, USA, 2014. MIT Press.
  • [42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [43] Alexander V. Terekhov, Guglielmo Montone, and J. Kevin O’Regan. Knowledge transfer in deep block-modular neural networks. In Proceedings of the 4th International Conference on Biomimetic and Biohybrid Systems - Volume 9222, Living Machines 2015, pages 268–279, New York, NY, USA, 2015. Springer-Verlag New York, Inc.
  • [44] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge J. Belongie. The caltech-ucsd birds-200-2011 dataset, 2011. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • [45] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [46] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, Zhengyou Zhang, and Yun Fu. Incremental classifier learning with generative adversarial networks. CoRR, abs/1802.00853, 2018.
  • [47] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, pages 177–186, New York, NY, USA, 2014. ACM.
  • [48] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.