Continual Universal Object Detection

02/13/2020 ∙ by Xialei Liu, et al. ∙ Amazon ∙ Universitat Autònoma de Barcelona

Object detection has improved significantly in recent years on multiple challenging benchmarks. However, most existing detectors are still domain-specific: the models are trained and tested on a single domain. When these detectors are adapted to new domains, they often suffer from catastrophic forgetting of previous knowledge. In this paper, we propose a continual object detector that can learn sequentially from different domains without forgetting. First, we explore learning the object detector continually in different scenarios across various domains and categories. Building on this analysis, we propose attentive feature distillation, which leverages both bottom-up and top-down attention to mitigate forgetting. It uses attention to ignore noisy background information and feature distillation to provide strong supervision. Finally, for the most challenging scenarios, we propose an adaptive exemplar sampling method that leverages exemplars from previous tasks to further reduce forgetting. The experimental results show the excellent performance of our proposed method in three different scenarios across seven different object detection datasets.




1 Introduction

Object detection has improved significantly in recent years [2, 10, 17, 28] on multiple challenging benchmarks, such as PASCAL VOC [7] and MS COCO [18]. However, most of the existing detectors are domain-specific, lacking the generalization ability for new domains and categories. Recently, Wang et al. [35] proposed a universal object detector by domain attention across eleven diverse datasets. They show that a single network with marginally more parameters can handle detection on multiple domains, even outperforming single-domain detectors in most cases.

However, a major drawback of such a universal detector is that all data has to be accessible at the time of training the model. Moreover, once the model is trained, it is hard to adapt it to new domains without retraining all seen domains again. Unfortunately, in real-world scenarios, legacy data from previous domains can be lost, proprietary, or simply too expensive to use in new domains [16]. For example, for models deployed on edge devices, it is unlikely that these models can still access the original training data. If we want to learn to detect new concepts from new domains on devices for these models, we need to develop a new learning scheme. Therefore, the ability to learn incrementally across multiple domains becomes an important feature of a true universal object detection system.

Figure 1: Given two object detection tasks $T_1$ and $T_2$, their corresponding domains $\mathcal{D}_1$ and $\mathcal{D}_2$, and their corresponding categories $\mathcal{C}_1$ and $\mathcal{C}_2$, there are four different scenarios when learning $T_1$ and $T_2$ sequentially: more training data (top-left), same categories but different domains (top-right), different categories but same domains (bottom-left), and different categories and domains (bottom-right).

On the other hand, incremental and continual learning has drawn significant attention in recent years [1, 14, 16, 20, 26]. However, most existing work focuses on classification problems, while ignoring the more challenging detection problems. For object detection, we not only need to consider the forgetting problem of the classification part but also need to pay specific attention to the bounding boxes. Recently, Shmelkov et al. [33] proposed to learn object detectors incrementally by adding one or a few classes, using a distillation loss on both logits and bounding boxes, but their work still only focuses on a single domain.

In this paper, we aim to develop a universal detector by allowing sequential learning across multiple domains. We approach this challenging problem by first studying the effects of domain and category differences on catastrophic forgetting. For instance, VOC and COCO are from similar domains, but COCO has 60 more categories: how would forgetting happen if we train from VOC to COCO or from COCO to VOC? Watercolor and Comic [13] share the same categories, but they are from different domains: how would the domain shift affect the continual learning process? KITTI [8] and Kitchen [9] have disjoint categories and significant domain discrepancies: can we learn those sequentially without forgetting? To answer these questions and better understand forgetting across domains, we conducted a comprehensive analysis of different scenarios, as shown in Figure 1.

To solve the problem of continual universal object detection, we study the line of work that applies probability distillation to learning without forgetting (LwF) [16]. However, naively transferring these distillation methods to object detection introduces new problems. Since many region proposals in detection are background, distillation that ignores spatial information can end up transferring more noise than useful information. Recently, learning without memorizing (LwM) [6] applied attention-based distillation to avoid catastrophic forgetting for classification problems. This method could perform better than distillation without attention, but this attention is rather weak for object detection. Hence, we develop a novel attentive feature distillation which leverages both bottom-up and top-down attention maps to focus on spatial foreground regions and useful context information. Further, it also provides stronger supervision compared to attention-based distillation [6, 38].

For the most difficult settings in our problem, i.e., when both domains and categories are drastically different, we study the use of exemplars from previous tasks to mitigate forgetting. iCaRL [26] proposes to keep exemplars that stay close to the mean of each category in the feature space. RWalk [3] compares several sampling methods; however, it turns out that random sampling works as well as the others at a lower complexity cost. To the best of our knowledge, we are the first to investigate the impact of keeping exemplars to avoid forgetting in object detection. Additionally, we propose to rank the samples by the number of ground truth bounding boxes, then sample them adaptively with a fixed budget to balance between the number of bounding boxes and the diversity of exemplars.

We make the following major contributions:

  • We propose to explore a novel problem, where the object detector is learned across different domains incrementally.

  • We conduct an in-depth analysis of the effects of domain and category differences on catastrophic forgetting for object detection.

  • We propose a novel attentive feature distillation approach to mitigate forgetting for object detection.

  • We propose a novel adaptive exemplar sampling to choose samples with more information and diversity to solve the difficult problems in our setting.

Experimental results across seven datasets demonstrate the effectiveness of our proposed method and show interesting behaviors in different scenarios.

2 Related work

Our work is at the intersection of continual learning, knowledge distillation, and object detection. All three fields have advanced dramatically in the last few years. Here we briefly review the relevant literature.

Object detection   Object detection networks can be divided into two categories: two-stage or one-stage. Faster-RCNN [28] is a representative two-stage method with a region proposal network (RPN) and a classification and regression network (Fast-RCNN [10]). YOLO [27] and SSD [19] are representatives of one-stage methods, which predict bounding boxes and class probabilities directly from full images in one step. We show experimental results with Faster-RCNN, although our method applies to other two-stage and one-stage methods. Large datasets, such as PASCAL VOC [7] and MS COCO [18], have been a driving force in object detection. However, most of the detectors are specific to one domain and hard to extend to other domains. Recently, Wang et al. [35] proposed to train a single network for 11 diverse domains, which achieves better performance compared to most other independent detectors. The assumption they make is that all data can be accessed simultaneously, which is not always true in practice. We propose to do object detection across domains by continual learning, in which the objective is to learn a detector on the current task without forgetting previous knowledge.

Continual learning   Losing the acquired knowledge from previous tasks when learning a new task is known as catastrophic forgetting. Continual learning approaches are introduced to alleviate this undesired effect. Most works on continual learning focus on classification problems, which can be roughly divided into three main families [5, 24].

The first family is regularization-based, which contains both data-focused and prior-focused approaches. The first subset is composed of distillation approaches such as LwF and LwM. The second subset estimates the importance of network parameters, and it applies a higher penalty on those that show a significant change when switching from one task to another [1, 14, 16, 20, 39]. The next family prevents catastrophic forgetting by growing a sub-network or learning a mask for each task [15, 22, 23, 30, 31]. Finally, rehearsal-based approaches form the last family, which either store a small number of training samples from previous tasks [3, 21, 26], or use a generative model to sample synthetic data from previously learned distributions [32, 36].

There are few works on continual learning for object detection. Shmelkov et al. [33] proposed the first work to learn an object detector incrementally by adding one or a few classes at a time within a single domain, using a distillation loss on both logits and bounding boxes. In contrast, we are interested in learning objects continually across different domains. Additionally, to the best of our knowledge, we are the first to explore storing exemplars for continual object detection and to propose adaptive exemplar sampling, which provides more information and diversity.

Knowledge distillation   Probability distillation was proposed by Hinton et al. [11] to compress networks for fast inference. It trains a smaller student network from the softened output of a wider teacher network. Romero et al. [29] extended the idea to learn intermediate features from teacher networks using a two-stage strategy. Moving forward, Zagoruyko et al. [38] proposed to mimic the attention maps from teacher networks, computed by activations or gradients. Due to the simple implementation and success of knowledge distillation, the idea has been applied to different applications, such as object detection [4], and continual learning [6, 16, 26]. In this work, we focus on learning an object detector continually by further exploring the idea of knowledge distillation.

3 Problem formulation

In this section, we introduce the formulation for universal object detection in continual learning and the main concepts before presenting the proposed approach in Sec. 4.

3.1 Object detection

We consider the object detection task $T$ learned from dataset $X = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th image, $y_i$ is the corresponding label with both bounding boxes and classes, and $N$ is the size of the dataset. As for Faster-RCNN [28], we divide the network into feature extractor $F$, Region Proposal Network (RPN) $R$, and the remaining modules as detection head $H$. The objective function of the detection loss is formalized as:

$$\mathcal{L}_{det} = \sum_{i=1}^{N} \ell(x_i, y_i; F, R, H) \tag{1}$$

where $\ell$ contains all losses for object detection including classification losses and bounding box regression losses.

3.2 Continual object detection

We consider the continual object detection learning setting where $n$ detection tasks $T_1, \dots, T_n$ are learned sequentially from the corresponding domains $\mathcal{D}_1, \dots, \mathcal{D}_n$ with corresponding categories $\mathcal{C}_1, \dots, \mathcal{C}_n$. During the learning process, we share part of the network modules or grow some of them for the new task. If we train a new copy of the model for each new task, it results in $n$ models which need $n$ times the memory size; this is commonly referred to as independent training. If we finetune from the previous model for each new task, we obtain a single model after training all tasks. That is efficient in terms of memory consumption, but suffers from catastrophic forgetting. We simplify the problem of learning tasks sequentially by proposing four different scenarios for two tasks $T_1$ and $T_2$ (see Figure 1):

  • $\mathcal{D}_1 = \mathcal{D}_2$ and $\mathcal{C}_1 = \mathcal{C}_2$. If both domains and categories are the same, task $T_2$ is a new set of samples of the same distribution as task $T_1$.

  • $\mathcal{D}_1 = \mathcal{D}_2$ and $\mathcal{C}_1 \neq \mathcal{C}_2$. If domains are the same but categories are different, it is the case of intra-domain sequential learning. For example, VOC and COCO have different categories under the same domain.

  • $\mathcal{D}_1 \neq \mathcal{D}_2$ and $\mathcal{C}_1 = \mathcal{C}_2$. If domains are different but categories are the same, the scenario is closer to sequential domain adaptation. For example, Watercolor and Comic have the same categories but belong to different domains.

  • $\mathcal{D}_1 \neq \mathcal{D}_2$ and $\mathcal{C}_1 \neq \mathcal{C}_2$. If both domains and categories are different, it is the case of inter-domain sequential learning. For example, KITTI and Kitchen have very different domains and categories.

In this work, we focus on the last three scenarios where forgetting is more likely to happen. It is straightforward to extend the idea from two domains to more.
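The four scenarios can be told apart with two comparisons. The following is a minimal sketch, assuming each task is summarized by a domain label and a set of category names; the function name, the domain strings, and the category lists are illustrative and not taken from the paper.

```python
def classify_scenario(domain_1, domain_2, categories_1, categories_2):
    """Return which of the four sequential-learning scenarios applies."""
    same_domain = domain_1 == domain_2
    same_categories = set(categories_1) == set(categories_2)
    if same_domain and same_categories:
        return "more training data"
    if same_domain:
        return "intra-domain sequential learning"
    if same_categories:
        return "sequential domain adaptation"
    return "inter-domain sequential learning"


# Hypothetical labels: same categories, different drawing styles.
print(classify_scenario("comic", "watercolor", ["bike", "dog"], ["dog", "bike"]))
```

Only the last three outcomes are studied in this work, since the first one amounts to receiving more data from the same distribution.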

3.3 Knowledge distillation

Probability distillation was proposed by Hinton et al. [11] to match two output distributions of two different networks. The loss function for probability distillation is:

$$\mathcal{L}_{PD} = -\sum_{k} \hat{p}_k \log p_k, \qquad p = \sigma(z / \tau), \quad \hat{p} = \sigma(\hat{z} / \tau) \tag{2}$$

where $\sigma$ is the softmax function, $p$ are the probabilities of the current model, $\hat{p}$ are the recorded probabilities of the previous model, $z$ and $\hat{z}$ are the outputs before softmax and $\tau$ is a temperature parameter to control the scaling of the probabilities.
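As a concrete illustration of probability distillation, the sketch below computes the cross-entropy between temperature-softened outputs of a previous and a current model with NumPy. The function names and the default temperature of 2.0 are assumptions made for this example, not values stated in the paper.

```python
import numpy as np


def softened_softmax(logits, temperature):
    """Softmax over the last axis of logits / temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def probability_distillation_loss(current_logits, previous_logits, temperature=2.0):
    """Cross-entropy between recorded (previous) and current softened outputs."""
    p_current = softened_softmax(current_logits, temperature)
    p_previous = softened_softmax(previous_logits, temperature)
    per_sample = -(p_previous * np.log(p_current + 1e-12)).sum(axis=-1)
    return float(per_sample.mean())
```

The loss is smallest when the current model reproduces the recorded distribution, which is why it penalizes drift on old outputs while the new task is learned.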

Feature distillation was proposed by Romero et al. [29] to use feature alignment on the intermediate representations to do network distillation. The loss function of their proposed hint feature distillation is:

$$\mathcal{L}_{FD} = \left\| F(x) - \hat{F}(x) \right\|_2^2 \tag{3}$$

where $F$ and $\hat{F}$ are feature extractors of current and previous model respectively.

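Attention distillation: instead of matching the probabilities of two networks, Zagoruyko et al. [38] proposed to mimic the attention maps from the teacher to the student for network distillation. The attention maps are computed as:

$$A = \sum_{c=1}^{C} |a_c|^p \tag{4}$$

where $C$ is the number of channels per activation and $p \geq 1$ is the exponent applied to the absolute activations. The activation of the $i$-th image can be computed as $a = F(x_i)$. Attention map $A$ is processed with an $\ell_2$ normalization and the attention distillation loss is defined as:

$$\mathcal{L}_{AD} = \left\| \frac{A}{\|A\|_2} - \frac{\hat{A}}{\|\hat{A}\|_2} \right\|_2 \tag{5}$$

where $A$ and $\hat{A}$ are attention maps of current and previous model respectively.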
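The attention distillation described above can be sketched in a few lines of NumPy. This is a hedged illustration following the activation-based attention of Zagoruyko et al.: the exponent `p=2`, the L2 normalization over the flattened map, and all function names are assumptions of this example.

```python
import numpy as np


def attention_map(activation, p=2):
    """Collapse a (C, H, W) activation into an (H, W) map: sum over channels of |a_c|^p."""
    return (np.abs(activation) ** p).sum(axis=0)


def l2_normalize(x, eps=1e-12):
    """Scale an array so that its flattened L2 norm is (almost exactly) 1."""
    return x / (np.linalg.norm(x) + eps)


def attention_distillation_loss(current_activation, previous_activation, p=2):
    """L2 distance between the normalized attention maps of two models."""
    a_current = l2_normalize(attention_map(current_activation, p))
    a_previous = l2_normalize(attention_map(previous_activation, p))
    return float(np.linalg.norm(a_current - a_previous))
```

Because the maps are normalized, rescaling all activations by a constant leaves the loss essentially unchanged; only the spatial pattern of attention is matched, which is why this supervision is weaker than matching the features themselves.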

One of the drawbacks of probability distillation, when we apply it to object detection, is that region proposals are very noisy. The large amounts of background result in unreliable distilled information, especially when domains are very different. A similar problem happens to feature distillation due to the diversity and complexity of backgrounds. Attention distillation partially reduces noisy information. However, it is weaker supervision compared to direct features, which ends up being less effective against catastrophic forgetting. Therefore, in this paper, we propose an attentive feature distillation using both bottom-up and top-down attentions to alleviate the effects of catastrophic forgetting in continual object detection.

4 Proposed approach

In this section, we will first introduce the proposed attentive feature distillation using both bottom-up and top-down attention. Then we will explain how adaptive exemplar sampling could balance the number of bounding boxes and diversity of exemplars.

4.1 Attentive feature distillation

Figure 2: Examples of attention maps. Original images (top), bottom-up attention (middle), top-down attention (bottom) from Comic (left) and Kitchen (right) object detection datasets.
Figure 3: Main pipeline for learning tasks sequentially. White modules on top are frozen after learning task $T_1$ with Faster-RCNN. When learning task $T_2$, we first duplicate the blue modules (including the RPN) and add a new head. Then, we combine data from task $T_2$ and exemplars from task $T_1$. In green, our proposed approach combines bottom-up attention from the outputs of the feature extractors with top-down attention from the output bounding boxes of the RPN. The attentive feature distillation loss is then combined with the usual object detection losses. The same procedure can be easily extended to more tasks.
Figure 4: Illustration of different sampling methods.

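We propose attentive feature distillation (AFD) to make use of the advantages of both feature and attention distillation while avoiding the drawbacks of both methods. Specifically, we propose to use both top-down (TD) and bottom-up (BU) attention maps to filter out the noisy information on the features. The bottom-up attentive feature distillation is computed by integrating attention maps (as shown in Figure 2) from Eq. 3 and Eq. 5:

$$\mathcal{L}_{BU} = \left\| \frac{\hat{A}}{\|\hat{A}\|_2} \odot \big( F(x) - \hat{F}(x) \big) \right\|_2^2 \tag{6}$$

where $\odot$ denotes element-wise multiplication at every spatial location, $\hat{A}$ is the attention map of the previous model, and $F$ and $\hat{F}$ are the feature extractors of the current and previous model respectively. Bottom-up attention captures critical context information of the input image, which we use to guide the current model to mimic the attentive features of the previous one.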

Top-down attention masks were first proposed in [34] to do distillation on object detection, although that work does not focus on avoiding catastrophic forgetting. For each ground truth bounding box, Intersection over Union (IoU) with all anchors is computed. A dynamic filter threshold is computed as $\psi$ times the maximum IoU value, and $\psi$ is set to be 0.5. We propose two modifications:

  1. For each bounding box, we compute the mean of all overlapping proposals over the threshold, which helps to focus on the center region of the object. Then, we obtain the final mask $M$ by looping over all ground-truth boxes and adding all normalized masks (as shown in Figure 2).

  2. We do the same normalization as in Eq. 6, but replace the attention maps by the previously computed final mask $M$.
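The mask construction described above can be sketched as follows. This is an illustrative NumPy version under several assumptions that the text does not specify: boxes are given as `(x1, y1, x2, y2)` in feature-map cells, the threshold is 0.5 times the maximum IoU for each ground-truth box, each per-box mask is normalized by its maximum before being added, and the final mask is L2-normalized. All function names are hypothetical.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def rasterize(box, height, width):
    """Binary (H, W) mask covering the integer cells inside a box."""
    mask = np.zeros((height, width))
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    mask[max(y1, 0):min(y2, height), max(x1, 0):min(x2, width)] = 1.0
    return mask


def top_down_mask(gt_boxes, anchors, height, width, psi=0.5):
    """Sum of per-box masks, each the mean of the anchors that pass the IoU threshold."""
    final = np.zeros((height, width))
    for gt in gt_boxes:
        overlaps = iou(gt, anchors)
        keep = anchors[overlaps > psi * overlaps.max()]
        if len(keep) == 0:  # no anchor overlaps this box at all
            continue
        box_mask = np.mean([rasterize(a, height, width) for a in keep], axis=0)
        if box_mask.max() > 0:
            final += box_mask / box_mask.max()
    norm = np.linalg.norm(final)
    return final / norm if norm > 0 else final
```

Averaging the selected anchors gives higher weight to cells covered by many of them, which tends to emphasize the center of each object.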

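Therefore, the top-down attentive feature distillation objective is defined as:

$$\mathcal{L}_{TD} = \left\| \frac{M}{\|M\|_2} \odot \big( F(x) - \hat{F}(x) \big) \right\|_2^2 \tag{7}$$

where $M$ is the final top-down mask, and $F$ and $\hat{F}$ are the feature extractors of the current and previous model respectively. Top-down attention concentrates on foreground objects, which is crucial for object detection algorithms. To leverage both rich context information and foreground objects, bottom-up and top-down attention can be complementary. The final attentive feature distillation loss function is:

$$\mathcal{L}_{AFD} = \mathcal{L}_{BU} + \mathcal{L}_{TD} \tag{8}$$

where $\mathcal{L}_{BU}$ is the bottom-up attentive feature distillation loss defined above. Since both attention maps and masks are normalized in the same way, there is no need to use a hyperparameter to balance between both losses.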
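To make the combination of the two terms concrete, here is a small NumPy sketch of an attentive feature distillation loss. It is an illustration under assumptions, not the authors' implementation: the bottom-up attention is taken from the previous model's features as a sum of squared channel responses, both attention maps are L2-normalized, and the weighted feature difference is penalized with a squared L2 norm.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale an array so that its flattened L2 norm is (almost exactly) 1."""
    return x / (np.linalg.norm(x) + eps)


def bottom_up_attention(activation):
    """(C, H, W) activation -> normalized (H, W) map of summed squared responses."""
    return l2_normalize((activation ** 2).sum(axis=0))


def attentive_feature_distillation(current_feat, previous_feat, top_down_mask):
    """Sum of bottom-up and top-down attention-weighted feature differences."""
    diff = current_feat - previous_feat          # (C, H, W)
    bu = bottom_up_attention(previous_feat)      # assumption: previous model's attention
    td = l2_normalize(top_down_mask)             # (H, W) foreground mask
    loss_bu = np.sum((bu[None] * diff) ** 2)
    loss_td = np.sum((td[None] * diff) ** 2)
    return float(loss_bu + loss_td)
```

In this sketch, a feature change at a location where both attention maps are zero contributes nothing to the loss, which mirrors the stated goal of ignoring noisy background while still supervising the features directly.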

4.2 Adaptive exemplar sampling

Exemplars are stored while learning a task to retain performance after learning future tasks [3, 26]. Random sampling has been shown to be simple and effective [3] for classification problems. Therefore, we use random sampling as a baseline for object detection.

Additionally, since the number of bounding boxes per image varies a lot in object detection, we propose hard sampling as another baseline, which keeps the exemplars with the largest number of bounding boxes. This sampling method provides more information with the same memory consumption compared to random sampling. However, for video-based datasets such as KITTI and Kitchen, hard sampling can reduce the diversity of samples, as the images with more bounding boxes are more likely to come from consecutive frames in a video.

To overcome this drawback, we also propose adaptive sampling. For each specific category, we rank all $N$ samples by the number of bounding boxes in descending order. Given a budget of $M$ exemplars with $\alpha \geq 1$ and $\alpha M \leq N$, we randomly sample $M$ exemplars from the list $(1, \alpha M)$ which we ordered before. We define three different sampling cases:

$$\text{sampling} = \begin{cases} \text{hard}, & \alpha = 1 \\ \text{adaptive}, & 1 < \alpha < N/M \\ \text{random}, & \alpha = N/M \end{cases} \tag{9}$$

which allow a balance between the number of bounding boxes and the diversity of exemplars by adapting $\alpha$. An illustration of different sampling methods is shown in Figure 4.
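The sampling procedure for a single category can be sketched with the standard library. This is a minimal illustration under assumptions: the candidate pool is the top-ranked images whose size is a multiple (here called `alpha`) of the budget, and the function name, the `alpha` parameter name, and the fixed seed are hypothetical choices for this example.

```python
import random


def adaptive_exemplar_sampling(box_counts, budget, alpha, seed=0):
    """Pick `budget` image ids from the top alpha * budget images ranked by box count.

    box_counts maps image id -> number of ground-truth boxes (one category).
    alpha = 1 reduces to hard sampling; alpha = N / budget to random sampling.
    """
    ranked = sorted(box_counts, key=box_counts.get, reverse=True)
    pool_size = min(len(ranked), max(budget, int(round(alpha * budget))))
    rng = random.Random(seed)
    return rng.sample(ranked[:pool_size], min(budget, pool_size))


# Hypothetical per-image box counts for one category.
counts = {f"img{i}": i for i in range(10)}
print(adaptive_exemplar_sampling(counts, budget=3, alpha=2))
```

A larger pool trades some boxes per exemplar for more variety, which matters for video-based datasets where the densest images tend to be near-duplicate consecutive frames.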

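| Methods | KITTI → Kitchen | Kitchen → KITTI | Comic → Watercolor | Watercolor → Comic | VOC → COCO | COCO → VOC |
|---|---|---|---|---|---|---|
| Independent | 53.7 / – | 70.9 / – | 45.1 / – | 50.0 / – | 74.1 / – | 47.8 / – |
| Finetuning | 10.9 (-42.8) / 68.2 | 9.2 (-61.7) / 54.0 | 39.6 (-5.5) / 48.8 | 46.4 (-3.6) / 43.4 | 74.9 (+0.8) / 48.4 | 20.1 (-27.7) / 75.9 |
| AFD (bottom-up) | 25.2 (-28.5) / 68.8 | 13.1 (-57.8) / 50.9 | 43.1 (-2.0) / 48.3 | 47.5 (-2.5) / 43.5 | 75.4 (+1.3) / 45.1 | 23.1 (-24.7) / 76.2 |
| AFD (top-down) | 33.9 (-19.8) / 72.1 | 16.8 (-54.1) / 51.2 | 43.6 (-1.5) / 45.9 | 51.3 (+1.3) / 46.2 | 76.0 (+1.9) / 45.5 | 28.1 (-19.7) / 76.8 |
| AFD (both) | 36.6 (-17.1) / 72.0 | 20.5 (-50.4) / 50.5 | 42.9 (-2.2) / 48.5 | 49.4 (-0.6) / 45.5 | 75.4 (+1.3) / 45.1 | 27.8 (-20.0) / 77.1 |

Table 1: Analysis of performance (mAP) and forgetting on different scenarios without exemplars. Arrows indicate order of learning. Each cell lists the mAP on the first task (with the change relative to Independent in parentheses) / the mAP on the second task. KITTI and Kitchen differ in both domains and categories, Comic and Watercolor differ in domains only, and VOC and COCO differ in categories only.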

4.3 Learning

By fusing the usual object detection losses with the proposed attentive feature distillation loss, the final objective function for continual object detection is defined as:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda \mathcal{L}_{AFD} \tag{10}$$

where $\lambda$ is the trade-off parameter to balance the current learning objective and the distillation objective to avoid forgetting. As shown in Figure 3, we train the first task $T_1$ using Faster-RCNN with data $X_1$. Then we train on task $T_2$ with data $X_2$ using our proposed attentive feature distillation and adaptive exemplar sampling methods. It results in the same model as in [35] but learning different tasks continually instead of training all tasks together. The same procedure could be easily applied to more tasks.

5 Experiments

We use a PyTorch [25] implementation of Faster-RCNN [37] with SE-ResNet-50 [12] pre-trained on ImageNet, as in [35]. As usual in detection, the first convolution layer, the first residual block, and all BN layers are fixed during training. We consider different combinations of the following datasets: Watercolor [13], Clipart [13], Comic [13], Kitchen [9], KITTI [8], VOC [7], COCO [18]. All train and test splits are the same as in [35]. The Pascal VOC mean average precision (mAP) is used for evaluation of all cases.

Training details:   Following [37], we use the default parameters for Faster-RCNN with 4 anchor scales and 3 anchor ratios. Learning rate is set to 0.01 for 10 epochs and 0.001 for the last 2. For small datasets (Watercolor, Clipart, Comic) we use a batch size of 8 on 4 synchronized GPUs. For the other datasets, we use a batch size of 16 to speed up training. For all experiments, we set the trade-off parameter $\lambda$ = 1e-4 from Eq. 10.
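The training settings above can be collected in a small configuration sketch. The dictionary keys and the helper function are hypothetical names chosen for illustration; only the numeric values come from the text, and epochs are assumed to be counted from zero.

```python
# Values taken from the training details; key names are illustrative.
TRAINING_CONFIG = {
    "epochs": 12,               # 10 epochs at the base rate, then 2 at the decayed rate
    "anchor_scales": 4,
    "anchor_ratios": 3,
    "batch_size_small": 8,      # Watercolor, Clipart, Comic
    "batch_size_large": 16,     # remaining datasets
    "distillation_weight": 1e-4,
}


def learning_rate(epoch, base_lr=0.01, decay_epoch=10, decayed_lr=0.001):
    """Step schedule: base_lr for the first 10 epochs, decayed_lr afterwards."""
    return base_lr if epoch < decay_epoch else decayed_lr


schedule = [learning_rate(e) for e in range(TRAINING_CONFIG["epochs"])]
```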

Compared methods:   We compare our proposed methods (Feature Distillation, Attention Distillation, AFD) with LwF Detection and three other baselines (Independent, Joint, Finetuning). For LwF Detection, we follow [33] to embed the loss function into our framework. Independent is training from an ImageNet pre-trained model for each dataset separately. For Joint, we follow [35] to train a universal detector. Both baselines can be seen as an upper bound, but they have their drawbacks (i.e. Joint needs all data available at once). Finetuning is done by training all object detection tasks sequentially without any additional loss to avoid forgetting.

5.1 Analysis on forgetting in different scenarios and ablation studies

We analyze different scenarios (Section 3.2) with ablation studies on different attention mechanisms (Section 4.1) and exemplar sampling (Section 4.2).

Figure 5: Ablation study on different number of exemplars. Evaluation on the first dataset after training on the second dataset.

Forgetting in different scenarios.   In Table 1, we analyze forgetting in object detection for three different scenarios: 1) KITTI and Kitchen (different domains, different categories); 2) Watercolor and Comic (different domains, same categories); 3) VOC and COCO (same domain, different categories). The analysis also contains the ablation study on our proposed distillation methods without using any exemplars.

In the case of Kitchen and KITTI, we observe a lot of forgetting due to the vast differences in categories and domains. Finetuning forgets the first task almost completely, while our proposed AFD outperforms both bottom-up and top-down attentive feature distillations. However, compared to Independent, the performance of AFD still drops by 17.1 mAP (KITTI to Kitchen) and 50.4 mAP (Kitchen to KITTI). Without the help of exemplars, it is hard to mitigate the forgetting effect even with the best distillation methods for this scenario.

In the second scenario, domains are different but categories are the same, which causes much less forgetting, even for Finetuning. By using our proposed attentive feature distillation, the object detector can retain similar performance as the Independent baseline. With drops of only 2.2 and 0.6 mAP for Comic and Watercolor (AFD with both attentions, Table 1), forgetting is largely avoided.

In the last scenario, since the VOC categories are a subset of those in COCO, it is interesting to note that training from VOC to COCO causes no forgetting. Furthermore, performance is improved by using our proposed methods. There is a lot of forgetting when learning from COCO to VOC, which neither Finetuning nor AFD can avoid. However, AFD still manages to improve over Finetuning, from 20.1 to 27.8 mAP. It seems that category differences cause severe forgetting when the number of categories is reduced, but much less when more categories from the same domain are added.
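The forgetting values in parentheses in Table 1 are the first-task mAP after sequential training minus the Independent mAP on that task. The short sketch below recomputes a few of them from the table; the dictionary layout and the function name are illustrative.

```python
# Independent mAP per dataset (Table 1).
independent = {"KITTI": 53.7, "Kitchen": 70.9, "VOC": 74.1, "COCO": 47.8}

# First-task mAP after learning the second task (Finetuning and AFD (both) rows of Table 1).
after_second_task = {
    ("KITTI", "Kitchen"): {"Finetuning": 10.9, "AFD": 36.6},
    ("Kitchen", "KITTI"): {"Finetuning": 9.2, "AFD": 20.5},
    ("VOC", "COCO"): {"Finetuning": 74.9, "AFD": 75.4},
    ("COCO", "VOC"): {"Finetuning": 20.1, "AFD": 27.8},
}


def forgetting(first_task, map_after):
    """Change in first-task mAP relative to Independent training (negative = forgetting)."""
    return round(map_after - independent[first_task], 1)


for (first, second), results in after_second_task.items():
    for method, value in results.items():
        print(f"{first} -> {second} | {method}: {forgetting(first, value):+.1f}")
```

Reading the output side by side shows the asymmetry discussed above: going from VOC to COCO yields a small gain, while going from COCO to VOC loses roughly 20 to 28 mAP on COCO.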

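| Method | Samples | Kitchen → KITTI: Kitchen | Kitchen → KITTI: KITTI | KITTI → Kitchen: KITTI | KITTI → Kitchen: Kitchen |
|---|---|---|---|---|---|
| none | 0 | 20.5 | 50.5 | 36.6 | 72.0 |
| random | 100 | 66.3 | 53.8 | 46.7 | 73.1 |
| hard | 100 | 69.4 | 51.6 | 42.1 | 71.7 |
| adaptive () | 100 | 68.1 | 52.5 | 45.9 | 71.4 |
| adaptive () | 100 | 68.6 | 53.4 | 48.4 | 72.0 |
| adaptive () | 100 | 67.8 | 52.6 | 48.1 | 72.4 |
| adaptive () | 100 | 67.3 | 53.2 | 47.2 | 72.7 |

Table 2: Ablation study on different sampling methods.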

Ablation study for exemplars   Although we have shown that attentive feature distillation can mitigate forgetting by a large margin, there is still a gap with Independent training in the more challenging scenario (i.e., KITTI and Kitchen). Therefore, we experiment with exemplar sampling by keeping a small number of exemplars from the previous task. As shown in Figure 5, we compare our proposed methods together with different sampling strategies. It is notable that with only a small number of randomly sampled exemplars, Finetuning improves markedly both from Kitchen to KITTI and from KITTI to Kitchen. Combined with our distillation methods, the accuracy on the previous task gets closer to Independent. Performance improves as more exemplars are added. Attentive feature distillation is superior to both individual bottom-up and top-down attentions.

As shown in Table 2, we also compare the different sampling strategies introduced in Section 4.2. All results are reported with attentive feature distillation. Hard sampling performs better than random sampling from Kitchen to KITTI, but is much worse from KITTI to Kitchen. When looking at the selected samples, we found that many consecutive frames show a very similar scene, which reduces the diversity of samples. By varying the adaptive sampling parameter (Section 4.2), we observe that adaptive sampling slightly outperforms both random and hard sampling; we fix the parameter to its best-performing value for the remaining experiments.

5.2 Comparison on two tasks

In this section, we compare our proposed methods to other baselines and methods in different scenarios.

KITTI and Kitchen.   As shown in Table 3, Joint performs slightly better than Independent by sharing representations from different tasks during training. Finetuning usually suffers from catastrophic forgetting, although with 100 exemplars it still performs quite well. LwF Detection [33] performs similarly to or worse than Finetuning. We assume this is due to the large amount of background after the RPN, which is not suitable for LwF Detection. Feature distillation is worse in both cases, while Attention distillation performs better by filtering out the noisy information. Attentive feature distillation outperforms all other methods and baselines and is close to both Independent and Joint with only 100 exemplars (less than 2% of KITTI data and less than 3% of Kitchen data).

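| Methods | Kitchen → KITTI: Kitchen | Kitchen → KITTI: KITTI | KITTI → Kitchen: KITTI | KITTI → Kitchen: Kitchen |
|---|---|---|---|---|
| Independent | 70.9 | – | 53.7 | – |
| Joint [35] | 74.0 | 54.3 | 54.3 | 74.0 |
| Finetuning | 63.0 | 54.5 | 38.5 | 70.5 |
| LwF Detection [33] | 59.9 | 54.7 | 39.4 | 69.9 |
| Feature Distillation | 62.7 | 54.4 | 35.0 | 69.4 |
| Attention Distillation | 64.2 | 52.8 | 39.8 | 71.0 |
| AFD | 68.6 | 53.4 | 48.1 | 72.4 |

Table 3: Results on KITTI and Kitchen (100 exemplars).

| Methods | Comic → Watercolor: Comic | Comic → Watercolor: Watercolor | Watercolor → Comic: Watercolor | Watercolor → Comic: Comic |
|---|---|---|---|---|
| Independent | 45.1 | – | 50.0 | – |
| Joint [35] | 45.3 | 49.7 | 49.7 | 45.3 |
| Finetuning | 39.6 | 48.8 | 46.4 | 43.4 |
| LwF Detection [33] | 39.8 | 48.3 | 44.2 | 44.7 |
| Feature Distillation | 39.7 | 48.5 | 43.8 | 41.0 |
| Attention Distillation | 40.9 | 47.9 | 48.1 | 44.6 |
| AFD | 42.9 | 48.5 | 49.4 | 45.5 |

Table 4: Results on Comic and Watercolor (without exemplars).

| Methods | COCO → VOC: COCO | COCO → VOC: VOC |
|---|---|---|
| Independent | 47.8 | – |
| Joint [35] | 44.3 | 79.0 |
| Finetuning | 28.7 | 73.3 |
| LwF Detection [33] | 26.6 | 73.0 |
| Feature Distillation | 26.9 | 72.4 |
| Attention Distillation | 28.5 | 73.0 |
| AFD | 36.8 | 75.2 |

Table 5: Results from COCO to VOC (500 exemplars).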

Comic and Watercolor.   Since we observed very little forgetting in this scenario, we compare with the methods and baselines without using exemplars. Similar conclusions to the previous scenario can be drawn, as seen in Table 4. Note that with our method, we achieve results similar to the upper bounds (Independent, Joint) without any exemplars.

COCO to VOC.   We present results on these relatively larger datasets in Table 5. Since there is no forgetting from VOC to COCO, we only report results from COCO to VOC. COCO is a more challenging dataset compared to the others because of its large amount of data and categories. Our attentive feature distillation surpasses the other distillation methods by a wide margin. However, there is still room for improvement before reaching the performance of Independent or Joint. We also show some qualitative results on COCO before and after training on VOC (see Figure 6). We can clearly see that Finetuning outputs fewer bounding boxes or has lower prediction confidence on COCO after learning VOC. Our proposed attentive feature distillation narrows the gap between Finetuning and Independent by forgetting less and predicting more confident scores.

Figure 6: Examples of predictions on COCO after training on VOC for Finetuning (top), Independent (bottom) and Ours (middle). Finetuning tends to forget objects, such as the hydrant, the tie or the keyboard. Ours is capable of detecting all those objects and only misses the ball and the bicycle. Independent is capable of detecting all the objects missed by the other methods, but is not capable of differentiating between the two elephants. In the last column, Ours predicts the objects with higher confidence than the other methods.

5.3 Experiments on longer sequences

In this section, we show results in more realistic and challenging settings with more than two tasks. Table 6 shows results for a sequence of KITTI, VOC, and Kitchen, where all datasets are evaluated after training on the last dataset. Our proposed approach clearly outperforms Finetuning in both orders of the sequential training. It is important to note that Finetuning with exemplars has a more competitive performance and lower forgetting, as seen in previous experiments.

To further scale our setting, we conduct experiments on a sequence of six different datasets: VOC, Clipart, KITTI, Watercolor, Comic, and Kitchen. To illustrate the dynamic process of forgetting after training on each dataset, we show a forgetting matrix in Figure 7. It reflects the forgetting of all current or previous datasets after training each task. The total forgetting for each dataset is obtained by adding all values in its row. For instance, after training the last task, performance on VOC drops 27.1 for Finetuning, but only 19.5 for Ours. It is interesting that after training on KITTI, datasets like VOC and Clipart experience severe forgetting. However, performance on KITTI does not drop much after training on the other datasets.

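| Methods | First order (three datasets) | Second order (three datasets) |
|---|---|---|
| Finetuning | 43.8 / 55.0 / 68.8 | 64.6 / 58.4 / 57.5 |
| AFD | 53.2 / 63.4 / 71.2 | 69.1 / 65.0 / 59.6 |

Table 6: Results on KITTI, VOC and Kitchen (300 exemplars).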

6 Conclusions

We have proposed a novel setting for continual learning for universal object detection. We have shown how forgetting can be mitigated by using attentive feature distillation. We can further improve performance in the more difficult scenarios by adding our proposed adaptive exemplar sampling. The experimental analysis indicates that when learning a new task, domain differences with previous tasks cause less forgetting than category differences. By combining attentive feature distillation and adaptive exemplar sampling, we outperform existing methods with a relatively small number of samples. As future work, it would be interesting to apply our proposed approaches to one-stage detectors.

Figure 7: Forgetting matrix for Finetuning (left) and AFD (right) on sequential learning of six tasks. Rows represent the forgetting progression of each dataset over the sequence. The lighter the color, the less the forgetting. The total forgetting is obtained by adding all values in a row.


  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
  • [2] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
  • [3] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
  • [4] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751, 2017.
  • [5] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.
  • [6] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5138–5146, 2019.
  • [7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
  • [9] Georgios Georgakis, Md Alimoor Reza, Arsalan Mousavian, Phi-Hung Le, and Jana Košecká. Multiview rgb-d dataset for object instance detection. In 2016 Fourth International Conference on 3D Vision (3DV), pages 426–434. IEEE, 2016.
  • [10] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NeurIPS Deep Learning and Representation Learning Workshop, 2015.
  • [12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [13] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5001–5009, 2018.
  • [14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • [15] Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
  • [16] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • [17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [19] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [20] Xialei Liu, Marc Masana, Luis Herranz, Joost Van de Weijer, Antonio M Lopez, and Andrew D Bagdanov. Rotate your networks: Better weight consolidation and less catastrophic forgetting. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2262–2268. IEEE, 2018.
  • [21] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, pages 6467–6476, 2017.
  • [22] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, pages 67–82, 2018.
  • [23] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In CVPR, pages 7765–7773, 2018.
  • [24] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
  • [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • [26] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • [27] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [29] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [30] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • [31] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In ICML, pages 4555–4564, 2018.
  • [32] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NIPS, pages 2990–2999, 2017.
  • [33] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3400–3409, 2017.
  • [34] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4933–4942, 2019.
  • [35] Xudong Wang, Zhaowei Cai, Dashan Gao, and Nuno Vasconcelos. Towards universal object detection by domain attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7289–7298, 2019.
  • [36] Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay GANs: learning to generate images from new categories without forgetting. In NIPS, 2018.
  • [37] Jianwei Yang, Jiasen Lu, Dhruv Batra, and Devi Parikh. A faster pytorch implementation of faster r-cnn, 2017.
  • [38] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.
  • [39] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, pages 3987–3995. JMLR.org, 2017.