Two-Level Residual Distillation based Triple Network for Incremental Object Detection

07/27/2020 ∙ by Dongbao Yang, et al.

Modern object detection methods based on convolutional neural networks suffer from severe catastrophic forgetting when learning new classes without the original data. Due to time consumption, storage burden and privacy concerns over old data, it is inadvisable to train the model from scratch with both old and new data when new object classes emerge after the model has been trained. In this paper, we propose a novel incremental object detector based on Faster R-CNN that continuously learns from new object classes without using old data. It is a triple network in which an old model and a residual model act as assistants that help the incremental model learn new classes without forgetting the previously learned knowledge. To better maintain the discrimination of features between old and new classes, the residual model is jointly trained on the new classes during the incremental learning procedure. In addition, a corresponding distillation scheme is designed to guide the training process, which consists of a two-level residual distillation loss and a joint classification distillation loss. Extensive experiments on VOC2007 and COCO are conducted, and the results demonstrate that the proposed method can effectively learn to incrementally detect objects of new classes, and that the problem of catastrophic forgetting is mitigated in this context.


1. Introduction

Although modern object detection methods based on convolutional neural networks have achieved state-of-the-art results, they suffer from severe catastrophic forgetting (French, 1999) (Goodfellow et al., 2013) (McCloskey and Cohen, 1989) when learning new object classes. In practice, new object classes often emerge after the model has been trained. Finetuning is a common way to adapt the old model to new classes, achieved by replacing the output layer with one for the new classes or by adding units to the output layer for the new classes, as shown in Figure 1. However, the performance may degrade severely due to the absence of old data. Intuitively, training the model from scratch with both old and new data would solve this problem, but it takes a lot of time and increases the storage burden of keeping the old data. In particular, the training data for the pretrained model are not always available for a new task. Therefore, it is necessary to develop incremental learning methods for object detection, which can continuously learn from new data instead of training on the whole dataset, while preserving the previously learned knowledge.

Figure 1. Finetuning on new classes. The old model is trained on the old class (person), and the new model is obtained by finetuning the old model on the new class (horse) after adding one unit to the output layer. The person in the test image cannot be detected after finetuning.

Many studies on incremental learning mainly focus on image classification. According to the optimization direction used to overcome catastrophic forgetting, these methods can be divided into two categories (Hou et al., 2019): (1) preserving significant parameters of the original model (Aljundi et al., 2018) (Kirkpatrick et al., 2017) (Zenke et al., 2017); (2) preserving the knowledge of the original model through knowledge distillation (Aljundi et al., 2017) (Jung et al., 2018) (Li and Hoiem, 2017) (Rannen et al., 2017) (Rebuffi et al., 2017). Because it is difficult to design a reasonable metric to evaluate the importance of all parameters, we follow the second direction to preserve the knowledge of the original classes when adapting the model to detect new object classes, which utilizes the supervisory information provided by the old model to guide the training of the new model via distillation losses.

Different from image classification, object detection involves the discrimination between foreground and background and the precise localization of objects, which increases the difficulty of incremental learning. Previous incremental object detection methods (Chen et al., 2019) (Hao et al., 2019b) (Hao et al., 2019a) (Li et al., 2019) (Shmelkov et al., 2017) (Zhang et al., 2020) mainly let the incremental model directly imitate the old model at some important positions, such as the feature space and the output layers, to preserve the knowledge learned from old data, which is achieved by constraining their activations to be similar. However, simply imitating the old model suppresses the feature discrimination between old and new classes to some extent.

To solve these problems, we propose a novel triple-network based incremental object detector built on Faster R-CNN (Ren et al., 2015). Three detection models cooperate to adapt the network trained on old classes to new classes while ensuring that the performance on the old classes does not degrade, without using the original training data. To train an incremental model responsible for detecting both old and new classes, a frozen copy of the original Faster R-CNN trained on old classes is utilized to provide the knowledge of old classes, including features, the distributions of the output layers and pseudo ground-truth. In addition, a novel residual model is proposed to assist the incremental learning procedure. To preserve the previously learned knowledge, a novel distillation scheme is designed, which includes a two-level residual distillation loss and a joint classification distillation loss applied on the feature space and the output layers respectively.

The contributions are as follows:

  • We propose a triple-network based incremental object detector, in which a residual model is jointly trained to detect new classes during the incremental learning procedure and is introduced to fit the difference between the incremental model and the old model.

  • A two-level residual distillation loss is designed to maintain the feature discrimination between old and new classes, and a joint classification distillation loss is used to maintain the learned knowledge from both old and new data.

  • Extensive experiments are conducted on VOC2007 (Everingham et al., 2010) and COCO (Lin et al., 2014), and the results demonstrate that the proposed method is effective for incremental object detection and achieves promising results compared with other methods.

2. Related Work

Incremental learning is a significant problem of machine learning (Cauwenberghs and Poggio, 2001) (Kuzborskij et al., 2013) (Mensink et al., 2013) (Polikar et al., 2001). Recently, with the success of deep learning, many researchers have paid more attention to incremental learning for deep neural networks. Most existing incremental learning methods for vision tasks mainly focus on image classification, i.e., continuously updating an image classifier to recognize new classes without decreasing the accuracy on previously seen classes. Existing works can be divided into two categories based on the optimization directions for preserving learned knowledge (Hou et al., 2019): parameter-based and distillation-based.

Figure 2. The framework of the proposed end-to-end incremental object detector. There are three detection models: (1) Old Model is a frozen copy of the original trained model, which is used to generate pseudo ground-truth and supervisory information for the old classes; (2) Incremental Model is finetuned to incrementally learn new classes while preserving the original knowledge through distillation losses; (3) Residual Model is used to learn the residual between the Old Model and the Incremental Model while learning to detect the new classes.

Some works are based on preserving important parameters of the network to maintain the performance on old tasks. (Jung et al., 2016) presents a method that maintains the performance on the previous task by freezing the weights of the softmax layer and minimizing the distance between the features of old data extracted from the target and source networks respectively. The limitation of this approach is that the weights of the new and old tasks may conflict, and the parameters cannot be updated, which degrades the learning of new tasks. EWC (Kirkpatrick et al., 2017) is a prominent work in this category, which initializes the weights of the new model with those of the original model and remembers old tasks by selectively slowing down learning on the weights important for those tasks. MAS (Aljundi et al., 2018) accumulates an importance measure for each parameter of the network based on the sensitivity of the predicted output function to changes in that parameter; changes to important parameters are penalized when learning a new task. Zenke et al. (Zenke et al., 2017) introduce intelligent synapses that accumulate task-relevant information over time and exploit this information to rapidly store new memories without forgetting old ones. The limitation of these works is that it is hard to design a metric to evaluate the importance of all parameters.

Distillation-based methods form the other representative category of incremental learning, where knowledge distillation is used to transfer knowledge from the original network to the new network. Knowledge distillation utilizes the supervisory information provided by a teacher model to guide the training of a student model so that it mimics the teacher through a distillation loss. LwF (Li and Hoiem, 2017) is the first to use knowledge distillation for incremental learning, utilizing a modified cross-entropy loss to preserve original knowledge with only examples from the new task. iCaRL (Rebuffi et al., 2017) proposes to jointly learn feature representations and classifiers by combining representation learning and knowledge distillation, and a small set of exemplars is selected to perform nearest-mean-of-exemplars classification. Rannen et al. (Rannen et al., 2017) propose an auto-encoder based method to retain the knowledge from old tasks, which prevents the reconstructions of the features from changing while leaving the features room to adjust. Sun et al. (Sun et al., 2018a) (Sun et al., 2018b) propose to maintain a lifelong dictionary, which is used to transfer knowledge when learning each new metric learning task.

Transfer learning is also related to incremental object detection; it uses the knowledge acquired from one task to help train other tasks. Finetuning is a representative paradigm of transfer learning, frequently used to initialize the backbone of an object detection model with a CNN trained on ImageNet (Krizhevsky et al., 2012). (Hinton et al., 2015) transfers the knowledge from a large network to a small network by knowledge distillation, which encourages the responses of the two networks to be similar. However, transfer learning needs data for both old and new tasks to maintain the performance on the old task; otherwise the performance degrades severely when the old data are not available.

For incremental object detection, (Shmelkov et al., 2017) introduces the first incremental object detector, based on Fast R-CNN (Girshick, 2015), by applying knowledge distillation without using previous training data. It first uses EdgeBoxes (Zitnick and Dollár, 2014) and MCG (Arbeláez et al., 2014) to precompute proposals. These proposals are then sampled and fed into R-CNN to predict their categories. The model is trained with distillation losses applied on the outputs of the final classification and regression layers to preserve the ability to recognize old classes. However, this method is not end-to-end and the proposal generation procedure is not learnable.

Recently, several end-to-end incremental object detection methods (Chen et al., 2019) (Hao et al., 2019b) (Hao et al., 2019a) (Li et al., 2019) have been proposed based on Faster R-CNN (Ren et al., 2015). (Chen et al., 2019) proposes a hint loss to minimize the difference between the feature maps of the old and incremental models, and a confidence loss to suppress the generation of low-confidence bounding boxes. (Hao et al., 2019b) proposes a hierarchical large-scale retail object detection dataset (TGFS) and utilizes a fixed-size exemplar set of old data to train a class-incremental object detector. (Hao et al., 2019a) uses a frozen duplicate of the RPN to preserve the knowledge gained from the old classes, and proposes a feature-changing loss to reduce the difference of the feature maps between the old and new classes. (Li et al., 2019) distills three types of knowledge from the old model to mimic its behavior in object classification, bounding box regression and feature extraction. (Zhang et al., 2020) proposes a dual distillation training function which pretrains a separate model only for the new classes, such that a student model can learn from two teacher models simultaneously. In addition, a recent work (Perez-Rua et al., 2020) proposes an incremental few-shot object detector based on CentreNet (Zhou et al., 2019); however, the original structure is redesigned for few-shot learning in this method. In our work, we focus on incremental object detection without changing the original network, which can be applied given only an existing trained model and the data of new classes. Different from these methods, we not only let the incremental model imitate the important activations of the old model, but also introduce a residual model trained simultaneously on the new classes, which is designed to maintain the feature discrimination between old and new classes in an end-to-end way without extra model training steps.

3. Method

In this paper, we propose an end-to-end incremental object detector that continuously learns from new data without using old data. Figure 2 presents the whole framework of the proposed method. It is a triple network that includes three detection models. Old Model (OM) is a frozen copy of the original detector trained on old data, which provides the knowledge of old classes, including the detection results and the distributions of the output layers. Incremental Model (IM) is adapted to detect both old and new classes using the annotations of new data and the knowledge from OM. The detection results from OM are regarded as pseudo ground-truth and are combined with the annotations of new data for updating IM. Residual Model (RM) is an assistant model jointly trained to detect new classes. To better preserve the knowledge of old classes and maintain the discrimination between old and new classes, a new distillation scheme adds constraints on the training procedure of IM, including a residual distillation loss and a joint classification distillation loss, applied on the feature space and the output layers respectively. The method is described in detail as follows.

3.1. Triple-Network based Incremental Detector

The triple-network for incremental object detection is based on Faster R-CNN, which is an end-to-end proposal-based object detector, and the backbone used in our framework is ResNet50 (He et al., 2016).

In the incremental learning stage, the parameters of IM are initialized from OM, except the weights and biases of the newly added output units for new classes, which are initialized randomly. After the training samples of new data are fed into the triple network, OM generates bounding boxes, which are filtered by a confidence threshold and per-category non-maximum suppression (NMS). For the remaining bounding boxes, the IoU with the ground-truth of new data is computed to further filter overlapping boxes. We delete a bounding box from OM if its IoU with a ground-truth box of a new class exceeds a given IoU threshold, since such boxes are obviously wrong detection results. The remaining bounding boxes serve as pseudo ground-truth and are combined with the original annotations of new data for training IM.
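As a concrete illustration, the pseudo ground-truth filtering described above might be sketched as follows in PyTorch/torchvision; the function name, argument layout and default thresholds (taken from the implementation details in Section 4.1) are our assumptions rather than the authors' released code.

```python
import torch
from torchvision.ops import nms, box_iou


def filter_pseudo_gt(boxes, scores, labels, new_gt_boxes,
                     conf_thresh=0.5, nms_thresh=0.3, iou_thresh=0.3):
    """Turn Old Model detections into pseudo ground-truth for old classes.

    boxes:        (N, 4) detections of OM on a training image of new classes
    scores:       (N,)   detection confidences
    labels:       (N,)   predicted old-class labels
    new_gt_boxes: (M, 4) annotated boxes of the new classes
    """
    # 1) drop low-confidence detections
    keep = scores > conf_thresh
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

    # 2) per-category NMS
    kept = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        kept.append(idx[nms(boxes[idx], scores[idx], nms_thresh)])
    if kept:
        kept = torch.cat(kept)
        boxes, scores, labels = boxes[kept], scores[kept], labels[kept]

    # 3) delete OM boxes that overlap an annotated new-class box too much;
    #    such boxes are very likely wrong detections of an old class
    if len(boxes) and len(new_gt_boxes):
        max_iou = box_iou(boxes, new_gt_boxes).max(dim=1).values
        keep = max_iou <= iou_thresh
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

    # the surviving boxes are later merged with the new-class annotations
    return boxes, scores, labels
```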

Figure 3. Illustration of the generally used distillation and the residual distillation in this paper: (a) general distillation; (b) residual distillation of the backbone feature; (c) residual distillation of the pooled feature after RoI-Pooling.

To preserve the learned knowledge, many methods simply train IM to mimic the features and outputs of OM. As shown in Figure 3 (a), the general distillation calculates the difference between OM and IM and minimizes it during training. This constraint restricts the capability of IM to learn the unique characteristics of both old and new classes. Therefore, RM is designed to maintain the difference between old and new classes; it is trained only to detect new classes in the incremental learning stage. Meanwhile, RM should mimic the residual feature between OM and IM, which can be seen as the representation of the new classes. The backbone of RM is initialized from the pretrained ResNet50, and the remaining parameters are initialized randomly.

In this framework, OM and RM are jointly utilized to assist the training of IM, and only IM is used to obtain detection results at inference time. Note that RM is trained jointly with IM, so the method remains an end-to-end incremental learning procedure.

3.2. Distillation Losses

To avoid catastrophic forgetting, we design a new distillation scheme to constrain the learning process of IM, which keeps significant activations of OM, IM and RM similar, thus preserving the knowledge of previously seen classes while maintaining the learning capability for unseen classes. The scheme consists of two types of distillation losses, applied at different positions of the network: (1) the feature space; (2) the output layers of R-CNN.

For the feature space, in addition to the general distillation between IM and OM, we also propose a two-level residual distillation, which includes a base feature residual and a pooled feature residual, as shown in Figure 3 (b) (c). For the backbone of IM, although freezing the parameters would best preserve the knowledge of previously seen classes, it would also remove the ability to update for new classes, so learning new classes would depend only on updating the parameters of the classifier and regressor, which degrades performance. If we directly finetune the backbone on the new data, the parameters drift towards the new classes. Following the general practice, distillation should be used to maintain the feature similarity between OM and IM, so we design a new way to calculate the difference between the backbone features, which is written as Eq. 1:

D(A, B) = \left\| \bar{A}/\|\bar{A}\|_2 - \bar{B}/\|\bar{B}\|_2 \right\|_1, \qquad \bar{A}(x, y) = \frac{1}{C} \sum_{c=1}^{C} A_c(x, y)    (1)

A and B represent feature maps, C is the number of channels, and W and H are the spatial dimensions. The maps Ā and B̄ are the means of A and B along the channel dimension, where (x, y) represents a coordinate on a channel of the feature map. The difference between two feature maps is D(A, B), where an L1 loss penalizes differences between the L2-normalized maps Ā and B̄. Compared to directly computing the L1 distance between two feature maps, this measure considers the 2D information of the feature map as a whole rather than as individual points. The general feature distillation between OM and IM is then defined as L_fea = D(F_OM, F_IM), where F_OM and F_IM are the backbone feature maps of OM and IM.
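For reference, a minimal PyTorch sketch of this distance under our reading of Eq. 1 (channel-wise mean, L2 normalization of the whole 2D map, then an L1 penalty) is given below; the function name and the small epsilon added for numerical stability are our choices.

```python
import torch


def feature_distance(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Distance of Eq. 1 between two backbone feature maps of shape (B, C, H, W).

    Each map is averaged over channels, the resulting 2D map is L2-normalized
    as a whole, and the L1 distance between the normalized maps is returned.
    """
    a = feat_a.mean(dim=1)  # (B, H, W): mean along the channel dimension
    b = feat_b.mean(dim=1)
    a = a / (a.flatten(1).norm(dim=1).view(-1, 1, 1) + 1e-8)  # L2-normalize per image
    b = b / (b.flatten(1).norm(dim=1).view(-1, 1, 1) + 1e-8)
    return (a - b).abs().sum(dim=(1, 2)).mean()  # L1 distance, averaged over the batch
```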

To maintain the capability of IM to preserve the learned knowledge of seen classes and to learn new classes simultaneously, we propose a residual distillation method (L_res), which can be divided into two parts, L_res^base and L_res^pool, representing the residual losses applied on the backbone feature map and on the pooled features after RoI pooling respectively.

For the backbone feature map, the residual between OM and IM is written as Eq. 2:

R = F_IM - F_OM    (2)

The difference is then calculated between the residual R and the backbone feature map F_RM of RM, using the distance of Eq. 1:

L_res^base = D(F_IM - F_OM, F_RM)    (3)

To integrate the features of the backbones in the triple network without increasing the model complexity at inference time, a feature merge loss is proposed, which minimizes the distance between the feature map of IM and the merged feature map F_OM + F_RM of OM and RM, keeping the feature of IM close to it. Due to the residual mechanism, the sum of the features from OM and RM can be regarded as the representation of incremental learning. Therefore, to better fuse these three features, we calculate the loss of Eq. 1 between the merged feature and the IM feature:

L_merge = D(F_OM + F_RM, F_IM)    (4)
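A minimal sketch combining Eqs. 2-4 could look like the following, reusing the feature_distance function sketched above; the loss names are ours, and whether gradients should be stopped on any of the three branches is not specified in the text, so all branches are left differentiable here.

```python
def backbone_residual_losses(f_om, f_im, f_rm):
    """Base-level residual distillation (Eqs. 2-3) and feature merge loss (Eq. 4).

    f_om, f_im, f_rm: backbone feature maps (B, C, H, W) produced by the Old,
    Incremental and Residual models for the same input image.
    """
    residual = f_im - f_om                          # Eq. 2: residual between IM and OM
    l_res_base = feature_distance(residual, f_rm)   # Eq. 3: RM mimics the residual
    l_merge = feature_distance(f_om + f_rm, f_im)   # Eq. 4: IM stays close to OM + RM
    return l_res_base, l_merge
```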

After RoI-Pooling, we add another residual distillation loss on the pooled features, using the same RoIs from IM to assist the residual learning, where a bidirectional regularization is designed to maintain the feature discrimination in the triple network, and an L1 loss is directly applied to compute the instance-level distance between the pooled-feature residual and the pooled feature of RM:

L_res^pool = \| (P_IM - P_OM) - P_RM \|_1    (5)

where P_OM, P_IM and P_RM represent the pooled features of OM, IM and RM for the same RoIs.
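Assuming P_OM, P_IM and P_RM are RoI-pooled features of the same RoIs, Eq. 5 reduces to a single instance-level L1 term; the sketch below is our reading of this loss, not the authors' code.

```python
import torch.nn.functional as F


def pooled_residual_loss(p_om, p_im, p_rm):
    """Instance-level residual distillation on RoI-pooled features (Eq. 5)."""
    # L1 distance between the pooled-feature residual (IM - OM) and RM's pooled feature
    return F.l1_loss(p_im - p_om, p_rm)
```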

For the output layers of R-CNN, the weights and biases of the classification layer can be considered as a high-level semantic representation of the classes, so the parameters of the classification layer corresponding to old classes are initialized from OM to maintain this representation. During the training of IM, the classifier is finetuned to learn new classes, and the classification outputs of IM are constrained to generate distributions similar to those of OM on old classes and of RM on new classes, which preserves the original knowledge. We compute an L2 loss between the softmax outputs of the old classes and background from the classification layers of OM and IM, and between the softmax outputs of the new classes from the classification layers of RM and IM:

L_cls^old = \frac{1}{N_o + 1} \sum_{c=0}^{N_o} \left( p_OM^c - p_IM^c \right)^2    (6)
L_cls^new = \frac{1}{N_n} \sum_{c=1}^{N_n} \left( p_RM^c - p_IM^c \right)^2    (7)

where p_OM, p_RM and p_IM are the classification outputs of OM, RM and IM respectively, c indexes the classes (c = 0 denotes the background), and N_o and N_n are the numbers of old and new classes respectively.
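Under this reconstruction, the joint classification distillation might be sketched as below; the assumed layout of the classification outputs (background first, then old classes, then new classes) and the use of mean squared error for the L2 loss are our assumptions.

```python
import torch.nn.functional as F


def joint_cls_distillation(logits_im, logits_om, logits_rm, num_old, num_new):
    """Joint classification distillation (Eqs. 6-7).

    logits_im: (R, 1 + num_old + num_new) R-CNN class logits of IM for R RoIs
    logits_om: (R, 1 + num_old)           logits of OM for the same RoIs
    logits_rm: (R, 1 + num_new)           logits of RM for the same RoIs
    Assumed layout: index 0 is background, then old classes, then new classes.
    """
    p_im = F.softmax(logits_im, dim=1)
    p_om = F.softmax(logits_om, dim=1)
    p_rm = F.softmax(logits_rm, dim=1)

    # Eq. 6: background + old classes of IM follow OM
    l_cls_old = F.mse_loss(p_im[:, : num_old + 1], p_om)
    # Eq. 7: new classes of IM follow RM (RM's background slot is ignored here)
    l_cls_new = F.mse_loss(p_im[:, num_old + 1:], p_rm[:, 1:])
    return l_cls_old, l_cls_new
```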

3.3. 2-Threshold Training

For end-to-end two-stage object detectors, it is difficult to preserve the performance on old classes when directly adapting OM to new classes without using old data, due to the class-sensitive RPN and R-CNN. Intuitively, the detection results of OM on the new training data can provide useful information about old classes for incremental learning. Consequently, we utilize the detection results from OM as pseudo ground-truth to keep the ability to detect old classes when training on new data. Due to the lack of ground truth of old classes on the new data, wrong detection results cannot be rectified. Therefore, the confidence threshold used to obtain the pseudo ground-truth is an important hyper-parameter that has a great influence on the performance of IM. A high threshold may discard some potential object-like proposals of old classes, while a low threshold may include many false positives, which confuse the classifier of IM and lead to a degraded detector.

As is well known, for two-stage object detection methods, the RPN needs to generate region proposals from complex background with a high recall, while R-CNN needs to accurately classify and regress these proposals. To better preserve the knowledge learned from old data, we train these two parts of IM for different purposes in the incremental training procedure: the RPN is trained to preserve the behavior of OM's RPN and generate more object-like proposals for both old and new classes, while R-CNN is trained to accurately discriminate between different classes. Therefore, we design a 2-threshold training strategy, where a low threshold selects more potential proposals for training the RPN and a high threshold keeps only high-confidence detections for training a precise R-CNN.

Algorithm 1 describes the whole procedure of generating the pseudo ground-truth and the 2-threshold training strategy for calculating the original Faster R-CNN losses. The original Faster R-CNN loss consists of the RPN and R-CNN losses, which are calculated as:

L_RPN = L_cls^RPN(G \cup \tilde{G}_low) + L_reg^RPN(G \cup \tilde{G}_low)    (8)
L_RCNN = L_cls^RCNN(G \cup \tilde{G}_high) + L_reg^RCNN(G \cup \tilde{G}_high)    (9)
L_det = L_RPN + L_RCNN    (10)

where G is the ground-truth of the new classes and \tilde{G}_low, \tilde{G}_high are the pseudo ground-truth sets selected from the detections of OM. The RPN and R-CNN losses both include classification and regression terms. θ_low and θ_high are the two confidence thresholds used to select the pseudo ground-truth for training RPN and R-CNN respectively.

1: Input: incremental model IM, old model OM, image x, ground-truth G of the new classes, two confidence thresholds θ_low and θ_high, IoU threshold θ_IoU
2: Output: loss L_det
3: R ← detection results of OM on x (after per-category NMS)
4: Let pseudo ground-truth G̃ ← ∅
5: for each detection r in R do
6:     if max IoU(r, G) < θ_IoU then
7:         G̃ ← G̃ ∪ {r}
8:     end if
9: end for
10: Compute L_RPN with G ∪ {r ∈ G̃ : score(r) > θ_low} (Eq. 8)
11: Compute L_RCNN with G ∪ {r ∈ G̃ : score(r) > θ_high} (Eq. 9)
12: Compute L_det = L_RPN + L_RCNN (Eq. 10)
13: return L_det
Algorithm 1 2-Threshold Training Strategy
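A compact Python sketch of Algorithm 1 is given below; rpn_loss_fn and rcnn_loss_fn stand in for the standard Faster R-CNN loss computations and are placeholders rather than functions of any particular library, and the default thresholds 0.1/0.9 are the choice reported for the +B(20) experiment in Figure 4.

```python
import torch
from torchvision.ops import box_iou


def two_threshold_losses(om_boxes, om_scores, new_gt_boxes,
                         rpn_loss_fn, rcnn_loss_fn,
                         theta_low=0.1, theta_high=0.9, iou_thresh=0.3):
    """2-threshold strategy (Algorithm 1, Eqs. 8-10).

    om_boxes / om_scores: NMS-filtered detections of the Old Model on one image.
    rpn_loss_fn / rcnn_loss_fn: placeholders computing the standard Faster R-CNN
    RPN and R-CNN losses given a set of target boxes.
    """
    # discard OM detections that overlap annotated new-class objects
    if len(om_boxes) and len(new_gt_boxes):
        keep = box_iou(om_boxes, new_gt_boxes).max(dim=1).values < iou_thresh
        om_boxes, om_scores = om_boxes[keep], om_scores[keep]

    pseudo_low = om_boxes[om_scores > theta_low]    # permissive set of object-like boxes for RPN
    pseudo_high = om_boxes[om_scores > theta_high]  # conservative, high-confidence set for R-CNN

    l_rpn = rpn_loss_fn(torch.cat([new_gt_boxes, pseudo_low], dim=0))     # Eq. 8
    l_rcnn = rcnn_loss_fn(torch.cat([new_gt_boxes, pseudo_high], dim=0))  # Eq. 9
    return l_rpn + l_rcnn                                                 # Eq. 10
```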

The overall loss used to adapt the detector to the new classes is presented in Eq. 11, which is the sum of the standard Faster R-CNN losses and the proposed distillation losses. The Faster R-CNN losses of IM are applied to all classes, where the pseudo ground-truth of old classes and the ground-truth of new classes are used for training. The Faster R-CNN losses of RM are applied to the new classes only, where only the ground-truth of new classes is used for training. The distillation losses are used to constrain the learning of IM.

L = L_det^IM + L_det^RM + \lambda \left( L_fea + L_res^base + L_merge + L_res^pool + L_cls^old + L_cls^new \right)    (11)

where λ is a trade-off between the original Faster R-CNN losses and the proposed distillation losses and is kept fixed in all experiments.

4. Experiments

Setting 19 + 1 (one new class added at once):
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
A(1-19) 76.79 81.43 75.85 59.2 56.75 81.61 84.76 84.15 51.76 82.5 67.23 83.28 83.69 79.66 78.21 47.08 73.42 67.95 77.98 - 73.33
Finetune 31.80 24.68 28.27 25.46 24.59 43.58 61.38 35.25 10.60 35.59 17.47 22.34 27.46 20.02 20.01 16.81 28.11 11.10 28.67 56.50 28.49
 (Shmelkov et al., 2017) 69.4 79.3 69.5 57.4 45.4 78.4 79.1 80.5 45.7 76.3 64.8 77.2 80.8 77.5 70.1 42.3 67.5 64.4 76.7 62.7 68.3
 (Chen et al., 2019) 68.30 <60.0 <68.30
 (Li et al., 2019) 69.7 78.3 70.2 46.4 59.5 69.3 79.7 79.9 52.7 69.8 57.4 75.8 69.1 69.8 76.4 43.2 68.5 70.9 53.7 40.4 65.00
+B(20) 2-th 73.65 81.03 75.17 60.59 57.69 80.95 84.65 85.51 52.11 80.82 63.67 83.22 83.89 80.75 78.04 47.25 75.04 67.05 79.47 51.98 72.13
Setting 15 + 5 (five new classes added at once):
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
A(1-15) 77.54 79.02 74.41 60.44 58.09 76 84.88 84.82 51.15 76 65.68 83.16 84.11 79.05 78.2 - - - - - 74.17
Finetune 54.07 50.34 47.80 32.71 21.12 51.57 71.14 64.62 19.18 47.98 47.59 52.77 61.22 46.08 42.46 37.22 55.63 56.95 62.99 63.31 49.34
 (Shmelkov et al., 2017) 70.5 79.2 68.8 59.1 53.2 75.4 79.4 78.8 46.6 59.4 59.0 75.8 71.8 78.6 69.6 33.7 61.5 63.1 71.7 62.2 65.9
+B(16-20) 2-th 75.56 81.05 75.76 58.77 58.11 77.03 83.90 84.69 52.77 75.62 66.25 81.56 84.37 78.78 76.89 30.83 65.86 57.98 72.63 55.76 69.71
Setting 10 + 10 (ten new classes added at once):
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
A(1-10) 90.42 90.77 90.55 90.62 86.65 87.37 90.35 89.15 87.64 77.78 - - - - - - - - - - 88.13
Finetune 52.67 27.05 41.87 30.06 15.42 40.78 46.85 60.44 13.03 40.50 57.56 70.85 78.76 70.35 75.84 38.65 64.96 63.43 69.96 64.04 51.15
 (Shmelkov et al., 2017) 69.9 70.4 69.4 54.3 48.0 68.7 78.9 68.4 45.5 58.1 59.7 72.7 73.5 73.2 66.3 29.5 63.4 61.6 69.3 62.2 63.1
 (Li et al., 2019) 71.7 81.7 66.9 49.6 58.0 65.9 84.7 76.8 50.1 69.4 67.0 72.8 77.3 73.8 74.9 39.9 68.5 61.5 75.5 72.4 67.90
+B(11-20) 2-th 75.85 73.44 72.35 58.57 58.86 79.11 82.55 77.47 44.10 73.90 54.20 73.23 76.15 72.05 69.86 30.82 65.05 56.36 70.99 59.24 66.21
A(1-20) 78.94 78.94 74.87 64.61 56.06 81.80 84.58 84.67 52.48 83.56 66.72 84.60 84.21 78.47 78.33 47.93 74.84 69.43 78.6 73.36 73.85
Table 1. Results on the VOC2007 test dataset. Per-class average precision (%) is presented under different settings when 1, 5 or 10 classes are added at once.
Method mAP@.5 mAP@[.5, .95]
A(1-40) 32.55 16.3
 (Shmelkov et al., 2017) 37.4 21.3
+B(41-80) 2-th 43.75 24.23
A(1-80) 49.59 29.04
Table 2. Results on COCO minival (first 5000 validation images). The mAP@.5 and mAP@[.5,.95] (%) of different methods are listed when 40 classes are added at once.
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP mAP of (Shmelkov et al., 2017)
A(1-15) 77.54 79.02 74.41 60.44 58.09 76 84.88 84.82 51.15 76 65.68 83.16 84.11 79.05 78.2 - - - - - 74.17
+B(16) 75.51 79.51 74.52 58.63 57.12 74.35 84.72 84.49 47.85 73.35 60.17 82.45 84.01 79.15 77.32 26.73 - - - - 69.99 67.0
+B(16)(17) 74.10 79.87 73.57 54.91 57.09 75.65 84.28 80.96 46.66 75.86 61.59 80.58 83.85 78.62 76.87 20.20 60.89 - - - 68.56 63.9
+B(16)(17)(18) 72.03 78.36 73.05 55.35 57.44 75.99 84.33 80.62 47.20 74.21 60.02 78.97 83.71 78.56 73.09 23.54 30.32 43.08 - - 64.99 62.5
+B(16)(17)(18)(19) 72.10 77.65 70.16 52.98 54.61 74.41 83.59 79.08 46.34 73.64 58.43 77.67 82.33 76.79 72.91 20.81 19.17 51.57 69.28 - 63.87 62.4
+B(16)(17)(18)(19)(20) 67.29 78.56 69.45 54.44 55.61 74.45 83.17 80.16 44.98 72.41 49.14 78.09 82.98 77.15 74.58 21.00 13.33 53.48 14.90 47.16 59.62 62.4
A(1-20) 78.94 78.94 74.87 64.61 56.06 81.80 84.58 84.67 52.48 83.56 66.72 84.60 84.21 78.47 78.33 47.93 74.84 69.43 78.6 73.36 73.85
Table 3. Results on the VOC2007 test dataset. Per-class average precision (%) is presented when 5 classes are added at once or sequentially; the last column lists the corresponding mAP of (Shmelkov et al., 2017).
Method +table +dog +horse +mbike +person +plant +sheep +sofa +train +tv
Fast RCNN  (Shmelkov et al., 2017) 65.1 62.5 59.9 59.8 59.2 57.3 49.1 49.8 48.7 49
Faster RCNN  (Chen et al., 2019) 66.3 62.6 54.7 50.3 48.8 45.5 38.2 36.6 31.2 33.5
Ours 64.46 64.14 61.68 55.96 52.82 50.04 48.25 44.29 37.80 35.39
Table 4. Results on the VOC2007 test dataset. Average precision (%) is presented when adding 10 classes sequentially.
Method A B C D mAP mAP of (Shmelkov et al., 2017) mAP of (Hao et al., 2019a)
Ours: A 71.97 - - - 71.97 66.3 63.9
Ours: A+B 66.23 69.98 - - 68.1 52 57.5
Ours: A+B+C 60.71 51.24 60 - 57.32 47 50.9
Ours: A+B+C+D 54.89 44.64 39.81 41.02 45.09 39.25 48.5
Table 5. Results on the VOC2007 test dataset when the four groups are added sequentially. Columns A-D give the average precision (%) on each group of 5 classes; the last two columns list the mAP of the compared methods.

4.1. Settings

Datasets. The proposed method is evaluated on two object detection benchmarks, PASCAL VOC2007 and Microsoft COCO. VOC2007 has 20 object classes and consists of 5K images in the trainval subset and 5K images in the test subset. We use the test subset for evaluation. COCO has 80 object classes, and the minival split (the first 5000 images from the validation set) is used for evaluation. We split old and new classes following the setting in (Shmelkov et al., 2017). The experiments are conducted with different numbers of classes, and the classes from VOC2007 and COCO are sorted in alphabetical order. We take the first 19, 15 and 10 classes from VOC2007 as old classes respectively, and the remaining 1, 5 and 10 classes are the corresponding new classes. For COCO, we take the first 40 classes as old classes and the remaining 40 classes as new classes.

Evaluation.

The evaluation metric is mean average precision (mAP) at 0.5 IoU threshold for VOC2007, and mAP averaged over IoU thresholds from 0.5 to 0.95 for COCO. Our method is compared with finetuning and recent related works based on two-stage object detectors. We list the results of these methods as reported in their original papers, which are evaluated under the same settings as our method. The best results are in boldface, and the second-best results are underlined.

Implementation details.

OM is trained for 20 epochs with an initial learning rate of 0.001, decayed every 5 epochs with gamma 0.1, and momentum 0.9. IM is trained for 10 epochs with an initial learning rate of 0.0001, decayed to 0.00001 after 5 epochs. The confidence and IoU thresholds for NMS are set to 0.5 and 0.3 respectively, and the IoU threshold for filtering the pseudo ground-truth is also 0.3. In the following experiments, some notations are used. A(·) denotes the results of OM, and +B(·) denotes the results of our incremental learning method, trained on the basis of A(·). We define L_fea to represent the feature distillation between OM and IM (Eq. 1), L_res to represent the residual distillation on the feature space (Eqs. 2-5), L_cls to represent the joint distillation on the classification layers (Eqs. 6-7), and 2-th to represent incremental learning with the 2-threshold training strategy.

4.2. Addition of Classes at Once

In the first experiment, we evaluate the performance of our method on VOC2007 when 1, 5 or 10 new classes are added at once. The results for these three settings are listed in Table 1, which presents the per-category average precision on the VOC2007 test subset.

For the first setting in this table, we test the performance on 19 old classes and one new class (tvmonitor) from VOC2007. We train OM on the VOC2007 trainval subset with all images containing any of the 19 old classes (A(1-19)), and IM is trained on the images of the VOC2007 trainval subset containing “tvmonitor” (+B(20)). In our experiments, the first baseline method is finetuning, for which we initialize IM with the parameters of OM. Different from the original finetuning, which trains a new classification layer from scratch for a new task, we also initialize the parameters of the old classes in the classification layer of IM from OM to preserve the learned knowledge. However, as can be seen in the first part of Table 1, finetuning reaches only 28.49% mAP on all classes when old classes are in the majority, which demonstrates the severe catastrophic forgetting caused by this strategy. The combination of all distillation losses with the 2-threshold training strategy (+B(20) 2-th) achieves the highest accuracy (72.13%), an increase of 3.83% over (Shmelkov et al., 2017). The mAP also increases by 0.8% compared with the Faster R-CNN based method (Hao et al., 2019a). This demonstrates the effectiveness of our method for mitigating catastrophic forgetting.

For the second setting, we choose the first 15 classes as old classes for training OM (A(1-15)), and the remaining 5 classes are used for incremental learning. As shown in the second part of Table 1, although the performance of finetuning improves with the increased number of new classes, its accuracy on the old classes is still lower than that of our method by a large margin. The mAP of our method (+B(16-20) 2-th) reaches 69.71%, an increase of about 3.81% compared with (Shmelkov et al., 2017).

Our method is also evaluated when adding more classes (10 classes), as shown in the third part of Table 1. OM is first trained on 10 classes, and IM learns to detect the remaining 10 new classes. The proposed method with all distillation losses (+B(11-20) 2-th) achieves 66.21% mAP, an increase of 3.11% compared with (Shmelkov et al., 2017). We also list the results of (Li et al., 2019) reported in the original paper under the same dataset split. Because (Li et al., 2019) keeps some exemplars of the old classes, its mAP is slightly better than ours in the 10+10 setting. However, our method exceeds it by a large margin (7.13%) in the 19+1 setting, which demonstrates the effectiveness of our method without any data of old classes.

We also test the proposed method with all distillation losses on COCO, where the first 40 classes are old classes and the remaining 40 classes are new classes. The results are listed in Table 2. The performance outperforms (Shmelkov et al., 2017) by a large margin, with a 6.35% improvement on mAP@0.5 and 2.93% on mAP@[.5,.95]. This further demonstrates the effectiveness of the proposed method on a larger dataset with more classes.

4.3. Sequential Addition of Multiple Classes

In this experiment, we evaluate the performance of our method by adding classes sequentially. IM is updated on the basis of the latest trained network with a new class, and the process is repeated with another new class. For example, OM is trained on 15 old classes of VOC2007 and IM is adapted to the 16th class (+B(16)); then a new IM uses the 16-class IM to learn the 17th class (+B(16)(17)). The process continues until the 20th class (+B(16)(17)(18)(19)(20)). The results in this scenario are listed in Table 3. As can be seen, our method outperforms (Shmelkov et al., 2017) on the learning of the 16th, 17th, 18th and 19th classes. Compared with (Shmelkov et al., 2017), the mAP after adding the 16th class increases by 2.99%, and the margin reaches 4.66% after adding the 17th class. The accuracy also improves after adding the 18th and 19th classes, with the mAP increasing by 2.49% and 1.47% respectively.

We also evaluate the method on adding 10 classes sequentially, as shown in Table 4. Compared with the Faster R-CNN based method (Chen et al., 2019), the mAP of our method is higher in almost all incremental learning steps. The results of the Faster R-CNN based methods are worse than the Fast R-CNN based method after many incremental learning steps, which may result from the gradual accumulation of errors from previous models in end-to-end detectors, where the process of generating region proposals also needs to be learned.

To further verify the sequential addition performance, we split the trainval set of VOC2007 into four groups (A, B, C, D) following the setting in (Hao et al., 2019a), where all 20 classes are sorted alphabetically and each group contains 5 classes. ResNet101 is used in this experiment for a fair comparison. The results are shown in Table 5. As can be seen, our method also achieves promising results compared with the other methods.

4.4. Ablation Study

To demonstrate the effectiveness of the key components, we conduct experiments to evaluate them separately when adding 1, 5 or 10 classes on VOC2007 at once. As shown in Table 6, the first row is the model trained on the new classes with the pseudo ground-truth generated from OM using a single confidence threshold (0.5), and the following four rows show the results when the designed losses and the 2-threshold training strategy are added separately. The value in parentheses is the change in mAP of each component relative to the first row. As listed in the table, the base feature distillation L_fea improves the mAP by 0.67% when only one class is added, which verifies its effectiveness for mitigating catastrophic forgetting. The performance of L_fea slightly decreases with an increasing number of new classes when it is used alone, because it is designed for preserving the performance on old classes and needs to cooperate with the other components to improve the mAP on all classes. L_res increases the mAP by about 1.89% on average, and the joint distillation on the final classification layer, L_cls, increases it by about 0.25% when used alone. The 2-threshold training strategy (2-th) is also effective for boosting the performance, increasing the mAP by about 0.95% on average.

Components 1 5 10
none (pseudo ground-truth with threshold 0.5 only) 69.10 66.05 64.08
L_fea only 69.77 (+0.67) 66.01 (−0.03) 63.94 (−0.14)
L_res only 71.81 (+2.71) 68.75 (+2.70) 64.33 (+0.25)
L_cls only 69.34 (+0.24) 66.08 (+0.03) 64.55 (+0.47)
2-th only 69.63 (+0.52) 66.40 (+0.35) 66.05 (+1.97)
L_fea 69.77 66.01 63.94
L_fea + L_res 71.42 (+1.65) 68.62 (+2.61) 64.39 (+0.45)
L_fea + L_res + L_cls 71.51 (+0.09) 68.95 (+0.33) 64.56 (+0.17)
L_fea + L_res + L_cls + 2-th 72.13 (+0.62) 69.71 (+0.76) 66.21 (+1.65)
Table 6. Ablation study. mAP (%) when 1, 5 or 10 classes are added at once. In the upper part each component is added to the baseline separately; in the lower part the components are added cumulatively. Values in parentheses are the change relative to the baseline (upper part) or to the previous combination (lower part).
Method 1 5 10
Feature distillation: plain L1 loss 71.14 68.17 62.88
Feature distillation: ours (Eq. 1) 71.51 (+0.37) 68.95 (+0.78) 64.56 (+1.68)
Classification distillation: from OM only 69.31 65.75 64.08
Classification distillation: ours (OM and RM) 69.34 (+0.03) 66.08 (+0.33) 64.55 (+0.47)
Table 7. Results with alternative distillation losses when 1, 5 or 10 classes are added at once.
Figure 4. The mAP with different choices of confidence thresholds for training +B(20) network.

We also evaluate these components by adding them sequentially, as shown in the last four rows of Table 6, where the value in parentheses represents the increase in mAP of each combination compared with the previous one. As can be seen, the mAP increases gradually. The combination of L_fea and L_res improves the average mAP over the three settings by about 1.57%. When L_cls is added, the mAP further increases by about 0.2%. The final combination of all designed components reaches the highest accuracy. This experiment further proves the validity of our method.

The comparison with alternative distillation losses is shown in Table 7. The experiments are conducted on the three settings (adding 1, 5 or 10 new classes) respectively. To evaluate the designed feature distillation (Eq. 1), we replace the loss between two 2D feature maps in L_fea and L_res with a plain L1 loss applied directly on the original 2D feature maps. For L_cls, the joint classification distillation from both OM and RM is compared with distillation from OM only. As shown, the mAP of the designed feature distillation exceeds the plain L1 loss by about 0.94% on average over the three settings. The joint classification distillation from both OM and RM outperforms distillation from OM only, with the mAP increasing by about 0.28%. The results verify that our designs are more appropriate in this scenario.

Figure 4 illustrates the comparison between single-threshold training and 2-threshold training for training +B(20), where we present the results on all classes and old classes for evaluating the 2-threshold training strategy on preserving the learned knowledge of old classes. As can be seen, the 2-threshold choice with 0.1 and 0.9 can maintain the performance of old classes to a large extent and the mAP of all classes is the highest compared to other choices.

5. Conclusion

In this paper, we propose a triple-network based incremental object detector with a novel residual distillation scheme for learning new classes without using the original training data. A frozen copy of the old model trained on old classes is used to generate pseudo ground-truth with a 2-threshold strategy and to provide knowledge of the old classes for training the incremental model. A residual model trained on the new classes is designed to preserve the feature discrimination between old and new classes by learning the residual between the incremental model and the old model. A two-level residual distillation loss is designed for the backbone features and the pooled features, and a joint classification distillation is designed for the output layers. Experimental results on VOC2007 and COCO demonstrate the effectiveness of the proposed method in incrementally learning to detect objects of new classes without forgetting the originally learned knowledge.

References

  • R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, pp. 139–154. Cited by: §1, §2.
  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3375. Cited by: §1.
  • P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik (2014) Multiscale combinatorial grouping. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 328–335. Cited by: §2.
  • G. Cauwenberghs and T. Poggio (2001) Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, pp. 409–415. Cited by: §2.
  • L. Chen, C. Yu, and L. Chen (2019) A new knowledge distillation for incremental object detection. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §1, §2, §4.3, Table 1, Table 4.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: 3rd item.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135. Cited by: §1.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §2.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §1.
  • Y. Hao, Y. Fu, Y. Jiang, and Q. Tian (2019a) An end-to-end architecture for class-incremental object detection with knowledge distillation. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §1, §2, §4.2, §4.3, Table 5.
  • Y. Hao, Y. Fu, and Y. Jiang (2019b) Take goods from shelves: a dataset for class-incremental object detection. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 271–278. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 831–839. Cited by: §1, §2.
  • H. Jung, J. Ju, M. Jung, and J. Kim (2016) Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122. Cited by: §2.
  • H. Jung, J. Ju, M. Jung, and J. Kim (2018) Less-forgetful learning for domain expansion in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §2.
  • I. Kuzborskij, F. Orabona, and B. Caputo (2013) From n to n+ 1: multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3358–3365. Cited by: §2.
  • D. Li, S. Tasci, S. Ghosh, J. Zhu, J. Zhang, and L. Heck (2019) RILOD: near real-time incremental learning for object detection at the edge. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, pp. 113–126. Cited by: §1, §2, §4.2, Table 1.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: §1, §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of the European Conference on Computer Vision, pp. 740–755. Cited by: 3rd item.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka (2013) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11), pp. 2624–2637. Cited by: §2.
  • J. Perez-Rua, X. Zhu, T. Hospedales, and T. Xiang (2020) Incremental few-shot object detection. arXiv preprint arXiv:2003.04668. Cited by: §2.
  • R. Polikar, L. Upda, S. S. Upda, and V. Honavar (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, part C (applications and reviews) 31 (4), pp. 497–508. Cited by: §2.
  • A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1320–1328. Cited by: §1, §2.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) ICaRL: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §1, §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §1, §2.
  • K. Shmelkov, C. Schmid, and K. Alahari (2017) Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3400–3409. Cited by: §1, §2, §4.1, §4.2, §4.2, §4.2, §4.2, §4.3, Table 1, Table 2, Table 3, Table 4, Table 5.
  • G. Sun, Y. Cong, and X. Xu (2018a) Active lifelong learning with "watchdog". In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • G. Sun, C. Yang, J. Liu, L. Liu, X. Xu, and H. Yu (2018b) Lifelong metric learning. IEEE Transactions on Cybernetics 49 (8), pp. 3168–3179. Cited by: §2.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §1, §2.
  • J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C. J. Kuo (2020) Class-incremental learning via deep model consolidation. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1131–1140. Cited by: §1, §2.
  • X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
  • C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In Proceedings of the European Conference on Computer Vision, pp. 391–405. Cited by: §2.