The human vision and perception system is inherently incremental: new knowledge is continually learned over time while existing knowledge is retained. Deep learning networks, on the other hand, are ill-equipped for incremental learning. When a well-trained network is adapted to new categories, its performance on the old categories degrades dramatically. To address this problem, incremental learning methods have been explored to preserve the old knowledge of deep learning models. However, the state-of-the-art incremental object detector employs an external fixed region proposal method that increases overall computation time and reduces accuracy compared to object detectors such as Faster RCNN that use trainable Region Proposal Networks (RPNs). The purpose of this paper is to design an efficient end-to-end incremental object detector using knowledge distillation for object detectors based on RPNs. We first evaluate and analyze the performance of an RPN-based detector with classic distillation on incremental detection tasks. Then, we introduce multi-network adaptive distillation that properly retains knowledge from the old categories when fine-tuning the model for a new task. Experiments on the benchmark datasets, PASCAL VOC and COCO, demonstrate that the proposed incremental detector is more accurate as well as being 13 times faster than the baseline detector.
Benefiting from the rapid development of deep learning models, the performance of object detectors has increased dramatically over the years. However, the gap between state-of-the-art performance and the human visual system is still huge. One of the main obstacles is incrementally learning new tasks in the dynamic real-world where new categories of interest can emerge over time. For example, in the pathology area, new sub-types of disease patterns are identified over time due to the continued growth in our knowledge and understanding. An ideal disease pattern detection system should be able to learn a new sub-type of disease from the pathology images without losing the ability to detect old disease sub-types. Humans can learn to recognize new categories without forgetting previously learned knowledge. However, when state-of-the-art object detectors are fine-tuned for new tasks, they often fail on the previously trained tasks — a problem called catastrophic forgetting (Goodfellow et al., 2014; McCloskey and Cohen, 1989). Figure 1 shows an example of this problem on the PASCAL VOC dataset (Everingham et al., 2010). The normal training shown in Figure 1 is the conventional way to make a detector work well on all tasks — this requires the model to be trained on labeled data from both the old and new tasks. Unfortunately, this retraining procedure is both time-consuming and computationally expensive. This method also requires access to all of the data for all tasks which is quite impractical for many real-life applications due to various reasons. The old training data may be inaccessible as it may have been lost or corrupted, or perhaps it is simply too large or there may be licensing or distribution issues. 
Even if all of the data is available, it needs to be re-annotated for retraining, since the old annotations only contain labels for the classes of one incremental learning task, whereas the annotations now need to contain labels for all classes of all incremental learning tasks learned so far.
To bridge this performance gap between catastrophic forgetting and normal full dataset training, (Shmelkov et al., 2017) proposed an incremental object detector. Their method is based on the largely superseded Fast RCNN (Girshick, 2015) detector, which uses an external fixed proposal generator rather than a CNN, so training is not end-to-end. During training on new categories, the annotations for old objects are not available, so Shmelkov et al. deliberately chose the external fixed proposal generator of Fast RCNN to ensure that proposals would be agnostic to the object categories. The more recent Faster RCNN (Ren et al., 2015) uses a trainable Region Proposal Network (RPN) to boost both accuracy and speed. RPN-based methods are expected to be fragile under incremental learning because the unlabeled old object classes are treated as background when the RPN detector is retrained, which may adversely affect the RPN proposals for the old classes. To address this challenge of incremental learning for RPN-based detectors, we first analyze how well the RPN copes with the missing annotation problem in incremental detection. Then, we propose an incremental framework, Faster ILOD, which uses multi-network adaptive distillation to improve performance. As illustrated in Figure 2, the multi-network adaptive distillation includes adaptive distillation on the feature maps and RPN outputs, as well as a conventional distillation component on the final outputs of the detector.
The contributions of this paper are as follows:
Multi-network adaptive distillation is proposed to help the network remember previously learned knowledge during the learning of new data as well as alleviate the missing annotation problem for old classes on new data.
Our framework is generic and can be applied to any object detectors with RPN.
Incremental learning for object detection consists of a number of incremental steps. In each incremental step, only the batch of training data for the new classes is accessible. Given an object detection model previously trained on images from certain old classes, incremental object detection is the task of retraining the model to maintain detection of the old classes while also detecting the new classes. We refer to the original model as the old model (teacher model) and the retrained model as the new model (student model). In multi-step incremental detection scenarios, at each step all categories trained during any previous step are regarded as old classes. Unlike image classification, where each image contains only one class of object, object detection specifies the locations of multiple objects in an image, and each input image can contain multiple classes of objects. In incremental object detection scenarios, the objects in one image can come from both the new task and the previous tasks. In real-life conditions the data for each task is labeled separately, so there is a high chance that, during labeling for a certain task, only objects for that task are annotated and all other objects are ignored, which leads to missing annotations in the incremental learning situation. Figure 3 shows an example of the missing annotation problem. Compared with incremental classification, incremental detection is a more challenging task, since it must address not only catastrophic forgetting but also the missing annotations for old classes in the new data. In summary, in this paper we target challenging real-life incremental detection scenarios in which:
In each incremental training step, only training data for the new classes is available; no representative exemplars of old classes from previous incremental steps are stored.
Objects from old classes could be contained in the training images of the new detection tasks; however, the annotations for these old object classes are not provided.
The retrained detector should have the ability to detect objects from both the new classes and all of the previous classes.
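The three conditions above can be made concrete with a small sketch of how the training set for one incremental step might be assembled. The annotation layout (a dict of image ids mapping to lists of `label`/`box` records) and the function name `build_step_annotations` are illustrative assumptions, not the data format used in the paper.

```python
def build_step_annotations(images, new_classes):
    """Keep images with at least one new-class object; drop all other labels.

    `images` maps an image id to a list of {"label": str, "box": tuple}
    records (a hypothetical layout). Old-class objects remain visible in the
    image but lose their annotations: the missing annotation problem.
    """
    new_classes = set(new_classes)
    step = {}
    for image_id, objects in images.items():
        kept = [obj for obj in objects if obj["label"] in new_classes]
        if kept:  # images without any new-class object are not used
            step[image_id] = kept
    return step


images = {
    "img_a": [{"label": "person", "box": (10, 10, 50, 90)},
              {"label": "tvmonitor", "box": (60, 20, 120, 80)}],
    "img_b": [{"label": "person", "box": (5, 5, 40, 70)}],
}
step_data = build_step_annotations(images, new_classes=["tvmonitor"])
```

In this toy example `img_a` is kept with only its `tvmonitor` box, so the person in that image would be treated as background unless the teacher model's knowledge is distilled into the student.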
Our work focuses on applying Knowledge Distillation (KD) on RPN-based object detectors to improve both speed and accuracy in incremental scenarios. In this section, we first introduce the background of KD followed by the discussion about its application to incremental learning scenarios.
KD was first introduced by (Hinton et al., 2015) for classification model compression. Model compression transfers the knowledge learned by a high-performance but cumbersome source model to a small target model. The intuition behind KD is that the relative probabilities of incorrect answers can reveal the potential relations between different categories. For example, in handwritten digit recognition, a given digit is more likely to be confused with some digits than with others. Thus, during model compression, it is advantageous to train the target model on the outputs of the source model instead of the ground truth labels. (Chen et al., 2017) applied distillation and hint learning (Romero et al., 2015) to detection model compression.
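As a minimal sketch of this idea, the snippet below computes softened teacher probabilities and a distillation loss for a three-class toy problem. The temperature-scaled softmax follows the general formulation of Hinton et al.; the temperature value, the logits and the function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher_logits = [6.0, 2.5, 0.1]          # class 0 is the predicted class
soft_targets = softmax(teacher_logits, temperature=2.0)
loss_same = distillation_loss(teacher_logits, teacher_logits)
loss_diff = distillation_loss([0.1, 2.5, 6.0], teacher_logits)
```

The soft targets still rank the two incorrect classes, which is the extra information a one-hot ground truth label does not carry; a student that reproduces the teacher's logits attains the smallest loss.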
Because KD can transfer the knowledge of one model to another, it has become one of the most commonly used methods for incremental learning. When KD is applied to incremental learning, the old model's output on the new data is combined with the ground truth information of that data to train the new model. In this section, we first discuss related methods for incremental classification, followed by methods for incremental detection.
(Li and Hoiem, 2017) first applied KD to incremental learning and built an incremental classifier called LwF. The LwF method does not require any old data to be stored and uses KD as an additional regularization term in the loss function to force the new model to follow the behavior of the old model on old tasks. (Zhou et al., 2019) proposed a multi-model distillation method called M2KD, which directly matches the category outputs of the current model with those of the corresponding old models. Mask-based pruning is used to compress the old models in M2KD. (Rebuffi et al., 2017) introduced a KD-based method called iCaRL, which stores some old data by selecting representative exemplars from each of the old classes based on herding. The stored old exemplars and the new data are combined to train the new model. However, as only a limited number of exemplars are stored, predictions are biased towards the new classes because of the data size imbalance between the old and new classes. (Castro et al., 2018) kept all final classification layers during incremental learning for distillation to alleviate this data imbalance. (Wu et al., 2019a) proposed using a few balanced batches of old and new data to train additional two-parameter offsets for the model outputs to remove the bias.
In contrast to classification, research on incremental object detection is quite limited. (Shmelkov et al., 2017) designed an incremental object detector with KD based on Fast RCNN (Girshick, 2015), where no old data is available. We refer to the incremental detector proposed by (Shmelkov et al., 2017) as the Incremental Learning Object Detector (ILOD). As ILOD is based on the Fast RCNN (Girshick, 2015) detector, it uses an external proposal generator such as EdgeBox (Zitnick and Dollár, 2014) or MCG (Arbeláez et al., 2014) to generate region proposals. Shmelkov et al. deliberately chose the external fixed proposal generator of Fast RCNN to ensure that proposals would be agnostic to the object categories. In our experiments, we show that our proposed method performs well on more efficient Region Proposal Network (RPN) based detectors such as Faster RCNN (Ren et al., 2015).
(Hao et al., 2019) proposed an incremental object detector called CIFRCN. In their experiments, they divided the classes into multiple class groups and trained their model to learn the class groups incrementally. For both training and testing of each class group, they ignored images that contain objects from multiple class groups. This process ensures that the training images for new classes do not contain any old objects and so avoids the missing annotation problem for old classes. However, in real-life scenarios, there is a high chance that an input image contains objects from both old and new classes. Similar to (Shmelkov et al., 2017), in our experiments we use a setting closer to real-life applications. All images that contain objects for the current task are used for training. If an image also contains objects from the old classes, the annotations for them are not present. (Li et al., 2019) proposed a one-stage incremental object detector called RILOD based on RetinaNet (Lin et al., 2017). In their experiments, they did not mention how they handled the annotations for old classes in new data, and they only performed one-step incremental learning on the benchmark dataset PASCAL VOC (Everingham et al., 2010). In our work, we aim to design a high-performance incremental object detector for real-life applications by solving the catastrophic forgetting and missing annotation problems for RPN-based detectors, and we perform experiments for both one-step and multi-step incremental learning on two detection benchmark datasets: PASCAL VOC (Everingham et al., 2010) and COCO (Lin et al., 2014).
Before designing our own framework, we first evaluate how the RPN affects the performance of the detector in incremental learning. To that end, we adapt the KD method of ILOD (Shmelkov et al., 2017) to the Faster RCNN (Ren et al., 2015) detector and follow the same training strategy to train the model. The evaluation is performed on the PASCAL VOC 2007 and COCO 2014 datasets. The training strategy and datasets are described in detail in Section 6. Tables 1 and 2 and Figures 4 and 5 show the performance of ILOD and of ILOD adapted on Faster RCNN on the VOC and COCO datasets under different incremental scenarios. According to the experimental results, in almost every condition ILOD adapted on Faster RCNN outperforms the original ILOD method. Unlike what was assumed in some previous literature (Shmelkov et al., 2017; Hao et al., 2019), the performance of Faster RCNN does not deteriorate substantially in incremental settings where the annotations of old classes are not provided in the new data.
One possible reason is that the RPN of Faster RCNN is robust to missing annotations. This has also been observed by (Wu et al., 2019b): in their experiments, after dropping 30% of the annotations, the performance of Faster RCNN drops by only 5%. During incremental training, the RPN randomly samples a set of negative proposals (proposals containing no objects) from thousands of anchors. The risk of these negative proposals containing a well-localized old-category object is quite low. On the other hand, the positive proposals feature objects from the new classes but may not contain many examples of the old classes. Old-class object proposals would be treated as false alarms and become a problem for training. Offsetting this effect, although distillation in the ILOD method is applied only at the final outputs, the loss from matching the old model back-propagates through the entire network and tends to force both the RPN and the feature extractor to detect old classes. As a result, RPN training is not destroyed, at least over the range of our experiments, such as one-step or several-step incremental settings.
While we show that the RPN is relatively robust to missing annotations for old classes, there is still an accuracy gap between the ILOD method adapted on RPN-based detectors and normal training. In this section, we propose a novel multi-network adaptive distillation method to further narrow the gap. We first discuss the backbone network used for our model and then discuss each component of our proposed method.
Our proposed method for incremental object detection is illustrated in Figure 2. It comprises two models: a teacher model and a student model. The teacher model is a frozen copy of the original detector, which detects objects from the old categories only. The student model is the adapted model that needs to be trained to detect objects from both the old and new categories. It is also initially a copy of the original detector, but the number of outputs in the last layer is increased to provide for the additional new classes. We use Faster RCNN (Ren et al., 2015) as our backbone network. Faster RCNN is a two-stage end-to-end object detector which consists of three parts: (1) a Convolutional Neural Network (CNN) based feature extractor to provide features; (2) a Region Proposal Network (RPN) to produce regions of interest (RoIs); (3) a class-level classification and bounding box regression network (RCN) to generate the final prediction for each of the proposals from the RPN (Chen et al., 2017). In order to create a high-performance incremental object detector, it is important to properly account for all three components.
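The way the student starts as a copy of the teacher with an enlarged last layer can be sketched as follows. This is a framework-free illustration on plain lists; the function name `expand_output_layer` and the small random initialisation of the new rows are assumptions, since the text does not state how the added outputs are initialised.

```python
import random


def expand_output_layer(old_weights, old_biases, num_new_classes, seed=0):
    """Return (weights, biases) with extra output rows for the new classes.

    Rows for old classes are copied unchanged; rows for new classes get
    small random weights and zero bias (an assumed initialisation).
    """
    rng = random.Random(seed)
    in_features = len(old_weights[0])
    weights = [row[:] for row in old_weights]   # copy, teacher stays frozen
    biases = list(old_biases)
    for _ in range(num_new_classes):
        weights.append([rng.gauss(0.0, 0.01) for _ in range(in_features)])
        biases.append(0.0)
    return weights, biases


teacher_w = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]   # 2 old outputs, 3 features
teacher_b = [0.1, -0.2]
student_w, student_b = expand_output_layer(teacher_w, teacher_b, num_new_classes=1)
```

Because the old rows are copied rather than shared, training the student never modifies the teacher's parameters.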
To make the model remember what it learned before, we apply knowledge distillation, similar to ILOD (Shmelkov et al., 2017). But unlike ILOD, which only performs one-step distillation at the final outputs, we perform multi-network distillation on the feature maps and on the RPN and RCN outputs. In addition, knowledge distillation was originally developed for model compression, which only requires the preservation of learned knowledge. Incremental learning requires not only maintaining learned knowledge but also learning new knowledge from the new classes. Thus, directly applying a distillation loss that forces the student model to follow the behavior of the teacher model will simply prevent it from learning from the new data. To solve this problem, we propose adaptive distillation, which uses the teacher model outputs as a lower bound to adaptively distill old knowledge.
Feature Distillation: The desired feature extractor needs to provide features that are effective for both old and new categories. To build the desired feature extractor, we utilize normalized adaptive distillation. Specifically, we subtract the mean of each unnormalized feature map to obtain the corresponding zero-mean feature maps from the teacher model and the student model. For each activation in the feature map, we then compare its value from the student model with the corresponding value from the teacher model. If the teacher's activation has the higher value, a loss is generated to force the student model to increase its value for this input, since this activation is important for the old classes. On the other hand, if the student's activation has the higher value, the loss is zero, since this activation is likely important for the new classes. This is how adaptive distillation preserves information for both the old and new classes. The feature distillation loss is computed over all activation values in the feature maps of the teacher and student networks. Note that feature distillation needs to be performed on every feature map that is presented to the RPN and RCN.
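The adaptive rule described for feature distillation can be sketched on a flattened feature map as below. The absolute-difference (L1-style) penalty and the averaging over all activations are assumptions made for this illustration, and the function names are hypothetical.

```python
def zero_mean(values):
    """Subtract the mean of a flattened feature map from every activation."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def feature_distillation_loss(teacher_map, student_map):
    """Adaptive distillation on flattened feature maps.

    Only activations where the zero-mean teacher value exceeds the zero-mean
    student value are penalised; the difference (assumed L1 penalty) is
    averaged over all activations.
    """
    teacher = zero_mean(teacher_map)
    student = zero_mean(student_map)
    total = sum(t - s for t, s in zip(teacher, student) if t > s)
    return total / len(teacher)


teacher_map = [1.0, 3.0, 2.0, 0.0]
student_map = [1.0, 1.0, 4.0, 0.0]
loss = feature_distillation_loss(teacher_map, student_map)   # 0.5
```

Here only the second activation contributes: the teacher responds more strongly there, so the student is pushed up, while the third activation, where the student is already higher, is left free to serve the new classes.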
RPN Distillation: The desired RPN needs to provide proposals for objects from both new and old classes. Similar to the feature distillation loss, we use the teacher model RPN output as a lower bound to force the student model to choose anchors according to both the training data from the new classes and the teacher model RPN output. In addition, the bounding box regression can provide incorrect values, since the real-valued regression output is unbounded. Inspired by the detection model distillation of (Chen et al., 2017), we use a threshold to control regression; the threshold is fixed to a constant value in our experiments. The RPN distillation loss is computed over all anchors from the RPN classification output and the RPN bounding box regression output.
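A possible reading of the RPN distillation rule is sketched below. Several details are assumptions for illustration only: the squared-error penalty, the gating rule that distills box regression only when the teacher's objectness exceeds the student's by more than the threshold, the default threshold value of 0.1, and the averaging over anchors.

```python
def rpn_distillation_loss(teacher_scores, student_scores,
                          teacher_boxes, student_boxes, threshold=0.1):
    """Adaptive RPN distillation over all anchors (assumed squared error).

    - Objectness: penalised only when the teacher score is higher.
    - Box regression: penalised only when the teacher score exceeds the
      student score by more than `threshold` (assumed rule and value).
    """
    total = 0.0
    for q_t, q_s, r_t, r_s in zip(teacher_scores, student_scores,
                                  teacher_boxes, student_boxes):
        if q_t <= q_s:
            continue  # student is at least as confident: no loss
        total += (q_t - q_s) ** 2
        if q_t > q_s + threshold:
            total += sum((a - b) ** 2 for a, b in zip(r_t, r_s))
    return total / len(teacher_scores)


teacher_scores = [0.9, 0.2, 0.55]
student_scores = [0.4, 0.6, 0.5]
teacher_boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
student_boxes = [(0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0)]
loss = rpn_distillation_loss(teacher_scores, student_scores,
                             teacher_boxes, student_boxes)
```

In this toy case the first anchor contributes both terms, the second contributes nothing because the student is more confident, and the third contributes only the objectness term because the teacher's margin is below the threshold.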
RCN Distillation: The desired RCN needs to predict each RoI for both old and new classes in an unbiased manner. We follow the method in ILOD (Shmelkov et al., 2017) to perform RCN distillation. More specifically, for each image, we randomly choose 64 out of the 128 RoIs with the smallest background score according to the RPN output from the teacher model. These proposals are then fed into the RCN of the student model, and the teacher model's final outputs are used as targets for the old classes. The student model's outputs on the new classes are not included in the RCN distillation. For each RoI classification output, we subtract the mean over the class dimension to get the zero-mean classification result. The RCN distillation loss is computed over the sampled RoIs and the old classes, including background, from the zero-mean classification results and the bounding box regression results.
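The RoI sampling and the zero-mean comparison described for RCN distillation can be sketched as follows. The squared-error penalty, the normalisation by the number of RoIs times the number of old classes, the flat four-value box per RoI, and both function names are assumptions for illustration.

```python
import random


def select_distillation_rois(background_scores, pool_size=128, sample_size=64, seed=0):
    """Pick `sample_size` RoIs at random from the `pool_size` RoIs with the
    smallest teacher background score; returns their indices."""
    ranked = sorted(range(len(background_scores)), key=lambda i: background_scores[i])
    pool = ranked[:pool_size]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))


def rcn_distillation_loss(teacher_logits, student_logits, teacher_boxes, student_boxes):
    """Distillation over sampled RoIs and old-class outputs (assumed squared error).

    Each logits row holds only the old-class entries, including background;
    student outputs for new classes are removed before calling this.
    """
    num_rois = len(teacher_logits)
    num_old = len(teacher_logits[0])
    total = 0.0
    for p_t, p_s, t_t, t_s in zip(teacher_logits, student_logits,
                                  teacher_boxes, student_boxes):
        mean_t = sum(p_t) / num_old
        mean_s = sum(p_s) / num_old
        # compare zero-mean logits so a constant offset is not penalised
        total += sum(((a - mean_t) - (b - mean_s)) ** 2 for a, b in zip(p_t, p_s))
        total += sum((a - b) ** 2 for a, b in zip(t_t, t_s))
    return total / (num_rois * num_old)


scores = [i / 200 for i in range(200)]        # toy teacher background scores
chosen = select_distillation_rois(scores)
loss = rcn_distillation_loss([[2.0, 0.0]], [[3.0, 1.0]],
                             [(0.0, 0.0, 1.0, 1.0)], [(0.0, 0.0, 1.0, 2.0)])
```

In the toy loss call the student logits differ from the teacher's only by a constant shift, so the classification term vanishes after mean subtraction and only the box term remains.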
Total Loss Function: The overall loss is the weighted sum of the standard Faster R-CNN loss (Ren et al., 2015), the feature distillation loss, the RPN distillation loss, and the RCN distillation loss. Three hyper-parameters, one for each distillation term, balance the loss terms and are empirically set to 1.
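Written out, the weighted sum described above takes the following form. The symbol names are assumptions chosen for readability, since the original notation is not recoverable here; the structure (one standard detection loss plus three weighted distillation terms, all weights equal to 1) follows the text.

```latex
\mathcal{L}_{total} = \mathcal{L}_{RCNN}
  + \lambda_{1}\,\mathcal{L}_{F\_Dist}
  + \lambda_{2}\,\mathcal{L}_{RPN\_Dist}
  + \lambda_{3}\,\mathcal{L}_{RCN\_Dist},
\qquad \lambda_{1} = \lambda_{2} = \lambda_{3} = 1
```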
|ILOD (Shmelkov et al., 2017)||68.60%||67.27%||70.03%||62.72%||65.61%||61.03%||69.50%|
|ILOD adapted on Faster RCNN||70.10%||67.72%||73.06%||66.35%||73.90%||61.14%||70.52%|
|ILOD (Shmelkov et al., 2017) (mAP@.5)||38.5%||33.8%||38.4%||34.3%||40.0%||36.1%||38.2%|
|ILOD adapted on Faster RCNN (mAP@.5)||42.8%||37.9%||43.0%||37.6%||47.0%||39.1%||42.7%|
|Faster ILOD (mAP@.5)||42.8%||39.6%||43.0%||39.9%||47.0%||40.1%||42.7%|
|ILOD (Shmelkov et al., 2017) (mAP@[.5, .95])||21.1%||19.2%||21.7%||19.5%||22.7%||19.8%||21.2%|
|ILOD adapted on Faster RCNN (mAP@[.5, .95])||22.5%||20.0%||22.9%||19.9%||24.4%||20.2%||22.5%|
|Faster ILOD (mAP@[.5, .95])||22.5%||21.0%||22.9%||21.3%||24.4%||20.6%||22.5%|
We call our proposed method Faster ILOD as it is designed to work with Faster RCNN. In this section, we compare our Faster ILOD method with the original ILOD method as well as with ILOD adapted on Faster RCNN.
VOC 2007 comprises 10k images of 20 object categories: 5k for training and 5k for testing. COCO 2014 comprises 164k images of 80 object categories: 83k for training, 40k for validation and 41k for testing. For the evaluation metric, we use mean average precision (mAP) at 0.5 Intersection over Union (IoU) for both datasets and also use mAP weighted across different IoU thresholds from 0.5 to 0.95 for COCO. To validate our method, we have investigated several incremental settings for these two datasets, such as one-step and multi-step addition. The sequence of categories is arranged according to the category names in alphabetical order.
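Both metrics rest on the IoU between a predicted box and a ground truth box. A minimal sketch is given below, assuming boxes in `(x1, y1, x2, y2)` corner format; the ten thresholds from 0.5 to 0.95 in steps of 0.05 are the standard COCO convention rather than something stated in this section.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes with x1 < x2, y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# IoU thresholds used by the COCO-style averaged metric
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))     # boxes share half of their area
is_match_at_05 = overlap >= 0.5
```

Two equal boxes that overlap by half their area only reach an IoU of one third, so this detection would not count as correct even at the 0.5 threshold.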
For ILOD, an external proposal generator is used to generate the proposals. Faster ILOD and ILOD adapted on Faster RCNN are implemented using PyTorch (Paszke et al., 2017). For a fair comparison of our approach with ILOD (Shmelkov et al., 2017), we use the same backbone network (ResNet-50 (He et al., 2016)) and training strategies similar to those described in their paper. In the first step of training, we set the learning rate to 0.001, decaying to 0.0001 after 30k iterations, and the momentum to 0.9. The network is trained for 40k iterations on VOC and 400k iterations on COCO. In the following incremental steps, the learning rate is set to 0.0001. The network is trained for 5k-10k iterations when only one class is added, and for the same number of iterations as the first step if multiple classes are added at once.
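The training schedule above can be summarised in a few lines. This sketch assumes a single step decay at exactly 30k iterations; the function and constant names are hypothetical and not taken from the authors' code.

```python
def learning_rate(iteration, incremental_step=False):
    """Learning rate schedule described in the text (step decay assumed).

    First training step: 0.001, dropping to 0.0001 after 30k iterations.
    Later incremental steps: constant 0.0001.
    """
    if incremental_step:
        return 0.0001
    return 0.001 if iteration < 30_000 else 0.0001


MOMENTUM = 0.9
FIRST_STEP_ITERATIONS = {"VOC": 40_000, "COCO": 400_000}
```

The lower constant rate in incremental steps keeps the student close to the teacher's weights, which complements the distillation terms.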
Experiments on VOC Dataset: Table 1 shows the results for one-step incremental settings in which the number of new classes is 1, 5 and 10, respectively. In all three settings, Faster ILOD is more accurate than both ILOD and ILOD adapted on Faster RCNN. Compared with the results for multi-step increments, the improvement in the one-step settings is not significant. This is because a one-step increment requires only a small amount of fine-tuning of the old model, so the catastrophic forgetting and missing annotation problems may not be severe, which leaves little room for improvement. However, when the old model is retrained over multiple incremental steps, detection errors caused by catastrophic forgetting and missing annotations build up approximately exponentially on the older classes, which is a more difficult scenario. Thus, we have also investigated multi-step incremental scenarios. Figure 4(a) shows the performance of Faster ILOD, ILOD and ILOD adapted on Faster RCNN when first training with 15 classes and then adding 1 class at a time for 5 steps. Under this add-one-class-at-a-time protocol, Faster ILOD outperforms ILOD and ILOD adapted on Faster RCNN at every incremental step, with average performance gains of 3.44% and 2.12% respectively. Figure 4(b) shows the performance of the three models when first training with 10 classes and then adding 2 classes at a time for 5 steps. Under this setting, Faster ILOD also performs best at all five incremental steps, with an average accuracy improvement of 5.78% over ILOD and 1.67% over ILOD adapted on Faster RCNN. Figure 4(c) shows the performance of the three models when first training with 5 classes and then adding 5 classes at a time for 3 steps. Under this scenario, Faster ILOD outperforms ILOD adapted on Faster RCNN at every step except the first and always has better accuracy than ILOD. The average accuracy increase is 1.61% over ILOD and 0.83% over ILOD adapted on Faster RCNN.
Experiments on COCO Dataset: For our experiments on COCO, we use the train set and the valminusminival set as our training data and the minival set as our testing data. Table 2 shows the results under one-step incremental settings, where the number of new classes is 5, 10 and 40, respectively. Figure 5 shows the results for multi-step incremental detection under the add-one-class-at-a-time protocol. In both scenarios, Faster ILOD provides the best detection accuracy. In particular, under the multi-step incremental detection shown in Figure 5, Faster ILOD outperforms ILOD and ILOD adapted on Faster RCNN at all steps, with average gains of 5.94% and 2.74% respectively at 0.5 IoU, and of 1.86% and 1.5% respectively when weighted across IoU thresholds from 0.5 to 0.95.
As the original ILOD code is built in TensorFlow, we rebuilt ILOD in PyTorch to compare the detection speeds of ILOD and Faster ILOD fairly. All experiments were performed on an NVIDIA Tesla V100 GPU. The average detection time per image of ILOD and Faster ILOD with ResNet-50 (He et al., 2016) on the VOC dataset is 1396.66 ms and 109.52 ms respectively. As ILOD relies on an external proposal generator to acquire proposals, the inference speed of ILOD is about 13 times slower than that of Faster ILOD.
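The "about 13 times" figure follows directly from the two reported timings, as the short calculation below shows; the variable names are illustrative.

```python
ilod_ms = 1396.66          # average detection time per image, ILOD
faster_ilod_ms = 109.52    # average detection time per image, Faster ILOD

speedup = ilod_ms / faster_ilod_ms              # ratio of per-image times
faster_ilod_fps = 1000.0 / faster_ilod_ms       # images per second
```

The ratio is roughly 12.75, which rounds to the factor of 13 quoted in the text, and corresponds to a little over 9 images per second for Faster ILOD.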
In this paper, we found that, contrary to what some previous literature (Shmelkov et al., 2017; Hao et al., 2019) assumed, the RPN is relatively robust to missing annotations for old classes in incremental object detection. We then proposed a novel end-to-end framework, Faster ILOD, to further narrow the gap between incremental learning and normal training that is caused by catastrophic forgetting and missing annotations when fine-tuning Faster RCNN using only new class annotations. By adaptively distilling the old information across multiple networks, the proposed method aims to preserve the capabilities of the detector on old object classes with limited effect on the learning of new classes. Our method shows superior results on the PASCAL VOC and COCO datasets and outperforms the state-of-the-art incremental detector (Shmelkov et al., 2017) by a large margin in most cases.
This research was funded by the Australian Government through the Australian Research Council and Sullivan Nicolaides Pathology under Linkage Project LP160101797.