Efficient Incremental Learning for Mobile Object Detection

03/26/2019 ∙ by Dawei Li, et al. ∙ University of Southern California

Object detection models shipped with camera-equipped mobile devices cannot cover the objects of interest for every user. Therefore, incremental learning capability is a critical feature for a robust and personalized mobile object detection system that many applications would rely on. In this paper, we present an efficient yet practical system, IMOD, to incrementally train an existing object detection model such that it can detect new object classes without losing its capability to detect old classes. The key component of IMOD is a novel incremental learning algorithm that trains one-stage deep object detection models end-to-end using only training data of the new object classes. Specifically, to avoid catastrophic forgetting, the algorithm distills three types of knowledge from the old model to mimic the old model's behavior on object classification, bounding box regression and feature extraction. In addition, since the training data for the new classes may not be available, a real-time dataset construction pipeline is designed to collect training images on-the-fly and automatically label the images with both category and bounding box annotations. We have implemented IMOD under both mobile-cloud and mobile-only setups. Experimental results show that the proposed system can learn to detect a new object class in just a few minutes, including both dataset construction and model training. In comparison, a traditional fine-tuning-based method may take a few hours for training, and in most cases would also need a tedious and costly manual dataset labeling step.




1 Introduction

Object detection is the computer vision task that identifies and localizes instances of semantic objects of a certain class (such as humans, buildings, or cars) in images or video frames. On mobile devices that are equipped with advanced cameras, such as smartphones, mobile robots and drones, object detection has been the backbone for many exciting applications including augmented reality, autopilot, mobile shopping, etc. These successful mobile applications were made possible by recent breakthroughs in deep learning based object detection, especially one-stage object detection frameworks such as YOLO [32], SSD [26] and RetinaNet [24], which enabled on-device inference for applications with real-time requirements.

In this paper, we focus on a challenging but critical problem for mobile object detection, i.e., the incremental learning of new object classes. Each mobile user may have personalized objects of interest that are not included in the deep learning model shipped with the mobile device, and new objects of interest may appear from time to time (e.g., the emergence of the MP3 Walkman to replace the old tape Walkman). To incrementally add new object classes to the model, a simple and straightforward way is to fine-tune the model with training data from both old classes and new classes. However, this naive method would not work in practice. First, due to privacy issues, the training data for the old classes of the deployed model is likely to be unavailable. Second, even if the old data is available, training on a dataset containing all classes is time consuming, since training time is highly correlated with the amount of training data required. For mobile object detection, learning to detect new objects incrementally in a timely and quality-conscious manner is crucial in many application scenarios. Users of mobile devices may encounter new objects of interest anywhere and at any time, and delayed learning of those objects would have a negative impact on the user experience. For mobile robots and drones performing military or disaster relief missions, this could play a decisive role in saving lives and property.

On the other hand, deep neural networks are known to suffer from the “catastrophic forgetting” problem [19]. This is the phenomenon where, when training using back-propagation and the standard cross-entropy loss with only training data of the new classes, the model can forget its learned knowledge of the old classes; for example, an object detection model can no longer detect objects belonging to the old classes. To deal with the “catastrophic forgetting” problem, deep learning researchers have proposed effective learning methods such as elastic weight consolidation (EWC) [19] and learning without forgetting (LwF) [22]. Those methods have focused on the object classification problem, and none of them can be directly applied to the object detection problem, especially to one-stage object detection models, which simultaneously predict the object classes and the bounding box locations.

We address the “catastrophic forgetting” problem of incremental learning for one-stage object detection models by proposing a novel loss function based on the knowledge distillation technique [12]. The basic idea is to discourage changes in the predictions for the old classes, including the model output of both classification and bounding box regression. Specifically, when training only with new-class data, we keep a copy of the old model to generate the output of the new data for both classification (i.e., the probability of belonging to each of the old classes) and bounding box location (i.e., the coordinates of each detected object in the image), and use distillation loss terms to minimize the changes in the new model (see Figure 3). In this way, we force the old and new models to reach a consensus on predicting the old classes. Moreover, we add a feature distillation loss term to prevent dramatic changes in the features extracted from a middle neural network layer, which further reduces catastrophic forgetting and improves the model accuracy.

Unlike object classification, the training data for object detection of a new class may not always be available since it requires both class level labels and accurate bounding boxes, annotating which is a costly and laborious job. In the absence of training data, we need to automatically build a relatively good training dataset on-the-fly. To this end, we propose a real-time training dataset construction method for the new object classes. We show that the created dataset has accurately annotated bounding boxes around the objects corresponding to the new classes, and thus can be directly used to incrementally train the detection model.

To summarize, we make the following contributions:

  • The first end-to-end incremental learning system for mobile object detection in near real-time.

  • A new algorithm and loss function for incremental learning of one-stage object detection models to solve the “catastrophic forgetting” problem.

  • On-the-fly automated dataset construction algorithm to create training data for new classes with both classification and bounding box location labels.

  • Extensive experiments that demonstrate the effectiveness and efficiency of each system component — we also investigate some trade-offs to optimize the overall system performance.

  • Prototype implementations of the end-to-end system on both mobile-cloud and mobile-only architectures. For the mobile-only implementation, we develop and deploy the system completely on Jetson TX2 embedded AI platform — we show that learning of a new class completes in just a few minutes, including both dataset construction and model training.

The rest of the paper is organized as follows: We review the related work in Section 2. The system overview is described in Section 3. We present our efficient incremental learning method for mobile object detection in Section 4. In Section 5, we introduce the automatic data collection and labeling algorithm. The evaluation and prototype implementation are given in Section 6. Finally, we conclude the paper in Section 7.

2 Related Work

Figure 1: System overview of IMOD. It includes 3 main components: (1) Incremental learning trigger which incorporates the logic on when the incremental learning will start and what classes will be learned (one or more new classes can be learned at a time), (2) Training dataset preparation which downloads, labels and purifies the training images for the new classes in case that there’s no available training dataset, and (3) Incremental training for new classes which applies our novel and efficient incremental learning algorithm to incrementally train the model with only the training data of the new classes while maintaining its knowledge on the old classes. Components (2) and (3) may run on a remote cloud server or locally on a mobile GPU for a fully on-device system, and both have been implemented for prototype system evaluation.

Deep Learning for Mobile Vision: Deep learning, especially convolutional neural networks (CNN), has been widely used for a variety of computer vision applications on mobile devices due to the recent development of mobile deep learning SDKs [1, 21, 30, 35]. Researchers have focused on reducing the computation cost on mobile devices by compressing pre-trained deep learning models [10, 9, 20, 3], designing lightweight neural network architectures [16, 13], or hardware acceleration [15, 37]. These papers have not addressed the incremental learning problem in mobile vision.

Deep Incremental Learning: To address the “catastrophic forgetting” problem, deep learning researchers have recently developed methods in several different directions. One important approach is adding regularization [2, 19] to the neural network weights according to their importance to the old classes, i.e., changes to important weights are discouraged by using a smaller learning rate. The other major research direction is based on knowledge distillation [12], which uses the new data to distill the knowledge (i.e., the network output) from the old model and mimic its behavior (i.e., generate similar output) when training the new model [31, 22, 34]. Among the related work, only [34] has focused on the object detection problem. However, their approach works on the Fast-RCNN model [8], where bounding box proposals are computed using an additional selective search method [36]; this approach cannot satisfy the real-time inference requirement on modern mobile devices.

Automatic Image Annotation: A research topic related to our automatic training dataset construction problem is automatic image annotation [17, 4], which is a process in which a computer system assigns metadata (e.g., keywords, captions) to digital images. Image annotation is generally used as a method to provide semantic labels for an image retrieval system. The labeled images are not used as a training dataset to train or refine a learning model. Our approach is different from automatic image annotation: we are trying to automatically build a training dataset on-the-fly by selecting and labeling a subset of training images from noisy web-crawled images. Furthermore, our labeling algorithm gives not only the image-level labels but also the object bounding box locations.

3 System Overview

In this section, we give an overview of the proposed IMOD system which has three major components as illustrated in Figure 1.

The first component is the incremental learning trigger that runs on the mobile devices to listen for the request of learning new classes. There could be different scenarios to initiate a new learning task. In the simplest scenario, the user directly tells the AI engine to learn a new object class that the current AI engine cannot detect. This scenario assumes that the user knows what class she/he is interested in and what existing classes the AI engine has learned. However, a more realistic situation is that the user does not know in advance her/his interest and the already learned classes, but would like to learn whenever new classes of interest come across.

Therefore, we design a more interactive way of initiating a new learning task as shown in Figure 1 — when pointing the camera of a mobile device to a certain object (e.g. slow cooker), if the AI engine cannot detect the object using the current object-detection model with enough confidence, it prompts the user to notify that it would like to learn the new object category. If the user approves the learning request, she/he could issue a simple initiate-learning command and tell the AI engine to start the learning process.

Another important problem here is how to designate the name(s) of the new class(es). The straightforward way is that the user would directly tell the AI engine the name of the new object in images captured by the camera (e.g., “slow cooker"). Alternatively, we could use a reverse image search engine to obtain the object name(s) by using the captured image as a query. For each learning task, one or more new classes can be learned simultaneously.

The second component is a training dataset preparation pipeline which constructs a set of training data in real-time for the new class(es) to be learned. This component is triggered if no training dataset is available for the new classes to be learned. Specifically, it downloads relevant images from the Internet using the new class name(s) as query and automatically generates bounding boxes for the object(s) corresponding to the given class name(s). Determining bounding box labels in a fully automatic manner is a highly challenging task. Our solution is to first generate a number of noisy bounding boxes, which can be done using either an unsupervised bounding box generator such as edge boxes [39] or a supervised object detector pre-trained on a relatively large number of classes (e.g., the 80-class COCO dataset [25]), and then filter out the incorrect bounding boxes by classifying each bounding box using a pre-trained large-scale object classification model (e.g., the ImageNet 11k-class model). However, this solution will not work unless the following two critical technical problems are resolved: (1) How can we map the user-given class name(s) to the labels used in the classification model? (2) How can we deal with overlapping boxes? We will elaborate on our solutions to these problems in Section 5.


The third component is the model training module, which learns to detect the new object using either an available training dataset or the dataset prepared by the previous system component. While training with only new-class data reduces the time cost of incremental learning, it increases the risk of catastrophic forgetting on the old classes. Our novel incremental learning algorithm is built on top of the powerful knowledge distillation technique, which has been used for incremental learning of object classification models [22]. However, incremental learning for an object detection model is much more complicated than for a classification model, since it requires preserving not only the semantic classification capability but also the model's knowledge regarding bounding box prediction. We will discuss the details in Section 4.

We have implemented the end-to-end system with two different system architectures: mobile-cloud and mobile-only. For the mobile-cloud implementation, the dataset construction and model training components run on a remote cloud-based GPU server — once the training is completed, the new model is downloaded to the mobile devices for future inferences. The mobile-only implementation is a fully on-device system without relying on the cloud computing resources. We have evaluated and compared both implementations in our evaluation section.

4 Incremental Learning for Mobile Object Detection

Figure 2: A representative one-stage object detection model architecture: RetinaNet [24]. For each cell in a grid of size W*H, it predicts (1) the probability that an object exists in each of the A anchor boxes from K different classes, and (2) the four coordinate offsets for each of the A anchor boxes to a ground-truth bounding box (if one exists).

4.1 Recap for One-stage Object Detector

The goal of object detection is to recognize instances of a predefined set of object classes (e.g. people, cars, bikes, animals) and describe the location of each detected object in the image using a bounding box. Earlier end-to-end deep learning methods adopted a two-stage architecture, which first identifies a set of possible bounding box locations using a region proposal network and then uses a second convolutional neural network to refine the bounding box proposals and classify the object categories [33]. While two-stage object detectors can achieve very high detection accuracy, they suffer from slow inference speed and thus cannot be deployed for most mobile applications, which require real-time inference capability. Therefore, recent research has focused on developing one-stage object detection architectures [24, 23, 26, 32] which run faster yet provide accuracy similar to the two-stage detectors.

The success of one-stage detectors is attributed to four critical techniques recently invented by computer vision researchers: grid-based prediction [32], anchor boxes [26], feature pyramids [23] and focal loss [24]. First, the feature maps generated by convolutional neural networks preserve the spatial information of the input image, and thus [32] divides the feature maps into grids, where each grid cell is used to directly predict objects (both classification and bounding box) falling into its area. In this manner, there is no need for a box proposal network. Second, regression learning of the four coordinates of a bounding box from random initialization makes the training extremely hard; to address this issue, [26] instead predicts bounding box offsets to some pre-defined anchor boxes with varying aspect ratios, which embed some prior information about the shape of candidate objects. Third, to better locate objects of different scales, especially small ones, [23] proposed to predict objects with feature maps of different resolutions via a top-down pathway, i.e., the feature pyramids. Finally, the focal loss was proposed [24] to solve the class imbalance problem (e.g., there are too many boxes containing only the image background compared to the boxes actually containing an object). In particular, a scaling factor was added to the cross-entropy loss such that the training focuses more on learning hard examples.
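As a concrete illustration of the scaling factor, the following minimal Python sketch implements the binary focal loss in the form given in [24], FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The function names and the default values γ = 2 and α = 0.25 are assumptions chosen for illustration; they are not settings reported in this paper.

```python
import math


def cross_entropy(p, y):
    """Standard binary cross-entropy for one anchor/class pair."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true label
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 1 for object and 0 for background
    gamma, alpha: illustrative defaults (assumptions, not from this paper)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Well-classified examples (p_t close to 1) are down-weighted strongly.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident, correct prediction
hard = focal_loss(0.1, 1)   # confident, wrong prediction
```

With gamma set to 0 and alpha set to 1 the function reduces to the ordinary cross-entropy, which makes the role of the scaling factor easy to check.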

In Figure 2, we show the basic architecture of the state-of-the-art one-stage object detection method, RetinaNet [24]. RetinaNet is composed of three subnets: a Feature Net (F) for feature map extraction from different resolutions (i.e., different neural network layers), a Class Subnet (C) for object classification, and a Box Subnet (B) for bounding box regression. The Feature Net is also called a backbone network; it is usually a classification network, pre-trained using a large-scale dataset such as ImageNet [5]. At inference time, RetinaNet first uses a pre-defined classification probability threshold to decode only boxes with high confidence scores and then uses the non-maximum suppression (NMS) technique to filter out redundant predictions.
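The two inference-time steps described above (confidence thresholding followed by NMS) can be sketched as follows. This is a generic, single-class illustration rather than RetinaNet's actual decoding code; the box format (x1, y1, x2, y2), the function names and both threshold values are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def decode(detections, score_thresh=0.5, iou_thresh=0.5):
    """Keep confident boxes, then greedily suppress overlapping ones.

    detections: list of (box, score) pairs for a single class.
    score_thresh, iou_thresh: illustrative values (assumptions).
    """
    confident = [d for d in detections if d[1] >= score_thresh]
    confident.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in confident:
        # A box survives only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept


dets = [((0, 0, 10, 10), 0.9), ((1, 1, 10, 10), 0.8),
        ((20, 20, 30, 30), 0.7), ((0, 0, 5, 5), 0.2)]
result = decode(dets)
```

In this toy input the second box overlaps the first heavily and is suppressed, while the fourth box falls below the score threshold and is never considered.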

4.2 Incremental Learning Method

Figure 3: The proposed incremental learning method using RetinaNet as an example.

The general architecture of our incremental learning algorithm for one-stage deep object detectors is illustrated in Figure 3. To detect objects from n new classes, we first create a copy of the old model as N’ and then create a new model N by expanding the old model’s classification subnet to classify n more classes, which is done by adding n neurons to the network’s output layer. The weight parameters of the new model are initialized using the corresponding parameters from the old model, with the exception of the newly added neurons in the output layer, which are randomly initialized.
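The initialization scheme described above can be sketched in a few lines. This is a simplified illustration that treats the output layer as one weight row per class and ignores the per-anchor structure of a real detection head; the function name, the Gaussian initializer and its scale are assumptions.

```python
import copy
import random


def expand_output_layer(old_weights, old_biases, n_new, scale=0.01, seed=0):
    """Return output-layer parameters for m + n_new classes.

    old_weights: list of m rows (one per old class), each of length in_dim
    old_biases:  list of m biases
    The first m rows are copied from the old model; the n_new added rows
    are randomly initialized (Gaussian with a small scale, an assumption).
    """
    rng = random.Random(seed)
    in_dim = len(old_weights[0])
    new_weights = copy.deepcopy(old_weights)   # old-class parameters are reused
    new_biases = list(old_biases)
    for _ in range(n_new):
        new_weights.append([rng.gauss(0.0, scale) for _ in range(in_dim)])
        new_biases.append(0.0)
    return new_weights, new_biases


old_w = [[1.0, 2.0], [3.0, 4.0]]   # m = 2 old classes
old_b = [0.1, 0.2]
new_w, new_b = expand_output_layer(old_w, old_b, n_new=1)
```

The old model's parameters are deep-copied, so the frozen copy N’ is left untouched when the new model N is trained.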

To avoid catastrophic forgetting of the old classes while training only with data for new classes, we follow the idea of learning without forgetting (LwF) [22], which has been successfully applied to the image classification problem. Specifically, LwF tries to ensure that the classification output for the old classes in the new model (i.e., the vector of the probabilities for the old classes) is close to the output of the old model on the same input image. To achieve this goal, LwF leverages the knowledge distillation [12] technique, which uses a modified cross-entropy loss: instead of using a hard one-hot ground-truth label, it uses the input image’s output from the old model as the soft ground-truth label to train the new model. The core idea here is that it discourages changes in the output for the old classes. By jointly optimizing the distillation loss on the old classes and the cross-entropy loss on the new classes, LwF achieves good classification performance on predicting both old and new classes.
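The soft-label idea can be illustrated with a small sketch. It is a simplified version of the LwF objective for a single image: the temperature scaling used in [22] is omitted, the hard-label term is computed over all m + n outputs, and all function names and the weight lam are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(soft_targets, probs):
    """Cross-entropy against soft labels instead of a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(soft_targets, probs) if t > 0)


def lwf_loss(new_logits, old_logits, new_label, m, lam=1.0):
    """Hard-label loss on the new class plus distillation on the old classes.

    new_logits: m + n outputs of the new model
    old_logits: m outputs of the frozen old model on the same image
    new_label:  index of the ground-truth (new) class in new_logits
    """
    soft_targets = softmax(old_logits)             # old model's soft labels
    new_on_old = softmax(new_logits[:m])           # new model, old classes only
    distill = soft_cross_entropy(soft_targets, new_on_old)
    hard = -math.log(softmax(new_logits)[new_label])
    return hard + lam * distill


loss = lwf_loss([2.0, 1.0, 3.0], [2.0, 1.0], new_label=2, m=2)
```

The distillation term is smallest when the new model reproduces the old model's distribution over the old classes, which is exactly the behavior LwF tries to preserve.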

However, only preserving the classification capability of the old model is not enough for object detection, as the new model would still lose its ability to predict correct bounding boxes for the old classes while training only using the new classes’ bounding box labels. To address this issue, we extend the knowledge distillation idea to the object detection task so that the outputs of both the classification subnet and the box subnet in the new model are encouraged to approximate the outputs of the old model for the old classes. Furthermore, we argue that applying distillation loss only on the model outputs is not enough to prevent forgetting on old classes. While training only with data for new classes, the intermediate features (i.e., features extracted from a middle layer) which are important for predicting old classes are also changed during back-propagation. Therefore, we design a new distillation loss on the intermediate features extracted from the feature net, F, so that the features extracted from the new model would not be dramatically different from those of the old model.

4.2.1 Detailed Loss Functions

Given the analysis above, the loss function for incremental learning of the object detection model must satisfy the following three properties to avoid catastrophic forgetting:

  1. Discourage changes on classification output for the old object classes.

  2. Discourage changes on bounding box locations for predicted objects.

  3. Prevent dramatic changes for features extracted from intermediate neural network layers.

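Given an old object detection model N’ trained on m classes, and a training dataset for n new classes, our goal is to incrementally train an object detection model N which performs object detection on the complete set of classes. The loss function for our incremental detection algorithm is defined as:

\[ L_{total} = L_{cls} + L_{bbox} + \lambda_1 L_{dist\_cls} + \lambda_2 L_{dist\_bbox} + \lambda_3 L_{dist\_feat} \]

where $L_{cls}$ and $L_{bbox}$ are the classification and bounding box regression losses on the new classes, $L_{dist\_cls}$, $L_{dist\_bbox}$ and $L_{dist\_feat}$ are the classification, bounding box and feature distillation losses, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters to balance the weights of different loss terms, which are all set to 1 in our experiments.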

The first loss term $L_{cls}$ and the second loss term $L_{bbox}$ are the standard loss functions used in [24] to train RetinaNet for the new classes, where $Y_n$ represents the ground-truth one-hot classification labels, $\hat{Y}_n$ represents the new model’s classification output over the n new classes, $B_n$ represents the ground-truth bounding box coordinates, and $\hat{B}_n$ represents the predicted bounding box coordinates for the ground-truth objects. (The box subnet outputs are actually offsets to the predefined anchor boxes; we simplify the notation in our equations.)

The third loss term $L_{dist\_cls}$ is the distillation loss for the classification subnet, similar to that defined in [22, 34]. Here, $Y_o$ is the output of the frozen old model N’ for the m old classes using the new training data, and $\hat{Y}_o$ is the output of the new model N for the old classes. Specifically, the loss is calculated by the following equation:


The fourth loss term, $L_{dist\_bbox}$, preserves the old model’s capability of correctly predicting bounding boxes for the old classes. Our insight for the bounding box distillation is that when an image is given as input to the old model, whether or not the image contains any object belonging to the old classes, the model would still predict the existence of old-class objects, though with relatively low confidence scores. We regard the predicted bounding boxes for those relatively low-confidence (but not too low) objects as the old model’s bounding box prediction capability. In particular, for each image, we sort the bounding boxes (i.e., the anchor boxes and the coordinate offsets) predicted by the old model based on their classification confidence scores, and select the top k bounding boxes $B_o$ as the ground-truth for the bounding box distillation loss (see the left sub-figure in Figure 4 for an example). When incrementally training the new model, we regress the new model’s bounding box predictions $\hat{B}_o$ corresponding to the same set of anchor boxes to the ground-truth using smooth L1 loss [8]:

\[ L_{dist\_bbox}(B_o, \hat{B}_o) = \sum_{i=1}^{k} \sum_{j=1}^{4} \mathrm{smooth}_{L1}\left(b_o^{i,j} - \hat{b}_o^{i,j}\right) \]
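The top-k selection and smooth L1 regression described above can be sketched as follows. The data layout (one list of four coordinates per anchor) and the function names are assumptions made for illustration; smooth L1 follows the definition in [8].

```python
def smooth_l1(x):
    """Smooth L1 of [8]: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def top_k_anchors(old_scores, k):
    """Indices of the k anchors the old model is most confident about."""
    order = sorted(range(len(old_scores)),
                   key=lambda i: old_scores[i], reverse=True)
    return order[:k]


def bbox_distill_loss(old_boxes, old_scores, new_boxes, k):
    """Regress the new model's boxes to the old model's top-k boxes.

    old_boxes / new_boxes: per-anchor lists of four coordinates, indexed
    by the same anchors in both models.
    old_scores: the old model's classification confidence per anchor.
    """
    selected = top_k_anchors(old_scores, k)
    return sum(smooth_l1(o - n)
               for i in selected
               for o, n in zip(old_boxes[i], new_boxes[i]))


old_boxes = [[0, 0, 1, 1], [0, 0, 2, 2], [5, 5, 6, 6]]
old_scores = [0.1, 0.4, 0.3]
new_boxes = [[9, 9, 9, 9], [0, 0, 2, 4], [5, 5.5, 6, 6]]
loss = bbox_distill_loss(old_boxes, old_scores, new_boxes, k=2)
```

Only the anchors selected from the old model contribute to the loss, so the new model is free to change its predictions elsewhere, for example on anchors that cover new-class objects.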


Finally, we append a novel feature distillation loss term $L_{dist\_feat}$ to prevent the extracted intermediate features from changing drastically. Similar to the bounding box distillation loss, we calculate the smooth L1 loss between the features extracted from the feature net of the old model, $T$, and those of the new model, $\hat{T}$:

\[ L_{dist\_feat}(T, \hat{T}) = \sum_{i} \mathrm{smooth}_{L1}\left(t_i - \hat{t}_i\right) \]
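To show how the five loss terms fit together, here is a minimal sketch under stated assumptions. The feature distillation uses smooth L1 as described in the text. The squared-error form of the classification distillation term is an assumption, since the corresponding formula is not shown here, and applying the three hyper-parameters to the three distillation terms is likewise an assumption. All function and parameter names are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1 of [8]: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def feature_distill_loss(old_feats, new_feats):
    """Smooth L1 between flattened intermediate features of both models."""
    return sum(smooth_l1(o - n) for o, n in zip(old_feats, new_feats))


def class_distill_loss(old_out, new_out):
    """ASSUMPTION: squared error between old-class outputs of both models."""
    return sum((o - n) ** 2 for o, n in zip(old_out, new_out))


def total_loss(l_cls, l_bbox, l_dist_cls, l_dist_bbox, l_dist_feat,
               lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum of the new-class losses and the distillation losses.

    The weights default to 1, the value reported for the experiments.
    Which term each weight multiplies is an assumption of this sketch.
    """
    return (l_cls + l_bbox
            + lam1 * l_dist_cls + lam2 * l_dist_bbox + lam3 * l_dist_feat)


l_feat = feature_distill_loss([1.0, 2.0], [1.0, 4.0])
l_dcls = class_distill_loss([0.2, 0.8], [0.2, 0.5])
combined = total_loss(1.0, 2.0, l_dcls, 0.0, l_feat)
```

Setting all three weights to zero recovers plain fine-tuning on the new classes, which is the configuration that suffers from catastrophic forgetting.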


4.2.2 Using Exemplars of Old Classes

The basic IMOD system assumes that no data from old classes are saved and learning happens only using the new class data, which are readily available or collected using the pipeline described in Section 5. This assumption reduces both the expected training time and the storage cost for incremental learning. Moreover, it increases the applicability of our system for cases where old data is not available.

However, in some application scenarios some or all of the data for old classes might be available. In such cases, we can add a small number of exemplars from this old data to our training set to ensure that at least some information about old classes is incorporated into the learning process. In this way, we can further reduce the forgetting on old classes and achieve a better detection accuracy overall. Note that, even when available, using all old data is not reasonable since it would increase the training latency of the system tremendously and make it unsuitable for real-time incremental object detection. Moreover, in our experiments we observed that using only a few exemplar images per class provides similar accuracy to using all data for old classes (see Section 6).

Exemplar image selection can be performed in various ways: 1) randomly select a fixed number of images from each class, 2) select images such that the average feature vector of exemplars will be closest to the class mean as in [31], 3) perform clustering for each class and pick a random image from each cluster. In this work, we preferred the last method for exemplar selection to consider in-class variation (e.g. select example images from each dog breed for dog class). In addition, even though we select a fixed number of exemplars from each class, it is also possible to fix the total number of exemplars to prevent the linear increase in exemplar dataset size as new classes are added incrementally.
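The clustering-based option can be sketched as follows. The text does not specify the clustering algorithm or the image features, so this example assumes plain k-means over precomputed per-image feature vectors; the function names, iteration count and seed handling are all hypothetical.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means; returns the cluster index assigned to each vector."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _dist2(f, centers[c]))
                  for f in features]
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = _mean(members)
    return assign


def select_exemplars(features, k, seed=0):
    """Pick one random image index from each non-empty cluster of a class."""
    rng = random.Random(seed)
    assign = kmeans(features, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(rng.choice(members))
    return chosen


# Toy class with two visually distinct groups (e.g., two dog breeds).
feats = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
chosen = select_exemplars(feats, k=2)
```

Because one exemplar is drawn per cluster, each visually distinct subgroup of a class is represented even when the exemplar budget is very small.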

5 Data Collection and Labeling

Figure 4: An example demonstrating the bounding box labeling problem. First, the labels in the classification model may not match the given new class name. Second, the noisy bounding boxes may overlap with each other (e.g., (a) and (b)) and the ideal box (b) may be eliminated by simply setting a low IoU threshold for NMS.

Given a set of downloaded images matching a new class name (e.g., slow cooker), our goal is to automatically and efficiently label the images that do include the desired objects with bounding boxes. Even though images with bounding box labels are scarce, there is an abundance of images with class labels such as ImageNet which has over 10 million labeled images for over 10k classes. With a state-of-the-art convolutional architecture (e.g., ResNet [11] or DenseNet [14]), a highly-accurate and efficient classification model can be trained to perform classification over a huge number of classes. Therefore, we argue that once we generate a set of noisy bounding boxes from a candidate image (this can be done using either a class-agnostic method [36, 39] or an existing deep object detector with a low confidence threshold), then we could use a large-scale classifier to verify each bounding box to filter out the bad box proposals.

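Algorithm 1 Automatic Dataset Construction
Input: c, the given new class name
Input: I, the downloaded images using c as query
Input: M, the large-scale classification model
Input: W, the Word2vec model

Voting for “credible labels” set L:
  initialize a counter for each label in M
  for each image i in I do
      produce a set of noisy bounding boxes B_i
      for each b in B_i that is larger than the box size threshold do
          predict top k labels using M
          for each label l in the top k labels do
              increment the counter of l
  sort the counters and append the top-1 label to L
  for each remaining label l do
      if l is within the word embedding distance threshold (measured with W) of both the top-1 label and c then
          append l to L
  return L

Purification for final dataset D:
  for each image i in I do
      for each bounding box b in B_i do
          if any label in the predicted top k labels of b is in L then
              calculate ACCS(b)
          else
              remove b from B_i
      if B_i contains more than one box then
          for each box pair (b1, b2) in B_i do
              if the overlap of b1 and b2 exceeds the overlap threshold then
                  remove the box with lower ACCS
      if B_i is not empty then
          add i with B_i to D
  return D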

However, there are two vital challenges that must be addressed, as shown in Figure 4. The first one is the label mismatch problem, in which the provided new class name does not have an exact match with the corresponding label in the large-scale classification model, e.g., a “slow cooker” may be labeled as “crock pot”. In addition, the classification model is not perfect and it may classify a “slow cooker” as “pressure cooker” or just its super-class “cooker”. To solve these problems, we propose a hybrid voting+semantic method based on two observations: (1) most of the downloaded images contain the objects matching the given class name, and (2) even though not perfect, the large-scale classification model still has a high recognition accuracy, e.g., we use an 11k-class model with 71.2% top-5 accuracy in our prototype system. Based on these observations, we can safely conclude that a considerable proportion of the images would have one or more bounding boxes predicted with the “true label” in the classification model. Because of this, the “true label” can be identified using a voting mechanism that sorts the labels by the number of occurrences in the classification results over the noisy bounding boxes and returns the most frequent label as the “true label”. In addition, other classification labels that are semantically close to both the “true label” and the given class name presumably also refer to the same object category. The semantic similarity can be measured by the distance between two labels’ word embeddings such as Word2vec [27]. The “true label” and the additional semantically verified labels form the set of “credible labels”.
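The voting and semantic verification steps can be sketched as follows. The function `embedding_distance` is a hypothetical stand-in for a Word2vec-based distance and is supplied by the caller; the toy distance function and the example predictions below are made up purely for illustration.

```python
from collections import Counter


def credible_labels(box_predictions, class_name, embedding_distance,
                    dist_thresh):
    """Vote for the "true label", then add semantically close labels.

    box_predictions: one list of top-k predicted labels per noisy box
    embedding_distance: hypothetical callable (label_a, label_b) -> float
    dist_thresh: maximum embedding distance for a label to be accepted
    """
    votes = Counter(label for labels in box_predictions for label in labels)
    ranked = [label for label, _ in votes.most_common()]
    true_label = ranked[0]                 # most frequent label wins the vote
    credible = [true_label]
    for label in ranked[1:]:
        # A label must be close to both the true label and the class name.
        if (embedding_distance(label, true_label) < dist_thresh
                and embedding_distance(label, class_name) < dist_thresh):
            credible.append(label)
    return credible


def toy_distance(a, b):
    """Made-up distance: anything involving "dog" is far from the rest."""
    return 50.0 if "dog" in (a, b) else 5.0


preds = [["crock pot", "cooker"], ["crock pot", "pressure cooker"],
         ["crock pot", "cooker"], ["dog", "crock pot"]]
labels = credible_labels(preds, "slow cooker", toy_distance, dist_thresh=10)
```

Here “crock pot” receives the most votes and becomes the “true label”, the two cooker-related labels pass the semantic check, and the unrelated label is rejected.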

The second challenge is the overlap among bounding boxes with correct labels. Simply setting a low IoU (intersection over union) threshold for NMS can eliminate too many true positive boxes, as shown in Figure 4. To deal with the overlapping-box problem, we introduce the accumulated classification confidence score (ACCS) to determine which bounding box should be retained from two overlapping boxes. The ACCS of a bounding box is defined as the sum of the predicted classification confidence scores for labels belonging to the generated set of “credible labels”. If the overlap between two boxes is larger than a threshold, we discard the one with the lower ACCS. Additionally, we ignore bounding boxes that are too small, since such boxes are very unlikely to contain an object matching the class name that was used as the query to download the images. Discarding these small boxes before sending them to the classification model greatly reduces the computational overhead.
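A minimal sketch of ACCS-based overlap resolution is given below. It assumes that overlap is measured as the intersection area divided by the area of the smaller box, and it uses a greedy pass in descending ACCS order instead of an explicit loop over all box pairs; both choices, as well as the function names, are assumptions made for illustration.

```python
def accs(label_scores, credible):
    """Sum of confidence scores over predicted labels that are credible."""
    return sum(s for label, s in label_scores.items() if label in credible)


def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_ratio(a, b):
    """Intersection area relative to the smaller box (an assumption)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    smaller = min(box_area(a), box_area(b))
    return inter / smaller if smaller > 0 else 0.0


def resolve_overlaps(boxes, scores, overlap_thresh=0.5):
    """Keep the higher-ACCS box whenever two boxes overlap too much.

    boxes:  list of (x1, y1, x2, y2)
    scores: ACCS value per box
    Returns the indices of the retained boxes in their original order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(overlap_ratio(boxes[i], boxes[j]) <= overlap_thresh
               for j in kept):
            kept.append(i)
    return sorted(kept)


score = accs({"crock pot": 0.6, "cooker": 0.2, "dog": 0.1},
             {"crock pot", "cooker"})
boxes = [(0, 0, 10, 10), (2, 2, 8, 8), (20, 20, 30, 30)]
kept = resolve_overlaps(boxes, [0.4, 0.9, 0.3])
```

In the toy input the inner box has the higher ACCS, so it is retained and the enclosing box is discarded, while the distant third box is unaffected.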

The pseudo-code for our dataset construction pipeline is summarized in Algorithm 1. In our implementation, we set the box size threshold to 1% of the image size. The word embedding similarity threshold is set to 10. The overlapping threshold of two boxes is set to half the size of the smaller box. is set to 5.
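
To make the ACCS-based filtering concrete, the following sketch applies the two thresholds stated above (1% of the image area for small boxes, half of the smaller box for overlap). It is an illustration under assumptions rather than Algorithm 1 itself: the function names, the `(x1, y1, x2, y2)` box format, the greedy highest-ACCS-first order, and the choice to drop boxes whose ACCS is zero are all hypothetical.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def accs(label_scores, credible):
    """Sum of the confidence scores of labels in the credible set."""
    return sum(score for label, score in label_scores.items() if label in credible)


def filter_boxes(candidates, credible, image_area,
                 min_area_frac=0.01, overlap_frac=0.5):
    """candidates: list of (box, {label: confidence}) pairs for one image."""
    # 1) Ignore boxes smaller than 1% of the image.
    large = [(box, scores) for box, scores in candidates
             if box_area(box) >= min_area_frac * image_area]
    # 2) Score the remaining boxes; boxes without any credible label are
    #    dropped (an assumption of this sketch).
    scored = [(box, accs(scores, credible)) for box, scores in large]
    scored = [(box, s) for box, s in scored if s > 0]
    # 3) Visit boxes from high to low ACCS; a box is discarded when it
    #    overlaps an already kept box by more than half of the smaller box.
    scored.sort(key=lambda item: item[1], reverse=True)
    kept = []
    for box, score in scored:
        if all(intersection_area(box, other)
               <= overlap_frac * min(box_area(box), box_area(other))
               for other, _ in kept):
            kept.append((box, score))
    return kept


credible_set = {"crock pot", "pressure cooker"}
toy_candidates = [
    ((10, 10, 60, 60), {"crock pot": 0.6, "pressure cooker": 0.2, "sofa": 0.1}),
    ((15, 15, 60, 60), {"crock pot": 0.5}),   # nested in the first box
    ((70, 70, 75, 75), {"crock pot": 0.9}),   # smaller than 1% of the image
    ((65, 10, 95, 50), {"sofa": 0.9}),        # no credible label
]
kept = filter_boxes(toy_candidates, credible_set, image_area=100 * 100)
```

Only the first toy box survives: the nested box loses on ACCS, the tiny box is skipped before scoring, and the “sofa" box has no credible label.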

Discussion on Dataset Construction. It might seem that, instead of incrementally training an object classifier using the automatically constructed dataset, the two-stage dataset construction algorithm could be directly employed to detect the new object classes without any training. However, this approach would incur higher storage and memory costs since it requires the deployment of two deep learning models: a detection model to create the set of noisy bounding boxes and a large-scale classification model to verify the label of each bounding box. In addition, inference on a single image would take significantly longer than with single-stage object detectors because every box proposal must be classified.

6 evaluation

The evaluation of IMOD focuses on answering the following questions: (1) How effective and efficient is the proposed incremental learning algorithm? (2) How effective and efficient is the automatic data construction algorithm? (3) How does this complete system work in practice? What is the running time? What are the bottlenecks in overall system design?

6.1 Evaluation Dataset

The following two datasets are used to evaluate the object detection accuracy for the incremental learning algorithm.

  • Pascal VOC 2007 [6]: This is a benchmark dataset for object detection which includes 20 object classes. In total, it has 9,963 images collected from the Flickr photo-sharing website (https://www.flickr.com/) and contains 24,640 annotated objects. We use the 5K images in the train and val splits for training, and the images in the test split for validation.

  • iKitchen: We collected and annotated 7K images (6K for training and 1K for validation) belonging to 10 classes of kitchen objects for an internal demonstration. Most of the images are collected using mobile phones in lab and real-world environments to evaluate our system in a more realistic setting [28]. This dataset was also used in the prototype system evaluation.

Method aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor mAP
Class 1-19 70.6 79.4 76.6 55.6 61.7 78.3 85.2 80.3 50.6 76.1 62.8 78.0 78.0 74.9 77.4 44.3 69.1 70.5 75.6 - -
All Data 77.8 85.0 82.9 62.1 64.4 74.7 86.9 87.0 56.0 76.5 71.2 79.2 79.1 76.2 83.8 53.9 73.2 67.4 77.7 78.7 74.7
Catastrophic Forgetting 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 68.9 3.4
w/o Feat-Distill Loss 61.9 78.5 62.5 39.2 60.9 53.2 79.3 84.5 52.3 52.6 62.8 71.5 51.8 61.5 76.8 43.8 43.8 69.7 52.9 44.6 60.2
w/ Feat-Distill Loss 69.7 78.3 70.2 46.4 59.5 69.3 79.7 79.9 52.7 69.8 57.4 75.8 69.1 69.8 76.4 43.2 68.5 70.9 53.7 40.4 65.0
Table 1: Per-class accuracy of Pascal VOC (%) for the 19+1 scenario.

Method aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor mAP
Class 1-10 76.8 78.1 74.3 58.9 58.7 68.6 84.5 81.1 52.3 61.4 - - - - - - - - - - -
All Data 77.8 85.0 82.9 62.1 64.4 74.7 86.9 87.0 56.0 76.5 71.2 79.2 79.1 76.2 83.8 53.9 73.2 67.4 77.7 78.7 74.7
Catastrophic Forgetting 0 0 0 0 0 0 0 0 0 0 66.3 71.5 75.2 67.7 76.4 38.6 66.6 66.6 71.1 74.5 33.7
w/o Feat-Distill Loss 67.1 64.1 45.7 40.9 52.2 66.5 83.4 75.3 46.4 59.4 64.1 74.8 77.1 67.1 63.3 32.7 61.3 56.8 73.7 67.3 62.0
w/ Feat-Distill Loss 71.7 81.7 66.9 49.6 58.0 65.9 84.7 76.8 50.1 69.4 67.0 72.8 77.3 73.8 74.9 39.9 68.5 61.5 75.5 72.4 67.9
Table 2: Per-class accuracy of Pascal VOC (%) for the 10+10 scenario.

To evaluate the automatic dataset construction algorithm, we simulate the behavior of downloading web images using a class name as query. Specifically:

  • regarding Pascal VOC, for each of the 20 classes, we use the class name (e.g., chair) as the search keyword on ImageNet (http://www.image-net.org) to find the matched synset, and we download 200 images from the synset along with their bounding box annotations.

  • regarding iKitchen, we use Google image search to download images in real-time to demonstrate the effectiveness and efficiency of the algorithm in the prototype system evaluation.

For all the model accuracy evaluations, we use the standard mean average precision (mAP) at the threshold of 0.5 IoU as the evaluation metric.
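
As a reminder of how the metric is formed, the sketch below computes the average precision (AP) of one class from detections that have already been matched to the ground truth at 0.5 IoU; mAP is the mean of the per-class AP values. The function name is hypothetical, and the area-under-the-envelope integration is an assumption of this sketch, since the text does not state which interpolation variant was used.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched an unused ground-truth
        box at IoU >= 0.5 (the matching itself is assumed to be done already).
    num_ground_truth: number of ground-truth boxes of this class.
    """
    if num_ground_truth == 0 or not scores:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the best precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Integrate precision over recall.
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)


example_ap = average_precision([0.9, 0.8, 0.7], [True, False, True], 2)
```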


6.2 Incremental Learning

The input image for the deep neural network is resized such that the longer side is 1024 and the shorter side is then adjusted proportionally. The initial learning rate for all experiments is set to and we use the Adam method [18] for optimization. The training is performed on a single NVIDIA Tesla M40 GPU, with batch size of 8. We run all our experiments using PyTorch.
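
The resizing rule can be written in a few lines. This is only a sketch of the stated rule (longer side fixed, aspect ratio preserved); the function name and the rounding of the shorter side to the nearest pixel are assumptions.

```python
def resized_size(width, height, longer_side=1024):
    """Scale (width, height) so that the longer side equals `longer_side`."""
    scale = longer_side / max(width, height)
    return round(width * scale), round(height * scale)


landscape = resized_size(2048, 1536)        # longer side is the width
portrait = resized_size(375, 500, 512)      # longer side is the height
```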


6.2.1 Experiment Setup For Pascal Dataset

The model architecture we use for Pascal dataset is RetinaNet with ResNet-50 backbone, and we train each model for 100 epochs. We performed experiments for the following two scenarios:

  • 19+1: we train a base model with the first 19 classes and incrementally train the 20th class (i.e., the tv/monitor class).

  • 10+10: we train a base 10-class model and then learn 10 new classes simultaneously.

The following 4 different learning schemes have been compared:

  • All Data: We fine-tune the old model using data from both old and new classes and use the original loss function in RetinaNet without distillation loss terms.

  • Catastrophic Forgetting: We train the model using the training data of only new classes and use the original loss function in RetinaNet without distillation loss terms.

  • w/o Feat-Distill Loss: We train the model with training data of only new classes and use the proposed loss function except the feature distillation loss term defined in Equation 4.

  • w/ Feat-Distill Loss: We train the model using training data of only new classes and use the proposed loss function including the feature distillation loss term.

6.2.2 Experiment Result For Pascal Dataset

Figure 5: Result of adding “slow cooker" to the base 8 class model on iKitchen Dataset.

We present the result of the 19+1 scenario in Table 1, and the result of the 10+10 scenario in Table 2. In both scenarios, we observe that without the distillation loss terms, the “catastrophic forgetting" problem occurs and the mAP values for all old classes drop to 0. On the other hand, with the novel loss function proposed in this paper, the accuracy on the old classes is largely preserved while incrementally learning with training images of only the new classes. Even compared to using data of all classes, the average mAP over all classes is reduced by less than 10 percentage points (74.7% with all data versus 65.0% and 67.9% with our method). In addition, by adding the feature distillation loss term, we observe a 4.8% accuracy increase for the 19+1 scenario, and a 5.9% accuracy increase for the 10+10 scenario. The results demonstrate that the proposed learning algorithm can effectively mitigate “catastrophic forgetting", even in scenarios where multiple new classes are learned simultaneously.
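
The figures quoted above follow directly from the mAP columns of Tables 1 and 2. The short check below reproduces them; the dictionary layout and function name are illustrative, while the numbers are copied from the tables.

```python
# mAP values (%) taken from the last column of Tables 1 and 2.
MAP = {
    "19+1": {"all_data": 74.7, "without_feat": 60.2, "with_feat": 65.0},
    "10+10": {"all_data": 74.7, "without_feat": 62.0, "with_feat": 67.9},
}


def summarize(scenario):
    """Return (gain from feature distillation, gap to training on all data)."""
    row = MAP[scenario]
    gain = row["with_feat"] - row["without_feat"]
    gap = row["all_data"] - row["with_feat"]
    return round(gain, 1), round(gap, 1)


# In the 19+1 "Catastrophic Forgetting" row, only the new class is non-zero,
# so the 20-class mean collapses to 68.9 / 20.
forgetting_map = 68.9 / 20
```

The gains are 4.8 and 5.9 points, the gaps to the all-data model are 9.7 and 6.8 points, and the forgetting baseline averages to roughly 3.4%, matching Table 1.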

In the 19+1 scenario, the number of training images for the new “tv/monitor" class is just 279 which is only 5.5% of the 5K training dataset for all 20 classes. Even with such a limited amount of data, our model can learn the new class without forgetting the old classes. Moreover, due to its capability of learning with a small number of images, our method can significantly reduce the training time and thus is more suitable for incremental learning in mobile applications.

6.2.3 Experiment Setup For iKitchen Dataset

The model architecture we use for the iKitchen dataset is RetinaNet with a ResNet-18 backbone. We train a base 8-class object detection model by excluding two classes: slow cooker (SC) and cocktail shaker (CS). The mAP for the base 8-class model is 80.1%.

Then we run the following experiments: (1) 8+SC, (2) 8+CS, (3) 8+SC+CS, and (4) 8+CS+SC. In experiments (3) and (4), we apply the proposed incremental learning method each time a new class is added. Moreover, we study the trade-off between accuracy and training time when adding exemplar data from the old classes, as discussed in Section 4.2.2. For all experiments, we train the model for 10 epochs using the proposed incremental learning algorithm including the feature distillation loss.

6.2.4 Experiment Result For iKitchen Dataset

As seen in Figure 6, adding exemplars from the old classes can boost the incremental learning accuracy, especially when we are learning new classes sequentially (i.e., the 8+SC+CS and 8+CS+SC scenarios). Even when adding just 10 exemplars per old class, the mAP increases by 15% to 40%. As more exemplars are added, we do not see a further significant accuracy increase. Using all the training data from old classes can be seen as a special case of adding exemplars.
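
A training set with a fixed number of exemplars per old class could be assembled as in the sketch below. How the exemplars are actually chosen is not specified in this section, so uniform random sampling, the function name, and the file names are assumptions made purely for illustration.

```python
import random


def build_training_set(new_class_images, old_class_images,
                       exemplars_per_class=10, seed=0):
    """Combine all new-class images with a few exemplars per old class.

    old_class_images: hypothetical mapping {class name: list of image paths}.
    """
    rng = random.Random(seed)
    training = list(new_class_images)
    for _, images in sorted(old_class_images.items()):
        k = min(exemplars_per_class, len(images))
        training.extend(rng.sample(images, k))  # random choice is an assumption
    return training


toy_old = {
    "kettle": [f"kettle_{i}.jpg" for i in range(50)],
    "toaster": [f"toaster_{i}.jpg" for i in range(5)],  # fewer than 10 images
}
toy_new = [f"slow_cooker_{i}.jpg" for i in range(71)]
toy_training = build_training_set(toy_new, toy_old)
```

With 10 exemplars per class the old classes contribute only a small, fixed number of images, which is why the training time stays close to that of training on the new class alone.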

In Figure 5, we show the detailed accuracy for each object class in the 8+SC scenario for different numbers of exemplars per class (left sub-figure). First of all, we see that all the old classes maintain good accuracy after learning the new class. In addition, we see that the accuracy for some of the base 8 classes also increases after adding the new class. The reason could be that as the model sees more diverse training samples, it can learn better features that provide better boundaries for discriminating different object classes. We have also measured the training time and the speed-up when using different numbers of exemplars (right sub-figure). Compared with the baseline of learning with all training data, we can achieve a 38x speed-up with only 4% accuracy loss (see the 10 exemplar case).

Figure 6: The average mAP over all classes with different number of exemplars per class.

PASCAL Label Top Returned Credible Labels
aeroplane ’airplane, aeroplane, plane’, ’jet, jet plane, jet-propelled plane’, ’jetliner’, ’warplane, military plane’
bicycle ’bicycle, bike, wheel, cycle’, ’safety bicycle, safety bike’, ’push-bike’, ’ordinary, ordinary bicycle’
bird ’bird’, ’passerine, passeriform bird’, ’dickeybird, dickey-bird, dickybird, dicky-bird’, ’parrot’
boat ’boat’, ’small boat’, ’dinghy, dory, rowboat’, ’sea boat’, ’rowing boat’, ’river boat’, ’cockleshell’
bottle ’bottle’, ’pop bottle, soda bottle’, ’water bottle’, ’jar’, ’smelling bottle’, ’flask’, ’jug’, ’carafe’
bus ’public transport’, ’local’, ’bus, autobus’, ’express, limited’, ’shuttle bus’, ’trolleybus, trolley coach’
car ’motor vehicle, automotive vehicle’, ’car, auto’, ’coupe’, ’sports car, sport car’, ’sedan, saloon’
cat ’domestic cat, house cat’, ’kitty, kitty-cat’, ’tom, tomcat’, ’mouser’, ’Manx, Manx cat’, ’tabby’
chair ’chair’, ’seat’, ’armchair’, ’straight chair, side chair’, ’rocking chair, rocker’, ’swivel chair’
cow ’bull’, ’cattle, cows’, ’cow’, ’bullock, steer’, ’beef, beef cattle’, ’cow, moo-cow’, ’dairy cattle’
dining table ’dining-room table’, ’dining table, board’, ’dinner table’, ’table’, ’dining-room furniture’
dog ’sporting dog, gun dog’, ’terrier’, ’retriever’, ’hunting dog’, ’Labrador retriever’, ’water dog’
horse ’horse, Equus caballus’, ’equine, equid’, ’gelding’, ’mare, female horse’, ’yearling’, ’pony’, ’dobbin’
motorbike ’motorcycle, bike’, ’wheeled vehicle’, ’trail bike, dirt bike, scrambler’, ’motor scooter, scooter’
person ’person, individual’, ’male, male person’, ’face’, ’oldster, old person’, ’man’, ’eccentric person’
potted plant ’pot, flowerpot’, ’planter’, ’houseplant’, ’bucket, pail’, ’vase’, ’crock, earthenware jar’, ’watering pot’
sheep ’sheep’, ’domestic sheep, Ovis aries’, ’ewe’, ’ram, tup’, ’black sheep’, ’wild sheep’, ’mountain sheep’
sofa ’seat’, ’sofa, couch, lounge’, ’love seat, loveseat’, ’chesterfield’, ’settee’, ’easy chair, lounge chair’
train ’train, railroad train’, ’passenger train’, ’mail train’, ’car train’, ’freight train, rattler’, ’commuter’
tv/monitor ’monitor’, ’LCD’, ’television monitor, tv monitor’, ’OLED’, ’digital display, alphanumeric display’
Table 3: Credible Labels for PASCAL Dataset.

Metric aeroplane bicycle bird boat bottle bus car cat chair cow diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor Average
Retention Rate (deep) 64.09 19.38 78.02 60.00 62.37 59.50 79.50 85.71 61.83 78.86 64.82 74.49 75.90 47.18 50.67 63.30 76.06 67.50 74.32 48.59 64.60
Retention Rate (ebox) 38.00 25.5 26.5 37.37 36.0 35.5 43.5 29.0 48.0 16.29 47.5 23.0 19.0 40.5 26.0 24.5 31.33 41.0 37.0 33.55 32.95
FP Rate (deep) 0.00 12.41 1.74 3.73 11.11 3.01 2.01 4.02 4.71 3.43 15.63 1.16 1.55 9.34 3.51 12.50 2.13 6.12 1.68 7.62 5.37
FP Rate (ebox) 4.5 11.5 19.79 11.16 9.64 2.5 3.51 9.54 6.63 9.55 13.5 14.19 13.0 12.0 14.06 27.71 15.64 11.16 6.03 18.86 11.73
Table 4: Evaluation on the quality of dataset construction (%).

6.3 Dataset Construction

6.3.1 Experiment setup

To evaluate our automatic dataset construction method, we use an 11k-class classification model pre-trained on the ImageNet dataset (we converted a pre-trained MXNet model, available at http://data.mxnet.io/models/imagenet-11k/, to PyTorch), and a word2vec model pre-trained on the Google News dataset (about 100 billion words), which contains 300-dimensional vectors for 3 million different words and phrases. To generate the noisy bounding boxes from an image, we have adopted two methods:

  • deep: We train an object detection model (RetinaNet) on the COCO dataset [25] by excluding the 20 classes in Pascal VOC and the two new classes in iKitchen, i.e. “slow cooker" and “cocktail shaker". For each image, we run the detector to identify bounding boxes with classification confidence scores above a low threshold 0.2.

  • edge (reported as “ebox" in Table 4): We run the EdgeBoxes [39] algorithm on each image and retain up to 20 bounding boxes with the highest predicted confidence scores.
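
Both proposal sources reduce to a simple filter over scored boxes, sketched below. The function names and the `(box, score)` input format are hypothetical; only the 0.2 confidence threshold and the limit of 20 boxes come from the text, and no real detector or EdgeBoxes implementation is called.

```python
def deep_proposals(detections, min_score=0.2):
    """Keep detector boxes whose classification confidence exceeds min_score."""
    return [box for box, score in detections if score > min_score]


def edge_proposals(proposals, max_boxes=20):
    """Keep at most max_boxes proposals, highest score first."""
    ranked = sorted(proposals, key=lambda item: item[1], reverse=True)
    return [box for box, _ in ranked[:max_boxes]]


toy_detections = [((0, 0, 10, 10), 0.9), ((1, 1, 5, 5), 0.15), ((2, 2, 8, 8), 0.25)]
toy_edges = [((i, i, i + 10, i + 10), i / 25) for i in range(25)]
deep_boxes = deep_proposals(toy_detections)
edge_boxes = edge_proposals(toy_edges)
```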

6.3.2 Pascal 20 Classes

First, we would like to demonstrate that the set of “credible labels" identified in the label set of the 11k classification model by the proposed Algorithm 1 is reasonable. The produced credible labels are given in Table 3. We can observe that our algorithm has managed to extract the semantically closest labels for the given class names. Some of the extracted class labels include rarely used words which might be overlooked even by humans. These accurately identified “credible labels" are critical for removing a large number of irrelevant bounding boxes from the noisy bounding box candidate set.

Second, we calculate the accuracy of the returned bounding boxes by comparing them with the ground-truth. Two evaluation metrics are used:

  • Retention Rate: The retention rate is defined as the percentage of images that have been labeled correctly for all the ground-truth bounding boxes. Here, we say a box is correctly labeled if the returned bounding box $B_p$ has IoU (intersection over union: $\mathrm{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$) above 0.5 to the ground-truth bounding box $B_{gt}$.

  • FP Rate: The false positive rate is defined as the percentage of bounding boxes that fall out of the ground-truth region. Concretely, we calculate the IoP value (intersection over prediction: $\mathrm{IoP} = \frac{|B_p \cap B_{gt}|}{|B_p|}$) to decide if the predicted bounding box $B_p$ is within the scope of a ground-truth bounding box $B_{gt}$. If IoP<0.5, we regard this predicted bounding box as a false positive. A high false positive rate will have serious negative impact on the training accuracy.
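
The two metrics above can be sketched as follows. The helper names and the `(x1, y1, x2, y2)` box format are assumptions; the 0.5 thresholds follow the definitions in the list. The toy example places a small box fully inside the ground truth, the situation in which a box fails the IoU test without counting as a false positive.

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(pred, gt):
    """Intersection over union of a predicted and a ground-truth box."""
    inter = intersection(pred, gt)
    union = area(pred) + area(gt) - inter
    return inter / union if union else 0.0


def iop(pred, gt):
    """Intersection over the area of the predicted box."""
    return intersection(pred, gt) / area(pred) if area(pred) else 0.0


def image_retained(preds, gts, threshold=0.5):
    """True if every ground-truth box is matched by some returned box."""
    return all(any(iou(p, g) > threshold for p in preds) for g in gts)


def is_false_positive(pred, gts, threshold=0.5):
    """True if the box lies mostly outside every ground-truth box."""
    return all(iop(pred, g) < threshold for g in gts)


ground_truth = [(0, 0, 10, 10)]
part_box = (0, 5, 5, 10)     # covers only a part of the object
stray_box = (20, 20, 30, 30)
```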

We present the result of the bounding box extraction accuracy in Table 4. Compared with the noisy bounding boxes generated by EdgeBoxes, the boxes generated by the deep learning detection model are much more accurate in terms of both retention rate and FP rate. This can be explained by the fact that even though the deep learning model is not trained on the new classes, it has “seen" tens of thousands of images, and thus could better identify the boundaries between different semantically meaningful objects with deep learned features. On the other hand, EdgeBoxes generates bounding box proposals that rely entirely on the edges extracted from the image, and thus is not as robust. On the noisy boxes generated by the deep learning model, our algorithm achieves a high retention rate of 64.6% and low FP rate of 5.37%.

The only case with a low retention rate is the bicycle class. After manually inspecting the generated bounding boxes, we observed that many bounding boxes were generated over the bicycle wheels instead of the whole bicycle. This happens because the classification model also classifies a bounding box containing a bicycle wheel as “bicycle". However, since those bounding boxes fall within the boundary of the ground-truth bounding boxes, the FP rate for the bicycle class is still only 12.4%.

6.3.3 Incrementally Train iKitchen Dataset with Automatically Constructed Dataset

Scenario Base-8 (before) Base-8 (after) New class All 9 (after)
8+SC 80.1% 80.8% 85.4% 81.3%
8+CS 80.1% 79.2% 32.2% 74.0%
Table 5: iKitchen accuracy on the automatically constructed dataset. Columns, from left to right: average mAP for the base 8 classes before learning the new class; average mAP for the base 8 classes after learning the new class; mAP for the newly learned class; average mAP for all 9 classes after learning the new class.

For iKitchen dataset, we download 100 images for the new class with Google image search for incremental training. In particular, we use the keywords “slow cooker" and “cocktail shaker" for the 8+SC and 8+CS scenarios respectively. After running the dataset construction algorithm, 71 images remain for the “slow cooker" class and 91 images remain for the “cocktail shaker" class. We train the model with 10 exemplars per old class.

We show the result in Table 5. For both the 8+SC and 8+CS scenarios, after learning the new class, the accuracy on the old classes remains almost unchanged. For the newly learned class, we achieve very good accuracy on the slow cooker class, with an mAP of 85.4%. However, for the 8+CS scenario, the accuracy for the new “cocktail shaker" class is relatively low, with an mAP of 32.2%. This demonstrates that the quality of the automatically created training dataset is not consistent across object classes. We will investigate this issue further in future work.
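
The 9-class averages in Table 5 are consistent with the other columns, since mAP is a mean over classes. A quick check (the function name is illustrative; the numbers come from Table 5):

```python
def combined_map(old_classes_map, new_class_map, num_old_classes=8):
    """Mean AP over the old classes plus one newly learned class."""
    total = num_old_classes * old_classes_map + new_class_map
    return total / (num_old_classes + 1)


slow_cooker_case = round(combined_map(80.8, 85.4), 1)
cocktail_shaker_case = round(combined_map(79.2, 32.2), 1)
```

Because the new class is only one of nine, even the weak 32.2% result for “cocktail shaker" lowers the overall average by only about six points.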

6.4 System Implementation and Efficiency

To measure the latency of our incremental learning approach, we have implemented the end-to-end system and performed multiple experiments using two experimental setups. In the Mobile-Only setup, all steps from dataset collection to incremental training of new classes are performed on the embedded Jetson TX2 platform. While this setup has advantages such as privacy protection and reduced network activity, it cannot benefit from the powerful GPU(s) on the cloud side. In the Mobile-Cloud setup, on the other hand, the major system components, including dataset preparation and incremental model training, run on a cloud server with a single NVIDIA Tesla M40 GPU, and the final model is then transferred to a Samsung Galaxy S9 Android phone for inference. Table 7 shows the latencies for every step of the incremental learning process in both setups. Even though the Mobile-Only setup has no model transfer latency, overall it is much slower than the Mobile-Cloud setup due to the significant difference in computation power during model training. Please refer to the video in the supplementary material for a demo of our Mobile-Cloud implementation.
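
The per-step latencies reported in Table 7 can be summed to confirm the totals and to see where the time goes. The dictionary keys and function names below are illustrative; the numbers are those of Table 7.

```python
# Per-step latency in seconds, copied from Table 7.
LATENCY_S = {
    "Mobile-Only": {"download_images": 16, "build_dataset": 44, "train_model": 233},
    "Mobile-Cloud": {"download_images": 10, "build_dataset": 21,
                     "train_model": 83, "download_model": 5},
}


def total_latency(setup):
    return sum(LATENCY_S[setup].values())


def training_share(setup):
    """Fraction of the end-to-end time spent in model training."""
    return LATENCY_S[setup]["train_model"] / total_latency(setup)


overall_speedup = total_latency("Mobile-Only") / total_latency("Mobile-Cloud")
```

Model training accounts for roughly 80% of the Mobile-Only time and 70% of the Mobile-Cloud time, so it is the dominant step in both setups.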

Input Size 10 Exemplar 30 Exemplar
Small (512) 8.3 14.1
Large (1024) 17.2 30.7
Table 6: Training time for different input image sizes on NVIDIA Tesla M40 (seconds per epoch).

Step Mobile-Only Mobile-Cloud
Download image 16 10
Build dataset 44 21
Train model 233 83
Download model N/A 5
Total 293 119
Table 7: System running time (s).

Figure 7: Accuracy of different input sizes for 8+slow cooker. Panels: (a) 10 exemplars; (b) 30 exemplars.

In these experiments, we used ResNet-18 as the base model, and trained the model for 10 epochs on a dataset automatically generated from 100 images downloaded using Google image search, plus 10 exemplar images for each of the base 8 classes. In addition, the input image is resized so that the longer side is 512 while maintaining the original aspect ratio. Note that we preferred a small image size since it has a major effect on training time. Table 6 shows that doubling the image dimensions more than doubles the training time while providing little improvement in the accuracy of the system (see Figure 7).
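
The effect of the input size can be read off Table 6 with a little arithmetic. The names below are illustrative and the numbers are taken from Table 6.

```python
# Seconds per epoch on the Tesla M40, copied from Table 6.
SECONDS_PER_EPOCH = {
    512: {10: 8.3, 30: 14.1},
    1024: {10: 17.2, 30: 30.7},
}


def slowdown(exemplars):
    """Training-time ratio when the longer side grows from 512 to 1024."""
    return SECONDS_PER_EPOCH[1024][exemplars] / SECONDS_PER_EPOCH[512][exemplars]


def training_time(size, exemplars, epochs=10):
    return SECONDS_PER_EPOCH[size][exemplars] * epochs
```

Ten epochs at size 512 with 10 exemplars come to about 83 seconds, which agrees with the Mobile-Cloud training time in Table 7. The slowdown factors are about 2.1 and 2.2, more than double but well below the 4x growth in pixel count.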

6.5 Discussion

Even though the proposed incremental learning system for object detection is fast and practical enough to be deployed in real applications, many optimizations are possible to improve the system efficiency:

Model optimizations. In this work, we used ResNet-50 and ResNet-18 as the backbone models. However, using a network that is primarily designed for time-critical mobile applications, such as MobileNet [13] or SqueezeNet [16], can reduce the overall system time significantly by decreasing the model training time. In addition, in single-shot object detectors, anchor boxes are calculated using multiple layers of the base feature extraction model. However, in many applications, the user is interested in a single salient object that is relatively large in the image. In such a case, we can skip anchor proposals from lower layers since they are mainly used for detecting smaller objects in the image.
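
To give a sense of how much work the lower layers account for, the sketch below counts anchors per pyramid level. It assumes a common RetinaNet configuration that is not stated in this section (pyramid levels P3 to P7 with strides 8 to 128, 9 anchors per location) and a square 512x512 input; the function name is hypothetical.

```python
import math


def anchors_per_level(image_size=512, strides=(8, 16, 32, 64, 128),
                      anchors_per_cell=9):
    """Number of anchors produced by each pyramid level for a square input."""
    counts = {}
    for level, stride in enumerate(strides, start=3):
        cells = math.ceil(image_size / stride)  # feature map side length
        counts[f"P{level}"] = cells * cells * anchors_per_cell
    return counts


counts = anchors_per_level()
lowest_level_share = counts["P3"] / sum(counts.values())
```

Under these assumptions the finest level alone produces about three quarters of all anchors, so skipping it for large, salient objects would remove most of the anchor processing.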

System enhancements. Besides these model optimizations, when enough system resources are available, system-level techniques such as multi-GPU training, parallel image downloading and preprocessing, caching of downloaded images, and pre-loading of the deep models used in dataset generation can be employed to further reduce the overall system time.
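
Parallel image downloading, for example, can be expressed with a thread pool from the Python standard library. This is a generic sketch and not part of the described system: `download_all` and `fake_fetch` are hypothetical names, and the fetch function is injected so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Fetch all URLs concurrently and return results in input order.

    fetch: any callable mapping a URL to image bytes (hypothetical here;
    in practice it would wrap an HTTP client).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return url.encode("utf-8")


toy_urls = [f"https://example.com/img_{i}.jpg" for i in range(100)]
images = download_all(toy_urls, fake_fetch)
```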

Bounding box generation for random unseen objects. In rare cases, the new objects a user is interested in might not be included in the large-scale classification model (e.g., the 11k-class ImageNet model). In such scenarios, we cannot use the proposed dataset preparation pipeline. Instead, we could possibly use an unsupervised object co-localization method [38] to locate objects belonging to the same category, at the cost of less accurate bounding box locations.

Scaling the model. As more classes are added, to maintain a good accuracy in the incremental learning, exemplars from old classes must be included. However, this will also increase the model training time linearly. More intelligent methods of selecting the exemplars would help alleviate this problem.

7 conclusion

In this paper, we have presented IMOD, a system for efficient incremental learning of deep neural networks for mobile object detection. IMOD includes a novel incremental learning algorithm which learns to detect a new object class with only training data from the new class while preventing the model from forgetting its knowledge of the old classes. Compared with training on all data, this algorithm achieves a 38x speed-up with only 4% accuracy loss in our iKitchen experiments. In addition, in the absence of available training data for the new classes, IMOD introduces a real-time dataset construction method to label web-crawled images with high-precision bounding boxes, which makes IMOD a practical system ready for deployment. We have implemented IMOD with both mobile-cloud and mobile-only architectures, and demonstrated that learning a new object class can finish in less than 2 minutes on a single GPU with good detection accuracy.


  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), Savannah, GA, October 2016.
  • [2] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. arXiv preprint arXiv:1711.09601, 2017.
  • [3] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751, 2017.
  • [4] Wang Chong, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1903–1910. IEEE, 2009.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
  • [7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [8] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [9] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constrains. In Proc. MobiSys’16, Singapore, June 2016.
  • [10] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. CoRR, abs/1510.00149, 2015.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385, 2015.
  • [12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE, 2017.
  • [15] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. Deepmon: Mobile gpu-based deep learning framework for continuous vision applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 82–95. ACM, 2017.
  • [16] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and 1MB Model Size. CoRR, abs/1602.07360, 2016.
  • [17] Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 119–126. ACM, 2003.
  • [18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [19] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, page 201611835, 2017.
  • [20] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. In Proc. IPSN’16, Vienna, Austria, 2016.
  • [21] Nicholas D. Lane and Petko Georgiev. Can Deep Learning Revolutionize Mobile Sensing? In Proc. HotMobile’15, Santa Fe, NM, February 2015.
  • [22] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 936–944. IEEE, 2017.
  • [24] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [28] Andrew Ng. Machine learning yearning, 2017.
  • [29] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, 2017.
  • [30] Porting Caffe to Android Platform.
  • [31] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. NIPS’15, Montreal, Canada, December 2015.
  • [34] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. arXiv preprint arXiv:1708.06977, 2017.
  • [35] The SDK for Jetpac’s iOS Deep Belief Image Recognition Framework.
  • [36] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [37] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou. Dlau: A scalable deep learning accelerator unit on fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
  • [38] Xiu-Shen Wei, Chen-Lin Zhang, Yao Li, Chen-Wei Xie, Jianxin Wu, Chunhua Shen, and Zhi-Hua Zhou. Deep descriptor transforming for image co-localization.
  • [39] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European conference on computer vision, pages 391–405. Springer, 2014.