1 Introduction
Autonomous driving is receiving enormous development effort, with many companies predicting large-scale commercial deployment in 2-3 years [35]. One of the most important features of autonomous vehicles is the ability to interpret their surroundings and perform complex perception tasks such as the detection and recognition of lanes, roads, pedestrians, vehicles, and traffic signs [27]. Recently, the growth of Convolutional Neural Networks (CNNs) and large labeled data sets [8, 11] has led to tremendous progress in object detection and recognition [14, 15, 26, 20]. It is now possible to detect objects with high accuracy [26, 25, 20].

Videos collected in-vehicle have great potential to improve model quality (through offline training), but at the scales achievable in a few years (billions to trillions of hours of video), training on all the data is completely impractical. Nor is it desirable: most images contain “typical” content and objects that are recognized with good accuracy. These images contribute little to the final model. It is the less common “interesting” images that are most important for training (i.e. images containing objects that are misclassified, or classified with low confidence). To benefit from these images, it is still important to have accurate label information. Images of distant objects in isolation are not good for this purpose: by definition they contain objects which are difficult to label automatically. But we can use a particular characteristic of vehicle video: most of the time (vehicle moving forward), objects gradually grow and become clearer and easier to recognize. We exploit this coherence by tracking objects back in time and using near, high-reliability labels to label more distant objects. We demonstrate this process on a hand-labeled dataset [13] which has only a small fraction of its frames labeled and also has short video clips that can be used for object tracking. We show that we can extend the labeled data using a near-to-far tracking strategy, and that importance sampling can be used to refine the automatically labeled dataset to improve its quality.

To automate far-object tracking, we need both single-image object detection and between-image tracking. While these two modules can be used separately, we designed a strategy to combine them. Specifically, we use a Faster R-CNN object detector [26] and Kalman filtering to track objects. We use predicted object positions from tracking to augment Faster R-CNN’s region proposals, and then use Faster R-CNN’s bounding box regression to correct the object position estimates. The result is that we can track and persist object labels much further into the distance, where it might be hard for Faster R-CNN to give accurate object region proposals.
The automatically generated labels are then used to compute the importance of each image. As shown in [2], an optimal model is obtained when images are importance sampled using the norm of the backpropagated gradient for each image. While computing the full backpropagated gradients in vehicle video systems would be very expensive, we can use the loss as a surrogate for the gradient norm, since it is easier to obtain. We further show in our experiments that the loss is a reasonable approximation of the gradient norm. Importance sampling for training data filtering is also described in [7].
Contributions. Starting with a sparsely labeled video dataset, we combine object tracking and object detection to generate new labeled data, and then use importance sampling for data reduction. The contributions of this paper are: 1) We show that near-to-far object tracking can produce useful training data augmentation. 2) We show empirically that the gradient norm can be approximated by the loss and by the last-layer gradient norm in deep neural networks. 3) We show that importance sampling produces large reductions in training data and training time with a modest loss of accuracy.
2 Related Work
Semi-Supervised Data Labeling. Given the large number of sparsely labeled image datasets, some work has been done on semi-supervised object detection and data labeling [34, 31, 18, 6]. These works try to learn a set of similar attributes for image classes [31, 6] to label new datasets, to cluster similar images [18], or to perform transfer learning to recognize similar types of objects [34]. However, these methods are typically suited to image datasets, where images are processed individually, and do not consider the temporal continuity of video data for semi-supervised learning. Semi-supervised learning for video datasets is also described in [19, 24, 33]. While their performance is good, they assume that the salient object in a video is labeled and thus do not apply well to multi-object detection or tracking. A large body of work also exists on tracking-by-detection [3, 23, 16, 17, 5]. However, these methods either assume that negative data can be sampled from around the object, or do not use the special characteristic of driving video that objects in the near field are easier to detect than objects in the far field. In addition, such naive combinations of tracking and detection may introduce additional noise when labeling images. Tracking also tends to drift in the long run, and the related data association is very challenging [3]. The work of [21] proposed semi-supervised learning to label large video datasets by tracking multiple objects in videos. However, its application scenario is not driving video, and the only objects it detects are cars. In addition, its focus is on short-term tracking of objects, and it does not require short tracklets to be associated with each other. Its applicability is therefore limited, especially when there are multiple categories of objects in a scene, since ignoring data association is problematic if the goal is to label multiple categories of objects. In our work, we do consider the problem of data association, and we use the object tracker’s prediction as the region proposal for the object detector to provide more accurate bounding box annotation. However, similar to [21], we do not perform long-run tracking, to prevent the tracker from drifting. We also use near-to-far labeling to help correct the detector’s classification results.

Importance Sampling. Importance sampling is a well-known technique for reducing variance when estimating properties of a particular distribution while only having samples generated from another distribution [22, 28]. The work of [38] studied the problem of improving traditional stochastic optimization with importance sampling, improving the convergence rate of prox-SMD [9, 10] and prox-SDCA [29] by reducing the stochastic variance using importance sampling. The work of [2] improves over [38] by incorporating stochastic gradient descent and deep neural networks. There is also work on importance sampling for minibatch SGD [7], which proposes using importance sampling to select data in minibatch SGD and shows that this can improve the convergence rate of SGD. The idea of hard negative example mining is also highly related to our work: [30] presented an approach for efficient object detection training by training on bounding boxes that are optimally sampled according to their gradient.

For training the vision system of self-driving vehicles, we typically do not know the ground truth distribution of the data, i.e. the images or video captured by cameras. Thus, importance sampling is very useful for estimating the properties of the data from a data-driven sampling scheme. The works of [36] and [12] proposed importance sampling for visual tracking, but their focus was not on reducing the amount of training data or on creating labeled data using visual tracking. In our work, we use importance sampling to obtain an optimal set of data so that training efficiency is high, as we train on the most informative data. The information that each image carries is characterized by its detection loss, which is reasonably suitable in our case, as images with high loss are usually images that are difficult for the current detector.
3 Methods
Our approach for creating labeled data and performing data reduction by importance sampling can be divided into two parts. First, starting from the sparsely labeled image frames, we initialize an object tracker based on the Kalman filter [37] and use the tracker to predict the bounding boxes of objects in the previous frame (previous, since we predict back in time). We then send this prediction to the object detection module as a region proposal. Given the region proposal, the object detection module, trained on the sparsely labeled data, performs bounding box regression to obtain the final bounding box and the detection loss. The object tracker then matches new detections to existing trackers, or creates new trackers for detections that cannot be matched to any existing tracker. The near-to-far labeling module double-checks the detection results within each tracker, using the more accurate classification results of objects in the near field to check the results for objects in the far field. The bounding boxes produced by the object detection module are used as labels for the unlabeled video frames, and the detection loss is used as the sampling weight for importance sampling. Second, based on the detection loss recorded in the first part, the importance sampler samples an optimal subset of the labeled images, and these selected images are used as training data to train a new object detector.

The system architecture is shown in figure 1. Here, we first describe the framework of semi-supervised data labeling, followed by data reduction using the importance sampling framework.
3.1 Semi-supervised Data Labeling
Object Tracking. Starting with a few sparsely annotated video frames, we first train an object detection network using Faster R-CNN [26]. Using a Kalman filter [37], we initialize object trackers with the ground truth labeled frames. The object tracking framework we use is similar to that of [1], where the state of the tracker includes 7 parameters: the center position of the bounding box, the scale (the area of the bounding box), the aspect ratio (the ratio of the width to the height of the bounding box), and the rates of change of the center position and of the scale. We follow the assumption in [1] that the aspect ratio of the bounding box does not change over time. The measurement consists of the first four parameters of the state vector.
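The state layout described above can be sketched as follows. This is a minimal illustration rather than the tracker used in the paper: the class and function names (`BoxState`, `box_to_state`, `predict`, `state_to_box`) are hypothetical, the motion model is assumed to be constant-velocity, and only the state mean is propagated, so the covariance and measurement-update steps of a full Kalman filter are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class BoxState:
    """Hypothetical 7-parameter tracker state (names are illustrative)."""
    u: float         # box centre, horizontal
    v: float         # box centre, vertical
    s: float         # scale = box area
    r: float         # aspect ratio = width / height (assumed constant)
    du: float = 0.0  # rate of change of u per frame
    dv: float = 0.0  # rate of change of v per frame
    ds: float = 0.0  # rate of change of s per frame


def box_to_state(x1, y1, x2, y2):
    """Convert a corner-format box into the (u, v, s, r) measurement."""
    w, h = x2 - x1, y2 - y1
    return BoxState(u=x1 + w / 2.0, v=y1 + h / 2.0, s=w * h, r=w / h)


def predict(state):
    """Advance the state mean by one frame under a constant-velocity model.

    Only the mean is propagated; the covariance update of a full Kalman
    filter is deliberately left out of this sketch.
    """
    return BoxState(
        u=state.u + state.du,
        v=state.v + state.dv,
        s=max(state.s + state.ds, 1e-6),  # keep the area positive
        r=state.r,                        # aspect ratio does not change
        du=state.du, dv=state.dv, ds=state.ds,
    )


def state_to_box(state):
    """Convert a state back to corner format, e.g. to use as a proposal."""
    w = math.sqrt(state.s * state.r)
    h = state.s / w
    return (state.u - w / 2.0, state.v - h / 2.0,
            state.u + w / 2.0, state.v + h / 2.0)


# Reverse playback: the object shrinks and drifts as the camera "backs away".
near = box_to_state(100.0, 50.0, 200.0, 100.0)
near.du, near.ds = -2.0, -500.0   # illustrative velocities, not measured values
far_proposal = state_to_box(predict(near))
```

Parameterizing the box by area and aspect ratio, rather than by width and height, means a single velocity term (`ds`) captures the shrinking of an object as it recedes, while the fixed ratio keeps the predicted box shape stable.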
We always use ground truth labeled bounding boxes to initialize object trackers, and the tracking is done from the near field to the far field. This means the video is played in the opposite direction to the one in which it was collected, so that at the very beginning the camera is close to the labeled objects and at the very end the camera is far away from them. Therefore, it is reasonable to believe that the classification and detection results for objects in the near field are more reliable, while there is more noise in the detection results for objects in the far field.
Prediction as Region Proposal. After the trackers are initialized with ground truth bounding boxes, the Kalman filter is used to predict the bounding boxes of the objects being tracked. These predictions are used as a hint for the object detection network to produce new bounding boxes in the next frame. The network we use for object detection is Faster R-CNN [26], which is composed of a region proposal network (RPN) and the object detection network Fast R-CNN [15]. Usually the RPN is used as the region proposer. However, as we already have prior information about where the object might be, we can use this information directly to help the object detection network avoid uncertainty in region proposal. This part corresponds to the get_new_detection method in algorithm 1.
Matching Tracker with Detections. Given the predictions sent by the object tracker, the object detection network produces a set of candidate bounding boxes in the next frame, and the object tracker tries to match the existing trackers to the new detections using linear assignment. We also use intersection over union (IoU) to filter out detection-tracker pairs whose IoU is not higher than a predefined threshold. After detection-tracker matching, the states of valid trackers are updated, and trackers that remain inactive (not updated) for a certain number of steps are removed from the tracker list. This completes one step of object tracking and labeling. The bounding boxes produced by the object detection network are used as labels for the unlabeled video frames. A more detailed description of one step of tracking and labeling is given in algorithm 1. Matching trackers with detections is further described in algorithm 2.
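The matching step can be illustrated with a small self-contained sketch. It is not algorithm 2 itself: the function names and the threshold value of 0.3 are assumptions made for illustration, and the optimal assignment is found here by exhaustive search over permutations, which is only practical for a handful of boxes. A Hungarian-style linear assignment solver would return the same assignment far more efficiently.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match(predictions, detections, iou_threshold=0.3):
    """Assign detections to tracker predictions, then filter by IoU.

    Returns (matches, unmatched_trackers, unmatched_detections), where
    matches is a list of (tracker_index, detection_index) pairs.
    The threshold default is an assumed value, not one from the paper.
    """
    n_t, n_d = len(predictions), len(detections)
    scores = [[iou(p, d) for d in detections] for p in predictions]

    # Exhaustive search for the one-to-one assignment with the largest
    # total IoU (stand-in for a linear assignment solver).
    best_total, best_pairs = -1.0, []
    if n_t <= n_d:
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        candidates = ([(t, d) for d, t in enumerate(perm)]
                      for perm in permutations(range(n_t), n_d))
    for pairs in candidates:
        total = sum(scores[t][d] for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs

    # Discard assigned pairs whose overlap is below the threshold.
    matches = sorted((t, d) for t, d in best_pairs
                     if scores[t][d] >= iou_threshold)
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_t = [t for t in range(n_t) if t not in matched_t]
    unmatched_d = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_t, unmatched_d
```

In this sketch, unmatched trackers would accumulate inactive steps until they are removed, while unmatched detections are the candidates left without an existing tracker.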
The tracker is a class containing the state of the object currently being tracked and methods for updating the object’s state given the ground truth state of the object. A detailed implementation of the tracker class can be found in [4].
Near-to-Far Labeling. Another key ingredient of our approach is the near-to-far labeling scheme. Consider an object that moves from the far field to the near field. When the object is far away from our current location, it can be very small or blurred in the image, which makes it very difficult to classify correctly. As the object approaches the vehicle, the detection network classifies it with higher confidence. Because we trust object detection results in the near field, if the detection results for the same tracked object in the far field differ from those in the near field, we use the near-field results to correct them. To do this, we restrict the initialization of object trackers to ground truth bounding boxes, so as to avoid the additional noise introduced by an imperfect object detection network. Examples of near-to-far labeling are shown in figure 4.
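A minimal sketch of this correction rule is given below. It assumes that each tracker stores its per-frame class labels in reverse playback order, so that the first entry is the near-field label from the ground-truth-initialized frame; the function name `near_to_far_relabel` and this data layout are illustrative assumptions, not the paper's implementation.

```python
def near_to_far_relabel(track_labels):
    """Overwrite far-field class labels with the trusted near-field label.

    track_labels: class labels of one tracked object, ordered from the
    near field (index 0, initialized from ground truth) to the far field.
    Returns the corrected label list and the indices that were changed.
    """
    if not track_labels:
        return [], []
    trusted = track_labels[0]
    changed = [i for i, label in enumerate(track_labels) if label != trusted]
    return [trusted] * len(track_labels), changed


# Example: a cyclist that the detector mistakes for a pedestrian when far away.
labels, changed = near_to_far_relabel(
    ["cyclist", "cyclist", "pedestrian", "cyclist", "pedestrian"])
```

Recording which indices were changed makes it possible to inspect how often far-field classifications diverge within a track.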
3.2 Sampling an Optimal Subset of Images
Inspired by the idea of importance sampling [2], we can select an optimal subset of the data by sampling according to the importance sampling probability distribution, so that the variance of the sampled data is minimized for a given expected sample size. Here, the sampling distribution is proportional to the object detection loss of each image. Images with higher loss receive more importance, as they provide more useful information for accurate object detection.
In our case, we are interested in estimating the expectation of $f(x)$ under a distribution $p(x)$, where $f(x)$ is the detection loss of each image, $p(x)$ denotes the image distribution, and $x$ denotes a particular image with an object detection loss. The problem is expressed by the following equation,

$$\mathbb{E}_{p}[f(x)] = \int p(x)\,f(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \qquad (1)$$

where $x_i \sim p(x)$. However, we usually do not know the ground truth distribution of the data $p(x)$, so we rely on a sampling proposal $q(x)$ to obtain an unbiased estimate of this expectation, with the requirement that $q(x) > 0$ whenever $f(x)\,p(x) \neq 0$. This is commonly known as importance sampling:

$$\mathbb{E}_{p}[f(x)] = \int q(x)\,\frac{p(x)}{q(x)}\,f(x)\,dx = \mathbb{E}_{q}\!\left[\frac{p(x)}{q(x)}\,f(x)\right] \qquad (2)$$

It has been proved in [2] that the variance of this estimate is minimized when we have,

$$q^{*}(x) \propto p(x)\,|f(x)| \qquad (3)$$

Defining $\tilde{\omega}_i = |f(x_i)|$ as the unnormalized optimal probability weight of image $x_i$, it is clear that images with a larger detection loss should have a larger weight. Although we do not know $p(x)$, we have access to a dataset $\{x_i\}_{i=1}^{N}$ sampled from $p(x)$. Therefore, we can obtain $q^{*}$ by associating the unnormalized probability weight $\tilde{\omega}_i$ with every $x_i$, and to sample from $q^{*}$ we just need to normalize these weights:

$$\omega_i = \frac{\tilde{\omega}_i}{\sum_{j=1}^{N}\tilde{\omega}_j} = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)} \qquad (4)$$

where $f(x_i)$ is the loss of input $x_i$. To reduce the total number of data instances used for estimating $\mathbb{E}_{p}[f(x)]$, we draw $M$ samples from the whole $N$ data instances ($M \le N$) based on a multinomial distribution, where $(\omega_1, \dots, \omega_N)$ are the parameters of this multinomial distribution. Based on the discussion above, we obtain an estimate of $\mathbb{E}_{p}[f(x)]$ which has the least variance among all cases where we draw $M$ samples from the entire data set. We further provide a proof in the appendix.
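The sampling procedure described above, normalizing per-image losses into multinomial parameters and then drawing a subset, can be sketched as follows. The function names and the fixed seed are illustrative assumptions. Note that multinomial draws are made with replacement, so the same image index can appear more than once in the result.

```python
import random


def normalized_weights(losses):
    """Turn non-negative per-image losses into probabilities that sum to 1."""
    if any(loss < 0 for loss in losses):
        raise ValueError("losses must be non-negative")
    total = sum(losses)
    if total <= 0:
        raise ValueError("at least one loss must be positive")
    return [loss / total for loss in losses]


def sample_subset(losses, m, seed=0):
    """Draw m image indices with probability proportional to their loss.

    Sampling is multinomial, i.e. with replacement, so duplicates are possible.
    """
    rng = random.Random(seed)
    weights = normalized_weights(losses)
    return rng.choices(range(len(losses)), weights=weights, k=m)


# Example: the image with loss 3.0 should be drawn about three times as
# often as the image with loss 1.0; the zero-loss image is never drawn.
example_losses = [0.0, 1.0, 3.0]
draws = sample_subset(example_losses, m=10000)
```

If a subset without duplicates is needed, one option is to deduplicate the drawn indices afterwards, at the cost of obtaining fewer than `m` distinct images.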
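Table 1: Average precision (AP) for each object category and degree of difficulty, and mAP, for models trained on different data.

| Training data | Pedestrian Easy | Pedestrian Medium | Pedestrian Hard | Car Easy | Car Medium | Car Hard | Cyclist Easy | Cyclist Medium | Cyclist Hard | mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| Ground Truth (GT) | 80.6 | 68.8 | 61.0 | 94.1 | 78.8 | 69.3 | 88.1 | 78.8 | 73.6 | 77.0 |
| New Labeled (NL) & GT | 69.2 | 58.4 | 50.8 | 83.4 | 63.2 | 53.1 | 68.3 | 56.6 | 52.9 | 61.8 |
| Sampled NL & GT | 71.3 | 62.7 | 54.1 | 75.8 | 61.5 | 51.6 | 77.5 | 66.0 | 61.3 | 64.6 |
| Only NL | 69.8 | 60.8 | 52.1 | 80.4 | 60.9 | 50.3 | 70.4 | 57.8 | 54.0 | 61.8 |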
3.3 Measuring Variance Reduction Efficiency
Once we have the sampling distribution, we perform importance sampling. Images with a higher detection loss are more likely to be sampled. We further measure how efficiently we estimate the detection loss distribution, since the goal of using importance sampling here is to reduce the variance when estimating properties of the data from a subset of the data.
To show that the loss estimated from the sampled images has a variance close to that of the loss estimated from all images, we compute a relative variance value. This value is the ratio of the detection loss variance of the whole data set to the detection loss variance of the sampled images.
Suppose the data set is $\{x_i\}_{i=1}^{N}$, and we can obtain the detection loss $f(x_i)$ for each individual input $x_i$. In order to calculate the relative variance more easily, we first normalize $f(x_i)$. Then, we define the sampling probability of image $x_i$ when we expect to sample M out of N images ($M \le N$) as,
(5) 
Taking the minimum with 1 ensures that the probability of sampling an image cannot be larger than 1, which happens when the scaled sampling weight saturates. Note that when the sampling probability is 1, we always sample this image. By changing the scaling of the sampling weight, we can draw different numbers of images from the entire image data. Typically, we choose a scaling such that the variance of the sampled gradient norm is close to that of the whole data. Since the data are in a discrete space, the relative variance is defined as,
(6) 
4 Experiments
Our framework has several major contributions. First of all, we proposed to use object tracker’s prediction as the region proposal input for the object detection network to detect objects. Secondly, we proposed to use neartofar labeling to help correct labels that may not be correct. Thirdly, we use importance sampling to select an optimal subset of images to remove images with less reliable labels and obtain a smaller but more informative set of data. We designed several comparative experiments to show the impact of our contribution.
4.1 Comparative Experiments
Datasets. To show that our algorithm is able to scale to a relatively large video dataset, we choose the KITTI benchmark dataset [13], which contains hundreds of autonomous driving video clips, each lasting about 10 to 30 seconds. The data set is fairly rich, as it contains high-resolution color and grayscale video frames captured in many kinds of driving environments: city, residential, road, campus, person, etc. The KITTI dataset also contains a set of sparsely labeled image frames for object detection purposes. The number of images with ground truth bounding box labels used in our experiments is 7481, while the total number of images is around 40000. The labeled object categories include cars, pedestrians, vans, trams, cyclists, trucks, persons sitting, and so on. For simplicity, we choose 3 of these categories to detect: cars, pedestrians, and cyclists. We manually and randomly divide the dataset into training, validation, and test sets. The training set contains 4206 images, the validation set contains 1404 images, and the test set contains 1871 images.
Experiment 0. We first train a basic object detection network on the ground truth labeled data using the Faster R-CNN [26] object detection network. We use a pretrained Faster R-CNN model with a VGG16 network [32] trained on the PASCAL VOC 2007 dataset [11], and then fine-tune it on the KITTI dataset. We train for 300k iterations with an initial learning rate of 0.01, decayed every 30k iterations.
Experiment 1. The first experiment is our labeling-by-tracking approach using semi-supervised learning. In this experiment, we use the ground truth labeled bounding boxes to initialize object trackers. Since images in the KITTI dataset are sparsely labeled, with unlabeled images between labeled images in the original video sequence, we use the labeled data as guidance to label images without ground truth labels. Note that, in this case, the object detection network does not use the RPN to generate region proposals. Instead, it takes the object tracker’s prediction of the bounding box in the next frame as the region proposal and then performs bounding box regression to generate the final bounding box for the object being tracked. Only ground truth labeled images are used to initialize object trackers, based on our assumption that objects in the near field provide more accurate information; we predict bounding boxes only from reliable information instead of relying on arbitrary detections. We train the object detector on the ground truth data from KITTI combined with the newly labeled data. The training setting is the same as in experiment 0.
Experiment 2. In this experiment, we take the approach of experiment 1 and combine it with importance sampling. Images labeled using the approach in experiment 1 may still contain redundant information, such as images that are already easy for the network to process, so we use importance sampling to select a more informative subset of images. We sample 60% of the data from experiment 1 (which consists of both ground truth and newly labeled data) using the importance sampling method described in section 3.2. As shown in figure 5, 60% of the data corresponds to a sampling efficiency of around 0.90, which is reasonably high. The training setting is also the same as in experiment 0.
Experiment 3. We further remove the ground truth data that comes from KITTI and use only the newly labeled data to train an object detector, with the same training setting as in experiment 0.
Evaluation of Accuracy. We train Faster R-CNN object detection networks using the data described in experiments 1, 2, and 3 respectively, all with the same training configuration as in experiment 0. We evaluate the models from experiments 0, 1, 2, and 3 on a held-out test set of 1871 images. The average precision is evaluated on the 3 object categories mentioned above.
4.2 Results and Analysis
Loss as an Approximation for Gradient. First, we show our finding that the gradients of the network we use are linearly correlated across layers, as shown in figure 3. Therefore, we can use the last-layer gradient (as it is easier to obtain) as an approximation of the total gradient. We also show in figure 6 that the loss can be used as an approximation of the total gradient norm. Therefore, we can use the loss to approximate the gradient and use it as the sampling weight for different object bounding box labels.
Qualitative Results for Bounding Box Generation. As mentioned in the experiment descriptions, we use two different strategies to generate bounding boxes with Faster R-CNN. The first strategy uses the region proposal network to generate bounding boxes, and the second strategy uses the object tracker’s predictions as region proposals. We show some qualitative results of bounding boxes generated by the two methods in figure 2.
Quantitative Results for Object Detection. The models trained in experiments 0, 1, 2, and 3 are evaluated on a test set of 1871 images. The average precision (AP) on the 3 object categories and the mAPs are reported in table 1, broken down by category and degree of difficulty. The model trained on ground truth data alone performs best (77.0 mAP), which is not surprising, since labels generated by tracking may introduce noise that harms the performance of the detector. However, filtering the combined data by importance sampling yields better detection accuracy than the unfiltered combined data (64.6 vs. 61.8 mAP) under the same training setting, which suggests that importance sampling reduces data volume and makes it easier to train a model to convergence.
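As a quick arithmetic check on table 1, the reported mAP values appear consistent with an unweighted mean over the nine per-category, per-difficulty AP entries in each row. This is an observation drawn from the numbers, not a definition stated in the text; the sketch below simply recomputes the mean for each row.

```python
# AP entries per row of table 1, in the order:
# Pedestrian (Easy, Medium, Hard), Car (Easy, Medium, Hard),
# Cyclist (Easy, Medium, Hard); followed by the reported mAP.
TABLE_1 = {
    "Ground Truth (GT)":
        ([80.6, 68.8, 61.0, 94.1, 78.8, 69.3, 88.1, 78.8, 73.6], 77.0),
    "New Labeled (NL) & GT":
        ([69.2, 58.4, 50.8, 83.4, 63.2, 53.1, 68.3, 56.6, 52.9], 61.8),
    "Sampled NL & GT":
        ([71.3, 62.7, 54.1, 75.8, 61.5, 51.6, 77.5, 66.0, 61.3], 64.6),
    "Only NL":
        ([69.8, 60.8, 52.1, 80.4, 60.9, 50.3, 70.4, 57.8, 54.0], 61.8),
}


def mean_ap(aps):
    """Unweighted mean of a list of AP values."""
    return sum(aps) / len(aps)


# Recomputed mAP per row, rounded to one decimal like the table.
recomputed = {name: round(mean_ap(aps), 1) for name, (aps, _) in TABLE_1.items()}
```

Under this reading, the gain of the sampled set over the unfiltered combined set comes mainly from the Pedestrian and Cyclist columns, while the Car columns are lower for the sampled set.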
Relative Variance Results. We use the relative variance described in section 3.3 to measure how well we estimate the image detection loss distribution. The result is shown in figure 5. From the plot, we can see that by scaling the importance sampling weight as described in Eq. (5), we are able to keep a high sampling efficiency (0.90) while sampling 60% of the original labeled data. This curve is useful for determining how much data to sample for a desired sampling efficiency.
5 Conclusion
We proposed a framework for automatically generating object bounding box labels for large-volume driving video datasets with sparse labels. Our work generates object bounding boxes for the newly labeled data by employing a near-to-far labeling strategy, a combination of the object tracker’s prediction with the object detection network, and an importance sampling scheme. Our experiments show that with our semi-supervised learning framework, we are able to annotate driving video datasets with bounding box labels and, using importance sampling, improve the accuracy of object detection with the newly labeled data.
References
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, Sept 2016.
[2] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
[3] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence, 33(9):1806–1819, 2011.
[4] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. CoRR, abs/1602.00763, 2016.
[5] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1515–1522. IEEE, 2009.
[6] J. Choi, M. Rastegari, A. Farhadi, and L. S. Davis. Adding unlabeled samples to categories by learned attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 875–882, 2013.
[7] D. Csiba and P. Richtárik. Importance sampling for minibatches. arXiv preprint arXiv:1602.02283, 2016.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[9] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
[10] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[12] R. Farah, Q. Gan, J. P. Langlois, G.-A. Bilodeau, and Y. Savaria. A computationally efficient importance sampling tracking algorithm. Machine Vision and Applications, 25(7):1761–1777, 2014.
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[15] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[16] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In European conference on computer vision, pages 234–247. Springer, 2008.
[17] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence, 34(7):1409–1422, 2012.
[18] S. Lad and D. Parikh. Interactively guiding semi-supervised clustering via attribute-based explanations. In European Conference on Computer Vision, pages 333–349. Springer, 2014.
[19] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE transactions on pattern analysis and machine intelligence, 32(12):2178–2190, 2010.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
[21] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. CoRR, abs/1505.05769, 2015.
[22] A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 449:135–143, 2000.
[23] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1201–1208. IEEE, 2011.
[24] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3282–3289. IEEE, 2012.
[25] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[26] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[27] E. Romera, L. M. Bergasa, and R. Arroyo. Can we unify monocular detectors for autonomous driving by using the pixel-wise semantic segmentation of cnns? CoRR, abs/1607.00971, 2016.
[28] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo method, volume 707. John Wiley & Sons, 2011.
[29] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.
[30] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[31] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In European Conference on Computer Vision, pages 369–383. Springer, 2012.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[33] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2483–2490, 2013.
[34] Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen. Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2119–2128, 2016.
[35] A. Teichman and S. Thrun. Practical object recognition in autonomous driving and beyond. Advanced Robotics and its Social Impacts (ARSO), 2011 IEEE Workshop on, pages 35–38, 10 2011.
[36] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 809–817. 2013.
[37] G. Welch and G. Bishop. An introduction to the kalman filter. 1995.
[38] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. 2014.
Appendix A Introduction
In this supplementary material, we provide a proof of the optimality of the importance sampling framework. We also provide a detailed explanation of how the relative variance is measured and what it means.
Appendix B Importance Sampling Framework Proof
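The importance sampling algorithm is used for data reduction, i.e. for selecting an optimal subset of data from the original labeled dataset with minimal variance. In the paper, we stated that by using the proposal distribution $q^{*}(x) \propto p(x)\,|f(x)|$ we obtain an estimate of the expectation of $f(x)$ with the least variance. We now provide the proof.
In importance sampling, the expectation of $f(x)$ under $p(x)$ is estimated by using samples from $q(x)$. We require that $q(x) > 0$ whenever $f(x)\,p(x) \neq 0$. It is then easy to verify that this estimate is unbiased. Suppose that $p$ is defined on $D$ while $q$ is defined on $Q$. We have $D = \{x : p(x) > 0\}$ and $Q = \{x : q(x) > 0\}$, so that $f(x) = 0$ for $x \in D \cap Q^{c}$, and $p(x) = 0$ for $x \in Q \cap D^{c}$. That is to say, for $x \in D \cap Q^{c}$ and $x \in Q \cap D^{c}$, we have $f(x)\,p(x) = 0$. So the expectation of $f(x)$ can be written as,

$$\mathbb{E}_{q}\!\left[\frac{f(x)\,p(x)}{q(x)}\right] = \int_{Q} \frac{f(x)\,p(x)}{q(x)}\,q(x)\,dx = \int_{Q} f(x)\,p(x)\,dx = \int_{D} f(x)\,p(x)\,dx = \mathbb{E}_{p}[f(x)] \qquad (7)$$

Then we prove that when the sampling distribution is $q^{*}(x) \propto p(x)\,|f(x)|$, we obtain the minimal variance in the estimate of the expectation. Let $\mu = \mathbb{E}_{p}[f(x)]$, and let,

$$\hat{\mu}_{q} = \frac{1}{n}\sum_{i=1}^{n} \frac{f(x_i)\,p(x_i)}{q(x_i)} \qquad (8)$$

given that the samples $x_1, \dots, x_n$ are sampled from $q$. Then the variance of $\hat{\mu}_{q}$ is,

$$\mathrm{Var}_{q}(\hat{\mu}_{q}) = \frac{\sigma_{q}^{2}}{n}, \qquad \sigma_{q}^{2} = \int_{Q} \frac{\left(f(x)\,p(x)\right)^{2}}{q(x)}\,dx - \mu^{2} \qquad (9)$$

Choose $q^{*}(x) = |f(x)|\,p(x) / \mathbb{E}_{p}[|f(x)|]$, and let $q$ be any density function that is positive whenever $f(x)\,p(x) \neq 0$. We have,

$$\mu^{2} + \sigma_{q^{*}}^{2} = \int \frac{f(x)^{2}\,p(x)^{2}}{q^{*}(x)}\,dx = \left(\mathbb{E}_{p}[|f(x)|]\right)^{2} = \left(\mathbb{E}_{q}\!\left[\frac{|f(x)|\,p(x)}{q(x)}\right]\right)^{2} \le \mathbb{E}_{q}\!\left[\frac{f(x)^{2}\,p(x)^{2}}{q(x)^{2}}\right] = \mu^{2} + \sigma_{q}^{2} \qquad (10)$$

The inequality is the Cauchy-Schwarz inequality. Therefore, by choosing the sampling distribution $q^{*}$ and sampling data according to the normalized weights, we obtain the minimal-variance estimate. In the case where $p(x)$ is not known directly, but we have a dataset sampled from $p(x)$, we can use $|f(x_i)|$ as the sampling weight.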
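The optimality argument can be checked numerically on a small finite population. The sketch below is an illustration under assumptions, not part of the proof: the loss values are made up, the data distribution is taken to be uniform, and the function names are hypothetical. It computes the exact per-sample variance of the importance-weighted estimator for a given proposal and compares the loss-proportional proposal against two alternatives.

```python
def estimator_variance(f, p, q):
    """Exact per-sample variance of f(x) p(x) / q(x) when x is drawn from q.

    f, p, q are lists over a finite population; q must be positive wherever
    f(x) p(x) is non-zero, otherwise the estimator is not unbiased.
    """
    mu = sum(fi * pi for fi, pi in zip(f, p))
    second_moment = 0.0
    for fi, pi, qi in zip(f, p, q):
        if fi * pi != 0.0:
            if qi <= 0.0:
                raise ValueError("q must be positive where f * p is non-zero")
            second_moment += (fi * pi) ** 2 / qi
    return second_moment - mu ** 2


def loss_proportional_proposal(f, p):
    """Proposal proportional to |f(x)| p(x), normalized to sum to 1."""
    z = sum(abs(fi) * pi for fi, pi in zip(f, p))
    return [abs(fi) * pi / z for fi, pi in zip(f, p)]


# Made-up losses for four images under a uniform data distribution.
losses = [0.1, 0.2, 0.5, 3.2]
uniform = [0.25, 0.25, 0.25, 0.25]

var_uniform = estimator_variance(losses, uniform, uniform)
var_skewed = estimator_variance(losses, uniform, [0.4, 0.3, 0.2, 0.1])
var_optimal = estimator_variance(
    losses, uniform, loss_proportional_proposal(losses, uniform))
```

For non-negative losses, the loss-proportional proposal makes every weighted term equal to the mean, so its variance is zero up to floating-point error, while the uniform and skewed proposals both have strictly positive variance.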
Appendix C Measuring the Efficiency of Sampling
We define the efficiency as the ratio between the original data variance and the sampled data variance. To make things simpler, suppose we want to estimate the expectation of , we first normalize and obtain , where and
are the mean and standard deviation of
. Now we use importance sampling to estimate the expectation of under by using proposal distribution . We sample images out of a total images, the probability of a particular image being sampled is,(11) 
As mentioned in the paper, we take the minimum compared with 1 to ensure that the probability is always no more than 1. Obviously, since describes the probability of a particular image being selected. We further define which has an upper bound of . Therefore, it is easy to see that . To get images, we select images according to their sampling probabilities . The expectation of based on the sampled images is,
(12) 
where . On the other hand, if we sample the entire dataset and get images, then and , the expectation will be,
(13) 
which is just . It is no harm to assume
is a uniform distribution since we consider it to be unknown. In the case where we sample the entire dataset,
, , and , then the variance of by sampling the entire dataset is,(14) 
In the case where we sample images out of images, , , and , then the variance of by sampling images out of images is,
(15) 
The efficiency is defined as the ratio between Eq. (14) and Eq. (15), which is,
(16) 
which is the same as the efficiency (relative variance) defined in the main paper. Obviously, since , this ratio will always be no larger than 1. If we sample all the data, which means , then we can obtain a sampling efficiency of 1. To simplify the calculation of , We can further express as,
(17) 
where , , , are smaller than 1 and , , are equal to 1.