Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving

04/09/2019 · Jiwoong Choi, et al. · Seoul National University of Science and Technology · Seoul National University

The use of object detection algorithms is becoming increasingly important in autonomous vehicles, and object detection at high accuracy and a fast inference speed is essential for safe autonomous driving. A false positive (FP) from a false localization during autonomous driving can lead to fatal accidents and hinder safe and efficient driving. Therefore, a detection algorithm that can cope with mislocalizations is required in autonomous driving applications. This paper proposes a method for improving the detection accuracy while supporting a real-time operation by modeling the bounding box (bbox) of YOLOv3, which is the most representative of one-stage detectors, with Gaussian parameters and redesigning the loss function. In addition, this paper proposes a method for predicting the localization uncertainty, which indicates the reliability of the bbox. By using the predicted localization uncertainty during the detection process, the proposed schemes can significantly reduce the FP and increase the true positive (TP), thereby improving the accuracy. Compared to a conventional YOLOv3, the proposed algorithm, Gaussian YOLOv3, improves the mean average precision (mAP) by 3.09 and 3.5 on the KITTI and Berkeley deep drive (BDD) datasets, respectively. In addition, on the same datasets, the proposed algorithm reduces the FP by 41.40% and 40.62% and increases the TP by 7.26% and 4.30%, respectively, while supporting real-time detection at faster than 42 frames per second (fps).


1 Introduction

In recent years, deep learning has been actively applied in various fields including computer vision [9], autonomous driving [5], and social network services [14]. The development of sensors and GPUs along with deep learning algorithms has accelerated research into autonomous vehicles based on artificial intelligence. An autonomous vehicle with self-driving capability without driver intervention must accurately detect cars, pedestrians, traffic signs, traffic lights, etc. in real time to ensure safe and correct control decisions [23]. To detect such objects, various sensors such as cameras, light detection and ranging (Lidar), and radio detection and ranging (Radar) are generally used in autonomous vehicles [25]. Among these various types of sensors, a camera sensor can accurately identify the object type based on texture and color features and is more cost-effective than other sensors [22]. In particular, deep-learning based object detection using camera sensors is becoming more important in autonomous vehicles because it achieves a better level of accuracy than humans in terms of object detection, and consequently it has become an essential method in autonomous driving systems [11].

An object detection algorithm for autonomous vehicles should satisfy the following two conditions. First, a high detection accuracy for road objects is required. Second, a real-time detection speed is essential for a rapid response of the vehicle controller and a reduced latency. Deep-learning based object detection algorithms, which are indispensable in autonomous vehicles, can be classified into two categories: two-stage and one-stage detectors. Two-stage detectors, e.g., Fast R-CNN [8], Faster R-CNN [20], and R-FCN [4], conduct a first stage of region proposal generation, followed by a second stage of object classification and bbox regression. These methods generally show a high accuracy but have the disadvantage of a slow detection speed and lower efficiency. One-stage detectors, e.g., SSD [15] and YOLO [17], conduct object classification and bbox regression concurrently without a region proposal stage. These methods generally have a fast detection speed and high efficiency but a low accuracy. In recent years, to take advantage of both types of methods and to compensate for their respective disadvantages, object detectors combining various schemes have been widely studied [1, 11, 27, 26]. MS-CNN [1], a two-stage detector, improves the detection speed by conducting detection on various intermediate network layers. SINet [11], also a two-stage detector, enables fast detection using a scale-insensitive network. CFENet [27], a one-stage detector, uses a comprehensive feature enhancement module based on SSD to improve the detection accuracy. RefineDet [26], also a one-stage detector, improves the detection accuracy by applying an anchor refinement module and an object detection module. However, using an input resolution of 512×512 or higher, which is widely applied in object detection algorithms for achieving a high detection accuracy, previous studies [1, 11, 27, 26] have been unable to meet a real-time detection speed of above 30 fps, which is a prerequisite for self-driving applications. This indicates that these previous schemes are incomplete in terms of the trade-off between accuracy and detection speed, and consequently have limitations in their application to autonomous driving systems.

In addition, one of the most critical problems of most conventional deep-learning based object detection algorithms is that, whereas the bbox coordinates (i.e., localization) of the detected object are known, the uncertainty of the bbox result is not. Thus, conventional object detectors cannot prevent mislocalizations (i.e., FPs) because they output deterministic bbox results without information regarding the uncertainty. In autonomous driving, an FP denotes an incorrect bbox detection result on an object that is not the ground truth (GT), or an inaccurate bbox detection result on the GT, whereas a TP denotes an accurate bbox detection result on the GT. An FP is extremely dangerous in autonomous driving because it causes excessive reactions such as unexpected braking, which can reduce the stability and efficiency of driving and lead to a fatal accident [16, 21], as well as confusion in the determination of an accurate object detection. In other words, it is extremely important to predict the uncertainty of the detected bboxes and to consider this factor along with the objectness score and class scores for reducing the FP and preventing autonomous driving accidents. For this reason, various studies have been conducted on predicting uncertainty in deep learning. Kendall et al. [12] proposed a modeling method for uncertainty prediction using a Bayesian neural network in deep learning. Feng et al. [6] proposed a method for predicting uncertainty by applying Kendall et al.'s scheme [12] to 3D vehicle detection using a Lidar sensor. However, the methods proposed by Kendall et al. [12] and Feng et al. [6] only predict the level of uncertainty and do not utilize this factor in actual applications. Choi et al. [2] proposed a method for predicting uncertainty in real time using a Gaussian mixture model and applied it to an autonomous driving application. However, it was applied to the steering angle rather than object detection, and a complicated distribution is modeled, increasing the computational complexity. He et al. [10] proposed an approach for predicting uncertainty and utilized it for object detection. However, because they focused on a two-stage detector, their method cannot support a real-time operation, and a box overlap problem remains, so it is unsuitable for self-driving applications.

Figure 1: (a) Network architecture of YOLOv3 and (b) attributes of its prediction feature map.

To overcome the problems of previous object detection studies, this paper proposes a novel object detection algorithm suitable for autonomous driving based on YOLOv3 [19]. YOLOv3 can detect multiple objects with a single inference, and its detection speed is therefore extremely fast; in addition, by applying a multi-scale detection method, it can complement the low accuracy of YOLO [17] and YOLOv2 [18]. Based on these advantages, YOLOv3 is suitable for autonomous driving applications, but it generally achieves a lower accuracy than a two-stage method. It is therefore essential to improve the accuracy while maintaining a real-time object detection capability. To achieve this goal, the present paper proposes a method for improving the detection accuracy by modeling the bbox coordinates of YOLOv3, which only outputs deterministic values, as Gaussian parameters (i.e., the mean and variance) and redesigning the loss function of the bbox. Through this Gaussian modeling, the localization uncertainty of the bbox regression task in YOLOv3 can be estimated. Furthermore, to further improve the detection accuracy, a method for reducing the FP and increasing the TP by utilizing the predicted localization uncertainty of the bbox during the detection process is proposed. This study is therefore the first attempt to model the localization uncertainty in YOLOv3 and to utilize this factor in a practical manner. As a result, the proposed Gaussian YOLOv3 can cope with mislocalizations in autonomous driving applications. In addition, because the proposed method is modeled only on the bbox of the YOLOv3 detection layer (i.e., the output layer), the additional computation cost is negligible, and the proposed algorithm consequently maintains a real-time detection speed of over 42 fps with an input resolution of 512×512 despite the significant improvements in performance. Compared to the baseline algorithm (i.e., YOLOv3), the proposed Gaussian YOLOv3 improves the mAP by 3.09 and 3.5 on the KITTI [7] and BDD [24] datasets, respectively. In addition, the proposed algorithm reduces the FP by 41.40% and 40.62%, respectively, and increases the TP by 7.26% and 4.30%, respectively, on the KITTI and BDD datasets. As a result, the proposed algorithm is suitable for autonomous driving because it significantly improves the detection accuracy and addresses the mislocalization problem while supporting a real-time operation.

2 Background

Instead of the region proposal method used in two-stage detectors, YOLO [17] detects objects by dividing an image into grid units. The feature map of the YOLO output layer is designed to output the bbox coordinates, the objectness score, and the class scores, and thus YOLO enables the detection of multiple objects with a single inference. Therefore, the detection speed is much faster than that of conventional methods. However, owing to the processing in grid units, the localization errors are large and the detection accuracy is low, and thus YOLO is unsuitable for autonomous driving applications. To address these problems, YOLOv2 [18] has been proposed. YOLOv2 improves the detection accuracy compared to YOLO by using batch normalization for the convolution layers and applying an anchor box, multi-scale training, and fine-grained features. However, the detection accuracy is still low for small or dense objects. Therefore, YOLOv2 is unsuitable for autonomous driving applications, where a high accuracy is required for dense road objects and small objects such as traffic signs and lights.

To overcome the disadvantages of YOLOv2, YOLOv3 [19] has been proposed. YOLOv3 consists of convolution layers, as shown in Figure 1(a), and is constructed as a deep network for improved accuracy. YOLOv3 applies residual skip connections to solve the vanishing gradient problem of deep networks and uses an up-sampling and concatenation method that preserves fine-grained features for small object detection. Its most prominent feature is detection at three different scales, in a similar manner as a feature pyramid network [13]. This allows YOLOv3 to detect objects of various sizes. In more detail, when an image with the three channels R, G, and B is input into the YOLOv3 network, as shown in Figure 1(a), the object detection information (i.e., bbox coordinates, objectness score, and class scores) is output from three detection layers. The predicted results of the three detection layers are combined and processed using non-maximum suppression, after which the final detection results are determined. Because YOLOv3 is a fully convolutional network consisting only of small-sized convolution filters of 1×1 and 3×3, like YOLOv2 [18], its detection speed is as fast as that of YOLO [17] and YOLOv2 [18]. Therefore, in terms of the trade-off between accuracy and speed, YOLOv3 is suitable for autonomous driving applications and is widely used in autonomous driving research [3]. However, in general, it still has a lower accuracy than a two-stage detector using a region proposal stage. To compensate for this drawback while taking advantage of the lower complexity of YOLOv3 compared to a two-stage detector, a more efficient detector for autonomous driving applications can be designed by applying additional accuracy-improving methods to YOLOv3 [19]. The Gaussian modeling and loss function reconstruction of YOLOv3 proposed in this paper improve the accuracy by reducing the influence of noisy data during training and allow the localization uncertainty to be predicted. In addition, the detection accuracy can be further enhanced by using this predicted localization uncertainty. A detailed description of these aspects is provided in Section 3.
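
As a concrete illustration of the multi-scale output structure described above, the short sketch below lists the shapes of the three detection-layer outputs for a 512×512 input with ten classes. The strides of 32, 16, and 8 and the three-anchors-per-cell layout are standard YOLOv3 conventions assumed here for illustration, not details stated in this paper.

```python
# Shapes of the three YOLOv3 detection layers for a 512x512 input,
# assuming the standard strides of 32, 16, and 8 and 3 anchors per grid cell.
num_classes = 10
num_anchors = 3
input_size = 512
attrs_per_anchor = 4 + 1 + num_classes  # bbox (tx, ty, tw, th) + objectness + class scores

for stride in (32, 16, 8):
    grid = input_size // stride
    # Each detection layer predicts a (grid, grid, anchors * attrs) feature map;
    # the three maps are combined and post-processed with non-maximum suppression.
    print(f"stride {stride:2d}: {grid} x {grid} x {num_anchors * attrs_per_anchor}")
```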

3 Gaussian YOLOv3

3.1 Gaussian Modeling

As shown in Figure 1(b), the prediction feature map of YOLOv3 [19] has three prediction boxes per grid cell, where each prediction box consists of the bbox coordinates (i.e., t_x, t_y, t_w, and t_h), the objectness score, and the class scores. YOLOv3 outputs the objectness (i.e., whether an object is present or not) and class (i.e., the category of the object) as scores between zero and one. An object is then detected based on the product of these two values. Unlike the objectness and class information, the bbox coordinates are output as deterministic coordinate values instead of scores, and thus the confidence of the detected bbox is unknown. The detector therefore cannot know how uncertain its bbox result is. In contrast, the uncertainty of the bbox, which is predicted by the proposed method, serves as a bbox score and can thus be used as an indicator of how uncertain the bbox is. The results for this are described in Section 4.1.

In YOLOv3, the bbox regression extracts the bbox center information (i.e., t_x and t_y) and the bbox size information (i.e., t_w and t_h). Because there is only one correct answer (i.e., the GT) for the bbox of an object, complex modeling is not required for predicting the localization uncertainty. In other words, the uncertainty of the bbox can be modeled using a single Gaussian model for each of t_x, t_y, t_w, and t_h. A single Gaussian model of an output y for a given test input x, whose output consists of Gaussian parameters, is as follows:

p(y | x) = N(y; μ(x), Σ(x)),   (1)

where μ(x) and Σ(x) are the mean and variance functions, respectively.

To predict the uncertainty of the bbox, each of the bbox coordinates in the prediction feature map is modeled as a mean (μ) and a variance (Σ), as shown in Figure 2. The outputs of the bbox are μ_tx, Σ_tx, μ_ty, Σ_ty, μ_tw, Σ_tw, μ_th, and Σ_th. The Gaussian parameters of each coordinate are preprocessed considering the structure of the detection layer in YOLOv3. The mean value of each coordinate in the detection layer is the predicted coordinate of the detected bbox, and each variance represents the uncertainty of that coordinate.

Single Gaussian modeling for predicting the uncertainty of the bbox applies only to the bbox coordinates of the YOLOv3 detection layers shown in Figure 1(a). Therefore, the overall computational complexity of the algorithm does not increase significantly. With a 512×512 input resolution and ten classes, YOLOv3 requires 99 FLOPs; after single Gaussian modeling of the bbox, 99.04 FLOPs are required. Thus, the penalty on the detection speed is extremely low because the computation cost increases by only 0.04% compared with that before the modeling. The related results are shown in Section 4.
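
To make the change in the detection layer concrete, the sketch below contrasts the per-anchor prediction vector of YOLOv3 with that of Gaussian YOLOv3 and shows one simple way to map raw network outputs to Gaussian parameters. The tensor layout, function name, and the sigmoid squashing of the variance outputs are illustrative assumptions, not the authors' Darknet implementation.

```python
# Illustrative sketch: per-anchor outputs of YOLOv3 vs. Gaussian YOLOv3
# and a simple (assumed) preprocessing of the raw variance outputs.
import numpy as np

num_classes = 10
yolo_attrs = 4 + 1 + num_classes       # tx, ty, tw, th + objectness + class scores
gaussian_attrs = 8 + 1 + num_classes   # (mu, sigma) per coordinate + objectness + class scores
print(f"per-anchor attributes: YOLOv3 = {yolo_attrs}, Gaussian YOLOv3 = {gaussian_attrs}")

def split_gaussian_bbox(raw):
    """raw: array of shape (..., 8) holding assumed interleaved (mu, sigma) pairs
    for tx, ty, tw, th. Returns the means and the variances, with the variances
    squashed into (0, 1) by a sigmoid as one possible preprocessing step."""
    mu = raw[..., 0::2]                             # mu_tx, mu_ty, mu_tw, mu_th
    sigma = 1.0 / (1.0 + np.exp(-raw[..., 1::2]))   # sigma_tx .. sigma_th in (0, 1)
    return mu, sigma
```

Because only the bbox channels of the output layer grow (from 4 to 8 per anchor), the rest of the network is unchanged, which is why the extra computation cost noted above is negligible.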

Figure 2: Components in the prediction box of the proposed algorithm.

3.2 Reconstruction of Loss Function

For training, YOLOv3 [19] uses the sum of the squared error loss for the bbox and the binary cross-entropy loss for the objectness and class. Because the bbox coordinates are output as Gaussian parameters through Gaussian modeling, the loss function of the bbox is redesigned as a negative log likelihood (NLL) loss, whereas the loss functions for the objectness and class are not changed. The redesigned bbox loss is as follows:

L_x = − ∑_{i=1}^{W} ∑_{j=1}^{H} ∑_{k=1}^{K} γ_ijk log( N( x^G_ijk | μ_tx(x_ijk), Σ_tx(x_ijk) ) + ε )   (2)
L_y = − ∑_{i=1}^{W} ∑_{j=1}^{H} ∑_{k=1}^{K} γ_ijk log( N( y^G_ijk | μ_ty(x_ijk), Σ_ty(x_ijk) ) + ε )   (3)
L_w = − ∑_{i=1}^{W} ∑_{j=1}^{H} ∑_{k=1}^{K} γ_ijk log( N( w^G_ijk | μ_tw(x_ijk), Σ_tw(x_ijk) ) + ε )   (4)
L_h = − ∑_{i=1}^{W} ∑_{j=1}^{H} ∑_{k=1}^{K} γ_ijk log( N( h^G_ijk | μ_th(x_ijk), Σ_th(x_ijk) ) + ε )   (5)

where L_x, L_y, L_w, and L_h are the NLL losses of the respective coordinates. In addition, W and H are the numbers of grid cells along the horizontal and vertical directions, respectively, and K is the number of anchors. Moreover, μ_tx(x_ijk), μ_ty(x_ijk), μ_tw(x_ijk), and μ_th(x_ijk) denote the bbox coordinates, which are the outputs of the detection layer of the proposed algorithm, at the k-th anchor in the (i, j) grid cell. In addition, Σ_tx(x_ijk), Σ_ty(x_ijk), Σ_tw(x_ijk), and Σ_th(x_ijk) are also outputs of the detection layer, indicating the uncertainty of each coordinate, and x^G_ijk, y^G_ijk, w^G_ijk, and h^G_ijk are the GT of the bbox. The indicator γ_ijk in the NLL loss is a parameter applied to include a prediction in the loss only if there is an anchor that is most suitable for the current object among the predefined anchors. This parameter is assigned a value of one when the intersection over union (IOU) of the GT and the k-th anchor box in the (i, j) grid cell is the largest, and is assigned a value of zero if there is no appropriate GT. For numerical stability of the logarithmic function, a small constant ε is added inside the logarithm.

Because YOLOv3 uses the sum of the squared error loss for bbox, it is unable to cope with noisy data during training. However, the redesigned loss function of bbox can provide a penalty [12] to the loss through the uncertainty for inconsistent data during training. That is, the model can be learned by concentrating on consistent data. Therefore, the redesigned loss function of bbox makes the model more robust to noisy data [12]. Through this loss attenuation [12], it is possible to improve the accuracy of the algorithm.
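
As an illustration of this loss attenuation, the following is a minimal NumPy sketch of the per-coordinate NLL loss in (2)-(5). The function name, array shapes, and the use of NumPy are assumptions made for clarity here, not the authors' Darknet implementation. Because the Gaussian density divides the squared error by the predicted variance, a coordinate with high predicted uncertainty contributes a smaller (attenuated) error term.

```python
# Minimal NumPy sketch of the per-coordinate NLL loss in Eq. (2).
import numpy as np

def gaussian_nll_coord(mu, sigma, gt, mask, eps=1e-9):
    """NLL of the GT coordinate under N(gt | mu, sigma).

    mu, sigma, gt, mask: arrays of shape (W, H, K) holding the predicted mean,
    predicted variance, ground-truth coordinate, and the 0/1 responsibility
    indicator (gamma_ijk) for each anchor in each grid cell.
    """
    # Gaussian density of the GT coordinate under the predicted distribution.
    density = np.exp(-0.5 * (gt - mu) ** 2 / sigma) / np.sqrt(2.0 * np.pi * sigma)
    # eps keeps the logarithm numerically stable, as noted in the text.
    nll = -np.log(density + eps)
    # Only anchors responsible for a GT box (mask == 1) contribute to the loss.
    return np.sum(mask * nll)
```

In this form, a larger predicted variance down-weights the squared-error term for an inconsistent sample while adding a penalty for the uncertainty itself, which is the attenuation behavior described above.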

3.3 Utilization of Localization Uncertainty

The proposed Gaussian YOLOv3 can obtain the uncertainty of the bbox for every detected object in an image. Because it is not an uncertainty over the entire image, the uncertainty can be applied to each detection result individually. YOLOv3 considers only the objectness score and class scores during object detection and cannot consider a bbox score during the detection process because the score information for the bbox coordinates is unknown. However, Gaussian YOLOv3 can output the localization uncertainty, which serves as the score of the bbox. Therefore, the localization uncertainty can be considered along with the objectness score and class scores during the detection process. The proposed algorithm applies the localization uncertainty to the detection criterion of YOLOv3 such that predicted bboxes with high uncertainty are filtered out during the detection process. In this way, predictions with high confidence in the objectness, class, and bbox are finally selected. Thus, Gaussian YOLOv3 can reduce the FP and increase the TP, which improves the detection accuracy. The proposed detection criterion considering the localization uncertainty is as follows:

Cr = σ(Object) × σ(Class_i) × (1 − Uncertainty_aver)   (6)

Cr in (6) indicates the detection criterion for Gaussian YOLOv3, σ(Object) is the objectness score, and σ(Class_i) is the score of the i-th class. In addition, Uncertainty_aver, which is the localization uncertainty, indicates the average of the uncertainties of the predicted bbox coordinates (i.e., Σ_tx, Σ_ty, Σ_tw, and Σ_th). The localization uncertainty has a value between zero and one, like the objectness score and class scores, and the higher the localization uncertainty, the lower the confidence of the predicted bbox. The results of the proposed algorithm, Gaussian YOLOv3, are described in Section 4.
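
A minimal sketch of this criterion is given below, assuming the objectness score, class score, and per-coordinate uncertainties have already been extracted from the detection layer; the helper name and threshold usage are illustrative, not part of the paper.

```python
# Minimal sketch of the detection criterion in Eq. (6): a detection is kept only
# if objectness x class score x (1 - average localization uncertainty) exceeds
# the detection threshold.
def detection_score(objectness, class_score, coord_uncertainties):
    uncertainty_aver = sum(coord_uncertainties) / len(coord_uncertainties)
    return objectness * class_score * (1.0 - uncertainty_aver)

# Example: a confident class/objectness prediction with a poorly localized box
# is suppressed when the detection threshold is 0.5.
score = detection_score(0.9, 0.8, [0.6, 0.5, 0.7, 0.6])  # 0.9 * 0.8 * (1 - 0.6) = 0.288
keep = score > 0.5  # False: filtered out despite high objectness and class scores
```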

4 Experimental Results

In the experiment, the KITTI dataset [7], which is commonly used in autonomous driving research, and the BDD dataset [24], which is the latest published autonomous driving dataset, are used. The KITTI dataset consists of three classes (car, cyclist, and pedestrian) and contains 7,481 images for training and 7,518 images for testing. Because there is no GT for the test images, the training and validation sets are made by randomly splitting the original training set in half [23]. The BDD dataset consists of ten classes: bike, bus, car, motor, person, rider, traffic light, traffic sign, train, and truck. The ratio of the training, validation, and test sets is 7:1:2. In this paper, the test set is utilized for the performance evaluation. In general, the IOU threshold (TH) of the KITTI dataset is set to 0.7 for cars and 0.5 for cyclists and pedestrians [7], whereas the IOU TH of the BDD dataset is 0.75 for all classes [24]. In both YOLOv3 and Gaussian YOLOv3 training, the anchor sizes are extracted using k-means clustering for each training set of KITTI and BDD. The experiments are conducted on an NVIDIA GTX 1080 Ti with CUDA 8.0 and cuDNN v7.

Figure 3: IOU versus localization uncertainty on KITTI and BDD validation sets.

4.1 Validation in Utilizing Localization Uncertainty

Figure 3 shows the relationship between the IOU and the localization uncertainty of the bbox on the KITTI and BDD validation sets. These results are plotted for cars, which is the dominant class in both datasets, and the localization uncertainty is predicted using the proposed algorithm. To show the typical tendency, the IOU is divided into increments of 0.1, and the average IOU and the average localization uncertainty are calculated for each range and used as representative values. As shown in Figure 3, the IOU tends to increase as the localization uncertainty decreases in both datasets. A larger IOU indicates that the coordinates of the predicted bbox are closer to those of the GT. Based on these results, the localization uncertainty of the proposed algorithm effectively represents the confidence of the predicted bbox. It is therefore possible to cope with mislocalizations and improve the accuracy by utilizing the localization uncertainty predicted by the proposed algorithm.
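
The binning procedure described above can be sketched as follows; the helper name, input format, and bin handling are assumptions made here for illustration rather than the authors' evaluation code.

```python
# Small sketch of the Figure 3 analysis: bin detections by IOU in increments of
# 0.1 and average the IOU and localization uncertainty within each bin.
import numpy as np

def bin_iou_vs_uncertainty(ious, uncertainties, bin_width=0.1):
    """Return a list of (mean IOU, mean uncertainty) pairs, one per non-empty bin."""
    ious = np.asarray(ious)
    uncertainties = np.asarray(uncertainties)
    points = []
    for lo in np.arange(0.0, 1.0, bin_width):
        in_bin = (ious >= lo) & (ious < lo + bin_width)
        if np.any(in_bin):
            points.append((ious[in_bin].mean(), uncertainties[in_bin].mean()))
    return points
```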

4.2 Performance Evaluation of Gaussian YOLOv3

To demonstrate the superiority of the proposed algorithm, its performance (i.e., accuracy and detection speed) is compared with that of the baseline algorithm, YOLOv3 [19]. The official evaluation method of each dataset is used for the accuracy comparison, and the IOU TH is set to the values mentioned before. For the accuracy comparison, the mAP, which has been widely used in previous studies on object detection, is selected.

Table 1 shows the performance of the proposed and baseline algorithms on the KITTI validation set and the BDD test set. On the KITTI validation set, the mAP of the proposed algorithm, Gaussian YOLOv3, improves by 3.09 compared to that of YOLOv3, and the detection speed is 43.13 fps, which enables real-time detection with only a slight difference from YOLOv3. On the BDD test set, Gaussian YOLOv3 improves the mAP by 3.5 compared with YOLOv3, and the detection speed is 42.5 fps, which is almost the same as that of YOLOv3. Based on these experimental results, because the proposed algorithm can significantly improve the accuracy with little penalty in speed compared to YOLOv3, Gaussian YOLOv3 is superior to the baseline algorithm.

Figure 4: Detection results of the baseline and proposed algorithms on the KITTI validation set. The first column shows the detection results of YOLOv3, whereas the second column shows the detection results of Gaussian YOLOv3.
Figure 5: Detection results of the baseline and proposed algorithms on the BDD test set. The first and second rows show the detection results of YOLOv3 and Gaussian YOLOv3, respectively, and each color is related to a particular object class.
Dataset                 Algorithm         mAP (%)   FPS     Input size
KITTI validation set    YOLOv3 [19]       80.52     43.57   512×512
KITTI validation set    Gaussian YOLOv3   83.61     43.13   512×512
BDD test set            YOLOv3 [19]       14.9      42.9    512×512
BDD test set            Gaussian YOLOv3   18.4      42.5    512×512
Table 1: Performance comparison.

4.3 Visual and Numerical Evaluation of FP and TP

For a visual evaluation of Gaussian YOLOv3, Figures 4 and 5 show detection examples of the baseline and Gaussian YOLOv3 on the KITTI validation set and the BDD test set, respectively. The detection TH is 0.5, which is the default test TH of YOLOv3. The results in the first row of Figure 4 and in the first column of Figure 5 show that Gaussian YOLOv3 can detect objects that YOLOv3 cannot find, thereby increasing its TP. These positive results are obtained because the Gaussian modeling and loss function reconstruction proposed in this paper provide a loss attenuation effect during training, so that the learning accuracy for the bbox is improved, which in turn enhances the objectness performance. Next, the results in the second row of Figure 4 and in the second column of Figure 5 show that Gaussian YOLOv3 can correct erroneous object detections produced by YOLOv3. In addition, the results in the third row of Figure 4 and in the third column of Figure 5 show that Gaussian YOLOv3 can accurately detect the bboxes of objects that are inaccurately localized by YOLOv3. Based on these results, Gaussian YOLOv3 can significantly reduce the FP and increase the TP; consequently, the driving stability and efficiency are improved and fatal accidents can be prevented.

                        YOLOv3    Gaussian YOLOv3   Variation rate (%)
KITTI validation set
  # of FP               1,681     985               -41.40
  # of TP               13,575    14,560            +7.26
  # of GT               17,607    17,607            0
BDD validation set
  # of FP               86,380    51,296            -40.62
  # of TP               57,261    59,724            +4.30
  # of GT               185,578   185,578           0
Table 2: Numerical evaluation of FP and TP.

For a numerical evaluation of the FP and TP of Gaussian YOLOv3, Table 2 shows the numbers of FPs and TPs for the baseline and Gaussian YOLOv3. The detection TH is the same as mentioned before. The KITTI and BDD validation sets are used to calculate the FP and TP because the GT is provided for the validation sets. For more accurate measurements, the FP and TP of both datasets are calculated using the official evaluation code of BDD, because the KITTI official evaluation method does not count an FP when the bbox is smaller than a certain size. On the KITTI and BDD validation sets, Gaussian YOLOv3 reduces the FP by 41.40% and 40.62%, respectively, compared to YOLOv3. In addition, it increases the TP by 7.26% and 4.30%, respectively. It should be noted that the reduction in the FP prevents unnecessary unexpected braking, and the increase in the TP prevents fatal accidents caused by object detection errors. In conclusion, Gaussian YOLOv3 shows a better performance than YOLOv3 for both the FP and TP, which are related to the safety of autonomous vehicles. Based on the results described in Sections 4.1, 4.2, and 4.3, the proposed algorithm outperforms the baseline algorithm and is suitable for autonomous driving applications.

5 Conclusion

A high accuracy and a real-time detection speed are extremely important for the safety and real-time control of autonomous vehicles. In addition, a detection algorithm that can cope with mislocalizations is required in autonomous driving applications. For these reasons, this paper proposes an object detection algorithm for autonomous driving that achieves a high accuracy and a real-time detection speed while handling mislocalizations. Through Gaussian modeling, loss function reconstruction, and the utilization of localization uncertainty, the proposed algorithm improves the accuracy, increases the TP, and significantly reduces the FP, while maintaining the real-time capability. Compared to the baseline, the proposed Gaussian YOLOv3 algorithm improves the mAP by 3.09 and 3.5 on the KITTI and BDD datasets, respectively. In addition, the proposed algorithm reduces the FP by 41.40% and 40.62%, respectively, and increases the TP by 7.26% and 4.30%, respectively, on the KITTI and BDD datasets. As a result, the proposed algorithm can significantly improve camera-based object detection systems for autonomous driving and is consequently expected to contribute significantly to the wide adoption of autonomous driving applications.

6 Future Work

We are comparing the proposed method with further previous studies to demonstrate its effectiveness. We are also conducting research to improve and optimize the proposed algorithm using uncertainty and other information.

References

  • [1] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision, pages 354–370. Springer, 2016.
  • [2] S. Choi, K. Lee, S. Lim, and S. Oh. Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6915–6922. IEEE, 2018.
  • [3] A. Ćorović, V. Ilić, S. Durić, M. Marijan, and B. Pavković. The real-time detection of traffic participants using yolo algorithm. In 2018 26th Telecommunications Forum (TELFOR), pages 1–4. IEEE, 2018.
  • [4] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
  • [5] X. Dai. Hybridnet: A fast vehicle detection system for autonomous driving. Signal Processing: Image Communication, 70:79–88, 2019.
  • [6] D. Feng, L. Rosenbaum, and K. Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266–3273. IEEE, 2018.
  • [7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
  • [8] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [10] Y. He, X. Zhang, M. Savvides, and K. Kitani. Softer-nms: Rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545, 2018.
  • [11] X. Hu, X. Xu, Y. Xiao, H. Chen, S. He, J. Qin, and P.-A. Heng. Sinet: A scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems, 20(3):1010–1019, 2019.
  • [12] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
  • [13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • [14] F. Liu, B. Liu, C. Sun, M. Liu, and X. Wang. Deep learning approaches for link prediction in social network services. In International Conference on Neural Information Processing, pages 425–432. Springer, 2013.
  • [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [16] A. Marshall. False positive: Self-driving cars and the agony of knowing what matters. WIRED Transportation, 2018.
  • [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [18] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
  • [19] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [21] Y.-W. Seo, N. Ratliff, and C. Urmson. Self-supervised aerial images analysis for extracting parking lot structure. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
  • [22] J. Wei, J. M. Snider, J. Kim, J. M. Dolan, R. Rajkumar, and B. Litkouhi. Towards a viable autonomous driving research platform. In 2013 IEEE Intelligent Vehicles Symposium (IV), pages 763–770. IEEE, 2013.
  • [23] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 129–137, 2017.
  • [24] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
  • [25] C. Zhang, Y. Liu, D. Zhao, and Y. Su. Roadview: A traffic scene simulator for autonomous vehicle simulation testing. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1160–1165. IEEE, 2014.
  • [26] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203–4212, 2018.
  • [27] Q. Zhao, Y. Wang, T. Sheng, and Z. Tang. Comprehensive feature enhancement module for single-shot object detector. In Asian conference on computer vision. Springer, 2018.