Autonomous cars are primed for mass adoption by the consumer market. The recent slew of announcements by several car makers and automotive suppliers indicates that we continue on the path to having autonomous vehicles share our streets with other objects, such as pedestrians, vehicles, strollers and shopping carts. Unlike most other consumer technologies, autonomous vehicles that behave erroneously can cause significant harm to humans and property. Therefore, the ability to detect and recognize various objects around the car is paramount to ensuring safe operation of the vehicle. In this paper, we present a solution to detect and classify several moving objects around the vehicle in real time. This solution was developed for use in Carnegie Mellon University’s autonomous driving research vehicle, a Cadillac SRX.
I-A Related Work
Several approaches to detecting obstacles around the car have been explored. Information about objects around the car can be obtained through various sensors. These include LIDAR- and RADAR-based approaches. Others use stereo cameras to exploit the additional depth information available. Bertozzi et al. use images from 4 fisheye cameras to detect and track objects. Another approach is to combine the data from various sensors to arrive at a more accurate estimate. This sensor fusion-based approach has also been explored for Carnegie Mellon University’s autonomous driving research vehicle.
Once the information about the environment has been obtained, objects can be detected either by processing one frame at a time or by comparing the images from adjacent frames. Single-frame detection can be done using feature-based methods or machine learning. These methods analyze a single frame to find various object classes like pedestrians and cyclists. The disadvantage of these methods is that they impose a high computational demand and can miss objects that are potentially dangerous but were not known a priori when the model was trained. Detection based on adjacent frames takes advantage of object motion to reduce the search space and computation time. Optical flow-based methods are the most popular for this latter approach. When combined with tracking, these methods can detect objects even in the absence of relative motion, provided the object was moving at some point in the past. Yet others combine many of these methods to develop a hybrid approach.
I-B Our Contributions
An autonomous vehicle must be capable of driving at different speeds based on the surrounding environment and operating conditions. Specifically, its speed can range from zero or close to zero all the way to highway speeds. We distinguish between two operating regimes based on whether the autonomous vehicle can stop rapidly and safely, or needs to travel a considerable distance before coming to a safe and comfortable halt. These two operating modes lead to very different design considerations. We have developed our solution to work in the former operating mode of driving at low speeds and hence being able to stop quickly yet safely. In this mode, it is important to detect, at a coarse granularity, whether an obstacle is present in the regions around the car. Once an obstacle is detected in any of the surrounding regions, the car can plan a path to avoid hitting the obstacle. In the worst case, the vehicle can wait until the region is considered safe to move into. Therefore, we do not need to detect the exact bounding box or distance of the obstacles relative to the vehicle.
We developed our solution to meet the practical need of performing accurately in real time on the vehicle, and we made several design choices to that end. The moving object detector runs on the CPU, while the DNN-based classifier runs on a GPU. One of the major challenges is to maintain a low computation overhead. This is critical to achieving real-time processing of simultaneous streams from 4 fisheye cameras. To achieve such a low computation overhead, we eschewed the de-warping stage that similar solutions use. Our algorithm was developed to work directly on the raw images from the fisheye cameras. Instead of processing the 4 video streams individually, we merge them into a single video sequence and process that as a single stream. We also work with fixed regions of interest (ROIs) in the 4 video streams, as shown in Figure 1. Through experimentation with video streams captured in real-world conditions, we found that these 12 ROIs work best to detect the presence of obstacles in the 8 regions around the car. This enabled us to narrow down the search space and transform the task of finding where the moving object is into the simpler one of deciding whether a particular region of interest contains a moving object. To ensure that a previously detected moving object continues to be detected even after it stops moving, we have developed a simplified form of tracking using the sparse Lucas Kanade algorithm.
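To make the idea of a merged quad view with fixed ROIs concrete, the following is a minimal sketch rather than the layout actually used in the paper. It assumes a 2x2 mosaic of 640x360 tiles, a hypothetical camera ordering, and three equal-width strips (left, center, right) over the lower half of each tile, which yields 12 ROIs. The tile size, ordering and ROI placement are all illustrative assumptions; the ROIs in the paper were chosen experimentally.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

# Assumed tile size and camera ordering; the paper does not specify them.
TILE_W, TILE_H = 640, 360
CAMERAS = ["front", "right", "rear", "left"]  # hypothetical ordering


def tile_origin(cam_index: int) -> Tuple[int, int]:
    """Top-left corner of a camera's tile in the 2x2 quad-view image."""
    row, col = divmod(cam_index, 2)
    return col * TILE_W, row * TILE_H


def build_rois() -> Dict[int, Rect]:
    """Return 12 fixed ROIs (left, center, right per camera) in quad-view coordinates."""
    rois: Dict[int, Rect] = {}
    strip_w = TILE_W // 3
    for cam_index in range(len(CAMERAS)):
        ox, oy = tile_origin(cam_index)
        for part in range(3):  # 0 = left, 1 = center, 2 = right
            # Hypothetical placement: equal strips over the lower half of the tile.
            rois[cam_index * 3 + part] = (
                ox + part * strip_w,
                oy + TILE_H // 2,
                strip_w,
                TILE_H // 2,
            )
    return rois


def roi_of_point(rois: Dict[int, Rect], x: float, y: float) -> Optional[int]:
    """Index of the ROI containing pixel (x, y), or None if it lies outside all ROIs."""
    for idx, (rx, ry, rw, rh) in rois.items():
        if rx <= x < rx + rw and ry <= y < ry + rh:
            return idx
    return None
```

Because the ROIs are fixed in quad-view coordinates, the question of where an object is reduces to a table lookup, and each later stage only has to report ROI indexes.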
Another priority of ours was that our solution must be portable enough to be platform-independent. The detector and tracking modules were implemented in OpenCV to achieve this objective. The modules were also developed to be independent from each other. This allows us to enable only a subset of the modules as necessary. This also allows the modules to be integrated without much effort with newer solutions. We have done extensive testing of the entire solution on a real car as well as using offline computation. Our entire solution has been ported to x86, TI TDA2x, and NVIDIA TX2. Performance and accuracy measurements were taken for all these system implementations.
II Architecture and Algorithms
The overall architecture of our solution includes detection, tracking and classification modules. The detection module detects if there is a moving object in any of the regions of interest. The tracking module is responsible for ensuring that a previously-detected object continues to be detected even if it stops moving. The classification module uses deep learning to categorize the moving object. The function and design of these modules are explained in the following sections.
II-A Detection of Moving Objects
The detection module detects if there is a moving object in any of the regions of interest. The regions of interest were chosen to meet the practical need of indicating vehicles and other objects approaching from the left, right or center of each field of view.
Figure 2 illustrates the pipeline of our detection and tracking modules. As shown, the first stage in detection is the background subtraction of adjacent frames. This narrows our search space to only the pixels where we believe motion to be present. In this background-subtracted image, we find the appropriate feature locations to track. Similar solutions use various feature descriptors such as SIFT, SURF, BRIEF and ORB. We decided to use the ORB feature descriptor chiefly because of its computational efficiency. Once the feature points are extracted, we use the sparse Lucas Kanade algorithm to find the corresponding feature points in the subsequent frame. The vectors connecting the feature points in the previous frame to the corresponding ones in the current frame are summed separately for every region of interest. The length of the resultant vector is then thresholded to determine whether the region of interest contains a moving object.
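The final step of this pipeline, summing the motion vectors per ROI and thresholding the length of the resultant, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the matched point pairs have already been produced by a feature extractor and a sparse optical flow routine (for example OpenCV's `calcOpticalFlowPyrLK`), and the ROI rectangles and the threshold value are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def rois_with_motion(
    prev_pts: Sequence[Point],
    curr_pts: Sequence[Point],
    rois: Dict[int, Rect],
    threshold: float,
) -> List[int]:
    """Return the indexes of ROIs whose summed motion vector exceeds the threshold.

    prev_pts[i] and curr_pts[i] are the positions of the same feature point in
    the previous and current frame. Each point is assigned to the ROI that
    contains its previous position.
    """
    sums = {idx: [0.0, 0.0] for idx in rois}
    for (px, py), (cx, cy) in zip(prev_pts, curr_pts):
        for idx, (rx, ry, rw, rh) in rois.items():
            if rx <= px < rx + rw and ry <= py < ry + rh:
                sums[idx][0] += cx - px
                sums[idx][1] += cy - py
                break
    return sorted(
        idx for idx, (sx, sy) in sums.items() if math.hypot(sx, sy) > threshold
    )
```

One consequence of summing vectors before thresholding is that small, uncorrelated displacements tend to cancel, while the consistent displacement of a moving object accumulates.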
II-B Tracking of Stopped Objects
Detection alone has the drawback that an object will stop being detected if it enters a region of interest and then stops moving. This behavior is not what we intend. We would like to know of every object around the car that can pose a danger, even if it is not currently moving. Hence, we introduced a tracking module that tracks a previously-detected object even after it stops moving. Tracking of moving objects in a video stream is a well-studied problem with many state-of-the-art trackers available. Most state-of-the-art trackers perform well, with high sensitivity and specificity. However, the computation overhead of these trackers is high. Since we are concerned only with detecting the presence of moving objects in the regions of interest, our requirements are less demanding than those addressed by these turn-key solutions. We need not track multiple objects, since we only generate warnings for the presence or absence of moving objects in each ROI. We also do not need to track the exact bounding boxes of the objects. We leverage these looser requirements to obtain a lower computation overhead. In our solution, once we detect the presence of moving objects, we store the corresponding feature points. When the detector transitions from positive to negative, we run another iteration of the sparse Lucas Kanade algorithm on the stored feature points from the previous frame. If the feature points are found in the current frame with high confidence and the resultant motion vector for the feature points is small, we conclude that the moving object is still present in the ROI but has stopped moving. This ROI is considered to contain the moving object until the detector module freshly detects a moving object, at which point the stored features are discarded.
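The per-ROI logic described above can be summarized as a small state machine. The sketch below is an assumption-laden illustration, not the authors' code: `relocate` is a hypothetical callback standing in for the sparse Lucas Kanade step (it returns the new position of each stored point, or `None` when a point is not found with high confidence), and `max_motion` and `min_found_ratio` are made-up thresholds.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical stand-in for the sparse Lucas Kanade step.
Relocate = Callable[[List[Point]], List[Optional[Point]]]


@dataclass
class RoiTracker:
    max_motion: float = 2.0       # assumed limit on the resultant motion, in pixels
    min_found_ratio: float = 0.5  # assumed fraction of points that must be re-found
    stored: List[Point] = field(default_factory=list)
    was_detected: bool = False
    latched: bool = False

    def update(self, detected: bool, features: List[Point], relocate: Relocate) -> bool:
        """Return True if the ROI should be reported as containing an object."""
        if detected:
            # Fresh detection: discard old features and store the new ones.
            self.stored = list(features)
            self.latched = False
            self.was_detected = True
            return True
        if self.was_detected and self.stored:
            # Detector just went from positive to negative: check whether the
            # object stopped in place or actually left the ROI.
            self.latched = self._stopped_in_place(relocate)
        self.was_detected = False
        return self.latched

    def _stopped_in_place(self, relocate: Relocate) -> bool:
        found = relocate(self.stored)
        pairs = [(p, q) for p, q in zip(self.stored, found) if q is not None]
        if len(pairs) < self.min_found_ratio * len(self.stored):
            return False
        dx = sum(q[0] - p[0] for p, q in pairs)
        dy = sum(q[1] - p[1] for p, q in pairs)
        return (dx * dx + dy * dy) ** 0.5 <= self.max_motion
```

In this sketch the optical flow step runs only on the frame where the detector output drops, which keeps the added cost of tracking small. The same latch is also what allows the module to hold on to a non-moving object erroneously.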
II-C ROI Classification
The detector and tracking modules are used to find the regions of interest with moving objects. However, to make meaningful use of this information, the system needs to know the nature of the moving object. Specifically, we would like to categorize the moving object as pedestrian, bicycle, shopping cart, etc. This information can then be used by the autonomous vehicle software to take action tailored to the category of the detected object. For this purpose, we introduce a deep neural network-based classification module.
Since AlexNet in 2012, Convolutional Neural Networks (CNNs) have proved to be the state-of-the-art methodology in Computer Vision. They have outperformed traditional approaches in various tasks, such as recognition, segmentation, and detection. Over time, the CNNs in use have grown very deep and complex. This has led to an increase in accuracy. However, these networks are also large and their response times are slow. In many real-world applications like our system, the model needs to execute on a small, low-power embedded system that lacks a GPU powerful enough to run a computationally heavy deep network in real time. In 2017, Howard et al. proposed MobileNet, which substitutes the standard convolution layers with depth-wise convolution layers and point-wise convolution layers to achieve small, low-latency models that meet the requirements of our application.
MobileNet factorizes a standard convolution into a depth-wise convolution and a point-wise convolution; the combination is referred to as a depth-wise separable convolution. The depth-wise convolution applies a single filter to each input channel. The resulting output map is then fed into a point-wise ($1 \times 1$) convolution that combines the outputs of the depth-wise convolution layer. The factorization thus splits a standard one-step convolution operation into 2 separate steps. Because the separation reduces space requirements and simplifies the computation, depth-wise separable convolution can achieve a reduction in computation of
$$\frac{1}{N} + \frac{1}{D_K^2},$$
where $N$ is the number of output channels and $D_K$ is the spatial size of the convolution kernel.
According to Howard et al., MobileNet uses $3 \times 3$ depth-wise separable convolutions, resulting in between 8 and 9 times less computation than standard convolutions, with only a small reduction in accuracy.
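The computational saving can be checked with a few lines of arithmetic. The sketch below counts multiply-accumulate operations for a standard convolution and for its depth-wise separable factorization, following the cost model in the MobileNet paper. The function and parameter names are ours, and the layer sizes in the usage note are arbitrary examples.

```python
def standard_conv_cost(kernel: int, in_ch: int, out_ch: int, fmap: int) -> int:
    """Multiply-accumulates of a standard kernel x kernel convolution
    producing an fmap x fmap output (stride 1, same padding assumed)."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_cost(kernel: int, in_ch: int, out_ch: int, fmap: int) -> int:
    """Multiply-accumulates of a depth-wise convolution followed by a 1x1 point-wise one."""
    depthwise = kernel * kernel * in_ch * fmap * fmap  # one filter per input channel
    pointwise = in_ch * out_ch * fmap * fmap           # 1x1 conv combines the channels
    return depthwise + pointwise


def cost_ratio(kernel: int, in_ch: int, out_ch: int, fmap: int) -> float:
    """Separable cost divided by standard cost; equals 1/out_ch + 1/kernel**2."""
    return separable_conv_cost(kernel, in_ch, out_ch, fmap) / standard_conv_cost(
        kernel, in_ch, out_ch, fmap
    )
```

For a 3x3 kernel the ratio is 1/9 plus 1/N, so `1 / cost_ratio(3, 512, 512, 14)` falls between 8 and 9, while layers with few output channels save somewhat less.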
MobileNet provides us with features extracted from the images, which serve as descriptors. We used these descriptors to build a classifier. Here, we chose to use SoftMax together with Cross Entropy as the loss function.
SoftMax is normally used after the final layer to map real values to the interval $(0, 1)$, in order to represent a probability distribution over the possible categories. We then use cross entropy to compute the loss and provide a gradient for the back-propagation phase. Before training, we first obtain the frames from the video and then split them into regions of interest. Because MobileNet takes input images of dimension $224 \times 224$, we re-sized the images, which have different resolutions, to $224 \times 224$ and labeled them as shown in Figure 3.
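As a small worked example of the loss described above, the following sketch computes SoftMax probabilities, the cross-entropy loss for one labeled sample, and the gradient with respect to the logits. It is a plain-Python illustration of the standard formulas rather than the training code used in the paper, which relies on Tensorflow. The class list is hypothetical.

```python
import math
from typing import List, Sequence

CLASSES = ["pedestrian", "bicycle", "shopping cart", "background"]  # hypothetical labels


def softmax(logits: Sequence[float]) -> List[float]:
    """Map real-valued logits to probabilities in (0, 1) that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(probs[target_index])


def loss_gradient(probs: Sequence[float], target_index: int) -> List[float]:
    """Gradient of the cross-entropy loss w.r.t. the logits: probs - one_hot(target)."""
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
```

The gradient has the simple form "predicted probability minus one-hot label", which is one reason this pairing of SoftMax and cross entropy is so commonly used for classification.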
This section describes our experiments measuring the performance of our moving-object detection system on different hardware platforms, using images acquired in various areas.
III-A Hardware and Implementation
|Platform||CPU||GPU / DSP|
|TDA2x||Dual-core 1.0-GHz ARM A15 processor||2 C66x DSPs|
|TX2||Quad-core 2.0-GHz ARM A57 and Dual-core 2.0-GHz Denver 2||Pascal GPU (2 SMs 256 Cores)|
|x86||3.3-GHz Intel i9-7900X||Titan Xp GPU (60 SMs 3840 Cores)|
In this evaluation, we examined the practical feasibility and efficacy of our algorithm using multiple platforms from different vendors. Figure 4 shows our surround view system prototype, built using a TI TDA2x evaluation kit from Spectrum Digital to capture data and evaluate our detection/tracking algorithm in real-life scenarios. We also integrated our algorithm on an NVIDIA TX2 embedded platform and an x86 desktop to compare the relative performance of our detection, tracking and classification modules across different platforms. Specifications of the different hardware we used are listed in Table I. We ran 64-bit Ubuntu 16.04 with OpenCV 2.4.17 for the detection/tracking module. CUDA 8.0 + CuDNN 6.0, Tensorflow, Python 3, and OpenCV 3.0 were used for the classification module implementation. To obtain the best classification result, we applied the MobileNet 1.0 model in our system. We discuss speed performance with different MobileNet model versions in Section III-B.
Figure 5 captures our implemented system architecture. The different components of the system architecture are as follows:
Capture/Merge: This component captures the images from the 4 fisheye cameras and merges them into a single quad view image as shown in Figure 1.
Detection/Tracking: These modules take the quad view image as input and process it to detect the moving objects. The outputs of the modules are the indexes of the ROIs containing moving objects.
Detection Output Transfer: The output of the detection/tracking modules is transferred to the classification module using socket communication.
Object Classification: This module takes the quad view image and selected ROI indexes as input and processes them to classify the moving objects into different categories.
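The hand-off between the detection/tracking modules and the classification module can be sketched as follows. The paper states only that socket communication is used; the wire format here (a 32-bit frame id followed by a 16-bit mask with one bit per ROI) and all function names are hypothetical choices made for illustration.

```python
import socket
import struct
from typing import List, Tuple

NUM_ROIS = 12
# Hypothetical wire format: frame id (uint32) + ROI bitmask (uint16), network byte order.
MSG = struct.Struct("!IH")


def encode(frame_id: int, roi_indexes: List[int]) -> bytes:
    """Pack the indexes of ROIs containing moving objects into a fixed-size message."""
    mask = 0
    for idx in roi_indexes:
        if not 0 <= idx < NUM_ROIS:
            raise ValueError(f"ROI index out of range: {idx}")
        mask |= 1 << idx
    return MSG.pack(frame_id, mask)


def decode(payload: bytes) -> Tuple[int, List[int]]:
    """Inverse of encode(): recover the frame id and the sorted ROI indexes."""
    frame_id, mask = MSG.unpack(payload)
    return frame_id, [i for i in range(NUM_ROIS) if mask & (1 << i)]


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, since recv() may return fewer than requested."""
    data = bytearray()
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("socket closed before a full message arrived")
        data.extend(chunk)
    return bytes(data)
```

On the sending side the detector would call `sock.sendall(encode(frame_id, rois))` once per frame, and the classifier would call `decode(recv_exact(sock, MSG.size))` and run the network only on the listed ROIs. A fixed-size message keeps framing trivial and the transfer cost negligible next to the image processing.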
|Core||Job description||Run Time (ms)||Total (ms)|
|CPU||Diff of two frames|
|ORB Feature Extraction|
|Sparse Optical flow cal.|
|GPU||Diff of two frames|
|ORB Feature Extraction|
|Sparse Optical flow cal.|
III-B Experiment Results
Experiment 1: Detection only
Experiment 2: Detection and tracking
Experiment 3: Classification only
Experiment 4: Detection, tracking and classification
The metrics for Experiments 1 and 2 were compared to evaluate the improvement in accuracy from adding the tracking module. Likewise, the metrics from Experiments 3 and 4 were compared to evaluate the speed and accuracy trade-offs when combining all the modules.
We collected video sequences at 1280x720 resolution and 30 fps using our surround view system, equipped with four 190° FOV fisheye cameras as shown in Figure 4. We used 7 video sequences to create our training data. 1009 frames were selected from the video data, and the 12 pre-fixed ROIs of each frame were sliced into individual images. The images were labeled with presence information and the category of objects for training our MobileNet-based classification module. To create test data, we labeled 5 other video sequences with manually-annotated ground truths of object presence and target class in the 12 ROIs. These training and testing data were obtained from various indoor and outdoor parking lots at local shopping centers and buildings in Pittsburgh, USA.
Table II shows a comparison of the execution times of the detection module between GPU and CPU on NVIDIA TX2. We observe that running the detection module on the GPU instead of the CPU does not result in a significant performance improvement. This is because the algorithms deal with sparse features instead of processing all the pixels in the image. Therefore, the detection and tracking module can be selected to run either on a GPU or a CPU as per system availability. We chose to pin the detection and tracking operation to the CPU. This allowed us to dedicate the GPU resource to the DNN-based classification algorithm to maximize its performance.
Table III shows the measured average frames per second (fps) of each algorithm module running on the different platforms. Unfortunately, we could not evaluate the classification module on the TI TDA2x platform since it does not support CUDA. We notice that the overall performance of the integrated ‘detection, tracking, and classification’ approach is significantly better than that of the ‘classification only’ scheme. This is because fewer ROIs need to be processed by the classification module when it uses the filtered ROI index information received from the detection and tracking module. We also observe that the ‘detection, tracking, and classification’ experiment was practically feasible on an embedded platform like the NVIDIA TX2 by dedicating the entire GPU resource to the classification module. The NVIDIA TX2 supports a feature inherent in integrated GPUs called zero-copy memory. Zero-copy memory eliminates the need to copy data to and from the DRAM associated with the GPU. This allows the CPU and GPU components of GPU-intensive programs to share a memory space. This feature enables us to reduce the latency of transferring the captured input image from the CPU to the classification module on the GPU.
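To evaluate the accuracy of our algorithms, we defined the following metrics for the detection/tracking and classification modules:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively, counted per ROI per frame.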
Our measured values for precision and recall are shown in Table IV. As we can see from the table, the addition of the tracking module improves recall at the cost of lower precision. Consider the case when a moving object comes to a stop in one of the ROIs. This object will not be detected in the absence of the tracking module. This ROI will be counted as a false negative in all the frames until the object starts moving again. The result is a low recall value. The addition of the tracking module reduces the number of false negatives, leading to a higher recall. However, we also see a reduction in precision. This happens when the tracking module erroneously latches onto a non-moving object. This ROI then counts as a false positive for the subsequent frames, resulting in a lower precision value. We believe that the overall compromise is nevertheless preferable for an autonomous vehicle system, where we would rather have more false positives than false negatives. Another important takeaway is that performing only classification gives the best precision and recall values. However, without the detection and tracking modules, we would have to run the classification module on all ROIs of all frames, which incurs a huge computational cost. In the future, with advances in hardware, it might become feasible to skip the detection and tracking modules to achieve better accuracy.
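The trade-off described above can be reproduced on a toy example. The sketch below computes precision and recall over (frame, ROI) pairs. The data in the usage note is invented purely for illustration and is unrelated to the measurements in Table IV.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (frame index, ROI index)


def precision_recall(predicted: Set[Cell], actual: Set[Cell]) -> Tuple[float, float]:
    """Precision and recall of predicted positive (frame, ROI) cells against ground truth."""
    tp = len(predicted & actual)   # reported and really occupied
    fp = len(predicted - actual)   # reported but empty
    fn = len(actual - predicted)   # occupied but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented scenario: an object occupies ROI 2 in frames 0-4; it moves in
# frames 0-1, then stands still in frames 2-4, and is gone by frame 5.
ground_truth = {(f, 2) for f in range(5)}
detection_only = {(0, 2), (1, 2)}                   # misses the stopped object
with_tracking = {(f, 2) for f in range(6)}          # stays latched one frame too long
```

Here `precision_recall(detection_only, ground_truth)` gives a precision of 1.0 and a recall of 0.4, while `precision_recall(with_tracking, ground_truth)` gives a precision of 5/6 and a recall of 1.0, mirroring the direction of the trade-off discussed above.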
Figure 6 shows the speed of different MobileNet models on x86 and NVIDIA TX2 with respect to the different input image resolutions. The appropriate MobileNet model can be chosen to fit the latency and size budget for the platform in use. We decided to use MobileNet_v1_1.0_224 to achieve the best accuracy in real time.
The test result of our classifier was compared with Faster R-CNN. We used MobileNet as the backbone of the R-CNN network for a fair comparison. The Faster R-CNN module was integrated only on the x86 desktop because it requires more computational resources than the NVIDIA TX2 provides. We trained a model using the KITTI dataset and measured its performance using our dataset. The system ran at 15 fps and achieved 60.5% recall and 81.1% precision. We observe a significant reduction in accuracy because the Faster R-CNN module fails to detect objects along the edges of the image, where the distortion is severe. We expect that the accuracy will improve when the model is trained on our own data.
IV Conclusion and Future Work
We have presented a solution that can detect, track and recognize moving objects in pre-determined regions of interest in real time with good accuracy. Our solution was evaluated on x86, TI TDA2x and NVIDIA TX2 platforms, and their relative performance was measured for various combinations of active modules. The final solution was also deployed on a real car and extensively tested in real-world situations. The system performed with good accuracy and precision in these real-world tests. Because the system was built from the ground up using a modular approach, it was easy to break down the performance and accuracy measurements by module. These modules can also be added to an existing autonomous vehicle solution with minimal porting time and system overhead.
In the future, we plan to train the classification module to detect several additional categories. We would also like to add an ego-motion compensator to be able to use the detection and tracking module even when the vehicle is traveling at high speeds. Another next step is to integrate the entire solution with other existing algorithms on the NVIDIA Drive PX2 and deploy on our autonomous driving research vehicle.
-  J. Wei, J. M. Snider, J. Kim, J. M. Dolan, R. Rajkumar, and B. Litkouhi, “Towards a viable autonomous driving research platform,” in Intelligent Vehicles Symposium (IV), 2013 IEEE. IEEE, 2013, pp. 763–770.
-  H. Wang, B. Wang, B. Liu, X. Meng, and G. Yang, “Pedestrian recognition and tracking using 3d lidar for autonomous vehicle,” Robotics and Autonomous Systems, vol. 88, pp. 71–78, 2017.
-  N. Bernini, M. Bertozzi, L. Castangia, M. Patander, and M. Sabbatelli, “Real-time obstacle detection using stereo vision for autonomous ground vehicles: A survey,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 873–878.
-  M. Bertozzi, L. Castangia, S. Cattani, A. Prioletti, and P. Versari, “360 detection and tracking algorithm of both pedestrian and vehicle using fisheye images,” in Intelligent Vehicles Symposium (IV), 2015 IEEE. IEEE, 2015, pp. 132–137.
-  H. Cho, Y.-W. Seo, B. V. Kumar, and R. R. Rajkumar, “A multi-sensor fusion system for moving object detection and tracking in urban driving environments,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 1836–1843.
-  J. Choi, “Realtime on-road vehicle detection with optical flows and haar-like feature detectors,” Tech. Rep., 2012.
-  D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer vision, 1999. The proceedings of the seventh IEEE international conference on, vol. 2. IEEE, 1999, pp. 1150–1157.
-  H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” Computer vision–ECCV 2006, pp. 404–417, 2006.
-  M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust independent elementary features,” Computer Vision–ECCV 2010, pp. 778–792, 2010.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE international conference on. IEEE, 2011, pp. 2564–2571.
-  S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International journal of computer vision, vol. 56, no. 3, pp. 221–255, 2004.
-  M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking vot2015 challenge results,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 1–23.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  “TDA2x SoC family,” http://processors.wiki.ti.com/index.php/TDA2x.
-  “TDA2x Vision Evaluation Module Kit,” http://www.spectrumdigital.com/tda2x-vision-evaluation-module-kit.
-  “NVIDIA Jetson TX1/TX2 Embedded Platforms,” http://www.nvidia.com/object/embedded-systems-dev-kits-modules.html.
-  J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
-  “MobileNets: Open-Source Models for Efficient On-Device Vision,” https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.