Self-driving cars have moved from desert terrain, spearheaded by the DARPA Grand Challenge, through highways, and into populated cities. End-to-end approaches directly map input images to driving actions, predicting trajectories with deep learning models trained by supervised regression. Using an LSTM network, we predict both trajectory and speed from a set of input images and a semantic map. Recently, ChauffeurNet predicts trajectories and uses a mid-level controller to transfer these predictions to specific vehicles, avoiding the need to retrain a model for every vehicle type. Visual semantic maps have been shown to outperform traditional end-to-end models that use only front-facing images as input [6, 7]. Recent work trains end-to-end models to predict steering angle, speed, and driving trajectories. LSTM models efficiently predict steering by taking long-range dependencies into account. A single 3D CNN model that simultaneously performs detection, tracking, and motion forecasting has also been trained in an end-to-end fashion. Since predicting steering angle alone is insufficient for self-driving, a multi-task learning framework is used to predict both speed and steering angle end-to-end. Augmenting existing data with pre-trained neural networks for image segmentation and optical flow estimation has been shown to improve steering angle prediction.
The challenge dataset consists of around 55 hours of recorded driving in Switzerland, together with the associated driving speed and steering angle. Sample training images are shown in Figure 1 and sample testing images in Figure 2. The dataset contains about 2 million images taken with GoPro Hero5 cameras facing the front, sides, and rear of a car driven around Switzerland. A visual map image from HERE Technologies and a semantic map derived from this data are provided. The semantic map consists of 21 fields, including GPS latitude, GPS longitude, and GPS precision. All images and data are sampled at 10 frames per second. The data is separated into 5-minute chapters; in total, there are 682 chapters covering 27 different routes. The chapters are randomly split into 548 for training, 36 for validation, and 98 for testing.
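The random split of chapters into training, validation, and test sets can be sketched as follows; the chapter naming scheme and the fixed seed are hypothetical placeholders, only the counts come from the text.

```python
import random

def split_chapters(chapters, n_train=548, n_val=36, n_test=98, seed=0):
    """Randomly partition driving chapters into train/val/test sets."""
    assert len(chapters) == n_train + n_val + n_test
    rng = random.Random(seed)
    shuffled = chapters[:]
    rng.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# 682 chapters in total, as in the challenge dataset
train, val, test = split_chapters([f"chapter_{i:03d}" for i in range(682)])
```

Splitting at the chapter level, rather than the frame level, keeps temporally adjacent frames out of different sets.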
We pre-process the data by down-sampling, normalization, semantic map imputation and augmentation as described next:
For efficient experimentation we down-sample the dataset in both space and time, as shown in Table 1. Although this initial processing is time consuming, it enables the model to train faster by limiting I/O time. The initial image size is also prohibitive given the memory limits of our GPUs and significantly slows training.
Table 1: Down-sampled dataset variants (Full, Sample 1, Sample 2, and Sample 3).
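The spatial and temporal down-sampling step might look like the following sketch. The directory layout, file naming, and the temporal stride are assumptions; the text fixes only the target resolutions (320×180 or 160×90).

```python
from pathlib import Path
from PIL import Image

def downsample_images(src_dir, dst_dir, size=(320, 180), step=2):
    """Down-sample one chapter's frames in space (resize) and time
    (keep every `step`-th frame, e.g. 10 fps -> 5 fps).
    Paths and file layout are hypothetical."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    frames = sorted(Path(src_dir).glob("*.jpg"))
    for frame in frames[::step]:
        img = Image.open(frame).resize(size, Image.BILINEAR)
        img.save(dst / frame.name)
```

Writing the resized frames back to disk once, rather than resizing on the fly, is what trades a few days of preprocessing for faster epochs later.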
We include the current and previous frame in each iteration, which are 0.4 seconds apart. The images are fed into a pre-trained ResNet34 or ResNet152 model. We feed the semantic map into a fully connected network with two hidden layers of 256 and 128 units and ReLU activations. The outputs of the fully connected and ResNet models are concatenated and fed into a long short-term memory (LSTM) network. The LSTM output is then concatenated with the current semantic-map information, if used, and the result is fed to both an angle regressor and a speed regressor, which predict the current steering angle and speed. Both regressors share the same structure, as shown in Figure 4.
We use 20 of the numerical features, as shown in Table 2. Missing values are imputed with zeros; we avoid using the mean value across the chapter or another placeholder so as not to use future data. For one model, we add 27 additional dummy variables indicating the folder in which an image is located. Although the semantic map contains location information, the folder provides another view that we consider potentially helpful to training.
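A possible implementation of this imputation and folder encoding, using hypothetical column names, is:

```python
import pandas as pd

def prepare_semantic_map(df, folder_col="folder", use_folder_dummies=False):
    """Impute missing semantic-map values with zeros (unlike a per-chapter
    mean, this leaks no future information) and optionally one-hot encode
    the source folder. Column names are assumptions."""
    numeric = df.select_dtypes("number").fillna(0.0)
    if use_folder_dummies and folder_col in df:
        dummies = pd.get_dummies(df[folder_col], prefix=folder_col)
        numeric = pd.concat([numeric, dummies], axis=1)
    return numeric
```

Zero imputation is computable frame by frame at inference time, which is why it is preferred over any statistic that depends on the rest of the chapter.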
We digitize the steering angles and speeds from the training dataset into 100 and 30 evenly spaced bins, respectively, where each bin represents a range of continuous values for angle or speed. We then count the digitized values in each bin and use these counts to estimate the likelihood of an angle or speed value. Using this estimated distribution, we compute a weighted average of the predictions from a selection of the models we trained. Table 3 shows the performance of each individual model and of the ensemble. We combine the predictions from epoch 1 of models 3-5 for angle, and the predictions from epoch 1 of models 2-5 and epoch 2 of model 1 for speed.
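The histogram-based weighting could be sketched as follows; details beyond the bin counts (the epsilon floor, the clipping of out-of-range predictions) are assumptions.

```python
import numpy as np

def likelihood_weighted_ensemble(train_values, model_preds, n_bins=100):
    """Weight each model's prediction by the empirical likelihood of that
    value, estimated from a histogram of the training targets, then take
    the weighted average across models."""
    counts, edges = np.histogram(train_values, bins=n_bins)
    probs = counts / counts.sum()
    preds = np.asarray(model_preds)            # shape: (n_models, n_samples)
    idx = np.clip(np.digitize(preds, edges) - 1, 0, n_bins - 1)
    weights = probs[idx] + 1e-12               # avoid an all-zero denominator
    return (weights * preds).sum(axis=0) / weights.sum(axis=0)
```

The effect is that predictions landing in frequently observed angle or speed ranges dominate the average, while outlier predictions are down-weighted.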
Hyper-parameters of network training.
All models are trained using the Adam optimizer with an initial learning rate of 0.0001, no weight decay, and momentum parameters β1 and β2. In all models, we optimize the sum of the mean squared errors (MSE) of the steering angle and the speed. Attempts to optimize only one regressor showed little improvement on either metric. Minibatch sizes are 8, 32, or 64, limited by GPU memory. Depending on the run, input images are 320×180 or 160×90 pixels, and the image CNN is a ResNet34 or a ResNet152. Table 3 summarizes the hyperparameters and image settings used for each of our models.
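A single training step on the joint objective described above (Adam at learning rate 0.0001, loss = MSE(angle) + MSE(speed)) might look like this sketch; the function signature is an assumption.

```python
import torch

def train_step(model, optimizer, frames, sem_map, angle_true, speed_true):
    """One optimization step on the sum of the steering-angle MSE and the
    speed MSE, as used for all models."""
    model.train()
    optimizer.zero_grad()
    angle_pred, speed_pred = model(frames, sem_map)
    loss = (torch.nn.functional.mse_loss(angle_pred, angle_true)
            + torch.nn.functional.mse_loss(speed_pred, speed_true))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Summing the two MSE terms lets one backward pass update the shared CNN/LSTM trunk and both regression heads together.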
Preprocessing took about 3 days of continuous computation on an Intel i7-4790k CPU using 6 of the 8 cores and with data stored on a 7200rpm hard drive. The largest limitation to the preprocessing was the I/O due to the sizeable number of photos. Models 1, 2, 4, and 5 were trained using a single Nvidia K80 GPU. Each epoch required approximately 12 hours to complete. Model 3 was trained on a single Nvidia GTX 980 GPU and each epoch took approximately 10 hours to compute. The code is implemented in Python 3.6 using PyTorch.
Table 3: Model configurations and results (CNN, image dimensions, semantic map, batch size, epochs, MSE angle, MSE speed, combined).
Ensemble results (model, MSE angle, MSE speed).
Performance across various zones: the ensemble method did worst in the pedestrian sections and best in Zone80. Pedestrian sections are presumably the hardest to learn, given the unpredictability of cities and people, while Zone80 sections are likely straighter, requiring fewer changes in speed and steering angle, and thus easier to learn. Likewise, Right and Left sections, which require learning a turn, are difficult, whereas Straight segments are easier to learn and scored better.
We train each of our five neural network models for a varying number of epochs to make efficient use of our computational resources. Table 3 summarizes our results. The most significant improvement for the single models comes from including the HERE semantic map, which decreases the angle MSE by approximately 300 points; this is most apparent between models 1 and 4. Including the city location as part of the semantic map provides a marginal benefit, as seen in the change from model 4 to model 5. Notably, models 3 and 4 tend to overfit: our best result for both is after 2 epochs, and the MSE slowly increases with each additional epoch, even though the training loss decreases throughout training, as shown in Figure 5. The performance of the ensemble method on the different road types is shown in Table 4.
In conclusion, fusing the semantic map data with the image data significantly improves results. We demonstrate that fusing these different modalities can be performed efficiently with a simple neural network, and that a classical ensemble of diverse models improves overall performance. We make our models and code publicly available.
We would like to thank the Columbia University students of the Fall 2019 Deep Learning class for their participation in the challenge. Specifically, we would like to thank Xiren Zhou, Fei Zheng, Xiaoxi Zhao, Yiyang Zeng, Albert Song, Kevin Wong, and Jiali Sun for their participation.
-  Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
-  Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
-  Michael Diodato, Yu Li, Manik Goyal, and Iddo Drori. GitHub repository for winning the ICCV 2019 Learning to Drive Challenge. https://github.com/mdiodato/cu-iccv-2019-learning-to-drive-winners, 2019.
-  Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes. Going deeper: Autonomous steering with neural memory networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 214–221, 2017.
-  GoPro. Camera. https://gopro.com, 2019.
-  Simon Hecker, Dengxin Dai, and Luc Van Gool. End-to-end learning of driving models with surround-view cameras and route planners. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–453, 2018.
-  Simon Hecker, Dengxin Dai, and Luc Van Gool. Learning accurate, comfortable and human-like driving. arXiv preprint arXiv:1903.10995, 2019.
-  HERE. Location Platform. https://here.com, 2019.
-  Yuenan Hou, Zheng Ma, Chunxiao Liu, and Chen Change Loy. Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8433–8440, 2019.
-  Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
-  Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, and Jiebo Luo. End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions. In International Conference on Pattern Recognition, pages 2289–2294, 2018.