Self-driving cars have moved from desert terrain, spearheaded by the DARPA Grand Challenge, through highways, and into populated cities. Traditionally, a mediated perception approach was used, in which the entire scene is parsed into sub-problems whose solutions are combined to make a driving decision. More recent deep learning architectures are end-to-end models [4, 16, 8, 3] that map input images directly to driving actions, predicting trajectories by supervised learning. Our three key contributions are:
End-to-end neural network system architectures that (i) encode multiple input representations (camera images, segmentation maps, semantic maps, semantic features), followed by (ii) a neural network that fuses these different modalities together, and finally (iii) separate decoding networks for generating each of the two outputs, steering angle and speed. A general prototype is shown in Figure 1, and we ensemble several variants of this architecture.
We use pre-trained models for generating segmentation maps of the input images and semantic information. These pre-trained models augment the existing inputs with rich additional information.
We leverage the fact that both speed and steering angle are smooth functions of time to improve the predictions.
Samples of front-facing training images. The training dataset consists of about 1.6 million images from each of the four sides of a car driving through Switzerland, covering a mixture of cities, highways, and rural areas. The videos were taken with a GoPro Hero 5 and are sampled at a rate of 10 frames per second. Each frame has a resolution of 1920x1080, and the videos are split into chapters of 5 minutes each. The training dataset has 548 chapters from 27 unique routes. In addition to the images, the training dataset also includes visual maps from HERE Technologies at each time step, a semantic map derived using a Hidden-Markov-Model path matcher, and the steering wheel angle and vehicle speed.
We evaluate our results on the Drive360 dataset in the Learning to Drive Challenge, where we win the top three positions overall, with prior state-of-the-art methods ranking below 10th position. The dataset consists of videos of around 55 hours of recorded driving in Switzerland, along with the associated driving speed and steering angle. Sample images from the training set are shown in Figure 2 and sample test images are shown in Figure 3. The dataset consists of about 2 million images recorded using a GoPro Hero 5 facing the front, left, right, and rear of the vehicle. The dataset also includes images of visual maps from HERE Technologies and their corresponding semantic map features. Specifically, the semantic map consists of 21 fields as well as GPS latitude, longitude, and precision. All images and data are sampled at 10 frames per second. The data is separated into 5-minute chapters. In total, there are 682 chapters for 27 different routes. The data is randomly sampled into 548 chapters for training, 36 chapters for validation, and 98 chapters for testing.
We show that jointly predicting steering angle and vehicle speed is improved by using segmentation maps and data augmentation. Motivated by the use of semantic segmentation models for self-driving vehicles [16, 13], we concatenate the segmentation maps with the images as the input instead of using the maps as an additional learning objective. We augment the dataset by applying transformations including mirroring, adjusting brightness, and geometric transformations. We ensemble three neural network architectures to achieve the best performance.
2 Related Work
End-to-end deep learning models have been trained to predict steering angle given only the front camera view. As humans have a wider perceptual field than the front camera, 360-degree view datasets have been collected with additional route planner maps [10, 11]. Neural network models have been trained end-to-end using these 360-degree views, which are specifically useful for navigating cities and crossing intersections. Map data has been demonstrated to improve steering angle prediction accuracy. LSTM models achieve good results predicting steering angle by taking into account long-range dependencies. Another improvement is the usage of event cameras instead of traditional cameras, which capture moving edges. Since predicting steering angle alone is insufficient for self-driving cars, a multi-task learning framework is used to predict both speed and steering angle in an end-to-end fashion. Adding a segmentation loss to fully connected layers has been shown to improve overall performance by learning a better feature representation. Recently, ChauffeurNet predicted trajectories using a mid-level controller, allowing a trajectory to be predicted once and transferred to many vehicle types, avoiding the need to retrain a model for every different vehicle type.
We pre-process the data by conducting down-sampling, normalization, semantic map imputation and data augmentation to allow for fast experimentation and improved results. We also augment the dataset using segmentation masks derived from a pre-trained model. The different modalities and pre-processed data are fed into two types of models, one of which uses the images and semantic map information and pre-trained ResNets, and another of which takes as input the images as well as segmentation masks, using a combination of pre-trained and fully-trained networks.
For efficient experimentation we down-sample the dataset in both space and time, as shown in Table 1. Although this initial pre-processing is time consuming, down-sampling enables us to train models faster by orders of magnitude, reducing computation time from days to minutes. The initial image size is also prohibitive due to memory limitations on our GPUs and significantly slows training. In both cases, an ensemble method is used to combine results, yielding the best results in the competition.
| Dataset | Full | Sample 1 | Sample 2 | Sample 3 |
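The space/time down-sampling described above can be sketched as follows. This is a minimal NumPy sketch; the stride factors are illustrative assumptions (a spatial factor of 6 maps 1920x1080 down to 320x180, and a temporal factor of 10 maps 10 fps down to 1 fps), not the exact factors in Table 1.

```python
import numpy as np

def downsample(frames, spatial_factor=6, temporal_factor=10):
    """Down-sample a video clip in space and time.

    frames: array of shape (T, H, W, C)
    spatial_factor: keep every k-th pixel in H and W (nearest-neighbour);
                    factor 6 maps 1920x1080 to 320x180
    temporal_factor: keep every k-th frame (10 fps -> 1 fps for factor 10)
    """
    sub = frames[::temporal_factor]                   # subsample in time
    sub = sub[:, ::spatial_factor, ::spatial_factor]  # subsample in space
    return sub

# A small synthetic clip: 20 frames of 12x18 RGB images.
clip = np.zeros((20, 12, 18, 3), dtype=np.uint8)
small = downsample(clip)  # shape (2, 2, 3, 3)
```

Nearest-neighbour striding is the cheapest option; proper image resizing (e.g. bilinear) would preserve more detail at the same output size.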
The images are normalized in all models using the training-set mean and standard deviation. Similarly, the steering angle and speed are normalized for training using their training-set means and standard deviations.
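The target normalization amounts to a standard z-score transform. A minimal sketch, where the mean and standard deviation values are illustrative placeholders rather than the dataset's actual statistics:

```python
import numpy as np

# Illustrative training-set statistics (not the paper's actual values).
ANGLE_MEAN, ANGLE_STD = 0.0, 50.0    # steering angle, degrees
SPEED_MEAN, SPEED_STD = 40.0, 20.0   # vehicle speed, km/h

def normalize(x, mean, std):
    """Map raw targets to zero-mean, unit-variance values for training."""
    return (x - mean) / std

def denormalize(x, mean, std):
    """Map network outputs back to physical units for evaluation."""
    return x * std + mean

speeds = np.array([20.0, 40.0, 60.0])
z = normalize(speeds, SPEED_MEAN, SPEED_STD)   # [-1.0, 0.0, 1.0]
```

Predictions are de-normalized with the same training statistics before computing the competition's MSE metrics.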
We use 20 of the numerical features, including latitude, longitude, speed limit from a navigation map, free flow speed (average driving speed based on the underlying road geometry), headings and road indices at intersections, as well as road distance to the next signal, yield, pedestrian crossing, and intersection. We impute missing values with zeros rather than the mean value across the chapter or another placeholder, so as not to use future data. For one model, we add 27 additional dummy variables indicating the folder in which an image was located. Although the semantic map has information on location, the folder provides another view that may help the training process.
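The imputation and folder-indicator steps can be sketched in a few lines. This is a hypothetical helper, not the paper's code; the point is that a constant fill value avoids leaking chapter-level (i.e. future-frame) statistics into the current prediction:

```python
import math

def impute(feature_row):
    """Replace missing semantic-map values with zero.

    Using a constant rather than a per-chapter mean avoids leaking
    information from future frames into the current prediction.
    """
    return [0.0 if v is None or (isinstance(v, float) and math.isnan(v)) else v
            for v in feature_row]

def route_dummies(folder_index, n_folders=27):
    """One-hot indicator for the folder/route an image came from."""
    d = [0.0] * n_folders
    d[folder_index] = 1.0
    return d
```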
We augmented the dataset by applying several transformations to the training data: (i) random horizontal flips of frontal images with probability 0.5 (the steering angle is multiplied by -1 to offset the flip), (ii) random changes to image brightness by a factor between 0.2 and 0.75 with probability 0.1, and (iii) random translations and rotations of the frontal image with probability 0.25.
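The horizontal-flip augmentation is the one transform that must also change the label: mirroring the scene negates the steering angle. A minimal sketch (brightness and geometric transforms are omitted; they follow the same pattern with their stated probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, angle, p=0.5):
    """Randomly flip a frontal image left-right with probability p.

    The steering angle is negated so the label still matches the
    mirrored scene; vehicle speed is unaffected by the flip.
    """
    if rng.random() < p:
        image = image[:, ::-1]  # flip the width axis
        angle = -angle
    return image, angle
```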
We used a pre-trained segmentation model [18, 7] to generate segmentation masks for each of the images in the dataset. The original pre-trained model contains 34 classes, of which we consider only the 19 that include relevant objects (such as road, car, parking, and wall) that influence the steering angle.
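Feeding the masks to the network means stacking them with the image along the channel axis. Whether the mask enters as a single label channel or one-hot class planes is not specified in the text, so the one-hot variant below is an assumption, shown only to make the input shape concrete:

```python
import numpy as np

def stack_image_and_mask(image, mask, n_classes=19):
    """Concatenate an RGB image with a one-hot segmentation mask.

    image: float or uint8 array of shape (H, W, 3)
    mask:  integer class-label array of shape (H, W), values in [0, n_classes)
    Returns an array of shape (H, W, 3 + n_classes) used as network input.
    """
    onehot = np.eye(n_classes, dtype=np.float32)[mask]      # (H, W, n_classes)
    return np.concatenate([image.astype(np.float32), onehot], axis=-1)
```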
We experimented with three different network architectures which won 1st, 2nd, and 3rd places in the competition.
3.1.1 Model 1 network architecture (1st place)
We include the current and previous frame in each iteration, which are 0.4 seconds apart. The images are fed into a pre-trained ResNet34 or ResNet152 model. We feed the semantic map into a fully connected network with two hidden layers of dimensions 256 and 128 with ReLU activations. The outputs from the fully connected and ResNet models are concatenated and fed into a long short-term memory (LSTM) network. The LSTM output is then concatenated with the current semantic map information, if used, and passed to both an angle regressor and a speed regressor, which predict the current steering angle and speed.
3.1.2 Model 2 network architectures (2nd place)
Model 2-single takes as input a single image and its corresponding segmentation mask. The input is passed through a DenseNet121 architecture, followed by decoders for predicting speed and angle, implemented as fully connected networks with three dense layers of sizes 200, 50, and 10. In between each dense layer we apply batch normalization and a ReLU non-linearity. The final output is a real-valued number, the predicted speed or steering angle, which is normalized using the mean and standard deviation from the training set. Model 2-stacked takes as input a sequence of 10 images, where each image is concatenated with its corresponding segmentation mask.
Model 2-sequence shown in Figure 8 takes as input a sequence of 10 images, and their corresponding segmentation masks, similar to Model 2-stacked. Each image in the input sequence is passed individually through a pre-trained ResNet34 model and a pre-trained DenseNet201. Additionally, the input images are concatenated with their corresponding segmentation masks and passed through model 2-single. The resulting outputs are concatenated and passed through an intermediate layer which contains two dense layers of dimensions 512 and 128 with dropout. The output is then passed into a bi-directional GRU. This is performed for each input image/mask pair in the sequence, and the output of the GRU is concatenated with the input to the previous fully connected layer which is also passed through a dense block. Finally, this representation is passed into two decoders for speed and steering angle prediction, each consisting of three fully connected layers of dimensions 256, 128 and 32. We ensemble the various model 2 variants.
3.1.3 Model 3 network architecture (3rd place)
The model uses a pre-trained ResNet for both frontal images and HERE maps, while the semantic map features are passed through a fully connected network. The output feature vector of the frontal images is fed into an LSTM model. Finally, the output of the LSTM and the outputs of the ResNet models are concatenated along with the HERE numerical (semantic) features and fed into the angle and speed regressors.
3.2.1 Model 1 implementation details (1st place)
Hyper-parameters for network training.
All models are trained using the Adam optimizer with an initial learning rate of 0.0001 and without weight decay. In all model variants we optimize the sum of the mean squared errors (MSE) of the steering angle and speed. Minibatch sizes are 8, 32, or 64, limited by GPU memory. Depending on the run, images are of dimensions 320x180 or 160x90, using ResNet34 or ResNet152 models. Table 2 summarizes the hyperparameters and image settings used for each model 1 variant.
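The joint objective is simply the sum of the two per-task MSEs, so neither task dominates by construction (their relative scale is set by the target normalization). A minimal NumPy sketch of the loss:

```python
import numpy as np

def combined_mse(pred_angle, true_angle, pred_speed, true_speed):
    """Training objective: sum of the MSEs of steering angle and speed.

    All four arguments are 1-D arrays over a minibatch, in normalized units.
    """
    mse_angle = np.mean((pred_angle - true_angle) ** 2)
    mse_speed = np.mean((pred_speed - true_speed) ** 2)
    return mse_angle + mse_speed
```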
We quantize the steering angles and speeds from the training dataset into 100 and 30 evenly spaced bins, respectively, and use these discrete distributions to compute a weighted average of predicted values. Table 2 shows the performance of each individual model 1 variant and the ensemble.
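One plausible reading of this post-processing step is sketched below: build evenly spaced bins over the training targets, then replace a raw prediction by the frequency-weighted mean of nearby bin centers. The windowing scheme is an assumption; the text does not specify exactly how the discrete distribution weights the prediction.

```python
import numpy as np

def bin_stats(values, n_bins):
    """Quantize training values into evenly spaced bins; return the
    bin centers and their empirical probabilities."""
    counts, edges = np.histogram(values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    probs = counts / counts.sum()
    return centers, probs

def smooth_prediction(raw, centers, probs, window):
    """Frequency-weighted mean of training-bin centers within +/- window
    of the raw prediction (a hypothetical weighting scheme)."""
    near = np.abs(centers - raw) <= window
    if not near.any() or probs[near].sum() == 0:
        return float(raw)
    w = probs[near] / probs[near].sum()
    return float(np.dot(w, centers[near]))
```

This pulls rare, extreme predictions toward values that actually occur in the training data, which is consistent with exploiting the smoothness of the targets.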
Preprocessing time is around 3 days of computation on an Intel i7-4790k CPU using 6 cores, with data stored on a 7200 rpm hard drive. The largest bottleneck in preprocessing is the I/O due to the large number of images. Model 1 variants were trained using a single Nvidia GPU. Each epoch required approximately 10-12 hours.
3.2.2 Model 2 implementation details (2nd place)
This model uses the Adam optimizer with an initial learning rate of 0.0003. We implement learning rate decay to 0.0001, 0.00005, and 0.00003 after the first 5, 15, and 20 epochs. Overall, the model was trained for 90 epochs, with a batch size of 13 for training, validation, and testing. The loss criterion for both speed and steering angle was the MSE loss, and the overall model loss was defined as the sum of the speed loss and steering angle loss. Model 2-stacked was trained in an identical manner, but the lowest MSE loss was achieved after the 14th epoch. We used a similar method to train all model 2 variants, with slight changes in learning rate and learning rate decay tuned over multiple runs. Our models did not require many epochs to reach their lowest validation MSE, and an ensemble of the individual models' results outperformed each model individually.
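The step schedule described above can be written as a small lookup. Where the decay takes effect exactly at the boundary epoch is an assumption in this sketch:

```python
def model2_lr(epoch):
    """Step learning-rate schedule for model 2 training:
    3e-4 initially, decayed after epochs 5, 15, and 20."""
    if epoch < 5:
        return 0.0003
    if epoch < 15:
        return 0.0001
    if epoch < 20:
        return 0.00005
    return 0.00003
```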
The model is trained using the Adam optimizer with an initial learning rate of 0.003, which is halved after epochs 20, 30 and 40. We use the same combination loss as in model 2-single and model 2-stacked, summing the MSE of speed and steering angle. Taking an ensemble of model 2 variants decreased the MSE for speed from 6.115 to 5.312, and the total MSE for steering from 925 to 901, achieving 2nd place overall in the competition.
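The exact combination rule behind the model 2 ensemble is not stated; the simplest baseline is an unweighted per-frame mean of the variants' predictions, sketched here as an assumption:

```python
import numpy as np

def ensemble(predictions):
    """Average per-frame predictions from several model variants.

    predictions: list of 1-D arrays, one per model, each of shape (T,).
    Returns the unweighted mean prediction of shape (T,); the actual
    combination used in the paper may be weighted differently.
    """
    return np.mean(np.stack(predictions, axis=0), axis=0)
```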
Hyper-parameters for network training.
In each training run of the model, we save only the best model for speed and the best model for steering angle, which were then separately stored and used for predicting values on the test set.
3.2.3 Model 3 implementation details (3rd place)
We used the HERE map images, HERE numerical features, and frontal images as inputs. In addition to the baseline model, we passed the HERE map images into a pre-trained ResNet and concatenated its output with the output from the encoded frontal images and the HERE numerical features. The concatenated feature vector is decoded by fully connected networks for predicting steering angle and speed.
Hyper-parameters for network training.
The model is trained using the Adam optimizer with a learning rate of 0.0001. We use a ResNet34 for the HERE map images and a ResNet50 for frontal images. We apply dropout with probabilities 0.2 and 0.5, and a batch size of 64 for training.
Pre-processing time is around 6 hours on an AMD 2700X CPU, with data stored on a 7200 rpm hard drive. The model is trained using an Nvidia GTX 1080. Each epoch requires approximately 5 minutes to complete.
We won the ICCV 2019 Learning to Drive Challenge, taking the first three places overall, with previous state-of-the-art methods ranked below 10th place. Figure 10 shows a sample of test prediction results: ground-truth angle and speed are in blue, and predicted angle and speed are in red. The entire video is available online. Figure 11 shows the predicted path for two driving chapters, where the axis scales represent distance in kilometers. Table 2 summarizes the results of model 1, which won 1st place overall by fusing different modalities with a neural network. We train each of our five model 1 variants for a varying number of epochs to make efficient use of our computational resources. Table 2 shows that the most significant improvement over the single models comes from including the HERE semantic map, which decreases the total MSE on the angle metric by approximately 300 points. This is most apparent between model 1 variants 1 and 4. Including the city location as part of the semantic map has a marginal benefit, as seen in the change from model 1 variant 4 to variant 5. Notably, model 1 variants 3 and 4 tend to overfit: our best result for both is after 2 epochs, and the MSE slowly increases with each additional epoch, while the training loss for both models decreases throughout training, as shown in Figure 12. The performance of the ensemble method on the different road types is shown in Table 3.
Table 4 summarizes the results of model 2, which won 2nd place overall by augmenting the data with its segmentation maps. Figure 13 shows the distribution of absolute steering angles between 0 and 180 degrees in bins of 5 degrees each. As expected, the number of instances per bin decreases exponentially as the angle increases. Finally, Figure 14 shows the total MSE for the top 50 ranked teams, with our models achieving the best performance overall, winning 1st, 2nd, and 3rd places.
| Model | CNN | Dimensions | Semantic Map | Batch | Epochs | MSE Angle | MSE Speed | Combined |
| Model 1 | MSE Angle | MSE Speed |
Model 2-sequence achieved the lowest speed MSE on the test set, at 6.115, while the best-performing single model for steering angle was model 2-stacked, with a total MSE of 925.926. The results are summarized in Table 4. Model 2-stacked performed best for steering as a result of the data augmentation, which included horizontal flipping of images in a sequence, combined with the model receiving the full sequence as input.
| Model 2 | MSE Speed | Total MSE Angle |
In conclusion, our main contributions are (i) fusing together multiple modalities, (ii) augmenting the inputs with segmentation maps generated from a pre-trained model, (iii) leveraging the fact that the outputs are smooth functions, and (iv) ensembling diverse model architectures, which together won the ICCV Learning to Drive Challenge. We demonstrate high-quality steering angle and speed predictions that yield accurate driving trajectories. In the spirit of reproducible research, we make our models and code publicly available.
-  (2019) ICCV 2019: Learning-to-Drive Challenge Winning Test Prediction Result Video. Note: https://www.aicrowd.com/challenges/32/submissions/20194[Online; accessed 15-Nov-2019] Cited by: §4.
-  (2019) ICCV 2019: Learning-to-Drive Challenge. Note: https://www.aicrowd.com/challenges/iccv-2019-learning-to-drive-challenge[Online; accessed 15-Nov-2019] Cited by: §4.
-  (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §1, §2.
-  (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §1, §2.
-  (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. Cited by: §1.
-  (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §1.
-  (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §3.
-  (2017) Going deeper: autonomous steering with neural memory networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 214–221. Cited by: §1, §2.
-  (2019) Camera. https://gopro.com. Cited by: §3.1.1, §3.1.3.
-  (2018) End-to-end learning of driving models with surround-view cameras and route planners. In Proceedings of the European Conference on Computer Vision, pp. 435–453. Cited by: §1, §2.
-  (2019) Learning accurate, comfortable and human-like driving. arXiv preprint arXiv:1903.10995. Cited by: §1, §2.
-  (2019) Location Platform. https://here.com. Cited by: §3.1.1, §3.1.3.
-  (2019) Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8433–8440. Cited by: §1.
-  (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5419–5427. Cited by: §2.
-  (2019) Models and Code. Note: https://www.dropbox.com/sh/ucbsh0wc6xepv4r/AAAYV1x-93n5XjP4alYCI8LYa?dl=0[Online; accessed 15-Nov-2019] Cited by: Accurate Trajectory Prediction for Autonomous Vehicles.
-  (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2174–2182. Cited by: §1, §1, §2.
-  (2018) End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions. In Proceedings of the IEEE International Conference on Pattern Recognition, pp. 2289–2294. Cited by: §2.
-  (2019) Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8856–8865. Cited by: §3.