Accurate Trajectory Prediction for Autonomous Vehicles

11/18/2019
by Michael Diodato, et al.

Predicting a vehicle's trajectory, steering angle, and speed is important for safe and comfortable driving. We demonstrate the best predicted angle, the best predicted speed, and the best overall performance, winning the top three places of the ICCV 2019 Learning to Drive challenge. Our key contributions are (i) a general neural network system architecture that encodes and fuses multiple inputs and decodes multiple outputs using neural networks, (ii) the use of pre-trained neural networks to augment the given input data with segmentation maps and semantic information, and (iii) leveraging the form and distribution of the expected output in the model.


1 Introduction

Self-driving cars have moved from driving in desert terrain, spearheaded by the DARPA Grand Challenge, through highways, and into populated cities. Traditionally, a mediated perception approach was used in which the entire scene is parsed to make a decision by solving sub-problems [6]. Deep learning architectures consist of end-to-end models [4, 16, 8, 3] that directly map input images to driving actions [5], predicting trajectories by supervised learning. Our three key contributions are:

  1. End-to-end neural network system architectures that (i) encode multiple input representations (camera images, segmentation maps, semantic maps, semantic features), followed by (ii) a neural network that fuses these different modalities together, and finally (iii) separate decoding networks for generating both outputs. A general prototype is shown in Figure 1 and we ensemble several variants of this architecture.

  2. We use pre-trained models for generating segmentation maps of the input images and semantic information. The pre-trained models augment the existing inputs with rich additional information.

  3. We leverage the fact that both speed and steering angles are smooth functions to improve the predictions.

Figure 1: Network architecture: inputs in green, neural networks in blue, intermediate outputs in red, and final outputs in orange. A cycle denotes a sequence model with multiple time steps.
Figure 2: Samples of front facing training images. The training dataset consists of about 1.6 million images from each of the four sides of a car driving through Switzerland and consists of a mixture of cities, highways, and rural areas. The videos were taken by a GoPro Hero 5 and are sampled at a rate of 10 frames per second. Each frame has a resolution of 1920x1080, and the videos are split into chapters of 5 minutes each. The training dataset has 548 chapters from 27 unique routes. In addition to the images, the training dataset also includes visual maps from HERE Technologies at each time, a semantic map that is derived using a Hidden-Markov-Model path matcher, and the steering wheel angle and vehicle speed.

Figure 3: Sample of front facing test images. The test dataset consists of 289,663 images from each of the four sides of a vehicle driving through Switzerland and consists of a mixture of cities, highways, and rural areas. The test image dataset consists of 98 chapters from 27 unique routes.

We evaluate our results on the Drive360 dataset [10] in the Learning to Drive Challenge. We win the top three positions overall, compared with state-of-the-art methods ranking below 10th position [11]. The dataset consists of around 55 hours of recorded driving in Switzerland, along with the associated driving speed and steering angle. Sample images from the training set are shown in Figure 2 and sample test images are shown in Figure 3. The dataset consists of about 2 million images recorded using a GoPro Hero5 facing the front, left, right, and rear of the vehicle. The dataset also includes images of visual maps from HERE Technologies and their corresponding semantic map features. Specifically, the semantic map consists of 21 fields as well as GPS latitude, longitude, and precision. All images and data are sampled at 10 frames per second. The data is separated into 5-minute chapters. In total, there are 682 chapters for 27 different routes. The data is randomly sampled into 548 chapters for training, 36 chapters for validation, and 98 chapters for testing.

We show that jointly predicting steering angle and vehicle speed is improved by using segmentation maps and data augmentation. Motivated by the use of semantic segmentation models for self driving vehicles [16, 13], we concatenate the segmentation maps with the images as the input instead of using the maps as an additional learning objective. We augment the data set by applying transformations including mirroring, adjusting brightness, and geometric transformations. We ensemble three neural network architectures to achieve the best performance.

2 Related Work

End-to-end deep learning models have been trained to predict steering angle given only the front camera view [4]. As humans have a wider perceptual field than the front camera, 360-degree view datasets have been collected with additional route planner maps [10, 11]. Neural network models have been trained end-to-end using these 360-degree views, which are especially useful for navigating cities and crossing intersections. Map data has been demonstrated to improve the steering angle prediction accuracy [10]. LSTM models achieve good results predicting steering angle by taking into account long range dependencies [8]. Another improvement is the use of event cameras instead of traditional cameras, which capture moving edges [14]. Since predicting steering angle alone is insufficient for self-driving cars, a multi-task learning framework has been used to predict both speed and steering angle in an end-to-end fashion [17]. Adding a segmentation loss to fully connected layers has been shown to improve overall performance by learning a better feature representation [16]. Recently, ChauffeurNet [3] predicted trajectories using a mid-level controller, allowing a trajectory to be predicted once and the results transferred to many vehicle types, avoiding the need to retrain a model for every different vehicle type.

Figure 4: Sample visual maps from HERE Technologies for each set of GoPro images in the training and test sets at a 10 frames per second sampling rate.
Figure 5: Sample front, rear, left, and right images for the same time and location. The corresponding HERE map for the instance is shown in Figure 4 (right). Each sampling of the dataset includes images in the four directions and a HERE map.

3 Methods

We pre-process the data by down-sampling, normalization, semantic map imputation, and data augmentation to allow for fast experimentation and improved results. We also augment the dataset with segmentation masks derived from a pre-trained model. The different modalities and pre-processed data are fed into two types of models: one that uses the images, the semantic map information, and pre-trained ResNets, and another that takes as input the images along with segmentation masks, using a combination of pre-trained and fully-trained networks.

Down-sampling.

For efficient experimentation we down-sample the dataset in both space and time as shown in Table 1. Although this initial pre-processing is time consuming, down-sampling lets us train models orders of magnitude faster, reducing computation time from days to minutes. The initial image size is also prohibitive due to memory limitations on our GPUs and significantly slows training. In both cases an ensemble method is used to combine results, yielding the highest-scoring results in the competition.
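As a rough illustration of this pre-processing step, the sketch below resizes frames and keeps every k-th one. The file layout, image format, and helper name are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal down-sampling sketch (assumed file layout and frame format, not the paper's exact script).
# Spatially resizes each frame and keeps every k-th frame to thin the 10 fps stream.
from pathlib import Path
from PIL import Image

def downsample_chapter(src_dir, dst_dir, size=(320, 180), keep_every=10):
    """Resize frames in src_dir to `size` and keep one frame out of every `keep_every`."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    frames = sorted(Path(src_dir).glob("*.jpg"))
    for i, frame in enumerate(frames):
        if i % keep_every != 0:          # temporal down-sampling
            continue
        img = Image.open(frame).convert("RGB")
        img.resize(size, Image.BILINEAR).save(dst / frame.name)  # spatial down-sampling

# Example: shrink a 1920x1080 chapter to 320x180, keeping 1 in 10 frames.
# downsample_chapter("chapters/ch001/front", "chapters_small/ch001/front")
```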

Dataset | Full      | Sample 1 | Sample 2 | Sample 3
Res.    | 1920x1080 | 640x360  | 320x180  | 160x90
Train   | 1,600,000 | 160,000  | 80,000   | 40,000
Val.    | 106,000   | 10,600   | 5,300    | 2,650
Test    | 290,000   | 29,000   | 14,500   | 7,250
Table 1: Data sampling in both space and time for efficiency.
Normalization.

In all models, the images are normalized by their mean and standard deviation. Similarly, the steering angle and speed are normalized for training using their means and standard deviations.

Semantic map.

We use 20 of the numerical features, including latitude, longitude, the speed limit from a navigation map, the free-flow speed (average driving speed based on the underlying road geometry), headings and road indices at intersections, and the road distance to the next signal, yield sign, pedestrian crossing, and intersection. We impute missing values with zeros rather than the mean value across the chapter or another placeholder, to avoid using future data. For one model, we add 27 additional dummy variables indicating the folder in which an image is located. Although the semantic map contains location information, the folder provides another view that we consider potentially helpful to the training process.
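A minimal pandas sketch of this feature preparation is shown below; the column and folder names are hypothetical placeholders, since the exact schema of the HERE semantic map is not reproduced here.

```python
# Hedged sketch of the semantic-map feature preparation described above.
# Column names and the folder identifier are illustrative, not the dataset's exact schema.
import pandas as pd

def prepare_semantic_features(df: pd.DataFrame, numeric_cols, folder_col="chapter_folder"):
    feats = df[numeric_cols].fillna(0.0)          # zero-imputation: never peek at future rows
    folder_dummies = pd.get_dummies(df[folder_col], prefix="folder")  # ~27 folder indicators
    return pd.concat([feats, folder_dummies], axis=1)

# Example with made-up columns:
# numeric_cols = ["latitude", "longitude", "speed_limit", "free_flow_speed"]
# X = prepare_semantic_features(here_df, numeric_cols)
```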

Data augmentation.

We augmented the dataset by applying several transformations to the training data: (i) random horizontal flips of frontal images with probability 0.5 (the steering angle is multiplied by -1 to offset the flip), (ii) random changes to image brightness by a factor between 0.2 and 0.75 with probability 0.1, and (iii) random translations and rotations of the frontal image with probability 0.25.
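The snippet below sketches this augmentation for a frontal image and its steering angle using torchvision. The probabilities follow the text, while the translation and rotation magnitudes and the steering sign convention are illustrative assumptions.

```python
# Sketch of the paired image/label augmentation (probabilities from the text, exact magnitudes assumed).
import random
import torchvision.transforms.functional as TF

def augment(front_img, angle):
    """front_img: PIL image of the frontal camera; angle: steering angle (sign convention assumed)."""
    if random.random() < 0.5:                      # (i) horizontal flip, negate the steering angle
        front_img = TF.hflip(front_img)
        angle = -angle
    if random.random() < 0.1:                      # (ii) brightness change by a factor in [0.2, 0.75]
        front_img = TF.adjust_brightness(front_img, random.uniform(0.2, 0.75))
    if random.random() < 0.25:                     # (iii) small translation / rotation (magnitudes assumed)
        front_img = TF.affine(front_img, angle=random.uniform(-3, 3),
                              translate=(random.randint(-10, 10), random.randint(-10, 10)),
                              scale=1.0, shear=0.0)
    return front_img, angle
```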

Segmentation mask.

We used a pre-trained segmentation model [18, 7] to generate segmentation masks for each image in the dataset. The original pre-trained model contains 34 classes, of which we keep only the 19 that correspond to relevant objects (such as road, car, parking, wall, etc.) that influence the steering angle [7].
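For illustration, the sketch below produces per-pixel masks with an off-the-shelf torchvision DeepLabV3 model standing in for the pre-trained model of [18]. The class-filtering step mirrors the idea of keeping only relevant classes, but the model choice and class ids here are assumptions.

```python
# Illustrative only: a torchvision DeepLabV3 stands in for the pre-trained model of [18]
# to show how per-pixel class masks can be produced and restricted to relevant classes.
import torch
from torchvision import models, transforms
from PIL import Image

seg_model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def segmentation_mask(img: Image.Image, keep_classes):
    x = preprocess(img).unsqueeze(0)               # 1 x 3 x H x W
    logits = seg_model(x)["out"]                   # 1 x C x H x W
    mask = logits.argmax(dim=1).squeeze(0)         # H x W class ids
    keep = torch.zeros_like(mask, dtype=torch.bool)
    for c in keep_classes:                         # keep only classes deemed relevant (e.g. road, car)
        keep |= mask == c
    return torch.where(keep, mask, torch.zeros_like(mask))
```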

3.1 Architecture

We experimented with three different network architectures which won 1st, 2nd, and 3rd places in the competition.

3.1.1 Model 1 network architecture (1st place)

Figure 6 shows Model 1's network architecture. The inputs consist of the front facing GoPro [9] images and the semantic map from HERE Technologies [12]. We include the current and previous frame in each iteration, which are 0.4 seconds apart. The images are fed into a pre-trained ResNet34 or ResNet152 model. We feed the semantic map into a fully connected network with two hidden layers of dimensions 256 and 128 with ReLU activations. The outputs of the fully connected network and the ResNet are concatenated and fed into a long short-term memory (LSTM) network. The output of the LSTM is then concatenated with the current semantic map information, if used, and passed to both an angle regressor and a speed regressor, which predict the current steering angle and speed.

Figure 6: Model 1: The network consists of a pre-trained ResNet and a fully connected network that feed into an LSTM model. The LSTM output and the output of the ResNet model on the current image are fed into an angle and a speed regressor. Each regressor consists of 3 blocks of a linear layer, a ReLU activation, and a 10% dropout layer. The hidden layers have dimensions 1024, 512, and 256.
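A condensed PyTorch sketch of this architecture is given below, assuming 512-dimensional ResNet features and a 128-unit LSTM. It follows the layer sizes stated above, but the exact plumbing is an assumption rather than the authors' implementation.

```python
# Condensed sketch of Model 1 (layer sizes follow the text; wiring details are assumed).
import torch
import torch.nn as nn
from torchvision import models

class Regressor(nn.Module):
    def __init__(self, in_dim, hidden=(1024, 512, 256)):
        super().__init__()
        layers, d = [], in_dim
        for h in hidden:                           # 3 blocks of Linear -> ReLU -> Dropout(0.1)
            layers += [nn.Linear(d, h), nn.ReLU(), nn.Dropout(0.1)]
            d = h
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class Model1(nn.Module):
    def __init__(self, sem_dim=20, hidden=128):
        super().__init__()
        resnet = models.resnet34(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # 512-d image features
        self.sem = nn.Sequential(nn.Linear(sem_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU())
        self.lstm = nn.LSTM(512 + 128, hidden, batch_first=True)
        self.angle = Regressor(hidden + 512)
        self.speed = Regressor(hidden + 512)

    def forward(self, frames, sem):
        # frames: B x T x 3 x H x W (previous and current frame), sem: B x T x sem_dim
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(B, T, 512)
        fused = torch.cat([feats, self.sem(sem)], dim=-1)
        out, _ = self.lstm(fused)
        head_in = torch.cat([out[:, -1], feats[:, -1]], dim=-1)   # LSTM state + current-frame features
        return self.angle(head_in), self.speed(head_in)
```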

3.1.2 Model 2 network architectures (2nd place)

Figures 7 and 8 show two variants of model 2 that differ in their inputs: one taking a single image as input and the other taking a sequence of images as input.

Model 2-single takes as input a single image and its corresponding segmentation mask. The input is passed through a DenseNet121 architecture, followed by decoders for predicting speed and angle, implemented as fully connected networks with three dense layers of sizes 200, 50, and 10. Between each pair of dense layers we apply batch normalization and a ReLU non-linearity. The final output is a real-valued number, the predicted speed or steering angle, normalized using the mean and standard deviation from the training set. Model 2-stacked takes as input a sequence of 10 images, where each image is concatenated with its corresponding segmentation mask.

Figure 7: Model 2.1: model 2-single and model 2-stacked share the same architecture; the only difference is the input. For model 2-single, the input is a single image concatenated with its one-hot encoded segmentation mask, giving dimensions width x height x 23 (3 RGB channels + 20 classes in the segmentation mask). For model 2-stacked, the input is a full sequence of 10 images concatenated together with their segmentation masks; all images and masks are stacked to width x height x 230 ((3 RGB channels + 20 classes) x 10). These inputs are fed into a pre-trained DenseNet121. The final layers are two separate feed-forward towers, one for speed and one for steering angle. Each of the feed-forward layers, except for the output layer, is followed by batch normalization and a ReLU activation.
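The following sketch shows one way to realize model 2-single in PyTorch: the first DenseNet121 convolution is widened to 23 input channels and two small towers regress speed and angle. The widened first layer necessarily loses its pre-trained weights; whether the authors handled this differently is not stated, so treat the wiring as an assumption.

```python
# Hedged sketch of model 2-single: a DenseNet121 trunk with a widened first convolution
# (3 RGB + 20 one-hot mask channels) and two small regression towers.
import torch
import torch.nn as nn
from torchvision import models

def make_tower(in_dim, dims=(200, 50, 10)):
    layers, d = [], in_dim
    for h in dims:                                  # Dense -> BatchNorm -> ReLU blocks
        layers += [nn.Linear(d, h), nn.BatchNorm1d(h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, 1))                  # single normalized speed or angle value
    return nn.Sequential(*layers)

class Model2Single(nn.Module):
    def __init__(self, in_channels=23):
        super().__init__()
        net = models.densenet121(weights="DEFAULT")
        net.features.conv0 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                                       padding=3, bias=False)   # widened, randomly initialized
        net.classifier = nn.Identity()              # expose the 1024-d DenseNet features
        self.trunk = net
        self.speed = make_tower(1024)
        self.angle = make_tower(1024)

    def forward(self, x):                           # x: B x 23 x H x W
        feats = self.trunk(x)
        return self.speed(feats), self.angle(feats)
```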

Model 2-sequence shown in Figure 8 takes as input a sequence of 10 images, and their corresponding segmentation masks, similar to Model 2-stacked. Each image in the input sequence is passed individually through a pre-trained ResNet34 model and a pre-trained DenseNet201. Additionally, the input images are concatenated with their corresponding segmentation masks and passed through model 2-single. The resulting outputs are concatenated and passed through an intermediate layer which contains two dense layers of dimensions 512 and 128 with dropout. The output is then passed into a bi-directional GRU. This is performed for each input image/mask pair in the sequence, and the output of the GRU is concatenated with the input to the previous fully connected layer which is also passed through a dense block. Finally, this representation is passed into two decoders for speed and steering angle prediction, each consisting of three fully connected layers of dimensions 256, 128 and 32. We ensemble the various model 2 variants.

Figure 8: Model 2-sequence: The input is a sequence of 10 images and the corresponding segmentation masks. The images are passed through a pre-trained ResNet34 and a pre-trained DenseNet121, while the segmentation masks are concatenated with their corresponding images and passed through model 2-single. For each image and segmentation mask, the hidden features after the convolutional layers are extracted and passed through 2 fully-connected layers. The result is a sequence of features, which is fed into a 3-layer bi-directional GRU with 64 hidden features. The output of the GRU is concatenated with the hidden features of the latest input image and its segmentation mask, which is finally fed into two feed-forward towers for speed and angle prediction.
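A simplified sketch of the sequence variant is shown below, using a single CNN encoder per frame (rather than the full ResNet/DenseNet/model 2-single trio) to keep the example short. The GRU and tower dimensions follow the text; the rest is assumed.

```python
# Condensed sketch of model 2-sequence: per-frame features are projected, run through a
# bi-directional GRU, and fused with the latest frame's features before the regression towers.
import torch
import torch.nn as nn
from torchvision import models

class Model2Sequence(nn.Module):
    def __init__(self, gru_hidden=64):
        super().__init__()
        resnet = models.resnet34(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])   # 512-d per-frame features
        self.project = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.2),
                                     nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2))
        self.gru = nn.GRU(128, gru_hidden, num_layers=3, batch_first=True, bidirectional=True)
        fused = 2 * gru_hidden + 128                 # GRU summary + latest-frame features
        def tower():
            return nn.Sequential(nn.Linear(fused, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 32), nn.ReLU(),
                                 nn.Linear(32, 1))
        self.speed, self.angle = tower(), tower()

    def forward(self, frames):                       # frames: B x 10 x 3 x H x W
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1).view(B, T, 512)
        seq = self.project(feats)                    # B x T x 128
        out, _ = self.gru(seq)                       # B x T x 2*gru_hidden
        fused = torch.cat([out[:, -1], seq[:, -1]], dim=-1)
        return self.speed(fused), self.angle(fused)
```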

3.1.3 Model 3 network architecture (3rd place)

Figure 9 shows Model 3's network architecture. The inputs consist of the front facing GoPro [9] images and the semantic map images and numerical features from HERE Technologies [12]. The model uses a pre-trained ResNet for both the frontal images and the HERE maps, while the semantic map features are passed through a fully connected network. The output feature vector of the frontal images is fed into an LSTM model. Finally, the output of the LSTM and the outputs of the ResNet models are concatenated with the HERE numerical features (semantics) and fed into the angle and speed regressors.

Figure 9: Model 3: The network consists of a pre-trained ResNet and a fully connected network that feed into an LSTM model for the frontal image input, and a pre-trained ResNet and fully connected network for the HERE map image input. Each regressor consists of 3 blocks of a linear layer, a ReLU activation, and 20% dropout. The hidden layers have dimensions 64, 32, and 1.

3.2 Implementation

3.2.1 Model 1 implementation details (1st place)

Hyper-parameters for network training.

All models are trained using the Adam optimizer with an initial learning rate of 0.0001 and without weight decay. In all model variants we optimize the sum of the mean squared errors (MSE) for the steering angle and speed. Minibatch sizes are 8, 32, or 64, limited by GPU memory. Depending on the run, the images have dimensions 320x180 or 160x90 and the backbone is a ResNet34 or ResNet152 model. Table 2 summarizes the hyperparameters and image settings used for each model 1 variant.
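A minimal training-step sketch with these settings is shown below, reusing the Model1 class from the earlier sketch; the data plumbing is assumed.

```python
# Minimal training-step sketch for model 1 (optimizer settings from the text; loop details assumed).
import torch

model = Model1()                                    # from the earlier Model 1 sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
mse = torch.nn.MSELoss()

def training_step(frames, sem, angle_gt, speed_gt):
    optimizer.zero_grad()
    angle_pred, speed_pred = model(frames, sem)
    # Optimize the sum of the angle and speed MSEs, as described above.
    loss = mse(angle_pred.squeeze(-1), angle_gt) + mse(speed_pred.squeeze(-1), speed_gt)
    loss.backward()
    optimizer.step()
    return loss.item()
```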

Ensemble.

We quantize the steering angles and speeds from the training dataset into 100 and 30 evenly spaced bins respectively and use these discrete distributions to compute a weighted average of predicted values. Table 2 shows the performance of each individual model 1 variant and the ensemble.
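The description above leaves some freedom in how the discretized training distributions weight the predictions. The NumPy sketch below shows one plausible reading, in which each variant's prediction is weighted by the training-set density of the bin it falls into; it is not necessarily the authors' exact rule.

```python
# One plausible reading of the ensembling rule: weight each variant's prediction by how
# frequently its bin occurs in the training distribution, then take a weighted average.
import numpy as np

def ensemble_predictions(preds, train_values, n_bins):
    """preds: per-model predictions for one sample; train_values: training-set targets."""
    hist, edges = np.histogram(train_values, bins=n_bins, density=True)
    bins = np.clip(np.digitize(preds, edges) - 1, 0, n_bins - 1)
    weights = hist[bins] + 1e-8                     # density of each prediction's bin
    return float(np.average(preds, weights=weights))

# angle_hat = ensemble_predictions(np.array(angle_preds), train_angles, n_bins=100)
# speed_hat = ensemble_predictions(np.array(speed_preds), train_speeds, n_bins=30)
```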

Computation.

Preprocessing takes around 3 days of computation on an Intel i7-4790K CPU using 6 cores, with data stored on a 7200 rpm hard drive. The largest bottleneck in preprocessing is I/O due to the large number of images. Model 1 variants were trained using a single Nvidia GPU. Each epoch required approximately 10-12 hours.

3.2.2 Model 2 implementation details (2nd place)

Model 2-single.

This model uses the Adam optimizer with an initial learning rate of 0.0003. We decay the learning rate to 0.0001, 0.00005, and 0.00003 after the first 5, 15, and 20 epochs, respectively. Overall the model was trained for 90 epochs, with a batch size of 13 for training, validation, and testing. The loss criterion for both speed and steering angle was the MSE, and the overall model loss was defined as the sum of the speed loss and the steering angle loss. Model 2-stacked was trained in an identical manner, but its lowest MSE loss was achieved after the 14th epoch. We used a similar method to train all model 2 variants, with slight changes in learning rate and learning rate decay tuned over multiple runs. Our models did not require many epochs to reach their lowest validation MSE, and an ensemble of the individual models outperformed each model individually.
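The learning-rate schedule described above can be expressed, for example, with a LambdaLR scheduler; the values come from the text, while the surrounding loop and the reuse of the Model2Single sketch are assumptions.

```python
# Sketch of the model 2-single learning-rate schedule (values from the text, wiring assumed).
import torch

def lr_for_epoch(epoch):
    if epoch < 5:
        return 3e-4
    if epoch < 15:
        return 1e-4
    if epoch < 20:
        return 5e-5
    return 3e-5

model2 = Model2Single()                             # from the earlier model 2-single sketch
optimizer = torch.optim.Adam(model2.parameters(), lr=lr_for_epoch(0))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: lr_for_epoch(epoch) / 3e-4)  # scale relative to the base LR

# for epoch in range(90):
#     train_one_epoch(...)
#     scheduler.step()
```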

Model 2-sequence.

The model is trained using the Adam optimizer with an initial learning rate of 0.003, which is halved after epochs 20, 30 and 40. We use the same combination loss as in model 2-single and model 2-stacked, summing the MSE of speed and steering angle. Taking an ensemble of model 2 variants decreased the MSE for speed from 6.115 to 5.312, and the total MSE for steering from 925 to 901, achieving 2nd place overall in the competition.

Hyper-parameters for network training.

In each training run of the model, we save only the best model for speed and the best model for steering angle, which were then separately stored and used for predicting values on the test set.

3.2.3 Model 3 implementation details (3rd place)

We used the HERE map images, HERE numerical features, and frontal images as inputs. In addition to the baseline model, we passed the HERE map images into a pre-trained ResNet and concatenated its output with the outputs from the encoded frontal images and the HERE numerical features. The concatenated feature vector is decoded by fully connected networks to predict steering angle and speed.

Hyper-parameters for network training.

The model is trained using the Adam optimizer with a learning rate of 0.0001. We use a ResNet34 for the HERE map images and a ResNet50 for the frontal images. We apply dropout with probabilities of 0.2 and 0.5 and a batch size of 64 for training.

Computation.

Pre-processing takes around 6 hours on an AMD 2700X CPU with data stored on a 7200 rpm hard drive. The model is trained using an Nvidia GTX 1080. Each epoch requires approximately 5 minutes to complete.

4 Results

We won the ICCV 2019 Learning to Drive Challenge [2], taking the first three places overall, with previous state-of-the-art methods placed below 10th. Figure 10 shows a sample of test prediction results: ground truth angle and speed are in blue, and predicted angle and speed are in red. The entire video is available online [1]. Figure 11 shows the predicted path for two driving chapters, where the axis scales represent distance in kilometers. Table 2 summarizes the results of model 1, which won 1st place overall by fusing together different modalities using a neural network. We train each of our five model 1 variants for a varying number of epochs to make efficient use of our computational resources. Table 2 shows that the most significant improvement in the single models comes from including the HERE semantic map, which decreases the total angle MSE by approximately 300 points; this is most apparent between model 1 variants 1 and 4. Including the city location as part of the semantic map has a marginal benefit, as seen by the change from model 1 variant 4 to 5. Notably, model 1 variants 3 and 4 tend to overfit: the best result for both is after 2 epochs, and the MSE slowly increases with each additional epoch, even though the training loss for both models decreases throughout training, as shown in Figure 12. The performance of the ensemble method on the different road types is shown in Table 3.

Table 4 summarizes the results of model 2, which won 2nd place overall by augmenting the data with its segmentation maps. Figure 13 shows the distribution of absolute steering angles between 0 and 180 degrees in bins of 5 degrees each. As expected, the number of instances per bin decreases exponentially as the angle increases. Finally, Figure 14 shows the total MSE for the top 50 ranked teams, with our models achieving the best performance overall, winning 1st, 2nd, and 3rd.

Figure 10: Sample of test results with the ground truth (in blue) and predicted values (in red) for steering angle and driving speed.
Figure 11: Sample predicted test paths: The vertical and horizontal axes represent distance in kilometers. Each trajectory starts at the zero coordinates and moves in the predicted direction and speed at each time point. The predictions for both steering angle and driving speed are 100 milliseconds apart and the trajectory is calculated with constant angle and speed between measurements.
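For readers who want to reproduce such path plots, the sketch below integrates per-timestep speed and steering-angle predictions into a 2D trajectory under a simple bicycle model. The wheelbase, steering ratio, and unit conventions are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: integrate speed and steering-angle predictions (100 ms apart, held constant
# between measurements) into a 2D path using an assumed bicycle model.
import math

def integrate_path(speeds_mps, steering_deg, dt=0.1, wheelbase=2.7, steer_ratio=16.0):
    x, y, heading = 0.0, 0.0, 0.0
    path = [(x, y)]
    for v, sw in zip(speeds_mps, steering_deg):
        wheel_angle = math.radians(sw) / steer_ratio             # steering wheel -> road wheel angle
        heading += (v / wheelbase) * math.tan(wheel_angle) * dt  # constant angle/speed over dt
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        path.append((x, y))
    return path
```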
Model    | CNN       | Dimensions | Semantic Map | Batch | Epochs | MSE Angle | MSE Speed | Combined
1        | ResNet34  | 320x180    | No           | 64    | 2      | 1111.437  | 5.866     | 1117.303
2        | ResNet152 | 320x180    | No           | 8     | 1      | 1211.434  | 5.461     | 1216.895
3        | ResNet34  | 160x90     | 20 Features  | 8     | 1      | 897.489   | 6.664     | 904.153
         |           |            |              |       | 2      | 883.501   | 6.403     | 889.904
         |           |            |              |       | 3      | 931.689   | 6.445     | 938.134
         |           |            |              |       | 4      | 970.96    | 6.714     | 977.674
         |           |            |              |       | 5      | 956.262   | 6.576     | 962.838
4        | ResNet34  | 320x180    | 20 Features  | 32    | 1      | 995.42    | 5.316     | 1000.736
         |           |            |              |       | 2      | 946.516   | 5.337     | 951.853
         |           |            |              |       | 3      | 989.013   | 5.519     | 994.532
         |           |            |              |       | 4      | 965.791   | 5.706     | 971.497
         |           |            |              |       | 5      | 987.572   | 5.846     | 993.418
5        | ResNet34  | 320x180    | 47 Features  | 64    | 1      | 900.407   | 5.571     | 905.978
Ensemble |           |            |              |       |        | 831.504   | 4.543     | 836.047
Table 2: Results for model 1 variants and their ensemble: The best overall result is an ensemble of the single models. Individually, we note that the inclusion of the semantic map reduces the MSE by about 300 (comparing models 1 and 4), and using the smaller image size provides an additional benefit (comparing models 3 and 4). Models 3 and 4 likely suffer from overfitting, as evidenced by the increasing test MSE even though the training loss decreases, as shown in Figure 12. The best standalone and overall models are in bold.
Model 1       | MSE Angle | MSE Speed
Overall       | 831.5     | 4.5
Zone30        | 2,981.1   | 0.3
Zone50        | 1,353.4   | 6.0
Zone80        | 168.6     | 4.1
Right         | 1,928.4   | 1.3
Straight      | 821.7     | 4.6
Left          | 833.6     | 2.6
Pedestrian    | 3,722.1   | 4.1
Traffic Light | 329.2     | 5.3
Yield         | 1,818.9   | 2.8
Table 3: Performance across various zones: the ensemble method did worst in the pedestrian sections and best in Zone80. Pedestrian sections are presumably the hardest to learn because of the unpredictability of cities and people, while Zone80 sections are likely straighter and require fewer changes in speed and steering angle, making them easier to learn. Likewise, Right and Left sections, which require learning a turn, are harder, whereas Straight segments are easier to learn and perform better.
Figure 12: Training loss of angle and speed MSE as a function of epoch.
Figure 13: The distribution of angles between 0 and 180 in absolute values for bins of 5 angles each.

Model 2-sequence achieved the lowest speed MSE on the test set at 6.115, while the best performing single model for steering angle was model 2-stacked, with a total MSE of 925.926. The results are summarized in Table 4. Model 2-stacked performed best for steering as a result of the combination of data augmentation, which included horizontal flipping of images in a sequence, and exposing the full sequence as the input to the model.

Model 2          | MSE Speed | Total MSE Angle
Model 2-single   | 7.440     | 1,140.875
Model 2-stacked  | 7.036     | 925.926
Model 2-sequence | 6.115     | 1,075.497
Average          | 5.312     | 901.802
Table 4: Results of model 2: The stacked variant achieved the lowest total MSE for steering and was second for speed. The individual model with the best performance for speed was model 2-sequence. The average of the predictions generated by model 2-stacked and model 2-sequence provided a significant decrease in the total MSE of both speed and steering angle. This combination is the final result of model 2 and was our best submission to the competition, which earned second place in steering angle prediction and second place overall.
Figure 14: Top 50 teams ranked by total MSE angle. Our methods are the top 3 performers as denoted by stars. The total MSE angle achieved by the models are Model 1: 831.424, Model 2: 881.513 and Model 3: 922.836.

5 Conclusions

In conclusion, our main contributions are (i) fusing together multiple modalities, (ii) augmenting the inputs with segmentation maps generated by a pre-trained model, (iii) leveraging the fact that the output is a smooth function, and (iv) ensembling diverse model architectures, which together won the ICCV Learning to Drive Challenge. We demonstrate high quality steering angle and speed predictions that yield accurate driving trajectories. In the spirit of reproducible research we make our models and code publicly available [15].

References

  • [1] AIcrowd (2019) ICCV 2019: Learning-to-Drive Challenge Winning Test Prediction Result Video. Note: https://www.aicrowd.com/challenges/32/submissions/20194 [Online; accessed 15-Nov-2019] Cited by: §4.
  • [2] AIcrowd (2019) ICCV 2019: Learning-to-Drive Challenge. Note: https://www.aicrowd.com/challenges/iccv-2019-learning-to-drive-challenge [Online; accessed 15-Nov-2019] Cited by: §4.
  • [3] M. Bansal, A. Krizhevsky, and A. Ogale (2018) Chauffeurnet: learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079. Cited by: §1, §2.
  • [4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §1, §2.
  • [5] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. Cited by: §1.
  • [6] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §1.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §3.
  • [8] T. Fernando, S. Denman, S. Sridharan, and C. Fookes (2017) Going deeper: autonomous steering with neural memory networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 214–221. Cited by: §1, §2.
  • [9] GoPro (2019) Camera. https://gopro.com. Cited by: §3.1.1, §3.1.3.
  • [10] S. Hecker, D. Dai, and L. Van Gool (2018) End-to-end learning of driving models with surround-view cameras and route planners. In Proceedings of the European Conference on Computer Vision, pp. 435–453. Cited by: §1, §2.
  • [11] S. Hecker, D. Dai, and L. Van Gool (2019) Learning accurate, comfortable and human-like driving. arXiv preprint arXiv:1903.10995. Cited by: §1, §2.
  • [12] HERE (2019) Location Platform. https://here.com. Cited by: §3.1.1, §3.1.3.
  • [13] Y. Hou, Z. Ma, C. Liu, and C. C. Loy (2019) Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8433–8440. Cited by: §1.
  • [14] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5419–5427. Cited by: §2.
  • [15] (2019) Models and Code. Note: https://www.dropbox.com/sh/ucbsh0wc6xepv4r/AAAYV1x-93n5XjP4alYCI8LYa?dl=0 [Online; accessed 15-Nov-2019] Cited by: Accurate Trajectory Prediction for Autonomous Vehicles.
  • [16] H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2174–2182. Cited by: §1, §1, §2.
  • [17] Z. Yang, Y. Zhang, J. Yu, J. Cai, and J. Luo (2018) End-to-end multi-modal multi-task vehicle control for self-driving cars with visual perceptions. In Proceedings of the IEEE International Conference on Pattern Recognition, pp. 2289–2294. Cited by: §2.
  • [18] Y. Zhu (2019) Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8856–8865. Cited by: §3.