Imitation Learning for End to End Vehicle Longitudinal Control with Forward Camera

by Laurent George, et al.

In this paper, we present a complete study of an end-to-end imitation learning system for speed control of a real car, based on a neural network with a Long Short-Term Memory (LSTM) layer. To achieve robustness and generalization from expert demonstrations, we propose data augmentation and label augmentation techniques that are relevant for imitation learning in the longitudinal control context. Based on the front camera image only, our system is able to correctly control the speed of a car in a simulation environment and in a real car on a challenging test track. The system also shows promising results in an open road context.



1 Introduction

In this work, we provide a proof of concept of how imitation learning can be used to control the speed of a vehicle in the context of autonomous driving. As opposed to classical approaches [1, 2, 3], the control is done end to end, with a single neural network mapping raw image data to the desired car speed. In particular, we focus on a behavior reflex approach [4]: the goal is to react to the environment state without an explicit description of it.

The basic principle of imitation learning is to rely on expert demonstrations. However, when the system is actuated by the network prediction, there will be deviations from the expert behavior. These deviations imply a distribution mismatch between the expert demonstrations and the images encountered at inference time. To alleviate this mismatch, such systems rely on data augmentation. For instance, in imitation learning systems for car steering control [5, 6, 7, 8], data were generated to emulate deviations from the expert trajectory: lateral deviations are generated with the corresponding label augmentation, either by adding more cameras or by using image transformation techniques.

Here, we consider vehicle speed prediction from a frontal camera: the speed should be adapted to the road shape and to the presence of obstacles on the road. This specific problem of longitudinal control of the vehicle has been less explored than steering. In particular, there is no well-described way to do the data augmentation. Baidu has demonstrated a vehicle with end-to-end lateral and longitudinal control [9]. Their network predicts the vehicle acceleration and is based on Conv-LSTM layers to deal with the spatio-temporal dependency of the longitudinal control. However, they do not detail their data augmentation process or their vehicle integration, although they have demonstrated it live. Yang et al. [10] propose a combined network which predicts both steering angle and speed, with an LSTM branch for the speed prediction. They show that combining both can benefit the steering angle prediction, but they do not detail their data augmentation for longitudinal control. Xu et al. [11] use an FCN-LSTM architecture to predict discretized steering angles and acceleration. They use what they call privileged training to achieve better performance by training the network to simultaneously segment the camera image, which is an intermediate approach between the behavior reflex and the mediated perception approaches [4].

Our main contribution is a detailed data and label augmentation pipeline which makes it possible to use speed prediction online and in the loop. We also provide an adapted loss function to ensure smooth control in the car, and quantitative and qualitative results both in simulation and on a real car. Finally, we show that challenging scenarios can be handled in a real car in our proof-of-concept demonstration.

2 System architecture

We use both a simulation environment and real cars to perform training and tests. We chose the video game Grand Theft Auto V (GTA) as the simulation environment. This game provides a very large, realistically rendered open world with a city environment. Multiple weather conditions (sun, rain, clouds, etc.) and different times of day are available. Pedestrians and cars are available, with an expert AI to control them. To interact with the game engine and generate training and offline validation data (images, positions of other characters, map, etc.), we use DeepGTAV [12] and the built-in control AI with different driver behaviors (varying aggressiveness). For our simulation experiments, we used a fixed circuit inside the city environment. Note that the vehicle simulator CARLA [13] was not used for this test because the focus was on complex urban interactions, but it could be leveraged in future work.

For our real data experiments (see Figure 1), the data is split between an open road passive database, collected in the Paris area and described more precisely in [8], and data collected on a test track designed to demonstrate some use cases: two hard turns (16 m in diameter), one dynamic barrier, one straight section and a road deviation through a traffic cone chicane. The test car is equipped with a front windshield camera with a 60-degree field of view, and a drive-by-wire system to send acceleration commands.


Figure 1: Test track (left) and demonstration car used (right). Track elements labeled in the image: cone chicane, turn #1, turn #2, Jersey barriers.

We focused our work on driving behavior in low-speed urban scenarios (less than 60 kph). During the online tests, the lateral behavior of the car was controlled either manually or by another end-to-end imitation learning neural network [8].

The input of the longitudinal neural network is the normalized 320x240 front camera image. The proposed network consists of 11 layers: 7 consecutive convolutional layers with decreasing filter size, 3 fully connected layers and 1 LSTM layer (all layers are followed by a Rectified Linear Unit (ReLU), except for the LSTM). The network produces a speed, which is transformed into acceleration and deceleration commands by a PID (Proportional, Integral, Derivative) controller with fuzzy gain adaptation.
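To make the last step concrete, the following is a minimal sketch of how a predicted target speed could be turned into an acceleration command. It is a plain fixed-gain PID only: the fuzzy gain adaptation used in the car is not reproduced, and the class name, gains, and command clamp are illustrative assumptions rather than the values of the actual system.

```python
from dataclasses import dataclass, field


@dataclass
class SpeedPID:
    """Fixed-gain PID mapping a speed error to an acceleration command.

    All gains and limits are assumed values chosen for illustration.
    """
    kp: float = 0.5
    ki: float = 0.1
    kd: float = 0.05
    max_cmd: float = 3.0  # assumed symmetric clamp on the command
    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)
    _has_prev: bool = field(default=False, init=False)

    def step(self, target_speed: float, measured_speed: float, dt: float) -> float:
        """Return a command: positive to accelerate, negative to brake."""
        error = target_speed - measured_speed
        self._integral += error * dt
        # No derivative term on the very first call (no previous error yet).
        derivative = (error - self._prev_error) / dt if self._has_prev else 0.0
        self._prev_error = error
        self._has_prev = True
        cmd = self.kp * error + self.ki * self._integral + self.kd * derivative
        return max(-self.max_cmd, min(self.max_cmd, cmd))
```

In such a loop, `target_speed` would be the network output for the current frame and `measured_speed` the speed reported by the vehicle, with `step` called once per control period.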

We define the training loss as the sum of three terms: the mean squared error between the output of the network and the command of the human driver, an auxiliary smoothing loss, and an L2 regularization loss. The smoothing loss is introduced to smooth the network output over time. It minimizes the difference between the successive differences of the labels and the successive differences of the predictions, to better match the actual dynamics of the vehicle.
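The structure of this loss can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation used for training: the smoothing term is taken here as a mean squared difference, and the function name and the weighting factors `smooth_weight` and `l2_weight` are hypothetical.

```python
def training_loss(pred, label, weights, smooth_weight=1.0, l2_weight=1e-4):
    """MSE + smoothing term on successive differences + L2 penalty.

    pred, label: per-frame speeds of one sequence (same length, at least 2).
    weights: flat list of network parameters.
    smooth_weight, l2_weight: assumed weighting factors.
    """
    n = len(pred)
    mse = sum((p - y) ** 2 for p, y in zip(pred, label)) / n
    # Differences between successive frames act as a proxy for acceleration.
    d_pred = [pred[t + 1] - pred[t] for t in range(n - 1)]
    d_label = [label[t + 1] - label[t] for t in range(n - 1)]
    smooth = sum((dp - dy) ** 2 for dp, dy in zip(d_pred, d_label)) / (n - 1)
    l2 = sum(w * w for w in weights)
    return mse + smooth_weight * smooth + l2_weight * l2
```

With this form, a prediction that is offset from the label by a constant is penalized only by the MSE term, while a prediction that oscillates around the label is additionally penalized by the smoothing term.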

3 Data augmentation and selection

Data augmentation is key to the success of imitation learning with online control. For speed prediction and control, we need to add more variety in the training dataset and mitigate the absence from our training dataset of specific driving behavior, like emergency or late braking.

To compensate for the absence of emergency braking in our datasets, we introduce a zoom procedure that mimics stops closer to obstacles. We create new sequences by copying those where the vehicle stops behind another car and appending the last frame multiple times with a progressive zoom. The associated label speed of the zoomed frames is set to 0 kph, since the desired behavior is to stop the car when it is close to obstacles. In parallel, to limit the over-representation of zero-speed frames in the dataset, we cap the sequence of frames when the car is fully stopped behind another car. In such conditions, the recorded frames are very similar (same rear of the preceding car) and thus do not provide new relevant information for the learning.
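The zoom and capping steps described above can be sketched as follows. The sketch only computes crop boxes and labels; the actual crop and resize back to the input size would be done with an image library. The function names and the parameters `n_steps`, `max_zoom` and `max_stopped` are assumptions made for illustration, and combining both steps in one function is a choice of this sketch.

```python
def zoom_crop_boxes(width, height, n_steps, max_zoom):
    """Centered crop boxes (left, top, right, bottom) for a progressive zoom.

    Cropping each box and resizing it back to (width, height) makes the
    obstacle appear closer. n_steps and max_zoom are assumed parameters.
    """
    boxes = []
    for i in range(1, n_steps + 1):
        zoom = 1.0 + (max_zoom - 1.0) * i / n_steps
        w, h = width / zoom, height / zoom
        left, top = (width - w) / 2, (height - h) / 2
        boxes.append((round(left), round(top), round(left + w), round(top + h)))
    return boxes


def augment_stop_sequence(frames, speeds, n_steps=5, max_zoom=1.5, max_stopped=10):
    """Cap the stopped tail of a sequence, then append zoomed last frames.

    frames: non-empty list of frame identifiers; speeds: labels in kph.
    Returns (frame, crop_box_or_None, speed) triples.
    """
    # Find where the trailing run of 0 kph labels starts.
    end = len(frames)
    while end > 0 and speeds[end - 1] == 0:
        end -= 1
    # Keep at most max_stopped frames of that run.
    keep = min(len(frames), end + max_stopped)
    out = [(f, None, s) for f, s in zip(frames[:keep], speeds[:keep])]
    # Zoomed copies of the last kept frame, all labeled 0 kph.
    for box in zoom_crop_boxes(320, 240, n_steps, max_zoom):
        out.append((frames[keep - 1], box, 0.0))
    return out
```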

To simulate a non-centered position of the car in the lane, we use lateral sub-cropping of the original image (320x240 to 300x240). A random cropping offset is drawn for each data sequence at each epoch. Our aim with this data augmentation is to make the system robust to such small lateral offsets. This is important because we intend to use our system together with other systems that control the steering of the car, and such systems can sometimes lead to non-centered behavior.
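A minimal sketch of this lateral sub-cropping is given below, assuming images are stored as lists of pixel rows. The function names are hypothetical, and the uniform sampling of the offset is an assumption; only the 320 to 300 pixel widths and the one-offset-per-sequence rule come from the text.

```python
import random


def sample_crop_offset(rng, src_width=320, crop_width=300):
    """Draw one horizontal offset, shared by every frame of a sequence."""
    return rng.randint(0, src_width - crop_width)


def crop_row(row, offset, crop_width=300):
    """Crop one image row (a list of pixels) to crop_width pixels."""
    return row[offset:offset + crop_width]


def crop_sequence(frames, rng):
    """Crop a whole sequence with a single random lateral offset.

    frames: list of images, each image being a list of rows.
    Call once per sequence and per epoch to get a new offset each time.
    """
    offset = sample_crop_offset(rng)
    return [[crop_row(row, offset) for row in frame] for frame in frames]
```

Using the same offset for all frames of a sequence keeps the apparent lateral position consistent over time, which matters for a network with a recurrent layer.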

The datasets must be filtered to remove some specific cases which can reduce the training and inference performance. Inferring a correct speed prediction from the front camera only is not always possible. This is the case, for example, with some road signals in France (e.g., traffic lights, stop signs, etc.) that cannot be seen from a front camera with a 60-degree field of view when the car is stopped at its mandatory position. During initial tests, we also observed some undesired braking: for example, on the test track, even when the barrier was open, the prediction would brake slightly at the location of the barrier. In our understanding, this was due to the fact that during training we provided numerous images for locations right after the barrier (when it is no longer visible), labeled with a small speed. They correspond to the acceleration phase after the opening of the barrier. However, these labels conflict with cases where the barrier is open and the car is at a much higher speed, which leads the training to settle on a compromise. Removing the restart sequences from the training data improved the car behavior: the network then consistently predicted a higher speed at this location.

4 Results

To evaluate the prediction error quantitatively, two measures are relevant: the speed Mean Absolute Error (MAE) and the acceleration MAE over the whole time sequence. We introduce the acceleration MAE because it reflects the final performance in the car: inconsistent acceleration results in discomfort for the car passengers at best, or impossible speed profiles at worst. To evaluate our architecture choice and ensure that the introduction of the LSTM layer and of the smoothing loss was relevant for the longitudinal prediction problem, we compare our architecture to two other architectures: a vanilla CNN, which predicts the speed only from the current image, with no temporal dependency, and an architecture similar to the one presented in Section 2 but trained without the smoothing loss.
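The two measures can be sketched as follows. How the acceleration is derived from the speed signals is not specified here, so the sketch assumes a finite difference between successive frames with a constant time step `dt`; the function names are hypothetical.

```python
def speed_mae(pred, label):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - y) for p, y in zip(pred, label)) / len(pred)


def accel_mae(pred, label, dt):
    """MAE between finite-difference accelerations of prediction and label.

    dt is the (assumed constant) time between two successive frames.
    """
    acc_pred = [(pred[t + 1] - pred[t]) / dt for t in range(len(pred) - 1)]
    acc_label = [(label[t + 1] - label[t]) / dt for t in range(len(label) - 1)]
    return speed_mae(acc_pred, acc_label)
```

For example, against a constant label of 11, the predictions `[10, 12, 10, 12]` and `[12, 12, 12, 12]` have the same speed MAE, but only the first one has a non-zero acceleration MAE, which is why the second measure is needed to capture jerky outputs.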

Table 1 presents a quantitative comparison of the final speed mean absolute error achieved offline on the different validation datasets (using the best training iteration for each network). For our main network (LSTM + loss), the best results are obtained on the test track (MAE 1.17 kph); GTA data gives comparable performance (MAE 2.06 kph), and open road results are still relevant (MAE 6.40 kph). Comparing the three networks, the two LSTM-based networks always outperform the vanilla CNN in terms of speed MAE, and they achieve quite similar speed MAE to each other. The LSTM network with the smoothing loss outperforms the other architectures by providing a slightly smaller acceleration MAE. Note that we could not perform a direct quantitative evaluation online on the car because it would have required a high-precision GPS. In practice, even if the acceleration MAE is only slightly lower with the smoothing loss, the behavior is significantly smoother when actually applied in the car. Table 1 only contains results of networks trained with data augmentation. Without data augmentation, we obtained similar values for offline evaluation on recorded tracks. However, when tested online (in the real car or in simulation), these networks showed dangerous behavior, leading to fatal errors requiring human intervention. Data augmentation is therefore necessary to obtain safe and complete behavior of the car.

Dataset LSTM + loss Simple LSTM Vanilla
Metric Speed Accel. Speed Accel. Speed Accel.
Table 1: Speed MAE (kph) and acceleration MAE according to datasets (lower is better)
Figure 3: Examples of visual backpropagation images on each dataset (validation subset)

For qualitative evaluation, we focus on experimental results in terms of smoothness and reproducibility on the test track. We performed a live demonstration at the CES show in Las Vegas in January 2018 (a video is available; note that both steering and speed of the vehicle are controlled by neural networks). We obtained a very high success rate (about one human intervention per day, when running several tens of laps over the day), with a constant performance in terms of comfort, when using the network trained from scratch on test track data. This confirms the offline results and highlights the ability of the system to control a real car.

Figure 3 shows examples of the visual backpropagation [14] that we use in the car to interpret the network output. In particular, we noticed that the network was focusing mainly on relevant elements of the track (Jersey barriers, traffic cones, stop signs, etc.) when braking was required and correctly predicted. In contrast, large, unfocused visual backpropagation was observed when the training was not complete (not enough iterations) or not adapted to the situation. This provides a qualitative way to assess the success of the training, and also to provide explainability to external users: running the visualization live during the demonstrations highlighted that the inference is based on relevant elements of the infrastructure.

5 Conclusion

In this work, we present a complete study of an end-to-end imitation learning system for speed control of a real car. We use a neural network with an LSTM layer and an adapted loss. To address the problem of generalizing from a few ideal samples, we propose an adapted label augmentation, then validate it offline, in a simulator, and online. The system performs particularly well on a dedicated test track that includes difficult scenarios such as stopping and restarting at a dynamic barrier and at a warning triangle, slowing down in hard turns or in a road deviation, and speeding up on a straight line. The use of an LSTM layer and of the proposed specific loss allows the system to capture speed dynamics and to output a speed control that correctly mimics a human driver. Concerning the general urban open road context, although we observed encouraging online behavior such as correctly adapting the speed to the preceding vehicle, the mean absolute error (6.40 kph) observed on validation data tends to indicate that the system is not yet ready to replace classical approaches.

Future work could couple lateral and longitudinal prediction in a single network. It would then be relevant to investigate how to transfer their respective label augmentation requirements, and how we can benefit from learning both tasks simultaneously. In the domain of label augmentation, the question of how to generate sequences of new labels and input images that are consistent over time in dynamic situations is still open. Finally, integrating other types of raw data, such as Lidar, or combining cameras with different fields of view could open new possibilities.


  • [1] Wilko Schwarting, Javier Alonso-Mora, and Daniela Rus. Survey on Planning and Decision-Making for Autonomous Vehicles. Annual Review of Control, Robotics, and Autonomous Systems, 1(January):1–26, 2018.
  • [2] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, MN Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, et al. Autonomous driving in Urban environments: Boss and the Urban Challenge. Journal of Field Robotics, 25(8):425–466, 2008.
  • [3] Dmitri Dolgov, Sebastian Thrun, Michael Montemerlo, and James Diebel. Path planning for autonomous vehicles in unknown semi-structured environments. International Journal of Robotics Research, 2010.
  • [4] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. ICCV, 2015.
  • [5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. CoRR, 2016.
  • [6] Christian Hubschneider, André Bauer, Michael Weber, and J Marius Zöllner. Adding Navigation to the Equation: Turning Decisions for End-to-End Vehicle Control. In ITSC Workshop, 2017.
  • [7] Christian Hubschneider, Jens Doll, Michael Weber, Sebastian Klemm, and Florian Kuhnt. Integrating End-to-End Learned Steering into Probabilistic Autonomous Driving. ITSC, 2017.
  • [8] Marin Toromanoff, Emilie Wirbel, Frédéric Wilhelm, Camilo Vejarano, Xavier Perrotton, and Fabien Moutarde. End to end vehicle lateral control using a single fisheye camera. arXiv preprint arXiv:1808.06940, 2018.
  • [9] Hao Yu, Shu Yang, Weihao Gu, and Shaoyu Zhang. Baidu driving dataset and end-to-end reactive control model. In IEEE Intelligent Vehicles Symposium, pages 341–346, June 2017.
  • [10] Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, and Jiebo Luo. End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception. arXiv preprint arXiv:1801.06734, 2018.
  • [11] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end Learning of Driving Models from Large-scale Video Datasets. CoRR, 2016.
  • [12] Deep GTA V plugin, 2018.
  • [13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
  • [14] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Larry D. Jackel, Urs Muller, and Karol Zieba. VisualBackProp: visualizing CNNs for autonomous driving. CoRR, 2016.