
Sample Efficient Interactive End-to-End Deep Learning for Self-Driving Cars with Selective Multi-Class Safe Dataset Aggregation

by   Yunus Bicer, et al.

The objective of this paper is to develop a sample-efficient end-to-end deep learning method for self-driving cars, where we attempt to increase the value of the information extracted from samples through careful analysis of each call to the expert driver's policy. End-to-end imitation learning is a popular method for computing self-driving car policies. The standard approach relies on collecting pairs of inputs (camera images) and outputs (steering angle, etc.) from an expert policy and fitting a deep neural network to this data to learn the driving policy. Although this approach has had some successful demonstrations in the past, learning a good policy might require a lot of samples from the expert driver, which might be resource-consuming. In this work, we develop a novel framework based on the Safe Dataset Aggregation (SafeDAgger) approach, where the trajectories of the current learned policy are automatically segmented into different trajectory classes, and the algorithm identifies the trajectory segments or classes with weak performance at each step. Once the trajectory segments with weak performance are identified, the sampling algorithm focuses on calling the expert policy only on these segments, which improves the convergence rate. The presented simulation results show that the proposed approach can yield significantly better performance compared to the standard SafeDAgger algorithm while using the same number of samples from the expert.





I Introduction

Recent years have seen significant advances in self-driving car technologies, mainly due to several breakthroughs in deep learning. In particular, the use of vision-based methods to generate driving policies has been of interest to a vast body of researchers, resulting in a variety of learning and control architectures that can be roughly classified into classical and end-to-end methods. Classical methods approach the problem of autonomous driving in three stages: perception, path planning, and control. In the perception stage, feature extraction and image processing techniques such as color enhancement and edge detection are applied to image data to detect lane markings. In path planning, the reference path and the current path of the car are determined based on the features identified in the perception stage. In the control stage, control actions for the vehicle such as steering and speed are calculated from the reference and current paths with an appropriate control algorithm. The performance of the classical methods heavily depends on the performance of the perception stage, and this performance can be sub-optimal because of the manually defined features and rules in this stage [2]. The sequential structure of the classical methods might also reduce robustness against errors, as an error in feature extraction can result in an inaccurate final decision.

On the other hand, end-to-end learning methods learn a function from the samples obtained from an expert driving policy. The learned function can generate the control inputs directly from the vision data, combining the three stages of the classical control sequence into a single step. By far the most popular approach for representing the mapping from images to controls in end-to-end driving is using neural networks (NN). ALVINN by Pomerleau [3] is one of the initial works in this area, which uses a feedforward neural network that maps frames of the front-facing camera to steering input. Researchers from Nvidia utilized convolutional neural networks (CNN) [4] to automate the feature extraction process and predict steering input. An FCN-LSTM architecture [5] was proposed to increase learning performance with scene segmentation. In [6], a visual attention model is used to highlight essential regions of frames for better prediction. Although steering input prediction in an end-to-end manner is a well-studied problem in the literature, the steering input alone is not sufficient for fully autonomous driving. In [7], a CNN-LSTM network is proposed to predict the speed and steering inputs synchronously.

Pure end-to-end learning policies are limited to the demonstrated performance, and although the training and validation loss on the data collected from the expert might be low, errors accumulated during the execution of the learned driving policy might lead to poor performance in the long run. This performance loss arises partly because the learned driving policy is likely to observe states that do not belong to the distribution of the original expert demonstration data. The DAgger [8] algorithm addresses this issue by iteratively collecting training data from both the expert and the trained policies. The main idea behind DAgger is to actively obtain more samples from the expert to improve the learned policy. Even though DAgger achieves better driving performance, it might end up obtaining a lot of samples from the expert, which can be time- and resource-consuming in many real-life scenarios. The SafeDAgger [9] algorithm, an extension of DAgger, attempts to minimize the number of calls to the expert by predicting the unsafe trajectories of the learned driving policy and calling the expert only in such cases. Another extension of DAgger, EnsembleDAgger [10], predicts the variance of the decisions by using multiple models and takes it as an additional safety criterion, similar to SafeDAgger.

In this paper, we propose a novel framework, named Selective SafeDAgger, which is sample-efficient compared to the SafeDAgger algorithm (the state-of-the-art data aggregation method). The proposed algorithm classifies the trajectories executed by the learned policy into safe segments and multiple classes of unsafe segments. After this classification, the model focuses on obtaining expert policy samples primarily from the identified unsafe segment classes. Our main contribution is an imitation learning algorithm that collects the most promising samples from the expert policy, which enables it to outperform the SafeDAgger method while limited to the same number of calls to the expert.

This paper is organized as follows. Section II provides the details of the methodology. The experimental setup is provided in section III, followed by a discussion about results in section IV and conclusions in section V.

II Methodology

In this section, the driving policies, the architecture of the network, and the devised algorithm are explained in detail.

II-A Driving Policies

We begin with giving definitions of the used terms to explain driving policies in detail.

A set of states $S$ for the car in this paper is an environment model, and $s \in S$ is one of the states for the car in that environment. The observation of the state $s$ is defined as $\phi(s) \in \Phi(S)$, where $\Phi(S)$ is the observation set for all states. $a(s) \in A(S)$ is the driving action at observation $\phi(s)$, where $A(S)$ is the set of all possible actions.

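A set of driving policies $\Pi$ is defined as in Eq. (1).

$$\Pi = \{\pi \mid \pi : \Phi(S) \rightarrow A(S)\} \tag{1}$$

where each $\pi \in \Pi$ is a mapping from state observations $\phi(s) \in \Phi(S)$ to driving actions $a(s) \in A(S)$ such as steering, throttle, brake, etc.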

Two distinct driving policies are defined throughout the paper. The first one is an expert policy $\pi^*$ that drives the car with a reasonable performance that we want to imitate. An expert policy in an autonomous driving scenario is usually chosen as the actions of a human driver. Variants of the DAgger algorithm, however, suffer from a mislabeling problem in the case of a human driver, since drivers do not feel the feedback of their actions and can react incorrectly to the given states. To overcome the mislabeling problem, we use a rule-based controller, which contains speed and steering controllers, as the expert policy in this paper. The second one is a primary policy $\pi$ that is trained to drive the car. This policy is sub-optimal with respect to the expert policy since it is trained on a subset of the observation set $\Phi(S)$.

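Training a primary policy to mimic an expert policy is called imitation learning or learning by demonstration. One of the most common methods for imitation learning is based on supervised learning techniques. The loss function for supervised learning is defined as in Eq. (2) [9].

$$l(\pi, \pi^*, \phi(s)) = \lVert \pi(\phi(s)) - \pi^*(\phi(s)) \rVert_2 \tag{2}$$

where $\pi(\phi(s))$ and $\pi^*(\phi(s))$ are the actions of the trained and expert policies at observation $\phi(s)$, and $\lVert \cdot \rVert_2$ refers to the $L_2$-norm between them.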

A primary policy, as in Eq. (3), is defined as a policy that minimizes the loss function as follows.

$$\pi = \underset{\pi' \in \Pi}{\arg\min}\; l(\pi', \pi^*, \phi(s)) \tag{3}$$

Minimization of the loss function can be challenging since it is known that the relation between image frames and driving actions is highly nonlinear. So, we have used a deep neural network architecture to find an optimal solution for the primary policy.
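As a small illustration of the supervised imitation loss, the sketch below computes the per-sample distance between learner and expert actions and its mean over a batch. It assumes that the norm is Euclidean and that each action is a (steering, speed) pair; the function name `imitation_loss` and the sample values are hypothetical and not taken from the paper.

```python
import math


def imitation_loss(policy_action, expert_action):
    """Euclidean distance between the learner's and the expert's action.

    Each action is assumed to be a (steering, speed) tuple; this layout is an
    illustrative assumption, not something the paper specifies.
    """
    return math.dist(policy_action, expert_action)


# Hypothetical actions for three observations.
learner_actions = [(0.10, 5.0), (0.00, 6.0), (-0.30, 4.0)]
expert_actions = [(0.10, 5.0), (0.30, 6.4), (-0.30, 7.0)]

# Per-sample losses and their mean over the batch.
losses = [imitation_loss(p, e) for p, e in zip(learner_actions, expert_actions)]
mean_loss = sum(losses) / len(losses)
```

In practice the minimization over policies is carried out by gradient-based training of the network rather than by evaluating this quantity directly.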

II-B Network Architecture

The earlier works in end-to-end learning for self-driving cars focus on computing only the steering angle from a single image or a sequence of images. A longitudinal control component is required to reach a higher level of autonomy in the end-to-end framework. In this work, we utilize the multi-task model proposed in [7] as our baseline, which is capable of generating both longitudinal and lateral control inputs for the car. In addition, we utilize a speed controller rather than the classical throttle/brake commands for the longitudinal control. The steering action is predicted from the raw image inputs taken from the cameras located at the front of the vehicle through convolution layers, and the speed is predicted from a sequence of speed profiles through a Long Short-Term Memory (LSTM) layer. There exists a single-direction coupling between the longitudinal controller (speed controller) and the lateral steering actions. In particular, the speed of the vehicle has a significant impact on the prediction model, since entering a turn at low speed presents different dynamics to the lateral controller than a high-speed maneuver. Moreover, the straight trajectory dominates all other trajectory types (e.g., turn left, turn right); therefore, the trained network will be biased toward the straight path. To recover from this issue, we define various trajectory types covering all major maneuvers, such as straight, turn left, and turn right in low- and high-speed scenarios, so that the devised model also learns the less frequent maneuvers.

The model architecture is shown in Fig. 1. It takes the current observation and the past speed profile and returns the steering action, the speed action, and the class of the trajectory segment. The multi-task network predicts the steering angle through a visual encoder using a stack of convolution and fully-connected layers. In the first two convolution layers (Conv1 and Conv2), a large kernel size is adopted to better capture the environment features, which is suitable for the front-view camera. In Fig. 1, each convolution layer is annotated with its input and kernel sizes, and each fully connected layer with its number of neurons. The speed and trajectory class are predicted through a concatenation of the visual encoder and feedback speed features. The speed features are extracted by an LSTM layer followed by fully-connected layers. ReLU (Rectified Linear Unit) is used as the activation function for all layers. Mean absolute error is the loss function for both the speed and steering angle predictions, which are regression problems, while cross-entropy is applied to the trajectory classifier, which is a classification problem.

Figure 1: Sample-efficient Selective SafeDAgger model
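The three training objectives named above (mean absolute error for steering and speed, cross-entropy for the trajectory class) can be sketched in plain Python as follows. How the paper weights the three terms is not stated, so the equal-weight sum in `multitask_loss` is an assumption, and all function names and sample values are hypothetical.

```python
import math


def mean_absolute_error(predicted, target):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def cross_entropy(class_probs, true_classes, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for probs, true_class in zip(class_probs, true_classes):
        total -= math.log(max(probs[true_class], eps))
    return total / len(true_classes)


def multitask_loss(steer_pred, steer_true, speed_pred, speed_true,
                   class_probs, class_true):
    """Equal-weight sum of the three terms (the weighting is an assumption)."""
    return (mean_absolute_error(steer_pred, steer_true)
            + mean_absolute_error(speed_pred, speed_true)
            + cross_entropy(class_probs, class_true))


# Hypothetical batch of two samples with two trajectory classes.
steer_mae = mean_absolute_error([0.1, -0.2], [0.0, -0.2])
speed_mae = mean_absolute_error([5.0, 7.0], [6.0, 7.0])
class_ce = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
total = multitask_loss([0.1, -0.2], [0.0, -0.2], [5.0, 7.0], [6.0, 7.0],
                       [[0.5, 0.5], [0.25, 0.75]], [0, 1])
```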

The multi-class classifier highlighted in Fig. 1 extends the SafeDAgger method to the novel algorithm devised in this paper. The trajectory classes are defined as follows:


Low and high speeds combined with left turn, straight, and right turn cover almost all unsafe trajectories. The same combinations are also applicable to safe trajectories, but since the expert policy does not need to be called on safe trajectories, we define only one class for them.

The multi-class classifier takes the partial observation of the state $\phi(s)$, which contains the visual perception and the past speed profile, and returns a label indicating in which part of the trajectory the policy $\pi$ is likely to deviate from the expert policy $\pi^*$.

The labels for training the model were generated through the one-hot encoding method, defined by sequential decisions. First, it was decided whether the policy is safe by measuring its distance from the expert policy through the $L_2$-norm metric using Eq. (5).

$$c_{\text{safe}}(\pi, \phi(s)) = \begin{cases} 1, & \lVert \pi(\phi(s)) - \pi^*(\phi(s)) \rVert_2 \le \tau \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $\pi(\phi(s))$ and $\pi^*(\phi(s))$ are the actions of the primary and expert policies at observation $\phi(s)$, and $\tau$ is a predefined threshold that can be chosen arbitrarily. Furthermore, to distinguish between low-speed and high-speed turn trajectories, a steering threshold and speed thresholds for turn maneuvers and straight trajectories are defined heuristically based on the response of the car dynamics in these trajectories. The threshold values for this work are listed in Table I.


Parameter Threshold value
Table I: Threshold Values in Labeling Process

where as yields for the steering angle and for the speed difference between the network prediction and expert policy output.
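The sequential labeling decisions described above can be sketched as follows. This is only an illustration under several assumptions: actions are (steering, speed) pairs, negative steering means a left turn, the maneuver type is read from the expert action, and the class order in `CLASSES` is arbitrary. The threshold values are placeholders and are not the values of Table I.

```python
import math

# Hypothetical class order: one safe class followed by six unsafe classes.
CLASSES = ["safe",
           "low_speed_left", "low_speed_straight", "low_speed_right",
           "high_speed_left", "high_speed_straight", "high_speed_right"]


def one_hot_label(policy_action, expert_action, tau,
                  steer_thr, turn_speed_thr, straight_speed_thr):
    """Return a one-hot label over CLASSES for a (steering, speed) action pair."""
    # Step 1: the sample is safe if the learner stays within tau of the expert.
    if math.dist(policy_action, expert_action) <= tau:
        name = "safe"
    else:
        # Step 2: maneuver type from the expert steering (sign convention assumed).
        steering, speed = expert_action
        if steering < -steer_thr:
            direction, speed_thr = "left", turn_speed_thr
        elif steering > steer_thr:
            direction, speed_thr = "right", turn_speed_thr
        else:
            direction, speed_thr = "straight", straight_speed_thr
        # Step 3: speed regime relative to the maneuver-specific threshold.
        regime = "high_speed" if speed > speed_thr else "low_speed"
        name = f"{regime}_{direction}"
    return [1 if c == name else 0 for c in CLASSES]


# Placeholder thresholds, chosen only to exercise the function.
thresholds = dict(tau=0.5, steer_thr=0.1, turn_speed_thr=4.0, straight_speed_thr=8.0)
safe_label = one_hot_label((0.02, 5.0), (0.0, 5.1), **thresholds)
unsafe_label = one_hot_label((0.0, 5.0), (-0.4, 3.0), **thresholds)
```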

II-C Selective SafeDAgger Algorithm

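Collect $D_0$ using $\pi^*$
Train $\pi_0$ on $D_0$
for i = 1:N do
      Define unsafe classes $c_{unsafe}$ over $D_0$ using $\pi_{i-1}$
      $D' \leftarrow \emptyset$
      while $|D'| < T$ do
            $c \leftarrow$ classifier output of $\pi_{i-1}(\phi)$ for the current observation $\phi$
            if $c \in c_{unsafe}$ then
                  use $\pi^*(\phi)$ and append $\phi$ to $D'$
            else
                  use $\pi_{i-1}(\phi)$
            end if
      end while
      $D_i \leftarrow D_{i-1} \cup D'$
      Train $\pi_i$ on $D_i$
end for
return best $\pi_i$ over validation set
Algorithm 1 Selective SafeDAgger: Blue font distinguishes the differences between Selective SafeDAgger and SafeDAgger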

Algorithm 1 describes the proposed method in detail. It takes the expert policy $\pi^*$ as an input and gives the trained policy $\pi_i$ as an output. The primary dataset $D_0$ is collected by using $\pi^*$ and is then utilized in training a primary policy $\pi_0$ by a supervised learning method. Having $\pi_0$ at hand, the unsafe classes $c_{unsafe}$ of $D_0$ for the trained policy are determined. An observation $\phi$ taken from the environment is evaluated by $\pi_{i-1}$ to find its class $c$. If $c$ is an element of $c_{unsafe}$, $\pi^*$ takes over the control of the car and $\phi$ is appended to $D'$. Otherwise, $\pi_{i-1}$ continues to command the car until it encounters an unsafe class. As depicted in the inner while loop of Algorithm 1, the algorithm continues to append data to $D'$ for T iterations. The appended dataset $D'$ is aggregated into $D_{i-1}$ to create $D_i$, and $\pi_i$ is trained on $D_i$. This procedure is repeated N times, as shown in the outer for loop. In the end, the algorithm returns the best $\pi_i$ over the validation set.
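The control flow described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: every callable (`expert`, `train`, `classify`, `find_unsafe_classes`, `observe`, `validation_loss`) is a hypothetical user-supplied function, and the `max_steps` guard is added here only so that the loop always terminates. The toy usage at the bottom replaces the driving task with a one-dimensional regression problem.

```python
from itertools import cycle


def selective_safedagger(expert, train, classify, find_unsafe_classes, observe,
                         validation_loss, initial_observations,
                         n_iterations, queries_per_iteration, max_steps=10_000):
    """Sketch of the Selective SafeDAgger loop; all callables are user-supplied."""
    # Collect the primary dataset with the expert and train the first policy.
    d0 = [(obs, expert(obs)) for obs in initial_observations]
    dataset = list(d0)
    policy = train(dataset)
    candidates = [policy]

    for _ in range(n_iterations):
        # Unsafe classes are determined on the pre-collected data.
        unsafe = find_unsafe_classes(policy, d0)
        aggregated = []
        steps = 0
        # max_steps is a safeguard added for this sketch.
        while len(aggregated) < queries_per_iteration and steps < max_steps:
            obs = observe()
            steps += 1
            if classify(policy, obs) in unsafe:
                # The expert takes over and its labeled sample is stored.
                aggregated.append((obs, expert(obs)))
            # Otherwise the learned policy keeps driving; no expert query.
        dataset.extend(aggregated)
        policy = train(dataset)
        candidates.append(policy)

    # Return the candidate with the lowest validation loss.
    return min(candidates, key=validation_loss)


# Toy 1-D usage: the "policy" is a single slope fitted to the expert actions.
calls = {"n": 0}


def toy_expert(obs):
    calls["n"] += 1
    return 2.0 * obs


def toy_train(data):
    # Least-squares slope through the origin.
    return sum(o * a for o, a in data) / sum(o * o for o, _ in data)


stream = cycle([0.1, 0.9, -0.7, 0.2])
best = selective_safedagger(
    expert=toy_expert,
    train=toy_train,
    classify=lambda policy, obs: "turn" if abs(obs) > 0.5 else "straight",
    find_unsafe_classes=lambda policy, data: {"turn"},
    observe=lambda: next(stream),
    validation_loss=lambda policy: abs(policy - 2.0),
    initial_observations=[0.1, 0.2, 0.3],
    n_iterations=2,
    queries_per_iteration=4,
)
```

In the toy run the expert is queried only for the initial dataset and for observations classified as "turn", which mirrors the selective querying that the algorithm relies on.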

III Experiments

III-A System Setup

III-A1 Simulator

AirSim, the simulator used in this work, is an Unreal Engine plugin for drones and cars developed by Microsoft as a platform for AI research, in which new deep learning, computer vision, and reinforcement learning algorithms for autonomous vehicles can be developed and tested with photo-realistic graphics [11]. It has built-in APIs for interfacing with Python. Furthermore, the engine editor can be used to create custom environments or scenarios.

The road track for the training process of the algorithm is devised in a way to capture all defined scenarios in this work. The geometry of the custom created training track is shown in Fig. 2, in which all the trajectory classes are illustrated.

Figure 2: Train set track

The representative power of the training set can be increased by collecting data from unseen observations. For that reason, two additional cameras were added to the front-facing camera with an angle of to imitate turning-like maneuvers [4]. AirSim APIs provide ground truth labels for the front-facing camera frames, but the ground truth labels for the left and right cameras must be adjusted with a threshold as in Eq. (6).


where , , and refer to the ground truth for the left and right cameras, center camera steering and speed actions respectively. In the turning case, the ground truth speed of the vehicle is adjusted by a parameter which is chosen as heuristically.

III-A2 Data Preprocessing

A couple of techniques were utilized in the preprocessing phase. The raw input image was down-sampled to a size of 144×256×3 (RGB), and a Region of Interest (ROI) with a size of 59×255 was defined to cover almost the entire road and ignore the features above the horizon, which reduces the computational cost. Moreover, to improve the convergence rate and robustness of the neural network model, the processed image was normalized to the range [0,1] by dividing all pixel values by 255 and augmented by randomly changing the brightness of the image with a scale of 0.4.
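A minimal NumPy sketch of these preprocessing steps is given below. It assumes the frame has already been down-sampled to 144×256×3, that the ROI is the bottom 59 rows and the first 255 columns, and that "a scale of 0.4" means multiplying by a random factor in [0.6, 1.4]. None of these details are confirmed by the paper, and `preprocess` is a hypothetical name.

```python
import numpy as np


def preprocess(frame, rng, roi_height=59, roi_width=255, brightness_scale=0.4):
    """Crop the ROI, normalize to [0, 1], and randomly change the brightness.

    frame: uint8 array of shape (144, 256, 3), already down-sampled.
    The ROI placement and the brightness rule are assumptions of this sketch.
    """
    height = frame.shape[0]
    # Keep the bottom rows (the road) and drop the area above the horizon.
    roi = frame[height - roi_height:, :roi_width, :]
    # Normalize by dividing all pixel values by 255.
    roi = roi.astype(np.float32) / 255.0
    # Random brightness factor drawn from [1 - scale, 1 + scale].
    factor = 1.0 + float(rng.uniform(-brightness_scale, brightness_scale))
    return np.clip(roi * factor, 0.0, 1.0)


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(144, 256, 3), dtype=np.uint8)
processed = preprocess(frame, rng)
```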

III-A3 Expert Policy

To automate the data collection part of the algorithm, a rule-based expert policy is defined as shown in Fig. 3.

Figure 3: Expert policy

For the steering action, is a tangent line to the road spline at the position of the car and is a point on road spline with distance along spline from that positions. Tangent line at according to road spline is . The angle between and which is will be expert steering action as depicted in Eq. (7).


For the speed action, is a point on the road spline with a distance from the position of the car along the road spline as depicted in Eq. (8).


where is current speed and is a fine tuned constant. Tangent line at according to the road spline is .

Figure 4: Convergence rate of the proposed model; It shows the improvement of the model as the number of dataset aggregation iterations increases.

Expert speed action is defined by Eq. (9).


where is a pre-defined cruise speed, is a fine tuned gain and is an angle between and .

For our implementation, the parameters are chosen as m, , m/s and .

III-B Training

For the training of the primary policy $\pi_0$, the dataset $D_0$, which contains 2800 images, was collected by using the expert policy $\pi^*$. The Nesterov Adam optimizer (Nadam) was used for training the network with the initial learning rate of and moment of 0.99. Training continued for ten epochs with a batch size of 32.

The trained primary policy $\pi$ is tested on the pre-collected dataset $D_0$ to classify trajectories and calculate the $L_2$-norm of each sample in the dataset. The weakness of the network over trajectory segments is determined by a coefficient of weakness, which is defined as in Eq. (10).


where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $L_2$-norm of the $i$-th class, $n_i$ is the number of samples in the $i$-th class whose $L_2$-norm falls in the region one $\sigma_i$ away from the mean $\mu_i$, and $N_i$ is the total number of samples in the $i$-th class.

Once the weakness coefficients are calculated, the trajectory classes are sorted according to their weakness coefficients, and the two most dominant unsafe classes are chosen for data aggregation, as shown in Table II. Additionally, the classes with a mean $L_2$-norm lower than 1 are selected as allowable classes.

As depicted in Table II, the weakness coefficients for the class and are quite low and never chosen as weak classes. The initial dataset for the training is biased toward , and classes and -Norms in those classes are low, which lead to low weakness coefficients. Moreover, training track does not have many samples from class so that weakness coefficients for the class is also low.

# Iter.
1 0.004 0.321 0.019 0.694 0.002 0.010
2 0.505 0.122 0.037 0.278 0.001 0.023
3 0.635 0.264 0.028 0.607 0.001 0.062
4 0.751 0.515 0.046 0.646 0.001 0.010
5 0.018 0.678 0.034 0.755 0.001 0.010
6 0.009 0.752 0.039 0.849 0.000 0.006
7 0.717 0.790 0.038 0.780 0.001 0.004
8 0.028 0.787 0.017 0.794 0.001 0.006
9 0.670 0.634 0.011 0.713 0.001 0.005
10 0.012 0.768 0.020 0.809 0.001 0.003
Table II: Coefficient of weakness for each class
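The selection rule can be illustrated on the first row of Table II. In the sketch below, classes are referred to by their zero-based column position in the table, the weakness coefficients are the iteration 1 values, and the per-class mean norms are hypothetical numbers invented only to exercise the allowable-class rule; `select_classes` is likewise a hypothetical name.

```python
def select_classes(weakness, mean_norm, n_dominant=2, allowable_threshold=1.0):
    """Pick the most dominant weak classes and the allowable classes.

    weakness and mean_norm are per-class lists in the column order of Table II.
    """
    # Sort class indices by weakness coefficient, largest first.
    ranked = sorted(range(len(weakness)), key=lambda i: weakness[i], reverse=True)
    dominant = ranked[:n_dominant]
    # Allowable classes: mean norm below the threshold.
    allowable = [i for i, norm in enumerate(mean_norm) if norm < allowable_threshold]
    return dominant, allowable


# Weakness coefficients of iteration 1 (first row of Table II).
weakness_iter1 = [0.004, 0.321, 0.019, 0.694, 0.002, 0.010]
# Hypothetical per-class mean norms; the paper does not list these values.
mean_norm_example = [0.4, 1.8, 1.2, 2.5, 0.3, 0.6]

dominant, allowable = select_classes(weakness_iter1, mean_norm_example)
```

With these inputs the fourth and second columns are selected as the dominant classes, since they carry the two largest coefficients in the first row.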

After the weak and allowable classes are determined, the data aggregation phase begins. In this phase, the policy drives the car to collect 10 batches of data in the dominant classes. If the policy encounters a dominant class, the expert policy takes control of the vehicle, and the samples taken during that time are labeled and reserved for aggregation. If the policy encounters an allowable class that is unsafe, it continues to drive the car. For all other unsafe classes, the expert policy takes control of the vehicle, within a query limit of 10 batches of samples. When the number of queries reaches the limit, data aggregation stops and training starts with the newly aggregated dataset.
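One possible reading of this aggregation rule is sketched below: the learned policy keeps driving in the safe class and in allowable classes, while the expert is queried in every other class until the query limit is reached. The function name, the class identifiers, and the example stream are hypothetical. With a batch size of 32, a limit of 10 batches corresponds to 320 queries per iteration.

```python
from collections import Counter

BATCH_SIZE = 32
QUERY_LIMIT = 10 * BATCH_SIZE  # 10 batches of samples per iteration


def aggregate_queries(class_stream, safe_class, allowable_classes, query_limit):
    """Count expert queries per class until the query limit is reached."""
    queries = Counter()
    total = 0
    for trajectory_class in class_stream:
        if total >= query_limit:
            break  # aggregation stops and training would start
        if trajectory_class == safe_class or trajectory_class in allowable_classes:
            continue  # the learned policy keeps driving, no expert query
        # Dominant or other unsafe class: the expert takes control.
        queries[trajectory_class] += 1
        total += 1
    return queries


# Hypothetical stream of predicted classes with a small limit for illustration.
example_stream = ["safe", 1, 0, 3, 3, 2, "safe", 1]
example_queries = aggregate_queries(example_stream, "safe", {0, 4, 5}, query_limit=4)
```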

After the training, the newly trained policy replaces the previous one, and the determination of the dominant weak classes on the pre-collected data is repeated for collecting relevant data. This process is repeated for 10 iterations. As shown in Fig. 4, in dataset aggregation iteration 1, a significant fraction of the dataset is unsafe, and as the model recovers from the most problematic cases, the model error converges. The progress of this process can be seen from iteration 1 to 10.

IV Results

Figure 5: (a) Performance of the Selective SafeDAgger algorithm for all classes at each aggregation iteration. (b) $L_2$-norm between prediction and ground truth over 10000 samples at each iteration.

In Fig. 5(a), we present the performance of Selective SafeDAgger, measured by the $L_2$-norm in each class during the training process. For the first iteration, the two classes with the highest weakness coefficients in Table II are chosen as weak classes, and the data for the new dataset comes from those classes by querying the expert policy. It is seen that in the second iteration, the $L_2$-norms drop for all classes when the aggregated dataset is used. Notice that the performance of the policy on the other classes also improves without querying the expert policy for those classes, which is not the case for SafeDAgger. Sequential decision making is the main reason behind this behavior. In SafeDAgger, when the policy shifts from nominal conditions, the expert policy is called, and the new dataset is collected until the safety criterion is met, which leads to unnecessary queries of the expert policy. On the other hand, Selective SafeDAgger tries to solve the problem from the beginning by finding the problematic classes. Moreover, after the seventh iteration, the norm of all classes drops below the allowable threshold, which means that the resultant dataset covers almost all trajectory classes, as seen in Fig. 4.

The trained model is tested at each iteration by taking 10000 samples from the environment, and the mean $L_2$-norms are calculated accordingly. Fig. 5(b) shows that the Selective SafeDAgger method performs better than the SafeDAgger method in all iterations, even though both methods make the same number of queries to the expert, as shown in Table III.

Iteration       Selective SafeDAgger (queries per unsafe class)    SafeDAgger
Iteration 1     0     127   38    155   0     0                    320
Iteration 2     0     44    0     228   0     48                   320
Iteration 3     19    63    0     238   0     0                    320
Iteration 4     27    12    0     281   0     0                    320
Iteration 5     0     165   0     155   0     0                    320
Iteration 6     31    189   0     100   0     0                    320
Iteration 7     0     93    0     227   0     0                    320
Iteration 8     2     162   0     156   2     5                    320
Iteration 9     83    0     0     237   0     0                    320
Iteration 10    0     205   0     115   0     0                    320
Total           3200                                               3200
Table III: Query to expert
Figure 6: Geometry of test tracks.

Three unseen test tracks were devised to evaluate the generalization performance of the proposed method; their layouts are illustrated in Fig. 6. The generalization performance of Selective SafeDAgger is reported in Table IV, which shows a lower mean norm than the SafeDAgger method on all three tracks. The selectivity of the proposed algorithm identifies the unsafe cases that dominate all other classes, which results in faster convergence of the model error compared to other dataset aggregation methods.

Test track    Selective SafeDAgger    SafeDAgger
1             0.4794                  0.5518
2             0.3295                  0.4986
3             0.3254                  0.3632
Table IV: Mean $L_2$-norm on unseen test tracks
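To put the values of Table IV on a common scale, the short computation below derives the relative reduction of the mean norm on each test track. The numbers are copied from Table IV; the helper name `relative_reduction` is hypothetical, and this derived metric is not reported in the paper itself.

```python
# Mean norms from Table IV: (Selective SafeDAgger, SafeDAgger).
table_iv = {
    "Test Track 1": (0.4794, 0.5518),
    "Test Track 2": (0.3295, 0.4986),
    "Test Track 3": (0.3254, 0.3632),
}


def relative_reduction(selective, baseline):
    """Fraction by which the selective method lowers the baseline error."""
    return (baseline - selective) / baseline


reductions = {track: relative_reduction(sel, base)
              for track, (sel, base) in table_iv.items()}

for track, value in reductions.items():
    print(f"{track}: {100 * value:.1f}% lower mean norm than SafeDAgger")
```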

V Conclusions

In this work, we implemented a Selective SafeDAgger algorithm, which is sample-efficient in selecting the data for dataset aggregation. The proposed algorithm evaluates the performance of the trained policy, determines the weakness of the policy over different trajectory classes, and recovers the policy on those specific trajectory classes. Our method outperforms the SafeDAgger algorithm in terms of sample efficiency and convergence rate. Next, we aim to cluster the trajectories with unsupervised neural network techniques to obtain a better representation of the road trajectories.


This work is supported by the Scientific and Technological Research Council of Turkey (Turkish: TÜBİTAK) under the grant agreement TEYDEB 1515 / 5169901.