End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

by Zhejun Zhang, et al.
ETH Zurich

End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera-input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the more challenging CARLA LeaderBoard.




1 Introduction

Figure 1: Roach: RL coach allows IL agents to benefit from dense and informative on-policy supervisions.

Even though nowadays most autonomous driving (AD) stacks [31, 49] use individual modules for perception, planning and control, end-to-end approaches have been proposed since the 80’s [36], and the success of deep learning brought them back into the research spotlight [5, 51]. Numerous works have studied different network architectures for this task [3, 17, 53], yet most of these approaches use supervised learning with expert demonstrations, which is known to suffer from covariate shift [37, 41]. While data augmentation based on view synthesis [2, 5, 36] can partially alleviate this issue, in this paper, we tackle the problem from the perspective of expert demonstrations.

Expert demonstrations are critical for end-to-end AD algorithms. While imitation learning (IL) methods directly mimic the experts’ behavior [3, 11], reinforcement learning (RL) methods often use expert demonstrations to improve sample efficiency by pre-training part of the model via supervised learning [28, 48]. In general, expert demonstrations can be divided into two categories: (i) Off-policy, where the expert directly controls the system, and the state/observation distribution follows the expert. Off-policy data for AD includes, for example, public driving datasets [7, 23, 52]. (ii) On-policy, where the system is controlled by the desired agent and the expert “labels” the data. In this case, the state/observation distribution follows the agent, but expert demonstrations are accessible. On-policy data is fundamental to alleviate covariate shift as it allows the agent to learn from its own mistakes, which the expert in the off-policy data does not exhibit. However, collecting adequate on-policy demonstrations from humans is non-trivial. While trajectories and actions taken by the human expert can be directly recorded during off-policy data collection, labeling these targets given sensor measurements turns out to be a challenging task for humans. In practice, only sparse events like human interventions are recorded, which, due to the limited information they contain, are hard to use for training and are better suited for RL [2, 24, 25] than for IL methods.

In this work we focus on automated experts, which in contrast to human experts can generate large-scale datasets with dense labels regardless of whether they are on-policy or off-policy. To achieve expert-level performance, automated experts may rely on exhaustive computations, expensive sensors or even ground truth information, so it is undesirable to deploy them directly. Even though some IL methods do not require on-policy labeling, such as GAIL [21] and inverse RL [1], these methods are not efficient in terms of on-policy interactions with the environment.

On the contrary, automated experts can reduce the expensive on-policy interactions. This allows IL to successfully apply automated experts to different aspects of AD. As a real-world example, Pan et al. [35] demonstrated end-to-end off-road racing with a monocular camera by imitating a model predictive control expert with access to expensive sensors. In the context of urban driving, [37] showed that a similar concept can be applied to the driving simulator CARLA [13]. Driving simulators are an ideal proving ground for such approaches since they are inherently safe and can provide ground truth states. However, there are two caveats. The first regards the “expert” in CARLA, commonly referred to as the Autopilot (or the roaming agent). The Autopilot has access to ground truth simulation states, but due to the use of hand-crafted rules, its driving skills are not comparable to a human expert’s. Secondly, the supervision offered by most automated experts is not informative. In fact, the IL problem can be seen as a knowledge transfer problem and just learning from expert actions is inefficient.

To tackle both drawbacks and motivated by the success of model-free RL in Atari games [19] and continuous control [15], we propose Roach (RL coach), an RL expert that maps bird’s-eye view (BEV) images to continuous actions (Fig. 1 bottom). After training from scratch for 10M steps, Roach sets the new performance upper-bound on CARLA by outperforming the Autopilot. We then train IL agents and investigate more effective training techniques when learning from our Roach expert. Given that Roach uses a neural network policy, it serves as a better coach for IL agents that are also based on neural networks. Roach offers numerous informative targets for IL agents to learn from, which go far beyond the deterministic actions provided by other experts. Here we demonstrate the effectiveness of using action distributions, value estimations and latent features as supervisions.

Fig. 1 shows the scheme of learning from on-policy supervisions labeled by Roach on CARLA. We also record off-policy data from Roach by using its output to drive the vehicle on CARLA. Leveraging 3D detection algorithms [27, 50] and extra sensors to synthesize the BEV, Roach could also address the scarcity of on-policy supervisions in the real world. This is feasible because on the one hand, BEV as a strong abstraction reduces the sim-to-real gap [32], and on the other hand, on-policy labeling does not have to happen in real-time or even onboard. Hence 3D detection becomes easier given the complete sequences [38].

In summary, this paper presents Roach, an RL expert that sets a new performance upper-bound on CARLA. Moreover, we demonstrate state-of-the-art performance on both the CARLA LeaderBoard and the CARLA NoCrash benchmark using a single-camera end-to-end IL agent, which is supervised by Roach using our improved training scheme. Our repository is publicly available at https://github.com/zhejz/carla-roach

2 Related Work

Since our methods are trained and evaluated on CARLA, we mainly focus on related works also done on CARLA.

End-to-End IL: Dosovitskiy et al. [13] introduced the CARLA driving simulator and demonstrated that a baseline end-to-end IL method with single camera input can achieve a performance comparable to a modular pipeline. After that, CIL [11] and CILRS [12] addressed directional multi-modality in AD by using branched action heads where the branch is selected by a high-level directional command. While the aforementioned methods are trained via behavior cloning, DA-RB [37] applied DAGGER [41] with critical state sampling to CILRS. Most recently, LSD [33] increased the model capacity of CILRS by learning a mixture of experts and refining the mixture coefficients using evolutionary optimization. Here, we use DA-RB as the baseline IL agent to be supervised by Roach.

Mid-to-X IL: Directly mapping camera images to low-level actions requires a large amount of data, especially if one wants generalization to diverse weather conditions. Mid-to-X approaches alleviate this issue by using more structured intermediate representations as input and/or output. CILRS with coarse segmentation masks as input was studied in [4]. CAL [42] combines CIL and direct perception [8] by mapping camera images to driving affordances which can be directly used by a rule-based low-level controller. LBC [9] maps camera images to waypoints by mimicking a privileged mid-to-mid IL agent similar to ChauffeurNet [3], which takes BEV as input and outputs future waypoints. Similarly, SAM [54] trained a visuomotor agent by imitating a privileged CILRS agent that takes segmentation and affordances as inputs. Our Roach adopts BEV as the input representation and predicts continuous low-level actions.

RL: As the first RL agent on CARLA, an A3C agent [30] was demonstrated in [13], yet its performance is lower than that of other methods presented in the same paper. CIRL [28] proposed an end-to-end DDPG [29] agent with its actor network pre-trained via behavior cloning to accelerate online training. To reduce the problem complexity, Chen et al. [10] investigated DDQN [16], TD3 [14] and SAC [15] using BEV as an input and pre-trained the image encoder with a variational auto-encoder [26] on expert trajectories. State-of-the-art performance is achieved in [48] using Rainbow-IQN [47]. To reduce the number of trainable parameters during online training, its image encoder is pre-trained to predict segmentation and affordances on an off-policy dataset. IL was combined with RL in [40] and multi-agent RL on CARLA was discussed in [34]. In contrast to these RL methods, Roach achieves high sample efficiency without using any expert demonstrations.

IL with Automated Experts: The effectiveness of automated experts was demonstrated in [35] for real-world off-road racing, where a visuomotor agent is trained by imitating on-policy actions labeled by a model predictive control expert equipped with expensive sensors. Although CARLA already comes with the Autopilot, it is still beneficial to train a proxy expert based on deep neural networks, as shown by LBC [9] and SAM [54]. Through a proxy expert, the difficult end-to-end problem is decomposed into two simpler stages. At the first stage, training the proxy expert is made easier by formulating a mid-to-X IL problem that separates perception from planning. At the second stage, the end-to-end IL agent can learn more effectively from the proxy expert given the informative targets it supplies. To provide strong supervision signals, LBC queries all branches of the proxy expert and backpropagates all branches of the IL agent given one data sample, whereas SAM matches latent features of the proxy expert and the end-to-end IL agent. While the proxy expert addresses planning, it is also possible to address perception at the first stage as shown by FM-Net [22]. Overall, two-stage approaches achieve better performance than direct IL, but using proxy experts inevitably lowers the performance upper-bound as a proxy expert trained via IL cannot outperform the expert it imitates. This is not a problem for Roach, which is trained via RL and outperforms the Autopilot.

(a) Drivable areas
(b) Desired route
(c) Lane boundaries
(d) Vehicles
(e) Pedestrians
(f) Lights and stops
Figure 2: The BEV representation used by our Roach.

3 Method

In this section we describe Roach and how IL agents can benefit from diverse supervisions supplied by Roach.

3.1 RL Coach

Our Roach has three features. Firstly, in contrast to previous RL agents, Roach does not depend on data from other experts. Secondly, unlike the rule-based Autopilot, Roach is end-to-end trainable, hence it can generalize to new scenarios with minor engineering efforts. Thirdly, it has a high sample efficiency. Using our proposed input/output representation and exploration loss, training Roach from scratch to achieve top expert performance on the six LeaderBoard maps takes less than a week on a single GPU machine.

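Roach consists of a policy network $\pi_\theta(\mathbf{a} \mid \mathbf{i}_{\text{RL}}, \mathbf{m}_{\text{RL}})$ parameterized by $\theta$ and a value network $V_\phi(\mathbf{i}_{\text{RL}}, \mathbf{m}_{\text{RL}})$ parameterized by $\phi$. The policy network maps a BEV image $\mathbf{i}_{\text{RL}}$ and a measurement vector $\mathbf{m}_{\text{RL}}$ to a distribution of actions $\mathbf{a}$. The value network estimates a scalar value $v$, while taking the same inputs as the policy network.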

Input Representation: We use a BEV semantic segmentation image $\mathbf{i}_{\text{RL}}$ to reduce the problem complexity, similar to the one used in [3, 9, 10]. It is rendered using ground-truth simulation states and consists of $C$ grayscale images of size $W \times H$. The ego-vehicle is heading upwards and is centered in all images at $D$ pixels above the bottom, but it is not rendered. Fig. 2 illustrates each channel of $\mathbf{i}_{\text{RL}}$. Drivable areas and intended routes are rendered respectively in Fig. 2(a) and 2(b). In Fig. 2(c) solid lines are white and broken lines are grey. Fig. 2(d) is a temporal sequence of grayscale images in which cyclists and vehicles are rendered as white bounding boxes. Fig. 2(e) is the same as Fig. 2(d) but for pedestrians. Similarly, stop lines at traffic lights and trigger areas of stop signs are rendered in Fig. 2(f). Red lights and stop signs are colored by the brightest level, yellow lights by an intermediate level and green lights by a darker level. A stop sign is rendered if it is active, i.e. once the ego-vehicle enters its vicinity, and it disappears once the ego-vehicle has made a full stop. By letting the BEV representation memorize if the ego-vehicle has stopped, we can use a network architecture without recurrent structure and hence reduce the model size of Roach. A colored combination of all channels is visualized in Fig. 1. We also feed Roach a measurement vector $\mathbf{m}_{\text{RL}}$ containing the states of the ego-vehicle not represented in the BEV; these include ground-truth measurements of steering, throttle, brake, gear, lateral and horizontal speed.

Output Representation: Low-level actions of CARLA are steering $\in [-1, 1]$, throttle $\in [0, 1]$ and brake $\in [0, 1]$. An effective way to reduce the problem complexity is predicting waypoint plans which are then tracked by a PID-controller to produce low-level actions [9, 40]. However, a PID-controller is not reliable for trajectory tracking and requires excessive parameter tuning. A model-based controller would be a better solution, but CARLA’s vehicle dynamics model is not directly accessible. To avoid parameter tuning and system identification, Roach directly predicts action distributions. Its action space is $\mathbf{a} \in [-1, 1]^2$ for steering and acceleration, where positive acceleration corresponds to throttle and negative corresponds to brake. To describe actions we use the Beta distribution $\mathcal{B}(\alpha, \beta)$, where $\alpha, \beta > 0$ are respectively the concentration on 1 and 0. Compared to the Gaussian distribution, which is commonly used in model-free RL, the support of the Beta distribution is bounded, thus avoiding clipping or squashing to enforce input constraints. This results in a better-behaved learning problem since no tanh layers are needed and the entropy and KL-divergence can be computed explicitly. Further, the modality of the Beta distribution is also suited for driving, where extreme maneuvers may often be taken, for example, emergency braking or taking a sharp turn.
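The output representation can be illustrated with a minimal sketch, which maps values drawn from Beta distributions on [0, 1] to steering and acceleration in [-1, 1] and then splits the acceleration into throttle and brake. The function names, the linear rescaling and the concentration values below are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import random

def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta) on [0, 1]; valid for alpha > 1 and beta > 1."""
    return (alpha - 1.0) / (alpha + beta - 2.0)

def to_carla_control(steer01, accel01):
    """Map two values in [0, 1] to (steer, throttle, brake).

    Assumption: a linear rescaling x -> 2x - 1 turns the Beta support [0, 1]
    into the action range [-1, 1]; positive acceleration becomes throttle,
    negative acceleration becomes brake.
    """
    steer = 2.0 * steer01 - 1.0
    accel = 2.0 * accel01 - 1.0
    throttle = max(accel, 0.0)
    brake = max(-accel, 0.0)
    return steer, throttle, brake

# Stochastic action, as one would sample during training (illustrative parameters).
rng = random.Random(0)
sampled = to_carla_control(rng.betavariate(2.0, 2.0), rng.betavariate(3.0, 1.5))

# Deterministic action taken from the distribution modes.
deterministic = to_carla_control(beta_mode(2.0, 2.0), beta_mode(3.0, 1.5))
```

Because the Beta support is bounded, the rescaled values always fall inside the valid control ranges, so no clipping step is needed in this sketch.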

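Training: We use proximal policy optimization (PPO) [44] with clipping to train the policy network $\pi_\theta$ and the value network $V_\phi$. To update both networks, we collect trajectories $\tau$ by executing $\pi_{\theta_k}$ on CARLA. A trajectory includes BEV images $\mathbf{i}_{\text{RL},k}$, measurement vectors $\mathbf{m}_{\text{RL},k}$, actions $\mathbf{a}_k$, rewards $r_k$ and a terminal event $z \in \mathcal{Z}$ that triggers the termination of an episode. The value network is trained to regress the expected returns, whereas the policy network is updated via

$$\theta_{k+1} = \arg\min_\theta \; \mathbb{E}_{\tau \sim \pi_{\theta_k}} \left[ \mathcal{L}_{\text{ppo}} + \mathcal{L}_{\text{ent}} + \mathcal{L}_{\text{exp}} \right].$$

The first objective $\mathcal{L}_{\text{ppo}}$ is the clipped policy gradient loss with advantages estimated using generalized advantage estimation [43]. The second objective $\mathcal{L}_{\text{ent}}$ is a maximum entropy loss commonly employed to encourage exploration

$$\mathcal{L}_{\text{ent}} = -\lambda_{\text{ent}} \cdot \mathrm{H}\left(\pi_\theta(\cdot \mid \mathbf{i}_{\text{RL}}, \mathbf{m}_{\text{RL}})\right).$$

$\mathcal{L}_{\text{ent}}$ pushes the action distribution towards a uniform prior because maximizing entropy is equivalent to minimizing the KL-divergence to a uniform distribution,

$$\mathrm{H}(\pi_\theta) = -\mathrm{KL}\left(\pi_\theta \,\|\, \mathcal{U}(-1, 1)\right) + \text{const},$$

if both distributions share the same support. This inspires us to propose a generalized form of $\mathcal{L}_{\text{ent}}$, which encourages exploration in sensible directions that comply with basic traffic rules. We call it the exploration loss and define it as

$$\mathcal{L}_{\text{exp}} = \lambda_{\text{exp}} \cdot \mathbb{1}_{\{T - N_z + 1, \dots, T\}}(k) \cdot \mathrm{KL}\left(\pi_\theta(\cdot \mid \mathbf{i}_{\text{RL},k}, \mathbf{m}_{\text{RL},k}) \,\|\, p_z\right),$$

where $\mathbb{1}$ is the indicator function, $T$ is the last time step of the episode and $z \in \mathcal{Z}$ is the event that ends the episode. The terminal condition set $\mathcal{Z}$ includes collision, running traffic light/sign, route deviation and being blocked. Unlike $\mathcal{L}_{\text{ent}}$, which imposes a uniform prior on the actions at all time steps regardless of which $z$ is triggered, $\mathcal{L}_{\text{exp}}$ shifts actions within the last $N_z$ steps of an episode towards a predefined exploration prior $p_z$, which encodes an “advice” to prevent the triggered event $z$ from happening again. In practice, we use $N_z = 100$, i.e. 10 seconds at 10 FPS. If $z$ is related to a collision or running traffic light/sign, we apply a prior $p_z$ on the acceleration to encourage Roach to slow down while the steering is unaffected. In contrast, if the car is blocked we use an acceleration prior that favors speeding up. For route deviations, a uniform prior $\mathcal{B}(1, 1)$ is applied on the steering. Despite being equivalent to maximizing entropy in this case, the exploration loss further encourages exploration on steering angles during the last 10 seconds before the route deviation.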
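To make the exploration loss concrete, the following sketch evaluates a KL-divergence between Beta distributions only over the last steps of an episode, which plays the role of the indicator function. It is a hedged illustration: the KL term is approximated numerically with the midpoint rule (a closed form exists but needs the digamma function), and the prior parameters, the episode and all function names are hypothetical.

```python
import math

def beta_log_pdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_norm

def kl_beta(p, q, n=20000):
    """KL(p || q) for two Beta distributions given as (a, b) pairs.

    Numerical approximation with the midpoint rule on (0, 1).
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        lp = beta_log_pdf(x, *p)
        total += math.exp(lp) * (lp - beta_log_pdf(x, *q)) * h
    return total

def exploration_loss(policies, prior, n_last, weight=1.0):
    """Weighted sum of KL(pi_k || prior) over the last n_last steps.

    policies: list of (a, b) Beta parameters, one per time step, for a
    single action dimension. Earlier steps contribute nothing.
    """
    start = max(len(policies) - n_last, 0)
    return weight * sum(kl_beta(p, prior) for p in policies[start:])

# Hypothetical episode that ends in a collision: the prior puts more mass
# near 0, i.e. towards braking under a [0, 1] -> [-1, 1] rescaling.
episode = [(3.0, 1.5)] * 5 + [(2.0, 2.0)] * 3
brake_prior = (1.0, 2.5)  # illustrative values only
loss = exploration_loss(episode, brake_prior, n_last=3)
```

With a uniform prior Beta(1, 1), `kl_beta` reduces to the negative entropy of the policy, which matches the relation between the entropy loss and the exploration loss described above.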

(a) Roach
Figure 3: Network architecture of Roach, the RL expert, and CILRS, the IL agent.

Implementation Details: Our implementation of PPO-clip is based on [39] and the network architecture is illustrated in Fig. 3(a). We use six convolutional layers to encode the BEV and two fully-connected (FC) layers to encode the measurement vector. Outputs of both encoders are concatenated and then processed by another two FC layers to produce a latent feature $\mathbf{j}_{\text{RL}}$, which is then fed into a value head and a policy head, each with two FC hidden layers. Trajectories are collected from six CARLA servers at 10 FPS; each server corresponds to one of the six LeaderBoard maps. At the beginning of each episode, a pair of start and target locations is randomly selected and the desired route is computed using the A* algorithm. Once the target is reached, a new random target will be chosen, hence the episode is endless unless one of the terminal conditions in $\mathcal{Z}$ is met. We use the reward of [47] and additionally penalize large steering changes to prevent oscillating maneuvers. To avoid infractions at high speed, we add an extra penalty proportional to the ego-vehicle’s speed. More details are in the supplement.

3.2 IL Agents Supervised by Roach

To allow IL agents to benefit from the informative supervisions generated by Roach, we formulate a loss for each of the supervisions. Our training scheme using Roach can be applied to improve the performance of existing IL agents. Here we use DA-RB [37] (CILRS [12] + DAGGER [41]) as an example to demonstrate its effectiveness.

CILRS: The network architecture of CILRS is illustrated in Fig. 3(b). It includes a perception module that encodes the camera image $\mathbf{i}_{\text{IL}}$ and a measurement module that encodes the measurement vector $\mathbf{m}_{\text{IL}}$. Outputs of both modules are concatenated and processed by FC layers to generate a bottleneck latent feature $\mathbf{j}_{\text{IL}}$. Navigation instructions are given as discrete high-level commands and for each kind of command a branch is constructed. All branches share the same architecture, while each branch contains an action head predicting continuous actions and a speed head predicting the current speed of the ego-vehicle. The latent feature $\mathbf{j}_{\text{IL}}$ is processed by the branch selected by the command. The imitation objective of CILRS consists of an L1 action loss

$$\mathcal{L}_A = \lVert \hat{\mathbf{a}} - \mathbf{a} \rVert_1$$

and a speed prediction regularization

$$\mathcal{L}_S = \lambda_S \cdot \lvert \hat{s} - s \rvert,$$

where $\lambda_S$ is a scalar weight, $\hat{\mathbf{a}}$ is the expert’s action, $\hat{s}$ is the measured speed, and $\mathbf{a}$ and $s$ are the action and speed predicted by CILRS. Expert actions $\hat{\mathbf{a}}$ may come from the Autopilot, which directly outputs deterministic actions, or from Roach, where the distribution mode is taken as the deterministic output. Besides deterministic actions, Roach also predicts action distributions, values and latent features. Next we will formulate a loss function for each of them.
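The branched imitation objective can be sketched in a few lines of plain Python: only the branch selected by the high-level command contributes, and the loss is an L1 action error plus a weighted speed error. The command names, the numeric values and the weight are hypothetical and serve only to illustrate the structure.

```python
def l1_action_loss(expert_action, predicted_action):
    """Sum of absolute differences between expert and predicted actions."""
    return sum(abs(e - p) for e, p in zip(expert_action, predicted_action))

def speed_loss(measured_speed, predicted_speed, weight):
    """Weighted absolute error of the speed head."""
    return weight * abs(measured_speed - predicted_speed)

def cilrs_loss(branches, command, expert_action, measured_speed, speed_weight):
    """Imitation loss for one sample, using only the commanded branch.

    branches: dict mapping a command name to (predicted_action, predicted_speed).
    """
    predicted_action, predicted_speed = branches[command]
    return (l1_action_loss(expert_action, predicted_action)
            + speed_loss(measured_speed, predicted_speed, speed_weight))

# Hypothetical outputs of a two-branch agent for one camera frame.
branches = {
    "follow_lane": ([0.10, 0.50], 6.0),
    "turn_left": ([-0.40, 0.20], 4.0),
}
loss = cilrs_loss(branches, "turn_left", expert_action=[-0.50, 0.30],
                  measured_speed=4.5, speed_weight=0.1)
```

In this example the action error is 0.1 + 0.1 and the weighted speed error is 0.05, so the loss is approximately 0.25.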

Action Distribution Loss: Inspired by [20], which suggests soft targets may provide more information per sample than hard targets, we propose a new action loss based on the action distributions as a replacement for the L1 action loss $\mathcal{L}_A$. The action head of CILRS is modified to predict distribution parameters and the loss is formulated as a KL-divergence

$$\mathcal{L}_K = \mathrm{KL}\left(\hat{\pi} \,\|\, \pi\right)$$

between the action distribution $\hat{\pi}$ predicted by the Roach expert and $\pi$ predicted by the CILRS agent.

Feature Loss: Feature matching is an effective way to transfer knowledge between networks and its effectiveness in supervising IL driving agents is demonstrated in [22, 54]. The latent feature $\mathbf{j}_{\text{RL}}$ of Roach is a compact representation that contains essential information for driving as it can be mapped to expert actions using an action head consisting of only two FC layers (cf. Fig. 3(a)). Moreover, $\mathbf{j}_{\text{RL}}$ is invariant to rendering and weather as Roach uses the BEV representation. Learning to embed camera images to the latent space of $\mathbf{j}_{\text{RL}}$ should help IL agents to generalize to new weather and new situations. Hence, we propose the feature loss

$$\mathcal{L}_F = \lambda_F \cdot \lVert \mathbf{j}_{\text{RL}} - \mathbf{j}_{\text{IL}} \rVert_2^2,$$

where $\mathbf{j}_{\text{IL}}$ is the latent feature of the IL agent and $\lambda_F$ is a scalar weight.


Value Loss: Multi-task learning with driving-related side tasks could also boost the performance of end-to-end IL driving agents as shown in [51], which used scene segmentation as a side task. Intuitively, the value predicted by Roach contains driving relevant information because it estimates the expected future return, which relates to how dangerous a situation is. Therefore, we augment CILRS with a value head and regress value as a side task. The value loss is the mean squared error between $\hat{v}$, the value estimated by Roach, and $v$, the value predicted by CILRS,

$$\mathcal{L}_V = \lambda_V \cdot (\hat{v} - v)^2.$$
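The feature and value supervisions can be sketched as two small regression terms. This is an illustration under assumptions: the squared L2 form of the feature distance, the weights, and the toy 4-dimensional features are chosen here for clarity and are not confirmed details of the authors' implementation.

```python
def feature_loss(expert_feature, agent_feature, weight):
    """Weighted squared L2 distance between two latent feature vectors."""
    return weight * sum((e - a) ** 2 for e, a in zip(expert_feature, agent_feature))

def value_loss(expert_value, agent_value, weight):
    """Weighted squared error between expert and agent value estimates."""
    return weight * (expert_value - agent_value) ** 2

# Hypothetical 4-dimensional latent features and scalar values.
j_expert = [0.5, -1.0, 0.0, 2.0]
j_agent = [0.0, -1.0, 1.0, 2.0]
side_task_loss = (feature_loss(j_expert, j_agent, weight=0.5)
                  + value_loss(expert_value=1.5, agent_value=1.0, weight=2.0))
```

Here the feature term is 0.5 * 1.25 and the value term is 2.0 * 0.25, giving a total of 1.125. Both terms would be added to the action loss during training.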


Implementation Details: Our implementation follows DA-RB [37]. We choose a ResNet-34 pretrained on ImageNet as the image encoder to generate a 1000-dimensional feature given $\mathbf{i}_{\text{IL}}$, a wide-angle camera image with a horizontal FOV. Outputs of the image and the measurement encoder are concatenated and processed by three FC layers to generate $\mathbf{j}_{\text{IL}}$, which shares the same size as $\mathbf{j}_{\text{RL}}$. More details are found in the supplement.

4 Experiments

Benchmarks: All evaluations are completed on CARLA 0.9.11. We evaluate our methods on the NoCrash [12] and the latest LeaderBoard benchmark [46] (we use the 50 public training routes and the 26 public testing routes). Each benchmark specifies its training towns and weather, where the agent is allowed to collect data, and evaluates the agent in new towns and weather. The NoCrash benchmark considers generalization from Town 1, a European town composed of solely one-lane roads and T-junctions, to Town 2, a smaller version of Town 1 with different textures. By contrast, the LeaderBoard considers a more difficult generalization task in six maps that cover diverse traffic situations, including freeways, US-style junctions, roundabouts, stop signs, lane changing and merging. Following the NoCrash benchmark, we test generalization from four training weather types to two new weather types. But to save computational resources, only two out of the four training weather types are evaluated. The NoCrash benchmark comes with three levels of traffic density (empty, regular and dense), which define the number of pedestrians and vehicles in each map. We focus on NoCrash-dense and introduce a new level between regular and dense traffic, NoCrash-busy, to avoid congestion that often appears in the dense traffic setting. For the CARLA LeaderBoard the traffic density in each map is tuned to be comparable to the busy traffic setting.

Metrics: Our results are reported in success rate, the metric proposed by NoCrash, and driving score, a new metric introduced by the CARLA LeaderBoard. The success rate is the percentage of routes completed without collision or blockage. The driving score is defined as the product of route completion, the percentage of route distance completed, and infraction penalty, a discount factor that aggregates all triggered infractions. For example, if the agent ran two red lights in one route and the penalty coefficient for running one red light was $0.7$, then the infraction penalty would be $0.7^2 = 0.49$. Compared to the success rate, the driving score is a fine-grained metric that considers more kinds of infractions and it is better suited to evaluate long-distance routes. More details about the benchmarks and the complete results are found in the supplement.
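The driving score defined above can be computed as in the short sketch below, which multiplies the route completion by the product of the penalty coefficients of all triggered infractions. The coefficient value and the function names are assumptions made for this worked example.

```python
def infraction_penalty(coefficients):
    """Product of the penalty coefficients of all triggered infractions."""
    penalty = 1.0
    for c in coefficients:
        penalty *= c
    return penalty

def driving_score(route_completion, coefficients):
    """Route completion (in [0, 1]) discounted by the infraction penalty."""
    return route_completion * infraction_penalty(coefficients)

RED_LIGHT = 0.7  # illustrative coefficient, assumed for this example
score = driving_score(0.9, [RED_LIGHT, RED_LIGHT])
```

An agent that completes 90% of a route while running two red lights would thus obtain a driving score of about 0.9 * 0.49 = 0.441 under this assumed coefficient, whereas a route without infractions keeps its full route completion as the score.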

Figure 4: Learning curves of RL experts trained in CARLA Town 1-6. Solid lines show the mean and shaded areas show the standard deviation of episode returns across 3 seeds. The dashed line shows an outlier run that collapsed.

4.1 Performance of Experts

We use a different CARLA version to train RL experts and fine-tune our Autopilot, yet all evaluations are still on 0.9.11.

Sample Efficiency: To improve the sample efficiency of PPO, we propose to use BEV instead of camera images, Beta instead of Gaussian distributions, and the exploration loss in addition to the entropy loss. Since the benefit of using a BEV representation is obvious, here we only ablate the Beta distribution and the exploration loss. As shown in Fig. 4, the baseline PPO with Gaussian distribution and entropy loss is trapped in a local minimum where staying still is the most rewarding strategy. Leveraging the exploration loss, PPO+exp can be successfully trained despite relatively high variance and low sample efficiency. The Beta distribution helps substantially, but without the exploration loss the training still collapsed in some cases due to insufficient exploration (cf. dashed blue line in Fig. 4). Our Roach (PPO+beta+exp) uses both the Beta distribution and the exploration loss to ensure stable and sample-efficient training. The training takes around 1.7M steps in each of the six CARLA servers; this accounts for 10M steps in total, which takes roughly a week on an AWS EC2 g4dn.4xlarge or 4 days on a 2080 Ti machine with 12 cores.

Driving Performance: Table 1 compares different experts on the NoCrash-dense and on all 76 LeaderBoard routes under dynamic weather with busy traffic. Our Autopilot is a strong baseline expert that achieves a higher success rate than the Autopilot used in LBC and DA-RB. We evaluate three RL experts: (1) Roach, the proposed RL coach using the Beta distribution and the exploration prior; (2) PPO+beta, the RL coach trained without the exploration prior; and (3) PPO+exp, the RL coach trained without the Beta distribution. In general, our RL experts achieve comparable success rates and higher driving scores than Autopilots because RL experts handle traffic lights in a better way (cf. Table 3). The two Autopilots often run red lights because they drive over-conservatively and wait too long at the junction, thus missing the green light. Among RL experts, PPO+beta and Roach, the two RL experts using a Beta distribution, achieve the best performance, while the difference between both is not significant. PPO+exp performs slightly worse, but it still achieves better driving scores than our Autopilot.

Columns: NCd-tt, NCd-tn, NCd-nt, NCd-nn, LB-all. Rows: success rate (%) for AP (ours), AP-lbc [9] (LB-all: N/A) and AP-darb [37] (LB-all: N/A); driving score (%) for AP (ours).
Table 1: Success rate and driving score of experts. Mean and standard deviation over 3 evaluation seeds. NCd: NoCrash-dense. tt: train town & weather. tn: train town & new weather. nt: new town & train weather. nn: new town & weather. LB-all: all 76 routes of LeaderBoard with dynamic weather. AP: CARLA Autopilot. For RL experts the best checkpoint among all training seeds and runs is used.

4.2 Performance of IL Agents

The performance of an IL agent is limited by the performance of the expert it is imitating. If the expert performs poorly, it is not sensible to compare IL agents imitating that expert. As shown in Table 1, this issue is evident in the NoCrash new town with dense traffic, where Autopilots do not perform well. To ensure a high performance upper-bound and hence a fair comparison, we conduct ablation studies (Fig. 5 and Table 3) under the busy traffic setting such that our Autopilot can achieve a driving score of 80% and a success rate of 90%. In order to compare with the state-of-the-art, the best model from the ablation studies is still evaluated on NoCrash with dense traffic in Table 2.

The input measurement vector of the IL agent is different for NoCrash and for the LeaderBoard. For NoCrash, it is just the speed. For the LeaderBoard, it additionally contains a 2D vector pointing to the next desired waypoint. This vector is computed from noisy GPS measurements and the desired route is specified as sparse GPS locations. The LeaderBoard instruction suggests that it is used to disambiguate situations where the semantics of left and right are not clear due to the complexity of the considered map.

Figure 5: Driving score of experts and IL agents. All IL agents (dashed lines) are supervised by Roach except for the baseline, which is supervised by our Autopilot. For IL agents at the 5th iteration on NoCrash and all experts, results are reported as the mean over 3 evaluation seeds. Other agents are evaluated with only one seed.
Columns: NCd-tt, NCd-tn, NCd-nt, NCd-nn (success rate in %). Rows, with the CARLA version in parentheses where given: LBC [9] (0.9.6), SAM [54] (0.8.4), LSD [33] (0.8.4, two entries N/A), DA-RB+(E) [37], DA-RB+ [37] (0.8.4), our baseline, our best.
Table 2: Success rate of camera-based end-to-end IL agents on NoCrash-dense. Mean and standard deviation over 3 seeds. Our models are from DAGGER iteration 5. For DA-RB, + means triangular perturbations are added to the off-policy dataset, (E) means ensemble of all iterations.
Table 3: Driving performance and infraction analysis of IL agents on NoCrash-busy, new town & new weather. Mean and standard deviation over 3 evaluation seeds. Agents are taken from DAGGER iteration 5; performance metrics are reported in % and infractions, including red light infractions, in #/Km.

Ablation: Fig. 5 shows driving scores of experts and IL agents at each DAGGER iteration on the NoCrash and LeaderBoard with busy traffic. The baseline is our implementation of DA-RB+ supervised by our Autopilot. Given our improved Autopilot, it is expected that this baseline can achieve higher success rates than those reported in the DA-RB paper, but this is not observed in Table 2. The large performance gap between the Autopilot and this baseline (cf. Fig. 5), especially while generalizing to a new town and new weather, indicates the limitation of this baseline.

By replacing the Autopilot with Roach, the agent trained with the L1 action loss performs better overall than the Autopilot-supervised baseline. Further learning from the action distribution, the agent trained with the action distribution loss generalizes better than the one trained with the L1 action loss on the NoCrash but not on the LeaderBoard. Feature matching only helps when the latent feature of the IL agent is provided with the necessary information needed to reproduce the latent feature of Roach. In our case, the latent feature of Roach contains navigational information as the desired route is rendered in the BEV input. For the LeaderBoard, navigational information is partially encoded in the measurement vector of the IL agent, which includes the vector to the next desired waypoint, so better performance is observed by using the feature loss. But for NoCrash this information is missing as the measurement vector is just the speed, hence it is impractical for the IL agent to mimic the latent feature of Roach and this causes the inferior performance of the agents trained with the feature loss. To confirm this hypothesis, we evaluate a single-branch network architecture where the measurement vector is augmented by the command encoded as a one-hot vector. Using feature matching with this architecture, the agents trained with the feature loss achieve the best driving score among IL agents in the NoCrash new town & weather generalization test, even outperforming the Autopilot.
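The single-branch variant described above can be illustrated with a minimal sketch that appends a one-hot encoding of the high-level command to the measurement vector. The command names and their order are hypothetical placeholders, not the identifiers used in CARLA or in the authors' code.

```python
COMMANDS = ["follow_lane", "turn_left", "turn_right", "go_straight"]  # hypothetical names

def one_hot(command):
    """Encode a discrete command as a one-hot list over COMMANDS."""
    return [1.0 if c == command else 0.0 for c in COMMANDS]

def augment_measurements(measurements, command):
    """Append the one-hot command so a single-branch network can condition on it."""
    return list(measurements) + one_hot(command)

m_nocrash = [5.2]  # on NoCrash the measurement vector is just the speed
m_augmented = augment_measurements(m_nocrash, "turn_left")
```

With the command available in the input, the latent feature of the IL agent has access to the navigational information it needs to match the latent feature of the expert, which is the hypothesis tested in this ablation.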

Using value supervision in addition to feature matching helps the DAGGER process to converge faster. However, without feature matching, using value supervision alone does not demonstrate superior performance. This indicates a potential synergy between feature matching and value estimation. Intuitively, the latent feature of Roach encodes the information needed for value estimation, hence mimicking this feature should help to predict the value, while value estimation could help to regularize feature matching.

Comparison with the State-of-the-art: In Table 2 we compare the baseline and our best performing agent with the state-of-the-art on the NoCrash-dense benchmark. Our baseline performs comparably to DA-RB+ except when generalizing to the new weather, where there is an incorrect rendering of after-rain puddles on CARLA 0.9.11 (see supplement for visualizations). This issue does not affect our best method due to the stronger supervision of Roach. By mimicking the weather-agnostic Roach, the performance of our IL agent drops by less than while generalizing to the new town and weather. Hence, if the Autopilot is considered the performance upper-bound, it is fair to claim our approach saturates the NoCrash benchmark. However, as shown in Fig. 5, there is still room for improvement on NoCrash compared to Roach, and the performance gap on the LeaderBoard highlights the importance of this new benchmark.

Performance and Infraction Analysis: Table 3 provides the detailed performance and infraction analysis on the NoCrash benchmark with busy traffic in the new town & weather setting. Most notably, the extremely high “Agent blocked” of our baseline is due to reflections from after-rain puddles. This problem is largely alleviated by imitating Roach, which drives more naturally, and shows an absolute improvement in terms of driving score. In other words, this is the gain achieved by using a better expert with the same imitation learning approach. Further using the improved supervision from soft targets and latent features results in our best model, which demonstrates another absolute improvement. By handling red lights in a better way, this agent achieves an expert-level driving score using a single camera image as input.

5 Conclusion

We present Roach, an RL expert, and an effective way to imitate this expert. Using the BEV representation, Beta distribution and the exploration loss, Roach sets the new performance upper-bound on CARLA while demonstrating high sample efficiency. To enable a more effective imitation, we propose to learn from soft targets, values and latent features generated by Roach. Supervised by these informative targets, a baseline end-to-end IL agent using a single camera image as input can achieve state-of-the-art performance, even reaching expert-level performance on the NoCrash-dense benchmark. Future work includes performance improvement on simulation benchmarks and real-world deployment. To saturate the LeaderBoard, the model capacity should be increased [3, 18, 33]. To apply Roach to label real-world on-policy data, several sim-to-real gaps have to be addressed besides the photorealism, which is partially alleviated by the BEV. For urban driving simulators, the realistic behavior of road users is of utmost importance [45].

Acknowledgements: This work was funded by Toyota Motor Europe via the research project TRACE Zurich.


  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1. Cited by: §1.
  • [2] A. Amini, I. Gilitschenski, J. Phillips, J. Moseyko, R. Banerjee, S. Karaman, and D. Rus (2020) Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robotics and Automation Letters (RA-L) 5 (2), pp. 1143–1150. Cited by: §1, §1.
  • [3] M. Bansal, A. Krizhevsky, and A. S. Ogale (2019) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. In Robotics: Science and Systems XV. Cited by: §1, §1, §2, §3.1, §5.
  • [4] A. Behl, K. Chitta, A. Prakash, E. Ohn-Bar, and A. Geiger (2020) Label efficient visual abstractions for autonomous driving. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §2.
  • [5] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §1.
  • [6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. arXiv preprint arXiv:1606.01540. Cited by: §B.2.
  • [7] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §1.
  • [8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730. Cited by: §2.
  • [9] D. Chen, B. Zhou, V. Koltun, and P. Krähenbühl (2020) Learning by cheating. In Conference on Robot Learning (CoRL), pp. 66–75. Cited by: §2, §2, §3.1, §3.1, Table 1, Table 2.
  • [10] J. Chen, B. Yuan, and M. Tomizuka (2019) Model-free deep reinforcement learning for urban autonomous driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 2765–2771. Cited by: §2, §3.1.
  • [11] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), pp. 4693–4700. Cited by: §1, §2.
  • [12] F. Codevilla, E. Santana, A. M. López, and A. Gaidon (2019) Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9329–9338. Cited by: §C.2, §2, §3.2, §4.
  • [13] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning (CoRL), pp. 1–16. Cited by: §1, §2, §2.
  • [14] S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596. Cited by: §2.
  • [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), pp. 1861–1870. Cited by: §1, §2.
  • [16] H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2094–2100. Cited by: §2.
  • [17] S. Hecker, D. Dai, A. Liniger, M. Hahner, and L. Van Gool (2020) Learning accurate and human-like driving using semantic maps and attention. In IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 2346–2353. Cited by: §1.
  • [18] S. Hecker, D. Dai, and L. Van Gool (2018) End-to-end learning of driving models with surround-view cameras and route planners. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–453. Cited by: §5.
  • [19] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.2.
  • [21] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 29. Cited by: §1.
  • [22] Y. Hou, Z. Ma, C. Liu, and C. C. Loy (2019) Learning to steer by mimicking features from heterogeneous auxiliary networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2, §3.2.
  • [23] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska (2020) One thousand and one hours: self-driving motion prediction dataset. Note: https://level5.lyft.com/dataset/ Cited by: §1.
  • [24] G. Kahn, P. Abbeel, and S. Levine (2021) LaND: learning to navigate from disengagements. IEEE Robotics and Automation Letters (RA-L). Cited by: §1.
  • [25] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J. Allen, V. Lam, A. Bewley, and A. Shah (2019) Learning to drive in a day. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8248–8254. Cited by: §1.
  • [26] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [27] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7345–7353. Cited by: §1.
  • [28] X. Liang, T. Wang, L. Yang, and E. Xing (2018) Cirl: controllable imitative reinforcement learning for vision-based self-driving. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 584–599. Cited by: §1, §2.
  • [29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1928–1937. Cited by: §2.
  • [31] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, et al. (2008) Junior: the stanford entry in the urban challenge. Journal of Field Robotics 25 (9), pp. 569–597. Cited by: §1.
  • [32] M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun (2018) Driving policy transfer via modularity and abstraction. In Proceedings of the Conference on Robot Learning (CoRL), Vol. 87, pp. 1–15. Cited by: §1.
  • [33] E. Ohn-Bar, A. Prakash, A. Behl, K. Chitta, and A. Geiger (2020) Learning situational driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11296–11305. Cited by: §2, Table 2, §5.
  • [34] P. Palanisamy (2020) Multi-agent connected autonomous driving using deep reinforcement learning. In International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §2.
  • [35] Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. A. Theodorou, and B. Boots (2018) Agile autonomous driving using end-to-end deep imitation learning. In Robotics: Science and Systems (RSS), Cited by: §1, §2.
  • [36] D. Pomerleau (1989) ALVINN: an autonomous land vehicle in a neural network. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pp. 305–313. Cited by: §1.
  • [37] A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger (2020) Exploring data aggregation in policy learning for vision-based urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11763–11773. Cited by: §C.2, §1, §1, §2, §3.2, §3.2, Table 1, Table 2.
  • [38] C. R. Qi, Y. Zhou, M. Najibi, P. Sun, K. Vo, B. Deng, and D. Anguelov (2021) Offboard 3D Object Detection from Point Cloud Sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6134–6144. Cited by: §1.
  • [39] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann (2019) Stable baselines3. GitHub. Note: https://github.com/DLR-RM/stable-baselines3 Cited by: §3.1.
  • [40] N. Rhinehart, R. McAllister, and S. Levine (2020) Deep imitative models for flexible inference, planning, and control. In International Conference on Learning Representations (ICLR), Cited by: §2, §3.1.
  • [41] S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 627–635. Cited by: §1, §2, §3.2.
  • [42] A. Sauer, N. Savinov, and A. Geiger (2018) Conditional affordance learning for driving in urban environments. In Conference on Robot Learning (CoRL), pp. 237–252. Cited by: §2.
  • [43] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §3.1.
  • [44] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.1.
  • [45] S. Suo, S. Regalado, S. Casas, and R. Urtasun (2021) TrafficSim: learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • [46] CARLA team (2020) CARLA autonomous driving leaderboard. Note: https://leaderboard.carla.org/ Accessed: 2021-02-11. Cited by: §4.
  • [47] M. Toromanoff, E. Wirbel, and F. Moutarde (2019) Is deep reinforcement learning really superhuman on atari?. In Deep Reinforcement Learning Workshop of the Conference on Neural Information Processing Systems, Cited by: §C.1, §2, §3.1.
  • [48] M. Toromanoff, E. Wirbel, and F. Moutarde (2020) End-to-end model-free reinforcement learning for urban driving using implicit affordances. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7153–7162. Cited by: §1, §2.
  • [49] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al. (2008) Autonomous driving in urban environments: boss and the urban challenge. Journal of Field Robotics 25 (8), pp. 425–466. Cited by: §1.
  • [50] D. Wang, C. Devin, Q. Cai, P. Krähenbühl, and T. Darrell (2019) Monocular plan view networks for autonomous driving. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §1.
  • [51] H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2174–2182. Cited by: §1, §3.2.
  • [52] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) Bdd100k: a diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 2636–2645. Cited by: §1.
  • [53] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun (2019) End-to-end interpretable neural motion planner. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8660–8669. Cited by: §1.
  • [54] A. Zhao, T. He, Y. Liang, H. Huang, G. V. den Broeck, and S. Soatto (2020) SAM: squeeze-and-mimic networks for conditional visual driving policy learning. In Conference on Robot Learning (CoRL), Cited by: §2, §2, §3.2, Table 2.

Appendix A Summary

In the appendix, we provide (1) an overview of supplementary videos and code, (2) implementation details of the RL experts and the IL agents, (3) details regarding benchmarks, and (4) additional experimental results.

Appendix B Other Supplementary Materials

B.1 Videos

To investigate how different agents actually drive, we provide three videos. roach.mp4 shows the driving performance of Roach, and highlights that it has a natural driving style and that it can handle complex traffic scenes. In autopilot.mp4 we demonstrate the rule-based CARLA Autopilot. This agent uses unnatural brake actuation, i.e. it only uses emergency braking. Further, this video also highlights that in dense traffic, the rule-based agent can get stuck due to conservative danger predictions. For more details about the Autopilot and the changes we made, see Section C.3. Finally, in il_agent.mp4 we demonstrate our best Roach-supervised IL agent, showing that the agent can handle complex traffic scenes but also highlighting failure cases. In detail:

  • roach.mp4 is an uncut evaluation run recorded from Roach driving in Town03 (LeaderBoard-busy under dynamic weather). This video demonstrates the natural driving style of Roach even in challenging situations such as US-style traffic lights, unprotected left turns, roundabouts and stop signs.

  • autopilot.mp4 is an uncut evaluation run recorded from Autopilot driving in Town02 (NoCrash-dense, new town & new weather). This video demonstrates the over-conservative behavior of the Autopilot while driving in dense traffic. This often leads to red light infractions and blockage (both are present in the video).

  • il_agent.mp4 is a highlight video recorded from our best Roach-supervised IL agent. This video includes multiple challenging situations often encountered during urban driving, such as EU and US-style junctions, unprotected left turns, roundabouts and reacting to pedestrians walking into the street. Furthermore, we highlight some of the failure modes of our camera-based IL agent, including not coming to a full stop for stop signs, collisions at overcrowded intersections and oscillation in the steering if the lane markings are not visible due to sun glare. We believe that including memory in the IL agent policy can help with most of these issues, due to a better understanding of the ego-motion (stop sign and oscillations) and other agents’ motion (collisions).

B.2 Code

To reproduce our results, we provide four Python scripts:

  • train_rl.py for training Roach.

  • train_il.py for training DA-RB (CILRS + DAGGER).

  • benchmark.py for benchmarking agents.

  • data_collect.py for collecting on/off-policy data.

It is recommended to run our scripts through bash files contained in the folder run. All configurations are in the folder config. Our repository is composed of two modules:

  • carla_gym, a versatile OpenAI gym [6] environment for CARLA. It allows not only RL training with synchronized rollouts, but also data collection and evaluation. The environment is configurable in terms of weather, number of background pedestrians and vehicles, benchmarks, terminal conditions, sensors, rewards for the ego-vehicle, etc.

  • agents, which includes our implementation of Autopilot (in agents/expert), Roach (in agents/rl_birdview) and DA-RB (in agents/cilrs).

B.3 Rendering issues

As illustrated in Fig. 6, on CARLA 0.9.11 reflections from after-rain puddles are sometimes wrongly rendered as black pixels. When the black pixels accumulate, for example in the middle of Fig. 6(a), they are often recognized as obstacles by the camera-based agents. Since this kind of reflection only appears under the testing weather but not under the training weather, generalizing to testing weather is exceptionally hard on CARLA 0.9.11 for the camera-based end-to-end IL agents.

(a) Reflections from after-rain puddles in front of the ego-vehicle are incorrectly rendered as black pixels.
(b) Reflections are correctly rendered if the puddle is not directly in front of the ego-vehicle.
Figure 6: Rendering issue of CARLA 0.9.11 running on Ubuntu with OpenGL.

Appendix C Implementation Details

C.1 Roach

The network architecture of Roach can be found in Table 6 and the hyper-parameter values are listed in Table 8.

BEV: Cyclists and pedestrians are rendered larger than their actual sizes, which allows us to use a smaller image encoder with fewer parameters for Roach. Additionally, increasing the size naturally adds some caution when dealing with these vulnerable road users.


The policy network and the value network are updated together using one Adam optimizer with an initial learning rate of 1e-5. The learning rate is scheduled based on the empirical KL-divergence between the policy before and after the update. If the KL-divergence is too large after an update epoch, the update phase will be interrupted and a new rollout phase will start. Furthermore, a patience counter will be increased by one and the learning rate will be reduced once the patience counter reaches a threshold.
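The schedule described above can be sketched as follows. This is a minimal, framework-free illustration rather than the released implementation: the class name `KLScheduler` and its method are hypothetical, and the defaults are the threshold, patience, factor and initial learning rate listed in Table 8. Whether the patience counter is reset after a learning rate reduction is not stated in the text, so the reset below is an assumption.

```python
class KLScheduler:
    """Hypothetical helper: interrupts an update phase and decays the learning rate."""

    def __init__(self, lr=1e-5, kl_threshold=0.15, patience=8, factor=0.5):
        self.lr = lr
        self.kl_threshold = kl_threshold
        self.patience = patience
        self.factor = factor
        self.counter = 0  # number of interrupted update phases so far

    def after_epoch(self, approx_kl):
        """Return True if the current update phase should be interrupted."""
        if approx_kl <= self.kl_threshold:
            return False
        # KL-divergence too large: stop updating and count one strike.
        self.counter += 1
        if self.counter >= self.patience:
            self.lr *= self.factor
            self.counter = 0  # assumption: the counter restarts after a reduction
        return True


# Usage: the per-epoch KL values here are made up for illustration.
scheduler = KLScheduler()
for epoch_kl in [0.02, 0.05, 0.30]:
    if scheduler.after_epoch(epoch_kl):
        break  # start a new rollout phase
```

Keeping the scheduler separate from the optimizer makes the rule easy to test in isolation; in a real training loop the new `lr` would still have to be written back to the optimizer.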

Rollout: Before each update phase, a fixed-size buffer is filled with trajectories collected on six CARLA servers, each corresponding to one of the six LeaderBoard maps (Town1-6).

Terminal Condition: An episode is terminated if and only if one of the following events happens.

  • Run red light: examination code taken from the public repository of LeaderBoard. Terminal reward: .

  • Run stop sign: examination code taken from the public repository of LeaderBoard. Terminal reward: .

  • Collision registered by CARLA: based on the physics engine. Any collision with intensity larger than 0 is considered. Terminal reward: .

  • Collision detected by bounding box overlapping in the BEV. Terminal reward: .

  • Route deviation: triggered if the lateral distance to the lane centerline of the desired route is larger than 3.5 meters. Terminal reward: .

  • Blocked: speed of the ego-vehicle is slower than 0.1 m/s for more than 90 consecutive seconds. Terminal reward: .

Here $v$ denotes the ego-vehicle’s speed. The terminal reward is the reward given to the very last observation/action pair before the termination. For non-terminal samples, the terminal reward is 0.
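As an illustration, the two conditions that depend only on ego-vehicle measurements (route deviation and blocked) can be checked as in the sketch below. The class `TerminationMonitor` and its argument names are hypothetical; only the thresholds (3.5 meters, 0.1 m/s, 90 seconds) come from the list above. Terminal reward values are deliberately not modeled.

```python
class TerminationMonitor:
    """Hypothetical checker for the 'route deviation' and 'blocked' conditions."""

    def __init__(self, max_lateral_m=3.5, min_speed_mps=0.1, max_blocked_s=90.0):
        self.max_lateral_m = max_lateral_m
        self.min_speed_mps = min_speed_mps
        self.max_blocked_s = max_blocked_s
        self.slow_since = None  # simulation time at which the vehicle became slow

    def step(self, sim_time_s, speed_mps, lateral_dist_m):
        """Return the name of the triggered condition, or None to continue."""
        if abs(lateral_dist_m) > self.max_lateral_m:
            return "route_deviation"
        if speed_mps < self.min_speed_mps:
            if self.slow_since is None:
                self.slow_since = sim_time_s
            # Blocked only after more than 90 consecutive slow seconds.
            if sim_time_s - self.slow_since > self.max_blocked_s:
                return "blocked"
        else:
            self.slow_since = None  # the slow streak is broken
        return None
```

Tracking the start time of the slow streak, instead of counting frames, keeps the check independent of the simulation frame rate.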

Reward Shaping: The reward is the sum of the following components.

  • r_speed: equals to , where is the measured speed of the ego-vehicle, is the maximum speed and is the desired speed. We use a constant maximum speed m/s. The desired speed is a variable and is explained below.

  • r_position: equals to , where is the lateral distance (in meters) between the ego-vehicle’s center and the center line of the desired route.

  • r_rotation: equals to , where is the absolute value of the angular difference (in radians) between the ego-vehicle’s heading and the heading of the center line of the desired route.

  • r_action: equals to if the current steering differs more than 0.01 from the steering applied in the previous step.

  • r_terminal: the aforementioned terminal reward.

The desired speed, as proposed in [47], depends on rule-based obstacle detections. If no obstacle is detected, the desired speed equals the maximum speed. If an obstacle is detected, the desired speed is linearly decreased to 0 based on the distance to the obstacle. As the obstacle detector we use the hazard detection of the Autopilot (cf. Section C.3). As a dense and informative reward, r_speed helps substantially to train our Roach and the camera-based end-to-end RL agent [47]. However, using rule-based obstacle detections inevitably introduces bias: the trained RL agent can be over-aggressive or over-conservative depending on the false positive and false negative rates of the detector. For example, during multi-lane freeway driving, our Roach decelerates for vehicles on the neighbouring lanes because those vehicles are detected as obstacles during training. As another example, Roach tends to collide after a right turn; this is related to the sector-shaped (around 40 degrees) detection area used by the obstacle detection, which does not cover vehicles and pedestrians on the right. To further improve the performance of Roach, r_speed should be modified, either by using a better obstacle detector or by completely removing the rule-based obstacle detection and building a less artificial reward based on simulation states.
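The linear decrease of the desired speed can be sketched as below. The function name and the parameter `ramp_distance_m` (the distance at which deceleration begins) are hypothetical, since the text does not state the distance range over which the speed is reduced; both it and the maximum speed are therefore passed in by the caller.

```python
def desired_speed(v_max, ramp_distance_m, obstacle_distance_m=None):
    """Desired speed in m/s given an optional distance to a detected obstacle.

    v_max: maximum speed in m/s.
    ramp_distance_m: hypothetical distance at which slowing down starts.
    obstacle_distance_m: None if no obstacle is detected.
    """
    if obstacle_distance_m is None:
        return v_max  # free road: drive at the maximum speed
    # Scale linearly with distance and clamp to [0, 1].
    ratio = obstacle_distance_m / ramp_distance_m
    ratio = min(max(ratio, 0.0), 1.0)
    return v_max * ratio
```

With this form, a false positive from the detector lowers the desired speed directly, which is one way to see how detector errors bias the learned behavior.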

Mode of Beta Distribution: We take the distribution mode as a deterministic output. For a Beta distribution with parameters $\alpha, \beta > 1$, the mode is defined as

$$\mathrm{mode}(\alpha, \beta) = \frac{\alpha - 1}{\alpha + \beta - 2}.$$

For a natural driving behavior, we use the mean as the deterministic output when the mode is not uniquely defined.
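A minimal sketch of such a deterministic output is given below for a single action dimension with Beta parameters `alpha` and `beta`. The function name is hypothetical. The interior mode (alpha - 1) / (alpha + beta - 2) and the mean alpha / (alpha + beta) are standard properties of the Beta distribution; the treatment of the boundary cases is an assumption based on those properties and is not necessarily what the released code does.

```python
def beta_deterministic_output(alpha, beta):
    """Mode of Beta(alpha, beta), falling back to the mean if the mode is not unique."""
    if alpha > 1.0 and beta > 1.0:
        # Unique interior mode.
        return (alpha - 1.0) / (alpha + beta - 2.0)
    if (alpha < 1.0 and beta < 1.0) or (alpha == 1.0 and beta == 1.0):
        # Bimodal (density peaks at 0 and 1) or uniform: use the mean instead.
        return alpha / (alpha + beta)
    # Remaining cases: the density is monotone, so the mode sits at a boundary.
    return 0.0 if alpha < beta else 1.0
```

Falling back to the mean avoids jumping between the two extremes of the action range when the distribution is bimodal, which would produce jerky control.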

C.2 IL Agent Supervised by Roach

The network architecture of our IL agent is found in Table 7 and the hyper-parameter values are listed in Table 9.

Network Architecture: We use six branches: turning left, turning right and going straight at the junction, following lane, changing to the left lane and changing to the right lane.

Off-policy Data Collection: Following CILRS [12], triangular perturbations on actions are applied while collecting the off-policy expert dataset to alleviate the covariate shift. The off-policy dataset for NoCrash includes 80 episodes and for LeaderBoard it includes 160 episodes. Each episode is at most 300 seconds and at least 30 seconds long. The episode will be terminated if the expert violates any traffic rules, including red light infractions, stop sign infractions and collisions. In such a case, we remove the last 30 seconds of that episode so as to ensure that the off-policy dataset includes only correct demonstrations. Data is not collected using the given training routes but from randomly spawned start and target locations.
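The trimming rule for episodes that end with an infraction can be sketched as follows. The function name and the `fps` argument are hypothetical (the recording frame rate is not given here); only the 30-second window comes from the text.

```python
def trim_episode(frames, fps, ended_by_infraction, trim_seconds=30.0):
    """Drop the last `trim_seconds` of an episode that ended with an infraction.

    frames: list of recorded frames in temporal order.
    fps: hypothetical recording rate in frames per second.
    """
    if not ended_by_infraction:
        return list(frames)
    n_drop = int(round(trim_seconds * fps))
    n_keep = max(len(frames) - n_drop, 0)
    return list(frames[:n_keep])
```

For example, at an assumed 10 frames per second, an episode of 1000 frames that ends with a collision keeps its first 700 frames, while an episode shorter than the trimming window is discarded entirely.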

On-policy Data Collection: We follow DA-RB [37] for DAGGER with critical state sampling and replay buffer. New DAGGER-data will replace the old data in the replay buffer, while the buffer size is fixed. The same number of frames are contained in the replay buffer as in the off-policy dataset. At each DAGGER iteration, around 15-25% of the replay buffer is filled with new DAGGER-data, whereas at least 20% of the replay buffer is filled with off-policy data. Identical to the off-policy data collection, we use randomly spawned start and target locations while collecting DAGGER datasets. Following DA-RB, we did not use a mixed agent/expert policy to collect DAGGER datasets. However, our code allows this kind of rollout for DAGGER.

Training Details: Since we use a ResNet-34 pre-trained on ImageNet, the input image is normalized as suggested. In case the IL agent uses a distributional action head and/or a value head, the corresponding weights will be loaded from the Roach model at the first training iteration (the behavior cloning iteration). At each DAGGER iteration, the training continues from the last epoch of the previous DAGGER iteration. We apply image augmentations using code modified from CILRS. The image augmentation methods are applied in random order and include Gaussian blur, additive Gaussian noise, coarse and block-wise dropouts, additive and multiplicative noise to each channel, randomized contrast and grayscale. All models are trained for 25 epochs using the Adam optimizer with an initial learning rate of 2e-4. The learning rate is halved if the validation loss has not decreased for more than 5 epochs.

C.3 Autopilot

The CARLA Autopilot (also called roaming agent) is a simple but effective automated expert based on hand-crafted rules and ground-truth simulation states. The Autopilot is composed of two PID controllers for trajectory tracking and hazard detectors for emergency braking. Hazards include

  • pedestrians/vehicles detected ahead,

  • red lights/stop signs detected ahead,

  • negative ego-vehicle speed, for handling slopes.

Locations and states of pedestrians, vehicles, red lights and stop signs are provided as ground-truth by the CARLA API. If any hazard appears in a trigger area ahead of the ego-vehicle, Autopilot will make an emergency brake with , . If no hazard is detected, the ego-vehicle will follow the desired path using two PID controllers, one for speed and one for steering control. The PID controller takes as input the location, rotation and speed of the ego-vehicle and the desired route specified as dense (1 meter interval) waypoints. The speed PID yields and the steering PID yields . We tuned the parameters for PID controllers and hazard detectors manually, such that the Autopilot is a strong baseline. The target speed is 6 m/s.
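The control logic described above can be sketched as follows. This is an illustrative sketch only: the class `PID`, the function `autopilot_control`, the error inputs and all gains are hypothetical, and the emergency-brake output (zero throttle, full brake, steering still tracked) is an assumption. The clipping uses the usual CARLA control ranges, throttle and brake in [0, 1] and steering in [-1, 1].

```python
class PID:
    """Textbook discrete PID controller (gains are placeholders)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # no derivative on the first call
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def clip(value, low, high):
    return min(max(value, low), high)


def autopilot_control(hazard_detected, speed_error, heading_error, speed_pid, steer_pid):
    """Return a dict with throttle, steer and brake for one control step."""
    steer = clip(steer_pid.step(heading_error), -1.0, 1.0)
    if hazard_detected:
        # Assumption: emergency brake means zero throttle and full brake.
        return {"throttle": 0.0, "steer": steer, "brake": 1.0}
    throttle = clip(speed_pid.step(speed_error), 0.0, 1.0)
    return {"throttle": throttle, "steer": steer, "brake": 0.0}
```

Because braking is binary in this scheme, the vehicle either follows the speed controller or brakes fully, which matches the unnatural brake actuation noted for the Autopilot in Section B.1.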

Appendix D Benchmarks

Scope: The scopes of the NoCrash and the LeaderBoard benchmarks are summarized in Table 4. As the latest benchmark on CARLA, the LeaderBoard benchmark considers more traffic scenarios and longer routes in six different maps. In this paper we use the publicly available training and testing routes of the LeaderBoard.

Weather: Following the NoCrash benchmark, we use ClearNoon, WetNoon, HardRainNoon and ClearSunset as the training weather types, whereas the new weather types are SoftRainSunset and WetSunset. To save computational resources, only two of the four training weather types are evaluated: WetNoon and ClearSunset.

Background Traffic: The numbers of vehicles and pedestrians spawned in each map of the different benchmarks are listed in Table 5. Vehicles and pedestrians are spawned randomly from the complete blueprint library of CARLA 0.9.11. This stands in contrast to several previous works where, for example, two-wheeled vehicles are disabled.

Appendix E Additional Experimental Results

To verify that IL agents trained using the feature loss indeed embed camera images into the latent space of Roach, we report the feature loss at test time in Fig. 7. In the first row of Fig. 7, the IL agent trained without feature loss learns a latent space independent of that of Roach. Hence, the test feature loss is effectively noise that is invariant to the test condition. In the second row, the IL agent is trained with the feature loss. The test feature loss of this agent is much smaller (less than 1) and increases as expected during the generalization tests.

Figure 7: Feature loss w.r.t. Roach on one of the NoCrash-dense routes. The y-axes of the two charts have different scales.

To complete Fig. 5 of the main paper, driving scores of experts and IL agents at each DAGGER iteration are shown in

  • Fig. 8: NoCrash-busy.

  • Fig. 9: LeaderBoard-busy.

To complete Table 3 of the main paper, detailed driving performance and infraction analysis of our experts and IL agents (5th DAGGER iteration) are listed in

  • Table 10: NoCrash-busy, train town & train weather.

  • Table 11: NoCrash-busy, train town & new weather.

  • Table 12: NoCrash-busy, new town & train weather.

  • Table 13: NoCrash-busy, new town & new weather.

  • Table 14: LeaderBoard, train town & train weather.

  • Table 15: LeaderBoard, train town & new weather.

  • Table 16: LeaderBoard, new town & train weather.

  • Table 17: LeaderBoard, new town & new weather.

# Traffic
# Stop
NoCrash Train
NoCrash Test
LeaderBoard Train
LeaderBoard Test
Table 4: Scope of the NoCrash benchmark and the LeaderBoard benchmark. Total kilometers, number of traffic lights and stop signs are measured using Roach.
Map # Vehicles # Pedestrians
NoCrash dense
NoCrash busy
LeaderBoard busy
Table 5: Background traffic settings for different benchmarks.
Layer Type Filters Size Strides Activation
Image Encoder
Conv2d 8 5x5 2 ReLU
Conv2d 16 5x5 2 ReLU
Conv2d 32 5x5 2 ReLU
Conv2d 64 3x3 2 ReLU
Conv2d 128 3x3 2 ReLU
Conv2d 256 3x3 1 -
Measurement Encoder
Dense 256 ReLU
Dense 256 ReLU
FC Layers after Concatenation
Dense 512 ReLU
Dense 256 ReLU
Action Head
Dense (shared) 256 ReLU
Dense (shared) 256 ReLU
Dense (for α) 2 Softplus
Dense (for β) 2 Softplus
Value Head
Dense 256 ReLU
Dense 256 ReLU
Dense 1 -
Table 6: The network architecture used for Roach. Around 1.53M trainable parameters.
| Layer Type | Filters | Activation | Dropout |
|---|---|---|---|
| Image Encoder | | | |
| Measurement Encoder | | | |
| Dense | 128 | ReLU | |
| Dense | 128 | ReLU | |
| FC Layers after Concatenation | | | |
| Dense | 512 | ReLU | |
| Dense | 512 | ReLU | |
| Dense | 256 | ReLU | |
| Speed Head | | | |
| Dense | 256 | ReLU | |
| Dense | 256 | ReLU | 0.5 |
| Dense | 1 | | |
| Value Head | | | |
| Dense | 256 | ReLU | |
| Dense | 256 | ReLU | 0.5 |
| Dense | 1 | | |
| Deterministic Action Head | | | |
| Dense | 256 | ReLU | |
| Dense | 256 | ReLU | 0.5 |
| Dense | 2 | | |
| Distributional Action Head | | | |
| Dense (shared) | 256 | ReLU | |
| Dense (shared) | 256 | ReLU | 0.5 |
| Dense (for α) | 2 | Softplus | |
| Dense (for β) | 2 | Softplus | |

Table 7: The network architecture used for our IL agent. Around 23.4M trainable parameters.
| Description | Value |
|---|---|
| BEV Representation | |
| Width | 192 px |
| Height | 192 px |
| Number of channels | 15 |
| Size of the temporal sequence | 4 |
| Timestamps of images in the temporal sequence | {-1.5, -1, -0.5, 0} sec |
| Distance from the ego-vehicle to the bottom | 40 px |
| Pixels per meter | 5 px/m |
| Minimum width/height of rendered bounding boxes | 8 px |
| Scale factor for bounding box size of pedestrians | 2 |
| Buffer size for six environments | 12288 frames |
| Value bootstrap for the last non-terminal sample | True |
| Synchronized | True |
| Reset at the beginning of a new phase | False |
| Weather | dynamic |
| Range of vehicle/pedestrian number in Town 1 | |
| Range of vehicle/pedestrian number in Town 2 | |
| Range of vehicle/pedestrian number in Town 3 | |
| Range of vehicle/pedestrian number in Town 4 | |
| Range of vehicle/pedestrian number in Town 5 | |
| Range of vehicle/pedestrian number in Town 6 | |
| Number of epochs | 20 |
| Weight for the entropy loss | 0.01 |
| Weight for the exploration loss | 0.05 |
| Weight for the value loss | 0.5 |
| γ for GAE | 0.99 |
| λ for GAE | 0.9 |
| Clipping range for PPO-clip | 0.2 |
| Max norm for gradient clipping | |
| Batch size | 256 |
| Initial learning rate | 1e-5 |
| KL-divergence threshold for learning rate schedule | 0.15 |
| Patience for learning rate schedule | 8 |
| Factor for learning rate schedule | 0.5 |

Table 8: The hyper-parameter values used for Roach.
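To illustrate how the GAE settings in Table 8 enter the advantage computation, here is a minimal sketch of generalized advantage estimation. It reads the two GAE entries as the discount γ = 0.99 and the trace parameter λ = 0.9, and it treats "value bootstrap for the last non-terminal sample" as using a value estimate for the state following the rollout. The function name `compute_gae` and its argument layout are hypothetical, not the authors' code.

```python
# Generalized advantage estimation over one rollout segment (a sketch).

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.9):
    """Return (advantages, returns) for a rollout of length T.

    rewards[t], values[t]: reward and value estimate at step t.
    dones[t]: True if the episode terminated at step t.
    last_value: value estimate of the state after the final step, used to
        bootstrap when the final sample is non-terminal.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    gae = 0.0
    next_value = last_value
    for t in reversed(range(horizon)):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term vanishes at terminal steps.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, cut at episode boundaries.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = compute_gae(
        rewards=[1.0, 1.0],
        values=[0.5, 0.5],
        dones=[True, False],
        last_value=2.0,
    )
    print(adv, ret)
```

If every epoch passes once over the full buffer, the table's buffer size of 12288 frames and batch size of 256 give 48 minibatches per epoch, or 960 gradient steps over 20 epochs.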
| Description | Value |
|---|---|
| Camera type | RGB |
| Camera image width | 900 px |
| Camera image height | 256 px |
| Camera location relative to the ego-vehicle | |
| Camera rotation relative to the ego-vehicle | |
| Camera horizontal FOV | |
| Mean for image normalization | |
| Standard deviation for image normalization | |
| Speed measurement | Forward speed in m/s |
| Normalization factor for speed | 12 |
| Data Collection | |
| Episode length | 300 sec |
| Triangular perturbation for off-policy data | 20% |
| Number of episodes (NoCrash, off-policy) | 80 |
| Number of episodes (LeaderBoard, off-policy) | 160 |
| Number of episodes (NoCrash, on-policy, Autopilot) | 80 |
| Number of episodes (LeaderBoard, on-policy, Autopilot) | 160 |
| Number of episodes (NoCrash, on-policy, Roach) | 40 |
| Number of episodes (LeaderBoard, on-policy, Roach) | 80 |
| DA-RB critical state sampling criterion | difference in acceleration |
| DA-RB critical state sampling threshold | 0.2 |
| Weather | Same as NoCrash train weathers |
| Range of vehicle/pedestrian number in NoCrash train town 1 | |
| Range of vehicle/pedestrian number in LeaderBoard train town 1 | |
| Range of vehicle/pedestrian number in LeaderBoard train town 3 | |
| Range of vehicle/pedestrian number in LeaderBoard train town 4 | |
| Range of vehicle/pedestrian number in LeaderBoard train town 6 | |
| Number of epochs at each DAGGER iteration | 25 |
| Weight for the speed regularization | 0.05 |
| Weight for the value loss, if applied | 0.05 |
| Weight for the feature loss, if applied | 0.001 |
| Batch size | 48 |
| Initial learning rate | 0.0002 |
| Patience for reduce-on-plateau learning rate schedule | 5 |
| Factor for learning rate schedule | 0.5 |
| Pre-trained distributional action head | True |
| Pre-trained value head | True |
| Image augmentation | True |

Table 9: The hyper-parameter values used for our IL agent.
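The reduce-on-plateau schedule in Table 9 (initial learning rate 0.0002, patience 5, factor 0.5) can be sketched as follows. This is a simplified illustration, not the authors' training code: the class name `ReduceOnPlateau` is hypothetical, the monitored quantity is assumed to be a per-epoch validation loss, and the rate is reduced once the loss has failed to improve for more than `patience` consecutive epochs.

```python
# Simplified reduce-on-plateau learning rate schedule (a sketch).

class ReduceOnPlateau:
    def __init__(self, lr=2e-4, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")  # lowest loss seen so far
        self.bad_epochs = 0       # consecutive epochs without improvement

    def step(self, val_loss):
        """Update the schedule with one epoch's loss and return the new rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    schedule = ReduceOnPlateau()
    # Assumed loss curve: improves twice, then stalls.
    for epoch, loss in enumerate([1.0, 0.8] + [0.8] * 6):
        print(epoch, schedule.step(loss))
```

With these settings, a loss that stalls for six consecutive epochs halves the learning rate from 0.0002 to 0.0001.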
Figure 8: Driving performance of experts and IL agents on the NoCrash-busy benchmark: (a) driving score, (b) success rate. All IL agents (dashed lines) are supervised by Roach except for one, which is supervised by the CARLA Autopilot. For IL agents at the 5th iteration and all experts, results are reported as the mean over 3 evaluation seeds. Other agents are evaluated only once.
Figure 9: Driving performance of experts and IL agents on the LeaderBoard-busy benchmark: (a) driving score, (b) success rate. All IL agents (dashed lines) are supervised by Roach except for one, which is supervised by the CARLA Autopilot. For all experts, results are reported as the mean over 3 evaluation seeds. IL agents are evaluated only once.
Table 10: Performance and infraction analysis on NoCrash-busy, train town & train weather. Mean and std. over 3 seeds. IL agents are evaluated at iter 5; performance metrics are given in % and infraction rates, including red light infractions, in #/Km.
Table 11: Performance and infraction analysis on NoCrash-busy, train town & new weather. Mean and std. over 3 seeds. IL agents are evaluated at iter 5; performance metrics are given in % and infraction rates, including red light infractions, in #/Km.
Table 12: Performance and infraction analysis on NoCrash-busy, new town & train weather. Mean and std. over 3 seeds. IL agents are evaluated at iter 5; performance metrics are given in % and infraction rates, including red light infractions, in #/Km.
Columns: performance metrics in %, infraction rates (including red light) in #/Km; IL agents at iter 5.