[CoRL'20] Learning a Decision Module by Imitating Driver’s Control Behaviors
Classical autonomous driving systems are modularized as a pipeline of perception, decision, planning, and control. The driving decision plays a central role in processing the observation from perception as well as directing the execution of the downstream planning and control modules. Commonly the decision module is designed to be rule-based and is difficult to learn from data. Recently, end-to-end neural control policies have been proposed to replace this pipeline, given their generalization ability. However, it remains challenging to enforce physical or logical constraints on the decision to ensure driving safety and stability. In this work, we propose a hybrid framework for learning a decision module, which is agnostic to the mechanisms of the perception, planning, and control modules. By imitating the low-level control behavior, it learns the high-level driving decisions while bypassing the ambiguous annotation of high-level driving decisions. We demonstrate that simulation agents with a learned decision module can generalize to various complex driving scenarios where the rule-based approach fails. Furthermore, the learned module can generate driving behaviors that are smoother and safer than those of end-to-end neural policies.
The autonomous driving system is essentially a decision-making system that takes a stream of onboard sensory data as input, processes it with prior knowledge about driving scenarios to make a decision, and then outputs control signals to steer the vehicle. The driving system is often modularized into perception, decision, planning, and control. While perception and map reconstruction, often referred to together as the state estimator, tend to be more open to machine learning methods, the downstream modules such as decision, planning and control remain program-based in order to be configurable and to ensure safe interaction with the physical world. In other words, this safety guarantee is built upon a human-inspectable and interruptible basis, often backed by the classical mechanism of a Finite State Machine (FSM) composed of a huge number of rules to follow. In complex driving environments, building a complete FSM for decision making is infeasible. A corner case that is inconsistent with the existing rules may lead to significant program refactoring. Additionally, other unobservable factors, such as the intentions of other drivers, are non-trivial to symbolize in a rule-based program. Aiming at a more flexible alternative, this work explores a learning-based decision-making module whose downstream modules, i.e. planning and control, remain encapsulated.
A driving decision is defined as the high-level abstraction of which lane the driver wants the vehicle to be in, and at what velocity, in $T$ seconds, so that the decision determines the execution of the downstream modules. However, such a driving decision is both difficult and ambiguous for human drivers to describe. For example, when drivers are trying to merge into traffic at a roundabout, they may have different responses and behaviors to execute. The lack of standardized criteria poses a challenge to large-scale supervised learning, and it is impractical to evaluate human annotation for quality control. On the other hand, the drivers’ physical behaviors, such as pressing the throttle or brake and steering the wheel, implicitly reflect their high-level decisions. Thus we would like to explore the possibility of reverse-engineering the high-level driving decisions by learning from easily accessible data of the drivers’ physical behaviors.
Learning human intention from observable behaviors has been a long-standing topic in the fields of artificial intelligence and cognitive science. Previous work mainly focuses on inferring the unobservable high-level intention by observing human behaviors from a second-person view. Recently, end-to-end imitation learning and inverse reinforcement learning have been used for autonomous driving tasks, but the resulting neural control policies cannot guarantee the fulfillment of low-level constraints on states or actions, nor do they have the robustness and stability guarantees of classical controllers. On the other hand, instead of end-to-end training, earlier work uses a reinforcement learning policy to switch between controllers designed with Lyapunov domain knowledge. Here we adopt the modeling of human intention for driving decisions, but also take the car’s own geometrical, mechanical and physical constraints into account. Specifically, we aim to learn a high-level decision policy in the modular pipeline, where the downstream modules re-use classical methods that are transparent to human inspectors. However, a tailored learning design is needed to handle the issue that the downstream programs are not differentiable for automated gradient calculators [1, 18].
In this work, we propose an imitation learning framework for decision making in the modular driving pipeline that learns neural decisions from human behavioral demonstrations. Although perception, planning and control modules are included in the modeling, this learning framework is agnostic to their mechanisms. Policy learning is conducted in a generative adversarial manner. The generator (green line in Fig. 1) simulates the modularized driving pipeline. With information from a local map, decisions are generated by a neural policy and processed by the programmed planning and control modules to interact with the environment sequentially. The neural discriminator takes in the generator’s low-level control trajectories along with their observation sequences, then compares them with the drivers’ behavior data (orange lines in Fig. 1). Due to the absence of gradients from the planning and control modules (blue line in Fig. 1), we modify the learning objective of generative adversarial imitation learning to propagate credit on control actions back to the decision directly (red line in Fig. 1). We evaluate this framework on various simulated urban driving scenarios, including following, merging at the crossing, merging at the roundabout, and overtaking, showing its superior performance over the rule-based and end-to-end approaches.
We summarize our contributions as follows:
The proposed framework combines the programming-based system with the end-to-end learning method, where the high-level decision policy is data-driven while the low-level plan and control remain configurable and physically constrained.
The tailored generative adversarial learning method learns driving decisions from drivers’ control data, bypassing the ambiguous annotation of human intention.
This framework successfully distills the behaviors in different scenarios into one decision policy, which can generalize to unseen scenarios to some extent.
Classical autonomous driving system The design of the autonomous driving pipeline can be traced back to the DARPA Urban Challenge 2007. Prior surveys provide an overview of the hierarchical structure of perception, decision, planning, and control. Specifically, a finite state machine (FSM) has been used at the decision level, or equivalently the upper planning level. Classical autonomous driving systems are organized in this way so that they are easy to diagnose. We follow this modular pipeline but develop a data-driven decision module to replace the FSM.
Imitation learning-based autonomous driving system With the popularity of deep generative learning, methods have been proposed to learn end-to-end neural control policies from drivers’ data. Previously, Ziebart et al. and Ross et al. proposed general methods in Inverse Reinforcement Learning and Interactive Learning from Demonstration, with an empirical study on a driving game. More recently, Kuefler et al. and Behbahani et al. learned an end-to-end policy in a GAIL-like manner. Codevilla et al. and Liang et al. share a similar hierarchical perspective with us, but their control policies are still completely neural. It is worth mentioning that Codevilla et al. proposed to imitatively learn a downstream policy given a human’s high-level command, whose motivation is complementary to ours. Rhinehart et al. proposed to learn a generative predictor for vehicles’ coordinate sequences. Our work differs from all of them in that we imitate the driver’s low-level behaviors to learn high-level decisions, whose execution is conducted by a transparent and configurable program.
Learning human intention Baker et al. formalized the human mental model of planning with a Markov Decision Process (MDP) and proposed to learn human mental states with Bayesian inverse planning. We adopt a similar probabilistic model of decisions, but the learning is done with a novel generative adversarial method. Wang et al. proposed a Hidden Markov Model (HMM) for human behaviors and instantiated it with a Gaussian Process (GP), where the intent is the hidden variable. Jain et al. model the interaction between human intentional maneuvers and their behaviors in a driving scenario, both in-car and outside, as an Auto-Regressive Input-Output HMM and learn it with the EM algorithm. However, they did not explicitly distinguish humans from the environment. Besides, rather than inferring the human’s mental state, we care more about how a robot can execute these learned decisions.
Before diving into the learning method, we give an overview of the system structure of an intelligent driving system, which is also the model of the generator in our proposed learning framework, as shown in Fig. 1. We focus on the decision-making module and its connection to the downstream planning and control modules. The state estimator constructs a local map by processing the sensory data and combining them with prior knowledge from the map and rules. The decision module then receives the observation and decides on a legal local driving task that steers the car towards the destination. To complete this task, the planning module searches for an optimal trajectory under enforced physical constraints. The control module then reactively corrects any errors in the execution of the planned motion.
We assume that a local map centered at the ego vehicle could be constructed from either the state estimator or an offline global HD map. This local map contains critical driving information such as the routing guidance to the destination, legal lanes, lane lines, other vehicle coordinates and velocity in the ego vehicle’s surroundings, as well as ego vehicle’s speed and acceleration.
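The interaction between the decision module and the environment through its downstream modules is modeled as a Markov Decision Process (MDP). An MDP is normally represented as a tuple $(\mathcal{S}, \mathcal{D}, P, r, \rho_0, \gamma, T)$, with a state set $\mathcal{S}$, a decision set $\mathcal{D}$, a transition probability distribution $P(s_{t+1} \mid s_t, d_t)$, a bounded reward function $r(s_t, d_t)$, an initial state distribution $\rho_0$, a discount factor $\gamma$ and a horizon $T$. The decision policy $\pi(d_t \mid s_t)$ takes the current state $s_t$ from the local map and generates a high-level decision $d_t$. Note that we explicitly refer to it as a decision to differentiate it from the control action $a_t$. This decision directs the execution of the planning and control modules, and makes the autonomous driving system proceed in the environment to acquire the next state $s_{t+1}$. The optimization objective of this policy is to maximize the expected discounted return $\mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, d_t)\right]$, where $\tau = (s_0, d_0, s_1, d_1, \dots)$ denotes the whole trajectory, $s_0 \sim \rho_0$, $d_t \sim \pi(d_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, d_t)$.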
Below we introduce the observation space and decision space, which are interfaces in modular systems. Note that we assume full access to necessary information in the local map.
To make the decision-making policy more generalizable to different driving scenarios, the output of the state estimator, which is also the input to the neural decision module, consists of the local map and traffic state. As shown in Fig. 2, part of the observation is the 3D coordinates of sample points on lane lines. We select two sets of lanes for the decision policy: one for the ego vehicle’s current lane and the other for the edge of legal regions. Note that here legal means that driving in that region abides by both the traffic rules and the global routing, which has a crucial effect at a crossing. To reduce the input dimension, these lane points are sampled at exponentially increasing intervals towards the far end. In some sense, this mimics the effect of Lidar, with the observation more accurate at the near end. The observation also includes the coordinates and velocities of the nearest vehicles in traffic within range of the ego vehicle.
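The exponentially increasing sampling intervals can be illustrated with a short sketch. The function name, the first gap of 1 m and the growth factor of 2 are hypothetical choices for illustration; the actual values used in the experiments are not stated here.

```python
import numpy as np


def exponential_sample_offsets(num_points: int, first_gap: float, growth: float) -> np.ndarray:
    """Distances ahead of the ego vehicle at which lane points are sampled.

    The k-th gap between consecutive samples is first_gap * growth**k, so the
    samples are dense near the vehicle and sparse far away.
    NOTE: first_gap and growth are illustrative assumptions.
    """
    gaps = first_gap * growth ** np.arange(num_points)
    return np.cumsum(gaps)


# Six samples covering 63 m, with half of them inside the first 7 m.
offsets = exponential_sample_offsets(num_points=6, first_gap=1.0, growth=2.0)
print(offsets)
```

With these assumed parameters, a handful of points covers a long stretch of lane while keeping most of the resolution close to the ego vehicle.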
We introduce strong priors over the decision space for tractable inversion of this generation process from the collected data. We define a decision as three categorical variables: one lateral, one longitudinal and one for velocity. Combined with the local map, each of them is assigned a specific semantic meaning. The lateral decision variable has three classes: changing to the left lane, keeping the current lane and changing to the right lane. Note that at a crossing, global routing information has been embedded into the local map, as introduced in the last paragraph. The lateral decision set is complete, since these three are the only possible lateral decisions for a vehicle. The longitudinal decision variable has four classes, each indicating a distance along the waypoints within time interval T. The exact coordinate of the endpoint can be extracted from the local map. Combined with the target speed at the endpoint predicted by the velocity decision variable, which uniformly discretizes the allowed speed range into 4 classes, this decision module provides a quantitative configuration of goal states for the downstream trajectory planner.
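A minimal sketch of how such a three-part categorical decision could be decoded into a goal configuration for the planner is given below. All names, the distance bins, the maximum speed and the use of bin centres for the target speed are assumptions made for illustration; only the class counts (3 lateral, 4 longitudinal, 4 velocity) come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Lateral(IntEnum):
    CHANGE_LEFT = 0
    KEEP_LANE = 1
    CHANGE_RIGHT = 2


NUM_LONGITUDINAL = 4  # four longitudinal classes, as in the text
NUM_VELOCITY = 4      # allowed speed range split uniformly into four classes


@dataclass(frozen=True)
class Goal:
    lane_offset: int         # -1 = left lane, 0 = current lane, +1 = right lane
    distance_m: float        # distance of the goal point along the waypoints
    target_speed_mps: float  # speed expected at the goal point


def decode_decision(lateral, longitudinal, velocity,
                    distance_bins_m=(5.0, 10.0, 20.0, 40.0),  # hypothetical bins
                    max_speed_mps=12.0):                      # hypothetical limit
    """Map the three categorical variables to a quantitative goal state."""
    if not 0 <= longitudinal < NUM_LONGITUDINAL:
        raise ValueError("longitudinal class out of range")
    if not 0 <= velocity < NUM_VELOCITY:
        raise ValueError("velocity class out of range")
    lane_offset = int(Lateral(lateral)) - 1
    # Assumption: each velocity class is represented by the centre of its bin.
    target_speed = max_speed_mps * (velocity + 0.5) / NUM_VELOCITY
    return Goal(lane_offset, distance_bins_m[longitudinal], target_speed)


goal = decode_decision(Lateral.CHANGE_LEFT, longitudinal=2, velocity=3)
print(goal)
```

Under this factorization the joint decision space has 3 × 4 × 4 = 48 elements, which keeps the policy output small while still specifying a concrete goal point and speed.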
We use a minimal design of the planning and control module as we focus on the decision making. The planning and control module can be easily extended to more complicated ones as long as they are deterministic or stationary.
The planning module processes path and velocity separately. Paths are planned with a cubic Bezier curve

$$B(u) = (1-u)^3 P_0 + 3(1-u)^2 u\, P_1 + 3(1-u)\, u^2 P_2 + u^3 P_3, \quad u \in [0, 1], \tag{1}$$

where $P_0$, $P_3$ are the starting and goal points, and $P_1$, $P_2$ are intermediate control points, imposing geometrical constraints such as the end pose aligning with the lane direction.
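As a small illustration, a cubic Bezier path between a start and a goal point can be evaluated as follows. The lane-change coordinates are hypothetical; the two intermediate control points are placed along the lane direction so that the path leaves and arrives tangent to the lanes.

```python
import numpy as np


def cubic_bezier(p0, p1, p2, p3, num: int = 50) -> np.ndarray:
    """Evaluate a cubic Bezier curve at `num` evenly spaced parameter values."""
    u = np.linspace(0.0, 1.0, num)[:, None]
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    return ((1 - u) ** 3 * p0
            + 3 * (1 - u) ** 2 * u * p1
            + 3 * (1 - u) * u ** 2 * p2
            + u ** 3 * p3)


# Hypothetical lane change: 15 m forward, 3.5 m to the side.
# p1 - p0 and p3 - p2 both point along x, so the start and end headings
# are aligned with the lane direction.
path = cubic_bezier((0.0, 0.0), (5.0, 0.0), (10.0, 3.5), (15.0, 3.5))
print(path[0], path[-1])
```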
The velocity planning module reparameterizes Eq. (1) with arc length, introducing a temporal unit. This reparameterization is not done directly; rather, spline segments with an identical time unit are fitted to sample points from (1).
The constraints on velocity planning are handled by minimizing the acceleration and jerk while fitting the sample points from (1) and maintaining the continuity of curvature. Formally, the fitting is posed as a Quadratic Programming (QP) problem.
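As a simplified illustration of this kind of fit, the sketch below solves an unconstrained quadratic program that trades off fidelity to the sample points against finite-difference acceleration and jerk. It is not the spline formulation of the paper: the discretization, the weights and the absence of inequality or continuity constraints are all assumptions made to keep the example short.

```python
import numpy as np


def smooth_profile(samples, accel_weight: float = 10.0, jerk_weight: float = 10.0) -> np.ndarray:
    """Minimize ||x - y||^2 + wa * ||D2 x||^2 + wj * ||D3 x||^2 over x.

    D2 and D3 are second- and third-order finite-difference operators,
    standing in for acceleration and jerk. The problem is an unconstrained
    QP, so its minimizer is the solution of one linear system.
    """
    y = np.asarray(samples, dtype=float)
    identity = np.eye(y.size)
    d2 = np.diff(identity, n=2, axis=0)
    d3 = np.diff(identity, n=3, axis=0)
    hessian = identity + accel_weight * d2.T @ d2 + jerk_weight * d3.T @ d3
    return np.linalg.solve(hessian, y)


def roughness(x, accel_weight: float = 10.0, jerk_weight: float = 10.0) -> float:
    """Weighted acceleration-plus-jerk penalty used in the objective above."""
    x = np.asarray(x, dtype=float)
    return float(accel_weight * np.sum(np.diff(x, n=2) ** 2)
                 + jerk_weight * np.sum(np.diff(x, n=3) ** 2))


rng = np.random.default_rng(0)
noisy = np.linspace(0.0, 10.0, 21) + rng.normal(0.0, 0.3, size=21)
smoothed = smooth_profile(noisy)
print(roughness(noisy), roughness(smoothed))
```

A practical planner would add bounds on speed and acceleration as inequality constraints and hand the problem to a dedicated QP solver.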
The control module is a basic PID controller. To reconcile with the MDP assumption in the decision module, we adopt the discrete-time form:

$$u(t_k) = K_P\, e(t_k) + K_I \sum_{j=0}^{k} e(t_j)\, \Delta t + K_D\, \frac{e(t_k) - e(t_{k-1})}{\Delta t},$$

where $K_P$, $K_I$ and $K_D$ denote the coefficients for the proportional, integral, and derivative terms respectively, $e$ is the error function and $\Delta t$ is the control time step. In this work we use two independent error functions, one for the lateral controller (steering wheel) and one for the longitudinal controller (throttle/brake), both of which are capped by mechanical constraints.
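A discrete-time PID controller with output capping can be sketched as below. The gains, time step and output limits are hypothetical; the class only illustrates the proportional, integral and derivative terms and the mechanical cap described above.

```python
from dataclasses import dataclass, field


@dataclass
class DiscretePID:
    kp: float
    ki: float
    kd: float
    dt: float        # control time step in seconds
    out_min: float   # mechanical lower limit of the actuator
    out_max: float   # mechanical upper limit of the actuator
    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)

    def step(self, error: float) -> float:
        """Return the capped control output for the current error."""
        self._integral += error * self.dt
        derivative = (error - self._prev_error) / self.dt
        self._prev_error = error
        output = self.kp * error + self.ki * self._integral + self.kd * derivative
        return min(max(output, self.out_min), self.out_max)


# Hypothetical lateral controller: error is the heading offset, output is a
# normalized steering command in [-1, 1].
steer_pid = DiscretePID(kp=1.0, ki=0.5, kd=0.1, dt=0.1, out_min=-1.0, out_max=1.0)
first = steer_pid.step(0.2)
saturated = steer_pid.step(5.0)
print(first, saturated)
```

Two independent instances, one fed with the lateral error and one with the longitudinal error, would correspond to the steering and throttle/brake controllers.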
Each module of the proposed framework can be extended flexibly. Because the interface between decision and planning is the location of a goal point and the speed the ego vehicle is expected to maintain when reaching it, the planning module could be replaced by a more sophisticated search-based or optimization-based trajectory planner to fulfill more practical requirements on execution time and constraint enforcement. Similarly, the model-free controller could be replaced with alternatives like Model Predictive Control (MPC). Other practical constraints and concerns over the controller, such as robustness and stability, could also be taken into account if necessary.
Under the MDP modeling, we choose imitation learning over reinforcement learning. In principle, imitation learning could save practitioners from ad hoc reward engineering and focus their effort on reusable infrastructure development.
Among all the imitation learning methods, behavior cloning (BC) is the most straightforward one. The policy is trained as a regressor in a purely supervised manner, ignoring the temporal effect of each action in an MDP trajectory. Thus it suffers from covariate shift when presented with states that are not covered by the training data.
By contrast, the model-free generative imitation learning like Inverse Reinforcement Learning (IRL) explicitly models the interaction between policy and environment and approximates it with Monte Carlo rollouts. Thus they can generalize better in states not covered in the demonstration.
Generative Adversarial Imitation Learning (GAIL) is a variant of IRL that makes the learning of the reward function unsupervised. Intuitively, it learns a policy whose actions at given states are indistinguishable from the demonstration data. More formally, given expert demonstrations, a neural discriminator is introduced and the reward function is defined from the discriminator’s output. The policy and the discriminator are then trained against each other in a minimax objective.
Specifically, the discriminator is optimized with a cross-entropy loss, while the policy is optimized with policy gradient.
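For reference, the GAIL objective in the form given by Ho and Ermon can be written as below. The symbols are assumptions for this note: $\pi_\theta$ is the learned policy, $\pi_E$ the expert, $D_\omega(s, a)$ the discriminator's probability that a state-action pair was generated by the policy, and $H$ a causal entropy regularizer with weight $\lambda$. Other papers flip the roles of the two logarithms, so the sign convention here should not be read as necessarily the one used in this work.

```latex
\min_{\theta} \max_{\omega} \;
  \mathbb{E}_{\pi_\theta}\!\left[\log D_\omega(s, a)\right]
  + \mathbb{E}_{\pi_E}\!\left[\log\bigl(1 - D_\omega(s, a)\bigr)\right]
  - \lambda H(\pi_\theta)
```

Under this convention the inner maximization is a binary cross-entropy classification problem for the discriminator, and the policy is rewarded with $r(s, a) = -\log D_\omega(s, a)$, which is large when its actions are mistaken for expert actions.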
Different from GAIL, in our framework the generation process is modularized, following the structure of classical autonomous driving systems. Because the downstream planning and control modules are not necessarily differentiable, a separation occurs between the decisions from the neural policy and the control data from drivers, as illustrated by the blue dotted line in Fig. 1: the policy generates decisions while the discriminator only distinguishes actions. One important insight of our work is that the transformation function, which is the parametrized equivalent of the planning and control modules, is a deterministic and invertible function. Therefore, it is a candidate for reparametrization or push-forward. After reparametrizing the action in the first expectation of the GAIL objective as the output of this transformation applied to the decision, we obtain a modified learning objective in which the discriminator’s credit on control actions is propagated directly to the decision policy.
The whole algorithm is described in Algorithm 1.
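The idea of assigning credit on control actions back to decisions without differentiating through planning and control can be sketched with a score-function (REINFORCE) update on a toy problem. Everything below is an assumption made for illustration: a single state, four decisions, a lookup table standing in for the deterministic downstream modules, a fixed hand-written discriminator, and the reward convention $r = -\log D$. It is not Algorithm 1.

```python
import numpy as np

# Hypothetical steering command produced by the downstream modules per decision.
ACTIONS = np.array([-0.5, 0.0, 0.3, 0.8])


def downstream(decision: int) -> float:
    """Stand-in for planning + control: deterministic and used as a black box."""
    return float(ACTIONS[decision])


def discriminator(action: float, expert_action: float = 0.3) -> float:
    """Toy fixed discriminator: probability that `action` is generated, not expert."""
    return float(1.0 - 0.9 * np.exp(-10.0 * (action - expert_action) ** 2))


def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()


def train(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(ACTIONS))  # categorical decision policy
    baseline = 0.0                   # running mean reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        decision = rng.choice(len(ACTIONS), p=probs)
        action = downstream(decision)              # no gradient needed here
        reward = -np.log(discriminator(action))    # credit measured on the action
        grad_log_prob = -probs                     # d log pi(decision) / d logits
        grad_log_prob[decision] += 1.0
        logits += lr * (reward - baseline) * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


final_probs = train()
print(final_probs)
```

The reward is computed on the low-level action, yet the gradient is taken only with respect to the log-probability of the decision, so the downstream modules never need to be differentiable. In the full method the discriminator would be trained jointly on driver data rather than fixed.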
Training Platform We evaluate the proposed framework on the CARLA simulation platform. As the proposed framework focuses on the decision module and its downstream modules, the state estimator is omitted and the observations are extracted directly from the simulator. We build a parallel training system to improve training efficiency.
Data Collection The demonstration data is collected by a human driving the ego car in the CARLA simulator with a Logitech G29 Driving Force steering wheel and pedals. The PC configuration is Ubuntu 16.04 x86, Intel Core i7-8700 CPU, GeForce GTX and GB memory. The following criteria are used to filter out low-quality demonstrations:
no collision occurs;
sufficient accumulated heuristic rewards in one episode;
no dangerous move and obeying traffic rules.
For each scenario, experts repeatedly drive in the simulator to finish episodes with steps. Trajectories violating the aforementioned criteria are rejected. After collecting trajectories in each scenario, we each choose with the highest heuristic rewards as a demonstration.
Traffic Scenarios Six common driving scenarios are constructed for evaluation: Empty Town, Car Following, Crossroad Merge, Crossroad Turn Left, Roundabout Merge, and Overtake.
Evaluation Metrics We compare the learned decision module with both the rule-based module and the end-to-end neural control policy. The rule-based module follows the design of prior work. The end-to-end policy is trained with vanilla GAIL. Following prior work, we conduct a quantitative comparison using metrics such as collision rate, time to accomplish tasks, average acceleration and jerk. These metrics reflect how safely and smoothly the agent drives. The trajectories from different approaches are also visualized and compared qualitatively. We provide demo videos in the supplementary material.
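One possible way to compute such metrics from logged episodes is sketched below. The function names and the use of mean absolute finite differences are assumptions; the exact definitions used for the reported numbers are not specified here.

```python
import numpy as np


def comfort_metrics(speeds_mps, dt_s: float):
    """Mean absolute acceleration and jerk from a uniformly sampled speed trace."""
    v = np.asarray(speeds_mps, dtype=float)
    accel = np.diff(v) / dt_s
    jerk = np.diff(accel) / dt_s
    return float(np.mean(np.abs(accel))), float(np.mean(np.abs(jerk)))


def collision_rate_percent(collided_flags) -> float:
    """Share of episodes that ended in a collision, in percent."""
    return 100.0 * float(np.mean(np.asarray(collided_flags, dtype=float)))


mean_accel, mean_jerk = comfort_metrics([0.0, 1.0, 2.0, 3.0], dt_s=1.0)
rate = collision_rate_percent([True, False, False])
print(mean_accel, mean_jerk, rate)
```

Lower values of both comfort metrics indicate smoother driving, while the collision rate summarizes safety over repeated episodes.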
| Scenarios | Learning-based Module | | | | End-to-End | | | | Rule-based | | | | Expert Data |
| | Collision Rate (%) | Time taken | Accel. | Jerk | Collision Rate (%) | Time taken | Accel. | Jerk | Collision Rate (%) | Time taken | Accel. | Jerk | Accel. |
| Two Lanes Car Following | 0.00 | 49.21 | 2.61 | 6.90 | 0.00 | 53.03 | 3.53 | 7.82 | 0.00 | 19.69 | 2.11 | 2.98 | 1.48 |
| Single Lane Car Following | 0.00 | 22.83 | 1.99 | 1.67 | 0.00 | 32.63 | 2.21 | 4.12 | 0.00 | 26.10 | 1.91 | 2.85 | 1.55 |
| Crossroad Turn Left | 0.00 | 36.75 | 2.09 | 2.75 | 0.00 | 33.45 | 2.23 | 5.62 | 33.33 | 43.49 | 2.05 | 4.75 | 2.55 |
Empty Town is a special traffic scenario where there is no traffic. It covers urban road structures like a straight road, curve road, crossing, and roundabout. Experiments in Empty Town show the difference between a modular driving system and an end-to-end control policy. The training time on an 8-GPU cluster for the end-to-end control policy is nearly 31 hours versus 15 hours for our modular driving system. Different training time in the same platform implies different numbers of interactions these two methods need to converge, showing the improvement in exploration from constraint enforcement.
As shown in Fig. 4, trajectories from our proposed framework are smoother than end-to-end policy. This demonstrates the effect of explicit enforcement of geometrical constraints in the downstream planning module. Interestingly, there are some road structures where the learning-based modular system can pass while the end-to-end control policy cannot. For example, Fig. 4(d) shows a narrow sharp turn where end-to-end control policy drives across the legal lane line. We believe this is a difficult temporally-consistent exploration problem. An agent needs some successful trials of consecutive right-turning action samples to learn, which is challenging for generative interactive learning. The modular system, in contrast, only explores behaviors that are temporally plausible like human drivers. With the planning module that searches for a geometrically constrained path, the agent can effortlessly make this sharp turn.
Table I shows the statistics of behaviors from the learning-based modular system, end-to-end control policy, and human expert. The comparison shows that our method can drive more safely (0% collision rate), much faster (less time to finish), and with higher comfort (less acceleration). This exhibits the advantage of having both the learned decision policy and the classic planning and control module.
Besides, the decision module trained in Town 3 works similarly well in Town 2, showing its generalization ability.
Traffic scenarios include basic scenarios such as Car Following and Crossroad Merge where vehicle’s interactions with zombies are relatively simple and monotonic; and complex ones such as Crossroad Turn Left, Roundabout Merge and Overtake. In complex traffic scenarios, the dynamics around the ego vehicle is of higher variation, due to either more traffic users or more complex relations between them.
As shown in Table I, in basic traffic scenarios, all three methods obtain agents that can drive safely (0% collision rate). Among them, the rule-based and learning-based modular systems are roughly on par in terms of time taken to finish, while the rule-based agent seems to drive slightly more comfortably. The end-to-end agent drives less smoothly in most of the scenarios, except in Crossroad Merge, where it takes a much longer time to finish. In basic traffic scenarios, none of them drives as well as the experts.
In complex traffic scenarios, a rule-based agent configured for two sub-scenarios fails to drive safely in another one, while the learning-based modular agent passes all of them safely, just as in basic traffic scenarios. When compared with the end-to-end control policy, a conclusion similar to that of the previous subsection can be drawn: the learning-based modular agent offers more comfort because the modular pipeline enforces smoother planning and control. Interestingly, in complex social scenarios, the learning-based modular system achieves higher comfort (lower acceleration) than the experts. This may be attributed to human errors, which the learning-based modular agent partly corrects.
The aforementioned experiments train one specific model for each scenario. To test the practical potential of the proposed framework, we train one policy with demonstration data in four scenarios.
The policy distillation process runs as follows. We leverage the same demonstration data as the previous experiments in four scenarios (Single Lane Car Following, Crossroad Merge, Roundabout Merge and Crossroad Turn Left) together to train a policy for these scenarios jointly. Four CARLA simulators run in parallel, each corresponding to one scenario, while only one learner is deployed to learn from all collected data simultaneously. Then we can obtain one policy for all scenarios.
Table II shows the statistics of behaviors from the learning-based modular system trained jointly in four scenarios and from the human expert. The distilled policy shows performance comparable to the policies previously trained separately in terms of collision rate, time to finish, acceleration and jerk. This shows the potential of our framework to handle scenarios at different levels of complexity concurrently without compromising performance.
| Scenarios | Learning-based Module | | | | Expert Data |
| | Collision Rate (%) | Time taken | Accel. | Jerk | Accel. |
| Single Lane Car Following | 0.00 | 23.53 | 2.01 | 1.77 | 1.55 |
| Crossroad Turn Left | 0.00 | 36.53 | 2.01 | 2.72 | 2.55 |
This work introduces a flexible framework for learning decision making for autonomous driving. To learn the driver’s high-level driving decision, the proposed framework imitates driver’s control behavior with a modularized generation pipeline. Being agnostic to the design of the non-differentiable downstream planning and control modules, this framework can train the neural decision module through reparametrized generative adversarial learning. We evaluate its effectiveness in simulation environments with human driving demonstrations.
This work can be extended to more complex environments where other vehicles are also learning agents, i.e. a multi-agent learning system. Alternatively, if we can collect data in the real world and replay them in the simulator, this framework could be tested in real driving environments.
Proceedings of the IEEE International Conference on Computer Vision, pp. 3182–3190.
Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.