Sequential decision-making problems are generally formulated using Markov decision processes (MDPs). In order to solve an MDP, a cost function must be specified; however, the cost function may be unknown and difficult to articulate for many problems. In such circumstances, imitation learning, also known as learning from demonstration, is a promising approach [2, 41]. Imitation learning uses expert demonstrations to learn a policy that behaves similarly to the expert with respect to performance on the unknown cost function.
There are two primary approaches for addressing the covariate shift problem, in which a policy trained only on expert states encounters a different distribution of states once it acts on its own (see section 3). First, if an expert and environment are available at training time, we can use Dataset Aggregation (DAgger). In this paper, we do not assume access to an expert. A second class of methods learns a replacement for the cost function that generalizes to unobserved states, allowing the policy to learn from interaction with the environment and thereby encounter the same distribution of states observed at test time. Inverse reinforcement learning and apprenticeship learning [1, 46, 18] are examples of this second approach. In this paper, we adopt a specific definition for apprenticeship learning following that of Ho et al. [ho2016model].
The goal in apprenticeship learning is for an agent to perform no worse than the expert on the true, unknown cost function. Traditional approaches to apprenticeship learning have three primary disadvantages. First, they often fail at imitating the expert as a consequence of restricting the class of cost functions. Second, the class of cost functions is often defined as the span of a set of basis functions that must be defined manually (as opposed to learned from the observations). Third, these methods generally involve running reinforcement learning repeatedly, and have a large computational cost as a result.
Generative Adversarial Imitation Learning (GAIL) removes the restriction that the cost belong to a highly limited class of functions, instead allowing it to be learned using expressive function approximators such as neural networks. Furthermore, using Trust Region Policy Optimization (TRPO), GAIL works with direct policy search as opposed to finding intermediate value functions [35, 1].
In this paper, we review GAIL, describing its connection to, and advantages over, previous apprenticeship learning approaches. We then apply GAIL to real-world driving data in order to learn models of human driving behavior. Autonomous driving systems are typically evaluated on real-world drive tests, which are expensive, time-consuming, and potentially dangerous. Furthermore, it is likely infeasible to build a statistically significant case for the safety of a system solely through real-world testing. Validation through simulation provides a promising alternative to real-world testing, with the ability to evaluate vehicle performance in large numbers of scenes quickly, safely, and economically. Simulations must accurately reflect real-world driving to be useful, and therefore require realistic models of human drivers to govern the behavior of non-autonomous vehicles that occupy the roadway. We demonstrate some variations to GAIL that become useful specifically for the problem of driver modeling: parameter sharing to enable multi-agent imitation, reward augmentation to provide domain knowledge, and mutual information maximization to uncover individual driving styles from the data.
This paper is organized as follows. Section 2 provides preliminaries of MDPs and establishes notation for apprenticeship learning that will be used throughout the paper. Section 3 formally introduces the imitation learning problem. Section 4 discusses apprenticeship learning in detail. Section 5 describes GAIL and establishes its connections to apprenticeship learning and GANs. Section 6 provides the driver modeling case study, which covers modeling the behavior of a single driver using GAIL, extending GAIL to PS-GAIL for multi-agent driver modeling, and using Burn-InfoGAIL to work with latent variables that can capture individual driving styles. Section 7 provides concluding remarks.
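An infinite horizon, discounted MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, c, \rho_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability distribution, $c$ is the cost function, $\rho_0$ is the distribution of the initial state $s_0$, and $\gamma$ is the discount factor.
A stochastic policy, $\pi(a \mid s)$, defines the probability of taking each action from each state. The set $\Pi$ contains all stationary stochastic policies that take actions in $\mathcal{A}$ given states in $\mathcal{S}$. We use $\pi_E$ to refer to the expert policy. In practice, $\pi_E$ will only be provided as a set of trajectory samples obtained by executing $\pi_E$ in the environment.
The expectation with respect to a policy $\pi$ is used to denote an expectation with respect to the trajectory it generates: $\mathbb{E}_\pi[c(s,a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\right]$, where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for $t \ge 0$. The $\gamma$-discounted causal entropy of the policy $\pi$ is $H(\pi) = \mathbb{E}_\pi[-\log \pi(a \mid s)]$.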
The state-occupancy distribution:
$$\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi) \tag{1}$$
gives the discounted probability of the agent being in state $s$. The state-action occupancy distribution of a policy is then defined as $\rho_\pi(s,a) = \pi(a \mid s)\,\rho_\pi(s)$. This can be interpreted as the distribution of states and actions that an agent encounters when following policy $\pi$ starting from state $s_0 \sim \rho_0$. The state-action occupancy distribution allows us to write the expected trajectory cost of a policy as
$$\mathbb{E}_\pi[c(s,a)] = \sum_{s,a} \rho_\pi(s,a)\, c(s,a) \tag{2}$$
for any cost function $c$.
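As a quick numerical check of the relationship between occupancy and expected cost, the following minimal sketch builds a two-state, two-action MDP and compares the expected discounted cost computed from the occupancy distribution with the value obtained by iterative policy evaluation. It assumes the unnormalized convention $\rho_\pi(s) = \sum_t \gamma^t P(s_t = s \mid \pi)$. All numbers, function names, and variable names are made up for illustration.

```python
# Toy MDP: every number below is an illustrative assumption.
GAMMA = 0.9
RHO0 = [1.0, 0.0]                        # initial state distribution
P = [[[0.8, 0.2], [0.1, 0.9]],           # P[s][a][s_next]
     [[0.5, 0.5], [0.0, 1.0]]]
PI = [[0.7, 0.3], [0.4, 0.6]]            # PI[s][a] = pi(a | s)
COST = [[1.0, 0.0], [2.0, 0.5]]          # COST[s][a] = c(s, a)
S, A = range(2), range(2)


def state_occupancy(horizon=500):
    """Truncated sum over t of gamma^t * P(s_t = s | pi)."""
    dist = list(RHO0)                    # distribution of s_t, starting at t = 0
    rho = [0.0 for _ in S]
    for t in range(horizon):
        for s in S:
            rho[s] += GAMMA ** t * dist[s]
        # Push the state distribution one step forward under pi and P.
        dist = [sum(dist[s] * PI[s][a] * P[s][a][s2] for s in S for a in A)
                for s2 in S]
    return rho


def cost_from_occupancy():
    """Expected cost as sum over (s, a) of rho(s) * pi(a | s) * c(s, a)."""
    rho = state_occupancy()
    return sum(rho[s] * PI[s][a] * COST[s][a] for s in S for a in A)


def cost_from_policy_evaluation(sweeps=2000):
    """Expected cost from iterating the Bellman expectation equation."""
    v = [0.0 for _ in S]
    for _ in range(sweeps):
        v = [sum(PI[s][a] * (COST[s][a]
                             + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in S))
                 for a in A)
             for s in S]
    return sum(RHO0[s] * v[s] for s in S)
```

Both routes should agree up to truncation error, and under this convention the state occupancy sums to $1/(1-\gamma)$ rather than to one.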
3 Imitation Learning
The goal of imitation learning (IL) is to learn a policy that imitates an expert policy given demonstrations from that expert [41, 40]. A demonstration is defined as a sequence of state-action pairs that result from a policy interacting with the environment: $\{(s_1, a_1), \ldots, (s_n, a_n)\}$.
Behavioral cloning learns a policy by minimizing some loss function $\mathcal{L}$ over the set of demonstrations $\mathcal{D}$ with respect to the policy $\pi$:
$$\hat{\pi} = \arg\min_{\pi} \sum_{(s,a) \in \mathcal{D}} \mathcal{L}(\pi(s), a) \tag{3}$$
$\mathcal{L}$ is typically the cross-entropy loss when using discrete actions and the negative log likelihood of a multivariate Gaussian distribution when using continuous actions.
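For continuous actions, the behavioral cloning loss can be sketched as the average negative log likelihood of the demonstrated actions under a diagonal Gaussian policy. This is a minimal illustration and not the training code used in the paper: the policy interface (a function returning a mean and a log standard deviation per action dimension), `toy_policy`, and the demonstration data are all hypothetical.

```python
import math


def gaussian_nll(action, mean, log_std):
    """Negative log likelihood of `action` under a diagonal Gaussian."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)            # standardized residual
        total += 0.5 * math.log(2.0 * math.pi) + ls + 0.5 * z * z
    return total


def bc_loss(policy, demonstrations):
    """Average NLL over (state, action) pairs; `policy(state)` returns
    a (mean, log_std) tuple. This interface is an assumption."""
    losses = [gaussian_nll(a, *policy(s)) for s, a in demonstrations]
    return sum(losses) / len(losses)


def toy_policy(state):
    # Hypothetical policy: mean is half the state, unit standard deviation.
    return [0.5 * x for x in state], [0.0 for _ in state]


def biased_policy(state):
    # Same as toy_policy but with the mean shifted by one.
    mean, log_std = toy_policy(state)
    return [m + 1.0 for m in mean], log_std


# Hypothetical demonstrations whose actions equal half the state.
demos = [([2.0], [1.0]), ([4.0], [2.0])]
```

Because the demonstrated actions coincide with the mean of `toy_policy`, its loss reduces to the Gaussian normalization constant, while `biased_policy` incurs a strictly larger loss.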
During training, behavioral cloning samples states from the state-occupancy distribution of the expert, $\rho_{\pi_E}$. However, when interacting with the environment, the policy samples states from the state-occupancy distribution of the learned policy, $\rho_\pi$. This change in distribution between training and test time is referred to as covariate shift, and results in the agent making increasingly large errors from which it cannot recover.
Allowing the agent to interact with the environment at training time addresses the underlying cause of covariate shift, but this interaction requires an explicit or implicit reward function since the agent may encounter states not contained in the training data. There are various approaches to addressing this problem, which we detail next.
4 Apprenticeship Learning
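The goal of apprenticeship learning is to find a policy $\pi$ that performs no worse than the expert $\pi_E$ under the true cost function $c_{\text{true}}$:
$$\mathbb{E}_\pi[c_{\text{true}}(s,a)] \le \mathbb{E}_{\pi_E}[c_{\text{true}}(s,a)] \tag{4}$$
The problem is that the true cost function is unknown. Hence, the desired goal is recast over a class of cost functions $\mathcal{C}$ as:
$$\mathbb{E}_\pi[c(s,a)] \le \mathbb{E}_{\pi_E}[c(s,a)] \quad \text{for all } c \in \mathcal{C} \tag{5}$$
If we can satisfy eq. 5 for the worst possible cost function, i.e., find a policy that performs no worse than the expert on the worst possible cost function in $\mathcal{C}$, we can guarantee that it will perform no worse than the expert on the (unknown) true cost function. Thus, for a given policy that is yet to be determined, we are interested in finding the worst possible cost function. This is made possible by posing the following optimization problem:
$$c^* = \arg\max_{c \in \mathcal{C}} \ \mathbb{E}_\pi[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{6}$$
Once the worst-case cost function $c^*$ is known, finding a policy can be posed as the following optimization problem:
$$\min_{\pi} \ \mathbb{E}_\pi[c^*(s,a)] \tag{7}$$
The policy found from eq. 7 is guaranteed to perform no worse than the expert with respect to the worst-case cost function, and hence guaranteed to perform no worse than the expert on the true cost function if $c_{\text{true}} \in \mathcal{C}$.
We can subtract the expert-incurred cost, which does not depend on $\pi$, from the objective function without changing the resulting optimum, as follows:
$$\min_{\pi} \ \mathbb{E}_\pi[c^*(s,a)] - \mathbb{E}_{\pi_E}[c^*(s,a)] \tag{8}$$
Since the worst-case cost function is found by solving a maximization problem in eq. 6, the overall objective function can be rewritten as:
$$\min_{\pi} \max_{c \in \mathcal{C}} \ \mathbb{E}_\pi[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{9}$$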
Equation 9 establishes a general framework for defining apprenticeship learning algorithms. To use this framework, we must provide a cost function class $\mathcal{C}$ and an optimization algorithm.
The unknown, true cost function is typically assumed to be a linear combination of known functions that are called basis cost functions. Classic apprenticeship learning algorithms [1, 46] restrict $\mathcal{C}$ to convex sets given by linear combinations of basis cost functions. However, when the true cost function does not lie within the cost function class, we lose the guarantee that the learning agent will perform no worse than the expert.
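For a linear cost class, the inner maximization has a closed form, which the sketch below illustrates. It assumes costs of the form $c_w(s,a) = w \cdot f(s,a)$ with $\|w\|_2 \le 1$, in which case the worst-case gap between the learner and the expert equals the Euclidean distance between their discounted feature expectations. The basis features, trajectories, and function names are hypothetical.

```python
import math


def discounted_feature_expectation(trajectories, gamma):
    """Average over trajectories of sum_t gamma^t f(s_t, a_t).

    Each trajectory is a list of basis-feature vectors f(s_t, a_t)."""
    dim = len(trajectories[0][0])
    mu = [0.0] * dim
    for traj in trajectories:
        for t, feat in enumerate(traj):
            for k in range(dim):
                mu[k] += gamma ** t * feat[k] / len(trajectories)
    return mu


def worst_case_cost(mu_policy, mu_expert):
    """Return (w, gap) maximizing w . (mu_policy - mu_expert) over ||w||_2 <= 1."""
    diff = [p - e for p, e in zip(mu_policy, mu_expert)]
    gap = math.sqrt(sum(d * d for d in diff))
    if gap == 0.0:
        return [0.0] * len(diff), 0.0     # feature expectations already match
    return [d / gap for d in diff], gap


# Hypothetical two-step trajectories over two basis features.
expert_trajs = [[[1.0, 0.0], [1.0, 0.0]]]
policy_trajs = [[[0.0, 1.0], [0.0, 1.0]]]
mu_e = discounted_feature_expectation(expert_trajs, gamma=0.5)
mu_pi = discounted_feature_expectation(policy_trajs, gamma=0.5)
w_star, gap = worst_case_cost(mu_pi, mu_e)
```

The returned weights place positive cost on the feature the learner over-visits and negative cost on the feature the expert favors, so driving the gap to zero amounts to matching feature expectations.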
5 Generative Adversarial Imitation Learning
Generative Adversarial Imitation Learning is derived from an alternative approach to imitation learning called Maximum Causal Entropy IRL (MaxEntIRL) [7, 52]. While apprenticeship learning attempts to find a policy that performs at least as well as the expert across cost functions, MaxEntIRL seeks a cost function for which the expert is uniquely optimal. This latter objective turns out to be equivalent, under certain assumptions, to finding a policy with an occupancy distribution matching that of the expert. This section describes the derivation of this connection, the resulting imitation learning algorithm, and its connection with Generative Adversarial Networks.
5.1 Derivation of GAIL
GAIL is derived from a cost-regularized MaxEntIRL objective:
$$\mathrm{IRL}_\psi(\pi_E) = \arg\max_{c} \ -\psi(c) + \left(\min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s,a)]\right) - \mathbb{E}_{\pi_E}[c(s,a)] \tag{10}$$
where $H(\pi)$ is the discounted causal entropy of the policy taken with respect to the state-action distribution of the policy, and $\psi$ is a function assigning a value in the extended reals to each cost function $c$. The regularization function, $\psi$, plays an important role in the derivation of GAIL and in its connection with the apprenticeship learning methods of section 4. Specifically, Ho and Ermon [ho2016generative] characterize the result of running reinforcement learning on a cost output from MaxEntIRL:
$$\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} \ -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E}) \tag{11}$$
Here, $\psi^*$ denotes the convex conjugate of $\psi$, which attempts to find a cost function that places high cost on state-action pairs more frequently visited by $\pi$ than by $\pi_E$. As a result, minimizing $\psi^*(\rho_\pi - \rho_{\pi_E})$ with respect to $\pi$ attempts to match the occupancy distributions of the two policies.
Ho and Ermon [ho2016generative] show that different cost function regularizers result in different imitation learning algorithms. For example, they show (under assumptions) that when $\psi$ is constant across cost functions, this results in exact occupancy distribution matching. This is accomplished by showing that MaxEntIRL is dual to the following optimization problem:
$$\min_{\rho_\pi} \ -H(\rho_\pi) \quad \text{subject to} \quad \rho_\pi(s,a) = \rho_{\pi_E}(s,a) \ \ \text{for all } s \in \mathcal{S},\ a \in \mathcal{A} \tag{12}$$
where $H(\rho_\pi)$ denotes the entropy of the occupancy distribution. Solving this optimization problem is intractable for large or continuous MDPs because it involves satisfying a constraint for each point in $\mathcal{S} \times \mathcal{A}$, many of which will require $\rho_\pi(s,a)$ to be zero due to the limited size of the dataset of expert demonstrations.
An alternative setting of the cost function regularizer results in the apprenticeship learning algorithms from section 4. Let $\psi = \delta_{\mathcal{C}}$, where $\delta_{\mathcal{C}}(c) = 0$ if $c \in \mathcal{C}$ and $+\infty$ otherwise, for a restricted class of cost functions $\mathcal{C}$. This results in eq. 11 reducing to (entropy-regularized) apprenticeship learning as follows:
$$\min_{\pi} \ -H(\pi) + \max_{c \in \mathcal{C}} \ \mathbb{E}_\pi[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \tag{13}$$
This regularizer restricts the cost function to $\mathcal{C}$, which is traditionally taken to be a small subspace spanned by finitely many basis cost functions. From eq. 11 we see that the information contained in the expert policy (or demonstrations sampled using that policy) must be encoded in the cost function. When the “true” cost function is not in this space, information about the expert policy can be lost, which partially explains why traditional apprenticeship learning algorithms can fail to imitate the expert well.
The desired imitation learning algorithm should 1) scale to large state-action spaces so that it works for practical problems, and 2) allow for imitation without restricting cost functions to lie in a small subspace spanned by finitely many linear basis cost functions. To this end, GAIL proposes a new cost function regularizer $\psi_{GA}$. This regularizer allows scaling to large state-action spaces and removes the requirement to specify basis cost functions. While existing apprenticeship learning formalisms used the cost function as the descriptor of desirable behavior, GAIL relies instead on the divergence between the demonstration occupancy distribution and the learning agent’s occupancy distribution. The subsequent discussion will derive the form of $\psi_{GA}$ and establish its connection to GANs.
5.2 Connection to Generative Adversarial Networks
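The cost function regularizer $\psi_{GA}$ is derived using ideas from binary classification of state-action pairs that have been drawn from the expert occupancy distribution $\rho_{\pi_E}$ or from the learning agent’s occupancy distribution $\rho_\pi$. Assuming a loss function $\phi$ to score training examples in this binary classification task, two associated functions $g_\phi$ and $\psi_\phi$ are constructed as follows (proposition A.1 of Ho and Ermon [ho2016generative]):
$$g_\phi(x) = \begin{cases} -x + \phi(-\phi^{-1}(-x)) & \text{if } x \in T \\ +\infty & \text{otherwise} \end{cases} \qquad \psi_\phi(c) = \sum_{s,a} \rho_{\pi_E}(s,a)\, g_\phi(c(s,a)) \tag{14}$$
where $T$ is the range of $-\phi$.
The minimum expected risk here in the context of binary classification of state-action pairs drawn from the occupancy distributions $\rho_\pi$ and $\rho_{\pi_E}$ is
$$R_\phi(\rho_\pi, \rho_{\pi_E}) = \sum_{s,a} \min_{\eta \in \mathbb{R}} \left[ \rho_\pi(s,a)\, \phi(\eta) + \rho_{\pi_E}(s,a)\, \phi(-\eta) \right] \tag{15}$$
Proposition A.1 of Ho and Ermon [ho2016generative] shows that this minimum expected risk is connected to the convex conjugate of the cost function regularizer as
$$\psi_\phi^*(\rho_\pi - \rho_{\pi_E}) = -R_\phi(\rho_\pi, \rho_{\pi_E}) \tag{16}$$
For the connection to GANs, the logistic loss function $\phi(x) = \log(1 + e^{-x})$ is chosen. Plugging this loss function into the constructed functions $g_\phi$ and $\psi_\phi$ gives the following forms (corollary A.1.1 in Ho and Ermon [ho2016generative]):
$$g_{GA}(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases} \qquad \psi_{GA}(c) = \mathbb{E}_{\pi_E}[g_{GA}(c(s,a))] \tag{17}$$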
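Now, we have seen that there is a connection between the cost function regularizer and the minimum expected risk via the convex conjugate. In this case, using the logistic loss, the connection is (as shown by corollary A.1.1 of Ho and Ermon [ho2016generative]):
$$\psi_{GA}^*(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \ \mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] \tag{18}$$
Substituting this conjugate into eq. 11, with a weight $\lambda \ge 0$ on the causal entropy term, gives the GAIL objective:
$$\min_{\pi} \max_{D} \ \mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \tag{19}$$
This objective has the same form as the GAN objective:
$$\min_{G} \max_{D} \ \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{20}$$
Here, $G$ is the generator that maps input noise variables $z$ to the data space as $G(z)$, and $D$ is the discriminator, which outputs a single scalar that represents the probability that $x$ came from the data rather than from $G$, a binary classification task. This objective is solved using simultaneous gradient descent wherein the parameters of $G$ and $D$ are updated. This is accomplished by sampling two sets of data, one from the training samples and the other from the noise prior.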
Unlike GANs, GAIL considers the environment as a black box, and thus the objective is not differentiable with respect to the parameters of the policy. Therefore, simultaneous gradient descent is not suitable for solving the GAIL optimization objective. Instead, optimization over the GAIL objective is performed by alternating between a gradient step to increase eq. 19 with respect to the discriminator parameters $w$, and a Trust Region Policy Optimization (TRPO) step to decrease eq. 19 with respect to the parameters $\theta$ of the policy $\pi_\theta$.
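The discriminator half of this alternation can be sketched with a logistic discriminator trained by plain gradient ascent. The sketch assumes the convention in which the discriminator outputs values near one on policy samples and near zero on expert samples, so that $\log D(s,a)$ serves as the surrogate cost passed to the policy step. The linear discriminator, the toy state-action vectors, the learning rate, and all names are hypothetical, and the TRPO policy step is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def disc(params, x):
    """Linear-logistic discriminator D(x) = sigmoid(w . x + b)."""
    w, b = params
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def objective(params, policy_sa, expert_sa):
    """Sample estimate of E_pi[log D] + E_expert[log(1 - D)]."""
    policy_term = sum(math.log(disc(params, x)) for x in policy_sa) / len(policy_sa)
    expert_term = sum(math.log(1.0 - disc(params, x)) for x in expert_sa) / len(expert_sa)
    return policy_term + expert_term


def ascent_step(params, policy_sa, expert_sa, lr=0.5):
    """One gradient ascent step on the objective w.r.t. (w, b)."""
    w, b = params
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x in policy_sa:                  # d/dz log(sigmoid(z)) = 1 - sigmoid(z)
        coeff = (1.0 - disc(params, x)) / len(policy_sa)
        grad_w = [g + coeff * xi for g, xi in zip(grad_w, x)]
        grad_b += coeff
    for x in expert_sa:                  # d/dz log(1 - sigmoid(z)) = -sigmoid(z)
        coeff = -disc(params, x) / len(expert_sa)
        grad_w = [g + coeff * xi for g, xi in zip(grad_w, x)]
        grad_b += coeff
    return [wi + lr * g for wi, g in zip(w, grad_w)], b + lr * grad_b


def surrogate_cost(params, x):
    """Cost handed to the policy optimizer: log D(s, a)."""
    return math.log(disc(params, x))


# Hypothetical state-action feature vectors.
policy_sa = [[1.0, 0.0], [0.8, 0.2]]
expert_sa = [[0.0, 1.0], [0.1, 0.9]]
params = ([0.0, 0.0], 0.0)
before = objective(params, policy_sa, expert_sa)
for _ in range(200):
    params = ascent_step(params, policy_sa, expert_sa)
after = objective(params, policy_sa, expert_sa)
```

After training, expert-like state-action pairs receive a lower surrogate cost than policy-like ones, which is what pushes the policy toward the expert's occupancy distribution.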
GAIL can also be derived more directly from an $f$-divergence minimization perspective, which is less general than cost-regularized MaxEntIRL. The $f$-divergence framework does not allow for minimizing certain distances between occupancy distributions, for example the Wasserstein distance, which has been shown to result in more reliable training of GANs. However, the Wasserstein distance can be used within the cost-regularized MaxEntIRL framework, and we use this version of GAIL in the multi-agent setting.
5.3 Information Maximizing GAIL
Demonstration trajectories are typically collected from human experts. However, these trajectories can show significant variability due to internal latent factors of variation among different individuals. For example, aggressive drivers will demonstrate significantly different driving trajectories as compared to passive drivers, even for the same road geometry and traffic scenario. To uncover these latent factors of variation, and learn policies that produce trajectories corresponding to these latent factors, Information Maximizing GAIL (InfoGAIL) was proposed.
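InfoGAIL assumes that the expert policy is a mixture of experts $\pi_E = \{\pi_E^0, \pi_E^1, \ldots\}$, and defines the generative process of the expert trajectory $\tau_E$ as $s_0 \sim \rho_0$, $z \sim p(z)$, $\pi \sim p(\pi \mid z)$, $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, where $z$ is a discrete latent variable that selects a specific policy $\pi$ from the mixture of expert policies through $p(\pi \mid z)$ (which is unknown and needs to be learned), and $p(z)$ is the known prior distribution of $z$.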
In the GAIL formulation, there is no incentive given to separating and disentangling variations observed in the data. The latent variable $z$ is introduced for this purpose. To ensure that the learned policy utilizes $z$ as much as possible, InfoGAIL tries to enforce high mutual information between the latent variable and the state-action pairs in the generated trajectory $\tau$, given by
$$I(z; \tau) = H(z) - H(z \mid \tau) = \mathbb{E}_{z, \tau}[\log p(z \mid \tau)] + H(z) \tag{21}$$
Intuitively, the mutual information captures the amount of information obtained from knowledge of the trajectory $\tau$ about the latent variable $z$.
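However, capturing the mutual information in eq. 21 relies on knowledge of the posterior distribution $p(z \mid \tau)$, which is difficult to access. Therefore, a variational lower bound, $L_I(\pi, Q)$, of the mutual information is introduced, where $Q(z \mid \tau)$ is an approximation of the true posterior $p(z \mid \tau)$. This lower bound is given by
$$L_I(\pi, Q) = \mathbb{E}_{z \sim p(z),\, a \sim \pi(\cdot \mid s, z)}[\log Q(z \mid \tau)] + H(z) \le I(z; \tau) \tag{22}$$
Now, the GAIL policy learning objective function under this mutual information regularization is modified to
$$\min_{\pi, Q} \max_{D} \ \mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda_1 L_I(\pi, Q) - \lambda_2 H(\pi) \tag{23}$$
where $\lambda_1 > 0$ is the hyperparameter for the information maximization regularization term and $\lambda_2 > 0$ weights the causal entropy term. In eq. 23, the latent code capturing the variability in demonstrations is introduced via $Q$, the approximation to the posterior distribution $p(z \mid \tau)$.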
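The variational lower bound is straightforward to estimate from samples when the latent code is discrete. The sketch below assumes the common form $L_I = \mathbb{E}[\log Q(z \mid s, a)] + H(z)$, evaluated per state-action pair, with samples given as `(z, state, action)` triples. The posterior functions and sample data are hypothetical stand-ins for a learned network and real rollouts.

```python
import math


def info_lower_bound(samples, posterior, prior):
    """Estimate L_I = E[log Q(z | s, a)] + H(z) for a discrete latent z.

    samples   : list of (z, state, action) triples generated with code z
    posterior : function (state, action) -> list of probabilities over z
    prior     : list of prior probabilities p(z)
    """
    entropy = -sum(p * math.log(p) for p in prior if p > 0.0)
    expected_log_q = sum(math.log(posterior(s, a)[z])
                         for z, s, a in samples) / len(samples)
    return expected_log_q + entropy


def uninformed_posterior(state, action):
    # Ignores its input: carries no information about z.
    return [0.5, 0.5]


def informed_posterior(state, action):
    # Hypothetical posterior that reads the code off the action.
    return [0.99, 0.01] if action == 0 else [0.01, 0.99]


# Toy rollouts in which the action reveals the latent code.
prior = [0.5, 0.5]
samples = [(0, "s", 0), (1, "s", 1), (0, "s", 0), (1, "s", 1)]
```

With the uninformed posterior the bound is zero, and with the informed one it approaches its maximum of $H(z) = \log 2$, which mirrors the incentive for the policy to make its behavior depend on $z$.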
However, if the policy is initialized from a state sampled at the end of a demonstrator’s trajectory (as is the case when initializing the ego vehicle from a human playback), the driving policy’s actions should be consistent with the driver’s past behavior. InfoGAIL relies on sampling a random latent code at the beginning of a trial, which cannot ensure the requirement of consistency with the true driving style. This shortcoming limits the applicability of InfoGAIL to modeling real driving situations, where ego vehicles are sampled from playbacks of recorded human data.
To address this issue of inconsistency with real driving behavior, Burn-InfoGAIL was introduced, where a policy must take over where an expert demonstration trajectory ends. This is referred to as a burn-in demonstration, upon which a learned inference model must be conditioned to draw latent codes that characterize driving style.
5.4 Extension to Multiple Agents
For safety validation in simulation, it is crucial to simulate the behavior of not just a single vehicle, but entire traffic scenes to be able to recreate driving behavior arising out of interaction between agents. This motivated the development of Parameter-Sharing GAIL (PS-GAIL), which enables scaling of the imitation learning approach to multiple agents.
In line with recent work in multi-agent imitation learning [45, 50, 15], we formulate multi-agent driving as a Markov game consisting of $M$ agents and an unknown reward function. We make three simplifying assumptions:
Homogeneous agents: agents have the same observation and action spaces: $\mathcal{O}_i = \mathcal{O}_j$ and $\mathcal{A}_i = \mathcal{A}_j$ for all agents $i$ and $j$.
Independent rewards: the reward function is not shared; it depends only on the action of each agent and the state, and not on the actions of other agents or the next state. In particular, agents are not cooperative: $r_i(s, a_1, \ldots, a_M, s') = r_i(s, a_i)$.
Identical reward function: the reward function is the same for all agents: $r_i(s, a) = r_j(s, a)$ for all agents $i$ and $j$.
These assumptions are idealizations and do not hold for real-world driving scenes. For example, different vehicles may permit different accelerations, a driver may only want to change lanes if other drivers are not doing so, and individuals may value different driving qualities such as smoothness or proximity to other vehicles differently. Nevertheless, these assumptions often do apply approximately, and, as we later show, allow for learning of realistic driving policies.
A naive approach to learning human driver policies would be to train a policy in an environment where it controls a single vehicle on the roadway and all remaining vehicles follow a predetermined trajectory. Unfortunately, this approach is often incapable of producing policies that can reliably control many vehicles on the same roadway. By introducing such a controller to other vehicles after training, we reintroduce covariate shift. As a result, small errors in the behavior of a single vehicle can destabilize neighboring vehicles, ultimately leading to the failure of many agents in the scene.
Gupta et al. [gupta2017cooperative] introduced an algorithm called Parameter Sharing Trust Region Policy Optimization (PS-TRPO), which is a policy gradient approach that combines parameter sharing and TRPO. PS-TRPO was shown to produce decentralized parameter-sharing neural network policies that exhibit emergent cooperative behavior without explicit communication between agents. PS-TRPO is highly sample-efficient because it reduces the number of parameters by a factor of $M$, and shares experience across all agents. Notably, PS-TRPO still allows agents to exhibit different behavior because each agent receives unique observations.
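For a policy $\pi_\theta$ with parameters $\theta$, PS-TRPO performs an update to the policy parameters by approximately solving the constrained optimization problem:
$$\max_{\theta'} \ \mathbb{E}_{o,\, a \sim q}\left[\frac{\pi_{\theta'}(a \mid o)}{q(a \mid o)}\, A_\theta(o, a)\right] \quad \text{subject to} \quad \mathbb{E}_{o}\left[D_{KL}\big(\pi_\theta(\cdot \mid o) \,\|\, \pi_{\theta'}(\cdot \mid o)\big)\right] \le \Delta_{KL}$$
where $q$ is a rollout-sampling policy, and $A_\theta(o, a)$ is an advantage function quantifying how much the value of an action $a$ taken in response to an observation $o$ differs from the baseline value estimated for $o$. $D_{KL}$ is the KL-divergence between the two policy distributions, and $\Delta_{KL}$ is a step size parameter that controls the maximum change in policy per optimization step.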
Our proposed approach, PS-GAIL, combines GAIL with PS-TRPO to generate policies capable of controlling multiple vehicles, enabling more stable simulation of entire road scenes. The approach is described in algorithm 1. We begin by initializing the shared policy parameters and selecting a step size parameter. At each iteration, the shared policy is used by each agent to generate trajectories. Rewards are then assigned to each state-action pair in these trajectories by the critic. Subsequently, observed trajectories are used to perform a TRPO update for the policy, and an Adam update for the critic. PS-GAIL can be viewed as a special case of the algorithms presented by Song et al. since all agents share the same policy and receive rewards from the same critic.
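The control flow just described can be sketched as a short training loop. This is a structural outline under stated assumptions, not the paper's algorithm 1: the four callables (`rollout`, `reward`, `update_policy`, `update_critic`) are hypothetical placeholders for the simulator, the critic's reward assignment, the TRPO step, and the Adam step, and their signatures are invented for illustration.

```python
def ps_gail(theta, critic, expert_batch, n_iters, n_agents,
            rollout, reward, update_policy, update_critic):
    """Skeleton of a parameter-sharing imitation loop.

    rollout(theta, n_agents)            -> list of per-agent trajectories,
                                           each a list of (obs, action) pairs
    reward(critic, pair)                -> scalar reward for one pair
    update_policy(theta, batch, rews)   -> new shared policy parameters
    update_critic(critic, batch, expert)-> new critic parameters
    """
    for _ in range(n_iters):
        # Every agent acts with the same shared parameters theta.
        trajectories = rollout(theta, n_agents)
        # Experience from all agents is pooled into a single batch.
        batch = [pair for traj in trajectories for pair in traj]
        # The critic scores each observation-action pair.
        rewards = [reward(critic, pair) for pair in batch]
        # Policy step (TRPO in the text) on the pooled experience.
        theta = update_policy(theta, batch, rewards)
        # Critic step (Adam in the text) against the expert data.
        critic = update_critic(critic, batch, expert_batch)
    return theta, critic
```

Pooling the per-agent trajectories into one batch is what lets a single set of parameters learn from the experience of every controlled vehicle.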
6 Case study: Learning Driver Models from Demonstrations
This section demonstrates the GAIL methodology in the driving domain. Reliable models of human driving behavior are important for building simulation platforms for validating the safety of autonomous driving algorithms. Driver modeling falls within the paradigm of learning from demonstrations. There is no reason to assume that the cost function of the human drivers lies within a small function class. Instead, the cost function could be quite complex, which makes GAIL a suitable choice for driver modeling.
We frame highway driving as a sequential decision-making task in which the driver obeys a stochastic policy $\pi(a \mid s)$, mapping observed road conditions $s$ to a distribution over driving actions $a$. The state space represents the driving scene, the actions are driving actions, and the transition model is governed by the vehicle dynamics and the actions taken by surrounding vehicles. However, the cost function of the MDP is unspecified because it is often difficult for humans to articulate, let alone mathematically formulate, the cost function that they are following while driving. Given a class of policies $\pi_\theta$ parameterized by $\theta$, we seek to find the policy that best recreates human driving behavior. The goal is to infer this policy from a dataset consisting of a sequence of state-action tuples $(s_t, a_t)$.
We use the public Next-Generation Simulation (NGSIM) dataset for US Highway 101. NGSIM provides 45 minutes of driving at 10 Hz. The US Highway 101 dataset covers an area in Los Angeles approximately 640 m in length with five mainline lanes and a sixth auxiliary lane for highway entrance and exit.
Traffic density in the dataset transitions from uncongested to full congestion and exhibits a high degree of vehicle interaction as vehicles merge on and off the highway and must navigate in congested flow. The diversity of driving conditions and the forced interaction of traffic participants make this dataset particularly useful for behavioral studies. The trajectories were smoothed using an extended Kalman filter on a bicycle model and projected to lanes using centerlines extracted from the NGSIM roadway geometry files. Cars, trucks, buses, and motorcycles are in the dataset, but only car trajectories were used for model training.
In order to learn the policy in an environment with human drivers, we use a simulator that allows for playing back real trajectories and simulating the movement of controlled vehicles given actions selected by a policy. The process proceeds as follows:
The initial scene is sampled from a dataset of real driver trajectories. This state includes the position, orientation, and velocity of all vehicles in the scene.
A subset of the vehicles in the scene is randomly selected to be controlled by the policy. For single-agent training only one vehicle is selected, whereas for multi-agent training several vehicles are controlled by the policy.
For each vehicle, a set of features is extracted and passed to the policy as the observation. Table 1 describes the features provided to the policy. These features represent the scene information, and thus act as observations of the state of the driving MDP.
At every timestep, the policy outputs longitudinal acceleration and turn-rate values as the vehicle action in response to the observed features. These values are used to propagate the vehicle forward in time according to the vehicle dynamics.
The simulation is carried out, and associated metrics of both imitation performance and driving performance are extracted.
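The propagation step in this process can be sketched with a simplified kinematic update driven by longitudinal acceleration and turn rate. This is a minimal stand-in for the simulator's vehicle dynamics, which are not specified here: the `VehicleState` fields, the forward-Euler integration, the 0.1 s default time step, and the `policy` and `observe` callables are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float        # position along the global x axis (m)
    y: float        # position along the global y axis (m)
    heading: float  # heading angle (rad)
    speed: float    # longitudinal speed (m/s)


def propagate(state, accel, turn_rate, dt=0.1):
    """Advance one time step with forward-Euler integration."""
    return VehicleState(
        x=state.x + state.speed * math.cos(state.heading) * dt,
        y=state.y + state.speed * math.sin(state.heading) * dt,
        heading=state.heading + turn_rate * dt,
        speed=state.speed + accel * dt,
    )


def simulate(initial, policy, observe, n_steps, dt=0.1):
    """Roll a policy forward: observe, act, propagate, repeat.

    observe(state) -> observation features (hypothetical extractor)
    policy(obs)    -> (acceleration, turn_rate)
    """
    states = [initial]
    for _ in range(n_steps):
        accel, turn_rate = policy(observe(states[-1]))
        states.append(propagate(states[-1], accel, turn_rate, dt))
    return states
```

For example, a policy that always returns zero acceleration and zero turn rate keeps the vehicle moving in a straight line at its initial speed.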
Table 1: Features provided to the policy as observations.
| Feature | Description |
|---|---|
| Ego Vehicle | Lane-relative velocity, heading, offset. Vehicle length and width. Longitudinal and lateral acceleration. Local and global turn and angular rate. Lane curvature, distance to left and right lane markers and road edges. |
| Leading Vehicle | Relative distance, velocity, and absolute acceleration of fore vehicle, if it exists. |
| LIDAR Range and Range Rate | Artificial LIDAR beams output in regular polar intervals, providing the relative position and velocity of intercepted objects. |
| Temporal | Timegap and time-to-collision. |
| Indicators | Collision occurring, ego vehicle out-of-lane, and negative velocity. |
6.3 Policy Representation
Our learned policy must be able to simulate human driving behavior, which involves:
Non-linearity in the desired mapping from states to actions (e.g., large corrections in steering to avoid collisions caused by small changes in the current state).
High-dimensionality of the state representation, which must describe properties of the ego-vehicle, in addition to surrounding cars and road conditions.
Stochasticity because humans may take different actions each time they encounter a given traffic scene.
To address the first and second points, we represent all learned policies using neural networks. Neural networks have gained widespread popularity due to their ability to learn robust hierarchical features from complicated inputs [28, 25], and have been used in automotive behavioral modeling for action prediction in car-following contexts [20, 36, 22, 29, 34], lateral position prediction, and maneuver classification.
To address the third point, we interpret the network’s real-valued outputs given input $s_t$ as the mean $\mu_t$ and logarithm of the diagonal covariance $\log \nu_t$ of a Gaussian distribution. This enables stochasticity in the driving action provided by the neural network policy in response to a particular driving scene. Actions are chosen by sampling $a_t \sim \mathcal{N}(\mu_t, \operatorname{diag}(\nu_t))$.
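Sampling from such a policy head takes only a few lines. The sketch below assumes the network outputs a mean and a log-variance per action dimension, so the standard deviation is recovered as the exponential of half the log-variance. The function name and the example values are hypothetical.

```python
import math
import random


def sample_action(mean, log_var, rng=random):
    """Draw one action from a diagonal Gaussian.

    mean    : per-dimension means output by the network
    log_var : per-dimension log-variances output by the network
    rng     : source of randomness exposing gauss(mu, sigma)
    """
    return [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mean, log_var)]


# Example: acceleration and turn-rate heads with hypothetical outputs.
rng = random.Random(0)
action = sample_action([0.3, 0.0], [math.log(0.04), math.log(0.0001)], rng)
```

Predicting the log of the variance rather than the variance itself keeps the covariance positive for any real-valued network output.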
We evaluate both feedforward and recurrent network architectures. Feedforward neural networks directly map inputs to outputs. The most common architecture, the multilayer perceptron (MLP), consists of alternating layers of tunable weights and element-wise nonlinearities. However, the feedforward MLP is limited in its ability to adequately address partially observable environments. In real-world driving, sensor error and occlusions may prevent the driver from seeing all relevant parts of the driving state. By maintaining sufficient statistics of past observations in memory, recurrent policies disambiguate perceptually similar states by acting with respect to histories of, rather than individual, observations. In this work, we represent recurrent policies using Gated Recurrent Unit (GRU) networks due to their comparable performance with fewer parameters than other architectures.
We use similar architectures for the feedforward and recurrent policies. The recurrent policies consist of five feedforward layers that decrease in size from 256 to 32 neurons, with an additional GRU layer consisting of 32 neurons. Exponential linear units (ELU) were used throughout the network, which have been shown to combat the vanishing gradient problem while supporting a zero-centered distribution of activation vectors. The MLP policies have the same architecture, except the GRU layer is replaced with an additional feedforward layer. For each network architecture, one policy is trained through BC and one policy is trained through GAIL. In all, we trained four neural network policies: GAIL GRU, GAIL MLP, BC GRU, and BC MLP.
Figure 1 shows the imitation learning pipeline starting from driving demonstration data to driving policies. We assess the imitation performance of our driving policies via different metrics. First, to measure imitation of local vehicle behaviors, we use a set of Root Mean Square Error (RMSE) metrics that quantify the distance between the real trajectories in the dataset and the trajectories generated by our learned driving policies. We calculate the RMSE between the original human driven vehicle and its replacement policy driven vehicle in terms of the position, speed, and lane offset.
Second, to assess the undesirable traffic phenomena that arise out of vehicular interactions as compared to local, single vehicle imitation, we extract metrics that quantify collisions, hard-braking, and offroad driving. We also extract these metrics of undesirable traffic phenomena for the NGSIM driving data and compare them against the metrics obtained from rollouts generated by our driving policies.
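The RMSE metric can be sketched as follows, assuming it is computed across rollouts at a fixed prediction horizon for one scalar quantity (position, speed, or lane offset). The data layout (one list of per-time-step values per rollout) and the function name are assumptions made for illustration.

```python
import math


def rmse_at_horizon(true_rollouts, sim_rollouts, step):
    """RMSE across rollouts of a scalar quantity at time index `step`.

    true_rollouts : recorded values, one list per human trajectory
    sim_rollouts  : simulated values for the matching policy-driven vehicle
    """
    squared_errors = [(true[step] - sim[step]) ** 2
                      for true, sim in zip(true_rollouts, sim_rollouts)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical speeds (m/s) for two vehicles over three time steps.
true_speed = [[10.0, 11.0, 12.0], [10.0, 11.0, 12.0]]
sim_speed = [[10.0, 11.0, 15.0], [10.0, 11.0, 8.0]]
```

Evaluating this function at increasing values of `step` produces an error-versus-horizon curve, which is a convenient way to see how imitation error accumulates over time.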
6.5 Single Agent Imitation
First, we report results obtained from experiments conducted on learning driving from a single agent. Here, one vehicle is randomly sampled from the NGSIM demonstration data and its trajectory is used to train the critic. The effectiveness of the resulting driving policy trained using GAIL in imitating human driving behavior is assessed by validation in rollouts conducted on the simulator described in section 6.2. The resulting driving behavior was compared against various driver modeling baselines using the metrics discussed in section 6.4.
The first baseline is a static Gaussian (SG) model, which is an unchanging Gaussian distribution fit using maximum likelihood estimation on the demonstration data. The second baseline model is a Behavioral Cloning (BC) approach using mixture regression (MR). The model has been used for model-predictive control and has been shown to work well in simulation and in real-world drive tests. Our MR model is a Gaussian mixture over the joint space of the actions and features, trained using Expectation Maximization. The stochastic policy is formed from the weighted combination of the Gaussian components conditioned on the features. Greedy feature selection is used during training to select a subset of predictors up to a maximum feature count threshold while minimizing the Bayesian information criterion.
The final baseline model uses a rule-based controller to govern the lateral and longitudinal motion of the ego vehicle. The longitudinal motion is controlled by the Intelligent Driver Model (IDM). The inputs to the model are the vehicle’s current speed $v(t)$ at time $t$, relative speed $r(t)$ with respect to the leading vehicle (leader speed minus ego speed), and distance headway $d(t)$. The model then outputs an acceleration according to
$$a_{IDM} = a_{max}\left(1 - \left(\frac{v(t)}{v_{des}}\right)^{4} - \left(\frac{d_{des}}{d(t)}\right)^{2}\right)$$
where the desired distance is
$$d_{des} = d_{min} + \tau\, v(t) - \frac{v(t)\, r(t)}{2\sqrt{a_{max}\, b_{pref}}}$$
The model has several parameters that determine the acceleration output based on the scene information. Here, $v_{des}$ refers to the free speed velocity, $d_{min}$ refers to the minimum allowable separation between the ego and leader vehicle, $\tau$ refers to the minimum time separation allowable between ego and leader vehicle, and $a_{max}$ and $b_{pref}$ refer to the limits on the acceleration and deceleration, respectively.
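The Intelligent Driver Model is compact enough to write out directly. The sketch below follows the standard IDM form with an acceleration exponent of 4 and expresses the speed difference as an approach rate (ego speed minus leader speed), so the interaction term enters with a plus sign. The default parameter values are illustrative assumptions and are not taken from the paper.

```python
import math


def idm_acceleration(v, v_lead, gap,
                     v_des=30.0,   # free speed (m/s), assumed value
                     d_min=2.0,    # minimum separation (m), assumed value
                     tau=1.5,      # minimum time separation (s), assumed value
                     a_max=1.5,    # acceleration limit (m/s^2), assumed value
                     b_pref=2.0,   # deceleration limit (m/s^2), assumed value
                     delta=4.0):   # acceleration exponent
    """Longitudinal acceleration from the Intelligent Driver Model."""
    approach_rate = v - v_lead                     # positive when closing in
    d_des = (d_min + tau * v
             + v * approach_rate / (2.0 * math.sqrt(a_max * b_pref)))
    return a_max * (1.0 - (v / v_des) ** delta - (d_des / gap) ** 2)
```

On an open road the model accelerates toward the free speed and settles there, while a rapidly closing gap to the leader drives the output strongly negative.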
For the lateral motion, MOBIL is used to select the desired lane, with a proportional controller used to track the lane centerline. A small amount of noise is added to both the lateral and longitudinal accelerations to make the controller nondeterministic.
To extract the metrics of driving performance, the ego vehicle is driven using a bicycle model with acceleration and turn rate sampled from the policy network. Figure 2 shows the discrepancy between rollouts and the ground truth demonstration through root mean square error metrics.
The RMSE results show that the BC models have competitive short-horizon performance, but accumulate error over longer time horizons. GAIL produces more stable trajectories, and its short-term predictions perform well.
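To make the horizon-dependent error concrete, the sketch below computes a root mean square error at each timestep across a set of rollouts. It assumes each trajectory is a list of scalar values (for example, speed) aligned timestep by timestep with its ground-truth counterpart; the function name and data layout are illustrative assumptions, not the evaluation code used for the figures.

```python
import math


def rmse_by_horizon(rollouts, ground_truth):
    """Return the RMSE at each timestep, averaged across trajectories.

    rollouts[i][t] and ground_truth[i][t] are scalar values for
    trajectory i at timestep t (hypothetical layout for this sketch).
    """
    if not rollouts or len(rollouts) != len(ground_truth):
        raise ValueError("need the same non-zero number of trajectories")
    n = len(rollouts)
    # Only compare timesteps that every trajectory actually covers.
    horizon = min(min(len(t) for t in rollouts),
                  min(len(t) for t in ground_truth))
    errors = []
    for t in range(horizon):
        squared = sum((rollouts[i][t] - ground_truth[i][t]) ** 2
                      for i in range(n))
        errors.append(math.sqrt(squared / n))
    return errors
```

A curve that stays flat over `t` corresponds to stable long-horizon behavior, while a curve that grows with `t` reflects accumulating error.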
Figure 3 shows the undesirable driving metrics obtained from simulation. The GAIL policies outperform the BC policies. Compared to BC, the GAIL GRU policy has the closest match to the data everywhere except for hard brakes, as it rarely takes extreme actions. Mixture regression largely performs better than SG and is on par with the BC policies, but is still susceptible to cascading errors. Offroad duration is perhaps the most striking statistic; only GAIL (and of course IDM + MOBIL) stay on the road for extended stretches. SG never brakes hard as it only drives straight, causing many collisions as a consequence. It is interesting that the collision rate for IDM + MOBIL is roughly the same as the collision rate for GAIL GRU, despite the fact that IDM + MOBIL should not collide. The inability of other vehicles within the simulation environment to fully react to the ego-vehicle may explain this phenomenon.
The results demonstrate that GAIL-based models capture many desirable properties of both rule-based and machine learning methods, while avoiding common pitfalls. With the exception of the rule-based controller, GAIL policies achieve the lowest collision and off-road driving rates, considerably outperforming baseline and similarly structured BC models. Furthermore, extending GAIL to recurrent policies leads to improved performance. This result is an interesting contrast with the BC policies, where the addition of recurrence tends not to yield better results. Thus, we find that recurrence by itself is insufficient for addressing the detrimental effects that cascading errors can have on BC policies.
6.6 Multi Agent Imitation
In this subsection, we describe the experiments conducted with multiple learning agents, together with their results, using the parameter sharing approach (PS-GAIL) described in Section 5.4.
In the multi-agent setting, multiple vehicles are sampled from the demonstration NGSIM data, and a policy with shared parameters is learned by batching together the observations and actions from all the vehicles. Importantly, the dynamics of the environment change along with the agent policies. Our training procedure must therefore account for non-stationary environment dynamics.
We mitigate this problem by introducing a curriculum that scales the difficulty of the multi-agent learning problem during training. Gupta et al. (2017) define a multi-agent curriculum as a multinomial distribution over the number of agents controlled by the policy each episode. The curriculum gradually shifts probability mass to larger numbers of agents. In practice, we use a simplified curriculum that increments the number of controlled agents by a fixed number every iterations during training, in which case the curriculum is a deterministic function of the iteration.
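The simplified curriculum can be sketched as a step function of the training iteration. The function name and every numeric default below (starting count, increment, interval, cap) are hypothetical placeholders for illustration; they are not the values used in the experiments.

```python
def controlled_agents(iteration, start=1, step=1, every=10, max_agents=100):
    """Number of policy-controlled agents at a given training iteration.

    The count starts at `start`, grows by `step` after each block of
    `every` iterations, and is capped at `max_agents`.
    All defaults are illustrative placeholders.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return min(max_agents, start + step * (iteration // every))
```

Because the schedule depends only on the iteration index, it needs no random sampling, unlike a curriculum defined as a distribution over agent counts.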
We use recurrent neural network (RNN) policies, in all cases consisting of Gated Recurrent Units (GRUs). The observation is passed directly into the RNN without any initial reduction in dimensionality. We use recurrent policies in order to address the partial observability of the state caused by occluded vehicles. In the multi-agent setting, a single shared policy selects actions for all vehicles, following the parameter sharing approach previously described. Policy optimization is performed using an implementation of TRPO from rllab with a step size of .
We use two training phases for all of the models. The first phase consists of iterations with a low discount of and a small batch size of observation-action pairs. The second phase fine-tunes the models, running for iterations with a higher discount of and larger batch size of . For the multi-agent model, we add agents to the environment every iterations of the first training phase. We use agents in the fine-tune phase for the multi-agent GAIL models.
The critic acts as the surrogate reward function in the environment. The observation-action pairs for each vehicle at each timestep are passed to the critic, which outputs a scalar value that is then used as the reward for that vehicle. The critic is implemented as a feed-forward neural network consisting of (,,) ReLU units. We implemented the critic as a Wasserstein GAN with a gradient penalty (WGAN-GP) of . Similar to Li et al. (2017), to stabilize training we used a replay memory for the critic that contains samples from the three most recent epochs. For each training epoch of the policy, the critic is trained for epochs using the Adam optimizer with a learning rate of , dropout probability of , and batch size of . Half of each batch consists of NGSIM data, with the remaining half consisting of data from policy rollouts. Finally, the reward values output from the critic are adaptively normalized to have zero mean and unit variance prior to being passed to TRPO.
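One way to read "adaptively normalized" is a normalizer that tracks running statistics of the critic outputs and rescales each batch before it is handed to the policy optimizer. The sketch below follows that reading using Welford's online update; the class name, the `eps` constant, and the choice of running (rather than per-batch) statistics are assumptions made for this example and may differ from the actual implementation.

```python
import math


class RunningRewardNormalizer:
    """Rescale rewards to roughly zero mean and unit variance.

    Statistics are accumulated online with Welford's algorithm.
    This is an illustrative sketch, not the authors' implementation.
    """

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero for constant rewards

    def update(self, rewards):
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        # Before any update, fall back to unit variance (identity scaling).
        var = self.m2 / self.count if self.count > 0 else 1.0
        std = math.sqrt(var) + self.eps
        return [(r - self.mean) / std for r in rewards]
```

In use, `update` would be called with each new batch of critic outputs, followed by `normalize` on the same batch before policy optimization.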
The difficulty of the multi-agent task scales with the number of agents controlled in the environment. Figure 4 shows the performance of the two models as a function of the number of agents driven by our learned driving policy. The indicated number of agents are randomly sampled and replaced in the environment with the policy, while the remaining agents are left as originally recorded in NGSIM. Here, the single-agent policy refers to the policy trained using data obtained from one vehicle, and then deployed on multiple vehicles during validation. The results indicate that while the single-agent policy deteriorates rapidly with increasing number of agents, the multi-agent policy declines in performance much more gradually.
6.6.1 Reward Augmentation
Both single-agent GAIL and PS-GAIL are domain-agnostic methodologies. However, for the specific task of driver modeling, providing the learning agent with domain knowledge proves useful. Reward Augmented Imitation Learning (RAIL) provides external penalties during training that specifically encapsulate rules of the road. These include penalties for going off the road, braking hard, and colliding with other vehicles. All of these are undesirable driving behaviors and therefore should be discouraged in the learning agent. These penalties help to improve the state space exploration of the learning agent by discouraging bad states such as those that could potentially lead to collisions. In RAIL, part of the reinforcement learning cost signal comes from the critic based on imitating the expert, and another part comes from the externally provided penalties specifying the prior knowledge of the expert.
We explore a binary penalty and a smoothed penalty as the two forms of reward augmentation provided to the imitation learning agent. The first method of reward augmentation that we employ is to penalize states in a binary manner, where the penalty is applied when a particular event is triggered. To calculate the augmented reward, we take the maximum of the individual penalty values. For example, if a vehicle is driving off the road and colliding with another vehicle, we only penalize the collision. This will also be important when we discuss smoothed penalties.
We explore penalizing three different behaviors. First, we give a large penalty to each vehicle involved in a collision. Next, we impose the same large penalty for a vehicle that drives off the road. Finally, performing a hard brake (acceleration of less than ) is penalized by only . The penalty formula is shown in eq. 27. We denote the smallest distance from the ego vehicle to any other vehicle on the road as (meters), where . We also define the closest distance from the ego vehicle to the edge of the road (meters): . We allow to be negative if the vehicle is off the road. Finally, let be the acceleration of the vehicle in . A negative value of indicates that the vehicle is braking. Now, we can formally define the binary penalty function:
The relative values of the penalties indicate the preferences of the designer of the imitation learning agent. For example, in this case study, we penalize hard braking less than the other undesirable traffic phenomena.
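The binary scheme can be sketched as a function that checks each event and returns the largest triggered penalty. The penalty magnitudes and event thresholds are passed in as arguments because their values are not restated here; the argument names, the default distance thresholds of zero, and the numbers in the usage note are hypothetical.

```python
def binary_penalty(d_c, d_road, accel, *, large_penalty, brake_penalty,
                   brake_accel, collision_dist=0.0, offroad_dist=0.0):
    """Return the maximum of the triggered binary penalties.

    d_c: smallest distance to any other vehicle (m)
    d_road: closest distance to the road edge (m), negative when off road
    accel: ego acceleration (m/s^2), negative when braking
    Thresholds and magnitudes are caller-supplied placeholders.
    """
    penalties = [0.0]
    if d_c <= collision_dist:      # collision event
        penalties.append(large_penalty)
    if d_road <= offroad_dist:     # off-road event
        penalties.append(large_penalty)
    if accel < brake_accel:        # hard-brake event
        penalties.append(brake_penalty)
    # Only the most severe triggered event is penalized.
    return max(penalties)
```

For example, with `large_penalty=1.0`, `brake_penalty=0.5`, and `brake_accel=-3.0`, a vehicle that is simultaneously off the road and colliding receives a penalty of 1.0 rather than 2.0, matching the maximum rule described above.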
We hypothesize that providing advance warning to the imitation learning agent, in the form of smaller penalties that increase as the agent approaches an event threshold, will help address the credit assignment problem in reinforcement learning. In this case, we provide a smooth penalty for off-road driving and hard braking, where the penalty is linearly increased from a minimum threshold to the previously defined event threshold for the binary penalty.
For off-road driving, we linearly increase the penalty from to when the vehicle is within of the edge of the road. For hard braking, we linearly increase the penalty from to when the acceleration is between and .
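The smoothed variant can be sketched by replacing the off-road and hard-brake steps with linear ramps, while keeping the collision penalty binary and still taking the maximum. All thresholds, margins, and magnitudes are caller-supplied and hypothetical; the helper `ramp` and the argument names are introduced only for this example.

```python
def ramp(value, start, end):
    """Map value linearly from 0 at `start` to 1 at `end`, clipped to [0, 1].

    Works whether `end` is above or below `start`.
    """
    fraction = (value - start) / (end - start)
    return min(1.0, max(0.0, fraction))


def smooth_penalty(d_c, d_road, accel, *, large_penalty, brake_penalty,
                   road_margin, brake_start, brake_event,
                   collision_dist=0.0, offroad_dist=0.0):
    """Return the maximum of the collision, off-road, and braking penalties.

    The off-road penalty ramps up over the last `road_margin` meters before
    the off-road threshold; the braking penalty ramps up between
    `brake_start` and `brake_event`. All values are placeholders.
    """
    collision = large_penalty if d_c <= collision_dist else 0.0
    offroad = large_penalty * ramp(d_road, offroad_dist + road_margin,
                                   offroad_dist)
    brake = brake_penalty * ramp(accel, brake_start, brake_event)
    return max(collision, offroad, brake)
```

Halfway through either ramp the agent receives half of the corresponding full penalty, so it gets a graded signal before the event itself is triggered.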
The performance of driving policies trained using PS-GAIL and RAIL was assessed through experiments in the simulator. Figure 5 shows root mean square error results for prediction horizons up to . These plots indicate that the multi-agent learning approaches PS-GAIL and RAIL capture expert behavior more faithfully than single-agent GAIL. This performance discrepancy is especially pronounced for longer prediction horizons, where the errors for single-agent policies begin to accumulate rapidly. Further, reward augmentation results in better local imitation performance, as seen by the lowest RMSE values.
The superior performance of PS-GAIL and RAIL is further illustrated by Figure 6. These validation results empirically demonstrate that PS-GAIL and RAIL policies are less likely to lead vehicles into collisions, extreme decelerations, and off-road driving. This further illustrates that the PS-GAIL training procedure encourages more stable interactions between agents, thereby making them less likely to encounter extreme or unlikely driving situations. The inclusion of domain knowledge is especially significant here, as seen in the reduced values of the undesirable driving metrics.
6.7 Disentangling Driving Styles
Human driving demonstrations display variability due to latent factors. In this subsection, we report results from experiments targeted at disentangling driving styles from demonstrations.
The simulator used to generate data and train models is based on an oval racetrack, shown in fig. 7. We populate our environment with vehicles simulated by the Intelligent Driver Model, where lane changes are executed by the MOBIL general lane changing model. The settings of each controller are drawn from one of four possible parameterizations, defining the style of each car. The resulting driving experts fall into one of four classes:
Aggressive: High speed, large acceleration, small headway distances.
Passive: Low speed, low acceleration, large headway distances.
Speeder: High speed and acceleration, but large headway distance.
Tailgating: Low speed and acceleration, but small headway distances.
Furthermore, the desired speed of each car is sampled from a Gaussian distribution, ensuring that individual cars belonging to the same class behave differently. A total of 960 training demonstrations and 480 validation demonstrations were used, each lasting 50 timesteps (or 5 seconds, at 10 Hz). The observations are represented with the combination of LIDAR and road features reported in table 1.
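The expert population described above can be sketched as four class-level parameterizations plus a per-car Gaussian draw of the desired speed. Every number below is a placeholder chosen only to respect the qualitative ordering of the four classes (high or low speed, acceleration, and headway); the field names, the `speed_std` argument, and the helper `sample_expert` are likewise assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DriverStyle:
    desired_speed: float  # m/s
    max_accel: float      # m/s^2
    min_headway: float    # m


# Placeholder values; only their relative ordering follows the class list.
STYLES = {
    "aggressive": DriverStyle(desired_speed=30.0, max_accel=3.0, min_headway=2.0),
    "passive": DriverStyle(desired_speed=20.0, max_accel=1.0, min_headway=6.0),
    "speeder": DriverStyle(desired_speed=30.0, max_accel=3.0, min_headway=6.0),
    "tailgating": DriverStyle(desired_speed=20.0, max_accel=1.0, min_headway=2.0),
}


def sample_expert(style_name, rng, speed_std=1.0):
    """Draw one expert: class parameters with a Gaussian desired speed."""
    base = STYLES[style_name]
    speed = rng.gauss(base.desired_speed, speed_std)
    # Only the desired speed varies between cars of the same class.
    return replace(base, desired_speed=speed)
```

Two cars sampled from the same class share acceleration and headway settings but differ in desired speed, which is what makes individual experts within a class behave differently.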
We compared against three baseline models. The first baseline is the VAE driver policy proposed by Morton and Kochenderfer (2017). Its encoder network consists of two Long Short-Term Memory (LSTM) layers that map state-action pairs to the mean and standard deviation of a 2-dimensional Gaussian distribution. Its decoder, or policy, is a 2-layer MLP, also consisting of 128 units. During testing, the encoder conditions on a sequence of observation-action pairs sampled from the expert whose playback is used to initialize the ego vehicle (the “burn-in demonstration”). The predicted mean of the distribution is used as the latent code for the policy. The second baseline is a GAIL model trained on the demonstration trajectories. It has the same model architecture as the policy trained using Burn-InfoGAIL, except that it omits the learned embedding layer needed to encode the latent style variable. Finally, we baseline against an implementation of InfoGAIL that is architecturally identical to Burn-InfoGAIL, but simply samples the latent code from a discrete uniform distribution at the beginning of each trial.
As shown in Figure 8, Burn-InfoGAIL achieves the lowest error over the longest period of driving. GAIL is able to capture differences in style for about , presumably because the imitation objective discourages the policy from adjusting its velocity away from its initial conditions. As minor errors compound over long horizons, GAIL drifts toward an average policy due to its mode-seeking nature. In contrast, the VAE is able to use the latent code inferred from the burn-in demonstration to maintain an appropriate speed, achieving an RMSE close to the true value, rivaling Burn-InfoGAIL. However, being trained without a simulator, the VAE suffers from cascading errors causing it to go off the road.
Learning from demonstrations is a promising approach to solving MDPs when the cost function is unknown or difficult to specify. Following a long line of work on inverse reinforcement learning, GAIL was proposed with the promise of, in theory, exact imitation even for problems with large (even continuous) state and action spaces. Driver modeling is a problem where the state and action spaces are continuous, the policy is characterized by non-linearity and stochasticity, and the cost function is difficult to articulate exactly. These characteristics make learning from human driving demonstrations a suitable approach to generating human-like driving behavior. However, human demonstrations are dependent on latent factors of variability that cannot be captured by GAIL on its own. Moreover, driver modeling is inherently a multi-agent problem, again not directly solvable by GAIL.
This article described three modifications to GAIL addressing these limitations. First, it described PS-GAIL, which accounts for the multi-agent nature of the problem resulting from the interaction between traffic participants. Second, it described RAIL, which uses reward augmentation to provide domain knowledge about the rules of the road to the driver modeling agent. Third, it described Burn-InfoGAIL, which deals with the disentanglement of latent variability in demonstrations. All three modifications were demonstrated on driver modeling experiments, including learning driver behavior models from real-world driving demonstration data.
Directions for future work include methods for improving model performance, and applying learned driver models. Potential methods for improving model performance include (i) explicitly modeling the interaction between agents in a centralized manner through the use of Graph Neural Networks, (ii) exploring recently introduced, alternative methods of imitation learning, and (iii) scaling up experiments to larger datasets and driving domains. Ultimately, the goal in learning human driver models is to validate autonomous vehicles in simulation, and we hope to apply these models to that end in the future.
Toyota Research Institute (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
- (2004) Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML).
- (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483.
- (2017) Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML).
- (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
- (2019) Simulating emergent properties of human driving behavior using reward augmented multi-agent imitation learning. In IEEE International Conference on Robotics and Automation (ICRA).
- (2018) Multi-agent imitation learning for driving simulation. In International Conference on Intelligent Robots and Systems (IROS).
- (2014) Infinite time horizon maximum causal entropy inverse reinforcement learning. In IEEE Conference on Decision and Control (CDC).
- Signal modelling and hidden Markov models for driving manoeuvre recognition and driver fault diagnosis in an urban road scenario. In Intelligent Vehicles Symposium (IV).
- (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- (2007) US highway 101 dataset. Technical Report FHWA-HRT-07-030, Federal Highway Administration (FHWA).
- (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML).
- (2019) A divergence minimization perspective on imitation learning methods. arXiv preprint arXiv:1911.02256.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).
- (2020) Multi-agent adversarial inverse reinforcement learning with latent variables. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).
- (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS).
- (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS).
- (2016) Model-free imitation learning with policy optimization. In International Conference on Machine Learning (ICML).
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2003) Develop a car-following model using data collected by five-wheel system. In Intelligent Transportation Systems Conference (ITSC).
- (2007) General lane-changing model MOBIL for car-following models. Transportation Research Record 1999 (1), pp. 86–94.
- (2012) A modified car-following model based on a neural network model of the human driver effects. IEEE Transactions on Systems, Man, and Cybernetics 42 (6), pp. 1440–1449.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2016) Challenges in autonomous vehicle testing and validation. SAE International Journal of Transportation Safety 4 (1), pp. 15–24.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- (2018) Burn-in demonstrations for multi-modal imitation learning. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS).
- (2017) Imitating driver behavior with generative adversarial networks. In Intelligent Vehicles Symposium (IV).
- (2009) . In International Conference on Machine Learning (ICML).
- (2014) Comparison of parametric and non-parametric approaches for vehicle speed prediction. In American Control Conference.
- (2017) InfoGAIL: interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems (NeurIPS).
- (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings, pp. 157–163.
- (2014) Vehicle lateral position prediction: a small step towards a comprehensive risk assessment system. In Intelligent Transportation Systems Conference (ITSC).
- (2017) Simultaneous policy learning and latent state inference for imitating driver behavior. In International Conference on Intelligent Transportation Systems (ITSC).
- (2016) Analysis of recurrent neural networks for probabilistic modeling of driver behavior. Transactions on Intelligent Transportation Systems 18 (5), pp. 1289–1298.
- (2000) Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML).
- (2007) Neural agent car-following models. IEEE Transactions on Intelligent Transportation Systems 8 (1), pp. 60–70.
- (1989) ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems (NeurIPS).
- (2019) SQIL: imitation learning via regularized behavioral cloning. arXiv preprint arXiv:1905.11108.
- Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics.
- (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics.
- (1999) Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3 (6), pp. 233–242.
- (2015) Trust region policy optimization. In International Conference on Machine Learning (ICML).
- (1978) Estimating the dimension of a model. The Annals of Statistics 6 (2), pp. 461–464.
- (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244.
- (2018) Multi-agent generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS).
- (2008) A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems (NeurIPS).
- (2000) Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62 (2), pp. 1805–1852.
- (2010) Recurrent policy gradients. Logic Journal of IGPL 18 (5), pp. 620–634.
- (2019) Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113.
- (2019) Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning (ICML).
- (2010) Modeling interaction via the principle of maximum causal entropy. In International Conference on Machine Learning (ICML).
- (2008) Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI).