1 Introduction
Recent works have achieved impressive results in a variety of sequence modeling problems, from NLP (Brown et al., 2020; Radford et al., 2019, 2018) to trajectory prediction (Ngiam et al., 2021; Liu et al., 2021; Quintanar et al., 2021), by leveraging powerful Transformer (Vaswani et al., 2017) models. Inspired by these successes, several recent works (Chen et al., 2021b; Janner et al., 2021) have explored ways of reformulating sequential decision-making problems in the offline RL framework as a single sequence modeling problem. In particular, these approaches jointly model the states, actions, and rewards as a single data stream with a high-capacity Transformer. These Transformer-based methods are able to outperform the behavioral policy at test time by either conditioning on desired outcomes when picking actions or leveraging the model to search for high-reward trajectories. The main benefit of this paradigm is that it avoids many of the complexities involved in modern model-free and model-based offline RL algorithms.
There are two major issues with treating RL as a single sequence modeling problem like the words or characters in an NLP problem. (1) States and actions are fundamentally different concepts. The agent always has complete control over its action sequences, but often has only limited influence on the resulting state transitions. In adversarial or stochastic environments, the same action in the same state can lead to different outcomes, which affects the likelihood or feasibility of achieving a desired result. This leads to the more practical issue (2), which is that we often need to perform different optimizations over the policy actions and the potential state transitions. Generally, we want to find the action that maximizes reward, but either in expectation over possible future states or with respect to the worst-case scenario. Thus, in safety-critical domains or adversarial games, we often want to perform a maximum over potential actions and a minimum over possible futures in the environment. In these types of environments, special considerations are therefore needed to ensure effective planning during deployment. Ozair et al. (2021) demonstrate similar issues when deploying MuZero (Schrittwieser et al., 2020) with different MCTS (Coulom, 2006) frameworks in chess. They find that planning with a single-player variant of MuZero that treats the other player as an unknown part of the environment results in a catastrophic drop in performance relative to the traditional two-player adversarial framework.
Most prior works in offline RL have focused on the mainly deterministic D4RL (Fu et al., 2020) benchmarks and a variety of weakly stochastic Atari (Machado et al., 2018) benchmarks. Therefore, there has been limited focus on the difficulties of deploying such methods in largely stochastic domains. In this work, we instead focus specifically on stochastic safety-critical domains, and use autonomous driving as a representative setting. Understanding the stochastic and multimodal nature of traffic is critical to safe and robust autonomous driving. For example, the same turning sequence for the ego vehicle could lead to a successful traversal of an intersection or a crash depending on the unknown intentions of the other agents. In this setting, the deployment strategies described in recent sequence modeling approaches for offline RL will lead to overly optimistic behavior because they do not properly disentangle the effects of the policy and world dynamics on the outcome. Specifically, they are biased to believe that the environment will cooperate with them because those sequences are most likely to lead to desired high returns.
In order to address this optimism bias in prior approaches, we develop a method called SeParated Latent Trajectory Transformer (SPLT Transformer), which learns separate generative Transformer models for the policy and dynamics. Because we focus on the autonomous driving domain, we represent each of these models as a discrete latent variable Variational AutoEncoder (VAE) (Kingma and Welling, 2013), as inspired by prior work (Tang and Salakhutdinov, 2019). By training two separate discrete latent variable VAE models, we can efficiently search over different possible ego behaviors and their interactions with different possible environment responses. We demonstrate how this planning approach allows us to search at test time for policies that are robust to many possible futures in the environment.
2 Related Work
Many recent works have explored leveraging high-capacity sequence models in sequential decision-making problems as well as in autonomous driving. Most of the works in the former area have focused on deterministic environments and thus struggle in stochastic and multimodal problems like autonomous driving. Most of the works applying sequence models to autonomous driving have focused specifically on joint trajectory prediction for all the vehicular and pedestrian agents in the scene. These methods have been effective in capturing the multimodal stochastic nature of autonomous driving, but often do not consider how these approaches should be leveraged to generate a robust driving policy during deployment. In this work, we develop a method that incorporates ideas from both of these fields in order to learn sequential policy and dynamics models entirely from offline data. Then, we demonstrate how these models can be leveraged at test time to perform robust planning solely from quantities inferable from an egocentric perception system.
2.1 Sequence Modeling for Offline RL
Two major works that have explored reformulating the offline RL problem as a single sequence modelling problem are Decision Transformer (Chen et al., 2021b) and Trajectory Transformer (Janner et al., 2021). In stochastic environments like autonomous driving, both approaches can act optimistically because they fail to account for the uncontrollable factors in the environment that can affect the ability to achieve a specific return.
Decision Transformer is a return-conditioned model-free method that learns a Transformer-based policy that takes in the historical states and actions and a target return, and outputs the action that is most likely to lead to a trajectory sequence that achieves the target return. In stochastic environments, the difficulties come from picking a suitable target return without strong prior knowledge of the testing domain. This is especially problematic because the distribution of possible returns is heavily dependent on the stochastic transitions of the environment. Thus, setting a large target return that is not always feasible can lead to overly aggressive and optimistic behavior, while setting any lower target return could lead to suboptimal behavior in situations where the environment does unroll favorably.
Trajectory Transformer is a model-based method that trains a Transformer-based trajectory model that can hallucinate potential trajectories in the environment. In stochastic environments, the main concern is how to use this model to properly search for a suitable next action or trajectory sequence given the uncontrollable stochastic transitions. Naively deploying NLP-style beam search as proposed in the original Trajectory Transformer paper (Janner et al., 2021), without accounting for these uncontrollable stochastic transitions, will be biased to explore and pick trajectories where the environment just so happens to unroll favorably. This will similarly lead to optimistic behavior, which can be dangerous in safety-critical environments. To address this issue, a different search procedure is needed that can reason about the states and actions separately. One possible avenue for doing this stepwise would be to explore the different variants of MCTS described in Ozair et al. (2021), but that would lead to an approach that scales exponentially with the search horizon, while our approach only scales linearly with the search horizon.
In order to illuminate the optimism bias in Decision Transformer and Trajectory Transformer, consider a simple discrete and stochastic MDP with only 5 states $s_0, s_1, s_2, s_3, s_4$ and 2 actions $a_0, a_1$. The agent always starts in $s_0$. If the agent takes $a_0$, it transitions stochastically to $s_1$ or $s_2$ with equal probability. Similarly, if the agent takes $a_1$, it transitions stochastically to $s_3$ or $s_4$ with equal probability. The reward at each state is: 0 for $s_0$, 10 for $s_1$, $-10$ for $s_2$, 6 for $s_3$, and 4 for $s_4$. The trajectory terminates upon reaching any state besides $s_0$. In this MDP, the expected return for taking $a_0$ is 0 and for taking $a_1$ is 5. Let’s assume that our dataset consists of samples from a uniform random policy.
Now, the optimism bias in Trajectory Transformer comes from the modified beam search it uses at test time. It jointly unrolls the next action and the resulting state, reward, and return, and then filters based on which trajectories have the highest estimated return. In this setting, the trajectory model should properly predict all possible transitions. Then, the beam search will choose to take action $a_0$ because it will lead to the highest return of 10 when it predicts the transition $s_0 \to s_1$. However, picking $a_0$ is obviously suboptimal in expectation, even if $a_0$ could lead to the best possible return.
The optimism bias for Decision Transformer comes from picking a fixed target return and the heuristic of setting it to be the highest in the dataset. If we condition on getting a return of 10 (the highest possible), then it will clearly pick action $a_0$, as that is the only action where the probability of achieving the desired outcome is nonzero. Note that this is an issue for any return-conditioned method, not just Decision Transformer. The inclination for both methods to select $a_0$ because there is a possibility it has the highest return, even though $a_1$ is a better action in expectation and in the worst-case scenario, is the specific optimism bias we allude to in this paper. We validate the optimism bias in these approaches in a toy multimodal autonomous driving task, and further demonstrate how addressing these issues leads to our method achieving superior performance.
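The toy MDP can be checked numerically with a short sketch. The action labels `a0` and `a1` are hypothetical names, and the rewards for the first action are assumed to be 10 and -10, which is the only pairing consistent with the stated expected return of 0. The three selection rules below are illustrative stand-ins for best-case, expected-value, and worst-case reasoning, not the actual Decision Transformer or Trajectory Transformer procedures.

```python
# Terminal rewards of the two equally likely outcomes of each action.
# Assumption: the first action yields 10 or -10 (expected return 0),
# the second yields 6 or 4 (expected return 5).
OUTCOMES = {
    "a0": [10.0, -10.0],
    "a1": [6.0, 4.0],
}


def expected_return(rewards):
    """Mean return when every outcome is equally likely."""
    return sum(rewards) / len(rewards)


def best_action(score):
    """Return the action whose outcome list maximizes the given score."""
    return max(OUTCOMES, key=lambda action: score(OUTCOMES[action]))


optimistic = best_action(max)                # judges actions by their best case
risk_neutral = best_action(expected_return)  # judges actions by their mean
worst_case = best_action(min)                # judges actions by their worst case
```

Only the best-case rule prefers the first action; both the expected-value rule and the worst-case rule prefer the second, which is the disagreement the paragraph above describes.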
2.2 Conservatism and Risk-Sensitive RL
Our work is also related to the fields of risk-sensitive RL and conservatism for offline RL. Most prior works in risk-sensitive RL focus on learning a policy that is optimized not for expected return, but rather for some risk-sensitive profile over the distribution of returns (Ma et al., 2021; Shen et al., 2014). Many approaches accomplish this by learning a distributional Q-function that estimates the full distribution of potential returns (Dabney et al., 2018) rather than just the expected return. Instead of learning a distributional or risk-sensitive Q-function, we use a learned Transformer world model to hallucinate possible future trajectories and pick a robust behavior that does well in the worst-case predicted future.
Prior works in offline RL have explored different methods of incorporating conservatism in both model-based (Yu et al., 2020; Kidambi et al., 2020) and model-free (Kumar et al., 2020; Kostrikov et al., 2021) RL methods. However, offline RL approaches generally leverage conservatism specifically to discourage the agent from visiting state-actions that are outside the training distribution, where the learned models could fail to generalize (Levine et al., 2020). In contrast, in this work we explore using a conservative approach to address the difficulties of planning in stochastic environments in safety-critical domains, like autonomous driving.
2.3 Trajectory Prediction in Autonomous Driving
Many recent works in trajectory prediction leverage attention-based models (Tang and Salakhutdinov, 2019; Choi et al., 2019) and in particular Transformer neural networks (Ngiam et al., 2021; Liu et al., 2021; Quintanar et al., 2021) to make accurate long-term predictions in complex traffic situations. These methods can learn to attend specifically to the relevant factors in both the target and surrounding vehicles’ recent trajectories in order to make predictions that are consistent with the surrounding traffic. Additionally, many trajectory prediction approaches have incorporated VAEs (Tang and Salakhutdinov, 2019; Quintanar et al., 2021) to facilitate covering the different possible modes of future traffic trajectories. However, these approaches have mostly been evaluated on different variations of prediction error on a held-out test set, with little focus on the effectiveness of leveraging these models for long-term online planning, as in the CARLA (Dosovitskiy et al., 2017) NoCrash (Codevilla et al., 2019) or Leaderboard (https://leaderboard.carla.org/) benchmarks.
In this work, we also leverage Transformers and VAEs in order to make multimodal trajectory predictions. However, we focus specifically on learning models that can be effectively used for robust search during deployment on an ego vehicle.
2.4 Learning Behavior for Self-Driving
Recently, there have been several works that have focused on learning behavioral policies for autonomous driving from offline datasets. Most of this work revolves around performing imitation learning on the behavioral policy of a privileged autopilot agent in simulation (Chen et al., 2020; Prakash et al., 2021; Chen et al., 2021a). In our work, we also use a privileged autopilot to collect our offline dataset in simulation. Instead of just learning the behavioral policy, we use these data to learn both a multimodal policy and a multimodal world model that allows us to do efficient and robust search at test time. Later, we demonstrate how this search leads our method to outperform pure imitation learning.
3 Preliminaries
3.1 Offline Reinforcement Learning
In RL, we treat the environment as a Markov decision process (MDP) represented as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $P(s_{t+1} \mid s_t, a_t)$ represents the probabilistic transition dynamics, $R$ is the reward function, $\rho_0$ is the initial state distribution, and $\gamma$ is the discount factor. (While we do evaluate in environments that could be considered POMDPs, we use the MDP formalism in our paper to keep it consistent with most prior work in offline RL.) For MDPs, a trajectory $\tau$ is a sequence of states and actions $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$. Each trajectory has corresponding rewards $r_0, r_1, \ldots, r_T$. The discounted return for a specific timestep $t$ is
$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}.$$
The goal in RL is to find a policy $\pi$ that maximizes the expected discounted return
$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right].$$
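As a small worked illustration of the discounted return, the sketch below computes the return from every timestep of a finite reward sequence, taking the return at timestep t to be the sum of all later rewards, each discounted by the discount factor raised to the number of steps elapsed. The function name and the example numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return from every timestep of one trajectory.

    Uses the backward recursion R_t = r_t + gamma * R_{t+1}, which is
    equivalent to summing gamma**(k - t) * r_k over all k >= t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three unit rewards with a discount factor of 0.5.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

The backward pass touches each reward once, so labeling a trajectory with returns costs time linear in its length.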
In offline RL, we are given a fixed dataset consisting of trajectories collected by some behavioral policy in the environment. Without collecting additional data, we must learn a policy that will be effective immediately upon deployment.
3.2 Transformers
Transformers (Vaswani et al., 2017) are a neural network architecture that uses several stacks of self-attention blocks to process an arbitrary collection of inputs. By leveraging positional encodings and a causal-attention map, Transformers can be used as sequence models for autoregressive generation. In this work, we mostly use a GPT-based (Radford et al., 2018) architecture similar to that of DT (Chen et al., 2021b). In particular, we use the same linear layer + layer normalization to project the raw inputs into the embedding dimension. We use 4 self-attention blocks with 8 self-attention heads. While our high-level details are more inspired by DT, our code and GPT implementation are based on the publicly available TT (Janner et al., 2021) codebase (https://github.com/JannerM/trajectory-transformer).
3.3 Variational AutoEncoder
Our method uses conditional variational autoencoders (CVAE) as generative policy and world models that we can use to generate realistic and multimodal candidate trajectories for test-time search. Thus, in this section we give a general overview of variational autoencoders (VAE). The goal of a VAE model is to generate samples $x$ that are within the distribution of a training set $X$. Thus the model should be trained to maximize the likelihood $p(x)$ of all training points $x \in X$. VAEs accomplish this by sampling a latent variable $z$ from some prior $p(z)$ and training a decoder $p_\theta(x \mid z)$ to convert the latent variable into a sample. However, optimizing this model directly is often intractable, and thus Kingma and Welling (2013) introduce an encoder $q_\phi(z \mid x)$ that can be used to instead optimize the evidence lower bound (ELBO) on the log-likelihood of the data
$$\log p(x) \geq \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right).$$
The expectation term represents the reconstruction loss, and the KL divergence term acts as a regularizer that keeps the encoder output close to the prior. Thus, with a trained VAE model, we can sample a $z$ from our prior and pass it to our decoder to generate a sample from the desired distribution.
3.4 $\beta$-VAE
When training VAEs with high-capacity models like Transformers or CNNs (Krizhevsky et al., 2012), the decoders often learn to ignore the high-entropy stochastic latent variables. Therefore, Higgins et al. (2016) proposed a modification to the standard ELBO loss and introduced a coefficient $\beta$ on the KL divergence term. Lowering this hyperparameter $\beta$ below 1 reduces the regularizing effect on the latent variables and thus makes it easier for the models to incorporate the latent variables into their predictions. We find this modification to be effective for our experiments.
4 Separated Latent Trajectory Transformer
In this section, we describe how we train two separate Transformer-based discrete latent variable VAEs to represent our policy and world models, and how we leverage these models for robust planning at test time. Figure 1 contains an overview of the entire SPLT model architecture.
4.1 Discrete Latent Variable VAE
In order for our models to be useful for search, they need to (1) be able to produce a good range of candidate behaviors for any given situation for the ego vehicle and (2) cover a majority of the different modes of potential responses from the agents in the environment. Towards this end, we train separate Transformer-based VAEs for both the policy and world models. We make the specific design choice for the stochastic latent variables for both models to be discrete and consistent over the entire planning horizon. This allows us to tractably enumerate all possible candidate trajectories without exponential branching, which enables efficient search at test time. The intuition is that the policy latent variables should correspond to different high-level intentions or policies for the ego vehicle, like whether to tail another vehicle aggressively or keep its distance. Similarly, the world latent variables should correspond to different possible intentions of the observable vehicles, like whether a vehicle in the opposing lane will go straight or turn through the intersection. Additionally, the world latent variables should capture other events like lights changing or cars suddenly appearing in the sensing range after rounding a corner.
4.1.1 Encoders
Both the world encoder and policy encoder use the same architecture and receive the same trajectory representation. Specifically, they both take in length-$K$ trajectory sequences of states and actions of the form
$$\tau = (s_1, a_1, s_2, a_2, \ldots, s_K, a_K)$$
and output an $n_w$- or $n_\pi$-dimensional discrete latent variable, $z^w$ or $z^\pi$ respectively, with each dimension having $c$ possible values.
The core modules for both encoders are Transformers based on the GPT architecture similar to TT and DT, except we do not perform any masking of the attention. Thus, all elements in the Transformer can fully attend to every other component in the sequence. We take the mean of the Transformer outputs over all the elements in order to coalesce the entire trajectory into a single vector representation for each encoder. Finally, we pass each of these outputs into a small MLP that outputs the $n_w$ and $n_\pi$ independent categorical distributions for $z^w$ and $z^\pi$ respectively. Thus, the conditional distributions for $z^w$ and $z^\pi$ are represented as $n_w$ and $n_\pi$ independent multinomial distributions respectively. We leverage the straight-through gradient estimator (Bengio et al., 2013) as described in Hafner et al. (2020) in order to make the sampling procedure fully differentiable for training.
4.1.2 Policy Decoder
The policy decoder uses a similar input trajectory representation of past states and actions and also takes in the latent variable $z^\pi$. The goal of the policy decoder is to estimate
$$p_{\theta_\pi}(a_t \mid s_{\le t}, a_{<t}, z^\pi)$$
so that we can predict the most likely next action in the trajectory. We represent this decoder with a causal Transformer model that is very similar to the ones used in Decision Transformer. Beyond excluding the returns-to-go as inputs, the main difference in our model is that we need to incorporate the latent variable $z^\pi$. In this work, we incorporate $z^\pi$ by first converting it into a single embedding vector, similar to the positional encodings used in other Transformer works, and adding it to all the state and action embeddings. We make this design choice instead of inputting the latent variable as another element in the sequence because it makes it harder for the decoder Transformer to learn to ignore the latent variable, which is a common issue when using high-capacity models in VAEs.
For simplicity, we represent our output distribution as a unit-variance isotropic Gaussian with the mean outputted by our deterministic decoder $f_{\theta_\pi}$:
$$p_{\theta_\pi}(a_t \mid s_{\le t}, a_{<t}, z^\pi) = \mathcal{N}\left(f_{\theta_\pi}(s_{\le t}, a_{<t}, z^\pi),\, I\right).$$
4.1.3 World Model Decoder
The world model decoder is very similar to the policy decoder, except that its goal is to estimate
$$p_{\theta_w}(s_{t+1}, r_t, R_{t+1} \mid s_{\le t}, a_{\le t}, z^w)$$
so that we can predict the most likely next state, reward, and discounted return in the trajectory. The world model decoder is similarly represented with a causal Transformer and incorporates its latent variable $z^w$ in the same manner as the policy decoder. The major difference is that the world model decoder has 3 separate heads to output the 3 different required quantities. The output distributions are similarly represented as unit-variance isotropic Gaussians with the means outputted by the different heads of the deterministic decoder $f_{\theta_w}$.
4.1.4 Variational Lower Bound
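In order to train our CVAEs, we minimize the negative of the standard evidence lower bound (ELBO) on the log-likelihood of the behavioral data for the policy model
$$\mathcal{L}_\pi = -\mathbb{E}_{z^\pi \sim q_{\phi_\pi}(z^\pi \mid \tau)}\left[\sum_{t} \log p_{\theta_\pi}(a_t \mid s_{\le t}, a_{<t}, z^\pi)\right] + D_{\mathrm{KL}}\left(q_{\phi_\pi}(z^\pi \mid \tau) \,\|\, p(z^\pi)\right) \qquad (1)$$
and the world model
$$\mathcal{L}_w = -\mathbb{E}_{z^w \sim q_{\phi_w}(z^w \mid \tau)}\left[\sum_{t} \log p_{\theta_w}(s_{t+1}, r_t, R_{t+1} \mid s_{\le t}, a_{\le t}, z^w)\right] + D_{\mathrm{KL}}\left(q_{\phi_w}(z^w \mid \tau) \,\|\, p(z^w)\right) \qquad (2)$$
The main difference between our CVAE formulation and the standard formulation is that we use independent discrete uniform distributions as our priors $p(z^\pi)$ and $p(z^w)$, so that each dimension of a latent variable is uniform over its $c$ possible values, because our latent variables are discrete and multidimensional. This prior makes the KL divergence terms in equations (1) and (2) equivalent to independently maximizing the entropy of each dimension of $z^\pi$ and $z^w$ respectively. This regularizes the decoders and encourages the VAEs to leverage all available combinations of discrete latent variables.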
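The link between the uniform prior and entropy maximization follows from a standard identity: for a categorical distribution q over a fixed number of values, the KL divergence from q to the uniform distribution equals the log of that number minus the entropy of q. The sketch below checks this numerically for a single latent dimension. The function names and the example distribution are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_to_uniform(probs):
    """KL(q || Uniform) for a categorical q over len(probs) values."""
    num_values = len(probs)
    # Each uniform probability is 1 / num_values, so p / (1 / num_values)
    # simplifies to p * num_values inside the log.
    return sum(p * math.log(p * num_values) for p in probs if p > 0.0)


q = [0.7, 0.2, 0.1]
gap = kl_to_uniform(q) - (math.log(len(q)) - entropy(q))
```

Since the log term is a constant, minimizing this KL term is the same as maximizing the entropy, and with independent dimensions the total KL is just the sum of the per-dimension terms.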
Because we use the straight-through gradient estimator, we can differentiate through the multinomial sampling of the latent variables. Thus, for both CVAEs, we can jointly train the encoders and decoders end-to-end by directly optimizing objective (1) for the policy models and objective (2) for the world models.
4.2 Training
During training, we sample batches of length-$K$ trajectories from our offline dataset. The states and actions of these trajectories are passed into the world and policy encoders in order to generate the corresponding $z^w$s and $z^\pi$s. Each $z^\pi$ and its corresponding trajectory are passed into the policy decoder, where it predicts all actions in the trajectory through the standard teacher-forcing procedure (Williams and Zipser, 1989). Each $z^w$ and its corresponding trajectory are passed into the world decoder, where it predicts all the next states, rewards, and discounted returns in the trajectory, also through the teacher-forcing procedure. The policy decoder and encoder parameters are updated to minimize equation (1), and the world decoder and encoder parameters are updated to minimize equation (2). We train all our models with the Adam optimizer (Kingma and Ba, 2014) with weight decay. Additionally, we normalize all raw values by subtracting the mean and dividing by the standard deviation of the dataset.
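The normalization step can be sketched for a single scalar feature as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the population standard deviation is an arbitrary choice, and the small `eps` term is added here only to guard against division by zero for constant features.

```python
import statistics


def fit_normalizer(values):
    """Compute the dataset mean and (population) standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)


def normalize(values, mean, std, eps=1e-8):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / (std + eps) for v in values]


raw = [1.0, 2.0, 3.0, 4.0]
mean, std = fit_normalizer(raw)
scaled = normalize(raw, mean, std)
```

The statistics are computed once from the offline dataset and then reused unchanged at deployment, so the model sees inputs on the same scale in both phases.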
4.3 Planning
In this section, we describe how we can use our trained conditional policy and world model decoders in order to perform efficient and robust search at test time.
4.3.1 Candidate Trajectory Generation
First, we describe how we can generate a single candidate trajectory given a specific $z^\pi$ and $z^w$.
Assume that we are currently at a state $s_t$ and we have stored a history of the last $k$ steps of the trajectory, $(s_{t-k}, a_{t-k}, \ldots, s_{t-1}, a_{t-1}, s_t)$.
Our goal is to predict a possible continuation of that trajectory over the planning horizon $h$, $\hat{\tau} = (\hat{a}_t, \hat{s}_{t+1}, \hat{a}_{t+1}, \ldots, \hat{s}_{t+h})$.
Additionally, we want to estimate the future discounted return $\hat{R}$ for our candidate trajectory.
In order to predict these quantities, we alternately make autoregressive predictions from the policy and world models. Specifically, we alternate between predicting the next action with the policy decoder conditioned on $z^\pi$
and the next state, reward, and return with the world model decoder conditioned on $z^w$.
We repeat this alternating procedure until we reach the horizon length $h$ and compute $\hat{\tau}$ and its corresponding $\hat{R}$.
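The alternating rollout can be sketched as a plain loop. Everything here is a hypothetical stand-in: `policy_step` and `world_step` play the role of the trained policy and world model decoders, `toy_policy` and `toy_world` are trivial one-dimensional stubs, and how the per-step predictions are combined into a single return estimate is deliberately left out.

```python
def generate_candidate(history, state, policy_step, world_step, z_pi, z_w, horizon):
    """Alternate policy and world-model predictions for `horizon` steps.

    history: list of past (state, action) pairs; state: the current state.
    """
    states = [s for s, _ in history] + [state]
    actions = [a for _, a in history]
    n_past = len(actions)
    rewards, returns_to_go = [], []
    for _ in range(horizon):
        # Policy model: next action from the sequence so far and its latent.
        actions.append(policy_step(states, actions, z_pi))
        # World model: next state, reward, and return from its own latent.
        next_state, reward, return_to_go = world_step(states, actions, z_w)
        states.append(next_state)
        rewards.append(reward)
        returns_to_go.append(return_to_go)
    return {"states": states[n_past + 1:], "actions": actions[n_past:],
            "rewards": rewards, "returns_to_go": returns_to_go}


def toy_policy(states, actions, z_pi):
    """Stub: the latent selects between accelerating and coasting."""
    return 1.0 if z_pi == 1 else 0.0


def toy_world(states, actions, z_w):
    """Stub: the latent selects whether progress is slowed by 0.5."""
    next_state = states[-1] + actions[-1] - (0.5 if z_w == 1 else 0.0)
    return next_state, next_state - states[-1], 0.0


candidate = generate_candidate([(0.0, 1.0)], 1.0, toy_policy, toy_world,
                               z_pi=1, z_w=0, horizon=3)
```

Because both latents stay fixed for the whole rollout, one candidate costs one pass over the horizon, in line with the linear scaling in the search horizon noted earlier.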
4.3.2 Action Selection
Because we use discrete latent variables, we can enumerate all possible combinations of $z^\pi$ and $z^w$. There are $c^{n_\pi}$ possible values for $z^\pi$ and $c^{n_w}$ possible values for $z^w$, which leads to $c^{n_\pi + n_w}$ different possible candidate trajectories. In this work, we found small values of $c$, $n_\pi$, and $n_w$ to be sufficient for all our explored problems. Thus, we only need to consider a maximum of 256 different combinations of latent variables, which is a standard batch size in many deep learning applications. Therefore, it is easy to run the candidate trajectory generation procedure previously described for all combinations of latent variables on modern GPU hardware.
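Enumerating the latent combinations is a direct Cartesian product. In the sketch below, the values of `c`, `n_pi`, and `n_w` are assumptions picked only so that the total matches the stated maximum of 256 combinations; they are not taken from the text. The function name is likewise hypothetical.

```python
from itertools import product


def enumerate_latents(num_values, num_dims):
    """All assignments of a discrete latent with `num_dims` dimensions."""
    return list(product(range(num_values), repeat=num_dims))


# Illustrative sizes only: 2 values per dimension, 4 dimensions per latent.
c, n_pi, n_w = 2, 4, 4
policy_latents = enumerate_latents(c, n_pi)
world_latents = enumerate_latents(c, n_w)

# Every (policy latent, world latent) pair defines one candidate rollout.
pairs = list(product(policy_latents, world_latents))
```

Each pair is independent of the others, so the full set can be stacked along the batch dimension and rolled out in a single batched forward pass per step.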
Without loss of generality, we can order the possible values of $z^\pi$ and assign each one an index $i$. We can do the same for $z^w$ and assign each one an index $j$. Then, we label the candidate trajectory produced when conditioned on the $i$th $z^\pi$ and $j$th $z^w$ as $\hat{\tau}_{ij}$ and its corresponding return as $\hat{R}_{ij}$. Then, we select the candidate trajectory that corresponds to
$$\max_i \min_j \hat{R}_{ij}.$$
We execute the first action of the selected trajectory and repeat this procedure at every timestep. The intuition behind this procedure is that we are trying to pick a policy to follow that will be robust to any realistic possible future in the current environment. Later, we will show how this procedure allows our method to be opportunistic in safe situations and cautious in more dangerous situations.
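The selection rule is a maximin over a table of estimated returns: for each policy latent, take the worst return across world latents, then pick the policy latent whose worst case is best. The sketch below uses a hypothetical function name and made-up return values chosen to contrast with an optimistic rule that takes the single largest entry.

```python
def select_policy_latent(returns):
    """Pick the policy latent index with the best worst-case return.

    returns[i][j] is the estimated return of the candidate generated with
    policy latent i and world latent j.
    """
    worst = [min(row) for row in returns]
    best_i = max(range(len(returns)), key=lambda i: worst[i])
    return best_i, worst[best_i]


# Row 0 is risky (great or terrible); row 1 is steady.
estimated = [[10.0, -10.0],
             [6.0, 4.0]]
robust_choice = select_policy_latent(estimated)

# For contrast: an optimistic rule that only looks at the best entry.
optimistic_choice = max(range(len(estimated)), key=lambda i: max(estimated[i]))
```

On this table the maximin rule selects the steady row, while the optimistic rule selects the risky one, mirroring the bias discussed for beam search over jointly modeled states and actions.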
5 Experiments
For all experiments, we compare our SPLT Transformer method (our code is available at https://github.com/avillaflor/SPLT-transformer) to Trajectory Transformer (TT), Decision Transformer (DT), and Behavioral Cloning (BC) with a Transformer model. Additionally, on the CARLA benchmarks we compare to Implicit Q-Learning (IQL) (Kostrikov et al., 2021), which is a state-of-the-art non-recurrent model-free offline RL approach; we use the publicly available rlkit implementation at https://github.com/rail-berkeley/rlkit.
5.1 Illustrative Example
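Table 1: Results on the toy autonomous driving problem.
Metric | SPLT (Ours) | BC | DT(m) | DT(e) | DT(t) | TT | TT(a) | IDM(t)
Return | | | | | | | |
Success (%) | | | | | | | |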
We start with a toy autonomous driving problem that we designed to be very simple, but that still demonstrates the dangerous optimism bias in prior Transformerbased approaches.
In this toy problem, we have an ego vehicle trailing a leading vehicle, with both travelling in the same direction on a 1D path. Both vehicles are represented using point-mass dynamics, but only the ego vehicle is controllable. Half of the time the leading vehicle will begin hard-braking at the last possible moment in order to stop just before the 70m mark before continuing. The other half of the time the leading vehicle will immediately speed up to the maximum speed and continue for the entire trajectory. The ego vehicle cannot infer beforehand whether the leading vehicle will brake or not, and thus this is a completely stochastic event from the perspective of the ego vehicle.
The observation space is the absolute position and velocity of both the ego and leading vehicle. The action space is just the acceleration for the ego vehicle, clipped to a bounded range. The velocity of both vehicles is bounded above by a maximum speed and bounded below so that the vehicles cannot travel backwards. The ego vehicle is rewarded for the distance traveled at each timestep, but is given a penalty if it crashes. The trajectory ends after a time limit or if the ego vehicle crashes into the leading vehicle. The ego vehicle is initialized at a fixed position with a random velocity drawn from a given range. The leading vehicle is initialized at a random position within a given range and with the same velocity as the ego vehicle.
For the offline dataset, we collected data with a distribution of different IDM (Treiber et al., 2000) controllers that demonstrate a wide range of aggressiveness, including some trajectories where the IDM controller is too aggressive and collides with the leading vehicle. We show our results in Table 1.
For DT, we find that conditioning on the maximum return in the dataset leads to crashes every time the leading vehicle brakes. If we condition on the mean return of the best controller used to collect the dataset, then we get the opposite behavior. The agent does not crash, but also does not take full advantage of the situations where the leading vehicle does not brake, and thus underperforms. While we were able to tune the conditional return in order to get reasonable results, we found the optimal value to be quite arbitrary. Thus, we believe this parameter will be very difficult to tune in more general and complex stochastic environments, where the possible returns would be very hard to estimate beforehand without significant prior knowledge of the specific testing scenario.
For TT, we find that the results depend heavily on the scope of the search used. When we use the default parameters from their codebase for the beam search, we get results very comparable to behavior cloning, which is quite suboptimal. When we reduce the low-probability filtering to allow for more aggressive search, we find that the method sometimes crashes into the leading vehicle because it picks the predicted trajectory where both vehicles will continue at max speed. Similar to DT, we expect tuning the search aggressiveness for TT to be difficult without significant prior knowledge of the intended testing scenario.
For our method, we find that our world VAE is able to predict both possible modes for the leading vehicle’s behavior. Additionally, the policy VAE seems to be able to predict a range of different trailing behaviors. Thus, our method is able to properly search for an effective and robust behavior and achieves results comparable to the best controller in the distribution used to collect the data.
5.2 NoCrash
Next, we evaluate our method on the CARLA (Dosovitskiy et al., 2017) NoCrash (Codevilla et al., 2019) benchmark. We assume access to a global route planner that can generate dense waypoints to our goal, an accurate localization system, and a perception system that can identify the state of any vehicle or traffic light directly in front of us within a limited sensing range. In our experiments, we obtain these ground-truth quantities from the CARLA simulator, as commercial self-driving car efforts already have systems to provide these quantities. Thus, we leave implementing such systems as beyond the scope of this project.
The goal in the CARLA NoCrash benchmark is to navigate in a suburban town to a desired goal waypoint from a predetermined start waypoint. The benchmark takes place in the CARLA towns Town01 and Town02 and consists of 25 different routes in each town.
For our observation we use a lowdimensional vector representation consisting of: (1) the relative heading error to the next target waypoint, (2) the distance from the center of the target lane, (3) the ego vehicle speed, (4) the relative distance to the leading vehicle or the max sensing range if there is no leading vehicle in range, (5) the speed of the leading vehicle or the max speed if there is no leading vehicle in range, (6) the distance to the upcoming red light or the max sensing range if there is no red light in range. There are 2 actions: the steering and the target velocity for a PID controller. The ego car is rewarded for traveling faster and receives a small penalty for deviating from the target lane and a large penalty for crashing or incurring a traffic infraction. We terminate the trajectory whenever the car crashes, incurs an infraction, times out, or reaches the goal.
We collect our offline dataset with autopilot agents with a distribution of different levels of aggressiveness. This aggressiveness corresponds to the parameters of a time-to-collision-based controller that is used to adjust the vehicle’s speed in response to any leading vehicles. Additionally, the autopilots use a PID controller for steering, and always immediately brake if they are too close to a red light. We collect steps with these autopilots in the Town01 routes in the dense traffic setting. The autopilots are imperfect and have an average success rate of in Town01.
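To make the notion of aggressiveness concrete, the sketch below shows one possible time-to-collision (TTC) rule for choosing a target speed. This is an illustrative assumption and not the autopilot implementation used to collect the data: the function name `ttc_target_speed`, the linear blending rule, and the use of a single `ttc_threshold_s` parameter as the aggressiveness knob are all hypothetical.

```python
def ttc_target_speed(
    ego_speed: float,
    lead_speed: float,
    gap: float,
    cruise_speed: float,
    ttc_threshold_s: float,
) -> float:
    """Pick a target speed from the time-to-collision with the leader.

    A smaller ttc_threshold_s makes the controller react later, i.e. it
    behaves more aggressively (hypothetical parameterization).
    """
    closing_speed = ego_speed - lead_speed
    if closing_speed <= 0.0:
        # The gap is not shrinking, so there is no finite time to collision.
        return cruise_speed

    ttc = gap / closing_speed
    if ttc >= ttc_threshold_s:
        return cruise_speed

    # Blend from the cruise speed down to the leader's speed as TTC shrinks.
    blend = ttc / ttc_threshold_s
    target = lead_speed + blend * (cruise_speed - lead_speed)
    return max(0.0, min(cruise_speed, target))
```

Sampling `ttc_threshold_s` from a range of values would then yield a population of autopilots that differ only in how early they slow down behind a leading vehicle.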
Metric  BC  TT  DT(m)  DT(t)  IQL  Autopilot(t)  

Success (%)  
Speed (m/s) 
We evaluate all methods by training on this Town01 dataset and then running in the unseen Town02 routes with the dense traffic setting. We depict our results in Table 2. We find that TT and DT conditioned on the max return have a lower success rate than BC, which we suspect is due to the optimism bias we previously described leading to unnecessary collisions and infractions. When we tune the target return for DT it can outperform BC in terms of average success rate and speed. However, similar to the toy problem, we found the tuned return to be quite arbitrary. Without online evaluation or prior knowledge of the testing domain, it would be quite difficult to estimate the best target return, especially considering that the testing scenarios are different from the training scenarios. Our SPLT method is also able to achieve a higher average success rate and speed compared to BC. We suspect that our positive results are due to our method’s planning procedure which avoids the optimism bias of TT’s naive beam search.
IQL is a non-recurrent model-free offline RL method. Instead of using model-based planning or a return-conditioned policy, it leverages a conservative Q-learning approach to perform advantage-weighted regression (Peng et al., 2019) in order to learn a policy that improves over the data-collection policy. IQL’s superior performance compared to all the offline Transformer-based approaches demonstrates the benefits of leveraging a conservatively trained Q-function. We believe the contributions of our work related to addressing the optimism bias in prior Transformer-based offline RL methods to be mostly orthogonal to IQL. Thus, we leave potential extensions of incorporating an IQL-style Q-function into the planning or policy learning procedure as future work.
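For readers unfamiliar with IQL, the two scalar quantities at its core can be sketched as follows: an asymmetric (expectile) squared loss used to fit the value function, and an exponentiated-advantage weight used for advantage-weighted policy extraction. This is a simplified per-sample sketch rather than the RLKit implementation; the function names and the `max_weight` clipping value are assumptions made for the example.

```python
import math


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    Here diff stands for Q(s, a) - V(s). With tau > 0.5, positive
    differences are penalized more, pushing V toward an upper expectile.
    """
    weight = tau if diff > 0.0 else (1.0 - tau)
    return weight * diff * diff


def awr_weight(advantage: float, beta: float, max_weight: float = 100.0) -> float:
    """Weight exp(beta * advantage) on the log-likelihood of a dataset action.

    beta is the inverse temperature; max_weight is an assumed clipping value
    that keeps the weight bounded for large advantages.
    """
    exponent = min(beta * advantage, math.log(max_weight))
    return math.exp(exponent)
```

Actions with a higher estimated advantage receive a larger weight in the behavior-cloning-style policy loss, so the extracted policy stays within the dataset's action support while favoring its better actions.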
5.3 Leaderboard
Metric  SPLT (Ours)  BC  DT(m)  DT(t)  TT  IQL  Autopilot 

Total score  
Completion (%)  
Collision (/km)  
Infraction (/km) 
Next, we evaluate our method on a modified version of the CARLA Leaderboard (https://leaderboard.carla.org/) benchmark. For these experiments, we run the version of CARLA at fps. The only additional assumption we make is that our perception system can identify the state of any vehicles or pedestrians directly surrounding us in all directions within a limited sensing range.
The CARLA Leaderboard benchmark is much more comprehensive and requires the agent to perform more involved maneuvers like lane-changing in urban and highway situations. The major difference from our NoCrash setup is that we introduce 8 additional variables to the observation, corresponding to the distance and speed of surrounding vehicles in each of the 4 diagonal directions. We collect million time steps using the autopilot from the Transfuser (Prakash et al., 2021) codebase (https://github.com/autonomousvision/transfuser) in the CARLA Challenge 2021 training routes for our offline dataset. We evaluate all methods on the officially released devtest routes. We depict our results in Table 3.
Aside from DT with a tuned conditional return, our method achieves the best overall driving score among all methods. Driving score is a comprehensive indicator for driving quality that accounts for route completion, collisions, and traffic infractions. The increased complexity in the Leaderboard scenarios leads to the world being less predictable and cooperative from the ego-vehicle’s perspective. Thus, our SPLT Transformer method, which disentangles the world dynamics and the agent decision-making process, is better equipped to handle this stochastic and uncooperative environment.
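As a rough illustration of how a driving score can combine these factors, the public CARLA Leaderboard multiplies the route completion percentage by a penalty that shrinks geometrically with each infraction. The sketch below follows that general form only; since our benchmark is a modified version, the function name `driving_score` and the coefficient values in the usage example are placeholders and should not be read as the exact scoring used for Table 3.

```python
from typing import Dict


def driving_score(
    route_completion_pct: float,
    infraction_counts: Dict[str, int],
    penalty_coefficients: Dict[str, float],
) -> float:
    """Route completion scaled by a multiplicative infraction penalty.

    penalty_coefficients maps an infraction type to a factor in (0, 1];
    every occurrence of that infraction multiplies the score by the factor.
    """
    penalty = 1.0
    for kind, count in infraction_counts.items():
        penalty *= penalty_coefficients[kind] ** count
    return route_completion_pct * penalty


# Placeholder coefficients, chosen only to keep the arithmetic easy to follow.
example_score = driving_score(
    80.0,
    {"collision": 1, "red_light": 2},
    {"collision": 0.5, "red_light": 0.5},
)
```

In this example the route is 80% complete and the three infractions each halve the score, so `example_score` is 80 × 0.5 × 0.25 = 10.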
We believe that TT in particular underperforms relative to our SPLT method because its naive beam search does not plan appropriately for the range of possibly uncooperative multimodal outcomes. On the other hand, our method is even able to achieve results on par with the privileged autopilot we used to collect the dataset.
6 Conclusion
We presented our SeParated Latent Trajectory Transformer (SPLT Transformer) method, which trains 2 separate policy and world VAE models that can be used at test-time to efficiently perform robust search. We discussed how our approach avoids the optimism bias that other Transformer-based approaches for offline RL struggle with in stochastic settings. Finally, we demonstrated how our method outperforms these baseline approaches on a variety of autonomous driving tasks.
Acknowledgments: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.
References
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Learning to drive from a world on rails. arXiv preprint arXiv:2105.00636.
Learning by cheating. In Conference on Robot Learning, pp. 66–75.
Decision transformer: reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345.
Attention-based recurrent neural network for urban vehicle trajectory prediction. Procedia Computer Science 151, pp. 327–334.
Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9329–9338.
Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pp. 72–83.
Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp. 1096–1105.
CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16.
D4rl: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193.
Beta-vae: learning basic visual concepts with a constrained variational framework.
Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems 34.
Morel: model-based offline reinforcement learning. Advances in neural information processing systems 33, pp. 21810–21823.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.
Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105.
Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 1179–1191.
Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7577–7586.
Conservative offline distributional reinforcement learning. Advances in Neural Information Processing Systems 34, pp. 19235–19247.
Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research 61, pp. 523–562.
Scene transformer: a unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417.
Vector quantized models for planning. arXiv preprint arXiv:2106.04615.
Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7077–7087.
Predicting vehicles trajectories in urban scenarios with transformer networks and augmented information. arXiv preprint arXiv:2106.00559.
Improving language understanding by generative pre-training.
Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
Mastering atari, go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609.
Risk-sensitive reinforcement learning. Neural computation 26 (7), pp. 1298–1328.
Multiple futures prediction. Advances in Neural Information Processing Systems 32, pp. 15424–15434.
Congested traffic states in empirical observations and microscopic simulations. Physical review E 62 (2), pp. 1805.
Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280.
Mopo: model-based offline policy optimization. Advances in Neural Information Processing Systems 33, pp. 14129–14142.
Appendix A Hyperparameters
For all Transformer-based methods across all experiments, we kept the general Transformer hyperparameters consistent. We used 4 layers of self-attention blocks with 8 heads and an embedding size of 128.
For all Transformer-based methods, we started with the linear warmup learning rate scheduler from the TT codebase (https://github.com/JannerM/trajectorytransformer). For all methods except TT, after the warmup we used a learning rate of with a constant learning rate schedule. For TT, we used the cosine-annealing schedule from their codebase.
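The two schedule shapes mentioned above can be sketched generically as follows. This is not the scheduler from the TT codebase: the function names, the step-based (rather than token-based) warmup, and the `min_lr` floor are assumptions for illustration, and `base_lr`, `warmup_steps`, and `total_steps` are left as parameters rather than set to our values.

```python
import math


def warmup_constant_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup to base_lr, then a constant learning rate."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def warmup_cosine_lr(
    step: int,
    base_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

Both functions return the learning rate for a given optimizer step, so they can be plugged into any training loop that sets the rate manually or through a per-step multiplier.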
For our SPLT method, the only additional important hyperparameters are , , and for the latent variables, for the VAE, and and for the planning. We generally did a hyperparameter search over , , , and . For the toy illustrative problem, we used , , , , , and . For NoCrash, we used , , , , , and . For Leaderboard, we used , , , , , and .
For BC, we used a context for all experiments. For DT, we similarly used a context for all experiments. For TT, we used a context and horizon and the default search parameters from their codebase for all experiments. For IQL, we use the default parameters provided in the RLKit (https://github.com/railberkeley/rlkit) implementation. Namely, we set , where is an inverse temperature parameter for the advantage term in the policy extraction step.
For DT(t), we search over different scalars times the max return in the dataset as the desired target return. For the toy problem, we searched over and found to get the best results. For NoCrash, we searched over and found to get the best results. For Leaderboard, we searched over and found to get the best results.
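The search described above amounts to a small grid search over return scalars, which can be sketched as follows. The function name `tune_target_return` and the `evaluate` callback are hypothetical; in practice `evaluate` would have to roll out the return-conditioned policy online, which is exactly the requirement that makes this tuning impractical in a purely offline setting.

```python
from typing import Callable, Iterable, Optional, Tuple


def tune_target_return(
    max_dataset_return: float,
    scalars: Iterable[float],
    evaluate: Callable[[float], float],
) -> Optional[Tuple[float, float]]:
    """Return (best_scalar, best_score) over the candidate target returns.

    Each candidate target is scalar * max_dataset_return, and evaluate(target)
    is assumed to return an average evaluation score (higher is better).
    """
    best: Optional[Tuple[float, float]] = None
    for scalar in scalars:
        target = scalar * max_dataset_return
        score = evaluate(target)
        if best is None or score > best[1]:
            best = (scalar, score)
    return best
```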
Appendix B Offline D4RL Experiments
Dataset  Environment  BC  MBOP  CQL  DT  TT  IQL  SPLT(ours) 

MedExpert  HalfCheetah  
MedExpert  Hopper  
MedExpert  Walker2d  
Medium  HalfCheetah  
Medium  Hopper  
Medium  Walker2d  
MedReplay  HalfCheetah  
MedReplay  Hopper  
MedReplay  Walker2d  
Average 
For completeness, we evaluate our method on the D4RL MuJoCo locomotion-v2 tasks. In these deterministic tasks, we find that our approach is comparable to, but has no obvious competitive advantage over, the prior Transformer-based methods, as expected. Specifically, we find that our SPLT approach is generally competitive with these approaches on the medium-expert and medium datasets, but underperforms on the hopper and walker2d medium-replay tasks. We believe these results are reasonable given that the medium-replay datasets are the least similar to the autonomous driving data setting, where we assume that the dataset contains a limited number of temporally consistent behaviors.
For all these locomotion tasks we used , , , , , and
Appendix C Additional Toy AV Results
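In Table 5, we show an additional ablation result for our method, where we take a max over world latent variables instead of a min. This corresponds to our planner picking the candidate trajectory with the highest predicted return over both the policy and the world latent variables, rather than the highest worst-case return.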
Using this planning procedure causes our approach to have a similar optimism bias as the beam search used in TT. We find that using this max planner yields substantially worse results and causes collisions with the leading vehicle. These results demonstrate the importance of searching for a policy latent variable that is robust to the worst-case scenario in stochastic environments. Additionally, we include results for IQL with default parameters on this toy problem for reference.
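The difference between the two planners can be illustrated with a small sketch over a table of predicted returns. Here `returns[i][j]` is assumed to hold the predicted return of the candidate trajectory generated from policy latent `i` and world latent `j`; the function names and the example numbers are hypothetical, and the trajectory generation itself is omitted.

```python
from typing import List, Tuple


def robust_plan(returns: List[List[float]]) -> Tuple[int, float]:
    """Max over policy latents of the min over world latents."""
    worst_case = [min(row) for row in returns]
    best_i = max(range(len(returns)), key=lambda i: worst_case[i])
    return best_i, worst_case[best_i]


def optimistic_plan(returns: List[List[float]]) -> Tuple[int, float]:
    """Ablation: max over policy latents of the max over world latents."""
    best_case = [max(row) for row in returns]
    best_i = max(range(len(returns)), key=lambda i: best_case[i])
    return best_i, best_case[best_i]


# Policy latent 0 is fast but crashes if the leader brakes (world latent 1);
# policy latent 1 is slower but safe under both world latents.
example_returns = [[10.0, -100.0], [6.0, 5.0]]
robust_choice = robust_plan(example_returns)
optimistic_choice = optimistic_plan(example_returns)
```

On this example the robust planner selects policy latent 1 with a guaranteed return of 5, while the max planner selects policy latent 0 because it only considers the cooperative outcome, mirroring the collisions observed in the ablation.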
Metric  SPLT (Ours)  SPLT (max planner)  IQL  IDM(t) 

Return  
Success (%) 
Appendix D Planning
We include an additional diagram (Figure 2) to help depict the trajectory generation used in the planning algorithm.
Appendix E Sensors
We include a diagram (Figure 3) here to depict the obstacle sensor setup we use in the Leaderboard experiments.