GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving

11/16/2021
by   Raphaël Chekroun, et al.

Deep reinforcement learning (DRL) has been demonstrated to be effective for several complex decision-making applications such as autonomous driving and robotics. However, DRL is notoriously limited by its high sample complexity and its lack of stability. Prior knowledge, e.g. expert demonstrations, is often available but challenging to leverage to mitigate these issues. In this paper, we propose General Reinforced Imitation (GRI), a novel method which combines benefits from exploration and expert data and is straightforward to implement over any off-policy RL algorithm. We make one simplifying hypothesis: expert demonstrations can be seen as perfect data whose underlying policy gets a constant high reward. Based on this assumption, GRI introduces the notion of an offline demonstration agent. This agent sends expert data which are processed both concurrently and indistinguishably with the experiences coming from the online RL exploration agent. We show that our approach enables major improvements on vision-based autonomous driving in urban environments. We further validate the GRI method on Mujoco continuous control tasks with different off-policy RL algorithms. Our method ranked first on the CARLA Leaderboard and outperforms World on Rails, the previous state-of-the-art, by 17% on the driving score.



1 Introduction

Autonomous driving (AD) in urban areas is a convoluted task. Agents have to efficiently analyse a highly complex environment and make online decisions to follow driving rules whilst simultaneously interacting with other dynamic agents, such as drivers or pedestrians. That is why the autonomous driving literature focuses on learning methods rather than the design of general hand-crafted rules.

Imitation learning (IL) [bojarski, il-bullshit, transfuser, marin-fisheye], especially behavior cloning, aims at mimicking expert behavior for a given task. It requires a significant amount of annotated data, often recorded by human drivers. Even though this kind of data can be recorded easily on a large scale, practical safety concerns in real traffic lead to heavily biased observations showing predominantly safe driving examples and under-representing rare dangerous situations. Hence, IL agents suffer from distribution mismatch and struggle to recover from their own mistakes.

Deep reinforcement learning (DRL) [dqn, ppo, td3, a3c] offers an alternative that is more robust to distribution mismatch than IL, by letting the agent learn from its own mistakes through trial and error. In the RL framework, the agent explores its environment by itself and gathers rewards, numerical values assessing how good a given action is in a given state. The goal of the agent is to maximize its cumulative reward. To do so, the agent needs to optimize sequences of actions rather than instantaneous ones. Nonetheless, DRL needs an order of magnitude more data than IL to converge, due to this extensive, and often time-consuming, exploration of the environment during training.
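
For reference, this exploration objective can be written as the maximization of the expected discounted return; this is standard RL notation not spelled out in the text, with the discount factor γ assumed here:

    max_π  E_π [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],   with 0 ≤ γ < 1,

where s_t, a_t and r(s_t, a_t) denote the state, action and reward at time t.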

To overcome IL distribution mismatch and RL data inefficiency, we propose General Reinforced Imitation (GRI), a novel method which combines both exploration and prior knowledge from demonstrations. GRI is based on the simplifying hypothesis that expert data present a perfect behavior and, therefore, that an expert's action should receive a constant high reward. Straightforward to implement over any off-policy algorithm, GRI introduces the notion of an offline demonstration agent. This offline agent sends expert data associated with a constant demonstration reward to the replay buffer of the online RL exploration agent. Those expert data are processed by the DRL algorithm concurrently and indistinguishably from the exploration data.

The GRI method is applied to visual-based autonomous driving in an end-to-end pipeline on the CARLA simulator [carla], an open-source simulator for research in autonomous driving. We call this algorithm GRI for Autonomous Driving (GRIAD). The whole pipeline is represented in Fig. 1. On the CARLA Leaderboard, an online benchmark ranking agents according to the quality of their driving, we achieved 17% better results than World on Rails [wor], the prior top-ranking entry. In addition, our method used only three cameras and no LiDAR, which is fewer sensors than the previous top two entries [wor, transfuser]. At the time of writing (early November 2021), we are ranked first on the CARLA Leaderboard according to the main metric, the driving score. We also conducted ablation studies to highlight the impact of GRIAD compared to standard RL training on the CARLA NoCrash benchmark [nocrash].

Finally, we conducted experiments on the Mujoco [mujoco] benchmark to investigate our method's adaptability and generalizability. Tests were conducted on four different Mujoco environments, with two different DRL algorithms as backbones. Our experiments demonstrate that using the GRI framework systematically leads to better results, even when the expert data are noisy or not significantly better than the trained vanilla RL algorithm.

We summarize our main contributions below:

  • Definition of the novel GRI method to combine offline demonstrations and online exploration.

  • Presentation and ablation study of the GRI for visual-based Autonomous Driving (GRIAD) algorithm.

  • Further analysis of GRI-based algorithms on the Mujoco benchmark.

2 Related work

The GRI method aims at leveraging both offline expert demonstrations and online simulator exploration. Our main application is end-to-end vision-based autonomous driving on the CARLA simulator [carla]. Therefore, this section focuses both on end-to-end autonomous driving methods that achieved milestones on CARLA, and on existing decision-making methods learning from demonstration and exploration.

2.1 End-to-end Autonomous Driving on CARLA

End-to-end autonomous driving, i.e. directly mapping sensor signals to control, is a highly complex task on which training an agent with DRL is tedious. IL methods were the first to lead the CARLA Leaderboard. In particular, Learning by Cheating (LBC) [lbc] presents an efficient method to train a behavior cloning agent in two steps: (i) train a privileged behavior cloning agent which has access to all the ground truth data, and (ii) train a behavior cloning agent to mimic the privileged one. An evaluation of several methods on the NoCrash benchmark, presented in Chen et al. [wor], shows that LBC achieves great results in training conditions but generalizes poorly to unknown environments.

DRL can also be used for end-to-end autonomous driving. However, vision-based DRL comes with some drawbacks. Indeed, image inputs are often high-dimensional, thus requiring larger DRL networks which are usually difficult to train to convergence. Therefore, for vision-based DRL, one can encode the sensors' signal into a more compact and semantically rich representation and train the DRL network on this predefined latent space, as in D. Gordon et al. [splitnet]. This latent space can be obtained by pretraining a visual encoder on some visual tasks, such as segmentation or classification.

Based on this principle, Toromanoff et al. [architecture-marin] introduced the Implicit Affordances (IAs) method. They design and train an efficient DRL agent on CARLA, winning the CARLA challenge two years in a row (ranked #4 at the time of writing). To do so, they propose an end-to-end pipeline composed of two subsystems trained successively. First, a visual encoder is trained on auxiliary tasks: semantic segmentation, classification of the type of road, detection of traffic lights and, if there is a relevant traffic light, its state and the distance to it. Then, the visual encoder is frozen and the DRL-based decision-making subsystem is trained on the encoder latent space.

Another top ranked (#3) agent on the CARLA Leaderboard is World on Rails [wor] which assumes the world to be on rails, meaning that the agent’s actions affect only its own state and do not influence its environment. Based on that hypothesis, they transpose the driving problem into a simple, yet powerful, tabular RL setup.

Finally, Transfuser+ [transfuser], a more recent top ranked (#2) agent on the CARLA Leaderboard, applies an IL method principally focused on LiDAR and camera fusion.

2.2 Learning from demonstration and exploration

The aforementioned IL and RL strengths and weaknesses are complementary. Indeed, IL suffers from distribution mismatch, contrary to online RL. Conversely, as RL learns from scratch, it is less data efficient than IL, which incorporates prior demonstration knowledge during training.

To take the best of both worlds, some algorithms combine IL and RL to maximize efficacy by leveraging both expert data and exploration [dqfd, sqil, dapg, sacr2]. In particular, demonstrations can be used to initialize policies by pretraining the network [nac, dapg, dqfd] or leveraged with a specifically designed reward [dqfd, sqil].

Soft-Q Imitation Learning (SQIL) [sqil] and Deep Q-learning from Demonstrations (DQfD) [dqfd] are the two closest approaches to ours, as both take advantage of demonstrations in different ways and can be applied to any off-policy RL algorithm.

SQIL [sqil] performs IL using an RL agent. To do so, the replay buffer is initially filled with demonstrations, each associated with a constant reward of 1. The RL agent then collects exploration data into the replay buffer, each transition associated with a constant reward of 0. SQIL thus builds an RL agent that learns to imitate the expert behavior, and it has been mathematically shown to be equivalent to regularized behavior cloning. However, SQIL does not efficiently leverage exploration, as environment rewards are never used. Our method combines both the IL part of SQIL and the classical, rich RL online exploration.
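
To make the contrast concrete, here is a minimal Python sketch of SQIL-style reward relabeling; the buffer, environment and policy objects are hypothetical placeholders, not the authors' implementation:

def fill_buffer_sqil(buffer, demonstrations, env, policy, n_steps):
    # SQIL: demonstration transitions are stored with a constant reward of 1 ...
    for (state, action, next_state) in demonstrations:
        buffer.add(state, action, 1.0, next_state)
    # ... and exploration transitions with a constant reward of 0,
    # discarding the true environment reward (which GRI instead keeps).
    state = env.reset()
    for _ in range(n_steps):
        action = policy(state)
        next_state, env_reward, done, _ = env.step(action)  # env_reward is ignored by SQIL
        buffer.add(state, action, 0.0, next_state)
        state = env.reset() if done else next_state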

DQfD [dqfd] is based on DQN [dqn], an off-policy RL algorithm with a replay buffer. DQfD first pretrains the agent on expert data with both IL and RL losses, using the real reward given by the environment. After some pretraining steps, the agent starts gathering data from the environment into the memory buffer. The network is then trained on batches composed of exploration data with an RL loss and expert data with both IL and RL losses. Nonetheless, DQfD simultaneously uses reinforcement and imitation losses, which can diverge and are difficult to jointly optimize [rl-id]. Our method leverages demonstrations and exploration exclusively with an RL loss, and thus cannot suffer from this divergent loss issue. Moreover, DQfD, contrarily to GRI, relies on the true environment reward for the expert data, which cannot always be obtained.

3 General Reinforced Imitation (GRI)

Our pipeline for autonomous driving is an end-to-end system. Its decision-making subsystem uses the GRIAD algorithm, which is an adaptation of the GRI method to visual-based autonomous driving (AD) on CARLA. This section presents the GRI method and details the whole pipeline.

3.1 Method

GRI is a method which is straightforward to implement over any off-policy RL algorithm using a replay buffer, such as SAC [sac], DDPG [ddpg], DQN [dqn] and its successive improvements [rainbow, iqn]. GRI is built upon the hypothesis that expert demonstrations can be seen as perfect data whose underlying policy gets a constant high reward. We denote this the demonstration reward, r_demo. In our experiments, we chose r_demo to be the maximum of the environment reward.

Input: demonstration reward r_demo, probability p_demo to use the demonstration agent;
Initialize empty replay buffer B;
while not converged do
       if len(B) > min_buffer then
              do a DRL network update;
       end if
       if random.random() > p_demo then
              collect episode in buffer B with the exploration agent;
       else
              collect episode in buffer B with the demonstration agent (reward r_demo);
       end if
end while
Algorithm 1 GRI: General Reinforced Imitation
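
Below is a minimal Python sketch of Algorithm 1; the replay buffer, the two agents and the min_buffer threshold are hypothetical objects standing in for whatever off-policy backbone (DQN, SAC, DDPG, ...) is used:

import random

def train_gri(drl_agent, replay_buffer, exploration_agent, demonstration_agent,
              p_demo, min_buffer, n_iterations):
    # Demonstration episodes (every transition tagged with the constant reward r_demo)
    # and exploration episodes (tagged with the true environment reward) land in the
    # same replay buffer and are consumed indistinguishably by the off-policy update.
    for _ in range(n_iterations):
        if len(replay_buffer) > min_buffer:
            drl_agent.update(replay_buffer.sample())  # standard DRL network update
        if random.random() > p_demo:
            replay_buffer.add_episode(exploration_agent.collect_episode())
        else:
            replay_buffer.add_episode(demonstration_agent.next_episode())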

The idea of GRI is to distill expert knowledge from demonstrations into an RL agent during the training phase. To do so, we define two types of agents: (i) the online exploration agent, which is the regular RL agent exploring its environment to gather experiences (s_t, a_t, r_t) into the memory buffer, and (ii) the offline demonstration agent, which sends expert data (s_t, a_t, r_demo) associated with the constant demonstration reward to the memory buffer. Here s_t is the state, a_t the chosen action and r_t the reward at time t. At any given training step, the next episode added to the replay buffer comes from the demonstration agent with probability p_demo, and from the exploration agent otherwise. GRI is summarized in Algorithm 1.

3.2 GRI for Autonomous Driving

We applied GRI to a pipeline inspired by the Implicit Affordances method of Toromanoff et al. [architecture-marin]. As this method is trained in two phases, hence making it modular, we optimized the visual and decision-making subsystems independently.

Design of the vision subsystem. We first train two visual encoders on segmentation and classification tasks with different camera perspectives to extract compact semantic features.

We found that a single-camera setup leads to more collisions at intersections, as the sensor cannot see close obstacles while turning. Thus, we mounted three RGB cameras on the hood of our agent vehicle at fixed positions relative to the center of the car. The side cameras are angled at 70°. All three cameras have a 100° field of view.

Our visual subsystem is composed of two highly specialized Efficientnet-b1 [efficientnet] models, one for segmentation and one for the classification and regression tasks, as shown in Fig. 2. We concatenate the four outputs (three segmentations, one for each camera, and one classification from the front camera only) of the two Efficientnet-b1 encoders and use them in the same way as Implicit Affordances (IAs) for the DRL training.

Figure 2: Feature extraction from RGB camera images for the visual subsystem. Two encoder-decoder networks are pretrained on segmentation, classifications and regression tasks. Classifications and regression are only performed on the center image while all three images are segmented. After training, the visual encoders serve as fixed feature extractors with frozen weights. For the DRL backbone training, both encoder outputs are concatenated and sent to the memory buffer as input to DRL. Both encoders are Efficientnet-b1. The segmentation decoder is fully convolutional, and the classification decoder is an MLP with several outputs.

This architecture keeps the same accuracy on classification and segmentation metrics as a single, bigger encoder handling all the auxiliary tasks, as in Toromanoff et al. [architecture-marin], while reducing the dimension of the encoder latent space.
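
As an illustration of how the frozen encoders feed the DRL input, here is a small PyTorch sketch; the tiny convolutional stubs below merely stand in for the two Efficientnet-b1 encoders, whose exact heads and feature sizes are not specified in the text:

import torch
import torch.nn as nn

# Tiny stubs standing in for the two frozen Efficientnet-b1 encoders (not the real networks).
seg_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(), nn.Flatten())
cls_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(), nn.Flatten())
for p in list(seg_encoder.parameters()) + list(cls_encoder.parameters()):
    p.requires_grad = False  # encoders are frozen after pretraining on the auxiliary tasks

def encode_observation(left, center, right):
    """Three RGB camera images -> one flat feature vector stored in the replay buffer.
    Segmentation features come from all three cameras; classification/regression
    features come from the center camera only, as in Fig. 2."""
    with torch.no_grad():
        seg_feats = [seg_encoder(img) for img in (left, center, right)]
        cls_feats = cls_encoder(center)
    return torch.cat(seg_feats + [cls_feats], dim=1)

# Example with dummy 224x224 images (batch size 1).
features = encode_observation(*[torch.rand(1, 3, 224, 224) for _ in range(3)])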

The visual part for the CARLA Leaderboard has been trained on a dataset of 400,000 samples, corresponding to 44 hours of driving. This dataset was generated with the CARLA autopilot on every town with random trajectories. Each sample is composed of three images from the three cameras and the corresponding ground truth information: segmentation maps from CARLA, booleans indicating the presence of an intersection and the presence of a traffic light in front of the car and, if there is a traffic light, a class corresponding to its color and the distance to it in meters. Trajectories have been augmented with random camera translations and rotations.

Design of the decision subsystem. The decision subsystem takes as input four consecutive encodings of the three camera images and outputs an action. An action is defined as the combination of the desired steering of the wheel and the throttle or brake to apply.

Generating data on the CARLA simulator is very computationally expensive. We used a Rainbow-IQN Ape-X [drl-marin], which is a distributed DRL backbone, to mitigate this issue.

Rainbow-IQN Ape-X being based on DQN [dqn], the action space has to be discrete. Therefore, we discretized the action space into 27 steering values and 4 throttle or brake values. The resulting discrete action space contains 27 × 4 = 108 actions.
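
For illustration, a possible discretization is sketched below; the specific steering and throttle/brake values are placeholders, as the exact values used are not listed in the text:

import itertools
import numpy as np

# Hypothetical grid: 27 steering values in [-1, 1] and 4 longitudinal commands
# (e.g. full brake, coast, half throttle, full throttle).
steering_values = np.linspace(-1.0, 1.0, 27)
throttle_brake_values = [-1.0, 0.0, 0.5, 1.0]

# Cartesian product -> 27 * 4 = 108 discrete actions for the DQN-style backbone.
discrete_actions = list(itertools.product(steering_values, throttle_brake_values))
assert len(discrete_actions) == 108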

We called this setup GRI for Autonomous Driving (GRIAD). We diagram it in Fig. 3.

Figure 3: Simplified representation of the distributed GRIAD setup with a Rainbow-IQN Ape-X backbone. A central computer receives data in a shared replay buffer from both exploration and demonstration agents running on other computers. Data are sampled from this replay buffer to perform the backpropagation and update the weights of all the agents. Images from the agents are encoded using the network presented in Fig. 2 before being stored in the memory buffer.

The demonstration dataset contains 200,000 samples, corresponding to 22 hours of driving, generated using the CARLA autopilot on predefined tracks published by the CARLA team (https://github.com/carla-simulator/scenario_runner). Each sample of the demonstration dataset consists of three images from the three cameras and a discrete action obtained by mapping the expert's continuous actions to our discrete set of RL actions. We did not use any data augmentation. We note that the autopilot makes driving errors, such as collisions, red light infractions, or the car getting stuck for hundreds of frames. As a result, a fraction of our demonstrations corresponds to poor action choices. However, we decided to use this demonstration dataset as-is in order to assess the robustness of our method to noisy demonstrations.

In our experiments on CARLA, GRIAD had a total of 12 agents, including 3 demonstration agents, running in a distributed setup and sending data to the memory buffer. As demonstration agents have been constrained to send data at the same frequency as exploration agents, this is equivalent to having p_demo = 3/12 = 0.25.

The reward function used for the exploration agents is the same as in Toromanoff et al. [architecture-marin]. Since this reward function is normalized to a range between 0 and 1, we set the demonstration reward r_demo to 1.

4 Experimental results

The GRI method was assessed on its primary application, visual-based autonomous driving, on the CARLA Leaderboard and with an ablation study comparing it to vanilla RL. Further studies of the method have also been conducted on the Mujoco benchmark to analyze its behavior depending on the proportion of demonstration agents and to highlight its generalizability to other DRL backbones.

4.1 GRIAD on CARLA

On the CARLA Leaderboard. We trained GRIAD for 60M steps (45M exploration steps + 200,000 expert samples sampled 15M times in total). Both the visual and decision-making parts were trained on all available maps with all available weather. We compare to LBC [lbc], IAs [architecture-marin], Transfuser+ [transfuser] and World on Rails [wor]. Our method outperforms World on Rails, the previous leading method on the CARLA Leaderboard, by 17% on the main metric, the driving score, while using fewer sensors. Our method also gives a higher driving score than Transfuser+ [transfuser], submitted a month after GRIAD, even though Transfuser+ uses more cameras and a LiDAR. CARLA Leaderboard results are presented in Tab. 1.

Method                     Cam.  LiDAR  DS     RC     IS
LBC [lbc]                  3     No     10.9   21.3   0.55
IAs [architecture-marin]   1     No     24.98  46.97  0.52
Rails [wor]                4     No     31.37  57.65  0.56
Transfuser+ [transfuser]   4     Yes    34.58  69.84  0.56
GRIAD                      3     No     36.79  61.85  0.60
Table 1: Comparison of the sensors used and of the CARLA Leaderboard driving metrics: driving score (DS, main metric), route completion (RC), and infraction score (IS). Results from the public CARLA Leaderboard website as of November 2021. Higher is better for all metrics. Our method improves the driving score by 17% relative to the prior state of the art [wor] while using fewer sensors than the two other best entries. GRIAD was trained on CARLA 0.9.10.
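
The 17% figure follows directly from the driving scores in Tab. 1:

    (36.79 − 31.37) / 31.37 ≈ 5.42 / 31.37 ≈ 0.173,

i.e. roughly a 17% relative improvement over World on Rails.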

Ablation study on the NoCrash benchmark. We compared GRIAD to regular RL by using the same global architecture with and without demonstration agents on the NoCrash benchmark [nocrash]. In the NoCrash benchmark, agents have to be trained on a single environment (Town01) under a specific set of training weathers. Then, agents are evaluated on several scenarios with different traffic densities in the training town (Town01) and the test town (Town02), under training and test sets of weather.

Task     Town   Weather  RL 12M  RL 16M  GRIAD
Empty    train  train    98      99      99
Regular  train  train    96      100     98
Dense    train  train    92      97      92
Empty    test   train    83      97      96
Regular  test   train    83      97      93
Dense    test   train    62      76      72
Empty    train  test     66      76      84
Regular  train  test     76      82      86
Dense    train  test     68      80      82
Empty    test   test     60      62      68
Regular  test   test     60      58      64
Dense    test   test     42      46      48
Table 2: Ablation study of GRIAD on the NoCrash benchmark. The score is the percentage of routes completed without any crash. GRIAD experimentally generalizes better to test weather than RL with 12M and 16M steps, and globally gives the best agent. We note that, for computational reasons, neither the RL baselines nor GRIAD were trained until convergence. Hence, comparison to the state of the art should rather be done on the CARLA Leaderboard with the fully trained GRIAD model, cf. Tab. 1.

For these experiments, GRIAD was trained on 16M samples, corresponding to 12M exploration steps + 25,000 expert samples which have been sampled 4M times in total. We present an ablation study to show how GRIAD compares to RL without GRI, i.e. without demonstration agents, using two vanilla RL models: one trained on 12M exploration steps and the other on 16M exploration steps. Each agent was trained using the exact same visual encoder, trained on another demonstration dataset of 100,000 samples coming exclusively from Town01 under training weather. Results are presented in Tab. 2.

We first observe that GRIAD systematically gives better results than RL with 12M steps, while taking approximately the same time to train. Indeed, as demonstration agents do not require any interaction with the simulator, we can add them at a negligible cost and still improve results.

We also observe that while RL with 16M steps does better than GRIAD on training weather, GRIAD gives better results on test weather while being faster to train. We believe this is because RL tends to overfit a given environment if it explores it too much. Hence, replacing 4M exploration samples with 25,000 demonstration samples, each sampled about 160 times on average, appears to reduce overfitting and allows better generalization.

We also trained the same pipeline using the SQIL method for 20M steps, but the evaluation reward stayed particularly low during training. First tests showed SQIL to be inefficient for autonomous driving on CARLA, as it did not learn to drive at all, staying static or drifting off the road most of the time. It reached a score of 0 on every evaluated task. We believe that the reward signal as defined by SQIL is not rich enough to allow the network to converge on such a highly complex task.

4.2 GRI on the Mujoco benchmark

To further validate the GRI method, we conducted experiments on selected Mujoco [mujoco] environments, shown in Fig. 4. Expert data were generated using chainerrl [chainerrl] pretrained RL agent weights. For each environment, the value of r_demo was chosen as the highest reward the chainerrl expert agent reached during the generation of the dataset. As we did not find real expert data for the Mujoco environments, the expert data are not always significantly better than our trained vanilla RL network. Hence, this study assesses the efficiency of GRI even with suboptimal expert data.

Figure 4: Mujoco environments used for our experiments: HalfCheetah-v2, Humanoid-v2, Ant-v2, and Walker2d-v2. Articulations are controlled to make the creatures walk. Rewards depend on the distance covered.

Study on the proportion of demonstration agents. Since experiments are faster on Mujoco environments than on CARLA, we were able to investigate the impact of the proportion of demonstration agents. For these experiments we used GRI-SAC, i.e. a GRI algorithm using SAC [sac] as DRL backbone, and varied the proportion of demonstration agents between 0% and 40%. Each experiment was repeated three times with different seeds. Fig. 5 presents the results, with variances and the evaluation reward of the expert. Experiments were conducted with public code from GitHub (original code from https://github.com/dongminlee94/deep_rl, adapted with GRI).

Figure 5: Evolution of the evaluation reward on Mujoco environments for different proportions of demonstration agents with GRI-SAC. GRI-SAC with 0% demonstration agents is vanilla SAC. We observe that GRI-SAC always reaches the level of the expert, even when the expert is significantly better than the trained vanilla SAC. The proportion of demonstration agents has a significant impact on the convergence dynamics.

We observe three different dynamics.

  • For HalfCheetah-v2, a difficult task on which the expert is significantly stronger than the trained SAC, we observe that the beginning of the training is slower using GRI-SAC; we call this a warm-up phase, which we explain further in Sec. 4.3. However, the reward turns out to become significantly higher after some time. On this environment, GRI-SAC is better than SAC for every proportion of demonstration agents. Best scores were reached with 10% and 20% of demonstration agents.

  • For Humanoid-v2, a difficult task on which the expert is only a little stronger than the trained SAC, we observe that the higher the proportion of demonstration agents, the longer the warm-up phase. Nonetheless, GRI-SAC models end up reaching higher rewards after their warm-up phase. Best scores are reached with 10% and 20% of demonstration agents.

  • Ant-v2 and Walker2d-v2 are the easiest of the four evaluated tasks. On Ant-v2 the SAC agent reaches the expert level, converging similarly to GRI-SAC regardless of the number of demonstration agents used. Nevertheless, GRI-SAC converges faster with 10% and 20% demonstration agents. On Walker2d-v2 the final reward of GRI-SAC is significantly higher and reaches the expert level, while SAC remains below.

More experiments were conducted with the proportion of demonstration agents varying between 50% and 90%. Results were significantly worse than with 20% demonstration agents. We therefore conclude that the proportion of demonstration agents should not exceed 50%. We discuss some qualitative insights in the limitations section.

These experiments reveal that, at least on the evaluated Mujoco environments, 20% demonstration agents seems to be the best choice for GRI-SAC to reach the expert level.

4.2.1 GRI with DDPG as DRL backbone

We also investigated the contribution of the DRL backbone to assess the generalizability of the GRI method. To do so, we evaluated the same tasks with the Deep Deterministic Policy Gradient (DDPG) algorithm [ddpg] instead of SAC. For these experiments, we fixed the proportion of demonstration agents to 20%. Results are shown in Fig. 6.
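
Because GRI only changes how episodes enter the replay buffer, switching the backbone amounts to passing a different agent object to the training loop sketched after Algorithm 1; the agent objects below are hypothetical placeholders, with the demonstration proportion fixed to 20% as in these experiments:

# Hypothetical usage of the train_gri sketch with two different off-policy backbones.
train_gri(sac_agent,  replay_buffer, exploration_agent, demonstration_agent,
          p_demo=0.2, min_buffer=10_000, n_iterations=1_000_000)
train_gri(ddpg_agent, replay_buffer, exploration_agent, demonstration_agent,
          p_demo=0.2, min_buffer=10_000, n_iterations=1_000_000)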

Figure 6: Comparison of the evolution of the evaluation reward on Mujoco environments between GRI-DDPG with 20% demonstration agents and vanilla DDPG. GRI-DDPG systematically leads to a better reward than vanilla DDPG. However, contrarily to GRI-SAC, GRI-DDPG with 20% demonstration agents does not systematically reach the expert level.

We first observe that, like GRI-SAC with a proportion of 20% demonstration agents, GRI-DDPG systematically reaches better results than DDPG on all the tested environments. However, GRI-DDPG does not systematically reach the level of the expert. While final rewards are better with SAC and GRI-SAC, the dynamics of the reward evolution are about the same with both backbones, cf. Fig. 5. We can conclude that GRI is easily adaptable and generalizes to locomotion tasks, where it robustly outperforms both vanilla backbones.

4.3 Limitations

The main limitations of this method are consequences of our initial hypothesis that demonstration data can always be associated with a constant, maximal reward r_demo.

A first limitation occurs if the demonstration data are not consistently optimal, e.g. due to low expert performance on some aspects of a given task, as this introduces noise into the reward function. This is the case for our demonstration dataset on the CARLA simulator, as the expert data have been generated using an imperfect autopilot and therefore contain noisy demonstrations. Still, GRIAD improved our model by a significant margin over vanilla RL. Therefore, we consider the GRI setup to present some robustness to noisy demonstrations.

A second limitation of our approach is the warm-up phase on some difficult environments, as observed in Fig. 5 on HalfCheetah-v2 and Humanoid-v2. This warm-up phase can be seen as the consequence of a distribution shift. Indeed, GRI suffers from a sort of distribution shift when the expert training data mostly represent actions made in states not yet reached by the exploration agents. In particular, we observed this effect on HalfCheetah-v2: the expert agent does not walk but jumps as soon as it touches the ground, which is a complex yet highly efficient strategy. But to reach a state where it can successfully jump, it needs some warm-up to gain the required speed and momentum by doing some low-reward actions. Hence, our GRI-SAC agent learns to jump before it is able to walk, making it fall. Once the agent has learned how to reach the jumping state, the reward steadily increases until convergence. However, we observe that the lower the proportion of demonstration agents, the faster the model recovers from this distribution shift. Indeed, collecting more exploration data following the current agent policy compensates for the distribution shift between demonstration and exploration data.

Finally, a third limitation of our approach is the inconsistency of the rewards associated with some common actions collected by both the demonstration and exploration agents. Staying with the HalfCheetah-v2 example, the demonstration agent rewards expert actions at the beginning of the run with the high demonstration reward, while the exploration agent receives a poor reward for the exact same actions. This induces a discrepancy between data coming from the offline demonstration agent and experiences coming from the online RL exploration agent, and implies an overestimation of the value of demonstration actions. However, allocating a high reward to demonstration data that is not correlated with the actual reward of the environment might encourage the agent to reach states closer to the expert's. Nonetheless, it is difficult to assess the impact on training in practice.

5 Conclusion

We present GRI, a method that efficiently leverages both expert demonstrations and environment exploration. GRI is straightforward to implement over any off-policy deep reinforcement learning algorithm. GRI-based algorithms improve data efficiency compared to vanilla reinforcement learning algorithms and do not suffer from distribution shift as much as behavior cloning methods. The method also proved robust to noisy demonstrations in the expert dataset. We applied GRI to visual-based autonomous driving with the distributed GRIAD algorithm and outperformed the previous state of the art on the CARLA Leaderboard. Finally, we showed its generalizability using different DRL backbones on several Mujoco continuous control environments and highlighted its robustness.

References