
Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

10/25/2022
by   Yi Zhao, et al.

Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behaviour cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available: <https://github.com/zhaoyi11/adaptive_bc>.


I Introduction

Offline or batch reinforcement learning (RL) deals with the training of RL agents from fixed datasets generated by possibly unknown behavior policies, without any interaction with the environment. This is important in problems like robotics, autonomous driving, and healthcare, where data collection can be expensive or dangerous. Offline RL has been challenging for model-free RL methods due to extrapolation error, where the Q networks predict unrealistic values when evaluated on out-of-distribution state-action pairs [16]. Recent methods overcome this issue by constraining the policy to stay close to the behavior policy that generated the offline data distribution [16, 25, 23, 14], and can demonstrate even better performance than the behavior policy on several simulated and real-world tasks [39, 42, 33].

However, the performance of pre-trained policies is limited by the quality of the offline dataset, and it is often necessary or desirable to fine-tune them by interacting with the environment. Offline-to-online learning also reduces the risks of online interaction, as offline pre-training produces reasonable policies that can be tested before deployment. In practice, however, offline RL methods often fail during online fine-tuning. This offline-to-online RL setting is challenging due to: (i) the sudden distribution shift from offline data to online data, which can lead to severe bootstrapping errors that completely distort the pre-trained policy and cause a sudden performance drop at the very beginning of online fine-tuning, and (ii) the constraints enforced by offline RL methods that keep the policy close to the behavior policy. While these constraints help in dealing with the sudden distribution shift, they significantly slow down online fine-tuning from newly collected samples.

We propose to adaptively weigh offline RL constraints such as the behavior cloning loss during online fine-tuning. This prevents sudden performance collapses due to the distribution shift while also enabling sample-efficient learning from the newly collected samples. We perform this adaptive weighing according to the agent's performance and training stability. We start from TD3+BC, a simple offline RL algorithm recently proposed by [14] which combines TD3 [15] with a simple behavior cloning loss weighted by a hyperparameter. We adaptively adjust this hyperparameter using a control mechanism similar to a proportional–derivative (PD) controller: its value is decided based on two components, the difference between the moving average return and the target return (proportional term) and the difference between the current episodic return and the moving average return (derivative term).

We demonstrate that these simple modifications lead to stable online fine-tuning after offline pre-training on datasets of different quality. We also use a randomized ensemble of Q functions [7] to further improve the sample-efficiency. We attain state-of-the-art online fine-tuning performance on locomotion tasks from the popular D4RL benchmark.

II Related Work

Offline RL. Offline RL aims to learn a policy from pre-collected fixed datasets without interacting with the environment [27, 1, 16, 24, 32, 39, 29, 35]. Off-policy RL algorithms allow for reuse of off-policy data [22, 9, 18, 41, 30, 15, 31], but they typically fail when trained offline on a fixed dataset, even if it is collected by a policy trained using the same algorithm [16, 24]. In actor-critic methods, this is due to extrapolation error of the critic network on out-of-distribution state-action pairs [29]. Offline RL methods deal with this by constraining the policy to stay close to the behavioral policy that collected the offline dataset. BRAC [45] achieves this by minimizing the Kullback-Leibler divergence between the behavior policy and the learned policy. BEAR [24] minimizes the MMD distance between the two policies. TD3+BC [14] proposes a simple yet efficient offline RL algorithm by adding an additional behavior cloning loss to the actor update. Another class of offline RL methods learns conservative Q functions, which prevents the policy network from exploiting out-of-distribution actions and forces it to stay close to the behavior policy. CQL [25] changes the critic objective to also minimize the Q function on unseen actions. Fisher-BRC [23] achieves conservative Q learning by constraining the gradient of the Q function on unseen data. Model-based offline RL methods [48, 21] train policies on data generated by ensembles of dynamics models learned from offline data, while constraining the policy to stay within samples where the dynamics model is certain. In this paper, we focus on offline-to-online RL with the goal of stable and sample-efficient online fine-tuning of policies pre-trained on offline datasets of different quality.

Offline pre-training in RL. Pre-training has been widely investigated in the machine learning community, from computer vision [38, 11, 47] to natural language processing [10, 44]. Offline pre-training in RL could enable deployment of RL methods in domains where data collection is expensive or dangerous. [40, 17, 36] pre-train the policy network with imitation learning to speed up RL. QT-Opt [20] studies vision-based object manipulation using a large and diverse dataset collected by seven robots over several months and fine-tunes the policy with 27K samples of online data. However, these methods pre-train on diverse, large, or expert datasets, and it is also important to investigate the possibility of pre-training from offline datasets of different quality. [46, 2] use offline pre-training to accelerate downstream tasks. AWAC [33] and Balanced Replay [28] are recent works that also focus on offline-to-online RL from datasets of different quality. AWAC updates the policy network such that it is constrained during offline training while not being too conservative during fine-tuning. Balanced Replay trains an additional neural network to prioritize samples in order to effectively use new data as well as near-on-policy samples in the offline dataset. We compare with AWAC and Balanced Replay and attain state-of-the-art offline-to-online RL performance on the popular D4RL benchmark.

Ensembles in RL. Ensemble methods are widely used to improve performance in RL [12, 34, 8, 19]. In model-based RL, PETS [8] and MBPO [19] use probabilistic ensembles to effectively model the dynamics of the environment. In model-free RL, ensembles of Q functions have been shown to improve performance [3, 26]. REDQ [7] learns a randomized ensemble of Q functions to achieve sample efficiency similar to model-based methods without learning a dynamics model. We utilize REDQ in this work for improved sample-efficiency during online fine-tuning. Specific to offline RL, REM [1] uses random convex combinations of multiple Q-value estimates to calculate the Q targets for effective offline RL on Atari games. MOPO [48] uses probabilistic ensembles from PETS to learn policies from offline data using uncertainty estimates based on model disagreement. MBOP [4] uses ensembles of dynamics models, Q functions, and policy networks to obtain better performance on locomotion tasks. Balanced Replay [28] uses an ensemble of pessimistic Q functions to mitigate the instability caused by distribution shift in offline-to-online RL. While ensembling of Q functions has been studied in several prior works [26, 7], we combine it with a behavioral cloning loss for the purpose of robust and sample-efficient offline-to-online RL.

Adaptive balancing of multiple objectives in RL. [5] train policies using learned dynamics models with the objective of visiting states that most likely lead to subsequent improvement of the dynamics model, using active online learning. They adaptively weigh the maximization of cumulative rewards and the minimization of model uncertainty using an online learning mechanism based on the exponential weights algorithm. In this paper, we focus on offline-to-online RL with model-free methods and propose to adaptively weigh the maximization of cumulative rewards and a behavioral cloning loss. Exploring other online learning algorithms, such as the exponential weights algorithm, is a line of future work.

III Background

III-A Reinforcement Learning

Reinforcement learning (RL) deals with sequential decision making to maximize cumulative rewards. RL problems are often formalized as Markov decision processes (MDPs). An MDP consists of a set of states S, a set of actions A, a transition dynamics p(s_{t+1} | s_t, a_t) that represents the probability of transitioning to a state s_{t+1} by taking action a_t in state s_t at timestep t, a scalar reward function r(s_t, a_t), and a discount factor γ ∈ [0, 1).

A policy function π of an RL agent is a mapping from states to actions and defines the behavior of the agent. The value function of a policy is defined as the expected cumulative rewards from state s: V^π(s) = E[ Σ_t γ^t r(s_t, a_t) | s_0 = s ], where the expectation is taken over the state transitions and the policy function π. Similarly, the state-action value function is defined as the expected cumulative rewards after taking action a in state s: Q^π(s, a) = E[ Σ_t γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]. The goal of RL is to learn an optimal policy function π_φ with parameters φ that maximizes the expected cumulative rewards: max_φ E_{π_φ}[ Σ_t γ^t r(s_t, a_t) ].

We use the TD3 algorithm for reinforcement learning [15]. TD3 is an actor-critic method that alternates between training: (i) the critic network Q_θ to estimate the values of the policy network π_φ, and (ii) the policy network π_φ to produce actions that maximize the Q function: max_φ E_s[ Q_θ(s, π_φ(s)) ].

III-B Offline Pre-training

Offline reinforcement learning or batch reinforcement learning assumes that the agent is not able to interact with the environment but is instead given a fixed dataset D of tuples (s, a, r, s′) to learn from. The data is assumed to be collected by an unknown behavioural policy (or a collection of policies).

The problem with using actor-critic methods for offline RL is extrapolation error due to the evaluation of the critic network on the next state and next action values to compute the temporal difference error. Here the next action is sampled from the policy network and this could lead to out-of-distribution evaluations of the critic network. This is problematic as erroneous predictions of the critic on unfamiliar actions could be propagated to other critic predictions due to bootstrapping in temporal difference learning. This will also lead to the policy network preferring actions with unrealistic value predictions. This problem can be overcome either by constraining the policy network to stay close to the data distribution [14] or by enforcing conservative estimates of the critic network on out-of-distribution samples [25].

[14] propose TD3+BC, a simple offline RL algorithm that regularizes policy learning in TD3 with a behavior cloning loss that constrains the policy actions to stay close to the actions in the offline dataset D. This is achieved by adding a behavior cloning term to the policy loss:

π = argmax_π E_(s,a)∼D [ λ Q(s, π(s)) − α (π(s) − a)² ],   with λ = 1 / ( (1/N) Σ_(s_i, a_i) |Q(s_i, a_i)| ),    (1)

where α is a weighing hyperparameter and λ normalizes the Q values, which helps in balancing both losses. The sum in the denominator is taken over a mini-batch, and gradients do not flow through the critic term in the denominator.
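To make the structure of this objective concrete, the following PyTorch-style sketch computes the loss of Equation 1 for a mini-batch. The `actor`/`critic` modules and argument names are illustrative placeholders, not the authors' implementation.

```python
import torch

def td3_bc_actor_loss(actor, critic, states, actions, alpha):
    """Sketch of the Equation 1 policy loss: maximize the normalized critic
    value while penalizing deviation from dataset actions (behavior cloning).
    A larger `alpha` means a stronger behavior cloning constraint."""
    pi = actor(states)                          # actions proposed by the policy
    q = critic(states, pi)                      # critic evaluation of those actions
    # Normalizer lambda; detached so gradients do not flow through the denominator.
    lam = 1.0 / (q.abs().mean().detach() + 1e-8)
    bc = ((pi - actions) ** 2).mean()           # behavior cloning term
    # Minimize the negative of the Equation 1 objective.
    return -(lam * q.mean()) + alpha * bc
```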

IV Online Fine-tuning

RL agents trained from offline data tend to have limited performance and often need to be further fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline data to online data. Keeping the constraint used in offline pre-training, such as in Equation 1, can mitigate this collapse. However, it forces the policy to stay close to the behavior policy (used to collect the dataset), which leads to slow improvement. In this section, we describe the two components of our online fine-tuning algorithm that enable stable and sample-efficient online fine-tuning.

IV-A Adaptive Weighing of Behavior Cloning Loss

The most straightforward way to fine-tune the pre-trained policy is to simply remove the constraints used in offline pre-training. For example, Balanced Replay [28] uses CQL [25] during offline pre-training and SAC [18] during fine-tuning. However, this strategy often leads to a performance collapse at the beginning of fine-tuning, as shown in Fig. 1 (with α_online = 0) and by TD3-ft in Fig. 2. In the TD3+BC algorithm considered in this paper, a hyperparameter α is used to balance the RL objective and the behaviour cloning term which constrains the policy to stay close to the behavior policy (see Equation 1). We use α_offline and α_online to distinguish the hyperparameter values used during offline and online training, respectively. By default, we use α_offline = 0.4 in all our experiments. Keeping α_online = α_offline for TD3+BC prevents sudden performance drops at the initial steps of online fine-tuning, but at the cost of very slow learning due to the strong behavior cloning constraint. On the other hand, setting α_online = 0 leads to sample-efficient learning on some tasks at the cost of complete instability on other tasks, since the sudden distribution shift causes the policy network to change significantly.

In Fig. 1, we show the influence of α_online on TD3+BC during fine-tuning by trying several values of α_online. We can clearly see that using the behavior cloning loss with a proper α_online enables stable fine-tuning. However, the best value of α_online depends on the quality of the offline dataset and has a significant influence on the fine-tuning performance. For example, a small α_online works well on the Hopper-Random task but causes immediate collapse on the Hopper-Medium and Hopper-Medium-Expert tasks.

Fig. 1: Results of online fine-tuning on the D4RL benchmark using TD3+BC with different α_online hyperparameters. We plot the mean and standard deviation across 3 runs. Using the behavior cloning loss with a proper α_online enables stable fine-tuning, but the optimal value of α_online differs between datasets.

In our experiments, we found that when the offline dataset has a narrow distribution, or when the policy has already converged to the desired performance (comparable to the expert), it is usually beneficial to maintain a higher α_online. When the data distribution is broader, or when the agent still needs to improve by a large margin, a smaller α_online works better. We could not find a single α_online that is suitable for all tasks, and its value needs to be tuned carefully per task, which makes this approach hard to use in practice.

To solve this problem, we propose to adapt the weight of the behavior cloning loss according to two factors: (i) the difference between the moving average return and the target return, and (ii) the difference between the current episodic return and the moving average return. We adaptively change the hyperparameter as:

α_online ← clip( α_online + K_p (R̄ − R_target) + K_d (R̄ − R), 0, α_offline ),    (2)

where we constrain α_online between 0 and 0.4 (the value of α_offline used during offline pre-training). The moving average return R̄ and the current episodic return R are normalized following the return normalization procedure used in D4RL. R_target is the target episodic return, which we set to 1.05 (corresponding to the expert policy) for all tasks. K_p controls how fast we decrease α_online according to the current performance, and K_d determines how fast we increase α_online when the performance drops. Intuitively, when the agent's performance reaches the target episodic return, we try to maintain it during fine-tuning, but when the agent's performance is low, we decrease α_online to allow the agent to improve further. The second term increases α_online when performance drops during training to mitigate performance collapse. Equation 2 allows for adaptive weighing of the behavior cloning loss throughout online fine-tuning and automatically adjusts the constraint enforced by the behavior cloning loss.
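As a concrete illustration of this update rule, here is a minimal Python sketch of the adaptive weight adjustment described by Equation 2. The function and argument names are ours, and the gains `k_p` and `k_d` are placeholders that need to be tuned (see Section V-C).

```python
def adapt_alpha(alpha, episode_return, moving_avg_return,
                k_p, k_d, target_return=1.05, alpha_max=0.4):
    """Sketch of the adaptive BC weight update (Equation 2).
    Returns are assumed to be D4RL-normalized; alpha_max corresponds to
    the alpha used during offline pre-training."""
    proportional = k_p * (moving_avg_return - target_return)  # < 0 while under target: relax BC
    derivative = k_d * (moving_avg_return - episode_return)   # > 0 on a performance drop: tighten BC
    alpha = alpha + proportional + derivative
    return min(max(alpha, 0.0), alpha_max)                    # clip to [0, alpha_offline]
```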

After offline pre-training, the replay buffer is filled with offline samples, and during online fine-tuning they are slowly replaced by online samples. Uniformly sampling mini-batches from this replay buffer for online fine-tuning is inefficient, as the buffer is dominated by offline samples. To deal with this problem, after offline learning we simply remove a random 95% of the offline samples from the replay buffer. Our results show that this downsampling allows efficient usage of the newly collected data without destabilizing training.
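A minimal sketch of this downsampling step, assuming the offline transitions are stored in a plain Python list (the function name and arguments are illustrative):

```python
import random

def downsample_offline_buffer(offline_transitions, keep_fraction=0.05, seed=0):
    """Randomly keep `keep_fraction` of the offline transitions so that newly
    collected online samples are not drowned out during fine-tuning."""
    rng = random.Random(seed)
    n_keep = int(len(offline_transitions) * keep_fraction)
    return rng.sample(offline_transitions, n_keep)
```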

IV-B Randomized Ensembles of Critic Networks

We propose to use an ensemble of Q functions to better deal with the distribution shift from offline pre-training to online fine-tuning and to improve the sample-efficiency of online fine-tuning. We use the Randomized Ensembled Double Q-learning (REDQ) method proposed by [7] to learn an ensemble of critic networks.

The critic networks are trained to satisfy the Bellman equation Q(s, a) = r(s, a) + γ E[Q(s′, a′)]. REDQ maintains an ensemble of N critic networks and randomly samples M of them for each critic update. Given a mini-batch of transitions (s, a, r, s′), all critic networks in the ensemble are updated towards the same target:

y = r + γ min_{i ∈ M} Q_{θ′_i}(s′, ã),   ã = clip( π_{φ′}(s′) + ε, a_low, a_high ),    (3)

where M is a random subset of the critic networks and ã is the smoothed target action. Here ε is clipped Gaussian noise with standard deviation σ, and [a_low, a_high] is the action range.
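The following PyTorch-style sketch computes the shared target of Equation 3 for a mini-batch. The module and argument names are illustrative, and the noise and clipping constants are standard TD3-style defaults rather than values taken from the paper.

```python
import random
import torch

def redq_critic_target(target_critics, target_actor, rewards, next_states, dones,
                       gamma=0.99, num_sampled=2, noise_std=0.2, noise_clip=0.5,
                       act_low=-1.0, act_high=1.0):
    """Sketch of the shared TD target of Equation 3: a minimum over a random
    subset of the target-critic ensemble, with smoothing noise on the next action.
    `target_critics` is assumed to be a list of target critic networks."""
    with torch.no_grad():
        next_pi = target_actor(next_states)
        noise = (torch.randn_like(next_pi) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_pi + noise).clamp(act_low, act_high)
        subset = random.sample(target_critics, num_sampled)    # random M out of N critics
        q_values = torch.stack([q(next_states, next_actions) for q in subset])
        q_next = q_values.min(dim=0).values
        return rewards + gamma * (1.0 - dones) * q_next        # every critic regresses to this y
```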

REDQ updates the policy network to maximize the average prediction of the critic networks: max_φ E_s[ (1/N) Σ_{i=1..N} Q_{θ_i}(s, π_φ(s)) ].

We combine this REDQ policy update with a behaviour cloning loss (as in Equation 1) for robust learning [14]:

π = argmax_π E_(s,a)∼D [ λ (1/N) Σ_{i=1..N} Q_{θ_i}(s, π(s)) − α (π(s) − a)² ].    (4)
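A matching sketch of the ensemble policy loss in Equation 4, mirroring the earlier Equation 1 sketch but averaging over the critic ensemble; again, the names are placeholders and `alpha` is the adaptively set behavior cloning weight.

```python
import torch

def redq_adaptive_bc_actor_loss(actor, critics, states, actions, alpha):
    """Sketch of the Equation 4 policy loss: maximize the normalized average
    of the critic ensemble while penalizing deviation from dataset actions."""
    pi = actor(states)
    q_mean = torch.stack([q(states, pi) for q in critics]).mean(dim=0)
    lam = 1.0 / (q_mean.abs().mean().detach() + 1e-8)   # same normalizer as in Equation 1
    bc = ((pi - actions) ** 2).mean()
    return -(lam * q_mean.mean()) + alpha * bc
```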

We show that this simple modification of ensembling the critic networks (which can be run in parallel) improves offline-to-online learning. We call this algorithm REDQ+AdaptiveBC, as it combines the randomized ensembling of REDQ with TD3+AdaptiveBC. We outline our complete algorithm in Algorithm 1.

  Initialize REDQ agent with critic parameters θ_1, …, θ_N and policy parameters φ
  Initialize target parameters θ′_i ← θ_i for i = 1, …, N and φ′ ← φ
  Initialize replay buffer D with offline data
  for t = 1 to T_offline do
     Sample mini-batch of B transitions from D
     Update critic parameters θ_i using Equation 3
     Update actor parameters φ using Equation 4 with α = α_offline
     Update target networks θ′_i and φ′
  end for

  Randomly remove 95% of offline samples from D
  Initialize α_online
  Initialize R and R̄ to store the returns of the current and previous episodes
  Initialize environment for online fine-tuning
  for every training episode do
     for t = 1 to T do
        Act with exploration noise a = π_φ(s) + ε
        Observe next state s′ and reward r
        Add (s, a, r, s′) to D
        for g = 1 to G do
           Sample mini-batch of B transitions from D
           Update critic parameters θ_i using Equation 3
           Update actor parameters φ using Equation 4 with α = α_online
           Update target networks θ′_i and φ′
        end for
     end for
     Set R to the episodic return and update the moving average R̄
     Adapt α_online based on R and R̄ using Equation 2
  end for
Algorithm 1 Offline-to-online RL with adaptive behaviour cloning and ensembles of critic networks

V Experiments

V-A Online Fine-tuning on the D4RL Benchmark

The goal of our experiments is to evaluate the stability and sample-efficiency of the proposed algorithm during online fine-tuning after offline pre-training on datasets of different quality. We evaluate our algorithm on the D4RL benchmark [13]. D4RL includes three locomotion environments (halfcheetah, hopper, and walker) implemented in the MuJoCo simulator [43] and wrapped in the OpenAI Gym API [6]. D4RL provides five different offline datasets for each task: Random, Medium, Medium-Replay, Medium-Expert, and Expert. The Random datasets are collected by random policies, the Medium datasets are collected by an early-stopped soft actor-critic (SAC) [18] agent with medium-level performance, the Medium-Replay datasets consist of all samples in the replay buffer accumulated while training a medium-level agent, the Medium-Expert datasets are mixtures of expert demonstrations and sub-optimal demonstrations from a medium-level agent, and the Expert datasets consist of expert demonstrations. The "expert" in these datasets is a fully trained SAC agent. We ignore the Expert datasets in this paper, as offline RL algorithms already achieve expert-level performance on them and there is little to no benefit in online fine-tuning.

Fig. 2: Results of online fine-tuning on the D4RL benchmark. We plot the mean and standard deviation across 5 runs. Our REDQ+AdaptiveBC method attains performance competitive to the state-of-the-art. Our method is able to consistently improve the pre-trained agent during fine-tuning without suffering from dramatic performance collapse at the beginning of training.

In Figure 2, we compare our REDQ+AdaptiveBC algorithm with two state-of-the-art offline-to-online RL algorithms (AWAC and Balanced Replay) and two baseline methods (TD3-ft and REDQ):

  • Advantage Weighted Actor-Critic (AWAC) [33] is an actor-critic method for offline-to-online RL that implicitly constrains the policy network to stay close to the behavior policy. We produce the results for AWAC using code taken from https://github.com/ikostrikov/jaxrl.

  • Balanced Replay [28] is an offline-to-online RL method that prioritizes near-on-policy samples from the replay buffer. This method also uses an ensemble of Q functions to prevent overestimation of Q values in the initial stages of online fine-tuning. We reproduced the results for this method using our own implementation. For a fair comparison, we base our implementation on TD3+BC (instead of CQL originally used by [28]) while ensuring that we are able to reproduce the original results.

  • TD3-ft is the standard TD3 algorithm [15] that was pre-trained offline using TD3+BC [14].

  • REDQ (scratch) [7] is an RL method trained from scratch, without any access to the offline data. This baseline emphasizes the importance of offline pre-training and online fine-tuning. We base our REDQ implementation on TD3 (instead of SAC used by [7]) for compatibility with TD3+BC.

All methods (except AWAC) are implemented on top of TD3 and are run from the same codebase for a fair comparison. For simplicity, we do not perform any state normalization like in the original TD3+BC implementation [14].

During offline pre-training, all algorithms are pre-trained on the offline dataset for one million gradient steps. After pre-training, we fine-tune the agents for 250,000 time steps by interacting with the environment. We evaluate the agent every 5000 time steps and each evaluation consists of 10 episodes. We attain performance competitive to the state-of-the-art in this benchmark with our method stably improving the performance during online fine-tuning.

Our method consistently improves the pre-trained policy and outperforms or matches the other methods on all tasks, across different environments and datasets. More importantly, our method does not collapse dramatically on any of the three Medium-Expert tasks. We significantly outperform REDQ on all tasks, which demonstrates that we benefit considerably from offline pre-training. TD3-ft is able to improve during online fine-tuning, but it suffers from significant performance drops due to the sudden distribution shift, and its learning progress is slow because the replay buffer is dominated by offline samples.

Both Balanced Replay and our method (REDQ+AdaptiveBC) use an ensemble of 10 Q networks, but in different ways. Balanced Replay maintains two ensembles of five Q networks, averages the predictions within each ensemble of five, and then takes the minimum of the two averages as the final prediction. In our method, we simply use the average of all 10 networks as the prediction for the policy update, but randomly sample a pair of Q networks to compute the critic targets (Equation 3). We show that this simple modification enables stable and sample-efficient online fine-tuning without the need for any complex sampling scheme from the replay buffer.
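To make the contrast explicit, the two aggregation rules can be sketched as follows, assuming `q_preds` stacks the ten Q predictions along the first dimension. This is an illustrative sketch, not either method's reference implementation.

```python
import random
import torch

def balanced_replay_estimate(q_preds):
    """Pessimistic estimate as described for Balanced Replay: split the ten
    predictions into two groups of five, average within each group, and take
    the minimum of the two averages. `q_preds` has shape [10, batch]."""
    first, second = q_preds[:5], q_preds[5:]
    return torch.min(first.mean(dim=0), second.mean(dim=0))

def redq_target_estimate(q_preds, num_sampled=2):
    """Our target computation: minimum over a randomly sampled pair of the ten
    predictions (the policy update instead uses the mean of all ten)."""
    idx = random.sample(range(q_preds.shape[0]), num_sampled)
    return q_preds[idx].min(dim=0).values
```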

Similar to prior work [14], we use feed-forward networks with two hidden layers as actor and critic networks for all methods. We use a batch size of 256 for all methods, except for AWAC where we use a larger batch size of 1024 [33]. During offline learning, we use α_offline = 0.4 for all tasks, except Walker-Random where we use a larger α_offline since the dataset has a very narrow distribution and the Q function diverges with a small α_offline.

V-B Experiments on Dexterous Manipulation Tasks

To demonstrate that our proposed method can be used to solve more challenging tasks, we test it on four dexterous manipulation tasks [36] in the D4RL benchmark: Hammer, Pen, Relocate, and Door.

In this section, we evaluate our algorithm on the Expert datasets of these four tasks, each composed of one million expert transitions from a fine-tuned RL policy. We first tune TD3+BC for offline pre-training on these tasks: we increase α_offline from 0.4 to 8 and correspondingly increase the initial α_online, K_p, and K_d by the same factor of 20. The online fine-tuning performance of our method with and without the adaptive behavior cloning term is shown in Figure 3. We observe that the performance of the REDQ agent immediately collapses, while the proposed adaptive behavior cloning method successfully prevents this.

Fig. 3: Comparison of the online fine-tuning performance of the REDQ agent with and without the adaptive behavioral cloning term on four dexterous manipulation tasks. We plot the mean and standard deviation across 3 runs. Our method successfully avoids the performance collapse during fine-tuning.

V-C Algorithmic Investigations

Adaptive Weighing of α_online: In our experiments, the hyperparameters K_p and K_d are chosen by a grid search on the Hopper-Random and Hopper-Medium-Expert tasks and kept fixed for the other tasks. To evaluate whether the proposed method can correctly select a good α_online for stable online fine-tuning, in Figure 4 we compare the results obtained with the automatically tuned α_online against manually tuned results on the Walker2d domain. To manually tune α_online, we perform a grid search and pick the best value separately for each dataset. From Figure 4, we can see that our method outperforms or matches the manually selected results while saving substantial manual effort and computational resources. Furthermore, among all 12 tasks, our method is only slightly worse than the manually tuned results on the HalfCheetah-Random and Hopper-Medium-Replay tasks.

Fig. 4: Comparison of results with the automatically tuned α_online and with carefully hand-picked values. It shows that our proposed method can effectively find a suitable α_online for the tested tasks.

Offline Dataset Downsampling: Balanced Replay [28] trains a neural network to estimate the priority of samples from the offline and online data. Their method needs to maintain three replay buffers: the offline dataset (0.1-2M samples), the online dataset (0.25M samples), and a prioritized replay buffer (0.35M-2.25M samples) [37], making it memory consuming (0.7M-4.5M samples in total). Our method simply downsamples the offline dataset by 95%, so it maintains only one replay buffer that stores online data and is prefilled with 0.05M offline data points, which substantially reduces memory usage. Our results show that dataset downsampling combined with the adaptive behavioral cloning term is enough to avoid performance collapse and consistently improve the pre-trained policy.

We compare two different downsampling methods, random sampling and prioritized sampling, as well as different downsampling ratios, as shown in Figure 5. For prioritized sampling, we simply retain the trajectories with higher episodic returns. Our results show that a proper downsampling ratio is important for good performance: retaining all offline data hurts performance, even for the Medium-Expert dataset. Dataset downsampling is even more important when the quality of the offline data is poor, such as when the dataset is collected by a random policy, since it allows the agent to effectively sample the novel data encountered during fine-tuning. Unlike the downsampling ratio, the choice of sampling method has little influence in our experiments.

Fig. 5: Comparison of different sampling methods and different downsampling ratios. We plot the mean and standard deviation across 3 runs. Downsampling enables effective usage of novel data encountered during fine-tuning.

Usage of Ensembles: In our experiments, we use an ensemble to represent the critic. In Figure 6, we compare the online fine-tuning performance with and without ensembles. We also compare two different ways of using an ensemble: (i) taking the minimum across all the Q networks in the ensemble (Minimum), and (ii) taking the minimum across a random pair of Q networks in the ensemble (REDQ). Our results show that using ensembles is not necessary to avoid performance collapse; however, it stabilizes training in most cases. We observe that calculating the target Q values with randomly sampled Q predictions, as in REDQ, is crucial when using an ensemble.

Fig. 6: Comparison of usages of ensembles on the hopper domain. We plot the mean and standard deviation across 3 runs. Randomized ensembled double Q-Learning stabilizes the training, but it is not necessary to avoid performance collapse during fine-tuning.

Dependency of the Target Return: To change α_online, we assume prior knowledge of the target return, obtained from the expert SAC agent. This is a reasonable assumption in some applications; however, in many real-world applications the target return of a task is unknown, and assuming knowledge of the maximal per-step reward is more reasonable. Thus, we compare two different ways to set the target return: i) the episodic return obtained by the expert SAC agent; ii) the product of the maximal per-step reward and the maximal episode length. The second method relies only on mild prior knowledge of the task and usually yields a higher target value. As shown in Figure 7, the two methods have similar performance on the four Walker2d tasks, and in our earlier experiments we found that they perform similarly in most cases. The second method performs better on the Hopper-Medium-Replay task, whereas using the expert SAC agent's performance as the target return is more stable on the Hopper-Medium-Expert task. This is caused by the quicker decrease of α_online, since according to Equation 2 the change in α_online is proportional to the gap between the moving average return and the target return. Note that we did not tune K_p and K_d in this experiment, so more stable results on the Hopper-Medium-Expert task could potentially be obtained with tuned hyperparameters.

Fig. 7: Comparison of two different ways to set the target return: i) the episodic return obtained by the expert SAC agent; ii) the product of the maximal per-step reward and the maximal episode length. The two methods have similar performance on the four Walker2d tasks.

VI Conclusion

We consider the problem of offline-to-online RL, where an agent is first pre-trained on offline data (collected by a possibly unknown behavior policy) and then fine-tuned online by interacting with the environment. This is desirable as pre-trained agents may have limited performance depending on the quality of the offline dataset. Offline-to-online RL is challenging due to the sudden distribution shift from offline data to online data, and also due to the constraints enforced by offline RL algorithms (such as a behavior cloning loss) during pre-training. In this paper, we propose a simple mechanism to adaptively weigh a behavior cloning loss during online fine-tuning, based on agent performance and training stability. We demonstrate that a randomized ensemble of Q functions further helps to deal with these challenges and enables sample-efficient online fine-tuning. We achieve performance competitive with state-of-the-art online fine-tuning methods on locomotion tasks in the popular D4RL benchmark. Furthermore, our method successfully avoids performance collapse on challenging dexterous manipulation tasks.

References

  • [1] R. Agarwal, D. Schuurmans, and M. Norouzi (2020) An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. Cited by: §II, §II.
  • [2] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum (2020) Opal: offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611. Cited by: §II.
  • [3] O. Anschel, N. Baram, and N. Shimkin (2017) Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In International conference on machine learning, pp. 176–185. Cited by: §II.
  • [4] A. Argenson and G. Dulac-Arnold (2020) Model-based offline planning. arXiv preprint arXiv:2008.05556. Cited by: §II.
  • [5] P. Ball, J. Parker-Holder, A. Pacchiano, K. Choromanski, and S. Roberts (2020) Ready policy one: world building through active learning. In International Conference on Machine Learning, pp. 591–601. Cited by: §II.
  • [6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §V-A.
  • [7] X. Chen, C. Wang, Z. Zhou, and K. W. Ross (2021) Randomized ensembled double Q-learning: learning fast without a model. In International Conference on Learning Representations, Cited by: §I, §II, §IV-B, 4th item.
  • [8] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114. Cited by: §II.
  • [9] T. Degris, M. White, and R. S. Sutton (2012) Off-policy actor-critic. arXiv preprint arXiv:1205.4839. Cited by: §II.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §II.
  • [11] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647–655. Cited by: §II.
  • [12] S. Faußer and F. Schwenker (2015) Neural network ensembles in reinforcement learning. Neural Processing Letters 41 (1), pp. 55–69. Cited by: §II.
  • [13] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D4RL: datasets for deep data-driven reinforcement learning. External Links: 2004.07219 Cited by: §V-A.
  • [14] S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860. Cited by: §I, §I, §II, §III-B, §III-B, §IV-B, 3rd item, §V-A, §V-A.
  • [15] S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. Cited by: §I, §II, §III-A, 3rd item.
  • [16] S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §I, §II.
  • [17] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman (2019) Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956. Cited by: §II.
  • [18] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §II, §IV-A, §V-A.
  • [19] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §II.
  • [20] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §II.
  • [21] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) Morel: model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951. Cited by: §II.
  • [22] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §II.
  • [23] I. Kostrikov, R. Fergus, J. Tompson, and O. Nachum (2021) Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. Cited by: §I, §II.
  • [24] A. Kumar, J. Fu, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949. Cited by: §II.
  • [25] A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779. Cited by: §I, §II, §III-B, §IV-A.
  • [26] Q. Lan, Y. Pan, A. Fyshe, and M. White (2020) Maxmin Q-learning: controlling the estimation bias of Q-learning. In International Conference on Learning Representations, Cited by: §II.
  • [27] S. Lange, T. Gabel, and M. Riedmiller (2012) Batch reinforcement learning. In Reinforcement learning, pp. 45–73. Cited by: §II.
  • [28] S. Lee, Y. Seo, K. Lee, P. Abbeel, and J. Shin (2021) Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. arXiv preprint arXiv:2107.00591. Cited by: §II, §II, §IV-A, 2nd item, §V-C.
  • [29] S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §II.
  • [30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §II.
  • [31] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §II.
  • [32] O. Nachum, Y. Chow, B. Dai, and L. Li (2019) Dualdice: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733. Cited by: §II.
  • [33] A. Nair, M. Dalal, A. Gupta, and S. Levine (2020) Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359. Cited by: §I, §II, 1st item, §V-A.
  • [34] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped dqn. Advances in neural information processing systems 29, pp. 4026–4034. Cited by: §II.
  • [35] X. B. Peng, A. Kumar, G. Zhang, and S. Levine (2019) Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177. Cited by: §II.
  • [36] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §II, §V-B.
  • [37] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §V-C.
  • [38] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §II.
  • [39] N. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, N. Heess, and M. Riedmiller (2020) Keep doing what worked: behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, Cited by: §I, §II.
  • [40] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §II.
  • [41] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. Cited by: §II.
  • [42] A. Singh, A. Yu, J. Yang, J. Zhang, A. Kumar, and S. Levine (2020) Cog: connecting new skills to past experience with offline reinforcement learning. arXiv preprint arXiv:2010.14500. Cited by: §I.
  • [43] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §V-A.
  • [44] J. Turian, L. Ratinov, and Y. Bengio (2010) Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pp. 384–394. Cited by: §II.
  • [45] Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361. Cited by: §II.
  • [46] M. Yang and O. Nachum (2021) Representation matters: offline pretraining for sequential decision making. arXiv preprint arXiv:2102.05815. Cited by: §II.
  • [47] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. arXiv preprint arXiv:1411.1792. Cited by: §II.
  • [48] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma (2020) Mopo: model-based offline policy optimization. arXiv preprint arXiv:2005.13239. Cited by: §II, §II.