Offline Distillation for Robot Lifelong Learning with Imbalanced Experience

by   Wenxuan Zhou, et al.

Robots will experience non-stationary environment dynamics throughout their lifetime: the robot dynamics can change due to wear and tear, or the surroundings may change over time. Eventually, the robot should perform well in all of the environment variations it has encountered. At the same time, it should still be able to learn fast in a new environment. We investigate two challenges in such a lifelong learning setting. First, existing off-policy algorithms struggle with the trade-off between being conservative to maintain good performance in the old environment and learning efficiently in the new environment. We propose the Offline Distillation Pipeline to break this trade-off by separating the training procedure into interleaved phases of online interaction and offline distillation. Second, training with the combined datasets from multiple environments across the lifetime might create a significant performance drop compared to training on the datasets individually. Our hypothesis is that both the imbalanced quality and the imbalanced size of the datasets exacerbate the extrapolation error of the Q-function during offline training over the "weaker" dataset. We propose a simple fix to the issue by keeping the policy closer to the dataset during the distillation phase. In the experiments, we demonstrate these challenges and the proposed solutions with a simulated bipedal robot walking task across various environment changes. We show that the Offline Distillation Pipeline achieves better performance across all the encountered environments without affecting data collection. We also provide a comprehensive empirical study to support our hypothesis on the data imbalance issue.




1 Introduction

Lifelong learning, also commonly known as continual learning, studies the problem of learning from a stream of tasks sequentially with incremental, non-stationary data (Thrun, 1998; Hadsell et al., 2020). Lifelong learning has been an important topic in artificial intelligence and it naturally reflects the challenges faced by animals and humans (Hassabis et al., 2017). In this work, we study the problem of lifelong robot reinforcement learning in the face of changing environment dynamics. Non-stationary environment dynamics present a practical and important challenge for training reinforcement learning policies on robots in the real world. Especially on low-cost or low-tolerance robots, the robot dynamics can change due to wear and tear both during training and deployment. Also, in most natural settings, the robot’s environment will change over time, for instance when the robot encounters new terrains or objects. Ideally, at any point in time the robot should be able to draw on the entirety of its past experience, so that it is able to recover or repurpose the skills that have been useful in the past when it encounters the same or a similar environment again during deployment or later in training. One important property of sequential environment variations in the real world is that the task boundaries may be unknown or not well defined. For example, deformation of the robot can happen gradually. This limits the applicability of many existing approaches in lifelong learning which rely on well-defined task boundaries (Rusu et al., 2016; Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017). We aim to investigate a practical solution for lifelong robot learning across environment variations without the need for task boundaries.

One important challenge in lifelong learning is the trade-off between remembering the old task (backward transfer) and learning the new task efficiently (forward transfer). The most widely studied aspect of this trade-off is the catastrophic forgetting issue of neural networks (French, 1999). We follow the memory-based method to avoid forgetting by simply saving all the incoming data in the replay buffer and training the policy with an off-policy algorithm, which does not require task boundaries (Rolnick et al., 2019). However, we find that even if we save all the data across environment variations, “forgetting” still happens. In this case, the additional challenge of forgetting is due to the extrapolation error of the Q-function, which is widely discussed in the offline RL literature (Fujimoto et al., 2019): when the agent does not have access to the previous environments, it becomes “offline” over these environments. Thus, the agent cannot correct for the overestimation error of the Q-function by collecting more data in these environments. Conversely, if we use “conservative” (or “pessimistic”) algorithms that force the policy to stay close to the existing replay buffer to maintain the performance in the old environments, it affects data collection and creates difficulties in learning in the new environment (Jeong et al., 2020). Different from the stability-plasticity dilemma of neural networks often discussed in the lifelong learning literature, this trade-off between forward and backward transfer is specific to RL due to off-policy data. We propose the Offline Distillation Pipeline to disentangle this trade-off into two stages. To learn the task in the latest environment efficiently, we can use any RL algorithm suitable for online data collection without worrying about forgetting. To obtain a policy that effectively accumulates previous experience across environment variations, we can distill the entire dataset into a policy by treating it as an offline RL problem. The distillation step can happen periodically during training or right before deployment. An illustration can be found in Figure 1.

In addition, we investigate a practical consideration of lifelong learning where the stream of experience is imbalanced across environment variations. For example, the agent might be trained on one environment much longer than on another. The ideal lifelong learning algorithm should be robust to such imbalanced experience. In the Offline Distillation Pipeline, we find that training a policy with the imbalanced datasets from multiple environments can sometimes lead to much worse performance than training on each dataset individually. Through the experiments, we provide evidence for the following hypothesis: both the imbalanced quality and the imbalanced size of the datasets become extra sources of extrapolation error in offline learning. The imbalanced quality makes the Q-function biased towards larger values. The imbalanced size leads to more fitting error of the policy network on the smaller dataset, which exacerbates the bootstrapping error caused by out-of-distribution actions. Furthermore, we find that keeping the policy closer to the dataset can be a simple yet effective solution to this issue without requiring task boundaries.

In summary, we study two challenges of lifelong robot learning over environment variations. First, we identify the trade-off between learning in the new environments and remembering the old environments in existing off-policy RL algorithms. We propose the Offline Distillation Pipeline to break this trade-off without the need for task boundaries. Second, we study how the dataset imbalance issue can affect the performance in offline learning and provide a thorough empirical analysis. We evaluate our method on a bipedal robot walking task in simulation with different environment changes. The proposed pipeline is shown to achieve similar or better performance than the baselines across the sequentially changing environments even with imbalanced experience.

Figure 1: Illustration of the Offline Distillation Pipeline with two types of experiment setup as examples: (a) The agent is trained over two environments sequentially and runs offline distillation at the end of training before deployment. (b) The pipeline can also be potentially applied in a more complex lifelong learning setup with parallel training on the robots. “RB” means replay buffer in the figure.

2 Related Work

Lifelong Learning: Lifelong learning has been widely studied in the machine learning literature (Thrun, 1998; Hadsell et al., 2020; Khetarpal et al., 2020). When given a stream of non-stationary data or non-stationary tasks, the agent should maintain the performance of previous tasks (backward transfer) while learning the new task efficiently (forward transfer). One direction of lifelong learning literature focuses more on the issue of backward transfer caused by the catastrophic forgetting of neural networks (French, 1999). Existing methods in this direction can be expansion-based (Rusu et al., 2016; Schwarz et al., 2018), regularization-based (Kirkpatrick et al., 2017), gradient-based (Lopez-Paz and Ranzato, 2017) or memory-based (Rolnick et al., 2019). There has also been a line of work in task-agnostic continual learning (Aljundi et al., 2019a, b; Zeno et al., 2018), where the task boundaries are unknown or not well-defined. We follow the task-agnostic memory-based method from Rolnick et al. (2019) by saving all the transitions in the replay buffer. In this paper, we show that there are additional challenges in lifelong reinforcement learning besides the catastrophic forgetting issue of the neural networks. Another direction in lifelong learning focuses on maximizing forward transfer without worrying about forgetting, where the performance is only measured on the new task. For example, recent work by Xie and Finn (2021) studies the problem of learning a sequence of tasks and proposes to selectively use past experience to accelerate forward transfer.

Offline RL: The additional forgetting issue in off-policy reinforcement learning discussed in this paper is related to offline RL. Thus, the proposed Offline Distillation Pipeline is based on this line of work. Offline RL investigates the problem of learning a policy from a static dataset without additional data collection (Ernst et al., 2005; Lange et al., 2012; Levine et al., 2020). Such a problem setting challenges existing off-policy algorithms due to the mismatch between the state-conditioned action distribution induced by the policy and the dataset distribution (Fujimoto et al., 2019). Previous work has proposed to fix this issue by constraining the policy to be close to the dataset explicitly (Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020; Wang et al., 2020; Jeong et al., 2020) or implicitly (Fujimoto et al., 2019; Zhou et al., 2020), learning a conservative Q-function (Kumar et al., 2020), or modifying the reward based on model uncertainty (Kidambi et al., 2020; Yu et al., 2020). We use Critic Regularized Regression (CRR) from Wang et al. (2020) to perform offline distillation. In terms of related work on imbalanced datasets in offline RL, Zhang et al. (2021) investigates imbalanced offline datasets collected by a variety of policies, in contrast to having a mixed dataset from multiple environments in our case. While most of the work in offline RL focuses on one task, Yu et al. (2021) studies multi-task offline RL with the goal of improving single-task performance by selectively sharing the data across tasks. In contrast, we aim at learning a universal policy for all the tasks which does not rely on task boundaries during training. The difference in the objectives is mainly due to the difference in the domains of interest: the “tasks” are defined to have different reward functions in Yu et al. (2021), while they are defined to have different dynamics in our case.

Distillation: The proposed pipeline is also related to knowledge distillation (Hinton et al., 2015). In reinforcement learning, policy distillation has been used to compress the network size (Rusu et al., 2015; Schwarz et al., 2018), improve multitask learning (Teh et al., 2017), or improve generalization (Igl et al., 2020). In contrast to policy distillation methods which distill the knowledge from networks to networks, we directly distill the data into a policy. This eliminates the need of task boundaries and additional data collection in previous methods. Nonetheless, the proposed pipeline with offline distillation may still share similar benefits of modifying network size or improving generalization because it trains a new policy from scratch (Igl et al., 2020).

3 Preliminaries

3.1 Problem Definition: Lifelong reinforcement learning with environment variations

We define the lifelong learning problem across environment variations to be a time-varying Markov Decision Process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, P_t, R, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, non-stationary dynamics function $P_t(s' \mid s, a)$ that may change over time $t$, reward function $R(s, a)$, and discount factor $\gamma$. In contrast to our work, most reinforcement learning literature considers static dynamics, which is a special case when $P_t = P$ for all $t$. In reinforcement learning, the objective is to optimize the policy to maximize the return given by $\sum_t \gamma^t R(s_t, a_t)$. We also define a policy $\pi(a \mid s)$ and its corresponding Q-function $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$, where the expectation is taken over the trajectories that start from an initial state $s$, an initial action $a$, and follow $P_t$ and $\pi$ for the following timesteps.

In this work, we assume that the agent experiences $P_t$ for a fixed amount of time $T$ during training, and will be evaluated and deployed at time $T$. We formulate our problem as maximizing the return over the support of $P_t$ when the policy is evaluated at time $T$. This involves efficient data collection across $t \in [0, T]$ and being able to recall the skills at time $T$. For example, in a simplified lifelong learning setting shown in Figure 1(a), the agent experiences $P_A$ for $T_A$ and then experiences $P_B$ for $T_B$. Note that $T_A$ and $T_B$ might not be equal. In this case, the objective can be defined as maximizing the sum of the returns of $\pi$ under $P_A$ and under $P_B$ during evaluation at the end of training. Note that this objective is independent of $T_A$ and $T_B$. Although we have two distinct stages with two environments in this example, in general the task boundaries may not always be accessible or well-defined since $P_t$ can change continuously. Without task boundaries, we cannot directly optimize the policy over the support of $P_t$ instead of the density of $P_t$. However, we aim to treat the importance of different environments equally during evaluation.
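The difference between weighting environments by their support and by their density can be made concrete with a small numerical sketch. The return values below are hypothetical placeholders, the durations are arbitrary relative units, and the helper names `support_objective` and `density_objective` are introduced only for illustration. The mean is used instead of the sum, which differs only by a constant factor.

```python
from typing import Dict


def support_objective(returns: Dict[str, float]) -> float:
    """Average return with equal weight for every environment encountered."""
    return sum(returns.values()) / len(returns)


def density_objective(returns: Dict[str, float],
                      durations: Dict[str, float]) -> float:
    """Average return weighted by how long each environment was experienced."""
    total_time = sum(durations.values())
    return sum(returns[name] * durations[name] / total_time for name in returns)


# Hypothetical evaluation returns of one policy in two environments.
returns = {"Env-A": 100.0, "Env-B": 400.0}
# Hypothetical training durations (arbitrary units), deliberately unequal.
durations = {"Env-A": 0.2, "Env-B": 1.0}

equal_weighted = support_objective(returns)
time_weighted = density_objective(returns, durations)
```

Changing `durations` changes `time_weighted` but leaves `equal_weighted` untouched, which is the sense in which the evaluation objective does not depend on how long each environment was experienced.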

3.2 Off-Policy Reinforcement Learning Algorithms

The proposed Offline Distillation Pipeline is built on top of two RL algorithms: Maximum a Posteriori Policy Optimisation (MPO) (Abdolmaleki et al., 2018a, b) and Critic Regularized Regression (CRR) (Wang et al., 2020). The pipeline uses MPO to update the policy during data collection, and uses CRR for offline distillation. Although MPO and CRR are both off-policy RL algorithms and share a lot of similarities, they are designed for different problem settings. MPO works well in the online setting, i.e. when data collection is allowed. CRR is designed for offline reinforcement learning, i.e. to learn from a fixed dataset without additional data collection. To stabilize learning in the offline setting, CRR attempts to avoid selecting actions outside of the dataset, which also renders it more conservative. More discussion of the connections between CRR and MPO can be found in Jeong et al. (2020); Abdolmaleki et al. (2021). Both algorithms alternate between policy evaluation and policy improvement. Both algorithms perform policy evaluation to estimate the Q-function $Q^{\pi}$ using the Bellman operator $\mathcal{T}^{\pi}$:

$$(\mathcal{T}^{\pi} Q)(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P_t(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')}\left[Q(s', a')\right] \tag{1}$$

but they differ in the policy improvement step.
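Policy evaluation of this kind is commonly implemented by regressing $Q(s, a)$ onto a sampled one-step target. The following is a minimal sketch under stated assumptions, not the authors' implementation: `policy_sample` and `q_fn` are hypothetical callables standing in for the policy and the (target) Q-network, and the expectation over next actions is approximated by a plain average over samples.

```python
from typing import Any, Callable, Sequence


def td_target(reward: float,
              next_state: Any,
              policy_sample: Callable[[Any], Sequence[Any]],
              q_fn: Callable[[Any, Any], float],
              gamma: float = 0.99) -> float:
    """Sample-based estimate of r + gamma * E_{a' ~ pi}[Q(s', a')]."""
    next_actions = policy_sample(next_state)  # actions drawn from pi(. | s')
    next_value = sum(q_fn(next_state, a) for a in next_actions) / len(next_actions)
    return reward + gamma * next_value


def squared_td_error(q_sa: float, target: float) -> float:
    """Regression loss that pulls Q(s, a) towards the one-step target."""
    return (q_sa - target) ** 2
```

Because the target itself depends on Q-values at actions chosen by the policy, any error at actions that are poorly covered by the data feeds back into later updates.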

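Maximum a Posteriori Policy Optimisation (MPO): To improve the current policy $\pi_k$ given the corresponding Q-function $Q^{\pi_k}$ and a state distribution $\mu(s)$, MPO performs two steps. In the first step, for each state $s$, an improved policy $q(a \mid s) \propto \pi_k(a \mid s)\, f(Q^{\pi_k}(s, a))$ is obtained, where $f$ is a transformation function that gives higher probabilities to actions with higher Q-values. In the second step, a new parametric policy $\pi_{k+1}$ is obtained by distilling the improved policies $q$ into a new parametric policy using the supervised learning loss:

$$\pi_{k+1} = \arg\min_{\pi} \mathbb{E}_{s \sim \mu}\left[\mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right)\right] \tag{2}$$

In practice, we represent $q$ as a non-parametric policy consisting of samples from $\pi_k$ for each state, re-weighting each sample by $f(Q^{\pi_k}(s, a))$. If the exponential function is chosen as the transformation function $f$, the improved policy can be written as $q(a \mid s) \propto \pi_k(a \mid s) \exp(Q^{\pi_k}(s, a)/\tau)$, where $\tau$ is the temperature term. This is the solution to the following KL-regularized RL objective that keeps the improved policies close to the current policy while maximising the expected Q-values:

$$q = \arg\max_{q} \mathbb{E}_{s \sim \mu}\left[\mathbb{E}_{a \sim q(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right] - \tau \, \mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\right)\right] \tag{3}$$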
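The sample-based form of the MPO improvement step can be sketched as follows. This is an illustrative sketch rather than the reference implementation: it assumes the exponential transformation, a fixed temperature (MPO implementations typically adapt it), and the function names are chosen here for illustration. Since the actions are drawn from the current policy, the improved policy at a state is represented by normalized weights proportional to $\exp(Q/\tau)$ on those samples, and the distillation step reduces to a weighted log-likelihood loss.

```python
import math
from typing import List, Sequence


def mpo_sample_weights(q_values: Sequence[float], temperature: float) -> List[float]:
    """Normalized weights softmax(Q / tau) over actions sampled from pi_k at one state."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [q / temperature for q in q_values]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


def projection_loss(weights: Sequence[float], log_probs: Sequence[float]) -> float:
    """Weighted negative log-likelihood; minimizing it over pi matches pi to q."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))
```

Here `log_probs` stands for $\log \pi(a_i \mid s)$ of the new parametric policy evaluated at the sampled actions.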


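Critic Regularized Regression (CRR): CRR follows a similar procedure as MPO for the policy improvement step. The major difference is the way of constructing the improved policy $q$. Since CRR is designed for offline RL, the optimization objective is to improve the policy according to the Q-function while keeping the policy close to the dataset distribution. In CRR, we construct the improved policies based on the joint distribution of $(s, a)$ by sampling state-action pairs from the dataset $\mathcal{D}$. Thus, the improved policy is defined as a joint distribution $q(s, a) \propto \mathcal{D}(s, a)\, f(Q^{\pi_k}(s, a))$ instead of a conditional distribution as in MPO. Similarly, Equation 2 can be modified to obtain a new parametric policy $\pi_{k+1}$:

$$\pi_{k+1} = \arg\max_{\pi} \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[f(Q^{\pi_k}(s, a)) \log \pi(a \mid s)\right]$$

When the transformation function is the exponential function, the improved policy can be written as $q(a \mid s) \propto \pi_{\mathcal{D}}(a \mid s) \exp(Q^{\pi_k}(s, a)/\tau)$, where $\pi_{\mathcal{D}}$ denotes the behavior policy of the dataset and $\tau$ is the temperature. This is a solution to the following objective similar to Equation 3:

$$q = \arg\max_{q} \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a \sim q(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right] - \tau \, \mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi_{\mathcal{D}}(\cdot \mid s)\right)\right] \tag{4}$$

In Equation 4, when the temperature $\tau$ is higher, the constraint on staying close to the dataset becomes stronger, which makes the policy more “conservative”. A common practice is to replace the Q-value by the advantage in the transformation function: $f(A^{\pi_k}(s, a))$. Besides the exponential function, another popular choice of the transformation is the indicator function: $f = \mathbb{1}[A^{\pi_k}(s, a) > 0]$, where $A^{\pi_k}$ is the advantage function. The indicator function corresponds to an exponential transformation clipped to $[0, 1]$ with $\tau \to 0$, which is less “conservative”.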
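The two transformation functions can be compared in a short sketch. It is illustrative only: the function names are hypothetical, the loss is a plain average over a batch of dataset actions, and the clipping bound is a free parameter here (setting it to 1 reproduces the clipped case mentioned above).

```python
import math
from typing import Callable, Sequence


def exp_weight(advantage: float, temperature: float, clip: float = 1.0) -> float:
    """min(exp(A / tau), clip), evaluated in log space to avoid overflow for small tau."""
    return math.exp(min(advantage / temperature, math.log(clip)))


def indicator_weight(advantage: float) -> float:
    """1 for dataset actions with positive advantage, 0 otherwise."""
    return 1.0 if advantage > 0 else 0.0


def crr_policy_loss(advantages: Sequence[float],
                    log_probs: Sequence[float],
                    weight_fn: Callable[[float], float]) -> float:
    """Weighted negative log-likelihood of dataset actions under the policy."""
    weights = [weight_fn(a) for a in advantages]
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)
```

With `clip=1.0` and a very small temperature, `exp_weight` returns 1 for positive advantages and a value close to 0 for negative ones, which matches the indicator function away from zero advantage.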

4 Forward and Backward trade-off in Lifelong Reinforcement Learning

Figure 2: Forgetting in MPO: The figure shows a performance drop in the original environment after a switch of environment at 200k steps.

To build a pipeline for lifelong learning, we need to first deal with the catastrophic forgetting issue of neural networks, as widely discussed in the literature (Hadsell et al., 2020). We follow the memory-based approach from Rolnick et al. (2019) by saving all the transitions across the agent’s life-cycle and running off-policy algorithms such as MPO (Abdolmaleki et al., 2018b). In off-policy algorithms, the policy is used for exploration in the latest environment while being trained on the entire history of data. However, we still observe that “forgetting” happens in the old environment following this setup. Figure 2 shows an example in which the policy experiences a change in the environment dynamics at 200k steps while being evaluated in the first environment across the full training process. More details of the experiment can be found in Section 7.1. Once the policy starts training in a new environment, the performance of MPO drops significantly even if the data from the old environment is kept in the replay buffer. This shows the extra challenge in lifelong reinforcement learning besides the catastrophic forgetting issue of the neural network.

The reason behind this drop is related to the issues of applying off-policy algorithms to Offline Reinforcement Learning (Fujimoto et al., 2019). The objective of offline RL is to learn a policy from a fixed dataset without further exploration. Due to the extrapolation error in the Q-function, the policy might select overestimated actions beyond the dataset. This error will be accumulated by bootstrapping during Q-function updates which results in significant overestimation bias of the Q-function. When the agent does not have access to collect more data to correct the overestimation bias, the performance of the policy will drastically degrade. Thus, off-policy algorithms designed with the assumption of active data collection often break under this problem setting. Similarly, in the lifelong learning scenario discussed above, when the agent switches from one environment to another, it is essentially training over the static dataset of the old environment. When the agent cannot collect more data in the old environment, it cannot correct the extrapolation error on those state-action pairs.
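One ingredient of this overestimation can be illustrated with a toy simulation. It is a sketch under strong assumptions, not an experiment from the paper: every action has a true value of zero, the Q-estimates carry independent zero-mean Gaussian noise, and the policy greedily picks the action with the highest estimate. The function name and default parameters are hypothetical.

```python
import random


def greedy_overestimation(num_actions: int = 10,
                          noise_std: float = 1.0,
                          trials: int = 2000,
                          seed: int = 0) -> float:
    """Average estimated value of the greedily chosen action when all true values are 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Noisy Q-estimates for actions whose true value is exactly zero.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


bias_many_actions = greedy_overestimation(num_actions=10)
bias_single_action = greedy_overestimation(num_actions=1)
```

With a single action the average stays near zero, while selecting the best of several noisy estimates yields a clearly positive value even though no action is actually better. The sketch covers only this selection effect; the accumulation through bootstrapping described above is not simulated.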

Figure 3: CRR is not as efficient as MPO when training from scratch.

Prior work in offline RL proposes to fix the overestimation issue of off-policy algorithms by restricting the policy to be closer to the conditional distribution of the dataset, as in Critic Regularized Regression (CRR) described in Section 3.2. If we apply a similar “conservative” objective in the lifelong learning setting, we find that it is able to reduce the forgetting issue. However, the conservatism instead hurts forward transfer. Figure 3 shows an example of running CRR from scratch, which can be viewed as the beginning stage of a lifelong learning experiment. Although CRR has been shown to have strong performance on offline RL benchmarks, it does not perform well when exploration is needed because of the constraint on the policy. In the experiment, we will further show that tuning the constraint leads to either forgetting or ineffective forward transfer.

The above two examples demonstrate the trade-off between preserving performance in the old environment (backward transfer) and exploring in the new environment effectively (forward transfer). Note that this is different from the “stability-plasticity” dilemma in previous lifelong learning literature in two ways. First, in terms of backward transfer, the issue of forgetting arises from the extrapolation error of the Q-function, which is specific to off-policy reinforcement learning. Second, in terms of forward transfer, previous work mainly considers the trade-off between past and recent experience from the streaming data. The issue we discuss above is a trade-off between past and future experience, which is specific to reinforcement learning, where the performance highly depends on effective data collection.

5 Offline Distillation Pipeline

To address this trade-off, we propose the Offline Distillation Pipeline shown in Figure 1. During data collection across environment variations, we can use any RL algorithm that maximizes forward transfer without considering forgetting. At the end of training, we “distill” the experience into a single policy by treating the entire dataset as an offline RL dataset. In this paper, we use MPO to train the policy for data collection, and use CRR during offline distillation. In this way, the forgetting issue of the off-policy data is handled by the distillation step without affecting exploration. Moreover, we may occasionally perform the distillation step during training. After a distilled policy is trained, we can bootstrap future experiments from this policy and its corresponding Q-function. Previous efforts can thus be accumulated if the agent encounters an environment that is similar to one it has encountered.
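The control flow of the pipeline can be sketched as below. This is a schematic outline under stated assumptions rather than the authors' code: `online_step` and `distill` are hypothetical callables standing in for the online learner (MPO in this paper) and the offline learner (CRR in this paper), and `env_stream` simply yields whichever environment is current at each step, so no task boundaries are passed to either learner.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def run_pipeline(env_stream: Iterable[Any],
                 online_step: Callable[[Any, List[Any]], Any],
                 distill: Callable[[List[Any]], Any],
                 distill_every: Optional[int] = None) -> Tuple[Any, List[Any]]:
    """Interleave online data collection with offline distillation of the whole buffer."""
    replay_buffer: List[Any] = []
    distilled_policy = None
    last_distilled_at = 0
    for step, env in enumerate(env_stream, start=1):
        # Online phase: collect one transition and update the exploration policy.
        replay_buffer.append(online_step(env, replay_buffer))
        # Optional periodic distillation, independent of any environment switch.
        if distill_every and step % distill_every == 0:
            distilled_policy = distill(replay_buffer)
            last_distilled_at = step
    # Final distillation before deployment, unless one just happened.
    if last_distilled_at != len(replay_buffer):
        distilled_policy = distill(replay_buffer)
    return distilled_policy, replay_buffer
```

The replay buffer is treated as one dataset throughout, and the distillation schedule is a fixed interval rather than something tied to environment changes.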

There are several benefits of this pipeline that are especially important for lifelong learning of real robots. First, the proposed pipeline does not require task boundaries. The wear and tear of the robot might happen over time and sometimes the change of the environment might not be immediately noticeable. This is different from a common multi-task learning setting where the task switches are well defined (such as learning to stand up and then learning to walk). In our method, the distillation step across training does not have to happen at the boundaries and the training procedure treats the replay buffer as a single dataset. Second, our method is flexible on the choice of data collection methods. During the development of a robot platform, we might have multiple robots that behave differently due to manufacturing tolerances. The training of the robots can happen in parallel or sequentially, and potentially with different choices of algorithms. The Offline Distillation Pipeline can reuse all of this previous experience within the “lifetime” of the platform (Figure 1(b)).

6 Imbalanced Experience in Offline Distillation

One practical issue we encounter in the offline distillation phase is that when the policy is trained over the combined dataset from multiple environments, the imbalance of the datasets might create an unexpected performance drop. For example, following Figure 1(a), the agent is first trained in Env-A and then switches to Env-B. During the offline distillation phase, we use CRR to train a policy with the combined dataset and evaluate the performance in both environments, as formulated in Section 3.1. We find that this sometimes results in worse performance in Env-A compared to training on the Env-A dataset alone. Although previous work has studied the problem of data imbalance in supervised learning (Johnson and Khoshgoftaar, 2019; Ren et al., 2018), the issue we observe has the extra complexity from the bootstrapping procedure in off-policy RL. We provide evidence for the following hypothesis: both the imbalanced quality and the imbalanced size of the combined dataset lead to additional extrapolation error of the Q-function in offline learning, which contributes to the performance drop. As we discussed in Section 4, extrapolation error of the Q-function plays an important role in the failure cases in offline RL. The imbalanced dataset exacerbates this problem in the following way: if one dataset has a higher average return than the other, it may cause overestimation bias of the Q-function for the “weaker” dataset. At the same time, if there is a large size imbalance, the policy network will be trained with more data points from one environment than the other. In this way, the policy may create more out-of-distribution actions in the environment that comes with a smaller dataset, which makes the extrapolation error worse. Both of these aspects contribute to the undesirable performance we observe in the offline distillation phase. In the experiment, we provide evidence to support this hypothesis and eliminate other potential factors.

To build a robust algorithm for lifelong learning, we need to improve the offline distillation phase to achieve good performance on all of the environments despite imbalanced experience. We prefer a solution that does not rely on task boundaries, as discussed before. Our insight is that since both the quality imbalance and the size imbalance eventually result in additional extrapolation error, we can follow the conservative objective in offline RL and make the policy even more conservative to compensate for this issue. As shown in Equation 4, the temperature $\tau$ controls the strength of the KL term in the policy improvement objective in CRR. With a larger temperature, the policy is constrained to be closer to the behavior policy of the dataset. We find that the imbalanced dataset requires a higher strength of the KL term compared to single dataset training to compensate for the additional extrapolation error. In the experiment, we show that increasing $\tau$ is a simple yet effective fix to the data imbalance problem. The effectiveness of increasing $\tau$ can also serve as evidence that the performance drop is highly related to extrapolation error. Note that increasing $\tau$ only makes the policy more “conservative” during the distillation phase, which will not affect exploration.
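How the temperature moves the policy update towards plain imitation of the dataset can be seen from the spread of the exponential weights. The sketch below is illustrative only: the advantage values are hypothetical, and the effective sample fraction (effective sample size divided by batch size) is a generic diagnostic chosen here, not a quantity used in the paper.

```python
import math
from typing import List, Sequence


def exp_weights(advantages: Sequence[float], temperature: float) -> List[float]:
    """Unnormalized weights exp(A / tau), shifted by the maximum for stability."""
    scaled = [a / temperature for a in advantages]
    shift = max(scaled)
    return [math.exp(x - shift) for x in scaled]


def effective_sample_fraction(weights: Sequence[float]) -> float:
    """(sum w)^2 / (n * sum w^2): 1.0 means all dataset actions count equally."""
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return (total * total / squares) / len(weights)


advantages = [-2.0, -0.5, 0.0, 0.5, 2.0]  # hypothetical batch of advantages
fractions = {tau: effective_sample_fraction(exp_weights(advantages, tau))
             for tau in (0.1, 1.0, 10.0)}
```

As the temperature grows, the fraction approaches 1, meaning the update weights all dataset actions almost equally and therefore stays close to the behavior policy. A small temperature concentrates the update on the few actions with the highest estimated advantage, which is where errors in the Q-function matter most.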

7 Experiments

7.1 Experiment Setup

We study the lifelong learning problem in a simulated bipedal walking task, where the goal is to maximize the forward velocity while avoiding falling. Our experiments involve a small humanoid robot, called OP3, that has 20 actuated joints and has been previously used to train walking directly on hardware (Bloesch et al., 2022). All of the experiments in this work are conducted in simulation, both due to limited access to hardware and for a more controlled experiment setting. However, we incorporate realistic considerations into the experiments wherever possible, so that the method can be deployed on real robots in the future. All of the results are averaged over 3 random seeds.

Our experiments are based on the setup where the robot is trained in Env-A for 0.2M steps, and then trained in Env-B for 1M steps (Figure 1(a)). The goal is to achieve good performance at the end of training in both Env-A and Env-B. To evaluate the generality of the results, we consider different types of changes in the environment including softer ground texture, hip joint deformation and larger foot size (Figure 4). The parameters for each change of the environment are chosen to create a clear performance drop when we perform zero-shot transfer of a policy trained in the default environment to the new environment. In the following experiments when there is a switch from Env-A to Env-B, we use the default environment as Env-A, and change one of the physical parameters to create Env-B. When we switch from one environment to another, we always keep the previous policy, Q-function and the replay buffer.
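The size imbalance implied by this setup can be worked out directly. The sketch assumes one stored transition per environment step and uniform sampling from the combined replay buffer; both are assumptions made for illustration, as is the batch size.

```python
from typing import Dict


def sampling_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Share of each environment in a uniformly sampled batch."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Step counts from the setup above: 0.2M steps in Env-A, 1M steps in Env-B.
buffer_sizes = {"Env-A": 200_000, "Env-B": 1_000_000}
fractions = sampling_fractions(buffer_sizes)

batch_size = 256  # hypothetical batch size, not taken from the paper
expected_per_batch = {name: batch_size * f for name, f in fractions.items()}
```

Under these assumptions only about one sixth of every batch comes from Env-A, so the policy and Q-function receive roughly five times more updates on Env-B transitions.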

To remove partial observability in non-stationary dynamics, we include the ground truth physical parameters in the observation. This eliminates the possibility that the issues we observe in the lifelong learning pipeline and with the imbalanced experience are caused by partial observability. The results we provide can serve as an upper bound on the expected performance without the physical parameters. However, the physical parameters are not explicitly used to denote task boundaries in the proposed method. In a realistic scenario, the partial observability could be handled by memory or system identification methods (Yu et al., 2017; Heess et al., 2016; Zhou et al., 2019). In these cases, the variations of the environments might be represented as continuous embeddings which cannot be used as task boundaries. Thus, we avoid relying on the physical parameters in the proposed pipeline as task boundaries.

7.2 Offline Distillation for Lifelong Reinforcement Learning

Figure 4: Experiment Setup: The bipedal robot walking task with different environment variations.
Figure 5: Evaluation curves of the lifelong learning experiment across two stages. The Offline Distillation Pipeline effectively breaks the trade-off between forward and backward transfer and achieves better performance than the baselines at the end.

In this section, we demonstrate the trade-off between forward transfer and backward transfer in lifelong reinforcement learning and the effectiveness of the Offline Distillation Pipeline. The policy is trained from scratch in Env-A during Stage-1 and then switched to Env-B during Stage-2 (as illustrated in Figure 1(a)). We compare the performance of MPO, CRR with a less conservative objective (with an indicator function, which corresponds to $\tau \to 0$ as discussed in Section 3.2), CRR with a more conservative objective (a larger temperature $\tau$) and the Offline Distillation Pipeline. In the example shown in Figure 5, the robot in Env-B has its right hip joints deformed by 0.3 rad. The results for more Env-B variations are included in Appendix A.

We first demonstrate the forward transfer problem in Stage-1. As shown in the left panel of Figure 5, conservative algorithms such as CRR cannot learn from scratch as efficiently as MPO. Then, we show the backward transfer problem in Stage-2. To make the performance drop easier to compare, we enforce the same starting performance in Stage-2 for all the baselines. This is done by loading the same agent (networks and replay buffer) trained with MPO in Stage-1. In Stage-2, the policy only collects data in Env-B, but we evaluate the performance in both Env-A and Env-B. In the middle panel of Figure 5, the performance of MPO drops significantly in Env-A after the switch. This could potentially be explained by the fact that Env-A transitions are “offline” during Stage-2, and thus the extrapolation error starts to accumulate. The forgetting issue also occurs in the less conservative CRR, although it is less severe than in MPO. The more conservative CRR effectively maintains the performance in Env-A, which indicates that the performance drop is indeed related to the extrapolation issue in offline RL. However, as shown in the right panel of Figure 5, the more conservative CRR does not improve the performance in Env-B as much as the other baselines.

In summary, these baselines struggle with either backward transfer or forward transfer. As described in Section 5, we propose the Offline Distillation Pipeline, which distills the data collected by MPO using CRR. In this experiment, we perform the distillation step at the end of Stage-2. The performance is shown as the dotted lines in the figures. Taking the best of both worlds, our method achieves better performance than the baselines in both environments. Note that in this experiment we include the results of the proposed method with the data imbalance issue fixed, which will be discussed in the next section. If we expect multiple switches of environments during training, we can perform the distillation step more often, as illustrated in Figure 1(b). We leave this to future work.
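The interleaving of online collection and offline distillation described above can be sketched as a short control loop. This is only an illustrative outline, not the authors' implementation: `run_lifelong_pipeline`, `collect_online`, `distill_offline`, and `distill_every` are hypothetical names, and the two callables stand in for an online learner such as MPO and an offline learner such as CRR.

```python
from typing import Callable, List, Tuple


def run_lifelong_pipeline(
    stages: List[str],
    collect_online: Callable[[str, list], list],
    distill_offline: Callable[[list], object],
    distill_every: int = 1,
) -> Tuple[list, list]:
    """Interleave online data collection with offline distillation.

    collect_online (hypothetical) stands in for an online learner such as MPO
    and returns the transitions gathered in the given environment.
    distill_offline (hypothetical) stands in for an offline learner such as CRR.
    It only reads a copy of the replay buffer, so it cannot affect collection.
    """
    replay_buffer: list = []
    distilled_policies: list = []
    for stage_index, env_name in enumerate(stages, start=1):
        # Online phase: the non-conservative learner gathers new data.
        replay_buffer.extend(collect_online(env_name, replay_buffer))
        # Offline phase: distill all data seen so far into a single policy.
        if stage_index % distill_every == 0:
            distilled_policies.append(distill_offline(list(replay_buffer)))
    return replay_buffer, distilled_policies


if __name__ == "__main__":
    # Toy stand-ins: three fake transitions per stage, and a "policy" that
    # simply records which environments it was distilled from.
    def fake_collect(env_name, buffer):
        return [(env_name, step) for step in range(3)]

    def fake_distill(buffer):
        return {"trained_on": sorted({env for env, _ in buffer})}

    buffer, policies = run_lifelong_pipeline(
        ["Env-A", "Env-B"], fake_collect, fake_distill, distill_every=2
    )
    print(len(buffer), policies)
```

With `distill_every=2` the sketch mirrors the two-stage experiment, where distillation runs once at the end of Stage-2; a smaller value would correspond to distilling more often when more environment switches are expected.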

7.3 Imbalanced Experience in Offline Distillation

As discussed in Section 6, we sometimes observe a performance drop during the distillation phase in the proposed pipeline with imbalanced experience. We will provide experimental evidence for the hypothesis that the decrease in performance is caused by the imbalanced size and quality between the datasets. In this section, we use CRR with the indicator function by default which corresponds to a less conservative CRR objective as discussed in Section 3.2.
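To make the two CRR variants concrete, the sketch below computes the per-transition weight that multiplies the log-likelihood of a dataset action, following the general form in Wang et al. (2020): an indicator of positive advantage, or an exponential of the advantage divided by a temperature. The function names, the symbol `beta` for the temperature, and the clipping value `max_weight=20.0` are assumptions made for illustration and are not taken from this paper's setup.

```python
import math


def crr_weight(advantage, mode="indicator", beta=1.0, max_weight=20.0):
    """Weight applied to log pi(a|s) for one dataset transition (sketch)."""
    if mode == "indicator":
        # Less conservative: only imitate actions with positive advantage.
        return 1.0 if advantage > 0.0 else 0.0
    if mode == "exp":
        if beta <= 0.0:
            raise ValueError("beta must be positive")
        exponent = advantage / beta
        # Clip before exponentiating to avoid overflow (assumed clip value).
        if exponent >= math.log(max_weight):
            return max_weight
        return math.exp(exponent)
    raise ValueError("mode must be 'indicator' or 'exp'")


def weight_spread(advantages, beta):
    """Ratio of largest to smallest exponential weight for a given temperature."""
    weights = [crr_weight(a, "exp", beta) for a in advantages]
    return max(weights) / min(weights)


if __name__ == "__main__":
    advantages = [-2.0, -0.5, 0.5, 2.0]
    for beta in (0.5, 1.0, 10.0):
        rounded = [round(crr_weight(a, "exp", beta), 3) for a in advantages]
        print(beta, rounded, round(weight_spread(advantages, beta), 2))
```

As the temperature grows, the weights flatten toward 1 for every transition, so the update approaches behavior cloning and the policy stays closer to the dataset; as it shrinks, the weights concentrate on high-advantage actions.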

(a) Reward
(b) Q-values
Figure 6: The imbalance issue in Offline Distillation: Training CRR with the combined dataset results in much lower performance in Env-A compared to training on the individual dataset. However, the worse performance corresponds to a higher average Q-value which indicates overestimation.

The Imbalance Issue. Following the setup in the previous section (Figure 1(a)), we run the offline distillation step at the end of Stage-2 on the combination of the Env-A dataset from Stage-1 and the Env-B dataset from Stage-2, both collected by MPO. We apply different environment variations in Env-B to generate the Env-B dataset while keeping the same Env-A dataset in the combined dataset (Figure 5). In Figure 6(a), the blue curves (Combined Dataset) show the performance of the distilled policy evaluated in Env-A and Env-B across the CRR training process of the distillation stage. Each column corresponds to a different Env-B dataset. Note that there is no data collection in this stage, and the x-axis here is the number of policy updates in CRR rather than environment steps. As a comparison, we run CRR on the Env-A dataset and the Env-B dataset separately with the same hyperparameters (Env-A Dataset Only, Env-B Dataset Only). Given that there is no partial observability (see Section 7.1), we expect the policy trained on the combined dataset to have similar or better performance than the Env-A-only policy evaluated in Env-A, and than the Env-B-only policy evaluated in Env-B. From the second row of Figure 6(a), the performance in Env-B is similar between the combined-dataset policy and the Env-B-only policy at convergence. However, the performance of the combined-dataset policy in Env-A is much worse than training with the Env-A dataset alone: we observe the blue curves converging to a lower reward, converging much more slowly, or becoming unstable during training. Despite being trained on the same Env-A dataset, the distilled policies have very different performance in Env-A because they are trained with different Env-B datasets. Although the performance drop in Env-A does not always happen, it is important to understand when and why the performance degrades in order to develop a robust lifelong learning pipeline that works in diverse settings.

To gain more insight into the problem, we also train a behavior cloning policy on the combined dataset. Figure 7 includes the final performance of CRR (Baseline) and behavior cloning (BC) over the combined dataset. Despite the size imbalance, BC performs reasonably well in Env-A with only supervised learning. The CRR Baseline is much worse than BC in Env-A. This comparison indicates that the performance drop we observe in Env-A is more likely to be rooted in the RL procedure than to be a regular data imbalance problem in a supervised setting. We have also tried a few sanity-check experiments, including increasing the batch size, increasing the network capacity, and using a mixture of Gaussians as the policy output. None of these prevents the performance drop in Env-A. In the following paragraphs, we discuss the most important experiments that support our hypothesis from Section 6.

Figure 7: To study the imbalance issue, the figure shows the final performance of different variants of the offline distillation step at 5e6 policy updates. The Baseline corresponds to the performance of Combined Dataset in Figure 6.

Overestimation due to the imbalanced quality. Figure 6(b) plots the average estimated Q-values of each method over the Env-A and Env-B datapoints separately. Although the policy trained on the combined dataset has lower performance in Env-A than the policy trained on the Env-A dataset alone, the corresponding Q-functions produce higher estimated Q-values over Env-A datapoints, which indicates significant overestimation. In contrast, if we compare the Q-values over the Env-B datapoints in the second row, the curves are similar to each other within each plot. This indicates that training on the combined dataset suffers from overestimation specifically for Env-A datapoints. Furthermore, we observe that the average Q-value over the Env-B dataset is higher than over the Env-A dataset in the individual dataset experiments. This is because the Env-A dataset is collected by training from scratch, while the Env-B dataset is collected during Stage-2, where the policy is bootstrapped from the previous experience and starts from a higher performance (see Figure 5). This observation leads us to the hypothesis that the high-value datapoints in Env-B bias the Q-function, which leads to overestimation for Env-A datapoints. To verify this hypothesis, we perform an experiment in which we scale the reward of all the transitions in one of the two datasets by a constant factor, which does not change the optimal solution of the policy. After this change, the distilled policy trained on the combined dataset works well in both environments (Scale Rwd in Figure 7) and achieves performance similar to the individual dataset baselines. However, re-scaling the reward is not an acceptable solution in our problem setting because it requires knowledge of the task boundaries. It only serves as an analysis to demonstrate the imbalance issue caused by the quality of the dataset.
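The claim that reward scaling leaves the optimal policy unchanged can be checked on a toy problem. The sketch below is not the paper's experiment: it runs tabular Q-iteration on a small, made-up deterministic MDP (the reward table, transition table, and scale factor `0.1` are arbitrary illustrative values) and shows that multiplying every reward by a positive constant rescales the Q-values by the same constant while the greedy action in each state stays the same. In the paper's setting the scaling is applied to one environment's transitions only; because the physical parameters are part of the observation, it seems reasonable to treat each environment as its own sub-problem, which is the case sketched here.

```python
def q_iteration(rewards, next_state, gamma=0.9, sweeps=200):
    """Tabular Q-iteration for a deterministic MDP given as nested lists."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        q = [
            [
                rewards[s][a] + gamma * max(q[next_state[s][a]])
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


def greedy_policy(q):
    """Index of the highest-valued action in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Made-up 3-state, 2-action MDP used only for illustration.
REWARDS = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.3]]
NEXT_STATE = [[1, 2], [0, 2], [2, 0]]
SCALE = 0.1  # any positive constant

q_original = q_iteration(REWARDS, NEXT_STATE)
q_scaled = q_iteration([[SCALE * r for r in row] for row in REWARDS], NEXT_STATE)

if __name__ == "__main__":
    print(greedy_policy(q_original), greedy_policy(q_scaled))
```

The scaled Q-values are smaller in magnitude, which is the property the analysis relies on: the value scale of one dataset can be brought closer to the other without changing which actions are optimal.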

Additional fitting error of the actor. We also test the contribution of the fitting error of the actor to the overall extrapolation error. We use separate actor networks for each dataset when training CRR on the combined dataset (Two Actors in Figure 7): Actor-A and Actor-B are trained with the transitions from Env-A and Env-B respectively, while the critic is shared across the two environments. This change also makes the policy work well in both environments even though the imbalanced reward is not corrected. In the Two Actors experiment, we find that the overestimation over Env-A datapoints still exists but is reduced. Together with the Scale Reward experiment, the results indicate that the overestimation in Env-A that we observe in Figure 6(b) is caused by two sources of error: the imbalanced quality creates overestimation, and the imbalanced size creates more fitting error in the actor, which results in more out-of-distribution actions that may take advantage of the overestimation. Note that using two separate actors also requires task boundaries and only serves as an analysis.

Figure 8: Sensitivity analysis of data imbalance: A higher temperature makes CRR more robust to different ratios of dataset imbalance. The legend indicates the dataset size ratio of Env-A:Env-B corresponding to different colors.

Effectiveness of the temperature. As shown in the previous paragraphs, fixing either the imbalanced quality or the fitting error of the actor makes the algorithm stable when evaluated in both Env-A and Env-B. However, we need a solution that does not require task boundaries. As proposed in Section 6, increasing the temperature term in CRR can largely fix this issue. Figure 7 includes the performance of CRR with different temperatures. The Baseline uses an indicator function as the transformation function, which corresponds to a very small temperature. As the temperature increases, the performance in Env-A increases. Although we observe a minor drop in Env-B at high temperatures, the overall performance in both Env-A and Env-B is reasonably satisfactory. To further demonstrate the effectiveness of increasing the temperature, we conduct an experiment in which we upsample either the Env-A dataset or the Env-B dataset to simulate other compositions of the combined dataset (Figure 8). The original size ratio of the Env-A dataset to the Env-B dataset, given in Section 7.1, is denoted as raw. The performance in Env-A of CRR with the indicator function (Baseline) decreases drastically with a higher Env-B sampling ratio. Interestingly, for one of the tested size ratios, the policy is still not able to consistently achieve the single-dataset performance in Env-A (which is expected to be above 400, as shown in Figure 6). In contrast, CRR with the higher temperature used in Section 7.2 works well across a wider range of size ratios. As shown in previous work (Wang et al., 2020), the specific choice of temperature could be domain-dependent. The more important takeaway from this analysis is that if we observe a performance drop during RL training with an imbalanced dataset, we may consider increasing the conservativeness of the policy to compensate for the additional extrapolation error, for example by increasing the temperature in CRR.
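One simple way to emulate different dataset compositions without collecting new data is to draw each training batch from the two datasets with a chosen Env-A:Env-B sampling ratio. The sketch below is an assumption about how such resampling could be done, not a description of the authors' code; `sample_mixed_batch` and its parameters are hypothetical names, and the string entries stand in for stored transitions.

```python
import random


def sample_mixed_batch(dataset_a, dataset_b, ratio_a, ratio_b, batch_size, rng):
    """Sample a batch whose expected Env-A:Env-B composition is ratio_a:ratio_b.

    Sampling is with replacement, so the smaller dataset is effectively
    upsampled whenever the target ratio exceeds its raw share.
    """
    if ratio_a < 0 or ratio_b < 0 or ratio_a + ratio_b == 0:
        raise ValueError("ratios must be non-negative and not both zero")
    prob_a = ratio_a / (ratio_a + ratio_b)
    batch = []
    for _ in range(batch_size):
        source = dataset_a if rng.random() < prob_a else dataset_b
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    env_a_data = [("Env-A", i) for i in range(100)]    # small dataset
    env_b_data = [("Env-B", i) for i in range(1000)]   # large dataset
    batch = sample_mixed_batch(env_a_data, env_b_data, 1, 1, 256, rng)
    share_a = sum(1 for env, _ in batch if env == "Env-A") / len(batch)
    print(round(share_a, 2))
```

Sampling by ratio rather than physically duplicating transitions keeps the replay buffer unchanged. Note that choosing the ratio per environment requires knowing which transitions came from which environment, so, like the other analysis experiments, this is a diagnostic tool rather than a task-boundary-free fix.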

8 Conclusion

In this work, we investigate the lifelong learning problem of variations in environment dynamics as commonly observed when learning on robot hardware. We find that there is a trade-off between backward and forward transfer in existing RL algorithms in this problem setting, even when we keep all of the transitions in the replay buffer. We connect the problem to offline RL and propose the Offline Distillation Pipeline to break this trade-off. In the proposed pipeline, the forgetting issue is prevented by distilling the replay buffer data across multiple environments into a universal policy as an offline RL problem. In this way, the solution to the forgetting problem is disentangled from data collection. We empirically verify the effectiveness of the pipeline through a bipedal robot walking task in simulation across various physical changes. In addition, we find a potential issue with imbalanced experience in offline distillation. Through controlled experiments, we demonstrate how the quality imbalance and the increased fitting error of the actor might exacerbate extrapolation error and create a performance drop. We also provide a simple yet effective solution to this issue by increasing the temperature term in CRR.

The insights from this work could potentially be applied in other settings beyond the lifelong learning problem of varying dynamics. For example, the Offline Distillation Pipeline can be used in other lifelong reinforcement learning settings with a different definition of “task”. The imbalance issue may also happen in other cases of multi-task learning in offline RL, or in single-task RL with sufficient non-stationarity (e.g. due to partial observability). In future work, we hope to see the proposed method being verified and deployed in more settings including real robot experiments.


References

  • A. Abdolmaleki, S. H. Huang, G. Vezzani, B. Shahriari, J. T. Springenberg, S. Mishra, D. TB, A. Byravan, K. Bousmalis, A. Gyorgy, et al. (2021) On multi-objective policy optimization as a tool for reinforcement learning. arXiv preprint arXiv:2106.08199. Cited by: §3.2.
  • A. Abdolmaleki, J. T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, and M. Riedmiller (2018a) Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256. Cited by: §3.2.
  • A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller (2018b) Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920. Cited by: §3.2, §4.
  • R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019a) Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11254–11263. Cited by: §2.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019b) Gradient based sample selection for online continual learning. Advances in neural information processing systems 32. Cited by: §2.
  • M. Bloesch, J. Humplik, V. Patraucean, R. Hafner, T. Haarnoja, A. Byravan, N. Y. Siegel, S. Tunyasuvunakool, F. Casarini, N. Batchelor, et al. (2022) Towards real robot learning in the wild: a case study in bipedal locomotion. In Conference on Robot Learning, pp. 1502–1511. Cited by: §7.1.
  • D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, pp. 503–556. Cited by: §2.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §1, §2.
  • S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §1, §2, §4.
  • R. Hadsell, D. Rao, A. A. Rusu, and R. Pascanu (2020) Embracing change: continual learning in deep neural networks. Trends in cognitive sciences 24 (12), pp. 1028–1040. Cited by: §1, §2, §4.
  • D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick (2017) Neuroscience-inspired artificial intelligence. Neuron 95 (2), pp. 245–258. Cited by: §1.
  • N. Heess, G. Wayne, Y. Tassa, T. Lillicrap, M. Riedmiller, and D. Silver (2016) Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182. Cited by: §7.1.
  • G. Hinton, O. Vinyals, J. Dean, et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2 (7). Cited by: §2.
  • M. Igl, G. Farquhar, J. Luketina, W. Boehmer, and S. Whiteson (2020) Transient non-stationarity and generalisation in deep reinforcement learning. arXiv preprint arXiv:2006.05826. Cited by: §2.
  • R. Jeong, J. T. Springenberg, J. Kay, D. Zheng, Y. Zhou, A. Galashov, N. Heess, and F. Nori (2020) Learning dexterous manipulation from suboptimal experts. arXiv preprint arXiv:2010.08587. Cited by: §1, §2, §3.2.
  • J. M. Johnson and T. M. Khoshgoftaar (2019) Survey on deep learning with class imbalance. Journal of Big Data 6 (1), pp. 1–54. Cited by: §6.
  • K. Khetarpal, M. Riemer, I. Rish, and D. Precup (2020) Towards continual reinforcement learning: a review and perspectives. arXiv preprint arXiv:2012.13490. Cited by: §2.
  • R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) Morel: model-based offline reinforcement learning. Advances in neural information processing systems 33, pp. 21810–21823. Cited by: §2.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.
  • A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine (2019) Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11784–11794. Cited by: §2.
  • A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 1179–1191. Cited by: §2.
  • S. Lange, T. Gabel, and M. Riedmiller (2012) Batch reinforcement learning. In Reinforcement learning, pp. 45–73. Cited by: §2.
  • S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §2.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in neural information processing systems 30. Cited by: §1, §2.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International conference on machine learning, pp. 4334–4343. Cited by: §6.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. Advances in Neural Information Processing Systems 32. Cited by: §1, §2, §4.
  • A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2015) Policy distillation. arXiv preprint arXiv:1511.06295. Cited by: §2.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §1, §2.
  • J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537. Cited by: §2, §2.
  • N. Y. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R. Hafner, N. Heess, and M. Riedmiller (2020) Keep doing what worked: behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396. Cited by: §2.
  • Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. Advances in neural information processing systems 30. Cited by: §2.
  • S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §1, §2.
  • Z. Wang, A. Novikov, K. Zolna, J. S. Merel, J. T. Springenberg, S. E. Reed, B. Shahriari, N. Siegel, C. Gulcehre, N. Heess, et al. (2020) Critic regularized regression. Advances in Neural Information Processing Systems 33, pp. 7768–7778. Cited by: §2, §3.2, §7.3.
  • Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361. Cited by: §2.
  • A. Xie and C. Finn (2021) Lifelong robotic reinforcement learning by retaining experiences. arXiv preprint arXiv:2109.09180. Cited by: §2.
  • F. Yang, C. Yang, H. Liu, and F. Sun (2022) Evaluations of the gap between supervised and reinforcement lifelong learning on robotic manipulation tasks. In Conference on Robot Learning, pp. 547–556. Cited by: §2.
  • T. Yu, A. Kumar, Y. Chebotar, K. Hausman, S. Levine, and C. Finn (2021) Conservative data sharing for multi-task offline reinforcement learning. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma (2020) Mopo: model-based offline policy optimization. Advances in Neural Information Processing Systems 33, pp. 14129–14142. Cited by: §2.
  • W. Yu, J. Tan, C. K. Liu, and G. Turk (2017) Preparing for the unknown: learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453. Cited by: §7.1.
  • C. Zeno, I. Golan, E. Hoffer, and D. Soudry (2018) Task agnostic continual learning using online variational bayes. arXiv preprint arXiv:1803.10123. Cited by: §2.
  • H. Zhang, J. Shao, Y. Jiang, S. He, and X. Ji (2021) Reducing conservativeness oriented offline reinforcement learning. arXiv preprint arXiv:2103.00098. Cited by: §2.
  • W. Zhou, S. Bajracharya, and D. Held (2020) PLAS: latent action space for offline reinforcement learning. In Conference on Robot Learning, Cited by: §2.
  • W. Zhou, L. Pinto, and A. Gupta (2019) Environment probing interaction policies. arXiv preprint arXiv:1907.11740. Cited by: §7.1.

Appendix A Additional results on the offline distillation pipeline

In Figure 5, we present the results of the two-stage lifelong learning experiments when Env-A is the default environment and Env-B has the hip joint of the robot deformed by 0.3 rad. In this section, we include results across more environment variations. To demonstrate the difficulty of data collection with conservative algorithms, Figure 9 shows the performance of each algorithm when trained from scratch, which corresponds to the first stage of the lifelong learning setup discussed in Section 7.2. MPO performs significantly better than both versions of CRR. Figure 10 shows the performance during Stage-2, where all of the algorithms load an agent that is pretrained in Env-A (the default environment) for 0.2M steps. The Offline Distillation Pipeline consistently achieves the best performance across different Env-B variations.

Figure 9: Comparison of off-policy algorithms for training from scratch which corresponds to the beginning stage of a lifelong learning experiment.
Figure 10: Comparison of off-policy algorithms during Stage-2 of the lifelong learning experiment.