1 Introduction
Batch reinforcement learning (RL) (Ernst et al., 2005; Lange et al., 2011) is the problem of learning a policy from a fixed, previously recorded dataset, without the opportunity to collect new data through interaction with the environment. This is in contrast to the typical RL setting, which alternates between policy improvement and environment interaction (to acquire data for policy evaluation). In many real-world domains collecting new data is laborious and costly, both in terms of experimentation time and hardware availability and in terms of the human labour involved in supervising experiments. This is especially evident in robotics applications (see e.g. Riedmiller et al. 2018; Haarnoja et al. 2018b; Kalashnikov et al. 2018 for recent examples of learning on robots). In these settings, where gathering new data is expensive compared to the cost of learning, batch RL promises to be a powerful solution.
There exists a wide class of off-policy reinforcement learning algorithms designed to handle data generated by a behavior policy $\mu$ which might differ from $\pi$, the policy that we are interested in learning (see e.g. Sutton and Barto (2018) for an introduction). One might thus expect solving batch RL to be a straightforward application of these algorithms. Surprisingly, however, for batch RL in continuous control domains, Fujimoto et al. (2018) found that policies obtained via the naïve application of off-policy methods perform dramatically worse than the policy that was used to generate the data. This result highlights the key challenge in batch RL: we need to exhaustively exploit the information that is in the data but avoid drawing conclusions for which there is no evidence (i.e. we need to avoid overvaluing state-action sequences not present in the training data).
As we will show in this paper, the problems with existing methods in the batch learning setting are further exacerbated when the provided data contains behavioral trajectories from different policies which solve different tasks, or the same task in different ways (and thus potentially execute conflicting actions), and which are not necessarily aligned with the target task that $\pi$ should accomplish. We empirically show that previously suggested adaptations for off-policy learning (Fujimoto et al., 2018; Kumar et al., 2019) can be led astray by behavioral patterns in the data that are consistent (i.e. policies that try to accomplish a different task or a subset of the goals for the target task) but not relevant for the task at hand. This situation is more damaging than learning from noisy or random data, where the behavior policy is suboptimal but not predictable, i.e. the randomness is not a correlated signal that will be picked up by the learning algorithm.
We propose to solve this problem by restricting our solutions to ‘stay close to the relevant data’. This is done by: 1) learning a prior that gives information about which candidate policies are potentially supported by the data (while ensuring that the prior focuses on relevant trajectories), and 2) enforcing that the policy improvement step stays close to the learned prior policy. We propose a policy iteration algorithm in which the prior is learned to form an advantage-weighted model of the behavior data. This prior biases the RL policy towards previously experienced actions that also have a high chance of being successful in the current task. Our method enables stable learning from conflicting data sources, and we show improvements over competitive baselines in a variety of RL tasks – including standard continuous control benchmarks and multi-task learning for simulated and real-world robots. We also find that utilizing an appropriate prior is sufficient to stabilize learning, demonstrating that the policy evaluation step is implicitly stabilized when a policy iteration algorithm is used – as long as care is taken to faithfully evaluate the value function within temporal difference calculations. This results in a simpler algorithm than in previous work (Fujimoto et al., 2018; Kumar et al., 2019).
2 Background and Notation
In the following we consider the problem of reinforcement learning, modeling the environment as a Markov decision process (MDP) consisting of continuous states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, and a transition probability distribution $p(s_{t+1} \mid s_t, a_t)$ – describing the evolution of the system dynamics over time (i.e. the probability of reaching state $s_{t+1}$ from state $s_t$ when executing action $a_t$) – together with the state-visitation distribution $\mu_\pi(s)$. The goal of reinforcement learning is to find a policy $\pi(a \mid s)$ that maximizes the cumulative discounted return $J(\pi) = \mathbb{E}_{\pi}\big[\sum_t \gamma^t r(s_t, a_t)\big]$, for the reward function $r(s, a)$. We also define the state-action value function $Q^{\pi}(s, a)$ for taking action $a$ in state $s$, and thereafter following $\pi$: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\sum_t \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\big]$, which we can relate to the objective via $J(\pi) = \mathbb{E}_{s \sim \mu_\pi}\big[\mathbb{E}_{a \sim \pi}[Q^{\pi}(s, a)]\big]$, maximized by the optimal policy $\pi^*$. We parameterize the policy $\pi_\theta$ by $\theta$, but we will omit this dependency where unambiguous. In some of the experiments we will also consider a setting where we learn about multiple tasks $k \in \{1, \dots, K\}$, each with its own reward function $r_k(s, a)$. We condition the policy and Q-function on the task index (i.e. $\pi(a \mid s, k)$ and $Q^{\pi}(s, a, k)$), changing the objective to maximize the sum of returns across all tasks.

For the batch RL setting we assume that we are given a dataset $\mathcal{D}$ containing trajectory snippets (i.e. sub-trajectories of length $N$) $\tau = \{(s_0, a_0), \dots, (s_N, a_N)\}$, with $\tau \sim \mathcal{D}$. We assume access to the reward function for the task of interest and can evaluate it for all transitions in $\mathcal{D}$ (for example, the reward may be some function of the state). We further assume $\mathcal{D}$ was filled, prior to training, by following a set of arbitrary behavior policies $\{\mu_b\}$. Note that these behavior policies may try to accomplish the task we are interested in, or might indeed generate trajectories unrelated to the task at hand.
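To make the notation above concrete, the following minimal sketch (our illustration, not code from the paper) computes the discounted return of a trajectory snippet:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return: sum_t gamma^t * r_t for one trajectory snippet."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# e.g. three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

The state-action value $Q^{\pi}(s, a)$ is simply the expectation of this quantity over trajectories started at $(s, a)$ and continued with $\pi$.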
3 A Learned Prior for Offline Off-policy RL from Imperfect Data
To stabilize off-policy RL from batch data, we want to restrict the learned policy to those parts of the state-action space supported by the batch. In practice this means that we need to approximately restrict the policy to the support of the empirical state-conditional action distribution. This prevents the policy from taking actions for which the Q-function cannot be trained and for which it might thus give erroneous, overly optimistic values (Fujimoto et al., 2018; Kumar et al., 2019). In this paper we achieve this by adopting a policy iteration procedure in which the policy is constrained in the improvement step. As in standard policy iteration (Sutton and Barto, 2018), the procedure consists of two alternating steps. First, starting with a given policy $\pi_i$ in iteration $i$ (with $\pi_0$ corresponding to a randomly initialized policy distribution), we find an approximate action-value function $\hat{Q}^{\pi_i}_{\phi}$ with parameters $\phi$ (Section 3.1); as with the policy, we will drop the dependence on $\phi$ and write $\hat{Q}_i$ where unambiguous. Second, we optimize the new policy $\pi_{i+1}$ with respect to $\hat{Q}_i$, subject to a constraint that ensures closeness to the empirical state-conditional action distribution of the batch (Section 3.2). Iterating these steps, overall, optimizes $J(\pi)$. We realize both policy evaluation and improvement via a fixed number of gradient descent steps – holding $\pi_i$ and $\hat{Q}_i$ fixed via the use of target networks (Mnih et al., 2015). We refer to Algorithm 1 for details.
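The alternating procedure can be summarized in a short sketch (function names are illustrative placeholders of ours, not the paper's implementation):

```python
def batch_policy_iteration(dataset, num_iterations, evaluate_policy, improve_policy,
                           initial_policy):
    """Alternate (1) policy evaluation and (2) prior-constrained policy improvement
    on a fixed batch of data; evaluate_policy and improve_policy are supplied callables."""
    policy, q_fn = initial_policy, None
    for _ in range(num_iterations):
        # 1) fit an approximate action-value function for the current policy (Section 3.1)
        q_fn = evaluate_policy(dataset, policy)
        # 2) improve the policy under a KL constraint towards the learned prior (Section 3.2)
        policy = improve_policy(dataset, q_fn, policy)
    return policy, q_fn
```

In the actual algorithm each of the two callables performs a fixed number of gradient steps against target networks rather than solving its sub-problem to convergence.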
3.1 Policy Evaluation
To learn the task action-value function $\hat{Q}_i$ in each iteration we minimize the squared temporal difference error for a given reward $r$ – note that when performing offline RL from a batch of data the reward might be computed post-hoc and does not necessarily correspond to the reward optimized by the behavior policies $\{\mu_b\}$. The result after iteration $i$ is given as

(1)  $\hat{Q}_i = \arg\min_{\phi} \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}} \Big[ \big( r(s_t, a_t) + \gamma \, \mathbb{E}_{a \sim \pi_i(\cdot \mid s_{t+1})} \big[ \hat{Q}'(s_{t+1}, a) \big] - Q_{\phi}(s_t, a_t) \big)^2 \Big],$

where $\hat{Q}'$ denotes the target network.
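A minimal sketch of this evaluation step (our toy callables standing in for the networks; the sample count is a free hyperparameter, not the paper's setting):

```python
def sampled_td_target(reward, next_state, q_target, sample_action,
                      gamma=0.99, num_samples=10):
    """TD target r + gamma * E_{a ~ pi_i}[Q'(s', a)], with the expectation over
    the policy replaced by a Monte Carlo average over sampled actions."""
    q_vals = [q_target(next_state, sample_action(next_state)) for _ in range(num_samples)]
    return reward + gamma * sum(q_vals) / len(q_vals)

def td_loss(q_pred, target):
    """Squared temporal-difference error for one transition."""
    return (target - q_pred) ** 2
```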
We approximate the expectation required to calculate the TD target with $M$ samples from $\pi_i$, i.e. $\mathbb{E}_{a \sim \pi_i}[\hat{Q}'(s_{t+1}, a)] \approx \frac{1}{M} \sum_{m=1}^{M} \hat{Q}'(s_{t+1}, a^{(m)})$ with $a^{(m)} \sim \pi_i(\cdot \mid s_{t+1})$. As further discussed in the related work (Section 4), the use of policy evaluation is different from the Q-learning approach pursued in Fujimoto et al. (2018); Kumar et al. (2019), which requires a maximum over actions and may be more susceptible to overestimation of Q-values. We find that when enough samples are taken and the policy is appropriately regularized (see Section 3.2), learning is stable without additional modifications.

3.2 Prior Learning and Policy Improvement
In the policy improvement step we solve the following constrained optimization problem:

(2)  $\pi_{i+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}} \Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \hat{Q}_i(s, a) \big] \Big]$

s.t.  $\mathbb{E}_{s \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\text{prior}}(\cdot \mid s) \big) \Big] < \epsilon,$

where $\mathcal{D}$ is the behavior data, $\pi$ is the policy being learned, and $\pi_{\text{prior}}$ is the prior policy. This is similar to the policy improvement step in Abdolmaleki et al. (2018), but instead of enforcing closeness to the previous policy, here the constraint is with respect to a separately learned “prior” policy, the behavior model. The role of $\pi_{\text{prior}}$ in Equation 2 is to keep the policy close to the regime of the actions found in $\mathcal{D}$. We consider two different ways to express this idea by learning a prior alongside the policy optimization.
For learning the prior, we first consider simply modeling the raw behavior data. This is similar to the approach of BCQ and BEAR-QL (Fujimoto et al., 2018; Kumar et al., 2019), but we use a parametric behavior model and measure distance by a KL divergence; we refer to the related work for a discussion. The behavior model can be learned by maximizing the log likelihood of the observed data:

(3)  $\theta_{\text{bm}} = \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \big[ \log \pi^{\text{bm}}_{\theta}(a \mid s) \big],$

where $\theta_{\text{bm}}$ are the parameters of the behavior model prior $\pi_{\text{bm}}$.
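As a toy illustration of Equation 3 (ours; a real implementation would use a state-conditional network), the maximum-likelihood fit of a simple state-independent Gaussian behavior model has a closed form:

```python
import math

def fit_gaussian_behavior_model(actions):
    """Closed-form maximum-likelihood fit of a 1-D Gaussian to observed actions."""
    n = len(actions)
    mean = sum(actions) / n
    var = sum((a - mean) ** 2 for a in actions) / n
    return mean, var

def avg_log_likelihood(actions, mean, var):
    """Average log-density of the actions under the model (the objective of Eq. 3)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (a - mean) ** 2 / var)
               for a in actions) / len(actions)
```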
Regularizing towards the behavior model can help to prevent the use of unobserved actions, but it may also prevent the policy from improving over the behavior in $\mathcal{D}$. In effect, the simple behavior prior in Equation 3 regularizes the new policy towards the empirical state-conditional action distribution in $\mathcal{D}$. This may be acceptable for datasets dominated by successful trajectories for the task of interest, or when the unsuccessful trajectories are not predictable (i.e. they correspond to random behaviour). Here, however, we are interested in the case where $\mathcal{D}$ is collected from imperfect policies and from multiple tasks. In this case, $\mathcal{D}$ will contain a diverse set of trajectories – both (partially) successful and actively harmful for the target task. With this in mind, we consider a second learned prior, the advantage-weighted behavior model $\pi_{\text{abm}}$, with which we can bias the RL policy to choose actions that are both supported by $\mathcal{D}$ and also good for the current task (i.e. keep doing actions that work). We can formulate this as maximizing the following objective:
(4)  $\theta_{\text{abm}} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \sum_{t=0}^{N-1} \log \pi^{\text{abm}}_{\theta}(a_t \mid s_t) \, f\big( R(\tau_{t:N}) - \hat{V}^{\pi_i}(s_t) \big) \Big],$

with  $R(\tau_{t:N}) = \sum_{k=t}^{N-1} \gamma^{k-t} r(s_k, a_k) + \gamma^{N-t} \hat{V}^{\pi_i}(s_N)$  and  $\hat{V}^{\pi_i}(s) = \mathbb{E}_{a \sim \pi_i}\big[ \hat{Q}_i(s, a) \big],$
where $f$ is an increasing, non-negative function, and the difference $R(\tau_{t:N}) - \hat{V}^{\pi_i}(s_t)$ is akin to an n-step advantage function, calculated here off-policy and representing the “advantage” of the behavior snippet over the policy $\pi_i$. This objective still tries to maximize the log likelihood of observed actions, and avoids taking actions not supported by data. However, by “advantage weighting” we focus the model on “good” actions while ignoring poor actions. We let $f = 1_+$ (the unit step function, with $f(x) = 1$ for $x \geq 0$ and $f(x) = 0$ otherwise) both for simplicity – to keep the number of hyperparameters to a minimum while keeping the prior broad – and because it has an intuitive interpretation: such a prior will start by covering the full data and, over time, filter out trajectories that would lead to worse performance than the current policy, until it eventually converges to the best trajectory snippets contained in the data. We note that Equation 4 is similar to a policy gradient, though samples here stem from the buffer and it will thus not necessarily converge to the optimal policy in itself; $\pi_{\text{abm}}$ will instead only cover the best trajectories in the data, since no importance weighting is performed for the off-policy data. This bias is in fact desirable in a batch RL setting; we want a broad prior that only considers actions present in the data. We also note that we tried several different functions for $f$, including exponentiation of the advantage, but found that the choice of function did not make a significant difference in our experiments.

Using either $\pi_{\text{bm}}$ or $\pi_{\text{abm}}$ as $\pi_{\text{prior}}$, Equation 2 can be solved with a variety of optimization schemes. We experimented with an EM-style optimization following the derivations for the MPO algorithm (Abdolmaleki et al., 2018), as well as directly using the stochastic value gradient of Equation 2 w.r.t. the policy parameters (Heess et al., 2015).
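The advantage filtering of Equation 4 with $f = 1_+$ can be sketched as follows (our toy code; in practice the returns and value estimates come from $\hat{Q}_i$ and the log-probabilities from the prior network):

```python
def step_function_weights(snippet_returns, value_estimates):
    """f = 1_+ weights: keep a snippet step iff its n-step return matches or beats
    the current policy's value estimate (a binary 'advantage' filter)."""
    return [1.0 if ret - v >= 0.0 else 0.0
            for ret, v in zip(snippet_returns, value_estimates)]

def abm_objective(log_probs, weights):
    """Advantage-weighted log-likelihood: sum_t w_t * log pi_abm(a_t | s_t)."""
    return sum(w * lp for w, lp in zip(weights, log_probs))
```

Early in training the value estimates are low, so most weights are 1 and the prior covers the full data; as the policy improves, more snippets are filtered out.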
It should be noted that, if the prior itself is already good enough to solve the task, we can learn with it directly – e.g. if the data stems from an expert or has sufficiently high quality. In this case learning both $\hat{Q}$ and $\pi_{\text{abm}}$ becomes independent of the RL policy improvement step; if we then set $\pi = \pi_{\text{abm}}$, skipping the policy improvement step, we obtain a further simplified algorithm consisting only of learning $\hat{Q}$ and $\pi_{\text{abm}}$ (see Figure 9 for an ablation).
EM-style optimization
We can optimize the objective from Equation 2 using a two-step procedure. Following Abdolmaleki et al. (2018), we first notice that the optimal policy for Equation 2 can be expressed as $\pi^{\text{opt}}(a \mid s) \propto \pi_{\text{prior}}(a \mid s) \exp\big( \hat{Q}_i(s, a) / \eta \big)$, where $\eta$ is a temperature that depends on the $\epsilon$ used for the KL constraint and can be found automatically by convex optimization (Appendix B.1). Conveniently, we can sample from this distribution by querying $\hat{Q}_i$ using samples from $\pi_{\text{prior}}$. These samples can then be used to learn the parametric policy $\pi_{\theta}$ by minimizing the divergence $\mathrm{KL}\big( \pi^{\text{opt}} \,\|\, \pi_{\theta} \big)$, which is equivalent to maximizing the weighted log likelihood
(5)  $\max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D}} \Big[ \mathbb{E}_{a \sim \pi_{\text{prior}}(\cdot \mid s)} \big[ \exp\big( \hat{Q}_i(s, a) / \eta \big) \log \pi_{\theta}(a \mid s) \big] \Big],$
which we optimize via gradient descent, subject to an additional trust-region constraint on the change in $\pi_{\theta}$ between updates to ensure conservative steps (Appendix B.1).
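The self-normalized exponential weights that define the non-parametric target distribution (and hence the weighted likelihood of Equation 5) can be computed as in this sketch of ours, with the standard max-subtraction for numerical stability:

```python
import math

def mpo_sample_weights(q_values, eta):
    """Weights proportional to exp(Q(s, a_j) / eta) over actions sampled from the
    prior, normalized per state; large eta -> near-uniform, small eta -> greedy."""
    scaled = [q / eta for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exp_w = [math.exp(s - m) for s in scaled]
    z = sum(exp_w)
    return [w / z for w in exp_w]
```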
Stochastic value gradient optimization
Alternatively, we can use Lagrangian relaxation to turn Equation 2 into an objective amenable to gradient descent. Inserting the learned prior for $\pi_{\text{prior}}$ and relaxing the constraint results in

(6)  $\max_{\theta} \min_{\alpha > 0} \; \mathbb{E}_{s \sim \mathcal{D}} \Big[ \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \big[ \hat{Q}_i(s, a) \big] + \alpha \big( \epsilon - \mathrm{KL}\big( \pi_{\theta}(\cdot \mid s) \,\big\|\, \pi_{\text{prior}}(\cdot \mid s) \big) \big) \Big]$

for the Lagrange multiplier $\alpha$ and policy parameters $\theta$, which we can optimize by alternating gradient descent steps on $\theta$ and $\alpha$ respectively, taking the stochastic gradient of the Q-value (Heess et al., 2015) through the sampling of $a \sim \pi_{\theta}$ via reparameterization. See Appendix B.2 for a derivation of this gradient.
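One alternating step of the relaxed objective can be sketched as follows (our scalar toy; in practice these are batched network outputs and automatic-differentiation gradients):

```python
def lagrangian_step(q_value, kl, epsilon, alpha, alpha_lr=0.1):
    """One alternating update for the relaxed constrained objective: the policy
    ascends on Q + alpha * (eps - KL), while alpha does gradient descent on
    alpha * (eps - KL), growing when the KL constraint is violated and
    shrinking when it is slack."""
    policy_objective = q_value + alpha * (epsilon - kl)
    new_alpha = max(alpha - alpha_lr * (epsilon - kl), 1e-6)  # keep alpha positive
    return policy_objective, new_alpha
```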
4 Related Work
A number of off-policy RL algorithms have been developed since the inception of the RL paradigm (see e.g. Sutton and Barto (2018) for an overview). Most relevant for our work, some of these have been studied in combination with function approximators (for estimating value functions and policies) with an eye on convergence properties in the batch RL setting. In particular, several papers have theoretically analyzed the accumulation of bootstrapping errors in approximate dynamic programming (Bertsekas and Tsitsiklis, 1996; Munos, 2005) and approximate policy iteration (Farahmand et al., 2010; Scherrer et al., 2015); for the latter there exist well-known algorithms that are stable at least with linear function approximation (see e.g. Lagoudakis and Parr (2003)). Work on RL with nonlinear function approximators has mainly considered the “online” or “growing batch” settings, where additional exploration data is collected (Ernst et al., 2005; Riedmiller, 2005; Ormoneit and Sen, 2002), though some success for batch RL in discrete domains has been reported (Agarwal et al., 2019). For continuous action domains, however, off-policy algorithms that are commonly used with powerful function approximators fail in the fixed batch setting.
Prior work has identified the cause of these failures as extrapolation or bootstrapping errors (Fujimoto et al., 2018; Kumar et al., 2019), which occur due to a failure to accurately estimate Q-values, especially for state-action pairs not present in the fixed dataset. Greedy exploitation of such misleading Q-values (e.g. due to a max operation) can then cause further propagation of such errors in the Bellman backup, and lead to inappropriate action choices during policy execution (resulting in suboptimal behavior). In non-batch settings, new data gathered during exploration allows the Q-function to be corrected. In the batch setting, however, this feedback loop is broken, and correction never occurs.
To mitigate these problems, previous algorithms based on Q-learning identified two potential solutions: 1) correcting for overly optimistic Q-values in the Bellman update, and 2) restricting the policy from taking actions unlikely to occur in the data. To address 1), prior work uses a Bellman backup operator in which the max operation is replaced by a generative model of actions (Fujimoto et al., 2018), which a learned policy is only allowed to minimally perturb, or by a maximum over actions sampled from a policy which is constrained to stay close to the data (Kumar et al., 2019) (implemented through a constraint on the distance to a model of the empirical data, measured either in terms of maximum mean discrepancy or relative entropy). To further penalize uncertainty in the Q-values, this can be combined with Clipped Double-Q learning (Fujimoto et al., 2018) or an ensemble of Q-networks (Kumar et al., 2019). To address 2), prior work uses a similarly constrained policy also during execution, by considering only actions sampled from the perturbed generative model (Fujimoto et al., 2018) or the constrained policy (Kumar et al., 2019), and choosing the best among them.
Our work is based on a policy iteration scheme instead of Q-learning – exchanging the max for an expectation. Thus we directly learn a parametric policy that we also use for execution. We estimate the Q-function as part of the policy evaluation step with standard TD(0) backups. We find that for an appropriately constrained policy no special treatment of the backup operator is necessary, and that it is sufficient to simply use an adequate number of samples to approximate the expectation over actions when estimating the TD target (see Equation 1). The only modification required is in the policy improvement step, where we constrain the policy to remain close to the adaptive prior in Equation 2. As we demonstrate in the empirical evaluation, it is the particular nature of the adaptive prior – which can adapt to the task at hand (see Equation 4) – that makes this constraint work well. Additional measures to account for uncertainty in the Q-values could also be integrated into our policy evaluation step, but we did not find this to be necessary for this work; we thus forego it in favor of our simpler procedure.
Our policy iteration scheme also bears similarity to previous works that use (relative) entropy regularized policy updates, which implement constraints with respect to either a fixed (e.g. uniform) policy (e.g. Haarnoja et al., 2018a) or, in a trust-region like scheme, the previous policy (e.g. Abdolmaleki et al., 2018). Other work has also focused on policy priors that are optimized to be different from the actual policy, but so far mainly in the multi-task or transfer-learning setup, i.e. to share knowledge across tasks or to transfer knowledge to new tasks (Teh et al., 2017; Galashov et al., 2019; Tirumala et al., 2019; Jaques et al., 2017). The constrained updates are also related to trust-region optimization in action space, e.g. in TRPO / PPO (Schulman et al., 2015, 2017) and MPO (Abdolmaleki et al., 2018), which ensures stable learning in the standard RL setting by enforcing conservative updates. The idea of conservative policy optimization can be traced back to Kakade and Langford (2002). Here we take a slightly different perspective: we enforce a trust-region constraint not with respect to the last policy in the policy optimization loop (conservative updates) but with respect to the advantage-weighted behavior distribution.

5 Experiments
We experiment with continuous control tasks in two different settings. In a first set of experiments we compare our algorithm to strong offpolicy baselines on tasks from the DeepMind control suite (Tassa et al., 2018) – to give a reference point as to how our algorithm performs on common benchmarks. We then turn to the more challenging setting of learning multiple tasks involving manipulation of blocks using a robot arm in simulation. These span tasks from reaching toward a block to stacking one block on top of another. Finally, we experiment with analogous tasks on a real robot.
We use the same networks for all algorithms that we compare, optimize parameters using Adam (Kingma and Ba, 2015), and utilize proprioceptive features (e.g. joint positions / velocities) together with task-relevant information (MuJoCo state for the control suite, and position/velocity estimates of the blocks for the manipulation tasks). All algorithms were implemented in the same framework, including our reproductions of BCQ and BEAR, and differ only in their update rules. Note that for BEAR we use a KL divergence instead of the MMD, as we found this to work well; see the appendix. In the multi-task setting (Section 5.1) we learn a task-conditional policy and Q-function, where a one-hot encoding of the task identifier is provided as an additional network input. We refer to the appendix for additional details.
5.1 Control Suite Experiments
We start by performing experiments on tasks from the DeepMind control suite, including Cheetah, Hopper, and Quadruped. To obtain data for the offline learning experiments we first generate a fixed dataset via a standard learning run using MPO, storing all transitions generated; we repeat this with 5 seeds for each environment. We then separate this collected data into two sets: for experiments in the high-data regime, we use the first 10,000 episodes generated from each seed. For experiments with low-quality data we use the first 2,000 episodes from each seed. The high-data regime therefore has both more data and data from policies which are of higher quality on average. A plot showing the performance of the initial training seeds over episodes is given in the appendix (Figure 6).
For our offline learning experiments, we reload this data into a replay buffer. (Due to memory constraints, it can be prohibitive to store the full dataset in replay. To circumvent this problem we run a set of "restorers", which read data from disk and add it back into the replay in a loop.) The dataset is then fixed and no new transitions are added; the offline learner never receives any data that any of its current or previous policies have generated. We evaluate performance by concurrently testing the policy in the environment. The results of this evaluation are shown in Figure 1. As can be observed, standard off-policy RL algorithms (MPO / SVG) can learn some tasks offline with enough data, but learning is unstable even on these relatively simple control suite tasks – confirming previous findings from Fujimoto et al. (2018); Kumar et al. (2019). In contrast, the other methods learn stably in the high-data regime, with BCQ lagging behind on Hopper and Quadruped (sticking too closely to the VAE prior actions, an effect already observed by Kumar et al. (2019)). Remarkably, our simple method of combining a policy iteration loop with a behavior model prior (BM+MPO in the plot) performs as well as or better than the more complex baselines (BEAR and BCQ from the literature). Further improvement can be obtained using our advantage-weighted behavior model, even in some of these simple domains (ABM+SVG and ABM+MPO). Comparing the performance of the priors (BM[prior] vs. ABM[prior], dotted lines) on Hopper, we can understand the advantage that ABM has over BM: the BM prior performs well on simple tasks, but struggles when the data contains conflicting trajectories (as in Hopper, where some of the seeds learn suboptimal jumping), leading to an overly hard constraint on the RL policy. Interestingly, the ABM prior itself performs as well as or better than the baseline methods on the control suite domains.
Furthermore, in additional experiments presented in the appendix we find that competitive performance can be achieved, in simple domains, when training only an ABM prior (effectively setting $\pi = \pi_{\text{abm}}$), providing an even simpler method when one does not care about squeezing out every last bit of performance. A test on lower-quality data (Figure 2) shows similar trends, with our method learning to perform slightly better than the best trajectories in the data.
5.2 Simulated Robot Experiments
We experiment with a Sawyer robot arm simulated in MuJoCo (Todorov et al., 2012) in a multi-task setting, as described above. The seven tasks are to manipulate blocks that are placed in the workspace of the robot. They include: reaching for the green block (Reach), grasping any block (Grasp), lifting the green block (Lift), hovering the green block over the yellow block (Place Wide), hovering the green block over the center of the yellow block (Place Narrow), and stacking the green block on top of the yellow one (Stack, and Stack/Leave, i.e. without gripper contact). To generate the data for this experiment we again run MPO – here simultaneously learning task-conditional policies for all seven tasks. Data was collected by randomly switching tasks after each episode (of 200 control steps), with random resets of the robot position every 20 episodes. As before, data from all executed tasks is collected in one big dataset, annotating each trajectory snippet with all rewards (i.e. this is similar to the SAC-R setting from Riedmiller et al. (2018)).
During offline learning we then compare the performance of MPO and RL with a behavior modelling prior (BM+MPO and ABM+MPO). As shown in Figure 3, behavioral modelling priors improve performance across all tasks over standard MPO – which struggles in these more challenging tasks. This is likely due to the sequential nature of the tasks: later tasks implicitly include earlier tasks, but only a smaller fraction of trajectories achieve success on stack and leave (and the actions needed for stacking conflict, e.g., with lifting); this causes the BM prior to be overly broad (see plots in the appendix). ABM+MPO, on the other hand, achieves high performance across all tasks. Interestingly, even with ABM in place, the RL policy learned using this prior still outperforms the prior, demonstrating that RL is still useful in this setting.
As an additional experiment we test whether we can learn new tasks entirely from previously recorded data. Since our rewards are specified as functions of observations, we compute rewards for two new tasks (bringing the green block to the center and bringing it to the corner) for the entire dataset – we then test the resulting policy in the simulator. As depicted in Figure 4, this is successful with ABM+MPO, demonstrating that we can learn tasks which were not originally executed in the dataset (as long as trajectory snippets that lead to successful task execution are contained in the data).
5.3 Real Robot Experiments
Finally, to validate that our approach is a feasible solution for fast learning in real-robot experiments, we perform an experiment using a real Sawyer arm and the same set of seven tasks (implemented on the real robot) from Section 5.2. As before, data from all executed tasks is collected in one big dataset, annotating each trajectory snippet with all rewards. The full buffer after about two weeks of real robot training is used as the data for offline learning, which we here only performed with ABM+MPO due to the costly evaluation. The goal is to re-learn all seven original tasks. Figure 5 shows the results of this experiment – we ran an evaluation script on the robot, continuously testing the offline learned policy, and stopped when there was no improvement in average reward (as measured over a window of 50 episodes). As can be seen, ABM with MPO as the optimizer manages to reliably re-learn all seven tasks purely from the logged data in less than 12 hours. All tasks can be learned jointly with only small differences in convergence time – while during the initial training run the harder tasks, of course, took the most time to learn. This suggests that gathering large datasets of experience from previous robot learning experiments, and then quickly extracting the skills of interest, might be a viable strategy for making progress in robotics.
6 Conclusion
In this work, we considered the problem of stable learning from logged experience with offpolicy RL algorithms. Our approach consists of using a learned prior that models the behavior distribution contained in the data (the advantage weighted behavior model) towards which the policy of an RL algorithm is regularized. This allows us to avoid drawing conclusions for which there is no evidence in the data. Our approach is robust to large amounts of suboptimal data, and compares favourably to strong baselines on standard continuous control benchmarks. We further demonstrate that our approach can work in challenging robot manipulation domains – learning some tasks without ever seeing a single trajectory for them.
References
Abdolmaleki, A., et al. (2018). Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
Agarwal, R., et al. (2019). Striving for simplicity in off-policy deep reinforcement learning. CoRR, abs/1907.04543.
Bertsekas, D. P., and Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Ernst, D., et al. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR).
Farahmand, A. M., et al. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NeurIPS) 23.
Fujimoto, S., et al. (2018). Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900.
Galashov, A., et al. (2019). Information asymmetry in KL-regularized RL. In International Conference on Learning Representations (ICLR).
Haarnoja, T., et al. (2018a). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML).
Haarnoja, T., et al. (2018b). Soft actor-critic algorithms and applications. CoRR, abs/1812.05905.
Heess, N., et al. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems 28 (NeurIPS).
Jaques, N., et al. (2017). Sequence tutor: conservative fine-tuning of sequence generation models with KL-control. In Proceedings of the 34th International Conference on Machine Learning (ICML).
Kakade, S., and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML).
Kalashnikov, D., et al. (2018). QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. CoRR, abs/1806.10293.
Kingma, D. P., and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
Kingma, D. P., and Ba, J. (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Kumar, A., et al. (2019). Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949.
Lagoudakis, M. G., and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research (JMLR).
Lange, S., et al. (2011). Batch reinforcement learning. In Reinforcement Learning: State of the Art, M. Wiering and M. van Otterlo (Eds.).
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature.
Munos, R. (2005). Error bounds for approximate value iteration. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI).
Ormoneit, D., and Sen, Ś. (2002). Kernel-based reinforcement learning. Machine Learning.
Rezende, D. J., et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML).
Riedmiller, M., et al. (2018). Learning by playing – solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567.
Riedmiller, M. (2005). Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning (ECML).
Scherrer, B., et al. (2015). Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research (JMLR).
Schulman, J., et al. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML).
Schulman, J., et al. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.
Sutton, R. S., and Barto, A. G. (2018). Reinforcement learning: an introduction. Second edition, The MIT Press.
Tassa, Y., et al. (2018). DeepMind control suite. CoRR, abs/1801.00690.
Teh, Y. W., et al. (2017). Distral: robust multitask reinforcement learning. CoRR, abs/1707.04175.
Tirumala, D., et al. (2019). Exploiting hierarchy for learning and transfer in KL-regularized RL. arXiv:1903.07438.
Todorov, E., et al. (2012). MuJoCo: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033.
Appendix A Algorithm
A full algorithm listing for our procedure is given in Algorithm 1.
Appendix B Details on Policy Improvement
We here give additional details on the implementation of the policy improvement step in our algorithm. Depending on the policy optimizer used (MPO or SVG) different update rules are used to maximize the objective given in Equation 4. This is also outlined in Algorithm 1. We here describe the general form for both algorithms, using $\pi_{prior}$ to represent the prior, which will then be instantiated as either the ABM or BM prior.
B.1 MPO
For the EM-style optimization based on MPO we first notice that the optimal non-parametric policy that respects the KL constraint wrt. $\pi_{prior}$ is
$q(a|s) = \frac{1}{Z(s)} \pi_{prior}(a|s) \exp\big(Q(s,a)/\eta\big),$   (7)
with $Z(s) = \int \pi_{prior}(a|s) \exp\big(Q(s,a)/\eta\big) \, da$ and where $\eta$ is a temperature that depends on the desired constraint $\epsilon$. In practice we estimate $Z(s)$ based on the $N$ samples $\{a_j\}_{j=1}^{N} \sim \pi_{prior}(\cdot|s)$ that we draw for each state to perform the optimization of $\pi_\theta$. That is, we set $Z(s) \approx \frac{1}{N} \sum_{j=1}^{N} \exp\big(Q(s, a_j)/\eta\big)$. Using these samples we can then optimize for $\eta$ in a way analogous to what is described in Abdolmaleki et al. (2018). Specifically, we find that the objective for finding $\eta$ is
$\min_\eta g(\eta) = \eta \epsilon + \eta \, \mathbb{E}_{s \sim \mathcal{D}} \Big[ \log \int \pi_{prior}(a|s) \exp\big(Q(s,a)/\eta\big) \, da \Big],$   (8)
which we can approximate based on a batch of trajectories $B$ sampled from $\mathcal{D}$ (sampling $N$ actions $\{a_j\}_{j=1}^{N} \sim \pi_{prior}(\cdot|s)$ for each state $s$ therein); using these samples we obtain
$\tilde{g}(\eta) = \eta \epsilon + \eta \frac{1}{|B|} \sum_{s \in B} \log \frac{1}{N} \sum_{j=1}^{N} \exp\big(Q(s, a_j)/\eta\big),$   (9)
which can readily be differentiated wrt. $\eta$. We then use Adam (with standard settings) to take a gradient step in the direction of $-\nabla_\eta \tilde{g}(\eta)$ (we want to minimize $\tilde{g}$) for each batch. We start our optimization with a large value of $\eta$ to ensure stable optimization (i.e. to avoid large changes in the policy parameters in the beginning of optimization). Further, after each gradient step, we project $\eta$ to the positive numbers, i.e. we set $\eta = \max(\eta, \eta_{min})$ for a small positive constant $\eta_{min}$, as $\eta$ is required to be positive. We find that this procedure is capable of fulfilling the desired KL constraints well.
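As an illustration, the temperature optimization above can be sketched in plain Python. This is our own minimal sketch, not the paper's implementation: it replaces Adam with plain gradient descent on a numerical derivative, and all function names and hyperparameter values here are illustrative.

```python
import math

def eta_dual(eta, q_batch, epsilon):
    """Sampled dual objective: eta*eps + eta * mean_s log (1/N sum_j exp(Q(s,a_j)/eta))."""
    total = 0.0
    for qs in q_batch:  # one list of N sampled Q-values per state
        m = max(q / eta for q in qs)  # log-sum-exp shift for numerical stability
        total += m + math.log(sum(math.exp(q / eta - m) for q in qs) / len(qs))
    return eta * epsilon + eta * total / len(q_batch)

def optimize_eta(q_batch, epsilon, eta_init=1.0, lr=0.01, steps=500):
    """Minimize the dual wrt. eta by gradient descent, projecting eta back to positive values."""
    eta = eta_init
    for _ in range(steps):
        h = min(1e-4, eta / 2)  # central finite difference kept inside eta > 0
        grad = (eta_dual(eta + h, q_batch, epsilon)
                - eta_dual(eta - h, q_batch, epsilon)) / (2 * h)
        eta = max(eta - lr * grad, 1e-8)  # projection to the positive numbers
    return eta
```

The resulting temperature then defines the weights $\exp\big(Q(s,a_j)/\eta\big)$ used in the weighted maximum likelihood step that follows.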
The same batch, and action samples, are then also used to take an optimization step for the policy parameters $\theta$. In particular we find the parametric policy by minimizing the divergence $\mathrm{KL}\big(q(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)$, which is equivalent to maximizing the weighted log likelihood of sampled actions (Equation 5). We only take $N$ samples here, which is a relatively crude representation of the behavior model at state $s$; therefore, to prevent the policy from converging too quickly it can be useful to employ an additional trust region constraint in this step that ensures slow convergence. We do this by adjusting the maximum likelihood objective in Equation 5 to contain an additional KL regularization towards the previous policy, yielding the following optimization problem using samples $\{a_j\}_{j=1}^{N}$:
$\max_\theta \min_{\alpha > 0} \sum_{s \in B} \bigg[ \sum_{j=1}^{N} \exp\big(Q(s, a_j)/\eta\big) \log \pi_\theta(a_j|s) + \alpha \Big( \epsilon' - \mathrm{KL}\big(\pi_{\theta'}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \Big) \bigg],$   (10)
for a small $\epsilon' > 0$, which is a Lagrangian relaxation to the maximum likelihood problem under the additional constraint that $\mathrm{KL}\big(\pi_{\theta'}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \leq \epsilon'$, where $\pi_{\theta'}$ is the previous policy and $\alpha$ is the Lagrange multiplier. This objective can be differentiated wrt. both $\theta$ and $\alpha$ and we simply take alternating gradient descent steps (one per batch for both $\theta$ and $\alpha$) using Adam (Kingma and Ba, 2015) (starting with a random $\theta$ and an initial $\alpha$) and projecting $\alpha$ back to the positive regime if it becomes negative; i.e. we set $\alpha = \max(\alpha, \alpha_{min})$ for a small positive constant $\alpha_{min}$.
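To make the alternating optimization of the policy and the multiplier concrete, here is a deliberately simplified one-dimensional sketch (our own illustration, not the paper's code): a single Gaussian mean is fitted by weighted maximum likelihood while a Lagrange multiplier enforces a KL trust region towards the previous mean. All names and hyperparameters are ours.

```python
def fit_weighted_gaussian_mean(actions, weights, mu_old, sigma=1.0,
                               eps_prime=0.01, lr_mu=0.01, lr_alpha=1.0,
                               steps=5000):
    """1-D sketch of the trust-region weighted MLE: alternate ascent on mu and
    descent on alpha for
        L = sum_j w_j log N(a_j; mu, sigma^2) + alpha * (eps' - KL(old || new)),
    where the KL between equal-variance Gaussians is (mu_old - mu)^2 / (2 sigma^2).
    """
    mu, alpha = mu_old, 1.0
    for _ in range(steps):
        kl = (mu_old - mu) ** 2 / (2 * sigma ** 2)
        # dL/dmu: weighted log-likelihood term plus the KL penalty term
        grad_mu = (sum(w * (a - mu) for a, w in zip(actions, weights))
                   + alpha * (mu_old - mu)) / sigma ** 2
        grad_alpha = eps_prime - kl            # dL/dalpha
        mu += lr_mu * grad_mu                  # gradient ascent on mu
        alpha = max(alpha - lr_alpha * grad_alpha, 1e-6)  # descent, kept positive
    return mu, alpha
```

With informative weights the mean moves towards high-weight actions only as far as the trust region allows; the multiplier grows until the KL constraint is approximately active.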
B.2 SVG
To optimize the policy parameters $\theta$ via the stochastic value gradient (Heess et al., 2015) (under a KL constraint) we can directly calculate the derivative of the Lagrangian relaxation from Equation 6. In particular, again assuming that we have sampled a batch $B$, the gradient wrt. $\theta$ can be obtained via the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014). For this we first require that a sample from our policy can be obtained via a deterministic function applied to a standard noise source. We first specify the policy class used in the paper to be that of Gaussian policies, parameterized as $\pi_\theta(a|s) = \mathcal{N}\big(a; \mu_\theta(s), \mathrm{diag}(\sigma_\theta(s))^2\big)$, where $\mathcal{N}$ denotes the pdf of a Normal distribution and where we assume the mean $\mu_\theta(s)$ is directly given as one output of the network, whereas we parameterize the diagonal standard deviation as $\sigma_\theta(s) = \log\big(1 + \exp(\hat{\sigma}_\theta(s))\big)$ with $\hat{\sigma}_\theta(s)$ being output by the neural network. We can then obtain samples via the deterministic transformation $a = g_\theta(s, \zeta) = \mu_\theta(s) + \sigma_\theta(s) \odot \zeta$, where $\zeta \sim \mathcal{N}(0, I)$ ($I$ being the identity matrix). Using this definition we can obtain the following expression for the value gradient:
$\nabla_\theta \frac{1}{|B|} \sum_{s \in B} \Big( Q\big(s, g_\theta(s, \zeta)\big) - \alpha \, \mathrm{KL}\big(\pi_\theta(\cdot|s) \,\|\, \pi_{prior}(\cdot|s)\big) \Big) = \frac{1}{|B|} \sum_{s \in B} \Big( \nabla_a Q(s, a)\big|_{a = g_\theta(s, \zeta)} \nabla_\theta g_\theta(s, \zeta) - \alpha \nabla_\theta \mathrm{KL}\big(\pi_\theta(\cdot|s) \,\|\, \pi_{prior}(\cdot|s)\big) \Big),$   (11)
where we use the Gaussian samples $\zeta \sim \mathcal{N}(0, I)$. The gradient for the Lagrangian multiplier is given as
$\nabla_\alpha \, \alpha \Big( \epsilon - \frac{1}{|B|} \sum_{s \in B} \mathrm{KL}\big(\pi_\theta(\cdot|s) \,\|\, \pi_{prior}(\cdot|s)\big) \Big) = \epsilon - \frac{1}{|B|} \sum_{s \in B} \mathrm{KL}\big(\pi_\theta(\cdot|s) \,\|\, \pi_{prior}(\cdot|s)\big),$   (12)
where we dropped terms independent of $\alpha$ on the right-hand side. Following $\nabla_\theta$ to maximize the objective, and conversely moving in the opposite direction of $\nabla_\alpha$ to minimize the objective wrt. $\alpha$, can then be performed by taking alternating gradient steps. We perform one step per batch for both $\theta$ and $\alpha$ via Adam, starting from an initially random $\theta$ and $\alpha$. As in the MPO procedure we ensure that $\alpha$ is positive by projecting it to the positive regime after each gradient step.
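The reparameterization step at the heart of this gradient can be illustrated in isolation. The following sketch (ours; the quadratic `q_grad` and all values are illustrative) estimates the pathwise gradient of $\mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}[Q(a)]$ wrt. the mean by sampling standard noise and differentiating through the deterministic transform:

```python
import random

def reparam_grad_mu(q_grad, mu, sigma, n_samples=20000, seed=0):
    """Pathwise estimate of d/dmu E_{a ~ N(mu, sigma^2)}[Q(a)].

    With a = mu + sigma * zeta and zeta ~ N(0, 1), the chain rule gives
    dQ/dmu = Q'(a) * da/dmu = Q'(a), averaged over noise samples.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        zeta = rng.gauss(0.0, 1.0)   # standard noise source
        a = mu + sigma * zeta        # deterministic transform g(mu, zeta)
        total += q_grad(a)           # Q'(a) * d a / d mu, and d a / d mu = 1
    return total / n_samples
```

For a quadratic $Q(a) = -(a - c)^2$ the true gradient is $-2(\mu - c)$, which the estimator recovers up to Monte Carlo noise; in the full algorithm the same pathwise term is combined with the KL penalty term of Equation (11).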
Appendix C Details on the experimental setup
C.1 Hyperparameters and network architecture
The hyperparameters and network architectures used for all algorithms are listed in Tables 1 to 5.
C.2 Details on BCQ and BEAR
To provide strong off-policy learning baselines we re-implemented BCQ (Fujimoto et al., 2018) and BEAR (Kumar et al., 2019) in the same framework that we used to implement our own algorithm. As mentioned in the main paper we used the same network architecture for all algorithms. Algorithm-specific hyperparameters were tuned via a coarse grid search on the control suite tasks, while following the advice from the original papers on good parameter ranges. To avoid bias in our comparisons we did not utilize ensembles of Q-functions for any of the methods (e.g. we removed them from BEAR); we note that ensembling did not seem to have a major impact on performance (see the appendix of Kumar et al. (2019)). Parameters for all methods were optimized with Adam. Furthermore, to apply BEAR and BCQ in the multi-task setting we employed the same conditioning of the policy and Q-function on a one-hot task vector (which is used to select among multiple network "heads", yielding per-task parameters, see description below).
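The head selection mentioned above can be sketched as follows (a toy illustration in plain Python with scalar heads; in the actual setup the heads are learned network layers, and the function names here are ours):

```python
def one_hot(task_id, num_tasks):
    """Encode a task index as a one-hot vector."""
    v = [0.0] * num_tasks
    v[task_id] = 1.0
    return v

def multi_head_output(shared_features, heads, task_onehot):
    """Evaluate all per-task heads on shared features and let the one-hot
    task vector select the active head (yielding per-task parameters)."""
    outputs = [head(shared_features) for head in heads]
    return sum(w * o for w, o in zip(task_onehot, outputs))
```

Because the selection is a weighted sum against a one-hot vector, gradients only flow into the head belonging to the current task, while the shared trunk is trained on all tasks.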
For BCQ we used a range of [-0.25, 0.25] for the perturbative actions generated by the DDPG-trained network, and chose a latent dimensionality of 64 for the VAE; see Table 3 for the full hyperparameters.
For BEAR we used a KL constraint rather than the maximum mean discrepancy. This ensures comparability with our method, and we did not see any issues with instability when using a KL. To ensure good satisfaction of the constraint we used the exact same optimization for the Lagrangian multiplier required in BEAR that was also used for our method – see the description of SVG above. The hyperparameters for the BEAR training run on the control suite are given in Table 4.
Table 1: Hyperparameters for MPO.
Hyperparameters  MPO
Policy net  256-256
Prior net  256-256
Number of actions sampled per state  20
Q-function net  256-256-256
$\epsilon$  0.1
Discount factor ($\gamma$)  0.99
Adam learning rate
Replay buffer size
Target network update period  200
Batch size  512
Activation function  elu
Layer norm on first layer  Yes
Tanh on Gaussian mean  No
Min variance  0.01
Max variance  unbounded
Table 2: Hyperparameters for SVG.
Hyperparameters  SVG
Policy net  256-256
Prior net  256-256
Q-function net  256-256-256
$\epsilon$  0.2
Discount factor ($\gamma$)  0.99
Adam learning rate
Replay buffer size
Target network update period  200
Batch size  512
Activation function  elu
Layer norm on first layer  Yes
Tanh on Gaussian mean  No
Min variance  0.01
Max variance  unbounded
Table 3: Hyperparameters for BCQ.
Hyperparameters  BCQ
Encoder net  256-256
Latent size  64
Decoder net  256-256
Perturbation net  256-256
Q-function net  256-256-256
Perturbation scale factor  0.25
Discount factor ($\gamma$)  0.99
Adam learning rate
Replay buffer size
Batch size  512
Table 4: Hyperparameters for BEAR.
Hyperparameters  BEAR
Policy net  256-256
Prior net  256-256
Q-function net  256-256-256
$\epsilon$ for KL constraint  0.2
Action samples for BEAR-QL  20
Discount factor ($\gamma$)  0.99
Adam learning rate
Replay buffer size
Target network update period  200
Batch size  512
Activation function  elu
Layer norm on first layer  Yes
Tanh on Gaussian mean  No
Min variance  0.01
Max variance  unbounded
Table 5: Hyperparameter changes for the multi-task experiments (shared layer size → per-task head size).
Hyperparameters  Multitask
Policy net  200 → 300
Q-function net  400 → 400
Encoder net (BCQ)  256 → 256-256
Decoder net (BCQ)  256-256
Perturbation net (BCQ)  256 → 100
Replay buffer size

c.3 Details on the Robot Experiment Setup
The task setup for both the simulated and real robot experiments is described in the following. A detailed description of the robot setup will be given in an accompanying paper. We nonetheless give a description here for completeness. We make no claim to have contributed these tasks specifically for this paper and merely use them as an evaluation testbed.
As the robot we utilize a Sawyer robotic arm mounted on a table and equipped with a Robotiq 2F-85 parallel gripper. A basket is positioned in front of the robot which contains three cubes (the proportions of cubes and basket sizes are consistent between simulation and reality). Three cameras on the basket track the cubes using augmented reality tags. To model the tasks as an MDP we provide both proprioceptive information from the robot sensors (joint positions, velocities and torques) and the tracked cube position, velocity (both in 3 dimensions) and orientation to the policy and Q-function. Overall the observations provided to the robot are, for proprioception: joint positions (7D, double), velocities (7D, double), torques (7D, double); wrist pose (7D, double), velocity (6D, double), force (6D, double); gripper finger angle (1D, int), velocity (1D, int), grasp flag (1D, binary); and for the object features: object pose (7D, double) averaged over all cameras observing the object, and relative pose between object and gripper (7D, double). In simulation the true object position, velocities and orientation are used instead of running a tracking algorithm.
We control both the robot and the gripper by commanding velocities in the 4-dimensional Cartesian space of the robot's end-effector (specifying three translational velocities plus the velocity for the wrist's rotation) while the gripper control is 1-dimensional. The action limits are [-0.07, 0.07] m/s for Cartesian actions, [-1, 1] rad/s for wrist rotation and [-255, 255] for finger velocity (for units see the gripper specifications). The control rate at which the actions are executed is 20 Hz. Episodes are terminated if the wrist force exceeds 20 N on any axis, encouraging a gentle interaction of the robot with its environment.
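Commanded actions must stay within these per-dimension limits before execution; a small sketch of such a clipping helper (our own code, assuming the symmetric ranges [-0.07, 0.07] m/s, [-1, 1] rad/s and [-255, 255] finger units) is:

```python
# Per-dimension limits assumed from the text: x, y, z velocity [m/s],
# wrist rotation [rad/s], finger velocity [gripper units].
ACTION_LOW = [-0.07, -0.07, -0.07, -1.0, -255.0]
ACTION_HIGH = [0.07, 0.07, 0.07, 1.0, 255.0]

def clip_action(action):
    """Clip a raw 5-D policy action to the robot's command limits."""
    return [min(max(a, lo), hi)
            for a, lo, hi in zip(action, ACTION_LOW, ACTION_HIGH)]
```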
For our experiment we use 7 different tasks to learn. The first 6 tasks are auxiliary tasks for better exploration that help to learn the final task (Stack and Leave), i.e. stacking the green cube on top of the yellow cube. The reward functions for all tasks are given as:

Reach: minimize the distance of the gripper to the green cube.
Grasp: activate the grasp sensor of the gripper.
Lift: 0 if the height of the green block is less than 3 cm, 1 if it is above 10 cm, and shaped in between.
Place Wide: move the green cube to a position 5 cm above the yellow cube.
Place Narrow: like Place Wide but with more precision.
Stack: a binary reward for stacking the green cube on the yellow one and deactivating the grasp sensor.
Stack and Leave: like Stack, but maximum reward additionally requires moving the arm 10 cm above the green cube.
Here $d(a, b)$ denotes the Euclidean distance between $a$ and $b$. We also define two tolerance functions with outputs scaled between 0 and 1, i.e.
(13) 
(14) 
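A generic shaped-tolerance reward of the kind used in Equations (13) and (14) can be sketched as follows. This is our own simplified version: the linear decay and the margin constant are illustrative choices, not the paper's exact definitions.

```python
def tolerance(x, lower, upper, margin):
    """Return 1.0 inside [lower, upper], decaying linearly to 0.0 over `margin` outside."""
    if lower <= x <= upper:
        return 1.0
    dist = (lower - x) if x < lower else (x - upper)
    return max(0.0, 1.0 - dist / margin)

def lift_reward(block_height_m):
    """Shaped Lift reward sketch: 0 below 3 cm, 1 above 10 cm, interpolated in between."""
    return tolerance(block_height_m, lower=0.10, upper=float("inf"), margin=0.07)
```

Such shaping gives the learner a gradient towards the sparse goal region while keeping the reward bounded in [0, 1].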
Appendix D Additional experimental results
We present additional plots that show some aspects of the developed algorithm in more detail.
D.1 Expanded plots for control suite
We provide expanded plots for MPO on the control suite in Figure 7. These show in detail that the learned advantage-weighted behavior model (ABM, in red in the left column) is far superior to the standard behavior model prior, leading to less constrained RL policies.
D.2 Expanded plots for robot simulation
Figure 8 shows full results for the simulated robot stacking task, including all 7 intentions as well as performance of the prior policies themselves during learning. The task set is structured: earlier tasks like reaching and lifting are necessary to perform most other tasks, so the simple behavioral model performs well on these. For the more difficult stacking tasks, however, the presence of conflicting data means the simple behavioral model doesn’t achieve high reward, though it still significantly improves performance of the regularized policy.
D.3 Performance tables for control suite and robot simulation
Table 6 shows final performance for all methods on the control suite tasks, and Table 7 shows final performance on the simulated robotics tasks, to make comparison between algorithms easier. The episode returns are averaged over the final 10% of episodes. ABM provides a performance boost over BM alone, particularly for difficult tasks such as block stacking and quadruped. The RL policy further improves performance on the difficult tasks. As noted in the main paper, one additional option to further simplify the algorithm is to omit the policy improvement step, setting $\pi = \pi_{prior}$ (i.e. considering the case where $\epsilon = 0$) and, conversely, learning the Q-values of the prior. In additional experiments we have found that this procedure roughly recovers the performance of the ABM prior when trained together with MPO (ABM+MPO); i.e. this is an option for a simpler algorithm to implement, at the cost of some performance loss (especially on the most complicated domains). We included this setting as ABM ($\epsilon = 0$) in the table, noting that in this case it is vital to choose short trajectory snippets in order to allow the prior to pick the best action in each state.
Table 6: Final performance on the control suite tasks.
Intention \ Algorithm  ABM + MPO  ABM prior  BM + MPO  BM prior  MPO  ABM + SVG  BM + SVG  SVG  BCQ  BEAR  ABM ($\epsilon = 0$)

cheetah  907.4  888.7  891.8  465.3  496.2  920.7  924.1  726.1  914.9  843  884.3 
hopper  539.3  454.8  453.8  237.2  599.2  651.6  634.7  544.1  341.2  425  451.8 
quadruped  786.6  733.4  710.8  398.8  465.6  856.5  818.7  472.7  621.4  674  733.6 
walker  843.0  835.9  817.7  684.8  891.9  896.5  895.6  891.0  869.5  814  838.1 
Table 7: Final performance on the simulated robot tasks.
Intention \ Algorithm  ABM + MPO  ABM prior  BM + MPO  BM prior  MPO  BCQ  BEAR

Reach  182.3  180.8  178.3  172.9  178.3  172.6  176.9 
Grasp  123.7  111.7  114.6  89.1  86.0  104.7  116.3 
Lift  147.1  144.9  144.2  122.5  121.4  126.4  137.9 
Place Wide  137.9  122.3  124.4  67.2  98.0  68.0  93.2 
Place Narrow  126.3  106.1  111.3  46.7  65.3  58.3  64.1 
Stack  109.9  78.8  86.9  20.1  17.5  39.3  67.8 
Full Task  77.9  47.3  53.5  3.5  0.2  5.2  34.6 