Contextual Policy Reuse using Deep Mixture Models

02/29/2020 ∙ by Michael Gimelfarb, et al.

Reinforcement learning methods that consider the context, or current state, when selecting source policies for transfer have been shown to outperform context-free approaches. However, existing work typically tailors the approach to a specific learning algorithm such as Q-learning, and it is often difficult to interpret and validate the knowledge transferred between tasks. In this paper, we assume knowledge of estimated source task dynamics and policies, and common goals between tasks. We introduce a novel deep mixture model formulation for learning a state-dependent prior over source task dynamics that matches the target dynamics using only state trajectories obtained while learning the target policy. The mixture model is easy to train and interpret, is compatible with most reinforcement learning algorithms, and complements existing work by leveraging knowledge of source dynamics rather than Q-values. We then show how the trained mixture model can be incorporated into standard policy reuse frameworks, and demonstrate its effectiveness on benchmarks from OpenAI-Gym.




1 Introduction

Reinforcement learning (RL) is a general framework for the development of artificial agents that learn to make complex decisions by interacting with their environment. In recent years, off-the-shelf RL algorithms have achieved state-of-the-art performance on simulated tasks such as Atari games (Mnih et al., 2015) and real-world applications (Gu et al., 2017). However, sample efficiency remains a genuine concern for RL in general, and particularly model-free RL.

To address this concern, transfer learning tries to reduce the number of samples required to learn a new (target) task by reusing previously acquired knowledge or skills from other similar (source) tasks (Taylor and Stone, 2009). The knowledge transferred can be represented in many different ways, including low-level information such as sample data (Lazaric, 2008), or high-level knowledge such as value functions (Taylor et al., 2005), task features (Banerjee and Stone, 2007), or skills (Konidaris and Barto, 2007). A major stream of literature focuses on reusing policies (Brys et al., 2015; Parisotto et al., 2015; Barreto et al., 2017; Gupta et al., 2017) because it is intuitive, direct, and does not rely on value functions that can be difficult to transfer, or are not always available by design of the RL algorithm.

Early work on transfer learning in RL focused on the efficient transfer of knowledge between a single source and target task (Taylor et al., 2007). However, the knowledge transferred from multiple source tasks can be more effective (Fernández and Veloso, 2006; Comanici and Precup, 2010; Rosman et al., 2016). Despite recent developments in transfer learning theory, few existing approaches are able to reason about task similarity separately in different parts of the state space. Doing so could lead to significant improvements in knowledge transfer, since only the information from the relevant regions of each source task is selected for transfer in a mix-and-match manner (Taylor and Stone, 2009).

In this paper, we assume the states, actions and rewards are identical between source and target tasks, but their dynamics can vary. Furthermore, the dynamics and optimal policies of the source tasks are estimated prior to transfer. Such formulations are often motivated by practical applications. In the field of maintenance, for instance, practitioners often rely on a digital reconstruction of the machine and its surroundings, called a digital twin, to assist maintenance on the physical asset (Lund et al., 2018). Here, different source tasks could represent models of optimal control under a wide range of conditions (exogenous events such as weather, or physical properties of the machine and other endogenous factors) corresponding to common state, action, and reward, but differing transition dynamics. Such simulations are routinely developed in other areas as well, such as drug discovery (Durrant and McCammon, 2011), robotics (Christiano et al., 2016) or manufacturing (Zhang et al., 2017).

To enable contextual policy transfer in RL, we introduce a novel Bayesian framework for autonomously identifying and combining promising sub-regions from multiple source tasks. This is done by placing state-dependent Dirichlet priors over source task models, and updating them using state trajectories sampled from the true target dynamics while learning the target policy. Specifically, posterior updates are informed by the likelihood of the observed transitions under the source task models. However, explicit knowledge of target dynamics is not necessary, making our approach model-free with respect to the target task. Furthermore, naive tabulation of state-dependent priors is intractable in large or continuous state space problems, so we parameterize them as deep neural networks. This architecture, inspired by the mixture of experts (Jacobs et al., 1991; Bishop, 1994), serves as a surrogate model that can inform the state-dependent contextual selection of source policies for locally exploring promising actions in each state.

Our approach has several key advantages over other existing methods. Firstly, Bayesian inference allows priors to be specified over source task models. Secondly, the mixture model network can benefit from advances in deep network architectures, such as CNNs (Krizhevsky et al., 2012). Finally, our approach separates reasoning about task similarity from policy learning, so that it can be easily combined with different forms of policy reuse (Fernández and Veloso, 2006; Brys et al., 2015) and is easy to interpret, as we demonstrate later in our experiments.

The main contributions of this paper are threefold:

  1. We introduce a contextual mixture model to efficiently learn state-dependent posterior distributions over source task models;

  2. We show how the trained mixture model can be incorporated into existing policy reuse methods, such as directed exploration (MAPSE) and reward shaping (MARS);

  3. We demonstrate the effectiveness and generality of our approach by testing it on problems with discrete and continuous spaces, including physics simulations.

1.1 Related Work

Using state-dependent knowledge to contextually reuse multiple source policies is a relatively new topic in transfer learning. Rajendran et al. (2015) used a soft attention mechanism to learn state-dependent weightings over source tasks, and then transferred either policies or values. Li and Kudenko (2018) proposed Two-Level Q-learning, in which the agent learns to select the most trustworthy source task in each state in addition to the optimal action. The selection of source policies can be seen as an outer optimization problem. The Context-Aware Policy Reuse algorithm of Li et al. (2019) used options to represent selection of source policies as well as target actions, learning Q-values and termination conditions simultaneously. However, these two papers are limited to critic-based approaches with finite action spaces. To fill this gap, Kurenkov et al. (2019) proposed AC-Teach, which uses Bayesian DDPG to learn probability distributions over Q-values corresponding to student and teacher actions, and Thompson sampling for selecting exploratory actions from them. However, their inference technique is considerably different from ours, and is specific to the actor-critic setting. Our paper complements existing work by using source task dynamics rather than Q-values to reason about task similarity, and is compatible with both model-based and model-free RL.

Potential-based reward shaping (PBRS) was first introduced in Ng et al. (1999) for constructing dense reward signals without changing the optimal policies. Later, Wiewiora et al. (2003) and Devlin and Kudenko (2012) extended this to action-dependent and time-varying potential functions, respectively. More recently, Harutyunyan et al. (2015) combined these two extensions into one framework and used it to incorporate arbitrary reward functions into PBRS. Brys et al. (2015) made the connection between PBRS and policy reuse, by turning a single source policy into a binary reward signal and then applying Harutyunyan et al. (2015). Later, Suay et al. (2016) recovered a potential function from policy demonstrations directly using inverse RL. Our paper extends Brys et al. (2015) by reusing multiple source policies in a state-dependent way that is compatible with modern deep RL techniques. Thus, our paper advances the state-of-the-art in policy transfer and contributes to the expanding body of theoretical research into reward shaping.

2 Preliminaries

Markov Decision Process

We follow the framework of Markov decision processes (MDPs) (Puterman, 2014), defined as five-tuples $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$ where: $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P(s' \mid s, a)$ are the state dynamics, $r(s, a, s')$ is a bounded reward function, and $\gamma \in [0, 1)$ is a discount factor. In deterministic problems, the state dynamics are typically represented as a deterministic function $s' = f(s, a)$. The objective of an agent is to find an optimal deterministic policy $\pi^* : \mathcal{S} \to \mathcal{A}$ that maximizes the discounted cumulative reward $\sum_{t=0}^{T} \gamma^t r_t$ over the planning horizon $T$, where $r_t = r(s_t, a_t, s_{t+1})$.

Reinforcement Learning

In the reinforcement learning setting, neither $P$ nor $r$ is assumed to be known by the agent a priori. Instead, an agent collects data by interacting with the environment through a randomized exploration policy $\pi_b(\cdot \mid s) \in \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes a probability distribution over $\mathcal{A}$. In model-based RL (MBRL), the agent uses this data to first learn $P$ and $r$, and then uses these to learn the optimal policy $\pi^*$. In model-free RL, an agent learns the optimal policy directly without knowledge of $P$ or $r$. Model-free RL algorithms typically fall into one of two categories. Temporal difference (TD) or Monte-Carlo (MC) algorithms approximate the optimal Q-values $Q^*(s, a)$ using a tabular (Watkins and Dayan, 1992) or deep neural network representation (Mnih et al., 2015). Policy gradient methods, on the other hand, learn the optimal policy directly (Sutton et al., 2000).

Model Learning

In model-based RL, the dynamics model $P$ or $f$ is typically parameterized as a deep neural network and trained through repeated interaction with the environment. It typically returns an estimate of the next state directly, $\hat{s}' = f_\theta(s, a)$, or approximates its distribution using, for instance, a Gaussian model $\hat{P}_\theta(s' \mid s, a) = \mathcal{N}(s'; \mu_\theta(s, a), \Sigma_\theta(s, a))$. Subsequently, samples from the trained dynamics model can be used to augment the real experience when training the policy (Sutton, 1991; Peng et al., 2018; Kaiser et al., 2019), although other methods for reusing dynamics exist (Todorov and Li, 2005; Levine and Koltun, 2013; Heess et al., 2015; Nagabandi et al., 2018).
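As a concrete sketch of this kind of model learning (a linear model standing in for the deep network, with hypothetical toy dynamics), a deterministic model $\hat{s}' = f_\theta(s, a)$ can be fit by gradient descent on the MSE loss:

```python
import numpy as np

def fit_linear_dynamics(S, A, S_next, lr=0.1, epochs=500):
    """Fit s' ~ [s; a] @ W by gradient descent on the MSE loss.
    A linear stand-in for the deep dynamics networks used in MBRL."""
    X = np.hstack([S, A])                       # (N, ds + da) model inputs
    W = np.zeros((X.shape[1], S_next.shape[1]))
    for _ in range(epochs):
        residual = X @ W - S_next
        W -= lr * X.T @ residual / len(X)       # gradient of 0.5 * mean squared error
    return W

# hypothetical toy dynamics: s' = s + 0.5 * a (deterministic)
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 2))
A = rng.normal(size=(200, 1))
S_next = S + 0.5 * A

W = fit_linear_dynamics(S, A, S_next)
pred = np.hstack([S, A]) @ W                    # one-step predictions
```

In practice the linear map is replaced by a deep network and the data by a replay buffer, but the training signal is the same one-step prediction error.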

Transfer Learning

We are interested in solving the following transfer learning problem. A library of $m$ source tasks, and a single target task, are provided with identical $\mathcal{S}$, $\mathcal{A}$ and $r$ (in practice, we only require that goals of source and target tasks are shared), but different dynamics. Each source task $i$ is then solved to obtain estimates of the optimal policy $\pi_i$, as well as the dynamics $P_i(s' \mid s, a)$ or $f_i(s, a)$. The main objective of this paper is to make use of this knowledge to solve the new target task in an efficient online manner.

3 Model-Aware Policy Reuse

Assuming the dynamics models have been learned for all source tasks, we now proceed to model and learn state-dependent contextual similarity between source tasks and a target task.

3.1 Contextual Mixture Models

We first introduce a state-dependent prior over combinations $\sum_{i=1}^m w_i(s) P_i(s' \mid s, a)$ of source task models, which tries to match the true (unknown) target dynamics using the transition data $\mathcal{D}_t$ collected from the target environment up to the current time $t$. Here, $w(s) = (w_1(s), \dots, w_m(s))$ consists of non-negative elements such that $\sum_{i=1}^m w_i(s) = 1$. Using combinations to model uncertainty in source task selection can be viewed as Bayesian model combination, which allows inference over a general space of hypotheses and has been shown to exhibit stable convergence in practice (Minka, 2000; Monteith et al., 2011).

The motivation for learning a state-dependent prior is that the optimal behaviour in the target task may be locally similar to one source task in one region of the state space, but a different source task in another region of the state space. By reasoning about task similarity locally in different areas of the state space, a reinforcement learning agent can make more efficient use of source task knowledge. Theoretically, better estimates of dynamics lead to better estimates of the value function and hence the optimal policy.

Theorem 1.

Consider an MDP with finite $\mathcal{S}$ and $\mathcal{A}$ and bounded reward $r$. Let $\mathbf{r}$ be the reward function in vector form, $\hat{\mathbf{P}}^\pi$ be an estimate of the transition probabilities induced by a policy $\pi$ in matrix form, and $\hat{\mathbf{v}}^\pi$ be the corresponding value function in vector form. Also, let $\mathbf{P}^\pi$ and $\mathbf{v}^\pi$ be the corresponding values under the true dynamics. Then for any policy $\pi$,

$$\left\lVert \hat{\mathbf{v}}^\pi - \mathbf{v}^\pi \right\rVert_\infty \;\le\; \frac{\gamma \left\lVert \mathbf{r} \right\rVert_\infty}{(1-\gamma)^2} \left\lVert \hat{\mathbf{P}}^\pi - \mathbf{P}^\pi \right\rVert_\infty.$$

This result justifies our methodology of using source task dynamics similarity to guide state-dependent policy reuse from source tasks. A proof is provided in the Appendix.
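The way the dynamics error propagates to the value error can be recovered from the Bellman equations; the following sketch is the standard simulation-lemma argument, not the appendix proof itself:

```latex
% Both value functions satisfy their Bellman equations:
%   v^\pi = r^\pi + \gamma P^\pi v^\pi, \qquad
%   \hat{v}^\pi = r^\pi + \gamma \hat{P}^\pi \hat{v}^\pi.
\begin{align*}
\hat{v}^\pi - v^\pi
  &= \gamma\bigl(\hat{P}^\pi \hat{v}^\pi - P^\pi v^\pi\bigr)
   = \gamma \hat{P}^\pi \bigl(\hat{v}^\pi - v^\pi\bigr)
   + \gamma \bigl(\hat{P}^\pi - P^\pi\bigr) v^\pi, \\
\|\hat{v}^\pi - v^\pi\|_\infty
  &\le \gamma\,\|\hat{v}^\pi - v^\pi\|_\infty
   + \gamma\,\|\hat{P}^\pi - P^\pi\|_\infty\,\|v^\pi\|_\infty.
\end{align*}
% Rearranging, and using \|v^\pi\|_\infty \le \|r\|_\infty / (1 - \gamma):
\[
\|\hat{v}^\pi - v^\pi\|_\infty
  \;\le\; \frac{\gamma\,\|r\|_\infty}{(1-\gamma)^2}\,
          \|\hat{P}^\pi - P^\pi\|_\infty .
\]
```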

In this setting, exact inference over $w(s)$ is intractable, so we model it using a surrogate probability distribution. In particular, since each realization of $w(s)$ is a discrete probability distribution, a suitable prior for $w$ in each state $s$ is a Dirichlet distribution with density

$$P(w \mid s) = \frac{\Gamma\left(\sum_{i=1}^m \alpha_i(s)\right)}{\prod_{i=1}^m \Gamma(\alpha_i(s))} \prod_{i=1}^m w_i^{\alpha_i(s) - 1}, \tag{1}$$

where $\alpha_1, \dots, \alpha_m$ are mappings from $\mathcal{S}$ to $(0, \infty)$.

Next, by averaging out the uncertainty in $w$, we can obtain an a-posteriori estimator of target dynamics:

$$\hat{P}(s' \mid s, a) = \mathbb{E}_{w \sim P(\cdot \mid s)}\left[\sum_{i=1}^m w_i P_i(s' \mid s, a)\right] = \sum_{i=1}^m \frac{\alpha_i(s)}{\sum_{j=1}^m \alpha_j(s)}\, P_i(s' \mid s, a). \tag{2}$$

In the following sections, we will instead refer to the following normalized form of (2):

$$\hat{P}(s' \mid s, a) = \sum_{i=1}^m \xi_i(s)\, P_i(s' \mid s, a), \tag{3}$$

where $\xi_i(s) = \alpha_i(s) / \sum_{j=1}^m \alpha_j(s)$ is the mean of a Dirichlet random variable with density (1) and $\sum_{i=1}^m \xi_i(s) = 1$. Therefore, the posterior estimate of target dynamics (3) can be represented as a mixture of source task models.

In a tabular setting, it is feasible to maintain separate estimates of $\alpha(s)$ per state using Bayes' rule,

$$P(w \mid s_t, \mathcal{D}_{t+1}) \propto P(s_{t+1} \mid s_t, a_t, w)\, P(w \mid s_t, \mathcal{D}_t), \tag{4}$$

using sampling (Andrieu et al., 2003) or variational inference (Gimelfarb et al., 2018). However, maintaining (4) for large or continuous state spaces presents inherent computational challenges. Furthermore, it is not practical to cache $\mathcal{D}_t$; rather, each sample should be processed online or in batches.
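Concretely, the tabular update (4) reweights each source model by the likelihood it assigns to the observed transition. A minimal numerical sketch (illustrative values, not taken from the paper):

```python
import numpy as np

def update_mixture(xi, likelihoods):
    """One tabular Bayesian model-combination step for the mixture weights
    of the visited state: posterior xi_i is proportional to xi_i * P_i(s'|s,a)."""
    posterior = xi * likelihoods
    return posterior / posterior.sum()

xi = np.array([0.5, 0.3, 0.2])    # prior weights over 3 source models at state s
lik = np.array([0.9, 0.1, 0.1])   # P_i(s'|s,a) of the observed transition
xi = update_mixture(xi, lik)      # mass shifts toward source model 1
```

After repeated transitions consistent with one source model, the weights concentrate on it; the deep parameterization below reproduces this behaviour while generalizing across states.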

3.2 Deep Contextual Mixture Models

Fortunately, as (3) shows, the posterior mean $\xi(s)$ is a sufficient estimator of $\hat{P}$. Therefore, we can approximate $\xi$ directly using a feed-forward neural network $\xi_\theta$ with parameters $\theta$ that can then be optimized using gradient descent. Now it is no longer necessary to store all of $\mathcal{D}_t$, since each sample can be processed online or in batches. Furthermore, since $\xi_\theta$ approximates $\xi$ and fully parameterizes the estimated model (3), we can write $\hat{P}_\theta(s' \mid s, a) = \sum_{i=1}^m \xi_{\theta,i}(s)\, P_i(s' \mid s, a)$.

The input of $\xi_\theta$ is a vectorized state representation of $s$, and the outputs are fed through the softmax function to guarantee that $\xi_{\theta,i}(s) \geq 0$ and $\sum_{i=1}^m \xi_{\theta,i}(s) = 1$.

In order to learn the parameters $\theta$, we can minimize the empirical negative log-likelihood function (note that this equates to optimizing the posterior of $\theta$ with a uniform prior) using gradient descent, given by (3) as:

$$\mathcal{L}(\theta) = -\sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \log \sum_{i=1}^m \xi_{\theta,i}(s_t)\, P_i(s_{t+1} \mid s_t, a_t). \tag{5}$$

The gradient of $\mathcal{L}$ has a Bayesian interpretation. For one observation $(s, a, s')$, its component with respect to the softmax inputs $z_i(s; \theta)$ can be written as

$$\frac{\partial \mathcal{L}}{\partial z_i} = \xi_{\theta,i}(s) - p_i(s, a, s'), \tag{6}$$

where

$$p_i(s, a, s') = \frac{\xi_{\theta,i}(s)\, P_i(s' \mid s, a)}{\sum_{j=1}^m \xi_{\theta,j}(s)\, P_j(s' \mid s, a)}. \tag{7}$$

Here, we can interpret $\xi_\theta(s)$ as a prior. Once a new sample is observed, we compute the posterior $p$ using Bayes' rule (7), and $\theta$ is updated according to the difference between prior and posterior, scaled by the state features. Hence, gradient updates in $\theta$-space can be viewed as projections of posterior updates in $w$-space, and the geometry of this learning process is illustrated in Figure 1. Regularization of $\theta$ can be incorporated naturally by introducing an informative prior (e.g. isotropic Gaussian, Laplace) in the loss (5), and can lead to smoother posteriors.
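The prior-minus-posterior form of the gradient can be verified numerically. The sketch below (hypothetical values; the logits stand in for the network output before the softmax) compares the analytic gradient with finite differences of the negative log-likelihood:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll_grad_logits(z, lik):
    """Gradient of L = -log sum_i xi_i * lik_i w.r.t. the logits z.
    Equals the prior weights minus the Bayesian posterior."""
    xi = softmax(z)                          # prior weights xi(s)
    posterior = xi * lik / (xi * lik).sum()  # Bayes' rule given the transition
    return xi - posterior

z = np.zeros(3)                              # uniform prior over 3 source models
lik = np.array([0.9, 0.1, 0.1])              # P_i(s'|s,a) of one observation
g = nll_grad_logits(z, lik)

# finite-difference check of the analytic gradient
f = lambda zz: -np.log((softmax(zz) * lik).sum())
eps = 1e-6
g_num = np.array([(f(z + eps * e) - f(z - eps * e)) / (2 * eps)
                  for e in np.eye(3)])
```

Gradient descent on the logits thus moves the prior toward the Bayesian posterior, one transition at a time.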

Figure 1: Geometric intuition of the posterior update according to (6) and (7) for two source tasks, $P_1$ and $P_2$, and a toy mixture network with parameters $\theta$. The left sub-plot shows how the prior vector $\xi_\theta(s)$ shifts along the dotted line to the posterior vector $p$, after observing a transition $(s, a, s')$ in the target task in state $s$. This displacement is expressed entirely in terms of the source models $P_1$ and $P_2$. The right sub-plot shows how this transformation computes $\hat{P}_\theta(s' \mid s, a)$.

3.3 Conditional RBF Networks

In continuous-state tasks with deterministic transitions, the $P_i$ correspond to Dirac measures, in which case we only have access to models $f_i$ that predict the next state. In order to tractably update the mixture model in this setting, we assume that, given that source task $i$ is the correct model of target dynamics, the probability of observing a transition from state $s$ to state $s'$ is a decreasing function of the prediction error $\|s' - f_i(s, a)\|$. More formally, given an arbitrarily small region $B_\varepsilon(s')$ around $s'$,

$$P_i\left(B_\varepsilon(s') \mid s, a\right) \propto K_\tau\left(\|s' - f_i(s, a)\|\right), \tag{8}$$

where $K_\tau$ can be interpreted as a normalized radial basis function (technically, we need only require $K_\tau \geq 0$, as the likelihood function need not be a valid probability density). A popular choice of $K_\tau$, implemented in this paper, is the Gaussian kernel, which for states in $\mathbb{R}^n$ is

$$K_\tau(d) = \left(2\pi\tau^2\right)^{-n/2} \exp\left(-\frac{d^2}{2\tau^2}\right). \tag{9}$$

In principle, $\tau$ could be modeled as an additional output of the mixture model, $\tau_\theta(s)$, and learned from data (Bishop, 1994), although we treat it as a constant in this paper.
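For deterministic source models, the kernel likelihood (9) can be evaluated directly from the prediction errors. A sketch with two hypothetical one-step models and a fixed bandwidth $\tau$:

```python
import numpy as np

def gaussian_kernel_likelihood(s_next, s_pred, tau=0.5):
    """Normalized Gaussian RBF likelihood of observing s_next when a
    deterministic source model predicts s_pred (n = state dimension)."""
    n = len(s_next)
    d2 = np.sum((s_next - s_pred) ** 2)
    return (2 * np.pi * tau ** 2) ** (-n / 2) * np.exp(-d2 / (2 * tau ** 2))

# two hypothetical deterministic source models of a 2-d state
f1 = lambda s, a: s + a
f2 = lambda s, a: s - a
s, a = np.array([0.0, 0.0]), np.array([1.0, 0.0])
s_next = np.array([1.05, 0.0])   # observed transition, close to f1's prediction

lik = np.array([gaussian_kernel_likelihood(s_next, f(s, a)) for f in (f1, f2)])
```

These likelihoods substitute for $P_i(s' \mid s, a)$ in the mixture loss, so the weight updates favour the source model with the smaller prediction error.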

By using (8) and following the derivations leading to (2), we obtain the following result in direct analogy to (3):

$$\hat{P}_\theta(s' \mid s, a) \propto \sum_{i=1}^m \xi_{\theta,i}(s)\, K_\tau\left(\|s' - f_i(s, a)\|\right). \tag{10}$$

Consequently, the results derived in the previous sections, including the mixture model and loss function for $\theta$, hold by replacing $P_i(s' \mid s, a)$ with $K_\tau(\|s' - f_i(s, a)\|)$. Furthermore, since (10) approximates the target dynamics as a mixture of kernel functions, it can be viewed as a conditional analogue of the RBF network (Broomhead and Lowe, 1988). It remains to show how to make use of this model and the source policy library to solve a new target task.

3.4 Policy Reuse

The most straightforward approach is to sample a source task $i \sim \xi_\theta(s)$ and follow its policy $\pi_i$ in state $s$. To allow for random exploration, the agent is only allowed to follow this action with probability $\varepsilon_t$, initially set to a high value and annealed over time to maximize the use of source policies early in training (Fernández and Veloso, 2006; Li and Zhang, 2018). The resulting behaviour policy is suitable for any off-policy RL algorithm. We call this algorithm Model-Aware Policy ReuSe for Exploration (MAPSE), and present the pseudocode in Algorithm 1.

0:  $s$, $\xi_\theta$, $\varepsilon_t$, $\{\pi_1, \dots, \pi_m\}$, $\pi_b$ {current state, mixture model, current exploration probability, source policy library, target behavior policy}
  if $U(0, 1) < \varepsilon_t$ then
     sample a source task $i \sim \xi_\theta(s)$ and return $a = \pi_i(s)$
  else
     return $a = \pi_b(s)$
  end if
Algorithm 1 Behavior Policy for MAPSE
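A minimal sketch of the MAPSE behavior policy (hypothetical signatures; the mixture $\xi_\theta$ and the policies are stubbed out as simple callables):

```python
import numpy as np

def mapse_action(s, xi, source_policies, behavior_policy, eps, rng):
    """With probability eps, sample a source task i ~ xi(s) and follow its
    policy; otherwise fall back to the target behavior policy."""
    if rng.random() < eps:
        i = rng.choice(len(source_policies), p=xi(s))
        return source_policies[i](s)
    return behavior_policy(s)

rng = np.random.default_rng(1)
xi = lambda s: np.array([0.95, 0.05])    # learned mixture weights at state s
policies = [lambda s: 0, lambda s: 1]    # two source policies (discrete actions)
behavior = lambda s: 2                   # target exploration action

a = mapse_action(None, xi, policies, behavior, eps=1.0, rng=rng)
```

With `eps=1.0` the agent always follows a sampled source policy; annealing `eps` toward zero hands control back to the target behavior policy.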

However, such an approach has several shortcomings. Firstly, it is not clear how to anneal $\varepsilon_t$, since the underlying mixture $\xi_\theta$ is learned over time and is non-stationary. Secondly, following the recommended actions too often can lead to poor test performance, since the agent may not observe sub-optimal actions enough times to learn to avoid them at test time. Finally, since efficient credit assignment is particularly difficult in sparse reward problems (Seo et al., 2019), this may limit the effectiveness of action recommendation.

Instead, motivated by the recent success of dynamic reward shaping (Brys et al., 2015), we directly modify the original reward to $\tilde{r}_t = r_t + F_t$, where:

$$F_t = \gamma\, \Phi_{t+1}(s_{t+1}, a_{t+1}) - \Phi_t(s_t, a_t). \tag{11}$$

Here, $C$ is a positive constant that defines the strength of the shaped reward signal and can be tuned for each problem, and $\Phi_t(s, a)$ is chosen to be proportional to the posterior probability that action $a$ would be recommended by a source policy in state $s$ at time $t$. The sign of the potential function is negative to encourage the agent to take the recommended actions, following the argument in Harutyunyan et al. (2015). A more elaborate approach would update $\Phi$ using an on-policy value-based RL algorithm, as explained in the aforementioned paper. By repeating the derivations leading to (2), we can derive a similar expression for $\Phi_t$ as follows:

$$\Phi_t(s, a) = -C \sum_{i=1}^m \xi_{\theta_t, i}(s)\, \pi_i(a \mid s).$$
Note that (3.4) reduces to Brys et al. (2015) when $m = 1$ and the source policy is deterministic, i.e. $\pi_1(a \mid s) \in \{0, 1\}$. Unlike MAPSE, this approach can also be applied on-policy, and is guaranteed to converge (Devlin and Kudenko, 2012). We call this approach Model-Aware Reward Shaping (MARS), and present the training procedure of both MARS and MAPSE in Algorithm 2.
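The MARS shaping computation can be sketched as follows (hypothetical names; the potential is taken to be the negated, $C$-scaled probability that the mixture recommends the action, per Section 3.4):

```python
import numpy as np

def potential(s, a, xi, source_policies, C=1.0):
    """Phi(s, a) = -C * sum_i xi_i(s) * pi_i(a|s): negated posterior
    probability that a source policy recommends action a in state s."""
    return -C * sum(w * pi(s)[a] for w, pi in zip(xi(s), source_policies))

def shaped_reward(r, s, a, s_next, a_next, xi, source_policies, gamma=0.99, C=1.0):
    """Dynamic PBRS: r' = r + gamma * Phi(s', a') - Phi(s, a)."""
    return (r + gamma * potential(s_next, a_next, xi, source_policies, C)
              - potential(s, a, xi, source_policies, C))

xi = lambda s: np.array([1.0])           # single source task for illustration
pi = lambda s: np.array([0.0, 1.0])      # deterministic policy recommending a=1

r_follow = shaped_reward(0.0, None, 1, None, 1, xi, [pi])  # recommended action
r_ignore = shaped_reward(0.0, None, 0, None, 1, xi, [pi])  # ignored recommendation
```

Following a recommended action yields a small positive bonus while deviating incurs a penalty, and by the potential-based construction the optimal policy is unchanged (Devlin and Kudenko, 2012).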

0:   $f_i$ (or $P_i$) and $\pi_i$ for $i = 1, \dots, m$, $\pi$, $\pi_b$, $\xi_\theta$, $\eta$, $\mathcal{B}$ (optionally $\varepsilon_t$ or $C$) {learned source dynamics, source policy library, learned target policy, behavior policy, mixture model weights, learning rate, replay buffer}
  for episode $= 1, 2, \dots$ do
     Sample an episode from the environment, where the behavior policy is defined by Algorithm 1 (MAPSE) or the reward is defined by (11) and (3.4) (MARS)
     Store the observed transitions in $\mathcal{B}$
     Train $\pi$ on random mini-batches sampled from $\mathcal{B}$
     Update $\theta$ using gradient descent (6) on random mini-batches sampled from $\mathcal{B}$, e.g. $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$
  end for
Algorithm 2 Model-Aware Policy Reuse (MAPSE, MARS)

The proposed framework is general and modular, since it can be combined with any standard RL algorithm in model-free and model-based settings. Furthermore, the computational cost of processing each sample is a linear function of the cost of evaluating the source task dynamics and policies, and can be efficiently implemented using neural networks. In a practical implementation, $\xi_\theta$ would be trained with a higher learning rate or larger batch size than the target policy, to make better use of source task information earlier.

Many extensions to the current framework are possible. For instance, to reduce the effect of incomplete or negative transfer, it is possible to estimate the target task dynamics $P$ or $f$, and include the estimate as an additional $(m+1)$-st component in the mixture (3). If this model can be estimated accurately, it can also be used to update the agent directly as suggested in Section 2. Further improvements for MARS could be obtained by learning a secondary Q-value function for the potentials following Harutyunyan et al. (2015). We do not investigate these extensions in this paper; they form interesting topics for future study.

4 Empirical Evaluation

In this section, we evaluate, empirically, the performance of both MAPSE (Algorithm 1) and MARS (11)-(3.4) in a typical RL setting (Algorithm 2). In particular, we are interested in answering the following research questions:

  1. Does the mixture model learn to select the most relevant source task(s) in each state?

  2. Does MARS (and possibly MAPSE) achieve better test performance, relative to the number of transitions observed, over existing baselines?

  3. Does MARS lead to better transfer than MAPSE?

In order to answer these questions, we implement Algorithm 2 in the context of tabular Q-learning (Watkins and Dayan, 1992) and DQN (Mnih et al., 2015) with MAPSE and MARS. To ensure fair comparison, we include the following state-of-the-art contextual and context-free policy reuse baselines:

  1. CAPS: a contextual option-based algorithm recently proposed in Li et al. (2019);

  2. UCB: a context-free UCB algorithm recently proposed in Li and Zhang (2018); here performance depends on the exploration schedule $\varepsilon_t$, so we run a grid search over its decay rate as a function of the episode, and report the best case (likewise for MAPSE);

  3. $\pi_1$, $\pi_2$, …: PBRS using a binary reward derived from each source policy in isolation, as suggested in Brys et al. (2015);

  4. Q: Q-learning or DQN with no transfer.

To help us answer the research questions above, we consider three variants of existing problems, Transfer-Maze, Transfer-CartPole, and Transfer-SparseLunarLander, that are explained in the subsequent subsections. All experiments are run using Keras with TensorFlow backend. Full details are provided in Appendix B. Experiment Settings.

4.1 Transfer-Maze

(a) Source tasks.
(b) Target task.
Figure 2: The Transfer-Maze environment.

The first experiment consists of a 30-by-30 discrete maze environment with four sub-rooms, as illustrated in Figure 2. The four possible actions left, up, right, down move the agent one cell in the corresponding direction, but have no effect if the destination contains a wall. The agent incurs a penalty of -0.02 for hitting a wall, and -0.01 otherwise. Upon reaching the goal, the agent receives +1.0 and the game ends. The goal is to find the shortest path from the green cell to the red cell in the target maze shown in Figure 2(b). The source tasks, shown in Figure 2(a), each correctly model the interior of only one room. As a result, only a context-aware algorithm can learn to utilize the source task knowledge correctly.

For each source task, we use Q-learning to learn the optimal policies $\pi_i$. The dynamics $P_i$ are estimated using lookup tables and are hence exact. The target policies are learned using Q-learning with each baseline. The experiment is repeated 20 times and the aggregated results are reported in Figure 3(a). Figure 3(b) plots the state-dependent posterior learned over time on a single trial.


(a) Smoothed mean test performance (number of steps needed to reach the goal) and standard error for 20 trials on the Transfer-Maze domain solved with Q-learning. As seen on the right, transferring from any single source policy in isolation prevents the agent from converging in the specified number of samples.

(b) Probability assigned to each source task in each position of the Transfer-Maze domain in a single trial of MARS (results for MAPSE are similar) after collecting 0, 5K, 10K, 20K, 50K and 100K samples (going left to right).
Figure 3: Training results for the Transfer-Maze experiment.

4.2 Transfer-CartPole

We next consider a variation of the continuous-state CartPole control problem, where the force applied to the cart is not constant, but varies with cart position. One way to interpret this is that the surface is not frictionless, but contains slippery and rough patches. To learn better policies, the agent can apply half or full force to the cart in either direction (4 possible actions). As a result, the optimal policy in each state depends on the surface. The problem is made more difficult by initializing the cart position uniformly at random, requiring the agent to generalize control to both surfaces. In the first two source tasks, agents balance the pole only on rough and slippery surfaces, respectively. In the third source task, the pole length is doubled.

Following Mnih et al. (2015), Q-values are approximated using feed-forward neural networks, and we use randomized experience replay and a target Q-network with hard updates. State dynamics are parameterized as feed-forward neural networks and trained in a supervised manner with the MSE loss on batches drawn randomly from the buffer. To learn $\xi_\theta$, the likelihood is estimated using (9) with fixed $\tau$. For CAPS, we follow Li et al. (2019) and only train the last layer when learning termination functions. We tried several learning rates and picked the best one. The test performance is illustrated in Figure 4(a). Figure 4(b) plots the state-dependent posterior learned over time.

(a) Smoothed mean test performance (number of steps until pole falls) and standard error for 20 trials on the Transfer-CartPole environment solved with DQN. Test performance on each trial is estimated by averaging the results of 10 episodes with randomly generated initial states and using the current greedy policy.
(b) Probability assigned to each source task in each state of the Transfer-CartPole environment in a single trial of MARS after collecting 0, 100, 500, 1K, 2.5K and 5K samples (going left to right). The x and y axes are cart position and pole angle, respectively, and probabilities are averaged over the remaining state dimensions.
Figure 4: Training results for the Transfer-CartPole experiment.

4.3 Transfer-SparseLunarLander

The final experiment consists of a variation of the LunarLander-v2 domain from OpenAI-Gym with sparse reward, in which the reward signal is deferred until the end of each episode. This is a high-dimensional continuous stochastic problem with sparse reward, representative of many real-world problems, where it is considerably harder to learn correct dynamics, and hence to transfer skills effectively. The first source task teaches the lander to hover above the landing pad in a fixed region of space, and fails if the lander gets too close to the ground. The second source task places the lander at a random location above the landing pad, and the agent learns to land the craft safely. The third source task is equivalent to LunarLander-v2, except that the mass of the craft is halved. A successful transfer experiment, therefore, should learn to transfer skills from the hover and land source tasks depending on altitude.

To solve this problem, we use the same setup as in Transfer-CartPole. Here, state transitions are stochastic and the moon surface is generated randomly in each episode, so dynamics are learned on noisy data. Observed state components are clipped to reduce the effect of outliers, and tanh output activations predict the first 6 state components (position and velocity) while sigmoid activations predict the last two (leg contact with the ground). Furthermore, the dynamics are learned offline on data collected during policy training, to avoid the moving-target problem and improve learning stability. The MSE obtained for the hover dynamics is considerably lower than for the other source task dynamics, highlighting the difficulty of correctly learning accurate dynamics for ground contact. The test performance averaged over 10 trials is shown in Figure 5. Figure 6 illustrates the output of the mixture on 10 state trajectories obtained during training.

Figure 5: Smoothed mean test performance (total return) and standard error for 10 trials on the Transfer-SparseLunarLander domain solved with DQN. Test performance on each trial is estimated by averaging the results of 10 episodes with randomly generated initial states and using the best greedy policy (to minimize over-fitting).
Figure 6: Each plot shows the lander’s position during 10 training episodes, collected after observing 0, 10K, 25K and 50K samples. Colors indicate the source task mixture as indicated in the legend, and arrows indicate direction and (relative) magnitude of lander’s linear velocity. Clearly, the lander tends to follow the hover policy (red) until near landing, when it switches to the landing policy (green). The half-mass policy (blue) is used in increasingly rare circumstances as training progresses.

4.4 Discussion

MARS consistently outperforms all baselines in terms of sample efficiency and solution quality, and MAPSE outperforms UCB, as shown in Figures 3(a), 4(a) and 5. Figures 3(b), 4(b) and 6 provide one possible explanation for this, namely the ability of the mixture model to converge to good mixtures even when presented with imperfect source dynamics, as in Transfer-SparseLunarLander. Furthermore, on all three domains, MARS achieves asymptotic performance comparable to, or better than, the best single potential function. Interestingly, although MARS consistently outperforms CAPS, MAPSE only does so on Transfer-SparseLunarLander. This reaffirms our hypothesis in Section 3.4 that reward shaping can improve generalization on test data with little tuning. Furthermore, we conjecture that the inconsistent performance of CAPS is due to its reliance on fluctuating Q-values, which is mitigated in MARS and MAPSE by their reliance instead on more stable samples of the dynamics.

5 Conclusion

We investigated transfer of policies from multiple source tasks with identical goals but different dynamics. We showed, theoretically, how dynamics are related to policy values. We then used estimates of source task dynamics to contextually measure similarity between source and target tasks using a deep mixture model. We introduced two ways to use this information to improve training in the target task. Experiments showed strong performance and the advantages of leveraging more stable dynamics, as well as reward shaping, as a means of contextual transfer. Several possible extensions of this work were discussed in Section 3.4. It is also possible to generalize this work to MDPs with different goals (Schaul et al., 2015) or different state or action spaces (Taylor et al., 2007).


  • C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan (2003) An introduction to MCMC for machine learning. Machine Learning 50 (1–2), pp. 5–43.
  • B. Banerjee and P. Stone (2007) General game learning using knowledge transfer. In IJCAI, pp. 672–677. Cited by: §1.
  • A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver (2017) Successor features for transfer in reinforcement learning. In NIPS, pp. 4055–4065. Cited by: §1.
  • C. M. Bishop (1994) Mixture density networks. Technical report, Aston University. Cited by: §1, §3.3.
  • D. S. Broomhead and D. Lowe (1988) Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom). Cited by: §3.3.
  • T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowé (2015) Policy transfer using reward shaping. In AAMAS, pp. 181–188. Cited by: §1.1, §1, §1, §3.4, §3.4, item 3.
  • P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba (2016) Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518. Cited by: §1.
  • G. Comanici and D. Precup (2010) Optimal policy switching algorithms for reinforcement learning. In AAMAS, pp. 709–714. Cited by: §1.
  • S. M. Devlin and D. Kudenko (2012) Dynamic potential-based reward shaping. In AAMAS, pp. 433–440. Cited by: §1.1, §3.4.
  • J. D. Durrant and J. A. McCammon (2011) Molecular dynamics simulations and drug discovery. BMC biology 9 (1), pp. 71. Cited by: §1.
  • F. Fernández and M. Veloso (2006) Probabilistic policy reuse in a reinforcement learning agent. In AAMAS, pp. 720–727. Cited by: §1, §1, §3.4.
  • M. Gimelfarb, S. Sanner, and C. Lee (2018) Reinforcement learning with multiple experts: a bayesian model combination approach. In NIPS, pp. 9528–9538. Cited by: §3.1.
  • S. Gu, E. Holly, T. Lillicrap, and S. Levine (2017) Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA, pp. 3389–3396. Cited by: §1.
  • A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine (2017) Learning invariant feature spaces to transfer skills with reinforcement learning. In ICLR, Cited by: §1.
  • A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé (2015) Expressing arbitrary reward functions as potential-based advice. In AAAI, Cited by: §1.1, §3.4, §3.4.
  • N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa (2015) Learning continuous control policies by stochastic value gradients. In NIPS, pp. 2944–2952. Cited by: §2.
  • R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87. Cited by: §1.
  • L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. (2019) Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374. Cited by: §2.
  • G. Konidaris and A. G. Barto (2007) Building portable options: skill transfer in reinforcement learning. In IJCAI, Vol. 7, pp. 895–900. Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §1.
  • A. Kurenkov, A. Mandlekar, R. Martin-Martin, S. Savarese, A. Garg, M. Danielczuk, A. Balakrishna, M. Matl, D. Wang, K. Goldberg, et al. (2019) AC-teach: a bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. In ICRA, Cited by: §1.1.
  • A. Lazaric (2008) Knowledge transfer in reinforcement learning. Ph.D. Thesis, Politecnico di Milano. Cited by: §1.
  • S. Levine and V. Koltun (2013) Guided policy search. In ICML, pp. 1–9. Cited by: §2.
  • M. Li and D. Kudenko (2018) Reinforcement learning from multiple experts demonstrations. In ALA, Vol. 18. Cited by: §1.1.
  • S. Li, F. Gu, G. Zhu, and C. Zhang (2019) Context-aware policy reuse. In AAMAS, pp. 989–997. Cited by: §1.1, item 1, §4.2.
  • S. Li and C. Zhang (2018) An optimal online method of selecting source policies for reinforcement learning. In AAAI, Cited by: §3.4, item 2.
  • A. M. Lund, K. Mochel, J. Lin, R. Onetto, J. Srinivasan, P. Gregg, J. E. Bergman, K. D. Hartling, A. Ahmed, S. Chotai, et al. (2018) Digital twin interface for operating wind farms. US Patent 9,995,278. Cited by: §1.
  • T. P. Minka (2000) Bayesian model averaging is not model combination. Available electronically at http://www.stat.cmu.edu/minka/papers/bma.html, pp. 1–2. Cited by: §3.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1, §2, §4.2, §4.
  • K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez (2011) Turning bayesian model averaging into bayesian model combination. In IJCNN, pp. 2657–2663. Cited by: §3.1.
  • A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, pp. 7559–7566. Cited by: §2.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §1.1.
  • E. Parisotto, J. L. Ba, and R. Salakhutdinov (2015) Actor-mimic: deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342. Cited by: §1.
  • B. Peng, X. Li, J. Gao, J. Liu, and K. Wong (2018) Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. In ACL, pp. 2182–2192. Cited by: §2.
  • M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.
  • J. Rajendran, A. S. Lakshminarayanan, M. M. Khapra, P. Prasanna, and B. Ravindran (2015) Attend, adapt and transfer: attentive deep architecture for adaptive transfer from multiple sources in the same domain. arXiv preprint arXiv:1510.02879. Cited by: §1.1.
  • B. Rosman, M. Hawasly, and S. Ramamoorthy (2016) Bayesian policy reuse. Machine Learning 104 (1), pp. 99–127. Cited by: §1.
  • T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In ICML, pp. 1312–1320. Cited by: §5.
  • M. Seo, L. F. Vecchietti, S. Lee, and D. Har (2019) Rewards prediction-based credit assignment for reinforcement learning with sparse binary rewards. IEEE Access 7, pp. 118776–118791. Cited by: §3.4.
  • H. B. Suay, T. Brys, M. E. Taylor, and S. Chernova (2016) Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS, pp. 429–437. Cited by: §1.1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063. Cited by: §2.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163. Cited by: §2.
  • M. E. Taylor, P. Stone, and Y. Liu (2005) Value functions for rl-based behavior transfer: a comparative study. In AAAI, Vol. 5, pp. 880–885. Cited by: §1.
  • M. E. Taylor, P. Stone, and Y. Liu (2007) Transfer learning via inter-task mappings for temporal difference learning. JMLR 8 (Sep), pp. 2125–2167. Cited by: §1, §5.
  • M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. JMLR 10 (Jul), pp. 1633–1685. Cited by: §1, §1.
  • E. Todorov and W. Li (2005) A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, pp. 300–306. Cited by: §2.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §2, §4.
  • E. Wiewiora, G. W. Cottrell, and C. Elkan (2003) Principled methods for advising reinforcement learning agents. In ICML, pp. 792–799. Cited by: §1.1.
  • H. Zhang, Q. Liu, X. Chen, D. Zhang, and J. Leng (2017) A digital twin-based approach for designing and multi-objective optimization of hollow glass production line. IEEE Access 5, pp. 26901–26911. Cited by: §1.


A. Proof of Theorem 1

First, observe that for any stochastic matrix $P$, $\|P\|_\infty = 1$, where $\|\cdot\|_\infty$ denotes the infinity norm, and $I - \gamma P$ is always invertible, since the eigenvalues of $\gamma P$ always lie in the interior of the unit circle for $\gamma \in [0, 1)$. Therefore,

$$\left\| (I - \gamma P)^{-1} \right\|_\infty = \left\| \sum_{t=0}^{\infty} (\gamma P)^t \right\|_\infty \leq \sum_{t=0}^{\infty} \gamma^t \left\| P \right\|_\infty^t = \frac{1}{1 - \gamma}.$$

To simplify notation, we write $A = I - \gamma P_1$ and $B = I - \gamma P_2$. Then $V_1 = A^{-1} R$ and $V_2 = B^{-1} R$. Now, making use of the identity $A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}$, we have

$$\left\| V_1 - V_2 \right\|_\infty = \left\| A^{-1} (B - A) B^{-1} R \right\|_\infty \leq \gamma \left\| A^{-1} \right\|_\infty \left\| P_1 - P_2 \right\|_\infty \left\| B^{-1} R \right\|_\infty \leq \frac{\gamma}{(1 - \gamma)^2} \left\| P_1 - P_2 \right\|_\infty \left\| R \right\|_\infty,$$

and so the proof is complete.
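The two linear-algebra facts used in the proof (the norm bound on $(I - \gamma P)^{-1}$ and the inverse-difference identity) can be checked numerically. The following NumPy sketch is our own illustration, not part of the paper's code; the matrix size and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random row-stochastic matrices and a discount factor gamma < 1.
P1 = rng.random((5, 5))
P1 /= P1.sum(axis=1, keepdims=True)
P2 = rng.random((5, 5))
P2 /= P2.sum(axis=1, keepdims=True)
gamma = 0.95

# Fact 1: I - gamma * P is invertible, and the infinity norm of its
# inverse (maximum absolute row sum) is bounded by 1 / (1 - gamma).
M = np.linalg.inv(np.eye(5) - gamma * P1)
inf_norm = np.abs(M).sum(axis=1).max()
assert inf_norm <= 1.0 / (1.0 - gamma) + 1e-9

# Fact 2: the identity A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1},
# here with A = I - gamma * P1 and B = I - gamma * P2.
A = np.eye(5) - gamma * P1
B = np.eye(5) - gamma * P2
lhs = np.linalg.inv(A) - np.linalg.inv(B)
rhs = np.linalg.inv(A) @ (B - A) @ np.linalg.inv(B)
assert np.allclose(lhs, rhs)
```

For a stochastic matrix the bound in Fact 1 is in fact tight, since $(I - \gamma P)^{-1}$ has nonnegative entries with row sums equal to $\sum_t \gamma^t = 1/(1-\gamma)$.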

B. Experiment Settings

All code was written and executed using Eclipse PyDev running Python 3.7. All neural networks were initialized and trained using Keras with the TensorFlow backend (version 1.14), and weights were initialized using the default setting. The Adam optimizer was used to train all neural networks. Experiments were run on an Intel 6700-HQ Quad-Core processor with 8 GB RAM running the Windows 10 operating system.

We used the following hyper-parameters in the experiments:

| Description | Transfer-Maze | Transfer-CartPole | Transfer-SparseLunarLander |
|---|---|---|---|
| maximum roll-out length | 300 | 500 | 1000 |
| discount factor | 0.95 | 0.98 | 0.99 |
| exploration probability | 0.12 | 0.12 | 0.12 |
| learning rate of Q-learning* | 0.2 (MARS/PBRS), 0.8 (other) | — | — |
| replay buffer capacity | — | 5000 | 20000 |
| batch size | — | 32 | 64 |
| topology of DQN | — | 4-40-40-4 | 8-120-100-4 |
| hidden activation of DQN | — | ReLU | ReLU |
| learning rate of DQN | — | 0.0005 | 0.001 |
| learning rate for termination function weights** | 0.4 | 0.01 | 0.0001 |
| target network update frequency (in batches) | — | 500 | 100 |
| L2 penalty of DQN | — | | |
| topology of dynamics model | — | (4+4)-50-50-4 | (8+4)-100-100-(6+2) |
| hidden activation of dynamics model | — | ReLU | ReLU |
| learning rate of dynamics model | — | 0.001 | 0.001 |
| L2 penalty of dynamics model | — | | |
| Gaussian kernel precision | | | |
| topology of mixture model | 58-30-30-4 | 4-30-30-3 | 8-30-30-3 |
| hidden activation of mixture | ReLU | ReLU | ReLU |
| learning rate of mixture | 0.001 | 0.001 | 0.001 |
| training epochs/batch for mixture | 4 | 3 | 1 |
| PBRS scaling factor | 0.1 | 1.0 | 20.0 |
| probability of following source policies*** | (MAPSE), (UCB) | (MAPSE), (UCB) | (MAPSE), (UCB) |

* we had to decrease the learning rate for MARS and reward shaping to avoid instability in the learning process caused by the larger rewards
** we report the best value found in the deep learning case
*** the probability is annealed with the episode number; we report the best value found
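The per-state model-combination update underlying the mixture model can be illustrated in isolation. The following NumPy sketch is our own illustration: the function name and the uniform prior are assumptions, and in the paper the state-dependent mixture weights are produced by a trained neural network (with the topologies listed above) rather than by this closed-form update.

```python
import numpy as np

def mixture_responsibilities(weights, likelihoods):
    """Posterior (responsibility) of each source dynamics model.

    weights: prior probability of each of the K source models (sums to 1).
    likelihoods: likelihood of an observed transition (s, a, s') under each
        source model's estimated dynamics, e.g. from a Gaussian kernel.
    """
    posterior = weights * likelihoods
    return posterior / posterior.sum()

# Example: three source tasks and a uniform prior. The observed transition
# is most likely under the second source model, so its weight dominates.
prior = np.array([1 / 3, 1 / 3, 1 / 3])
lik = np.array([0.1, 0.7, 0.2])
post = mixture_responsibilities(prior, lik)
```

Training a network to output these weights amortizes the update across states, which is what lets the agent switch source policies contextually, as in Figure 6.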