1 Introduction
The world is filled with families of tasks that have similar, but not identical, dynamics. For example, consider training a robot to swing a bat of unknown length and mass. This task is a member of a family of bat-swinging tasks, and if the robot has already learned to swing several bats of various lengths and masses, it should learn to swing a new bat more efficiently than it could from scratch. That is, it is grossly inefficient to develop a control policy from scratch each time a unique task is encountered.
The Hidden Parameter Markov Decision Process (HiP-MDP) (Doshi-Velez and Konidaris, 2016) was developed to address this type of transfer learning, where optimal policies are adapted to subtle variations within tasks in an efficient and robust manner. Specifically, the HiP-MDP paradigm introduced a low-dimensional latent task parameterization that, combined with a state and action, completely describes the system's dynamics. However, the original formulation did not account for nonlinear interactions between the latent parameterization and the state space when approximating these dynamics, which required all states to be visited during training. In addition, the original framework scaled poorly because it used Gaussian Processes (GPs) as basis functions for approximating the task's dynamics.
We present a new HiP-MDP formulation that models interactions between the latent parameters $w_b$ and the state $s$ when transitioning to state $s'$ after taking action $a$. We do so by including the latent parameters $w_b$, the state $s$, and the action $a$ as input to a Bayesian Neural Network (BNN). The BNN both learns the common transition dynamics for a family of tasks and models how the unique variations of a particular instance impact the instance's overall dynamics. Embedding the latent parameters in this way allows for more accurate uncertainty estimation and more robust transfer when learning a control policy for a new and possibly unique task instance. Our formulation also inherits several desirable properties of BNNs: it can model multimodal and heteroskedastic transition functions, inference scales to data large in both dimension and number of samples, and all output dimensions are jointly modeled, which reduces computation and increases predictive accuracy (Depeweg et al., 2017a). As such, a BNN can capture complex dynamical systems with highly nonlinear interactions between state dimensions. Furthermore, model uncertainty is easily quantified through the BNN's output variance. Thus, we can scale to larger domains than previously possible.
We use the improved HiP-MDP formulation to develop control policies for acting in a simple two-dimensional navigation domain, playing acrobot (Sutton and Barto, 1998), and designing treatment plans for simulated patients with HIV (Ernst et al., 2006). The HiP-MDP rapidly determines the dynamics of new instances, enabling us to quickly find near-optimal instance-specific control policies.
2 Background
Model-based reinforcement learning
We consider reinforcement learning (RL) problems in which an agent acts in a continuous state space $S$ and a discrete action space $A$. We assume that the environment has some true transition dynamics $T(s' \mid s, a)$, unknown to the agent, and that we are given a reward function $R(s, a)$ that provides the utility of taking action $a$ from state $s$. In the model-based reinforcement learning setting, our goal is to learn an approximate transition function $\tilde{T}(s' \mid s, a)$ based on observed transitions and then use $\tilde{T}$ to learn a policy $\pi$ that maximizes the long-term expected reward $\mathbb{E}[\sum_t \gamma^t r_t]$, where the discount $\gamma \in [0, 1)$ governs the relative importance of immediate and future rewards.

HiP-MDPs
A HiP-MDP (Doshi-Velez and Konidaris, 2016) describes a family of Markov Decision Processes (MDPs) and is defined by the tuple $\{S, A, W, T, R, \gamma, P_W\}$, where $S$ is the set of states $s$, $A$ is the set of actions $a$, and $R$ is the reward function. The transition dynamics $T(s' \mid s, a, w_b)$ for each task instance $b$ depend on the value of the hidden parameters $w_b \in W$; for each instance, the parameters are drawn from the prior $P_W$. The HiP-MDP framework assumes that a finite-dimensional array of hidden parameters can fully specify variations among the true task dynamics. It also assumes that the system dynamics are invariant during a task and that the agent is signaled when one task ends and another begins.
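To make the setup concrete, a HiP-MDP family can be sketched as follows. This is a toy illustration; the class, the scalar dynamics, and all names are ours, not from the paper.

```python
import numpy as np

class HiPMDPFamily:
    """A family of tasks sharing a transition form, up to hidden parameters w_b."""
    def __init__(self, w_dim=2, seed=0):
        self.rng = np.random.default_rng(seed)
        self.w_dim = w_dim

    def draw_instance(self):
        # Each instance b draws hidden parameters w_b from the prior P_W;
        # w_b then stays fixed for the entire instance.
        return self.rng.standard_normal(self.w_dim)

def true_transition(s, a, w_b):
    # Toy ground-truth dynamics: the same functional form for every
    # instance, modulated by the instance's hidden parameters w_b.
    return s + 0.1 * a + 0.05 * w_b

family = HiPMDPFamily()
w_b = family.draw_instance()
s_next = true_transition(np.zeros(2), np.array([1.0, 0.0]), w_b)
```

The key assumption encoded here is that the dynamics are invariant within an instance: the same $(s, a, w_b)$ always yields the same transition distribution.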
Bayesian Neural Networks
A Bayesian Neural Network (BNN) is a neural network, $f(\cdot\,; \theta)$, in which the parameters $\theta$ are random variables with some prior $P(\theta)$ (MacKay, 1992). We place independent Gaussian priors on each parameter. Exact Bayesian inference for the posterior over parameters $P(\theta \mid D)$ is intractable, but several recent techniques have been developed to scale inference in BNNs (Blundell et al., 2015; Gal and Ghahramani, 2016; Hernández-Lobato et al., 2016; Neal, 1992). As probabilistic models, BNNs reduce the tendency of neural networks to overfit in the presence of small amounts of data, just as GPs do. In general, training a BNN is more computationally efficient than training a GP (Hernández-Lobato et al., 2016), while still providing coherent uncertainty measurements. Specifically, predictive distributions can be calculated by averaging over samples of $\theta$ drawn from an approximated posterior distribution over the parameters. As such, BNNs are being adopted for the estimation of stochastic dynamical systems (Depeweg et al., 2017a; Gal et al., 2016).

3 A HiP-MDP with Joint Uncertainty
The original HiP-MDP transition function models variation across task instances as (we present a simplified version that omits their filtering variables, to make the parallels between our formulation and the original more explicit; our simplification does not change any key properties):

$$s'_d \sim \sum_k w_{bk}\,\tilde{T}_{kad}(s) \qquad (1)$$

where $d$ indexes the dimension of $s'$. Each basis transition function $\tilde{T}_{kad}$ (indexed by the latent parameter index $k$, the action $a$, and the dimension $d$) is a GP using only $s$ as input, linearly combined with the instance-specific weights $w_{bk}$. Inference involves learning the parameters for the GP basis functions and the weights for each instance. GPs can robustly approximate stochastic state transitions in continuous dynamical systems in model-based reinforcement learning (Deisenroth and Rasmussen, 2011; Rasmussen and Kuss, 2003; Rasmussen and Williams, 2006). GPs have also been widely used in transfer learning outside of RL (e.g., Bonilla et al., 2008).
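The structure of this linear formulation, with per-dimension basis functions combined by instance-specific weights, can be sketched as follows. Generic nonlinear functions stand in for the GP basis functions, and all names are ours.

```python
import numpy as np

def basis_functions(s, a, d, K=3):
    # Stand-ins for the K basis outputs for action a and output dimension d;
    # as in the original formulation, each takes only the state s as input.
    return np.array([np.tanh((k + 1) * s[d] + a) for k in range(K)])

def linear_hipmdp_transition(s, a, w_b):
    # Each output dimension d is an independent linear combination of its
    # own basis functions, weighted by the instance-specific weights w_b.
    return np.array([w_b @ basis_functions(s, a, d) for d in range(len(s))])

s = np.array([0.2, -0.4])
w_b = np.array([0.5, 0.3, -0.1])          # instance-specific weights
s_next = linear_hipmdp_transition(s, a=1, w_b=w_b)
```

Note that $w_b$ enters only linearly and each output dimension is modeled separately, which are exactly the two limitations discussed next.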
While this formulation is expressive, it has limitations. The primary limitation is that the uncertainty in the latent parameters $w_b$ is modeled independently of the agent's state uncertainty. Hence, the model does not account for interactions between the latent parameterization and the state $s$. As a result, Doshi-Velez and Konidaris (2016) required that each task instance perform the same set of state-action combinations during training. While such training may sometimes be possible (e.g., for robots that can be driven to identical positions), it is onerous at best and impossible for other systems, such as human patients. The secondary limitation is that each output dimension is modeled separately by its own collection of GP basis functions: the basis functions $\tilde{T}_{kad}$ for output dimension $d$ are independent of the basis functions $\tilde{T}_{kad'}$ for any other output dimension $d' \ne d$. Hence, the model does not account for correlations between output dimensions. Modeling such correlations typically requires knowledge of how dimensions interact in the approximated dynamical system (Alvarez et al., 2012; Genton et al., 2015). We choose not to constrain the HiP-MDP with such a priori knowledge, since the aim is to provide basis functions that can ascertain these relationships through observed transitions.
To overcome these limitations, we include the instance-specific weights $w_b$ as input to the transition function and model all dimensions of the output jointly:

$$s' \sim \tilde{T}(s, a, w_b) \qquad (2)$$
This critical modeling change eliminates all of the above limitations: we can learn directly from data as observed (which is abundant in many industrial and health domains) and no longer require a highly constrained training procedure. We can also capture the correlations in the outputs of these domains, which occur in many natural processes.
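Concretely, the change from equation (1) to equation (2) is a change in the model's inputs and outputs: state, action, and latent parameters are concatenated into one input, and all state dimensions are predicted jointly. The schematic below shows this in plain numpy, with a single deterministic forward pass standing in for the BNN; sizes and names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, W_DIM, HIDDEN = 2, 1, 2, 16

# One fixed set of network weights; in the BNN these are random variables.
W1 = rng.standard_normal((S_DIM + A_DIM + W_DIM, HIDDEN)) * 0.3
W2 = rng.standard_normal((HIDDEN, S_DIM)) * 0.3

def bnn_transition(s, a, w_b):
    # Equation (2): s, a, and w_b form a single joint input, and all output
    # dimensions are produced together, so nonlinear state-latent
    # interactions and output correlations can be captured.
    x = np.concatenate([s, np.atleast_1d(a), w_b])
    return np.tanh(x @ W1) @ W2

s_next = bnn_transition(np.array([0.2, -0.4]), 1.0, np.array([0.5, -0.3]))
```

Unlike the linear formulation, the prediction here is a nonlinear function of $w_b$, so doubling the latent weights does not simply double the predicted transition.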
Finally, the computational demands of using GPs as the transition function limited the application of the original HiP-MDP formulation to relatively small domains. In the following, we use a BNN rather than a GP to model this transition function. The computational requirements needed to learn a GP-based transition function make a direct comparison to our new BNN-based formulation infeasible within our experiments (Section 5). We demonstrate, in Appendix A, that the BNN-based transition model far exceeds the GP-based transition model in both computational and predictive performance. In addition, BNNs naturally produce multidimensional outputs without requiring prior knowledge of the relationships between dimensions. This allows us to directly model output correlations between the state dimensions, leading to a more unified and coherent transition model. Inference in a larger input space with a large number of samples is tractable using efficient approaches that let us, given a prior over $w_b$ and observed input-output tuples $\{(s, a), s'\}$, estimate a distribution over the latent embedding $w_b$. This enables more robust, scalable transfer.
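Model uncertainty from such a BNN is typically estimated by Monte Carlo: draw several weight samples from the (approximate) posterior, run a forward pass with each, and take the spread of the predictions. A sketch follows, with randomly drawn weight sets standing in for posterior samples; all sizes and names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
IN_DIM, HIDDEN, OUT_DIM, N_SAMPLES = 5, 16, 2, 50

def forward(x, theta):
    W1, W2 = theta
    return np.tanh(x @ W1) @ W2

# Stand-ins for samples of the network weights from the approximate posterior.
posterior_samples = [
    (rng.standard_normal((IN_DIM, HIDDEN)) * 0.3,
     rng.standard_normal((HIDDEN, OUT_DIM)) * 0.3)
    for _ in range(N_SAMPLES)
]

# One joint input [s, a, w_b], as in equation (2).
x = np.concatenate([np.array([0.2, -0.4]), [1.0], np.array([0.5, -0.3])])
preds = np.stack([forward(x, theta) for theta in posterior_samples])
pred_mean = preds.mean(axis=0)   # predictive mean of s'
pred_std = preds.std(axis=0)     # model uncertainty, per output dimension
```

The per-dimension spread `pred_std` is the output variance referred to in the text: it is large where the sampled networks disagree, i.e., in regions with few related observed transitions.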
Demonstration
We present a toy domain (Figure 1) in which an agent is tasked with navigating to a goal region. The state space is continuous, and the action space is discrete, consisting of the four cardinal directions. Task instances vary the following domain aspects: the location of a wall that blocks access to the goal region (either to the left of or below the goal region), the orientation of the cardinal directions (i.e., whether taking action North moves the agent up or down), and the direction of a nonlinear wind effect that increases as the agent moves away from the start region. Ignoring the wall and grid boundaries, the transition dynamics are

$$s_{t+1} = s_t + \theta_b\left(\delta + c\,\|s_t\|^2\right)u(a_t),$$

where $\delta$ is the step size (without wind), $u(a)$ is the unit vector associated with action $a$, $\theta_b \in \{-1, +1\}$ indicates which of the two classes the instance belongs to, and $c$ controls the influence of the wind and is fixed for all instances. The agent is penalized for trying to cross a wall, and each step incurs a small cost until the agent reaches the goal region, encouraging the agent to discover the goal region via the shortest possible route. An episode terminates once the agent enters the goal region or after 100 time steps.
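A simulator step consistent with this description might look as follows. The exact functional form of the wind term (quadratic in distance from the start region at the origin) is our assumption; $\delta$, $\theta_b$, and $c$ are as in the text.

```python
import numpy as np

ACTIONS = {0: np.array([0.0, 1.0]),    # North
           1: np.array([1.0, 0.0]),    # East
           2: np.array([0.0, -1.0]),   # South
           3: np.array([-1.0, 0.0])}   # West

def step(s, a, theta_b, delta=0.05, c=0.1):
    # Step of size delta in the commanded direction; theta_b = +/-1 flips
    # the cardinal orientation, and the wind term grows with the squared
    # distance from the start region at the origin (our assumed form).
    direction = theta_b * ACTIONS[a]
    wind = theta_b * c * float(s @ s) * ACTIONS[a]
    return s + delta * direction + wind

s = step(np.zeros(2), a=1, theta_b=1)   # no wind at the start region
```

Note that the hidden parameter $\theta_b$ multiplies both the commanded step and the state-dependent wind, so its effect on a transition depends on where the agent is; this is precisely the state-latent interaction a linear model cannot capture.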
A linear function of the state and latent parameters would struggle to model both classes of instances ($\theta_b = +1$ and $\theta_b = -1$) in this domain, because the state transition resulting from taking an action is a nonlinear function with interactions between the state $s$ and the hidden parameter $\theta_b$.
By contrast, our new HiP-MDP model allows nonlinear interactions between the state $s$ and the latent parameters $w_b$, and jointly models their uncertainty. In Figure 1, this produces measurable differences in transition uncertainty in regions with few related observed transitions, even when there are many observations from unrelated instances. Here, the HiP-MDP is trained on two instances from distinct classes (shown in blue and red on the left). We display the uncertainty of the transition function $\tilde{T}$, using the latent parameters inferred for a red instance, in two regions of the domain: 1) an area explored during red instances and 2) an area not explored under red instances but explored under blue instances. The transition uncertainty is three times larger in the region where red instances have not been (even though many blue instances have been there) than in regions that red instances have commonly explored, demonstrating that the latent parameters can have different effects on the transition uncertainty in different states.
4 Inference
Algorithm 1 summarizes the inference procedure for learning a policy for a new task instance $b$, facilitated by a BNN pretrained for that task family, and is similar in structure to prior work (Deisenroth and Rasmussen, 2011; Gal et al., 2016). The procedure involves several parts. Specifically, at the start of a new instance $b$, we have a global replay buffer $\mathcal{B}$ of all observed transitions and a posterior over the parameters of our BNN transition function learned from the data in $\mathcal{B}$. The first objective is to quickly determine the latent embedding $w_b$ of the current instance's specific dynamical variation as transitions are observed from that instance. Transitions from instance $b$ are stored in both the global replay buffer $\mathcal{B}$ and an instance-specific replay buffer $\mathcal{B}_b$. The second objective is to develop an optimal control policy using the transition model $\tilde{T}$ and the learned latent parameters $w_b$. The transition model and latent embedding are separately updated via mini-batch stochastic gradient descent (SGD) using Adam (Kingma and Ba, 2015). Using $\tilde{T}$ for planning increases our sample efficiency, as we reduce interactions with the environment. We describe each of these parts in more detail below.

4.1 Updating embedding and BNN parameters
For each new instance, a new latent weighting $w_b$ is sampled from the prior (Alg. 1, step 2), in preparation for estimating the unobserved dynamics introduced by instance $b$. Next, we observe transitions from the task instance over an initial exploratory episode (Alg. 1, steps 7-10). Given that data, we optimize the latent parameters to minimize the divergence between the posterior predictions of the BNN and the true state transitions (step 3 in TuneModel) (Hernández-Lobato et al., 2016). Here, the minimization occurs by adjusting the latent embedding $w_b$ while holding the BNN parameters fixed. After an initial update of the $w_b$ for a newly encountered instance, the parameters of the BNN transition function are optimized (step 4 in TuneModel). When the BNN has been trained on multiple instances of a task, we found that an initial exploratory episode provides all the additional data needed to refine the BNN and the latent $w_b$ for a new instance. Otherwise, additional data from subsequent episodes can be used to further improve the BNN and latent estimates (Alg. 1, steps 11-14). The mini-batches used for optimizing the latent parameters and the BNN network parameters are sampled from $\mathcal{B}_b$ with squared-error prioritization (Moore and Atkeson, 1993). We found that alternating between small updates to the latent parameters and small updates to the BNN parameters led to the best transfer performance. If either the BNN network parameters or the latent parameters are updated too aggressively (with a large learning rate or an excessive number of training epochs), the BNN disregards the latent parameters or the state inputs, respectively. After completing an instance, the BNN parameters and the latent parameters are updated using samples from the global replay buffer $\mathcal{B}$. Specific modeling details, such as the number of epochs and learning rates, are described in Appendix C.

4.2 Updating policy
We construct an $\epsilon$-greedy policy that selects actions based on an approximate action-value function $Q(s, a)$. We model the action-value function with a Double Deep Q Network (DDQN) (Hasselt et al., 2016; Mnih et al., 2015). The DDQN involves training two separately parametrized networks: a primary Q-network, which informs the policy, and a target Q-network, which is a slowly annealed copy of the primary network (step 9 of SimEp) that provides greater stability when updating the policy.
With the updated transition function $\tilde{T}$, we approximate the environment when developing a control policy (SimEp). We simulate batches of entire episodes using the approximate dynamical model $\tilde{T}$, storing each transition in a fictional experience replay buffer (steps 2-6 in SimEp). The primary network parameters are updated via SGD at regular intervals (step 8 in SimEp) to minimize the temporal-difference error between the primary network's and the target network's Q-values. The mini-batches used in the update are sampled from the fictional experience replay buffer using TD-error-based prioritization (Schaul et al., 2016).
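Putting the pieces of Sections 4.1 and 4.2 together, the per-instance procedure has roughly the following shape. This is a pseudocode-level sketch; the function names are ours, and the optimizers, buffers, and prioritized sampling are elided.

```python
def run_instance(bnn, prior, env, policy, global_buffer, n_episodes):
    w_b = prior.sample()                  # fresh latent weighting (step 2)
    instance_buffer = []
    for ep in range(n_episodes):
        # Collect a real episode (exploratory on the first pass).
        transitions = collect_episode(env, policy)
        instance_buffer += transitions
        global_buffer += transitions

        # TuneModel: alternate small updates, holding the other part fixed.
        w_b = update_latent(bnn, w_b, instance_buffer)       # BNN frozen
        bnn = update_bnn_weights(bnn, w_b, instance_buffer)  # w_b frozen

        # SimEp: learn the policy from fictional rollouts of the model.
        fictional_buffer = simulate_episodes(bnn, w_b)
        policy = update_ddqn(policy, fictional_buffer)
    return policy, w_b
```

The design choice worth noting is that the environment is touched only in `collect_episode`; all policy updates are driven by fictional transitions from the model, which is what yields the sample efficiency discussed above.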
5 Experiments and Results
We now demonstrate the performance of the HiP-MDP with embedded latent parameters in transferring learning across various instances of the same task. We revisit the 2D demonstration problem from Section 3, and also describe results on both acrobot (Sutton and Barto, 1998) and a more complex healthcare domain: prescribing effective HIV treatments to patients with varying physiologies (Ernst et al., 2006). (Example code for training and evaluating a HiP-MDP, including the simulators used in this section, can be found at http://github.com/dtak/hipmdppublic.)
For each of these domains, we compare our formulation of the HiP-MDP with embedded latent parameters (equation 2) against four baselines (one model-free and three model-based) to demonstrate the efficiency of learning a policy for a new instance using the HiP-MDP. These comparisons are made across the first handful of episodes of a new task instance to highlight the advantage provided by transferring information through the HiP-MDP. The 'linear' baseline uses a BNN to learn a set of basis functions that are linearly combined with the latent parameters $w_b$ (approximating the approach of Doshi-Velez and Konidaris (2016), equation 1), which does not allow interactions between states and weights. The 'model-based from scratch' baseline treats each task instance as unique, requiring the BNN transition function to be trained only on observations from the current task instance. The 'average' model baseline is constructed under the assumption that a single transition function can be used for every instance of the task; it is trained on observations from all task instances together. For all model-based approaches, we replicated the HiP-MDP procedure as closely as possible: the BNN was trained on observations from a single episode before being used to generate a large batch of approximate transition data, from which a policy was learned. Finally, the model-free baseline learns a DDQN policy directly from observations of the current instance.
For more information on the experimental specifications and long-run policy learning, see Appendices C and D, respectively.
5.1 Revisiting the 2D demonstration
The HiP-MDP and the average model were supplied with a transition model trained on two previous instances, one from each class, before being updated according to the procedure outlined in Section 4 for a newly encountered instance. After the first exploratory episode, the HiP-MDP has sufficiently determined the latent embedding, as evidenced in Figure 1(b), where the developed policy clearly outperforms all four benchmarks. This implies that the transition model, aided by the learned latent parameterization, provides the accuracy needed to develop an optimal policy.
The HiP-MDP with linear $w_b$ also quickly adapts to the new instance and learns a good policy. However, it is unable to model the nonlinear interaction between the latent parameters and the state. Therefore, its model is less accurate and it learns a less consistent policy than the HiP-MDP with embedded $w_b$ (see Figure 5(a) in Appendix A.2).
With a single episode of data, the model trained from scratch on the current instance is not accurate enough to learn a good policy. Training a BNN from scratch requires more observations of the true dynamics than the HiP-MDP needs to learn the latent parameterization and achieve a high level of accuracy. The model-free approach eventually learns an optimal policy but requires significantly more observations to do so, as shown in Figure 1(a); it makes no improvement in the first 10 episodes. The poor performance of the average-model approach indicates that a single model cannot adequately represent the dynamics of the different task instances. Hence, learning a latent representation of the dynamics specific to each instance is crucial.
5.2 Acrobot
First introduced by Sutton and Barto (1998), acrobot is a canonical RL and control problem. The most common objective in this domain is for the agent to swing up a two-link pendulum by applying a positive, neutral, or negative torque to the joint between the two links (see Figure 2(a)). These actions must be performed in sequence such that the tip of the bottom link reaches a predetermined height above the top of the pendulum. The state space consists of the angles $\theta_1, \theta_2$ and angular velocities $\dot{\theta}_1, \dot{\theta}_2$, with hidden parameters corresponding to the masses ($m_1$, $m_2$) and lengths ($l_1$, $l_2$) of the two links. (The centers of mass and moments of inertia can also be varied; for our purposes we left them unperturbed.) See Appendix B.2 for details on how these hidden parameters were varied to create different task instances. A policy learned on one setting of the acrobot will generally perform poorly on other settings of the system, as noted in Bai et al. (2013). Thus, subtle changes in the physical parameters require separate policies to adequately control the varied dynamical behavior they introduce. This provides a perfect opportunity to apply the HiP-MDP to transfer knowledge between separate acrobot instances when learning a control policy for the current instance.

Figure 2(b) shows that the HiP-MDP learns an optimal policy after a single episode, whereas all other model-based benchmarks required an additional episode of training. As in the toy example, the model-free approach eventually learns an optimal policy but requires more time.
5.3 HIV treatment
Determining effective treatment protocols for patients with HIV was introduced as an RL problem by mathematically representing a patient's physiological response to separate classes of treatments (Adams et al., 2004; Ernst et al., 2006). In this model, the state of a patient's health is recorded via six separate markers measured with a blood test: the viral load, the counts of healthy and infected CD4 T-lymphocytes, the counts of healthy and infected macrophages, and the count of HIV-specific cytotoxic T-cells. Patients are given one of four treatments on a regular schedule: treatment from one of two classes of drugs, a mixture of the two treatments, or no treatment (effectively a rest period). There are 22 hidden parameters in this system that control a patient's specific physiology and dictate the rates of virulence, cell birth, infection, and death (see Appendix B.3 for more details). The objective is to develop a treatment sequence that transitions the patient from an unhealthy steady state to a healthy steady state (Figure 3(a); see Adams et al. (2004) for a more thorough explanation). Small changes to these parameters can greatly affect the behavior of the system and can therefore introduce separate steady-state regions that require unique policies to transition between them.
Figure 3(b) shows that the HiP-MDP develops an optimal control policy after a single episode, learning an unmatched policy in the shortest time. The HIV simulator is the most complex of our three domains, and the separation between the benchmarks is more pronounced. Modeling the HIV dynamical system from scratch with a single episode of observations proved infeasible. The average model, which is trained on a large batch of observations from related dynamical systems, learns a better policy. The HiP-MDP with linear $w_b$ is able to transfer knowledge from previous task instances and quickly learn the latent parameterization for the new instance, leading to an even better policy. However, the dynamical system contains nonlinear interactions between the latent parameters and the state space, which, unlike the HiP-MDP with embedded $w_b$, the HiP-MDP with linear $w_b$ cannot model. This demonstrates the superiority of the HiP-MDP with embedded $w_b$ for efficiently transferring knowledge between instances in highly complex domains.
6 Related Work
There has been a large body of work on solving single POMDP models efficiently (Brunskill and Li, 2013; Fern and Tadepalli, 2010; Kaelbling et al., 1998; Rosman et al., 2016; Williams and Young, 2006). In contrast, transfer learning approaches leverage training done on one task to perform related tasks. Strategies for transfer learning include: latent variable models, reusing pretrained model parameters, and learning a mapping between separate tasks (see review in (Taylor and Stone, 2009)).
Our work falls into the latent variable model category. Using latent representations to relate tasks has been particularly popular in robotics, where similar physical movements can be exploited across a variety of tasks and platforms (Delhaisse et al., 2017; Gupta et al., 2017). In Chen et al. (2016), these latent representations are encoded as separate MDPs with an accompanying index that an agent learns while adapting to observed variations in the environment. Bai et al. (2013) take an approach closely related to our updated HiP-MDP formulation by incorporating estimates of unknown or partially observed parameters of a known environmental model and refining those estimates using model-based Bayesian RL. The core difference is that we learn the transition model and the observed variations directly from the data, whereas Bai et al. (2013) assume the model is given and learn only the specific variations of its parameters. Also related are multi-task approaches that train a single model for multiple tasks simultaneously (Bonilla et al., 2008; Caruana, 1998). Finally, there have been many applications of reinforcement learning in the healthcare domain (e.g., Moore et al., 2014; Shortreed et al., 2011; Tenenbaum et al.) and of transfer learning via the identification of subgroups with similar responses (e.g., Jaques et al., 2015; Marivate et al., 2014; Schulam and Saria, 2016).
More broadly, BNNs are powerful probabilistic models that allow for the estimation of stochastic dynamical systems (Depeweg et al., 2017a; Gal et al., 2016). Core to this functionality is their ability to represent both model uncertainty and transition stochasticity (Kendall and Gal, 2017). Recent work decomposes these two forms of uncertainty, isolating the separate streams of information to improve learning. Our use of fixed latent variables as input to a BNN helps account for model uncertainty when transferring the pretrained BNN to a new instance of a task; other approaches use stochastic latent variable inputs to introduce transition stochasticity (Depeweg et al., 2017b; Moerland et al., 2017).
We view the HiP-MDP with latent embedding as a methodology that can facilitate personalization, and do so robustly, by transferring knowledge from prior observations to the current instance. This approach can be especially useful for extending personalized care to groups of patients with similar diagnoses, but it also extends to any control system in which variations may be present.
7 Discussion and Conclusion
We present a new formulation for transfer learning among tasks with similar, but not identical, dynamics within the HiP-MDP framework. Our approach leverages a latent embedding, learned and optimized in an online fashion, to approximate the true dynamics of a task. Our adjustment to the HiP-MDP provides robust and efficient learning when faced with varied dynamical systems distinct from those previously learned. By virtue of transfer learning, it rapidly determines optimal control policies when faced with a unique instance.
The results in this work assume the presence of a large batch of already-collected data. This setting is common in many industrial and health domains, where there may be months, sometimes years, worth of operations data on plant function, product performance, or patient health. Even with large batches, each new instance still requires collapsing the uncertainty around the instance-specific parameters in order to quickly perform well on the task. In Section 5, we used a batch of transition data from multiple instances of a task, without any artificial exploration procedure, to train the BNN and learn the latent parameterizations. Seeded with data from diverse task instances, the BNN and latent parameters accounted for the variation between instances.
While we are primarily interested in settings where batches of observational data exist, one might also consider more traditional settings in which the first instance is completely new, the second instance has information only from the first, and so on. In our initial explorations, we found that one can indeed learn the BNN in an online manner for simpler domains. However, even in simple domains, the model-selection problem becomes more challenging: an overly expressive BNN can overfit to the first few instances and then has a hard time adapting when it sees data from an instance with very different dynamics. Model-selection approaches that allow the BNN to learn online, starting from scratch, are an interesting direction for future research.
Another interesting extension is rapidly identifying the latent $w_b$. Exploration aimed at identifying $w_b$ would supply the dynamical model with data from the regions of the domain with the largest uncertainty. This could lead to a more accurate latent representation of the observed dynamics while also improving the overall accuracy of the transition model. We also found that training a DQN requires careful exploration strategies: when exploration is constrained too early, the DQN quickly converges to a suboptimal, deterministic policy, often choosing the same action at each step. Training a DQN along the BNN's trajectories of least certainty could lead to improved coverage of the domain and result in more robust policies. The development of effective policies would be greatly accelerated if exploration were more robust and stable. One could also use the hidden parameters to learn a policy directly.
Recognizing structure between task variations through latent embeddings enables a form of transfer learning that is both robust and efficient. Our extension of the HiP-MDP demonstrates how embedding a low-dimensional latent representation in the input of an approximate dynamical model facilitates transfer and yields a more accurate model of a complex dynamical system, as interactions between the input state and the latent representation are modeled naturally. We also model correlations in the output dimensions by replacing the GP basis functions of the original HiP-MDP formulation with a BNN, which scales significantly better to larger and more complex problems. Our improvements to the HiP-MDP provide a foundation for robust and efficient transfer learning, and future work will contribute toward a general transfer learning framework capable of addressing nuanced and complex control problems.
Acknowledgements
We thank Mike Hughes, Andrew Miller, Jessica Forde, and Andrew Ross for their helpful conversations. TWK was supported by the MIT Lincoln Laboratory Lincoln Scholars Program. GDK is supported in part by the NIH R01MH109177. The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
References
 Adams et al. [2004] BM Adams, HT Banks, H Kwon, and HT Tran. Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical Biosciences and Engineering, pages 223–241, 2004.

 Alvarez et al. [2012] MA Alvarez, L Rosasco, ND Lawrence, et al. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
 Bai et al. [2013] H Bai, D Hsu, and WS Lee. Planning how to learn. In Robotics and Automation, International Conference on, pages 2853–2859. IEEE, 2013.
 Blundell et al. [2015] C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra. Weight uncertainty in neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613–1622, 2015.
 Bonilla et al. [2008] EV Bonilla, KM Chai, and CK Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, volume 20, pages 153–160, 2008.
 Brunskill and Li [2013] E Brunskill and L Li. Sample complexity of multi-task reinforcement learning. arXiv preprint arXiv:1309.6821, 2013.
 Caruana [1998] R Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
Chen et al. [2016] M Chen, E Frazzoli, D Hsu, and WS Lee. POMDP-lite for robust robot planning under uncertainty. In Robotics and Automation, International Conference on, pages 5427–5433. IEEE, 2016.
Deisenroth and Rasmussen [2011] MP Deisenroth and CE Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.
 Delhaisse et al. [2017] B Delhaisse, D Esteban, L Rozo, and D Caldwell. Transfer learning of shared latent spaces between robots with similar kinematic structure. In Neural Networks, International Joint Conference on. IEEE, 2017.
Depeweg et al. [2017a] S Depeweg, JM Hernández-Lobato, F Doshi-Velez, and S Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. In International Conference on Learning Representations, 2017a.
Depeweg et al. [2017b] S Depeweg, JM Hernández-Lobato, F Doshi-Velez, and S Udluft. Uncertainty decomposition in Bayesian neural networks with latent variables. arXiv preprint arXiv:1706.08495, 2017b.
Dietrich and Newsam [1997] CR Dietrich and GN Newsam. Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM Journal on Scientific Computing, 18(4):1088–1107, 1997.

Doshi-Velez and Konidaris [2016] F Doshi-Velez and G Konidaris. Hidden parameter Markov decision processes: a semiparametric regression approach for discovering latent task parametrizations. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, volume 25, pages 1432–1440, 2016.
Ernst et al. [2006] D Ernst, G Stan, J Goncalves, and L Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, 2006.
 Fern and Tadepalli [2010] A Fern and P Tadepalli. A computational decision theory for interactive assistants. In Advances in Neural Information Processing Systems, pages 577–585, 2010.

Gal and Ghahramani [2016] Y Gal and Z Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
Gal et al. [2016] Y Gal, R McAllister, and CE Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, 2016.
Genton et al. [2015] MG Genton, W Kleiber, et al. Cross-covariance functions for multivariate geostatistics. Statistical Science, 30(2):147–163, 2015.
 Gupta et al. [2017] A Gupta, C Devin, Y Liu, P Abbeel, and S Levine. Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations, 2017.
Hasselt et al. [2016] H van Hasselt, A Guez, and D Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2094–2100. AAAI Press, 2016.
Hernández-Lobato et al. [2016] JM Hernández-Lobato, Y Li, M Rowland, D Hernández-Lobato, T Bui, and RE Turner. Black-box α-divergence minimization. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
Jaques et al. [2015] N Jaques, S Taylor, A Sano, and R Picard. Multi-task, multi-kernel learning for estimating individual wellbeing. In Proc. NIPS Workshop on Multimodal Machine Learning, 2015.
 Kaelbling et al. [1998] LP Kaelbling, ML Littman, and AR Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
Kendall and Gal [2017] A Kendall and Y Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
 Kingma and Ba [2015] D Kingma and J Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

MacKay [1992] DJC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Marivate et al. [2014] VN Marivate, J Chemali, E Brunskill, and M Littman. Quantifying uncertainty in batch personalized sequential decision making. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
 Mnih et al. [2015] V Mnih, K Kavukcuoglu, D Silver, A A Rusu, J Veness, M G Bellemare, A Graves, M Riedmiller, A K Fidjeland, G Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Moerland et al. [2017] TM Moerland, J Broekens, and CM Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.
 Moore and Atkeson [1993] AW Moore and CG Atkeson. Prioritized sweeping: reinforcement learning with less data and less time. Machine learning, 13(1):103–130, 1993.
Moore et al. [2014] BL Moore, LD Pyeatt, V Kulkarni, P Panousis, K Padrez, and AG Doufas. Reinforcement learning for closed-loop propofol anesthesia: a study in human volunteers. Journal of Machine Learning Research, 15(1):655–696, 2014.
Neal [1992] RM Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992.
QuiñoneroCandela and Rasmussen [2005] J Quiñonero-Candela and CE Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
 Rasmussen and Kuss [2003] CE Rasmussen and M Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems, volume 15, 2003.
 Rasmussen and Williams [2006] CE Rasmussen and CKI Williams. Gaussian processes for machine learning. MIT Press, Cambridge, 2006.
 Rosman et al. [2016] B Rosman, M Hawasly, and S Ramamoorthy. Bayesian policy reuse. Machine Learning, 104(1):99–127, 2016.
 Schaul et al. [2016] T Schaul, J Quan, I Antonoglou, and D Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
 Schulam and Saria [2016] P Schulam and S Saria. Integrative analysis using coupled latent variable models for individualizing prognoses. Journal of Machine Learning Research, 17:1–35, 2016.
Shortreed et al. [2011] SM Shortreed, E Laber, DJ Lizotte, TS Stroup, J Pineau, and SA Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1–2):109–136, 2011.
Snelson and Ghahramani [2006] E Snelson and Z Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
 Sutton and Barto [1998] R Sutton and A Barto. Reinforcement learning: an introduction, volume 1. MIT Press, Cambridge, 1998.
 Taylor and Stone [2009] ME Taylor and P Stone. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
Tenenbaum et al. M Tenenbaum, A Fern, L Getoor, M Littman, V Manasinghka, S Natarajan, D Page, J Shrager, Y Singer, and P Tadepalli. Personalizing cancer therapy via machine learning.
Williams and Young [2006] JD Williams and S Young. Scaling POMDPs for dialog management with composite summary point-based value iteration (CSPBVI). In AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems, pages 37–42, 2006.
Appendix
Appendix A BNN-based transition functions with embedded latent weighting
a.1 Scalability of BNN- vs. GP-based transition approximation
In this section, we demonstrate the computational motivation for replacing the GP basis functions of the original HiPMDP model [Doshi-Velez and Konidaris, 2016] with a single, standalone BNN (as discussed in Sec. 3), using the 2D navigation domain. To fully motivate the replacement, we altered the GP-based model to accept the latent parameters and a one-hot encoded action as additional inputs to the transition model. This was done to investigate how the performance of the GP would scale with a higher input dimension: the original formulation of the HiPMDP [Doshi-Velez and Konidaris, 2016] uses 2 input dimensions, whereas our proposed reformulation uses 11 (2 from the state, 4 from the action, and 5 from the latent parameterization). We directly compare the runtime performance of training both the GP-based and BNN-based HiPMDP over 6 unique instances of the toy domain, with 50 episodes per instance. Figure 5 shows the running times (in seconds) for each episode of the GP-based HiPMDP and the BNN-based HiPMDP, with the transition model and latent parameters being updated after every 10 episodes. In stark contrast with the increase in computation for the GP-based HiPMDP, the BNN-based HiPMDP exhibits no increase in computation time as more data are gathered and further training occurs. Training the BNN over the course of 300 separate episodes in the 2D toy domain completed in a little more than 8 hours; in contrast, the GP-based HiPMDP, trained on the same domain, took close to 70 hours to complete training on the same number of episodes.
This significant increase in computation time for the GP-based HiPMDP, on a relatively simple domain, prevented us from performing comparisons to the GP model on our other domains. (We recognize that there is a large literature on making GPs more computationally efficient [Dietrich and Newsam, 1997, Quiñonero-Candela and Rasmussen, 2005, Snelson and Ghahramani, 2006]; we chose BNNs because, if we were going to make many inference approximations anyway, it seemed reasonable to turn to a model that can easily capture heteroskedastic, multimodal noise and correlated outputs.)
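The scaling gap has a simple complexity explanation. Below is an illustrative cost model (a sketch, not the paper's implementation): exact GP inference must factorize a kernel matrix that grows with the total number of observed transitions, while a BNN minibatch gradient step has a fixed cost regardless of how much data has accumulated.

```python
def gp_update_cost(n_transitions):
    # Naive exact GP inference factorizes an n x n kernel matrix: O(n^3),
    # so each update grows cubically with the data collected so far.
    return n_transitions ** 3

def bnn_update_cost(batch_size=32, n_params=10_000):
    # One minibatch gradient step costs O(batch_size * n_params),
    # independent of the total amount of data collected.
    return batch_size * n_params

# As episodes accumulate, the GP's per-update cost explodes while the
# BNN's stays constant.
for n in (100, 1_000, 10_000):
    print(n, gp_update_cost(n), bnn_update_cost())
```

This is the qualitative pattern Figure 5 reports: GP episode times grow with the buffer, BNN episode times stay flat.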
a.2 Prediction performance: benefit of embedding the latent parameters
In the previous section, we justified replacing the GP basis functions in the HiPMDP with a BNN. In this section, we investigate the prediction performance of various BNN models to determine whether the latent embedding provides the desired effect of more robust and efficient transfer. The BNN models characterized here are those presented in Sec. 5 and used as baseline comparisons to the HiPMDP with embedded latent parameters. These models are:

- HiPMDP with embedded latent parameters
- HiPMDP with linear latent parameters
- BNN learned from scratch, without any latent characterization of the dynamics
- Average model: a BNN trained on all data without any latent characterization of the dynamics
For each of these benchmarks except the BNN trained from scratch, a batch of transition data from previously observed instances was used to pretrain the BNN (and to learn the latent parameters for the HiPMDPs). Each method was then used to learn the dynamics of a new, previously unobserved instance. After the first episode in the newly encountered instance, the BNN is updated; in the two models that use a latent estimation of the environment, the latent parameters are also updated. As can be seen in Fig. 6, the models using a latent parameterization improve greatly after those first network and latent parameter updates. The other two (the model learned from scratch and the average model) also improve, but only marginally: the average model is unable to account for the different dynamics of the new instance, and the model trained from scratch does not have enough observed transition data from the new instance to construct an accurate representation of the transition dynamics.
The superior predictive performance of the two models that learn and utilize a latent estimate of the environment's underlying dynamics reinforces the intent of the HiPMDP as a latent variable model: by estimating and employing a latent estimate of the environment, one may robustly transfer trained transition models to previously unseen instances. Further, as shown across both domains in Fig. 6, the BNN with the latent parameterization embedded in the input is more reliably accurate over the duration the model interacts with a new environment. This is because the HiPMDP with embedded latent parameters can model nonlinear interactions between the latent parameters and the state, whereas the HiPMDP with linear latent parameters cannot. Moreover, the 2D navigation domain was constructed such that the true transition function is a nonlinear function of the latent parameter and the state; therefore, the most accurate predictions can only be made by an approximate transition function that can model those nonlinear interactions. Hence, the 2D navigation domain demonstrates the importance of embedding the latent parameters in the input of the transition model.
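The distinction between the two latent parameterizations can be sketched as follows. The functions `f` and `g` below are toy stand-ins for trained networks (purely illustrative assumptions, not the paper's architecture), and the shapes follow the 2D navigation setup stated earlier (2 state dimensions, 4 actions, 5 latent parameters):

```python
import numpy as np

def embedded_prediction(f, s, a_onehot, w):
    # Embedded latent parameters: w is concatenated with the input, so a
    # nonlinear network f can capture interactions between w and the state.
    x = np.concatenate([s, a_onehot, w])
    return f(x)

def linear_prediction(g, s, a_onehot, w):
    # Linear latent parameters (original HiPMDP style): the latent weights
    # combine the network's basis outputs linearly, so no nonlinear
    # state-latent interactions can be modeled.
    basis = g(np.concatenate([s, a_onehot]))  # shape: (len(w), state_dim)
    return w @ basis

# Toy stand-ins for trained networks:
s = np.array([0.1, -0.2])
a = np.array([1.0, 0.0, 0.0, 0.0])
w = np.ones(5)
f = lambda x: np.tanh(x)[:2]                 # nonlinear in [s, a, w] jointly
g = lambda x: np.outer(np.ones(5), x[:2])    # basis functions of (s, a) only
print(embedded_prediction(f, s, a, w), linear_prediction(g, s, a, w))
```

In the embedded form, the network sees the latent parameters alongside the state, which is what allows the nonlinear interactions discussed above.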
Appendix B Experimental Domains
This section outlines the nonlinear dynamical systems that define the experimental domains investigated in Sec. 5. Here we outline the equations of motion, the hidden parameters dictating the dynamics of that motion, and the procedures used to perturb those parameters to produce subtle variations in the environmental dynamics. Other domain-specific settings, such as the length of an episode, are also presented.
b.1 2D Navigation Domain
As presented in Sec. 3, the transition dynamics of the 2D navigation domain include a force term governed by two hyperparameters that restrict the agent's movement either laterally or vertically, depending on the hidden parameter. In this domain, the hidden parameter is simply a binary choice between the two classes of agent ("blue" or "red"). This force, used to counteract or accentuate certain actions of the agent, is scaled nonlinearly by the distance the agent moves away from the center of the region of origin. The agent accumulates a small negative reward (0.1) for each step taken and a large penalty (5) if it hits a wall or attempts to cross into the goal region over the wrong boundary. The agent receives a substantial reward (1000) once it successfully navigates to the goal region over the correct boundary; this value was purposefully set large so as to encourage the agent to enter the goal region rapidly and to move against the force pushing it away from the goal region.
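The reward structure described above can be summarized in a short sketch. The wall/goal predicates are hypothetical helpers; the numeric values are those stated in the text:

```python
def navigation_reward(hit_wall, wrong_boundary, reached_goal):
    # Reward structure of the 2D navigation domain as described above:
    # a small per-step cost, a penalty for hitting a wall or crossing the
    # wrong boundary, and a large terminal reward for entering the goal
    # region over the correct boundary.
    r = -0.1
    if hit_wall or wrong_boundary:
        r -= 5.0
    if reached_goal:
        r += 1000.0
    return r
```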
At the initialization of a new episode, the class of the agent is chosen with uniform probability and the starting state of the agent is randomly chosen to lie within the region of origin.
b.2 Acrobot
The acrobot domain [Sutton and Barto, 1998] is governed by the standard two-link underactuated pendulum dynamics, evolving the state parameters $(\theta_1, \theta_2, \dot\theta_1, \dot\theta_2)$. The reward is $R = -1$ at each step while the foot of the pendulum has not exceeded the goal height; once it has, $R = 0$ and the episode ends. The hyperparameter settings are $l_{c1} = l_{c2} = 0.5$ (lengths to the center of mass of the links), $I_1 = I_2 = 1$ (moments of inertia of the links), and $g = 9.8$ (gravity). The hidden parameters are the lengths and masses of the two links, all initially set to 1. In order to observe varied dynamics from this system, we perturb the hidden parameters by adding Gaussian noise to each parameter independently at the initialization of a new instance. The possible state values for the angular velocities of the pendulum are constrained to $\dot\theta_1 \in [-4\pi, 4\pi]$ and $\dot\theta_2 \in [-9\pi, 9\pi]$.
At the initialization of a new episode, the agent's state is initialized to the zero vector (both links hanging straight down) and perturbed by small uniformly distributed noise in each dimension. The agent is then free to apply torques to the hinge until it raises the foot of the pendulum above the goal height, or until 400 time steps have elapsed.
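A minimal sketch of instance generation as described above: the hidden parameters (link lengths and masses, nominally 1) are perturbed with independent Gaussian noise. The noise scale is not specified in the text, so it is left as an argument here:

```python
import random

def sample_acrobot_instance(sigma, rng=random):
    # Draw a new acrobot task instance by perturbing the hidden parameters
    # (link lengths and masses, nominally 1) with independent Gaussian
    # noise. The scale `sigma` is an assumption left as an argument, since
    # the text does not state it.
    nominal = {"l1": 1.0, "l2": 1.0, "m1": 1.0, "m2": 1.0}
    return {k: v + rng.gauss(0.0, sigma) for k, v in nominal.items()}
```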
b.3 HIV Treatment
The dynamical system used to simulate a patient's response to HIV treatments was formulated in [Adams et al., 2004]. The equations are highly nonlinear in the parameters and track the evolution of six core markers used to infer a patient's overall health. These markers are the viral load ($V$), the numbers of healthy and infected CD4 T-lymphocytes ($T_1$ and $T_1^*$, respectively), the numbers of healthy and infected macrophages ($T_2$ and $T_2^*$, respectively), and the number of HIV-specific cytotoxic T-cells ($E$). Thus, $s = (T_1, T_2, T_1^*, T_2^*, V, E)$. The system of equations is defined as:

$$\dot T_1 = \lambda_1 - d_1 T_1 - (1-\epsilon_1) k_1 V T_1$$
$$\dot T_2 = \lambda_2 - d_2 T_2 - (1-f\epsilon_1) k_2 V T_2$$
$$\dot T_1^* = (1-\epsilon_1) k_1 V T_1 - \delta T_1^* - m_1 E T_1^*$$
$$\dot T_2^* = (1-f\epsilon_1) k_2 V T_2 - \delta T_2^* - m_2 E T_2^*$$
$$\dot V = (1-\epsilon_2) N_T \delta (T_1^* + T_2^*) - cV - \left[(1-\epsilon_1)\rho_1 k_1 T_1 + (1-f\epsilon_1)\rho_2 k_2 T_2\right] V$$
$$\dot E = \lambda_E + \frac{b_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_b}E - \frac{d_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_d}E - \delta_E E$$
With reward function $R = -0.1\,V - 2\times 10^4\,\epsilon_1^2 - 2\times 10^3\,\epsilon_2^2 + 10^3\,E$, where $\epsilon_1$ and $\epsilon_2$ are treatment-specific parameters selected by the prescribed action.
As was done with the acrobot, at the initialization of a new instance these hidden parameters are perturbed by Gaussian noise, applied to each parameter independently. These perturbations were applied naively and at times would cause the dynamical system to lose stability or otherwise exhibit non-physical behavior; we filter out such instantiations of the domain and deploy the HiPMDP only on well-behaved and controllable versions of this dynamical system.
At the initialization of a new episode, the agent is started at an unhealthy steady state where the viral load and the number of infected cells are much higher than the number of virus-fighting T-cells. An episode consists of 200 time steps, where one time step corresponds to 5 days. At each 5-day interval, the patient's state is observed and a treatment is prescribed, until the treatment period of 1000 days has been completed.
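A hedged sketch of the reward and action set for this domain: the numeric coefficients and on-efficacies below follow Ernst et al. [2006] and are assumptions, not values taken from the main text:

```python
def hiv_reward(V, E, eps1, eps2):
    # Penalize viral load and drug dosage, reward immune response.
    # Coefficients follow Ernst et al. [2006] (an assumption here).
    return -0.1 * V - 2e4 * eps1 ** 2 - 2e3 * eps2 ** 2 + 1e3 * E

# The four actions toggle the two drugs on/off; the on-efficacies
# (0.7 for the RTI, 0.3 for the PI) also follow Ernst et al. [2006].
ACTIONS = [(0.0, 0.0), (0.7, 0.0), (0.0, 0.3), (0.7, 0.3)]
```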
Appendix C Experiment Specifications
c.1 Bayesian neural network
HiPMDP architecture
For all domains, we model the dynamics using a feedforward BNN. For the toy example, we used 3 fully connected hidden layers with 25 hidden units each; for the acrobot and HIV domains, we used 2 fully connected hidden layers with 32 units each. We used rectifier (ReLU) activation functions on each hidden layer and the identity activation function on the output layer. For the HiPMDP with embedded latent parameters, the input to the BNN is a vector consisting of the state, a one-hot encoding of the action, and the latent embedding. The BNN architecture for the HiPMDP with linear latent parameters uses a different input layer and output layer: the input does not include the latent parameters; rather, the BNN outputs a matrix of basis predictions that is combined linearly with the latent parameters to compute the next state. In all experiments, the BNN output is the state difference rather than the next state.
Hyperparameters and Training
For all domains, we placed zero-mean priors on the random input noise and the network weights, with a variance of 1.0 on the input noise, following the procedure used by Hernández-Lobato et al. [2016]. In our experiments, we found the BNN performed best when initialized with a small prior variance on the network weights that increases over training, rather than with a large prior variance. Following Hernández-Lobato et al. [2016], we learn the network parameters by minimizing the α-divergence using ADAM, with separate α settings for acrobot and the toy example than for the HIV domain. In each update to the BNN, we performed 100 epochs of ADAM, where in each epoch we sampled 160 transitions from a prioritized experience buffer and divided those transitions into minibatches of size 32. We used one learning rate for HIV and acrobot and a separate learning rate for the toy example.
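The update schedule above (100 epochs, 160 sampled transitions per epoch, minibatches of 32) can be sketched as follows; `sample_fn` and `sgd_step` are hypothetical interfaces standing in for the prioritized buffer and the ADAM step:

```python
import random

def bnn_update(sample_fn, sgd_step,
               n_epochs=100, sample_size=160, batch_size=32):
    # One BNN update as described above: each epoch draws `sample_size`
    # transitions from the prioritized experience buffer and processes
    # them in minibatches of size `batch_size`.
    for _ in range(n_epochs):
        batch = list(sample_fn(sample_size))
        random.shuffle(batch)
        for i in range(0, len(batch), batch_size):
            sgd_step(batch[i:i + batch_size])
```

With the stated settings, each update performs 100 buffer draws and 500 gradient steps (5 minibatches per epoch).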
The BNN and latent parameters were learned from a batch of transition data gathered from multiple instances across 500 episodes per instance. For the toy example, acrobot, and HIV, we use data from 2, 8, and 5 instances, respectively. For HIV, we found performance improved by standardizing the observed states to have zero mean, unit variance.
c.2 Latent Parameters
For all domains, we used 5 latent parameters. The latent parameters were updated using the same procedure as for the BNN network parameters (but with the network parameters held fixed) and a separate learning rate.
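A minimal sketch of this latent-parameter update, assuming a generic `grad_fn` (a hypothetical function returning the gradient of the BNN loss with respect to the latent parameters, with the network weights frozen):

```python
def update_latent_params(w, grad_fn, lr, n_steps):
    # Gradient descent on the latent parameters alone; the BNN network
    # weights are held fixed inside `grad_fn`.
    for _ in range(n_steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w
```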
c.3 Deep QNetwork
To learn a policy for a new task instance, we use a Double Deep Q-Network with two fully connected hidden layers of 256 and 512 hidden units, respectively. Rectifier activation functions are used on the hidden layers and the identity function on the output layer. For all domains, we update the primary network weights at a fixed interval of time steps using ADAM, and slowly update the target network to mirror the primary network at a small rate. Additionally, we clip gradients such that their L2-norm is less than 2.5. We use an ε-greedy policy, decaying ε after each episode (each real episode for the model-free approach and each approximated episode for the model-based approaches).
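Two of the mechanics above can be sketched with the unspecified constants left as arguments; the multiplicative form of the ε decay is an assumption, since the text omits the exact schedule:

```python
def soft_update(target_weights, primary_weights, tau):
    # Slowly move the target network toward the primary network, as in the
    # Double DQN above; the mixing rate tau is not given in the text.
    return [(1 - tau) * t + tau * p
            for t, p in zip(target_weights, primary_weights)]

def decay_epsilon(eps, rate, eps_min=0.0):
    # Epsilon decayed after each (real or approximated) episode;
    # multiplicative decay is one common choice (an assumption here).
    return max(eps_min, eps * rate)
```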
In the model-based approaches, we found that the DQN learns more robust policies (both on the BNN's approximate dynamics and in the real environment) when trained exclusively on the approximated transitions of the BNN. After training the BNN on the first episode, we train the DQN using an initial batch of approximated episodes generated with the BNN.
c.4 Prioritized Experience Replay Buffers
We used a TD-error-based prioritized experience replay buffer [Schaul et al., 2016] to store experiences used to train the DQN. For model-based approaches, we used a separate squared-error-based prioritized buffer to store experiences used to train the BNN and learn the latent parameterization. Each prioritized buffer was large enough to store all experiences. We used a prioritization exponent of 0.2 and an importance sampling exponent of 0.1.
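The sampling scheme with these exponents can be sketched as follows; this is a minimal version of the proportional prioritization of Schaul et al. [2016], not the exact implementation used:

```python
import numpy as np

def prioritized_sample(priorities, n, rng, alpha=0.2, beta=0.1):
    # Sample transition indices with probability proportional to
    # priority**alpha and compute importance-sampling weights with
    # exponent beta, using the exponents stated above (0.2 and 0.1).
    p = np.asarray(priorities, dtype=float) ** alpha
    p /= p.sum()
    idx = rng.choice(len(p), size=n, p=p)
    weights = (len(p) * p[idx]) ** (-beta)
    weights /= weights.max()  # normalize, as in Schaul et al. [2016]
    return idx, weights
```

With uniform priorities this reduces to uniform sampling with unit importance weights.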
Appendix D Long run demonstration of policy learning
We demonstrate that, given a sufficient number of training episodes, all benchmark methods learn good control policies for new, unique instances of the acrobot domain (Figure 7(a)) and the HIV treatment domain (Figure 7(b)). However, in terms of policy-learning efficiency and proficiency, comparing the performance of the HiPMDP with the benchmarks over the first 10 episodes is instructive, as presented in the Experiments section of the main paper; this format emphasizes the immediate returns of using the embedded latent parameters to transfer previously learned information when encountering a new instance of a task.