Single Episode Policy Transfer in Reinforcement Learning

by   Jiachen Yang, et al.

Transfer and adaptation to new unknown environmental dynamics is a key challenge for reinforcement learning (RL). An even greater challenge is performing near-optimally in a single attempt at test time, possibly without access to dense rewards, which is not addressed by current methods that require multiple experience rollouts for adaptation. To achieve single episode transfer in a family of environments with related dynamics, we propose a general algorithm that optimizes a probe and an inference model to rapidly estimate underlying latent variables of test dynamics, which are then immediately used as input to a universal control policy. This modular approach enables integration of state-of-the-art algorithms for variational inference or RL. Moreover, our approach does not require access to rewards at test time, allowing it to perform in settings where existing adaptive approaches cannot. In diverse experimental domains with a single episode test constraint, our method significantly outperforms existing adaptive approaches and shows favorable performance against baselines for robust transfer.



1 Introduction

One salient feature of human intelligence is the ability to perform well in a single attempt at a new task instance, by recognizing critical characteristics of the instance and immediately executing appropriate behavior based on experience in similar instances. Artificial agents must do likewise in applications where success must be achieved in one attempt and failure is irreversible. This problem setting, single episode transfer, imposes a challenging constraint in which an agent experiences—and is evaluated on—only one episode of a test instance.

As a motivating example, a key challenge in precision medicine is the uniqueness of each patient’s response to therapeutics [11, 3, 34]. Adaptive therapy is a promising approach that formulates a treatment strategy as a sequential decision-making problem [40, 33, 21]. However, heterogeneity among instances may require explicitly accounting for factors that underlie individual patient dynamics. For example, in the case of adaptive therapy for sepsis [21], predicting patient response prior to treatment is not possible. However, differences in patient responses can be observed via blood measurements very early after the onset of treatment [6].

As a first step to address single episode transfer in reinforcement learning (RL), we propose a general algorithm for near-optimal test-time performance in a family of environments where differences in dynamics can be ascertained early during an episode. Our key idea is to train an inference model and a probe that together achieve rapid inference of latent variables—which account for variation in a family of similar dynamical systems—using a small fraction (e.g., 5%) of the test episode, then deploy a universal policy conditioned on the estimated parameters for near-optimal control on the new instance. Our approach combines the advantages of robust transfer and adaptation-based transfer, as we learn a single universal policy that requires no further training during test, but which is adapted to the new environment by conditioning on an unsupervised estimation of new latent dynamics.

In contrast to methods that quickly adapt or train policies via gradients during test but assume access to multiple test rollouts and/or dense rewards [9, 15, 23], we explicitly optimize for performance in one test episode without accessing the reward function at test time. Hence our method applies to real-world settings in which rewards during test are highly delayed or even completely inaccessible. We also consider computation time a crucial factor for real-time application, whereas some existing approaches require considerable computation during test [15]. Our algorithm builds on variational inference and RL as submodules, which ensures practical compatibility with existing RL workflows.

Our main contribution is a simple general algorithm for single episode transfer in families of environments with varying dynamics, via rapid inference of latent variables and immediate execution of a universal policy. Our method achieves significantly higher test performance, with orders of magnitude less computation time during test, than the state-of-the-art model-based method [15], on benchmark high-dimensional domains whose dynamics are discontinuous and continuous in latent parameters. We also show superior performance over optimization-based meta-learning and favorable performance versus baselines for robust transfer.

2 Single episode transfer in RL: problem setup

Our goal is to train a model that performs close to optimal within a single episode of a test instance with new unknown dynamics. We formalize the problem as a family of MDPs $\mathcal{M} = \{(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}_z, \gamma)\}_{z \in \mathcal{Z}}$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, and $\gamma$ are the state space, action space, reward function, and discount of an episodic Markov decision process (MDP). Each instance of the family is a stationary MDP with transition function $\mathcal{T}_z(s' \mid s, a)$. When a set of physical parameters determines transition dynamics [17], each $\mathcal{T}_z$ has a hidden parameter $z \in \mathcal{Z}$ that is sampled once from a distribution $P_{\mathcal{Z}}$ and held constant for that instance. For more general stochastic systems whose modes of behavior are not easily attributed to physical parameters, $\mathcal{Z}$ is induced by a generative latent variable model that indirectly associates each $\mathcal{T}_z$ to a latent variable $z$ learned from observed trajectory data. We use the term “latent variable” for both cases, with the clear ontological difference understood. Depending on the application, $\mathcal{T}_z$ can be continuous or discontinuous in $z$. We strictly enforce the challenging constraint that latent variables are never observed, in contrast to methods that use known values during training [38], to ensure the framework applies to challenging cases without prior knowledge.

This formulation captures a diverse set of important problems. The latent space $\mathcal{Z}$ has physical meaning in systems where $\mathcal{T}_z$ is a continuous function of physical parameters (e.g., friction and stiffness) with unknown values. In contrast, a discrete set $\mathcal{Z}$ can induce qualitatively different dynamics, such as a 2D navigation task where $z$ decides if the same action moves in either a cardinal direction or its opposite [15]. Such drastic impact of latent variables may arise when a single drug is effective for some patients but causes serious side effects for others [6].

Training phase. Our training approach is fully compatible with RL for episodic environments. We sample many instances, either via a simulator with controllable change of instances or using off-policy batch data in which demarcation of instances—but not values of latent variables—is known, and train for one or more episodes on each instance. While we focus on the case with known change of instances, the rare case of unknown demarcation can be approached either by preprocessing steps such as clustering trajectory data or using a dynamic variant of our algorithm (Appendix C).

Single test episode. In contrast to prior work that depends on the luxury of multiple experience rollouts for adaptation during test time [8, 15, 9, 23], we introduce the strict constraint that the trained model has access to, and is evaluated on, only one episode of a new test instance. This reflects the need to perform near-optimally as soon as possible in critical applications such as precision medicine, where an episode for a new patient with new physiological dynamics is the entirety of hospitalization.

3 Single episode policy transfer

We present Single Episode Policy Transfer (SEPT), a high-level algorithm for single episode transfer between MDPs with different dynamics. The following sections discuss specific design choices in SEPT, all of which are combined in synergy for near-optimal performance in a single test episode.

3.1 Policy transfer through latent space

Our best theories of natural and engineered systems involve physical constants and design parameters that enter into dynamical models. This physicalist viewpoint motivates a partition for transfer learning in families of MDPs: 1. learn a representation of latent variables with an inference model that rapidly encodes a vector $\hat{z}$ of discriminative features for a new instance; 2. train a universal policy to perform near-optimally for dynamics corresponding to any latent variable in $\mathcal{Z}$; 3. immediately deploy both the inference model and universal policy on a given test episode. To build on the generality of model-free RL, and for scalability to systems with complex dynamics, we do not expend computational effort to learn a model of the transition function $\mathcal{T}_z$, in contrast to model-based approaches [15, 37]. Instead, we leverage expressive variational inference models to represent latent variables and provide uncertainty quantification.

In domains with ground truth hidden parameters, a latent variable encoding is the most succinct representation of differences in dynamics between instances. As the encoding is held constant for all episodes of an instance, a universal policy can either adapt to all instances when $\mathcal{Z}$ is finite, or interpolate between instances when $\mathcal{T}_z$ is continuous in $z$ [25]. Estimating a discriminative encoding for a new instance enables immediate deployment of the universal policy on the single test episode, bypassing the need for further fine-tuning. This is critical for applications where further training of complex models on a test instance is not permitted due to safety concerns. In contrast, methods that do not explicitly estimate a latent representation of varied dynamics must use precious experiences in the test episode to tune the trained policy [9].

In the training phase, we generate an optimized dataset $\mathcal{D}$ of short trajectories (optimized in the sense of machine teaching, as explained fully in Section 3.3), where each trajectory $\tau \in \mathcal{D}$ is a sequence of early state-action pairs at the start of an episode of one instance. We train a variational auto-encoder, comprising an approximate posterior inference model $q_\phi(z \mid \tau)$ that produces a latent encoding $\hat{z}$ from $\tau$ and a parameterized generative model $p_\psi(\tau \mid z)$. The dimension chosen for $\hat{z}$ may differ from the exact true dimension when it exists but is unknown; domain knowledge can aid the choice of dimensionality reduction. Because dynamics of a large variety of natural systems are determined by independent parameters (e.g., coefficient of contact friction and Reynolds number can vary independently), we consider a disentangled latent representation where latent units capture the effects of independent generative parameters. To this end, we bring $\beta$-VAE [10] into the context of families of dynamical systems, choosing an isotropic unit Gaussian as the prior $p(z)$ and imposing the constraint $D_{KL}(q_\phi(z \mid \tau) \,\|\, p(z)) < \epsilon$. The $\beta$-VAE is trained by maximizing the variational lower bound $\mathcal{L}(\phi, \psi; \tau)$ for each $\tau$ across $\mathcal{D}$:

$$\max_{\phi, \psi} \; \mathcal{L}(\phi, \psi; \tau) := \mathbb{E}_{q_\phi(z \mid \tau)}\left[\log p_\psi(\tau \mid z)\right] - \beta\, D_{KL}\left(q_\phi(z \mid \tau) \,\|\, p(z)\right) \tag{1}$$

This subsumes the VAE [16] as a special case ($\beta = 1$), and we refer to both as VAE in the following. Since latent variables only serve to differentiate among trajectories that arise from different transition functions, the meaning of latent variables is not affected by isometries, and hence the value of $\hat{z}$ by itself need not have any simple relation to a physically meaningful $z$ even when one exists. Only the partition of latent space is important for training a universal policy.
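To make the objective concrete, the following is a minimal sketch of how a $\beta$-weighted lower bound can be evaluated for one trajectory, assuming a diagonal Gaussian posterior and a unit Gaussian prior so that the KL term has a closed form. The function names and the `recon_log_lik` argument (a decoder log-likelihood computed elsewhere) are hypothetical and are not taken from the paper's code.

```python
import math


def kl_diag_gaussian_to_unit(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def beta_vae_lower_bound(recon_log_lik, mu, log_var, beta=1.0):
    """Reconstruction log-likelihood minus the beta-weighted KL term.

    recon_log_lik: assumed to be log p(tau | z) for a sampled z,
                   supplied by a decoder that is not modeled here.
    beta = 1 recovers the standard VAE bound.
    """
    return recon_log_lik - beta * kl_diag_gaussian_to_unit(mu, log_var)
```

When the posterior equals the prior, the KL term vanishes and the bound reduces to the reconstruction term for any `beta`; larger `beta` penalizes posteriors that move away from the prior more heavily.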

Earlier methods for a family of similar dynamics relied on Bayesian neural network (BNN) approximations of the entire transition function $\mathcal{T}_z(s' \mid s, a)$, which was either used to perform computationally expensive fictional rollouts during test time [15] or used indirectly to further optimize a posterior over $z$ [37]. Our use of variational inference is more economical: the encoder can be used immediately to infer latent variables during test, while the decoder plays a crucial role for optimized probing in our algorithm (see Section 3.3).

In systems with ground truth hidden parameters, we desire two additional properties. First, the encoder should produce low-variance encodings, which we implement by minimizing the entropy of $q_\phi(z \mid \tau)$:

$$H(q_\phi(z \mid \tau)) = \frac{1}{2} \sum_{d=1}^{D} \log\left(2 \pi e\, \sigma_d^2\right) \tag{2}$$

under a diagonal Gaussian parameterization, where $\sigma_d^2$ is the variance of the $d$-th latent unit and $D = \dim(z)$. We add $-H(q_\phi(z \mid \tau))$ as a regularizer to (1). Second, we must capture the impact of $z$ on higher-order dynamics. While previous work neglects the order of transitions in a trajectory [23], we note that a single transition may be compatible with multiple instances whose differences manifest only at higher orders. In general, partitioning the latent space requires taking the ordering of a temporally-extended trajectory into account. Therefore, we parameterize our encoder using a bidirectional LSTM, since both temporal directions of state-action pairs are informative, and we use an LSTM decoder (architecture in Section E.2). In contrast to embedding trajectories from a single MDP for hierarchical learning [5], our purpose is to encode trajectories from different instances of transition dynamics for optimal control.
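The entropy regularizer has a simple closed form for a diagonal Gaussian, which the short sketch below evaluates from per-dimension log-variances. The function name and the log-variance input convention are assumptions made for illustration.

```python
import math


def diag_gaussian_entropy(log_var):
    """Differential entropy of N(mu, diag(exp(log_var))).

    Equals 0.5 * sum_d log(2 * pi * e * sigma_d^2); it does not
    depend on the mean, only on the per-dimension variances.
    """
    d = len(log_var)
    return 0.5 * d * math.log(2.0 * math.pi * math.e) + 0.5 * sum(log_var)
```

Because the entropy is linear in the log-variances, penalizing it pushes every latent dimension toward a narrower posterior, which matches the stated goal of low-variance encodings.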

3.2 Transfer of a universal policy

We train a single universal policy $\pi_\theta(a \mid s, z)$ and deploy the same policy during test, for two reasons: robustness against imperfection in latent variable representation and significant improvement in scalability. Earlier methods trained multiple instance-specific optimal policies on training instances with a set of hidden parameters, then employed either behavioral cloning [37] or off-policy Q-learning [2] to train a final policy using a dataset of their state-action pairs. However, this supervised training scheme may not be robust [38]: if the final policy were trained only using instance-specific optimal state-action pairs generated by the instance-specific policies and posterior samples of $z$ from an optimal inference model, it may not generalize well when faced with states and encodings that were not present during training. Moreover, it is computationally infeasible to train a collection of instance-specific policies (which is thrown away during test) when faced with a large set of training instances from a continuous set $\mathcal{Z}$. Instead, we interleave training of the VAE and a single policy $\pi_\theta$, benefiting from considerable computation savings at training time and higher robustness due to a larger effective sample count. In our experiments, we use DDQN with prioritized replay [32, 26], while other RL algorithms can be readily substituted (e.g., PPO for continuous action spaces [27]).

3.3 Optimized probing for accelerated latent variable inference

To execute near-optimal control within a single test episode, we first rapidly compute the encoding $\hat{z}$ using a short trajectory of initial experience. This is loosely analogous to the use of preliminary medical treatment to define subsequent prescriptions that better match a patient’s unique physiological response. Our goal of rapid inference motivates two algorithmic design choices to optimize this initial phase. First, the trajectory used for inference by the encoder must be optimized, in the sense of machine teaching [41], as certain trajectories are more suitable than others for inferring latent variables that underlie system dynamics. If specific degrees of freedom are impacted the most by latent variables, an agent should probe exactly those dimensions to produce an informative trajectory for inference. Conversely, methods that deploy a single universal policy without an initial probing phase [37] can fail in adversarial cases, such as when the initial placeholder encoding used in the universal policy at the start of an instance causes failure to exercise dimensions of dynamics that are necessary for inference. Second, the VAE must be specifically trained on a dataset of short trajectories consisting of initial steps of each training episode. We cannot expend a long trajectory for input to the encoder during test, to ensure enough remaining steps for control. Hence, single episode transfer motivates the machine teaching problem of learning to distinguish among dynamics: our algorithm must have learned both to generate and to use a short initial trajectory to estimate a representation of dynamics for control.

Our key idea of optimized probing for accelerated latent variable inference is to train a dedicated probe policy $\pi_\varphi$ to generate a dataset $\mathcal{D}$ of short trajectories at the beginning of all training episodes, such that the VAE’s performance on $\mathcal{D}$ is optimized (in general, $\mathcal{D}$ is not related to the replay buffer commonly used in off-policy RL algorithms). Orthogonal to training a meta-policy for faster exploration during standard RL training [36], our probe and VAE are trained for the purpose of performing well on a new test MDP. For ease of exposition, we discuss the case with access to a simulator, but our method easily allows use of off-policy batch data. We start each training episode using $\pi_\varphi$ for a probe phase lasting $T_p$ steps, record the probe trajectory $\tau_p$ into $\mathcal{D}$, train the VAE using minibatches from $\mathcal{D}$, then use $\tau_p$ with the encoder to generate $\hat{z}$ for use by the control policy $\pi_\theta(a \mid s, \hat{z})$ to complete the remainder of the episode (Algorithm 1). At test time, SEPT only requires lines 5, 8, and 9 in Algorithm 1 (training step in 9 removed; see Algorithm 2). The reward function for $\pi_\varphi$ is defined as the VAE objective, approximated by the variational lower bound (1): $R_p(\tau) := \mathcal{L}(\phi, \psi; \tau)$. This feedback loop between the probe and VAE directly trains the probe to help the VAE’s inference of latent variables that distinguish different dynamics (Figure 1). We provide detailed justification as follows. First we state a result derived in Appendix A:

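Proposition 1.

Let $p_\varphi(\tau)$ denote the distribution of trajectories induced by $\pi_\varphi$. Then the gradient of the entropy $H(p_\varphi(\tau))$ is given by

$$\nabla_\varphi H(p_\varphi(\tau)) = \mathbb{E}_{p_\varphi(\tau)}\left[ \nabla_\varphi \sum_{t=0}^{T_p - 1} \log \pi_\varphi(a_t \mid s_t) \left( -\log p_\varphi(\tau) \right) \right] \tag{3}$$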
Noting that dataset $\mathcal{D}$ follows the trajectory distribution $p_\varphi(\tau)$ induced by the probe policy $\pi_\varphi$, and that the VAE is exactly trained to maximize the log probability of $\mathcal{D}$, we use the variational lower bound $\mathcal{L}(\phi, \psi; \tau)$ as a tractable lower bound on $\log p_\varphi(\tau)$. Crucially, to generate optimal probe trajectories for the VAE, we take a minimum-entropy viewpoint and descend the gradient (3). This is the opposite of a maximum entropy viewpoint that encourages the policy to generate diverse trajectories [5], which would minimize $\log p_\varphi(\tau)$ and produce an adversarial dataset for the VAE; hence, optimal probing is not equivalent to diverse exploration. The degenerate case of learning to “stay still” for minimum entropy is precluded by any source of environmental stochasticity: trajectories from different instances will still differ, so degenerate trajectories result in low VAE performance. Finally we observe that (3) is the defining equation of a simple policy gradient algorithm [35] for training $\pi_\varphi$, with $\log p_\varphi(\tau)$ interpreted as the cumulative reward of a trajectory generated by $\pi_\varphi$. This completes our justification for defining the reward $R_p(\tau) := \mathcal{L}(\phi, \psi; \tau)$. We also show empirically in ablation experiments that this reward is more effective than choices that encourage high perturbation of state dimensions or high entropy (Section 6).
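Since Appendix A is not reproduced in this section, the following is a brief sketch of how an entropy gradient of this form can be obtained with the standard log-derivative identity. It assumes that the trajectory distribution factorizes into policy terms and dynamics terms that do not depend on the probe parameters; the symbols $p_\varphi$, $\pi_\varphi$, and the probe horizon $T_p$ are notation assumed for this sketch.

```latex
\begin{align*}
\nabla_\varphi H(p_\varphi)
  &= -\nabla_\varphi \int p_\varphi(\tau) \log p_\varphi(\tau)\, d\tau \\
  &= -\int \nabla_\varphi p_\varphi(\tau)\, \log p_\varphi(\tau)\, d\tau
     - \int p_\varphi(\tau)\, \nabla_\varphi \log p_\varphi(\tau)\, d\tau \\
  &= -\mathbb{E}_{p_\varphi(\tau)}\!\left[ \nabla_\varphi \log p_\varphi(\tau)\, \log p_\varphi(\tau) \right]
     - \nabla_\varphi \int p_\varphi(\tau)\, d\tau \\
  &= \mathbb{E}_{p_\varphi(\tau)}\!\left[ \Big( \sum_{t=0}^{T_p - 1} \nabla_\varphi \log \pi_\varphi(a_t \mid s_t) \Big)
     \big( -\log p_\varphi(\tau) \big) \right].
\end{align*}
```

The second integral vanishes because $p_\varphi$ integrates to one, and the last step uses the fact that only the policy terms in $\log p_\varphi(\tau)$ depend on $\varphi$. Descending this gradient is therefore the same as ascending a REINFORCE-style policy gradient whose trajectory return is $\log p_\varphi(\tau)$.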

Figure 1: The probe policy $\pi_\varphi$ learns to generate an optimal dataset for the VAE, whose performance is the reward for $\pi_\varphi$. The encoding $\hat{z}$ produced by the VAE is given to the control policy $\pi_\theta$.

The VAE objective function may not perfectly evaluate a probe trajectory generated by the probe policy $\pi_\varphi$ because the objective value increases due to VAE training regardless of $\pi_\varphi$. To give a more stable reward signal to $\pi_\varphi$, we can use a second VAE whose parameters $\psi'$ slowly track the main VAE parameters $\psi$ according to $\psi' \leftarrow \alpha \psi + (1 - \alpha) \psi'$ for $\alpha \in [0, 1]$, and similarly for $\phi'$. While analogous to target networks in DQN [18], the difference is that our second VAE is used to compute the reward for $\pi_\varphi$.
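The slow tracking of the second VAE can be illustrated with a generic soft (Polyak-style) parameter update, sketched below on flat lists of floats. The function name and the use of plain lists instead of network parameters are simplifications assumed for illustration.

```python
def soft_update(target_params, main_params, alpha):
    """Move each target parameter a fraction alpha toward the main one.

    alpha = 0 leaves the target unchanged; alpha = 1 copies the main
    parameters. Small alpha gives a slowly moving reward model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        alpha * m + (1.0 - alpha) * t
        for t, m in zip(target_params, main_params)
    ]
```

Applying the update once per training step with a small `alpha` keeps the reward signal for the probe from drifting as quickly as the main VAE does.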

Algorithm 1: Single Episode Policy Transfer, training phase
1: procedure SEPT-train
2:      Initialize encoder $q_\phi$, decoder $p_\psi$, probe policy $\pi_\varphi$, control policy $\pi_\theta$, and trajectory buffer $\mathcal{D}$
3:      for each instance with transition function $\mathcal{T}_z$ sampled from the family do
4:            for each episode on instance $\mathcal{T}_z$ do
5:                 Execute $\pi_\varphi$ for $T_p$ steps and store trajectory $\tau_p$ into $\mathcal{D}$
6:                 Use variational lower bound (1) as the reward to train $\pi_\varphi$ by descending gradient (3)
7:                 Train VAE using minibatches from $\mathcal{D}$ for gradient ascent on (1) and descent on (2)
8:                 Estimate $\hat{z}$ from $\tau_p$ using encoder $q_\phi$
9:                 Execute $\pi_\theta$ with $\hat{z}$ for remaining time steps and train it with suitable RL algorithm
10:            end for
11:      end for
12: end procedure
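The test-time use of Algorithm 1 (probe, encode, then control, with no training and no use of rewards for adaptation) can be sketched as follows. Everything here is a toy stand-in assumed for illustration: `FlipEnv` is a hypothetical 1D environment whose hidden sign flips the effect of actions, and the hand-written `probe`, `encode`, and `control` functions take the place of the learned probe policy, encoder, and universal policy.

```python
class FlipEnv:
    """Toy 1D chain: a hidden sign (+1 or -1) flips the effect of actions."""

    def __init__(self, sign, goal=3, horizon=20):
        self.sign, self.goal, self.horizon = sign, goal, horizon

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.pos += self.sign * action
        self.t += 1
        solved = self.pos >= self.goal
        done = solved or self.t >= self.horizon
        return self.pos, (1.0 if solved else 0.0), done


def probe(state):
    return 1  # fixed probing action, standing in for a learned probe policy


def encode(trajectory):
    # Infer the hidden sign from the first observed displacement.
    (s0, a0), (s1, _) = trajectory[0], trajectory[1]
    return (s1 - s0) * a0


def control(state, z_hat):
    return z_hat  # choose the action that moves toward the goal


def sept_test_episode(env, probe_policy, encoder, control_policy, probe_steps):
    """Run one test episode; rewards are only tallied, never used to adapt."""
    state, total, done, trajectory = env.reset(), 0.0, False, []
    for _ in range(probe_steps):
        action = probe_policy(state)
        trajectory.append((state, action))
        state, reward, done = env.step(action)
        total += reward
        if done:
            return total
    z_hat = encoder(trajectory)  # single inference step after probing
    while not done:
        state, reward, done = env.step(control_policy(state, z_hat))
        total += reward
    return total
```

For example, `sept_test_episode(FlipEnv(-1), probe, encode, control, probe_steps=2)` spends two steps probing, infers the flipped sign, and then reaches the goal within the horizon; the same call with `FlipEnv(+1)` also succeeds.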

4 Related work

Transfer learning in a family of MDPs with different dynamics manifests in various formulations [30]. Analyses of $\epsilon$-stationary MDPs and $\epsilon$-MDPs provide theoretical grounding by showing that an RL algorithm that learns an optimal policy in an MDP can also learn a near-optimal policy for multiple transition functions [13, 29].

Imposing more structure, the hidden-parameter Markov decision process (HiP-MDP) formalism posits a space of hidden parameters that determine transition dynamics, and implements transfer by model-based policy training after inference of latent parameters [8, 17]. Our work considers HiP-MDP as a widely applicable yet special case of a general viewpoint, in which the existence of hidden parameters is not assumed but rather is induced by a latent variable inference model. The key structural difference from POMDPs [12] is that given fixed latent values, each instance from the family is an MDP with no hidden dynamics. In contrast to multi-task learning [4], which uses the same tasks for training and test, and in contrast to parameterized-skill learning [7], where an agent learns from a collection of rewards with given task identities in one environment with fixed dynamics, our training and test MDPs have different dynamics and identities of instances are not given.

Prior latent variable based methods for transfer in RL depend on a multitude of optimal policies during training [2], or learn a surrogate transition model for model predictive control with real-time posterior updates during test [20]. Our variational model-free approach does not incur either of these high computational costs. We encode trajectories to infer a latent representation of differing dynamics, in contrast to state encodings in [39]. Rather than formulating variational inference in the space of optimal value functions [31], we implement transfer through variational inference in a latent space that underlies dynamics. Previous work on transfer across dynamics with hidden parameters employs model-based RL with Gaussian process and Bayesian neural network (BNN) models of the transition function [8, 15], which requires computationally expensive fictional rollouts to train a policy from scratch during test time and poses difficulties for real-time test deployment. DPT uses a fully-trained BNN to further optimize the latent variable during a single test episode, but faces scalability issues as it needs one optimal policy per training instance [37]. In contrast, our method does not need a transition function and can be deployed without optimization during test. Methods for robust transfer either require access to multiple rounds from the test MDP during training [22], or require the distribution over hidden variables to be known or controllable [19]. While meta-learning [9, 24, 42, 23] in principle can take one gradient step during a single test episode, prior empirical evaluations were not made with this constraint enforced, and adaptation during test is impossible in settings without dense rewards.

5 Experimental setup

We conducted experiments on three benchmark domains with diverse challenges to evaluate the performance, speed of reward attainment, and computational time of SEPT versus five baselines in the single test episode. We evaluated four ablations and variants of SEPT to investigate the necessity of all algorithmic design choices. For each method on each domain, we conducted three independent training runs. For each trained model, all test instances start with the same model; adaptations during the single test episode, if done by any method, are not preserved across the independent test instances. Hyperparameters were adjusted using a coarse coordinate search on validation performance.

Domains. We use the same continuous state discrete action HiP-MDPs proposed by Killian et al. [15] for benchmarking. Each isolated instance from each domain is solvable by RL, but it is highly challenging, if not impossible, for naïve RL to perform optimally for all instances because significantly different dynamics require different optimal policies. In 2D navigation, dynamics are discontinuous in as follows: location of barrier to goal region, flipped effect of actions (i.e., depending on , the same action moves in either a cardinal direction or its opposite), and direction of a nonlinear wind. In Acrobot [28], the agent applies torques to swing a two-link pendulum above a certain height. Dynamics are determined by a vector of masses and lengths, centered at 1.0. We use four unique instances in training and validation, constructed by sampling uniformly from and adding it to all components of . During test, we sample from to evaluate both interpolation and extrapolation. In HIV, a patient’s state dynamics is modeled by differential equations with high sensitivity to 12 hidden variables and separate steady-state regions of health, such that different patients require unique treatment policies [1]. Four actions determine binary activation of two drugs. We used the same training and test instances as Killian et al. [15].

Baselines. First, we evaluated two simple baselines that establish approximate bounds on test performance of methods that train a single policy: as a lower bound, Avg trains a single policy on all instances sampled during training and runs directly on test instances; as an upper bound in the limit of perfect function approximation, Oracle receives the true hidden parameter during both training and test. Next we adapted existing methods, detailed in Section E.1, to single episode test evaluation: 1. we allow BNN [15] to fine-tune a pre-trained BNN and train a policy using BNN-generated fictional episodes every 10 steps during the test episode; 2. we adapted the adversarial part of EPOpt [22], which we term EPOpt-adv, by training a policy on instances with the lowest 10-percentile performance; 3. we evaluate MAML as an archetype of meta-learning methods that require dense rewards or multiple rollouts [9]. We allow MAML to use a trajectory of the same length as SEPT’s probe trajectory for one gradient step during test. We used the same architecture for the RL module of all methods (Section E.2).

Ablations. To investigate the benefit of our optimized probing method for accelerated inference, we designed an ablation called SEPT-NP in which the probe is removed. Instead, trajectories generated by the control policy are used by the encoder for inference and stored into to train the VAE. Second, we investigated an alternative reward function for the probe, labeled TotalVar and defined as for probe trajectory . In contrast to the minimum entropy viewpoint in Section 3.3, this reward encourages generation of trajectories that maximize total variation across all state space dimensions. Third, we tested the maximum entropy viewpoint on probe trajectory generation, labeled MaxEnt, by giving negative lowerbound as the probe reward: . Last, we tested whether DynaSEPT, an extension that dynamically decides to probe or execute control (Appendix C), has any benefit for stationary dynamics.

Figure 2: (a-e): Comparison against baselines. (a-b): Number of steps to solve 2D navigation (a) and Acrobot (b) in a single test episode; failure to solve is assigned a count of 50 in 2D navigation. (c-e): Cumulative reward versus test episode step on 2D navigation (c), Acrobot (d), and HIV (e). (f-j): Ablation results on 2D navigation (f, h), Acrobot (g, i), and HIV (j). DynaSEPT is out of range in (g), see Figure 3(b). Error bars show standard error of mean over all test instances over three training runs per method.

6 Results and discussion

2D navigation and Acrobot are solved upon attaining terminal reward of 1000 and 10, respectively. SEPT outperforms all baselines in 2D navigation (Figures 2(a) and 2(c)). SEPT attained the same maximum test performance as the Oracle, in a nearly equal number of steps. While a single instance of 2D navigation is easy for RL, handling multiple instances is highly non-trivial. EPOpt-adv and Avg almost never solve the test instance (we set “steps to solve” to 50 for test episodes that were unsolved) because interpolating between instance-specific optimal policies in policy parameter space is not meaningful for any task instance. MAML did not perform well despite having the advantage of being provided with rewards at test time, unlike SEPT. The gradient adaptation step was likely ineffective because the rewards are sparse and delayed. BNN requires significantly more steps than SEPT, and it uses four orders of magnitude longer computation time (Table 3), due to training a policy from scratch during the test episode. Training times of all algorithms except BNN are of the same order of magnitude (Table 2).

In Acrobot and HIV, where dynamics are continuous in latent variables, interpolation within policy space can produce meaningful policies, so all baselines are feasible in principle. SEPT is statistically significantly faster than BNN, Avg, and MAML, is within error bars of EPOpt-adv, and matches the Oracle’s performance in Acrobot (Figures 2(b) and 2(d)). As the true values of latent variables for Acrobot test instances were interpolated and extrapolated from the training values, this shows that SEPT is robust to out-of-training dynamics. BNN requires more steps due to simultaneously learning and executing control during the test episode. On HIV, SEPT is within the margin of error of EPOpt-adv, and outperformed other baselines (Figure 2(e)). The Oracle’s low performance may be due to insufficient examples of the high-dimensional ground truth hidden parameters.

Comparing directly to reported results in DPT [37], SEPT solves 2D Navigation at least 33% (>10 steps) faster, and solves Acrobot at least 20% (>20 steps) faster. Together, these results show that methods that explicitly distinguish different dynamics (e.g., SEPT and BNN) can significantly outperform methods that implicitly interpolate in policy parameter space (e.g., Avg and EPOpt-adv) in settings where the latent variable has a large discontinuous effect on dynamics, such as 2D navigation. When dynamics are continuous in latent variables (e.g., Acrobot and HIV), interpolation-based methods fare better than BNN, which faces the difficulty of learning a model of the entire family of dynamics. SEPT performs well in both cases because it explicitly distinguishes dynamics and does not require learning a full transition model. Moreover, SEPT does not require rewards at test time, allowing it to be useful on a broader class of problems than optimization-based meta-learning approaches like MAML. Appendix D contains training curves.

Ablation results. Comparing to SEPT-NP, Figures 2(f), 2(g) and 2(j) show that the probe phase is necessary to solve 2D navigation and Acrobot quickly, while giving marginal improvement in HIV. SEPT matched the performance of TotalVar in HIV and outperformed it in 2D navigation and Acrobot, showing that directly using VAE performance as reward for probing is more effective than a reward that deliberately encourages perturbation of state dimensions. The clear advantage of SEPT over MaxEnt in all three domains supports our hypothesis in Section 3.3 that the variational lower bound, rather than its negation in the maximum entropy viewpoint, should be used as the probe reward. SEPT outperforms DynaSEPT on all problems where dynamics are stationary during each instance. On the other hand, DynaSEPT is the better choice in a non-stationary variant of 2D navigation where the dynamics “switch” abruptly during the episode (Figure 3(d)).

Figure 3: Cumulative reward on test episode when varying the probe trajectory length (a-c) and a second hyperparameter (d-f). Panels in each group: 2D navigation, Acrobot, HIV.

Robustness. Figure 3 shows that SEPT is robust to varying the length of the probe trajectory and a second hyperparameter. Even with suboptimal choices of both, it can outperform all baselines on 2D navigation in both steps-to-solve and final reward; it matches all baselines on Acrobot based on final cumulative reward; and it meets the performance of EPOpt-adv while outperforming other baselines on HIV. Increasing the probe length means forgoing valuable steps of the control policy; hence a longer probing phase is not always better. Section D.4 shows the effect of β on latent variable encodings.

7 Conclusion and future directions

We propose a general algorithm for single episode transfer among MDPs with different dynamics, a challenging goal with real-world significance that deserves increased effort from the transfer learning and RL community. Our method trains a probe policy along with an inference model to discover a latent representation of dynamics using very few initial steps in a single test episode, such that a universal policy can execute optimal control. The dedicated probing phase may be improved by other objectives, in addition to performance of the inference model, to mitigate the risk and opportunity cost of probing. While our method generates short probe trajectories for inference, one can consider how best to use available long trajectory data with machine teaching methods when differences in dynamics are not detectable early during an episode. Strong performance versus previous methods on domains involving both continuous and discontinuous dependence on latent variables suggests that SEPT has promise for problems where different dynamics can be distinguished via a short probing phase. Our work is one step along a broader avenue of research on general transfer learning in RL equipped with the realistic constraint of a single episode for adaptation and evaluation.


  • [1] B. M. Adams, H. T. Banks, H. Kwon, and H. T. Tran (2004) Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical biosciences and engineering 1 (2), pp. 223–241.
  • [2] I. Arnekvist, D. Kragic, and J. A. Stork (2019) VPE: variational policy embedding for transfer reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 36–42.
  • [3] A. Bordbar, D. McCloskey, D. C. Zielinski, N. Sonnenschein, N. Jamshidi, and B. O. Palsson (2015) Personalized whole-cell kinetic models of metabolism for discovery in genomics and pharmacodynamics. Cell systems 1 (4), pp. 283–292.
  • [4] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75.
  • [5] J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine (2018) Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In Proceedings of the 35th International Conference on Machine Learning, pp. 1009–1018.
  • [6] R. C. Cockrell and G. An (2018) Examining the controllability of sepsis using genetic algorithms on an agent-based model of systemic inflammation. PLOS Computational Biology 14, pp. 1–17.
  • [7] B. C. Da Silva, G. Konidaris, and A. G. Barto (2012) Learning parameterized skills. In Proceedings of the 29th International Conference on Machine Learning, pp. 1443–1450.
  • [8] F. Doshi-Velez and G. Konidaris (2016) Hidden parameter markov decision processes: a semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference, Vol. 2016, pp. 1432.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
  • [10] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Vol. 3.
  • [11] R. Hodson (2016) Precision medicine. Nature 537 (7619), pp. S49.
  • [12] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134.
  • [13] Z. Kalmár, C. Szepesvári, and A. Lőrincz (1998) Module-based reinforcement learning: experiments with a real robot. Autonomous Robots 5 (3-4), pp. 273–295.
  • [14] A. Kastrin, P. Ferk, and B. Leskošek (2018) Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PloS one 13 (5), pp. e0196865.
  • [15] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez (2017) Robust and efficient transfer learning with hidden parameter markov decision processes. In Advances in Neural Information Processing Systems, pp. 6250–6261.
  • [16] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In International Conference on Learning Representations.
  • [17] G. Konidaris and F. Doshi-Velez (2014) Hidden parameter markov decision processes: an emerging paradigm for modeling families of related tasks. In the AAAI Fall Symposium on Knowledge, Skill, and Behavior Transfer in Autonomous Robots.
  • [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • [19] S. Paul, M. A. Osborne, and S. Whiteson (2019) Fingerprint policy optimisation for robust reinforcement learning. In International Conference on Machine Learning, pp. 5082–5091.
  • [20] C. F. Perez, F. P. Such, and T. Karaletsos (2018) Efficient transfer learning and online adaptation with latent variable models for continuous control. arXiv preprint arXiv:1812.03399.
  • [21] B. K. Petersen, J. Yang, W. S. Grathwohl, C. Cockrell, C. Santiago, G. An, and D. M. Faissol (2019) Deep reinforcement learning and simulation as a path toward precision medicine. Journal of Computational Biology.
  • [22] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine (2017) Epopt: learning robust neural network policies using model ensembles. In International Conference on Learning Representations.
  • [23] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331–5340.
  • [24] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In International Conference on Learning Representations (ICLR).
  • [25] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320.
  • [26] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In International Conference on Learning Representations (ICLR), Vol. 2016.
  • [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [28] R. S. Sutton, A. G. Barto, et al. (1998) Reinforcement learning: an introduction. MIT press.
  • [29] I. Szita, B. Takács, and A. Lőrincz (2002) ε-MDPs: learning in varying environments. Journal of Machine Learning Research 3 (Aug), pp. 145–174.
  • [30] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (Jul), pp. 1633–1685.
  • [31] A. Tirinzoni, R. R. Sanchez, and M. Restelli (2018) Transfer of value functions via variational methods. In Advances in Neural Information Processing Systems, pp. 6179–6189.
  • [32] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
  • [33] J. West, L. You, J. Brown, P. K. Newton, and A. R. A. Anderson (2018) Towards multi-drug adaptive therapy. bioRxiv.
  • [34] M. Whirl-Carrillo, E. M. McDonagh, J. Hebert, L. Gong, K. Sangkuhl, C. Thorn, R. B. Altman, and T. E. Klein (2012) Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics 92 (4), pp. 414–417.
  • [35] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256.
  • [36] T. Xu, Q. Liu, L. Zhao, W. Xu, and J. Peng (2018) Learning to explore with meta-policy gradient. In Proceedings of the 35th International Conference on Machine Learning, pp. 5463–5472.
  • [37] J. Yao, T. Killian, G. Konidaris, and F. Doshi-Velez (2018) Direct policy transfer via hidden parameter markov decision processes. In LLARLA Workshop, FAIM, Vol. 2018.
  • [38] W. Yu, J. Tan, C. K. Liu, and G. Turk (2017) Preparing for the unknown: learning a universal policy with online system identification. In Proceedings of Robotics: Science and Systems, Cambridge, Massachusetts.
  • [39] A. Zhang, H. Satija, and J. Pineau (2018) Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689.
  • [40] J. Zhang, J. J. Cunningham, J. S. Brown, and R. A. Gatenby (2017) Integrating evolutionary dynamics into treatment of metastatic castrate-resistant prostate cancer. Nature communications 8 (1), pp. 1816.
  • [41] X. Zhu, A. Singla, S. Zilles, and A. N. Rafferty (2018) An overview of machine teaching. arXiv preprint arXiv:1801.05927.
  • [42] L. M. Zintgraf, K. Shiarlis, V. Kurin, K. Hofmann, and S. Whiteson (2019) Fast context adaptation via meta-learning. In International Conference on Machine Learning (ICML), Vol. 2019.

Appendix A Derivations

See 1


Assuming regularity, the gradient of the entropy is

For trajectory generated by the probe policy :


Since and do not depend on , we get

Substituting this into the gradient of the entropy gives (3). ∎

Appendix B Testing phase of SEPT

procedure SEPT-test
    Restore the trained decoder, encoder, probe policy, and control policy
    Run the probe policy for the probe-phase time steps and record the trajectory
    Use the recorded trajectory with the decoder to estimate the latent variable
    Use the estimate with the control policy for the remaining duration of the test episode
end procedure
Algorithm 2 Single Episode Policy Transfer: testing phase
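
The testing phase above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sept_test_episode`, the callable signatures, and the `(state, reward, done)` step convention are all hypothetical, and latent inference is hidden behind a single `estimate_latent` callable so that the sketch does not commit to how the trained VAE produces its estimate.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Transition = Tuple[State, int]  # (state, action) pair recorded while probing


def sept_test_episode(
    env_reset: Callable[[], State],
    env_step: Callable[[int], Tuple[State, float, bool]],
    probe_policy: Callable[[State], int],
    estimate_latent: Callable[[List[Transition]], Sequence[float]],
    control_policy: Callable[[State, Sequence[float]], int],
    probe_steps: int,
    max_steps: int,
) -> float:
    """Run one test episode: probe, infer the latent once, then control."""
    state = env_reset()
    trajectory: List[Transition] = []
    total_reward = 0.0
    steps = 0
    done = False

    # Probe phase. Rewards are summed for reporting only; they are never
    # passed to the probe policy or to the inference model.
    while steps < probe_steps and steps < max_steps and not done:
        action = probe_policy(state)
        trajectory.append((state, action))
        state, reward, done = env_step(action)
        total_reward += reward
        steps += 1

    # A single inference call on the recorded probe trajectory.
    z_hat = estimate_latent(trajectory)

    # Control phase for the remaining duration of the episode.
    while steps < max_steps and not done:
        action = control_policy(state, z_hat)
        state, reward, done = env_step(action)
        total_reward += reward
        steps += 1
    return total_reward
```

Passing the environment and the trained components as callables keeps the sketch independent of any particular RL library.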

Appendix C DynaSEPT

In our problem formulation, it is not necessary to re-estimate the latent variable at every step of the test episode, as each instance is a stationary MDP and the change of instances is known. However, removing the common assumption of stationarity leads to time-dependent transition functions, which introduce problematic cases. For example, a fixed-length probing phase would fail if the dynamics switch after probing has ended, such as when poorly understood drug-drug interactions lead to abrupt changes in dynamics during co-medication therapies [14]. Here we describe an alternative general algorithm for non-stationary dynamics, which we call DynaSEPT. We train a single policy that dynamically decides whether to probe for better inference or act to maximize the MDP reward, based on a scalar-valued function representing the degree of uncertainty in posterior inference, which is updated at every time step. The total reward combines a history-dependent term, computed on a short sliding-window trajectory of fixed length, with the MDP reward at the final state of that window. The history-dependent term is equivalent to a delayed reward given for executing a sequence of probe actions. Following the same reasoning as for SEPT, one choice for this term is the probe reward used in SEPT. Assuming the encoder outputs the variance of each latent dimension, one choice for the uncertainty function is a normalized standard deviation over all dimensions of the latent variable, in which each dimension's standard deviation is divided by its running maximum. Despite its novelty, we consider DynaSEPT only for rare nonstationary dynamics and merely as a baseline in the predominant case of stationary dynamics, where SEPT is our primary contribution. The remainder of this appendix explains that DynaSEPT has no advantage over SEPT in the stationary case.

DynaSEPT does not have any clear advantage over SEPT when each instance is a stationary MDP. DynaSEPT requires the uncertainty function to start at 1.0, representing complete lack of knowledge about latent variables, and it still requires the choice of the sliding-window length as a hyperparameter. Only after that many steps can it use the uncertainty of the posterior estimate to adapt and continue to generate the sliding-window trajectory to improve the estimate. By this time, SEPT has already generated an optimized sequence using the probe policy for the encoder to estimate the latent variable. If a trajectory of this length is sufficient for computing a good estimate of latent variables, then SEPT is expected to outperform DynaSEPT.
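
To make the uncertainty signal concrete, the sketch below shows one way a normalized standard deviation with a running maximum could be computed. It is an illustration under assumptions rather than the paper's code: the class and function names are hypothetical, the per-dimension ratio is averaged over dimensions, and the convex combination in `blended_reward` is an assumed form of the total reward.

```python
import math
from typing import List, Sequence


class UncertaintyTracker:
    """Mean over latent dimensions of sigma_i / running_max(sigma_i)."""

    def __init__(self, latent_dim: int) -> None:
        # Running maximum of the standard deviation, one entry per dimension.
        self.running_max: List[float] = [0.0] * latent_dim

    def update(self, variances: Sequence[float]) -> float:
        sigmas = [math.sqrt(v) for v in variances]
        self.running_max = [max(m, s) for m, s in zip(self.running_max, sigmas)]
        # A dimension with zero running max contributes full uncertainty.
        ratios = [s / m if m > 0.0 else 1.0
                  for s, m in zip(sigmas, self.running_max)]
        return sum(ratios) / len(ratios)


def blended_reward(eta: float, probe_reward: float, env_reward: float) -> float:
    """Assumed form: weight the probe term by eta and the MDP term by 1 - eta."""
    return eta * probe_reward + (1.0 - eta) * env_reward
```

With this construction the first update always returns 1.0, because each standard deviation equals its own running maximum, which is consistent with the requirement that the uncertainty start at 1.0.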

Figure 4: Ablations and variants of SEPT on (a) 2D navigation, (b) Acrobot, and (c) HIV, and additional test on nonstationary dynamics on (d) 2D switch.

Appendix D Supplementary experimental results

D.1 Steps to solve 2D and Acrobot

2D navigation and Acrobot have a definition of “solved”. Table 1 reports the number of steps in a test episode required to solve the MDP. Average and standard deviation were computed across all test instances and across all independently trained models. If an episode was not solved, the maximum allowed number of steps was used (50 for 2D navigation and 200 for Acrobot).

             2D navigation   Acrobot
Average      47 ± 10         109 ± 42
Oracle       12 ± 1          82 ± 27
BNN          34 ± 10         154 ± 45
EPOpt-adv    46 ± 10         89 ± 42
MAML         50 ± 0          103 ± 48
SEPT         14 ± 2          79 ± 23
Table 1: Steps to solve 2D navigation and Acrobot
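
The capping rule described above, where an unsolved episode counts as the maximum allowed number of steps, can be written as a short helper. This is a sketch with hypothetical names; in particular, whether the reported spread is a population or a sample standard deviation is not stated, so the use of the population form here is an assumption.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence, Tuple


def steps_to_solve_summary(
    solve_steps: Sequence[Optional[int]], max_steps: int
) -> Tuple[float, float]:
    """Return (mean, population std) of steps to solve.

    `None` marks an unsolved episode, which is counted as `max_steps`.
    """
    capped = [max_steps if s is None else min(s, max_steps) for s in solve_steps]
    return mean(capped), pstdev(capped)
```

For example, `steps_to_solve_summary([10, None, 30], max_steps=50)` treats the unsolved episode as 50 steps before averaging.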

D.2 Timing comparison

             2D navigation   Acrobot          HIV
Average      1.3e3 ± 277     1.0e3 ± 85       1.4e3 ± 47
Oracle       0.6e3 ± 163     1.1e3 ± 129      1.5e3 ± 47
BNN          2.9e3 ± 244     9.0e4 ± 3.0e3    4.3e4 ± 313
EPOpt-adv    1.1e3 ± 44      1.1e3 ± 1.0      1.9e3 ± 33
MAML         0.9e3 ± 116     1.1e3 ± 96       1.3e3 ± 6.0
SEPT         1.9e3 ± 70      2.3e3 ± 1e3      2.8e3 ± 11
Table 2: Total training times in seconds on all experiment domains

             2D navigation   Acrobot          HIV
Average      0.04 ± 0.04     0.09 ± 0.04      0.42 ± 0.01
Oracle       0.02 ± 0.04     0.09 ± 0.04      0.45 ± 0.02
BNN          2.6e3 ± 957     2.8e3 ± 968      1.4e3 ± 8.8
EPOpt-adv    0.04 ± 0.04     0.10 ± 0.06      0.45 ± 0.03
MAML         0.05 ± 0.05     0.10 ± 0.07      0.48 ± 0.01
SEPT         0.04 ± 0.07     0.12 ± 0.10      0.60 ± 0.02
Table 3: Test episode time in seconds on all experiment domains

D.3 Training curves

Figure 5: Average episodic return over training episodes on (a) 2D navigation, (b) Acrobot, and (c) HIV. Only SEPT and Oracle converged in 2D navigation. All methods converged in Acrobot. All methods except MAML converged in HIV. BNN is not shown as the implementation does not record training progress.

Figure 5 shows training curves on all domains by all methods. None of the baselines, excepting Oracle, converge in 2D navigation, because it is meaningless for Avg and EPOpt-adv to interpolate between optimal policies for each instance, and MAML cannot adapt due to lack of informative rewards for almost the entire test episode. Hence these baselines cannot work for a new unknown test episode, even in principle. We allowed the same number of training episodes for HIV as in Killian et al. [15], and all baselines except MAML show learning progress.

D.4 Latent representation of dynamics

Figure 6: Two-dimensional encodings generated for four instances of Acrobot (represented by four ground-truth colors), for different values of β. We chose for Acrobot.

There is a tradeoff between reconstruction and disentanglement as β increases [10]. Increasing β encourages greater similarity between the posterior and an isotropic Gaussian. Figure 6 gives evidence that this comes at a cost of lower quality of separation in latent space.
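
The pull toward an isotropic Gaussian can be seen in the standard β-weighted VAE objective, sketched below. This is the textbook form from the β-VAE literature, assuming a diagonal Gaussian posterior and a standard normal prior; the function names are hypothetical and the paper's exact loss may differ.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mean: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def beta_vae_loss(
    recon_nll: float, mean: Sequence[float], log_var: Sequence[float], beta: float
) -> float:
    """Reconstruction negative log-likelihood plus beta times the KL term."""
    return recon_nll + beta * kl_to_standard_normal(mean, log_var)
```

A larger `beta` penalizes any posterior mean away from zero more heavily, which is one way to read the weaker separation between instances at higher values.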

D.5 Probe reward

Figure 7: Probe policy reward curve in one training run in 2D navigation

Appendix E Experimental details

For 2D navigation, Acrobot, and HIV, the total numbers of training episodes allowed for all methods are 10k, 4k, and 2.5k, respectively. There are 2, 4, and 5 unique training instances, and 2, 4, and 5 validation instances, respectively. For each of three independent training runs, we tested on 10, 5, and 1 test instances, respectively.

E.1 Algorithm implementation details

The simple baselines Average and Oracle can be immediately deployed in a single test episode after training. However, the other methods for transfer learning require modification to work in the setting of single episode test, as they were not designed specifically for this highly constrained setting. We detail the necessary modifications below. We also describe the ablation SEPT-NP in more detail.

BNN. In Killian et al. [15], a pre-trained BNN model was fine-tuned using the first test episode and then used to generate fictional episodes for training a policy from scratch. More episodes on the same test instance were allowed to help improve model accuracy of the BNN. In the single test episode setting, all fine-tuning and policy training must be conducted within the first test episode. We fine-tune the pre-trained BNN every 10 steps and allow the same total number of fictional episodes as reported in [15] for policy training. We measured the cumulative reward attained by the policy, while it is undergoing training, during the single real test episode.

EPOpt. EPOpt trains on the lowest ε-percentile rollouts from instances sampled from a source distribution, then adapts the source distribution using observations from the target instance [22]. Since we do not allow observation from the test instance, we implemented only the adversarial part of EPOpt. To run EPOpt with off-policy DDQN, we generated 100 rollouts per iteration and stored the lowest 10-percentile in the replay buffer, then executed the same number of minibatch training steps that a regular DDQN would have done during those rollouts.
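
The rollout filtering step can be illustrated with a small helper that keeps only the worst-performing fraction of a batch. This is a sketch with a hypothetical function name and data layout (each rollout is a `(return, payload)` pair); how ties and rounding were handled in the actual experiments is not specified, so flooring with a minimum of one rollout is an assumption.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

R = TypeVar("R")


def lowest_percentile_rollouts(
    rollouts: Sequence[Tuple[float, R]], percentile: float
) -> List[Tuple[float, R]]:
    """Keep the rollouts whose returns fall in the lowest `percentile` percent."""
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    # Assumption: round down, but always keep at least one rollout.
    keep = max(1, math.floor(len(rollouts) * percentile / 100.0))
    return sorted(rollouts, key=lambda item: item[0])[:keep]
```

With 100 rollouts and `percentile=10.0`, the helper returns the 10 rollouts with the lowest returns, which would then be written to the replay buffer.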

MAML. While MAML uses many complete rollouts per gradient step [9], the single episode test constraint mandates that it can only use a partial episode for adaptation during test, and hence the same must be done during meta-training. For both training and test, we allow MAML to take one gradient step for adaptation using a trajectory of the same length as the probe trajectory of SEPT, starting from the initial state of the episode. We implemented a first-order approximation that computes the meta-gradient at the post-update parameters but omits second derivatives. This was reported to have nearly equal performance to the full version, due to the use of ReLU activations.
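
The first-order approximation can be illustrated on a toy problem. The sketch below uses scalar quadratic task losses instead of RL trajectories, so it shows only the structure of the update (one adaptation step per task, then a meta-gradient evaluated at the adapted parameters with second derivatives dropped); the function names and the toy loss are assumptions for illustration.

```python
from typing import Sequence


def task_grad(theta: float, target: float) -> float:
    """Gradient of the toy task loss 0.5 * (theta - target) ** 2."""
    return theta - target


def fomaml_step(
    theta: float, targets: Sequence[float], inner_lr: float, outer_lr: float
) -> float:
    """One first-order MAML meta-update over a batch of toy tasks."""
    meta_grad = 0.0
    for target in targets:
        # Inner loop: a single gradient step of adaptation.
        adapted = theta - inner_lr * task_grad(theta, target)
        # First-order meta-gradient: evaluate the gradient at the adapted
        # parameters and ignore how `adapted` depends on `theta`.
        meta_grad += task_grad(adapted, target)
    meta_grad /= len(targets)
    return theta - outer_lr * meta_grad
```

The full version would multiply each task's gradient by the derivative of the adapted parameters with respect to `theta`, which is the second-order term omitted here.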

SEPT-NP. The control policy begins with a zero vector as the latent estimate at the start of training. When it has produced a trajectory of the probe length, we store the trajectory in the dataset for training the VAE, and use it with the VAE to estimate the latent variable for the episode. Later training episodes begin with the rolling mean of all latent estimates so far. For test, we give the final rolling mean of the latent estimates at the end of training as initial input to the control policy.
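
The bookkeeping for the initial latent input can be sketched as an incremental mean. This assumes that "rolling mean" refers to the cumulative mean over all estimates produced so far; the class name is hypothetical.

```python
from typing import List, Sequence


class RollingLatentMean:
    """Cumulative mean of latent estimates, starting from the zero vector."""

    def __init__(self, latent_dim: int) -> None:
        self.mean: List[float] = [0.0] * latent_dim
        self.count = 0

    def update(self, z_hat: Sequence[float]) -> List[float]:
        """Fold one new estimate into the mean and return a copy of it."""
        self.count += 1
        self.mean = [m + (z - m) / self.count for m, z in zip(self.mean, z_hat)]
        return list(self.mean)
```

Before any update the stored mean is the zero vector, which corresponds to the initial input at the start of training; the value held after the last training episode would be the initial input at test time.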

E.2 Architecture

Encoder. For all experiments, the encoder is a bidirectional LSTM with 300 hidden units and activation. Outputs are mean-pooled over time, then fully-connected to two linear output layers of width dim(z), interpreted as the mean and log-variance of a Gaussian over the latent variable z.

Decoder. For all experiments, the decoder is an LSTM with 256 hidden units and activation. Given input at LSTM time step , the output is fully-connected to two linear output layers of width , and interpreted as the mean and log-variance of a Gaussian decoder for the next state-action pair .

Q network. For all experiments, the Q function is a fully-connected neural network with two hidden layers of width 256 and 512, ReLU activation, and a linear output layer whose size is the number of actions. For SEPT and Oracle, the input is the concatenation of the state and the latent variable, where the latent variable is estimated in the case of SEPT and is the ground truth for the Oracle. For all other methods, the input is only the state.
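
The input convention described here can be illustrated with a small pure-Python forward pass. This is a sketch only: the class name, the random initialization, and the use of plain lists in place of a deep learning framework are assumptions, and no training logic is included.

```python
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    scale = (2.0 / n_in) ** 0.5  # assumed initialization scale
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: Sequence[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class QNetwork:
    """MLP with hidden widths 256 and 512 and one output per action."""

    def __init__(self, state_dim: int, latent_dim: int, num_actions: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)
        self.input_dim = state_dim + latent_dim
        sizes = [self.input_dim, 256, 512, num_actions]
        self.layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

    def q_values(self, state: Sequence[float], z: Sequence[float] = ()) -> List[float]:
        # SEPT and Oracle pass a latent vector; other methods pass only the state.
        h = list(state) + list(z)
        if len(h) != self.input_dim:
            raise ValueError("unexpected input size")
        for layer in self.layers[:-1]:
            h = [max(0.0, v) for v in linear(h, layer)]  # ReLU
        return linear(h, self.layers[-1])
```

Setting `latent_dim=0` gives the state-only variant used by the other methods, so the same class covers both input conventions.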

Probe policy network. For all experiments, the probe policy is a fully-connected neural network with 3 hidden layers, ReLU activation, 32 nodes in all layers, and a softmax in the output layer.

E.3 Hyperparameters

VAE learning rate was 1e-4 for all experiments. Size of the dataset of probe trajectories was limited to 1000, with earliest trajectories discarded. 10 minibatches from the dataset were used for each VAE training step. We used for the VAE. Probe policy learning rate was 1e-3 for all experiments. DDQN minibatch size was 32, one training step was done for every 10 environment steps, learning rate was 1e-3, gradient clip was 2.5, and target network update rate was 5e-3. Prioritized replay used the same parameters as in [15].
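
Two of the DDQN settings above can be made concrete with short helpers. Both are sketches under assumptions that the text does not confirm: the target network update rate of 5e-3 is read as a soft (Polyak) update coefficient, and the gradient clip of 2.5 is read as a bound on the global gradient norm rather than on individual values. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def soft_update(
    target: Sequence[float], online: Sequence[float], rate: float = 5e-3
) -> List[float]:
    """Move each target parameter a fraction `rate` toward the online value."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target, online)]


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 2.5) -> List[float]:
    """Rescale the gradient vector so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]
```

Under these readings, a gradient of `[3.0, 4.0]` with norm 5 would be rescaled to norm 2.5, and a target parameter at 0 would move to 0.005 when the online parameter is 1.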

                        2D navigation   Acrobot   HIV
                        2               5         8
Instances               1000            500       500
Episodes on instance    10              8         5
VAE batch size          10              64        64
dim(z)                  2               2         6
                        N/A             0.005     N/A
Probe minibatches       1               10        1
DDQN                    1.0             1.0       0.3
Table 4: Hyperparameters used by each method, where applicable