
Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.




1 Introduction

Deep Reinforcement Learning (DRL) has been used to successfully train agents on a range of challenging environments such as Atari games (Mnih et al., 2013; Bellemare et al., 2013; Hessel et al., 2017) or continuous control (Peng et al., 2017; Schulman et al., 2017). Nonetheless, in these problems, RL agents perform exploration strategies to discover the environment and implement algorithms to learn a policy tailored to a single task. Whenever the task changes, RL agents generalize poorly and the whole process of exploration and learning restarts from scratch. On the other hand, we expect an intelligent agent to fully master a problem when it is able to generalize from a few instances (tasks) and learn how to achieve the objective of the problem under many variations of the environment. For instance, children know how to ride a bike (i.e., the problem) when they can reach their destination irrespective of the specific bike they are riding, which requires adapting to the weight of the bike, the friction of the brakes and tires, and the road conditions (i.e., the tasks).

How to enable agents to generalize across tasks has been studied under the frameworks of Multi-task Reinforcement Learning (e.g., Wilson et al., 2007; Teh et al., 2017), Transfer Learning (e.g., Taylor and Stone, 2011; Lazaric, 2012) and Meta-Reinforcement Learning (Finn et al., 2017; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019). These works fall into two categories. Learning-to-learn approaches aim at speeding up learning on new tasks by pre-training feature extractors or learning good initializations of policy weights (Raghu et al., 2019). In contrast, we study in this paper the online adaptation setting, where a single policy is trained for a fixed family of tasks. When facing a new task, the policy must then balance exploration, to reduce the uncertainty about the current task, and exploitation, to maximize the cumulative reward of the task.

The online adaptation setting is a special case of a partially observable Markov decision process (POMDP), where the unobserved variables are the descriptors of the current task. It is thus possible to rely on recurrent neural networks (RNNs) (Bakker, 2001; Heess et al., 2015; Duan et al., 2016), since they can in theory represent optimal policies in POMDPs given enough capacity. Unfortunately, the training of RNN policies often has prohibitive sample complexity, and it may converge to suboptimal local minima.

To overcome this drawback, efficient online adaptation methods leverage the knowledge of the task at train time. The main approach is to pair an exploration strategy with the training of informed policies, i.e., policies taking the description of the current task as input. Probe-then-Exploit (PTE) algorithms (e.g., Zhou et al., 2019) operate in two stages. They first rely on an exploration policy to identify the task. Then, they commit to the identified task by playing the associated informed policy. Thompson Sampling (TS) approaches (Thompson, 1933; Osband et al., 2016, 2019) maintain a distribution over plausible tasks and play the informed policy of a task sampled from the posterior following a predefined schedule.

PTE and TS are expected to be sample-efficient because learning informed policies is easier than learning RNN-based policies: since informed policies know the current task, the problem is fully observable and they can be learnt with efficient algorithms for MDPs. However, as we discuss in Section 3, PTE and TS cannot represent effective exploration/exploitation policies in many environments. This limitation is even more severe in non-stationary environments, where the task changes within each episode. In this case, the exploration strategy must also adapt to how tasks evolve over time.

Recently, Humplik et al. (2019) proposed an alternative approach, Task Inference (TI), which trains a full RNN policy with the prediction of the current task as an auxiliary loss. TI avoids the suboptimality of PTE/TS since it does not constrain the structure of the exploration/exploitation policy. However, in TI, the task descriptors are used as targets and not as inputs, so TI does not leverage the faster learning of informed policies. Moreover, the behavior of TI in non-stationary environments has not been studied.

In this paper, we introduce IMPORT (InforMed POlicy RegularizaTion), a novel policy architecture for efficient online adaptation that combines the rich expressivity of RNNs with the efficient learning of informed policies. At training time, a shared policy head receives as input the current observation, together with either a (learned) embedding of the current task or the hidden state of an RNN, such that two policies are learned simultaneously: the informed policy and the RNN policy. At test time, the hidden state of the RNN replaces the task embedding, and the agent can act without having access to the current task.

We evaluate IMPORT against the main approaches to online adaptation on a suite of environments with different characteristics, from challenging exploration problems with sparse rewards to non-stationary control problems. We confirm that TS suffers from its limited expressivity when non-trivial probing policies are required, and show that the policy regularization of IMPORT significantly speeds up learning compared to TI. Moreover, the learnt task embeddings of IMPORT make it robust to irrelevant or minimally informative task descriptors, whereas TI's performance degrades significantly in these cases.

2 Setting

Let $\mathcal{M}$ be the space of possible tasks. Each task $\mu \in \mathcal{M}$ is associated with an episodic $\gamma$-discounted MDP whose dynamics and rewards are task dependent, while the state and action spaces are shared across tasks and $\gamma$ is the discount factor. The descriptor $\mu$ can be a simple id (e.g., an integer index) or a set of parameters (e.g., a real-valued vector).

When the reward function and the transition probabilities are unknown, RL agents need to devise a strategy that balances exploration, to gather information about the system, and exploitation, to maximize the cumulative reward. Such a strategy can be defined as the solution of a partially observable MDP (POMDP), where the hidden variable is the descriptor $\mu$ of the MDP. Given a trajectory $\tau_t = (s_0, a_0, r_0, \ldots, s_t)$, a POMDP policy maps the trajectory to actions. In particular, the optimal policy in a POMDP is a history-dependent policy that uses $\tau_t$ to construct a belief state $b_t$, which describes the uncertainty about the task at hand, and then maps it to the action that maximizes the expected sum of rewards (e.g., Kaelbling et al., 1998). In this case, maximizing the rewards may require taking explorative actions that improve the belief state enough so that future actions can be more effective in collecting reward.

At training time, we assume the agent has unrestricted access to the descriptor $\mu$ of the tasks it interacts with. In particular, we consider the challenging case of dynamic (i.e., non-stationary) environments, where the task may change over time according to a fixed or random schedule; we denote by $\mu_t$ the value of the descriptor at time $t$. Let $p(\mu_t \mid \mu_0, \ldots, \mu_{t-1})$ be a history-dependent task distribution; at each step, a new task $\mu_t$ is drawn from it. (Notice that this definition of $p$ is rich enough to represent cases such as stationary, piece-wise stationary, and shifts with limited deviations.) Leveraging the information gathered at training time, we expect the agent to learn an exploration strategy that is better suited for tasks in $\mathcal{M}$ generated from $p$. More formally, after $N$ steps of training, the agent returns a policy $\pi$ that is evaluated according to its average performance across tasks in $\mathcal{M}$ generated from $p$, i.e.,


$$\mathcal{L}(\pi) = \mathbb{E}_{\tau \sim \pi,\, \mu \sim p}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big] \qquad (1)$$

where the expectation is taken over trajectories $\tau$ of full episodes, and $T$ is the length of the episode.

The objective is then to find an architecture for $\pi$ that is able to express strategies that perform best according to Eq. 1 and, at the same time, can be efficiently learned even for moderately short training phases.
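As a simple illustration of the non-stationary schedules above (our own sketch, not from the paper), a piece-wise stationary sequence of task descriptors can be obtained by re-sampling the task at each step with a small probability:

```python
import numpy as np

def sample_task_sequence(episode_len, resample_prob, sample_task, rng):
    """Draw a piece-wise stationary sequence of task descriptors.

    At each step, a fresh task is drawn with probability `resample_prob`;
    otherwise the previous task is kept.
    """
    mus = [sample_task(rng)]
    for _ in range(episode_len - 1):
        if rng.random() < resample_prob:
            mus.append(sample_task(rng))
        else:
            mus.append(mus[-1])
    return mus

rng = np.random.default_rng(0)
# hypothetical choice: 4 discrete tasks, 5% resampling probability
seq = sample_task_sequence(100, 0.05, lambda r: int(r.integers(4)), rng)
```

With a 5% resampling probability and episodes of length 100, the task changes about 5 times per episode, matching the order of magnitude used in the bandit experiments of Section 5.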




Figure 1: An environment with two tasks: the goal location changes at each episode. The sign reveals the location of the goal. Optimal informed policies are shortest paths from the start to either goal, which never visit the sign. Thompson Sampling cannot represent the optimal exploration/exploitation policy (go to the sign first) since going to the sign is not feasible under any informed policy.

3 Related Work and Contributions

In this section, we review how the online adaptation setting has been tackled in the literature. The main approaches are depicted in Fig. 2. We first compare the different methods in terms of the optimality of strategies they can represent, how they deal with non-stationary environments, and whether they leverage the efficient learning of informed policies. We then discuss learning task embeddings and how the various methods deal with unknown or irrelevant task descriptors. The last subsection summarizes our contributions. Unless otherwise stated, the methods described below have not been tested in the non-stationary setting.

3.1 Online Adaptation with Deep RL

In the previous section, we mentioned that the best exploration strategy corresponds to the optimal policy of the associated POMDP. Since the belief state is a sufficient statistic of the history, POMDP policies take the form $\pi(a_t \mid b_t)$. While it is impractical to compute the exact belief state even for toy discrete problems, approximations can be learnt using Recurrent Neural Networks (RNNs) (Bakker, 2001; Heess et al., 2015; Duan et al., 2016). RNN policies are directly trained to maximize the cumulative reward and do not leverage information about task descriptors at train time. While this class of policies can represent rich exploratory strategies, their large training complexity makes them highly suboptimal whenever the training phase is too short.

In order to reduce the training complexity of RNN policies, existing strategies have constrained the set of possible exploratory behaviors by leveraging privileged information about the task. Probe-Then-Exploit (PTE) (e.g., Lattimore and Szepesvári, 2018; Zhou et al., 2019) works in two phases. First, it executes a pure exploratory policy with the objective of identifying the underlying task (i.e., estimating $\mu$) as accurately as possible. In the second phase, PTE runs the optimal policy associated to the estimated task. In that case, the agent only needs to learn a probing policy that maximizes the likelihood of the task at the end of the exploration phase, together with the informed policies for the training tasks, thus leading to a much more efficient training process. PTE has two main limitations. First, similarly to explore-then-commit approaches in bandits (Garivier et al., 2016), the exploration can be suboptimal because it is not reward-driven: valuable time is wasted estimating unnecessary information. Second, the switch between probing and exploiting is difficult to tune and problem-dependent, which makes these approaches unsuitable in non-stationary environments.

Thompson Sampling (TS) (Thompson, 1933) leverages randomization to efficiently mix exploration and exploitation. Similarly to the belief state of an RNN policy, TS maintains a distribution over tasks that are compatible with the observed history. The policy samples a task from the posterior and executes the corresponding informed policy for several steps. In that case, the training is limited to learning a maximum likelihood estimator mapping trajectories to distributions over tasks and, as in PTE, informed policies for the training tasks. This strategy has proved successful in a variety of problems (e.g., Chapelle and Li, 2011; Osband and Roy, 2017; Osband et al., 2019). However, TS cannot represent certain probing policies because it is constrained to executing informed policies. Another drawback of TS approaches is that, even in stationary environments, the frequency of re-sampling needs to be carefully tuned, which makes the application of TS to non-stationary environments challenging. We describe an environment where TS is suboptimal in Fig. 1.
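For intuition, here is a minimal sketch of Thompson Sampling on a stationary Bernoulli bandit, where tasks are arm means and the "informed policy" for a sampled task is simply the greedy arm; this is an illustrative toy of our own, not the paper's implementation:

```python
import numpy as np

def thompson_bandit(true_means, n_steps, rng):
    """Thompson Sampling on a stationary Bernoulli bandit.

    Keeps a Beta(a, b) posterior per arm; at each step, samples a mean
    from each posterior and plays the greedy arm for that sample,
    mirroring how TS executes the informed policy of a sampled task.
    """
    k = len(true_means)
    a, b = np.ones(k), np.ones(k)          # Beta(1, 1) priors
    total = 0
    for _ in range(n_steps):
        arm = int(np.argmax(rng.beta(a, b)))
        reward = int(rng.random() < true_means[arm])
        a[arm] += reward
        b[arm] += 1 - reward
        total += reward
    return total, a / (a + b)              # cumulated reward, posterior means

rng = np.random.default_rng(0)
total, post_means = thompson_bandit([0.2, 0.8, 0.4], 2000, rng)
```

Note how every action is an informed (greedy) action for some sampled task: TS can never deliberately "go look at the sign" of Fig. 1 if no informed policy does so.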

The Task Inference (TI) approach (Humplik et al., 2019) trains an RNN to simultaneously learn a good policy and predict the task descriptor $\mu$. Denoting by $h_\theta$ the mapping from histories to a latent representation of the belief state, $z_t = h_\theta(\tau_t)$, the policy $\pi(a_t \mid z_t)$ selects the action based on the representation constructed by the RNN. During training, $z_t$ is also used to predict the task descriptor $\mu$, using a task-identification module $q(\mu \mid z_t)$. The overall objective is:

$$\max_\theta\ \mathbb{E}_{\pi}\Big[\sum_t \gamma^t r_t\Big] + \lambda\, \mathbb{E}_{\pi}\Big[\sum_t \log q(\mu_t \mid z_t)\Big] \qquad (2)$$

where $\log q(\mu_t \mid z_t)$ is the log-likelihood of $\mu_t$ under the predicted distribution. Note that because the auxiliary loss is only supposed to structure the memory of the RNN rather than be an additional reward for the policy, the gradient of the auxiliary loss with respect to $\theta$ ignores the effect of $\theta$ on the trajectory distribution: given trajectories sampled by $\pi$, only the average gradient of $\log q(\mu_t \mid z_t)$ with respect to $\theta$ is backpropagated.

Thus, the training of the policy in TI is purely reward-driven, and it does not suffer from the suboptimality of PTE/TS. However, in contrast to PTE/TS, it does not leverage the smaller sample complexity of training informed policies, and the auxiliary loss is defined over the whole descriptor $\mu$ even though only some of its dimensions may be relevant to solve the task.
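A shape-level sketch of the TI-style auxiliary objective may help; the linear head and Gaussian likelihood below are our own simplifying assumptions (a Gaussian negative log-likelihood with fixed variance reduces to mean squared error up to constants):

```python
import numpy as np

def ti_auxiliary_loss(hidden_states, mus, W, b):
    """Task-inference auxiliary loss: squared error between the task
    descriptor predicted from the RNN hidden state and the true one
    (Gaussian NLL up to a constant, assuming unit variance).

    hidden_states: (T, d_h) RNN hidden states along a trajectory
    mus:           (T, d_mu) true task descriptors (training targets)
    W, b:          parameters of a linear task-identification head
    """
    pred = hidden_states @ W + b      # predicted mean of q(mu | z_t)
    return float(np.mean((pred - mus) ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 8))          # made-up hidden states
mu = rng.normal(size=(10, 3))         # made-up descriptors
W, b0 = np.zeros((8, 3)), np.zeros(3)
loss = ti_auxiliary_loss(z, mu, W, b0)
```

The descriptor appears only as a target here, never as an input, which is exactly why TI cannot reuse informed policies: nothing in the architecture consumes $\mu$.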

In the non-stationary setting, only a few models have been proposed, mainly based on the MAML algorithm. For instance, Nagabandi et al. (2018) combine MAML with model-based RL by meta-learning a transition model that helps an MPC controller predict the action sequence to take. The method does not make use of the value of $\mu$ at train time and is specific to MPC controllers.

3.2 Learning Task Embeddings

The approaches described above differ in the prior assumptions on tasks and their descriptions used for learning. The minimal requirement for all these approaches is to have access to task identifiers, i.e., one-hot encodings of the task. In general, however, these approaches are sensitive to the description of the task. In particular, irrelevant variables have a significant impact on PTE approaches, since the probing policy aims at identifying the task: for instance, an agent might waste time reconstructing the full map of a maze when it only needs to find one specific piece of information to act optimally w.r.t. the reward. Moreover, many methods are guided by a prior distribution over $\mu$ that has to be chosen by hand to fit the ground-truth distribution of tasks.

Several approaches have been proposed to learn task embeddings from task identifiers (Gupta et al., 2018; Rakelly et al., 2019; Zintgraf et al., 2019; Hausman et al., 2018). The usual approach is to train embeddings of task identifiers jointly with the informed policies. TI is not amenable to joint task-embedding training, since tasks are used as targets and not as inputs. Humplik et al. (2019) mention using TI with task embeddings, but the embeddings are pre-trained separately, which requires either additional interactions with the environment or expert traces.

3.3 Contributions

Like RNN and TI, IMPORT learns an RNN policy to maximize the cumulative reward. As such, our approach does not suffer from the intrinsic limitations of PTE/TS in terms of optimality; because there is no decoupling between exploration and exploitation (of an informed policy), the approach does not suffer from the scheduling difficulties of PTE/TS and is readily applicable to non-stationary environments. Nonetheless, similarly to PTE/TS and contrary to TI, IMPORT leverages the fast training of informed policies through the joint training of an RNN and an informed policy.

In addition, IMPORT does not rely on probabilistic models of task descriptors. Learning task embeddings makes the approach robust to irrelevant task descriptors, contrary to TI, and also makes IMPORT applicable when only task identifiers are available.

The next section describes these components in more details.

4 Method






Figure 2: Representation of the different architectures. The RNN policy is a classical recurrent neural network. Thompson Sampling (TS) samples a value of $\mu$ and follows the corresponding informed policy. TI is a recurrent policy such that one can predict $\mu$ from the hidden state. Finally, IMPORT is composed of two models sharing parameters: the informed policy, optimized through term (B) of the objective, and the history-based policy (used at test time), trained through term (A) (and optionally (C)).

In this section, we describe the main components of the IMPORT model, as well as the online optimization procedure and an additional auxiliary loss to further speed up learning. The overall approach is depicted in Fig. 2 (right).

4.1 Regularization Through Informed Policies

Our approach leverages the knowledge of the task descriptor $\mu$ and informed policies to construct a latent representation of the task that is purely reward-driven. Since $\mu$ is unknown at test time, we use this informed representation to train a predictor based on a recurrent neural network. To leverage the efficiency of informed policies even in this phase, we propose an architecture that shares parameters between the informed policy and the final policy, so that the final policy benefits from parameters learned with privileged information. The idea is to constrain the final policy to stay close to the informed policy while allowing it to perform probing behaviors when needed to effectively reduce the uncertainty about the task. We call this approach InforMed POlicy RegularizaTion (IMPORT).

Formally, we denote by $\pi^\mu$ the informed policy and by $\pi$ the history-dependent policy that will be used at test time. The informed policy is defined as the functional composition of $f_\psi$ and $g_\phi$, i.e., $\pi^\mu(a_t \mid s_t) = f_\psi(s_t, g_\phi(\mu_t))$, where $g_\phi$ projects $\mu$ into a latent space and $f_\psi$ selects the action based on the provided latent representation. The idea is that $g_\phi$ captures the relevant information contained in $\mu$ while ignoring dimensions that are not relevant for learning the optimal policy. This behavior is obtained by training $g_\phi$ directly to maximize the task reward (i.e., informed).

While this policy leverages the knowledge of $\mu$ at training time, $\pi$ should be able to act based on the history alone. To encourage $\pi$ to behave like the informed policy while preserving its ability to probe, we let $\pi$ share parameters with $\pi^\mu$ through the component $f_\psi$ that they have in common. We thus define $\pi(a_t \mid \tau_t) = f_\psi(s_t, h_\theta(\tau_t))$, where $h_\theta$ encodes the history into the latent space. By sharing the policy head $f_\psi$ between the informed and history-dependent policies, the approximate belief state constructed by the RNN is mapped to the same latent space as $g_\phi(\mu)$. As such, when informed policies learn faster than the RNN, they can be reused directly by $\pi$ when the uncertainty about the task is small.
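The parameter sharing can be sketched at the shape level as follows; the function names ($g_\phi$, $h_\theta$, $f_\psi$), dimensions, and the feed-forward stand-in for the RNN are our own illustrative choices:

```python
import numpy as np

def g_phi(mu, Wg):
    """Informed task embedding: project the descriptor into latent space."""
    return np.tanh(mu @ Wg)

def h_theta(history, Wh):
    """History embedding: a stand-in for the RNN (here, a simple
    nonlinear map of averaged past observations)."""
    return np.tanh(history.mean(axis=0) @ Wh)

def f_psi(state, z, Wf):
    """Shared policy head: action probabilities from (state, embedding)."""
    logits = np.concatenate([state, z]) @ Wf
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_s, d_mu, d_z, n_actions = 4, 3, 5, 2   # made-up sizes
Wg = rng.normal(size=(d_mu, d_z))
Wh = rng.normal(size=(d_s, d_z))
Wf = rng.normal(size=(d_s + d_z, n_actions))

state = rng.normal(size=d_s)
mu = rng.normal(size=d_mu)
history = rng.normal(size=(7, d_s))

pi_informed = f_psi(state, g_phi(mu, Wg), Wf)        # train time
pi_history = f_psi(state, h_theta(history, Wh), Wf)  # test time
```

Both branches feed the same head `f_psi`, which is what lets improvements of the informed branch transfer to the history-based policy.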

More precisely, let $\theta$, $\phi$ and $\psi$ be the parameters of $h$, $g$ and $f$ respectively, so that $\pi = f_\psi(\cdot, h_\theta(\tau_t))$ and $\pi^\mu = f_\psi(\cdot, g_\phi(\mu_t))$. The goal of IMPORT is to maximize over $(\theta, \phi, \psi)$ the following objective function:

$$\underbrace{\mathbb{E}_{\pi}\Big[\sum_t \gamma^t r_t\Big]}_{(A)} + \underbrace{\mathbb{E}_{\pi^\mu}\Big[\sum_t \gamma^t r_t\Big]}_{(B)} \qquad (3)$$
4.2 IMPORT with $\lambda > 0$: Speeding Up the Learning

The only information shared between $\pi$ and $\pi^\mu$ is the policy head $f_\psi$. However, optimizing term (B) in Eq. 3 also produces a reward-driven latent representation of the task through the function $g_\phi$. This information can be used to regularize the prediction of the history-based embedding $h_\theta(\tau_t)$, encouraging the history-based policy to predict a task embedding close to the one computed by the informed policy. We can thus rewrite Eq. 3 as:

$$\underbrace{\mathbb{E}_{\pi}\Big[\sum_t \gamma^t r_t\Big]}_{(A)} + \underbrace{\mathbb{E}_{\pi^\mu}\Big[\sum_t \gamma^t r_t\Big]}_{(B)} - \lambda\, \underbrace{\mathbb{E}_{\pi}\Big[\sum_t d\big(h_\theta(\tau_t),\, g_\phi(\mu_t)\big)\Big]}_{(C)} \qquad (4)$$

where $d$ is the squared $\ell_2$-norm in our experiments. Note that the objective (C) is an auxiliary loss, so only the average gradient of $d$ with respect to $\theta$ along trajectories collected by $\pi$ is backpropagated, ignoring the effect of $\theta$ on the trajectory distribution.
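Term (C) is easy to state concretely; in the sketch below (ours, not the paper's code), the informed embedding is treated as a fixed target for the history branch, which is one simple way to realize the regularizer:

```python
import numpy as np

def embedding_regularizer(z_history, z_informed):
    """Term (C): squared L2 distance between the history-based embedding
    h_theta(tau_t) and the informed embedding g_phi(mu_t).

    Here the informed embedding is copied and treated as a constant
    target (a "detach"), so only the RNN branch would be pushed toward
    it; whether gradients also flow to the informed branch is a design
    choice we make for illustration.
    """
    target = z_informed.copy()   # constant target in this sketch
    return float(np.sum((z_history - target) ** 2))

z_hist = np.array([0.2, -0.1, 0.4])   # made-up embeddings
z_inf = np.array([0.0, 0.0, 0.5])
c = embedding_regularizer(z_hist, z_inf)
```

Scaled by $\lambda$ and averaged along trajectories collected by $\pi$, this is the quantity subtracted from the return objectives in Eq. 4.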

  Initialize $\theta$, $\phi$, $\psi$ arbitrarily
  for each iteration do
      Collect trajectories according to $\pi$ in buffer $\mathcal{B}$
      Collect trajectories according to $\pi^\mu$ in buffer $\mathcal{B}^\mu$
      Compute the gradient of (A) on $\mathcal{B}$
      Compute the average gradient of the auxiliary loss (C) on $\mathcal{B}$ (if $\lambda > 0$)
      Compute the gradient of (B) on $\mathcal{B}^\mu$
      Update $\theta$, $\phi$ and $\psi$ based on (A) + (B) + $\lambda$(C)
  end for
Algorithm 1 IMPORT Training

4.3 Optimization

We propose an optimization scheme for (3) and (4) based on policy gradient algorithms (our experiments use A2C (Mnih et al., 2016) and REINFORCE (Baxter and Bartlett, 2001)). The high-level algorithm is summarized in Alg. 1. At each iteration, we collect two batches of episodes, one containing trajectories (full episodes) generated by the exploration/exploitation policy $\pi$, and the other one containing trajectories generated by the informed policy $\pi^\mu$. The gradients of the RNN ($\theta$) and of the policy head ($\psi$) are computed on the first batch according to objectives (A) and (C) (if $\lambda > 0$ in Eq. 4), while the gradients of the policy head ($\psi$) and of the task embeddings ($\phi$) are computed on the second batch. The weight updates are performed on the sum of all gradients.
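To make the informed-batch update of Alg. 1 concrete, here is a self-contained toy of our own (not the paper's code): two tasks and two actions, the informed policy $\pi^\mu = f_\psi(g_\phi(\mu))$, and gradient ascent on the exact expected policy gradient of term (B). The full algorithm interleaves an analogous update for the history-based policy on its own batch:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy problem: 2 tasks, 2 actions; action == task yields reward 1, else 0.
d_z = 3
Wg = rng.normal(scale=0.1, size=(2, d_z))   # task embeddings (phi)
Wf = rng.normal(scale=0.1, size=(d_z, 2))   # shared policy head (psi)
lr = 0.5

for _ in range(800):
    gWg, gWf = np.zeros_like(Wg), np.zeros_like(Wf)
    for task in (0, 1):                     # "batch" of informed episodes
        z = Wg[task]                        # g_phi(mu), mu one-hot
        probs = softmax(z @ Wf)             # f_psi(z)
        # exact expected REINFORCE gradient: sum_a pi(a) r(a) (e_a - pi)
        dlogits = probs[task] * (np.eye(2)[task] - probs)
        gWf += np.outer(z, dlogits)         # gradient w.r.t. shared head
        gWg[task] += dlogits @ Wf.T         # gradient w.r.t. embedding
    Wf += lr * gWf
    Wg += lr * gWg

p0 = softmax(Wg[0] @ Wf)   # informed policy for task 0
p1 = softmax(Wg[1] @ Wf)   # informed policy for task 1
```

Because the head `Wf` is shared, the same weight updates would immediately benefit a history-based branch whose embedding lands near `Wg[task]`, which is the mechanism IMPORT exploits.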

5 Experiments

Method Test reward
p = 0.9 | p = 0.5
UCB 64.67(0.38) 52.03(1.13) 37.61(1.5) 31.12(0.21) 22.84(0.97) 15.63(0.74)
SW-UCB 62.17(0.74) 50.8(1.21) 33.02(0.19) 30.06(1.36) 22.11(0.69) 14.63(0.57)
TS 68.93(0.7) 41.35(0.98) 20.58(1.86) 28.93(1.23) 15.23(0.28) 9.8(0.64)
RNN 73.39(0.78) 54.21(2.94) 30.51(0.8) 31.63(0.12) 21.19(0.87) 11.81(0.94)
TI 73.01(0.86) 58.81(1.67) 32.22(0.89) 31.9(0.82) 21.46(0.37) 12.78(0.77)
IMPORT($\lambda=0$) 72.66(0.96) 57.62(0.75) 31.36(4.55) 30.76(0.7) 21.53(2.07) 11.41(0.46)
IMPORT($\lambda>0$) 73.13(0.62) 61.38(0.62) 42.29(3.08) 32.44(1.12) 24.83(0.87) 14.47(2.05)
Table 1: Bandits in non-stationary modes. On average, the distribution over the arms changes 5 times per episode. Each value is the cumulated reward at the end of training (standard deviation over the 3 seeds).

5.1 Experimental Setting and Baselines

Method Test reward
TS 93.54(1.99) 85.33(6.28) 91.34(4.28)
RNN 81.3(12.7) 73.6(6.3) 76(4.1)
TI 90.4(0.5) 83.6(2.4) 80.1(2.2)
IMPORT($\lambda=0$) 90.66(6.6) 69.1(9.1) 79(2.5)
IMPORT($\lambda>0$) 96.4(2.9) 93.4(3.2) 97.2(1.5)
Table 2: CartPole in stationary and non-stationary settings (cumulated reward at the end of training).
Method Test reward
RNN -406.4(30.1) -459.0(57.5)
TI -278.4(45.0) -223.6(36.3)
TS -242.9(4.9) -190.4(51.9)
IMPORT -112.2(10.8) -101.7(11.1)
Table 3: Acrobot in non-stationary settings (cumulated reward at the end of training).
Method Test reward
TS 72.2(15.4) 44.9(8.2) 89.8(2.2) 49.2(4.0)
RNN 81.4(9.7) 74.3(8.4) 79.8(9.4) 63.2(6.0)
TI 88.6(4.4) 80.8(5.0) 85.2(7.5) 71.2(8.4)
IMPORT 77.1(17.8) 89.4(6.0) 90.1(1.5) 88.3(1.0)
Table 4: CartPole: generalization with task embeddings. Each column corresponds to a different number of training tasks; the models have been selected on 100 different validation tasks, and the performance is reported on 100 test tasks. Here $\mu$ is a one-hot vector encoding the id of each training task (cumulated reward at the end of training).

Our model is compared to different baselines: Recurrent Neural Network (RNN) is a recurrent policy based on a GRU module (Heess et al., 2015) that does not use $\mu$ at train time but only the observations and actions. Thompson Sampling (TS) and Task Inference (TI) models are trained in the same setting as IMPORT, using $\mu$ at train time but not at test time.

Figure 3: Training curves on one of the bandit problems.
Figure 4: CartPole with a small number of training tasks. The curves have been selected on 100 validation tasks, and the results are reported over 100 test tasks unseen at train time.
Figure 5: Training curves on Maze2d (top) and Maze3d (bottom) environments under scenario A, learned with REINFORCE. The curves show the ability of IMPORT to learn faster than TI and RNN. TS struggles on such problems because of its oscillating behaviour; under scenario A it can still touch the sign, but under scenario B TS is bound to fail.

Each approach (i.e., TS, RNN, TI and IMPORT) is trained using A2C (Mnih et al., 2016) on CPUs (REINFORCE has been used for the Maze problems because it was more stable in this sparse-reward setting; one GPU is added in the Maze3d setting). Precise values of the hyper-parameters and details on the neural network architectures are given in Appendix B.2. The networks for all approaches have a similar structure, with the same number of hidden layers and the same hidden layer sizes, for a fair comparison. In all the tables, we report the test performance of the best hyper-parameter value, selected according to the procedure described in Appendix B.3.

5.2 Environments

Experiments have been performed on a diverse set of environments whose goals are to showcase different aspects: a) generalization to unseen tasks, b) adaptation to non-stationarities, c) ability to learn complex probing policies, d) coping with high-dimensional state space and e) sensitivity to the task descriptor.

In both stationary and non-stationary settings, we study multi-armed bandits (MAB) and control tasks (CartPole and Acrobot) to test characteristics a) and b). They all have high-dimensional task descriptors that contain irrelevant variables, and are thus potentially sensitive to reconstructing unnecessary information (e)).
We then provide two environments with sparse rewards but different state inputs: Maze2d, with coordinates as observations, and Maze3d, a challenging first-person-view task with a high-dimensional state space (pixels), to test d). The Maze environments require c), i.e., discovering complex probing policies.
Note that in MAB, CartPole and Acrobot, the train/validation/test task sets are disjoint to support a). To test e), we study CartPole with two types of privileged information: 1) $\mu$ summarizes the dynamics of the system, e.g., pole length, and 2) $\mu$ encodes a task index (one-hot vector) in a set of training tasks. When using 2), we restrict the size of the training set to be small, thus making the generalization property a) even harder.

Environments are described in further details in Appendix C.

Multi-armed bandits:

We use MAB problems where each arm is a Bernoulli with parameter given by the corresponding component of $\mu$. In the non-stationary case, there is a fixed probability of re-sampling the value of $\mu$ at each timestep, and the length of each episode is fixed. We consider the setting where one single random arm is associated with a Bernoulli of high parameter, while the other arms are associated with Bernoullis whose parameters are drawn independently and uniformly over a lower range. In this environment, TS and TI use a Beta distribution to model $\mu$. We also compare with bandit-specific online algorithms: UCB (Auer, 2003) and one of its non-stationary counterparts, Sliding-Window UCB (Garivier and Moulines, 2008).
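A minimal environment of this kind can be sketched as follows; the specific parameter values (5 arms, best parameter 0.9, 5% resampling probability, a uniform lower range for the other arms) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

class NonStationaryBandit:
    """Bernoulli bandit whose arm parameters are re-sampled with a
    fixed probability at every timestep."""

    def __init__(self, n_arms, best_p, resample_prob, rng):
        self.n_arms, self.best_p = n_arms, best_p
        self.resample_prob, self.rng = resample_prob, rng
        self._resample()

    def _resample(self):
        # one random arm gets the high parameter, the others a lower range
        self.mu = self.rng.uniform(0.0, self.best_p / 2, size=self.n_arms)
        self.mu[self.rng.integers(self.n_arms)] = self.best_p

    def step(self, arm):
        if self.rng.random() < self.resample_prob:
            self._resample()
        return int(self.rng.random() < self.mu[arm])

rng = np.random.default_rng(0)
env = NonStationaryBandit(n_arms=5, best_p=0.9, resample_prob=0.05, rng=rng)
rewards = [env.step(int(rng.integers(5))) for _ in range(100)]
```

With a resampling probability of 0.05 and episodes of length 100, the arm distribution changes about 5 times per episode, consistent with the setup reported in Table 1.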

Control Environments:

We use two environments, CartPole and Acrobot, where $\mu$ controls different physical variables of the system, e.g., the weight of the cart, the size of the pole, etc.; the dimensionality of $\mu$ differs between the two environments. The components of $\mu$ are normalized, uniformly sampled at the beginning of each episode, and can be resampled at each timestep with a fixed probability. The maximum trajectory length is fixed per environment, and the reward functions are the ones implemented in OpenAI Gym (Brockman et al., 2016). These environments are particularly difficult for two reasons: the 'direction' of the forces applied to the system may vary such that the optimal informed policies are very different w.r.t. $\mu$; moreover, the mapping between $\mu$ and the dynamics is not obvious, since some dimensions of $\mu$ may be irrelevant, and some others may have opposite or similar effects. In these environments, TS and TI use a Gaussian distribution to model $\mu$. CartPole with a task identifier as $\mu$ is described in Section 5.3 and Appendix C.3.

Maze2d/Maze3d with Two Goals and Sign:

Figure 6: Maze3d under scenario A: the goal is located either at the blue or the red box. The grey box is the sign that provides the goal location when the agent is close. The optimal policy (i.e., reaching the sign and then the correct goal location) has an average length of 19 steps.

We consider (see Fig. 6) an environment where two possible goals are positioned at two different locations. The task descriptor denotes which of the two goals is active, and it is sampled at random at the beginning of each episode. The agent can access this information by moving to a sign location. It has to reach the goal in a limited number of steps (the optimal policy achieves this objective in 19 steps on average) to receive a positive reward, at which point the episode stops. If the agent reaches the wrong goal, it receives a negative reward and the episode stops. The agent has three actions (forward, turn left, turn right), and the environment has been implemented using MiniWorld (Chevalier-Boisvert, 2018). We propose two versions: the 2d version, where the agent observes its location (and the goal location when going over the sign), and the 3d version, where the agent observes an image and the sign information is encoded as a fourth channel (either zeros, or informative values when the sign is touched) concatenated to the RGB image. Note that this is an environment with a sparse reward that is very hard to solve with an RNN, which has to learn to memorize the sign information. In this setting, TS and TI use a Bernoulli distribution to model $\mu$. We consider two scenarios, A and B, detailed in Appendix C.5, whose difficulty is adjusted via the sign position: in scenario A, the sign is between the initial position and the two possible goals, whereas in scenario B it is behind the initial position, as in Fig. 1.

5.3 Results

Tables 1, 2, 3 and 4 show the cumulated reward obtained by the different methods in the different environments, with the standard deviation over 3 seeds.

Quality of the learned policy: In all environments, the TI and IMPORT methods perform better than or similarly to the baselines (TS and RNN). Indeed, TI and IMPORT are able to benefit from the values of $\mu$ at train time to avoid poor local minima and to learn a good policy in a reasonable time. The ordering of the methods depends on the environment: in control problems, TS performs well, since the dynamics of the environment can be captured in a few transitions and the informed policy acts as a sufficient exploration policy, but it performs poorly in the other settings. The non-stationary settings make the different problems more difficult for most methods, but IMPORT suffers less from this non-stationarity than the others.

Learning Speed: One important aspect to consider is the ability to learn with few interactions with the environment. Indeed, as explained before, RNN, TI and IMPORT have the same expressiveness – i.e., their final policy is a recurrent neural network – but are trained differently. Figure 3 (all learning curves are provided in the appendix) shows the best training curves (averaged over 3 seeds) for the different methods on one of the bandit settings. IMPORT reaches a better performance with fewer interactions with the environment than the baselines, and particularly than RNN, which is not guided by the task information at train time. Note that this is observed in almost all the experimental results on all the environments. When comparing IMPORT with its ablated variant, the learning speed of the full model is better since it benefits from an auxiliary loss of the same nature as the one used by TI; coupled with the informed-policy regularization principle, it achieves better results, faster, than TI. Note that the ability of IMPORT to learn fast is particularly visible on the Maze experiments (see Figure 5). Depending on the scenario, TS may perform badly. This is mainly due to the fact that the policy is trained with the task descriptor as an input; this prevents the agent from actually following probing actions, making the auxiliary supervised objective ineffective. For CartPole-task, TS predicts the task using a multinomial distribution; at test time, it behaves according to informed policies whose input corresponds to the index of one training task.

Generalization and sensitivity to task descriptors: We have performed a set of experiments to evaluate both generalization to unseen tasks and the sensitivity of the different methods to the type of privileged information (see Table 4 and Figure 4). Intrinsically, these experiments test the capacity of IMPORT to learn task embeddings. We reuse the CartPole environment, but consider a fixed set of possible task-descriptor values (i.e., tasks) at train time; validation is performed on episodes generated with 100 different values, and performance is reported over further held-out values. In this setting, at train time, the task is encoded as a one-hot vector reflecting which task the agent is currently playing. Note that the informed policy trained on multi-task CartPole with task-index inputs is able to perform optimally. This setup allows us both to evaluate the ability of the methods to generalize to unseen task values, and to compute task embeddings. Note that TI and TS use a multinomial distribution in this setting. Performance is provided in Table 4 and in Figure 4: IMPORT outperforms the other models in this setting as well, and is thus able to generalize to unseen tasks, even with few tasks at train time. The decrease in performance later in training can be explained by overfitting on the training tasks.

6 Conclusion

We have proposed a new policy architecture for learning in multi-task environments. The IMPORT model is trained only on the reward objective, and leverages the knowledge of the informed policy to discover a good trade-off between exploration and exploitation. It is thus able to learn better strategies than Thompson Sampling approaches, and faster than recurrent neural network policies and Task Inference approaches. Moreover, our approach also works well in non-stationary settings and is able to generalize to tasks unseen at train time. Learning this model in a continual learning setting will be investigated in future work.


  • P. Auer (2003) Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3 (null), pp. 397–422. External Links: ISSN 1532-4435 Cited by: §5.2.
  • B. Bakker (2001) Reinforcement learning with long short-term memory. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS'01, Cambridge, MA, USA, pp. 1475–1482. Cited by: §1, §3.1.
  • J. Baxter and P. L. Bartlett (2001) Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. 15, pp. 319–350. Cited by: §4.3.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. External Links: ISSN 1076-9757, Link, Document Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.2.
  • O. Chapelle and L. Li (2011) An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249–2257. Cited by: §3.1.
  • M. Chevalier-Boisvert (2018) Gym-miniworld environment for openai gym. GitHub. Note: licensed under Apache License 2.0. Cited by: §C.5, §5.2.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. CoRR abs/1611.02779. External Links: Link, 1611.02779 Cited by: §1, §3.1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv e-prints, pp. arXiv:1703.03400. External Links: 1703.03400 Cited by: §1.
  • A. Garivier, T. Lattimore, and E. Kaufmann (2016) On explore-then-commit strategies. In NIPS, pp. 784–792. Cited by: §3.1.
  • A. Garivier and E. Moulines (2008) On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems. arXiv e-prints, pp. arXiv:0805.3415. External Links: 0805.3415 Cited by: §5.2.
  • A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine (2018) Meta-reinforcement learning of structured exploration strategies. In NeurIPS, pp. 5307–5316. Cited by: §3.2.
  • K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. A. Riedmiller (2018) Learning an embedding space for transferable robot skills. In ICLR (Poster), Cited by: §1, §3.2.
  • N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver (2015) Memory-based control with recurrent neural networks. CoRR abs/1512.04455. External Links: Link, 1512.04455 Cited by: §1, §3.1, §5.1.
  • M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2017) Rainbow: combining improvements in deep reinforcement learning. External Links: 1710.02298 Cited by: §1.
  • J. Humplik, A. Galashov, L. Hasenclever, P. A. Ortega, Y. Whye Teh, and N. Heess (2019) Meta reinforcement learning as task inference. arXiv e-prints, pp. arXiv:1905.06424. External Links: 1905.06424 Cited by: §1, §1, §3.1, §3.2.
  • L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artif. Intell. 101 (1-2), pp. 99–134. Cited by: §2.
  • T. Lattimore and C. Szepesvári (2018) Bandit algorithms. preprint, pp. 28. Cited by: §3.1.
  • A. Lazaric (2012) Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pp. 143–173. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. External Links: 1312.5602 Cited by: §1.
  • V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous Methods for Deep Reinforcement Learning. arXiv e-prints, pp. arXiv:1602.01783. External Links: 1602.01783 Cited by: §B.1, §4.3, §5.1.
  • A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2018) Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. arXiv e-prints, pp. arXiv:1803.11347. External Links: 1803.11347 Cited by: §3.1.
  • I. Osband, C. Blundell, A. Pritzel, and B. V. Roy (2016) Deep exploration via bootstrapped DQN. In NIPS, pp. 4026–4034. Cited by: §1.
  • I. Osband, B. V. Roy, D. J. Russo, and Z. Wen (2019) Deep exploration via randomized value functions. Journal of Machine Learning Research 20 (124), pp. 1–62. External Links: Link Cited by: §1, §3.1.
  • I. Osband and B. V. Roy (2017) Why is posterior sampling better than optimism for reinforcement learning?. In ICML, Proceedings of Machine Learning Research, Vol. 70, pp. 2701–2710. Cited by: §3.1.
  • X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2017) Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. arXiv e-prints, pp. arXiv:1710.06537. External Links: 1710.06537 Cited by: §1.
  • A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2019) Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. arXiv e-prints, pp. arXiv:1909.09157. External Links: 1909.09157 Cited by: §1.
  • K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen (2019) Efficient off-policy meta-reinforcement learning via probabilistic context variables. In ICML, Proceedings of Machine Learning Research, Vol. 97, pp. 5331–5340. Cited by: §1, §3.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. External Links: 1707.06347 Cited by: §1.
  • M. E. Taylor and P. Stone (2011) An introduction to inter-task transfer for reinforcement learning. AI Magazine 32 (1), pp. 15–34. External Links: Link Cited by: §1.
  • Y. W. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. CoRR abs/1707.04175. External Links: Link, 1707.04175 Cited by: §1.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §1, §3.1.
  • A. Wilson, A. Fern, S. Ray, and P. Tadepalli (2007) Multi-task reinforcement learning: a hierarchical bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 1015–1022. External Links: ISBN 9781595937933, Link, Document Cited by: §1.
  • W. Zhou, L. Pinto, and A. Gupta (2019) Environment probing interaction policies. Cited by: §1, §3.1.
  • L. Zintgraf, M. Igl, K. Shiarlis, A. Mahajan, K. Hofmann, and S. Whiteson (2019) Variational Task Embeddings for Fast Adaptation in Deep Reinforcement Learning. arXiv e-prints. Cited by: §3.2.

Appendix A Details of the IMPORT algorithm

The algorithm is described in detail in Algorithm 2.

  Initialize parameters arbitrarily
  while training do
      Collect trajectories according to the informed policy into its buffer.
      Collect trajectories according to the recurrent policy into its buffer.
  end while
Algorithm 2 Details of IMPORT Training
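The collection loop of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: all function names (`collect_trajectory`, `import_training_loop`, `env_step`, `sample_task`) are hypothetical, and the parameter update (policy gradient with informed-policy regularization and the auxiliary loss) is left as a commented-out placeholder.

```python
def collect_trajectory(policy, env_step, task, horizon=5):
    """Roll out `policy` for `horizon` steps and return the trajectory."""
    traj, obs = [], 0.0
    for _ in range(horizon):
        action = policy(obs, task)
        obs, reward = env_step(obs, action)
        traj.append((obs, action, reward))
    return traj

def import_training_loop(informed_policy, recurrent_policy, env_step,
                         sample_task, n_iterations=3):
    """Alternate data collection with the informed and recurrent policies,
    mirroring the two collection steps of Algorithm 2."""
    buffer_informed, buffer_recurrent = [], []
    for _ in range(n_iterations):
        task = sample_task()
        # The informed policy observes the task descriptor...
        buffer_informed.append(collect_trajectory(informed_policy, env_step, task))
        # ...while the recurrent policy does not (the task is hidden at test time).
        buffer_recurrent.append(collect_trajectory(recurrent_policy, env_step, None))
        # update(recurrent_policy, buffer_informed, buffer_recurrent)  # omitted
    return buffer_informed, buffer_recurrent
```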

Appendix B Implementation details

b.1 Data collection and optimization

We focus on on-policy training, for which we use the actor-critic method A2C (Mnih et al., 2016). We use a distributed execution to accelerate experience collection: several worker processes independently collect trajectories. As workers progress, a shared replay buffer is filled with trajectories, and an optimization step happens when the buffer's capacity is reached. After the model updates, the replay buffer is emptied and the parameters of all workers are updated to guarantee synchronisation.
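The fill/optimize/empty cycle described above can be sketched as follows (a single-process simulation of the shared buffer; class and function names are illustrative, not from the authors' code):

```python
class TrajectoryBuffer:
    """Shared buffer filled by workers; an optimization step is triggered
    when capacity is reached, after which the buffer is emptied."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.trajectories = []

    def add(self, trajectory):
        self.trajectories.append(trajectory)
        return len(self.trajectories) >= self.capacity  # signal: time to update

    def drain(self):
        """Return all collected trajectories and empty the buffer."""
        batch, self.trajectories = self.trajectories, []
        return batch

def training_step(buffer, trajectory, optimize):
    """Add a worker's trajectory; run `optimize` and empty the buffer when full."""
    if buffer.add(trajectory):
        optimize(buffer.drain())
        # in the real system, worker parameters would be re-synchronized here
```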

b.2 Network architectures

The architecture of the different methods remains the same in all our experiments, except that the number of hidden units changes across considered environments. A description of the architectures of each method is given in Fig. 2.
Unless otherwise specified, MLP blocks are single linear layers followed by an activation function, and their output size is the hidden size. All methods aggregate the trajectory into an embedding using a GRU. Its input is the concatenation of representations of the last action and the current state, obtained separately. For bandit environments, the current state corresponds to the previous reward. TS uses the same GRU architecture to aggregate the history.
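As an illustration of the history aggregation, here is a minimal NumPy sketch of a GRU step applied to a sequence of action/state feature vectors. The gate equations follow the standard GRU formulation; the weight shapes and random initialization are purely illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step: folds the feature vector x into the hidden state h."""
    Wr, Ur, Wz, Uz, Wn, Un = params
    r = sigmoid(Wr @ x + Ur @ h)          # reset gate
    z = sigmoid(Wz @ x + Uz @ h)          # update gate
    n = np.tanh(Wn @ x + Un @ (r * h))    # candidate state
    return (1.0 - z) * n + z * h

def encode_history(inputs, hidden_size, rng):
    """Aggregate a trajectory (sequence of action/state feature vectors)
    into a single hidden state, as the GRU in the architecture does."""
    d = inputs[0].shape[0]
    params = tuple(rng.standard_normal(s) * 0.1
                   for s in [(hidden_size, d), (hidden_size, hidden_size)] * 3)
    h = np.zeros(hidden_size)
    for x in inputs:
        h = gru_cell(x, h, params)
    return h
```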

All methods use a softmax activation to obtain a probability distribution over actions.

The use of the hidden state differs across methods. While RNN only uses it as an input to the policy and critic, both TS and TI map it to a belief distribution that is problem-specific, e.g. Gaussian for control problems, Beta for bandits, and multinomial for the Maze and CartPole-task environments. For instance, the hidden state is mapped to a Gaussian distribution by using two MLPs whose outputs correspond to the mean and variance; the variance values are constrained to be positive by the output activation.
IMPORT maps the hidden state to an embedding, whereas the task embedding is obtained by an activated linear mapping of the task descriptor. Both embeddings have the same size, tuned by cross-validation on a set of validation tasks. The input of the shared policy head is the embedding associated with the policy in use, i.e. either the history embedding or the task embedding. For the Maze3d experiment, the pixel input is fed into three convolutional layers (each with 32 output channels) with LeakyReLU activations (kernel sizes are respectively 5, 5 and 4, with stride 2). The output is flattened, linearly mapped to a vector, and activated.
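The Gaussian belief head used by TS/TI can be sketched as follows. Softplus is an assumption on our part: the text only states that the variance is constrained by an output activation, so treat the exact nonlinearity (and the two plain linear heads) as illustrative:

```python
import numpy as np

def softplus(x):
    """Smooth map to positive values; assumed here as the variance activation."""
    return np.log1p(np.exp(x))

def gaussian_belief_head(h, W_mu, W_var):
    """Map the GRU hidden state h to a Gaussian belief over the task
    descriptor: one linear head for the mean, one for the (positive) variance."""
    mean = W_mu @ h
    var = softplus(W_var @ h)
    return mean, var
```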

b.3 Results preprocessing

We run each method for different hyperparameter configurations, specified in Appendix C, and choose the best hyperparameters using grid search. We separate task sets into disjoint training, validation and testing sets. During training, every 10 model updates, the validation performance is measured by running the current policy for 100 episodes with tasks taken from the validation set. Similarly, the test performance is measured using the testing tasks.

Each pair (method, set of hyperparameters) was trained with 3 seeds. For each method, we define the best set of hyperparameters as follows. First, for each seed, find the best validation performance achieved by the model over the course of training. The score of a set of hyperparameters is then the average of this performance over seeds. The best set of hyperparameters is the one with maximum score.
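The selection rule above (per-seed best validation value, averaged over seeds, then argmax over configurations) can be written directly; the data layout here is illustrative:

```python
def select_hyperparameters(validation_curves):
    """validation_curves maps a hyperparameter config (any hashable key) to a
    list of per-seed validation curves (each a list of scores over training).
    The score of a config is the mean over seeds of the per-seed best value;
    the best config maximizes this score."""
    def score(curves):
        return sum(max(curve) for curve in curves) / len(curves)
    return max(validation_curves, key=lambda cfg: score(validation_curves[cfg]))
```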


Each curve was obtained by averaging the already-smoothed test-performance curves over 3 seeds. The error bars correspond to standard deviations. Smoothing is done with a sliding window. For each method, we only plot the run with the best set of hyperparameters, as defined above. The x-axes of the plots correspond to environment steps.
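The smoothing-and-aggregation step can be sketched as follows (the window size used in the paper is elided in this copy, so it is a free parameter here):

```python
import numpy as np

def smooth(curve, window):
    """Sliding-window average of a single test-performance curve."""
    kernel = np.ones(window) / window
    return np.convolve(curve, kernel, mode="valid")

def aggregate_seeds(curves, window):
    """Smooth each seed's curve, then report mean and std across seeds."""
    smoothed = np.stack([smooth(np.asarray(c, float), window) for c in curves])
    return smoothed.mean(axis=0), smoothed.std(axis=0)
```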


Results in Tables 1, 2, 3 and 4 correspond to the mean and standard deviation (over seeds) of the test performance obtained by the policy extracted from the model with the best set of hyperparameters at maximum validation performance.

Appendix C Experiments

In this section, we describe in greater detail the environments and the sets of hyperparameters we considered. We add learning curves of all experiments to supplement the results of Tables 1, 2, 3 and 4 in order to study sample efficiency. Note that for all experiments but bandits, the task descriptor is normalized to a fixed range in each of its dimensions.

The hyperparameter ranges specified in Table 5 are kept constant across all environments. Environment-specific hyperparameters (hidden size, belief distribution for TS/TI, …) are specified in the corresponding subsections of this appendix.

Hyperparameter | Considered values
clip gradient  | 40
Table 5: Hyperparameter ranges

c.1 Bandits

At every step, the agent pulls one of K arms and obtains a stochastic reward drawn from a Bernoulli distribution whose success probability depends on the arm id. The goal of the agent is to maximize the cumulative reward collected over the episode. At test time, the agent does not know the arm parameters and only observes the reward of the selected arm.

The arm parameters are sampled according to the following multivariate random variable, with two constants fixed beforehand:

  • an optimal arm is sampled uniformly at random, and the arm means are set from the two constants.

At each time step, there is a fixed probability of re-sampling a new value of the arm parameters.

We consider different configurations of this generic scheme, varying the number of arms and the switching probability.
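A minimal sketch of such a non-stationary Bernoulli bandit follows. The two success-probability levels are illustrative placeholders, since the constants used in the paper did not survive extraction:

```python
import random

class NonStationaryBandit:
    """K-armed Bernoulli bandit: with probability p_switch per step, the arm
    success probabilities are re-sampled, making a new random arm optimal."""
    def __init__(self, k, p_switch, p_opt=0.9, p_other=0.5, seed=0):
        self.k, self.p_switch = k, p_switch
        self.p_opt, self.p_other = p_opt, p_other
        self.rng = random.Random(seed)
        self._resample()

    def _resample(self):
        # One arm gets the high success probability, the rest the low one.
        best = self.rng.randrange(self.k)
        self.probs = [self.p_opt if a == best else self.p_other
                      for a in range(self.k)]

    def step(self, arm):
        reward = 1 if self.rng.random() < self.probs[arm] else 0
        if self.rng.random() < self.p_switch:  # non-stationarity
            self._resample()
        return reward
```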

All methods use the environment-specific hidden size, and the belief distribution is either a Beta or a Gaussian distribution. Other hyperparameters are presented in Table 5.

Since the setting with few arms is fairly easy to solve, RNN, TI and IMPORT perform on par (see Fig. 7). TS performs worse, as it is sub-optimal in non-stationary environments. In the setting of Fig. 8, IMPORT largely outperforms the other methods. When the number of arms grows, the gap between the optimal arm and the second best can be small: the optimal policy does not necessarily stick to the best arm, and learning is slower. In the hardest setting (Fig. 9), learning is more difficult and the UCB baseline performs better.

c.2 CartPole.

We consider the classic CartPole control environment where the environment dynamics change within a set described by the following physical variables: gravity, cart mass, pole mass, pole length, force magnitude. Knowing some components of the task descriptor might not be required to behave optimally. The action space is discrete.

’s are re-sampled at each step with probability . Episode length is .

All methods use the environment-specific hidden size, and the belief distribution is Gaussian. Other hyperparameters are presented in Table 5.

Figure 10 shows that IMPORT's performance and sample efficiency are greatly superior to the other methods. The ablated IMPORT variant performs on par with or worse than TI, which shows that IMPORT's main advantage is the auxiliary supervised loss. TI performs dramatically worse, showing that reconstructing the entire task descriptor is not optimal.

c.3 CartPole-task

To study how the different methods deal with cases where no meaningful physical parameters of the system are available, and to study their performance on tasks that were not seen during training, we conduct a new set of experiments in the CartPole environment described below. In this new set of experiments, the task descriptor is the task identifier of the considered MDP: a one-hot encoding of the MDP, thus containing no relevant information about the world dynamics. To assess generalization to unseen tasks, we consider a training task set where the underlying dynamics parameters are sampled in the same way as for the usual CartPole environment. Validation and testing task sets are additional disjoint sets of tasks (thus, there is no overlap between the train, validation, and test task sets).
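The task-identifier encoding and the disjoint train/validation/test split can be sketched as follows (all names and set sizes are illustrative):

```python
import random

def one_hot(task_id, n_tasks):
    """Task descriptor for CartPole-task: a one-hot vector that carries no
    information about the underlying dynamics."""
    v = [0.0] * n_tasks
    v[task_id] = 1.0
    return v

def split_tasks(n_tasks, n_valid, n_test, seed=0):
    """Partition task ids into disjoint train / validation / test sets."""
    ids = list(range(n_tasks))
    random.Random(seed).shuffle(ids)
    return (set(ids[n_valid + n_test:]),          # train
            set(ids[:n_valid]),                   # validation
            set(ids[n_valid:n_valid + n_test]))   # test
```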

’s are re-sampled at each step with probability . Episode length is .

The hyperparameters considered for CartPole-task are the same as for CartPole, except that the belief distribution is multinomial.

In stationary environments, all methods perform roughly equivalently (Figures 11, 12). Indeed, in control problems, there is no need for a strong exploration policy, since the underlying physics can be inferred from a few transitions. When the environment is non-stationary, IMPORT is significantly better than the baselines. Overall, these experiments suggest that, in the stationary setting, all methods are able to generalize to unseen tasks in this environment; in the non-stationary setting, however, IMPORT significantly outperforms the baselines.

c.4 Acrobot

Acrobot consists of two joints and two links, where the joint between the two links is actuated. Initially, the links hang downwards, and the goal is to swing the end of the lower link up to a given height. The environment dynamics are determined by the lengths of the two links, their masses, and their maximum velocities. Unlike CartPole, the environment is stochastic because the simulator applies noise to the applied force. We also add an extra dynamics parameter which controls whether the action order is inverted, i.e. the positive and negative torque actions are swapped.

’s are re-sampled at each step with probability . Episode length is .

All methods use the environment-specific hidden size, and the belief distribution is Gaussian. Other hyperparameters are presented in Table 5.

IMPORT outperforms all baselines in every setting (Fig. 13).

c.5 Maze environments

Maze2d/Maze3d are grid-world environments with two possible goals positioned at two different locations, and a sign that indicates, when visited, which goal is activated. The value of the task descriptor denotes which of the two goals is active; it is sampled at random at the beginning of each episode. The agent can access this information by moving to the sign location. It has to reach the goal in a limited number of steps – the optimal policy is able to achieve this objective in 19 steps – to receive a reward, at which point the episode stops. If the agent reaches the wrong goal, it receives a lower reward and the episode stops. The agent has three actions (forward, turn left, turn right) and the environment has been implemented using MiniWorld (Chevalier-Boisvert, 2018).

We propose two versions of the same grid-world environment, differing in the inputs given to the agent. In the Maze2d version, the agent observes its absolute coordinates (and possibly the goal location when going over the sign; otherwise a placeholder value). In the Maze3d version, the agent observes a high-dimensional image; the sign information is encoded as a fourth channel (either zeros, or informative values when the sign is touched) concatenated to the RGB image. Note that it is an environment with a sparse reward (since there is no reward when reaching the sign) that is very complex to solve, because the policy has to learn to discover the sign location, to associate the sign information with the sign location, to memorize the sign information, and to reach the goal. In this setting, TS and TI use a Bernoulli distribution to model the belief.

In both cases, the maze’s width and length are 12 with coordinates going from to in both directions. The goal locations are and . In order to adjust the difficulty of solving the environment, we tried two scenarios:

  • Scenario A: The sign is located between the initial position and the goals. The agent does not waste time going to the sign, as it is on its way.

  • Scenario B: The sign is located on and the agent starts in position . This requires the agent to go to the bottom of the maze first, then remember the goal location, and finally go to the activated goal. This is a very hard exploration problem.
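The terminal-reward and sign-observation logic of the maze can be sketched as follows. The numeric reward values are placeholders, since the paper's exact values are elided in this copy:

```python
def maze_reward(reached_goal, active_goal, r_success=1.0, r_fail=-1.0):
    """Terminal reward of a maze episode: the episode stops as soon as a goal
    is reached, with a higher reward for the active goal.
    r_success and r_fail are assumed placeholder values."""
    return r_success if reached_goal == active_goal else r_fail

def sign_observation(agent_pos, sign_pos, active_goal):
    """The sign channel: uninformative unless the agent stands on the sign,
    in which case it reveals which goal is active."""
    return active_goal if agent_pos == sign_pos else None
```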

In the main article, results are reported on scenario A with a single seed. We report here complete results on the two scenarios on multiple seeds.

All methods use the environment-specific hidden size, and the belief distribution is Bernoulli. Other hyperparameters are presented in Table 5.

IMPORT outperforms the other methods on Scenario A in both Maze2d and Maze3d (Fig. 14) in terms of sample efficiency. Due to time constraints, we ran Maze3d with a single seed. In Scenario B (Fig. 15), IMPORT is slightly more sample-efficient in Maze2d. None of the methods was able to solve scenario B with image inputs.

Figure 7: Training curves on the bandits environment with and respectively (left) and (right).
Figure 8: Training curves on the bandits environment with and respectively (left) and (right).
Figure 9: Training curves on the bandits environment with and respectively (left) and (right).
Figure 10: Training curves on the CartPole environment with respectively (top left), (top right), (bottom)
Figure 11: Training curves on the CartPole-task environment with and respectively (left), (right)
Figure 12: Training curves on the CartPole-task environment with and respectively (left), (right)
Figure 13: Training curves on the Acrobot environment respectively for (top left), (top right), (bottom)
Figure 14: Training curves on the scenario A of Maze2d (top) and Maze3d (bottom)
Figure 15: Training curves on scenario B of Maze2d. None of the methods was able to solve this scenario with image inputs.