rlstructures
RLStructures is a library to facilitate the implementation of new reinforcement learning algorithms. It includes a library, a tutorial, and different RL algorithms provided as examples.
We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
Deep Reinforcement Learning (DRL) has been used to successfully train agents on a range of challenging environments such as Atari games (Mnih et al., 2013; Bellemare et al., 2013; Hessel et al., 2017) or continuous control (Peng et al., 2017; Schulman et al., 2017). Nonetheless, in these problems, RL agents perform exploration strategies to discover the environment and implement algorithms to learn a policy that is tailored to solving a single task. Whenever the task changes, RL agents generalize poorly and the whole process of exploration and learning restarts from scratch. On the other hand, we expect an intelligent agent to fully master a problem when it is able to generalize from a few instances (tasks) and learn how to achieve the objective of the problem under many variations of the environment. For instance, children know how to ride a bike (i.e., the problem) when they can reach their destination irrespective of the specific bike they are riding, which requires adapting to the weight of the bike, the friction of the brakes and tires, and the road conditions (i.e., the tasks).
How to enable agents to generalize across tasks has been studied under the frameworks of Multi-task Reinforcement Learning (e.g., Wilson et al., 2007; Teh et al., 2017), Transfer Learning (e.g., Taylor and Stone, 2011; Lazaric, 2012) and Meta-Reinforcement Learning (Finn et al., 2017; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019). These works fall into two categories. Learning to learn approaches aim at speeding up learning on new tasks, by pre-training feature extractors or learning good initializations of policy weights (Raghu et al., 2019). In contrast, we study in this paper the online adaptation setting where a single policy is trained for a fixed family of tasks. When facing a new task, the policy must then balance exploration, to reduce the uncertainty about the current task, and exploitation, to maximize the cumulative reward of the task.
The online adaptation setting is a special case of a partially observed Markov decision problem, where the unobserved variables are the descriptors of the current task. It is thus possible to rely on recurrent neural networks (RNNs) (Bakker, 2001; Heess et al., 2015; Duan et al., 2016), since they can theoretically represent optimal policies in POMDPs if given enough capacity. Unfortunately, the training of RNN policies often has prohibitive sample complexity and may converge to suboptimal local minima.
To overcome this drawback, efficient online adaptation methods leverage the knowledge of the task at train time. The main approach is to pair an exploration strategy with the training of informed policies, i.e., policies taking the description of the current task as input. Probe-then-Exploit (PTE) algorithms (e.g., Zhou et al., 2019) operate in two stages. They first rely on an exploration policy to identify the task. Then, they commit to the identified task by playing the associated informed policy. Thompson Sampling (TS) approaches (Thompson, 1933; Osband et al., 2016, 2019) maintain a distribution over plausible tasks and play the informed policy of a task sampled from the posterior following a predefined schedule.
PTE and TS are expected to be sample-efficient because learning informed policies is easier than RNN policies: since informed policies know the current task, the problem is fully observable and they can be learnt with efficient algorithms for MDPs. However, as we discuss in Section 3, PTE and TS cannot represent effective exploration/exploitation policies in many environments. This limitation is even more severe in non-stationary environments, where the task changes within each episode. In this case, the exploration strategy must also adapt to how tasks evolve over time.
Recently, Humplik et al. (2019) proposed an alternative approach, Task Inference (TI), which trains a full RNN policy with the prediction of the current task as an auxiliary loss. TI avoids the suboptimality of PTE/TS since it does not constrain the structure of the exploration/exploitation policy. However, in TI, the task descriptors are used as targets and not as inputs, so TI does not leverage the faster learning of informed policies. Moreover, the behavior of TI in non-stationary environments has not been studied.
In this paper, we introduce IMPORT (InforMed POlicy RegularizaTion), a novel policy architecture for efficient online adaptation that combines the rich expressivity of RNNs with the efficient learning of informed policies. At training time, a shared policy head receives as input the current observation, together with either a (learned) embedding of the current task or the hidden state of an RNN, so that two policies are learned simultaneously: the informed policy and the RNN policy. At test time, the hidden state of the RNN replaces the task embedding, and the agent can act without having access to the current task.
We evaluate IMPORT against the main approaches to online adaptation on a suite of different environments with different characteristics, from challenging exploration problems with sparse rewards to non-stationary control problems. We confirm that TS suffers from its limited expressivity when non-trivial probing policies are required, and show that the policy regularization of IMPORT significantly speeds up learning compared to TI. Moreover, the learnt task embeddings of IMPORT make it robust to irrelevant or minimally informative task descriptors, while the performance of TI degrades significantly in both cases.
Let be the space of possible tasks. Each is associated to an episodic -MDP whose dynamics and rewards are task dependent, while state and action spaces are shared across tasks and is the discount factor. The descriptor can be a simple id () or a set of parameters ().
When the reward function and the transition probabilities are unknown, RL agents need to devise a strategy that balances exploration to gather information about the system and exploitation to maximize the cumulative reward. Such a strategy can be defined as the solution of a partially observable MDP (POMDP), where the hidden variable is the descriptor of the MDP. Given a trajectory , a POMDP policy maps the trajectory to actions. In particular, the optimal policy in a POMDP is a history-dependent policy that uses to construct a belief state , which describes the uncertainty about the task at hand, and then maps it to the action that maximizes the expected sum of rewards (e.g., Kaelbling et al., 1998). In this case, maximizing the rewards may require taking explorative actions that improve the belief state enough so that future actions can be more effective in collecting reward.
At training time, we assume the agent has unrestricted access to the descriptor of the tasks it interacts with. In particular, we consider the challenging case of dynamic (i.e., non-stationary) environments, where the task may change over time according to a fixed or random schedule, being the value of at time . Let be a history-dependent task distribution, then at each step a new task is drawn from . (Notice that the definition of is rich enough so that it can represent cases such as stationary, piece-wise stationary, and shifts with limited deviations.) Leveraging the information gathered at training time, we expect the agent to learn an exploration strategy that is better suited for tasks in and . More formally, after steps of training, the agent returns a policy that is evaluated according to its average performance across tasks in generated from , i.e.,
(1) |
where the expectation is taken over trajectories of full episodes , and is the length of episode .
The objective is then to find an architecture for that is able to express strategies that perform the best according to Eq. 1 and, at the same time, can be efficiently learned even for moderately short training phases.
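Consistent with the description around Eq. 1, the evaluation criterion can be read as an expected discounted return over full episodes. The following is one way to write it; the symbols $J$, $\pi$, $\tau$, $r_t$ and $\gamma$ are notation assumed here for illustration, with $r_t$ the reward received at step $t$ under the task active at that step:

```latex
J(\pi) \;=\; \mathbb{E}\left[\, \sum_{t=1}^{|\tau|} \gamma^{t-1}\, r_t \,\middle|\, \pi \right]
```

Here $\tau$ stands for a full episode generated by $\pi$ while tasks evolve according to the task distribution, and $|\tau|$ is its length.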
In this section, we review how the online adaptation setting has been tackled in the literature. The main approaches are depicted in Fig. 2. We first compare the different methods in terms of the optimality of strategies they can represent, how they deal with non-stationary environments, and whether they leverage the efficient learning of informed policies. We then discuss learning task embeddings and how the various methods deal with unknown or irrelevant task descriptors. The last subsection summarizes our contributions. Unless otherwise stated, the methods described below have not been tested in the non-stationary setting.
In the previous section we mentioned that the best exploration strategy corresponds to the optimal policy of the associated POMDP. Since the belief state is a sufficient statistic of the history, POMDP policies take the form . While it is impractical to compute the exact belief state even for toy discrete problems, approximations can be learnt using Recurrent Neural Networks (RNNs) (Bakker, 2001; Heess et al., 2015; Duan et al., 2016). RNN-policies are directly trained to maximize the cumulative reward and do not leverage information about task descriptors at train time. While this class of policies can represent rich exploratory strategies, their large training complexity makes them highly suboptimal whenever the training phase is too short.
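To make the notion of a belief state concrete, the sketch below performs an exact Bayesian update over a small finite set of candidate tasks. It assumes, purely for illustration, a stationary task in which each candidate is a Bernoulli reward model for a single action; the function name `update_belief` and the two-task example are hypothetical and not taken from the paper.

```python
def update_belief(belief, likelihoods):
    """Bayes rule: posterior(task) is proportional to prior(task) * likelihood(obs | task)."""
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under every task")
    return [u / total for u in unnormalized]


# Hypothetical example: two candidate tasks with reward probabilities 0.9 and 0.5.
reward_probs = [0.9, 0.5]
belief = [0.5, 0.5]  # uniform prior over the two tasks
for reward in [1, 1, 0]:
    # Likelihood of the observed reward under each candidate task.
    likelihoods = [p if reward == 1 else 1.0 - p for p in reward_probs]
    belief = update_belief(belief, likelihoods)
```

In a non-stationary environment the update would additionally need a prediction step that accounts for how the task may change between observations, which is part of what makes exact belief tracking impractical and motivates learned approximations.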
In order to reduce the training complexity of RNN policies, existing strategies have constrained the set of possible exploratory behaviors by leveraging privileged information about the task. Probe-Then-Exploit (PTE) (e.g., Lattimore and Szepesvári, 2018; Zhou et al., 2019) works in two phases. First, it executes a pure exploratory policy with the objective of identifying the underlying task (i.e., ) as accurately as possible. In the second phase, PTE runs the optimal policy associated to the estimated task. In that case, the agent only needs to learn a probing policy that maximizes the likelihood of the task at the end of the exploration phase and the informed policies for the training tasks, thus leading to a much more efficient training process. PTE has two main limitations. First, similarly to explore-then-commit approaches in bandits (Garivier et al., 2016), the exploration can be suboptimal because it is not reward-driven: valuable time is wasted estimating unnecessary information. Second, the switch between probing and exploiting is difficult to tune and problem-dependent, which makes these approaches unsuitable in non-stationary environments.
Thompson Sampling (TS) (Thompson, 1933) leverages randomization to efficiently mix exploration and exploitation. Similarly to the belief state of an RNN-policy, TS maintains a distribution over tasks that are compatible with the observed history. The policy samples a task from the posterior and executes the corresponding informed policy for several steps. In that case, the training is limited to learning a maximum likelihood estimator to map trajectories to distributions over states and, similar to PTE, informed policies for the training tasks. This strategy proved successful in a variety of problems (e.g., Chapelle and Li, 2011; Osband and Roy, 2017; Osband et al., 2019). However, TS cannot represent certain probing policies because it is constrained to executing informed policies. Another drawback of TS approaches is that even in stationary environments, the frequency of re-sampling needs to be carefully tuned. This makes the application of TS to non-stationary environments challenging. We describe an example of environment where TS is suboptimal in Fig. 1.
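The sample-then-commit structure of TS can be sketched on a Bernoulli bandit, where the informed policy for a sampled task is simply "pull the arm with the highest sampled mean". This is a minimal illustration under assumed settings, not the implementation used in the experiments: the function name, the Beta(1, 1) prior, the `resample_every` parameter and the 3-armed example are all hypothetical.

```python
import random


def thompson_sampling(pull, num_arms, horizon, resample_every=1, rng=None):
    """Run Beta-Bernoulli Thompson Sampling; returns the list of chosen arms."""
    rng = rng or random.Random(0)
    successes = [1] * num_arms  # Beta(1, 1) prior on each arm's mean
    failures = [1] * num_arms
    chosen, arm = [], 0
    for t in range(horizon):
        if t % resample_every == 0:
            # Sample a plausible task from the posterior ...
            sampled_means = [rng.betavariate(successes[k], failures[k])
                             for k in range(num_arms)]
            # ... and commit to its informed policy (here: the best arm of the sample).
            arm = max(range(num_arms), key=lambda k: sampled_means[k])
        reward = pull(arm)
        successes[arm] += reward
        failures[arm] += 1 - reward
        chosen.append(arm)
    return chosen


# Hypothetical 3-armed problem with fixed means.
means = [0.2, 0.5, 0.9]
env_rng = random.Random(1)
arms = thompson_sampling(lambda a: int(env_rng.random() < means[a]),
                         num_arms=3, horizon=500, resample_every=5)
```

The `resample_every` argument plays the role of the re-sampling schedule discussed above: between two re-sampling steps the agent can only execute an informed policy, which is why pure probing behaviors are out of reach for this scheme.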
The Task Inference (TI) approach (Humplik et al., 2019) is an RNN trained to simultaneously learn a good policy and predict the task descriptor . Denoting by the mapping from histories to a latent representation of the belief state (), the policy selects the action based on the representation constructed by the RNN. During training, is also used to predict the task descriptor , using the task-identification module . The overall objective is:
(2) |
where is the log-likelihood of under distribution . Note that because the auxiliary loss is only supposed to structure the memory of the RNN rather than be an additional reward for the policy, the gradient of the auxiliary loss with respect to ignores the effect of on : given trajectories sampled by , only the average gradient of with respect to is backpropagated.
Thus, the training of in TI is purely reward-driven and it does not suffer from the suboptimality of PTE/TS. However, in contrast to PTE/TS, it does not leverage the smaller sample complexity of training informed policies, and the auxiliary loss is defined over the whole value of while only some dimensions may be relevant to solve the task.
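Based on the description of TI above, its objective combines the expected return of the policy with a task-prediction log-likelihood computed from the RNN representation. A form consistent with that description is sketched below; the symbols $f$ (history encoder), $g$ (task-identification module), $\ell$ (log-likelihood), $\mu_t$ (task descriptor at step $t$) and the weight $\beta$ are notation assumed here for illustration:

```latex
\max_{\pi,\, f,\, g}\;
\mathbb{E}\left[\sum_{t=1}^{|\tau|} \gamma^{t-1} r_t \,\middle|\, \pi\right]
\;+\; \beta\, \mathbb{E}\left[\sum_{t=1}^{|\tau|} \ell\big(\mu_t,\; g(f(\tau_t))\big) \,\middle|\, \pi\right]
```

In this reading, the second term is differentiated only through $f$ and $g$ on trajectories already sampled by $\pi$, which matches the remark that the auxiliary loss structures the memory of the RNN without acting as an extra reward.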
In the non-stationary setting, only a few models have been proposed, mainly based on the MAML algorithm. For instance, Nagabandi et al. (2018) combine MAML with model-based RL by meta-learning a transition model that helps an MPC controller predict the action sequence to take. The method does not make use of the value of at train time and is specific to MPC controllers.
The approaches described above differ in the necessary prior assumptions on tasks and their descriptions used for learning. The minimal requirement for all these approaches is to have access to task identifiers , i.e., one-hot encodings of the task. In general, however, these approaches are sensitive to the description of the task. In particular, irrelevant variables have a significant impact on PTE approaches since the probing policy aims at identifying the task: for instance, an agent might waste time reconstructing the full map of a maze when it only needs to find a specific piece of information to act optimally w.r.t. the reward. Moreover, many methods are guided by a prior distribution over that has to be chosen by hand to fit the ground-truth distribution of tasks.
Several approaches have been proposed to learn from task identifiers (Gupta et al., 2018; Rakelly et al., 2019; Zintgraf et al., 2019; Hausman et al., 2018). The usual approach is to train embeddings of task identifiers jointly with the informed policies. TI is not amenable to joint task embedding training, since tasks are used as targets and not as inputs. Humplik et al. (2019) mention using TI with task embeddings, but the embeddings are pre-trained separately, which requires either additional interactions with the environment or expert traces.
As with RNN/TI, IMPORT learns an RNN policy to maximize cumulative reward. As such, our approach does not suffer from the intrinsic limitations of PTE/TS in terms of optimality; because there is no decoupling between exploration and exploitation (of an informed policy), the approach does not suffer from the scheduling difficulties of PTE/TS and is readily applicable to non-stationary environments. Nonetheless, similarly to PTE/TS and contrary to TI, IMPORT leverages fast training of informed policies through a joint training of an RNN and an informed policy.
In addition, IMPORT does not rely on probabilistic models of task descriptors. Learning task embeddings makes the approach robust to irrelevant task descriptors contrary to TI, but also makes IMPORT applicable when only task identifiers are available.
The next section describes these components in more detail.
In this section, we describe the main components of the IMPORT model, as well as the online optimization procedure and an additional auxiliary loss to further speed up learning. The overall approach is described in Fig. 2 (right).
Our approach leverages the knowledge of the task descriptor and informed policies to construct a latent representation of the task that is purely reward driven. Since is unknown at testing time, we use this informed representation to train a predictor based on a recurrent neural network. To leverage the efficiency of informed policies even in this phase, we propose an architecture sharing parameters between the informed policy and the final policy such that the final policy will benefit from parameters learned with privileged information. The idea is to constrain the final policy to stay close to the informed policy while allowing it to perform probing behaviors when needed to effectively reduce the uncertainty about the task. We call this approach InforMed POlicy RegularizaTion (IMPORT).
Formally, we define by and the informed policy and the history-dependent policy that will be used at test time. The informed policy is defined as the functional composition of and , where projects in a latent space and selects the action based on the provided latent representation. The idea is that captures the relevant information contained in while ignoring dimensions that are not relevant for learning the optimal policy. This behavior is obtained by training directly to maximize the task reward (i.e., informed).
While this policy leverages the knowledge of at training time, should be able to act based on the sole history. To encourage to behave like the informed policy while preserving the ability to probe, we let share parameters with through the component that they have in common. We thus define where encodes the history into the latent space. By sharing the policy head between informed and history-dependent policies, the approximate belief state constructed by the RNN is mapped to the same latent space as . As such, when informed policies learn faster than the , they can be reused directly by when the uncertainty about the task is small.
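The parameter sharing described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the architecture used in the experiments: the class name `ImportSketch`, the layer sizes, the random initialisation, and the use of a plain tanh recurrence over observations only (in place of a GRU fed with actions and states) are all assumptions made to keep the example self-contained.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ImportSketch:
    """Two policies sharing one head; sizes and initialisation are illustrative."""

    def __init__(self, obs_dim, task_dim, latent_dim, num_actions, seed=0):
        rng = random.Random(seed)
        # Task embedding (informed branch): task descriptor -> latent z.
        self.w_task, self.b_task = init(latent_dim, task_dim, rng), [0.0] * latent_dim
        # Recurrent encoder (history branch): (previous z, observation) -> latent z.
        self.w_rec, self.b_rec = init(latent_dim, latent_dim + obs_dim, rng), [0.0] * latent_dim
        # Shared policy head: (observation, z) -> action probabilities.
        self.w_head, self.b_head = init(num_actions, obs_dim + latent_dim, rng), [0.0] * num_actions
        self.latent_dim = latent_dim

    def head(self, obs, z):
        return softmax(linear(self.w_head, self.b_head, obs + z))

    def informed_policy(self, obs, task):
        z = [math.tanh(v) for v in linear(self.w_task, self.b_task, task)]
        return self.head(obs, z)

    def history_policy(self, observations):
        z = [0.0] * self.latent_dim
        for obs in observations:  # fold the history into z, one step at a time
            z = [math.tanh(v) for v in linear(self.w_rec, self.b_rec, z + obs)]
        return self.head(observations[-1], z)


model = ImportSketch(obs_dim=2, task_dim=3, latent_dim=4, num_actions=3)
p_informed = model.informed_policy([0.1, -0.2], [1.0, 0.0, 0.0])  # one-hot task id
p_history = model.history_policy([[0.0, 0.0], [0.1, -0.2]])      # no task access
```

Both policies call the same `head`, so whatever the head learns from the informed branch is immediately available to the history-based branch whenever the recurrent encoder produces a latent vector close to the task embedding.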
More precisely, let the parameters of and respectively, so that and . The goal of IMPORT is to maximize over the following objective function:
(3) |
The only information that is shared between and is function . However, optimizing term (B) in Eq. 3 also produces a reward-driven latent representation of the task through function . This information can be used to regularize the prediction of , and to encourage the history-based policy to predict a task embedding close to the one predicted by the informed policy. We can thus rewrite Eq. 3 as:
(4) |
where is the squared -norm in our experiments. Note that the objective (C) is an auxiliary loss, so only the average gradient of with respect to along trajectories collected by is backpropagated, ignoring the effect of on .
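Putting the three terms discussed in this section together, the regularized objective can be read as the sum of the return of the history-based policy (A), the return of the informed policy (B), and a penalty on the distance between the two latent representations (C). The display below is a sketch consistent with that description; the symbols $\pi_H$, $\pi_\mu$, $f_H$, $f_\mu$, $D$, $\beta$ and the parameter names $\theta, \omega, \sigma$ are notation assumed here for illustration:

```latex
\max_{\theta,\omega,\sigma}\;
\underbrace{\mathbb{E}\Big[\sum_{t=1}^{|\tau|}\gamma^{t-1} r_t \,\Big|\, \pi_H\Big]}_{(A)}
\;+\;
\underbrace{\mathbb{E}\Big[\sum_{t=1}^{|\tau|}\gamma^{t-1} r_t \,\Big|\, \pi_\mu\Big]}_{(B)}
\;-\;\beta\,
\underbrace{\mathbb{E}\Big[\sum_{t=1}^{|\tau|} D\big(f_\mu(\mu_t),\, f_H(\tau_t)\big) \,\Big|\, \pi_H\Big]}_{(C)}
```

Under this reading, setting $\beta = 0$ drops term (C) and leaves only the two return terms (A) and (B), while $\beta > 0$ adds the auxiliary regularization.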
We propose an optimization scheme for (3) and (4) based on policy gradient algorithms (our experiments use A2C (Mnih et al., 2016) and REINFORCE (Baxter and Bartlett, 2001)). The high-level algorithm is summarized in Alg. 1. At each iteration, we collect two batches of episodes, one containing trajectories (full episodes) generated by the exploration/exploitation policy , and the other one containing trajectories generated by the informed policy . The gradients of the RNN () and of the policy head are computed on the first batch according to objectives (A) and (C) (if in Eq. (4)), while the gradients of the policy head () and of the task embeddings () are computed on the second batch. The weight updates are performed on the sum of all gradients.
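The way gradients from the two batches are routed and summed can be illustrated with a tiny sketch. The parameter-group names and the scalar "gradients" below are hypothetical placeholders chosen for readability; in practice each entry would be a tensor produced by a policy gradient estimator.

```python
def combine_gradients(history_batch_grads, informed_batch_grads):
    """Sum per-parameter-group gradients coming from the two batches."""
    groups = set(history_batch_grads) | set(informed_batch_grads)
    return {g: history_batch_grads.get(g, 0.0) + informed_batch_grads.get(g, 0.0)
            for g in groups}


# Hypothetical scalar gradients, one number per parameter group for readability.
# First batch (history-based policy): touches the RNN and the shared policy head.
history_batch_grads = {"rnn": 0.3, "policy_head": -0.1}
# Second batch (informed policy): touches the shared policy head and the task embedding.
informed_batch_grads = {"policy_head": 0.4, "task_embedding": 0.2}

total = combine_gradients(history_batch_grads, informed_batch_grads)
```

Only the shared policy head receives contributions from both batches, which is the mechanism through which the informed policy regularizes the history-based one.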
Method | Test reward | |||||
---|---|---|---|---|---|---|
= 0.9 () | = 0.5 () | |||||
UCB | 64.67(0.38) | 52.03(1.13) | 37.61(1.5) | 31.12(0.21) | 22.84(0.97) | 15.63(0.74) |
SW-UCB | 62.17(0.74) | 50.8(1.21) | 33.02(0.19) | 30.06(1.36) | 22.11(0.69) | 14.63(0.57) |
TS | 68.93(0.7) | 41.35(0.98) | 20.58(1.86) | 28.93(1.23) | 15.23(0.28) | 9.8(0.64) |
RNN | 73.39(0.78) | 54.21(2.94) | 30.51(0.8) | 31.63(0.12) | 21.19(0.87) | 11.81(0.94) |
TI | 73.01(0.86) | 58.81(1.67) | 32.22(0.89) | 31.9(0.82) | 21.46(0.37) | 12.78(0.77) |
IMPORT() | 72.66(0.96) | 57.62(0.75) | 31.36(4.55) | 30.76(0.7) | 21.53(2.07) | 11.41(0.46) |
IMPORT() | 73.13(0.62) | 61.38(0.62) | 42.29(3.08) | 32.44(1.12) | 24.83(0.87) | 14.47(2.05) |
environment steps (and standard deviation over the 3 seeds).
Method | Test reward | ||
---|---|---|---|
TS | 93.54(1.99) | 85.33(6.28) | 91.34(4.28) |
RNN | 81.3(12.7) | 73.6(6.3) | 76(4.1) |
TI | 90.4(0.5) | 83.6(2.4) | 80.1(2.2) |
IMPORT() | 90.66(6.6) | 69.1(9.1) | 79(2.5) |
IMPORT() | 96.4(2.9) | 93.4(3.2) | 97.2(1.5) |
Method | Test reward | |
---|---|---|
RNN | -406.4(30.1) | -459.(57.5) |
TI | -278.4(45.) | -223.6(36.3) |
TS | -242.9(4.9) | -190.4(51.9) |
IMPORT | -112.2(10.8) | -101.7(11.1) |
Method | Test reward | |||
---|---|---|---|---|
TS | 72.2(15.4) | 44.9(8.2) | 89.8(2.2) | 49.2(4.) |
RNN | 81.4(9.7) | 74.3(8.4) | 79.8(9.4) | 63.2(6.) |
TI | 88.6(4.4) | 80.8(5.) | 85.2(7.5) | 71.2(8.4) |
IMPORT | 77.1(17.8) | 89.4(6.) | 90.1(1.5) | 88.3(1.) |
is a one-hot vector of size encoding the id of each training task (after environment steps).
Our model is compared to different baselines: Recurrent Neural Networks (RNN) is a recurrent policy based on a GRU recurrent module (Heess et al., 2015) that does not use at train time but just the observations and actions. Thompson Sampling (TS) and Task Inference (TI) models are trained in the same setting as the IMPORT model, using at train time, but not at test time.
Each approach (i.e., TS, RNN, TI and IMPORT) is trained using A2C (Mnih et al., 2016) on CPUs (plus 1 GPU in the Maze 3d setting). REINFORCE has been used for Maze problems because it was more stable in this sparse reward setting. Precise values of the hyper-parameters and details on the neural network architectures are given in Appendix B.2. The networks for all approaches have a similar structure with the same number of hidden layers and the same hidden layer sizes for fair comparison. In all the tables, we report the test performance of the best hyper-parameter value, selected according to the procedure described in Appendix B.3.
Experiments have been performed on a diverse set of environments whose goals are to showcase different aspects: a) generalization to unseen tasks, b) adaptation to non-stationarities, c) ability to learn complex probing policies, d) coping with high-dimensional state space and e) sensitivity to the task descriptor.
In both stationary and non-stationary settings, we study multi-armed bandits (MAB) and control tasks (CartPole and Acrobot) to test characteristics a) and b). They all have high-dimensional task descriptors that contain irrelevant variables, so methods that reconstruct the full descriptor may spend effort on unnecessary information, which tests e).
We then provide two environments with sparse rewards but different state inputs: Maze2d, where the state is given as coordinates, and Maze3d, a challenging first-person view task with a high-dimensional state space (pixels), to test d). Maze environments require discovering complex probing policies, which tests c).
Note that in MAB, CartPole and Acrobot, train/validation/test task sets are disjoint to support a).
To test e), we study CartPole for two types of privileged information: 1) summarizes the dynamics of the system, e.g. pole length and 2) encodes a task index (one-hot vector) in a set of training tasks. When using 2), we restrict the size of the training set to be small, thus making generalization property a) even harder.
Environments are described in further detail in Appendix C.
We use MAB problems with arms where each arm is Bernoulli with parameter . In the non-stationary case, there is a probability to re-sample the value of at each timestep. The size of each episode is . We consider the setting where one single random arm is associated with a Bernoulli of parameter , while the other arms are associated with Bernoulli distributions with parameters independently drawn uniformly in . In this environment, TS and TI use a Beta distribution to model . We also compare with bandit-specific online algorithms: UCB (Auer, 2003) and one of its non-stationary counterparts, Sliding-Window UCB (Garivier and Moulines, 2008).
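A bandit of this kind can be sketched as a small environment class. All numeric values below (10 arms, a best mean of 0.9, other means drawn in [0, 0.5], a re-sampling probability of 0.05) and the names `NonStationaryBernoulliBandit`, `best_mean`, `other_max` and `resample_prob` are hypothetical choices for illustration, not the settings used in the experiments.

```python
import random


class NonStationaryBernoulliBandit:
    """K Bernoulli arms whose means may be re-drawn at every step."""

    def __init__(self, num_arms, best_mean, other_max, resample_prob, seed=0):
        self.num_arms = num_arms
        self.best_mean = best_mean
        self.other_max = other_max
        self.resample_prob = resample_prob
        self.rng = random.Random(seed)
        self.means = self._sample_means()

    def _sample_means(self):
        # One random arm gets the high mean; the rest are uniform in [0, other_max].
        means = [self.rng.uniform(0.0, self.other_max) for _ in range(self.num_arms)]
        means[self.rng.randrange(self.num_arms)] = self.best_mean
        return means

    def step(self, arm):
        reward = int(self.rng.random() < self.means[arm])
        # With probability resample_prob the task changes before the next step.
        if self.rng.random() < self.resample_prob:
            self.means = self._sample_means()
        return reward


env = NonStationaryBernoulliBandit(num_arms=10, best_mean=0.9, other_max=0.5,
                                   resample_prob=0.05)
rewards = [env.step(arm=t % 10) for t in range(100)]  # round-robin pulls
```

Setting `resample_prob=0.0` recovers the stationary case, in which the identity of the best arm is fixed for the whole episode.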
We use two environments, CartPole and Acrobot, where controls different physical variables of the system, e.g., the weight of the cart, the size of the pole, etc. The size of is for Cartpole and for Acrobot. The values of components are normalized between and , are uniformly sampled at the beginning of each episode, and can be resampled at each timestep with a probability . The maximum size of each trajectory is for Cartpole and for Acrobot, and the reward functions are the ones implemented in OpenAI Gym (Brockman et al., 2016). These environments are particularly difficult for two reasons. First, the ‘direction’ of the forces applied to the system may vary such that the optimal informed policies are very different w.r.t. . Second, the mapping between and the dynamics is not obvious since some dimensions of may be irrelevant, and some others may have opposite or similar effects. In these environments, TS and TI use a Gaussian distribution to model . CartPole with a task identifier as is described in 5.3 and Appendix C.3.
We consider (see Fig. 6) an environment where two possible goals are positioned at two different locations. The value of or denotes which of the two goals is active and it is sampled at random at the beginning of each episode. The agent can access this information by moving to a sign location. It has to reach the goal in at most steps – the optimal policy is able to achieve this objective in 19 steps – to receive a reward of , after which the episode stops. If the agent reaches the wrong goal, it receives a reward of and the episode stops. The agent has three actions (forward, turn left, turn right) and the environment has been implemented using MiniWorld (Chevalier-Boisvert, 2018). We propose two versions: the 2d version, where the agent observes its location (and possibly the goal location when going over the sign), and the 3d version, where the agent observes an image; the sign information is encoded as a fourth dimension (either zeros or values when the sign is touched) concatenated to the RGB image. Note that it is an environment with a sparse reward that is very complex to solve with an RNN, which has to learn to memorize the sign information. In this setting, TS and TI use a Bernoulli distribution to model . We tried two scenarios, A and B, detailed in Appendix C.5, with difficulty adjusted through the sign position: in scenario A, the sign is between the initial position and the two possible goals, whereas in scenario B it is behind the initial position as in Fig. 1.
Tables 1, 2, 3 and 4 show the cumulative reward obtained by the different methods over the different environments, with the standard deviation over the 3 seeds.
Quality of the learned policy: In all environments, the TI and IMPORT methods perform better than or similarly to the baselines (TS and RNN). Indeed, TI and IMPORT are able to benefit from the values of at train time to avoid poor local minima and to learn a good policy in a reasonable time. The ordering of the methods depends on the environment: in control problems, TS performs well since the dynamics of the environment can be captured in a few transitions and the informed policy acts as a sufficient exploration policy, but it performs poorly in other settings. The non-stationary settings make the different problems more difficult for most of the methods, but IMPORT suffers less from this non-stationarity than other methods.
Learning Speed: One important aspect to look at is the ability to learn with few interactions with the environment. Indeed, as explained before, RNN, TI and IMPORT have the same expressiveness – i.e., their final policy is a recurrent neural network – but are trained differently. Figure 3 shows the best training curves (averaged over 3 seeds) for the different methods on one of the bandit settings. (All the learning curves are provided in the appendix.) IMPORT is able to reach a better performance with fewer interactions with the environment than the baselines, and particularly than RNN, which is not guided by the information at train time. Note that this is observed in almost all the experimental results on all the environments. When comparing IMPORT with and without the auxiliary loss, the variant with the auxiliary loss learns faster, since this loss is of the same nature as the one used by TI. Coupled with the informed policy regularization principle, it performs better and learns faster than TI. Note that the ability of IMPORT to learn fast is particularly visible on the Maze experiments (see Figure 5). Depending on the scenario, TS may perform badly. This is mainly due to the fact that is trained with as an input; this prevents the agent from actually taking probing actions, thus making the auxiliary supervised objective ineffective. For CartPole-task, TS predicts using a multinomial distribution; at test time, it behaves according to informed policies where corresponds to the index of one training task.
Generalization and sensitivity to task descriptors: We have performed a set of experiments to evaluate both generalization to unseen tasks and the sensitivity to the type of privileged information for the different methods (see Table 4 and Figure 4). Intrinsically, they test the capacity of IMPORT to learn task embeddings. We reuse the CartPole environment with but we consider possible values of (or tasks) at train time, the validation being performed on episodes generated with 100 different values of , and the performance being reported over other values. In this setting, at train time, is encoded as a one-hot vector of size reflecting which task the agent is currently playing. Note that the informed policy trained on a multi-task CartPole with task index inputs is able to perform optimally. This allows us both to evaluate the ability of the methods to generalize to unseen values of , and to compute task embeddings. Note that TI and TS models use a multinomial distribution in this setting. Performance is provided in Table 4 and in Figure 4. It shows that IMPORT outperforms the other models also in this setting, and that it is thus able to generalize to unseen tasks, even with few tasks at train time. Decreasing performance after steps can be explained by overfitting on training tasks.
We have proposed a new policy architecture for learning in multi-task environments. The IMPORT model is trained only on the reward objective, and leverages the informed policy knowledge to discover a good trade-off between exploration and exploitation. It is thus able to learn better strategies than Thompson Sampling approaches, and faster than recurrent neural network policies and Task Inference approaches. Moreover, our approach also works well in non-stationary settings and is able to generalize to tasks unseen at train time. Learning this model in a continual learning setting will be investigated in the near future.
The algorithm is described in detail in Algorithm 2.
We focus on on-policy training, for which we use the actor-critic algorithm A2C (Mnih et al., 2016). We use distributed execution to accelerate experience collection: several worker processes independently collect trajectories. As workers progress, a shared replay buffer is filled with trajectories, and an optimization step happens when the buffer's capacity is reached. After each model update, the replay buffer is emptied and the parameters of all workers are updated to guarantee synchronisation.
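The collection and update cycle described above can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the `Worker` and `Learner` classes, the round-robin scheduling and the version counter standing in for model parameters are all assumptions made for the example, and the A2C gradient step itself is replaced by a placeholder.

```python
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """Hypothetical worker holding a local copy of the policy parameters."""
    params_version: int = 0

    def collect(self) -> dict:
        # A real worker would roll out its policy in an environment here.
        return {"params_version": self.params_version}


@dataclass
class Learner:
    """Hypothetical learner owning the shared buffer and the model."""
    capacity: int
    version: int = 0
    buffer: List[dict] = field(default_factory=list)

    def add(self, trajectory: dict) -> bool:
        """Store a trajectory and report whether the buffer is now full."""
        self.buffer.append(trajectory)
        return len(self.buffer) >= self.capacity

    def update(self) -> int:
        # Placeholder for one A2C optimization step on the buffered data.
        self.version += 1
        self.buffer.clear()  # on-policy data is discarded after the update
        return self.version


def train(n_workers: int, capacity: int, n_updates: int):
    learner = Learner(capacity)
    workers = [Worker() for _ in range(n_workers)]
    stale = 0  # trajectories collected with out-of-date parameters
    for worker in itertools.cycle(workers):
        if learner.version >= n_updates:
            break
        trajectory = worker.collect()
        stale += trajectory["params_version"] != learner.version
        if learner.add(trajectory):
            new_version = learner.update()
            for w in workers:  # synchronise every worker with the new model
                w.params_version = new_version
    return learner, workers, stale
```

Emptying the buffer and synchronising all workers after each step is what keeps the data on-policy in this sketch: no trajectory is ever collected with parameters older than the current model.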
The architecture of the different methods remains the same in all our experiments, except that the number of hidden units changes across considered environments. A description of the architectures of each method is given in Fig. 2.
Unless otherwise specified, MLP blocks represent single linear layers activated with a function and their output size is .
All methods aggregate the trajectory into an embedding using a GRU with hidden size . Its input is the concatenation of separately computed representations of the last action and the current state. For bandit environments, the current state corresponds to the previous reward.
TS uses the same GRU architecture to aggregate the history into .
All methods use a softmax activation to obtain a probability distribution over actions.
The history embedding is mapped to a belief distribution that is problem-specific, e.g. Gaussian for control problems, Beta distribution for bandits, and a multinomial distribution for Maze and CartPole-task environments. For instance, it is mapped to a Gaussian distribution by using two MLPs whose outputs of size correspond to the mean and variance. The variance values are mapped to using a activation.

The image input is fed into three convolutional layers (with 32 output channels) with LeakyReLU activation (kernel sizes are respectively 5, 5 and 4, and the stride is 2). The output is flattened, linearly mapped to a vector of size and -activated.

We run each method for different hyperparameter configurations, specified in Appendix C, and choose the best hyperparameters using grid search. We separate task sets into disjoint training, validation and testing sets. During training, every 10 model updates, the validation performance is measured by running on 100 episodes with taken from the validation tasks. Similarly, the test performance is measured using testing tasks.
Each pair (method, set of hyperparameters) was trained with 3 seeds. For each method, we define the best set of hyperparameters as follows. First, for each seed, find the best validation performance achieved by the model over the course of training. The score of a set of hyperparameters is then the average of this performance over seeds. The best set of hyperparameters is the one with maximum score.
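The selection rule above can be written in a few lines. The sketch below assumes a hypothetical `results` dictionary that maps each hyperparameter configuration to one validation curve per seed; the function names are illustrative only.

```python
from statistics import mean


def score(curves_per_seed):
    """Average over seeds of the best validation performance during training."""
    return mean(max(curve) for curve in curves_per_seed)


def select_best(results):
    """Return the configuration with the maximum score.

    `results` maps a hashable configuration to a list of validation
    curves, one curve (a list of floats) per seed.
    """
    return max(results, key=lambda config: score(results[config]))


# Toy example with two configurations and 3 seeds each.
results = {
    ("lr", 1e-3): [[1.0, 5.0, 3.0], [2.0, 4.0, 4.0], [0.0, 6.0, 1.0]],
    ("lr", 1e-4): [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0], [4.0, 4.0, 4.0]],
}
best = select_best(results)
```

Note that the maximum is taken per seed before averaging, so a configuration is rewarded for its peak validation performance rather than for its final one.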
Each curve was obtained by averaging over 3 seeds already-smoothed test performance curves. The error bars correspond to standard deviations. Smoothing is done with a sliding window of size . For each method, we only plot the method with the best set of hyperparameters, as defined above. The x-axes of the plots correspond to environment steps.
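As an illustration of how such curves can be produced, the sketch below smooths each seed's curve first and then computes the mean and standard deviation across seeds. The trailing window (truncated at the start of the curve), the use of the population standard deviation and the `window` value are assumptions; the text does not specify them.

```python
from statistics import mean, pstdev


def smooth(curve, window):
    """Trailing moving average; the window is truncated at the curve start."""
    return [
        mean(curve[max(0, i - window + 1): i + 1])
        for i in range(len(curve))
    ]


def aggregate(curves, window):
    """Smooth each seed's curve, then return per-step mean and std over seeds."""
    smoothed = [smooth(curve, window) for curve in curves]
    means = [mean(values) for values in zip(*smoothed)]
    stds = [pstdev(values) for values in zip(*smoothed)]
    return means, stds


# Two toy seeds, window of size 2 (hypothetical value).
means, stds = aggregate([[0, 2, 4, 6], [2, 4, 6, 8]], window=2)
```

Smoothing before averaging means the error bars reflect the variability of the smoothed curves across seeds, not the step-to-step noise within a single run.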
In this section, we describe in more detail the environments and the sets of hyperparameters we considered. We add learning curves of all experiments to supplement the results from Tables 1, 2, 3 and 4 in order to study sample efficiency. Note that for all experiments but bandits, is normalized to be in where is the task descriptor dimension.
Hyperparameter ranges specified in Table 5 are kept constant across all environments. Environment-specific hyperparameters (hidden size , belief distribution for TS/TI, …) are specified in Appendix C.
Hyperparameter | Considered values |
---|---|
 | 4 |
 | 0.95 |
clip gradient | 40 |
At every step, the agent pulls one of K arms and obtains a stochastic reward drawn from a Bernoulli distribution with success probability , where is the arm id. The goal of the agent is to maximize the cumulative reward collected over steps. At test time, the agent does not know and only observes the reward of the selected arm. is sampled according to the following multivariate random variable with constants and fixed beforehand: an optimal arm is sampled at random in and
At each time-step, there is a probability to sample a new value of .
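A minimal sketch of such a non-stationary Bernoulli bandit is given below. It is only an illustration under stated assumptions: the class name, the parameter names `p_best`, `p_other` and `p_switch`, the choice of giving every non-optimal arm the same success probability, and the order of resampling and reward generation within a step are all hypothetical and are not taken from the paper.

```python
import random


class NonStationaryBernoulliBandit:
    """Hypothetical K-armed Bernoulli bandit whose task may change each step."""

    def __init__(self, n_arms, p_best, p_other, p_switch, seed=None):
        self.n_arms = n_arms
        self.p_best = p_best      # success probability of the optimal arm
        self.p_other = p_other    # success probability of every other arm
        self.p_switch = p_switch  # per-step probability of resampling the task
        self.rng = random.Random(seed)
        self._resample_task()

    def _resample_task(self):
        # The optimal arm is drawn uniformly at random among the K arms.
        self.best_arm = self.rng.randrange(self.n_arms)
        self.mu = [
            self.p_best if arm == self.best_arm else self.p_other
            for arm in range(self.n_arms)
        ]

    def step(self, arm):
        """Possibly change the task, then return a Bernoulli reward (0 or 1)."""
        if self.rng.random() < self.p_switch:
            self._resample_task()
        return 1 if self.rng.random() < self.mu[arm] else 0


# Stationary, deterministic instance used only to illustrate the interface.
env = NonStationaryBernoulliBandit(
    n_arms=5, p_best=1.0, p_other=0.0, p_switch=0.0, seed=0
)
```

Setting `p_switch` to 0 recovers the stationary setting, while at training time the vector `env.mu` plays the role of the privileged task descriptor that is hidden from the agent at test time.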
We consider different configurations of this generic schema with .
All methods use and the belief distribution is either a Beta distribution or Gaussian. Other hyperparameters are presented in Table 5.
Since the setting with is fairly easy to solve, RNN, TI and IMPORT perform on par (see Fig. 7). TS performs worse as it is sub-optimal in non-stationary environments. For (Fig. 8), IMPORT largely outperforms other methods. When , the gap between the optimal arm and the second best can be small. The optimal policy does not necessarily stick to the best arm and learning is slower. When (Fig. 9), learning is harder and the UCB baseline is better.
We consider the classic CartPole control environment where the environment dynamics change within a set () described by the following physical variables: gravity, cart mass, pole mass, pole length, magnetic force. Their respective pre-normalized domains are . Knowing some components of might not be required to behave optimally. The discrete action space is .
’s are re-sampled at each step with probability . Episode length is .
All methods use and the belief distribution is Gaussian. Other hyperparameters are presented in Table 5.
Figure 10 shows that IMPORT's performance and sample efficiency are greatly superior to those of the other methods. IMPORT() performs on par with or worse than TI, which suggests that IMPORT's main advantage is the auxiliary supervised loss. TI performs dramatically worse, showing that reconstructing the entire is not optimal.
To study how the different methods deal with cases where no meaningful physical parameters of the system are available, and to study their performance on tasks that were not seen during training, we conduct a new set of experiments in the CartPole environment described below. In this new set of experiments, represents the task identifier of the considered -MDP. Here is a one-hot encoding of the MDP, thus containing no relevant information on the world dynamics. To assess generalization on unseen tasks, we consider a training task set of different tasks where the underlying dynamics parameters are sampled in the same way as for the usual CartPole environment. Validation and testing task sets are then additional disjoint sets of tasks (thus, there is no overlap between the train, validation and test task sets).
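The task-identifier setup can be illustrated with the short sketch below, which builds disjoint training, validation and test task sets and encodes a training task as a one-hot vector. The helper names and the set sizes are hypothetical; in the actual experiments each task would also carry its sampled dynamics parameters.

```python
import random


def one_hot(index, size):
    """One-hot task descriptor: carries the task id, not the dynamics."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def split_tasks(tasks, n_train, n_val, seed=0):
    """Shuffle tasks and split them into disjoint train/val/test sets."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# 30 hypothetical tasks, each identified here by an integer.
train, val, test = split_tasks(range(30), n_train=10, n_val=10)

# Descriptor of the third training task, of size len(train).
descriptor = one_hot(2, len(train))
```

Only training tasks have a descriptor in this sketch, since the one-hot index is defined relative to the training set; validation and test tasks are used without any task information.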
’s are re-sampled at each step with probability . Episode length is .
The hyperparameters considered in CartPole-task are the same as the ones in CartPole, except that the belief distribution is multinomial.
In stationary environments, all methods are roughly equivalent in performance (Figures 11, 12) and are able to generalize to unseen tasks. Indeed, in control problems, there is no need for a strong exploration policy since the underlying physics can be inferred from a few transitions. When the environment is non-stationary, however, IMPORT significantly outperforms the baselines.
Acrobot consists of two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height. Environment dynamics are determined by the lengths of the two links, their masses, and their maximum velocities. Their respective pre-normalized domains are . Unlike CartPole, the environment is stochastic because the simulator adds noise to the applied force. The action space is . We also add an extra dynamics parameter which controls whether the action order is inverted, i.e. , thus .
’s are re-sampled at each step with probability . Episode length is .
All methods use and the belief distribution is Gaussian. Other hyperparameters are presented in Table 5.
IMPORT outperforms all baselines in every setting (Fig. 13).
Maze2d/Maze3d are grid-world environments with two possible goals positioned at two different locations and a sign that indicates which goal is activated when visited. The value of or denotes which of the two goals is active. is sampled at random at the beginning of each episode. The agent can access this information by moving to a sign location. It has to reach the goal in at most steps – the optimal policy is able to achieve this objective in 19 steps – to receive a reward of and the episode stops. If the agent reaches the wrong goal, it receives a reward of and the episode stops. The agent has three actions (forward, turn left, turn right) and the environment has been implemented using MiniWorld (Chevalier-Boisvert, 2018).
We propose two versions of the same grid-world environment with different inputs given to the agent. In the Maze2d version, the agent observes its absolute coordinates (and possibly the goal location when going over the sign, otherwise a placeholder s.t. ). In the Maze3d version, the agent observes a high-dimensional () image; the sign information is encoded as a fourth dimension (either zeros or values when the sign is touched) concatenated to the RGB image. Note that this is an environment with a sparse reward (since there is no reward when reaching the sign) that is very complex to solve, because the policy has to learn to discover the sign location, to associate the sign information with the sign location, to memorize the sign information, and to reach the goal. In this setting, TS and TI use a Bernoulli distribution to model .
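The Maze3d observation format can be sketched as follows, with the image represented as nested lists of shape height x width x 3. The function name and the convention of filling the whole extra channel with a single scalar `sign_value` are assumptions made for illustration; the exact encoding used in the experiments is not recoverable from the text.

```python
def add_sign_channel(rgb, sign_value=None):
    """Append a fourth channel carrying the sign information to an RGB image.

    rgb: nested lists of shape H x W x 3.
    sign_value: None while the sign has not been touched (channel of zeros),
        otherwise the scalar written into the extra channel (hypothetical).
    """
    fill = 0.0 if sign_value is None else float(sign_value)
    return [[list(pixel) + [fill] for pixel in row] for row in rgb]


# Tiny 2 x 2 image used only to illustrate the shapes.
rgb = [[[0.1, 0.2, 0.3] for _ in range(2)] for _ in range(2)]
before_sign = add_sign_channel(rgb)
on_sign = add_sign_channel(rgb, sign_value=1)
```

With this encoding, a sign value of zero would be indistinguishable from the untouched case, so the two goal values should both be non-zero if this convention is used.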
In both cases, the maze’s width and length are 12 with coordinates going from to in both directions. The goal locations are and . In order to adjust the difficulty of solving the environment, we tried two scenarios:
Scenario A: The sign is located on and the agent starts in position . The agent does not waste time going to the sign as it is on its way.
Scenario B: The sign is located on and the agent starts in position . This requires the agent to go to the bottom of the maze first, then remember the goal location, and finally go to the activated goal. This is a very hard exploration problem.
In the main article, results are reported on scenario A with a single seed. We report here complete results on the two scenarios on multiple seeds.
All methods use , and the belief distribution is Bernoulli. Other hyperparameters are presented in Table 5.
IMPORT is more sample efficient than the other methods on Scenario A in both Maze2d and Maze3d (Fig. 14). Due to time constraints, we ran Maze3d on a single seed only. In Scenario B (Fig. 15), IMPORT is slightly more sample efficient in Maze2d. None of the methods was able to solve Scenario B with image inputs.