1 Introduction
One hallmark of human intelligence is our ability to leverage knowledge collected over our lifetimes when we face a new problem. When we first drive a new car, we do not relearn from scratch how to drive a car. Instead, we leverage our experience driving to quickly adapt to the new car (its handling, control placement, etc.). Standard reinforcement learning (RL) methods lack this ability. When faced with a new problem—a new Markov decision process (MDP)—they typically start from scratch, initially making decisions randomly to explore and learn about the current problem they face.
The problem of creating agents that can leverage previous experiences to solve new problems is called lifelong learning or continual learning, and is related to the problem of transfer learning. In this paper, however, we focus on one aspect of lifelong learning: when faced with a sequence of MDPs sampled from a distribution over MDPs, how can a reinforcement learning agent learn an optimal policy for exploration? Specifically, we do not consider the question of when an agent should explore or how much an agent should explore, which is a well-studied area of reinforcement learning research [24, 14, 2, 6, 22]. Instead, we study the following question: given that an agent is going to explore, which action should it take?
After formally defining the problem of searching for an optimal exploration policy, we show that this problem can itself be modeled as an MDP. This means that the task of finding an optimal exploration strategy for a learning agent can be solved by another reinforcement learning agent that is solving a new meta-MDP. This meta-MDP operates at a different timescale from the RL agent solving specific MDPs: one episode of the meta-MDP corresponds to an entire lifetime of the RL agent. This difference of timescales distinguishes our approach from previous meta-MDP methods for optimizing components of reinforcement learning algorithms [25, 11, 26, 10, 4].
We contend that using random action selection during exploration (as is common when using Q-learning [27], Sarsa [23], and DQN [16]) ignores useful information from the agent’s experience with previous similar MDPs that could be leveraged to guide exploration. We separate the policies that define the agent’s behavior into an exploration policy (which governs behavior when the agent is exploring) and an exploitation policy (which governs behavior when the agent is exploiting).
In this paper we make the following contributions: 1) we formally define the problem of searching for an optimal exploration policy, 2) we prove that this problem can be modeled as a new MDP, and describe one algorithm for solving this meta-MDP, and 3) we present experimental results that show the benefits of our approach. Although the search for an optimal exploration policy is only one of the necessary components for lifelong learning (along with deciding when to explore, how to represent data, how to transfer models, etc.), it provides one key step towards agents that leverage prior knowledge to solve challenging problems.
2 Related Work
There is a large body of work discussing the problem of how an agent should behave during exploration when faced with a single MDP. Simple strategies, such as $\epsilon$-greedy with random action selection, Boltzmann action selection, or softmax action selection, make sense when an agent has no prior knowledge of the problem that it is currently trying to solve. The performance of an agent exploring with unguided exploration techniques, such as random action selection, reduces drastically as the size of the state space increases [28]. For example, the performance of Boltzmann or softmax action selection hinges on the accuracy of the action-value estimates. When these estimates are poor (e.g., early during the learning process), it can have a drastic negative effect on the overall learning ability of the agent. More sophisticated methods search for subgoal states to define temporally extended actions, called options, that explore the state space more efficiently [15, 7], use state-visitation counts to encourage the agent to explore states that have not been frequently visited [24, 14], or use approximations of a state-transition graph to exploit structural patterns [13, 12].
Recent research concerning exploration has also taken the approach of adding an exploration “bonus” to the reward function. VIME [8] takes a Bayesian approach by maintaining a model of the dynamics of the environment, obtaining a posterior of the model after taking an action, and using the KL divergence between these two models as a bonus. The intuition behind this approach is that encouraging actions that make large updates to the model allows the agent to better explore areas where the current model is inaccurate. [17] define a bonus in the reward function by adding an intrinsic reward. They propose using a neural network to predict state transitions based on the action taken and provide an intrinsic reward proportional to the prediction error. The agent is therefore encouraged to make state transitions that are not modeled accurately. Another relevant work in exploration was presented in [4], where the authors propose building a library of policies from prior experience to explore the environment in new problems more efficiently. These techniques are useful when an agent is dealing with a single MDP or a class of MDPs with the same state-transition graph; however, they do not provide a means to guide an agent to explore intelligently when faced with a novel task with different dynamics.
Related to our approach is the idea of meta-learning, or learning to learn, which has also been a recent area of focus. [1] proposed learning an update rule for a class of optimization problems. Given an objective function $f$ and parameters $\theta$, the authors proposed learning a model, $g_\phi$, such that the update to the parameters $\theta_k$ at iteration $k$ is given by $\theta_{k+1} = \theta_k + g_\phi(\nabla f(\theta_k))$. RL has also been used in meta-learning to learn efficient neural network architectures [19]. However, even though one can draw a connection to our work through meta-learning, these methods are not concerned with the problem of exploration.
In the context of RL, a similar idea can be applied by defining a meta-MDP, i.e., considering the agent as part of the environment in a larger MDP. In multi-agent systems, [11] considered other agents as part of the environment from the perspective of each individual agent. [25] proposed the conjugate MDP framework, in which agents solving meta-MDPs (called CoMDPs) can search for the state representation, action representation, or options that maximize the expected return when used by an RL agent solving a single MDP. Despite existing meta-MDP approaches, to the best of our knowledge, ours is the first to use the meta-MDP approach to specifically optimize exploration for a set of related tasks.
3 Background
A Markov decision process (MDP) is a tuple, $M = (\mathcal{S}, \mathcal{A}, P, R, d_0)$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{A}$ is the set of possible actions that the agent can take, $P(s, a, s')$ is the probability that the environment will transition to state $s'$ if the agent takes action $a$ in state $s$, $R(s, a, s')$ is a function denoting the reward received after taking action $a$ in state $s$ and transitioning to state $s'$, and $d_0$ is the initial state distribution. We use $t$ to index the time step, and write $S_t$, $A_t$, and $R_t$ to denote the state, action, and reward at time $t$. We also consider the undiscounted episodic setting, wherein rewards are not discounted based on the time at which they occur. We assume that $T$, the maximum time step, is finite, and thus we restrict our discussion to episodic MDPs; that is, after $T$ time steps the agent resets to some initial state. We use $I$ to denote the total number of episodes the agent interacts with an environment. A policy, $\pi$, provides a conditional distribution over actions given each possible state: $\pi(s, a) = \Pr(A_t = a \mid S_t = s)$. Furthermore, we assume that for all policies, $\pi$, (and all tasks, $c$, defined later) the expected returns are normalized to be in the interval $[0, 1]$.
One of the key challenges within RL, and the one this work focuses on, is related to the exploration-exploitation dilemma. To ensure that an agent is able to find a good policy, it should act with the sole purpose of gathering information about the environment (exploration). However, once enough information is gathered, it should behave according to what it believes to be the best policy (exploitation). In this work, we separate the behavior of an RL agent into two distinct policies: an exploration policy and an exploitation policy. We assume an $\epsilon$-greedy exploration schedule, i.e., with probability $\epsilon_i$ the agent explores and with probability $1 - \epsilon_i$ the agent exploits, where $(\epsilon_i)$ is a sequence of exploration rates with $\epsilon_i \in [0, 1]$ and $i$ refers to the episode number in the current task. We note that more sophisticated decisions on when to explore are certainly possible and could exploit our proposed method. By assuming this exploration strategy, the agent forgoes the ability to learn when it should explore, and the decision as to whether the agent explores or not is random. That being said, $\epsilon$-greedy is currently widely used (e.g., SARSA [23], Q-learning [27], DQN [16]) and its popularity makes its study still relevant today.
Let $\mathcal{C}$ be the set of all tasks, $c$. That is, all $c \in \mathcal{C}$ are MDPs sharing the same state set $\mathcal{S}$ and action set $\mathcal{A}$, which may have different transition functions $P_c$, reward functions $R_c$, and initial state distributions $d_0^c$. An agent is required to solve a set of tasks $c \in \mathcal{C}$, where we refer to the set $\mathcal{C}$ as the problem class. Given that each task is a separate MDP, the exploitation policy might not directly apply to a novel task. In fact, doing this could hinder the agent’s ability to learn an appropriate policy. This type of scenario arises, for example, in control problems where the policy learned for one specific agent will not work for another due to differences in the environment dynamics and physical properties. As a concrete example, Intelligent Control Flight Systems (ICFS) is an area of study that was born out of the necessity to address some of the limitations of PID controllers, and one where RL has gained significant traction in recent years [30, 31]. One particular scenario where our proposed problem would arise is in using RL to control autonomous vehicles [9], where a single control policy would likely not work for a number of distinct vehicles and each policy would need to be adapted to the specifics of each vehicle. Under this scenario, if $\mathcal{C}$ refers to learning a policy for vehicles to stay in a specific lane, for example, each task $c$ could refer to learning to stay in a lane for a vehicle with its own individual dynamics.
In our framework, the agent has a task-specific policy, $\pi$, that is updated by the agent’s own learning algorithm. This policy defines the agent’s behavior during exploitation, and so we refer to it as the exploitation policy. The behavior of the agent during exploration is determined by an advisor, which maintains a policy, $\mu$, tailored to the problem class (i.e., it is shared across all tasks in $\mathcal{C}$). We refer to this policy as an exploration policy. The agent is given $K = IT$ time steps of interaction with each of the sampled tasks. Hereafter we use $i$ to denote the index of the current episode on the current task, $t$ to denote the time step within that episode, and $k$ to denote the number of time steps that have passed on the current task, i.e., $k = iT + t$, and we refer to $k$ as the advisor time step. At every time step, $k$, the advisor suggests an action, $U_k$, to the agent, where $U_k$ is sampled according to $\mu$. If the agent decides to explore at this step, it takes action $U_k$; otherwise it takes action $A_k$, sampled according to the agent’s policy, $\pi$. We refer to an optimal policy for the agent solving a specific task, $c \in \mathcal{C}$, as an optimal exploitation policy, $\pi_c^*$. More formally: $\pi_c^* \in \arg\max_{\pi} \mathbf{E}[G \mid \pi, c]$, where $G = \sum_{t=0}^{T} R_t$ is referred to as the return. Thus, the agent solving a specific task is optimizing the standard expected return objective. From now on we refer to the agent solving a specific task as the agent (even though the advisor can also be viewed as an agent).
Intuitively, we consider a process that proceeds as follows. First, a task, $c \in \mathcal{C}$, is sampled from some distribution, $d_{\mathcal{C}}$, over $\mathcal{C}$. Next, the agent uses some pre-specified reinforcement learning algorithm (e.g., Q-learning or Sarsa) to approximate an optimal policy on the sampled task, $c$. Whenever the agent decides to explore, it uses an action provided by the advisor according to its policy, $\mu$. After the agent completes $I$ episodes on the current task, the next task is sampled from $d_{\mathcal{C}}$ and the agent’s policy is reset to an initial policy.
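The process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the corridor task `ToyTask`, the helper `sample_task`, the tabular Q-learning agent, and every numeric value are hypothetical and were chosen only to make the loop runnable.

```python
import random


class ToyTask:
    """Hypothetical corridor task: start at cell 0, goal at cell `length`."""

    def __init__(self, length):
        self.length = length

    def reset(self):
        return 0

    def step(self, state, action):
        # Action 1 moves right, action 0 moves left (clipped at cell 0).
        nxt = max(0, state + (1 if action == 1 else -1))
        done = nxt == self.length
        return nxt, (1.0 if done else -0.1), done


def run_lifetime(task, advisor, episodes, horizon, eps0, decay, rng, alpha=0.5):
    """Run one agent lifetime on `task`; return the total reward collected."""
    q = {}  # agent memory, reset for every new task
    total, eps = 0.0, eps0
    for _ in range(episodes):
        s = task.reset()
        for _ in range(horizon):
            if rng.random() < eps:
                a = advisor(s, rng)  # explore: take the advisor's action
            else:
                a = max((0, 1), key=lambda b: q.get((s, b), 0.0))  # exploit
            s2, r, done = task.step(s, a)
            nxt = 0.0 if done else max(q.get((s2, b), 0.0) for b in (0, 1))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + nxt - old)  # undiscounted Q-learning
            total += r
            s = s2
            if done:
                break
        eps *= decay  # exploration rate changes once per episode
    return total


def sample_task(rng):
    """Stand-in for sampling a task from the task distribution."""
    return ToyTask(rng.choice([3, 4, 5]))


if __name__ == "__main__":
    rng = random.Random(0)
    uniform_advisor = lambda s, r: r.choice((0, 1))
    for _ in range(3):  # a short sequence of sampled tasks
        task = sample_task(rng)
        print(task.length, run_lifetime(task, uniform_advisor, 20, 50, 0.8, 0.95, rng))
```

The `advisor` argument is the only piece shared across tasks; the Q-table is rebuilt for every task, which mirrors the reset of the agent's policy described above.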
4 Problem Statement
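We define the performance of the advisor’s policy, $\mu$, for a specific task $c \in \mathcal{C}$ to be $\rho(\mu, c) = \mathbf{E}\left[\sum_{i=0}^{I-1} \sum_{t=0}^{T} R_t^i \,\middle|\, \mu, c\right]$, where $R_t^i$ is the reward at time step $t$ during the $i$-th episode. Let $C$ be a random variable that denotes a task sampled from $d_{\mathcal{C}}$. The goal of the advisor is to find an optimal exploration policy, $\mu^*$, which we define to be any policy that satisfies:
$$\mu^* \in \arg\max_{\mu} \mathbf{E}\left[\rho(\mu, C)\right]. \qquad (1)$$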
We cannot directly optimize this objective because we do not know the transition and reward functions of each MDP, and we can only sample tasks from $d_{\mathcal{C}}$. In the next section we show that the search for an exploration policy can be formulated as an RL problem where the advisor is itself an RL agent solving an MDP whose environment contains both the current task, $c$, and the agent solving the current task.
5 A General Solution Framework
Our framework can be viewed as a meta-MDP, that is, an MDP within an MDP. From the point of view of the agent, the environment is the current task, $c$ (an MDP). However, from the point of view of the advisor, the environment contains both the task, $c$, and the agent. At every time step, the advisor selects an action $U$ and the agent an action $A$. The selected actions go through a selection mechanism which executes action $U$ with probability $\epsilon_i$ and action $A$ with probability $1 - \epsilon_i$ at episode $i$. Figure 1 depicts the proposed framework with action $A$ (exploitation) being selected. Even though one time step for the agent corresponds to one time step for the advisor, one episode for the advisor constitutes a lifetime of the agent. From this perspective, wherein the advisor is merely another reinforcement learning algorithm, we can take advantage of the existing body of work in RL to optimize the exploration policy, $\mu$.
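The selection mechanism described above amounts to a single biased coin flip per time step. The sketch below is illustrative only; the function names `select_action` and `exploration_fraction` are hypothetical and do not come from the paper.

```python
import random


def select_action(advisor_action, agent_action, epsilon, rng):
    """Execute the advisor's action with probability epsilon, else the agent's."""
    return advisor_action if rng.random() < epsilon else agent_action


def exploration_fraction(epsilon, steps, seed=0):
    """Empirical fraction of time steps on which the advisor's action is executed."""
    rng = random.Random(seed)
    picks = [select_action("explore", "exploit", epsilon, rng) for _ in range(steps)]
    return picks.count("explore") / steps
```

Because the coin flip is independent of both proposed actions, the advisor must propose an action at every time step without knowing whether it will actually be executed.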
We experimented with training the advisor policy using two different RL algorithms: REINFORCE [29] and Proximal Policy Optimization (PPO) [21]. Using Monte Carlo methods, such as REINFORCE, results in a simpler implementation at the expense of a large computation time (each update of the advisor requires training the agent for an entire lifetime). On the other hand, using temporal-difference methods, such as PPO, overcomes this computational bottleneck at the expense of larger variance in the performance of the advisor.
Pseudocode for an implementation of our framework using REINFORCE, where the meta-MDP is trained for $I_{\text{meta}}$ episodes, is described in Algorithm 1. Pseudocode for our implementation of PPO is presented in Appendix B.
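As a rough illustration of the Monte Carlo variant, the following sketch applies one REINFORCE update to a tabular softmax advisor after a complete agent lifetime. It is not a transcription of Algorithm 1: the tabular parameterization, the choice to condition the advisor only on the task state, and the names `softmax` and `reinforce_update` are assumptions made for this example.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, lifetime, lr):
    """One REINFORCE step from a single agent lifetime (one meta-MDP episode).

    theta:    dict mapping state -> list of action preferences (softmax advisor)
    lifetime: list of (state, advisor_action, reward), one per advisor time step
    """
    # Undiscounted return-to-go from every advisor time step.
    g, returns = 0.0, []
    for _, _, r in reversed(lifetime):
        g += r
        returns.append(g)
    returns.reverse()

    # Accumulate the gradient at the current parameters, then apply it once.
    grads = {s: [0.0] * len(p) for s, p in theta.items()}
    for (s, u, _), g_k in zip(lifetime, returns):
        probs = softmax(theta[s])
        for a, p in enumerate(probs):
            # Gradient of log mu(u | s) with respect to theta[s][a].
            grads[s][a] += g_k * ((1.0 if a == u else 0.0) - p)
    for s in theta:
        theta[s] = [w + lr * d for w, d in zip(theta[s], grads[s])]
    return theta
```

Each call consumes a whole lifetime, which is why this variant is simple but slow: no update can be made until the agent has finished all of its episodes on the current task.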
5.1 Theoretical Results
Below, we formally define the meta-MDP faced by the advisor and show that an optimal policy for the meta-MDP optimizes the objective in (1). Recall that $R_c$, $P_c$, and $d_0^c$ denote the reward function, transition function, and initial state distribution of the MDP $c$.
To formally describe the meta-MDP, we must capture the property that the agent can implement an arbitrary RL algorithm. To do so, we assume the agent maintains some memory, $M_k$, that is updated by some learning rule $l$ (an RL algorithm) at each time step, and write $\pi_{M_k}$ to denote the agent’s policy given that its memory is $M_k$. In other words, $M_k$ provides all the information needed to determine $\pi_{M_k}$, and its update is of the form $M_{k+1} = l(M_k, S_k, A_k, R_k, S_{k+1})$ (this update rule can represent popular RL algorithms like Q-learning and actor-critics). We make no assumptions about which learning algorithm the agent uses (e.g., it can use Sarsa, Q-learning, REINFORCE, and even batch methods like Fitted Q-Iteration), and consider the learning rule to be unknown and a source of uncertainty.
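To make the memory abstraction concrete, here is a minimal sketch in which the memory is a Q-table and the learning rule is one undiscounted Q-learning step. The names `q_learning_rule` and `greedy_policy`, the dictionary representation, and the step size are assumptions for illustration; the framework itself does not fix any of them.

```python
def q_learning_rule(memory, s, a, r, s_next, actions, alpha=0.1, terminal=False):
    """Hypothetical learning rule: map old memory and a transition to new memory."""
    new_memory = dict(memory)  # leave the old memory untouched
    if terminal:
        target = r
    else:
        target = r + max(memory.get((s_next, b), 0.0) for b in actions)
    old = memory.get((s, a), 0.0)
    new_memory[(s, a)] = old + alpha * (target - old)
    return new_memory


def greedy_policy(memory, s, actions):
    """Exploitation policy induced by the memory: greedy in the stored values."""
    return max(actions, key=lambda b: memory.get((s, b), 0.0))
```

From the advisor's point of view only the pair (memory, rule) matters: any algorithm that can be written in this form fits the framework, whether or not the advisor knows which rule is in use.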
Proposition 1.
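Consider an advisor policy, $\mu$, and episodic tasks $c \in \mathcal{C}$ belonging to a problem class $\mathcal{C}$. The problem of learning $\mu$ can be formulated as an MDP, $M_{\text{meta}} = (\mathcal{X}, \mathcal{U}, \mathcal{T}, \mathcal{Y}, d_0')$, where $\mathcal{X}$ is the state space, $\mathcal{U}$ the action space, $\mathcal{T}$ the transition function, $\mathcal{Y}$ the reward function, and $d_0'$ the initial state distribution.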
Proof.
To show that $M_{\text{meta}}$ is a valid MDP it is sufficient to characterize the MDP’s state set, $\mathcal{X}$, action set, $\mathcal{U}$, transition function, $\mathcal{T}$, reward function, $\mathcal{Y}$, and initial state distribution, $d_0'$. We assume that when facing a new task, the agent memory, $M$, is initialized to some fixed memory $M_0$ (defining a default initial policy and/or value function). The following definitions fully characterize the meta-MDP the advisor faces:

$\mathcal{X}$ is the state set, defined such that each state, $x = (c, s, i, M)$, contains the current task, $c$, the current state, $s$, in the current task, the current episode number, $i$, and the current memory, $M$, of the agent.

$\mathcal{U} = \mathcal{A}$. That is, the action set is the same as the action set of the problem class, $\mathcal{C}$.

$\mathcal{T}$ is the transition function, and is defined such that $\mathcal{T}(x, u, x')$ is the probability of transitioning from state $x \in \mathcal{X}$ to state $x' \in \mathcal{X}$ upon taking action $u \in \mathcal{U}$. Assuming the underlying RL agent decides to explore with probability $\epsilon_i$ and to exploit with probability $1 - \epsilon_i$ at episode $i$, then $\mathcal{T}$ is as follows.
If is terminal and , then .
If is terminal and , then .
Otherwise,

$\mathcal{Y}$ is the reward function, and defines the reward obtained after taking action $u$ in state $x$ and transitioning to state $x'$.

is the initial state distribution and is defined by: .
∎
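For intuition about the non-terminal case of the transition function, the probability of the next task state can be written out directly from the exploration rule stated in the proof. The expression below is a sketch under assumed notation rather than the definition used in the proof: $x = (c, s, i, M)$ denotes the advisor's state, $u$ the advisor's action, $\pi_M$ the agent's policy under memory $M$, and $P_c$ the transition function of task $c$. Within an episode the task and episode number stay fixed, and the memory is updated by the agent's learning rule.

```latex
% Non-terminal case only; notation is assumed for illustration.
% With probability \epsilon_i the advisor's action u is executed,
% otherwise an action a drawn from the agent's policy \pi_M is executed.
\Pr(s' \mid x, u)
  = \epsilon_i \, P_c(s, u, s')
  + (1 - \epsilon_i) \sum_{a \in \mathcal{A}} \pi_M(s, a) \, P_c(s, a, s'),
\qquad c' = c, \quad i' = i .
```

The second term shows why the agent is part of the advisor's environment: the advisor's dynamics depend on the agent's current policy, which changes as the agent learns.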
Now that we have fully described $M_{\text{meta}}$, we are able to show that the optimal policy for this new MDP corresponds to an optimal exploration policy.
Theorem 1.
An optimal policy for $M_{\text{meta}}$ is an optimal exploration policy, $\mu^*$, as defined in (1). That is, .
Proof.
See Appendix A. ∎
Since $M_{\text{meta}}$ is an MDP for which an optimal exploration policy is an optimal policy, it follows that the convergence properties of reinforcement learning algorithms apply to the search for an optimal exploration policy. For example, in some experiments the advisor uses the REINFORCE algorithm [29], the convergence properties of which have been well-studied [18].
Although the framework presented thus far is intuitive and results in nice theoretical properties (e.g., methods that guarantee convergence to at least locally optimal exploration policies), each episode corresponds to a new task being sampled. This means that training the advisor may require solving a large number of tasks (episodes of the meta-MDP), each one potentially being an expensive procedure. To address this issue, we sample a small number of tasks $c_1, \dots, c_n$, where each $c_j \sim d_{\mathcal{C}}$, and train for many episodes on each task in parallel. By taking this approach, every update to the advisor is influenced by several simultaneous tasks, which results in a scalable approach to obtaining a general exploration policy. In more difficult tasks, which might require the agent to train for a long time, using TD techniques allows the advisor to improve its policy while the agent is still training.
6 Empirical Results
In this section we present experiments for discrete and continuous control tasks. Figures 5(a) and 5(b) depict task variations for Animat for the case of discrete action sets. Figures 8(a) and 8(b) show task variations for the Ant problem for the case of continuous action sets. Depictions of task variations in the remaining experiments are given in Appendix C. The implementations used for the discrete pole-balancing case and for all continuous control problems were taken from the OpenAI Gym and Roboschool benchmarks [3]. For the driving task experiments we used a simulator implemented in Unity by Tawn Kramer from the “Donkey Car” community (the Unity simulator for the self-driving task can be found at https://github.com/tawnkramer/sdsandbox). We demonstrate that: 1) in practice the meta-MDP, $M_{\text{meta}}$, can be solved using existing reinforcement learning methods, 2) the exploration policy learned by the advisor improves performance on existing RL methods, on average, and 3) the exploration policy learned by the advisor differs from the optimal exploitation policy for any task $c \in \mathcal{C}$, i.e., the exploration policy learned by the advisor is not necessarily a good exploitation policy. To that end, we first study the behavior of our method in two simple problem classes with discrete action spaces, pole-balancing [23] and animat [25], and then in a more realistic application of control tuning in self-driving vehicles.
As a baseline meta-learning method against which we contrast our framework, we chose the recently proposed technique called Model-Agnostic Meta-Learning (MAML) [5], a general meta-learning method for adapting previously trained neural networks to novel but related tasks. It is worth noting that, although the method was not specifically designed for RL, the authors describe some promising results in adapting behavior learned from previous tasks to novel ones.
6.1 Pole Balancing Problem Class
In our first experiments on discrete action sets, we used variants of the standard pole-balancing (cart-pole) problem class. The agent is tasked with applying force to a cart to prevent a pole balancing on it from falling. The distinct tasks were constructed by modifying the length and mass of the pole, the mass of the cart, and the force magnitude. States are represented by 4-D vectors describing the position and velocity of the cart, and the angle and angular velocity of the pendulum, i.e., $s = (x, v, \theta, \dot{\theta})$. The agent has 2 actions at its disposal: apply a force in the positive or negative direction.
Figure 2(a) contrasts the cumulative return of an agent using the advisor against random exploration during training over 6 tasks, shown in blue and red respectively. Both policies, $\pi$ and $\mu$, were trained using REINFORCE: $\pi$ for episodes and $\mu$ for iterations. In the figure, the horizontal axis corresponds to episodes for the advisor. The horizontal red line denotes an estimate (with standard error bar) of the expected cumulative reward over an agent’s lifetime if it samples actions uniformly when exploring. Notice that this is not a function of the training iteration, as the random exploration is not updated. The blue curve (with standard error bars from 15 trials) shows how the expected cumulative reward the agent obtains during its lifetime changes as the advisor improves its policy. By the end of the plot, the agent is obtaining roughly more reward during its lifetime than it was when using a random exploration. To visualize this difference, Figure 2(b) shows the mean learning curves (episodes of an agent’s lifetime on the horizontal axis and average return for each episode on the vertical axis) during the first and last 50 iterations. The mean cumulative rewards were and respectively.
6.2 Animat Problem Class
The following set of experiments was conducted in the animat problem class. In these environments, the agent is a circular creature that lives in a continuous state space. It has 8 independent actuators, angled around it in increments of 45 degrees. Each actuator can be either on or off at each time step, so the action set is $\{0, 1\}^8$, for a total of 256 actions. When an actuator is on, it produces a small force in the direction that it is pointing. The resulting action moves the agent in the direction that results from the sum of all those forces and is perturbed by zero-mean, unit-variance Gaussian noise. The agent is tasked with moving to a goal location; it receives a reward of at each time step and a reward of +100 at the goal state. The different variations of the tasks correspond to randomized start and goal positions in different environments.
Figure 2(c) shows a clear performance improvement on average as the advisor improves its policy over 50 training iterations. The curves show the average learning curves obtained over the first and last 10 iterations of training the advisor, shown in blue and orange respectively. Each individual task was trained for episodes.
An interesting pattern that is shared across all variations of this problem class is that there are actuator combinations that are not useful for reaching the goal. For example, activating actuators at opposite angles would leave the agent in the same position it was before (ignoring the effect of the noise). The presence of these poor-performing actions provides some common patterns that can be leveraged. To test our intuition that an exploration policy would exploit the presence of poor-performing actions, we recorded the frequency with which they were executed on unseen testing tasks when using the learned exploration policy after training and when using a random exploration strategy, over 5 different tasks. Figure 2(d) helps explain the improvement in performance. It depicts, on the y-axis, the percentage of times these poor-performing actions were selected at a given episode, and on the x-axis the agent episode number in the current task. The agent using the advisor policy (blue) is encouraged to reduce the selection of known poor-performing actions, compared to a random action-selection exploration strategy (red).
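The structure of the animat action set, and the cancellation of opposite actuators noted above, can be checked with a short sketch. The unit force magnitude, the omission of noise, and the names `net_displacement` and `no_net_force` are assumptions for illustration; treating zero-net-force actions as the poor-performing ones is only one possible reading of the text.

```python
import math
from itertools import product


def net_displacement(action, force=1.0):
    """Sum the force vectors of the active actuators (noise omitted)."""
    dx = dy = 0.0
    for k, on in enumerate(action):
        if on:
            angle = math.radians(45 * k)  # actuator k points at 45*k degrees
            dx += force * math.cos(angle)
            dy += force * math.sin(angle)
    return dx, dy


# Every on/off combination of the 8 actuators is one action.
actions = list(product((0, 1), repeat=8))

# Actions whose forces cancel out, leaving the agent (noise aside) in place.
no_net_force = [a for a in actions if math.hypot(*net_displacement(a)) < 1e-9]
```

An advisor that learns to avoid the actions in `no_net_force` wastes fewer exploratory steps, regardless of where the start and goal positions are placed in a particular task.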
6.3 Driving Problem Class
As a more realistic application, to which we made reference earlier, we tested the advisor on a control problem using a self-driving car simulator implemented in Unity. We assume that the agent has a constant acceleration (up to some maximum velocity) and the actions consist of 15 possible steering angles between angles $\theta_{\min}$ and $\theta_{\max}$. The state is represented as a stack of the last four 80x80 images sensed by a front-facing camera, and the tasks vary in the body mass, $m$, of the car and the values of $\theta_{\min}$ and $\theta_{\max}$. We tested the ability of the advisor to improve fine-tuning of controls for specific cars. We first learned a well-performing policy for one car and used the policy as a starting point to fine-tune policies for 8 different cars.
The experiment, depicted in Figure 3, compares an agent that is able to use an advisor during exploration for fine-tuning (blue) with an agent that does not have access to an advisor (red). The figure shows the number of episodes of fine-tuning needed to reach a predefined performance threshold ( time steps without leaving the correct lane). The first and second groups in the figure show the average number of episodes needed to fine-tune in the first and second half of tasks, respectively. In the first half of tasks (left), the advisor seems to make fine-tuning more difficult since it has not been trained to deal with this specific problem. Using the advisor took an average of 42 episodes to fine-tune, while it took on average 12 episodes to fine-tune without it. The benefit, however, can be seen in the second half of training tasks. Once the advisor had been trained, it took on average 5 episodes to fine-tune, while not using the advisor required an average of 18 episodes to reach the required performance threshold. When the number of tasks is large enough and each episode is a time-consuming or costly process, our framework could result in important time and cost savings.
6.4 Is an Exploration Policy Simply a General Exploitation Policy?
One might be tempted to think that the learned policy for exploration might simply be a policy that works well in general. How do we know that the advisor is learning a policy for exploration and not simply a policy for exploitation? To answer this question, we generated three distinct unseen tasks for the pole-balancing and animat problem classes and compared the performance of using only the learned exploration policy with the performance obtained by an exploitation policy trained to solve each specific task.
Figure 4 shows two bar charts contrasting the performance of the exploration policy (blue) and the exploitation policy (green) on each task variation. In both charts, the first three groups of bars correspond to the performance on each task and the last one to an average over all tasks. Figure 4(a) corresponds to the mean performance on pole-balancing and the error bars to the standard deviation; the y-axis denotes the return obtained. We can see that, as expected, the exploration policy by itself fails to achieve a comparable performance to a task-specific policy. The same occurs with the animat problem class, shown in Figure 4(b). In this case, the y-axis refers to the number of steps needed to reach the goal (smaller bars are better). In all cases, a task-specific policy performs significantly better than the learned exploration policy, indicating that the exploration policy is not a general exploitation policy.
6.5 Performance Evaluation on Novel Tasks
In this section we examine the performance of our framework on novel tasks when learning from scratch, and contrast our method to MAML trained using PPO. In the case of discrete action sets, we trained each task for 500 episodes and compare the performance of an agent trained with REINFORCE (R) and PPO, with and without an advisor. In the case of continuous tasks, we restrict our experiments to an agent using PPO after training for 500 episodes. In our experiments we set the initial value of $\epsilon$ to , and decreased it by a factor of every episode. The results shown in Table 1 were obtained by training 5 times in 5 novel tasks and recording the average performance and standard deviations. The table displays the mean of those averages and the mean of the standard deviations recorded. The problem classes “pole-balance (d)” and “animat” correspond to discrete action spaces, while “pole-balance (c)”, “hopper”, and “ant” are continuous.
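A per-episode multiplicative decay of the exploration rate, as described above, can be sketched as follows. The function name `epsilon_schedule` is hypothetical, and the values 0.8 and 0.995 in the usage line are placeholders rather than the settings used in the experiments.

```python
def epsilon_schedule(eps0, decay, episodes):
    """Exploration rate for each episode: eps0, eps0*decay, eps0*decay**2, ..."""
    return [eps0 * decay ** i for i in range(episodes)]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    schedule = epsilon_schedule(eps0=0.8, decay=0.995, episodes=500)
    print(schedule[0], schedule[-1])
```

Because the rate shrinks geometrically, the advisor's actions are executed mostly in the early episodes of a task, which is where a good exploration policy has the most influence on what the agent learns.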
In the discrete case, MAML showed a clear improvement over starting from a random initial policy. However, using the advisor with PPO resulted in a clear improvement in pole-balancing and, in the case of animat, training the advisor with REINFORCE led to an almost improvement over MAML. In the case of continuous control, the first test corresponds to a continuous version of pole-balancing. The second and third sets of tasks correspond to the “Hopper” and “Ant” problem classes, where the task variations were obtained by modifying the length and size of the limbs and body. In all continuous control tasks, using the advisor and MAML led to a clear improvement in performance in the allotted time. In the case of pole-balancing, using the advisor led the agent to accumulate almost twice as much reward as MAML, and in the case of Hopper, the advisor led to accumulating 4 times the reward. On the other hand, MAML achieved a higher average return in the Ant problem class, but showed high variance. An important takeaway from these results is that in all cases, using the advisor resulted in a clear improvement in performance over a limited number of episodes. This does not necessarily mean that the agent can reach a better policy over an arbitrarily long period of time, but rather that it is able to reach a certain performance level much quicker.
Problem Class  R  R+Advisor  PPO  PPO+Advisor  MAML
Pole-balance (d)  
Animat  
Pole-balance (c)  —  —  
Hopper  —  —  
Ant  —  — 
7 Conclusion
In this work we developed a framework for leveraging experience to guide an agent’s exploration in novel tasks, where the advisor learns the exploration policy used by the agent solving a task. We showed that a few sample tasks can be used to learn an exploration policy that the agent can use to improve the speed of learning on novel tasks. A takeaway from this work is that oftentimes an agent solving a new task may have had experience with similar problems, and that experience can be leveraged. One way to do that is to learn a better approach for exploring in the face of uncertainty. A natural future direction from this work is to use past experience to identify when exploration is needed and not just what action to take when exploring.
References
 [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. CoRR, abs/1606.04474, 2016.

 [2] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, pages 263–272, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 [4] Fernando Fernandez and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’06, pages 720–727, New York, NY, USA, 2006. ACM.
 [5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
 [6] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT’11, pages 174–188, Berlin, Heidelberg, 2011. Springer-Verlag.
 [7] Sandeep Goel and Manfred Huber. Subgoal discovery for hierarchical reinforcement learning using learned policies. In Ingrid Russell and Susan M. Haller, editors, FLAIRS Conference, pages 346–350. AAAI Press, 2003.
 [8] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. CoRR, abs/1605.09674, 2016.
 [9] William Koch, Renato Mancuso, Richard West, and Azer Bestavros. Reinforcement learning for UAV attitude control. CoRR, abs/1804.04154, 2018.
 [10] Romain Laroche, Mehdi Fatemi, Harm van Seijen, and Joshua Romoff. Multi-advisor reinforcement learning. April 2017.
 [11] Bingyao Liu, Satinder P. Singh, Richard L. Lewis, and Shiyin Qin. Optimal rewards in multiagent teams. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics, ICDL-EPIROB 2012, San Diego, CA, USA, November 7-9, 2012, pages 1–8, 2012.
 [12] Marlos C. Machado, Marc G. Bellemare, and Michael H. Bowling. A laplacian framework for option discovery in reinforcement learning. CoRR, abs/1703.00956, 2017.
 [13] Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 553–560, New York, NY, USA, 2005. ACM.
 [14] Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. CoRR, abs/1706.08090, 2017.
 [15] Amy Mcgovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proc. of ICML, 2001.
 [16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
 [17] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. CoRR, abs/1705.05363, 2017.
 [18] V. V. Phansalkar and M. A. L. Thathachar. Local and global optimization algorithms for generalized learning automata. Neural Comput., 7(5):950–973, September 1995.
 [19] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. CoRR, abs/1711.01239, 2017.
 [20] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 [21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [22] Alexander L. Strehl. Probably approximately correct (pac) exploration in reinforcement learning. In ISAIM, 2008.
 [23] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
 [24] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. CoRR, abs/1611.04717, 2016.
 [25] Philip S. Thomas and Andrew G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 137–144, 2011.
 [26] Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. June 2017.
 [27] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine Learning, pages 279–292, 1992.
 [28] Steven D. Whitehead. Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop (ML91), Northwestern University, Evanston, Illinois, USA, pages 363–367, 1991.
 [29] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.
 [30] Q. Yang and S. Jagannathan. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):377–390, April 2012.
 [31] Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. CoRR, abs/1509.06791, 2015.
8 Appendix A  Proof of Theorem 1
To show that an optimal policy of is an optimal exploration policy, we will first establish the following equality to help us in our derivation.
Lemma 1.
Proof.
Recall that given task , episode , advisor timestep , the agent state corresponds to the state of the advisor . Continuing with our derivation.
∎
Theorem 2.
An optimal policy for is an optimal exploration policy, .
Proof.
To show that an optimal policy of is an optimal exploration policy as defined in this paper, it is sufficient to show that maximizing the return in the metaMDP is equivalent to maximizing the expected performance. That is, .
∎
9 Appendix B  Pseudocode
This section presents pseudocode for the PPO implementation of the proposed metaMDP framework, which was omitted from the paper for space considerations.
Pseudocode for a PPO implementation of both agent and advisor is given in Algorithm 2. PPO maintains two parameterized policies for an agent, and . The algorithm runs for timesteps and computes the generalized advantage estimates (GAE), [20], , where and .
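As an illustration of the advantage computation mentioned above, the following is a minimal sketch of generalized advantage estimation in its standard form from [20], not the implementation used in our experiments. The function name compute_gae, the argument layout (a list of values with one extra bootstrap entry), and the default discount and decay values are assumptions made for this example.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over one trajectory (illustrative sketch).

    rewards: list of T rewards.
    values:  list of T + 1 value estimates; the last entry is the
             bootstrap value of the state reached after the final step.
    gamma, lam: discount and GAE decay (default values are assumptions).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each advantage reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With lam set to 0 the estimate reduces to the one-step temporal-difference residual, and with lam set to 1 it reduces to the discounted return minus the value baseline.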
PPO seeks to maximize the following objective for timestep :
(2)  
where , and is an estimate of the value for state . The updates are done in minibatches that are stored in a buffer of collected samples.
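For reference, the clipped policy term of the standard PPO objective from [21] can be sketched as follows. This is a hedged illustration rather than our exact objective: the function name clipped_surrogate, the use of log-probabilities as inputs, and the clipping value 0.2 are assumptions, and any value-function term is omitted.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a minibatch (illustrative sketch).

    logp_new:   log-probabilities of the taken actions under the policy
                being optimized.
    logp_old:   log-probabilities of the same actions under the policy
                that collected the samples.
    advantages: advantage estimates for the same timesteps.
    clip_eps:   clipping range (the value 0.2 is an assumption).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Taking the minimum removes the incentive to move the probability ratio outside the clipping range when doing so would increase the objective.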
To train the agent and advisor with PPO we defined two separate sets of policies: and for the advisor, and and for the agent. The agent collects samples of trajectories of length to update its policy, while the advisor collects trajectories of length , where . So, (the objective of the agent) is computed with samples while (the objective of the advisor) is computed with samples. In our experiments, setting seemed to give the best results.
Notice that the presence of a buffer to store samples means that the advisor will be storing samples obtained from many different tasks, which prevents it from overfitting to one particular problem.
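The interaction between the two buffers can be sketched as below. This is a simplified illustration under stated assumptions, not Algorithm 2: the names RolloutBuffer and run_lifetimes are hypothetical, samples are placeholder (task, step) pairs, every step is stored for both learners, and the advisor horizon is assumed to be larger than the agent horizon. The point of the sketch is only that the agent buffer is reset for each task while the advisor buffer persists across tasks.

```python
class RolloutBuffer:
    """Fixed-horizon sample buffer (hypothetical helper for this sketch)."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.samples = []

    def add(self, sample):
        """Store a sample and report whether the buffer is full."""
        self.samples.append(sample)
        return len(self.samples) >= self.horizon

    def drain(self):
        """Return the stored samples and empty the buffer."""
        batch, self.samples = self.samples, []
        return batch


def run_lifetimes(tasks, steps_per_task, agent_horizon, advisor_horizon):
    """Count agent updates and collect advisor batches across tasks."""
    agent_updates = 0
    advisor_batches = []
    # The advisor buffer lives across tasks (one lifetime per task).
    advisor_buffer = RolloutBuffer(advisor_horizon)
    for task_id in tasks:
        # A new agent starts each task with an empty buffer.
        agent_buffer = RolloutBuffer(agent_horizon)
        for step in range(steps_per_task):
            sample = (task_id, step)  # placeholder for a real transition
            if agent_buffer.add(sample):
                agent_buffer.drain()  # an agent policy update would go here
                agent_updates += 1
            if advisor_buffer.add(sample):
                # An advisor policy update would use this batch.
                advisor_batches.append(advisor_buffer.drain())
    return agent_updates, advisor_batches
```

For example, with three tasks of four steps each, an agent horizon of 2 and an advisor horizon of 6, each advisor batch contains samples from two different tasks.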
10 Appendix C  Task Variations
This section describes the task variations of each problem class used in the experiments.
Polebalancing: Task variations in this problem class were obtained by changing the mass of the cart and the mass and length of the pole.
Animat: Task variations in this problem class were obtained by randomly sampling new environments and changing the start and goal locations of the animat.
Driving Task: Task variations in this problem class were obtained by changing the mass of the car and its turning radius. A decrease in body mass and an increase in turning radius cause the car to drift more and become more unstable when taking sharp turns.
Hopper: Task variations for Hopper consisted of changing the length of the limbs, causing policies learned for other Hopper tasks to behave erratically.
Ant: Task variations for Ant consisted of changing the length of the limbs of the ant and the size of its body.
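As an illustration of how such variations might be generated, the following sketch samples polebalancing tasks by drawing the three physical parameters independently. The class name PoleBalanceTask, the function sample_polebalance_task, the uniform distribution, and all numeric ranges are assumptions made for this example and are not the values used in our experiments.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PoleBalanceTask:
    """Physical parameters that define one polebalancing variation."""
    cart_mass: float
    pole_mass: float
    pole_length: float


def sample_polebalance_task(
    rng,
    cart_mass_range=(0.5, 2.0),     # hypothetical range
    pole_mass_range=(0.05, 0.5),    # hypothetical range
    pole_length_range=(0.25, 1.0),  # hypothetical range
):
    """Draw one task variation uniformly from the given ranges."""
    return PoleBalanceTask(
        cart_mass=rng.uniform(*cart_mass_range),
        pole_mass=rng.uniform(*pole_mass_range),
        pole_length=rng.uniform(*pole_length_range),
    )


# Usage: a seeded generator gives a reproducible sequence of tasks.
rng = random.Random(0)
training_tasks = [sample_polebalance_task(rng) for _ in range(5)]
```

The same pattern applies to the other problem classes by replacing the sampled fields, for example limb lengths for Hopper and Ant.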