1 Related Work
There is a long history of work on designing rewards to accelerate learning in reinforcement learning (RL). Reward shaping aims to design task-specific rewards that steer the agent towards known optimal behaviours, typically requiring domain knowledge. Both the benefits (randlov1998learning; ng1999policy; harutyunyan2015expressing) and the difficulty (FaultyRewards) of task-specific reward shaping have been studied. On the other hand, many intrinsic rewards, inspired by animal behaviours, have been proposed to encourage exploration. Examples include prediction error (schmidhuber1991curious; schmidhuber1991possibility; oudeyer2007intrinsic; gordon2011reinforcement; mirolli2013functions; pathak2017curiosity), surprise (itti2006bayesian), weight change (linke2019adapting), and state-visitation counts (Sutton90integratedarchitectures; poupart2006analytic; strehl2008analysis; bellemare2016unifying; ostrovski2017count). Although these kinds of intrinsic rewards are not domain-specific, they are often not well-aligned with the task the agent is trying to solve, and they ignore their effect on the agent’s learning dynamics. In contrast, our work aims to learn, from data, intrinsic rewards that take the agent’s learning dynamics into account, without requiring prior knowledge from a human.
Rewards Learned from Data
There have been a few attempts to learn useful intrinsic rewards from data. The optimal reward framework (singh2009rewards) proposed to learn, via random search, an optimal reward function that allows agents to solve a distribution of tasks quickly. We revisit this problem in this paper and propose a more scalable gradient-based approach. Although there has been follow-up work (sorg2010reward; guo2016deep) that uses gradient-based methods, it considers a non-parametric policy using Monte-Carlo Tree Search (MCTS). Our work is closely related to LIRPG (zheng2018learning), which proposed a meta-gradient method to learn intrinsic rewards. However, LIRPG considers a single task in a single lifetime with a myopic episode-return objective, which is limited in that it allows neither exploration across episodes nor generalisation to different agents.
Meta-learning for Exploration
Meta-learning (schmidhuber1996simple; thrun1998learning) has recently received considerable attention in RL. Recent advances include few-shot adaptation (finn2017model), few-shot imitation (finn2017one; duan2017one), model adaptation (clavera2018learning), and inverse RL (xu2019learning). In particular, our work is closely related to the prior work on meta-learning good exploration strategies (Wang2016LearningTR; duan2016rl; stadie2018importance; xu2018learning) in that both perform temporal credit assignment across episode boundaries by maximising rewards accumulated beyond an episode. Unlike the prior work that aims to learn an exploratory policy, our framework indirectly drives exploration via a reward function which can be reused by different learning agents as we show in this paper (Section 5.1).
Meta-learning of Agent Update
There have been a few studies that directly meta-learn how to update the agent’s parameters via meta-parameters, including the discount factor and returns (xu2018meta), auxiliary tasks (schlegel2018discovery; veeriah2019discovery), unsupervised learning rules (metz2019meta), and RL objectives (bechtle2019meta). Our work also belongs to this category in that our meta-parameters define the reward function used in the agent’s update. In particular, our multi-lifetime formulation is similar to ML$^3$ (bechtle2019meta). However, we consider the long-term lifetime return for cross-episode temporal credit assignment, as opposed to the myopic episodic objective of ML$^3$.
2 The Optimal Reward Problem
We first introduce some terminology.
Agent: A learning system interacting with an environment. On each step the agent selects an action and receives from the environment an observation and an extrinsic reward defined by a task $\mathcal{T}$. The agent chooses actions based on a policy $\pi_\theta$ parameterised by $\theta$.
Episode: A finite sequence of agent-environment interactions until the end of the episode, as defined by the task. The episode return is $G^{\text{ep}}_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$, where $\gamma \in [0, 1]$ is a discount factor, $r_t$ is the extrinsic reward at step $t$, and the random variable $T$ gives the number of steps until the end of the episode.
Lifetime: A finite sequence of agent-environment interactions until the end of training, as defined by an agent-designer, which can span multiple episodes. The lifetime return is $G^{\text{life}}_t = \sum_{k=0}^{T^{\text{life}}-t} \gamma_{\text{life}}^k r_{t+k}$, where $\gamma_{\text{life}} \in [0, 1]$ is a discount factor, and $T^{\text{life}}$ is the number of steps in the lifetime.
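The distinction between the two returns can be made concrete with a minimal sketch (plain Python; the function names are ours, not the paper’s). The episode return discounts within one episode, while the lifetime return applies a single discount across all episode boundaries:

```python
def episode_return(rewards, gamma):
    """Discounted return over a single episode: sum_k gamma^k * r_{t+k}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def lifetime_return(episodes, gamma_life):
    """Discounted return over a whole lifetime (a list of episodes),
    accumulated across episode boundaries with one lifetime discount."""
    flat = [r for ep in episodes for r in ep]
    return episode_return(flat, gamma_life)
```

For example, `episode_return([1.0, 0.0, 1.0], 0.5)` gives 1.25, while a lifetime of two one-step episodes with `gamma_life=1.0` simply sums all rewards.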
Intrinsic reward: A reward function $r_\eta$ parameterised by $\eta$, whose input at time $t$ is the lifetime history $\tau_t = (s_0, a_0, r_0, d_0, \dots, s_t)$, including the (binary) episode terminations $d_i$.
The Optimal Reward Problem (singh2010intrinsically), illustrated in Figure 1, aims to learn the parameters of the intrinsic reward such that the resulting rewards achieve a learning dynamic for an RL agent that maximises the lifetime (extrinsic) return on tasks drawn from some distribution. Formally, the optimal reward function is defined as:
$$\eta^* = \arg\max_\eta \; \mathbb{E}_{\theta_0 \sim \Theta,\, \mathcal{T} \sim p(\mathcal{T})} \; \mathbb{E}_{\tau \sim p(\tau \mid \theta_0, \mathcal{T}, \eta)} \left[ G^{\text{life}} \right],$$
where $\Theta$ and $p(\mathcal{T})$ are an initial policy distribution and a distribution over possibly non-stationary tasks, respectively. The likelihood of a lifetime history is $p(\tau \mid \theta_0, \mathcal{T}, \eta) = p(s_0) \prod_{t} \pi_{\theta_t}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, where $\theta_t$ is the policy parameter as updated with update function $f$, which is policy gradient in this paper. (We assume that the policy parameter is updated after each time-step throughout the paper for brevity; in practice, the parameter can be updated less frequently.) Note that the optimisation of $\eta$ spans multiple lifetimes, each of which can span multiple episodes.
Using the lifetime return as the objective, instead of the conventional episodic return, allows exploration across multiple episodes as long as the lifetime return is maximised in the long run. In particular, when the lifetime is defined as a fixed number of episodes, we find that the lifetime return objective is sometimes more beneficial than the episodic return objective, even when performance is measured by episode return. However, different objectives (e.g., the final episode return) can be considered, depending on the definition of what makes a good reward function.
3 Meta-Learning Intrinsic Reward
We propose a meta-gradient approach (xu2018meta; zheng2018learning) to solve the optimal reward problem. At a high level, we sample a new task and a new random policy parameter at each lifetime iteration. We then simulate an agent’s lifetime by updating the parameter using an intrinsic reward function (Section 3.1) with policy gradient (Section 3.2). Concurrently, we compute the meta-gradient, taking into account the effect of the intrinsic rewards on the policy parameters, and use it to update the intrinsic reward function with the help of a lifetime value function (Section 3.3). Algorithm 1 gives an overview of our method; the following sections describe the details.
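The high-level loop described above can be sketched as follows. This is a toy stand-in, not the authors’ implementation: the function names, signatures, and the scalar stand-ins for parameters are all illustrative assumptions.

```python
def train_intrinsic_reward(num_lifetimes, episodes_per_lifetime,
                           sample_task, init_policy, inner_update,
                           outer_update, eta):
    """Sketch of the meta-training loop: each lifetime trains a fresh agent
    with the current intrinsic reward parameters eta, after which eta is
    meta-updated from the lifetime's experience."""
    for _ in range(num_lifetimes):
        task = sample_task()          # new task per lifetime
        theta = init_policy()         # new random policy per lifetime
        history = []
        for _ in range(episodes_per_lifetime):
            # inner loop: policy gradient on *intrinsic* rewards
            theta, episode = inner_update(theta, eta, task)
            history.append(episode)
        # outer loop: meta-gradient on the lifetime (extrinsic) return
        eta = outer_update(eta, history)
    return eta
```

The key structural point is that `theta` is discarded at the end of each lifetime, while `eta` persists across lifetimes.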
3.1 Intrinsic Reward And Lifetime Value Function Architectures
The intrinsic reward function is a recurrent neural network (RNN) parameterised by $\eta$, which produces a scalar reward on arriving in a state by taking into account the history of the agent’s lifetime $\tau_t$. We claim that giving the lifetime history across episodes as input is crucial for balancing exploration and exploitation, for instance by capturing how frequently a certain state has been visited in order to determine an exploration bonus. The lifetime value function is a separate recurrent neural network parameterised by $\phi$, which takes the same inputs as the intrinsic reward function and produces a scalar estimate of the expected future return within the lifetime.
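As an illustration of this architecture, here is a minimal recurrent reward sketch in plain Python. It is a stand-in for the paper’s LSTM: the weights, the scalar input encoding, and the tanh recurrence are invented for illustration only.

```python
import math

class RecurrentReward:
    """Minimal recurrent intrinsic-reward sketch: a hidden state summarises
    the lifetime history (carried across episode boundaries), and a linear
    head emits a scalar reward."""

    def __init__(self, w_in=0.5, w_h=0.9, w_out=1.0):
        self.w_in, self.w_h, self.w_out = w_in, w_h, w_out
        self.h = 0.0  # lifetime hidden state; NOT reset between episodes

    def step(self, obs, extrinsic_reward, done):
        # the input combines observation features, the extrinsic reward,
        # and the (binary) episode-termination flag
        x = obs + extrinsic_reward + float(done)
        self.h = math.tanh(self.w_in * x + self.w_h * self.h)
        return self.w_out * self.h  # scalar intrinsic reward
```

Because `self.h` is never reset on `done`, the emitted reward can depend on what happened in earlier episodes of the same lifetime.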
3.2 Policy Update ($\theta$)
Each agent interacts with an environment and a task sampled from the distribution $p(\mathcal{T})$. However, instead of directly maximising the extrinsic rewards defined by the task, the agent maximises the intrinsic rewards ($r_\eta$) by using policy gradient (williams1992simple; sutton2000policy):
$$\theta_{t+1} = \theta_t + \alpha \, G^{\text{ep}}_t(\eta) \, \nabla_{\theta_t} \log \pi_{\theta_t}(a_t \mid s_t),$$
where $r_\eta(\tau_t)$ is the intrinsic reward at time $t$, $\alpha$ is a learning rate, and $G^{\text{ep}}_t(\eta) = \sum_{k=0}^{T-t} \gamma^k r_\eta(\tau_{t+k})$ is the return of the intrinsic rewards accumulated over an episode with discount factor $\gamma$.
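The policy update above can be sketched as a tabular REINFORCE step. This is an illustrative stand-in (the paper uses neural-network policies); the two-action softmax parameterisation is an assumption of the sketch.

```python
import math

def reinforce_step(theta, trajectory, alpha, gamma):
    """One REINFORCE update on *intrinsic* rewards: ascend the gradient of
    log-probabilities weighted by the intrinsic return G_t(eta).
    `theta` holds logits of a softmax policy over len(theta) actions;
    `trajectory` is a list of (action, intrinsic_reward) pairs."""
    # intrinsic returns G_t(eta), computed backwards through the episode
    returns, g = [], 0.0
    for _, r_in in reversed(trajectory):
        g = r_in + gamma * g
        returns.append(g)
    returns.reverse()
    for (a, _), g_t in zip(trajectory, returns):
        z = sum(math.exp(t) for t in theta)
        probs = [math.exp(t) / z for t in theta]
        for i in range(len(theta)):
            # grad of log pi(a) w.r.t. logit i is 1{i==a} - pi(i)
            theta[i] += alpha * g_t * ((1.0 if i == a else 0.0) - probs[i])
    return theta
```

A positive intrinsic return on an action raises that action’s logit and lowers the others, exactly as the update rule prescribes.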
3.3 Intrinsic Reward ($\eta$) and Lifetime Value Function ($\phi$) Update
The chain rule is used to obtain the meta-gradient ($\nabla_\eta J$), as in previous work (zheng2018learning). The computation graph of this procedure is illustrated in Figure 1.
Computing the true meta-gradient in Equation 3 requires backpropagation through the entire lifetime, which is infeasible as each lifetime can involve thousands of policy updates. To partially address this issue, we truncate the meta-gradient after $n$ policy updates but approximate the lifetime return using a lifetime value function $V_\phi$ parameterised by $\phi$, which is learned by temporal-difference learning from an $n$-step trajectory:
$$\phi \leftarrow \phi + \beta \left( \sum_{k=0}^{n-1} \gamma_{\text{life}}^k r_{t+k} + \gamma_{\text{life}}^n V_\phi(\tau_{t+n}) - V_\phi(\tau_t) \right) \nabla_\phi V_\phi(\tau_t),$$
where $\beta$ is a learning rate. In our empirical work, we found that the lifetime value estimates were crucial for allowing the intrinsic reward to perform long-term credit assignment across episodes (Section 4.5).
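The n-step bootstrapped target and the semi-gradient update can be sketched for a toy linear value function (illustrative only; the lifetime value function in the paper is an LSTM, and the scalar feature here is an invented stand-in):

```python
def td_update(v, feature, rewards, bootstrap, beta, gamma_life):
    """One semi-gradient n-step TD update for a linear value function
    V_phi(tau) = v * feature. The target is
    r_t + gamma*r_{t+1} + ... + gamma^n * V_phi(tau_{t+n})."""
    target = bootstrap
    for r in reversed(rewards):           # fold in n rewards, discounted
        target = r + gamma_life * target
    prediction = v * feature
    # semi-gradient step: the bootstrap term is treated as a constant
    return v + beta * (target - prediction) * feature
```

Starting from `v = 0` with two undiscounted rewards of 1 and no bootstrap, one update with `beta = 0.5` moves the weight halfway to the target of 2.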
4 Empirical Investigations: Feasibility and Usefulness
We present the results from our empirical investigations in two sections. For the results in this section, the experiments and domains are designed to answer the following research questions:
What kind of knowledge is learned by the intrinsic reward?
How does the distribution of tasks influence the intrinsic reward?
What is the benefit of the lifetime return objective over the episode return?
When is it important to provide the lifetime history as input to the intrinsic reward?
We investigate these research questions in the grid-world domains illustrated in Figure 2. For each domain, we trained an intrinsic reward function across many lifetimes and evaluated it by training an agent using the learned reward. We implemented the following baselines.
Extrinsic-EP: A policy is trained with extrinsic rewards to maximise the episode return.
Extrinsic-LIFE: A policy is trained with extrinsic rewards to maximise the lifetime return.
Count-based (strehl2008analysis): A policy is trained with extrinsic rewards and count-based exploration bonus rewards.
ICM (pathak2017curiosity): A policy is trained with extrinsic rewards and curiosity rewards based on an inverse dynamics model.
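The count-based baseline above can be sketched as follows. This uses a common bonus form, $\beta / \sqrt{N(s)}$ with visit count $N(s)$, in the spirit of strehl2008analysis; the coefficient `beta` and the per-state (rather than per-state-action) counting are assumptions of the sketch.

```python
from collections import defaultdict

class CountBonus:
    """Count-based exploration bonus: beta / sqrt(N(s)), where N(s) is the
    number of times state s has been visited so far."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state):
        self.counts[state] += 1
        return self.beta / self.counts[state] ** 0.5
```

The bonus is added to the extrinsic reward at each step, so novel states are worth more than frequently visited ones.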
Note that these baselines, unlike the learned intrinsic rewards, do not transfer any knowledge across different lifetimes. Throughout Sections 4.1-4.4, we focus on analysing what kind of knowledge is learned by the intrinsic reward depending on the nature of the environment. We discuss the benefit of using the lifetime return and of conditioning on the lifetime history when learning the intrinsic reward in Section 4.5. The details of implementation and hyperparameters are described in Appendix B.
4.1 Exploring Uncertain States
We designed ‘Empty Rooms’ (Figure 1(a)) to test whether the intrinsic reward can learn to encourage exploration of uncertain states, as novelty-based exploration methods do. The goal is to visit an invisible goal location, which is fixed within each lifetime but varies across lifetimes. An episode terminates when the goal is reached, and each lifetime consists of a fixed number of episodes. From the agent’s perspective, its policy should visit the locations suggested by the intrinsic reward. From the intrinsic reward’s perspective, it should encourage the agent to visit unvisited locations until the goal is located, and then have the agent exploit that knowledge for the rest of its lifetime.
Figure 3 shows that our learned intrinsic reward was more efficient than extrinsic rewards and count-based exploration when training a new agent. We observed that the intrinsic reward learned two interesting strategies, as visualised in Figure 4. While the goal has not yet been found, it encourages exploration of unvisited locations, because it has learned that there exists a rewarding goal location somewhere. Once the goal is found, the intrinsic reward encourages the agent to exploit it without further exploration, because it has learned that there is only one goal. This result shows that curiosity about uncertain states can naturally emerge when many different states can be rewarding in a domain, even when the rewarding states are fixed within an agent’s lifetime.
4.2 Exploring Uncertain Objects and Avoiding Harmful Objects
In the previous domain, we considered uncertainty about where the reward (or goal location) is. We now consider uncertainty about the value of different objects. In the ‘Random ABC’ environment (see Figure 1(b)), the rewards for objects A, B, and C are sampled uniformly at random for each lifetime (each from its own fixed range) but are held fixed within the lifetime. A good intrinsic reward should learn that: 1) B should be avoided; 2) A and C have uncertain rewards and hence require systematic exploration (first go to one and then the other); and 3) once it is determined which of A and C is better, that knowledge should be exploited by encouraging the agent to repeatedly go to the better object for the rest of the lifetime.
Figure 3 shows that the agent learned a near-optimal explore-then-exploit strategy with the learned intrinsic reward. Note that, as usual in reinforcement learning, the agent itself cannot carry information about the object rewards across episodes. The intrinsic reward, however, can propagate such information across episodes and help the agent explore or exploit as appropriate. We visualised the learned intrinsic reward for different action sequences in Figure 5. The intrinsic rewards encourage the agent to explore towards A and C in the first few episodes. Once A and C have been explored, the agent exploits the more rewarding of the two. Throughout training, the agent is discouraged from visiting B through negative intrinsic rewards. These results show that avoidance of harmful objects and curiosity about uncertain objects can emerge when a domain contains objects with uncertain rewards alongside objects with fixed rewards.
4.3 Exploiting Invariant Causal Relationship
To see how the intrinsic reward deals with causal relationships between objects, we designed ‘Key-Box’, which is similar to Random ABC except that there is a key in the room (see Figure 1(c)). The agent needs to collect the key first in order to open one of the boxes (A, B, and C) and receive the corresponding reward. The rewards for the boxes are sampled from the same distribution as in Random ABC, and the key itself gives a neutral reward. Moreover, the locations of the agent, the key, and the boxes are randomly sampled for each episode. As a result, the state space is far too large to enumerate fully. Figure 3 shows that the learned intrinsic reward leads to near-optimal exploration, whereas the agent trained with extrinsic rewards did not learn to open any box. The intrinsic reward captures the fact that the key is necessary to open any box, which holds across the many lifetimes of training. This demonstrates that the intrinsic reward can capture causal relationships between objects when the domain has this kind of invariant dynamics.
4.4 Dealing with Non-stationarity
We investigated how the intrinsic reward handles non-stationary tasks within a lifetime in our ‘Non-stationary ABC’ environment. The reward for A takes one of two values, the reward for B is fixed, and the reward for C is the negative of the reward for A; the rewards for A and C are swapped at a regular interval of episodes, and each lifetime lasts a fixed number of episodes. Figure 3 shows that the agent with the learned intrinsic reward quickly recovered its performance whenever the task changed, whereas the baselines took more time to recover. Figure 6 shows how the learned intrinsic reward encourages the learning agent to react to the changing rewards. Interestingly, the intrinsic reward learned to prepare for the change by giving negative rewards to the exploitation policy of the agent a few episodes before the task changes. In other words, the intrinsic reward reduces the agent’s commitment to the currently best rewarding object, thereby increasing the entropy of the current policy in anticipation of the change and eventually making it easier to adapt quickly. This shows that the intrinsic reward can capture (regularly) repeated non-stationarity across many lifetimes and make the agent intrinsically motivated not to commit too firmly to a policy, in anticipation of changes in the environment.
4.5 Ablation Study
To study the relative benefits of the proposed technical ideas, we conducted an ablation study: 1) replacing the long-term lifetime return objective ($G^{\text{life}}$) with the episodic return ($G^{\text{ep}}$), and 2) restricting the input of the reward network to the current time-step instead of the entire lifetime history. Figure 7 shows that the lifetime history was crucial for achieving good performance. This is reasonable because all domains require some past information (e.g., object rewards in Random ABC, visited locations in Empty Rooms) to provide useful exploration strategies. It also shows that the lifetime return objective was beneficial on Random ABC, Non-stationary ABC, and Key-Box. These domains require exploration across multiple episodes in order to find the optimal policy. For example, collecting an uncertain object (e.g., object A in Random ABC) is necessary even if the episode terminates with a negative reward. The episodic value function would directly penalise such an under-performing exploratory episode when computing the meta-gradient, which prevents the intrinsic reward from learning to encourage exploration across episodes. Such behaviour can, on the other hand, be encouraged by the lifetime value function, as long as it provides useful information for maximising the lifetime return in the long term.
5 Empirical Investigations: Generalisation via Rewards
As noted above, rewards capture knowledge about what an agent’s goals should be, rather than how it should behave. At the same time, transferring the latter kind of knowledge, in the form of policies, is also feasible in the domains presented above. Here we confirm this by implementing and presenting results for the following two meta-learning methods:
MAML (finn2017model): A policy meta-learned over a distribution of tasks such that it can adapt quickly to a given task after a few parameter updates.
RL$^2$ (duan2016rl; Wang2016LearningTR): An LSTM policy unrolled over the entire lifetime to maximise the lifetime return, pre-trained on a distribution of tasks.
Although all the methods we implemented, including ours, are designed to learn useful knowledge from a distribution of tasks, they have different objectives. Specifically, the objective of our method is to learn knowledge that is useful for training randomly-initialised policies by capturing “what to do”, whereas the goal of policy transfer methods is to directly transfer a useful policy for fast task adaptation, i.e., to transfer “how to do” knowledge. In fact, given a new task it can be more efficient to transfer and reuse a pre-trained policy than to restart from a random policy and learn using the learned rewards. Figure 8 indeed shows that RL$^2$ performs better than our intrinsic reward approach. It also shows that MAML and RL$^2$ achieve good performance from the beginning, as they have already learned how to navigate the grid worlds and how to achieve the goals of the tasks. With our method, on the other hand, the agent starts from a random policy and relies on the learned intrinsic reward, which indirectly tells it what to do. Nevertheless, our method outperforms MAML and achieves asymptotic performance comparable to RL$^2$.
5.1 Generalisation to Different Agent-Environment Interfaces
In fact, our method can be interpreted as an instance of RL$^2$ with a particular decomposition of parameters ($\theta$ and $\eta$) that uses policy gradient as the recurrent update (see Figure 1). While this modular structure may not be more beneficial than RL$^2$ when evaluated with the same agent-environment interface, the decomposition provides clear semantics for each module: the policy ($\theta$) captures “how to do” while the intrinsic reward ($\eta$) captures “what to do”, and this enables interesting kinds of generalisation, as we show below. Specifically, we show that the “what” knowledge captured by the intrinsic reward can be reused by many different learning agents.
Generalisation to unseen action spaces
We first evaluated the learned intrinsic reward on new action spaces. Specifically, the intrinsic reward was used to train new agents with either 1) permuted actions, where the semantics of left/right and up/down are reversed, or 2) extended actions, with 4 additional actions that move diagonally. Figure 8(a) shows that the intrinsic reward provided useful rewards to new agents with different actions, even though it was not trained with those actions. This is possible because the intrinsic reward assigns rewards to the agent’s state changes rather than to its actions. In other words, the intrinsic reward captures “what to do”, which makes it possible to generalise to new actions as long as the goal remains the same. On the other hand, it is unclear how RL$^2$ and MAML could be generalised in this way.
Generalisation to unseen learning algorithms
We further investigated how general the knowledge captured by the intrinsic reward is by evaluating it on agents with different learning algorithms. In particular, after training the intrinsic reward with actor-critic agents, we evaluated it by training new agents with Q-learning using the learned intrinsic reward, denoted ‘Q-Intrinsic’ in Figure 8(b). Interestingly, the learned intrinsic reward turns out to be general enough to be useful for Q-learning agents, even though it was trained for actor-critic agents. Again, it is unclear how RL$^2$ and MAML could be generalised in this way.
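The ‘Q-Intrinsic’ setup above can be sketched as tabular Q-learning driven by a frozen learned reward. The interfaces `env_step(s, a) -> (s2, extrinsic_r, done)` and `intrinsic_reward(s, a, r_ext, done)` are assumptions of this sketch, not the paper’s exact API.

```python
import random
from collections import defaultdict

def q_learning_with_intrinsic(env_step, intrinsic_reward, episodes,
                              alpha=0.5, gamma=0.9, eps=0.1, actions=(0, 1)):
    """Tabular Q-learning agent trained on the outputs of a (frozen) learned
    intrinsic reward function instead of the extrinsic reward."""
    q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q[(s, a_)])
            s2, r_ext, done = env_step(s, a)
            r_in = intrinsic_reward(s, a, r_ext, done)  # learn from intrinsic reward
            best_next = 0.0 if done else max(q[(s2, a_)] for a_ in actions)
            q[(s, a)] += alpha * (r_in + gamma * best_next - q[(s, a)])
            s = s2
    return q
```

The point of the sketch is structural: the intrinsic reward function is queried at every transition but is never updated by the Q-learner.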
Comparison to policy transfer
Although it was not possible to apply the policies learned by RL$^2$ and MAML when we extended the action space or changed the learning algorithm, we can do so when we keep the same number of actions and simply permute them. As shown in Figure 8(c), both RL$^2$ and MAML generalise poorly when the action space is permuted in Random ABC, because the transferred policies are highly biased towards the original action space. Again, this result highlights the difference between the “what to do” knowledge captured by our approach and the “how to do” knowledge captured by policies.
6 Conclusion
We revisited the optimal reward problem (singh2009rewards) and proposed a more scalable, gradient-based method for learning intrinsic rewards. Through several proof-of-concept experiments, we showed that the learned non-stationary intrinsic rewards can capture regularities within a distribution of environments or, over time, within a non-stationary environment. As a result, they were capable of encouraging both exploratory and exploitative behaviour across multiple episodes. In addition, task-independent notions of intrinsic motivation such as curiosity emerged when they were effective for the distribution of tasks the agent was trained on across lifetimes. We also showed that the learned intrinsic rewards can generalise to different agent-environment interfaces, such as different action spaces and different learning algorithms, whereas policy transfer methods fail to generalise in this way. This highlights the difference between the “what” kind of knowledge captured by rewards and the “how” kind of knowledge captured by policies. The flexibility and range of knowledge captured by intrinsic rewards in our proof-of-concept experiments encourage further work towards combining different loci of knowledge to achieve greater practical benefits.
We thank Joseph Modayil for his helpful feedback on the manuscript.
Appendix A Derivation of intrinsic reward update
Following conventional RL notation, we define $V(\tau_t)$ as the state-value function that estimates the expected future lifetime return given the lifetime history $\tau_t$, the task $\mathcal{T}$, the initial policy parameters $\theta_0$, and the intrinsic reward parameters $\eta$. Specifically, $V(\tau_0)$ denotes the expected lifetime return at the starting state, i.e.,
$$V(\tau_0) = \mathbb{E}_{\tau \sim p(\tau \mid \theta_0, \mathcal{T}, \eta)} \left[ G^{\text{life}} \right],$$
where $G^{\text{life}}$ denotes the lifetime return in task $\mathcal{T}$. We also define the action-value function $Q(\tau_t, a_t)$ accordingly as the expected future lifetime return given the lifetime history $\tau_t$ and an action $a_t$.
The objective function of the optimal reward problem is defined as:
$$J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\, \mathcal{T} \sim p(\mathcal{T})} \left[ V(\tau_0) \right],$$
where $\Theta$ and $p(\mathcal{T})$ are an initial policy distribution and a task distribution, respectively.
Assuming the task $\mathcal{T}$ and the initial policy parameters $\theta_0$ are given, we omit them from the remaining equations for simplicity. Let $\pi_{\theta_t}(a \mid \tau_t)$ be the probability distribution over actions at time $t$ given the history $\tau_t$, where $\theta_t$ is the policy parameters at time $t$ in the lifetime. We can derive the meta-gradient with respect to $\eta$ as follows:
$$\nabla_\eta V(\tau_0) = \mathbb{E}_{\tau} \left[ \sum_t G^{\text{life}}_t \, \nabla_\eta \log \pi_{\theta_t}(a_t \mid \tau_t) \right], \qquad \nabla_\eta \log \pi_{\theta_t}(a_t \mid \tau_t) = \nabla_{\theta_t} \log \pi_{\theta_t}(a_t \mid \tau_t) \, \nabla_\eta \theta_t,$$
where $G^{\text{life}}_t$ is the lifetime return given the history $\tau_t$, and we assume the discount factor $\gamma_{\text{life}} = 1$ for brevity. Thus, the derivative of the overall objective is:
$$\nabla_\eta J(\eta) = \mathbb{E}_{\theta_0 \sim \Theta,\, \mathcal{T} \sim p(\mathcal{T})} \, \mathbb{E}_{\tau} \left[ \sum_t G^{\text{life}}_t \, \nabla_{\theta_t} \log \pi_{\theta_t}(a_t \mid \tau_t) \, \nabla_\eta \theta_t \right].$$
Appendix B Experimental Details
b.1 Implementation Details
We used mini-batch updates to reduce the variance of the meta-gradient estimation. Specifically, we ran multiple lifetimes in parallel, each with a randomly sampled task and randomly initialised policy parameters, and averaged the meta-gradients across lifetimes to compute the update to the intrinsic reward parameters ($\eta$). We ran a fixed number of updates to $\eta$ at training time. All hidden layers in the neural networks used ReLU as the activation function. We used an arctan activation on the output of the intrinsic reward. The hyperparameters used for each domain are described in Table 1.
| Hyperparameter | Empty Rooms | Random ABC | Key-Box | Non-stationary ABC |
| --- | --- | --- | --- | --- |
| Time limit per episode | 100 | 10 | 100 | 10 |
| Number of episodes per lifetime | 200 | 50 | 5000 | 1000 |
| Policy architecture | Conv(filters=16, kernel=3, strides=1)-FC(64) (shared) | | | |
| Policy learning rate ($\alpha$) | 0.1 | 0.1 | 0.001 | 0.1 |
| Reward architecture | Conv(filters=16, kernel=3, strides=1)-FC(64)-LSTM(64) (shared) | | | |
| Reward learning rate | 0.001 (shared) | | | |
| Lifetime VF architecture | Conv(filters=16, kernel=3, strides=1)-FC(64)-LSTM(64) (shared) | | | |
| Lifetime VF optimiser | Adam (shared) | | | |
| Lifetime VF learning rate ($\beta$) | 0.001 (shared) | | | |
| Outer unroll length ($n$) | 5 (shared) | | | |
| Inner discount factor ($\gamma$) | 0.9 (shared) | | | |
| Outer discount factor ($\gamma_{\text{life}}$) | 0.99 (shared) | | | |
b.2 Domains
We consider four task distributions, each instantiated within one of the three main gridworld domains shown in Figure 2. In all cases the agent has four actions available, corresponding to moving up, down, left, and right. However, the topology of the gridworld and the reward structure may vary.
b.2.1 Empty Rooms
Figure 1(a) shows the layout of the Empty Rooms domain, which consists of four rooms. The agent always starts at the centre of the top-left room. Exactly one cell is rewarding; this cell is called the goal and is invisible to the agent. The goal location is sampled uniformly over all cells at the beginning of each lifetime. An episode terminates when the agent reaches the goal location or when a time limit is reached. Each lifetime consists of a fixed number of episodes. The agent needs to explore all rooms to find the goal and then go to the goal in all subsequent episodes.
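For concreteness, the task above can be sketched as a toy environment. This is a simplified stand-in (a single open grid rather than four rooms, with an invented size and reward of 1 at the goal), not the authors’ implementation.

```python
import random

class EmptyRoomsLite:
    """Toy Empty-Rooms-style gridworld: an N x N grid with an invisible goal
    that is sampled once per lifetime and kept fixed across episodes."""

    MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

    def __init__(self, size=5, seed=None):
        self.size = size
        self.rng = random.Random(seed)
        # goal is fixed for the whole lifetime, hidden from the agent
        self.goal = (self.rng.randrange(size), self.rng.randrange(size))
        self.reset()

    def reset(self):
        """Start a new episode; the goal does NOT change."""
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)  # clamp to the grid
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

Because `reset` leaves `self.goal` untouched, knowledge gathered in early episodes of a lifetime remains valid in later ones, which is what the intrinsic reward exploits.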
b.2.2 ABC World
Figure 1(b) shows the layout of the ABC World domain. There is a single room with three objects (denoted A, B, and C). Each object provides a reward upon being reached. An episode terminates when the agent reaches an object or when a time limit is reached. We consider two versions of this environment: Random ABC and Non-stationary ABC. In the Random ABC environment, each lifetime consists of a fixed number of episodes. The reward associated with each object is randomly sampled for each lifetime and held fixed within the lifetime; specifically, the rewards for A, B, and C are each sampled uniformly from their own fixed range. Thus, the environment is stationary from the agent’s perspective but non-stationary from the reward function’s perspective. The optimal behaviour is to explore A and C at the beginning of a lifetime to assess which is better, and then to commit to the better one for all subsequent episodes. In the Non-stationary ABC environment, each lifetime also consists of a fixed number of episodes; the reward for B is fixed, and the rewards for A and C are swapped at a regular interval of episodes.
b.2.3 Key Box World
Figure 1(c) shows the Key Box World domain. In this domain there is a key and three boxes: A, B, and C. To open any box, the agent must first pick up the key, which itself gives a neutral reward. The rewards for boxes A, B, and C are sampled uniformly at random for each lifetime, from the same per-box ranges as in Random ABC. An episode terminates when the agent opens a box or when a time limit is reached. Each lifetime consists of a fixed number of episodes.
b.3 Hand-designed near-optimal exploration strategy for Random ABC
We hand-designed a heuristic strategy for the Random ABC domain, assuming the agent has the prior knowledge that B is always bad and that A and C have uncertain rewards. The heuristic goes to A in the first episode, goes to C in the second episode, and then goes to the better of the two in all remaining episodes of the lifetime. We view this heuristic as an upper bound because it always identifies the best object and controls the agent’s behaviour exactly.
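The heuristic above can be written down directly; the function name, the episode-indexing convention, and the `observed` dictionary are illustrative assumptions of this sketch.

```python
def abc_heuristic(episode_index, observed):
    """Hand-designed near-optimal Random ABC strategy: probe A, then C,
    then commit to the better of the two for the rest of the lifetime.
    `observed` maps object name to the reward seen for it so far."""
    if episode_index == 0:
        return "A"   # first episode: probe A
    if episode_index == 1:
        return "C"   # second episode: probe C
    # all remaining episodes: exploit the better of the two probed objects
    return "A" if observed["A"] >= observed["C"] else "C"
```

Object B is never chosen, reflecting the assumed prior knowledge that B is always bad.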