1 Introduction
Reinforcement Learning (RL) has recently attracted much attention with its success in various domains such as Atari (Mnih et al., 2015) and Go (Silver et al., 2018). However, the problem of credit assignment (Minsky, 1961) still limits its learning efficiency. It is difficult for RL agents to answer the following question: how should credit for success (or penalty for failure) be distributed among the sequence of decisions that produced the result, given naturally delayed (or even sparse) rewards? If the agent knew exactly which actions were right or wrong, RL would be no more difficult than supervised learning. Such inefficiency in credit assignment is one major reason for the unsatisfactory learning efficiency of current model-free RL methods.
Reward shaping is one of the most intuitive, popular and effective solutions to credit assignment. Its very goal is to reshape the original delayed rewards so as to properly reward or penalize intermediate actions, providing in-time credit assignment. The technique first emerged in animal training (Skinner, 1990), and was then introduced to RL (Dorigo & Colombetti, 1994; Mataric, 1994) to tackle increasingly complex problems like Doom (Wu & Tian, 2017) and Dota 2 (OpenAI, 2018). While arbitrary shaping functions can be applied directly, optimal policies are proved to remain invariant only under a certain class, namely potential-based shaping functions (Ng et al., 1999).
However, almost all reward shapings are hand-crafted and need to be carefully designed by experienced human experts (Wu & Tian, 2017; OpenAI, 2018). On one hand, coding those shaping functions in programming languages is potentially tedious and inconvenient, especially in complex large-scale environments such as Doom (Wu & Tian, 2017) and Dota 2 (OpenAI, 2018). On the other hand, humans have to theoretically justify the shaping rewards to ensure that they lead to the expected behavior rather than other local optima. Together, this makes effective reward shapings hard to design and code, and easily coded shapings usually ineffective.
Furthermore, in practice we are usually interested in solving multiple similar tasks as a whole. For example, when training an RL agent to solve 2D grid mazes, we would not train an individual agent for each maze map, but would naturally hope for one general agent for all possible mazes. The shared but not identical task structures naturally induce a distribution over tasks, which in this case is a distribution over maze configurations (Wilson et al., 2007), and could elsewhere be a distribution over system parameters (Lazaric & Ghavamzadeh, 2010) for different robot-hand sizes, or over game maps for RTS games (Jaderberg et al., 2018). The ability to quickly solve new similar tasks drawn from such distributions is a hallmark of general intelligence, mastered by human infants at quite a young age (Smith & Slone, 2017). However, the human effort in reward shaping would be further exacerbated here, since we would have to either design a different shaping per task or come up with a general task-dependent function that is presumably even harder to design.
To this end, we consider the generally hard problem of reward shaping on a distribution of tasks. Motivated by the inconvenience of reward shaping under task multiplicity, we seek a general, automatic reward-shaping mechanism that works well on the task distribution without hand-engineering by human experts. We first derive the theoretically optimal reward shaping in terms of credit assignment in model-free RL to be shaping with the optimal V-values. Noting that tasks on the same distribution share knowledge, we then propose a novel value-based algorithm built on Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a), leveraging meta-learning to extract such prior knowledge. This prior approximates the optimal potential-based shaping function (Ng et al., 1999) for each task. The meta-learned prior conducts reward shaping on newly sampled tasks either directly (zero-shot) or by adapting to the task-posterior optimum (few-shot), shaping rewards while solving the task. We provide a theoretical guarantee for the latter. Extensive experiments demonstrate the effectiveness of our reward shaping in both cases.
To summarize, our contributions are: (1) we present a first attempt to conduct general, automatic reward shaping with meta-learning on a distribution of tasks for better credit assignment and learning efficiency; (2) our framework requires only a shared state space across tasks, and can be applied either directly or adaptively on newly sampled tasks, making it quite general and flexible compared with most existing meta-learning methods and multi-task reward-shaping works; (3) we theoretically derive and analyze the optimal reward shaping (w.r.t. credit assignment based on potential functions (Ng et al., 1999)) and our shaping algorithm.
2 Preliminaries
We consider the setting of multi-task reinforcement learning (RL), where the tasks follow a distribution p(T). Each sampled task T_i ~ p(T) is a standard Markov Decision Process (MDP) M_i = (S, A_i, P_i, γ, R_i), where S is the state space, assumed to be shared by all tasks, A_i is the action space, P_i is the state transition probability, γ is the discount factor and R_i is the reward function. Here, we use the subscript i to denote that the tasks may have different action spaces A_i, different transition probabilities P_i and different reward functions R_i.
In this section, we briefly introduce the techniques on which our method is based, namely general Q-learning variants to solve individual MDPs, reward-shaping functions to accelerate learning with theoretical guarantees, and meta-learning to tackle reward shaping on task distributions.
2.1 Q-Learning
Given any MDP M, a policy π is a distribution π(a | s) over actions. The V-value V^π(s) and Q-value Q^π(s, a) are correspondingly defined for π as expected cumulative discounted rewards. The goal of standard RL on a single task is to find the optimal policy π* that gives maximal V- (and Q-) values: π* = argmax_π V^π(s) for all s.
Q-Learning (Watkins & Dayan, 1992) provides one solution to directly learn Q* and induce π* from it. Departing from earlier tabular representations, Deep Q-Network (DQN) (Mnih et al., 2015) parameterizes the Q-value with a neural network Q_θ(s, a) and minimizes the temporal difference (TD) error (Sutton & Barto, 1998) with gradient descent:

L(θ) = E_{(s,a,r,s')} [ (r + γ max_{a'} Q_θ(s', a') - Q_θ(s, a))² ],

where θ represents the parameters of the neural network. A periodic target network is usually adopted.
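As a concrete illustration, the TD objective above reduces in the tabular case to a one-line update rule; the following minimal sketch (names and constants are ours, not from the paper) performs one Q-learning step on a 2-state, 2-action table:

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, done=False):
    """One tabular Q-learning step: move Q[s, a] toward the TD target
    r + gamma * max_a' Q[s', a'] (zero bootstrap at terminal states)."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                       # 2 states, 2 actions
Q = td_update(Q, s=0, a=1, r=1.0, s_next=1)
```

DQN replaces the table with a neural network and minimizes the same error by gradient descent over minibatches.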
Dueling-DQN (Wang et al., 2016b) specifically parameterizes Q_θ(s, a) = V_θ(s) + A_θ(s, a) so as to "generalize learning across actions" for better learning efficiency and performance: the neural network's penultimate layer outputs a V-value head and an advantage head that sum to the ultimate Q-value. Still, the delayed (or even sparse) nature of rewards poses a great challenge to learning.
2.2 Potential-Based Shaping Functions
A reward-shaping function F modifies the original reward function R and attempts to make RL methods (e.g., Q-learning) converge faster with more "instructive" rewards. It generally resides in the same functional space as the reward function, F(s, a, s'), and transforms the original MDP M = (S, A, P, γ, R) into a shaped MDP M' = (S, A, P, γ, R + F). Of all possible shapings, potential-based shaping functions (Ng et al., 1999) retain the optimal policy, as summarized below.
Definition 2.1 (Potential-based shaping function (Ng et al., 1999)).
F is a potential-based shaping function if there exists a real-valued function Φ: S → R such that, for all s, a, s',

F(s, a, s') = γΦ(s') - Φ(s).

Φ is thus called the potential function.
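In code, a potential-based shaping is a one-liner wrapped around the original reward; the sketch below (helper name ours) also sets the potential of terminal states to zero, a common convention for episodic tasks:

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99, done=False):
    """Return r + F(s, a, s') with F = gamma * Phi(s') - Phi(s) (Def. 2.1).
    phi_s / phi_s_next are the potentials of the current and next state."""
    phi_next = 0.0 if done else phi_s_next
    return r + gamma * phi_next - phi_s
```

Any potential Φ yields a valid shaping this way; the choice of Φ only affects learning speed, not the optimal policy (Thm. 2.1).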
Theorem 2.1 (Policy Invariance under Reward Shaping (Ng et al., 1999)).
The condition that F is a potential-based shaping function is necessary and sufficient for it to guarantee consistency with the optimal policy. Formally, for M = (S, A, P, γ, R) and M' = (S, A, P, γ, R + F), if F(s, a, s') = γΦ(s') - Φ(s), then

Q*_M'(s, a) = Q*_M(s, a) - Φ(s),    (1)

so the optimal policy derived from M' remains the same as that of M.
Consequently, if we choose Φ(s) = V*_M(s), then Q*_M'(s, a) = Q*_M(s, a) - V*_M(s) ≤ 0, and "all that would remain to be done would be to learn the nonzero Q-values" (Ng et al., 1999).
However, why are the "nonzero Q-values" easier to learn for RL? Agents can never know a priori which actions' Q-values are zero, and we cannot directly induce policies from V-values without access to the underlying MDP model. We find that the true advantage this particular reward shaping brings is underappreciated in previous works, and in Sec. 3.2 we provide a formal analysis and identify its theoretically optimal efficiency in credit assignment, motivating our framework based on such shaping functions.
2.3 Meta-Learning
Meta-learning is an effective strategy for dealing with a distribution of tasks. Specifically, it operates on two sets of tasks: a meta-training set and a meta-testing set, both drawn from the same task distribution p(T). The meta-learner attempts to learn the structure of tasks during meta-training; in meta-testing, it leverages this structure to learn efficiently on new tasks from a limited number of newly observed examples.
Meta-learning methods have been developed in both supervised learning (Santoro et al., 2016; Vinyals et al., 2016) and RL settings (Duan et al., 2016; Wang et al., 2016a). One of the most popular algorithms is Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a), which meta-learns a versatile initialization of model parameters by:
θ'_i = θ - α ∇_θ L_{T_i}(θ),    (2)

θ ← θ - β ∇_θ Σ_{T_i ~ p(T)} L_{T_i}(θ'_i),    (3)

where θ are the parameters to be learned, θ'_i are the task-specific parameters updated from θ as initialization (Eqn. (2)), α and β are learning rates, and L_{T_i} is the loss function on each task T_i. Note that the θ'_i depend on θ, and the gradients backpropagate through θ'_i to θ (Eqn. (3)). In meta-testing, given data from a new task, MAML adapts the model parameters starting from θ. MAML has also been recently extended to more Bayesian treatments (Grant et al., 2018; Yoon et al., 2018; Finn et al., 2018).

3 Methods
Based on the notions and notations in Sec. 2, we first formulate the problem of learning shaping functions on a distribution of tasks. We then derive the optimal shaping function we would like to learn, and introduce our algorithm to learn it on tasks sampled from the distribution. Lastly, we describe how to use the learned shaping function on newly sampled tasks.
3.1 Problem Formulation
Our goal is to learn a potential function Φ capable of effective reward shaping on tasks sampled from the distribution p(T) to accelerate their learning. We seek to learn Φ via meta-learning on a certain number of sampled tasks. In terms of meta-learning, this is the meta-training phase, which extracts prior knowledge from the task distribution. In light of this and recent works (Grant et al., 2018; Yoon et al., 2018; Finn et al., 2018), we call this potential function the prior. During the meta-testing phase, we either directly plug in the prior to shape rewards as a general test, or adapt it to the task-posterior under more restricted conditions for more effective shaping.
Note that in implementation we instantiate the prior as Φ_φ and the task-posterior as Φ_{θ_T}, i.e., ordinary neural networks rather than distributions. However, our method can still be understood from a Bayesian perspective by treating the prior as a delta function, the task-posterior as maximum-a-posteriori inference, and the overall algorithm as empirical Bayes; the details are beyond the scope of this paper, and readers may refer to (Grant et al., 2018; Yoon et al., 2018; Finn et al., 2018).
We first derive the ideal task-posterior.
3.2 Efficient Credit Assignment with Optimal Potential Functions
Delving deeper into the particular potential function Φ = V*_M of Sec. 2.2, we first show that the substantial advantage it brings to credit assignment, which the "nonzero Q-values" argument fails to capture, is the following:
Theorem 3.1.
Shaping with Φ(s) = V*_M(s) is optimal for credit assignment and learning efficiency.
Proof.
We first show that this reward shaping gives non-positive expected immediate rewards, with the optimal actions' rewards exclusively zero. To see this, consider a general MDP M and the corresponding shaped MDP M'. For the shaped reward R' we have

E_{s'}[R'(s, a, s')] = E_{s'}[R(s, a, s') + γV*_M(s')] - V*_M(s) = Q*_M(s, a) - V*_M(s) ≤ 0,

where the last inequality holds with equality iff a ∈ argmax_{a'} Q*_M(s, a').
Therefore, after shaping the rewards with Φ = V*_M, at any state only the optimal action(s) give zero immediate reward, and all other actions give strictly negative rewards right away. As a result, credit assignment can be achieved most efficiently, since the agent spots a deviation from the optimal policy as soon as it receives a negative reward. The optimality of any action can be determined instantaneously after it is taken, without any need to consider future rewards, and any RL algorithm can penalize negative-reward actions without any fear that they might lead to better rewards in the future; hence the theoretically optimal efficiency in credit assignment. ∎
We thus choose V*_T as the adaptation target of the task-posterior Φ_{θ_T}. In practical RL, the non-positivity may not always hold for individual sampled experience and rewards from the environment, but the property still holds in expectation, and minibatches of data approximate that very expectation. Learning efficiency can therefore still be improved, as demonstrated in our experiments.
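The sign structure claimed in the proof can be checked numerically on a toy example. The sketch below (a 3-state chain of our own construction, not from the paper) shapes with Φ = V*_M and exposes exactly zero reward for optimal actions and strictly negative reward otherwise:

```python
import numpy as np

# 3-state deterministic chain: entering terminal state 2 yields reward 1.
GAMMA = 0.9
V_star = np.array([GAMMA, 1.0, 0.0])       # closed-form optimal V-values

def step(s, a):                            # a = +1 (right) or -1 (left)
    s_next = min(max(s + a, 0), 2)
    r = 1.0 if s_next == 2 and s != 2 else 0.0
    return s_next, r

def shaped(s, a):                          # immediate reward after shaping with V*
    s_next, r = step(s, a)
    return r + GAMMA * V_star[s_next] - V_star[s]
```

Here shaped(s, +1) is exactly 0 at both non-terminal states, while shaped(s, -1) is strictly negative, so a deviation from the optimal policy is flagged by the very first reward.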
3.3 MetaLearning Potential Function Prior
The optimal shaping function V*_T is task-specific, with no universal optimum for all tasks T ~ p(T). Inspired by MAML's idea of learning a proper prior capable of fast adaptation to the task-posterior, we propose Alg. 1, as detailed below.
Formally, we specify the prior Φ_φ as defined on the shared state space S with parameters φ. For each task T, the task-posterior Φ_{θ_T} adapts in the direction of V*_T, initialized from Φ_φ. Then, a natural objective for learning the prior is:
min_φ E_{T ~ p(T)} [ ‖Φ_φ - V*_T‖² ].    (4)
However, V*_T is not directly accessible for any MDP, and neither is there any RL algorithm that directly learns optimal V-values. We therefore first specify the adaptation from the prior to Φ_{θ_T}, and then return to the learning of the prior.
Task-Posterior Adaptation:
Existing policy-based RL methods either do not estimate values or use the value output merely as a baseline or bootstrap, leaving value-based RL methods more suitable for our framework. In this paper, we simply choose Q-learning (Watkins & Dayan, 1992), though in principle any value-based algorithm explicitly estimating optimal values could be adopted. Q-learning still cannot directly estimate optimal V-values. To address this, we decompose the optimal values as:
Q*_T(s, a) = V*_T(s) + A*_T(s, a),

where A*_T is the advantage-value function. We implement this by separating a V-value head and an advantage head before the network outputs the Q-value:

Q_{θ_T}(s, a) = V_{θ_T}(s) + A_{θ_T}(s, a),

where θ_T is initialized as φ.
Note that V_{θ_T} (and V_φ) are the potential functions we need, so their parameters are just part of the whole network; for notational completeness we write the overall parameters θ_T (and φ) and treat Φ_{θ_T} (and Φ_φ) as "augmented" potential functions.
The task-posterior adapts by following Q-learning and minimizing the TD error:

L_T(θ_T) = E_{(s,a,r,s')} [ (r + γ max_{a'} Q_{θ_T}(s', a') - Q_{θ_T}(s, a))² ].    (5)
This architecture was first introduced in dueling-DQN (Wang et al., 2016b), but for the different purpose of speeding up training; here we exploit it to estimate the optimal V-values. To see this, first note that for identifiability of V and A, the maximum of the output advantage function is further subtracted from A in implementation:

Q_θ(s, a) = V_θ(s) + ( A_θ(s, a) - max_{a'} A_θ(s, a') ).    (6)
As Q_θ approaches Q* during Q-learning, taking max_a on both sides of Eqn. (6) gives V_θ(s) = max_a Q_θ(s, a) → V*(s). We can therefore learn the optimal V-values with dueling-DQN, adapting from the prior to the task-posterior.
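The max-subtraction of Eqn. (6) is a single line in the forward pass; the sketch below (function name ours) shows why the value head becomes readable as max_a Q:

```python
import numpy as np

def dueling_q(v, adv):
    """Combine a scalar value head v and a per-action advantage head adv as in
    Eqn. (6); subtracting max(adv) makes the split identifiable, so that
    max_a Q(s, a) equals v exactly."""
    return v + (adv - np.max(adv))

q = dueling_q(2.0, np.array([0.5, 1.5, 1.0]))   # max of q recovers v = 2.0
```

Reading the V-value estimate off the network thus requires no extra computation beyond the dueling forward pass.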
Prior Learning: Following the design of the task-posterior, the prior is naturally instantiated also as a dueling-DQN with parameters φ. Similarly to MAML (Finn et al., 2017a), we explicitly model the desired property of the prior: the ability to adapt efficiently to the task-posterior. Since each task-posterior adapts from φ to θ_T on task T with K steps of gradient update, we can finally rewrite the impractical prior-learning problem (4) as a practical one:
min_φ E_{T ~ p(T)} [ ‖Φ_φ - max_a Q_{θ_T}(·, a)‖² ],    (7)

where θ_T is obtained from φ by K gradient steps on Eqn. (5).
It is worth noting that this problem is in essence different from that of DQN, as it does not compute bootstrapped Q-values for a TD error, but directly uses the adapted max_a Q_{θ_T}(s, a) under the expectation over p(T) as the learning target for Φ_φ.
Also note that in implementation we keep the full computational graph of the task-posterior adaptation, so that θ_T depends on φ and gradients can backpropagate through θ_T to φ. For all our experiments we set K = 1 for simplicity, but larger K is a natural implementational extension. Possibly thanks to task multiplicity, we did not find target networks necessary for the Q-networks. Besides, since the algorithm remains off-policy overall, we do not need to resample data for the meta-update, contrary to MAML.
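The inner/outer structure of Alg. 1 can be sketched with a first-order approximation and toy quadratic losses standing in for the TD and prior objectives (all names and constants here are ours, for illustration only):

```python
import numpy as np

def loss_grad(params, target):
    return params - target                  # gradient of 0.5 * ||params - target||^2

def meta_update(phi, tasks, alpha=0.1, beta=0.1, K=1):
    """One meta-iteration: adapt the prior phi on each sampled task for K steps,
    then move phi against the averaged post-adaptation gradients (first-order)."""
    meta_grad = np.zeros_like(phi)
    for target in tasks:
        theta = phi.copy()
        for _ in range(K):                  # task-posterior adaptation
            theta -= alpha * loss_grad(theta, target)
        meta_grad += loss_grad(theta, target)
    return phi - beta * meta_grad / len(tasks)

phi = np.array([0.0])
for _ in range(100):                        # prior drifts toward the task "center"
    phi = meta_update(phi, tasks=[np.array([1.0]), np.array([3.0])])
```

The full algorithm differs in that the inner loss is the TD error of Eqn. (5) and the outer gradient is backpropagated through the adaptation rather than approximated to first order.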
3.4 MetaTesting with Potential Function Prior
During meta-testing, we aim to find the optimal policy on newly sampled tasks, with reward shaping by the learned potential function prior. We use the meta-learned Φ_φ to directly shape the MDP, transforming the original MDP M into the shaped MDP M' with F(s, a, s') = γΦ_φ(s') - Φ_φ(s). Intuitively, Φ_φ provides a good estimate of V*_T thanks to meta-training on the task distribution, so learning on M' can be much simpler than learning on M, as the reward shaping is close to optimal.
We identify two cases of meta-testing with our dueling-DQN-based meta-learning algorithm. Shaping only is the case where Φ_φ is directly applied on new tasks without adaptation. This applies to new tasks with different action spaces, or when the advantage head simply cannot be used for some reason (e.g., constraints on the new policy). According to Thm. 2.1, any RL algorithm can then be used on the shaped MDP with the optimal policy unchanged. Adaptation with advantage head is the case where the action space does not change and the DQN-policy is still applicable. We can then jointly adapt to the task-posterior and find the optimal policy efficiently within a few updates, initializing the whole dueling-DQN as the prior.
In the latter case, we still shape the MDP while the task-posterior is being adapted. We iteratively collect experience with ε-greedy on the current Q-network, and alternate the following two update steps (step size η):

Update the Q-network θ with sampled data from the replay buffer, minimizing the TD error on the shaped MDP M' with shaped reward r_F = r + γΦ_φ(s') - Φ_φ(s):

min_θ E_D [ (r_F + γ max_{a'} Q_θ(s', a') - Q_θ(s, a))² ].    (8)

Update the potential Φ_φ with sampled data from the replay buffer, regressing it toward the current optimal-V estimate:

min_φ E_D [ (Φ_φ(s) - y(s))² ],  y(s) = sg[Φ_φ(s)] + max_a Q_θ(s, a),    (9)

where sg[·] denotes stopping gradients.
This alternating adaptation enjoys a theoretical guarantee; we defer the proof to Appx. A.
For faster adaptation on new tasks, we simply optimize Eqn. (8) and Eqn. (9) alternately, which we find sufficient in experiments. We summarize such adaptation with advantage head in Alg. 2.
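A tabular caricature of Alg. 2's alternation (our own toy construction, with a Q-table and a potential table standing in for the dueling network) looks as follows:

```python
import numpy as np

GAMMA, ALPHA, ETA, N = 0.9, 0.5, 0.1, 5    # state N-1 is terminal (+1 on entry)

def env_step(s, a):                        # a = 0: left, a = 1: right
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    return s2, float(s2 == N - 1), s2 == N - 1

rng = np.random.default_rng(0)
Q, phi = np.zeros((N, 2)), np.zeros(N)     # policy values and shaping potential
for _ in range(5000):
    s, a = rng.integers(0, N - 1), rng.integers(0, 2)
    s2, r, done = env_step(s, a)
    # policy update on the shaped reward r + gamma*phi(s') - phi(s)
    r_shaped = r + GAMMA * (0.0 if done else phi[s2]) - phi[s]
    target = r_shaped + (0.0 if done else GAMMA * Q[s2].max())
    Q[s, a] += ALPHA * (target - Q[s, a])
    # potential update toward the current optimal-V estimate phi(s) + max_a Q(s, a)
    phi[s] += ETA * Q[s].max()
```

At the fixed point, phi approaches the optimal V-values of the chain, the shaped values of optimal actions shrink to zero, and the greedy policy (always move right) is recovered; the adapting potential and the policy learning boost each other, mirroring the alternation of the two updates above.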
Advantage over MAML: In the latter case of meta-testing, one could directly adapt as in the original MAML. However, direct adaptation merely exploits the parameter initialization, while our Alg. 2 additionally and explicitly exploits the efficient reward shaping of the potential function prior. The shaped rewards are easier for policy learning, and the adapting shaping (Eqn. (9)) further boosts policy learning (Eqn. (8)) immediately in the next loop. Thus our Alg. 2 is faster and more stable than direct MAML, and in Sec. 5 we compare with and outperform MAML. We also emphasize that we only assume a shared state space, facilitating adaptation across discrete and continuous action spaces, which MAML cannot achieve.
4 Related Work
To the best of our knowledge, the only recent work on automatic reward shaping on a task distribution is Jaderberg et al. (2018). Besides being independent of our work, Jaderberg et al. (2018) differ in that they access the limited novel states (termed "game events") exposed by the game engine of their specific task, and only need to evolve rewards for those states. Such rewards are simply stored in a short, fixed-length table and optimized with evolution strategies, with the meta-optimization objective of the evolution also designed task-specifically. Earlier similar works (Konidaris & Barto, 2006; Snel & Whiteson, 2010) are also restricted in various ways, such as relying on specific feature choices and evolution heuristics, being unable to adapt to new tasks as ours does, lacking theoretical analysis of reward shaping for credit assignment, or being unable to scale to complex environments with simple models. In contrast, our method is quite general, assuming no task knowledge or model access, with a more general, principled meta-learning objective, flexible application settings, novel theoretical analysis and gradient-based optimization.
Apart from Jaderberg et al. (2018), almost all other recent RL successes in complex environments either manually design reward shaping based on game elements, with examples in Doom (Wu & Tian, 2017) and Dota 2 (OpenAI, 2018), or simply depart from the scalar-reward RL approach and exploit rich supervision signals from other sources with supervised learning (Dosovitskiy & Koltun, 2017; Silver et al., 2018; Huang et al., 2019; Wu et al., 2019).
5 Experiments
We demonstrate the effectiveness and generality of our framework under various settings. First we conduct experiments on the classic control task CartPole (Barto et al., 1983), where the task distribution is defined by varying the pole length and the action space can be either discrete or continuous. We then consider grid games, whose state space is of much higher dimensionality and whose maps admit exponentially many possibilities (the task distribution is defined over all possible maps). Depending on whether the action space is shared across the task distribution, the advantage head in our dueling-DQN model (and thus the Q-values) may not be applicable to newly sampled tasks. We therefore experiment under both settings to test the learning efficiency on new tasks. Since we are more interested in general complex MDPs where shaping rewards are hard to code, and our meta-training relies on function approximators to generalize over the task distribution, we use neural-network agents in all experiments under the model-free setting.
5.1 Discrete and Continuous CartPoles
In CartPole (Barto et al., 1983), the agent tries to keep a pole upright by applying horizontal forces to the cart supporting the pole. Although a single particular CartPole is not very difficult, it is nontrivial to consider infinitely many CartPole tasks with different pole lengths, since the pole length affects the pole mass, the center of mass and, therefore, the whole dynamics of the environment. Besides, the applied forces can be represented either discretely or continuously in different tasks, posing further difficulty in solving them altogether.
A positive reward of 1 is provided at every timestep as long as the pole stays within a predefined "upright" range of 15 degrees from vertical (Barto et al., 1983; Brockman et al., 2016). This reward is not sparse, but it is still far from optimal in terms of credit assignment, since it does not distinguish between "really" upright positions and dangerous ones where the pole is about to fall. Designing a properly distinguishing reward shaping obviously requires much expert knowledge of the underlying physics. Therefore, automatic reward shaping on the distribution of CartPoles is of much significance.
Basic Training Settings: We modify the CartPole environment in OpenAI Gym (Brockman et al., 2016) so that it accepts the pole length as a construction parameter and changes the physical dynamics accordingly. The pole length is uniformly sampled within a fixed range, which defines a distribution over CartPoles. All tasks share the same 4-dimensional state space of cart position, cart velocity, pole angle and pole angular velocity. We use the discrete two-action setting (a fixed amount of force to the left or right) and the aforementioned original reward during meta-training. Episodes terminate after 200 timesteps, so the maximum achievable return is 200.
For the dueling-DQN we use an MLP with two hidden layers of size 32, followed by one linear layer for the advantage head and one for the value head, which aggregate into the output Q-values as in Eqn. (6). The prior is meta-trained with Alg. 1 for 500 meta-iterations with 10 sampled tasks per iteration. Note that the tasks are merely used for the meta-update in Alg. 1, with no performance guarantee on single tasks. All results are taken across five random seeds.
Intuitively, Alg. 1 is learning to generalize over different dynamics to assess how good/bad a state is.
Meta-Testing with Advantage Head: We first test the case of adaptation with advantage head as per Sec. 3.4, where test tasks share the action space with meta-training tasks. The meta-trained prior is evaluated on 40 newly sampled, unseen discrete CartPoles with Alg. 2, to see how fast and how well the potential function (value head), as well as the advantage head, adapts to each new task after initializing their weights from the prior. As mentioned in Sec. 3.4, we compare with the meta-testing procedure of MAML as a baseline, keeping the architecture and all common hyperparameters the same.
We track the episodic returns of the agent after each gradient update step, aggregate all such returns across different meta-test tasks and different runs, and plot their medians and quartiles in Fig. 1 (left). As can be seen, our method performs better than MAML, achieving the maximum of 200 twice as fast (in 4 steps vs. 8 steps) and oscillating less, with the improvement even clearer in Sec. 5.2. The relatively high initial return also indicates the quality of the meta-learned prior on new tasks. While MAML can also exploit the prior over the entire model, it is the additional reward shaping that lets our method adapt and learn faster on new tasks. Note that oscillation cannot be completely avoided, since it is to some extent inherent to off-policy RL algorithms, as shown in later experiments.
Meta-Testing from Discrete to Continuous: We then test the shaping only case as per Sec. 3.4. With Φ_φ directly used for zero-shot reward shaping, we train: (1) a vanilla DQN with randomly initialized weights on discrete CartPoles, corresponding to situations where the advantage head cannot be used; and (2) a deterministic policy network using DDPG on continuous CartPoles, corresponding to situations where meta-test tasks have different action spaces, which disqualifies almost all existing meta-learning methods.
The vanilla DQN has only two hidden layers of size 32, without dueling. It is randomly re-initialized for each test task, and we track and plot the test progress as before, except that we evaluate episodic returns every 100 updates. Naturally, we compare with training the same vanilla DQN with the same common hyperparameters but without any reward shaping, to test the effectiveness of the meta-learned reward shaping. As shown in Fig. 1 (middle), the zero-shot reward shaping still significantly boosts the learning process on new tasks, reaching the maximum of 200 remarkably faster, while "without shaping" has not yet reached it.
To test with continuous actions, we further modify the CartPole environment to accept a scalar real value as the action, whose sign determines the direction and whose absolute value determines the force magnitude. We use a deterministic policy, also with two hidden layers of size 32, and an additional two-hidden-layer critic network for DDPG. As with the vanilla DQN, we run DDPG with and without our reward shaping. As shown in Fig. 1 (right), learning on new tasks is again significantly accelerated with our reward shaping. Note that because we apply a tanh nonlinearity to the action output to bound the actions, the initial policy appears more stable, with higher initial returns than in the discrete case. However, due to the original reward being non-optimal in terms of credit assignment, DDPG without shaping struggles at first, with returns dropping below 25.
5.2 Grid Games
Grid games are clean but still challenging environments for model-free RL agents in terms of navigation and planning, especially when using neural nets as the agent model (Tamar et al., 2016), since tabular representations cannot generalize across grids. Many real-world environments can be modeled as grids in 2D or 3D. While simple in representation, grids admit many variations, with different start and goal positions on a grid inducing different tasks. Introducing additional obstacles on the maps leads to a combinatorial explosion of further possibilities.
Furthermore, grids almost always come with sparse rewards, obtained only in novel states like goals or traps. Such rewards are probably the most difficult for credit assignment, and manually designing reward shapings requires full access to the environment model and much human knowledge and heuristics, which usually precompute shortest paths or some distance metric. It is therefore very important to study automatic reward shaping on the distribution of grid games.
We randomly generate grid maps specifying start and goal positions, and possibly obstacles and traps. Agents start from the start position, move in the four canonical directions (up, down, left and right) and only receive a positive reward of 1 upon reaching the goal. The discount factor ensures that the optimal V-/Q-values encode a certain notion of shortest path. Episodes terminate if the agent has not reached the goal within a certain number of timesteps (50 in our experiments).
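A minimal version of such a grid task (a sketch with our own names, not the paper's exact environment) can be written as:

```python
# Sparse-reward grid task in the spirit of Sec. 5.2: +1 only upon reaching the
# goal; moves into walls or obstacles leave the agent in place.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

class GridTask:
    def __init__(self, size, start, goal, obstacles=()):
        self.size, self.goal, self.obstacles = size, goal, set(obstacles)
        self.pos = start

    def step(self, a):
        dr, dc = MOVES[a]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= r < self.size and 0 <= c < self.size and (r, c) not in self.obstacles:
            self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

Sampling random (start, goal, obstacles) configurations from such a constructor is what induces the task distribution; a multi-channel 0-1 map encoding of these fields then serves as the state.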
We use the same representations for start, goal and obstacles respectively across different maps, so intuitively, Alg. 1 learns to recognize and generalize the concepts of map positions and, more importantly, the notion of the shortest path to the goal.
Grid Games with Clean Maps: We first experiment with a simpler version of grid mazes with only start and goal positions but no obstacles. We generate 800 such maps (e.g., Fig. 2 (left)) for meta-training, where all state spaces share the same shape, with the last dimension corresponding to the 4 channels of 0-1 maps of the start, goal and current positions, as well as the obstacles (all zeros in this case).
For the dueling-DQN we use a CNN with four convolutional layers of 32 kernels with stride 1, followed by two fully connected layers and then the dueling module. Meta-training is conducted as in Sec. 5.1. We also meta-test the two cases as per Sec. 3.4: adaptation with advantage head from the whole prior, and shaping only to train a vanilla DQN with zero-shot shaping. We mainly follow the procedure of Sec. 5.1, except that we do not construct continuous-action grids. All common hyperparameters are the same between any pair of our method and the baseline.
As can be seen from Fig. 3, our method performs much better in both cases of meta-testing in terms of learning efficiency and stability, displaying the high potential of our method for scaling to complex environments and agent models. Visualization of the learned V-values on an unseen map (Fig. 2 (right)) also justifies the meta-learned prior Φ_φ.
Grid Games with Obstacles: We then experiment with a fuller version of grid mazes where an obstacle may be present at each grid position with probability 0.2 during map generation. We generate 4000 such maps (e.g., Fig. 4 (left)) for meta-training, so all state spaces again share the same shape. We use the same convolutional dueling-DQN architecture as on clean maps, and the same meta-training/testing protocols.
As can be seen from Fig. 5, our method consistently learns more efficiently than the baselines in both cases, adaptation with advantage head and shaping only, on new tasks. The meta-learned potential on an unseen map also passes an intuitive sanity check (Fig. 4 (right)). Oscillation is a bit more severe than before due to the harder tasks and the off-policy algorithmic nature, but ours remains superior in relative performance and stability.
6 Conclusions
In this paper, we consider the problem of reward shaping on a distribution of tasks. We first prove the optimality of the optimal V-values for potential-based reward shaping in terms of credit assignment. We then propose a meta-learning algorithm to learn a flexible prior over the optimal V-values. The prior can be applied directly to shape rewards, and can also quickly adapt to the task-posterior optimum while solving the task; we provide an additional theoretical guarantee for the latter case. Meanwhile, our framework only assumes that the state spaces across the task distribution are shared, leaving wide possibilities for potential applications. Extensive experiments demonstrate the effectiveness of our method in terms of learning efficiency and stability on new tasks. In the future, we plan to consider adapting the shaping prior without the advantage head, as well as the single-task setting.
Acknowledgement
Haosheng Zou would personally like to thank his beautiful and sweet wife, Jiamin Deng, for her incredible support during the whole process of this paper by not being around most of the time.
References
 Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, 1983.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
 Dorigo & Colombetti (1994) Dorigo, M. and Colombetti, M. Robot shaping: Developing autonomous agents through learning. Artificial intelligence, 71(2):321–370, 1994.
 Dosovitskiy & Koltun (2017) Dosovitskiy, A. and Koltun, V. Learning to act by predicting the future. In ICLR, 2017.
 Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
 Finn et al. (2017a) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017a.

 Finn et al. (2017b) Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-shot visual imitation learning via meta-learning. In CoRL, 2017b.
 Finn et al. (2018) Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In NeurIPS, 2018.
 Grant et al. (2018) Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In ICLR, 2018.
 Huang et al. (2019) Huang, S., Su, H., Zhu, J., and Chen, T. Combo-action: Training agent for FPS game with auxiliary tasks. In AAAI, 2019.
 Jaderberg et al. (2018) Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
 Konidaris & Barto (2006) Konidaris, G. and Barto, A. Autonomous shaping: Knowledge transfer in reinforcement learning. In ICML, 2006.
 Lazaric & Ghavamzadeh (2010) Lazaric, A. and Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In ICML, 2010.
 Mataric (1994) Mataric, M. J. Reward functions for accelerated learning. In Machine Learning Proceedings 1994, pp. 181–189. Elsevier, 1994.
 Minsky (1961) Minsky, M. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
 OpenAI (2018) OpenAI. OpenAI Five blog. https://blog.openai.com/openaifive/, 2018. Posted: 2018-06-25.
 Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In ICML, 2016.
 Silver et al. (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
 Skinner (1990) Skinner, B. F. The behavior of organisms: An experimental analysis. BF Skinner Foundation, 1990.
 Smith & Slone (2017) Smith, L. B. and Slone, L. K. A developmental approach to machine learning? Frontiers in Psychology, 8:2124, 2017.

 Snel & Whiteson (2010) Snel, M. and Whiteson, S. Multi-task evolutionary shaping without pre-specified representations. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, 2010.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 1998.
 Tamar et al. (2016) Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In NeurIPS, 2016.
 Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one-shot learning. In NeurIPS, 2016.
 Wang et al. (2016a) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016a.
 Wang et al. (2016b) Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. In ICML, 2016b.
 Watkins & Dayan (1992) Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 Wilson et al. (2007) Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: a hierarchical Bayesian approach. In ICML, 2007.
 Wu et al. (2019) Wu, B., Fu, Q., Liang, J., Qu, P., Li, X., Wang, L., Liu, W., Yang, W., and Liu, Y. Hierarchical macro strategy model for MOBA game AI. In AAAI, 2019.
 Wu & Tian (2017) Wu, Y. and Tian, Y. Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR, 2017.
 Yoon et al. (2018) Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In NeurIPS, 2018.
Appendix A Proof of Thm. 3.2
Proof.
(Part I) Eqn. (8) optimizes for the optimal policy.
(10) 
Bearing in mind that in Alg. 2 we use the learned value estimate as the potential-based shaping function to obtain the shaped MDP, we can rewrite objective (10) as:
where the equality holds simply because of the neural-network computation of the dueling architecture.
We have now arrived at exactly the Q-learning TD error on the original MDP:
(11) 
Therefore, Eqn. (8) in essence minimizes the Q-learning TD error on the original MDP, thus optimizing for the optimal policy (which is invariant with or without the potential-based reward shaping).
Remark: As an alternative understanding, first note that the advantage head is just a notational device for the neural-network output. If we view it as an estimator of the shaped Q-values, then Eqn. (8) actually performs Q-learning on the shaped MDP, with objective (10) directly being the corresponding TD error. It therefore still optimizes for the invariant optimal policy.
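The algebra above can be checked numerically: under the potential-based shaping reward r' = r + γΦ(s') − Φ(s) and the reparameterization Q'(s, a) = Q(s, a) − Φ(s), the shaped TD error coincides with the original Q-learning TD error for any potential Φ. A minimal sketch with random placeholder values (these tables stand in for the paper's networks):

```python
import numpy as np

# Numerical check of the identity: with shaping reward
#   r' = r + gamma * Phi(s') - Phi(s)
# and the reparameterization Q'(s, a) = Q(s, a) - Phi(s),
# the shaped TD error equals the original Q-learning TD error.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
Q = rng.normal(size=(n_states, n_actions))   # tabular Q-values on the original MDP
Phi = rng.normal(size=n_states)              # arbitrary potential function

s, a, s_next, r = 0, 1, 3, 0.5               # one arbitrary transition
td_original = r + gamma * Q[s_next].max() - Q[s, a]

r_shaped = r + gamma * Phi[s_next] - Phi[s]
Q_shaped = Q - Phi[:, None]                  # Q'(s, a) = Q(s, a) - Phi(s)
td_shaped = r_shaped + gamma * Q_shaped[s_next].max() - Q_shaped[s, a]

assert np.isclose(td_original, td_shaped)    # identical for any choice of Phi
```

The max over Q'(s', ·) shifts by the constant −Φ(s'), which exactly cancels the +γΦ(s') in the shaping reward, so the two TD errors agree term by term.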
(Part II) Eqn. (9) optimizes for the task-posterior optimum.
Assume Eqn. (8) optimizes to the minimum the parameters on which it has gradients. Following the remark in Part I, we have
(12) 
from Q-learning on the shaped MDP.
We also rearrange Eqn. (1) with the adopted shaping potential to get:
(13) 
Note that by definition,
(15) 
So we have
(16) 
In this way, we transform the inaccessible quantity into a computable one, and to adapt to the task-posterior optimum one should minimize
(18) 
where we stop the gradients because the latter part should be treated as a scalar learning target.
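The role of the stop-gradient can be illustrated with a small value head: treating the target as a fixed scalar, the loss gradient flows only through the prediction. A toy sketch (the linear parameterization V_theta(s) = theta · phi(s) is an illustrative assumption, not the paper's network):

```python
import numpy as np

# Stop-gradient illustration: with loss 0.5 * (V_theta(s) - target)^2 and the
# target detached (a constant scalar), the gradient w.r.t. theta is
# (V_theta(s) - target) * phi(s). Verified against a finite difference.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)                    # value-head parameters
phi_s = rng.normal(size=4)                    # state features
target = 1.7                                  # detached TD-style learning target

def loss(t):
    return 0.5 * ((t @ phi_s) - target) ** 2

grad = ((theta @ phi_s) - target) * phi_s     # analytic gradient, target held constant

eps = 1e-6
num_grad = np.array([
    (loss(theta + eps * np.eye(4)[i]) - loss(theta)) / eps for i in range(4)
])
assert np.allclose(grad, num_grad, atol=1e-4)
```

If the target were not detached, its own dependence on the parameters would contribute extra gradient terms, which is exactly what stopping the gradients prevents.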
Therefore, Eqn. (9) is indeed optimizing for the task-posterior optimum.
Remark: Here we assume Eqn. (8) is optimized to the final minimum. In practice this is neither necessary nor desired, as it would prevent fast adaptation. Therefore, we take only one step of Eqn. (8), and alternate between one step of Eqn. (8) and one step of Eqn. (9), where one such pair constitutes one update step in Figs. 3 and 5 (left).
Also note that the two heads may or may not share parameters, and the minimized notation only corresponds to the parameters on which Eqn. (8) has gradients, so we keep the two notations separate. From the above derivation, we can see that Eqn. (17) holds for arbitrary parameters, so nothing is violated if some of these parameters are also updated by Eqn. (8). ∎