Reinforcement learning (RL) algorithms are typically divided into two categories, i.e., model-free RL and model-based RL. The former directly learns the policy from the interactions with the environment, and has achieved impressive results in many areas, such as games (Mnih et al., 2015; Silver et al., 2016). But these model-free algorithms are data-expensive to train, which limits their applications to simulated domains. Different from model-free approaches, model-based reinforcement learning algorithms learn an internal model of the real environment to generate imaginary data, perform online planning or do policy search, which holds promise to provide significantly lower sample complexity (Luo et al., 2018).
Previously, model-based RL with linear or Bayesian models has obtained excellent performance on simple low-dimensional control problems (Abbeel, Quigley, and Ng, 2006; Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013; Levine and Abbeel, 2014; Levine et al., 2016). But these methods are hard to apply to high-dimensional domains. Since neural network models can represent more complex transition functions, model-based RL with them can solve higher-dimensional control problems (Gal, McAllister, and Rasmussen, 2016; Depeweg et al., 2017; Nagabandi et al., 2018). However, learned high-capacity dynamics models inevitably suffer from prediction errors, which result in suboptimal performance and even catastrophic failures (Deisenroth and Rasmussen, 2011).
Plenty of approaches have been proposed to alleviate this problem. For example, Chua et al. (2018) learn an ensemble of probabilistic models to mitigate the model error. Clavera et al. (2018) also learn an ensemble of models, and meta-train a policy that adapts to all of the models so that the policy is robust against model bias. Within this line of research, one type of solution tunes model usage to reduce the adverse effects of the imaginary data generated by inaccurate models, and promising results have been obtained. Kalweit and Boedecker (2017) only use imaginary trajectories when the uncertainty of the Q-function is high. Heess et al. (2015) only use imaginary data to compute policy gradients. Janner et al. (2019) replace model-generated rollouts that begin from the initial state distribution with short model-generated rollouts branched from the real data.
These simple tuning schemes always ignore the generated data in some parts of training, even when the data is completely accurate. Since samples with large prediction errors in the imaginary experience make the value or policy function trained on them inaccurate, adaptively filtering out such samples can reduce the performance degradation caused by model bias. This is the basic motivation of our study. However, the prediction error of an imaginary transition is difficult to obtain, and it is hard to decide on a threshold of prediction error that determines whether a sample should be discarded. For instance, when the value or policy function is very imprecise, even samples with relatively large prediction errors can be used to optimize the function.
To handle this prediction error problem, we adaptively tune model usage by reweighting the imaginary samples according to their potential effect on training, which differs from previous model usage approaches. More specifically, we measure the effect by comparing the values of the optimization objective (e.g., the TD error) computed on real samples before and after updating the functions with the imaginary transition. In this way, the filtering process can be viewed as selecting an appropriate weight from 0, 1 for each imaginary sample based on its effect. To achieve this, we train a weight function to minimize the adverse effects of the samples after they are reweighted by it. The weight function outputs a weight between 0 and 1 for each transition based on its features, such as the uncertainty of the predicted next state in the transition. The effect of a reweighted sample can be measured by the same evaluation criterion.
A main issue in using a weight function lies in its optimization. Given a generated transition, the weight function predicts a weight, and a weighted loss is calculated accordingly to update the parameters. The effect of the transition is evaluated by the difference between the losses computed on real transitions using the parameters before and after the update. Since the loss is a function of the updated parameters, and the updated parameters are a function of the output of the weight function, the weight function can be optimized by minimizing this difference using the chain rule. Our method can be considered an instance of meta-gradient learning (Xu, van Hasselt, and Silver, 2018; Zheng, Oh, and Singh, 2018; Veeriah et al., 2019), a form of meta-learning (Thrun and Pratt, 1998; Finn, Abbeel, and Levine, 2017; Hospedales et al., 2020), where the meta-learner is trained via gradients through the effect of the meta-parameters on a learner also trained via gradients (Xu, van Hasselt, and Silver, 2018).
To this end, we implement the algorithm by employing an ensemble of bootstrapped probabilistic neural networks and using Soft Actor-Critic (Haarnoja et al., 2018a, b) to update the policy and action-value function. We name this implementation Reweighted Probabilistic-Ensemble Soft-Actor-Critic (ReW-PE-SAC). Experimental results demonstrate that ReW-PE-SAC outperforms state-of-the-art model-based and model-free deep RL algorithms on multiple benchmark tasks. We also analyze the predicted weights of samples generated with different schemes at different stages of training, which shows that the learned weight function provides reasonable weights for them. In addition, the critic loss obtained when updating with the weighted samples is clearly smaller than the one obtained with the unweighted samples. This indicates that the learned weight function can filter out samples with adverse effects by decreasing their weights.
The main contributions of this work are:
- We propose an effective scheme for tuning model usage by adaptively reweighting the imaginary transitions. Unlike the simple tuning schemes proposed in previous work, this scheme adaptively filters out generated samples with a certain degree of prediction error based on the precision of the action-value and policy functions, while making maximal use of the remaining generated samples.
- We use a neural network to predict the weight of each transition in the generated trajectories from well-designed features of the transitions, and use the meta-gradient method to optimize the weight network according to the above scheme. The learned weight network can therefore be applied to newly generated samples.
- Experimental results demonstrate that our method outperforms state-of-the-art model-based and model-free RL algorithms on multiple tasks.
We consider the standard reinforcement learning setting, in which an agent interacts with an environment in discrete time. The environment is described by a state space, an action space, a reward function, state transition probabilities, and a discount factor. The state transition probabilities give the probability density of the next state given the current state and action, and the reward function gives the reward for each transition. At each time step, the agent selects an action according to its policy, and then receives the next state and the reward from the environment. The objective of standard reinforcement learning is to learn a policy for the agent that maximizes the discounted cumulative reward.
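As a small illustration of the objective, the discounted cumulative reward of one episode can be computed as below. This is only a sketch: the function name `discounted_return` and the argument name `gamma` (standing for the discount factor) are chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 and discount 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```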
Model-based reinforcement learning approaches learn a dynamics model to simulate the real environment and use the model to make better decisions. In most cases, the learned model is imperfect and not all of the transitions it generates are accurate, so the value and policy functions can be misled by transitions with prediction errors. Therefore, this paper proposes to adaptively reweight the generated transitions to minimize their negative effect on training.
We train a weight function to minimize the adverse effects of the transitions after they are reweighted. Specifically, for each transition, the weight function outputs a weight. The effect of a reweighted transition is measured by comparing the losses of the value and policy functions computed on real samples before and after the functions are updated with the reweighted transition. As the loss before the update is fixed, minimizing the adverse effect is equivalent to minimizing the loss after the update. This loss is a function of the updated parameters, and the updated parameters are a function of the weight function's output, so we can optimize the weight function by minimizing the post-update loss via the chain rule. The training process of the weight function is shown in Figure 1 (left).
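The idea can be illustrated on a toy problem. The sketch below is not the implementation used in this paper: it assumes a one-parameter linear learner with a squared-error loss in place of the value and policy networks, a two-parameter sigmoid weight function of a single uncertainty feature, and hand-made data. All names (`theta`, `phi`, `alpha`, `imagined`, `real`) are hypothetical. Only the weight function is trained here; the learner parameter is held fixed so that the meta-gradient is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight(phi, u):
    """Weight in (0, 1) predicted from a scalar uncertainty feature u."""
    return sigmoid(phi[0] + phi[1] * u)


def inner_update(theta, phi, imagined, alpha):
    """One gradient step on the weighted loss sum_i w_i * 0.5 * (theta*x_i - y_i)**2."""
    grad = sum(weight(phi, u) * (theta * x - y) * x for x, y, u in imagined)
    return theta - alpha * grad


def real_loss(theta, real):
    """Mean squared-error loss on real samples."""
    return sum(0.5 * (theta * x - y) ** 2 for x, y in real) / len(real)


def meta_gradient(theta, phi, imagined, real, alpha):
    """Gradient of real_loss(theta') with respect to phi, via the chain rule."""
    theta_new = inner_update(theta, phi, imagined, alpha)
    dloss_dtheta = sum((theta_new * x - y) * x for x, y in real) / len(real)
    grad = [0.0, 0.0]
    for x, y, u in imagined:
        w = weight(phi, u)
        g = (theta * x - y) * x      # per-sample gradient of the learner loss
        dtheta_dw = -alpha * g       # d theta' / d w_i
        dw_dz = w * (1.0 - w)        # derivative of the sigmoid
        grad[0] += dloss_dtheta * dtheta_dw * dw_dz
        grad[1] += dloss_dtheta * dtheta_dw * dw_dz * u
    return grad


real = [(1.0, 2.0), (2.0, 4.0)]      # real data follows y = 2x
imagined = [(1.0, 2.0, 0.1),         # accurate sample, low uncertainty
            (1.0, -2.0, 1.0)]        # inaccurate sample, high uncertainty
theta, phi, alpha = 0.0, [0.0, 0.0], 0.1

for _ in range(200):                 # plain gradient descent on phi, step size 1.0
    g = meta_gradient(theta, phi, imagined, real, alpha)
    phi = [p - 1.0 * gi for p, gi in zip(phi, g)]

w_good = weight(phi, 0.1)
w_bad = weight(phi, 1.0)
print(w_good, w_bad)
```

In this toy setting the weight of the inaccurate sample is driven down and the weight of the accurate sample is driven up, because that is what lowers the loss on the real samples after the inner update.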
We employ an ensemble of bootstrapped probabilistic neural networks as the dynamics model, which provides an estimated uncertainty for each generated transition. Based on these estimated uncertainties, the weight function can predict more reasonable weights for the transitions. We use Soft Actor-Critic (Haarnoja et al., 2018a, b) to update the q-value and policy functions; because it is an off-policy RL algorithm, we can use old experience to evaluate the effect of the updated parameters. We call this implementation ReWeighted Probabilistic-Ensemble Soft-Actor-Critic (ReW-PE-SAC).
In the following, we first present how to obtain the ensemble of networks, then describe the network architecture of the weight function, and finally explain how to optimize the weight function.
In our method, the dynamics model is required not only to generate the transitions, but also to provide other information that is useful for evaluating the weights of these transitions, such as uncertainty.
In order to measure the uncertainties of generated transitions, we train an ensemble of bootstrapped probabilistic models, as in Chua et al. (2018). The models share the same architecture but have different parameters and training datasets. Each dataset is generated by sampling with replacement from the replay buffer, where the number of draws is equal to the size of the buffer. Each probabilistic model is a neural network that predicts the probability distribution of the next state based on the input state and action. This distribution is a Gaussian, and the predicted next state is obtained by sampling from it. The reward function is assumed to be given in advance, as in most of the literature on model-based RL methods (Wang et al., 2019; Clavera et al., 2018; Chua et al., 2018).
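The two sampling operations described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the replay buffer is a plain list of placeholder tuples, and the Gaussian is assumed to have a diagonal covariance so that each state dimension is sampled independently.

```python
import random


def bootstrap_datasets(replay_buffer, num_models, rng):
    """One resampled dataset per ensemble member, each as large as the buffer."""
    n = len(replay_buffer)
    return [[replay_buffer[rng.randrange(n)] for _ in range(n)]
            for _ in range(num_models)]


def sample_next_state(mean, std, rng):
    """Draw a next state from a diagonal Gaussian (assumed) given per-dimension mean and std."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
buffer = [("s%d" % i, "a%d" % i, "s%d" % (i + 1)) for i in range(5)]
datasets = bootstrap_datasets(buffer, num_models=3, rng=rng)
next_state = sample_next_state(mean=[0.0, 1.0], std=[0.1, 0.2], rng=rng)
print(len(datasets), len(datasets[0]), len(next_state))  # 3 5 2
```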
Given a state and an action sequence, the learned dynamics models induce a distribution over the subsequent trajectories. Based on the current state and action, we use the ensemble of probabilistic models to induce one Gaussian distribution of the next state per model, and then sample states from each of these Gaussian distributions. The reward function is applied to the predicted next states to evaluate their rewards. One of the predicted states is selected at random as the next input. The selected state and the next action in the sequence are then used to generate the subsequent states. In this way, we obtain a transition set for each time step.
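The rollout procedure can be sketched as below. The sketch makes several assumptions for illustration: states and actions are scalars, each ensemble member is a hypothetical callable returning a mean and a standard deviation, and the reward function takes the state, action, and predicted next state.

```python
import random


def rollout(models, reward_fn, state, actions, samples_per_model, rng):
    """Return one transition set (state, action, next_states, rewards) per time step."""
    transition_sets = []
    for action in actions:
        next_states = []
        for model in models:
            mean, std = model(state, action)
            next_states += [rng.gauss(mean, std) for _ in range(samples_per_model)]
        rewards = [reward_fn(state, action, s_next) for s_next in next_states]
        transition_sets.append((state, action, next_states, rewards))
        state = rng.choice(next_states)  # one predicted state becomes the next input
    return transition_sets


# Hypothetical toy ensemble: each member has a slightly different drift.
models = [lambda s, a, d=d: (s + a + d, 0.05) for d in (-0.1, 0.0, 0.1)]
reward_fn = lambda s, a, s_next: -abs(s_next)
sets = rollout(models, reward_fn, 0.0, [0.5, -0.5, 0.2], 2, random.Random(0))
print(len(sets), len(sets[0][2]))  # 3 6
```

The spread of `next_states` and `rewards` within each set is what later serves as the uncertainty estimate for that time step.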
Weight Prediction Network
Estimating the weight of a single generated transition is difficult, because a single transition carries no information about the prediction accuracy of its predicted next state and reward. Thus, instead of a single transition, we estimate the weight of the transition set generated by the ensemble of probabilistic models for the same input.
The weight function is approximated by a parameterized neural network that takes the feature vector of a generated transition set as input. The feature vector is composed of the state, the action, the uncertainty of the predicted reward, and the uncertainty of each dimension of the predicted next state. The uncertainties are approximated by the standard deviations of the predicted rewards and next states. The uncertainties indicate the credibility of the generated transition, while the input state and action uniquely identify the transitions. In practice, we find that including the latter enables the weight function to make better predictions. To avoid large disparities between different features, the feature vectors are normalized per dimension before they are fed to the weight network.
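A minimal sketch of the feature construction is given below. The function names are hypothetical, and two details are assumptions rather than choices stated in the text: the population standard deviation is used for the uncertainties, and normalization is done by standardizing each dimension over a batch.

```python
import statistics


def transition_set_features(state, action, next_states, rewards):
    """Concatenate state, action, reward std, and per-dimension next-state std."""
    reward_std = statistics.pstdev(rewards)
    state_std = [statistics.pstdev(dim) for dim in zip(*next_states)]
    return list(state) + list(action) + [reward_std] + state_std


def normalize(feature_rows, eps=1e-8):
    """Standardize each feature dimension across a batch of feature vectors."""
    columns = list(zip(*feature_rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / (s + eps) for v, m, s in zip(row, means, stds)]
            for row in feature_rows]


features = transition_set_features(
    state=[0.0, 1.0], action=[0.5],
    next_states=[[0.1, 1.0], [0.3, 1.0]], rewards=[1.0, 3.0])
print(features)
```

In this example the second state dimension is predicted identically by both samples, so its uncertainty feature is zero, while the first dimension and the reward carry nonzero uncertainty.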
The credibility of a transition set is related to that of its predecessors, because modeling errors in the dynamics accumulate over time steps. We therefore use Gated Recurrent Units (GRU) (Cho et al., 2014) to integrate the features of the predecessors. The network architecture of the weight function is shown in Figure 1 (right).
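To make the recurrent structure concrete, the following sketch runs a single GRU cell over a sequence of feature vectors and emits one weight per time step. It is not the architecture of Figure 1: the layer sizes, the random initialization, the sigmoid output head, and all parameter names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices."""
    return [sum(wi * xi for wi, xi in zip(w_row, x))
            + sum(ui * hi for ui, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]


def gru_step(p, x, h):
    """One GRU update with update gate z, reset gate r, and candidate state."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h, p["bz"])]
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h, p["br"])]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], rh, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def predict_weights(p, feature_seq, hidden_size):
    """One weight in (0, 1) per time step; the hidden state carries predecessor features."""
    h = [0.0] * hidden_size
    weights = []
    for x in feature_seq:
        h = gru_step(p, x, h)
        logit = sum(wi * hi for wi, hi in zip(p["w_out"], h)) + p["b_out"]
        weights.append(sigmoid(logit))
    return weights


def init_params(input_size, hidden_size, rng):
    """Small random parameters; purely illustrative initialization."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "zrh":
        p["W" + gate] = mat(hidden_size, input_size)
        p["U" + gate] = mat(hidden_size, hidden_size)
        p["b" + gate] = [0.0] * hidden_size
    p["w_out"] = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
    p["b_out"] = 0.0
    return p


params = init_params(input_size=3, hidden_size=4, rng=random.Random(0))
weights = predict_weights(params, [[0.1, 0.2, 0.3], [0.0, 0.5, 1.0]], hidden_size=4)
print(weights)
```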
Training the Weight Function
This section shows how to train the weight function so that it predicts appropriate weights for imaginary transitions and minimizes their adverse effect.
The training of the weight function can be split into two steps: evaluating the potential effects of the reweighted transitions, and optimizing the weight function by minimizing the negative effects via the chain rule. We focus on the effects on the action-value and policy functions, and update the weight function by minimizing the effects of a mini-batch of imaginary transitions in each iteration.
For the first step, we sample real states and the corresponding real action sequences from the replay buffer, and generate imaginary transitions from them up to the planning horizon. Then we compute the weights of the imaginary transitions and update the parameters of the Q-network and the policy network by a gradient step, with a shared learning rate, on the reweighted losses of these imaginary samples. The per-sample losses of the Q-network and the policy network are, respectively, the soft Bellman residual and the KL divergence between the policy and the exponential of the soft Q-function (Haarnoja et al., 2018a, b). For a transition set, computing them also involves the parameters of the target Q-network and the temperature parameter.
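As a sketch of this first step, one reweighted gradient step could be written as follows. The symbols are assumptions introduced for illustration: $\theta$ and $\psi$ for the Q-network and policy parameters, $\bar{\theta}$ for the target Q-network, $\phi$ for the weight function $w_\phi$, $\eta$ for the learning rate, $\alpha$ for the temperature, $x_i$ for the feature vector of the $i$-th imaginary transition $\tau_i$, and $B$ for the mini-batch size. The losses are given in the standard single-transition Soft Actor-Critic form; how they are aggregated over the samples inside a transition set is not specified here.

```latex
\begin{align}
\theta'(\phi) &= \theta - \eta \, \nabla_{\theta} \sum_{i=1}^{B} w_{\phi}(x_i)\, J_Q(\tau_i; \theta), \\
\psi'(\phi)   &= \psi - \eta \, \nabla_{\psi} \sum_{i=1}^{B} w_{\phi}(x_i)\, J_{\pi}(\tau_i; \psi), \\
J_Q(\tau; \theta) &= \tfrac{1}{2} \Big( Q_{\theta}(s, a) - \big( r + \gamma \, [\, Q_{\bar{\theta}}(s', a') - \alpha \log \pi_{\psi}(a' \mid s') \,] \big) \Big)^{2},
  \quad a' \sim \pi_{\psi}(\cdot \mid s'), \\
J_{\pi}(\tau; \psi) &= \mathbb{E}_{\tilde{a} \sim \pi_{\psi}(\cdot \mid s)} \big[ \alpha \log \pi_{\psi}(\tilde{a} \mid s) - Q_{\theta}(s, \tilde{a}) \big].
\end{align}
```

Written this way, the updated parameters $\theta'$ and $\psi'$ are explicit functions of $\phi$, which is what allows the weight function to be trained by differentiating through the update.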
For the second step, we sample real transitions from the replay buffer, combine them into a set, and compute the losses of the q-value and policy functions on this set with the updated parameters. The gradient of these losses with respect to the parameters of the weight function is then computed through the chain rule: the losses depend on the updated parameters, which in turn depend on the predicted weights. Once the gradient is obtained, the parameters of the weight function can be updated by any optimization algorithm.
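Under assumed notation, the chain rule for this second step can be sketched as follows. Here $\theta'(\phi)$ and $\psi'(\phi)$ denote the Q-network and policy parameters after one reweighted gradient step with learning rate $\eta$, $w_\phi$ is the weight function with feature input $x_i$ for the $i$-th imaginary transition $\tau_i$, $\mathcal{D}_{\text{real}}$ is the sampled set of real transitions, and $J_Q$, $J_\pi$ are the per-sample losses. Summing the two losses into a single meta-objective is an assumption made for this sketch.

```latex
\begin{align}
\mathcal{L}(\phi) &= \sum_{\tau \in \mathcal{D}_{\text{real}}} \Big[ J_Q\big(\tau; \theta'(\phi)\big) + J_{\pi}\big(\tau; \psi'(\phi)\big) \Big], \\
\nabla_{\phi} \mathcal{L} &= \Big( \frac{\partial \theta'}{\partial \phi} \Big)^{\!\top} \nabla_{\theta'} \mathcal{L}
  + \Big( \frac{\partial \psi'}{\partial \phi} \Big)^{\!\top} \nabla_{\psi'} \mathcal{L}, \\
\frac{\partial \theta'}{\partial \phi} &= - \eta \sum_{i=1}^{B} \nabla_{\theta} J_Q(\tau_i; \theta) \, \big( \nabla_{\phi} w_{\phi}(x_i) \big)^{\!\top},
\qquad
\frac{\partial \psi'}{\partial \phi} = - \eta \sum_{i=1}^{B} \nabla_{\psi} J_{\pi}(\tau_i; \psi) \, \big( \nabla_{\phi} w_{\phi}(x_i) \big)^{\!\top}.
\end{align}
```

Intuitively, an imaginary transition whose loss gradient points in the same direction as the gradient of the real-sample loss receives a larger weight, and one whose gradient points in the opposite direction receives a smaller weight.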
We alternately optimize the q-value and policy functions and the weight function, so that the weight function can adaptively adjust the weights of imaginary transitions as the precision of the q-value and policy functions changes. We sample real states and the corresponding action sequences with an explore policy, which is obtained by changing the temperature parameter of current policy from to ( is set to 10 in this paper). A larger temperature parameter is conducive to generating diverse transitions. Based on the sampled state and action sequences, we use the dynamics model to generate imaginary transitions and use the weight function to reweight them. The gradients of the q-value and policy functions are then computed on the reweighted transitions.
We use Adam to update the parameters of the q-value and policy networks. The temperature parameter is optimized on the generated transition sets without reweighting.
The complete algorithm is shown in Alg. 1. In our algorithm, the real transitions are used not only to train the dynamics models, but also to train the action-value and policy networks. The real samples prevent excessively large prediction errors of the action-value function. When the predicted weights of generated samples are too low, the real samples also keep the algorithm from stagnating.
In this section, we evaluate our algorithm on six complex continuous control tasks from the model-based RL benchmark (Wang et al., 2019), which is modified from the OpenAI gym benchmark suite (Brockman et al., 2016). The six tasks are Ant, HalfCheetah, Hopper, SlimHumanoid, Swimmer-v0, and Walker2D, whose horizon length is fixed to 1000. The network architecture and training hyperparameters are given in the appendix. First, we compare ReW-PE-SAC on the benchmark against state-of-the-art model-free and model-based approaches. Then, we show the differences in the q-value losses with and without the reweighting method. Next, we evaluate the robustness of our algorithm to imperfect dynamics models. Finally, we analyze the relation between the learned weights and the training iteration, the planning horizon, and the explore policy.
Comparison with State of the Art
We compare ReW-PE-SAC with state-of-the-art model-free and model-based RL methods, including SAC (Haarnoja et al., 2018a, b), TD3 (Fujimoto, Hoof, and Meger, 2018), ME-TRPO (Kurutach et al., 2018), MB-MPO (Clavera et al., 2018), PETS (Chua et al., 2018), MBPO (Janner et al., 2019) and POPLIN (Wang and Ba, 2019). For SAC, we use the PyTorch implementation of soft actor-critic at https://github.com/pranz24/pytorch-soft-actor-critic to evaluate the performance. This implementation uses a double-Q network, ignores the artificial terminal signal, and includes other tricks, so its performance is better than the one reported in (Wang et al., 2019). We reproduce results from (Wang et al., 2019; Janner et al., 2019) and additionally run MBPO on the tasks of SlimHumanoid and Swimmer, as the corresponding experimental results are absent. We run our method ReW-PE-SAC for time-steps with random seeds. To evaluate our reweighting mechanism, we also run PE-SAC on these six tasks, which does not learn the weight function and directly uses the imaginary transitions to train the policy and value networks. To measure the sample efficiency of ReW-PE-SAC, we additionally run SAC time-steps on each task. The results are summarized in Table 1, and the learning curves of SAC and our methods with or without reweighting are plotted in Figure 2.
As shown in Table 1, ReW-PE-SAC achieves better performance than all other state-of-the-art algorithms except MBPO running with time-steps in all the environments. In particular, in the Ant, Hopper, Swimmer and Walker2D environments, the performance of ReW-PE-SAC is comparable to that of SAC running with time-steps, which demonstrates that ReW-PE-SAC has good sample efficiency. Compared with MBPO, ReW-PE-SAC is better on four environments and slightly weaker on HalfCheetah and Hopper.
Comparing the results of our method with and without reweighting (ReW-PE-SAC and PE-SAC), the performance with reweighting is clearly higher on most of the environments. This demonstrates that the learned weight function can provide appropriate weights that facilitate training a better policy. The performance gap between ReW-PE-SAC and PE-SAC on HalfCheetah is probably because the weight function is overcautious and the weights it provides are too low.
From Figure 2(a,d), we find that our method has a large performance variance on the Ant and SlimHumanoid tasks. The most likely reason is that our method uses the collected transitions to evaluate the effect of imaginary transitions, while the number of collected transitions is insufficient for some tasks. As a result, the weights of some valid imaginary transitions could be underestimated, and the learned policy would then be relatively poor due to the lack of these valid transitions. We will consider constructing a more reasonable validation set in future work.
The Critic Losses of PE-SAC and ReW-PE-SAC
In this section, we compare the critic losses in cases with and without reweighting. We run the algorithms of PE-SAC and ReW-PE-SAC on the tasks of Ant, HalfCheetah, SlimHumanoid and Swimmer, and record the average critic losses of real samples in every episode. The minimum, maximum and mean of the losses in the same time-step are plotted in Figure 3.
As shown in the figure, ReW-PE-SAC maintains lower critic losses than PE-SAC and prevents abnormally large losses. Combined with the learning curve for the Swimmer task (shown in Figure 2(e)), we find that the performance of PE-SAC falls after about time-steps, while the critic loss also increases sharply at around this time. So maintaining lower losses has contributed to improving the performance in most cases. The only exception is the HalfCheetah task, in which the lower critic losses have not resulted in higher performance. The most likely reason is that an imprecise Q-value function is enough to train a good policy.
Robustness to Imperfect Dynamics Model
We construct dynamics models with different prediction accuracies by adjusting the number of hidden layers in them from to . We run PE-SAC and ReW-PE-SAC with these dynamics models on the Ant task. Their learning curves are plotted in Figure 4.
When the number of hidden layers is decreased, the performance of PE-SAC drops significantly. This means that the dynamics models with hidden layers have a strong negative effect on the training process. The performance of ReW-PE-SAC remains roughly unchanged, which means that our method can effectively reduce the negative effect of generated samples with prediction errors. This analysis gives a possible explanation for the observation that ReW-PE-SAC achieves a larger performance improvement on the more complex tasks, like SlimHumanoid and Walker2D.
The Trend of the Predicted Weight
In this section, we analyze the overall trend of the predicted weights and the relation between the weights and the prediction depth and the soft scale . We run ReW-PE-SAC on the Swimmer task with only random seed, and record the predicted weights of generated samples at the first step of each episode. The predicted weights change over the course of training, so averaging over different seeds is meaningless. The 25th percentile, median, and 75th percentile are plotted in Figure 5(a). Then, we split these weights according to the prediction depth, and plot the median of the weights at each prediction depth in Figure 5(b). Finally, we generate some extra data using different , and plot the median of the predicted weights on them in Figure 5(c).
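The summary statistics used in this analysis can be computed as in the sketch below. The function names and the toy data are hypothetical, and the interpolation rule used for the percentiles (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics
from collections import defaultdict


def quartiles(weights):
    """25th percentile, median, and 75th percentile of a list of weights."""
    q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")
    return q1, q2, q3


def median_by_depth(records):
    """Median weight for each prediction depth; records are (depth, weight) pairs."""
    groups = defaultdict(list)
    for depth, weight in records:
        groups[depth].append(weight)
    return {depth: statistics.median(ws) for depth, ws in sorted(groups.items())}


print(quartiles([0.1, 0.2, 0.3, 0.4, 0.5]))
print(median_by_depth([(1, 0.8), (1, 0.6), (2, 0.4), (2, 0.2), (2, 0.3)]))
```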
In Figure 5(a), the weights are lower in the earlier and later stages but higher in the middle stage (The weight function’s initial output is about as that the bias of last layer is initialized to .). This trend reflects the change in the accuracy of the dynamics model and of the q-value and policy functions. In the earlier stage, the dynamics model is imprecise, so most of the generated transitions are rejected. Then, the weights begin to increase as the prediction precision of the dynamics model improves. However, in the later stage, the precision of the q-value function also improves, while the model has reached its bottleneck. This results in the decline of the weights. From Figure 5(b), we find that the predicted weights decrease with the number of planning steps, which accords with the fact that prediction errors accumulate over steps. From Figure 5(c), we also find that the weights decrease with the scale, which is caused by the difference between the distributions of the actions in the training and prediction processes of the dynamics model. These phenomena further verify that the learned weight function is reasonable.
In this paper, we have proposed a novel and efficient model-based reinforcement learning approach, which adaptively adjusts the weights of all generated transitions by training a weight function to reduce their potential negative effect. We measure the effect of reweighted imaginary transitions as the difference between the losses computed on real transitions before and after training with them, and minimize this difference to optimize the weight function via the chain rule.
Experimental results show that our method obtains state-of-the-art performance on multiple complex continuous control tasks. The learned weight function provides reasonable weights for different generated samples at different stages of the training process. We believe that the weight function can be used to adjust some hyper-parameters, such as the planning horizon, in the future.
This work is funded by the National Natural Science Foundation of China (Grant No. 61876181, No. 61673375, and No. 61721004), Beijing Nova Program of Science and Technology under Grant No. Z191100001119043, the Youth Innovation Promotion Association, and CAS and the Projects of Chinese Academy of Science (Grant No. QYZDB-SSW-JSC006).
- Abbeel, Quigley, and Ng (2006) Abbeel, P.; Quigley, M.; and Ng, A. Y. 2006. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on machine learning (ICML-06), 1–8.
- Brockman et al. (2016) Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym. arXiv preprint arXiv:1606.01540 .
- Cho et al. (2014) Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing, 1724–1734.
- Chua et al. (2018) Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754–4765.
- Clavera et al. (2018) Clavera, I.; Rothfuss, J.; Schulman, J.; Fujita, Y.; Asfour, T.; and Abbeel, P. 2018. Model-Based Reinforcement Learning via Meta-Policy Optimization. In Conference on Robot Learning, 617–629.
- Deisenroth and Rasmussen (2011) Deisenroth, M.; and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), 465–472.
- Depeweg et al. (2017) Depeweg, S.; Hernández-Lobato, J.; Doshi-Velez, F.; and Udluft, S. 2017. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017-Conference Track Proceedings.
- Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on machine learning (ICML-17), 1126–1135. JMLR. org.
- Fujimoto, Hoof, and Meger (2018) Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on machine learning (ICML-18), 1587–1596.
- Gal, McAllister, and Rasmussen (2016) Gal, Y.; McAllister, R.; and Rasmussen, C. E. 2016. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4, 34.
- Haarnoja et al. (2018a) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018a. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on machine learning (ICML-18), 1861–1870.
- Haarnoja et al. (2018b) Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. 2018b. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 .
- Heess et al. (2015) Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944–2952.
- Hospedales et al. (2020) Hospedales, T.; Antoniou, A.; Micaelli, P.; and Storkey, A. 2020. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439 .
- Janner et al. (2019) Janner, M.; Fu, J.; Zhang, M.; and Levine, S. 2019. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 12498–12509.
- Kalweit and Boedecker (2017) Kalweit, G.; and Boedecker, J. 2017. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 195–206.
- Kurutach et al. (2018) Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; and Abbeel, P. 2018. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592 .
- Levine and Abbeel (2014) Levine, S.; and Abbeel, P. 2014. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 1071–1079.
- Levine et al. (2016) Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17(1): 1334–1373.
- Levine and Koltun (2013) Levine, S.; and Koltun, V. 2013. Guided policy search. In Proceedings of the 30th International Conference on machine learning (ICML-13), 1–9.
- Luo et al. (2018) Luo, Y.; Xu, H.; Li, Y.; Tian, Y.; Darrell, T.; and Ma, T. 2018. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858 .
- Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.
- Nagabandi et al. (2018) Nagabandi, A.; Kahn, G.; Fearing, R. S.; and Levine, S. 2018. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 7559–7566. IEEE.
- Silver et al. (2016) Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484.
- Thrun and Pratt (1998) Thrun, S.; and Pratt, L. 1998. Learning to learn: Introduction and overview. In Learning to learn, 3–17. Springer.
- Veeriah et al. (2019) Veeriah, V.; Hessel, M.; Xu, Z.; Rajendran, J.; Lewis, R. L.; Oh, J.; van Hasselt, H. P.; Silver, D.; and Singh, S. 2019. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, 9306–9317.
- Wang and Ba (2019) Wang, T.; and Ba, J. 2019. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649 .
- Wang et al. (2019) Wang, T.; Bao, X.; Clavera, I.; Hoang, J.; Wen, Y.; Langlois, E.; Zhang, S.; Zhang, G.; Abbeel, P.; and Ba, J. 2019. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057 .
- Xu, van Hasselt, and Silver (2018) Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Meta-gradient reinforcement learning. In Advances in neural information processing systems, 2396–2407.
- Zheng, Oh, and Singh (2018) Zheng, Z.; Oh, J.; and Singh, S. 2018. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, 4644–4654.