Introduction
Reinforcement learning (RL) algorithms are typically divided into two categories: model-free RL and model-based RL. The former learns a policy directly from interactions with the environment and has achieved impressive results in many areas, such as games (Mnih et al., 2015; Silver et al., 2016). However, model-free algorithms are data-expensive to train, which limits their application to simulated domains. In contrast, model-based reinforcement learning algorithms learn an internal model of the real environment to generate imaginary data, perform online planning, or conduct policy search, which holds promise for significantly lower sample complexity (Luo et al., 2018).
Previously, model-based RL with linear or Bayesian models obtained excellent performance on simple low-dimensional control problems (Abbeel, Quigley, and Ng, 2006; Deisenroth and Rasmussen, 2011; Levine and Koltun, 2013; Levine and Abbeel, 2014; Levine et al., 2016), but these methods are hard to apply to high-dimensional domains. Since neural networks can represent more complex transition functions, model-based RL with neural network models can solve higher-dimensional control problems (Gal, McAllister, and Rasmussen, 2016; Depeweg et al., 2017; Nagabandi et al., 2018). However, learned high-capacity dynamics models inevitably suffer from prediction error, which results in suboptimal performance and even catastrophic failures (Deisenroth and Rasmussen, 2011).

Plenty of approaches have been proposed to alleviate this problem. For example, Chua et al. (2018) learn an ensemble of probabilistic models to mitigate model error. Clavera et al. (2018) also learn an ensemble of models and meta-train a policy to adapt to all of them, so that the policy is robust against model bias. Among this line of research, one type of solution tunes model usage to reduce the adverse effects of imaginary data generated by inaccurate models, and promising results have been obtained. Kalweit and Boedecker (2017) use imaginary trajectories only when the uncertainty of the Q-function is high. Heess et al. (2015) use imaginary data only to compute policy gradients. Janner et al. (2019) replace model-generated rollouts that begin from the initial state distribution with short model-generated rollouts branched from real data.
These simple tuning schemes can ignore generated data in parts of the training process even when it is completely accurate. Since imaginary samples with large prediction errors make the value or policy functions trained on them inaccurate, adaptively filtering such samples can reduce the performance degradation caused by model bias. This is the basic motivation of our study. However, filtering is nontrivial: the prediction error of an imaginary transition is difficult to obtain, and it is hard to set a threshold on the prediction error for deciding whether a sample should be abandoned. For instance, when the value or policy function is still very imprecise, even samples with relatively large prediction errors can be used to optimize it.
To handle the prediction-error problem above, we adaptively tune model usage by reweighting imaginary samples according to their potential effect on training, which differs fundamentally from previous model-usage approaches. More specifically, we measure the effect by comparing the values of the optimization objective (e.g., the TD error) computed on real samples before and after updating the functions with the imaginary transition. From this viewpoint, the filtering process can be seen as selecting a weight from {0, 1} for each imaginary sample based on its effect. To generalize this, we train a weight function to minimize the adverse effects of the samples after they are reweighted by the function. The weight function outputs a weight in [0, 1] for each transition based on its features, such as the uncertainty of the predicted next state. The effect of a reweighted sample is measured by the same evaluation criterion as above.
A main issue in using a weight function lies in its optimization. Given a generated transition, the weight function predicts a weight, and a weighted loss is calculated accordingly to update the parameters. The effect of the update is evaluated by the difference between the losses computed on real transitions with the parameters before and after the update. Since the evaluation loss is a function of the updated parameters, and the parameter update is a function of the weight-function output, the weight function can be optimized by minimizing this difference via the chain rule. Our method can be considered an instance of meta-gradient methods (Xu, van Hasselt, and Silver, 2018; Zheng, Oh, and Singh, 2018; Veeriah et al., 2019), a form of meta-learning (Thrun and Pratt, 1998; Finn, Abbeel, and Levine, 2017; Hospedales et al., 2020), in which the meta-learner is trained via gradients through the effect of the meta-parameters on a learner that is itself trained via gradients (Xu, van Hasselt, and Silver, 2018).
To this end, we implement the algorithm with an ensemble of bootstrapped probabilistic neural networks and use Soft Actor-Critic (Haarnoja et al., 2018a,b) to update the policy and action-value functions. We name this implementation ReWeighted Probabilistic-Ensemble Soft Actor-Critic (ReW-PE-SAC). Experimental results demonstrate that ReW-PE-SAC outperforms state-of-the-art model-based and model-free deep RL algorithms on multiple benchmark tasks. We also analyze the predicted weights of samples generated with different schemes at different stages of training, which shows that the learned weight function provides reasonable weights in each case. In addition, the critic loss after updating with weighted samples is clearly smaller than after updating with unweighted samples, which indicates that the learned weight function filters out samples with adverse effects by decreasing their weights.
The main contributions of this work are:

We propose an effective tuning scheme for model usage that adaptively reweights imaginary transitions. Unlike the simple tuning schemes of previous work, this scheme adaptively filters generated samples with a certain degree of prediction error according to the precision of the action-value and policy functions, while maximizing the use of the remaining generated samples.

We use a neural network to predict the weight of each transition in the generated trajectories based on well-designed features of the transitions, and optimize the weight network with a meta-gradient method according to the above scheme. The learned weight network can thus be applied to newly generated samples.

Experimental results demonstrate that our method outperforms state-of-the-art model-based and model-free RL algorithms on multiple tasks.
Approach
Notation
We consider the standard reinforcement learning setting, in which an agent interacts with an environment in discrete time. The environment is described by a state space S, an action space A, a reward function r(s_t, a_t, s_{t+1}), state transition probabilities p(s_{t+1} | s_t, a_t), and a discount factor γ, where p(s_{t+1} | s_t, a_t) denotes the probability density of the next state s_{t+1} given the current state s_t and action a_t, and the reward function gives the reward of the transition. At each time step t, the agent selects an action a_t according to the policy π(a_t | s_t), and then receives the next state s_{t+1} and the reward r_t from the environment. The objective of standard reinforcement learning is to learn a policy that maximizes the discounted cumulative reward.

Overall Framework
Model-based reinforcement learning approaches learn a dynamics model to simulate the real environment and use the model to make better decisions. In most cases, the learned model is imperfect and not all the transitions it generates are accurate, so the value and policy functions can be misled by transitions with prediction errors. This paper therefore proposes to adaptively reweight the generated transitions to minimize their negative effect on training.
We train a weight function to minimize the adverse effects of the transitions after they are reweighted. Specifically, the weight function outputs a weight for each transition. The effect of a reweighted transition is measured by comparing the losses of the value and policy functions computed on real samples before and after the functions are updated with the reweighted transition. As the loss before the update is fixed, minimizing the adverse effect is equivalent to minimizing the loss after the update. This loss is parameterized by the updated parameters, and the parameter update is parameterized by the weight function, so we can optimize the weight function by minimizing the post-update loss via the chain rule. The training process of the weight function is shown in Figure 1 (left).
We employ an ensemble of bootstrapped probabilistic neural networks as the dynamics model, which provides an estimated uncertainty for each generated transition; based on these uncertainties, the weight function can predict weights more reasonably. We use Soft Actor-Critic (Haarnoja et al., 2018a,b) to update the Q-value and policy functions; since it is an off-policy RL algorithm, we can use old experience to evaluate the effect of the updated parameters. We call this implementation ReWeighted Probabilistic-Ensemble Soft Actor-Critic (ReW-PE-SAC). In the following, we first present how the ensemble of networks is obtained, then describe the network architecture of the weight function, and finally explain how the weight function is optimized.
Dynamics Model
In our method, the dynamics model is required not only to generate transitions but also to provide additional information useful for evaluating their weights, such as uncertainty.
To measure the uncertainties of generated transitions, we train an ensemble of B bootstrapped probabilistic models, as in (Chua et al., 2018). The models share the same architecture but have different parameters and training datasets D_1, …, D_B. Each dataset D_b is generated by sampling with replacement N times from the replay buffer D, where N equals the size of D. Each probabilistic model is a neural network that predicts the probability distribution of the next state s_{t+1} given the input state and action. The distribution is Gaussian, f_b(s_{t+1} | s_t, a_t) = N(μ_b(s_t, a_t), Σ_b(s_t, a_t)), and the predicted next state is obtained by sampling from it. As in most of the model-based RL literature (Wang et al., 2019; Clavera et al., 2018; Chua et al., 2018), the reward function r is assumed to be given in advance.

Given a state s_t and an action sequence a_t, a_{t+1}, …, the learned dynamics models induce a distribution over the subsequent trajectories. Based on s_t and a_t, the ensemble of probabilistic models induces B Gaussian distributions over the next state, and we sample one state ŝ_{t+1}^b from each. The reward function is applied to the predicted next states to evaluate their rewards, r̂_t^b = r(s_t, a_t, ŝ_{t+1}^b). One of the B predicted states is randomly selected as the next input s_{t+1}, and the selected state and the next action are used to generate the subsequent states. In this way, we obtain a transition set T_t = {(s_t, a_t, r̂_t^b, ŝ_{t+1}^b)}, b = 1, …, B, for each timestep t.
Weight Prediction Network
Estimating a weight for a single generated transition is difficult, because a single transition carries no information about the prediction accuracy of its reward and next state. We therefore estimate the weight on the transition set T_t generated by the ensemble of probabilistic models for each input, instead of on a single transition.

The weight function is approximated by a neural network w_ψ(x_t) with parameters ψ, where x_t represents the feature vector of a generated transition set T_t. The feature vector is composed of the state s_t, the action a_t, the uncertainty of the predicted reward, and the uncertainty of each dimension of the predicted next state. The uncertainties are approximated by the standard deviations of the B predicted rewards and next states. The uncertainties indicate the credibility of the generated transitions, while the input state and action uniquely identify them; in practice, we find that including the latter enables the weight function to make better predictions. To avoid large disparities between different features, each dimension of the feature vector is normalized before it is fed to the weight network.

The credibility of T_t is clearly related to that of its predecessors T_{t−1}, T_{t−2}, …, because modeling errors in the dynamics accumulate over timesteps. We therefore use Gated Recurrent Units (GRU) (Cho et al., 2014) to integrate the features of the predecessors. The network architecture of the weight function is shown in Figure 1 (right).

Training the Weight Function
This section shows how to train the weight function so that it predicts appropriate weights that minimize the adverse effects of imaginary transitions. The training consists of two steps: evaluating the potential effects of the reweighted transitions, and optimizing the weight function by minimizing the negative effects via the chain rule. We focus on the effects on the action-value and policy functions, and update the weight function by minimizing the effects of a minibatch of imaginary transitions in each iteration.
For the first step, we sample M real states and the corresponding real action sequences from the replay buffer D to generate imaginary transition sets {T_i}, with rollouts of length H, where H is the planning horizon. Then we compute the weights of the imaginary transitions and update the parameters of the Q-network and the policy network, θ and φ, with the reweighted losses of these imaginary samples:

(1)  θ′ = θ − λ ∇_θ (1/M) Σ_i w_ψ(x_i) J_Q(θ; T_i),  φ′ = φ − λ ∇_φ (1/M) Σ_i w_ψ(x_i) J_π(φ; T_i)

where λ is the learning rate of θ and φ. J_Q and J_π are the soft Bellman residual and the KL divergence between the policy and the exponential of the soft Q-function (Haarnoja et al., 2018a,b), respectively. For a transition set T_t, J_Q and J_π are computed by

(2)  J_Q(θ; T_t) = (1/B) Σ_b ( Q_θ(s_t, a_t) − r̂_t^b − γ E_{a′∼π_φ}[ Q_θ̄(ŝ_{t+1}^b, a′) − α log π_φ(a′ | ŝ_{t+1}^b) ] )²

(3)  J_π(φ; T_t) = E_{a∼π_φ}[ α log π_φ(a | s_t) − Q_θ(s_t, a) ]

where θ̄ denotes the parameters of the target Q-network and α is the temperature parameter.
For the second step, we sample real transitions from the replay buffer D, combine them into a set D_real, and compute the losses of the Q-value and policy functions on it with the updated parameters θ′ and φ′:

(4)  J(ψ) = J_Q(θ′; D_real) + J_π(φ′; D_real)

The gradient of the parameters ψ of the weight function is computed through the chain rule:

(5)  ∇_ψ J(ψ) = (∂J/∂θ′)(∂θ′/∂w_ψ)(∂w_ψ/∂ψ) + (∂J/∂φ′)(∂φ′/∂w_ψ)(∂w_ψ/∂ψ)

Once the gradient is obtained, the parameters ψ can be updated by any gradient-based optimization algorithm.
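As a sanity check of the chain-rule computation described above, the following NumPy sketch reproduces the two-step procedure on a deliberately tiny problem: a linear Q-function with a quadratic loss and a sigmoid weight over a fixed feature vector. All of these are illustrative stand-ins for the neural networks; the point is only that the meta-gradient factors into the three terms above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def imaginary_loss_grad(theta, x_im, y_im):
    # gradient of 0.5 * (theta.x - y)^2 w.r.t. theta for one imaginary sample
    return (theta @ x_im - y_im) * x_im

def real_loss(theta, X, y):
    # evaluation loss on a batch of real samples
    r = X @ theta - y
    return 0.5 * np.mean(r ** 2)

def meta_gradient(psi, theta, f, x_im, y_im, X, y, lr=0.1):
    """Chain-rule gradient of the real loss (after the weighted inner
    update) with respect to the weight-function parameters psi."""
    w = sigmoid(psi @ f)                 # predicted weight of the sample
    g_im = imaginary_loss_grad(theta, x_im, y_im)
    theta_new = theta - lr * w * g_im    # weighted inner update
    r = X @ theta_new - y
    dL_dtheta = X.T @ r / len(y)         # dJ/dtheta' on real data
    dL_dw = dL_dtheta @ (-lr * g_im)     # times dtheta'/dw
    return dL_dw * w * (1.0 - w) * f, theta_new  # times dw/dpsi
```

In the full method, automatic differentiation performs this composition for the actual Q and policy networks; the sketch only makes the three chain-rule factors explicit.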
We alternately optimize the Q-value and policy functions and the weight function, so that the latter can adaptively adjust the weights of imaginary transitions as the precision of the former changes. We sample real states and the corresponding action sequences with an exploration policy obtained by scaling the temperature parameter of the current policy from α to kα (k is set to 10 in this paper); a larger temperature parameter is conducive to generating diverse transitions. Based on the sampled states and action sequences, we use the dynamics model to generate imaginary transitions and the weight function to reweight them. The gradients of the Q-value and policy functions are computed by

(6)  ∇_θ (1/M) Σ_i w_ψ(x_i) J_Q(θ; T_i),  ∇_φ (1/M) Σ_i w_ψ(x_i) J_π(φ; T_i)

We use Adam to update the parameters θ and φ. The temperature parameter α is optimized based on the generated transition sets without reweighting.
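The claim that a larger temperature yields more diverse behavior can be illustrated with a discrete softmax policy (a stand-in for the continuous Gaussian policy used here): scaling the temperature up flattens the action distribution and raises its entropy.

```python
import numpy as np

def softmax_policy(logits, temperature):
    """Action distribution of a discrete softmax policy; a higher
    temperature flattens the distribution."""
    z = logits / temperature
    z = z - z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.array([2.0, 0.5, -1.0, 0.0])
h_explore = entropy(softmax_policy(logits, 10.0))  # scaled-up temperature
h_current = entropy(softmax_policy(logits, 1.0))   # current policy
```

With the temperature scaled by 10 the distribution is nearly uniform, so the sampled action sequences cover more of the action space.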
Table 1: Average returns (± standard deviation) on six continuous control tasks.

| | Ant | HalfCheetah | Hopper | Slimhumanoid | Swimmer | Walker2d |
|---|---|---|---|---|---|---|
| ME-TRPO | 282.2±18.0 | 2283.7±900.4 | 1272.5±500.9 | 154.9±534.3 | 30.1±9.7 | 1609.3±657.5 |
| MB-MPO | 705.8±147.2 | 3639.0±1185.8 | 333.2±1189.7 | 674.4±982.2 | 85.0±98.9 | 1545.9±216.5 |
| PETS | 1165.5±226.9 | 2795.3±879.9 | 1125.0±679.6 | 1472.4±738.3 | 22.1±25.2 | 260.2±536.9 |
| POPLIN | 2330.1±320.9 | 4235.0±1133.0 | 2055.2±613.8 | 245.7±141.9 | 37.1±4.6 | 597.0±478.8 |
| MBPO | 4332.5±1277.6 | 10758.9±1413.7 | 3279.8±455.0 | 2950.4±819.1 | 26.3±13.3 | 4154.7±846.1 |
| TD3 | 956.1±66.9 | 3614.3±82.1 | 2245.3±232.4 | 1319.1±1246.1 | 40.4±8.3 | 73.8±769.0 |
| SAC (200k) | 922.0±283.0 | 6129.3±775.7 | 2365.1±193.4 | 1891.6±379.2 | 49.7±5.8 | 1642.7±606.9 |
| PE-SAC (w/o reweighting) | 4033.5±1480.5 | 11854.3±102.8 | 2202.6±363.5 | 1436.8±490.8 | 26.6±25.4 | 2673.8±2264.8 |
| ReW-PE-SAC (ours) | 4614.4±931.1 | 9779.8±546.6 | 2824.0±159.9 | 11755.9±11152.2 | 82.2±33.4 | 4961.9±457.8 |
| SAC (1000k) | 4994.9±719.5 | 10283.8±648.4 | 2990.3±214.3 | 29122.5±11129.0 | 86.8±6.4 | 5094.0±1371.3 |

The complete algorithm is shown in Alg. 1. In our algorithm, the real transitions are used not only to train the dynamics models but also to train the action-value and policy networks. Training on real samples prevents the prediction errors of the action-value function from growing too large, and when the predicted weights of the generated samples are very low, the real samples keep the algorithm from stagnating.
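A minimal sketch of how real and reweighted imaginary samples can share one update, assuming real samples keep weight 1 (the batching and normalization choices here are illustrative, not the paper's exact scheme):

```python
import numpy as np

def mixed_weighted_loss(real_err, imag_err, imag_weights):
    """Weighted squared-error loss over a batch that mixes real
    transitions (weight 1) with reweighted imaginary transitions.
    If all imaginary weights are 0, the loss falls back to the
    real-sample loss, so training never stagnates."""
    errors = np.concatenate([real_err, imag_err])
    weights = np.concatenate([np.ones(len(real_err)), imag_weights])
    return float(np.sum(weights * errors ** 2) / np.sum(weights))
```

Because the real samples always contribute with full weight, driving all imaginary weights to zero merely recovers model-free training rather than halting learning.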
Experiments
In this section, we evaluate our algorithm on six complex continuous control tasks from the model-based RL benchmark (Wang et al., 2019), which is modified from the OpenAI Gym benchmark suite (Brockman et al., 2016). The six tasks are Ant, HalfCheetah, Hopper, SlimHumanoid, Swimmer-v0, and Walker2D, whose horizon length is fixed to 1000. The network architecture and training hyperparameters are given in the appendix. First, we compare ReW-PE-SAC against state-of-the-art model-free and model-based approaches on the benchmark. Then, we show the differences in the Q-value losses with and without the reweighting method. Next, we evaluate the robustness of our algorithm to imperfect dynamics models. Finally, we analyze the relation between the learned weights and the training iteration, the planning horizon, and the exploration policy.
Comparison with State of the Art
We compare ReW-PE-SAC with state-of-the-art model-free and model-based RL methods, including SAC (Haarnoja et al., 2018a,b)¹, TD3 (Fujimoto, Hoof, and Meger, 2018), ME-TRPO (Kurutach et al., 2018), MB-MPO (Clavera et al., 2018), PETS (Chua et al., 2018), MBPO (Janner et al., 2019), and POPLIN (Wang and Ba, 2019). We reproduce the results from (Wang et al., 2019; Janner et al., 2019) and additionally run MBPO on the Slimhumanoid and Swimmer tasks, as the corresponding experimental results are absent. We run our method ReW-PE-SAC with multiple random seeds. To evaluate our reweighting mechanism, we also run PE-SAC, which does not learn the weight function and directly uses the imaginary transitions to train the policy and value networks, on these six tasks. To measure the sample efficiency of ReW-PE-SAC, we additionally run SAC for a larger number of timesteps on each task (SAC (1000k) in Table 1). The results are summarized in Table 1, and the learning curves of SAC and our methods with and without reweighting are plotted in Figure 2.

¹We use the PyTorch implementation of Soft Actor-Critic at https://github.com/pranz24/pytorch-soft-actor-critic to evaluate performance. This implementation uses a double-Q network, ignores the artificial terminal signal, and applies other tricks, so its performance is better than that reported in (Wang et al., 2019).

As shown in Table 1, ReW-PE-SAC achieves better performance than all the other state-of-the-art algorithms in all the environments, with MBPO as the only exception. Especially in Ant, Hopper, Swimmer, and Walker2d, the performance of ReW-PE-SAC is comparable to that of the long-run SAC (1000k), which demonstrates that ReW-PE-SAC has good sample efficiency. Compared with MBPO, ReW-PE-SAC is better in four environments and slightly weaker on HalfCheetah and Hopper.
Comparing our method with and without reweighting (ReW-PE-SAC vs. PE-SAC), the performance with reweighting is clearly higher in most environments. This demonstrates that the learned weight function provides appropriate weights that facilitate training a better policy. The performance gap between ReW-PE-SAC and PE-SAC on HalfCheetah is probably because the weight function is overcautious there, so the weights it provides are too low.
From Figure 2(a,d), we find that our method has a large performance variance on Ant and Slimhumanoid. The most likely reason is that our method uses the collected transitions to evaluate the effect of imaginary transitions, and the number of collected transitions is insufficient for some tasks. Consequently, the weights of some valid imaginary transitions may be underestimated, and the learned policy then suffers from the lack of these valid transitions. We will consider constructing a more reasonable validation set in future work.
The Critic Losses of PE-SAC and ReW-PE-SAC
In this section, we compare the critic losses with and without reweighting. We run PE-SAC and ReW-PE-SAC on Ant, HalfCheetah, SlimHumanoid, and Swimmer, and record the average critic loss on real samples in every episode. The minimum, maximum, and mean of the losses at each timestep are plotted in Figure 3.

As shown in the figure, ReW-PE-SAC maintains lower critic losses than PE-SAC and prevents abnormally large losses. Combined with the learning curve for Swimmer (Figure 2(e)), we find that the performance of PE-SAC falls after a certain number of timesteps, while the critic loss also increases sharply around the same time. So maintaining lower losses contributes to improved performance in most cases. The only exception is HalfCheetah, where the lower critic losses do not yield higher performance; the most likely reason is that even an imprecise Q-value function suffices to train a good policy there.
Robustness to Imperfect Dynamics Model
We construct dynamics models with different prediction accuracies by varying the number of their hidden layers. We run PE-SAC and ReW-PE-SAC with these dynamics models on the Ant task. The learning curves are plotted in Figure 4.

When the number of hidden layers is decreased, the performance of PE-SAC drops significantly, which means that the dynamics models with fewer hidden layers have a strong negative effect on the training process. The performance of ReW-PE-SAC remains roughly unchanged, which means that our method effectively reduces the negative effect of generated samples with prediction errors. This analysis offers a possible explanation for why ReW-PE-SAC achieves a larger performance improvement on the more complex tasks, such as Slimhumanoid and Walker2d.
The Trend of the Predicted Weight
In this section, we analyze the overall trend of the predicted weights and their relation to the prediction depth and the temperature scale k. We run ReW-PE-SAC on Swimmer with a single random seed and record the predicted weights of the generated samples at the first step of each episode. Because the predicted weights change over the course of training, averaging over different seeds is meaningless. The 25th percentile, median, and 75th percentile are plotted in Figure 5(a). Then, we split the weights by prediction depth and plot the median weight at each depth in Figure 5(b). Finally, we generate extra data with different values of k and plot the median predicted weight for each in Figure 5(c).
In Figure 5(a), the weights are lower in the early and late stages but higher in the middle stage (the initial output of the weight function is about 0.5, as the bias of the last layer is initialized to 0). The trend reflects the changing accuracy of the dynamics model and of the Q-value and policy functions. In the early stage, the dynamics model is imprecise, so most generated transitions are rejected. The weights then increase as the prediction precision of the dynamics model improves. In the later stage, however, the precision of the Q-value function also improves while the model has reached its bottleneck, which results in declining weights. From Figure 5(b), we find that the predicted weights decrease with planning depth, which accords with the fact that prediction errors accumulate over steps. From Figure 5(c), we also find that the weights decrease as the scale k grows, which is caused by the difference between the action distributions seen during the training and prediction of the dynamics model. These phenomena further verify that the learned weight function is reasonable.
Conclusion
In this paper, we have proposed a novel and efficient model-based reinforcement learning approach that adaptively adjusts the weights of generated transitions by training a weight function to reduce their potential negative effect. We measure the effect of reweighted imaginary transitions by computing the difference between the losses on real transitions before and after training with them, and we optimize the weight function by minimizing this difference via the chain rule.

Experimental results show that our method obtains state-of-the-art performance on multiple complex continuous control tasks. The learned weight function provides reasonable weights for different generated samples at different stages of training. We believe the weight function can also be used to adjust some hyperparameters, such as the planning horizon, in future work.
Acknowledgments
This work is funded by the National Natural Science Foundation of China (Grant Nos. 61876181, 61673375, and 61721004), the Beijing Nova Program of Science and Technology (Grant No. Z191100001119043), the Youth Innovation Promotion Association, CAS, and the Projects of the Chinese Academy of Sciences (Grant No. QYZDB-SSW-JSC006).
References
Abbeel, Quigley, and Ng (2006) Abbeel, P.; Quigley, M.; and Ng, A. Y. 2006. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML-06), 1–8.
 Brockman et al. (2016) Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Cho et al. (2014) Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing, 1724–1734.
 Chua et al. (2018) Chua, K.; Calandra, R.; McAllister, R.; and Levine, S. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 4754–4765.
 Clavera et al. (2018) Clavera, I.; Rothfuss, J.; Schulman, J.; Fujita, Y.; Asfour, T.; and Abbeel, P. 2018. ModelBased Reinforcement Learning via MetaPolicy Optimization. In Conference on Robot Learning, 617–629.
 Deisenroth and Rasmussen (2011) Deisenroth, M.; and Rasmussen, C. E. 2011. PILCO: A modelbased and dataefficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML11), 465–472.
 Depeweg et al. (2017) Depeweg, S.; HernándezLobato, J.; DoshiVelez, F.; and Udluft, S. 2017. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017Conference Track Proceedings.
 Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on machine learning (ICML17), 1126–1135. JMLR. org.
 Fujimoto, Hoof, and Meger (2018) Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing Function Approximation Error in ActorCritic Methods. In Proceedings of the 35th International Conference on machine learning (ICML18), 1587–1596.
 Gal, McAllister, and Rasmussen (2016) Gal, Y.; McAllister, R.; and Rasmussen, C. E. 2016. Improving PILCO with Bayesian neural network dynamics models. In DataEfficient Machine Learning workshop, ICML, volume 4, 34.
 Haarnoja et al. (2018a) Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018a. Soft ActorCritic: OffPolicy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on machine learning (ICML18), 1861–1870.
 Haarnoja et al. (2018b) Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. 2018b. Soft actorcritic algorithms and applications. arXiv preprint arXiv:1812.05905 .
 Heess et al. (2015) Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944–2952.
 Hospedales et al. (2020) Hospedales, T.; Antoniou, A.; Micaelli, P.; and Storkey, A. 2020. Metalearning in neural networks: A survey. arXiv preprint arXiv:2004.05439 .
 Janner et al. (2019) Janner, M.; Fu, J.; Zhang, M.; and Levine, S. 2019. When to trust your model: Modelbased policy optimization. In Advances in Neural Information Processing Systems, 12498–12509.
 Kalweit and Boedecker (2017) Kalweit, G.; and Boedecker, J. 2017. Uncertaintydriven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, 195–206.
 Kurutach et al. (2018) Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; and Abbeel, P. 2018. Modelensemble trustregion policy optimization. arXiv preprint arXiv:1802.10592 .
 Levine and Abbeel (2014) Levine, S.; and Abbeel, P. 2014. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 1071–1079.
 Levine et al. (2016) Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research 17(1): 1334–1373.
 Levine and Koltun (2013) Levine, S.; and Koltun, V. 2013. Guided policy search. In Proceedings of the 30th International Conference on machine learning (ICML13), 1–9.
 Luo et al. (2018) Luo, Y.; Xu, H.; Li, Y.; Tian, Y.; Darrell, T.; and Ma, T. 2018. Algorithmic framework for modelbased deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858 .
 Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518(7540): 529–533.
 Nagabandi et al. (2018) Nagabandi, A.; Kahn, G.; Fearing, R. S.; and Levine, S. 2018. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 7559–7566. IEEE.
 Silver et al. (2016) Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. NATURE 529(7587): 484.
 Thrun and Pratt (1998) Thrun, S.; and Pratt, L. 1998. Learning to learn: Introduction and overview. In Learning to learn, 3–17. Springer.
 Veeriah et al. (2019) Veeriah, V.; Hessel, M.; Xu, Z.; Rajendran, J.; Lewis, R. L.; Oh, J.; van Hasselt, H. P.; Silver, D.; and Singh, S. 2019. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, 9306–9317.
 Wang and Ba (2019) Wang, T.; and Ba, J. 2019. Exploring modelbased planning with policy networks. arXiv preprint arXiv:1906.08649 .
 Wang et al. (2019) Wang, T.; Bao, X.; Clavera, I.; Hoang, J.; Wen, Y.; Langlois, E.; Zhang, S.; Zhang, G.; Abbeel, P.; and Ba, J. 2019. Benchmarking modelbased reinforcement learning. arXiv preprint arXiv:1907.02057 .
 Xu, van Hasselt, and Silver (2018) Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Metagradient reinforcement learning. In Advances in neural information processing systems, 2396–2407.
 Zheng, Oh, and Singh (2018) Zheng, Z.; Oh, J.; and Singh, S. 2018. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, 4644–4654.