1 Introduction
Despite the tremendous progress made by model-free reinforcement learning (RL) algorithms in games (Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019) and biomechanical control, poor sample efficiency remains a major challenge for RL, especially when interacting with the real world. A promising direction is to integrate a dynamics model into the learning process to improve sample efficiency (Sutton, 1991; Calandra et al., 2016; Kalweit and Boedecker, 2017; Oh et al., 2017; Racanière et al., 2017). However, classic model-based reinforcement learning (MBRL) methods often lag behind model-free methods (MFRL) asymptotically, especially in stochastic environments where the dynamics models are difficult to learn. Hybrid combinations of MFRL and MBRL (HybridRL for short) have also attracted attention. Considerable effort has been devoted to this field, including the Dyna algorithm (Sutton, 1991), model-based value expansion (Feinberg et al., 2018), and I2A (Weber et al., 2017).
The robustness of the learned policy is another concern in RL. In MFRL, off-policy methods typically suffer from this problem, and their performance can drop suddenly (Duan et al., 2016). A promising solution is to avoid risky decisions. Risk-sensitive MFRL not only maximizes the expected return but also tries to reduce catastrophic outcomes (García and Fernández, 2015; Dabney et al., 2018a; Pan et al., 2019). In MBRL and HybridRL, failing to model the uncertainty in the environment (especially with continuous states and actions) often leads to higher function approximation errors and poorer performance. Complete modeling of the uncertainty in transitions has been shown to improve performance significantly (Chua et al., 2018); however, reducing risk in MBRL and HybridRL has not yet been sufficiently studied.
To achieve sample efficiency and robustness at the same time, we propose a new HybridRL method that is more capable of handling stochastic and risky environments. The proposed method, Risk Averse Value Expansion (RAVE), extends model-based value expansion (MVE) (Feinberg et al., 2018) and stochastic ensemble value expansion (STEVE) (Buckman et al., 2018). We analyse the higher approximation error of model-based methods in stochastic environments. Based on this analysis, we borrow ideas from uncertainty modeling (Chua et al., 2018) and risk-averse reinforcement learning. The probabilistic ensemble environment model captures not only the variance in estimation (also called epistemic uncertainty) but also the stochastic nature of the environment's transitions (also called aleatoric uncertainty). Utilizing the ensemble of estimates, we further adopt a dynamic confidence lower bound of the target value function to make the policy more risk-sensitive. We compare RAVE with prior MFRL and HybridRL baselines, showing that RAVE not only yields state-of-the-art (SOTA) expected performance but also improves the robustness of the policy. With this technique, our solution won first place in the NeurIPS 2019 "Learn to Move" challenge, with a gap of 144 points over second place.¹

¹ https://www.aicrowd.com/challenges/neurips2019learntomovewalkaround/leaderboards

2 Related Works
Model-based value expansion (MVE) (Feinberg et al., 2018) is a HybridRL algorithm. Unlike typical MFRL methods such as DQN, which use only one-step bootstrapping, MVE uses imagined rollouts to predict the target value. Though the results are promising, they rely on task-specific tuning of the rollout length. Following this thread, stochastic ensemble value expansion (STEVE) (Buckman et al., 2018) interpolates value expansions of different horizon lengths. The accuracy of each expansion is estimated through an ensemble of environment models as well as value functions. An ensemble of environment models also models uncertainty to some extent; however, an ensemble of deterministic models mainly captures epistemic uncertainty rather than stochastic transitions (Chua et al., 2018).

Uncertainty is typically divided into three classes (Geman et al., 1992). The noise is the randomness inherent in the environment itself, e.g., stochastic transitions, and is also called aleatoric uncertainty (Chua et al., 2018). The model bias is the error produced by the limited expressive power of the approximating functions, measured by the expected gap between the ground truth and the prediction of the model when infinite training data is provided. The variance is the uncertainty brought by insufficient training data, and is also called epistemic uncertainty. Dabney et al. (2018b) discuss epistemic and aleatoric uncertainty and focus on the latter to improve distributional RL. Recent work suggests that an ensemble of probabilistic models (PE) models uncertainty more thoroughly (Chua et al., 2018), whereas simply aggregating deterministic models captures only epistemic uncertainty. Aleatoric uncertainty is more closely related to stochastic transitions, while epistemic uncertainty is of interest to many works on the trade-off between exploitation and exploration (Pathak et al., 2017; Schmidhuber, 2010; Oudeyer and Kaplan, 2009). Other works adopt an ensemble of deterministic value functions for exploration (Osband et al., 2016; Buckman et al., 2018).
Risk in RL typically refers to the inherent uncertainty of the environment and the fact that a policy may perform poorly in some cases (García and Fernández, 2015). Risk-sensitive learning requires not only maximizing expected rewards but also lowering the variance and risk in performance. Toward this objective, some works use the variance of the return (Sato et al., 2001; Pan et al., 2019; Reddy et al., 2019) or the worst-case outcome (Heger, 1994; Gaskett, 2003) in policy learning (Pan et al., 2019; Reddy et al., 2019), exploration (Smirnova et al., 2019), or distributional value estimates (Dabney et al., 2018a). An interesting issue is that risk reduction typically conflicts with the exploration and exploitation that try to maximize reward in the long run. Pan et al. (2019) introduce two adversarial agents (one risk-averse and one seeking long-term reward) that act in combination to resolve the trade-off between risk-sensitive and risk-seeking (exploratory) behavior in RL. In this paper, we propose a dynamic confidence bound for this purpose.
A number of prior works have studied the function approximation error that leads to overestimation and suboptimal solutions in MFRL. Double DQN (Van Hasselt et al., 2016) improves over DQN by disentangling the target value function from the target policy that pursues the maximum value. The authors of TD3 (Fujimoto et al., 2018) suggest that systematic overestimation of the value function also exists in actor-critic MFRL. They use an ensemble of two value functions, with the minimum estimate being used as the target value. Selecting the lower value estimate is similar to using uncertainty or a lower confidence bound, as adopted by other risk-sensitive methods (Pan et al., 2019), though these works have different motivations.
3 Preliminaries
3.1 Actor-Critic Model-free Reinforcement Learning
A Markov Decision Process (MDP) describes the process of an agent interacting with the environment. The agent selects an action $a_t$ at each time step $t$. After executing the action, it receives a new observation $s_{t+1}$ and a feedback $r_t$ from the environment. As we focus mainly on environments with continuous actions, we denote the policy that the agent uses to select actions as $a_t = \pi_\theta(s_t)$. As the interaction continues, the agent generates a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ following the policy $\pi_\theta$. For finite-horizon MDPs, we use the indicator $d(s_t) \in \{0, 1\}$ to mark whether the episode has terminated. The objective of RL is to find the optimal policy that maximizes the expected discounted sum of rewards along the trajectory. The value of performing the action $a$ under the policy $\pi_\theta$ at the state $s$ is defined by $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_t \mid s_0 = s, a_0 = a\right]$, where $\gamma$ is the discount factor. The Q-value function can be updated with the Bellman equation by minimizing the Temporal Difference (TD) error:

$$L_\phi = \mathbb{E}_{\tau}\left[\left(Q^{target}(r_t, s_{t+1}) - \hat{Q}_\phi(s_t, a_t)\right)^2\right] \quad (1)$$

The target Q-values $Q^{target}(r_t, s_{t+1}) = r_t + \gamma\,(1 - d(s_{t+1}))\,\hat{Q}_{\phi'}(s_{t+1}, \pi_\theta(s_{t+1}))$ are estimated by a target network $\hat{Q}_{\phi'}$, where $\phi'$ is a delayed copy of the parameters $\phi$ of the function approximator (Lillicrap et al., 2015).
To optimize the deterministic policy function in a continuous action space, deep deterministic policy gradient (DDPG) maximizes the value function (or, equivalently, minimizes the negative value function) under the policy $\pi_\theta$:

$$L_\theta = -\mathbb{E}_{\tau}\left[\hat{Q}_\phi(s_t, \pi_\theta(s_t))\right] \quad (2)$$
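As an illustration of the TD update described above, the following minimal Python sketch computes one-step bootstrapped targets and the mean squared TD error for a batch of transitions. The callables `q`, `target_q`, and `policy`, the tuple layout of a transition, and the soft-update rate `tau` are hypothetical names introduced here for illustration; they are assumptions and are not taken from the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]
# (state, action, reward, next_state, done), with done in {0.0, 1.0}
Transition = Tuple[State, Action, float, State, float]
QFunction = Callable[[State, Action], float]
Policy = Callable[[State], Action]


def td_target(reward: float, next_state: State, done: float, gamma: float,
              target_q: QFunction, policy: Policy) -> float:
    """One-step bootstrapped target r + gamma * (1 - d) * Q'(s', pi(s'))."""
    next_action = policy(next_state)
    return reward + gamma * (1.0 - done) * target_q(next_state, next_action)


def td_loss(batch: List[Transition], gamma: float, q: QFunction,
            target_q: QFunction, policy: Policy) -> float:
    """Mean squared TD error over a batch of transitions."""
    errors = []
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, next_state, done, gamma, target_q, policy)
        errors.append((y - q(state, action)) ** 2)
    return sum(errors) / len(errors)


def soft_update(target_params: List[float], online_params: List[float],
                tau: float) -> List[float]:
    """Move the delayed target parameters a fraction tau toward the online ones
    (one common way to keep a delayed copy; an assumption of this sketch)."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

When the termination indicator is 1, the bootstrap term vanishes and the target reduces to the immediate reward.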
3.2 Environment Modeling
To model the environment in continuous space, an environment model is typically composed of three individual mapping functions: $\hat{f}_{r,\zeta_r}(s_t, a_t, s_{t+1})$, $\hat{f}_{s,\zeta_s}(s_t, a_t)$, and $\hat{f}_{d,\zeta_d}(s_{t+1})$, which approximate the feedback, the next state, and the probability of the terminal indicator, respectively (Gu et al., 2016; Feinberg et al., 2018). With the environment model, starting from $s_t$, we can predict the next state and reward by $\hat{s}_{t+1} = \hat{f}_{s,\zeta_s}(s_t, a_t)$ and $\hat{r}_t = \hat{f}_{r,\zeta_r}(s_t, a_t, \hat{s}_{t+1})$, and this process can continue to generate a complete imagined trajectory $(s_t, a_t, \hat{r}_t, \hat{s}_{t+1}, \hat{a}_{t+1}, \hat{r}_{t+1}, \dots)$.

3.3 Uncertainty Aware Prediction
The deterministic model approximates only the expectation and cannot capture either aleatoric or epistemic uncertainty. Following recent work on uncertainty (Chua et al., 2018), we briefly review different uncertainty modeling techniques.
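A probabilistic model outputs a distribution (e.g., the mean and variance of a Gaussian distribution) instead of an expectation. Taking the reward component of the environment model as an example, the probabilistic model is written as $\hat{r} \sim \mathcal{N}\left(\mu_{\zeta_r}(s_t, a_t, s_{t+1}),\, \sigma^2_{\zeta_r}(s_t, a_t, s_{t+1})\right)$, and the loss function is the negative log-likelihood:

$$L_{\zeta_r} = -\mathbb{E}\left[\log \mathcal{N}\left(r_t;\ \mu_{\zeta_r}(s_t, a_t, s_{t+1}),\ \sigma^2_{\zeta_r}(s_t, a_t, s_{t+1})\right)\right] \quad (3)$$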
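For a Gaussian output, the negative log-likelihood has a simple closed form. The sketch below is a minimal illustration in plain Python; the function names are hypothetical, and the scalar (one-dimensional) output is an assumption made to keep the example short.

```python
import math
from typing import Sequence


def gaussian_nll(target: float, mean: float, var: float) -> float:
    """Negative log-likelihood of `target` under a Gaussian N(mean, var)."""
    if var <= 0.0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mean) ** 2 / var)


def mean_gaussian_nll(targets: Sequence[float], means: Sequence[float],
                      variances: Sequence[float]) -> float:
    """Average negative log-likelihood over a batch of scalar predictions."""
    if not (len(targets) == len(means) == len(variances)):
        raise ValueError("inputs must have the same length")
    losses = [gaussian_nll(t, m, v)
              for t, m, v in zip(targets, means, variances)]
    return sum(losses) / len(losses)
```

For a fixed prediction error, this loss is smallest when the predicted variance equals the squared error, which is why the learned variance can track the noise level of the data.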
An ensemble of deterministic models (DE) maintains an ensemble of parameters, each of which is typically trained with a unique dataset. Given the ensemble of parameters $\zeta_r = \{\zeta_{r,1}, \dots, \zeta_{r,n}\}$, the variance of the predicted values across the ensemble members measures the prediction uncertainty.
The variance output by the probabilistic model in equation (3) mainly captures the aleatoric uncertainty, and the variance across the ensemble members mainly captures the epistemic uncertainty (Chua et al., 2018).
An ensemble of probabilistic models (PE) keeps track of a collection of distributions, one Gaussian per ensemble member, which can capture both the aleatoric and the epistemic uncertainty.
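The distinction between the two kinds of uncertainty can be made concrete with the law of total variance. The sketch below assumes that every ensemble member outputs a Gaussian mean and variance for the same input and that the members are weighted equally; the function name and this equal-weight mixture view are assumptions made for illustration only.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def decompose_uncertainty(means: Sequence[float],
                          variances: Sequence[float]) -> Tuple[float, float, float]:
    """Split the variance of an equally weighted Gaussian mixture.

    aleatoric: average of the variances predicted by each member
    epistemic: variance of the predicted means across members
    total:     their sum (law of total variance)
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need one mean and one variance per ensemble member")
    aleatoric = fmean(variances)
    epistemic = pvariance(means)
    return aleatoric, epistemic, aleatoric + epistemic
```

A deterministic ensemble corresponds to the case where all predicted variances are zero, so only the epistemic term remains.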
3.4 ModelBased Value Expansion
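MVE (Feinberg et al., 2018) uses the learned environment model and the policy to imagine a trajectory. We define the imaginative trajectory with the rollout horizon H as $\hat{\tau}_H = (\hat{s}_0, \hat{a}_0, \hat{r}_0, \dots, \hat{s}_H, \hat{a}_H)$. For the imaginative trajectory starting from state $\hat{s}_0 = s_{t+1}$ and action $\hat{a}_0 = \pi_\theta(s_{t+1})$, we can write $\hat{s}_{i+1} = \hat{f}_{s,\zeta_s}(\hat{s}_i, \hat{a}_i)$, $\hat{a}_{i+1} = \pi_\theta(\hat{s}_{i+1})$, and $\hat{r}_i = \hat{f}_{r,\zeta_r}(\hat{s}_i, \hat{a}_i, \hat{s}_{i+1})$.
MVE defines the target values based on the imaginative trajectory as:

$$\hat{Q}^{MVE}_{H}(r_t, s_{t+1}) = r_t + \sum_{i=1}^{H} \gamma^{i} D_i\, \hat{r}_{i-1} + \gamma^{H+1} D_{H+1}\, \hat{Q}_{\phi'}(\hat{s}_H, \hat{a}_H), \qquad D_i = \prod_{j=0}^{i-1}\left(1 - \hat{f}_{d,\zeta_d}(\hat{s}_j)\right)$$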
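A minimal sketch of such a value-expansion target is given below. It assumes the convention that the first reward is the observed one, that the imagined rollout starts from the observed next state, and that each imagined state comes with a termination probability; with an empty rollout it reduces to the usual one-step TD target. The function name and argument layout are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def mve_target(reward: float, imagined_rewards: Sequence[float],
               done_probs: Sequence[float], bootstrap_value: float,
               gamma: float) -> float:
    """Value-expansion target for a rollout of horizon H.

    reward:           observed reward of the real transition
    imagined_rewards: H model-predicted rewards along the imagined rollout
    done_probs:       H + 1 termination probabilities, one per imagined state
                      (the first imagined state is the observed next state)
    bootstrap_value:  target-network value of the last imagined state-action
    """
    if len(done_probs) != len(imagined_rewards) + 1:
        raise ValueError("expected one more termination probability than rewards")
    target = reward
    discount = gamma   # gamma ** i, starting at i = 1
    alive = 1.0        # probability that the imagined episode is still running
    for r_hat, d in zip(imagined_rewards, done_probs):
        alive *= 1.0 - d
        target += discount * alive * r_hat
        discount *= gamma
    alive *= 1.0 - done_probs[-1]
    return target + discount * alive * bootstrap_value
```

Once an imagined state is marked as terminal, all later rewards and the bootstrap value are masked out.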
3.5 Stochastic Ensemble Value Expansion
Selecting a proper horizon H for value expansion is important for achieving high sample efficiency and asymptotic accuracy at the same time. Although increasing H brings greater predictive ability, asymptotic accuracy is sacrificed because of the growing reliance on the environment model.
To reduce the difficulty of selecting proper horizons for different environments, STEVE (Buckman et al., 2018) proposes to interpolate the estimated values of different horizons. Modeling the dynamics and the Q function with ensembles of models, STEVE decides the weight for each rollout step by considering the variance of the ensemble predictions. We denote the parameters $\zeta_r$ and $\zeta_s$ for the dynamics models $\hat{f}_r$ and $\hat{f}_s$ respectively, and $\phi$ for the Q function. The parameter set for the ensemble can be denoted as $\{\zeta_{r,i}, \zeta_{s,i}, \phi_i\}_{i=1}^{n}$, where n represents the ensemble size of each function. The target values in STEVE can be expressed as:

$$\hat{Q}^{STEVE}_{H} = \sum_{i=0}^{H} \frac{\tilde{w}_i}{\sum_{j=0}^{H} \tilde{w}_j}\, \mathbb{E}\left[\hat{Q}^{MVE}_{i}\right], \qquad \tilde{w}_i^{-1} = \mathrm{Var}\left[\hat{Q}^{MVE}_{i}\right] \quad (4)$$

where $\hat{Q}^{MVE}_{i}$ represents the outcomes of the ensemble models for horizon $i$, with $\mathbb{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ representing their mean and variance respectively. More details about the ensemble prediction can be found in the appendix.
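The interpolation can be sketched as an inverse-variance weighted average of the per-horizon ensemble means. In the sketch below, `candidates[h]` holds the ensemble of target estimates for horizon `h`; the function name, the use of the population variance, and the small constant `eps` that guards against division by zero are assumptions of this illustration.

```python
from statistics import fmean, pvariance
from typing import Sequence


def steve_target(candidates: Sequence[Sequence[float]], eps: float = 1e-8) -> float:
    """Inverse-variance weighted interpolation over rollout horizons.

    candidates[h] is the list of ensemble target estimates for horizon h.
    Horizons whose ensemble members disagree more receive a smaller weight.
    """
    means = [fmean(c) for c in candidates]
    weights = [1.0 / (pvariance(c) + eps) for c in candidates]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, means)) / total
```

For example, if horizon 0 has estimates (9, 11) and horizon 1 has estimates (18, 22), the weights are roughly 1 and 0.25, so the interpolated target stays closer to the more consistent horizon-0 mean.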
4 Investigation of the approximation error in stochastic environments
To thoroughly investigate the impact of incomplete uncertainty modeling on HybridRL methods, we construct a demonstrative toy environment (fig. 1(a)). The agent starts from an initial state and chooses an action at each time step. We compare two versions of the environment: one with a deterministic transition and one in which noise is added to the transition. The episode terminates at the terminal state, where the agent acquires a final reward. The agent gets a constant penalty at each time step to encourage it to reach the terminal state as soon as possible. Note that the deterministic environment actually requires more steps in expectation to reach the terminal state than the stochastic environment, so the value function at the starting point of the deterministic environment (ground truth = 380+) tends to be lower than that of the stochastic environment (ground truth = 430+).
We apply DDPG, MVE, and STEVE to this environment and plot the changes of the estimated values at the starting point (see fig. 1).
The results show that, in the deterministic environment, the Q-values estimated by all methods converge asymptotically to the ground truth. However, after adding the noise, previous MFRL and HybridRL methods show various levels of overestimation. Feinberg et al. (2018) claim that value expansion improves the quality of the estimated values, but MVE and STEVE actually give even worse predictions than model-free DDPG in the stochastic environment. A potential explanation is that the overall overestimation comes from the unavoidable imprecision of the estimator (Thrun and Schwartz, 1993; Fujimoto et al., 2018), while HybridRL additionally suffers from the approximation error of the dynamics model. When a deterministic environment model is used, the predicted transitions of both environments are identical, because the deterministic dynamics model tends to approximate the expectation of the next state. This results in the same estimated values in the two environments for both value expansion methods, although the ground truth Q-values differ between them. As a result, the deterministic environment model introduces additional approximation error, leading to extra bias in the estimated value.
5 Methodology
5.1 Risk Averse Value Expansion
We propose two main improvements over MVE and STEVE. First, we apply an ensemble of probabilistic models (PE) to enable the environment model to capture both aleatoric and epistemic uncertainty. Second, inspired by risk-sensitive RL, we calculate a confidence lower bound of the target values, using the variance arising from aleatoric and epistemic uncertainty.
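Before introducing RAVE, we start with Distributional Value Expansion (DVE). Compared with MVE, which uses a deterministic environment model and value function, DVE uses a probabilistic environment model, and we independently sample the reward and the next state from the probabilistic environment models:

$$\tilde{s}_{i+1} \sim \mathcal{N}\left(\mu_{\zeta_s}(\tilde{s}_i, \tilde{a}_i),\ \sigma^2_{\zeta_s}(\tilde{s}_i, \tilde{a}_i)\right), \qquad \tilde{r}_i \sim \mathcal{N}\left(\mu_{\zeta_r}(\tilde{s}_i, \tilde{a}_i, \tilde{s}_{i+1}),\ \sigma^2_{\zeta_r}(\tilde{s}_i, \tilde{a}_i, \tilde{s}_{i+1})\right) \quad (5)$$

We apply the distributional expansion starting from $\tilde{s}_0 = s_{t+1}$ to acquire the trajectory $\tilde{\tau}_H = (\tilde{s}_0, \tilde{a}_0, \tilde{r}_0, \dots, \tilde{s}_H, \tilde{a}_H)$, which leads to the definition of DVE:

$$\hat{Q}^{DVE}_{H}(r_t, s_{t+1}) = r_t + \sum_{i=1}^{H} \gamma^{i} D_i\, \tilde{r}_{i-1} + \gamma^{H+1} D_{H+1}\, \hat{Q}_{\phi'}(\tilde{s}_H, \tilde{a}_H) \quad (6)$$

where $D_i = \prod_{j=0}^{i-1}\left(1 - \hat{f}_{d,\zeta_d}(\tilde{s}_j)\right)$ is the probability that the imagined episode has not terminated before step $i$.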
With an ensemble of dynamics models and value functions, we can compute multiple estimates based on different combinations of dynamics models and Q functions (details about the combination can be found in the appendix). We then compute the mean and the variance over the ensemble of DVE estimates, and by subtracting a certain proportion ($\alpha$) of the standard deviation, we acquire a lower bound of the DVE estimate. We call this estimate of the value function the $\alpha$-confidence lower bound (CLB):

$$\hat{Q}^{CLB}_{H} = \mathbb{E}\left[\hat{Q}^{DVE}_{H}\right] - \alpha \sqrt{\mathrm{Var}\left[\hat{Q}^{DVE}_{H}\right]} \quad (7)$$

The variance of the DVE estimates consists of aleatoric and epistemic uncertainty, and we present the proof of convergence in the Appendix using the Bellman backup. Subtracting a variance term is a commonly used technique in risk-sensitive RL (Sato and Kobayashi, 2000; Pan et al., 2019; Reddy et al., 2019). It tries to suppress the utility of high-variance trajectories to avoid possible risks. Finally, we define RAVE, which adopts an interpolation among different horizon lengths similar to that of STEVE, based on DVE and CLB:

$$\hat{Q}^{RAVE}_{H} = \sum_{i=0}^{H} \frac{\tilde{w}_i}{\sum_{j=0}^{H} \tilde{w}_j}\, \hat{Q}^{CLB}_{i}, \qquad \tilde{w}_i^{-1} = \mathrm{Var}\left[\hat{Q}^{DVE}_{i}\right] \quad (8)$$
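The two steps described above, taking a lower confidence bound per horizon and then interpolating across horizons, can be sketched as follows. The sketch assumes that the sampled target estimates for each horizon are already available as plain lists of floats, that the interpolation weights are inverse variances as in STEVE, and that the population variance is used; the function names and the constant `eps` are hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance
from typing import Sequence


def confidence_lower_bound(estimates: Sequence[float], alpha: float) -> float:
    """Ensemble mean minus alpha times the ensemble standard deviation."""
    return fmean(estimates) - alpha * sqrt(pvariance(estimates))


def rave_target(candidates: Sequence[Sequence[float]], alpha: float,
                eps: float = 1e-8) -> float:
    """Inverse-variance interpolation of per-horizon lower confidence bounds.

    candidates[h] is the list of sampled target estimates for horizon h.
    """
    bounds = [confidence_lower_bound(c, alpha) for c in candidates]
    weights = [1.0 / (pvariance(c) + eps) for c in candidates]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, bounds)) / total
```

With `alpha = 0` the bound coincides with the ensemble mean, so the sketch falls back to a plain inverse-variance interpolation; larger values of `alpha` penalise horizons whose estimates disagree.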
5.2 Adaptive Confidence Bound
An unsolved problem in RAVE is how to select a proper confidence bound factor α. The requirements of risk aversion and exploration are somewhat competing: risk aversion seeks to minimize the variance, while exploration searches for states with higher variance (uncertainty). The agent needs to explore the state space sufficiently, and it should become more risk-sensitive as the model converges. Epistemic uncertainty is a proper indicator of how well our model knows the state space. In MBRL and HybridRL, a common technique for monitoring epistemic uncertainty is to evaluate the ability of the learned environment model to predict the consequences of the agent's own actions (Pathak et al., 2017).
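We set the confidence bound factor to depend on the current state and action, denoted as $\alpha(s_t, a_t)$. We want $\alpha(s_t, a_t)$ to be larger when the environment model can predict the state perfectly, so that the agent becomes more risk-sensitive, and smaller when the prediction is noisy, to allow more exploration. We define

$$\alpha(s_t, a_t) = \max\left(0,\ \alpha - Z\,\left\lVert \mathbb{E}\left[\hat{f}_{s,\zeta_s}(s_t, a_t)\right] - s_{t+1} \right\rVert^2\right) \quad (9)$$

where $\mathbb{E}[\hat{f}_{s,\zeta_s}(s_t, a_t)]$ is the mean next state predicted by the environment model, $s_{t+1}$ is the observed next state, and $Z$ is a scaling factor for the prediction error. With a slight abuse of notation, we use $\alpha$ here to represent a constant hyperparameter, and $\alpha(s_t, a_t)$ is the factor that is actually used in CLB. $\alpha(s_t, a_t)$ takes a value near zero at first and gradually increases to $\alpha$ as learning proceeds.

6 Experiments and Analysis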
We evaluate RAVE on continuous control environments using the MuJoCo physics simulator (Todorov et al., 2012), Roboschool (Klimov and Schulman, 2017), and osim-based environments (Seth et al., 2018). The baselines include the model-free DDPG, proximal policy optimization (PPO) (Schulman et al., 2017), and STEVE, which yields SOTA HybridRL performance. We also compare RAVE with SOTA MFRL methods, including twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) and the soft actor-critic (SAC) algorithm (Haarnoja et al., 2018), using the author-provided implementations. To further demonstrate robustness in complex environments, we also evaluate RAVE on the osim-rl environment with the learning-to-run task. For details about the environment settings and implementation, please refer to the supplementary materials.²

² We present training details of the 1st place solution in the supplementary materials due to the limited space of the paper.
6.1 Experimental Results
We carried out experiments on the nine environments shown in fig. 2. Among the compared methods, PPO has very poor sample efficiency, as it needs a large batch size to learn stably (Haarnoja et al., 2018). On Hopper and Walker2d, STEVE lags behind the best MFRL methods (TD3 and SAC), which perform quite well in many environments. Overall, RAVE performs favorably in most environments in both sample efficiency and asymptotic performance.
6.2 Analysis
Distribution of Value Function Approximation. While the RAVE estimation on the toy environment has demonstrated its improvement in value estimation, we further investigate whether it predicts the value function more precisely in a more complicated environment. We plot the predicted values against the ground truth values on Hopper-v1 in fig. 3. The ground truth value here is calculated by directly adding the rewards of the remaining trajectory, so it is closer to a Monte Carlo sample from the ground truth distribution and is quite noisy. To better demonstrate the distribution of points, we draw confidence ellipses representing the density. The points are extracted from the model at 1M environment steps. In DDPG and STEVE, the predicted values align poorly with the ground truth, while RAVE yields better alignment, though with slight underestimation.
Investigation on dynamic confidence bound. To study the role played by the confidence lower bound separately, we further carried out ablation experiments. We compared RAVE (α is a constant), RAVE (dynamic α), and other baselines in fig. 4.
We can see that RAVE with α = 0 already surpasses the performance of STEVE on Hopper, showing that modeling aleatoric uncertainty through PE indeed benefits the performance of value expansion. A larger margin is attained by introducing CLB. A very large constant α (which means a lower CLB) can quickly stabilize the performance, but the performance stays low due to a lack of exploration, while a smaller α (constant α = 0.5) generates larger fluctuations in performance. The dynamic adjustment of α facilitates both a quick rise and stable performance.
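To make the behaviour of the dynamic factor concrete, the sketch below assumes that it is obtained by subtracting a scaled squared prediction error of the environment model from a constant and clipping the result at zero. This particular form, the function name, and the argument names are assumptions made for illustration.

```python
from typing import Sequence


def adaptive_alpha(alpha_max: float, scale: float,
                   predicted_next_state: Sequence[float],
                   observed_next_state: Sequence[float]) -> float:
    """State-dependent confidence factor in the range [0, alpha_max].

    The factor is alpha_max when the model predicts the next state exactly
    and shrinks toward zero as the squared prediction error grows.
    """
    if len(predicted_next_state) != len(observed_next_state):
        raise ValueError("state vectors must have the same dimension")
    error = sum((p - o) ** 2
                for p, o in zip(predicted_next_state, observed_next_state))
    return max(0.0, alpha_max - scale * error)
```

Early in training the prediction error is large, so the factor is close to zero and the agent is free to explore; as the model improves, the factor approaches its maximum and the target becomes more conservative.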
Analysis on Robustness. We also evaluated the robustness of RAVE and the baselines by testing the falling rates of the learned agents. We test the performance of various algorithms on the LearnToRun task, as the agent is reported to be prone to falling in this environment (Kidzinski et al., 2019). As shown in fig. 4(b), RAVE achieves the lowest falling rate in a short time compared with the baselines. We also found that RAVE played an important role in the competition, where the future velocity commands depend on the future position. This dependence adds additional bias to the Q-value estimation, and in such cases RAVE, with its risk-averse Q-values, can stabilize policy learning. In our experiments, the DDPG-based policy fell with a probability of 15%, while the agent learned with RAVE performed more stably, falling with a probability of 1.3%.
Computational Complexity. A main concern about RAVE may be its computational complexity.
For the training stage, the additional training cost of RAVE compared with STEVE arises from the sampling operation. In our experiments, RAVE takes 13.20s to finish training 500 batches with a batch size of 512, an increase of 24.29% over STEVE (10.62s).³
For the inference stage, RAVE requires exactly the same computational resources as other model-free actor-critic methods, as only the learned policy is used for inference.

³ The time reported here was measured on 2 Nvidia P40 GPUs with 8 CPUs (2.00GHz), with the same number of candidate target values generated for both STEVE and RAVE.
7 Conclusion
In this paper, we raise the problem of incomplete modeling of uncertainty and insufficient robustness in model-based value expansion. To address the issue, we introduce an ensemble of probabilistic models to better approximate the environment, as well as the α-confidence lower bound to avoid opportunistic solutions. Based on the ensemble of imagined rollout predictions, we take the lower confidence bound for value estimation to avoid optimistic estimates, which would lead to risky action selection. We also suggest tuning the lower confidence bound dynamically to balance risk-averse actions and exploration. Experiments on a range of environments demonstrate the superiority of RAVE in both sample efficiency and robustness compared with state-of-the-art RL methods, including the model-free TD3 algorithm and the HybridRL STEVE algorithm. The RAVE algorithm also provides a plausible model-based and robust solution for the NeurIPS challenge on the physiologically-based human model. We hope that this algorithm will facilitate the application of reinforcement learning in risky, real-world scenarios.
References
Buckman et al. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234.
Calandra et al. (2016). Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence (AMAI) 76 (1), pp. 5–23. ISSN 1573-7470.
Chua et al. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.
Dabney et al. (2018a). Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923.
Dabney et al. (2018b). Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence.
Duan et al. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338.
Feinberg et al. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods.
García and Fernández (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
Gaskett (2003). Reinforcement learning under circumstances beyond its control.
Geman et al. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4 (1), pp. 1–58.
Gu et al. (2016). Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Heger (1994). Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pp. 105–111.
Huang et al. (2017). Learning to run with actor-critic ensemble. CoRR abs/1712.08987.
Kalweit and Boedecker (2017). Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pp. 195–206.
Kidzinski et al. (2019). Artificial intelligence for prosthetics - challenge solutions. CoRR abs/1902.02441.
Klimov and Schulman (2017). Roboschool. GitHub. https://github.com/openai/roboschool
Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
Oh et al. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128.
Osband et al. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034.
Oudeyer and Kaplan (2009). What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics 1, pp. 6.
Pan et al. (2019). Risk averse robust adversarial reinforcement learning. arXiv preprint arXiv:1904.00511.
Pathak et al. (2017). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17.
Racanière et al. (2017). Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5690–5701.
Reddy et al. (2019). Risk averse reinforcement learning for mixed multi-agent environments. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2171–2173.
Sato et al. (2001). TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16 (3), pp. 353–362.
Sato and Kobayashi (2000). Variance-penalized reinforcement learning for risk-averse asset allocation. In International Conference on Intelligent Data Engineering and Automated Learning, pp. 244–249.
Schmidhuber (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2 (3), pp. 230–247.
Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Seth et al. (2018). OpenSim: simulating musculoskeletal dynamics and neuromuscular control to study human and animal movement. PLoS Computational Biology 14 (7), pp. e1006223.
Silver et al. (2017). Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
Smirnova et al. (2019). Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708.
Sutton (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163.
Thrun and Schwartz (1993). Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum.
Todorov et al. (2012). MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
Van Hasselt et al. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
Vinyals et al. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastarmasteringrealtimestrategygamestarcraftii/
Weber et al. (2017). Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.
Appendix A Training and Implementation Details
A.1 Neural Network Structure
We used rectified linear units (ReLUs) between all hidden layers of all our implemented algorithms. Unless otherwise stated, the output layers of the models have no activation function.
RL Models. We implement the model-based algorithms on top of DDPG, with a policy network and a Q-value network. The policy network is a stack of four fully-connected (FC) layers. The activation function of its output layer is tanh, to constrain the output range of the network. The Q-value network takes the concatenation of the state and the action as input, followed by four FC layers.
Dynamics Models. We train three neural networks as the transition function, the reward function, and the termination function. We build eight FC layers for the transition approximator and four FC layers for the other approximators. The distributional models in RAVE use a similar structure, except that there are two output layers corresponding to the mean and the variance respectively.
A.2 Parallel Training
We use distributed training to accelerate our algorithms and the baseline algorithms. Following the implementation of STEVE, we train a GPU learner with multiple actors deployed in a CPU cluster. The actors periodically reload the parameters from the learner, generate trajectories, and send the trajectories to the learner (we used 8 CPUs for the MuJoCo and Roboschool environments and 128 CPUs for the osim-rl environment). The learner stores them in the replay buffer and updates the parameters with data randomly sampled from the buffer. For network communication, we use PARL⁴ to transfer data and parameters between the actors and the learner. We have 8 actors generating data and deploy the learner on the GPUs. For osim tasks, we use 128 actors, as the simulation is much slower than MuJoCo. DDPG uses one GPU, and the model-based methods use two: one for training the policy and another for the dynamics model.

⁴ https://github.com/PaddlePaddle/PARL
A.3 Rollout Details
We employ the same method of computing target candidates as STEVE, except that we imagine a rollout with an ensemble of probabilistic models. First, we bind the parameters of the transition model to those of the termination model. That is, we enumerate the combinations of three integers, which index the members of the transition (and termination), reward, and value ensembles, giving us an ensemble of parameter combinations. The actual sampling process is as follows: for each combination, we first use the transition model and the termination model to imagine a state-action sequence; based on this sequence, we use the reward function to estimate the rewards and the value function to predict the value of the last state (fig. 5). In total, both RAVE and STEVE predict the same number of combinations of rewards and value functions.
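The enumeration of ensemble members can be sketched with a Cartesian product. The sketch assumes that one index selects the bound transition and termination models, one selects the reward model, and one selects the value function; the function name and the use of plain integers as stand-ins for parameter sets are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple, TypeVar

P = TypeVar("P")


def parameter_combinations(transition_params: Sequence[P],
                           reward_params: Sequence[P],
                           value_params: Sequence[P]) -> List[Tuple[P, P, P]]:
    """All (transition/termination, reward, value) parameter triples.

    The termination model shares its index with the transition model, so it
    does not contribute a separate dimension to the product.
    """
    return list(product(transition_params, reward_params, value_params))


# Integers stand in for parameter sets in this illustration.
combos = parameter_combinations(range(4), range(4), range(4))
```

Under these assumptions, an ensemble size of 4 for every function yields 4 × 4 × 4 = 64 candidate combinations per rollout horizon.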
Appendix B Details for NeurIPS 2019: Learn to Move Challenge
The Learn to Move challenge at NeurIPS 2019 is the third competition in the series. It requires participants to train a controller for a 3D human model to follow input velocity commands. The agent has to reach two target points sequentially within 2500 frames. At each step, the agent has to follow an input velocity command that depends on its position relative to the target point. Compared with last year's task, there are three challenges: real-time velocity commands, which vary according to the position of the agent; an unknown destination at any angle, sometimes behind the agent; and the minimum effort required during locomotion.
Following last year's 1st place solution, we divide our solution into two stages: learning a normal, human-like walking gait and learning to follow the input velocity commands.
B.1 Stage 1: Sensible Walking Gait
We believe that a human-like gait is more suitable and flexible for real-time velocity targets. However, the agent tends to learn odd gaits such as jumping or staggering when its target is simply set to low-speed walking. Kidzinski et al. [2019] proposed using curriculum learning to learn a flexible walking gait. The first aim is to run very fast, because the agent has a limited set of potential gaits for moving at high speed. It then gradually learns to walk at lower speeds, still keeping a human-like gait in which the two legs move forward alternately.
B.2 Stage 2: Following targets
Distilling. After attaining a normal, human-like walking gait, the main problem is to distill a policy network with velocity targets from the current policy network. As the new network requires real-time targets as input, which were not considered in the previous training stage, the distilled network performs worse on the new task. To solve this issue, we improve the robustness of the distilled policy by adding noise during the distilling process. Suppose we have collected a dataset from the old policy and want to fit a new policy function that additionally takes the velocity target t as input. We replace t with uniform noise during training, to make the new function robust to any unseen targets.
Low-energy locomotion. A main challenge in this competition is to reach the targets with low energy. We found that adding a large penalty on muscles degrades the flexibility and robustness of the agent. One explanation is that the agent fails to explore efficient gaits because of the muscle penalty. Our solution is to first train an agent without the muscle penalty to facilitate exploration of the action space, and then add the muscle penalty to the reward to reduce the energy used.
B.3 Note on robustness
A main concern about the learned agent is its robustness, because the agent gets a bonus of 500 points for completing the task. To address the issue of falling, previous works (Kidzinski et al. [2019], Huang et al. [2017]) rely on an ensemble of Q-value functions to evaluate the future returns of candidate actions. However, the estimated Q-values suffer from high bias and variance, as the agent has no information about the subsequent velocity commands, which is critical for reward computation.
We tried using an ensemble of Q-value functions over DDPG, but the learned agent with ensemble action prediction was prone to falling, with a probability of 15%. When training the agent with RAVE, the falling rate dropped significantly. We evaluated the submitted agent by running 5000 episodes locally. The agent walks to the target points within 2500 frames in each episode, and the mean score is 1489.54, with a 1.3% probability of falling (see fig. 6 for more information about the learned agent).
Appendix C Rewards for Osimbased experiments
LearnToRun
Here , is the moving distance On Xaxis of the left leg and right leg.
Appendix D Hyperparameters for Training
We list all the hyperparameters used in our MuJoCo and Roboschool experiments in table 1. For osim tasks, we used the same hyperparameters as last year's winning solution (Kidzinski et al. [2019]).

Hyperparameter | Value | Description
 | 512 | Batch size for training the RL model and the dynamics model
 | 1e6 | Size of the replay buffer storing the transitions
 | 3e-4 | Learning rate of the training policy
 | 3e-4 | Learning rate of the Q-value function
 | 3e-4 | Learning rate of the dynamics model
 | 0.05 | Probability of adding Gaussian noise to the action for exploration
H | 3 | Maximum horizon length for value expansion
n | 4 | Ensemble size of the value function and environment models
 | 10000 | Number of collected frames used to pretrain the dynamics model before training the policy
Z | 1 | Scaling factor for the prediction error
α | 1.5 | Confidence lower bound factor in equation (9)