Code repository for Active Domain Randomization (https://arxiv.org/abs/1904.04762)
Domain randomization is a popular technique for improving domain transfer, often used in a zero-shot setting when the target domain is unknown or cannot easily be used for training. In this work, we empirically examine the effects of domain randomization on agent generalization. Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters. We propose Active Domain Randomization, a novel algorithm that learns a parameter sampling strategy. Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization. In addition, when domain randomization and policy transfer fail, Active Domain Randomization offers more insight into the deficiencies of both the chosen parameter ranges and the learned policy, allowing for more focused debugging. Our experiments across various physics-based simulated and a real-robot task show that this enhancement leads to more robust, consistent policies.
Recent trends in Deep Reinforcement Learning (DRL) exhibit a growing interest in zero-shot domain transfer, i.e. when a policy is learned in a source domain and is then tested without finetuning in an unseen target domain. Zero-shot transfer is particularly useful when the task in the target domain is inaccessible, complex, or expensive, such as gathering rollouts from a real-world robot. An ideal agent would learn to generalize across domains; it would accomplish the task without exploiting irrelevant features or deficiencies in the source domain (e.g., approximate physics in simulators), which may vary dramatically after transfer.
One promising approach for zero-shot transfer has been domain randomization (Tobin et al., 2017). In Domain Randomization (DR), we uniformly randomize environment parameters of the simulation (e.g. friction, motor torque) across predefined ranges after every training episode. By randomizing everything that might vary in the target environment, the hope is that the agent will view the target domain as just another variation. However, recent works suggest that the sample complexity grows exponentially with the number of randomization parameters, even when dealing only with transfer between simulations (e.g. in Andrychowicz et al. (2018), Figure 8). In addition, when domain randomization is unsuccessful, policy transfer fails as a black box. After a failed transfer, randomization ranges are tweaked heuristically via trial-and-error. Repeating this process iteratively, researchers are often left with arbitrary ranges that do (or do not) lead to policy convergence without any insight into how those settings may affect the learned behavior.
In this work, we demonstrate that the strategy of uniformly sampling environment parameters is suboptimal and propose an alternative method, Active Domain Randomization. Active Domain Randomization (ADR) formulates domain randomization as a search for randomized environments that maximize utility for the agent policy. Concretely, we aim to find environments that currently pose difficulties for the agent policy and dedicate more training time to these troublesome parameter settings. We cast this active search as a Reinforcement Learning (RL) problem where the ADR sampling policy is parameterized with Stein Variational Policy Gradient (SVPG). ADR homes in on problematic regions of the randomization space by learning a discriminative reward computed from discrepancies in policy rollouts generated in randomized and reference environments.
We showcase ADR on a simple environment where the benefits of training on more challenging variations are apparent and interpretable (Figure 1), and demonstrate that ADR learns to preferentially select parameters from these regions while still adapting to the policy’s current deficiencies. We then apply ADR to more complex environments and real robot settings (Figure 2) and show that even with high-dimensional search spaces and unmodeled dynamics, policies trained with ADR exhibit superior generalization and lower overall variance than their Uniform Domain Randomization (UDR) counterparts.
The key contributions of our work can be summarized as:
Our proposed ADR method learns an adaptive randomization strategy that finds problematic environments within the given randomization ranges. Across a wide variety of tasks, we find that training preferentially on these environments leads to better generalization.
ADR can provide insight into which dimensions and parameter ranges are most influential before transfer, which can aid the tuning of randomization ranges before expensive experiments are undertaken.
In this section, we briefly cover the basics of RL (used to train both the agent policy and the ADR policy), domain randomization, and Stein Variational Policy Gradient (which parameterizes the ADR policy).
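We consider an RL framework (Sutton & Barto, 2018) where some task $T$ is defined by a Markov Decision Process (MDP) consisting of a state space $\mathcal{S}$, action space $\mathcal{A}$, state transition function $P : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and discount factor $\gamma \in (0, 1)$. The goal for an agent trying to solve $T$ is to learn a policy $\pi$ with parameters $\theta$ that maximizes the expected total discounted reward. We define a rollout $\tau = (s_0, a_0, s_1, a_1, \ldots)$ to be the sequence of states $s_t$ and actions $a_t \sim \pi(a_t \mid s_t)$ executed by a policy $\pi$ in the environment.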
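As a small illustration of the objective, the total discounted reward of a single rollout can be computed by folding the rewards backwards with the discount factor. This is a minimal sketch; the function name and the example rewards are illustrative and are not taken from the paper's code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one rollout, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three steps of reward 1 with a discount factor of 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The agent policy is trained to maximize the expectation of this quantity over rollouts.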
Domain randomization (DR) is a technique to train policies completely in simulation and transfer them in a zero-shot manner to the real world. DR requires a prescribed set of $N_{rand}$ simulation parameters to randomize, as well as corresponding ranges to sample them from. A set of parameters $\xi$ is sampled from randomization space $\Xi \subset \mathbb{R}^{N_{rand}}$, where each randomization parameter $\xi^{(i)}$ is bounded on a closed interval $[\xi^{(i)}_{low}, \xi^{(i)}_{high}]$, $i = 1, \ldots, N_{rand}$.
When a configuration $\xi \in \Xi$ is passed to a non-differentiable simulator $S$, it generates an environment $E$. At the start of each episode, the parameters are uniformly sampled from the ranges, and the environment generated from those values is used to train the agent policy $\pi$.
DR may perturb any to all elements of the task $T$’s underlying MDP (the effects of DR on the action space are usually implicit or are carried out on the simulation side), with the exception of keeping the reward function $R$ and discount factor $\gamma$ constant. DR therefore generates a set of MDPs that are superficially similar, but can vary greatly in difficulty depending on the character of the randomization. Upon transfer to the target domain, the expectation is that the agent policy has learned to generalize across MDPs, and sees the final domain as just another variation of parameters.
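The uniform sampling step of DR can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter names and the friction interval are hypothetical, while the main engine strength interval follows the LunarLander-v2 experiment in this paper. Building an actual environment from the sampled values is left to the simulator and is not shown.

```python
import random

# Hypothetical randomization space: one closed interval per parameter.
# The friction interval is made up for illustration only.
RANGES = {
    "main_engine_strength": (8.0, 20.0),
    "friction": (0.5, 1.5),
}

def sample_uniform_config(ranges, rng):
    """Draw each randomization parameter uniformly from its closed interval."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

rng = random.Random(0)
# One fresh configuration per training episode, as in uniform DR.
episode_configs = [sample_uniform_config(RANGES, rng) for _ in range(5)]
```

Every configuration is equally likely under this scheme, regardless of how useful the resulting environment is for training.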
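Sufficient exploration in high-dimensional state spaces has always been a difficult problem in RL. Recently, Liu et al. (2017) proposed SVPG, which learns an ensemble of policies $\mu_\phi$ in a maximum-entropy RL framework (Ziebart, 2010):

$$\max_{\mu} \; \mathbb{E}_{\mu}[J(\mu)] + \alpha \mathcal{H}(\mu) \tag{1}$$

with entropy $\mathcal{H}$ being controlled by temperature parameter $\alpha$. SVPG uses Stein Variational Gradient Descent (Liu & Wang, 2016) to iteratively update an ensemble of $N$ policies or particles $\mu_\phi = \{\mu_{\phi_i}\}_{i=1}^{N}$ with an update rule:

$$\mu_{\phi_i} \leftarrow \mu_{\phi_i} + \frac{\epsilon}{N} \sum_{j=1}^{N} \left[ \nabla_{\mu_{\phi_j}} J(\mu_{\phi_j}) \, k(\mu_{\phi_j}, \mu_{\phi_i}) + \alpha \nabla_{\mu_{\phi_j}} k(\mu_{\phi_j}, \mu_{\phi_i}) \right] \tag{2}$$

with step size $\epsilon$ and positive definite kernel $k$. This update rule balances exploitation (the first term moves particles towards high-reward regions) and exploration (the second term repulses similar policies).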
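To make the two terms concrete, the sketch below applies an SVPG-style step to scalar particles: each particle moves along a kernel-weighted sum of all particles' objective gradients, plus a temperature-scaled kernel-gradient term that pushes nearby particles apart. It is a toy under stated assumptions: real SVPG particles are policy parameter vectors with policy-gradient estimates, whereas here the particles are single numbers, the objective is a hypothetical quadratic, and the RBF bandwidth `h` is fixed by hand.

```python
import math

def rbf(x, y, h):
    """RBF kernel on scalars with bandwidth h."""
    return math.exp(-((x - y) ** 2) / h)

def svpg_step(particles, grad_J, eps, alpha, h):
    """One kernelized update of every particle (exploitation term + repulsion term)."""
    n = len(particles)
    updated = []
    for p_i in particles:
        total = 0.0
        for p_j in particles:
            k = rbf(p_j, p_i, h)
            grad_k = -2.0 * (p_j - p_i) / h * k  # derivative of k with respect to p_j
            total += grad_J(p_j) * k + alpha * grad_k
        updated.append(p_i + eps / n * total)
    return updated

def toy_grad_J(p):
    """Gradient of the hypothetical objective J(p) = -(p - 1)^2."""
    return -2.0 * (p - 1.0)

particles = [-1.0, 0.0, 3.0]
for _ in range(200):
    particles = svpg_step(particles, toy_grad_J, eps=0.05, alpha=0.1, h=1.0)
```

With the temperature set to zero the particles simply climb the objective; a positive temperature keeps them spread out around the optimum instead of collapsing onto it.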
Drawing analogies with Bayesian Optimization (BO) literature, one can think of the randomization space as a search space. We aim to look for points (environment instances) that maximize utility, or provide the most improvement to our agent policy when used for training. Traditionally, in BO, the search for where to evaluate an objective is handled by acquisition functions, which trade off exploitation of the objective with exploration in the uncertain regions of the space (Brochu et al., 2010). However, unlike the stationary objectives seen in BO, training the agent policy makes our optimization nonstationary: the environment with highest utility at time $t$ is likely not the same as the maximum-utility environment at a later time $t'$. With this dynamic objective, we need to actively search the space for the most fruitful training environments given the current state of the agent policy.
However, as this nonstationary search adds its own complexity, it is important to investigate if uniform sampling across the entire space is actually detrimental to agent performance. Concretely, we begin by investigating the validity of the following claim: uniform sampling of environment parameters does not generate equally useful MDPs. To test the hypothesis, we use LunarLander-v2, where the agent’s task is to ground a lander in a designated zone and reward is based on the quality of landing (fuel used, impact velocity, etc). Parameterized by an 8D state vector and actuated by a 2D continuous action space, LunarLander-v2 has one main axis of randomization that we vary: the main engine strength (MES).
We aim to determine if certain environment instances (different values of the MES) are more informative, i.e. more efficient than others in terms of aiding generalization. We set the total range of variation for the MES to be [8, 20] (the default is 13, and lower than 7.5 makes the environment unsolvable when all other physics parameters are held constant) and find through empirical tests that lower engine strengths generate harder MDPs to solve. Under this assumption, we show the effects of focused domain randomization by editing the ranges that the MES is uniformly sampled from.
We train multiple agents, with the only difference between them being the randomization ranges for MES. The randomization ranges define what types of environments the agent is exposed to during training. Figure 1 shows the final generalization performance of each agent by sweeping across the entire randomization range of [8, 20] and rolling out the policy in the generated environments. We see that focusing on harder MDPs improves generalization over uniformly sampling the whole space, even when the evaluation environment is outside of the training distribution.
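The generalization sweep described here amounts to evaluating one trained policy on a grid of MES values. The sketch below shows the shape of that loop; `toy_evaluate` is a made-up stand-in for "build the environment at this MES, roll out the policy, and return the episode reward", and its numbers are illustrative rather than experimental results.

```python
def sweep(evaluate, low, high, num_points):
    """Evaluate a policy on an evenly spaced grid over [low, high]."""
    step = (high - low) / (num_points - 1)
    grid = [low + i * step for i in range(num_points)]
    return [(value, evaluate(value)) for value in grid]

def toy_evaluate(mes):
    """Hypothetical return curve: flat at high MES, degrading as the engine gets weaker."""
    return 200.0 if mes >= 12.0 else 200.0 - 50.0 * (12.0 - mes)

# Sweep the full randomization range [8, 20] in steps of 1.
curve = sweep(toy_evaluate, 8.0, 20.0, 13)
```

Plotting the second element of each pair against the first gives a generalization curve of the kind compared across agents in Figure 1.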
The experiment in the previous section shows that preferential training on more informative environments provides tangible benefits, but in general, finding these environments is difficult because:
It is rare that such intuitively hard MDP instances or parameter ranges are known beforehand.
DR is used mostly when the space of randomized parameters is high-dimensional or noninterpretable.
An ideal randomization scheme would find the most informative environment instances in the randomization space, rather than uniformly sampling from the entire space. While seemingly just an instantiation of the traditional BO problem, the nonstationarity of the objective (the environment utility) requires us to redefine the notion of an acquisition function while simultaneously dealing with BO’s deficiencies with higher-dimensional inputs (Wang et al., 2013).
To this end, we propose ADR, summarized in Algorithm 1 and Figure 3. ADR provides a framework for manipulating a more general analog of an acquisition function, selecting the most informative MDPs for the agent within the randomization space. By formulating the search as an RL problem, ADR learns a sampling policy $\mu_\phi$ where the states are proposed randomization configurations and actions are continuous changes to those parameters.
We learn a discriminator-based reward for $\mu_\phi$, similar to the one originally proposed in Eysenbach et al. (2018):

$$r_D = \log D_\psi\left(y \mid \tau_i \sim \pi(\cdot \,; E_i)\right) \tag{3}$$

where $y$ is a boolean variable denoting the discriminator’s prediction of which type of environment (a randomized environment $E_i$ or reference environment $E_{ref}$) the trajectory $\tau_i$ was generated from. We assume that $E_{ref}$ is provided with the original task definition.
Intuitively, we reward the policy for finding regions of the randomization space that produce environment instances where the same agent policy acts differently than in the reference environment. The agent policy sees and trains only on the randomized environments (as it would in traditional DR), using the environment’s task-specific reward for updates. As the agent improves on the proposed, problematic environments, it becomes more difficult to differentiate whether any given state transition was generated from the reference or randomized environment. Thus, ADR can find what parts of the randomization space the agent is currently performing poorly on, and can actively update its sampling strategy throughout the training process.
To encourage sufficient exploration in high dimensional randomization spaces, we parameterize the sampling policy $\mu_\phi$ with SVPG. Since each particle proposes its own environment settings (lines 4-6, Fig. 3h), all of which are passed to the agent for training, the agent policy benefits from the same environment variety seen in UDR. However, unlike UDR, $\mu_\phi$ can use the learned reward to focus on problematic MDP instances while still being efficiently parallelizable.
After receiving each particle’s proposed parameter settings $\xi_i$, we generate randomized environments $E_i$ (line 9, Fig. 3b).
We proceed to train the agent policy $\pi$ on the randomized instances $E_i$, just as in UDR. We roll out $\pi$ on each randomized instance and store each trajectory $\tau_i$. For every randomized trajectory generated, we use the same policy to collect and store a reference trajectory $\tau_{ref}$ by rolling out $\pi$ in the default environment $E_{ref}$ (lines 10-12, Fig. 3a, c). We store all trajectories (lines 11-12) as we will use them to score each parameter setting and update the discriminator.
We now generate a score for each environment (lines 20-22) using each stored randomized trajectory $\tau_i$ by passing them through the discriminator $D_\psi$, which predicts the type of environment (reference or randomized) each trajectory was generated from. We use this score as a reward to update each SVPG particle using Equation 2 (lines 24-26, Fig. 3f).
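The data flow of one ADR iteration, as described in the steps above, can be sketched as a single function. This is not the authors' implementation: every callable passed in (`make_env`, `rollout`, `score`) is a hypothetical placeholder, and the toy stand-ins below only exist so the sketch runs. The agent update, the particle update, and discriminator training would consume the returned values and are omitted.

```python
def adr_iteration(proposals, make_env, rollout, score, reference_env):
    """Collect paired rollouts for each proposed setting and score the randomized ones.

    Returns (randomized trajectories, reference trajectories, per-proposal rewards).
    """
    randomized, reference = [], []
    for xi in proposals:
        env = make_env(xi)                        # randomized instance for this proposal
        randomized.append(rollout(env))           # agent rollout in the randomized instance
        reference.append(rollout(reference_env))  # same agent in the default environment
    # Score each proposal from its randomized trajectory; these rewards drive the particles.
    rewards = [score(trajectory) for trajectory in randomized]
    return randomized, reference, rewards

# Toy stand-ins (assumptions for illustration only).
REFERENCE_MES = 13.0                              # default main engine strength from the text
toy_make_env = lambda xi: xi                      # the "environment" is just its MES value
toy_rollout = lambda env: [env, env, env]         # a "trajectory" that only records the MES
toy_score = lambda traj: abs(traj[0] - REFERENCE_MES)  # placeholder for the discriminator

rand_trajs, ref_trajs, rewards = adr_iteration(
    [8.0, 13.0, 20.0], toy_make_env, toy_rollout, toy_score, REFERENCE_MES
)
```

In this toy setting the proposal equal to the reference earns zero reward, while proposals that make rollouts look different from the reference earn more, which is the signal the sampling policy follows.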
To test ADR, we experiment on OpenAI Gym environments (Brockman et al., 2016) across various tasks, both simulated and real.
ErgoReacher-v0 (originally developed for Golemo et al. (2018)): a 4 DoF arm which has to touch a goal with its end effector, implemented in the Bullet Physics Engine (Coumans, 2015). For sim2real experiments, we recreate this environment setup on a real Poppy Ergo Jr. robot (Lapeyre, 2014) shown in Fig. 2.
In addition to the randomization of the main engine strength of LunarLander-v2, both other environments randomize various physics parameters that change the environments drastically. Pusher-3DOF-v0 has two axes of randomization that make the puck slide more or less when pushed. ErgoReacher-v0 randomizes the max torque and gain for each degree of freedom (joint) for eight randomization parameters. We provide a detailed account of the randomizations used in Table 1 in Appendix C.
To investigate whether ADR’s learned sampling strategy provides a tangible benefit for agent generalization, we start by comparing it against traditional DR (labeled as UDR) on LunarLander-v2 and vary only the main engine strength (MES). In Figure 1, we see that ADR approaches expert levels of generalization whereas UDR fails to generalize on lower MES ranges.
From Figure 4, we see that ADR solves the reference environment (MES = 13) more consistently than UDR, never dipping below the Solved line once that level of performance is reached. Figure 4 also shows that, unlike the baseline (trained only on MES of 13) and the UDR agent (trained seeing environments with MES in [8, 20]), ADR is the only agent that makes significant progress on the hard, low-MES environment instances.
Figure 5 explains the adaptability of ADR by showing generalization and sampling distribution at various stages of training. ADR starts by sampling approximately uniformly for the first 650K steps, but then finds a deficiency in the policy on higher ranges of the MES. As those areas become more frequently sampled between 650K-800K steps, the agent learns to solve all of the higher-MES environments, as shown by the generalization curve for 800K steps. As a result, the discriminator is no longer able to differentiate reference and randomized trajectories from the higher MES regions, and starts to reward environment instances generated in the lower end of the MES range, which improves generalization towards the completion of training.
If the intuitions that drive ADR are correct, we should see increased benefit of a learned sampling strategy with a larger number of randomization parameters $N_{rand}$, due to the increasing sparsity of informative environments when sampling uniformly. We first explore ADR’s performance on Pusher-3DOF-v0, an environment where $N_{rand} = 2$. Both randomization dimensions (puck damping, puck friction loss) affect whether or not the puck retains momentum and continues to slide after making contact with the agent’s end effector. Lowering the values of these parameters simultaneously creates an intuitively harder environment, where the puck continues to slide after being hit. In the reference environment, the puck retains no momentum and must be continuously pushed in order to move. We qualitatively visualize the effect of these parameters on puck sliding in Figure 6.
From Figure 6, we see ADR’s improved robustness to extrapolation, i.e. when the target domain lies outside the training region. We train two agents, one using ADR and one using UDR, and show them only the training regions encapsulated by the dark, outlined box in the top-right of Figure 6. Qualitatively, only 25% of the environments have dynamics which cause the puck to slide, which are the hardest environments to solve in the training region. We see from the sampling histogram overlaid on Figure 6 that ADR prioritizes the single, harder purple region more than the light blue regions, allowing for better generalization to the unseen test domains, as shown in Figure 6. ADR outperforms UDR in all but one test region and produces policies with less variance than their UDR counterparts.
From the two panels of Figure 7, which are learning curves for UDR and ADR on the reference environment and hard environment (the pink square in Figure 6) respectively, we observe an interesting phenomenon: not only does ADR solve both environments more consistently (i.e. doesn’t pop up above the Solved line), but UDR also unlearns the good behaviors it acquired in the beginning of training. When training neural networks in both supervised and reinforcement learning settings, this phenomenon has been dubbed catastrophic forgetting (Kirkpatrick et al., 2016). ADR seems to exhibit this slightly (leading to "hills" in the curve), but due to the adaptive nature of the algorithm, it is able to adjust quickly and retain better performance across all environments.
We further show the significance of ADR over UDR on ErgoReacher-v0, where $N_{rand} = 8$. It is now impossible to infer intuitively which environments are hard due to the complex interactions between the eight randomization parameters (gains and maximum torques for each joint). For demonstration purposes, we test extrapolation by creating a held-out target environment with extremely low values for torque and gain, which causes certain states in the environment to lead to catastrophic failure - gravity pulls the robot end effector down, and the robot is not strong enough to pull itself back up. We show an example of an agent getting trapped in a catastrophic failure state in Figure 12, Appendix C.1.
To generalize effectively, the sampling policy should prioritize environments with lower torque and gain values in order for the agent to operate in such states precisely. However, since the hard evaluation environment is not seen during training, ADR must learn to prioritize the hardest environments that it can see, while still learning behaviors that can operate well across the entire training region.
In Figure 8, we see that when evaluated on the reference environment, the policy learned using ADR has much lower variance than one learned using UDR and can solve the environment much more effectively. In addition, it generalizes better to the unseen target domain as shown in Figure 8, again with much less variance in the learned agent policy.
UDR’s high variance on ErgoReacher-v0 is indicative of some of its issues: by continuously training on a random mix of hard and easy MDP instances, both beneficial and detrimental agent behaviors can be learned and unlearned throughout training. As shown in ErgoReacher-v0, this mixing can lead to high-variance, inconsistent, and unpredictable behavior upon transfer. By focusing on those harder environments and allowing the definition of hard to adapt over time, ADR shows more consistent performance and better overall generalization than UDR in all environments tested.
In this section, we present results of simulation-trained policies transferred zero-shot onto the real Poppy Ergo Jr. robot.
In sim2real (simulation to reality) transfer, many policies fail due to unmodeled dynamics within the simulators, as policies may have overfit to or exploited simulation-specific details of their training environments. While the deficiencies and high variance of UDR are clear even in simulated environments, one of the most impressive results of domain randomization was zero-shot transfer out of simulation onto robots. However, we find that the same issues of unpredictable performance apply to UDR-trained policies in the real world as well.
We take each method’s (ADR and UDR) five independent simulation-trained policies from Section 4.4 and transfer them without fine-tuning onto the real robot. We roll out only the final policy on the robot, and show performance in Figure 9. To evaluate generalization, we alter the robot by changing the values of the torques (higher torque means the arm moves at higher speed and accelerates faster), and evaluate each of the policies with 25 random goals (125 independent evaluations per torque setting). As shown in Figure 9, ADR policies obtain performance better than or similar to that of UDR policies trained in the same conditions. More importantly, ADR policies are more consistent and display lower spread across all environments, which is crucial when safely evaluating reinforcement learning policies on real-world robots.
One of the secondary benefits of ADR is its insight into incompatibilities between the task and randomization ranges. We demonstrate the simple effects of this phenomenon in a one-dimensional LunarLander-v2, where we only randomize the main engine strength. Our initial experiments varied this parameter between 6 and 20, which led to ADR learning degenerate agent policies by learning to propose the lopsided blue distribution in Figure 10. Upon inspection of the simulation, we see that when the parameter has a value of less than approximately 8, the task becomes almost impossible to solve due to the other environment factors (in this case the lander always hits the ground too fast, which it is penalized for).
After adjusting the parameter ranges to more sensible values, we see a better sampled distribution in pink, which still gives more preference to the hard environments in the lower engine strength range. Most importantly, ADR allows for analysis that is both focused - we know exactly what part of the simulation is causing trouble - and pre-transfer, i.e. done before a more expensive experiment such as real robot transfer has taken place. With UDR, the agents would be equally trained on these degenerate environments, leading to policies with potentially undefined behavior (or, as seen in Section 4.4, unlearn good behaviors) in these truly out-of-distribution simulations.
Simulators have played a crucial role in transferring learned policies onto real robots, and many different strategies have been proposed. Randomizing simulation parameters for better generalization or transfer performance is a well-established idea in evolutionary robotics (Zagal et al., 2004; Bongard & Lipson, 2004), but recently has emerged as an effective way to perform zero-shot transfer of deep reinforcement learning policies in difficult tasks (Andrychowicz et al., 2018; Tobin et al., 2017; Peng et al., 2018; Sadeghi & Levine, 2016).
Learnable simulations are also an effective way to adapt a simulation to a particular target environment. Chebotar et al. (2018) and Ruiz et al. (2018) use RL for effective transfer by learning parameters of a simulation that accurately describes the target domain, but require the target domain for reward calculation, which can lead to overfitting. In contrast, our approach requires no target domain, but rather only a reference domain (the default simulation parameters) and a general range for each parameter. ADR encourages diversity, and as a result gives the agent a wider variety of experience. In addition, unlike Chebotar et al. (2018), our method does not require carefully-tuned (co-)variances or task-specific cost functions. Concurrently, Khirodkar et al. (2018) also showed the advantages of learning adversarial simulations and disadvantages of purely uniform randomization distributions in object detection tasks.
To improve policy robustness, Robust Adversarial Reinforcement Learning (RARL; Pinto et al., 2017) jointly trains both an agent and an adversary who applies environment forces that disrupt the agent’s task progress. ADR removes the zero-sum game dynamics, which have been known to decrease training stability (Mescheder et al., 2018). More importantly, our method’s final outputs - the SVPG-based sampling strategy and discriminator - are reusable and can be used to train new agents as shown in Appendix A, whereas a trained RARL adversary would overpower any new agent and impede learning progress.
Active learning methods in supervised learning try to construct a representative, sometimes time-variant, dataset from a large pool of unlabelled data by proposing elements to be labeled. The chosen samples are labelled by an oracle and sent back to the model for use. Similarly, ADR searches for what environments may be most useful to the agent at any given time. Active learners, like BO methods discussed in Section 3, often require an acquisition function (derived from a notion of model uncertainty) to choose the next sample. Since ADR handles this decision through the explore-exploit framework of RL and the temperature $\alpha$ in SVPG, ADR sidesteps the well-known scalability issues of both active learning and BO (Tong, 2001).
Recently, Toneva et al. (2018) showed that certain examples in popular computer vision datasets are harder for networks to learn, and that some examples generalize (or are forgotten) much quicker than others. We explore the same phenomenon in the space of MDPs defined by our randomization ranges, and try to find the "examples" that cause our agent the most trouble. Unlike the active learning setting or Toneva et al. (2018), we have no oracle or supervisory loss signal in RL, and instead attempt to learn a proxy signal for ADR via a discriminator.
Generalization in RL has long been one of the holy grails of the field, and recent work like Packer et al. (2018), Cobbe et al. (2018), and Farebrother et al. (2018) highlight the tendency of deep RL policies to overfit to details of the training environment. Our experiments exhibit the same phenomena, but our method improves upon the state of the art by explicitly searching for and varying the environment aspects that our agent policy may have overfit to. We find that our agents, when trained more frequently on these problematic samples, show better generalization while also improving interpretability in both the randomization ranges’ and agent policy’s weaknesses.
In this work, we highlight failure cases of traditional domain randomization, and propose active domain randomization (ADR), a general method capable of finding the most informative parts of the randomization parameter space for a reinforcement learning agent to train on. ADR does this by posing the search as a reinforcement learning problem, and optimizes for the most informative environments using a learned reward and multiple policies. We show on a wide variety of simulated environments that this method efficiently trains agents with better generalization than traditional domain randomization, extends well to high dimensional parameter spaces, and produces more robust policies when transferring to the real world.
The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Quebec (FQRNT) and the Open Philanthropy Project for supporting this work. In addition, the authors would like to thank Kyle Kastner and members of the REAL Lab for their helpful comments.
International Conference on Machine Learning, 2018.
Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29. 2016.
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI ’13, 2013.
Unlike DR, ADR’s learned sampling strategy and discriminator can be reused to train new agents from scratch. To test the transferability of the sampling strategy, we first train an instance of ADR on LunarLander-v2, and then extract the SVPG particles and discriminator. We then replace the agent policy with a random network initialization, and once again train according to the details in Section 4.1. From Figure 11, it can be seen that the bootstrapped agent generalization is even better than the one learned with ADR from scratch. However, its training speed on the default environment (MES = 13) is relatively lower.
Please see Table 1.
In Figure 12, we show an example progression to a catastrophic failure state in the held-out, simulated target environment of ErgoReacher-v0, with extremely low torque and gain values.
|Environment||Number of Randomized Parameters||Types of Randomizations||Train Ranges||Test Ranges|
|LunarLander-v2||1||Main Engine Strength|| || |
|Pusher-3DOF-v0||2||Puck Friction Loss & Puck Joint Damping||default||default|
|ErgoReacher-v0|| ||Joint Max Torque||default||default|
All policies on Lunar Lander described in our paper receive a Solved score when the engine strengths are above 12, which is why truncated plots are shown in the main document. For clarity, we show the full, untruncated plot in Figure 13.
All experiments can be reproduced using our GitHub repository (https://github.com/montrealrobotics/active-domainrand).
All of our experiments use the same network architectures and experiment hyperparameters, except for the number of particles. For any experiment with LunarLander-v2, we use . For both other environments, we use . All other hyperparameters and network architectures remain constant, which we detail below. All networks use the Adam optimizer (Kingma & Ba, 2014).
We run Algorithm 1 until 1 million agent timesteps are reached - i.e. the agent policy takes 1M steps in the randomized environments. We also cap each episode at a particular number of timesteps according to the documentation associated with Brockman et al. (2016). In particular, LunarLander-v2 has an episode time limit of 1000 environment timesteps, whereas both Pusher-3DOF-v0 and ErgoReacher-v0 use an episode time limit of 100 timesteps.
For our agent policy, we use an implementation of DDPG (particularly, OurDDPG.py) from the GitHub repository associated with Fujimoto et al. (2018). The actor and critic both have two hidden layers of 400 and 300 neurons respectively, and use ReLU activations. Our discriminator-based rewarder is a two-layer neural network, both layers having 128 neurons. The hidden layers use tanh activation, and the network outputs a sigmoid for prediction.
The agent particles in SVPG are parameterized by a two-layer actor-critic architecture, both layers in both networks having 100 neurons. We use Advantage Actor-Critic (A2C) to calculate unbiased and low variance gradient estimates. All of the hidden layers use tanh activation and are orthogonally initialized, with a learning rate of and discount factor . They operate on a continuous space, with each axis bounded between . We set the max step length to be 0.05, and every 50 timesteps, we reset each particle and randomly initialize its state using a uniform distribution with one dimension per randomization parameter. We use a temperature with a Radial Basis Function (RBF) kernel and median baseline, as described in Liu et al. (2017), and an A2C policy gradient estimator (Mnih et al., 2016), although both the kernel and estimator could be substituted with alternative methods (Gangwani et al., 2019). To ensure diversity of environments throughout training, we always roll out the SVPG particles using a non-deterministic sample.
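The particle state dynamics described in this paragraph (bounded axes, a maximum step length of 0.05, and a uniform re-initialization every 50 timesteps) can be sketched as below. Several details are assumptions made for illustration: each axis is taken to be the unit interval, actions are clipped to the maximum step length rather than rescaled, and `to_parameters` is a hypothetical helper that maps a normalized state onto simulator ranges.

```python
import random

def step_particle(state, action, max_step=0.05, low=0.0, high=1.0):
    """Move each coordinate by at most max_step and keep it inside [low, high]."""
    new_state = []
    for s, a in zip(state, action):
        delta = max(-max_step, min(max_step, a))
        new_state.append(max(low, min(high, s + delta)))
    return new_state

def reset_particle(n_dims, rng, low=0.0, high=1.0):
    """Re-initialize a particle uniformly, one coordinate per randomization parameter."""
    return [rng.uniform(low, high) for _ in range(n_dims)]

def to_parameters(state, ranges):
    """Map a normalized state onto the actual closed interval of each parameter."""
    return [lo + s * (hi - lo) for s, (lo, hi) in zip(state, ranges)]

rng = random.Random(0)
state = reset_particle(2, rng)
for t in range(1, 101):
    # Random actions stand in for the particle's learned policy output.
    state = step_particle(state, [rng.uniform(-1.0, 1.0) for _ in state])
    if t % 50 == 0:
        state = reset_particle(2, rng)  # periodic reset, as described in the text
```

The periodic reset keeps particles from settling permanently in one corner of the randomization space, which complements the repulsion term in the SVPG update.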
For DDPG, we use a learning rate , target update coefficient of , discount factor , and batch size of 1000. We let the policy run for 1000 steps before any updates, and clip the max action of the actor between as prescribed by each environment.
Our discriminator-based reward generator is a network with two 128-neuron layers, with a learning rate of and a binary cross entropy loss (i.e. is this a randomized or reference trajectory). To calculate the reward for a trajectory for any environment, we split each trajectory into its constituent transition tuples, pass each tuple through the discriminator, and average the outputs; this average is then set as the reward for the trajectory. Our batch size is set to be 128, and most importantly, as done in Eysenbach et al. (2018), we calculate the reward for examples before using those same examples to train the discriminator.
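The reward computation and its ordering relative to discriminator training can be sketched as follows. This is a toy under stated assumptions: the paper's discriminator is a two-layer network over transition tuples, whereas `ToyDiscriminator` here is a hypothetical one-weight logistic model over a single scalar feature per transition, trained with a plain gradient step on the binary cross entropy.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ToyDiscriminator:
    """Hypothetical logistic model: probability that a transition is 'randomized'."""

    def __init__(self):
        self.w = 0.0

    def predict(self, feature):
        return sigmoid(self.w * feature)

    def train(self, features, labels, lr=0.1):
        for x, y in zip(features, labels):
            self.w += lr * (y - self.predict(x)) * x  # gradient step on binary cross entropy

def trajectory_reward(transitions, predict):
    """Average the discriminator output over the transitions of one trajectory."""
    outputs = [predict(t) for t in transitions]
    return sum(outputs) / len(outputs)

def score_then_train(disc, randomized, reference):
    """Compute rewards first, then train the discriminator on the same examples."""
    rewards = [trajectory_reward(traj, disc.predict) for traj in randomized]
    rand_feats = [x for traj in randomized for x in traj]
    ref_feats = [x for traj in reference for x in traj]
    labels = [1.0] * len(rand_feats) + [0.0] * len(ref_feats)
    disc.train(rand_feats + ref_feats, labels)
    return rewards

disc = ToyDiscriminator()
rewards = score_then_train(disc, randomized=[[1.0, 2.0]], reference=[[-1.0, -2.0]])
```

Scoring before training means the reward reflects what the discriminator could tell apart before it had seen these particular trajectories, rather than what it memorized from them.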