Generating Automatic Curricula via Self-Supervised Active Domain Randomization

02/18/2020 ∙ by Sharath Chandra Raparthy, et al. ∙ Montréal Institute of Learning Algorithms 15

Goal-directed Reinforcement Learning (RL) traditionally considers an agent interacting with an environment that prescribes a real-valued reward proportional to the completion of some goal. Goal-directed RL has seen large gains in sample efficiency, due to the ease of reusing or generating new experience by proposing goals. In this work, we build on the framework of self-play, allowing an agent to interact with itself in order to make progress on some unknown task. We use Active Domain Randomization and self-play to create a novel, coupled environment-goal curriculum, where agents learn through progressively more difficult tasks and environment variations. Our method, Self-Supervised Active Domain Randomization (SS-ADR), generates a growing curriculum, encouraging the agent to try tasks that are just outside of its current capabilities, while building a domain-randomization curriculum that enables state-of-the-art results on various sim2real transfer tasks. Our results show that co-evolving the environment difficulty along with the difficulty of goals set in each environment provides practical benefits in the goal-directed tasks tested.

1 Introduction

The classic Markov Decision Process (MDP)-based formulation of RL can be extended with goals to contextualize actions and enable higher sample-efficiency (see e.g. [22] [1]). These methods work by allowing the agent to set its own goals, rather than exclusively relying on the environment to provide these. However, when setting new goals, the onus falls on the experimenter to decide which goals to use. Not all experience is equally useful for learning. As a result, past works have resorted to simple random sampling [1] or learning an expensive generative model to generate relevant goals [12].

In the framework of self-play, the agent can set goals for itself, using only unlabelled interactions with the environment (i.e., no evaluation of the true reward function). While many heuristics for this self-play goal curriculum exist, we focus on the framework of Asymmetric Self-Play [25], which learns a goal-setting policy via time-based heuristics. The idea is that the most “productive” goals for an agent to see are just out of the agent’s understanding or horizon. If goals are too easy or too hard, the experience will not be useful, making the horizon approach a strong option to pursue.

Figure 1: Self-Supervised Active Domain Randomization (SS-ADR) learns robust policies (h) via self-play by co-evolving a goal curriculum, set by Alice (e), alongside an environment curriculum, set by the ADR particles (j). The randomized environments (c) and goals (g) slowly increase in difficulty, leading to strong zero-shot transfer on all environments tested.

However, in certain cases, learning a goal curriculum via self-play alone is not enough. In robotic RL, policies trained purely in simulation have proven difficult to transfer to the real world, a problem known as the “reality gap” [13]. One leading approach for this sim2real transfer is Domain Randomization (DR) [26], in which a simulator’s parameters are perturbed, generating a space of related-but-different environments, all of which an agent tries to solve before transferring to a real robot. Yet, just as with goal selection, the issue once again becomes a question of which environments to show the agent. Recently, [18] empirically showed that not all generated environments are equally useful for learning, leading to Active Domain Randomization (ADR). ADR defines a curriculum learning problem in the environment randomization space, using learned rewards to search for an optimal curriculum.

As our work deals with both robotics and goal-directed RL, we combine ADR and Asymmetric Self Play to propose Self-Supervised Active Domain Randomization (SS-ADR). SS-ADR couples the environment and goal space, learning a curriculum across both simultaneously. SS-ADR can transfer to real-world robotic tasks without ever evaluating the true reward function during training, learning a policy completely via self-supervised reward signals. We show that this coupling generates strong robotic policies in all environments tested, even across multiple robots and simulation settings.

2 Background

2.1 Reinforcement Learning

We consider a Markov Decision Process (MDP) $\mathcal{M}$, defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the transition function, $r$ is the reward function, and $\gamma$ is the discount factor. Formally, the agent receives a state $s_t$ at timestep $t$ and takes an action $a_t$ based on the policy $\pi(a_t \mid s_t)$. The environment gives a reward of $r_t$ and the agent transitions to the next state $s_{t+1}$. The goal of RL is to find a policy $\pi$ which maximizes the expected return from each state, where the return is given by $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$. Goal-directed RL often appends a goal $g$ (from some goal space $\mathcal{G}$) to the state, and requires the goal when evaluating the reward function (i.e., $r(s_t, a_t, g)$).
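As a concrete illustration of the quantities above, the short sketch below computes a discounted return and a goal-conditioned reward; the variable names and the toy distance-based reward are ours, not taken from the paper's code.

```python
# Minimal sketch of the return R_t and a goal-conditioned reward r(s, a, g).
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, computed over a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def goal_reward(state, goal):
    """A toy goal-directed reward: negative distance between the state and the goal g."""
    return -float(np.linalg.norm(np.asarray(state) - np.asarray(goal)))

print(discounted_return([0.0, 0.0, 1.0]))    # 0.9801...
print(goal_reward([0.0, 0.0], [0.3, -0.4]))  # -0.5
```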

2.2 Curriculum Learning

Curriculum learning is a strategy of training machine learning models on a series of tasks of gradually increasing difficulty (from easy to hard) [2]. In curriculum learning, the focus lies on the order of tasks, often abstracting away the particular learning of the task itself. In general, task curricula are crafted in such a way that the next task is just beyond the agent’s current capabilities. However, when an explicit ordering of task difficulty is not available, careful design of the curriculum is required to avoid optimization failures.

2.3 Self-Play

We consider the self-play framework by [25], which proposes an unsupervised way of learning to explore the environment. In this method, the agent has two brains: Alice, which sets a task, and Bob, which finishes the assigned task. The novelty of this method can be attributed to the elegant reward design given by Equations 1 and 2,

$$r_A = \gamma \,\max(0,\; t_B - t_A) \tag{1}$$
$$r_B = -\gamma \, t_B \tag{2}$$

where $t_A$ is the number of timesteps taken by Alice to set a task, $t_B$ is the number of timesteps taken by Bob to finish the task set by Alice, and $\gamma$ is a scaling factor. This reward design allows self-regulating feedback between both agents, as Alice focuses on tasks that are just beyond Bob’s horizon: Alice tries to propose tasks that are easy for her, yet difficult for Bob. This evolution of tasks forces the two agents to automatically construct a curriculum for exploration.
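A minimal sketch of these time-based rewards (our code; the value of the scaling factor $\gamma$ is illustrative):

```python
# Self-play rewards in the spirit of Eqs. (1)-(2): Alice is rewarded when Bob is
# slow relative to her, and Bob is penalized for taking long.
def self_play_rewards(t_alice, t_bob, gamma=0.1):
    r_alice = gamma * max(0, t_bob - t_alice)
    r_bob = -gamma * t_bob
    return r_alice, r_bob

# Alice sets a task in 10 steps; Bob needs 40 steps to reproduce it.
print(self_play_rewards(t_alice=10, t_bob=40))  # roughly (3.0, -4.0)
```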

However, in the original work, unsupervised self-play is used only as supplementary experience. In order to learn better policies on a target task, Bob still requires a majority of trajectories where the reward is evaluated by the environment.

2.4 Domain Randomization

Domain Randomization [20], [26] is a technique that provides enough variability at training time that, at test time, the model generalizes well to potentially unseen conditions. It requires the explicit definition of a set of $N_{rand}$ simulation parameters (e.g., friction, damping) and a randomization space $\Xi \subset \mathbb{R}^{N_{rand}}$. In every episode, a set of parameters $\xi \in \Xi$ is sampled and passed through the simulator $S$ to generate a new MDP $E_{\xi} = S(\xi)$. If $J(\pi; \xi)$ is the cumulative return of the policy $\pi$ in the MDP parameterized by $\xi$, the goal is to maximize this expected return across the distribution of such MDPs. The hope is that, when the model is deployed in an unseen environment, such as a real robot (a zero-shot transfer scenario), the policy generalizes well enough to maintain strong performance.
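The uniform-sampling variant of DR (UDR, used as a baseline later) can be sketched as follows; the parameter names and ranges are illustrative, and the environment constructor is a placeholder rather than a real API.

```python
# Sketch of uniform domain randomization: each episode, xi ~ U(Xi) and the
# simulator builds a randomized instance E_xi = S(xi). Ranges are illustrative.
import numpy as np

RANDOMIZATION_SPACE = {          # Xi, as {parameter: (low, high)}
    "puck_friction": (0.05, 0.8),
    "puck_damping": (0.05, 0.5),
}

def sample_udr_params(rng):
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_SPACE.items()}

rng = np.random.default_rng(0)
xi = sample_udr_params(rng)
print(xi)
# env = make_randomized_env(**xi)   # hypothetical simulator call S(xi)
```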

2.5 Active Domain Randomization

ADR [18] is a framework that searches for the most informative environment instances, unlike the uniform sampling of DR [26]. ADR formulates this search as an RL problem in which the sampling policy is parameterized with Stein Variational Policy Gradient (SVPG) [16], learning a set of particles $\{\mu_i\}_{i=1}^{N}$ that control which environments are shown to the agent. The particles undergo interacting updates, which can be written as:

$$\mu_i \leftarrow \mu_i + \frac{\alpha}{N} \sum_{j=1}^{N} \left[ \nabla_{\mu_j} \frac{1}{T} J(\mu_j)\, k(\mu_j, \mu_i) + \nabla_{\mu_j} k(\mu_j, \mu_i) \right] \tag{3}$$

where $J(\mu_j)$ denotes the sampled return from particle $\mu_j$, $k(\cdot, \cdot)$ is a kernel between particles, and the learning rate $\alpha$ and temperature $T$ are hyperparameters.

The particles are trained using learned discriminator-based rewards [5], which measure the discrepancy between trajectories generated in the reference environment instance $E_{ref}$ and those generated in the randomized instances $E_i$:

$$r_D = \log D_{\phi}\!\left(E_i \mid \tau_i\right) \tag{4}$$

where $D_{\phi}$ is a discriminator trained to predict whether a trajectory $\tau_i$ was produced in a randomized instance rather than in the reference instance.

The authors claim that ADR finds environments which are difficult for the current agent policy to solve via learnable discrepancies between the reference (generally, easier) environment, and a proposed randomized instance.
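A sketch (our own, not the ADR implementation) of such a discriminator-based reward: a small classifier is trained to tell randomized-instance trajectories from reference ones, and its log-probability of "randomized" rewards environments that induce clearly different behaviour. The trajectory featurization and network size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryDiscriminator(nn.Module):
    def __init__(self, feature_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # logit of P(randomized | trajectory features)
        )

    def forward(self, traj_features):
        return self.net(traj_features)

def particle_reward(disc, traj_features):
    """r_D ~ log D(randomized | tau) for a trajectory from a randomized instance."""
    with torch.no_grad():
        return F.logsigmoid(disc(traj_features)).item()

disc = TrajectoryDiscriminator(feature_dim=16)
print(particle_reward(disc, torch.randn(16)))
```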

While the formulation benefits from learned rewards, ADR also suffers from an exploitability problem, as the authors mention in the paper’s appendix. Equation 4 finds (and rewards) environments where the discrepancy can be maximized, leading to situations where the method exploits the physics of the simulation by generating “impossible to solve” environments. The original work proposed iteratively adjusting the bounds of the randomization space as a workaround for this exploitability issue.

3 Related Work

The idea of curriculum learning was first proposed by [4], who showed that a curriculum of tasks is beneficial in language processing. Later, [2] extended this idea to various vision and language tasks, showing faster learning and better convergence. While many of these approaches require some human specification, automatic task generation has recently gained interest in the RL community. This body of work includes automatic curricula produced by adversarial training [12], reverse curriculum generation [6] [7], and teacher-student curriculum learning [17] [10]. However, many of these works rely on (a) distinct tasks rather than continuous task spaces, or (b) state- or reward-based “progress” heuristics. Our work builds upon the naturally growing curriculum formulation of asymmetric self-play [25], fixing some of its issues with stability-inducing properties.

Curriculum learning has also been studied through the lens of Self-Play. Self-play has been successfully applied to many games such as checkers [21] and Go [23]. Recently an interesting asymmetric self-play strategy has been proposed [25], which models a game between two variants of the same agent, Alice and Bob, enabling exploration of the environment without requiring any extrinsic reward. However, in this work, we use the self-play framework for learning a curriculum of goals, rather than for its traditional exploration-driven use case.

Despite the success of deep RL, training RL algorithms on physical robots remains a difficult problem and is often impractical due to safety concerns. Simulators have played a huge role in transferring policies to real robots safely, and many different methods have been proposed for this purpose [9], [19], [3]. DR [26] is one of the popular methods, generating a multitude of environment instances by uniformly sampling the environment parameters from a fixed range. However, [18] showed that DR suffers from high variance due to the unstructured task space, and instead proposed a novel algorithm that learns to sample the most informative environment instances. In our work, we use the ADR formulation while mitigating some of its critical issues, such as exploitability, by substituting the learned reward with a self-supervised reward.

4 Method

ADR allows for curriculum learning in an environment space: given some black box agent, trajectories are used to differentiate between the difficulty of environments, regardless of the goal set in the particular environment instance. In goal-directed RL, the goal itself may be the difference between a useful episode and a useless one. In particular, certain goals within the same environment instance may vary in difficulty; on the other hand, the same goal may vary in terms of reachability in different environments. ADR provides a curriculum in environment space, but with goal-directed environments, we have a new dimension to consider; one that the standard ADR formulation does not account for.

In order to build proficient, generalizable agents, we need to evolve a curriculum in goal space alongside a curriculum in environment space; otherwise, we may find degenerate solutions by proposing impossible goals in any environment, or vice versa. As shown in [25], self-play provides a way for policies to learn without reward supervision from the environment, but when used only for goal curricula, it requires interleaving self-play trajectories alongside reward-evaluated rollouts for best performance.

To this end, we propose Self-Supervised Active Domain Randomization (SS-ADR), summarized in Algorithm 1. SS-ADR learns a curriculum in the joint goal-environment space, producing strong, generalizable policies without ever evaluating an environment reward function during training.

SS-ADR learns two additional policies: Alice and Bob. Alice and Bob are trained in the format described in Algorithm 1 and [25]. Alice sets a goal in the environment, and eventually signals a STOP action. The environment is reset to the starting state, and Bob’s policy is then used to attempt to achieve the goal Alice has set. Bob sees Alice’s goal state appended to the current state, while Alice sees the current state appended to her initial state. Alice and Bob are trained via DDPG [24], using Equations 1 and 2 to generate rewards for each trajectory based on the time each agent took to complete the task (denoted by $t_A$ and $t_B$).

The reward structure forces Alice to focus on horizons: her reward is maximized when she can do something quickly that Bob cannot do at all. Considering the synchrony of policy updates for each agent, we presume that the goal set by Alice is not far out of Bob’s current reach.
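The asymmetric observation construction described above can be sketched as follows (array shapes and names are illustrative assumptions):

```python
# Alice conditions on (current state, her initial state); Bob conditions on
# (current state, Alice's final pre-STOP state s*), which acts as his goal.
import numpy as np

def alice_observation(current_state, initial_state):
    return np.concatenate([current_state, initial_state])

def bob_observation(current_state, alice_final_state):
    return np.concatenate([current_state, alice_final_state])

s0 = np.zeros(6)             # episode start state
s_star = 0.3 * np.ones(6)    # state where Alice signalled STOP
print(alice_observation(s_star, s0).shape)  # (12,)
print(bob_observation(s0, s_star).shape)    # (12,)
```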

1:  Input: $\Xi$: randomization space, $S$: simulator ($S(\xi) = E_{\xi}$), $\xi_{ref}$: reference parameters
2:  Initialize $\pi_A$: Alice’s policy, $\pi_B$: Bob’s policy, $\{\mu_i\}_{i=1}^{N}$: SVPG particles
3:  for $T_{max}$ timesteps do
4:     Instantiate the reference environment $E_{ref} \leftarrow S(\xi_{ref})$
5:     Reset $E_{ref}$ and set Alice’s step counter $t_A \leftarrow 0$
6:     Observe the initial state $s_0$
7:     while $a_t$ is not STOP do
8:        Sample Alice’s action $a_t \sim \pi_A(\cdot \mid s_t, s_0)$
9:        Observe the current state $s_t$
10:       $t_A \leftarrow t_A + 1$
11:    end while
12:    Set Alice’s final state as Bob’s target state: $s^{*} \leftarrow s_{t_A}$
13:    Sample environment parameters $\xi \sim \mu(\cdot)$ and instantiate $E_{\xi} \leftarrow S(\xi)$
14:    Reset $E_{\xi}$ to the initial state $s_0$
15:    Set Bob’s step counter $t_B \leftarrow 0$
16:    while Bob not done do
17:       Sample Bob’s action $a_t \sim \pi_B(\cdot \mid s_t, s^{*})$
18:       Observe the current state $s_t$
19:       $t_B \leftarrow t_B + 1$
20:    end while
21:    Compute Alice’s reward $r_A$ using Eq. (1)
22:    Compute Bob’s reward $r_B$ using Eq. (2)
23:    Update the particles $\{\mu_i\}$ using Eq. (3), with $r_A$ as the particle return
24:    with $r_A$, update Alice’s policy $\pi_A$:
25:       $\pi_A \leftarrow$ DDPG update with reward $r_A$
26:    with $r_B$, update Bob’s policy $\pi_B$:
27:       $\pi_B \leftarrow$ DDPG update with reward $r_B$
28:  end for
Algorithm 1: Self-Supervised ADR

However, before Bob operates in the environment, the environment is randomized (e.g., object frictions are perturbed or robot torques are changed). Alice, who operates in the reference environment $E_{ref}$ (an environment given as the “default”), tries to find goals that are easy in the reference environment but difficult in the randomized ones ($E_{\xi}$). Since the randomizations themselves are prescribed by the ADR particles, training the ADR particles with Alice’s reward (i.e., Equation 1, evaluated separately for each randomization tested) yields a co-evolution at both curriculum levels: the curricula in goal space and environment space increase in difficulty simultaneously, leading to state-of-the-art performance in goal-directed, real-world robotic tasks.
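The coupling can be summarized with the schematic sketch below. It is our pseudocode, not the released implementation: the policy and particle objects are stand-in stubs, and only the reward wiring reflects the description above.

```python
import random

class StubPolicy:
    """Stand-in for a DDPG policy (Alice or Bob); rollouts are faked with random lengths."""
    def rollout(self, env, goal=None):
        steps = random.randint(5, 50)
        return (steps, "alice_final_state") if goal is None else steps
    def update(self, reward):
        pass  # DDPG update omitted in this sketch

class StubParticles:
    """Stand-in for the SVPG particles over the randomization space."""
    def sample(self):
        return {"puck_friction": random.uniform(0.05, 0.8)}
    def update(self, xi, reward):
        pass  # SVPG step (Eq. 3) omitted

def ss_adr_step(alice, bob, particles, simulator, xi_ref, gamma=0.1):
    # 1. Alice acts in the reference environment until she signals STOP.
    t_alice, goal_state = alice.rollout(simulator(xi_ref))
    # 2. A particle proposes parameters; Bob chases Alice's goal in that randomized instance.
    xi = particles.sample()
    t_bob = bob.rollout(simulator(xi), goal=goal_state)
    # 3. Self-play rewards (Eqs. 1-2) train Alice and Bob...
    r_alice = gamma * max(0, t_bob - t_alice)
    r_bob = -gamma * t_bob
    alice.update(r_alice)
    bob.update(r_bob)
    # 4. ...and Alice's reward, evaluated on the randomized instance, also trains
    #    the particles, coupling the goal and environment curricula.
    particles.update(xi, r_alice)
    return r_alice, r_bob

print(ss_adr_step(StubPolicy(), StubPolicy(), StubParticles(),
                  simulator=lambda xi: xi, xi_ref={"puck_friction": 0.3}))
```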

4.1 Implementation

Across all experiments, all networks share the same architecture and hyperparameters. For each of the Alice and Bob policies, we use Deep Deterministic Policy Gradients [24], using an implementation from [8]. Each actor and critic has two hidden layers with 400 and 300 neurons, respectively, and uses ReLU activations. For Alice’s stopping policy (which signals the STOP action), we use a multi-layer perceptron with two hidden layers of 300 neurons each. All networks use the Adam optimizer [14] with standard hyperparameters from the PyTorch implementation (https://pytorch.org/). The learning rate, discount factor, reward scaling factor, and number of ADR/SVPG particles are kept fixed across experiments. In all our self-play experiments, we consider 1 million unlabelled self-play interactions and plot the mean-averaged learning curves across 4 seeds (unlike the original results of [25], where the x-axis considers only labeled interactions). All of the corresponding code and experiments can be found in the supplementary material.
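A sketch of the actor/critic sizes described above (400- and 300-unit hidden layers, ReLU, Adam with PyTorch defaults); the tanh output scaling and the example input dimensions are our assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # assumed action squashing
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor = Actor(state_dim=12, action_dim=4)    # example dimensions
critic = Critic(state_dim=12, action_dim=4)
actor_opt = torch.optim.Adam(actor.parameters())  # PyTorch-default Adam hyperparameters
```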

Figure 2: ErgoReacher is a 4-DoF robotic arm, with both simulation and real-world environments. The goal is to move the end effector to a series of imaginary goals (pink dot) as fast as possible, actuated by the four motors.
Figure 3: ErgoPusher is a 3-DoF robotic arm, with the goal of bringing a secondary object to an imaginary goal (pink dot).

5 Results

In order to evaluate our method, we perform experiments on continuous control robotic tasks, both in simulation and in the real world. We use the following environments from [9] and [18]:

  • ErgoReacher: a 4-DoF robotic arm where the end effector has to reach the goal (Figure 2)

  • ErgoPusher: a 3-DoF robotic arm, similar to [11], that has to push a puck to the goal (Figure 3)

For the sim-to-real experiments, we recreated the simulation environments on real Poppy Ergo Jr robots [15], shown in Figures 2b and 3b. All simulated experiments are run across 4 seeds. We evaluate the policy on (a) the default environment and (b) an intuitively hard environment which lies outside the training domain, every 5000 timesteps, amounting to 200 evaluations in total over 1 million timesteps. Unlike the self-play framework proposed in [25], we do not explicitly train Bob on the target task with extrinsic rewards from the environment. Instead, we evaluate a policy trained only with intrinsic rewards, making the approach completely self-supervised.

We compare our method against two different baselines:

  • Uniform Domain Randomization (UDR): Our first baseline generates a multitude of tasks by uniformly sampling parameters from a given range. The environment space generated by UDR is unstructured, and the difficulty of instances varies greatly. No goal curriculum is used.

  • Unsupervised Default: We use the self-play framework to generate a naturally growing curriculum of goals as our second baseline. Here, only the goal curriculum (and not the coupled environment curriculum) is considered.

Figure 4: On the default (in-distribution) environment, both the self-play method, shown as Unsupervised-Default, and SS-ADR show strong performance. Even on this easier task, we see issues with UDR, which is unstable in both performance and convergence throughout training. Shown is the final distance to the goal; lower is better.

5.1 Simulation Experiments

We explore the significance of SS-ADR’s performance on the ErgoPusher and ErgoReacher tasks. In the ErgoPusher task, we vary the puck damping and puck friction (two randomization dimensions). In order to create an intuitively hard environment, we lower the values of these two parameters, which creates an “icy” surface, ensuring that the puck needs to be hit carefully to complete the difficult task.

Figure 5: When we test the Reacher policies on a harder, held-out test environment (i.e., where torques are dropped to a minimum, leading to non-recoverable states in the MDP), we see that only SS-ADR converges with low variance and strong performance. Both UDR and Unsupervised-Default struggle on the held-out environment. Shown is the final distance to the goal; lower is better.

For the ErgoReacher task, we increase the number of randomization dimensions (torques and gains for each of the four joints), making it hard to intuitively infer the environment complexity. However, for demonstration purposes, we create an intuitively hard environment by assigning extremely low torques and gains to each joint. We adapt the parameter ranges from the GitHub repository of [18].

Figure 6: Final distance to goal, lower is better. In the Pusher environment, we see the same narrative as in Figure 4; UDR struggles even in the easy, in-distribution environment, while both self-play methods converge quickly with low variance.
Figure 7: Final distance to goal, lower is better. In the held-out Pusher environment, both self-play methods show higher variance in simulation, although SS-ADR has better overall performance.

From Figures 4 and 6 we can see that both Unsupervised-Default and SS-ADR significantly outperform UDR, both in terms of variance and average final distance. This highlights that the uniform sampling in UDR can lead to unpredictable and inconsistent behaviour. To see the benefits of the coupled environment-goal curriculum over a goal-only curriculum, we evaluate on the intuitively hard environments (outside of the training parameter distribution, as described above). From Figures 5 and 7, we can see that our method, SS-ADR, which co-evolves the environment and goal curricula, outperforms Unsupervised-Default, which omits the environment curriculum. This shows that the coupled curriculum enables stronger generalization than the standard self-play formulation.

Figure 8: On various instantiations of the real robot (parameterized by motor torques), SS-ADR outperforms UDR in terms of performance (lower is better) and spread, while SS-ADR’s performance is consistently on par with or better than that of Unsupervised-Default.

5.2 Sim-to-Real Transfer Experiments

In this section, we explore the zero-shot transfer performance of the policies trained in the simulator. To test our policies on real robots, we take the four independently trained policies for each of ErgoReacher and ErgoPusher and deploy them onto the real robots without any fine-tuning. We roll out each policy per seed for 25 independent trials and report the average final distance across those trials. To evaluate generalization, we change the task definitions (and therefore the MDPs): the puck friction (across low, high, and standard frictions in a box-pushing environment) for ErgoPusher, and the joint torques (across a wide spectrum of values) for ErgoReacher. In general, lower values in both settings correspond to harder tasks, due to the construction of the robot and the intrinsic difficulty of the task itself.

Figure 9: We see the difference between the various methods clearly in the Pusher environment, where SS-ADR outperforms all other baselines. Lower is better.

From Figures 8 and 9, we see that SS-ADR outperforms both baselines in terms of accuracy and consistency, leading to robust performance across all environment variants tested. Zero-shot policy transfer is a difficult and dangerous task, meaning that low spread (i.e., consistent performance) is required for deployed robotic RL agents. As the plots show, simulation alone is not the answer (leading to the poor performance of UDR), while self-play alone also sometimes fails to generate curricula that allow for strong, generalizable policies. However, by utilizing both methods together and co-evolving the two curriculum spaces, we obtain benefits beyond those of applying curriculum learning in either space separately.

6 Conclusion

In this work, we proposed Self-Supervised Active Domain Randomization (SS-ADR), which co-evolves curricula in a joint goal-environment task space to create strong, robust policies that can transfer zero-shot onto real world robots. Our method requires no evaluation of training environment reward functions, and learns this joint curriculum entirely through self-play. SS-ADR is a feasible approach to train new policies in goal-directed RL settings, and outperforms all baselines in both tasks (in simulated and real variants) tested.

7 Acknowledgements

The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Quebec (FQRNT), Calcul Quebec, Compute Canada, the Canada Research Chairs, Canadian Institute for Advanced Research (CIFAR) and Nvidia for donating a DGX-1 for computation. BM would like to thank IVADO for financial support. FG would like to thank MITACS for their funding and support.

References

  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. CoRR abs/1707.01495. External Links: Link, 1707.01495 Cited by: §1.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA. External Links: ISBN 978-1-60558-516-1, Link, Document Cited by: §2.2, §3.
  • [3] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox (2018) Closing the sim-to-real loop: adapting simulation randomization with real world experience. CoRR abs/1810.05687. External Links: Link, 1810.05687 Cited by: §3.
  • [4] J. L. Elman (1993) Learning and development in neural networks: the importance of starting small. Cognition 48 (1). Cited by: §3.
  • [5] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2018) Diversity is all you need: learning skills without a reward function. CoRR abs/1802.06070. External Links: Link, 1802.06070 Cited by: §2.5.
  • [6] C. Florensa, D. Held, M. Wulfmeier, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. CoRR abs/1707.05300. External Links: Link, 1707.05300 Cited by: §3.
  • [7] S. Forestier, Y. Mollard, and P. Oudeyer (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. CoRR abs/1708.02190. External Links: Link, 1708.02190 Cited by: §3.
  • [8] S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. CoRR abs/1802.09477. External Links: Link, 1802.09477 Cited by: §4.1.
  • [9] F. Golemo, A. A. Taiga, A. Courville, and P. Oudeyer (2018-29–31 Oct) Sim-to-real transfer with neural-augmented robot simulation. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, . External Links: Link Cited by: §3, §5.
  • [10] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. CoRR abs/1704.03003. External Links: Link, 1704.03003 Cited by: §3.
  • [11] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine (2018) Composable deep reinforcement learning for robotic manipulation. CoRR abs/1803.06773. External Links: Link, 1803.06773 Cited by: 2nd item.
  • [12] D. Held, X. Geng, C. Florensa, and P. Abbeel (2017) Automatic goal generation for reinforcement learning agents. CoRR abs/1705.06366. External Links: Link, 1705.06366 Cited by: §1, §3.
  • [13] N. Jakobi, P. Husbands, and I. Harvey (1995) Noise and the reality gap: the use of simulation in evolutionary robotics. In European Conference on Artificial Life, Cited by: §1.
  • [14] D. Kingma and J. Ba (2014-12) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §4.1.
  • [15] M. Lapeyre (2014-11) Poppy: open-source, 3d printed and fully-modular robotic platform for science, art and education. Cited by: §5.
  • [16] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng (2017) Stein variational policy gradient. CoRR abs/1704.02399. External Links: Link, 1704.02399 Cited by: §2.5.
  • [17] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman (2017) Teacher-student curriculum learning. CoRR abs/1707.00183. External Links: Link, 1707.00183 Cited by: §3.
  • [18] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2019) Active domain randomization. CoRR abs/1904.04762. External Links: Link, 1904.04762 Cited by: §1, §2.5, §3, §5.
  • [19] A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira, and S. Birchfield (2018) Structured domain randomization: bridging the reality gap by context-aware synthetic data. CoRR abs/1810.10093. External Links: Link, 1810.10093 Cited by: §3.
  • [20] F. Sadeghi and S. Levine (2017-07) CAD2RL: real single-image flight without a single real image. Robotics: Science and Systems XIII. External Links: ISBN 9780992374730, Link, Document Cited by: §2.4.
  • [21] A. L. Samuel (1959) Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3. Cited by: §3.
  • [22] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15. Cited by: §1.
  • [23] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016-01) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587). External Links: Document Cited by: §3.
  • [24] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014-22–24 Jun) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, Bejing, China. External Links: Link Cited by: §4.1, §4.
  • [25] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus (2017) Intrinsic motivation and automatic curricula via asymmetric self-play. CoRR abs/1703.05407. External Links: Link, 1703.05407 Cited by: §1, §2.3, §3, §4, §4, §5, footnote 3.
  • [26] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, Cited by: §1, §2.4, §2.5, §3.