I Introduction
Teaching robots to perform challenging tasks has been an active topic of research. In particular, it has recently been demonstrated that reinforcement learning (RL) coupled with deep neural networks is able to learn policies (controllers) which can successfully perform tasks such as pick and fetch.
Robots may be slow and dangerous, can damage themselves, and are expensive. When a robot is learning a task, it needs to be taken out of production. Learning policies using model-free deep RL typically requires many samples to explore the sequential decision making space. Model-free RL applied to tasks involving complex dynamics requires even more samples to learn adequate policies than tasks involving (largely) linear dynamics. Directly learning on robots may thus be very costly.
In order to reduce the time required for learning on a real robot, training can be performed in simulation environments, and the learned policy is then transferred to the real world domain. Modern graphics cards and sophisticated physics engines enable the simulation of complex tasks. Learning with simulators has several advantages. The rendering and physics engines are capable of computing simulations faster than real time, which helps to reduce overall training times. Recent deep reinforcement learning algorithms allow agents to learn in parallel [1], which reduces training times further. Furthermore, both appearance and physics can be controlled in simulation. For example, the lighting conditions or the friction of an object can be changed, or the entire simulation can be halted to allow for computation of updates.
Appearance, complex dynamics, and robot motor movements in the real world can only be simulated up to some approximation. Simulation to real world transfer thus requires fine-tuning on real data. Furthermore, real setups involving various components experience delays which are hard to determine exactly, for example the delay introduced by the acquisition system: some time passes before an acquired image is available for processing by the algorithm.
By randomizing the appearance, physics, and system parameters during reinforcement learning on simulation data, robustified policies can be learned. This is analogous to training a deep convolutional neural network to classify objects regardless of the background in the input images. We found that robustified policies can greatly reduce the amount of time needed for fine-tuning in transfer learning. Reducing the fine-tuning time becomes especially important for tasks involving complex dynamics.
We demonstrate our proposed approach on the challenging task of a robot learning to solve a marble maze game. The maze game is shown in Figure 1. The marbles are subject to static and rolling friction, acceleration, and collisions (with other marbles and with the maze geometry). A simulator simulates the physics of the marbles in the maze game and renders the results to images. We learn to solve the game from scratch using deep reinforcement learning. A modified version of the deep RL algorithm is used to learn directly on real robot hardware. We learn both a robustified and a non-robustified policy in simulation and compare the times required for fine-tuning after transferring the policies to the real world.
In the remainder of this paper we will refer to learning on simulated data / environments as offline learning, and to learning on real data / environments as online learning. Transfer learning (TL) with fine-tuning on real data therefore comprises both offline and online learning.
II Related Work
Our work is inspired by the recent advances in deep reinforcement learning, learning complicated tasks and achieving (beyond) human level performance on a variety of tasks [2, 1, 3, 4].
TL has been an active area of research in the context of deep learning. For example, tasks such as object detection and classification can avoid costly training time by using pretrained networks and fine-tuning [5, 6], where typically only the weights in the last couple of layers are updated. TL from simulation to the real world has also been applied to learn robot tasks [7, 8, 9, 10]. To reduce the time required for fine-tuning in TL, the authors in [11] propose to make simulated data look more like the real world. In [12] the authors acknowledge that training robot tasks on simulated data alone does not readily transfer to the real world. They propose a form of fine-tuning where the inverse dynamics for the real robot are recovered. It requires a simulator and training which produces reasonable estimates of the real world situation. The drawback of this method is that it requires long online training times, whereas our goal is to minimize the duration of the online training.
By randomizing the appearance, the learning can become robust against appearance changes and readily transfer to the real world domain [13, 14]. The method proposed in [15] exploits an ensemble of simulated source domains and adversarial training to obtain robust policies. This policy search approach relies on trajectories and rollouts which solve the task. The approach proposed in [16] uses model-based RL to learn a controller entirely in simulation, allowing for zero-shot TL. Since we are considering tasks involving (much) more complex dynamics, we instead follow an approach similar to [17], and perform randomization of appearance, physics, and system parameters with model-free RL.
Model-agnostic meta-learning (MAML) [18] aims to learn a meta-policy that can be quickly adapted to new (but similar) tasks. In the case of complex dynamics it is not clear how easily MAML could be applied. Appearance and dynamics randomization can be considered forms of meta-learning. Other approaches aim to learn new tasks, or refine previously learned tasks, without "forgetting", e.g., [19]. Our emphasis instead is on reducing the amount of time required for fine-tuning in TL.
Our simulator provides observations of the state, similar to the real world. In [20] the critic receives full states, whereas the actor receives observations of states. Coupled with appearance randomization, zero-shot transfer can be achieved. Using the full state, however, requires that the physics parameters producing the complex dynamics match those of the real world, and precisely determining the physics parameters is nontrivial.
Formulating reward functions is not straightforward. The authors in [21] propose to discover robust rewards to enable the learning of complicated tasks. Adding subgoals, essentially a form of curriculum learning [22], can improve the learning as well [23]. The latter approach could be applied to break up the goal of the marble maze into stages. However, in this paper we show that a simple reward function which governs the overall goal of the game is sufficient.
The authors in [24] propose a gamelike environment for generating synthetic data for benchmark problems related to reinforcement learning. We developed our simulator along the same lines as [24].
In [25] the authors propose to model both the dynamics and control in order to solve the marble maze game. This is a complementary approach to the TL approach proposed in this paper, and we believe that each approach has its own strengths and weaknesses.
III Preliminaries
We briefly review some concepts from (deep) reinforcement learning (RL) using model-free asynchronous actor-critic methods, and define some terminology that we will use in the remainder of this paper. In the next section we will discuss our approach.
III-A Reinforcement Learning
In RL an agent interacts with an environment, represented by a set of states S, taking actions from an action set A, and receiving rewards r. The environment is governed by (unknown) state transition probabilities p(s'|s, a). The agent aims to learn a (stochastic) policy π(a|s), which predicts (a distribution over) actions a based on the state s. The goal for the agent is to learn a policy which maximizes the expected return E[R_t], where the return R_t = Σ_{k≥0} γ^k r_{t+k} denotes the discounted sum of future rewards, with discount factor γ ∈ (0, 1]. To determine for a given policy how good it is to be in a certain state, or how good it is to take a certain action in a certain state, RL depends on two value functions: a state-value function V(s) and an action-value function Q(s, a). For Markov decision processes, the value functions can be written as a recursion of expected rewards, e.g., V(s) = E[r + γ V(s')], where s denotes the current state and s' denotes the next state. The recursive formulations are known as Bellman equations. Solving the Bellman optimality equations gives rise to an optimal policy π*. For details we refer the reader to [26]. We consider the case where agents interact with the environment in episodes of finite length. The end of an episode is reached when the agent arrives at the timestep of maximum episode length, or the goal (terminal state) is achieved. In either case, the agent restarts from a new initial state.
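As a minimal illustration of the return defined above (with hypothetical reward values, not taken from our experiments), the discounted sum of future rewards can be computed by a backward recursion:

```python
def discounted_return(rewards, gamma):
    """Compute R_0 = sum_k gamma^k * r_k via the recursion
    R_t = r_t + gamma * R_{t+1}, as in the Bellman formulation."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# A sparse-reward episode (hypothetical): reward only at the final step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # approximately 0.81 (= 0.9^2)
```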
III-B Deep RL using Advantage Actor-Critic
In [1] the authors propose the asynchronous advantage actor-critic (A3C) algorithm. The algorithm defines two networks: a policy network π(a|s; θ) with network parameters θ, and a value network V(s; θ_v) with network parameters θ_v. This policy-based model-free method determines a reduced-variance estimate of the policy gradient as ∇_θ log π(a_t|s_t; θ)(R_t − V(s_t; θ_v)) [27]. The return R_t is an estimate of Q(s_t, a_t), and the baseline V(s_t; θ_v) is a learned estimate of the value function. The policy is referred to as the actor, and the value function estimate as the critic. The authors in [1] describe an algorithm where multiple agents learn in parallel, and each agent maintains local copies of the policy and value networks. Agents are trained on episodes of maximum length L_ep. Within each episode, trajectories are acquired as sequences (o_t, a_t, r_t) of maximum length t_max. Rather than the actual state, the inputs are observations (images) o_t of the state, and a forward pass of each image through the agent's local policy network results in a distribution over the actions. Every t_max steps, the parameters of the global policy and value networks are updated, and the agent synchronizes its local copy with the parameters of the global networks. The current episode ends after L_ep steps, or when the terminal state is reached, and then a new episode starts. This episodal learning is repeated until the task is solved consistently. See [1] for further details.
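The bootstrapped n-step returns and the resulting advantages used in such updates can be sketched as follows (a minimal illustration; the variable names are ours, not from [1]):

```python
def n_step_returns(rewards, values, bootstrap, gamma):
    """Bootstrapped returns over a trajectory: R_t = r_t + gamma * R_{t+1},
    with R at the truncation point set to the critic's estimate V(s_T)
    (the 'bootstrap'), or 0 if the trajectory ended in a terminal state.
    The advantage is R_t minus the learned baseline V(s_t)."""
    returns = []
    R = bootstrap
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages
```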
IV Deep Reinforcement Learning for a Task with Complex Dynamics
IV-A Setting up the Task
The task we aim to learn is to solve a marble maze game, see Figure 1. Solving the game means that the marble(s) are maneuvered from the outermost ring, through a sequence of gates, into the center. Due to static and dynamic friction, acceleration, damping, and the discontinuous geometry of the maze, the dynamics are (highly) complex and difficult to model. To solve the marble maze game using model-free RL we define a reward function as:
r_t = +1 if a marble passes a gate toward the center, r_t = −1 if a marble passes a gate away from the center, and r_t = 0 otherwise. (1)
This sparse reward function is general and does not encode any information about the actual geometry of the game. The action space is discretized into five actions. The first four actions constitute rotation increments, clockwise and counterclockwise around the x and y axes, up to a fixed maximum angle. Figure 1–Left shows the orientation of the x and y axes with respect to the maze. The increment is sufficient to overcome the static friction, while simultaneously avoiding accelerations that are too large. We define a fifth action as no-op, i.e., maintain the current orientation of the maze. The fixed maximum angle in either direction was determined empirically.
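The discretized action set and the sparse reward can be sketched as below. This is an illustrative reconstruction: the sign convention (+1 toward the center, −1 away) is chosen to be consistent with the accumulated-reward description in Sec. VI, and ANGLE_INCREMENT_DEG is a placeholder for the empirically determined maximum angle:

```python
# Placeholder value; the actual maximum angle was determined empirically.
ANGLE_INCREMENT_DEG = 1.0

# Five discrete actions: +/- rotation about the x and y axes, plus no-op.
ACTIONS = {
    0: ("x", +ANGLE_INCREMENT_DEG),
    1: ("x", -ANGLE_INCREMENT_DEG),
    2: ("y", +ANGLE_INCREMENT_DEG),
    3: ("y", -ANGLE_INCREMENT_DEG),
    4: ("noop", 0.0),
}

def gate_reward(prev_ring, ring):
    """Sparse reward per Eq. 1: +1 when a marble passes a gate toward the
    center (ring index decreases), -1 when it passes back out, 0 otherwise."""
    if ring < prev_ring:
        return 1.0
    if ring > prev_ring:
        return -1.0
    return 0.0
```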
IV-B Deep Reinforcement Learning on Simulated Robot Environments
In order to learn a robustified policy in simulation, we adopt the idea of randomization from [17, 13, 14]. We implemented two learning schemes. In the first scheme, each agent was assigned different parameters which were kept fixed for the duration of learning. In the second scheme, the physics and appearance parameters are randomly sampled from a predetermined range, according to a uniform distribution, for each episode and each agent. We found that the second scheme produced robustified policies which adapted more quickly during fine-tuning on the real robot after transfer.
We use the asynchronous advantage actor-critic (A3C) algorithm to learn a policy for the marble maze game. To successfully apply reinforcement learning with sparse rewards, a framework of auxiliary tasks may be incorporated [28]. One could consider path following as an auxiliary (dense reward) task. However, we aim to keep our approach as general as possible and not rely on the geometry of the maze. We instead incorporate pixel change and reward prediction, as proposed in [28]. Pixel change promotes taking actions which result in maximal change between images of consecutive states. In the context of the maze game, this helps avoid selecting consecutive actions that would result in little to no marble motion. Reward prediction aims to overrepresent rewarding events to offset the sparse reward signal provided by the reward function. To stabilize learning and avoid settling into suboptimal policies we employ generalized advantage estimation as proposed in [29], together with entropy regularization with respect to the policy parameters [1].
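Generalized advantage estimation [29] replaces the plain advantage with an exponentially weighted sum of temporal-difference errors. A minimal sketch (our own variable names):

```python
def gae_advantages(rewards, values, bootstrap, gamma, lam):
    """Generalized advantage estimation [29]:
    A_t = sum_k (gamma * lam)^k * delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `bootstrap` is the critic's value estimate at the truncation point."""
    values = list(values) + [bootstrap]
    adv, A = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        A = delta + gamma * lam * A
        adv.append(A)
    adv.reverse()
    return adv
```

With lam = 0 this reduces to one-step TD errors; with lam = 1 it recovers the plain (Monte Carlo) advantage.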
IV-B1 Robustified Policies
At the start of each episode, for each agent, the parameter values for static friction, dynamic friction, damping, and marble(s) mass are uniformly sampled from a range of values. We emulated a camera delay by rendering frames into a buffer. The camera delay was varied per episode and per agent. During each episode the parameters are held constant. Each observation received from the simulator is corrupted by additive white Gaussian noise (AWGN). We experimented with additional appearance changes, such as different light colors and intensities, but found that those changes had little effect on the time required for fine-tuning in our current setup.
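The per-episode randomization and the camera-delay emulation can be sketched as below. The parameter ranges shown are illustrative placeholders, not the tuned values used in our experiments:

```python
import random
from collections import deque

# Illustrative placeholder ranges; the paper randomizes around empirically
# tuned nominal MuJoCo values.
PARAM_RANGES = {
    "static_friction": (0.1, 0.5),
    "dynamic_friction": (0.05, 0.3),
    "damping": (0.2, 1.0),
    "marble_mass": (0.008, 0.02),
}

def sample_episode_params(rng):
    """Sample physics parameters uniformly once per episode (and per agent);
    they are held constant for the duration of the episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

class DelayedCamera:
    """Emulate acquisition delay by buffering rendered frames: the agent
    observes the frame rendered `delay` steps ago."""
    def __init__(self, delay):
        self.buffer = deque(maxlen=delay + 1)

    def observe(self, frame):
        self.buffer.append(frame)
        return self.buffer[0]
```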
IV-C Deep Reinforcement Learning on Real Robot Environments
A3C is an on-policy method, since the current policy is used in rollouts (with an ε-greedy exploration strategy) to obtain the current trajectory of fixed maximum length. For each update, A3C accumulates the losses for the policy and value networks over the trajectory and performs backpropagation of the losses to update the policy and value network parameters. The simulation is halted until the network parameters have been updated, and then rollouts for the next trajectory continue using the updated policy. For a real robot setup we need to be able to compute an update while simultaneously collecting the next trajectory, since we cannot halt the motion of the marble(s) during an update. We therefore adopt an off-policy approach for the real robot setup (see Algorithm 1).
We acquire the next trajectory while concurrently computing the updates for the policy and value networks based on the previously acquired trajectory. We first verified in simulation that our off-policy adaptation of A3C can indeed successfully learn a policy to solve the marble maze. If one had access to multiple robots, the robots could act as parallel agents, similar to the case of simulation. However, due to practical limitations we only have access to a single robot, and are thus limited to training with a single agent in the real world case.
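This off-policy adaptation can be sketched as a producer/consumer pair: one thread keeps collecting fixed-length trajectories while another updates the networks from the previously collected trajectory. A simplified sketch of the idea behind Algorithm 1; the env_step, policy, and update callables are placeholders:

```python
import queue
import threading

def actor_loop(env_step, policy, traj_queue, n_traj, t_max):
    """Collect fixed-length trajectories and hand them off for updating,
    without ever pausing the environment (the marble keeps moving)."""
    for _ in range(n_traj):
        traj = [env_step(policy) for _ in range(t_max)]
        traj_queue.put(traj)
    traj_queue.put(None)  # sentinel: collection finished

def learner_loop(traj_queue, update):
    """Consume trajectory i while the actor is already collecting i+1, so
    each update is computed from the previous (off-policy) trajectory."""
    while True:
        traj = traj_queue.get()
        if traj is None:
            break
        update(traj)
```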
V Implementation
We have implemented a simulation of the marble maze using MuJoCo [30] to simulate the dynamics, and Ogre 3D [31] for the appearance. We carefully measured the maze and marble dimensions to accurately reconstruct the 3D geometry. In order to match the simulated dynamics to the real world dynamics, we tuned the MuJoCo parameters, in particular the static friction, dynamic friction, and damping parameters. For tuning, the maze was inclined to a known orientation, and the marble was released from various predetermined locations within the maze. Using the markers (see Figure 1) we aligned the images of the simulated maze to the real maze by computing a homography warp. We then empirically tuned the parameters to match the marble oscillations between the simulated and real maze. Learning the parameters instead would be preferable, but this is left as future work. The simulator is executed as a separate process, and communication between controller and simulator is performed via sockets. The simulator receives an action to perform, and returns an image of the updated marble positions and maze orientation, along with a reward (according to Eq. 1) and a terminal flag.
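The controller-simulator exchange over sockets can be sketched with a simple length-prefixed message framing. This is an illustrative protocol only; the actual wire format of our implementation may differ:

```python
import pickle
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def send_msg(sock, obj):
    """Sockets are byte streams, so each message is prefixed with its
    payload length (4-byte big-endian unsigned int)."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))
```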
The policy network consists of two convolutional layers, followed by a fully-connected layer. The input to the network is an 84×84 image. A one-hot action vector and the reward are appended to the 256-dim. output of the fully-connected layer, and this serves as input to an LSTM layer. This part of the network is shared between the policy (actor) and value (critic) networks. For the policy network, a fully-connected layer with softmax activation computes a distribution over the actions. For the value network, a fully-connected layer outputs a single value. The maximum episode and trajectory lengths were chosen empirically.
The (observation, action, reward) tuples are stored in a FIFO experience buffer (of length 3000). We keep track of which tuples have zero and nonzero rewards for importance sampling. For reward prediction we (importance) sample three consecutive frames from the experience buffer. The two convolutional layers and the fully-connected layer are shared with the policy and value networks. Two more fully-connected layers determine a distribution over negative, zero, or positive rewards.
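The FIFO experience buffer with skewed (importance) sampling for reward prediction can be sketched as follows (a simplified sketch; the class and method names are ours):

```python
import random
from collections import deque

class ExperienceBuffer:
    """FIFO buffer of (observation, action, reward) tuples."""

    def __init__(self, capacity=3000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward):
        self.buffer.append((obs, action, reward))

    def sample_reward_prediction(self, rng):
        """Sample three consecutive tuples such that sequences ending in a
        rewarding (nonzero-reward) step are drawn as often as non-rewarding
        ones, offsetting the sparse reward signal."""
        want_reward = rng.random() < 0.5
        ends = [i for i in range(2, len(self.buffer))
                if (self.buffer[i][2] != 0) == want_reward]
        if not ends:  # fall back when one class is absent
            ends = range(2, len(self.buffer))
        end = rng.choice(list(ends))
        return [self.buffer[end - 2], self.buffer[end - 1], self.buffer[end]]
```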
For pixel change, we compute the average pixel change over a 20×20 grid for the central 80×80 portion of consecutive images. The pixel-change network reuses the layers up to and including the LSTM layer of the policy and value networks. A fully-connected layer together with deconvolution layers predicts 20×20 pixel-change images. A fixed maximum number of frames is sampled from the experience buffer, and we compute the L2 loss between the pixel change predicted by the network and the recorded pixel change over the sampled sequence. Both auxiliary losses are added to the A3C loss.
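The recorded pixel-change target can be computed as the mean absolute difference per 4×4 cell over the central 80×80 crop of consecutive 84×84 frames, yielding a 20×20 grid. A sketch in plain Python; frames are 2D grayscale arrays here for simplicity:

```python
def pixel_change_target(frame, next_frame, crop=80, cell=4):
    """Mean absolute pixel difference per cell x cell block over the
    central crop x crop region, returning a (crop/cell) x (crop/cell) grid."""
    m = (len(frame) - crop) // 2  # margin: (84 - 80) // 2 = 2
    g = crop // cell
    target = [[0.0] * g for _ in range(g)]
    for i in range(crop):
        for j in range(crop):
            d = abs(next_frame[m + i][m + j] - frame[m + i][m + j])
            target[i // cell][j // cell] += d
    n = cell * cell
    return [[v / n for v in row] for row in target]
```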
The physics parameters are uniformly sampled from a range around the empirically estimated parameter values. Due to the lack of intuitive interpretation of some of the physics parameters, the range was determined by visually inspecting the resulting dynamics to ensure that the dynamics had sufficient variety, but did not lead to instability in the simulation.
For the real setup, the ROS framework is used to integrate the learning with camera acquisition and robot control. The camera is an Intel RealSense R200 and the robot arm is a Mitsubishi Electric Melfa RV-6SL (see Figure 1–Middle). The execution time of a rotation command for the robot arm is about 190ms. Forward passes through the networks and additional computation add up to about 20 to 30ms. Although we can overlap computation and robot command execution to some degree, observations are acquired at a frame rate of 4.3Hz, i.e., 233ms intervals, to ensure robot commands are completed entirely before the new state is obtained. We observed that during concurrent network parameter updates the computation time for a forward pass through the policy network increases drastically. If we expect that the robot action cannot be completed before the new state is observed by the camera, we set the action to no-op (Sec. IV-A). We implemented a simple marble detector to determine when a marble has passed through a gate, in order to provide a reward signal. For learning in simulation we use the same 4.3Hz frame rate. Each incremental rotation action is performed over the course of the allotted time interval of 233ms, such that the next state provided by the simulator reflects the situation after a complete incremental rotation.
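The no-op fallback implied by the timing constraint above can be sketched as a simple guard (the no-op action index is a placeholder):

```python
FRAME_INTERVAL = 0.233   # seconds per observation at 4.3 Hz
ROBOT_EXEC_TIME = 0.190  # approximate rotation-command execution time
NOOP = 4                 # placeholder index of the no-op action (Sec. IV-A)

def safe_action(action, compute_time):
    """Fall back to no-op when the rotation could not complete before the
    next observation is acquired (e.g., while a concurrent network update
    slows down the policy forward pass)."""
    if compute_time + ROBOT_EXEC_TIME > FRAME_INTERVAL:
        return NOOP
    return action
```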
VI Results
Table I
              Online (real)   Offline (simulator)   TL (online part)
Robust        3.5M            4.0M                  55K
Non-Robust    3.5M            4.5M                  220K
Table I compares the number of steps for training a policy to successfully play a one marble maze game. Training directly on the real robot takes about 3.5M steps. For TL, we compare the number of fine-tuning steps necessary for a robustified policy versus a non-robustified policy (fixed parameters). Training a robustified policy in simulation takes about 4.0M steps, whereas a non-robustified policy takes approximately 4.5M steps to achieve a 100% success rate. TL of a robustified policy requires about 55K steps to "converge". This is a reduction of nearly 60× compared to online training. A non-robustified policy requires at least 3× the number of fine-tuning steps in order to achieve the same level of success in solving the maze game.
Figure 2 further shows the benefit of TL with a robustified policy. The left side of Figure 2 shows results for the robustified policy, with results for the non-robustified policy on the right. The bottom row shows the accumulated reward per episode. An accumulated reward of 4.0 means that the marble has been maneuvered from the outside ring into the center, since there are four gates to pass through. The graph for the robustified policy shows that the learning essentially converges, i.e., achieves a 100% success rate, whereas for the non-robustified policy transfer the success rate is around 90%. The top row of Figure 2 shows the length of each episode. It is evident that the robustified policy has successfully learned how to handle the complex dynamics to solve the maze game.
We repeated the same experiment for a two marble maze game, with the goal of getting both marbles into the center of the maze. We only compared TL with the robustified policy. The results are shown in Table II. Learning a two marble game in simulation with the rewards of Eq. 1 achieved a 100% success rate. However, training on the real setup with these rewards proved very challenging. We believe this is due to the geometry of the maze (the center has only one gate, surrounded by four gates in the adjacent ring), coupled with the static friction. We therefore designed a reward function which gives more importance to passing through gates into rings closer to the goal. This promotes one marble staying in the center area while the controller maneuvers the remaining marble. The rewards were modified accordingly (the modified rewards were also used for training the two marble game offline). When learning online, even after 1M steps the success rate is still at 0% (a single marble reached the center about a dozen times). With fine-tuning of a transferred robustified policy, a success rate of around 75% is achieved after 225K steps.
Table II
         Online    Offline       TL (online part)
Robust   1M (0%)   3.0M (100%)   225K (75%)
We also investigated whether the transfer of a single marble policy learned offline would require longer fine-tuning for a two marble game online. After 100K steps of fine-tuning, the policy was able to start solving the game. A success rate of about 50% was achieved after 400K steps. Thus, fine-tuning a robustified policy trained on a two marble maze game in simulation achieves a higher success rate than fine-tuning a single marble robustified policy.
We refer the reader to the supplemental material for videos of example rollouts for single and two marble maze games.
VII Discussion and Future Work
Deep reinforcement learning is capable of learning complicated robot tasks, in some cases achieving (beyond) human-level performance. Deep RL requires many training samples, especially in the case of model-free approaches. For learning robot tasks, learning in simulation is desirable since robots are slow, can be dangerous, and are expensive. Powerful GPUs and CPUs have enabled simulation of complex dynamics coupled with high quality rendering at high speeds. Transfer learning, i.e., training in simulation and subsequent transfer to the real world, is typically followed by fine-tuning. Fine-tuning is necessary to adapt to any differences between the simulated and the real world. Previous work has focused on transfer learning for tasks involving (largely) linear dynamics, such as controlling a robot to pick an object and place it at some desired location. We instead explore the case where the dynamics are complex. Nonlinearities arise due to static and dynamic friction, acceleration, and collisions of objects interacting with each other and the environment. We compare learning online, i.e., directly in the real world, with learning in simulation where the physics, appearance, and system parameters are varied during training. For reinforcement learning we refer to this as learning robustified policies. We show that the time required for fine-tuning with robustified policies is greatly reduced.
Although we have shown that model-free deep reinforcement learning can be successfully used to learn tasks involving complex dynamics, there are drawbacks to using a model-free approach. In the example discussed in our paper, the dynamics are (mostly) captured by the LSTM layer in the network. In the case of more than one marble, the amount of fine-tuning time increases significantly. In general, as the complexity of the state space increases, the amount of training time increases as well. When people perform tasks such as the maze game, they typically have a decent prediction of where the marble(s) will go given the amount of rotation applied. In [32, 33] the graphics and physics engines are embedded within the learning to recover physics parameters and perform predictions of the dynamics. In [34] the physics and dynamics predictions are modeled with networks. These approaches are interesting research directions for tasks involving complex dynamics.
We currently use high-dimensional images as input to the learning framework. Low-dimensional input, i.e., marble position and velocity, could be used instead. In addition, rather than producing a distribution over a discrete set of actions, the problem could be formulated as a regression which directly produces values for the x and y axis rotations [35, 1].
People quickly figure out that the task can be broken down into moving a single marble at a time into the center, while preventing marbles already in the center from spilling back out. Discovering such subtasks automatically would be another interesting research direction. Along those lines, teaching a robot to perform tasks by human demonstration, or imitation learning, could teach robots complicated tasks without the need for elaborate reward functions, e.g., [36].

Acknowledgements
We want to thank Rachana Sreedhar for the implementation of the simulator and Wei-An Lin for the PyTorch implementation of deep reinforcement learning.
References
 [1] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
 [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
 [4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. X. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017.
 [5] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” Computer Vision and Pattern Recognition (CVPR), 2014.
 [6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” Advances in Neural Information Processing Systems (NIPS), 2014.
 [7] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, “Sim-to-real robot learning from pixels with progressive nets,” arXiv preprint, vol. arXiv/1610.04286, 2016.
 [8] F. Zhang, J. Leitner, M. Milford, and P. Corke, “Sim-to-real transfer of visuomotor policies for reaching in clutter: Domain randomization and adaptation with modular networks,” CoRR, vol. abs/1709.05746, 2017. [Online]. Available: http://arxiv.org/abs/1709.05746
 [9] F. Zhang, J. Leitner, B. Upcroft, and P. I. Corke, “Vision-based reaching using modular deep networks: from simulation to the real world,” arXiv preprint, vol. arXiv:1610.06781, 2016. [Online]. Available: http://arxiv.org/abs/1610.06781
 [10] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” arXiv preprint, vol. arXiv/1709.07857, 2017.
 [11] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” Computer Vision and Pattern Recognition (CVPR), 2016.
 [12] P. F. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba, “Transfer from simulation to real world through learning deep inverse dynamics model,” arXiv preprint, vol. arXiv/1610.03518, 2016.
 [13] S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” Conference on Robot Learning (CoRL), 2017.
 [14] F. Sadeghi and S. Levine, “(CAD)²RL: Real single-image flight without a single real image,” Robotics: Science and Systems Conference (RSS), 2016.
 [15] A. Rajeswaran, S. Ghotra, S. Levine, and B. Ravindran, “Epopt: Learning robust neural network policies using model ensembles,” International Conference on Learning Representations (ICLR), vol. abs/1610.01283, 2016. [Online]. Available: http://arxiv.org/abs/1610.01283
 [16] K. Lowrey, S. Kolev, J. Dao, A. Rajeswaran, and E. Todorov, “Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system,” IEEE Conf. on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), vol. abs/1803.10371, 2018. [Online]. Available: http://arxiv.org/abs/1803.10371
 [17] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” arXiv preprint, vol. abs/1710.06537, 2018. [Online]. Available: http://arxiv.org/abs/1710.06537
 [18] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” International Conference on Machine Learning (ICML), vol. abs/1703.03400, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400
 [19] Z. Li and D. Hoiem, “Learning without forgetting,” European Conference on Computer Vision (ECCV), 2016.
 [20] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” CoRR, vol. abs/1710.06542, 2017. [Online]. Available: http://arxiv.org/abs/1710.06542
 [21] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” CoRR, vol. abs/1710.11248, 2017. [Online]. Available: http://arxiv.org/abs/1710.11248

 [22] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning (ICML), 2009.
 [23] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems (NIPS), 2017.
 [24] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint, vol. arXiv/1606.01540, 2016.
 [25] D. Romeres, D. Jha, A. DallaLibera, B. Yerazunis, and D. Nikovski, “Learning hybrid models to control a ball in a circular maze,” arXiv preprint, vol. abs/1809.04993, 2018. [Online]. Available: http://arxiv.org/abs/1809.04993
 [26] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
 [27] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, May 1992.
 [28] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint, vol. arXiv/1611.05397, 2016.
 [29] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” International Conference on Learning Representations (ICLR), 2016.
 [30] E. Todorov, “Convex and analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo,” IEEE International Conference on Robotics and Automation (ICRA), 2014.
 [31] “Ogre 3D,” http://www.ogre3d.org, 2018, [Accessed May 2018].
 [32] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, “Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,” Advances in Neural Information Processing Systems (NIPS), 2015.
 [33] J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum, “Learning to see physics via visual deanimation,” Advances in Neural Information Processing Systems (NIPS), 2017.
 [34] S. Ehrhardt, A. Monszpart, N. J. Mitra, and A. Vedaldi, “Unsupervised intuitive physics from visual observations,” arXiv preprint, vol. abs/1805.05086, 2018. [Online]. Available: http://arxiv.org/abs/1805.05086
 [35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” International Conference on Learning Representations (ICLR), 2015.
 [36] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “Oneshot visual imitation learning via metalearning,” Conference on Robot Learning (CoRL), 2017.