Teaching robots to perform challenging tasks has been an active topic of research. In particular, it has recently been demonstrated that reinforcement learning (RL) coupled with deep neural networks is able to learn policies (controllers) which can successfully perform tasks such as pick and fetch.
Robots are slow and expensive, can be dangerous, and can damage themselves. When a robot is learning a task, it needs to be taken out of production. Learning policies using model-free deep RL typically requires many samples to explore the sequential decision-making space. Model-free RL applied to tasks that involve complex dynamics requires even more samples to learn adequate policies than tasks involving (largely) linear dynamics. Directly learning on robots may thus be very costly.
In order to reduce the time required for learning on a real robot, training can be performed in simulation environments, and the learned policy is then transferred to the real-world domain. Modern graphics cards and sophisticated physics engines enable the simulation of complex tasks. Learning with simulators has several advantages. The rendering and physics engines are capable of computing simulations faster than real time, and recent deep reinforcement learning algorithms allow agents to learn in parallel; both reduce overall training times. Furthermore, both appearance and physics can be controlled in simulation. For example, the lighting conditions or the friction of an object can be changed, or the entire simulation can be halted to allow for the computation of updates.
Appearance, complex dynamics, and robot motor movements in the real world can only be simulated up to some approximation. Simulation-to-real-world transfer thus requires fine-tuning on real data. Furthermore, real setups involving various components experience delays which are hard to determine exactly; for example, the acquisition system introduces a delay before an acquired image is available for processing by the algorithm.
By randomizing the appearance, physics, and system parameters during reinforcement learning on simulation data, robustified policies can be learned. This is analogous to training a deep convolutional neural network to classify objects regardless of the background in the input images. We found that robustified policies can greatly reduce the amount of time needed for fine-tuning in transfer learning. Reducing the fine-tuning time becomes especially important for tasks involving complex dynamics.
We demonstrate our proposed approach on the challenging task of a robot learning to solve a marble maze game, shown in Figure 1. The marbles are subject to static and rolling friction, acceleration, and collisions (with other marbles and with the maze geometry). A simulator simulates the physics of the marbles in the maze game and renders the results to images. We learn to solve the game from scratch using deep reinforcement learning, and use a modified version of the deep reinforcement learning algorithm to learn directly on real robot hardware. We learn both a robustified and a non-robustified policy in simulation and compare the times required for fine-tuning after transferring the policy to the real world.
In the remainder of this paper we will refer to learning on simulated data / environments as offline learning, and learning on real data / environments will be referred to as online learning. Transfer learning (TL) with fine-tuning on real data therefore constitutes both offline as well as online learning.
II Related Work
TL has been an active area of research in the context of deep learning. For example, tasks such as object detection and classification can avoid costly training time by using pre-trained networks and fine-tuning [5, 6], where typically only the weights in the last couple of layers are updated. TL from simulation to the real world has also been applied to learn robot tasks [7, 8, 9, 10]. To reduce the time required for fine-tuning in TL, one line of work proposes to make simulated data look more like the real world.
Other authors acknowledge that training robot tasks on simulated data alone does not readily transfer to the real world, and propose a form of fine-tuning in which the inverse dynamics of the real robot are recovered. This requires a simulator and training regime which produce reasonable estimates of the real-world situation. The drawback of this method is that it requires long online training times, whereas our goal is to minimize the duration of the online training.
By randomizing the appearance, the learning can become robust against appearance changes and readily transfer to the real-world domain [13, 14]. One proposed method exploits an ensemble of simulated source domains and adversarial training to obtain robust policies; this policy-search approach relies on trajectories and roll-outs which solve the task. Another approach uses model-based RL to learn a controller entirely in simulation, allowing for zero-shot TL. Since we are considering tasks involving (much) more complex dynamics, we instead follow a similar approach to dynamics randomization, and randomize appearance, physics, and system parameters with model-free RL.
Model-agnostic meta-learning (MAML) aims to learn a meta-policy that can be quickly adapted to new (but similar) tasks. In the case of complex dynamics it is not clear how easily MAML could be applied. Appearance and dynamics randomization can be considered forms of meta-learning. Other approaches aim to learn new tasks, or refine previously learned tasks, without "forgetting" earlier ones. Our emphasis instead is on reducing the amount of time required for fine-tuning in TL.
Our simulator provides observations of the state, similar to the real world. In asymmetric actor-critic approaches, the critic receives full states whereas the actor receives observations of states; coupled with appearance randomization, zero-shot transfer can be achieved. Using the full state, however, requires that the physics parameters producing the complex dynamics match those of the real world, and precisely determining these physics parameters is non-trivial.
Formulating reward functions is not straightforward. One line of work proposes to discover robust rewards to enable the learning of complicated tasks. Adding additional goals (sub-goals), essentially a form of curriculum learning, can improve the learning as well. The latter approach could be applied to break up the goal of the marble maze into stages. However, in this paper we show that a simple reward function which captures the overall goal of the game is sufficient.
Prior work has also proposed modeling both the dynamics and the control in order to solve the marble maze game. This is complementary to the TL approach proposed in this paper, and we believe that each approach has its own strengths and weaknesses.
We briefly review some concepts from (deep) reinforcement learning (RL) using model-free asynchronous actor-critic, and define some terminology that we will use in the remainder of this paper. In the next section we will discuss our approach.
III-A Reinforcement Learning
In RL an agent interacts with an environment, represented by a set of states S, taking actions from an action set A, and receiving rewards r. The environment is governed by (unknown) state transition probabilities p(s' | s, a). The agent aims to learn a (stochastic) policy π(a | s), which predicts (a distribution over) actions a based on the state s. The goal for the agent is to learn a policy which maximizes the expected return E[R_t], where the return R_t = Σ_{k≥0} γ^k r_{t+k} denotes the discounted sum of future rewards, with discount factor γ ∈ (0, 1].
To determine for a given policy how good it is to be in a certain state, or how good it is to take a certain action in a certain state, RL relies on two value functions: a state-value function V^π(s) and an action-value function Q^π(s, a). For Markov decision processes, the value functions can be written as a recursion of expected rewards, e.g., V^π(s) = E[r + γ V^π(s')], where s denotes the current state and s' denotes the next state. The recursive formulations are known as Bellman equations, and solving the Bellman optimality equations gives rise to the optimal policy π*. For details we refer the reader to the standard text by Sutton and Barto.
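As a small concrete illustration of the discounted return defined above, it can be computed by a single backward pass over a reward sequence (a minimal sketch; the function name and default γ are ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... by backward accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with γ = 0.5 the reward sequence [0, 0, 1] yields a return of 0.25, since the single reward is discounted twice.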
We consider the case where agents interact with the environment in episodes of finite length. The end of an episode is reached if the agent arrives at the timestep of maximum episode length, or the goal (terminal state) is achieved. In either case, the agent restarts from a new initial state.
III-B Deep RL using Advantage Actor-Critic
The asynchronous advantage actor-critic algorithm defines two networks: a policy network π(a | s; θ) with network parameters θ, and a value network V(s; θ_v) with network parameters θ_v. This policy-based model-free method determines a reduced-variance estimate of the policy gradient by using the advantage A_t = R_t − V(s_t; θ_v). The return R_t is an estimate of Q(s_t, a_t), and the baseline V(s_t; θ_v) is a learned estimate of the value function. The policy π is referred to as the actor, and the value function estimate as the critic.
The authors describe an algorithm where multiple agents learn in parallel, and each agent maintains local copies of the policy and value networks. Agents are trained on episodes of some maximum length, within which trajectories are acquired as sequences (s_t, a_t, r_t, s_{t+1}) of maximum length t_max. Rather than the actual state, the inputs are observations (images) of the state, and a forward pass of each image through the agent's local policy network results in a distribution over the actions. Every t_max steps, the parameters of the global policy and value networks are updated and the agent synchronizes its local copy with the parameters of the global networks. The current episode ends after the maximum number of steps, or when the terminal state is reached, and then a new episode starts. This episodic learning is repeated until the task is solved consistently. We refer to the original paper for further details.
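The bootstrapped n-step returns used in the advantage estimate A_t = R_t − V(s_t) can be sketched as follows (a minimal illustration; the function name is ours, and the bootstrap value stands for V of the state after the trajectory):

```python
def nstep_returns(rewards, bootstrap_value, gamma=0.99):
    """Bootstrapped n-step returns R_t for one trajectory segment.

    bootstrap_value approximates the value of the state following the last
    reward; it is propagated backwards through the discounted sum.
    """
    returns = []
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```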
IV Deep Reinforcement Learning for a Task with Complex Dynamics
IV-A Setting up the Task
The task we aim to learn is to solve a marble maze game, see Figure 1. Solving the game means that the marble(s) are maneuvered from the outermost ring, through a sequence of gates, into the center. Due to static and dynamic friction, acceleration, damping, and the discontinuous geometry of the maze, the dynamics are (highly) complex and difficult to model. To solve the marble maze game using model-free RL we can define a reward function as:

    r_t = 1 if a marble passes through a gate toward the center at time t, and r_t = 0 otherwise.   (1)
This sparse reward function is general and does not encode any information about the actual geometry of the game. The action space is discretized into five actions. The first four actions constitute rotation increments, clockwise and counterclockwise around the x and y axes, up to a fixed maximum angle. Figure 1 (left) shows the orientation of the x and y axes with respect to the maze. The increment is sufficient to overcome the static friction, while simultaneously avoiding accelerations that are too large. We define a fifth action as no-op, i.e., maintain the current orientation of the maze. We empirically determined a suitable fixed maximum angle in either direction.
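The discrete action set and the sparse reward of Eq. 1 can be sketched as follows. The action names are hypothetical, and the ring-index encoding (0 = center, larger indices further out) is our assumption for illustration:

```python
# Hypothetical names for the five discrete actions: four fixed-angle tilt
# increments around the x and y axes, plus a no-op.
ACTIONS = ("x_cw", "x_ccw", "y_cw", "y_ccw", "no_op")

def sparse_reward(prev_ring, ring):
    """Assumed form of Eq. 1: +1 when a marble passes a gate toward the
    center (its ring index decreases), 0 otherwise."""
    return 1.0 if ring < prev_ring else 0.0
```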
IV-B Deep Reinforcement Learning on Simulated Robot Environments
We implemented two learning schemes. In the first scheme, each agent was assigned different parameters which were kept fixed for the duration of learning. In the second scheme, the physics and appearance parameters are randomly sampled from a pre-determined range, according to a uniform distribution, for each episode and each agent. We found that the second scheme produced robustified policies which adapted more quickly during fine-tuning on the real robot after transfer.
We use the asynchronous advantage actor-critic (A3C) algorithm to learn a policy for the marble maze game. To successfully apply reinforcement learning with sparse rewards, a framework of auxiliary tasks may be incorporated. One could consider path following as an auxiliary (dense-reward) task. However, we aim to keep our approach as general as possible and not rely on the geometry of the maze. We instead incorporate pixel change and reward prediction, as proposed in work on unsupervised auxiliary tasks. Pixel change promotes taking actions which result in maximal change between images of consecutive states; in the context of the maze game, this helps avoid selecting consecutive actions that would result in little to no marble motion. Reward prediction over-represents rewarding events to offset the sparse reward signal provided by the reward function. To stabilize learning and avoid settling into sub-optimal policies, we employ generalized advantage estimation together with entropy regularization with respect to the policy parameters.
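Generalized advantage estimation, as referenced above, forms an exponentially weighted sum of TD errors δ_t = r_t + γV(s_{t+1}) − V(s_t). A minimal sketch (function name and defaults are ours):

```python
def gae(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    values[t] is V(s_t); bootstrap_value approximates V of the state after
    the last step.  Advantages are accumulated backwards:
        A_t = delta_t + gamma*lam * A_{t+1}.
    """
    advantages = []
    a = 0.0
    next_v = bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_v - v
        a = delta + gamma * lam * a
        advantages.append(a)
        next_v = v
    return list(reversed(advantages))
```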
IV-B1 Robustified Policies
At the start of each episode, for each agent, the parameter values for static friction, dynamic friction, damping, and marble mass are uniformly sampled from a range of values. We emulated a camera delay by rendering frames into a buffer; the camera delay was varied per episode and agent. During each episode the parameters are held constant. Each observation received from the simulator is corrupted by additive white Gaussian noise (AWGN). We experimented with additional appearance changes, such as different light colors and intensities, but found that those changes had little effect on the time required for fine-tuning in our current setup.
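The per-episode randomization can be sketched as below. The parameter ranges here are purely illustrative placeholders; the actual ranges were determined empirically (Sec. IV-B):

```python
import random

# Illustrative ranges only -- the paper's tuned ranges are not reproduced here.
PARAM_RANGES = {
    "static_friction": (0.1, 0.5),
    "dynamic_friction": (0.05, 0.3),
    "damping": (0.01, 0.1),
    "marble_mass": (0.005, 0.02),
}
CAMERA_DELAY_FRAMES = (0, 3)  # delay emulated by buffering rendered frames

def sample_episode_params(rng=random):
    """Uniformly sample one parameter set, held constant for the whole episode."""
    params = {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}
    params["camera_delay_frames"] = rng.randint(*CAMERA_DELAY_FRAMES)
    return params
```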
IV-C Deep Reinforcement Learning on Real Robot Environments
A3C is an on-policy method: the current policy is used in roll-outs (with an ε-greedy exploration strategy) to obtain the current trajectory of length t_max. For each update, A3C accumulates the losses for the policy and value networks over the trajectory and backpropagates them to update the policy and value network parameters. The simulation is halted until the network parameters have been updated, and then roll-outs for the next trajectory continue using the updated policy.
For a real robot setup we need to be able to compute an update, while simultaneously collecting the next trajectory, since we cannot halt the motion of the marble(s) during an update. We therefore adopt an off-policy approach for the real robot setups (see Algorithm 1).
We acquire the next trajectory while concurrently computing the updates for the policy and value networks based on the previously acquired trajectory. We first verified in simulation that our off-policy adaptation of A3C is indeed able to successfully learn a policy that solves the marble maze. If one had access to multiple robots, the robots could act as parallel agents, similar to the case of simulation. However, due to practical limitations, we only have access to a single robot and are thus limited to training with a single agent in the real world.
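The overlap of acquisition and learning can be sketched with a worker thread and a queue (a minimal sketch of the scheduling only; function names and the queue-based hand-off are our assumptions, not the paper's implementation):

```python
import queue
import threading

def actor_learner_loop(collect_trajectory, update_networks, num_trajectories):
    """Update on trajectory k-1 while trajectory k is being collected."""
    pending = queue.Queue()

    def learner():
        while True:
            traj = pending.get()
            if traj is None:          # sentinel: no more trajectories
                return
            update_networks(traj)     # runs concurrently with the next collection

    worker = threading.Thread(target=learner)
    worker.start()
    for _ in range(num_trajectories):
        pending.put(collect_trajectory())  # robot keeps moving meanwhile
    pending.put(None)
    worker.join()
```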
We have implemented a simulation of the marble maze using MuJoCo to simulate the dynamics, and Ogre 3D for the appearance. We carefully measured the maze and marble dimensions to accurately reconstruct the 3D geometry. In order to match the simulated dynamics to the real-world dynamics, we tuned the MuJoCo parameters, in particular the static friction, dynamic friction, and damping parameters. For tuning, the maze was inclined to a known orientation, and the marble was released from various pre-determined locations within the maze. Using the markers (see Figure 1) we aligned the images of the simulated maze to the real maze by computing a homography warp. We then empirically tuned the parameters to match the marble oscillations between the simulated and real maze. Learning the parameters instead would be preferable, but this is left as future work. The simulator is executed as a separate process, and communication between controller and simulator is performed via sockets. The simulator receives an action to perform, and returns an image of the updated marble positions and maze orientation, along with a reward (according to Eq. 1) and a terminal flag.
The policy network consists of two convolutional layers, followed by a fully-connected layer. The input to the network is an 84×84 image. A one-hot action vector and the reward are appended to the 256-dimensional output of the fully-connected layer, and this serves as input to an LSTM layer. This part of the network is shared between the policy (actor) and value (critic) networks. For the policy network, a fully-connected layer with softmax activation computes a distribution over the actions. For the value network, a fully-connected layer outputs a single value. The remaining hyperparameter values were chosen empirically.
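The spatial dimensions flowing through the convolutional trunk can be checked with a small helper. The kernel sizes and strides below (8×8 stride 4, then 4×4 stride 2) are the common A3C-style choices and are an assumption here, since the paper does not list them:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid (no padding) convolution."""
    return (size - kernel) // stride + 1

# Assumed A3C-style trunk on the 84x84 input:
after_conv1 = conv_out(84, 8, 4)            # first conv layer
after_conv2 = conv_out(after_conv1, 4, 2)   # second conv layer
```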
The (s, a, r, s')-tuples are stored in a FIFO experience buffer (of length 3000). We keep track of which tuples have zero and non-zero rewards for importance sampling. For reward prediction we (importance) sample three consecutive frames from the experience buffer. The two convolutional layers and the fully-connected layer are shared with the policy and value networks. Two additional fully-connected layers determine a distribution over negative, zero, or positive rewards.
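A sketch of such a buffer follows. The 50/50 bias toward rewarding sequences is our assumption; the paper only states that rewarding events are over-represented:

```python
import random
from collections import deque

class ExperienceBuffer:
    """FIFO buffer that tracks rewarding tuples for reward-prediction sampling."""

    def __init__(self, capacity=3000):
        self.buf = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample_reward_prediction(self, rng=random):
        """Sample 3 consecutive tuples, biased toward sequences ending in a reward."""
        rewarding = [i for i in range(2, len(self.buf)) if self.buf[i][2] != 0]
        if rewarding and rng.random() < 0.5:   # assumed 50/50 over-representation
            end = rng.choice(rewarding)
        else:
            end = rng.randrange(2, len(self.buf))
        return [self.buf[i] for i in (end - 2, end - 1, end)]
```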
For pixel change, we compute the average pixel change on a 20×20 grid over the central 80×80 portion of consecutive images. The pixel-change network re-uses the layers up to and including the LSTM layer of the policy and value networks. A fully-connected layer together with deconvolution layers predicts 20×20 pixel-change images. A bounded number of frames is sampled from the experience buffer, and we compute the L2 loss between the pixel change predicted by the network and the recorded pixel change over the sampled sequence. Both auxiliary losses are added to the A3C loss.
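The recorded pixel-change target itself is a simple average of absolute differences per grid cell. With an 80×80 crop and a 20×20 grid, each cell covers 4×4 pixels; the sketch below accepts images as nested lists for self-containment:

```python
def pixel_change(prev, curr, cell=4):
    """Average absolute pixel difference per cell of a coarse grid.

    For the 80x80 central crop and a 20x20 grid, cell = 80 // 20 = 4.
    """
    n = len(curr)
    grid = []
    for gy in range(0, n, cell):
        row = []
        for gx in range(0, n, cell):
            diff = sum(abs(curr[y][x] - prev[y][x])
                       for y in range(gy, gy + cell)
                       for x in range(gx, gx + cell))
            row.append(diff / (cell * cell))
        grid.append(row)
    return grid
```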
The physics parameters are uniformly sampled from a range around the empirically estimated parameter values. Due to the lack of intuitive interpretation of some of the physics parameters, the range was determined by visually inspecting the resulting dynamics to ensure that the dynamics had sufficient variety, but did not lead to instability in the simulation.
For the real setup, the ROS framework is used to integrate the learning with camera acquisition and robot control. The camera is an Intel RealSense R200 and the robot arm is a Mitsubishi Electric Melfa RV-6SL (see Figure 1–Middle). The execution time of a rotation command for the robot arm is about 190ms. Forward passes through the networks and additional computation time add up to about 20 or 30ms. Although we can overlap computation and robot command execution to some degree, observations are acquired at a framerate of 4.3Hz, i.e. 233ms intervals, to ensure robot commands are completed entirely before the new state is obtained. We observed that during concurrent network parameter updates the computation time for a forward pass through the policy network increases drastically. If we expect that the robot action cannot be completed before the new state is observed by the camera, we set the action to no-op (Sec. IV-A). We implemented a simple marble detector to determine when a marble has passed through a gate, in order to provide a reward signal. For learning in simulation we use the same 4.3Hz framerate. Each incremental rotation action is performed over the course of the allotted time interval of 233ms, such that the next state provided by the simulator reflects the situation after a complete incremental rotation.
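The timing constraint above (a ~190 ms rotation inside a 233 ms step, with a no-op fallback when inference runs too long) can be sketched as:

```python
STEP_INTERVAL = 0.233    # observations arrive at ~4.3 Hz
ROTATION_TIME = 0.190    # approximate robot rotation execution time (s)
COMPUTE_BUDGET = STEP_INTERVAL - ROTATION_TIME   # time left for inference etc.

def choose_action(policy_action, compute_time):
    """Fall back to no-op when inference ran too long for the rotation to
    complete before the next observation (e.g., during a concurrent update)."""
    return policy_action if compute_time <= COMPUTE_BUDGET else "no_op"
```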
[Table I: number of training steps for the one-marble maze game; columns: Online (real), Offline (simulator), TL (online part).]
Table I compares the number of steps for training a policy to successfully play a one-marble maze game. Training directly on the real robot takes about 3.5M steps. For TL, we compare the number of fine-tuning steps necessary for a robustified policy versus a non-robustified policy (fixed parameters). Training a robustified policy in simulation takes about 4.0M steps, whereas a non-robustified policy takes approximately 4.5M steps to achieve a 100% success rate. TL of a robustified policy requires about 55K steps to "converge": a reduction of nearly a factor of 60 compared to online training. A non-robustified policy requires at least 3× the number of fine-tuning steps to achieve the same level of success in solving the maze game.
Figure 2 further shows the benefit of TL with a robustified policy. The left side of Figure 2 shows results for the robustified policy, with results for the non-robustified policy on the right. The bottom row shows the accumulated rewards per episode. An accumulated reward of 4.0 means that the marble has been maneuvered from the outside ring into the center, since there are four gates to pass through. The graph for the robustified policy shows that the learning essentially converges, i.e., achieves 100% success, whereas for the non-robustified policy transfer the success rate is around 90%. The top row of Figure 2 shows the length of each episode. It is evident that the robustified policy has successfully learned how to handle the complex dynamics to solve the maze game.
We repeated the same experiment for a two-marble maze game, with the goal of getting both marbles into the center of the maze. We only compared TL with the robustified policy; the results are shown in Table II. Learning a two-marble game in simulation with the sparse rewards achieved 100% success. However, training on the real setup with these rewards proved very challenging. We believe this is due to the geometry of the maze (the center has only one gate, surrounded by four gates in the adjacent ring) coupled with the static friction. We therefore designed a reward function which gives more importance to passing through gates into rings closer to the goal. This promotes one marble staying in the center area while the controller maneuvers the remaining marble. The modified rewards were also used for training the two-marble game offline. When learning online, even after 1M steps the success rate is still at 0% (a single marble reached the center about a dozen times). With fine-tuning of a transferred robustified policy, a success rate of around 75% is achieved after 225K steps.
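One plausible form of such a ring-weighted reward is sketched below. The weight values are assumptions for illustration; the paper's actual values are not reproduced here:

```python
def ring_weighted_reward(prev_ring, ring, weights=(4.0, 3.0, 2.0, 1.0)):
    """Passing a gate into ring k (0 = center) pays weights[k], so gates
    closer to the goal pay more.  The weights here are assumed values."""
    return weights[ring] if ring < prev_ring else 0.0
```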
[Table II: two-marble maze game, robustified policy; Online (real): 1M steps (0% success); Offline (simulator): 3.0M (100%); TL (online part): 225K (75%).]
We also investigated whether the transfer of a single-marble policy learned offline would require longer fine-tuning for a two-marble game online. After 100K steps of fine-tuning, the policy was able to start solving the game, and a success rate of about 50% was achieved after 400K steps. Thus, fine-tuning a robustified policy trained on a two-marble maze game in simulation achieves a higher success rate than fine-tuning a single-marble robustified policy.
We refer the reader to the supplemental material for videos of example roll-outs for single and two marble maze games.
VII Discussion and Future Work
Deep reinforcement learning is capable of learning complicated robot tasks, in some cases achieving (beyond) human-level performance. Deep RL requires many training samples, especially in the case of model-free approaches. For learning robot tasks, learning in simulation is desirable since robots are slow, can be dangerous, and are expensive. Powerful GPUs and CPUs have enabled simulation of complex dynamics coupled with high-quality rendering at high speeds. Transfer learning, i.e., training in simulation and subsequently transferring to the real world, is typically followed by fine-tuning, which is necessary to adapt to any differences between the simulated and the real world. Previous work has focused on transfer learning for tasks involving largely linear dynamics, such as controlling a robot to pick an object and place it at some desired location. We instead explore the case where the dynamics are complex: non-linearities arise due to static and dynamic friction, acceleration, and collisions of objects interacting with each other and the environment. We compare learning online, i.e., directly in the real world, with learning in simulation where the physics, appearance, and system parameters are varied during training; for reinforcement learning we refer to this as learning robustified policies. We show that the time required for fine-tuning robustified policies is greatly reduced.
Although we have shown that model-free deep reinforcement learning can successfully learn tasks involving complex dynamics, there are drawbacks to a model-free approach. In the example discussed in our paper, the dynamics are (mostly) captured by the LSTM layer in the network. With more than one marble, the amount of fine-tuning time increases significantly; in general, as the complexity of the state space increases, so does the amount of training time. When people perform tasks such as the maze game, they typically have a decent prediction of where the marble(s) will go given the amount of rotation applied. In [32, 33] the graphics and physics engines are embedded within the learning to recover physics parameters and predict the dynamics; other work models the physics and dynamics predictions with networks. These approaches are interesting research directions for tasks involving complex dynamics.
We currently use high-dimensional images as input to the learning framework. Low-dimensional input, i.e., marble position and velocity, could be used instead. In addition, rather than producing a distribution over a discrete set of actions, the problem can be formulated as a regression which directly produces values for the x- and y-axis rotations [35, 1].
People quickly figure out that the task can be broken down into moving a single marble at a time into the center, while preventing marbles already in the center from spilling back out. Discovering such sub-tasks automatically would be another interesting research direction. Along those lines, teaching a robot to perform tasks by human demonstration, or imitation learning, could teach robots complicated tasks without the need for elaborate reward functions.
We want to thank Rachana Sreedhar for the implementation of the simulator and Wei-An Lin for the PyTorch implementation of deep reinforcement learning.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
-  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. X. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, 2017.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” Computer Vision and Pattern Recognition (CVPR), 2014.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” Advances in Neural Information Processing Systems (NIPS), 2014.
-  A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, “Sim-to-real robot learning from pixels with progressive nets,” arXiv preprint, vol. arXiv/1610.04286, 2016.
-  F. Zhang, J. Leitner, M. Milford, and P. Corke, “Sim-to-real transfer of visuo-motor policies for reaching in clutter: Domain randomization and adaptation with modular networks,” CoRR, vol. abs/1709.05746, 2017. [Online]. Available: http://arxiv.org/abs/1709.05746
-  F. Zhang, J. Leitner, B. Upcroft, and P. I. Corke, “Vision-based reaching using modular deep networks: from simulation to the real world,” arXiv preprint, vol. arXiv:1610.06781, 2016. [Online]. Available: http://arxiv.org/abs/1610.06781
-  K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” arXiv preprint, vol. arXiv/1709.07857, 2017.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” Computer Vision and Pattern Recognition (CVPR), 2016.
-  P. F. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba, “Transfer from simulation to real world through learning deep inverse dynamics model,” arXiv preprint, vol. arXiv/1610.03518, 2016.
-  S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” Conference on Robot Learning (CoRL), 2017.
-  F. Sadeghi and S. Levine, “(CAD)$^2$RL: Real single-image flight without a single real image,” Robotics: Science and Systems Conference (RSS), 2016.
-  A. Rajeswaran, S. Ghotra, S. Levine, and B. Ravindran, “Epopt: Learning robust neural network policies using model ensembles,” International Conference on Learning Representations (ICLR), vol. abs/1610.01283, 2016. [Online]. Available: http://arxiv.org/abs/1610.01283
-  K. Lowrey, S. Kolev, J. Dao, A. Rajeswaran, and E. Todorov, “Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system,” IEEE Conf. on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), vol. abs/1803.10371, 2018. [Online]. Available: http://arxiv.org/abs/1803.10371
-  X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” arXiv preprint, vol. abs/1710.06537, 2018. [Online]. Available: http://arxiv.org/abs/1710.06537
-  C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” ICML 2017, vol. abs/1703.03400, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400
-  Z. Li and D. Hoiem, “Learning without forgetting,” European Conference on Computer Vision (ECCV), 2016.
-  L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” CoRR, vol. abs/1710.06542, 2017. [Online]. Available: http://arxiv.org/abs/1710.06542
-  J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” CoRR, vol. abs/1710.11248, 2017. [Online]. Available: http://arxiv.org/abs/1710.11248
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” International Conference on Machine Learning (ICML), 2009.
-  M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems (NIPS), 2017.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint, vol. arXiv/1606.01540, 2016.
-  D. Romeres, D. Jha, A. DallaLibera, B. Yerazunis, and D. Nikovski, “Learning hybrid models to control a ball in a circular maze,” arXiv preprint, vol. abs/1809.04993, 2018. [Online]. Available: http://arxiv.org/abs/1809.04993
-  R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, May 1992.
-  M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint, vol. arXiv/1611.05397, 2016.
-  J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” International Conference on Learning Representations (ICLR), 2016.
-  E. Todorov, “Convex and analytically-invertible dynamics with contacts and constraints: Theory and implementation in mujoco,” IEEE International Conference on Robotics and Automation (ICRA), 2014.
-  “Ogre 3D,” http://www.ogre3d.org, 2018, [Accessed May 2018].
-  J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, “Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,” Advances in Neural Information Processing Systems (NIPS), 2015.
-  J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum, “Learning to see physics via visual de-animation,” Advances in Neural Information Processing Systems (NIPS), 2017.
-  S. Ehrhardt, A. Monszpart, N. J. Mitra, and A. Vedaldi, “Unsupervised intuitive physics from visual observations,” arXiv preprint, vol. abs/1805.05086, 2018. [Online]. Available: http://arxiv.org/abs/1805.05086
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” International Conference on Learning Representations (ICLR), 2015.
-  C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visual imitation learning via meta-learning,” Conference on Robot Learning (CoRL), 2017.