Let us consider the task of opening a bottle. How should a two-armed robot accomplish this? Even without knowing the bottle geometry, its position, or its orientation, one can answer that the task will involve holding the bottle’s base with one hand, grasping the bottle’s cap with the other hand, and twisting the cap off. This “schema,” the high-level plan of what steps need to be executed, only depends on the task and not on the object’s geometric and spatial state, which only influence how to parameterize each of these steps (e.g., deciding where to grasp, or how much to twist).
However, typical end-to-end reinforcement learning approaches do not leverage this kind of structure, and instead aim to solve tasks by learning a policy, which would involve inferring both the schema and the parameterizations, as a function of the raw sensory input. These approaches have led to impressive successes across domains such as game-playing [1, 2, 3, 4] and robotic control tasks [4, 5, 6, 7, 8, 9], but are known to have very high sample complexity. For instance, they require millions of frames of interaction to learn to play Atari games, or several weeks’ worth of experience to learn simulated control policies, which makes them impractical to train on real hardware.
In this work, we address the problem of learning to perform tasks in environments with a sparse reward signal, given a discrete set of generic skills parameterized by continuous arguments. Examples of skills include exerting a force at a location or moving an end effector to a target pose. The action space is hybrid discrete-continuous: at each timestep, the agent must decide both 1) which skill to use and 2) what continuous arguments to use for it (e.g., the location to apply force, the amount of force, or the target pose to move to). The sample inefficiency of current reinforcement learning methods is exacerbated in domains with these large search spaces; even basic tasks such as opening a bottle with two arms are challenging to learn from sparse rewards. While one could hand-engineer dense rewards, this is undesirable as it does not scale to more complicated tasks. We ask a fundamental question: can we use the given skills to efficiently learn policies for tasks with a large policy search space, like bimanual manipulation, given only sparse rewards?
Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (sequence of skills) and a state-dependent policy that chooses appropriate parameterizations for the different skills. Such a decomposition of the policy into state-dependent and state-independent parts simplifies the credit assignment problem and leads to more effective sharing of experience, as data from different instantiations of the task can be used to improve the same shared skills. This leads to faster learning.
This modularization can further allow us to transfer learned schemas among related tasks, even if they have different state spaces. For example, suppose we have learned a good schema for picking up a long bar in simulation, where we have access to object poses, geometry information, etc. We can then reuse that schema for a related task such as picking up a tray in the real world from only raw camera observations, even though both the state space and the optimal parameterizations (e.g., grasp poses) differ significantly. As the schema is fixed, policy learning for this tray pickup task will be very efficient, since it only requires learning the (observation-dependent) arguments for each skill. Transferring the schema in this way enables learning to solve sparse-reward tasks very efficiently, making it feasible to train real robots to perform complex skills. See Figure 2 for an overview of our approach.
We validate our approach over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware. We give the robots a very generic library of skills such as twisting, lifting, and reaching. Even given these skills, bimanual manipulation is challenging due to the large search space for policy optimization. We consider four task families: lateral lifting, picking, opening, and rotating, all with varying objects, geometries, and initial poses. All tasks have a sparse binary reward signal: 1 if the task is completed, and 0 otherwise. We empirically show that a) explicitly modeling schema state-independence yields large improvements in learning efficiency over the typical strategy of conditioning the policy on the full state, and b) transferring learned schemas to real-world tasks allows complex manipulation skills to be discovered within only a few hours (<10) of training on a single setup. Figure 1 shows some examples of real-world tasks solved by our system.
II Related Work
Search in hybrid discrete-continuous spaces. An agent equipped with a set of skills parameterized by continuous arguments must learn a policy that decides both which skills to use and what continuous arguments to use for them. Therefore, policy optimization requires searching in a hybrid discrete-continuous space. Learning in such a hybrid space has been addressed within the field of task and motion planning [10, 11, 12, 13], but these methods typically rely on hand-designed abstract representations of the state space in order to make use of classical planners. In contrast, we enable end-to-end reinforcement learning from raw images by building independence assumptions into our model. A separate line of work learns control policies for steps in a policy sketch [14], which can be recombined in novel ways to solve new task instances; however, this work does not consider the discrete search aspect of the problem, as we do.
Transfer learning for robotics. The idea of transferring a learned policy from simulation to the real world for more efficient robotic learning was first developed in the early 1990s [15, 16]. More recent techniques include learning from model ensembles [17] and utilizing domain randomization [18, 19, 20], in which physical properties of a simulated environment are randomized to allow learned policies to be robust. However, as these methods directly transfer the policy learned in simulation, they rely on the simulation being visually and physically similar to the real world. In contrast, we transfer only one part of our learned policy (the skill sequence to be executed) from simulation to the real world, and allow the associated continuous parameters to be learned in the real-world domain.
Temporal abstraction for reinforcement learning. The idea of using temporally extended actions to reduce the sample complexity of reinforcement learning algorithms has been studied for decades [21, 22, 23, 24]. For instance, work on macro-actions for MDPs [24] attempts to build a hierarchical model in which the primitive actions occupy the lowest level, and subsequently higher levels build local policies, each equipped with their own termination conditions, that make use of actions at the level below. In our work, the skills are parameterized, and therefore the agent must reason about not only which skills to apply, but also what arguments to use for the chosen skills.
Bimanual manipulation. Dual-arm manipulation tasks have been studied in classical control settings [25], and often rely on hybrid force-position control strategies to guide both manipulators [26, 27]. These tasks have also been addressed via learning from demonstration [28, 29, 30]. In our work, we do not rely on demonstrations, and we are able to learn control policies directly from raw sensory inputs (camera images) without relying on models of the environment, which are difficult to specify by hand.
Given a set of parameterized skills, we aim to solve sparse-reward tasks by learning a policy that decides both which skill to execute and what arguments to use when invoking it. Our insight is that, for many tasks, the same sequence of skills (possibly with different arguments) can be used to optimally solve different instantiations of the task. We operationalize this by disentangling the policy into a state-independent task schema (sequence of skills) and a state-dependent prediction of how to parameterize these skills. We first formally define our problem setup, and then present our model for leveraging the state-independence of schemas to learn efficiently. Finally, we describe how our approach also allows transferring schemas across tasks, letting us learn real-world policies from raw images by reusing schemas learned for related tasks in simulation.
| Task Family | Object (Sim) | Objects (Real) | Schema Discovered from Learning in Simulation |
|---|---|---|---|
| lateral lifting | bar | aluminum tray, rolling pin, heavy bar, plastic box | 1) L: top grasp, R: top grasp 2) L: lift, R: lift |
| picking | ball | soccer ball | 1) L: top grasp, R: go-to pose 2) L: no-op, R: go-to pose 3) L: lift, R: lift |
| opening | bottle | glass jar, water bottle | 1) L: top grasp, R: side grasp 2) L: twist, R: no-op |
| rotating | corkscrew | T-wrench, corkscrew | 1) L: go-to pose, R: side grasp 2) L: go-to pose, R: no-op 3) L: rotate, R: no-op |
Each task we consider is defined as a finite-horizon Markov decision process (MDP) [31], with a hybrid discrete-continuous action space and time horizon $T$. The reward associated with each task is a binary function indicating whether the current state is an element of the set of desired goal configurations, such as a state with the bottle opened. The learning objective, therefore, is to obtain a policy that maximizes the expected proportion of times that following it achieves the goal. Note that this is a particularly challenging setup for reinforcement learning algorithms due to the sparse nature of the reward function.
The agent is given a discrete library of generic skills $\mathcal{X}$, where each skill $x \in \mathcal{X}$ is parameterized by a corresponding vector $v_x$ of continuous values. Examples of skills can include exerting a force at a location, moving an end effector to a target pose, or rotating an end effector about an axis. An action is therefore a tuple $(x, v_x)$, indicating what skill to apply as well as the corresponding parameterization. A schema $\bar{x}$ is a sequence of $T$ skills in $\mathcal{X}$, one per timestep up to the horizon $T$, where $\bar{x} = (x_1, x_2, \ldots, x_T)$ captures the sequence of skills but not their corresponding continuous parameterizations.
Assumption. We assume that the optimal schema is state-independent: it depends only on the task, not on the state and its dynamics. This implies that the same schema is optimal for all instantiations of a task, e.g. different geometries and poses of objects. We note that this is a valid assumption across many tasks of interest, since the skills themselves can be appropriately chosen to be complicated and expressive, such as stochastic, closed-loop control policies for guiding an end effector.
Modular Policies. The agent must learn a policy that, at each timestep, infers both which skill to use (a discrete choice) and what continuous arguments to use.
What is a good form for such a policy? A simple strategy, which we use as a baseline and depict in Figure 3 (top), would be to represent the policy $\pi$ via a neural network, with weights $\phi$, that takes the state as input and has a two-headed output. One head predicts logits that represent a categorical distribution over the skills; the other head predicts means and variances for the continuous arguments of all skills. To sample an action, we first sample a skill $x$ from the logits predicted by the first head, then sample arguments using the subset of means and variances predicted by the second head that correspond to $x$.
However, this does not model the fact that the optimal schema is state-independent. To capture this, we need to remove the dependence of the discrete skill selection on the input state. Thus, we propose to maintain a separate $T \times |\mathcal{X}|$ array $K$ (one row per timestep, one column per skill in the library $\mathcal{X}$), where row $t$ is the logits of a categorical distribution over which skill to use at time $t$. Note that $T$ is the horizon of the MDP. In this architecture, the neural network is only tasked with predicting the skill arguments. The array of logits and the neural network, taken together, represent the policy $\pi$, as depicted in Figure 3 (bottom).
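To make the split concrete, the following is a minimal sketch of sampling one action from such a modular policy. It is an illustration under stated assumptions rather than the implementation used in the experiments: the sizes, the names `schema_logits` and `sample_action`, and the random linear map standing in for the neural network are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
T, N_SKILLS, ARG_DIM, STATE_DIM = 3, 4, 2, 5

# State-independent schema logits: one row per timestep, one column per skill.
schema_logits = np.zeros((T, N_SKILLS))

# Stand-in for the neural network: a fixed random linear map from the state
# to a mean and a log-variance for every skill's arguments.
W = rng.normal(size=(STATE_DIM, N_SKILLS * 2 * ARG_DIM)) * 0.1


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def sample_action(state, t):
    """Sample (skill index, continuous arguments) at timestep t."""
    # The skill choice depends only on t, never on the state.
    probs = softmax(schema_logits[t])
    skill = int(rng.choice(N_SKILLS, p=probs))
    # The arguments depend on the state; keep only the chosen skill's slice.
    out = (state @ W).reshape(N_SKILLS, 2, ARG_DIM)
    mean, log_var = out[skill]
    args = mean + np.exp(0.5 * log_var) * rng.normal(size=ARG_DIM)
    return skill, args
```

Calling `sample_action(state, t)` with two different states at the same `t` draws the skill from the same categorical distribution; only the sampled arguments change with the state.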
Learning Schemas and Skill Arguments. The weights $\phi$ of the neural network can be updated via standard policy gradient methods. Let $\tau$ denote a trajectory induced by following the policy $\pi$ in an episode, and let $r(\tau)$ denote the reward it obtains. The objective we wish to maximize is $J(\phi) = \mathbb{E}_{\tau}[r(\tau)]$. Policy gradient methods such as REINFORCE [32] leverage the likelihood ratio trick, which says that $\nabla_\phi J(\phi) = \mathbb{E}_{\tau}[r(\tau) \nabla_\phi \log \pi(\tau)]$, to tune $\phi$ via gradient ascent. When estimating this gradient, we treat the current setting of the array of logits as a constant.
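For reference, the likelihood ratio identity follows in a few steps from the fact that $\nabla p = p \nabla \log p$. The notation below is chosen for illustration: $\pi_\phi(\tau)$ denotes the probability of trajectory $\tau$ under the policy with weights $\phi$, and $r(\tau)$ the reward obtained along it.

```latex
\begin{aligned}
\nabla_\phi \, \mathbb{E}_{\tau \sim \pi_\phi}[r(\tau)]
  &= \nabla_\phi \int \pi_\phi(\tau)\, r(\tau)\, d\tau \\
  &= \int \pi_\phi(\tau)\, \nabla_\phi \log \pi_\phi(\tau)\, r(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\phi}\big[\, r(\tau)\, \nabla_\phi \log \pi_\phi(\tau) \,\big].
\end{aligned}
```

Because the environment dynamics do not depend on $\phi$, the term $\nabla_\phi \log \pi_\phi(\tau)$ reduces to a sum over timesteps of the gradients of the log-probabilities of the actions taken, which can be estimated from sampled trajectories.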
Updating the logits within the array can also be achieved via policy gradients; however, since there is no input, and because we have sparse rewards, the policy optimization procedure is quite simple. Let $K_{t,x}$ be the logit for time $t$ and skill $x$. Given a trajectory $\tau$ with reward $r(\tau)$:
- If $\tau$ achieves the goal, i.e. $r(\tau) = 1$, increase $K_{t,x}$ for each timestep $t$ and skill $x$ taken at that timestep.
- If $\tau$ does not achieve the goal, i.e. $r(\tau) = 0$, decrease $K_{t,x}$ for each timestep $t$ and skill $x$ taken at that timestep.
The amount by which to increase or decrease is absorbed by the step size and thus gets tuned as a hyper-parameter. See Algorithm 1 for full pseudocode.
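The update rule described above can be sketched in a few lines. This is an illustrative reading of the text, not a reproduction of Algorithm 1: the function name `update_schema_logits`, the default `step_size`, and the toy array sizes are assumptions.

```python
import numpy as np


def update_schema_logits(logits, skills_taken, success, step_size=0.1):
    """Raise (on success) or lower (on failure) the logit of each skill taken.

    logits: array of shape (T, n_skills).
    skills_taken: one skill index per timestep of the trajectory.
    """
    updated = logits.copy()
    direction = 1.0 if success else -1.0
    for t, skill in enumerate(skills_taken):
        updated[t, skill] += direction * step_size
    return updated


# Toy usage: a horizon of 2 and a library of 3 skills.
logits = np.zeros((2, 3))
logits = update_schema_logits(logits, skills_taken=[0, 2], success=True)
logits = update_schema_logits(logits, skills_taken=[1, 2], success=False)
greedy_schema = logits.argmax(axis=1)  # most likely skill at each timestep
```

Because the reward is binary, only the sign of the update depends on the outcome; its magnitude is the single step-size hyper-parameter.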
Schema Transfer Across Tasks. Since we have disentangled the learning of the schema from the learning of the skill arguments within our policy architecture, we can now transfer the array of logits across related tasks, as long as the skill spaces and horizons are equal. Therefore, learning for a new task can be made efficient by reusing a previously learned schema, since we would only need to train the neural network weights to infer skill arguments for that new task.
Importantly, transferring the schema is reasonable even when the tasks have different state spaces. For instance, one task can be a set of simulated bimanual bottle-opening problems in a low-dimensional state space, while the other involves learning to open bottles in the real world from high-dimensional camera observations. As the state spaces can be different, it follows immediately that the tasks can also have different optimal arguments for the skills.
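As a rough sketch of what the transfer amounts to in code, the array of logits learned on the source task is copied to the target task after checking that the horizons and skill spaces match. The helper `transfer_schema` is hypothetical, and reading off a fixed skill sequence with `argmax` is our assumption for illustration.

```python
import numpy as np


def transfer_schema(source_logits, target_horizon, target_n_skills):
    """Return a copy of the source logits for reuse on a related task.

    Raises ValueError when the horizons or skill spaces differ, since the
    array would then not line up with the target task.
    """
    expected = (target_horizon, target_n_skills)
    if source_logits.shape != expected:
        raise ValueError(f"shape {source_logits.shape} != {expected}")
    return source_logits.copy()


# Toy logits standing in for a schema learned on a source task.
sim_logits = np.array([[2.0, -1.0, 0.0], [-1.0, 0.5, 3.0]])
real_logits = transfer_schema(sim_logits, target_horizon=2, target_n_skills=3)
fixed_schema = real_logits.argmax(axis=1)  # skill sequence held fixed on the new task
```

Nothing in this step touches the state representation, which is why the source and target tasks are free to use different observations.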
We test our proposed approach on four robotic bimanual manipulation task families: lateral lifting, picking, opening, and rotating. Table I lists the different objects that we considered for each one. These task families were chosen because they represent a challenging hybrid discrete-continuous search space for policy optimization, while meeting our requirement that the optimal schema is independent of the state. We show results on these tasks both in simulation and on real Sawyer arms: schemas are learned in simulation by training with low-dimensional state inputs, then transferred as-is to visual inputs (in simulation as well as in the real world), for which we only need to learn skill arguments. Our experiments show that our proposed approach is significantly more sample-efficient than one that uses the baseline policy architecture, and allows us to learn bimanual policies on real robots in less than 10 hours of training. We first describe the experimental setup, then discuss our results.
IV-A MuJoCo Experimental Setup
Environment. For all four task families, two Sawyer robot arms with parallel-jaw grippers are placed at opposing ends of a table, facing each other. A single object is placed on the table, and the goal is to manipulate the object’s pose in a task-specific way.
- Lateral lifting (bar): The goal is to lift a heavy and long bar by 25 cm while maintaining its orientation. We vary the bar’s location and density.
- Picking (ball): The goal is to lift a slippery (low coefficient of friction) ball vertically by 25 cm. The ball slips out of the gripper when grasped by a single arm. We vary the ball’s location and coefficient of friction.
- Opening (bottle): The goal is to open a bottle implemented as two links (a base and a cap) connected by a hinge joint. If the cap is twisted without the base being held in place, the entire bottle twists. The cap must undergo a quarter-rotation while the base maintains its pose. We vary the bottle’s location and size.
- Rotating (corkscrew): The goal is to rotate a corkscrew implemented as two links (a base and a handle) connected by a hinge joint, like the bottle. The handle must undergo a half-rotation while the base maintains its pose. We vary the corkscrew’s location and size.
Skills. The skills we use are detailed in Table II, and the search spaces for the skill parameters are detailed in Table III. Note that because we have two arms, we actually need to search over a cross product of this space with itself.
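The effect of the cross product on the discrete side of the search can be illustrated with a short count. The skill names are those that appear in the schemas of Table I; treating every skill as available to both arms in every task family is a simplification made only for this sketch, and `num_schemas` is a hypothetical helper.

```python
from itertools import product

# Skill names as they appear in the schemas of Table I; treating all of them
# as available to both arms is a simplification for illustration.
ARM_SKILLS = ["top grasp", "side grasp", "go-to pose", "lift", "twist", "rotate", "no-op"]

# A joint skill assigns one skill to the left arm and one to the right arm.
joint_skills = list(product(ARM_SKILLS, repeat=2))


def num_schemas(horizon):
    """Number of distinct joint-skill sequences of the given length."""
    return len(joint_skills) ** horizon
```

Under this simplification there are 49 joint skills per timestep, so a schema of length 3 (the longest in Table I) is one of 117,649 candidate sequences, before any continuous arguments are chosen.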
State and Policy Representation. Experiments conducted in the MuJoCo simulator [34] use a low-dimensional state: proprioceptive features (joint positions, joint velocities, end effector pose) for each arm, the current timestep, geometry information for the object, and the object pose in the world frame and each end effector’s frame. The policy is represented as a 4-layer MLP with 64 neurons in each layer, ReLU activations, and a multi-headed output for the actor and the critic. Since object geometry and pose can only be computed within the simulator, our real-world experiments will instead use raw RGB camera images.
| Skill | Allowed Task Families | Continuous Parameters |
|---|---|---|
| top grasp | lateral lifting, picking, opening | (x, y) position, z-orientation |
| side grasp | opening, rotating | (x, y) position, approach angle |
| go-to pose | picking, rotating | (x, y) position, orientation |
| lift | lateral lifting, picking | distance to lift |
| rotate | rotating | rotation axis, rotation radius |
|Parameter||Relevant Skills||Search Space (Sim)||Search Space (Real)|
|(x, y) position||grasps, go-to pose||
|location on table surface|
|approach angle||side grasp|
|distance to lift||lift|
|location on table surface|
Training Details. We use the Stable Baselines [35] implementation of proximal policy optimization (PPO) [36], though our method is agnostic to the choice of policy gradient algorithm. We use the following hyper-parameters: Adam [33] with learning rate , clipping parameter , entropy loss coefficient , value function loss coefficient , gradient clip threshold, number of steps , number of minibatches per update , and number of optimization epochs. Our implementation builds on the Surreal Robotics Suite [37]. Training is parallelized across 50 workers. The time horizon in all tasks.
IV-B Real-World Sawyer Experimental Setup
Environment. Our real-world setup also contains two Sawyer robot arms with parallel-jaw grippers placed at opposing ends of a table, facing each other. We task the robots with manipulating nine common household objects that require two parallel-jaw grippers to interact with. We consider the same four task families (lateral lifting, picking, opening and rotating), but work with more diverse objects (such as a rolling pin, soccer ball, glass jar, and T-wrench), as detailed in Table I. For each task family, we use the schema discovered for that family in simulation, and only learn the continuous parameterizations of the skills in the real world. See Figure 1 for pictures of some of our tasks.
State and Policy Representation. The state for these real-world tasks is the RGB image obtained from an overhead camera that faces directly down at the table. To predict the continuous arguments, we use a fully convolutional spatial neural network architecture [38], as shown in Figure 4 along with example response maps.
Training Details. We use PPO and mostly the same hyper-parameters, with the following differences: learning rate , number of steps , number of minibatches per update , number of optimization epochs , and no parallelization. We control the Sawyers using PyRobot [40].
IV-C Results in Simulation
Figure 5 shows that our policy architecture greatly improves the sample efficiency of model-free reinforcement learning. In all simulated environments, our method learns the optimal schema, as shown in the last column of Table I. Much of the difficulty in these tasks stems from sequencing the skills correctly, and so our method, which more effectively shares experience across task instantiations in its attempt to learn the task schema, performs very well.
Before transferring the learned schemas to the real-world tasks, we consider learning from rendered images in simulation, using the architecture from Figure 4 to process them. Figure 6 shows the impact of transferring the schema versus re-learning it in this more realistic simulation setting. We see that when learning visual policies, transferring the schemas learned in the tasks with low-dimensional state spaces is critical to efficient training. These results increase our confidence that transferring the schema will enable efficient real-world training with raw RGB images, as we show next.
IV-D Results in Real World
Figure 7 shows our results on the nine real-world tasks, with schemas transferred from the simulated tasks. We can see that, despite the challenging nature of the problem (learning from raw camera images, given sparse rewards), our system is able to learn to manipulate most objects in around 4-10 hours of training. We believe that our approach can be useful for sample-efficient learning in problems other than manipulation as well; all one needs is to define skills appropriate for the environment such that the optimal sequence depends only on the task, not the (dynamic) state. The skills may themselves be parameterized closed-loop policies.
Please see the supplementary video for examples of learned behavior on the real-world tasks.
V Future Work
In this work, we have studied how to leverage state-independent sequences of skills to greatly improve the sample efficiency of model-free reinforcement learning. Furthermore, we have shown experimentally that transferring sequences of skills learned in simulation to real-world tasks enables us to solve sparse-reward problems from images very efficiently, making it feasible to train real robots to perform complex skills such as bimanual manipulation.
An important avenue for future work is to relax the assumption that the optimal schema is open-loop. For instance, one could imagine predicting the schema via a recurrent mechanism, so that the decision on what skill to use at each timestep is conditioned on the skill used at the previous timestep. Another interesting future direction is to study alternative approaches to training the state-independent schema predictor.
We would like to thank Dhiraj Gandhi for help with experimental setup. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI conference on artificial intelligence, 2016.
-  Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.
-  A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
-  Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez, “Active model learning and diverse action sampling for task and motion planning,” arXiv preprint arXiv:1803.00967, 2018.
-  R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel, “Guided search for task and motion plans using learned heuristics,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 447–454.
-  B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, “Learning to guide task and motion planning using score-space representation,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2810–2817.
-  C. Paxton, V. Raman, G. D. Hager, and M. Kobilarov, “Combining neural networks and tree search for task and motion planning in challenging environments,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 6059–6066.
-  J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 166–175.
-  Y. Davidor, Genetic Algorithms and Robotics: A heuristic strategy for optimization. World Scientific, 1991, vol. 1.
-  E. Gat, “On the role of simulation in the study of autonomous mobile robots,” in AAAI-95 Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, 1995.
-  I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 5307–5314.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
-  K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4243–4250.
-  F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
-  M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda, “Purposive behavior acquisition for a real robot by vision-based reinforcement learning,” Machine learning, vol. 23, no. 2-3, pp. 279–303, 1996.
-  L. Chrisman, “Reasoning about probabilistic actions at multiple levels of granularity,” in AAAI Spring Symposium: Decision-Theoretic Planning, 1994.
-  P. Dayan and G. E. Hinton, “Feudal reinforcement learning,” in Advances in neural information processing systems, 1993, pp. 271–278.
-  M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier, “Hierarchical solution of Markov decision processes using macro-actions,” in Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998, pp. 220–229.
-  C. Smith, Y. Karayiannidis, L. Nalpantidis, X. Gratal, P. Qi, D. V. Dimarogonas, and D. Kragic, “Dual arm manipulation – a survey,” Robotics and Autonomous systems, vol. 60, no. 10, pp. 1340–1353, 2012.
-  P. Hsu, “Coordinated control of multiple manipulator systems,” IEEE Transactions on Robotics and Automation, vol. 9, no. 4, pp. 400–410, 1993.
-  N. Xi, T.-J. Tarn, and A. K. Bejczy, “Intelligent planning and control for multirobot coordination: An event-based approach,” IEEE transactions on robotics and automation, vol. 12, no. 3, pp. 439–452, 1996.
-  R. Zollner, T. Asfour, and R. Dillmann, “Programming by demonstration: Dual-arm manipulation tasks for humanoid robots,” in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), vol. 1. IEEE, 2004, pp. 479–484.
-  E. Gribovskaya and A. Billard, “Combining dynamical systems control and programming by demonstration for teaching discrete bimanual coordination tasks to a humanoid robot,” in 2008 3rd ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2008, pp. 33–40.
-  O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters, “Towards learning hierarchical skills for multi-phase manipulation tasks,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1503–1510.
-  M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
-  A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
-  L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, “SURREAL: Open-source reinforcement learning framework and robot manipulation benchmark,” in Conference on Robot Learning, 2018.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta, “Pyrobot: An open-source robotics framework for research and benchmarking,” arXiv preprint arXiv:1906.08236, 2019.