, and Reinforcement Learning (RL) [7, 8]. The success of DL in each of these fields was made possible by the availability of huge amounts of labeled training data. Researchers in vision and language can easily train and evaluate deep neural networks on standard datasets with crowdsourced annotations such as ImageNet, COCO, and CLEVR. In simulated environments like video games, where experience and rewards are easy to obtain, Deep RL has been tremendously successful, outperforming highly skilled humans by ingesting huge amounts of data [8, 12, 13]. The OpenAI Five DOTA bot processes 180 years of simulated experience every day to play at a professional level. Even playing simple Atari games typically requires 40 days of game play. This stands in contrast to robotics, where we lack abundant data, since designing a policy typically requires execution on a real robot, which cannot be accelerated beyond real time. This limits the effectiveness of Deep RL in robotics.
This paper presents a framework to bring the power of Deep RL to robotics through an approach that makes the generation and handling of a large dataset tractable. While previous attempts required several robots operating in parallel [14, 15], our framework requires only a single robot.
To generate enough data to feed the data-hungry machinery of Deep RL with a single robot, we record experience continuously and persistently, regardless of the purpose or quality of the behavior. This means that for a new task our data is necessarily off-policy, and typically off-task as well. Nevertheless, we show that the accumulated experience can be reused to learn new tasks. Reuse is made possible by automatically annotating historical data using a task-specific learned reward model.
Our framework introduces three novel components:
The NeverEnding Storage (NES) system, which captures and stores all experience generated by the robot, regardless of purpose.
The reward sketching procedure for eliciting reward functions from humans, so that past experience can be automatically annotated for new tasks.
The use of off-policy batch RL, which allows us to train as many policies as the computation budget permits, using data generated by different behavior policies, without further execution on the robot.
The NES system captures all the camera and sensor data generated by the robot. This includes demonstrations (by human teleoperators) of various tasks, behaviors generated by trained policies for these tasks (successful or not), as well as experience generated by scripted or random policies. If the robot is moving, then the experience is being captured by NES.
Our approach to task specification relies on human judgments about progress toward the goal to define task-specific reward functions. Annotations are elicited from humans in the form of per-timestep reward annotations using a process we call reward sketching (Fig. 1, top right). The sketching procedure is intuitive for humans, and allows them to label many timesteps rapidly and accurately. We use the human annotations to train a reward model, which is then used to annotate the remainder of the database automatically. The use of a learned reward function in this way allows us to repurpose an arbitrary amount of past experience using a fixed amount of annotation effort per task.
Our framework can be contrasted with other common approaches in robotics. Through learning of both reward functions and controllers, we avoid the need for explicit state estimation, which is an essential component of classic robotics. RL systems also typically rely on state estimation to deliver rewards, even when they learn policies from pixels. System identification and environment modelling arise in any situation where explicit models (either of the robot or of the task) are required, and significant effort is needed to build these models in both classical robotics and sim2real. Similarly, unlike the sim2real setting, we have no perceptual domain gap to address, since we always train on experience collected directly from the robot.
II Related Work
RL has a long history in robotics [16, 17, 18, 19, 20, 14, 15], but applying RL to robots inherits all the general difficulties of applying RL to the real world. Most published works either rely on state estimation for a specific task, or work in a very data-limited regime to learn from raw observations. These methods typically entail highly engineered reward functions. Our work addresses these limitations and goes beyond the usual scale of RL methods in robotics.
Among the RL-for-robotics literature, QT-Opt is the closest approach to ours. The authors collected a large dataset of grasps over the course of several weeks with robots. The resulting distributed Q-learning agent shows remarkable generalization to different objects. Yet, the whole system focuses on a single task: grasping. Grasping is particularly well suited to hand-engineering a reward function and scripting policies for collecting data. However, neither hand-engineered reward functions nor scripted policies are easy to design for many tasks, so relying on these techniques limits the applicability of the method. In contrast, in our work we learn the reward functions. Moreover, experience from multiple tasks contributes to learning a target task.
Demonstrations in RL have gained popularity in recent years [22, 23] as they help to address the exploration problem. As in prior works [24, 25, 26], we use demonstrations as part of the agent experience and train with temporal difference learning in a model-free setting.
The sim2real framework aims to seamlessly transfer policies trained in simulated environments to the real world [29, 30, 31, 32, 33]. These methods have achieved remarkable success in dexterous in-hand manipulation  and deformable object manipulation . The challenges of sim2real lie in the need for extensive system identification to construct simulated environments that closely mirror the target task. Even with careful simulation design there remains a domain gap between simulation and reality that needs to be dealt with [30, 34]. In our framework there is no need to simulate the robot or the environment, and, thus, no domain gap.
Learning reward functions using inverse RL dates back to Ng and Russell and has achieved tremendous success recently [36, 37, 38, 39, 40, 41, 42]. This class of methods works best when applied to states or well-engineered features. Making these methods work for high-dimensional input spaces, particularly raw pixels, remains a challenge.
Learning from preferences [43, 44, 45, 46, 47, 48] also has a long history. For example, interactive learning and optimization with human preferences has been applied to animation [49, 50, 51]. Preference learning is also used in RL for reward learning [52, 53]. Preferences can be extracted with whole episode comparisons [54, 55, 56, 57] or shorter clip comparisons [58, 59]. Binary success labels of frames are also used to learn reward functions [26, 60].
Batch RL describes the scenario where the learning experience is fixed a priori. We rely on batch RL for training our agents. This is an active area of research, with a number of recent works aimed at improving its stability [63, 64, 65, 66]. These advancements can further improve our agents.
III Framework overview and motivation
The general workflow is illustrated in Fig. 2. The distinct properties of our pipeline come from the interaction of the two key features: NES and learning the reward model from sketches. NES allows us to accumulate a large dataset of task-agnostic experience. A task-specific reward model allows us to retrospectively annotate data in NES with reward signals for a new task. With rewards, we can then train batch RL agents with all the data in NES.
The procedure for training an agent to complete a new task using our framework has the following steps which are described in turn in the remainder of the section, with more details given in Section IV:
A human teleoperates the robot to provide first-person demonstrations of the target task (IV-A).
All robot experience, including these demonstrations, is accumulated into NES (IV-B).
A subset of data from NES (including the task-specific demos) is annotated by humans with reward sketches for the target task (IV-C).
A reward model for the target task is trained using the labelled experience (IV-D).
An agent for the target task is trained using all experience in NES, with rewards provided by the learned reward model (IV-E).
The resulting policy is deployed on a real robot for execution, which at the same time records more data into NES (IV-F).
Occasionally we select an agent for careful evaluation, to track overall progress on the task.
To specify a new target task, a human operator first remotely controls the robot to provide several successful (and optionally unsuccessful) examples of completing the task. By employing the demonstration trajectories, we circumvent the problem of exploration in RL: Instead of requiring that the agent explores the state space autonomously, we use expert knowledge about the intended outcome of the task to guide the agent.
NES captures all of the robot experience generated across all tasks in a central repository. This allows us to make use of historical data when learning a new target task, instead of generating a new dataset from scratch each time. NES includes teleoperated trajectories for various tasks, human play data, and accumulated experience from the execution of numerous learned policies. The experience data is stored together with various metadata for organization and retrieval.
The second step in task specification is reward sketching. We ask human experts to provide per-timestep annotations of reward using a custom user interface. As illustrated in Fig. 3, the user draws a curve indicating progress toward accomplishing the target task as a function of time, while the interface shows the frame corresponding to the current cursor position. This intuitive interface allows a single annotator to produce hundreds of frames of reward annotations per minute.
The reward sketches allow comparison of perceived value of any two frames. In addition, the green region in Fig. 3 is reserved for timesteps where the goal is achieved. For each task the episodes to be annotated are drawn from NES. They include both the demonstrations provided for the target task, as well as experience generated for prior tasks. Annotating data from prior tasks ensures better coverage of the state space.
The reward annotations produced by sketching are used as supervision to train a reward model. This model is then used to predict reward values for all experience in NES. As a result, we can leverage all historical data in training a policy for a new task, without requiring manual human annotation of the entire repository.
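The auto-annotation step above can be pictured as a single pass of the learned reward model over every stored frame. The snippet below is a minimal illustrative sketch, not the paper's implementation: the `reward_model` function and the `green_height` feature are hypothetical stand-ins for the learned network and its pixel inputs.

```python
def reward_model(frame):
    # placeholder for the learned reward network; outputs lie in [0, 1]
    return min(1.0, max(0.0, frame["green_height"]))

# toy stand-in for NES: a list of episodes, each a list of frames
nes = [
    {"frames": [{"green_height": 0.0}, {"green_height": 0.5}]},
    {"frames": [{"green_height": 1.2}]},
]

# Annotate every stored episode with predicted rewards for the new task,
# so all historical data becomes usable for policy training.
for episode in nes:
    episode["predicted_rewards"] = [reward_model(f) for f in episode["frames"]]
```

The key property is that annotation cost is paid once (training the model from a fixed number of sketches), after which labeling the whole repository is automatic.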
Off-policy batch RL
We train an agent using off-policy, pure batch RL. This allows us to learn as many policies as the computation budget permits, using data generated by different behavior policies, without further execution on the robot. Moreover, the agent is trained using only the RL objective: it does not use feature pretraining, BC initialization, special batch-correction terms, or auxiliary losses. We do, however, find it important to use the historical data from other tasks.
Once an agent is trained, we can run it on the real robot. By running the agent on the robot, we collect more experience data which can be used for reward sketching or RL in future iterations. Running the agent also allows us to observe its performance and make judgments about the steps needed to improve it.
At this point we can iterate the workflow to improve the agent. Typically this involves sketching more data to improve the reward function. We can choose to sketch the new data generated by executing the policy on the robot, or more of the historical data stored in NES. If particular failure cases are observed during execution, we can teleoperate the robot to collect examples near these failure modes. Then, we sketch them to direct the agent in areas of the state space where it does not yet perform well.
Our approach starts with human operators providing first-person demonstrations of the target tasks on the robot. The robot is controlled with a -DoF mouse or hand-held virtual reality controllers. A demonstrated sequence contains a pair of an observation and the corresponding action for each time step. Observations contain all available sensor data, including raw pixels as well as proprioceptive inputs.
In addition to full episodes of demonstrations, interactive interventions can also be performed: a human operator can take over from, or return control to, an agent at any time. This data is useful for fixing particular corner cases that the agents might encounter. All demonstrations are stored in NES with corresponding metadata. Note that in later workflow iterations, all demonstrations are used for RL and possibly reward sketching.
IV-B NeverEnding Storage
NeverEnding Storage (NES) is a database of episode data, stored as files on disk, together with associated metadata. Episode data includes video recordings from several cameras in the cage and at the robot wrist (Fig. 4). Metadata includes a unique episode ID, operator name, date and time of operation, task ID, episode type, paths to separately stored experience data, and more. Each episode can be marked with an arbitrary set of tags to offer a flexible way to identify sets of relevant episodes. NES also allows an arbitrary number of reward sketches, possibly task-specific, to be associated with each episode. Metadata and reward sketches can be queried jointly with SQL to select subsets of data.
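The kind of joint metadata-and-sketch query described above can be sketched with an in-memory SQLite database. The schema, table names, and column names below are illustrative assumptions; the paper does not specify NES's actual storage layout.

```python
import sqlite3

# Hypothetical NES-style metadata store; all names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE episodes (
    episode_id   TEXT PRIMARY KEY,
    operator     TEXT,
    recorded     TEXT,   -- date and time of operation
    task_id      TEXT,
    episode_type TEXT,   -- e.g. 'demo', 'policy', 'random'
    data_path    TEXT    -- path to separately stored experience data
);
CREATE TABLE tags (episode_id TEXT, tag TEXT);
CREATE TABLE reward_sketches (
    episode_id  TEXT,
    task_id     TEXT,    -- sketches may be task-specific
    sketch_path TEXT
);
""")

conn.execute("INSERT INTO episodes VALUES "
             "('ep1','alice','2020-01-01','lift_green','demo','/data/ep1')")
conn.execute("INSERT INTO episodes VALUES "
             "('ep2','bob','2020-01-02','stack_green_on_red','policy','/data/ep2')")
conn.execute("INSERT INTO reward_sketches VALUES "
             "('ep1','lift_green','/sketches/ep1_lift')")

# Jointly query metadata and sketches: select episodes that already have
# reward sketches for a target task.
rows = conn.execute("""
    SELECT e.episode_id, e.episode_type
    FROM episodes e JOIN reward_sketches s ON e.episode_id = s.episode_id
    WHERE s.task_id = 'lift_green'
""").fetchall()
```

Keeping bulky experience data on disk and only paths plus lightweight metadata in the database is what makes flexible SQL selection over a continuously growing store practical.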
IV-C Reward Sketching
In order to convey the desired behaviour to the agent, a human operator sketches the reward values for several examples of successful and unsuccessful trajectories for each task. For annotation, we select demonstration episodes as they are likely to exhibit positive examples. At the same time, we select episodes from other tasks as they exhibit negative examples. This combination gives good coverage of the state space.
Sketching works on an episodic basis. An example of a sketch is shown in Fig. 3. To sketch an episode, a user interactively selects a frame and provides an associated reward value. The sketching interface allows the annotator to draw reward curves while “scrubbing” through a video episode, rather than annotating frame by frame. This efficient procedure provides a rich source of information about the reward across the entire episode. The sketches for an episode are stored in NES as described in Section IV-B.
IV-D Reward Learning
Episodes annotated with reward sketches are used to train a reward function, in the form of a neural network, in a supervised manner. We find that although there is high agreement between annotators on the relative quality of timesteps within an episode, annotators are often not consistent in the overall scale of the sketched rewards. We therefore adopt an intra-episode ranking approach to learn reward functions, rather than trying to regress the sketched values directly.
Specifically, given two frames in the same episode, we train the reward model to satisfy two conditions. First, if a frame is (un)successful according to the sketch, it should be (un)successful according to the estimated reward function. Second, if the sketched reward of one frame is higher than that of the other by some threshold, then its estimated reward should also be higher by a (different) threshold. These conditions are captured by two hinge losses, and the total loss is obtained by adding them. The threshold values are fixed in our experiments.
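The two hinge-loss conditions can be made concrete with a small numpy sketch. Since the paper's exact loss terms and numeric thresholds are not reproduced here, every constant and the precise functional form below are assumptions; the sketch only illustrates the structure of the objective.

```python
import numpy as np

# Illustrative thresholds (assumptions, not the paper's settings).
SUCCESS_SKETCH = 0.9   # sketch value above which a frame counts as successful
SUCCESS_PRED = 0.9     # predicted reward should exceed this for successful frames
MARGIN_SKETCH = 0.2    # sketch gap that triggers the ranking constraint
MARGIN_PRED = 0.1      # required gap in predicted rewards

def success_loss(sketch, pred):
    """Hinge loss: frames sketched as (un)successful should be predicted so."""
    loss = np.where(sketch >= SUCCESS_SKETCH,
                    np.maximum(0.0, SUCCESS_PRED - pred), 0.0)
    loss += np.where(sketch <= 1 - SUCCESS_SKETCH,
                     np.maximum(0.0, pred - (1 - SUCCESS_PRED)), 0.0)
    return loss.mean()

def ranking_loss(sketch_a, sketch_b, pred_a, pred_b):
    """Hinge loss: if frame a is sketched higher than frame b by a margin,
    its predicted reward should also be higher by a margin."""
    applies = (sketch_a - sketch_b) >= MARGIN_SKETCH
    gap = np.maximum(0.0, MARGIN_PRED - (pred_a - pred_b))
    return np.where(applies, gap, 0.0).mean()

# Two frames from one episode: one clearly successful, one clearly not.
sketch = np.array([0.95, 0.05])
pred = np.array([0.7, 0.3])
total = success_loss(sketch, pred) + ranking_loss(
    sketch[:1], sketch[1:], pred[:1], pred[1:])
```

Because only within-episode comparisons are penalized, the model is free to learn a consistent global reward scale even though annotators disagree on scale across episodes.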
IV-E Off-policy batch RL
We use an algorithm similar to D4PG as our main training algorithm, but with a recurrent state. It maintains an online value network and an online policy network. Given the effectiveness of recurrent value functions, both networks are recurrent, maintaining corresponding recurrent hidden states. The target networks have the same structure as the value and policy networks, but are parameterized by separate parameters, which are periodically updated to the current parameters of the online networks.
Given the value function, we update the policy using the deterministic policy gradient (DPG) (2). As in D4PG, instead of a scalar value function, we adopt a distributional value function. We refer the reader to the original paper for details of the loss used to learn distributional value functions, whose gradient (3) we use to train the critic.
During learning, we sample a batch of sequences of observations and actions, and use a zero start state to initialize all recurrent states at the beginning of the sampled sequences. We then update the policy and value networks following the gradients defined in (2) and (3), respectively, using BPTT.
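The training-loop bookkeeping described above, recurrent states initialized to zero at the start of each sampled sequence and target networks refreshed only periodically, can be sketched as follows. The update period, sequence length, toy recurrence, and parameter shapes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

TARGET_UPDATE_PERIOD = 100  # illustrative; the paper's period is not given
SEQUENCE_LENGTH = 10

# stand-ins for network parameters
online = {"critic": np.zeros(4), "actor": np.zeros(4)}
target = {k: v.copy() for k, v in online.items()}

def unroll(params, sequence):
    """Recurrent unroll with a zero start state, as described in the text."""
    hidden = np.zeros_like(params)
    for obs in sequence:
        hidden = np.tanh(params + hidden + obs)  # toy recurrence
    return hidden

for step in range(1, 301):
    sequence = np.zeros(SEQUENCE_LENGTH)      # placeholder sampled sequence
    _ = unroll(online["critic"], sequence)    # forward pass that BPTT would use
    online["critic"] += 0.01                  # stand-in for a gradient step
    online["actor"] += 0.01
    if step % TARGET_UPDATE_PERIOD == 0:      # periodic target refresh
        target = {k: v.copy() for k, v in online.items()}
```

Periodic (rather than per-step) target updates keep the bootstrapping targets stable while the online networks change quickly.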
Since NES contains data from many different tasks, a randomly sampled batch from NES may contain data mostly irrelevant to the task at hand. To increase the representation of data from the current task, we construct fixed ratio batches, with % of the batch drawn from the entirety of NES and % from the data specific to the target task. This is similar to the solution proposed in previous work , where fixed ratio batches are formed with agent and demonstration data.
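A fixed-ratio batch of this kind is straightforward to construct; the sketch below is illustrative, and the batch size and task fraction are placeholders since the paper's percentages are elided.

```python
import random

def fixed_ratio_batch(nes, task_episodes, batch_size=8, task_fraction=0.5):
    """Draw a batch with a fixed fraction from the task-specific subset of NES
    and the remainder from the whole store (sampling with replacement)."""
    n_task = int(batch_size * task_fraction)
    batch = random.choices(task_episodes, k=n_task)
    batch += random.choices(nes, k=batch_size - n_task)
    random.shuffle(batch)
    return batch

# Toy NES: target-task data is a small minority of the store.
nes = [{"task": t} for t in ["lift_green"] * 2 + ["other"] * 98]
task_only = [e for e in nes if e["task"] == "lift_green"]
batch = fixed_ratio_batch(nes, task_only)
n_on_task = sum(e["task"] == "lift_green" for e in batch)
```

Without the fixed ratio, a uniform sample from this store would contain on-task data only about 2% of the time; the sampler guarantees a floor on task-relevant examples per batch.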
After policies are trained, we choose some of them for execution on the robot. Regardless of how successful the episodes are, the data is accumulated in NES to train subsequent models.
In early workflow iterations, before the reward functions are trained with sufficient coverage of state space, the policies often exploit “delusions” where high rewards are assigned to undesired behaviors. To fix a reward delusion, a human annotator sketches some of the episodes where the delusion is observed. New annotations are used to improve the reward model, which is used in training a new policy. For each target task, this cycle is typically repeated – times until the predictions of a reward function are satisfactory.
V-A Experimental Setup
Our experimental setup consists of a Sawyer robot with a Robotiq F- gripper and a wrist force-torque sensor, facing a cm basket. The action space has seven degrees of freedom, corresponding to Cartesian translational and rotational velocity targets of the gripper pinch point, and the gripper fingers. The agent control loop is executed at Hz. For safety, the pinch point movement is restricted to a cm workspace with maximum rotations of , , and around each axis.
Observations are provided by three cameras around the cage, as well as two wide angle cameras and one depth camera mounted at the wrist, and proprioceptive sensors in the arm (Fig. 4). NES captures all of the observations, and we indicate what subset is used for each learned component.
Tasks and datasets
We focus on two subsets of NES, containing data recorded during manipulations of:
three colored objects, one of each color: red, green and blue (rgb dataset, Fig. 3);
three deformable objects: a soft ball, a rope and a piece of cloth (deformable dataset, Fig. 1, bottom right).
The rgb dataset is used to learn policies for two tasks, lift_green and stack_green_on_red; the deformable dataset is used for the pull_cloth_up task. Final statistics for both datasets are presented in Table II. Both datasets were grown progressively by iterating the process in Fig. 2. Each episode lasts for steps ( seconds), unless it is terminated earlier for safety reasons.
To generate initial datasets for training, we use a scripted policy called the random_watcher. This policy moves the end effector to randomly chosen locations, and opens and closes the gripper at random moments in time. When following this policy, the robot occasionally picks up or pushes the objects, but is typically just moving in free space. This data not only serves to seed the initial iteration of learning; removing it from the final datasets also degrades the performance of the final agents.
The two datasets contain a significant number of teleoperated episodes, although the majority are recorded via interactive teleoperation (section IV-A), and thus required limited human intervention. Only about full teleoperated episodes correspond to the lift_green or stack_green_on_red tasks.
There are , , and sketched episodes for the lift_green, stack_green_on_red and pull_cloth_up tasks, respectively. Approximately % of the episodes are used for training and % for validation. Note that the sketches were not all obtained at once; they were accumulated over several iterations of the process in Fig. 2.
Reward network architecture
The reward network is a non-recurrent residual network with a spatial softmax layer (Fig. 5). The spatial softmax layer produces a list of coordinates. Proprioceptive features are embedded using linear layers, layer-normalized, and concatenated with the camera encoding vectors. As the sketched values are bounded, the reward network ends with a sigmoid non-linearity.
The final agent for the lift_green and stack_green_on_red tasks observes two cameras: a basket front-left camera and one of the wrist-mounted wide-angle cameras (Fig. 4). The agent for pull_cloth_up also uses an additional basket back-left camera.
The agent network is illustrated in Fig. 5. Each camera is encoded using a residual network followed by a spatial softmax keypoint encoder with channels. We use one such column for each camera and concatenate the results. Before applying the spatial softmax, noise is added to the logits.
Proprioceptive features are concatenated, embedded using a linear layer, layer-normalized, passed through an activation, and appended to the camera encodings to form the joint input features.
The actor network consumes these joint input features directly. The critic network additionally passes them through a linear layer, concatenates the result with action features (obtained by passing the action through a linear layer), and passes the result through an additional linear layer with a ReLU activation. Everything is then fed through two layer-normed LSTM layers with hidden units each, followed by actor or critic heads. Action outputs are processed through a layer putting them in a fixed range, and then rescaled to their native ranges before being sent to the robot.
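The final squash-and-rescale step can be sketched in a few lines. The tanh squashing function and the native ranges below are illustrative assumptions (the paper's exact layer and limits are elided); only the two-step structure, bounded output then per-dimension rescaling, follows the text.

```python
import numpy as np

def rescale_actions(raw, low, high):
    """Squash raw network outputs into (-1, 1), then rescale each
    action dimension to its native [low, high] range."""
    squashed = np.tanh(raw)
    return low + (squashed + 1.0) * 0.5 * (high - low)

# Hypothetical native ranges: 3 translational velocities, 3 rotational
# velocities, and a gripper command (all values are placeholders).
low = np.array([-0.1, -0.1, -0.1, -1.0, -1.0, -1.0, 0.0])
high = np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0, 255.0])

# A zero network output maps to the midpoint of each native range.
actions = rescale_actions(np.zeros(7), low, high)
```

Bounding the outputs before rescaling guarantees that no command sent to the robot can exceed its per-dimension limits, regardless of network activations.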
We train multiple RL agents in parallel and briefly evaluate promising ones on the robot. Each agent is trained for k update steps. To further improve performance, we save all episodes from RL agents, sketch more reward curves if necessary, and use them when training the next generation of agents. We iterated this procedure – times to obtain the final agents reported here.
| Ablation | (A): normal | (B): hard | (C): unseen |
| --- | --- | --- | --- |
| No random watcher data | 80% | 70% | 20% |
| No other task data | 0% | 0% | 0% |

| Ablation | (A): normal | (B): hard | (C): unseen |
| --- | --- | --- | --- |
| No random watcher data | 50% | 30% | 30% |
| No other task data | 0% | 10% | 0% |
Finally, we run controlled evaluations on the physical robot, with fixed initial conditions across different policies. For the lift_green and stack_green_on_red tasks we devise three evaluation conditions with varying levels of difficulty:
normal – basic rectangular green blocks well represented in training data, large red object close to center in starting position;
hard – more diverse objects that are less well represented in training data, smaller red objects with diverse starting locations;
unseen – green objects unseen during training with a large red object.
Each condition specifies different initial positions of the objects (set by human operator) as well as the initial pose of the robot (set automatically). The hard and unseen conditions are especially challenging, since they require the agent to cope with novel objects and novel object configurations.
Figure 6 shows examples of the initial conditions. We use the same evaluation sets for both the lift_green and stack_green_on_red tasks. To evaluate the pull_cloth_up task, we randomize initial conditions at every trial. As a quality metric, we measure the rate of successfully completed episodes, where success is judged by the human operator.
Results on the rgb dataset are summarized in Tables III and IV. Our agent achieves a success rate of % for lifting and % for stacking. Even with rarely seen objects positioned in adversarial ways, the agent is quite robust, with success rates of % and %, respectively. Remarkably, when dealing with objects that were never seen before, it can lift or stack them in % and % of cases. The success rate of our agent on the pull_cloth_up task, in episodes with randomized initial conditions, is %.
Our results compare favorably with those of Zhu et al., where block lifting and stacking success rates are 64% and 35% (the results are not directly comparable due to different physical setups). Wulfmeier et al. also attempted the block stacking task, but instead of learning directly from pixels, they rely on QR-code-based state estimation of a fixed set of cubes, whereas our policies can handle a variety of objects.
To understand the importance of accumulating robot experience in NES, we run the following ablations. First, we train an agent using only task-specific data that resembles the conventional RL approach. This strategy is interesting because although the amount of training data is reduced, the similarity between training data and target behavior is increased (i.e. the training data is more on-policy). Second, we train an agent while excluding only the random_watcher data. As this data is unlikely to contain episodes relevant to the task, we are interested in knowing how much it contributes to the final performance.
Tables III and IV show the results of these ablations. Remarkably, using only a task-specific dataset dramatically degrades the performance of the policy. Random watcher data proves to be valuable as it contributes up to an additional % improvement, showing the biggest advantage in the hardest case with unseen objects.
For qualitative results we refer the reader to the accompanying video, which demonstrates the robustness of our agents. They successfully deal with adversarial perturbations by a human operator, stack several unseen and non-standard objects, and lift toys such as a robot and a pony. Last but not least, our agents move faster and are more efficient than a human operator.
We proposed a framework for data-driven robotics that makes use of a large dataset of diverse robot experience and reward functions learned from a novel form of human feedback. We presented a successful instantiation of this framework to train policies using pure batch RL. Our experimental results allow us to draw the following conclusions:
Reward sketching is an effective way to elicit reward functions, since humans are good at judging progress toward a task goal. Importantly, this approach can be directly applied to many other tasks.
Stored robot experience over a long period of time and across different tasks can be efficiently harnessed to learn policies in a completely offline manner.
Diversity of training data seems to be an essential factor in the success of standard state-of-the-art RL algorithms, which were previously reported to fail when trained only on expert data or on the history of a single agent.
We can tightly close the loop of human input, reward learning and policy learning with a human-in-the-loop workflow. It enables us to learn a new real world skill in a flexible and fast way.
Qualitatively we observed that agents trained using our framework accomplish tasks faster than human teleoperation or the BC policies we trained.
Our proposed framework promises to address several shortcomings of other approaches that rely on DL in robotics. Importantly, it is general enough to obtain policies for a variety of robot manipulation tasks, as we demonstrated in our experiments. Our policies achieve notable success even in hard settings involving complex object manipulation, soft deformable objects, and previously unseen sets of objects, without any need for feature or reward design.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
-  A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 345–354.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in IEEE Computer Vision and Pattern Recognition, 2017.
-  O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik, A. Huang, P. Georgiev, R. Powell, et al., “AlphaStar: Mastering the real-time strategy game StarCraft II,” DeepMind Blog, 2019.
-  “OpenAI Five,” https://openai.com/five/.
-  S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
-  D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning, 2018, pp. 651–673.
-  J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
-  J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural networks, vol. 21, no. 4, pp. 682–697, 2008.
-  M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force control policies for compliant manipulation,” in International Conference on Intelligent Robots and Systems, 2011, pp. 4639–4644.
-  R. Hafner and M. Riedmiller, “Reinforcement learning in feedback control,” Machine learning, vol. 84, no. 1-2, pp. 137–169, 2011.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
-  G. Dulac-Arnold, D. J. Mankowitz, and T. Hester, “Challenges of real-world reinforcement learning,” arXiv preprint arXiv:1904.12901, 2019.
-  A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in IEEE International Conference on Robotics & Automation, 2018, pp. 6292–6299.
-  A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” arXiv preprint arXiv:1709.10087, 2017.
-  M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” arXiv preprint arXiv:1707.08817, 2017.
-  T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. Van Hasselt, J. Quan, M. Večerík, et al., “Observe and look further: Achieving consistent performance on Atari,” arXiv preprint arXiv:1805.11593, 2018.
-  M. Vecerik, O. Sushkov, D. Barker, T. Rothörl, T. Hester, and J. Scholz, “A practical approach to insertion with variable socket position using deep reinforcement learning,” in IEEE International Conference on Robotics & Automation, 2019, pp. 754–760.
-  D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances on Neural Information Processing Systems, 1989, pp. 305–313.
-  R. Rahmatizadeh, P. Abolghasemi, L. Bölöni, and S. Levine, “Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration,” in IEEE International Conference on Robotics & Automation, 2018, pp. 3758–3765.
-  S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” arXiv preprint arXiv:1707.02267, 2017.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in International Conference on Intelligent Robots and Systems, 2017, pp. 23–30.
-  L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, “Asymmetric actor critic for image-based robot learning,” arXiv preprint arXiv:1710.06542, 2017.
-  M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation,” arXiv preprint arXiv:1808.00177, 2018.
-  K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in IEEE International Conference on Robotics & Automation, 2018, pp. 4243–4250.
-  J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learning for deformable object manipulation,” arXiv preprint arXiv:1806.07851, 2018.
-  A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in International Conference on Machine Learning, 2000, pp. 663–670.
-  C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on Machine Learning, 2016, pp. 49–58.
-  J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances on Neural Information Processing Systems, 2016, pp. 4565–4573.
-  Y. Li, J. Song, and S. Ermon, “InfoGAIL: Interpretable imitation learning from visual demonstrations,” in NIPS, 2017.
-  J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in ICLR, 2018.
-  J. Merel, Y. Tassa, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, “Learning human behaviors from motion capture by adversarial imitation,” arXiv preprint arXiv:1707.02201, 2017.
-  Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, et al., “Reinforcement and imitation learning for diverse visuomotor skills,” arXiv preprint arXiv:1802.09564, 2018.
-  N. Baram, O. Anschel, I. Caspi, and S. Mannor, “End-to-end differentiable adversarial imitation learning,” in ICML, 2017.
-  L. Thurstone, “A law of comparative judgment,” Psychological Review, vol. 34, pp. 273–286, 1927.
-  F. Mosteller, “Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations,” Psychometrika, vol. 16, pp. 3–9, 1951.
-  S. E. Fienberg and K. Larntz, “Log-linear representation for paired and multiple comparison models,” Biometrika, vol. 63, no. 2, pp. 245–254, 1976.
-  H. Stern, “A continuum of paired comparison models,” Biometrika, vol. 77, pp. 265–273, 1990.
-  W. Chu and Z. Ghahramani, “Preference learning with Gaussian processes,” in International Conference on Machine Learning, 2005.
-  T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay, “Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search,” ACM Transactions on Information Systems, vol. 25, no. 2, 2007.
-  E. Brochu, N. de Freitas, and A. Ghosh, “Active preference learning with discrete choice data,” in Advances on Neural Information Processing Systems, 2007.
-  E. Brochu, T. Brochu, and N. de Freitas, “A Bayesian interactive optimization approach to procedural animation design,” in SIGGRAPH Symposium on Computer Animation, 2010, pp. 103–112.
-  Y. Koyama, I. Sato, D. Sakamoto, and T. Igarashi, “Sequential line search for efficient visual design optimization by crowds,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–11, 2017.
-  M. J. A. Strens and A. W. Moore, “Policy search using paired comparisons,” Journal of Machine Learning Research, vol. 3, pp. 921–950, 2003.
-  C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” Journal of Machine Learning Research, vol. 18, no. 136, pp. 1–46, 2017.
-  R. Akrour, M. Schoenauer, and M. Sebag, “APRIL: Active preference learning-based reinforcement learning,” in European Conference on Machine Learning and Knowledge Discovery in Databases, 2012, pp. 116–131.
-  R. Akrour, M. Schoenauer, M. Sebag, and J.-C. Souplet, “Programming by feedback,” in International Conference on Machine Learning, 2014, pp. 1503–1511.
-  D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations,” arXiv preprint arXiv:1904.06387, 2019.
-  D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics, Science and Systems, 2017.
-  P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances on Neural Information Processing Systems, 2017, pp. 4299–4307.
-  B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei, “Reward learning from human preferences and demonstrations in Atari,” in Advances on Neural Information Processing Systems, 2018, pp. 8011–8023.
-  A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine, “End-to-end robotic reinforcement learning without reward engineering,” arXiv preprint arXiv:1904.07854, 2019.
-  P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” in ICLR Workshop, 2017.
-  S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” in Reinforcement learning, 2012, pp. 45–73.
-  S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” arXiv e-prints, 2018.
-  N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard, “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog,” arXiv preprint arXiv:1907.00456, 2019.
-  R. Agarwal, D. Schuurmans, and M. Norouzi, “Striving for simplicity in off-policy deep reinforcement learning,” arXiv preprint arXiv:1907.04543, 2019.
-  A. Kumar, J. Fu, G. Tucker, and S. Levine, “Stabilizing off-policy Q-learning via bootstrapping error reduction,” arXiv preprint arXiv:1906.00949, 2019.
-  G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” in International Conference on Learning Representations, 2018.
-  S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, “Recurrent experience replay in distributed reinforcement learning,” in International Conference on Learning Representations, 2018.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning, 2014, pp. 387–395.
-  M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” arXiv preprint arXiv:1707.06887, 2017.
-  P. J. Werbos et al., “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  M. Wulfmeier, A. Abdolmaleki, R. Hafner, J. T. Springenberg, M. Neunert, T. Hertweck, T. Lampe, N. Siegel, N. Heess, and M. Riedmiller, “Regularized hierarchical policies for compositional transfer in robotics,” arXiv preprint arXiv:1906.11228, 2019.
-  M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: an open-source robot operating system,” in ICRA Workshop on Open Source Software, 2009.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
-  T. Oliphant, Guide to NumPy. USA: Trelgol Publishing, 2006.
-  W. McKinney et al., “Data structures for statistical computing in Python,” in Proceedings of the 9th Python in Science Conference, vol. 445, 2010, pp. 51–56.
-  J. D. Hunter, “Matplotlib: A 2D graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, p. 90, 2007.
-  M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vanderplas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian, C. Fonnesbeck, A. Lee, and A. Qalieh, “mwaskom/seaborn: v0.8.1 (september 2017),” Sept. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.883859
We would like to thank Jan Leike, Borja Ibarz, and Konstantinos Bousmalis for their insightful comments throughout the project, and all the DeepMind colleagues who kindly teleoperated the robot for data collection. We would also like to thank the open source community for developing the core set of tools that enabled this work, including ROS, TensorFlow, NumPy, Pandas, Matplotlib, and Seaborn.