Robot Learning on the Job: Human-in-the-Loop Autonomy and Learning During Deployment

With the rapid growth of computing power and recent advances in deep learning, we have witnessed impressive demonstrations of novel robot capabilities in research settings. Nonetheless, these learning systems exhibit brittle generalization and require excessive training data for practical tasks. To harness the capabilities of state-of-the-art robot learning models while embracing their imperfections, we present Sirius, a principled framework for humans and robots to collaborate through a division of work. In this framework, partially autonomous robots are tasked with handling a major portion of decision-making where they work reliably; meanwhile, human operators monitor the process and intervene in challenging situations. Such a human-robot team ensures safe deployments in complex tasks. Further, we introduce a new learning algorithm to improve the policy's performance on the data collected from the task executions. The core idea is re-weighing training samples with approximated human trust and optimizing the policies with weighted behavioral cloning. We evaluate Sirius in simulation and on real hardware, showing that Sirius consistently outperforms baselines over a collection of contact-rich manipulation tasks, achieving an 8% boost in simulation and a 27% boost on real hardware in policy performance over the state-of-the-art methods, with 3 times faster convergence and 15% of the memory size. Videos and code are available at



I Introduction

Recent years have witnessed great strides in deep learning techniques for robotics. In contrast to the traditional form of robot automation, which heavily relies on human engineering, these data-driven approaches show great promise in building robot autonomy that is difficult to design manually. While learning-powered robotics systems have achieved impressive demonstrations in research settings [andrychowicz2020learning, kalashnikov2018qt, lee2020learning], the state-of-the-art robot learning algorithms still fall short of generalization and robustness for widespread deployment in real-world tasks. The dichotomy between rapid research progress and the absence of real-world application stems from the lack of performance guarantees in today’s learning systems, especially when using black-box neural networks. It remains opaque to the potential practitioners of these learning systems: how often they fail, in what circumstances the failures occur, and how they can be continually enhanced to address them.

Fig. 1: Overview of Sirius, our human-in-the-loop learning and deployment framework. Sirius enables a human and a robot to collaborate on manipulation tasks through shared control. The human monitors the robot’s autonomous execution and intervenes to provide corrections through teleoperation. Data from deployments will be used by our algorithm to improve the robot’s policy in consecutive rounds of policy learning.

To harness the power of modern robot learning algorithms while embracing their imperfections, a burgeoning body of research has investigated new mechanisms to enable effective human-robot collaborations. Specifically, shared autonomy methods [Javdani2015SharedAV, Reddy2018SharedAV] aim at combining human input and semi-autonomous robot control for achieving a common task goal. These methods typically use a pre-built robot controller rather than seeking to improve robot autonomy over time. Meanwhile, recent advances in interactive imitation learning [kelly2019hg, mandlekar2020humanintheloop, ross2011reduction] have aimed to learn policies from human feedback in the learning loop. These learning algorithms can improve the overall efficacy of autonomous policies; however, the resulting policies alone still fail to meet the performance requirements for real-world deployment.

This work aims at developing a human-in-the-loop learning framework for human-robot collaboration and continual policy learning in deployed environments. We expect our framework to satisfy two key requirements: 1) it ensures task execution to be consistently successful through human-robot teaming, and 2) it allows the learning models to improve continually, such that human workload is reduced as the level of robot autonomy increases. To build such a framework, we need to move from the train-then-deploy paradigm common in today’s robot learning literature to continuous model updates during deployment. This idea of robot learning on the job resembles the Continuous Integration, Continuous Deployment (CI/CD) principles in software engineering [CICD]. Nonetheless, realizing this idea for learning-based manipulation invites fundamental challenges.

The foremost challenge is developing the infrastructure for human-robot collaborative manipulation. We develop a system that allows a human operator to monitor and intervene in the robot’s policy execution (see Fig. 1). The human can take over control when necessary and handle challenging situations to ensure safe and reliable task execution. Meanwhile, human interventions implicitly reveal the task structure and the level of human trust in the robot. As recent work [kelly2019hg, mandlekar2020humanintheloop, Hoque2021ThriftyDAggerBN] indicates, human interventions inform when the human lacks trust in the robot, where the risk-sensitive task states are, and how to traverse these states. We can thus take advantage of the occurrences of human interventions during deployments as informative signals for policy learning.

The subsequent challenge is updating policies on an ever-growing dataset of shifting distributions. As our framework runs over time, the policy would adapt its behaviors through learning, and the human would adjust their intervention patterns accordingly. Deployment data from human-robot teams can be multimodal and suboptimal. Learning from such deployment data requires us to selectively use them for policy updates. We want the robot to learn from good behaviors to reinforce them and also to recover from mistakes and deal with novel situations. At the same time, we want to prevent the robot from copying bad actions that would lead to failure. Our key insight is that we can assess the importance of varying training data based on human interventions for policy learning.

To this end, we develop a simple yet effective learning algorithm that uses human interventions to re-weigh training data. We consider the robot rollouts right before an intervention as “low-quality” (as the human believes the robot is about to fail) and both human demonstrations and interventions as “high-quality” for policy training. We label training samples with different weights and train policies on these samples using weighted behavioral cloning, the state-of-the-art algorithm for imitation learning [sasaki2021behavioral, DBLP:journals/corr/abs-2011-13885, pmlr-v162-xu22l] and offline reinforcement learning [wang2020critic, nair2021awac, Kostrikov2021OfflineRL]. This supervised learning algorithm lends itself to the efficiency and stability of policy optimization on our large-scale and growing dataset.

Furthermore, deploying our system in long-term missions leads to two practical considerations: 1) storing all past experiences over a long duration incurs a heavy memory burden, and 2) a large number of similar experiences may inundate the small subset of truly valuable data for policy training. We thus examine different memory management strategies, aiming at adaptively adding and removing data samples from a memory storage of fixed size. Our results show that even with 15% of the full memory size, we can match or exceed the performance of keeping all data, while achieving 3 times faster convergence for rapid model updates between consecutive rounds.

We name our framework Sirius, the star symbolizing our human-robot team with its binary star system. We evaluate Sirius in two simulated and two real-world tasks requiring contact-rich manipulation with precise motor skills. Compared to the state-of-the-art methods of learning from offline data [nair2021awac, Kostrikov2021OfflineRL, robomimic2021] and interactive imitation learning [mandlekar2020humanintheloop], Sirius achieves higher policy performance and reduced human workload. Sirius reports an 8% boost in simulation and a 27% boost on real hardware in policy performance over the state-of-the-art methods.

II Related Work

Human-in-the-loop Learning: In a human-in-the-loop learning setting, an agent utilizes interactive human feedback signals to improve its performance [zhang2019leveraging, Cruz2020ASO, Cui2021UnderstandingTR]. Human feedback can serve as a rich source of supervision, as humans often have a priori domain information and can interactively guide the agent with respect to its learning progress. Many forms of human feedback exist, such as interventions [kelly2019hg, spencer2020wil, mandlekar2020humanintheloop], preferences [christiano2017preferences, Biyik2022LearningRF, lee2021pebble, Wang2021SkillPL], rankings [Brown2019ExtrapolatingBS], scalar-valued feedback [macglashan2017interactive, warnell2018deep], and human gaze [ijcai2020-689]. These forms of feedback can be integrated into the learning loop through techniques such as policy shaping [knox2009tamer, NIPS2013_e034fb6b] and reward modeling [daniel2014active, leike2018scalable].

Within the context of robot manipulation, one approach is to incorporate human interventions in imitation learning algorithms [kelly2019hg, spencer2020wil, mandlekar2020humanintheloop]. Another approach is to employ deep reinforcement learning algorithms with learned rewards, either from preferences [lee2021pebble, Wang2021SkillPL] or reward sketching [cabi2019scaling]. While these methods have demonstrated higher performance compared to those without humans in the loop, they require a large amount of supervision from humans and also fail to incorporate human control feedback in deployment into the learning loop again to improve model performance. In contrast, we specifically consider the above scenarios which are critical to real-world robotic systems.

Shared Autonomy: Human-robot collaborative control is often necessary for real-world tasks when we do not have full robot autonomy while full human teleoperation control is burdensome. In shared autonomy [dragan2013policy, Javdani2015SharedAV, gopinath2017human, Reddy2018SharedAV], the control of a system is shared by human and robot to accomplish a common goal [Tan2021InterventionAS]. The existing literature on shared autonomy focuses on efficient collaborative control from human intent prediction [dragan2008formalizing, Muelling2015AutonomyIT, 7140066]. However, these methods do not attempt to learn from human intervention feedback, so the robot policy does not improve. We examine a context similar to that of shared autonomy, where a human is involved during the actual deployment of the robot system; however, we also put human control in the feedback loop and use it to improve the learning itself.

Learning from Offline Data: An alternative to the human-in-the-loop paradigm is to learn from fixed robot datasets via imitation learning [pomerleau1989alvinn, zhang2017deep, mandlekar2020gti, florence2021implicit] or offline reinforcement learning (offline RL) [levine2020offline, fujimoto2019bcq, Kumar2020ConservativeQF, kidambi2020morel, yu2020mopo, yu2021combo, mandlekar2020iris, Kostrikov2021OfflineRL]. Offline RL algorithms in particular have demonstrated promise when trained on large diverse datasets with suboptimal behaviors [singh2020cog, kumar2022when, ajay2020opal]. Among a number of different methods, advantage-weighted regression methods [wang2020critic, nair2021awac, Kostrikov2021OfflineRL] have recently emerged as a popular approach to offline RL. These methods use a weighted behavior cloning objective to learn the policy, using learned advantage estimates as the weight. In this work, we also use weighted behavior cloning; however, we explicitly leverage human intervention signals from our online human-in-the-loop setting to obtain weights rather than using task rewards to learn advantage-based weights. We show that this leads to superior empirical performance for our manipulation tasks.

III Overview

III-A Problem Formulation

We formulate a robot manipulation task as a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, p_0, \gamma)$ representing the state space, action space, reward function, transition probability, initial state distribution, and discount factor. In this work, we adopt an intervention-based learning framework in which the human can choose to intervene and take control of the robot. Given the current state $s_t \in \mathcal{S}$, the robot action $a^R_t$ is drawn from the policy $\pi_R(\cdot \mid s_t)$, and the human can override this action with a human action $a^H_t$. The policy for the human-robot team can thus be formulated as:

$$\pi_{\text{team}}(\cdot \mid s) = \mathbb{1}^H(s)\,\pi_H(\cdot \mid s) + \big(1 - \mathbb{1}^H(s)\big)\,\pi_R(\cdot \mid s),$$

where $\mathbb{1}^H$ is a binary indicator function of human interventions and $\pi_H$ is the implicit human policy. Our learning objective is two-fold: 1) we want to improve the level of robot autonomy by finding the autonomous policy $\pi_R$ that maximizes the cumulative rewards $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, and 2) we want to minimize the human’s workload in the system, i.e., the expectation of interventions $\mathbb{E}_{s \sim d_{\pi_{\text{team}}}}\big[\mathbb{1}^H(s)\big]$ under the state distribution $d_{\pi_{\text{team}}}$ induced by the team policy $\pi_{\text{team}}$.
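The human-gated team policy and the workload measure can be illustrated with a minimal Python sketch. The function names (`team_action`, `intervention_ratio`) and the toy policies in the usage note are hypothetical and chosen for illustration only; the class labels `"intv"` and `"robot"` follow the names used for the data classes later in the text.

```python
from typing import Any, Callable, List, Tuple


def team_action(
    state: Any,
    robot_policy: Callable[[Any], Any],
    human_policy: Callable[[Any], Any],
    intervene: Callable[[Any], bool],
) -> Tuple[Any, str]:
    """One step of human-gated control.

    If the human chooses to intervene in this state, the human action
    overrides the robot action; otherwise the robot action is executed.
    Returns the executed action and its data class label.
    """
    if intervene(state):
        return human_policy(state), "intv"
    return robot_policy(state), "robot"


def intervention_ratio(labels: List[str]) -> float:
    """Fraction of executed steps that were human interventions."""
    if not labels:
        return 0.0
    return sum(1 for c in labels if c == "intv") / len(labels)
```

As a toy usage, rolling out `team_action` over states `0..4` with `intervene = lambda s: s > 2` labels the last two steps as interventions, so `intervention_ratio` returns `0.4`.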

Fig. 2: Illustration of the workflow in Sirius. Robot deployment and policy update co-occur in two parallel threads. Deployment data are passed to policy training, while a newly trained policy is deployed to the target environment for task execution.

III-B Human-in-the-Loop Learning Framework

Next, we define the human-in-the-loop deployment setting and give an overview of the system design of Sirius as depicted in Fig. 2. Our human-in-the-loop system consists of two components that happen simultaneously: Robot Deployment and Policy Update. We have a memory buffer that grows in size until it reaches the memory limit.

Initially, the system has a warm-up phase, where we bootstrap a robot policy trained on a small number of human demonstrations. These demonstrations comprise a set of trajectories $\mathcal{D}_0 = \{\tau_j\}$, where each trajectory $\tau_j = \{(s_t, a_t, r_t, c_t)\}$ consists of the states, actions, task rewards, and the data class type flag $c_t = \text{demo}$ indicating that these trajectories are human demonstrations.

Upon training the initial policy $\pi_1$, we deploy the robot to perform the task, and in the process, we collect a set of trajectories to improve the policy. A human operator who continuously monitors the robot’s execution will intervene based on whether the robot has performed or will perform suboptimal behaviors. Note that we adopt human-gated control [kelly2019hg] rather than robot-gated control [Hoque2021ThriftyDAggerBN] to guarantee task execution success and trustworthiness of the system for real-world deployment. Through this process, we obtain a new dataset of trajectories $\mathcal{D}'_1$, where $c_t$ indicates whether the transition is a robot action ($c_t = \text{robot}$) or a human intervention ($c_t = \text{intv}$). We append this data to the existing memory buffer collected so far, $\mathcal{D}_1 = \mathcal{D}_0 \cup \mathcal{D}'_1$, and train a new policy $\pi_2$ on this new dataset.

In subsequent rounds, we deploy the robot to collect new data while simultaneously updating the policy. We define a “Round” as the interval for policy update and deployment: it consists of the completion of training for one policy and, at the same time, the collection of one set of deployment data. In Round $i$, we train policy $\pi_{i+1}$ using all previous data. Meanwhile, the robot is continuously deployed using the current best policy $\pi_i$, gathering deployment data $\mathcal{D}'_i$. At the end of Round $i$, we append this data to the existing memory buffer collected so far, $\mathcal{D}_i = \mathcal{D}_{i-1} \cup \mathcal{D}'_i$, and train a new policy on this aggregated dataset.
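The round structure can be sketched as a simple loop. This is a simplified, sequential illustration only: the framework described above runs deployment and training in parallel threads, whereas the sketch alternates them. The names `run_rounds`, `train`, and `deploy` are hypothetical placeholders for the training and deployment procedures.

```python
from typing import Any, Callable, List, Tuple


def run_rounds(
    demos: List[Any],
    train: Callable[[List[Any]], Any],
    deploy: Callable[[Any], List[Any]],
    num_rounds: int,
) -> Tuple[List[Any], List[Any]]:
    """Alternate deployment and policy updates over several rounds.

    demos:  warm-up human demonstration samples.
    train:  maps the current memory buffer to a new policy.
    deploy: runs a policy with a human in the loop and returns new samples.
    """
    memory = list(demos)          # memory buffer starts with the demonstrations
    policy = train(memory)        # warm-up policy trained on demonstrations only
    policies = [policy]
    for _ in range(num_rounds):
        new_data = deploy(policy)  # human-robot team collects deployment data
        memory.extend(new_data)    # aggregate into the memory buffer
        policy = train(memory)     # update the policy on the aggregated data
        policies.append(policy)
    return memory, policies
```

With three rounds, the loop returns four policies: the warm-up policy plus one update per round.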

III-C Properties of Deployment Data

Our framework aggregates data from deployment environments over its long-term deployments. This leads to a dataset that is constantly growing in size, consisting of mixed data from the human and the robot. Such a dataset differs from conventional offline datasets and presents a unique set of challenges:

  • Mixed Data Distributions. Data samples in the dataset consist of a mix of robot actions, human interventions, and human demonstrations. The data distributions could be suboptimal and highly multi-modal, as the robot and the human may resort to different strategies to solve the same task or take different actions in similar states.

  • Subjective Human Behavior. Extensive studies have shown that human data can vary greatly in terms of expertise, quality, and behaviors, affecting policy learning performance [Biyik2022LearningRF]. In our framework, the human’s trust and subjectivity are manifested in the timing, criteria, and duration of interventions.

  • Growing Data Size. The system running in long-term deployments can produce a vast amount of data, creating computational burdens for learning algorithms.

IV Method

Fig. 3: Overview of our human-in-the-loop learning model. We maintain an ever-growing database of diverse experiences spanning four categories: human demonstrations, autonomous robot data, human interventions, and transitions preceding interventions which we call pre-interventions. We employ weighted behavioral cloning to learn vision-based policies from these experiences. We set weights according to these four categories, with a high weight given to interventions over other categories.

To improve autonomy and reduce human costs in human-robot collaboration, we develop a policy learning method that harnesses the deployment data to update the models in consecutive rounds. Devising a suitable learning algorithm is challenging due to data distribution shifts. As we discussed in Section III-C, the dataset is multi-modal, of mixed quality, and ever-growing. To utilize such data sources, we have a critical insight that human interventions provide informative signals of task structure and human trust, which we will use to guide the design of our algorithm.

The core idea of our approach is to use human interventions to re-weigh training samples based on an approximate quality score. With these weighted samples, we train the policy with the weighted behavioral cloning method to learn the policy on mixed-quality data. Our intuition is that with a mixed-quality dataset, different weight assignments can differentiate high-quality samples from low-quality ones, such that the algorithm prioritizes high-quality samples for learning. This strategy has been shown effective in recent literature on offline reinforcement learning [nair2021awac, wang2020critic, Kumar2020ConservativeQF]. In the following, we will review the weighted behavioral cloning method and discuss our method for learning the sample weights from human interventions.

IV-A Weighted Behavioral Cloning

In vanilla behavioral cloning (BC), we train a model to replicate the actions for each state in the dataset. The objective is to learn a policy $\pi_\theta$ parameterized by $\theta$ that maximizes the log likelihood of actions $a$ at states $s$:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \big[ \log \pi_\theta(a \mid s) \big],$$

where $(s, a)$ are samples from the dataset $\mathcal{D}$. For weighted behavioral cloning, the log likelihood term of each $(s, a)$ pair is scaled by a weight function $w(s, a)$, which assigns different importance scores to different samples:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{(s, a) \sim \mathcal{D}} \big[ w(s, a) \log \pi_\theta(a \mid s) \big].$$

Although this algorithm appears conceptually simple, it serves as the foundation of several state-of-the-art methods for offline reinforcement learning (RL) [nair2021awac, Kostrikov2021OfflineRL, wang2020critic]. In particular, advantage-based offline RL algorithms learn advantage estimates $A(s, a)$ and calculate weights as $w(s, a) = f(A(s, a))$, where $f$ is a non-negative scalar function. The intuition of the advantage-weighted methods is that they can filter bad samples of low advantage and train on the samples of high advantage. However, effectively learning advantage estimates may be challenging if the dataset does not cover a sufficiently wide distribution of states and actions.
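As a minimal sketch of the weighted behavioral cloning objective, the function below computes the negative weighted mean log-likelihood over a batch, given per-sample log-probabilities that a policy has already produced. The name `weighted_bc_loss` is hypothetical, and the normalization by batch size is an illustrative choice; with all weights equal to 1 it reduces to the vanilla BC loss.

```python
from typing import Sequence


def weighted_bc_loss(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Negative weighted mean log-likelihood over a batch.

    log_probs[i] is log pi_theta(a_i | s_i) for sample i.
    weights[i]   is the importance weight w(s_i, a_i) for sample i.
    Minimizing this loss maximizes the weighted log-likelihood.
    """
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    if not log_probs:
        raise ValueError("batch must not be empty")
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)
```

A sample with weight 0 has no influence on the loss, which is how low-quality samples can be filtered out of training.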

IV-B Intervention-based Weighting Scheme

Rather than relying on advantage-based weights, we explicitly leverage the human intervention signals in the dataset to construct our weighting scheme (see Fig. 3 for an overview). Recall that each sample $(s, a, c)$ in our dataset contains a data class type $c$ indicating whether the sample denotes a human demonstration action, robot action, or human intervention action. We assign the weight for each sample according to $c$, that is

$$w(s, a, c) = \frac{\alpha_c}{N_c},$$

where $N_c$ refers to the total number of samples in the dataset corresponding to data class $c$, and $\alpha_c$ is a hyperparameter that represents the contribution of class type $c$ to the weighted BC objective. By dividing the weight over the number of class samples $N_c$, we ensure that the overall contribution of each class type to the weighted BC objective is $\alpha_c$, regardless of the makeup of the dataset. This is an important feature of our method, allowing us to operate over varying datasets with the same set of fixed hyperparameters.
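The class-normalized weighting can be sketched in a few lines of Python. The function name `class_normalized_weights` is hypothetical, and the per-class contribution values in the usage note are placeholders chosen for illustration, not the values used in the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence


def class_normalized_weights(labels: Sequence[str], alpha: Dict[str, float]) -> List[float]:
    """Assign each sample the weight alpha[c] / N_c for its class c.

    labels: data class of each sample, e.g. "demo", "robot", "intv".
    alpha:  desired total contribution of each class to the objective.
    The weights of all samples in class c then sum to alpha[c],
    independent of how many samples the class contains.
    """
    counts = Counter(labels)
    return [alpha[c] / counts[c] for c in labels]
```

For example, with labels `["demo", "robot", "robot", "intv"]` and placeholder contributions `{"demo": 0.25, "robot": 0.25, "intv": 0.5}`, the two robot samples each receive weight `0.125`, so the robot class still contributes `0.25` in total.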

Fig. 4: Quantitative evaluations. We compare our method with human-in-the-loop learning, imitation learning, and offline reinforcement learning baselines. Our results in simulated and real-world tasks show steady performance improvements of the autonomous policies over rounds. Our model reports the highest performances in all four tasks after three rounds of deployments and policy updates. Solid line: human-in-the-loop; dashed line: offline learning on data from our method.

The weighting scheme described so far resembles similar weighting schemes proposed by prior work [mandlekar2020humanintheloop]. While promising, this weighting scheme has a critical limitation: it treats all of the robot samples with the same weight $\alpha_{\text{robot}}$ regardless of their quality. While the robot often performs reasonable behaviors, it can make mistakes that subsequently demand interventions. In practice, humans exhibit a response delay in performing corrective interventions, as they must first detect mistakes and react accordingly. During this delay period, the robot’s actions can be highly undesirable, which can hurt subsequent policy training. Therefore, we should down-weight the influence of these samples. To address this limitation, we introduce a simple heuristic of relabeling robot samples within $\ell$ timesteps preceding a window of intervention as a new pre-intervention class $\text{preintv}$. We use a low weight $\alpha_{\text{preintv}}$ to reduce the contribution of these samples.
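The pre-intervention relabeling heuristic can be sketched as follows for a single trajectory. The function name `relabel_pre_interventions`, the label string `"preintv"`, and the `window` argument (the number of timesteps before an intervention to relabel) are illustrative assumptions.

```python
from typing import List, Sequence


def relabel_pre_interventions(labels: Sequence[str], window: int) -> List[str]:
    """Relabel robot samples that directly precede an intervention.

    labels: per-timestep class labels of one trajectory ("robot" or "intv").
    window: number of timesteps before each intervention segment to relabel.
    Only "robot" samples are relabeled; intervention samples are left intact.
    """
    out = list(labels)
    for t, c in enumerate(labels):
        starts_segment = c == "intv" and (t == 0 or labels[t - 1] != "intv")
        if not starts_segment:
            continue
        # Walk back over the `window` timesteps before the intervention starts.
        for k in range(max(0, t - window), t):
            if out[k] == "robot":
                out[k] = "preintv"
    return out
```

With five robot steps followed by a two-step intervention and `window=3`, the last three robot steps before the intervention become `"preintv"` while the earlier robot steps keep their label.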

IV-C Memory Management

As the deployment continues and the dataset increases, large data slows down training convergence and takes up excessive memory space. We hypothesize that forgetting (routinely discarding samples from memory) helps prioritize important and useful experiences for learning, speeding up convergence and even further improving policy. Therefore, we examine strategies for managing the memory buffer.

We assume that we have a fixed-size memory buffer. Once the buffer becomes full, we replace existing samples with new samples. Since we know that human data is useful, we always retain demo and intv samples and instead reject robot execution samples corresponding to the robot class. We consider five strategies for managing the memory buffer of deployment data: 1) FIFO (First-In-First-Out): reject samples in the order that they were added to the buffer; 2) FILO (First-In-Last-Out): reject the most recently added samples first; 3) LFI (Least-Frequently-Intervened): first reject samples from trajectories with the least interventions; 4) MFI (Most-Frequently-Intervened): first reject samples from trajectories with the most interventions; and 5) Uniform: reject samples uniformly at random. Each strategy operates on a different set of assumptions and offers unique advantages. For example, MFI retains trajectories with fewer interventions, which contain high-quality samples, whereas LFI retains trajectories with the most interventions, which reveal rare corner cases and preserve the diversity of the dataset.
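The five rejection strategies can be sketched as a single eviction routine. This is an illustrative sketch, not the paper's implementation: the `Sample` record, its field names, and the `evict` function are hypothetical, and ties within LFI and MFI are broken by insertion order as an arbitrary choice.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    step: int       # insertion order into the buffer (smaller = older)
    cls: str        # data class: "demo", "intv", or "robot"
    traj_intv: int  # number of interventions in the sample's trajectory


def evict(buffer: List[Sample], capacity: int, strategy: str, seed: int = 0) -> List[Sample]:
    """Reject robot-class samples until the buffer fits the capacity.

    Human samples ("demo", "intv") are always retained, so the result may
    still exceed the capacity if there are too few robot samples to reject.
    """
    excess = len(buffer) - capacity
    if excess <= 0:
        return list(buffer)
    robot = [s for s in buffer if s.cls == "robot"]
    if strategy == "FIFO":      # oldest robot samples go first
        order = sorted(robot, key=lambda s: s.step)
    elif strategy == "FILO":    # most recently added robot samples go first
        order = sorted(robot, key=lambda s: -s.step)
    elif strategy == "LFI":     # least-intervened trajectories go first
        order = sorted(robot, key=lambda s: (s.traj_intv, s.step))
    elif strategy == "MFI":     # most-intervened trajectories go first
        order = sorted(robot, key=lambda s: (-s.traj_intv, s.step))
    elif strategy == "Uniform": # uniformly random rejection
        order = list(robot)
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    rejected = {id(s) for s in order[:excess]}
    return [s for s in buffer if id(s) not in rejected]
```

Because only the ordering of candidate samples differs between strategies, new rejection rules can be added by defining another sort key.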

IV-D Implementation Details

We adopt BC-RNN [robomimic2021], the state-of-the-art behavioral cloning algorithm, as our model backbone. We use ResNet-18 encoders [DBLP:journals/corr/HeZRS15] to encode third-person and eye-in-hand images [robomimic2021, mandlekar2020gti]. We concatenate image features with robot proprioceptive state as input to the policy. The network outputs a Gaussian Mixture Model (GMM) distribution over actions. See Appendix VII-D for the full details.
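To make the GMM action head concrete, the following sketch evaluates the log-likelihood of an action under a mixture of Gaussians, which is the quantity a behavioral cloning loss would maximize. It assumes diagonal covariances and takes the mixture parameters as plain lists; both are illustrative assumptions, and `gmm_log_prob` is a hypothetical name rather than a robomimic function.

```python
import math
from typing import Sequence


def gmm_log_prob(
    action: Sequence[float],
    mix_weights: Sequence[float],
    means: Sequence[Sequence[float]],
    stds: Sequence[Sequence[float]],
) -> float:
    """Log-likelihood of an action under a diagonal-covariance GMM.

    mix_weights[k] is the probability of component k (must sum to 1).
    means[k][d] and stds[k][d] are the mean and standard deviation of
    component k along action dimension d.
    """
    component_log_probs = []
    for w, mu, sd in zip(mix_weights, means, stds):
        lp = math.log(w)
        for a, m, s in zip(action, mu, sd):
            # Log density of a univariate Gaussian N(m, s^2) at a.
            lp += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        component_log_probs.append(lp)
    # Log-sum-exp over components for numerical stability.
    mx = max(component_log_probs)
    return mx + math.log(sum(math.exp(lp - mx) for lp in component_log_probs))
```

A mixture lets the policy place probability mass on several distinct actions for the same state, which suits data in which the human and the robot solve a task with different strategies.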

For the sample weights, we find that good weighting schemes favor weights that are higher for the intervention class, lower for the robot class, and near-zero for the pre-intervention class. Intuitively, human interventions serve as a crucial supervisory signal that guides the agent to recover from suboptimal behaviors, while pre-interventions denote suboptimal behaviors. We set the weights $\alpha_{\text{intv}}$, $\alpha_{\text{robot}}$, and $\alpha_{\text{preintv}}$ following this ordering, and set the demonstration weight to $\alpha_{\text{demo}} = N_{\text{demo}} / N$, where $N$ denotes the total number of samples in the memory. Note that our demo weight maintains the true ratio of demonstration samples in the dataset. It keeps the policy close to the expert data distribution, especially during initial rounds of updates when the robot generates lower-quality data. This is in contrast to Mandlekar et al. [mandlekar2020humanintheloop], which treats all non-intervention samples as a single class, thus lowering the contribution of demonstrations over time.

V Experiments

V-A Tasks

We design a set of simulated and real-world tasks that resemble common industrial tasks in manufacturing and logistics. We consider long-horizon tasks that require precise contact-rich manipulation, necessitating human guidance. For all tasks, we use a Franka Emika Panda robot arm equipped with a parallel jaw gripper. Both the policy and the teleoperation device control the robot in the task space. We use the SpaceMouse as the human interface device to intervene.

We systematically evaluate the performance of our method and baselines in the robosuite simulator [zhu2020robosuite]. We choose the two most challenging contact-rich manipulation tasks in the robomimic benchmark [robomimic2021]:

Nut Assembly. The robot picks up a square rod from the table and inserts the rod into a column.

Tool Hang. The robot picks up a hook piece and inserts it into a very small hole, then hangs a wrench on the hook. As noted in robomimic [robomimic2021], this is a difficult task requiring very precise and dexterous control.

In the real world, we design two tasks representative of industrial assembly and food packaging applications:

Gear Insertion. The robot picks up two gears on the NIST board and inserts each of them onto the gear shafts.

Coffee Pod Packing. The robot opens a drawer, places a coffee pod into the pod holder and closes the drawer.

V-B Metrics and Baselines

We benchmark human-in-the-loop deployment systems in two aspects: 1) Policy Performance. Our human-robot team achieves a reliable task success of 100%. Here we evaluate the success rate of the autonomous policy after each round of model update; and 2) Human Workload. We measure human workload as the percentage of intervention in the trajectories in each round.

We compare our method with the state-of-the-art human-in-the-loop method Intervention Weighted Regression (IWR) [mandlekar2020humanintheloop]. Furthermore, to isolate the impacts of learning algorithms and data distributions, we compare against the state-of-the-art imitation learning algorithm BC-RNN [robomimic2021] and the state-of-the-art offline RL algorithm Implicit Q-Learning (IQL) [Kostrikov2021OfflineRL]. We run these two algorithms on the deployment data generated by our method for a fair comparison.

To mimic the intervention-guided weights for IQL, we use the following rewards: upon task success, for intervention states, for pre-intervention states, and for all other states. We also run IQL in a sparse reward setting but find that it underperformed. We highlight that in contrast to our method, IQL requires task rewards, which may be expensive to obtain in real-world settings.

We follow round updates rules by prior works [mandlekar2020humanintheloop, kelly2019hg]: 3 rounds of update when the number of intervention samples reaches of the human demonstration samples, so that all methods get the same amount of human samples per round. We discuss more details on the round update design in Appendix VII-E.

Fig. 5: Ablation on memory management strategies. We study the five different strategies introduced in Section IV-C. Adopting these strategies consistently matches or yields better performance over keeping all data samples (Base) while achieving higher learning efficiency.

V-C Quantitative Results

We show in Fig. 4 that our method significantly outperforms the baselines on our evaluation tasks. Our method consistently outperforms IWR over the rounds. We attribute this difference to our fine-grained weighting scheme, enabling the method to better differentiate high-quality and suboptimal samples. This advantage over IWR cascades across the rounds, as we obtain a better policy, which in turn yields better deployment data.

We also show that our method significantly outperforms the BC-RNN and IQL baselines under the same dataset distribution. This highlights the importance of our weighting scheme — BC-RNN performs poorly due to copying the suboptimal behaviors in the dataset, while IQL is unable to effectively learn the values as weights.

Moreover, we show the effectiveness of memory management in deployment in Fig. 5. With memory management strategies that discard samples, we reduce the memory size to a small proportion of the original dataset size (30% for Nut Assembly and 15% for Tool Hang) while maintaining the same level of performance as, or even higher performance than, no memory management. Memory management is also crucial to learning efficiency in model updates, as the policy is learned from a smaller dataset with more useful samples. We compare the management strategies discussed in Section IV-C in terms of their effect on policy performance and learning efficiency. We measure learning efficiency as the number of epochs needed to reach a high policy performance, which we benchmark at a 90% policy success rate. With memory management strategies, we achieve 2-3x faster policy convergence, which is crucial to efficient updates in the deployment setting.
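The learning-efficiency metric can be sketched as the first epoch at which a success-rate curve reaches the 90% threshold, with the convergence speedup given by the ratio of two such epochs. The function names and the curves in the usage note are hypothetical and only illustrate the computation; they are not results from the paper.

```python
from typing import Optional, Sequence


def epochs_to_threshold(success_rates: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """First epoch (1-indexed) whose success rate reaches the threshold.

    Returns None if the threshold is never reached.
    """
    for epoch, rate in enumerate(success_rates, start=1):
        if rate >= threshold:
            return epoch
    return None


def convergence_speedup(
    baseline_curve: Sequence[float],
    managed_curve: Sequence[float],
    threshold: float = 0.9,
) -> Optional[float]:
    """Ratio of epochs-to-threshold without and with memory management."""
    base = epochs_to_threshold(baseline_curve, threshold)
    managed = epochs_to_threshold(managed_curve, threshold)
    if base is None or managed is None:
        return None
    return base / managed
```

For a made-up baseline curve that first reaches 90% at epoch 6 and a managed curve that reaches it at epoch 2, `convergence_speedup` returns `3.0`.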

Fig. 6: (Left) Ablation on weight function design. Our results show that removing each class label hurts model performance. (Right) Comparison of human intervention distribution. Each bar represents the ratio of human intervention samples that occurred in that time range over total samples. Higher bars indicate a harder bottleneck region. The sum of all bars represents the overall human intervention ratio.

Furthermore, we conduct an ablation study to examine the contribution of each component in our weighting scheme in Fig. 6 (Left). Our method has a different weight design for interventions, demonstrations, and pre-interventions. We study how removing each class, i.e., treating each class as the robot action class, affects the policy performance. We run each ablation method on the deployment data generated by our method for each round of the Nut Assembly task. As shown in Fig. 6 (Left), removing any class hurts the policy performance, especially in the first round where the dataset contains mostly suboptimal data. In earlier rounds, the robot needs to stay closer to higher-quality demonstration samples, learn corrective behaviors, and avoid erroneous actions. The impact of class removal is less at later rounds when the dataset is abundant with higher quality data.

Lastly, we show that our method is more effective at reducing human workload than IWR. We visualize the distribution of interventions in trajectories in Fig. 6 (Right). Our method requires significantly fewer interventions and is able to overcome harder bottleneck states better. Please see the supplementary video for qualitative results.
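
As a minimal sketch of how such an intervention distribution can be computed, the snippet below bins human-controlled timesteps by time range and normalizes by the total number of samples, so that the bars sum to the overall intervention ratio. The fixed bin width in absolute timesteps and the toy trajectories are assumptions made for illustration.

```python
import numpy as np


def intervention_distribution(trajectories, bin_width):
    """Per-bin ratio of intervention samples over all samples.

    trajectories: list of per-timestep flags (1 = human intervention, 0 = robot).
    bin_width: number of timesteps per time range (an assumed binning choice).
    """
    total = sum(len(t) for t in trajectories)
    horizon = max(len(t) for t in trajectories)
    num_bins = -(-horizon // bin_width)  # ceiling division
    counts = np.zeros(num_bins)
    for traj in trajectories:
        steps = np.flatnonzero(np.asarray(traj, dtype=bool))
        np.add.at(counts, steps // bin_width, 1)  # accumulate repeated bins
    return counts / total


# Toy data: two short rollouts with interventions near the start.
toy = [[0, 0, 1, 1, 0, 0, 0, 0], [0, 1, 1, 1, 0, 0]]
bars = intervention_distribution(toy, bin_width=2)
overall_ratio = bars.sum()  # equals (number of intervention samples) / (all samples)
```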

VI Conclusion

We introduce Sirius, a framework for human-in-the-loop robot manipulation and learning at deployment that both guarantees reliable task execution and also improves autonomous policy performance over time. We utilize the properties and assumptions of human-robot collaboration to develop an intervention-based weighted behavioral cloning method for making effective use of deployment data. We also design a practical system that trains and deploys new models continuously under memory constraints. For future work, we are interested in exploring ways to make the robot respond to shared human control inputs in real time. Another direction for future research is alleviating the human’s cognitive burdens for monitoring and teleoperating the system.


We thank Ajay Mandlekar for having multiple insightful discussions, and for sharing well-designed simulation task environments and codebases during development of the project. We thank Yifeng Zhu for valuable advice and system infrastructure development for real robot experiments. We would like to thank Tian Gao, Jake Grigsby, Zhenyu Jiang, Ajay Mandlekar, Braham Snyder, and Yifeng Zhu for providing helpful feedback for this manuscript. We acknowledge the support of the National Science Foundation (1955523, 2145283), the Office of Naval Research (N00014-22-1-2204), and Amazon.

VII Appendix

VII-A Task Details

We elaborate on the four tasks in this section, providing more details of the task setups, the bottleneck regions, and how they are challenging. The two simulation tasks, Nut Assembly and Tool Hang, are from the robomimic codebase [robomimic2021] for better benchmarking.

Nut Assembly. The robot picks up a square rod from the table and inserts the rod into a column. The bottleneck lies in grasping the square rod with the correct orientation and turning it such that it aims at the column correctly.

Tool Hang. The robot picks up a hook piece, inserts it into a tiny hole, and then hangs a wrench on the hook. As noted in robomimic [robomimic2021], this task requires very precise and dexterous control. There are multiple bottleneck regions: picking up the hook piece with the correct orientation, inserting the hook piece with high precision in both position and orientation, picking out the wrench, and carefully aiming the tiny hole at the hook.

Gear Insertion. We adapt the task scene from the common NIST board benchmark Task Board 1, which is designed for standard industrial tasks like peg insertion and electrical connector insertion. Initially, one blue gear and one red gear are placed in a randomized region on the board. The robot picks up the two gears in sequence and inserts each onto its gear shaft. The gears’ holes are very small, requiring precise insertion onto the gear shafts.

Coffee Pod Packing. We design this task for a food manufacturing setting where the robot packs real coffee pods into a coffee pod holder. The robot first opens the coffee pod holder drawer, grasps a coffee pod placed at a random initial position on the table, places the coffee pod into the pod holder, and closes the drawer. The pod holder contains holes that fit the coffee pods tightly, so the task requires precise insertion of the coffee pods into the holes. The common bottlenecks are grasping the coffee pod accurately, inserting it precisely, and releasing the drawer after the opening and closing actions without getting stuck.

The objects in all tasks are initialized randomly within an x-y position range and with a rotation about the z-axis. The configurations of the simulation tasks follow those in robomimic. We present the reset initialization configuration in Table I for reference.
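
The reset procedure can be sketched as follows. Uniform sampling, the function name `sample_initial_pose`, and the numeric ranges are assumptions for illustration only; they are not the values used in our experiments.

```python
import math
import random


def sample_initial_pose(x_range, y_range, yaw_range, rng):
    """Sample an object pose: x-y position (meters) and rotation about z (radians).

    Uniform sampling within each range is an assumption of this sketch.
    """
    x = rng.uniform(*x_range)
    y = rng.uniform(*y_range)
    yaw = rng.uniform(*yaw_range)
    return x, y, yaw


# Hypothetical ranges, for illustration only.
X_RANGE = (-0.05, 0.05)
Y_RANGE = (0.10, 0.20)
YAW_RANGE = (0.0, math.radians(90.0))

rng = random.Random(0)  # fixed seed so resets are reproducible
poses = [sample_initial_pose(X_RANGE, Y_RANGE, YAW_RANGE, rng) for _ in range(5)]
```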

Tasks and Objects Position (x-y) Orientation (z)
Nut Assembly
square nut cm cm
Tool Hang
hook cm cm
wrench cm cm
Gear Insertion (Real)
blue gear cm cm
red gear cm cm
Coffee Pod Packing (Real)
coffee pod cm cm
TABLE I: Task objects configuration

Initialize memory buffer with fixed size  ▷ continuous data storage
Collect a small amount of human demonstrations  ▷ warmstart phase
Initialize BC policy from the demonstrations
while deployment continues do  ▷ deployment phase
     Collect rollout episodes
     if the memory buffer exceeds its fixed size then
         Reorganize the buffer to the fixed size with memory management strategies
     for each gradient step do  ▷ start next round of policy update
         Sample a batch from the buffer  ▷ each sample carries its class label
         Obtain sample weights from Algorithm 2
         Update the policy on the weighted loss  ▷ perform weighted BC
     Update the deployed policy for the next round of deployment
Algorithm 1 Learning at Deployment
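
The control flow of Algorithm 1 can be sketched on a toy problem. Everything below is a stand-in chosen for illustration: the task is a scalar regression (action = 2 × state), the policy is a single parameter, the BC loss is a weighted squared error rather than a GMM log-likelihood, the buffer is trimmed first-in-first-out, and `class_weights` uses placeholder values instead of Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(0)
BUFFER_SIZE = 200  # hypothetical fixed memory size


def collect_samples(n, classes):
    """Stand-in for demonstrations or rollouts: (state, action, class) tuples."""
    s = rng.normal(size=n)
    a = 2.0 * s + rng.normal(scale=0.1, size=n)
    c = rng.choice(classes, size=n)
    return list(zip(s, a, c))


def class_weights(buffer):
    """Placeholder for Algorithm 2 (values are illustrative, not from Table IV)."""
    return {"demo": 1.0, "robot": 1.0, "preintv": 0.1, "intv": 2.0}


def weighted_bc_update(theta, buffer, steps=100, batch_size=32, lr=0.05):
    """Gradient steps on mean(w * (theta * s - a)^2) over sampled batches."""
    weights = class_weights(buffer)
    for _ in range(steps):
        idx = rng.integers(len(buffer), size=batch_size)
        s = np.array([buffer[i][0] for i in idx])
        a = np.array([buffer[i][1] for i in idx])
        w = np.array([weights[buffer[i][2]] for i in idx])
        grad = np.mean(2.0 * w * (theta * s - a) * s)
        theta -= lr * grad
    return theta


# Warm-start phase: a small number of human demonstrations.
buffer = collect_samples(50, ["demo"])
theta = weighted_bc_update(0.0, buffer)

# Deployment phase: three rounds of rollouts followed by a policy update.
for round_idx in range(3):
    buffer += collect_samples(100, ["robot", "preintv", "intv"])
    if len(buffer) > BUFFER_SIZE:
        buffer = buffer[-BUFFER_SIZE:]  # FIFO trimming, an assumed strategy
    theta = weighted_bc_update(theta, buffer)
```

The sketch only mirrors the structure of the loop (collect, trim, reweight, update); the real system trains an image-based BC-RNN policy.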

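Input: class labels; data storage with class labels
Compute the total sample count  ▷ total number of samples in data storage
for each class do  ▷ number of samples for each class
     Count the samples with that class label  ▷ number of samples with class label
Set ratio for classes:  ▷ set overall percentage of the influence each class has on loss
     demo class: keep its empirical ratio  ▷ does not change demo ratio
     intv class: fixed ratio  ▷ see Table IV
     preintv class: fixed ratio  ▷ see Table IV
     robot class: the remaining ratio  ▷ ratio of all classes sum up to 1
for each class do  ▷ calculate weight for individual sample
Algorithm 2 Intervention-based Weighting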
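
A minimal sketch of the weighting scheme follows. It assumes that the per-sample weight of a class is its target ratio divided by its empirical ratio in the data storage, which makes each class contribute its target share to the loss; the exact formula is not recoverable from the listing above, so this form is an assumption. The ratio values in the example are placeholders, not the values in Table IV.

```python
from collections import Counter


def intervention_weights(labels, intv_ratio, preintv_ratio):
    """Map each class label to a per-sample weight.

    labels: class label of every sample ("demo", "intv", "preintv", "robot").
    intv_ratio, preintv_ratio: target shares of the loss (placeholders here).
    """
    n_total = len(labels)
    counts = Counter(labels)
    demo_ratio = counts["demo"] / n_total  # demo ratio is left unchanged
    robot_ratio = 1.0 - demo_ratio - intv_ratio - preintv_ratio  # ratios sum to 1
    if robot_ratio < 0.0:
        raise ValueError("target ratios exceed 1")
    target = {
        "demo": demo_ratio,
        "intv": intv_ratio,
        "preintv": preintv_ratio,
        "robot": robot_ratio,
    }
    # Assumed weight: target share divided by empirical share of the class.
    return {c: target[c] / (counts[c] / n_total) for c in counts}


labels = ["demo"] * 20 + ["robot"] * 60 + ["preintv"] * 10 + ["intv"] * 10
weights = intervention_weights(labels, intv_ratio=0.5, preintv_ratio=0.0)
# Average weight over all samples; equals 1 when the target ratios sum to 1.
mean_weight = sum(weights[c] for c in labels) / len(labels)
```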

VII-B Human-Robot Teaming

We illustrate the actual human-robot teaming process during human-in-the-loop deployment in Figure 7. The robot executes a task (e.g., gear insertion) by default while a human supervises the execution. In this gear insertion scenario, the expected robot behavior is to pick up the gear and insert it down the gear shaft. When the human detects undesirable robot behavior (e.g., gear getting stuck), the human intervenes by taking over control of the robot. The human directly passes in action commands to perform the desired behavior. When the human judges that the robot can continue the task, the human passes control back to the robot.

To enable effective shared human control of the robot, we seek a teleoperation interface that (1) enables humans to control the robot effectively and intuitively and (2) switches between robot and human control immediately once the human decides to intervene or to pass control back to the robot. To this end, we employ SpaceMouse control. The human operator controls a 6-DoF SpaceMouse and passes the position and orientation of the SpaceMouse as action commands. While monitoring the computer screen, the user can pause the robot by pressing a button, exert control until the robot is back in an acceptable state, and pass control back to the robot by stopping the motion on the SpaceMouse.
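
The control switching described above can be sketched as a simple arbitration rule: the human command is used whenever the device reports motion, and the robot policy's action is used otherwise. The function `select_action`, the deadband threshold, and the class label strings are hypothetical; the sketch does not call any real SpaceMouse driver.

```python
import numpy as np


def select_action(robot_action, human_input, deadband=1e-3):
    """Choose which action to execute at the current timestep.

    Returns the action and the class label recorded for the sample.
    The deadband on the device reading is an assumed threshold.
    """
    human_input = np.asarray(human_input, dtype=float)
    if np.linalg.norm(human_input) > deadband:
        return human_input, "intv"  # the human is moving the device
    return np.asarray(robot_action, dtype=float), "robot"  # device at rest


policy_action = np.array([0.1, 0.0, -0.2, 0.0, 1.0])
idle = np.zeros(5)
correction = np.array([0.0, 0.3, 0.0, 0.0, 1.0])

action_a, label_a = select_action(policy_action, idle)
action_b, label_b = select_action(policy_action, correction)
```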

VII-C Observation and Action Space

The observation space of all our tasks consists of the workspace camera image, the eye-in-hand camera image, and low-dimensional proprioceptive information. For simulation tasks, we use the operational space controller (OSC), which has a 7D action space; for real-world tasks, we use the OSC yaw controller, which has a 5D action space.

Our image observations for the Tool Hang task differ slightly from the robomimic [robomimic2021] default. We use a smaller image size than the default for training efficiency. Because the task needs high-resolution image inputs, we adjust the workspace camera angle to show the objects in more detail. This compensates for the smaller image size and boosts policy performance.

Fig. 7: Human Robot Teaming. Left: The robot executes the task by default while a human supervises the execution. Right: When the human detects undesirable robot behavior, the human intervenes.
Fig. 8: Human Intervention Sample Ratio. We evaluate the human intervention sample ratio for the four tasks. The human intervention sample ratio decreases over deployment round updates. Our methods have a larger reduction in human intervention ratio as compared with IWR.

Details on low-dimensional proprioceptive information: For simulation tasks, we have the end effector position (3D) and orientation (4D), as well as the distance of the gripper (2D). We have joint positions (7D) and gripper width (1D) for real-world tasks.

The action space of simulation tasks is 7 dimensions in total: x-y-z position (3D), yaw-pitch-roll orientation (3D), and the gripper open-close command (1D). The action space of real-world tasks is 5 dimensions in total: x-y-z position (3D), yaw orientation (1D), and the gripper open-close command (1D).
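
As an illustration of the two layouts, the sketch below splits a flat action vector into its components. The ordering of the components inside the vector follows the order in which they are listed above, which is an assumption; the function names are our own.

```python
import numpy as np


def split_sim_action(action):
    """Split a 7D simulation action: position (3), orientation (3), gripper (1)."""
    action = np.asarray(action, dtype=float)
    if action.shape != (7,):
        raise ValueError("simulation actions must be 7-dimensional")
    return {"position": action[:3], "orientation": action[3:6], "gripper": action[6]}


def split_real_action(action):
    """Split a 5D real-world action: position (3), yaw (1), gripper (1)."""
    action = np.asarray(action, dtype=float)
    if action.shape != (5,):
        raise ValueError("real-world actions must be 5-dimensional")
    return {"position": action[:3], "yaw": action[3], "gripper": action[4]}


sim = split_sim_action([0.1, 0.0, -0.1, 0.0, 0.0, 0.2, 1.0])
real = split_real_action([0.1, 0.0, -0.1, 0.2, -1.0])
```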

VII-D Method Implementations

We present our learning-at-deployment pipeline in Algorithm 1 and the intervention-based weighting scheme in Algorithm 2.

We describe the policy architecture details initially introduced in Section IV-D. Our codebase is based on robomimic [robomimic2021], a recent open-source project that benchmarks a range of learning algorithms on offline data. We standardize all methods with the same state-of-the-art policy architectures and hyperparameters from robomimic. The architectural design includes ResNet-18 image encoders, random cropping for image augmentation, a GMM head, and the same training procedures. The list of hyperparameter choices is presented in Table II. For all BC-related methods, including Ours, IWR, and BC-RNN, we use the same BC-RNN architecture specified in Table III.

For all tasks except Tool Hang, we use the same hyperparameters and the same image size; Tool Hang uses a larger image size due to its need for high-precision details. We use a few demonstrations for each task to warm-start the policy; the number varies by task so that every initial policy shows some level of reasonable behavior regardless of task difficulty. See Table VI for all task-dependent hyperparameters.

For IQL [Kostrikov2021OfflineRL], we reimplemented the method in our robomimic-based codebase to keep the policy backbone and common architecture the same across all methods. Our implementation is based on the publicly available PyTorch implementation of IQL.


We follow the paper’s original design with some slight modifications. In particular, the original IQL uses a sparse reward setting where the reward is based on task success. We add a denser reward for IQL to incorporate information on human intervention. To mimic the intervention-guided weights for IQL, we assign distinct rewards to task success, intervention states, pre-intervention states, and all other states. We found that this version of IQL outperforms the default sparse reward setting. We list the hyperparameters for the IQL baseline in Table V.
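
The denser reward can be sketched as a lookup on the sample's class label. The numeric values in the example configuration are placeholders chosen for illustration and are not the rewards used in our experiments; giving task success precedence over the class label is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Reward assigned to each kind of state (all values are user-supplied)."""
    success: float
    intervention: float
    pre_intervention: float
    other: float


def shaped_reward(label: str, task_success: bool, cfg: RewardConfig) -> float:
    """Return the reward for one transition given its class label."""
    if task_success:  # assumed to take precedence over the class label
        return cfg.success
    if label == "intv":
        return cfg.intervention
    if label == "preintv":
        return cfg.pre_intervention
    return cfg.other


# Placeholder values, for illustration only.
example_cfg = RewardConfig(success=1.0, intervention=0.5, pre_intervention=-1.0, other=0.0)
rewards = [
    shaped_reward(label, done, example_cfg)
    for label, done in [("robot", False), ("preintv", False), ("intv", False), ("robot", True)]
]
```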

VII-E HITL System Policy Updates

We elaborate on our design choice for HITL system policy update rules discussed in Section V-B.

In a practical human-in-the-loop deployment system, there are many possible design choices for the condition and frequency of policy updates. A few straightforward ones are: update after a specific amount of elapsed time, update after the robot completes a certain number of tasks, or update after human interventions reach a certain number. Our experiments aim to provide a fair comparison between various human-in-the-loop methods and to benchmark our method against prior baselines. For consistent evaluation, we follow the round-update rules of prior work [mandlekar2020humanintheloop, kelly2019hg]: 3 rounds of updates, each triggered when the number of intervention samples reaches a fixed proportion of the human demonstration samples. The motivation is to evaluate prior baselines in their original setting to ensure a fair comparison; moreover, we want to ensure that all methods get the same amount of human samples per round. Since they are human-in-the-loop methods, the amount of human samples is important to their utilization. How policies are updated could be a dimension of human-in-the-loop system design in its own right and could be further explored in future work.
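
The round-update rule can be sketched as a threshold check on accumulated intervention samples. The proportion, the demonstration count, and the per-episode intervention counts below are hypothetical, and resetting the counter after each update is an assumption of this sketch.

```python
def should_update(num_intervention_samples, num_demo_samples, fraction):
    """True once intervention samples reach the given proportion of demo samples."""
    return num_intervention_samples >= fraction * num_demo_samples


def count_rounds(interventions_per_episode, num_demo_samples, fraction, max_rounds=3):
    """Count how many update rounds a stream of episodes would trigger."""
    rounds = 0
    pending = 0  # intervention samples gathered since the last update
    for n in interventions_per_episode:
        pending += n
        if should_update(pending, num_demo_samples, fraction):
            rounds += 1
            pending = 0  # assumed: the counter restarts each round
            if rounds == max_rounds:
                break
    return rounds


# Hypothetical stream of intervention sample counts per episode.
episodes = [40, 30, 50, 20, 60, 10, 90]
rounds_triggered = count_rounds(episodes, num_demo_samples=200, fraction=0.5)
```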

VII-F Evaluation Procedure

Evaluation is an essential step in the robot learning experiment pipeline. We make sure to design fair evaluation procedures to provide an accurate estimate of the true policy performance of each method as best as possible. Here we elaborate on our evaluation procedures and the motivations behind them.

Simulation experiments: We evaluate the success rate across 3 seeds and report the average and standard deviation across all seeds. For each seed, we run the following evaluation procedure: we evaluate one checkpoint every 50 epochs over the full range of trained epochs, running 100 trials per checkpoint. We record the top 3 success rates and average the three. We do so to reduce the effect of outliers in the evaluation.
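
A small sketch of this aggregation is given below. The per-checkpoint success rates are hypothetical numbers, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np


def seed_score(checkpoint_success_rates, top_k=3):
    """Average of the top-k checkpoint success rates for one seed."""
    rates = np.sort(np.asarray(checkpoint_success_rates, dtype=float))[::-1]
    return rates[:top_k].mean()


def aggregate(per_seed_rates, top_k=3):
    """Mean and standard deviation of the per-seed scores."""
    scores = np.array([seed_score(r, top_k) for r in per_seed_rates])
    return scores.mean(), scores.std()


# Hypothetical success rates, one entry per evaluated checkpoint.
seeds = [
    [0.50, 0.70, 0.80, 0.90],
    [0.60, 0.80, 0.85, 0.75],
    [0.40, 0.90, 0.95, 0.85],
]
mean_rate, std_rate = aggregate(seeds)
```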

Real-world experiments: Due to the high time cost for real robot evaluation, we evaluate for 1 seed for each method. Since real robot evaluations are subject to noise and variation across checkpoints, we run the following evaluation procedure to ensure that results are as fair as possible. For each method, we perform an initial evaluation of different checkpoints (5 checkpoints in practice), evaluating each for a small number of trials (5 trials in practice). For the checkpoint that gives the best quantitative and qualitative behavior, we perform 32 trials and report the success rate over them. To give a more straightforward comparison, we present the absolute number of successes over 32 trials in Table VII.

VII-G Human Workload Reduction

We present more results on the effectiveness of our method in reducing human workload, as discussed in Section V-C. To show how human workload decreases over the policy deployment round updates, we plot the human intervention sample ratio for every round, i.e., the percentage of intervention samples among all samples in that round. We compare the results for the HITL methods, Ours and IWR, in Figure 8. The human intervention ratio decreases for all four tasks as policy performance increases over time. Furthermore, our method reduces the human intervention ratio significantly more than IWR.

Moreover, we note that there are different metrics to evaluate human workload, such as the number of control switches and lengths of interventions, as introduced in prior work [Hoque2021ThriftyDAggerBN]. We include two additional human workload metrics:

Fig. 9: More Intervention Behavior Metrics (Nut Assembly). We present two more metrics to measure human workload over time: average intervention frequency (Left) and average intervention length (Right). We show that our method results in a larger reduction of both metrics over round updates, developing better human trust and human-robot partnership.

Average intervention frequency: the number of intervention occurrences divided by the number of rollouts. This reflects the number of context switches, i.e., shifts of control between the human and the robot. A higher number of context switches demands more concentration from the human and is more exhausting.

Average intervention length: the length of each intervention in terms of the number of timesteps. This reflects the ease of every intervention: a longer intervention means a higher mental workload for the human taking control of the robot.
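
Both metrics, together with the intervention sample ratio, can be computed from per-timestep intervention flags. The sketch below uses toy rollouts invented for illustration; the function names are our own.

```python
def intervention_segments(flags):
    """Lengths of maximal runs of consecutive human-controlled timesteps."""
    lengths, run = [], 0
    for flag in flags:
        if flag:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def workload_metrics(rollouts):
    """Intervention ratio, average frequency, and average length over rollouts."""
    lengths = [n for r in rollouts for n in intervention_segments(r)]
    total_steps = sum(len(r) for r in rollouts)
    return {
        "intervention_ratio": sum(lengths) / total_steps,
        "avg_frequency": len(lengths) / len(rollouts),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Toy rollouts: 1 marks a human-controlled timestep, 0 a robot-controlled one.
toy_rollouts = [[0, 1, 1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 0, 0, 0]]
metrics = workload_metrics(toy_rollouts)
```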

We note that these metrics also reflect the human's level of trust in the robot. The human makes a decision during robot control: should I intervene at this point? Furthermore, during human control: is the robot in a state where I can safely return control to the robot? Lower intervention frequency and shorter intervention length reflect that the human trusts the robot more, intervening at fewer places and returning control to the robot faster.

Fig. 10: Human Intervention Distribution. The color bar represents the time duration over 10 trials and whether a timestep is autonomous robot action (yellow) or human intervention (green). In Round 1, much human intervention is needed to overcome the bottleneck states. In Round 3, the policy needs very little human intervention, and the robot can run autonomously most of the time.

We present the results in Figure 9 using Nut Assembly as an example. Like the human intervention ratio, the average intervention frequency and the average intervention length decrease over rounds. Our method also reduces both metrics faster over round updates. This shows that our human-in-the-loop system fosters good human trust in the robot and develops better human-robot partnerships.

Lastly, we visualize how the dynamics of the human-robot partnership evolve qualitatively in Figure 10. For the Gear Insertion task, we run 10 trials of task execution in sequence for our method in Round 1 and in Round 3, and record the time duration of the human intervention needed during deployment. Compared with Round 1, the policy in Round 3 needs very little human intervention. The intervention duration is much shorter too, validating the effectiveness of our system in human workload reduction.

Hyperparameter Value
GMM number of modes
Image encoder ResNet-18
Random crop ratio % of image height
Optimizer Adam
Batch size
# Training steps per epoch
# Total training epochs
Evaluation checkpoint interval (in epoch)
TABLE II: Common hyperparameters
Hyperparameter Value
RNN hidden dim
RNN sequence length
# of LSTM layers
Learning rate
TABLE III: BC backbone hyperparameters
Hyperparameter Value
ratio for intv class,
ratio for preintv class,
TABLE IV: Intervention-based weighting scheme
Hyperparameter Value
Reward scale
Termination false
Discount factor
Adv filter exponential

V function quantile

Actor lr
Actor lr decay factor
Actor mlp layers
Critic lr
Critic lr decay factor
Critic mlp layers
TABLE V: IQL hyperparameters
Hyperparameter Nut Assembly ToolHang Gear Insertion (Real) Coffee Pod Packing (Real)
Image size ()
Initial # of human demonstrations
Evaluation rollout length
TABLE VI: Task hyperparameters
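| Task | Method | Round 0 (BC base policy) | Round 1 | Round 2 | Round 3 |
|---|---|---|---|---|---|
| Gear Insertion | Ours | 14 / 32 | 22 / 32 | 25 / 32 | 29 / 32 |
| Gear Insertion | IWR | 14 / 32 | 18 / 32 | 20 / 32 | 23 / 32 |
| Gear Insertion | BC-RNN | - | - | - | 11 / 32 |
| Gear Insertion | IQL | - | - | - | 12 / 32 |
| Coffee Pod Packing | Ours | 3 / 32 | 20 / 32 | 23 / 32 | 24 / 32 |
| Coffee Pod Packing | IWR | 3 / 32 | 15 / 32 | 19 / 32 | 21 / 32 |
| Coffee Pod Packing | BC-RNN | - | - | - | 14 / 32 |
| Coffee Pod Packing | IQL | - | - | - | 13 / 32 |

TABLE VII: Quantitative results for real robots in absolute success numbers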
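
For convenience, the final-round success counts in Table VII can be converted to success rates as follows. The counts are copied from the table; the dictionary layout and variable names are our own.

```python
TRIALS = 32  # number of evaluation trials per method

# Final-round success counts from Table VII.
final_round_successes = {
    "Gear Insertion": {"Ours": 29, "IWR": 23, "BC-RNN": 11, "IQL": 12},
    "Coffee Pod Packing": {"Ours": 24, "IWR": 21, "BC-RNN": 14, "IQL": 13},
}

success_rates = {
    task: {method: count / TRIALS for method, count in methods.items()}
    for task, methods in final_round_successes.items()
}

# Absolute gap between Ours and IWR in the final round, per task.
gap_over_iwr = {
    task: rates["Ours"] - rates["IWR"] for task, rates in success_rates.items()
}
```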