I Introduction
Recent years have witnessed great strides in deep learning techniques for robotics. In contrast to the traditional form of robot automation, which heavily relies on human engineering, these data-driven approaches show great promise in building robot autonomy that is difficult to design manually. While learning-powered robotics systems have achieved impressive demonstrations in research settings
[andrychowicz2020learning, kalashnikov2018qt, lee2020learning], the state-of-the-art robot learning algorithms still fall short of the generalization and robustness needed for widespread deployment in real-world tasks. The dichotomy between rapid research progress and the absence of real-world application stems from the lack of performance guarantees in today's learning systems, especially those built on black-box neural networks. For potential practitioners, it remains opaque how often these systems fail, under what circumstances the failures occur, and how the systems can be continually improved to address them.

To harness the power of modern robot learning algorithms while embracing their imperfections, a burgeoning body of research has investigated new mechanisms to enable effective human-robot collaborations.
Specifically, shared autonomy methods [Javdani2015SharedAV, Reddy2018SharedAV] aim at combining human input and semi-autonomous robot control to achieve a common task goal. These methods typically use a pre-built robot controller rather than seeking to improve robot autonomy over time. Meanwhile, recent advances in interactive imitation learning allow a human to provide corrective interventions during policy execution and use this feedback to improve the policy over time.
This work aims at developing a human-in-the-loop learning framework for human-robot collaboration and continual policy learning in deployed environments. We expect our framework to satisfy two key requirements: 1) it ensures task execution to be consistently successful through human-robot teaming, and 2) it allows the learning models to improve continually, such that human workload is reduced as the level of robot autonomy increases. To build such a framework, we need to move from the train-then-deploy paradigm common in today’s robot learning literature to continuous model updates during deployment. This idea of robot learning on the job resembles the Continuous Integration, Continuous Deployment (CI/CD) principles in software engineering [CICD]. Nonetheless, realizing this idea for learning-based manipulation invites fundamental challenges.
The foremost challenge is developing the infrastructure for human-robot collaborative manipulation. We develop a system that allows a human operator to monitor and intervene in the robot's policy execution (see Fig. 1). The human can take over control when necessary and handle challenging situations to ensure safe and reliable task execution. Meanwhile, human interventions implicitly reveal the task structure and the level of human trust in the robot. As recent work [kelly2019hg, mandlekar2020humanintheloop, Hoque2021ThriftyDAggerBN] indicates, human interventions inform when the human lacks trust in the robot, where the risk-sensitive task states are, and how to traverse these states. We can thus take advantage of the occurrences of human interventions during deployment as informative signals for policy learning.
The subsequent challenge is updating policies on an ever-growing dataset of shifting distributions. As our framework runs over time, the policy adapts its behaviors through learning, and the human adjusts their intervention patterns accordingly. Deployment data from human-robot teams can be multimodal and suboptimal. Learning from such deployment data requires us to use it selectively for policy updates. We want the robot to learn from good behaviors to reinforce them, and also to learn to recover from mistakes and handle novel situations. At the same time, we want to prevent the robot from copying bad actions that would lead to failure. Our key insight is that we can assess the importance of different training samples for policy learning based on human interventions.
To this end, we develop a simple yet effective learning algorithm that uses human interventions to re-weight training data. We consider the robot rollouts right before an intervention as “low-quality” (as the human believes the robot is about to fail) and both human demonstrations and interventions as “high-quality” for policy training. We label training samples with different weights and train policies on these samples using weighted behavioral cloning, a technique at the core of state-of-the-art algorithms for imitation learning [sasaki2021behavioral, DBLP:journals/corr/abs-2011-13885, pmlr-v162-xu22l] and offline reinforcement learning [wang2020critic, nair2021awac, Kostrikov2021OfflineRL]. This supervised learning algorithm lends itself to efficient and stable policy optimization on our large-scale and growing dataset.
Furthermore, deploying our system in long-term missions raises two practical considerations: 1) storing all past experiences over a long duration incurs a heavy memory burden, and 2) a large number of similar experiences may drown out the small subset of truly valuable data for policy training. We thus examine different memory management strategies, aiming at adaptively adding and removing data samples from a memory storage of fixed size. Our results show that even with 15% of the full memory size, we retain the same level of performance as, or achieve even better performance than, keeping all data, while enabling 3 times faster convergence for rapid model updates between consecutive rounds.
We name our framework Sirius, after the binary star system, symbolizing the partnership of our human-robot team. We evaluate Sirius on two simulated and two real-world tasks requiring contact-rich manipulation with precise motor skills. Compared to the state-of-the-art methods of learning from offline data [nair2021awac, Kostrikov2021OfflineRL, robomimic2021] and interactive imitation learning [mandlekar2020humanintheloop], Sirius achieves higher policy performance and reduced human workload, with an 8% boost in policy performance in simulation and a 27% boost on real hardware over the state-of-the-art methods.
II Related Work
Human-in-the-loop Learning: In a human-in-the-loop learning setting, an agent utilizes interactive human feedback signals to improve its performance [zhang2019leveraging, Cruz2020ASO, Cui2021UnderstandingTR]. Human feedback can serve as a rich source of supervision, as humans often have a priori domain information and can interactively guide the agent with respect to its learning progress. Many forms of human feedback exist, such as interventions [kelly2019hg, spencer2020wil, mandlekar2020humanintheloop], preferences [christiano2017preferences, Biyik2022LearningRF, lee2021pebble, Wang2021SkillPL], rankings [Brown2019ExtrapolatingBS], scalar-valued feedback [macglashan2017interactive, warnell2018deep], and human gaze [ijcai2020-689]. These forms of feedback can be integrated into the learning loop through techniques such as policy shaping [knox2009tamer, NIPS2013_e034fb6b] and reward modeling [daniel2014active, leike2018scalable].
Within the context of robot manipulation, one approach is to incorporate human interventions into imitation learning algorithms [kelly2019hg, spencer2020wil, mandlekar2020humanintheloop]. Another approach is to employ deep reinforcement learning algorithms with learned rewards, either from preferences [lee2021pebble, Wang2021SkillPL] or reward sketching [cabi2019scaling]. While these methods have demonstrated higher performance compared to those without humans in the loop, they require a large amount of human supervision and do not feed the human control feedback gathered during deployment back into the learning loop to improve model performance. In contrast, we specifically target these deployment scenarios, which are critical to real-world robotic systems.
Shared Autonomy: Human-robot collaborative control is often necessary for real-world tasks when we do not have full robot autonomy and full human teleoperation is burdensome. In shared autonomy [dragan2013policy, Javdani2015SharedAV, gopinath2017human, Reddy2018SharedAV], the control of a system is shared by the human and the robot to accomplish a common goal [Tan2021InterventionAS]. The existing literature on shared autonomy focuses on efficient collaborative control through human intent prediction [dragan2008formalizing, Muelling2015AutonomyIT, 7140066]. However, these methods do not attempt to learn from human intervention feedback, so there is no policy improvement. We examine a context similar to that of shared autonomy, where the human is involved during the actual deployment of the robot system; however, we also put human control into the feedback loop and use it to improve the policy itself.
Learning from Offline Data: An alternative to the human-in-the-loop paradigm is to learn from fixed robot datasets via imitation learning [pomerleau1989alvinn, zhang2017deep, mandlekar2020gti, florence2021implicit] or offline reinforcement learning (offline RL) [levine2020offline, fujimoto2019bcq, Kumar2020ConservativeQF, kidambi2020morel, yu2020mopo, yu2021combo, mandlekar2020iris, Kostrikov2021OfflineRL]. Offline RL algorithms in particular have demonstrated promise when trained on large, diverse datasets with suboptimal behaviors [singh2020cog, kumar2022when, ajay2020opal]. Among the different methods, advantage-weighted regression [wang2020critic, nair2021awac, Kostrikov2021OfflineRL] has recently emerged as a popular approach to offline RL. These methods use a weighted behavioral cloning objective to learn the policy, using learned advantage estimates as the weights. In this work, we also use weighted behavioral cloning; however, we explicitly leverage human intervention signals from our online human-in-the-loop setting to obtain the weights, rather than using task rewards to learn advantage-based weights. We show that this leads to superior empirical performance on our manipulation tasks.
III Overview
III-A Problem Formulation
We formulate a robot manipulation task as a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, R, P, \rho_0, \gamma)$, representing the state space, action space, reward function, transition probability, initial state distribution, and discount factor. In this work, we adopt an intervention-based learning framework in which the human can choose to intervene and take control of the robot. Given the current state $s_t$, the robot action is drawn from the policy $a_t^R \sim \pi^R(\cdot \mid s_t)$, and the human can override this action with a human action $a_t^H$. The policy for the human-robot team can thus be formulated as
$$\pi^{\text{team}}(a \mid s) = g(s)\,\pi^{H}(a \mid s) + \big(1 - g(s)\big)\,\pi^{R}(a \mid s),$$
where $g(s) \in \{0, 1\}$ is a binary indicator function of human interventions and $\pi^H$ is the implicit human policy. Our learning objective is two-fold: 1) we want to improve the level of robot autonomy by finding the autonomous policy $\pi^R$ that maximizes the cumulative rewards $\mathbb{E}\big[\sum_t \gamma^t R(s_t, a_t)\big]$, and 2) we want to minimize the human's workload in the system, i.e., the expectation of interventions $\mathbb{E}_{s \sim d^{\pi^{\text{team}}}}[g(s)]$ under the state distribution induced by the team policy $\pi^{\text{team}}$.
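To make the formulation concrete, the following minimal Python sketch shows one step of the human-gated team policy; `robot_policy`, `human_policy`, and `human_wants_control` are hypothetical placeholders for the autonomous policy, the human teleoperator, and the intervention indicator $g(s)$, not our actual implementation.

```python
def team_policy_step(state, robot_policy, human_policy, human_wants_control):
    """One step of the human-robot team policy pi_team.

    g(s) = 1 when the human chooses to intervene, 0 otherwise.
    All three callables are illustrative placeholders.
    """
    g = human_wants_control(state)      # binary intervention indicator g(s)
    if g:
        action = human_policy(state)    # human override a_t^H
    else:
        action = robot_policy(state)    # autonomous robot action a_t^R
    return action, g                    # g also serves as the data class label
```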

III-B Human-in-the-Loop Learning Framework
Next, we define the human-in-the-loop deployment setting and give an overview of the system design of Sirius as depicted in Fig. 2. Our human-in-the-loop system consists of two processes that run simultaneously: Robot Deployment and Policy Update. We maintain a memory buffer that grows in size until it reaches the memory limit.
Initially, the system has a warm-up phase, where we bootstrap a robot policy trained on a small number of human demonstrations. These demonstrations comprise a set of trajectories $D_0$, where each trajectory consists of the states, actions, task rewards, and a data class label $c_t = \textit{demo}$ indicating that these transitions are human demonstrations.
Upon training the initial policy $\pi_0$, we deploy the robot to perform the task and, in the process, collect a set of trajectories to improve the policy. A human operator who continuously monitors the robot's execution will intervene based on whether the robot has performed or is about to perform suboptimal behaviors. Note that we adopt human-gated control [kelly2019hg] rather than robot-gated control [Hoque2021ThriftyDAggerBN] to guarantee task execution success and trustworthiness of the system for real-world deployment. Through this process, we obtain a new dataset of trajectories $D_1$, where the class label of each transition indicates either a robot action ($c_t = \textit{robot}$) or a human intervention ($c_t = \textit{intv}$). We append this data to the existing memory buffer collected so far, $D \leftarrow D \cup D_1$, and train a new policy $\pi_1$ on this dataset.
In subsequent rounds, we deploy the robot to collect new data while simultaneously updating the policy. We define a “Round” as the interval for policy update and deployment: it consists of the completion of training for one policy and, at the same time, the collection of one set of deployment data. In Round $n$, we train policy $\pi_n$ using all previous data. Meanwhile, the robot is continuously deployed using the current best policy $\pi_{n-1}$, gathering deployment data $D_n$. At the end of Round $n$, we append this data to the existing memory buffer collected so far, $D \leftarrow D \cup D_n$, and begin training the next policy $\pi_{n+1}$ on this aggregated dataset.
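The round-based procedure can be summarized by the schematic sketch below; `train_policy` and `deploy_and_collect` are placeholder functions standing in for our training and human-in-the-loop deployment routines, and the loop is shown sequentially for clarity even though, in the actual system, training and deployment overlap within a round.

```python
def run_rounds(human_demos, num_rounds, train_policy, deploy_and_collect):
    """Schematic round-based loop: warm-start from demos, then alternate
    human-in-the-loop deployment and policy updates on aggregated data."""
    memory = list(human_demos)          # D_0: warm-up demonstrations (class 'demo')
    policy = train_policy(memory)       # bootstrap policy pi_0
    for n in range(1, num_rounds + 1):
        # Deploy the current best policy; every transition is labeled
        # 'robot' or 'intv' depending on who produced the action.
        new_data = deploy_and_collect(policy)
        memory.extend(new_data)         # D <- D U D_n
        policy = train_policy(memory)   # next policy trained on all data so far
    return policy
```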
III-C Properties of Deployment Data
Our framework aggregates data from deployment environments over its long-term deployments. This leads to a dataset that is constantly growing in size and consists of mixed data from the human and the robot. Such a dataset distinguishes itself from conventional offline datasets and presents a unique set of challenges:
- Mixed Data Distributions. Data samples in the dataset consist of a mix of robot actions, human interventions, and human demonstrations. The data distributions can be suboptimal and highly multi-modal, as the robot and the human may resort to different strategies to solve the same task or take different actions in similar states.
- Subjective Human Behavior. Extensive studies have shown that human data can vary greatly in expertise, quality, and behavior, affecting policy learning performance [Biyik2022LearningRF]. In our framework, the human's trust and subjectivity are manifested in the timing, criteria, and duration of interventions.
- Growing Data Size. The system, running in long-term deployments, can produce a vast amount of data, creating computational burdens for learning algorithms.
IV Method

To improve autonomy and reduce human costs in human-robot collaboration, we develop a policy learning method that harnesses the deployment data to update the models in consecutive rounds. Devising a suitable learning algorithm is challenging due to data distribution shifts. As discussed in Section III-C, the dataset is multi-modal, of mixed quality, and ever-growing. To utilize such data, our key insight is that human interventions provide informative signals about task structure and human trust, which we use to guide the design of our algorithm.
The core idea of our approach is to use human interventions to re-weight training samples based on an approximate quality score. With these weighted samples, we train the policy with weighted behavioral cloning on the mixed-quality data. Our intuition is that, within a mixed-quality dataset, different weight assignments can differentiate high-quality samples from low-quality ones, such that the algorithm prioritizes high-quality samples for learning. This strategy has been shown effective in the recent literature on offline reinforcement learning [nair2021awac, wang2020critic, Kumar2020ConservativeQF]. In the following, we review weighted behavioral cloning and then describe how we derive sample weights from human interventions.
IV-A Weighted Behavioral Cloning
In vanilla behavioral cloning (BC), we train a model to replicate the actions for each state in the dataset. The objective is to learn a policy $\pi_\theta$ parameterized by $\theta$ that maximizes the log likelihood of actions $a$ at states $s$:

$$\arg\max_\theta \; \mathbb{E}_{(s, a) \sim D} \big[ \log \pi_\theta(a \mid s) \big] \quad (1)$$

where $(s, a)$ are samples from the dataset $D$. For weighted behavioral cloning, the log likelihood term of each $(s, a)$ pair is scaled by a weight function $w(s, a)$, which assigns different importance scores to different samples:

$$\arg\max_\theta \; \mathbb{E}_{(s, a) \sim D} \big[ w(s, a) \log \pi_\theta(a \mid s) \big] \quad (2)$$
Although this algorithm appears conceptually simple, it serves as the foundation of several state-of-the-art methods for offline reinforcement learning (RL) [nair2021awac, Kostrikov2021OfflineRL, wang2020critic]. In particular, advantage-based offline RL algorithms learn advantage estimates $A(s, a)$ and calculate weights as $w(s, a) = f(A(s, a))$, where $f$ is a non-negative scalar function. The intuition behind the advantage-weighted methods is that they can filter out bad samples with low advantage and train on samples with high advantage. However, effectively learning advantage estimates may be challenging if the dataset does not cover a sufficiently wide distribution of states and actions.
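As a reference point, the following is a minimal PyTorch-style sketch of the weighted BC objective in Eq. (2); it assumes the policy returns a `torch.distributions` object over actions and is an illustration rather than our exact training code.

```python
import torch

def weighted_bc_loss(policy, states, actions, weights):
    """Weighted behavioral cloning loss (negative of Eq. 2).

    `policy(states)` is assumed to return a torch.distributions.Distribution
    (e.g., a GMM) over actions; `weights` holds w(s, a) for each sample.
    """
    dist = policy(states)                 # action distribution pi_theta(.|s)
    log_prob = dist.log_prob(actions)     # log pi_theta(a|s), one value per sample
    return -(weights * log_prob).mean()   # minimize the weighted negative log likelihood
```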
IV-B Intervention-based Weighting Scheme
Rather than relying on advantage-based weights, we explicitly leverage the human intervention signals in the dataset to construct our weighting scheme (see Fig. 3 for an overview). Recall that each sample in our dataset carries a data class label indicating whether it is a human demonstration action, a robot action, or a human intervention action. We assign the weight of each sample according to its class $c$, that is,
$$w(s, a) = \frac{\alpha_c}{N_c} \quad \text{for a sample of class } c, \quad (3)$$

where $N_c$ refers to the total number of samples in the dataset corresponding to data class $c$, and $\alpha_c$ is a hyperparameter that represents the contribution of class $c$ to the weighted BC objective. By dividing the weight by the number of class samples $N_c$, we ensure that the overall contribution of each class to the weighted BC objective is $\alpha_c$, regardless of the makeup of the dataset. This is an important feature of our method, allowing us to operate over varying datasets with the same set of fixed hyperparameters.
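For concreteness, a minimal Python sketch of this class-balanced weight assignment follows; the per-class hyperparameter values in the usage comment are placeholders for illustration, not the values used in our experiments.

```python
from collections import Counter

def assign_sample_weights(class_labels, class_contribution):
    """Compute w = alpha_c / N_c for each sample (Eq. 3).

    class_labels: per-sample class labels, e.g. 'demo', 'robot', 'intv'.
    class_contribution: dict mapping class label -> alpha_c (hyperparameter).
    Dividing by the class count N_c keeps each class's total contribution
    equal to alpha_c regardless of the dataset makeup.
    """
    counts = Counter(class_labels)                    # N_c per class
    return [class_contribution[c] / counts[c] for c in class_labels]

# Example usage with placeholder hyperparameters (not the paper's values):
# weights = assign_sample_weights(labels, {'demo': 1.0, 'robot': 1.0, 'intv': 2.0})
```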
The weighting scheme described so far resembles similar weighting schemes proposed in prior work [mandlekar2020humanintheloop]. While promising, it has a critical limitation: it treats all of the robot samples with the same weight regardless of their quality. While the robot often performs reasonable behaviors, it can make mistakes that subsequently demand interventions. In practice, humans exhibit a response delay in performing corrective interventions, as they must first detect mistakes and react accordingly. During this delay period, the robot's actions can be highly undesirable, which can hurt subsequent policy training. Therefore, we should down-weight the influence of these samples. To address this limitation, we introduce a simple heuristic that relabels robot samples within a window of timesteps preceding an intervention as a new pre-intervention class, and we use a low weight to reduce the contribution of these samples.

IV-C Memory Management
As deployment continues and the dataset grows, the large dataset slows down training convergence and takes up excessive memory space. We hypothesize that forgetting (routinely discarding samples from memory) helps prioritize important and useful experiences for learning, speeding up convergence and even further improving the policy. Therefore, we examine strategies for managing the memory buffer.
We assume a fixed-size memory buffer. Once the buffer becomes full, we replace existing samples with new samples. Since we know that human data is useful, we always retain demonstration and intervention samples and instead reject samples belonging to the robot class. We consider five strategies for managing the memory buffer of deployment data: 1) FIFO (First-In-First-Out): reject samples in the order they were added to the buffer; 2) FILO (First-In-Last-Out): reject the most recently added samples first; 3) LFI (Least-Frequently-Intervened): first reject samples from trajectories with the fewest interventions; 4) MFI (Most-Frequently-Intervened): first reject samples from trajectories with the most interventions; and 5) Uniform: reject samples uniformly at random. Each strategy operates on a different set of assumptions and offers unique advantages. For example, the advantage of MFI is that the retained trajectories with fewer interventions contain high-quality samples, whereas LFI retains the trajectories with the most interventions, which reveal rare corner cases and preserve the diversity of the dataset.
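The following minimal sketch illustrates the five rejection strategies on a buffer of labeled samples; the sample format (dictionaries with a class label, insertion order, and per-trajectory intervention count) is an assumption for illustration, not our exact buffer implementation.

```python
import random

def reject_robot_samples(buffer, num_to_remove, strategy, rng=random):
    """Remove `num_to_remove` robot-class samples from a full buffer.

    Each sample is assumed to be a dict with keys:
      'cls'        -- 'demo', 'intv', or 'robot' (demo/intv are always kept)
      'step'       -- global insertion order
      'traj_intvs' -- number of interventions in the sample's trajectory
    """
    robot = [i for i, s in enumerate(buffer) if s['cls'] == 'robot']
    if strategy == 'FIFO':      # reject oldest robot samples first
        order = sorted(robot, key=lambda i: buffer[i]['step'])
    elif strategy == 'FILO':    # reject most recently added robot samples first
        order = sorted(robot, key=lambda i: -buffer[i]['step'])
    elif strategy == 'LFI':     # reject samples from least-intervened trajectories first
        order = sorted(robot, key=lambda i: buffer[i]['traj_intvs'])
    elif strategy == 'MFI':     # reject samples from most-intervened trajectories first
        order = sorted(robot, key=lambda i: -buffer[i]['traj_intvs'])
    else:                       # 'Uniform': reject robot samples at random
        order = robot[:]
        rng.shuffle(order)
    drop = set(order[:num_to_remove])
    return [s for i, s in enumerate(buffer) if i not in drop]
```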
IV-D Implementation Details
We adopt BC-RNN [robomimic2021], the state-of-the-art behavioral cloning algorithm, as our model backbone. We use ResNet-18 encoders [DBLP:journals/corr/HeZRS15] to encode third-person and eye-in-hand images [robomimic2021, mandlekar2020gti]. We concatenate the image features with the robot proprioceptive state as input to the policy. The network outputs a Gaussian Mixture Model (GMM) distribution over actions. See Appendix VII-D for full details.

For the sample weights, we find that good weighting schemes favor weights that are higher for the intervention class, lower for the robot class, and near-zero for the pre-intervention class. Intuitively, human interventions serve as a crucial supervisory signal that guides the agent to recover from suboptimal behaviors, while pre-interventions denote suboptimal behaviors. We set the intervention weight high, the robot weight lower, and the pre-intervention weight near zero, and we set the demonstration weight so that it maintains the true ratio of demonstration samples in the dataset. This keeps the policy close to the expert data distribution, especially during the initial rounds of updates when the robot generates lower-quality data. This is in contrast to Mandlekar et al. [mandlekar2020humanintheloop], which treats all non-intervention samples as a single class, thus lowering the contribution of demonstrations over time.
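As a concrete illustration of the pre-intervention relabeling described in Section IV-B, the following minimal sketch relabels robot samples in a window before each intervention; the window size is a hyperparameter whose value we leave unspecified here, and the class names are those used throughout the paper.

```python
def relabel_pre_intervention(class_labels, window):
    """Relabel robot samples within `window` timesteps before each human
    intervention as the 'pre_intv' class (Sec. IV-B).

    class_labels: per-timestep labels of a single trajectory, in time order.
    Returns a new list where qualifying 'robot' labels become 'pre_intv'.
    """
    labels = list(class_labels)
    for t, c in enumerate(labels):
        if c == 'intv':
            # Walk back over the robot samples immediately preceding this intervention.
            for k in range(max(0, t - window), t):
                if labels[k] == 'robot':
                    labels[k] = 'pre_intv'
    return labels

# e.g. relabel_pre_intervention(['robot'] * 5 + ['intv'] * 2, window=3)
# -> ['robot', 'robot', 'pre_intv', 'pre_intv', 'pre_intv', 'intv', 'intv']
```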
V Experiments
V-A Tasks
We design a set of simulated and real-world tasks that resemble common industrial tasks in manufacturing and logistics. We consider long-horizon tasks that require precise, contact-rich manipulation, necessitating human guidance. For all tasks, we use a Franka Emika Panda robot arm equipped with a parallel-jaw gripper. Both the policy and the teleoperation device control the robot in task space. We use a SpaceMouse as the human interface device for interventions.
We systematically evaluate the performance of our method and baselines in the robosuite simulator [zhu2020robosuite]. We choose the two most challenging contact-rich manipulation tasks in the robomimic benchmark [robomimic2021]:
Nut Assembly. The robot picks up a square rod from the table and inserts the rod into a column.
Tool Hang. The robot picks up a hook piece and inserts it into a very small hole, then hangs a wrench on the hook. As noted in robomimic [robomimic2021], this is a difficult task requiring very precise and dexterous control.
In the real world, we design two tasks representative of industrial assembly and food packaging applications:
Gear Insertion. The robot picks up two gears on the NIST board and inserts each of them onto the gear shafts.
Coffee Pod Packing. The robot opens a drawer, places a coffee pod into the pod holder and closes the drawer.
V-B Metrics and Baselines
We benchmark human-in-the-loop deployment systems on two aspects: 1) Policy Performance. Our human-robot team achieves a reliable task success rate of 100%; here, we evaluate the success rate of the autonomous policy after each round of model update. 2) Human Workload. We measure human workload as the percentage of intervention samples in the trajectories of each round.
We compare our method with the state-of-the-art human-in-the-loop method Intervention Weighted Regression (IWR) [mandlekar2020humanintheloop]. Furthermore, to isolate the impacts of learning algorithms and data distributions, we compare against the state-of-the-art imitation learning algorithm BC-RNN [robomimic2021] and the state-of-the-art offline RL algorithm Implicit Q-Learning (IQL) [Kostrikov2021OfflineRL]. We run these two algorithms on the deployment data generated by our method for a fair comparison.
To mimic our intervention-guided weights in IQL, we assign distinct reward values upon task success, for intervention states, for pre-intervention states, and for all other states. We also run IQL in a sparse reward setting but find that it underperforms. We highlight that, in contrast to our method, IQL requires task rewards, which may be expensive to obtain in real-world settings.
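For illustration, the reward relabeling for this IQL variant can be sketched as follows; the numerical constants, and even their relative ordering, are placeholders for exposition rather than the values used in our experiments.

```python
# Placeholder reward constants for the intervention-guided IQL baseline.
# Both the values and their ordering are assumptions for illustration only.
REWARDS = {'success': 1.0, 'intv': 0.5, 'pre_intv': -0.5, 'other': 0.0}

def label_reward(is_success, cls):
    """Assign a scalar reward to a transition from its outcome and class label."""
    if is_success:
        return REWARDS['success']
    if cls == 'intv':
        return REWARDS['intv']
    if cls == 'pre_intv':
        return REWARDS['pre_intv']
    return REWARDS['other']
```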
We follow the round update rules of prior work [mandlekar2020humanintheloop, kelly2019hg]: 3 rounds of updates, with each update triggered when the number of intervention samples reaches a fixed fraction of the human demonstration samples, so that all methods receive the same amount of human samples per round. We discuss the round update design in more detail in Appendix VII-E.

V-C Quantitative Results
We show in Fig. 4 that our method significantly outperforms the baselines on our evaluation tasks. Our method consistently outperforms IWR over the rounds. We attribute this difference to our fine-grained weighting scheme, enabling the method to better differentiate high-quality and suboptimal samples. This advantage over IWR cascades across the rounds, as we obtain a better policy, which in turn yields better deployment data.
We also show that our method significantly outperforms the BC-RNN and IQL baselines under the same dataset distribution. This highlights the importance of our weighting scheme — BC-RNN performs poorly due to copying the suboptimal behaviors in the dataset, while IQL is unable to effectively learn the values as weights.
Moreover, we show the effectiveness of memory management during deployment in Fig. 5. With memory management strategies that discard samples, we reduce the memory size to a small proportion of the original dataset size (30% for Nut Assembly and 15% for Tool Hang). Memory management is also crucial to learning efficiency in model updates, as the policy is learned from a smaller dataset with more useful samples, while maintaining the same or even higher performance than no memory management. We compare the different management strategies discussed in Section IV-C and examine their effect on policy performance and learning efficiency. We define the learning efficiency metric as the number of epochs needed to reach high policy performance, which we benchmark at a 90% policy success rate. We show that with memory management strategies, we achieve 2-3x faster policy convergence, which is crucial to efficient updates in the deployment setting.

Furthermore, we conduct an ablation study to examine the contribution of each component of our weighting scheme in Fig. 6 (Left). Our method has a different weight design for interventions, demonstrations, and pre-interventions. We study how removing each class, i.e., treating its samples as the robot action class, affects the policy performance. We run each ablation on the deployment data generated by our method for each round of the Nut Assembly task. As shown in Fig. 6 (Left), removing any class hurts the policy performance, especially in the first round, where the dataset contains mostly suboptimal data. In earlier rounds, the robot needs to stay closer to the higher-quality demonstration samples, learn corrective behaviors, and avoid erroneous actions. The impact of class removal is smaller in later rounds, when the dataset is abundant with higher-quality data.
Lastly, we show that our method is more effective at reducing human workload than IWR. We visualize the distribution of interventions within trajectories in Fig. 6 (Right). Our method requires significantly fewer interventions and overcomes difficult bottleneck states more effectively. Please see the supplementary video for qualitative results.
VI Conclusion
We introduce Sirius, a framework for human-in-the-loop robot manipulation and learning at deployment that both guarantees reliable task execution and also improves autonomous policy performance over time. We utilize the properties and assumptions of human-robot collaboration to develop an intervention-based weighted behavioral cloning method for making effective use of deployment data. We also design a practical system that trains and deploys new models continuously under memory constraints. For future work, we are interested in exploring ways to make the robot respond to shared human control inputs in real time. Another direction for future research is alleviating the human’s cognitive burdens for monitoring and teleoperating the system.
Acknowledgment
We thank Ajay Mandlekar for having multiple insightful discussions, and for sharing well-designed simulation task environments and codebases during development of the project. We thank Yifeng Zhu for valuable advice and system infrastructure development for real robot experiments. We would like to thank Tian Gao, Jake Grigsby, Zhenyu Jiang, Ajay Mandlekar, Braham Snyder, and Yifeng Zhu for providing helpful feedback for this manuscript. We acknowledge the support of the National Science Foundation (1955523, 2145283), the Office of Naval Research (N00014-22-1-2204), and Amazon.
VII Appendix
VII-A Task Details
We elaborate on the four tasks in this section, providing more details of the task setups, the bottleneck regions, and how they are challenging. The two simulation tasks, Nut Assembly and Tool Hang, are from the robomimic codebase [robomimic2021] for better benchmarking.
Nut Assembly. The robot picks up a square rod from the table and inserts the rod into a column. The bottleneck lies in grasping the square rod with the correct orientation and turning it such that it aims at the column correctly.
Tool Hang. The robot picks up a hook piece, inserts it into a tiny hole, and then hangs a wrench on the hook. As noted in robomimic [robomimic2021], this task requires very precise and dexterous control. There are multiple bottleneck regions: picking up the hook piece with the correct orientation, inserting the hook piece with high precision in both position and orientation, picking up the wrench, and carefully aligning its small hole with the hook.
Gear Insertion. We design the task scene adapting from the common NIST board benchmark Task Board 1 (https://www.nist.gov/el/intelligent-systems-division-73500/robotic-grasping-and-manipulation-assembly/assembly), which is designed for standard industrial tasks like peg insertion and electrical connector insertion. Initially, one blue gear and one red gear are placed at a randomized region on the board. The robot picks up the two gears in sequence and inserts each onto its gear shaft. The gears' holes are very small, requiring precise insertion onto the gear shafts.
Coffee Pod Packing. We design this task for a food manufacturing setting in which the robot packs real coffee pods (https://www.amazon.com/dp/B00I5FWWPI) into a coffee pod holder (https://www.amazon.com/gp/product/B07D7M93ZW). The robot first opens the coffee pod holder drawer, grasps a coffee pod placed at a random initial position on the table, places the coffee pod into the pod holder, and closes the drawer. The pod holder contains holes that fit the coffee pods' sides precisely, so inserting the coffee pods into the holes requires precision. The common bottlenecks are grasping the coffee pod exactly, the precise insertion, and releasing the drawer after opening or closing it without getting stuck.
The objects in all tasks are initialized randomly within an x-y position range and with a rotation about the z-axis. The configurations of the simulation tasks follow those in robomimic. We present the reset initialization configuration in Table I for reference.
Tasks and Objects | Position (x-y) | Orientation (z) |
---|---|---|
Nut Assembly | ||
square nut | cm cm | |
Tool Hang | |
hook | cm cm | |
wrench | cm cm | |
Gear Insertion (Real) | ||
blue gear | cm cm | |
red gear | cm cm | |
Coffee Pod Packing (Real) | ||
coffee pod | cm cm |
[Algorithm 2 (intervention-based weighting scheme) inputs: class labels; data storage with class labels.]
VII-B Human-Robot Teaming
We illustrate the actual human-robot teaming process during human-in-the-loop deployment in Figure 7. The robot executes a task (e.g., gear insertion) by default while a human supervises the execution. In this gear insertion scenario, the expected robot behavior is to pick up the gear and insert it down the gear shaft. When the human detects undesirable robot behavior (e.g., gear getting stuck), the human intervenes by taking over control of the robot. The human directly passes in action commands to perform the desired behavior. When the human judges that the robot can continue the task, the human passes control back to the robot.
To enable effective shared human control of the robot, we seek a teleoperation interface that (1) enables humans to control the robot effectively and intuitively and (2) switches between robot and human control immediately once the human decides to intervene or pass control back to the robot. To this end, we employ a SpaceMouse (https://3dconnexion.com/us/spacemouse/). The human operator controls the 6-DoF SpaceMouse, whose position and orientation are passed to the robot as action commands. While monitoring the computer screen, the user can pause the robot by pressing a button, exert control until the robot is back in an acceptable state, and pass control back to the robot by stopping the motion of the SpaceMouse.
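A schematic sketch of this control-switching logic is shown below; the policy and device interfaces (`policy`, `spacemouse_cmd`, `takeover_pressed`) are hypothetical placeholders rather than our actual teleoperation stack.

```python
import numpy as np

class HumanGatedController:
    """Schematic human-gated control switch for SpaceMouse teleoperation.

    The robot acts by default; pressing the take-over button gives control to
    the human, and control returns to the robot once the SpaceMouse motion stops.
    `policy` is a hypothetical callable mapping an observation to an action.
    """
    def __init__(self, policy):
        self.policy = policy
        self.human_in_control = False

    def select_action(self, obs, spacemouse_cmd, takeover_pressed):
        if takeover_pressed:
            self.human_in_control = True           # human takes over control
        elif self.human_in_control and np.allclose(spacemouse_cmd, 0.0):
            self.human_in_control = False          # idle device hands control back
        if self.human_in_control:
            return spacemouse_cmd, 'intv'          # logged as an intervention sample
        return self.policy(obs), 'robot'           # logged as a robot sample
```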
VII-C Observation and Action Space
The observation space of all our tasks consists of the workspace camera image, the eye-in-hand camera image, and low-dimensional proprioceptive information. For simulation tasks, we use an operational space controller (OSC) with a 7D action space; for real-world tasks, we use an OSC yaw controller with a 5D action space.

The Tool Hang task differs slightly from the default robomimic [robomimic2021] image observations: we use a smaller image size than the default for training efficiency. Because the task needs high-resolution image inputs, we adjust the workspace camera angle to show more detail on the objects. This compensates for the need for a large image size and boosts policy performance.


Details on the low-dimensional proprioceptive information: for simulation tasks, we use the end-effector position (3D) and orientation (4D), as well as the gripper finger distance (2D); for real-world tasks, we use the joint positions (7D) and gripper width (1D).

The action space of the simulation tasks has 7 dimensions in total: x-y-z position (3D), yaw-pitch-roll orientation (3D), and the gripper open-close command (1D). The action space of the real-world tasks has 5 dimensions in total: x-y-z position (3D), yaw orientation (1D), and the gripper open-close command (1D).
VII-D Method Implementations
We present our learning-at-deployment pipeline in Algorithm 1 and the intervention-based weighting scheme in Algorithm 2.
We describe the policy architecture details initially introduced in Section IV-D. Our codebase is based on robomimic [robomimic2021], a recent open-source project that benchmarks a range of learning algorithms on offline data. We standardize all methods with the same state-of-the-art policy architectures and hyperparameters from robomimic. The architectural design includes ResNet-18 image encoders, random cropping for image augmentation, a GMM head, and the same training procedures. The list of hyperparameter choices is presented in Table II. For all BC-related methods, including Ours, IWR, and BC-RNN, we use the same BC-RNN architecture specified in Table III.

For all tasks except Tool Hang, we use the same hyperparameters and image size; we use a larger image size for Tool Hang due to its need for high-precision details. We use a few demonstrations for each task to warm-start the policy; the number varies with task difficulty so that the initial policy has some level of reasonable behavior on every task. See Table VI for all task-dependent hyperparameters.
For IQL [Kostrikov2021OfflineRL], we reimplement the method in our robomimic-based codebase to keep the policy backbone and common architecture the same across all methods. Our implementation is based on the publicly available PyTorch implementation of IQL (https://github.com/rail-berkeley/rlkit/tree/master/examples/iql).

We follow the paper's original design with some slight modifications. In particular, the original IQL uses a sparse reward setting where the reward is based on task success. We add a denser reward for IQL to incorporate information about human interventions: mimicking our intervention-guided weights, we assign distinct reward values upon task success, for intervention states, for pre-intervention states, and for all other states. We find that this version of IQL outperforms the default sparse reward setting. We list the hyperparameters for the IQL baseline in Table V.
VII-E HITL System Policy Updates
We elaborate on our design choices for the HITL system policy update rules discussed in Section V-B.

In a practical human-in-the-loop deployment system, there are many possible design choices for the condition and frequency of policy updates. A few straightforward ones are: update after a specific amount of elapsed time, update after the robot completes a certain number of tasks, or update after human interventions reach a certain number. Our experiments aim to provide a fair comparison between various human-in-the-loop methods and benchmark our method against prior baselines. For consistent evaluation, we follow the round update rules of prior work [mandlekar2020humanintheloop, kelly2019hg]: 3 rounds of updates, with each update triggered when the number of intervention samples reaches a fixed fraction of the human demonstration samples. The motivation is to evaluate prior baselines in their original setting to ensure a fair comparison; moreover, we want to ensure all methods receive the same amount of human samples per round. Since they are human-in-the-loop methods, the amount of human samples is important to their data utilization. How policies are updated could be a dimension of human-in-the-loop system design in its own right and could be further explored in future work.
VII-F Evaluation Procedure
Evaluation is an essential step in the robot learning experiment pipeline. We design fair evaluation procedures to provide as accurate an estimate as possible of the true policy performance of each method. Here we elaborate on our evaluation procedures and the motivations behind them.
Simulation experiments: We evaluate the success rate across 3 seeds and report the average and standard deviation across seeds. For each seed, we run the following evaluation procedure: we evaluate 100 trials for each checkpoint, choosing one checkpoint every 50 epochs within the total number of trained epochs. We record the top 3 success rates and average them to reduce the effect of outliers in the evaluation.
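A minimal sketch of this per-seed simulation evaluation procedure follows; `eval_checkpoint` is a placeholder for rolling out a saved checkpoint and returning its success rate.

```python
def simulation_success_rate(eval_checkpoint, checkpoint_epochs, num_trials=100, top_k=3):
    """Per-seed evaluation: roll out each saved checkpoint (one every 50 epochs
    in our setup) for `num_trials` episodes, then average the top-k success
    rates to reduce the effect of outliers.

    `eval_checkpoint(epoch, num_trials)` is a hypothetical helper returning the
    success rate of the checkpoint saved at `epoch`.
    """
    rates = [eval_checkpoint(epoch, num_trials) for epoch in checkpoint_epochs]
    best = sorted(rates, reverse=True)[:top_k]
    return sum(best) / len(best)
```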
Real-world experiments: Due to the high time cost for real robot evaluation, we evaluate for 1 seed for each method. Since real robot evaluations are subject to noise and variation across checkpoints, we run the following evaluation procedure to ensure that results are as fair as possible. For each method, we perform an initial evaluation of different checkpoints (5 checkpoints in practice), evaluating each for a small number of trials (5 trials in practice). For the checkpoint that gives the best quantitative and qualitative behavior, we perform 32 trials and report the success rate over them. To give a more straightforward comparison, we present the absolute number of successes over 32 trials in Table VII.
VII-G Human Workload Reduction
We present more results on the effectiveness of our method in reducing human workload, as discussed in Section V-C. To show how human workload decreases over the policy deployment rounds, we plot the human intervention sample ratio for every round, i.e., the percentage of intervention samples among all samples per round. We compare the results for the HITL methods, Ours and IWR, in Figure 8. The human intervention ratio decreases for all four tasks as policy performance increases over time. Furthermore, our method reduces the human intervention ratio significantly more than IWR.
Moreover, we note that there are different metrics to evaluate human workload, such as the number of control switches and lengths of interventions, as introduced in prior work [Hoque2021ThriftyDAggerBN]. We include two additional human workload metrics:

Average intervention frequency: the number of intervention occurrences divided by the number of rollouts. This reflects the number of context switches, i.e., shifts of control between the human and the robot; a higher number of context switches demands more concentration from the human and causes more exhaustion.
Average intervention length: the length of each intervention in number of timesteps. This reflects the effort of each intervention: a longer intervention means a higher mental workload for the human while taking control of the robot.
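Both metrics can be computed from logged per-timestep intervention flags, as in the following minimal sketch; the data format is an assumption for illustration.

```python
def workload_metrics(rollouts):
    """Compute the two intervention workload metrics from deployment rollouts.

    `rollouts` is assumed to be a list of per-timestep boolean lists, where
    True marks a human-intervention timestep.
    Returns (average intervention frequency per rollout,
             average intervention length in timesteps).
    """
    num_interventions, lengths = 0, []
    for flags in rollouts:
        run = 0
        for f in flags + [False]:        # sentinel closes a trailing intervention run
            if f:
                run += 1
            elif run > 0:                # an intervention segment just ended
                num_interventions += 1
                lengths.append(run)
                run = 0
    avg_frequency = num_interventions / max(len(rollouts), 1)
    avg_length = sum(lengths) / max(len(lengths), 1)
    return avg_frequency, avg_length
```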
We note that these metrics also reflect the human's trust in the robot. During robot control, the human decides: should I intervene at this point? During human control, the human decides: is the robot in a state where I can safely return control? Lower intervention frequency and shorter intervention length indicate that the human trusts the robot more, intervening in fewer places and returning control to the robot faster.

We present the results in Figure 9 using Nut Assembly as an example. We can see that, like the human intervention ratio, the average intervention frequency and intervention length both decrease. Our method also reduces both metrics faster over the round updates. This shows that our human-in-the-loop system fosters human trust in the robot and develops a better human-robot partnership.
Lastly, we visualize how the dynamics of the human-robot partnership evolve qualitatively in Figure 10. For the Gear Insertion task, we run a sequence of task execution trials for our method in an early round and a later round, and record the duration of human intervention needed during deployment. Compared with the early round, the policy in the later round needs very little human intervention, and the intervention durations are much shorter, validating the effectiveness of our system in reducing human workload.
Hyperparameter | Value |
---|---|
GMM number of modes | |
Image encoder | ResNet-18 |
Random crop ratio | % of image height |
Optimizer | Adam |
Batch size | |
# Training steps per epoch | |
# Total training epochs | |
Evaluation checkpoint interval (in epoch) | |
Hyperparameter | Value |
---|---|
RNN hidden dim | |
RNN sequence length | |
# of LSTM layers | |
Learning rate | |
Hyperparameter | Value |
---|---|
ratio for intv class, | |
ratio for preintv class, | |
Hyperparameter | Value |
---|---|
Reward scale | |
Termination | false |
Discount factor | |
Beta | |
Adv filter | exponential |
V function quantile | |
Actor lr | |
Actor lr decay factor | |
Actor mlp layers | |
Critic lr | |
Critic lr decay factor | |
Critic mlp layers | |
Hyperparameter | Nut Assembly | ToolHang | Gear Insertion (Real) | Coffee Pod Packing (Real) |
---|---|---|---|---|
Image size () | ||||
Initial # of human demonstrations | ||||
Evaluation rollout length | ||||
Task | Method | Round 0 (BC base policy) | Round 1 | Round 2 | Round 3
---|---|---|---|---|---|
Gear Insertion | Ours | 14 / 32 | 22 / 32 | 25 / 32 | 29 / 32 |
IWR | 14 / 32 | 18 / 32 | 20 / 32 | 23 / 32 | |
BC-RNN | - | - | - | 11 / 32 | |
IQL | - | - | - | 12 / 32 | |
Coffee Pod Packing | Ours | 3 / 32 | 20 / 32 | 23 / 32 | 24 / 32 |
IWR | 3 / 32 | 15 / 32 | 19 / 32 | 21 / 32 | |
BC-RNN | - | - | - | 14 / 32 | |
IQL | - | - | - | 13 / 32 |