A long-standing dream in robotics is to enable robots to perform general-purpose tasks in diverse environments. In recent years, imitation learning has led to great progress towards this goal. Recent work has demonstrated that well-engineered behavioral cloning methods can achieve competitive performance on diverse manipulation tasks [robomimic2021, florence2021implicit, zeng2020transporter]. Despite this promise, these imitation learning algorithms have been confined to relatively small-scale domains. For real-world problems, current imitation learning algorithms often demand a large number of task demonstrations, which can be difficult and costly to obtain.
Recognizing these limitations, a line of recent work seeks to use prior interaction data to improve the sample efficiency of imitation learning on new tasks. Such prior data come in various forms, including task-agnostic exploratory “play” data [mees2021calvin] or demonstrations previously collected for different tasks [ebert2021bridge]. An open question is how best to extract knowledge from these large prior datasets and use this knowledge to facilitate learning novel tasks. One promising approach is skill-based imitation learning [ajay2020opal, hakhamaneshi2021fist], which aims to learn a latent space of short-horizon sub-trajectories from the prior data (called skill learning) and subsequently learn a policy to invoke the skills to solve a specific downstream task (called policy learning). This approach offers two appealing advantages. First, the policy benefits from the temporal abstraction encapsulated by the skills, allowing the policy to focus on higher-level reasoning about what behavior to perform rather than how to execute that behavior. Second, by reasoning about the target task in relation to skills trained on prior data, the robot implicitly distills knowledge from the rich, diverse interactions of the prior data into the policy. Even so, existing skill-based imitation learning approaches bring only modest improvements over simple behavioral cloning methods. This work aims to identify the underlying limitations of existing approaches and to design new methods that address them.
We argue that a practical approach should encompass two key properties. First, the latent skill space should serve as a predictable representation for downstream policy learning, allowing the policy to accurately infer which skill is appropriate in new situations. While prior methods [pertsch2020spirl, pertsch2021skild, ajay2020opal, hakhamaneshi2021fist] have widely adopted variational autoencoders to transform sub-trajectories into embeddings, we empirically show that this learning objective alone is insufficient for constructing a well-structured latent space. As a result, the policy tends to execute irrelevant skills. Second, the robot should take advantage of the prior data to learn both the skills and the policy. A major limitation of existing skill-based imitation learning approaches [ajay2020opal, hakhamaneshi2021fist] is that they primarily use prior data for skill learning but not for policy learning. In these approaches, policies learned on a small number of target task demonstrations are prone to severe overfitting and covariate shift.
To address these limitations, we introduce Skill-Augmented Imitation Learning with prior Retrieval (SAILOR). To improve the predictability of our skill representations, SAILOR uses an auxiliary temporal predictability objective to estimate the temporal distance between sub-trajectories in the demonstration sequences. To use prior data for policy learning, we introduce a novel retrieval-based data augmentation procedure that selectively retrieves data relevant to the target task. Specifically, we consider a subset of sub-trajectories from the prior dataset that have high latent skill similarity with sub-trajectories in the target task demonstrations (see Fig. 1). We evaluate on a wide variety of manipulation tasks in simulation and the real world, and show that SAILOR significantly outperforms state-of-the-art imitation learning and offline reinforcement learning approaches. Through comprehensive analysis, we highlight the roles that prior data, our representation learning objective, and retrieval-based data augmentation have in data-efficient learning of robust manipulation policies.
2 Related Work
Learning from Prior Data. There is a large body of work on learning manipulation tasks using human demonstrations [robomimic2021, mandlekar2018roboturk, wong2021momart, zhang2017deep, rajeswaran2017learning, mandlekar2020iris]. While promising, most of these works learn tasks independently, without re-using knowledge from prior tasks. As a result, they have high data requirements and exhibit brittle generalization in complex long-horizon tasks. To address these limitations, several lines of work have investigated leveraging large offline prior datasets to facilitate learning downstream robotic tasks. These prior datasets include task-agnostic play data [mees2021calvin, lynch2019learning], demonstrations for related tasks [ebert2021bridge], self-supervised agent-generated data [dasari2019robonet], or a combination of these [cabi2019scaling]. Large video datasets are an appealing alternative [goyal2017something, damen2018epic, Ego4D2022CVPR]. Recent work has leveraged these datasets to learn pre-trained visual representations for downstream control tasks [nair2022r3m, xiao2020masked]. Yet despite their appeal, after experimenting with one such approach, R3M [nair2022r3m], we found that it can sometimes hinder downstream performance. One hypothesis is that the prior data on which these methods are trained exhibit significant domain shift compared to downstream tasks, limiting transfer. In this work, we instead consider large multi-task robotic datasets.
Multi-Task Imitation Learning. Multi-task imitation learning methods offer a promising way to learn from diverse robotic data. These include task-conditioned [ebert2021bridge], language-conditioned [jang2022bcz, mees2021calvin, shridhar2021cliport, lynch2020lang], and skill-based [ajay2020opal, hakhamaneshi2021fist] imitation learning. While most task-conditioned and language-conditioned approaches seek to learn a single policy that performs well across a series of tasks, we focus on learning task-specific policies by utilizing a large multi-task prior dataset to learn a useful skill representation space, and supervising the task-specific policy using learned latent skills. These multi-task imitation approaches could be complementary to our skill representation learning pipeline, i.e., we could incorporate language and task ID supervision into our approach.
Skill-based Imitation Learning. We adopt skill-based imitation learning as the underlying framework for our method. The objective is to learn temporally abstract representations of sensory-motor data (termed skills) to enable more effective imitation learning. A long line of work has proposed learning skill representations by segmenting demonstrations into trajectory segments termed sub-trajectories. This includes work on segmenting demonstrations into variable-length sub-trajectories, either in an unsupervised manner [konidaris2012cst, niekum2012learning, krishnan2017ddco, shankar2020learning, shankar2020discovering, kipf2019compile, tanneberg2021skid] or relying on additional supervision [shiarlis2018taco, su2018learning, zhu2022buds, belkhale2022plato]. Recent work has shown promise for encoding fixed-length sub-trajectories without any additional supervision [pertsch2020spirl, pertsch2021skild, ajay2020opal, hakhamaneshi2021fist] using variational autoencoder-based approaches. We adopt this setting due to its relative simplicity and scalability. In contrast to these prior approaches (see Sec. 3), we enforce two key properties: a temporal predictability objective in our learned skill representation and a retrieval-based mechanism to improve task-specific policy learning.
3 Problem Formulation
Our goal is to leverage prior data to learn novel target tasks in a data-efficient manner. Formally, we consider a target task as a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \rho_0, \gamma)$ representing the state space, action space, reward function, transition probability, initial state distribution, and discount factor. Our objective is to learn a policy $\pi$ that maximizes the discounted sum of rewards for the task. To learn the policy, we assume access to a small offline dataset $\mathcal{D}_{\text{target}}$ collected for the target task and a large offline dataset $\mathcal{D}_{\text{prior}}$ either from previous related tasks or task-agnostic interactions. These datasets consist of variable-length trajectories in the form $\{(o_t, a_t)\}_{t=0}^{T-1}$, with $o_t$ denoting the observations and $a_t$ denoting actions. We highlight that $\mathcal{D}_{\text{prior}}$ and $\mathcal{D}_{\text{target}}$ may have significant differences, i.e., the two datasets may come from different environments, be collected by different human demonstrators, and demonstrate different tasks.
Skill-based Imitation Learning. We employ a skill-based imitation learning framework [ajay2020opal, hakhamaneshi2021fist] which consists of two stages: skill learning and policy learning. In the skill learning phase, we learn a skill embedding space of fixed-length sub-trajectories in $\mathcal{D}_{\text{prior}}$. These skill embeddings serve as an abstract representation of the agent’s behavior and can be invoked to solve a range of downstream tasks. A number of representation learning methods can be employed to learn the skill embeddings, spanning reconstruction-based methods and contrastive learning. In the subsequent policy learning phase, we are given $\mathcal{D}_{\text{target}}$ and our goal is to learn a policy for the target task $\mathcal{M}$. The policy now emits skill embeddings $z$, and during execution we decode $z$ into a sequence of actions via a skill decoder model. To learn the policy, prior work has proposed parametric [ajay2020opal] and semi-parametric [hakhamaneshi2021fist] policies that first parse segments of $\mathcal{D}_{\text{target}}$ into skills and subsequently learn to map observations in $\mathcal{D}_{\text{target}}$ to these skill embeddings. Note that we use distinct terms, skill and policy, to distinguish the role of these two components. Skills represent short-horizon behaviors that can be re-used across many tasks, while the policy solves a specific long-horizon target task.
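As an illustration of the data that skill learning operates on, the sketch below slices one variable-length trajectory into every contiguous fixed-length sub-trajectory. The function name `extract_subtrajectories`, the `horizon` parameter, and the toy string observations are assumptions made for this example and are not taken from the method's implementation.

```python
from typing import List, Sequence, Tuple

# One timestep of a demonstration: (observation, action). The concrete types
# are placeholders; real data would hold images, proprioception, and so on.
Step = Tuple[object, object]


def extract_subtrajectories(trajectory: Sequence[Step], horizon: int) -> List[List[Step]]:
    """Return every contiguous window of `horizon` steps from one trajectory."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    # A trajectory shorter than `horizon` yields no windows (empty range).
    num_windows = len(trajectory) - horizon + 1
    return [list(trajectory[start:start + horizon]) for start in range(num_windows)]


# Toy trajectory with five timesteps and a hypothetical horizon of three.
toy_trajectory = [(f"o{t}", f"a{t}") for t in range(5)]
windows = extract_subtrajectories(toy_trajectory, horizon=3)
```

Whether windows overlap, as they do here, or are sampled at random offsets is a design choice that this sketch does not settle.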
4 Skill-based Imitation Learning with Retrieval
In this section, we describe our skill-based imitation learning approach that can leverage prior multi-task data to efficiently learn novel target tasks with a small amount of task-specific demonstrations. As discussed in Sec. 3, this consists of two phases—a task-agnostic skill learning phase, where a latent skill space is learned using the prior data, and a task-specific policy learning phase where task-specific data is used to learn a policy using the skills as supervision. Compared to prior methods, our approach makes two key considerations—we (1) ensure that the learned skill space is a predictable representation for downstream policy learning, and (2) improve the efficacy of task-specific policy learning by retrieving task-relevant datapoints from the prior dataset. See Fig. 2 for our model overview.
4.1 Learning a Predictable Representation of Skills
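We learn a skill representation by encoding sub-trajectories with a variational autoencoder (VAE) [kingma2014vae], and we further introduce an auxiliary objective to shape the representation. Denoting a given sub-trajectory as $\tau = \{o_t, a_t, \dots, o_{t+H-1}, a_{t+H-1}\}$, we employ a long short-term memory (LSTM) [hochreiter1997lstm] encoder $q_\phi(z \mid \tau)$ that encodes $\tau$ into a Gaussian distribution over latent skills. Our decoder is an LSTM network $p_\psi(a_k \mid z, o_k)$ that, for each timestep $k$, decodes a latent $z$ and the given observation $o_k$ into the reconstructed action $\hat{a}_k$. We additionally employ a learned prior $p_\omega(z \mid o_t, o_{t+H-1})$ to encourage sub-trajectories with similar starting and ending observations to have similar latent representations [lynch2019learning]. Our VAE loss objective is then
$$\mathcal{L}_{\text{VAE}} = -\,\mathbb{E}_{z \sim q_\phi(z \mid \tau)}\Big[\sum_{k=t}^{t+H-1} \log p_\psi(a_k \mid z, o_k)\Big] + \beta\, D_{\text{KL}}\big(q_\phi(z \mid \tau)\,\big\|\,p_\omega(z \mid o_t, o_{t+H-1})\big), \qquad (1)$$
where $\beta$ controls the effect of the KL divergence term [higgins2016beta]. (Footnote 1: In practice we use a deterministic VAE decoder and compute the reconstruction loss using a distance between the predicted and demonstrated actions.)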
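For reference, the KL divergence term of the VAE objective has a closed form when both distributions are Gaussian with diagonal covariance. The text states that the encoder outputs a Gaussian; the diagonal covariance and the Gaussian form of the learned prior are assumptions made for this worked equation. Writing the encoder distribution as $\mathcal{N}(\mu_q, \operatorname{diag}(\sigma_q^2))$ and the prior as $\mathcal{N}(\mu_p, \operatorname{diag}(\sigma_p^2))$ over a $D$-dimensional latent:

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_q, \operatorname{diag}(\sigma_q^2)) \,\middle\|\, \mathcal{N}(\mu_p, \operatorname{diag}(\sigma_p^2))\right)
= \frac{1}{2} \sum_{d=1}^{D} \left[ \log\frac{\sigma_{p,d}^{2}}{\sigma_{q,d}^{2}}
+ \frac{\sigma_{q,d}^{2} + (\mu_{q,d} - \mu_{p,d})^{2}}{\sigma_{p,d}^{2}} - 1 \right]
```

The expression is zero exactly when the two distributions coincide, and it grows both with the gap between the means and with any mismatch in the variances.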
It is important to highlight that action reconstruction is not the sole objective of our skill learning model: learning a consistent and predictable representation of behavior is critical for downstream policy learning, as shown by recent work [allshire2021laser, yang2022trail]. While the KL divergence term is one step towards this objective (by encouraging skills to be predictable given partial information from the sub-trajectories), in this work we introduce an additional temporal predictability term that encourages the learned latent space to predict the temporal difference between two sub-trajectories. Specifically, given two sub-trajectories $\tau_i$ and $\tau_j$ from the same trajectory separated by $\Delta t$ timesteps, we learn a model $f_\theta$ to predict $\Delta t$ given the corresponding skill mean embeddings of the trajectories:
$$\mathcal{L}_{\text{TP}} = \ell\big(f_\theta(\bar{q}_\phi(\tau_i), \bar{q}_\phi(\tau_j)),\; \Delta t\big), \qquad (2)$$
where $\bar{q}_\phi(\tau)$ denotes taking the mean of the encoder distribution $q_\phi(z \mid \tau)$ and $\ell$ is a regression loss (e.g., a squared error). We back-propagate through the skill encoder model, allowing this term to shape the learned skill representation. Note that this is just one way to encourage temporal predictability; other objectives are also readily compatible with our method, such as time-contrastive networks [sermanet2018tcn]. Our overall objective is a weighted combination of the VAE and temporal predictability objectives:
$$\mathcal{L}_{\text{skill}} = \mathcal{L}_{\text{VAE}} + \alpha\, \mathcal{L}_{\text{TP}}, \qquad (3)$$
where $\alpha$ weights the temporal predictability term.
Refer to Algorithm 1 for a detailed summary of our skill learning algorithm.
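The temporal predictability objective can be illustrated with a small sketch that samples two sub-trajectory start indices from one trajectory and scores a predictor on their separation. Everything here is an assumption for illustration: the helper names, the use of an ordered (non-negative) separation, the squared-error loss, and the toy one-dimensional embeddings that stand in for the encoder's mean outputs.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector, int]  # (embedding_i, embedding_j, separation)


def sample_pair_starts(traj_len: int, horizon: int, rng: random.Random) -> Tuple[int, int, int]:
    """Pick two sub-trajectory start indices from one trajectory and their separation."""
    last_start = traj_len - horizon
    if last_start < 0:
        raise ValueError("trajectory is shorter than one sub-trajectory")
    first = rng.randint(0, last_start)
    second = rng.randint(0, last_start)
    first, second = min(first, second), max(first, second)
    return first, second, second - first


def temporal_predictability_loss(pairs: List[Pair],
                                 predictor: Callable[[Vector, Vector], float]) -> float:
    """Mean squared error between predicted and true separations (assumed loss)."""
    errors = [(predictor(z_i, z_j) - dt) ** 2 for z_i, z_j, dt in pairs]
    return sum(errors) / len(errors)


# Toy check: if an embedding simply stores its start index, the separation
# can be read off exactly and the loss is zero.
rng = random.Random(0)
pairs: List[Pair] = []
for _ in range(4):
    i, j, dt = sample_pair_starts(traj_len=20, horizon=5, rng=rng)
    pairs.append(([float(i)], [float(j)], dt))


def perfect_predictor(z_i: Vector, z_j: Vector) -> float:
    return z_j[0] - z_i[0]


loss = temporal_predictability_loss(pairs, perfect_predictor)
```

In an actual training loop the predictor and encoder would be neural networks trained jointly by gradient descent; the sketch only fixes the shape of the supervision signal.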
4.2 Retrieval-based Policy Learning
In the policy learning phase, we employ an LSTM policy that outputs the skill to execute next. We train the policy on a dataset of tuples $(h, \tau, z)$, where $\tau$ is an $H$-length sub-trajectory, $z$ is the mean encoding of that sub-trajectory, and $h$ is the frame-stacked history of observations preceding the sub-trajectory. We train the policy to predict $z$ from $h$ using a standard behavioral cloning loss. During execution, we roll out the LSTM skill decoder in a closed-loop manner, i.e., at each timestep $k$ we observe $o_k$ and execute the next action $\hat{a}_k$. After rolling out the skill for $H$ timesteps we repeat the process by sampling a new skill from the policy.
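The execution loop described above can be sketched as two nested loops: the policy picks a skill, and the decoder is queried once per timestep with a fresh observation. The `ToyEnv` class, the function names, and the stateless toy policy and decoder are hypothetical stand-ins; in particular, a real LSTM decoder would also carry hidden state across the steps of a skill, which is omitted here.

```python
from typing import Callable, List


class ToyEnv:
    """Hypothetical 1-D environment: the observation is a running sum of actions."""

    def __init__(self) -> None:
        self.x = 0.0

    def reset(self) -> float:
        self.x = 0.0
        return self.x

    def step(self, action: float) -> float:
        self.x += action
        return self.x


def rollout_skills(env: ToyEnv,
                   policy: Callable[[List[float]], float],
                   decoder: Callable[[float, float], float],
                   horizon: int,
                   num_skills: int) -> List[float]:
    """Alternate between choosing a skill and decoding it for `horizon` steps."""
    obs = env.reset()
    history = [obs]
    actions: List[float] = []
    for _ in range(num_skills):
        z = policy(history)  # choose the next skill from the observation history
        for _ in range(horizon):
            action = decoder(z, obs)  # closed loop: condition on the latest observation
            obs = env.step(action)
            history.append(obs)
            actions.append(action)
    return actions


def toy_policy(history: List[float]) -> float:
    # Move up until the state reaches 2.0, then move back down.
    return 1.0 if history[-1] < 2.0 else -1.0


def toy_decoder(z: float, obs: float) -> float:
    # The toy skill embedding is directly the action to apply.
    return z


actions = rollout_skills(ToyEnv(), toy_policy, toy_decoder, horizon=3, num_skills=2)
```

Because the skill is only re-selected every `horizon` steps, the policy reasons at a coarser timescale than the decoder, which is the temporal abstraction the framework relies on.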
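A common approach to skill-based policy learning is training the policy with all $H$-length sub-trajectories in $\mathcal{D}_{\text{target}}$. However, this limits the amount of supervision, especially when $\mathcal{D}_{\text{target}}$ is small. On the other hand, naïvely training on all sub-trajectories in $\mathcal{D}_{\text{prior}}$ and $\mathcal{D}_{\text{target}}$ can hurt performance [NEURIPS2020_3fe78a8a] due to divergent and conflicting behaviors between the prior and target datasets. We thus introduce a retrieval-based mechanism to train on sub-trajectories in $\mathcal{D}_{\text{prior}}$ that have high similarity with those in $\mathcal{D}_{\text{target}}$. While many similarity metrics are suitable, in this work we measure similarity with respect to the skill embedding space: intuitively, sub-trajectories with similar skill embeddings demonstrate similar behaviors. First we obtain skill embeddings of randomly sampled sub-trajectories in $\mathcal{D}_{\text{prior}}$ and $\mathcal{D}_{\text{target}}$:
$$Z_{\text{prior}} = \{\bar{q}_\phi(\tau) : \tau \sim \mathcal{D}_{\text{prior}}\}, \qquad Z_{\text{target}} = \{\bar{q}_\phi(\tau) : \tau \sim \mathcal{D}_{\text{target}}\},$$
where $\bar{q}_\phi(\tau)$ is the mean skill embedding of the sub-trajectory $\tau$.
We then calculate the pairwise distances between the prior and target dataset skill embeddings, i.e., $D[i][j] = d(z^{\text{prior}}_i, z^{\text{target}}_j)$, where $d$ is the distance in the skill embedding space. Next, for each prior dataset skill embedding $z^{\text{prior}}_i$, we find the closest corresponding target dataset skill, D_min[i] = min(D[i][:]). Finally, we retrieve the top-$n$ sub-trajectories in $\mathcal{D}_{\text{prior}}$ with the smallest distance, argsort(D_min)[:n], resulting in the retrieval dataset $\mathcal{D}_{\text{ret}}$. We train the policy using the aggregated set of sub-trajectories in $\mathcal{D}_{\text{target}}$ and $\mathcal{D}_{\text{ret}}$. We use a behavioral cloning loss split across two terms: one for $\mathcal{D}_{\text{target}}$, and one for $\mathcal{D}_{\text{ret}}$ weighted by a factor $\lambda$ to control the effect of the retrieval data relative to the target dataset. At the same time as training the policy we additionally fine-tune the skill model. We summarize the retrieval, policy learning, and skill fine-tuning steps in Algorithm 2.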
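The retrieval steps above (pairwise distances, nearest target skill per prior embedding, then the top-n smallest) can be written out in a few lines. This is a minimal sketch under stated assumptions: Euclidean distance is used as the embedding-space metric, ties are broken by index order, and the way the two behavioral cloning terms are averaged in `weighted_bc_loss` is a guess at one reasonable normalization. The function names and the `weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def retrieve_indices(prior_z: Sequence[Vector], target_z: Sequence[Vector], n: int) -> List[int]:
    """Indices of the n prior embeddings closest to any target embedding."""
    # D_min[i] = min over j of D[i][j], using Euclidean distance (an assumption).
    d_min = [min(math.dist(z_p, z_t) for z_t in target_z) for z_p in prior_z]
    # Equivalent of argsort(D_min)[:n]; Python's sort is stable, so ties keep index order.
    order = sorted(range(len(prior_z)), key=d_min.__getitem__)
    return order[:n]


def weighted_bc_loss(target_losses: Sequence[float],
                     retrieved_losses: Sequence[float],
                     weight: float) -> float:
    """Mean target loss plus `weight` times mean retrieved-data loss."""
    target_term = sum(target_losses) / len(target_losses)
    retrieved_term = sum(retrieved_losses) / len(retrieved_losses)
    return target_term + weight * retrieved_term


# Toy 2-D embeddings: prior items 0 and 2 sit next to the target embeddings.
prior_embeddings = [[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [9.0, 9.0]]
target_embeddings = [[0.0, 1.0], [1.0, 1.0]]
retrieved = retrieve_indices(prior_embeddings, target_embeddings, n=2)
total_loss = weighted_bc_loss([1.0, 3.0], [2.0], weight=0.5)
```

For prior datasets with millions of sub-trajectories, the nested loop would normally be replaced by a vectorized or approximate nearest-neighbor computation; the logic stays the same.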
Table 1: We evaluate our method SAILOR against a set of six baselines and report the mean task success rate and standard deviation over three seeds (exception: six seeds for BC-RNN (FT) due to high variance). Note: for the kitchen tasks we report one number for baselines that do not involve prior data. We see that SAILOR significantly outperforms the baselines on all tasks.
5.1 Simulated Experiment Setup
We perform empirical evaluations on two simulated robot manipulation domains (see Fig. 3):
Franka Kitchen [gupta2019rpl]: A simulated kitchen environment involving different sub-tasks, such as opening cabinets, moving a kettle, and turning on a stove. This environment comes with a large dataset of approximately 600 demonstrations performing various permutations of seven subtasks. In this dataset, a subset of 18 demonstrations corresponds to $\mathcal{D}_{\text{target}}$ and demonstrates a specific permutation of subtasks: opening the microwave, followed by moving the kettle, flipping on the light switch, and opening the sliding cabinet. We consider two prior datasets $\mathcal{D}_{\text{prior}}$: (1) using all demonstrations except the ones corresponding to the target task (Kitchen-All); and (2) using all demonstrations except those that involve interacting with the microwave (Kitchen-No Microwave). These prior datasets have 584 and 235 demonstrations, respectively.
CALVIN [mees2021calvin]: A simulated tabletop playroom environment accompanied by a large dataset of task-agnostic “play” data with 2.3M transitions. The play data encompass diverse behaviors, such as opening and closing drawers, turning on and off the lights, and picking, placing, and pushing blocks. We use all play data as $\mathcal{D}_{\text{prior}}$ to solve two target tasks. The first target task involves setting up the playroom environment in multiple stages (CALVIN-Setting Up). Specifically, the robot must turn on the lights, retrieve three blocks, and place them on the table. The second target task, in contrast, involves cleaning up the playroom environment (CALVIN-Cleaning Up). Specifically, the robot must open the drawer, place all three blocks into the drawer, close the drawer, and turn off the lights. For each task, we collect 30 demonstrations, which amounts to about half an hour of data collection.
The CALVIN domain is substantially more challenging than the Franka Kitchen domain, as the target tasks have a longer horizon and involve a greater number of objects. Also, in contrast to Franka Kitchen, the prior and target datasets in CALVIN are collected by different human demonstrators who exhibit different styles of teleoperation.
5.2 Quantitative Analysis
We evaluate our method SAILOR against state-of-the-art imitation learning and offline reinforcement learning algorithms:
BC-RNN: behavioral cloning on $\mathcal{D}_{\text{target}}$ without prior data. We adopt the LSTM-based BC-RNN implementation in robomimic [robomimic2021], which has shown superior performance over other behavioral cloning approaches.
BC-RNN (FT): BC-RNN variant that leverages prior data. We first pre-train BC-RNN on $\mathcal{D}_{\text{prior}}$ and subsequently fine-tune on $\mathcal{D}_{\text{target}}$. This baseline aims to examine the effectiveness of supervised pre-training on interaction data for imitation learning.
BC-RNN (R3M): behavioral cloning on $\mathcal{D}_{\text{target}}$ using a frozen R3M visual representation [nair2022r3m] pre-trained on the large-scale Ego4D video dataset [Ego4D2022CVPR]. This baseline is intended to examine the effectiveness of using visual representations trained on natural images and videos.
IQL: Implicit Q-Learning [kostrikov2022iql], a recent offline reinforcement learning method with state-of-the-art performance on the D4RL dataset [fu2020d4rl]; trained on $\mathcal{D}_{\text{target}}$.
IQL (UDS): Implicit Q-Learning with Unlabeled Data Sharing [yu2022uds], which is a variant of IQL trained jointly on $\mathcal{D}_{\text{target}}$ and $\mathcal{D}_{\text{prior}}$, where the transitions in $\mathcal{D}_{\text{prior}}$ are labeled with the minimum task reward. Yu et al. [yu2022uds] show that this simple data augmentation procedure can effectively leverage prior data without additional reward annotation.
FIST: Few-shot Imitation with Skill Transition Models [hakhamaneshi2021fist], an analogue of our method that employs a semi-parametric policy to select the skill to execute next. This baseline uses the same underlying skill model as our method but a different policy learning scheme and does not involve retrieval.
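To make the IQL (UDS) baseline's data preparation concrete, the sketch below relabels every prior transition with a fixed minimum reward and concatenates the result with the target transitions. The dictionary-based transition format, the function names, and the example value `0.0` for the minimum reward are all assumptions for illustration; the actual minimum reward depends on the task's reward definition.

```python
from typing import Dict, List

Transition = Dict[str, object]


def relabel_with_min_reward(prior: List[Transition], min_reward: float) -> List[Transition]:
    """Copy each prior transition, overwriting or adding its reward field."""
    return [{**transition, "reward": min_reward} for transition in prior]


def build_uds_dataset(target: List[Transition],
                      prior: List[Transition],
                      min_reward: float) -> List[Transition]:
    """Target transitions keep their rewards; prior transitions get `min_reward`."""
    return list(target) + relabel_with_min_reward(prior, min_reward)


# Toy data: the prior transitions carry no reward annotation at all.
target_data = [{"obs": "t0", "action": "a0", "reward": 1.0}]
prior_data = [{"obs": "p0", "action": "b0"}, {"obs": "p1", "action": "b1"}]
# min_reward=0.0 is a placeholder value chosen for this example.
uds_data = build_uds_dataset(target_data, prior_data, min_reward=0.0)
```

The copies leave the original prior transitions untouched, so the same prior dataset can still be reused by the other baselines.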
See the appendix for hyperparameter details. We report the performance of all methods in Table 1. SAILOR greatly outperforms the baselines in average task success rate, including the most competitive baseline, BC-RNN (FT). BC-RNN performs poorly as it fails to learn an effective policy from a small number of demonstrations. In comparison, BC-RNN (R3M) shows significant improvements on the Franka Kitchen tasks, but performs worse on the CALVIN tasks. We hypothesize that this is due to the limited generalization ability of the pre-trained visual representations. The offline reinforcement learning baselines show more promising results on the Franka Kitchen tasks but struggle on the more challenging CALVIN tasks. Finally, FIST significantly under-performs our method. As FIST uses the same underlying skill model as our method, we attribute the limitations of FIST to its semi-parametric policy.
5.3 Ablation Study
We perform an extensive ablation study to understand the effects of various modeling choices on our method. First, we study the effect of the temporal predictability term in Eq. 2 on downstream task performance by removing it from Eq. 3 (No TP). Next we study the role of retrieval by training the policy solely on sub-trajectories in $\mathcal{D}_{\text{target}}$ (No Retrieval). We also study the opposite case, training the policy on all of $\mathcal{D}_{\text{prior}}$ and $\mathcal{D}_{\text{target}}$ (All Retrieval). Finally, we study the role of prior data on our method by training the skill and policy solely on $\mathcal{D}_{\text{target}}$ (No Prior Data). We also report ablations in Appendix A on the size of the prior, retrieval, and target datasets, in addition to the choice of retrieval method.
We present results in Table 2 for the more challenging CALVIN tasks. First, we find that both the temporal predictability objective and the retrieval mechanism have a significant impact on the final performance. It is worth noting that removing these components makes our model degenerate into the skill-based imitation learning setting of OPAL [ajay2020opal]. In fact, these two ablations perform worse than the naïve BC-RNN (FT) baseline for the CALVIN-Cleaning Up task, indicating that both an effective skill representation and the retrieval mechanism play a critical role in skill-based imitation learning. The All Retrieval ablation also performs suboptimally—qualitatively we observe that the robot often was “distracted” and performed behaviors unrelated to the target task, likely due to the multimodal distributions in the prior data. Finally, the No Prior Data ablation validates the role of prior data in learning effective skills and policy. Comparing the No Retrieval ablation and the No Prior Data ablation, we see a gap in performance, and this is attributed to the fact that the skills in the No Retrieval ablation are additionally trained on the prior data. Despite the loss of performance in the No Prior Data ablation, we still see that it outperforms the BC-RNN baseline by a significant margin. We attribute this to the temporal abstraction afforded by our skill-based learning framework. In sum, our ablation studies suggest that effective skill abstractions, coupled with mechanisms that effectively leverage the prior data, allow us to achieve strong results.
5.4 Real World Experiments
Finally, we showcase the efficacy of our method in the real world with a kitchen environment involving eight food items, receptacles, a stove, and a serving area (see Fig. 4). We first collect a play dataset of exploratory interactions involving the food items and receptacles. We consider three target tasks: (1) Real-Breakfast: setting up a breakfast table by placing the bread, butter, and milk in the serving area; (2) Real-Cook: cooking a meal by placing the fish, sausage, and tomato into the frying pan; (3) Real-Cook-Pan: a variant of the Real-Cook task involving placing the pan onto the stove. We collect 30 demonstrations; refer to Section B.3 for detailed descriptions of the tasks and datasets. We evaluate SAILOR against the most competitive baseline, BC-RNN (FT) (see Section C.3 for our evaluation protocol). We find that while on Real-Breakfast both methods achieve a success rate of 76.7%, on Real-Cook our method significantly outperforms BC-RNN (FT) with a success rate of 73.3% vs. 23.3%, and similarly for Real-Cook-Pan (76.7% vs. 46.7%). To see the value of prior data we ran the No Prior Data ablation for the Real-Cook task, achieving a 53.3% success rate compared to our 73.3%. Interestingly, the No Prior Data ablation substantially outperforms the BC-RNN (FT) baseline on this task (23.3%), even though the latter had access to additional prior data. Overall, we observe that BC-RNN (FT) often failed to correctly grasp objects. One hypothesis for this result is that the pre-training stage biases the policy to learn the multi-modal behaviors in the prior dataset, preventing the policy from learning specialized target task behaviors during the fine-tuning phase.
While our method shows significant promise, it leaves limitations that we hope to address in future work. First, acquiring large amounts of multi-task prior data is difficult and costly. To amortize the high cost, large prior multi-task datasets should be useful in a diverse range of downstream tasks, rather than a handful. In this work, we evaluate our method in a limited set of target tasks and leave it for future work to scale up the variety of tasks. We hope (and believe) that the need for large robotic datasets will be addressed in the coming years [mandlekar2018roboturk, jang2022bcz]. Second, our method is more computationally expensive than the BC-RNN baseline [robomimic2021], due to the higher number of losses and networks used. Third, our experiments focus on domains and datasets where the prior data and target tasks are reasonably close to each other. Notably our experiments do not evaluate generalization to unseen objects between the prior and target datasets. It would be interesting to investigate methods that are tolerant to much larger domain shifts between prior and target task data.
We present SAILOR, a skill-based imitation learning framework for robot manipulation. Our method uses prior data to construct a latent space of predictable and consistent skill representations. It uses these latent skills as the temporal abstraction to learn policies for vision-based manipulation. Key to its effectiveness are our newly designed representation learning objectives and retrieval-based data augmentation procedure. We demonstrate that our method can solve long-horizon manipulation tasks in simulation and on physical hardware. It offers a data-efficient way of programming robots with new behaviors using a small number of target task demonstrations. For future work, we plan to address the limitations discussed in the previous section and investigate the effectiveness of this approach with various forms of prior data at different scales.
We would like to thank Jake Grigsby, Huihan Liu, and Zhenyu Jiang for providing feedback on this manuscript. We would also like to thank Yifeng Zhu for real robot infrastructure support. We acknowledge the support of the National Science Foundation (1955523, 2145283), the Office of Naval Research (N00014-22-1-2204), and Amazon.
Appendix A Additional Experiments
A.1 Ablation on prior data
We perform a more fine-grained study on the role of prior data using the CALVIN domain. We compare to variants of our method using 25% or 50% of the available prior data to see whether the quantity of prior data plays a significant role in downstream policy performance. We also compare to a variant of our method that only utilizes prior data collected from environments different from the target task environment. In the context of our CALVIN domain, the prior data spans four environments (A, B, C, D), while the target task data only spans one environment (D). Thus for this ablation we only consider data from unseen environments (A, B, C) for our prior dataset. This ablation examines the robustness of our method under environmental mismatch between the prior data and target task data. We outline all results in Table 3.
First, we see that increasing the size of prior data yields greater downstream policy performance. There is a significant performance gain from using no prior data to using 25% prior data, from which point increasing the amount of prior data leads to smaller gains. In addition, restricting the prior data to unseen environments still results in a meaningful performance increase, which is a promising sign that our method can operate even under controlled environmental distribution shifts between the prior and target task data.
A.2 Ablation on retrieval metric
In this work we use distance in our latent skill space as the underlying distance measure for our retrieval operation. We also consider performing retrieval based on KL-divergence distances, i.e., given two inference distributions $q_1$ and $q_2$, we compute their distance as the average of the forward and reverse KL divergences: $d(q_1, q_2) = \tfrac{1}{2}\big[D_{\text{KL}}(q_1 \,\|\, q_2) + D_{\text{KL}}(q_2 \,\|\, q_1)\big]$. This metric effectively incorporates both the mean and standard deviation of the inference distributions. We compare our standard retrieval procedure with the KL-based retrieval operation in Table 4.
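The averaged forward and reverse KL distance can be computed in closed form when the inference distributions are Gaussians with diagonal covariance, which is the assumption made in this sketch. The function names and the toy inputs are illustrative only.

```python
import math
from typing import Sequence


def kl_diag_gaussian(mu_q: Sequence[float], std_q: Sequence[float],
                     mu_p: Sequence[float], std_p: Sequence[float]) -> float:
    """KL(q || p) for Gaussians with diagonal covariance, summed over dimensions."""
    total = 0.0
    for m_q, s_q, m_p, s_p in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5
    return total


def symmetric_kl(mu_a: Sequence[float], std_a: Sequence[float],
                 mu_b: Sequence[float], std_b: Sequence[float]) -> float:
    """Average of the forward and reverse KL divergences."""
    forward = kl_diag_gaussian(mu_a, std_a, mu_b, std_b)
    reverse = kl_diag_gaussian(mu_b, std_b, mu_a, std_a)
    return 0.5 * (forward + reverse)


# Two unit-variance Gaussians whose means differ by 2 in one dimension:
# each direction gives (1 + 4) / 2 - 0.5 = 2.0, so the average is also 2.0.
distance = symmetric_kl([0.0], [1.0], [2.0], [1.0])
same = symmetric_kl([0.3, -1.0], [1.0, 1.0], [0.3, -1.0], [1.0, 1.0])
```

Unlike a distance on the means alone, this quantity also changes when only the standard deviations differ, which is what lets it account for the encoder's uncertainty.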
We do not find a significant difference between these two variants, suggesting that our method can work with alternative distance metrics for retrieval.
A.3 Ablation on target dataset
We perform an ablation study on the size of the target dataset. We find for the CALVIN-Setting Up task that increasing the number of target task demonstrations from 30 to 100 yields an increase in success rate (see Table 5). Note that while this increase is promising, it comes at the expense of an additional burden on the humans collecting demonstrations for the target task.
A.4 Ablation on retrieval dataset
We perform a detailed ablation study on the quantity and quality of the retrieved data for policy learning.
Recall our retrieval procedure: (1) we first randomly sample N sub-trajectories from the prior dataset as possible retrieval candidates, (2) sort them according to their relevance to the target task, and (3) select the top r% of candidates for retrieval. We study the following ablations, which use our standard settings of N=250,000 and r=10% unless otherwise stated:
No Retrieval: N=0, i.e., we retrieve no data.
All Retrieval: N=sizeof(prior dataset), r=100%, i.e., we retrieve the entire prior dataset. For CALVIN this is 2.3M sub-trajectories.
Random Retrieval: instead of sorting the retrieval candidates according to relevance, we randomly select 10% of the candidates. This is a test of data quality, to see whether the relevance of the retrieved sub-trajectories to the target task matters.
2 / 50 / 90 % Retrieval: we retrieve r=2%, 50%, or 90% of the N candidates. This tests whether our setting of r=10% is a good threshold for retrieval.
Large Retrieval: N=sizeof(prior dataset). This ablation uses the same threshold r=10% as our method to perform retrieval but considers all prior sub-trajectories as retrieval candidates and thus retrieves a significantly larger quantity of data.
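The ablation variants listed above differ only in the number of candidates N, the retained fraction r, and whether candidates are ranked or chosen at random. The sketch below expresses them as arguments to one hypothetical function, `select_retrieval`; the function name, its parameters, and the precomputed per-item distance list are assumptions for illustration, and the toy sizes are far smaller than the N=250,000 used in the experiments.

```python
import random
from typing import List, Sequence


def select_retrieval(distances: Sequence[float],
                     num_candidates: int,
                     top_fraction: float,
                     rng: random.Random,
                     random_selection: bool = False) -> List[int]:
    """Sample candidates from the prior data, then keep a fraction of them.

    `distances[i]` is the distance of prior sub-trajectory i to the target
    data (lower means more relevant).
    """
    num_candidates = min(num_candidates, len(distances))
    candidates = rng.sample(range(len(distances)), num_candidates)
    keep = round(len(candidates) * top_fraction)
    if random_selection:
        # Random Retrieval: same quantity of data, relevance ignored.
        return rng.sample(candidates, keep)
    # Standard procedure: most relevant candidates first.
    return sorted(candidates, key=lambda i: distances[i])[:keep]


# Toy prior dataset of 100 sub-trajectories whose distance equals its index.
toy_distances = [float(i) for i in range(100)]
rng = random.Random(0)
ours = select_retrieval(toy_distances, 50, 0.10, rng)
random_variant = select_retrieval(toy_distances, 50, 0.10, rng, random_selection=True)
no_retrieval = select_retrieval(toy_distances, 0, 0.10, rng)
all_retrieval = select_retrieval(toy_distances, len(toy_distances), 1.0, rng)
```

Expressed this way, Large Retrieval is simply the standard call with `num_candidates` set to the full prior dataset size.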
Note that Ours, No Retrieval, and All Retrieval are from the original submission and we include these results again for reference.
We present results on the CALVIN Setting Up and Cleaning Up tasks in Table 6. We make the following observations for CALVIN-Cleaning Up:
Data quality is important. The Random Retrieval ablation retrieves the same quantity of data as Ours, but of lower quality. The performance significantly degrades as a result. We see the same trend in the 50 / 90% Retrieval experiments, i.e., as we increase the threshold for retrieval from r=10% to 50% and 90% (and thus decrease the quality of data) we see a consistent and significant drop in performance.
Our standard setting of r=10% is optimal, striking the right balance between diversity and quality of data. Lower and higher thresholds (2%, 50%, 90%) perform worse.
Retrieving larger amounts of data does not have a major impact on performance. Large Retrieval achieves performance within the margin of error as Ours.
CALVIN-Setting Up, however, tells a different story. For this task data quality does not appear to matter, as the Random Retrieval and 2% / 50% / 90% Retrieval ablations all perform similarly to Ours within the margin of error. One possible explanation for this observation is that the Setting Up task involves a more diverse range of behaviors than Cleaning Up: the Setting Up task involves manipulating all components of the environment whereas the Cleaning Up task involves only a subset. Another hypothesis is that the prior data is more biased towards behaviors seen in the Setting Up task. Because many of the behaviors in the Cleaning Up task are mirror behaviors of the Setting Up task, this may result in an unfavorable bias for the Cleaning Up task, necessitating a retrieval procedure to filter out irrelevant behaviors.
The implication of all of these results is that the importance of retrieval may be task and dataset dependent, with some tasks being especially sensitive to the choice of retrieved data.
Appendix B Tasks and Datasets
b.1 Franka Kitchen
The Franka Kitchen domain consists of a simulated 9-DoF Franka robot operating in a kitchen environment comprising a microwave, kettle, light switch, stove knobs, and a sliding and hinge cabinet. In our experiments the agent operates the robot via joint torque control resulting in a 9-dimensional action space. For observations, the agent has access to proprioceptive information consisting of the 9-dimensional joint values of the robot, in addition to RGB images from a third-person view camera and an eye-in-hand camera.
Prior Data. This environment is accompanied by approximately 600 human demonstrations, each performing a subset of four out of seven possible subtasks: opening the microwave, turning on the light switch, turning on the top burner, turning on the bottom burner, moving the kettle, opening the hinge cabinet, and opening the sliding cabinet. We consider two prior datasets: (1) using all demonstrations except the ones corresponding to the target task (Kitchen-All); and (2) using all demonstrations except those that involve interacting with the microwave (Kitchen-No Microwave). These prior datasets have 584 and 235 demonstrations, respectively.
Target Task. We consider one target task demonstrating a specific permutation of subtasks: opening the microwave, followed by moving the kettle, flipping on the light switch, and opening the sliding cabinet. We define task success as whether the agent has completed all of these subtasks (in no particular order). For the target dataset we obtain all demonstrations in the original dataset that perform this specific permutation of subtasks, resulting in 18 demonstrations. Note that this dataset is equivalent to the kitchen-complete-v0 dataset in the d4rl benchmark [fu2020d4rl]. These demonstrations have an average length of timesteps.
b.2 CALVIN
The CALVIN domain consists of a simulated 7-DoF Franka robot operating in a playroom environment comprising a drawer, cubbies, two lights, and three blocks. The environment comes in four variants (see Figure 3), each with different textures, block sizes, and fixture locations. In our experiments the agent operates the robot via inverse kinematics control resulting in a 7-dimensional action space. For observations, the agent has access to proprioceptive information consisting of the robot end effector pose and gripper state, in addition to RGB images from a third-person view camera and an eye-in-hand camera.
Prior Data. This environment is accompanied by a large dataset of task-agnostic “play” data across all four environment variants, comprising 2.3M transitions. The play data encompass diverse behaviors, such as opening and closing drawers, turning the lights on and off, and picking, placing, and pushing blocks. We use all play data as prior data to solve two target tasks.
Target Tasks. We consider two target tasks:
CALVIN-Setting Up: the robot must turn on the lights, retrieve the pink block from the drawer, place it on the table, and retrieve the red and blue blocks from the cubby area and place them on the table. We define task success as whether the agent has completed all of these subtasks (in no particular order). At environment resets the lights are always off, the pink block is randomly initialized inside the (closed) drawer, and the red and blue blocks are randomly initialized inside the cubby area with one block in the left region of the cubby and the other block in the right region of the cubby. For this task we collect 30 demonstrations, amounting to about half an hour of data collection. In these demonstrations, we first turn on the lights, then retrieve the pink block, then retrieve the first unoccluded block from the cubby area, then move the slider to retrieve the other block from the other side of the cubby area. These demonstrations have an average length of timesteps.
CALVIN-Cleaning Up: the robot must open the drawer, place all three blocks into the drawer, close the drawer, and turn off the lights. We define task success as whether the agent has completed all of these subtasks (in no particular order). At environment resets the lights are always on, the drawer is closed, and the three blocks are randomly placed in left, center, and right regions of the table. For this task we collect 30 demonstrations, amounting to about half an hour of data collection. In these demonstrations, we first open the drawer, then place the blocks one by one into the drawer from right to left, then close the drawer, and finally turn off the lights. These demonstrations have an average length of timesteps.
b.3 Real World Kitchen
We designed a real world kitchen environment to study the utility of our method on physical hardware. Our kitchen environment comprises a Flexa toy kitchen (https://flexa-usa.com/collections/play/products/toys-the-kitchen), a set of toy food items (https://www.amazon.com/Melissa-Doug-Food-Groups-Hand-Painted/dp/B0000BX8MA), a number of serving items (placemat, plate, knife, fork), and a small pot and pan that we purchased from a local store. We use a 7-DoF Franka Emika Panda robot which is operated via Operational Space Control (OSC) [khatib1995osc]. We found OSC to be a fitting choice, as it offers task-space compliant behavior that makes for a more intuitive data collection experience. We restrict the OSC controller to the position and yaw of the end effector (we did not find roll and pitch actuation to be necessary for our real world tasks, so we opted for a simpler action space), which combined with the gripper controller results in a 5-dimensional action space. For observations, the agent has access to proprioceptive information consisting of the robot end effector pose and gripper state, in addition to RGB images from a third-person view camera and an eye-in-hand camera.
Prior Data. We collect a large prior dataset of task-agnostic play behaviors involving the food items and the pot and pan. Overall our prior dataset involves trajectories each with approximately , timesteps, resulting in approximately , total timesteps. For each trajectory we first initialize the scene by randomly sampling four out of eight food items (milk, bread, butter, sausage, fish, tomato, banana, cheese) and randomly placing these four items around the serving area. We also randomly initialize the pot and pan on the two front stove burners or occasionally place one on the table next to the serving area. We then randomly pick and place food items either on the table, the serving area, or the pot and pan. We also occasionally pick and place the pot or pan to the table or stove burners.
Target Tasks. We consider three target tasks:
Real-Breakfast: the objective of this task is to place the bread, butter, and milk from the table onto the serving area. These food items are initialized randomly in the vicinity of three possible locations on the table: the left, center, and right of the region preceding the serving area. We consider three possible permutations for the placement of the objects onto these three regions (in left-center-right format): butter-bread-milk, bread-butter-milk, and butter-milk-bread. The pots and pans are initialized on the front stove burners. We define task success as whether the robot has (in no particular order) placed the bread onto the plate, the butter to the left of the plate on the placemat, and the milk to the right of the plate on the placemat. For this task we collect 30 demonstrations, amounting to about half an hour of data collection. In these demonstrations we place the bread, butter, and milk in order onto their corresponding goal locations. These demonstrations have an average length of timesteps.
Real-Cook: the objective of this task is to place the fish, sausage, and tomato from the table into the pan. These food items are initialized randomly in the vicinity of three possible locations on the table: the left, center, and right of the region preceding the serving area. We consider three possible permutations for the placement of the objects onto these three regions (in left-center-right format): fish-sausage-tomato, sausage-fish-tomato, and fish-tomato-sausage. The pots and pans are initialized on the front stove burners. We define task success as whether the robot has (in no particular order) placed these three items into the pan. For this task we collect 30 demonstrations, amounting to about half an hour of data collection. In these demonstrations we place the food items from left to right (in order) into the pan. These demonstrations have an average length of timesteps.
Real-Setup-Pan: the objective of this task is to place the pan from the table onto the stove and subsequently place the fish and sausage into the pan. The pan is initialized randomly in the vicinity of the right region of the table preceding the serving area. The food items are initialized randomly in the vicinity of the left and center regions of the table preceding the serving area. We consider two possible permutations for the placement of the objects onto these three regions (in left-center-right format): fish-tomato-pan and tomato-fish-pan. The pot is initialized on the front stove burners. We define task success as whether the robot has (in no particular order) placed the pan onto the stove and the two food items into the pan. For this task we collect 30 demonstrations, amounting to about half an hour of data collection.
Appendix C Implementation Details
c.1 Model Architecture
Our implementation builds on robomimic [robomimic2021], a recent open source codebase with extensive benchmarking results across a number of imitation learning algorithms. We adopted the same neural modules (same RNN backbone, VAE, visual perception encoders) for our algorithm, and in fact our BC-RNN baseline uses the exact implementation from robomimic.
Our model specifically consists of five neural network modules: four networks for the skill model, comprising an RNN encoder, an RNN decoder, a feedforward VAE prior, and a feedforward temporal prediction network; and one RNN network for the policy. We also utilize a feedforward deterministic inverse dynamics model, but we found that it does not lead to a significant change in downstream policy learning results.
Observation Encoder. Four out of the five modules described above take observation inputs (among other potential inputs), and each of these modules is equipped with an observation encoder to process these observations. The observation encoder specifically consists of ResNet-18 backbones [he2016deep] to encode the third-person image and eye-in-hand image, and a multi-layer perceptron (MLP) for all remaining low-dimensional observation inputs. Note that we pre-process the ResNet inputs with random cropping and post-process the outputs with a Spatial Softmax [finn2016deep] pooling layer. After processing the image and low-dimensional observation inputs we concatenate the resulting outputs to form one unified observation encoding. Note that our RNN encoder, RNN decoder, and RNN policy each process a sequence of observations individually using the observation encoder and then process these encoded observations into one unified representation with a recurrent neural network.
Skill Model. The skill model is a Variational Autoencoder that encodes sub-trajectories into a latent skill representation and decodes information back into the actions of sub-trajectories. The skill encoder and decoder are RNNs with a 2-layer LSTM followed by a 2-layer MLP, while the VAE prior and temporal prediction network are 2-layer MLPs.
Policy. The policy is a 2-layer LSTM network that maps a history of observations into a latent skill. We also condition the policy on a dataset id to indicate whether the policy is optimized on the target dataset or the retrieval dataset, to prevent potential interference between the target and retrieved data (see Algorithm 2 for additional details). Note that we can extend our policy to incorporate fine-grained goal information by conditioning on additional context information such as goal images or language goals [gupta2019rpl, mandlekar2020learning, jang2022bcz].
Our algorithm consists of two phases. In the first phase we pre-train our skill model on sub-trajectories in the prior dataset (see Algorithm 1 for further details; in our code we also have a slowness term to ensure that two nearby sub-trajectories have similar skill embeddings, but we did not find this feature to have a noticeable impact on downstream policy performance and therefore omit it from the algorithm pseudocode for simplicity). In the subsequent phase we are given the target dataset and we proceed to learn the policy and fine-tune the skill model. Before we perform policy learning, we first retrieve sub-trajectories in the prior dataset that have similar embeddings to those in the target dataset. We aggregate these retrieved embeddings into our retrieval dataset. We then train the policy jointly on embeddings from the target and retrieval datasets. At the same time we continue to fine-tune the skill model with sub-trajectories sampled from both the target and retrieval datasets. We summarize these steps in Algorithm 2.
We sample fixed-length sub-trajectories uniformly at random to train our model, following recent skill-based imitation learning works [ajay2020opal, pertsch2021skild, hakhamaneshi2021fist]. More specifically, for each dataset we concatenate all trajectories into one continuous stream of data and uniformly sample sub-trajectories from this stream. Note that this can result in sampling overlapping sub-trajectories. For training the policy we additionally train on the frame stack of observations preceding the sampled sub-trajectory. There are some edge cases, such as when the sub-trajectory intersects with the next trajectory and when the frame stack intersects with the previous trajectory. We deal with these cases by padding all data from the offending consecutive trajectory with the first / last observation of the current trajectory.
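The sampling and padding scheme described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the helper names (`build_stream`, `slice_with_padding`, `sample_subtrajectory`) are hypothetical, and the "current" trajectory is assumed to be the one containing the first step of the sampled sub-trajectory.

```python
import random


def build_stream(trajectories):
    """Concatenate trajectories into one stream and record, for every stream
    index, the (first, last) stream indices of the trajectory it belongs to."""
    stream, bounds = [], []
    for traj in trajectories:
        first = len(stream)
        stream.extend(traj)
        bounds.extend([(first, len(stream) - 1)] * len(traj))
    return stream, bounds


def slice_with_padding(stream, bounds, start, horizon, frame_stack):
    """Return (history, sub_trajectory) for a sub-trajectory starting at `start`."""
    first, last = bounds[start]  # boundaries of the current trajectory

    def clamp(i):
        # Indices that fall into the previous trajectory map to the first
        # observation of the current one; indices that fall into the next
        # trajectory map to its last observation.
        return min(max(i, first), last)

    history = [stream[clamp(i)] for i in range(start - frame_stack, start)]
    sub_trajectory = [stream[clamp(i)] for i in range(start, start + horizon)]
    return history, sub_trajectory


def sample_subtrajectory(trajectories, horizon, frame_stack, rng=None):
    """Uniformly sample a start index from the stream; overlaps are allowed."""
    rng = rng or random.Random(0)
    stream, bounds = build_stream(trajectories)
    start = rng.randrange(len(stream))
    return slice_with_padding(stream, bounds, start, horizon, frame_stack)


# Toy usage with two short trajectories of labelled observations.
stream, bounds = build_stream([["a0", "a1", "a2"], ["b0", "b1", "b2"]])
runs_past_end = slice_with_padding(stream, bounds, start=2, horizon=3, frame_stack=2)
starts_at_boundary = slice_with_padding(stream, bounds, start=3, horizon=2, frame_stack=2)
```

Clamping indices to the current trajectory handles both edge cases with one rule and also covers the very first and very last positions of the stream.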
To perform a policy rollout, we first obtain a skill from the policy. We execute this skill with our closed-loop skill decoder for the length of the skill sub-trajectory, and we subsequently repeat the process by obtaining a new skill from the policy. Note that we do not preempt skill execution; we execute all timesteps until completion (we believe this is a reasonable choice, as (1) the closed-loop skill decoder can react to current environment conditions during skill execution, and (2) the policy is still operating at high frequency and can react accordingly). We terminate the episode either when the agent has successfully solved the task or when the agent has exceeded the time budget for the rollout. We assess each episode based on whether the agent successfully solved the task in the allotted time budget. While other metrics also exist (e.g., time to complete the task), we chose binary success for its popularity and relative simplicity. We elaborate further on our evaluation protocol:
Simulation Experiments: We evaluate the success rate across 3 seeds (unless otherwise noted) and report the average and standard deviation across all seeds. To evaluate a seed, we perform 100 policy rollouts at regular checkpoint intervals and record the success rate for each checkpoint. We then record the success rate for the seed as the highest success rate across all checkpoints evaluated for that seed (this is the same evaluation protocol used in [robomimic2021]).
Real World Experiments: Due to the challenges of real-world evaluation we only evaluate 1 seed for each baseline. To evaluate an experiment, we perform an initial evaluation of different policy checkpoints, evaluating each checkpoint for only a few trials. Upon choosing the most promising checkpoint we perform 30 rollouts and report the success rate over these rollouts.
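The rollout loop and the simulation scoring rule can be sketched as below. Everything about the interfaces is hypothetical: the environment returns `(observation, success)` from `step`, the policy is simplified to take only the current observation rather than a history, and the population standard deviation is an assumption since the text does not say which estimator is used.

```python
import statistics


def rollout(env, policy, skill_decoder, skill_horizon, max_steps):
    """Run one episode; return True if the task is solved within `max_steps`."""
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        skill = policy(obs)  # pick a new skill
        for _ in range(skill_horizon):  # execute it to completion, no preemption
            action = skill_decoder(obs, skill)  # closed loop: sees the current obs
            obs, success = env.step(action)
            steps += 1
            if success:
                return True
            if steps >= max_steps:
                return False
    return False


def aggregate(per_seed_checkpoint_rates):
    """Score each seed by its best checkpoint, then report mean and std over seeds."""
    seed_scores = [max(rates) for rates in per_seed_checkpoint_rates]
    # ASSUMPTION: population standard deviation.
    return statistics.mean(seed_scores), statistics.pstdev(seed_scores)


class CountingEnv:
    """Toy stand-in environment that reports success after a fixed number of steps."""

    def __init__(self, steps_to_solve):
        self.steps_to_solve = steps_to_solve
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, self.t >= self.steps_to_solve


def dummy_policy(obs):
    return "skill"


def dummy_decoder(obs, skill):
    return 0.0


solved = rollout(CountingEnv(7), dummy_policy, dummy_decoder, skill_horizon=5, max_steps=10)
timed_out = rollout(CountingEnv(7), dummy_policy, dummy_decoder, skill_horizon=5, max_steps=6)
mean_rate, std_rate = aggregate([[0.1, 0.5], [0.3, 0.2], [0.4, 0.7]])
```

For the real world protocol, the same `rollout` function would be called 30 times on the single selected checkpoint and the fraction of successes reported directly.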
All baselines are implemented in the robomimic codebase for fair comparison. We briefly elaborate on these implementations as follows:
BC-RNN: We use the default implementation of BC-RNN in robomimic and we use identical hyperparameters as those reported in the robomimic study paper [robomimic2021].
BC-RNN (FT): We use an identical architecture and identical hyperparameters as BC-RNN. We first train the baseline on the prior dataset and subsequently fine-tune on the target dataset via a second stage of training. We also experimented with jointly training a task-conditioned BC-RNN policy on the prior and target datasets, but we found that it yielded very poor performance due to the multi-modality of actions in the prior data.
BC-RNN (R3M): We use an identical architecture and identical hyperparameters as BC-RNN but with a pretrained R3M visual representation. We specifically replace the weights of our ResNet-18 networks with the pretrained ResNet-18 weights from R3M (https://github.com/facebookresearch/r3m). We follow the same practice as the R3M paper and freeze the pretrained ResNet weights during downstream imitation learning.
IQL: We base our implementation on the publicly available PyTorch implementation of IQL (https://github.com/rail-berkeley/rlkit/tree/master/examples/iql).
IQL (UDS): We make a small modification to our IQL implementation. For each batch that we sample from the target dataset, we also sample an equivalent-size batch from the prior dataset with its rewards overwritten by a fixed constant. We then perform gradient updates on the aggregated data from both of these batches.
FIST: We use the same underlying skill model as our method but a semi-parametric policy in place of our parametric neural network policy. We use an identical scheme for the semi-parametric policy as the FIST paper [hakhamaneshi2021fist].
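The batch construction for the IQL (UDS) baseline can be illustrated with a short sketch. The transition format (dictionaries with a `reward` key), the function name `sample_uds_batch`, and the placeholder reward value of 0.0 are assumptions for the example. It also assumes that the relabeled batch is the one drawn from the prior data.

```python
import random


def sample_uds_batch(target_data, prior_data, batch_size, prior_reward=0.0, rng=None):
    """Build one training batch: `batch_size` target transitions plus an
    equal-size batch of prior transitions with relabeled rewards."""
    rng = rng or random.Random(0)
    target_batch = [dict(rng.choice(target_data)) for _ in range(batch_size)]
    prior_batch = []
    for _ in range(batch_size):
        transition = dict(rng.choice(prior_data))  # copy so the dataset is untouched
        transition["reward"] = prior_reward  # ASSUMPTION: placeholder constant
        prior_batch.append(transition)
    return target_batch + prior_batch


# Toy usage with two tiny datasets.
target_data = [{"obs": 0, "action": 1, "reward": 1.0}]
prior_data = [{"obs": 5, "action": 2, "reward": 3.0}, {"obs": 6, "action": 3, "reward": 4.0}]
batch = sample_uds_batch(target_data, prior_data, batch_size=4)
```

The combined batch would then be passed to the usual IQL update; copying each transition keeps the relabeling local to the batch.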
c.5 Environment Implementation
c.5.1 Gripper Logic
We elaborate on the gripper logic in our environments. The gripper state is either the position of the gripper fingers (Franka Kitchen, Real World Kitchen) or the opening width of the gripper (CALVIN). The gripper action is a continuous 1-D variable, which we interpret as either an opening or a closing command depending on its value. The gripper is controlled via position control. When the agent specifies a closing action, the position target of the controller is set to close the gripper fingers all the way (and for an opening action the target is set to open the gripper fingers all the way). There are limits on the force and velocity of the fingers in order to ensure gripper stability.
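A minimal sketch of this gripper logic is given below. The threshold of 0.0, the convention that values above the threshold mean "close", the function names, and the simple per-step velocity clamp are all assumptions for illustration; the actual conventions and limits of each environment may differ.

```python
def gripper_position_target(gripper_action, open_position, closed_position, threshold=0.0):
    """Map the continuous 1-D gripper action to a position-control target."""
    # ASSUMPTION: actions above the threshold close the gripper, all others open it.
    if gripper_action > threshold:
        return closed_position  # close the fingers all the way
    return open_position  # open the fingers all the way


def limited_step(current, target, max_delta):
    """Move toward the target by at most `max_delta` per control step,
    standing in for a velocity limit on the fingers."""
    delta = max(-max_delta, min(max_delta, target - current))
    return current + delta


# Toy usage: finger position in metres, fully open at 0.08 and closed at 0.0.
target = gripper_position_target(0.7, open_position=0.08, closed_position=0.0)
next_position = limited_step(current=0.08, target=target, max_delta=0.01)
```

Because the target is always one of the two extremes, the continuous action effectively acts as a binary open/close command, with the position controller and its limits determining how fast the fingers move.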
Appendix D Hyperparameters
We adopted a similar set of hyperparameters as the BC-RNN baseline from robomimic: we used the same LSTM settings, batch size, and RNN policy history length. We also experimented with different sub-trajectory lengths for the skill model and adopted the best-performing length. Longer horizons may be helpful in some settings; however, we hypothesize that RNN-based architectures lack the capacity to accurately predict actions over significantly longer horizons. It would be interesting to investigate whether the optimal sub-trajectory length changes under a Transformer-based [vaswani2017transformer] architecture.
|Skill encoder: # LSTM hidden units|
|Skill encoder: MLP hidden sizes|
|Skill decoder: # LSTM hidden units|
|Skill decoder: MLP hidden sizes|
|Skill prior: hidden sizes|
|TC: hidden sizes|
|Policy: # LSTM hidden units|
|Skill latent dimension|
|Skill KL weight|
|Retrieval weight||for Real tasks, else|
|# observation history frames|
|Learning rate: skill VAE|
|Learning rate: TP|
|Learning rate: Policy|
|Retrieval: max # samples||, for Real tasks, else ,|
|Retrieval: max # samples||,|
|% of samples chosen for retrieval||for Real tasks, else|
|Evaluation checkpoints freq|
|BC-RNN (FT): phase 1||
|BC-RNN (FT): phase 2||
|FIST: phase 1||
|FIST: phase 2|
|Ours: phase 1||
|Ours: phase 2|
|Task||Image Size||Rollout Length|
|CALVIN: Setting Up|
|CALVIN: Cleaning Up|