Train generalizable policies for kit assembly with self-supervised dense correspondence training.
Is it possible to learn policies for robotic assembly that can generalize to new objects? We explore this idea in the context of the kit assembly task. Since classic methods rely heavily on object pose estimation, they often struggle to generalize to new objects without 3D CAD models or task-specific training data. In this work, we propose to formulate the kit assembly task as a shape matching problem, where the goal is to learn a shape descriptor that establishes geometric correspondences between object surfaces and their target placement locations from visual input. This formulation enables the model to acquire a broader understanding of how shapes and surfaces fit together for assembly – allowing it to generalize to new objects and kits. To obtain training data for our model, we present a self-supervised data-collection pipeline that obtains ground truth object-to-placement correspondences by disassembling complete kits. Our resulting real-world system, Form2Fit, learns effective pick and place strategies for assembling objects into a variety of kits – achieving 90% average success rates under different initial conditions (e.g. varying object and kit poses), 94% success under new configurations of multiple kits, and over 86% success with completely new objects and kits.
Across many assembly tasks, the shape of an object can often inform how it should be fitted with other parts. For example, in kit assembly (i.e., placing object(s) into a blister pack or corrugated display to form a single unit – see examples in Fig. 1), the profile of an object likely matches the silhouette of the cavity in the paperboard packaging (i.e., kit) that it should be placed into.
While these low-level geometric signals can provide useful cues for both perception and planning, they are often overlooked in many modern assembly methods, which typically abstract visual observations into 6D object poses and then plan on top of the inferred poses. By relying heavily on accurate pose information, these algorithms remain unable to generalize to new objects and kits without task-specific training data and cost functions. As a result, these systems also struggle to quickly scale up to the subclass of real-world assembly lines that may see new kits every two weeks (e.g. due to seasonal items and packages).
In this work, we explore the following: if we formulate the kit assembly task as a shape matching problem, is it possible to learn policies that can generalize to new objects and kits? To this end, we propose Form2Fit, an end-to-end pick and place formulation for kit assembly that leverages shape priors for generalization. Form2Fit has two key aspects:
Shape-driven assembly for generalization. We establish geometric correspondences between object surfaces and their target placement locations (e.g. empty holes/cavities) by learning a fully convolutional network that maps from visual observations of a scene to dense pixel-wise feature descriptors. During training, the descriptors are supervised in a Siamese fashion and regularized so that they are more similar for ground truth object-to-placement correspondences. The key idea is that as the network trains over a variety of objects and target locations across multiple kitting tasks, it acquires a broader understanding of how shapes and surfaces fit together for assembly – subsequently learning a more generalizable descriptor that is capable of matching new objects and target locations.
Learning assembly from disassembly. We present a self-supervised data-collection pipeline that obtains training data for assembly (i.e., ground truth motion trajectories and correspondences between objects and target placements) by disassembling completed kits. Classic methods of obtaining training data for assembly (e.g. via human tele-operated demonstrations or scripted policies) are often time-consuming and expensive. However, we show that for kit assembly, it is possible to autonomously acquire large amounts of high-quality assembly data by disassembling kits through trial and error with pick and place, then rewinding the action sequences over time.
This enables our system to assemble a wide variety of kits under different initial conditions (e.g. different rotations and translations) with accuracies of 90% (rate at which an object is placed in its target kit with the correct configuration), and generalizes to new settings with mixtures of multiple kits as well as entirely new objects and kits unseen during training.
The primary contribution of this paper is to provide new perspectives on robotic assembly: in particular, we study the extent to which we can achieve generalizable kit assembly by formulating the task as a shape matching problem. We also demonstrate that it is possible to acquire substantial amounts of training data for kit assembly by reversing the action sequences collected from self-supervised disassembly. We provide extensive experiments in real settings to evaluate key components of our system. We also discuss some extensions to our formulation, as well as its practical limitations.
Object pose estimation for assembly. Classic methods for assembly are often characterized by a perception module that first estimates the 6D object poses of observed parts [1, 5, 13, 30, 36], followed by a planner that then optimizes for picking and placing actions based on the inferred object poses and task-informed goal states. For example, Choi et al. introduce a voting-based algorithm to estimate the poses of component parts for assembly from 3D sensor data. Litvak et al. leverage simulated depth images to amass training data for pose estimation. Jorg et al. use a multi-sensory approach with both vision and force-torque sensing to improve the accuracy of engine assembly.
While these methods have seen great success in highly structured environments, they require full knowledge of all object parts (e.g. with high-quality 3D object models) and/or substantial task-specific training data and task definitions (which require manual tuning). This limits their practical applicability to kit assembly lines in retail logistics or manufacturing, which can be exposed to new objects and kits as frequently as every two weeks. Handling such high task-level variation requires assembly algorithms that can quickly scale or adapt to new objects, which is the focus of our formulation, Form2Fit.
Reinforcement learning for assembly. Another line of work focuses on learning policies for assembly tasks to replace classic optimization-based or rule-based planners. To simplify the task, these works often assume full state knowledge (i.e., object and robot poses). For example, Popov et al. tackle the task of Lego brick stitching, where the state of the environment, including brick positions, is provided by the simulation environment. Thomas et al. learn a variety of robotic assembly tasks from CAD models, where the state of each object part is detected using QR codes.
More recent works [18, 22] eliminate the need for accurate state estimation by learning a policy that directly maps from raw pixel observations to actions through reinforcement learning. However, these end-to-end models require large amounts of training data and remain difficult to generalize to new scenarios (train and test cases are often very similar, with the same set of objects). In contrast, our system is able to learn effective assembly policies with a much smaller amount of data (500 disassembly sequences), and generalizes well to different object types and scene configurations (different poses and numbers of objects) without extensive fine-tuning.
Learning shape correspondences. Learning visual and shape correspondences is a fundamental task in vision and graphics, studied extensively in prior work via descriptors [34, 9, 27, 16, 10, 26]. These descriptors are either designed or trained to match between points of similar geometry across different meshes or from different viewpoints. Hence, rotation invariance is a desired property of these descriptors. In contrast to prior work, our goal is to learn a general matching function between objects and their target placement locations (i.e., holes/cavities in the kits) instead of matching between two similar shapes. We also want the descriptor to be “rotation-sensitive”, so that the shape matching result can inform the actions necessary for successful assembly.
Learning from reversing time. Time-reversal is a classic trick in computer vision for learning information from sequences of visual data (i.e., videos). For example, many works [31, 24, 7, 21] study how predicting the arrow of time or the order of frames in a video can be used to learn useful image representations and feature embeddings. Nair et al. use time-reversal as supervision for video prediction to learn effective visual foresight policies.
While many of these methods use time-reversal as a means to extract additional information from the order of frames, we instead use time-reversal as a way to generate correspondence labels from pick and place. We observed that a pick and place sequence for disassembly, when reversed, can serve as a valid sequence for quasi-static assembly. This is not true for all assembly tasks, particularly when more complex dynamics are involved, but this observation holds true for a substantial number of kit assembly tasks. Since it is easier to disassemble than assemble, we leverage time-reversed disassembly sequences (e.g. obtained from trial and error) to amass training data for assembly.
Form2Fit takes as input a visual observation of the workspace (including objects and kits), and outputs a prediction of three parameters: a picking location, a placing location, and an angle that defines a change in orientation between the picking and placing locations. These parameters are used with motion primitives on the robot to execute a respective pick, orient, and place operation. Our learning objective is to optimize our predictions of these parameters such that after each physical execution, an object is correctly placed into its target location and orientation in its corresponding kit. In this work, each prediction of the three parameters is i.i.d. and conditioned only on the current visual input – which is sufficient for sequential planning in kit assembly (see Sec. III-C and III-D).
The system consists of three network modules: 1) a suction network that outputs dense pixel-wise predictions of picking success probabilities (i.e., affordances) using a suction cup, 2) a placing network that outputs dense pixel-wise predictions of placing success probabilities on the kit, and 3) a matching network that outputs dense, pixel-wise, rotation-sensitive feature descriptors to match between objects and their corresponding placement locations in the kit. We use the placing network directly to infer the placing location, then use the suction and matching networks together to infer both the picking location and the change in orientation (see Sec. III-E). Fig. 2 illustrates an overview of our approach.
Our system is trained through self-supervision from time-reversed disassembly: randomly picking with trial and error to disassemble from a fully-assembled kit, then reversing the disassembly sequence to obtain training labels for the suction, placing, and matching networks. In the following subsections, we provide an overview of each module, then describe the details of data collection and training.
We represent the visual observation of the workspace as a grayscale-depth heightmap image. To compute this heightmap, we capture intensity and depth images from a calibrated and statically-mounted camera, project the data onto a 3D point cloud, and orthographically back-project upwards in the direction of gravity to construct a 2-channel heightmap image representation with both grayscale (generated from intensity) and height-from-bottom (depth) channels concatenated. Each pixel location thus maps to a 3D position in the robot’s workspace. The workspace covers a tabletop surface. The heightmap has a pixel resolution of with a spatial resolution of per pixel. We constrain the kit to lie within the left half of the workspace and the objects within the right half. This separation allows us to split the heightmap image into two halves: containing the kit, and containing the objects, each with a pixel resolution of . The supplemental file contains additional details on the representation.
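The heightmap construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `bounds` and `pixel_size` arguments, the `(x, y, z, intensity)` point format, and the choice of image columns for the left/right split are all assumptions made for the example.

```python
import math

def build_heightmap(points, bounds, pixel_size):
    """Orthographically project (x, y, z, intensity) points into two channels.

    bounds = ((x_min, x_max), (y_min, y_max), z_bottom) and pixel_size are
    hypothetical workspace parameters, not values taken from the paper.
    Returns (gray, depth) as row-major lists of lists.
    """
    (x_min, x_max), (y_min, y_max), z_bottom = bounds
    width = int(round((x_max - x_min) / pixel_size))
    height = int(round((y_max - y_min) / pixel_size))
    gray = [[0.0] * width for _ in range(height)]
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z, intensity in points:
        col = math.floor((x - x_min) / pixel_size)
        row = math.floor((y - y_min) / pixel_size)
        if not (0 <= row < height and 0 <= col < width):
            continue  # point lies outside the workspace
        h = z - z_bottom  # height-from-bottom
        if h >= depth[row][col]:  # keep only the topmost point per cell
            depth[row][col] = h
            gray[row][col] = intensity
    return gray, depth

def pixel_to_xy(row, col, bounds, pixel_size):
    """Map a pixel back to the (x, y) centre of its cell in the workspace."""
    (x_min, _), (y_min, _), _ = bounds
    return (x_min + (col + 0.5) * pixel_size, y_min + (row + 0.5) * pixel_size)

def split_halves(channel):
    """Split one channel into a left (kit) half and a right (object) half."""
    mid = len(channel[0]) // 2
    return [r[:mid] for r in channel], [r[mid:] for r in channel]
```

Because every pixel corresponds to a fixed cell of the tabletop, a prediction made in image space can be converted to a robot target with `pixel_to_xy` plus the stored height.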
The suction module uses a deep network that takes as input the heightmap image and predicts favorable picking locations for a suction primitive on the objects inside the kit (during disassembly) and outside the kit (during assembly). Stable suction points often correspond to flat, nonporous surfaces near an object’s center of mass.
Suction primitive. The suction primitive takes as input a 3D location and executes a top-down suction-based grasp centered at that location. This primitive executes in an open-loop fashion using stable, collision-free IK solves.
Network architecture. The suction network is a fully-convolutional dilated residual network [12, 20, 33]. It takes as input a grayscale-depth heightmap and outputs a suction confidence map with the same size and resolution as the input. Each pixel represents the predicted probability of suction success (i.e., suction affordance) when the suction primitive is executed at the 3D surface location (inferred from calibration) of the corresponding heightmap pixel. The supplemental file contains architectural details.
For certain kit assembly tasks, there may be sequence-level constraints that define the order in which objects should be placed into the kit. For example, to successfully assemble a five-pack kit of deodorants (see this test case in Fig. 3), the bottom layer must be filled with deodorants before the robot can proceed to fill the top layer.
One way our system enables sequential ordering is through a place module that predicts the next best placing location conditioned on the current state (i.e., observation) of the environment. The place module consists of a fully-convolutional network (same architecture as the suction network) which takes as input the kit heightmap and outputs a dense pixel-wise prediction of placing confidence values over that heightmap. The 3D locations of the pixels with higher confidence serve as better locations for the suction gripper to approach from a top-down angle while holding an object.
While the suction and placing modules provide a list of candidate picking and placing locations, the system requires a third module to 1) associate each suction location on the object to a corresponding placing location in the kit and 2) infer the change in object orientation. This matching module serves as the core of our algorithm, which learns dense pixel-wise orientation-sensitive correspondences between the objects on the table and their placement locations in the kit.
Network architecture. The matching module consists of a two-stream Siamese network, where each stream is a fully-convolutional residual network with shared weights across streams. Its goal is to learn a function $f$ that maps each pixel of the observed kit and objects in the heightmap to a $d$-dimensional descriptor space ($d = 64$ in our experiments), where closer feature distances indicate better object-to-placement correspondences, i.e., $f: \mathbb{R}^{H \times W \times 2} \to \mathbb{R}^{H \times W \times d}$, where $H$ and $W$ are the pixel height and width of the input heightmap, respectively.
The first stream of the network maps the object heightmap to a dense object descriptor map. The second stream maps a batch of kit heightmaps with 20 different orientations (i.e., multiples of 18°) to a batch of 20 kit descriptor maps. Each pixel in the kit heightmap maps to 20 kit descriptors (one for each rotation), but only one of them (the most similar) will match its corresponding object descriptor in the object descriptor map. The index of the rotation with the most similar kit descriptor informs the change in object orientation between picking and placing, i.e., the angle is that index multiplied by the 18° rotation step. In this way, the matching network not only establishes correspondences between object picking and kit placement locations, but also infers the change in object orientation between the pick and place.
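As a rough illustration of how a rotation index turns into an angle, the sketch below compares one object descriptor against the 20 rotated kit descriptors for a single kit pixel and picks the closest. The function names and the sign convention of the returned angle are assumptions for the example, not details confirmed by the paper.

```python
import math

NUM_ROTATIONS = 20
STEP_DEG = 360.0 / NUM_ROTATIONS  # 18 degrees per rotation bin

def l2(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_rotation(object_descriptor, kit_descriptors):
    """Pick the rotation whose kit descriptor is closest to the object descriptor.

    kit_descriptors holds one descriptor per rotation for a single kit pixel.
    Returns (rotation_index, angle_in_degrees, distance).
    """
    distances = [l2(object_descriptor, k) for k in kit_descriptors]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, index * STEP_DEG, distances[index]
```

Discretizing into 20 bins means the recovered angle is only accurate to within the 18° step, which is one reason the descriptors need to be rotation-sensitive rather than rotation-invariant.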
The matching module is trained using a pixel-wise contrastive loss, where for every pair of kit and object heightmaps, we sample non-matches from the object heightmap and all 20 rotations of the kit heightmap, and matches from the object heightmap but only the rotation of the kit heightmap corresponding to the ground-truth angle. The loss function thus encourages descriptors to match solely at the correct rotation of the kit image, while non-matches are pushed to be at least a feature distance margin apart. See the supplemental for additional details.
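A pixel-wise contrastive loss of this kind can be sketched as below. This uses the commonly seen form (squared distance for matches, squared hinge on the margin for non-matches); the exact formulation used by the authors is in their supplemental material, so the equal weighting of the two terms and the default `margin` value here are assumptions.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(matches, non_matches, margin=1.0):
    """Contrastive loss over sampled descriptor pairs.

    matches / non_matches are lists of (object_descriptor, kit_descriptor).
    Matches are pulled together; non-matches are pushed until they are at
    least `margin` apart, after which they contribute nothing.
    """
    match_loss = sum(l2(a, b) ** 2 for a, b in matches) / max(len(matches), 1)
    non_match_loss = sum(
        max(0.0, margin - l2(a, b)) ** 2 for a, b in non_matches
    ) / max(len(non_matches), 1)
    return match_loss + non_match_loss
```

In the setting described above, the non-match list would include kit descriptors taken from the wrong rotations at the correct pixel, which is what forces the descriptor to encode orientation.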
Learning ordered assembly. Each predicted pixel-wise descriptor from the matching network is conditioned on contextual information available inside of its local receptive field (e.g. a descriptor changes based on whether its receptive field sees 0, 1, or more objects already inside the kit). The descriptors thus have the capacity to memorize the sequencing for ordered assembly. As a result, both the matching and place modules implicitly enable our system to memorize the sequencing, where the learned order of assembly corresponds to the reversed order of disassembly from data collection.
The planner is responsible for integrating information from all three modules and producing the final assembly parameters (picking location, placing location, and rotation). Specifically, top-k pick candidates are sampled from the suction module output and top-k place candidates are sampled across all 20 rotations of the place module output. Then, for each pick and place pair in the Cartesian product of candidates, the kit and object descriptors are indexed and their L2 distance is evaluated. The pair with the lowest L2 distance across all rotations and all candidates is chosen to produce the final kit descriptor, object descriptor, and rotation index.
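The candidate search can be sketched as a brute-force loop over the Cartesian product. This is a simplified illustration: it assumes a single place confidence map rather than one per rotation, and the data layout (`object_desc[row][col]`, `kit_desc_by_rot[rot][row][col]`) and function names are hypothetical.

```python
import heapq
import itertools
import math

def top_k(conf_map, k):
    """Return the k highest-confidence (row, col) pixels of a 2D map."""
    cells = [(v, (r, c)) for r, row in enumerate(conf_map) for c, v in enumerate(row)]
    return [rc for _, rc in heapq.nlargest(k, cells)]

def plan(suction_conf, place_conf, object_desc, kit_desc_by_rot, k=2):
    """Choose the pick, place, and rotation with the smallest descriptor distance.

    Returns (distance, pick_pixel, place_pixel, rotation_index).
    """
    picks = top_k(suction_conf, k)
    places = top_k(place_conf, k)
    best = None
    for pick, place in itertools.product(picks, places):
        obj_d = object_desc[pick[0]][pick[1]]
        for rot, kit_desc in enumerate(kit_desc_by_rot):
            d = math.dist(obj_d, kit_desc[place[0]][place[1]])
            if best is None or d < best[0]:
                best = (d, pick, place, rot)
    return best
```

Restricting the search to top-k candidates keeps the cost at k × k × 20 distance evaluations instead of comparing every object pixel with every kit pixel.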
To generate the inputs and ground-truth labels needed to train our various networks, we create a self-resetting closed-loop system wherein the robot continuously generates a random sequence of disassembly trajectories to empty a kit of objects, then performs it in reverse to reset the system to its initial state. Fig. 4 shows the pipeline of the disassembly data collection process.
Specifically, for every object in the kit, a trajectory is generated as follows: first, the robot captures a grayscale-depth image to construct kit and object heightmaps, then it performs a forward pass of the suction network to predict a picking location, which is executed by the suction primitive to grasp the object. If the suction action is successful, it places the object at a position and rotation sampled uniformly at random within the bounds of the workspace. If the suction action is not successful (i.e., no object gets picked up), this suction point is labeled as negative for the online learning process. The suction success signal is obtained by visual background subtraction and by measuring suction air flow. To ease trial and error during disassembly, we fix the kit to the workspace to prevent failed grasps from causing accidental kit displacements.
All the parameters (picking location, placing location, and rotation) are stored for resetting the scene. Once all objects have been disassembled, the robot indexes the trajectories in reverse, suctioning the objects using the place parameters and placing them back into the kit using the suction parameters. For each trajectory, it stores the grayscale-depth heightmaps captured before and after the disassembly step, the predicted suction pose, and the randomly generated place pose. To bootstrap the learning, we pretrain the suction network using 50 manually labelled suction examples. For each kit in the training set, we collect 500 disassembly sequences, which in total takes 8-10 hours.
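The rewinding step can be sketched as a simple sequence reversal with pick and place swapped. The `Step` record and the negation of the rotation are assumptions for illustration; the paper does not spell out how the stored rotation is represented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    pick: tuple          # location where the object was grasped
    place: tuple         # location where the object was released
    rotation_deg: float  # rotation applied between pick and place

def reverse_disassembly(disassembly):
    """Rewind a disassembly sequence into an assembly sequence.

    The last object removed becomes the first object inserted, each step's
    pick and place are swapped, and the rotation is undone.
    """
    return [
        Step(pick=s.place, place=s.pick, rotation_deg=-s.rotation_deg)
        for s in reversed(disassembly)
    ]
```

Reversing the order is what lets the system learn ordered assembly for free: an object that had to be removed first (for example from a top layer) automatically becomes the object that is placed last.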
Place Network Dataset. To generate the training data for the placing network, we use the suction location from a disassembly step and the kit heightmap captured after that suction action as one training pair. Thus, the placing network is encouraged to look at empty locations in the kit and predict valid place positions for the next object.
Suction Network Dataset. The training data for the suction network consists of two sets of input-label pairs: (1) the kit heightmap and the suction position, and (2) the object heightmap and the place position. Thus, the suction network is encouraged to predict favorable suction locations on the object of interest inside and outside the kit.
Matching Network Dataset. To label the correspondences for the matching network, we first compute the masks of the object (both inside and outside the kit) using image differencing. The relationship between every pixel in the cavity of the kit and its corresponding pixel on the object outside the kit is calculated from the stored rotation angle, using a rotation matrix that defines a rotation by that angle around the z-axis. Additional details on data collection and pre-training are in the supplemental file.
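One plausible way to compute such a correspondence is a planar rigid transform: rotate the pixel's offset from the kit-side anchor and re-attach it at the object-side anchor. The sketch below is an assumption about the geometry, not the paper's equation; in particular, the use of the stored suction and place pixels as anchors and the rotation direction are hypothetical.

```python
import math

def kit_to_object_pixel(kit_pixel, kit_anchor, object_anchor, theta_deg):
    """Map a kit-cavity pixel to its corresponding pixel on the displaced object.

    kit_anchor is the pixel where the object was grasped inside the kit and
    object_anchor is the pixel where it was released (both hypothetical
    inputs). The offset from the anchor is rotated by theta about the z-axis.
    """
    t = math.radians(theta_deg)
    dx = kit_pixel[0] - kit_anchor[0]
    dy = kit_pixel[1] - kit_anchor[1]
    rx = math.cos(t) * dx - math.sin(t) * dy
    ry = math.sin(t) * dx + math.cos(t) * dy
    return (round(object_anchor[0] + rx), round(object_anchor[1] + ry))
```

Applying this to every pixel inside the object mask yields the dense match labels, while pixels outside the mask or at other rotations can serve as non-matches.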
We design a series of experiments to evaluate the effectiveness of our approach across different assembly settings. In particular, our goal is to examine the following questions: (1) How does our proposed method – based on learned shape-driven descriptors – compare to other baseline alternatives? (2) How accurate and robust is our system across a wide range of rotations and translations of the objects and kit? (3) Is our system capable of generalizing to new kit configurations such as multiple versions of the same kit and mixtures of different kits when trained solely on individual kits? (4) Does our system learn descriptors that can generalize to previously unseen objects and kits?
Our first experiment compares the kitting performance of Form2Fit to ORB-PE, a classic method for pose estimation that leverages Oriented FAST and Rotated BRIEF (ORB) descriptors with RANSAC. Implementation details for this baseline can be found in the supplementary file.
Benchmark. We collected data from 25 random test sequences for each training kit (shown in Fig. 5) using our automatic data collection pipeline (Sec. IV) and generate associated ground-truth pose transforms (i.e., a matrix encoding the change in object pose from its initial position outside the kit to its final position inside the kit hole).
Evaluation metric. Form2Fit generates a rigid transform between the object and its final position in the kit using the predicted descriptors from the matching module, while ORB-PE generates the rigid transform by matching the observed object to a previously known canonical object model (whose transform into the kit is also known beforehand). Our goal is to evaluate how closely these rigid transforms match the ground-truth pose transforms from the benchmark. To this end, we adopt the average distance (ADD) metric and measure the area under the accuracy-threshold curve using ADD, where we vary the threshold for the average distance (in meters) and then compute the pose accuracy. The maximum threshold is set to 10 cm. Results are shown in Table LABEL:table:main-auc.
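The ADD metric and the area under the accuracy-threshold curve can be sketched as follows. For brevity the sketch uses planar poses `(theta_rad, tx, ty)` and 2D model points rather than full pose matrices, and normalizes the area by the maximum threshold; both choices, along with the number of threshold steps, are assumptions for illustration.

```python
import math

def transform(pose, point):
    """Apply a planar rigid transform (theta_rad, tx, ty) to a 2D point."""
    theta, tx, ty = pose
    x, y = point
    return (math.cos(theta) * x - math.sin(theta) * y + tx,
            math.sin(theta) * x + math.cos(theta) * y + ty)

def add_metric(points, pose_gt, pose_pred):
    """Average distance between model points under the two poses (meters)."""
    return sum(
        math.dist(transform(pose_gt, p), transform(pose_pred, p)) for p in points
    ) / len(points)

def auc(add_values, max_threshold=0.10, steps=100):
    """Normalized area under the accuracy-vs-threshold curve (trapezoid rule)."""
    thresholds = [max_threshold * i / steps for i in range(steps + 1)]
    acc = [sum(v <= t for v in add_values) / len(add_values) for t in thresholds]
    area = sum(
        (acc[i] + acc[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
        for i in range(steps)
    )
    return area / max_threshold
```

A method whose ADD is always zero scores 1.0 under this normalization, and a method whose ADD always exceeds 10 cm scores 0.0.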
In general, we observe that Form2Fit has a higher mean area under the curve across the different training kits than ORB-PE. While ORB-PE performs competitively, it requires prior object-specific knowledge (i.e., canonical object models and their precomputed pose transforms into the kit), making it unable to generalize to novel objects and kits. On the contrary, Form2Fit is capable of generalizing to novel objects and kits, which we demonstrate in the following subsection.
We evaluate the generalization performance of Form2Fit by conducting a series of experiments on a real platform, which consists of a 6DoF UR5e robot with a 3D-printed suction end-effector overlooking a tabletop scenario, as well as a Photoneo PhoXi Model M camera calibrated with respect to the robot base. Video recordings can be found in our supplementary material.
The evaluation metric is assembly accuracy, defined as the percentage of attempts where the objects are successfully placed into their target locations. For the kits that contain multiple objects (e.g. deodorants, zoo animals, fruits), we record the average individual success rate of each object, then average over all objects to compute the overall kit success rate. When evaluating on the training objects, we count 180° rotational flips as incorrectly assembled even though the objects may still fit in the kit. Our expectation is that the system should pick up on minor details in texture and geometry which should inform it about the right orientation. However, for novel kits in generalization experiments, we choose not to penalize such flips since the system has never previously seen the kits. See the supplemental file for examples of rotational flips.
|Kit Type|Name|Assembly Success (per object)|
|---|---|---|
|multi-object|fruits|0.65, 0.60, 0.95, 0.90|
|multi-object|zoo-animals|0.90, 0.95, 0.95, 0.95, 0.85|
|multi-object|deodorants|1.00, 1.00, 1.00, 0.90, 0.85|

|Config.|1 black-floss|1 tape|3 tape|2 tape & 2 black-floss|
|---|---|---|---|---|
|Succ. Rate|0.95|0.90|1.00, 0.95, 0.85|0.80, 1.00, 1.00, 0.95|
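The kit-level success rate defined earlier (average the per-object success rates within a kit) can be computed directly from the per-object values listed in the first table. The snippet below is only a worked illustration of that averaging; the dictionary layout and function name are hypothetical.

```python
def kit_success_rate(per_object_rates):
    """Average the individual object success rates of one kit."""
    return sum(per_object_rates) / len(per_object_rates)

# Per-object success rates copied from the multi-object rows of the table.
MULTI_OBJECT_KITS = {
    "fruits": [0.65, 0.60, 0.95, 0.90],
    "zoo-animals": [0.90, 0.95, 0.95, 0.95, 0.85],
    "deodorants": [1.00, 1.00, 1.00, 0.90, 0.85],
}

kit_rates = {name: kit_success_rate(r) for name, r in MULTI_OBJECT_KITS.items()}
for name, rate in kit_rates.items():
    print(f"{name}: {rate:.3f}")
```

Note that these three kits alone do not reproduce the reported overall average, since the single-object kits also contribute to it.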
Generalization to initial conditions. First, we measure the robustness of our system to varying initial conditions. For each kit, during training, the kit is fixed in the same position and orientation (Fig. 6 (a)) while during testing, we randomly position and orient it on the workspace (Fig. 6 (b)). Specifically, we record assembly accuracy on 5 random kit poses, 4 times each, for a total of 20 trials. Kit and object positions are uniformly sampled inside the table, while the orientations of the objects and of the kit are each sampled from their own angular range. Note that for kits with multiple objects, if the execution of an object fails, we intervene and place it in its correct location to allow the system to resume. As seen in Table LABEL:table:result-single, our system achieves 90% average assembly success on both single and multi-object kits. We observe that frequent modes of failure come from the robot placing objects (e.g. floss and tape) flipped from their correct orientation.
Generalization to multiple kits. Next, we study how well our system can generalize to different kit configurations. During training, the system sees only 2 individual kits (Fig. 6 (a)), while during testing, we create combinations of the same kit and mixtures of kits (Fig. 6 (c) and (d)). As above, we run repeated trials for each configuration. Although our system has never been trained on these novel settings, it is able to achieve an assembly success rate of 94.27% (Table LABEL:table:result-multi).
Generalization to novel kits. Finally, we study how well our system can generalize to novel objects and kits. Specifically, the testing kits are never-before-seen single-object kits and multi-object kits with various object shapes (see Fig. 5). For perfectly symmetrical objects (e.g. circles), we consider the assembly to be successful as long as the object is placed into the kit; otherwise we count it as a failure. On a set of 20 trials, our system achieves over 86% generalization accuracy.
To explore what the object descriptors generated by the matching network have learned to encode, we compute and visualize the t-SNE embedding of the learned feature descriptors for different kits. Specifically, we mask the object pixels in the heightmap using binary masks obtained during the data collection process and forward the masked heightmaps through the matching network. Then, the descriptor map of channel dimension 64 is reduced to dimension 3 using t-SNE and normalized for colorspace visualization. From Fig. 7, we observe that the descriptors have learned to encode: (a) rotation: objects oriented differently have different descriptors, (b) spatial correspondence: same points on the same oriented objects share similar descriptors, and (c) object identity: zoo animals and fruits exhibit unique descriptors (cols. 3 and 4).
We present Form2Fit, a framework for generalizable kit assembly. By formulating the assembly task as a shape matching problem, our method learns a general matching function that is robust to a variety of initial conditions, handles new kit combinations, and generalizes to new objects and kits. The system is self-supervised – obtaining its own training labels for assembly by disassembling kits through trial and error with pick and place, then rewinding the action sequences over time.
However, while our system presents a step towards generalizable kit assembly, it also has a few limitations. First, it only handles 2D rotations (i.e., planar object rotations) and assumes that objects are face-down – it would be interesting to explore a more complex (e.g. higher DoF) action representation for 3D assembly. Second, while our system is able to handle partially transparent kits, it has trouble handling fully transparent ones like the deodorant blister pack (we spray-paint it to support stereo matching for our 3D camera). Exploring the use of external vision algorithms like [32, 11, 3, 15] to estimate the geometry of the transparent kits before using the visual data would be a promising direction for future research.
Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §III-D.
Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3636–3645. Cited by: §II.
The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §II.
Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §II.
Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In ICRA. Cited by: §II.