We develop a state estimator for the application of inserting an object held by a suction cup into a tight container. This scenario takes inspiration from manual packaging stations in e-commerce warehouses. Warehouse bin-picking has been widely addressed in recent work [1] and is close to commercial use [2]. Fine placement of the picked object has received significantly less attention. In the logistics industry, placing objects densely reduces the space required to store and transport them. In this paper, we focus on the state estimation problem for the task of inserting an object into a container that is already populated.
The proposed state estimation framework is based on incremental Smoothing and Mapping (iSAM) [3], a least-squares optimization method that can add variables and constraints online and maintain an estimate in real time.
We derive a metric that combines observation models for both visual and tactile sensing. Visual sensing provides global information but is noisy and suffers from occlusions and unexpected outlying detections. On the other hand, force/tactile sensing is local but accurate when contact occurs; i.e., the distance is zero. Our focus is on exploiting tactile information. In our scenario, the tactile information is provided through a combination of a force-torque sensor measurement at the wrist of the robot, the geometry of the container and the object, and the measurement of robot encoders. The general idea is illustrated in Figure 1.
The state estimation framework extends our previous work [4] on tracking the pose of an object pushed in the horizontal plane. In this paper, we address four new technical challenges:
There are more complex contact formations.
The contact formations cannot be discriminated directly.
The robot has a deformable joint whose state is not directly observable.
The object state is in 3D.
The first two challenges, which concern contact formations (CFs), are the focus of this paper. CFs describe the structure of possible contact arrangements, i.e., which edge contacts which edge. To detect CFs, we train a support vector machine (SVM) [5, 6] to distinguish the contact formations given force measurements. We apply a self-supervised scheme to collect labeled examples of contact formations: we use a precise Vicon tracking system that provides enough positional accuracy to estimate the contact formation, and we use force sensing to distinguish between contact and no contact. This self-supervised scheme allows large-scale data collection that ultimately yields sufficient classification performance. After the contact formations are predicted, we add the corresponding cost functions to enforce the contact information.
We show the efficacy of fusing contact and visual information with ablation studies on the performance of the system using different combinations of sensor information, and for two different objects. The instrumented setup is used for the purpose of evaluating the recovered object pose and contact locations.
II Related Work
This work aims to combine classical work on peg-in-hole assembly with recent breakthroughs in state estimation techniques for solving the SLAM problem. Gadeyne et al. [7] show that there is an analogy between the SLAM problem and state estimation for manipulation tasks involving contacts: both infer continuous variables (poses of the robot vs. the object) and discrete variables (data association vs. contact formation detection). Below we discuss the literature on A) the peg-in-hole problem; and B) state estimation in manipulation tasks.
II-A Peg-in-hole problem
The peg-in-hole problem studies the task of inserting an object into a hole where the tolerance is smaller than the accuracy of the actuators or sensors. There are two important families of classical approaches: the design of passive compliant devices or control schemes [8, 9], which are often tailored to a specific scene; or active sensing techniques with feedback control [10, 11, 12]. The latter provides more potential flexibility in adapting to new scenarios. In this work, we investigate the second approach, where the quality of sensing and estimation often dictates the system performance. The focus of this paper is on providing high-quality and timely estimates of the object pose and contact formation.
Simunovic [10] first presents the insertion process as a pose estimation problem between the part and the hole from a series of noisy measurements. He used both smoothing and Kalman filtering methods. However, due to a lack of computation power, the algorithm did not work online. We follow his idea and use iSAM to balance accuracy and latency. In addition, we provide an in-depth experimental understanding of the performance of the system.
Bruyninckx et al. [12] proposed a model-based state estimation approach for the initial alignment of peg and hole. Their view is close to ours, in that the initial alignment is more crucial than the later insertion stage. However, their method is only derived for a round peg and round hole, and it is nontrivial to apply it to other geometries. In fact, much of the classical work is only derived for round-peg-round-hole settings in which symmetries simplify the problem. Round profiles are common in manufacturing, but not in a warehouse setting.
The work that exploits passive compliance in hardware or control is well developed and may be more suitable for industrial application. Drake [8] proposes a passive compliance device called Remote Center Compliance (RCC) for accommodating minor uncertainties in geometry in the insertion task. Whitney [9] presents an analysis for choosing the compliance parameters for a given task, and derives conditions to avoid wedging and jamming. However, the analyses are planar, assuming a round peg and a round hole with chamfers. Caine et al. [13] extend Whitney’s analysis to inserting a rectangular part into a chamferless hole, which is closest to our box-insertion scenario, where the suction cup can be seen as a passive compliant device. Their analysis of contact formations inspires our work.
II-B State estimation in manipulation tasks
Probabilistic filters such as particle filters and extended Kalman filters are widely used in robotics [14]. Particle filters are good at representing the complex multimodal distributions that often arise in contact problems, but struggle to provide accurate estimates for two reasons. First, they are limited by sampling resolution. Second, they usually suffer from particle depletion when fusing measurements with different levels of noise. Koval et al. [15] proposed a manifold particle filter to handle the depletion problem. The manifold generation is usually done offline for a particular configuration of an end-effector in order to make the online estimation fast enough.
Gadeyne et al. [7] propose to solve the joint probability of the hybrid problem (i.e., with continuous and discrete variables) using particle filters. In the same line of work, Zhang and Trinkle [16] use nonlinear complementarity programs to resolve contact and motion simultaneously. However, the expensive computation is far from real-time. Li et al. [17] use a contact graph that governs the transition of discrete contact states so that the contact state evolves physically. The system, however, does not scale well due to the combinatorial growth of the contact graph. Our system uses a simple SVM classifier to estimate the instantaneous contact formation in parallel with the probabilistic inference.
The Extended Kalman Filter (EKF) is a popular framework for online state estimation also used in contact tasks [14, 18, 19]. It linearizes a system so as to apply a Kalman Filter, designed for linear systems. One common drawback is that the linearization point is at the current estimate of variables. This can result in an inaccurate linearization, followed by an inaccurate estimation.
Kaess et al. [3] propose incremental smoothing and mapping (iSAM) to solve the above issues in the SLAM setting. iSAM can be seen as an online nonlinear least-squares optimization tool: cost functions and variables can be added online, and the current estimate of the variables and the linearization points can be refined. The update is efficient because it uses a QR-factorized matrix to represent the linearized cost functions and only updates a small portion of the matrix.
We used iSAM in our previous work on state estimation in the planar setting [4] and obtained higher accuracy than the EKF while respecting the real-time constraint. In this work, we explore a similar framework on a problem with more complex contact formations.
III Suction-based insertion problem
We are concerned with estimating the pose of a suction-held object, its contact formation with a hole, and the precise location of the contact points between them. We assume the object and the hole are rigid. We focus particularly on the mating phase, i.e., the alignment of the object and the hole, and less on the subsequent pushing-down stage. The mating phase is key to the insertion problem, and its contact formations are richer.
In a complete assembly system, the output of our algorithm would be fed into an insertion controller. The top of Figure 1 shows an example scenario where a box is to be inserted into the remaining free space in a box.
Based on this application, we simplify the scenario to the experimental setup in Figure 5, composed of two rigid parallel walls that emulate the space into which the box is inserted. These walls play the role of either the container or adjacent objects. We also assume that the hole is tight in only one of its two dimensions (see the axes in Figure 5). This is often true for real packing scenarios, and it also simplifies the instrumentation and observation of the assemblies.
III-A Problem definition
The object state is observed periodically, and we use a time subscript to indicate the corresponding timestamp along the trajectory.
State variables. The state at each time step includes the object pose and the contact points. The object pose has six components: three translational coordinates followed by roll, pitch, and yaw. The pose is expressed with respect to the top center of the hole. The contact points on the walls and on the object are separate variables, each identified by a unique id (Section V-B specifies the frames in which they are expressed). These points will be defined based on the possible contact formations of the objects involved.
When in contact, the contact points on the objects and the walls should be coincident. Here we parameterize both and will impose their coincidence as a constraint when appropriate depending on the predicted contact formation. This makes the implementation more modular and easy to adapt to changes in the object or hole. Section V describes how to impose contact information in our design.
Visual input. Visual input is in the form of time-stamped 3D poses and a binary signal denoting whether a pose is available at each time step (it may be missing due to occlusions). In this implementation, we use AprilTag markers to emulate a perception system with a realistic noise level. The algorithm, however, is agnostic to the particular type of perception algorithm.
Tactile input. A tactile input includes the 3D pose of the robot’s tool center point (TCP) and the force-torque measurement at the TCP.
In this section, we formulate the state estimation problem as a least squares problem in order to apply iSAM. The iSAM algorithm allows adding variables and constraints in an on-line fashion, in contrast to batch optimization. We refer the reader to [3] for a detailed description of the iSAM algorithm.
IV-A Objective function
The overall cost function is a sum of five cost functions/constraints:
the visual measurement cost;
the local robot motion cost;
the contact measurement cost;
the contact point on geometric feature cost;
the contact point prior cost.
The factor graph in Figure 2 shows the relationship between these cost functions. The overall least squares problem, Eq. (1), minimizes the sum of these five costs over all time steps with respect to a long vector formed by concatenating the state variables of every time step. Each cost is the squared Mahalanobis distance of its residual under the covariance matrix of the corresponding noise. We identify these covariance matrices from the measurement input and groundtruth. If some measurement is missing due to physical limitations, we remove the relevant cost functions at that instant; e.g., when the object is not in camera view, we remove the visual measurement term.
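As a reading aid, the objective described above can be written in the following generic form. The notation is an assumption introduced here for illustration only: $X$ is the stacked state, $r_{k,t}$ is the residual of cost type $k$ at time step $t$, $\Sigma_k$ is its noise covariance, and the labels vis, mot, con, geo, and pri simply name the five costs listed above.

```latex
X^{*} = \operatorname*{arg\,min}_{X} \; \sum_{t} \;
  \sum_{k \in \{\mathrm{vis},\,\mathrm{mot},\,\mathrm{con},\,\mathrm{geo},\,\mathrm{pri}\}}
  \left\| r_{k,t}(X) \right\|^{2}_{\Sigma_{k}},
\qquad
\left\| r \right\|^{2}_{\Sigma} = r^{\top} \Sigma^{-1} r .
```

Terms whose measurement is unavailable at a given time step are simply left out of the sum.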
Below we discuss the visual measurement and motion costs. The contact-related costs are discussed in Section V.
IV-B Visual measurement
The visual measurement cost forces the pose estimate to be close to the visual input. The associated residual is simply the difference between the estimated pose and the visually measured pose.
IV-C Motion prior
The motion prior forces the displacement of the object to be consistent with the robot motion. The deformation of the suction cup is small, and its effect on the object pose is mostly in rotation, so we adjust the noise level to match the stiffness of a suction cup.
This cost stabilizes the estimate against noise in the visual input. It also allows the correction from contact to persist after the contact is broken. The residual is the difference between two displacements: the change in the estimated object pose and the change in the robot TCP pose over the same time step.
All subtractions of angles in the cost functions are wrapped into a single 2π-wide interval.
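The motion residual and the angle wrapping can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: poses are taken to be 6-tuples (x, y, z, roll, pitch, yaw), displacements are formed component-wise, angles are wrapped into [-π, π), and all function names are hypothetical.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians into the interval [-pi, pi) (assumed convention)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def pose_difference(pose_a, pose_b):
    """Component-wise pose_a - pose_b; the last three entries are wrapped angles."""
    trans = [pose_a[i] - pose_b[i] for i in range(3)]
    rot = [wrap_angle(pose_a[i] - pose_b[i]) for i in range(3, 6)]
    return tuple(trans + rot)


def motion_residual(obj_prev, obj_curr, tcp_prev, tcp_curr):
    """Object displacement minus robot TCP displacement over one time step."""
    d_obj = pose_difference(obj_curr, obj_prev)
    d_tcp = pose_difference(tcp_curr, tcp_prev)
    return pose_difference(d_obj, d_tcp)


# Made-up example: object and TCP both move 5 mm straight down.
obj_prev = (0.0, 0.0, 0.100, 0.0, 0.0, 0.0)
obj_curr = (0.0, 0.0, 0.095, 0.0, 0.0, 0.0)
tcp_prev = (0.0, 0.0, 0.300, 0.0, 0.0, 0.0)
tcp_curr = (0.0, 0.0, 0.295, 0.0, 0.0, 0.0)
residual = motion_residual(obj_prev, obj_curr, tcp_prev, tcp_curr)
```

When the object follows the TCP rigidly, the residual is zero up to rounding; suction cup deformation shows up as a nonzero residual, mostly in the rotational entries.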
V Contact information
The optimization framework iSAM handles continuous variables. In our implementation, we decide the discrete contact formation (CF) first and then provide iSAM the corresponding constraints. This enables the estimation to be in real time. Below we illustrate the method with the example of a cuboid.
V-A Contact formations
A general guideline for defining CFs is to minimize the number of CFs, since classification accuracy will suffer when there are too many classes.
We define 9 contact formations labeled from 0 to 8, as shown in Figure 3. The contact formations are line contacts between the bottom face of the object and the top face of the walls. Some degenerate situations, such as a face contact, are not given their own class; we treat them as line contacts.
We propose to use a support vector machine (SVM) to classify the force signature received during the mating phase of the assembly in order to estimate CFs. The input to the classifier is a 6-dimensional force-torque measurement, and the output is a discrete CF id. Based on the hyper-parameter tuning tool provided by libsvm [5], we choose the radial basis function as the kernel of the SVM. We choose an SVM because of its speed and simplicity, and because it gives reasonable accuracy in our scenario. One could use many other classifiers.
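For reference, the radial basis function kernel compares two force-torque samples as sketched below. The form exp(-γ‖u − v‖²) is the RBF kernel as defined in libsvm; the value of γ and the two sample wrenches are placeholders chosen for illustration, not the tuned or measured values.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2) between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Two hypothetical 6-D wrenches (Fx, Fy, Fz, Tx, Ty, Tz); the values are made up.
wrench_a = (0.1, -0.3, 2.0, 0.02, 0.00, -0.01)
wrench_b = (0.2, -0.1, 1.5, 0.01, 0.03, -0.01)
GAMMA = 0.5  # placeholder, not a tuned hyper-parameter

similarity = rbf_kernel(wrench_a, wrench_b, GAMMA)
```

The kernel equals 1 for identical samples and decays toward 0 as the wrenches move apart, so γ sets how local the classifier's decision regions are.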
V-B Contact constraint
We aim for a modular and simple formulation to describe contact constraints because contact constraints take much effort to implement correctly by enumerating the contact formations. We want it to be easy to adapt to other geometries and to scale as the geometric complexity grows.
The process for specifying contact constraint is as follows. Regardless of whether contact happens or not, we always define contact points on the objects and walls. We assign 4 contact points on the bottom edges of the object and 2 contact points to the individual walls as shown in Figure 4. When a particular contact formation is detected by the SVM classifier, we connect the corresponding contact points to force them to join. Figure 4 shows examples of the contact constraints of the rect object.
The contact constraint is implemented as a subtraction of the two points in the wall coordinate frame. Note that the contact points on the object are in the object frame, and the contact points on the wall are in the wall frame. Thus, the residual is the difference between the wall contact point and the object contact point mapped into the wall frame, for every pair of wall and object contact points that meet according to the CF. The mapping is a homogeneous transform matrix that takes points from object coordinates to wall coordinates. Note that there could be multiple contacts, so there will be multiple such terms in Eq. (1) according to the CF.
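The contact constraint above can be sketched as a small residual function. This is an illustration only: the sign convention, the function names, and the example transform and points are assumptions, and in the estimator the transform would be built from the current object pose estimate.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    return tuple(
        T[i][0] * p[0] + T[i][1] * p[1] + T[i][2] * p[2] + T[i][3]
        for i in range(3)
    )


def contact_residual(T_wall_obj, p_obj, p_wall):
    """Object contact point mapped into the wall frame, minus the wall contact point."""
    q = transform_point(T_wall_obj, p_obj)
    return tuple(q[i] - p_wall[i] for i in range(3))


# Hypothetical pose: object frame sits 0.02 m above the wall frame, no rotation.
T_wall_obj = [
    [1.0, 0.0, 0.0, 0.00],
    [0.0, 1.0, 0.0, 0.00],
    [0.0, 0.0, 1.0, 0.02],
    [0.0, 0.0, 0.0, 1.00],
]
p_obj = (0.01, 0.0, -0.02)  # point on a bottom edge, object frame
p_wall = (0.01, 0.0, 0.0)   # point on the wall's top edge, wall frame

residual = contact_residual(T_wall_obj, p_obj, p_wall)
```

One such residual is added per pair of contact points that the detected CF declares to be in contact, which is what keeps the formulation modular.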
V-C Contact point on geometric feature
We explicitly constrain each contact point to lie on a designated geometric feature (e.g., line, curve) of an object or a wall. The cost is the difference between the point and its closest point on that line or curve.
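For a straight edge, the closest-point residual can be sketched as below. The sketch assumes the feature is a finite line segment given by its two endpoints; a curved feature such as an ellipse would need its own projection routine. Function names and the example coordinates are hypothetical.

```python
def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (all 3D tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    length_sq = sum(c * c for c in ab)
    if length_sq == 0.0:
        return tuple(a)  # degenerate segment: both endpoints coincide
    s = sum(ap[i] * ab[i] for i in range(3)) / length_sq
    s = max(0.0, min(1.0, s))  # clamp so the result stays on the segment
    return tuple(a[i] + s * ab[i] for i in range(3))


def feature_residual(p, a, b):
    """Difference between a contact point and its closest point on the edge."""
    c = closest_point_on_segment(p, a, b)
    return tuple(p[i] - c[i] for i in range(3))


# Made-up edge along the x axis and a contact point slightly off it.
edge_start, edge_end = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
residual = feature_residual((0.5, 0.2, 0.0), edge_start, edge_end)
```

The residual vanishes exactly when the contact point lies on the edge, so the cost only penalizes motion off the feature and leaves sliding along it free.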
V-D Contact point prior
The contact point prior is a regularization term that gives a gentle hint of the contact point location on a geometric feature of the object or walls. This prior has two benefits. First, it helps to locate the contact points based on prior statistics. Second, it prevents the system from becoming under-constrained, which would make the least squares problem ill-defined. In practice, we use a relatively low weight compared to that of the other costs. Therefore, this prior constraint is only effective when there are no other relevant costs. The cost is the difference between the contact point and its corresponding nominal position.
The nominal position can be found by statistics of the groundtruth data. This does not have to be accurate as it is a regularization prior.
V-E Adaptation to another shape – elliptic cylinder
We choose an elliptic cylinder to test a geometry with a curved shape. Following the above procedure, we can adapt the cost functions to an elliptic cylinder. We mainly need to change three places:
The definition of the contact formations;
The associated contact constraints;
The contact point on the geometric feature of the object, i.e., the elliptic bottom face.
The contact formations and contact constraints for ellip are shown in Figure 7. Since the ellipse is a smooth shape, we only need two CFs to describe which wall is in contact. Due to our modular design, we do not need to modify the constraint for contact point on walls.
We want to understand the individual components and how possible changes may affect the performance of the system. Specifically, we want to answer the following questions.
How accurate is the contact formation prediction?
How much do contact constraints improve accuracy?
How accurate is the contact point prediction?
How well does the system adapt to an object of smooth shape, e.g., elliptic cylinder?
What is the runtime of the system?
Objects. We use two objects, as shown in Figure 6. The crucial part of each object is the bottom face, where the contact with the walls happens. Object rect has a rectangular face; ellip has an elliptical face. Their properties are:
rect: a 3D printed cuboid of cm, 110 g.
ellip: we modified the cuboid to have the bottom portion to be an elliptical cylinder. This is to maintain the same configuration on the top portion for sticking on the same pattern of Vicon markers and Apriltags as in rect. It weighs 100 g.
Walls. Two walls have a top face of cm. They are held rigidly to the environment.
Robot. We use an ABB IRB 120 industrial robotic arm with 6 DOF to control precisely the position and velocity of its tool center point (TCP). The TCP moves at 5 mm/s in the experiments. The TCP pose is published at 250 Hz.
Force sensing. We use an ATI Gamma F/T sensor, which connects the robot’s TCP and the suction tube guide. It has high sensitivity, with a force resolution of 1/160 N and a torque resolution of 1/2000 Nm. The data is published at 250 Hz.
Vision system. We use AprilTags to track the part at 30 Hz. We build a stereo AprilTag 3D pose tracking system to provide a more accurate and stable 3D visual tracking system than that with a single camera. We use two RGB cameras on two Intel Realsense SR300s to build a stereo system. The camera resolution used is on both cameras.
Suction system. The suction cup is a Piab BX25, which is made of a mix of rubber and polyurethane. The diameter of the suction cup skirt is 25 mm. It is mounted on the suction tube guide. We use a vacuum generator that converts compressed air at 50 psi to vacuum.
Data collection procedure. We first command the robot to suction the object at the center, move the object above the hole, and align the object with the hole at the center. With the above as the nominal configuration, we add variations in translation in and rotation in yaw using the coordinate system in Figure 5. We assume the error of the hole and object alignment is not large: within mm in translation and in rotation.
We collect data for training and testing. The former is for learning the contact formation prediction and tuning the iSAM parameters. For training, the translation and yaw variations are each divided evenly into 25 grid points, giving 625 trials in total and taking about 1 hour. For testing, the variations are each divided evenly into 5 grid points, giving 25 trials in total and taking about 2.5 minutes. The testing configurations are chosen so that they do not exactly match the poses used in training, in order to test the generalization of the system. Each trial starts with a varied pose, and then the robot moves straight down until the force sensor detects contact. After that, it moves straight up.
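The trial layout can be sketched as a grid over the two perturbed dimensions. The ranges below are placeholders, since the actual limits are not reproduced here, and the symmetric, endpoint-inclusive spacing is an assumption; only the grid sizes (25 and 5 per dimension) and the total durations come from the text.

```python
import itertools


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi, endpoints included."""
    if n < 2:
        raise ValueError("need at least two grid points")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def perturbation_grid(max_trans_mm, max_yaw_deg, n):
    """All (translation, yaw) offsets on an n-by-n grid centered at zero."""
    trans = linspace(-max_trans_mm, max_trans_mm, n)
    yaws = linspace(-max_yaw_deg, max_yaw_deg, n)
    return list(itertools.product(trans, yaws))


# Placeholder ranges (hypothetical values, chosen only to make the sketch run).
train_grid = perturbation_grid(max_trans_mm=10.0, max_yaw_deg=10.0, n=25)
test_grid = perturbation_grid(max_trans_mm=9.0, max_yaw_deg=9.0, n=5)

# Average time per trial implied by the stated totals (1 hour and 2.5 minutes).
train_seconds_per_trial = 3600.0 / len(train_grid)
test_seconds_per_trial = 150.0 / len(test_grid)
```

Using a different range for the test grid is one simple way to keep test poses from coinciding with training poses.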
The data is recorded in the ROS bag format [20] so that we can test the same data with different system configurations.
Groundtruth. We use a Vicon tracking system to accurately find the groundtruth pose of the object and the walls. We also use the pose information to find the groundtruth of the CF labels by checking the geometric relationship between the bottom vertices of the object and the two line segments on the walls. This automatic labeling method allows us to label a large amount of data for training and testing.
Computation. All computation was done on a laptop machine with Intel Core i7-3920XM CPU and 16 GB RAM.
| |rect|ellip|
|number of training data|32,437|13,553|
|cross-validation accuracy (%)|88.5|99.3|
|testing accuracy (%)|83.5|98.6|
VI-B Accuracy of contact formation prediction
We extract the portion of data where contact happens. Table I reports the number of training samples and accuracy results.
We examine the error types by checking the confusion matrices shown in Figure 8. For rect, the classification errors are mostly among contact formations from adjacent configurations. These errors are usually not harmful to the state estimation, as the object poses are close for those CFs. We also note that there is more training data for CFs 2 and 5, so there is a bias toward those two classes. However, we do not normalize the data because we assume the training distribution is close to the testing distribution. For ellip, we observe very little confusion between the two CFs, which suggests that ellip is much easier to classify than rect.
We also investigate how much training data is sufficient. Figure 9 shows the cross-validation accuracy versus the number of training samples. For rect, the accuracy increases with the number of training samples until saturating at around 20,000; for ellip, it saturates at around 8,000.
VI-C Improvement of contact information
We want to examine whether adding the proposed tactile costs – contact constraint and robot motion – would improve over the baseline, which uses only visual input. Here we use object rect.
|Combination of costs|Trans.|Rot.|Trans(C).|Rot(C).|
We observe that both tactile-related constraints need to coexist in order to obtain a substantial improvement. Table II shows the root-mean-square errors in translation and rotation over the whole trajectory, as well as the errors during contact only, which are marked with (C). For the first improvement, we add robot motion to the baseline. The result improves slightly: by 0.2 mm and 0.5 mm for the full trajectory and the contact portion, respectively. For the second improvement, we add contact information to the baseline. The result improves by 2.4 mm during the contact portions but not significantly over the full trajectory. Combining both improvements, we see a larger 2.9 mm improvement over the whole trajectory. The top of Figure 10 shows a time instance where the contact information improves the accuracy.
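The two kinds of error reported above, over the full trajectory and over the contact portion only, can be computed as sketched here. The per-step error values and contact flags are made up for illustration, and the function names are hypothetical.

```python
import math


def rmse(errors):
    """Root-mean-square of a non-empty sequence of scalar errors."""
    if not errors:
        raise ValueError("need at least one error value")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_full_and_contact(errors, in_contact):
    """Return (RMSE over all steps, RMSE over steps flagged as in contact)."""
    contact_errors = [e for e, c in zip(errors, in_contact) if c]
    return rmse(errors), rmse(contact_errors)


# Made-up per-step translation errors in mm and groundtruth contact flags.
errors_mm = [3.0, 4.0, 1.0, 2.0]
in_contact = [False, False, True, True]

full_rmse, contact_rmse = rmse_full_and_contact(errors_mm, in_contact)
```

Reporting the contact portion separately matters because contact constraints can only act while contact is detected, so their benefit is diluted in a full-trajectory average.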
VI-D Evaluating contact point accuracy
The groundtruth contact points are computed from the Vicon tracking result. We calculate the contact points by finding the closest point between the two line segments that are supposed to meet given the groundtruth CF.
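A closest-point query between two line segments can be sketched with the standard clamped-parameter method. This is an illustration of the geometric step, not the evaluation code used for the paper; the function names and the example segments are hypothetical.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _clamp01(v):
    return max(0.0, min(1.0, v))


def _along(p, d, s):
    return tuple(x + s * y for x, y in zip(p, d))


def closest_points_between_segments(p1, q1, p2, q2, eps=1e-12):
    """Closest points on segments p1-q1 and p2-q2, returned as (c1, c2)."""
    d1, d2, r = _sub(q1, p1), _sub(q2, p2), _sub(p1, p2)
    a, e, f = _dot(d1, d1), _dot(d2, d2), _dot(d2, r)
    if a <= eps and e <= eps:      # both segments are points
        s = t = 0.0
    elif a <= eps:                 # first segment is a point
        s, t = 0.0, _clamp01(f / e)
    else:
        c = _dot(d1, r)
        if e <= eps:               # second segment is a point
            s, t = _clamp01(-c / a), 0.0
        else:
            b = _dot(d1, d2)
            denom = a * e - b * b  # zero when the segments are parallel
            s = _clamp01((b * f - c * e) / denom) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:            # re-clamp s if t left the segment
                s, t = _clamp01(-c / a), 0.0
            elif t > 1.0:
                s, t = _clamp01((b - c) / a), 1.0
    return _along(p1, d1, s), _along(p2, d2, t)


# Made-up object edge and wall edge that cross at (0.5, 0, 0).
c_obj, c_wall = closest_points_between_segments(
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, -1.0, 0.0), (0.5, 1.0, 0.0)
)
```

When the two edges actually touch, both returned points coincide; when tracking noise leaves a small gap, the pair brackets the contact location.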
We see an improvement in contact point estimation, as shown in Table III. We only report the portion where there is contact. Adding robot motion information reduces the error by 5.1 mm, and adding contact information reduces it by 12.7 mm. With both additions, the error drops by 13.6 mm and reaches 5.4 mm.
|Combination of costs|Error (mm)|
VI-E Different geometry of the object
Here we show that our state estimation, using tactile sensing, can adapt to a different object shape. Figure 10 bottom shows an instance of improvement using contact information. The contact constraint drags the estimation down to touch the wall when contact is detected. Table IV shows that adding the contact information improves the accuracy by 1.1 mm.
|Combination of costs|Trans.|Rot.|Trans(C).|Rot(C).|
VI-F Run time
The testing sequence is 154 s long and the state estimator runs for 1,570 time steps, which results in 10 Hz or 96 ms per step on average. SVM prediction is fast: it takes 0.3 ms per step for rect and 0.06 ms per step for ellip, which indicates that the algorithm could potentially run faster.
In this work, we present a real-time state estimation system to recover the pose, the CF, and the contact points of an object interacting with its environment. We focus on the particular problem of inserting a suction-held object into a tight container. We use iSAM as the optimization framework to fuse tactile and visual sensing. At every time step, we use an SVM classifier to decide the CF, and then add the corresponding contact constraints to iSAM. We show that using tactile sensing improves the accuracy of the estimated object pose compared with visual sensing alone.
Number of contact formations. For the sake of computational complexity and to boost the performance of the SVM classifier, we use a reduced number of CFs. The CFs we chose are based on observation of real trials. There are two lessons: 1) If the CFs are too similar, merge them if possible; 2) If the CF happens rarely, regard it as no contact or the closest CF.
SVM accuracy. We observe lower accuracy when we test the SVM on the test data. We believe this is in part due to the fact that the frequency of each formation is different during training and testing. During training, we want to preserve this prior knowledge of the frequency of different CFs so we did not normalize the data. Meanwhile, hardware degradation during the data collection process could also be a factor.
Use force for locating contact points. In this paper, we use an F/T sensor to find the discrete contact formation rather than the contact point location. Initially, we attempted to identify a dynamic model of the suction cup compliance and to exploit it jointly with the F/T readings to infer the contact location. In practice, the force reading is too noisy for this to yield a meaningful result.
Unilateral Contact Constraints. While we can impose constraints of contact, we cannot guarantee that when there is no contact, objects will not penetrate. To prevent it, we would need unilateral constraints, which are not allowed by the optimization framework we propose.
Future work. The CFs here are defined manually. One extension would be to investigate their automatic definition, as in [21], along with the contact constraints. Based on the pose and contact state output, we would like to build a model-predictive controller for the insertion task by modifying the approach in [22] for pushing tasks.
- [1] A. Zeng, S. Song, K.-T. Yu et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in ICRA, 2018.
- [2] Next leap for robots: Picking out and boxing your online order. [Online]. Available: https://www.wsj.com/articles/next-leap-for-robots-picking-out-and-boxing-your-online-order-1500807601
- [3] M. Kaess, A. Ranganathan, and F. Dellaert, “iSAM: Incremental smoothing and mapping,” IEEE Trans. on Robotics (TRO), vol. 24, no. 6, pp. 1365–1378, Dec. 2008.
- [4] K.-T. Yu and A. Rodriguez, “Realtime state estimation with tactile and visual sensing. Application to planar manipulation,” in ICRA, 2018.
- [5] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
- [6] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
- [7] K. Gadeyne, T. Lefebvre, and H. Bruyninckx, “Bayesian hybrid model-state estimation applied to simultaneous contact formation recognition and geometrical parameter estimation,” IJRR, vol. 24, no. 8, 2005.
- [8] S. H. Drake, “Using compliance in lieu of sensory feedback for automatic assembly,” Ph.D. dissertation, MIT, 1978.
- [9] D. E. Whitney, “Quasi-static assembly of compliantly supported rigid parts,” Journal of Dynamic Systems, Measurement, and Control, vol. 104, no. 1, pp. 65–77, 1982.
- [10] S. Simunovic, “An information approach to parts mating,” Ph.D. dissertation, Massachusetts Institute of Technology, 1979.
- [11] H. Inoue, “Force feedback in precise assembly tasks,” AI Memos, 1974.
- [12] H. Bruyninckx, S. Dutre, and J. De Schutter, “Peg-on-hole: a model based solution to peg and hole alignment,” in ICRA, 1995.
- [13] M. E. Caine, T. Lozano-Pérez, and W. P. Seering, “Assembly strategies for chamferless parts,” in ICRA. IEEE, 1989, pp. 472–477.
- [14] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.
- [15] M. Koval, N. Pollard, and S. Srinivasa, “Pose estimation for planar contact manipulation with manifold particle filters,” IJRR, vol. 34, no. 7, June 2015.
- [16] L. Zhang and J. C. Trinkle, “The application of particle filtering to grasping acquisition with visual occlusion and tactile sensing,” in ICRA, 2012.
- [17] S. Li, S. Lyu, and J. Trinkle, “State estimation for dynamic systems with intermittent contact,” in ICRA, 2015.
- [18] G. Izatt, G. Mirano, E. Adelson, and R. Tedrake, “Tracking objects with point clouds from vision and touch,” in ICRA, 2017.
- [19] P. Hebert, N. Hudson, J. Ma, and J. Burdick, “Fusion of stereo vision, force-torque, and joint sensors for estimation of in-hand object location,” in ICRA, 2011.
- [20] M. Quigley, K. Conley, B. Gerkey et al., “ROS: an open-source robot operating system,” in ICRA Workshop on Open Source Software, 2009.
- [21] P. Tang, “Representation and automatic generation of contact state graphs between general solid objects,” Ph.D. dissertation, The University of North Carolina at Charlotte, 2007.
- [22] F. Hogan and A. Rodriguez, “Feedback control of the pusher-slider system: A story of hybrid and underactuated contact dynamics,” in WAFR, 2016.