I Introduction & Related Work
Manipulation is a difficult problem, complicated by the challenge of robustly estimating the state of the robot’s interaction with the environment. Parameters such as the contact point and the force vector applied at that point can be very hard to estimate robustly. These parameters are generally only partially observable and must be inferred from noisy information obtained via coarse visual or depth sensors and highly sensitive but difficult-to-interpret tactile sensors.
For instance, in the case of “in-hand” manipulation problems, where a held object is often partially occluded by an end-effector, tactile sensing offers an additional modality that can be exploited to estimate the pose of the object .
Vision and tactile sensors have been used to localize an object within a grasp using a gradient-based optimization approach . This has been extended to incorporate constraints like signed-distance field penalties and kinematic priors . However, the former is deterministic and the latter handles uncertainty only per time-step, which is insufficient since sensors can be highly noisy and sensitive. Particle filtering-based approaches have been proposed that infer the latent belief state from bi-modal, noisy sensory data, both to estimate the object pose for two-dimensional grasps  and for online localization of a grasped object . These approaches are often limited in scope. For example,  uses vision only to initialize the object pose and later relies purely on contact information and dynamics models. In general, particle filtering-based methods also suffer from practical limitations such as computational complexity, mode collapse, and particle depletion in tightly constrained state spaces.
Beyond manipulation, state estimation is a classic problem in robotics. For example, Simultaneous Localization and Mapping (SLAM) has been studied for many decades, and many efficient tools have been developed to address noisy multi-modal sensor fusion in these domains [5, 6, 7]. One of the more successful tools, the smoothing and mapping (SAM) framework , uses factor graphs to perform inference and exploits the underlying sparsity of the estimation problem to efficiently find locally optimal distributions of latent state variables over temporal sequences. This technique offers the desired combination of being computationally fast while accounting for uncertainty over time, and has recently been incorporated into motion planning [8, 9].
This framework has also been explored for estimation during manipulation [10, 11, 12]. In particular, Yu et al.  formulate a factor graph of planar pushing interaction (for non-prehensile and underactuated object manipulation) using a simplified dynamics model with both visual object-pose and force-torque measurements, and show improved pose recovery over trajectory histories compared to single-step filtering techniques. However, the scope of  is limited to a purpose-built system equipped with small-diameter pushing rods kept at a vertical orientation, which allows for high-fidelity contact-point estimation. A fiducial-based tracking system is also used. Such high-precision measurements are impractical in a realistic setting.
In this work, we extend the capabilities of such factor graph inference frameworks in several ways to perform planar pushing tasks in real-world settings. We extend the representation to incorporate various geometric and physics-based constraints alongside multi-modal information from vision and tactile sensors. We perform ablation benchmarks to show the benefits of including such constraints, as well as benchmarks in which vision is occluded or the tactile sensors are very noisy, using data from our own generalized systems. We conduct our tests on two systems: a dual-arm ABB Yumi manipulator equipped with a gel-based Syntouch Biotac tactile sensor , and a Barrett WAM arm equipped with a pushing-probe end-effector mounted with a force-torque sensor (see Fig. 1). Both of these systems are also set up with a vision-based articulated tracking system that leverages a depth camera, joint encoders, and contact-point estimates .
Through inference, we jointly estimate the history of not only object poses and end-effector poses, but also contact points and applied force vectors. Estimates of contact points and applied force vectors can be used in tractable dynamics models to predict future states, and can benefit contact-rich planning and control for manipulation .
With our experiments, we show that we can contend with a range of multi-modal noisy sensor data and perform efficient inference in batch and incremental settings to provide high-fidelity and consistent state estimates.
II Dynamics of Planar Pushing
In this section, we review the dynamics model for pushing on planar surfaces. The quasi-static approximation of this model is used in the next section to describe the motion model of the pushed object within the factor graph for estimation.
Given an object of mass $m$ being pushed with an applied force $\mathbf{f}$, we can describe the planar dynamics of the rigid body through the primary equations of motion

$$\mathbf{f} + \mathbf{f}_{\mu} = m\,\ddot{\mathbf{x}}, \qquad \tau + \tau_{\mu} = I\,\dot{\omega} \tag{1}$$

where $\mathbf{x}$ is the object position measured at the center-of-mass (CM), $\omega$ the angular velocity of the object frame, $I$ the moment of inertia, and $\mathbf{f}_{\mu}$ the linear frictional force. The applied and frictional moments are defined as $\tau$ and $\tau_{\mu}$ respectively.
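We can estimate the frictional loads on the object by considering the contribution of each point on the support area of the object . The friction force $\mathbf{f}_{\mu}$ and corresponding moment $\tau_{\mu}$ are found by integrating Coulomb’s law across the contact region of the object with the surface

$$\mathbf{f}_{\mu} = -\mu \int_{A} \frac{\mathbf{v}(\mathbf{r})}{\lVert \mathbf{v}(\mathbf{r}) \rVert}\, P(\mathbf{r})\, dA, \qquad \tau_{\mu} = -\mu \int_{A} \mathbf{r} \times \frac{\mathbf{v}(\mathbf{r})}{\lVert \mathbf{v}(\mathbf{r}) \rVert}\, P(\mathbf{r})\, dA \tag{2}$$

where $\mathbf{v}(\mathbf{r})$ denotes the linear velocity at a point $\mathbf{r}$ in area $A$, and $P(\mathbf{r})$ the pressure distribution. The coefficient of friction $\mu$ is assumed to be uniform across the support area.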
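As a numerical illustration of integrating Coulomb's law over the support area, the sketch below discretizes a rectangular support patch into cells and sums the per-cell friction contributions. The function name `friction_load`, the rectangular footprint, the uniform pressure, and all numeric values are assumptions made for illustration; none of them are taken from the experiments.

```python
import math


def friction_load(width, height, vx, vy, omega, mu, normal_force, n=50):
    """Approximate the frictional force and moment on a sliding rectangle.

    Assumptions (illustrative only): the rectangle is centered at its
    center of mass, pressure is uniform, and (vx, vy, omega) is the planar
    body twist expressed at the center of mass.
    """
    cell_area = (width / n) * (height / n)
    pressure = normal_force / (width * height)
    fx = fy = tau = 0.0
    for i in range(n):
        for j in range(n):
            # Cell center relative to the center of mass.
            rx = (i + 0.5) * width / n - width / 2
            ry = (j + 0.5) * height / n - height / 2
            # Planar velocity of this point: v + omega x r.
            px = vx - omega * ry
            py = vy + omega * rx
            speed = math.hypot(px, py)
            if speed < 1e-12:
                continue  # a stationary point adds no sliding friction
            # Coulomb friction opposes the local sliding direction.
            dfx = -mu * pressure * cell_area * px / speed
            dfy = -mu * pressure * cell_area * py / speed
            fx += dfx
            fy += dfy
            tau += rx * dfy - ry * dfx
    return fx, fy, tau


if __name__ == "__main__":
    # Pure translation along +x: friction is mu * N along -x, with no moment.
    print(friction_load(0.1, 0.1, 0.05, 0.0, 0.0, mu=0.3, normal_force=10.0))
```

For pure translation the sum collapses to a force of magnitude mu times the normal force opposing the motion, which is a quick sanity check on the discretization.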
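For pusher trajectories that are executed at near-constant speeds, inertial forces can be considered negligible. The push is then said to be quasi-static, where the applied force is large enough to overcome friction and maintain a velocity, but is insufficient to impart an acceleration . Then, the applied load $(f_x, f_y, \tau)$ must lie on the limit surface. This surface is defined in $(f_x, f_y, \tau)$ space and encloses all loads under which the object would remain stationary . It can be approximated as an ellipsoid with principal semi-axes $f_{\max}$ and $\tau_{\max}$

$$\left(\frac{f_x}{f_{\max}}\right)^{2} + \left(\frac{f_y}{f_{\max}}\right)^{2} + \left(\frac{\tau}{\tau_{\max}}\right)^{2} = 1 \tag{3}$$

where $f_{\max} = \mu f_n$, and $f_n$ is the normal force. In order to calculate $\tau_{\max}$, we assume a uniform pressure distribution and define $\mathbf{r}$ with respect to the center of mass: $\tau_{\max} = \frac{\mu f_n}{A} \int_{A} \lVert \mathbf{r} \rVert\, dA$. For quasi-static pushing, the velocity is aligned with the frictional load, and therefore must be parallel to the normal of the limit surface. This results in the following constraints on the object motion $(v_x, v_y, \omega)$

$$\frac{v_x}{\omega} = c^{2}\,\frac{f_x}{\tau}, \qquad \frac{v_y}{\omega} = c^{2}\,\frac{f_y}{\tau}, \qquad c = \frac{\tau_{\max}}{f_{\max}} \tag{4}$$

used within our estimation factor graph in the next section.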
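The ellipsoidal limit-surface approximation can be sketched numerically as follows. The sketch assumes a rectangular footprint with uniform pressure, writes the semi-axes as `f_max` (friction coefficient times normal force) and `tau_max`, and uses their ratio `c = tau_max / f_max`; the function names and all numeric values are hypothetical. The residual is the cross-multiplied form of the condition that the object twist is parallel to the surface normal, which avoids dividing by a near-zero angular velocity or moment.

```python
import math


def tau_max_rectangle(width, height, mu, normal_force, n=200):
    """Maximum frictional torque about the CM under uniform pressure."""
    cell = (width / n) * (height / n)
    total = 0.0
    for i in range(n):
        for j in range(n):
            rx = (i + 0.5) * width / n - width / 2
            ry = (j + 0.5) * height / n - height / 2
            total += math.hypot(rx, ry) * cell  # midpoint rule for |r| dA
    return mu * normal_force / (width * height) * total


def quasi_static_residual(vx, vy, omega, fx, fy, tau, c):
    """Residuals that vanish when the twist is normal to the ellipsoid.

    Equivalent to vx/omega = c^2 fx/tau and vy/omega = c^2 fy/tau,
    written without divisions.
    """
    return (vx * tau - c ** 2 * fx * omega,
            vy * tau - c ** 2 * fy * omega)


if __name__ == "__main__":
    mu, normal_force = 0.3, 10.0           # illustrative values
    f_max = mu * normal_force
    t_max = tau_max_rectangle(0.1, 0.1, mu, normal_force)
    c = t_max / f_max
    fx, fy, tau = 1.0, 0.5, 0.02            # some applied load
    # A twist along the ellipsoid normal at that load has zero residual.
    twist = (fx / f_max ** 2, fy / f_max ** 2, tau / t_max ** 2)
    print(c, quasi_static_residual(*twist, fx, fy, tau, c))
```

A residual of this kind is the sort of quantity a quasi-static dynamics factor would penalize between consecutive object poses.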
III State Estimation with Factor Graphs
To solve state estimation during manipulation, we formulate a factor graph of belief distributions over any state and force vector trajectory, and perform inference over the trajectory given noisy sensor measurements. Graph construction and inference are performed with GTSAM [7, 18] via sparsity-exploiting nonlinear least-squares optimization to find the maximum a posteriori (MAP) trajectory that satisfies all the constraints and measurements. In the batch setting we use a Gauss-Newton optimizer, and in the incremental setting we use iSAM2, which performs incremental inference via Bayes trees.
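For Gaussian factors, MAP inference on a factor graph reduces to a nonlinear least-squares problem. In generic notation (the symbols $X$, $\phi_i$, $h_i$, $z_i$, and $\Sigma_i$ are assumed here for illustration and may differ from the notation used for the individual factors), this standard relationship can be written as:

```latex
\begin{aligned}
X^{*} &= \arg\max_{X} \prod_{i} \phi_i(X_i),
\qquad
\phi_i(X_i) \propto \exp\!\left(-\tfrac{1}{2}\,\lVert h_i(X_i) - z_i \rVert^{2}_{\Sigma_i}\right) \\
      &= \arg\min_{X} \sum_{i} \lVert h_i(X_i) - z_i \rVert^{2}_{\Sigma_i}
\end{aligned}
```

Each factor touches only a small subset $X_i$ of the variables, so the Jacobian of the stacked residuals is sparse; this is the structure that Gauss-Newton and incremental solvers exploit.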
III-A Model Design
We construct three different factor graphs for state estimation in our pushing task: CP, SDF, and QS (see Fig. 2). All three models include the latent state variables for a given time : the planar object pose , the projected end-effector pose , and the contact point .
Measurements: Each latent state variable is accompanied by an associated measurement factor which projects corresponding measurements from into the pushing plane. The object poses are estimated by the visual tracking system with measurements . Likewise, the end-effector pose measurements may be provided from robot forward kinematics, or from the tracking system (DART includes a prior on joint measurements). The contact-point measurements are provided by a tactile sensor model. In the QS graph (Fig. 2(c)), we include a new state variable for the applied planar contact force with corresponding measurements . For simplicity of graphical representation, we combine the contact point and force variables:
Geometric Constraints: We assume constant point-contact between the end-effector and the object. We include the factor which incurs a cost on the difference between the contact point and the closest point to a surface () :
where is the projection of onto , and returns the surface geometry of a body with a given pose: for the object, and for the end-effector. Additionally, the object and the end-effector must be prevented from occupying the same region in space. Such a constraint is necessary in practice where contact-point estimation is often noisy. Therefore, we introduce a factor to penalize intersecting geometries with a signed distance field. Let the point on the end-effector furthest into the object be denoted by , where . The projection of onto (the surface of the object) is then defined by , and we can apply a penalty
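The two geometric penalties can be illustrated with a small sketch. It assumes, purely for illustration, a rectangular object expressed in its own frame (so no pose transform is needed) and a circular probe; the function names are hypothetical and the hinge form of the penetration penalty is one common choice, not necessarily the one used here.

```python
import math


def signed_distance_box(px, py, half_w, half_h):
    """Signed distance from a point to an axis-aligned box centered at the
    origin: negative inside, zero on the boundary, positive outside."""
    dx = abs(px) - half_w
    dy = abs(py) - half_h
    outside = math.hypot(max(dx, 0.0), max(dy, 0.0))
    inside = min(max(dx, dy), 0.0)
    return outside + inside


def project_to_box_surface(px, py, half_w, half_h):
    """Closest point on the box boundary to (px, py)."""
    cx = min(max(px, -half_w), half_w)
    cy = min(max(py, -half_h), half_h)
    if abs(px) < half_w and abs(py) < half_h:
        # Interior point: snap to the nearest edge.
        if half_w - abs(px) < half_h - abs(py):
            cx = math.copysign(half_w, px)
        else:
            cy = math.copysign(half_h, py)
    return cx, cy


def contact_surface_residual(cx, cy, half_w, half_h):
    """Offset between a contact-point estimate and its surface projection."""
    sx, sy = project_to_box_surface(cx, cy, half_w, half_h)
    return cx - sx, cy - sy


def penetration_penalty(ex, ey, radius, half_w, half_h):
    """Hinge penalty on how far a circular probe reaches into the box."""
    depth = radius - signed_distance_box(ex, ey, half_w, half_h)
    return max(depth, 0.0)


if __name__ == "__main__":
    print(contact_surface_residual(0.06, 0.01, 0.05, 0.05))
    print(penetration_penalty(0.055, 0.0, 0.01, 0.05, 0.05))
```

The contact residual pulls a noisy contact estimate onto the surface, while the penetration penalty is zero whenever the probe and the object do not overlap.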
Dynamics: We add a constant velocity prior to impose smoothness on state transitions. For example, for finite-difference velocities of object poses we have
where and denote the timestep sizes at and . Similar to , we introduce an additional factor to condition object state transitions on quasi-static pushing. The corresponding graphical model is denoted by QS and is shown in Fig. 2(c). From Eq. 4 we get
where and are the finite-difference linear and angular velocity, respectively. The final cost function is optimized with respect to the set of variables over a trajectory of length :
Minimizing this cost yields the locally optimal (MAP) solution of the estimation problem.
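To make the structure of such a cost concrete, the toy sketch below smooths a one-dimensional trajectory using only measurement factors and constant-velocity factors. It is a deliberate simplification: the states are scalars rather than planar poses, the timestep is uniform, the problem is linear (so a single Gauss-Newton step solves it), and the names `smooth_trajectory`, `sigma_z`, and `sigma_v` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def smooth_trajectory(z, sigma_z, sigma_v):
    """MAP estimate of a 1-D trajectory from noisy measurements z.

    Cost: sum_t ((x_t - z_t) / sigma_z)^2
        + sum_t ((x_{t+1} - 2 x_t + x_{t-1}) / sigma_v)^2
    """
    n = len(z)
    rows, rhs = [], []
    for t in range(n):  # measurement factors
        row = [0.0] * n
        row[t] = 1.0 / sigma_z
        rows.append(row)
        rhs.append(z[t] / sigma_z)
    for t in range(1, n - 1):  # constant-velocity factors
        row = [0.0] * n
        row[t - 1] = 1.0 / sigma_v
        row[t] = -2.0 / sigma_v
        row[t + 1] = 1.0 / sigma_v
        rows.append(row)
        rhs.append(0.0)
    # Normal equations (J^T J) x = J^T b of the whitened linear system.
    jtj = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    jtb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(n)]
    return solve_linear(jtj, jtb)


if __name__ == "__main__":
    print(smooth_trajectory([0.0, 1.3, 1.8, 3.2, 4.0], sigma_z=0.1, sigma_v=0.05))
```

Each row of the whitened system involves at most three neighboring states, so the normal matrix is banded; a dense solve is used here only to keep the sketch short.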
IV Baseline Comparison
To first ascertain the general performance of our approach, we evaluate the QS graph on the MIT planar pushing dataset  using batch optimization. This dataset contains a variety of pushing trajectories for a single-point robotic pushing system. The object poses were tracked with a motion capture system, and contact forces were measured with a pushing probe mounted on a force-torque sensor. We use this data as ground truth, since it is considered to be sufficiently reliable. We restrict our experiments to a subset of this data, using trajectories with zero pushing acceleration and velocities under 10 cm/s in order to maintain approximately quasi-static conditions. Additionally, we only consider trajectories on the ABS surface, but examine different object types (ellip1, rect1, rect3) with approximately 100 trajectories per object and measurements provided at 100 Hz. Gaussian noise is artificially added to the measurements prior to inference, with the following sigma values: .
The resulting RMS and covariance values post-optimization are shown in Table I. The optimized values exhibit marked reductions in error compared to the sigma values of the initial measurements. Note that, for object poses we only include values in which the object is in motion, in order to exclude trivial stationary estimates. All position-related values are in cm, with angular values in radians, and forces in Newtons. An example of an optimized trajectory is shown in Fig. 3. Although the observation noise is artificial, these results indicate that latent state estimates may still be successfully recovered with the addition of geometric and physics-based priors, and without over-constraining the optimization.
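The evaluation protocol of corrupting ground-truth values with Gaussian noise and then scoring by RMS error can be sketched as below. The sigma value, trajectory, and function names are placeholders chosen for illustration; they are not the values used in the benchmark.

```python
import math
import random


def add_gaussian_noise(values, sigma, rng):
    """Return a copy of `values` with zero-mean Gaussian noise added."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def rms_error(estimates, truth):
    """Root-mean-square error between two equally long sequences."""
    total = sum((e - t) ** 2 for e, t in zip(estimates, truth))
    return math.sqrt(total / len(truth))


if __name__ == "__main__":
    rng = random.Random(0)                       # fixed seed for repeatability
    truth = [0.01 * t for t in range(5000)]      # placeholder trajectory
    noisy = add_gaussian_noise(truth, sigma=0.5, rng=rng)
    # Before any inference, the RMS error is close to the injected sigma.
    print(rms_error(noisy, truth))
```

Comparing this pre-inference RMS against the RMS of the optimized estimates gives the kind of error reduction the table reports.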
V State Estimation in Open and Cluttered Scenes
We first perform pushing experiments with the Barrett WAM manipulator acting on a laminated box as shown in Fig. 4. The system is observed by a stationary PrimeSense depth camera located 2.0 m away from the starting push position of the end-effector. Vision-based tracking measurements of the object pose are provided by DART, configured with contact-based priors and joint estimates . The robot is equipped with a force-torque (F/T) sensor and a rigid end-effector mounted with a spherical hard-plastic pushing probe. The contact forces are measured by the F/T sensor, with contact point measurements provided through optimization in DART. Ground-truth poses are provided via a motion-capture system. The table is mounted with a smooth Delrin sheet to provide approximately uniform friction across the pushing area.
Fig. 5: Mean error and standard deviations of object pose estimates (after the last iSAM2 step has been performed). CP, SDF, and QS model results are compared to raw measured values, and to those produced by the graph described in Yu et al. Tracking performance is greatly improved with the inclusion of geometric and physics-based priors. The comparison with , which does not use SDF priors, indicates the importance of enforcing these constraints in practice.
We performed 100 pushing trials with varying initial end-effector and object poses. The end-effector trajectories were varied in curvature and maintained a translational speed close to 6 cm/s to approximate quasi-static conditions. Object pose-tracking measurements were provided at roughly 25 Hz, with end-effector poses and force/contact measurements published at 250 Hz. Incremental inference of the factor graph is performed after every 5 object pose measurements.
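One way to organize such mixed-rate measurements for incremental inference is sketched below. It assumes that an update is triggered each time five new object-pose measurements have arrived, and that every measurement received in the meantime is folded into the same update; the event representation and the name `schedule_updates` are hypothetical.

```python
def schedule_updates(events, poses_per_update=5):
    """Group a time-ordered measurement stream into incremental batches.

    `events` is an iterable of (timestamp_ms, kind) tuples, with kind either
    "object_pose" or "effector". A batch is closed each time
    `poses_per_update` object-pose measurements have arrived.
    """
    batches, current, pose_count = [], [], 0
    for event in events:
        current.append(event)
        if event[1] == "object_pose":
            pose_count += 1
            if pose_count == poses_per_update:
                batches.append(current)
                current, pose_count = [], 0
    return batches, current  # closed batches and the unfinished remainder


if __name__ == "__main__":
    # One second of data: effector/force at 250 Hz, object pose at 25 Hz.
    stream = [(t, "effector") for t in range(0, 1000, 4)]
    stream += [(t, "object_pose") for t in range(0, 1000, 40)]
    stream.sort()
    batches, remainder = schedule_updates(stream)
    print(len(batches), [len(b) for b in batches], len(remainder))
```

Each closed batch would correspond to one incremental solver update, with the new factors and initial values for that window added together.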
In order to provide real-time performance, DART maintains a belief distribution over state at a single timestep. However, this can make tracking susceptible to unreliable measurements that may arise from state-dependent uncertainty or partial observability. As such, we purposely include trajectories in which the object orientation changes significantly with respect to the camera orientation, causing large variations in point-cloud association. In addition, the pushing trajectories were also performed in cluttered scenes, as depicted in Fig. 4, with 85% occlusion of the pushed object occurring in the middle of the trajectory.
Examples of measured and estimated state trajectories are shown in Fig. 6. In the fully observable (unoccluded) setting, a distinct improvement of the object pose can be seen with both SDF and QS models. Under heavy occlusion, the visual tracking system loses the object and is unable to regain the trajectory state. However, the addition of both geometric and physics-based priors to the factor graph results in realignment of the tracked object. Fig. 5 shows the tracking performance for fully observable trajectories using the CP, SDF, and QS factor graphs. The results are compared to the model proposed by Yu et al. , which includes quasi-static dynamics factors with contact and zero-velocity priors.
| Force magnitude (N) | 0.352 | 0.195 | 0.043 |
| Force direction (deg.) | 3.15 | 2.54 | 0.78 |
| Contact location (cm) | 0.32 | 0.14 | 0.18 |
In addition to improving inference on kinematic trajectories, the QS graph can be used to improve contact point and force estimates. To demonstrate this, we artificially add non-Gaussian noise (a bi-modal mixture of two triangular distributions) to contact points and force measurements on the ground-truth data. The resulting estimation errors after optimization are shown in Table II, and indicate that our approach manages to recover true contact points and pushing forces. An example of force-trajectory optimization is illustrated in Fig. 7.
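A bi-modal mixture of two triangular distributions can be sampled as in the sketch below. The equal mixture weights, the symmetric placement of the two modes, and the parameter names `offset` and `half_width` are assumptions made for illustration; the actual noise parameters are not specified here.

```python
import random


def bimodal_triangular(rng, offset, half_width):
    """Sample from an equal-weight mixture of two triangular distributions
    whose modes sit at -offset and +offset."""
    center = offset if rng.random() < 0.5 else -offset
    # random.triangular takes (low, high, mode).
    return rng.triangular(center - half_width, center + half_width, center)


if __name__ == "__main__":
    rng = random.Random(1)                       # fixed seed for repeatability
    truth = [0.0] * 10
    noisy = [t + bimodal_triangular(rng, offset=1.0, half_width=0.5) for t in truth]
    print(noisy)
```

With `half_width` smaller than `offset`, the two modes do not overlap, so the resulting noise is clearly non-Gaussian and never close to zero.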
VI Force Estimation for Tactile Sensing
We further demonstrate inference on force trajectories using realistic (noisy) tactile data. The Biotac sensor comprises a solid core encased in an elastomeric skin and is filled with weakly-conductive gel . The core surface is populated by an array of 19 electrodes, each measuring impedance as the thickness of the fluid between the electrode and the skin changes. A transducer provides static pressure readings which consist of a single scalar value per time-step. The sensor is also equipped with a thermistor for measuring fluid temperature. Although the device does not directly provide a force distribution or contact point measurements, an analytical method for estimating these values is described in .
Using an ABB YUMI robot with a mounted Biotac sensor, we generated randomized linear trajectories of the end-effector pushing a 0.65 kg box across a laminated surface (see Fig. 1) starting from a number of different poses. We used the DART tracking system  to obtain object and end-effector pose measurements, along with approximate contact points. The analytical force sensor model  was used to provide initial force measurements.
Examples of initial and optimized trajectories are shown in Figs. 8 and 9. The presence of the contact surface factor shrinks the contact point covariance in the direction of push, as is expected. The covariances for finger and object pose estimates are drastically reduced, exhibiting the benefits of joint inference across trajectory histories. Also, the dynamics factor aligns the force vector in the direction of motion of the object. This is further clarified in Fig. 9, where force vectors are correctly aligned with the object center-of-mass for linear trajectories, and provide a moment arm during angular displacement. This demonstrates the importance of contact and geometric factors in aligning the surface tangents of the finger and the object at the point of contact.
We proposed a factor graph-based inference framework to solve estimation problems for robotic manipulation in batch and incremental settings. Our approach leverages geometric and physics-based constraints along with multi-modal vision and tactile sensor information to jointly estimate the history of robot and object poses along with contact locations and force vectors. We performed several benchmarks on various datasets with multiple manipulators in real environments, and showed that our framework can contend with sensitive, noisy sensor data and occlusions in vision to efficiently solve for locally optimal state estimates that closely match ground truth. Future work will include incorporating the approach within a motion planning context , combining vision and tactile modalities in learning predictive sensor models [22, 23], and possible integration into a hierarchical task-planning framework.
- [1] T. Schmidt, K. Hertkorn, R. Newcombe, Z. Marton, M. Suppa, and D. Fox, “Depth-based tracking with physical constraints for robot manipulation,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 119–126.
- [2] J. Bimbo, L. D. Seneviratne, K. Althoefer, and H. Liu, “Combining touch and vision for the estimation of an object’s pose during manipulation,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 4021–4026.
- [3] L. Zhang and J. C. Trinkle, “The application of particle filtering to grasping acquisition with visual occlusion and tactile sensing,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 3805–3812.
- [4] M. Chalon, J. Reinecke, and M. Pfanne, “Online in-hand object localization,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 2977–2984.
- [5] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “FastSLAM: A factored solution to the simultaneous localization and mapping problem,” AAAI/IAAI, pp. 593–598, 2002.
- [6] S. Thrun and M. Montemerlo, “The GraphSLAM algorithm with applications to large-scale mapping of urban structures,” The International Journal of Robotics Research, vol. 25, no. 5-6, pp. 403–429, 2006.
- [7] F. Dellaert and M. Kaess, “Square root SAM: Simultaneous localization and mapping via square root information smoothing,” The International Journal of Robotics Research, vol. 25, no. 12, pp. 1181–1203, 2006.
- [8] M. Mukadam, J. Dong, X. Yan, F. Dellaert, and B. Boots, “Continuous-time Gaussian process motion planning via probabilistic inference,” The International Journal of Robotics Research (IJRR), 2018.
- [9] M. Mukadam, J. Dong, F. Dellaert, and B. Boots, “Simultaneous trajectory estimation and planning via probabilistic inference,” in Proceedings of Robotics: Science and Systems (RSS), 2017.
- [10] K.-T. Yu, J. Leonard, and A. Rodriguez, “Shape and pose recovery from planar pushing,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 1208–1215.
- [11] K.-T. Yu and A. Rodriguez, “Realtime state estimation with tactile and visual sensing. Application to planar manipulation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7778–7785.
- [12] K.-T. Yu and A. Rodriguez, “Realtime state estimation with tactile and visual sensing for inserting a suction-held object,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1628–1635.
- [13] G. E. Loeb, “Estimating point of contact, force and torque in a biomimetic tactile sensor with deformable skin,” 2013.
- [14] F. R. Hogan and A. Rodriguez, “Feedback control of the pusher-slider system: A story of hybrid and underactuated contact dynamics,” arXiv preprint arXiv:1611.08268, 2016.
- [15] K. M. Lynch, H. Maekawa, and K. Tanie, “Manipulation and active sensing by pushing using tactile feedback,” in IROS, 1992, pp. 416–421.
- [16] M. T. Mason, “Mechanics and planning of manipulator pushing operations,” The International Journal of Robotics Research, vol. 5, no. 3, pp. 53–71, 1986.
- [17] S. H. Lee and M. Cutkosky, “Fixture planning with friction,” Journal of Engineering for Industry, vol. 113, no. 3, pp. 320–327, 1991.
- [18] F. Dellaert, “Factor graphs and GTSAM: A hands-on introduction,” Georgia Institute of Technology, Tech. Rep., 2012.
- [19] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “iSAM2: Incremental smoothing and mapping using the Bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
- [20] K.-T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez, “More than a million ways to be pushed. A high-fidelity experimental dataset of planar pushing,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016, pp. 30–37.
- [21] T. Schmidt, R. A. Newcombe, and D. Fox, “DART: Dense articulated real-time tracking,” in Robotics: Science and Systems, vol. 2, no. 1, 2014.
- [22] A. Lambert, A. Shaban, A. Raj, Z. Liu, and B. Boots, “Deep forward and inverse perceptual models for tracking and prediction,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 675–682.
- [23] B. Sundaralingam, A. Lambert, A. Handa, B. Boots, T. Hermans, S. Birchfield, N. Ratliff, and D. Fox, “Robust learning of tactile force estimation through robot interaction,” arXiv preprint arXiv:1810.06187, 2018.