Robot assistants operating in real-world environments should be capable of performing maintenance and repair tasks. Going beyond pick-and-place actions, we aim to enable robots to use the diversity of objects they might encounter. The ability to use commercial, off-the-shelf hand tools is critical for robots to perform tasks in unstructured, everyday environments. To accomplish this, robots must be able to identify and localize tools in an arbitrary cluttered scene in order to plan appropriate actions toward performing a task.
Recognizing hand tools and localizing their pose remains challenging in common human environments. These challenges arise from uncertainty caused by physical clutter and the high dimensionality of the space of poses that multiple objects in contact may occupy. Many hand tools are articulated, adding complexity to the localization problem by introducing additional degrees of freedom. Figure 1 shows one example of hand tools in a cluttered scene that could be typical of a work area.
State-of-the-art object and pose recognition methods estimate the six degree-of-freedom (6D) pose of objects using convolutional neural networks (CNNs) [xiang2018posecnn, tremblay2018corl:dope]. Other methods accomplish pose estimation using probabilistic inference [sui2015axiomatic, deng2019poserbpf]. However, localizing articulated objects remains a challenge for these methods, due to both the added degrees of freedom that arise from articulations and occlusions caused by clutter. Parts-based representations [Felzenszwalb] have the potential to achieve higher levels of robustness under these conditions than whole-object approaches. For such methods to be suitable for robot manipulation tasks, they must be able to localize 6D pose with reasonable computational efficiency. We suggest that generative inference methods, if made more computationally efficient, offer compelling and complementary benefits to modern deep learning. Additionally, a parts-based representation can provide information about the affordances of an object, because robot actions are typically applied to object parts.
In this paper, we present a method for recognition and localization of articulated objects in clutter suited to robotic manipulation of object affordances. We formulate the problem of articulated object pose estimation as a Markov Random Field (MRF), representing the 6D poses of each rigid object part and the articulation constraints between them. We propose a method to perform inference over the MRF based on message passing. We are inspired by work by Desingh et al. [Desingheaaw4523], in which parts-based articulated object localization is facilitated by combining information from both the observation as well as the compatibility with neighbouring parts within the inference process. Our method is informed jointly by a learned likelihood modelled by a CNN, as well as by the known articulation constraints between each component part. We assume known object mesh models and kinematic constraints in the form of a Unified Robot Description Format (URDF) file, a standard geometrical object representation in the field of robotics.
By employing generative inference to integrate both data-driven techniques and domain knowledge about the object models, we leverage the speed and representational abilities of deep CNNs, while retaining the ability to reconcile noisy results and provide structure and context to the estimate. Methods we present emphasize novel synthesis of (1) efficient discriminative-generative inference via nonparametric belief propagation for pose estimation of articulated objects, and (2) a learned part-based likelihood to evaluate hypotheses of articulated object pose against RGB observations. We present results using a custom dataset made up of commercial, off-the-shelf hand tools with robot observations containing varying levels of clutter.
II Related Work
Pose estimation has received considerable attention in robotics. Here, we discuss related work that focuses on rigid body, parts-based, and articulated object pose estimation.
II-A Rigid Body Pose Estimation
Methods that tackle the problem of rigid body pose estimation include geometry-based registration approaches [besl1992method], generative approaches [sui2015axiomatic, desingh2016physically], approaches combining discriminative and generative methods [narayanan2017deliberative, sui2017sum, chen2019grip, zeng2018semantic, mitash2018robust, deng2019poserbpf], and end-to-end learning approaches [xiang2018posecnn, tremblay2018corl:dope]. Here, we focus on the discriminative-generative methods and end-to-end learning methods that are most relevant to this work.
Combining the discriminative power of feature-based methods with generative inference has been successful under challenging conditions such as background and foreground clutter [narayanan2017deliberative, mitash2018robust, deng2019poserbpf], adversarial environment conditions [chen2019grip], and uncertainty due to robot actions [sui2017sum, zeng2018semantic]. We are inspired by the success of the above approaches in taking advantage of the speed of discriminative methods to perform analysis and synthesis based generative inference.
Xiang et al. propose an end-to-end network for estimating 6D pose from RGB images [xiang2018posecnn]. This work was further extended with synthetic data generation and augmentation techniques to improve performance [tremblay2018corl:dope]. Wang et al. [wang2019densefusion] propose an end-to-end network that uses depth information along with RGB. These methods rely significantly on the textured appearance of objects. More importantly, the state representation used in these methods assumes rigidity. Our attempts to adapt these methods to articulated objects required considerably more training data and computation time. In addition, estimates from end-to-end methods can be noisy, especially in challenging cluttered scenarios. We believe estimates from these methods can serve as good priors to help generative methods recover in challenging scenarios. Hence, in this work, we learn a likelihood function over the observation to inform the generative inference.
II-B Parts-Based Pose Estimation
Understanding objects in terms of their parts paves the way to meaningful and purposeful action execution, such as tool use. Parts-based representations have been proposed to aid scene understanding and action execution [Felzenszwalb, felzenszwalb2005pictorial, xiang2012estimating], and have recently garnered attention within the robotics and perception communities [Mo2018, Lu2018]. Parts-based localization has led to research in recognizing objects and their articulated parts [Yi2018]. Parts-based perception for objects in human environments is often limited to recognition and classification tasks. Parts-based pose estimation is often considered for human body pose [sigal2004tracking] and hand pose [sudderth2004visual] estimation with fixed graphical models. Here, we propose a general framework for estimating the pose of articulated objects, such as hand tools, that includes parts with fixed transforms as constraints.
II-C Articulated Pose Estimation and Tracking
Probabilistic inference is a popular technique in robot perception for articulated body tracking [cifuentes2016probabilistic, schmidt2014dart, schmidt2015depth], where filtering-based approaches alongside novel observation models have been proposed. These tracking frameworks are either initialized to the ground truth poses of objects, or applied to robot manipulators, where the inference is informed by joint encoder readings. In this work, we aim to perform pose estimation of multiple articulated objects using a single RGB-D frame with weak initialization from pixel-wise segmentations.
Interactive perception [interactiveperception] for articulated object estimation [hausman2015artic] has been a problem of interest in the robotics community. Various works [martin14online, sturm11prob, sturm13book] propose methods for estimating kinematic models from demonstrations of manipulation or articulation. We instead focus on using known kinematic models to estimate object poses in challenging cluttered environments.
Li et al. [li2019category] explore category-level localization of articulated bodies in a point cloud; however, their method does not consider clutter and occlusions from the environment. Michel et al. [michel2015pose] perform one-shot pose estimation of articulated bodies using 3D correspondences with optimization over hypotheses. Desingh et al. consider pose estimation of articulated objects in cluttered scenarios using efficient belief propagation [Desingheaaw4523], but do not consider RGB information. All of these approaches consider large, primarily planar objects that cover a significant portion of the observation, as opposed to the small objects in clutter considered in this work.
Li et al. [li2016hierarchical] developed techniques to handle the challenges of hand tools and small objects with no articulation, however the techniques proposed require multi-viewpoint information, as opposed to the single image approach that we propose.
III Problem Statement
Given a scene containing objects $O = \{o_1, \dots, o_N\}$, such that $O$ is the set of relevant objects, we wish to localize each object $o \in O$. The state of an object $o$ is represented by the set of part poses $X = \{X_1, \dots, X_P\}$, where $X_p \in SE(3)$ is the 6D pose of an articulating rigid part of $o$, with $P$ parts in total. Each object in the scene is estimated independently.
This estimation problem is formulated as a Markov Random Field (MRF). Let $G = (V, E)$ denote an undirected graph with nodes $V$ and edges $E$. An example MRF is illustrated in Figure 2. The joint probability of the graph is expressed as:

$$p(X, Z) \propto \prod_{(s,t) \in E} \psi(X_s, X_t) \prod_{s \in V} \phi(X_s, Z_s) \quad (1)$$

where $X = \{X_s \mid s \in V\}$ denotes the hidden state variables to be inferred and $Z$ denotes the observed sensor information in the form of an RGB-D image. The function $\psi(X_s, X_t)$ is the pairwise potential, describing the correspondence between part poses based on the articulation constraints, and $\phi(X_s, Z_s)$ is the unary potential, describing the correspondence of a part pose $X_s$ with its observation $Z_s$. The problem of pose estimation of an articulated model is interpreted as the problem of estimating the marginal distribution of each part pose, called the belief, $bel(X_s)$.
In addition to the sensor data, the articulation constraints and 3D geometry of the object, in the form of a Unified Robot Description Format (URDF), and the 3D mesh models of the objects are provided as inputs. We assume that the object articulations are produced by either fixed, prismatic or revolute joints. We consider scenes which contain only one instance of an object. In Section IV, our proposed inference mechanism is detailed, along with a description of our modelling of the potentials in Equation 1.
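As a concrete illustration, the object model described above can be held as a small graph whose nodes are rigid parts and whose edges carry the joint constraints read from the URDF. The following sketch uses hypothetical names and a simplified joint record; it is not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Joint:
    joint_type: str   # "fixed", "prismatic", or "revolute"
    limits: tuple     # (lower, upper) joint limits

@dataclass
class PartGraph:
    nodes: list = field(default_factory=list)   # part names (MRF nodes)
    edges: dict = field(default_factory=dict)   # (part_a, part_b) -> Joint

    def add_part(self, name):
        self.nodes.append(name)

    def add_joint(self, a, b, joint):
        self.edges[(a, b)] = joint

    def neighbors(self, name):
        # Neighbors in the undirected MRF graph.
        out = []
        for (a, b) in self.edges:
            if a == name:
                out.append(b)
            elif b == name:
                out.append(a)
        return out

# Example: a pair of pliers modeled as two parts joined by one revolute joint.
pliers = PartGraph()
pliers.add_part("half_a")
pliers.add_part("half_b")
pliers.add_joint("half_a", "half_b", Joint("revolute", (0.0, 0.6)))
```

In practice the nodes would also carry mesh models and the edges the full joint transforms parsed from the URDF.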
Belief propagation via iterative message passing is a common approach to infer hidden variables while maximizing the joint probability of a graphical model [Desingheaaw4523, sudderth2004visual]. We adopt the sum-product iterative message passing approach to perform inference [wainwright2008graphical], where messages are passed between hidden variables until their beliefs converge. A message, denoted by $m^n_{t \to s}$, can be considered as the belief of the receiving node $s$ as informed by its neighbor $t$ at iteration $n$. An approximation of the message, denoted by $\hat{m}^n_{t \to s}$, is computed using the incoming messages to $t$:

$$\hat{m}^n_{t \to s}(X_s) = \sum_{X_t \in M_t} \psi(X_s, X_t)\, \phi(X_t, Z_t) \prod_{u \in \rho(t) \setminus s} \hat{m}^{n-1}_{u \to t}(X_t) \quad (2)$$

where $\rho(t)$ denotes the neighboring nodes of $t$, and $M_t$ denotes the particle set of node $t$.
The marginal belief of a hidden node $s$ is a product of all the incoming messages weighted by the node's unary potential:

$$bel^n(X_s) \propto \phi(X_s, Z_s) \prod_{t \in \rho(s)} \hat{m}^n_{t \to s}(X_s) \quad (3)$$
Our particle optimization algorithm aims to approximate the joint probability of the MRF, as in Equation 1, by maintaining the marginal belief, as in Equation 3, for each object part. The belief of a rigid part pose, $bel(X_s)$, is represented nonparametrically as a set of $T$ weighted particles, $M_s = \{(X_s^{(i)}, w_s^{(i)})\}_{i=1}^{T}$.
Iv-a Belief Propagation via Message Passing
Our method adopts the traditional reweight and resample paradigm for particle refinement methods. The particles are first reweighted using an approximated sum-product message. The particles are then resampled using importance sampling based on the calculated weights.
The high-dimensional nature of the estimation problem and the cluttered settings with similar parts and partial observations make the inference prone to convergence to local minima. To mitigate this problem while computing messages, we can optionally add an augmentation step before the reweight step to accommodate different proposals. The augmentation technique is adapted from Pacheco et al. [pacheco2014preserving] and is discussed in Section IV-D.
The overall system is summarized in Figure 3.
IV-A1 Reweighting and Resampling Steps
Each particle $X_s^{(i)}$ is reweighted as follows:

$$w_s^{(i)} = \phi(X_s^{(i)}, Z_s) \prod_{t \in \rho(s)} \hat{m}^n_{t \to s}(X_s^{(i)}) \quad (4)$$

where $\hat{m}^n_{t \to s}$ is the sum-product message:

$$\hat{m}^n_{t \to s}(X_s^{(i)}) = \sum_{X_t \in M_t} \psi(X_s^{(i)}, X_t)\, \phi(X_t, Z_t) \quad (5)$$

which only takes into account the immediate neighbors of the node. Since the number of parts in each object is small, this approximation has negligible effect in practice and saves computation time. For numerical stability, log-likelihoods are used in practice. The weights are normalized and the particles are then resampled using importance sampling. The object pose estimate is made by selecting the maximum likelihood estimate (MLE) from each of the marginal beliefs.
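The reweighting and resampling steps can be sketched over particle arrays as follows. Here `pairwise` and `unary_t` stand in for the pairwise and unary potentials described later, and the simplified sum-product message above is used; all names and interfaces are illustrative, not the authors' implementation:

```python
import numpy as np

def approx_message(particles_s, particles_t, unary_t, pairwise):
    """Simplified message from node t to s: for each particle of s, sum
    pairwise compatibility with each neighbor particle, weighted by the
    neighbor's unary likelihood."""
    m = np.zeros(len(particles_s))
    for i, xs in enumerate(particles_s):
        m[i] = sum(pairwise(xs, xt) * ut
                   for xt, ut in zip(particles_t, unary_t))
    return m

def reweight_resample(particles, log_unary, log_msg, rng):
    """Particle weight = unary * incoming messages (computed in log space
    for numerical stability), followed by importance resampling."""
    log_w = log_unary + log_msg
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                         # normalize weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], w
```

A particle with a dominant combined log-likelihood will dominate the resampled set, concentrating the belief around high-likelihood pose hypotheses.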
IV-B Unary Likelihood
The unary potential represents the compatibility of each pose hypothesis with the RGB-D observation, $Z = (Z_{rgb}, Z_d)$. The RGB and depth portions of the observation are treated as independent, such that the unary likelihood is:

$$\phi(X_s, Z_s) = \phi_{rgb}(X_s, Z_{rgb}) \cdot \phi_d(X_s, Z_d) \quad (6)$$

where $\phi_{rgb}$ and $\phi_d$ are the likelihoods with respect to the RGB and depth parts of the observation.
IV-B1 RGB Unary Likelihood
The RGB portion of the unary likelihood makes use of the Dilated ResNets architecture [Yu2017]. Dilated convolutions maintain high-resolution feature maps, which is beneficial for dense prediction tasks such as semantic segmentation.
The CNN outputs a pixelwise score for each object part class $p \in \{1, \dots, P\}$. We apply a sigmoid function $\sigma$ so the final scores lie between zero and one. This constitutes a learned heatmap $H_p = \sigma(f(Z_{rgb})_p)$ over an RGB observation $Z_{rgb}$, where $f$ is the Dilated ResNets model trained on $P$ parts and $f(\cdot)_p$ is the output indexed at class $p$. For each particle hypothesis $X_p^{(i)}$, we generate a mask over the image for the object part at the hypothesis pose. We transform the mesh model of the part to pose $X_p^{(i)}$ and use the camera parameters to obtain a corresponding binary mask $M(X_p^{(i)})$ in image space. We represent the likelihood of a particle over the heatmap using the Jaccard index [Moulton2018Jaccard], commonly called the Intersection over Union (IoU), between the heatmap and the rendered mask:

$$\phi_{rgb}(X_p^{(i)}, Z_{rgb}) = \frac{|H_p \cap M(X_p^{(i)})|}{|H_p \cup M(X_p^{(i)})|} \quad (7)$$

The CNN is trained using the analogous IoU loss and, as such, represents a learned likelihood function over the image.
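Since the heatmap is real-valued in [0, 1], the Jaccard index can be evaluated in its soft form, with intersection and union computed via pixelwise minima and maxima. A minimal sketch (names are illustrative; the exact formulation in the implementation may differ):

```python
import numpy as np

def soft_iou(heatmap, mask, eps=1e-8):
    """Soft Jaccard index between a [0, 1] heatmap and a binary mask:
    intersection = pixelwise min, union = pixelwise max."""
    inter = np.minimum(heatmap, mask).sum()
    union = np.maximum(heatmap, mask).sum()
    return inter / (union + eps)

# A confident heatmap overlapping the rendered mask scores high; a
# non-overlapping hypothesis scores near zero.
h = np.zeros((4, 4)); h[:2, :2] = 0.8   # heat in the top-left block
m = np.zeros((4, 4)); m[:2, :2] = 1.0   # rendered mask in the same region
```

Under this formulation a pose hypothesis whose rendered mask lands exactly on the high-heat region receives a likelihood close to the heatmap's confidence.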
IV-B2 Depth Unary Likelihood
For a given part $p$, a masked depth observation $\hat{Z}_d$ is generated by thresholding the heatmap $H_p$ and using it to mask the depth image $Z_d$. For a particle $X_p^{(i)}$, $\phi_d(X_p^{(i)}, Z_d)$ is the exponential of the negative average pixelwise error between $\hat{Z}_d$ and the depth render of the mesh model of part $p$ at pose $X_p^{(i)}$. The error is only evaluated over areas in which the two depth images overlap. If there is no overlap between the masked observation and the hypothesis, we assign a chosen constant maximum error instead.
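A minimal sketch of this depth likelihood, assuming depth images where zeros mark invalid (masked-out) pixels, and with an illustrative maximum-error constant (the value used by the authors is not specified here):

```python
import numpy as np

def depth_likelihood(obs_depth, rendered_depth, max_error=0.5):
    """exp(-mean absolute depth error) over the region where both the
    masked observation and the rendered hypothesis are valid (non-zero).
    With no overlap, the chosen maximum error is assigned instead."""
    overlap = (obs_depth > 0) & (rendered_depth > 0)
    if not overlap.any():
        return float(np.exp(-max_error))   # no overlap: maximum error
    err = np.abs(obs_depth[overlap] - rendered_depth[overlap]).mean()
    return float(np.exp(-err))
```

A hypothesis whose render matches the observed depth exactly scores 1, and the score decays exponentially as the average depth discrepancy grows.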
IV-C Pairwise Likelihood
The pairwise likelihood $\psi(X_s, X_t)$ between neighbouring particles measures how compatible $X_s$ is with respect to $X_t$. If $X_s$ falls within the joint limits of part $s$ with respect to part $t$ at pose $X_t$, then $\psi(X_s, X_t) = 1$. Otherwise, the likelihood is the exponential of the negative error between $X_s$ and the nearest joint limit. We refer to [Desingheaaw4523] for further details.
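For a single revolute or prismatic degree of freedom, this pairwise potential reduces to a limit check on the implied joint value. The following one-dimensional sketch is illustrative, not the full SE(3) computation:

```python
import numpy as np

def pairwise_likelihood(joint_value, lower, upper):
    """1 inside the joint limits; otherwise, exponential decay with the
    distance to the nearest limit (illustrative 1-DoF version)."""
    if lower <= joint_value <= upper:
        return 1.0
    dist = min(abs(joint_value - lower), abs(joint_value - upper))
    return float(np.exp(-dist))
```

In the full method, the joint value would be recovered from the relative transform between the two part pose hypotheses before applying this check.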
IV-D Particle Augmentation
At each node $s$, the particle set $M_s$ can be augmented by drawing particles from various proposal distributions. Given the $T$ particles in $M_s$, Gaussian noise is first added to the current particles, then the distribution is augmented to $\hat{M}_s = M_s \cup M_s^{aug}$, where $M_s^{aug}$ represents the particles generated from the augmentation procedure. The set $\hat{M}_s$ contains $T + T_{aug}$ particles, where $T_{aug}$ is the number of augmented particles. Various proposals $Q_{pair}$, $Q_{unary}$, and $Q_{rand}$, as described below, can be used to augment the particle set. This optional variant is evaluated and discussed in the results section.
Pairwise: The pairwise proposal distribution $Q_{pair}$ is conditioned on a sample $X_t$ drawn from the neighboring node $t$. Using the known geometric relationship between nodes $s$ and $t$, a compatible proposal for node $s$ is generated from $X_t$.
Unary: The unary proposal distribution $Q_{unary}$ draws samples based on the unary potential $\phi(X_s, Z_s)$.
Random: The random proposal distribution $Q_{rand}$ draws additional noisy samples. This can be used to avoid the belief falling into a local minimum due to the high dimensionality of the orientation space, and to account for mirror symmetry in some objects.
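The augmentation step above can be sketched as follows, with each proposal distribution represented as a sampling function. The interface and the fraction bookkeeping are assumptions made for illustration:

```python
import numpy as np

def augment_particles(particles, proposals, fractions, rng, sigma=0.01):
    """Perturb the current (T, D) particle set with Gaussian noise, then
    append extra particles drawn from each proposal. `proposals` maps a
    name to a sampler fn(n) -> (n, D); `fractions` gives each proposal's
    share of T additional particles."""
    T, D = particles.shape
    noisy = particles + rng.normal(0.0, sigma, size=(T, D))
    extra = [sampler(int(round(fractions[name] * T)))
             for name, sampler in proposals.items()]
    return np.vstack([noisy] + extra)
```

In the full method the samplers would implement $Q_{pair}$, $Q_{unary}$, and $Q_{rand}$, with their fractions adjusted over iterations as described in the experiments.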
We evaluate our methods for articulated object localization in uncluttered and cluttered scenes. We run experiments on each component of our method and provide an analysis of their effects. These results provide quantitative and qualitative evidence of the accuracy and practicality of our methods.
We test on 20 uncluttered and 17 cluttered test scenes, unseen in the training data. We localize 196 total object instances in these scenes. We do not include results on objects which are severely or fully occluded such that there is no clear observation of any part. We remove 19 objects which fall into this category. An example of such a case is shown in the highly cluttered scene in Figure 7(a), where the flashlight (behind the hammer) and lineman’s pliers (behind the clamp) are almost entirely occluded.
V-A Dataset & Training
Our custom dataset consists of eight distinct hand tool instances: hammer, clamp, boxcutter, flashlight, screwdriver, longnose pliers, and two instances of lineman's pliers (see Figure 4). We collect videos of both cluttered and uncluttered scenes using the Fetch Mobile Manipulator's onboard Primesense Carmine 1.09 sensor. The articulated hand tools span the full range of possible articulations in the data. Semantic masks and 6D poses for the objects are labelled using Label Fusion [marion2018label], which generates annotations for each video once the first scene is manually labelled. Semantic part masks and part poses are calculated using the object URDFs. The pixels in the images which do not correspond to a tool part are given the class label "background." After downsampling to remove adjacent frames in the videos, the dataset contains RGB-D images.
We train the Dilated ResNets DRN-D-22 architecture [Yu2017] to perform semantic segmentation on 90% of the dataset, reserving 10% for validation. We further augment the training images with random crops, flips, and rotations, increasing the training set size by applying two transforms per image. The backbone is pre-trained on ImageNet, and the last layers are fine-tuned on our dataset. We employ the Intersection over Union (IoU) loss with an Adam optimizer. We train for 10 epochs on an RTX 2080 Max-Q GPU.
V-B Implementation Details
Our implementation performs efficient unary potential computation on the GPU, both to evaluate the heatmap from DRN and to generate binary masks and depth images for pose hypotheses. The current implementation is vectorized and processes all object parts together in one iteration with 300 particles. The computation time could be further reduced with a more efficient implementation.
The $x$ and $y$ locations of particle poses are initialized randomly in areas corresponding to high-heat pixels of the heatmap over the RGB observation. The $z$-axis is initialized to the corresponding depth in the observed depth image. The initial orientations are uniformly distributed. For completely occluded parts which do not appear on the segmentation mask, we generate compatible poses from the neighbour initializations.
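A sketch of this initialization, assuming a single-part heatmap and a registered depth image, and representing orientation as Euler angles for brevity (back-projection of pixel coordinates through the camera model is omitted):

```python
import numpy as np

def init_particles(heatmap, depth, n, rng, thresh=0.5):
    """Sample (u, v) pixel locations among high-heat pixels, read z from
    the registered depth image, and draw orientations uniformly."""
    vs, us = np.nonzero(heatmap > thresh)   # high-heat pixel coordinates
    idx = rng.choice(len(us), size=n)
    u, v = us[idx], vs[idx]
    z = depth[v, u]                          # depth at the sampled pixels
    euler = rng.uniform(-np.pi, np.pi, size=(n, 3))
    return np.column_stack([u, v, z, euler])
```

Each returned row holds a pixel location, a depth value, and a random orientation, forming a weak but observation-grounded initial hypothesis.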
V-C Evaluation Metric
For evaluation, we use the average point matching error proposed by Hinterstoisser et al. [Hinterstoisser2013pose], which measures the average pairwise distance between corresponding points of the rigid object model's point cloud at the ground truth and estimated poses:

$$e_{match} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert$$

where $x_i$ and $\hat{x}_i$ are corresponding points in the ground truth and estimated point clouds respectively, each with $n$ points in the rigid object model. We also report the symmetric point matching error, which measures the average distance between each point in the estimated point cloud and the nearest point in the ground truth point cloud:

$$e_{sym} = \frac{1}{n} \sum_{i=1}^{n} \min_{j} \lVert x_j - \hat{x}_i \rVert$$
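Both metrics are straightforward to compute over sampled model point clouds; a sketch in NumPy (function names are ours):

```python
import numpy as np

def matching_error(pts_gt, pts_est):
    """Average distance between corresponding model points
    (the ADD-style matching error)."""
    return np.linalg.norm(pts_gt - pts_est, axis=1).mean()

def symmetric_matching_error(pts_gt, pts_est):
    """Average distance from each estimated point to its nearest
    ground-truth point, which does not penalize rotations about an
    axis of symmetry."""
    d = np.linalg.norm(pts_est[:, None, :] - pts_gt[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Note the asymmetry in behavior: a point cloud mapped onto itself by a symmetry transform yields zero symmetric error while the matching error can remain large.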
The symmetric matching error represents the error in symmetric objects, such as the screwdriver, better by not penalizing estimates rotated around a degree of symmetry in the object. However, it tends to provide artificially low errors for incorrect estimates.
Table I: Average pose estimation error in cm (mean ± standard deviation) for the matching and symmetric matching metrics, on uncluttered scenes, cluttered scenes, and all scenes combined.

| Method | Uncluttered Matching | Uncluttered Symmetric | Cluttered Matching | Cluttered Symmetric | All Matching | All Symmetric |
|---|---|---|---|---|---|---|
| DRN+ICP | 6.59 ± 3.36 | 3.31 ± 2.69 | 6.38 ± 8.00 | 3.93 ± 10.84 | 6.47 ± 6.32 | 3.65 ± 8.22 |
| Parts-PF | 6.03 ± 4.64 | 2.73 ± 4.04 | 4.55 ± 2.57 | 1.43 ± 1.33 | 5.23 ± 3.74 | 2.03 ± 2.98 |
| MP+RGB | 4.57 ± 3.96 | 2.83 ± 2.95 | 3.60 ± 2.75 | 2.04 ± 1.95 | 4.06 ± 3.40 | 2.41 ± 2.50 |
| MP+RGB+ICP | 4.77 ± 3.89 | 3.03 ± 2.88 | 3.85 ± 2.65 | 2.29 ± 1.90 | 4.28 ± 3.32 | 2.64 ± 2.44 |
| MP+RGB-D+Aug | 4.80 ± 3.80 | 2.90 ± 2.60 | 3.71 ± 2.55 | 2.18 ± 1.88 | 4.21 ± 3.23 | 2.51 ± 2.27 |
| MP+RGB-D | 3.58 ± 2.56 | 1.89 ± 1.84 | 2.65 ± 2.01 | 1.30 ± 1.21 | 3.08 ± 2.32 | 1.57 ± 1.56 |
We implement two baselines, described below.
Segmentation with ICP (DRN+ICP): We initialize the 3D position of the particle hypotheses using the depth image and segmentation mask generated by Dilated ResNets, with random orientations. We use Iterative Closest Point (ICP) [besl1992method] to find the transform from the initialized point cloud to the observed point cloud. ICP works best for local refinement and is prone to failure when the initial orientation is incorrect. To accommodate this failure mode, we generate multiple proposal poses per part, perform ICP on each, and select the one with the best final fitness score as the estimate. A similar method is used by Wong et al. [wong2017segicp].
Part-based Particle Filter (Parts-PF): This baseline consists of independent particle filters at each tool part. We use the unary potential from Equation 6 to calculate the weights for each hypothesis, and use importance sampling to select particles at each iteration. We use 300 particles per part, and run for 85 iterations.
For both DRN+ICP and Parts-PF baselines, if a part is completely occluded in the image, a pose estimate cannot be generated from the segmentation. In such cases, we randomly select a neighboring part for which an estimate was made and use the object model to generate a corresponding pose for the occluded part. If the edge between the parts is articulated, a joint value is uniformly sampled within the joint limits.
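The neighbor-based proposal for occluded parts can be sketched for a single prismatic joint as follows; the geometry (a fixed offset plus a displacement along the joint axis) is illustrative, not the full SE(3) kinematics:

```python
import numpy as np

def propose_from_neighbor(neighbor_xyz, offset_xyz, lower, upper, axis, rng):
    """Propose a position for an occluded part from a localized neighbor:
    sample a joint value uniformly within the limits, then apply the known
    fixed offset plus the prismatic displacement along the joint axis."""
    q = rng.uniform(lower, upper)            # sampled joint value
    return neighbor_xyz + offset_xyz + q * axis

rng = np.random.default_rng(1)
p = propose_from_neighbor(np.zeros(3), np.array([0.1, 0.0, 0.0]),
                          0.0, 0.05, np.array([1.0, 0.0, 0.0]), rng)
```

For a revolute joint, the sampled value would instead parameterize a rotation about the joint axis before composing with the neighbor's pose.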
V-E Parts-Based Pose Estimation
To fully understand the performance of our method, we perform an ablation study over its components. We focus on three factors: message passing (MP), the use of RGB only versus the inclusion of depth (RGB-D) in the unary potential, and the augmentation step (Aug). We use 300 particles (before augmentation) for all experiments; this number was qualitatively observed to achieve sufficient results in most cases. While the representation of the underlying belief improves with more particles, computation becomes intractable for very large particle sets. We run each method for 100 iterations, after which we observe little change in the estimate.
The results of each method are shown in Table I. Figure 5 shows the results for all scenes. Message passing leads to superior results compared to the baselines, DRN+ICP and Parts-PF, which do not use message passing. The best performance is achieved by using the full RGB-D observation (MP+RGB-D). Further description and analysis of each method is provided below.
Message Passing: RGB Unary (MP+RGB): To test the effect of the depth component of the unary potential, we evaluate using only the RGB component, informed by the heatmap, such that Equation 6 becomes $\phi(X_s, Z_s) = \phi_{rgb}(X_s, Z_{rgb})$. We use message passing to calculate the final likelihood for each particle using the pairwise potential, as described in Equation 4. Using only the RGB image, we obtain lower accuracy on the pose estimates. The RGB image captures the position and orientation of the objects well in image space (see Figure 6(c)), but is prone to falling into local minima along the axes which are not well represented by the image, namely $z$, pitch, and roll (see Figure 6(d)).
Message Passing: RGB Unary and ICP (MP+RGB+ICP): To attempt to recover from the errors in $z$, pitch, and roll, we add an ICP step on the final estimate from MP+RGB to align it to the masked depth image. We estimate the offset along the $z$-axis based on the depth image. Since ICP is a local refinement method which relies on an accurate estimate of the initial transform, the ICP step does not always reconcile the orientation error in pitch and roll, for which the initial transform is unknown.
Message Passing: RGB-D Unary (MP+RGB-D): We hypothesize that by including depth information in the unary potential, we can more reliably estimate the full 6D pose of the parts and make up for missing information in the 2D image. The depth term improves the estimation accuracy by discouraging all unoccluded particles from deviating from the depth image at each iteration. This performs better than MP+RGB+ICP because the latter only attempts to align to the depth image in the final iteration, where it often has converged to a local minimum. This is the best performing method. Selected qualitative results are shown in Figures 7 and 8.
Message Passing with Augmentation (MP+RGB-D+Aug): Using the unary informed by both RGB and depth, as well as message passing, we augment the particle set at the beginning of each iteration as described in Section IV-D. We use the augmented particle set, with 5% of the additional particles drawn from the unary proposal $Q_{unary}$. At the first iteration, the remaining 95% of the particles are drawn from $Q_{rand}$. The percentage of particles drawn from $Q_{pair}$ is increased by 10% every 5 iterations, and the percentage from $Q_{rand}$ is decreased accordingly, up to a maximum of 90% of particles from $Q_{pair}$.
Qualitatively, we observe that the augmentation step leads to quicker convergence in some highly cluttered scenarios. On average, however, this method performs worse than MP+RGB-D, in part because it is susceptible to propagating incorrect estimates with artificially inflated pairwise scores, due to the addition of perfectly compatible pose estimates. These results depend on careful selection of parameters and might be improved by further tuning. Further analysis of the effect of the parameters on the final estimate is left to future work.
V-F Analysis on Tool Classes
The results for the MP+RGB-D and MP+RGB-D+Aug methods for each object in the dataset are shown in Table II. We present the percentage of object instances per class which have error under 4 cm. We observed high error on the flashlight, likely due to its symmetric geometry. The unary potential does not explicitly encode texture information, so geometrically symmetric parts can tend to flip. The clamp is among the most difficult objects to localize due to its high dimensionality and significant self-occlusions.
Table II (excerpt): Fraction of object instances with matching error under 4 cm.

| Object | MP+RGB-D | MP+RGB-D+Aug |
|---|---|---|
| lineman’s pliers A | 0.909 | 0.818 |
| lineman’s pliers B | 0.917 | 0.833 |
V-G Qualitative Analysis
Selected examples of the MP+RGB-D method are shown in Figure 7. Figure 8 shows a selected scene which demonstrates the effectiveness of the MP+RGB-D method at localizing hand tools, even under partial or full occlusion of some of their parts. While the heatmap may provide little to no information for some parts, by leveraging geometric information through message passing, we are able to resolve the pose of all the visible tools in the scene.
In this work, we present an inference technique for estimating articulated parts-based object pose in clutter. We model the part poses of each articulated object as a Markov Random Field (MRF) and perform efficient particle-based belief propagation. We use articulation constraints between parts and a novel learned likelihood function to perform message passing in the MRF. We perform a thorough analysis of our method and show that it performs well on both uncluttered and cluttered scenes. We demonstrate that the message passing step is highly beneficial in terms of enforcing geometric consistency to inform pose estimation in the high dimensional space of 6D articulated object pose.
We would like to thank Zhen Zeng for helpful discussions and feedback, and Adrian Röfer and Zhiming Ruan for creating the 3D tool models. Zhiming Ruan also contributed to initial software development.