Distributional Depth-Based Estimation of Object Articulation Models

08/12/2021 ∙ by Ajinkya Jain, et al. ∙ The University of Texas at Austin

We propose a method that efficiently learns distributions over articulation model parameters directly from depth images without the need to know articulation model categories a priori. By contrast, existing methods that learn articulation models from raw observations typically only predict point estimates of the model parameters, which are insufficient to guarantee the safe manipulation of articulated objects. Our core contributions include a novel representation for distributions over rigid body transformations and articulation model parameters based on screw theory, von Mises-Fisher distributions, and Stiefel manifolds. Combining these concepts allows for an efficient, mathematically sound representation that implicitly satisfies the constraints that rigid body transformations and articulations must adhere to. Leveraging this representation, we introduce a novel deep learning based approach, DUST-net, that performs category-independent articulation model estimation while also providing model uncertainties. We evaluate our approach on several benchmarking datasets and real-world objects and compare its performance with two current state-of-the-art methods. Our results demonstrate that DUST-net can successfully learn distributions over articulation models for novel objects across articulation model categories, which generate point estimates with better accuracy than state-of-the-art methods and effectively capture the uncertainty over predicted model parameters due to noisy inputs.


1 Introduction

Articulated objects, such as drawers, staplers, refrigerators, and dishwashers, are ubiquitous in human environments. These objects consist of multiple rigid bodies connected via mechanical joints such as hinge joints or slider joints. Robots in human environments will need to interact with these objects often while assisting humans in day-to-day tasks. To interact safely with such objects, a robot must reason about their articulation properties while manipulating them. An ideal method for learning such properties would estimate these parameters directly from raw observations, such as RGB-D images, while requiring limited or no a priori information about the task. The ability to additionally provide a confidence over the estimated properties would allow such a method to be leveraged in the development of safe motion policies for articulated objects [jain2018efficient].

The majority of existing methods that learn articulation models for objects from visual data either need fiducial markers to track motion between object parts [sturm2011probabilistic, katz2008manipulating, katz2013interactive, niekum2015online] or require textured objects [pillai2015learning, martin2014online, martin2016integrated, martin2019coupled, jain2019learning]. Recent deep-learning-based methods address this by predicting articulation properties for objects from raw observations, such as depth images [abbatematteo2019learning, li2020category, liu2020nothing, jain2020screwnet] or point-cloud data [wang2019shape2motion, yan2019RPM]. However, most of these methods [abbatematteo2019learning, li2020category, wang2019shape2motion, yan2019RPM] require knowledge of the articulation model category for the object (e.g., whether it has a revolute or prismatic joint), which may not be available in many realistic settings. Alleviating this requirement, jain2020screwnet introduced ScrewNet, which uses a unified representation based on screw transformations to represent different articulation types and performs category-independent articulation model estimation directly from raw depth images. However, ScrewNet [jain2020screwnet] and related methods [abbatematteo2019learning, li2020category, liu2020nothing, wang2019shape2motion, yan2019RPM] only predict point estimates of an object's articulation model parameters. Reasoning about the uncertainty in the estimated parameters can provide significant advantages for ensuring success in robot manipulation tasks, and allows for further advancements such as robust planning [jain2018efficient], active learning using human queries [cui2018active], and the learning of behavior policies that provide safety assurances [taylor20safe]. Motivated by these advantages, we propose a method for learning articulation models that estimates the uncertainty over model parameters using a novel distribution over the set of screw transformations based on the matrix von Mises-Fisher distribution over Stiefel manifolds [chikuse2003statistics]. We introduce DUST-net, the Deep Uncertainty estimation on Screw Transforms network, a novel deep-learning-based method that, in addition to providing point estimates of the object's articulation model parameters, leverages raw depth images to provide uncertainty estimates that can be used to guide the robot's behavior, without requiring knowledge of the object's articulation model category a priori.

DUST-net garners numerous benefits over existing methods. First, DUST-net estimates articulation properties for objects with uncertainty estimates, unlike most current methods [abbatematteo2019learning, li2020category, jain2020screwnet, liu2020nothing, wang2019shape2motion, yan2019RPM]. These uncertainty estimates, apart from helping robots manipulate objects safely [jain2018efficient], could allow a robot to take information-gathering actions when it is not confident, enhancing its chances of completing the task. Second, similar to ScrewNet [jain2020screwnet], DUST-net can estimate model parameters without the need to know the articulation model category a priori, by leveraging a unified representation for different articulation model types. Third, this unified representation makes DUST-net more computationally and data-efficient than other state-of-the-art methods [abbatematteo2019learning, li2020category], as it uses a single network to estimate model parameters for all common articulation models, unlike methods that require a separate network for each articulation model category [abbatematteo2019learning, li2020category, wang2019shape2motion, yan2019RPM]. Empirically, DUST-net outperforms other methods even when trained using only half as much training data. Fourth, the distributional learning setting yields more robustness to outliers and noise. Fifth, DUST-net can reliably estimate distributions over articulation model parameters for objects directly in the robot's camera frame. By contrast, ScrewNet [jain2020screwnet], the most closely related approach to ours, can only predict point estimates for articulation model parameters in the object's local frame.

Figure 1: DUST-net uses a sequence of depth images to compute the parameters of the conditional distribution over an articulation model's joint parameters and configurations. This distribution allows for inference and reasoning, such as uncertainty and confidence estimates, over both the parameters and the configurations. Using a von Mises-Fisher distribution on a Stiefel manifold allows for an efficient reparameterization that inherently obeys the constraints that define rigid body transformations.

We evaluate DUST-net through experiments on two benchmarking datasets: a simulated articulated objects dataset [abbatematteo2019learning] and the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet], as well as three real-world objects: a microwave, a drawer, and a toaster oven. We compare DUST-net with two state-of-the-art methods, namely ScrewNet [jain2020screwnet] and an MDN-based method proposed by abbatematteo2019learning, as well as two baseline methods. The experiments demonstrate that the samples drawn from the distributions learned by DUST-net result in significantly better estimates for articulation model parameters in comparison to the point estimates predicted by other methods. Additionally, the experiments show that DUST-net can successfully and accurately capture the uncertainty over articulation model parameters resulting from noisy inputs.

2 Related Work

Articulation model estimation from visual observations: A widely used approach for estimating articulation models is based on the probabilistic framework proposed by sturm2011probabilistic, which uses time-series observations of the 6D poses of different parts of an articulated object to learn the relationships between them [sturm2011probabilistic, niekum2015online, pillai2015learning, jain2019learning]. More recently, abbatematteo2019learning and li2020category proposed methods to learn articulation properties for objects from raw depth images, given the articulation model category. In a related body of work on object part mobility estimation, wang2019shape2motion and yan2019RPM proposed approaches to segment the different parts of an object in an input point cloud and estimate their mobility relationships, again given a known articulation model category. Alleviating the requirement of a known articulation model category, jain2020screwnet recently proposed ScrewNet, which performs category-independent articulation model estimation from depth images. However, these methods only predict point estimates for the articulation model parameters, while DUST-net predicts a distribution over their values.

Rigid Body Pose Estimation: Our contributions are related to existing work on estimating distributions describing the orientation of rigid bodies. gilitschenski2015unscented, arun2018probabilistic, srivatsan2016estimating, and rosen2019se propose strategies that can be used to estimate the rigid body transformation of an object using combinations of Bingham, Gaussian, and von Mises-Fisher distributions. The mathematical model used by our approach is inspired by these works, but 1) extends them to also represent uncertainty over the configuration of articulated object components about screw axes, and 2) integrates them into a deep learning model that is capable of learning these configurations from raw depth images. In addition, while these approaches use distributions over orientations and rigid body transformations to produce estimates, DUST-net directly outputs a distribution that can be used to facilitate further applications such as uncertainty-aware behavior planning.

Interactive perception (IP): katz2008manipulating introduced IP as a method to leverage a robot's interaction with objects to generate a rich perceptual signal for articulation model estimation for planar objects, and later extended it to learn 3D articulation models [katz2013interactive]. martin2016integrated used hierarchical recursive Bayesian filters to make estimation more robust and developed online methods for articulation model estimation from RGB images [martin2014online, martin2016integrated, martin2019coupled]. A comprehensive survey of IP methods in robotics was presented by bohg2017interactive. While IP is a powerful tool for estimating the articulation properties of objects, the wide majority of existing IP methods require textured objects, unlike DUST-net, which learns these properties from depth images.

Further approaches: Articulation motion models can be viewed as geometric constraints imposed on multiple rigid bodies. Such constraints can be learned from human demonstrations by leveraging different sensing modalities [perez2017c, liu2019learning, daniele2020multiview, liu2020nothing, subramani2018inferring]. Recently, daniele2020multiview proposed a multimodal learning framework that incorporates both vision and natural language information for articulation model estimation. However, these approaches predict point estimates for the articulation model parameters, unlike DUST-net, which predicts a distribution over the articulation model parameters.

3 Preliminaries

Problem Formulation: Given a sequence of depth images of motion between two parts of an articulated object, we estimate the parameters of a probability distribution representing uncertainty over the parameters of the articulation model governing the motion between the two parts. Following jain2020screwnet, we define the model parameters as the parameters of the screw axis of motion, S = (l, m), where both the axis orientation l and the moment m are elements of R^3. This unified parameterization covers all articulation models with at most one degree of freedom (DoF), namely rigid, revolute, prismatic, and helical [jain2020screwnet]. Additionally, we estimate the parameters of a distribution representing uncertainty over the configurations identifying the rigid body transformations between the two parts in the given sequence of images under the model with parameters S. Configurations correspond to a set of tuples, q = (θ, d), defining a rotation θ about and a displacement d along the screw axis S. We assume that the relative motion between the two object parts is governed by a single articulation model.

Background: DUST-net uses a reparameterization of the space of rigid body transformations that allows distributions over an object’s articulation model parameters to be defined naturally. Here, we briefly describe the mathematical foundation leveraged in the proposed distribution over articulation parameters.

Figure 2: Matrix von Mises-Fisher distribution over the Stiefel manifold V_{2,3}. The X, Y, and Z axes are shown in red, blue, and green, respectively. Magenta and cyan denote the vectors corresponding to the first and second columns of the matrix representing the mode of the distribution.

Screw Transformations: Chasles' theorem states that "Any displacement of a body in space can be accomplished by means of a rotation of the body about a unique line in space accompanied by a translation of the body parallel to that line" [siciliano2016springer]. Such a line is called a screw axis, S. We represent this line using Plücker coordinates (l, m), where l is the unit vector along the line and m = p × l is its moment about the origin for any point p on the line [siciliano2016springer, jia2019]. The constraints ||l|| = 1 and ⟨l, m⟩ = 0 ensure that the degrees of freedom of the line in space are restricted to four. A rigid body displacement in SE(3) is then defined as a screw transform σ = (l, m, θ, d), where the linear displacement d and the rotation θ are connected through the pitch h of the screw axis, d = hθ.
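To make the construction above concrete, the following numpy sketch (function names are ours, not from the paper) builds the homogeneous transform for a rotation θ about and a displacement d along a screw axis given in Plücker coordinates:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def screw_to_transform(l, m, theta, d):
    """Homogeneous transform for a rotation theta about and translation d
    along the screw axis with Plücker coordinates (l, m)."""
    # The Plücker constraints: unit orientation, moment orthogonal to it.
    assert np.isclose(np.linalg.norm(l), 1.0) and np.isclose(l @ m, 0.0)
    K = skew(l)
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues
    p = np.cross(l, m)            # point on the axis closest to the origin
    t = p - R @ p + d * l         # translation induced by the screw motion
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

For example, a pure translation of 2 along the z-axis (θ = 0, d = 2) yields a transform whose rotation block is the identity and whose translation is (0, 0, 2).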

Stiefel manifold: The Stiefel manifold V_{k,m} is the space whose points are sets of k orthonormal vectors in R^m, called k-frames in R^m [chikuse2003statistics]. Points on the Stiefel manifold are represented by the set of m × k matrices X such that X^T X = I_k, where I_k is the k × k identity matrix; thus V_{k,m} = {X ∈ R^{m×k} : X^T X = I_k}. Some special cases of the Stiefel manifold are the unit hypersphere in R^m for k = 1, and the orthogonal group O(m) for k = m.
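The defining constraint X^T X = I_k is easy to check and to sample from numerically; a small sketch (our helper names), using the QR decomposition of a Gaussian matrix to draw uniform points on the manifold:

```python
import numpy as np

def on_stiefel(X, tol=1e-9):
    """Check X^T X = I_k, i.e. the columns of X form a k-frame in R^m."""
    k = X.shape[1]
    return np.allclose(X.T @ X, np.eye(k), atol=tol)

def random_stiefel(m, k, rng):
    """Haar-uniform point on V_{k,m}: QR of a Gaussian matrix, with the
    sign of R's diagonal folded into Q to fix the QR sign convention."""
    Q, R = np.linalg.qr(rng.standard_normal((m, k)))
    return Q * np.sign(np.diag(R))
```

For V_{2,3}, the case used by DUST-net, this produces 3 × 2 matrices with orthonormal columns.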

Matrix von Mises-Fisher distribution: A random matrix X on V_{k,m} is said to have the matrix von Mises-Fisher distribution (or matrix Langevin distribution) if its density function is proportional to etr(F^T X), where F is any m × k parameter matrix and the normalizing constant is a hypergeometric function of matrix argument, 0F1(m/2; F^T F/4) [chikuse2003statistics]. We can write the general (unique) singular value decomposition (SVD) of F as F = Γ Δ Θ^T, where Γ ∈ Ṽ_{k,m}, Δ = diag(δ_1, …, δ_k) with δ_1 ≥ … ≥ δ_k ≥ 0, and Θ ∈ O(k); here Ṽ_{k,m} denotes the set of matrices in V_{k,m} whose first row has all positive elements, and O(k) denotes the orthogonal group in k dimensions. It can be shown that the mode of the distribution is M = Γ Θ^T. For more details, we refer to [chikuse2003statistics].
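The relationship between F, the mode, and the concentration can be made concrete with a short numpy sketch (function names are ours): the SVD splits F into a Stiefel-manifold mode and singular values that control concentration, and the mode maximizes the unnormalized density etr(F^T X).

```python
import numpy as np

def vmf_mode_and_concentration(F):
    """Split F = Gamma @ diag(s) @ Theta^T into the distribution's mode
    M = Gamma Theta^T (a point on the Stiefel manifold) and the singular
    values s, which control how concentrated the distribution is."""
    Gamma, s, ThetaT = np.linalg.svd(F, full_matrices=False)
    return Gamma @ ThetaT, s

def vmf_unnormalized_logpdf(X, F):
    """log etr(F^T X): the matrix von Mises-Fisher density up to the
    normalizing hypergeometric constant."""
    return np.trace(F.T @ X)
```

Because tr(F^T M) equals the sum of the singular values of F, the mode attains the largest unnormalized log-density among all points on the manifold.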

4 Approach

Given a sequence of depth images of motion between two parts of an articulated object, DUST-net estimates the parameters of the joint probability distribution representing uncertainty over the articulation model parameters governing the motion between the two parts and the observed configurations. When deciding how to learn this distribution, two goals arise. First, while some parameters, such as the translation of an object part along a screw axis, are defined on Euclidean space, the set of valid screw axes exhibits constraints that prevent standard distributions defined on R^n from being applied without complicating the learning process. For example, a standard representation for distributions over screw axes is the product of a Bingham distribution over the line's orientation and a multivariate normal distribution over its position in space [siciliano2016springer]. However, this representation produces non-unique estimation targets: a rotation of θ about the screw axis with orientation l results in the same transformation as a rotation of −θ about the screw axis with orientation −l, and, similarly, a displacement of d along l results in the same transformation as a displacement of −d along −l. This leads to ambiguities in the targets of the estimation problem and can hinder the performance of the trained estimator. By selecting a representation that accounts for these symmetries, these non-unique estimation targets are removed. Second, once a suitable parameterization is chosen, we seek a parametric form for the joint distribution which can be learned by a deep network.
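The axis-negation symmetry described above is easy to verify numerically; a two-line demonstration via Rodrigues' formula:

```python
import numpy as np

def rodrigues(l, theta):
    """Rotation matrix for angle theta about unit axis l."""
    K = np.array([[0, -l[2], l[1]],
                  [l[2], 0, -l[0]],
                  [-l[1], l[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

l = np.array([0.0, 0.0, 1.0])
theta = 0.7
# Negating both the axis and the angle yields the identical rotation, so
# (l, theta) and (-l, -theta) are indistinguishable estimation targets.
assert np.allclose(rodrigues(l, theta), rodrigues(-l, -theta))
```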

First, we consider the problem of parameterizing the set of screw axes. As noted earlier, we define the model parameters as the parameters of the screw axis of motion S = (l, m). However, this parameterization requires that l have unit norm, and that l and m be orthogonal. To eliminate these constraints, we rewrite the moment vector of a screw axis as m = ||m|| m̂, where ||m|| and m̂ represent its magnitude and the unit vector along it, respectively, and the Plücker coordinates for the screw axis as (l, m̂, ||m||). The Plücker coordinates can then be seen as an unconstrained point in the space V_{2,3} × R+, where the 3 × 2 frame [l m̂] lies on the Stiefel manifold V_{2,3} and ||m|| ∈ R+. Consequently, a distribution over screw transformations can be defined as the product of any probability distribution defined on V_{2,3} and a distribution over R+. Importantly, because of the one-to-one mapping from elements of V_{2,3} × R+ to screw axes, the non-unique estimation targets described above are eliminated. Based on this parameterization of screw axes, we define the set of valid configuration parameters as follows: we restrict the rotation about the screw axis to be non-negative and allow the displacement along the axis to take any real value. Note that these constraints do not reduce the representational power of the screw transform to denote a general rigid body transform, but merely ensure a unique representation.
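A minimal sketch of this reparameterization (our function names; the handling of the degenerate case m = 0, i.e. an axis through the origin, is our addition, since the second frame vector is then unconstrained):

```python
import numpy as np

def _any_unit_normal(l):
    """Arbitrary unit vector orthogonal to l, used when m = 0."""
    v = np.array([1.0, 0.0, 0.0]) if abs(l[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    n = np.cross(l, v)
    return n / np.linalg.norm(n)

def reparameterize_screw(l, m, eps=1e-12):
    """Map Plücker coordinates (l, m) to an element of V_{2,3} x R+:
    a 3x2 orthonormal frame [l, m_hat] plus the moment magnitude ||m||."""
    mag = np.linalg.norm(m)
    m_hat = m / mag if mag > eps else _any_unit_normal(l)
    X = np.stack([l, m_hat], axis=1)   # 3x2; columns are orthonormal because
    return X, mag                      # l is unit and m is orthogonal to l
```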

Having described the parameterization of the set of screw axes and configurations, we now consider the task of defining a joint probability distribution over their values. We propose to represent the distribution over the predicted screw axis parameters, with [l m̂] ∈ V_{2,3} and ||m|| ∈ R+, as the product of a matrix von Mises-Fisher distribution defined on the Stiefel manifold V_{2,3} and a truncated normal distribution with truncation interval [0, ∞) over ||m||. Formally,

p(S) = p([l m̂]; F) · N_{[0,∞)}(||m||; μ, σ),    (1)

where F is a 3 × 2 matrix representing the parameters of the matrix von Mises-Fisher distribution over V_{2,3}, and μ and σ denote the mean and standard deviation of the truncated normal distribution.

Given the sequence of images, we also wish to estimate the posterior over the configurations, corresponding to the rotations θ_{1:n} about and displacements d_{1:n} along the screw axis S. We define the joint posterior representing the uncertainty over the screw axis and the configurations about it as the product of the aforementioned distribution and a set of distributions defined over the configuration parameters,

p(S, θ_{1:n}, d_{1:n}) = p(S) · p(θ_{1:n}) · p(d_{1:n}),    (2)

where p(θ_{1:n}) and p(d_{1:n}) represent the sets of distributions over the configurations θ_{1:n} and d_{1:n}, respectively. In this work, we consider both to be products of truncated normal distributions, with the respective sets of means and standard deviations over the configurations θ and d as their parameters.

Distribution parameter matrix F: The parameter matrix for the matrix von Mises-Fisher distribution over V_{2,3} is a 3 × 2 matrix, F. This presents two possible parameterizations: first, to estimate each of the 6 elements of F directly, and second, to estimate the matrices Γ, Δ, and Θ defining the SVD of F, given by F = Γ Δ Θ^T. The second parameterization decouples the two objectives of aligning the distribution's mode with the ground-truth labels and representing the uncertainty: the mode of the distribution is given by M = Γ Θ^T, and the concentration of the distribution is governed by the singular values in Δ. This decoupling allows the network to optimize both objectives independently, whereas in the first parameterization, changes in the elements of F cause changes in both components.

By definition, Δ is a diagonal matrix with two independent parameters, and Θ is a rotation matrix in two dimensions with one independent parameter, its rotation angle. The matrix Γ can be constructed from a 3D rotation matrix R by keeping only the first two columns of R. Hence, the matrix Γ can be defined by three independent Euler angles, denoting rotations according to a fixed convention in the rotating frame. Euler angles can suffer from the problem of gimbal lock [siciliano2016springer], which we resolve by restricting each Euler angle to a suitable range.
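This parameterization can be sketched as follows (our function names; the Z-Y-Z Euler convention here is an assumption for illustration, not necessarily the paper's choice). The point is that any angles and non-negative singular values produce a valid F, whose SVD recovers exactly those singular values:

```python
import numpy as np

def build_F(euler, theta, deltas):
    """Assemble F = Gamma @ Delta @ Theta^T from unconstrained parameters:
    three Euler angles for Gamma (first two columns of a rotation matrix),
    one rotation angle for Theta in SO(2), and two singular values."""
    a, b, c = euler
    Rz = lambda t: np.array([[np.cos(t), -np.sin(t), 0.0],
                             [np.sin(t),  np.cos(t), 0.0],
                             [0.0, 0.0, 1.0]])
    Ry = lambda t: np.array([[np.cos(t), 0.0, np.sin(t)],
                             [0.0, 1.0, 0.0],
                             [-np.sin(t), 0.0, np.cos(t)]])
    R = Rz(a) @ Ry(b) @ Rz(c)      # assumed Z-Y-Z convention
    Gamma = R[:, :2]               # 3x2 frame on the Stiefel manifold
    Theta = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    return Gamma @ np.diag(deltas) @ Theta.T
```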

Normalization factor: One of the main challenges of using the matrix von Mises-Fisher distribution is the calculation of its normalization factor, which is a hypergeometric function of matrix argument [chikuse2003statistics]. In this work, we approximate this hypergeometric function using a truncated series in terms of zonal polynomials, which are multivariate symmetric homogeneous polynomials that form a basis of the space of symmetric polynomials [chikuse2003statistics]. In our preliminary experiments, we found that this truncated series is a good approximation of the hypergeometric function, converging to a finite value as long as the singular values of F, i.e., δ_1 and δ_2, remain bounded.
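One simple way to sanity-check any such series approximation (this Monte Carlo check is our suggestion, not the paper's method) is to note that the normalizing constant equals the expectation of etr(F^T X) under the uniform measure on the Stiefel manifold, which can be estimated by sampling:

```python
import numpy as np

def haar_stiefel_sample(m, k, rng):
    """Haar-uniform sample on V_{k,m}: QR of a Gaussian matrix with the
    sign of R's diagonal absorbed into Q."""
    Q, R = np.linalg.qr(rng.standard_normal((m, k)))
    return Q * np.sign(np.diag(R))

def mc_normalizer(F, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_X[etr(F^T X)] under the uniform measure,
    i.e. the normalizing constant of the matrix von Mises-Fisher density;
    useful for validating a truncated zonal-polynomial series."""
    rng = np.random.default_rng(seed)
    vals = [np.exp(np.trace(F.T @ haar_stiefel_sample(3, 2, rng)))
            for _ in range(n_samples)]
    return float(np.mean(vals))
```

For F = 0 the density is uniform and the estimator returns exactly 1, since etr(0) = 1 for every sample.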

Architecture: DUST-net sequentially connects a ResNet-18 CNN [he2016deep] and a 2-layer MLP. ResNet-18 extracts task-relevant features from the input images, which the MLP uses to predict the set of parameters of the distribution. The network is trained end-to-end, with ReLU activations for the hidden fully-connected layers. The first four of the 40 outputs of the last linear layer of the MLP, corresponding to the angle parameters defining the matrices Γ and Θ, are fed through a ReLU-6 layer to ensure that the predictions map to their respective ranges. The remaining outputs are fed through a Softplus layer to ensure they are non-negative. The detailed network architecture is presented in the appendix (Fig. 7).
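The output-head mapping can be sketched in numpy as follows (the exact rescaling of the ReLU-6 outputs to angle ranges is our assumption for illustration; the paper's constants may differ):

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def softplus(x):
    return np.log1p(np.exp(x))

def split_head(raw):
    """Map 40 raw MLP outputs to distribution parameters: the first four
    pass through ReLU-6 and are rescaled to an angle range (here [-pi, pi),
    an assumed choice), parameterizing Gamma and Theta; the remaining 36
    pass through Softplus so that singular values, means, and precisions
    stay non-negative."""
    raw = np.asarray(raw, dtype=float)
    assert raw.shape == (40,)
    angles = relu6(raw[:4]) / 6.0 * (2.0 * np.pi) - np.pi
    rest = softplus(raw[4:])
    return angles, rest
```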

Training: The training data for the model consists of sequences of depth images of object parts moving relative to each other and the corresponding ground-truth screw transforms. The objects and depth images are rendered in MuJoCo [todorov2012mujoco]. We train DUST-net by maximizing the log-probability of the labels under the predicted distribution. We assume that the observed configurations share the same variance, and we use precision parameters rather than standard deviations to represent the distributions during training, for better numerical stability. Following the discussion on training MDNs by makansi2019overcoming, we separate training into three stages. In the first stage, we assume the dispersion of the matrix von Mises-Fisher distribution to be fixed and learn only the parameters corresponding to the Γ and Θ matrices. In the second stage, we fix the previously learned parameters and learn the remaining parameters of the set. Finally, we train the network to predict the complete set of distribution parameters.
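The truncated-normal factors of this objective are straightforward to evaluate; a stdlib-only sketch (our function names) of the per-configuration log-density and the shared-variance negative log-likelihood:

```python
import math

def truncnorm_logpdf(x, mu, sigma, lo=0.0, hi=float("inf")):
    """Log-density of Normal(mu, sigma) truncated to [lo, hi], written
    with math.erf so nothing beyond the standard library is needed."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    log_norm = math.log(Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma))
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - log_norm

def configuration_nll(values, means, sigma):
    """Negative log-likelihood of observed configurations under per-frame
    truncated normals sharing one standard deviation, mirroring the
    shared-variance assumption used during training."""
    return -sum(truncnorm_logpdf(v, m, sigma) for v, m in zip(values, means))
```

As a check, truncating a zero-mean unit normal to [0, ∞) simply doubles its density, shifting the log-density up by log 2.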

Figure 3: Mean error values on the MAAD (top) and Screw Loss (bottom) metrics for the simulated articulated objects dataset [abbatematteo2019learning] (lower values are better). Point estimates for DUST-net (violet) correspond to the modes of the distributions predicted by DUST-net.

5 Experiments

In this section, we evaluate DUST-net on its ability to learn articulation model parameters and uncertainty estimates. We conducted three sets of experiments evaluating DUST-net's performance under different criteria: (1) how accurate point estimates of the articulation model parameters drawn from DUST-net's estimated distribution are in comparison to existing methods, (2) how effectively DUST-net captures the uncertainty over parameters arising from noisy input, and (3) how effectively DUST-net transfers from simulation to a real-world setting. We evaluated DUST-net's performance on two simulated benchmarking datasets: the objects dataset provided by Abbatematteo et al. [abbatematteo2019learning] and the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet], as well as a set of three real-world objects. From the simulated articulated objects dataset [abbatematteo2019learning], we considered the cabinet, microwave, and toaster oven classes for revolute articulations and the drawer class for prismatic articulations. From the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet], we considered five object classes: the dishwasher, oven, and microwave classes for the revolute articulation model category, and the storage furniture class, consisting of either a single column of drawers or multiple columns of drawers, for the prismatic articulation model category. The first two sets of experiments were conducted on the simulated datasets, while the last set was conducted on the real-world objects. In all experiments, we assumed that the input depth images are semantically segmented and contain non-zero pixels corresponding only to the two objects between which we wish to estimate the articulation model.

We compared DUST-net's performance in producing point estimates of articulation model parameters with two state-of-the-art methods, ScrewNet [jain2020screwnet] and the MDN-based approach proposed by abbatematteo2019learning. ScrewNet estimates the object's articulation model parameters in a local frame located at the center of the object, whereas DUST-net does so directly in the camera frame; we therefore compare against ScrewNet predicting parameters both in the object's local frame and in the camera frame. Additionally, we propose two baseline methods that estimate distributions over articulation model parameters and compare DUST-net to them. The first baseline (vm-SoftOrtho) can be viewed as an extension of ScrewNet to a distributional setting. It represents the uncertainty over the screw axis orientation vector and the direction of the moment vector using two independent von Mises-Fisher distributions and imposes a soft orthogonality constraint over the modes of the two distributions; the distributions over the moment vector magnitude and the configurations are taken to be normal distributions. This method suffers from the same drawback as ScrewNet, i.e., the use of a soft orthogonality constraint during training, and therefore cannot directly predict a valid set of screw axis parameters, unlike DUST-net. The second baseline (Direct F) uses the same probability distribution as DUST-net to represent the uncertainty over the articulation model parameters, but estimates the individual elements of the F matrix directly. As a result, it fails to capture the uncertainty over model parameters accurately.

Figure 4: Mean error values on the MAAD (top) and Screw Loss (bottom) metrics for the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet] (lower values are better). Point estimates for DUST-net (violet) correspond to the modes of the distributions predicted by DUST-net.

5.1 Accuracy of Point Estimates

The first set of experiments evaluated DUST-net's accuracy in predicting point estimates of articulation model parameters. We use the mode of the estimated distribution as the point estimate of the model parameters. We used two metrics to evaluate accuracy: Mean Absolute (Angular) Deviation (MAAD) and the Screw Loss (the metric proposed in ScrewNet [jain2020screwnet]). The MAAD metric indicates how close the individual screw parameters are to their targets, whereas the Screw Loss indicates how close the complete predicted screw transform is to the target transform. The MAAD metric calculates the angular distance between the predicted and ground-truth screw axis orientation vectors and between the directions of the moment vectors; for the remaining parameters, it calculates the mean absolute deviation between the predicted and ground-truth values. The Screw Loss reports the angular distance between the predicted and ground-truth screw axis orientation vectors as the orientation error, and the length of the shortest perpendicular between the predicted and ground-truth screw axes as the distance between them. Configuration errors over rotations are reported as the difference between the predicted and true rotations about the respective screw axes, whereas errors over displacements are calculated as the Euclidean distance between the points displaced by the predicted and true displacements along the respective axes.
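The two geometric error terms above can be sketched in a few lines of numpy (our function names; a minimal version of the metrics, not the paper's evaluation code):

```python
import numpy as np

def angular_distance(u, v):
    """Angle between two unit vectors (MAAD-style orientation error)."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def axis_distance(l1, m1, l2, m2, eps=1e-9):
    """Length of the shortest perpendicular between two screw axes given
    in Plücker coordinates (the screw-loss distance term)."""
    p1, p2 = np.cross(l1, m1), np.cross(l2, m2)  # closest points to origin
    n = np.cross(l1, l2)                         # common perpendicular
    if np.linalg.norm(n) < eps:                  # parallel axes
        d = p2 - p1
        return float(np.linalg.norm(d - (d @ l1) * l1))
    return float(abs((p2 - p1) @ n) / np.linalg.norm(n))
```

For instance, two parallel z-axes passing through the origin and through (1, 0, 0) are at distance 1.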

Results for the synthetic articulated objects dataset and the PartNet-Mobility dataset are shown in Figures 3 and 4, respectively. They demonstrate that, under both metrics, the estimates obtained from DUST-net are typically more accurate than those obtained from the state-of-the-art methods. The first baseline, vm-SoftOrtho, performs comparably with DUST-net on both datasets when only MAAD estimates are considered. However, Figures 3 and 4 show that it produces a very large distance between the predicted and ground-truth screw axes. This error arises from the soft-orthogonality constraint used by vm-SoftOrtho; DUST-net and the second baseline, both of which handle the constraint implicitly, do not exhibit high errors on that metric. Meanwhile, the second baseline, Direct F, performs comparably with DUST-net on both metrics for both datasets, but fails to capture the uncertainty over parameters with the required accuracy.

Figure 5: Variation of the mean of the singular values of the concentration matrices of the distributions over screw axes predicted by DUST-net, as artificial noise is injected. The predicted singular values decrease monotonically with input noise, showing that the network's confidence in the predicted parameters decreases accordingly.

5.2 Uncertainty Estimation

The second set of experiments evaluated how effectively DUST-net's predicted distribution captures epistemic uncertainty over the predicted articulation parameters. We evaluate this by adding artificial noise to the training labels from the two simulated datasets while training DUST-net. As more noise is added, we expect the confidence estimates produced by DUST-net to decrease as well. We add noise to the labels by sampling perturbations from a matrix von Mises-Fisher distribution with varying singular values of the distribution parameter matrix F, and from truncated normal distributions with varying precision parameters. Figure 5 shows the variation of the mean of the singular values of the concentration matrices predicted by DUST-net as a function of the injected noise. In the noiseless case, the singular values of the matrix von Mises-Fisher distribution increase until they reach their maximum allowed value. When label noise is added, DUST-net's confidence in its predicted parameters decreases monotonically as more noise is added to the labels, supporting our hypothesis.

Figure 6: (a) Mean error values on the MAAD (top) and Screw Loss (bottom) metrics for real-world objects when the network was trained solely on simulated data [abbatematteo2019learning]. (b) Predicted concentrations over articulation model parameters; DUST-net's estimation performance on simulated data [abbatematteo2019learning] (hatched green) is included for comparison. DUST-net reported lower confidence in its predictions for real-world objects than for simulated data (b), consistent with its degraded estimation accuracy (a).

5.3 Sim to Real Transfer

Lastly, we evaluated how effectively DUST-net transfers from simulation to a real-world setting. DUST-net was trained solely on the simulated articulated object dataset [abbatematteo2019learning]. Afterward, we used it to infer the articulation model parameters for three real-world objects. Results (Fig. 6(a)) show that DUST-net outperforms the current state-of-the-art method, ScrewNet, in estimating the model parameters for real-world objects. However, the parameters estimated by DUST-net are not yet accurate enough to be used directly for manipulating these objects. This sub-par performance stems from the significant differences between the training data (clean, information-rich simulated images) and the test data, which consist of noisy depth images acquired with a Kinect sensor and contain heavy salt-and-pepper noise, spurious features, and incomplete objects. Better performance could be achieved by fine-tuning the network on real-world data or by retraining it on a larger real-world dataset. A noteworthy insight is that DUST-net also reports low confidence over the predicted parameters for real-world objects, compared to when it is tested on simulated data (Fig. 6(b)). This clearly illustrates why it is beneficial to estimate a distribution over the articulation model parameters instead of point estimates. Given only point estimates, a robot has no way to determine whether the estimates are reliable enough to manipulate the object safely. In contrast, DUST-net's reported confidence over its predictions could allow the robot to develop safe motion policies for articulated objects [jain2018efficient, taylor20safe] or to use active learning methods [cui2018active] to reduce uncertainty over the articulation parameters.

6 Conclusion

We introduced DUST-net, which utilizes a novel distribution over screw transforms on a Stiefel manifold to perform category-independent articulation model estimation with uncertainty estimates. We evaluated our approach on two benchmarking datasets and three real-world objects and compared its performance with two current state-of-the-art methods [jain2020screwnet, abbatematteo2019learning]. Results show that DUST-net can successfully estimate articulation models, their parameters, and model uncertainties for novel objects across articulation model categories, with better accuracy than the state-of-the-art methods. At present, DUST-net can only predict parameters for 1-DOF articulation models directly. For multi-DOF objects, an additional image segmentation step is required to mask out all non-relevant object parts. This procedure can be repeated iteratively over all pairs of object parts to estimate relative models between them, which can later be combined to construct a complete kinematic model for the object [jain2019learning]. An interesting extension of DUST-net could estimate parameters for multi-DOF objects directly by learning a segmentation network along with it. Another exciting direction of future work is to use DUST-net in an active learning setting where, if the robot is not confident enough about its estimates of the articulation model parameters, it can actively take information-gathering actions to reduce uncertainty.

References

Appendix A Hypergeometric function

A general hypergeometric function of matrix argument can be written as an infinite series in terms of zonal polynomials, which are multivariate symmetric homogeneous polynomials that form a basis of the space of symmetric polynomials [chikuse2003statistics]. Given a symmetric, positive-definite matrix $Y$, the hypergeometric function of matrix argument $Y$ is defined as

$${}_pF_q(a_1, \ldots, a_p;\; b_1, \ldots, b_q;\; Y) = \sum_{k=0}^{\infty} \frac{1}{k!} \sum_{\kappa \vdash k} \frac{(a_1)_\kappa \cdots (a_p)_\kappa}{(b_1)_\kappa \cdots (b_q)_\kappa}\, C_\kappa(Y) \qquad (3)$$

where

  • $\kappa \vdash k$ denotes the set of all ordered integer partitions $\kappa = (\kappa_1, \kappa_2, \ldots)$ of $k$,

  • $(a)_\kappa$ is the generalized Pochhammer symbol, defined as $(a)_\kappa = \prod_{i} \left(a - \tfrac{i-1}{2}\right)_{\kappa_i}$, where $(x)_m = x(x+1)\cdots(x+m-1)$,

  • and $C_\kappa(Y)$ denotes the zonal polynomial of $Y$, indexed by a partition $\kappa$, which is a symmetric homogeneous polynomial of degree $k$ in the eigenvalues of $Y$, satisfying

    $$\sum_{\kappa \vdash k} C_\kappa(Y) = (\operatorname{tr} Y)^k \qquad (4)$$

Using zonal polynomials, we can define the hypergeometric function giving the normalization factor of the matrix von Mises-Fisher distribution over the Stiefel manifold as

$${}_0F_1\!\left(\tfrac{n}{2};\; \tfrac{1}{4}F^\top F\right) = \sum_{k=0}^{\infty} \frac{1}{k!} \sum_{\kappa \vdash k} \frac{C_\kappa\!\left(\tfrac{1}{4}F^\top F\right)}{\left(\tfrac{n}{2}\right)_\kappa} \qquad (5)$$

where $\kappa \vdash k$ is the set of all ordered integer partitions of $k$, $(\cdot)_\kappa$ is the generalized Pochhammer symbol, and $C_\kappa$ denotes the zonal polynomial indexed by a partition $\kappa$. For a general hypergeometric function ${}_pF_q$, this series converges for all input matrices if $p \le q$, which holds in our case [chikuse2003statistics]. Recently, jiu2020calculation investigated zonal polynomials in detail and developed a computer algebra package to calculate these polynomials in SageMath. We use this package to calculate the hypergeometric function ${}_0F_1$. However, as the number of terms in the series grows combinatorially with $k$, we truncate the series for computational reasons. Through our experimental analysis, we found that the truncated series is a good approximation of ${}_0F_1$, as the series converges to a finite value provided the singular values of the distribution parameter matrix $F$ remain below a maximum value.
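The truncation idea is easiest to see in the scalar ($1 \times 1$ matrix) case, where zonal polynomials reduce to ordinary powers and the series becomes ${}_0F_1(b; y) = \sum_k y^k / ((b)_k\, k!)$. The sketch below implements this one-dimensional analogue only; it is not the SageMath zonal-polynomial package used in the paper:

```python
import math

def pochhammer(a: float, k: int) -> float:
    """Rising factorial (a)_k = a (a+1) ... (a+k-1)."""
    out = 1.0
    for i in range(k):
        out *= a + i
    return out

def hyp0f1_truncated(b: float, y: float, max_k: int = 20) -> float:
    """Truncated series for the scalar 0F1(b; y) = sum_k y^k / ((b)_k k!).
    In the matrix-argument case, y^k is replaced by a sum of zonal
    polynomials C_kappa(Y) over partitions kappa of k."""
    return sum(y**k / (pochhammer(b, k) * math.factorial(k))
               for k in range(max_k + 1))
```

In the scalar case ${}_0F_1(1; x) = I_0(2\sqrt{x})$, which provides a convenient convergence check on the truncation: at $x = 1$ the partial sums settle to $I_0(2) \approx 2.27959$ well before $k = 20$.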

Figure 7: DUST-net architecture

Appendix B Network Architecture

Figure 7 shows the detailed network architecture of DUST-net. DUST-net uses an off-the-shelf convolutional network, ResNet-18, to extract task-relevant visual features from the input images, which are then passed through a two-layer MLP to predict a set of parameters for the distribution. We use ReLU activations for the hidden fully-connected layers. The first four output parameters (out of 40) of the last linear layer of the MLP correspond to parameters that must lie in bounded ranges; we pass these four values through a ReLU-6 layer [howard2017mobilenets] to map the predicted values to their respective ranges. The remaining parameters are required to be non-negative; we pass the remaining output values of the last linear layer through a Softplus layer to ensure non-negative outputs.
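The output mapping described above can be sketched as follows. The helper names and the use of numpy (rather than the actual training framework) are illustrative assumptions; only the split at four outputs and the ReLU-6/Softplus choice come from the text:

```python
import numpy as np

def relu6(x):
    """Clamp activations to the range [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def softplus(x):
    """Smooth, strictly positive mapping log(1 + e^x)."""
    return np.log1p(np.exp(x))

def map_head(raw: np.ndarray) -> np.ndarray:
    """Map the 40 raw outputs of the last linear layer to valid
    distribution parameters: the first four pass through ReLU-6
    (bounded ranges), the rest through Softplus (non-negative)."""
    assert raw.shape[-1] == 40
    return np.concatenate([relu6(raw[..., :4]),
                           softplus(raw[..., 4:])], axis=-1)
```

Softplus is preferred over a plain ReLU for the concentration parameters because it keeps gradients non-zero even when the raw output is negative.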

Figure 8: Object classes used from the simulated articulated object dataset [abbatematteo2019learning]. Object classes: cabinet, drawer, microwave, and toaster (left to right)

Appendix C Experimental details

C.1 Datasets

Objects used in the experiments from each of the datasets are shown in Figures 8 and 9. We sampled a new object geometry and a joint location for each training example in the simulated articulated object dataset, as proposed by [abbatematteo2019learning]. For the PartNet-Mobility dataset, we considered microwave ( train, test), dishwasher ( train, test), oven ( train, test), single-column drawer ( train, test), and multi-column drawer ( train, test) object models. For both datasets, we sampled object positions and orientations uniformly in the view frustum of the camera up to a maximum depth dependent on the object size. The objects and depth images are rendered in MuJoCo [todorov2012mujoco]. We apply random frame skipping and pixel dropping to simulate the noise encountered in real-world sensor data. For evaluating DUST-net's performance on real-world objects, we consider three household objects: a microwave, a drawer, and a toaster oven. The objects are shown in Figure 10.
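The pixel-dropping augmentation mentioned above might look like the following sketch. The function name, dropout rate, and seeding are assumptions for illustration; frame skipping is not shown:

```python
import numpy as np

def drop_pixels(depth: np.ndarray, p: float = 0.05, rng=None) -> np.ndarray:
    """Zero out a random fraction p of pixels in a depth image to
    mimic missing depth returns (salt-and-pepper-style dropout)."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(depth.shape) >= p
    return depth * keep
```

Zeroed pixels mimic the invalid-depth returns that structured-light sensors such as the Kinect produce at depth discontinuities and reflective surfaces.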

To generate the labels for screw displacements, we follow the same procedure as jain2020screwnet. Treating one of the objects as the base object, we calculate the screw displacements between temporally displaced poses of the second object relative to it. Given a sequence of images, we calculate a sequence of screw displacements, where each displacement corresponds to the relative spatial displacement between the pose of the object in the first image and its pose in each subsequent image. Note that these displacements are defined in the frame attached to the pose of the object in the first image. We then transform them to the camera frame by defining the 3D line motion matrix between the two frames [bartoli20013d] and transforming the common screw axis to the target frame. The configurations remain unchanged under frame transformations. The 3D line motion matrix between two frames $A$ and $B$ can be constructed from the rotation matrix $R$ and translation vector $t$ between the frames, as:

$$\tilde{\mathcal{D}} = \begin{bmatrix} R & [t]_\times R \\ 0_{3\times 3} & R \end{bmatrix} \qquad (6)$$

where $[t]_\times$ denotes the skew-symmetric matrix corresponding to the translation vector $t$, and $L_A$ and $L_B = \tilde{\mathcal{D}} L_A$ represent the line in frames $A$ and $B$, respectively [bartoli20013d].
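Under the Plücker ordering (moment; direction) used by Bartoli and Sturm, the line motion matrix can be assembled as in the following numpy sketch (variable names are illustrative):

```python
import numpy as np

def skew(t: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix [t]_x such that [t]_x @ v == cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def line_motion_matrix(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """6x6 3D line motion matrix acting on Plucker coordinates
    ordered (moment m; direction l), following Bartoli and Sturm."""
    D = np.zeros((6, 6))
    D[:3, :3] = R
    D[:3, 3:] = skew(t) @ R
    D[3:, 3:] = R
    return D

# Pure translation of a line through the origin along the x axis:
L = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # (m; l)
L_new = line_motion_matrix(np.eye(3), np.array([0.0, 0.0, 1.0])) @ L
# direction l is unchanged; moment becomes t x l = (0, 1, 0)
```

This matches the expected Plücker behavior: translating a line leaves its direction fixed and shifts its moment by the cross product of the translation with the direction.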

Figure 9: Object classes used from the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet]. Object classes: dishwasher, oven, microwave, drawer (single column), and drawer (multiple columns) (left to right)
Figure 10: Real world objects used to evaluate DUST-net’s performance. Object classes: microwave, drawer, and toaster (left to right)
Figure 11: von Mises-Fisher distribution on the unit sphere. X, Y, and Z axes are shown in red, blue, and green, respectively. Black represents the mean direction of the distribution

C.2 Baseline: vm-SoftOrtho

von Mises-Fisher distribution: The von Mises-Fisher distribution (or Langevin distribution) is a unimodal probability distribution on the unit sphere. A random $m$-dimensional unit vector $x$ is said to follow the von Mises-Fisher distribution if its probability density function is given by $f(x;\, \mu, \kappa) = C_m(\kappa) \exp(\kappa\, \mu^\top x)$, where the concentration parameter $\kappa \ge 0$, the mean direction $\mu$ is a unit vector, and the normalization constant is $C_m(\kappa) = \frac{\kappa^{m/2 - 1}}{(2\pi)^{m/2}\, I_{m/2 - 1}(\kappa)}$, where $I_v$ denotes the modified Bessel function of the first kind at order $v$ [mardia1999directional]. For $m = 3$, the normalization constant reduces to $C_3(\kappa) = \frac{\kappa}{4\pi \sinh \kappa} = \frac{\kappa}{2\pi\,(e^{\kappa} - e^{-\kappa})}$.
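As a numerical sanity check on the $m = 3$ normalization constant, the following sketch evaluates the density and verifies that it integrates to one over the sphere (function names are illustrative):

```python
import math

def vmf_pdf_3d(cos_angle: float, kappa: float) -> float:
    """von Mises-Fisher density on the unit sphere in R^3, evaluated at
    a point whose angle from the mean direction has the given cosine.
    Normalizer: C_3(kappa) = kappa / (4 pi sinh(kappa))."""
    c3 = kappa / (4.0 * math.pi * math.sinh(kappa))
    return c3 * math.exp(kappa * cos_angle)

def sphere_integral(kappa: float, n: int = 20000) -> float:
    """Midpoint-rule integration of the density over the sphere,
    using the area element sin(theta) dtheta dphi."""
    total, h = 0.0, math.pi / n
    for i in range(n):
        theta = (i + 0.5) * h
        total += vmf_pdf_3d(math.cos(theta), kappa) * math.sin(theta) * h
    return 2.0 * math.pi * total
```

Larger $\kappa$ concentrates the density around the mean direction; as $\kappa \to 0$ the density approaches the uniform value $1/(4\pi)$.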

As noted in Section 5 of the main text, the first baseline method (vm-SoftOrtho) represents the uncertainty over the screw axis orientation vector and the direction of the moment vector using two independent von Mises-Fisher distributions. To ensure that the learned distributions respect the Plücker constraint, the method imposes a soft orthogonality constraint on the modes of the two distributions. The distributions over the moment vector magnitude and the configurations are modeled as truncated normal distributions.
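A soft orthogonality constraint of this kind can be implemented as a penalty on the dot product of the two distribution modes. The exact form and weighting used by the baseline are not specified in this text, so the squared dot product below is one plausible choice:

```python
import numpy as np

def soft_orthogonality_penalty(l_mode: np.ndarray,
                               m_dir_mode: np.ndarray) -> float:
    """Penalty encouraging the modes of the two von Mises-Fisher
    distributions (axis orientation and moment direction) to satisfy
    the Plucker constraint l . m = 0; zero iff the modes are orthogonal."""
    return float(np.dot(l_mode, m_dir_mode) ** 2)
```

Because the constraint is only penalized rather than enforced, the learned modes can still drift away from exact orthogonality, which is consistent with the large screw-axis distance errors reported for vm-SoftOrtho in Section D.1.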

Appendix D Further Results

D.1 Accuracy of Point Estimates

Detailed numerical results for the simulated articulated objects dataset and the PartNet-Mobility dataset are shown in Tables 1 and 2, respectively. The results demonstrate that, under both metrics, the estimates obtained from DUST-net are considerably more accurate than those obtained from the state-of-the-art methods. DUST-net also correctly estimates very high distribution concentration parameters for the true, noise-free labels. The first baseline, vm-SoftOrtho, performs comparably to DUST-net on both datasets when only the MAAD estimates are considered. However, Tables 1 and 2 show that it produces a very large distance between the predicted and ground-truth screw axes. This error arises from the soft-orthogonality constraint used by vm-SoftOrtho; DUST-net and the second baseline, both of which handle the constraint implicitly, do not exhibit high errors on that metric. Meanwhile, the second baseline, Direct F, performs comparably to DUST-net on both metrics for both datasets, but fails to capture the uncertainty over parameters with the required accuracy.
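The angular entries in these comparisons reduce to the angle between predicted and ground-truth direction vectors; a minimal sketch of that primitive, assuming MAAD-style metrics average per-parameter angular deviations of this form:

```python
import numpy as np

def angular_deviation(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in radians between two (possibly unnormalized) direction
    vectors, e.g. predicted vs. ground-truth screw axis orientations."""
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values just outside [-1, 1].
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))
```

Identical directions give a deviation of zero; orthogonal directions give pi/2, matching the radian units used in the tables.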

| Method | MAAD | SL | MAAD | SL | MAAD | SL | MAAD | SL | Uncertainty |
|---|---|---|---|---|---|---|---|---|---|
| vm-SoftOrtho | 0.139 | 0.154 | 0.068 | 0.956 | 0.012 | 0.117 | 0.003 | 0.006 | 56.2, 55.8, 9.8, 47.9, 89.5 |
| Direct F | 0.240 | 0.261 | 0.062 | 0.104 | 0.010 | 0.208 | 0.002 | 0.006 | 8.4, 7.9, 9.8, 48.5, 75.3 |
| ScrewNet | 0.846 | 0.929 | 0.486 | 0.475 | 0.115 | 0.217 | 0.111 | 0.118 | - |
| abbatematteo2019learning | 0.194 | - | - | 0.111 | 0.223 | - | 0.045 | - | - |
| DUST-net | 0.151 | 0.163 | 0.052 | 0.059 | 0.012 | 0.122 | 0.002 | 0.006 | 53.8, 54.0, 18.3, 128.1, 219.1 |
| ScrewNet (Local) | 0.178 | 0.443 | 0.068 | 0.033 | 0.057 | 0.118 | 0.015 | 0.015 | - |

Table 1: Mean error values on the MAAD and Screw Loss (SL) metrics for the simulated articulated objects dataset [abbatematteo2019learning]. Point estimates for DUST-net correspond to the modes of the distributions predicted by DUST-net. Angular values and distances are reported in radians and meters, respectively. Numerical values are reported for the uncertainty parameters. The symbol - represents a value not reported.
| Method | MAAD | SL | MAAD | SL | MAAD | SL | MAAD | SL | Uncertainty |
|---|---|---|---|---|---|---|---|---|---|
| vm-SoftOrtho | 0.284 | 0.243 | 0.221 | 1.137 | 0.030 | 0.086 | 0.012 | 0.027 | 26.9, 31.1, 5.7, 54.5, 60.9 |
| Direct F | 0.214 | 0.212 | 0.257 | 0.219 | 0.030 | 0.064 | 0.012 | 0.024 | 8.1, 7.3, 4.9, 59.5, 70.9 |
| ScrewNet | 0.846 | 0.929 | 0.486 | 0.475 | 0.115 | 0.217 | 0.111 | 0.118 | - |
| abbatematteo2019learning | 0.989 | - | - | 0.095 | 0.141 | - | 0.085 | - | - |
| DUST-net | 0.220 | 0.219 | 0.178 | 0.189 | 0.029 | 0.063 | 0.012 | 0.029 | 49.3, 48.3, 7.7, 72.0, 131.9 |
| ScrewNet (Local) | 0.260 | 1.23 | 0.314 | 0.151 | 0.060 | 0.106 | 0.040 | 0.009 | - |

Table 2: Mean error values on the MAAD and Screw Loss (SL) metrics for the PartNet-Mobility dataset [Xiang_2020_SAPIEN, Mo_2019_CVPR, chang2015shapenet]. Point estimates for DUST-net correspond to the modes of the distributions predicted by DUST-net. Angular values and distances are reported in radians and meters, respectively. Numerical values are reported for the uncertainty parameters. The symbol - represents a value not reported.

D.2 Uncertainty Estimation

The detailed numerical results from the second set of experiments are shown in Table 3. In the noiseless case, the singular values of the matrix von Mises-Fisher distribution increase until they reach their maximum allowed value, while the precision parameters of the truncated normal distributions over the remaining parameters become arbitrarily large.

| Label Noise | No noise | (15, 15, 50, 50, 50) | (12, 12, 50, 50, 50) | (10, 10, 50, 50, 50) |
|---|---|---|---|---|
| SynArt | 53.8, 53.9, 18.3, 128.0, 219.0 | 8.2, 8.2, 14.6, 53.7, 51.9 | 6.8, 6.8, 10.5, 41.6, 49.6 | 3.8, 3.8, 10.3, 41.9, 47.4 |
| PartNet | 49.3, 48.3, 7.7, 72.0, 132.0 | 6.4, 6.3, 9.4, 29.5, 29.2 | 4.9, 4.7, 8.9, 34.0, 37.9 | 3.2, 3.1, 9.4, 31.2, 32.1 |

Table 3: Variation of DUST-net's confidence over predicted articulation model parameters with injected label noise. DUST-net's confidence over its predicted parameters decreases monotonically as the noise increases, showing that DUST-net's predicted distribution effectively captures the network's confidence over the predicted articulation parameters.

D.3 Real objects

The numerical results from the sim-to-real transfer experiments are shown in Table 4. The results show that while DUST-net outperforms ScrewNet in estimating the model parameters for real-world objects, the estimated parameters are not yet accurate enough to be used directly for manipulating these objects. However, a noteworthy insight is that DUST-net also reported very low confidence over the predicted parameters. This clearly illustrates why it is beneficial to estimate a distribution over the articulation model parameters instead of only point estimates, as discussed in Section 5.3.

| Object | Method | MAAD | SL | MAAD | SL | MAAD | SL | MAAD | SL | Uncertainty |
|---|---|---|---|---|---|---|---|---|---|---|
| Toaster Oven | ScrewNet | 2.42 | 2.48 | 0.74 | 0.76 | 0.45 | 1.26 | 0.01 | 0.00 | - |
| Toaster Oven | DUST-net | 0.17 | 0.31 | 0.52 | 0.59 | 0.44 | 0.64 | 0.01 | 0.01 | 2.5, 0.1, 11.6, 10.8, 75.5 |
| Microwave | ScrewNet | 0.79 | 0.81 | 0.13 | 0.52 | 1.19 | 0.54 | 0.01 | 0.01 | - |
| Microwave | DUST-net | 0.41 | 0.42 | 0.22 | 0.43 | 0.46 | 0.40 | 0.00 | 0.00 | 0.7, 0.6, 19.7, 14.3, 39.9 |
| Drawer | ScrewNet | 0.69 | 0.24 | 0.49 | 0.24 | 0.72 | 0.97 | 0.08 | 0.08 | - |
| Drawer | DUST-net | 0.42 | 0.50 | 0.32 | 0.74 | 0.75 | 0.56 | 0.07 | 0.08 | 0.2, 0.1, 12.3, 31.6, 55.2 |

Table 4: Mean error values on the MAAD and Screw Loss (SL) metrics for the estimation of articulation model parameters for real-world objects when the network was trained solely on simulated data. Angular values and distances are reported in radians and meters, respectively. Numerical values are reported for the uncertainty parameters. The symbol - represents a value not reported.