I. Introduction

The understanding of object affordances [gibson1979ecological], i.e., the set of possible interactions with an object, plays a crucial role in human perception of real-world objects despite the tremendous variance in their appearances [dicarlo2012does]. As robots begin to enter our daily life, it is necessary for them to assess previously unseen real-world objects and understand how they can be interacted with via affordance-based object perception. Despite the necessity and advantages of affordance reasoning in object perception, the majority of work on this topic is appearance-based. Appearance-based methods can be brittle when encountering objects with large intra-class variations. Moreover, they can be challenged when functional reasoning about previously unseen objects and inter-class function generalization are needed. For example, consider a robot trying to prepare a bowl of M&M’s® candies (hereafter referred to as candies) in a kitchen. What if the bowl in the kitchen has a different appearance from the bowls that the robot has seen before? What if the bowl is broken, with a hole at the bottom, while it still visually appears to be a bowl? If there are no bowls but only cups in the kitchen, can the robot be programmed to improvise? Given these unexpected situations, we argue that appearance-based object perception alone is not sufficient for robotics applications in the real world; what is missing is the capability to physically “understand” object affordances. If the bowl is previously unseen, the robot should recognize its containability affordance; if the bowl is broken, it should reason that the bowl cannot afford to contain candies; if there are no bowls but only cups, it should generalize the containability affordance from bowls to cups and use a cup instead to contain the candies (Fig. 1).
To address the problem of affordance reasoning in object perception, we introduced the interaction-based definition of an object and a method to “imagine” the affordance of an object as a chair via physical simulations in our previous work [wu2019chair]. In this paper, we extend our idea of physical simulation-based object affordance reasoning to the class of open containers and propose a novel method for robots to imagine the open containability affordance. We are inspired by the fact that humans mentally simulate high-level physical interactions to reason about complex systems [battaglia2013simulation]. Our long-term goal is to endow robots with similar mental capabilities to imagine potential interactions with an object and reason about its affordances accordingly (Fig. 1(a)).
We first discuss the physical attributes to imagine for the class of open containers. Containers can generally be categorized into open containers (e.g., cups) and closed containers (e.g., bottles with a cap). The containability closure [varadarajan2012afnet] of a closed container makes the physical interaction with it different from that with an open container: an agent needs to open a closed container to transfer contents into it, while there is no such need for an open container. The open containability we describe in this paper does not take the containability closure into account. The pose of an open container also affects the interaction with it; an upside-down open container needs to be reoriented upright to afford open containability. Finally, containability is related to the physical properties of the container and the contents. For example, the origami cup in Fig. 2 cannot contain water, because it will be soaked, but it is able to contain candies. In this paper, we focus on reasoning about the open containability of an upright object for granular material. Based on the containability definition in [varadarajan2012afnet], we propose the interaction-based definition of open containers: “an object which can be placed such that liquid or granular material can be poured into it from above and retained therein under small perturbations in pose.”
The interaction-based definition defines open containers from the perspective of object interaction, which robots can “understand” and “imagine” accordingly. Fig. 2 shows our method’s pipeline. In our method, the robot performs open containability imagination (Sec. III-A) to quantify the open containability affordance of a previously unseen object. This quantification is used as a cue for open-container vs. non-open-container binary classification (hereafter referred to as open container classification). If the object is classified as an open container, the robot further performs pouring imagination (Sec. III-B) to obtain the pouring position and orientation for the robot to pour autonomously in the real world (Sec. IV-D). We implement our imagination method on a UR5 manipulator. We evaluate our method on open container classification and autonomous pouring of granular material (M&M’s® candies) with a dataset of 130 previously unseen real-world objects covering 57 object categories. Results show that our method achieves human-like judgements on open container classification, even though it uses only 11 objects for simulation calibration (training). In addition, our method is able to autonomously pour into the 55 open containers in the dataset with a very high success rate. We also compare to a deep learning method [do2018affordancenet] which we find most relevant in the literature. Our method achieves the same performance as the deep learning method on open container classification and outperforms it on autonomous pouring.
II. Related Work
Containability Reasoning. Containability reasoning has been studied in the fields of cognitive science [liang2015evaluating, davis2013reasoning, ullman2019model] and computer vision [varadarajan2012afnet, yu2015fill, hinkle2013predicting, liang2018tracking]. Liang et al. [liang2015evaluating] compare human judgements of containing relations between different objects with physical simulation results. Yu et al. [yu2015fill] voxelize the world containing an object of interest and determine the state of each voxel to reason about whether the object is a container and its best filling direction. Our method diverges from these approaches by reasoning about the open containability affordance with a real robot and leveraging the affordance reasoning for robot autonomous pouring.
Learning-based Affordance Detection. There is a growing interest in learning object affordances and object functionalities in the computer vision and robotics community [myers2015affordance, do2018affordancenet, nguyen2017object, chu2019learning, sawatzky2017weakly, roy2016multi, manuelli2019kpam, ruiz2020geometric, zhu2014reasoning, zhu2018visual]. [nguyen2017object, do2018affordancenet, chu2019learning, roy2016multi, sawatzky2017weakly]
leverage convolutional neural networks (CNNs) to segment affordance regions within an image. Manuelli et al. [manuelli2019kpam] use deep learning to identify keypoint affordances of objects to reinforce robot manipulation. Instead of learning appearance-based cues, our method digs into the underlying physics by using physical simulations to encode object affordances.
Physics-based Reasoning. The concept of object affordances has also been studied by introducing physics to reason about the functionality of an object [zhu2018visual, ho1987representing, wu2019chair, yu2015fill, hinkle2013predicting, kunze2017envisioning, abelha2017learning]. Kunze and Beetz [kunze2017envisioning] reason about the outcomes of robot actions with simulation-based projections to guide robots in planning actions. The concept of digital twins [boschert2016digital] uses simulations to replicate physical systems for operation optimization and failure prediction in manufacturing. Instead of reasoning about physics or using physics to predict action outcomes, we exploit physics to imagine potential interactions with an object for object classification and robot-object interaction.
Autonomous Pouring. Autonomous pouring is an important task in robot manipulation [guevara2017adaptable, yamaguchi2015differential, schenck2017visual, pan2016motion, kennedy2019autonomous]. Yamaguchi and Atkeson [yamaguchi2015differential] explore temporally decomposed dynamics in pouring simulation experiments with differential dynamic programming. Schenck and Fox [schenck2017visual] learn to estimate the liquid volume in a container and incorporate this sensory feedback into robotic pouring. Unlike these methods, which know the object class (container) and/or the object 3D model a priori, our method categorizes a previously unseen object and predicts the pouring position and orientation without knowing the 3D model or the category of the object a priori.
III. Method

Our method takes two steps to explore the open containability affordance of a previously unseen object. In the first step, the robot performs open containability imagination, which simulates dropping particles of granular material from above the object and perturbing the object, to determine if the object is an open container. If the object is identified as an open container, the robot proceeds to pouring imagination, which simulates pouring particles from a cup, to obtain the pouring position and orientation for autonomous pouring in the real world. Given the scanned 3D model of the object, our method attaches the body frame to its center of mass (CoM) and sets the axes parallel to those of the world frame. The object and particles are considered rigid bodies. A rigid body transformation can be specified by (R, t) ∈ SE(3), where R ∈ SO(3) is the rotation matrix and t ∈ R^3 parameterizes the translation.
III-A. Open Containability Imagination
Fig. 3 shows the setting of open containability imagination. Particles are placed above the axis-aligned bounding box (AABB) of the object and patterned in a grid to cover the top face of the AABB (Fig. 3(b)). We denote the numbers of particles along the x-, y-, and z-axes as N_x, N_y, and N_z. N_x and N_y are set as follows:

N_x = floor(s · e_x^obj / e_x^par),  N_y = floor(s · e_y^obj / e_y^par),

where e_x^obj (e_x^par) and e_y^obj (e_y^par) are the extents of the object (particle) AABB along the x- and y-axes, respectively; s is a scale factor to avoid cases where there are too many particles to drop, which would greatly increase the runtime; floor(·) is the floor function. We first set s to its initial value and calculate N_x and N_y. If N_x · N_y exceeds an upper threshold, i.e., there are too many particles, we decrease s and recalculate N_x and N_y. If N_x · N_y falls below a lower threshold, i.e., there are too few particles, which would make the later calculation of open containability (Eqn. 1) unreliable, we recompute N_x and N_y with the ceiling function ceil(·) instead.
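The grid sizing above can be sketched in Python. The threshold values and the halving update for s are illustrative assumptions; the paper states only that s is decreased when there are too many particles and that the ceiling function is used when there are too few.

```python
import math

def particle_grid_counts(obj_extent, par_extent, s0=1.0, t_max=2000, t_min=20):
    """Numbers of particles (Nx, Ny) patterned over the top face of the
    object AABB.  obj_extent / par_extent: (x, y) extents of the object /
    particle AABB.  t_max, t_min, and the halving update for the scale
    factor s are illustrative choices, not values from the paper.
    """
    def counts(s, rounding):
        nx = rounding(s * obj_extent[0] / par_extent[0])
        ny = rounding(s * obj_extent[1] / par_extent[1])
        return nx, ny

    s = s0
    nx, ny = counts(s, math.floor)
    while nx * ny > t_max and s > 1e-3:      # too many particles: shrink s
        s *= 0.5
        nx, ny = counts(s, math.floor)
    if nx * ny < t_min:                      # too few: use ceiling instead
        nx, ny = counts(s, math.ceil)
    return max(nx, 1), max(ny, 1)
```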
After the particles are set up, they are dropped in the physical simulation with gravity. According to the interaction-based definition, if an object can robustly retain particles, small perturbations in pose shall not spill them. Therefore, perturbations are introduced by simulating motions equivalent to rotating and translating the object when the drop is finished. The object is rotated by a small angle along the x- and y-axes to simulate rotation; horizontal force fields along the positive and negative directions of the x- and y-axes are applied sequentially to simulate translation (Fig. 3(a)). Note that applying a horizontal force field is equivalent to accelerating the object horizontally along the negative direction of the force field. Translation is simulated with accelerations instead of constant velocities because, based on common experience, spillage mostly happens during accelerations rather than while moving at constant velocity. The number of particles retained within the object, N_in, is obtained by counting the particles within the object AABB at the end of the simulation. The open containability of the object is quantified as:
C_open = N_in / N_drop,  (1)

where N_drop is the number of particles dropped. If C_open > T, the object is considered an open container, where T is a threshold (set by calibration in Sec. IV-C). The xy-plane projection of the initial dropping positions of the particles retained within the object after the drop constitutes what we call the footprint of the object (indicated in red in Fig. 3(b)).
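As a minimal sketch, the retained-particle count and Eqn. 1 can be computed from the particles' final simulated positions; here the positions are plain coordinate tuples, whereas in the actual pipeline they would come from the physics engine.

```python
def count_in_aabb(positions, aabb_min, aabb_max):
    """Count particles whose centers lie inside the object AABB."""
    return sum(
        all(lo <= c <= hi for c, lo, hi in zip(p, aabb_min, aabb_max))
        for p in positions
    )

def open_containability(final_positions, aabb_min, aabb_max, n_dropped,
                        threshold=0.0):
    """Eqn. 1: fraction of dropped particles retained, plus the decision.
    threshold=0 reflects the calibration described in Sec. IV-C."""
    n_in = count_in_aabb(final_positions, aabb_min, aabb_max)
    c_open = n_in / n_dropped
    return c_open, c_open > threshold
```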
III-B. Pouring Imagination
Pouring imagination simulates pouring particles from a cup at different positions and orientations (Fig. 4). To describe the pouring setting, we first define p (red star in Fig. 4(a)) as a point on the rim of the cup. p is fixed as the rotation pivot during the pouring. Since the cup used for pouring is symmetric about its longitudinal axis, any point on the rim can be equivalently used as the pivot. For all pourings, we set p 1 cm above the object AABB. That is, p lives on a horizontal plane Π. The origin of the coordinate frame attached to Π is specified by the CoM of the object’s footprint; the x- and y-axes of this frame are the corresponding principal axes.
The inner surface of the cup is a truncated cone. We denote the slant height passing through p as l (green line in Fig. 4(a)). For all pourings, the cup is initially placed such that l is horizontal and p is the lowest end of the cup mouth. This constrains the initial pose of the cup. We parameterize a pouring with a tuple (θ, d). At the beginning of each pouring, l is aligned with a ray (red dashed line in Fig. 4(a)) originating from the origin of the frame of Π. The angle between the x-axis of the frame and the ray equals θ. d, measured along the ray, specifies the x and y components of p in the frame of Π. The pouring is a one-degree-of-freedom motion: the cup is rotated about an axis (blue dashed line in Fig. 4(a)) which is tangent to the rim of the cup at p.
For an object, the imagination simulates pouring from 8 different angles θ_i:

θ_i = i · π/4,  i = 0, 1, …, 7.  (2)
For each θ_i, it simulates pouring at 3 different distances d_ij:

d_ij = d_i^fp + δ1 + (j − 1) · δ2,  j = 1, 2, 3,  (3)

where d_i^fp is the distance from the origin of the frame of Π to the boundary of the footprint along the ray at angle θ_i; δ1 + (j − 1) · δ2 defines the j-th indentation distance; δ1 and δ2 are two hyperparameters: δ1 = 1 cm and δ2 equals a fraction of the length of the diagonal of the object AABB’s top face.
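The 8 × 3 = 24 candidate pourings can be enumerated as below. The evenly spaced angles, the `footprint_dist` callback, and the fraction `frac` used for the second hyperparameter are illustrative assumptions.

```python
import math

def pouring_candidates(footprint_dist, diag, delta1=0.01, frac=0.25,
                       n_angles=8, n_dists=3):
    """Enumerate the candidate pourings (theta_i, d_ij).

    footprint_dist(theta): distance (m) from the frame origin to the
    footprint boundary along the ray at angle theta.  diag: length of the
    diagonal of the object AABB's top face.  frac (the fraction of diag
    used for delta2) is a hypothetical value, not from the paper.
    """
    delta2 = frac * diag
    candidates = []
    for i in range(n_angles):
        theta = i * 2.0 * math.pi / n_angles      # Eqn. 2
        d0 = footprint_dist(theta)
        for j in range(1, n_dists + 1):
            d = d0 + delta1 + (j - 1) * delta2    # Eqn. 3 (reconstructed)
            candidates.append((theta, d))
    return candidates
```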
The same method described in Sec. III-A is used to count the number of particles retained within the object after each pouring. The particle-in-object ratio of the pouring with (θ_i, d_ij) is specified as:

r_ij = N_in(θ_i, d_ij) / N_pour,  (4)
where N_pour is the number of particles poured. In total, the imagination simulates 24 pourings for 24 tuples (θ_i, d_ij) and gets 24 ratios r_ij. We add the indentation instead of directly pouring at the footprint boundary (Eqn. 3) for two reasons. First, the footprint is obtained by freely dropping the particles, i.e., the particles have no horizontal velocities, whereas in pouring, the particles have a horizontal velocity when exiting the cup mouth. We indent the cup “backward” horizontally to compensate for this velocity. Second, we want to explore more area for pouring. Indenting the cup towards the object periphery provides a wider exploration area and a higher chance of finding a (θ_i, d_ij) with a large r_ij (Fig. 5).
The tuple (θ*, d*) for robot autonomous pouring is chosen by setting θ* and d* as follows:

(θ*, d*) = argmax_{(θ_i, d_ij)} r_ij.  (5)
We select (θ*, d*) as above because pouring with this tuple has higher chances of obtaining a large particle-in-object ratio, i.e., less spillage, in the real world. Let (R_obj, t_obj) denote the pose of the object in the world frame. The initial pose of the cup for pouring in the world frame can then be retrieved as follows:

(R_cup, t_cup) = (R_z(θ*) R_0,  R_obj t_p + t_obj),  (6)
where R_z(θ*) specifies the rotation which rotates about the z-axis of the world frame by θ*; R_0 rotates the cup such that l is parallel to the x-axis of the frame of Π and p is the lowest end of the cup mouth in the world frame; t_p is the position of the pivot in the object body frame; p and l are defined at the beginning of this section.
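Selecting the pouring to execute then reduces to an argmax over the simulated ratios. A sketch, where the joint argmax over all 24 tuples is our reading of the selection rule:

```python
def select_pouring(candidates, ratios):
    """Eqn. 5 (reconstructed): pick the candidate (theta*, d*) whose
    simulated particle-in-object ratio is largest, i.e. the pouring
    expected to spill the least in the real world."""
    best = max(range(len(candidates)), key=lambda k: ratios[k])
    return candidates[best]
```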
IV. Experiments

Fig. 1(b) shows the experiment setup. We implement our method on a UR5 robot mounted with an afag EU-20 UR gripper. A PrimeSense Carmine 1.09 RGB-D camera is also mounted on the end effector. The robot base frame is used as the world frame throughout the experiment.
IV-A. Robot 3D Scanning
In the experiment, the object of interest is placed randomly on a transparent platform in its upright pose. The transparent platform is also randomly placed on a table within limits that are described below. Since the depth sensor of the camera has a minimum range of 0.35 m, the position of the object, although random, needs to satisfy: 1) the object does not collide with the robot during the scanning; 2) the object falls within the range of the depth sensor in all capturing views (e.g., the front-to-back view shown in Fig. 6(a)); 3) the object is in the robot workspace so that the robot can pour into the object if it is classified as an open container. Using the wrist-mounted RGB-D camera, we move the robot’s end effector to 24 pre-defined configurations to capture depth images of the scene. From the robot’s forward kinematics, we are able to obtain the pose of the camera at each view. This allows us to use TSDF Fusion [curless1996volumetric, zeng20163dmatch] to densely reconstruct the scene. Since the transparent platform is not captured by the depth sensor of the camera, we are able to segment the object from the scene by simply cropping the 3D reconstruction with a box (plane segmentation could segment the object from the table without the transparent platform; however, the bottoms of the inner surfaces of some objects, e.g., the red bag in Fig. 4(b), would be very close to the table and falsely segmented out as part of the table due to sensor inaccuracy). Finally, we get the 3D mesh of the object. It is worth noting that object segmentation could also be achieved without the transparent platform by 3D object segmentation [finman2013toward] or by physically picking up the object for scanning [krainin2011manipulator].
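As a rough illustration of the fusion step, the following sketches one TSDF integration update in the spirit of [curless1996volumetric] for a single depth view, assuming a simple pinhole camera. The flat voxel list, the intrinsics tuple, and the `cam_pose_inv` callback are simplifications for illustration, not the actual implementation.

```python
def tsdf_integrate(tsdf, weights, voxels, depth, K, cam_pose_inv, trunc=0.05):
    """One simplified TSDF integration step: project each world-frame voxel
    center into the depth image and fold the truncated signed distance into
    a running weighted average.  K = (fx, fy, cx, cy); cam_pose_inv maps a
    world point to the camera frame."""
    fx, fy, cx, cy = K
    h, w = len(depth), len(depth[0])
    for idx, voxel in enumerate(voxels):
        xc, yc, zc = cam_pose_inv(voxel)      # voxel center in camera frame
        if zc <= 0:
            continue                          # behind the camera
        u = int(round(fx * xc / zc + cx))     # pinhole projection
        v = int(round(fy * yc / zc + cy))
        if not (0 <= u < w and 0 <= v < h):
            continue
        d = depth[v][u]
        if d <= 0:
            continue                          # no depth measurement
        sdf = d - zc                          # + in front of surface, - behind
        if sdf < -trunc:
            continue                          # occluded: no update
        tsdf_val = min(1.0, sdf / trunc)      # truncate
        w_old = weights[idx]                  # running weighted average
        tsdf[idx] = (tsdf[idx] * w_old + tsdf_val) / (w_old + 1)
        weights[idx] = w_old + 1
```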
As we are not able to capture views from underneath the object, the reconstructed 3D model is partial. It only covers surfaces which can be seen from the top and side views. The inner containing surface of open containers can be seen from the top and side views given the object is placed upright, and thus is included in the reconstructed model. Therefore, the partially reconstructed model is sufficient for the purpose of open containability reasoning. Reconstructing the complete model of an object can be achieved by methods involving manipulating the object such as [krainin2011manipulator, welke2010autonomous].
IV-B. Dataset

Our dataset contains 141 real-world objects which are commonly encountered in daily life (Fig. 6(b)(c)(d)). We use 11 objects for simulation calibration (training set) and the remaining 130, which are previously unseen by the robot, for testing (test set). The 130 objects in the test set cover 57 object categories. Since the objects have no open container labels, we recruited 20 human subjects to annotate the data for open container classification. Every object is annotated 5 times by 5 different subjects. Given an object, we asked each of the 5 subjects two questions (we did not directly ask whether the object is an open container because the concept of an open container is not clear to some human subjects): 1) Given the object in this pose, is it able to contain M&M’s® candies? 2) Given the object in this pose, are you able to pour M&M’s® candies into it? If the annotations to these two questions are both true, we consider the object annotated as an open container by the subject; otherwise, it is annotated as a non-open container. If 3 or more subjects annotate an object as an open container, the ground truth label of the object is open container; otherwise, the label is non-open container. The test set contains 55 open containers and 75 non-open containers.
IV-C. Physical Simulation
We use PyBullet [coumanspybullet] as the physics engine for both the open containability imagination and the pouring imagination. The object, cup, and particles are imported via URDF files which specify the mass, CoM, inertia matrix, collision model, friction coefficient, and joint properties. The particles are modelled as ellipsoids with dimensions similar to those of an M&M’s® candy. We use Volumetric Hierarchical Approximate Convex Decomposition (V-HACD) [mamou2016volumetric] to decompose the models of the object and the cup into convex pieces as their collision models in the simulation. The default Coulomb friction model is used for friction modeling. The collision between the particles and the object is modelled as inelastic (zero coefficient of restitution).
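For illustration, a particle URDF along these lines could encode the contact properties (the numeric values and the mesh filename are hypothetical; the `<contact>` tags are PyBullet's URDF extension for friction and restitution):

```xml
<?xml version="1.0"?>
<robot name="particle">
  <link name="base">
    <contact>
      <lateral_friction value="0.5"/>  <!-- Coulomb friction (hypothetical value) -->
      <restitution value="0.0"/>       <!-- inelastic collision -->
    </contact>
    <inertial>
      <mass value="0.001"/>
      <inertia ixx="1e-8" ixy="0" ixz="0" iyy="1e-8" iyz="0" izz="1e-8"/>
    </inertial>
    <collision>
      <geometry>
        <!-- ellipsoid approximating a candy; filename is hypothetical -->
        <mesh filename="candy_ellipsoid.obj"/>
      </geometry>
    </collision>
  </link>
</robot>
```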
We calibrate the simulation by running the algorithm on the training set. The simulation parameters of the open containability imagination are manually calibrated such that the open container classification results match the human annotation. The simulation parameters of the pouring imagination are manually calibrated such that the pouring results, i.e., the particle-in-object ratios in Eqn. 4, match the real-world pouring results to the greatest extent. From the simulation calibration, the open containability (Eqn. 1) is greater than zero for all the open containers in the training set and equals zero for all the non-open containers. Thus, we set the classification threshold to zero, which physically means that an object is classified as an open container if there are particles retained within the object after the drop, and as a non-open container otherwise. We use the same fixed number of particles for all pourings in the pouring imagination, a number whose volume can be contained by all the open containers in the dataset. However, we want to point out that the number of particles can be set according to how much the user wants to pour and the containing volume of the object, which we do not cover in this paper; a further investigation of reasoning about the pouring amount is left for future work. The cup used in the pouring imagination has the same dimensions as the one used in robot autonomous pouring (Sec. IV-D).
IV-D. Robot Autonomous Pouring
In robot autonomous pouring, the robot pours the same number of candies as in the pouring imagination. At the beginning of each experiment, we fill the cup with candies and place it at a pre-defined position. After the pouring imagination, the robot first picks up the cup and executes a pre-defined trajectory to move the cup to a pre-pour pose. The trajectory is designed such that the cup does not spill the candies during the execution. Then, the robot is controlled to move the cup to the imagined initial pouring pose obtained from the pouring imagination (Eqn. 6). After that, similar to the pouring imagination, the robot rotates the cup about the axis tangent to the cup rim at the pivot to pour the candies (Fig. 6(e)). We use the python-urx library [pythonurx] to control and generate motion plans for the robot.
V. Results

In this section, we show the results of open container classification and robot autonomous pouring on the test set (videos are available at: https://chirikjianlab.github.io/realcontainerimagination/). We evaluate our method on a computer running an Intel Core i7-8700 @ 3.2GHz CPU. Our single-threaded, unoptimized implementation takes about 50 seconds for 3D scanning, 15 seconds for open containability imagination, 80 seconds for pouring imagination, and 40 seconds for autonomous pouring. We denote our method as Imagination. The most comparable method is [yu2015fill]; however, its source code is not available for comparison. Thus, we compare with a deep learning method [do2018affordancenet], denoted as AffordanceNet, which we find most relevant in the literature (we use the code and pretrained weights provided by the authors for testing). AffordanceNet takes as input an RGB image. It has an object detection branch, which localizes the object with a bounding box and classifies the object, and an affordance detection branch, which assigns each pixel within the bounding box an affordance label. AffordanceNet is trained on the IIT-AFF dataset [nguyen2017object], which contains 8,835 real-world images with 10 object classes and 9 affordance classes. Among the 10 object classes, 3 are open containers (cup, bowl, pan); among the 9 affordance classes, one is “contain”. To ensure fairness, the comparison with AffordanceNet is evaluated on a subset of the full test set, denoted as the restricted test set. It contains only the 51 objects (23 open containers (Fig. 6(b)) and 28 non-open containers) which fall within the object classes of the IIT-AFF dataset. For each of these objects, at least four of the five human annotators were in agreement.
V-A. Open Container Classification
We use the classification accuracy (accuracy) and the area under the Receiver Operating Characteristic curve (AUC) to evaluate the open container classification performance of different methods. The results are shown in Table I. For our proposed method, the open containability (Eqn. 1) is used as the confidence score to calculate the AUC. For AffordanceNet, we use its object detection branch for open container classification. Although our method only uses the depth images from the 24 views captured by the RGB-D camera, we also save the RGB images; these RGB images are used for testing AffordanceNet. Our method segments the object (Sec. IV-A) before the imagination for open container classification. To ensure fairness to the greatest extent, we manually crop the object from each RGB image to subtract the background before inputting it into AffordanceNet. From the output of AffordanceNet’s detection branch, we first select the detection (bounding box + classification score) with the highest classification score. The maximum classification score over the 3 open container classes of this detection is used as the confidence score of the object being an open container. We evaluate the classification accuracy and AUC of AffordanceNet for each of the 24 views, resulting in 24 accuracies and AUCs in total. The highest accuracy and AUC are selected as the accuracy and AUC of AffordanceNet.
Both our method and AffordanceNet achieve perfect performance on the restricted test set. On the full test set, our method fails to classify some objects (indicated with green I’s in Fig. 6(d)), which are labelled as non-open containers by the human annotation, for two reasons. First, the reconstructed 3D model of the object traps particles in small concavities arising from inaccurate reconstruction and approximate convex decomposition. Second, the object (e.g., a spoon) itself possesses small concavities which can retain particles. Interestingly, the spoon, classified as an open container by our method, is annotated as a non-open container by only 3 out of 5 human subjects, i.e., whether it is an open container diverges among human subjects.
V-B. Robot Autonomous Pouring
We use the method described in Sec. III-B and IV-D to pour candies into the open containers in the test set. We compare with AffordanceNet by using its affordance detection branch ([do2018affordancenet] demonstrates robot pouring using the affordance detection branch; however, the authors did not do quantitative experiments and the source code for pouring is not available, so the code for testing AffordanceNet on pouring is implemented based on the description in the paper and correspondence with the authors). We derive the pouring position from the affordance detection and use the same method described in Sec. IV-D to pour. The view with the highest AUC, which also has the highest classification accuracy, is used to test AffordanceNet. It is a top-down view similar to the view used for pouring in [do2018affordancenet]. As in the open container classification, we crop the object from the image to ensure a fair comparison. Using the same method described in [do2018affordancenet], we set the x and y components of the pouring position by calculating the centroid of all the pixels predicted with the “contain” affordance in 2D and projecting it to 3D with the corresponding depth image and camera intrinsic parameters. However, directly pouring at the projected point will result in collision with the object if the point is on the object surface. Therefore, we set the z component of the pouring position by adding a fixed offset to the z component of the projected point. Since AffordanceNet is not able to specify the pouring orientation, we use a fixed orientation in the experiment. We also compare with a variation of AffordanceNet, denoted as AffordanceNet w/ 3D Scanning, which incorporates the robot 3D scanning module of our method to provide the z component of the pouring position. Similar to our method, the z component is set such that the pouring pivot is 1 cm above the object AABB. In addition, we compare to a variation of our method, denoted as Imagination w/ fixed θ, in which the pouring orientation is fixed and equals that used for AffordanceNet. We want to study the importance of reasoning about pouring orientations.
Method                        | Test set   | Success rate (%)
AffordanceNet w/ 3D Scanning  | restricted | 91.30
Imagination w/ fixed θ        | restricted | 100.00
Imagination w/ fixed θ        | full       | 94.55
For AffordanceNet and its variation, if the network detects the “contain” affordance for an object and the robot pours without spillage, we consider the pouring a success, and a failure otherwise. For our method and its variation, if the open containability imagination classifies the object as an open container and the robot pours without spillage, we consider the pouring a success, and a failure otherwise. The results are shown in Table II. Failure cases of all methods are shown in Fig. 6(b)(c). Since AffordanceNet is not able to specify the z component of the pouring position, it struggles with very shallow containers (e.g., pans) due to candies bouncing off the object. With the robot 3D scanning providing the object height, AffordanceNet w/ 3D Scanning is able to cope with this problem but still fails on some objects (e.g., the wooden bowl with a color similar to the table in Fig. 6(b)) due to incorrect affordance detection. Both Imagination and its variation perform perfectly on the restricted test set. On the full test set, the variation fails on a gravy boat whose opening resembles a slim rectangle. Without reasoning about the pouring orientation, it is not able to find the orientation in which the cup mouth points along the length of the rectangle, which would provide a larger tolerance for the randomness in the candies’ pouring trajectories. The only failure case of Imagination on the full test set is a vase with a small opening, which was reconstructed as a dent with very limited depth due to limitations of the 3D scanning. The robot classifies it correctly but fails to pour without spillage.
VI. Discussion & Future Work
Despite being trained with only 11 objects, our method achieves the same open container classification performance on the restricted test set as the deep learning method, which was trained with thousands of annotated images. Moreover, our method’s performance on the full test set aligns well with human judgements. The pouring experiment results show that our method outperforms the deep learning method on the restricted test set and maintains similar performance on the full test set. Both the classification and pouring results of our method on the full test set, which covers 57 object categories, show that our method achieves inter-class function (open containability) generalization, which is challenging for appearance-based methods due to the data burden and the implicit nature of appearance representations. We believe this generalization is a fundamental strength of affordance-based object perception, which physically captures the most essential cue of open containers by reasoning about interactions with objects.
Unlike the black-box nature of the deep learning method, our method is explainable: it explains the open containability affordance of an object by the number of particles the object is able to retain in the physical simulation. Our method explores object affordances based on object instances instead of object categories. For objects which do not belong to typical open container categories but are able to afford open containability, e.g., the candlestick in Fig. 5, our method is able to improvise and recognize the affordance. This provides robots with the intelligence to tackle previously unseen object instances and/or categories when deployed in our daily life.
In future work, we plan to introduce more accurate and adaptive scanning to tackle the failure cases in the classification and pouring experiments. We also plan to combine reasoning about the amount of particles to pour with an evaluation of the open container’s volume for more adaptive pouring. Finally, we want to cut down the runtime by parallelizing the pourings from different poses in the pouring imagination.
VII. Conclusion

In this work, we develop a novel method for robots to imagine the open containability of previously unseen objects via physical simulations. Our method performs open container classification and endows robots with the capability to autonomously pour into objects via open containability imagination and pouring imagination, respectively. We evaluate our method on a dataset containing 130 previously unseen objects covering 57 object categories. Results show that our method’s performance on open container classification aligns well with human judgements, and that it is able to pour into the 55 open containers in the dataset with a success rate of 98.18%. This ability to assess and generalize function to previously unseen object instances and categories provides robots with greater intelligence in robot-object interaction. We hope that our method will serve as an effective approach to reinforce robot-object interaction in future research.