The ability to find and retrieve an object from a cluttered environment, i.e. solving the object search problem, is an important requirement for many robotics applications, from warehouse retrieval to household chores. Yet cluttered scenes inherently impose a limitation on visibility: objects occlude one another in close proximity, and often the range of feasible viewpoints is limited. Also, in unstructured environments such as homes, new objects appear frequently and it is very restrictive to require that all objects in the scene have corresponding CAD models and/or labeled images available. In spite of these difficulties, estimating object geometry is important for sequencing actions to find a target object.
Thus this paper focuses on a key topic that has been largely overlooked in the object search domain: inferring occluded geometry. The current state-of-the-art in object search (e.g. [jonschkowski2016probabilistic, schwarz2017nimbro, milan2018semantic]), which shows impressive performance, does not reason about occlusion but instead relies on CAD models and/or labeled images to construct a geometric description of the scene. However, when such information is not available, we hypothesize that inferring occluded geometry significantly improves object search performance in dense clutter in terms of the number of actions required to retrieve the object. This paper does not focus on the effect of shape-completion on grasping, which has been explored in previous work [varley2017shape]. Rather we focus on the role of shape-completion in action selection, i.e. in determining which object to move and where.
The main contribution of this paper is the integration of a novel extension to a previous method for shape completion [Yang17] into a manipulation planning framework. Our proposed shape completion method allows us to infer occluded geometry more accurately by using free-space information and we use this method to infer occluded geometry in a cluttered scene. We are not aware of any previous method that integrates shape completion into a framework for object search. We also integrate a memory method to track free space seen earlier in the interaction, as well as previously-observed geometry that has become occluded.
To gauge how much inferring occluded geometry improves manipulation performance, we constructed a manipulation planning system for a bimanual robot (see Figure 3). Our baseline system consists of a planner that operates on a volumetric segmentation of RGBD images and uses a set of motion primitives to locate and retrieve a target object. We conducted 182 manipulation experiments in eight tabletop cluttered scenes (see Figure 18) comparing the baseline system to a system augmented with shape completion and/or memory. We first evaluated whether our proposed shape completion method outperformed previous work on a dataset specialized to our application. We then tested the hypothesis that both shape completion and memory independently improved performance in certain scenarios. Finally, we tested the performance of the baseline vs. the augmented system in scenarios with varying amounts of clutter. In densely-cluttered scenes we found that inferring occluded geometry significantly reduced the number of actions necessary to retrieve the target object.
2 Related Work
Object Retrieval from Clutter Manipulation of movable obstacles in cluttered scenes has been a longstanding goal of robotics research [Stilman2007], [Dogar2012]. While these earlier examples assumed full knowledge of the scene in question, leveraging manipulation to discover hidden objects has also been a significant focus. In [Wong2013], Wong et al. use spatial visibility constraints and object semantic information to plan manipulation sequences. [Dogar2014] use revealed volume in a utility function and generate connected component networks from object occlusions, while [Li2016] frame the problem as a POMDP and employ an image processing pipeline similar to the one presented here. The above methods make important contributions to the literature; however, they make simplifying assumptions (e.g. discrete planning space, known object models, or sparse clutter) that are more restrictive than ours.
In recent years, much work on vision and manipulation in clutter has been driven by the scenarios presented by the Amazon Picking Challenge (APC), producing numerous publications on both full frameworks and isolated components. In APC 2015, [jonschkowski2016probabilistic] developed a segmentation algorithm based on explicit image features (color, edge, missing 3D, distance to shelf, height, etc.), though the more recent trend has been toward deep-learned perceptual models (e.g. [schwarz2017nimbro] from APC 2016, which combined object detection using a network fine-tuned on Visual Genome with semantic segmentation using a pixel-level CNN model pretrained on ImageNet). Starting in APC 2017, the challenge required competitors to attempt to manipulate novel items in the scene, producing complex sensing and grasping frameworks such as [milan2018semantic, zeng2018robotic]. While powerful, these methods do not reason about occluded space.
Volumetric Shape Reconstruction Reconstructing a 3D model of a scene is both a major challenge and powerful tool for robotic manipulation. Until recently, most systems seeking to infer volumetric information about the hidden parts of the scene have relied on CAD model matching [Liu2012Fast, Choi2012] or semantic matching [factored3dTulsiani17, xiang2018posecnn]. However, progress in reconstructing objects from single 2.5D views [wu2015shapenets, firman2016structured, Yang17, fan2017point, Yang18] has enabled manipulation planning on unseen parts of the space [varley2017shape]. We extend this work in order to reason about occlusion in a cluttered scene, choosing to build on [Yang17] because it obtained good performance on challenging objects and it was clear how to incorporate free-space information into the network.
3 Problem Statement
We seek to retrieve a specific object from a cluttered scene using robotic manipulator(s). Our domain represents household applications, where previously-unseen objects can be present and CAD models or labeled images are not available.
Sensing To sense the environment we assume that the robot is endowed with a single RGB-D sensor. We assume that the objects in the environment are arranged on a flat surface and may be in arbitrary stable pose and contact configurations. A key difference between our domain and that of much previous work is that we do not assume that we possess explicit object representations for the objects on the table: the robot may have observed some of the objects during training of its perception algorithms, but other objects are completely new. Furthermore, the robot has no way to identify which objects it has been trained on. Before it acts, the robot receives an observation of the state of the environment in the form of a grid of RGBD values.
The robot also has no explicit representation for the target object, but is endowed with a classifier which determines if a given pixel in the RGBD image is likely to be a part of the target object. This is meant to handle queries that may come from a user, such as “Bring me the yellow ball”, where no explicit model of the object is given.
Acting The robot may manipulate the objects in any way it chooses. However, unlike much previous work, we assume we are not able to command the robot to remove objects from the scene. We make this restriction to consider scenarios where the robot has a limited work-surface (i.e. a confined space).
Further, we assume that the robot has only a limited knowledge of contact mechanics and physics. Contacts between the robot model and the environment or between a grasped object and a tabletop object can be computed based on their observed or inferred shapes, but their behavior after contact is difficult to predict because physical properties such as mass, pressure distribution, and friction, are not known.
Problem The robot is endowed with a set of possible actions it can apply and must choose which actions to take to locate and retrieve the target object, with each action having a negative reward and retrieval having a positive reward. This problem can be formulated as a Partially-Observable Markov Decision Process (POMDP) by defining a belief state over the environment and computing a policy over belief states which maximizes the probability of success given any starting state. While a POMDP policy would be desirable, this is clearly intractable, as the belief over environments is too high-dimensional for a POMDP solver to handle and we do not have models of the transition and observation uncertainty. Instead, we focus on a greedy approach: we seek to design a policy that takes the next best action given the current observation and the target classifier. A key challenge is how to use these inputs to infer object geometry in occluded regions, so that this information can be used to inform action selection.
4 System Framework
This section describes the components of our system, shown in Figure 4. We first describe our methods for perception, then action selection.
4.1 RGB-D Segmentation
The pipeline begins by processing the RGB-D observation to produce a segmentation of the observed scene into distinct objects. Many state-of-the-art semantic segmentation approaches require object class annotations [he2017mask, Lin:2017:RefineNet], which violates our stipulation that the scene may contain novel objects or classes. Therefore, we investigated class-agnostic segmentation approaches, adapting SceneCut [Pham2017] for our scenario. The method begins with an ultrametric contour map (UCM) [arbelaez2006boundary], a hierarchical segmentation tree derived from the RGB-D image of the scene. The UCM is generated by a Convolutional Oriented Boundary (COB) [Man+16a, Man+17] network trained on the NYUD-v2 dataset [Silberman:ECCV12]. SceneCut then utilizes a tree cut to minimize an energy function over objectness and geometric fitting.
Each segment of the RGB-D image, given the camera intrinsics, corresponds to a point cloud representing a potential object. Points belonging to the table surface, robot arms, or outside the table region of interest are rejected, and the surviving point clouds are passed to shape completion. We also try to find the target in the observation, and we extract occupied and free volumes (detailed below).
Target Object Detection Given a collection of image/point cloud segments, we next determine whether any of them is the target in question. We assume that a per-pixel classifier for the target is available, as object recognition is not a focus of this work. Our implementation uses color matching to classify whether a pixel belongs to the target object.
Occupied and Free-Space Volumes Using the segmentation and full point cloud, we can compute a voxelized representation of the free, occupied, and unknown state of the world. Occupied and free space computation is provided by feeding the point cloud data to OctoMap [Hornung2013] to generate an octree representation of the scene. The shape completion results from the following section are fed back into object-specific OctoMaps that are used for end-effector collision computations.
4.2 Shape Completion
The main contribution of this paper is the integration of a new method for shape completion into our object search system. We first frame the problem of shape completion and then describe our solution. Consider an occupancy map carrying 3D points to a binary occupancy value, empty or filled. Indexing each axis by an initial segment of the natural numbers, we can define a voxel grid as a discrete version of this occupancy map.
The pointwise representation of a voxel grid is the list of points at the centers of its filled voxels. We define the distance between two voxel grids as the Chamfer distance, as defined by [fan2017point], between their pointwise representations.
Given a partial scan of the object and the observed free space, shape completion seeks the voxel grid that minimizes this distance to the true voxel occupancy of the object.
However, at runtime the true shape of the object is unknown. Instead, we apply learning methods which train a deep neural network on multiple views of objects in simulation, where the true shape is used as ground truth. We then use the learned network to predict a likely complete shape for a given partial scan.
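The voxel-grid distance described above can be sketched in a few lines. This is a minimal illustration, assuming the unnormalized squared-distance form of the Chamfer distance from [fan2017point]; the function names, the nested-list grid layout, and the `voxel_size` parameter are hypothetical and not taken from the paper's implementation.

```python
def voxel_centers(grid, voxel_size=1.0):
    """Pointwise representation: centers of all filled voxels.

    `grid` is a nested list indexed [i][j][k] holding 0 (empty) or 1 (filled).
    """
    return [
        ((i + 0.5) * voxel_size, (j + 0.5) * voxel_size, (k + 0.5) * voxel_size)
        for i, plane in enumerate(grid)
        for j, row in enumerate(plane)
        for k, filled in enumerate(row)
        if filled
    ]


def _squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Sum of squared nearest-neighbor distances, taken in both directions."""
    if not points_a or not points_b:
        raise ValueError("Chamfer distance is undefined for an empty point set")
    a_to_b = sum(min(_squared_distance(p, q) for q in points_b) for p in points_a)
    b_to_a = sum(min(_squared_distance(q, p) for p in points_a) for q in points_b)
    return a_to_b + b_to_a


def voxel_grid_distance(grid_a, grid_b, voxel_size=1.0):
    """Distance between two voxel grids via their pointwise representations."""
    return chamfer_distance(
        voxel_centers(grid_a, voxel_size), voxel_centers(grid_b, voxel_size)
    )
```

The brute-force nearest-neighbor search is quadratic in the number of filled voxels; a spatial index would be the natural replacement for grids of realistic size.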
To tackle the learning problem, we begin with a base model of the 3D-RecGAN architecture [Yang17], a combination of a generative autoencoder and a Generative Adversarial Network (GAN) [goodfellow2014generative] capable of generating high-resolution 3D shapes that capture key features (such as handles). Compared to previous approaches which generate a 3D shape from RGB or RGB-D information, this architecture does not require object class labels and is able to generalize to unseen objects. This approach performs well, but does not have the ability to include known free-space information, so it may generate voxels in known free space. We thus build on this method by incorporating two main modifications: 1) restructuring the network architecture to include known free space and 2) using a dataset that includes occlusions for training.
Architecture 3D-RecGAN consists of two main networks: the generator and the discriminator. We improve on the original network by augmenting the input space with the known free-space voxels. Figure 5 shows the detailed architecture of our modified generator in 3D-RecGAN. Both the occupancy voxels and free voxels are encoded using the five 3D convolutional layers used by 3D-RecGAN. In latent space, the two latent vectors are concatenated together. The decoder comprises six up-convolutional layers followed by ReLU activations, except for the last layer, which uses a sigmoid function. All encoder layers are concatenated to the decoder by skip-connections to preserve local structures. The discriminator and loss functions are the same as those used in 3D-RecGAN.
Synthesizing a Dataset with Occlusions Many large synthetic datasets generated from 3D models exist for the purpose of 3D reconstruction from a single view. Most existing cluttered datasets are generated from large and complex objects such as furniture in an indoor scene. Existing datasets for robotic manipulation in cluttered scenes are limited. [varley2017shape] generated a dataset for shape completion for the purpose of robot grasping using objects from the YCB dataset [calli2015ycb] and Grasping dataset [kappler2015leveraging]. However, this dataset includes only single-view occupancy grids with self-occlusion. [firman2016structured] introduced a tabletop dataset which consists of complete RGB-D occluded scenes with objects occluding each other, but their TSDF encoding differs from the input required by our network architecture.
In order to train a network to reconstruct occluded parts from cluttered scenes, we modify and augment the dataset synthesis steps used by [varley2017shape] so that the objects are not only self-occluded but also occluded by an obstacle. Our dataset contains three kinds of 3D voxel grids for each example: the observed occupied voxels, the known free-space voxels, and the complete object occupancy (the ground truth).
Twelve objects from the YCB dataset and ShapeNet [shapenet2015] are collected, and occupancy grids are generated from the object meshes using binvox [binvox]. Rotations are then uniformly sampled in roll-pitch-yaw space. Instead of directly generating depth images from different rotation angles, an obstacle mask is placed between the camera and the mesh, occluding part of the object and thus generating the observed occupancy grid. The voxels between the camera and the observed surface are marked as known free space. The observed occupancy grid is then centered in the reconstruction grid in order to remove information about the original object extents, so that the input is similar to a real scenario, where the true extent of the object is unclear. The recentered voxels are then shifted towards the camera to a fixed offset to provide more space for reconstruction. In the experiment section, we show that training using this occluded dataset boosts performance when reconstructing objects in the presence of occlusion.
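The occlusion step of the dataset synthesis can be illustrated with a simplified sketch. It assumes an orthographic camera looking along the first grid axis and an obstacle mask defined over the image plane; both are simplifications chosen for brevity, and the function name and grid layout are hypothetical rather than the procedure actually used to build the dataset.

```python
def synthesize_example(truth, mask):
    """Derive observed-occupied and known-free grids from a ground-truth grid.

    truth[i][j][k] is 0 or 1, with the camera looking along increasing i.
    mask[j][k] is True where the obstacle blocks the camera ray (assumption:
    a blocked ray yields neither occupied nor free observations).
    """
    ni, nj, nk = len(truth), len(truth[0]), len(truth[0][0])
    observed = [[[0] * nk for _ in range(nj)] for _ in range(ni)]
    free = [[[0] * nk for _ in range(nj)] for _ in range(ni)]
    for j in range(nj):
        for k in range(nk):
            if mask[j][k]:
                continue  # ray is occluded by the obstacle: everything stays unknown
            for i in range(ni):
                if truth[i][j][k]:
                    observed[i][j][k] = 1  # first surface hit along the ray
                    break  # voxels behind the hit are self-occluded (unknown)
                free[i][j][k] = 1  # empty voxel seen before any hit
    return observed, free
```

The three grids of one training example would then be `observed`, `free`, and `truth`; recentering and shifting toward the camera are omitted here.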
4.3 Volumetric Memory
Although the dynamics of manipulation are difficult to predict, there are cases, such as when the robot has a stable grasp on a manipulated object, where we wish to carry information from past interactions into the next scene. With positive memory, we compute the pose transformation due to the manipulator motion, then add the object octree from the previous time step into the scene octree at the current time step using the new pose. With negative memory, we assume that space that was previously free and is now occluded is likely to remain free, unless the shape completion indicates otherwise. As both of these assumptions can be violated by unanticipated object interaction (e.g. objects slipping in the grasp or collisions knocking objects behind others), unobserved space is set to decay toward the occupancy threshold at a fixed rate.
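One way to realize such a decay is sketched below. The exponential form, the default threshold of 0.5, and the default rate are assumptions made for illustration, since the exact update rule is not recoverable from the text above; the function name and the dictionary-based map are likewise hypothetical.

```python
def decay_unobserved(occupancy, observed, threshold=0.5, rate=0.1):
    """Move remembered occupancy probabilities toward the occupancy threshold.

    occupancy: dict mapping voxel index -> occupancy probability in [0, 1].
    observed:  set of voxel indices seen in the current frame (left unchanged).
    Assumption: each step removes a fraction `rate` of the gap to `threshold`.
    """
    return {
        voxel: p if voxel in observed else threshold + (1.0 - rate) * (p - threshold)
        for voxel, p in occupancy.items()
    }
```

Under this rule, remembered free space (probability below the threshold) and remembered occupied space (probability above it) both drift back toward "unknown" the longer they go unobserved.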
4.4 Motion Planning
After segmentation and reconstruction, we are left with a collection of voxel maps approximately representing individual objects which must be rearranged to facilitate target retrieval. We employ a randomized kinodynamic motion planner with heterogeneous action types to find an action to perform in the current scene. Acting in clutter often restricts the feasibility of traditional pick-and-place actions due to limited reachability around objects. For these reasons, this work follows others, including [Gupta2015], [Boularias-2015-5904], [Dogar2012], in employing action primitives to act in constricted space.
Domain Definitions When planning for the motion of rigid bodies in 3D, the natural planning space is the Cartesian product of one copy of SE(3) per rigid body in the scene. Combined with the robot’s joint configuration space, we can form the full configuration space representing the state of the robot and all manipulable rigid bodies in the scene.
The action space contains all the possible control actions the system can take. Each action has an associated parameter space and maps a configuration and a parameter value to a resulting configuration. Each action may also be equipped with a steering policy, which takes an initial and goal configuration and returns the parameterized action to locally advance toward the goal state. For actions without steering, it is necessary to sample from the parameter space directly.
In the absence of a known terminal state, we instead supply a reward function that determines the most promising action given the current and predicted next state. The function rewards actions that are likely to exhibit high information gain and penalizes trajectories with high incidence of collision. Collisions between the robot and scene objects during action execution are not forbidden in the framework, as the objects are movable, but actions with lower contact are rewarded.
After segmentation and shape completion, occluded voxels (“shadows”) in the field of view are computed by raycasting from unknown cells back to the camera origin. The object to move is determined by uniformly sampling from the set of shadow voxels, then casting back to select the object that occluded that volume. This provides a heuristic for sampling objects that are more likely to be hiding the target object. In the case where the target object is partially visible, but unreachable due to gripper collisions with its neighbors, the selector has a fixed chance of choosing one of those colliding objects (with probability proportional to the number of gripper poses it obstructs and the number of target voxels occluded by the object), and otherwise proceeds normally. Function SelectObject in Algorithm 1 highlights this process.
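The sampling heuristic can be sketched as follows. This is an illustrative simplification, not a transcription of SelectObject in Algorithm 1: it assumes an orthographic camera looking along the first grid axis, and the names `shadow_voxels`, `select_object`, `blockers`, and `blocker_chance` are hypothetical.

```python
import random


def shadow_voxels(labels):
    """List (voxel, occluder_id) pairs for voxels hidden behind an object.

    labels[i][j][k] is 0 for empty or unknown space, else an object id
    (including any shape-completed voxels). Camera looks along increasing i.
    """
    shadows = []
    for j in range(len(labels[0])):
        for k in range(len(labels[0][0])):
            occluder = None
            for i in range(len(labels)):
                label = labels[i][j][k]
                if occluder is None:
                    if label:
                        occluder = label  # first object hit along this ray
                elif not label:
                    shadows.append(((i, j, k), occluder))
    return shadows


def select_object(labels, blockers=None, blocker_chance=0.0, rng=random):
    """Pick an object to move by sampling a shadow voxel uniformly.

    blockers: optional dict of object id -> weight for objects obstructing a
    partially visible target; chosen with probability `blocker_chance`.
    """
    if blockers and rng.random() < blocker_chance:
        ids = list(blockers)
        return rng.choices(ids, weights=[blockers[i] for i in ids], k=1)[0]
    shadows = shadow_voxels(labels)
    if not shadows:
        return None
    _, occluder = rng.choice(shadows)
    return occluder
```

Because shadow voxels are sampled uniformly, an object is selected with probability proportional to the volume it hides, which is why shape completion (by relabeling part of a shadow as belonging to the object itself) changes the selection odds.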
Action Specifications For the tabletop manipulation domain, we have chosen three heterogeneous actions representing a taxonomy based on the controllable subspace of the full state space: Push, Slide, and Pick. Each action operates on a single selected object, although in clutter this will likely influence the neighborhood of objects around it. Push represents a 1D palm-push motion, parameterized by a magnitude and direction in the plane of the table surface. Slide is implemented by grasping the selected object and dragging it in the table plane, for 3 controllable dimensions. Its parameter contains the planar transform of the object motion. Finally, Pick represents a full grasp of the object, and can move it freely in 3D.
Each task-space motion of an object has a generating policy that produces full joint-space motions of the robot. This consists of planning a collision-free path to the start of the trajectory, then solving coherent inverse kinematics for a Cartesian path of the end-effector.
Feasibility Function The feasibility function is composed of two components. First, the entire generated trajectory must be kinematically feasible. Second, the initial pose of the hand must be free from collisions. Collisions between the remainder of the robot trajectory and the scene objects are permitted, and are handled by the reward function.
Reward Function Given limited knowledge of the scene’s dynamics and state, the reward function plays a dominant role in enabling progress toward locating the target object. The reward used here is a linear combination of a number of heuristic value functions.
For our constituent reward functions, we use the following elements:
The number of previously occluded voxels that should be revealed by this motion (the Information heuristic).
The standard deviation of the centroids of detected objects.
The motion of the object toward or away from the center of mass of the scene.
The number of collisions between the robot trajectory and non-manipulated objects in the current motion.
No penalty is assessed for disturbing or toppling other objects, other than the collision metric. The weight vector will in general depend highly on the resolution of the voxel map.
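The weighted-sum reward and the greedy choice among feasible candidate actions can be sketched as below. The feature names, weight values, and the `is_feasible` callback are hypothetical placeholders; the weights actually used depend on the voxel resolution, as noted above.

```python
def reward(features, weights):
    """Linear combination of heuristic values for one candidate action."""
    return sum(weights[name] * value for name, value in features.items())


def best_action(candidates, weights, is_feasible=lambda action: True):
    """Greedy selection: highest-reward candidate that passes feasibility.

    candidates: list of dicts with a "features" entry mapping heuristic name
    to its value for that action. Returns None if nothing is feasible.
    """
    feasible = [c for c in candidates if is_feasible(c)]
    if not feasible:
        return None
    return max(feasible, key=lambda c: reward(c["features"], weights))


# Illustrative weights only: collisions carry a negative weight so that
# low-contact motions are preferred.
example_weights = {
    "revealed_voxels": 1.0,
    "centroid_spread": 0.5,
    "outward_motion": 0.5,
    "collisions": -2.0,
}
```

Keeping feasibility as a hard filter and collisions as a soft penalty mirrors the split described in the text: infeasible trajectories are discarded, while contact with movable objects only lowers the score.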
The Information heuristic requires some additional explanation. When moving a partially-visible object with shadow, the hidden voxels could “belong” either to the object in motion or to the remainder of the scene. Given the rigid transform representing the object motion and the coordinate of a shadowed voxel, we therefore need to check whether either the original voxel coordinate or its image under the transform is visible given the new object position. Subfigures (b) and (c) show this process.
Grasp Generation For both the cases with and without shape completion, grasps were planned on a convex prismatic approximation of the voxel grid, generated by extruding the 2D convex hull of the object’s downprojected (X-Y) occupancy map between its minimum and maximum extents in the table z-coordinate.
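The convex prismatic approximation can be illustrated with a short sketch. It uses Andrew's monotone chain algorithm for the 2D hull, which is a choice made here for illustration and not necessarily the routine used in the system; the function names and the use of integer voxel indices instead of metric coordinates are also assumptions.

```python
def convex_hull_2d(points):
    """Convex hull of 2D points in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain because it starts the other chain.
    return lower[:-1] + upper[:-1]


def prism_from_voxels(voxels):
    """Extrude the hull of the X-Y footprint between the min and max z index.

    voxels: iterable of (x, y, z) indices of occupied cells.
    Returns (footprint_polygon, z_min, z_max).
    """
    voxels = list(voxels)
    if not voxels:
        raise ValueError("cannot build a prism from an empty voxel set")
    footprint = convex_hull_2d([(x, y) for x, y, _ in voxels])
    heights = [z for _, _, z in voxels]
    return footprint, min(heights), max(heights)
```

For example, `prism_from_voxels([(0, 0, 0), (2, 0, 0), (2, 2, 1), (0, 2, 3), (1, 1, 2)])` yields a square footprint with corners at (0, 0), (2, 0), (2, 2), (0, 2) spanning z indices 0 to 3.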
5 Experimental Results
Experimental Equipment To evaluate the system described, we employed a custom bimanual robot equipped with two KUKA LBR iiwa 7 R800 arms, two Robotiq 3-Finger Adaptive Grippers, and a Microsoft Kinect 2 for vision. External localization of the robot, camera, and table was provided by eight Vicon Bonita 10 motion capture cameras (no motion capture was used for the manipulated objects). Primary scene processing and motion planning was performed on a PC with an Intel 4.7GHz i7-8700K CPU and NVIDIA GTX 1080Ti GPU. Shape completion was also performed on a 1080Ti, and segmentation was performed on an NVIDIA Tesla V100-SXM2. The system, spanning six PCs including hardware-facing machines, used ROS for interprocess communication, sensor data acquisition, and trajectory transmission.
Experimental Parameters Throughout the preceding sections, several threshold and weight parameters were employed. For this experiment configuration, we used values of , , , and from Equation 4.4 .
5.1 Shape Completion
Experiment setup To benchmark the shape completion modifications, we generated voxel grids for 16 objects in a variety of previously unseen orientations, four of which were previously unseen by the network. One quadrant of the view was then occluded to mimic the conditions found in realistic scenes. 864 data points are generated for each object and randomly split into training and testing sets with a ratio of 4:1. The testing dataset also includes 3 new objects which were not in the training set. The resulting reconstructions were then compared using the Chamfer distance metric from Equation 1. These error statistics are collected for all 8,448 data points in Figure 19. We then evaluated two hypotheses about our shape completion method:
Training shape completion with a dataset of occluded objects significantly improves performance.
As seen in Figure 19, augmenting the training data with occlusions provides a dramatic improvement in performance (note the log scale on the y-axis). Comparing the 3D-RecGAN model with the 3D-RecGAN + occlusions model yields a t-statistic of 36.752 and a p-value of . Examples are shown in Figure 24. This hypothesis is strongly supported by the results.
Including known free voxel information in shape completion significantly improves performance.
On top of the additions to the training dataset, Section 4.2 described a modification to the original 3D-RecGAN architecture to account for known free space. Figure 19 shows these results as well, showing a modest improvement over the non-freespace network. Computing the t-statistic for the freespace and non-freespace networks leads to a t-value of 1.87543 and a p-value of 0.06079.
5.2 Tests in manually-designed scenes
In order to show the capabilities of shape completion and memory, we test each of these components in manually-designed scenarios where that component is beneficial. We then compare the results to using the baseline (the framework without either of these components). These test scenarios are shown in Figures (a) and (b).
In all scenes, the target object is the yellow ball, and the scene is considered to be successfully solved when the robot picks the ball from the scene using either hand. An attempt is marked as a failure if the target object is ejected from the workspace during the course of the attempt, or if the number of actions taken is more than three times the number of objects in the scene.
Memory significantly reduces the number of actions necessary to retrieve a target object when other objects are likely to be investigated first.
Scene A was designed to explore the benefits of volumetric memory in locating a hidden object in the scene. Here the pitcher casts a much larger shadow than the coffee can behind which the target is hiding, but after one or two moves the system should realize that the target is not there, and should prioritize other objects. Figure 25 and Table 1 show strong improvement, with a t-statistic of 5.8 and a p-value of 0.00002. This furnishes a compelling argument that memory greatly improves the system performance.
Shape completion significantly reduces the number of actions necessary to retrieve a target object when large occluded areas are a part of visible objects.
Scene B was constructed to demonstrate the capability of shape completion to rule out some of the occluded area for exploration because it is a part of visible objects. In the scenario, the target is hidden behind a small cylinder, while there are two large boxes as distractions. The boxes have large shadows, so the baseline would be biased to move those to find the hidden object. However, with accurate shape completion, the system should realize that most of the volume shadowed by the boxes is likely part of the boxes themselves, and thus should move the cylinder. Figure 25 shows that about half of the time the augmented system decides to move the cylinder first, retrieving the target in the optimal two moves. In the other cases, the augmented system selects one of the side boxes first, or needs to clear an obstacle in contact with the target before grasping. Table 1 shows improvement over the baseline, with a t-statistic of 2.1 and a p-value of 0.056.
5.3 Tests in arbitrary cluttered scenes
To assess the performance of the framework as a whole (including both memory and shape completion), we tested the full framework vs. the baseline (without shape completion or memory) on arbitrary sparsely-cluttered scenes and densely-cluttered scenes. We generated a collection of arbitrary sparse and dense clutter scenes, shown in Figure 18 as C1-C3 and D1-D3. For our purposes, “dense” clutter is defined to be where most or all objects in the scene are in contact with one another.
Our full framework significantly reduces the number of actions necessary to retrieve a target object in sparsely-cluttered scenarios.
Scenes C1-3 were constructed to resemble typical household clutter. Figure 25 and Table 1 show strong improvement on C1 and C3, but C2 shows a minor regression. Thus, the hypothesis is only weakly supported, with a t-statistic of 1.5 and a p-value of 0.14.
Our full framework significantly reduces the number of actions necessary to retrieve a target object in densely-cluttered scenarios.
Scenes D1-3 were constructed to resemble typical household clutter that is more densely distributed than C. In all of these cases, the augmented system showed strong improvement, with a t-statistic of 2.6 and a p-value of 0.012, showing good support for the hypothesis.
5.4 Computation Time
Our framework is a proof-of-concept and has not been optimized for fast computation or execution. However, to gauge the practicality of our method, we collected statistics on average computation time used for each component: Preprocessing: 4.34s; Segmentation: 7.30s; Shape Completion: 1.67s; Memory ; Action Selection: 7.32s, and Execution: 34.98s. These results show that the benefits of shape completion and memory come at a low computational cost as compared to the rest of the framework.
6 Conclusions and Future Work
This paper has presented a method for the volumetric completion of partially observed scenes and demonstrated that such a method, when integrated with a manipulation planner, can significantly reduce the number of actions required to retrieve a hidden object from dense clutter. In addition, we have shown that our extension of previous work on shape completion to consider partially-occluded views and known free-space volumes can boost the performance of shape completion in cluttered environments. Future work focuses on enhancing the segmentation component by inferring better segmentations from video of interactions.