Object Finding in Cluttered Scenes Using Interactive Perception

by Tonci Novkovic, et al.
ETH Zurich

Object finding in clutter is a skill that requires both perception of the environment and, in many cases, physical interaction. In robotics, interactive perception defines a set of algorithms that leverage actions to improve the perception of the environment and, vice versa, use perception to guide the next action. Scene interactions are difficult to model; therefore, most current systems use predefined heuristics. This limits their ability to efficiently search for the target object in a complex environment. In order to remove heuristics and the need for explicit models of the interactions, in this work we propose a reinforcement learning based active and interactive perception system for scene exploration and object search. We evaluate our work both in simulated and in real world experiments using a robotic manipulator equipped with an RGB and a depth camera, and compare our system to two baselines. The results indicate that our approach, trained in simulation only, transfers smoothly to reality and can solve the object finding task efficiently and with more than 90




1 Introduction

Robots are very efficient tools that have been used in industry for several decades. Their ability to perform specific, repetitive tasks is enabled by their very high precision and accuracy. However, such performance is only possible if the environment is known and not changing. Since nowadays numerous applications require robots to leave such environments and work in unknown spaces in collaboration with humans, new challenges arise. Some of these include maintaining a consistent representation of the world, interpreting the sensor readings even with dynamic and unknown objects in the unstructured and cluttered scenes, estimating the pose of the robot with respect to a map in all kinds of environments and conditions (indoor, outdoor, low-light, fog, rain, etc.), obstacle avoidance in human-shared spaces, etc.

Nevertheless, robots are slowly becoming more versatile tools that can tackle some of these challenges [correll2016analysis, yang2018grand]. However, they are still not able to cope with all of them at once. If the robots are to become more than just passive observers, the important first step is to let them interact with their surroundings. This enables a whole range of new applications such as completely autonomous object handling in stores and warehouse spaces, home assistance, elderly care, hospital assistance, disaster site inspection, etc. Such robots would save time, lower costs and improve the quality of a service in many of these industries. Furthermore, professionals would then be able to focus on tasks that are not physically demanding or dangerous.

One basic application where interaction is crucial is object search in unstructured environments. Some methods from computer vision focus on detecting objects in single RGB images. With recent progress in deep learning, such methods have shown impressive results, even surpassing human performance in the object detection task [he2017mask, redmon2017yolo9000, li2017].

Figure 1: In order to find a specific object in a scene, at times it is necessary to interact with the environment to reveal the hidden parts. In our experiments, the robot learned that the object of interest, i.e. the red cube, can be hidden inside a pile of cubes and that physically interacting with the objects might reveal it.

However, a single image is usually not sufficient to find an object that is hidden in clutter and occluded by other objects, and therefore not visible. Because of this, active perception, where the camera is actively moved, is required to observe the scene from different viewpoints. In 3D scene reconstruction [kriegel2015efficient], changing the camera viewpoint is required to reveal details of the scene that are obstructed by other objects [kahn2015active] or to gain more information on the scene in order to grasp an object [morrison2019multiview]. The “next best view” to which the camera should move can be selected based on the current knowledge of the scene and the given task [Dunn2009NextBV], or based on a greedy heuristic [delmerico2018comparison] that does not account for long-term reward.

Even when camera motions and smart selection of the next views are allowed, objects can still be hidden in a pile or completely occluded by other objects. Therefore, interaction with the environment is necessary to remove these occluders. Interactive perception builds upon active perception and combines camera motion with environment interaction. It uses information from the interaction to get a better understanding of the scene and then, based on the current state of the scene, plans the next best action. Such systems have been demonstrated for different applications such as segmentation, grasp planning or object recognition [bohg2017interactive], even using RL [cheng2018reinforcement]. However, these usually use predefined actions [danielczuk2019mechanical] and do not consider exploration. As a result, such methods constrain the robot to a limited set of interactions, which are usually not efficient for object search in cluttered environments.

Our approach to object finding in clutter, using interactive perception, is built upon an RL based control algorithm and a simple color detector. Since object detection in a single image is already a very well studied problem, and to simplify our problem, we assume that the target object is of a specific color and use a color detector to provide us with the detections. In addition to these, we encode the scene state using a discretized TSDF volumetric representation [curless1996tsdf]. Afterwards, the next action is determined by an RL algorithm based on the current encoded state and the knowledge obtained from past experiences. Since specifying which actions are good and which are bad is very hard and non-intuitive, supervised methods are not well suited to this task. Furthermore, it is more intuitive to define rewards for the agent in such a complex task than to manually engineer policies. With such a framework, we show that we can learn a policy that can effectively search for an object in a cluttered scene. The main contributions of this paper are:

  • an RL approach to active and interactive perception based object search in clutter,

  • a compressed and interpretable volumetric representation of the environment suitable for RL,

  • experimental evaluation of the framework on both a simulated and a real world robotic system, including a comparison to baselines.

2 Related Work

Finding objects hidden in clutter requires robots to actively explore and manipulate their environment. In this section, we present an overview of related works from the fields of active and interactive perception, as well as their intersection with robotic manipulation.

In contrast to recent works in object detection and semantic segmentation that typically only operate on single fixed images [he2017mask, redmon2017yolo9000], active perception considers the problem of optimizing the sensor placement in order to perform a task [bajcsy2018revisiting], and has found applications in a variety of tasks such as mapping [carrillo2012comparison] and object reconstruction [pito1999solution, wenhardt2007active, kriegel2015efficient]. A common approach is to choose the next-best-view in order to maximize an information gain metric [connolly1985determination]. Velez et al. [velez2011planning] presented a planner for improved object detection in an unknown scene taking into account the confidence of the object detector and uncertainty of the robot’s pose. Choosing a good metric is often challenging and task specific [chen2011active]; however, measures such as Shannon entropy [vazquez2001viewpoint] or KL-divergence [hoof2012maximally] are often used. Recently, new information gain formulations have been proposed for volumetric reconstruction of unknown scenes [isler2016information, delmerico2018comparison]. Acquiring more information about a scene is also beneficial for manipulation. Kahn et al. [kahn2015active] model the probability of grasp locations in occluded regions as a mixture of Gaussians and optimize for sensor placements that minimize uncertainty of these regions, while Morrison et al. [morrison2019multiview] choose informative viewpoints based on a distribution of grasp pose estimates.

However, in many cases robots can improve perception through physical interaction with their environment. The goal of interactive perception is to learn the relationship between actions and their sensory response [bohg2017interactive]. Applications cover a wide range of problems, such as object recognition [sinapov2014learning], object segmentation [fitzpatrick2002manipulationdriven, hoof2012maximally], and inferring physical properties of objects [xu2019densephysnet, martin-martin2017building]. Dogar et al. [dogar2014object] generate plans that minimize the expected time to find an occluded object using a visibility-accessibility graph, assuming that the target is the only hidden object in the scene. Li et al. [li2016act] formulate object search as a POMDP and solve it with an approximate online solver. Their approach, however, relies on highly accurate segmentations. Xiao et al. [xiao2019online] address some of these limitations, assuming knowledge of the exact number and the geometric properties of the objects in the scene. In a recent work, Danielczuk et al. use the output of a grasp planner [mahler2019learning] and different heuristics to choose between grasping, suction, and pushing actions to extract a target object from a heap [danielczuk2019mechanical]. Similarly, Breyer et al. [breyer2019comparing] leverage RL to design simple actions in order to clear a table of objects, with a camera mounted on the end-effector, as in our work.

The majority of these works rely on predefined action primitives, such as pushing and grasping, reducing the set of possible actions to choose from. In contrast, we propose to use RL in order to learn a suitable policy and allow the agent to freely control the pose of its end-effector. Gualtieri et al. [gualtieri2018] also use RL to learn 6-DoF movement of a robot arm. However, to simplify the problem, they constrain the agent to focus attention on sub-regions of the current observation, which requires the target to be visible from the beginning. Cheng et al. [cheng2018reinforcement] train an RNN to predict gripper displacements that push an object to a target location, while being robust to occlusions.

A critical component of the latter work was to incorporate the prediction of an object detector in the observation encoding. This agrees with the findings of Sax et al. [sax2018midlevel], who motivate that task-specific, mid-level features offer better learning efficiency and generalization compared to raw images. This work thus tries to focus on simple and interpretable inputs, allowing light and efficient learning of interactive tasks.

3 Method

In our method, we tackle the problem of finding specific objects in cluttered environments where interaction is required, using a robotic arm with wrist-mounted RGB and depth cameras. Our goal is to find a mapping from sensor measurements to end-effector displacements, i.e. a policy $\pi$, using a model-free RL algorithm. This algorithm should, based on the current representation of the scene and past experience from the training process, predict the next action to perform.

To find the target we use a color detector to provide us with object detections in 3D. These 3D detections together with a volumetric scene representation are used to define the state of our system. Our action space has 5 DoF and one additional value that lets the algorithm decide to terminate the searching procedure. The overview of the system is shown in Figure 2.

Figure 2: To find the objects in clutter we encode the sensor measurements from the RGB and the depth camera into a vector that is then given to the RL agent. The agent computes the next action, which is executed by the robot, and feedback in the form of a reward is provided to the agent. Once the object is detected or the maximum number of time steps is reached, the algorithm terminates.

3.1 Problem Formulation

We model the problem as a discrete, finite horizon MDP $(\mathcal{S}, \mathcal{A}, r, \gamma, T)$, where $\mathcal{S}$ is the set of all states, $\mathcal{A}(s)$ is the set of all valid actions for a given state $s \in \mathcal{S}$, $r$ is a real-valued reward function, $\gamma$ is the discount factor, and $T$ is a deterministic function that defines the dynamics of the MDP.

At each time step $t$, the RL agent observes the current state $s_t$. Based on the parameterized policy $\pi_\theta$, it decides which action $a_t$ to take next. Upon completion of this action, the system transitions into a new state $s_{t+1}$ based on the system dynamics and receives a reward $r_t$. For each episode (one run of the algorithm) this is repeated until the maximum number of time steps is reached or the agent reaches the terminal state. The final goal of the on-policy RL is to find the optimal actions for the given states that maximize the expected return, i.e. the total discounted reward.
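The return that the agent maximizes can be illustrated with a few lines of Python. This is only a sketch: the three-step reward sequence is hypothetical, and the discount factor of 0.99 is the one reported in the experimental setup.

```python
def discounted_return(rewards, gamma=0.99):
    """Total discounted reward sum_t gamma^t * r_t of one finite episode."""
    total = 0.0
    # Accumulate backwards so that each step costs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two steps that only pay the time penalty,
# followed by a successful termination.
episode_rewards = [-2.0, -2.0, 500.0]
print(discounted_return(episode_rewards))
```

Because later rewards are multiplied by higher powers of the discount factor, a large final reward is worth slightly less the longer the agent takes to reach it.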

3.2 Agent Model

We first define how the agent converts the sensor measurements into a state representation that is useful for its RL control module. The control module, based on the current state, plans the next action for the robot. In order to plan meaningful actions, the module is trained in simulation where a reward is given for each episode. The reward is shaped in such a way that the algorithm is guided to successfully complete the given task.

3.2.1 State Representation

The choice of a state representation is a crucial component for learning good policies. The representation has to be general enough to capture all the relevant information for the successful completion of the task. In our object finding task, the representation has to allow the agent to distinguish the boundaries of the objects, to reason about the spatial relationships between the objects, to distinguish the target object, and to provide information about unobserved areas. However, the representation should also be compact enough to keep the network size reasonable and informative enough to allow the convergence of the RL control module.

TSDF is a volumetric representation mainly used in 3D reconstruction [curless1996tsdf], which discretizes the scene into 3D voxels and assigns a value to each of them. These values range from -1 to 1: negative values encode distances behind the surface and positive values encode distances in front of it. A value of 1 also indicates observed free-space in front of the surface, and -1 indicates unobserved space. At all times, the zero crossing in the field defines the surface. The voxel values are averaged over all integrated depth measurements. Additionally, each voxel stores a weight value which, in our approach, is incremented by 1 each time a new measurement in this voxel is observed; a low weight corresponds to a high measurement uncertainty. The main advantage of this representation is that it has a notion of free-space and unobserved space, where both unexplored areas and regions inside the objects are considered unobserved.

The volumetric TSDF representation provides us with spatial information about the scene. However, since we are interested in specific objects, additional information is required. Therefore, we have extended this voxel grid with an additional voxel value that indicates if the voxel is part of the target or not. This information is provided by the object detector which works on the same RGB-D frame. Hence, these detections can be integrated into the grid the same way as depth measurements. Since the focus of our work is on interactive perception, we have simplified the detection problem by using an object of specific color as a target. Therefore, a color detector in CIELAB color space is used as a target detection method.
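A color detector of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the detector used in the experiments: the sRGB to CIELAB conversion uses the standard D65 formulas, while the thresholds `a_min` and `b_min` and the function names are hypothetical choices for picking out red pixels.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 reference white)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear RGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 white point before the nonlinearity.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def is_target_pixel(rgb, a_min=40.0, b_min=10.0):
    """Label a pixel as red if it lies far enough along +a* and +b*.

    The thresholds are hypothetical and would need tuning on real images.
    """
    _, a, b = srgb_to_lab(*rgb)
    return a > a_min and b > b_min


def target_mask(image):
    """Apply the detector to an image given as rows of (r, g, b) tuples."""
    return [[is_target_pixel(pixel) for pixel in row] for row in image]


image = [[(255, 0, 0), (0, 255, 0)], [(128, 128, 128), (200, 30, 30)]]
print(target_mask(image))
```

Thresholding on the a* and b* channels rather than on raw RGB values makes the decision less dependent on lightness, which is one common reason for working in CIELAB.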

Depending on the resolution of the voxels, such a representation can contain a very large amount of information. Since it would be infeasible to use this information directly in the RL algorithm, we need to encode it into a smaller vector that can be efficiently used for learning. To do this, we split the space into two parts, one above the fingers of the gripper and one below. Then we sum the values of the TSDF and the 3D detections separately along the $z$ coordinate to get 4 flattened maps. These maps are then discretized into a $3 \times 3$ grid, where the middle cell is further separated into a $3 \times 3$ grid. As a result, each map is encoded with 8 + 9 = 17 values, and the 4 encoded maps are concatenated into a 68 dimensional vector that represents the state, depicted in Figure 3.
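The encoding step can be sketched in plain Python. The sketch assumes that each flattened map is split into a 3×3 grid whose middle cell is subdivided into another 3×3 grid, which is the layout consistent with the 17 values per map stated above; the function names and the toy 9×9 maps are hypothetical.

```python
def block_sum(grid, r0, r1, c0, c1):
    """Sum of grid[r0:r1][c0:c1] for a 2D list of numbers."""
    return sum(sum(row[c0:c1]) for row in grid[r0:r1])


def edges(lo, hi, parts=3):
    """Split the index range [lo, hi) into `parts` near-equal intervals."""
    return [lo + (hi - lo) * k // parts for k in range(parts + 1)]


def encode_map(grid):
    """Encode one flattened map as 8 coarse cells plus 9 fine cells."""
    row_edges = edges(0, len(grid))
    col_edges = edges(0, len(grid[0]))
    values = []
    for i in range(3):
        for j in range(3):
            if (i, j) == (1, 1):
                continue  # the middle cell is encoded at finer resolution
            values.append(block_sum(grid, row_edges[i], row_edges[i + 1],
                                    col_edges[j], col_edges[j + 1]))
    fine_rows = edges(row_edges[1], row_edges[2])
    fine_cols = edges(col_edges[1], col_edges[2])
    for i in range(3):
        for j in range(3):
            values.append(block_sum(grid, fine_rows[i], fine_rows[i + 1],
                                    fine_cols[j], fine_cols[j + 1]))
    return values


def encode_state(maps):
    """Concatenate the encodings of all flattened maps into one vector."""
    state = []
    for flat_map in maps:
        state.extend(encode_map(flat_map))
    return state


# Four toy 9x9 maps (TSDF and detections, above and below the fingers).
toy_map = [[1.0] * 9 for _ in range(9)]
state = encode_state([toy_map] * 4)
print(len(state))
```

The finer resolution in the middle cell gives the agent more detail close to the gripper, while the eight coarse cells summarize the surroundings.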

The global representation of the environment is stored in the robot’s base frame. However, prior to generating the encoded vector, the TSDF and the 3D detection grids are transformed to the end-effector frame projected onto the xy-plane. This is done by translating the two grids horizontally so that the end-effector’s position is at the center of the grids. Because of this, there is no need to additionally append the end-effector pose to the robot state since all the measurements are relative to it.

Since we also allow interaction with the scene, to be able to update the representation in case something moves, we modified the depth image integration to limit the maximum weight in the TSDF grid to 2. This means that all past measurements have a small weight (high uncertainty) and when new measurements are obtained, old ones are quickly forgotten. Such a strategy allows us to update the representation very quickly by only taking a few additional measurements.
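The effect of the weight cap can be seen in a small per-voxel sketch. It assumes the usual running weighted average for TSDF integration with a unit weight per new measurement; the function name and the example values are hypothetical.

```python
MAX_WEIGHT = 2.0  # cap on the accumulated weight, as described above


def integrate(value, weight, measurement, max_weight=MAX_WEIGHT):
    """Fuse one new truncated distance into a voxel.

    Assumes a running weighted average in which each new measurement
    has weight 1 and the stored weight saturates at `max_weight`.
    """
    new_value = (weight * value + measurement) / (weight + 1.0)
    new_weight = min(weight + 1.0, max_weight)
    return new_value, new_weight


# A voxel that was observed as free space (value 1, saturated weight)
# is now repeatedly measured as lying behind a surface (value -0.5).
value, weight = 1.0, MAX_WEIGHT
for step in range(4):
    value, weight = integrate(value, weight, -0.5)
    print(step + 1, round(value, 4), weight)
```

With the cap at 2, the old value never contributes more than two thirds of the update, so the distance to the new measurement shrinks by at least a third at every integration. Without a cap, the weight would keep growing and a long-observed voxel would react very slowly to a moved object.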

Figure 3: The 68 dimensional state vector used for the RL algorithm is generated from the volumetric TSDF map and the object detections around the gripper. This information is summarized by discretizing the space into bins centered in the end-effector frame and projected onto the xy-plane.

3.2.2 Actions

To perform the object finding task successfully, the robot is required to move the camera in 3D, as well as to interact with the objects in the scene. To generalize well to both tasks of active and interactive perception, we decided to control relative displacements of the gripper in the end-effector frame. Therefore, we have 3 variables for the translations in the $x$, $y$, and $z$ directions. Since exploring the unobserved areas can be much faster by tilting the end-effector, we additionally added roll and yaw angles. The final degree of freedom, pitch, is omitted here to reduce the complexity of the RL agent. To handle scenes where no object is present, we use an additional binary variable that allows the agent to terminate the episode. This allows the agent to finish successfully even if the target object is not in the scene.
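The resulting action can be represented as a small record. The following sketch assumes that the policy outputs six normalized values in [-1, 1] and that the termination flag is obtained by thresholding the last value at zero; both are assumptions made for illustration. The default limits are the maximum translation and rotation values listed in Table 2.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Relative end-effector displacement plus a termination flag."""
    dx: float
    dy: float
    dz: float
    droll: float
    dyaw: float
    terminate: bool


def clip(value, lo=-1.0, hi=1.0):
    return max(lo, min(hi, value))


def decode_action(raw, max_trans=0.06, max_roll=0.15, max_yaw=0.15):
    """Scale a normalized 6-vector to physical displacements.

    Assumption: raw values are meant to lie in [-1, 1]; out-of-range
    values are clipped, and raw[5] > 0 is read as "terminate".
    """
    if len(raw) != 6:
        raise ValueError("expected 5 motion values and 1 termination value")
    x, y, z, roll, yaw = (clip(v) for v in raw[:5])
    return Action(x * max_trans, y * max_trans, z * max_trans,
                  roll * max_roll, yaw * max_yaw, raw[5] > 0.0)


print(decode_action([1.0, -0.5, 2.0, 0.0, -1.0, 0.3]))
```

Pitch does not appear in the record because it is not controlled, which keeps the action space at five continuous values and one binary decision.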

3.2.3 Reward Function

Given the complexity of the task, the reward should be shaped such that it guides the agent towards the optimal solution. It requires balancing between exploration and exploitation, i.e. discovering unobserved space and evaluating if any potential 3D detection is actually part of the target or not. Therefore, the reward needs to be carefully crafted to avoid getting stuck at a local optimum that prevents the algorithm from converging.

Our reward has the following components:

  • Detection reward - the agent receives a reward proportional to the number of pixels detected that belong to the target object if the number of detected pixels is less than a defined threshold. This encourages the agent to observe more pixels from the target by moving towards it. However, in order to discourage the agent from obtaining this reward repeatedly at each step without terminating, we apply a weight to each detected pixel of the target that decreases when the pixel has already been observed multiple times.

  • Exploration reward - the agent is rewarded for exploring the scene, proportionally to the number of newly observed voxels (the number of voxels with value -1 is compared before and after the action). In order to balance the importance of the detection relative to the exploration, this reward is clipped to a maximum value.

  • Time penalty - to encourage the agent to finish the task as fast as possible, a constant negative reward is given at each step.

  • Final detection reward - if the agent observes enough of the target object and decides to terminate, it gets a final detection reward.

  • Final exploration reward - if there is no target object in the scene and the agent terminates, it gets a final exploration reward equal to the final detection reward.

  • Final failure penalty - if the agent terminates before achieving the task it gets a failure penalty.

At each iteration, if the agent does not terminate or find the object, the best that it can achieve is a reward of 0, since the maximum exploration and detection rewards summed together (1.2 + 0.8, see Table 1) exactly cancel the time penalty (-2). Getting a total positive reward therefore means that the agent decided to terminate after having seen at least part of the target or when there was no target in the scene.
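The reward components listed above can be combined as in the following sketch. The constants are the values from Table 1; the proportionality factors `k_det` and `k_exp` are hypothetical, and the decaying weight applied to repeatedly observed target pixels is assumed to be already folded into `weighted_target_pixels`.

```python
MAX_DETECTION = 0.8
MAX_EXPLORATION = 1.2
TIME_PENALTY = -2.0
FINAL_SUCCESS = 500.0   # final detection or final exploration reward
FINAL_FAILURE = -300.0


def step_reward(weighted_target_pixels, new_voxels, k_det=1e-3, k_exp=1e-4):
    """Reward for one non-terminal step.

    k_det and k_exp are hypothetical proportionality factors; the
    clipping values and the time penalty come from Table 1.
    """
    detection = min(k_det * weighted_target_pixels, MAX_DETECTION)
    exploration = min(k_exp * new_voxels, MAX_EXPLORATION)
    return detection + exploration + TIME_PENALTY


def terminal_reward(task_achieved):
    """Reward added when the agent decides to terminate."""
    return FINAL_SUCCESS if task_achieved else FINAL_FAILURE


print(step_reward(0, 0))          # nothing new seen: only the time penalty
print(step_reward(10**9, 10**9))  # both terms saturated: best case per step
print(terminal_reward(True))
```

Since a non-terminal step can never yield a positive reward, the only way to accumulate a positive return is to terminate correctly, which discourages the agent from collecting detection or exploration rewards indefinitely.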

3.2.4 Policy

To find a successful policy for our task, we used a fully connected network to map the current state to the next action and update its weights with PPO [SchulmanWDRK17], a state-of-the-art policy gradient method. The output layer uses activations that keep the mean of the predicted actions within a bounded range.
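For reference, the clipped surrogate objective maximized by PPO, as given in [SchulmanWDRK17], can be written as follows. The symbols are the standard ones from that work and are not defined elsewhere in this text: $\rho_t(\theta)$ is the probability ratio between the new and the old policy, $\hat{A}_t$ is an advantage estimate, and $\epsilon$ is the clipping parameter, whose value for these experiments is not stated here.

```latex
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[
  \min\left( \rho_t(\theta)\, \hat{A}_t,\;
  \operatorname{clip}\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right)
\right]
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is what makes several optimization epochs on the same batch feasible.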

3.3 Simulation

In order to train the RL model, we use a simulation environment based on the Bullet physics engine [Coumans2015Bullet]. In the simulation, we added a disentangled (free-floating) robot hand whose pose is controlled by a force constraint. Furthermore, we attached a virtual RGB-D camera to the end-effector, mimicking the real robot setup. Depth and RGB images are generated using Bullet’s camera renderer. The simulated scene is surrounded by walls, and the initial positions and the number of objects are randomized in each episode.

3.4 Transfer to Reality

Transfer to reality was done without any fine-tuning of the model on the real robot. Since robot poses are provided in the base frame of the robot, and the depth measurements are in the camera frame, the transformation between the two frames is required to properly integrate the measurements into the TSDF grid. The transformation was obtained using our hand-eye-calibration toolbox [furrer2018evaluation]. The fact that we use a TSDF representation, which averages depth images during integration, allows us to tolerate some noise in the depth image and small errors in the pose estimates of the camera. The discretization of the volume used to represent the state additionally tolerates small scene reconstruction errors. Such a choice of state makes the precise details of the scene less important and, as a result, allows for a better abstraction and generalization to real data.

4 Evaluation

In our experiments we evaluate the performance of our approach both in simulation and in the real world, and compare it to two baselines, one for the active perception task and one for the interactive perception task. As metrics for our evaluation we use success rate, number of steps per episode, and total time per episode.

4.1 Experimental Setup

We evaluate our method with a position-controlled 7-DoF arm of an ABB Yumi. An RGB Chameleon 3 camera, coupled with a CamBoard pico flexx providing depth images, is mounted on the wrist of the robot. We use a similar model in the physics engine PyBullet (https://pybullet.org) to perform the task realistically in simulation. The position of the gripper is randomly initialized between each run.

The scene is composed of cubes, either scattered on the ground in the active task or forming piles in the interactive one. The target, when there is one, is always the single red object in the scene and the other objects are either all green (in simulation) or of diverse colors (real world). The different setups are visualized in Figure 4. The TSDF is integrated with a modified version of the library Open3D [Zhou2018] that can integrate the detections of the target as well. The target detection is performed by first converting the RGB image into the CIELAB color space and then labelling the red pixels as belonging to the target.

Both tasks are trained in simulation using the PPO implementation of the OpenAI Baselines framework [baselines] for 5 epochs, with a batch size of 3000, a minibatch size of 300, a discount factor of 0.99, an entropy bonus of 0.01 and the reward values shown in Table 1. The network used for inference is a Multi Layer Perceptron with 2 hidden layers, each of size 200. The model for interactive perception is first pretrained using the model of active perception and then fine-tuned on the dataset with piles of cubes. Table 2 summarizes the other parameters used in the various experiments.

Reward                     | Value
Maximum detection reward   | 0.8
Maximum exploration reward | 1.2
Time penalty               | -2
Final detection reward     | 500.0
Final exploration reward   | 500.0
Final failure penalty      | -300.0

Table 1: Maximum values for each component of the reward function for each step of the episode.

Figure 4: (a) AP simulation, (b) AP real world, (c) IP simulation, (d) IP real world. Setup of the simulation and real world experiments for the active and interactive perception tasks of object finding. For the active perception (AP) task the target object is never completely covered by the other objects and can be found by moving the camera, whereas in the interactive perception (IP) task the object can be hidden within a pile of objects and, therefore, interaction is necessary to reveal it. In the simulation experiments the hand can freely move within a defined workspace, whereas in the real experiments it also has to obey kinematic constraints.

Parameter                | Sim. AP | Sim. AP (L) | Sim. IP | Real AP | Real IP
Min num. objects         | 5       | 10          | 20      | 15      | 15
Max num. objects         | 25      | 40          | 100     | 30      | 30
Num. of piles            | 0       | 0           | 3       | 0       | 4
No target prob.          | 0.1     | 0.1         | 0.05    | 0.0     | 0.0
Exploration ratio        | x       | x           | 0.25    | x       | 0.1
TSDF grid resolution     | 100     | 100         | 100     | 100     | 100
SDF truncation dist. [m] | 0.04    | 0.04        | 0.04    | 0.04    | 0.04
Workspace length [m]     | 0.6     | 1.2         | 0.6     | 0.33    | 0.33
Max. translation [m]     | 0.06    | 0.06        | 0.06    | 0.06    | 0.06
Max. roll rotation       | 0.15    | 0.15        | 0.15    | 0.15    | 0.15
Max. yaw rotation        | 0.15    | 0.15        | 0.15    | 0.52    | 0.52
Time horizon             | 150     | 500         | 150     | 50      | 50
Num. experiments         | 1000    | 1000        | 1000    | 20      | 20

Table 2: Simulation and real world experiment parameters for both small and large (L) active perception (AP), and interactive perception (IP) tasks. Exploration ratio specifies the fraction of scenes for which there are no piles; therefore, it is only used in the interactive task.

4.2 Baselines

In order to evaluate the performance of our RL based algorithm for object search, we have implemented two baselines, one for the active perception task and one for the interactive perception task.

4.2.1 Active Perception

For the active perception task, where the target object can be revealed just by moving the camera, we have implemented a one-step Greedy Next-Best-View agent [connolly1985determination]. This agent samples poses in the proximity of the current pose and evaluates each of them based on how many voxels would be observed if the agent moved to that location, assuming that there are no objects in the unobserved area. The sample that obtains the highest score is taken as the next action. If a part of the red object has been detected, the samples are instead evaluated based on which one brings the target object closest to the camera image center. Once the agent has explored more than a threshold fraction of the voxels, or sees more than a threshold fraction of the red object, it terminates the search. The number of pose samples per step was set separately for the simulation and for the real world experiments. Each sample is obtained by sampling actions uniformly in the allowed range and applying them to the current pose of the agent.
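The one-step greedy selection can be sketched as below. This is a simplified illustration and not the baseline implementation: poses are treated as plain 5-vectors to which actions are added, `score_fn` is a hypothetical stand-in for counting the voxels that would be newly observed from a pose, and `n_samples` is a free parameter.

```python
import random


def sample_action(rng, max_trans=0.06, max_rot=0.15):
    """Draw one action uniformly within the allowed ranges."""
    translation = [rng.uniform(-max_trans, max_trans) for _ in range(3)]
    rotation = [rng.uniform(-max_rot, max_rot) for _ in range(2)]
    return translation + rotation


def apply_action(pose, action):
    """Simplification: add the action to an [x, y, z, roll, yaw] pose."""
    return [p + a for p, a in zip(pose, action)]


def greedy_next_best_view(pose, score_fn, n_samples, rng):
    """Return the sampled pose with the highest score, and that score."""
    best_pose, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = apply_action(pose, sample_action(rng))
        score = score_fn(candidate)
        if score > best_score:
            best_pose, best_score = candidate, score
    return best_pose, best_score


# Toy score: a higher camera is assumed to see more of the scene.
rng = random.Random(0)
start = [0.0, 0.0, 0.2, 0.0, 0.0]
pose, score = greedy_next_best_view(start, lambda p: p[2], 50, rng)
print(pose, score)
```

Every candidate requires one evaluation of the scoring function, so the cost per step grows linearly with the number of samples, which is relevant when the score involves ray casting through a voxel grid.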

4.2.2 Interactive Perception

For the task where the target object can be hidden in a pile, a Grid Exhaustive Search agent has been implemented. The agent traverses the whole workspace at a fixed low height (just above the lowest cube), pushing along the way any cube that is placed on top of another cube, thus revealing any hidden cubes below. To achieve this, the agent divides the workspace into a regular grid which defines a graph whose nodes are the intersections of the grid. Neighboring nodes are one maximum allowed translation apart, such that the agent can move from one node to another in one step. Since the initial position of the agent is randomized, finding the shortest route through all nodes is equivalent to the travelling salesman problem (TSP), which is NP-hard. Therefore, we used a 2-opt algorithm [Croes58] to efficiently compute a route that is close to optimal. The robot then traverses the route until it detects the object or reaches the maximum number of steps.
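A minimal 2-opt sketch for such a route is given below. It assumes an open route whose first node, the one nearest to the start position, stays fixed; the 3×3 grid with 0.06 m spacing and the deliberately poor initial order are hypothetical.

```python
import math


def route_length(points, order):
    """Length of the open route visiting `points` in the given order."""
    return sum(math.dist(points[order[i]], points[order[i + 1]])
               for i in range(len(order) - 1))


def two_opt(points, order):
    """Shorten a route by reversing segments until no reversal helps.

    The first node is kept fixed so that the route still begins at the
    node closest to the robot's start position.
    """
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(order) - 1):
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                gain = route_length(points, order) - route_length(points, candidate)
                if gain > 1e-12:
                    order = candidate
                    improved = True
    return order


# Nodes of a toy 3x3 grid, one maximum translation (0.06 m) apart.
points = [(0.06 * i, 0.06 * j) for i in range(3) for j in range(3)]
initial = [0, 8, 1, 7, 2, 6, 3, 5, 4]
best = two_opt(points, initial)
print(route_length(points, initial), route_length(points, best))
```

2-opt is a local search, so the result is only guaranteed to be no longer than the initial route, not to be optimal. The sketch recomputes full route lengths for clarity; an efficient implementation would evaluate only the two edges that change.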

4.3 Simulation Results

The active task in simulation is evaluated on 2 scenes of different size: a normal and large (L) one. Both tasks are evaluated with 1000 episodes. The same number of episodes was used to evaluate the interactive task. The simulation parameters are shown in Table 2 and the results are shown in Figure 5.

4.3.1 Active Perception

The agent trained for active perception learned to quickly go up in order to get a good overview of the scene and then to roll to observe the remaining unobserved parts. This efficient strategy achieves a high success rate (97.3% in normal scenes and 96% in large ones). While our approach is overall on par with the greedy baseline in terms of success rates (Table 3), it solves the task significantly faster than the baseline, with an average time per episode of 8.45s for the Active 3D Exploration against 22.38s for the Greedy Next-Best-View (a reduction of 62%), and 17.58s for the Active 3D Exploration on the large scene against 126.07s for the Greedy Next-Best-View on the same scene (a reduction of 86%). This is due to the sampling nature of the Next-Best-View algorithm and to the fact that views are evaluated based on the number of newly observed voxels in the scene, which can be costly to obtain. It is important to note here that the Greedy Next-Best-View baseline has knowledge of the whole map and uses this knowledge to compute the number of unobserved voxels, while our approach uses only the local map to obtain the state. Nevertheless, even though the baseline is in a favourable position, we achieve comparable results for the success rate.
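The reported reductions follow directly from the average episode times, as this short check shows (the function name is illustrative):

```python
def reduction_percent(ours, baseline):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ours / baseline)


print(round(reduction_percent(8.45, 22.38)))    # normal scene -> 62
print(round(reduction_percent(17.58, 126.07)))  # large scene  -> 86
```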

Learning to terminate is challenging for the agent since it only has compressed information about a local map, and if this map is not big enough to cover the whole TSDF volume, the agent has no way of knowing whether it has explored the whole area. Furthermore, the discretized state can be either normalized or not. In the first case, we lose the absolute number of unobserved voxels in the scene and the number of point detections; however, the agent generalizes better to different scenes with more or fewer objects and is forced to explore even small unobserved areas. On the other hand, the normalized state makes it very difficult for the agent to determine whether it should terminate when there is no object in the scene. In the second case, we know the exact number of unobserved voxels in the local map; however, the state is very scene dependent. To force the agent to explore more, the normalized state was used in the experiments.

4.3.2 Interactive Perception

Results show that our learned approach for interactive perception is more efficient than the Grid Exhaustive Search baseline, both in number of steps and in time (39.60s for ours against 63.89s for the baseline), at the cost of a slightly lower success rate (Table 3). To achieve this outcome, our agent has learned to stay down and to push any clutter of objects that is sufficiently big to appear as unobserved in the TSDF.
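For reference, the relative time savings quoted in this section follow directly from the reported average episode times. The short Python check below recomputes them; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(ours_s, baseline_s):
    """Return the time saved relative to the baseline, in percent."""
    return 100.0 * (1.0 - ours_s / baseline_s)


# Average time per episode in seconds (ours, baseline), as reported in the text.
timings = {
    "active, normal scene": (8.45, 22.38),
    "active, large scene": (17.58, 126.07),
    "interactive": (39.60, 63.89),
}

for name, (ours, baseline) in timings.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1f}% less time")
```

This reproduces the 62% and 86% reductions for the active task and gives a reduction of about 38% for the interactive task.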

The exhaustive baseline can still fail the task when it accidentally pushes the target away without seeing it, so that it needs a second exploration of the whole scene to find the target. Moreover, since the target is absent from the scene in 5% of the scenarios, the baseline always fails in these scenarios, as it has no means of knowing that there is no target.

Our learned method has the advantage of being significantly faster, especially as the grid search does not scale well with the size of the workspace (its cost is quadratic in the side length of the scene).
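The quadratic scaling can be made concrete with a short Python sketch that counts the cells an exhaustive grid search has to visit on a square workspace. The grid resolution and workspace sizes below are hypothetical and are not the values used in the experiments.

```python
def grid_cells(side_cm, cell_cm):
    """Number of cells an exhaustive grid search visits on a square workspace."""
    per_side = -(-side_cm // cell_cm)  # ceiling division on integers
    return per_side ** 2


# Hypothetical workspace side lengths with an assumed 10 cm grid resolution.
for side_cm in (50, 100, 200):
    print(f"side {side_cm} cm -> {grid_cells(side_cm, 10)} cells")
```

Doubling the side length quadruples the number of cells, and with it the worst-case number of steps of the exhaustive search.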

Method                          Success rate (%)
Greedy Next-Best-View           98.4
Active 3D Exploration           97.3
Greedy Next-Best-View (L)       98.2
Active 3D Exploration (L)       96.0
Grid Exhaustive Search          94.8
Interactive 3D Exploration      93.2
Table 3: Simulation results: average success rate over 1000 episodes. (L) denotes the large scene.
Figure 5: Simulation results: average number of steps (top) and time in seconds (bottom) per episode for (a) the active methods and (b) the interactive methods.

4.4 Real-World Results

As the real experiment is time-consuming and resource-demanding, the evaluation on the real setup is performed over 20 runs for both the active and the interactive task. The results are shown in Table 4 and visualized in Figure 6. The scene parameters used are shown in Table 2.

4.4.1 Active Perception

The comparison between the learned Active 3D Exploration and the Greedy Next-Best-View baseline yields results comparable to the ones obtained in simulation, both in terms of number of steps per episode and success rate. This shows that the RL-based active task can be successfully transferred to reality with a high success rate.

4.4.2 Interactive Perception

The Interactive 3D Exploration method performs well in the real-world experiment, with a success rate of 100% (over 20 episodes), the same as the Grid Exhaustive Search baseline. Furthermore, the learned interactive method outperforms the baseline in terms of efficiency (number of steps per episode). These results confirm the outcome of the evaluation in simulation and show that a transfer to reality can be performed without any fine-tuning on real data.

Method                          Success rate (%)
Greedy Next-Best-View           90.0
Active 3D Exploration           90.0
Grid Exhaustive Search          100.0
Interactive 3D Exploration      100.0
Table 4: Real-world results: average success rate over 20 episodes.
Figure 6: Real-world results: average number of steps per episode for (a) the active methods and (b) the interactive methods.
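Because only 20 episodes were run per method, the success rates in Table 4 carry a noticeable statistical uncertainty. As a rough aid to interpretation, the Python sketch below computes a 95% Wilson score interval for a success rate. The choice of this interval and the function name `wilson_interval` are assumptions made here for illustration, and the success counts (18 and 20 out of 20) are derived from the reported percentages.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Success counts derived from Table 4: 90% and 100% of 20 episodes.
for label, successes in (("90.0%", 18), ("100.0%", 20)):
    low, high = wilson_interval(successes, 20)
    print(f"{label}: [{100 * low:.1f}%, {100 * high:.1f}%]")
```

Under these assumptions, 18 out of 20 successes corresponds to an interval of roughly 70% to 97%, and 20 out of 20 to roughly 84% to 100%, so the real-world success rates are consistent with the simulation results but do not resolve small differences between methods.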

4.5 Limitations

One limitation of our system is that it is aware of neither the size of the workspace nor its current pose within that workspace. This can cause the agent to get stuck repeatedly trying to go up without being able to execute the action.

In the real-world experiments, because of the small size of the workspace, the robot was not always able to plan a direct path to the target location. Additionally, the agent was trained with a detached gripper in simulation, which allows transitions to any pose; this is infeasible on the real robot.

The current state of the robot encodes information about the unobserved voxels and the detections of a single object. It would need to be extended if we were to allow multiple object hypotheses in our detector.

5 Conclusion

In this work, we have shown that an RL approach can learn a successful policy for an object finding task that requires both active and interactive perception. With the encoded volumetric representation of the environment and reward shaping, we are able to achieve a high success rate. By comparing our approach to two baselines, we have shown that it is more time efficient than the Greedy Next-Best-View, since it does not require expensive sampling in each iteration, and more step efficient than the Grid Exhaustive Search, as it does not perform a naive brute-force search but instead interprets the encoded state and acts upon it. Furthermore, our agent learned to decide when to terminate the sequence. Results obtained in the real robot experiments validate the findings provided by the simulations.

In future work, we plan to integrate a more sophisticated object detection pipeline such that we do not rely on the fact that the target object has a specific color. Furthermore, we plan to extend our volumetric representation to also encode dynamic objects and avoid the need to “forget” the volume where objects moved.


This work was supported by the Amazon Research Awards program, the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research (NCCR) on Digital Fabrication, the Luxembourg National Research Fund (FNR) 12571953, and ABB Corporate Research Center.