Bridging the gap between learning and heuristic-based pushing policies

Non-prehensile pushing actions have the potential to singulate a target object from its surrounding clutter in order to facilitate robotic grasping of the target. To address this problem we utilize a heuristic rule that moves the target object towards the workspace's empty space and demonstrate that this simple heuristic rule achieves singulation. Furthermore, we incorporate this heuristic rule into the reward in order to train reinforcement learning (RL) agents for singulation more efficiently. Simulation experiments demonstrate that this insight increases performance. Finally, our results show that the RL-based policy implicitly learns behaviour similar to one of the used heuristics in terms of decision making.




I Introduction

Despite the research effort of the last decades, extracting target objects from unstructured, cluttered environments via robotic grasping has yet to be achieved. To this day, robotic solutions are mostly limited to highly structured and predictable setups within industrial environments. In these environments the robots perform successfully because the tasks require characteristics such as precision, repeatability and high payload. However, tasks such as tight packing [wang21], throwing objects onto target packages [zeng20] or object grasping and manipulation in cluttered environments [zeng19] require a very different skill set and are still performed mostly by humans. Even though humans lack precision and strength, they are equipped with dexterity and manipulation skills that today's robots are missing, endowing humans with the ability to deal effectively with extremely diverse environments. Such a fundamental manipulation skill is the robust grasping of target objects [bohg14, sarantopoulos18a, kiatos20]. However, when dealing with clutter, grasping affordances are not always available unless the robot intentionally rearranges the objects in the scene. As a result, non-prehensile manipulation primitives are used for object rearrangement [stuber20] and a body of research has emerged on the total or partial singulation of objects using pushing actions, in order to create the space around the target object required by the robotic fingers to grasp it. Total singulation [sarantopoulos2021total] refers to the creation of free space around the whole object, while partial singulation refers to the creation of free space along some of the object's sides.

Both hand-crafted and data-driven methods have been developed for object singulation. Hand-crafted methods include analytic and heuristic methods: analytic methods use exact models to make predictions, assuming accurate knowledge of the environment, while heuristic methods follow simple rules that can lead to task completion without assuming such knowledge. Analytic and heuristic-based methods exploit insights about the task, but lack the generalization ability of data-driven methods. The success of deep learning in fields like computer vision and language processing motivated the robotics community to solve robotic problems using deep learning [sunderhauf18, kroemer21] or reinforcement learning [kober13, polydoros17, ibartz21, levine17]. However, deep learning approaches come with the cost of large data requirements, a problem that is exacerbated in robotics due to huge state/action spaces as well as the time-consuming, costly and risky training on real robots. This has led to training agents in simulated environments and transferring them to the real robotic system either via robust feature selection [sarantopoulos2021total] or via sim-to-real methods [rusu17]. Bridging the gap between hand-crafted and data-driven approaches can lead to more robust hybrid algorithms that combine the advantages of both, and to a better understanding of what agents learn from data.

In this paper, our objective is to study the learning-based policy proposed in our previous work [sarantopoulos2021total] with respect to pushing policies that follow simple but effective heuristic rules, answering three questions: What is the difference in performance between learning-based and heuristic policies in the singulation task? Does the RL-based policy implicitly learn something similar to the heuristics? Can the heuristics be used during training to increase the performance of the RL-based policy in the singulation task? We aim to answer these questions by:

  • demonstrating that heuristic policies that move the target object towards the empty-from-obstacles space accomplish total singulation,

  • demonstrating that guiding the training with these heuristic rules in the reward increases the performance of the RL-based policy and

  • performing experiments to study the similarity of the initial RL-based policy with the heuristic-based policies, in terms of decision making.

In the following section the related work is presented. Section III describes the environment and the problem formulation. Section IV presents the heuristic and the learning-based policies used in this work, as well as training details for the learning-based policies. Section V provides the results of our experiments and finally Section VI draws the conclusions.

II Related Work

II-A Analytic and heuristic-based approaches

Lynch et al. [lynch96] pioneered research on analytic models of push mechanics. Following works focused on planning methods to reduce grasp uncertainty in cluttered scenes by utilizing push-grasping actions [dogar11, dogar12]. Cosgun et al. [cosgun11] proposed a method for selecting pushing actions in order to move a target object to a planned position as well as clearing the obstructed space from other objects, assuming however knowledge of the exact object poses. In contrast to these analytic methods, where exact knowledge of the objects and the environment is assumed, other works employ heuristics for manipulating table-top objects. Hermans et al. [hermans12] proposed a method that separates table-top objects, exploiting the visible boundaries that the edges of the objects create on the RGB image. Similarly, Katz et al. [katz13] segmented the scene for poking the objects in the scene before grasping them. Chang et al. [chang12] proposed a heuristic singulation method which pushes clusters of objects away from other clusters. Their method checked if a push separated the target cluster into multiple units and iteratively kept pushing until the target cluster remained a single unit after the push. The downside is that the clusters must be tracked in order to produce future decisions. Finally, Danielczuk et al. [danielczuk18] proposed two heuristic push policies for moving objects towards free space or for diffusing clusters of objects, using Euclidean clustering to segment the scene into potential objects. We adapt the free-space heuristic to our problem and demonstrate that it achieves total singulation.

II-B Learning-based approaches

Recently, many researchers have employed learning algorithms for object manipulation. Boularias et al. [boularias2015learning] explored the use of reinforcement learning for training policies to choose among grasp and push primitives. They learnt to singulate two specific objects given a depth image of the scene, but the agent required retraining for a new set of objects. Zeng et al. [zeng18a] used Q-learning to train two fully convolutional networks end-to-end for learning synergies between pushing and grasping, leading to higher grasp success rates. Yang et al. [yang20] adapted [zeng18a] to grasp a target object from clutter by partially singulating it via pushing actions. Kurenkov et al. [kurenkov20] and Novkovic et al. [novkovic20] used deep reinforcement learning to learn push actions in order to unveil a fully occluded target object. In contrast to singulation, achieving full visibility of the target does not always create free space between the target and the surrounding obstacles, which is necessary for prehensile grasping. In contrast to model-free reinforcement learning approaches, Huang et al. [9591286] proposed a model-based solution for target object retrieval. Specifically, they employ visual foresight trees to intelligently rearrange the surrounding clutter with pushing actions so that the target object can be grasped easily. All the above approaches focus on the partial singulation of objects in order to grasp them.

On the other hand, many works develop methods that totally singulate the objects. Eitel et al. [eitel2020learning] trained a convolutional neural network on segmented images in order to singulate every object in the scene. However, they evaluated a large number of potential pushes to find the optimal action. Kiatos et al. [kiatos19] trained a deep Q-network to select push actions in order to singulate a target object from its surrounding clutter with the minimum number of pushes, using depth features to approximate the topography of the scene. Sarantopoulos and Kiatos [sarantopoulos20] extended the work of [kiatos19] by modelling the Q-function with two different networks, one for each primitive action, leading to higher success rates and faster network convergence. Finally, Sarantopoulos et al. [sarantopoulos2021total] proposed a modular reinforcement learning approach which uses continuous actions to totally singulate the target object. They used a high-level policy which selects between pushing primitives that are learnt separately, and incorporated prior knowledge through action primitives and feature selection to increase sample efficiency. Their experiments demonstrated increased success rates over previous works. We build upon this work to demonstrate that incorporating insights from heuristics into reinforcement learning algorithms via reward shaping can increase the performance further, and we study the similarity of the decisions made by this learning-based policy with respect to the heuristic-based policies.

III Problem Formulation

The objective is to properly rearrange the clutter in order to totally singulate the target object. Total singulation means that the distance between the target object and its closest obstacle is larger than a predefined constant distance. Given an RGB-D image of the scene, the agent selects the pushing action that maximizes the probability of singulating the target object.
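The singulation criterion above can be sketched as a simple distance test. This is a minimal illustration, assuming planar object centers and a hypothetical 3 cm clearance threshold (the paper's exact threshold and boundary-distance computation are not reproduced here):

```python
import numpy as np

def is_singulated(target_pos, obstacle_positions, d_threshold=0.03):
    """Return True if the target's closest obstacle is farther than d_threshold.

    target_pos: (2,) planar position of the target; obstacle_positions:
    (N, 2) positions of the obstacles. For simplicity, distances are
    measured between object centers, not object boundaries.
    """
    target_pos = np.asarray(target_pos, dtype=float)
    dists = np.linalg.norm(np.asarray(obstacle_positions) - target_pos, axis=1)
    return bool(dists.min() > d_threshold)
```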

III-A Environment

Fig. 1: Illustration of the environment. The goal is to singulate the target object (red contour) from the surrounding obstacles. Note that we place the world frame to the center of the table.
Fig. 2: Illustration of the visual state generation, representing a segmentation of the scene. The heightmap of the scene and the mask of the target object are fused into the visual representation. An autoencoder is used for dimensionality reduction of the visual representation to a latent encoding vector; the decoder produces the reconstructed visual representation.

The environment consists of a robotic finger, rigid objects in clutter resting on a rectangular workspace with predefined dimensions and a single overhead RGB-D camera. The pose, dimensions and number of obstacles are random. Finally, the world frame is placed at the center of the workspace with its z-axis opposite to the direction of gravity, as shown in Fig. 1.

III-B Visual Representation

Similar to [sarantopoulos2021total], we use a visual representation which is extracted from the camera observation and is generated by the fusion of the 2D heightmap and the mask (Fig. 2). Specifically, the heightmap shows the height of each pixel and is generated from the depth image by subtracting each pixel from the maximum depth value. The mask shows which pixels belong to the target object and is extracted by detecting the visible surfaces of the target object. Note that we assume that the target is at least partially visible and can be visually distinguished from the obstacles. We also crop the heightmap and the mask to keep only information within the workspace. Essentially, the visual state representation is an image that provides a segmentation of the scene between the target, the obstacles and the support surface. The pixels that belong to the target have the value 0.5, those belonging to an obstacle the value 1 and the rest, belonging to the support surface, the value 0. Finally, we resize the visual representation to 128 × 128. An illustration of the visual state representation is shown in Fig. 2.
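The fusion above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's implementation: the obstacle pixels are detected here by a hypothetical height threshold `eps`, the workspace crop is omitted, and a nearest-neighbour resize stands in for whatever interpolation the authors used:

```python
import numpy as np

def visual_state(depth, target_mask, size=128, eps=0.005):
    """Fuse a heightmap and a target mask into the segmentation-style state.

    depth: HxW depth image (meters); target_mask: HxW boolean mask of the
    target. Support surface pixels -> 0.0, target -> 0.5, obstacles -> 1.0.
    """
    heightmap = depth.max() - depth        # height of each pixel above the table
    state = np.zeros_like(heightmap)
    state[heightmap > eps] = 1.0           # anything above the table is an obstacle...
    state[target_mask] = 0.5               # ...unless it belongs to the target
    # nearest-neighbour resize to size x size
    rows = (np.arange(size) * state.shape[0] / size).astype(int)
    cols = (np.arange(size) * state.shape[1] / size).astype(int)
    return state[np.ix_(rows, cols)]
```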

III-C Actions

A push action is defined by a line segment from an initial point p1 to a final point p2, which is parallel to the workspace and placed at a constant height above it. In contrast to previous work [sarantopoulos2021total], we use only the push-target primitive, which always pushes the target object and is defined by a vector (d, θ), where d is the pushing distance and θ is the angle defining the direction of the push. The primitive selects the pushing height at half the distance between the table and the top of the target. This restricts the available pushing positions, since the surrounding obstacles may be taller than this height. To ensure that the finger avoids undesired collisions with obstacles while reaching the initial position, we define the initial position p1 to be the position closest to the target which is not occupied by obstacles along the direction θ. Note, however, that interaction with the obstacles may occur during the execution of the pushing action, but this is not considered undesired. To achieve this obstacle avoidance, we use the visual representation and find the first patch of pixels along the direction defined by θ that contains only pixels from the table. Given the vector (d, θ), the final point is computed as p2 = p1 + d [cos θ, sin θ]ᵀ.
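The back-off search for a free start point can be sketched as follows. This is a simplified stand-in for the patch-based check described above: `occupied` is a hypothetical collision query (in the paper it would be answered from the visual representation), and the step and back-off limits are assumed values:

```python
import numpy as np

def push_endpoints(target_center, theta, push_distance, occupied,
                   step=0.005, max_back=0.1):
    """Back off from the target along -[cos(theta), sin(theta)] until a free
    start point p1 is found, then push to p2 = p1 + d * [cos(theta), sin(theta)].

    occupied(p) -> bool is a collision query against the scene.
    """
    target_center = np.asarray(target_center, dtype=float)
    direction = np.array([np.cos(theta), np.sin(theta)])
    back, p1 = 0.0, target_center.copy()
    while occupied(p1) and back < max_back:
        back += step
        p1 = target_center - back * direction
    p2 = p1 + push_distance * direction
    return p1, p2
```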

IV Pushing Policies

In this section we present the different pushing policies used in this work. We use two types of policies: heuristic policies that make greedy decisions based on a heuristic rule, and learned policies which are optimized using deep reinforcement learning.

IV-A Heuristic-based pushing policies

Fig. 3: The different stages of extracting the empty space map ESM. Purple denotes low distance values and yellow high distance values. (a) The initial mask of the obstacles. (b) The points belonging to the contour of the obstacles' mask. (c) The obstacle distance map ODM showing the distances from the obstacles. (d) The limits distance map LDM showing the distances from the workspace limits. (e) The final empty space map ESM, which combines ODM and LDM.

The heuristic policy used in this work selects pushing actions that move the target object towards the empty space of the scene, inspired by [danielczuk18]. The assumption is that moving the target object towards the empty space will create empty space around it, thus accomplishing total singulation.

In order to select the best action, we first need to calculate the empty space map (ESM) shown in Fig. 3(e). To produce this map, we first create a mask of the obstacles of the scene (Fig. 3(a)) by removing the target object from the visual state representation shown in Fig. 2, since we are only interested in the distances from the obstacles. Then, we calculate the points that belong to the contour C of this mask (Fig. 3(b)). Subsequently, we create an obstacle distance map (ODM) showing the distance of each workspace point from the obstacles (Fig. 3(c)). Hence, the value of each pixel p of ODM is its minimum distance from the points of the contour C, i.e.:

    ODM(p) = min_{c ∈ C} ‖p − c‖    (1)
Finally, we take into account the limits of the workspace. For this, we create a limits distance map (LDM) by calculating for each pixel its minimum distance from the limits of the workspace, as shown in Fig. 3(d). Then, the final map is given by keeping the minimum value between the obstacle distance map and the limits distance map:

    ESM(p) = min(ODM(p), LDM(p))    (2)

as shown in Fig. 3(e). As a final step, we normalize the values of the map to [0, 1], given its minimum and maximum value.
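The ESM construction above maps directly onto SciPy's Euclidean distance transform. A minimal sketch, working in pixel units (a real implementation would convert to metric units via the camera calibration):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def empty_space_map(obstacle_mask):
    """Compute the normalized empty space map (ESM) from an obstacle mask.

    obstacle_mask: HxW boolean, True on obstacle pixels (target removed).
    """
    h, w = obstacle_mask.shape
    # ODM: distance of every free pixel to the nearest obstacle pixel
    odm = distance_transform_edt(~obstacle_mask)
    # LDM: distance of every pixel to the nearest workspace limit
    ii, jj = np.mgrid[0:h, 0:w]
    ldm = np.minimum.reduce([ii, jj, h - 1 - ii, w - 1 - jj])
    esm = np.minimum(odm, ldm)
    # normalize to [0, 1]
    return (esm - esm.min()) / (esm.max() - esm.min() + 1e-12)
```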

Given the empty space map (ESM), we can select the direction θ and the pushing distance d for pushing the target object towards the empty space. To determine the direction, we find all the pixels of the ESM whose value is above 0.9, i.e. the top 10%. From this set, we select as the optimal pixel p* the pixel that has the minimum distance from the centroid of the target object, calculated using the target's mask. Note that considering multiple best values instead of just the maximum can produce better decisions. For example, if one pixel far from the target object has the value 0.97 and another pixel much closer to the target has the value 0.95, it would not be efficient to greedily select the maximum value, because the agent could oscillate between pushes that unnecessarily attempt to reach remote areas. The direction θ is then given by the line segment between the target's centroid and p*, and the pushing distance d is the length of this line segment. Finally, we enforce an upper limit on d to avoid long pushes that can result in unpredictable behaviour of the target object. Fig. 4(b) shows the push selected by this policy, called the Empty Space policy (ES), for this scene.
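The ES selection rule can be sketched as below. The pixel-to-meter scale and the push-distance cap `d_max` are assumed values, not the paper's constants:

```python
import numpy as np

def es_push(esm, target_mask, d_max=0.1, meters_per_pixel=0.004, top=0.9):
    """Select the ES push direction and distance from the empty space map.

    Among pixels with ESM value above `top`, pick the one closest to the
    target centroid; the push points from the centroid towards that pixel.
    """
    centroid = np.array(np.nonzero(target_mask)).mean(axis=1)   # (row, col)
    candidates = np.array(np.nonzero(esm > top)).T              # top ~10% pixels
    best = candidates[np.argmin(np.linalg.norm(candidates - centroid, axis=1))]
    delta = best - centroid
    theta = np.arctan2(delta[1], delta[0])                      # direction of push
    d = min(np.linalg.norm(delta) * meters_per_pixel, d_max)    # capped distance
    return theta, d
```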

Even though the ES policy selects the optimal pixel among multiple candidates, it can still produce pushes based on regions that are far from the target. For this reason, we also propose a variant called the Local Empty Space policy (LES), which selects the best action taking into account only a local neighborhood of the target object. To this end, we crop the ESM around the target object, producing a local empty space map (LESM), shown in Fig. 4(c) along with the action selected by the LES policy. Notice that for the same scene the two policies, ES and LES, produce very different actions.
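The LESM crop can be sketched as follows; the neighbourhood radius `half_size` (in pixels) is an assumed value, and the crop is clamped to the image borders:

```python
import numpy as np

def local_empty_space_map(esm, target_mask, half_size=20):
    """Crop the ESM around the target centroid to obtain the LESM."""
    r, c = np.array(np.nonzero(target_mask)).mean(axis=1).astype(int)
    r0, r1 = max(r - half_size, 0), min(r + half_size, esm.shape[0])
    c0, c1 = max(c - half_size, 0), min(c + half_size, esm.shape[1])
    return esm[r0:r1, c0:c1]
```

The LES policy then applies the same top-value selection rule as ES, but on this local map only.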

Fig. 4: Comparison of decision making between the heuristic policies. (a) The visual state representation of the scene. (b) The ES policy. (c) The LES policy.

IV-B Learning-based pushing policies

To train the reinforcement learning policies we formulate the problem as a Markov Decision Process (MDP) with a finite time horizon. An MDP is a tuple (S, A, r, p, γ) of continuous states S, continuous actions A, rewards r, an unknown transition function p and a discount factor γ.

States: We use both the visual state representation presented in Section III-B and the full state representation taken from simulation. The full state representation contains the exact poses of the objects as well as their bounding boxes. To reduce the dimensionality of the visual state representation, we train an autoencoder on a large dataset of observations and use the latent vector as the input to the agents (Fig. 2). The architecture of the autoencoder is similar to the one proposed in [sarantopoulos2021total], but trained with cross-entropy loss instead of regression. Note that in this work we encode the whole view of the scene instead of a downscaled crop centered around the target.

Actions: All the agents use the pushing actions defined in Section III-C.

Rewards: First we define a sparse reward that motivates the RL agent to discover a policy purely for the singulation task. Then, to investigate how insights from the heuristics described in Section IV-A improve the performance of reinforcement learning policies, we present two additional reward schemes, which use the ESM and LESM maps for reward shaping. Specifically, we measure the error between the angle θ that the agent produced and the angle θ_h that the heuristic policy would have taken if used:

    e(θ, θ_h) = |θ − θ_h| / π    (3)

with the angle difference wrapped to [−π, π]. This error is zero if the directions are the same (θ = θ_h) and one if the directions are exactly opposite (|θ − θ_h| = π). The three rewards are the following:

  1. For the sparse reward, we assign a negative reward if the target falls out of the workspace and a positive reward if the target is totally singulated. In any other case we assign to each push a small negative reward, which depends on the maximum number of timesteps, to motivate the agent to minimize the number of actions.

  2. To guide the RL policy towards the global empty space, we use the ESM to compute the error. Specifically, we assign a penalty proportional to the error for each push, a negative reward if the target falls out of the workspace and a positive reward if the target is totally singulated.

  3. Finally, we use the same reward as above, but utilize the LESM map to compute the error.
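The three reward schemes above can be sketched in one function. The numeric values (+1 / −1, the per-push penalty and the shaping weight `w`) are assumptions for illustration, not the paper's constants:

```python
def singulation_reward(outcome, error=None, max_timesteps=10, w=1.0):
    """Sketch of the three reward schemes of Sec. IV-B.

    outcome: 'singulated', 'out_of_workspace' or 'pushed'. Pass error=None
    for the sparse reward (scheme 1), or the ESM/LESM direction error e for
    the shaped rewards (schemes 2 and 3).
    """
    if outcome == 'out_of_workspace':
        return -1.0                        # target pushed off the workspace
    if outcome == 'singulated':
        return 1.0                         # total singulation achieved
    if error is None:
        return -1.0 / max_timesteps        # scheme 1: small per-push penalty
    return -w * error                      # schemes 2/3: penalize disagreeing
                                           # with the heuristic's direction
```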

Terminal states: The episode is terminated if the target is totally singulated, falls out of the workspace or the maximum number of timesteps has been reached.

We exploit the fact that we train in simulation by employing an asymmetric actor-critic algorithm, proposed in [pinto18], in which the critic is trained with the full state representation while the actor receives the visual state representation as input. In particular, we use Deep Deterministic Policy Gradient (DDPG) [lillicrap16] as the actor-critic algorithm. In contrast to [sarantopoulos2021total], we train the agents in an off-policy way. Specifically, we fill a replay buffer with 100K transitions of random pushing actions. During training we do not update the replay buffer with new transitions, and thus we can pass the entire dataset multiple times (epochs) through the networks.

The actor and the critic are both 3-layer fully-connected networks with 512 units per layer. The target networks are updated with Polyak averaging of 0.999. We train the networks for 50 epochs with batch size 32, using the Adam optimizer for both networks.
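The network shapes and the Polyak target update can be sketched as follows. This is a numpy stand-in for a deep learning framework, with assumed initialization; it illustrates the architecture and the soft target update, not the full DDPG training loop:

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=512, layers=3, seed=0):
    """Weights for a fully-connected net with 3 hidden layers of 512 units."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass with ReLU activations on the hidden layers."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def polyak_update(target, source, tau=0.999):
    """Soft target update: target <- tau * target + (1 - tau) * source."""
    return [(tau * Wt + (1 - tau) * Ws, tau * bt + (1 - tau) * bs)
            for (Wt, bt), (Ws, bs) in zip(target, source)]
```

In the asymmetric setup, the actor's input would be the autoencoder latent vector while the critic's input would be the full simulation state.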

V Results

The objectives of the experiments are twofold: (a) to evaluate the performance of each policy in the singulation task, i.e. the success rate and the mean number of actions before singulation, and (b) to examine whether there is any similarity in the decision making between the policy produced by reinforcement learning and the heuristics. The evaluated policies are the following:

  • RL: policy trained with the reward (1) of Section IV-B.

  • ES: a heuristic policy that chooses pushing actions according to the empty space map of Section IV-A.

  • LES: a heuristic policy which selects the optimal action based on the local empty space map of Section IV-A.

  • RL-ES: policy trained with the reward (2) of Section IV-B, that guides the agent using the empty space map.

  • RL-LES: policy trained using the reward (3) of Section IV-B that guides the agent using the local empty space map.

We use the Bullet physics engine [coumans2016pybullet] to advance the simulation. The objects are approximated as rectangular boxes of random dimensions, and the number of obstacles is between 8 and 13. Furthermore, we use a rectangular robotic finger and a rectangular workspace of fixed dimensions, and keep all the dynamic parameters of the simulation fixed. All policies are evaluated on the same 200 scenes in simulation.

V-A Performance of the policies

Table I and Fig. 5 show quantitative performance results for each policy. As shown in Table I, the heuristic policies outperform the policies optimized with reinforcement learning, while producing similar results to each other. This can be justified by the fact that both heuristic policies follow the same rule but use a different representation to select the best action. However, the ES policy needs more actions to singulate the target object, because the next optimal pixel may be selected far from the previous one, producing actions that oscillate the target object between optimal pixels. This is eliminated with the local ESM map, which does not take the whole scene into consideration when selecting the optimal action, thus producing a policy with a smaller mean number of actions and a smaller standard deviation than ES, as shown in Table I.


As we see in Fig. 5(a), the RL-ES and RL-LES policies outperform the RL policy. The improved success rates demonstrate the importance of incorporating insights from heuristics to guide reinforcement learning agents through reward shaping, leading to faster convergence and improved success rates. Finally, the RL-ES policy outperforms RL-LES because it is optimized with the ESM, which takes the whole scene into consideration, thus producing actions that lead to an increased success rate.

Policy   Type       Success rate %   Mean #actions   Std #actions
RL       Learned    89.1             2.67            1.31
ES       Heuristic  97.9             2.57            1.16
LES      Heuristic  98.8             2.28            0.74
RL-ES    Hybrid     94.8             4.08            2.73
RL-LES   Hybrid     92.3             3.67            2.39

ES = Empty Space, LES = Local Empty Space

TABLE I: Results
Fig. 5: Qualitative results for the performance of the different policies. (a) Success rates; (b) Mean number of actions

V-B Similarity of decision making between learning and heuristics

To evaluate the similarity of the decision making between heuristic and learning-based policies, we compare the pushing directions decided by the policies RL, RL-ES and RL-LES with the pushing directions that would have been decided by the heuristic policies ES and LES in the same scenes. To this end, we run 200 episodes for each of the policies RL, RL-ES and RL-LES. For each scene we compute the angle of the pushing direction produced by the policy and executed in the environment, and the angle of the pushing direction that the heuristic ES or LES would have produced if used. Then, we calculate the error between these angles from Eq. (3).
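The per-scene error of Eq. (3) can be computed with a quadrant-aware angle wrap; this sketch assumes the linear normalization stated in Sec. IV-B (0 for identical directions, 1 for opposite ones):

```python
import numpy as np

def direction_error(theta, theta_h):
    """Normalized angular error between a policy's push direction theta and
    the heuristic's direction theta_h: 0 if identical, 1 if opposite."""
    # wrap the difference to (-pi, pi] before taking the magnitude
    diff = np.arctan2(np.sin(theta - theta_h), np.cos(theta - theta_h))
    return abs(diff) / np.pi
```

Collecting this error over the 200 evaluation episodes yields the histograms of Fig. 6.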

Fig. 6: The distribution of the errors between the directions chosen by the policies. (a) The errors of the RL (blue) and RL-ES (orange) policies with respect to the heuristic ES; (b) The errors of the RL (blue) and RL-LES (orange) policies with respect to the heuristic LES.

Fig. 6(a) shows the distribution of the errors over the 200 episodes between the RL and ES policies (blue bars), along with the distribution of the errors between the RL-ES and ES policies (orange bars). Notice that the RL-ES distribution is shifted towards smaller errors compared to the RL distribution (see the difference in the 0-0.2 error range), indicating that RL and RL-ES decide on different actions for the same scene and thus that RL learns something different. On the other hand, Fig. 6(b) shows the distribution of the errors between the RL and LES policies (blue bars), along with the distribution of the errors between the RL-LES and LES policies (orange bars). In this figure we observe that the error distributions of RL and RL-LES match, indicating similar decision making. From the above, we can conclude that the RL policy implicitly learns something more similar to the local heuristic than to the global heuristic. Fig. 7 demonstrates a sample of actions taken by the different policies. Notice that in this sample the RL policy decides on an action more similar to those of the local policies LES and RL-LES than to those of the global ES and RL-ES.

This conclusion can also explain the results of Fig. 5(a) regarding the performance of the policies. If the RL agent implicitly learns something similar to the LES policy, then we expect that guiding with directions from LES should have a smaller impact on the performance than guiding with ES. As we see in Fig. 5(a), guiding the RL agent to learn a policy closer to ES increased the success rate from 89.1% (RL) to 94.8% (RL-ES), whereas guiding the agent to learn a policy closer to LES yielded a smaller improvement (92.3% for RL-LES).

Finally, the conclusion that RL learns something closer to the local heuristic LES is to be expected from another point of view. The discount factor limits how far into the future the agent optimizes the Q-values and, given the limited pushing distance in this task, the agent learns to act better by taking into account observations from the local neighborhood of the target object, rather than observations far from the target.

Fig. 7: A sample of actions taken by the policies in the same scene. (a) The initial state of the scene. (b) The actions taken by each policy. The start and the end of the arrow is placed on the initial and the final position of the push. (c) The next states for each action.

VI Conclusions

In this paper, we employed simple heuristic rules to achieve total singulation by pushing the target object towards the free space of the workspace. Experiments demonstrated that incorporating these heuristics into the reward of RL agents increases singulation success. Finally, we showed that the pure RL-based policy implicitly learns something more similar to the local heuristic.