I Introduction
Robust grasping of objects is an important capability in many robotic applications. As robots move from caged and structured industrial settings to unstructured civil environments, assumptions about the knowledge of object models and environment maps no longer hold, and learning-based methods start to show an advantage in scaling towards generalizable robotic grasping of unseen objects in unstructured environments.
Most previous methods that use deep learning for grasping formulate the problem as training a regression neural network to predict the success probability of grasp poses given RGB or depth observations [13, 15, 17]. Such methods require an additional component for generating grasp pose proposals, which are subsequently evaluated and ranked. This process can be slow and inefficient, especially in high-dimensional action spaces such as full 6-DOF grasping or grasping while moving the base of the robot. In this work, we propose to directly learn the distribution of good grasp poses from self-supervised grasping trials. With recent advances in density estimation [3, 4, 12], neural network models such as Real NVP [3] are able to approximate arbitrary distributions. Furthermore, these models can both efficiently generate samples from the distribution and compute the probability density of given samples. We call our models actor models. With a trained actor model we are able to speed up inference by eliminating the generation-evaluation-ranking process. In addition, exploration for continuous reinforcement learning becomes more natural and adaptive compared to additive noise.
II Related Work
Many prior works on end-to-end grasp prediction from visual observations formulate the problem as training a value function (critic) which estimates the probability of success given a hypothetical grasp pose [13, 15, 17]. A separate component is required to generate candidate grasp poses for the critic to evaluate and rank. In [15, 9], the cross-entropy method (CEM) is used to iteratively find good actions using the value network. When the action space is high-dimensional and promising actions only occupy a small fraction of the space, such a method would require a large number of samples or prior heuristics in order to obtain good grasp poses.
Another approach is to directly learn a policy (actor) which predicts the optimal action given the current state. Most previous works represent this policy as either a deterministic function [14, 16] or a diagonal Gaussian conditioned on the observations [7, 18]. Such models struggle to represent multimodal action distributions, which are common in a grasping context, especially due to cluttered scenes or symmetry in objects. Some works [21, 20] partially address the multimodality problem by regressing to a grasp pose per image patch or per pixel. These approaches make assumptions on the correspondence between grasp positions in robot coordinates and in camera coordinates. This assumption makes these approaches hard to generalize to tasks beyond grasping, e.g. in-hand manipulation, where actions no longer correspond to image regions. In this work, we address the multimodality problem by predicting a very expressive probability distribution in action space, conditioned on the whole input image, eliminating the above assumption while increasing the expressiveness of the model. Gaussian Mixture Models (GMM) may be employed to model more complex action distributions [1]; however, they are still limited in their representational power, and are not very friendly to stochastic gradient descent-based optimizers.
Generative Adversarial Networks (GAN) have received a lot of attention in modeling probability distributions for images [5, 10, 2] and have also been used successfully in imitation learning [8] settings. However, because these generators cannot compute the probability density of generated samples, a discriminator is required to provide a training signal, and balancing the interplay between these two components is difficult. Similar works in energy-based models [6, 11] also learn generators for which the probability density of generated samples cannot be computed.

Recent work in normalizing flows [3, 12, 22] makes it possible to train a neural network that can both produce samples and calculate the probability density at given points, which enables density estimation by directly maximizing the log-likelihood of observed data. We employ normalizing flows in our actor models and train them by maximizing the log-likelihood of successful grasp poses. When running on the robot, the probability density of actions also serves as a confidence score for the actions.
Our method is also related to imitation learning. In [8], a GAN is used to match the generator model to the distribution of demonstration actions. Instead of learning from expert demonstrations, our model learns from positive examples accumulated from random trials, and continues to improve itself by executing samples from the learned distribution and receiving binary rewards in a self-supervised manner.
III Motivation
As a motivating example, we study a toy problem that captures some essential characteristics of grasping in clutter. In an action space of a given dimension, we randomly sample a number of target points and draw hyperspheres of a fixed radius around them as regions of successful actions. We assume the value function can be learned perfectly and use an oracle value function that depends only on the distance to the nearest target point. To evaluate the performance of CEM, we record the number of iterations required to obtain at least one successful action, i.e. one whose distance to the nearest target point is smaller than the radius. At each iteration, our implementation of CEM generates a batch of normally distributed samples and uses the top elite samples to estimate the mean and variance for the next iteration.

We repeat this experiment multiple times for different settings of the hyperparameters. The resulting distributions are plotted in Fig. 2. The average number of iterations required to find a successful action grows exponentially with the dimension of the action space, and also grows quickly when raising the required precision of the task. In some experiments, CEM failed to find even one successful action within the maximum number of iterations.

These experiments illustrate the drawbacks of using CEM to select optimal actions at inference time. For higher-dimensional action spaces and high-precision tasks, iteratively optimizing on a multimodal value function is inefficient. This is especially undesirable when action selection needs to happen inside a high-frequency control loop, such as in robotics.
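The toy experiment can be sketched in a few lines of Python. This is only an illustration: the population size, number of elites, iteration cap, action range and the choice of the negative distance as oracle value are assumptions made for the sketch, not the settings used in the paper.

```python
import math
import random

def nearest_distance(action, targets):
    """Distance from an action to the nearest target point."""
    return min(math.dist(action, t) for t in targets)

def cem_iterations(targets, radius, dim, rng, pop=64, n_elite=6, max_iters=100):
    """Run CEM with an oracle value (negative distance to the nearest target).

    Returns the 1-based iteration at which a sample first lands inside a
    target hypersphere, or None if max_iters is exhausted.
    All default settings are assumed values for illustration.
    """
    mean = [0.0] * dim
    std = [1.0] * dim
    for it in range(1, max_iters + 1):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Highest oracle value first, i.e. smallest distance first.
        samples.sort(key=lambda a: nearest_distance(a, targets))
        if nearest_distance(samples[0], targets) < radius:
            return it
        elite = samples[:n_elite]
        # Refit a diagonal Gaussian to the elite samples.
        for j in range(dim):
            col = [e[j] for e in elite]
            mean[j] = sum(col) / n_elite
            var = sum((c - mean[j]) ** 2 for c in col) / n_elite
            std[j] = math.sqrt(var) + 1e-6
    return None

rng = random.Random(0)
dim, radius = 4, 0.1
targets = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
result = cem_iterations(targets, radius, dim, rng)
print("iterations until first success:", result)
```

Repeating the call over many seeds and several values of `dim` and `radius` gives iteration-count distributions of the kind discussed above.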
On the other hand, we can train an actor model to predict the distribution of successful actions given the input state. With enough training, the actor can always predict a successful action with a single iteration, by generating samples from the predicted distribution and selecting the one with the highest probability density. As the dimension of the action space increases, the state space also expands exponentially, thus requiring more training time; however, the inference time remains constant.
IV Method
With some mild assumptions, e.g. that the environment has no obstacles to avoid and that objects are not so densely packed that pre-grasp manipulation is necessary, learning to grasp can be seen as predicting the grasp poses that lead to high success probability. We approach this problem by training a neural density model that approximates the ground truth conditional distribution of successful grasp poses.
IV-A Neural density models
There are several types of neural network models that are very powerful in representing probability distributions. In this work we study the Gaussian mixture model [1] and Real NVP model [3], and a combination of both, which we call the mixture of flows (MoF) model. We briefly describe the three models below.
Gaussian mixture model
A neural network is trained to predict the centers, variances, and weights of a set of multivariate diagonal Gaussians, where the number of Gaussians is determined according to the task. For predicting the weights, a softmax layer is used as the last layer to satisfy the constraint that the weights sum to one. GMM is the simplest probabilistic model to approximate a multimodal distribution. However, GMM is not friendly to stochastic gradient optimization. With maximum log-likelihood as the objective, there are saddle points in the optimization landscape that are hard to escape from, even for momentum or Adam optimizers. As a simple example, assume the ground truth distribution is uniform on a bounded region, and we are approximating this distribution with a GMM. It is easy to verify that there is a saddle point in the space of the centers, standard deviations and weights of the GMM. In experiments we observed that the optimization got stuck on such saddle points very often, unless the parameters of the GMM were carefully initialized, which requires prior knowledge of the task at hand.

Real-valued non-volume preserving (Real NVP) transformations
Real NVP transformations are bijective mappings between the latent space and the prediction space. If the probability distribution in the latent space is known, then the distribution in the prediction space can be calculated as

$$p_Y(y) = p_Z(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1},$$

where $y = f(z)$ is a point in the prediction space and $z$ is the corresponding point in the latent space, calculated from $y$ using the inverse function, $z = f^{-1}(y)$. A multivariate normal distribution is used for the prior distribution $p_Z(z)$.

For a general fully-connected neural network, it is time-consuming to compute the determinant of the partial derivative matrix $\partial f(z) / \partial z$, and its derivative with respect to network parameters. The network is also not guaranteed to be a bijective function. In [3], the authors proposed a special way of constructing the neural network to solve these problems. The latent space is split into two orthogonal subspaces, with components $z_1$ and $z_2$, and the transformation is defined to be the composition of a series of affine coupling layers. Each affine coupling layer has the form

$$y_1 = z_1, \qquad y_2 = z_2 \odot \exp\big(s(z_1)\big) + t(z_1),$$

or similarly

$$y_2 = z_2, \qquad y_1 = z_1 \odot \exp\big(s(z_2)\big) + t(z_2),$$

where the functions $s$ and $t$ are neural networks that predict the vectors of log-scale and translation of the affine transformation, and $\odot$ is the Hadamard (or element-wise) product. The neural networks $s$ and $t$ may optionally be conditioned on features of the input observations.

By alternating between the two coupling layers, the composed transformation can be arbitrarily complex. This class of transformations has two desirable properties: the inverse function can be easily computed by inverting each affine coupling layer, and the determinant of the partial derivative matrix for each layer can be easily calculated as $\exp\big(\sum_j s(z_1)_j\big)$ for the first form, and analogously with $z_2$ for the second. As a result, we are able to efficiently sample from the predicted distribution, and also compute the probability density of given data under the predicted distribution.
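A single affine coupling layer can be illustrated with a minimal Python sketch. Here one half of the latent vector, `z1`, passes through unchanged and the other half, `z2`, is scaled and shifted by functions of `z1`. The function `scale_and_shift` is a hypothetical stand-in for the log-scale and translation networks; in the actual model these would be learned and possibly conditioned on image features.

```python
import math

def scale_and_shift(z1):
    """Hypothetical stand-in for the log-scale and translation networks."""
    s = [math.tanh(0.5 * v) for v in z1]   # log-scale vector
    t = [0.3 * v + 0.1 for v in z1]        # translation vector
    return s, t

def coupling_forward(z1, z2):
    """Map latent (z1, z2) to prediction space; also return log|det J|."""
    s, t = scale_and_shift(z1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(z2, s, t)]
    log_det = sum(s)  # the Jacobian is triangular with diagonal exp(s)
    return list(z1), y2, log_det

def coupling_inverse(y1, y2):
    """Invert the layer: y1 equals z1, so s and t can be recomputed exactly."""
    s, t = scale_and_shift(y1)
    z2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return list(y1), z2

def standard_normal_logpdf(z):
    return sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in z)

def log_density(y1, y2):
    """Change of variables: log p(y) = log p(z) - log|det J|."""
    z1, z2 = coupling_inverse(y1, y2)
    _, _, log_det = coupling_forward(z1, z2)
    return standard_normal_logpdf(z1 + z2) - log_det

z1, z2 = [0.2, -1.0], [0.7, 0.4]
y1, y2, log_det = coupling_forward(z1, z2)
r1, r2 = coupling_inverse(y1, y2)
print(y2, r2, log_density(y1, y2))
```

Stacking several such layers, alternating which half is left unchanged, gives the composed transformation; the log-determinants of the layers simply add.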
Mixture of flows model
It is straightforward to use a multivariate normal distribution in the latent space for Real NVP models. However, in experiments we observe that it is difficult for Real NVP models to learn a cluster-like distribution, where the support of the target distribution is separated into modes, instead of a continuous region. To make the model more expressive, we combine Gaussian mixture and Real NVP into a mixture of flows (MoF) model, where the latent space distribution is a learnable Gaussian mixture, and each Gaussian in the latent space is transformed by an independent Real NVP transformation. The MoF model combines the best of both worlds. It does not suffer from the saddle point problem of GMM, and the model can easily use different Gaussian components to model different modes in the action space.
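The structure of the MoF density can be sketched in one dimension as follows. Each component carries a mixture weight, a latent Gaussian, and its own invertible map; the density is a weighted sum of the per-component change-of-variables densities. As a loudly stated simplification, each flow here is a toy scalar affine map rather than a Real NVP network, and all parameter values are made up for illustration.

```python
import math
import random

def normal_logpdf(z, mean, std):
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Each component: mixture weight, latent Gaussian (mean, std), and the
# parameters of a toy invertible map y = z * exp(log_scale) + shift.
# All values are illustrative assumptions.
COMPONENTS = [
    {"weight": 0.5, "mean": 0.0, "std": 1.0, "log_scale": -1.0, "shift": -2.0},
    {"weight": 0.5, "mean": 0.0, "std": 1.0, "log_scale": -0.5, "shift": 2.0},
]

def mof_log_density(y):
    """log p(y) = logsumexp_k [log w_k + log N(f_k^{-1}(y)) - log|det J_k|]."""
    terms = []
    for c in COMPONENTS:
        z = (y - c["shift"]) * math.exp(-c["log_scale"])
        terms.append(math.log(c["weight"])
                     + normal_logpdf(z, c["mean"], c["std"])
                     - c["log_scale"])
    return logsumexp(terms)

def mof_sample(rng):
    """Pick a component by weight, sample its latent Gaussian, apply its flow."""
    c = rng.choices(COMPONENTS, weights=[k["weight"] for k in COMPONENTS])[0]
    z = rng.gauss(c["mean"], c["std"])
    return z * math.exp(c["log_scale"]) + c["shift"]

rng = random.Random(0)
samples = [mof_sample(rng) for _ in range(5)]
print(samples, mof_log_density(2.0))
```

With separate flows per component, each mode of the action distribution can be shaped independently while sampling and density evaluation both remain cheap.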
IV-B Actor model training
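Using neural density models enables us to directly train an actor model by maximizing the log-likelihood, as opposed to GAN-style adversarial training where a critic or discriminator is required. With a dataset of successful grasps $\mathcal{D} = \{(s_i, a_i)\}$, the training loss is

$$\mathcal{L}(\theta) = -\mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid s) \right].$$

If we give a binary reward of $r = 1$ to successful grasps and $r = 0$ to failed grasps, this loss is equivalent to

$$\mathcal{L}(\theta) = -\mathbb{E}_{s,\, a \sim \pi_b(\cdot \mid s)} \left[ r(s, a) \log \pi_\theta(a \mid s) \right],$$

where $\pi_b$ is the behavior policy used to collect the dataset. Our training loss is equivalent to minimizing the KL divergence between the normalized distribution proportional to $r(s, a)\, \pi_b(a \mid s)$ and the actor distribution $\pi_\theta(a \mid s)$. When the behavior policy is uniform random across the action space, assuming our density model is able to approximate arbitrary probability distributions, the optimal policy is $\pi^*(a \mid s) \propto r(s, a)$, and covers every successful action.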
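The relation between the two forms of the loss can be checked numerically. In the sketch below, `log_probs` are hypothetical log-densities that an actor assigns to a batch of tried actions and `rewards` are their binary outcomes; the values are made up. On a finite batch the reward-weighted loss equals the success-only negative log-likelihood scaled by the batch success rate, so both have the same minimizer.

```python
def nll_on_successes(log_probs, rewards):
    """Mean negative log-likelihood over the successful actions only."""
    succ = [lp for lp, r in zip(log_probs, rewards) if r == 1]
    return -sum(succ) / len(succ)

def reward_weighted_loss(log_probs, rewards):
    """Reward-weighted negative log-likelihood over all tried actions."""
    return -sum(r * lp for lp, r in zip(log_probs, rewards)) / len(rewards)

# Hypothetical batch: actor log-densities and binary grasp outcomes.
log_probs = [-0.5, -2.0, -1.0, -3.0]
rewards = [1, 0, 1, 0]

nll = nll_on_successes(log_probs, rewards)
weighted = reward_weighted_loss(log_probs, rewards)
success_rate = sum(rewards) / len(rewards)
print(nll, weighted, success_rate)
```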
As the task gets more difficult, the success rate of a random policy can be low, and collecting a dataset of successful grasps from random trials can be inefficient. We can instead sample actions from the actor model, rather than from a random distribution, to add data to the dataset. Inferring actions from the actor model increases the grasp success rate and makes learning more efficient. In this case, the training loss needs to be adapted. Since the behavior policy is now the actor model itself, minimizing the KL divergence against the resulting unnormalized distribution is prone to mode missing. A maximum-entropy regularizer is added to the training loss to prevent mode missing. The loss becomes a weighted sum of the log-likelihood loss and the entropy term, where a scalar coefficient sets the relative weight between the two losses. It is not hard to prove that, under this loss, the action distribution converges to a fixed target distribution.
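One way such an entropy-regularized loss could be estimated from samples is sketched below. The exact form used in the paper is not recoverable from the text, so the structure here (negative log-likelihood minus a weighted Monte Carlo entropy estimate) and the name `alpha` for the weight are assumptions made for illustration.

```python
def entropy_estimate(sampled_log_probs):
    """Monte Carlo entropy estimate: H is approximately -mean(log pi(a)), a ~ pi."""
    return -sum(sampled_log_probs) / len(sampled_log_probs)

def regularized_loss(success_log_probs, sampled_log_probs, alpha):
    """Assumed form: NLL on successful grasps minus alpha times the entropy."""
    nll = -sum(success_log_probs) / len(success_log_probs)
    return nll - alpha * entropy_estimate(sampled_log_probs)

# Hypothetical log-densities of successful grasps under the actor, and of
# fresh samples drawn from the actor itself.
loss = regularized_loss([-1.0, -3.0], [-0.5, -1.5], alpha=0.1)
print(loss)
```

A larger weight pushes the actor towards broader distributions, which counteracts the tendency to collapse onto the modes that were sampled first.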
V Robot grasping overview
We demonstrate the result of training actor models for vision-based robotic grasping. An illustration of our grasping setup is shown in Fig. 8. To demonstrate the advantage of a probabilistic actor over a deterministic one, all our experiments have multiple objects in the workspace, so the distribution of good grasps is multimodal.
The observation sent to the actor model includes the robot’s current camera observation, an RGB image recorded from an over-the-shoulder monocular camera (see Fig. 8), and an initial image taken before the arm is in the scene. The action is a 4-dimensional top-down grasp pose, with a vector in Cartesian space indicating the desired change in the gripper position, and a change in azimuthal angle encoded via a sine-cosine encoding. The gripper is scripted to go to the bottom of the tray and close on the final time step.
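The sine-cosine encoding of an angle can be illustrated with a short sketch. The function names are hypothetical; the point is that the encoding is continuous across the wrap-around at plus or minus pi, and that `math.atan2` recovers the angle.

```python
import math

def encode_angle(theta):
    """Encode an angle (radians) as a (sin, cos) pair."""
    return (math.sin(theta), math.cos(theta))

def decode_angle(sin_val, cos_val):
    """Recover the angle in (-pi, pi] from its (sin, cos) encoding."""
    return math.atan2(sin_val, cos_val)

theta = 0.7
encoded = encode_angle(theta)
decoded = decode_angle(*encoded)
print(encoded, decoded)
```

Angles that differ by a full turn map to the same pair, so the regression target has no discontinuity of the kind a raw angle would have.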
In simulation, grasp success is determined by raising the gripper to a fixed height and checking the objects’ poses. For the real robots, the post-grasp and the post-drop images, both taken without the arm in view, are subtracted. A grasp is labeled successful only if the two images differ significantly, which indicates that an object was dropped back into the tray. This labeling process is fully automated to achieve self-supervision.
The neural network for the actor model consists of 7 convolution layers to process the image, followed by a spatial softmax layer to extract 128 feature points. The coordinates of the feature points are then processed with 2 fully connected layers to produce the final representation of the input images, which is used to predict the parameters of the Gaussian mixture, and/or concatenated with the latent code to predict the log scale and translation for Real NVP’s affine coupling layers.
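The spatial softmax operation can be sketched for a single feature map as follows: a softmax over all pixel locations turns the map into a probability distribution, and the expected image coordinate under that distribution is the feature point. Normalizing coordinates to the range [-1, 1] is an assumption of this sketch; applying the function to each of 128 feature maps would yield 128 feature points.

```python
import math

def spatial_softmax(feature_map):
    """Return the expected (x, y) location of one 2D feature map.

    feature_map is a list of rows of floats with at least 2 rows and 2 columns.
    Coordinates are normalized to [-1, 1] (an assumed convention).
    """
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = sum(w * (2.0 * j / (width - 1) - 1.0)
            for row in weights for j, w in enumerate(row)) / total
    y = sum(w * (2.0 * i / (height - 1) - 1.0)
            for i, row in enumerate(weights) for w in row) / total
    return x, y

uniform_point = spatial_softmax([[0.0] * 3 for _ in range(3)])
peaked_point = spatial_softmax([[50.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(uniform_point, peaked_point)
```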
Although the actor model is trained to predict good grasp poses in one step, our robots take multiple actions for each grasp trial, both for data collection and for evaluation. For data collection, the number of actions taken in a trial is chosen at random. To transform the recorded grasp trials into data samples suitable for training the actor model, at each step the action is determined by the difference between the final grasping pose and the current gripper pose, and the grasp success determined at the end of the trial is used for every step in the process. For evaluation, the robot will close its gripper and end one grasp trial if it has converged to a grasp pose, or if a maximum of 10 actions is reached. Experimentally, we define convergence for the actor as the selected action being within a small threshold of movement (in mm) in Cartesian space and rotation, and for the critic as the predicted value for the zero action exceeding a fixed fraction of the highest sample’s value.
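The relabeling of a recorded trial into training samples can be sketched as below. The pose layout `(x, y, z, theta)`, the dictionary keys and the function names are hypothetical; the sketch only illustrates that each step's action is the difference between the final grasp pose and the current gripper pose, and that every step inherits the trial's success label.

```python
import math

def wrap_angle(theta):
    """Wrap an angle difference into (-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))

def trial_to_samples(observations, gripper_poses, final_pose, success):
    """Turn one multi-step grasp trial into per-step training samples.

    Poses are (x, y, z, theta) tuples (an assumed layout).
    """
    samples = []
    for obs, pose in zip(observations, gripper_poses):
        dx, dy, dz = (f - p for f, p in zip(final_pose[:3], pose[:3]))
        dtheta = wrap_angle(final_pose[3] - pose[3])
        samples.append({"observation": obs,
                        "action": (dx, dy, dz, dtheta),
                        "success": success})
    return samples

observations = ["image_0", "image_1"]  # placeholders for camera images
gripper_poses = [(0.0, 0.0, 0.3, 0.0), (0.1, 0.0, 0.2, 0.5)]
final_pose = (0.2, 0.1, 0.0, 1.0)
samples = trial_to_samples(observations, gripper_poses, final_pose, success=True)
print(samples)
```

Only the samples from successful trials would then be kept for training the actor.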
In simulation, we evaluate the performance of our actor model with pure off-policy data as well as on-policy data. When training with only off-policy data, the robots move randomly within the workspace, and successful grasps are extracted. When training on-policy, the initial successful transitions are collected by a random policy, after which the actor model is used to sample actions, and successful grasps are added to the data buffer. We use simulated robots and GPUs to collect data and perform training asynchronously.
We also evaluated our method on real KUKA robots. In this case our models are trained with a dataset of grasps previously collected and used in [2], which has 9.4 million training samples in total, including 3.6 million successful samples.
VI Experiments
VI-A Representation power of neural density models
We demonstrate the advantage of our MoF model compared to other neural density models on a toy task as well as on robotic grasping. For the toy task, we randomly sample 5 points in a square. The coordinates of the 5 points are observations, and the distribution of good actions is defined as a mixture of Gaussians centered at with standard deviation . The GMM model has 1 linear layer and 4 components, the Real NVP model has 4 affine coupling layers and each translation and log scale function has 2 fully connected layers. For the MoF model, the base distribution in the latent space uses the same model as GMM, and each Real NVP branch has the same architecture as the Real NVP model.
Visualizations of each model’s prediction are shown in Fig. 3. Due to the saddle point problem in GMM models, its convergence is dependent on the initial value of Gaussian variances. The Real NVP model prediction covers all 4 Gaussians in the ground truth distribution, but also has a significant probability mass in areas that are not supported by the ground truth. The MoF model can represent the ground truth distribution well, and is also robust to changes in the initial variances of the base Gaussian mixture.
We also evaluate the representation power of the models on robotic grasping. The models are trained on an off-policy dataset of 1.8M successful grasps. Training and test objects are a set of 30 drink bottles and cups. Some examples of the objects are shown in Fig. 4. The grasp success rates are plotted in Fig. 5. Our MoF model achieves the highest grasp success rate. For the GMM model, trainable variance is less stable and performs worse than a fixed variance determined beforehand according to our knowledge of the task.
VI-B Data efficiency and inference speed of actor vs. critic models
We compared our method with training a critic model [15], where a cross-entropy method (CEM) optimization process is used to find good actions during evaluation. We collected datasets of different sizes by running a random policy. For training the critic model, both successful and failed grasps are used, while for training the actor model only successful grasps are used. However, we report dataset size as the number of actions tried, including successful and failed ones, even for the actor model. This corresponds to the time required to collect training data, although actor models are at a disadvantage in this comparison due to the low success rate of random trials. We evaluate both the actor model and the critic model on 2 grasping tasks with different objects. In the first task, there are 2 blocks in the basket that have the same appearance, and in the second task the basket has two objects from a set of 30 different commodity drink bottles and cups. Camera images from the robot for both tasks are shown in Fig. 4. Random trial success rate is on the blocks and on the bottles and cups.
During evaluation, the actor model samples actions for each input observation and picks the action that is kinematically feasible and has the highest probability density. For the critic model, the CEM process runs for 3 iterations, picking the top samples of each population to estimate the mean and variance for the next iteration; on the last iteration, the sample with the highest score is executed.
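The actor's action selection at evaluation time amounts to a single sample-filter-argmax pass, sketched below. The callables `sample_fn`, `log_density_fn` and `is_feasible` are hypothetical interfaces standing in for the trained actor and the robot's kinematic check, and the one-dimensional toy density is made up for the example.

```python
import random

def select_action(sample_fn, log_density_fn, is_feasible, num_samples, rng):
    """Sample candidates once, drop infeasible ones, return the densest one.

    Returns None if no sampled action is feasible.
    """
    candidates = [sample_fn(rng) for _ in range(num_samples)]
    feasible = [a for a in candidates if is_feasible(a)]
    if not feasible:
        return None
    return max(feasible, key=log_density_fn)

# Toy 1D stand-ins: a standard normal "actor" and a feasibility constraint a >= 0.
rng = random.Random(0)
best = select_action(
    sample_fn=lambda r: r.gauss(0.0, 1.0),
    log_density_fn=lambda a: -0.5 * a * a,
    is_feasible=lambda a: a >= 0.0,
    num_samples=64,
    rng=rng,
)
print(best)
```

Unlike CEM, no refitting loop is involved: the cost is one forward pass to obtain the distribution parameters plus one density evaluation per sample.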
Figure 6 plots the grasp success rate in evaluation as the number of training examples increases. The trend of growth is similar for both the actor and the critic models, although the success rate for the actor model is lower than the critic model, especially on the more difficult set of objects. Our hypothesis is that it is easier for the network to judge if a hypothetical action will be successful, by giving attention to only the area around the destination, while the actor model has to digest the whole image and learn all modes of possible successful actions.
Because the CEM optimization takes 3 iterations, the inference time for the CEM policy is 3 times longer than the inference time of the actor policy. The CEM policy inference also takes significantly more memory on the GPU, because the visual features need to be replicated to merge with the feature of each grasp pose proposal, while the actor can predict the parameters for the probability distribution once, and sample many grasp poses with very little overhead.
VI-C Actor models as natural exploration policy
Actor models provide a natural way of exploration for on-policy training. Once the actor model is trained with a small amount of off-policy data, it can be used to sample actions for collecting more grasping data, with a significantly higher rate of success.
We compare training the actor model using a replay buffer [19], where data is collected by sampling actions randomly, versus sampling from the actor model. We also compare with training a critic model on-policy, where data is collected by running CEM with the critic model.
Grasp success rates are plotted in Fig. 7 as training progresses. Using the actor model to sample actions and collect data has a clear advantage compared to using random actions. This advantage is more obvious when the task requires more precision and the random policy gets fewer successes. It is also clear that having an entropy regularizer in the training loss helps to improve the training speed and final performance of on-policy training. Finally, the actor and the critic can achieve the same final success rate, although for the harder task of grasping bottles and cups, the critic improves faster at the beginning of training. In both tasks the highest grasp success rate reaches about 90%, where most failures are due to objects being too close to the corners of the bin, so that the gripper collides with the bin before grasping.
VI-D Real robot experiments
We trained and evaluated the actor model and the critic model on real KUKA robots. Both models are trained on the same dataset of real robot grasps, but for the actor model only successful grasps are used. We evaluate the actor model by predicting 64 samples at each time step and taking the action with the highest probability density. The actor will close the gripper and terminate one grasp if the selected action is within movement in Cartesian space and rotation, or if a maximum of 10 time steps is reached. For evaluating the CEM method using the critic model, we set the initial Gaussian to have a standard deviation of in horizontal direction, in vertical direction and in rotation. This distribution is chosen to cover the space of the tray. The CEM is run for 3 iterations (see [15]) and the action with highest predicted value is selected. We also evaluated a policy that combines both the actor model and the critic model, where the actor model predicts 64 samples, and the samples are evaluated by the critic model. Finally the action scored highest by the critic model is selected.
The experiment consisted of 6 sets of 30 grasp attempts on each of the 7 KUKA robots, totalling 1260 grasp attempts. Each robot was presented with 56 objects from the test set, shown in Fig. 8. The average success rates of each run using the three presented methods are summarized in the table below.
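| Method       | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Avg.  |
|--------------|-------|-------|-------|-------|-------|-------|-------|
| CEM          | 78.1% | 76.2% | 79.0% | 76.1% | 82.3% | 80.5% | 78.7% |
| Actor only   | 80.0% | 80.0% | 70.4% | 77.1% | 77.1% | 75.7% | 76.7% |
| Actor+Critic | 86.9% | 81.4% | 82.8% | 80.0% | 81.4% | 84.7% | 82.8% |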
Visualizations of the actor model and the critic model predictions are shown in Fig. 1. The critic model predicts a smooth function over the workspace, and the CEM samples gradually concentrate towards the high-valued region that covers one of the objects. The actor model directly predicts promising samples that usually concentrate on the object closest to the gripper. Occasionally, the actor model predicts samples on the boundary of objects that would result in unstable grasps, see e.g. Fig. 9. The critic model predicts higher values for samples that are closer to the object center and thus more stable, and helps improve grasp success.
VII Conclusions and discussions
We proposed an alternative approach to vision-based robotic grasping. Instead of training a critic model that evaluates grasp proposals, we directly train a neural density model to approximate the conditional distribution of successful grasp poses given input images. We demonstrated both in simulation and on real robots that the proposed actor model achieves similar performance compared to the critic model using CEM for inference. On the top-down grasping task with a 4-dimensional action space, our actor model reduces inference time by a factor of 3 compared to the state-of-the-art CEM method, at the cost of longer training time. Going into higher-dimensional action spaces, we believe actor models will be more promising and scalable, while the CEM method will take exponentially longer inference time or even fail to find good solutions from the critic model.
Our proposed actor model also has limitations. One of the limitations is that our model only uses successful grasps as training data. Because the density model normalizes over the action space, it implicitly assumes that every action not included in the dataset of successful grasps is a failure, yet there may be additional information in the failed examples that can be helpful for improving the actor model. As future work, our actor model can be trained jointly with a critic model, where the binary reward from the dataset can be replaced by the value predicted by the critic model. Another way to incorporate information from failed trials is to train two separate actor models, one that predicts the distribution of successful actions (success actor model), and one that predicts the distribution of all actions tried (prior actor model). During evaluation we can predict samples from the success actor model, and evaluate the samples by taking the quotient of probability densities of the success actor model and prior actor model.
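The density-quotient idea from the last paragraph can be sketched as follows. Since this is proposed as future work, everything here is hypothetical: the two actors are replaced by one-dimensional Gaussians with made-up parameters, and the quotient is taken in log space as a difference of log-densities.

```python
import math

def normal_logpdf(a, mean, std):
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def ratio_score(action, success_logpdf, prior_logpdf):
    """Log of the density quotient p_success(a) / p_prior(a)."""
    return success_logpdf(action) - prior_logpdf(action)

# Toy stand-ins for the two actor models (illustrative parameters only).
success_actor = lambda a: normal_logpdf(a, mean=1.0, std=0.2)
prior_actor = lambda a: normal_logpdf(a, mean=0.0, std=1.0)

candidates = [-0.5, 0.4, 1.1]
best = max(candidates, key=lambda a: ratio_score(a, success_actor, prior_actor))
print(best)
```

Actions that were tried often but rarely succeeded receive a low score, which is how information from failed trials would enter the ranking.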
References
[1] C. M. Bishop. Mixture Density Networks. Technical report, Neural Computing Research Group, Aston University, 1994.
[2] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke. Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping. In IEEE International Conference on Robotics and Automation, 2018.
[3] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[4] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked Autoencoder for Distribution Estimation. In International Conference on Machine Learning, 2015.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, 2014.
[6] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning, 2017.
[7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
[8] J. Ho and S. Ermon. Generative Adversarial Imitation Learning. In Advances in Neural Information Processing Systems, 2016.
[9] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In Conference on Robot Learning, 2018.
[10] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
[11] T. Kim and Y. Bengio. Deep Directed Generative Models with Energy-Based Probability Estimation. 2016.
[12] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.
[13] I. Lenz, H. Lee, and A. Saxena. Deep Learning for Detecting Robotic Grasps. The International Journal of Robotics Research, 2015.
[14] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end Training of Deep Visuomotor Policies. Journal of Machine Learning Research, 2016.
[15] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. The International Journal of Robotics Research, volume 37, 2017.
[16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations, 2016.
[17] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems, 2017.
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, 2016.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[20] D. Morrison, P. Corke, and J. Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. In Robotics: Science and Systems, 2018.
[21] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322, 2015.
[22] D. J. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In International Conference on Machine Learning, 2015.