I. Introduction
Robotic grasping, in which a robot picks up and transfers one or more targets from a specific place to another in an unstructured environment, is a fundamental problem in robotic control. The robot needs to perceive the environment through multiple modalities before making decisions. Vision-based robotic grasping has become the most popular approach.
Robotic grasping has been researched for decades, and different researchers solve the problem in different ways [1]. Learning-based methods are one of the mainstream research directions in robotic grasping and have achieved impressive performance in recent years. [2][3] use deep learning to estimate the grasp policy from human-labeled datasets or synthetic CAD datasets, greatly improving performance. [4][5] realize high-level grasping operations through reinforcement learning.
Most existing methods train a neural network, whether a Convolutional Neural Network (CNN) or a Fully Convolutional Network (FCN), on either the depth image or the RGB image alone to output the grasp policy. Robotic grasping datasets are very small compared to those in computer vision, so it is necessary to extract more information from the limited grasping data.
In this paper, we propose an FCN grasping network (UG-Net V2) based on U-Net [6] which generates an optimal grasping policy pixel-wise. Our work builds on our previous UG-Net (https://youtu.be/LJJRqmpYl2c), which uses only the depth image to predict the grasp policy. Depth sampling is unstable across different observation heights but insensitive to shadows, while the RGB image is insensitive to height but suffers from shadows cast by natural light. We therefore propose UG-Net V2 to fuse the advantages of both modalities by taking the depth and RGB images as direct inputs.
Our main contributions in this paper are: (1) We design UG-Net V2, which takes depth and RGB images directly as input and outputs a pixel-wise grasp prediction; some preprocessing is applied to improve the result. (2) We evaluate the effectiveness of our method in a physical experiment. The results show that UG-Net V2 is competitive with the state of the art in grasp performance, real-time capability and model size.
II. Related Work
Grasping is a fundamental operation in robotic manipulation and has been researched for decades. Traditional methods [7] concentrate on analysis and depend on precise physical models of the target and multiple properties such as geometry, dynamics and friction. Another popular direction is learning-based: many researchers design grasping algorithms using data-based and behavior-based learning methods. [3][8][9][10] use deep learning to generate the grasp policy; [4][5][11][12][13] explore the robotic task environment and generate the grasp policy by reinforcement learning.
Robotic manipulation has also been studied with multiple sensors, though current work mainly concerns computer vision data. FCNs and CNNs have been applied successfully to grasp prediction [2][8][14][15] based on 2D or 3D (2D+depth images or point clouds) vision. Furthermore, heterogeneous sensor modalities have become popular in recent years: many works fuse multi-modal data such as vision, range, haptics and even language to let robots grasp target objects more stably and flexibly [16][17][18].
III. Preliminary and Problem Formulation
In this section, we formulate the vision-based robotic grasping problem using a parallel-jaw gripper and RGB-D images. A parallel-jaw gripper is programmed to robustly grasp a novel object placed on a table surface, guided by a fixed overhead RGB-D camera. We need to learn a mapping from the image space to the grasp space, which we define below.
III-A. Problem Definition
Grasp definition. In this paper, robotic grasping means predicting a 5D grasp representation for one or more objects from the RGB and depth images. The 5D representation consists of the grasp position p = (x, y, z), the gripper orientation φ and the gripper open width w. In Cartesian coordinates it is defined as:

g = (p, φ, w),  p = (x, y, z).  (1)
Transformation. In a real grasp pipeline, we need to consider the 5D grasp representation in different coordinate frames. Here we discuss the grasp representation in image space and camera space. We assume that the intrinsic and extrinsic parameters of the camera and the physical properties of the robot are known.
The RGB and depth images taken from the RGB-D camera form the input I ∈ R^{H×W×4}, where H is the image height and W is the image width. In image space, the 5D grasp representation can be rewritten as:

g_I = (s, φ_I, w_I),  s = (u, v),  (2)
where s = (u, v) is the grasp position in image coordinates, and φ_I and w_I correspond to φ and w in Cartesian coordinates. We obtain the grasp representation in robot base coordinates from g_I following Eq. (3):

g = T_RC ( T_CI ( g_I ) ),  (3)

where T_CI is the transformation matrix from image space to camera space based on the camera intrinsics, and T_RC is the transformation matrix from camera coordinates to robot base coordinates based on the camera extrinsics.
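As an illustration of Eq. (3), the mapping from a pixel and its depth to camera coordinates (standard pinhole model, using the intrinsics) and then to the robot base frame (using the extrinsics) could be sketched as follows. This is our own minimal numpy sketch, not code from the paper, and the function name and argument layout are assumptions:

```python
import numpy as np

def pixel_to_base(u, v, depth, K, T_cam_to_base):
    """Back-project pixel (u, v) with depth z into camera coordinates
    via the pinhole model, then apply the 4x4 extrinsic transform to
    express the point in the robot base frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx   # image space -> camera space
    y = (v - cy) * depth / fy
    p_cam = np.array([x, y, depth, 1.0])  # homogeneous coordinates
    return (T_cam_to_base @ p_cam)[:3]    # camera space -> base space
```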
III-B. Objective
In grasp prediction, we estimate each pixel's grasp probability instead of predicting the grasp position directly. We can therefore redefine the grasp as

G = (Q, Φ, W),  (4)

where Q, Φ and W give each pixel's grasp probability, orientation and gripper width respectively. We obtain the grasp position by selecting the pixel with the maximum probability.
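Reading the final grasp off the pixel-wise maps amounts to an argmax over the probability map; a minimal sketch (variable and function names are ours):

```python
import numpy as np

def select_grasp(Q, Theta, W):
    """Pick the pixel with the highest grasp probability and return the
    orientation and gripper width predicted at that pixel."""
    v, u = np.unravel_index(np.argmax(Q), Q.shape)
    return (u, v), Theta[v, u], W[v, u], Q[v, u]
```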
Instead of detecting the object and then planning a grasp, we obtain the grasp policy through a pixel-wise metric on the RGB-D image. We define a function M from the image input to the grasp space:

G = M(I),  (5)

where I denotes an RGB-D image.
Our goal is to find a robust function M*:

M*_θ = argmin_θ L(G_T, M_θ(I)),  (6)

where L is the loss function between the ground truth G_T and the prediction M_θ(I), and θ is the parameter set of M. After the modeling process, we obtain the most robust function M* and the optimal grasp in camera space. Finally, we obtain the 5D grasp representation in robot base coordinates via Eq. (3).

IV. Method
In this section, we introduce a deep encoder-decoder fully convolutional network to approximate the function M defined in the last section. We use supervised learning to learn a function M_θ(I), where θ denotes the weight parameters of the neural network and I is the input RGB-D image.

IV-A. Dataset Generation
We use the Cornell grasp detection dataset [8] to train our network. This is a small human-labeled dataset containing 1035 RGB-D images of 280 different objects, with ground-truth labels of positive (graspable) and negative (non-graspable) rectangles. We project the point cloud data into a depth image, concatenate it with the RGB image into I and resize it to the network input resolution. We then augment the dataset, as most supervised methods do, by rotating, translating, cropping and scaling the raw data.
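The augmentation step might look like the following minimal sketch (a random 90-degree rotation plus a wrap-around translation; the paper's actual pipeline also crops and scales, and the shift range here is an assumption):

```python
import numpy as np

def augment(img, rng):
    """Randomly rotate a square image by a multiple of 90 degrees and
    apply a small random translation (wrap-around for simplicity)."""
    img = np.rot90(img, k=int(rng.integers(4)), axes=(0, 1))
    dy, dx = rng.integers(-5, 6, size=2)
    return np.roll(img, (int(dy), int(dx)), axis=(0, 1))
```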
IV-B. Raw Data Processing
We defined the 5D grasp representation in III-A; here we illustrate the raw data processing based on that grasp definition. In the Cornell grasp detection dataset, antipodal grasp candidates are labeled with rectangles in image space. We choose the middle third of each rectangle as our grasp mask, as in [3].
Grasp probability. We use the new grasp mask as our training label and choose the center point of the mask as the grasp position. We build a sparse binary label image in which positive pixels are assigned 1 and all other pixels 0.
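For an axis-aligned rectangle this labeling can be sketched as below; the Cornell rectangles are generally rotated, which would require a polygon fill, so this simplified version is illustrative only:

```python
import numpy as np

def grasp_mask(h, w, rect):
    """Sparse binary label: 1 inside the middle third (along the first
    axis) of an axis-aligned grasp rectangle, 0 elsewhere.
    rect = (r0, c0, r1, c1) in pixel coordinates."""
    r0, c0, r1, c1 = rect
    third = (r1 - r0) // 3
    label = np.zeros((h, w), dtype=np.uint8)
    label[r0 + third : r1 - third, c0:c1] = 1
    return label
```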
Gripper orientation. We define the gripper orientation angle φ in the range [-π/2, π/2] and represent it as a vector (cos 2φ, sin 2φ) on the unit circle, whose components are continuously distributed in [-1, 1]; [19] shows this encoding eases training.

Gripper width. We compute the gripper open width w_I in image space; its value equals the width of each labeled rectangle. We normalize it by the maximum rectangle width in the dataset, measured in pixels. The gripper width in robot base coordinates (the physical world) can then be recovered using the camera parameters and the depth value.
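The angle encoding, its recovery via the two-argument arctangent, and the width normalization can be sketched as follows (our own minimal sketch; the maximum width is passed in as a parameter rather than hard-coded, since the dataset's value is not given here):

```python
import numpy as np

def encode_angle(theta):
    """Map an angle in [-pi/2, pi/2] to a point on the unit circle;
    doubling the angle removes the gripper's 180-degree symmetry."""
    return np.cos(2 * theta), np.sin(2 * theta)

def decode_angle(cos2t, sin2t):
    """Recover the angle from the two encoded components."""
    return 0.5 * np.arctan2(sin2t, cos2t)

def normalize_width(w_pixels, w_max):
    """Scale the rectangle width in pixels to [0, 1]."""
    return w_pixels / w_max
```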
Image input. As discussed in IV-A, we use the RGB and depth images together by concatenating them into a single RGB-D image I. The depth image is inpainted with OpenCV [20]. Because the dataset is sampled from a real camera, the data already contain realistic noise and we do not need additional noise filtering. RGB pixel values lie in [0, 255] while depth values lie in a different range typical of the RealSense SR300, so the two images have different scales. We rescale all values to [0, 1] by min-max normalization following Eq. (7):
x' = (x - x_min) / (x_max - x_min).  (7)
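Applied to an image channel, Eq. (7) is one line of numpy; the small epsilon guarding against a constant channel is our own addition:

```python
import numpy as np

def minmax_normalize(x):
    """Rescale an array to [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-8)
```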
IV-C. Learning Network Design
The architecture of UG-Net V2 is illustrated in Fig. 1. In this work, we design a fully convolutional network to approximate the function M. UG-Net V2 is an extended version of UG-Net based on U-Net [6]; U-Net is a typical multi-scale network whose output image has the same size as its input. The network takes an RGB-D image as input and outputs the normalized grasp representation through four output branches, which predict the probability Q, the orientation vector components (cos 2Φ, sin 2Φ) and the gripper width W respectively. The orientation Φ can then be computed following Eq. (8). Our network has 2,327,876 parameters (approximately 28 MB) and runs in real time on our platform, which we introduce in the next section.
Φ = (1/2) arctan( sin 2Φ / cos 2Φ ).  (8)
IV-D. Proposed Loss Function and Training
We use 80% of our dataset for training and 20% for evaluation. We use the ReLU activation function for all layers except the last, which is linear. We train UG-Net V2 for 10 epochs with the MSE loss of Eq. (9); the output is M_θ(I) and the ground truth is G_T. Training takes about 100 minutes with batch size 4 on two NVIDIA GTX 1080 Ti graphics cards.

L = λ_1 MSE(Q) + λ_2 MSE(cos 2Φ) + λ_3 MSE(sin 2Φ) + λ_4 MSE(W),  (9)

where each MSE term is computed between the corresponding output branch and its ground truth, and λ_1, ..., λ_4 are weight coefficients, all set to the same default value.
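The weighted multi-branch MSE loss could be sketched as below (illustrative only; the branch ordering and the way the weights enter are our assumptions):

```python
import numpy as np

def weighted_mse(preds, targets, lambdas):
    """Weighted sum of per-branch MSE terms over the four output
    branches (probability, cos- and sin-orientation channels, width)."""
    return sum(l * np.mean((p - t) ** 2)
               for l, p, t in zip(lambdas, preds, targets))
```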
IV-E. Grasping Metrics
We consider three performance metrics for robotic grasping: success rate, robust grasp rate and planning time.
Success rate. The percentage of successful grasps over all grasp attempts.
Robust grasp rate. The percentage of successful grasps whose predicted probability exceeds 50%, over all runs.
Planning time. The time between receiving raw data from the camera and UG-Net V2 outputting the 5D grasp representation prediction. Because our algorithm runs on a physical system, whether a table-top environment or a mobile platform, its real-time capability must be considered.
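Planning time as defined above can be measured with a small wall-clock helper (our own utility, not from the paper):

```python
import time

def timed(fn, *args):
    """Run fn and report its latency in milliseconds, i.e. the span
    between receiving the input and emitting the prediction."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0
```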
V. Experiment
In this section, we evaluate UG-Net V2 in a physical environment. We choose a general benchmark including 3D-printed adversarial objects and household objects. A Kinova Jaco 7-DOF single-arm robot executes the grasp operations.
V-A. Assumptions
To realize our method, some preparatory work is required. First, the RGB image's coordinates should be aligned with the depth image's. Second, we should synchronize the timestamps of the RGB image topic and the depth image topic. These two steps are easy to realize in a simulation environment but need some engineering work on the physical system.
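On the software side, the timestamp synchronization can be approximated by pairing each RGB frame with the nearest depth frame within a tolerance, similar in spirit to ROS's approximate-time policy; this standalone sketch is our own:

```python
def match_frames(rgb_stamps, depth_stamps, tol):
    """Pair each RGB timestamp with the closest depth timestamp,
    dropping pairs whose offset exceeds tol seconds."""
    if not depth_stamps:
        return []
    pairs = []
    for i, t in enumerate(rgb_stamps):
        j = min(range(len(depth_stamps)),
                key=lambda k: abs(depth_stamps[k] - t))
        if abs(depth_stamps[j] - t) <= tol:
            pairs.append((i, j))
    return pairs
```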
V-B. Experimental Components
To evaluate the performance of UG-Net V2, we test in both the simulation environment and the physical one. We build a Kinova Jaco simulation environment using PyBullet [21]; in this paper we report only the physical experiment results, and the simulation code is available at https://github.com/aaronhd/pybullet_kinova7.git. We use an Intel RealSense SR300 RGB-D camera mounted on the wrist of the robot to acquire image information. All computation runs on a PC with Ubuntu 16.04, an Intel Core i7-8700K CPU and two NVIDIA GeForce GTX 1080 Ti graphics cards. Note that we use two cards to accelerate training but only one card to run the trained network.
V-C. Test Objects
V-D. Pipeline
We design a deep encoder-decoder fully convolutional network that uses RGB-D information to realize robotic grasping. As a baseline, we also train the same network with only depth input.
We design an open-loop controller to grasp target objects from three different heights. During the experiments, although the sensor's manufacturer states that it works over a wide range, we find that the raw depth data are noisy and too sparse at different heights, while the RGB data remain stable regardless of the measuring distance. Experiments at different heights can therefore indicate the effect of the RGB information. The whole pipeline is presented in Algorithm 1.
V-E. Results and Analysis
In the experiments, we performed more than 480 grasp trials on the adversarial objects described in V-C. The robot grasps each object 10 times with the setup in Fig. 3 at each of three observation heights (35, 45 and 55 cm). We use the depth-only network as the baseline and evaluate the method that fuses depth with RGB data. The results are shown in TABLE I: the method fusing RGB and depth achieves a large improvement in grasp performance. For both depth-only and RGB+depth, the robust grasp rate is always lower than the success rate: since we obtain the optimal grasp by selecting the highest-probability pixel (III-B), the selected grasp may succeed even when its probability is below 50%. As for sensitivity to the observation height, a smaller camera-object distance achieves better grasp performance in the depth-only scenario, while a larger distance performs better in the RGB+depth scenario. We believe the depth data become sparser, noisier or missing as the height increases, which badly affects grasp performance, whereas RGB sampling is independent of height within the sensor's measurement range, and a higher observation height captures more global information for predicting the grasp policy. Fig. 4 and Fig. 5 visualize the grasp probability predictions, which also indicate the advantage of RGB information for grasp prediction.







TABLE I: Grasp performance at three observation heights

Height (cm) | Depth-only Success | Depth-only Robust | RGB+Depth Success | RGB+Depth Robust
35          | 83.34%             | 78.31%            | 96.39%            | 90.36%
45          | 86.59%             | 78.05%            | 94.05%            | 82.14%
55          | 86.67%             | 82.22%            | 93.33%            | 75.56%
V-F. Comparison
We also compare with other work in TABLE II. Although robotic grasping is a complex operation across different scenarios and lacks a uniform standard, we manage to compare our work with existing methods on the same benchmarks. Note that GQ-CNN [2] and GG-CNN [3] used a GTX 1080 and a GTX 1070 NVIDIA graphics card respectively, while our work uses one GTX 1080 Ti to realize grasp prediction and takes only 8 ms per prediction. Our hardware is the best of the three, but it represents a common computing capacity today; under these conditions our work is still competitive. Both the success rate and the robust grasp rate of our work are better than those of the two existing methods.
TABLE II: Comparison with existing work

Metric                | GQ-CNN [2] | GG-CNN [3] | UG-Net (depth only) | UG-Net V2
Success Rate (%)      | 80         | 84         | 85.88               | 94.55
Robust Grasp Rate (%) | 43         | -          | 79.61               | 82.49
Planning Time (ms)    | 800        | 19         | 8                   | 8
Model Size (approx.)  | 75 MB      | 0.5 MB     | 28 MB               | 28 MB
VI. Conclusions
In this paper, we propose an encoder-decoder neural network to predict the grasp policy using 2.5D image information. The depth image provides physically scaled geometric features for grasp policy generation, and the RGB image provides diverse, robust information in camera space; we fuse the two modalities to predict the grasp policy more robustly. The experimental results in Fig. 4 and Fig. 5 show that our network predicts the grasp probability more stably and robustly. We design an open-loop controller (Algorithm 1) to evaluate grasp performance on the physical system and achieve a high grasp success rate compared to our depth-only baseline. We also compare against the state of the art, and our method is competitive in grasp performance, real-time capability and model size. We believe our method is well suited for deployment on mobile platforms as a general module. Our experiment video is available at https://youtu.be/Wxw_r5a8qV0.
References
 [1] A. Bicchi and V. Kumar, “Robotic grasping and contact: A review,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), vol. 1, pp. 348–353, IEEE, 2000.
 [2] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017.
 [3] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” arXiv preprint arXiv:1804.05172, 2018.
 [4] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, “Learning synergies between pushing and grasping with selfsupervised deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245, IEEE, 2018.
 [5] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, “Learning robotic assembly from cad,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9, IEEE, 2018.
 [6] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
 [7] B. Siciliano and O. Khatib, Springer handbook of robotics. Springer, 2016.
 [8] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
 [9] R. Detry, C. H. Ek, M. Madry, and D. Kragic, “Learning a dictionary of prototypical grasp-predicting parts from grasping experience,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 601–608, IEEE, 2013.
 [10] D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for grasp planning,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, pp. 4304–4311, IEEE, 2015.
 [11] R. Balasubramanian, L. Xu, P. D. Brook, J. R. Smith, and Y. Matsuoka, “Physical human interactive guidance: Identifying grasping principles from human-planned grasps,” IEEE Transactions on Robotics, vol. 28, no. 4, pp. 899–910, 2012.
 [12] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
 [13] F. Sadeghi, A. Toshev, E. Jang, and S. Levine, “Sim2real viewpoint invariant visual servoing by recurrent control,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4691–4699, 2018.
 [14] E. Arruda, J. Wyatt, and M. Kopicki, “Active vision for dexterous grasping of novel objects,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 2881–2888, IEEE, 2016.
 [15] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” International Journal of Robotics Research, vol. 36, no. 13-14, 2017.
 [16] R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine, “More than a feeling: Learning to grasp and regrasp using vision and touch,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300–3307, 2018.
 [17] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” arXiv preprint arXiv:1810.10191, 2018.
 [18] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser, “TossingBot: Learning to throw arbitrary objects with residual physics,” arXiv preprint arXiv:1903.11239, 2019.
 [19] K. Hara, R. Vemulapalli, and R. Chellappa, “Designing deep convolutional neural networks for continuous object orientation estimation,” arXiv preprint arXiv:1702.01499, 2017.
 [20] G. Bradski and A. Kaehler, “OpenCV,” Dr. Dobb’s Journal of Software Tools, vol. 3, 2000.
 [21] E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” GitHub repository, 2016.