Real-time Fruit Recognition and Grasp Estimation for Autonomous Apple Harvesting

by   Hanwen Kang, et al.
Monash University

In this research, a fully neural-network-based visual perception framework for autonomous apple harvesting is proposed. The framework includes a multi-function neural network for fruit recognition and a Pointnet grasp estimation network that determines the proper grasp pose to guide robotic execution. Fruit recognition takes raw RGB images from the RGB-D camera as input and performs fruit detection and instance segmentation; Pointnet grasp estimation takes the point cloud of each fruit as input and outputs a grasp pose prediction for each fruit. The proposed framework is validated using RGB-D images collected from laboratory and orchard environments, and a robotic grasping test in a controlled environment is also included in the experiments. Experimental results show that the proposed framework can accurately localise fruits and estimate the grasp pose for robotic grasping.




I Introduction

Autonomous harvesting plays a significant role in the recent development of the agricultural industry [51]. Vision is one of the essential tasks in autonomous harvesting, as it detects and localises the crop to guide the robotic arm to perform detachment [42]. The vision task in orchard environments is challenging because many factors can influence the performance of the system, such as variances in illumination and appearance, and occlusion between the crop and other items within the environment. Occlusion between fruits and other items can also decrease the success rate of autonomous harvesting [2] (for example, autonomous harvesting of sweet pepper, strawberry, or apples). Therefore, to increase the efficiency of harvesting, the vision system should be capable of guiding the robotic arm to detach the crop from a proper pose. Overall, an efficient vision algorithm which can robustly perform crop recognition and grasp pose estimation is the key to the success of autonomous harvesting [58].

In this work, a fully deep-learning-based vision algorithm which performs real-time fruit recognition and grasp pose estimation on raw sensor input for autonomous apple harvesting is proposed. The proposed vision solution includes two blocks: a fruit recognition block and a grasp pose estimation block. The fruit recognition block applies a one-stage multi-task neural network to perform fruit detection and instance segmentation on colour images. The grasp pose estimation block further processes the output of the fruit recognition block together with the depth images to estimate the proper grasp pose for each fruit by using Pointnet. The following highlights are presented in the paper:

  • Proposing the multi-task neural network Dasnet to perform real-time and accurate fruit detection and instance segmentation on input colour images from RGB-D camera.

  • Proposing the improved Pointnet network to perform fruit modelling and grasp pose estimation on input point clouds from RGB-D camera.

  • Realising the real-time and accurate multi-task visual processing on input from RGB-D camera by combining the aforementioned two points. The outputs from multi-task vision processing block are used to guide the robot to perform autonomous harvesting.

The rest of the paper is organised as follows. Section II reviews related work on fruit recognition and grasp pose estimation. Section III introduces the methods of the proposed vision processing algorithm. The experimental setup and results are included in Section IV. Conclusions and future work are given in Section V.

II Literature Review

II-A Fruit Recognition

Fruit recognition is an essential task in autonomous agricultural applications [52]. Many methods have been studied over the decades, including traditional methods [27, 26, 8] and deep-learning-based methods. Traditional methods apply hand-crafted image features to describe the objects within images and use machine-learning algorithms to perform classification, detection, or segmentation on such objects [19]. The performance of traditional methods is limited by the expressive ability of the feature descriptor, which requires adjustment before processing a different class of objects [43]. Deep-learning-based methods apply deep convolutional neural networks to perform automatic image feature extraction, which has shown good performance and generalisation [11].

Deep-learning-based detection methods can be divided into two classes: two-stage detection and one-stage detection [59]. Two-stage detection applies a Region Proposal Network (RPN) to search for Regions of Interest (RoI) in the image, and a classification branch is applied to perform bounding box regression and classification [9, 41]. One-stage detection combines the RPN and classification into a single architecture, which speeds up the processing of the images [40, 30]. Both two-stage and one-stage detection are widely studied in autonomous harvesting [17]. Bargoti and Underwood [3] applied the Faster Region Convolutional Neural Network (Faster-RCNN) to perform multi-class fruit detection in orchard environments. Yu et al. [56] applied Mask-RCNN [12] to perform strawberry detection and instance segmentation in a non-structural environment. Liu et al. [31] applied a modified Faster-RCNN to kiwifruit detection by combining the information from RGB and NIR images, and reported accurate detection performance. Tian et al. [48] applied an improved Dense-YOLO to monitor apple growth at different stages. Koirala et al. [20] applied a light-weight YOLO-V2 model named 'MangoYOLO' to perform fruit load estimation. Kang and Chen [14, 16] developed a multi-task network based on YOLO, which combines semantic segmentation, instance segmentation, and detection. Deep-learning-based methods are also widely applied in other agricultural applications, such as environmental modelling [32] and remote sensing [21].

II-B Grasp Estimation

Grasp pose estimation is one of the key techniques in robotic grasping [7]. Similar to the methods developed for fruit recognition, grasp pose estimation methods can be divided into two categories: traditional analytical approaches and deep-learning-based approaches [5]. Traditional analytical approaches extract feature or key points from the point clouds and then match the sensory data against templates from a database to estimate the object pose [1]. A pre-defined grasp pose can be applied in this condition. For unknown objects, some assumptions can be made, such as grasping the object along its principal axis [7]. The performance of traditional analytical approaches is limited in the real world, as noise or partial point clouds can severely influence the accuracy of the estimation [47].

Deep-learning-based methods have been developed more recently. Earlier deep-learning-based works recast grasp pose estimation as an object detection task, which can directly produce the grasp pose from images [24]. Recently, with the development of deep-learning architectures for 3D point cloud processing [37, 38], more works have focused on performing grasp pose estimation on 3D point clouds. These methods apply convolutional neural network architectures to process the 3D point clouds and estimate the grasp pose to guide the grasping, such as Grasp Pose Detection (GPD) [10] and Pointnet GPD [25].

Several works have applied grasp pose estimation in autonomous harvesting. Lehnert et al. [23] modelled the sweet pepper as a super-ellipsoid to estimate the grasp pose of the fruits. In their following work [22], they used the surface normals of the sweet pepper to generate grasp candidates. The generated grasp candidates are ranked by a utility function, which combines the surface curvature, the distance to the point cloud boundary, and the angle with respect to the horizontal world axis. The grasp candidate with the highest score is then chosen for execution. Currently, most of the works [45, 54, 34] developed in autonomous harvesting grasp the fruits by translating towards the target, which cannot secure the success rate of harvesting in unstructured environments [28]. Therefore, an efficient grasp pose estimation is the key to fully automatic harvesting.

III Methods and Materials

III-A System Design

Fig. 1: Two-stage vision perception and grasp estimation for autonomous harvesting.


The proposed vision perception and grasp estimation framework includes two stages: fruit recognition and grasp pose estimation. The workflow of the proposed vision processing algorithm is shown in Figure 1. In the first step, the fruit recognition block performs fruit detection and segmentation on input RGB images from the RGB-D camera. The outputs of the fruit recognition are projected onto the depth images, and the point cloud of each detected fruit is extracted and sent to the grasp pose estimation block for further processing. In the second step, the Pointnet architecture is applied to estimate the geometry and grasp pose of the fruits by using the point clouds from the previous step. The methods of the fruit recognition block and the grasp pose estimation block are presented in Sections III-B and III-C, respectively. The implementation details of the proposed method are introduced in Section III-D.

III-B Fruit Recognition

III-B1 Network Architecture

Fig. 2: Dasnet is a one-stage detection network which combines detection, instance segmentation and semantic segmentation.

A one-stage neural network, Dasnet [18], is applied to perform fruit detection and instance segmentation. Dasnet applies a 50-layer residual network (resnet-50) [13] as the backbone to extract features from the input image. A three-level Feature Pyramid Network (FPN) is used to fuse feature maps from the C3, C4, and C5 levels of the backbone (as shown in Figure 2). That is, the feature maps from the higher levels are fused into the feature maps from the lower levels, since higher-level feature maps include more semantic information, which can increase the classification accuracy [57].

Fig. 3: Architecture of the instance segmentation branch, which can perform instance segmentation, bounding box regression, and classification.

On each level of the FPN, an instance segmentation branch (which covers both detection and instance segmentation) is applied, as shown in Figure 3. Before the instance segmentation branch, an Atrous Spatial Pyramid Pooling (ASPP) module [6] is used to process the multi-scale features within the feature maps. ASPP applies dilated convolutions with different rates (1, 2, and 4 in this work), so that features of different scales are processed separately. The instance segmentation branch includes three sub-branches: a mask generation branch, a bounding box branch, and a classification branch. The mask generation branch follows the architecture design proposed in the Single Pixel Reconstruction Network (SPRNet) [55], which can predict a binary mask for an object from a single pixel within the feature maps. The bounding box branch predicts the confidence score and the bounding box shape. We apply one anchor bounding box on each level of the FPN; the anchor box sizes of the instance segmentation branch on the C3, C4, and C5 levels are 32 x 32, 80 x 80, and 160 x 160 pixels, respectively. The classification branch predicts the class of the object within the bounding box. The combined outputs of the instance segmentation branch form the fruit recognition results on colour images. Dasnet also has a semantic segmentation branch for semantic modelling of the environment, which is not applied in this research.

III-B2 Network Training

More than 1000 images were collected from apple orchards located in Qingdao, China and Melbourne, Australia, covering varieties such as Fuji, Gala, and Pink Lady. The images are labelled by using the LabelImage tool from Github [50]. We use 600 images as the training set, 100 images as the validation set, and 400 images as the test set. We introduce multiple image augmentations in the network training, including random crop, random scaling (0.8-1.2), flip (horizontal only), random rotation, and random adjustment of saturation (0.8-1.2) and brightness (0.8-1.2) in HSV colour space. We apply focal loss [29] in the training, and the Adam optimiser is used to optimise the network parameters. The learning rate and decay rate of the optimiser are 0.001 and 0.75 per epoch, respectively. We train the instance segmentation branch for 100 epochs and then train the whole network for another 50 epochs.
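As an illustration of the focal loss [29] used in training, the following is a minimal NumPy sketch of its binary form. The function name `focal_loss` and the default values `alpha=0.25` and `gamma=2.0` are assumptions taken from common practice; they are not settings reported in this work.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over all predictions.

    p: predicted foreground probabilities in [0, 1]
    y: ground-truth labels in {0, 1}
    alpha, gamma: assumed defaults, not values from the paper
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    # Probability assigned to the true class of each sample
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma down-weights samples that are already well classified
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` the expression reduces to a class-weighted cross-entropy, which is one way to sanity-check an implementation.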

III-B3 Post Processing

The results of the fruit recognition are projected onto the depth image. That is, the mask region of each apple in the depth image is extracted, and the 3D position of each point within that region is calculated to form the point cloud of the apple. The generated point clouds cover the visible part of each apple from the current view-angle of the RGB-D camera. These point clouds are further processed by the grasp pose estimation block to estimate the grasp pose, which is introduced in the following section.
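The projection step described above can be sketched as follows, assuming a pinhole camera model and a depth image that is already aligned with the colour image. The function name `mask_to_points` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical names used for illustration; the text does not specify how the back-projection is implemented.

```python
import numpy as np

def mask_to_points(depth_m, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels of one fruit into 3D points.

    depth_m: (H, W) depth image in metres, aligned to the colour image
    mask:    (H, W) instance mask of one fruit
    fx, fy, cx, cy: assumed pinhole intrinsics of the colour camera
    Returns an (n, 3) array of points in the camera frame.
    """
    # Keep only masked pixels that carry a valid (non-zero) depth reading
    valid = mask.astype(bool) & (depth_m > 0)
    v, u = np.nonzero(valid)          # pixel rows (v) and columns (u)
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Pixels with zero depth are skipped because RGB-D sensors commonly report zero where no depth measurement is available.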

III-C Grasp Estimation

III-C1 Grasp Planning

Since most apples are approximately spherical or ellipsoidal, we model the apple as a sphere for simplicity. In natural environments, apples can be blocked by branches or other items from the view-angle of the RGB-D camera. Therefore, the visible part of the apple from the current view-angle of the RGB-D camera indicates a potential grasp pose that is suitable for the robotic arm to pick the fruit. Unlike GPD [10] and Pointnet GPD [25], which generate multiple grasp candidates and use a network to find the best one, we formulate grasp pose estimation as object pose estimation, similar to Frustum PointNets [35]. We select the centre of the visible part as the position of the grasp pose, and the direction from the centre of the apple to this centre as its orientation (as shown in Figure 4). The Pointnet takes the single-view point cloud of each fruit as input and estimates the grasp pose for the robotic arm to perform detachment.

Fig. 4: Our method selects the orientation from the fruit centre to the centre of the visible part as the grasp pose.

III-C2 Grasp Representation

The pose of an object in 3D space has 6 Degrees of Freedom (DoF), which include three positions ($x$, $y$, and $z$) and three rotations ($\theta$, $\phi$, and $\omega$, along the Z-axis, Y-axis, and X-axis, respectively). We apply Euler-ZYX angles to represent the orientation of the grasp pose, as shown in Figure 5. The value of $\omega$ is set to zero, since we can always assume that the fruit does not rotate along its X-axis (apples are modelled as spheres). The grasp pose (GP) of an apple can then be formulated as follows:

$$
T_{GP} =
\begin{bmatrix}
\cos\theta\cos\phi & -\sin\theta & \cos\theta\sin\phi & x \\
\sin\theta\cos\phi & \cos\theta & \sin\theta\sin\phi & y \\
-\sin\phi & 0 & \cos\phi & z \\
0 & 0 & 0 & 1
\end{bmatrix}
\quad (1)
$$

Therefore, a parameter list [$x$, $y$, $z$, $\theta$, $\phi$] is used to represent the grasp pose of the fruit.
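A minimal sketch of how such a grasp pose could be computed from the two centres is given below. It assumes Euler-ZYX angles with the rotation about the X-axis fixed at zero, writes the remaining angles as `theta` (about Z) and `phi` (about Y), and assumes that the X-axis of the grasp frame is the approach direction pointing from the fruit centre to the centre of the visible part. The function names are hypothetical.

```python
import numpy as np

def grasp_angles(fruit_centre, visible_centre):
    """Angles (theta about Z, phi about Y) that align the grasp X-axis
    with the direction from the fruit centre to the visible-part centre."""
    d = np.asarray(visible_centre, dtype=float) - np.asarray(fruit_centre, dtype=float)
    d /= np.linalg.norm(d)
    theta = np.arctan2(d[1], d[0])
    phi = np.arctan2(-d[2], np.hypot(d[0], d[1]))
    return theta, phi

def grasp_transform(x, y, z, theta, phi):
    """Homogeneous transform Rz(theta) @ Ry(phi) with translation (x, y, z)."""
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(phi), np.sin(phi)
    T = np.eye(4)
    T[:3, :3] = [[ct * cp, -st, ct * sp],
                 [st * cp,  ct, st * sp],
                 [-sp,     0.0,      cp]]
    T[:3, 3] = [x, y, z]
    return T
```

The first column of the rotation part of `grasp_transform` is the approach direction, so it should reproduce the normalised centre-to-centre vector passed to `grasp_angles`.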

III-C3 Data Annotation

The grasp pose block uses point clouds as input and predicts the 3D Oriented Bounding Box (3D-OBB), oriented in the grasp orientation, for each fruit. Each 3D-OBB includes six parameters, which are $x$, $y$, $z$, $r$, $\theta$, and $\phi$. The position ($x$, $y$, $z$) represents the offsets on the X-, Y-, and Z-axis from the centre of the point cloud to the centre of the apple, respectively. The parameter $r$ represents the radius of the apple, as the apple is modelled as a sphere. The length, width, and height of the box can be derived from the radius. $\theta$ and $\phi$ represent the grasp pose of the fruit, as described in Section III-C2.

Fig. 5: Euler-ZYX angle is applied to represent the orientation of the grasp pose.

Since the parameters $x$, $y$, $z$, and $r$ may vary widely between situations, a scale parameter $S$ is introduced. We apply $S$ to represent the mean scale (radius) of the apple, which equals 30 (cm) in our case. The parameters $x$, $y$, $z$, and $r$ are divided by $S$ to obtain the united offsets and radius ($x_u$, $y_u$, $z_u$, $r_u$). After remapping, the range of $x_u$, $y_u$, and $z_u$ is $[-\infty, \infty]$, and the range of $r_u$ is $[0, \infty]$. To keep the grasp pose within the range of motion of the robotic arm, $\theta$ and $\phi$ are limited to a symmetric range $[-\theta_{max}, \theta_{max}]$. We divide $\theta$ and $\phi$ by $\theta_{max}$ to map the grasp angles into the range [-1, 1]. The united $\theta$ and $\phi$ are denoted as $\theta_u$ and $\phi_u$. In total, we have six united parameters to predict the 3D-OBB for each fruit, which are [$x_u$, $y_u$, $z_u$, $r_u$, $\theta_u$, $\phi_u$]. Among these parameters, [$x_u$, $y_u$, $z_u$, $\theta_u$, $\phi_u$] represent the grasp pose of the fruit, and $r_u$ controls the shape of the 3D-OBB.

III-C4 Pointnet Architecture

Fig. 6: Pointnet applies symmetric function to extract features from the unordered point cloud.

Pointnet [37] is a deep neural network architecture which can perform classification, segmentation, or other tasks on point clouds. Pointnet uses the raw point cloud of an object as input and does not require any pre-processing. The architecture of the Pointnet is shown in Figure 7. Pointnet takes an n x 3 (n is the number of points) unordered point cloud as input. Firstly, Pointnet applies convolution operations to extract a multi-dimensional feature vector for each point. Then, a symmetric function is used to extract the features of the point cloud on each dimension of the feature vector (as shown in Figures 6 and 7):

$$
f(x_1, x_2, \ldots, x_n) = g\big(h(x_1), h(x_2), \ldots, h(x_n)\big) \quad (2)
$$

In Eq. 2, $g$ is a symmetric function, $h$ is the per-point feature extraction, and $f$ is the extracted feature of the set. Pointnet applies max-pooling as the symmetric function. In this manner, Pointnet can learn a number of features from the point set while remaining invariant to input permutation. The generated feature vectors are further processed by a Multi-Layer Perceptron (MLP) (fully-connected layers in Pointnet) to perform classification of the input point clouds. A batch-norm layer is applied after each convolution layer or fully-connected layer. Drop-out is applied in the fully-connected layers during training.
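The permutation invariance obtained from max-pooling can be illustrated with a small NumPy sketch. The single shared linear layer with ReLU below is a stand-in for the per-point convolutions of Pointnet, and the weight matrix `W` is random and purely hypothetical; only the structure (shared per-point function followed by a max over points) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))  # hypothetical shared per-point weights

def global_feature(points):
    """Shared per-point layer followed by max-pooling over the point axis."""
    per_point = np.maximum(points @ W, 0.0)  # (n, 64), same weights for every point
    return per_point.max(axis=0)             # (64,), symmetric in the point order

cloud = rng.normal(size=(200, 3))
shuffled = cloud[rng.permutation(len(cloud))]

# The global feature does not depend on the order of the input points
print(np.allclose(global_feature(cloud), global_feature(shuffled)))
```

Any other symmetric reduction (sum or mean, for example) would give the same invariance; max-pooling is the choice stated in the text.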

Fig. 7: Network architecture of the Pointnet applied in Grasp estimation.

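In this work, the output of the Pointnet is changed to the 3D-OBB prediction, which includes six united parameters [$x_u$, $y_u$, $z_u$, $r_u$, $\theta_u$, $\phi_u$] (three offsets, the radius, and the two grasp angles). The range of the parameters $x_u$, $y_u$, and $z_u$ is $[-\infty, \infty]$, hence we do not apply an activation function to these three parameters. The range of $r_u$ is from 0 to $\infty$, so the exponential function is used as its activation. The range of $\theta_u$ and $\phi_u$ is from -1 to 1, hence a tanh activation function is applied. The outputs of the Pointnet before activation are denoted as [$x_o$, $y_o$, $z_o$, $r_o$, $\theta_o$, $\phi_o$]. Therefore, we have

$$
x_u = x_o, \quad y_u = y_o, \quad z_u = z_o, \quad r_u = \exp(r_o), \quad \theta_u = \tanh(\theta_o), \quad \phi_u = \tanh(\phi_o) \quad (3)
$$

The output of the Pointnet can be remapped to the original values by following the description in Section III-C3.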
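A minimal sketch of this decoding step is shown below, under the assumptions stated in the text: no activation on the three offsets, an exponential on the radius, and tanh on the two angles, followed by remapping with the scale parameter and the angle limit of Section III-C3. The function name `decode_output` and the argument names `scale` and `max_angle` are hypothetical, and their values are left to the caller.

```python
import numpy as np

def decode_output(raw, scale, max_angle):
    """Map six pre-activation network outputs to metric 3D-OBB parameters.

    raw:       [x, y, z, r, theta, phi] before activation
    scale:     mean fruit scale used to normalise offsets and radius
    max_angle: symmetric limit used to normalise the two grasp angles
    """
    raw = np.asarray(raw, dtype=float)
    offset = raw[:3] * scale                # linear outputs, unbounded
    radius = np.exp(raw[3]) * scale         # exp keeps the radius positive
    angles = np.tanh(raw[4:6]) * max_angle  # tanh bounds each angle to [-max_angle, max_angle]
    return offset, radius, angles
```

An all-zero raw output therefore decodes to zero offset, a radius equal to `scale`, and zero grasp angles, which makes the mean fruit the default prediction.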

III-C5 Network Training

The data labelling is performed with our own labelling tool, as shown in Figure 8. The labelling tool records the six parameters of the 3D-OBB and all the points within the point cloud. The training of the Pointnet for 3D-OBB prediction is independent of the fruit recognition network training. In total, 570 single-view point clouds of apples are labelled (285 collected in the lab and 285 collected in orchards). We use 300 point sets as the training set (150 from each data set), 50 samples as the validation set (25 from each data set), and the remaining 220 samples as the test set (110 from each data set). The data augmentation includes scaling (0.8 to 1.2), translation (-15 cm to 15 cm on each axis), rotation (applied to the two grasp orientation angles), adding Gaussian noise (mean equals 0, variance equals 2 cm), and adding outliers (1% to 5% of the total number of points). Note that the orientation of the samples after augmentation should remain within the limited orientation range described in Section III-C3.
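The point cloud augmentations listed above can be sketched as follows. This is an illustration under several assumptions: coordinates are in metres, the 2 cm noise figure is used as a standard deviation, outliers are drawn uniformly from an enlarged bounding box, and the rotation step is omitted. The matching transformation of the 3D-OBB labels is also left out. The function name `augment` and its parameters are hypothetical.

```python
import numpy as np

def augment(points, rng, noise_std=0.02, max_shift=0.15):
    """Return an augmented copy of an (n, 3) point cloud given in metres."""
    pts = np.asarray(points, dtype=float) * rng.uniform(0.8, 1.2)   # random scaling
    pts = pts + rng.uniform(-max_shift, max_shift, size=3)          # shift per axis
    pts = pts + rng.normal(0.0, noise_std, size=pts.shape)          # Gaussian noise
    # Replace 1% to 5% of the points with outliers (assumed sampling scheme)
    n_out = int(len(pts) * rng.uniform(0.01, 0.05))
    idx = rng.choice(len(pts), size=n_out, replace=False)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = hi - lo
    pts[idx] = rng.uniform(lo - span, hi + span, size=(n_out, 3))
    return pts
```

Because the scaling and translation change the fruit centre and radius, the labels would have to be updated with the same factors before training.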

Fig. 8: The developed labelling tool for RGB-D images.

The squared error between the prediction and the ground truth is applied as the training loss. The Adam optimiser in Tensorflow is used to perform the optimisation; the learning rate, decay rate, and total number of training epochs are 0.0001, 0.6 per epoch, and 100, respectively.

III-D Implementation Details

III-D1 System Configuration and Software

Fig. 9: Harvesting is performed by using a custom harvesting tool and a UR5 robotic arm.

The Intel-D435 RGB-D camera is used in this research. A laptop (DELL-INSPIRATION) with an Nvidia GTX-980M GPU and an Intel i7-6700 CPU is used to control the RGB-D camera and perform the tests. The connection between the RGB-D camera and the laptop is achieved by using the RealSense communication package in the Robot Operating System (ROS) Kinetic [39] on Linux Ubuntu 16.04. The calibration between the colour image and the depth image of the RGB-D camera is included in realsense-ros. The implementation code of the Pointnet (in Tensorflow) is from Github [36], and it is trained on the Nvidia GTX-980M GPU. The Dasnet is implemented in Tensorflow, and its training is performed on an Nvidia GTX-1080Ti GPU. In the autonomous harvesting experiment, an industrial robotic arm, the Universal Robots UR5, is used (as shown in Figure 9). The communication between the UR5 and the laptop is performed by using universal-robot-ROS. MoveIt! [46] with the TRAC-IK inverse kinematics solver [4] is used for the motion planning of the robotic arm.

III-D2 Point Clouds Pre-processing

Although the data augmentation in Pointnet training includes added outliers to improve the robustness of the algorithm, we also apply a Euclidean-distance-based outlier rejection algorithm to filter out outliers before processing by the Pointnet. When the distance between a point and the centre of the point cloud is more than two times the mean distance between the points and the centre, we consider this point an outlier and reject it. This step is repeated three times to ensure the effectiveness of the filtering. For inference efficiency, a voxel downsampling function (resolution 3 mm) from the 3D data processing library open3D is used, and 200 points are extracted as the input of the Pointnet grasp estimation. A point set with fewer than 200 points after voxel downsampling is rejected, since it contains too few points for a reliable estimate. The point clouds before and after outlier rejection and voxel downsampling are shown in the corresponding figure.
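The pre-processing steps above can be sketched in NumPy as follows. The voxel averaging here is a plain NumPy stand-in for the open3D function named in the text, and the way exactly 200 points are selected after downsampling (random sampling without replacement) is an assumption, as the text does not state it. Function names and default values other than the factor of two, the three passes, the 3 mm resolution, and the 200 points are hypothetical.

```python
import numpy as np

def reject_outliers(points, ratio=2.0, passes=3):
    """Drop points farther from the centroid than ratio x the mean distance."""
    for _ in range(passes):
        dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
        points = points[dist <= ratio * dist.mean()]
    return points

def voxel_downsample(points, voxel=0.003):
    """Average the points that fall into the same voxel (edge length in metres)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse)
    out = np.zeros((len(counts), 3))
    for axis in range(3):
        out[:, axis] = np.bincount(inverse, weights=points[:, axis]) / counts
    return out

def prepare_input(points, n_points=200, rng=None):
    """Return n_points pre-processed points, or None if too few remain."""
    if rng is None:
        rng = np.random.default_rng()
    pts = voxel_downsample(reject_outliers(points))
    if len(pts) < n_points:
        return None  # rejected: not enough points for the network input
    return pts[rng.choice(len(pts), size=n_points, replace=False)]
```

Returning `None` for sparse point sets lets the caller skip the fruit instead of feeding an undersized input to the network.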

IV Experiment and Discussion

IV-A Experiment Setup

Fig. 10: Experiment setup in laboratory scenario.

We evaluate the proposed fruit recognition and grasp estimation algorithm both in simulation and on robotic hardware. In the simulation experiment, we apply the proposed method to the RGB-D data of the test set, which includes 110 point sets from each of the laboratory and orchard environments. In the robotic harvesting experiment, we apply the proposed method to guide the robotic arm to grasp apples on an artificial plant in the lab. We apply the IoU between the predicted and ground-truth bounding boxes to evaluate the accuracy of 3D localisation and shape estimation of the fruits. We use 3D Axis-Aligned Bounding Boxes (3D-AABB) to simplify the IoU calculation for 3D bounding boxes [53]. The IoU between 3D-AABBs is denoted as IoU$_{3D}$. We set 0.75 as the threshold value (thres) on IoU$_{3D}$ to determine the accuracy of fruit shape prediction. To evaluate the grasp pose estimation, we apply the absolute error between the predicted and ground-truth grasp pose, as it intuitively shows the accuracy of the predicted grasp pose. The maximum accepted error of grasp pose estimation for the robot to perform a successful grasp is 8°, which is set as the threshold value in the grasp pose evaluation. The experiment is conducted in several scenarios, including conditions with noise and outliers, and a dense clutter condition.
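The IoU between two 3D axis-aligned bounding boxes can be computed as in the sketch below. Representing a box as a pair of minimum and maximum corners, and deriving the box of a fruit from its centre and radius, are illustrative choices; the function names are hypothetical.

```python
import numpy as np

def sphere_aabb(centre, radius):
    """Axis-aligned bounding box (min corner, max corner) of a sphere."""
    c = np.asarray(centre, dtype=float)
    return c - radius, c + radius

def iou_3d_aabb(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    # Overlap length on each axis, clipped at zero when the boxes are disjoint
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = overlap.prod()
    vol_a = (a_max - a_min).prod()
    vol_b = (b_max - b_min).prod()
    return float(inter / (vol_a + vol_b - inter))
```

A prediction would then count as accurate when `iou_3d_aabb(predicted, ground_truth)` is at least the 0.75 threshold stated above.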

IV-B Simulation Experiments

In the simulation experiment, we compare our method with traditional shape fitting methods, namely sphere Random Sample Consensus (sphere-RANSAC) [44] and sphere Hough Transform (sphere-HT) [49], in terms of accuracy on fruit localisation and shape estimation. Both the RANSAC-based and HT-based algorithms take point clouds as input and generate a prediction of the fruit shape. The 3D bounding boxes of the predicted shapes are then used for accuracy evaluation and compared with our method. This comparison is conducted on RGB-D images collected from both laboratory and orchard scenarios.
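For readers unfamiliar with the sphere-RANSAC baseline, a generic sketch is given below. It is not the implementation of [44]: the algebraic least-squares sphere fit, the number of iterations, and the 3 mm inlier tolerance are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def fit_sphere(points):
    """Algebraic least-squares sphere fit: |p|^2 = 2 c.p + (r^2 - |c|^2)."""
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre = sol[:3]
    radius = np.sqrt(max(sol[3] + centre @ centre, 0.0))
    return centre, radius

def ransac_sphere(points, iters=200, tol=0.003, rng=None):
    """Fit a sphere to the largest consensus set found from 4-point samples."""
    if rng is None:
        rng = np.random.default_rng()
    best_inliers = None
    for _ in range(iters):
        sample = points[rng.choice(len(points), size=4, replace=False)]
        centre, radius = fit_sphere(sample)
        residual = np.abs(np.linalg.norm(points - centre, axis=1) - radius)
        inliers = residual < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best hypothesis
    return fit_sphere(points[best_inliers])
```

Because each hypothesis is judged by how many points lie within `tol` of its surface, isolated outliers have little effect, whereas noise spread over all points reduces the consensus of every hypothesis.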

Fig. 11: Pointset under different conditions, green sphere is the ground truth of the fruit shape.

IV-B1 Experiments in Laboratory Environments

Method     Normal   Noise   Outlier   Dense clutter   Combined
Pointnet   0.94     0.92    0.93      0.91            0.89
RANSAC     0.82     0.71    0.81      0.74            0.61
HT         0.81     0.67    0.79      0.73            0.63
TABLE I: Accuracy of the fruit shape estimation by using Pointnet, RANSAC, and HT in different tests.

We applied Pointnet grasp estimation, RANSAC, and HT to the RGB-D images collected in the laboratory environment. The experimental results of the three methods in the different tests are shown in Table I. Pointnet grasp estimation significantly increases the localisation accuracy of the 3D bounding box of the fruits: it achieves 0.94 on IoU, which is 0.12 and 0.13 higher than the RANSAC and HT methods, respectively. To evaluate the robustness of the different methods under noisy and outlier conditions, we randomly add Gaussian noise (mean equals 0, variance equals 2 cm) and outliers (1% to 5% of the total number of points) to the point clouds, as shown in Figure 11. The three methods show similar robustness in the outlier condition, since both RANSAC and HT apply a voting framework to estimate the primitives of the shape, which is robust to outliers. In the noisy condition, however, Pointnet grasp estimation is more robust than RANSAC and HT, since noisy point clouds can strongly affect the accuracy of the voting framework. We also tested Pointnet grasp estimation, RANSAC, and HT in a dense clutter scenario. Grasp estimation in dense clutter is challenging since the point cloud of an object can be influenced by neighbouring objects. Pointnet grasp estimation robustly performs accurate localisation and shape fitting of apples in this condition, which is a significant improvement over RANSAC and HT. The results obtained with Pointnet grasp estimation are presented in Figure 12; the 3D-OBBs are projected into image space by using the method developed in [33].

Fig. 12: Grasp estimation by using Pointnet. The green boxes are the fronts of the 3D-OBBs, the blue arrows are the predicted grasp poses, and the red spheres are the predicted shapes of the fruits.

Method     Normal   Noise   Outlier   Dense clutter   Combined
Pointnet   3.2      5.4     4.6       4.8             5.5
TABLE II: Mean error (in degrees) of grasp orientation estimation by using Pointnet in different tests.

In terms of grasp orientation estimation, Pointnet grasp estimation performs accurately, as shown in Table II. The mean error between the predicted and ground-truth grasp pose is 3.2° in the normal condition. The experimental results also show that Pointnet grasp estimation can accurately and robustly determine the grasp orientation of the objects in the noisy, outlier, and dense clutter conditions.

IV-B2 Experiments in Orchard Environments

Method   F1 score   Recall   Accuracy   IoU (mask)
Dasnet   0.873      0.868    0.88       0.873
TABLE III: Performance evaluation of fruit recognition in RGB-D images collected in orchard scenarios

In this experiment, we applied the fruit recognition (Dasnet) and Pointnet grasp estimation to the RGB-D images collected from apple orchards. The performance of the Dasnet is evaluated by using the RGB images in the test set. We apply the F1 score and the mask IoU as the evaluation metrics of the fruit recognition, where the mask IoU is the IoU value of the instance masks of fruits in colour images. Table III shows the performance of the Dasnet, and Figure 13 shows fruit recognition results obtained with Dasnet on the test set. The experimental results show that Dasnet performs well on fruit recognition in the orchard environment, achieving 0.88 and 0.868 on accuracy and recall, respectively. The accuracy of the instance segmentation on apples is 0.873. The remaining inaccuracy of the fruit recognition is due to variances in illumination and fruit appearance. From the experiments, we found that Dasnet can accurately detect and segment the apples in most conditions.

Fig. 13: Detection and instance segmentation performed by using Dasnet on collected RGB images.

Metric                  Pointnet   RANSAC   HT
Accuracy                0.88       0.76     0.78
Grasp orientation (°)   5.2        -        -
TABLE IV: Evaluation on grasp pose estimation by using Pointnet in different tests in the orchard scenario.

Table IV shows the performance comparison between Pointnet grasp estimation, RANSAC, and HT. In orchard environments, grasp pose estimation is more challenging than in indoor environments, because the sensory depth data can be affected by various environmental factors, as shown in Figure 15. In this condition, the performance of RANSAC and HT decreases significantly compared with the indoor experiment, while Pointnet grasp estimation shows better robustness. The IoU values achieved by Pointnet grasp estimation, RANSAC, and HT in the orchard scenario are 0.88, 0.76, and 0.78, respectively. In terms of grasp orientation estimation, Pointnet grasp estimation shows robust performance when dealing with flawed sensory data. The mean error of orientation estimation with Pointnet grasp estimation is 5.2°, which is still within the accepted range of orientation error. The results of grasp pose estimation with Pointnet grasp estimation in the orchard scenario are shown in Figure 14.

Fig. 14: Fruit recognition and grasp estimation experiments in orchard scenario.

IV-B3 Common Failures in Grasp Estimation

The major cause of grasp estimation failure with Pointnet grasp estimation is defective sensory data, as shown in Figure 15. Under these conditions, Pointnet grasp estimation always predicts a sphere with a very small radius. A threshold on the radius value can therefore be applied to filter out this kind of failure during operation.

Fig. 15: Failed grasp estimation in laboratory and orchard scenarios.

IV-C Experiments of Robotic Harvesting

Fig. 16: Autonomous harvesting experiment in the laboratory scenario.

The Pointnet grasp estimation was tested with a UR5 robotic arm to validate its performance in a real working scenario. We arranged apples on a fake plant in the laboratory environment, as shown in Figure 10. We conducted multiple trials (each trial contains three to seven apples on the fake plant) to evaluate the success rate of the grasp. The success rate is the fraction of successful grasps in the total number of grasp attempts. The operational procedure follows the design of our previous work [15], as shown in Figure 16. We simulate the real outdoor environment of autonomous harvesting by adding noise and outliers to the depth data. We also tested our system in the dense clutter condition. The experimental results are shown in Table V.

Condition    | Normal | Noise | Outlier | Dense clutter | Combined
Success rate | 0.91   | 0.87  | 0.90    | 0.84          | 0.837
TABLE V: Experimental results of robotic grasping using Pointnet grasp estimation in the laboratory scenario
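As an illustration of how the noise and outlier conditions and the success rate in Table V could be produced, the sketch below perturbs a fruit point cloud and computes the fraction of successful grasps. The perturbation model (Gaussian jitter on every point plus a small share of points displaced by a larger uniform offset) and all parameter values are assumptions made for this example; the actual noise model used in the experiments is not specified here.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def perturb_points(points: Sequence[Point],
                   noise_std: float = 0.005,      # assumed jitter, metres
                   outlier_ratio: float = 0.05,   # assumed share of outliers
                   outlier_range: float = 0.1,    # assumed max offset, metres
                   seed: Optional[int] = None) -> List[Point]:
    """Return a copy of the cloud with Gaussian noise and random outliers."""
    rng = random.Random(seed)
    # Add zero-mean Gaussian noise to every coordinate.
    noisy = [tuple(c + rng.gauss(0.0, noise_std) for c in p) for p in points]
    # Displace a random subset of points to act as outliers.
    n_outliers = int(round(outlier_ratio * len(noisy)))
    for idx in rng.sample(range(len(noisy)), n_outliers):
        noisy[idx] = tuple(
            c + rng.uniform(-outlier_range, outlier_range) for c in noisy[idx]
        )
    return noisy


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful grasps over all grasp attempts."""
    if not outcomes:
        raise ValueError("at least one grasp attempt is required")
    return sum(outcomes) / len(outcomes)
```

Fixing `seed` makes a perturbed trial repeatable, which can help when comparing estimators on identical degraded inputs.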

From the experimental results presented in Table V, Pointnet grasp estimation performs well in the robotic grasp tests. It achieves success rates of 0.91, 0.87, and 0.90 under the normal, noise, and outlier conditions, respectively. Under dense clutter, the success rate decreases relative to these conditions. The decrease is caused by collisions between the gripper and adjacent fruits: when a collision occurs during the grasp, it shifts the target fruit and causes the grasp to fail. This defect could be mitigated either by redesigning the gripper or by proposing multiple grasp candidates to avoid the collision. Collisions between the gripper and branches can also cause grasp failures in the other three conditions. Although such defects reduce the success rate, the system still performs well in the experiments: the success rates under the dense clutter and combined conditions are 0.84 and 0.837, respectively.

V Conclusion and Future Work

In this work, a fully deep neural network based fruit recognition and grasp estimation method was proposed and validated. The proposed method includes a multi-function network for fruit detection and instance segmentation, and a Pointnet grasp estimation network to determine the proper grasp pose of each fruit. Both networks were validated on RGB-D images from the laboratory and orchard scenarios. Experimental results showed that the proposed method can accurately perform visual perception and grasp pose estimation. Pointnet grasp estimation was also tested in robotic grasping in the laboratory scenario, where it achieved a high success rate. Future work will focus on optimising the end-effector design and proposing multiple grasp candidates to improve the grasp success rate in dense clutter.


This research is supported by ARC ITRH IH150100006 and THOR TECH PTY LTD. We also acknowledge Zhuo Chen for her assistance in preparation of this work.


  • [1] A. Aldoma, Z. Marton, F. Tombari, W. Wohlkinger, C. Potthast, B. Zeisl, and M. Vincze (2012) Three-dimensional object recognition and 6 dof pose estimation. IEEE Robotics & Automation Magazine, pp. 80–91. Cited by: §II-B.
  • [2] C. W. Bac, E. J. van Henten, J. Hemming, and Y. Edan (2014) Harvesting robots for high-value crops: state-of-the-art review and challenges ahead. Journal of Field Robotics 31 (6), pp. 888–911. Cited by: §I.
  • [3] S. Bargoti and J. Underwood (2017) Deep fruit detection in orchards. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3626–3633. Cited by: §II-A.
  • [4] P. Beeson and B. Ames (2015) TRAC-IK: an open-source library for improved solving of generic inverse kinematics. In 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 928–935. Cited by: §III-D1.
  • [5] S. Caldera, A. Rassau, and D. Chai (2018) Review of deep learning methods in robotic grasp detection. Multimodal Technologies and Interaction 2 (3), pp. 57. Cited by: §II-B.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §III-B1.
  • [7] S. Chitta, E. G. Jones, M. Ciocarlie, and K. Hsiao (2012) Perception, planning, and execution for mobile manipulation in unstructured environments. IEEE Robotics and Automation Magazine, Special Issue on Mobile Manipulation 19 (2), pp. 58–71. Cited by: §II-B.
  • [8] L. Fu, E. Tola, A. Al-Mallahi, R. Li, and Y. Cui (2019) A novel image processing algorithm to separate linearly clustered kiwifruits. Biosystems engineering 183, pp. 184–195. Cited by: §II-A.
  • [9] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §II-A.
  • [10] M. Gualtieri, A. Ten Pas, K. Saenko, and R. Platt (2016) High precision grasp pose detection in dense clutter. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 598–605. Cited by: §II-B, §III-C1.
  • [11] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu (2018) Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine 35 (1), pp. 84–100. Cited by: §II-A.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §II-A.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-B1.
  • [14] H. Kang and C. Chen (2019) Fruit detection and segmentation for apple harvesting using visual sensor in orchards. Sensors 19 (20), pp. 4599. Cited by: §II-A.
  • [15] H. Kang and C. Chen (2019) Visual perception and modelling in unstructured orchard for apple harvesting robots. arXiv preprint arXiv:1912.12555. Cited by: §IV-C.
  • [16] H. Kang and C. Chen (2020) Fast implementation of real-time fruit detection in apple orchards using deep learning. Computers and Electronics in Agriculture 168, pp. 105108. Cited by: §II-A.
  • [17] H. Kang and C. Chen (2020) Fast implementation of real-time fruit detection in apple orchards using deep learning. Computers and Electronics in Agriculture 168, pp. 105108. Cited by: §II-A.
  • [18] H. Kang and C. Chen (2020) Fruit detection, segmentation and 3d visualisation of environments in apple orchards. Computers and Electronics in Agriculture 171, pp. 105302. Cited by: §III-B1.
  • [19] K. Kapach, E. Barnea, R. Mairon, Y. Edan, and O. Ben-Shahar (2012) Computer vision for fruit harvesting robots–state of the art and challenges ahead. International Journal of Computational Vision and Robotics 3 (1/2), pp. 4–34. Cited by: §II-A.
  • [20] A. Koirala, K. Walsh, Z. Wang, and C. McCarthy (2019) Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of ‘mangoyolo’. Precision Agriculture, pp. 1–29. Cited by: §II-A.
  • [21] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov (2017) Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters 14 (5), pp. 778–782. Cited by: §II-A.
  • [22] C. Lehnert, A. English, C. McCool, A. W. Tow, and T. Perez (2017) Autonomous sweet pepper harvesting for protected cropping systems. IEEE Robotics and Automation Letters 2 (2), pp. 872–879. Cited by: §II-B.
  • [23] C. Lehnert, I. Sa, C. McCool, B. Upcroft, and T. Perez (2016) Sweet pepper pose detection and grasping for automated crop harvesting. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2428–2434. Cited by: §II-B.
  • [24] I. Lenz, H. Lee, and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Cited by: §II-B.
  • [25] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang (2019) Pointnetgpd: detecting grasp configurations from point sets. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3629–3635. Cited by: §II-B, §III-C1.
  • [26] G. Lin, Y. Tang, X. Zou, J. Cheng, and J. Xiong (2019) Fruit detection in natural environment using partial shape matching and probabilistic hough transform. Precision Agriculture, pp. 1–18. Cited by: §II-A.
  • [27] G. Lin, Y. Tang, X. Zou, J. Xiong, and Y. Fang (2019) Color-, depth-, and shape-based 3d fruit detection. Precision Agriculture, pp. 1–17. Cited by: §II-A.
  • [28] G. Lin, Y. Tang, X. Zou, J. Xiong, and J. Li (2019) Guava detection and pose estimation using a low-cost rgb-d sensor in the field. Sensors 19 (2), pp. 428. Cited by: §II-B.
  • [29] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §III-B2.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §II-A.
  • [31] Z. Liu, J. Wu, L. Fu, Y. Majeed, Y. Feng, R. Li, and Y. Cui (2019) Improved kiwifruit detection using pre-trained vgg16 with rgb and nir information fusion. IEEE Access. Cited by: §II-A.
  • [32] Y. Majeed, J. Zhang, X. Zhang, L. Fu, M. Karkee, Q. Zhang, and M. D. Whiting (2020) Deep learning based segmentation for automated training of apple trees on trellis wires. Computers and Electronics in Agriculture 170, pp. 105277. Cited by: §II-A.
  • [33] L. Novak (2017) Vehicle detection and pose estimation for autonomous driving. Master’s thesis, Czech Technical University in Prague. Cited by: §IV-B1.
  • [34] Y. Onishi, T. Yoshida, H. Kurita, T. Fukao, H. Arihara, and A. Iwai (2019) An automated fruit harvesting robot by using deep learning. ROBOMECH Journal 6 (1), pp. 13. Cited by: §II-B.
  • [35] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927. Cited by: §III-C1.
  • [36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. Cited by: §III-D1.
  • [37] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §II-B, §III-C4.
  • [38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §II-B.
  • [39] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software, Vol. 3, pp. 5. Cited by: §III-D1.
  • [40] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §II-A.
  • [41] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §II-A.
  • [42] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool (2016) Deepfruits: a fruit detection system using deep neural networks. Sensors 16 (8), pp. 1222. Cited by: §I.
  • [43] J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §II-A.
  • [44] R. Schnabel, R. Wahl, and R. Klein (2007) Efficient ransac for point-cloud shape detection. In Computer graphics forum, Vol. 26, pp. 214–226. Cited by: §IV-B.
  • [45] Y. Si, G. Liu, and J. Feng (2015) Location of apples in trees using stereoscopic vision. Computers and Electronics in Agriculture 112, pp. 68–74. Cited by: §II-B.
  • [46] I. A. Sucan and S. Chitta (2016) MoveIt!. Cited by: §III-D1.
  • [47] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt (2017) Grasp pose detection in point clouds. The International Journal of Robotics Research 36 (13-14), pp. 1455–1473. Cited by: §II-B.
  • [48] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, and Z. Liang (2019) Apple detection during different growth stages in orchards using the improved yolo-v3 model. Computers and electronics in agriculture 157, pp. 417–426. Cited by: §II-A.
  • [49] A. Torii and A. Imiya (2007) The randomized-hough-transform-based method for great-circle detection on sphere. Pattern Recognition Letters 28 (10), pp. 1186–1192. Cited by: §IV-B.
  • [50] Tzutalin (2015) LabelImg. Code repository. Cited by: §III-B2.
  • [51] J. P. Vasconez, G. A. Kantor, and F. A. A. Cheein (2019) Human–robot interaction in agriculture: a survey and current challenges. Biosystems engineering 179, pp. 35–48. Cited by: §I.
  • [52] A. Vibhute and S. Bodhe (2012) Applications of image processing in agriculture: a survey. International Journal of Computer Applications 52 (2). Cited by: §II-A.
  • [53] J. Xu, Y. Ma, S. He, and J. Zhu (2019) 3D-giou: 3d generalized intersection over union for object detection in point cloud. Sensors 19 (19), pp. 4093. Cited by: §IV-A.
  • [54] H. Yaguchi, K. Nagahama, T. Hasegawa, and M. Inaba (2016) Development of an autonomous tomato harvesting robot with rotational plucking gripper. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 652–657. Cited by: §II-B.
  • [55] J. Yao, Z. Yu, J. Yu, and D. Tao (2019) Single pixel reconstruction for one-stage instance segmentation. arXiv preprint arXiv:1904.07426. Cited by: §III-B1.
  • [56] Y. Yu, K. Zhang, L. Yang, and D. Zhang (2019) Fruit detection for strawberry harvesting robot in non-structural environment based on mask-rcnn. Computers and Electronics in Agriculture 163, pp. 104846. Cited by: §II-A.
  • [57] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §III-B1.
  • [58] Y. Zhao, L. Gong, Y. Huang, and C. Liu (2016) A review of key techniques of vision-based control for harvesting robot. Computers and Electronics in Agriculture 127, pp. 311–323. Cited by: §I.
  • [59] Z. Zhao, P. Zheng, S. Xu, and X. Wu (2019) Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems 30 (11), pp. 3212–3232. Cited by: §II-A.