Geometry-Aware Fruit Grasping Estimation for Robotic Harvesting in Orchards

by Hanwen Kang, et al.

Field robotic harvesting is a promising technique in the recent development of the agricultural industry. Robots must recognise and localise fruits before harvesting in natural orchards. However, the workspace of harvesting robots in orchards is complex: many fruits are occluded by branches and leaves. It is therefore important to estimate a proper grasping pose for each fruit before performing the manipulation. In this study, a geometry-aware network, A3N, is proposed to perform end-to-end instance segmentation and grasping estimation using both color and geometry sensory data from an RGB-D camera. In addition, workspace geometry modelling is applied to assist the robotic manipulation. Moreover, we implement a global-to-local scanning strategy, which enables robots to accurately recognise and retrieve fruits in field environments with two consumer-level RGB-D cameras. We also comprehensively evaluate the accuracy and robustness of the proposed network in experiments. The experimental results show that A3N achieves an instance segmentation accuracy of 0.873, with an average computation time of 35 ms. The average error of grasping estimation is 0.61 cm in centre and 4.8° in orientation. Overall, the robotic system that utilizes the global-to-local scanning and A3N achieves a harvesting success rate ranging from 70% to 85% in field harvesting experiments.




I Introduction

With the continuously increasing cost of labor, robotic fruit harvesting in orchards has become a promising technology [1, 2]. However, robotic fruit harvesting in typical orchard environments is more challenging than traditional crop harvesting [3, 4], because most orchard environments are highly unstructured and complex. Therefore, most recent robotic fruit harvesting systems detach fruits from plants by applying an end-effector mounted on a high Degree-of-Freedom (DoF) robotic arm [5, 6]. In general, a fruit picking cycle includes four steps: perception, approaching, detachment, and collection. Perception by vision techniques is key to the success of robotic fruit harvesting, as robots need to see fruits before further processing [7]. After that, robots need to find a proper grasping orientation and a collision-free path to approach and detach fruits from trees. In the past several years, many methods have been developed for visual perception. Both traditional and deep-learning based methods are used to detect, segment, and localise fruits using sensor data such as RGB images [8, 9] and point clouds [10]. Most of these studies do not consider estimating the approaching orientation for fruits. If the workspace is clear and fruits are not blocked by obstacles, detaching these fruits from trees is not challenging. However, in many cases, the fruits are surrounded by tangled branches and leaves, as shown in Figure 1(a) and (b), and the end-effector is then likely to fail to grasp them. A forced pull-back of the end-effector may damage both the robotic arm and the trees. Meanwhile, in practice, a collision between the robot body and the environment may cause unexpected movement of the target fruits, leading to a failure of the harvesting cycle. To date, only a few studies [11, 12, 13, 14, 15] have tried to solve this problem to improve the harvesting success rate. Moreover, these works focus on fruit grasping estimation in structured environments, such as greenhouses and laboratories, and cannot be generalized to complex field environments [16, 17].

Fig. 1: (a) Target apple surrounded by leaves and branches, (b) attempt to grasp the target apple with soft end-effector.

Motivated by object pose estimation using multi-source data of RGB images and point clouds [18], a new geometry-aware deep-learning network model, Apple 3D Network (A3N), is proposed in this work. A3N is designed to perform end-to-end detection, instance segmentation, and grasping estimation of fruits using the raw RGB image and point clouds from an RGB-D camera. Specifically, A3N uses a deep-learning detector to search for Regions of Interest (ROIs) in RGB images. Then, a PointNet model performs bounding box regression on the point cloud of each fruit and predicts a proper approach angle for the robotic arm. Based on the estimated grasping pose of each fruit, the robot can plan a proper path to approach and detach the fruit accurately and safely. To provide the robot with more information about surrounding obstacles, OctoMap [19] is used to construct the occupancy map of the workspace. Finally, A3N is evaluated with field data and our developed robotic harvesting system. To summarize, we make the following contributions in this research:

  • A novel end-to-end geometry-aware network A3N including a region proposal and a grasping estimation network is proposed to perform fruit segmentation and grasping pose estimation, respectively.

  • A framework including fruit detection, grasping pose estimation, and workspace modeling is proposed, which can be directly used for accurate and robust robotic harvesting in orchards.

  • A global-to-local strategy is implemented, allowing accurate vision sensing in orchards when depth sensors have limited accuracy. Extensive quantitative evaluations are also included to validate the proposed method.

The rest of the paper is organised as follows. Section II reviews the related works. Section III introduces the design of the robotic system and the visual processing approach. Experimental methods and results are presented in Section IV. Section V presents the conclusion and future work.

II Related Works

Fruit detection is an essential step in robotic harvesting [20]. Traditional fruit detection uses handcrafted features, such as gradients and textures, to perform segmentation or detection [21]. However, the performance of traditional methods is limited in field environments [22]. By comparison, deep-learning methods show superior accuracy and generalization [9]. Deep-learning based fruit detection includes two-stage [23] and one-stage methods [8]. Two-stage detection networks first search for ROIs of objects, then perform classification and location regression on the ROIs [24], while one-stage detection networks perform region proposal, classification, and regression in a single step [25]. Deep-learning based fruit detection has been widely studied in many scenes. Yu et al. (2019) applied Mask-RCNN to detect strawberries in greenhouses and reported an average precision of 95.78% [26]. Tian et al. (2019) applied YOLO-V3 to monitor the maturity and growth stage of apples in orchard environments [27]. Kang et al. (2019) presented a one-stage panoptic segmentation network to perform vision sensing in orchard conditions; an accuracy of 0.87 was reported [28].

Grasping estimation by the vision system can provide critical information for the manipulation [29]. It has been widely studied in many robotic and computer vision tasks. Traditional grasping estimation utilizes point cloud alignment methods, such as ICP, which match objects’ shapes with templates. The performance of these methods degrades when the shapes vary significantly. Grasping estimation using deep learning has drawn tremendous attention recently. It uses a deep-learning algorithm to estimate a grasping pose from 2D or 3D data [31]. Grasping estimation using 2D images recasts the task as an image detection or classification problem [32], while grasping estimation using 3D data can operate on points [33, 34], voxels [35], or meshes. However, only a few studies focus on grasping estimation in robotic harvesting. Most of these existing works address the grasping estimation of fruits in greenhouses [14, 12, 36] or on factory lines [11]. Without proper vision information, the success rate of harvesting can be severely affected when robots work in fields [37]. In addition, some works [13, 17] have studied grasping estimation on fruits in fields while ignoring the collision objects around the fruits, which cannot ensure the safety and accuracy of the manipulation.

With the help of visual perception, manipulators are moving robotic harvesting toward human-like operation [38]. Birrell et al. [39] applied a 6-DoF arm and an end-effector with force monitoring, achieving 82% accuracy in classifying vegetables. Ge et al. [40] developed a dedicated manipulator to harvest strawberries in the greenhouse. A Mask-RCNN based vision system was used to localise fruits and model structured obstacles within the workspace; a 74.1% harvesting success rate on ripe strawberries was reported. Sepulveda et al. [41] applied a dual-arm system to harvest eggplants in the laboratory, in which one arm removes occlusions while the other picks fruits simultaneously. Even though these works used information from visual perception to harvest fruits and avoid collisions between robots and environments during the operation, they did not consider vision perception and robotic operations in unstructured environments, which are the most common workspace for robotic operation in fruit orchards. Therefore, a vision perception algorithm that can perform grasping estimation and workspace modelling is urgently needed for robotic harvesting systems to succeed in fields.

III Material and Methodology

III-A Data Collection

The data were collected in apple orchards located in Melbourne, Australia, using a RealSense D435 camera. The data were collected between 10:00 am and 4:00 pm. The distance between the camera and the trees ranged from 0.4 to 1.2 meters. In total, 768 RGB-D images and another 1132 color images were taken in the orchards. The ground truth for detection and segmentation was labelled using the software LabelMe. The ground truth of grasping poses was labelled using our purpose-built software.

III-B Geometry-aware Grasping Estimation

Fig. 2: Working framework of the proposed A3N: Region Proposal Network receives and processes the RGB image, Grasping Estimation Network receives and processes the depth image filtered by generated masks.

A3N includes two modules: a region proposal network to process 2D images and a grasping estimation network to process point clouds, as shown in Figure 2. The region proposal network applies a 2D detector to perform detection and instance segmentation of objects. Then a modified PointNet performs grasping estimation using the point cloud within the segmented instance and the surrounding area of each object. The grasping pose of each fruit is returned as the output for manipulation.

III-B1 Region Proposal Network

Fig. 3: Architecture of the 2D region proposal network, which can extract ROI from RGB images. We base this architecture off of ResNet-50 + FPN (PANet)

RGB images provide dense features of objects’ appearance compared to sparse 3D data. A3N takes advantage of 2D deep-learning methods, using a YOLACT model [42], to perform instance segmentation in a one-stage detector without an explicit feature localisation step. Figure 3 shows the architecture of the model. YOLACT performs instance segmentation as two parallel tasks: one produces image-sized semantic masks in the manner of FCN [43], and the other predicts parameters that explicitly segment those semantic masks at the instance level. The YOLACT network includes three branches: a backbone network, a detection branch, and an instance segmentation branch. We use ResNet-50 [44] as the default backbone network, and the base image size is 416 × 416. The feature maps of the C3, C4, and C5 levels of the backbone network are output for further processing. The detection branch has two components: a Feature Pyramid Network (FPN) and a detection head. The FPN uses PANet [45] to fuse multi-scale feature maps from the backbone. PANet receives feature maps from the C3, C4, and C5 levels of the backbone, and each level applies a detection head to predict detections. Each detection head has channels for the confidence score, box localisation, classification, and mask assembly parameters.

Mask Assembly: The instance segmentation branch generates prototype masks using feature maps at the size of the C3 level. The input feature maps can come from the backbone network or PANet. The prototype masks are identical to semantic feature maps in FCN, whose last layer has k channels, where k is the number of mask coefficients. The instance masks are assembled by matrix multiplication of the prototype masks and the mask coefficients. The results are then processed by a sigmoid activation to produce the final masks. The final masks are cropped into instance masks using the objects’ bounding boxes from the detection head.
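The mask assembly step can be illustrated with a small NumPy sketch. It follows the general YOLACT formulation (linear combination of prototypes, sigmoid, crop by box) rather than the authors' implementation; the array shapes, the 0.5 threshold, and the function name `assemble_masks` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_masks(prototypes, coefficients, boxes, threshold=0.5):
    """Combine prototype masks into per-instance masks.

    prototypes:   (h, w, k) prototype masks (assumed layout)
    coefficients: (n, k) mask coefficients, one row per detected instance
    boxes:        (n, 4) boxes as x1, y1, x2, y2 in prototype-pixel units
    """
    h, w, _ = prototypes.shape
    # Matrix multiplication followed by sigmoid gives (h, w, n) soft masks
    soft = sigmoid(prototypes @ coefficients.T)
    masks = np.zeros((len(boxes), h, w), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        # Crop: keep only pixels inside the instance's bounding box
        masks[i, y1:y2, x1:x2] = soft[y1:y2, x1:x2, i] > threshold
    return masks

rng = np.random.default_rng(0)
protos = rng.normal(size=(52, 52, 32))   # hypothetical 52 x 52 prototypes, k = 32
coeffs = rng.normal(size=(2, 32))        # two hypothetical detections
boxes = np.array([[5, 5, 20, 20], [30, 10, 50, 40]])
instance_masks = assemble_masks(protos, coeffs, boxes)
print(instance_masks.shape)
```

Cropping with the predicted box is what turns the image-sized combination into an instance mask, which is also why a box that covers two fruits leads to a merged mask.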

III-B2 Grasping Estimation Network

Fig. 4: Working framework of the grasping estimation network: both objects’ and non-objects’ points are utilized in the separate subnet

From the previous step, a set of bounding boxes and instance masks of objects in image space are obtained. A grasping estimation network is then applied to estimate the grasping pose of each fruit using the PointNet-based network.

PointNet architecture: A point set has two essential properties, order invariance and transform invariance, which means that the properties of a point cloud do not change when the point order or the object’s pose is altered. For the first property, PointNet uses a symmetric function to extract a feature vector that is invariant to point order. For the second property, PointNet uses a Spatial Transform Net (STN) [46] to predict an affine transformation matrix directly on the raw input points. Moreover, PointNet uses a shared Multi-Layer Perceptron (MLP) to aggregate geometry features at local and global scales. PointNet has a large advantage in computational complexity, as it enables 2D network architectures to learn from 3D data without sacrificing data resolution.
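The order invariance obtained from a symmetric function can be checked with a minimal NumPy sketch. The single-layer shared MLP, the random weights, and the feature width of 16 are placeholders chosen for illustration and are not the network described here.

```python
import numpy as np

def shared_mlp(points, weights, bias):
    # The same weights are applied to every point independently (ReLU activation)
    return np.maximum(points @ weights + bias, 0.0)

def global_feature(points, weights, bias):
    # Max pooling over the point dimension is a symmetric function:
    # its result does not depend on the order of the points
    return shared_mlp(points, weights, bias).max(axis=0)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))          # hypothetical point set
W = rng.normal(size=(3, 16))
b = rng.normal(size=16)

shuffled = pts[rng.permutation(len(pts))]
order_invariant = np.allclose(global_feature(pts, W, b),
                              global_feature(shuffled, W, b))
print(order_invariant)
```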

Geometry-aware Grasping Estimation: A modified PointNet is used to estimate the grasping pose of each fruit, as shown in Figure 4. The modified model has two subnets that receive the points of objects and non-objects, respectively. Each subnet has five blocks: one STN followed by a convolution layer (MLP) that converts the point set (n × 3, where n is the number of points in the set) into per-point features. Then another STN is used to estimate a proper transform in feature space, followed by another two convolution layers, which output a feature vector with the size of n × m (m is set to 256). Lastly, a symmetric function, maximum pooling, is applied along the first dimension of the feature vector, producing one feature vector per subnet. The two output feature vectors are resized and concatenated into a single feature vector, named the global feature vector. After that, three fully connected layers are used to generate the grasping pose bounding box of each fruit.

Grasping Pose Representation: We use Euler angles to represent the orientation of the grasping poses. To ensure the safety of the robot system, the value of each predicted angle is restricted to a bounded range. To ensure the convergence of network training, we align a local coordinate frame at the centre of each point set. The network estimates offsets along the X-, Y-, and Z-axes to obtain the real location of each object’s centre.
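A minimal sketch of the local-frame normalisation and centre recovery described above is given here. The use of the centroid as the local origin, the function names, and the 45-degree limit are illustrative assumptions; the actual angle range used by the authors is not reproduced in this text.

```python
import numpy as np

def to_local_frame(points):
    # Shift the point set so that its centroid becomes the local origin
    centroid = points.mean(axis=0)
    return points - centroid, centroid

def recover_centre(centroid, predicted_offset):
    # The network predicts X, Y, Z offsets relative to the local origin
    return centroid + predicted_offset

def clamp_angles(angles_deg, limit_deg):
    # Restrict predicted Euler angles to a symmetric range (limit is a placeholder)
    return np.clip(angles_deg, -limit_deg, limit_deg)

rng = np.random.default_rng(1)
pts = rng.normal(loc=[0.1, -0.05, 0.6], scale=0.02, size=(200, 3))
local_pts, centroid = to_local_frame(pts)
centre = recover_centre(centroid, np.array([0.01, 0.0, -0.02]))  # hypothetical offset
angles = clamp_angles(np.array([-60.0, 10.0, 80.0]), limit_deg=45.0)
print(centre, angles)
```

Working in a local frame keeps the regression targets small and centred, which is the usual motivation for this kind of normalisation.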

Point Cloud Processing: The 3D point set of each instance mask can be computed using the camera projection matrix. The intrinsic and extrinsic matrices of the RGB camera are calibrated before implementation. The computed points are divided into two sets, object points and non-object points, based on the instance masks. For the non-object points, only the points within 0.3 m of the object’s centre are retained. Then, a voxel-sampling algorithm is applied to re-sample the points to a given resolution.
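The point cloud preparation can be sketched as follows, assuming a simple pinhole camera model with depth aligned to the RGB image. The intrinsic values, the voxel size, and the use of the object centroid as its centre are assumptions for illustration; only the 0.3 m radius comes from the text.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Convert a depth image (H, W) in metres into (H, W, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def split_points(points, mask, radius=0.3):
    """Split points into object and nearby non-object sets using an instance mask."""
    valid = points[..., 2] > 0                 # drop pixels without depth
    obj = points[mask & valid]
    centre = obj.mean(axis=0)                  # assumes the mask is not empty
    bg = points[~mask & valid]
    bg = bg[np.linalg.norm(bg - centre, axis=1) <= radius]
    return obj, bg

def voxel_downsample(points, voxel=0.02):
    """Keep one point per occupied voxel (first point encountered)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

depth = np.full((48, 64), 0.5)                 # synthetic flat scene at 0.5 m
mask = np.zeros((48, 64), dtype=bool)
mask[20:28, 28:36] = True                      # hypothetical instance mask
pts = back_project(depth, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
obj_pts, bg_pts = split_points(pts, mask)
bg_sampled = voxel_downsample(bg_pts)
print(len(obj_pts), len(bg_pts), len(bg_sampled))
```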

III-B3 Network Training

We train YOLACT and the grasping estimation network separately. Three losses are used to train YOLACT: a classification loss, a box regression loss, and a mask loss. The classification and box regression losses use binary cross-entropy, as only two classes are involved, while the mask loss uses an L2 loss. The weights of the classification, box regression, and mask losses are 1.0, 1.0, and 2.5, respectively. Training the grasping estimation network involves a box regression loss and a grasping pose regression loss. Both use the smooth-L1 (Huber) loss, with weights of 1.0 and 2.0, respectively.
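The weighted training objective of the grasping estimation network can be sketched in NumPy as below. The weights 1.0 and 2.0 come from the text; the transition point `beta=1.0`, the mean reduction, and the function names are assumptions.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Quadratic for small errors, linear for large ones
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def grasp_loss(box_pred, box_gt, pose_pred, pose_gt, w_box=1.0, w_pose=2.0):
    # Weighted sum of box regression and grasping pose regression terms
    return w_box * smooth_l1(box_pred, box_gt) + w_pose * smooth_l1(pose_pred, pose_gt)

total = grasp_loss(box_pred=[0.5], box_gt=[0.0], pose_pred=[2.0], pose_gt=[0.0])
print(total)
```

With `beta=1.0` this form coincides with the Huber loss with a threshold of 1.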

III-C Robotic Manipulations

Fig. 5: The robotic harvesting system integration with both base camera and eye-on-hand. Working framework of the global-to-local sensing strategy: the A3N performs global sensing first to obtain the position of targets and construct the scene of the workspace, which guides eye-on-hand local sensing to refine the detection and execute the grasping.

The harvesting robot mainly includes a mobile base, an industrial robotic arm (UR5), a vision system, and a soft end-effector, as shown in Figure 5. The vision system includes two Intel RealSense RGB-D cameras: one on the base and the other on the end-effector. Both cameras are connected to the central computer, an NVIDIA Jetson-TX2. The control framework is implemented on Robot Operating System (ROS) Melodic.

Global-to-local strategy: The harvesting system is equipped with two RGB-D cameras, on the base and the end-effector respectively, as shown in Figure 5. Recent studies [47, 48] show that the accuracy of consumer RGB-D cameras, such as the RealSense D435, degrades significantly when the distance exceeds 1 meter. However, the field of view of the camera is largely limited if the distance is too short. Therefore, a global-to-local sensing strategy is implemented here. The base camera is first used to scan all fruits globally and initialise a coarse workspace model for manipulation planning. The manipulator is then moved to each target fruit in sequence, and the eye-on-hand camera is used to update an accurate local model. In this manner, the workspace model and fruit poses are gradually refined, ensuring the accuracy and efficiency of harvesting.

Workspace modelling: Workspace modelling is an essential step for robotic operations in field environments, as obstacles in the environment can heavily affect the manipulation. OctoMap [19], which uses an octree to hierarchically subdivide the occupied space into voxels of a given size, is applied to turn collision points into an occupancy grid (the resolution is set to 5 cm). During the operation, joint state messages from the robot arm are used to register each frame of the point cloud into the map. With the global-to-local sensing strategy, an OctoMap with fine details is finally obtained for manipulation planning.
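The idea of registering successive point cloud frames into a 5 cm occupancy grid can be illustrated with a deliberately simplified sketch. It uses a flat set of voxel indices instead of an octree and has no probabilistic updates, so it is not the OctoMap algorithm; the camera pose matrix `T_world_cam` is assumed to be available from the arm's forward kinematics.

```python
import numpy as np

def occupied_voxels(points, resolution=0.05):
    # Quantise each point to the integer index of the voxel that contains it
    idx = np.floor(points / resolution).astype(np.int64)
    return set(map(tuple, idx.tolist()))

def update_map(occupancy, points_cam, T_world_cam, resolution=0.05):
    # Transform camera-frame points into the world frame, then mark their voxels
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    points_world = (T_world_cam @ homo.T).T[:, :3]
    occupancy |= occupied_voxels(points_world, resolution)
    return occupancy

occupancy = set()
frame = np.array([[0.01, 0.01, 0.01],
                  [0.02, 0.03, 0.04],
                  [0.06, 0.00, 0.00]])
update_map(occupancy, frame, np.eye(4))        # identity pose for the example
print(sorted(occupancy))
```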

Manipulation: The MoveIt! framework [49] is used to plan and execute manipulation. In our case, fruit poses are first mapped into configuration space using the inverse kinematics solver Trac-IK [50]. The planning pipeline of MoveIt!, which uses the RRT algorithm in OMPL [51], receives the latest workspace model and the current robot state to plan a collision-free path to the pose goal. Lastly, time-optimal trajectory generation is applied to smooth the trajectory. The trajectory is executed using the Universal Robots ROS driver. The system repeats these steps until all fruits in the current workspace are retrieved.

IV Experiment and Discussion

IV-A Evaluation Methods

We use the F1-score and Intersection over Union (IoU) to evaluate the performance of A3N on detection and instance segmentation, respectively [28]. IoU measures the ratio of the intersection area to the union area of the prediction and the ground truth, and is computed for both detection (bounding boxes) and segmentation (masks). Detected objects with a confidence score and a detection IoU larger than 0.5 are considered true positives. The performance of the grasping estimation network is evaluated using the Root Mean Squared Error (RMSE) of both the centre position (cm) and the angles (°).
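The three metrics can be written out in a few lines of NumPy. This is a generic sketch of the standard definitions, with hypothetical function names, and is not taken from the authors' evaluation code.

```python
import numpy as np

def mask_iou(pred, gt):
    # Intersection over union of two boolean masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall from detection counts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def rmse(pred, gt):
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 1:3] = True
print(mask_iou(pred, gt), f1_score(tp=8, fp=2, fn=2), rmse([1, 2, 3], [1, 2, 5]))
```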

IV-B Evaluation on Instance Segmentation

This section reports the instance segmentation accuracy and detection performance of the region proposal network of A3N. We also evaluate different network configurations and compare them with other detectors.

Implementation details: All models are trained using data collected from the orchards. The network’s backbone adopts weights pre-trained on ImageNet. In training, we first froze the backbone weights and trained the rest of the model for 80 epochs, then trained the whole network for another 40 epochs. The Adam optimiser is applied; the decay rates of the optimiser and the batch-norm layers are set to 0.95 and 0.9, respectively. We use a batch size of 24 on one GPU (11 GB) for the first 80 epochs and 12 for the remaining epochs. Each model is trained three times, and the weights with the highest validation accuracy are used in the evaluation.

Comparison on Configurations: We evaluate the performance of alternative configurations of the region proposal network. Firstly, we train networks with ResNet-50 (R50), ResNet-101 (R101), and MobileNet-v2 (MN) as the backbone. Then, we compare the instance segmentation accuracy of the network using features from C3 or PANet, respectively. We also evaluate the performance of the network with different input image resolutions. Lastly, we compare the accuracy of instance segmentation with different numbers of mask coefficients in the detection head. The experimental results are shown in Tables I and II.

Model     Backbone      Time (ms)   F1      IoU (mask)
A3N-416   R50-C3        35          0.890   0.842
A3N-416   MN-PANet      24          0.873   0.851
A3N-416   R50-PANet     35          0.890   0.873
A3N-480   R50-PANet     53          0.903   0.884
A3N-640   R50-PANet     76          0.897   0.886
A3N-416   R101-PANet    47          0.907   0.882
A3N-480   R101-PANet    75          0.923   0.891
A3N-640   R101-PANet    97          0.923   0.893
TABLE I: Comparison of the performance of different configurations of the region proposal network of A3N

k     IoU (mask)   Time (ms)
8     0.863        34
16    0.868        34
32    0.873        35
64    0.872        37
128   0.870        40
TABLE II: Performance of instance segmentation under various values of k

Comparison with other models: We compare the region proposal network of A3N with YOLO-V4 [52] and Mask-RCNN [53]. Compared to YOLO-V4, our region proposal network applies depth-wise convolution layers instead of standard convolution layers in PANet. We also optimize some other details of the network model accordingly, for example the layer configuration in PANet and the detection head. In addition, we compare the model with Mask-RCNN, which has SOTA accuracy on instance segmentation. Both YOLO-V4 and Mask-RCNN are trained on the collected data using COCO pre-trained weights, and the training parameters of each model are slightly adjusted based on the results. The results are shown in Table III.


Model           Backbone       Time (ms)   Time on TX2 (ms)   F1      IoU (mask)
YOLO-V4-416     CSPD53-PANet   78          592                0.864   N/A
YOLO-V4-480     CSPD53-PANet   106         827                0.886   N/A
Mask-RCNN-640   R50-FPN        122         920                0.857   0.887
Mask-RCNN-640   R101-FPN       157         1285               0.877   0.895
A3N-416         MN-PANet       24          174                0.873   0.851
A3N-416         R50-PANet      35          282                0.890   0.873
A3N-480         R101-PANet     75          598                0.923   0.891
A3N-640         R101-PANet     97          782                0.923   0.893
TABLE III: Comparison of performance among A3N, YOLO-V4 and Mask-RCNN with different architectures.

The performance evaluations of the alternative network configurations are shown in Tables I and II. Firstly, the results suggest that generating prototype masks from PANet features noticeably improves the accuracy of instance segmentation. This is because PANet fuses robust semantic features from deeper levels into the lower level. We also trained the network using different image sizes and backbones. The results suggest that network performance increases with a larger image size or a more accurate backbone, while the inference speed is inevitably reduced, as expected. Lastly, we compare the segmentation accuracy with different numbers of mask coefficients. The results suggest that the region proposal network achieves the best instance segmentation accuracy when k is 32.

As shown in Table III, the region proposal network of A3N offers competitive instance segmentation accuracy compared to Mask-RCNN. A3N-640-R101-PANet achieves a segmentation IoU of 0.893, which is comparable to the performance of Mask-RCNN, while our model is much faster in computation. From the experiments, we observed two common causes of instance segmentation errors. The first is mask leakage, which often occurs when fruits are close to each other, as shown in Figures 6(a) and (b). In this case, the network may fail to accurately segment the boundary of each fruit. The second is bounding box localisation error, as shown in Figure 6(c), where two apples are included in one bounding box, leading to a failure to segment each fruit. These two defects are more likely to occur when the sensing distance is larger. Therefore, an additional perception of fruits at close distance can largely improve segmentation accuracy in our case.

Fig. 6: Inaccuracy of instance segmentation results: (a), (b) mask leakage and (c) inaccurate object localisation.

Computational Speed: The inference time of the networks is tested on an NVIDIA Jetson-TX2 and an NVIDIA RTX-2060 Super. Table III shows that A3N is faster than Mask-RCNN and YOLO-V4. Taking accuracy into account as well, our model achieves similar or even better performance than the SOTA networks, at a faster speed, in fruit detection and segmentation. The Jetson-TX2 is an embedded computer that is widely used in robotic applications. Our A3N-R50-PANet is faster than Mask-RCNN on the Jetson-TX2, with competitive accuracy.

IV-C Evaluation on Grasping Estimation

Fig. 7: Planning scenes, which include obstacles and grasping poses of the detected targets generated by A3N and OctoMap. Red dots represent the targets and green arrows represent the suggested grasping orientations.
Fig. 8: RMSE for both centre (cm) and pose estimation (°) under different scanning distances. Note that all distances have an error range of 5 cm.

Accuracy Test: We first evaluate the accuracy of A3N in fruit grasping estimation. Examples of grasping estimation and the reconstructed workspace are shown in Figure 7. The accuracy of grasping estimation is measured by the RMSE between the predicted pose and the ground truth. The experimental results are shown in Figure 8. Based on the collected test dataset, we further analyze the grasping estimation performance according to the distance at which the images were taken. It can be seen that the centre and angular errors of the estimation are significantly reduced at a closer perception distance. The results also indicate that centre estimation achieves acceptable accuracy within a distance of 0.8 m, while grasping pose estimation achieves accurate results when the distance is about 0.4 m. As the distance increases, the quality and number of points decrease dramatically. Meanwhile, sensory data often include defects (as shown in Figure 9), which also affect the accuracy of estimation. In our experiments, A3N estimates fruit poses with good accuracy in most cases.

Fig. 9: Illustration of data corruption in sensory data in field environments.
Model             Time on TX2 (ms)   Centre RMSE (cm)            Angular RMSE (°)
                                     0.4 m         0.8 m         0.4 m
A3N-416(MN)       207                1.35 ± 0.41   1.8 ± 0.73    6.6 ± 3.1
A3N-416(C3)       322                1.15 ± 0.37   1.7 ± 0.62    5.8 ± 2.5
A3N-416(PANet)    322                0.61 ± 0.25   1.06 ± 0.4    4.8 ± 2.2
A3N-640(R101)     816                0.53 ± 0.22   0.97 ± 0.35   4.4 ± 1.8
Mask-RCNN(R101)   1357               0.51 ± 0.22   0.99 ± 0.38   4.3 ± 1.6
TABLE IV: Comparison of inference time and RMSE of centre (cm) and angular (°) estimation, as affected by the quality of instance segmentation generated by different networks.

Influence of segmentation: We report the influence of instance segmentation quality on grasping estimation in Table IV. The results show that higher instance segmentation accuracy improves grasping estimation. Grasping estimation using the instance segmentation of Mask-RCNN achieves the highest accuracy with smaller variances. However, grasping estimation using A3N-416-PANet (the third row of Table IV) also achieves competitive results at a 4.2× faster speed. Therefore, we use A3N-416-PANet as the region proposal network to balance accuracy and speed.

Distance   None    Missing                  Outlier
Centre (cm)
0.4 m      0.61    0.67    0.94    1.4      0.65    0.69    0.82
0.8 m      1.06    1.22    1.48    1.97     1.16    1.27    1.43
Angular (°)
0.4 m      4.8     5.2     6.6     10.4     5.1     6.2     8.6
0.8 m      10.5    11.5    14.2    20.6     11.3    13.7    18.5
TABLE V: Comparison of performance among different input corruptions

Robustness Test: To evaluate the robustness of the proposed algorithm, we test the accuracy of A3N under various input corruptions, as shown in Table V. The results show that fruit centre prediction is more robust than grasping pose estimation. At a distance of 0.8 meter, A3N still achieves high accuracy on fruit centre estimation when there are 40 missing points or 40 outlier points. This ensures that the robotic arm can move accurately to a close position to perform another, more precise perception. Grasping pose estimation at 0.4 meter shows high robustness to missing points and outliers, which enables the robot to detach fruits in a proper orientation.
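The two corruption types in the robustness test can be simulated with a short NumPy sketch. Whether the corruption levels in Table V are point counts or proportions is not stated here, so the sketch takes counts; the uniform outlier distribution and the `spread` margin are also assumptions made for illustration.

```python
import numpy as np

def drop_points(points, n_drop, rng):
    # Randomly remove n_drop points to simulate missing depth measurements
    keep = rng.choice(len(points), size=len(points) - n_drop, replace=False)
    return points[keep]

def add_outliers(points, n_out, rng, spread=0.3):
    # Append n_out points drawn uniformly from an enlarged bounding box
    lo = points.min(axis=0) - spread
    hi = points.max(axis=0) + spread
    outliers = rng.uniform(lo, hi, size=(n_out, 3))
    return np.vstack([points, outliers])

rng = np.random.default_rng(0)
cloud = rng.normal(scale=0.03, size=(400, 3))  # hypothetical fruit point cloud
with_missing = drop_points(cloud, 40, rng)
with_outliers = add_outliers(cloud, 40, rng)
print(with_missing.shape, with_outliers.shape)
```

The corrupted clouds would then be passed through the grasping estimation network and scored with the same RMSE measure as the clean input.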

IV-D Evaluation of A3N in Harvesting

Fig. 10: (a) Workspace of Fankhauser apple farm in Melbourne, (b) local sensing with A3N to refine the detection and workspace reconstruction, (c) grasping of target apple with a compliant end-effector.

This experiment evaluates the performance of A3N in operation. The developed harvesting robot is tested in an apple orchard in Melbourne, as shown in Figure 10. Other errors also arise from system integration, such as the calibration error between the eye (camera) and the arm, manipulation error, and sensing error due to the depth camera. Therefore, it is difficult to evaluate the accuracy of the vision system independently. In the test, we move the manipulator to the estimated pose of each fruit based on the perception results. Then, we manually measure the error between the gripper and the real pose of the fruit. We separately evaluate the robotic system using only global scanning, only local scanning, and global-to-local scanning. The results are shown in Table VI.

Method   Num   RMSE (°)   RMSE (cm)   Success rate (%)
Global   16    13.7       3.9         56.3
Local    5     6.1        1.4         80
G-to-L   16    6.5        1.5         81.3
TABLE VI: Comparison of the average experimental results of the global, local, and global-to-local scanning strategies. Num represents the number of objects detected, and Success rate represents the success rate of apple harvesting.

As shown in the results, both the centre and angular errors of the whole system are within the tolerance range of the gripper operation (maximum 3 cm). Global scanning maximizes the number of fruits detected within the view. The subsequent local perception significantly improves the accuracy of grasping estimation for each fruit. Using local scanning alone requires multiple scans, with the manipulator moved to different locations, which lowers the efficiency of the system. Our robotic system achieves an average harvesting success rate of 70% to 85% in operation, depending on the complexity of the workspace.
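The global-to-local strategy can be summarised as a simple control loop: one wide scan to find candidates, then a close-range scan per fruit before grasping. The sketch below is a schematic under stated assumptions, not the authors' control software: the `Pose` layout, the four callables, and the stub values used in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical pose layout: (x, y, z in metres, approach angle in degrees).
Pose = Tuple[float, float, float, float]

@dataclass
class HarvestLog:
    attempted: int = 0
    succeeded: int = 0

def global_to_local_harvest(
    global_scan: Callable[[], List[Pose]],
    move_near: Callable[[Pose], None],
    local_scan: Callable[[Pose], Pose],
    grasp: Callable[[Pose], bool],
) -> HarvestLog:
    """Detect all fruits once from afar, then refine each pose up close."""
    log = HarvestLog()
    for coarse in global_scan():       # one wide scan finds every candidate
        move_near(coarse)              # approach using the coarse estimate
        refined = local_scan(coarse)   # close-range scan refines the pose
        log.attempted += 1
        if grasp(refined):
            log.succeeded += 1
    return log

# Usage with stub callables standing in for perception and motion.
visited: List[Pose] = []
log = global_to_local_harvest(
    global_scan=lambda: [(0.1, 0.0, 0.8, 10.0), (0.3, 0.1, 0.9, -5.0)],
    move_near=visited.append,
    local_scan=lambda p: (p[0], p[1], p[2] - 0.4, p[3]),
    grasp=lambda p: p[2] < 0.45,
)
print(log)  # HarvestLog(attempted=2, succeeded=1)
```

Passing the scanning, motion, and grasping steps in as callables keeps the loop independent of any particular camera or arm driver, which makes it easy to compare against a global-only or local-only variant.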

V Conclusion and future work

This work presents a geometry-aware detection network for apple harvesting applications, comprising a region proposal network and a grasping estimation network. The former performs fruit detection and instance segmentation, while the latter predicts the 3D bounding box, the centre of the fruit, and the appropriate approach angle for grasping. Fruit detection and instance segmentation follow the work of YOLACT, which combines these two tasks in one step, while pose estimation is developed and improved based on PointNet. With the proposed A3N network, an F1-score of 0.890 is achieved for apple detection, and an accuracy of 0.873 is recorded for apple instance segmentation. The global and local scanning strategy achieves an RMSE of 0.61 cm for centre estimation and 4.8° for angle estimation. Finally, a global-to-local scanning strategy is proposed and experimentally validated, which provides valuable guidance for the robot. An overall harvesting success rate of 70% to 85% is achieved on various natural orchard scenes.

Many challenges remain for current field robotic harvesting systems. For example, the accuracy of the RGB-D camera is significantly affected by sunlight. Alternative sensors, such as LiDAR, could be used to improve the accuracy of perception at the source. In addition, the path planning of the arm could be improved to generate more human-like grasping behaviour. Such improvements are expected to significantly increase the efficiency and success rate of robotic harvesting in the field.


We gratefully acknowledge the financial support from the Australian Research Council (ARC ITRH IH150100006). We would also like to thank Dr. Shao Liu and Dr. Lilian Khaw at Monash University for the language check.


  • [1] P. Li, S.-h. Lee, H.-Y. Hsu, Review on fruit harvesting method for potential use of automatic fruit harvesting systems, Procedia Engineering 23 (2011) 351–366.
  • [2] Y. Sarig, Robotics of fruit harvesting: A state-of-the-art review, Journal of agricultural engineering research 54 (4) (1993) 265–280.
  • [3] Y. Zhao, L. Gong, Y. Huang, C. Liu, A review of key techniques of vision-based control for harvesting robot, Computers and Electronics in Agriculture 127 (2016) 311–323.
  • [4] D. Font, T. Pallejà, M. Tresanchez, D. Runcan, J. Moreno, D. Martínez, M. Teixidó, J. Palacín, A proposal for automatic fruit harvesting by combining a low cost stereovision camera and a robotic arm, Sensors 14 (7) (2014) 11557–11579.
  • [5] H. Zhou, X. Wang, W. Au, H. Kang, C. Chen, Intelligent robots for fruit harvesting: Recent developments and future challenges (2021).
  • [6] X. Wang, A. Khara, C. Chen, A soft pneumatic bistable reinforced actuator bioinspired by venus flytrap with enhanced grasping capability, Bioinspiration & Biomimetics 15 (5) (2020) 056017.
  • [7] A. Gongal, S. Amatya, M. Karkee, Q. Zhang, K. Lewis, Sensors and systems for fruit detection and localization: A review, Computers and Electronics in Agriculture 116 (2015) 8–19.
  • [8] Y. Chen, X. An, S. Gao, S. Li, H. Kang, A deep learning-based vision system combining detection and tracking for fast on-line citrus sorting, Frontiers in Plant Science 12 (2021) 171.
  • [9] H. Kang, C. Chen, Fast implementation of real-time fruit detection in apple orchards using deep learning, Computers and Electronics in Agriculture 168 (2020) 105108.
  • [10] G. Lin, Y. Tang, X. Zou, J. Xiong, Y. Fang, Color-, depth-, and shape-based 3d fruit detection, Precision Agriculture 21 (1) (2020) 1–17.
  • [11] N. Guo, B. Zhang, J. Zhou, K. Zhan, S. Lai, Pose estimation and adaptable grasp configuration with point cloud registration and geometry understanding for fruit grasp planning, Computers and Electronics in Agriculture 179 (2020) 105818.
  • [12] H. Li, Q. Zhu, M. Huang, Y. Guo, J. Qin, Pose estimation of sweet pepper through symmetry axis detection, Sensors 18 (9) (2018) 3083.
  • [13] G. Lin, Y. Tang, X. Zou, J. Xiong, J. Li, Guava detection and pose estimation using a low-cost rgb-d sensor in the field, Sensors 19 (2) (2019) 428.
  • [14] P. Eizentals, K. Oka, 3d pose estimation of green pepper fruit for automated harvesting, Computers and Electronics in Agriculture 128 (2016) 127–140.
  • [15] X. Wang, H. Zhou, H. Kang, W. Au, C. Chen, Bio-inspired soft bistable actuator with dual actuations, Smart Materials and Structures (2021).
  • [16] H. Kang, H. Zhou, X. Wang, C. Chen, Real-time fruit recognition and grasping estimation for robotic apple harvesting, Sensors 20 (19) (2020) 5670.
  • [17] H. Kang, H. Zhou, C. Chen, Visual perception and modeling for autonomous apple harvesting, IEEE Access 8 (2020) 62151–62163.
  • [18] C. R. Qi, W. Liu, C. Wu, H. Su, L. J. Guibas, Frustum pointnets for 3d object detection from rgb-d data, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 918–927.
  • [19] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, W. Burgard, Octomap: An efficient probabilistic 3d mapping framework based on octrees, Autonomous robots 34 (3) (2013) 189–206.
  • [20] A. Voulodimos, N. Doulamis, A. Doulamis, E. Protopapadakis, Deep learning for computer vision: A brief review, Computational intelligence and neuroscience 2018 (2018).
  • [21] C. Blehm, S. Vishnu, A. Khattak, S. Mitra, R. W. Yee, Computer vision syndrome: a review, Survey of ophthalmology 50 (3) (2005) 253–262.
  • [22] A. Kamilaris, F. X. Prenafeta-Boldú, Deep learning in agriculture: A survey, Computers and electronics in agriculture 147 (2018) 70–90.
  • [23] F. Gao, L. Fu, X. Zhang, Y. Majeed, R. Li, M. Karkee, Q. Zhang, Multi-class fruit-on-plant detection for apple in snap system using faster r-cnn, Computers and Electronics in Agriculture 176 (2020) 105634.
  • [24] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
  • [25] M. Simon, S. Milz, K. Amende, H.-M. Gross, Complex-yolo: An euler-region-proposal for real-time 3d object detection on point clouds, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
  • [26] Y. Yu, K. Zhang, L. Yang, D. Zhang, Fruit detection for strawberry harvesting robot in non-structural environment based on mask-rcnn, Computers and Electronics in Agriculture 163 (2019) 104846.
  • [27] Y. Tian, G. Yang, Z. Wang, H. Wang, E. Li, Z. Liang, Apple detection during different growth stages in orchards using the improved yolo-v3 model, Computers and electronics in agriculture 157 (2019) 417–426.
  • [28] H. Kang, C. Chen, Fruit detection, segmentation and 3d visualisation of environments in apple orchards, Computers and Electronics in Agriculture 171 (2020) 105302.
  • [29] C. Lehnert, I. Sa, C. McCool, B. Upcroft, T. Perez, Sweet pepper pose detection and grasping for automated crop harvesting, in: 2016 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2016, pp. 2428–2434.
  • [30] D. Droeschel, S. Behnke, 3d body pose estimation using an adaptive person model for articulated icp, in: International Conference on Intelligent Robotics and Applications, Springer, 2011, pp. 157–167.
  • [31] G. Du, K. Wang, S. Lian, K. Zhao, Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review, Artificial Intelligence Review 54 (3) (2021) 1677–1734.
  • [32] M. Zhu, K. G. Derpanis, Y. Yang, S. Brahmbhatt, M. Zhang, C. Phillips, M. Lecce, K. Daniilidis, Single image 3d object detection and pose estimation for grasping, in: 2014 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2014, pp. 3936–3943.
  • [33] C. R. Qi, H. Su, K. Mo, L. J. Guibas, Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [34] H.-S. Fang, C. Wang, M. Gou, C. Lu, Graspnet-1billion: A large-scale benchmark for general object grasping, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11444–11453.
  • [35] Y. Zhou, O. Tuzel, Voxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
  • [36] R. Barth, J. Hemming, E. J. Van Henten, Angle estimation between plant parts for grasp optimisation in harvest robots, Biosystems Engineering 183 (2019) 26–46.
  • [37] S. Hayashi, K. Shigematsu, S. Yamamoto, K. Kobayashi, Y. Kohno, J. Kamata, M. Kurita, Evaluation of a strawberry-harvesting robot in a field test, Biosystems engineering 105 (2) (2010) 160–171.
  • [38] L. F. Oliveira, A. P. Moreira, M. F. Silva, Advances in agriculture robotics: A state-of-the-art review and challenges ahead, Robotics 10 (2) (2021) 52.
  • [39] S. Birrell, J. Hughes, J. Y. Cai, F. Iida, A field-tested robotic harvesting system for iceberg lettuce, Journal of Field Robotics 37 (2) (2020) 225–245.
  • [40] Y. Ge, Y. Xiong, G. L. Tenorio, P. J. From, Fruit localization and environment perception for strawberry harvesting robots, IEEE Access 7 (2019) 147642–147652.
  • [41] D. Sepúlveda, R. Fernández, E. Navas, M. Armada, P. González-De-Santos, Robotic aubergine harvesting using dual-arm manipulation, IEEE Access 8 (2020) 121889–121904.
  • [42] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact: Real-time instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.
  • [43] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [44] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [45] K. Wang, J. H. Liew, Y. Zou, D. Zhou, J. Feng, Panet: Few-shot image semantic segmentation with prototype alignment, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9197–9206.
  • [46] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, Advances in neural information processing systems 28 (2015) 2017–2025.
  • [47] L. Fu, F. Gao, J. Wu, R. Li, M. Karkee, Q. Zhang, Application of consumer rgb-d cameras for fruit detection and localization in field: A critical review, Computers and Electronics in Agriculture 177 (2020) 105687.
  • [48] C. Neupane, A. Koirala, Z. Wang, K. B. Walsh, Evaluation of depth cameras for use in fruit localization and sizing: Finding a successor to kinect v2, Agronomy 11 (9) (2021) 1780.
  • [49] S. Chitta, Moveit!: an introduction, in: Robot Operating System (ROS), Springer, 2016, pp. 3–27.
  • [50] P. Beeson, B. Ames, Trac-ik: An open-source library for improved solving of generic inverse kinematics, in: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), IEEE, 2015, pp. 928–935.
  • [51] I. A. Sucan, M. Moll, L. E. Kavraki, The open motion planning library, IEEE Robotics & Automation Magazine 19 (4) (2012) 72–82.
  • [52] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, Yolov4: Optimal speed and accuracy of object detection, arXiv (2020).
  • [53] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.