Agricultural robotics is a rapidly moving research field due to advances in computer vision and machine learning, and increased agricultural demand. Research into horticultural automation has been exploring the potential to automate the monitoring of fruit [19, 7, 18] to estimate phenotypic parameters such as yield and ripeness. Harvesting in ad-hoc cropping systems and commercial glasshouses has also been investigated; these techniques generally use mobile platforms and robot arms capable of locally sensing, mapping, and cutting fruits. However, there is still a considerable gap between farming requirements and available technology, specifically in terms of the sensing capabilities needed to tackle these challenging and dynamic environments.
This paper introduces PATHoBot (Phenotyping Autonomous Trolley for Horticulture), an automated robotic platform capable of operating in protected cropping environments. The system captures color (RGB), depth (D), and near-infrared (NIR) data of the cropping system using a custom camera array. Platform motion information is also obtained using encoders and a tracking camera. This information is suitable to perform in-situ surveillance and to estimate phenotypic indices. The platform also carries a robot arm intended for close proximity surveying and interaction. To evaluate the viability of PATHoBot we capture two species of fruit in a commercial glasshouse setting: sweet pepper and tomato. We demonstrate the utility of PATHoBot by presenting two surveillance tasks.
First, we use the platform to improve on previous work on automated fruit counting by leveraging information made available by this robot. We employ existing models trained in a similar domain on a different cultivar (and of a different color) to perform “in the wild” tracking-via-segmentation to count the number of fruit seen by the platform. We investigate the use of the depth and motion information from the platform for both mask re-projection and distance-based filtering. This exemplifies how spatio-temporal information from this platform can be used to enhance existing techniques.
Second, we demonstrate our system’s ability to survey a whole crop-row on-the-fly by producing visualisations offline from the gathered data. We achieve this by concatenating point-clouds from our RGB-D sensor array and using odometry data to show them in a common reference frame as a single map.
The main contributions of this work are:
A ROS enabled, crop surveying platform running in a commercial glasshouse environment;
A sensing approach to scan entire crop rows on-the-fly;
A multi-modal vision-based fruit counting approach exploiting spatial-temporal information made available by this platform;
3D maps generated from the gathered datasets, enabled by our on-the-fly surveying method.
We present the related work in Sec. II, and the cropping system and our platform design considerations in Sec. III. Sec. IV presents our fruit counting and mapping experiments enabled by this platform, and results are shown in Sec. V.
II Related Work
Multiple research groups have tackled the task of automating crop production in glasshouses. Belforte et al. built a custom 3-DOF robot arm to show the potential use of robots in glasshouse environments. The arm was used as a proof of concept for a number of tasks through predefined motions.
Polder et al. retro-fitted a spraying platform into a phenotyping robot for commercial glasshouses by allowing it to drive on the heating pipe-rails. It carried a single multi-focus plenoptic camera to produce registered RGB-D images, successfully aiding color-based crop detection by depth filtering. However, the system was very sensitive to illumination changes. More recently, another system also consisted of a pipe-rail trolley with a scissor-lift platform, which carried an arm to perform local crop semantic segmentation to find a harvesting path. This system operated autonomously inside crop-rows once snapped to the rails by an operator. A further arm-based approach uses a multi-camera array to obtain multiple samples of the crop from different viewpoints.
In 2017, Lehnert et al. presented a robot capable of harvesting sweet pepper in ad-hoc cropping systems. More recently, Halstead et al. acquired a subset of data from field-grown sweet pepper, arranged to suit the robotic system. In both cases, sensing capabilities were limited to local detection and/or 3D mapping of each crop, and the task of autonomously surveying the entire row was left as future work.
One of the main drawbacks of approaches that use a single image sensor mounted on an arm is the need to scan crops multiple times from different viewpoints for complete crop coverage, due to the small sensor FOV and space limitations. We address this by building a sensor array whose combined FOV covers the entire crop vertically, so that the entire row can be covered in a single pass.
Crop surveillance data gathered by robot platforms have also enabled modern computer vision and learning based techniques to automate phenotyping processes as outlined below.
II-A Fruit segmentation and vision-based phenotyping
Sa et al. showed that deep neural network (DNN) approaches produced excellent results for fruit detection by using multi-modal information, namely RGB and NIR images. Halstead et al. built upon this approach to produce quality and yield estimations of sweet pepper from RGB video sequences; however, their approach does not consider depth information. Koirala et al. compared RCNN and YOLO DNNs for on-the-fly mango detection from RGB image sequences.
Other researchers explored supervised-learning, segmentation-based approaches, for example for crop and weed classification. Zabawa et al. also tackled the challenging task of grape segmentation by biasing a DNN architecture to segment the fruit’s edge as a separate class.
Employing depth-aware sensors is beneficial not only for crop size estimation, but also for 3D mapping [13, 1] and as extra information for multi-modal segmentation [4, 20]. Detection/segmentation and 3D mapping techniques can be combined to generate semantic reconstructions, as shown by McCormac et al. These techniques have surged in recent years thanks to developments in autonomous driving and have the potential to be transferred to semantic crop mapping applications as well as fruit and organ-wise phenotyping.
III Platform Design
The PATHoBot platform is designed to operate in commercial glasshouses. This allows it to use existing glasshouse infrastructure and reduce the barriers for grower adoption. To outline the decisions behind our design, we first describe the glasshouse infrastructure and then define the design of the platform.
III-A Cropping System Environment and Infrastructure
Our platform is focused on glasshouse crops and makes use of the University of Bonn’s commercial glasshouse located at Campus Klein Altendorf (CKA). It consists of multiple automated chambers capable of self-regulated heating, shading, ventilation and irrigation. Heating is achieved using pipe-rails mounted on the ground, which are also used to move machinery, see Fig. 1(a). Each chamber consists of 6 rows of hanging substrate trays where the crops are grown and each row is long. Here, glasshouse crops are trellised and can reach several meters in height. Two crops, tomato and sweet pepper, are routinely grown in CKA and so are the focus for this platform.
Cultivated species during this study were sweet pepper Mavera (yellow) and Allrounder (red), and Lyterno RZ F1 tomatoes. Generally, sweet pepper leaves grow larger than the fruit, which they can occlude partially or completely. A further complication with sweet pepper is that juvenile fruit and leaves have a similar green tone, making them difficult to distinguish. For tomatoes, as long as weather conditions are favorable, the stems will grow continuously. This is managed at CKA such that the fruit stays low to simplify cropping, even dangling below their substrates (see Fig. 1(b)). Moreover, the leaves are smaller than sweet pepper ones, making occlusions less frequent, and juvenile tomato fruits have a lighter green tone, making them easier to spot.
At CKA, motorized pipe-rail trolleys, with a manually set height, are employed for crop management tasks (e.g. pruning and harvesting). This infrastructure is also used to measure phenotypic traits using specialized sensors; these tasks are currently time consuming due to a lack of automation. One limitation of the CKA glasshouse is the available space to perform turns at the end of each row as there is less than of free space. Moreover, there are also beams, piping and wires which impose height constraints for operating in these areas. From these environment conditions and the current operations carried out in CKA we derived the following requirements for our robot:
a mobile platform capable of navigating through rows autonomously, carrying all surveying sensors and actuators;
utilise existing glasshouse infrastructure (e.g. pipe-rails);
carry an array of high quality multi-modal sensors, capable of capturing whole plants on-the-fly, and an extra navigation sensor to enable consistent data fusion; and
carry a robot arm capable of reaching and potentially manipulating all crop.
III-B Retrofitted commercial cropping platform
PATHoBot is built around an off-the-shelf glasshouse platform which fits within the limited space at the end of the row. It has a pneumatically actuated scissor-lift and is capable of carrying loads of up to 455 kg and lifting them up to 3 m, making it ideal for mounting a robot arm. We added a collapsible mast upon which the camera array is mounted. It remains upright while scanning crops and is collapsed to avoid the infrastructure present between the rows (see Sec. III-C).
To enable automated operation, we installed a wheel encoder, whose signal is fed to the platform controller. This controller also receives serial commands over USB, and actuates both the drive motor and the scissor-lift. Finally, we have mounted a DC-AC sinusoidal converter to power all on-board systems from the platform batteries and enable flexible expansion.
III-C Sensor Array Design
As shown by Sa et al., employing multi-modal information improves crop detection results, in particular fusing RGB and NIR information for sweet pepper detection. Furthermore, depth or 3D information is useful for fruit size estimation and to distinguish between elements in the scene (e.g. crop and leaves).
To obtain RGB, NIR, and depth information we chose to use Intel RealSense D435i cameras, as they use NIR images to estimate stereo depth, which can be registered to the RGB images. Due to the abundance of texture in the scene the camera IR projector is not required.
Fig. 3 shows the general structure of the camera arrays on the machine: the three upper cameras are used for sweet pepper and the three lower cameras are used for tomatoes (due to the low-lying fruit). We ensure a FOV superposition between neighbouring cameras of approx. at the average crop stem depth. Horizontally we space the cameras away from the heating rails, creating greater scene coverage.
As stated in Sec. III-B, we fit PATHoBot with a collapsible mast to bypass obstacles; only the top three sensors are mounted on this mast. Finally, we mount a plastic shield to the mast to prevent damaging crops, and avoid lens occlusions from leaves. We also mounted an Intel RealSense T265 tracking camera (see Fig. 1), which provides 6-DoF pose estimates on-chip. This camera will be used to bootstrap consistent 3D mapping and serves as a baseline for any other SLAM algorithms that might be tested in the future.
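The relation between camera spacing, field of view and overlap at the crop depth can be illustrated with simple pinhole geometry. The following is a minimal sketch, assuming identical, parallel cameras stacked vertically; the function names and all numeric values are hypothetical and are not taken from the platform.

```python
import math


def vertical_coverage(fov_deg: float, depth_m: float) -> float:
    """Height (m) of the scene strip seen by one camera at a given depth."""
    return 2.0 * depth_m * math.tan(math.radians(fov_deg) / 2.0)


def overlap_fraction(fov_deg: float, depth_m: float, spacing_m: float) -> float:
    """Fraction of one camera's coverage that is shared with its neighbour.

    Assumes two identical, parallel cameras mounted `spacing_m` apart.
    Returns 0.0 when the two views do not overlap at this depth.
    """
    coverage = vertical_coverage(fov_deg, depth_m)
    return max(0.0, (coverage - spacing_m) / coverage)


# Hypothetical values, for illustration only.
example_overlap = overlap_fraction(fov_deg=60.0, depth_m=0.5, spacing_m=0.4)
```

Because the coverage grows linearly with depth, the overlap has to be checked at the depth of interest (here, the crop stems) rather than at the sensor.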
III-D On-Board Robot Arm
A UR5e robot arm from Universal Robots is mounted on the scissor-lift platform. It has a reach of and can be equipped with payloads of up to 5 kg. Combined with the scissor-lift’s maximum height, the arm is able to reach the entire crop. Currently, an Intel RealSense L515 LiDAR camera is used as the sensor on the tip of the arm. The camera is used in combination with a viewpoint planning approach to enable more complete 3D models of the plants by maneuvering the sensor around in order to avoid occlusions by leaves.
To complement the existing camera, other sensors that require close proximity to parts of the plant could also be used in the future. For example, fluorometers can be used to measure plant physiological indices such as chlorophyll, flavonol, and anthocyanin content in leaves and fruits. Additional intervention-based tools could also be mounted on the arm for autonomous interaction.
The arm’s control box is also connected to the platform’s wired network, and a remote control program is installed on it. Universal Robots provides an open source ROS driver which can interact with this controller via standard ROS interfaces such as MoveIt!. The driver is installed on the on-board computer of the platform, and can be launched to enable control of the arm by any node in the ROS network.
III-E Software Infrastructure
The platform has an on-board computer running ROS. This computer communicates with the platform’s USB controller, exposing its state and receiving commands through the ROS network. To enable remote operations, we employ an on-board Wi-Fi router to ensure reliable communication with remote stations; however, only non-critical mission data is exchanged through this link for safety reasons.
As explained in Sec. III-C, only 3 of the cameras will be recording simultaneously. From each camera, we record the RGB, NIR and depth images (at , 1280x720) as well as IMU data (gyro at , and accel at ) which amounts to a total bandwidth of approx . We performed extensive testing of multiple hardware configurations (for these 3 cameras) and found that a dedicated PCIe card was required to ensure no frame drops. To ensure synchronization, the 3 cameras were triggered using their built-in triggering hardware.
The on-board computer is a small form factor desktop PC equipped with a graphics processing unit (GPU). To enable running computationally demanding algorithms on-board, such as fruit detection, 3D mapping and navigation, the system includes an i7-10700K CPU, 32 GB of RAM and an NVidia GeForce RTX 2080Ti.
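To give a feel for how the recording bandwidth scales with resolution and frame rate, the sketch below estimates the raw (uncompressed) image bandwidth of a camera array. The pixel formats (3 bytes for RGB, 2 bytes for depth, 1 byte for a single NIR image) and the frame rate used in the example are assumptions made for illustration, and the comparatively small IMU streams are ignored.

```python
# Assumed bytes per pixel for each recorded stream (illustrative only).
BYTES_PER_PIXEL = {"rgb": 3, "depth": 2, "nir": 1}


def stream_bandwidth_mb_s(width: int, height: int, fps: float, bytes_per_pixel: int) -> float:
    """Raw bandwidth of one image stream in megabytes per second."""
    return width * height * fps * bytes_per_pixel / 1e6


def array_bandwidth_mb_s(n_cameras: int, width: int, height: int, fps: float) -> float:
    """Raw bandwidth of all image streams of all cameras in MB/s."""
    per_camera = sum(
        stream_bandwidth_mb_s(width, height, fps, bpp)
        for bpp in BYTES_PER_PIXEL.values()
    )
    return n_cameras * per_camera


# Three cameras at 1280x720; the frame rate of 30 is a hypothetical value.
example_bandwidth = array_bandwidth_mb_s(n_cameras=3, width=1280, height=720, fps=30)
```

An estimate of this kind helps explain why a dedicated interface card can be needed: the aggregate rate of several uncompressed streams can approach the practical limit of a shared USB host controller.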
IV Platform surveillance enabled approaches
The following sections describe phenotyping-related applications enabled by PATHoBot’s sensing capabilities. We first exploit spatial-temporal information for counting sweet pepper in a row, and then we perform autonomous crop 3D mapping.
We scan sweet pepper at a speed of approximately . At this speed, scanning a whole chamber takes approximately mins, making it suitable for frequent surveying. We captured a number of datasets over a three-month period with different separations between captures. We select the final two captured datasets to evaluate our platform-enabled surveillance technique. These datasets contain RGB, NIR, and depth images along with wheel encoder information as outlined in Section III-E.
IV-A Fruit Counting
To improve the spatial awareness of our previous work on fruit counting, we investigate adding two components to the prior IoU tracking strategy, both of which exploit spatial-temporal information captured by this platform. First, we re-project detection tracklets for comparison with new detections, creating a more precise estimate of where the fruit should appear in the next image. This leads to higher IoU values between the tracklet and associated detections. Second, we further exploit depth information to filter objects outside of a specific range. As discussed in Section III, detecting/segmenting sweet pepper in a commercial glasshouse is a challenging task due to occlusion, juvenile sweet pepper, similar appearance to leaves, and strong illumination variation. We aim to count all of the sweet pepper in a row recorded by the RGB-D camera array, using only the second camera from the top. This involves a multi-stage process including instance segmentation utilising Mask-RCNN and tracking-via-segmentation based on our previous work.
Our Mask-RCNN network is trained on recently released sweet pepper data. While that data was captured in a similar environment (glasshouse), the training data contains black and red sweet pepper, while the evaluation data here contains yellow and red (in all its development stages, green to ripe). This variation imposes a domain shift between the datasets, creating an “in the wild” situation.
The counting component of this experiment is a version of the tracking-via-detection approach from our previous work, modified to incorporate segmentation masks. This technique tracks a sweet pepper through a video sequence to count it only once by associating new detections with active tracklets. A number of hyper-parameters are employed both to associate detections with tracklets accurately and to ascertain whether a tracklet meets the conditions of being a legitimate fruit.
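The association step described above can be illustrated with a small example. This is a minimal sketch of greedy mask-IoU matching between active tracklets and new detections; it is not the tracker used in this work, and the function names and the greedy strategy are assumptions made for illustration.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)


def associate(tracklet_masks, detection_masks, iou_threshold):
    """Greedily match detections to tracklets by descending IoU.

    Returns (matches, unmatched), where matches is a list of
    (tracklet_index, detection_index) pairs and unmatched lists the
    detection indices that would start new tracklets.
    """
    candidates = []
    for t, t_mask in enumerate(tracklet_masks):
        for d, d_mask in enumerate(detection_masks):
            iou = mask_iou(t_mask, d_mask)
            if iou >= iou_threshold:
                candidates.append((iou, t, d))
    candidates.sort(reverse=True)  # best overlaps first

    matches, used_t, used_d = [], set(), set()
    for _, t, d in candidates:
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    unmatched = [d for d in range(len(detection_masks)) if d not in used_d]
    return matches, unmatched
```

In a full tracker, a tracklet that stays unmatched for more than an allowed number of frames would be closed, and only tracklets meeting the remaining conditions would be counted as fruit.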
To achieve re-projection we exploit the RGB-D information captured by the platform. We employ the Intel RealSense API to register the depth to the color image and use the factory-calibrated intrinsic parameters with a pinhole camera model. To re-project a detection mask from one image frame to the next, we compute the homogeneous transform of the camera between the two frames from the wheel odometry transform between them, also accounting for the extrinsics between the camera and the encoders (based on the platform’s CAD model).
Given a binary detection mask in the first frame, composed of pixel coordinates, the mask points are re-projected into the second frame as follows: each mask coordinate is back-projected to a 3D point using its depth and the inverse of the camera projection function, the homogeneous transform is applied to that 3D point, and the result is projected back into the image with the camera projection function.
This provides a more up-to-date spatial mask of the tracklet for comparison to new detections and also provides the potential to track fruit through small occlusions.
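The re-projection described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used on the platform: the function names, the matrix conventions (4x4 homogeneous transforms acting on point coordinates, a 3x3 pinhole intrinsic matrix `K`) and the handling of invalid depth are all assumptions.

```python
import numpy as np


def camera_motion(T_odom: np.ndarray, T_base_cam: np.ndarray) -> np.ndarray:
    """Express a base-frame motion in the camera frame.

    T_odom maps point coordinates from the base frame at time k to the
    base frame at time k+1; T_base_cam maps camera-frame coordinates
    into the base frame (the extrinsics).
    """
    return np.linalg.inv(T_base_cam) @ T_odom @ T_base_cam


def reproject_mask(mask, depth, K, T_cam):
    """Re-project a boolean mask from frame k into frame k+1."""
    h, w = mask.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    v, u = np.nonzero(mask)
    z = depth[v, u].astype(float)
    valid = z > 0  # drop pixels without a depth measurement
    u, v, z = u[valid], v[valid], z[valid]

    # Back-project pixels to 3D points (inverse pinhole model).
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)])
    pts = T_cam @ pts
    pts = pts[:, pts[2] > 0]  # keep points in front of the camera

    # Project into the new image and discard points outside of it.
    u2 = np.round(fx * pts[0] / pts[2] + cx).astype(int)
    v2 = np.round(fy * pts[1] / pts[2] + cy).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)

    out = np.zeros((h, w), dtype=bool)
    out[v2[inside], u2[inside]] = True
    return out
```

The re-projected mask can then be compared against new detections with an IoU measure in place of the tracklet's last observed mask. Forward-projecting individual pixels can leave small holes in the output mask, which a real system might close with a morphological operation.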
In the final evaluation, along with re-projection, we perform depth-based detection filtering. For each frame (image) we obtain a set of detections, each consisting of a mask. We only retain a detection if a sufficient proportion of the depth values in its mask lie within the closest row, that is, if the number of depth values within the defined range, divided by the total number of values in the mask, meets an empirically derived threshold.
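The filtering rule can be illustrated as follows. This is a minimal sketch under assumed names: `d_min`, `d_max` and `min_ratio` are hypothetical parameters standing in for the depth range of the closest row and the empirically derived proportion.

```python
import numpy as np


def keep_detection(mask, depth, d_min, d_max, min_ratio):
    """Return True if enough of the mask's depth values lie in [d_min, d_max].

    mask  : boolean array marking the detection's pixels
    depth : depth image registered to the same pixels
    """
    values = depth[mask]
    if values.size == 0:
        return False
    in_range = np.count_nonzero((values >= d_min) & (values <= d_max))
    return in_range / values.size >= min_ratio


def filter_detections(masks, depth, d_min, d_max, min_ratio):
    """Keep only the detections that pass the depth-proportion test."""
    return [m for m in masks if keep_detection(m, depth, d_min, d_max, min_ratio)]
```

Using a proportion instead of a single depth value (for example the mask centre) makes the test more tolerant of depth holes and of leaves partially covering the fruit.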
For our evaluation we use three different tracking set-ups, all using the same Mask-RCNN models for mask propagation. The first experiment uses a baseline (bl) tracker built in a similar manner to the prior IoU tracking strategy. In the next two evaluations we exploit spatial-temporal information by utilising re-projection only (rp), and re-projection plus depth filtering (df). To evaluate all approaches we use the normalised error of the total fruit count in the current row, computed from the manually annotated fruit count (ground truth) in each row and the fruit count predicted by the tracking-via-segmentation approach. We also use the mean error of an experiment to display the overall performance at each IoU value. Finally, we compute the coefficient of determination ($R^2$) to measure the correlation between the ground truth and the prediction.
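The two metrics can be written compactly. The sketch below assumes that the normalised error is the absolute count difference divided by the ground-truth count, and uses the standard definition of the coefficient of determination; the exact forms used in the evaluation may differ, so both should be read as assumptions.

```python
import numpy as np


def normalised_error(gt_count: float, pred_count: float) -> float:
    """Absolute counting error relative to the ground-truth count (assumed form)."""
    return abs(gt_count - pred_count) / gt_count


def r_squared(gt_counts, pred_counts) -> float:
    """Coefficient of determination between ground truth and predictions."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    ss_res = np.sum((gt - pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((gt - gt.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def mean_normalised_error(gt_counts, pred_counts) -> float:
    """Mean of the per-row normalised errors of one experiment."""
    return float(np.mean([normalised_error(g, p) for g, p in zip(gt_counts, pred_counts)]))
```

Note that a high coefficient of determination can coexist with a large normalised error when the tracker over- or under-counts consistently, which is why both measures are reported.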
IV-B Crop Sensing and Mapping
We also outline a method for autonomous crop mapping, enabled by PATHoBot’s on-the-fly surveying capabilities. We first synchronize all camera feeds, utilising the built-in triggering hardware (Sec. III-E). Then we employ the Intel RealSense API to generate colored 3D point-clouds from registered RGB-D images, while keeping track of the platform state and frames with a robot model (see Fig. 4(c)). Each camera’s point-cloud is referenced to that robot model, producing spatially referenced point-clouds. Furthermore, we filter points outside the surveyed crop-row by keeping those with depth . Since this simple experiment concatenates frame-wise point-clouds on a single map, we skip frames to get minimal superposition between point-clouds at the radiators’ depth (see Fig. 3), yielding a less dense visualization.
The following section details the results of our experiments performed on the last two datasets gathered by PATHoBot at the CKA commercial glasshouse. These evaluations outline the versatility of PATHoBot for various agricultural robotics tasks.
Sweet pepper mapping results showing a) multiple mapped fruits; b) fruit blending with leaves and holes due to depth interpolation and occlusions respectively; c) section of a sweet pepper row 3D map also showing the robot state.
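The concatenation of frame-wise point-clouds into one map can be sketched as below. This is an illustrative NumPy example and not the pipeline used on the platform: the function names, the planar pose helper, the `max_depth` cut-off and the use of the camera z axis as depth are assumptions.

```python
import numpy as np


def planar_pose(x: float, y: float = 0.0, yaw: float = 0.0) -> np.ndarray:
    """4x4 pose of the platform base in the map frame for planar motion."""
    T = np.eye(4)
    c, s = np.cos(yaw), np.sin(yaw)
    T[:2, :2] = [[c, -s], [s, c]]
    T[0, 3], T[1, 3] = x, y
    return T


def transform_points(T: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (T @ homog.T).T[:, :3]


def merge_clouds(clouds, base_poses, T_base_cam, max_depth):
    """Concatenate camera-frame clouds into a single map-frame cloud.

    clouds     : list of (N, 3) arrays in the camera frame (z = depth)
    base_poses : list of 4x4 map-from-base poses, e.g. from odometry
    T_base_cam : 4x4 base-from-camera extrinsics
    max_depth  : points farther than this are discarded
    """
    merged = []
    for pts, T_map_base in zip(clouds, base_poses):
        near = pts[(pts[:, 2] > 0) & (pts[:, 2] <= max_depth)]
        merged.append(transform_points(T_map_base @ T_base_cam, near))
    return np.vstack(merged) if merged else np.empty((0, 3))
```

Skipping frames corresponds to passing only every n-th cloud and pose to `merge_clouds`, which limits the overlap between consecutive clouds in the resulting map.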
V-A Fruit Counting
Based on the techniques and data outlined in Sec. IV-A we evaluate the performance of fruit tracking “in the wild”. This evaluation utilises Mask-RCNN for instance-based segmentation and a version of the tracker from our previous work. In order to perform these evaluations a number of hyper-parameters are required for the tracker.
The start and stop zones for the tracker were selected in accordance with our previous work. As PATHoBot’s motion is faster, a sweet pepper stays in the scene for only approx. frames at a depth of . This means that, after the start and stop zones are accounted for, only legitimate frames for detection exist. Based on this information we halve the number of allowed missed-detection frames to ten. Finally, to ensure we filter detections from other rows, we select a depth filter threshold (see eq. 3) of based on visual inspection of the results.
Fig. 4 provides an overview of the performance of all the experiments. Generally, accuracy improves across the three systems: bl performs the worst and the filtered re-projection (df) outperforms the other two. We attribute this improvement to both filtering and re-projection: only relevant fruit are detected and tracklet IoU is improved. The contribution of filtering can be seen by investigating the performance of re-projection (rp) on its own. In the Fruit Count plot of Fig. 4, rp has a consistently higher estimate of the number of fruit (across all IoUs) when compared to df. We believe that this is because rp tracks sweet pepper from other rows, inflating the overall count of this system.
TABLE I: Mean and Std Normalised Error for the three trackers: No Depth (bl), Reprojection Only (rp), and Reprojection + Filtering (df).
The limitation of the rp technique is highlighted in Fig. 6. It can be seen that multiple tracks exist in the background rows, which should not be counted. By introducing depth-based filtering (df) we are able to remove these unwanted tracks. This improves both the mean normalized error and $R^2$. It is also noteworthy that in this “in the wild” scenario the segmentation algorithm is able to detect yellow sweet pepper, despite no exposure to it during training.
The df technique is the best performing system. In terms of overall performance it achieves a mean normalized error of at a detection IoU of as seen in Tab. I. This is an absolute improvement of 0.18 over bl, which is the next best performing technique. The df technique also achieves an $R^2$ value of showing a strong relationship between the fruit tracking predictions and the ground truth. Thus, in a practical application a simple linear model could be used to improve the mean normalized error of the fruit count.
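The idea of correcting the predicted counts with a simple linear model can be sketched as follows. This example uses an ordinary least-squares line fitted with NumPy; the function names and the counts in the example are hypothetical and do not come from the experiments.

```python
import numpy as np


def fit_count_correction(pred_counts, gt_counts):
    """Fit gt ≈ slope * pred + intercept by least squares."""
    slope, intercept = np.polyfit(
        np.asarray(pred_counts, dtype=float),
        np.asarray(gt_counts, dtype=float),
        deg=1,
    )
    return float(slope), float(intercept)


def correct_count(pred_count, slope, intercept):
    """Apply the fitted linear correction to a new predicted count."""
    return slope * pred_count + intercept


# Hypothetical calibration rows: tracker counts and manual counts.
slope, intercept = fit_count_correction([10, 20, 30], [25, 45, 65])
corrected = correct_count(40, slope, intercept)
```

Such a correction only helps when the relation between predictions and ground truth is close to linear, which is what a high coefficient of determination suggests; it would need to be fitted on rows that are not used for evaluation.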
V-B Crop Sensing and Mapping
The mapping pipeline described in Sec. IV-B is able to yield offline 3D visualizations of each crop-row. Fig. 4(c) outlines the entire height of a single section of one row. Portions of this map highlighting the positive and negative attributes are provided in Fig. 4(a) and Fig. 4(b) respectively. In the positive example, clear and dense maps of the fruit are obtained. While the negative example still displays promising results, there is blending between the leaf stem and the pepper’s surface. Moreover, un-mapped patches occur due to occlusions.
Despite the simplistic nature of this mapping approach, the visualizations should be dense and detailed enough to be used for phenotyping tasks. However, one of the limitations of using point-clouds from a single depth measurement is that they tend to be noisy and graphically intensive. This raises the need for fusing multiple depth measurements in a single map, not only to reduce the map noise but also to associate similar points and produce memory-efficient reconstructions. Fusing all available information sources (odometry, tracking camera pose, depth images and IMU data) in a SLAM system could produce more precise and consistent 3D maps.
Combining such maps with fruit segmentation algorithms should yield rich 3D semantic crop maps suitable for autonomous phenotyping. The use of this information can be enhanced through registration techniques that account for crop growth and enable multi-session semantic crop mapping.
In this paper we introduced PATHoBot, an automated platform for surveying crops, capable of operating in a commercial protected cropping environment. This robot has an array of localization and multi-modal sensors which produce rich data that can be employed by advanced computer vision and machine learning techniques to estimate key phenotyping indices. One example of a method enabled by PATHoBot is the counting of sweet peppers “in the wild”. We were able to leverage existing annotated data to train models in one domain and utilise them on a different cultivar (even with a different fruit color). We then enhanced our tracking-via-segmentation approach, outperforming the baseline on the same data by approximately 20 points, while also achieving a considerably improved $R^2$ score. This improvement in fruit counting was only possible due to the information captured by the platform. We also demonstrated PATHoBot’s on-the-fly scanning ability by producing visualisations offline from the camera array data combined with motion information.
To the best of our knowledge this is the first robot capable of surveying a whole crop in a commercial glasshouse, generating suitable data for autonomous phenotyping and crop mapping. We believe this platform helps to simplify laborious phenotyping processes, allowing for more efficient and higher quality crop production.
This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2070 – 390732324.
[1] (2020) Development of a sweet pepper harvesting robot. Journal of Field Robotics.
[2] SemanticKITTI: a dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9297–9307.
[3] (2006) Robot Design and Testing for Greenhouse Applications. Biosystems Engineering 95 (3), pp. 309–321.
[4] Multimodal deep learning for robust rgb-d object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 681–687.
[5] (2018) Using the realsense d4xx depth sensors in multi-camera configurations. Intel Corp.
[6] (2020) Fruit detection in the wild: the impact of varying conditions and cultivar. To appear in Proceedings of Digital Image Computing: Techniques and Applications (DICTA).
[7] (2018) Fruit quantity and ripeness estimation using a robotic vision system. IEEE Robotics and Automation Letters 3 (4), pp. 2995–3002.
[8] (2003) Multiple view geometry in computer vision. Cambridge University Press.
[9] (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[10] (2019) Deep learning for real-time fruit detection and orchard fruit load estimation: benchmarking of ‘mangoyolo’. Precision Agriculture 20 (6), pp. 1107–1135.
[11] (2020) Performance improvements of a sweet pepper harvesting robot in protected cropping environments. Journal of Field Robotics 37, pp. 1197–1223.
[12] (2019) 3D move to see: multi-perspective visual servoing towards the next best view within unstructured and occluded environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3890–3897.
[13] (2017) Autonomous sweet pepper harvesting for protected cropping systems. IEEE Robotics and Automation Letters 2 (2), pp. 872–879.
[14] (2020) Segmentation-Based 4D Registration of Plants Point Clouds for Phenotyping. In IROS.
[15] Semanticfusion: dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635.
[16] (2018) Real-time semantic segmentation of crop and weed for precision agriculture robots leveraging background knowledge in cnns. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2229–2235.
[17] (2014) Phenotyping large tomato plants in the greenhouse using a 3d light-field camera. American Society of Agricultural and Biological Engineers Annual International Meeting 2014, ASABE 2014 1, pp. 153–159.
[18] (2013) PhenoBot-a robot system for phenotyping large tomato plants in the greenhouse using a 3d light field camera. Unpublished lecture.
[19] (2016) Deepfruits: a fruit detection system using deep neural networks. Sensors 16 (8), pp. 1222.
[20] (2019) Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, pp. 1–47.
[21] Detection of single grapevine berries in images using fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.