Monocular Camera Based Fruit Counting and Mapping with Semantic Data Association

11/04/2018 · Xu Liu et al. · Arizona State University, University of Pennsylvania

We present a cheap, lightweight, and fast fruit counting pipeline that uses a single monocular camera. Relying only on a monocular camera, our pipeline achieves counting performance on a mango dataset comparable to a state-of-the-art fruit counting system that utilizes an expensive sensor suite including LiDAR and GPS/INS. Our monocular pipeline begins with a fruit detection component that uses a deep neural network. It then uses semantic structure from motion (SfM) to convert these detections into fruit counts by estimating the 3D landmark locations of the fruit, and using these landmarks to identify double counting scenarios. There are many benefits to developing a low-cost and lightweight fruit counting system, including applicability to agriculture in developing countries, where monetary constraints or unstructured environments necessitate cheaper hardware solutions.







I Introduction

Accurately estimating fruit count is important for growers to optimize yield and make decisions for harvest scheduling, labor allocation, and storage. Robotic fruit counting systems typically utilize a variety of sensors such as stereo cameras, depth sensors, LiDAR, and global positioning inertial navigation systems (GPS/INS). These systems have demonstrated great success in counting a variety of fruits including mangoes, oranges, and apples [1][2][3]. However, while the use of a variety of high-end sensors results in good counting accuracy, it comes at high monetary, weight, and size costs. For example, a sensor suite equipped with cameras, LiDAR, and a computer can carry a considerable monetary cost, and weigh upwards of a few kilograms [1].

These high monetary, weight, and size costs directly limit the applicability of these systems. Calibration poses additional challenges when multiple sensing modalities such as cameras and LiDAR are used. A key motivation of this work is to develop a fruit counting system for cashew growers in Mozambique. The lack of infrastructure and technical knowledge, together with tight cost constraints, makes it infeasible to use a complex sensor suite in these agricultural environments. Meanwhile, the growth of smartphone technology has made high-quality monocular cameras readily available and accessible. These factors motivate the development of a high-performance fruit counting pipeline that uses only a monocular camera and can potentially run on smartphones. By doing so, we would like to shift the burden of performance from sophisticated hardware to sophisticated algorithms running on cheap and ubiquitous commodity hardware.

The main contributions of our work are: (1) a monocular camera fruit counting system that uses a deep neural network to detect fruits and semantic SfM to convert these detections into fruit counts; and (2) a thorough comparison on a mango dataset with a fruit counting system that uses LiDAR and GPS/INS data, demonstrating that our monocular counting system achieves comparable performance with much cheaper, lighter sensors. Fig. 1 depicts the detection and tracking performance of our algorithm. A video of our algorithm can be found at:

II Related Work

Figure 1: Detection and tracking of mangoes across frames. A deep neural network is used to identify fruit detections. These detections are then associated with 3D landmarks in order to convert them into fruit counts.

Fruit detection, segmentation, and counting in a single image has seen a revolution from hand-crafted computer vision techniques to data-driven techniques. Traditional hand-engineered techniques for this task are usually based on a combination of shape detection and color segmentation. Dorj et al. develop a watershed segmentation based method to detect citrus in HSV space [4]. Ramos et al. use contour analysis on superpixel over-segmentation results to fit ellipses for counting coffee fruits on branches [5]. Roy et al. develop a two-step apple counting method which first uses RGB-based over-segmentation for fruit area proposals, then estimates the fruit count by fitting clustering models with different numbers of centers [6]. Such hand-crafted features usually have difficulty generalizing to different datasets where illumination or occlusion levels may differ.

Consequently, data-driven methods have become the state of the art, primarily as a result of advances in deep learning. Bargoti and Underwood use Faster Region-based Convolutional Neural Networks (Faster R-CNN) to detect mangoes, almonds, and apples, while also providing a standardized dataset for the evaluation of counting algorithms [2, 7]. Chen et al. use a Fully Convolutional Network (FCN) to segment the image into candidate regions and a CNN to count the fruit within each region [8]. Rahnemoonfar and Sheppard train an Inception-style architecture to directly count the number of tomatoes in an image, demonstrating that in some scenarios these deep networks can even be trained using synthetic data [9]. Barth et al. also generate synthetic data to train deep neural networks to segment pepper images [10]. Our work differs from these previous works by expanding the counting problem from a single image to an image sequence. This extension is more challenging since it requires tracking of the fruits across image frames.

To exploit the physiological features of fruits, adding information from different modalities such as depth and near-infrared (NIR) has been considered in recent studies. Choi et al. use a Hough Circle Estimation-Deep Classification pipeline for green citrus detection and show that the performance on NIR images is significantly better than on RGB or depth images [11]. Gan et al. propose a Color-Thermal Combined Probability (CTCP) measure that refines Faster R-CNN detections with thermal image information for immature citrus detection, counteracting the similarity between fruit and canopy.

Most approaches that track fruits across image frames use some combination of Structure from Motion (SfM), the Hungarian algorithm, optical flow, and Kalman filters to provide the corresponding assignments of fruits across frames. Wang et al. use stereo cameras to count red and green apples by taking images at night in order to control the illumination and exploit specular reflection features.

Das et al. use a Support Vector Machine (SVM) to detect fruits, and use optical flow to associate the fruits between frames [1]. Halstead et al. use a 2D tracking algorithm to track and refine FR-CNN detections of sweet peppers for counting and crop quantity evaluation in an indoor environment [14]. Roy et al. develop a four-step 3D reconstruction method which first roughly aligns the two-side views of the 3D point cloud of a fruit tree row, then generates a semantic representation with deep learning-based trunk segmentation, and further refines the two-view alignment with this data. At the back end, it uses the 3D point cloud and the fruits pre-detected in [6] to produce both a visual count and tree height and size estimates for harvest count estimation [15].

Our objective is to estimate the count of fruits on trees in the field. We limit ourselves to using only a monocular camera, which presents additional challenges over the previous work in [3], since depth or pose information is not directly available from LiDAR or GPS/INS sensors. Our previous monocular camera fruit counting approach first maintains fruit tracks in the 2D image plane across frames. Separately, to reject outlier fruits, a computationally expensive structure from motion (SfM) reconstruction is performed using relatively dense SIFT features [16]. However, in that approach, the counting and data association are still performed in 2D, and the 3D points are reconstructions of SIFT features, not fruits.

We draw from our two previous works by using a monocular camera to estimate the camera poses and fruit landmark locations, performing 3D-to-2D data association, and counting based on 3D landmarks. This method decreases computation time and increases accuracy. Our algorithm can be broken down into two components: (1) robust detection of fruits in a given image, and (2) association of each detected fruit with a 3D landmark in order to identify double counting scenarios.

III Fruit Detection with Deep Learning

Our fruit detection component takes in an image sequence and outputs bounding boxes of fruit detections in each image, as shown in Fig. 1. The fruit detection system is identical to the one used by Stein et al. in [3], which is based on Faster R-CNN (FR-CNN) [17]. The FR-CNN framework consists of two modules. The first module is a region proposal network which detects regions of interest. The second module is a classification module, which classifies individual regions and regresses the bounding box for every fruit simultaneously. Finally, probability thresholding is applied and non-maximum suppression is conducted to remove duplicate detections.
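
The thresholding and suppression step described above can be sketched as follows. This is a minimal, generic implementation of score thresholding plus non-maximum suppression, not the FR-CNN internals; the threshold values are illustrative assumptions.

```python
import numpy as np

def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Keep high-scoring boxes, suppressing overlapping duplicates.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of the boxes kept, highest score first.
    """
    keep_mask = scores >= score_thresh           # probability thresholding
    idxs = np.where(keep_mask)[0]
    order = idxs[np.argsort(-scores[idxs])]      # descending score
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(int(i))
        # IoU of the top box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                 * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]     # drop duplicate detections
    return kept
```

Two heavily overlapping detections of the same mango collapse to the single higher-scoring box, while well-separated fruits are all retained.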

We follow the methodology of Bargoti and Underwood for obtaining ground truth annotations for network training [2]. These annotations are obtained by randomly sampling 1500 cropped images of size 500 × 500 from all 15,000 images of the orchard, each with an original resolution of 8.14 megapixels. For ground truth, each fruit is labeled as a rectangular bounding box, giving both size and location information. Only fruits on trees in the first row are labeled, and labels of fruits on the ground-truth trees are excluded from the training set. A Python-based annotation toolbox is publicly available at [18]. We refer the reader to [2, 3] for more details.

IV Fruit Counting with Landmark Representation

The fruit detections in each image frame are used to construct a fruit count for each tree. The challenge in this step is associating detections with each other across all the image frames in the entire dataset, or in other words, identifying double counts. These associated detections then represent a single fruit.

We consider three kinds of situations which lead to double counts. The first results from observing the same fruit across consecutive images, which we address by tracking fruits in the 2D image plane. During tracking, some fruit tracks may be split in two due to a missed detection or occlusion in some frames, which is the second source of double counts. The third source of double counts is viewing the same tree from opposite directions (i.e. robot facing east and robot facing west). Taking this two-side view of a tree maximizes the number of fruits visible in the image sequences, but also leads to the same fruit being viewed multiple times. The second and third sources of double counts are accounted for by estimating each fruit's 3D landmark position as well as the corresponding camera pose for each image frame.

Our fruit counting pipeline thus consists of four parts. The first part performs 2D tracking on fruit centers to account for the first source of double counts. The second part uses these fruit centers and their associations across frames as feature matches in a semantic SfM reconstruction to estimate 3D landmark positions as well as the camera pose of each image frame. The third part projects those 3D landmarks back to the image plane of every image in the video sequence in order to identify split tracks and address the second source of double counts. Finally, the fourth part estimates the 3D locations of the tree trunk centers. These trunk centroids are used as depth thresholds so that only fruits closer to the camera than the tree trunk are counted, thus accounting for the third source of double counts.

IV-A Tracking in the Image Plane

Similar to our previous work [16], we use a combination of the Kanade-Lucas-Tomasi (KLT) algorithm, Kalman filter, and Hungarian Assignment algorithm to track fruits across image frames. However, we improve upon previous work by defining a different filtering step which fuses both the FR-CNN detections and KLT estimates as measurements.

Each detection from the FR-CNN is associated with a center and a bounding box: the row and column of the fruit's center in the image coordinate space, and the area of the bounding box. We use the KLT tracker to estimate the optical flow for each fruit in order to predict its location in the next image. After this optical flow prediction step, we compute the overlap proportion between the bounding box of each fruit at its predicted position and the bounding box of each detected fruit in the next image.

Our tracking task is a Multiple Hypothesis Tracking (MHT) problem [19]. Given the two sets of detections in consecutive images, we want to find the cost-minimizing assignment, which defines the tracks between the two frames, using the Hungarian Algorithm [20]. Each possible assignment between a detection in one image and a detection in the next is associated with a cost computed from the predicted positions and bounding box overlaps defined above.
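
The assignment step can be sketched with `scipy.optimize.linear_sum_assignment`, which implements the Hungarian-style optimal matching. The weights `w_dist` and `w_olap` and the exact form of the cost matrix are illustrative assumptions, since the paper's cost equation is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_detections(pred_centers, det_centers, overlap,
                         w_dist=1.0, w_olap=50.0):
    """Match KLT-predicted fruit positions to new FR-CNN detections.

    pred_centers: (N, 2) predicted fruit centers in the next frame.
    det_centers:  (M, 2) detected fruit centers in the next frame.
    overlap:      (N, M) predicted-vs-detected bounding-box overlap in [0, 1].
    Returns a list of (track_idx, detection_idx) pairs.
    """
    # Cost combines center distance with a penalty for low box overlap.
    dist = np.linalg.norm(
        pred_centers[:, None, :] - det_centers[None, :, :], axis=2)
    cost = w_dist * dist + w_olap * (1.0 - overlap)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```

The solver returns the globally cost-minimizing one-to-one assignment, rather than greedily matching each track to its nearest detection.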

Once a detection in one image has been assigned to a detection in the next, we have two measurements for the fruit's position in the new image: one from the KLT tracker and one from the FR-CNN detection. We use a Kalman Filter [21] to fuse these two measurements and obtain the final estimate of the fruit's position. Note that the FR-CNN measurement is the center of the detected bounding box, while the filtered estimate is what we report as the position of the fruit.

For every new fruit, we initialize its own Kalman Filter upon first detection. We define an expanded state vector consisting of the fruit's pixel row and column, together with its pixel row velocity and pixel column velocity, both of which have units of pixels per frame interval, where the frame interval is the constant time between every two frames. The fruit's position is initialized to the center of its FR-CNN detected bounding box, and its velocity is initialized to the average optical flow of all fruits in the previous image. In using the average optical flow to initialize the fruit's velocity, we exploit the fact that the perceived movement of the fruit in 2D is due to the motion of the camera.

We use a discrete-time, time-invariant linear system model of the standard form x_{t+1} = A x_t + w_t, z_t = H x_t + v_t, where z_t is our measurement vector consisting of the optical flow and FR-CNN measurements (assuming that the detection has been associated in the previous step), as well as an additional velocity measurement that multiplies the magnitude of the previous optical flow displacement for the fruit (approximating the depth) with the normalized flow direction in the current image.

Figure 2: Map of Fruits (Dark Points) and Camera Trajectory (Red) Estimated by Semantic Structure from Motion. The blue arrow denotes the front direction of the camera. The estimated fruit map and camera trajectory are reasonable, since we expect the clusters of fruits to be centered around their corresponding trees, and the camera trajectory to be straight. There are 17 trees in this map, which can be easily observed by the fruit clusters.

Here A is the state transition matrix, H the observation matrix, w the process noise, and v the measurement noise. Both w and v are random variables assumed to be drawn from zero-mean Gaussian distributions, with process noise covariance Q and measurement noise covariance R. We chose the relative magnitudes of these covariance matrices so that the FR-CNN measurements, which are relatively precise, are trusted more than the optical flow measurements, which are sometimes noisy.

Thus, given the current state estimate and using the process model above, we obtain an a priori estimate for the next image given knowledge of the process up to the current step. Using the measurement vector, we then perform the standard Kalman Filter prediction and update steps detailed in [21] to compute the a posteriori state estimate. In this way, we keep propagating the state vector and covariance matrix of every fruit until we lose track of it.
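
A minimal sketch of such a per-fruit filter is given below, assuming a constant-velocity model with the KLT and FR-CNN positions stacked into one measurement vector. The noise covariances here are illustrative values, not the paper's, and the extra velocity measurement is omitted for brevity.

```python
import numpy as np

class FruitKalmanFilter:
    """Constant-velocity Kalman filter over [row, col, v_row, v_col]."""

    def __init__(self, center, mean_flow):
        # Position from the FR-CNN box center, velocity from mean optical flow.
        self.x = np.array([center[0], center[1],
                           mean_flow[0], mean_flow[1]], float)
        self.P = np.eye(4) * 10.0                   # initial uncertainty
        self.A = np.eye(4)
        self.A[0, 2] = self.A[1, 3] = 1.0           # dt = 1 frame interval
        # Measurement: KLT position stacked over the FR-CNN center.
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0],
                           [1, 0, 0, 0],
                           [0, 1, 0, 0]], float)
        self.Q = np.eye(4) * 0.1                    # process noise
        # Trust the FR-CNN center more than the noisier optical flow.
        self.R = np.diag([4.0, 4.0, 1.0, 1.0])

    def step(self, klt_pos, det_pos):
        # Predict (a priori estimate).
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        # Update with the fused measurement (a posteriori estimate).
        z = np.array([klt_pos[0], klt_pos[1], det_pos[0], det_pos[1]], float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                           # filtered position
```

Because the FR-CNN entries of R are smaller, the posterior position is pulled more strongly toward the detected box center than toward the flow prediction.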

Using the above tracking process, we extract the full tracking history of every fruit to construct a set of fruit feature matches. For every detection that has been tracked over a sequence of frames, we add the entire sequence of tracked positions to this set. By constructing this set of matches, we account for the first kind of double counts, which results from seeing the same fruit in consecutive frames.

IV-B Semantic SfM: Estimate the Camera Poses and the Fruit Landmark Positions from Fruit Feature Matches

A normal SfM implementation associates a descriptor with each feature point and matches features using nearest neighbors in the descriptor space. This descriptor matching process is computationally expensive [22], so we replace it with the frame-to-frame fruit feature matches output by the 2D tracking process. Bypassing the feature matching process with our 2D tracking algorithm greatly speeds up the computation. We use the COLMAP package [23][24] as our SfM implementation. The outputs of this SfM step are a set of 3D landmarks corresponding to the fruits and a camera pose for each frame. Each landmark has an associated 3D position.

The first step is to identify a good initial pair of images to start the SfM reconstruction process. We input our fruit correspondences as raw feature matches, and then conduct a geometric verification step [23], which uses epipolar geometry [25] and RANSAC [26] to determine the best initial pair of images as well as their inlier feature matches. These initial images are used to initialize the SfM reconstruction using two-view geometry [23][27][28], in which the initial pair of camera poses is estimated and the landmarks observed in the initial pair of images are initialized. The Perspective-n-Point (PnP) algorithm [26] is then employed to incrementally estimate the poses of the preceding and succeeding images, while multi-view triangulation is conducted to initialize new landmarks [23].
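
The multi-view triangulation used to initialize new landmarks can be illustrated with the standard linear (DLT) method; COLMAP's actual triangulation is more sophisticated, so this is only a sketch of the underlying geometry.

```python
import numpy as np

def triangulate(proj_mats, points_2d):
    """Linear (DLT) multi-view triangulation of one landmark.

    proj_mats: list of 3x4 camera projection matrices P = K [R | t].
    points_2d: list of (u, v) observations of the landmark, one per view.
    Returns the 3D point minimizing the algebraic error.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each observation contributes two linear constraints on X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous solution: right singular vector of the smallest value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With two or more verified views and known poses, each tracked fruit center yields one such linear system, and the solution is the fruit's initial 3D landmark.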

While the above process generates initial estimates of camera poses and landmark positions, uncertainties in poses and landmark positions can increase over time and cause the system to drift. Therefore, an optimization process is needed to correct for these errors [23]. A common approach is to minimize the reprojection error of the landmarks, i.e. the distance between each landmark's projection p_l^t into image I_t and the tracked position z_l^t determined by our Kalman Filter. The projection of landmark position X_l onto image I_t with estimated camera pose can be calculated as:

p_l^t = pi( K ( R_t X_l + t_t ) )    (1)

where K is the camera intrinsic matrix, and R_t and t_t are the rotation matrix and translation vector that define the camera rotation and translation in the world frame; pi denotes the perspective division. R_t and t_t can be derived from the estimated camera pose of image I_t.

Bundle Adjustment (BA) [29] solves the following nonlinear optimization problem:

min over {X_l}, {R_t, t_t} of  sum_{l=1..L} sum_{t=1..T} chi_{l,t} || p_l^t - z_l^t ||^2

where L is the total number of landmarks, T is the total number of images, and chi_{l,t} is a binary-valued variable denoting the observability of landmark l in image I_t. The minimizing values are the estimated landmark locations and camera poses from this SfM step. An example of our semantic SfM reconstruction is shown in Fig. 2.
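
The reprojection objective that bundle adjustment minimizes can be sketched as below. The data structures (`poses`, `landmarks`, `tracks`) are hypothetical illustrations rather than COLMAP's API; the observability indicator is implicit in which track entries exist.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D landmark X into the image: p = pi(K (R X + t))."""
    p = K @ (R @ X + t)
    return p[:2] / p[2]            # perspective division

def reprojection_cost(K, poses, landmarks, tracks):
    """Sum of squared reprojection errors, the scalar objective that
    bundle adjustment minimizes over all poses and landmark positions.

    poses:     list of (R, t) per image.
    landmarks: dict landmark_id -> 3D position.
    tracks:    dict landmark_id -> {image_idx: observed (u, v)}.
    """
    cost = 0.0
    for lid, obs in tracks.items():
        for img_idx, z in obs.items():
            R, t = poses[img_idx]
            residual = project(K, R, t, landmarks[lid]) - np.asarray(z, float)
            cost += float(residual @ residual)
    return cost
```

A nonlinear least-squares solver would repeatedly evaluate these residuals while jointly adjusting every pose and landmark to drive the cost down.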

IV-C Avoid Double Counting of Doubly Tracked Fruits: Re-associate 3D Landmarks with Detections

The second source of double counts results from split tracks caused by a missed detection or the fruit being occluded in an intermediate frame. As a result, the 2D tracking and semantic SfM steps can generate multiple landmarks for such a fruit. A direct way to handle this problem would be to compare the distances between landmarks and reject those that coincide. Unfortunately, this approach does not work well: the SfM reconstruction only provides relative scale, and the fruits are clustered, so it is difficult to choose an absolute threshold that determines whether two landmarks coincide.

Instead, we approach this problem by re-associating every 3D landmark with a 2D FR-CNN detection, and discarding landmarks which are not associated with any detection. We sequentially project the landmarks back into every image using Eqn. (1), and match these projections with the detections using a second Hungarian assignment. The cost function of this Hungarian assignment is designed to account for the age of the landmark, i.e. the number of previous images it has been observed in. Using this cost function, for two landmarks corresponding to the same fruit, the older landmark will have the lower cost and be matched with the fruit detection, while the newer landmark will be discarded.

In addition to its 3D position, we associate four attributes with each landmark. The first attribute is a bounding box whose area corresponds to the FR-CNN bounding box in the last frame where the landmark was observed. The second attribute is an observability history, which records the landmark's observability in every frame. The third attribute is a depth, representing the depth of the landmark w.r.t. the corresponding camera center, i.e. the z-axis value of the landmark in the camera coordinate frame. The fourth attribute is the age, defined as the number of images the landmark has been observed in up until the current image.

Using these attributes, we define a Hungarian algorithm cost function for associating each landmark with an FR-CNN detection in the current image. The cost includes an age term that decreases with the landmark's age and saturates at an age threshold, which we chose to be 7. The age cost weight, chosen as 0.5, controls the contribution of the age cost to the total cost. Conducting this global data association is computationally cheap since our fruits are relatively sparse.
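
A sketch of such an age-aware re-association cost is given below. The age threshold of 7 and weight of 0.5 come from the text, while the exact functional form and the image-distance term are illustrative assumptions.

```python
def age_cost(age, age_thresh=7, w_age=0.5):
    """Age term of the re-association cost: older landmarks are cheaper
    to match, saturating (at zero extra cost) once age reaches the
    threshold. The linear form here is an illustrative assumption."""
    return w_age * max(0.0, 1.0 - min(age, age_thresh) / age_thresh)

def reassociation_cost(proj_center, det_center, age):
    """Cost of matching a reprojected landmark to an FR-CNN detection:
    image-plane distance plus the age term, so of two duplicate
    landmarks of one fruit, the older one wins the Hungarian match and
    the newer one is discarded."""
    dr = proj_center[0] - det_center[0]
    dc = proj_center[1] - det_center[1]
    return (dr * dr + dc * dc) ** 0.5 + age_cost(age)
```

Feeding these costs into the same Hungarian solver as in the 2D tracking step yields the landmark-to-detection assignment.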

IV-D Avoid Double Counting from Two Opposite Tree Sides: Compare the Fruit Landmark with the Tree Centroid

In order to maximize the number of visible fruits, we recorded our dataset by viewing each row from both of its sides. If these two views of a single row were captured consecutively, the SfM reconstruction algorithm would be able to combine both views into a single point cloud by estimating the pose of the camera as it turns around and faces the other direction. However, our dataset first captures all rows facing one direction (east), and then turns around to capture those rows facing the other direction (west). As a result, the SfM reconstruction process generates two separate point clouds for each row, and we need to integrate them together in order to avoid double counting fruits that are visible from both sides.

We approach this double counting problem by using the tree trunk to separate the tree into two parts, and then only counting the fruits that lie on the side of the trunk closer to the camera. In order to estimate the location of the trunk centroid, we track Shi-Tomasi corners [30], a modified version of Harris corners [31], on the trunks. Since we need the corner features to lie on the trunk, we use a context extraction network based on the fully convolutional network (FCN) structure [32] to segment trunks in every image. The context extraction network takes in an image and outputs a score tensor with one channel per object class. A dense CRF [33] is added to refine the network output, forcing consistency in the segmentation and sharpening the predicted edges.

For each trunk, we first manually choose a start frame according to the segmentation network's results. We extract corner points inside the trunk region, and use the KLT tracker to track them across a fixed number of subsequent frames. For every successfully tracked corner point, we obtain a set of point correspondences across these frames and add it to the trunk feature correspondence set. Tracking every point across multiple frames in this way yields robust tracking and triangulation performance.

Using the pose estimates over these frames, we conduct a multi-view triangulation of the corner points and calculate their depths w.r.t. the camera center of every frame. Considering that most false positive pixels in the trunk segmentation lie on closer objects such as leaves or fruits (because they occlude the trunks), we represent the depth of the trunk in each frame using the third quartile of the depths of all corner points in that frame, and we increase this value by a fixed offset to account for the diameter of the trunk and the tilting of the camera. For every fruit landmark, before counting it, we look back 15 frames and compare the depth of the trunk with the depth of the landmark in each of those frames. We then use a voting system: for each frame, if the landmark is closer to the camera than the trunk, we add a before-centroid vote; otherwise, we add an after-centroid vote. We count the landmark only if its before-centroid votes outnumber its after-centroid votes.
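
The voting rule above can be sketched as follows. The function and argument names are hypothetical, and the trunk depth offset is left as a parameter since its value is not given here.

```python
import numpy as np

def count_landmark(landmark_depths, trunk_corner_depths,
                   trunk_offset=0.0, lookback=15):
    """Decide whether to count a fruit landmark using per-frame depths.

    landmark_depths:     depth of the landmark in each recent frame.
    trunk_corner_depths: per-frame lists of tracked trunk-corner depths.
    trunk_offset:        slack added to the trunk depth for the trunk
                         diameter and camera tilt (illustrative default).
    Returns True if before-centroid votes outnumber after-centroid votes.
    """
    before, after = 0, 0
    frames = list(zip(landmark_depths, trunk_corner_depths))[-lookback:]
    for d_fruit, corners in frames:
        # Third quartile is robust to mis-segmented foreground pixels.
        d_trunk = np.percentile(corners, 75) + trunk_offset
        if d_fruit < d_trunk:
            before += 1            # landmark is on the near side
        else:
            after += 1             # landmark is behind the trunk centroid
    return before > after
```

A landmark seen mostly in front of the trunk is counted from this side; one mostly behind it is left for the opposite-side pass, preventing the two-side double count.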

V Results and Analysis

Figure 3: Comparison of per-tree count error against the field count of all algorithms for 18 ground-truth trees. The X-axis is the tree index, where trees are ranked from smallest to largest according to their ground-truth counts. The Y-axis is the per-tree fruit count error against the field count. On average, every tree has 175 mangoes. Our monocular camera multi-view counting algorithm (red) and the sensor suite multi-view counting algorithm (blue) have the smallest errors against the field count. Both algorithms on average slightly undercount, which is expected since we are comparing against the field count. A significant improvement can be seen from the 2D tracking outputs to the 3D landmark tracking outputs, and from the 3D landmark tracking outputs to the final algorithm outputs. The sensor suite dual-view algorithm undercounts since it uses only two opposing images of every tree to obtain the count. It is notable that the algorithms show a similar trend in their error against the field count, which reflects the variance in the occlusion conditions of our ground-truth trees.

Figure 4: Comparison of linear regression models for the monocular camera approach (left) and the sensor suite approach (right). The X-axis is the field count and the Y-axis is the estimated count. Every dot represents a tree. The slope of both models is slightly less than 1, which implies that on average the estimated count is smaller than the ground truth count. This is expected since some fruits are fully occluded. The fitted lines represent the relationship between the estimated count and the field count of the ground-truth trees well. The R^2 values for both methods lie within a relatively high precision range.

Measure                        Monocular Multi    Sensor Suite Multi
Per-tree Count Error Mean      27.8               19.8
Per-tree Count Error Std Dev   29.6               22.9

Figure 5: Per-tree count error mean and standard deviation of our monocular camera multi-view algorithm and the sensor suite multi-view algorithm. The two algorithms have comparable performance judging from the error mean and standard deviation. The error mean and standard deviation of both algorithms are small considering that on average every ground-truth tree has 175 mangoes, and that occlusion conditions vary across the ground-truth trees.

In this section, we compare the estimated count output by our monocular camera system and the counts output by two algorithms that use the LiDAR and GPS/INS system from [3] against the ground truth field count. These benchmark algorithms use the GPS/INS/RTK system to estimate the camera poses, and the 3D LiDAR point cloud to estimate the tree centroids and tree masks that associate fruits with the correct individual trees. The first algorithm uses multiple views of the tree together with multi-view geometry, while the second algorithm uses images from only two opposing views (one facing east and one facing west). All three algorithms use the same FR-CNN to detect fruits in the 2D images. Although the monocular camera system only utilizes the 2D images, the results show that our system has performance comparable to the sensor suite multi-view algorithm.

Our dataset was collected using an unmanned ground vehicle (UGV) built by the Australian Centre for Field Robotics (ACFR) at The University of Sydney. It has a 3D LiDAR and a GPS/INS capable of real-time kinematic (RTK) correction. In addition, it has a Prosilica GT3300C camera with a Kowa LM8CX lens, which captures RGB images of 8.14 megapixels at 5 Hz [3].

The dataset was collected on December 6, 2017, from a single mango orchard at Simpson Farms in Bundaberg, Queensland, Australia. We manually counted 18 trees in the field as ground truth. The 18 ground-truth trees were chosen from all 10 rows of trees in the orchard to maximize the variability of NDVI (and by extension yield) derived from multi-band satellite data. The trees vary in size and have differing occlusion conditions.

Fig. 3 and Fig. 5 show the counting results of our monocular camera system and the sensor suite system. To measure and compare the per-tree counting performance, we manually mask the target trees in our algorithm. Most of the difference in performance between the monocular camera system and the sensor suite system comes from trees with higher counts. For the 10 trees with fewer mangoes, our monocular system on average slightly undercounts, while the sensor suite multi-view system slightly overcounts. However, for the 8 trees with more mangoes, both systems undercount, with our system undercounting by a larger margin, indicating that the sensor suite multi-view algorithm can better handle occluded fruit.

We would expect both algorithms to handle fruit occlusions equally well since they are using the same FR-CNN detection network. However, the occluded fruit can cause a performance difference due to the third source of double counting that results from combining point clouds from the two different viewpoints. Due to higher occlusion, from a given side of the tree, it is more likely that a fruit may lie further than the trunk centroid, but is not visible from the other side of the tree. Throwing away this landmark just because it is on the wrong side of the tree would be too aggressive of a strategy, and may be the cause of the larger amount of undercounting.

One solution to this problem is a more sophisticated algorithm to integrate the two point clouds. As previously mentioned though, if the data was collected so that the two views of the same row were consecutive, our monocular algorithm would not need the trunk centroid rejection step since the SfM reconstruction algorithm would be able to integrate both views into the same point cloud.

Fig. 3 also depicts the improvement in counting results across the three steps of our pipeline. The 2D tracking overcounts most of the trees and has a high variance in the per-tree counting error. The 3D landmark based tracking overcounts to a lesser degree; however, it is still not robust and has a high standard deviation in the per-tree counting error. After using the centroid to remove two-side double counts, our final algorithm is much more robust and accurate.

Fig. 4 shows the linear regression lines fitted for the monocular camera multi-view algorithm and the sensor suite multi-view algorithm. A slope of 1 indicates that the estimated count is proportional to the field count, and a high R^2 value indicates that the linear model on the field counts is a good fit for the estimated counts. The slopes and R^2 values of the two systems are comparable. The linear regressions show that most of the data points corresponding to high field counts lie below the unit diagonal line, which matches our previous observation that undercounting occurs due to highly occluded fruits. The metrics indicate that both systems perform well.

One strength of our algorithm is that by using semantic SfM on fruit locations rather than SIFT features, we achieve a much faster algorithm compared to traditional SfM based on geometric features. On a 4-core i7 CPU, for a 1000-frame video, our SfM reconstruction (not including fruit tracking in 2D) takes about 5 minutes, compared to 10 hours for traditional SIFT based SfM. This dramatic speedup results from the fact that every image contains far fewer fruit features (50 - 100 fruit features compared to 10,000 SIFT features). In addition, by first performing the 2D tracking step to estimate the associations across frames, we bypass the computationally expensive feature matching process based on nearest neighbors over SIFT descriptors. Since the fruit counting and mapping task requires tracking fruits regardless of whether the tracking results are used for SfM, we do not consider this phase's computation to be introduced by the SfM step.

VI Conclusion and Future Work

We presented a monocular fruit counting pipeline that paves the way for yield estimation using commodity smartphone technology. Such a fruit counting system has applications in a wide variety of farm environments where cost and environmental constraints preclude larger, more expensive sensors. Our pipeline begins with a fruit detection step using the Faster R-CNN deep neural network. It then estimates the 3D landmark positions of the fruits using 2D tracking, SfM reconstruction, identification of split tracks, and trunk centroid estimation in order to identify various sources of double counting. We evaluated the monocular system on a mango dataset against two sensor suite algorithms that use a 3D LiDAR and GPS/INS system.

When designing a fruit counting system, there is a tradeoff between hardware complexity and software complexity. We have identified modes (trees with high fruit counts) where the monocular-only system underperforms the sensor suite algorithm. Further experimentation is required to understand this tradeoff curve and to identify more failure cases to improve upon. Despite the restriction to low cost sensors, we have demonstrated that our monocular system achieves performance comparable to systems based on expensive sensor suites, and it is a step toward the ultimate goal of a low cost, robust, lightweight fruit counting system.

VII Acknowledgements

This work was supported by USDA NIFA grant 2015-67021-23857 under the National Robotics Initiative, and the Australian Centre for Field Robotics (ACFR) at The University of Sydney, with thanks to Simpsons Farms.


  • [1] J. Das, G. Cross, C. Qu, A. Makineni, P. Tokekar, Y. Mulgaonkar, and V. Kumar, “Devices, systems, and methods for automated monitoring enabling precision agriculture,” in Automation Science and Engineering (CASE), 2015 IEEE International Conference on.   IEEE, 2015, pp. 462–469.
  • [2] S. Bargoti and J. Underwood, “Deep fruit detection in orchards,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on.   IEEE, 2017, pp. 3626–3633.
  • [3] M. Stein, S. Bargoti, and J. Underwood, “Image based mango fruit detection, localisation and yield estimation using multiple view geometry,” Sensors, vol. 16, no. 11, p. 1915, 2016.
  • [4] U.-O. Dorj, M. Lee, and S.-s. Yun, “An yield estimation in citrus orchards via fruit detection and counting using image processing,” Computers and Electronics in Agriculture, vol. 140, pp. 103–112, 2017.
  • [5] P. Ramos, F. A. Prieto, E. Montoya, and C. E. Oliveros, “Automatic fruit count on coffee branches using computer vision,” Computers and Electronics in Agriculture, vol. 137, pp. 9–22, 2017.
  • [6] P. Roy, A. Kislay, P. A. Plonski, J. Luby, and V. Isler, “Vision-based preharvest yield mapping for apple orchards,” arXiv preprint arXiv:1808.04336, 2018.
  • [7] S. Bargoti and J. P. Underwood, “Image segmentation for fruit detection and yield estimation in apple orchards,” Journal of Field Robotics, vol. 34, no. 6, pp. 1039–1060, 2017.
  • [8] S. W. Chen, S. S. Shivakumar, S. Dcunha, J. Das, E. Okon, C. Qu, C. J. Taylor, and V. Kumar, “Counting apples and oranges with deep learning: A data-driven approach,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 781–788, April 2017.
  • [9] M. Rahnemoonfar and C. Sheppard, “Deep count: fruit counting based on deep simulated learning,” Sensors, vol. 17, no. 4, p. 905, 2017.
  • [10] R. Barth, J. IJsselmuiden, J. Hemming, and E. Van Henten, “Data synthesis methods for semantic segmentation in agriculture: A capsicum annuum dataset,” Computers and Electronics in Agriculture, vol. 144, pp. 284–296, 2018.
  • [11] D. Choi, W. S. Lee, J. K. Schueller, R. Ehsani, F. Roka, and J. Diamond, “A performance comparison of rgb, nir, and depth images in immature citrus detection using deep learning algorithms for yield prediction,” in 2017 ASABE Annual International Meeting.   American Society of Agricultural and Biological Engineers, 2017, p. 1.
  • [12] H. Gan, W. Lee, V. Alchanatis, R. Ehsani, and J. Schueller, “Immature green citrus fruit detection using color and thermal images,” Computers and Electronics in Agriculture, vol. 152, pp. 117–125, 2018.
  • [13] Q. Wang, S. Nuske, M. Bergerman, and S. Singh, “Automated crop yield estimation for apple orchards,” in Experimental robotics.   Springer, 2013, pp. 745–758.
  • [14] M. Halstead, C. McCool, S. Denman, T. Perez, and C. Fookes, “Fruit quantity and quality estimation using a robotic vision system,” arXiv preprint arXiv:1801.05560, 2018.
  • [15] W. Dong, P. Roy, and V. Isler, “Semantic mapping for orchard environments by merging two-sides reconstructions of tree rows,” arXiv preprint arXiv:1809.00075, 2018.
  • [16] X. Liu, S. W. Chen, S. Aditya, N. Sivakumar, S. Dcunha, C. Qu, C. J. Taylor, J. Das, and V. Kumar, “Robust fruit counting: Combining deep learning, tracking, and structure from motion,” in Intelligent Robots and Systems (IROS), 2018 IEEE/RSJ International Conference on.   IEEE, 2018.
  • [17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [18] B. M. Clapper, “Munkres 1.0.8.” Available online:
  • [19] D. Reid, “An algorithm for tracking multiple targets,” IEEE transactions on Automatic Control, vol. 24, no. 6, pp. 843–854, 1979.
  • [20] J. Munkres, “Algorithms for the assignment and transportation problems,” Journal of the society for industrial and applied mathematics, vol. 5, no. 1, pp. 32–38, 1957.
  • [21] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
  • [22] E. Karami, S. Prasad, and M. Shehata, “Image matching using sift, surf, brief and orb: performance comparison for distorted images,” arXiv preprint arXiv:1710.02726, 2017.
  • [23] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [24] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in European Conference on Computer Vision (ECCV), 2016.
  • [25] R. Hartley and A. Zisserman, “Multiple view geometry in computer vision,” Robotica, vol. 23, no. 2, pp. 271–271, 2005.
  • [26] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [27] C. Beder and R. Steffen, “Determining an initial image pair for fixing the scale of a 3d reconstruction from an image sequence,” in Joint Pattern Recognition Symposium.   Springer, 2006, pp. 657–666.
  • [28] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • [29] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms.   Springer, 1999, pp. 298–372.
  • [30] J. Shi and C. Tomasi, “Good features to track,” Cornell University, Tech. Rep., 1993.
  • [31] C. Harris and M. Stephens, “A combined corner and edge detector.” in Alvey vision conference, vol. 15, no. 50.   Citeseer, 1988, pp. 10–5244.
  • [32] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, Jun 2015, pp. 3431–3440.
  • [33] P. Krähenbühl and V. Koltun, “Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials,” 2012. [Online]. Available: