Plant phenotyping focuses on measuring structural and chemical traits such as height, shape, weight, and other properties [walter_2015]. The stand count in a field is an important phenotypic trait related to the emergence of plants relative to the number of seeds planted, while plant location provides information on the variability of emergence within a plot or geographic area of a field. Plant location is also important for evaluating more complex characteristics of individual plants using other precisely co-registered data sets. Traditional phenotyping is costly, labor-intensive, and primarily destructive [addie_paper]. Traditional stand counting is done manually by personnel walking through the field, which is not viable for large areas and typically does not provide locations at the plant level. Modern high-throughput phenotyping [furbank_2011, chapman_2014, makanza_2018, singh2016machine] addresses the problems of traditional phenotyping by using remotely sensed data to measure plant properties with robotic platforms.
Unmanned Aerial Vehicles (UAVs) with sensors such as RGB and multi/hyperspectral imaging, as well as LiDAR, have demonstrated the capability to reduce and, in some cases, eliminate field-based phenotyping [chapman_2014]. UAVs are suitable for high-throughput phenotyping because of their ability to non-invasively collect data from a field in a short time. Compared to traditional phenotyping, UAV data collection has lower cost and can cover more area in the same period of time. As shown in Figure 1, the aerial images acquired with UAVs need to be geometrically rectified and mosaicked [habib] with accurate location properties, which is critical for developing reliable methods for plant location at the field scale.
Traditional methods for image-based plant localization typically model the plants before detection [chen_2017_iccv]. Widely varying plant features, such as plant shape and plant overlap, limit the capability of modeling and detection with traditional methods. Deep learning approaches avoid these problems by learning the features during training instead of modeling them beforehand. In recent years, deep learning has been successfully used for the detection of objects from UAV images [zeggada_2017]. In [ammour_2017], Ammour et al. use a VGG-16 [vgg] network combined with a linear support vector machine [Cortes1995] to identify cars from UAV imagery. For UAV-based high-throughput phenotyping, in [ampatzidis_2019], Ampatzidis et al. use two convolutional neural networks (CNNs) to detect and count citrus trees. In [fan_2018], Fan et al. use a CNN to detect tobacco plants in extracted UAV images. Chen et al. [chen_2018] detect plant centers from orthorectified images [habib] using a deep binary classifier.
Locating plant centers in UAV images with deep learning is not a trivial problem. Because of the altitude of most UAV flights, field-scale aerial images have a spatial resolution of 1 cm per pixel or coarser. The problem is even more difficult when plants are in an early stage of growth and are very small. Flying at a lower altitude increases the spatial resolution, but the data sets are larger and additional flightlines are required to cover the field, sometimes necessitating multiple flights due to limited battery time. Deep learning is highly dependent on the quality and quantity of the available training data: large amounts of high-quality ground truth data are needed to achieve good performance. Deep learning models usually perform well if training and testing data are of the same type (e.g., in our case, the same field, same date, and the same type of plants). If we apply the same model to different data, the results are often degraded. For example, the color of the soil and the size of the plants can vary across fields and plant types. These variations can cause a plant localization network trained on a single field with a single plant type to fail when used on other types of plants. In this case, training a new network to achieve high performance requires acquiring large quantities of new ground truth data on the different field, creating a major bottleneck. In this paper we present a method for estimating plant centers across two row-crop types and acquisition dates with a limited quantity of training data using a transfer learning approach.
2 Related Work
Network-based transfer learning.
As noted previously, deep learning methods usually require significantly more training data than traditional machine learning [deep_transferlearning] due to their increased number of parameters. The number of parameters of a 16-layer CNN, for example, can easily exceed millions [vgg]. Training with insufficient data often results in poor performance: a few thousand images are inadequate to properly train most deep neural networks from scratch, and the results reflect the inability of the model to converge with limited data. Collecting more training data (ground truth) is labor-intensive and costly. As shown in Figure 2, network-based transfer learning addresses the problem of insufficient data by transferring a model pretrained on larger, more general datasets such as ImageNet [imagenet] to the target task [deep_transferlearning]. During the transfer learning process, the weights of the pretrained network are copied to the new network for the target task. In deep neural networks, the first few layers can be considered a general feature extractor for the input image [yosinski_2014]. For example, in [oquab_2014], Oquab et al. transfer the weights of a pretrained CNN to improve the performance of a network trained with a small amount of data. In [ng_2015], Ng et al. fine-tune a pretrained CNN for emotion recognition on small datasets. Tapas et al. [tapas2016] retrain a pretrained GoogLeNet [googlenet] to classify Arabidopsis and tobacco plant images. In [ghazi_2017], Ghazi et al. show that retraining pretrained networks on plant images can improve performance compared to training from scratch.
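The weight-copying step of network-based transfer learning can be sketched with plain dictionaries standing in for layer parameters. This is an illustrative sketch, not code from any particular framework; the layer names and the prefix convention are hypothetical:

```python
import numpy as np

def transfer_weights(pretrained, target, transferable_prefixes=("encoder",)):
    """Copy weights from a pretrained model into a target model.

    Both models are represented as {layer_name: ndarray} dictionaries.
    Only layers whose names start with one of the given prefixes and whose
    shapes match are copied; the rest keep their fresh initialization.
    """
    transferred = []
    for name, weights in pretrained.items():
        if (name in target
                and name.startswith(transferable_prefixes)
                and target[name].shape == weights.shape):
            target[name] = weights.copy()   # reuse the learned feature extractor
            transferred.append(name)
    return transferred

# Toy example: a 2-layer "pretrained" net and a freshly initialized target net.
pretrained = {"encoder.conv1": np.ones((3, 3)), "head.fc": np.ones((10,))}
target = {"encoder.conv1": np.zeros((3, 3)), "head.fc": np.zeros((5,))}
copied = transfer_weights(pretrained, target)
# Only the encoder layer is copied; the task-specific head keeps its
# initialization and is trained (fine-tuned) on the target data.
```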
Object Detection. Faster R-CNN [fasterrcnn] and Mask R-CNN [maskrcnn] are commonly used general object detectors. In Faster R-CNN [fasterrcnn], Ren et al. use a region proposal network to search for regions of interest in a feature map. The output of the region proposal network is connected to convolutional layers for object detection and bounding box regression. Building on Faster R-CNN [fasterrcnn], in Mask R-CNN [maskrcnn], He et al. add additional layers to generate segmentation masks for objects in the image. The ground truth for these networks is based on bounding boxes or masks. Bounding-box ground truth often results in inaccurate location estimation when the object is very small, and using bounding boxes to define ground truth is tedious and time-consuming. Plant centers are small objects, so detecting their location precisely is an important objective for the network. Recent work shows that locating and counting objects can be achieved without bounding boxes [javi-2019]. In [aich2018], Aich et al. use the segmentation map generated by a CNN to count the number of wheat plants. Wu et al. [wu_2019] estimate the number of rice seedlings from UAV images using an estimated map from a CNN.
3 Current Approach and Transfer Learning
Our task can be defined as locating plant centers in orthorectified images [habib] with different types of crops, fields, and image acquisition dates. We represent plant centers as points in our ground truth because they are more accurate than bounding boxes in terms of localization, and are relatively easier to use for labeling. Since our task is localization, our ground truth masks are very sparse. We cannot use pixelwise losses as they do not represent the distance between the prediction and the ground truth, unless they perfectly overlap. This is especially true for the task of point localization. Due to this, our approach is based on locating objects without bounding boxes [javi-2019], which is used for plant localization, eye pupil identification, and people counting. The major contribution of Ribera et al. [javi-2019]
is their proposed loss function, the weighted Hausdorff distance (WHD):
\[
\mathrm{WHD}(p, Y) = \frac{1}{\mathcal{S} + \epsilon} \sum_{x \in \Omega} p_x \min_{y \in Y} d(x, y)
+ \frac{1}{|Y|} \sum_{y \in Y} M^{\alpha}_{x \in \Omega} \left[\, p_x d(x, y) + (1 - p_x) d_{\max} \,\right],
\]
where $\Omega$ is the set of pixel coordinates, $Y$ is the set of ground truth points, $M^{\alpha}$ is the generalized mean, $p_x \in [0, 1]$ is the output at pixel $x$, $\mathcal{S} = \sum_{x \in \Omega} p_x$, and the function $d(\cdot, \cdot)$ is the Euclidean distance. The $\epsilon$ in the denominator of the first term is a small positive number that provides stability if the network detects no objects. Multiplying by $p_x$ in the first term ensures that high activations at locations with no ground truth are penalized. The second term has two parts. The expression $p_x d(x, y) + (1 - p_x) d_{\max}$ is used to enforce the constraints $f(x, y) = d(x, y)$ when $p_x = 1$ and $f(x, y) = d_{\max}$ when $p_x = 0$. Now, note that $M^{\alpha}$ corresponds to the minimum function when $\alpha \to -\infty$. So, ideally, if $\alpha \to -\infty$, the minimum of the function is obtained, meaning the second constraint will penalize low activations around ground truth points. However, the minimum function makes training difficult as it is not a smooth function w.r.t. its inputs, so Ribera et al. [javi-2019] approximate it with the generalized mean at a finite negative $\alpha$. They empirically found the best values for $\alpha$ and $\epsilon$ [javi-2019].
One of the strengths of the WHD and the approach of object localization as minimizing distance between points is that it is independent of the CNN architecture used.
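The WHD described above can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the authors' released code; the function name and the default parameter values are chosen for the example rather than taken from [javi-2019]:

```python
import numpy as np

def weighted_hausdorff_distance(p, points, alpha=-1.0, eps=1e-6, d_max=None):
    """Sketch of the weighted Hausdorff distance (WHD).

    p      : 2D array in [0, 1], the network's per-pixel output.
    points : (N, 2) array of ground truth (row, col) coordinates.
    alpha  : generalized-mean exponent (alpha -> -inf recovers the minimum).
    """
    h, w = p.shape
    if d_max is None:
        d_max = float(np.hypot(h, w))   # largest possible pixel distance
    rows, cols = np.mgrid[0:h, 0:w]
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    px = p.ravel()
    # Pairwise Euclidean distances between every pixel and every GT point.
    d = np.linalg.norm(coords[:, None, :] - np.asarray(points, float)[None, :, :], axis=2)

    # First term: high activations far from all GT points are penalized.
    term1 = (px * d.min(axis=1)).sum() / (px.sum() + eps)

    # Second term: generalized mean over pixels of p_x*d(x,y) + (1-p_x)*d_max,
    # one value per GT point (eps keeps f positive so f**alpha is defined).
    f = px[:, None] * d + (1.0 - px[:, None]) * d_max + eps
    m_alpha = np.mean(f ** alpha, axis=0) ** (1.0 / alpha)
    return term1 + m_alpha.mean()

# Toy example: a map activated at the GT point vs. one activated far from it.
gt = np.array([[1.0, 1.0]])
p_good = np.zeros((4, 4)); p_good[1, 1] = 0.9
p_bad = np.zeros((4, 4)); p_bad[3, 3] = 0.9
```

A correct activation near the ground truth point yields a lower loss than the same activation far from it, which is what drives training.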
We use the modified U-Net architecture from [javi-2019], shown in Figure 3. The left block represents the downsampling path (encoder) and the right block the upsampling path (decoder). During the transfer learning process, only the weights of the encoder are copied to the target network for fine-tuning. The input image is of fixed size, and the encoder has 8 downsampling blocks, each consisting of two convolutional layers and a max pooling layer. The number of channels doubles in each of the first five blocks, while the last three blocks keep the number of channels constant while still downsampling. Compared to the original U-Net [unet] architecture, this network has 4 more downsampling blocks. It also removes the convolutional bridge structure after the last downsampling block of the original U-Net [unet]. The upsampling block is similar to the one in the original U-Net [unet] architecture. It concatenates two inputs: the output of the previous upsampling block, and the output of the downsampling block with the same shape as that output. The number of channels doubles during concatenation but returns to the original number of channels in the last convolutional layer of each upsampling block.
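To make the channel and resolution bookkeeping in the encoder concrete, the following sketch traces the tensor shape through the downsampling blocks. The input size (256×256) and initial channel count (64) are assumptions made for illustration, not values taken from the paper:

```python
def encoder_shapes(in_hw=(256, 256), in_channels=64, n_blocks=8, n_doubling=5):
    """Trace (channels, height, width) through the encoder: every block
    halves the spatial dimensions, and the channel count doubles only in
    the first n_doubling blocks (assumed defaults are hypothetical)."""
    shapes = []
    c, (h, w) = in_channels, in_hw
    for i in range(n_blocks):
        if i < n_doubling:
            c *= 2                    # channels double in the first five blocks
        h, w = h // 2, w // 2         # max pooling halves the resolution
        shapes.append((c, h, w))
    return shapes
```

With these assumed defaults, eight blocks reduce a 256×256 input all the way to a 1×1 spatial map while the channel count plateaus after the fifth block.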
The network decoder output is a saliency map, shown in Figure 4 as the “Estimated Map”. Each pixel of the saliency map takes a value in the range [0, 1] indicating the likelihood that an object is present at that location. Otsu thresholding [otsu_1979] is used on the saliency map to generate the thresholded image. Additionally, the network has fully connected layers that concatenate the input of the last layer of the encoder and the last layer of the decoder. The output of these fully connected layers is the estimated number of plant centers. The plant centers are then estimated with a Gaussian mixture model fit using expectation maximization (EM) [em]. In the Gaussian mixture model, each plant is considered a cluster, and the estimated number of plant centers is used as the number of clusters. The cluster centers are the estimated plant centers.
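A minimal NumPy implementation of Otsu thresholding on a [0, 1]-valued saliency map might look as follows. This is our own sketch (the subsequent Gaussian mixture fit is omitted):

```python
import numpy as np

def otsu_threshold(saliency, nbins=256):
    """Otsu's method: pick the threshold that maximizes the between-class
    variance of the saliency histogram (values assumed in [0, 1])."""
    hist, edges = np.histogram(saliency, bins=nbins, range=(0.0, 1.0))
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist)                 # probability of the "background" class
    w1 = 1.0 - w0                        # probability of the "plant" class
    mu0 = np.cumsum(hist * centers)      # unnormalized background mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu0) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0  # ignore empty classes at the extremes
    return centers[np.argmax(between)]
```

Pixels above the returned threshold form the segmentation on which the Gaussian mixture model is fit, with the predicted plant count as the number of mixture components.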
4 Experimental Results
Our datasets are extracted from an orthomosaic image [habib] of a maize field captured using a UAV on May 22, 2018. The UAV was flown at an altitude of 50 m. The orthomosaic image [habib] has a spatial resolution of 1 cm/pixel. The ground truth region, where manual plant center labeling was performed, is shown in the blue, green, and red boxes in Figure 5 (b) and consists of 5,500 individual plants and their labeled centers. The ground truth region was split into 80% for training (blue box in Figure 5 (b)), 10% for validation (green box), and 10% for testing (red box). We randomly extracted 2,000 images from the training region as the training dataset and 200 images from the validation region as the validation dataset. The testing dataset also consists of 200 randomly extracted images from the region in the red box in Figure 5 (b). Because of this random extraction, the images within each dataset can have high overlap. Since the ground truth region was split into disjoint regions before extraction, however, the datasets share no common imagery, which prevents testing on training data. The width and height of the randomly extracted images are uniformly distributed between 100 and 500 pixels.
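The random extraction can be sketched as follows. The region dimensions in the example are hypothetical, while the 100 to 500 pixel size range matches the text:

```python
import random

def random_crops(region_w, region_h, n, min_size=100, max_size=500, seed=0):
    """Sample n axis-aligned crops fully inside a (region_w x region_h)
    region, with width and height drawn uniformly from [min_size, max_size]."""
    rng = random.Random(seed)            # seeded for reproducibility
    crops = []
    for _ in range(n):
        w = rng.randint(min_size, max_size)
        h = rng.randint(min_size, max_size)
        x = rng.randint(0, region_w - w)  # keep the crop inside the region
        y = rng.randint(0, region_h - h)
        crops.append((x, y, w, h))
    return crops
```

Because crop positions are sampled independently, crops can overlap heavily, which is why the train/validation/test regions must be split before extraction.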
We use Precision [powers_2011], Recall [powers_2011], F1 Score [powers_2011], Mean Average Hausdorff Distance (MAHD), Mean Absolute Percent Error (MAPE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) related to plant location as our testing metrics. These are defined as:
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\]
\[
\mathrm{AHD} = \frac{1}{|Y|} \sum_{y \in Y} \min_{\hat{y} \in \hat{Y}} d(y, \hat{y})
+ \frac{1}{|\hat{Y}|} \sum_{\hat{y} \in \hat{Y}} \min_{y \in Y} d(y, \hat{y}),
\]
\[
\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i}, \qquad
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2},
\]
where MAHD is the mean of the average Hausdorff distance (AHD) over all test images.
A true positive (TP) is a detected plant located within $r$ pixels of a ground reference plant center. A false positive (FP) is a detected plant located more than $r$ pixels from any ground reference plant center. A false negative (FN) is a ground reference plant center with no detection within $r$ pixels. We find setting $r = 5$ reasonable for the plant center detection application because 5 pixels is about 5 cm at the 1 cm/pixel resolution, which is within the RMSE of the geometric targets. In the definition of the average Hausdorff distance, $Y$ and $\hat{Y}$ are the sets of ground truth plant centers and predicted plant centers, respectively. Consequently, $|Y|$ and $|\hat{Y}|$ represent the number of plant centers in the corresponding set. We use the Euclidean distance for the function $d$. For MAPE, MAE, and RMSE, $y_i$ is the ground reference total number of plants in the $i$-th extracted image, $\hat{y}_i$ is the estimated number of plants, and $N$ is the number of plant images. Precision, Recall, and F1 Score indicate how close the estimated points are to the ground reference points. Multiple plant center detections on a single plant are possible even with a high F1 Score, so we add MAPE, MAE, and RMSE to account for multiple detections.
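These metrics can be sketched in plain Python. Note that the text does not specify how detections are matched to ground reference centers, so the greedy nearest-neighbor matching below is our own assumption:

```python
import math

def match_detections(gt, pred, r=5.0):
    """Greedy one-to-one matching of predicted centers to ground truth
    centers within radius r (pixels); returns (TP, FP, FN)."""
    unmatched_gt = list(gt)
    tp = 0
    for p in pred:
        best, best_d = None, r
        for g in unmatched_gt:
            d = math.dist(p, g)          # Euclidean distance
            if d <= best_d:
                best, best_d = g, d
        if best is not None:
            unmatched_gt.remove(best)    # each GT center matches at most once
            tp += 1
    fp = len(pred) - tp
    fn = len(unmatched_gt)
    return tp, fp, fn

def count_errors(gt_counts, pred_counts):
    """MAPE (%), MAE, and RMSE between per-image plant counts."""
    n = len(gt_counts)
    mape = 100.0 / n * sum(abs(g - p) / g for g, p in zip(gt_counts, pred_counts))
    mae = sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt_counts, pred_counts)) / n)
    return mape, mae, rmse
```

Precision and Recall then follow directly as TP/(TP+FP) and TP/(TP+FN).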
We compared the performance of the model between transfer learning and training from scratch. Both networks use the modified U-Net [javi-2019] depicted in Figure 3. As noted previously, the pretrained network used in transfer learning is trained on 50,000 randomly cropped images with 15,208 distinct plant centers obtained from an orthomosaic image [habib] of a sorghum field acquired on June 13, 2016. The learning rate is set separately for the transfer learning model and for training from scratch. All training uses Adam [kingma_2014] optimization with a batch size of 16. We evaluate the network performance on the validation dataset after each epoch, and the model with the lowest average Hausdorff distance on the validation dataset is saved as the best model. The results are shown in Table 1. We first directly apply the pretrained network to the maize dataset to evaluate the baseline performance without any fine-tuning. Note that the sorghum dataset has a dark soil background, while the maize dataset has a light soil background due to drier conditions, with plants at a much earlier growth stage. The pretrained network achieves only a 0.94% F1 Score. After training (fine-tuning) on 2,000 maize images, however, the pretrained network outperforms the non-pretrained network, with a 90% F1 Score and fewer multiple detections.
We also evaluated the effectiveness of different pretrained networks for transfer learning by comparing a model pretrained on ImageNet [imagenet] with a model pretrained on plant images. The modified U-Net [javi-2019] structure does not have a readily available encoder pretrained on ImageNet [imagenet]. While we could train such an encoder ourselves, training the model on ImageNet [imagenet], with over 1 million images, would consume significant resources, and there is no guarantee that the resulting network would perform on par with publicly available pretrained networks. Thus, we use a ResNet-50 [resnet] as the encoder of the modified U-Net [javi-2019] in this comparison experiment, since ResNet-50 [resnet] has a publicly available model pretrained on ImageNet [imagenet]. The same learning rate is used for both networks, with Adam [kingma_2014] optimization and a batch size of 16. The results are shown in Table 2. The ImageNet [imagenet] pretrained network performs better than the non-pretrained network, but worse than the plant-image-pretrained network because its source domain is too different from the target domain (the more general ImageNet [imagenet] vs. UAV plant images).
We also investigated the effect of the size of the training dataset on the transfer learning result. In addition to the 2,000-image maize training dataset, we randomly cropped additional training datasets, ranging from 500 to 5,000 images, from the ground reference region. We used 2 NVIDIA GeForce 1080 Ti GPUs for training. Training with 500 images required the least training time, around 4 hours, and training with 5,000 images the most, at 12 hours; training time increases roughly linearly with the number of training images. The results are shown in Figure 6. The dataset with 2,000 images results in a model that balances performance and training time.
5 Conclusions
In this paper we presented a method to locate plant centers in UAV images with limited ground truth data by using network-based transfer learning. We showed that, with a properly pretrained network, transfer learning can improve the overall performance of the network when training data is scarce. We also demonstrated that transfer learning with a pretrained network is not effective if the distribution of the source domain is significantly different from that of the target domain. Future work will include evaluating more network structures, as well as testing with more dates, fields, and plant types.
We thank Professor Ayman Habib and the Digital Photogrammetry Research Group (DPRG) from the School of Civil Engineering at Purdue University for providing the images used in this paper.
The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0001135. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Address all correspondence to Edward J. Delp, firstname.lastname@example.org