Enlisting 3D Crop Models and GANs for More Data Efficient and Generalizable Fruit Detection

08/30/2021
by   Zhenghao Fei, et al.
University of California-Davis

Training real-world neural network models to achieve high performance and generalizability typically requires a substantial amount of labeled data spanning a broad range of variation. This data-labeling process can be both labor and cost intensive. To achieve desirable predictive performance, a trained model is typically applied in a domain where the data distribution is similar to that of the training dataset. However, for many agricultural machine learning problems, training datasets are collected at a specific location, during a specific period of the growing season. Since agricultural systems exhibit substantial variability in terms of crop type, cultivar, management, seasonal growth dynamics, lighting condition, sensor type, etc., a model trained on one dataset often does not generalize well across domains. To enable more data efficient and generalizable neural network models in agriculture, we propose a method that generates photorealistic agricultural images by translating images from a synthetic 3D crop model domain into real-world crop domains. The method uses a semantically constrained GAN (generative adversarial network) to preserve fruit position and geometry. We observe that a baseline CycleGAN method generates visually realistic target domain images but does not preserve fruit position information, while our method maintains fruit positions well. Image generation results for day and night vineyard grape images show that the visual outputs of our network are much better than those of a baseline network. Incremental training experiments on vineyard grape detection tasks show that the images generated by our method can significantly speed up the domain adaptation process, increase performance for a given number of labeled images (i.e., data efficiency), and decrease labeling requirements.


1 Introduction

Showing promising detection results in complex environments, deep neural network-based models (primarily convolutional neural networks, or CNNs) have been widely applied in agricultural applications. Bargoti and Underwood [2] applied Faster R-CNN [19] to agricultural applications such as fruit detection in orchards, including mangoes, apples, and almonds, and showed that deep neural network-based approaches achieve high accuracy in fruit detection. Santos et al. [20] applied deep neural networks, YOLO (Redmon et al. [17]) for grape detection and Mask R-CNN (He et al. [10]) for grape instance segmentation, to recognize and track grapes in RGB imagery. As another example, Zabawa et al. [24] proposed an encoder-decoder semantic segmentation network for accurate and efficient grape counting. Vasconez et al. [23] conducted a comprehensive evaluation of different CNNs applied to fruit detection and counting. Deep neural network-based approaches have not only been applied to fruit detection and counting, but have also been widely used in field-based robotic and automation applications such as thinning, pruning, and harvesting. For instance, Zhang et al. [25] tested the use of CNNs for identification of tree trunks and branches for automated shake-and-catch apple harvesting. Majeed et al. [15] developed and tested CNNs for vine cordon detection to provide a reference for robotic green shoot thinning.

While deep neural networks have been widely applied to various agricultural tasks, such approaches typically need a large amount of data to train high-performing models. Yet, in the agricultural domain, publicly available datasets are very limited, and at the same time the data are usually very specific to the application scenario, plant variety, horticultural practice, lighting condition, season, and even camera type. Silwal et al. [21] showed that apple images captured at the same location can vary substantially across different camera systems, which affects model performance. If a model trained on data from one field does not work well when applied to data from a different field, or a model trained in one year does not work well the next year, the resulting model will not be scalable. Thus, it is important to develop new techniques that enable successful adaptation of a deep learning model trained in one agricultural domain to a new domain (same crop but different horticultural practice, lighting condition, season, or camera type), while minimizing the amount of additional labeling required.

To enable more data efficient and generalizable neural network models in agricultural applications, we propose a method that generates photorealistic agricultural images by translating images from a synthetic 3D crop model domain into real-world crop domains. The main contributions of our work include:

1. A task-aware, semantically constrained GAN that translates images from one agricultural domain into another while preserving the task-related semantics (such as fruit position and size, as shown in Figure 1).

2. A domain adaptation pipeline that improves model performance in a new domain by combining fine-tuning on a small number of labeled target-domain images with training on labeled images generated by the semantically constrained GAN.

3. Utilization of a 3D crop model to generate synthetic grape images for pre-training the grape detection model, and the use of these synthetic images to generate photorealistic images that carry the same labels. This ultimately enables generation of unlimited free “labeled” images in the target domain.

2 Related Work

GANs [9] are generative models that are widely used for generating artificial data (e.g., images) with the same distribution as the training data, and the images they generate are visually realistic [12]. Domain adaptation using GANs has gained considerable attention in recent years. Zhu et al. [26] proposed an unpaired image-to-image translation method using cycle-consistent adversarial networks (i.e., CycleGAN) to translate images from one domain to another without the need for paired training data. Their method showed promising results in collection style transfer, object transfiguration, season transfer, and photo enhancement. However, CycleGAN does not specifically constrain the semantics of an image after translation. Consequently, the translated image often matches the general visual distribution of the target domain, but objects within it are often not well aligned with those in the original image. The idea of using GANs for domain adaptation has also been introduced into the agricultural domain. Giuffrida et al. [7] used adversarial unsupervised domain adaptation to reduce the domain shift between two datasets. They used an adversarial loss to match the statistics of image features between two datasets without generating visually translated images. Moreover, the leaf counting dataset they used was derived from images taken in a controlled environment, as opposed to the field. Marino et al. [16] applied CoGAN (Liu and Tuzel [14]) to bridge the domain gap between potato defect classification datasets. They tested their method on artificially brightened and differently colored potatoes, achieving performance improvements compared to when no domain adaptation was applied. However, the domain gap and scene complexity in their experiments were relatively limited. Bellocchio et al. [3] combined an unsupervised domain adaptation network (i.e., CycleGAN) and a weakly supervised fruit counting model to count fruits in four different orchards. The results show that their proposed approach is more accurate than the supervised baseline method alone, but because of the weakly supervised fruit counting model, their method is limited to counting tasks. Gogoll et al. [8] designed an unsupervised semantically consistent domain transfer method for plant/weed pixel-wise classification in new field environments. They utilized the idea that the image before and after transfer should have the same labels, which was enforced in the loss function when co-training the generators and a target-domain fully convolutional network (FCN) semantic segmentation model. They achieved very promising transfer results in the plant/weed classification task, and their method does not rely on any target-domain labeled data. However, their method cannot explicitly avoid the “trap” in which the task network and the generators cooperate to find a shortcut that tricks the loss (e.g., the generator transfers plants to stones and the task network classifies stones as plants). Drees et al. [5] extended the idea of using GANs to generate temporal predictions of plant growth, in which the model learns from a plant growth model and produces realistic, reliable images of future growth stages of plants. Kierdorf et al. [13] proposed the use of a conditional GAN for estimating grapevine berries occluded by leaves, by treating the occluded and non-occluded grapevine images as two domains, based on different leaf distributions, that can be translated into each other.

3 Approach

3.1 Problem definition

For each domain adaptation task in our problem, there are two domains. The first is the source domain A. A is usually a well-labeled real-world dataset or a synthetically generated dataset with ground-truth labels produced via a 3D rendering engine (i.e., a 3D crop model). The second domain is called the target domain B, which refers to the domain where the model is applied for prediction. There are many images in domain B, but a lack of ground-truth labels. The target domain and the source domain can be different in style but should have similar contexts. Specifically, in agricultural applications, the two domains should contain the same crop while differing in crop variety, lighting condition, camera view distance/angle, management practice, etc. Figure 2 provides an example of data from different domains in grape production, in which the differences between domains are apparent.

Figure 2: Images from four domains within grape vineyards. a) image collected at vineyard A on a shady day using an Intel RealSense D435i camera; b) image collected at vineyard B on a sunny day using a GoPro Hero7 Black camera; c) image collected at vineyard B during the night using an Intel RealSense D435i camera; d) image from the WGISD dataset [20].

In this study, our objective is to utilize the well-labeled source domain while using as little labeled data as possible in the target domain to train a task model that performs well in B. To achieve this, we want to learn a mapping between A and B given training samples x_A in A with task labels y_A and training samples x_B in B. Among the training images in domain B, a few of them (k images) can be labeled. The mapping from A to B is called G_AB and the mapping from B to A is called G_BA. A generated image G_AB(x_A) should have the same visual style as images in domain B but also maintain the same task-related semantics as the original image x_A. We hypothesize that fine-tuning a task model on the generated images should improve the performance of that model in domain B.

3.2 Task

The main objective of our method is not only photorealistic image generation but also to utilize the generated images to facilitate domain adaptation. As a result, the machine learning task and the corresponding task model are very important to our method. Generally speaking, the task model should be a fully differentiable model that can provide guidance through backpropagation to the image generation network. In this study, we chose object detection as our task, as it is one of the more popular and important tasks in agricultural machine learning applications. Specifically, the object detection model we used in this work is YOLOv3 (Redmon and Farhadi [18]).

3.3 Method

In this problem, we assume that we have access to labeled source images x_A with all task labels y_A, unlabeled target-domain training images x_B, and k labeled target-domain training images with labels y_B, where k is small. We want to train an accurate task model on domain B using as few labeled training images as possible. The step-by-step method for doing this is described below, and an overview is shown in Figure 3.

Figure 3: Training pipeline overview. The method includes four steps: 1) Train an initial task model using synthetic labeled data. 2) Fine-tune the pre-trained task model using a few labeled target domain images to embed the domain knowledge into the task model. 3) Train a semantically constrained GAN. 4) Fine-tune the task model using GAN-generated images and labels.
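The pipeline can be summarized in code. Below is a schematic sketch (not the authors' implementation); the step_* callables and variable names are hypothetical placeholders for the four stages listed above.

```python
# Schematic sketch of the four-step pipeline; each step_* argument is a callable
# implementing that stage (hypothetical names, supplied by the caller).
def run_pipeline(synthetic_labeled, target_unlabeled, target_labeled_few,
                 step_pretrain, step_finetune, step_train_sem_gan):
    task_a = step_pretrain(synthetic_labeled)                    # 1) pre-train on 3D crop model images
    task_b = step_finetune(task_a, target_labeled_few)           # 2) embed target-domain knowledge
    g_ab = step_train_sem_gan(synthetic_labeled,                 # 3) train the semantically
                              target_unlabeled, task_b)          #    constrained GAN (task model frozen)
    generated = [(g_ab(x), y) for x, y in synthetic_labeled]     # 4) translate images, keep labels
    return step_finetune(task_b, generated)                      #    and fine-tune the detector again
```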

3.3.1 Train an initial task model

Given the labeled source images x_A with all task labels y_A, we can train an initial task model M_A in a supervised way by minimizing the task loss. This model performs well in domain A but relatively poorly in domain B, and the level of performance degradation is related to the domain gap between A and B (e.g., a model trained in a daylight domain usually performs better in another daylight domain than in a night domain).

3.3.2 Embed the domain knowledge into a task network

One of the main ideas behind our method is to embed the domain knowledge into a task network by fine-tuning the initial task model M_A using a few (k) labeled images in domain B; the fine-tuned task model is named M_B. Based on our findings, even fine-tuning with a single labeled image in domain B can make M_B perform much better than M_A. We call this step “domain knowledge embedding”.
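As an illustration, a minimal PyTorch-style sketch of this fine-tuning step is shown below. It is not the authors' code; the model, data loader, and detection loss are assumed to be supplied by the caller, and the epoch count and learning rate are arbitrary placeholders.

```python
import torch

def embed_domain_knowledge(model, labeled_b_loader, task_loss_fn, epochs=50, lr=1e-4):
    """Fine-tune the synthetic-pretrained detector M_A on the k labeled target images."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in labeled_b_loader:         # k can be as small as 1 image
            optimizer.zero_grad()
            loss = task_loss_fn(model(images), targets)   # detection loss of the task model
            loss.backward()
            optimizer.step()
    return model  # the domain-embedded task model M_B
```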

Figure 4: Overview of the proposed image generation network.

3.3.3 Train a semantically constrained target domain image generator

To achieve the generation of images from domain A to domain B while retaining the semantics (the meaning of “semantics” here is task-specific), we present an image generator network composed of five main parts: 1) an image generator from domain A to B, G_AB; 2) an image generator from domain B to A, G_BA; 3) an adversarial discriminator D_B to distinguish between generated images G_AB(x_A) and real domain-B images x_B; 4) an adversarial discriminator D_A to distinguish between generated images G_BA(x_B) and real domain-A images x_A; 5) a task network M_B in domain B, whose inputs are images G_AB(x_A) generated from domain-A images and whose target labels are the corresponding ground-truth labels y_A. Parts 1-4 are the same as in the CycleGAN work of Zhu et al. [26] and are used to generate realistic fake images. Part 5 is the key to retaining semantic consistency between a real image x_A and the generated image G_AB(x_A). The overview of the proposed method is shown in Figure 4.
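A minimal sketch of how the five components might be collected into one module is shown below (hypothetical class and constructor names; the paper reuses the CycleGAN generator/discriminator architectures and a YOLOv3-tiny task network).

```python
import torch.nn as nn

class SemGANModules(nn.Module):
    """Container for the five parts of the semantically constrained GAN."""
    def __init__(self, make_generator, make_discriminator, task_model_b):
        super().__init__()
        self.g_ab = make_generator()      # 1) generator from domain A to B
        self.g_ba = make_generator()      # 2) generator from domain B to A
        self.d_b = make_discriminator()   # 3) discriminator for domain B
        self.d_a = make_discriminator()   # 4) discriminator for domain A
        self.task_b = task_model_b        # 5) task network M_B, frozen during GAN training
        for p in self.task_b.parameters():
            p.requires_grad_(False)
```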

In terms of training this network, there are several losses that need to be optimized; among these, losses 1-3 are the same as in Zhu et al. [26].

1) Adversarial loss: The adversarial loss (Goodfellow et al. [9]) is applied to both generators and their corresponding discriminators. The objective of the adversarial loss is to train the generator network to generate visually realistic images in the target domain. For the generator G_AB from domain A to B and its discriminator D_B, the loss is as follows, with a symmetric loss applied to G_BA and D_A:

L_GAN(G_AB, D_B) = E_{x_B}[log D_B(x_B)] + E_{x_A}[log(1 − D_B(G_AB(x_A)))]    (1)

L_GAN(G_BA, D_A) = E_{x_A}[log D_A(x_A)] + E_{x_B}[log(1 − D_A(G_BA(x_B)))]    (2)
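For reference, a sketch of these adversarial terms in PyTorch is shown below. It follows common CycleGAN practice of replacing the log-likelihood objective with a least-squares loss; this substitution is an implementation assumption, not something stated in the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_b, real_b, fake_b):
    """D_B should score real target-domain images as 1 and generated images as 0."""
    pred_real = d_b(real_b)
    pred_fake = d_b(fake_b.detach())  # detach: do not backpropagate into the generator here
    return 0.5 * (F.mse_loss(pred_real, torch.ones_like(pred_real)) +
                  F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))

def generator_adv_loss(d_b, fake_b):
    """G_AB is rewarded when D_B scores its output as real."""
    pred_fake = d_b(fake_b)
    return F.mse_loss(pred_fake, torch.ones_like(pred_fake))
```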

2) Cycle consistency loss: There are infinitely many possible image mappings from one domain to the other that match the target domain distribution. To constrain the space of this mapping function, Zhu et al. [26] introduced the cycle consistency loss, which forces the image translation cycle to return the input image back to the original image (i.e., G_BA(G_AB(x_A)) ≈ x_A and G_AB(G_BA(x_B)) ≈ x_B). The cycle consistency loss is expressed as

L_cyc(G_AB, G_BA) = E_{x_A}[||G_BA(G_AB(x_A)) − x_A||_1] + E_{x_B}[||G_AB(G_BA(x_B)) − x_B||_1]    (3)

3) Identity loss: To constrain the image generator to preserve color information between the input and output, an identity loss is applied. The identity loss was first introduced by Taigman et al. [22], and is defined as

L_idt(G_AB, G_BA) = E_{x_B}[||G_AB(x_B) − x_B||_1] + E_{x_A}[||G_BA(x_A) − x_A||_1]    (4)

The intuition behind adding the identity loss is to encourage each generator to act as an identity mapping when it is fed images that already belong to its output domain.
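Both Eq. (3) and Eq. (4) are plain L1 penalties; a minimal sketch of the two terms is shown below (illustrative only, with generators and image batches passed in by the caller).

```python
import torch.nn.functional as F

def cycle_loss(g_ab, g_ba, real_a, real_b):
    """Eq. (3): A -> B -> A and B -> A -> B should each reconstruct the input."""
    rec_a = g_ba(g_ab(real_a))
    rec_b = g_ab(g_ba(real_b))
    return F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b)

def identity_loss(g_ab, g_ba, real_a, real_b):
    """Eq. (4): a generator fed an image already in its output domain should change nothing."""
    return F.l1_loss(g_ab(real_b), real_b) + F.l1_loss(g_ba(real_a), real_a)
```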

4) Task-specific semantic constraint loss: Using losses 1-3 we can train a pair of generators that produce visually realistic images in the target domains. However, aside from the cycle consistency loss, the semantics are not specifically constrained after translation. The semantics, especially detailed spatial semantics such as the position and size of the fruit, are prone to change, which makes the generated labels unusable for domain adaptation when localization is required. To overcome this limitation, we use a task-specific semantic constraint loss: given a task model M_B in domain B, the prediction result for a generated image G_AB(x_A) should be identical to x_A's ground-truth label y_A. During backpropagation and parameter updates, the weights in M_B are fixed, and the gradient is passed into the generator to encourage it to generate images for which M_B makes more accurate predictions. We find this to be a very data-efficient way to extract knowledge from M_B and generate semantically consistent translated images. This loss is referred to as L_sem, and its specific form depends on the task and task model (e.g., YOLOv3 has its own specific loss). The only requirement on this loss and the task is that the task loss is differentiable.
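A sketch of this constraint is shown below: the task model M_B is kept frozen and only supplies gradients that push G_AB(x_A) to keep the grapes where the source label y_A says they are. The name yolo_loss is a hypothetical stand-in for the detection loss of the chosen task model.

```python
def semantic_constraint_loss(g_ab, task_model_b, real_a, labels_a, yolo_loss):
    """Task-specific semantic constraint L_sem; gradients flow only into the generator."""
    for p in task_model_b.parameters():
        p.requires_grad_(False)                 # keep M_B fixed during this step
    fake_b = g_ab(real_a)                       # translated image; gradients flow through G_AB
    predictions = task_model_b(fake_b)
    return yolo_loss(predictions, labels_a)     # predictions on the fake image must match y_A
```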

Combining all the losses above, the full objective function is given by:

L(G_AB, G_BA, D_A, D_B) = L_GAN(G_AB, D_B) + L_GAN(G_BA, D_A) + λ_cyc L_cyc(G_AB, G_BA) + λ_idt L_idt(G_AB, G_BA) + λ_sem L_sem(G_AB, M_B)    (5)

where λ_cyc, λ_idt, and λ_sem are the relative weights of the cycle consistency loss, identity loss, and task-specific semantic constraint loss, respectively. When λ_sem = 0, this method collapses to the original CycleGAN method.
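Reusing the loss helpers and module container sketched above, one generator update for the full objective might look as follows; the default weight values are illustrative placeholders, not the paper's settings.

```python
def generator_step(mods, opt_g, real_a, real_b, labels_a, yolo_loss,
                   lambda_cyc=10.0, lambda_idt=5.0, lambda_sem=1.0):
    """One optimization step of G_AB and G_BA under the full objective in Eq. (5)."""
    fake_b = mods.g_ab(real_a)
    fake_a = mods.g_ba(real_b)
    loss = (generator_adv_loss(mods.d_b, fake_b)
            + generator_adv_loss(mods.d_a, fake_a)
            + lambda_cyc * cycle_loss(mods.g_ab, mods.g_ba, real_a, real_b)
            + lambda_idt * identity_loss(mods.g_ab, mods.g_ba, real_a, real_b)
            + lambda_sem * semantic_constraint_loss(mods.g_ab, mods.task_b,
                                                    real_a, labels_a, yolo_loss))
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```

The discriminators D_A and D_B would be updated in a separate step with their own optimizer, as in standard CycleGAN training.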

3.3.4 Fine-tune the task network using generated images

After the previous step, we have a semantically consistent image generator G_AB. Applying G_AB to all of the domain-A data, we obtain an equal number of generated images whose labels y_A carry over from the corresponding source images. Using these generated data, we can further train the task network M_B to improve its performance in domain B.
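A minimal sketch of this step is shown below: every labeled source image is translated with the trained G_AB and paired with its unchanged source label, yielding a “free” labeled training set in the target style (illustrative code, not the authors' implementation).

```python
import torch

@torch.no_grad()
def build_generated_dataset(g_ab, source_pairs):
    """source_pairs: iterable of (image_tensor, label); returns target-style pairs."""
    g_ab.eval()
    generated = []
    for image, label in source_pairs:
        fake_b = g_ab(image.unsqueeze(0)).squeeze(0)  # translate the A-domain image to B style
        generated.append((fake_b, label))             # the label is unchanged by design
    return generated
```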

3.4 Network architectures

The generators and discriminators are the same as those in Zhu et al. [26]. The task network is YOLOv3-tiny, a very lightweight model proposed by Redmon and Farhadi [18]. The main reason for choosing YOLOv3-tiny as the task network is to decrease training time, but other differentiable task networks could be used as well.

4 Experiments and Results

4.1 3D synthetic source domain

One special source domain is the 3D synthetic domain, where images are generated using a rendering engine instead of being collected in the real world. The benefit of generating images with a rendering engine is that the ground-truth labels are known and easily extracted; the synthetic domain can be treated as a domain with “infinite” labeled images. On the other hand, no matter how realistic the synthetic images are, there is always a domain gap between the synthetic images and the real-world domain where the model needs to be applied, and this gap leads to model performance degradation in the target domain. Moreover, increasing the level of photorealism increases rendering time, which reduces the scalability of synthetic image generation for agricultural applications.

In this work, we used the open-source Helios 3D Plant and Environmental Biophysical Modeling Framework of Bailey [1] to generate synthetic grape images. Using Helios, we generated 500 synthetic vineyard images that spanned a range of geometric canopy parameters, trellis types, and camera positions. Bounding box labels for grape clusters were generated using a custom Helios plugin. Importantly, Helios can be used to parametrically generate 3D geometries for a wide range of crop types, which can then be used to create synthetically labeled data for tasks such as object detection, semantic segmentation, or instance segmentation.

4.2 Real world target domain

We have two target domains in this work: one called the day domain and the other called the night domain.

4.2.1 Day domain

The day domain data were collected in the California Central Valley using a GoPro camera during the daytime in Summer 2019. This dataset contains 3065 images, of which we labeled 100; among these, 25 images are always reserved as the test set to evaluate model performance in this domain. Example images from the day domain are shown in Figure 5.

4.2.2 Night domain

The night domain data were collected at the same location as the day domain using an Intel RealSense D435i camera with a custom lighting system during nighttime in Summer 2020. This dataset contains 800 images, of which we labeled 150; among these, 24 are always reserved as the test set to evaluate model performance in this domain. Example images from the night domain are shown in Figure 5.

Figure 5: Top: example day domain real images; bottom: example night domain real images.

4.3 Experimental Design

The main idea of our work is to utilize a 3D crop model and a GAN to reduce the need for labeling in a new domain. The task we choose here is grape detection, and the detection model we use is YOLOv3 with the tiny backbone (Redmon and Farhadi [18]). We hold out a test dataset for each target domain that is not used during model training and serves only to evaluate final model performance.

To evaluate how the combined 3D crop model and GAN approach affects data efficiency, we pre-trained an object detection model with the synthetic (3D crop model generated) images and their generated labels. We evaluate the performance of this pre-trained model, the performance of the model after further fine-tuning on k labeled target-domain real images, and the performance of the model after fine-tuning on k labeled target-domain real images together with GAN-generated images paired with their source-domain labels.

We choose k = 2, 5, 10, 15, 20, 30, 40, 50. For each k, 80% of the labeled images are used for training (a images) and the remaining 20% are used for validation (b images), with at least 1 image in each set. The best-performing model on the validation set during training is selected. The GAN for each experiment uses the same model fine-tuned on the k labeled target-domain images; no additional labeled images are introduced when training the GAN. We also evaluated the performance of the model when only CycleGAN is used to generate images and the model is fine-tuned on these generated images. We use AP (Average Precision; see Everingham et al. [6] for a detailed definition) at 0.3 and 0.5 IoU (Intersection over Union) as model performance metrics.
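For reference, the IoU underlying the AP@IoU0.3 and AP@IoU0.5 metrics can be computed as in the short sketch below (boxes given as (x1, y1, x2, y2) corner coordinates; a detection is counted as correct when its IoU with a ground-truth box exceeds the threshold).

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```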

4.3.1 Generate images using semantically constrained GAN

To help better understand the quality of images generated using the semantically constrained GAN, a set of random results is shown in Figure 6. The source images in domain A are randomly selected; each pair of rows uses the same source image from the first column, translated into different target domains. From the generated images, we can see that both the baseline CycleGAN and the semantically constrained CycleGAN models generate visually realistic images. However, the baseline CycleGAN has trouble generating images with the grapes in the same locations as in the source synthetic images. This “positional drift” problem is more significant in the generated night domain images than in the day domain images. The main reason for this drift is that the CycleGAN network is given no information from which to learn what a grape is, and the domain gap between the night domain and the synthetic domain is relatively large. Using the semantically constrained GAN, even when the task-constrained network is trained on only 1 labeled image and validated on only 1 labeled image, the generated images are well constrained semantically in terms of grape position and size. Also, a single source-domain 3D rendered image can be translated into two different real-world domains using the two generators, and both generated images show the same grape distribution as the 3D rendered image.

Figure 6: Example GAN-generated images (randomly selected). The first column shows randomly selected source domain images with their ground-truth generated labels. Columns 2–4 show generated images with the projected labels drawn as yellow boxes (the same as the labels in the first column, shown only for visualization purposes). a is the number of labeled target-domain images used for training and b is the number of labeled target-domain images used for validation.

4.3.2 Fruit detection performance

We first trained a task network using only synthetic 3D grape model images (345 for training and 74 for validation). The results of applying this model to the two target real-world domains are shown in Table 2, and the baseline methods' results are shown in Table 1. The performance of the direct synthetic-to-real model transfer is shown in the rows labeled “Synthetic Pre-trained”. We also applied the CycleGAN method to the synthetic–night and synthetic–day images, generated images in the target domains, and fine-tuned the pre-trained task model on these generated images (validating on 20% of the generated images); these results are shown in the rows labeled “CycleGAN”. The “fine-tuned” columns contain the results of fine-tuning the pre-trained task model on the labeled target-domain training images, selecting the model based on the labeled target-domain validation images. The performance of the task networks refined using our semantically constrained GAN is shown in the “SemGAN + fine-tuned” columns. As the results show, direct application of a model pre-trained on the 3D synthetic domain to a real domain can result in relatively poor performance, since the real domains differ from the 3D synthetic domain; for the night domain in particular, the pre-trained model has almost no ability to predict grape locations. One naive domain adaptation method is to use CycleGAN to generate target domain images, assume the labels are the same as those of the source images, and further train the pre-trained model on these generated images and labels. Our experiments show that this approach does not lead to a performance improvement, and can even decrease performance in the day domain; the main reason is that the generated images do not always keep the grape clusters in the same locations, so the source labels are no longer valid. Another domain adaptation method is to use some labeled images in the target domain to fine-tune the pre-trained network. This classical method is still very effective and leads to a significant increase in model performance even with 1 labeled training image, and performance increases as more labeled target-domain images are involved. Our method further improves data efficiency over fine-tuning alone with the same labeled target-domain images, especially when very few labeled target-domain images are available.

Baseline method          Train a   Valid b   Total k   Synthetic to Day Domain     Synthetic to Night Domain
                                                       AP@IoU0.3   AP@IoU0.5       AP@IoU0.3   AP@IoU0.5
Synthetic Pre-trained    0         0         0         27.8        13.2            0.0         0.8
CycleGAN                 0         0         0         10.3        2.1             0.1         0.0

Table 1: Grape detection results using baseline domain adaptation methods. Average precision (AP) numbers are percentages. “Synthetic Pre-trained” means the model was trained only on synthetically generated images. “CycleGAN” means CycleGAN-generated images were used to fine-tune the pre-trained model. All models are evaluated on the same test datasets as in Table 2.
Train a   Valid b   Total k   Synthetic to Day Domain                           Synthetic to Night Domain
                              fine-tuned            SemGAN + fine-tuned         fine-tuned            SemGAN + fine-tuned
                              AP@IoU0.3  AP@IoU0.5  AP@IoU0.3  AP@IoU0.5        AP@IoU0.3  AP@IoU0.5  AP@IoU0.3  AP@IoU0.5
1         1         2         37.4       16.4       51.0       23.4             32.6       8.3        38.3       12.1
4         1         5         49.7       21.8       51.8       23.6             38.2       10.8       37.8       12.2
8         1         9         39.8       16.5       55.7       26.5             35.7       9.3        38.2       13.0
12        2         14        52.1       26.6       54.7       26.6             43.0       13.9       45.1       12.8
16        3         19        51.6       23.7       57.9       30.8             43.0       13.4       46.1       17.3
24        6         30        56.3       26.5       57.6       28.5             45.2       17.5       48.2       20.2
32        8         40        57.9       31.1       59.5       36.1             46.1       17.5       51.4       20.2
40        10        50        57.4       31.7       63.9       36.4             49.7       19.3       50.5       20.6
98(all)   28(all)   126       /          /          /          /                52.8       22.8       56.0       26.7
58(all)   15(all)   73        61.3       37.0       61.7       37.2             /          /          /          /

Table 2: Grape detection results. Average precision (AP) numbers are percentages. “/” means not applicable. The “SemGAN + fine-tuned” columns show the results of fine-tuning the detection network on images generated by the semantically constrained CycleGAN. The “fine-tuned” columns use only the labeled training images in the target domain to fine-tune the detection network. Networks are selected using the corresponding labeled validation images. For each domain, all models are evaluated on the same test dataset.

5 Discussion and Future Work

To apply deep learning-based AI models in agricultural and plant environments, we need to overcome the problems of insufficient labeled data and massive variability (e.g., plant appearance, horticultural practice, seasonal differences, lighting differences). It is labor and cost intensive to manually label images across the broad range of scenarios encountered in agricultural production environments, and this requirement hinders the large-scale deployment of deep learning models in agricultural production. To address these problems and make deep learning model deployment more feasible in new agricultural environments, we proposed a semantically constrained GAN. We presented a training pipeline for this network and used the generated images to improve task model performance in a new domain, with fruit detection as the task. The results in this paper showed that using a semantically constrained GAN we can generate very realistic day and night grapevine images from 3D-rendered images while retaining grape position and size. The generated images can be used to further train the task network and improve its performance in the target domain, surpassing the vanilla fine-tuning results, especially when the number of labeled images is low.

Many interesting questions remain to be answered following this research. 1) Using this method, we successfully constrained grape position and geometry, but other parts of the images are unconstrained (e.g., foliage, trunks, sky). The reason they are not constrained is that the task-constrained network is only designed to identify grapes. It would be interesting to see whether the task-constrained network can identify and constrain multi-class objects, or even constrain the whole scene semantics, by replacing the object detection task network with a semantic segmentation task network. 2) When further training using the GAN-generated images, we did not include generated images in the validation set, only the true labeled images (except when using CycleGAN, since no labeled images were involved). However, adding GAN-generated images to the validation set to select the best model could also help improve overall model accuracy, especially when the number of labeled validation images is low or even zero. Determining the best mixing ratio between GAN-generated images and labeled images in the validation set could further improve data efficiency. 3) The main GAN network architecture of this work is the same as in the CycleGAN work except for the task-constrained network. While adding only the task-constrained network already achieves good semantic consistency, other work focuses on different ways of achieving semantic consistency, such as Hoffman et al. [11] and Chen et al. [4]. It would be interesting to see how model performance changes when these semantic consistency methods are utilized.

6 Acknowledgement

This project was partly supported by the USDA AI Institute for Next Generation Food Systems (AIFS), USDA award number 2020-67021-32855, and by the NSF funded Center for Data Science and Artificial Intelligence award number 1934568. Brian N. Bailey was supported by USDA National Institute of Food and Agriculture Hatch project 1013396. We would also like to thank Hamid Kamangir and Kaustubh Deshpande for support on synthetic data generation.

References

  • [1] B. N. Bailey (2019) Helios: A Scalable 3D Plant and Environmental Biophysical Modeling Framework. Frontiers in Plant Science 10, pp. 1–17.
  • [2] S. Bargoti and J. Underwood (2017) Deep fruit detection in orchards. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3626–3633.
  • [3] E. Bellocchio, G. Costante, S. Cascianelli, M. L. Fravolini, and P. Valigi (2020) Combining Domain Adaptation and Spatial Consistency for Unseen Fruits Counting: A Quasi-Unsupervised Approach. IEEE Robotics and Automation Letters 5(2), pp. 1079–1086.
  • [4] Y. C. Chen, Y. Y. Lin, M. H. Yang, and J. B. Huang (2019) CrDoCo: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1791–1800.
  • [5] L. Drees, L. V. Junker-Frohn, J. Kierdorf, and R. Roscher (2021) Temporal Prediction and Evaluation of Brassica Growth in the Field using Conditional Generative Adversarial Networks. arXiv preprint arXiv:2105.07789.
  • [6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), pp. 303–338.
  • [7] M. V. Giuffrida, A. Dobrescu, P. Doerner, and S. A. Tsaftaris (2019) Leaf counting without annotations using adversarial unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2590–2599.
  • [8] D. Gogoll, P. Lottes, J. Weyler, N. Petrinic, and C. Stachniss (2020) Unsupervised domain adaptation for transferring plant classification systems to new field environments, crops, and robots. In IEEE International Conference on Intelligent Robots and Systems, pp. 2636–2642.
  • [9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2020) Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2).
  • [11] J. Hoffman, E. Tzeng, T. Park, J. Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In Proceedings of the 35th International Conference on Machine Learning, pp. 1989–1998.
  • [12] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations.
  • [13] J. Kierdorf, I. Weber, A. Kicherer, L. Zabawa, L. Drees, and R. Roscher (2021) Behind the leaves – Estimation of occluded grapevine berries with conditional generative adversarial networks. arXiv preprint arXiv:2105.10325.
  • [14] M. Y. Liu and O. Tuzel (2016) Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 469–477.
  • [15] Y. Majeed, M. Karkee, Q. Zhang, L. Fu, and M. D. Whiting (2021) Development and performance evaluation of a machine vision system and an integrated prototype for automated green shoot thinning in vineyards. Journal of Field Robotics.
  • [16] S. Marino, P. Beauseroy, and A. Smolarz (2020) Unsupervised adversarial deep domain adaptation method for potato defects classification. Computers and Electronics in Agriculture 174, pp. 105501.
  • [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • [18] J. Redmon and A. Farhadi (2018) YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [20] T. T. Santos, L. L. de Souza, A. A. dos Santos, and S. Avila (2020) Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Computers and Electronics in Agriculture 170, pp. 105247.
  • [21] A. Silwal, T. Parhar, F. Yandun, and G. Kantor (2021) A Robust Illumination-Invariant Camera System for Agricultural Applications. arXiv preprint arXiv:2101.02190.
  • [22] Y. Taigman, A. Polyak, and L. Wolf (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200.
  • [23] J. P. Vasconez, J. Delpiano, S. Vougioukas, and F. A. Cheein (2020) Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Computers and Electronics in Agriculture 173, pp. 105348.
  • [24] L. Zabawa, A. Kicherer, L. Klingbeil, R. Töpfer, H. Kuhlmann, and R. Roscher (2020) Counting of grapevine berries in images via semantic segmentation using convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 164, pp. 73–83.
  • [25] X. Zhang, M. Karkee, Q. Zhang, and M. D. Whiting (2021) Computer vision-based tree trunk and branch identification and shaking points detection in dense-foliage canopy for automated harvesting of apples. Journal of Field Robotics 38(3), pp. 476–493.
  • [26] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2242–2251.