The success of deep learning in visual recognition tasks has driven advances in multiple fields of research. In particular, its application in agriculture has received increasing attention. Nevertheless, while visual pattern recognition on farmlands carries enormous economic value, little progress has been made to merge computer vision and crop sciences due to the lack of suitable agricultural image datasets. Meanwhile, problems in agriculture also pose new challenges in computer vision. For example, semantic segmentation of aerial farmland images requires inference over extremely large images with extreme annotation sparsity. These challenges are not present in most common object datasets, and we show that they are more demanding than those posed by many other aerial image datasets. To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel. We annotate nine types of field anomaly patterns that are most important to farmers. As a pilot study of aerial agricultural semantic segmentation, we perform comprehensive experiments using popular semantic segmentation models; we also propose an effective model designed for aerial agricultural pattern recognition. Our experiments demonstrate several challenges Agriculture-Vision poses to both the computer vision and agriculture communities. Future versions of this dataset will include even more aerial images, anomaly patterns and image channels. More information at https://www.agriculture-vision.com.
Since the introduction of ImageNet, a large-scale image classification dataset, research in computer vision and pattern recognition using deep neural networks has seen unprecedented development [30, 22, 48, 47, 24]. Deep neural network based algorithms have proven effective across multiple domains such as medicine and astronomy [33, 1], across multiple datasets [19, 16], across different computer vision tasks [54, 27, 55, 8, 10, 9, 46, 42], and across different numerical precisions and hardware architectures [53, 57]. However, progress in visual pattern recognition for agriculture, one of the fundamental activities of the human race, has been relatively slow. This is partially due to the lack of relevant datasets that encourage the study of agricultural imagery and visual patterns, which pose many distinctive characteristics.
A major direction of visual recognition in agriculture is aerial image semantic segmentation. Solving this problem is important because it carries tremendous economic potential: efficient algorithms for detecting field conditions enable timely actions that prevent major losses or increase potential yield throughout the growing season. However, it is much more challenging than typical semantic segmentation tasks on other aerial image datasets. For example, to segment weed patterns in aerial farmland images, an algorithm must identify sparse weed clusters of vastly different shapes and coverages. In addition, some of these aerial images span tens of thousands of pixels on each side, which poses a serious problem for end-to-end segmentation in terms of computation power and memory consumption. Agricultural data are also inherently multi-modal, where information such as field temperature and near-infrared signal is essential for determining field conditions. These properties deviate from those of conventional semantic segmentation tasks, reducing the applicability of existing methods to this area of research.
Table 1: Statistics of related aerial and agricultural image datasets.

| Dataset | # Images | # Classes | # Labels | Tasks | # Pixels | Channels | Resolution |
|---|---|---|---|---|---|---|---|
| Inria Aerial Image | 180 | 2 | 180 | seg. | 4.5B | RGB | 30 cm/px |
| AID | 10,000 | 30 | 10,000 | cls. | 3.6B | RGB | 50-800 cm/px |
| DeepGlobe Building | 24,586 | 2 | 302,701 | det. / seg. | 10.4B | 9 bands | 31-124 cm/px |
| EuroSAT | 27,000 | 10 | 27,000 | cls. | 1.77B | 13 bands | 30 cm/px |
| SAT-4 | 500,000 | 4 | 500,000 | cls. | 0.39B | RGB, NIR | 600 cm/px |
| SAT-6 | 405,000 | 6 | 405,000 | cls. | 0.32B | RGB, NIR | 600 cm/px |
| Crop/Weed discrimination | 60 | 2 | 494 | seg. | 0.08B | RGB | N/A |
| Sensefly Crop Field | 5,260 | N/A | N/A | N/A | N/A | NRG, Red edge | 12.13 cm/px |
| Agriculture-Vision (ours) | 94,986 | 9 | 169,086 | seg. | 22.6B | RGB, NIR | 10/15/20 cm/px |
DeepWeeds has only image-level weed annotations, but includes 8 sub-categories of weeds.
To encourage research on this challenging task, we present Agriculture-Vision, a large-scale, high-quality dataset of aerial farmland images for advancing studies of agricultural semantic segmentation. We collected images throughout the growing seasons at numerous farming locations in the US, where several important field patterns were annotated by agronomy experts.
Agriculture-Vision differs significantly from other aerial image datasets in the following aspects: (1) unprecedented aerial image resolutions up to 10 cm per pixel (cm/px); (2) multiple aligned image channels beyond RGB; (3) challenging annotations of multiple agricultural anomaly patterns; (4) precise annotations from professional agronomists with a strict quality assurance process; and (5) large size and shape variations of annotations. These features make Agriculture-Vision a unique image dataset that poses new challenges for semantic segmentation in aerial agricultural images.
Our main contributions are summarized as follows:
We introduce a large-scale and high-quality aerial agricultural image database for advancing research in agricultural pattern analysis and semantic segmentation.
We perform a pilot study with extensive experiments on the proposed database and provide a baseline for semantic segmentation using deep learning approaches to encourage further research.
Most segmentation datasets primarily focus on common objects or street views. For example, the Pascal VOC and MS-COCO segmentation datasets respectively consist of 20 and 91 daily object categories such as airplane, person, computer, etc. The Cityscapes dataset, where dense annotations of street scenes are available, opened up research directions in street-view scene parsing and encouraged more research efforts in this area.
Aerial image visual recognition has also gained increasing attention. Unlike daily scenes, aerial images are often significantly larger. For example, the DOTA dataset contains images with sizes up to 4000x4000 pixels, significantly larger than those in common object datasets, which are typically around 500x500 pixels. Yet aerial images are often of much lower resolution. For instance, the CVPR DeepGlobe2018 Building Extraction Challenge uses aerial images at a resolution of 31 cm/px or lower. As a result, finer object details such as shape and texture are lost and have to be omitted in later studies.
Table 1 summarizes the statistics of the most related datasets, including those of aerial images and agricultural images. As can be seen from the table, there has been an apparent lack of large-scale aerial agricultural image databases, which, in some sense, hinders agricultural visual recognition research from the rapid growth witnessed for common images.
Meanwhile, many agricultural studies have proposed solutions to extract meaningful information from images. These papers cover numerous subtopics, such as spectral analysis of land and crops [34, 26, 29], aerial device photogrammetry [20, 31], color indices and low-level image feature analysis [17, 50, 14], as well as integrated image processing systems [31, 32]. One popular approach to analyzing agricultural images is to use geo-color indices such as the Normalized Difference Vegetation Index (NDVI) and Excess Green Index (ExG), which correlate highly with land information such as water and plantations.
In addition, recent work in computer vision has been heavily driven by deep convolutional neural networks (DCNNs), which have also entered the spotlight in agricultural vision problems such as land cover classification and weed detection. In one such work, Lu et al. collected aerial images using an EOS 5D camera at 650 m and 500 m above ground in Penzhou and Guanghan County, Sichuan, China, and labeled cultivated land vs. background using a three-layer CNN model. In another recent work, Rebetez et al. utilized an experimental farmland dataset from the Swiss Confederation's Agroscope research center and proposed a DCNN-HistNN hybrid model to categorize plant species at the pixel level. Nevertheless, since their datasets are limited in scale and their models are outdated, neither work carries state-of-the-art deep learning approaches into agricultural applications in the long run.
Agriculture-Vision aims to be a publicly available large-scale aerial agricultural image dataset that is high-resolution, multi-band, and with multiple types of patterns annotated by agronomy experts. In its current stage, we have captured 3,432 farmland images with nine types of annotations: double plant, drydown, endrow, nutrient deficiency, planter skip, storm damage, water, waterway and weed cluster. All of these patterns have substantial impacts on field conditions and the final yield. These farmland images were captured between 2017 and 2019 across multiple growing seasons in numerous farming locations in the US. The proposed Agriculture-Vision dataset contains 94,986 images sampled from these farmlands. In this section, we describe the details on how we construct the Agriculture-Vision dataset, including image acquisition, preprocessing, pattern annotation, and finally image sample generation.
Farmland images in the Agriculture-Vision dataset were captured by specialized mounted cameras on aerial vehicles flown over numerous fields in the US. These farmland images can be identified by their years of acquisition. All images in the current version of Agriculture-Vision were collected from the growing seasons between 2017 and 2019. Each field image contains four color channels: Near-infrared (NIR), Red, Green and Blue.
Table 2: Camera settings for each year of image acquisition.

| Year | Channels | Resolution | Band type | Camera |
|---|---|---|---|---|
| 2017 | N, R, G, B | 15 cm/px | Narrow band | Canon SLR |
| 2018 | N, R, G | 10 cm/px | Narrow band | Nikon D850 |
| 2018 | B | 20 cm/px | Wide band | Nikon D800E |
| 2019 | N, R, G, B | 10 cm/px | Narrow band | WAMS |
The camera settings for capturing farmland images are shown in Table 2. Farmland images in 2017 were taken with two aligned Canon SLR cameras, where one captures RGB images and the other captures only the NIR channel. For farmland images in 2018, the NIR, Red and Green (NRG) channels were taken using two Nikon D850 cameras to enable 10 cm/px resolution. Custom filters were used to capture near-infrared instead of the blue channel. Meanwhile, the separate Blue channel images were captured using one Nikon D800E at 20 cm/px resolution, and were then scaled up to align with the corresponding NRG images. Farmland images in 2019 were captured using a proprietary Wide Area Multi-Spectral System (WAMS) commonly used for remote sensing. The WAMS captures all four channels simultaneously at 10 cm/px resolution. Note that, compared to the other aerial image datasets in Table 1, our dataset contains images at higher resolution than all of them.
Farmland images captured in 2017 were already stored as regular pixel values between 0 and 255, while those captured in 2018 and 2019 were initially stored in camera raw pixel format. Following the conventional method for normalizing agricultural images, for each of the four channels in one field image, we first compute the 5th and 95th percentile pixel values, then clip all pixel values $v$ in the image by a lower bound and an upper bound:

$$\hat{v} = \min\big(\max(v,\ v_{\text{lower}}),\ v_{\text{upper}}\big), \qquad v_{\text{lower}} = P_5,\quad v_{\text{upper}} = P_{95},$$

where $v_{\text{lower}}$, $v_{\text{upper}}$ stand for the lower and upper bound of pixel values respectively, and $P_5$, $P_{95}$ stand for the 5th and 95th percentile respectively.
Note that farmland images may contain invalid areas, which were initially marked with a special pixel value. Therefore, we exclude these invalid areas when computing pixel percentiles for images in 2018 and 2019.
To intuitively visualize each field image and prepare for later experiments, we separate the four channels into a regular RGB image and an additional single-channel NIR image, and store them as two JPG images.
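For concreteness, a minimal sketch of this normalization and export step is given below. The 5th/95th percentile pair, the invalid-pixel sentinel, and the (NIR, R, G, B) channel ordering are our assumptions for illustration, not the dataset's actual tooling.

```python
# Minimal sketch of the per-channel normalization and RGB/NIR export described
# above. The 5th/95th percentile pair, the INVALID sentinel and the
# (NIR, R, G, B) channel order are illustrative assumptions.
import numpy as np
from PIL import Image

INVALID = -1  # hypothetical sentinel marking invalid field areas

def normalize_channel(raw: np.ndarray, p_low: float = 5.0, p_high: float = 95.0) -> np.ndarray:
    """Clip one raw channel to its percentile bounds and rescale to [0, 255]."""
    valid = raw[raw != INVALID]                       # exclude invalid areas (2018/2019 images)
    v_lower, v_upper = np.percentile(valid, [p_low, p_high])
    clipped = np.clip(raw, v_lower, v_upper)
    scaled = (clipped - v_lower) / max(v_upper - v_lower, 1e-6) * 255.0
    return scaled.astype(np.uint8)

def export_field_image(nrgb_raw: np.ndarray, stem: str) -> None:
    """Normalize an H x W x 4 (NIR, R, G, B) field image and save RGB + NIR JPGs."""
    norm = np.stack([normalize_channel(nrgb_raw[..., c]) for c in range(4)], axis=-1)
    Image.fromarray(norm[..., 1:4]).save(f"{stem}_rgb.jpg")  # R, G, B channels
    Image.fromarray(norm[..., 0]).save(f"{stem}_nir.jpg")    # single-channel NIR
```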
All annotations in Agriculture-Vision were labeled by five annotators trained by expert agronomists using commercial software. The software provides visualizations of several image channels and vegetation indices, including RGB, NIR and NDVI, where NDVI is derived from the Red and NIR channels by:

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}$$
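A one-function sketch of this index computation follows; the epsilon guard against division by zero is our addition.

```python
# NDVI from the NIR and Red channels, matching the formula above;
# the epsilon guard against division by zero is our addition.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)
```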
All annotations were reviewed by agronomy experts for quality assurance, where regions that were falsely annotated were corrected.
Unprocessed farmland images have extremely large sizes. For instance, Figure 1 shows one field image spanning thousands of pixels on each side, and the largest field image we collected is larger still. This poses significant challenges to deep network training in terms of computation time and memory consumption. In addition, Figure 1 also shows the sparsity of some annotations, which means training a segmentation model on the entire image for these patterns would be very inefficient and would likely yield suboptimal results.
On the other hand, unlike common objects, visual appearances of anomaly patterns in aerial farmland images are preserved under image sub-sampling methods such as flipping and cropping. This is because these patterns represent regions of the anomalies instead of individual objects. As a result, we can sample image patches from these large farmland images by cropping around annotated regions in the image. This simultaneously improves data efficiency, since the proportion of annotated pixels is increased.
Motivated by the above reasons, we construct the Agriculture-Vision dataset by cropping annotations with a window size of 512x512 pixels. For field patterns smaller than the window size, we simply crop the region centered at the annotation. For field patterns larger than the window size, we employ a non-overlapping sliding window technique to cover the entirety of the annotation. Note that we discard images covered by more than 90% of annotations, so that all images retain sufficient context information.
In many cases, multiple small annotations are located near each other. Generating one image patch for every annotation would lead to severe re-sampling of those field regions, which causes biases in the dataset. To alleviate the issue, if two image patches have an Intersection-over-Union of over 30%, we discard the one with fewer pixels annotated as field patterns. When cropping large annotations using a sliding window, we also discard any image patches with only background pixels. A visualization of our sample generation method is illustrated in Figure 2, and some images in the final Agriculture-Vision dataset are shown in Figure 3.
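The sketch below illustrates this window-based sampling under an assumed (x0, y0, x1, y1) box convention; the 90%-coverage and 30%-IoU filters described above would then be applied to the returned windows.

```python
# Sketch of the window-based sample generation described above, assuming a
# 512 x 512 window and (x0, y0, x1, y1) annotation boxes in field coordinates.
# The 90%-coverage and 30%-IoU filters from the text are applied afterwards.
WINDOW = 512

def window_boxes(box, field_w, field_h):
    """Center-crop small annotations; tile large ones with non-overlapping windows."""
    x0, y0, x1, y1 = box
    if (x1 - x0) <= WINDOW and (y1 - y0) <= WINDOW:
        # Small pattern: one window centered on the annotation, clamped to bounds.
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        left = min(max(cx - WINDOW // 2, 0), field_w - WINDOW)
        top = min(max(cy - WINDOW // 2, 0), field_h - WINDOW)
        return [(left, top, left + WINDOW, top + WINDOW)]
    # Large pattern: non-overlapping sliding window covering the annotation.
    return [(left, top, left + WINDOW, top + WINDOW)
            for top in range(y0, y1, WINDOW)
            for left in range(x0, x1, WINDOW)]
```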
We first randomly split the 3,432 farmland images with a 6/2/2 train/val/test ratio. We then assign each sampled image to the split of the farmland image it is cropped from. This guarantees that no cropped images from the same farmland appear in multiple splits of the final dataset. The generated Agriculture-Vision dataset thus contains 56,944/18,334/19,708 train/val/test images. We will release all images from the train, validation and test sets in Agriculture-Vision. We will also release annotations from the train and validation sets, but we reserve annotations from the test set for official evaluation.
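A minimal sketch of this farmland-level split follows; the fixed seed is a hypothetical choice for reproducibility.

```python
# Sketch of the farmland-level split: farmland IDs, not crops, are partitioned
# 6/2/2, so every cropped image inherits the split of its source field.
# The fixed seed is a hypothetical choice for reproducibility.
import random

def split_farmlands(field_ids, seed: int = 0):
    ids = list(field_ids)
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.6 * len(ids)), int(0.2 * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }
```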
In this section, we present the statistics of the Agriculture-Vision dataset. We show several challenging properties of our dataset, including the variation in annotation areas, the number of annotations per category, and their proportions within images.
Field patterns have different shapes and sizes. For example, weed clusters can appear either in small patches or across enormous regions, while double plant patterns usually occur in small areas of the field. At the same time, these patterns appear at different frequencies; patterns that are large and more common therefore occupy significantly larger total areas than patterns that are smaller and relatively rare.
Figure 6 shows the total number of pixels for each annotation type in Agriculture-Vision. We observe significantly more drydown, nutrient deficiency and weed cluster pixels than pixels of other categories, which indicates extreme label imbalance across categories.
The frequency at which a model observes a pattern during training determines the model’s ability to recognize the same pattern during inference. It is therefore very important to understand the sample distribution for each of these field patterns in the Agriculture-Vision dataset.
Figure 6 shows the number of images that contain each annotation category. While most annotations fall under a natural and smooth occurrence distribution, we observe a sudden drop of images containing storm damage patterns. The extreme scarcity of storm damage annotations would be problematic for model training. As a result, we ignore any storm damage annotations when performing evaluations.
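The following sketch shows per-class IoU evaluation with storm damage excluded; the class index used here is illustrative.

```python
# Sketch of per-class IoU evaluation that skips storm damage, as described
# above; the class index assignment here is illustrative.
import numpy as np

STORM_DAMAGE = 6  # hypothetical index of the ignored category

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        if c == STORM_DAMAGE:                      # excluded from evaluation
            continue
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```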
As previously described, field patterns can vary dramatically in size. Correspondingly, each generated image sample in Agriculture-Vision may also contain various proportions of annotations. We show in Figure 6 that many images contain more than 50% annotated pixels, and in some the annotations occupy more than 80% of the image. Training a model to segment large patterns can be difficult, since recognition of field patterns relies heavily on the contextual information of the surrounding field.
Table 3: Validation set results, mIoU and per-category IoU (%); os denotes the output stride.

| Model | mIoU | Background | Double plant | Drydown | Endrow | Nutrient deficiency | Planter skip | Water | Waterway | Weed cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabv3 (os=8) | 35.29 | 73.01 | 21.32 | 56.19 | 12.00 | 35.22 | 20.10 | 42.19 | 35.04 | 22.51 |
| DeepLabv3+ (os=8) | 37.95 | 72.76 | 21.94 | 56.80 | 16.88 | 34.18 | 18.80 | 61.98 | 35.25 | 22.98 |
| DeepLabv3 (os=16) | 41.66 | 74.45 | 25.77 | 57.91 | 19.15 | 39.40 | 24.25 | 72.35 | 36.42 | 25.24 |
| DeepLabv3+ (os=16) | 42.27 | 74.32 | 25.62 | 57.96 | 21.65 | 38.42 | 29.22 | 73.19 | 36.92 | 23.16 |

Table 4: Test set results, mIoU and per-category IoU (%); os denotes the output stride.

| Model | mIoU | Background | Double plant | Drydown | Endrow | Nutrient deficiency | Planter skip | Water | Waterway | Weed cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabv3 (os=8) | 32.18 | 70.42 | 21.51 | 50.97 | 12.60 | 39.37 | 20.37 | 15.69 | 33.71 | 24.98 |
| DeepLabv3+ (os=8) | 39.05 | 70.99 | 19.67 | 50.89 | 19.50 | 41.32 | 24.42 | 62.25 | 34.14 | 28.27 |
| DeepLabv3 (os=16) | 42.22 | 72.73 | 25.15 | 53.62 | 20.99 | 43.95 | 24.57 | 70.42 | 38.63 | 29.91 |
| DeepLabv3+ (os=16) | 42.42 | 72.50 | 25.99 | 53.57 | 24.10 | 44.15 | 24.39 | 70.33 | 37.91 | 28.81 |
In this section, we perform comprehensive experiments in several aspects to investigate the challenges and the potential of the Agriculture-Vision dataset. We train well-known common object semantic segmentation models on Agriculture-Vision and show the difficulty of our dataset. We then propose a specialized model designed for agricultural pattern recognition that achieves improved results.
There are many popular models for semantic segmentation on common object datasets. For example, U-Net is a light-weight model that leverages an encoder-decoder architecture for pixel-wise classification. PSPNet uses spatial pooling at multiple resolutions to gather global information. DeepLab [3, 4, 5, 6] is a well-known series of deep learning models that use atrous convolutions for semantic segmentation. More recently, many new methods have achieved state-of-the-art results on the Cityscapes benchmark. For example, SPGNet proposes a Semantic Prediction Guidance (SPG) module that learns to re-weight local features through guidance from pixel-wise semantic predictions, and Criss-Cross Network (CCNet) obtains better contextual information in a more effective and efficient way. In our experiments, we perform comparative evaluations on the Agriculture-Vision dataset using DeepLabV3 and DeepLabV3+, two well-performing models across several semantic segmentation datasets. We also propose a specialized FPN-based model that outperforms these two strong baselines on Agriculture-Vision.
To adapt them to Agriculture-Vision, we make minor modifications to the existing DeepLabV3 and DeepLabV3+ architectures. Since Agriculture-Vision contains NRGB images, we duplicate the weights corresponding to the Red channel of the pretrained input convolution layer, which yields a convolution layer with four input channels in the backbone.
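A minimal sketch of this weight duplication in PyTorch is shown below; the (NIR, R, G, B) input ordering is an assumption.

```python
# Minimal sketch of the channel expansion described above: reuse the pretrained
# RGB weights of the backbone's input convolution and duplicate the Red-channel
# weights for NIR. The (NIR, R, G, B) input ordering is an assumption.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(pretrained=True)
w = model.conv1.weight.data                  # shape (64, 3, 7, 7); channels R, G, B

conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv1.weight.data = torch.cat([w[:, :1], w], dim=1)   # NIR copies the Red weights
model.conv1 = conv1                          # backbone now accepts NRGB input
```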
In our FPN-based model, the encoder of the FPN is a ResNet. We retain the first three residual blocks of the ResNet and change the last residual block (layer4) into a dilated residual block with rate 4; the modified block shares the same structure as the DeepLab series [3, 4, 5, 6]. We implement the lateral connections in the FPN decoder using convolution layers without bias units. For the upsampling modules, instead of bilinear interpolation, we use a deconvolution layer with kernel size 3, stride 2 and padding 1, followed by a BN layer, leaky ReLU activation and another convolution layer without bias. The output from each lateral connection and the corresponding upsampling module are added together, and the result is passed through two more convolution layers with BN and leaky ReLU. Lastly, outputs from all pyramid levels are upsampled to the highest pyramid resolution using bilinear interpolation and concatenated; the result is passed to a convolution layer with bias units to predict the final semantic map.
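The sketch below illustrates one such upsampling module; the channel width, the kernel size of the final bias-free convolution, and the output padding needed for exact 2x upsampling are our assumptions.

```python
# Sketch of the upsampling module described above. output_padding=1 is our
# addition so the deconvolution exactly doubles the spatial size; the channel
# width and the 1x1 kernel of the final bias-free convolution are assumptions.
import torch.nn as nn

class UpsampleModule(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel_size=3, stride=2,
                               padding=1, output_padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return self.block(x)
```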
We use backbone models pretrained on ImageNet in all our experiments. We train each model for 25,000 iterations with a batch size of 40 on four RTX 2080Ti GPUs, using SGD with a base learning rate of 0.01 and weight decay. Within the 25,000 iterations, we first warm up the training for 1,000 iterations, where the learning rate grows linearly from 0 to 0.01. We then train for 7,000 iterations with a constant learning rate of 0.01, and finally decrease the learning rate back to 0 with the "poly" rule over the remaining 17,000 iterations.
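This schedule can be summarized as a single function of the iteration count, sketched below; the poly power of 0.9 is an assumed common default.

```python
# Sketch of the learning-rate schedule described above: 1,000-iteration linear
# warm-up, 7,000 iterations at 0.01, then "poly" decay to 0 over the remaining
# 17,000 iterations. The poly power of 0.9 is an assumed common default.
def learning_rate(it: int, base_lr: float = 0.01, warmup: int = 1000,
                  constant: int = 7000, total: int = 25000, power: float = 0.9) -> float:
    if it < warmup:                                   # linear warm-up from 0
        return base_lr * it / warmup
    if it < warmup + constant:                        # constant phase
        return base_lr
    progress = (it - warmup - constant) / (total - warmup - constant)
    return base_lr * (1.0 - progress) ** power        # poly decay to 0
```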
Table 3 and Table 4 show the validation and test set results of DeepLabV3 and DeepLabV3+ with different output strides and of our proposed FPN-based model. Our model consistently outperforms these semantic segmentation models on Agriculture-Vision. Therefore, in the following experiments, we use our FPN-based model for comparison studies.
One major focus of our work is the effectiveness of training on multi-spectral data for image recognition. Agriculture-Vision consists of NIR-Red-Green-Blue (NRGB) images, which goes beyond the input of many conventional image recognition tasks. Therefore, we investigate the differences in performance between semantic segmentation from multi-spectral images, including NRG and NRGB images, and from regular RGB images.
We simultaneously investigate the impact of using models with different complexities. Specifically, we train our FPN-based model with ResNet-50 and ResNet-101 as backbone. We evaluate combinations of multi-spectral images and various backbones and report the results in Table 5.
Table 5 reports the validation and test mIoU (%) for each combination of backbone and input channels.
Aerial farmland images contain annotations of vastly different sizes. As a result, models trained on images at different scales can achieve significantly different performance. To justify our choice of 512x512 windows for constructing the Agriculture-Vision dataset, we additionally generate two versions of the dataset with different window sizes. The first version (Agriculture-Vision-1024) uses 1024x1024 windows to crop annotations. The second version (Agriculture-Vision-MS) uses three window sizes: 512x512, 1024x1024 and 2048x2048.
In Agriculture-Vision-MS, images are cropped with the smallest window size that completely encloses the annotation. If an annotation exceeds the largest window size, we again use the sliding window cropping method to generate multiple sub-samples. We use Agriculture-Vision-MS to evaluate whether retaining the integrity of large annotations helps improve performance. Note that this differs from conventional multi-scale inference used in common object segmentation, since images in Agriculture-Vision-MS have different sizes.
We cross-evaluate models trained on each dataset version against all three versions. Results in Table 6 show that the model trained on the proposed Agriculture-Vision dataset with a 512x512 window size is the most stable and performs the best, justifying our choice of image sampling method.
We would like to highlight the use of Agriculture-Vision to tackle the following crucial tasks:
Agriculture images beyond RGB: Deep convolutional neural networks (DCNNs) are channel-wise expandable by nature, yet few datasets promote in-depth research on this capability. We have demonstrated that aerial agricultural semantic segmentation is more effective with NRGB images than with RGB images alone. Future versions of Agriculture-Vision will also include thermal images, soil maps and topographic maps, and we expect them to prompt further studies of multi-spectral agricultural images.
Our segmentation task induces an uncommon type of transfer learning, where a model pretrained on RGB images of common objects is transferred to multi-spectral agricultural images. Although the gap between the source and target domains is tremendous, our experiments show that transfer learning remains an effective way of learning to recognize field patterns. Similar types of transfer learning are not often seen, but we expect them to become more common with Agriculture-Vision. The effectiveness of fine-tuning can be explored further, including channel expansion in convolution layers and domain adaptation from common objects to agricultural patterns.
Learning from extreme image sizes: The current version of Agriculture-Vision provides a pilot study of aerial agricultural pattern recognition with conventional image sizes. However, our multi-scale experiments show that there is still much to explore in effectively leveraging large-scale aerial images for improved performance. Using Agriculture-Vision as a starting point, we hope to initiate related research on visual recognition tasks that are generalizable to extremely large aerial farmland images. We envision future work in this direction to enable large-scale image analysis as a whole.
We introduce Agriculture-Vision, an aerial agricultural semantic segmentation dataset. We capture extremely large farmland images and provide multiple field pattern annotations. This dataset poses new challenges in agricultural semantic segmentation from aerial images. As a baseline, we provide a pilot study on Agriculture-Vision using well-known off-the-shelf semantic segmentation models and our specialized one.
In later versions, Agriculture-Vision will include more field images and patterns, as well as more image modalities, such as thermal images, soil maps and topographic maps. This would make Agriculture-Vision an even more standardized and inclusive aerial agricultural dataset. We hope this dataset will encourage more work on improving visual recognition methods for agriculture, particularly on large-scale, multi-channel aerial farmland semantic segmentation.
Bottom-up higher-resolution networks for multi-person pose estimation. arXiv preprint arXiv:1908.10357.
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213-3223.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8843-8850.
European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 515-520.
Computed tomography super-resolution using convolutional neural networks. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3944-3948.