Multispectral Vineyard Segmentation: A Deep Learning approach

by T. Barros et al.
University of Coimbra

Digital agriculture has evolved significantly over the last few years due to technological developments in automation and computational intelligence applied to the agricultural sector, including vineyards, a relevant crop in the Mediterranean region. In this paper, a study of semantic segmentation for vine detection in real-world vineyards is presented, exploring state-of-the-art deep segmentation networks and conventional unsupervised methods. Camera data were collected over vineyards using an Unmanned Aerial System (UAS) equipped with a dual imaging-sensor payload, namely a high-resolution color camera and a five-band multispectral and thermal camera. Extensive experiments with the segmentation networks and unsupervised methods were performed on multimodal datasets representing three distinct vineyards located in the central region of Portugal. The reported results indicate that the best segmentation performances are obtained with deep networks, while traditional (non-deep) approaches using the NIR band showed competitive results. The results also show that multimodality slightly improves the performance of vine segmentation, but the NIR spectrum alone is generally sufficient on most of the datasets. The code and dataset are publicly available.




1 Introduction

Precision agriculture has received much attention lately due to recent technological advancements in remote sensing, mainly enabled by unmanned aerial systems (UASs) and satellites. These systems provide non-invasive, time- and cost-effective techniques to automate tasks such as disease detection Kerkech et al. (2020b), crop yield prediction van Klompenburg et al. (2020), and other monitoring-related tasks Karatzinis et al. (2020). In contrast to satellites, which are limited by temporal and resolution constraints, UAS-based remote sensing in the optical domain offers a cost-effective way to generate the geospatial data necessary to extract the grapevine canopy structure and its spatial variability at vineyard scale Deng et al. (2018). Remote sensing in combination with machine learning can be used to infer the spatio-temporal variability and structure of vineyards, which are relevant for designing site-specific management strategies de Castro et al. (2018) to maximize both grape yield and quality, to avoid unnecessary treatments, and thereby reduce costs Pádua et al. (2020).

Figure 1: Study area and the corresponding orthomosaic images - captured with the HD RGB (X7) and multispectral (Altum) cameras - considered in the datasets. a) shows the geographic locations; b), c), d) and e) show the orthomosaic images and the DSMs of the vineyard plots in Valdoeiro, respectively; f), h), i) and g) correspond to the vineyard plots in Coimbra. The orthomosaics and DSMs generated using the X7 sensor are shown in b), f) and c), g). The multispectral-based orthomosaics generated from the Altum sensor are shown in d) and h) using an R-G-B composition, and in e) and i) using a false-color RE-R-G composition.

In terms of technology, aerial images from an onboard high-resolution RGB camera, processed with dedicated Structure-from-Motion (SfM) photogrammetric approaches, are a common framework to obtain detailed (centimeter-level) Digital Surface Models (DSMs) and orthomosaics of vineyards, which are often used for efficient segmentation of the grapevine canopy (see Pádua et al. (2018) and the references therein). In addition to RGB imagery, multispectral images collected by onboard multispectral sensors comprise information from additional spectral bands such as near-infrared (NIR), red-edge (RE), and thermal (TIR), which relate to chlorophyll in green vegetation Xue and Su (2017). This additional information allows the measurement and mapping of the vegetation's biochemical and biophysical attributes such as water stress and heat Costa et al. (2016); Pôças et al. (2015), crop evapotranspiration Ferreira et al. (2012), biomass, or leaf pigment contents, which are key factors for vine-growers to estimate the outcomes in terms of yield and quality.


Figure 2: Orthomosaic segmentation pipeline (OthoSeg) with the following modules: image splitting, which splits the orthomosaics into sub-images; pre-processing, which normalizes each band of the sub-images; DL segmentation, which predicts sub-masks using a DL-based segmentation approach; and mask rebuilding, which uses the sub-masks to build a mask with the same size of the input orthomosaic.

In vineyard-based applications, to obtain accurate prediction estimates from the aerial/UAV images we must concentrate on the vine-plant data while ignoring the remaining vegetation, which is essential to avoid measurement contamination from undesired plants. Thus, this work resorts to image segmentation techniques to separate the pixels that contain vine plants from everything else. Traditional approaches, on one hand, are based on classical (non-deep) segmentation methods, which rely on handcrafted features and may have poor generalization capabilities when differentiating the background from the foreground on test data. On the other hand, deep learning (DL) segmentation approaches are end-to-end learning mechanisms that have demonstrated exceptional performance on complex problems. One of their disadvantages, however, is the large amount of data presumably required to train deep models, which is scarce in many agriculture domains, particularly in vineyard-based applications.

In this work, a comprehensive study of segmentation approaches is presented for the task of vine detection in vineyard plots on both HD (high-definition) and multispectral aerial imagery collected by a UAV. The study is performed on a new dataset, which will be made public to the community, comprising UAV-based multispectral and HD-RGB orthomosaics, as well as digital surface models of vineyard plots from central Portugal, namely the Coimbra and Valdoeiro regions (as shown in Fig. 1). To the best of our knowledge, this is the first dataset with such characteristics that is freely available.

The main contributions of this work are the following:
A thorough multispectral DL-based segmentation study for vine detection in real-world vineyards;
A new publicly available UAV-based Vineyard dataset, with annotated labels, comprising multispectral, high-resolution RGB orthomosaics, and digital surface models;
Supervised segmentation models based on state-of-the-art deep architectures to tackle the challenging problem of semantic segmentation.

The remainder of this paper is organized as follows: Section 2 presents the state-of-the-art in the domain of semantic segmentation using UAV/drone data for precision agriculture, namely in vineyards. Section 3 describes the study areas, the UAV system, and the multimodal dataset, while Section 4 details the experimental setup. Section 5 presents and discusses the results of the experiments conducted in this work. Finally, Section 6 concludes the findings of this study and suggests future research directions.

2 Related Work

| Ref | Bands | Fusion | Architecture/Approach | Application |
|---|---|---|---|---|
| Kerkech et al. (2020b) | RGB + NIR | Late (case-based) | Encoder-Decoder (SegNet) | Mildew disease detection in vines + row detection |
| Karatzinis et al. (2020) | RGB | Late (HSV) | Otsu's thresholding + Hough transformation | Row detection in vineyards |
| Romero et al. (2018) | RGB + NIR + RE | Early (vegetation indices) | Two-layer feedforward network | Vineyard water status estimation |
| Fawakherji et al. (2019) | RGB | Early (concatenation) | Encoder-Decoder (based on U-Net) | Weed/crop segmentation and classification |
| Blok et al. (2021) | RGB | Early (concatenation) | Mask R-CNN | Detection of broccoli heads |
| Ahmed et al. (2019) | RGB + NIR + RE | Early (NDVI) | Laplacian of Gaussian + unsupervised clustering + random walker | Detection and segmentation of lentil plots |
| Guijarro et al. (2011) | RGB | Early (ExG, CIVE, ExGR, VEG, ExR and ExB) | Threshold-based fuzzy clustering | Crop segmentation |
| Bah et al. (2019) | RGB | Early (concatenation) | S-SegNet + HoughCNet | Crop row detection |
| Song et al. (2020) | RGB + NIR | Early (concatenation) | Encoder-Decoder (based on SegNet) | Identification of sunflower lodging |
| Kerkech et al. (2020a) | RGB + NIR | Late (case-based) | Encoder-Decoder (based on SegNet) | Vine disease detection + row detection |

Table 1: Related work on multispectral data for semantic segmentation in digital/precision agriculture.

The recent advances in drones and multispectral cameras, together with the increasing achievements of DL approaches, have led to higher adoption of UAV-based multispectral imagery as a reliable information source for decision-making in agricultural tasks van Klompenburg et al. (2020); Hamuda et al. (2016); Kamilaris and Prenafeta-Boldú (2018).

Precision agriculture and related agricultural applications have been resorting to DL approaches to perform various perception-like tasks, namely recognition, detection, and semantic segmentation - a survey on DL in agriculture can be found in Kamilaris and Prenafeta-Boldú (2018). Segmentation, in particular, is mainly used in this context to recognize plants or fruits from the background (soil and other residues) e.g., roots from soil Douarre et al. (2016), fruits from leaves Bargoti and Underwood (2017), or crops from weeds Haug and Ostermann (2014). In vineyards, the crop plants are distinguished from weeds by detecting the rows that contain the vine plants Bah et al. (2020).

In perception-based agricultural tasks, row detection is a commonly used practice to avoid measurement contamination from undesired plants (e.g., weeds), which is achieved by detecting in the images the rows that contain the actual crop plants. Early row detection approaches resorted mostly to classical computer vision methods such as segmentation based on color indices Kirk et al. (2009), thresholds Jeon et al. (2011), or non-deep learning Guerrero et al. (2012), which are 'handcrafted' approaches. Advantages of these classical approaches include simplicity, 'shallow' training, and low computational cost. On the other hand, their disadvantages, particularly in the agriculture context, are mainly related to low performance under varying lighting conditions, shadows, or complex backgrounds, which makes them more suitable for simple and non-changing environments - a survey on early segmentation approaches in agriculture can be found in Hamuda et al. (2016).
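As an illustration of such color-index methods, the widely used excess-green index ExG = 2g − r − b over chromaticity-normalized RGB, followed by a fixed threshold, separates vegetation from soil. This is a generic sketch, not the implementation of any of the cited works; the function name and threshold value are arbitrary choices for illustration.

```python
# Generic color-index segmentation sketch (not the cited implementations):
# compute the excess-green index ExG = 2g - r - b on chromaticity-normalized
# RGB and threshold it to obtain a binary vegetation mask.

def exg_mask(rgb_pixels, thr=0.1):
    """rgb_pixels: list of (R, G, B) tuples; returns a 0/1 vegetation mask."""
    mask = []
    for r, g, b in rgb_pixels:
        s = r + g + b
        if s == 0:                      # black pixel: treat as background
            mask.append(0)
            continue
        rn, gn, bn = r / s, g / s, b / s    # chromaticity coordinates
        exg = 2 * gn - rn - bn              # excess-green index
        mask.append(1 if exg > thr else 0)
    return mask
```

A green canopy pixel such as (20, 200, 30) passes the threshold, while a greyish soil pixel such as (120, 100, 80) does not, which is exactly why these indices work well only under stable lighting.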

DL-based approaches usually rely on convolutional layers for feature learning, which enables an end-to-end data-driven approach. In Bah et al. (2019), Convolutional Neural Networks (CNNs) are used to extract features from RGB images for row detection. That approach, called CRowNet, relies on SegNet Badrinarayanan et al. (2017) and a CNN-based Hough transform. Segmentation networks like SegNet and U-Net Ronneberger et al. (2015) have been used in agriculture domains Fawakherji et al. (2019) and in other applications as well (e.g., urban scene segmentation Hong et al. (2020)).

In vineyards, DL-based row detection can be applied to enhance vine disease detection Kerkech et al. (2020b, a) or water status assessment Romero et al. (2018), both performed on UAS-based multispectral imagery. Using such data, on one hand, allows capturing relevant information that is not present in the RGB spectrum and, on the other hand, allows generating DSMs of the surveyed area. In Kerkech et al. (2020b), mildew disease is detected using SegNet, which identifies regions of interest based on visible and NIR information. A similar approach is proposed in Kerkech et al. (2020a), where, in addition to the two spectral bands, depth information is incorporated as well. A broader spectral range is used in Romero et al. (2018) to assess the water status of vine plants, using information from the VIS, GREEN, RED, RED-EDGE, and NIR bands to compute vegetation indices. While the former two works (i.e., Kerkech et al. (2020b, a)) detect the rows through learning, the latter uses a heuristic approach that computes the height difference Baofeng et al. (2016), which is easy to implement but has limitations in areas with slopes and dense vegetation, i.e., it tends to a poor generalization capability.

To summarise the related work on semantic segmentation applied to precision agriculture, Table 1 presents a comprehensive organization as a function of the spectral bands, the fusion strategies, and the related architectures/models that have been used in this domain.

3 Materials and Methods

3.1 Study Areas

The study was carried out in two vineyards located in the centre of mainland Portugal, namely Valdoeiro, located in the Bairrada wine region, and Coimbra, a living-lab/farm within the Agrarian School of Coimbra (ESAC) (Fig. 1). The region has a Mediterranean climate with a strong influence of the Atlantic Ocean, characterized by an average annual rainfall of 1077 mm and average annual temperature Ferreira et al. (2018), marked by a relatively long and dry summer (June-August).

Both vineyards are managed under conventional management practices but present different crop characteristics. Valdoeiro is a 2.9 ha vineyard, located at an altitude of 99 m, in flat terrain under Cambisol soil, with a northeast-southwest exposure. The vineyard was planted in 2005 with the typical Baga vine variety and an approximate density of 3200 vines per ha, with a plant spacing of 1.3 m in straight rows and an inter-row distance of 2.4 m. The Coimbra-ESAC vineyard extends over an area of 2.3 ha divided into two plots (i.e., ESAC1 and ESAC2, see Fig. 6.a), which are located at an altitude of 28 m, in gently sloping terrain under Fluvisol soil.

The vineyard was planted in 1999 with different vine varieties such as Alfrocheiro, Aragonez, Touriga Nacional, and Marselan. ESAC1 has a south-north exposure with an approximate plant density of 2800 vines per ha, a plant spacing of 1.5 m in straight rows, and an inter-row distance of 2.4 m. ESAC2 has an east-west solar exposure with a plant density of approximately 3400 vines per ha, a plant spacing of 1.4 m, and an inter-row distance of 2.1 m.

3.2 Materials and Data Acquisition

To survey the study areas, a compact and low-cost UAS from DJI (shown in Fig. 3) was equipped with a multispectral camera (Micasense Altum), a high-resolution (HD) RGB camera (Zenmuse X7), and a global navigation satellite system (GNSS) receiver with RTK correction. The UAS's flight missions were planned with the DJI Pilot 1.9 software, with the front and side overlap set to 80% and 70%, respectively, using the Altum sensor as a reference; this sensor captures five spectral bands (R, G, B, RE, NIR) and a thermal band. A sample of each band is illustrated in Fig. 4.

The data acquisition process was carried out by surveying both sites (i.e., ESAC and Valdoeiro) with custom settings, chosen to optimize information acquisition at survey time. One of the flight settings that was adjusted was the height at which the UAS surveyed the plots. The Coimbra plots were surveyed in October, at a height of 120 m, after the harvest was finished. The Valdoeiro plot, on the other hand, was surveyed in April, when the plants are still at an early growth stage with no, or few, visible leaves, which makes plant recognition difficult at 120 m. Thus, the height was adjusted to 60 m to capture richer and more detailed information from the plants.

After data acquisition, images from both cameras were used to generate geospatial products (i.e., DSMs and orthomosaics) of both sites. These geospatial products were computed offline using the Structure-from-Motion (SfM) workflow available in the Agisoft Metashape Professional Edition software (Agisoft LLC, St. Petersburg, Russia), version 1.7.2. The multispectral images were only used to generate orthomosaics (see Fig.1.d, e, h and i), while the HD images were used to generate both orthomosaics (see Fig.1.b and f) and DSMs (see Fig.1.c and g). The multispectral-based orthomosaics were generated with a dedicated implementation, while the HD-based geospatial products were obtained based on a processing workflow presented in Gonçalves et al. (2021). An overview of the survey conditions and of the geospatial products is presented in Table 2.

Figure 3: UAS and the on-board cameras used for data collection.
Figure 4: Image examples of the vineyards showing the spectral bands captured by the multispectral sensor, and a ground-truth mask.

| Location | Date | Time | Temp. [°C] | Weather | Flying height AGL [m] | GSD RGB [cm/pix] | GSD Multispectral [cm/pix] | GSD DSM [cm/pix] |
|---|---|---|---|---|---|---|---|---|
| Coimbra | 10/01/2020 | 1:40 pm | 17 | Sunny | 120 | 1.7 | 4.8 | 3.4 |
| Valdoeiro | 04/15/2021 | 11:45 am | 10 | Sun/cloud | 60 | 1 | 3 | 2 |

Table 2: UAS surveys and the corresponding GSD of the generated geospatial products.

3.3 Orthomosaic Deep Learning-based Segmentation

Geospatial products such as orthomosaics are data structures that may have a large and arbitrary size. Such data structures are not appropriate to feed directly to DL-based approaches, which rely on CNNs and are optimized for grid-based, fixed-sized inputs. Moreover, the computational demands of CNNs increase proportionally with the input size, which makes feeding orthomosaics directly to DL networks computationally too expensive. To overcome this limitation, this work resorts to an approach (named OrthoSeg, illustrated in Fig. 2) with the following steps: it receives orthomosaics as input; splits them into sub-images; pre-processes the sub-images before they are fed to the segmentation network, which outputs prediction sub-masks; and rebuilds these sub-masks into an orthomosaic mask with the same size as the initial input.

3.3.1 Orthomosaic Splitting & Rebuilding

The image splitting approach has been devised to divide the orthomosaics of all bands into smaller sub-images with a fixed size of 240×240 pixels, which represents a much smaller computational burden for the DL segmentation networks.

The splitting process, as illustrated in Fig. 5, begins at the top left corner of the orthomosaic and proceeds to the right, creating sub-images every 240 pixels. After the row is completed, a new row is defined 240 pixels below. The process is repeated until the whole orthomosaic has been processed.
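The splitting and rebuilding steps can be sketched as follows. This is a simplified pure-Python illustration: the function names, the non-overlapping row-major traversal, and the dropping of incomplete border tiles are assumptions based on the text and Fig. 5, not the authors' code.

```python
# Sketch of OrthoSeg's split/rebuild stages (illustrative; names assumed).

def split_orthomosaic(ortho, tile=240):
    """Split one 2-D band (list of rows) into non-overlapping tile x tile
    sub-images, scanning left-to-right, top-to-bottom."""
    h, w = len(ortho), len(ortho[0])
    tiles = []
    for top in range(0, h - tile + 1, tile):       # a new row every `tile` px
        for left in range(0, w - tile + 1, tile):  # move right along the row
            tiles.append([row[left:left + tile] for row in ortho[top:top + tile]])
    return tiles

def rebuild_mask(sub_masks, h, w, tile=240):
    """Place predicted sub-masks back into a full-size orthomosaic mask,
    assuming the same row-major order produced by split_orthomosaic."""
    mask = [[0] * w for _ in range(h)]
    cols = w // tile
    for idx, sm in enumerate(sub_masks):
        top, left = (idx // cols) * tile, (idx % cols) * tile
        for r in range(tile):
            mask[top + r][left:left + tile] = sm[r]
    return mask
```

Splitting followed by rebuilding is the identity on any orthomosaic whose dimensions are multiples of the tile size, which is what makes the per-tile predictions composable into one full-size mask.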

Figure 5: Orthomosaic splitting approach. The splitting begins at the upper left corner and proceeds to the right until the end of the row. The process is repeated until the bottom.

3.3.2 Pre-processing

In order to improve convergence at training time, the generated sub-images are standardized using (1) before being fed to the neural network:

    x̂_b = (x_b − μ_b) / σ_b ,    (1)

where x_b represents the sub-image of band b, μ_b is the mean, σ_b denotes the standard deviation, and x̂_b is the corresponding standardized sub-image.
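In code, the band-wise standardization reduces to the following pure-Python sketch; `standardize_band` is a hypothetical helper, and the guard for a zero-variance (constant) band is an added assumption not stated in the paper.

```python
# Per-band standardization sketch: x_hat = (x - mu) / sigma, applied to one
# band of a sub-image (here a flattened pixel list). Illustrative only.
from math import sqrt

def standardize_band(pixels):
    """Standardize one band to zero mean and unit standard deviation."""
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = sqrt(sum((p - mu) ** 2 for p in pixels) / n)
    if sigma == 0:                 # assumed guard: constant band -> all zeros
        return [0.0] * n
    return [(p - mu) / sigma for p in pixels]
```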

3.3.3 Deep Segmentation Networks

In this work, three state-of-the-art supervised segmentation networks are used: SegNet, U-Net Ronneberger et al. (2015), and ModSegNet Ganaye et al. (2018). All three networks have an encoder-decoder architecture, i.e., the encoder maps the input to a reduced feature space and the decoder maps the feature space to a prediction mask with the same size as the input. The U-Net architecture implemented here differs slightly from the one originally proposed in Ronneberger et al. (2015): the network has been augmented with Batch Normalization (BatchNorm) Ioffe and Szegedy (2015) and Dropout Srivastava et al. (2014) layers, towards the goal of improving generalization and convergence.

4 Experiments

The study was carried out on the proposed dataset, from which three areas of interest (AoI) were selected, denoted ESAC1, ESAC2, and Valdoeiro. The ground truth masks were generated in the orthomosaic space. Both the orthomosaic images and masks were processed to fit the computational requirements to train the DL-based segmentation networks, which were evaluated through 3-fold cross-validation.

4.1 Dataset

All experiments were conducted on data from the three areas of interest: ESAC1, ESAC2, and Valdoeiro, which are illustrated in Fig. 6. In Coimbra, two sets were created, corresponding to the ESAC1 and ESAC2 plots. In Valdoeiro, only the upper fraction of the plot is used, referred to simply as Valdoeiro.

For practical reasons, given the limited GPU memory available for working with images in the experiments, the three sets were divided into 240×240 sub-images. We note that only the images with at least 1 pixel belonging to the positive class (i.e., corresponding to a vine plant) were used in the training stage.
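The tile-filtering rule above (keep only sub-images whose mask contains at least one vine pixel) can be sketched as follows; the function name is illustrative.

```python
# Keep only (tile, mask) pairs with at least one positive (vine) pixel,
# mirroring the training-set filtering described in the text. Illustrative.

def keep_for_training(tiles, masks):
    """tiles: list of sub-images; masks: matching list of 0/1 2-D sub-masks.
    Returns the pairs whose mask contains at least one label-1 pixel."""
    return [(t, m) for t, m in zip(tiles, masks)
            if any(px == 1 for row in m for px in row)]
```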

The resulting dataset thus comprises three sets: ESAC1, ESAC2, and Valdoeiro. Each set comprises data from the HD and multispectral cameras. After the splitting strategy, the HD data contain 624, 626, and 1195 images for ESAC1, ESAC2, and Valdoeiro, respectively; the multispectral data contain, respectively, 85, 89, and 150 images. A summary of the dataset is presented in Table 3, where P represents the positive class (vine-plant pixels) and N the negative class (non-vineyard pixels).

Figure 6: Areas of interest of (a) Coimbra's vineyard plots (ESAC1 and ESAC2) and (b) the Valdoeiro plot.

4.2 Ground-truth data

Figure 7: Sub-images and corresponding ground truth masks (240 x 240) used for training and testing.

In segmentation tasks the ground truth data correspond to masks. In this work, ground truth masks have been generated in the geospatial space (i.e., orthomosaic and DSM spaces), populating the pixels that belong to vine plants with the positive class (label = 1) and the remaining pixels with the negative class (label = 0), i.e., this is a binary segmentation problem. The masks were split with the same process as the orthomosaics; thus, a sub-mask for each sub-image has been created. Figure 7 illustrates three sub-image samples of the three areas with their respective sub-masks, and Table 3 contains information regarding image/mask and class distributions of each area of interest.

| Set | I/M (MS) | I/M (HD) | P (MS) | N (MS) | N/P (MS) | P (HD) | N (HD) | N/P (HD) |
|---|---|---|---|---|---|---|---|---|
| ESAC1 | 85 | 624 | 0.25 | 0.75 | 3 | 0.23 | 0.77 | 3 |
| ESAC2 | 89 | 626 | 0.28 | 0.72 | 3 | 0.25 | 0.75 | 3 |
| Valdoeiro | 150 | 1,196 | 0.07 | 0.93 | 14 | 0.08 | 0.92 | 12 |
| Mean | - | - | 0.17 | 0.83 | 5 | 0.17 | 0.83 | 5 |

Table 3: Image/mask (I/M) counts and class distributions of each sensor modality, where P and N represent, respectively, the positive and the negative class fraction available in each set, and N/P their ratio.

4.3 Evaluation

The evaluation procedure adopted in this work was k-fold cross-validation, using the F1-score as the performance metric (2). In particular, k represents the number of groups (or subsets) the dataset is split into; here k = 3, corresponding to the three study areas: ESAC1, ESAC2, and Valdoeiro. The three plot combinations (denoted by T1, T2, and T3) and their corresponding data distributions are represented in Table 4.

The F1 score is computed as follows:

    F1 = 2·TP / (2·TP + FP + FN) ,    (2)

where True Positives (TP) are pixels correctly classified as vines; False Positives (FP) are pixels wrongly classified as vine plants; True Negatives (TN) are pixels correctly classified as background; and False Negatives (FN) are pixels wrongly classified as background.
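These pixel-wise counts map directly to code. The following is an illustrative helper, not the evaluation script used in the paper.

```python
# Pixel-wise F1 sketch matching the TP/FP/FN definitions above:
# F1 = 2*TP / (2*TP + FP + FN).

def f1_score(pred, truth):
    """pred, truth: flat lists of 0/1 pixel labels (1 = vine)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # empty-class convention assumed
```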

| Split | Training plots | I/M (MS) | I/M (HD) | Test plot | I/M (MS) | I/M (HD) |
|---|---|---|---|---|---|---|
| T1 | ESAC1 & ESAC2 | 174 | 1250 | Valdoeiro | 150 | 1196 |
| T2 | ESAC1 & Valdoeiro | 235 | 1820 | ESAC2 | 89 | 626 |
| T3 | ESAC2 & Valdoeiro | 239 | 1822 | ESAC1 | 85 | 624 |

Table 4: Image/Mask (I/M) distribution among the training and test set for cross-validation. MS denotes multispectral and HD=high-definition.

4.4 Implementation details and Training

The implementation of this work was done in a Python 3.7 environment, using the PyTorch framework for the segmentation networks. The environment was set up on hardware with an NVIDIA GeForce GTX 1070 Ti GPU and an AMD Ryzen 5 CPU with 32 GB of RAM.

All networks were initialized, trained, and validated under the same conditions. The networks' weights were initialized using a normal distribution with mean 1 and a standard deviation of 0.2. The training was performed using the AdamW optimizer Loshchilov and Hutter (2019) with a learning rate of 0.000171 and a weight decay of 0.00061. The loss function was PyTorch's BCEWithLogitsLoss with the positive class weight set to 5 to compensate for the unbalanced class distribution (as can be verified in Table 3). Data augmentation was also implemented in the form of random rotations with angles between 0 and 180 degrees, applied to the sub-images and the corresponding sub-masks. Finally, the networks were trained for 20 epochs, using early stopping to extract the best scores.
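The effect of the positive-class weight can be illustrated with a pure-Python version of weighted binary cross-entropy with logits. This is a sketch of what BCEWithLogitsLoss computes when given a pos_weight, not the actual training code; the numerically stable form used by PyTorch is omitted for clarity.

```python
# Class-weighted BCE-with-logits sketch:
# loss = -mean( pw * y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z)) )
# Positive (vine) pixels are up-weighted to counter the class imbalance.
import math

def bce_with_logits(logits, targets, pos_weight=5.0):
    """logits: raw network outputs; targets: 0/1 labels. Returns mean loss."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))          # sigmoid
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

With pos_weight = 5, a missed vine pixel costs five times as much as a missed background pixel, which pushes the network away from the trivial all-background solution implied by the N/P ratios in Table 3.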

5 Results and Discussion

Table 5: Average F1 scores over 5 repetitions, each trained with the same parameters (20 epochs, data augmentation, normal-distribution weight initialization), for each band combination of the multispectral sensor (RGB, RE, NIR, thermal) and for the HD RGB modality, reported per network (SegNet, U-Net, ModSegNet) and per train-test split (T1, T2, T3, with mean and standard deviation). The NIR-based combinations reach mean F1 scores around 0.81, the thermal band alone scores lowest (mean F1 of 0.31-0.33), and the HD RGB modality attains mean F1 scores of 0.81 (SegNet), 0.83 (U-Net), and 0.82 (ModSegNet).

This section reports and discusses the results obtained with the segmentation networks as well as with non-deep unsupervised methods. The results in Table 5, which represent the average performance of 5 repetitions of the DL-based segmentation networks when fed with the various band combinations, were analyzed from three distinct perspectives: the network architectures, the camera being used (i.e., HD vs. multispectral), and the band configuration. Additionally, comparisons have been carried out between the deep networks and the classical (non-deep) unsupervised segmentation methods.

5.1 Network Comparison

Based on the reported experimental results, we can deduce that the three deep networks achieved equivalent performance; nonetheless, U-Net and SegNet presented slightly higher and more 'stable' performances across the various band combinations and train-test configurations (i.e., T1, T2, and T3). Despite the seasonal change of the vineyards, the networks demonstrated high generalization capability.

5.2 HD vs Multispectral

When comparing the scores from the sensor perspective (i.e., HD RGB vs. multispectral modalities), the HD-based F1-scores are in general higher, which can be partially explained by the larger amount of training data available, i.e., more pixels due to the higher resolution (see Table 4 for the number of images in each dataset). Nonetheless, when focusing on the T1 scenario, where the networks are trained on a completely different scenario from the one on which they are tested, it is interesting to observe how the multispectral modalities allow a better overall performance.

5.3 Spectral Band Comparison

From the results, one key observation stands out: the NIR spectral band tends to generate the best results; thus, using this band alone is sufficient to achieve proficient results. In some cases, adding other bands does not improve and may even degrade the performance. The thermal band is one such case, having very low performance when used alone. A possible reason behind the thermal band's poor outcome is its low resolution compared with the other bands (see Fig. 4).

Despite the high performance of the NIR band, having access to such information requires multispectral cameras, which are less affordable than their RGB counterparts. Thus, due to their low cost and ease of acquisition, color cameras are very popular among works in agriculture (as can be seen in Table 1). Comparing the results of the RGB bands with the best-performing band combination, RGB has on average lower performance, which is acceptable when a multispectral camera is not an option.

Figure 8: Qualitative prediction masks comparison of DL-based and classical approaches. The two samples represent: (upper) a corner-case where classical approaches have low performance; and (lower) an ideal case where classical approaches are competitive with DL-based approaches.

5.4 Deep vs Conventional Unsupervised Methods

For the sake of comparison with the DL segmentation networks, two unsupervised image segmentation methods (Otsu's thresholding and K-means) were applied to some of the previously tested modalities (RGB-HD, RGB, RE, and NIR). The methods were independently applied to the selected regions (ESAC1, ESAC2, and Valdoeiro) in an unsupervised fashion and evaluated against the ground truth. For the K-means method, the same band-wise normalization used with the deep learning methods was applied during the pre-processing phase.
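Otsu's method picks the threshold that maximizes the between-class variance of the single-band pixel histogram. The following is a minimal sketch for 8-bit values; the study's actual implementation is not specified, so function names and the ">" foreground convention are assumptions.

```python
# Otsu thresholding sketch for one band (e.g., NIR), 8-bit pixel values.

def otsu_threshold(pixels):
    """Return the threshold t maximizing between-class variance
    w0 * w1 * (m0 - m1)^2 over the 256-bin histogram."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0.0
    for t in range(256):
        w0 += hist[t]                          # background pixel count
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                         # background mean
        m1 = (sum_all - sum0) / (total - w0)   # foreground mean
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def segment(pixels, thr):
    """Binary mask: pixels above the threshold are labelled foreground."""
    return [1 if p > thr else 0 for p in pixels]
```

On the NIR band this works because vegetation reflects strongly in the near-infrared, making the vine/background histogram close to bimodal.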

Analogous to the deep methods, here too the NIR band led to the highest F1-scores. Overall, the unsupervised methods achieved unsatisfactory segmentation performances, with many F1 scores below 0.50. Nonetheless, when using the NIR band, competitive results are attained across the different regions of the dataset, as shown in Table 6, which compares the DL and non-deep methods on the NIR band. The best DL networks outperformed the non-deep methods by 6 percentage points on average. Additionally, we note that the 'shallow' methods struggle in some corner cases where the positive class is scarce, contrary to the DL-based approaches (as illustrated in Fig. 8).

| Method | Valdoeiro | ESAC1 | ESAC2 | Mean |
|---|---|---|---|---|
| Otsu | 0.78 | 0.78 | 0.70 | 0.75 |
| K-means | 0.78 | 0.78 | 0.70 | 0.75 |
| SegNet | 0.79 | 0.83 | 0.81 | 0.81 |
| U-Net | 0.81 | 0.84 | 0.78 | 0.81 |
| ModSegNet | 0.74 | 0.81 | 0.81 | 0.79 |

Table 6: Segmentation performance (F1 scores) of the unsupervised (non-deep) methods and the deep networks using the NIR band.

6 Conclusions

In this work, a new UAV-based multispectral and HD RGB dataset has been used to train three deep segmentation networks for the task of pixel-wise vineyard recognition. The aim was to study the responses of the different spectral bands, image resolutions, and segmentation networks when used in this agricultural application. The data was captured from two distinct vineyards at different seasonal stages, both located in the central region of Portugal: Coimbra and Valdoeiro.

From the results of this study, two major conclusions are derived. Firstly, the higher image resolution of the HD RGB modality increases the general performance of the DL networks compared with the multispectral modalities, but is not sufficient when we restrict our analysis to the scenario of greater generalization (illustrated by the T1 data configuration). Secondly, focusing on the performance of the individual bands from the multispectral camera, the NIR band stands out not only for being ubiquitous in the best-performing combinations, but also for the robust results it yields as a single modality. The latter observation was further substantiated when conventional unsupervised segmentation methods were compared with our DL-network architectures; in this scenario, competitive results were only possible when the conventional algorithms were applied to the NIR band, despite their problems in some corner cases.

The present article makes a good case for the use of this type of dual-camera approach to UAV-based data acquisition, highlighting the clear advantages and disadvantages of each option and discussing, in a thorough and rigorous way, the best semantic segmentation approaches for each scenario. Finally, the DL-based networks were compared with traditional approaches, underlining the importance of this type of study for real-life precision agriculture applications. For future work, a combination of data acquired from both cameras could be introduced in our analysis of Neural Network performance, as well as some depth information retrieved from the DSMs.


Acknowledgments

This work has been supported by the Portuguese Foundation for Science and Technology (FCT) via the projects AI+Green (MIT-EXPL/TDI/0029/2019) and Agribotics (UIDB/00048/2020). The work of G. Gonçalves was also supported by the FCT through the grant UIDB/00308/2020.

