Weeds are a significant threat to agricultural productivity because they compete with crops for water, nutrients and sunlight llewellyn2016impact; gharde2018assessment. Weed control in conservation cropping systems relies on herbicides, given the lack of suitable alternative weed control options that do not interfere with the minimum tillage and residue retention principles, and thus the benefits, of these systems. To alleviate this threat, precise weed management is becoming an increasingly important approach, implemented through site-specific weed control (SSWC) techniques that identify and target weeds with treatments such as herbicides, lasers, electrical weeding and targeted tillage coleman2019energy.
While in-crop SSWC creates important opportunities for improved production, the major challenge for these systems is the development of robust weed identification algorithms that perform accurately and consistently in complex and highly variable crop production environments olsen2019deepweeds; charters2014eagle. Conventional machine-vision approaches often follow a pipeline in which hand-crafted image features play a primary role. As a result, developing such pipelines is labour intensive, and images must be captured under well-defined conditions.
Fortunately, owing to the great success of deep learning in many vision tasks, hand-crafted features are no longer required to obtain promising results. Instead, deep learning derives deep representations of an input image that are relevant to the task at hand. For weed identification, four types of deep learning approaches are illustrated in Fig. 2: a) image classification identifies the weed or crop species contained in an image; b) object detection identifies the per-plant locations of weeds within an image; c) semantic segmentation conducts pixel-wise classification of individual weed classes; and d) instance segmentation further identifies the instance to which each pixel belongs. As most deep learning-based weed identification studies are built on existing, well-known deep architectures, the architectures relevant to weed identification, including their building blocks and contributions, are introduced first. Next, more than 30 deep learning-based weed identification studies are discussed in terms of their architectures, goals and performance. In addition, as deep learning-based weed identification research often requires a large volume of annotated data, we provide the details of currently public weed datasets and benchmarking metrics.
Expanding on the research of existing well-known architectures, we present other fine-grained and alternative architectures that may offer identification performance advantages for future research. In practice, current research still faces limitations and challenges in providing weed control for large-scale crop production systems. Therefore, deep learning mechanisms that could further improve the efficiency and effectiveness of weed control, including real-time inference, weakly-supervised learning, explainable learning and incremental learning techniques, are discussed.
In summary, this review aims to: 1) investigate deep learning techniques related to weed control; 2) summarize current deep learning-based weed identification research including architectures, research materials and evaluation methods; 3) identify further challenges and improvements for future research with deep learning-based solutions.
The remainder of this review is organised as follows. Section 2 introduces the deep learning techniques relevant to weed identification. Section 3 provides a discussion of the existing deep learning based weed detection studies; in addition, public datasets and evaluation metrics for benchmarking are summarised. Section 4 discusses the challenges for weed detection and potential deep learning solutions. Finally, Section 5 concludes this review.
2 Overview of Deep Learning Techniques
In this section, the theory of deep learning techniques is introduced including the deep learning building blocks and architectures relevant for weed detection.
2.1 Machine Learning
Machine learning (ML) algorithms are a class of algorithms that ’learn’ to perform a specific task given sample data (i.e., training data). These algorithms are not explicitly programmed with rules or instructions to fulfil the task. In general, a set of samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ can be obtained for a ML task, where $\mathbf{x}_i$ is the observed features describing the characteristics of the $i$-th sample and $y_i$ is the associated output. In general, it can be costly and time-consuming to obtain $y_i$, whilst $\mathbf{x}_i$ is convenient to collect. Therefore, it is expected to learn a model $f(\mathbf{x}; \theta)$ that maps input values $\mathbf{x}$ to the target variable $y$ as closely as possible, where $\theta$ is a set containing the parameters of the model. Optimization methods can be used to find the best set of model parameters, $\theta^*$, to minimize the difference between the predicted output $f(\mathbf{x}_i; \theta)$ and the ground truth $y_i$. In regard to the form of $y$, machine learning problems can be differentiated as classification problems if the value of $y$ is categorical, or regression problems if the value of $y$ is continuous.
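As a toy illustration of this optimisation view, the sketch below fits a hypothetical 1-D linear model by gradient descent on the mean squared error; the model, data and learning rate are illustrative and not drawn from any reviewed study.

```python
# Minimal sketch of supervised learning as parameter optimisation:
# fit y ≈ w*x + b by gradient descent on mean squared error.
def fit_linear(samples, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = (w * x + b) - y      # prediction error for one sample
            grad_w += 2 * err * x / n  # d(MSE)/dw, averaged over samples
            grad_b += 2 * err / n      # d(MSE)/db
        w -= lr * grad_w               # update the parameters theta = (w, b)
        b -= lr * grad_b
    return w, b

# Training data generated from the ground-truth rule y = 2x + 1
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = fit_linear(data)
```

After training, `(w, b)` is close to the generating parameters `(2, 1)`, illustrating how optimisation recovers the model from samples alone.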
In the past decades, various machine learning models have been proposed (e.g. support vector machines cortes1995support). However, these methods require carefully devised hand-crafted features. Thanks to the recent growth in computational capacity and the availability of large amounts of training data, deep learning integrates feature extraction and modelling together automatically, and achieves promising performance on many tasks. Fig. 1 provides an illustration of the difference between conventional machine learning and deep learning. In the following subsections, the details of deep learning are introduced.
2.2 Neural Networks
The model $f(\mathbf{x}; \theta)$ in machine learning can be chosen as a neural network (NN) schmidhuber2015deep, which contains an interconnected group of nodes (i.e., artificial neurons) inspired by and simplified from the biological neural networks in animal brains. The most well-known neural network architectures are the multi-layered perceptrons (MLPs), shown in Fig. 3 (a). This architecture organizes nodes into groups of layers and connects nodes between neighbouring layers. In detail, the computations of the $l$-th layer can be written as:

$$\mathbf{h}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad (1)$$

where $\mathbf{h}^{(l-1)}$ is the input of the $l$-th layer, which can be viewed as nodes of the neural network; $\mathbf{W}^{(l)}$ with the bias vector $\mathbf{b}^{(l)}$ represents a linear transform of the input signal which introduces full connectivity between the $(l-1)$-th layer and the $l$-th layer; $\sigma$ is an activation function which introduces a non-linearity to the output, allowing complex representations. In particular, $\mathbf{h}^{(0)} = \mathbf{x}$ is the input feature of a sample.
The layer defined in Eq. (1) can also be referred to as a fully connected (FC) layer. By stacking multiple FC layers, neural networks are able to formulate more complex representations of the input. To obtain predictions, computations are conducted from the first (input) layer to the last (output) layer, which is known as the forward propagation stage. To optimize the parameters of a neural network, a backward propagation stage updates the parameters in reverse order. Recently, more mechanisms and architectures have been proposed for constructing deeper neural networks, of which the ones related to weed identification are reviewed in the rest of this section.
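To make forward propagation concrete, a plain-Python sketch of stacked FC layers with a ReLU activation is shown below; the layer sizes and weight values are arbitrary illustrative choices.

```python
# Forward pass through stacked fully connected layers: h = sigma(W h + b).
def relu(v):
    return [max(0.0, x) for x in v]

def fc_layer(h, W, b, activation=relu):
    """One fully connected layer: activation(W h + b)."""
    z = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z)

def forward(x, layers):
    """Forward propagation: apply each (W, b) pair in order."""
    h = x
    for W, b in layers:
        h = fc_layer(h, W, b)
    return h

# Two-layer MLP: 3 inputs -> 2 hidden units -> 1 output
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.2]),
]
out = forward([1.0, 2.0, 3.0], layers)
```

A backward propagation stage would then compute gradients of a loss with respect to each `W` and `b` in the reverse order of this loop.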
2.3 Convolution Neural Networks
Inspired by the biological processes of the animal visual cortex, convolution neural networks (CNNs) reduce the challenges of training deep models for visual tasks gu2018recent. Convolution layers are the key components of CNNs, as illustrated in Fig. 3 (b). They involve partial connections, compared with the fully connected layers of MLPs, where each node focuses on a local region of the input. In detail, denote $\{\mathbf{K}^{(l)}_1, \dots, \mathbf{K}^{(l)}_{C_{out}}\}$ to represent a series of convolution filters; the computations of the $l$-th convolution layer can be written as:

$$\mathbf{h}^{(l)}_c = \sigma\left(\mathbf{K}^{(l)}_c * \mathbf{h}^{(l-1)}\right), \quad c = 1, \dots, C_{out}, \qquad (2)$$

where $*$ represents a convolution operator and $\sigma$ is an activation function; $\mathbf{h}^{(l-1)}$ is the input feature map containing $C_{in}$ channels and the output feature map $\mathbf{h}^{(l)}$ contains $C_{out}$ channels. A convolution layer can be viewed as a special case of FC layers with a sparse weight matrix.
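The sliding-filter computation of a single-channel 2-D convolution (valid padding, stride 1) can be sketched as follows; the input image and the horizontal-difference filter are illustrative only.

```python
# Single-channel 2-D convolution, valid padding, stride 1:
# the output shrinks by (kernel size - 1) in each dimension.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # Each output node only sees a kh x kw local region of the input.
            out[i][j] = sum(image[i + u][j + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edge = [[1, -1]]          # toy horizontal difference filter
result = conv2d(image, edge)
```

Because each output value depends only on a local window, the equivalent FC weight matrix would be sparse, which is exactly the observation made above.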
Convolution layers often reduce the input spatial size but increase the number of channels. For some applications, recovery from a deep representation to the original input size is required. For this purpose, deconvolution (transpose convolution) operations were used. Readers can refer to zeiler2010deconvolutional for more details.
2.4 Graph Neural Networks
Whereas most neural networks were designed for processing vectorized data, a wide range of applications involve non-vectorized data. Graph neural networks (GNNs) were devised for graph inputs. One commonly used GNN is the graph convolution neural network (GCNN), which generalises conventional CNNs by involving adjacency patterns bruna2013spectral. In detail, a particular form to compute graph convolutions in the $l$-th layer can be written as:

$$\mathbf{H}^{(l)} = \sigma\left(\mathbf{A} \mathbf{H}^{(l-1)} \mathbf{W}^{(l)} + \mathbf{b}^{(l)}\right), \qquad (3)$$

where $\mathbf{H}^{(l-1)}$ represents the vertex features of the vertices in a graph, $\mathbf{A}$ is an adjacency matrix to illustrate the relations between vertices, $\mathbf{W}^{(l)}$ contains trainable weights and $\mathbf{b}^{(l)}$ is a bias vector. Instead of using a pre-defined adjacency matrix, the graph attention network (GAT) estimates edge weights of the adjacency in line with the vertex features velickovic2018graph. Recently, various methods were proposed to focus on graph characteristics which could not be captured by GCNNs (e.g. the longest circle in garg2020generalization).
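The graph-convolution step above (with the bias omitted for brevity) can be sketched on a toy 3-vertex graph; the adjacency, features and weights below are made-up illustration values.

```python
# One graph-convolution step: H' = sigma(A H W), bias omitted.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(A, H, W):
    """Aggregate neighbour features with A, then transform with W and ReLU."""
    Z = matmul(matmul(A, H), W)
    return [[max(0.0, x) for x in row] for row in Z]

# Toy chain graph 0-1-2; self-loops on the diagonal so each vertex
# keeps its own feature during aggregation.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
H = [[1.0], [2.0], [3.0]]   # one feature per vertex
W = [[0.5]]                 # single trainable weight
out = gcn_layer(A, H, W)
```

Each output row mixes a vertex's own feature with its neighbours' features, which is how adjacency patterns enter the computation.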
2.5 Deep Learning Architectures
Following the above discussed neural networks for deep learning, various deep learning architectures can be constructed in line with different target applications. In terms of weed detection tasks, four categories of deep neural network architectures are summarized, including image classification, object detection, semantic segmentation, and instance segmentation.
2.5.1 Image Classification
Image classification tasks focus on predicting the category (e.g. weed species) of the object in an input image. Input images can be viewed as spatially organized data, and many CNN-based architectures have been proposed for classifying them into a specific class or category. AlexNet, which consists of 5 convolution layers and 3 fully connected layers, was first adopted for large-scale image classification krizhevsky2012imagenet. Convolution layers (potentially with other mechanisms) were used to formulate deep representations from input images and FC layers were further used to generate output vectors in line with the categories involved. VGG simonyan2014very further introduced stacks of small convolution filters to substitute each convolution filter with a large receptive field, learning a deeper representation while reducing the computational costs. InceptionNet introduced filters of multiple sizes at the same level to characterize the salient regions in an image, which can have extremely large variations in size szegedy2017inception. With the growing depth of CNN architectures, short-cuts between layers, as used in ResNet and DenseNet, alleviate the gradient vanishing issue 8099726. NASNet zoph2018learning was obtained by architecture search algorithms.
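The short-cut idea behind ResNet-style blocks can be sketched in a few lines: the block output is the transformed input plus an identity path, so gradients can bypass the transform. The toy transform below stands in for the block's convolutions and is purely illustrative.

```python
# Residual short-cut: output = F(h) + h, so the identity path
# carries gradients even when F's gradients become small.
def residual_block(h, transform):
    """Apply a transform and add the input back (identity short-cut)."""
    return [f + x for f, x in zip(transform(h), h)]

# Toy transform standing in for the block's convolutions.
halve = lambda v: [0.5 * x for x in v]
out = residual_block([2.0, 4.0], halve)
```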
2.5.2 Object Detection
Object detection aims to identify the positions and the classes of the objects contained in an image. Generally, various CNN architectures for image classification can be used as backbones to learn deep representations, and specific output layers can be introduced to obtain object-level annotations including positions and categories. For example, R-CNN and its improvements, such as Faster R-CNN ren2015faster, follow a two-stage scheme where the first stage generates region proposals and the second stage predicts the positions and labels for those proposals. One-stage methods were also explored to perform object detection with less latency. For example, YOLO redmon2016you (You Only Look Once) treats object detection as a regression problem, of which the output is a feature map containing the positions and labels for each pre-defined grid cell. The single shot multi-box detector (SSD) liu2016ssd introduced feature maps of multiple scales and prior anchor boxes of different ratios. As class imbalance is one of the critical challenges for one-stage object detection, RetinaNet with a focal loss was proposed lin2017focal. Note that pre-defined anchor boxes play an important role in most of the above-mentioned methods. To avoid the significant computational costs of such anchor boxes, anchor-free methods were also investigated in tian2019fcos.
2.5.3 Semantic Segmentation
Semantic segmentation focuses on the pixel-wise (dense) predictions of an image, by which the category of each pixel is identified. In general, semantic segmentation uses fully convolutional networks (FCNs), which were first explored in long2015fully. These studies often involve an encoder-decoder scheme: the encoder formulates a latent representation of an image through convolutions and the decoder upsamples the latent representation back to the original image size for dense predictions. By increasing the capacity of the decoder, U-Net ronneberger2015u achieved promising performance for medical images. SegNet badrinarayanan2017segnet additionally uses pooling indices in its decoder, unlike general encoder-decoder models that keep only the pooled values, to perform non-linear upsampling that preserves boundary information. The pyramid scene parsing network (PSPNet) zhao2017pyramid exploited global context through different-region-based context aggregation with a pyramid pooling module, providing a superior framework for pixel-level predictions. Instead of following a conventional encoder-decoder scheme, DeepLab models adopt atrous convolutions to reduce the downsampling operations while keeping a large receptive field chen2018encoder.
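The decoder's job of recovering spatial resolution can be illustrated with the simplest possible upsampling scheme, nearest-neighbour interpolation by a factor of 2; real decoders use learned transpose convolutions or pooling indices, so this is only a sketch of the resolution-recovery step.

```python
# Nearest-neighbour upsampling of a 2-D feature map by an integer factor:
# each value is repeated factor x factor times.
def upsample_nearest(fmap, factor=2):
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]  # repeat along width
        out.extend([wide[:] for _ in range(factor)])    # repeat along height
    return out

fmap = [[1, 2],
        [3, 4]]
up = upsample_nearest(fmap)
```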
2.5.4 Instance Segmentation
Instance segmentation aims to output both the class and the class instance information for individual pixels. Instance segmentation methods were initially devised in a two-stage manner, focusing on two separate tasks: object detection and semantic segmentation. For example, Mask R-CNN he2017mask follows a top-down design, which first conducts object detection to locate the bounding box of each instance and then undertakes semantic segmentation within each bounding box. Bottom-up methods were also investigated, which first conduct semantic segmentation and then use clustering or metric learning to obtain the different instances (e.g. papandreou2018personlab). Two-stage methods require accurate results from each stage, and their computational cost can be expensive; therefore, single-shot methods were explored. The anchor-based method YOLACT bolya2019yolact added two parallel tasks to an existing one-stage object detection model: generating a dictionary of non-local prototype masks over the entire image and predicting a set of linear combination coefficients per instance. An anchor-free method, fully convolutional instance-aware semantic segmentation (FCIS), was proposed based on FCNs by introducing position-sensitive inside/outside score maps li2017fully. PolarMask xie2020polarmask conducts instance centre classification and dense distance regression in polar coordinates, offering a much simpler and more flexible framework. Very recently, BlendMask chen2020blendmask, inspired by FCIS, introduced a blender module to effectively combine instance-level information and semantic information with low-level fine granularity.
3 Deep Learning for Weed Identification
In this section, deep learning based weed identification studies are summarized covering four approaches: image classification, object detection, semantic segmentation and instance segmentation. Before reviewing those approaches, the research data, data augmentation strategies and evaluation metrics used in these studies are reviewed first to provide context for the field.
3.1 Weed Data
Weed data is the foundation for developing and benchmarking weed identification methods, and sensing technologies determine what weed data can be acquired and what weed management practices can be developed machleb2020sensor. While various sensing techniques such as ultrasound, LiDAR (Light Detection And Ranging) and optoelectronic sensors have been used for simple differentiation between weeds and crops, image-based weed identification has gained increasing interest due to the advances in various imaging techniques.
Multispectral imaging captures light energy within specific wavelength ranges or bands of the electromagnetic spectrum, which can include information beyond the visible wavelengths farooq2018analysis. For example, hyperspectral images consist of many contiguous and narrow bands; near infrared (NIR) imaging uses a subset of the infrared band, as chlorophyll, the pigment in plant leaves, strongly absorbs red and blue visible light and reflects near infrared. Driven by low-cost RGB cameras and the significant progress in computer vision, RGB images have been increasingly used (e.g. olsen2019deepweeds). In addition, some studies involved depth images (the distance between the image plane and each pixel) brilhador2019classification; wang2020semantic.
The availability of rich public datasets in the field plays a key role in facilitating the development of new algorithms specific to weed identification tasks. In recent years, a number of in-crop weed datasets have been made public, as shown in Fig. 4, and will be reviewed in the rest of this section.
Bccr-segset NguyenThanhLe2019 contains 30,000 RGB images with pixel-wise annotations of canola (Brassica napus), maize (Zea mays) and wild radish (Raphanus raphanistrum). The images were acquired across multiple growth stages using a gantry system mounted above an indoor growth facility.
Carrot-Weed lameski2017weed contains 39 RGB images of young carrot (Daucus carota subsp. sativus) seedlings, collected with a 10-MP (Mega Pixel) phone camera under variable light conditions in the Republic of Macedonia. Pixel-level annotations are provided for three categories: carrots, unspecified weeds and soil (https://github.com/lameski/rgbweeddetection).
Crop/Weed Field Image Dataset (CWFID) haug2014crop comprises 60 top-down field images of carrots with intra-row and close-to-crop weeds captured by RGB cameras. Pixel-level annotations are provided for crop vs weed discrimination of 162 carrot plants and 332 weeds in total (https://github.com/cwfid).
CWF-788 li2019real is a field image dataset containing 788 RGB images collected from cauliflower (Brassica oleracea var. botrytis) fields with high weed pressure. It was collected for semantic segmentation of the cauliflower plants from the background (combining both weeds and soil) with manually segmented annotations (https://github.com/ZhangXG001/Real-Time-Crop-Recognition).
DeepWeeds olsen2019deepweeds was collected from remote rangelands in northern Australia for weed-specific image classification. It includes 17,509 images of 8 target weed species together with various off-target plants native to Australia. The target weed species are chinee apple (Ziziphus mauritiana), lantana (Lantana camara), parkinsonia (Parkinsonia aculeata), parthenium (Parthenium hysterophorus), prickly acacia (Vachellia nilotica), rubber vine (Cryptostegia grandiflora), siam weed (Chromolaena odorata) and snake weed (Stachytarpheta spp.). For each target weed species (positive class), around 1,000 images were obtained; off-target flora and backgrounds not containing the weeds of interest were collected as a single negative class of 9,106 images. The dataset was collected from 8 different locations to reduce scene bias; images of the target species and negative cases were collected at each location in similar quantities (https://github.com/AlexOlsen/DeepWeeds).
Grass-Broadleaf dyrmann2016plant contains images of 22 different plant species at early growth stages, constructed by combining 6 image datasets. In total, 10,413 RGB images are included. Note that the image background was removed in this dataset and each image contains only one individual plant.
GrassClover skovsen2019grassclover is a diverse image and biomass dataset, of which 8,000 synthetic RGB images are provided with pixel-wise annotations for semantic segmentation based weed identification studies. The dataset was collected in an outdoor field setting and includes 6 classes: unspecified grass species, white clover (Trifolium repens), red clover (Trifolium pratense), shepherd’s purse (Capsella bursa-pastoris), unspecified thistle, dandelion (Taraxacum spp.) and soil. In addition, 31,600 unlabelled images are provided for pre-training, weakly-supervised learning and unsupervised learning (https://vision.eng.au.dk/grass-clover-dataset).
Plant Seedling Dataset giselsson2017public contains 960 unique plants at several growth stages in RGB images for species including blackgrass (Alopecurus myosuroides), charlock (Sinapis arvensis), cleavers (Galium aparine), common chickweed (Stellaria media), wheat, fat hen (Chenopodium album), loose silky-bent (Apera spica-venti), maize (Zea mays), scentless mayweed, shepherd’s purse, small-flowered cranesbill (Geranium pusillum) and sugar beet (Beta vulgaris var. altissima) (https://vision.eng.au.dk/plant-seedlings-dataset).
Soybean/Grass/Broadleaf/Soil dos2017weed comprises 15,336 segments of soybean (Glycine max), unspecified grass weeds, unspecified broadleaf weeds and soil. The segments were extracted using the simple linear iterative clustering (SLIC) superpixel algorithm on 400 images collected with an Unmanned Aerial Vehicle (UAV)-mounted RGB camera.
Sugar Beets 2016 chebrolu2017agricultural was collected from agricultural fields with pixel-wise annotations for three classes: sugar beet, weeds and soil. This dataset contains 1,600 images, of which 700 were captured initially and 900 were captured after a four-week period. Both RGB-D and multispectral images are provided, which is helpful for exploring the effectiveness of different modalities for weed identification and for constructing multi-modal learning methods (http://www.ipb.uni-bonn.de/data/sugarbeets2016).
Sugar Beet/Weed Dataset sa2017weednet contains 155 multispectral images (near-infrared 790 nm, red 660 nm) plus normalised difference vegetation index (NDVI) with pixel-wise labelling for sugar beet, weeds and soil from a controlled field experiment (https://github.com/inkyusa/weedNet).
Weed-Corn/Lettuce/Radish jiang2020cnn contains 7,200 RGB images with image-level annotations. It includes three subsets: the corn dataset was collected from a corn field and contains 1,200 corn images and 4,800 images of four different weed species, including Canada thistle (Cirsium arvense), fat hen, bluegrass (Poa spp.) and sedge (Carex spp.); the lettuce dataset was collected from a vegetable field with two plant classes, lettuce (500 images) and weeds (300 images); the radish dataset contains 200 radish images and 200 weed images lameski2017weed (https://github.com/zhangchuanyin/weed-datasets).
Whilst these datasets provide useful imagery and annotation data for benchmarking, there is a lack of consistency and detail in metadata reporting standards and contextual information. An understanding of weed species, beyond a simple awareness of their difference from crops, is important in creating opportunities to deliver specific weed control treatments. For example, contextual understanding of crop growth stages and the presence/absence of stubble will assist in developing algorithms capable of handling variability across different conditions.
3.2 Data Augmentation
Due to the laborious nature of developing annotated datasets within weed control contexts, existing datasets are often not large enough and do not reflect sufficient diversity in conditions. A significant risk of deep learning with small datasets is overfitting, where the model performs well on the training set but poorly when deployed in the field. To address this issue, various data augmentation strategies have been adopted to enlarge the size and improve the quality of training sets, such as random cropping, rotation, flipping, color space transformation, noise injection, image mixing, random erasing and generative approaches. Readers can refer to shorten2019survey for more details.
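A few of the geometric augmentations listed above can be sketched on a nested-list "image" in plain Python; each transform preserves the image-level label, which is what makes it usable for enlarging a classification training set.

```python
# Label-preserving geometric augmentations on a nested-list image.
def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the row order (vertical flip)."""
    return img[::-1]

def rotate90(img):
    """Rotate 90 degrees clockwise: reverse rows, then transpose."""
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
augmented = [hflip(img), vflip(img), rotate90(img)]
```

In practice these would be applied randomly during training, alongside the colour-space, noise and generative strategies mentioned above.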
3.3 Evaluation Metrics
A number of metrics have been utilized to evaluate the performance of weed identification algorithms. The definitions of these metrics may differ across the different types of identification approaches.
For binary image classification whereby the classification result of an input sample is labelled either as a positive (P) or a negative (N) case, 4 possible outcomes can be derived: (1) If a positive sample is classified as positive, the prediction is correct and defined as true positive (TP). (2) If a negative sample is classified as positive, the prediction is false positive (FP). (3) If a negative sample is classified as negative, the prediction is true negative (TN). (4) If a positive sample is classified as negative, the prediction is false negative (FN).
Based on these definitions, some widely used evaluation metrics can be defined for benchmarking the performance of different algorithms. Accuracy measures the proportion of the correct predictions (#TP + #TN) over all the predictions (#P + #N). Sensitivity, also known as recall, measures the proportion of the correctly predicted positive cases (#TP) over all the positive cases (#TP + #FN). It indicates the likelihood that the algorithm identifies all weeds. A low sensitivity would suggest that a large number of weeds are missed, while a sensitivity of 1 indicates that all weeds are successfully detected. Precision measures the proportion of the correctly predicted positive cases (#TP) over all predicted positive cases (#TP + #FP). For weed detection, a high precision indicates low off-target or crop damage. Specificity measures the proportion of the correctly predicted negative cases (#TN) over all the negative cases (#TN + #FP). A low specificity suggests that an algorithm is delivering a control treatment towards crops. The F-score (also known as the F$_1$ score) combines the precision and the recall values by treating them with equal importance:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
As a binary classification model generally outputs continuous predictions, a threshold is required to determine the predicted labels: if the score is above the threshold, the corresponding sample is predicted as a positive case; otherwise the sample is predicted as a negative case. By varying the threshold, trade-offs among some of the metrics can be made. A receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classification model by plotting the sensitivity against 1-specificity at various threshold settings. A precision-recall (PR) curve plots the precision against the recall. A large area under these curves (AUC) often indicates a model of high quality. For multi-class classification, most of these metrics can be computed class by class and their means can be used.
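The threshold-based classification metrics above can be sketched as follows; the scores and labels are made-up illustration data, and sweeping the threshold over such scores is exactly how ROC and PR curves are traced out.

```python
# Confusion-matrix metrics from thresholded classifier scores.
def confusion_metrics(scores, labels, threshold=0.5):
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Made-up scores and ground-truth labels for six samples.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
p, r, spec, f1 = confusion_metrics(scores, labels)
```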
For an object detection task with only one class, a sample is associated with an object in a bounding box. For a predicted bounding box, the intersection over union (IoU) is defined as the area of the intersection divided by the area of the union of the predicted bounding box and a ground truth bounding box. Given a confidence threshold $t$, if the confidence value of a predicted bounding box is above $t$ and its IoU against the ground truth bounding box is above 0.5, the predicted bounding box is regarded as a TP case; if the confidence is above $t$ and the IoU is less than 0.5, it is regarded as a FP case; if the confidence is less than $t$ and the IoU is less than 0.5, it is regarded as a TN case; if the confidence is less than $t$ and the IoU is above 0.5, it is regarded as a FN case. Next, the precision and recall values can be defined to measure the quality of the detection results. By varying $t$, a PR curve can be obtained, and the average precision (AP), $\mathrm{AP} = \int_0^1 p(r)\,dr$, is used to summarize the quality of the PR curve, where $p(r)$ indicates the precision value corresponding to the recall value $r$ for a particular IoU threshold. In practice, different estimations of AP are adopted, such as the AUC of the PR curve. IoU threshold values other than 0.5 can also be used to select the bounding boxes from the candidates and the corresponding AP can be obtained; for example, AP$_{50}$ and AP$_{75}$ define the AP with IoU thresholds of 0.5 and 0.75, respectively. For multi-class object detection problems, these metrics can be computed for each class individually and a mean average precision (mAP) can be obtained over all classes.
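The IoU between two axis-aligned boxes can be sketched directly from its definition; the boxes below are given as (x1, y1, x2, y2) corner coordinates and are illustrative only.

```python
# Intersection over union for two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    # Corners of the overlap rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)  # union = sum - intersection

pred = (0, 0, 2, 2)
gt = (1, 1, 3, 3)
score = iou(pred, gt)   # overlap area 1, union area 7
```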
In segmentation tasks, a sample can be viewed as a pixel. The metrics such as (mean) accuracy (mAcc), recall, precision and F-score discussed above can be derived in a similar manner. By organizing the pixels belonging to the same class as regions, the concepts such as mAP and mIoU can be derived as well.
|Hyperspectral (400 - 1000nm)|
|Multispectral (NIR & RGB)|
|ResNet-18 bah2018deep||Binary||Bean||-||bean, weeds||94.8||-||-||-||-||-|
|ResNet-18 bah2018deep||Binary||Spinach||-||spinach, weeds||95.7||-||-||-||-||-|
|GCN-ResNet-101 jiang2020cnn||Binary||Radish||Y||radish, weeds||-||98.9||98.5||98.3||98.9||98.5|
|GCN-ResNet-101 jiang2020cnn||Binary||Corn||Y||corn, weeds||-||97.8||99.3||99.2||99.3||97.1|
|GCN-ResNet-101 jiang2020cnn||Binary||Lettuce||Y||lettuce, weeds||-||99.4||98.4||99.7||99.0||99.0|
|GCN-ResNet-101 jiang2020cnn||Binary||Mixed Dataset||Y||3 crops, weeds||-||96.5||98.8||98.7||97.2||96.5|
|Y||8 weeds, others||-||95.7||95.7||-||-||98.0|
|DenseNet-128-32 lammie2019low||Multiclass*||Y||8 weeds, others||-||90.1||-||-||-||-|
|GraphWeedsNet hu2020graph||Multiclass*||Y||8 weeds, others||-||98.1||98.2||-||-||99.3|
|Multiclass||Y||8 weeds, others||-||70.6||96.6||-||-||-|
|VGGNet yu2019weed||Multiclass||in-house||-||3 weeds||-||-||98.2||99.1||98.6||-|
|VGGNet yu2019weed||Multiclass||in-house||-||3 weeds||-||-||98.6||93.4||95.6||-|
|VGGNet yu2019deep||Multiclass||in-house||-||3 weeds||-||-||95.1||99.1||97.1||-|
|VGGNet yu2019deep||Multiclass||in-house||-||3 weeds||-||-||93.7||99.9||96.7||-|
|PCANet xinshao2015weed||Multiclass||in-house||-||91 weed seeds||-||91.0||-||-||-||-|
* Multi-label classification problem.
|Faster R-CNN veeranampalayam2020comparison||5 weeds||RGB||66.0||68.0||67.0||0.84|
|DetectNet (1224×1024) dyrmann2017roboweedsupport||weeds||RGB||86.6||46.3||60.3||0.64|
|YOLOv4 sharpe2020vegetation||broadleaves, sedges, grasses||RGB||100.0||91.0||95.0||-|
|DetectNet (1280×720) yu2019deep||Poa annua||RGB||100.0||99.6||99.8||-|
|DetectNet (1280×720) yu2019deep||Poa annua||RGB||100.0||99.8||100.0||-|
| Model | Dataset | Classes | Public | F-score | Precision | Recall | mIoU | mAcc | mAP | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| *Multispectral (NIR & RGB), Depth* | | | | | | | | | | |
| Customized U-Net brilhador2019classification | Sugarbeets2016 | sugar beet, weeds, soil | Y | 83.4 | - | - | - | - | - | - |
| DeepLabv3+ wang2020semantic | Sugarbeets2016 | sugar beet, weeds, soil | Y | - | - | - | 87.1 | - | - | - |
| *Multispectral (NIR & RGB)* | | | | | | | | | | |
| sNet + cNet potena2016fast | in-house | crops, weeds, soil | - | - | - | - | - | 92.0 | 97.4 | - |
| Customized FCN lottes2018fully | BONN | crops, weeds, soil | Y | 86.6 | 93.3 | 81.5 | - | - | - | - |
| Customized FCN lottes2018fully | STUTTGART | crops, weeds, soil | - | 92.4 | 91.6 | 93.5 | - | - | - | - |
| SegNet sa2017weednet | Sugar beet/Weed | crops, weeds, soil | Y | 80.0 | - | - | - | - | - | 78.0 |
| VGG-UNet fawakherji2019uav* | Sugar beet/Weed | crops, weeds, soil | Y | - | - | - | - | 95.0 | - | - |
| Customized CNN knoll2018improving | in-house | carrots, weeds, soil | - | 98.6 | 98.5 | - | - | 97.9 | - | - |
| U-Net + VGG-16 fawakherji2019crop | Sunflower | sunflower, weed, soil | - | - | - | - | 80.0 | - | - | - |
| AgNet mccool2017mixtures | CWFID | carrots, weeds, soil | Y | - | - | - | - | 88.9 | - | - |
| MiniInception mccool2017mixtures | CWFID | carrots, weeds, soil | Y | - | - | - | - | 89.9 | - | - |
| Customized VGG-16 dyrmann2016pixel | in-house | maize, weeds, soil | - | - | - | - | 84.0 | 96.4 | - | - |
| Customized CNN li2019real | CWF-788 | cauliflower, weeds, soil | Y | 98.0 | - | - | 95.9 | - | - | - |
| DeepLabv3+ wang2020semantic | Oilseed Image | oilseed, weeds, soil | - | - | - | - | 88.9 | - | - | - |
| SegNet lameski2017weed | Carrot-Weed | carrots, weeds, soil | Y | - | - | - | - | 64.1 | - | - |

* Additional information from the NDVI modality is involved.
3.4 Weed Identification Methods
Existing studies on weed identification can be organized into four categories in terms of the approaches used: weed image classification, weed object detection, weed semantic segmentation and weed instance segmentation. Tables 1, 2 and 3 summarize the major studies of the first three categories, whilst instance segmentation based weed identification has only emerged recently.
3.4.1 Weed Image Classification
This approach aims to achieve image-level weed identification, determining which species an image contains. An early deep learning based study dyrmann2016plant
devised a residual CNN for multi-class classification. On their proposed Grass-Broadleaf dataset, which contains 10,413 RGB crop-weed images of 22 weed species, an accuracy of 86.2% was achieved. A variant of the PCA (Principal Component Analysis) network was proposed xinshao2015weed for classifying 91 classes of weed seeds using RGB images, achieving an accuracy of 90.96%. AlexNet was adopted to classify RGB images from the public Grass-Broadleaf dataset dos2017weed, achieving an accuracy of 99.5%. Although these results look promising, the plants or seeds were well segmented and the field or natural background information was limited, which could lead to failures under real field conditions.
More recently, a hybrid model of AlexNet and VGGNet was proposed. It was evaluated on a public plant seedling dataset containing RGB images of 3 crop species and 9 weed species at an early growth stage and achieved an accuracy of 93.6%. Classifying three weed species growing in perennial ryegrass, namely Euphorbia maculata, Glechoma hederacea and Taraxacum officinale, was studied yu2019weed using VGGNet on RGB images, which achieved F-scores of 98.6% and 95.6% on two independent test sets collected from fields at different locations. A similar study classified three other weed species growing in perennial ryegrass, namely Hydrocotyle spp., Hedyotis corymbosa and Richardia scabra yu2019deep. Another study identified cephalanoplos, digitaria, bindweed and soybean in RGB images by introducing a CNN based on LeNet-5 ciresan2011convolutional with K-means clustering for unsupervised pre-training, which achieved an accuracy of 92.9% tang2017weed. To further advance weed image classification in complex environments, a public dataset, namely DeepWeeds olsen2019deepweeds, was constructed by acquiring RGB images in remote and extensive rangelands with rough and uneven terrain. A baseline accuracy of 95.7% was achieved by a ResNet-50 for multi-label classification. A simplified DenseNet, namely DenseNet-128-32, was explored to reduce the computational cost and inference time while keeping performance comparable to that of the original DenseNet model lammie2019low.
Recently, a few fine-grained architectures were explored to improve weed image classification performance. By introducing a graph-based image representation, a graph weeds net, which formulates global and fine-grained weed characteristics with GCNs, achieved a state-of-the-art accuracy of 98.1% on the DeepWeeds dataset hu2020graph. Another study also investigated graph mechanisms jiang2020cnn, in which a GCN-ResNet-101 was proposed and accuracies varied from 96.5% to 98.9% across 4 public RGB datasets.
Deep unsupervised learning was explored in a recent weed study dos2019unsupervised, which investigated two methods: joint unsupervised learning of deep representations and image clusters (JULE) and deep clustering for unsupervised learning of visual features (DeepCluster). They adopted the CNN outputs as features for a clustering algorithm and assigned pseudo labels to samples based on the clustering results. As reported, the DeepCluster method achieved an accuracy of 70.6% with a VGG-16 backbone on the DeepWeeds dataset, and an accuracy of 99.5% with an AlexNet backbone on Grass-Broadleaf.
Besides RGB images, multispectral imaging has also been investigated. Different input sizes of multispectral images were explored with a CNN containing 4 convolution layers farooq2018analysis; as the input size varied, the accuracy ranged from 86.3% to 94.7% on the UNSW Hyperspectral Weed Dataset. FCNN-SPLBP combined a CNN with superpixel based local binary pattern feature extraction farooq2019multi and was evaluated on two public datasets, achieving an accuracy of 89.7% on the UNSW Hyperspectral Weed Dataset and an accuracy of 96.4% on the Sugar beet/Weed dataset.
3.4.2 Weed Object Detection
Existing weed object detection methods are mainly based on generic object detection methods. In dyrmann2017roboweedsupport, DetectNet achieved an mIoU of 64.0% and an F-score of 60.3% on an in-house dataset containing 1,427 RGB images. Another study yu2019deep with DetectNet achieved F-scores of 99.8% and 100.0% in detecting a single weed species, Poa annua, in two different environments. In veeranampalayam2020comparison, Faster R-CNN achieved an mIoU of 84.0% and an F-score of 67.0% in detecting waterhemp (Amaranthus tuberculatus), Palmer amaranth (Amaranthus palmeri), common lambsquarters (Chenopodium album), velvetleaf (Abutilon theophrasti) and foxtail species such as yellow and green foxtails on an in-house dataset containing 450 augmented RGB images. YOLOv3 was used to detect broadleaves, sedges and grasses sharpe2020vegetation and achieved an F-score of 95.0%.
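The precision, recall and F-score values quoted in this subsection are typically computed by matching each predicted box to an unmatched ground-truth box at an IoU threshold. A minimal, illustrative sketch (greedy matching; not the exact protocol of any cited study):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_f_score(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily match predictions to ground truth; return (precision, recall, F-score)."""
    matched, tp = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched and iou(p, g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Published studies differ in the IoU threshold used and in how multiple detections of the same plant are counted, so reported numbers are not always directly comparable.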
3.4.3 Weed Semantic Segmentation
An intuitive approach to weed semantic segmentation is based on a two-stage scheme. Two CNNs, namely sNet and cNet, were devised in potena2016fast, in which the first stage generated segmented objects and the second stage predicted the class of each object. The method was applied to multispectral images to identify regions of crops, weeds and soil with a mean accuracy of 92.0% and an mAP of 97.4%. Another study adopted a conventional HSV colour-space vegetation index method for segmentation and a CNN for classification knoll2018improving. It achieved an F-score of 98.6% and a mean accuracy of 97.9% on an in-house RGB image dataset containing carrots, weeds and soil.
FCNs were investigated in pursuit of end-to-end solutions that handle segmentation and classification within one neural network. In dyrmann2016pixel; mortensen2016semantic, the last FC layer of a VGG-16 was replaced with a deconvolutional layer. The modified VGG-16 was evaluated on two RGB datasets: on one, segmenting maize, weeds and soil, an mIoU of 84.0% and a mean accuracy of 95.4% were achieved; on the other, segmenting equipment, soil, stump, weeds, grass, radish and unknown categories, a mean accuracy of 79.0% was achieved. An FCN with a DenseNet backbone lottes2018fully was evaluated to identify crops, weeds and soil in multispectral images and achieved F-scores of 86.6% and 92.4% on two datasets collected from sugar beet fields at different locations. Two FCN-8s models were trained to segment RGB images: the first recognized grass, clover, weeds and soil, and the second recognized fine-grained clover species including white clover and red clover skovsen2019grassclover. This achieved an mIoU of 55.0% on the proposed GrassClover dataset.
In addition to using plain FCNs, recent studies have tended to explore FCNs with additional mechanisms that benefit segmentation tasks. In lameski2017weed, a SegNet with a VGG-16 backbone achieved a mean accuracy of 64.1% on a carrot-weed dataset containing RGB images of carrots, weeds and soil. In sa2017weednet, a public multispectral dataset, Sugar beet/Weed, was proposed to identify crops, weeds and soil, and a SegNet with a VGG-16 backbone achieved an F-score of 80.0% and an AUC of 78.0% when evaluating the crop and weed predictions within a binary pixel-wise classification scheme. In asad2019weed, a SegNet with a ResNet-50 backbone was adopted to identify canola and weeds in RGB images using a pre-processing step to remove backgrounds, which achieved an mIoU of 82.9% and a mean accuracy of 99.5%. The Bonnet framework milioto2019bonnet was used to segment sunflower, weeds and soil in RGB images and achieved an mIoU of 80.0% fawakherji2019crop. A customized U-Net with different data augmentation strategies was investigated on the CWFID dataset brilhador2019classification, achieving an F-score of 83.4%. A VGG-UNet was evaluated on the Sugar beet/Weed dataset and achieved a mean accuracy of 95.0% fawakherji2019uav. DeepLabv3+ was evaluated on Sugarbeets2016, which contains multispectral and depth data, and on an in-house RGB oilseed dataset, achieving mIoU values of 87.1% and 88.9%, respectively wang2020semantic.
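The mIoU values reported for these segmentation studies are commonly derived from a pixel-wise confusion matrix. A small illustrative sketch (averaging details vary between papers, e.g. whether absent classes are skipped):

```python
import numpy as np

def miou(pred, target, num_classes):
    """Mean IoU over classes, from two integer label maps of identical shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        conf[t, p] += 1  # rows: ground truth, columns: prediction
    ious = []
    for c in range(num_classes):
        inter = conf[c, c]
        union = conf[c, :].sum() + conf[:, c].sum() - inter
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

For weed segmentation this per-class averaging matters: soil usually dominates the image, so pixel accuracy can look high even when the weed class is poorly segmented.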
Lightweight models aiming at efficient segmentation have also been explored. In mccool2017mixtures, lightweight models were mixed together under the guidance of a large and accurate model, Inception-V3. On the CWFID dataset, compared to Inception-V3 with an accuracy of 93.9% at 0.12 fps during inference, a mixture of 4 lightweight models achieved an accuracy of 90.3% at 1.83 fps. A customized CNN with a ResNet-10 backbone li2019real, using side outputs and short connections for multi-scale feature fusion, achieved an F-score of 98.0% and an mIoU of 95.9% on the CWF-788 dataset.
3.4.4 Weed Instance Segmentation
There have been a few attempts at deploying instance segmentation algorithms for weed identification. A recent study adopted Mask R-CNN for field RGB images of two crop species and four weed species champ2020instance. Further exploration of this approach is needed, for example to improve performance across different growth stages and on small plants.
4 Challenges and Opportunities
In this section, based on current weed management studies and recent developments in deep learning techniques, we discuss the challenges and opportunities for further advancing weed identification research from the following aspects: fine-grained learning, real-time inference, explainability, weakly-supervised / unsupervised learning and incremental learning, as well as large scale datasets.
4.1 Fine-grained Learning
As reviewed in Section 3.4, most existing weed identification methods were based on general deep architectures, ignoring the challenge caused by the strong visual similarities between crops and weed species. Recently, three major categories of fine-grained deep methods have been explored to address this challenge.
Patch-based methods are built on the fact that fine-grained details often occur at a local level. With patterns collected from each region, fusion or aggregation methods can be used to compute the final outputs. For example, regional CNN based features can be collected according to the key points of human poses for fine-grained action recognition hu2019vision.
High-order pooling based methods were also introduced to address fine-grained tasks, without requiring explicit patch proposals (e.g. zheng2019looking). In particular, for a given convolutional feature map $X \in \mathbb{R}^{c \times hw}$ with $c$ channels and $h \times w$ spatial locations, the bilinear pooling can be computed as $XX^{\top}$, and the trilinear pooling can be computed as $(XX^{\top})X$. The pooled result can be used as the input of the subsequent layer of the network. The relation between high-order pooling methods and patch-based methods can be explained from the perspective of attention kim2018bilinear: both focus on critical regions to collect effective deep representations for their associated tasks.
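Once the feature map is flattened, both pooling operators reduce to plain matrix products. A shape-checking sketch (the feature map here is random, purely for illustration):

```python
import numpy as np

# Feature map X with c channels and h*w spatial locations, flattened to (c, hw).
c, h, w = 8, 4, 4
X = np.random.randn(c, h * w)

bilinear = X @ X.T          # (c, c): channel co-occurrence statistics
trilinear = (X @ X.T) @ X   # (c, hw): channel relations projected back onto locations

assert bilinear.shape == (c, c)
assert trilinear.shape == (c, h * w)
```

The bilinear result discards spatial layout entirely, while the trilinear result keeps one value per spatial location, which is why it can act as an attention-like map over regions.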
Regularization based methods rest on the observation that, in fine-grained settings, intra-class differences can be higher than inter-class differences; they introduce regularization terms into the loss to drive the optimization towards learning fine-grained patterns. For example, in DubeyGGRFN17, pairwise confusion and entropic confusion terms were introduced to construct the loss function.
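A pairwise confusion term of this kind can be sketched in a few lines (an illustrative simplification of the idea in DubeyGGRFN17, pairing adjacent batch samples rather than random ones):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pairwise_confusion(logits):
    """Mean squared distance between predicted distributions of sample pairs
    in a batch. Minimizing this term discourages over-confident, sample-specific
    predictions on visually similar fine-grained classes."""
    p = softmax(logits)
    a, b = p[0::2], p[1::2]          # split the batch into pairs
    n = min(len(a), len(b))
    return float(np.mean(np.sum((a[:n] - b[:n]) ** 2, axis=1)))
```

In training this term would be added to the usual cross-entropy loss with a small weight, so it regularizes rather than dominates the objective.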
Such fine-grained deep models provide a great opportunity to advance weed identification by taking domain knowledge into account. For example, a weed can be decomposed into meaningful regions, such as leaves and stems. In our recent work, a patch-based GNN hu2020graph was proposed for fine-grained weed classification, which achieved an accuracy of 98.1% on the DeepWeeds dataset, compared with an accuracy of 95.3% for DenseNet.
4.2 Real-time Inference
While most weed identification studies demonstrated promising performance using deep learning techniques, these deep networks often contain a huge number of parameters. This leads to three major deployment issues regarding efficiency, memory consumption and power consumption. Intuitively, as indicated in cheng2017survey, lightweight models (e.g. MobileNet howard2019searching) can be devised using mechanisms such as parameter pruning, low-rank factorization and transferred/compact convolutional filters. In particular, on a Google Pixel 3 device running single-threaded on one large core, MobileNet (V3) achieved a top-1 accuracy of 65.4% for image classification on ImageNet with an inference latency of 11.7 ms. Note that these lightweight models can also serve as backbones for object detection and segmentation. For example, SSDLite with a MobileNet (V3) Small backbone achieved an inference latency of 43 ms and an mAP of 16.1% on the COCO test set; MobileNet (V3) based segmentation achieved an mIoU of 69.4% with an inference time of 1.03 s per input image. For weed identification, a ResNet-10 was proposed as a backbone, with side outputs and short connections introduced for multi-scale feature fusion; it achieved an mIoU of 95.9% and an F-score of 98.0%, with an average inference latency of around 180 ms on an Nvidia Jetson TX2 li2019real.
| Model | Modality | Latency (ms) | Performance | Hardware |
|---|---|---|---|---|
| Customized CNN li2019real | RGB | 180 | 95.9 mIoU | Jetson TX2 |
| Binarized DenseNet-128-32 lammie2019low | RGB | 1.539 | 90.1 Acc | Terasic DE1-SoC |
| Mixture of lightweight models mccool2017mixtures | Multispectral | 546-934 | 90.0 Acc | GeForce Titan X |
In addition to devising lightweight architectures, methods such as quantization and knowledge distillation can turn an existing model into one with fewer parameters while providing performance comparable to the complex model (e.g. ResNet-50 vs. ResNet-152). Quantization methods reduce the number of bits used to represent the parameters of a model. In particular, binarization stores only one bit per parameter, which significantly reduces memory consumption and computational cost qin2020binary. A binarized DenseNet-128-32 was implemented on an FPGA (Terasic DE1-SoC) for weed detection, gaining an accuracy of 88.91% lammie2019low. This was slightly lower than a standard DenseNet, but it obtained a very fast average inference latency of 1.539 ms. Knowledge distillation follows a way similar to how human beings learn: it involves one or more large pre-trained teacher models and a small student model, and aims to obtain an efficient student model that mimics, and performs comparably to, the teacher models. A distillation loss penalizes the difference between the outputs of the teacher and student models. A weed identification study followed this scheme to obtain a few lightweight models for semantic segmentation mccool2017mixtures; mixing these lightweight models achieved an accuracy of 90.0% with inference latencies between 546 ms and 934 ms on an Nvidia GeForce Titan X graphics card.
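The soft-target term of a distillation loss can be sketched as follows (an illustrative stdlib-only version; the temperature T=4 is an assumed value, and in practice this term is combined with a cross-entropy loss on hard labels):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between the softened teacher and student distributions:
    the soft-target term that lets a small student mimic a large teacher."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
```

A higher temperature exposes the teacher's relative confidence between non-target classes (e.g. between two similar weed species), which is exactly the information hard labels discard.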
4.3 Weakly-supervised & Unsupervised Learning
As manually collecting supervision information for weed datasets can be resource expensive, weakly-supervised and unsupervised learning algorithms are needed for weed identification. For weakly-supervised learning, it is expected that weed object detection or even weed segmentation can be conducted using only image-level annotations. For unsupervised learning, deep clustering and domain adaptation can be conducted. Deep clustering categorizes similar samples into one cluster in line with similarity measures on their deep representations min2018survey. One application of deep clustering is pre-training a neural network on a large unlabelled dataset and then fine-tuning it on a small labelled dataset. Domain adaptation addresses the problem that training and testing samples follow different distributions; this could be the case, for example, when two datasets for the same species are collected at different locations. Unsupervised domain adaptation handles situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain. Readers can refer to wilson2020survey for more details.
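The deep clustering idea, assigning pseudo labels from clusters of deep features, can be sketched with a tiny k-means (illustrative only; DeepCluster itself alternates clustering with network updates):

```python
import numpy as np

def kmeans_pseudo_labels(features, k=3, iters=20, seed=0):
    """DeepCluster-style pseudo-labelling: cluster deep features with k-means
    and use the cluster indices as labels for the next training round."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its members
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels
```

In the full scheme, `features` would be CNN embeddings of unlabelled weed images, and the returned cluster indices would supervise the next epoch of network training.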
Note that existing deep learning based weed identification methods have not adequately explored this realm of using unlabelled samples. Only very recently was deep clustering investigated for weed image classification dos2019unsupervised, in which a VGG-16 based DeepCluster network achieved an accuracy of 70.6% on the DeepWeeds dataset. In hu2020graph, the graph weeds net involved a weakly-supervised learning strategy, namely multi-instance learning, and used image-level annotations to provide rough locations of weed plants.
4.4 Explainable Learning
Deep learning has a black-box nature, since it is difficult to understand and interpret the relations between inputs and outputs. However, explainability is of great importance for building trust between models and users and eventually facilitating model adoption. As summarized in xie2020explainable, there are three major approaches in pursuit of the explainability of deep learning: 1) visualization methods identify the parts of an input that most strongly influence the results; 2) model distillation employs conventional machine learning models, most of which have clear statistical explanations and indications, to mimic the behaviour of trained deep models; and 3) intrinsic methods integrate explainable mechanisms (e.g., attention mechanisms) into the network itself.
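As a concrete example of the visualization family, occlusion sensitivity needs only black-box access to a model's score function (a sketch; `predict` here is a hypothetical scoring function returning the model's score for the predicted class):

```python
import numpy as np

def occlusion_map(image, predict, patch=4, baseline=0.0):
    """Occlusion-sensitivity visualization: slide a masking patch over the image
    and record how much the model's score drops; large drops mark the regions
    the decision depends on (e.g. leaf areas in a weed image)."""
    base_score = predict(image)
    h, w = image.shape[:2]
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            heat[i // patch, j // patch] = base_score - predict(occluded)
    return heat
```

Because the method never touches the model internals, it applies equally to any of the classifiers reviewed in Section 3.4, at the cost of one forward pass per patch position.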
Explainable learning has seldom been investigated for deep learning based weed identification, although it has the potential to provide further insights. Recently, the graph weeds net was proposed with a graph mechanism that treats the regions of an input image as graph vertices in order to analyse critical regions hu2020graph. As object detection or segmentation usually take more effort than image classification, such an explainable learning approach also provides an opportunity to focus on the critical objects within an image at a lower cost. Furthermore, the critical regions are obtained without regional annotations, which can be viewed as weakly-supervised learning requiring less human effort.
4.5 Incremental Learning
Most existing weed identification methods assume that a trained network will only deal with fixed target species that are available during training. As a result, when new species of interest emerge, the deep model generally needs to be re-trained with a new training set. To address this time-consuming and inflexible scheme, incremental learning extends a trained model to new classes without re-training from scratch. Note that the training samples of existing species often cannot be stored in high volume due to storage limitations, whilst samples of the incremental species may be adequate. Hence, incremental learning mainly addresses this imbalance when obtaining a new model based on an existing one.
To conduct incremental learning, four major approaches have been investigated de2019continual: 1) rehearsal methods retain a subset of the old training data in line with a budget; 2) generative methods store the distributions of the old dataset as the parameters of a generative model, which can produce unlimited samples during incremental training; 3) parameter isolation based methods aim to prevent any possible forgetting of previous tasks when there are no constraints on the model size, in general by using different model parameters for different species; and 4) regularization techniques penalize changes that would cause forgetting of previous knowledge.
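The first (rehearsal) approach depends on choosing which old samples to keep. A common heuristic is iCaRL-style herding, sketched here (illustrative; assumes per-sample feature vectors for one old class are available):

```python
import numpy as np

def herding_selection(features, m):
    """Greedily pick m exemplars whose running mean best approximates the
    class-mean feature: the small rehearsal memory kept per old class."""
    mu = features.mean(axis=0)
    chosen, running = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # score each remaining sample by how close the new running mean gets to mu
        gains = [np.linalg.norm(mu - (running + f) / k) if i not in chosen else np.inf
                 for i, f in enumerate(features)]
        i = int(np.argmin(gains))
        chosen.append(i)
        running = running + features[i]
    return chosen
```

For a weed species with thousands of training images, such a selection keeps only the few samples most representative of the class mean, which is what makes the storage budget of approach 1 workable.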
Recently, AgroAVNET explored incremental learning on the plant seedling dataset chavan2018agroavnet and achieved an accuracy of 91.35% for 12 species, compared to 93.64% from full re-training. It followed a very straightforward approach that froze the convolutional layers trained on the original dataset and re-trained only the FC layers, without fully exploiting incremental learning techniques.
4.6 Large Scale Dataset
Large scale datasets are essential for developing high-performance and robust deep learning models. For example, ImageNet krizhevsky2012imagenet, which contains 15 million labeled images belonging to roughly 22,000 categories, has played a significant role in advancing deep learning based vision tasks. However, as summarized in Section 3.1, most existing weed datasets contain images of a small number of classes. In addition, those images were collected under limited scenarios, such as a single growth stage and a single lighting condition. This has limited the development of advanced methods applicable to a large variety of fields and has prevented translation towards commercial adoption. Therefore, constructing large scale datasets covering diverse and complex conditions in the context of practical deployment is highly demanded.
5 Conclusion
In this paper, we reviewed recent progress in the field of deep learning based weed identification and discussed the challenges and opportunities for future research. After introducing the fundamentals of deep learning techniques, we provided a systematic review from three aspects: research data, evaluation and weed identification methods. There have been more than 10 public datasets collected through different modalities, and many weed identification methods have been reported across different research disciplines due to the inter-disciplinary nature of this topic. It is also noticeable that most existing weed identification methods were built on architectures developed for generic deep learning problems. Finally, we discussed the challenges and opportunities in terms of five different learning techniques and large scale datasets. Overall, deep learning based weed identification has gained increasing interest from different research communities, and we feel that large scale datasets are strongly needed to bring this research direction to a new level.
Funding: This work was supported by the GRDC (Grains Research and Development Corporation) [grant number 9177493].