Precision weed management offers a promising solution for sustainable cropping systems through the use of chemical-reduced or non-chemical robotic weeding techniques, which apply suitable control tactics to individual weeds. Accurate identification of weed species therefore plays a crucial role in such systems, enabling precise, individualized weed treatment. This paper presents the first comprehensive evaluation of deep transfer learning (DTL) for identifying common weeds specific to cotton production systems in the southern United States. A new dataset for weed identification was created, consisting of 5187 color images of 15 weed classes collected under natural lighting conditions and at varied weed growth stages in cotton fields during the 2020 and 2021 field seasons. We evaluated 27 state-of-the-art deep learning models through transfer learning and established an extensive benchmark for the considered weed identification task. DTL achieved high classification accuracy, with F1 scores exceeding 95% across models; ResNet101 achieved the best F1 score of 99.1%, and 14 of the 27 models achieved F1 scores exceeding 98.0%. However, performance on minority weed classes with few training samples was less satisfactory for models trained with a conventional, unweighted cross entropy loss function. To address this issue, a weighted cross entropy loss function was adopted, which substantially improved the accuracies for minority weed classes. Furthermore, a deep learning-based cosine similarity metric was employed to analyze the similarity among weed classes, assisting in the interpretation of classifications. Both the code for model benchmarking and the weed dataset are made publicly available, and are expected to be a valuable resource for future research in weed identification and beyond.
Weeds are critical threats to crop production; potential crop yield loss due to weeds is estimated at 43% on a global scale Oerke (2006). In cotton production, poor weed management can lead to yield losses of up to 90% Manalil et al. (2017). Weed control is traditionally performed with machines or by hand weeding. With the advent of transgenic, glyphosate-tolerant crops since 1996, over 90% of U.S. farmland for field crops such as cotton is planted with herbicide-resistant seeds Service (2015). Weed control has thus become predominantly reliant on herbicide application Duke (2015); Pandey et al. (2021). Intensive, blanket, broadcast application of herbicides, however, has adverse environmental impacts and facilitates the evolution of herbicide-resistant weeds (e.g., Palmer Amaranth and Waterhemp), which in turn substantially increases management costs Norsworthy et al. (2012).
Precision weed management (PWM) has recently emerged as a promising solution for sustainable, effective weed control, incorporating sensors, computer systems and robotics into cropping systems Young et al. (2014). By recognizing the biological attributes of different weed species, PWM enables precise, minimum-necessary treatments according to site-specific demand, targeting individual weeds or small clusters Gerhards and Christensen (2003), which can lead to significant reduction in the consumption of herbicides and other resources. For instance, a robotic weeder can spray a particular type or volume of herbicide, or use mechanical weeders or lasers to treat specific weed species, avoiding unnecessary application to crops, bare soil or plant residues Barnes et al. (2021). Successful implementation of integrated, precise weed control strategies therefore relies on accurate identification, localization and monitoring of weeds. Currently, machine vision and robotic technology for automated weed control have been demonstrated in certain specialty crops Fennimore and Cutulle (2019). However, commercial-scale applicability to row crops such as cotton under varying growing conditions has yet to be evaluated or demonstrated. The lack of a robust machine vision system capable of weed recognition with accuracy exceeding 95% in unstructured field conditions has been identified as one of the most critical technological bottlenecks towards full realization of automated weeding Westwood et al. (2018). The key to addressing this bottleneck thus lies in the development of image analysis and modeling algorithms with high and robust performance.
Image analysis methods based on the extraction of color and texture features, followed by thresholding or supervised modeling, are widely used for weed classification and detection Wang et al. (2019); Meyer and Neto (2008). A variety of color indices that accentuate plant greenness have been proposed for separating weeds from soil backgrounds Meyer and Neto (2008); Woebbecke et al. (1995). Color indices developed from empirical observations, however, are not robust enough for images acquired under variable field lighting conditions Hamuda et al. (2016). In Bawden et al. (2017), texture features including local binary patterns and covariance features were used for weed classification; applied on a robotic platform, the extracted features achieved an accuracy of 92.3% on a dataset containing 40 images of 6 weed species. Local shape and edge orientation features were used in Ahmad et al. (2018) for discriminating monocot and dicot weeds, achieving an overall accuracy of 98.4% with AdaBoost and Naïve Bayes. In Bakhshipour and Jafari (2018), Fourier descriptors and invariant moments were extracted and fed into a support vector machine to classify four common weeds in sugarbeet fields, resulting in an accuracy of 93.3%. Despite these promising results, the aforementioned color or texture feature-based approaches require engineering hand-crafted features for given weed detection/classification tasks, which may not adapt satisfactorily to a more diverse set of imaging conditions.
Recently, data-driven methods such as deep learning (DL), e.g., convolutional neural networks (CNNs), have been researched for weed classification and detection Hasan et al. (2021). CNNs are able to capture spatial and temporal dependencies in images through the use of shared-weight filters and can be trained end-to-end without explicit feature extraction O'Shea and Nash (2015), empowering neural networks to adaptively discover underlying class-specific patterns and the most discriminative features. In Dyrmann et al. (2016), a CNN model trained on a dataset of 10413 images of 22 plant species at early growth stages achieved a classification accuracy of up to 98%. A graph-based DL architecture with multi-scale graph representations was developed in Hu et al. (2020) for weed classification, achieving an accuracy of 98.1% on the DeepWeeds dataset Olsen et al. (2019). While successful, training such DL models from scratch is very time-consuming and resource-intensive, requiring high-performance computation units and large-scale, high-quality annotated image datasets, which may not be readily available.
Transfer learning, a methodology that aims at transferring knowledge across domains, can greatly reduce the training time and the dependence on massive training data by reusing already trained models for new problems Zhuang et al. (2020). Deep transfer learning (DTL, i.e., transferring DL models) therefore only involves fine-tuning model parameters on new datasets in the target domain. DTL has recently been investigated for weed identification. In Olsen et al. (2019), two pretrained DL models were fine-tuned and tested on the DeepWeeds dataset, achieving average accuracies above 95%. In Espejo-Garcia et al. (2020a), the authors found that fine-tuning DL models on agricultural datasets helped reduce training epochs while improving model accuracy. They fine-tuned four DL models on the Plant Seedlings Dataset Giselsson et al. (2017) and the Early Crop Weeds Dataset Espejo-Garcia et al. (2020b) and improved the classification accuracy by 0.51% and 1.89%, respectively. In Suh et al. (2018), six pretrained DL models were adopted to classify sugarbeet and volunteer potato images, achieving a best accuracy of up to 98.7%. In Ahmad et al. (2021), three pretrained CNN models were used for weed classification, achieving 98.8% accuracy in classifying four weed species in corn and soybean fields. These studies, however, experimented with only a small number of DL models. Given active developments in DL model architectures Khan et al. (2020), it would be beneficial to the research community to evaluate a broad range of state-of-the-art DL models for weed identification, so as to facilitate informed selection of high-performance models in terms of accuracy, training time, model complexity, and inference speed.
Despite transfer learning strategies, large volumes of annotated image data remain highly desirable for powering DL models in visual categorization tasks Sun et al. (2017). Currently, the dearth of such datasets remains a crucial hurdle for exploiting the potential of DL and advancing machine vision systems for precision agriculture Lu and Young (2020); Library (2021). In weed detection, achieving high accuracy and robustness requires a dataset that adequately represents important weed species and accounts for the variations associated with environmental factors (e.g., soil types and characteristics, field light, shadows) as well as growth-stage-related morphological or physiological variations. Recently, Lu and Young (2020) reviewed 15 publicly available weed image datasets dedicated to weed control, such as DeepWeeds Olsen et al. (2019), the Early Crop Weeds Dataset Espejo-Garcia et al. (2020b), and the Open Plant Phenotyping Database Leminen Madsen et al. (2020), among others. Most of these datasets target a small number of weed species, with images acquired in a single growing season at geographically similar field sites. No image datasets of weeds specific to cotton production systems have been published so far.
In this paper, we present a new weed dataset collected in cotton fields in multiple southern U.S. states over the two consecutive seasons of 2020 and 2021. We establish a comprehensive benchmark of a large set of DL architectures for weed classification on the new dataset. This research is expected to provide a valuable reference for future research on developing machine vision systems for cotton weed control and beyond. The contributions of this paper are highlighted as follows:
The presentation of a unique, diverse weed dataset (https://www.kaggle.com/yuzhenlu/cottonweedid15) consisting of 5187 images of 15 weed classes specific to U.S. cotton production systems.
A comprehensive evaluation and benchmark of 27 state-of-the-art DL models (https://github.com/Derekabc/CottonWeeds) through transfer learning for multi-class weed identification.
A novel DL-based cosine similarity metric for assisting in the interpretation of DL output and a weighted loss function for improving classification accuracies for minority weed classes.
RGB (Red-Green-Blue) images of weed plants were collected from cotton fields using either smartphones or hand-held digital color cameras. For the sake of image diversity, following the recommendations in Lu and Young (2020), images were captured from different view angles, under natural field light conditions, at varying stages of weed growth, and at different locations across the U.S. cotton belt states (primarily in North Carolina and Mississippi). Regular visits to cotton fields were conducted throughout June to August in the growing seasons of 2020 and 2021 for weed image collection. In 2020, images were mainly acquired in the cotton fields of North Carolina State University research stations, including Central Crops Research Station (Clayton, NC), Upper Coastal Plain Research Station (Rocky Mount, NC) and Cherry Research Farm (Goldsboro, NC). In 2021, more weed images were acquired in cotton fields of R. R. Foil Plant Science Research Center (Starkville, MS) and Black Belt Experiment Station (Brooksville, MS) of Mississippi State University. To create a diverse, large-scale dataset, weed scientists at different institutions were invited to participate in the image collection effort. A Google form (https://forms.gle/zr9wa1uu7qHTFiK2A) was created and shared for uploading weed images and associated metadata (e.g., weed species, field sites, weather conditions).
The acquired images were first annotated for weed species by weed experts during image submission through the Google form; the received images were then annotated by trained individuals, and the final annotations were examined again by experts to ensure annotation quality. Images containing multiple classes of weeds were cropped so that each resultant image contained a single class of weeds. The weed classes were defined by the common names of the weed plants. At the time of writing, the entire dataset contains more than 10000 images of over 50 weed species, which will be documented in detail in a future study. The weed dataset used here for benchmarking DL models consists of a total of 5187 images of 15 common weed classes. The image number for each weed class is shown in Fig. 1. It should be noted that all weed classes, except Morningglory, correspond to single weed species. The images of different Morningglory species (e.g., Ivy Morningglory, Pitted Morningglory, Entireleaf Morningglory and Tall Morningglory) were grouped together as a single weed class because of their similarity in weed management. Overall, weed classes including Morningglory, Carpetweed, Palmer Amaranth, Waterhemp and Purslane are the major classes in terms of image number, as opposed to minority classes such as Crabgrass, Swinecress and Spurred Anoda. It is clear that the present dataset has unbalanced classes. Class imbalance generally poses a challenge to machine learning modeling, which will be discussed in Sections 2.4 and 3.2.
Fig. 2 shows example images from the cotton weed dataset. The images within the same weed class have large variations in leaf color and morphology, soil background and field light conditions, which are desirable for building models robust to varying image conditions or dataset shift. The degree of image variation differs among weed classes; despite distinct identifying characteristics, some weed classes exhibit relatively high similarities in plant morphology. For instance, some young Morningglory and Spurred Anoda seedlings have similar, broad leaves, and the latter is also similar to Prickly Sida in terms of toothed leaf margins. Goosegrass and Crabgrass are both grassy weeds that grow prostrate on the ground, with similar leaf shapes. Palmer Amaranth and Waterhemp, which are both pigweed species, may look similar and are difficult to distinguish from each other. These similarities may contribute to errors in weed identification by DL models. A quantitative DL-based similarity measure along with a similarity matrix will be discussed in later sections (see Sections 2.5 and 3.3) to characterize the similarity among weed classes.
Deep transfer learning (DTL) starts with a DL model pre-trained on a large-scale dataset (e.g., ImageNet Deng et al. (2009)) and then fine-tunes the model on a new dataset from the specific domain of interest Zhuang et al. (2020). For the weed classification task in this study, we replaced the last fully-connected (FC) layer of each DL model with a layer of 15 neurons, corresponding to the number of weed classes in the cotton weed dataset.
A literature review was conducted to select appropriate DL models for weed identification. The main selection criteria were the demonstrated performance of models in visual categorization tasks in the computer vision community and the availability of their source-code implementations. As a result, a suite of 27 state-of-the-art CNN models of different architectures, as summarized in Table 2, was selected for classifying the cotton weed images. Some of them, including Xception Chollet (2017), VGG16 Simonyan and Zisserman (2014), ResNet50 He et al. (2016), InceptionV3 Szegedy et al. (2016) and DenseNet Huang et al. (2017), have recently been evaluated for classifying weeds in other cropping systems Espejo-Garcia et al. (2020b); Olsen et al. (2019); Ahmad et al. (2021). The majority of these models, such as EfficientNet Tan and Le (2019) and MnasNet Tan et al. (2019), remain to be evaluated for weed classification tasks.
The DL models were trained with a conventional cross entropy (CE) loss function as follows:

L_{CE} = -\sum_{i=1}^{C} y_i \log(p_i),    (1)

where p = (p_1, ..., p_C) is the vector of the Softmax output layer Goodfellow et al. (2018) indicating the predicted probabilities of the 15 weed classes. Here C = 15 is the number of weed classes, and y_i denotes the true probability, defined as y_i = 1 if i is the true class of the input image and y_i = 0 otherwise.
For model development and evaluation, the cotton weed dataset was randomly partitioned into three subsets: 65% for training, 20% for validation and 15% for testing, as shown in Fig. 1. All training and validation images were resized to a fixed resolution before being fed into the DL models (two alternative resolutions were also examined, but the chosen size was found to give a better trade-off between accuracy and speed). The image pixel intensities per color channel were normalized to a fixed range for enhanced image recognition performance Koo and Cha (2017). In addition, for better model accuracy, real-time data augmentation was conducted by randomly rotating and flipping images during the training process.
Because of the random nature of the dataset partition, it is desirable to run model training and testing multiple times to obtain a reliable estimate of model performance Raschka (2018). In this study, DL models were trained with 5 replications, using different random seeds that were shared by all the models, and the mean accuracies on test data were computed for performance evaluation. All models were trained for 50 epochs (found sufficient for modeling the weed data) with the SGD (stochastic gradient descent) optimizer and a momentum of 0.9. The learning rate was initially set to 0.001 and dynamically decreased by a factor of 0.1 every 7 epochs to stabilize model training. The DL framework PyTorch (version 1.9) with Torchvision (version 0.10.0) Paszke et al. (2019) was used for model training, in which the Python multiprocessing package (https://docs.python.org/3/library/multiprocessing.html) was employed with 32 CPU cores to speed up training. The experiments were performed on an Ubuntu 20.04 server with an AMD 3990X 64-core CPU and a GeForce RTX 3090Ti GPU (24 GB GDDR6X memory). Readers are referred to the open-source code (https://github.com/Derekabc/CottonWeeds) for the detailed implementation of transfer learning for the 27 DL models.
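The training configuration described above (SGD with momentum 0.9, initial learning rate 0.001, decayed by a factor of 0.1 every 7 epochs, over 50 epochs) maps directly onto PyTorch's `StepLR` scheduler. The sketch below uses a tiny stand-in model and omits the per-batch forward/backward pass, which are placeholders for the actual fine-tuned networks and data loaders.

```python
import torch.nn as nn
from torch import optim

# Stand-in for one of the fine-tuned DL models
model = nn.Linear(10, 15)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate by a factor of 0.1 every 7 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(50):
    # ... iterate over training batches: forward pass, loss, backward, step ...
    optimizer.step()   # placeholder optimizer update (no gradients in this sketch)
    scheduler.step()   # advance the learning-rate schedule once per epoch
```

After 50 epochs the learning rate has been decayed seven times (at epochs 7, 14, ..., 49), illustrating how the schedule stabilizes the late stage of training.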
The performance of the DL models in weed identification was evaluated in terms of number of model parameters, training and inference times, confusion matrix and F1-score.
In this study, pretrained DL models were fine-tuned by updating all the model parameters for the weed classification task. The number of model parameters thus refers to all the weights (and biases) in the network that are updated/learnt during the training process through back-propagation. The parameter count is a direct measure of model complexity: networks with more parameters potentially require more deployment memory and incur longer training and inference times (see Subsection 2.3.2).
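Counting trainable parameters in this sense is straightforward in PyTorch; the following sketch uses a toy network (not one of the benchmarked models) purely to illustrate the counting.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Count all weights and biases updated via back-propagation."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy two-layer network: (4*8 + 8) + (8*2 + 2) = 58 parameters
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
n_params = count_trainable_params(net)
```

Applied to the benchmarked models, this is how figures such as 0.74 M (SqueezeNet) or 139.6 M (VGG19) in Table 2 are obtained.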
The training time is the time required to train a DL model with prescribed model configurations and computing resources. The training time depends on factors such as model architecture, number of model parameters, data size, hyper-parameters, DL framework as well as computing hardware. The training time is an important consideration where development time and resources are constrained.
A trained DL model is to be used to make predictions (also known as inference). The inference time (i.e., latency) is one crucial aspect in deploying DL models for real-time applications (e.g., in-field weed identification). It is the time that a trained DL model takes to make a prediction given an image input. In this paper, for reliable estimation, the inference time was measured as the average time needed to predict 30 weed images randomly selected from the testing dataset.
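A simple way to estimate this average latency is sketched below, with a toy model and random tensors standing in for real test images; the batch-of-one forward pass mirrors the per-image prediction setting described above.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def mean_inference_ms(model: nn.Module, images, n: int = 30) -> float:
    """Average per-image prediction time in milliseconds over n images."""
    model.eval()  # disable dropout/batch-norm updates for inference
    start = time.perf_counter()
    for img in images[:n]:
        _ = model(img.unsqueeze(0))  # batch of one, as in deployment
    return (time.perf_counter() - start) / n * 1000.0

# Toy stand-ins: a linear classifier and 30 random "images"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 15))
images = [torch.randn(3, 32, 32) for _ in range(30)]
latency_ms = mean_inference_ms(model, images)
```

In practice, a few warm-up passes before timing (and, on GPU, synchronization via `torch.cuda.synchronize()`) give more stable estimates.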
The confusion matrix on testing images, which provides the accuracy for each class while revealing detailed misclassifications, was presented to show the classification performance for individual weed classes. The overall classification accuracy was measured by the F1 score. For the multi-class weed classification, the micro-averaged F1 score Yang and Liu (1999) was calculated as the classification accuracy. In micro-averaging, the per-class classifications are aggregated across classes by counting the total true positives, false positives and false negatives to compute the micro-averaged precision P and recall R, which are then harmonically combined into Micro-F1 as follows:

P = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i},  R = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i},    (2)

Micro\text{-}F1 = \frac{2PR}{P + R},    (3)

where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for the i-th class, respectively.
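The computation can be sketched as follows. Note that for single-label multi-class classification, every false positive for one class is a false negative for another, so micro-averaged precision, recall and F1 all coincide with overall accuracy.

```python
import numpy as np

def micro_f1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int = 15) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all classes, then combine."""
    tp = fp = fn = 0
    for c in range(num_classes):
        tp += np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp += np.sum((y_pred == c) & (y_true != c))  # false positives for class c
        fn += np.sum((y_pred != c) & (y_true == c))  # false negatives for class c
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with true labels [0, 1, 2, 2] and predictions [0, 1, 1, 2], three of four samples are correct, giving a micro-F1 of 0.75.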
The CE loss function defined in Eqn. 1 does not account for the class imbalance present in the cotton weed dataset (Fig. 1). Training with the CE loss may result in large classification errors for minority weed classes (e.g., Spurred Anoda). To mitigate this issue, a weighted cross entropy (WCE) loss function Phan and Yamamoto (2020) was introduced, which performs re-weighting according to the image number of each weed class:

L_{WCE} = -\sum_{i=1}^{C} w_i y_i \log(p_i),    (4)

where w = (w_1, ..., w_C) is a weighting vector that assigns an individualized penalty to each class, preferentially placing larger weights on minority classes. The conventional CE loss, which does not consider class imbalance, corresponds to a weighting vector of all ones (the CE column in Table 1). In this study, an inverse-proportion weighting strategy Phan and Yamamoto (2020) was adopted to assign the weight of the i-th weed class as follows:

w_i = \frac{N_{max}}{N_i},    (5)

where N_i denotes the number of images of the i-th weed class and N_{max} represents the maximum number of images among classes, i.e., 1115 (for Morningglory). As a result, weed classes with fewer images are assigned relatively greater weights. For example, the weight for Spurred Anoda is 1115/61 ≈ 18.3. This strategy, which enforces larger penalties on misclassifications of minority classes, can potentially enhance the classification accuracy for these classes. In preliminary testing, it was observed that directly inverting the image ratios may lead to sub-optimal performance; hence the final adopted weights were fine-tuned and empirically set as shown in the WCE column in Table 1. Other choices of weighting strategies are discussed in Section 4.1.
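This inverse-proportion weighting combines naturally with PyTorch's built-in per-class weighting of the cross entropy loss, as sketched below. Only the counts 1115 (Morningglory) and 61 (Spurred Anoda) come from the dataset; the other counts and the reduced 5-class setup are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative per-class image counts; 1115 and 61 are from the dataset,
# the middle three values are made up for this sketch.
class_counts = torch.tensor([1115.0, 763.0, 450.0, 254.0, 61.0])

# Inverse-proportion weighting: w_i = N_max / N_i, so minority classes
# receive proportionally larger penalties for misclassification.
weights = class_counts.max() / class_counts

# PyTorch applies the per-class weights inside the cross entropy loss.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 5)            # a batch of 4 predictions over 5 classes
targets = torch.tensor([0, 2, 4, 4])  # ground-truth class indices
loss = criterion(logits, targets)
```

The largest class keeps a weight of 1.0 while the smallest here gets 1115/61 ≈ 18.3, matching the Spurred Anoda example above.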
Table 1: Number of images and weighting coefficients (CE and WCE) of each weed class.
To assist in the interpretation of DL classifications, an inter-class analysis was conducted by quantifying the similarity among the images of different weed classes. Euclidean distance is the most commonly used measure of inter-class similarity, but it is sensitive to varying image conditions (e.g., variable ambient light, variations in camera view angle and position), which are typical of the cotton weed images collected under natural field conditions. Cosine similarity (CS), which measures the cosine of the angle between two vectors and is thus insensitive to magnitude, offers an effective alternative to the Euclidean distance Xia et al. (2015).
In this study, we employed a DL-based CS measure for quantifying inter-class similarities. A DL model was used as a feature extractor to obtain hierarchically learnt high-level representations of weed images, based on which the CS was calculated between two weed classes. Specifically, the VGG11 model Simonyan and Zisserman (2014) was trained on the cotton weed dataset through DTL, and the output of the first FC layer was taken as the feature vector, which is of length 4096 (i.e., the output size of that FC layer in the VGG11 network Simonyan and Zisserman (2014)). While other DL models could also be used for feature extraction, VGG11 was chosen because it achieved the best trade-off between classification performance and training time (see Table 2), particularly with high accuracies for minority weed classes (see Table 3). Given the extracted features of any two weed classes, the CS was calculated as follows Xia et al. (2015):

CS(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|},    (6)

where u and v are two feature vectors extracted by the VGG11 model; we randomly sampled pairs of images from the two weed classes of interest and computed the averaged similarity value. CS values range from -1 to 1, where 1 means the two classes are perfectly similar and -1 means they are perfectly dissimilar.
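The cosine similarity itself is directly available in PyTorch; a small sketch follows, where generic low-dimensional vectors stand in for the 4096-dimensional VGG11 features (the feature extraction step is omitted).

```python
import torch
import torch.nn.functional as F

def cosine_sim(u: torch.Tensor, v: torch.Tensor) -> float:
    """CS(u, v) = u·v / (||u|| ||v||), insensitive to vector magnitude."""
    return F.cosine_similarity(u.unsqueeze(0), v.unsqueeze(0)).item()

u = torch.tensor([1.0, 2.0, 3.0])
parallel = cosine_sim(u, 2 * u)   # a scaled copy points the same way: CS ≈ 1
opposite = cosine_sim(u, -u)      # an inverted copy points the other way: CS ≈ -1
```

Averaging this quantity over randomly sampled feature pairs from two weed classes yields one entry of the inter-class similarity matrix discussed in Section 3.3.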
Table 2 summarizes the number of model parameters, training and inference times, and F1 scores of the 27 selected DL models. There is a large variation in the number of parameters across the models, ranging from 0.74 M (million) for SqueezeNet to 139.6 M for VGG19. Depending on model architecture, the training time ranged from 37 min to 144 min. Models with more parameters tended to require longer training times (see Fig. 3, left) because of their increased complexity. The inference times also exhibited an increasing trend with the number of parameters (see Fig. 3, right), although the differences among models were notably smaller, ranging from 188 ms to 256 ms. Inference is mainly a forward propagation process that requires no parameter estimation and is thus far more efficient than training. In particular, models including AlexNet, SqueezeNet, GoogleNet, ResNet18, ResNet50, the VGGs and the MobileNets required inference times of less than 200 ms, translating into a prediction speed of over 5 frames per second. Overall, the DL models show good potential for deployment in real-time weed identification.
Figure 4 shows the training accuracy and loss curves of the DL models. All the DL models exhibited promising training performance in terms of fast convergence, low training losses and high training accuracies (F1 scores). The training accuracies tended to plateau after 10 epochs at a level exceeding 90%. Regarding test accuracies (Table 2), ResNet101 achieved the best overall F1 score of 99.1%, followed by ResNet50 with an F1 score of 99.0%. Twelve other models also gave F1 scores exceeding 98%, including the three Densenet variants, DPN68 and MobilenetV3-large, among others, and the top-10 models achieved an average F1 score of 98.71%. On the other hand, three models, AlexNet, SqueezeNet and MnasNet, yielded the lowest F1 scores, close to or below 96%, although they were all highly efficient to train and fast at inference.
| Index | Model | # Parameters | Training Time | Training F1 (%) | Testing F1 (%) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| 1 | AlexNet Krizhevsky et al. (2012) | 57.1M | 37m 2s | 95.4 ± 0.2 | 95.3 ± 0.4 | 188.5 ± 2.2 |
| 2 | SqueezeNet Iandola et al. (2016) | 0.743M | 46m 7s | 96.4 ± 0.2 | 95.8 ± 0.5 | 187.3 ± 1.6 |
| 3 | GoogleNet Szegedy et al. (2015) | 5.6M | 52m 28s | 94.7 ± 0 | 97.8 ± 0.3 | 196.3 ± 0.5 |
| 4 | Xception Chollet (2017) | 20.8M | 89m 9s | 94.7 ± 0.2 | 97.5 ± 0.4 | 211.3 ± 1.8 |
| 5 | DPN68 Chen et al. (2017) | 11.8M | 79m 10s | 98.5 ± 0.1 | **98.8 ± 0.2** | 219.0 ± 6.9 |
| 6 | MnasNet Tan et al. (2019) | 3.1M | 51m 3s | 91.8 ± 0.2 | 96.0 ± 0.4 | 191.2 ± 2.0 |
| 7 | ResNet18 He et al. (2016) | 11.2M | 47m 30s | 96.9 ± 0.1 | 98.1 ± 0.2 | 188.9 ± 0.9 |
| 8 | ResNet50 He et al. (2016) | 23.5M | 73m 17s | 98.0 ± 0.1 | **99.0 ± 0.1** | 195.6 ± 0.4 |
| 9 | ResNet101 He et al. (2016) | 42.5M | 92m 55s | 98.3 ± 0.1 | **99.1 ± 0.2** | 207.0 ± 0.6 |
| 10 | VGG11 Simonyan and Zisserman (2014) | 128.8M | 67m 46s | 97.3 ± 0.1 | 98.1 ± 0.2 | 194.1 ± 1.3 |
| 11 | VGG16 Simonyan and Zisserman (2014) | 134.3M | 99m 25s | 97.7 ± 0.2 | 98.1 ± 0.3 | 195.7 ± 1.4 |
| 12 | VGG19 Simonyan and Zisserman (2014) | 139.6M | 112m 41s | 97.9 ± 0.1 | 97.9 ± 0.2 | 197.2 ± 1.4 |
| 13 | Densenet121 Huang et al. (2017) | 7.0M | 75m 40s | 97.9 ± 0.1 | **98.7 ± 0.1** | 212.4 ± 0.8 |
| 14 | Densenet161 Huang et al. (2017) | 26.5M | 133m 42s | 98.4 ± 0.1 | **98.9 ± 0.4** | 227.4 ± 0.5 |
| 15 | Densenet169 Huang et al. (2017) | 12.5M | 85m 1s | 98.1 ± 0.1 | **98.9 ± 0.3** | 226.8 ± 0.5 |
| 16 | Inception v3 Szegedy et al. (2016) | 24.4M | 73m 50s | 96.7 ± 0 | **98.4 ± 0.3** | 206.3 ± 0.4 |
| 17 | Inception v4 Szegedy et al. (2017) | 41.2M | 120m 42s | 95.9 ± 0.1 | 98.1 ± 0.4 | 235.4 ± 0.8 |
| 18 | Inception-ResNet v2 Szegedy et al. (2017) | 54.3M | 124m 36s | 94.0 ± 0.2 | 97.6 ± 0.4 | 255.9 ± 1.4 |
| 19 | MobilenetV2 Sandler et al. (2018) | 2.2M | 53m 27s | 97.4 ± 0.1 | **98.4 ± 0.1** | 191.1 ± 0.8 |
| 20 | MobilenetV3-small Howard et al. (2019) | 1.5M | 41m 27s | 94.5 ± 0.2 | 96.6 ± 0.1 | 193.1 ± 1.2 |
| 21 | MobilenetV3-large Howard et al. (2019) | 4.2M | 49m 4s | 96.6 ± 0.1 | **98.6 ± 0.2** | 193.8 ± 2.0 |
| 22 | EfficientNet-b0 Tan and Le (2019) | 4.0M | 63m 39s | 93.0 ± 0.1 | 97.4 ± 0.4 | 202.0 ± 5.6 |
| 23 | EfficientNet-b1 Tan and Le (2019) | 6.5M | 77m 8s | 93.8 ± 0.2 | 97.3 ± 0.4 | 203.8 ± 0.8 |
| 24 | EfficientNet-b2 Tan and Le (2019) | 7.7M | 78m 56s | 94.1 ± 0.2 | 97.8 ± 0.1 | 204.5 ± 1.7 |
| 25 | EfficientNet-b3 Tan and Le (2019) | 10.7M | 92m 51s | 95.0 ± 0.2 | **98.2 ± 0.1** | 211.3 ± 1.2 |
| 26 | EfficientNet-b4 Tan and Le (2019) | 17.6M | 113m 12s | 94.1 ± 0.2 | 97.8 ± 0.2 | 216.3 ± 1.3 |
| 27 | EfficientNet-b5 Tan and Le (2019) | 28.4M | 144m 44s | 94.1 ± 0.3 | 97.4 ± 0.1 | 224.1 ± 1.5 |
Table 2: Performance of the 27 state-of-the-art deep learning models on the cotton weed dataset. The variations in training time were negligible, so standard deviations are omitted. The top-10 testing F1 scores are highlighted in bold. "M" stands for million.
The confusion matrices on test data for all the DL models are available on our GitHub page (https://github.com/Derekabc/CottonWeeds/tree/master/Confusing_Matrices). Due to space constraints, we only show the confusion matrices for one top F1-score model, ResNet-101, and one low-performing model, MnasNet (MnasNet1.0), in Fig. 5 and Fig. 6, respectively. ResNet-101 yielded perfect classifications for 12 of the 15 weed classes, although it misclassified 3%, 4% and 20% of the images of Goosegrass, Palmer Amaranth and Spurred Anoda, respectively. Spurred Anoda was the most challenging weed class to distinguish from the others. ResNet-101 achieved a classification accuracy of 80% for this species, misclassifying 20.0% of the weed as Prickly Sida. The MnasNet model achieved an accuracy of only 20% on Spurred Anoda, as shown in Fig. 6, misclassifying 60% and 20% of the weed as Prickly Sida and Palmer Amaranth, respectively. The poor accuracies are presumably due to this class having the fewest images in the dataset (61, as shown in Fig. 1). Similarly low accuracies were also observed by MnasNet for other minority weed classes such as Crabgrass and Ragweed, at 88% and 80%, respectively. To improve the performance of DL models on the minority weed classes, the proposed WCE loss function is discussed next.
Fig. 7 shows the confusion matrix achieved by the MnasNet model trained with the WCE loss function (Eqn. 4). The WCE-based model achieved remarkable improvements over the counterpart trained with the regular CE loss function (Eqn. 1, Fig. 6) in classifying minority weed classes. The classification accuracy for Spurred Anoda jumped from 20% to 80%, and the accuracies for Crabgrass and Ragweed improved from 88% to 94% and from 80% to 95%, respectively.
Table 3 compares the classification accuracies of five selected models (including the aforementioned MnasNet) trained with the CE loss and the WCE loss. The confusion matrices for all the DL models trained with the CE and WCE losses are available online (https://github.com/Derekabc/CottonWeeds/tree/master/Confusing_Matrices). Emphasis here is placed on classifying two majority weeds, Morningglory and Waterhemp, and two minority weeds, Crabgrass and Spurred Anoda. Considerable improvements were achieved by all these models for the minority weed classes. Notably, in addition to MnasNet, EfficientNet-b2 and Xception achieved improvements of 40% and 20%, respectively, in identifying Spurred Anoda, compared to their counterparts trained with the CE loss.
Despite the improvements on minority classes, models such as Xception, MnasNet and EfficientNet-b2 showed a slightly decreased accuracy for Morningglory. This is because the WCE strategy, which places stronger weights on minority classes, can negatively affect the classification of majority classes. Nonetheless, the significant improvements on the minority classes outweighed the decreased accuracy on the majority classes, leading to overall improvements in F1-score by these models. In particular, DenseNet161 achieved an overall F1-score of 99.24%, outperforming ResNet101, which achieved the best accuracy (99.1%) among all the CE-based models. VGG11 saw a slight decrease in the overall F1-score, but it is encouraging that the model achieved 100% classification accuracy for Spurred Anoda, which has only 61 images in the dataset.
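The exact forms of Eqn. 4 and Eqn. 5 are not reproduced here; as a minimal illustrative sketch, assuming class weights inversely proportional to class frequency (a common weighting choice, not necessarily the paper's exact scheme), the WCE idea can be expressed as follows. The function names are ours, introduced for illustration:

```python
import math

def class_weights_inverse_frequency(counts):
    """Per-class weights inversely proportional to class frequency,
    normalized so that the weights average to 1 across classes."""
    total = sum(counts)
    raw = [total / c for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def weighted_cross_entropy(probs, label, weights):
    """Weighted CE for a single sample: -w_y * log(p_y),
    where y is the ground-truth class index."""
    return -weights[label] * math.log(probs[label])

# Example: a majority class (1000 images) vs. a minority class
# with 61 images (the size of the Spurred Anoda class).
weights = class_weights_inverse_frequency([1000, 61])
# The minority class receives a much larger weight, so its
# misclassifications contribute more to the training loss.
```

With this normalization, a balanced dataset recovers the ordinary CE loss, since all weights equal 1.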
Table 3: Classification accuracies for Morningglory, Waterhemp, Crabgrass and Spurred Anoda, and overall F1-scores, by five selected models trained with the CE and WCE losses.
Fig. 8 shows an inter-class cosine similarity (CS) matrix based on the features extracted by the VGG11 model (with the CE loss) (Section 2.5). The CS matrix helps explain misclassifications among weed classes by DL models. Weed classes that share more common features tend to have higher CS values. For example, Goosegrass and Crabgrass, which are both grassy weeds in the Poaceae family, had a CS of 0.69, greater than their similarities with all other weeds. This high CS is in agreement with the classification errors observed between the two classes (see Fig. 5, Fig. 6 and Fig. 7). For the ResNet101 model, for instance, all of the 3% misclassifications of Goosegrass were due to misclassifying the weed as Crabgrass (Fig. 5). Spurred Anoda and Prickly Sida are another pair of similar weeds; both have toothed leaf margins and are members of the Mallow family. The globally highest CS of 0.73 was observed between these two classes. Their strong similarity, along with the fact that Prickly Sida has more than twice as many images as Spurred Anoda, explains the significant proportion of Spurred Anoda misclassified as Prickly Sida (see Fig. 5, Fig. 6 and Fig. 7).
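The inter-class CS computation can be sketched as below. Representing each class by a single mean feature vector is an illustrative simplification, and the function names are ours, not the paper's exact pipeline:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors:
    dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def inter_class_cs_matrix(class_features):
    """Pairwise CS matrix, given one representative (e.g., mean)
    feature vector per weed class."""
    n = len(class_features)
    return [[cosine_similarity(class_features[i], class_features[j])
             for j in range(n)] for i in range(n)]
```

Class pairs with CS values close to 1 (such as Spurred Anoda and Prickly Sida at 0.73) are the ones a classifier is most likely to confuse.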
In this section, we discuss two potential approaches to improving the performance on minority weed classes, which will be investigated in future studies.
The WCE loss function (Eqn. 4) improves the CE loss by adaptively assigning weights to individual weed classes to account for class imbalance. In addition to the weighting in Eqn. 5, there are other weighting Phan and Yamamoto (2020) or cost-sensitive methods Khan et al. (2017) to cope with imbalanced data.
The class-balanced (CB) loss introduced in Cui et al. (2019) re-balances the classification loss based on the effective number of samples for each class.
The CB loss is defined as:

$$\mathrm{CB}(\mathbf{p}, y) = \frac{1-\beta}{1-\beta^{n_y}}\,\mathcal{L}(\mathbf{p}, y),$$

where $n_y$ is the number of training samples of the ground-truth class $y$, $\mathcal{L}(\mathbf{p}, y)$ is the classification loss (e.g., the CE loss), and $\beta \in [0, 1)$ is a hyperparameter. When $\beta = 0$, the CB loss is equivalent to the CE loss, and $\beta \to 1$ corresponds to re-weighting by inverse class frequency, which enables us to smoothly adjust the class-balanced term between no re-weighting and re-weighting by inverse class frequency Cui et al. (2019).
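The class-balanced weighting term of Cui et al. (2019), $w = (1-\beta)/(1-\beta^{n_y})$, is a one-liner; the function name `cb_weight` is ours, introduced for illustration:

```python
def cb_weight(n_samples, beta):
    """Class-balanced weight from Cui et al. (2019):
    w = (1 - beta) / (1 - beta**n_y), where n_y is the number of
    training samples of the ground-truth class."""
    return (1.0 - beta) / (1.0 - beta ** n_samples)

# beta = 0 gives w = 1 for every class (plain CE loss);
# beta close to 1 approaches inverse-class-frequency weighting,
# so a 61-image class (e.g., Spurred Anoda) is weighted far more
# heavily than a 1000-image class.
```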
Focal loss (FL), which was originally proposed in Lin et al. (2017), offers another promising alternative for imbalanced learning, and is calculated as follows:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$$

where $(1 - p_t)^{\gamma}$ is called the modulating factor, which allows down-weighting the contributions of easy examples or majority classes during training while rapidly focusing on challenging classes that have few images. Here, $\gamma \geq 0$ is the focusing parameter, and the FL loss reduces to the conventional CE loss when $\gamma = 0$.
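The focal loss on the predicted probability $p_t$ of the true class, $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$, can be sketched directly; the function name is ours:

```python
import math

def focal_loss(p_t, gamma):
    """Focal loss (Lin et al., 2017) for a single sample, where p_t
    is the predicted probability of the true class and gamma >= 0
    is the focusing parameter. gamma = 0 recovers the CE loss."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A confidently classified ("easy") sample, p_t = 0.9, is strongly
# down-weighted at gamma = 2, while a hard sample, p_t = 0.1,
# dominates the loss.
```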
In future research, we will experiment with and evaluate these weighted loss functions for improved classification of minority weed classes.
In this paper, although the DL models overall achieved remarkable weed identification accuracy, some models that have proven powerful in visual categorization tasks, such as EfficientNet Tan and Le (2019), did not perform as well as expected, especially on minority weed classes (https://github.com/Derekabc/CottonWeeds/tree/master/Confusing_Matrices/csv). This is likely because these models rely heavily on large-scale data to be sufficiently optimized while avoiding overfitting Shorten and Khoshgoftaar (2019). One intuitive solution is to collect more images for the under-performing weed classes. Unfortunately, images of many weed species may be difficult to collect due to unpredictable weather conditions and limited access to a diversity of field sites.
Data augmentation (DA) offers an effective means to address the insufficiency of physically collected image data. In DA, a suite of techniques Shorten and Khoshgoftaar (2019), such as geometric transformations, color space augmentations and generative adversarial networks (GANs), can be used to enhance the size and quality of training images, such that deep learning models can be trained on the artificially expanded dataset and thereby achieve better performance. In particular, GANs have received increasing attention, representing a novel framework of generative modeling through adversarial training Creswell et al. (2018). Recently, GAN methods have been investigated for weed classification tasks Espejo-Garcia et al. (2021, 2021) to address the lack of large-scale domain datasets. In this paper, we also applied basic techniques such as random rotation and pixel normalization, but did not fully explore the potential of DA techniques in the classification of weed images, which will be a subject of future research.
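Geometric transformations are the simplest of these DA techniques. A toy sketch on an image stored as nested lists of pixels is shown below; real pipelines would use a library such as torchvision, and the function names here are illustrative, not the paper's implementation:

```python
def hflip(image):
    """Horizontal flip of an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image):
    """Expand one image into a small set of geometric variants,
    multiplying the effective training data for a minority class."""
    return [image, hflip(image), rotate90(image), hflip(rotate90(image))]
```

Applied to the 61 Spurred Anoda images, even this trivial set of transforms would quadruple the number of training samples, though it cannot substitute for genuinely diverse field imagery.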
In this study, a first, comprehensive benchmark of a suite of 27 DL models was established through transfer learning for multi-class identification of common weeds specific to cotton production systems. A dedicated dataset consisting of 5187 images of 15 weed classes was created by collecting images under natural light conditions and at varied weed growth stages from a diversity of cotton fields in southern U.S. states over two growing seasons. DTL proved to be effective for achieving high weed classification accuracies (F1-score >95%) within reasonably short training times (<2.5 hours). ResNet101 was the best-performing model with the highest F1-score of 99.1%, and the top-10 models resulted in an average F1-score of 98.71%. A WCE loss function was proposed for model training, in which individualized weights were assigned to weed classes to account for class imbalance, achieving substantial improvements in classifying minority weed classes. A DL-based cosine similarity metric was found to be useful for assisting the interpretation of misclassifications. Both the source codes for model development and evaluation and the weed dataset were made publicly accessible for the research community. This study provides a good foundation for informed choice of DL models for weed classification tasks, and can be beneficial for precision agriculture research at large.
Dong Chen: Formal analysis, Software, Writing - original draft; Yuzhen Lu: Conceptualization, Investigation, Data curation, Supervision, Writing - review & editing; Zhaojiang Li: Resources, Writing - review & editing; Sierra Young: Data curation, Writing - review & editing.
This work was supported in part by Cotton Incorporated award 21-005. The authors thank Dr. Camp Hand and Dr. Edward Barnes for contributing weed images and Dr. Charlie Cahoon for the assistance in weed identification. We also thank Mr. Shea Hoffman and Mr. Vinay Kumar for helping label the weed images.
F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri (2017) Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3573–3587.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.