Climatic conditions leading to cyclones and monsoons, causing heavy rains, prompted the recent 2019-2021 upsurge in desert locusts (fao2021upsurge). These upsurges pose a significant threat to food security in affected areas, especially in the Northern parts of the African continent. Furthermore, the occurrence and severity of such upsurges could potentially be exacerbated by global climate change (vallebona2008large; fao2016weather; zhang2019locust; salih2020climate).
Machine learning (ML) has been shown to be a valuable tool for species distribution modeling (beery2021species) and has great potential for early warning of locust outbreaks and upsurges. In particular, many recent papers have used ML specifically for modelling locusts (gomez2018machine; gomez2019desert; kimathi2020prediction; gomez2020modelling; gomez2021prediction). Remote sensing has become an invaluable component of building ML models for this task (latchininsky2010locust; cressman2013role; latchininsky2013locusts; klein2021application). However, even when useful remote sensing data are readily available and capable of providing quality features for such models, ML still relies heavily on large amounts of labelled data for training. Currently, the Food and Agriculture Organization (FAO) of the United Nations provides many years' worth of labelled data on locusts through its Locust Hub (https://locust-hub-hqfao.hub.arcgis.com/). This is an extremely useful resource and contains recorded sightings of locusts in various phases and stages of their lifecycle. That said, survey teams generally record only the presence of locusts and rarely their absence. This is typical of many ecological surveys, and data of this kind are referred to as presence-only data. To overcome the lack of negative labels when training ML models, past work has made use of pseudo-absence generation (gomez2018machine; gomez2020modelling; gomez2021prediction). A commonly used approach is to randomly sample points in a region of interest while ensuring that pseudo-absence points are sampled a minimum distance away from any true presence points (barbet2012selecting). More advanced pseudo-absence methods also exist, such as environmental profiling and background extent limitation (iturbide2015framework).
Application context. The FAO of the United Nations operates a sophisticated monitoring and early warning system for locust outbreaks (fao2020early). The system relies on a range of technologies serving field survey operators, control centres and researchers (cressman2008use). From field studies, researchers have ascertained that female locusts typically lay their eggs in wet, warm soil and wingless nymph locusts, referred to as hoppers, require specific vegetation nearby to sustain them before their wings develop (symmons2001desert; fao2021standard). This connection between certain environmental variables and locust behaviour makes it possible to attempt to model locust distribution through remote sensing combined with survey data (piou2013coupling; escorihuela2018smos; piou2019soil; chen2020geographic; ellenburg2021detecting). Our work seeks to leverage this connection to be able to predict locust breeding grounds for the purpose of potentially improving early warning and assisting disaster prevention and response teams on the ground.
In this paper, we compare different pseudo-absence generation methods used in conjunction with ML, including methods such as random sampling, environmental profiling and background extent limitation, specifically when modelling the desert locust species in Africa. We focus on ML algorithms commonly used in prior work: logistic regression (LR), gradient boosting (XGBoost), random forests (RF) and maximum entropy (MaxEnt), a presence-background modelling approach. Presence-background approaches make use of only labelled presence data and points sampled across the entire study area of interest, referred to as background data. These background points are randomly selected and completely independent of presence data. Even though MaxEnt does not rely on pseudo-absence generation, we include it in this work as a strong baseline for comparison. We train each algorithm on specific environmental variables combined with presence labels from the FAO's Locust Hub. For LR, XGBoost and RF we generate pseudo-absence labels using each of the above-mentioned pseudo-absence generation methods. Our results show LR performs significantly better in terms of prediction accuracy and F1 score compared to the ensemble methods XGBoost and RF, as well as MaxEnt. For LR training, a significant improvement was obtained when using environmental profiling, whereas for the ensemble methods, random sampling combined with background extent limitation significantly improved performance. We therefore conclude that LR combined with environmental profiling is a suitable and effective approach to predicting breeding grounds in Africa. Our study region includes the following countries: Mauritania, Mali, Egypt, Morocco, Algeria, Sudan, South Sudan, Niger, Eritrea, Senegal, Libya, Western Sahara, Uganda, Tunisia, Cape Verde, Chad, Ethiopia, Kenya, Somalia, and Djibouti.
We are interested in testing the effectiveness of different pseudo-absence generation methods combined with ML for modelling locusts. Here we discuss our methodology concerning our data, including our choice of environmental variables and preprocessing. We explain the different pseudo-absence generation methods and ML algorithms in more detail and formally state our research hypothesis.
Data. We focus on modelling desert locusts over the entire affected region of the African continent. We use the FAO's Locust Hub observation data in this study, as it contains the geo-locations of areas where locusts were observed, the type of locust observed and some environmental conditions. It has a temporal range of 1985 to 2021. The observations were enriched with environmental data from NASA (https://disc.gsfc.nasa.gov/datasets/GLDAS_NOAH025_3H_2.1/summary) and ISRIC SoilGrids (https://data.isric.org/geonetwork/srv/eng/catalog.search#/home); names and descriptions of each variable are provided in SM. Our variables include soil characteristics such as moisture, profile (type) and temperature, as well as air pressure, humidity and surface level temperature. The environmental data from NASA have a temporal range of 2000 to 2021, while the soil profile information is non-temporal. The three datasets were combined by selecting the region of temporal overlap from 2000 to 2021. As in gomez2018machine, we use hopper presence/absence as a proxy for locust breeding grounds. Given that the maximum time period between the start of egg laying and the end of the hopper phase is approximately 95 days and that hoppers are not able to fly, they remain close to the breeding ground and consequently act as a good proxy. Therefore, we use a 95-day history for all temporal variables and predict hopper presence 7 days into the future as a proxy for predicting the presence of potential future breeding grounds. The resulting dataset was split into train and test sets based on time: the period from 2000 to 2014 was used for training, while the period from 2015 to 2021 was used for testing. Pseudo-absence generation was also performed separately on each split, ensuring a balanced distribution of hopper presence/absence. After generating pseudo-absences, the train set was further split into a smaller train set and a validation set for parameter tuning in the ratio of 80:20.
To ensure that different algorithm and pseudo-absence pairs could be tested fairly and on the same test set, we constructed pseudo-absences for the test set by randomly sampling a balanced mixture of test pseudo-absence points generated by the different methods operating on the test set presence points. For more details, we refer the reader to the SM.
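As a rough illustration of the splitting scheme described above, the following sketch separates records by year and then holds out 20% of the training records for validation. The record layout (a dictionary with a `year` key), the function names and the use of a random shuffle for the 80:20 split are assumptions made for illustration; the source does not state how the validation set was drawn.

```python
import random


def split_by_year(records, last_train_year=2014):
    """Time-based split: records up to `last_train_year` train, the rest test."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records if r["year"] > last_train_year]
    return train, test


def train_val_split(records, val_fraction=0.2, seed=0):
    """Hold out a fraction of the records for validation.

    Assumption: the hold-out is drawn by a seeded random shuffle.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical records: one observation per year from 2000 to 2021.
records = [{"year": y, "hopper_present": 1} for y in range(2000, 2022)]
train_full, test = split_by_year(records)
train, val = train_val_split(train_full)
print(len(train), len(val), len(test))
```

Splitting on time rather than at random keeps every test observation strictly later than the training data, which mirrors how a forecasting model would be used in practice.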
Pseudo-absence generation. We consider four different approaches for pseudo-absence generation detailed in iturbide2015framework and depicted in Figure 1 (a)-(d):
Random sampling (RS): Pseudo-absences are sampled at random across all points in the study area that are not within some minimum selected distance to any presence point. The minimum distance between a presence and any absence point is referred to as the exclusion buffer and is set to 30km for all methods.
Random sampling with environmental profiling (RSEP): The RSEP method is aimed at defining the environmental range of the background from which pseudo-absences are sampled. Environmentally unsuitable areas for locusts are defined using a presence-only profiling algorithm (one-class SVM) trained on soil moisture, profile and surface temperature variables. Once these unsuitable regions have been established, pseudo-absence points are randomly sampled from within them.
RS with background extent limitation (RS+): Pseudo-absences are sampled at random within a limited background extent not including the full study region. This optimum limited background is determined by a multi-step process as outlined in iturbide2015framework. We perform all pseudo-absence generation using the mopa R package (iturbide2015framework).
RSEP with background extent limitation (RSEP+): This method is similar to the above RS+, but instead of using unconditioned random sampling within the limited background extent, samples are drawn only from within environmentally unsuitable regions identified through profiling.
As a strong baseline, we compare these pseudo-absence generation methods to presence-background modelling (see Figure 1 (e)), where background samples are generated over the entire study region of interest without any constraint on their location with respect to presence points.
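A minimal sketch of the random sampling (RS) idea with an exclusion buffer follows. It is not the mopa implementation used in the study: the function names, the bounding box, the presence coordinates and the use of great-circle (haversine) distance are illustrative assumptions, and sampling uniformly in latitude and longitude is a simplification that is not area-uniform on the sphere.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sample_pseudo_absences(presences, bbox, n, buffer_km=30.0, seed=0, max_tries=100_000):
    """Rejection-sample `n` points in `bbox` at least `buffer_km` from every presence.

    `bbox` is (lat_min, lat_max, lon_min, lon_max). Fewer than `n` points are
    returned if `max_tries` candidates are exhausted first.
    """
    rng = random.Random(seed)
    lat_min, lat_max, lon_min, lon_max = bbox
    absences = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        lat = rng.uniform(lat_min, lat_max)
        lon = rng.uniform(lon_min, lon_max)
        # Keep the candidate only if it lies outside every exclusion buffer.
        if all(haversine_km(lat, lon, p_lat, p_lon) >= buffer_km for p_lat, p_lon in presences):
            absences.append((lat, lon))
    return absences


# Hypothetical presence points and study-area bounding box.
presences = [(18.0, -12.0), (17.5, -11.0), (20.0, -10.0)]
bbox = (15.0, 22.0, -15.0, -8.0)
# One pseudo-absence per presence gives a balanced presence/absence set.
absences = sample_pseudo_absences(presences, bbox, n=len(presences))
```

Dropping the buffer check turns the same routine into unconstrained background sampling, which is the difference between RS and the presence-background baseline.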
ML algorithms. We compare the following algorithms: logistic regression (LR), gradient boosting (XGBoost) (freund1997decision; friedman2001greedy), random forests (RF) (breiman2001random) and maximum entropy (MaxEnt) (phillips2006maximum). We provide hyperparameter details in SM.
Our null hypothesis, $H_0$, is that of no difference between the mean performances of the different pseudo-absence generation methods across all the algorithms tested, including the mean performance of MaxEnt using only presence-background data. Specifically, let $\mathcal{G} = \{\text{RS}, \text{RSEP}, \text{RS+}, \text{RSEP+}\}$ and $\mathcal{A} = \{\text{LR}, \text{XGBoost}, \text{RF}\}$, and let $\mu_{g,a}$ represent the mean performance for pseudo-absence generation method $g \in \mathcal{G}$, used when training algorithm $a \in \mathcal{A}$. Then the null hypothesis is given by

$$H_0:\ \mu_{g,a} = \mu_{\text{MaxEnt}} \quad \text{for all } g \in \mathcal{G},\ a \in \mathcal{A}.$$

We are required to reject $H_0$ to have evidence that there are indeed differences between the pseudo-absence generation methods used for training the different algorithms as well as the presence-background MaxEnt approach. Our significance level, i.e. the point at which the probability of equal mean performances under the null hypothesis is low enough to reject the null hypothesis, is denoted $\alpha$. If $H_0$ is rejected, we can continue with more specific pairwise tests, with appropriate $p$-value adjustments to control for the family-wise error (demvsar2006statistical).
In Figure 2, we show the pseudo-absence points generated by each method for November 2003 across a subset of the countries considered in our study, namely Niger, Mauritania, Mali, Algeria, Western Sahara and Morocco.
Results. The mean accuracy and F1 score over 100 runs for each algorithm and generation method are shown in Table 1. In general, the performances are respectable. Interestingly, however, the linear model LR outperforms both the more sophisticated ensemble methods, XGBoost and RF, across all generation methods and MaxEnt, in terms of absolute mean accuracy as well as F1 score. Next, we ascertain the statistical significance of these results and test for specific differences between generation methods and ML algorithms.
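For reference, the two reported metrics can be computed from binary presence/absence labels as in the sketch below. The function name and the example labels are hypothetical and are not taken from the study's data.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = presence, 0 = absence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels for six test points.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

On a balanced presence/absence set such as the one used here, accuracy and F1 tend to move together; they diverge mainly when a model favours one class.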
Statistical analysis. We test $H_0$ using the Friedman aligned ranks test (friedman1937use) and find that we can easily reject the null hypothesis of no difference, with $p$-value $< 2.2 \times 10^{-16}$. (We conducted our statistical testing in R using the stats library, where $2.2 \times 10^{-16}$ is considered the minimum $p$-value. Values smaller than this are indicated with the '<' symbol in printed output.) Having rejected $H_0$, we perform additional pairwise tests using the $p$-value adjustment procedure for multiple testing from holm1979simple. For XGBoost and RF, significant differences are detected between the generation methods (all $p$-values $< 2.2 \times 10^{-16}$), providing evidence that background extent limitation can be of benefit to these ensemble methods. However, for LR, environmental profiling was instead found to significantly improve performance irrespective of whether background extent limitation was used or not ($p$-values provided in SM).
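The Holm step-down adjustment used for the pairwise tests can be sketched in a few lines. This is an illustrative re-implementation in Python rather than the R code used in the study, and the raw $p$-values in the example are made up.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    The i-th smallest p-value (i = 0, 1, ...) is multiplied by (m - i), capped
    at 1, and a running maximum enforces monotonicity of the adjusted values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        scaled = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(raw)
# A comparison is declared significant when its adjusted value is below alpha.
```

Holm's procedure controls the family-wise error rate like the Bonferroni correction but is uniformly at least as powerful, since only the smallest $p$-value is multiplied by the full number of comparisons.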
Interpretation. In Figure 3, we provide a SHAP analysis (NIPS2017_7062) of variable importance for the LR model when using RS or RSEP. We find that only a few of the total 174 features are important for prediction. For both methods, features such as Albedo_inst_bucket_14 and clay_0.5cm_mean roughly amount to the same predictive effect as the bottom 165 variables combined. This is in contrast to the ensemble methods, where feature importance is far more spread out (not shown). Therefore, we suspect that LR performs well due to its ability to better avoid overfitting to noisy features.
Our study provides some evidence for the suitability of using random sampling with environmental profiling for pseudo-absence generation and logistic regression for predicting desert locust breeding grounds in Africa. We note that a detailed analysis of pseudo-absence generation appeared in barbet2012selecting, but was performed on synthetic data, whereas we are primarily interested in locust modeling and considered more modern methods such as background extent limitation. Finally, although we find linear models to perform well in our study, recent papers have started to use deep learning for locust modeling (samil2020predicting; tabar2021plan), moving towards more sophisticated ML model architectures and leveraging promising new sources of data such as those generated by the eLocust3m app (plant2020elocust) to good effect. We hope to compare with and potentially contribute to these approaches in future work.
This research was supported in part through computational resources provided by Google.
Dataset details – environmental variables, splits, sizes and features. Temporal variables were extracted from NASA Global Land Data Assimilation System Version 2.1 (GLDAS-2.1) Noah Land Surface Model with a temporal resolution of 3 hours and spatial resolution of 0.25 x 0.25 degrees. https://disc.gsfc.nasa.gov/datasets/GLDAS_NOAH025_3H_2.1/summary
| Variable | Description |
|---|---|
| AvgSurfT_inst | Instantaneous average surface skin temperature (K) |
| Albedo_inst | Instantaneous albedo (%) |
| SoilMoi0_10cm_inst | Instantaneous soil moisture 0-10cm (kg m-2) |
| SoilMoi10_40cm_inst | Instantaneous soil moisture 10-40cm (kg m-2) |
| SoilTMP0_10cm_inst | Instantaneous soil temperature 0-10cm (K) |
| SoilTMP10_40cm_inst | Instantaneous soil temperature 10-40cm (K) |
| Tveg_tavg | 3-hour averaged transpiration (W m-2) |
| Wind_f_inst | Instantaneous wind speed (m s-1) |
| Rainf_f_tavg | 3-hour averaged total precipitation rate (kg m-2 s-1) |
| Tair_f_inst | Instantaneous air temperature (K) |
| Qair_f_inst | Instantaneous specific humidity (kg kg-1) |
| Psurf_f_inst | Instantaneous surface pressure (Pa) |
Soil profile variables were extracted from International Soil Reference and Information Centre (ISRIC) SoilGrids250m 2.0 data product. https://data.isric.org/geonetwork/srv/eng/catalog.search#/home
| Variable | Description |
|---|---|
| sand_0.5cm_mean | Average sand content between 0-5cm (g/kg) |
| sand_5.15cm_mean | Average sand content between 5-15cm (g/kg) |
| clay_0.5cm_mean | Average clay content between 0-5cm (g/kg) |
| clay_5.15cm_mean | Average clay content between 5-15cm (g/kg) |
| silt_0.5cm_mean | Average silt content between 0-5cm (g/kg) |
| silt_5.15cm_mean | Average silt content between 5-15cm (g/kg) |
| Split | Feature type(s) | Number of features | Presence | Pseudo-Absence | Total |
We used a total of 12 temporal and 6 non-temporal variables. For each temporal variable, we retrieved a 95-day history and removed the last 7 days (including the observation day), to ensure that we predict 7 days ahead. Furthermore, we took each 89-day window and bucketised it over time by computing the mean value for every 6-day window (following gomez2018machine). We dropped the last window, which had fewer than 6 days. After doing this, we arrived at 14 windows for each variable, so that in total we had 168 (14 × 12) temporal features. Together with the 6 non-temporal features, the total number of features was 174.
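The feature-count arithmetic above can be checked with a short sketch. It assumes one value per day for each temporal variable and reads the 89-day window as the 95-day history plus the observation day (96 values) minus the final 7; both points, and all names in the code, are assumptions made for illustration.

```python
from statistics import mean

HISTORY_DAYS = 96   # assumption: 95-day history plus the observation day
LEAD_DAYS = 7       # days removed at the end, including the observation day
WINDOW_DAYS = 6     # length of each averaging window


def bucketise(daily_values, lead_days=LEAD_DAYS, window_days=WINDOW_DAYS):
    """Drop the last `lead_days` values, then average consecutive windows.

    A trailing window shorter than `window_days` is discarded.
    """
    kept = daily_values[:len(daily_values) - lead_days]
    n_windows = len(kept) // window_days
    return [
        mean(kept[i * window_days:(i + 1) * window_days])
        for i in range(n_windows)
    ]


# Hypothetical daily series for a single temporal variable.
series = [float(day) for day in range(HISTORY_DAYS)]
buckets = bucketise(series)

n_temporal_variables = 12
n_static_variables = 6
n_features = len(buckets) * n_temporal_variables + n_static_variables
print(len(buckets), n_features)
```

With 89 retained days, 14 full 6-day windows cover 84 days and the remaining 5 days form the dropped partial window.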
ML model hyperparameters.
We performed manual hyperparameter tuning on the validation set, resulting in the following values: a maximum tree depth of 4 for XGBoost and 15 for RF. The number of random variables used for node splitting in RF was set as a function of the total number of variables. For our MaxEnt model, we used a linear feature class and a regularization factor of 1. These hyperparameters were found using a grid search on the training data over the available feature classes (linear, quadratic, product, threshold and hinge) and regularization factors in the range [0.1, 1], with a step size of 0.1. We used the MaxNet library from phillips2017opening for modeling.
Pairwise comparisons for LR using different pseudo-absence generation methods. Table 5 shows the different $p$-values computed for pairwise comparisons between ML algorithms and pseudo-absence generation for LR. For all other pairwise comparisons, the $p$-values obtained were smaller than the minimum value in R, i.e. $2.2 \times 10^{-16}$. For LR, there is a significant difference between using sampling with or without environmental profiling, but not between methods that use background extent limitation or not.