On pseudo-absence generation and machine learning for locust breeding ground prediction in Africa

by Ibrahim Salihu Yusuf, et al.

Desert locust outbreaks threaten the food security of a large part of Africa and have affected the livelihoods of millions of people over the years. Machine learning (ML) has been demonstrated as an effective approach to locust distribution modelling which could assist in early warning. ML requires a significant amount of labelled data to train. Most publicly available labelled data on locusts are presence-only data, where only the sightings of locusts being present at a location are recorded. Therefore, prior work using ML has resorted to pseudo-absence generation methods to circumvent this issue. The most commonly used approach is to randomly sample points in a region of interest while ensuring that these sampled pseudo-absence points are at least a specific distance away from true presence points. In this paper, we compare this random sampling approach to more advanced pseudo-absence generation methods, such as environmental profiling and optimal background extent limitation, specifically for predicting desert locust breeding grounds in Africa. Interestingly, we find that for the algorithms we tested, namely logistic regression (LR), gradient boosting, random forests and maximum entropy, all popular in prior work, the logistic model performed significantly better than the more sophisticated ensemble methods, both in terms of prediction accuracy and F1 score. Although background extent limitation combined with random sampling boosted performance for ensemble methods, for LR this was not the case, and instead, a significant improvement was obtained when using environmental profiling. In light of this, we conclude that a simpler ML approach such as logistic regression combined with more advanced pseudo-absence generation, specifically environmental profiling, can be a sensible and effective approach to predicting locust breeding grounds across Africa.




1 Introduction

Climatic conditions leading to cyclones and monsoons, causing heavy rains, prompted the recent 2019-2021 upsurge in desert locusts (fao2021upsurge). These upsurges pose a significant threat to food security in affected areas, especially in the Northern parts of the African continent. Furthermore, the occurrence and severity of such upsurges could potentially be exacerbated by global climate change (vallebona2008large; fao2016weather; zhang2019locust; salih2020climate).

Machine learning (ML) has been shown to be a valuable tool for species distribution modeling (beery2021species) and has great potential for being applied for early warning of locust outbreaks and upsurges. In particular, many recent papers have looked at using ML specifically for modelling locusts (gomez2018machine; gomez2019desert; kimathi2020prediction; gomez2020modelling; gomez2021prediction). Remote sensing has become an invaluable component in building ML models for this task (latchininsky2010locust; cressman2013role; latchininsky2013locusts; klein2021application). However, even when useful remote sensing data is readily available and is capable of providing quality features for such models, ML still relies heavily on large amounts of labelled data for training. Currently, the Food and Agriculture Organization (FAO) of the United Nations provides many years' worth of labelled data on locusts hosted through their Locust Hub (https://locust-hub-hqfao.hub.arcgis.com/). This is an extremely useful resource and contains recorded sightings of locusts in various phases and stages of their lifecycle. That said, survey teams in general only record the presence of locusts and rarely their absence. This is typical of many ecological surveys and data of this kind are referred to as presence-only data. To overcome the lack of negative labels when training ML models, past work has made use of pseudo-absence generation (gomez2018machine; gomez2020modelling; gomez2021prediction). A commonly used approach is to randomly sample points in a region of interest while ensuring that pseudo-absence points are sampled a minimum distance away from any true presence points (barbet2012selecting). More advanced pseudo-absence methods also exist, such as environmental profiling and background extent limitation (iturbide2015framework).

Application context. The FAO of the United Nations operates a sophisticated monitoring and early warning system for locust outbreaks (fao2020early). The system relies on a range of technologies serving field survey operators, control centres and researchers (cressman2008use). From field studies, researchers have ascertained that female locusts typically lay their eggs in wet, warm soil and wingless nymph locusts, referred to as hoppers, require specific vegetation nearby to sustain them before their wings develop (symmons2001desert; fao2021standard). This connection between certain environmental variables and locust behaviour makes it possible to attempt to model locust distribution through remote sensing combined with survey data (piou2013coupling; escorihuela2018smos; piou2019soil; chen2020geographic; ellenburg2021detecting). Our work seeks to leverage this connection to be able to predict locust breeding grounds for the purpose of potentially improving early warning and assisting disaster prevention and response teams on the ground.

In this paper, we compare different pseudo-absence generation methods used in conjunction with ML, including random sampling, environmental profiling and background extent limitation, specifically when modelling the desert locust species in Africa. We focus on ML algorithms commonly used in prior work: logistic regression (LR), gradient boosting (XGBoost), random forests (RF) and maximum entropy (MaxEnt), a presence-background modelling approach. (Presence-background approaches make use of only labelled presence data and points sampled across the entire study area of interest, referred to as background data. These background points are randomly selected and completely independent of presence data. Even though MaxEnt does not rely on pseudo-absence generation, we include it in this work as a strong baseline for comparison.) We train each algorithm on specific environmental variables combined with presence labels from the FAO’s Locust Hub. For LR, XGBoost and RF we generate pseudo-absence labels using each of the above-mentioned pseudo-absence generation methods. Our results show that LR performs significantly better in terms of prediction accuracy and F1 score compared to the ensemble methods XGBoost and RF, as well as MaxEnt. For LR training, a significant improvement was obtained when using environmental profiling, whereas for the ensemble methods, random sampling combined with background extent limitation significantly improved performance. We therefore conclude that LR combined with environmental profiling is a suitable and effective approach to predicting breeding grounds in Africa. (Our study region includes the following countries: Mauritania, Mali, Egypt, Morocco, Algeria, Sudan, South Sudan, Niger, Eritrea, Senegal, Libya, Western Sahara, Uganda, Tunisia, Cape Verde, Chad, Ethiopia, Kenya, Somalia, and Djibouti.)

2 Methodology

We are interested in testing the effectiveness of different pseudo-absence generation methods combined with ML for modelling locusts. Here we discuss our methodology concerning our data, including our choice of environmental variables and preprocessing. We explain the different pseudo-absence generation methods and ML algorithms in more detail and formally state our research hypothesis.

Data. We focus on modelling desert locusts over the entire affected region of the African continent. We use the FAO’s Locust Hub observation data in this study, as it contains the geo-locations of areas where locusts were observed, the type of locust observed and some environmental conditions. It has a temporal range of 1985 to 2021. The observations were enriched with environmental data from NASA (https://disc.gsfc.nasa.gov/datasets/GLDAS_NOAH025_3H_2.1/summary) and ISRIC SoilGrids (https://data.isric.org/geonetwork/srv/eng/catalog.search#/home); names and descriptions of each variable are provided in the SM. Our variables include soil characteristics such as moisture, profile (type) and temperature as well as air pressure, humidity and surface level temperature. The environmental data from NASA have a temporal range of 2000 to 2021, while the soil profile information is non-temporal. The three datasets were combined by selecting the region of temporal overlap from 2000 to 2021. As in gomez2018machine, we use hopper presence/absence as a proxy for locust breeding grounds. Given that the maximum time period from the start of egg laying to the end of the hopper phase is approximately 95 days and that hoppers are not able to fly, they remain close to the breeding ground and consequently act as a good proxy. Therefore, we use a 95-day history for all temporal variables and predict hopper presence 7 days into the future as a proxy for predicting the presence of potential future breeding grounds. The resulting dataset was split into train and test sets based on time; the period from 2000-2014 was used for training, while the period from 2015-2021 was used for testing. Pseudo-absence generation was also performed separately on each split, ensuring a balanced distribution of hopper presence/absence. After generating pseudo-absences, the train set was further split into a smaller train set and a validation set for parameter tuning in the ratio of 80:20.
To ensure that different algorithm and pseudo-absence pairs could be tested fairly and on the same test set, we constructed pseudo-absences for the test set by randomly sampling a balanced mixture of test pseudo-absence points generated by the different methods operating on the test set presence points. For more details, we refer the reader to the SM.
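As a rough illustration of the splitting scheme described above, the following Python sketch separates records by year and then carves a validation set out of the training portion. The record layout (a dict with a `year` key), the function names and the seeded random shuffle are assumptions made for illustration; the text does not state how the 80:20 split was drawn.

```python
import random


def temporal_split(records, train_end_year=2014):
    """Split records by year: up to train_end_year for training, later years for testing."""
    train = [r for r in records if r["year"] <= train_end_year]
    test = [r for r in records if r["year"] > train_end_year]
    return train, test


def train_val_split(train, val_fraction=0.2, seed=0):
    """Hold out a fraction of the training records for validation.

    Assumption: the hold-out is drawn by a seeded random shuffle.
    """
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical records: ten observations per year from 2000 to 2021.
records = [{"year": y, "id": i} for y in range(2000, 2022) for i in range(10)]
train, test = temporal_split(records)
fit, val = train_val_split(train)
```

Splitting on time before generating pseudo-absences keeps test-period information out of the training data, which is why the two steps are kept as separate functions here.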

Figure 1: Illustrations of pseudo-absence and background data generation. Blue square is a presence point and red squares represent generated pseudo-absence points. (a) Random sampling (RS). Orange circle is the exclusion buffer. (b) Random sampling with environment profiling (RSEP). Dark blue line shows the boundary between environmentally suitable and unsuitable regions. (c) Random sampling with background extent limitation (RS+). Light blue lines show excluded background regions. (d) Random sampling with environment profiling and background extent limitation (RSEP+). (e) Background sampling (BS) for presence-background modelling. It is important to note that for BS the red points do not necessarily correspond to absence but could represent presence or absence.

Pseudo-absence generation. We consider four different approaches for pseudo-absence generation detailed in iturbide2015framework and depicted in Figure 1 (a)-(d):

  1. Random sampling (RS): Pseudo-absences are sampled at random across all points in the study area that are not within some minimum selected distance of any presence point. The minimum distance between a presence and any absence point is referred to as the exclusion buffer and is set to 30 km for all methods.

  2. Random sampling with environmental profiling (RSEP): The RSEP method is aimed at defining the environmental range of the background from which pseudo-absences are sampled. Environmentally unsuitable areas for locusts are defined using a presence-only profiling algorithm (one-class SVM) trained on soil moisture, profile and surface temperature variables. Once these unsuitable regions have been established, pseudo-absence points are randomly sampled from within them.

  3. RS with background extent limitation (RS+): Pseudo-absences are sampled at random within a limited background extent not including the full study region. This optimum limited background is determined by a multi-step process as outlined in iturbide2015framework. (We perform all pseudo-absence generation using the mopa R package (iturbide2015framework).)

  4. RSEP with background extent limitation (RSEP+): This method is similar to RS+ above, but instead of using unconditioned random sampling within the limited background extent, samples are drawn only from environmentally unsuitable regions identified through profiling.

As a strong baseline, we compare these pseudo-absence generation methods to presence-background modelling (see Figure 1 (e)), where background samples are generated over the entire study region of interest without any constraint on their location with respect to presence points.
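The simplest of the methods above, random sampling with an exclusion buffer, can be sketched in a few lines of Python. This is a minimal illustration and not the mopa implementation used in the paper: the bounding-box study area, the rejection-sampling loop, the haversine distance and all function names are assumptions. Sampling uniformly in latitude and longitude is also a simplification, since it is not uniform in area.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sample_pseudo_absences(presences, bbox, n, buffer_km=30.0, seed=0, max_tries=100_000):
    """Rejection-sample up to n points that lie at least buffer_km from every presence.

    bbox is (lat_min, lat_max, lon_min, lon_max); a hypothetical rectangular study area.
    """
    rng = random.Random(seed)
    lat_min, lat_max, lon_min, lon_max = bbox
    samples = []
    for _ in range(max_tries):
        if len(samples) == n:
            break
        lat = rng.uniform(lat_min, lat_max)
        lon = rng.uniform(lon_min, lon_max)
        # Keep the candidate only if it is outside every exclusion buffer.
        if all(haversine_km(lat, lon, p_lat, p_lon) >= buffer_km for p_lat, p_lon in presences):
            samples.append((lat, lon))
    return samples


# Hypothetical presence points and study area.
presences = [(15.0, 0.0), (15.5, 0.5)]
absences = sample_pseudo_absences(presences, bbox=(14.0, 17.0, -1.0, 2.0), n=50)
```

The other three methods change only the region from which candidates are drawn (environmentally unsuitable areas, a limited background extent, or both), so under this sketch they would amount to adding a further acceptance condition inside the loop.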

ML algorithms. We compare the following algorithms: logistic regression (LR), gradient boosting (XGBoost) (freund1997decision; friedman2001greedy), random forests (RF) (breiman2001random) and maximum entropy (MaxEnt) (phillips2006maximum). We provide hyperparameter details in the SM.


Our null hypothesis, $H_0$, is that of no difference between the mean performances of the different pseudo-absence generation methods across all the algorithms tested, including the mean performance of MaxEnt using only presence-background data. Specifically, let $\mathcal{G} = \{\text{RS}, \text{RSEP}, \text{RS+}, \text{RSEP+}\}$ and $\mathcal{A} = \{\text{LR}, \text{XGBoost}, \text{RF}\}$, and let $\mu_{g}^{a}$ represent the mean performance for pseudo-absence generation method $g \in \mathcal{G}$, used when training algorithm $a \in \mathcal{A}$. Then the null hypothesis is given by

$$H_0:\ \mu_{g}^{a} = \mu_{\text{BS}}^{\text{MaxEnt}} \quad \text{for all } g \in \mathcal{G},\ a \in \mathcal{A}.$$

We are required to reject $H_0$ to have evidence that there are indeed differences between the pseudo-absence generation methods used for training the different algorithms as well as the presence-background MaxEnt approach. Our significance level, i.e. the point at which the probability of equal mean performances under the null hypothesis is low enough to reject the null hypothesis, is denoted $\alpha$. If $H_0$ is rejected, we can continue with more specific pairwise tests, with appropriate $p$-value adjustments to control for the family-wise error (demvsar2006statistical).

3 Experiments

In Figure 2, we show the pseudo-absence points generated by each method for November 2003 across a subset of the countries considered in our study, namely Niger, Mauritania, Mali, Algeria, Western Sahara and Morocco.

Figure 2: Pseudo-absence and background data generation: an example on a subset of African countries Niger, Mauritania, Mali, Algeria, Western Sahara and Morocco for November 2003. (a) Random sampling (RS). (b) Random sampling with environment profiling (RSEP). White regions indicate environmentally suitable regions as identified through environmental profiling, i.e. these are the regions where pseudo-absence points should not be sampled. (c) Random sampling with background extent limitation (RS+). (d) Random sampling with environment profiling and background extent limitation (RSEP+). (e) Background sampling (BS).

Results. The mean accuracy and F1 score over 100 runs for each algorithm and generation method are shown in Table 1. In general, the performances are respectable. However, interestingly, the linear model LR is seen to outperform the more sophisticated ensemble methods, XGBoost and RF, across all generation methods, as well as MaxEnt, both in terms of absolute mean accuracy and F1 score. Next, we ascertain the statistical significance of these results and test for specific differences between generation methods and ML algorithms.

Table 1: Performance comparison between generation methods and ML algorithms (columns: LR, XGBoost, RF, MaxEnt; rows: accuracy and F1 score for each generation method). Bold values indicate top performance across generation methods for a specific algorithm.

Statistical analysis. We test $H_0$ using the Friedman aligned ranks test (friedman1937use) and find that we can easily reject the null hypothesis of no difference, with a $p$-value below the smallest value that R reports. (We conducted our statistical testing in R using the stats library, which treats a fixed small value as the minimum $p$-value; values smaller than this are indicated with the ‘<’ symbol in printed output.) Having rejected $H_0$, we perform additional pairwise tests using the $p$-value adjustment procedure for multiple testing from holm1979simple. For XGBoost and RF, significant differences are detected between the generation methods (all $p$-values below this minimum), providing evidence that background extent limitation can be of benefit to these ensemble methods. However, for LR, environmental profiling was instead found to significantly improve performance irrespective of whether background extent limitation was used or not ($p$-values provided in the SM).
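To make the pairwise adjustment step concrete, here is a minimal Python sketch of the Holm step-down adjustment. The paper ran its tests in R, so this is an illustrative re-expression and not the authors' code; the function name and the example $p$-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is then declared significant when its adjusted value falls below the chosen significance level, which controls the family-wise error rate across all pairwise tests.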

Figure 3: SHAP analysis for logistic regression. An interpretation of the logistic regression model on the different pseudo-absence generation methods using Shapley additive explanations (NIPS2017_7062). (a) Random sampling (RS). (b) Random sampling with environment profiling (RSEP).

Interpretation. In Figure 3, we provide a SHAP analysis (NIPS2017_7062) of variable importance for the LR model when using RS or RSEP. We find that only a few of the 174 features in total are important for prediction. For both methods, features such as Albedo_inst_bucket_14 and clay_0.5cm_mean have roughly the same predictive effect as the bottom 165 variables combined. This is in contrast to the ensemble methods, where feature importance is far more spread out (not shown). Therefore, we suspect that LR performs well due to its ability to better avoid overfitting to noisy features.

4 Discussion

Our study provides some evidence for the suitability of using random sampling with environmental profiling for pseudo-absence generation and logistic regression for predicting desert locust breeding grounds in Africa. We note that a detailed analysis of pseudo-absence generation appeared in barbet2012selecting, but was performed on synthetic data, whereas we are primarily interested in locust modeling and considered more modern methods such as background extent limitation. Finally, although we find linear models to perform well in our study, recent papers have started to use deep learning for locust modeling (samil2020predicting; tabar2021plan), moving towards more sophisticated ML model architectures and leveraging promising new sources of data such as those generated by the eLocust3m app (plant2020elocust) to good effect. We hope to compare with and potentially contribute to these approaches in future work.

This research was supported in part through computational resources provided by Google.


Supplementary Material

Dataset details: environmental variables, splits, sizes and features. Temporal variables were extracted from the NASA Global Land Data Assimilation System Version 2.1 (GLDAS-2.1) Noah Land Surface Model, with a temporal resolution of 3 hours and a spatial resolution of 0.25 x 0.25 degrees (https://disc.gsfc.nasa.gov/datasets/GLDAS_NOAH025_3H_2.1/summary).

Name                   Description
AvgSurfT_inst          Instantaneous average surface skin temperature (K)
Albedo_inst            Instantaneous albedo (%)
SoilMoi0_10cm_inst     Instantaneous soil moisture 0-10cm (kg m-2)
SoilMoi10_40cm_inst    Instantaneous soil moisture 10-40cm (kg m-2)
SoilTMP0_10cm_inst     Instantaneous soil temperature 0-10cm (K)
SoilTMP10_40cm_inst    Instantaneous soil temperature 10-40cm (K)
Tveg_tavg              3-hour averaged transpiration (W m-2)
Wind_f_inst            Instantaneous wind speed (m s-1)
Rainf_f_tavg           3-hour averaged total precipitation rate (kg m-2 s-1)
Tair_f_inst            Instantaneous air temperature (K)
Qair_f_inst            Instantaneous specific humidity (kg kg-1)
Psurf_f_inst           Instantaneous surface pressure (Pa)

Table 2: Environmental variables and their descriptions

Soil profile variables were extracted from the International Soil Reference and Information Centre (ISRIC) SoilGrids250m 2.0 data product (https://data.isric.org/geonetwork/srv/eng/catalog.search#/home).

Name                Description
sand_0.5cm_mean     Average sand content between 0-5cm (g/kg)
sand_5.15cm_mean    Average sand content between 5-15cm (g/kg)
clay_0.5cm_mean     Average clay content between 0-5cm (g/kg)
clay_5.15cm_mean    Average clay content between 5-15cm (g/kg)
silt_0.5cm_mean     Average silt content between 0-5cm (g/kg)
silt_5.15cm_mean    Average silt content between 5-15cm (g/kg)

Table 3: Environmental variables and their descriptions

Split    Feature type(s)    Number of features    Presence    Pseudo-Absence    Total
Train    Numeric            174                   17007       9251              26258
Val      Numeric            174                   4206        2359              6565
Test     Numeric            174                   5842        1238              7080

Table 4: Dataset splits, sizes and number of features

We used a total of 12 temporal and 6 non-temporal variables. For each temporal variable, we retrieved a 95-day history and removed the last 7 days (including the observation day), to ensure that we predict 7 days ahead. Furthermore, we took each 89-day window and bucketised over time by computing the mean value for every 6-day window (following gomez2018machine). We dropped the last window, which had fewer than 6 days. After doing this we arrived at 14 windows for each variable, so that in total we had 168 (14*12) temporal features. Together with the 6 non-temporal features, the total number of features was 174.
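The bucketing arithmetic above can be checked with a short Python sketch. It assumes the trimmed 89-day series for one variable is already available as a list of daily values; the function name and the synthetic input are hypothetical.

```python
def bucketise(series, window=6):
    """Average a daily series over consecutive full windows, dropping any incomplete tail."""
    n_full = len(series) // window
    return [
        sum(series[i * window:(i + 1) * window]) / window
        for i in range(n_full)
    ]


# Synthetic 89-day history for a single temporal variable.
daily_values = [float(day) for day in range(89)]
buckets = bucketise(daily_values)

n_temporal_variables = 12
n_non_temporal_features = 6
n_features = len(buckets) * n_temporal_variables + n_non_temporal_features
```

With 89 days and 6-day windows there are 14 full windows and 5 leftover days, so `n_features` comes to 14 * 12 + 6 = 174, matching the feature count reported in Table 4.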

ML model hyperparameters. We performed manual hyperparameter tuning on the validation set, resulting in the following values: a maximum tree depth of 4 for XGBoost and 15 for RF. The number of random variables used for node splitting in RF was set as a function of $p$, where $p$ is the total number of variables. For our MaxEnt model, we used a linear feature class and a regularization factor of 1. These hyperparameters were found using a grid search on the training data over the available feature classes (linear, quadratic, product, threshold and hinge) and the regularization factor in the range [0.1, 1], with a step size of 0.1. We used the MaxNet library from phillips2017opening for modeling.

Pairwise comparisons for LR using different pseudo-absence generation methods. Table 5 shows the different $p$-values computed for pairwise comparisons between pseudo-absence generation methods for LR. For all other pairwise comparisons, the $p$-values obtained were smaller than the minimum value reported by R. For LR, there is a significant difference between using sampling with or without environmental profiling, but not between methods that use background extent limitation or not.

        RS    RS+        RSEP    RSEP+
RS      -     1.00000
RS+           -
RSEP                     -       1.00000

Table 5: Pairwise comparison for LR for different pseudo-absence generation methods. The * symbol indicates a significant difference.