Importance of spatial predictor variable selection in machine learning applications – Moving from data reproduction to spatial prediction

08/21/2019
by Hanna Meyer, et al.

Machine learning algorithms find frequent application in spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for reliable spatial predictions. We introduce two case studies that use remote sensing to predict land cover and the leaf area index for the "Marburg Open Forest", an open research and education site of Marburg University, Germany. We use the machine learning algorithm Random Forests to train models using non-spatial and spatial cross-validation strategies to understand how spatial variable selection affects the predictions. Our findings confirm that spatial cross-validation is essential in preventing overoptimistic model performance. We further show that highly autocorrelated predictors (such as geolocation variables, e.g. latitude, longitude) can lead to considerable overfitting and result in models that can reproduce the training data but fail in making spatial predictions. The problem becomes apparent in the visual assessment of the spatial predictions that show clear artefacts that can be traced back to a misinterpretation of the spatially autocorrelated predictors by the algorithm. Spatial variable selection could automatically detect and remove such variables that lead to overfitting, resulting in reliable spatial prediction patterns and improved statistical spatial model performance. We conclude that in addition to spatial validation, a spatial variable selection must be considered in spatial predictions of ecological data to produce reliable predictions.
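
The contrast between non-spatial and spatial cross-validation, and the effect of a geolocation predictor, can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are invented, a 1-nearest-neighbour classifier stands in for Random Forests to keep the code dependency-free, and the greedy forward selection is a simplified stand-in for spatial variable selection. All names (`PIXELS`, `interleaved_folds`, `spatial_folds`, `cv_accuracy`, `forward_selection`) are hypothetical.

```python
from math import dist

# Hypothetical toy data: four training polygons, three pixels each.
# Each pixel is (polygon_id, lon, ndvi, land_cover); all values are invented.
PIXELS = [
    ("A", 0.0, 0.80, "forest"), ("A", 0.1, 0.82, "forest"), ("A", 0.2, 0.84, "forest"),
    ("B", 10.0, 0.30, "field"), ("B", 10.1, 0.32, "field"), ("B", 10.2, 0.34, "field"),
    ("C", 20.0, 0.79, "forest"), ("C", 20.1, 0.81, "forest"), ("C", 20.2, 0.83, "forest"),
    ("D", 30.0, 0.29, "field"), ("D", 30.1, 0.31, "field"), ("D", 30.2, 0.33, "field"),
]
COLUMN = {"lon": 1, "ndvi": 2}  # tuple position of each predictor


def interleaved_folds(n, k=3):
    """Non-spatial folds: pixel i goes to fold i % k, so every polygon is split."""
    return [[i for i in range(n) if i % k == f] for f in range(k)]


def spatial_folds(pixels):
    """Leave-one-polygon-out folds: each fold holds all pixels of one polygon."""
    ids = sorted({p[0] for p in pixels})
    return [[i for i, p in enumerate(pixels) if p[0] == pid] for pid in ids]


def cv_accuracy(pixels, features, folds):
    """Cross-validated accuracy of a 1-nearest-neighbour classifier."""
    def vec(p):
        return [p[COLUMN[f]] for f in features]

    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [p for i, p in enumerate(pixels) if i not in held_out]
        for i in test_idx:
            nearest = min(train, key=lambda q: dist(vec(pixels[i]), vec(q)))
            correct += nearest[3] == pixels[i][3]
    return correct / len(pixels)


def forward_selection(pixels, candidates, folds):
    """Greedy forward selection: add a predictor only if CV accuracy improves."""
    selected, best = [], -1.0
    while True:
        trials = [(cv_accuracy(pixels, selected + [f], folds), f)
                  for f in candidates if f not in selected]
        if not trials:
            break
        score, feature = max(trials)
        if score <= best:
            break
        selected.append(feature)
        best = score
    return selected, best


if __name__ == "__main__":
    n = len(PIXELS)
    print("lon, non-spatial CV:", cv_accuracy(PIXELS, ["lon"], interleaved_folds(n)))
    print("lon, spatial CV:    ", cv_accuracy(PIXELS, ["lon"], spatial_folds(PIXELS)))
    print("ndvi, spatial CV:   ", cv_accuracy(PIXELS, ["ndvi"], spatial_folds(PIXELS)))
    print("selected:", forward_selection(PIXELS, ["lon", "ndvi"], spatial_folds(PIXELS)))
```

In this toy setting, longitude alone reaches an accuracy of 1.0 under the non-spatial folds, because every held-out pixel has a neighbour from its own polygon in the training set, but drops to 0.0 when whole polygons are held out. The hypothetical `ndvi` predictor transfers between polygons, so selection driven by the spatial folds keeps `ndvi` and rejects `lon`.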


