## 1 Introduction

Given spatial data samples with explanatory features and a targeted response variable at different locations and times, the spatial and spatiotemporal prediction problem aims to learn a model that can predict the response variable based on the explanatory features. Examples of the problem existed even before the popularity of computers. The spatial interpolation method Kriging, named after spatial statistician and mining engineer Krige [krige1951statistical], was developed in the 1960s to estimate minerals for mining activities. However, with the recent advancement of GPS and remote sensing technology, as well as the popularity of geographic information systems and mobile devices, large amounts of spatial and spatiotemporal data are being collected at an increasing speed, such as earth observation imagery, gauge observations in river streams, and geo-social media data. Utilizing this rich geospatial data plays a critical role in addressing many grand societal issues, but is also technically challenging due to the unique characteristics of spatial data. There is an urgent need for effective and efficient prediction methods that can unlock the value of such rich geospatial data assets.

Over the years, various spatial and spatiotemporal prediction methods have been developed by different research communities, including spatial statistics, spatial econometrics, data mining and machine learning, computer vision, remote sensing, geographic information science, and spatial databases. In particular, progress has been made with the rapid development of spatial data mining, a field that studies how to automatically identify non-trivial, previously unknown but potentially useful patterns from large geospatial datasets [shekhar2011identifying]. Yet methods developed by different research communities tend to use different vocabularies and approach problems from different angles. A systematic review that compares these methods is missing. To this end, this paper provides a systematic review of different spatial prediction methods. We categorize methods based on the unique challenges they address, discuss their underlying assumptions, and compare their advantages and disadvantages. The goal is to help interdisciplinary researchers choose appropriate techniques for problems in their application domains, and more importantly, to help data mining researchers understand the basic principles as well as identify open research opportunities in spatial prediction.

### 1.1 Societal Applications

Spatial prediction is of great importance in societal applications related to various agencies, such as the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), the Department of Defense, the Department of Transportation, the Department of Homeland Security, the United States Department of Agriculture (USDA), and the Environmental Protection Agency (EPA). Here we categorize application examples into four major areas including earth science, urban informatics, geosocial media analytics, and public health.

*Earth science*: Earth science is a major application area for spatial prediction [jiang2017spatial]. Remote sensors from satellites, airplanes, and unmanned aerial vehicles (UAVs) have collected petabytes of geo-referenced earth imagery. Particularly, the recent deployment of CubeSat fleets by commercial companies (e.g., Planet Labs Inc.) help collect high-resolution imagery that covers the entire earth surface almost every day. In addition, numerous ground sensors deployed on land or rivers collect real-time information on soil properties, river flow volume, and air quality. Spatial prediction on earth data plays an important role in mapping land use and land cover, understanding global deforestation [hansen2013high], monitoring surface water dynamics [pekel2016high], improving the situational awareness during disasters (e.g., mapping hurricane flood and earthquakes) [brivio2002integration, jiangkdd2018], predicting crop yield [moran1997opportunities], estimating the spatial distribution of species [austin2002spatial, elith2009species], and mapping soil properties [chang2001near, hengl2004generic].

*Urban informatics*: Another important application area is urban informatics. Relevant spatial data includes temporally detailed road networks with real-time travel costs on individual road segments, GPS trajectories of taxis and trucks, data from video cameras and high-resolution sensors on traffic volume and occupancy close to highway intersections, passenger transactions on public transit systems such as subways and buses, high-resolution street view imagery, geo-referenced crime and accident records, and cellphone location history collected from communication towers. Spatial prediction plays an important role in routing and navigation services (e.g., speed profile and travel volume prediction) [Meng2017], spatially detailed demand forecasting for sharing economy (e.g., ride-hailing, bike sharing) [demandPredAAAI18], law enforcement management (e.g., patrol route planning targeted at predicted crime or crash hotspots), monitoring environment pollution, as well as predictive maintenance of critical urban infrastructures.

*Geosocial media*: Geosocial media analytics is an important emerging application area. With the popularity of smart phones and mobile apps, large amounts of geo-referenced social media data are collected from billions of users. Examples include geo-tagged tweets and Facebook posts, geo-tagged photos and videos, online articles with named entities for locations, as well as check-in records. Geosocial media provides a new way to collect near real-time information about what is happening on the earth surface at a large scale and with low costs. Spatial prediction on geosocial media data has been applied to real-time spatial event detection (e.g., earthquake, flood, landslide) for disaster management [sakaki2010earthquake, zhang2017triovecevent], spatiotemporal event forecasting (e.g., political unrest) [DBLP:conf/kdd/ZhaoSYCLR15], and travel destination recommendations in tourism [majid2013context]. With the area growing rapidly, more applications are being developed in agriculture, environment monitoring, transportation, education, and finance.

*Public health*: Public health has long been an important application area for spatial prediction. Examples of spatial data related to public health include demographic information on district blocks, electronic health records with patient home addresses, aggregated disease count maps at the city, county, or state level, population mobility data, environmental data such as air quality and water quality, and other online data such as search engine queries related to diseases. Spatial prediction plays a critical role in automatic medical diagnosis from MRI imagery, monitoring infectious disease and mapping disease risk [best2005comparison], detecting disease outbreaks, analyzing environmental factors that cause diseases [rappaport2010environment], understanding drug epidemics [lee2016mind], as well as providing early alerts for individuals on environmental triggers of asthma.

It is important to note that the application areas listed above are not isolated but cross-cutting. For example, spatial prediction on geosocial media data can help map flood disasters in earth science applications, detect damage and failures of critical urban infrastructures, and monitor disease transmission in public health. Earth observation data has been used in land use modeling for urban planning, and in modeling environmental factors for public health. Such cross-cutting applications often represent new interdisciplinary research opportunities.

### 1.2 Input Spatial Data

*Spatial data representation*: A geographic information system (GIS) [worboys2004gis] represents spatial data in two ways: *object* and *field*. In the object representation, spatial data consists of identifiable geometric objects including points, lines, and polygons. For example, cities are often represented as points on a map, while rivers and states are represented as line-strings and polygons respectively. In the field representation, spatial data consists of a spatial framework that tessellates continuous space into regular or irregular cells, together with a function that maps each cell into a value. Examples include earth observation imagery and a county-level median house income map.

| Symbol | Domain | Description |
| --- | --- | --- |
| $n$ | $\mathbb{N}$ | The number of sample locations |
| $\mathbf{s}_i$ | $\mathbb{R}^2$ | Sample location coordinates |
| $\mathbf{x}(\mathbf{s}_i)$ | $\mathbb{R}^m$ | $m$-dimensional feature vector of a sample at location $\mathbf{s}_i$ |
| $y(\mathbf{s}_i)$ or $c(\mathbf{s}_i)$ | $\mathbb{R}$ or $\mathcal{C}$ | Continuous or categorical response of a sample at location $\mathbf{s}_i$ |
| $\hat{y}(\mathbf{s}_i)$ or $\hat{c}(\mathbf{s}_i)$ | $\mathbb{R}$ or $\mathcal{C}$ | Predicted response of a sample at location $\mathbf{s}_i$; $\mathcal{C}$ is the set of class categories |
| $\mathbf{X}$ | $\mathbb{R}^{n \times m}$ | Feature matrix of $n$ samples |
| $\mathbf{y}$ or $\mathbf{c}$ | $\mathbb{R}^n$ or $\mathcal{C}^n$ | Response vector of $n$ samples |
| $\hat{\mathbf{y}}$ or $\hat{\mathbf{c}}$ | $\mathbb{R}^n$ or $\mathcal{C}^n$ | Predicted response vector |
| $\mathbf{W}$ | $\mathbb{R}^{n \times n}$ | W-matrix |
| $w_{ij}$ | $\mathbb{R}$ | An element of the W-matrix |
| $C(\cdot)$ | Function | Covariogram function |
| $\rho$ | $\mathbb{R}$ | Weight of spatial effect |
| $\boldsymbol{\mu}$ | $\mathbb{R}^m$ | Mean feature vector |
| $\boldsymbol{\Sigma}$ | $\mathbb{R}^{m \times m}$ | Covariance matrix of features |
| $\boldsymbol{\beta}$ | $\mathbb{R}^m$ | Coefficient vector |
| $\epsilon$ | $\mathbb{R}$ | Noise (residual error) of a sample |
| $\boldsymbol{\epsilon}$ | $\mathbb{R}^n$ | Noise (residual errors) of $n$ samples |
| $\boldsymbol{\beta}(\mathbf{s}_i)$ | $\mathbb{R}^m$ | Coefficients for samples at location $\mathbf{s}_i$ |
| $\boldsymbol{\beta}(\mathbf{s}_1), \ldots, \boldsymbol{\beta}(\mathbf{s}_n)$ | $\mathbb{R}^{n \times m}$ | All coefficients at different locations |
| $\Theta$ | Set | A set of all model parameters |
| $k(\mathbf{s}_i, \mathbf{s}_j)$ | $\mathbb{R}$ | Spatial kernel weight between location $\mathbf{s}_i$ and location $\mathbf{s}_j$ |
| $\mathbf{x}(\mathbf{s}_i, t)$ | $\mathbb{R}^m$ | $m$-dimensional feature vector of a sample at location $\mathbf{s}_i$ and time $t$ |
| $y(\mathbf{s}_i, t)$ or $c(\mathbf{s}_i, t)$ | $\mathbb{R}$ or $\mathcal{C}$ | Continuous or categorical response of a sample at location $\mathbf{s}_i$ and time $t$ |
| $\mathbf{X}_t$ | $\mathbb{R}^{n \times m}$ | Feature matrix of samples at time $t$ |
| $\mathbf{y}_t$ or $\mathbf{c}_t$ | $\mathbb{R}^n$ or $\mathcal{C}^n$ | Responses (observations) at time $t$ |
| $\mathbf{z}_t$ | $\mathbb{R}^n$ | Hidden variable vector at time $t$ |
| $\boldsymbol{\eta}_t$ | $\mathbb{R}^n$ | Residual errors for hidden process variables at time $t$ |
| $\boldsymbol{\varepsilon}_t$ | $\mathbb{R}^n$ | Residual errors for observation variables at time $t$ |
| $C(\cdot, \cdot)$ | Function | Spatiotemporal covariogram |

*Spatial data sample*: In traditional prediction problems, input data is often viewed as a collection of sample records. Similarly, in spatial prediction, input spatial data can be viewed as a collection of spatial data samples, whereby each sample corresponds to a spatial object (e.g., point, line or polygon), or a raster cell.
A spatial data sample has multiple non-spatial attributes, one of which is the target response variable to be predicted while the others are explanatory features. Additionally, a spatial data sample also has location information and spatial attributes (e.g., distance to another point, length of a line, area of a polygon). This additional information distinguishes spatial data samples from traditional data samples in two important ways: first, the location information and corresponding spatial attributes can provide additional contextual features in the explanatory feature list; second and more importantly, implicit spatial relationships exist based on sample locations, making samples not independent and identically distributed (non-i.i.d.). For example, in ground sensor observations on soil properties, a sample corresponds to information from one geo-located sensor. Explanatory features can include soil texture, nutrient level, and moisture. The response variable can be whether a type of plant can grow at the location.

Formally, spatial data is a set of spatial data samples $\{(\mathbf{s}_i, \mathbf{x}(\mathbf{s}_i), y(\mathbf{s}_i))\}_{i=1}^{n}$, where $n$ is the total number of samples, $\mathbf{s}_i$ is a $2$ by $1$ spatial coordinate vector for the $i$th sample, $\mathbf{x}(\mathbf{s}_i)$ is an $m$ by $1$ explanatory feature vector ($m$ is the feature dimension), and $y(\mathbf{s}_i)$ is a scalar response (it is categorical for classification, and continuous for regression). The set of spatial samples can also be written in matrix format $(\mathbf{X}, \mathbf{y})$, where $\mathbf{X}$ is an $n$ by $m$ feature matrix, and $\mathbf{y}$ is an $n$ by $1$ response vector.
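As a minimal sketch with hypothetical values (the sensor coordinates, features, and labels below are invented for illustration), a set of spatial data samples can be stored as parallel arrays: an n-by-2 coordinate matrix, an n-by-m feature matrix, and a length-n response vector.

```python
import numpy as np

# Hypothetical soil-sensor data set: n = 4 samples, m = 3 features
# (e.g., soil texture, nutrient level, moisture).
S = np.array([[0.0, 0.0],           # n-by-2 matrix of sample coordinates
              [0.0, 1.0],
              [1.0, 0.0],
              [5.0, 5.0]])
X = np.array([[0.20, 0.70, 0.10],   # n-by-m explanatory feature matrix
              [0.30, 0.60, 0.20],
              [0.25, 0.65, 0.15],
              [0.90, 0.10, 0.80]])
y = np.array([1, 1, 1, 0])          # categorical response: plant grows (1) or not (0)

n, m = X.shape                       # number of samples and feature dimension
```

Keeping the coordinates `S` separate from the features `X` makes the distinction explicit: the model consumes `X` and `y`, while `S` encodes the implicit spatial relationships between samples.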

### 1.3 Formal Problem Definition

Given a set of spatial data samples with explanatory features $\mathbf{X}$ and responses $\mathbf{y}$, the spatial prediction problem aims to learn a model (or function) $f$ such that $\mathbf{y} = f(\mathbf{X})$. Once the model is learned, it can be used to predict the responses at other locations based on their explanatory features. The problem can be further categorized into *spatial classification* for a categorical response and *spatial regression* for a continuous response.

For example, in earth imagery classification for land cover mapping, input spatial data samples are training pixels whose spectral band values (e.g., red, green, blue, and near-infrared bands) are explanatory features, and whose land cover classes (e.g., forest, water) are the responses. The output is a classification model that can predict the land cover classes of other pixels based on their spectral band values.

Spatial prediction is unique from traditional prediction in data mining. In traditional prediction problems, samples are commonly assumed to be independent and identically distributed (i.i.d.). Thus, the same model can be used to predict every sample independently, i.e., $\hat{y}(\mathbf{s}_i) = f(\mathbf{x}(\mathbf{s}_i))$ for $1 \le i \le n$. However, the i.i.d. assumption is often violated in spatial data due to implicit spatial relationships between sample locations. According to the first law of geography, “everything is related to everything else, but near things are more related than distant things” [tobler1970computer]. Ignoring spatial relationships can result in incorrect model assumptions and poor prediction performance.

### 1.4 Challenges

Spatial prediction poses unique challenges as compared to traditional prediction due to the special characteristics of spatial data. Here we describe these special characteristics in an intuitive way. More mathematically rigorous discussions are in Section 2.

*Spatial autocorrelation (dependency)*: According to Tobler’s first law of geography [tobler1970computer], “everything is related to everything else, but near things are more related than distant things.” In real world spatial data, nearby locations tend to resemble each other, instead of being statistically independent. For example, the temperatures of two nearby cities are often very close. This phenomenon is also called the spatial autocorrelation effect. The existence of spatial autocorrelation is a challenge because it violates a common assumption of many traditional prediction models, i.e., that learning samples are independent and identically distributed (i.i.d.). Ignoring the spatial autocorrelation effect may produce prediction models that are inaccurate or inconsistent with data. For instance, when applying a linear regression model to spatial data, the residual errors are often correlated, instead of being i.i.d. as the model assumes. In earth imagery classification, running classification models that rely on the i.i.d. assumption (e.g., decision tree, random forest) often produces results with artifacts (e.g., salt-and-pepper noise) [jiang2015focal].

*Spatial heterogeneity*: Another unique characteristic of spatial data is that the sample distribution is often not identical across the entire study area, an effect often called spatial heterogeneity. Specifically, spatial heterogeneity can be reflected in two ways: spatial non-stationarity and spatial anisotropy. Spatial non-stationarity means that the sample distribution varies across different sub-regions. For example, the same spectral signature in earth image pixels may correspond to different land cover types from tropical to temperate regions. This is a challenge because a model learned from global samples may not perform well in each sub-region (also referred to as the “ecological fallacy” [ess2001culture]). Spatial anisotropy means that spatial dependency between sample locations is non-uniform along different directions. For example, climate data distribution is often influenced by geographical terrain (e.g., mountain ranges), showing unique patterns along different directions. Modeling anisotropic spatial dependency is a challenge because such dependency cannot be modeled simply as a function of distance (it should be a function of direction as well).

*Limited ground truth*: Real world spatial data often contains a large amount of information on explanatory features due to advanced data collection techniques (e.g., GPS, remote sensor). However, availability of ground truth data (e.g., land cover classes) is often very limited because collecting ground truth involves sending a field crew or hiring well-trained visual interpreters, which is both expensive and time consuming. While limited ground truth is a common challenge in many other non-spatial prediction problems, the cost of collecting ground truth for spatial data is unique in that it includes not only the labeling time cost but also the time cost for the field crew to travel on the ground between sample locations. In addition, the selection of sample locations should also consider the geographical representativeness of samples for rigorous evaluations [congalton1991review].

*Multiple scales and resolutions*: The last major challenge is that spatial data often exists at multiple spatial scales or resolutions. For example, resolutions of earth observation imagery pixels range from sub-meter (high-resolution aerial photos) to over hundreds of meters (MODIS satellite imagery). In precision agriculture, soil properties are recorded by ground sensors at isolated point locations, spectral signatures of crops are measured in aerial imagery that covers the entire study area, and crop yields are often measured at the per-plot level. This poses a challenge since traditional prediction methods often assume that data samples are at the same scale or resolution, so these models cannot be directly applied. Simply preprocessing the data by aggregating it to a common scale or resolution may cause the loss of critical information. Other approaches (e.g., statistical downscaling [wilby2004guidelines]) have been studied, but mostly for particular applications such as climate science. The challenge is still largely underexplored for broad spatial prediction applications.
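To illustrate why naive aggregation loses information, the following sketch (a hypothetical helper, not from any cited work) aggregates a fine raster to a coarser resolution by block mean; any within-block variation is discarded.

```python
import numpy as np

def block_aggregate(raster, factor):
    """Aggregate a fine 2-d raster to a coarser resolution by taking the
    mean of each factor-by-factor block. Within-block variation is lost."""
    nr, nc = raster.shape
    # Trim edges so dimensions divide evenly by the aggregation factor.
    r = raster[:nr - nr % factor, :nc - nc % factor]
    return r.reshape(r.shape[0] // factor, factor,
                     r.shape[1] // factor, factor).mean(axis=(1, 3))

fine = np.arange(16, dtype=float).reshape(4, 4)   # hypothetical high-res raster
coarse = block_aggregate(fine, 2)                 # 4x4 -> 2x2 at half resolution
```

The coarse raster has lower variance than the fine one, a simple symptom of the information loss that motivates more careful multi-scale methods such as statistical downscaling.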

### 1.5 Comparison with Existing Surveys

Most existing related surveys focus on general spatial data mining. Ester et al. [ester1997spatial] and Koperski et al. [koperski1996spatial] provide early surveys on spatial data mining from a database perspective. Miller et al. [miller2009geographic] wrote a book on geographic data mining and knowledge discovery that contains a chapter on spatial prediction. The chapter compares a few common methods in case studies but does not provide a systematic survey. Shekhar et al. [shekhar2003trends, shekhar2011identifying, shekhar2015spatiotemporal] and Atluri et al. [atluri2017spatio] provide surveys on general spatial and spatiotemporal data mining, highlighting the unique challenges of mining spatial data and the spatial statistical foundation, but without a systematic review of prediction methods. To the best of our knowledge, there is no systematic survey on spatial prediction methods in the literature.

To fill this gap, we provide a systematic review of the principles and methods of spatial prediction. We provide a taxonomy of methods based on the unique challenges they address, including spatial autocorrelation (or dependency), spatial heterogeneity, limited ground truth, and multiple spatial scales and resolutions. When introducing each method, we start with the intuition and underlying assumptions, then introduce the key ideas and theoretical foundation, and finally discuss its advantages and disadvantages. We also introduce several spatiotemporal extensions of the methods, and identify future research opportunities.

### 1.6 Scope and Outline

Due to space limits, we focus only on spatial prediction problems in which samples have fixed locations. Prediction for moving object data, such as location prediction and recommendation, is beyond our scope. We also do not include methods from computer vision unless the input images are geo-referenced, such as earth observation imagery. We do not particularly distinguish between spatial classification and regression, since methods are often interchangeable through a logistic transformation.

The outline of the paper is as follows. Section 2 introduces spatial statistics foundations. Section 3 provides a taxonomy of spatial prediction methods based on the key challenge they address, and also introduces spatiotemporal extensions of methods. Future research opportunities are discussed in Section 4. Section 5 concludes the paper.

## 2 Spatial Statistical Foundations

Spatial statistics [schabenberger2005statistical, banerjee2014hierarchical, cressie2015statistics] provides a theoretical framework for conducting exploratory analysis and making inferences on spatial data. In spatial statistics, samples corresponding to fixed point locations in continuous space are called *point reference data*, while samples corresponding to fixed cells in the field representation are called *areal data*, as summarized in Table II. Samples corresponding to random point locations form a *spatial point process*, which is beyond our scope. This section reviews some important spatial statistics concepts that are the foundation of many spatial prediction methods, including spatial autocorrelation, stationarity, isotropy, variogram, covariogram, and spatial heterogeneity.

| Data Representation View | | Spatial Statistics View |
| --- | --- | --- |
| Object | Points | Point reference data; Spatial point process |
| | Lines | Areal data |
| | Polygons | Areal data |
| Field | Regular cells | Areal data |
| | Irregular cells | Areal data |

### 2.1 Spatial Autocorrelation (Dependency)

The first law of geography indicates that spatial data samples are not statistically independent but correlated, particularly across nearby locations. This effect is also called the *spatial autocorrelation* effect. We use the terms “autocorrelation” and “dependency” interchangeably to refer to this same effect on spatial data. The specific statistics of spatial autocorrelation differ between areal data and point reference data, and are introduced separately below.

#### 2.1.1 Spatial autocorrelation on areal data

*Spatial neighborhood and W-matrix*: On areal data, spatial data samples are regular or irregular cells in a discrete tessellation of continuous space. The range of spatial dependency is assumed to be within a *spatial neighborhood*, which can be defined based on cell adjacency or distance. Most often, two samples (cells) are spatial neighbors if they share boundaries (rook neighborhood), or if they share corners or boundaries (queen neighborhood). Spatial neighborhood relationships across all samples (cells) can be represented by an $n$ by $n$ square matrix called the W-matrix $\mathbf{W}$, where $n$ is the number of samples, element $w_{ij} = 1$ if the $i$th sample and the $j$th sample are spatial neighbors, and $w_{ij} = 0$ otherwise (by default, $w_{ii} = 0$, i.e., samples are not neighbors of themselves).
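As a sketch (the helper name is hypothetical), a binary W-matrix for a regular grid under the rook neighborhood can be built as follows; the queen variant would additionally include the four diagonal offsets.

```python
import numpy as np

def rook_w_matrix(nrows, ncols):
    """Binary W-matrix for a regular nrows-by-ncols grid under the rook
    neighborhood: two cells are neighbors iff they share an edge."""
    n = nrows * ncols
    W = np.zeros((n, n), dtype=int)
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c                      # row-major cell index
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    W[i, rr * ncols + cc] = 1      # mark edge-sharing neighbor
    return W

W = rook_w_matrix(3, 3)
```

On a 3-by-3 grid, the matrix is symmetric with a zero diagonal; a corner cell has two neighbors, an edge cell three, and the center cell four.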

*Spatial autocorrelation*: Based on the definition above, a spatial autocorrelation statistic is defined as the correlation between observations of the same variable at neighboring cells. One example for a continuous response $y$ is Moran’s $I$, which is defined as

$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{\left(\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\right) \sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{1}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ with $n$ as the total number of samples (cells). Similar to Pearson’s correlation, the value of Moran’s $I$ is within $[-1, 1]$. A positive Moran’s $I$ indicates that nearby locations tend to have similar values, while a negative $I$ indicates that nearby locations tend to have different values. There are also other spatial autocorrelation statistics, such as Geary’s $C$, the $G$ statistic, and the Black-Black join count [schabenberger2005statistical].
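As an illustrative sketch, Moran's I can be computed directly from a response vector and a binary W-matrix (the helper name and the four-cell example are hypothetical):

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for a continuous variable y observed on n cells,
    given a binary spatial-neighbor matrix W with zero diagonal."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = y - y.mean()                                  # deviations from the mean
    return n * (W * np.outer(z, z)).sum() / (W.sum() * (z ** 2).sum())

# Four cells in a row: each cell's neighbors are the adjacent cells only.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
smooth_trend = [1.0, 2.0, 3.0, 4.0]     # neighbors have similar values
checkerboard = [1.0, -1.0, 1.0, -1.0]   # neighbors have opposite values
```

The smooth trend yields a positive Moran's I (nearby cells resemble each other), while the checkerboard pattern yields a negative value.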

#### 2.1.2 Spatial autocorrelation on point reference data

Spatial autocorrelation statistics on point reference data differ from those on areal data in that they measure the correlation between variables at any two locations in continuous space, based only on observations at a limited number of point locations. To make such measures possible, further assumptions on the data distribution have to be made, including spatial stationarity and isotropy.

*Spatial stationarity*: Spatial stationarity is the assumption that sample statistical properties are location invariant. There are different levels of stationarity assumption according to which statistical properties stay invariant when locations are shifted. The strongest assumption is *strict stationarity*, meaning that the joint distribution of variables at several locations stays unchanged if their locations are all shifted by the same distance and direction, i.e., $P(y(\mathbf{s}_1), \ldots, y(\mathbf{s}_k)) = P(y(\mathbf{s}_1 + \mathbf{h}), \ldots, y(\mathbf{s}_k + \mathbf{h}))$ for any shift vector $\mathbf{h}$. This assumption is the strongest because the joint distribution determines all other statistical properties. *Weak stationarity* assumes that the first and second moments of spatial variables are invariant with location shifting, i.e., $E[y(\mathbf{s})] = \mu$ and $\mathrm{Cov}(y(\mathbf{s}), y(\mathbf{s} + \mathbf{h})) = C(\mathbf{h})$ for any $\mathbf{s}$. With the assumption of weak stationarity, the covariance between any two locations is a function of the relative location difference $\mathbf{h}$ only, which is called the *covariogram* $C(\mathbf{h})$. Another weaker assumption is *intrinsic stationarity*: formally, $\mathrm{Var}(y(\mathbf{s} + \mathbf{h}) - y(\mathbf{s})) = 2\gamma(\mathbf{h})$ for any $\mathbf{s}$. The function $2\gamma(\mathbf{h})$ here is also called the *variogram*. In practice, a spatial variable usually has a non-constant mean (also called a trend) $\mu(\mathbf{s})$, so weak (or intrinsic) stationarity is often assumed on the residual errors $\epsilon(\mathbf{s}) = y(\mathbf{s}) - \mu(\mathbf{s})$.

*Spatial isotropy*: Weak or intrinsic stationarity assumes that the second order statistical property of a variable at two locations is a function of the location difference vector $\mathbf{h}$ only. *Spatial isotropy* further assumes that these properties depend only on distance regardless of direction, i.e., covariogram $C(\mathbf{h}) = C(\|\mathbf{h}\|)$ and variogram $2\gamma(\mathbf{h}) = 2\gamma(\|\mathbf{h}\|)$, where $\|\mathbf{h}\|$ is the length of $\mathbf{h}$. In other words, the covariance between variables at any two locations only depends on their relative distance.

In point reference data, assuming weak stationarity and isotropy, we can empirically estimate the variogram or covariogram function by fitting a curve (e.g., linear, spherical, exponential) based on observations at several point locations. Once the curve is fitted, we can obtain the covariance between variables at any two locations simply based on their distance.
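The first step of that procedure, computing an empirical semivariogram by distance binning, can be sketched as follows (a hypothetical helper under the isotropy assumption; curve fitting would follow on the binned values):

```python
import numpy as np

def empirical_variogram(S, y, bins):
    """Empirical semivariogram under isotropy: for each distance bin,
    average 0.5 * (y_i - y_j)^2 over all sample pairs whose separation
    distance falls in that bin."""
    S, y = np.asarray(S, dtype=float), np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(y), k=1)          # all unordered sample pairs
    d = np.linalg.norm(S[i] - S[j], axis=1)      # pairwise distances
    sq = 0.5 * (y[i] - y[j]) ** 2                # half squared differences
    gamma = np.full(len(bins) - 1, np.nan)
    for b in range(len(bins) - 1):
        mask = (d >= bins[b]) & (d < bins[b + 1])
        if mask.any():
            gamma[b] = sq[mask].mean()
    return gamma

# Hypothetical points on a line with a smooth trend: semivariance grows
# with distance, as expected under positive spatial autocorrelation.
S = [[i, 0.0] for i in range(5)]
y = [0.0, 1.0, 2.0, 3.0, 4.0]
g = empirical_variogram(S, y, bins=[0.5, 1.5, 2.5])
```

Here `g[0]` (pairs one unit apart) is smaller than `g[1]` (pairs two units apart), reflecting that nearby observations are more similar.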

From the discussions above, we can observe that the spatial autocorrelation (dependency) effect is measured differently on areal data and point reference data. On areal data, it is based on the definition of a spatial neighborhood, while on point reference data, it is based on a covariance function (covariogram). The existence of spatial autocorrelation poses a challenge in spatial prediction, since the common assumption in many traditional prediction models that samples are statistically independent is no longer valid. Ignoring this challenge can lead to poor prediction performance.

### 2.2 Spatial heterogeneity

If a spatial distribution is stationary and isotropic, it is called *homogeneous* [banerjee2014hierarchical]. Thus, a spatial distribution is *heterogeneous* if it is either non-stationary or anisotropic. The terms *homogeneity* and *heterogeneity* should not be confused with homoscedasticity and heteroscedasticity, which specifically refer to properties of variance.

From the discussion in Section 2.1, we can see that assuming spatial homogeneity (stationarity and isotropy) can greatly simplify the statistical modeling of spatial data. Thus, this assumption is made in many spatial prediction methods. However, the assumption can be violated by real world spatial data, which is often spatially heterogeneous (spatially non-stationary or anisotropic). For example, spectral features of forest, land, and water in earth imagery vary from tropical regions to temperate regions. As another example, when classifying earth observation imagery pixels into water and land, spatial dependency across nearby water locations is anisotropic, following geographic terrain and topography (water flows from a higher elevation to a nearby lower elevation due to gravity). Spatial heterogeneity poses a challenge in spatial prediction since a model learned from an entire study area may perform poorly in some local regions.
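A tiny synthetic illustration of spatial non-stationarity (all data below is invented): when the feature-response relationship flips sign between two sub-regions, a single global linear model detects no relationship at all, while per-region models recover it.

```python
import numpy as np

# Hypothetical 1-d example: y increases with x in the "western" region
# but decreases with x in the "eastern" region.
x = np.linspace(0.0, 1.0, 20)
y_west = 2.0 * x
y_east = -2.0 * x

def slope(xs, ys):
    """Least-squares slope of a degree-1 polynomial fit."""
    return np.polyfit(xs, ys, 1)[0]

# One global model pools both regions; local models fit each separately.
global_slope = slope(np.concatenate([x, x]), np.concatenate([y_west, y_east]))
west_slope, east_slope = slope(x, y_west), slope(x, y_east)
```

The global slope is essentially zero, an instance of the ecological-fallacy risk described above, while the local fits recover the true slopes of +2 and -2.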

## 3 A Taxonomy of Spatial Prediction Methods

This section provides a taxonomy of existing spatial prediction methods, as shown in Figure 1. Methods are first categorized by the unique challenge they address, including spatial autocorrelation, spatial heterogeneity, limited ground truth, and multiple spatial scales and resolutions. Within each category, we further group the methods based on the strategies they use to address the challenge. When reviewing a method, we start from the intuition behind it, highlight underlying assumptions, explain the key ideas, and discuss its advantages and disadvantages. We also discuss spatiotemporal extensions of methods in the end.

### 3.1 Spatial Autocorrelation (Dependency)

Addressing the challenge of spatial autocorrelation or dependency requires spatial prediction algorithms to go beyond the independence assumption. Common strategies include spatial contextual feature generation for model inputs, spatial dependency constraint within model structure, and spatial regularization for model objective function.

#### 3.1.1 Spatial contextual feature generation

One way of incorporating spatial dependency into prediction methods is to augment input data with additional spatial contextual features. The spatial context of a sample location refers to the information surrounding it, such as relationships to other objects or locations, attributes of nearby samples, and auxiliary semantic information from additional data sources. Once spatial contextual information is added to the explanatory features, traditional non-spatial prediction methods can be used. We now introduce several approaches to generating spatial contextual features.

*Spatial relationship features*: Spatial contextual features can be generated based on spatial relationships with other locations or objects, such as distance or direction, touching, lying within, or overlapping with another object. Spatial relationship features can be readily used in rule-based or decision tree-based models. Examples of techniques include a spatiotemporal probability tree model to classify meteorological data on storms [mcgovern_spatiotemporal_2008, mcgovern_enhanced_2013], multi-relational spatial classification [frank_multi-relational_2009], and prediction based on spatial association rules [ding_discovery_2009].

*Spatial contextual features on raster data*: Spatial contextual features have long been used in classifying raster data (e.g., earth observation imagery) to reduce salt-and-pepper noise [lu2007survey]. Specific methods include neighborhood window filters (e.g., the median filter [chan2005salt], weighted median filter [brownrigg1984weighted], adaptive median filter [hwang1995adaptive], and decision-based filter [chan2005salt, esakkirajan2011removal]), spatial contextual variables and textures [puissant2005utility], neighborhood spatial autocorrelation statistics [jiang2015focal, jiang_focal-test-based_2013], morphological profiling [benediktsson2005classification], and object-based image analysis (e.g., mean, variance, and texture of object segments) [hay2008geographic]. Sometimes, these methods are used in a post-processing step [tarabalka2009spectral].
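A minimal sketch of one such neighborhood filter (the helper name and toy raster are hypothetical): a moving-window median applied to a binary label raster as post-processing, where it removes isolated misclassified pixels. For binary labels the window median coincides with the majority vote.

```python
import numpy as np

def focal_median(labels, radius=1):
    """Apply a (2*radius+1)^2 moving-window median to an integer label
    raster; isolated 'salt-and-pepper' pixels are replaced by the
    dominant value of their neighborhood. Edges are padded by replication."""
    padded = np.pad(labels, radius, mode='edge')
    out = np.empty_like(labels)
    nrows, ncols = labels.shape
    for r in range(nrows):
        for c in range(ncols):
            win = padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1]
            out[r, c] = int(np.median(win))
    return out

noisy = np.zeros((5, 5), dtype=int)
noisy[2, 2] = 1               # one isolated misclassified pixel
smoothed = focal_median(noisy)
```

The lone noise pixel is replaced by the surrounding class, illustrating how neighborhood filters suppress salt-and-pepper artifacts in classified imagery.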

*Spatial contextual features from multi-source data fusion*: One unique property of spatial data is that information from different sources can be fused into the same spatial framework, providing important spatial contexts for learning samples. For example, when predicting a fine-grained air quality map for an entire city, we can generate contextual features by fusing air quality records at ground stations, weather information, road network and traffic data, as well as POIs [zheng2013u]. When predicting human behaviors from location history, auxiliary data from geosocial media can provide important semantic annotations [wu2015semantic, wu2016did]. Generating contextual features through data fusion has its own challenges (e.g., multi-modality, sparsity, noise). Various techniques have been explored, such as coupled matrix factorization and context-aware tensor decomposition with manifold regularization [zheng2015methodologies].

Spatial contextual feature generation is important in many practical applications (e.g., urban computing) due to two main advantages. First, generating appropriate contextual features can significantly enhance prediction accuracy due to the effectiveness of those features in explaining the response variable. Second, after spatial contextual features are generated, many traditional non-spatial predictive models can be used (e.g., random forest, support vector machine). This is sometimes convenient since there is no need to modify non-spatial prediction models or learning algorithms. At the same time, spatial contextual feature generation may require significant knowledge about the application domain.

#### 3.1.2 Spatial dependency within model structure

Instead of generating spatial contextual features and utilizing traditional non-spatial prediction methods, we can directly incorporate spatial dependency into the model structure. There are three strategies to do so: Markov random field based models for areal data, Gaussian process based models for point reference data, and hierarchical models with latent variables, which provide a new perspective on capturing spatial dependency for both areal and point reference data.

*(1) Markov Random Field Based Models*

Markov random field (MRF) is a widely used model for areal data such as earth observation images, MRI medical images, and county-level disease count maps. An MRF is a random field that satisfies the Markov property: the conditional probability of the observation at one cell given observations at all remaining cells only depends on observations at its neighbors. This property is consistent with the first law of geography that "nearby things are more related than distant things". According to Brook's lemma [brook1964distinction], the joint distribution of cell observations can be uniquely determined based on the conditional probabilities specified in the Markov property. Furthermore, according to the Hammersley-Clifford theorem [clifford1990markov], the corresponding joint distribution of an MRF has a unique structure: it can be expressed by a set of potential functions on spatial neighbor cliques (i.e., symmetric functions that are unchanged by any permutation of input variables within a clique). Such a joint distribution is also called a Gibbs distribution. Equation 2 is an example, whose potential function is based on cliques of size two (pairs of neighboring locations):

$$P(\mathbf{y}) \propto \exp\Big\{-\frac{1}{2\tau^2}\sum_{i \sim j} w_{ij}\,(y_i - y_j)^2\Big\} \qquad (2)$$

The Markov property simplifies the modeling process: as long as the neighborhood structure is specified, the joint distribution of an MRF can be expressed by potential functions on neighbor cliques. Spatial prediction methods based on MRF include ones that explicitly capture spatial dependency such as Simultaneous Autoregressive (SAR) models, ones that implicitly capture spatial dependency such as Conditional Autoregressive (CAR) models, and ones integrating MRF with other models such as Bayesian classifiers and support vector machines.

*Simultaneous Autoregressive (SAR) models* (also called *spatial autoregressive models* in spatial econometrics) explicitly express spatial dependency across response variables [anse-88, banerjee2014hierarchical]. SAR models can be better explained by comparison with linear regression. The classical linear regression model is expressed as Equation 3,

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad (3)$$

where $\mathbf{y}$ is an $n$ by $1$ column vector of all response variables, $X$ is an $n$ by $p$ sample covariate (feature) matrix, $\boldsymbol{\beta}$ is a $p$ by $1$ column vector of coefficients, and $\boldsymbol{\epsilon}$ is an $n$ by $1$ column vector of i.i.d. Gaussian noise (residual errors). In contrast, the SAR model extends traditional linear regression with an additional spatial autoregressive term, as shown in Equation 4,

$$\mathbf{y} = \rho W \mathbf{y} + X\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad (4)$$

where $W$ is the row-normalized W-matrix and $\rho$ reflects the strength of the spatial dependency effect. The $i$th row of the spatial autoregressive term $\rho W \mathbf{y}$ is a weighted average of the response variables at all neighboring locations of location $i$. Another way to look at SAR is that it multiplies a "smoother" term onto the mean and residual error of classical linear regression, i.e., $\mathbf{y} = (I - \rho W)^{-1}(X\boldsymbol{\beta} + \boldsymbol{\epsilon})$. Parameters in SAR can be estimated by the maximum likelihood method. The SAR model can also be extended for spatial classification via a logit transformation. It is worth noting that the spatial autoregressive term can also be applied to variables other than the responses, such as covariates as in the spatial Durbin model or residual errors as in the spatial error model

[viton2010notes].

*Conditional autoregressive (CAR) models* implicitly express spatial dependency via conditional distributions. One common example in the Gaussian case is shown in Equation 5,

$$y_i \mid \mathbf{y}_{-i} \sim N\Big(\frac{\sum_j w_{ij}\, y_j}{w_{i+}},\ \frac{\tau^2}{w_{i+}}\Big) \qquad (5)$$

where $w_{ij}$ is an element of the W-matrix and $w_{i+} = \sum_j w_{ij}$ is the sum of the $i$th row. In this case, the conditional distribution of a random variable $y_i$ given all other random variables $\mathbf{y}_{-i}$ follows a Gaussian distribution whose mean is a neighborhood weighted average. It has been shown in [banerjee2014hierarchical] that the corresponding joint distribution is the same as Equation 2. The main advantage of CAR models is that spatial dependency can be easily captured via potential functions on neighboring cliques. However, the joint distribution can be improper (the integral of the CAR model above is not equal to one). In practice, such CAR models are often used as a prior distribution in Bayesian models [assunccao2009neighborhood].

*Integrating MRF with other models*

MRF models can also be integrated with other classification and regression methods to incorporate spatial dependency. One important example is MRF-based Bayes classifiers. Bayes classifiers are classification models that utilize the maximum a posteriori probability (MAP) estimate in Bayes' theorem, i.e.,

$$\hat{Y} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} P(X \mid Y)\, P(Y) = \arg\max_{Y} \prod_i P(x_i \mid y_i)\, P(y_i) \qquad (6)$$

where $X$ and $Y$ are the features and class labels of all samples respectively. The last step of Equation 6 is based on the i.i.d. assumption. The MRF-based Bayes classifier [li2009markov] replaces the i.i.d. assumption with spatial dependency via MRF models in order to reduce salt-and-pepper noise [chawla_modeling_2001, Shekhar-02]. Specifically, the joint distribution is expressed as a Markov random field with potential functions defined on neighbor cliques. For example,

$$P(Y \mid X) \propto \exp\Big\{\sum_{i \sim j} \lambda\, \delta(y_i, y_j) + \sum_i \ln P(x_i \mid y_i)\Big\} \qquad (7)$$

where $\lambda$ is the weight for spatial dependency and the term $\lambda\, \delta(y_i, y_j)$ is a potential function with $\delta$ as the Kronecker delta function (i.e., $\delta(y_i, y_j) = 1$ if $y_i = y_j$, and $\delta(y_i, y_j) = 0$ otherwise). The posterior probability shown in Equation 7 can be considered as an energy function, which is the sum of potential functions on neighboring classes $y_i$ and $y_j$ as well as between the feature $x_i$ and class $y_i$ of each sample. Parameters in MRF-based Bayes classifiers can be estimated via graph cut [boykov2001fast] and iterative methods [besag1986statistical, jackson2002adaptive]. MRF-based models with other potential functions have been applied to earth science problems such as drought detection [fu_drought_2012].

Another model similar to MRF is the *conditional random field* (CRF) [lafferty2001conditional], which directly models spatial dependency within the conditional probability function in Equation 7. Its potential function on class labels within a clique is conditioned on the feature vector $X$. Several variants of CRF have been proposed, including decoupled conditional random field [lee_efficient_2006], discriminative random field [kumar2003discriminative], and support vector random field [lee_support_2005]. The difference between MRF and CRF is that the former is generative while the latter is discriminative.
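As an illustration of the iterative estimation mentioned above, the sketch below runs iterated conditional modes (ICM) [besag1986statistical] for a two-class MRF-style classifier on a raster; the squared-distance likelihood term and the 4-neighbor grid are simplifying assumptions, not the cited papers' exact formulation.

```python
import numpy as np

def icm(features, means, lam=1.0, iters=5):
    """Iterated conditional modes for a two-class MRF-style classifier on a
    grid. Per-cell energy: squared distance of the feature to the class mean,
    minus lam for each 4-neighbor currently carrying the same label."""
    labels = np.argmin([(features - m) ** 2 for m in means], axis=0)
    h, w = features.shape
    for _ in range(iters):
        for i in range(h):
            for j in range(w):
                best, best_e = labels[i, j], np.inf
                for c in range(len(means)):
                    e = (features[i, j] - means[c]) ** 2
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < h and 0 <= nj < w:
                            e -= lam * (labels[ni, nj] == c)
                    if e < best_e:
                        best, best_e = c, e
                labels[i, j] = best
    return labels

# Noisy two-region image: left half near 0, right half near 1, with a
# salt-and-pepper outlier (0.9) inside the left region.
img = np.array([[0.1, 0.0, 0.9, 1.0],
                [0.0, 0.9, 1.0, 0.9],
                [0.1, 0.1, 0.8, 1.0]])
out = icm(img, means=[0.0, 1.0])
```

With the neighborhood potential, the isolated outlier in the left region is smoothed to its neighbors' class, which is exactly the salt-and-pepper noise reduction discussed above.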

The advantage of MRF-based models is that spatial dependency can be modeled in a very intuitive and simple way (designing potential functions). The limitations include high computational cost in parameter estimation and strong assumptions on the structure of joint probability distribution. In addition, neighborhood relationships are assumed to be given as inputs. Thus, fixed neighborhoods such as square windows are often used for simplicity. For applications where spatial data is anisotropic, determining spatial neighborhood structure is also a challenge.

*(2) Gaussian process based models (Kriging)*

Different from MRF based models that are used for areal data, Gaussian process based models are used for spatial prediction (interpolation) on point reference data [banerjee2014hierarchical]. Given observations of a variable at sample locations in continuous space, the problem aims to interpolate the variable at an unobserved location. Gaussian process assumes that observations at any set of sample locations jointly follow a multivariate Gaussian distribution. The mean term at a location is determined by its local covariates. The residual error term at a location is assumed to be weakly stationary and isotropic, so that the covariance matrix can be expressed as a function of distance (covariogram).

Specifically, a Gaussian process (Kriging) assumes that any set of sample observations $\mathbf{y} = (y_1, \dots, y_n)^T$ follows a multivariate Gaussian distribution $\mathbf{y} \sim N(\boldsymbol{\mu}, \Sigma)$, where $\mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$ and $\Sigma$ is the covariance matrix. The main difference of a Gaussian process from classical regression is that the residual errors are not mutually independent (the covariance matrix is not diagonal, and the non-diagonal elements can be determined from the covariogram). It can be shown that the optimal predictor (minimizing expected square loss) of $y_0$ at a new location given the other observations is the conditional expectation shown in Equation 8, which can be estimated based on the covariance structure from the covariogram [banerjee2014hierarchical]:

$$\hat{y}_0 = E[y_0 \mid \mathbf{y}] = \mathbf{x}_0^T \boldsymbol{\beta} + \boldsymbol{\gamma}^T \Sigma^{-1} (\mathbf{y} - X\boldsymbol{\beta}) \qquad (8)$$

where $\boldsymbol{\gamma}$ is the vector of covariances between the new location and the sample locations. This method is also called *universal* Kriging since it involves covariates $\mathbf{x}_0$. Special cases without covariates include *simple Kriging* (with a known constant mean) and *ordinary Kriging* (with an unknown constant mean) [zimmerman1999experimental].

| Method | Markov random field | Gaussian process |
| --- | --- | --- |
| Spatial data | Areal data | Point reference data |
| Spatial dependency | Neighborhood adjacency matrix | Variogram (covariance function on distance) |
| Assumption | Conditional independence given neighbors | Isotropy (covariance depends on distance only) |

Gaussian process shares some similarities with MRF in that both incorporate spatial dependency (autocorrelation) across sample locations into the model structure. There are several differences, however, as summarized in Table III: first, Gaussian process is developed for point reference data while MRF is developed for areal data; second, MRF models spatial dependency through the W-matrix, while Gaussian process models spatial dependency through a covariance function on spatial distance (covariogram). Similar to MRF, which assumes a given neighborhood structure and the Markov property, Gaussian process makes assumptions of spatial stationarity and isotropy.
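A rough sketch of ordinary Kriging (constant unknown mean, no covariates) is below; the exponential covariogram and the toy configuration of four observations are illustrative assumptions. The prediction weights come from the standard augmented system with a Lagrange multiplier forcing the weights to sum to one.

```python
import numpy as np

def ordinary_kriging(coords, values, target, sill=1.0, rng_par=1.0):
    """Ordinary kriging sketch with an exponential covariogram
    C(d) = sill * exp(-d / rng_par). Solves the augmented system
    [C 1; 1' 0][w; mu] = [c0; 1], which enforces sum(w) = 1."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = sill * np.exp(-d / rng_par)                      # sample covariances
    c0 = sill * np.exp(-np.linalg.norm(coords - target, axis=-1) / rng_par)
    A = np.block([[C, np.ones((n, 1))],
                  [np.ones((1, n)), np.zeros((1, 1))]])
    sol = np.linalg.solve(A, np.append(c0, 1.0))
    w = sol[:n]                                          # kriging weights
    return w @ values, w

# Four observations at the corners of a unit square; predict at the center.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 2.0, 3.0])
pred, w = ordinary_kriging(coords, values, target=np.array([0.5, 0.5]))
```

By symmetry the four weights are equal here, so the prediction is simply the average of the four observations; with unevenly placed samples, nearer samples receive larger weights.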

*(3) Hierarchical model with latent variables*

Spatial autocorrelation or dependency can be incorporated into predictive models by adding some spatially autocorrelated latent variables into Bayesian hierarchical models (graphical models). Indeed, the Markov random field based models and Gaussian process models discussed above can be considered as special cases or building blocks for hierarchical models with latent variables.

For instance, in a Gaussian process (Kriging), the sample observations $\mathbf{y}$ are assumed to follow a multivariate Gaussian distribution $\mathbf{y} \sim N(\boldsymbol{\mu}, \Sigma)$ where $\mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$, and the covariance matrix $\Sigma$ can be decomposed into two components: $\tau^2 I$ for i.i.d. Gaussian noise, and a non-diagonal matrix $\Sigma_w$ that captures covariance based on the covariogram (introduced in Section 2.1.2). This formulation can be rewritten as the hierarchical model in Equation 9,

$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + w_i + \epsilon_i, \quad \epsilon_i \overset{i.i.d.}{\sim} N(0, \tau^2) \qquad (9)$$

where $w_i$ is a latent variable and the vector $\mathbf{w} = (w_1, \dots, w_n)^T$ follows a multivariate Gaussian distribution with zero mean and covariance matrix $\Sigma_w$ [banerjee2014hierarchical]. The hierarchical model expresses the response variables by adding i.i.d. noise and non-i.i.d. latent variables for the spatial effect. It is illustrated in Figure 2, in which $w_i$ and $\epsilon_i$ (as well as $w_j$ and $\epsilon_j$) are independent, and $y_i$ and $y_j$ are conditionally independent given the latent variables $w_i$ and $w_j$. Latent variables at different locations reflect the spatial (autocorrelation) effect since they follow a joint Gaussian distribution with a non-diagonal covariance matrix (i.e., correlation exists across neighboring $w_i$ and $w_j$).
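The generative structure of this hierarchy can be simulated directly; the coordinates, the exponential covariogram range, and the parameter values below are arbitrary illustrative choices, not values from the literature.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample locations and covariates (illustrative values).
n, p = 50, 2
coords = rng.uniform(0, 10, size=(n, 2))
X = rng.normal(size=(n, p))
beta, tau2 = np.array([2.0, -1.0]), 0.05

# Latent spatial effects w ~ N(0, Sigma_w), exponential covariogram on distance.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma_w = np.exp(-d / 2.0)
w = rng.multivariate_normal(np.zeros(n), Sigma_w)

# Hierarchy: response = fixed effects + spatial latent variable + i.i.d. noise.
y = X @ beta + w + rng.normal(scale=np.sqrt(tau2), size=n)
```

Because `Sigma_w` is non-diagonal, nearby latent effects (and hence nearby responses) are correlated, while the noise term remains independent across locations.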

Another example is disease risk mapping on areal data [banerjee2014hierarchical]. The count of disease events $c_i$ at one county $i$ can be assumed to follow a Poisson distribution $c_i \sim \text{Poisson}(n_i \theta_i)$, where $n_i$ is the known number of high-risk people, $\theta_i$ is the variable for disease risk ($\theta_i > 0$), and $n_i \theta_i$ is the expectation of the Poisson distribution in county $i$. If we assume the distribution in Equation 10,

$$\log \theta_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i \qquad (10)$$

where $\mathbf{x}_i$ is the covariate vector, $\boldsymbol{\beta}$ is the coefficient vector, and $\epsilon_i$ is i.i.d. Gaussian noise, then the problem becomes a non-spatial problem, i.e., each county is independent from the others. However, in reality, though disease risks at different counties may be different, nearby counties have a high tendency to have similar risks. To reflect this phenomenon, we can further model the disease risks at different counties as in Equation 11,

$$\log \theta_i = \mathbf{x}_i^T \boldsymbol{\beta} + w_i + \epsilon_i \qquad (11)$$

where $w_i$ is an additive term for the spatial autocorrelation effect. For instance, $\mathbf{w}$ can follow a conditional autoregressive (CAR) model. This hierarchical model has a similar structure as the one in Figure 2, based on a similar conditional independence assumption.

Latent variables can also be used to incorporate spatial autocorrelation for corrupted observation data. Observations of a target variable may be corrupted or missing due to noise or errors in the data collection process, even though the true values of the variable should be uncorrupted and exhibit high spatial autocorrelation. Corrupted observations impact the performance of predictive models since training samples become inaccurate. To address this issue, a latent variable approach has been proposed [DBLP:conf/kdd/KimYTM15], in which a corrupted observation depends on a corresponding uncorrupted (spatially autocorrelated) latent variable, which in turn depends on the covariates. The entire model is hierarchical, as illustrated in Figure 3. Model learning involves estimating the parameters of these two mappings.

The advantage of using hierarchical models with latent variables to incorporate spatial dependency is that the modeling process is simple and intuitive, providing flexibility in model design. For example, such models have been used in real estate appraisal [DBLP:conf/kdd/FuXGYZZ14] and event forecasting from social media data [DBLP:conf/sdm/ZhaoCLR15]. The main issue is the computational cost: learning parameters involves iterative methods such as EM algorithms or Markov chain Monte Carlo simulation, which can be expensive for large data with many nodes (variables).

#### 3.1.3 Spatial regularization in objective function

In addition to explicitly modifying the model structure to capture spatial dependency, we can also extend the objective (or loss) function with an additional spatial regularization term. In this way, the learning algorithm favors parameter values that not only make accurate predictions at individual locations but also show high spatial autocorrelation in the predicted map. The model here can refer to a single model such as linear regression or a composite of multiple models with one model at each location.

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Spatial contextual feature generation | Easy to use; does not require modifying models and learning algorithms; no restriction on model types | Subjective; need to regenerate all candidate features for each problem |
| Spatial dependency in model structure | Intuitive; clear theoretical properties | Restricted to fixed types of models and distributions; model learning is computationally expensive |
| Spatial regularization in objective function | Intuitive; clear theoretical properties; fewer restrictions on types of models | Requires models with differentiable objective functions; no guarantee of optimal solutions; computationally expensive |

*Spatial regularization in multi-model prediction:* Multi-model prediction utilizes a composite of local models with one model at each location (or sub-region) in the study area. Each local model has its own parameters that can be learned by minimizing prediction errors on learning samples at this location. However, independently learning these local models at individual locations risks overfitting due to the large number of parameters in multiple models. To solve this issue, a spatial autocorrelation constraint on model parameters can be added to the objective function. The main idea is to create an overall loss function by summing up the individual losses of local models together with a regularization term that penalizes inconsistent model parameters at neighboring locations. The underlying assumption is that nearby models should have similar parameters due to spatial autocorrelation.

One common spatial regularizer based on the spatial autocorrelation effect is the graph Laplacian regularizer [weinberger2007graph]. The idea is to consider each local model as a node and spatial neighborhood relationships between locations as edges, and to penalize neighboring nodes with very different parameters. Approaches have been proposed with different types of base models, including linear regression [subbian_climate_2013, DBLP:conf/sdm/DataKKBK14] and support vector machines [stoeckel_svm_2005, DBLP:conf/sdm/DataKKBK14]. Consider a linear regression base model as an example. The traditional loss function is the sum of squared errors, i.e.,

$$J(B) = \sum_{t=1}^{m} \| X_t \boldsymbol{\beta}_t - \mathbf{y}_t \|^2 \qquad (12)$$

where $s_t$ is the location (or region) for the $t$th model (task), $X_t$ and $\mathbf{y}_t$ are the feature matrix and response vector for all samples at $s_t$, and $B = [\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_m]$ is a $p$ by $m$ matrix of all model parameters. Minimizing $J(B)$ is equivalent to minimizing the loss function at each location independently, since the parameters $\boldsymbol{\beta}_s$ and $\boldsymbol{\beta}_t$ are independent for $s \neq t$. This leads to a large number of parameters and a potential risk of model overfitting. To address this issue, a graph Laplacian regularizer is added, penalizing differences of model parameters at neighboring locations. Specifically, the regularization term is

$$R(B) = \frac{1}{2} \sum_{s,t} w_{st} \| \boldsymbol{\beta}_s - \boldsymbol{\beta}_t \|^2 = \mathrm{tr}(B L B^T) \qquad (13)$$

where $w_{st}$ is the element of the W-matrix with $w_{st} = 1$ when locations $s$ and $t$ are neighbors and $w_{st} = 0$ otherwise, and $L = D - W$ (with $D$ the diagonal degree matrix) is a graph Laplacian matrix. The combined loss function is Equation 14, where $\lambda$ controls the degree of spatial smoothness among parameters.

$$\min_B \; J(B) + \lambda\, \mathrm{tr}(B L B^T) \qquad (14)$$

The advantage of this approach is that the model is intuitive, easily interpretable, and generally applicable to different base models as long as their loss functions are differentiable. In practice, parameters can be estimated iteratively via Newton-Raphson methods [avriel2003nonlinear]. It is worth noting that the graph Laplacian regularizer here has subtle differences from the one often used in semi-supervised learning: in semi-supervised learning, there is only one model instead of multiple models, and the regularizer penalizes parameter values whose model predictions at neighboring locations are inconsistent.
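A rough sketch of this multi-model objective with a graph Laplacian penalty follows; the two hypothetical neighboring tasks, the regularization weight, the step size, and plain gradient descent are all illustrative assumptions rather than the cited papers' exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two neighboring locations (tasks) with slightly different true coefficients.
X1, X2 = rng.normal(size=(30, 2)), rng.normal(size=(30, 2))
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=30)
y2 = X2 @ np.array([1.2, 1.8]) + rng.normal(scale=0.1, size=30)
tasks = [(X1, y1), (X2, y2)]
W = np.array([[0.0, 1.0], [1.0, 0.0]])  # the two locations are neighbors

def loss_and_grad(B, lam):
    """Sum of per-task squared errors plus the graph Laplacian penalty
    lam * sum_{s,t} w_st ||beta_s - beta_t||^2 over ordered pairs."""
    loss, grad = 0.0, np.zeros_like(B)
    for t, (X, y) in enumerate(tasks):
        r = X @ B[t] - y
        loss += r @ r
        grad[t] += 2 * X.T @ r
    for s in range(len(tasks)):
        for t in range(len(tasks)):
            diff = B[s] - B[t]
            loss += lam * W[s, t] * (diff @ diff)
            # factor 4: each unordered pair contributes twice to the sum
            grad[s] += 4 * lam * W[s, t] * diff
    return loss, grad

B = np.zeros((2, 2))  # row t holds beta_t
for _ in range(500):
    _, g = loss_and_grad(B, lam=5.0)
    B -= 0.001 * g
```

The penalty shrinks the two coefficient vectors toward each other relative to fitting each task independently, which is the overfitting control discussed above.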

*Spatial decision trees:* Decision tree classifiers have been widely used for spatial classification problems in earth science [hansen2000global, pal2003assessment] due to their simplicity, interpretability, computational efficiency, and non-parametric nature. However, decision trees implicitly assume that samples are independent and identically distributed. This assumption is often violated in spatial data due to spatial autocorrelation, resulting in artifacts in prediction results. To address this limitation, *spatial decision trees* have been proposed [jiang_learning_2012, li_spatial_2006, stojanova2011global, stojanova2012dealing, jiang2015focal] that incorporate the spatial autocorrelation effect into decision tree learning algorithms. This is usually done by modifying the entropy or information gain heuristic. The main idea is that the selection of a tree node test should be based not only on class purification but also on the spatial arrangement of the samples being split on the map.

Specifically, decision trees often use entropy and information gain measures to select tree node tests. Entropy measures the impurity of the class distribution: its value is high when the probabilities of different classes are close to each other (high class impurity). Information gain is defined as the decrease of entropy after training samples are split by a tree node test. Decision tree learning algorithms select the tree node test with the maximum information gain at each step. However, this heuristic ignores the spatial pattern of how training samples are split on a map. Based on spatial autocorrelation, we can assume that nearby training samples with the same class should be split into the same subset, such that they are likely to be predicted into the same class. Several approaches have been proposed that extend the traditional entropy or information gain definitions with spatial regularization, including spatial entropy based on spatial distance [li_spatial_2006], and spatial information gain based on spatial autocorrelation statistics such as Moran's I and Geary's C [stojanova2011global, stojanova2012dealing] and the Gamma index [jiang_learning_2012]. Different spatial entropy or information gain definitions make different assumptions about the ground truth class map. Distance based spatial entropy assumes that the samples of each class form a globally compact cluster with high inter-class sample distance and low intra-class sample distance. Spatial autocorrelation statistics based spatial information gain assumes that there are good tree node tests that can separate neighboring samples from different classes into different subsets while keeping neighboring samples from the same class within the same subset.
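The idea can be sketched as follows; the penalty below (fraction of same-class neighboring pairs that a split separates) and the weight `alpha` are illustrative stand-ins for the Gamma-index-style statistics in the cited papers, not their exact definitions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def spatial_info_gain(coords, labels, mask, alpha=0.5, radius=1.5):
    """Classical information gain of a binary split (mask) minus alpha times
    the fraction of same-class neighboring pairs that the split separates."""
    gain = entropy(labels)
    for side in (mask, ~mask):
        if side.any():
            gain -= side.mean() * entropy(labels[side])
    separated, same_pairs = 0, 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j] and np.linalg.norm(coords[i] - coords[j]) <= radius:
                same_pairs += 1
                separated += int(mask[i] != mask[j])
    penalty = separated / same_pairs if same_pairs else 0.0
    return gain - alpha * penalty

# Four samples on a unit grid; each class occupies two adjacent cells.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
split_good = np.array([True, True, False, False])  # aligns with class regions
split_bad = np.array([True, False, True, False])   # cuts same-class neighbors apart
g_good = spatial_info_gain(coords, labels, split_good)
g_bad = spatial_info_gain(coords, labels, split_bad)
```

The split that keeps same-class neighbors together scores strictly higher, which is the behavior the spatially regularized heuristics aim for.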

Spatial decision trees inherit the merits of the decision tree model family, such as interpretability and a non-parametric nature. However, since entropy and information gain are only greedy heuristics, there is no guarantee of global optimality. For example, if all input feature maps tend to have poor spatial autocorrelation, spatial entropy or information gain will still select one among them. In this scenario, extending the candidate tree node tests to incorporate neighborhood autocorrelation statistics can help [jiang2015focal].

*Spatial accuracy objective function:* In traditional classification problems, the objective function is often measured on each sample (e.g., whether the sample is misclassified or not), regardless of how far the prediction is from the nearest location of the true class. Consider the example of classifying raster cells into "with bird nest" and "without bird nest". If a cell that is mistakenly predicted as "with bird nest" is very close to an actual bird nest cell, the prediction accuracy on this sample should be considered higher than zero. Thus, spatial accuracy [chawla_modeling_2001] has been proposed to measure not only how accurately each cell is predicted by itself but also how far it is from the nearest location of its true class.
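A toy sketch of such a distance-sensitive measure is below; the exponential distance discount and the `decay` parameter are hypothetical simplifications, not the exact published definition of spatial accuracy.

```python
import numpy as np

def spatial_accuracy(pred, truth, coords, decay=1.0):
    """Spatial-accuracy style measure (a sketch, not the exact published
    definition): a correct sample scores 1; a misclassified sample scores
    exp(-decay * d), where d is the distance to the nearest sample whose
    true class equals the predicted class (0 if no such sample exists)."""
    scores = []
    for i in range(len(pred)):
        if pred[i] == truth[i]:
            scores.append(1.0)
            continue
        locs = coords[truth == pred[i]]
        if len(locs) == 0:
            scores.append(0.0)
        else:
            d = np.linalg.norm(locs - coords[i], axis=1).min()
            scores.append(float(np.exp(-decay * d)))
    return float(np.mean(scores))

pred = np.array([1, 1])
truth = np.array([1, 0])
# The same mistake is penalized less when it is near an actual class-1 cell.
near = spatial_accuracy(pred, truth, np.array([[0.0, 0.0], [0.0, 1.0]]))
far = spatial_accuracy(pred, truth, np.array([[0.0, 0.0], [0.0, 5.0]]))
```

Both runs contain one identical misclassification, but the nearby mistake receives a higher score, reflecting the bird nest example above.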

Table IV compares the three different strategies to incorporate spatial autocorrelation (or dependency) into spatial prediction, including spatial contextual feature generation, spatial dependency within model structure, and spatial regularization in objective function. Spatial feature generation is simple and intuitive, without the need of modifying model structures and learning algorithms. But selecting candidate features can be subjective, requiring strong knowledge related to the application domain. The process needs to be repeated for a different problem instance. The latter two approaches are intuitive with clear theoretical properties, making it easy to design models and learning algorithms. But they rely on strong assumptions on model types, and are often computationally expensive.

### 3.2 Spatial Heterogeneity

Spatial heterogeneity is another major challenge in spatial prediction problems. Spatial data samples often do not follow an identical distribution in the entire study area, so a global model learned from samples across the entire study area may produce poor predictions in local regions. For example, the relationship between house price and house age may differ dramatically between suburban and urban areas. There are several approaches in the literature to address the challenge of spatial heterogeneity, including spatial coordinate features, geographically weighted models with location dependent parameters, decomposition based spatial ensembles, and multi-task learning. The main idea behind these methods is to replace a single global model with a composite of local or regional models that are location sensitive.

#### 3.2.1 Spatial coordinate features

One simple strategy to make a model location sensitive is to incorporate spatial coordinates into the feature (covariate) vector. For example, incorporating spatial coordinate features into regression can fit a trend surface that is location dependent. Similarly, spatial coordinate features in decision trees can split samples not only in the feature space but also in the geographic space. However, because spatial locations are multi-dimensional, treating spatial coordinates as separate features cannot effectively model heterogeneous spatial data in which homogeneous sub-regions have arbitrary footprint shapes. For example, a decision tree learned from data with spatial coordinate features partitions the geographical space parallel to the vertical or horizontal axis, and thus cannot effectively partition the geographic space into irregular footprints.

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Geographically weighted model | Simple and intuitive; clear theoretical properties; no restriction on types of models | Underlying assumption of isotropy can be invalid |
| Decomposition based spatial ensemble | No restriction on types of models and shapes of homogeneous zones | Decomposing space into homogeneous zones for local models can be non-trivial |
| Multi-task learning | No restriction on shapes of homogeneous zones; captures spatial dependency between local models | Restricted to local models with differentiable objective functions; decomposing space into homogeneous zones can be non-trivial |

#### 3.2.2 Geographically Weighted Models

A geographically weighted model addresses the challenge of spatial heterogeneity, particularly spatial non-stationarity, by learning a distinct model at each location so that the model parameters are location dependent. When learning a local model, training samples are geographically weighted so that nearby samples have higher weights.

One common example is geographically weighted regression (GWR) [fotheringham2002geographically]. GWR can be better explained through comparison with classical linear regression. In linear regression, $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, the model coefficients $\boldsymbol{\beta}$ are assumed to be identical for the entire study area. In contrast, the model coefficients of GWR are location dependent, i.e., $\boldsymbol{\beta}(s)$ at location $s$. The coefficients at a new location $s_0$ can be learned by weighted least squares, where the weight of each training sample is determined by its distance to $s_0$. Specifically, this is shown in Equation 15,

$$\hat{\boldsymbol{\beta}}(s_0) = \big(X^T W(s_0)\, X\big)^{-1} X^T W(s_0)\, \mathbf{y} \qquad (15)$$

where $W(s_0)$ is a diagonal matrix whose element $w_i(s_0)$ is determined by a spatial kernel weighting function, e.g., $w_i(s_0) = \exp(-d_i^2 / h^2)$ with $d_i$ the distance from sample $i$ to $s_0$ and $h$ the bandwidth. A sample that is closer to the current model location has a higher weight, following the spatial autocorrelation effect, as illustrated in Figure 4. It is worth noting that both GWR and SAR (Section 3.1.2) are spatial regression models that incorporate spatial autocorrelation or dependency. The main difference is that SAR still assumes that model coefficients are the same at all locations, and thus does not address the challenge of spatial heterogeneity, while GWR addresses spatial heterogeneity by learning a set of model parameters at each location.

The main idea of geographically weighted models using spatial kernel weighting has been extended to generalized linear models [nakaya2005geographically] and principal component analysis [harris2011geographically]. It can also be generalized to many other methods, such as decision trees and support vector machines. In these cases, we only need to precompute the weights of all samples relative to a location of interest so that a local model can be learned for that location.

The advantages of geographically weighted models include simplicity and effectiveness in incorporating local effects. Two main limitations exist, though. First, the computational cost is high, since a model needs to be learned for each location of interest in the continuous space. Second, the geographical weights of samples are determined by spatial kernel functions, assuming that the geographical influence of samples on a location depends only on their distances (spatial isotropy). This assumption is often violated by real world geographic data due to spatial anisotropy, e.g., when homogeneous sub-regions have arbitrary shapes.
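The weighted least squares step of GWR can be sketched as follows; the Gaussian kernel, bandwidth, and the synthetic data with a spatially varying slope are illustrative assumptions.

```python
import numpy as np

def gwr_coefficients(coords, X, y, s0, bandwidth=1.0):
    """Weighted least squares at location s0 with Gaussian kernel weights
    w_i = exp(-d_i^2 / h^2), following the general form of GWR."""
    d = np.linalg.norm(coords - s0, axis=1)
    w = np.exp(-(d ** 2) / bandwidth ** 2)
    Wd = np.diag(w)
    return np.linalg.solve(X.T @ Wd @ X, X.T @ Wd @ y)

rng = np.random.default_rng(3)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Spatially varying true slope: 1 in the west half, 3 in the east half.
slope = np.where(coords[:, 0] < 5, 1.0, 3.0)
y = 0.5 + X[:, 1] * slope + rng.normal(scale=0.1, size=n)

beta_west = gwr_coefficients(coords, X, y, np.array([1.0, 5.0]), bandwidth=2.0)
beta_east = gwr_coefficients(coords, X, y, np.array([9.0, 5.0]), bandwidth=2.0)
```

A single global regression would average the two regimes, while the two locally weighted fits recover the distinct west and east slopes.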

#### 3.2.3 Decomposition based ensemble

The assumption is that spatial data consist of a number of homogeneous sub-populations, within which the relationship between sample features and class labels is consistent. Based on this assumption, decomposition based ensemble learning uses a divide-and-conquer strategy to first partition data into different homogeneous sub-groups and then learn a local model in each sub-group. This approach generally belongs to ensemble learning [ren2016ensemble, zhou2012ensemble, dietterich2000ensemble], which aims to boost predictive accuracy by learning multiple base models.

Several decomposition based ensemble methods partition multi-modular input data in the feature vector space, including mixture of experts [jacobs1991adaptive, jordan1994hierarchical] and multimodal ensemble [KarpatneK15, karpatne2015ensemble]. Partitioning is often done via feature clustering or a gating network. However, partitioning input data in the feature vector space may not effectively separate samples with class ambiguity, i.e., samples whose similar explanatory features belong to different classes in different spatial regions. Several approaches have been proposed to partition spatial samples in the geographical space. One approach is to use auxiliary information such as road networks and census blocks together with spectral information to segment satellite imagery into different sub-regions [vatsavai_hybrid_2011]. Similarly, a two-step spatial ensemble method has been proposed that first segments sample locations into homogeneous patches and then groups patches into different zones via a bisecting algorithm [jiang_ensemble_2017]. Another approach is based on a competition strategy [radosavljevic_spatio-temporal_2008], which involves an initial partitioning and sample shifting across zone boundaries.

Decomposition based ensemble has several advantages. First, the spatial footprints of local models can have arbitrary shapes. This overcomes the limitation of geographically weighted models, which assume spatial isotropy (circular footprints). Second, once spatial decomposition is done in preprocessing, traditional classification models can be used within each sub-region without the need to develop new classification algorithms. Several limitations exist as well. First, finding a good decomposition in geographic space is computationally challenging. Second, most decomposition based spatial ensemble approaches assume that future test samples lie within the same spatial framework as the training samples. Thus, the methods cannot be applied to test samples beyond the training area.
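A toy sketch of decomposition based spatial ensemble follows; the 2-means partition of coordinates and the per-zone decision stump are simple stand-ins (assumptions) for the segmentation and local models in the cited works. The data are constructed with class ambiguity: the same feature values map to opposite classes in the two regions.

```python
import numpy as np

rng = np.random.default_rng(4)

def two_means(coords, iters=10):
    """Tiny 2-means on coordinates with farthest-point initialization,
    used here only to decompose the study area into two zones."""
    c0 = coords[0]
    c1 = coords[np.argmax(np.linalg.norm(coords - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        assign = np.argmin(((coords[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(2):
            centers[c] = coords[assign == c].mean(axis=0)
    return assign

# Two spatial regions with reversed feature-class rules (class ambiguity).
n = 100
coords = np.vstack([rng.normal([0, 0], 0.5, size=(n, 2)),
                    rng.normal([10, 10], 0.5, size=(n, 2))])
feats = rng.normal(size=2 * n)  # same feature distribution everywhere
labels = np.r_[(feats[:n] > 0).astype(int), (feats[n:] <= 0).astype(int)]

zones = two_means(coords)
preds = np.empty(2 * n, dtype=int)
for z in (0, 1):
    idx = zones == z
    # Local model per zone: a decision stump whose direction is chosen to
    # best fit that zone's training samples.
    pos = (feats[idx] > 0).astype(int)
    preds[idx] = pos if (pos == labels[idx]).mean() >= 0.5 else 1 - pos

ensemble_acc = (preds == labels).mean()
global_acc = max(((feats > 0).astype(int) == labels).mean(),
                 ((feats <= 0).astype(int) == labels).mean())
```

Because the rule reverses between regions, the best single global stump cannot beat chance, while the zone-wise ensemble recovers both local rules.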

#### 3.2.4 Multi-task learning

Multi-task learning is a common machine learning technique for heterogeneous data [Caruana1997]. The main idea is to group learning samples into different tasks and to simultaneously learn the models of different tasks according to task relatedness (e.g., models from related tasks share the same set of features). When used to address spatial heterogeneity, the multi-task learning approach is similar to the decomposition based ensemble approach above in that it also learns local models (tasks) in different regions or locations, but it differs in that the parameters of local models are jointly learned according to task relatedness (nearby models tend to have similar parameters).

The main question is how to identify different tasks and task relationships. One approach is to use each location (spatial point or raster cell) as a task and spatial neighborhood relationships as task relatedness, which can be determined by spatial distance or cell adjacency [subbian_climate_2013, DBLP:conf/kdd/ZhaoSYCLR15]. Another approach is to conduct spatial clustering on sample locations; each cluster is a task, and clusters that are closer to each other are more related [DBLP:conf/sdm/DataKKBK14]. Task relatedness can also be determined by inferring conditional independence of class probability distributions [gonccalves2015multi]. When learning models from related tasks, a graph Laplacian regularizer can be used to enforce that nearby models have similar parameters [subbian_climate_2013, DBLP:conf/sdm/DataKKBK14] (details are in the discussion of spatial regularization for multi-model prediction in Section 3.1.3). The graph Laplacian regularizer not only incorporates the spatial autocorrelation effect but also reduces overfitting when the number of training samples is limited. The base models can be linear models or generalized linear models such as logistic regression, as well as other models, as long as the objective function is differentiable.

Similar to the decomposition-based ensemble, the multi-task learning approach has the advantage of flexibility in the spatial footprint shapes of sub-regions. It has the additional advantages of incorporating spatial autocorrelation and avoiding overfitting with limited training samples. However, it also places more constraints on the choice of base models (those with differentiable objective functions). In addition, determining sub-regions for local tasks as well as relationships between tasks can be non-trivial. A detailed comparison of geographically weighted models, decomposition-based spatial ensembles, and multi-task learning approaches is summarized in Table V.

### 3.3 Limited Ground Truth

In real world spatial prediction problems, input data often contain abundant explanatory features but very limited ground truth. For example, in earth image classification for land cover mapping, a large number of learning samples (image pixels) are collected with explanatory features (spectral band values), but only a small set of these samples have ground truth (land cover types). Collecting ground truth is both expensive and time consuming, requiring sending a field crew on the ground or recruiting visual interpreters. There are two general strategies to address this challenge: semi-supervised learning and active learning.

#### 3.3.1 Semi-supervised learning

Semi-supervised learning methods utilize labeled samples together with unlabeled samples in model learning to improve prediction performance. There are many different semi-supervised learning methods, including generative models such as a mixture of Gaussians with the EM algorithm, graph-based methods, transductive support vector machines, co-training, and self-training (more details can be found in a survey [zhu2005semi]). Due to space limits, we briefly introduce three popular methods that have been used in spatial classification and prediction problems.

*Generative mixture of models and EM algorithm:*

This method assumes that feature values of samples from different classes follow a mixture model such as a mixture of Gaussians. Unlabeled samples are used to improve the estimate of the conditional distribution of feature values under each class. For example, in the Gaussian mixture model, the joint distribution of features and classes is $p(\mathbf{x}, y = c \mid \theta) = \pi_c \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$, where $\pi_c$ is the prior probability $p(y = c)$, and $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ are the mean and covariance of the normal distribution for the conditional probability $p(\mathbf{x} \mid y = c)$. Unlabeled samples can be incorporated in the maximum likelihood estimation of model parameters. Specifically, the log likelihood function with both labeled and unlabeled samples can be written as

$$\log p(D \mid \theta) = \sum_{i=1}^{n_l} \log \big( \pi_{y_i} \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_{y_i}, \boldsymbol{\Sigma}_{y_i}) \big) + \sum_{i=n_l+1}^{n_l+n_u} \log \sum_{c=1}^{C} \pi_c \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \tag{16}$$

where $\theta$ represents all parameters, and $n_l$ and $n_u$ are the numbers of labeled and unlabeled samples respectively. The log likelihood of unlabeled samples is incorporated in the second term of Equation 16, where their unknown class labels (latent variables) are marginalized out (summation over $c$). The Expectation-Maximization (EM) algorithm [dempster1977maximum] can be used to iteratively update model parameters and the hidden classes of unlabeled samples (latent variables).

The Gaussian mixture model has been used in semi-supervised classification of earth imagery with maximum likelihood classifiers [DBLP:conf/icdm/VatsavaiSB08, DBLP:conf/igarss/VatsavaiBSB08]. Its advantages include clear assumptions and theoretical properties. The main limitation is that the assumptions of i.i.d. distribution and Gaussian class conditional distribution can be violated by real world spatial data. Ideas have been explored that incorporate the Markov property into semi-supervised maximum likelihood classifiers [DBLP:journals/paapp/VatsavaiSB07].
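As a concrete illustration of the EM procedure (not of the cited classifiers), the sketch below runs semi-supervised EM on a simulated one-dimensional, two-class Gaussian mixture, with labeled samples clamped to their known classes:

```python
import numpy as np

rng = np.random.default_rng(1)

def npdf(x, m, s):                 # normal density
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Toy 1-D, two-class mixture: 10 labeled samples, 400 unlabeled.
xl = np.concatenate([rng.normal(0, 1, 5), rng.normal(4, 1, 5)])
yl = np.array([0] * 5 + [1] * 5)
xu = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])

# Initialize parameters from the labeled samples only.
pi = np.array([0.5, 0.5])
mu = np.array([xl[yl == 0].mean(), xl[yl == 1].mean()])
sd = np.array([1.0, 1.0])

x_all = np.concatenate([xl, xu])
for _ in range(50):
    # E-step: responsibilities for unlabeled samples; labeled samples
    # keep their known classes (responsibility 0 or 1).
    dens = np.stack([pi[c] * npdf(xu, mu[c], sd[c]) for c in (0, 1)])
    r = dens / dens.sum(axis=0)
    # M-step: re-estimate priors, means, variances from both sample sets.
    for c in (0, 1):
        w = np.concatenate([(yl == c).astype(float), r[c]])
        pi[c] = w.sum() / len(x_all)
        mu[c] = (w * x_all).sum() / w.sum()
        sd[c] = np.sqrt((w * (x_all - mu[c]) ** 2).sum() / w.sum())

print(mu.round(2))  # recovers the class means (near 0 and 4)
```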

*Graph-based methods:* The graph-based approach first constructs a graph, in which nodes are spatial data samples (both labeled and unlabeled) and the edges are determined by the distance or similarity between samples (e.g., feature similarity, geographical proximity, or both via composite kernels [camps2007semi]). The main assumption is that samples that are close to each other have a high chance of sharing the same class labels. The loss function thus consists of two components, one for the prediction accuracy on labeled samples, and the other for the smoothness of predictions on neighboring samples. Specifically,

$$\min_f \; \sum_{i=1}^{n_l} \big( f(\mathbf{x}_i) - y_i \big)^2 + \lambda \sum_{i,j} w_{ij} \big( f(\mathbf{x}_i) - f(\mathbf{x}_j) \big)^2 \tag{17}$$

where $y_i$ and $f(\mathbf{x}_i)$ are the true response and predicted response for the sample at location $i$, and $w_{ij}$ is the element of the W-matrix corresponding to samples $i$ and $j$, whose value can be determined by feature similarity or spatial proximity. Various computational algorithms exist to estimate class labels that minimize the loss function, including the graph min-cut algorithm, iterations with neighborhood updates, or a closed form solution when the loss function is differentiable [zhu2005semi]. A graph regularizer has also been used with support vector machines for hyperspectral earth image classification [gomez2008semisupervised].

The graph-based method looks similar to the graph Laplacian regularization method in Section 3.1.3. The main difference is that in semi-supervised learning, there is only one model and the graph regularizer smoothes model predictions on nearby samples, while in multiple model prediction in Section 3.1.3, there are multiple models at different locations and the graph regularizer smoothes the parameters of nearby models. The advantage of graph-based methods for semi-supervised spatial prediction is its simplicity and intuitiveness. However, its performance may be degraded if the assumption on the smoothness of neighboring classes is violated.
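A minimal sketch of the iterative neighborhood-update algorithm on simulated data (a one-dimensional transect with two labeled endpoints; the Gaussian kernel and its bandwidth are illustrative choices):

```python
import numpy as np

# Toy setting: 20 samples on a one-dimensional transect; the two endpoints
# are labeled (0 and 1) and the rest are unlabeled. W encodes spatial
# proximity via a Gaussian kernel.
coords = np.linspace(0, 1, 20)
W = np.exp(-((coords[:, None] - coords[None, :]) ** 2) / 0.01)
np.fill_diagonal(W, 0)

labeled = {0: 0.0, 19: 1.0}          # sample index -> known response
f = np.full(20, 0.5)                 # initial guess for all samples
for i, v in labeled.items():
    f[i] = v

# Iterative neighborhood updates: each unlabeled prediction is replaced
# by the weighted average of its neighbors (minimizing the smoothness
# term of the graph loss), while labeled samples stay clamped.
for _ in range(500):
    f = (W @ f) / W.sum(axis=1)
    for i, v in labeled.items():
        f[i] = v

print(f.round(2))  # a smooth ramp from 0 at one end to 1 at the other
```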

*Self-training and co-training:* Self-training methods iteratively expand the set of training samples by adding predicted classes on unlabeled samples. Predictions with the highest confidence will be selected and added into the training set. The model is iteratively updated based on the expanded training set. In spatial classification, spatial neighborhood expansion is often used to select unlabeled samples, i.e., we can select unlabeled samples that are spatial neighbors of a labeled sample [dopido2013semisupervised, tan2015novel]. This follows the first law of geography, i.e., nearby samples tend to resemble each other in their class labels. Co-training is another semi-supervised learning method. Similar to self-training, it augments the training set by selecting samples with the highest confidence from model prediction. The difference is that co-training uses multiple conditionally independent feature sets (or views) and learns one classifier for each feature set (or view). Each classifier selects an unlabeled sample with highest prediction confidence and adds it to the shared expanded training set. Co-training has been used for spatial classification problems, particularly earth image classification. In this case, different feature sets (views) can be spectral versus spatial features [hong2015spatial], derived features from different image sub-blocks [zhang2014modified], or features from different data sources with various resolutions such as Landsat [landsat], MODIS [modis], and high resolution aerial photos.
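The sketch below illustrates self-training with spatial neighborhood expansion on simulated data; the study area, feature, base classifier (a tiny class-conditional Gaussian model standing in for any probabilistic model), and all thresholds are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical study area: 300 samples, one noisy feature separating two
# classes; only 10 samples start with ground-truth labels.
coords = rng.uniform(0, 10, size=(300, 2))
y_true = (coords[:, 0] > 5).astype(int)
x = y_true + 0.3 * rng.normal(size=300)
seed = np.concatenate([np.where(y_true == 0)[0][:5],
                       np.where(y_true == 1)[0][:5]])
labels = {int(i): int(y_true[i]) for i in seed}

def fit(idx):
    # Tiny generative classifier: class-conditional Gaussians, shared scale.
    xs = x[idx]; ys = np.array([labels[i] for i in idx])
    return xs[ys == 0].mean(), xs[ys == 1].mean(), xs.std() + 1e-6

def prob1(model, xs):               # P(class = 1 | x), equal priors
    m0, m1, s = model
    d0 = np.exp(-0.5 * ((xs - m0) / s) ** 2)
    d1 = np.exp(-0.5 * ((xs - m1) / s) ** 2)
    return d1 / (d0 + d1)

for _ in range(10):
    model = fit(sorted(labels))
    # Spatial neighborhood expansion: candidates are unlabeled samples
    # within distance 1 of an already labeled sample (first law of
    # geography: nearby samples tend to share class labels).
    cand = [j for j in range(300) if j not in labels and
            min(np.linalg.norm(coords[j] - coords[i]) for i in labels) < 1]
    if not cand:
        break
    p1 = prob1(model, x[cand])
    conf = np.maximum(p1, 1 - p1)
    for k in np.argsort(conf)[-20:]:    # pseudo-label the most confident
        labels[cand[k]] = int(p1[k] > 0.5)

acc = ((prob1(model, x) > 0.5).astype(int) == y_true).mean()
print(round(float(acc), 2))
```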

The advantages of self-training and co-training include that the methods are general across many different model families and that the process is automated without extra human intervention. The limitation is that the methods rely on accurate model predictions. If a model prediction with the highest confidence is still erroneous, adding the predicted samples into the training set can further degrade model performance. In addition, the computational cost is high since a model needs to be re-trained in each iteration.

#### 3.3.2 Active learning

Active learning [settles2010active, tuia2011survey] addresses the challenge of limited ground truth by manually labeling a carefully selected subset of unlabeled samples. The main question is how to select the subset of unlabeled samples that can enhance prediction performance the most and with the minimum labeling costs. There are several strategies to do this, including selecting samples with the highest uncertainty, samples on which a committee of models disagree the most, samples that have the highest expected model change, or samples with the best expected error reduction. The process is often iterative: models are re-trained after new training samples are added. More details can be found in a survey [settles2010active].

For spatial prediction, sample locations need to be considered when selecting unlabeled samples, because collecting class labels often involves sending a field crew traveling between locations on the ground. If the selected locations of unlabeled samples are spatially disperse, the travel costs will be high. Thus, unlabeled samples that are spatially clustered are preferred over those that are spatially disperse. To this end, a region-based active learning method [stumpf2014active] has been proposed for remote sensing image classification. In each iteration, the method selects a window of pixels on which a committee of models has the most overall disagreement. Another approach is spatially cost-sensitive active learning [liu_spatially_2009], which finds a travel route that covers the top-K most uncertain samples with the minimum travel cost.
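A minimal sketch of the region-based selection idea (uncertainties are simulated rather than produced by a committee of models, and the window radius is an illustrative parameter):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical pool of unlabeled samples: map coordinates plus the current
# model's predicted probability of class 1 (simulated here for brevity).
coords = rng.uniform(0, 10, size=(500, 2))
p = rng.uniform(0.01, 0.99, size=500)
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # per-sample uncertainty

# Region-based selection: rather than picking the top-K most uncertain
# samples scattered across the map, score each candidate window (radius 1
# around a sample) by its total uncertainty and label the single best
# window, so the field crew makes one compact trip.
best_i = max(range(500), key=lambda i: entropy[
    np.linalg.norm(coords - coords[i], axis=1) < 1.0].sum())
batch = np.where(np.linalg.norm(coords - coords[best_i], axis=1) < 1.0)[0]
print(len(batch))  # a spatially compact batch to label in one trip
```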

The advantage of the active learning approach is that it can improve labeling efficiency since the most uncertain samples are labeled first. Its main limitation is that human labor is still needed, so the amount of ground truth data being collected is often limited. In addition, human experts who manually collect ground truth labels may need to wait for model re-training in each iteration.

### 3.4 Multiple Scales and Resolutions

Another challenge in spatial prediction is that spatial data may exist at multiple spatial scales and resolutions. For example, resolutions of earth observation imagery range from sub-meter to hundreds of meters. In addition to raster imagery, spatial data can also contain point reference data such as soil samples. Many existing predictive models assume that data samples are from the same scale and resolution, and thus are not directly applicable to multi-scale and multi-resolution data. Moreover, spatial patterns or relationships between explanatory features and the target response variable can be scale dependent (also called the Modifiable Areal Unit Problem (MAUP) [wong2009modifiable]).

One existing strategy to address this challenge is to build spatial hierarchical models. A hierarchical model has multiple layers ordered by spatial scales, and each layer consists of spatial units at the same scale or resolution. Each spatial unit in a layer has its own predictive model, and relationships between parameters in different models can be established based on the scale hierarchy. As a specific example, a hierarchical multi-source feature learning framework [DBLP:conf/kdd/ZhaoYCLR16] has been proposed to forecast spatiotemporal events based on features from multiple scales including cities, states, and countries. Another example is the problem of modeling the count of caries in human teeth. Each subject (person) has a number of teeth, and each tooth has different surfaces. Spatial neighborhood relationships exist across nearby tooth surfaces in the same subject. A Bayesian hierarchical model has been used to capture such spatial hierarchy [bandyopadhyay2009bayesian]. Spatial hierarchical models have also been used in crime event forecasting [yu2016hierarchical]. In this work, distributed spatio-temporal patterns from multi-resolution data are ensembled in predictive modeling.

Hierarchical spatial models have several advantages, including intuitive model design, utilizing data from different sources in different scales or resolutions, and capturing spatial dependency within and across spatial scales. The main limitation lies in model complexity, e.g., high computational cost, and risk of overfitting.

### 3.5 Spatiotemporal Extensions to Prediction Methods

In many spatial prediction problems, both space and time are critically important. For example, the time of day (e.g., rush hour versus non-rush hour) plays an important role in air quality prediction due to its impacts on the amount of traffic emissions. Adding the time dimension into spatial data creates new data representation and statistical concepts. More details can be found in a recent survey on spatiotemporal data mining [shekhar2015spatiotemporal].

Compared with spatial prediction, spatiotemporal prediction poses two unique challenges: spatiotemporal autocorrelation and temporal non-stationarity. The effect of *spatiotemporal autocorrelation* means that samples tend to resemble each other not only at nearby locations but also at close times. The effect of *temporal non-stationarity* means that statistical properties are dynamic over time. Addressing these challenges requires extensions to existing spatial prediction methods. Due to space limits, we only introduce a few well-known examples.

#### 3.5.1 Spatiotemporal Autocorrelation

*Spatiotemporal contextual features*: Many methods discussed in Section 3.1.1 that generate spatial contextual features can be readily used or extended for spatiotemporal data. For example, additional contextual features can be generated in temporal sequences of raster data (e.g., earth imagery) based on sample attributes in spatiotemporal neighborhoods [mennis2005cubic]. Contextual features can also be generated by fusing data from multiple sources into a common spatiotemporal framework (e.g., spatial grid cells and time intervals) [zheng2015methodologies]. After features are generated, we can apply traditional non-spatiotemporal prediction methods.

*Spatial panel data model*: Spatial panels refer to spatiotemporal data containing time series observations at a number of spatial locations (e.g., zip codes, cities, states) [fischer2009handbook]. Spatial panel data models, therefore, refer to prediction models whose response variables are spatial panels. One simple form is a pooled linear regression model with spatial specific effects but without spatial interaction effects,

$$y_{it} = \mathbf{x}_{it}^T \boldsymbol{\beta} + \mu_i + \epsilon_{it}, \quad i = 1, \ldots, N, \;\; t = 1, \ldots, T \tag{18}$$

where $N$ and $T$ are the numbers of locations and time steps respectively, $\mu_i$ is the spatial specific term (depending on location regardless of time), and $\epsilon_{it}$ is i.i.d. Gaussian noise. The model can also be represented in matrix form,

$$\mathbf{Y}_t = \mathbf{X}_t \boldsymbol{\beta} + \boldsymbol{\mu} + \boldsymbol{\epsilon}_t \tag{19}$$

where $\mathbf{Y}_t$ is an $N$ by $1$ response vector at time $t$, $\mathbf{X}_t$ is an $N$ by $k$ covariate (feature) matrix at time $t$, $\boldsymbol{\mu}$ is an $N$ by $1$ vector for the spatial specific effect, and $\boldsymbol{\epsilon}_t$ is an $N$ by $1$ vector with i.i.d. Gaussian residual errors. Based on this simple form, a spatial autoregressive (interaction) term (Section 3.1.2) can be added to the response vector (Equation 20), or to the covariate matrix (Equation 21), or to the residual errors (Equation 22 with $\mathbf{u}_t$).

$$\mathbf{Y}_t = \delta \mathbf{W} \mathbf{Y}_t + \mathbf{X}_t \boldsymbol{\beta} + \boldsymbol{\mu} + \boldsymbol{\epsilon}_t \tag{20}$$

$$\mathbf{Y}_t = \mathbf{X}_t \boldsymbol{\beta} + \mathbf{W} \mathbf{X}_t \boldsymbol{\theta} + \boldsymbol{\mu} + \boldsymbol{\epsilon}_t \tag{21}$$

$$\mathbf{Y}_t = \mathbf{X}_t \boldsymbol{\beta} + \boldsymbol{\mu} + \mathbf{u}_t, \quad \mathbf{u}_t = \rho \mathbf{W} \mathbf{u}_t + \boldsymbol{\epsilon}_t \tag{22}$$

Spatial panel data models extend simple pooled linear regression with spatial specific effects and spatial interaction effects, but without considering the temporal autocorrelation effect (dependency at nearby time steps).
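As a numerical illustration, the sketch below simulates the spatial-lag form of the panel model, Y_t = δWY_t + X_tβ + μ + ε_t, on a hypothetical ring of locations and recovers the covariate effect. For simplicity it assumes the spatial lag coefficient δ is known (in practice it is estimated, e.g., by maximum likelihood):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical panel: N locations on a ring, T time steps, one covariate.
N, T = 30, 50
W = np.zeros((N, N))
for i in range(N):                         # row-normalized ring adjacency
    W[i, (i - 1) % N] = W[i, (i + 1) % N] = 0.5

beta, delta = 2.0, 0.4                     # covariate effect, spatial lag
mu = rng.normal(size=N)                    # spatial specific effects
A_inv = np.linalg.inv(np.eye(N) - delta * W)

X = rng.normal(size=(T, N))
# Y_t = delta * W Y_t + beta * X_t + mu + eps_t, solved for Y_t.
Y = np.stack([A_inv @ (beta * X[t] + mu + 0.1 * rng.normal(size=N))
              for t in range(T)])

# With delta known, spatially filter the response:
# (I - delta*W) Y_t = beta * X_t + mu + eps_t,
# then demean over time to remove mu and regress.
Z = Y @ (np.eye(N) - delta * W).T
Zc, Xc = Z - Z.mean(axis=0), X - X.mean(axis=0)
beta_hat = (Xc * Zc).sum() / (Xc * Xc).sum()
print(round(float(beta_hat), 2))  # close to the true beta = 2.0
```

Demeaning over time is the standard within transformation for removing the spatial specific effects μ before estimating β.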

*Spatiotemporal autoregressive model (STAR)*: The spatiotemporal autoregressive model (STAR) extends the spatial autoregressive model (SAR) by incorporating temporal autoregression. Specifically, the SAR model can be generally written as $\mathbf{y} = \rho \mathbf{W} \mathbf{y} + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$. The term $\rho \mathbf{W} \mathbf{y}$ helps to remove the spatial autoregressive effect from the response vector so that residual errors are i.i.d. The spatiotemporal autoregressive model generalizes $\mathbf{y}$ and $\mathbf{X}$ from spatial data to spatiotemporal data (stacking all samples into different rows), and also extends $\rho \mathbf{W} \mathbf{y}$ to incorporate a temporal autoregressive effect $\rho_t \mathbf{W}_t \mathbf{y}$ and a spatiotemporal autoregressive effect $\rho_{st} \mathbf{W}_s \mathbf{W}_t \mathbf{y}$, where $\mathbf{W}_s$ and $\mathbf{W}_t$ model spatial neighbors and temporal neighbors respectively.

*Spatiotemporal Kriging*: Spatiotemporal Kriging is an extension of Kriging for spatiotemporal interpolation. It assumes that observations in continuous space and time follow a multi-variate Gaussian distribution $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The mean vector can be modeled as a linear function of sample covariates and coefficients, and the covariance matrix can be determined based on a *spatiotemporal covariogram*. Assuming both spatial and temporal stationarity, the covariance between two variables is location and time invariant, as in Equation 23. The function

$$C(\mathbf{h}, u) = \mathrm{Cov}\big( z(\mathbf{s}, t), \; z(\mathbf{s} + \mathbf{h}, t + u) \big) \tag{23}$$

is called the spatiotemporal covariogram, where $\mathbf{h}$ and $u$ are the spatial and temporal lags. Similar to Kriging, once the spatiotemporal covariogram function is empirically estimated, we can derive the covariance matrix of the joint Gaussian distribution based on sample location and time differences. An unknown observation $z_0$ at a new location and time can be estimated through conditional expectation, as shown in Equation 24,

$$\hat{z}_0 = E[z_0 \mid \mathbf{z}] = \mu_0 + \mathbf{c}_0^T \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \boldsymbol{\mu}) \tag{24}$$

where $\mathbf{c}_0$ is the vector of covariances between the new sample and the observed samples.
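A small numerical sketch of this conditional-expectation estimate, using a hypothetical separable exponential covariogram and made-up zero-mean observations (a nugget term is added for numerical stability):

```python
import numpy as np

# Hypothetical separable covariogram: C(h, u) = exp(-|h|/2) * exp(-|u|/5)
# for spatial lag h and temporal lag u (stationary in space and time).
def covariogram(s1, t1, s2, t2):
    return np.exp(-abs(s1 - s2) / 2.0) * np.exp(-abs(t1 - t2) / 5.0)

# Made-up zero-mean observations: (location, time, value).
obs = np.array([[0.0, 0.0, 1.2], [1.0, 1.0, 0.9],
                [3.0, 2.0, -0.4], [2.0, 4.0, 0.3]])
s, t, z = obs[:, 0], obs[:, 1], obs[:, 2]
n = len(z)

# Covariance among observations, plus a small nugget for stability.
K = np.array([[covariogram(s[i], t[i], s[j], t[j]) for j in range(n)]
              for i in range(n)]) + 1e-6 * np.eye(n)

# Prediction at a new location and time: the conditional mean of the
# joint Gaussian, z_hat = c0' K^{-1} z (zero mean), with its variance.
s0, t0 = 1.5, 2.0
c0 = np.array([covariogram(s0, t0, s[i], t[i]) for i in range(n)])
z_hat = c0 @ np.linalg.solve(K, z)
var_hat = covariogram(s0, t0, s0, t0) - c0 @ np.linalg.solve(K, c0)
print(round(float(z_hat), 3), round(float(var_hat), 3))
```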

#### 3.5.2 Temporal Non-stationarity

Temporal non-stationarity means that the sample distribution can vary over time. This poses a challenge in that we cannot fit the same prediction model for all time steps. Addressing the challenge often requires the design of spatiotemporal dynamic models.

*Spatiotemporal dynamic models*: Spatiotemporal dynamic models consider spatiotemporal observation data as a sequence of temporal snapshots, $\mathbf{Z}_1, \ldots, \mathbf{Z}_T$, where $\mathbf{Z}_t = [z(\mathbf{s}_1, t), \ldots, z(\mathbf{s}_N, t)]^T$, and $N$ is the number of spatial locations [cressie2015statistics2]. Moreover, observations depend on a corresponding sequence of temporal snapshots in the form of hidden variables (a hidden process) $\mathbf{Y}_1, \ldots, \mathbf{Y}_T$. Temporal autocorrelation (dependency) is usually modeled via the Markov property on the hidden process. This is formally written in Equation 25 and Equation 26, where $\boldsymbol{\epsilon}_t$ and $\boldsymbol{\eta}_t$ are noise, $f$ is a function that captures the dependency of observations on hidden variables at a specific time step, and $g$ is a function that captures transitions of hidden variables over time. In the simplest form, $f$ and $g$ are linear, and $\boldsymbol{\epsilon}_t$ and $\boldsymbol{\eta}_t$ are Gaussian. There are other more complicated cases that incorporate spatial heterogeneity and temporal dynamics [stroud2001dynamic]. The hidden process variables can be configured based on physics or theories from an application domain, making spatiotemporal dynamic models useful tools in many fields such as meteorology and climate science.

$$\mathbf{Z}_t = f(\mathbf{Y}_t) + \boldsymbol{\epsilon}_t \tag{25}$$

$$\mathbf{Y}_t = g(\mathbf{Y}_{t-1}) + \boldsymbol{\eta}_t \tag{26}$$
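In the linear-Gaussian case, inference on the hidden process reduces to the classical Kalman filter. The sketch below uses a hypothetical two-location transition matrix and simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Linear-Gaussian special case at N = 2 locations:
# z_t = H y_t + eps_t (observation), y_t = M y_{t-1} + eta_t (transition).
N, T = 2, 100
H = np.eye(N)
M = np.array([[0.8, 0.1], [0.1, 0.8]])   # hypothetical transition mixing neighbors
Q, R = 0.05 * np.eye(N), 0.2 * np.eye(N)

# Simulate a hidden process and noisy observations.
y = np.zeros((T, N))
z = np.zeros((T, N))
for t in range(1, T):
    y[t] = M @ y[t - 1] + rng.multivariate_normal(np.zeros(N), Q)
    z[t] = H @ y[t] + rng.multivariate_normal(np.zeros(N), R)

# Kalman filter: sequential inference of the hidden state.
m, P = np.zeros(N), np.eye(N)
est = np.zeros((T, N))
for t in range(T):
    m, P = M @ m, M @ P @ M.T + Q                 # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    m = m + K @ (z[t] - H @ m)                    # update with observation
    P = (np.eye(N) - K @ H) @ P
    est[t] = m

print(round(float(np.mean((est - y) ** 2)), 3))   # filtered error
```

The filtered estimates should track the hidden process more closely than the raw observations do, since the filter averages information over time according to the transition model.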

## 4 Future Research Opportunities

This section summarizes future research opportunities. Most existing spatial prediction methods focus on the challenge of spatial autocorrelation. A few methods address spatial heterogeneity in the form of non-stationarity. The challenges of heterogeneous spatial data with anisotropic spatial dependency, multiple scales, and multiple modalities are largely underexplored. Moreover, the emergence of deep learning and spatial big data also represents new frontiers of spatial prediction research.

### 4.1 Prediction for Heterogeneous Spatial Data

Existing prediction methods addressing spatial heterogeneity focus on non-stationarity, assuming that training and test samples are within the same spatial framework, and that the sample distribution is isotropic. However, in the real world, spatial dependency can be anisotropic and test samples can lie beyond the training area. Thus, novel spatial prediction methods need to be developed.

*Prediction with anisotropic spatial dependency*: In real world spatial data, spatial dependency across sample locations can be anisotropic, instead of being uniform in all directions in the Euclidean space. For example, dependency between observations on spatial networks (e.g., pollutants in river networks or traffic accidents on road networks) often follows the network topology (e.g., river flow directions, traffic directions). As another example, flood water locations (pixels) in earth observation imagery implicitly follow terrain and topography due to gravity (i.e., water flows to lower elevations). Anisotropic prediction on spatial networks poses unique challenges due to directional spatial dependency and expensive network distance computation. Recently, several spatial statistical methods have been generalized from Euclidean space to spatial network space, e.g., network spatial autocorrelation, network kernel density, and network Kriging [okabe2012spatial]. However, little research has been done on prediction methods that incorporate anisotropic spatial dependency (i.e., spatial dependency along certain directions). Existing Markov random field methods only reflect undirected spatial dependency, and thus cannot be easily applied. Bayesian networks have been used to model directed dependency, but they do not scale to a large number of nodes (locations). Recently, a hidden Markov tree model has been proposed to capture anisotropic spatial dependency by a reverse tree structure in the hidden class layer, but it has only been studied in the context of flood mapping applications [jiangkdd2018, jiang2019hidden, jiang2019geographical, sainju2020hidden]. Addressing the challenge requires innovations in model structure, regularizers in objective functions, as well as effective and efficient learning algorithms.

*Model transfer across heterogeneous spatial regions*: Due to the effect of spatial heterogeneity, prediction models learned from one region may not perform well in another. This is an issue since in many spatial prediction problems, test samples may not lie within the same spatial domain as training samples. In machine learning, similar issues have been studied through transfer learning [pan2010survey] and domain adaptation [daume2006domain], but corresponding methods for spatial prediction are largely under-explored. Addressing the challenge may require assumptions on the relationships or structure of spatial sample distributions between one region and another, as well as the fusion of auxiliary data to provide common spatial (or spatiotemporal) contexts.

### 4.2 Data Fusion of Multi-scale Spatial Data from Different Sources

Data fusion is the process of combining data from multiple sources to improve inference. Existing research on data fusion includes multi-sensor fusion from a signal processing perspective [hall1997introduction, waltz2001principle, khaleghi2013multisensor] and data integration from a data management perspective [bleiholder2009data, dong2009data]. For spatial prediction, we are more interested in spatial data fusion that can improve predictive performance. A recent survey summarizes techniques to fuse cross-domain data for big data analytics [zheng2015methodologies]. Methods are categorized into stage-based, feature-level-based, and semantic meaning-based.

*Data fusion for multi-resolution earth imagery classification*: Earth observation imagery from different satellite and airborne platforms has different spatial, temporal, and spectral resolutions and coverage. Moreover, imagery from each single source is imperfect, with noise, clouds, and obstacles. Spatial prediction methods that can utilize a diverse portfolio of earth imagery with multiple resolutions are of great practical value. Potential research directions include multi-view learning and multi-instance learning [karpatne2016monitoring].

*Data fusion for multi-modal spatial data*: In many spatial prediction applications, spatial data come in different representations (e.g., points, line-strings, polygons, and raster imagery) and modalities (e.g., geo-social media, geotagged imagery and videos). For example, in precision agriculture, hyperspectral imagery often has high spatial detail and complete spatial coverage, ground soil samples are only taken at several point locations, and crop yields are recorded at the per-plot level. The goal is to predict crop yield in an early growing phase to optimize fertilizer allocation. Utilizing such multi-modal data in spatial prediction requires data fusion and uncertainty quantification.

### 4.3 Deep Learning Methods for Spatial Prediction

Deep learning is a set of machine learning algorithms that use a multi-layer graph structure to extract a hierarchy of features at different levels [goodfellow2016deep]. High-level features in a top layer are built upon low-level features in lower layers. Deep learning models (e.g., deep convolutional neural networks, deep recurrent neural networks) have been shown successful in computer vision and natural language processing tasks. In the last couple of years, deep learning has been applied to spatial prediction problems, particularly on remote sensing imagery. Two recent surveys [zhang2016deep, zhu2017deep] summarize progress in utilizing deep learning techniques in classifying hyperspectral imagery for land cover mapping, radar imagery for target recognition, as well as high-resolution aerial imagery for scene classification and object detection. Based on the types of inputs and outputs, methods can be categorized into per-pixel classification and per-image classification. In *per-pixel* classification, the input network layer consists of spectral and spatial features extracted from spectral band values within the neighborhood of a pixel, and the output layer consists of thematic class categories for that pixel (e.g., land cover types). Based on a sliding window method, all pixels in the image can be classified. Recently, there are works that predict the classes of all pixels in one network architecture end-to-end, such as U-Net [ronneberger2015u]. In *per-image* classification, the input layer consists of all pixels of an image, and the output layer consists of class categories of the entire image (e.g., scenes or object types). This research area is still growing rapidly with open challenges to address.

*Limited ground truth class labels*: A large amount of training data is one important factor for the success of deep learning methods. Unfortunately, in remote sensing applications, collecting ground truth is both expensive and time consuming, as discussed in Section 3.3. There are several potential directions to address the challenge. In some problems such as high-resolution aerial imagery classification, we can leverage existing well-known datasets with similar data types (e.g., the ImageNet dataset [deng2009imagenet], the IARPA functional map of the world challenge dataset [iarpa], the UC Merced land use dataset [yang2010bag]) to train a deep model or to adapt well-trained deep models to a new application. For other problems, however, existing datasets and models may not be readily useful. In this case, collecting ground truth labels by well-trained experts at a large scale is infeasible. Several promising directions include utilizing crowd-sourcing from volunteered geographic information (e.g., Amazon Mechanical Turk, Tomnod.com by DigitalGlobe) and geotagged social media (e.g., tweets, Facebook posts), as well as leveraging physics-based modeling and simulation.

*Enhancing interpretability*: One major limitation of deep learning is the lack of interpretability. This may not be a major concern in business applications, but in scientific fields such as climate science and hydrological science, interpretability is critical for the theoretical development of the field. One potential direction is to incorporate physical theories and constraints in model design and architecture, as generally discussed in theory-guided data science [karpatne2017theory].

### 4.4 Spatial Big Data Prediction

Spatial big data (SBD) [shekhar2012spatial] refers to geo-referenced data whose volume, velocity, and variety exceed the capability of traditional spatial computing platforms. Examples of spatial big data include earth observation imagery (NASA collects petabytes of imagery each year [vatsavai2012spatiotemporal]), GPS trajectories, temporally detailed road networks, and cellphone check-in histories. Making predictions on spatial big data provides unique opportunities for large scale scientific studies such as national water forecasting and global land cover change analysis, but is also technically challenging due to the large data volume. In recent years, spatial big data techniques have been developed, including HadoopGIS [planthaber2012earthdb] and SpatialHadoop [eldawy2015spatialhadoop], GeoSpark [yu2015geospark], GPU-based algorithms [prasad2015vision, puri2013efficient, zhang2012speeding], distributed database systems such as EarthDB [planthaber2012earthdb], as well as Google Earth Engine [googleearthengine]. These existing platforms currently focus mostly on basic spatial operations (e.g., spatial joins) or traditional non-spatial prediction algorithms. Thus, future research is needed to develop parallel spatial prediction algorithms on spatial big data platforms.

*Parallel spatial prediction algorithms*: Algorithm design and platform selection are determined by the computational structure of spatial prediction algorithms. Common computational structures include filter-and-refine [shekhar2003spatial], divide-and-conquer, matrix operations, and iteration (iteration is common in Expectation-Maximization, Newton-Raphson, and Markov chain Monte Carlo simulation). Thus, GPUs and Spark are potentially appropriate platforms.

## 5 Conclusion

This survey provides a systematic overview on spatial prediction techniques. Spatial prediction is of great importance in various application areas, such as earth science, urban informatics, social media analytics, and public health, but is technically challenging due to the unique characteristics of spatial data. We provide a taxonomy of spatial prediction methods based on the challenge they address and discuss several spatiotemporal extensions. We also identify future research opportunities.

## Acknowledgments

This material is based upon work supported by the NSF under Grant No. IIS-1850546, IIS-2008973, CNS-1951974, the National Oceanic and Atmospheric Administration (NOAA), and the University Corporation for Atmospheric Research (UCAR).
