Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence

In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints to simultaneously handle spatial nonstationarity, local homogeneity, and outlier contamination. Compared with existing spatial regression models, our proposed model assumes the existence of a few distinct regression models that are estimated based on observations that exhibit similar response-predictor relationships. As such, the proposed model not only accounts for nonstationarity in the spatial trend, but also clusters observations into a few distinct and homogeneous groups. This aids interpretation, since a few stationary sub-processes are identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contamination from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. A rigorous statistical hypothesis testing procedure has been designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression and spatial segmentation methods.

I Introduction

Many problems in the environmental, economic, and biological sciences involve spatially collected data, and a main problem of interest is investigation of the relationship between a response variable and a set of explanatory variables over the spatial domain using regression modeling. Notably, the relationships between response variables and covariates may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Instead, such relationships may abruptly change at a certain boundary of two neighboring spatial clusters, but stay relatively homogeneous within clusters. Detecting clusters of observations that display similarity in both regression relationships and spatial proximity allows straightforward interpretations of local associations between response variables and covariates. For example, the residential real estate pricing could be quite similar in a local community, but drastically differ for two houses across the street [dolde1997temporal]; a major goal of analyzing functional magnetic resonance imaging (fMRI) data is to detect spatially distributed and functionally linked regions that continuously share information with each other in reaction to different stimuli [van2010exploring]. For all these real-world application settings, the collected data may often contain outliers, which may severely corrupt the analysis results if not properly handled. Overall, the spatial nonstationarity, local homogeneity, and model robustness are three main challenges in spatial regression modeling.

In the nonspatial setting, finite mixture regression models have been used in many areas as an effective exploratory approach to identify heterogeneity in response–predictor relationships. For an overview, see [mclachlan2004finite, fruhwirth2006finite]. To account for outliers or heavy-tailed noise, many algorithms have been developed to estimate the parameters robustly [yu2020selective]. To obtain robust parameter estimates in the presence of outliers, several methods replace the least-squares criterion in the M-step of the expectation maximization (EM) algorithm with a more robust criterion [markatou2000mixture, shen2004outlier, bai2012robust, bashir2012robust, song2014robust, yao2014robust, peel2000robust]. To enable simultaneous model estimation and outlier removal, the penalized mean-shift mixture model [yu2017new] and the least trimmed likelihood estimator [neykov2007robust, dougru2018robust, garcia2010robust, garcia2017robust] were proposed. While these methods can robustly capture the heterogeneous relationship between response and predictor variables, they are not designed to model spatial dependency.

In modeling spatial dependency, conventional nonstationary spatial regression models such as geographically weighted regression (GWR) [brunsdon1996geographically, fotheringham2003geographically, wheeler2010geographically] and Bayesian spatially varying coefficient (SVC) [fuentes2002spectral, banerjee2014hierarchical] models fit as many regression models to the data as there are observations, at the cost of a large computational burden for large spatial datasets, and may sometimes lead to overfitting. In addition, interpretation of the GWR and SVC models requires visual inspection of the coefficient maps to pursue local homogeneity, and these models cannot automatically capture spatially clustered patterns. To automatically detect spatially homogeneous clusters, a penalized spatial regression model has been proposed [li2019spatial], in which a fused-lasso [tibshirani2005sparsity] type of penalty accounts for spatial homogeneity in the linear regression setting. Nevertheless, the spatial smoothness assumption in the above spatial regression models can be violated by natural or man-made discontinuities in the spatial domain. In addition, none of these methods is designed to handle outliers.

Model-based spatial segmentation is another class of methods for spatial data, based on the spatially constrained Gaussian mixture model [nguyen2011gaussian, nguyen2012fast]. Spatial segmentation incorporates spatial information between neighboring pixels into the Gaussian mixture model via a Markov random field (MRF), with the goal of clustering all instances (e.g. pixels in an image), where the distance between two instances depends on both their feature expressions and their spatial proximity. This comes at a high computational cost. While robust spatial segmentation algorithms are available [nguyen2011gaussian, nguyen2012fast], they do not explicitly model the linear relationship between the response and the predictors, but instead simply treat the response and predictors as different features.

In summary, none of the existing methods can robustly model the spatial clustering patterns of linear dependency between response and predictors. We therefore propose a novel Spatial Robust Mixture Regression (SRMR) model that enables the simultaneous detection of spatial regions in which variables have a strong linear dependency.

The key contributions of this work include: (1) We developed the very first computational concept of spatially dependent mixture regression analysis. (2) We provided the SRMR model, which efficiently solves the spatially dependent mixture regression problem and is also equipped with a statistical inference approach to assess regression significance. (3) SRMR enables a new type of spatial segmentation analysis to detect overlapping spatial regions of varied dependencies among subsets of features, which are highly meaningful in context.

II Preliminary

II-A Notations

We denote a scalar, a vector, and a matrix by a lowercase character $x$, a bold lowercase character $\mathbf{x}$, and an uppercase character $X$, respectively. Let $\{(y_i, \mathbf{x}_i, \mathbf{s}_i)\}_{i=1}^{N}$ represent a set of spatial data observed at spatial locations $\mathbf{s}_1, \dots, \mathbf{s}_N$, where the response variable $y_i$ is assumed to be spatially correlated, $\mathbf{x}_i$ is the $p$-dimensional vector of explanatory variables for the observation located at $\mathbf{s}_i$, and $\mathbf{s}_i$ is the 2-dimensional coordinate of the $i$th location. In this study, we only describe and validate the SRMR model on 2-dimensional spatial data. Note that the approach can be directly applied to $d$-dimensional ($d > 2$) spatial data.

II-B Problem statement

To capture the spatially dependent structure of the response variable, we write a standard generalized linear regression model (GLM) for the $i$-th spatial location as follows,

$$g\big(\mathbb{E}[y_i \mid \mathbf{x}_i]\big) = \mathbf{x}_i^\top \boldsymbol{\beta}_i, \qquad y_i = \mathbb{E}[y_i \mid \mathbf{x}_i] + \epsilon_i,$$

where $\boldsymbol{\beta}_i$, $i = 1, \dots, N$, are the regression coefficients for the predictors, $\epsilon_i$ represents random noise with mean 0 and variance $\sigma^2$, and $g(\cdot)$ is the link function. In this work, we assume the identity link for linear regression, so that $y_i = \mathbf{x}_i^\top \boldsymbol{\beta}_i + \epsilon_i$. The intercept can be accommodated by including 1 as an entry of $\mathbf{x}_i$. Unless there is a sufficient number of repeated measurements at each location, the $\boldsymbol{\beta}_i$ are non-identifiable. Because in many cases there is only a single observation per spatial location, certain spatial constraints are enforced to ensure the identifiability of the model parameters.

Definition 1

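Spatially Dependent Mixture Regression. Given a dataset $\{(y_i, \mathbf{x}_i, \mathbf{s}_i)\}_{i=1}^{N}$ consisting of $N$ observations from spatial locations $\mathbf{s}_1, \dots, \mathbf{s}_N$, the goal of spatially dependent mixture regression is to identify spatial regions $R_1, \dots, R_K$ and the number $K$, s.t.,

$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta}_k + \epsilon_i \quad \text{for samples in } R_k,$$

where $\boldsymbol{\beta}_k$ are the regression parameters for the predictors in the $k$-th cluster; $\epsilon_i \sim N(0, \sigma_k^2)$, where $\sigma_k$ represents the noise level of cluster $k$.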

To account for the presence of outliers, we assume that the regions $R_1, \dots, R_K$ are non-overlapping subsets of the whole set of samples, and denote the outlier set as $O$, such that every sample belongs either to exactly one region or to $O$. Two types of outliers are considered here:
Type 1 Outliers: samples that do not fit any regression model.
Type 2 Outliers: samples that fit a certain regression model but are not located near the corresponding spatial region.

Note that prior assumptions about the spatial regions are needed to enable a valid solution of the spatially dependent mixture regression problem. Such assumptions include a connected spatial region, a compact shape, or high enrichment in a certain region. Because spatially dependent mixture regression assigns each sample to one spatial region, it directly forms a spatial segmentation method.

II-C Related works

Mixture regression and robust estimators. Consider a finite mixture Gaussian regression model with $K$ components parameterized by $\theta = \{(\pi_k, \boldsymbol{\beta}_k, \sigma_k)\}_{k=1}^{K}$. The conditional density of $y$ given $\mathbf{x}$ is $f(y \mid \mathbf{x}, \theta) = \sum_{k=1}^{K} \pi_k \, \phi(y; \mathbf{x}^\top \boldsymbol{\beta}_k, \sigma_k^2)$, where $\phi(\cdot; \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. Many algorithms have been developed to estimate the parameters robustly [yu2020selective] by replacing the least-squares criterion in the M-step of the EM algorithm with a more robust criterion [bai2012robust, song2014robust, yao2014robust, peel2000robust]. To enable simultaneous model estimation and outlier detection, a hyperparameter for the proportion of outlying samples usually needs to be specified, as in the penalized mean-shift mixture model [yu2017new] and the least trimmed likelihood estimator [neykov2007robust, chang2020robmixreg].
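As a minimal sketch of the finite mixture Gaussian regression density described above, the following Python code evaluates the conditional density of a response under a K-component mixture and the posterior component probabilities that an EM E-step would use. The function names, the argument layout, and the toy parameter values in the usage note are illustrative assumptions, not taken from the paper or from any package.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def component_terms(y, x, weights, betas, sds):
    """Weighted component densities pi_k * phi(y; x'beta_k, sd_k^2), one per component.

    x and every beta_k are lists of equal length; put 1.0 in x for an intercept.
    """
    terms = []
    for pi_k, beta_k, sd_k in zip(weights, betas, sds):
        mean_k = sum(b * v for b, v in zip(beta_k, x))
        terms.append(pi_k * normal_pdf(y, mean_k, sd_k))
    return terms


def mixture_density(y, x, weights, betas, sds):
    """Conditional density of y given x under the mixture regression model."""
    return sum(component_terms(y, x, weights, betas, sds))


def responsibilities(y, x, weights, betas, sds):
    """Posterior probability of each component for one observation (E-step)."""
    terms = component_terms(y, x, weights, betas, sds)
    total = sum(terms)
    return [t / total for t in terms]
```

For example, with two equally weighted components with slopes 1 and -1 and unit noise, `responsibilities(2.0, [1.0, 2.0], [0.5, 0.5], [[0.0, 1.0], [0.0, -1.0]], [1.0, 1.0])` places almost all posterior mass on the first component, because the observation lies on its regression line.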

Spatially smooth regression. Conventional nonstationary spatial regression models such as geographically weighted regression (GWR) [brunsdon1996geographically, fotheringham2003geographically, wheeler2010geographically] and Bayesian spatially varying coefficient (SVC) [fuentes2002spectral, banerjee2014hierarchical] models allow regression coefficients to vary smoothly as a function of the spatial domain. For GWR, assuming a linear model with $\mathbf{y}$ denoting the observed response vector and $X$ the design matrix, the regression coefficient at the $i$th location is estimated from $\hat{\boldsymbol{\beta}}_i = (X^\top W_i X)^{-1} X^\top W_i \mathbf{y}$, where $W_i$ is a diagonal weight matrix defined by a kernel function of the distance from all other points to point $i$. The challenge with GWR and SVC models is that they fit as many regression models to the data as there are observations, at the cost of a large computational burden, possible over-fitting, and difficult interpretation. A penalized spatial regression model has been developed to automatically detect clusters [li2019spatial] by incorporating a fused-lasso penalty constructed from spatial proximity.
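To make the GWR estimate concrete, here is a small sketch of kernel-weighted least squares at a single target location, for one predictor plus an intercept. In this special case the weighted normal equations have a closed form, so no matrix library is needed. The Gaussian kernel, the `bandwidth` parameter, and the function name are assumptions chosen for illustration; the text above only says the weights come from a kernel function of distance.

```python
import math


def gwr_fit_at(target, coords, x, y, bandwidth):
    """Local (intercept, slope) at `target` from kernel-weighted least squares.

    coords: list of (u, v) locations; x, y: predictor and response lists.
    Weights use a Gaussian kernel of Euclidean distance (an assumed choice).
    """
    weights = []
    for (u, v) in coords:
        d2 = (u - target[0]) ** 2 + (v - target[1]) ** 2
        weights.append(math.exp(-0.5 * d2 / bandwidth ** 2))

    sw = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / sw

    # Weighted sums of squares and cross-products around the weighted means.
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope
```

Calling `gwr_fit_at` once per observed location reproduces the "one regression per observation" behaviour that makes GWR expensive on large datasets.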

Spatial segmentation. Model-based spatial segmentation aims to segment all samples (e.g. pixels in an image) based on the input features. It adopts an energy function to integrate spatial information, such as neighborhoods, with a regular clustering analysis of the features. Intrinsically, such methods leverage spatial and data consistency to segment spatial regions, i.e., they only consider the covariance of independent variables, and therefore cannot solve the spatially dependent regression problem.

III Method

To solve the problem of spatially dependent mixture regression, computational challenges arise from three aspects: (1) the mixture regression model and spatial consistency do not form one unified likelihood function, which prohibits a direct solution using the EM algorithm, (2) the detection of spatial regions should depend on both goodness of fit and spatial consistency, and (3) there is a lack of a valid approach to assess the statistical significance of mixture regression models.


III-A SRMR algorithm and mathematical considerations

In light of these challenges, we developed the spatial robust mixture regression (SRMR) algorithm to conduct simultaneous outlier detection and spatially dependent mixture regression estimation. The underlying idea is that, by assuming a likelihood function for the spatial regions and introducing a tuning parameter that links it with the likelihood of the mixture regression, a surrogate likelihood function is obtained that enables a modified EM algorithm (Algorithm 1). The inputs of Algorithm 1 include the response and independent variables, the spatial coordinates, and the hyperparameter. It conducts a simplified spatially dependent mixture regression fitting by assuming there are only Type 2 outliers, i.e., samples that fit one mixture component but are not located in the corresponding spatial region. Hence, Algorithm 1 fits a conventional mixture regression model and computes the spatial regions that are most enriched by the samples fitting each regression component. In this study, we assume that the spatial likelihood of a sample is determined by its class and by the Euclidean distance between its spatial coordinate and the centers of the spatial regions, i.e., we assume that the spatial regions form a compact shape. Specifically, a voting step (C-step) is introduced in Algorithm 1, which identifies as Type 2 outliers the samples whose most likely regression component and spatial region are not consistent. Note that, because all input samples are used in the estimation of the mixture regression model, Algorithm 1 always converges.
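The voting idea can be illustrated with a heavily simplified sketch. It is not the authors' Algorithm 1: the tuning parameter and the surrogate likelihood are omitted, each sample is assigned to the regression component with the smallest absolute residual, region centers are taken as the mean coordinates of each component's samples, and a sample is flagged as a Type 2 outlier when its nearest region center disagrees with its regression component. All function names are hypothetical, and every component is assumed to receive at least one sample.

```python
import math


def regression_labels(x, y, betas):
    """Assign each sample to the component (intercept, slope) with the smallest |residual|."""
    labels = []
    for xi, yi in zip(x, y):
        residuals = [abs(yi - (b0 + b1 * xi)) for (b0, b1) in betas]
        labels.append(residuals.index(min(residuals)))
    return labels


def region_centers(coords, labels, n_components):
    """Mean coordinate of the samples assigned to each component (assumed non-empty)."""
    centers = []
    for k in range(n_components):
        pts = [p for p, lab in zip(coords, labels) if lab == k]
        centers.append((sum(p[0] for p in pts) / len(pts),
                        sum(p[1] for p in pts) / len(pts)))
    return centers


def spatial_labels(coords, centers):
    """Assign each sample to the region whose center is closest in Euclidean distance."""
    labels = []
    for p in coords:
        dists = [math.dist(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def type2_outliers(reg_labels, sp_labels):
    """Indices where the regression component and the spatial region disagree."""
    return [i for i, (a, b) in enumerate(zip(reg_labels, sp_labels)) if a != b]
```

In a full EM-style procedure these steps would be repeated with re-estimated regression coefficients; the sketch shows only a single voting pass with fixed coefficients.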

Based on Algorithm 1, we developed the SRMR framework (Algorithm 2). In SRMR, we iteratively conduct the Type 2 outlier-only spatially dependent mixture regression using Algorithm 1 and identify Type 1 outliers by running a robust linear regression on all the samples assigned to each spatial region. The underlying consideration is that each identified spatial region contains only one regression component, which can be effectively identified by a conventional robust regression approach (RLM). In SRMR, we implement the trimmed likelihood estimation based robust mixture regression. The inputs of SRMR are the same as those of Algorithm 1, plus the maximal iteration number and a random seed. The outputs of SRMR include the identified mixture regression models and outliers. The component of each non-outlier sample can be further assigned by maximum likelihood. In SRMR, we utilize the same BIC function as for conventional robust mixture regression analysis.

III-B Statistical Inference

III-B1 Hypothesis testing for spatial regions

We used a geometry-based approach to estimate the significance of observing a spatial region of a certain size. Note that SRMR uses the compact spatial shape assumption, under which a region can be considered round. For a round shape of a given diameter, the number of such shapes needed to cover a rectangular spatial region can be computed; this number serves as a weight to correct the p value assessed from each single-component robust regression, as detailed in the following.

III-B2 Hypothesis testing for robust linear regression

We discuss hypothesis testing of the significance of a robust linear regression model parameterized by $(\hat{\boldsymbol{\beta}}, \hat{\sigma}, \hat{O})$, which represent the robust regression coefficient estimator, the standard deviation estimator, and the indices of the outlying samples, respectively. A bootstrap procedure is adopted to test the null hypothesis. We perform the following steps.

Step 1: Calculate the residuals for all observations, including the outlier samples, under the regression parameter $\hat{\boldsymbol{\beta}}$, denoted as $\mathbf{r}$. Let $\mathbf{r}_{\hat{O}}$ be the residuals corresponding to the outlying samples, and $r_{\min}$ be the smallest absolute residual in $\mathbf{r}_{\hat{O}}$.

Step 2: Generate an i.i.d. sample from the normal distribution $N(0, \hat{\sigma}^2)$, denoted as $\mathbf{e}$.

Step 3: Calculate the percentage of samples in $\mathbf{e}$ whose absolute values are larger than $r_{\min}$, and denote it as $p_b$.

Step 4: Repeat steps 2-3 $B$ times; the statistical significance is evaluated as the average of $p_b$ over the $B$ repetitions.
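The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the simulated noise is taken to be normal with mean 0 and standard deviation equal to the estimated one, each bootstrap draw has as many samples as there are observations, and the names `bootstrap_significance`, `n_boot`, and `seed` are hypothetical.

```python
import random


def bootstrap_significance(residuals, outlier_idx, sigma_hat, n_boot=1000, seed=0):
    """Average fraction of simulated noise draws exceeding the smallest outlier residual.

    residuals:   residuals of all observations under the fitted robust regression (Step 1)
    outlier_idx: indices of the samples flagged as outliers
    sigma_hat:   estimated residual standard deviation
    """
    rng = random.Random(seed)
    n = len(residuals)

    # Step 1: smallest absolute residual among the flagged outliers.
    r_min = min(abs(residuals[i]) for i in outlier_idx)

    fractions = []
    for _ in range(n_boot):
        # Step 2: i.i.d. draws from N(0, sigma_hat^2); draw size n is an assumption.
        draws = [rng.gauss(0.0, sigma_hat) for _ in range(n)]
        # Step 3: fraction of draws whose absolute value exceeds r_min.
        fractions.append(sum(1 for e in draws if abs(e) > r_min) / n)

    # Step 4: average over all bootstrap repetitions.
    return sum(fractions) / n_boot
```

A small returned value indicates that residuals as large as those of the flagged outliers would rarely arise from the fitted noise level alone.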

III-C Discussion

Several prominent features make our proposed approach attractive. First, instead of using a robust estimation criterion or complex heavy-tailed distributions to robustify the mixture regression model, our method is built upon a spatial regression model so as to facilitate computation and model interpretation. Second, we adopt a sparse and scale-dependent mean-shift parameterization. Each observation is allowed to have potentially different outlying effects across different regression components, which is very flexible. Compared to existing spatial regression methods, our approach allows an efficient solution via the celebrated penalized regression approach, and different information criteria (such as AIC and BIC) can be used to adaptively determine the proportion of outliers. In the next section, we use extensive simulations to demonstrate the performance of SRMR and its high robustness to both gross outliers and high leverage points.


IV Experiments on synthetic data

We evaluated SRMR and selected baseline methods on a comprehensive set of synthetic datasets, assessing how well each method solves the spatially dependent mixture regression problem under different numbers of mixture models, levels of linear dependency, spatial distributions, and ratios of outliers.

IV-A Baseline Methods

We collected in total nine existing methods to represent the current works. In the field of mixture regression, Pan et al. [wu2016new] proposed DC-ADMM, which clusters mixture content in a group pursuit way. It is implemented in the “PRclust” R package (https://github.com/ChongWu-Biostat/prclust). In the field of robust mixture regression, we collected two state-of-the-art algorithms, Trimmed Likelihood Estimation (TLE) and Component-wise adaptive Trimming Likelihood Estimation (CTLE), from the R package “RobMixReg” (https://cran.r-project.org/web/packages/RobMixReg/) on CRAN [chang2020robmixreg]. In the field of spatially smooth regression, we collected four algorithms: spatially clustered coefficient regression (SCC) (https://github.com/furong-tamu/Supplementary-files-for-SCC) [li2019spatial], Spatialcluster (https://github.com/mpadge/spatialcluster) [guo2008regionalization], Spdep (https://github.com/r-spatial/spdep/) [bivand2011spdep], and ClustGeo (https://cran.r-project.org/web/packages/ClustGeo/) [chavent2018clustgeo]. However, only ClustGeo can be executed under our formulation. In the field of segmentation methods based on Markov random fields, we collected two methods, FRGMM (https://sites.google.com/site/nguyen1j/home/10-code) [nguyen2012fast] and mrf2d (https://freguglia.github.io/mrf2d/) [freguglia2020mrf2d]. However, these two methods aim to cluster image pixels, which requires the natural spatial order of neighboring pixels as input, and hence they cannot be applied to our problem. Finally, we used four baseline methods, DC-ADMM, TLE, CTLE, and ClustGeo, in the comparison experiments. All baseline methods were run with their default parameters, except that the nit parameter in TLE and CTLERob was set to 10. For DC-ADMM, we used the stability-prclust function to select the best parameter, following the package instructions. For ClustGeo, we used the choicealpha function to select the best parameter.

IV-B Experimental setup

To simulate spatially dependent linear relationships, we first generate a univariate independent variable $x$ from a uniform distribution and the dependent variable by $y = \beta_k x + \epsilon$, $k = 1, \dots, K$, where $K$ is the number of mixture models and $\beta_k$ is the regression coefficient of the $k$-th model. The spatial coordinate of each sample was generated from a multivariate normal distribution $N(\boldsymbol{\mu}_k, \Sigma_k)$, where $\boldsymbol{\mu}_k$ determines the center and $\Sigma_k$ determines the range and shape of each spatial region. The default experimental setting consists of two distinct and non-uniformly distributed spatial regions. The two types of outliers were further simulated. We simulated Type 1 outliers by a rejection sampling approach. Specifically, we first sample $(x, y)$ pairs independently and only accept those whose Euclidean distance to the regression lines is larger than two as Type 1 outliers. To simulate Type 2 outliers at a given ratio, we randomly select that ratio of samples and reverse their spatial coordinates.
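A rough sketch of this simulation scheme is given below. The numeric settings (uniform range, sampling box, noise level, region centers, isotropic spread) are placeholders chosen for illustration, since the values used in the paper are not recoverable from this text. The regression lines are assumed to pass through the origin, and the spatial coordinates use independent normal draws per axis, which corresponds to a diagonal covariance matrix. Type 2 outliers are not sketched because the coordinate transformation is not given here.

```python
import math
import random


def simulate_regions(n_per_region, betas, centers, spread, sigma, rng):
    """Simulate samples with region-specific slopes and spatially clustered coordinates."""
    samples = []
    for k, (beta, (cx, cy)) in enumerate(zip(betas, centers)):
        for _ in range(n_per_region):
            x = rng.uniform(-3.0, 3.0)               # placeholder range
            y = beta * x + rng.gauss(0.0, sigma)     # linear model of region k
            s = (rng.gauss(cx, spread), rng.gauss(cy, spread))
            samples.append({"x": x, "y": y, "s": s, "label": k})
    return samples


def distance_to_line(x, y, beta):
    """Euclidean distance from the point (x, y) to the line y = beta * x."""
    return abs(beta * x - y) / math.sqrt(beta * beta + 1.0)


def type1_outliers(n, betas, rng, min_dist=2.0, box=6.0):
    """Rejection sampling: keep points farther than min_dist from every regression line."""
    accepted = []
    while len(accepted) < n:
        x = rng.uniform(-box, box)   # placeholder sampling box
        y = rng.uniform(-box, box)
        if all(distance_to_line(x, y, b) > min_dist for b in betas):
            accepted.append((x, y))
    return accepted
```

For instance, `simulate_regions(50, [1.0, -1.0], [(-2, -2), (2, 2)], 0.5, 0.1, random.Random(1))` yields 100 samples from two regions with opposite slopes.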

We conducted the synthetic data based experiments for three types of method evaluation:
(1) We evaluated the general performance of SRMR and the baseline methods in solving the spatially dependent mixture regression problem under the following experimental setups (Fig 1A). Each time, we perturbed one of five factors and fixed the others: the number of mixture regression models $K$, the total sample size $N$, the error of the linear regression $\sigma$, the proportions of samples belonging to (model 1, model 2, outliers) (only for $K=2$), and the coefficients of the linear regression models (only for $K=2$).
(2) We validated the robustness of SRMR and the baseline methods in handling the two types of outliers, namely Type 1 and Type 2 outliers, by perturbing their ratio from 10 to 20 (Fig 1B).
(3) We validated the capability of SRMR and the baseline methods to detect different shapes and distributions of spatial regions. We simulated the spatial coordinates from a multivariate normal distribution or a multivariate uniform distribution; the former simulates a round and dense spatial region, while the latter generates uniformly distributed 2D coordinates. The simulated shapes are showcased in Fig 1C. In addition, we evaluated whether SRMR is sensitive to the relative positions of the spatial regions. We simulated two types of relative location, namely (i) a diagonal arrangement and (ii) a horizontal arrangement of the region centers. To simulate spatial regions of imbalanced densities, we perturbed the covariance matrix of the spatial coordinates.

In summary, we set up ten perturbation scenarios (Fig 1), each containing 2-3 different parameter settings. We conducted 100 replicates for each parameter setting in each scenario. In total, we obtained 2,500 synthetic datasets. The mean value of each evaluation metric was used for performance evaluation.

Fig. 1: Experiment setting. Sub-figures without a grid show the linear relationships and sub-figures with a grid show the spatial coordinates. For (b) and (c), we show only the plot whose controlling factor is changed, rather than the full pair of plots (linear relationship and spatial coordinates) as in (a). (a) contains five different scenarios in terms of mixture regression. (b) contains two scenarios dealing with Type 1 and Type 2 outliers. (c) contains three scenarios for detecting different shapes and distributions of spatial regions.

IV-C Evaluation Metrics

We evaluated the performance of SRMR and the baseline methods on synthetic datasets based on how accurately the methods can identify the simulated mixture regression models and the corresponding spatial regions, and distinguish the two types of outliers. Four evaluation metrics were used in the synthetic data based evaluations:

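1) Rand Index (RI): a similarity measure between two clusterings, computed by counting the sample pairs that are assigned to the same cluster, or to different clusters, in both the predicted and the true clusterings.

2) Adjusted Rand Index (ARI): $\mathrm{ARI} = (\mathrm{RI} - \mathbb{E}[\mathrm{RI}]) / (\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}])$, which is a corrected-for-chance version of RI.

3) Accuracy Rate (ACC) for outlier detection: measures the accuracy of distinguishing the Type 1 and Type 2 outliers.

4) Error of Predicted Coefficients (PCE): measures the distance between the true regression coefficient $\boldsymbol{\beta}_k$ of each regression component and the predicted regression coefficient $\hat{\boldsymbol{\beta}}_{(k)}$. Here $\hat{\boldsymbol{\beta}}_{(k)} = \arg\min_{\hat{\boldsymbol{\beta}}_j} \lVert \boldsymbol{\beta}_k - \hat{\boldsymbol{\beta}}_j \rVert$, i.e., $\hat{\boldsymbol{\beta}}_{(k)}$ is the predicted coefficient closest to $\boldsymbol{\beta}_k$.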
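The clustering and coefficient metrics listed above can be sketched as follows. RI and ARI follow their standard pair-counting definitions. For the coefficient error, averaging the Euclidean distances over components is an assumption, since the exact aggregation is not recoverable from this text. The function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb, dist


def rand_index(true, pred):
    """Fraction of sample pairs on which the two clusterings agree (same/same or diff/diff)."""
    pairs = list(combinations(range(len(true)), 2))
    agree = sum(1 for i, j in pairs
                if (true[i] == true[j]) == (pred[i] == pred[j]))
    return agree / len(pairs)


def adjusted_rand_index(true, pred):
    """Chance-corrected Rand index computed from the contingency table."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are trivial
    return (sum_ij - expected) / (max_index - expected)


def coefficient_error(true_betas, pred_betas):
    """Mean distance from each true coefficient vector to its closest predicted one."""
    closest = [min(dist(t, p) for p in pred_betas) for t in true_betas]
    return sum(closest) / len(closest)
```

Both clustering scores are invariant to relabeling, so a prediction that swaps the names of two clusters still scores 1.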

IV-D Performance

We organized the synthetic data experiment results in Table 1 into three sections: mixture regression, robustness and spatial patterns. Overall, SRMR outperforms baseline methods in all 10 experiment settings under almost all evaluation metrics.

In Table 1, the first section (1st-5th blocks) illustrates the performance of SRMR and the other methods in detecting heterogeneous linear dependencies in different scenarios, with regard to sample size, number of components, noise level, cluster balance and strength of regression coefficients. SRMR detects the clusters and the regression coefficients of each cluster very accurately for different sample sizes and numbers of components, and it is robust to different noise levels, imbalanced cluster sizes and small regression coefficients. Notably, because it incorporates spatial information, it is able to differentiate two clusters with very similar regression coefficients but different spatial locations. Since DC-ADMM and ClustGeo are designed for clustering rather than regression, the evaluation metrics ACC and PCE for these two methods are filled with NaN. Although DC-ADMM uses a novel formulation for clustering, it cannot handle outliers or incorporate spatial information. Thus, the performance of DC-ADMM is the lowest in most cases. As the noise level of the regression line increased, the power of the ordinary robust mixture regression methods TLE and CTLE decreased, leading to lower RI and ARI scores. When the clusters become more imbalanced, the RI and ARI scores of all baseline methods get much worse. When two clusters have very similar regression parameters but are far apart in terms of spatial location, TLE and CTLE cannot differentiate the two clusters, as they do not account for spatial proximity, causing low RI and ARI scores.

The second section (6th-7th blocks) of Table 1 illustrates the performance of all methods in terms of robustness to outlier contamination, including Type 1 outliers and Type 2 outliers. SRMR is highly robust to both regression outliers and spatial outliers, and its clustering accuracies and parameter estimates are almost unaffected by the presence of outliers. This is because SRMR adopts a trimmed likelihood approach, so that outliers are expected to be excluded from model estimation. Since DC-ADMM and ClustGeo are not designed to handle either Type 1 or Type 2 outliers, their performance is consistently worse than that of TLE, CTLE, and SRMR. While TLE and CTLE can handle regression outliers, they have no control over spatial proximity, and hence they are very sensitive to Type 2 outliers, i.e., spatial outliers. The ACC of TLE and CTLE is around 70% due to spatial heterogeneity, while SRMR has a 100% accuracy rate in all scenarios.

The third section (8th-10th blocks) of Table 1 illustrates the performance of all methods for different spatial patterns, regarding the shape, center and density of the spatial clusters. SRMR is designed to detect heterogeneous linear dependencies in a way that is robust to both regression outliers and spatial outliers, and its performance is consistently good across different spatial patterns. When the spatial distribution of the clusters is changed from multivariate normal to multivariate uniform, the shape of the clusters becomes less spherical and more diffuse. When the arrangement of the spatial centers is changed from diagonal to horizontal, the boundary between the two spatial regions becomes blurred, meaning there is more overlap between neighbouring clusters. The performance of TLE, CTLE and ClustGeo got worse with more cluster overlap, while SRMR is robust to this complex situation thanks to the integration of both regression and spatial similarity. ClustGeo is sensitive to imbalanced densities of different clusters, while SRMR is unaffected.

In summary, SRMR is the only method among those tested that can model a linear dependency between response and predictors that varies over the spatial domain, and detect clusters of observations that are similar in both regression parameters and spatial proximity. It is also robust to outliers in both regression fitting and spatial location. It produced highly favorable performance in different simulation settings, with regard to different levels of regression/spatial noise, outliers, and mixture imbalance.

V Experiments on real-world data

We further validated SRMR on two real-world datasets, namely (1) a geospatial economics dataset collected from 298 cities of China and (2) a spatial transcriptomics dataset collected from 3,798 spatial spots on a 2D breast cancer tissue. The synthetic data based experiments suggested that, among the tested methods, only SRMR can effectively solve the spatially dependent mixture regression problem. In the real-world data based experiments, we mainly focused on illustrating the contextual meaning of the spatial regions and corresponding regression models identified by SRMR. We also evaluated the goodness of fit and significance of the spatially dependent mixture regression models, as well as the running time of the tested methods.

Fig. 2: Real-world data based experiments. a1: SRMR, a2: SRMR, a3: TLE, a4: CTLE, a5: DC-ADMM, a6: ClustGeo; a1,a3-a6: , a2: . a1-a6: cities assigned to different regression components are colored red, blue, green and orange, while outliers are colored grey.

V-A Application on Geospatial Economics Data

We collected 7 economic features, namely total GDP, public income, public spend, educational spend, technology spend, population, and averaged personal income, together with the latitude and longitude, for 298 cities in China. We applied SRMR and the baseline methods to this dataset. When applying SRMR and the other regression models, we used each of the seven features as the dependent variable and selected others as independent variables, while all features were used as the input of ClustGeo. As in the synthetic data based experiments, SRMR is the only method that can identify spatially dependent mixture regression models. In contrast, TLE, CTLERob, and DC-ADMM only detected spatially independent regression models, and ClustGeo output a spatial segmentation based on all features.

For clear visualization and explanation, we illustrate two univariate regressions of and . For both and , SRMR identified four spatial regions corresponding to the north-east, middle-east, south-east and west regions of China (Fig 2a1, 2a2). The spatial regions detected by SRMR show distinctly different dependencies of and with . Specifically, is positively associated with in middle-east () and north-east China (). The south-east cities have more stable , which depends less on (), while a negative association between and is observed in the west cities (). The high dependency in middle-east and north-east cities and the lower dependency in south-east cities are consistent with our knowledge, as middle-east and north-east China are still building up the basis of their education systems while the education systems of south-east China are relatively stable. We also checked the cities in west China that have high but low . Such cities include Dongying, Ordos, Karamay, etc., which have been developing new-energy businesses rather than education in recent years. Similar observations were also made in the model (Fig 2a2). The SRMR outputs suggested that the personal in the north-east, south-east and west cities depends less on , while a more positive dependency between and was observed in middle-east cities, especially the well-developed cities Beijing, Shanghai, Tianjin, Hangzhou, etc. On the other hand, on both and , TLE and CTLERob failed to identify such spatially dependent and contextually meaningful patterns, and both of them tended to over-fit the mixture of regressions (Fig 2a3, 2a4). DC-ADMM assigned all cities to one class (Fig 2a5), while ClustGeo identified three distinct non-overlapping spatial regions without offering explainable region-specific feature dependencies (Fig 2a6).

V-B Application on Spatial Transcriptomics Data

10x Genomics spatial transcriptomics (ST) is a recently commercialized technique that measures gene expression signals together with their spatial coordinates in a biological tissue sample, and it is broadly used in biomedical research. A typical ST dataset is a matrix of roughly 15,000 genes (rows) by 4,000 individual spatial spots (columns), and each spot has a 2D spatial coordinate (Fig 2b1). The spatial spots are uniformly distributed. A key challenge in ST data analysis is to infer spatially dependent and biologically meaningful functional variations from the high-dimensional feature matrix (genes by spatial spots). Here we illustrate that SRMR enables a new type of ST data analysis by simultaneously identifying spatial regions in which the expression levels of genes show different levels of dependency, which directly annotates the biological meaning of each detected region.

We applied SRMR and the baseline methods to the v1.1 ST data of breast cancer provided by 10xgenomics.com, consisting of 13,161 genes and 3,798 spatial spots. We first selected 500 genes that have high expression levels and known tumor micro-environment related functions. We fit a regression model of one gene on another for each pair of the 500 genes by using SRMR, TLE, CTLERob and DC-ADMM, and ran ClustGeo using all 500 genes. Similar to the synthetic and geospatial data, SRMR is the only method that detected spatially dependent mixture regression models in the ST data. General spatial segmentation methods, such as ClustGeo, identify spatial regions by using the whole feature matrix (Fig 2b3), and the resulting regions are consistent with the distribution of the averaged gene expression signal level (Fig 2b2). On the other hand, we identified more than 500 overlapping spatial regions by using SRMR, each showing a varied dependency among certain genes. Fig 2b4 showcases two distinct spatial regions identified only by SRMR, which have a varied dependency between the CD79A and CD79B genes, as shown in Fig 2b5. CD79A/B are key genes involved in the maturation and functional variation of B cells. The varied dependency of CD79A and CD79B characterizes distinct sub-regions in one breast cancer tissue that potentially have different immune activities and responses to immunotherapy.
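The pairwise screening described above can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: genes are ranked by mean expression only (the selection by known tumor micro-environment function is not modeled), pairs are treated as unordered, and a simple least-squares line fitted with `np.polyfit` stands in for SRMR and the baselines. The names `top_expressed` and `pairwise_slopes` are hypothetical, and `expr` is assumed to be a genes-by-spots matrix as in the text.

```python
from itertools import combinations

import numpy as np

def top_expressed(expr, k):
    """Return row indices of the k genes with the highest mean expression."""
    means = expr.mean(axis=1)
    return np.argsort(means)[::-1][:k]

def pairwise_slopes(expr, gene_idx):
    """Fit y = a + b * x by least squares for every unordered gene pair.

    For a pair (i, j), gene i is the response and gene j the predictor.
    A plain linear fit is a placeholder for the mixture regression
    that the study applies to each pair.
    """
    slopes = {}
    for i, j in combinations(gene_idx, 2):
        b, a = np.polyfit(expr[j], expr[i], deg=1)  # returns [slope, intercept]
        slopes[(int(i), int(j))] = float(b)
    return slopes
```

With 500 selected genes this loop visits 124,750 unordered pairs, so in practice the per-pair fit dominates the cost of the screening.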

In summary, compared with the baseline methods, SRMR is the only method that can effectively solve the spatially dependent mixture regression problem on the two real-world datasets. For the analysis of a single regression model in the real-world data, the running times of SRMR, TLE and CTLERob are about 15s, 10s and 2s, respectively. SRMR is therefore slower than, but still comparable to, the baseline robust mixture regression approaches. The running times of DC-ADMM and ClustGeo are about 0.01s.

VI Conclusion

We developed a new statistical model for high-dimensional data with matched spatial information, namely spatially dependent mixture regression. We also developed spatial robust mixture regression (SRMR) analysis as an effective solution to the problem. SRMR is equipped with an inference scheme to assess the statistical significance of spatially dependent finite mixture regression models. In experiments on both synthetic and real-world data, we demonstrated that SRMR is the only method among those compared that can solve the spatially dependent mixture regression problem. In particular, SRMR enables a new type of spatial segmentation analysis by detecting large sets of spatial regions that have varied dependencies among certain features. Compared with conventional spatial segmentation analysis, the regions identified by SRMR characterize more of the spatially dependent variation contained in the data and enable better contextual explanation. The source code of SRMR and of the analyses in this study is provided at

https://github.com/changwn/SRMR.

References