Scalable Spatiotemporally Varying Coefficient Modeling with Bayesian Kernelized Tensor Regression

08/31/2021 · by Mengying Lei, et al. · HEC Montréal, McGill University

As a regression technique in spatial statistics, spatiotemporally varying coefficient model (STVC) is an important tool to discover nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analysis due to the high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data with substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies among the samples, we place Gaussian process (GP) priors on the spatial and temporal factor matrices to better encode local spatial and temporal processes on each factor component. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.



1 Introduction

Local spatial regression aims to characterize the nonstationary and heterogeneous associations between a response variable and the corresponding covariates observed over a spatial domain Banerjee et al. (2014); Cressie and Wikle (2015). This is achieved by assuming that the regression coefficients vary locally over space. Local spatial regression offers enhanced interpretability of complex relationships, and it has become an important technique in many fields, such as geography, ecology, economics, environmental science, public health, and climate science. In general, a local spatial regression model for a scalar response can be written as:

$$y_s = \mathbf{x}_s^\top \boldsymbol{\beta}_s + \epsilon_s, \qquad (1)$$

where $s$ is the index (e.g., longitude and latitude) of a spatial location, $\mathbf{x}_s \in \mathbb{R}^{P}$ and $\boldsymbol{\beta}_s \in \mathbb{R}^{P}$ are the covariate vector and regression coefficients at location $s$, respectively, and $\epsilon_s$ is a white noise process with precision $\tau$.

There are two common methods for local spatial regression analysis—the Bayesian spatially varying coefficient model (SVC) Gelfand et al. (2003) and geographically weighted regression (GWR) Fotheringham et al. (2003). SVC is a Bayesian hierarchical model in which the regression coefficients are modeled using Gaussian processes (GPs) with a kernel function to be learned Rasmussen and Williams (2006). For a collection of $M$ observed locations $\{s_1, \ldots, s_M\}$, SVC imposes a prior such that $\operatorname{vec}(B) \sim \mathcal{N}\left(0, K_s \otimes \Lambda^{-1}\right)$, where $B = [\boldsymbol{\beta}_{s_1}, \ldots, \boldsymbol{\beta}_{s_M}] \in \mathbb{R}^{P \times M}$ is the matrix of all coefficients, $\operatorname{vec}(\cdot)$ denotes vectorization by stacking all columns of a matrix as a single vector, $K_s \in \mathbb{R}^{M \times M}$ is a spatial correlation matrix, $\Lambda \in \mathbb{R}^{P \times P}$ is a precision matrix for the covariates, and $\otimes$ denotes the Kronecker product. Note that setting $K_s$ as a correlation matrix simplifies the covariance specification, since the variances can be captured by the diagonal of $\Lambda^{-1}$. This formulation is equivalent to a matrix normal distribution $B \sim \mathcal{MN}_{P \times M}\left(0, \Lambda^{-1}, K_s\right)$. GWR was developed independently using a simple local weighted regression approach, in which a bandwidth parameter is used to calculate weights for different observations, with closer observations carrying larger weights. In practice, the bandwidth parameter is either pre-specified based on domain knowledge or tuned through cross-validation. However, it has been shown in the literature that the estimation results of GWR are highly sensitive to the choice of the bandwidth parameter (e.g., Finley, 2011). Compared with GWR, the Bayesian hierarchical framework of SVC provides more robust results and allows us to learn the hyperparameters of the spatial kernel, which is critical to understanding the underlying characteristics of the spatial processes. In addition, by using Markov chain Monte Carlo (MCMC), we can not only explore the posterior distributions of the kernel hyperparameters and regression coefficients but also perform out-of-sample prediction.
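To make the separable prior concrete, the following minimal numpy sketch (our illustration, not the authors' code) draws SVC coefficients from $\operatorname{vec}(B) \sim \mathcal{N}(0, K_s \otimes \Lambda^{-1})$ via the equivalent matrix normal form; the SE kernel and the length-scale value are placeholder choices:

```python
import numpy as np

def draw_svc_coefficients(coords, Lambda, lengthscale=1.0, seed=0):
    """Draw B in R^{P x M} with vec(B) ~ N(0, K_s kron Lambda^{-1})."""
    rng = np.random.default_rng(seed)
    M, P = coords.shape[0], Lambda.shape[0]
    # Spatial correlation matrix from an SE kernel (any valid kernel works here).
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K_s = np.exp(-d2 / (2 * lengthscale**2)) + 1e-8 * np.eye(M)
    # Matrix-normal draw: B = L_row @ Z @ L_col.T has row covariance Lambda^{-1}
    # and column covariance K_s, i.e., vec(B) ~ N(0, K_s kron Lambda^{-1}).
    L_col = np.linalg.cholesky(K_s)
    L_row = np.linalg.cholesky(np.linalg.inv(Lambda))
    Z = rng.standard_normal((P, M))
    return L_row @ Z @ L_col.T  # column m is beta_{s_m}

B = draw_svc_coefficients(np.random.default_rng(1).random((30, 2)), np.eye(3))
```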

The formulation in Eq. (1) can be easily extended to local spatiotemporal regression to further characterize the temporal variation of the coefficients. For a panel response matrix $Y \in \mathbb{R}^{M \times N}$ observed from a set of locations $\{s_1, \ldots, s_M\}$ over a set of time points $\{t_1, \ldots, t_N\}$, the local spatiotemporal regression model can be formulated as:

$$y_{m,n} = \mathbf{x}_{m,n}^\top \boldsymbol{\beta}_{m,n} + \epsilon_{m,n}, \qquad (2)$$

where we use $m$ and $n$ to index rows (i.e., locations) and columns (i.e., time points), respectively, $y_{m,n}$ is the $(m,n)$th element of $Y$, and $\mathbf{x}_{m,n}, \boldsymbol{\beta}_{m,n} \in \mathbb{R}^{P}$ are the covariate vector and coefficient vector at location $s_m$ and time $t_n$, respectively. Based on this formulation, Huang et al. (2010) extended GWR to geographically and temporally weighted regression (GTWR) by introducing more parameters to quantify spatiotemporal weights in the local weighted regression. For SVC, Gelfand et al. (2003) suggested using a separable kernel structure to build a spatiotemporally varying coefficient model (STVC), which assumes that the vector stacking all coefficients $\boldsymbol{\beta}_{m,n}$ follows $\mathcal{N}\left(0, K_t \otimes K_s \otimes \Lambda^{-1}\right)$, where $K_t \in \mathbb{R}^{N \times N}$ is a kernel matrix defining the correlation structure of the time points. Note that with this GP formulation, it is not necessary for the $N$ time points to be uniformly distributed. If we parameterize the regression coefficients in Eq. (2) as a third-order tensor $\mathcal{B} \in \mathbb{R}^{M \times N \times P}$ with mode-3 fibers $\mathcal{B}(m,n,:) = \boldsymbol{\beta}_{m,n}$, the above specification is equivalent to a tensor normal prior $\mathcal{B} \sim \mathcal{TN}\left(0; K_s, K_t, \Lambda^{-1}\right)$. However, despite the elegant GP-based formulation of STVC, the model is rarely used in real-world practice, mainly due to its high computational cost. For example, for a fully observed $M \times N$ matrix with corresponding spatiotemporal covariates, the time complexity of updating all $MNP$ coefficients in each MCMC iteration is $\mathcal{O}\left(M^3 N^3 P^3\right)$ due to the inversion (or Cholesky factorization) of the $MNP \times MNP$ covariance matrix. Even when $Y$ is only partially observed on an index set $\Omega$, the time complexity remains cubic in the number of observations and is still prohibitive when the sample size is large.

In this paper, we provide an alternative estimation strategy—Bayesian Kernelized Tensor Regression (BKTR)—to perform local spatiotemporal regression analysis on large-scale data sets. Inspired by the ideas of reduced-rank regression and tensor regression (e.g., Izenman, 1975; Zhou et al., 2013; Bahadori et al., 2014; Guhaniyogi et al., 2017), we use low-rank tensor factorization to encode the dependencies among the three dimensions of $\mathcal{B}$ with only a few latent factors. To further incorporate local spatial and temporal consistency, we place GP priors on the spatial and temporal factor vectors following Luttinen and Ilin (2009), thus translating the default tensor factorization into a kernelized factorization model. With a specified tensor rank $R$, the time complexity becomes $\mathcal{O}\left(R^3\left(M^3 + N^3 + P^3\right)\right)$, which is substantially lower than that of the default STVC formulation. We conduct numerical experiments on both synthetic and real-world data sets, and our results confirm the promising performance of BKTR.

2 Related Work

Our key research question is how to efficiently and effectively model a high-dimensional latent variable (i.e., the third-order spatiotemporal tensor $\mathcal{B}$) in which complex dependencies exist both within and across dimensions. The most popular strategy is to model the variable using a GP with a separable covariance matrix, which has been used in spatiotemporal analysis (e.g., Cressie and Huang, 1999; Gelfand et al., 2003; Stein, 2005; Zhang, 2007) and multi-output GPs (e.g., Bonilla et al., 2007; Álvarez et al., 2012). However, the cubic time complexity of learning the kernel hyperparameters becomes the critical bottleneck. Several solutions have been proposed to address this computational challenge, such as low-rank approximation Bonilla et al. (2007), sparse approximation Quinonero-Candela and Rasmussen (2005); Alvarez and Lawrence (2008), and compactly supported kernels Luttinen and Ilin (2012).

Our work follows a different approach. Instead of modeling $\mathcal{B}$ directly using a GP, we parameterize the third-order tensor using a low-rank tensor decomposition Kolda and Bader (2009). The idea is inspired by recent studies on low-rank tensor regression/learning (see e.g., Zhou et al., 2013; Bahadori et al., 2014; Rabusseau and Kadri, 2016; Yu and Liu, 2016; Guhaniyogi et al., 2017; Yu et al., 2018). The low-rank assumption not only preserves the global patterns and higher-order dependencies in the variable, but also greatly reduces the number of parameters. In fact, without considering the spatiotemporal indices, we can formulate Eq. (2) as a scalar tensor regression problem Guhaniyogi et al. (2017) by reconstructing each $\mathbf{x}_{m,n}$ as a sparse covariate tensor of the same size as $\mathcal{B}$. However, for spatiotemporal data, the low-rank assumption alone cannot characterize the strong local spatial and temporal consistency. To better encode such side information, existing studies have introduced graph Laplacian regularization into the loss function (e.g., Bahadori et al., 2014; Rao et al., 2015). Nevertheless, this approach introduces additional parameters (e.g., those used to define the distance/similarity function and the weights in the loss function), and it has limited power in modeling complex spatial and temporal processes. The most relevant work is the Gaussian process factor analysis model of Luttinen and Ilin (2009), developed for a completely different problem—spatiotemporal matrix completion—in which different GP priors are placed on the spatial and temporal factors and the whole model is learned through variational Bayesian inference. We follow a similar idea to parameterize the coefficients $\mathcal{B}$ and develop an MCMC algorithm for model inference.

3 Preliminaries

3.1 Notations

Throughout this paper, we use lowercase letters to denote scalars, e.g., $x$, boldface lowercase letters to denote vectors, e.g., $\mathbf{x} \in \mathbb{R}^{M}$, and boldface uppercase letters to denote matrices, e.g., $X \in \mathbb{R}^{M \times N}$. The $\ell_2$-norm of $\mathbf{x}$ is defined as $\|\mathbf{x}\|_2 = \sqrt{\sum_m x_m^2}$. For a matrix $X$, we denote its $(m,n)$th entry by $x_{m,n}$. We use $I_M$ to denote an identity matrix of size $M \times M$. Given two matrices $A \in \mathbb{R}^{M \times N}$ and $B \in \mathbb{R}^{P \times Q}$, the Kronecker product is defined as

$$A \otimes B = \begin{bmatrix} a_{1,1}B & \cdots & a_{1,N}B \\ \vdots & \ddots & \vdots \\ a_{M,1}B & \cdots & a_{M,N}B \end{bmatrix} \in \mathbb{R}^{MP \times NQ}.$$

If $A$ and $B$ have the same number of columns, i.e., $N = Q$, then the Khatri-Rao product is defined as the column-wise Kronecker product $A \odot B = [\mathbf{a}_1 \otimes \mathbf{b}_1, \ldots, \mathbf{a}_N \otimes \mathbf{b}_N] \in \mathbb{R}^{MP \times N}$. The vectorization $\operatorname{vec}(X)$ stacks all column vectors of $X$ as a single vector. Following the tensor notation in Kolda and Bader (2009), we denote a third-order tensor by $\mathcal{X} \in \mathbb{R}^{M \times N \times P}$ and its mode-$k$ unfolding by $\mathcal{X}_{(k)}$, which maps the tensor into a matrix whose columns are the mode-$k$ fibers.
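The following numpy sketch (with illustrative helper names of our own) implements the Khatri-Rao product and the mode-$k$ unfolding under the Kolda and Bader (2009) convention used here:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (IJ x R)."""
    assert A.shape[1] == B.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

def unfold(T, mode):
    """Mode-k unfolding (Kolda and Bader, 2009): mode-k fibers become columns."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

rng = np.random.default_rng(0)
A, B = rng.random((4, 3)), rng.random((5, 3))
KR = khatri_rao(A, B)
assert np.allclose(KR[:, 1], np.kron(A[:, 1], B[:, 1]))  # column-wise Kronecker
T = np.arange(24.0).reshape(2, 3, 4)
assert unfold(T, 1).shape == (3, 8)
```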

3.2 Tensor CP Decomposition

For a third-order tensor $\mathcal{B} \in \mathbb{R}^{M \times N \times P}$, the CANDECOMP/PARAFAC (CP) decomposition factorizes $\mathcal{B}$ into a sum of rank-one tensors Kolda and Bader (2009):

$$\mathcal{B} = \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r, \qquad (3)$$

where $R$ is the CP rank, $\circ$ represents the vector outer product, and $\mathbf{u}_r \in \mathbb{R}^{M}$, $\mathbf{v}_r \in \mathbb{R}^{N}$, and $\mathbf{w}_r \in \mathbb{R}^{P}$ for $r = 1, \ldots, R$. The factor matrices that collect the vectors from the rank-one components are denoted by $U = [\mathbf{u}_1, \ldots, \mathbf{u}_R]$, $V = [\mathbf{v}_1, \ldots, \mathbf{v}_R]$, and $W = [\mathbf{w}_1, \ldots, \mathbf{w}_R]$, respectively. We can write Eq. (3) in the following matricized form:

$$\mathcal{B}_{(1)} = U (W \odot V)^\top, \quad \mathcal{B}_{(2)} = V (W \odot U)^\top, \quad \mathcal{B}_{(3)} = W (V \odot U)^\top, \qquad (4)$$

which relates each mode-$k$ unfolding of the tensor to its polyadic decomposition.
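As a quick numerical sanity check (ours, not from the paper), the snippet below builds a random CP tensor following Eq. (3) and verifies the three matricized identities in Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P, R = 6, 5, 4, 3
U, V, W = (rng.standard_normal((d, R)) for d in (M, N, P))

# Full tensor from the sum of rank-one terms u_r o v_r o w_r (Eq. 3).
B = np.einsum('mr,nr,pr->mnp', U, V, W)

def unfold(T, mode):  # Kolda-Bader mode-k unfolding, as in Section 3.1
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

def khatri_rao(A, C):  # column-wise Kronecker product
    return np.einsum('ir,jr->ijr', A, C).reshape(A.shape[0] * C.shape[0], -1)

# Eq. (4): each unfolding equals a factor matrix times a Khatri-Rao product.
assert np.allclose(unfold(B, 0), U @ khatri_rao(W, V).T)
assert np.allclose(unfold(B, 1), V @ khatri_rao(W, U).T)
assert np.allclose(unfold(B, 2), W @ khatri_rao(V, U).T)
```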

4 Bayesian Kernelized Tensor Regression

4.1 Model Specification

Let $\mathcal{X} \in \mathbb{R}^{M \times N \times P}$ be the covariate tensor, of which the $(m,n)$th mode-3 fiber is the covariate vector at location $s_m$ and time $t_n$, i.e., $\mathcal{X}(m,n,:) = \mathbf{x}_{m,n}$. We use $\mathbf{y} = \operatorname{vec}(Y)$ to denote the vectorized response for simplicity; then Eq. (2) can be formulated as:

$$\mathbf{y} = \left(I_{MN} \odot \mathcal{X}_{(3)}\right)^\top \operatorname{vec}\left(\mathcal{B}_{(3)}\right) + \boldsymbol{\epsilon}, \qquad (5)$$

where $\mathcal{X}_{(3)}$ and $\mathcal{B}_{(3)} \in \mathbb{R}^{P \times MN}$ are the mode-3 unfoldings of $\mathcal{X}$ and $\mathcal{B}$, respectively, the Khatri-Rao product $\left(I_{MN} \odot \mathcal{X}_{(3)}\right)^\top$ is a block diagonal matrix, and $\boldsymbol{\epsilon} \sim \mathcal{N}\left(0, \tau^{-1} I_{MN}\right)$. Assuming that $\mathcal{B}$ admits a CP decomposition with rank $R$, i.e., $\mathcal{B}_{(3)} = W (V \odot U)^\top$, we can rewrite Eq. (5) as:

$$\mathbf{y} = \tilde{X} \operatorname{vec}\left(W (V \odot U)^\top\right) + \boldsymbol{\epsilon}, \qquad (6)$$

where we use $\tilde{X} = \left(I_{MN} \odot \mathcal{X}_{(3)}\right)^\top$ to denote the expanded covariate matrix. The number of parameters in model (6) becomes $R(M + N + P)$, which is substantially smaller than the $MNP$ parameters in model (5).
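The equivalence between the fiber-wise regression in Eq. (2) and the vectorized form in Eq. (5) can be checked numerically; in the sketch below (our illustration), the block-diagonal structure of $(I_{MN} \odot \mathcal{X}_{(3)})^\top$ reduces the product to column-wise inner products of the two mode-3 unfoldings:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, P, R = 4, 3, 2, 2
X = rng.standard_normal((M, N, P))                  # covariate tensor
U, V, W = (rng.standard_normal((d, R)) for d in (M, N, P))
B = np.einsum('mr,nr,pr->mnp', U, V, W)             # low-rank coefficient tensor

# Direct computation of Eq. (2): y_{m,n} = x_{m,n}^T beta_{m,n}.
Y = np.einsum('mnp,mnp->mn', X, B)

# Vectorized form of Eq. (5): each column of a mode-3 unfolding is one (m, n)
# fiber, so the block-diagonal product is a column-wise inner product.
def unfold3(T):
    return np.reshape(np.moveaxis(T, 2, 0), (T.shape[2], -1), order='F')

y = np.sum(unfold3(X) * unfold3(B), axis=0)
assert np.allclose(y, Y.reshape(-1, order='F'))     # y = vec(Y)
```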

Local spatial and temporal processes are critical to the modeling of spatiotemporal data. However, as mentioned, the low-rank assumption alone cannot encode such local dependencies. To address this issue, we follow the GP factor analysis strategy of Luttinen and Ilin (2009) and place GP priors on $U$ and $V$, respectively, in order to estimate model (6). Figure 1 shows the graphical model of the proposed framework, which is referred to as Bayesian Kernelized Tensor Regression (BKTR) in the remainder of this paper.

Figure 1: Graphical model of BKTR.

Note that in real-world applications the panel data $Y$ is often incomplete, with only a subset of the entries observed ($\Omega$ refers to the set of observation indices, with $|\Omega| \le MN$). We denote by $O \in \{0, 1\}^{M \times N}$ a binary indicator matrix with $o_{m,n} = 1$ if $(m,n) \in \Omega$ and $o_{m,n} = 0$ otherwise, and by $A \in \{0, 1\}^{|\Omega| \times MN}$ the binary matrix formed by removing from the identity matrix $I_{MN}$ the rows corresponding to the zero entries of $\operatorname{vec}(O)$. The vector of observed data can then be obtained as $\mathbf{y}_\Omega = A \operatorname{vec}(Y)$. Therefore, when the data contains missing entries, we have

$$\mathbf{y}_\Omega = A \tilde{X} \operatorname{vec}\left(W (V \odot U)^\top\right) + \boldsymbol{\epsilon}_\Omega, \quad \boldsymbol{\epsilon}_\Omega \sim \mathcal{N}\left(0, \tau^{-1} I_{|\Omega|}\right). \qquad (7)$$

For the spatial and temporal factor matrices $U$ and $V$, we use identical GP priors on the component vectors:

$$\mathbf{u}_r \sim \mathcal{N}\left(0, K_s\right), \quad \mathbf{v}_r \sim \mathcal{N}\left(0, K_t\right), \quad r = 1, \ldots, R, \qquad (8)$$

where $K_s \in \mathbb{R}^{M \times M}$ and $K_t \in \mathbb{R}^{N \times N}$ are the spatial and temporal covariance matrices built from two valid kernel functions $k_s(\cdot, \cdot; \phi)$ and $k_t(\cdot, \cdot; \gamma)$, respectively, with $\phi$ and $\gamma$ being the kernel hyperparameters. Note that we also restrict $K_s$ and $K_t$ to be correlation matrices by setting the kernel variance to one. We reparameterize the kernel hyperparameters as log-transformed variables to ensure their positivity and assume normal priors on them:

$$\log \phi \sim \mathcal{N}\left(\mu_\phi, \tau_\phi^{-1}\right), \quad \log \gamma \sim \mathcal{N}\left(\mu_\gamma, \tau_\gamma^{-1}\right). \qquad (9)$$

For the factor matrix $W$, we assume all columns follow an identical zero-mean Gaussian distribution with a conjugate Wishart prior on the precision matrix:

$$\mathbf{w}_r \sim \mathcal{N}\left(0, \Lambda_w^{-1}\right), \quad r = 1, \ldots, R, \quad \Lambda_w \sim \mathcal{W}\left(\Psi_0, \nu_0\right), \qquad (10)$$

where $\Psi_0$ is a $P$-by-$P$ positive-definite scale matrix and $\nu_0$ denotes the degrees of freedom. Finally, we place a conjugate Gamma prior $\tau \sim \operatorname{Gamma}(a_0, b_0)$ on the precision $\tau$.

4.2 Model Inference

We use a Gibbs sampling strategy to perform Bayesian inference for the model parameters, including the coefficient factors $\{U, V, W\}$, the precision $\tau$, and the precision matrix $\Lambda_w$. For the kernel hyperparameters $\{\phi, \gamma\}$, whose conditional distributions are not easy to sample from directly, we use a slice sampler.

Sampling kernel hyperparameters $\{\phi, \gamma\}$.

As shown in Figure 1, sampling the kernel hyperparameters conditional on the factor matrices would be straightforward using the Metropolis-Hastings algorithm. However, in practice, conditioning on the latent variables induces sharply peaked posteriors over $\{\phi, \gamma\}$, making the Markov chains mix slowly and resulting in poor updates Murray and Adams (2010). To address this issue, we adopt the reparameterization and slice sampling approach proposed in Murray and Adams (2010), introducing a set of auxiliary variables—a noisy version of the factor matrices—to sample the kernel hyperparameters more efficiently.

We next illustrate the sampling process for $\phi$. First, we generate auxiliary data $\dot{U}$ with each column $\dot{\mathbf{u}}_r \sim \mathcal{N}\left(\mathbf{u}_r, \sigma^2 I_M\right)$, where $\sigma^2$ is a specified variance. This auxiliary variable can be regarded as surrogate data for $U$, and the joint distribution of $U$ and $\phi$ conditioned on $\dot{U}$ becomes:

$$p\left(U, \phi \mid \dot{U}\right) \propto p(\phi) \prod_{r=1}^{R} \mathcal{N}\left(\mathbf{u}_r \mid \mathbf{m}_r, \Sigma_{\phi,\sigma}\right) \mathcal{N}\left(\dot{\mathbf{u}}_r \mid 0, K_s + \sigma^2 I_M\right), \qquad (11)$$

where $\Sigma_{\phi,\sigma} = \left(K_s^{-1} + \sigma^{-2} I_M\right)^{-1}$ and $\mathbf{m}_r = \sigma^{-2} \Sigma_{\phi,\sigma} \dot{\mathbf{u}}_r$. Next, we simulate a whitened Gaussian variable $\boldsymbol{\eta}_r \sim \mathcal{N}\left(0, I_M\right)$. After the Cholesky decomposition $\Sigma_{\phi,\sigma} = L L^\top$, we can sample $\phi$ conditioning on the transformed whitened variables $H = [\boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_R]$, where each column is defined as $\boldsymbol{\eta}_r = L^{-1}\left(\mathbf{u}_r - \mathbf{m}_r\right)$. Specifically, the posterior distribution of $\phi$ becomes:

$$p\left(\phi \mid \mathbf{y}_\Omega, H, \dot{U}\right) \propto p\left(\mathbf{y}_\Omega \mid \phi, H, \dot{U}\right)\, p(\phi) \prod_{r=1}^{R} \mathcal{N}\left(\dot{\mathbf{u}}_r \mid 0, K_s + \sigma^2 I_M\right). \qquad (12)$$

We use slice sampling as in Murray and Adams (2010) to draw samples from the above distribution. This approach is robust to the choice of sampling scale and easy to implement. Sampling $\gamma$ can be achieved in a similar way. Note that, as mentioned, the sampling is performed on the log-transformed variables to avoid numerical issues.
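For completeness, the sketch below shows a generic stepping-out slice sampler of the kind applied here to the log-transformed hyperparameters; `log_post` stands in for the (unnormalized) log-density in Eq. (12) and is a placeholder in this illustration:

```python
import numpy as np

def slice_sample(x0, log_post, width=1.0, max_steps=50, seed=None):
    """One univariate slice-sampling update starting from x0."""
    rng = np.random.default_rng(seed)
    log_y = log_post(x0) + np.log(rng.uniform())  # vertical level under the density
    # Step out an interval [l, r] that brackets the slice.
    l = x0 - width * rng.uniform()
    r = l + width
    for _ in range(max_steps):
        if log_post(l) < log_y:
            break
        l -= width
    for _ in range(max_steps):
        if log_post(r) < log_y:
            break
        r += width
    # Shrink the interval until a point inside the slice is found.
    while True:
        x1 = rng.uniform(l, r)
        if log_post(x1) >= log_y:
            return x1
        if x1 < x0:
            l = x1
        else:
            r = x1

# e.g., one update of log(phi) under a standard-normal toy log-posterior:
x = slice_sample(0.0, lambda z: -0.5 * z * z, seed=0)
```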

Sampling $\Lambda_w$.

Given the conjugate Wishart prior, the posterior distribution of $\Lambda_w$ is:

$$\Lambda_w \mid W \sim \mathcal{W}\left(\Psi^*, \nu^*\right), \qquad (13)$$

where $\Psi^* = \left(\Psi_0^{-1} + W W^\top\right)^{-1}$ and $\nu^* = \nu_0 + R$.
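A minimal sketch of this conjugate update (function and variable names are ours):

```python
import numpy as np
from scipy.stats import wishart

def sample_Lambda_w(W, Psi0, nu0, random_state=None):
    """Draw Lambda_w from the Wishart posterior in Eq. (13)."""
    P, R = W.shape
    Psi_star = np.linalg.inv(np.linalg.inv(Psi0) + W @ W.T)  # updated scale
    return wishart.rvs(df=nu0 + R, scale=Psi_star, random_state=random_state)

Lam = sample_Lambda_w(np.random.default_rng(0).standard_normal((3, 5)),
                      Psi0=np.eye(3), nu0=3)
```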

Sampling the coefficient factor matrices $\{U, V, W\}$.

Sampling the factor matrices can be cast as a Bayesian linear regression problem. Taking $W$ as an example, we can rewrite Eq. (7) as:

$$\mathbf{y}_\Omega = H_w \operatorname{vec}(W) + \boldsymbol{\epsilon}_\Omega, \qquad (14)$$

where $H_w = A \tilde{X} \left((V \odot U) \otimes I_P\right)$ is known and $\operatorname{vec}(W)$ is the coefficient vector to estimate. Since the priors on the component vectors $\mathbf{w}_r$ are independent, the prior distribution over the whole vectorized matrix is $\operatorname{vec}(W) \sim \mathcal{N}\left(0, I_R \otimes \Lambda_w^{-1}\right)$. Because both the likelihood and the prior of $\operatorname{vec}(W)$ are Gaussian, its posterior is also Gaussian, with mean $\boldsymbol{\mu}_w^* = \tau \Sigma_w^* H_w^\top \mathbf{y}_\Omega$ and precision $\left(\Sigma_w^*\right)^{-1} = \tau H_w^\top H_w + I_R \otimes \Lambda_w$.

The posterior distributions of $\operatorname{vec}(U)$ and $\operatorname{vec}(V)$ can be obtained similarly using different tensor unfoldings. To sample $U$, we use the mode-1 unfolding in Eq. (4) and reconstruct the regression model with $\operatorname{vec}(U)$ as coefficients:

$$\mathbf{y}_\Omega = H_u \operatorname{vec}(U) + \boldsymbol{\epsilon}_\Omega, \qquad (15)$$

where $H_u$ is the corresponding known design matrix built from $\tilde{X}$, $V$, and $W$, and the prior is $\operatorname{vec}(U) \sim \mathcal{N}\left(0, I_R \otimes K_s\right)$. Thus, the posterior of $\operatorname{vec}(U)$ has a closed form—a Gaussian distribution with mean $\boldsymbol{\mu}_u^* = \tau \Sigma_u^* H_u^\top \mathbf{y}_\Omega$ and precision $\left(\Sigma_u^*\right)^{-1} = \tau H_u^\top H_u + I_R \otimes K_s^{-1}$.

The posterior for $\operatorname{vec}(V)$ can be obtained by applying the mode-2 tensor unfolding. It should be noted that the above derivation gives the posterior for a whole factor matrix, e.g., $\operatorname{vec}(U) \in \mathbb{R}^{MR}$. One can further reduce the computing cost by sampling the component vectors $\mathbf{u}_r$ one by one for $r = 1, \ldots, R$, as in Luttinen and Ilin (2009). In this case, the time complexity of learning the factor matrices can be further reduced, at the cost of slower/poorer mixing.
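All three factor updates share the same conjugate Gaussian form; a generic sketch (our names, assuming the design matrix $H$ has already been formed) is:

```python
import numpy as np

def sample_gaussian_factor(H, y, tau, prior_prec, seed=None):
    """Draw theta from N(mu, Sigma) with Sigma^{-1} = tau H^T H + prior_prec
    and mu = tau Sigma H^T y, as in the posteriors of vec(W), vec(U), vec(V)."""
    rng = np.random.default_rng(seed)
    prec = tau * H.T @ H + prior_prec          # posterior precision
    L = np.linalg.cholesky(prec)
    mu = np.linalg.solve(prec, tau * H.T @ y)  # posterior mean
    # theta = mu + L^{-T} z has covariance prec^{-1}.
    z = rng.standard_normal(len(mu))
    return mu + np.linalg.solve(L.T, z)

rng = np.random.default_rng(0)
theta = sample_gaussian_factor(rng.standard_normal((50, 6)),
                               rng.standard_normal(50),
                               tau=4.0, prior_prec=np.eye(6))
```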

Sampling the precision $\tau$.

Since we used a conjugate Gamma prior, the posterior distribution of $\tau$ is also a Gamma distribution, with shape and rate parameters

$$a^* = a_0 + \frac{|\Omega|}{2}, \qquad b^* = b_0 + \frac{1}{2} \sum_{(m,n) \in \Omega} \left(y_{m,n} - \mathbf{x}_{m,n}^\top \boldsymbol{\beta}_{m,n}\right)^2,$$

respectively.
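A minimal sketch of this Gamma update (names are ours):

```python
import numpy as np

def sample_tau(residuals, a0=1e-2, b0=1e-2, seed=None):
    """Draw tau from Gamma(a0 + |Omega|/2, b0 + 0.5 * sum of squared residuals)."""
    rng = np.random.default_rng(seed)
    a = a0 + residuals.size / 2.0
    b = b0 + 0.5 * np.sum(residuals**2)
    return rng.gamma(shape=a, scale=1.0 / b)   # numpy uses scale = 1 / rate

tau = sample_tau(np.random.default_rng(0).standard_normal(100) * 0.5)
```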

4.3 Model Implementation

Input: observed data $\mathbf{y}_\Omega$, covariate tensor $\mathcal{X}$, CP rank $R$, burn-in size $K_1$, and sample size $K_2$.
Initialize the factor matrices $\{U, V, W\}$ with normally distributed random values, and set the hyperprior parameters $\{\mu_\phi, \tau_\phi\}$, $\{\mu_\gamma, \tau_\gamma\}$, $\{\Psi_0, \nu_0\}$, and $\{a_0, b_0\}$.
for $k = 1$ to $K_1 + K_2$ do
       Sample kernel hyperparameters $\{\phi, \gamma\}$;
       Sample the hyperparameter $\Lambda_w$;
       Sample factors $U$, $V$, and $W$;
       Sample precision $\tau$;
       if $k > K_1$ then
             Collect the sample $\mathcal{B}^{(k)} = \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r$;
             Compute the corresponding estimates (e.g., predictions for unobserved entries).
return the collected samples $\{\mathcal{B}^{(k)}\}$ to approximate the posterior.
Algorithm 1: MCMC inference for BKTR.

We summarize the implementation of BKTR in Algorithm 1. For the MCMC simulations, we run $K_1$ iterations as burn-in and take the following $K_2$ samples for estimation.
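A bookkeeping skeleton of Algorithm 1 might look as follows; `gibbs_sweep` is a stand-in for one pass of the conditional updates in Section 4.2, and all names are illustrative:

```python
import numpy as np

def gibbs_sweep(state, rng):
    # Placeholder: a real sweep would update phi, gamma, Lambda_w, U, V, W, tau
    # using the conditionals derived in Section 4.2.
    return dict(state)

def run_bktr_mcmc(init_state, K1=500, K2=500, seed=0):
    rng = np.random.default_rng(seed)
    state, samples = dict(init_state), []
    for k in range(K1 + K2):
        state = gibbs_sweep(state, rng)
        if k >= K1:                          # keep only post-burn-in draws
            B = np.einsum('mr,nr,pr->mnp', state['U'], state['V'], state['W'])
            samples.append(B)
    return np.mean(samples, axis=0)          # posterior-mean estimate of B

rng0 = np.random.default_rng(0)
init = {'U': rng0.standard_normal((30, 5)),
        'V': rng0.standard_normal((20, 5)),
        'W': rng0.standard_normal((4, 5))}
B_hat = run_bktr_mcmc(init, K1=10, K2=10)    # shape (30, 20, 4)
```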

5 Experiments

In this section, we evaluate the effectiveness of BKTR on two synthetic data sets and a real-world spatiotemporal bike-sharing demand data set. All experiments are conducted on a laptop with a 6-core Intel Xeon 2.60 GHz CPU and 32 GB RAM. We conduct three experiments: (1) a small-scale analysis to compare BKTR with STVC and a pure low-rank tensor regression model; (2) a moderately sized analysis to test the effect of the rank on coefficient reconstruction and on prediction for unobserved entries; and (3) a large-scale, real-world analysis of spatiotemporal bike-sharing demand.

5.1 Simulation 1: Small Data Set

Simulation Setting.

In this simulation, we generate a small data set in which the true $\mathcal{B}$ follows a separable covariance structure. Specifically, we simulate $M = 30$ locations in a square and $N$ uniformly distributed time points on an interval, and then generate the synthetic data following Eq. (5) with:

$$\operatorname{vec}\left(\mathcal{B}_{(3)}\right) \sim \mathcal{N}\left(0, K_t \otimes K_s \otimes \Lambda^{-1}\right), \quad \boldsymbol{\epsilon} \sim \mathcal{N}\left(0, \tau^{-1} I_{MN}\right),$$

where the covariate tensor $\mathcal{X}$ contains a slice of ones for the intercept together with randomly generated covariates, and $K_s$ and $K_t$ are computed from a Matern 3/2 kernel function $k_s(d) = \left(1 + \frac{\sqrt{3}d}{\ell_s}\right)\exp\left(-\frac{\sqrt{3}d}{\ell_s}\right)$ ($d$ is the distance between two locations) and a squared exponential (SE) kernel function $k_t(t_i, t_j) = \exp\left(-\frac{(t_i - t_j)^2}{2\ell_t^2}\right)$, respectively, with length-scales $\ell_s$ and $\ell_t$. We set the CP rank $R$ and compare BKTR with the original STVC model in Gelfand et al. (2003) and with a pure low-rank Bayesian tensor regression (BTR) model that imposes no spatiotemporal GP priors, using the same rank setting as BKTR. In other words, BTR corresponds to the default tensor regression in which all data points are assumed to be independent. For both BKTR and STVC, we assume the kernel functions are known, i.e., $k_s$ is Matern 3/2 and $k_t$ is SE, and use MCMC to estimate the unknown kernel hyperparameters. We estimate $\mathcal{B}$ using the posterior mean and evaluate model performance using the mean absolute error (MAE) and the root mean square error (RMSE).
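For reference, the two kernels assumed in this simulation can be constructed as below (a sketch with placeholder length-scale values, not the exact simulation code):

```python
import numpy as np

def matern32(D, ell):
    """Matern 3/2 correlation matrix from a distance matrix D."""
    a = np.sqrt(3.0) * D / ell
    return (1.0 + a) * np.exp(-a)

def sq_exp(t, ell):
    """Squared-exponential correlation matrix over time points t."""
    d2 = (t[:, None] - t[None, :]) ** 2
    return np.exp(-d2 / (2.0 * ell**2))

rng = np.random.default_rng(0)
coords = rng.random((30, 2)) * 10            # e.g., 30 locations in a square
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
K_s = matern32(D, ell=2.0)                   # placeholder length-scales
K_t = sq_exp(np.linspace(0.0, 1.0, 20), ell=0.1)
```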

Results.

Method | MAE | RMSE | Computing time
STVC | 1.52 ± 0.48 | 2.08 ± 0.66 | 19.60 sec/iter.
BKTR | 1.66 ± 0.58 | 2.23 ± 0.82 | 0.009 sec/iter.
BTR | 2.29 ± 0.84 | 2.98 ± 1.10 | 0.004 sec/iter.
Table 1: Performance (mean ± std over 10 runs) on the estimation of $\mathcal{B}$ and model comparison for simulation 1.
Figure 2: Estimation results on simulation 1. (a) Comparison of the estimated coefficients: an example of the approximated coefficients (mean with 95% CI) for one location-covariate pair, where the top and bottom subplots compare BKTR with STVC and BTR, respectively. (b) Trace plots of the kernel hyperparameters estimated by BKTR.

We replicate the simulation 10 times and run the MCMC sampling with $K_1$ burn-in iterations and $K_2$ retained samples. Table 1 presents the accuracy (mean ± std) of the estimated coefficient tensor over the 10 runs; we also report the average computing time per MCMC iteration. As we can see, BKTR with a small rank achieves accuracy similar to STVC in terms of both MAE and RMSE, suggesting that the kernelized CP decomposition can approximate the true coefficient tensor well even when the true $\mathcal{B}$ does not follow an exact low-rank specification. In addition, BKTR substantially outperforms BTR, which confirms the importance of introducing GPs to encode the spatial and temporal dependencies. In terms of computing time, BKTR is much faster than STVC; due to the high cost of STVC, it becomes infeasible to analyze even a data set of moderate size, e.g., with hundreds of locations and time points.

Figure 2 shows the estimation results for a randomly selected run. In panel (a), we plot an example of the estimated temporal variation of the coefficients from the three methods. As can be seen, most of the true coefficients are contained in the 95% credible intervals (CIs) of BKTR and STVC (top of Figure 2(a)). The bottom of Figure 2(a) shows the comparison between BKTR and BTR: although BTR can still capture the general trend, it fails to reproduce the strong temporal dependency due to the absence of spatial/temporal priors. To further verify the mixing of the GP hyperparameters, we continue the MCMC for BKTR for 5000 iterations and show the trace plots for $\ell_s$ and $\ell_t$ in Figure 2(b). The Markov chains for the kernel hyperparameters mix fast and well, and a few hundred iterations should be sufficient for posterior estimation.

5.2 Simulation 2: Data of Moderate Size

Simulation Setting.

To evaluate the performance of BKTR on larger data sets, we generate a relatively large synthetic data set with $M$ locations, $N$ time points, and $P$ covariates following Eq. (5); the setups of $K_s$ and $K_t$ are the same as in simulation 1. We consider a scenario with incomplete observations, i.e., we suppose that $Y$ is only partially observed on an index set $\Omega$. Specifically, we randomly mask 50% of the values as missing and use the remainder as observed data. To assess the effect of rank selection, we test a range of CP ranks $R$ and compute both the estimation error of the coefficients and the prediction error for the unobserved entries.

Results.

(a) Estimation results for $\mathcal{B}(:, 20, 4)$, i.e., the coefficients of the 4th covariate at the 20th time point.
(b) Estimation of coefficients.
(c) Rank effect.
Figure 3: BKTR modeling results on simulation 2. (a) Spatial plots of the true coefficients, the estimated coefficients (posterior mean), and the absolute estimation error for $\mathcal{B}(:, 20, 4)$. (b) Two examples of the estimated coefficients (mean with 95% CI). (c) Estimation error of the coefficients and of the missing entries with respect to the tensor rank.

We run MCMC with $K_1$ burn-in iterations and $K_2$ samples, and show the results in Figure 3. Figure 3(a) illustrates the spatial pattern of one coefficient (the 4th covariate at the 20th time point), showing the true and estimated values together with the absolute estimation error. Panel (b) illustrates the temporal behaviour of two randomly selected coefficients. As we can see, BKTR effectively reproduces the true values with acceptable errors. Figure 3(c) shows the effect of the rank: a larger rank gives better accuracy for both coefficient estimation and prediction of the missing entries, but the accuracy gain becomes marginal once the rank is sufficiently large. This demonstrates that the proposed Bayesian treatment offers a flexible solution for model estimation and parameter inference.

5.3 Bike-sharing Demand Modeling

Data Description.

In this section we perform local spatiotemporal regression on large-scale bike-sharing trip data collected from BIXI (https://bixi.com)—a docked bike-sharing service in Montreal, Canada. BIXI operates 615 stations across the city, and the yearly service window runs from April to November. We collect the number of daily departure trips for each station over 196 days (from April 15 to October 27, 2019). We discard the stations with a daily average demand of less than 5 and consider only the remaining $M$ stations. The resulting matrix $Y$ contains missing/corrupted values in the raw data, and we keep only the valid entries as the observed set $\Omega$. We follow the analyses in Faghih-Imani et al. (2014) and Wang et al. (2021) and consider two types of spatial covariates: (a) features related to bicycle infrastructure, and (b) land-use and built-environment features. For temporal covariates, we mainly consider weather conditions and holidays. Table 2 lists the 13 spatial covariates and 5 temporal covariates used in this analysis; together with an intercept, the final size of the covariate dimension is $P = 19$. In constructing the covariate tensor $\mathcal{X}$, we follow the same approach as in simulations 1 and 2: we set the first slice to one for the intercept, replicate each spatial covariate along the temporal dimension, and replicate each temporal covariate along the spatial dimension. Given the differences in magnitude, we normalize all covariates to $[0, 1]$ using min-max normalization; the demand is also normalized. The coefficient tensor $\mathcal{B}$ then contains more than two million entries, preventing us from applying STVC directly.

Spatial (a): length of cycle paths; length of major roads; length of minor roads; capacity.
Spatial (b): numbers of metro stations, bus stations, and bus routes; numbers of restaurants, universities, and other businesses & enterprises; park area; walkscore; population.
Temporal: daily relative humidity; daily maximum temperature; daily mean temperature; total precipitation; dummy variables for holidays.

Table 2: Summary of the independent variables.

Experimental Setup.

For the spatial factors, we use a Matern 3/2 kernel as a universal prior, i.e., $k_s(s_i, s_j) = \left(1 + \frac{\sqrt{3}\,d(s_i, s_j)}{\ell_s}\right)\exp\left(-\frac{\sqrt{3}\,d(s_i, s_j)}{\ell_s}\right)$, where $d(s_i, s_j)$ is the haversine distance between locations $s_i$ and $s_j$, and $\ell_s$ is the spatial length-scale. The Matern class is a commonly used prior kernel in spatial modeling. For the temporal factors, we use a locally periodic correlation matrix obtained as the product of a periodic kernel and an SE kernel, i.e., $k_t(t_i, t_j) = \exp\left(-\frac{2\sin^2\left(\pi|t_i - t_j|/T\right)}{\ell_p^2}\right)\exp\left(-\frac{(t_i - t_j)^2}{2\ell_d^2}\right)$, where $\ell_p$ and $\ell_d$ denote the length-scale and decay-time of the periodic component, respectively, and we fix the period at $T = 7$ days. This specification encodes a weekly temporal pattern that can change over time, allowing us to characterize both the daily variation and the weekly trend of the coefficients. We set the CP rank $R$ and run MCMC with $K_1$ burn-in and $K_2$ sampling iterations.
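A sketch of these two kernels follows (with placeholder hyperparameter values; only the period $T = 7$ days is fixed, as stated above):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def matern32(D, ell):
    """Matern 3/2 spatial correlation from a distance matrix D."""
    a = np.sqrt(3.0) * D / ell
    return (1.0 + a) * np.exp(-a)

def locally_periodic(t, ell_p, ell_d, period=7.0):
    """Periodic kernel (weekly pattern) times an SE decay over days."""
    dt = np.abs(t[:, None] - t[None, :])
    periodic = np.exp(-2.0 * np.sin(np.pi * dt / period) ** 2 / ell_p**2)
    decay = np.exp(-(dt**2) / (2.0 * ell_d**2))
    return periodic * decay

t = np.arange(196.0)                         # daily index over the study period
K_t = locally_periodic(t, ell_p=1.0, ell_d=30.0)  # placeholder length-scales
```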

Results.

(a) Temporal plots of coefficients.
(b) Spatial maps of coefficients on August 1, 2019.
Figure 4: BKTR-estimated coefficients for the BIXI demand data. (a) Temporal plots of the coefficients (mean with 95% CI) for 4 spatial covariates at location #23 and 4 temporal covariates at location #2. (b) Spatial maps showing the posterior mean for 6 covariates.

Figure 4 visualizes some examples of the estimated coefficients, with panel (a) showing temporal examples and panel (b) showing spatial examples. As we can see from Figure 4(a), the temporal variation of the coefficients for both spatial and temporal covariates is interpretable. The coefficients (along with CIs describing their uncertainty) allow us to identify the most important factors for each station at each time point. For example, we observe a similar variation within each week and a general long-term trend from April to October. Furthermore, the magnitude of the coefficients is much larger during the summer (July/August) than at the beginning and end of the operating period, which is as expected given the colder weather at those times. Overall, for the spatial variables, the positive effect of walkscore and the negative effect of the length of major roads indicate that bicycle demand tends to be higher in densely populated, walkable neighborhoods. For the temporal covariates, precipitation, humidity, and the holiday variable all carry negative coefficients, implying that people are less likely to ride bicycles during rainy/humid periods and on holidays. The spatial distributions of the coefficients in Figure 4(b) likewise show consistent estimates; again, one can explore the credible intervals to evaluate the significance of each covariate at a given location. It should also be noted that the posterior mean of the estimated noise variance $\tau^{-1}$ is much smaller than the variability of the data, confirming the importance of the varying coefficients in the model. Overall, BKTR produces interpretable results for understanding the effects of different spatial and temporal covariates on bike-sharing demand. These findings can help planners evaluate the performance of bike-sharing systems and update/design them accordingly. Moreover, thanks to the GP priors, BKTR can perform spatial forecasting/kriging for unseen locations, and such results can be used to select sites for new stations.

6 Conclusion

This paper proposes an effective solution for large-scale local spatiotemporal regression analysis. We parameterize the model coefficients using a low-rank CP decomposition, which greatly reduces the number of parameters from $MNP$ to $R(M + N + P)$. In contrast to previous studies on tensor regression, the proposed BKTR model goes beyond the low-rank assumption by integrating GP priors to characterize the strong local spatial and temporal dependencies. Our numerical experiments on both synthetic and real-world data suggest that BKTR can reproduce local spatiotemporal processes efficiently and reliably.

There are several directions for future research. For the tensor structure, we can extend our work to learn the CP rank automatically Zhao et al. (2015) or replace the CP decomposition with a more flexible Tucker decomposition. In terms of GP priors, BKTR is flexible and can accommodate different kernels (w.r.t. functional form and hyperparameters) for different factors, as in Luttinen and Ilin (2009); combining different kernels can also produce richer spatiotemporal dynamics and multiscale properties. In terms of computation, one can further reduce the cost of GP learning (e.g., for the spatial kernel) by introducing sparse approximation techniques such as inducing points Titsias (2009). Finally, the regression model itself can be extended to handle generalized responses, such as binary or count data, by introducing appropriate link functions.

Acknowledgements

This research is supported by NSERC, FRQNT, and CFI. M. Lei would like to thank the Institute for Data Valorization (IVADO) for providing a scholarship to support this study.

References

  • M. A. Alvarez and N. D. Lawrence (2008) Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, Vol. 21, pp. 57–64.
  • M. A. Álvarez, L. Rosasco, and N. D. Lawrence (2012) Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning 4 (3), pp. 195–266.
  • M. T. Bahadori, Q. R. Yu, and Y. Liu (2014) Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pp. 3491–3499.
  • S. Banerjee, B. P. Carlin, and A. E. Gelfand (2014) Hierarchical Modeling and Analysis for Spatial Data. 2nd edition, CRC Press.
  • E. V. Bonilla, K. M. A. Chai, and C. K. Williams (2007) Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153–160.
  • N. Cressie and H. Huang (1999) Classes of nonseparable, spatio-temporal stationary covariance functions. Journal of the American Statistical Association 94 (448), pp. 1330–1339.
  • N. Cressie and C. K. Wikle (2015) Statistics for Spatio-Temporal Data. John Wiley & Sons.
  • A. Faghih-Imani, N. Eluru, A. M. El-Geneidy, M. Rabbat, and U. Haq (2014) How land-use and urban form impact bicycle flows: evidence from the bicycle-sharing system (BIXI) in Montreal. Journal of Transport Geography 41, pp. 306–314.
  • A. O. Finley (2011) Comparing spatially-varying coefficients models for analysis of ecological data with non-stationary and anisotropic residual dependence. Methods in Ecology and Evolution 2 (2), pp. 143–154.
  • A. S. Fotheringham, C. Brunsdon, and M. Charlton (2003) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley & Sons.
  • A. E. Gelfand, H. Kim, C. Sirmans, and S. Banerjee (2003) Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association 98 (462), pp. 387–396.
  • R. Guhaniyogi, S. Qamar, and D. B. Dunson (2017) Bayesian tensor regression. The Journal of Machine Learning Research 18 (1), pp. 2733–2763.
  • B. Huang, B. Wu, and M. Barry (2010) Geographically and temporally weighted regression for modeling spatio-temporal variation in house prices. International Journal of Geographical Information Science 24 (3), pp. 383–401.
  • A. J. Izenman (1975) Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5 (2), pp. 248–264.
  • T. G. Kolda and B. W. Bader (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500.
  • J. Luttinen and A. Ilin (2009) Variational Gaussian-process factor analysis for modeling spatio-temporal data. Advances in Neural Information Processing Systems 22, pp. 1177–1185.
  • J. Luttinen and A. Ilin (2012) Efficient Gaussian process inference for short-scale spatio-temporal modeling. In International Conference on Artificial Intelligence and Statistics, pp. 741–750.
  • I. Murray and R. P. Adams (2010) Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pp. 1723–1731.
  • J. Quinonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research 6, pp. 1939–1959.
  • G. Rabusseau and H. Kadri (2016) Low-rank regression with tensor responses. Advances in Neural Information Processing Systems 29, pp. 1867–1875.
  • N. Rao, H. Yu, P. Ravikumar, and I. S. Dhillon (2015) Collaborative filtering with graph information: consistency and scalable methods. In Advances in Neural Information Processing Systems, pp. 2107–2115.
  • C. E. Rasmussen and C. K. Williams (2006) Gaussian Processes for Machine Learning. MIT Press.
  • M. L. Stein (2005) Space-time covariance functions. Journal of the American Statistical Association 100 (469), pp. 310–321.
  • M. Titsias (2009) Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574.
  • X. Wang, Z. Cheng, M. Trépanier, and L. Sun (2021) Modeling bike-sharing demand using a regression model with spatially varying coefficients. Journal of Transport Geography 93, pp. 103059.
  • R. Yu, G. Li, and Y. Liu (2018) Tensor regression meets Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 482–490.
  • R. Yu and Y. Liu (2016) Learning from multiway data: simple and efficient tensor regression. In International Conference on Machine Learning, pp. 373–381.
  • H. Zhang (2007) Maximum-likelihood estimation for multivariate spatial linear coregionalization models. Environmetrics 18 (2), pp. 125–139.
  • Q. Zhao, L. Zhang, and A. Cichocki (2015) Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1751–1763.
  • H. Zhou, L. Li, and H. Zhu (2013) Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association 108 (502), pp. 540–552.