Multiresolution Tensor Learning for Efficient and Interpretable Spatial Analysis

02/13/2020 ∙ by Jung Yeon Park, et al. ∙ 58

Efficient and interpretable spatial analysis is crucial in many fields such as geology, sports, and climate science. Large-scale spatial data often contains complex higher-order correlations across features and locations. While tensor latent factor models can describe higher-order correlations, they are inherently computationally expensive to train. Furthermore, for spatial analysis, these models should not only be predictive but also be spatially coherent. However, latent factor models are sensitive to initialization and can yield inexplicable results. We develop a novel Multi-resolution Tensor Learning (MRTL) algorithm for efficiently learning interpretable spatial patterns. MRTL initializes the latent factors from an approximate full-rank tensor model for improved interpretability and progressively learns from a coarse resolution to the fine resolution for an enormous computation speedup. We also prove the theoretical convergence and computational complexity of MRTL. When applied to two real-world datasets, MRTL demonstrates 4   5 times speedup compared to a fixed resolution while yielding accurate and interpretable models.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 8

page 17

Code Repositories

mrtl

Multiresolution Tensor Learning


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Spatial analysis seeks to explain patterns of geographic data and analyzing such large-scale spatial data plays a critical role in sports, geology and climate science. In spatial statistics, kriging or Gaussian processes are popular tools for spatial analysis (Cressie, 1992). Others have proposed various Bayesian methods such as Cox processes (Miller et al., 2014; Dieng et al., 2017) to model spatial data. However, while mathematically appealing, these methods often have difficulties scaling to high-resolution data.

Figure 1: Latent factors: random vs. good initialization

We are interested in learning high-dimensional tensor latent factor models, which have shown to be a scalable alternative for spatial analysis (Yu et al., 2018; Litvinenko et al., 2019). High resolution spatial data often contain higher-order correlations between features and locations and tensors can naturally encode such multi-way correlations. For example, in competitive basketball play, we can predict how each player’s decision to shoot is jointly influenced by their shooting style, his or her court position and the position of the defenders by simultaneously encoding these features as a tensor. Using such representations, tensor learning can directly extract higher-order correlations. A challenge in such models is high computational cost. High-resolution spatial data often requires discretization, leading to high-dimensional tensors that are computationally expensive to train. Low-rank tensor learning (Yu et al., 2018; Kossaifi et al., 2019) reduces the dimensionality by assuming low-rank structures in the data and uses tensor decomposition to discovery latent semantics, see review papers (Kolda and Bader, 2009; Sidiropoulos et al., 2017). However, many tensor learning methods have been shown to be sensitive to noise (Cheng et al., 2016) and initialization (Anandkumar et al., 2014). Other numerical techniques including random sketching (Wang et al., 2015; Haupt et al., 2017) and parallelization (Austin et al., 2016; Li et al., 2017a) can speed up training, but they often fail to utilize the unique properties of spatial data such as spatial auto-correlations. Using latent factor models also gives rise to another issue: interpretability. It is well known that a latent factor model is not identifiable due to rotational indeterminacy. As the latent factors are not unique, these models can yield unintuitive latent factors that do not offer insights to domain experts. In general, the definition of interpretability is highly application dependent (Doshi-Velez and Kim, 2017). For spatial analysis, one of the unique properties of spatial patterns is spatial auto-correlation: close objects have similar values (Moran, 1950), which we use as criteria for interpretability. As latent factors models are sensitive to initialization, previous research (Miller et al., 2014; Yue et al., 2014) has shown that random initialized latent factor models can lead to spatial patterns that violate spatial auto-correlation and hence are not interpretable (see Fig. 1). In this paper, we propose a Multiresolution Tensor Learning algorithm, MRTL, to efficiently learn accurate and interpretable patterns in spatial data. MRTL

is based on two key insights. First, to obtain good initialization, we train a full-rank tensor model approximately at a low resolution and factorize it with tensor decomposition. Second, we exploit spatial auto-correlation to learn models at multiple resolutions: starting at a coarse resolution and iteratively finegrain to the next resolution. We demonstrate that this approach converges significantly faster than fixed resolution methods. We develop several fine-graining criteria based on loss convergence and information theory to determine when to finegrain. We also consider different interpolation schemes and discuss how to finegrain in different applications. In summary, our contributions are:

  • We propose a Multiresolution Tensor Learning (MRTL) optimization algorithm for large-scale spatial analysis

  • We prove the rate of convergence for MRTL which depends on the spectral norm of the interpolation operator. We also show the exponential computational speedup for MRTL compared with fixed resolution.

  • We develop different criteria to determine when to transition to a finer resolution and discuss different finegraining methods.

  • We evaluate on two real-world datasets and show MRTL learns orders-of-magnitude faster than fixed-resolution learning and can produce interpretable latent factors.

2 Related Work.

Spatial Analysis

Discovering spatial patterns has significant implications in scientific fields such as human behavior modeling, neural science, and climate science. Early work in spatial statistics has contributed greatly to spatial analysis through the work in Moran’s I statistics (Moran, 1950) and Getis-Ord general G statistics (Getis and Ord, 1992) for measuring spatial auto-correlation. Geographically weighted regression (Brunsdon et al., 1998) accounts for the spatial heterogeneity with a local version of spatial regression but fail to capture higher order correlation. Kriging or Gaussian processes are popular tools for spatial analysis but they often require carefully designed variograms (also known as kernels) (Cressie, 1992). Other Bayesian hierarchical models favor spatial point processes to model spatial data (Diggle et al., 2013; Miller et al., 2014; Dieng et al., 2017). These frameworks are conceptually elegant, but often computationally intractable.

Tensor Learning

Latent factor models utilize correlations in the data to reduce the dimensionality of the problem and have been used extensively in multi-task learning (Romera-Paredes et al., 2013) and recommendation systems (Lee and Seung, 2001). Tensor learning (Zhou et al., 2013; Bahadori et al., 2014; Haupt et al., 2017)

employs tensor latent factor models to learn higher-order correlations in the data in a supervised fashion. In particular, tensor latent factor models aim to learn the higher-order correlations in spatial data by assuming unknown low-dimensional representations among features and locations. High-order tensor models are non-convex by nature, suffer from the curse of dimensionality, and are notoriously hard to train

(Kolda and Bader, 2009; Sidiropoulos et al., 2017). There are many efforts to scale up tensor computation, e.g., parallelization (Austin et al., 2016) and sketching (Wang et al., 2015; Haupt et al., 2017; Li et al., 2017b). In this work, we propose an optimization algorithm to learn tensor models at multiple resolutions that is not only fast but can also generate interpretable factors.

Multiresolution Methods

The idea of multiresolution methods is well celebrated in machine learning, both in latent factor modeling

(Kondor et al., 2014; Ozdemir et al., 2017)

and deep learning

(Reed et al., 2017; Serban et al., 2017). For example, multi-resolution matrix factorization (Kondor et al., 2014; Ding et al., 2017) and its higher order extensions (Schifanella et al., 2014; Ozdemir et al., 2017; Han and Dunson, 2018)

applies multi-level orthogonal operators with the goal to uncover the multiscale structure in a single matrix. In contrast, our method aims to speed up learning by exploiting the relationship among multiple tensors of different resolutions. Our approach bears affinity with the multigrid method in numerical analysis for solving partial differential equations

(Trottenberg et al., 2000; Hiptmair, 1998) where the idea is to accelerate iterative algorithms by solving a coarse problem first and then gradually fine-grain the solution.

3 Tensor Models for Spatial Data

We consider tensor learning in the supervised setting. We will describe both models for the full-rank case and the low-rank case. We use an order-3 tensor for ease of illustration but our model can be readily extended to higher order cases.

3.1 Full Rank Tensor Models

Given input data consisting of both non-spatial and spatial features, we can discretize the spatial features at different resolutions, with corresponding dimensions as . Tensor learning parameterizes the model with a weight tensor over all features, where is number of outputs and is number of non-spatial features. The input data is of the form . Note that both the input features and the learning model are resolution dependent. is the label for output . For resolution , the full rank tensor learning model can be written as

(1)

where

is the activation function and

is the bias for output . The weight tensor is contracted with along the non-spatial mode and the spatial mode . In general, Eqn. (1) can be extended to multiple spatial features and multiple spatial modes, each of which can have its own set of resolution-dependent dimensions. We use a sigmoid activation function for classification and a linear activation function for regression.

3.2 Low Rank Tensor Model

Low rank tensor models assume a low-dimensional structure in that can capture the inherent sparsity in the data and also alleviate model overfitting. We use CANDECOMP/PARAFAC (CP) decomposition (Hitchcock, 1927) as an example; our method can easily be extended for other decompositions as well. Let be the CP rank of the tensor. The weight tensor is factorized into multiple factors matrices

The tensor latent factor model is:

(2)

where the columns of are latent factors for each mode of and is resolution dependent.

3.3 Spatial Regularization

Interpretability is in general hard to define or quantify (Doshi-Velez and Kim, 2017; Ribeiro et al., 2016; Lipton, 2018; Molnar, 2019). In the context of spatial analysis, we deem a latent factor as interpretable if it produces a spatially coherent pattern that satisfies spatial auto-correlation. To this end, we utilize a spatial regularization kernel (Lotte and Guan, 2010; Miller et al., 2014; Yue et al., 2014) and extend to the tensor case. Let index all locations of the spatial dimension for resolution . The spatial regularization term is:

(3)

where denotes the Frobenius norm and

is the kernel that controls the degree of similarity between locations. We use a simple RBF kernel with hyperparameter

.

(4)

These kernels can be precomputed for all spatial modes for each resolution. If there are multiple spatial modes, we apply spatial regularization across all different modes. We additionally use regularization to encourage smaller weights. Any other -norm can be used.

4 Multi-Resolution Tensor Learning

We now describe our algorithm MRTL that addresses both the computation and interpretability issues. Two key concepts of MRTL are learning good initializations and utilizing multiple resolutions.

4.1 Initialization

In general, due to the nonconvex nature of various problems, tensor latent factor models are sensitive to initialization and lead to uninterpretable latent factors (Miller et al., 2014; Yue et al., 2014). We use full-rank initialization in order to learn latent factors that correspond to known spatial patterns. We first train an approximate full-rank version of the tensor model in Eqn. (1). The weight tensor is then decomposed into latent factors and these values are used as initialization of the low-rank model. The low-rank model in Eqn. (2

) is then trained to the final desired accuracy. As we use approximately optimal solutions of the full-rank model as initializations for the low-rank model, our algorithm produces interpretable latent factors in a variety of different scenarios and datasets. There is a trade-off in computation due to training of the full-rank model. However, as the full-rank model is trained only for a small number of epochs, the increase in computation time is not substantial. We also train the full-rank model only at lower resolutions to further reduce computation. In our experiments, we find that spatial regularization alone is not enough to learn spatially coherent factors, whereas full-rank initialization, though computationally costly, is able to fix this issue. Thus, full-rank initialization is critical for spatial interpretability.

4.2 Multi-resolution

Learning a high-dimensional tensor model is generally computationally expensive and memory inefficient. We utilize multiple resolutions for this issue. We outline MRTL in Alg. 1 where we omit the bias in the description for simplicity.

1:  Input: initialization , data .
2:  Output: latent factors
3:  # full rank tensor model
4:  for each resolution  do
5:     Initialize
6:     Get a mini-batch from training set
7:     while stopping criterion not true do
8:         
9:          Opt
10:     end while
11:      = Finegrain
12:  end for
13:  # tensor decomposition
14:   CP_ALS
15:  # low rank tensor model
16:  for each resolution  do
17:     Initialize
18:     Get a mini-batch from training set
19:     while stopping criterion not true do
20:         
21:          Opt
22:     end while
23:     for each spatial factor  do
24:          = Finegrain
25:     end for
26:  end for
Algorithm 1 Multiresolution Tensor Learning: MRTL

We use superscripts to represent the weight tensor and factor matrices at resolution and subscripts to denote the iterate at step , i.e. is at resolution at step . is the initial weight tensor at the lowest resolution. denotes all factor matrices at resolution and we use to index the factor . To make training feasible, we train both the full rank and low rank models at multiple resolutions, starting from a coarse spatial resolution and progressively increase the resolution. At each resolution , we learn using stochastic optimization Opt (we used Adam (Kingma and Ba, 2014) in our experiments). When the stopping criterion is met, we transform to in a process we call finegraining (Finegrain). Due to spatial auto-correlation, trained model at the lower resolution will serve as a good initialization for the higher resolutions. For both models, we only finegrain the factors that corresponds to the spatial mode. Once the full rank resolution has been trained up to resolution (chosen to fit GPU memory or time constraints), we decompose using CP_ALS, the standard alternating least squares (ALS) algorithm (Kolda and Bader, 2009) for CP decomposition. Then the low-rank model is trained at resolutions to final desired accuracy, finegraining between the resolutions. MRTL

is not specific to ALS and allows for any other CP decomposition method, such as multiplicative updates for nonnegative CP decomposition. While we can use tensor nuclear norm regularization

(Friedland and Lim, 2018), this method was computationally infeasible for our datasets.

When to finegrain

There is a tradeoff between training times at different resolutions. While training for longer at lower resolutions decreases computation significantly, we do not want to overfit to the lower resolution data. On the other hand, training at higher resolutions can yield more accurate solutions using more finegrained information. We investigate four different criteria to balance this tradeoff: 1) validation loss, 2) gradient norm, 3) gradient variance, and 4) gradient entropy. Increase in validation loss

(Prechelt, 1998; Yao et al., 2007)

is a commonly used heuristic for early stopping. Another approach is to analyze the gradient distributions during training. For a convex function

, stochastic gradient descent will converge into a noise ball near the optimal solution as the gradients approach zero. However, lower resolutions may be too coarse to learn more finegrained curvatures and the gradients will increasingly disagree near the optimal solution. We quantify the disagreement in the gradients with metrics such as norm, variance, and entropy. We use intuition from convergence analysis for gradient norm and variance

(Bottou et al., 2018), and information theory for gradient entropy (Srinivas et al., 2012). Let and represent the weights and the random sampling of minibatches at step , respectively. Let be the validation loss and be the stochastic gradients at step . The finegraining criteria are:

  • Validation Loss:

  • Gradient Norm:

  • Gradient Variance:

  • Gradient Entropy:

where . One can also use thresholds, i.e. , but as these are dependent on the dataset, we use in our experiments. One can also incorporate patience, i.e. setting the maximum number of epochs where the stopping conditions was reached.

How to finegrain

We discuss different interpolation schemes for different types of features. Categorical/multinomial variables, such as a player’s position on the court, are one-hot encoded or multi-hot encoded onto a discretized grid. Note that as we use higher resolutions, the sum of the input values are still equal across resolutions,

. As the sum of the features remains the same across resolutions and our tensor models are multilinear, nearest neighbor interpolation should be used in order to produce the same outputs.

as for cells that do not contain the value. This scheme yields the same outputs and the same loss values across resolutions. Continuous variables that represent averages over locations, such as sea surface salinity, often have similar values at each finegrained cell at higher resolutions (as the values at coarse resolutions are subsampled or averaged from values at the higher resolution). An example is Laplacian/Gaussian pyramids for images (Burt and Adelson, 1983). Then , where the approximation comes from the type of downsampling used. Using a linear interpolation scheme,

The weights are divided by the scale factor of to keep the outputs approximately equal. We use bilinear interpolation, though any other linear interpolation can be used.

5 Theoretical Analysis.

5.1 Convergence

We prove the convergence rate for MRTL

of a single spatial mode. For a loss function

and stochastic variable , the optimization problem is:

(5)

We defer all proofs to the Appendix. For a fixed-resolution miniSGD algorithm, under common assumptions in convergence analysis:

  • is - strongly convex, -smooth

  • (unbiased) gradient given

  • (variance) for all the ,

Theorem 5.1.

(Bottou et al., 2018) If the step size , then a fixed resolution solution satisfies

where , , and is the optimal solution.

which gives convergence. Denote the number of total iterations for resolution as , and the weights as . We let denote the number of dimensions at and we assume a dyadic scaling between resolutions such that . We define finegraining using an interpolation operator such that as in (Bramble, 2019). For the simple case of a 1D spatial grid where has spatial dimension , P would be of a Toeplitz matrix of dimension . For example, for linear interpolation of ,

Any interpolation scheme can be expressed in this form. The convergence of multiresolution learning algorithm depends on the following property of spatial data:

Definition 5.2 (Spatial Autocorrelation).

The difference between the optimal solutions of consecutive resolution is upper bounded by

with being the interpolation operator.

Theorem 5.3.

If the step size , then MRTL solution satisfies

where , , and is the operator norm of the interpolation operator .

This gives similar convergence rate as the fixed resolution algorithm, with a constant that depends on the operator norm of the interpolation operator .

5.2 Computational Complexity

To analyze computational complexity, we resort to fixed point convergence (Hale et al., 2008) and the multi-grid method (Stüben, 2001). Intuitively, as most of the training iterations are spent on coarser resolutions with fewer number of parameters, multi-resolution learning is more efficient than fixed-resolution training. Assuming that is Lipschitz continuous, we can view gradient-based optimization as a fixed-point iteration operator with a contraction constant of (note that stochastic gradient descent converges to a noise ball instead of a fixed point):

Let

be the optimal estimator at resolution

and be a solution satisfying . The algorithm terminates when the estimation error reaches . The following lemma describes the computational cost of the fixed-resolution algorithm.

Lemma 5.4.

Given a fixed point iteration operator with contraction constant of , the computational complexity of fixed-resolution training for tensor model of order and rank is

(6)

where is the terminal estimation error.

The next theorem 5.5 characterizes the computational speed-up gained by multi-resolution learning compared to fixed-resolution learning, with respect to the contraction factor and the terminal estimation error .

Theorem 5.5.

If the fixed point iteration operator (gradient descent) has a contraction factor of , multi-resolution learning with the termination criteria of at resolution is faster than fixed-resolution learning by a factor of , with the terminal estimation error .

Note that the speed-up using multi-resolution learning uses a global convergence criterion for each .

6 Experiments

We apply MRTL to two real-world datasets: basketball tracking and climate data.

6.1 Datasets

Tensor classification: Basketball tracking

We use a large NBA player tracking dataset from (Yue et al., 2014; Zheng et al., 2016)

consisting of the coordinates of all players at 25 frames per second, for a total of approximately 6 million frames. The goal is to predict whether a given ball handler will shoot within the next second, given his position on the court and the relative positions of the defenders around him. In applying our method, we hope to obtain common shooting locations on the court and how a defender’s relative position suppresses shot probability.

Figure 2: Left: Discretizing of a continuous-valued position of a player (red) via a spatial grid. Right: sample frame with a ballhandler (red) and defenders (green). Only players close to the ballhandler are used.

We have two spatial modes: the ball handler’s position and the relative defender positions around the ball handler. We instantiate the tensor classification model in Eqn (1) as follows:

where is the ballhandler ID, indexes the ballhandler’s position on the discretized court of dimension , and indexes the relative defender positions around the ballhandler in a discretized grid of dimension . As shown in Fig. 2, we orient the defender positions so that the direction from the ballhandler to the basket points up. is the binary output equal to if player shoots within the next second. We use a weighted cross entropy loss (due to imbalanced classes) and nearest neighbor interpolation for finegraining.

Tensor regression: Climate

Figure 3: Left: precipitation over continental U.S. Right: regions considered in particular.

Recent research (Li et al., 2016a, b; Zeng et al., 2019) shows that oceanic variables such as sea surface salinity (SSS) and sea surface temperature (SST) are significant predictors of the variability in rainfall in land-locked locations, such as the U.S. Midwest. We aim to predict the average monthly precipitation variability in U.S Midwest using SSS and SST to identify meaningful latent factors underlying the large-scale processes linking the ocean and precipitation on land (Fig. 3) Let be the historical oceanic data with input features SSS and SST across locations, using the previous months of data. We consider the lag as a non-spatial feature so that and as there are two spatial features. We instantiate the tensor regression model in Eqn (1) as follows:

We use difference detrending for each timestamp due to non-stationarity of the inputs, and deasonalize by standardizing each month of the year. The features are normalized using min-max normalization. We also normalize and deseasonalize the outputs in order to model variability. We use mean square error (MSE) and bilinear interpolation for finegraining.

Implementation Details

For both datasets, we discretize the spatial features into cells and use a 60-20-20 train-validation-test set split. We use Adam (Kingma and Ba, 2014) for optimization as it was empirically faster than SGD in our experiments. We use both and spatial regularization as described in Section 3. We selected optimal hyperparameters for all models via random search. We use a stepwise learning rate decay with stepsize of with . We perform ten trials for all experiments. All other details are provided in the Appendix.

6.2 Accuracy and Convergence

We compare MRTL against a fixed-resolution model on accuracy and computation time. The fixed-resolution model uses the same full-rank initialization as MRTL but utilizes the highest resolution only. The results of all trials are listed in Table 1. Other results are provided in the Appendix.

Figure 4: Basketball: F1 scores of MRTL vs. the fixed-resolution model for the full rank (left) and low rank model (right). The vertical lines indicate finegraining to the next resolution.

Fig. 4 shows the F1 scores of MRTL vs a fixed resolution model for the basketball dataset (validation loss was used as the finegraining criterion for both models). For the full rank case, MRTL converges 9 times faster than the fixed resolution case. The fixed-resolution model is able to reach a higher F1 score for the full rank case, as it uses a higher resolution and is able to learn more finegrained information. This advantage does not transfer to the low rank model. For the low rank model, the training times are comparable and both reach a similar F1 score. There is decrease in the F1 score going from full rank to low rank for both MRTL and the fixed resolution model due to approximation error from CP decomposition. Note that this is dependent on the choice of , specific to each dataset. We see a similar trend for the climate data, where MRTL converges faster than the fixed-resolution model. Overall, MRTL is approximately 4 5 times faster than a fixed-resolution model and we get a similar speedup in the climate data.

6.3 Finegraining Criteria

Figure 5: Basketball: F1 scores different finegraining criteria for the full rank (left) and low rank (right) model
Dataset Model Full-Rank Low-Rank
Time [s] Loss F1 Time [s] Loss F1
Basketball Fixed 11462 0.608 0.685 2205 0.849 0.494
MRTL 1230 0.699 0.607 2009 0.868 0.475
Climate Fixed 12.5 0.0882 - 269 0.0803 -
MRTL 1.11 0.0825 - 67.1 0.0409 -
Table 1: Runtime and prediction performance comparison of a fixed-resolution model vs MRTL for datasets
Figure 6: Climate: Some latent factors of sea surface locations. The red areas in the northwest Atlantic region (east of North America and Gulf of Mexico) represent areas where moisture export contributes to precipitation in the U.S. Midwest.

We compared the performance of different finegraining criteria in Fig. 5. Validation loss converges much faster than other criteria for the full rank model while the other finegraining criteria converge slightly faster for the low rank model. In the classification case, we observe that the full rank model spends many epochs training when we use gradient-based criteria, suggesting that they can be too strict for the full rank case. For the regression case, we see all criteria perform similarly for the full rank model, and validation loss converges faster for the low rank model. As there are differences between finegraining criteria for different datasets, one should try all of them for fastest convergence.

6.4 Interpretability

We now demonstrate that MRTL can learn semantic representations along spatial dimensions. For all latent factor figures, the factors have been normalized to so that reds are positive and blues are negative.

Figure 7: Basketball: Latent factors heatmaps of ballhandler position after training MRTL for . They represent common shooting locations such as the right/left sides of the court, the paint, or near the three point line.
Figure 8: Basketball: Latent factors heatmaps of relative defender positions after training MRTL for . The green dot represents the ballhandler at . The latent factors show spatial patterns near the ballhandler, suggesting important positions to suppress shot probability.

Figs. 7, 8 visualize some latent factors for ballhandler position and relative defender positions, respectively (see Appendix for all latent factors). For the ballhandler position in Fig. 7, coherent spatial patterns (can be both red or blue regions as they are simply inverses of each other) can correspond to common shooting locations. These latent factors can represent known locations such as the paint or near the three-point line on both sides of the court. For relative defender positions in Fig. 8, we see many concentrated spatial regions near the ballhandler, demonstrating that such close positions suppress shot probability (as expected). Some latent factors exhibit directionality as well, suggesting that guarding one side of the ballhandler may suppress shot probability more than the other side. Fig. 6 depicts two latent factors of sea surface locations. We would expect latent factors to correspond to regions of the ocean which independently influence precipitation. The left latent factor highlights the Gulf of Mexico and northwest Atlantic ocean as influential for rainfall in the Midwest, consistent with findings from (Li et al., 2018, 2016a).

Random initialization

We also perform experiments using a randomly initialized low-rank model in order to verify the importance of full rank initialization. Fig. 9 compares random initialization vs. MRTL for the ballhandler position (left two plots) and the defender positions (right two plots). We observe that even with spatial regularization, randomly initialized latent factor models can produce noisy, uninterpretable factors and thus full-rank initialization is essential.

Figure 9: Latent factor comparisons () of randomly initialized low-rank model (1st and 3rd) and MRTL (2nd and 4th) for ballhandler position (left two plots) and the defender positions (right two plots). Random initialization leads to uninterpretable latent factors.

7 Conclusion and Future Work

We presented a novel algorithm for tensor models for spatial analysis. Our algorithm MRTL utilizes multiple resolutions to significantly decrease training time and incorporates an full-rank initialization strategy that promotes spatially coherent and interpretable latent factors. MRTL is generalized to both the classification and regression cases. We proved the theoretical convergence of our algorithm for stochastic gradient descent and compared the computational complexity of MRTL to a single, fixed-resolution model. The experimental results on two real-world datasets support its computational efficiency and interpretability. Future work includes 1) develop other stopping criteria in order to enhance the computational speedup, 2) apply our algorithm to more higher-dimensional spatial data, and 3) study the effect of varying batch sizes between resolutions as in (Wu et al., 2019).

References

  • A. Anandkumar, R. Ge, and M. Janzamin (2014) Guaranteed non-orthogonal tensor decomposition via alternating rank- updates. arXiv preprint arXiv:1402.5180. Cited by: §1.
  • W. Austin, G. Ballard, and T. G. Kolda (2016) Parallel tensor compression for large-scale scientific data. In Parallel and Distributed Processing Symposium, 2016 IEEE International, pp. 912–922. Cited by: §1, §2.
  • M. T. Bahadori, Q. R. Yu, and Y. Liu (2014) Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in neural information processing systems, pp. 3491–3499. Cited by: §2.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: Theorem A.1, §4.2, Theorem 5.1.
  • J. H. Bramble (2019) Multigrid methods. Routledge. Cited by: §5.1.
  • C. Brunsdon, S. Fotheringham, and M. Charlton (1998) Geographically weighted regression. Journal of the Royal Statistical Society: Series D (The Statistician) 47 (3), pp. 431–443. Cited by: §2.
  • P. Burt and E. Adelson (1983) The laplacian pyramid as a compact image code. IEEE Transactions on communications 31 (4), pp. 532–540. Cited by: §4.2.
  • H. Cheng, Y. Yu, X. Zhang, E. Xing, and D. Schuurmans (2016) Scalable and sound low-rank tensor learning. In Artificial Intelligence and Statistics, pp. 1114–1123. Cited by: §1.
  • N. Cressie (1992) Statistics for spatial data. Terra Nova 4 (5), pp. 613–617. Cited by: §1, §2.
  • A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei (2017) Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems, pp. 2732–2741. Cited by: §1, §2.
  • P. J. Diggle, P. Moraga, B. Rowlingson, and B. M. Taylor (2013) Spatial and spatio-temporal log-gaussian cox processes: extending the geostatistical paradigm. Statistical Science, pp. 542–563. Cited by: §2.
  • Y. Ding, R. Kondor, and J. Eskreis-Winkler (2017) Multiresolution kernel approximation for gaussian process regression. In Advances in Neural Information Processing Systems, pp. 3743–3751. Cited by: §2.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §1, §3.3.
  • S. Friedland and L. Lim (2018) Nuclear norm of higher-order tensors. Mathematics of Computation 87 (311), pp. 1255–1281. Cited by: §4.2.
  • A. Getis and J. K. Ord (1992) The analysis of spatial association by use of distance statistics. Geographical analysis 24 (3), pp. 189–206. Cited by: §2.
  • E. T. Hale, W. Yin, and Y. Zhang (2008) Fixed-point continuation for -minimization: methodology and convergence. SIAM Journal on Optimization 19 (3), pp. 1107–1130. Cited by: §5.2.
  • S. Han and D. B. Dunson (2018) Multiresolution tensor decomposition for multiple spatial passing networks. arXiv preprint arXiv:1803.01203. Cited by: §2.
  • J. Haupt, X. Li, and D. P. Woodruff (2017) Near optimal sketching of low-rank tensor regression. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3469–3479. Cited by: §1, §2.
  • R. Hiptmair (1998) Multigrid method for maxwell’s equations. SIAM Journal on Numerical Analysis 36 (1), pp. 204–225. Cited by: §2.
  • F. L. Hitchcock (1927) The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics 6 (1-4), pp. 164–189. Cited by: §3.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2, §6.1.
  • T. G. Kolda and B. W. Bader (2009) Tensor Decompositions and Applications. SIAM Review 51 (3), pp. 455–500. External Links: Document, ISBN 0036-1445, ISSN 0036-1445 Cited by: §1, §2, §4.2.
  • R. Kondor, N. Teneva, and V. Garg (2014) Multiresolution matrix factorization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1620–1628. Cited by: §2.
  • J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic (2019) Tensorly: tensor learning in python. The Journal of Machine Learning Research 20 (1), pp. 925–930. Cited by: §1.
  • D. D. Lee and H. S. Seung (2001) Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562. Cited by: §2.
  • J. Li, J. Choi, I. Perros, J. Sun, and R. Vuduc (2017a) Model-driven sparse cp decomposition for higher-order tensors. In 2017 IEEE international parallel and distributed processing symposium (IPDPS), pp. 1048–1057. Cited by: §1.
  • L. Li, R. Schmitt, C. Ummenhofer, and K. Karnauskas (2016a) Implications of north atlantic sea surface salinity for summer precipitation over the us midwest: mechanisms and predictive value. Journal of Climate 29, pp. 3143–3159. Cited by: §6.1, §6.4.
  • L. Li, R. Schmitt, C. Ummenhofer, and K. Karnauskas (2016b) North atlantic salinity as a predictor of sahel rainfall. Science Advances 29, pp. 3143–3159. Cited by: §6.1.
  • L. Li, R. W. Schmitt, and C. C. Ummenhofer (2018) The role of the subtropical north atlantic water cycle in recent us extreme precipitation events. Climate dynamics 50 (3-4), pp. 1291–1305. Cited by: §6.4.
  • X. Li, J. Haupt, and D. Woodruff (2017b) Near optimal sketching of low-rank tensor regression. In Advances in Neural Information Processing Systems 30, pp. 3466–3476. Cited by: §2.
  • Z. C. Lipton (2018) The mythos of model interpretability. Queue 16 (3), pp. 31–57. Cited by: §3.3.
  • A. Litvinenko, D. Keyes, V. Khoromskaia, B. N. Khoromskij, and H. G. Matthies (2019) Tucker tensor analysis of matérn functions in spatial statistics. Computational Methods in Applied Mathematics 19 (1), pp. 101–122. Cited by: §1.
  • F. Lotte and C. Guan (2010) Regularizing common spatial patterns to improve bci designs: unified theory and new algorithms. IEEE Transactions on biomedical Engineering 58 (2), pp. 355–362. Cited by: §3.3.
  • A. Miller, L. Bornn, R. Adams, and K. Goldsberry (2014) Factorized point process intensities: a spatial analysis of professional basketball. In International Conference on Machine Learning (ICML), Cited by: §1, §1, §2, §3.3, §4.1.
  • C. Molnar (2019) Interpretable machine learning. Lulu. com. Cited by: §3.3.
  • P. A. Moran (1950) Notes on continuous stochastic phenomena. Biometrika, pp. 17–23. Cited by: §1, §2.
  • S. G. Nash (2000) A multigrid approach to discretized optimization problems. Optimization Methods and Software 14 (1-2), pp. 99–116. Cited by: §A.2, Lemma A.5.
  • A. Ozdemir, M. A. Iwen, and S. Aviyente (2017) Multiscale analysis for higher-order tensors. arXiv preprint arXiv:1704.08578. Cited by: §2.
  • L. Prechelt (1998) Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 11 (4), pp. 761–767. Cited by: §4.2.
  • S. Reed, A. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. Freitas (2017) Parallel multiscale autoregressive density estimation. In International Conference on Machine Learning, pp. 2912–2921. Cited by: §2.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016)

    Why should i trust you?: explaining the predictions of any classifier

    .
    pp. 1135–1144. Cited by: §3.3.
  • B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil (2013) Multilinear multitask learning. In International Conference on Machine Learning, pp. 1444–1452. Cited by: §2.
  • C. Schifanella, K. S. Candan, and M. L. Sapino (2014) Multiresolution tensor decompositions with mode hierarchies. ACM Transactions on Knowledge Discovery from Data (TKDD) 8 (2), pp. 10. Cited by: §2.
  • I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. C. Courville (2017)

    Multiresolution recurrent neural networks: an application to dialogue response generation.

    .
    In AAAI, pp. 3288–3294. Cited by: §2.
  • N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos (2017) Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582. Cited by: §1, §2.
  • N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger (2012) Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory 58 (5), pp. 3250–3265. Cited by: §4.2.
  • K. Stüben (2001) A review of algebraic multigrid. In Partial Differential Equations, pp. 281–309. Cited by: §5.2.
  • U. Trottenberg, C. W. Oosterlee, and A. Schuller (2000) Multigrid. Elsevier. Cited by: §2.
  • Y. Wang, H. Tung, A. J. Smola, and A. Anandkumar (2015) Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems, pp. 991–999. Cited by: §1, §2.
  • C. Wu, R. Girshick, K. He, C. Feichtenhofer, and P. Krähenbühl (2019) A multigrid method for efficiently training video models. External Links: 1912.00998 Cited by: §7.
  • Y. Yao, L. Rosasco, and A. Caponnetto (2007) On early stopping in gradient descent learning. Constructive Approximation 26 (2), pp. 289–315. Cited by: §4.2.
  • R. Yu, G. Li, and Y. Liu (2018) Tensor regression meets gaussian processes. In 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Cited by: §1.
  • Y. Yue, P. Lucey, P. Carr, A. Bialkowski, and I. Matthews (2014) Learning Fine-Grained Spatial Models for Dynamic Sports Play Prediction. In IEEE International Conference on Data Mining (ICDM), Cited by: Appendix B, §1, §3.3, §4.1, §6.1.
  • L. Zeng, R. Schmitt, L. Li, Q. Wang, and D. Wang (2019) Forecast of summer precipitation in the yangtze river valley based on south china sea springtime sea surface salinity. Climate Dynamics 53, pp. 5495–5509. Cited by: §6.1.
  • S. Zheng, Y. Yue, and J. Hobbs (2016) Generating long-term trajectories using deep hierarchical networks. In Advances in Neural Information Processing Systems, pp. 1543–1551. Cited by: §6.1.
  • H. Zhou, L. Li, and H. Zhu (2013) Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association 108 (502), pp. 540–552. Cited by: §2.

Appendix A Appendix

a.1 Convergence Analysis

Theorem A.1.

(Bottou et al., 2018) If the step size , then a fixed resolution solution satisfies

where , , and is the optimal solution.

Proof.

For a single step update,

(7)

by the law of total expectation

(8)

From strong convexity,

(9)

which implies as . Putting it all together yields

(10)

As , we complete the contraction, by setting

(11)

Repeat the iterations

(12)

Rearranging the terms, we get

(13)

Theorem A.2.

If the step size , then MRTL solution satisfies

where , , and is the operator norm of the interpolation operator .

Consider a two resolution case where and . Let be the total number of iterations of resolution . Based on Eqn. (10), for a fixed resolution algorithm, after number of iterations,

For multi-resolution, where we train on resolution first, we have

At resolution , we have

(14)

Using interpolation, we have . Given the spatial autocorrelation assumption, we have

By the definition of operator norm and triangle inequality,

Combined with eq. (14), we have

(15)
(16)

If we initialize and such that , we have MRTL solution

(17)

for some that completes the contraction. Repeat the resolution iterates in Eqn. (16), we reach our conclusion. ∎

a.2 Computational Complexity Analysis

In this section, we analyze the computational complexity for MRTL (Algorithm 1). Assuming that is Lipschitz continuous, we can view gradient-based optimization as a fixed-point iteration operator with a contraction constant of (note that stochastic gradient descent converges to a noise ball instead of a fixed point).

Let be the optimal estimator at resolution . Suppose for each resolution , we use the following fine-grain criterion:

(18)

where is the number of iterations taken at level . The algorithm terminates when the estimation error reaches . The following main theorem characterizes the speed-up gained by multi-resolution learning MRTL w.r.t. the contraction factor and the terminal estimation error .

Theorem A.3.

Suppose the fixed point iteration operator (gradient descent) for the optimization algorithm has a contraction factor (Lipschitz constant) of , the multi-resolution learning procedure is faster than that of the fixed resolution algorithm by a factor of , with as the terminal estimation error.

We prove several useful Lemmas before proving the main Theorem A.3. The following lemma analyzes the computational cost of the fixed-resolution algorithm.

Lemma A.4.

Given a fixed point iteration operator with a contraction factor , the computational complexity of a fixed-resolution training for a -order tensor with rank is

(19)
Proof.

At a high level, we can prove this by choosing a small enough resolution such that the approximation error is bounded with a fixed number of iterations. Let be the optimal estimate at resolution and be the estimate at step . Then

(20)

We pick a fixed resolution small enough such that

(21)

then using the termination criteria gives where is the discretization size at resolution . Initialize and apply to for times such that

(22)

As , , we obtain that

(23)

Note that for an order tensor with rank , the computational complexity of every iteration in MRTL is with as the discretization size. Hence, the computational complexity of the fixed resolution training is

Given a spatial discretization , we can construct an operator that learns discretized tensor weights. The next lemma relates the estimation error with resolution. The following lemma relates the estimation error with resolution:

Lemma A.5.

(Nash, 2000) For each resolution level , there exists a constant and , such that the fixed point iteration with discretization size has an estimation error:

(24)
Proof.

See (Nash, 2000) for details. We have obtained the discretization error for the fixed point operation at any resolution. Next we analyze the number of iterations needed at each resolution before finegraining.

Lemma A.6.

For every resolution , there exists a constant such that the number of iterations before finegraining satisfies:

(25)
Proof.

According to the fixed point iteration definition, we have for each resolution :

(26)
(27)
(28)

using the definition of the finegrain criterion. ∎By combining Lemmas A.6 and the computational cost per iteration, we can compute the total computational cost for our MRTL algorithm, which is proportional to the total number of iterations for all resolutions:

(29)

where the last step uses the termination criterion in (18). Comparing with the complexity analysis for the fixed resolution algorithm in Lemma A.4, we complete the proof. ∎

Appendix B Experiments

Basketball

We list implementation details for the basketball dataset. We focus only on half-court possessions, where all players have crossed into the half court as in (Yue et al., 2014). The ball must also be inside the court and be within a 4 feet radius of the ballhandler. We discard any passing/turnover events and do not consider frames with free throws. For the ball handler location , we discretize the half-court into resolutions . For the relative defender locations, at the full resolution, we choose a grid around the ball handler where the ball handler is located at (more space in front of the ball handler than behind him/her). We also consider a smaller grid around the ball handler for the defender locations, assuming that defenders that are far away from the ball handler do not influence shooting probability. We use for defender positions. Let us denote the pair of resolutions as . We train the full-rank model at resolutions and the low-rank model at resolutions . There is a notable class imbalance in labels (88% of data points have zero labels) so we use weighted cross entropy loss using the inverse of class counts as weights. For the low-rank model, we use tensor rank . The performance trend of MRTL is similar across a variety of tensor ranks. should be chosen in appropriately to the desired level of approximation.

Climate

We describe the data sources used for climate. The precipitation data comes from the PRISM group, which provides estimates monthly estimates at 1/24º spatial resolution across the continental U.S from 1895 to 2018. For oceanic data we use the EN4 reanalysis product, which provides monthly estimates for ocean salinity and temperature at 1º spatial resolution across the globe from 1900 to the present (see Fig. 3). We constrain our spatial analysis to the range [-180ºW, 0ºW] and [-20ºS, 60ºN], which encapsulates the area around North America and a large portion of South America. The ocean data is non-stationary, with the variance of the data increasing over time. This is likely due to improvement in observational measurements of ocean temperature and salinity over time, which reduce the amount of interpolation needed to generate an estimate for a given month. After detrending and deseasonalizing, we split the train, validation, and test sets using random consecutive sequences so that their samples come from a similar distribution. We train the full-rank model at resolutions and and the low-rank model at resolutions , , , , , and . For finegraining criteria, we use a patience factor of 4, i.e. training was terminated when a finegraining criterion was reached a total of 4 times. Both validation loss and gradient statistics were relatively noisy during training (possibly due to a small number of samples), leading to early termination without the patience factor. During fine-graining, the weights were upsampled to the higher resolution using bilinear interpolation and then scaled by the ratio of the number of inputs for the higher resolution to the number of inputs for the lower resolution (as described in Section 4) to preserve the magnitude of the prediction.

Details

We trained the basketball dataset on 4 RTX 2080 Ti GPUs, while the climate dataset experiments were performed on a separate workstation with 1 RTX 2080 Ti GPU. The computation times of the fixed-resolution and MRTL model were compared on the same setup for all experiments.

b.1 Hyperparameters

Hyperparameter Basketball Climate
Batch size
Full-rank learning rate
Full-rank regularization
Low-rank learning rate
Low-rank regularization