mrtl
Multiresolution Tensor Learning
view repo
Efficient and interpretable spatial analysis is crucial in many fields such as geology, sports, and climate science. Large-scale spatial data often contains complex higher-order correlations across features and locations. While tensor latent factor models can describe higher-order correlations, they are inherently computationally expensive to train. Furthermore, for spatial analysis, these models should not only be predictive but also be spatially coherent. However, latent factor models are sensitive to initialization and can yield inexplicable results. We develop a novel Multi-resolution Tensor Learning (MRTL) algorithm for efficiently learning interpretable spatial patterns. MRTL initializes the latent factors from an approximate full-rank tensor model for improved interpretability and progressively learns from a coarse resolution to the fine resolution for an enormous computation speedup. We also prove the theoretical convergence and computational complexity of MRTL. When applied to two real-world datasets, MRTL demonstrates 4 5 times speedup compared to a fixed resolution while yielding accurate and interpretable models.
READ FULL TEXT VIEW PDFMultiresolution Tensor Learning
Spatial analysis seeks to explain patterns of geographic data and analyzing such large-scale spatial data plays a critical role in sports, geology and climate science. In spatial statistics, kriging or Gaussian processes are popular tools for spatial analysis (Cressie, 1992). Others have proposed various Bayesian methods such as Cox processes (Miller et al., 2014; Dieng et al., 2017) to model spatial data. However, while mathematically appealing, these methods often have difficulties scaling to high-resolution data.
We are interested in learning high-dimensional tensor latent factor models, which have shown to be a scalable alternative for spatial analysis (Yu et al., 2018; Litvinenko et al., 2019). High resolution spatial data often contain higher-order correlations between features and locations and tensors can naturally encode such multi-way correlations. For example, in competitive basketball play, we can predict how each player’s decision to shoot is jointly influenced by their shooting style, his or her court position and the position of the defenders by simultaneously encoding these features as a tensor. Using such representations, tensor learning can directly extract higher-order correlations. A challenge in such models is high computational cost. High-resolution spatial data often requires discretization, leading to high-dimensional tensors that are computationally expensive to train. Low-rank tensor learning (Yu et al., 2018; Kossaifi et al., 2019) reduces the dimensionality by assuming low-rank structures in the data and uses tensor decomposition to discovery latent semantics, see review papers (Kolda and Bader, 2009; Sidiropoulos et al., 2017). However, many tensor learning methods have been shown to be sensitive to noise (Cheng et al., 2016) and initialization (Anandkumar et al., 2014). Other numerical techniques including random sketching (Wang et al., 2015; Haupt et al., 2017) and parallelization (Austin et al., 2016; Li et al., 2017a) can speed up training, but they often fail to utilize the unique properties of spatial data such as spatial auto-correlations. Using latent factor models also gives rise to another issue: interpretability. It is well known that a latent factor model is not identifiable due to rotational indeterminacy. As the latent factors are not unique, these models can yield unintuitive latent factors that do not offer insights to domain experts. In general, the definition of interpretability is highly application dependent (Doshi-Velez and Kim, 2017). For spatial analysis, one of the unique properties of spatial patterns is spatial auto-correlation: close objects have similar values (Moran, 1950), which we use as criteria for interpretability. As latent factors models are sensitive to initialization, previous research (Miller et al., 2014; Yue et al., 2014) has shown that random initialized latent factor models can lead to spatial patterns that violate spatial auto-correlation and hence are not interpretable (see Fig. 1). In this paper, we propose a Multiresolution Tensor Learning algorithm, MRTL, to efficiently learn accurate and interpretable patterns in spatial data. MRTL
is based on two key insights. First, to obtain good initialization, we train a full-rank tensor model approximately at a low resolution and factorize it with tensor decomposition. Second, we exploit spatial auto-correlation to learn models at multiple resolutions: starting at a coarse resolution and iteratively finegrain to the next resolution. We demonstrate that this approach converges significantly faster than fixed resolution methods. We develop several fine-graining criteria based on loss convergence and information theory to determine when to finegrain. We also consider different interpolation schemes and discuss how to finegrain in different applications. In summary, our contributions are:
We propose a Multiresolution Tensor Learning (MRTL) optimization algorithm for large-scale spatial analysis
We prove the rate of convergence for MRTL which depends on the spectral norm of the interpolation operator. We also show the exponential computational speedup for MRTL compared with fixed resolution.
We develop different criteria to determine when to transition to a finer resolution and discuss different finegraining methods.
We evaluate on two real-world datasets and show MRTL learns orders-of-magnitude faster than fixed-resolution learning and can produce interpretable latent factors.
Discovering spatial patterns has significant implications in scientific fields such as human behavior modeling, neural science, and climate science. Early work in spatial statistics has contributed greatly to spatial analysis through the work in Moran’s I statistics (Moran, 1950) and Getis-Ord general G statistics (Getis and Ord, 1992) for measuring spatial auto-correlation. Geographically weighted regression (Brunsdon et al., 1998) accounts for the spatial heterogeneity with a local version of spatial regression but fail to capture higher order correlation. Kriging or Gaussian processes are popular tools for spatial analysis but they often require carefully designed variograms (also known as kernels) (Cressie, 1992). Other Bayesian hierarchical models favor spatial point processes to model spatial data (Diggle et al., 2013; Miller et al., 2014; Dieng et al., 2017). These frameworks are conceptually elegant, but often computationally intractable.
Latent factor models utilize correlations in the data to reduce the dimensionality of the problem and have been used extensively in multi-task learning (Romera-Paredes et al., 2013) and recommendation systems (Lee and Seung, 2001). Tensor learning (Zhou et al., 2013; Bahadori et al., 2014; Haupt et al., 2017)
employs tensor latent factor models to learn higher-order correlations in the data in a supervised fashion. In particular, tensor latent factor models aim to learn the higher-order correlations in spatial data by assuming unknown low-dimensional representations among features and locations. High-order tensor models are non-convex by nature, suffer from the curse of dimensionality, and are notoriously hard to train
(Kolda and Bader, 2009; Sidiropoulos et al., 2017). There are many efforts to scale up tensor computation, e.g., parallelization (Austin et al., 2016) and sketching (Wang et al., 2015; Haupt et al., 2017; Li et al., 2017b). In this work, we propose an optimization algorithm to learn tensor models at multiple resolutions that is not only fast but can also generate interpretable factors.The idea of multiresolution methods is well celebrated in machine learning, both in latent factor modeling
(Kondor et al., 2014; Ozdemir et al., 2017)and deep learning
(Reed et al., 2017; Serban et al., 2017). For example, multi-resolution matrix factorization (Kondor et al., 2014; Ding et al., 2017) and its higher order extensions (Schifanella et al., 2014; Ozdemir et al., 2017; Han and Dunson, 2018)applies multi-level orthogonal operators with the goal to uncover the multiscale structure in a single matrix. In contrast, our method aims to speed up learning by exploiting the relationship among multiple tensors of different resolutions. Our approach bears affinity with the multigrid method in numerical analysis for solving partial differential equations
(Trottenberg et al., 2000; Hiptmair, 1998) where the idea is to accelerate iterative algorithms by solving a coarse problem first and then gradually fine-grain the solution.We consider tensor learning in the supervised setting. We will describe both models for the full-rank case and the low-rank case. We use an order-3 tensor for ease of illustration but our model can be readily extended to higher order cases.
Given input data consisting of both non-spatial and spatial features, we can discretize the spatial features at different resolutions, with corresponding dimensions as . Tensor learning parameterizes the model with a weight tensor over all features, where is number of outputs and is number of non-spatial features. The input data is of the form . Note that both the input features and the learning model are resolution dependent. is the label for output . For resolution , the full rank tensor learning model can be written as
(1) |
where
is the activation function and
is the bias for output . The weight tensor is contracted with along the non-spatial mode and the spatial mode . In general, Eqn. (1) can be extended to multiple spatial features and multiple spatial modes, each of which can have its own set of resolution-dependent dimensions. We use a sigmoid activation function for classification and a linear activation function for regression.Low rank tensor models assume a low-dimensional structure in that can capture the inherent sparsity in the data and also alleviate model overfitting. We use CANDECOMP/PARAFAC (CP) decomposition (Hitchcock, 1927) as an example; our method can easily be extended for other decompositions as well. Let be the CP rank of the tensor. The weight tensor is factorized into multiple factors matrices
The tensor latent factor model is:
(2) |
where the columns of are latent factors for each mode of and is resolution dependent.
Interpretability is in general hard to define or quantify (Doshi-Velez and Kim, 2017; Ribeiro et al., 2016; Lipton, 2018; Molnar, 2019). In the context of spatial analysis, we deem a latent factor as interpretable if it produces a spatially coherent pattern that satisfies spatial auto-correlation. To this end, we utilize a spatial regularization kernel (Lotte and Guan, 2010; Miller et al., 2014; Yue et al., 2014) and extend to the tensor case. Let index all locations of the spatial dimension for resolution . The spatial regularization term is:
(3) |
where denotes the Frobenius norm and
is the kernel that controls the degree of similarity between locations. We use a simple RBF kernel with hyperparameter
.(4) |
These kernels can be precomputed for all spatial modes for each resolution. If there are multiple spatial modes, we apply spatial regularization across all different modes. We additionally use regularization to encourage smaller weights. Any other -norm can be used.
We now describe our algorithm MRTL that addresses both the computation and interpretability issues. Two key concepts of MRTL are learning good initializations and utilizing multiple resolutions.
In general, due to the nonconvex nature of various problems, tensor latent factor models are sensitive to initialization and lead to uninterpretable latent factors (Miller et al., 2014; Yue et al., 2014). We use full-rank initialization in order to learn latent factors that correspond to known spatial patterns. We first train an approximate full-rank version of the tensor model in Eqn. (1). The weight tensor is then decomposed into latent factors and these values are used as initialization of the low-rank model. The low-rank model in Eqn. (2
) is then trained to the final desired accuracy. As we use approximately optimal solutions of the full-rank model as initializations for the low-rank model, our algorithm produces interpretable latent factors in a variety of different scenarios and datasets. There is a trade-off in computation due to training of the full-rank model. However, as the full-rank model is trained only for a small number of epochs, the increase in computation time is not substantial. We also train the full-rank model only at lower resolutions to further reduce computation. In our experiments, we find that spatial regularization alone is not enough to learn spatially coherent factors, whereas full-rank initialization, though computationally costly, is able to fix this issue. Thus, full-rank initialization is critical for spatial interpretability.
Learning a high-dimensional tensor model is generally computationally expensive and memory inefficient. We utilize multiple resolutions for this issue. We outline MRTL in Alg. 1 where we omit the bias in the description for simplicity.
We use superscripts to represent the weight tensor and factor matrices at resolution and subscripts to denote the iterate at step , i.e. is at resolution at step . is the initial weight tensor at the lowest resolution. denotes all factor matrices at resolution and we use to index the factor . To make training feasible, we train both the full rank and low rank models at multiple resolutions, starting from a coarse spatial resolution and progressively increase the resolution. At each resolution , we learn using stochastic optimization Opt (we used Adam (Kingma and Ba, 2014) in our experiments). When the stopping criterion is met, we transform to in a process we call finegraining (Finegrain). Due to spatial auto-correlation, trained model at the lower resolution will serve as a good initialization for the higher resolutions. For both models, we only finegrain the factors that corresponds to the spatial mode. Once the full rank resolution has been trained up to resolution (chosen to fit GPU memory or time constraints), we decompose using CP_ALS, the standard alternating least squares (ALS) algorithm (Kolda and Bader, 2009) for CP decomposition. Then the low-rank model is trained at resolutions to final desired accuracy, finegraining between the resolutions. MRTL
is not specific to ALS and allows for any other CP decomposition method, such as multiplicative updates for nonnegative CP decomposition. While we can use tensor nuclear norm regularization
(Friedland and Lim, 2018), this method was computationally infeasible for our datasets.There is a tradeoff between training times at different resolutions. While training for longer at lower resolutions decreases computation significantly, we do not want to overfit to the lower resolution data. On the other hand, training at higher resolutions can yield more accurate solutions using more finegrained information. We investigate four different criteria to balance this tradeoff: 1) validation loss, 2) gradient norm, 3) gradient variance, and 4) gradient entropy. Increase in validation loss
(Prechelt, 1998; Yao et al., 2007)is a commonly used heuristic for early stopping. Another approach is to analyze the gradient distributions during training. For a convex function
, stochastic gradient descent will converge into a noise ball near the optimal solution as the gradients approach zero. However, lower resolutions may be too coarse to learn more finegrained curvatures and the gradients will increasingly disagree near the optimal solution. We quantify the disagreement in the gradients with metrics such as norm, variance, and entropy. We use intuition from convergence analysis for gradient norm and variance
(Bottou et al., 2018), and information theory for gradient entropy (Srinivas et al., 2012). Let and represent the weights and the random sampling of minibatches at step , respectively. Let be the validation loss and be the stochastic gradients at step . The finegraining criteria are:Validation Loss:
Gradient Norm:
Gradient Variance:
Gradient Entropy:
where . One can also use thresholds, i.e. , but as these are dependent on the dataset, we use in our experiments. One can also incorporate patience, i.e. setting the maximum number of epochs where the stopping conditions was reached.
We discuss different interpolation schemes for different types of features. Categorical/multinomial variables, such as a player’s position on the court, are one-hot encoded or multi-hot encoded onto a discretized grid. Note that as we use higher resolutions, the sum of the input values are still equal across resolutions,
. As the sum of the features remains the same across resolutions and our tensor models are multilinear, nearest neighbor interpolation should be used in order to produce the same outputs.as for cells that do not contain the value. This scheme yields the same outputs and the same loss values across resolutions. Continuous variables that represent averages over locations, such as sea surface salinity, often have similar values at each finegrained cell at higher resolutions (as the values at coarse resolutions are subsampled or averaged from values at the higher resolution). An example is Laplacian/Gaussian pyramids for images (Burt and Adelson, 1983). Then , where the approximation comes from the type of downsampling used. Using a linear interpolation scheme,
The weights are divided by the scale factor of to keep the outputs approximately equal. We use bilinear interpolation, though any other linear interpolation can be used.
We prove the convergence rate for MRTL
of a single spatial mode. For a loss function
and stochastic variable , the optimization problem is:(5) |
We defer all proofs to the Appendix. For a fixed-resolution miniSGD algorithm, under common assumptions in convergence analysis:
is - strongly convex, -smooth
(unbiased) gradient given
(variance) for all the ,
(Bottou et al., 2018) If the step size , then a fixed resolution solution satisfies
where , , and is the optimal solution.
which gives convergence. Denote the number of total iterations for resolution as , and the weights as . We let denote the number of dimensions at and we assume a dyadic scaling between resolutions such that . We define finegraining using an interpolation operator such that as in (Bramble, 2019). For the simple case of a 1D spatial grid where has spatial dimension , P would be of a Toeplitz matrix of dimension . For example, for linear interpolation of ,
Any interpolation scheme can be expressed in this form. The convergence of multiresolution learning algorithm depends on the following property of spatial data:
The difference between the optimal solutions of consecutive resolution is upper bounded by
with being the interpolation operator.
If the step size , then MRTL solution satisfies
where , , and is the operator norm of the interpolation operator .
This gives similar convergence rate as the fixed resolution algorithm, with a constant that depends on the operator norm of the interpolation operator .
To analyze computational complexity, we resort to fixed point convergence (Hale et al., 2008) and the multi-grid method (Stüben, 2001). Intuitively, as most of the training iterations are spent on coarser resolutions with fewer number of parameters, multi-resolution learning is more efficient than fixed-resolution training. Assuming that is Lipschitz continuous, we can view gradient-based optimization as a fixed-point iteration operator with a contraction constant of (note that stochastic gradient descent converges to a noise ball instead of a fixed point):
Let
be the optimal estimator at resolution
and be a solution satisfying . The algorithm terminates when the estimation error reaches . The following lemma describes the computational cost of the fixed-resolution algorithm.Given a fixed point iteration operator with contraction constant of , the computational complexity of fixed-resolution training for tensor model of order and rank is
(6) |
where is the terminal estimation error.
The next theorem 5.5 characterizes the computational speed-up gained by multi-resolution learning compared to fixed-resolution learning, with respect to the contraction factor and the terminal estimation error .
If the fixed point iteration operator (gradient descent) has a contraction factor of , multi-resolution learning with the termination criteria of at resolution is faster than fixed-resolution learning by a factor of , with the terminal estimation error .
Note that the speed-up using multi-resolution learning uses a global convergence criterion for each .
We apply MRTL to two real-world datasets: basketball tracking and climate data.
We use a large NBA player tracking dataset from (Yue et al., 2014; Zheng et al., 2016)
consisting of the coordinates of all players at 25 frames per second, for a total of approximately 6 million frames. The goal is to predict whether a given ball handler will shoot within the next second, given his position on the court and the relative positions of the defenders around him. In applying our method, we hope to obtain common shooting locations on the court and how a defender’s relative position suppresses shot probability.
We have two spatial modes: the ball handler’s position and the relative defender positions around the ball handler. We instantiate the tensor classification model in Eqn (1) as follows:
where is the ballhandler ID, indexes the ballhandler’s position on the discretized court of dimension , and indexes the relative defender positions around the ballhandler in a discretized grid of dimension . As shown in Fig. 2, we orient the defender positions so that the direction from the ballhandler to the basket points up. is the binary output equal to if player shoots within the next second. We use a weighted cross entropy loss (due to imbalanced classes) and nearest neighbor interpolation for finegraining.
Recent research (Li et al., 2016a, b; Zeng et al., 2019) shows that oceanic variables such as sea surface salinity (SSS) and sea surface temperature (SST) are significant predictors of the variability in rainfall in land-locked locations, such as the U.S. Midwest. We aim to predict the average monthly precipitation variability in U.S Midwest using SSS and SST to identify meaningful latent factors underlying the large-scale processes linking the ocean and precipitation on land (Fig. 3) Let be the historical oceanic data with input features SSS and SST across locations, using the previous months of data. We consider the lag as a non-spatial feature so that and as there are two spatial features. We instantiate the tensor regression model in Eqn (1) as follows:
We use difference detrending for each timestamp due to non-stationarity of the inputs, and deasonalize by standardizing each month of the year. The features are normalized using min-max normalization. We also normalize and deseasonalize the outputs in order to model variability. We use mean square error (MSE) and bilinear interpolation for finegraining.
For both datasets, we discretize the spatial features into cells and use a 60-20-20 train-validation-test set split. We use Adam (Kingma and Ba, 2014) for optimization as it was empirically faster than SGD in our experiments. We use both and spatial regularization as described in Section 3. We selected optimal hyperparameters for all models via random search. We use a stepwise learning rate decay with stepsize of with . We perform ten trials for all experiments. All other details are provided in the Appendix.
We compare MRTL against a fixed-resolution model on accuracy and computation time. The fixed-resolution model uses the same full-rank initialization as MRTL but utilizes the highest resolution only. The results of all trials are listed in Table 1. Other results are provided in the Appendix.
Fig. 4 shows the F1 scores of MRTL vs a fixed resolution model for the basketball dataset (validation loss was used as the finegraining criterion for both models). For the full rank case, MRTL converges 9 times faster than the fixed resolution case. The fixed-resolution model is able to reach a higher F1 score for the full rank case, as it uses a higher resolution and is able to learn more finegrained information. This advantage does not transfer to the low rank model. For the low rank model, the training times are comparable and both reach a similar F1 score. There is decrease in the F1 score going from full rank to low rank for both MRTL and the fixed resolution model due to approximation error from CP decomposition. Note that this is dependent on the choice of , specific to each dataset. We see a similar trend for the climate data, where MRTL converges faster than the fixed-resolution model. Overall, MRTL is approximately 4 5 times faster than a fixed-resolution model and we get a similar speedup in the climate data.
Dataset | Model | Full-Rank | Low-Rank | ||||
---|---|---|---|---|---|---|---|
Time [s] | Loss | F1 | Time [s] | Loss | F1 | ||
Basketball | Fixed | 11462 | 0.608 | 0.685 | 2205 | 0.849 | 0.494 |
MRTL | 1230 | 0.699 | 0.607 | 2009 | 0.868 | 0.475 | |
Climate | Fixed | 12.5 | 0.0882 | - | 269 | 0.0803 | - |
MRTL | 1.11 | 0.0825 | - | 67.1 | 0.0409 | - |
We compared the performance of different finegraining criteria in Fig. 5. Validation loss converges much faster than other criteria for the full rank model while the other finegraining criteria converge slightly faster for the low rank model. In the classification case, we observe that the full rank model spends many epochs training when we use gradient-based criteria, suggesting that they can be too strict for the full rank case. For the regression case, we see all criteria perform similarly for the full rank model, and validation loss converges faster for the low rank model. As there are differences between finegraining criteria for different datasets, one should try all of them for fastest convergence.
We now demonstrate that MRTL can learn semantic representations along spatial dimensions. For all latent factor figures, the factors have been normalized to so that reds are positive and blues are negative.
Figs. 7, 8 visualize some latent factors for ballhandler position and relative defender positions, respectively (see Appendix for all latent factors). For the ballhandler position in Fig. 7, coherent spatial patterns (can be both red or blue regions as they are simply inverses of each other) can correspond to common shooting locations. These latent factors can represent known locations such as the paint or near the three-point line on both sides of the court. For relative defender positions in Fig. 8, we see many concentrated spatial regions near the ballhandler, demonstrating that such close positions suppress shot probability (as expected). Some latent factors exhibit directionality as well, suggesting that guarding one side of the ballhandler may suppress shot probability more than the other side. Fig. 6 depicts two latent factors of sea surface locations. We would expect latent factors to correspond to regions of the ocean which independently influence precipitation. The left latent factor highlights the Gulf of Mexico and northwest Atlantic ocean as influential for rainfall in the Midwest, consistent with findings from (Li et al., 2018, 2016a).
We also perform experiments using a randomly initialized low-rank model in order to verify the importance of full rank initialization. Fig. 9 compares random initialization vs. MRTL for the ballhandler position (left two plots) and the defender positions (right two plots). We observe that even with spatial regularization, randomly initialized latent factor models can produce noisy, uninterpretable factors and thus full-rank initialization is essential.
We presented a novel algorithm for tensor models for spatial analysis. Our algorithm MRTL utilizes multiple resolutions to significantly decrease training time and incorporates an full-rank initialization strategy that promotes spatially coherent and interpretable latent factors. MRTL is generalized to both the classification and regression cases. We proved the theoretical convergence of our algorithm for stochastic gradient descent and compared the computational complexity of MRTL to a single, fixed-resolution model. The experimental results on two real-world datasets support its computational efficiency and interpretability. Future work includes 1) develop other stopping criteria in order to enhance the computational speedup, 2) apply our algorithm to more higher-dimensional spatial data, and 3) study the effect of varying batch sizes between resolutions as in (Wu et al., 2019).
Why should i trust you?: explaining the predictions of any classifier
. pp. 1135–1144. Cited by: §3.3.Multiresolution recurrent neural networks: an application to dialogue response generation.
. In AAAI, pp. 3288–3294. Cited by: §2.(Bottou et al., 2018) If the step size , then a fixed resolution solution satisfies
where , , and is the optimal solution.
For a single step update,
(7) |
by the law of total expectation
(8) |
From strong convexity,
(9) |
which implies as . Putting it all together yields
(10) |
As , we complete the contraction, by setting
(11) |
Repeat the iterations
(12) |
Rearranging the terms, we get
(13) |
∎
If the step size , then MRTL solution satisfies
where , , and is the operator norm of the interpolation operator .
Consider a two resolution case where and . Let be the total number of iterations of resolution . Based on Eqn. (10), for a fixed resolution algorithm, after number of iterations,
For multi-resolution, where we train on resolution first, we have
At resolution , we have
(14) |
Using interpolation, we have . Given the spatial autocorrelation assumption, we have
By the definition of operator norm and triangle inequality,
Combined with eq. (14), we have
(15) | ||||
(16) |
If we initialize and such that , we have MRTL solution
(17) |
for some that completes the contraction. Repeat the resolution iterates in Eqn. (16), we reach our conclusion. ∎
In this section, we analyze the computational complexity for MRTL (Algorithm 1). Assuming that is Lipschitz continuous, we can view gradient-based optimization as a fixed-point iteration operator with a contraction constant of (note that stochastic gradient descent converges to a noise ball instead of a fixed point).
Let be the optimal estimator at resolution . Suppose for each resolution , we use the following fine-grain criterion:
(18) |
where is the number of iterations taken at level . The algorithm terminates when the estimation error reaches . The following main theorem characterizes the speed-up gained by multi-resolution learning MRTL w.r.t. the contraction factor and the terminal estimation error .
Suppose the fixed point iteration operator (gradient descent) for the optimization algorithm has a contraction factor (Lipschitz constant) of , the multi-resolution learning procedure is faster than that of the fixed resolution algorithm by a factor of , with as the terminal estimation error.
We prove several useful Lemmas before proving the main Theorem A.3. The following lemma analyzes the computational cost of the fixed-resolution algorithm.
Given a fixed point iteration operator with a contraction factor , the computational complexity of a fixed-resolution training for a -order tensor with rank is
(19) |
At a high level, we can prove this by choosing a small enough resolution such that the approximation error is bounded with a fixed number of iterations. Let be the optimal estimate at resolution and be the estimate at step . Then
(20) |
We pick a fixed resolution small enough such that
(21) |
then using the termination criteria gives where is the discretization size at resolution . Initialize and apply to for times such that
(22) |
As , , we obtain that
(23) |
Note that for an order tensor with rank , the computational complexity of every iteration in MRTL is with as the discretization size. Hence, the computational complexity of the fixed resolution training is
Given a spatial discretization , we can construct an operator that learns discretized tensor weights. The next lemma relates the estimation error with resolution. The following lemma relates the estimation error with resolution:
(Nash, 2000) For each resolution level , there exists a constant and , such that the fixed point iteration with discretization size has an estimation error:
(24) |
See (Nash, 2000) for details. We have obtained the discretization error for the fixed point operation at any resolution. Next we analyze the number of iterations needed at each resolution before finegraining.
For every resolution , there exists a constant such that the number of iterations before finegraining satisfies:
(25) |
According to the fixed point iteration definition, we have for each resolution :
(26) | |||||
(27) | |||||
(28) |
using the definition of the finegrain criterion. ∎By combining Lemmas A.6 and the computational cost per iteration, we can compute the total computational cost for our MRTL algorithm, which is proportional to the total number of iterations for all resolutions:
(29) |
where the last step uses the termination criterion in (18). Comparing with the complexity analysis for the fixed resolution algorithm in Lemma A.4, we complete the proof. ∎
We list implementation details for the basketball dataset. We focus only on half-court possessions, where all players have crossed into the half court as in (Yue et al., 2014). The ball must also be inside the court and be within a 4 feet radius of the ballhandler. We discard any passing/turnover events and do not consider frames with free throws. For the ball handler location , we discretize the half-court into resolutions . For the relative defender locations, at the full resolution, we choose a grid around the ball handler where the ball handler is located at (more space in front of the ball handler than behind him/her). We also consider a smaller grid around the ball handler for the defender locations, assuming that defenders that are far away from the ball handler do not influence shooting probability. We use for defender positions. Let us denote the pair of resolutions as . We train the full-rank model at resolutions and the low-rank model at resolutions . There is a notable class imbalance in labels (88% of data points have zero labels) so we use weighted cross entropy loss using the inverse of class counts as weights. For the low-rank model, we use tensor rank . The performance trend of MRTL is similar across a variety of tensor ranks. should be chosen in appropriately to the desired level of approximation.
We describe the data sources used for climate. The precipitation data comes from the PRISM group, which provides estimates monthly estimates at 1/24º spatial resolution across the continental U.S from 1895 to 2018. For oceanic data we use the EN4 reanalysis product, which provides monthly estimates for ocean salinity and temperature at 1º spatial resolution across the globe from 1900 to the present (see Fig. 3). We constrain our spatial analysis to the range [-180ºW, 0ºW] and [-20ºS, 60ºN], which encapsulates the area around North America and a large portion of South America. The ocean data is non-stationary, with the variance of the data increasing over time. This is likely due to improvement in observational measurements of ocean temperature and salinity over time, which reduce the amount of interpolation needed to generate an estimate for a given month. After detrending and deseasonalizing, we split the train, validation, and test sets using random consecutive sequences so that their samples come from a similar distribution. We train the full-rank model at resolutions and and the low-rank model at resolutions , , , , , and . For finegraining criteria, we use a patience factor of 4, i.e. training was terminated when a finegraining criterion was reached a total of 4 times. Both validation loss and gradient statistics were relatively noisy during training (possibly due to a small number of samples), leading to early termination without the patience factor. During fine-graining, the weights were upsampled to the higher resolution using bilinear interpolation and then scaled by the ratio of the number of inputs for the higher resolution to the number of inputs for the lower resolution (as described in Section 4) to preserve the magnitude of the prediction.
We trained the basketball dataset on 4 RTX 2080 Ti GPUs, while the climate dataset experiments were performed on a separate workstation with 1 RTX 2080 Ti GPU. The computation times of the fixed-resolution and MRTL model were compared on the same setup for all experiments.
Hyperparameter | Basketball | Climate |
---|---|---|
Batch size | ||
Full-rank learning rate | ||
Full-rank regularization | ||
Low-rank learning rate | ||
Low-rank regularization | ||
Spatial regularization |