Large-Scale Learning with Fourier Features and Tensor Decompositions

09/03/2021
by   Frederiek Wesel, et al.
Delft University of Technology

Random Fourier features provide a way to tackle large-scale machine learning problems with kernel methods. Their slow Monte Carlo convergence rate has motivated research into deterministic Fourier features, whose approximation error decreases exponentially with the number of frequencies. However, due to their tensor product structure these methods suffer heavily from the curse of dimensionality, limiting their applicability to two- or three-dimensional scenarios. In our approach we overcome said curse of dimensionality by exploiting the tensor product structure of deterministic Fourier features, which enables us to represent the model parameters as a low-rank tensor decomposition. We derive a monotonically converging block coordinate descent algorithm with linear complexity in both the sample size and the dimensionality of the inputs for a regularized squared loss function, allowing us to learn a parsimonious model in decomposed form using deterministic Fourier features. We demonstrate by means of numerical experiments how our low-rank tensor approach obtains the same performance as the corresponding nonparametric model, consistently outperforming random Fourier features.



1 Introduction

Kernel methods such as Support Vector Machines and Gaussian Processes are commonly used to tackle problems such as classification, regression and dimensionality reduction. Since they can be universal function approximators [12], kernel methods have received renewed attention in the last few years and have shown equivalent or superior performance to neural networks [21, 25, 10]. The main idea behind kernel methods is to lift the data into a higher-dimensional (or even infinite-dimensional) Reproducing Kernel Hilbert Space H by means of a feature map φ: X → H. Considering the pairwise similarities between the mapped data then allows one to tackle problems which are highly nonlinear in the original sample space X but linear in H. Equivalently, one can consider a kernel function k such that k(x, x') = ⟨φ(x), φ(x')⟩ and thus perform the mapping implicitly. Although effective at learning nonlinear patterns in the data, kernel methods are known to scale poorly as the number of data points N

increases. For example, when considering Kernel Ridge Regression (KRR), Gaussian Process Regression (GPR) [32] or Least Squares Support Vector Machines (LS-SVM) [40, 41], training usually consists of inverting the N × N Gram matrix, which encodes the pairwise relations between all data points. As a consequence, the associated storage complexity is O(N^2) and the computational complexity is O(N^3), rendering these methods infeasible for large data. In order to lower the computational load, data-dependent methods approximate the kernel function by means of M data-dependent basis functions. Due to their reduced-rank formulation, the computational complexity is reduced to O(NM^2) for large data. However, e.g. for the Nyström method, the convergence rate of the approximation is only O(1/√M) [7], limiting its effectiveness. As the name suggests, data-independent methods approximate the kernel function by data-independent basis functions. A good example is the celebrated Random Fourier Features (RFF) approach by Rahimi and Recht [30], where the authors propose for stationary kernels a low-dimensional random mapping z: X → R^M such that

k(x, x') ≈ z(x)^T z(x').    (1)

As in the case of data-dependent methods, the reduced-rank formulation allows for a computational complexity of O(NM^2). Probabilistic error bounds on the approximation are provided which result in a convergence rate of O(1/√M), which is again the Monte Carlo rate.
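For concreteness, the construction in (1) for the Gaussian kernel can be sketched in a few lines of NumPy. This is an illustrative sketch only (function names and parameter values are ours, not taken from the paper), showing the Monte Carlo nature of the approximation:

```python
import numpy as np

def rff_features(X, n_features, lengthscale, rng):
    # Random Fourier features for the Gaussian (RBF) kernel: frequencies are
    # drawn from the kernel's spectral density (a Gaussian), phases uniformly.
    n, d = X.shape
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 3))
Z = rff_features(X, n_features=2000, lengthscale=0.5, rng=rng)

# Exact Gaussian kernel for comparison: k(x, x') = exp(-||x - x'||^2 / (2 l^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * 0.5 ** 2))
print(np.max(np.abs(Z @ Z.T - K_exact)))   # error decays only as O(1/sqrt(M))
```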

Improvements in this sense were achieved by considering deterministic features resulting from dense quadrature methods [5, 24, 35] and kernel eigenfunctions [15, 38]. These methods are able to achieve exponential convergence. For a D-dimensional input space, however, they take the tensor product of D vectors, resulting in an exponential increase in the number of basis functions and weights, which effectively limits the applicability of deterministic features to low-dimensional data. In this work we consider deterministic features. In order to take advantage of the tensor product structure which arises when mapping the inputs, we represent the weights as a low-rank tensor. This allows us to learn the inter-modal relations in the tensor product of (low-dimensional) Hilbert spaces, avoiding computational and storage complexities that are exponential in D. In this way we obtain a computational complexity that is linear in both the sample size N and the input dimension D during training, without having to resort to sparse grids or additive kernels.

The main contribution of this work is in lifting the curse of dimensionality affecting deterministic Fourier Features by modeling the weights as a low-rank tensor. This enables efficiently solving large-scale and high-dimensional kernel ridge regression problems with an exponential convergence of the kernel function approximation. We test the resulting algorithm under regularized squared loss for regression and classification.

2 Related Work

Fourier Features (FF) are a collection of data-independent methods that leverage Bochner's theorem [34] from harmonic analysis to approximate stationary kernels by numerical integration of their spectral density S(ω):

k(x − x') = ∫ S(ω) exp( i ω^T (x − x') ) dω.    (2)

Rahimi and Recht [30] proposed to approximate the integral by Monte Carlo integration, i.e. by drawing M random frequencies from S(ω). In their work they show how the method converges uniformly at the Monte Carlo rate. See [22] for an overview of random FF. In order to achieve faster convergence and a lower sample complexity, a multitude of approaches that rely on deterministic numerical quadrature of the Fourier integral were developed. These methods generally consider kernels whose spectral density factorizes over the D dimensions in the frequency domain, which in turn allows the Fourier integral to be factorized. The resulting deterministic Fourier features are then the tensor product of D one-dimensional deterministic features.

For example, in [5] the authors give an analysis of the sample complexity of features resulting from dense Gaussian Quadrature (GQ). In [24] the authors present a similar quadrature approach for kernels whose spectral density factorizes over the dimensions of the inputs. They provide an explicit construction for their Quadrature Fourier Features relying on a dense Cartesian grid, and note that their method, as well as GQ, can achieve exponentially fast convergence in the total number of frequencies [24, Theorem 1]. To avoid the curse of dimensionality, they make use of additive modeling of kernels. Variational Fourier Features (VFF) [15] are derived probabilistically in a one-dimensional setting for Matérn kernels by projecting a Gaussian Process onto a set of Fourier basis functions. An extension to multiple dimensions when considering product kernels is then achieved by taking the tensor product of the one-dimensional features, incurring however again exponentially rising computational costs in D. In what they call the Hilbert-GP [38], the authors diagonalize stationary kernels in terms of the eigenvalues and eigenfunctions of the Laplace operator with Dirichlet boundary conditions. Due to the multiplicative structure of the eigenfunctions of the Laplace operator, the number of basis functions increases exponentially in D when one considers product kernels. As with other deterministic approximations, exponential convergence can be achieved [38, Theorem 8].

Tensor decompositions have been used extensively in machine learning in order to benefit from structure in data [37], in particular to obtain an efficient representation of the model parameters. In the context of Gaussian Process Regression, the tensor product structure which arises when the inputs are located on a Cartesian grid has been exploited to speed up inference [11]. In their variational inference framework, [18] proposed to parameterize the posterior mean of a Gaussian Process by means of a tensor decomposition in order to exploit the tensor product structure which arises when interpolating the kernel function. In the context of neural networks, [26] proposed to represent the weights of deep neural networks as a tensor decomposition to speed up training. A similar approach was carried out in the case of recurrent neural networks [46, 42].

Tensor decompositions have also been used to learn from simple features. Motivated by spin vectors in quantum mechanics, [39] considered trigonometric feature maps, which arguably induce uninformative kernels, and tackled multi-class image classification by modeling the model weights as a tensor decomposition. On a similar note, [27] and [1] leverage the tensor product structure of polynomial mappings to induce a polynomial kernel and consider model parameters in decomposed form. While these existing approaches are able to overcome the curse of dimensionality affecting polynomial feature maps by modeling the parameter tensor as a tensor network, they consider simple and empirical feature maps which induce uninformative kernels. In the following section we show how deterministic Fourier features can be used for supervised learning in large-scale and high-dimensional scenarios.

3 Learning with deterministic Fourier features

3.1 Notation

All tensor operations and mappings can be formulated over both the complex and the real field. For the remainder of this article real-valued mappings are considered. Vectors and matrices are denoted by boldface lowercase and uppercase letters, respectively. Higher-order tensors are denoted by boldface calligraphic letters. Tensor entries are always written as lightface lowercase letters with a subscript index notation. For example, element (i1, i2, i3) of the third-order tensor A is written as a_{i1 i2 i3}. The vectorization of a D-way tensor A of size M1 × M2 × ⋯ × MD is the vector vec(A) of length M1 M2 ⋯ MD such that

[vec(A)]_i = a_{i1 i2 ⋯ iD},

where the relationship between the linear index i and the multi-index (i1, i2, …, iD) is given by

i = i1 + Σ_{d=2}^{D} (i_d − 1) M1 M2 ⋯ M_{d−1}.

The Frobenius inner product between two tensors A and B of equal size is defined as

⟨A, B⟩ := Σ_{i1, …, iD} a_{i1 ⋯ iD} b_{i1 ⋯ iD}.

Depending on the context, the symbol ⊗ either denotes the tensor outer product or the matrix Kronecker product. The Khatri-Rao product A ⊙ B of the matrices A (of size M × R) and B (of size P × R) is the MP × R matrix obtained by taking the column-wise Kronecker product.
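As a small illustration of the two matrix products (the matrices and sizes below are arbitrary, chosen only to show the shapes):

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)     # 3 x 2
B = np.arange(8.0).reshape(4, 2)     # 4 x 2

kron = np.kron(A, B)                 # Kronecker product: 12 x 4
khatri_rao = np.vstack(              # Khatri-Rao: column-wise Kronecker, 12 x 2
    [np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])]
).T

print(kron.shape, khatri_rao.shape)  # (12, 4) (12, 2)
```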

3.2 The model

In this article we assume kernelized models of the form

f(x) = ⟨w, φ(x)⟩.    (3)

Learning such a kernelized model consists in finding a hyperplane w that minimizes a symmetric loss function over the mapped data in the high- and possibly infinite-dimensional space H such that:

min_w  Σ_{n=1}^{N} ℓ( y_n, ⟨w, φ(x_n)⟩ ) + λ Ω(w).    (4)

Here Ω(w) is a p-th norm regularization term and λ the regularization constant. A variety of machine learning algorithms arise when considering different combinations of loss functions and regularization terms. For example, considering a squared loss together with Ω(w) = ⟨w, w⟩ leads to Kernel Ridge Regression [41], while considering a hinge loss with the same regularizer leads to the SVM [4], and so on. Furthermore, applying the kernel trick recovers the nonparametric formulation which, depending on the choice of kernel, enables inference with infinitely many basis functions at a computational complexity of O(N^3). Considering a low-dimensional finite mapping such as Random Fourier Features or the Nyström method leads to computational savings and a computational complexity of O(NM^2). However, as already discussed, these low-dimensional random mappings converge at the slow Monte Carlo rate, motivating our approach.

In order to approximate the kernel function with fast exponential convergence, we consider deterministic, finite-dimensional features which are the tensor product of D vectors. For a given D-dimensional input point x we define the deterministic feature mapping z(x) as

z(x) = z^(1)(x_1) ⊗ z^(2)(x_2) ⊗ ⋯ ⊗ z^(D)(x_D),    (5)

where z^(d): R → R^{M_d} is a deterministic mapping applied to the d-th dimension of the input. The dimension of the feature space is hence M1 M2 ⋯ MD. Without loss of generality it is from here on assumed that M1 = M2 = ⋯ = MD = M, such that the total number of basis functions is M^D. The mapping encompasses for instance the mappings derived by quadrature [5, 24, 35] and by projection [15, 38]. Note that these mappings cover many popular kernels such as the Gaussian and Matérn kernels. To give a concrete example, in the framework of A. Solin and S. Särkkä [38, Equation 60], for input data centered in a hyperbox [−L_1, L_1] × ⋯ × [−L_D, L_D], the Gaussian kernel is approximated by means of tensor products of weighted sinusoidal basis functions with frequencies lying on a harmonic scale such that:

[z^(d)(x_d)]_i = sqrt( S_d( π i / (2 L_d) ) ) · (1/sqrt(L_d)) · sin( π i (x_d + L_d) / (2 L_d) ),   i = 1, …, M.    (6)

Here S_d(·) is the spectral density of the Gaussian kernel with one-dimensional inputs, which is known in closed form [32, page 83]. This deterministic mapping then approximates the Gaussian kernel function [38, Equation 59] such that k(x, x') ≈ z(x)^T z(x'), and the approximation converges exponentially fast in M [38, Theorem 8]. We therefore use z(·) instead of φ(·) in (3) to obtain

f(x) = ⟨W, Z(x)⟩,    (7)

where the weight vector w of length M^D has been reshaped into a D-dimensional tensor W of size M × ⋯ × M, and Z(x) denotes the feature vector z(x) of (5) reshaped in the same way. Learning the exponential number of model parameters in (7) under a hinge loss leads to Support Vector Machines, while considering a squared loss leads to Kernel Ridge Regression:

min_W  Σ_{n=1}^{N} ( y_n − ⟨W, Z(x_n)⟩ )^2 + λ ⟨W, W⟩.    (8)
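To make (5)-(8) concrete, the following NumPy sketch builds the tensor-product features and solves the naive primal problem for a toy two-dimensional example. It assumes the Solin-Särkkä form of (6) for the Gaussian kernel; all names and numerical values are ours, and the sketch is only meant to show why this direct approach works for small D only:

```python
import numpy as np

def hilbert_gp_features(xd, M, L, lengthscale):
    # One-dimensional deterministic features in the spirit of (6): Laplacian
    # eigenfunctions on [-L, L], weighted by the square root of the Gaussian
    # spectral density S(w) = lengthscale * sqrt(2*pi) * exp(-(lengthscale*w)^2 / 2).
    j = np.arange(1, M + 1)
    freqs = np.pi * j / (2.0 * L)                     # harmonic frequency scale
    S = lengthscale * np.sqrt(2.0 * np.pi) * np.exp(-0.5 * (lengthscale * freqs) ** 2)
    return np.sqrt(S) * np.sin(freqs * (xd + L)) / np.sqrt(L)

def tensor_product_features(x, M, L, lengthscale):
    # Full feature vector z(x) of (5): Kronecker product over the D dimensions.
    z = np.ones(1)
    for xd in x:
        z = np.kron(z, hilbert_gp_features(xd, M, L, lengthscale))
    return z                                          # length M**D

# Naive primal kernel ridge regression (8) for a toy D = 2 problem.
rng = np.random.default_rng(0)
D, M, L, ell, lam = 2, 10, 1.0, 0.3, 1e-3
X = rng.uniform(-0.5, 0.5, size=(200, D))
y = np.sin(3.0 * X[:, 0]) * np.cos(2.0 * X[:, 1])
Z = np.stack([tensor_product_features(x, M, L, ell) for x in X])   # 200 x M**D
w = np.linalg.solve(Z.T @ Z + lam * np.eye(M ** D), Z.T @ y)
print(Z.shape)   # (200, 100): already M**D = 100 weights for D = 2, M = 10
```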

Since the number of elements in W and Z(x) grows exponentially in D, this primal approach is advantageous compared to the nonparametric dual approach only if M^D < N, limiting it to low-dimensional inputs. In order to lift this curse of dimensionality, we propose to represent W, and to learn it directly, as a low-rank tensor decomposition. A low-rank tensor decomposition allows us to exploit redundancies in W in order to obtain a parsimonious model with a storage complexity that scales linearly in both D and M. The low-rank structure will, as explicitly shown in the experiments, act as a form of regularization by limiting the total number of degrees of freedom of the model.

3.3 Low-rank tensor decompositions

Tensor decompositions (also called tensor networks) can be seen as generalizations of the Singular Value Decomposition (SVD) of a matrix to tensors [19]. Three common tensor decompositions are the Tucker decomposition [6, 44], the Tensor Train decomposition [28] and the Canonical Polyadic Decomposition (CPD) [16, 20], each of which retains different properties of the matrix SVD. In this subsection we briefly discuss these three decompositions and list their advantages and disadvantages for modeling the weight tensor W in (8). For a detailed exposition on tensor decompositions we refer the reader to [2] and the references therein. An important property of tensor decompositions is that they can always be written linearly in each of their components, which implies that applying a block coordinate descent algorithm to solve (8) results in a sequence of linear least squares problems.

A rank-R CPD of the weight tensor W consists of D factor matrices W^(1), …, W^(D), each of size M × R, and a vector c of length R such that

w_{i1 i2 ⋯ iD} = Σ_{r=1}^{R} c_r w^(1)_{i1 r} w^(2)_{i2 r} ⋯ w^(D)_{iD r},   or equivalently   w = ( W^(D) ⊙ ⋯ ⊙ W^(1) ) c.

Note that the entries of the vector c can be absorbed into one of the factor matrices when writing the CPD as a sum of R rank-one terms. The CPD has been shown, in contrast to matrix factorizations, to be unique under mild conditions [36]. The storage complexity comes mainly from the factor matrices and is therefore O(DMR). The Tucker decomposition generalizes the CPD in two ways. First, the Khatri-Rao product is replaced with a Kronecker product, and the vector has to grow in length accordingly from R to R^D. Second, the R-dimension of each of the factor matrices is allowed to vary, resulting in a multilinear rank (R_1, R_2, …, R_D):

w = ( W^(D) ⊗ ⋯ ⊗ W^(1) ) g,   with g of length R_1 R_2 ⋯ R_D.

The Tucker decomposition is inherently non-unique and its storage complexity is dominated by the vector g, whose length R_1 R_2 ⋯ R_D grows exponentially in D, which limits its use to low-dimensional input data. For this reason we do not consider the Tucker decomposition any further in this article.
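The storage argument for the CPD can be made concrete with a short NumPy sketch (sizes are arbitrary and chosen only for illustration):

```python
import numpy as np

D, M, R = 8, 20, 10
rng = np.random.default_rng(0)
factors = [rng.normal(size=(M, R)) for _ in range(D)]

print(D * M * R)   # 1600 numbers stored by the rank-R CPD
print(M ** D)      # 25_600_000_000 entries of the full weight tensor (never formed)

# For a tiny case the tensor can be reconstructed explicitly to inspect the format.
small = [f[:3, :2] for f in factors[:3]]            # D = 3, M = 3, R = 2
W = np.einsum('ir,jr,kr->ijk', *small)              # w_ijk = sum_r a_ir b_jr c_kr
print(W.shape)                                      # (3, 3, 3)
```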

The Tensor Train (TT) decomposition consists of D third-order tensors G^(1), …, G^(D), where G^(d) has size R_{d−1} × M × R_d, such that

w_{i1 i2 ⋯ iD} = Σ_{r_0, r_1, …, r_D} g^(1)_{r_0 i1 r_1} g^(2)_{r_1 i2 r_2} ⋯ g^(D)_{r_{D−1} iD r_D}.    (9)

The auxiliary dimensions R_1, …, R_{D−1} are called the TT-ranks. In order to ensure that the right-hand side of (9) is a scalar, the boundary condition R_0 = R_D = 1 is enforced. The TT decomposition is, just like the Tucker decomposition, non-unique, and its storage complexity is O(DMR^2) due to the third-order tensor components, where R denotes the largest TT-rank. Considering their storage complexity, both the CPD and the TT decomposition are viable candidates to replace W in (8). The CP-rank R and the TT-ranks R_1, …, R_{D−1} are additional hyperparameters; since the CPD introduces only a single one, this favors the CPD in practice. For the TT, one could choose equal TT-ranks to reduce the number of additional hyperparameters, but this constraint turns out in practice to lead to suboptimal results. For these reasons we limit the discussion of the learning algorithm to the CPD case.
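The key computational property exploited below is that a model with a CPD-structured weight tensor can be evaluated without ever forming the M^D tensor, since ⟨W, Z(x)⟩ collapses to a sum over R of products of D small inner products. A minimal NumPy sketch of this evaluation (our own illustrative code, not the authors' implementation), together with a consistency check against the explicit Kronecker-structured formulation:

```python
import numpy as np

def cpd_predict(x, factors, feature_fn):
    # f(x) = <W, Z(x)> with W in rank-R CPD form, evaluated without building
    # the M**D weight tensor:  f(x) = sum_r prod_d z^(d)(x_d)^T w_r^(d).
    acc = np.ones(factors[0].shape[1])          # length-R accumulator
    for d, Wd in enumerate(factors):
        acc *= feature_fn(x[d]) @ Wd            # Hadamard accumulation over modes
    return acc.sum()

# Consistency check against the explicit M**D-dimensional formulation.
rng = np.random.default_rng(1)
D, M, R = 3, 4, 2
factors = [rng.normal(size=(M, R)) for _ in range(D)]
feat = lambda xd: np.cos(np.arange(1, M + 1) * xd)    # any per-dimension map

x = rng.uniform(size=D)
w_full, z_full = np.zeros(M ** D), np.ones(1)
for r in range(R):
    col = np.ones(1)
    for d in range(D):
        col = np.kron(col, factors[d][:, r])
    w_full += col
for d in range(D):
    z_full = np.kron(z_full, feat(x[d]))
print(np.isclose(cpd_predict(x, factors, feat), w_full @ z_full))   # True
```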

3.4 Tensor learning with deterministic Fourier features

We now wish to minimize the standard regularized squared loss function as in (8) with the additional constraint that the weight tensor W has a rank-R CPD structure:

min_{W^(1), …, W^(D)}  Σ_{n=1}^{N} ( y_n − ⟨W, Z(x_n)⟩ )^2 + λ ⟨W, W⟩    (10)
subject to  w_{i1 i2 ⋯ iD} = Σ_{r=1}^{R} w^(1)_{i1 r} w^(2)_{i2 r} ⋯ w^(D)_{iD r}.    (11)

Note that if R equals the true CP-rank of the underlying weight tensor, the exact solution of (8) would be obtained. In practice a low-rank solution for W achieves a sufficiently complex decision boundary that is practically indistinguishable from the full-rank solution, as is demonstrated in the experiments. Imposing the rank-R structure reduces the number of unknowns from M^D to DMR and allows the application of a block coordinate descent algorithm (also known as the alternating linear scheme), which is known to converge monotonically [3, 17, 33]. Each of the D factor matrices is optimized in an iterative fashion while keeping the others fixed. Such a factor matrix update is obtained by solving a linear least squares problem with MR unknowns. In what follows we derive the linear problem for the d-th factor matrix W^(d). The data-fitting term can be rewritten linearly in terms of the unknown factor matrix as

Σ_{n=1}^{N} ( y_n − ⟨W, Z(x_n)⟩ )^2 = || y − A^(d) vec( W^(d) ) ||^2,    (12)

where ⊛ denotes the Hadamard (element-wise) matrix product and the n-th row of the N × MR matrix A^(d) is given by ( ⊛_{k≠d} z^(k)(x_{n,k})^T W^(k) ) ⊗ z^(d)(x_{n,d})^T.
Similarly, for the regularization term:

λ ⟨W, W⟩ = λ vec( W^(d) )^T ( ( ⊛_{k≠d} (W^(k))^T W^(k) ) ⊗ I_M ) vec( W^(d) ).    (13)

Here I_M denotes the M × M identity matrix. Derivations of (12) and (13) can be found in Appendix A. Substitution of the expressions (12) and (13) into (10) and (11) leads to a parametric ridge regression problem for vec(W^(d)):

min_{vec(W^(d))}  || y − A^(d) vec( W^(d) ) ||^2 + λ vec( W^(d) )^T ( ( ⊛_{k≠d} (W^(k))^T W^(k) ) ⊗ I_M ) vec( W^(d) ).    (14)

Its solution can be computed exactly by applying the matrix inversion lemma and then solving a small system of linear equations in the MR unknowns of W^(d). Since we are solving a non-convex optimization problem that consists of a series of convex and exactly solvable sub-problems, our algorithm is monotonically decreasing and, although it is not guaranteed to converge to the global optimum, this block coordinate approach is the standard choice when dealing with tensors [3, 19]. The total computational complexity of the algorithm is linear in both N and D, rendering it suitable for problems which are large in both the sample size and the input dimension, provided that M and R are small. The choice of M, and of the additional hyperparameter R which we introduce, is directly linked with the available computational budget. One should in fact choose M so that the model has access to a sufficiently complex set of basis functions to learn from. In practice we notice that a few tens of basis functions per dimension suffice for most problems. R can then be fixed accordingly in order to match the computational budget at hand. As we will show in the next section, learning is possible with small values of M and R.
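The structure of (12)-(14) can be summarized in a short NumPy sketch of one block coordinate update and one sweep. This is an illustrative sketch only, not the authors' MATLAB implementation: it forms A^(d) explicitly and solves the MR × MR system directly, omitting the incremental updates, the matrix inversion lemma and the diagonal-regularization speedup mentioned in Section 4.

```python
import numpy as np

def als_update(d, factors, Zs, y, lam):
    # One block coordinate update of W^(d), cf. (12)-(14). Zs[k] is the N x M
    # matrix whose n-th row is z^(k)(x_{n,k}); factors[k] is the M x R matrix W^(k).
    N, M = Zs[d].shape
    R = factors[d].shape[1]

    # Hadamard product over all modes except d (N x R).
    V = np.ones((N, R))
    for k in range(len(factors)):
        if k != d:
            V *= Zs[k] @ factors[k]

    # Row-wise Kronecker structure of (12): A^(d) has shape N x (M*R).
    A = np.einsum('nr,nm->nrm', V, Zs[d]).reshape(N, M * R)

    # Regularization matrix of (13): Hadamard of factor Gram matrices, kron I_M.
    G = np.ones((R, R))
    for k in range(len(factors)):
        if k != d:
            G *= factors[k].T @ factors[k]

    # Ridge-regression subproblem (14) in the M*R unknowns of vec(W^(d)).
    lhs = A.T @ A + lam * np.kron(G, np.eye(M))
    w_vec = np.linalg.solve(lhs, A.T @ y)
    factors[d] = w_vec.reshape(R, M).T
    return factors

def sweep(factors, Zs, y, lam):
    # One sweep: update the factors forward and then backward.
    D = len(factors)
    for d in list(range(D)) + list(range(D - 1, -1, -1)):
        factors = als_update(d, factors, Zs, y, lam)
    return factors
```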

4 Numerical experiments

We have implemented our Tensor-Kernel Ridge Regression (T-KRR) method in Mathworks MATLAB 2021a (Update 1) [23] and tested it on several regression and classification problems. In our implementation we avoid constructing the matrices in (12) and (13) from scratch at every iteration by updating their components. We further speed up our implementation by considering only the diagonal of the regularization term in (13). All experiments were run on an Intel(R) Core i7-10610U CPU running at 1.80 GHz in a Dell Inc. Latitude 7410 laptop with 16 GB of RAM. In all experiments the Gaussian kernel was used, approximated by means of the Hilbert-GP [38] mapping of (6), and the inputs were scaled to lie in a D-dimensional unit hypercube. When dealing with regression the responses were standardized around the mean, while for binary classification inference was done by looking at the sign of the model response (LS-SVM [40]). One sweep of our T-KRR algorithm is defined as updating the factor matrices in the order W^(1), …, W^(D) and then back from W^(D) to W^(1). All factor matrices were initialized with standard normal numbers and normalized by dividing all entries by their Frobenius norm. For all experiments the number of sweeps of the T-KRR algorithm is fixed in advance. In the following three experiments it is shown how our proposed T-KRR algorithm is stable and recovers the full KRR estimate with a low number of frequencies M and a low rank R, consistently outperforming RFF. Finally, our model exhibits very competitive performance on large-scale problems when compared with other kernel methods.

4.1 Banana classification

The banana dataset is a two-dimensional binary classification problem [14] consisting of 5300 datapoints. Since the data is two-dimensional, we consider M frequencies per dimension, which enables us to compare T-KRR with its full-rank counterpart. Furthermore, since for tensors of order two (i.e. matrices) the CP-rank is the matrix rank, we can deduce that our approach should recover the underlying Hilbert-GP method once R equals the rank of the full weight matrix. In this example we fix the kernel hyperparameters of our method and of the Hilbert-GP to the same values for visualization purposes. In Figure 1 we plot the decision boundary of T-KRR (full line) and of the associated Hilbert-GP (dashed line) for increasing CP-rank. We can see that already for small R the learned decision boundary is almost identical to the one of Hilbert-GP, meaning that in this example it is possible to obtain a low-rank representation of the model weights. From a computational point of view, we solve a series of linear systems with MR unknowns instead of a single system with M^2 unknowns. Due to the low dimensionality of the dataset, however, the computational savings per iteration are very modest. As we show in Subsection 4.2, the linear computational complexity in D of our algorithm results in ever-increasing performance benefits as the dimensionality of the problem becomes larger.

Figure 1: Classification boundary of the two-dimensional banana dataset for increasing CP-ranks. In the last plot the chosen CP-rank matches the true matrix rank. The dashed line is the Hilbert-GP, while the full line is T-KRR.
Figure 2: Monotonically decreasing normalized residual (loss) and misclassification rate as a function of the training iteration on the Spambase dataset.

4.2 Model performance with baseline

We consider five UCI [8] datasets in order to compare the performance of our model with RFF and with the baseline, which we choose to be GPR. For each dataset, part of the data is used for training and the remainder for testing. In particular, for all regression datasets we first obtain an estimate of the lengthscale of the Gaussian kernel and of the regularization constant λ by log-marginal likelihood optimization over the whole training set using the GPML toolbox [31] (FreeBSD License). We subsequently train KRR, RFF and our method with the estimated hyperparameters. In the case of classification, we choose the lengthscale to be the sample mean of the per-feature sample standard deviations of our data (which is the default choice in the MATLAB Machine Learning Toolbox and scikit-learn [29]). We repeat this procedure ten times over different random splits and report the sample mean and sample standard deviation of the predictive mean squared error (MSE) for regression and of the misclassification rate for classification. In order to test the approximation capabilities of our approach, we set M and R to small values in order to benefit from computational gains. Furthermore, we note that due to the curse of dimensionality it is not possible to compare our approach with RFF while considering the same total number of basis functions M^D. Hence we select the number of random frequencies such that the computational complexity of both methods is similar. Table 1 shows the results for the different UCI datasets. We see that our approach comes close to the performance of full KRR and outperforms RFF on all datasets when considering the same number of model parameters. This is because our approach implicitly considers M^D frequencies to learn a low-rank parsimonious model which is fully described by DMR parameters, whereas for RFF the number of frequencies equals the number of model weights. Similarly to what is reported in [45, Figure 2], we observe lower performance of RFF on the Adult dataset than reported in [30]. Figure 2 plots the monotonically decreasing loss function as well as the corresponding misclassification rate while training T-KRR on the Spambase dataset. After four sweeps the misclassification rate has converged. Since T-KRR is able to obtain similar performance as the KRR baseline on a range of small datasets, we proceed to tackle a large-scale regression problem.
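A minimal sketch of the preprocessing and lengthscale heuristic described above, written in NumPy as an illustration (function and variable names are ours; the ddof choice is an assumption):

```python
import numpy as np

def preprocess(X_train, X_test, y_train):
    # Scale inputs to the unit hypercube and centre the regression targets,
    # then compute the classification lengthscale heuristic: the mean of the
    # per-feature sample standard deviations of the (scaled) training inputs.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    X_tr = (X_train - lo) / span
    X_te = (X_test - lo) / span
    y_tr = y_train - y_train.mean()
    lengthscale = X_tr.std(axis=0, ddof=1).mean()
    return X_tr, X_te, y_tr, lengthscale
```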

Dataset       N      D    M    R    RFF    T-KRR    KRR
Yacht         308    6    10   25
Energy [43]   768    9    20   10
Airfoil       1503   5    20   10
Spambase      4601   57   40   10
Adult         45222  96   40   10                   N/A
Table 1: Predictive MSE (regression) and misclassification rate (classification) with one standard deviation for RFF, T-KRR and KRR on different UCI datasets.

4.3 Large-scale experiment

In order to highlight the ability of our method to deal with large data we consider the Airline dataset. The Airline dataset [13, 15] (Apache License 2.0) is a large-scale regression problem originally considered in [13] which is often used to compare state-of-the-art Gaussian Process Regression approximations, because of its large size and its non-stationary features. The goal is to predict the delay of a flight given eight features: the age of the airplane, route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. We follow the exact same preprocessing steps as in the experiments of [15], [38] and [9], which consider subsets of the data of size 10,000, 100,000, 1,000,000 and the full 5,929,413 records, each chosen uniformly at random. Training the model is then accomplished on part of each subset, with the remaining portion reserved for testing. The entire procedure is repeated over different random subsets in order to obtain a sample mean and sample standard deviation estimate of the MSE. Following exactly the approach of the Hilbert-GP [38], we consider M basis functions per dimension, with the crucial difference that T-KRR approximates a standard product Gaussian kernel, as opposed to an additive model. As was the case in the previous section for classification, we select the lengthscale of the kernel as the mean of the standard deviations of the eight features, and we scale the regularization constant λ with N for the different splits, where the dependency on N accounts for the fact that we minimize the squared loss instead of the mean squared loss. We then train four distinct models for increasing values of the CP-rank R. Table 2 compares T-KRR with the best performing models found in the literature. The T-KRR method is able to recover the baseline GPR performance already at the smallest rank considered and remarkably outperforms all other approaches, even though we choose the hyperparameters naively and consider equal lengthscales for all dimensions. Interestingly, a higher choice of R results in better performance in all cases except for the smallest subset, where model performance depends strongly on the random split (note that we consider different splits for each experiment). The SVIGP [13] achieves comparable results to T-KRR, indicating that a performance gain is possible from product kernels. The difficulty with SVIGP is however that the kernel function is interpolated locally at a set of nodes, requiring an increasing number of interpolation nodes to cover the whole domain and allow for good generalization. In contrast, T-KRR considers an exponentially large number of basis functions in the frequency domain and learns an efficient representation of the model weights. In light of the performance of the additive Hilbert-GP and additive VFF models, we expect similar performance when considering VFF feature maps. When considering the whole dataset, training our model on a laptop takes a number of seconds for both the smallest and the largest rank considered, whereas the reported training times of SVIGP [38], also of the order of seconds, were obtained on a cluster. Since the computational complexity of our algorithm is dominated by matrix-matrix multiplications, we expect significant speedups when relying on GPU computations.

N                       10,000   100,000   1,000,000   5,929,413
T-KRR (smallest R)
T-KRR
T-KRR
T-KRR (largest R)
A-Hilbert-GP [38, Table 1]
A-VFF [38, Table 1]
SVIGP [38, Table 1]
VISH [9, Table 2]
GPR [38, Table 1]                N/A       N/A         N/A
A-GPR [38, Table 1]              N/A       N/A         N/A
Table 2: Predictive MSE with one standard deviation for T-KRR on the Airline dataset for increasing subset size N. The prefix A- indicates that the model in question considers an additive kernel.

5 Conclusion

In this work a framework to perform approximate Kernel Ridge Regression with tensors by leveraging the tensor product structure of deterministic Fourier features was introduced. Concretely, a monotonically decreasing learning algorithm with linear complexity in both sample size and dimensionality was derived. This algorithm leverages the efficient format of the Canonical Polyadic Decomposition in combination with exponentially converging deterministic Fourier features. Numerical experiments show how the performance of the baseline Kernel Ridge Regression is recovered with a very limited number of parameters. Very much like Kernel Ridge Regression itself, the biggest limitation of the current approach is that it is non-probabilistic, which motivates further work in that direction. We also plan to test our approach further on other large-scale problems.

Appendix A Appendix

Derivation of the data-fitting term of (12). We make use of the multi-linearity property of the CPD and rely on re-ordering the summations:

⟨W, Z(x)⟩ = Σ_{i1, …, iD} ( Σ_{r=1}^{R} w^(1)_{i1 r} ⋯ w^(D)_{iD r} ) [z^(1)(x_1)]_{i1} ⋯ [z^(D)(x_D)]_{iD}
          = Σ_{r=1}^{R} ∏_{k=1}^{D} z^(k)(x_k)^T w_r^(k)
          = ( ( ⊛_{k≠d} z^(k)(x_k)^T W^(k) ) ⊗ z^(d)(x_d)^T ) vec( W^(d) ),

which, stacked over the N samples, yields (12). The derivation of the regularization term of (13) follows a similar reasoning as for the data-fitting term:

⟨W, W⟩ = Σ_{r=1}^{R} Σ_{s=1}^{R} ∏_{k=1}^{D} w_r^(k)^T w_s^(k) = vec( W^(d) )^T ( ( ⊛_{k≠d} (W^(k))^T W^(k) ) ⊗ I_M ) vec( W^(d) ).

References

  • Chen et al. [2018] Z. Chen, K. Batselier, J. A. K. Suykens, and N. Wong. Parallelized Tensor Train Learning of Polynomial Classifiers. IEEE Transactions on Neural Networks and Learning Systems, 29(10):4621–4632, Oct. 2018. ISSN 2162-2388. doi: 10.1109/TNNLS.2017.2771264.
  • Cichocki et al. [2016] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic. Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions. Foundations and Trends® in Machine Learning, 9(4-5):249–429, 2016. ISSN 1935-8237, 1935-8245. doi: 10.1561/2200000059.
  • Comon et al. [2009] P. Comon, X. Luciani, and A. L. F. de Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23(7-8):393–405, July 2009. ISSN 08869383, 1099128X. doi: 10.1002/cem.1236.
  • Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sept. 1995. ISSN 1573-0565. doi: 10.1007/BF00994018.
  • Dao et al. [2017] T. Dao, C. D. Sa, and C. Ré. Gaussian quadrature for kernel features. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6109–6119, Red Hook, NY, USA, Dec. 2017. Curran Associates Inc. ISBN 978-1-5108-6096-4.
  • De Lathauwer et al. [2000] L. De Lathauwer, B. De Moor, and J. Vandewalle. A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, Jan. 2000. ISSN 0895-4798. doi: 10.1137/S0895479896305696.
  • Drineas and Mahoney [2005] P. Drineas and M. W. Mahoney. On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research, 6(72):2153–2175, 2005. ISSN 1533-7928.
  • Dua and Graff [2017] D. Dua and C. Graff. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2017.
  • Dutordoir et al. [2020] V. Dutordoir, N. Durrande, and J. Hensman. Sparse Gaussian Processes with Spherical Harmonic Features. In International Conference on Machine Learning, pages 2793–2802. PMLR, Nov. 2020.
  • Garriga-Alonso et al. [2018] A. Garriga-Alonso, C. E. Rasmussen, and L. Aitchison. Deep Convolutional Networks as shallow Gaussian Processes. In International Conference on Learning Representations, 2018.
  • Gilboa et al. [2015] E. Gilboa, Y. Saatçi, and J. P. Cunningham. Scaling Multidimensional Inference for Structured Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):424–436, Feb. 2015. ISSN 1939-3539. doi: 10.1109/TPAMI.2013.192.
  • Hammer and Gersmann [2003] B. Hammer and K. Gersmann. A Note on the Universal Approximation Capability of Support Vector Machines. Neural Processing Letters, 17(1):43–53, Feb. 2003. ISSN 1573-773X. doi: 10.1023/A:1022936519097.
  • Hensman et al. [2013] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for Big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI'13, pages 282–290, Arlington, Virginia, USA, Aug. 2013. AUAI Press.
  • Hensman et al. [2015] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable Variational Gaussian Process Classification. In Artificial Intelligence and Statistics, pages 351–360. PMLR, Feb. 2015.
  • Hensman et al. [2017] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. The Journal of Machine Learning Research, 18(1):5537–5588, Jan. 2017. ISSN 1532-4435.
  • Hitchcock [1927] F. L. Hitchcock. The Expression of a Tensor or a Polyadic as a Sum of Products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927. ISSN 1467-9590. doi: 10.1002/sapm192761164.
  • Holtz et al. [2012] S. Holtz, T. Rohwedder, and R. Schneider. The alternating linear scheme for tensor optimization in the tensor train format. SIAM J. Sci. Comput., 34(2):A683–A713, 2012.
  • Izmailov et al. [2018] P. Izmailov, A. Novikov, and D. Kropotov. Scalable Gaussian Processes with Billions of Inducing Inputs via Tensor Train Decomposition. In International Conference on Artificial Intelligence and Statistics, pages 726–735. PMLR, Mar. 2018.
  • Kolda and Bader [2009] T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, Aug. 2009. ISSN 0036-1445, 1095-7200. doi: 10.1137/07070111X.
  • Kruskal [1977] J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, Jan. 1977. ISSN 0024-3795. doi: 10.1016/0024-3795(77)90069-6.
  • Lee et al. [2018] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep Neural Networks as Gaussian Processes. arXiv:1711.00165 [cs, stat], Mar. 2018.
  • Liu et al. [2020] F. Liu, X. Huang, Y. Chen, and J. A. K. Suykens. Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond. arXiv:2004.11154 [cs, stat], July 2020.
  • Matlab [2021] Matlab. 9.10.0.1649659 (R2021a) Update 1. The MathWorks Inc., Natick, Massachussetts, 2021.
  • Mutný and Krause [2018] M. Mutný and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature fourier features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 9019–9030, Red Hook, NY, USA, Dec. 2018. Curran Associates Inc.
  • Novak et al. [2018] R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2018.
  • Novikov et al. [2015] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • Novikov et al. [2017] A. Novikov, M. Trofimov, and I. Oseledets. Exponential Machines. arXiv:1605.03795 [cs, stat], Dec. 2017.
  • Oseledets [2011] I. V. Oseledets. Tensor-Train Decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, Jan. 2011. ISSN 1064-8275, 1095-7197. doi: 10.1137/090752286.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011. ISSN 1533-7928.
  • Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 1177–1184, Red Hook, NY, USA, Dec. 2007. Curran Associates Inc. ISBN 978-1-60560-352-0.
  • Rasmussen and Nickisch [2010] C. E. Rasmussen and H. Nickisch. Gaussian Processes for Machine Learning (GPML) Toolbox. Journal of Machine Learning Research, 11(100):3011–3015, 2010. ISSN 1533-7928.
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9.
  • Rohwedder and Uschmajew [2013] T. Rohwedder and A. Uschmajew. On local convergence of alternating schemes for optimization of convex problems in the tensor train format. SIAM Journal on Numerical Analysis, 51(2):1134–1162, 2013.
  • Rudin [1962] W. Rudin. Fourier Analysis on Groups. Number 12 in Interscience Tracts in Pure and Applied Mathematics. Interscience Publishers, New York, 1962. ISBN 0-471-52364-X 978-0-471-52364-2.
  • Shustin and Avron [2021] P. F. Shustin and H. Avron. Gauss-Legendre Features for Gaussian Process Regression. arXiv:2101.01137 [cs, math], Jan. 2021.
  • Sidiropoulos and Bro [2000] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000. ISSN 1099-128X. doi: 10.1002/1099-128X(200005/06)14:3<229::AID-CEM587>3.0.CO;2-N.
  • Sidiropoulos et al. [2017] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor Decomposition for Signal Processing and Machine Learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, July 2017. ISSN 1941-0476. doi: 10.1109/TSP.2017.2690524.
  • Solin and Särkkä [2020] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2):419–446, Mar. 2020. ISSN 1573-1375. doi: 10.1007/s11222-019-09886-w.
  • Stoudenmire and Schwab [2016] E. M. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 4806–4814, Red Hook, NY, USA, Dec. 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9.
  • Suykens and Vandewalle [1999] J. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293–300, June 1999. ISSN 1573-773X. doi: 10.1023/A:1018628609742.
  • Suykens et al. [2002] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Nov. 2002. ISBN 978-981-238-151-4 978-981-277-665-5. doi: 10.1142/5089.
  • Tjandra et al. [2017] A. Tjandra, S. Sakti, and S. Nakamura. Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4451–4458, May 2017. doi: 10.1109/IJCNN.2017.7966420.
  • Tsanas and Xifara [2012] A. Tsanas and A. Xifara. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49:560–567, 2012.
  • Tucker [1966] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, Sept. 1966. ISSN 1860-0980. doi: 10.1007/BF02289464.
  • Yang et al. [2012] T. Yang, Y.-f. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random fourier features: A theoretical and empirical comparison. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  • Yang et al. [2017] Y. Yang, D. Krompass, and V. Tresp. Tensor-Train Recurrent Neural Networks for Video Classification. In International Conference on Machine Learning, pages 3891–3900. PMLR, July 2017.