Multivariate function approximation to construct a surrogate model for an underlying physical model that is expensive to evaluate is an important topic: it has applications in optimization, reduced order modeling, uncertainty quantification, sensitivity analysis, and many other areas. The first goal of this paper is to provide a multivariate approximation of a function of the form
where is a hypercube, are basis functions for and and are the coefficients. Such a decomposition is described as a summation of separable functions and the existence of such an approximation depends on the properties of (such as regularity). If the functions are analytic, then the error in the best approximation decays geometrically with the degree of the polynomial trefethen2017multivariate ; gass2018chebyshev ; the decay is algebraic if the function is in a Sobolev space griebel2019analysis .
This paper proposes a technique for multivariate function approximation, using a combination of Chebyshev polynomial interpolation using function evaluations at Chebyshev interpolation nodes and compression of the tensor (i.e. multiway array) that contains the function evaluations. We use the Tucker format for the tensor to achieve compression. Since it is a natural fit for the application at hand, we can leverage prior work on randomized low-rank tensor approximationsminster2020randomized and we can exploit the structure in the Tucker format to our computational advantage resulting in new algorithms. Similar functional tensor approximations have been used for trivariate functions (i.e., for ) dolgov2021functional ; hashemi2017chebfun but our approach is not limited to three dimensions. The authors in rai2019randomized also use a functional Tucker approximation but use sparsity promoting techniques to reduce the storage cost involving the core tensor. Several authors have constructed multivariate tensor approximations using the Tensor Train format, rather than the Tucker format gorodetsky2019continuous ; bigoni2016spectral ; however, a detailed comparison between the Tucker and Tensor Train formats is beyond the scope of the paper. In addition to tensor-based methods, there is a technique for multivariate function approximation that uses sparse grid interpolation bungartz_griebel_2004 but it makes slightly different assumptions on the regularity of the functions to be approximated. While there are alternatives to the Tucker format for function approximation, to the best of our knowledge, none of the tensor formats have been explored in the context of kernel approximations, which is the second goal of the paper. For this application, using the Tucker format for the tensor has several advantages, which we exploit throughout this paper.
Kernel methods are used to model pairwise interactions among a set of points. They are popular in many applications in numerical analysis and scientific computing such as solvers for integral equations, radial basis function interpolation, Gaussian processes, machine learning, geostatistics and spatiotemporal statistics, and Bayesian inverse problems. A major challenge in working with kernel methods is that the resulting kernel matrices are dense, and are difficult to store and compute with when the number of interaction points is large. Many fast kernel summation methods have been developed over the years including the Barnes-Hut algorithmbarnes1986hierarchical , Fast Multipole Method greengard1987fast , Fast Gauss Transform greengard1991fast , etc.
Operations with kernel matrices, which are used to model pairwise interactions, can be accelerated by using low-rank matrix approximations. Over the years, there have been many efforts to compute efficient low-rank approximations in the context of kernel methods cambier2019fast ; xu2018low ; chen2018fast ; ying2004kernel ; ye2020analytical ; bebendorf2009recompression . There are two techniques to generating such approximations. In the first approach, the entire matrix is approximated by a global low-rank matrix (e.g., using Nyström approach). In the second approach, the kernel matrix is represented as a hierarchy of subblocks, and certain of the off-diagonal subblocks are then approximated in a low-rank format (e.g., hackbusch1999sparse ; hackbusch2002data ; grasedyck2003construction ; borm2003hierarchical ; hackbusch2015hierarchical
). The subblocks representing the interactions between these sets of points can be compressed using the singular value decomposition (SVD). However, the cost of computing the SVD can be prohibitively expensive when the number of source and target points is large and/or when the low-rank approximations need to be computed repeatedly over several sets of source and target points. Our goal is to avoid these expensive calculations when computing our approximations.
Our approach builds on the black-box fast multipole method (BBFMM) low-rank approximation approach in Fong and Darve fong2009black , which uses Chebyshev interpolation to approximate the kernel interactions. In this method, instead of computing the kernel interactions between every pair of points given, the interactions are only computed between points on Chebyshev grids. This significantly reduces the number of kernel evaluations needed. This work is also similar to the semi-analytic methods in cambier2019fast ; xu2018low which also use some form of Chebyshev interpolation to reduce the number of kernel interactions. Our work differs in the subsequent low-rank approximation techniques, as we use tensor-based methods instead of the matrix techniques presented in these papers.
Contributions and Contents.
This paper provides new algorithms for multivariate interpolation with an application to low-rank kernel matrix approximations. We summarize the main contributions in our paper:
We propose a method for multivariate function interpolation that combines Chebyshev polynomials and tensor compression techniques via the Tucker format. We analyze the error in the approximation which is comprised of the error due to interpolation and the error due to tensor compression. The resulting function approximations can be used as surrogates in a variety of applications including, but not limited to, uncertainty quantification and reduced order modeling.
We develop three novel randomized compression algorithms for computing low multirank approximations to tensors in the Tucker format. The first algorithm employs randomized interpolatory decomposition for each mode-unfolding; we derive error bounds, in expectation, for this compression technique. The second algorithm uses the first approach but for a subsampled tensor, and requires fewer function evaluations. The third algorithm is similar to the first but uses random matrices formed using Kronecker products, and requires less storage and fewer random numbers to be generated. We also provide a detailed analysis of the computational costs. These algorithms are of independent interest to tensor compression beyond computing functional approximations.
We use the multivariate function approximations to develop low-rank approximations to kernel matrices which represent pairwise interactions. The resulting approximations are accurate, as demonstrated by detailed numerical experiments, and the computational costs are linear in the number of source and target points. To our knowledge, this is the first use of tensor-based function approximations to low-rank kernel matrix approximations.
We conclude this section with a brief organizational overview of the paper. In Section 2
, we review the material on Chebyshev interpolation, tensor compression, and randomized matrix algorithms. In Section3, we propose a tensor-based method for approximating multivariate functions, which combines multivariate Chebyshev interpolation with randomized compression algorithms. We develop three randomized algorithms for compressing low-rank tensors and provide insight into the error in the interpolation. In Section 4, we apply the tensor-based algorithms developed in Section 3 to develop low-rank approximations to kernel matrices. We provide a detailed analysis of the computational cost of the resulting algorithms. In Section 5, we perform three sets of numerical experiments: the first highlights the performance of the tensor-based function approximation, the second demonstrates the accuracy of the low-rank approximations to kernel matrices with several different kernels, and the third is an application to Gaussian processes involving low-rank kernel matrices.
In this section, we provide some background on Chebyshev interpolation as well as tensor notation and decompositions. We also discuss the randomized matrix algorithms that will be important components in our algorithms.
2.1 Chebyshev interpolation
We begin by reviewing how to interpolate a function : using Chebyshev interpolation. The first step is to construct the Chebyshev points of the first kind in the interval ; these are given by for . Using the mapping : defined as
we obtain the Chebyshev points in the interval as for . This mapping is invertible, with the inverse . Recall that for a function with , the Chebyshev interpolation polynomial of degree is
where is the Chebyshev polynomial of degree , and are the coefficients of the Chebyshev polynomials. The coefficients can be obtained using the interpolating conditions and the discrete orthogonality of the Chebyshev polynomials. If we define an interpolating polynomial as
then the polynomial can also be expressed as
The idea of interpolation using Chebyshev polynomials can be extented to multivariate functions as we explore in Section 3.1. The accuracy of Chebyshev interpolation has been studied extensively; see for example mason2002chebyshev ; trefethen2013approximation .
2.2 Tensor Preliminaries
We now introduce notation and background information for handling tensors. See kolda2009tensor for more details. A -mode tensor is denoted as with entries for , .
We can reorder the elements of a tensor to “unfold” it into a matrix, and this process is known as matricization or mode-unfolding. A -mode tensor (also called a th order tensor) can be unfolded into a matrix in different ways. For mode-, the mode- fibers111A mode- fiber of a tensor is obtained by holding all indices except the th fixed. For example, the mode-1 fibers of 3rd order tensor would be denoted . of a tensor are arranged to become the columns of a matrix. This matrix is called the mode- unfolding, and is denoted by . We also denote the unfolding of modes through as .
One fundamental operation with tensors is the mode-wise product, a way of multiplying a tensor by a matrix. For a matrix , the mode- product of with is denoted . The entries of this product are
Note that we can also express the mode- product as the product of two matrices, as .
The Tucker form of an -mode tensor , given a target rank , consists of a core tensor and factor matrices , with , such that . For short, we denote this form as .
A popular algorithm for computing a Tucker representation of a tensor is the Higher-order SVD (HOSVD) de2000multilinear . In this algorithm, each mode of a tensor is processed separately and independently. For mode , the factor matrix is computed by taking the first
left singular vectors of, where is the target rank. Then, the core tensor is formed as . The tensor compression algorithms we will present in Section 3 are related to the HOSVD algorithm. This can be computationally expensive for large-scale problems, so we propose new randomized algorithms in Section 3, extending some of the methods proposed in minster2020randomized .
2.3 Randomized Matrix Algorithms
In this section, we review some of the randomized algorithms (for matrices) that will be used in our later algorithms. The first of these is the randomized range finder, popularized in halko2011finding . Given a matrix , this algorithm finds a matrix
that estimates the range of, i.e., . This is accomplished by drawing a Gaussian random matrix , where is the desired target rank, and is an oversampling parameter. We then form the product , which consists of linear combinations of random columns of . We next compute a thin QR of the result, , to obtain the matrix such that is a good approximation to . Taken together, this process gives the approximation . See steps 1-3 in Algorithm 1 for the details.
The randomized range finder can be combined with subset selection (specifically pivoted QR, which we call the randomized row interpolatory decomposition (RRID). This algorithm, similar to randomized SVD with row extraction (halko2011finding, , Algorithm 5.2), produces a low-rank approximation to a matrix that exactly reproduces rows of . After the matrix is computed, we use pivoted QR on to pick out indices of well-conditioned rows of ; denote this by . We then form the matrix , which gives the low-rank approximation to as . In practice, we use the Businger-Golub pivoting algorithm (golub2013matrix, , Algorithm 5.4.1) but in Theorem 3.1 we use the strong rank-revealing QR factorization gu1996efficient , since among the rank-revealing QR algorithms this algorithm has the best known theoretical results. The details of this algorithm are presented in Algorithm 1, and will be used later in our tensor compression algorithms.
3 Tensor-based functional approximation
In this section, we present our tensor-based approach to approximate functions. First, we explain our approach in general, which combines multivariate Chebyshev interpolation and tensor compression techniques to approximate a function . We then propose three tensor compression methods that fit within our framework and the natural application of our approach to kernel approximations. Finally, we provide analysis of the computational cost of our approach as well as analysis of the accuracy.
3.1 Multivariate functional interpolation
Let us denote the Cartesian product of the intervals as
and let be a function that we want to approximate using Chebyshev interpolation. We can expand the function using Chebyshev polynomials as
where are the Chebyshev points of the first kind in the interval for and the interpolating polynomials have been defined in (3). Here we have assumed the same number of Chebyshev points in each dimension for ease of discussion, but this is straightforward to generalize. To match notation with (1), we have and .
For a fixed , we can express this in tensor form as
where the entries of are and
For a new point , we need only to compute the factor matrices . The core tensor need not be recomputed. However, storing the core tensor can be expensive, especially in high dimensions since the storage cost is entries. Therefore, we propose to work with a compressed representation of the core tensor . In (dolgov2021functional, , Section 2.5), the authors argue that for certain functions, the degree of the multivariate polynomial required for approximating a function to a specified tolerance can be larger than the multilinear rank (in each dimension). For these functions, this justifies a compression of the tensor which has a smaller memory footprint but is nearly as accurate as the multivariate polynomial interpolant.
Using the Tucker format, we can approximate where and for . Then,
where for with entries
The compressed Tucker representation can be obtained using any appropriate tensor compression technique (e.g., HOSVD, sequentially truncated HOSVD). However, computing this low-rank approximation using deterministic techniques is computationally expensive and requires many function evaluations. We propose randomized approaches to computing low-multirank Tucker approximations, in order to both avoid the computational costs and reduce the number of function evaluations, associated with non-randomized approaches.
3.2 Tensor compression techniques
We propose three new tensor compression methods in the context of compressing core tensor as defined above. The first uses the Randomized Row Interpolatory Decomposition (RRID), or Algorithm 1, on each mode unfolding of the tensor to obtain a low-rank approximation for each mode. We provide an analysis for this method as well. The second tensor compression method is similar to the first, but works with a subsampled tensor instead, thus reducing the computational cost. Our third method uses the Kronecker product of Gaussian random matrices instead of a single Gaussian random matrix while compressing , which reduces the number of random entries needed. These three methods produce structure-preserving approximations in the Tucker format as described in minster2020randomized ; this means that the core has entries from the original tensor and the decomposition (in exact arithmetic) reproduces certain entries of the tensor exactly.
3.2.1 Method 1: Randomized Interpolatory Tensor Decomposition
The first compression algorithm we present uses a variation of the structure-preserving randomized algorithm that we previously proposed in (minster2020randomized, , Algorithm 5.1) to compress core tensor . In mode , we apply Algorithm 1 with target rank and oversampling parameter to the mode-1 unfolding to obtain the low-rank approximation
has columns from the identity matrix corresponding to the index set. This process is repeated across each mode to obtain the other factor matrices and the index sets . Finally, we have the low multirank Tucker representation
In contrast to (minster2020randomized, , Algorithm 5.1) in which we used sequential truncation, our low-rank approach can be described as the structure-preserving HOSVD algorithm. To analyze the computational cost of this algorithm, assume that the target tensor rank is . First, we discuss the cost of tensor compression. We apply the RRID algorithm to each mode unfolding of size . This costs flops. The total cost is flops.
We now provide a result on the expected error from compressing a tensor using Algorithm 2. Note that this theorem assumes we are using strong rank-revealing QR (sRRQR) in the RRID algorithm instead of pivoted QR. This proof is very similar to the proof of (minster2020randomized, , Theorem 5.1), so we leave the proof to Appendix A.
Let be the rank- approximation to -mode tensor which is the output of Algorithm 2. We use oversampling parameter such that , and strong rank-revealing QR factorization (gu1996efficient, , Algorithm 4) with parameter . Then the expected approximation error is
3.2.2 Method 2: Randomized Interpolatory Tensor Decomposition with Block Selection
Our second compression algorithm aims is a variant of the previous approach, which reduces the number of function evaluations and the computational cost, by working with a subsampled tensor rather than the entire tensor. Recall that we have assumed that the number of Chebyshev grid points in each spatial dimension is the same. Suppose we are given an index set with cardinality representing the subsampled fibers. In the first step of the algorithm, we process mode . We consider the subsampled tensor and apply the RRID algorithm to obtain the matrix and the index set . We then have a low-rank approximation to as
Here has columns from the identity matrix corresponding to the index set . This process is repeated across each mode to obtain the other factor matrices and the index sets . Finally, we have the low-rank Tucker representation
The summary of all steps for the second variation is described in Algorithm 3.
Compared with Method 1, in each mode we are working with the subsampled version of the mode unfolding rather than the entire mode-unfolding. In other words, by setting we can recover Method 1. To subsample, we use the nested property of the Chebyshev nodes xu2016chebyshev . More precisely, let the Chebyshev nodes
Then ; i.e., the -th point is the th point . Therefore, assuming that is a multiple of , to subsample by 1 level we take . To subsample more aggressively, by levels, we take the indices to be and .
Now consider the computational cost. At the first step, the unfolded matrix has dimensions . Applying RRID for each mode requires flops; the total cost for the entire tensor decomposition is therefore . The number of function evaluations are ; in practice, this is much smaller than since we choose .
3.2.3 Method 3: Randomized Kronecker Product
In the third compression algorithm, we are essentially computing RRID along each mode for the tensor compression step; however, the main difference is that random matrix is generated as the Kronecker product of Gaussian random matrices. More specifically, consider independent standard Gaussian random matrices . Here is the target tensor rank and is the oversampling parameter. In mode 1, we form the tensor
and compute the left singular vectors of the mode-1 unfolding to obtain the factor matrix . We call this the Randomized Kronecker Product approach since using the definition of mode products
To obtain the factor matrices , we follow a similar procedure along modes through . Finally, to obtain the core tensor we compute . The rest of the algorithm resembles Method 1 and is detailed in Algorithm 4. The asymptotic cost of Method 3 is the same as Method 1. However, a potential advantage over Method 1 is the reduction in the number of random entries generated. In Method 1, in using a RRID along each mode, we need to generate random numbers, whereas in Method 3, we only need to generate random numbers.
3.3 Error analysis
We now analyze the accuracy of the proposed algorithmic framework. Since there are several approximations (interpolation error, and errors due to tensor compression) involved, it is difficult to give sharp estimates of the error, but we aim to provide some insight into the error analysis.
We want to first understand the effect of the tensor compression on the overall approximation error. To this end, we use the triangle inequality to bound
The first term in the error is due to Chebyshev polynomial interpolation and can be bounded using classical results from multivariate approximation theory. Assuming that the function satisfies the assumptions in (gass2018chebyshev, , Theorem 2.2), then by Corollary 2.3 in that same paper , where and are positive constants with . To bound the second term, we use the Cauchy-Schwartz inequality to obtain
We then use the vector inequality ; the terms can also be bounded using results from interpolation; since the terms are the same as the Lagrange basis functions defined at the appropriate interpolation nodes, we can bound
where is the Lebesgue constant due to interpolation at the nodes . For Chebyshev interpolating points, as is the case in our application, it is well known that (see (trefethen2013approximation, , Chapter 15)) for the Chebyshev points for . This gives , using which we can derive the error bound
The important point to note is that there are two sources of error: the error due to Chebyshev interpolation which decreases geometrically with increasing number of Chebyshev nodes, and the error due to tensor compression is which is amplified by the term . In numerical experiments, we have observed that the error due to the tensor compression was sufficiently small, so the amplification factor did not appear to have deleterious effects.
It is desirable to have an adaptive approach that produces an approximation for a given accuracy; we briefly indicate a few ideas to develop an adaptive approach. However, we do not pursue the adaptive case here further and leave it for future work. If the number of Chebyshev points is not known in advance, an adaptive approach can be derived that combines the methods proposed here with the computational phases as described in hashemi2017chebfun ; dolgov2021functional . We can start with a small estimate for , we can use the fact that the Chebyshev points of the first kind are nested. We can start with a small value of (say ), and increase by multiples of until a desired accuracy is achieved (by evaluating the accuracy on a small number of random points); since the points are nested, this avoids recomputing a fraction of the function evaluations. If the target rank is not known in advance, we can use the techniques described in (minster2020randomized, , Section 4) to estimate the target rank such that the low-rank decomposition satisfies a relative error bound involving a user-specified tolerance.
4 Application: Low-rank Kernel approximations
In this section, we use the functional tensor approximations developed in Section 3 to compute low-rank approximations to kernel matrices.
4.1 Problem Setup and Notation
Rank-structured matrices are a class of dense matrices that are useful for efficiently representing kernel matrices. In this approach, the kernel matrix is represented as a hierarchy of subblocks, and certain off-diagonal subblocks are approximated in a low-rank format. There are several different types of hierarchical formats for rank-structured matrices, including -matrices, -matrices, hierarchically semi-separable (HSS) matrices, and hierarchically off-diagonal low-rank (HODLR) matrices. For more details on hierarchical matrices, see hackbusch1999sparse ; hackbusch2002data ; grasedyck2003construction ; borm2003hierarchical ; hackbusch2015hierarchical ; sauter2010boundary . If the number of interaction points is , then the matrix can be stored approximately using entries rather than entries, and the cost of forming matrix-vector products (matvecs) is flops, where are nonnegative integers depending on the format and the algorithm for computing the rank-structured matrix. The dominant cost in working with rank-structured matrices is the high cost of computing low-rank approximations corresponding to off-diagonal blocks, which we address here using novel tensor-based methods.
Before showing how to compute these off-diagonal blocks efficiently, we define the problem setup that we will use in the rest of this chapter. Let and be and source and target points, respectively. These points are assumed to be enclosed by two boxes, , where is the dimension, defined as
such that and . These bounding boxes are plotted in Figure 1 for visualization in two spatial dimensions (i.e., ).
Then, for the kernel , let the interaction matrix be defined as
We can only represent blocks in low-rank form if those blocks are admissible. Note that the diameters of the bounding blocks and are
We define the distance between these blocks as the minimum distance between any two points from and . Specifically,
The domains and are considered strongly admissible if they satisfy
where is the admissibility parameter. This means two sets of points are strongly admissible if they are sufficiently well-separated. Two blocks are weakly admissible if there is no overlap between them. Typically, we will assume that the blocks and are either strongly or weakly admissible. However, in Section 5.3 we discuss how to compute a global low-rank approximation assuming only that and that the kernel is sufficiently smooth (and that the blocks are not admissible).
Given points and , we seek to compute the pairwise kernel interactions between the sets of points, i.e., . This is expensive both from a computational and storage perspective. If we simply compute the kernel interaction for each pair of points, this requires kernel interactions and the cost of storage is . As and can get quite large, computing can become computationally challenging. Assuming that the points are well-separated so that the kernel interactions between the source and targets can be treated as a smooth function, the matrix can be approximated in a low-rank format. However, computing this low-rank approximation is computationally expensive. To this end, our goal is to compute an approximation to with a cost that is linear in and , which we accomplish using multivariate Chebyshev interpolation.
4.2 Low-rank approximation of
Suppose we have the boxes denoting the bounding boxes for the sources and the targets respectively that are well-separated so that the kernel function can be viewed as a smooth function of the form
That is, we identify and for . Similarly, and for . This allows us to approximate the function , and, therefore, the kernel by a multivariate Chebyshev polynomial approximation
following the notation in Subsection 3.1 for . We can use the multidimensional Chebyshev approximation to obtain a low-rank representation for as follows.
We are now given the set of points and . Let us define the sequence of matrices and such that
for . Let define the row-wise Khatri-Rao product. Using these sequences, we define the following factor matrices to be the row-wise Khatri-Rao product of the matrices and in reverse order
This gives the low-rank approximation (see Figure 3 for a visual representation)
where is an unfolding of the tensor . The storage cost of this representation is , assuming and are stored in terms of factors , and are not formed explicitly.
Compressed low-rank representation
Suppose we compress using the methods discussed in Section 3 to obtain with multirank (here ), we can write the kernel approximation as