Gradient-based Optimization for Regression in the Functional Tensor-Train Format

01/03/2018 ∙ by Alex A. Gorodetsky, et al. ∙ University of Michigan 0

We consider the task of low-multilinear-rank functional regression, i.e., learning a low-rank parametric representation of functions from scattered real-valued data. Our first contribution is the development and analysis of an efficient gradient computation that enables gradient-based optimization procedures, including stochastic gradient descent and quasi-Newton methods, for learning the parameters of a functional tensor-train (FT). The functional tensor-train uses the tensor-train (TT) representation of low-rank arrays as an ansatz for a class of low-multilinear-rank functions. The FT is represented by a set of matrix-valued functions that contain a set of univariate functions, and the regression task is to learn the parameters of these univariate functions. Our second contribution demonstrates that using nonlinearly parameterized univariate functions, e.g., symmetric kernels with moving centers, within each core can outperform the standard approach of using a linear expansion of basis functions. Our final contributions are new rank adaptation and group-sparsity regularization procedures to minimize overfitting. We use several benchmark problems to demonstrate at least an order of magnitude lower accuracy with gradient-based optimization methods than standard alternating least squares procedures in the low-sample number regime. We also demonstrate an order of magnitude reduction in accuracy on a test problem resulting from using nonlinear parameterizations over linear parameterizations. Finally we compare regression performance with 22 other nonparametric and parametric regression methods on 10 real-world data sets. We achieve top-five accuracy for seven of the data sets and best accuracy for two of the data sets. These rankings are the best amongst parametric models and competetive with the best non-parametric methods.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Assesment of uncertainty in a computational model is essential to increasing the credibility of simulation based knowledge discovery, prediction and design. Sources of model uncertainty must be identified and the effect of these uncertainties on the model output (prediction) quantified. The accuracy to which uncertainty can be quantified is limited by the computational resources available. Many applications require vast amounts of computational effort, thereby limiting the number model evaluations that can be used to interrogate the uncertainty in the system behavior. Consequently a significant portion of methods developed for uncertainty quantification (UQ) in recent years have focused on constructing surrogates of expensive simulation models using only a limited number of model evaluations.

Within the computational science community, both parametric and non-parametric function approximation methods have been extensively used for Uncertainty Quantification (UQ). Non parametric Gaussian process models (GP) (Rasmussen and Williams, 2006; O’Hagan and Kingman, 1978) and parametric Polynomial Chaos Expansions (PCE) (Ghanem and Spanos, 1991; Xiu and Karniadakis, 2002)

are arguably the two most popular methods used. Gaussian process regression can be interpreted as a Bayesian method for function approximation, providing a posterior probability distribution over functions. Maximum likelihood estimation and Markov Chain Monte Carlo sampling are the two most popular methods used to characterize the posterior distribution of the GP. Polynomial chaos expansions represent a response surface as a linear combination of orthonormal multivariate polynomials. The choice of the orthonormal polynomials is related to the distribution of the uncertain model variables. Various approaches have been adopted to compute the coefficients of the PCE basis. Approaches include, pseudo-spectral projection

(Conrad and Marzouk, 2013; Constantine et al., 2012)

, sparse grid interpolation

(Ganapathysubramanian and Zabaras, 2007; Nobile et al., 2008), and regression using -minimization (Blatman and Sudret, 2011; Doostan and Owhadi, 2011; Mathelin and Gallivan, 2012). For a comparison between nonparametric GP methods and parametric PCE methods see (Gorodetsky and Marzouk, 2016), and for an attempt to combine the benefits of both approaches see (Schobi et al., 2015).

High-dimensional approximation problems, such as regression, pose challenges for both parametric and nonparametric representation formats. Parametric approaches, for example those using a PCE representation, are limited by their expressivity; and increasing expressivity, for example by increasing the polynomial order, results in the curse of dimensionality for fixed polynomial order. Nonparametric methods, for example Gaussian process regression, have great expressive capabilities. However, they also encounter the curse of dimensionality since their excess risk grows exponentially with dimension

(Györfi et al., 2002).

To counteract these computational burdens for both types of methods, attention has focused on constraining the functional representation to limit the curse of dimensionality. One popular constraint is limiting the model to that of additive separable forms (Hastie and Tibshirani, 1990; Meier et al., 2009; Ravikumar et al., 2008)

(1)

One can also use second order interactions, e.g., , to increase expressivity while maintaining tractability (Kandasamy and Yu, 2016). However, further increasing the number of interaction terms in the ANOVA model (Fisher, 1925)

will still naturally encounter the curse of dimensionality unless adaptive methods that identify the order of interaction interactively are utilized (Ganapathysubramanian and Zabaras, 2007; Ma and Zabaras, 2010; Foo and Karniadakis, 2010; Jakeman and Roberts, 2013; Jakeman et al., 2015).

In this paper, we propose algorithms to improve regression in a functional representation that takes advantage of low-rank structure to mitigate the curse of dimensionality while maintaining high expressivity. Low-rank functional representations are parametric models that enable a wide variety of interactions between variables and can generate high order representations. More specifically, they are continuous analogues of tensor decompositions and exploit separability, i.e., that a function can be represented by the sum of products of univariate functions. One example is the canonical polyadic (CP) (Carroll and Chang, 1970) representation consisting of a sum of products of univariate functions , and the number of free parameters of such a representation scales linearly with dimension. Instead of the CP format, we use a continuous analogue of the discrete tensor train (TT) decomposition (Oseledets, 2011) called the functional tensor-train (FT) (Oseledets, 2013; Gorodetsky et al., 2016) to allow for a greater number of interactions between variables.

Low-rank functional tensor decompositions have been used for regression previously. Existing approaches (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015; Rauhut et al., 2017) implicitely make two simplifying assumptions to facilitate the use of certain algorithms and data structures from the low-rank tensor decomposition literature. Specifically, they assume linear and identical basis expansions for each univariate function of a particular dimension. These approaches convert the problem from one of determining a low-rank function to one of representing the coefficients of a tensor-product basis as a low-rank tensor. Following this conversion, many of these techniques use alternating minimization to determine the coefficients of the FT. Alternating minimization, such as alternating least squares, transforms a nonlinear optimization problem for fitting the parameters of each univariate function to data into a linear problem by only considering a single dimension at a time. Existing approaches either use efficient linear algebra routines to solve the linear system at each iteration (Doostan et al., 2013) or sparsity inducing methods such as the LASSO (Mathelin, 2014). Recently, iterative thresholding has also been used to find low rank coefficients; however, such an approach has been limited to problems with low-dimensionality (Rauhut et al., 2017).

In this paper we take a different approach: we use gradient-based optimization procedures such as quasi-Newton methods and stochastic gradient descent, and we do not restrict ourselves by the assumptions that lead to the consideration of a tensor of coefficients. Overall, our contributions include:

  1. Derivation and computational complexity analysis of a gradient-based fitting procedure that yields more accurate approximations than alternating minimization

  2. Usage of both linear and nonlinear parameterizations of univariate function in each mode to facilitate a larger class of functions than is possible when using solely linear representations, and

  3. Creation of rank-adaptation and regularization schemes to reduce overfitting.

We demonstrate the efficacy of our approach on both synthetic functions and on several real-world data sets used in existing literature to test other regression algorithms (Kandasamy and Yu, 2016). Specifically, we show that gradient-based optimization provides significant advantages in terms of approximation error, as compared with an alternating least squares approach, and that these advantages are especially apparent in the case of small amounts of data and large parameter sizes. To our knowledge, no gradient based procedure has been derived or evaluated for the low-rank functional format that we consider. We also show the benefits of nonlinear representations of univariate functions over the linear representations most frequently found in the literature. Our real-world data results show that our methodology is competitive with both nonparametric and parametric algorithms. We achieve top-five accuracy for seven of the 10 data sets considered and best accuracy for two of the data sets. These rankings are the best amongst parametric models and competitive with the best non-parametric methods. Finally we demonstrate that for some physical systems, exploiting low-rank structure is more effective than exploiting sparse structure which is common approach used for enabling uncertainty quantification of computational models.

1.1 Related Work

As mentioned above, the functional tensor-train decomposition was proposed as an ansatz for representing functions by Oseledets (2013), and computational routines for compressing a function into FT format have been developed before (Gorodetsky et al., 2016). In that setting, an approximation of a black-box function is sought to a prescribed accuracy. A sampling procedure and associated algorithm was designed to optimally evaluate the function in order to obtain an FT representation. In this work, we consider the setting of fixed data.

There has also been some recent work on regression in low-rank format (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015). These approaches rely on linearity between the parameters of the low-rank format and the output of the function. Utilizing this relationship, they convert the low-rank function approximation to one of low-rank tensor decomposition for the coefficients of a tensor-product basis. In Section 3.2 we show how the representation presented in those works can be obtained as a particular instantiation of the formulation we present here.

In spirit, our approach is also similar to the recent use of the tensor-train decomposition within fully connected neural networks

(Novikov et al., 2015)

. There, the layers of a neural network are assumed to be a linear transformation of an input vector, and the weight matrix of the transformation is estimated from data. Their contribution is representing the weight matrix in a compressed TT-format. In this work, our low-rank format can be thought of as an alternative to the linear transformation layer of a neural network. Indeed, future work can focus on utilizing our proposed methodology within a hierarchy of low-rank approximations.

2 Background

In this section we establish notation and provide background for regression and low-rank tensor decompositions.

2.1 Notation

Let be the set of real numbers and be the set of positive integers. Let , and suppose that we are given i.i.d. data such that each datum is sampled from a distribution on a compact space Let be a tensor product space with . Let the marginal distribution of be the tensor product measure , and assume that all integrals appearing in this paper are with respect to this measure. For example let and , then the inner product is defined as Similarly, the norm is defined as

In this paper scalars and scalar-valued functions are denoted by lowercase letters, vectors are denoted by bold lower-case letters such as ; matrices are denoted with upper boldface such as ; tensors are denoted with upper boldface caligraphic letters such as ; and matrix-valued functions are denoted by upper caligraphic letters such as An ordered sequence of elements of the same set are distinguished by parenthesized superscripts such as or .

2.2 Supervised learning

The supervised learning task seeks a function

that minimizes the average of a cost function

(2)

For example, the standard least squares objective is specified with . In this work, the cost functional cannot be exactly calculated; instead, data is used to estimate its value. The cost functional then becomes a sum over the data instead of an integral

(3)

where we have reused the notation , and further references to will use this definition. To obtain an optimization problem over a finite set of variables, the search space of functions must be restricted. Nonparametric representations generally allow this space to vary depending upon the data, and parametric representations typically fix the representation. In either case, we denote this space as to seek an optimal function such that

(4)

One example of a common function space is the reproducing kernel Hilbert space, and resulting algorithms using this space include Gaussian process regression or radial basis function interpolation. Other examples include linear functions (resulting in linear regression) or polynomial functions. In this work, we consider a function space that incorporates all functions with a prescribed

rank, which will be defined in the next section.

When solving the supervised learning problem (4) it is often useful to introduce a regularization term to minimize overfitting. In this paper we will consider the following regularized learning problem

(5)

where denotes a scaling factor and is a functional that penalizes certain functions in . For example, the function penalizes functions that have large norms, and such a penalty has been used for a certain type of low-rank functional approximation before (Doostan et al., 2013). In this work, we impose a group sparsity type regularization that has been shown to improve the prediction performance in other approximation settings (Turlach et al., 2005; Yuan and Lin, 2006). Such an approach seeks to increase parsimony by reducing the number of non-zero elements in an approximation. In Section 4.1 we describe what this type of constrains means in the context of low-rank functional decompositions.

2.3 Low-rank tensor decompositions

The function space that we use to constrain our supervised learning problem is related to the concept of low-rank decompositions of multiway arrays, or tensors. Tensor decompositions are multidimensional analogues of the matrix SVD and are used to mitigate the curse of dimensionality associated with storing multivariate arrays (Kolda and Bader, 2009). Consider an array with , this array contains an exponentially growing number of elements as the dimension increases. Tensor decompositions can provide a compact representation of the tensor and have found wide spread usage for data compression and learning tasks (Cichocki et al., 2009; Novikov et al., 2015; Yu and Lie, 2016).

In this work, we consider the tensor-train (TT) decomposition (Oseledets, 2011). Specifically, we use the TT as an ansatz for representating low-rank functions. A TT representation of a tensor is defined by a list of -way arrays called TT-cores where and for . The sizes of the cores are called the TT-ranks. In this format a tensor element is obtained according to the following multiplication

(6)

where are matrices and the above equation describes a sequence of matrix multiplication.

In this format, storage complexity scales linearly with dimension and quadratically with TT-rank. In the next section we discuss how it can be used as a framework for function approximation.

3 Low-rank functions

Low-rank formats for functions can be thought of as direct analogues to low-rank tensor decompositions. In this work, we focus on aa TT-like decomposition of functions called the functional tensor-train (FT) (Oseledets, 2013).

(7)

where , and for single-output functions. A more compact expression is obtained by viewing a function value as a set of products of matrix-valued functions

(8)

where each matrix-valued function is called a core and can be visualized as an array of the univariate functions

(9)

If each univariate function is represented with parameters and for all , then the storage complexity scales as . Comparing this representation with (6) we see a very close resemblence between the TT cores and the FT cores. Indeed they are both matrices when indexed by a discrete index for the TT or a continuous index for the FT. We describe a closer relationship between the TT and the FT in the context of low-rank representations of functions in the next section.

3.1 Parameterizations of low-rank functions

An FT core is comprised of sets of univariate functions as shown in Figure 1. Each set of univariate functions, contains all of the parameters of the associated univariate functions. As a result the full FT is parameterized through the parameterization of its univariate functions. Let denote the number of parameters describing Let denote the vector of parameters of all the univariate functions. Then, there are a total of parameters describing the FT, i.e., .

The parameter vector is indexed by a multi-index where corresponds to an input variable, and correspond to a univariate function within the th core, and corresponds to a specific parameter within that univariate function. In other words, we adopt the convention that refers to the th parameter of the univariate function in the th row and th column of the th core.

The additional flexibility of the FT representation allows both linear and nonlinear parameterizations of univariate functions. A linear parameterization represents a univariate function as an expansion of basis functions according to

(10)

Linear parameterizations are often convenient within the context of gradient-based optimization methods since the gradient with respect to a parameter is independent of the parameter values. Thus, one only needs to compute and store the evaluation of the basis functions a single time. Nonlinear parameterizations are more flexible and general, but often incur a greater computational cost within optimization. One example of a nonlinear parameterization is that of a Gaussian kernel, which can be written as

(11)

where Here, the first parameters refer to the coefficients of radial basis functions and that second half of the parameters refer to the centers.

Figure 1: Object-oriented hierarchy for storing low-rank functions. Parameters are represented by red squares and functions are represented in blue.

3.2 Low-rank functions vs. low-rank coefficients

The functional tensor train can be used by independently parameterizing the univariate functions of each core, and both linear and nonlinear parameterizations are possible. As described below, the advantage of this representation includes a naturally sparse storage scheme for the cores. Another advantage is the availablitiy of efficient computational algorithms for multilinear algebra that can adapt the representation of each univariate function individually as needed (Gorodetsky et al., 2016; Gorodetsky, 2017) in the spirit of continuous computation pioneered by Chebfun (Platte and Trefethen, 2010).

Although the functional form of the TT is general, most of the literature makes two simplifying assumptions (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015):

  1. a linear parameterization of each , and

  2. an identitical basis for the functions within each FT core, i.e., and for all , and .

These assumptions transform the problem of learning low-rank functions to the problem of learning low-rank coefficients. This transformation of the problem allows one to facilitate the use of discrete TT algorithms and theory. We will refer to representations using these assumptions as a functional tensor-train with TT coefficients or FT-c, where “c” stands for coefficients.

The FT-c representation stores the coefficients of a tensor-product basis for all and in TT format. Let be a tensor of the following form

(12)

for . Function evaluations can be obtained from the from coefficients stored in TT format by performing the the following summation

(13)

From (8), (12), and (13) we can see that the relationship between the TT cores and the FT cores is

(14)

where the basis function multiplies every element of the tensor. In other words, the TT cores represent a TT decomposition of the coefficient tensor of a tensor-product basis.

The two assumptions required for converting the problem from one of low-rank function approximation to low-rank coefficient approximation often result in a computational burden in practice. The burden of the first assumption is clear, some functions are more compressible if nonlinear parameterizations are allowed. The second assumption can also limit compressibility, but also results in a larger computational and storage burden. One example in which this burden becomes obvious is when attempting to represent additively separable functions, e.g., Equation (1), in the FT format. As mentioned in the introduction, this representation is common for high-dimensional regression, and an FT can represent this function using cores that fave the following rank 2 structure.

Suppose that each of the univariate functions can be represented with parameters and the constants and can be stored with a single parameter. Then, the storage requirement is parameters for the middle cores and for the outer cores, totaling floating point values. In the TT case, the and terms must be stored with the same complexity as the other terms, since the TT is a three-way array. Thus, the total number of parameters becomes . Almost four times less storage is required for the FT format than the TT format, in the limit of large and . Essentially, the TT format does take into account the sparsity of the cores. In this case, one can think of the FT format as storing the TT cores in a sparse format. This burden is exacerbated if we add interaction terms to Equation 1.

4 Low-rank supervised learning

In this section, we incorporate the FT model as a constraint in the supervised learning task. We discuss issues surrounding optimization algorithms, regularization, and choosing the TT-ranks. In particular we discuss three optimization algorithms for fitting fixed-rank models: batch gradient descent, alternating least squares, and stochastic gradient descent. We also present approaches for minimizing over-fitting. Specifically we discuss a group-sparsity-inducing regularization regularization scheme and hyper-parameter estimation scheme using cross validation and a rank-adapation.

4.1 Low-rank constraints and regularization

The function space in Equation (4) constrains the search space. The space constrained by low-rank functions is defined as follows. Let such that for . Then

denotes the space of functions with FT ranks Note that this function space also includes functions with smaller ranks. For example, a function that differs in ranks such that and can be obtained by setting all univariate functions, aside from the top left block, to zero; analogous modifications must be made to all other cores.

In this paper we also add a regularization to attempt to minimize the number of nonzero functions in each core. The structure of the FT-cores in Equation (9) admits a natural grouping in terms of the univariate functions. Setting a univariate function to zero inside of a core essentially lowers the number of interactions allowable between univariate functions in neighbouring dimensions, with the overall effect beign a smaller number of sums in the representation of the function in Equation (7). Minimizes the number of nonzero univariate functions not only produces a more parsimonious model, but also minimizes the number of interactions between function variables.

Specifically, we penalize the regression problem using the sum of the norms of the univariate functions

(15)

Note that this regularization method is different from that first proposed in (Mathelin, 2014) where the FT-c representation was used and the TT-cores of the coefficient tensor themselves were forced to be sparse. In our case we do not look for sparse coefficients, instead our goal is to increase the number of functions that are identically zero. As a surrogate for this goal we follow the standard practice of replacing the non-differential “” norm by the another norm, in our case a sum of the norms of the functions. This replacement also seeks to directly minimize the number of interacting univariate functions and also provides a differentiable objective for optimization.

4.2 Hyper-parameter estimation using cross validation and rank adaptation

A careful selection of the number of parameters in each univariate function, the rank of each core, and the Lagrangian multiplier in the regularized learning problem is required to avoid over fitting. For example if a polynomial basis is chosen, setting the degree too high can result in spurious oscillations. Cross validation provides a mechanism to estimate hyper-parameters. In this paper we use 20 fold cross validation to select the hyper-parameters of an FT that minimizes the expected prediction error.

Alternatively, we can design a more effective rank adaptation scheme by combining cross validation with a rounding procedure. The problem with the simplest cross validation option is that a scheme that optimizes separately over each imposes a computational burden, and a scheme that optimizes for a single rank across all cores, i.e., does not allow enough flexibility. Instead, we propose to combine FT rounding (Gorodetsky et al., 2016) and cross validation to avoid overfitting.

Rounding is a procedure to reapproximate an FT by one with lower ranks to a given tolerance. The tolerance criterion can be thought of as a regularization term, since it limits the complexity of the representation and will be investigated in furture work. The full rank adaptation scheme is provided by Algorithm 1. In that algorithm refers to a function that provides a cross-validation error for an optimization over the space , and the function provides the ranks of a rounded function with a particular tolerance .

0:  Rounding tolerance
1:  
2:  
3:  while not converged do
4:     for  do
5:        
6:     end for
7:     
8:     
9:     if  then
10:        
11:        break
12:     end if
13:     
14:     
15:     
16:     if  for all  then
17:        break
18:     else
19:         for
20:     end if
21:  end while
Algorithm 1 Rank adaptation

The scheme increases ranks until either the cross-validation error increases or until the rounding procedure decreases every rank. The first termination criterion targets overfitting, and the second termination criterion seeks to limit rank growth when data is no longer informative enough to necessitate the increase in expressivity.

4.3 Optimization algorithms

In this section, we overview three gradient-based optimization algorithms for solving supervised learning problems in low-rank format: alternating minimization, gradient decent and stochastic gradient decent. The computational complexity of these algorithms in analyzed in Section 5.

4.3.1 Alternating minimization

Alternating minimization methods, mainly alternating least squares, are the main avenue for optimization in low-rank tensor and functional contexts. These routines are typically used within tensor decompositions by sequentially sweeping over dimensions and optimizing over parameters of a single dimension. Such approaches are popular for compressing multiway arrays or for tensor completion problems (Savostyanov and Oseledets, 2011; Grasedyck et al., 2015).

For the case of supervised learning, their usage is straightforward. The idea is to solve a sequence of optimization problems, where the functional space is further constrained by fixing all-but-one FT core, . After optimizing over the th core, that core is fixed and optimization over the next one is performed. This algorithm is provided by Algorithm 2. This algorithm performs sweeps over all dimensions until convergence is obtained. For details on convergence of this algorithm we refer to, e.g., Uschmajew (2012). We will provide more details about the implementation and complexity of this algorithm in Section 5.

0:  Parameterized low-rank function space, ; Initial FT cores for ; Data, ; Objective function, ; Convergence tolerence,
1:  
2:  
3:  while not converged do
4:     for  do
5:        
6:     end for
7:     
8:     if   then
9:        break;
10:     end if
11:     
12:  end while
Algorithm 2 Alternating minimization for low-rank supervised learning

4.3.2 Batch gradient methods

Gradient descent directly minimizes the cost function with a batch gradient (or second-order) based procedure. While this is a standard optimization approach, we will refer to this algorithm as all-at-once (AAO) to distinguish it from the alternating minimization. Gradient-based procedures have been shown to be promising in the context of canonical tensor decompositions and gradient descent (Acar et al., 2011) and TT decompositions in the context of iterative thresholding (Rauhut et al., 2017), but have not been explored well for low-rank functional approximation.

Gradient descent updates parameters according to

for More generally, we use a quasi-Newton method to choose a direction such that the update becomes

In the examples in Section 6, we use the limited-memory BFGS (Liu and Nocedal, 1989) method available in the library111github.com/goroda/Compressed-Continuous-Computation to perform this update.

One disadvantage of this approach that is often cited is that it involves solving an optimization problem whose size is the number of parameters in the function. However, we note that the number of parameters scales as so for a one hundred dimensional, rank 5 problem with we have unknowns. This number of unknowns is well within the capabilities of modern hardware and optimization techniques.

4.3.3 Stochastic gradient descent

Stochastic gradient descent (SGD) is often used instead of gradient decent when using large data sets. SGD aims to minimize objective functions of the form

which includes the least-squares objective (3). The objective function is updated using only one data point (batches can also be used), which is chosen randomly at each iteration

(16)

Many variations on stochastic gradient descent have been developed. We refer the reader to (Bottou et al., 2016) for more information. These variations often include adaptive strategies for choosing the learning rate . Such methods have previous been applied in the context of tensors to minimize computation costs when dealing with large scale data (Huang et al., 2015).

We will use the adaptive strategy from ADAM (Diederik and Jimmy, 2014), as implemented in the library, to demonstrated the effectiveness of SGD in Section 6.

5 Gradient computation and analysis

The evaluation of the gradients of the FT with respect to its parameters is essential for the making gradient-based optimization algorithms tractable. Almost all existing literature for tensor approximation uses alternating minimization strategies because the subproblems are convex and can be solved exactly and efficiently with standard linear algebra algorithms, though they have limited (if any) guarantees for obtaining a good minimizer of the full problem. One of the major contributions of this paper is the derivation and analysis of the computational complexity of computing gradients of the supervised learning objective (4).

The gradient of the least squares objective function with respect to a parameter is

(17)

where denotes a multi-index for a unique parameter of the th core, see Section 3.1. Efficient computation of the partial derivative, is essential for making gradient based optimization algorithms tractable. Letting

(18)
(19)

denote the products of the cores before and after the th, respectively, and combining Equations (8), (18), and (19) we obtain the following expression

(20)

Thus efficiently computing gradients of the objective function and the FT requires efficient algorithms for evaluation of the left and right product cores and and the partial derivative .

In the following sections we describe and analyze the gradient of the FT with respect to its parameters. We first describe the partial derivatives of an FT with respect to the parameters of a single core. Then we present an efficient algorithm for computing left and right product cores. Finally, we discuss the computation of gradients of the full FT and the objective function.

5.1 Derivatives of an FT core

In this section we discuss how to obtain the partial derivatives with respect to parameters of a specific core. Without loss of generality , we will make the following assumption to ease the notational burden

Assumption 1.

Each univariate function in each FT-core (9) is independently parameterized with parameters so that the FT core is parameterized with parameters.

Under this assumption the FT cores have the following structure.

Proposition 1.

Let denote the partial derivative for some . Under Assumption 1, is a sparse matrix with the following elements

(21)

for , and .

Now if we let denote the number of operations required to compute the gradient of a univariate function with respect to all of its parameters then the cost of computing is operations.

5.2 Evaluating left and right product cores

Computation of the partial derivative of the FT (20) requires the evaluation of the left and right product cores and . A single forward and backward sweep the cores can be used to obtain these values using the following recursion identities

Thus we can obtain the gradient of the FT with respect to parameters of the th core using the Algorithms 3 and 4.

0:  Number of rows, ; Number of columns ; Number of parameters per univariate function ; FT core ; Left multiplier
1:  for  do
2:     for  do
3:        for  do
4:           
5:        end for
6:     end for
7:  end for
Algorithm 3 coregrad-left: Generate intermediate result for gradient computation
0:  Number of rows, ; Number of columns ; Number of parameters per univariate function ; Intermediate result ; Right multiplier
1:  for  do
2:     for  do
3:        for  do
4:           
5:        end for
6:     end for
7:  end for
Algorithm 4 coregrad-right: Update intermediate gradient result with multiplier from the right

These two algorithms consist of a triple nested loop. The innermost loop of coregrad-left requires computing the gradient of a univariate function with respect to its parameters, and multipling each element of the gradient by the left multiplier. Thus if then operations are needed for the gradient and products between floating point operations are needed. Finally, the partial derivative with respect to each parameter of the core is stored with a complexity of floating point numbers. Each of these partial derivatives is updated in coregrad-right for a computational cost of and no additional storage. Since each of these functions is called times, the additional computational cost they incur is and the additional storage complexity they incur is .

5.3 Gradient of FT and objective functions

Th entire gradient of the FT can be obtained with a single forward and backward sweep as presented in Algorithm 5. In Algorithm 5 we use to denote the partial derivative of with respect to the parameters of the th core. Each step of the forward sweep requires: (i) creating an intermediate result for the gradient of the FT with respect to each parameter of the core in line 4 of Algorithm 3, (ii) evaluating and storing the core of the FT in line 6 of Algorithm 5, and (iii) updating the product of the cores through the current step in line 7 of Algorithm 5). The backward sweep involves involves updating the the intermediate gradient result obtained from the forward sweep (line 4 in Algorithm 4), and then updating the product of the cores in line 14 of Algorithm 5.

0:  Parameters ; Evaluation location
1:  {Generate intermediate results during forward sweep}
2:  
3:  
4:  for  do
5:     
6:     
7:     
8:  end for
9:   {Evaluation of the FT}
10:  {Update intermediate results during backward sweep}
11:  
12:  for  do
13:     
14:     
15:  end for
16:  
Algorithm 5 FT evaluation and gradient
Proposition 2.

Let be a rank FT with every univariate function in Equation (9). Assume Assumption 1 and that for we have for some . Let denote the number of operations required to evaluate a univariate function. Let denote the number of operations required to compute the gradient of a univariate function with respect to all of its parameters. Then, evaluating and computing the gradient with respect to its parameters requires

operations, and storing floating point values.

Proof.

To demonstrate this result, we can calculate the number of evaluations and storage size required for each step of Algorithm 5. First note that Lines 2 and 6 involve the evaluation of each FT core. Since each core has at most univariate functions these evaluations require operations and the ability to store floating point numbers. Since this evaluation has to happend for each of the cores the computational complexity of this step is . The storage space can be reused and only requires storing two vectors of size () and an matrix () for a total storage complexity of floating point numbers. Next we note that a product between a vector and an matrix in lines 7 and 14 needs to occur times and therefore requires operations. Thus apart from the calls to coregrad-left and coregrad-right we have a computational complexity of and a storage complexity of .

Combining these costs with the cost of the coregrad algorthims presented in Section 5.2 obtain the stated result. ∎

The values and are dependent upon the types of parameterizations of univariate functions. Consider two examples: one linear and one nonlinear. For both parameterizations we consider kernel basis functions; however, for the nonlinear parameterization we will consider the centers of each kernel as free parameters. The linear parameterization is given by Equation (10) with , where and are the centers of the kernel such that each univariate function of the th dimension is represented with as a sum of kernels with the same locations. If the evaluation of the exponential takes a constant number of operations with respect to then we have and because the the gradient of the univariate function with respect to its th parameter is

(22)

This gradient is independent of any other parameters. In practice this means that it can be precomputed at each location and recalled at runtime. In such a case the storage increases to numbers if each univariate function in each core has a different parameterization. If the univariate functions of each core share the same parameterization then the additional storage cost is . In either case the online cost becomes a simple lookup, i.e., .

The nonlinear parameterization provided in Equation (11) is different because we are free to optimize over the centers. In this case the gradient of each univariate function becomes The gradient with respect to the second half of the parameters now depends on the parameter value

The parameter dependence of the gradient increases the complexity of optimization algorithms since precomputation of the gradients cannot be performed. In this case we have and .

5.4 Summary of computational complexity

The complexity of evaluating the gradient of the least squares objective function

is dominated by the cost of this derivative is evaluating the gradient of at points. Consequently, the total complexity is times that provided by Proposition 2, for a total cost of operations. The storage cost need not increase because one can evaluate the sum by sequentially iterating through each sample.

Now that we have described the computational complexity of the gradient computation, we summarize the computational complexity of the proposed optimization algorithms. Table 1 shows the computational complexity of the all-at-once optimization scheme when using a low-memory BFGS solver. Three parameterizations are considered a linear parameterization with identical parameterizations of each univariate function in a particular core, a more general parameterization with varying linear parmaeterizations within each core and, and the most general case of nonlinear parameterization of each function. We see that using linear parameterizations allows us to precompute certain gradients before running the optimizer. This precomputation reduces the online cost of optimization. The computational complexity of the full solution is dominated by the number of evaluations and gradients of the objective function, and we denote this number as .

Parameterization type Offline cost/storage Eval. and grad. () Full solution
Shared/linear /
General/linear /
nonlinear N/A
Table 1: Computational complexity of all-at-once optimization.

The computational cost of stochastic gradient descent is given in Table 2

. In this case the cost per epoch (once through all of the training points) is the same as a single gradient evaluation in the all-at-once approach. The total cost of such a scheme is dominated by how many samples are used during the optimization procedures. In the associated table, we represent this number as the number of times one requires iterating through all of the samples

. A fully online algorithm would not have any associated offline cost and its complexity would be equivalent to the nonlinear parameterization case.

Parameterization type Offline cost/storage Eval. and grad. per sample () Full solution
Shared/linear /
General/linear /
nonlinear N/A
Table 2: Computational complexity of stochastic gradient descent with ADAM.

Finally, the alternating optimization scheme is of a slightly different nature. In the case of linear parameterization, each sub-optimization problem is quadratic and can be posed as solving a linear system by setting the right hand side of Equation (17) to zero. This linear system has equations and unknowns and its solution, using a direct method, has an asymptotic complexity of . We also need to introduce a new constant called that represents how many sweeps through all of the dimensions are required. We see that in the most general case, this algorithm is times more expensive than all-at-once. However, this number is a bit desceptive since each sub problem has only unknowns and therefore we can assume that . In Table 3 we summarize the costs of this algorithm.

Param type Offline cost/storage Solution of sub-problem () Full solution
Shared/linear /
General/linear /
nonlinear N/A
Table 3: Computational complexity of alternating (non)-linear least squares.

The summaries above were provided for the pure least squares regression problem. If we consider the regularization term of Equation (15) then an additional step must be performed. The gradient of the functional requires computing the gradient of each of the summands, which can be written as

for . In other words, to compute the gradient of every element in the sum in Equation (15), one computes the inner product between the original function and the function representing the partial derivative of this function. In the case of linear parameterizations described above, the partial derivative is simply a scalar and only the integral of the univariate function needs to be computed. Alternatively the full inner product must be evaluated; however we note that the evaluation of this integral is often analytical or available in closed form based upon the type of function. For example the cost is for orthonormal basis functions and ) more generally, to obtain the gradient with respect to every parameter of a univariate function.

6 Experiments

In this section, we provide experimental validation of our approach. Our validation is performed on a synthetic function, approximation benchmark problems, and several real-world data sets. The synthetic examples are used to show the convergence of our approximations with increasing number of samples using the various optimization approaches. We also include comparisons between both the nonlinear parameterized FT and the linearly parameterized FT-c representations. Furthermore, we show the effectiveness of our rank adaptation approach. The real-world data sets indicate the viability of FT-based regression for a wide variety of application areas and indicates pervasiveness of low-rank structure. The Compressed Continuous Computation () library (Gorodetsky, 2014-2015) was used for all of experiments and is publicly available.

6.1 Comparison of optimization strategies

We first compare the convergence, with the number of samples, of three optimization algorithms. We use three synthetic test cases with known rank and parameterization. The first two functions are from a commonly used benchmark database (Surjanovic and Bingham, 2017). The third function is a FT-rank 2 function that is commonly used to demonstrate the performance of low-rank algorithms.

The first function is six dimensional and called the OTL circuit function. It is given by

(23)

with and variable bounds , , , , and . This function has a decaying spectrum due to the complicated sum of variables in the denominators, and it provides and an important test problem for our proposed rank adaptation scheme. Before exploring rank-adaptation first we explore the effectiveness of the three optimization algorithms in the context of fixed rank and number of univariate parameters . Specifically we compute the relative squared error, using 10000 validation samples, for increasing number of training samples. We use a stopping criterion of

for the difference between objective values for the gradient based techniques and for the difference between functions of successive sweeps for ALS. Though we have found that the results for these functions are not too sensitive to this tolerance. We solve 100 realizations of the optimization problem for each sample size and report the median, 25th, and 75th quantiles. The univariate function are parameterized using Legendre polynomials because we are using a domain with uniform measure.

Figure 2 demonstrates that stochastic gradient descent and the all-at-once approach tend to outperform alternating least squares. This performance benefit is greater in regions of small sample sizes and large number of unknowns (large and ). In the case of the largest number of unknowns in Figure 1(c) our gradient based methods obtain an error that is several orders of magnitude lower error than the error obtained by ALS.

(a)
(b)
(c)
Figure 2: Median, 25th, and 75th quantiles of relative error over 100 realizations of training samples for the OTL Circuit (23).

Next we consider two cases in which the rank is known. The wing weight function is ten dimensional and given by

(24)

with variable bounds , , , , , , , , , . Using the variable ordering above the rank of this function is . The results in Figure 3 indicate the same qualitative performance of the three optimizaiton methods. However, the difference in this case is that the SGD is not significantly better than all-at-once (AAO) approach for low sample sizes. In the third panel we see that SGD levels off around a relative squared error of . For such small errors we have found it difficult to converge the SGD to smaller errors because of the tuning parameters involved in ADAM. In particular, the final error tolerance becomes sensitive to the choice of initial learning rate.

(a)
(b)
(c)
Figure 3: Median, 25th, and 75th quantiles of relative error over 100 realizations of training samples for the Wing Weight function (24).

The final test function is six dimensional and commonly used for testing tensor approximation because it has a TT-rank of 2 (Oseledets and Tyrtyshnikov, 2010), this function is a Sine of sums

(25)

In Figure 4 we also see that the gradient based approaches are more effective in the case of small number of data sets, but that all achieve essentially the same minimums as the number of samples increase.

(a)
(b)
(c)
Figure 4: Median, 25th, and 75th quantiles of relative error over 100 realizations of training samples for the Sine of sums (25).

6.2 Linear vs nonlinear approximation

Next we compare the FT and FT-c representations with different basis functions. Specifically, using the OTL function (23), we compare kernels at fixed locations and with kernels at optimized locations. For the linear approximation we use 8 kernels with fixed centers, and for the nonlinear approximation we use 4 kernels with moving centers (for a total of for both approximation types). The results of this study are shown in Figure 5. Using AAO optimization we see that, for this problem, the nonlinear parameterization of kernels with moving centers provides a more effective representation. Specifically, we achieve an order of magnitude reduction in error when using the nonlinear moving-center representation.

Figure 5:

Comparison of parameterization with eight kernels with fixed and uniformly distributed centers and four kernels with with optimized centers with rank