1 Introduction
Assessment of uncertainty in a computational model is essential to increasing the credibility of simulation-based knowledge discovery, prediction, and design. Sources of model uncertainty must be identified, and the effect of these uncertainties on the model output (prediction) must be quantified. The accuracy to which uncertainty can be quantified is limited by the available computational resources. Many applications require vast amounts of computational effort, thereby limiting the number of model evaluations that can be used to interrogate the uncertainty in the system behavior. Consequently, a significant portion of the methods developed for uncertainty quantification (UQ) in recent years have focused on constructing surrogates of expensive simulation models using only a limited number of model evaluations.
Within the computational science community, both parametric and nonparametric function approximation methods have been extensively used for UQ. Nonparametric Gaussian process (GP) models (Rasmussen and Williams, 2006; O'Hagan and Kingman, 1978) and parametric polynomial chaos expansions (PCE) (Ghanem and Spanos, 1991; Xiu and Karniadakis, 2002) are arguably the two most popular approaches. Gaussian process regression can be interpreted as a Bayesian method for function approximation, providing a posterior probability distribution over functions. Maximum likelihood estimation and Markov chain Monte Carlo sampling are the two most popular methods used to characterize the posterior distribution of the GP. Polynomial chaos expansions represent a response surface as a linear combination of orthonormal multivariate polynomials, where the choice of the orthonormal polynomials is related to the distribution of the uncertain model variables. Various approaches have been adopted to compute the coefficients of the PCE basis, including pseudospectral projection
(Conrad and Marzouk, 2013; Constantine et al., 2012), sparse grid interpolation
(Ganapathysubramanian and Zabaras, 2007; Nobile et al., 2008), and regression using $\ell_1$-minimization (Blatman and Sudret, 2011; Doostan and Owhadi, 2011; Mathelin and Gallivan, 2012). For a comparison between nonparametric GP methods and parametric PCE methods see (Gorodetsky and Marzouk, 2016), and for an attempt to combine the benefits of both approaches see (Schobi et al., 2015).

High-dimensional approximation problems, such as regression, pose challenges for both parametric and nonparametric representation formats. Parametric approaches, for example those using a PCE representation, are limited by their expressivity; increasing expressivity, for example by raising the polynomial order, incurs the curse of dimensionality because the number of basis terms grows exponentially with dimension. Nonparametric methods, for example Gaussian process regression, have great expressive capabilities. However, they also encounter the curse of dimensionality, since their excess risk grows exponentially with dimension
(Györfi et al., 2002).

To counteract these computational burdens for both types of methods, attention has focused on constraining the functional representation to limit the curse of dimensionality. One popular constraint restricts the model to additively separable forms (Hastie and Tibshirani, 1990; Meier et al., 2009; Ravikumar et al., 2008)

$f(x_1, \ldots, x_d) = f_1(x_1) + f_2(x_2) + \cdots + f_d(x_d).$ (1)
One can also use second-order interactions, e.g., terms of the form $f_{ij}(x_i, x_j)$, to increase expressivity while maintaining tractability (Kandasamy and Yu, 2016). However, further increasing the number of interaction terms in the ANOVA model (Fisher, 1925) will still naturally encounter the curse of dimensionality unless adaptive methods that identify the order of interaction are utilized (Ganapathysubramanian and Zabaras, 2007; Ma and Zabaras, 2010; Foo and Karniadakis, 2010; Jakeman and Roberts, 2013; Jakeman et al., 2015).
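The additive structure in Equation (1) can be sketched in a few lines. This is our own generic illustration (not code from the paper), with arbitrary quadratic choices for the univariate terms; the point is that the number of terms grows only linearly with dimension.

```python
import numpy as np

def additive_model(x, univariate_fns):
    """Evaluate an additively separable model f(x) = f_1(x_1) + ... + f_d(x_d)."""
    return sum(f(xi) for f, xi in zip(univariate_fns, x))

d = 5
# Illustrative univariate terms: f_k(t) = (k+1) * t^2
fns = [lambda t, k=k: (k + 1) * t**2 for k in range(d)]
x = np.ones(d)
value = additive_model(x, fns)  # 1 + 2 + 3 + 4 + 5 = 15.0
```

Adding second-order interaction terms $f_{ij}(x_i, x_j)$ would require up to $d(d-1)/2$ additional summands, which illustrates how higher-order ANOVA interactions reintroduce combinatorial growth.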
In this paper, we propose algorithms to improve regression in a functional representation that takes advantage of low-rank structure to mitigate the curse of dimensionality while maintaining high expressivity. Low-rank functional representations are parametric models that enable a wide variety of interactions between variables and can generate high-order representations. More specifically, they are continuous analogues of tensor decompositions and exploit separability, i.e., the property that a function can be represented by a sum of products of univariate functions. One example is the canonical polyadic (CP) representation (Carroll and Chang, 1970), $f(x_1, \ldots, x_d) = \sum_{l=1}^{r} \prod_{i=1}^{d} f_i^{(l)}(x_i)$, whose number of free parameters scales linearly with dimension. Instead of the CP format, we use a continuous analogue of the discrete tensor-train (TT) decomposition (Oseledets, 2011), called the functional tensor-train (FT) (Oseledets, 2013; Gorodetsky et al., 2016), to allow for a greater number of interactions between variables.
Low-rank functional tensor decompositions have been used for regression previously. Existing approaches (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015; Rauhut et al., 2017) implicitly make two simplifying assumptions to facilitate the use of certain algorithms and data structures from the low-rank tensor decomposition literature. Specifically, they assume linear and identical basis expansions for each univariate function of a particular dimension. These approaches convert the problem from one of determining a low-rank function to one of representing the coefficients of a tensor-product basis as a low-rank tensor. Following this conversion, many of these techniques use alternating minimization to determine the coefficients of the FT. Alternating minimization, such as alternating least squares, transforms the nonlinear optimization problem of fitting the parameters of each univariate function to data into a linear problem by considering only a single dimension at a time. Existing approaches either use efficient linear algebra routines to solve the linear system at each iteration (Doostan et al., 2013) or sparsity-inducing methods such as the LASSO (Mathelin, 2014). Recently, iterative thresholding has also been used to find low-rank coefficients; however, such an approach has been limited to problems of low dimensionality (Rauhut et al., 2017).
In this paper we take a different approach: we use gradient-based optimization procedures such as quasi-Newton methods and stochastic gradient descent, and we do not restrict ourselves to the assumptions that lead to the consideration of a tensor of coefficients. Overall, our contributions include:
- derivation and computational complexity analysis of a gradient-based fitting procedure that yields more accurate approximations than alternating minimization;
- use of both linear and nonlinear parameterizations of the univariate functions in each mode, facilitating a larger class of functions than is possible when using solely linear representations; and
- creation of rank-adaptation and regularization schemes to reduce overfitting.
We demonstrate the efficacy of our approach on both synthetic functions and on several real-world data sets used in the existing literature to test other regression algorithms (Kandasamy and Yu, 2016). Specifically, we show that gradient-based optimization provides significant advantages in approximation error compared with an alternating least squares approach, and that these advantages are especially apparent in the case of small amounts of data and large parameter sizes. To our knowledge, no gradient-based procedure has been derived or evaluated for the low-rank functional format that we consider. We also show the benefits of nonlinear representations of univariate functions over the linear representations most frequently found in the literature. Our real-world data results show that our methodology is competitive with both nonparametric and parametric algorithms. We achieve top-five accuracy for seven of the ten data sets considered and best accuracy for two of the data sets. These rankings are the best amongst parametric models and competitive with the best nonparametric methods. Finally, we demonstrate that for some physical systems, exploiting low-rank structure is more effective than exploiting sparse structure, which is a common approach used to enable uncertainty quantification of computational models.
1.1 Related Work
As mentioned above, the functional tensor-train decomposition was proposed as an ansatz for representing functions by Oseledets (2013), and computational routines for compressing a function into FT format have been developed previously (Gorodetsky et al., 2016). In that setting, an approximation of a black-box function is sought to a prescribed accuracy, and a sampling procedure and associated algorithm were designed to optimally evaluate the function in order to obtain an FT representation. In this work, we instead consider the setting of fixed data.
There has also been some recent work on regression in low-rank format (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015). These approaches rely on linearity between the parameters of the low-rank format and the output of the function. Utilizing this relationship, they convert the problem of low-rank function approximation to one of low-rank tensor decomposition of the coefficients of a tensor-product basis. In Section 3.2 we show how the representation presented in those works can be obtained as a particular instantiation of the formulation we present here.
In spirit, our approach is also similar to the recent use of the tensor-train decomposition within fully connected neural networks (Novikov et al., 2015). There, the layers of a neural network are assumed to be a linear transformation of an input vector, and the weight matrix of the transformation is estimated from data; their contribution is representing the weight matrix in a compressed TT format. In this work, our low-rank format can be thought of as an alternative to the linear transformation layer of a neural network. Indeed, future work can focus on utilizing our proposed methodology within a hierarchy of low-rank approximations.
2 Background
In this section we establish notation and provide background for regression and low-rank tensor decompositions.
2.1 Notation
Let $\mathbb{R}$ denote the set of real numbers and $\mathbb{Z}_{+}$ the set of positive integers. Let $d \in \mathbb{Z}_{+}$, and suppose that we are given $N$ i.i.d. data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ such that each datum is sampled from a distribution on a compact space $\mathcal{X} \times \mathbb{R}$. Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ be a tensor-product space with $\mathcal{X}_k \subseteq \mathbb{R}$. Let the marginal distribution of $x$ be the tensor-product measure $\mu = \mu_1 \times \cdots \times \mu_d$, and assume that all integrals appearing in this paper are with respect to this measure. For example, let $f, g : \mathcal{X} \to \mathbb{R}$; then the inner product is defined as $\langle f, g \rangle = \int_{\mathcal{X}} f(x) g(x) \, d\mu(x)$. Similarly, the norm is defined as $\|f\|^2 = \langle f, f \rangle$.
In this paper scalars and scalar-valued functions are denoted by lowercase letters; vectors are denoted by bold lowercase letters such as $\boldsymbol{x}$; matrices are denoted by bold uppercase letters such as $\boldsymbol{A}$; tensors are denoted by bold uppercase calligraphic letters such as $\boldsymbol{\mathcal{A}}$; and matrix-valued functions are denoted by uppercase calligraphic letters such as $\mathcal{F}$. An ordered sequence of elements of the same set is distinguished by parenthesized superscripts such as $x^{(1)}$ or $x^{(2)}$.
2.2 Supervised learning
The supervised learning task seeks a function $f : \mathcal{X} \to \mathbb{R}$ that minimizes the average of a cost function $\ell$:

$J(f) = \mathbb{E}\left[ \ell(f(x), y) \right].$ (2)
For example, the standard least squares objective is specified with $\ell(f(x), y) = (f(x) - y)^2$. In this work, the cost functional cannot be calculated exactly; instead, data is used to estimate its value. The cost functional then becomes a sum over the data instead of an integral,

$J(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\left(f\left(x^{(i)}\right), y^{(i)}\right),$ (3)
where we have reused the notation $J$, and further references to $J$ will use this definition. To obtain an optimization problem over a finite set of variables, the search space of functions must be restricted. Nonparametric representations generally allow this space to vary depending upon the data, whereas parametric representations typically fix the representation. In either case, we denote this space by $\mathcal{F}$ and seek an optimal function $f^*$ such that

$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} J(f).$ (4)
One example of a common function space is the reproducing kernel Hilbert space, and resulting algorithms using this space include Gaussian process regression and radial basis function interpolation. Other examples include linear functions (resulting in linear regression) or polynomial functions. In this work, we consider a function space that incorporates all functions with a prescribed rank, which will be defined in the next section.

When solving the supervised learning problem (4), it is often useful to introduce a regularization term to mitigate overfitting. In this paper we will consider the following regularized learning problem:

$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} J(f) + \lambda \, \Omega(f),$ (5)

where $\lambda$ denotes a scaling factor and $\Omega$ is a functional that penalizes certain functions in $\mathcal{F}$. For example, the choice $\Omega(f) = \|f\|^2$ penalizes functions that have large norms, and such a penalty has been used for a certain type of low-rank functional approximation before (Doostan et al., 2013). In this work, we impose a group-sparsity type of regularization that has been shown to improve prediction performance in other approximation settings (Turlach et al., 2005; Yuan and Lin, 2006). Such an approach seeks to increase parsimony by reducing the number of nonzero elements in an approximation. In Section 4.1 we describe what this type of constraint means in the context of low-rank functional decompositions.
2.3 Low-rank tensor decompositions
The function space that we use to constrain our supervised learning problem is related to the concept of low-rank decompositions of multiway arrays, or tensors. Tensor decompositions are multidimensional analogues of the matrix SVD and are used to mitigate the curse of dimensionality associated with storing multivariate arrays (Kolda and Bader, 2009). A $d$-way array $\boldsymbol{\mathcal{A}} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ contains an exponentially growing number of elements as the dimension increases. Tensor decompositions can provide a compact representation of the tensor and have found widespread usage for data compression and learning tasks (Cichocki et al., 2009; Novikov et al., 2015; Yu and Lie, 2016).
In this work, we consider the tensor-train (TT) decomposition (Oseledets, 2011). Specifically, we use the TT as an ansatz for representing low-rank functions. A TT representation of a tensor $\boldsymbol{\mathcal{A}} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ is defined by a list of three-way arrays $\boldsymbol{\mathcal{G}}_k \in \mathbb{R}^{r_{k-1} \times n_k \times r_k}$, called TT-cores, where $r_0 = r_d = 1$. The sizes $(r_0, r_1, \ldots, r_d)$ of the cores are called the TT-ranks. In this format, a tensor element is obtained according to the following multiplication:

$\boldsymbol{\mathcal{A}}[i_1, i_2, \ldots, i_d] = \boldsymbol{\mathcal{G}}_1[i_1] \, \boldsymbol{\mathcal{G}}_2[i_2] \cdots \boldsymbol{\mathcal{G}}_d[i_d],$ (6)

where the $\boldsymbol{\mathcal{G}}_k[i_k] \in \mathbb{R}^{r_{k-1} \times r_k}$ are matrices, and the above equation describes a sequence of matrix multiplications.
In this format, storage complexity scales linearly with dimension and quadratically with TTrank. In the next section we discuss how it can be used as a framework for function approximation.
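A minimal NumPy sketch of this element-wise evaluation may help make the core shapes concrete; the core sizes and random values below are our own illustrative choices, not data from the paper.

```python
import numpy as np

def tt_element(cores, idx):
    """Evaluate one element of a tensor in TT format (Eq. (6)).

    cores[k] has shape (r_{k-1}, n_k, r_k); cores[k][:, i_k, :] is the
    r_{k-1} x r_k matrix slice, and the element is the product of slices.
    """
    v = cores[0][:, idx[0], :]          # 1 x r_1 row vector
    for core, ik in zip(cores[1:], idx[1:]):
        v = v @ core[:, ik, :]          # sequence of matrix multiplications
    return v.item()                     # final product is 1 x 1

rng = np.random.default_rng(0)
n, r = 4, 2
cores = [rng.standard_normal((1, n, r)),
         rng.standard_normal((r, n, r)),
         rng.standard_normal((r, n, 1))]
val = tt_element(cores, (0, 1, 2))
```

Storing the three cores requires only $n(2r + r^2)$ numbers here, versus $n^3$ for the dense array, which is the linear-in-dimension scaling discussed next.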
3 Low-rank functions
Low-rank formats for functions can be thought of as direct analogues of low-rank tensor decompositions. In this work, we focus on a TT-like decomposition of functions called the functional tensor-train (FT) (Oseledets, 2013):

$f(x_1, \ldots, x_d) = \sum_{\alpha_1=1}^{r_1} \cdots \sum_{\alpha_{d-1}=1}^{r_{d-1}} f_1^{(1,\alpha_1)}(x_1) \, f_2^{(\alpha_1,\alpha_2)}(x_2) \cdots f_d^{(\alpha_{d-1},1)}(x_d),$ (7)

where the $f_k^{(i,j)}$ are univariate functions, and $r_0 = r_d = 1$ for single-output functions. A more compact expression is obtained by viewing a function value as a product of matrix-valued functions,

$f(x_1, \ldots, x_d) = \mathcal{F}_1(x_1) \, \mathcal{F}_2(x_2) \cdots \mathcal{F}_d(x_d),$ (8)

where each matrix-valued function $\mathcal{F}_k : \mathcal{X}_k \to \mathbb{R}^{r_{k-1} \times r_k}$ is called a core and can be visualized as an array of the univariate functions

$\mathcal{F}_k(x_k) = \begin{bmatrix} f_k^{(1,1)}(x_k) & \cdots & f_k^{(1,r_k)}(x_k) \\ \vdots & \ddots & \vdots \\ f_k^{(r_{k-1},1)}(x_k) & \cdots & f_k^{(r_{k-1},r_k)}(x_k) \end{bmatrix}.$ (9)
If each univariate function is represented with $n$ parameters and $r_k = r$ for all $k$, then the storage complexity scales as $O(dnr^2)$. Comparing this representation with (6), we see a very close resemblance between the TT cores and the FT cores; indeed, they are both matrices when indexed by a discrete index (for the TT) or a continuous index (for the FT). We describe a closer relationship between the TT and the FT in the context of low-rank representations of functions in the next section.
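Evaluating an FT at a point amounts to forming each matrix of univariate function values and multiplying, exactly as in the TT case. The following is our own minimal sketch (not the paper's library), with cores stored as nested lists of Python callables:

```python
import numpy as np

def ft_eval(cores, x):
    """Evaluate a functional tensor-train at a point x.

    cores[k][i][j] is a callable univariate function; the matrix-valued
    core F_k(x_k) is formed by evaluating every entry at x_k, and the
    function value is the product of the resulting matrices (Eq. (8)).
    """
    v = None
    for core, xk in zip(cores, x):
        mat = np.array([[f(xk) for f in row] for row in core])
        v = mat if v is None else v @ mat
    return v.item()

# A rank-1, two-dimensional FT representing f(x1, x2) = x1 * x2
cores = [[[lambda t: t]],   # F_1(x1) = [x1], a 1x1 core
         [[lambda t: t]]]   # F_2(x2) = [x2]
value = ft_eval(cores, [2.0, 3.0])  # 2.0 * 3.0 = 6.0
```

Higher ranks simply enlarge the inner matrices; the evaluation cost remains a sequence of small matrix-vector products.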
3.1 Parameterizations of low-rank functions
An FT core comprises an $r_{k-1} \times r_k$ array of univariate functions, as shown in Figure 1. Each univariate function carries its own parameters, and as a result the full FT is parameterized through the parameterization of its univariate functions. Let $n_{k,i,j}$ denote the number of parameters describing $f_k^{(i,j)}$, and let $\theta$ denote the vector of parameters of all the univariate functions. Then there are a total of $n_\theta = \sum_{k,i,j} n_{k,i,j}$ parameters describing the FT, i.e., $\theta \in \mathbb{R}^{n_\theta}$.

The parameter vector is indexed by a multi-index $(k, i, j, \ell)$, where $k$ corresponds to an input variable, $i$ and $j$ correspond to a univariate function within the $k$th core, and $\ell$ corresponds to a specific parameter within that univariate function. In other words, we adopt the convention that $\theta_{k,i,j,\ell}$ refers to the $\ell$th parameter of the univariate function in the $i$th row and $j$th column of the $k$th core.
The additional flexibility of the FT representation allows both linear and nonlinear parameterizations of univariate functions. A linear parameterization represents a univariate function as an expansion in basis functions $\phi_\ell$ according to

$f_k^{(i,j)}(x_k) = \sum_{\ell=1}^{n_{k,i,j}} \theta_{k,i,j,\ell} \, \phi_\ell(x_k).$ (10)
Linear parameterizations are often convenient within the context of gradient-based optimization methods, since the gradient with respect to a parameter is independent of the parameter values; thus, one only needs to compute and store the evaluations of the basis functions a single time. Nonlinear parameterizations are more flexible and general, but often incur a greater computational cost within optimization. One example of a nonlinear parameterization is that of a Gaussian kernel, which can be written as

$f_k^{(i,j)}(x_k) = \sum_{\ell=1}^{m} \theta_{k,i,j,\ell} \exp\left( - \frac{\left(x_k - \theta_{k,i,j,m+\ell}\right)^2}{\sigma^2} \right),$ (11)

where $n_{k,i,j} = 2m$. Here, the first $m$ parameters refer to the coefficients of the radial basis functions, and the second half of the parameters refer to the centers.
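The two parameterization styles can be sketched as follows. This is our own illustration: the polynomial basis and the fixed kernel width are assumptions made here for concreteness, not choices prescribed by the paper.

```python
import numpy as np

def linear_param(t, coeffs, basis):
    """Linear parameterization (Eq. (10)): f(t) = sum_l coeffs[l] * basis[l](t).

    The gradient w.r.t. coeffs is just the vector of basis evaluations,
    independent of the parameter values themselves.
    """
    return sum(c * b(t) for c, b in zip(coeffs, basis))

def gaussian_kernel_param(t, params, width=0.5):
    """Nonlinear Gaussian-kernel parameterization (Eq. (11)).

    The first half of params are coefficients; the second half are centers.
    Because the output depends nonlinearly on the centers, gradients must be
    recomputed as the parameters change.
    """
    m = len(params) // 2
    coeffs, centers = params[:m], params[m:]
    return float(np.sum(coeffs * np.exp(-((t - centers) ** 2) / width**2)))

basis = [lambda t: 1.0, lambda t: t, lambda t: t**2]
lin_val = linear_param(2.0, [1.0, 0.0, 3.0], basis)   # 1 + 0 + 12 = 13.0
rbf_val = gaussian_kernel_param(0.0, np.array([1.0, 0.0]))  # peak at center
```

In an FT, each entry $f_k^{(i,j)}$ would be one such parameterized function, and either style may be chosen per core.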
3.2 Low-rank functions vs. low-rank coefficients
The functional tensor-train can be used by independently parameterizing the univariate functions of each core, and both linear and nonlinear parameterizations are possible. As described below, the advantages of this representation include a naturally sparse storage scheme for the cores. Another advantage is the availability of efficient computational algorithms for multilinear algebra that can adapt the representation of each univariate function individually as needed (Gorodetsky et al., 2016; Gorodetsky, 2017), in the spirit of continuous computation pioneered by Chebfun (Platte and Trefethen, 2010).
Although the functional form of the TT is general, most of the literature makes two simplifying assumptions (Doostan et al., 2013; Mathelin, 2014; Chevreuil et al., 2015):

- a linear parameterization of each univariate function $f_k^{(i,j)}$, and
- an identical basis for the functions within each FT core, i.e., the same basis functions $\phi_{\ell}$ are shared by all $(i, j)$ within core $k$.
These assumptions transform the problem of learning low-rank functions into the problem of learning low-rank coefficients. This transformation facilitates the use of discrete TT algorithms and theory. We will refer to representations using these assumptions as a functional tensor-train with TT coefficients, or FTc, where “c” stands for coefficients.
The FTc representation stores the coefficients of a tensor-product basis in TT format. Let $\boldsymbol{\mathcal{A}}$ be a tensor of the following form:

$\boldsymbol{\mathcal{A}}[\ell_1, \ldots, \ell_d] = \boldsymbol{\mathcal{G}}_1[\ell_1] \cdots \boldsymbol{\mathcal{G}}_d[\ell_d],$ (12)

for $\ell_k = 1, \ldots, n_k$. Function evaluations can be obtained from the coefficients stored in TT format by performing the following summation:

$f(x_1, \ldots, x_d) = \sum_{\ell_1=1}^{n_1} \cdots \sum_{\ell_d=1}^{n_d} \boldsymbol{\mathcal{A}}[\ell_1, \ldots, \ell_d] \, \phi_{\ell_1}(x_1) \cdots \phi_{\ell_d}(x_d).$ (13)

From (8), (12), and (13) we can see that the relationship between the TT cores and the FT cores is

$\mathcal{F}_k(x_k) = \sum_{\ell_k=1}^{n_k} \boldsymbol{\mathcal{G}}_k[\ell_k] \, \phi_{\ell_k}(x_k),$ (14)

where the basis function $\phi_{\ell_k}(x_k)$ multiplies every element of the matrix $\boldsymbol{\mathcal{G}}_k[\ell_k]$. In other words, the TT cores represent a TT decomposition of the coefficient tensor of a tensor-product basis.
The two assumptions required for converting the problem from one of low-rank function approximation to low-rank coefficient approximation often result in a computational burden in practice. The burden of the first assumption is clear: some functions are more compressible if nonlinear parameterizations are allowed. The second assumption can also limit compressibility, but in addition it results in a larger computational and storage burden. One example in which this burden becomes obvious is when attempting to represent additively separable functions, e.g., Equation (1), in the FT format. As mentioned in the introduction, this representation is common for high-dimensional regression, and an FT can represent such a function using cores that have the following rank-2 structure:

$\mathcal{F}_1(x_1) = \begin{bmatrix} f_1(x_1) & 1 \end{bmatrix}, \quad \mathcal{F}_k(x_k) = \begin{bmatrix} 1 & 0 \\ f_k(x_k) & 1 \end{bmatrix} \ (1 < k < d), \quad \mathcal{F}_d(x_d) = \begin{bmatrix} 1 \\ f_d(x_d) \end{bmatrix}.$
Suppose that each of the univariate functions $f_k$ can be represented with $n$ parameters and that the constants 0 and 1 can each be stored with a single parameter. Then the storage requirement is $n + 3$ parameters for each middle core and $n + 1$ for each outer core, totaling $(d-2)(n+3) + 2(n+1)$ floating point values. In the TT case, the constant entries must be stored with the same complexity as the other entries, since each TT core is a dense three-way array. Thus, the total number of parameters becomes $4n(d-2) + 4n = 4n(d-1)$. Almost four times less storage is required for the FT format than for the TT format in the limit of large $d$ and $n$. Essentially, the TT format does not take into account the sparsity of the cores; in this case, one can think of the FT format as storing the TT cores in a sparse format. This burden is exacerbated if we add interaction terms to Equation (1).
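The parameter counts above can be checked with a small back-of-envelope script. This is our own arithmetic sketch under the stated assumptions (one $n$-parameter function plus single-parameter constants per FT core, dense $2 \times n \times 2$ middle cores for the TT):

```python
def ft_params(d, n):
    """FT storage for the rank-2 additive structure: two outer cores with
    n + 1 parameters each, and d - 2 middle cores with n + 3 each."""
    return 2 * (n + 1) + (d - 2) * (n + 3)

def tt_params(d, n):
    """TT storage with dense three-way cores: outer cores are 1 x n x 2
    (2n entries each), middle cores are 2 x n x 2 (4n entries each)."""
    return 2 * (2 * n) + (d - 2) * (4 * n)

d, n = 100, 50
ratio = tt_params(d, n) / ft_params(d, n)  # approaches 4 for large d and n
```

For $d = 100$ and $n = 50$ the ratio is already about 3.7, consistent with the "almost four times" observation in the text.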
4 Low-rank supervised learning
In this section, we incorporate the FT model as a constraint in the supervised learning task. We discuss issues surrounding optimization algorithms, regularization, and choosing the TT-ranks. In particular, we discuss three optimization algorithms for fitting fixed-rank models: batch gradient descent, alternating least squares, and stochastic gradient descent. We also present approaches for mitigating overfitting; specifically, we discuss a group-sparsity-inducing regularization scheme and a hyperparameter estimation scheme using cross validation and rank adaptation.
4.1 Low-rank constraints and regularization
The function space $\mathcal{F}$ in Equation (4) constrains the search space. The space of low-rank functions is defined as follows. Let $r = (r_0, r_1, \ldots, r_d)$ with $r_0 = r_d = 1$. Then $\mathcal{F}_r$ denotes the space of functions with FT ranks $r$. Note that this function space also includes functions with smaller ranks; for example, a function with ranks $\tilde{r}_k \le r_k$ can be obtained by setting all univariate functions outside the top-left $\tilde{r}_{k-1} \times \tilde{r}_k$ block of each core to zero, with analogous modifications made to all other cores.
In this paper we also add a regularization term to reduce the number of nonzero functions in each core. The structure of the FT cores in Equation (9) admits a natural grouping in terms of the univariate functions. Setting a univariate function to zero inside a core lowers the number of interactions allowed between univariate functions in neighboring dimensions, with the overall effect being a smaller number of summands in the representation of the function in Equation (7). Minimizing the number of nonzero univariate functions not only produces a more parsimonious model, but also reduces the number of interactions between function variables.
Specifically, we penalize the regression problem using the sum of the norms of the univariate functions:

$\Omega(f) = \sum_{k=1}^{d} \sum_{i=1}^{r_{k-1}} \sum_{j=1}^{r_k} \left\| f_k^{(i,j)} \right\|.$ (15)
Note that this regularization method is different from that first proposed in (Mathelin, 2014), where the FTc representation was used and the TT-cores of the coefficient tensor themselves were forced to be sparse. In our case we do not look for sparse coefficients; instead, our goal is to increase the number of univariate functions that are identically zero. As a surrogate for this goal, we follow standard practice and replace the nondifferentiable count of nonzero functions with a sum of the norms of the functions. This replacement directly targets the number of interacting univariate functions and provides a differentiable objective for optimization.
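For linearly parameterized functions with an orthonormal basis, the $L^2$ norm of each univariate function equals the Euclidean norm of its coefficient vector, so the penalty in Equation (15) reduces to a group-lasso-style sum over coefficient groups. A minimal sketch of that computation, under the orthonormal-basis assumption (our simplification):

```python
import numpy as np

def group_sparsity_penalty(cores):
    """Sum of norms over univariate functions (Eq. (15)).

    cores[k][i][j] holds the coefficient vector of one univariate function
    in an orthonormal basis, so ||f|| equals the Euclidean norm of the
    coefficients. Grouping by function drives entire functions, not
    individual coefficients, to zero.
    """
    return sum(np.linalg.norm(coef)
               for core in cores for row in core for coef in row)

cores = [[[np.array([3.0, 4.0])]],      # ||(3, 4)|| = 5
         [[np.array([0.0, 0.0])]]]      # the zero function contributes 0
penalty = group_sparsity_penalty(cores)  # 5.0
```

Because the penalty is a sum of norms rather than a sum of squared norms, its minimizers tend to zero out whole groups, which is what removes interactions between neighboring dimensions.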
4.2 Hyperparameter estimation using cross validation and rank adaptation
A careful selection of the number of parameters in each univariate function, the rank of each core, and the regularization parameter $\lambda$ in the regularized learning problem is required to avoid overfitting. For example, if a polynomial basis is chosen, setting the degree too high can result in spurious oscillations. Cross validation provides a mechanism to estimate hyperparameters. In this paper we use 20-fold cross validation to select the hyperparameters of an FT that minimize the expected prediction error.
Alternatively, we can design a more effective rank adaptation scheme by combining cross validation with a rounding procedure. The problem with the simplest cross validation option is that a scheme that optimizes separately over each rank $r_k$ imposes a computational burden, while a scheme that optimizes for a single rank across all cores, i.e., $r_k = r$ for all $k$, does not allow enough flexibility. Instead, we propose to combine FT rounding (Gorodetsky et al., 2016) and cross validation to avoid overfitting.
Rounding is a procedure to reapproximate an FT by one with lower ranks to a given tolerance $\epsilon$. The tolerance criterion can be thought of as a regularization term, since it limits the complexity of the representation, and will be investigated in future work. The full rank adaptation scheme is provided by Algorithm 1. In that algorithm, one routine provides a cross-validation error for an optimization over the space of FTs with given ranks, and another provides the ranks of a rounded function with a particular tolerance $\epsilon$.
The scheme increases the ranks until either the cross-validation error increases or the rounding procedure decreases every rank. The first termination criterion targets overfitting, and the second seeks to limit rank growth when the data are no longer informative enough to necessitate the increase in expressivity.
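The adaptation loop can be sketched as follows. This is our own simplified rendering of Algorithm 1: `cv_error` and `rounded_ranks` are hypothetical stand-ins for the cross-validation and FT-rounding routines described in the text, and a single scalar rank is used in place of the full rank vector.

```python
def adapt_ranks(cv_error, rounded_ranks, r0=1, max_rank=10):
    """Increase the rank until CV error rises or rounding undoes the increase.

    cv_error(r)       -> cross-validation error of the best fit at rank r
    rounded_ranks(r)  -> rank of the fitted model after FT rounding
    """
    ranks, best_err = r0, cv_error(r0)
    while ranks < max_rank:
        err = cv_error(ranks + 1)
        if err > best_err:                      # overfitting: stop
            break
        if rounded_ranks(ranks + 1) <= ranks:   # rounding removed the increase
            break
        ranks, best_err = ranks + 1, err
    return ranks

# Toy stand-ins: error improves until rank 3, rounding never shrinks ranks.
errs = {1: 1.0, 2: 0.5, 3: 0.2, 4: 0.4}
chosen = adapt_ranks(lambda r: errs.get(r, 1.0), lambda r: r)  # stops at 3
```

The two `break` conditions correspond directly to the two termination criteria described above.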
4.3 Optimization algorithms
In this section, we give an overview of three gradient-based optimization algorithms for solving supervised learning problems in low-rank format: alternating minimization, gradient descent, and stochastic gradient descent. The computational complexity of these algorithms is analyzed in Section 5.
4.3.1 Alternating minimization
Alternating minimization methods, chiefly alternating least squares, are the primary avenue for optimization in low-rank tensor and functional contexts. These routines are typically used within tensor decompositions by sequentially sweeping over dimensions and optimizing over the parameters of a single dimension. Such approaches are popular for compressing multiway arrays or for tensor completion problems (Savostyanov and Oseledets, 2011; Grasedyck et al., 2015).
For the case of supervised learning, their usage is straightforward. The idea is to solve a sequence of optimization problems in which the functional space is further constrained by fixing all but one FT core $\mathcal{F}_k$. After optimizing over the $k$th core, that core is fixed and optimization over the next one is performed. This procedure is given by Algorithm 2, which performs sweeps over all dimensions until convergence is obtained. For details on the convergence of this algorithm we refer to, e.g., Uschmajew (2012). We provide more details about the implementation and complexity of this algorithm in Section 5.
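A toy instance of alternating least squares may clarify the idea. This is our own simplified illustration of the principle behind Algorithm 2, not the paper's algorithm itself: a rank-1 two-dimensional model $f(x) = (a \cdot p(x_1))(b \cdot p(x_2))$ with $p(t) = [1, t]$, fit by alternately solving the linear subproblem for $a$ with $b$ fixed and vice versa.

```python
import numpy as np

def als_rank1(X, y, sweeps=50):
    """Alternating least squares for a rank-1 separable model."""
    ones = np.ones(len(X))
    P1 = np.column_stack([ones, X[:, 0]])
    P2 = np.column_stack([ones, X[:, 1]])
    a, b = np.ones(2), np.ones(2)
    for _ in range(sweeps):
        # With b fixed, the model is linear in a: solve a least squares problem.
        a = np.linalg.lstsq(P1 * (P2 @ b)[:, None], y, rcond=None)[0]
        # With a fixed, the model is linear in b.
        b = np.linalg.lstsq(P2 * (P1 @ a)[:, None], y, rcond=None)[0]
    return a, b

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = (1 + X[:, 0]) * (2 - X[:, 1])          # exactly rank 1 in this basis
a, b = als_rank1(X, y)
P1 = np.column_stack([np.ones(200), X[:, 0]])
P2 = np.column_stack([np.ones(200), X[:, 1]])
resid = np.max(np.abs((P1 @ a) * (P2 @ b) - y))
```

Each subproblem is convex and solved exactly, but convergence of the overall sweep is only to a stationary point of the full nonconvex problem, which motivates the all-at-once alternative below.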
4.3.2 Batch gradient methods
Gradient descent directly minimizes the cost function with a batch gradient (or second-order) procedure. While this is a standard optimization approach, we will refer to this algorithm as all-at-once (AAO) to distinguish it from alternating minimization. Gradient-based procedures have been shown to be promising in the context of canonical tensor decompositions via gradient descent (Acar et al., 2011) and TT decompositions in the context of iterative thresholding (Rauhut et al., 2017), but they have not been well explored for low-rank functional approximation.
Gradient descent updates the parameters according to

$\theta^{(m+1)} = \theta^{(m)} - \alpha \, \nabla_\theta J\left(\theta^{(m)}\right)$

for $m = 0, 1, \ldots$, where $\alpha$ is a step size. More generally, we use a quasi-Newton method to choose a descent direction $p^{(m)}$ such that the update becomes

$\theta^{(m+1)} = \theta^{(m)} + \alpha \, p^{(m)}.$

In the examples in Section 6, we use the limited-memory BFGS method (Liu and Nocedal, 1989) available in the Compressed Continuous Computation library^1 (^1 github.com/goroda/CompressedContinuousComputation) to perform this update.
One often-cited disadvantage of this approach is that it involves solving an optimization problem whose size is the number of parameters in the function. However, we note that the number of parameters scales as $O(dnr^2)$, so a one-hundred-dimensional, rank-5 problem has on the order of $2500n$ unknowns, where $n$ is the number of parameters per univariate function. This number of unknowns is well within the capabilities of modern hardware and optimization techniques.
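To contrast with the alternating approach, the following is our own toy all-at-once sketch: plain gradient descent stands in for the quasi-Newton update, and the model is the same rank-1 separable form $f(x) = (a \cdot [1, x_1])(b \cdot [1, x_2])$, with all parameters updated jointly at every iteration.

```python
import numpy as np

def aao_fit(X, y, lr=0.05, iters=5000):
    """All-at-once fit: gradient descent jointly over both parameter blocks."""
    P1 = np.column_stack([np.ones(len(X)), X[:, 0]])
    P2 = np.column_stack([np.ones(len(X)), X[:, 1]])
    a, b = np.full(2, 0.5), np.full(2, 0.5)
    for _ in range(iters):
        u, v = P1 @ a, P2 @ b
        r = u * v - y                            # residuals
        a = a - lr * (P1.T @ (r * v)) / len(y)   # gradient of 0.5 * mean(r^2)
        b = b - lr * (P2.T @ (r * u)) / len(y)
    return a, b

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 2))
y = (1 + X[:, 0]) * (2 - X[:, 1])
a, b = aao_fit(X, y)
mse = np.mean(((np.column_stack([np.ones(200), X[:, 0]]) @ a) *
               (np.column_stack([np.ones(200), X[:, 1]]) @ b) - y) ** 2)
```

Unlike the alternating scheme, no subproblem is solved exactly here; instead every parameter moves along the joint gradient direction, which is the structure a quasi-Newton method would exploit with curvature information.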
4.3.3 Stochastic gradient descent
Stochastic gradient descent (SGD) is often used instead of gradient descent for large data sets. SGD aims to minimize objective functions of the form

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} J_i(\theta),$

which includes the least-squares objective (3). The parameters are updated using the gradient at only one data point (mini-batches can also be used), chosen randomly at each iteration:

$\theta^{(m+1)} = \theta^{(m)} - \alpha_m \, \nabla_\theta J_{i_m}\left(\theta^{(m)}\right).$ (16)

Many variations of stochastic gradient descent have been developed; we refer the reader to (Bottou et al., 2016) for more information. These variations often include adaptive strategies for choosing the learning rate $\alpha_m$. Such methods have previously been applied in the context of tensors to reduce computational costs when dealing with large-scale data (Huang et al., 2015).
5 Gradient computation and analysis
The evaluation of the gradients of the FT with respect to its parameters is essential for making gradient-based optimization algorithms tractable. Almost all existing literature on tensor approximation uses alternating minimization strategies because the subproblems are convex and can be solved exactly and efficiently with standard linear algebra algorithms, though these strategies have limited (if any) guarantees of obtaining a good minimizer of the full problem. One of the major contributions of this paper is the derivation and analysis of the computational complexity of computing gradients of the supervised learning objective (4).
The gradient of the least squares objective function with respect to a parameter $\theta_{k,i,j,\ell}$ is

$\frac{\partial J}{\partial \theta_{k,i,j,\ell}} = \frac{2}{N} \sum_{m=1}^{N} \left( f\left(x^{(m)}\right) - y^{(m)} \right) \frac{\partial f\left(x^{(m)}\right)}{\partial \theta_{k,i,j,\ell}},$ (17)

where $(k, i, j, \ell)$ denotes the multi-index of a unique parameter of the $k$th core; see Section 3.1. Efficient computation of the partial derivative $\partial f / \partial \theta_{k,i,j,\ell}$ is essential for making gradient-based optimization algorithms tractable. Letting

$\mathcal{L}_{<k}(x_1, \ldots, x_{k-1}) = \mathcal{F}_1(x_1) \cdots \mathcal{F}_{k-1}(x_{k-1})$ (18)

$\mathcal{R}_{>k}(x_{k+1}, \ldots, x_d) = \mathcal{F}_{k+1}(x_{k+1}) \cdots \mathcal{F}_d(x_d)$ (19)

denote the products of the cores before and after the $k$th, respectively, and combining Equations (8), (18), and (19), we obtain the following expression:

$\frac{\partial f}{\partial \theta_{k,i,j,\ell}} = \mathcal{L}_{<k} \, \frac{\partial \mathcal{F}_k(x_k)}{\partial \theta_{k,i,j,\ell}} \, \mathcal{R}_{>k}.$ (20)

Thus, efficiently computing gradients of the objective function and the FT requires efficient algorithms for evaluating the left and right product cores $\mathcal{L}_{<k}$ and $\mathcal{R}_{>k}$ and the partial derivative $\partial \mathcal{F}_k / \partial \theta_{k,i,j,\ell}$.
In the following sections we describe and analyze the gradient of the FT with respect to its parameters. We first describe the partial derivatives of an FT with respect to the parameters of a single core. Then we present an efficient algorithm for computing the left and right product cores. Finally, we discuss the computation of gradients of the full FT and of the objective function.
5.1 Derivatives of an FT core
In this section we discuss how to obtain the partial derivatives with respect to the parameters of a specific core. Without loss of generality, we make the following assumption to ease the notational burden.
Assumption 1.
Each univariate function in each FT core (9) is independently parameterized with $n$ parameters, so that the FT core $\mathcal{F}_k$ is parameterized with $r_{k-1} r_k n$ parameters.
Under this assumption the FT cores have the following structure.
Proposition 1.
Let $\partial \mathcal{F}_k / \partial \theta_{k,i,j,\ell}$ denote the partial derivative of the $k$th core with respect to its parameter $\theta_{k,i,j,\ell}$, for some $1 \le \ell \le n$. Under Assumption 1, $\partial \mathcal{F}_k / \partial \theta_{k,i,j,\ell}$ is a sparse matrix with the following elements:

$\left[ \frac{\partial \mathcal{F}_k(x_k)}{\partial \theta_{k,i,j,\ell}} \right]_{(p,q)} = \begin{cases} \dfrac{\partial f_k^{(i,j)}(x_k)}{\partial \theta_{k,i,j,\ell}} & \text{if } (p, q) = (i, j), \\ 0 & \text{otherwise}, \end{cases}$ (21)

for $p = 1, \ldots, r_{k-1}$ and $q = 1, \ldots, r_k$.
Now, if we let $G$ denote the number of operations required to compute the gradient of a univariate function with respect to all of its parameters, then the cost of computing the partial derivatives of $\mathcal{F}_k$ with respect to all of its parameters is $r_{k-1} r_k G$ operations.
5.2 Evaluating left and right product cores
Computation of the partial derivative of the FT (20) requires the evaluation of the left and right product cores $\mathcal{L}_{<k}$ and $\mathcal{R}_{>k}$. A single forward and backward sweep over the cores can be used to obtain these values using the following recursion identities:

$\mathcal{L}_{<k+1} = \mathcal{L}_{<k} \, \mathcal{F}_k(x_k), \qquad \mathcal{R}_{>k-1} = \mathcal{F}_k(x_k) \, \mathcal{R}_{>k}.$

Thus we can obtain the gradient of the FT with respect to the parameters of the $k$th core using Algorithms 3 and 4.
These two algorithms consist of a triply nested loop. The innermost loop of coregradleft requires computing the gradient of a univariate function with respect to its $n$ parameters and multiplying each element of the gradient by the left multiplier. Thus, for each of the $r_{k-1} r_k$ univariate functions of the core, $G$ operations are needed for the gradient and $n$ multiplications are needed for the products; the partial derivatives with respect to the parameters of the core are stored with a complexity of $r_{k-1} r_k n$ floating point numbers. Each of these partial derivatives is updated in coregradright for a computational cost of $r_{k-1} r_k n$ and no additional storage. Since each of these functions is called for every core, the additional computational cost they incur is $O\left(d r^2 (G + n)\right)$ and the additional storage complexity they incur is $O\left(d r^2 n\right)$, where $r_k \le r$.
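The forward/backward sweep that makes these algorithms efficient can be sketched in isolation. This is our own minimal illustration of the recursion for the left products $\mathcal{L}_{<k}$ and right products $\mathcal{R}_{>k}$, using evaluated core matrices at a fixed point:

```python
import numpy as np

def left_right_products(core_mats):
    """Compute left and right partial products of a list of matrices.

    L[k] is the product of matrices before index k (identity for k = 0);
    R[k] is the product of matrices after index k (identity for the last).
    The partial derivative w.r.t. a parameter of core k is then
    L[k] @ dF_k @ R[k], as in Eq. (20), and both lists cost one sweep each.
    """
    d = len(core_mats)
    L = [np.eye(core_mats[0].shape[0])]
    for k in range(d - 1):
        L.append(L[-1] @ core_mats[k])        # forward sweep
    R = [np.eye(core_mats[-1].shape[1])]
    for k in range(d - 1, 0, -1):
        R.insert(0, core_mats[k] @ R[0])      # backward sweep
    return L, R

# Two evaluated cores of a rank-2 FT at some fixed point:
mats = [np.array([[1.0, 2.0]]), np.array([[3.0], [4.0]])]
L, R = left_right_products(mats)
# The full product is recovered as L[k] @ mats[k] @ R[k] for any k.
```

Reusing the partial products across all cores is what avoids recomputing the full chain of matrix multiplications for every parameter.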
5.3 Gradient of FT and objective functions
The entire gradient of the FT can be obtained with a single forward and backward sweep, as presented in Algorithm 5. In Algorithm 5 we use to denote the partial derivative of with respect to the parameters of the th core. Each step of the forward sweep requires: (i) creating an intermediate result for the gradient of the FT with respect to each parameter of the core in line 4 of Algorithm 3, (ii) evaluating and storing the core of the FT in line 6 of Algorithm 5, and (iii) updating the product of the cores through the current step in line 7 of Algorithm 5. The backward sweep involves updating the intermediate gradient result obtained from the forward sweep (line 4 in Algorithm 4) and then updating the product of the cores in line 14 of Algorithm 5.
Proposition 2.
Let be a rank FT with every univariate function in Equation (9). Suppose Assumption 1 holds and that for we have for some . Let denote the number of operations required to evaluate a univariate function, and let denote the number of operations required to compute the gradient of a univariate function with respect to all of its parameters. Then evaluating and computing the gradient with respect to its parameters requires
operations, and storing floating point values.
Proof.
To demonstrate this result, we can calculate the number of evaluations and storage size required for each step of Algorithm 5. First note that lines 2 and 6 involve the evaluation of each FT core. Since each core has at most univariate functions, these evaluations require operations and the ability to store floating point numbers. Since this evaluation must happen for each of the cores, the computational complexity of this step is . The storage space can be reused and only requires storing two vectors of size () and an matrix () for a total storage complexity of floating point numbers. Next we note that a product between a vector and an matrix in lines 7 and 14 needs to occur times and therefore requires operations. Thus, apart from the calls to coregradleft and coregradright, we have a computational complexity of and a storage complexity of .
Combining these costs with the cost of the coregrad algorithms presented in Section 5.2, we obtain the stated result. ∎
The values and depend upon the types of parameterizations of the univariate functions. Consider two examples: one linear and one nonlinear. For both parameterizations we consider kernel basis functions; however, for the nonlinear parameterization we will consider the centers of each kernel as free parameters. The linear parameterization is given by Equation (10) with , where and are the centers of the kernels, such that each univariate function of the th dimension is represented as a sum of kernels with the same locations. If the evaluation of the exponential takes a constant number of operations with respect to , then we have and because the gradient of the univariate function with respect to its th parameter is
(22) 
This gradient is independent of any other parameters. In practice this means that it can be precomputed at each location and recalled at runtime. In such a case the storage increases to numbers if each univariate function in each core has a different parameterization. If the univariate functions of each core share the same parameterization then the additional storage cost is . In either case the online cost becomes a simple lookup, i.e., .
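The precomputation can be sketched as follows, using a squared-exponential kernel with fixed centers (the kernel form, bandwidth, and point values are illustrative assumptions):

```python
import numpy as np

def kernel(x, c, ell=0.5):
    """Squared-exponential kernel with a fixed center c (an Equation
    (10)-style linear parameterization)."""
    return np.exp(-(x - c) ** 2 / ell ** 2)

# Fixed centers: the gradient d f / d theta_i = kernel(x, c_i) does not
# depend on theta, so it can be precomputed once per training point ...
centers = np.linspace(0.0, 1.0, 5)
x_train = np.array([0.1, 0.4, 0.9])
design = kernel(x_train[:, None], centers[None, :])  # precomputed gradients

# ... and reused at every optimization iteration as a simple lookup.
theta = np.zeros(5)
f_vals = design @ theta
grad = design  # gradient of f(x_train) w.r.t. theta, no recomputation
```

Each optimization iteration then touches only the precomputed matrix, which is the lookup cost described above.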
The nonlinear parameterization provided in Equation (11) is different because we are free to optimize over the centers. In this case the gradient of each univariate function changes: the gradient with respect to the second half of the parameters now depends on the parameter values.
The parameter dependence of the gradient increases the complexity of optimization algorithms since precomputation of the gradients cannot be performed. In this case we have and .
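A sketch of this parameter dependence, again with an illustrative squared-exponential kernel whose weights and centers are both free (an Equation (11)-style parameterization; names are hypothetical):

```python
import numpy as np

def f_and_grad(weights, centers, x, ell=0.5):
    """Moving-center kernel expansion. The gradient with respect to a
    weight is still just the kernel value, but the gradient with respect
    to a center depends on the current parameter values, so it cannot be
    precomputed before optimization."""
    k = np.exp(-(x - centers) ** 2 / ell ** 2)
    f = weights @ k
    dw = k                                              # d f / d weights
    dc = weights * k * 2.0 * (x - centers) / ell ** 2   # d f / d centers
    return f, dw, dc
```

Because `dc` involves both `weights` and `centers`, it must be recomputed at every optimization iteration, which is the source of the increased cost.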
5.4 Summary of computational complexity
The complexity of evaluating the gradient of the least squares objective function
is dominated by the cost of evaluating the gradient of at points. Consequently, the total complexity is times that provided by Proposition 2, for a total cost of operations. The storage cost need not increase because one can evaluate the sum by iterating sequentially through each sample.
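The sequential accumulation can be sketched as follows, with a stand-in `predict_and_grad` in place of the FT evaluation-plus-gradient of Proposition 2 (all names here are illustrative):

```python
import numpy as np

def lsq_gradient(predict_and_grad, params, xs, ys):
    """Gradient of the least-squares objective, accumulated one sample
    at a time so storage does not grow with the number of samples.

    predict_and_grad(params, x) -> (f(x), df/dparams) stands in for the
    FT evaluation and gradient of Proposition 2."""
    grad = np.zeros_like(params)
    for x, y in zip(xs, ys):
        f, df = predict_and_grad(params, x)
        grad += 2.0 * (f - y) * df  # running sum; no per-sample storage
    return grad / len(xs)
```

Only the running sum is stored, so the memory footprint matches that of a single gradient evaluation regardless of the number of training samples.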
Now that we have described the computational complexity of the gradient computation, we summarize the computational complexity of the proposed optimization algorithms. Table 1 shows the computational complexity of the all-at-once optimization scheme when using a low-memory BFGS solver. Three parameterizations are considered: a linear parameterization with identical parameterizations of each univariate function in a particular core, a more general parameterization with varying linear parameterizations within each core, and the most general case of nonlinear parameterization of each function. We see that using linear parameterizations allows us to precompute certain gradients before running the optimizer. This precomputation reduces the online cost of optimization. The computational complexity of the full solution is dominated by the number of evaluations and gradients of the objective function, and we denote this number as .
Parameterization type  Offline cost/storage  Eval. and grad. ()  Full solution 

Shared/linear  /  
General/linear  /  
Nonlinear  N/A 
The computational cost of stochastic gradient descent is given in Table 2. In this case the cost per epoch (one pass through all of the training points) is the same as a single gradient evaluation in the all-at-once approach. The total cost of such a scheme is dominated by how many samples are used during the optimization procedure; in the associated table, we represent this number as the number of times one iterates through all of the samples. A fully online algorithm would not have any associated offline cost, and its complexity would be equivalent to the nonlinear parameterization case.
Parameterization type  Offline cost/storage  Eval. and grad. per sample ()  Full solution 

Shared/linear  /  
General/linear  /  
Nonlinear  N/A 
Finally, the alternating optimization scheme is of a slightly different nature. In the case of linear parameterization, each suboptimization problem is quadratic and can be posed as solving a linear system by setting the right hand side of Equation (17) to zero. This linear system has equations and unknowns, and its solution using a direct method has an asymptotic complexity of . We also introduce a new constant that represents how many sweeps through all of the dimensions are required. We see that in the most general case, this algorithm is times more expensive than all-at-once. However, this number is somewhat deceptive since each subproblem has only unknowns, and therefore we can assume that . In Table 3 we summarize the costs of this algorithm.
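A sketch of one such linear subproblem (the data, dimensions, and helper name below are illustrative assumptions): holding all other cores fixed, the FT output is linear in the current core's parameters, so the update reduces to a standard least-squares solve.

```python
import numpy as np

def als_core_update(A, y):
    """One alternating-least-squares subproblem for a linearly
    parameterized core. With all other cores fixed, f(x_n) = A[n] @ theta,
    so the optimum solves the normal equations of a linear least-squares
    problem."""
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))       # per-sample gradients w.r.t. theta
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
theta = als_core_update(A, A @ theta_true)  # exact recovery, noiseless data
```

In practice the direct solve is what gives the per-subproblem asymptotic complexity quoted above, since the number of unknowns per core is small.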
Parameterization type  Offline cost/storage  Solution of subproblem ()  Full solution 

Shared/linear  /  
General/linear  /  
Nonlinear  N/A 
The summaries above were provided for the pure least squares regression problem. If we consider the regularization term of Equation (15) then an additional step must be performed. The gradient of the functional requires computing the gradient of each of the summands, which can be written as
for . In other words, to compute the gradient of every element in the sum in Equation (15), one computes the inner product between the original function and the function representing its partial derivative. In the case of the linear parameterizations described above, the partial derivative is simply a scalar, and only the integral of the univariate function needs to be computed. Otherwise, the full inner product must be evaluated; however, we note that this integral is often available analytically or in closed form, depending on the type of function. For example, the cost is for orthonormal basis functions and more generally, to obtain the gradient with respect to every parameter of a univariate function.
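The orthonormal-basis case mentioned above can be sketched directly (the function name is hypothetical): orthonormality gives ∫ f φ_i = c_i, so the squared norm of the expansion is just the squared norm of its coefficients, and each partial derivative is 2c_i.

```python
import numpy as np

def sq_norm_and_grad(coeffs):
    """Regularizer ∫ f(x)^2 dx and its gradient for an orthonormal-basis
    expansion f(x) = sum_i c_i phi_i(x). Orthonormality collapses the
    integral to |c|^2, so each partial derivative is simply 2 c_i."""
    return coeffs @ coeffs, 2.0 * coeffs
```

This is why the per-function gradient cost is linear in the number of basis coefficients in the orthonormal case.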
6 Experiments
In this section, we provide experimental validation of our approach. Our validation is performed on a synthetic function, approximation benchmark problems, and several real-world data sets. The synthetic examples are used to show the convergence of our approximations with an increasing number of samples using the various optimization approaches. We also include comparisons between the nonlinearly parameterized FT and the linearly parameterized FTc representations. Furthermore, we show the effectiveness of our rank adaptation approach. The real-world data sets indicate the viability of FT-based regression for a wide variety of application areas and indicate the pervasiveness of low-rank structure. The Compressed Continuous Computation () library (Gorodetsky, 2014–2015) was used for all of the experiments and is publicly available.
6.1 Comparison of optimization strategies
We first compare the convergence, with respect to the number of samples, of three optimization algorithms. We use three synthetic test cases with known rank and parameterization. The first two functions are from a commonly used benchmark database (Surjanovic and Bingham, 2017). The third function is an FT-rank 2 function that is commonly used to demonstrate the performance of low-rank algorithms.
The first function is six dimensional and called the OTL circuit function. It is given by
(23) 
with and variable bounds , , , , and . This function has a decaying spectrum due to the complicated sum of variables in the denominators, and it provides an important test problem for our proposed rank adaptation scheme. Before exploring rank adaptation, we first explore the effectiveness of the three optimization algorithms in the context of fixed rank and number of univariate parameters . Specifically, we compute the relative squared error, using 10,000 validation samples, for an increasing number of training samples. We use a stopping criterion of
for the difference between objective values for the gradient-based techniques and for the difference between functions of successive sweeps for ALS, though we have found that the results for these functions are not too sensitive to this tolerance. We solve 100 realizations of the optimization problem for each sample size and report the median, 25th, and 75th quantiles. The univariate functions are parameterized using Legendre polynomials because we are using a domain with uniform measure.
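For reference, a sketch of the OTL circuit function, with the constants and variable bounds as given in the cited benchmark library of Surjanovic and Bingham (2017) rather than in this paper's text:

```python
import numpy as np

def otl_circuit(Rb1, Rb2, Rf, Rc1, Rc2, beta):
    """OTL circuit function: midpoint voltage Vm of a push-pull circuit.

    Constants and bounds follow the Surjanovic and Bingham benchmark
    library: Rb1 in [50, 150], Rb2 in [25, 70], Rf in [0.5, 3],
    Rc1 in [1.2, 2.5], Rc2 in [0.25, 1.2], beta in [50, 300]."""
    Vb1 = 12.0 * Rb2 / (Rb1 + Rb2)
    denom = beta * (Rc2 + 9.0) + Rf
    return ((Vb1 + 0.74) * beta * (Rc2 + 9.0) / denom
            + 11.35 * Rf / denom
            + 0.74 * Rf * beta * (Rc2 + 9.0) / (denom * Rc1))

Vm = otl_circuit(100.0, 50.0, 1.0, 2.0, 1.0, 100.0)
```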
Figure 2 demonstrates that stochastic gradient descent and the all-at-once approach tend to outperform alternating least squares. This performance benefit is greater in regimes of small sample sizes and large numbers of unknowns (large and ). In the case of the largest number of unknowns, in Figure 1(c), our gradient-based methods obtain an error that is several orders of magnitude lower than that obtained by ALS.
Next we consider two cases in which the rank is known. The wing weight function is ten dimensional and given by
(24) 
with variable bounds , , , , , , , , , . Using the variable ordering above, the rank of this function is . The results in Figure 3 indicate the same qualitative performance of the three optimization methods. However, the difference in this case is that SGD is not significantly better than the all-at-once (AAO) approach for low sample sizes. In the third panel we see that SGD levels off around a relative squared error of . For such small errors we have found it difficult to converge SGD to smaller errors because of the tuning parameters involved in ADAM. In particular, the final error tolerance becomes sensitive to the choice of initial learning rate.
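For reference, a sketch of the wing weight function, again with the constants and bounds as specified in the cited benchmark library of Surjanovic and Bingham (2017) rather than in this paper's text:

```python
import numpy as np

def wing_weight(Sw, Wfw, A, Lam_deg, q, lam, tc, Nz, Wdg, Wp):
    """Wing weight function; Lam_deg is the quarter-chord sweep angle in
    degrees. Bounds per the benchmark library: Sw in [150, 200], Wfw in
    [220, 300], A in [6, 10], Lam_deg in [-10, 10], q in [16, 45],
    lam in [0.5, 1], tc in [0.08, 0.18], Nz in [2.5, 6],
    Wdg in [1700, 2500], Wp in [0.025, 0.08]."""
    cosL = np.cos(np.deg2rad(Lam_deg))
    return (0.036 * Sw**0.758 * Wfw**0.0035
            * (A / cosL**2)**0.6 * q**0.006 * lam**0.04
            * (100.0 * tc / cosL)**-0.3 * (Nz * Wdg)**0.49
            + Sw * Wp)
```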
The final test function is six dimensional and commonly used for testing tensor approximation because it has a TT-rank of 2 (Oseledets and Tyrtyshnikov, 2010). This function is a sine of sums:
(25) 
In Figure 4 we again see that the gradient-based approaches are more effective for small data sets, but all approaches achieve essentially the same minimum as the number of samples increases.
6.2 Linear vs nonlinear approximation
Next we compare the FT and FTc representations with different basis functions. Specifically, using the OTL function (23), we compare kernels at fixed locations with kernels at optimized locations. For the linear approximation we use 8 kernels with fixed centers, and for the nonlinear approximation we use 4 kernels with moving centers (for a total of for both approximation types). The results of this study are shown in Figure 5. Using AAO optimization we see that, for this problem, the nonlinear parameterization of kernels with moving centers provides a more effective representation. Specifically, we achieve an order of magnitude reduction in error when using the nonlinear moving-center representation.