# Tensor-Train Networks for Learning Predictive Modeling of Multidimensional Data

Deep neural networks have attracted the attention of the machine learning community because of their appealing data-driven framework and their performance in several pattern recognition tasks. On the other hand, there are many open theoretical problems regarding the internal operation of the network, the necessity of certain layers, hyperparameter selection, etc. A promising strategy is based on tensor networks, which have been very successful in physical and chemical applications. In general, higher-order tensors are decomposed into sparsely interconnected lower-order tensors. This is a numerically reliable way to avoid the curse of dimensionality and to provide a highly compressed representation of a data tensor, in addition to the good numerical properties that allow one to control the desired accuracy of approximation. In order to compare tensor and neural networks, we first consider the identification of the classical Multilayer Perceptron using the Tensor-Train format. A comparative analysis is also carried out in the context of prediction of the noisy chaotic Mackey-Glass time series and of the NASDAQ index. We show that the weights of a multidimensional regression model can be learned by means of tensor networks with the aim of providing a powerful compact representation that retains the accuracy of neural networks. Furthermore, an algorithm based on alternating least squares is proposed for approximating the weights in TT-format with a reduced computational cost. By means of a direct expression, we approximate the core estimation as the conventional solution of a general regression model, which allows us to extend the applicability of tensor structures to different algorithms.


## 1 Introduction

Mathematical models, frequently based on assumptions about the nature of a physical phenomenon, can be employed to simulate the relations between input and output parameters in order to interpret the inherent interactions and to predict a certain time-varying process. Hence, several modeling techniques have been developed with the aim of providing a more adequate understanding of complex systems. Time series analysis and forecasting are widely present in different fields, such as econometrics, dynamic systems theory, statistics, pattern recognition, physiological systems modeling, climatology, and biological system identification, among others GoHy:06; BoJeReLj:15. The behavior of these phenomena is influenced by several factors that can affect the development of the system over time, and what is desired is their interpretation through a proper model and/or the prediction of future values based on past observations.

For several decades, time series analysis has been approached by means of stochastic models BoJeReLj:15; Pa:83, artificial neural networks (ANNs) ZhPaHu:98; AdCo:98; Zh:12, and models based on support vector machines (SVMs) CoVa:95; GeSuBaLaLaVaMoVa:01; CaTa:03; RaLuSa:03. One of the most classical approaches to represent linear time series is the auto-regressive moving average (ARMA) model, which combines the concepts of auto-regressive (AR) and moving-average (MA) models. However, in practice many time series are non-stationary in nature and ARMA models are inadequate to properly describe this behavior. As an alternative, the auto-regressive integrated moving average (ARIMA) model and its variations have been proposed for non-stationary data, as well as to detect certain trend and seasonal patterns. Since linear models are insufficient in many real applications, several methods have been developed to capture non-linear characteristics, such as the non-linear moving average model Ro:77 and the class of auto-regressive conditional heteroskedasticity (ARCH) models introduced by Engle En:82. Nonlinear models are appropriate for predicting volatility changes in financial time series Ts:10.

SVMs solve pattern classification problems by building maximum margin hyperplanes. They can solve non-linear problems by applying the kernel trick to calculate inner products in a feature space AiBrRo:06. In DrBuKaSmVa:96, the concept of SVM was extended to encompass regression analysis, and other techniques have since been developed based on this extension, such as least-squares SVM (LS-SVM) SuVa:99 and Bayesian SVM PoSc:11.

ANNs, originally developed to model basic biological neural systems, have been applied to a wide variety of tasks, such as image recognition, data mining, classification, and regression analysis, among others. Due to their universal approximation property HoStWh:89; HoStWh_90; Ho:91 and the fact that no a priori assumption about the statistical distribution of the data is needed, ANNs have become a powerful tool. Hence, in the last decades, a wide range of time series analysis and forecasting applications have been solved by neural networks (NNs) ZhPaHu:98; Zh:12. The most widely used ANNs for regression analysis are multilayer perceptrons (MLPs) with non-linear activation functions, which are composed of an input layer, one or more hidden layers, and an output layer of nodes.

In ANNs, non-linearity is commonly introduced by activation functions that model the outputs of intermediate and/or final layers with the aim of computing more complex problems, which is valuable for most ANN applications. According to Cy:89, two-layer NNs with a non-linear activation function can be proven to be universal function approximators. This function is usually selected according to heuristic rules or desired properties; some common choices are: the rectified linear unit (ReLU, mostly used in convolutional networks), softmax (used in multi-class classification), the logistic sigmoid (used in binary predictions), and the hyperbolic tangent (Tanh), among other variations. The Tanh function is just a scaled and shifted version of the logistic sigmoid but, in addition, it is an anti-symmetric function. Non-symmetric functions such as the sigmoid tend to introduce a source of systematic bias, which can result in getting stuck during training. Therefore, the Tanh function is a more convenient alternative for overcoming this problem and also yields faster convergence than non-symmetric activation functions Ha:98.
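The relation between the two functions can be checked numerically; the snippet below (plain Python, illustrative values only) verifies the identity tanh(z) = 2σ(2z) − 1 and the anti-symmetry of Tanh:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    # tanh(z) = 2*sigmoid(2z) - 1: a scaled and shifted logistic sigmoid
    return 2.0 * sigmoid(2.0 * z) - 1.0

for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(tanh_via_sigmoid(z) - math.tanh(z)) < 1e-12
    # anti-symmetry: tanh(-z) = -tanh(z)
    assert abs(tanh_via_sigmoid(-z) + tanh_via_sigmoid(z)) < 1e-12
```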

There is no algorithm for obtaining the global optimal solution of a general non-linear optimization problem in a reasonable amount of time. Besides that, MLPs are usually trained by means of the standard error back-propagation algorithm, which is based on the well-known gradient-descent (GD) algorithm. The error with respect to the desired response is propagated through the network and allows the adjustment of network weights.

Deep neural networks (DNNs) are NNs with a certain level of complexity, which can be described as ANNs with an expressive number of layers connecting input and output GoBeCo:16. Deep learning neural networks have attracted the attention of the machine learning community because of their appealing data-driven framework and their remarkable performance in a number of pattern recognition tasks. It is well known that state-of-the-art DNNs are highly redundant and contain hundreds of millions of parameters, using up all the available memory of personal computers. However, attempts to decrease the width and depth of the neural network layers usually lead to a considerable drop in performance.

To overcome the limitations inherent to modern DNNs, there is a need for the development of new fast learning algorithms and the application of special data formats for storing the parameters of such networks. Current advances in NNs are in most cases associated with heuristic construction of the network architecture and applicable only to a particular problem. On the other hand, there is no understanding of the internal modus operandi of the network, of the necessity or redundancy of certain layers, or of the optimal methods to choose hyper-parameters, among others. The lack of comprehensive scientific answers to the above questions essentially limits the qualitative development of neural network methods. Furthermore, it is important to reduce the computing requirements of modern DNNs, and a very promising strategy is based on tensor networks (TNs) Ci:14.

A tensor network (TN) Ci:14 generally decomposes higher-order tensors into sparsely interconnected matrices or lower-order tensors, and is related to the concept of compression. Tensor networks are one of the most successful tools in quantum information theory, and are a way of representing multi-dimensional arrays with a small number of parameters. Two special cases of TN are the Tensor-Train (TT) Os:11; OsTy:09 and Quantized Tensor-Train (QTT) Os:10 decompositions. The TT network is the simplest tensor network, while QTT is a version of TT for problems of small dimension (vectors and matrices) obtained by introducing virtual dimensions.
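To see why the TT format is attractive, one can compare parameter counts. The sketch below uses illustrative mode sizes and ranks (not values from the paper): a full 8th-order tensor with 10 entries per mode stores 10^8 coefficients, while its TT representation with rank 4 stores only about a thousand.

```python
# Parameter count of a full N-way tensor versus its TT representation.
# Mode sizes and ranks here are illustrative choices, not from the paper.
def full_params(dims):
    n = 1
    for d in dims:
        n *= d
    return n

def tt_params(dims, ranks):
    # ranks has length N+1 with ranks[0] = ranks[N] = 1; core n holds
    # ranks[n] * dims[n] * ranks[n+1] entries
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))

dims = [10] * 8                  # an 8th-order tensor, 10 entries per mode
ranks = [1] + [4] * 7 + [1]      # modest TT-ranks
print(full_params(dims), tt_params(dims, ranks))  # 100000000 vs 1040
```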

DNNs allow learning multiple related tasks simultaneously and using the information learned from each task in the other layers. This is known as a cross-task sharing structure. Nonetheless, it is not simple to determine or specify the number of tasks per layer and the appropriate sharing structure. Multidimensional data can be naturally constructed by stacking the parameters of many tasks in a higher-order tensor. In YaHo:16, the authors proposed a generalization of shallow multi-task learning (MTL) methods to the deep network context with the purpose of applying tensor factorization of DNNs for multi-task representation.

Thus their deep multi-task representation is trained via standard back-propagation. The rank of each layer is set based on a single task learning initialization. As in KuDa:12, this framework is not sensitive to rank choice as long as it is large enough. Consequently, their method automatically learns the shared representation parameters across the tasks, thereby significantly reducing the space of DNN design choices and not requiring user trial and error.

In NoPoOsVe:15, the authors investigated perspectives of applying the QTT decomposition to the weight matrix of the fully connected layer of a DNN. A new formulation of the stochastic gradient descent method was developed for training the network in TT-format using standard approaches. Preliminary results demonstrate the ability to compress the fully connected layer of an ultra-deep DNN by more than 200,000 times.

To tackle the issues associated with ANNs, and thanks to the advantages of tensor approaches and their recent results, the present paper focuses on using TT networks to construct a compact representation of NNs and to directly learn the coefficients of the TT network from given data with the aim of solving a regression problem. The prediction of time series plays an important role in many applied fields such as engineering, biology, physics, and meteorology. In our work we have considered two different scenarios: a noisy chaotic time series, generated by the Mackey-Glass equation, and a real financial time series, given by the NASDAQ index.

The Mackey–Glass system was introduced as a model of white blood cell production MaGl:77, which is usually modeled by delay-differential equations. Diseases can be associated with unstable oscillations observed in complex mathematical models of human physiological systems. The Mackey-Glass equation provides a range of periodic and chaotic dynamics, which makes it possible to represent the dynamics found in human physiology. Due to its dynamic properties and mathematical simplicity, the Mackey-Glass time series has been employed to validate prediction methods through the forecasting of chaotic time series ChChMu:96; LeLoWa:01; GuWa:07; Mi:09; KoFuLiLe:11.

Because of their ability to perform complex non-linear modeling, ANNs have been employed in financial applications in order to predict market behavior, as well as to provide a powerful statistical modeling technique that can be an alternative to traditional methodologies, such as ARMA, ARCH, and Generalized ARCH (GARCH) models TaAlPa:91; Ga:95; ChAyBo:00. In this sense, we have considered NASDAQ index forecasting in order to study the learned model in a real case.

This paper is organized as follows. Section 2 describes and discusses the learning model based on TT networks, proposing a reduction of the computational cost and deriving a regularization matrix factor. Section 3 analyzes the optimization framework and discusses an alternative strategy to reduce the computational cost of pseudo-inverse calculation. Section 4 discusses some general considerations regarding tensor and neural networks. In Section 5, a comparative analysis is carried out in the context of neural network recovery and non-linear prediction of two time series. Finally, Section 6 presents some conclusions.

Notations and property: Scalars, column vectors, matrices, and higher-order tensors are written with lower-case, boldface lower-case, boldface upper-case, and calligraphic letters, i.e. $(a, \mathbf{a}, \mathbf{A}, \mathcal{A})$, respectively. $\mathbf{A}^T$ and $\mathbf{A}^{-1}$ stand for the transpose and inverse of $\mathbf{A}$, respectively. $\mathbf{0}$ stands for a column vector of zeros, $\mathbf{I}_N$ is the identity matrix of order $N$, $\mathcal{N}(\cdot)$ denotes the null-space of a matrix, $\|\cdot\|_2$ is the Euclidean norm, $\|\cdot\|_F$ is the Frobenius norm, whereas $\langle\cdot,\cdot\rangle$ denotes the scalar product of two tensors (extension of the classical scalar product between two vectors). The operators $\mathrm{vec}(\cdot)$ and $\mathrm{unfold}(\cdot)$ form a column vector by stacking the columns of its tensor argument and a matrix by arranging the modes of an input tensor, respectively. The $n$-mode product of a tensor and a vector, written $\mathcal{A}\times_n \mathbf{v}$, represents a contraction of the tensor to a lower-order tensor. The Kronecker and Khatri-Rao products are denoted by $\otimes$ and $\diamond$, respectively. Given $\mathbf{A}$ and $\mathbf{B}$, the Kronecker and Khatri-Rao products are related according to $\mathbf{A}\diamond\mathbf{B} = [\mathbf{a}_1\otimes\mathbf{b}_1 \;\cdots\; \mathbf{a}_J\otimes\mathbf{b}_J]$, where the vectors $\mathbf{a}_j$ and $\mathbf{b}_j$ denote the $j$-th columns of $\mathbf{A}$ and $\mathbf{B}$. A useful Kronecker property is given by

$$\mathrm{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{B}). \tag{1}$$
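Property (1) can be verified numerically, keeping in mind that vec(·) stacks columns, so column-major (Fortran) order must be used; a minimal NumPy check with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

# vec(.) stacks columns, i.e. column-major (Fortran) order
vec = lambda M: M.reshape(-1, order="F")

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
assert np.allclose(lhs, rhs)   # vec(ABC) = (C^T kron A) vec(B)
```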

## 2 Learning of predictive model

In supervised machine learning, given a training dataset of pairs $(\mathbf{x}^{(m)}, y^{(m)})$, for $m = 1, \dots, M$, where each input vector $\mathbf{x}^{(m)}$ is associated with a desired output $y^{(m)}$, the target output can be predicted according to the following model:

$$\hat{y}^{(m)} \triangleq \langle \mathcal{W}, \Phi(\mathbf{x}^{(m)}) \rangle = \mathcal{W} \times_1 \phi(x^{(m)}_1) \cdots \times_N \phi(x^{(m)}_N), \tag{2}$$

where each $n$-th input entry is mapped onto a higher-dimensional space through a feature map $\phi(\cdot)$. The weighting tensor $\mathcal{W}$ determines how each feature affects the prediction.

The most common method used for fitting regression problems is based on the least squares (LS) method. Thus, the predictors resulting from this model, i.e., those based on $\mathcal{W}$, can be learned by minimizing the mean squared error (MSE) function:

$$l(\mathcal{W}) = \frac{1}{M}\,\|\hat{\mathbf{y}} - \mathbf{y}\|_2^2, \tag{3}$$

where $\mathbf{y}$ and $\hat{\mathbf{y}} \in \mathbb{R}^M$ denote respectively the concatenation of all desired outputs and of their predictions associated with the input vectors $\mathbf{x}^{(m)}$.

Feature functions, as well as the weighting tensor, can be exponentially large. In our case, both $N$-th order tensors $\Phi(\mathbf{x}^{(m)})$ and $\mathcal{W}$ have $\prod_{n=1}^{N} S_n$ components. An interesting way to reduce the number of coefficients is to impose a particular structure on $\mathcal{W}$, such as the Tensor-Train (TT) decomposition Os:11, since it is given by a sequence of low-order tensors in accordance with

$$w_{s_1,s_2,\dots,s_N} = \sum_{r_1,\dots,r_{N-1}} g^{(1)}_{r_0,s_1,r_1} \cdots g^{(N)}_{r_{N-1},s_N,r_N} = \sum_{r_1,\dots,r_{N-1}} \prod_{n=1}^{N} g^{(n)}_{r_{n-1},s_n,r_n}, \tag{4}$$

where each tensor, called a TT-core, is denoted by $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1}\times S_n\times R_n}$ for all $n \in \{1,\dots,N\}$, with $r_n \in \{1,\dots,R_n\}$ and $R_0 = R_N = 1$.

The TT-rank of a tensor is given by the rank of each unfolded matrix of the TT-core constructed by fixing the second mode, i.e. it is bounded by the minimum between $R_{n-1}$ and $R_n$. Thus the dimensions $R_{n-1}$ and $R_n$ allow one to control the trade-off between the representational power and the computational complexity of the TT decomposition; in general, the TT-rank is constrained by an upper bound $R$, i.e. $R_n \leq R$.
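As a sanity check of the format in (4), the sketch below builds random TT-cores with illustrative sizes and contracts them left to right into the full tensor:

```python
import numpy as np

rng = np.random.default_rng(1)
S = [4, 5, 6, 3]                 # mode sizes S_1..S_N (illustrative)
R = [1, 2, 3, 2, 1]              # TT-ranks with R_0 = R_N = 1
cores = [rng.standard_normal((R[n], S[n], R[n + 1])) for n in range(len(S))]

# Contract the train left to right: W[s1,...,sN] = G1[s1] G2[s2] ... GN[sN]
W = cores[0]                      # shape (1, S1, R1)
for G in cores[1:]:
    W = np.einsum('...i,isj->...sj', W, G)
W = W.reshape(S)                  # drop the boundary ranks R_0 = R_N = 1

assert W.shape == (4, 5, 6, 3)
```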

Assuming a TT-format for the weighting tensor $\mathcal{W}$ in (2), we can rewrite the scalar product in (2) by isolating the $k$-th core in terms of Kronecker products as follows

$$\hat{y}^{(m)} = \prod_{n=1}^{N} \mathcal{G}^{(n)} \times_n \phi(x^{(m)}_n) = \left(\phi(x^{(m)}_k) \otimes \mathbf{p}^{-}_{k-1}(\mathbf{x}^{(m)}) \otimes \mathbf{p}^{+}_{k+1}(\mathbf{x}^{(m)})\right)^T \mathrm{vec}\!\left(\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k}\right), \tag{5}$$

where the vectors $\mathbf{p}^{-}_{k-1}$ and $\mathbf{p}^{+}_{k+1}$ represent respectively the contraction of the left and right sides of the TT structure, i.e.

$$\begin{cases} \mathbf{p}^{-}_{k-1}(\mathbf{x}^{(m)}) \triangleq \prod_{n=1}^{k-1} \mathcal{G}^{(n)} \times_n \phi(x^{(m)}_n) \in \mathbb{R}^{R_{k-1}}, \\[4pt] \mathbf{p}^{+}_{k+1}(\mathbf{x}^{(m)}) \triangleq \prod_{n=k+1}^{N} \mathcal{G}^{(n)} \times_n \phi(x^{(m)}_n) \in \mathbb{R}^{R_k}. \end{cases} \tag{6}$$

Observe that both vectors $\mathbf{p}^{-}_{k-1}$ and $\mathbf{p}^{+}_{k+1}$ can be rewritten as a function of the previous value, i.e.

$$\begin{cases} \mathbf{p}^{-}_{k-1}(\mathbf{x}^{(m)}) = \mathcal{G}^{(k-1)} \times_1 \mathbf{p}^{-}_{k-2}(\mathbf{x}^{(m)}) \times_2 \phi(x^{(m)}_{k-1}), \\[4pt] \mathbf{p}^{+}_{k+1}(\mathbf{x}^{(m)}) = \mathcal{G}^{(k+1)} \times_2 \phi(x^{(m)}_{k+1}) \times_3 \mathbf{p}^{+}_{k+2}(\mathbf{x}^{(m)}). \end{cases} \tag{7}$$

Thus, by sweeping from left-to-right (or right-to-left), we can use $\mathbf{p}^{-}_{k-2}$ (or $\mathbf{p}^{+}_{k+2}$) to compute $\mathbf{p}^{-}_{k-1}$ (or $\mathbf{p}^{+}_{k+1}$), respectively. Using (7) instead of (6) reduces the computational operations demanded per $k$-th core estimation, in terms of multiplications, since the previous calculation of $\mathbf{p}^{-}_{k-2}$ (or $\mathbf{p}^{+}_{k+2}$) can be reused instead of recomputing the whole left (or right) contraction.
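The left-to-right reuse in (7) amounts to a sequence of small matrix-vector products. The sketch below (illustrative sizes, randomly generated cores and feature vectors) evaluates the model output by one such sweep and cross-checks it against the explicit full contraction:

```python
import numpy as np

rng = np.random.default_rng(2)
S = [3, 3, 3]                 # illustrative mode sizes
R = [1, 2, 2, 1]              # TT-ranks with R_0 = R_N = 1
cores = [rng.standard_normal((R[n], S[n], R[n + 1])) for n in range(len(S))]
phis = [rng.standard_normal(s) for s in S]   # stand-ins for phi(x_n)

# Left-to-right sweep: reuse the running partial product p instead of
# recomputing the whole left contraction for every core.
p = np.ones(1)
for G, phi in zip(cores, phis):
    p = p @ np.einsum('isj,s->ij', G, phi)
y_hat = float(p[0])

# cross-check against the explicit full contraction of the train
W = cores[0]
for G in cores[1:]:
    W = np.einsum('...i,isj->...sj', W, G)
ref = float(np.einsum('abc,a,b,c->', W.reshape(S), *phis))
assert abs(y_hat - ref) < 1e-10
```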

From the concatenation of all outputs $\hat{y}^{(m)}$, and by applying (5), the estimated vector $\hat{\mathbf{y}}$ of the desired vector $\mathbf{y}$ can be expressed in terms of the $k$-th core $\mathcal{G}^{(k)}$, i.e. $\boldsymbol{\theta}_k \triangleq \mathrm{vec}(\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k})$, by

$$\hat{\mathbf{y}} = \left(\boldsymbol{\Phi}_k \diamond \mathbf{P}^{-}_{k-1} \diamond \mathbf{P}^{+}_{k+1}\right)^T \mathrm{vec}\!\left(\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k}\right) = \mathbf{P}_k \boldsymbol{\theta}_k \in \mathbb{R}^M, \tag{8}$$

where $\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k}$ denotes an unfolded matrix of $\mathcal{G}^{(k)}$, the matrices $\boldsymbol{\Phi}_k$, $\mathbf{P}^{-}_{k-1}$, and $\mathbf{P}^{+}_{k+1}$ collect, as columns, the vectors $\phi(x^{(m)}_k)$, $\mathbf{p}^{-}_{k-1}(\mathbf{x}^{(m)})$, and $\mathbf{p}^{+}_{k+1}(\mathbf{x}^{(m)})$ for $m = 1,\dots,M$, and

$$\mathbf{P}_k \triangleq \left(\boldsymbol{\Phi}_k \diamond \mathbf{P}^{-}_{k-1} \diamond \mathbf{P}^{+}_{k+1}\right)^T \in \mathbb{R}^{M\times S_k R_{k-1} R_k}. \tag{9}$$

Remark that the remaining core tensors are absorbed into the matrix $\mathbf{P}_k$, following the manipulations in (5)-(7), and the $k$-th core tensor is isolated in expression (2) with the aim of rewriting the loss function (3) in terms of the $k$-th core tensor. The importance of this procedure will become clearer in the next section.

Finally, the loss function given in (3) can also be expressed in terms of the vectors $\mathbf{y}$ and $\hat{\mathbf{y}}$, respectively associated with all target outputs and their predictions, by applying (2) to (3) in the form

$$l(\mathcal{W}) = \frac{1}{M}\,\|\mathbf{P}_k\boldsymbol{\theta}_k - \mathbf{y}\|_2^2. \tag{10}$$

If $\mathbf{P}_k$ has linearly independent columns, then $\mathbf{P}_k^T\mathbf{P}_k$ is non-singular and the solution of the least squares regression given by (10) turns out to be

$$\frac{\partial}{\partial \boldsymbol{\theta}_k} l(\mathcal{W}) = \frac{2}{M}\left(\mathbf{P}_k^T\mathbf{P}_k\boldsymbol{\theta}_k - \mathbf{P}_k^T\mathbf{y}\right), \qquad \hat{\boldsymbol{\theta}}_k = \left(\mathbf{P}_k^T\mathbf{P}_k\right)^{-1}\mathbf{P}_k^T\mathbf{y}, \tag{11}$$

where $\hat{\boldsymbol{\theta}}_k$ denotes an estimate of $\boldsymbol{\theta}_k$ and, consequently, an estimate of $\mathcal{G}^{(k)}$.
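A minimal numerical illustration of (11), with a random design matrix standing in for $\mathbf{P}_k$ and a noiseless target so the estimate recovers the true parameters exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
M, D = 200, 5                       # observations and parameters (illustrative)
P = rng.standard_normal((M, D))     # stands in for the design matrix P_k
theta_true = rng.standard_normal(D)
y = P @ theta_true                  # noiseless target

# Normal-equations solution (P^T P)^{-1} P^T y; np.linalg.lstsq is the
# numerically stabler route in practice
theta_hat = np.linalg.solve(P.T @ P, P.T @ y)
assert np.allclose(theta_hat, theta_true)
```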

### 2.1 Shrinkage regularization method

The collinearity (or multicollinearity) phenomenon affects calculations regarding individual predictors, in the sense that one predictor can be linearly determined from the others with a substantial degree of accuracy, which leads to an inversion problem due to the rank deficiency of $\mathbf{P}_k^T\mathbf{P}_k$. In order to ensure that $\mathbf{P}_k^T\mathbf{P}_k$ is not ill-conditioned due to correlated columns of $\mathbf{P}_k$, i.e. collinear rows of $\boldsymbol{\Phi}_k$, $\mathbf{P}^{-}_{k-1}$, and $\mathbf{P}^{+}_{k+1}$, and owing to the Khatri-Rao structure given by (9), we can consider a regularization term added to the loss function (10). Thus, we are minimizing the following function:

$$l'(\mathcal{W}) = l(\mathcal{W}) + \lambda\, r(\mathcal{W}), \tag{12}$$

where $\lambda$ denotes the regularization or shrinkage factor.

One common option, initially motivated by the need to stabilize the solution (2), is based on the $\ell_2$-norm of the weighting coefficients, also referred to as Tikhonov regularization KeStOr:91. In the statistical literature, it is also known as ridge regression HoKe:70 and the regularization term can be given by

$$r(\mathcal{W}) = \langle \mathcal{W}, \mathcal{W} \rangle = \|\mathcal{W}\|_F^2. \tag{13}$$

In order to obtain a regularization expression in terms of $\boldsymbol{\theta}_k$, assuming a TT-format for the weighting tensor $\mathcal{W}$ in (2), we can rewrite the scalar product in (13) by isolating the $k$-th core and recursively contracting the remaining left and right cores as follows

$$\begin{aligned} \tilde{\mathbf{G}}^{(k-1)}_{-} &\triangleq \mathrm{unfold}\!\left(\prod_{n=2}^{k-1}\mathcal{G}^{(n)}\times_1\tilde{\mathbf{G}}^{(n-1)}_{-}\right) &&\in \mathbb{R}^{S_1\cdots S_{k-1}\times R_{k-1}}, \\ \tilde{\mathbf{G}}^{(k+1)}_{+} &\triangleq \mathrm{unfold}\!\left(\prod_{n=k+1}^{N-1}\mathcal{G}^{(n)}\times_3\tilde{\mathbf{G}}^{(n+1)}_{+}\right)^{T} &&\in \mathbb{R}^{S_{k+1}\cdots S_N\times R_k}, \end{aligned} \tag{14}$$

where $\tilde{\mathbf{G}}^{(n)}_{-}$ and $\tilde{\mathbf{G}}^{(n)}_{+}$ are recursively obtained according to

$$\tilde{\mathbf{G}}^{(n)}_{-} \triangleq \begin{cases} \mathbf{G}^{(1)} \in \mathbb{R}^{S_1\times R_1}, & n = 1, \\ \mathrm{unfold}\!\left(\mathcal{G}^{(n)}\times_1\tilde{\mathbf{G}}^{(n-1)}_{-}\right), & 2 \le n \le k-1, \end{cases} \qquad \tilde{\mathbf{G}}^{(n)}_{+} \triangleq \begin{cases} \mathrm{unfold}\!\left(\mathcal{G}^{(n)}\times_3\tilde{\mathbf{G}}^{(n+1)}_{+}\right)^{T}, & k+1 \le n \le N-1, \\ \mathbf{G}^{(N)T} \in \mathbb{R}^{S_N\times R_{N-1}}, & n = N, \end{cases} \tag{15}$$

with $\mathrm{unfold}(\cdot)$ defining the unfolding operation of a tensor into a matrix by contracting the first two modes into one mode.

Finally, we can represent the tensor of coefficients $\mathcal{W}$, defined in (2), from (14)-(15) by means of an unfolded matrix as follows

$$\mathrm{unfold}(\mathcal{W}) = \left(\tilde{\mathbf{G}}^{(k-1)}_{-}\otimes\tilde{\mathbf{G}}^{(k+1)}_{+}\right)\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k} \;\in\; \mathbb{R}^{\prod_{n\neq k} S_n \times S_k}. \tag{16}$$

The vectorization of a higher-order tensor can be derived from the vectorization of a matrix unfolding of this tensor. Observe that the order of the dimensions is quite relevant because it denotes the speed at which each mode changes.

By applying the Kronecker property (1), we can represent the above matrix as a vector given by

$$\mathrm{vec}(\mathcal{W}) = \mathbf{L}_k\,\mathrm{vec}\!\left(\mathbf{G}^{(k)}_{R_{k-1}R_k\times S_k}\right) = \mathbf{L}_k\boldsymbol{\theta}_k, \quad \text{with} \quad \mathbf{L}_k \triangleq \mathbf{I}_{S_k}\otimes\left(\tilde{\mathbf{G}}^{(k-1)}_{-}\otimes\tilde{\mathbf{G}}^{(k+1)}_{+}\right) = \mathbf{I}_{S_k}\otimes\mathbf{B}_k. \tag{17}$$

From (13)-(17), we can write the regularization term as a function of the $k$-th core $\mathcal{G}^{(k)}$, given by $\boldsymbol{\theta}_k$, according to

$$r(\mathcal{W}) = \|\mathrm{unfold}(\mathcal{W})\|_F^2 = \|\mathrm{vec}(\mathcal{W})\|_2^2 = \|\mathbf{L}_k\boldsymbol{\theta}_k\|_2^2, \tag{18}$$

and the gradient vector with respect to $\boldsymbol{\theta}_k$ is

$$\frac{\partial}{\partial\boldsymbol{\theta}_k} r(\mathcal{W}) = 2\mathbf{L}_k^T\mathbf{L}_k\boldsymbol{\theta}_k = 2\left(\mathbf{I}_{S_k}\otimes\mathbf{B}_k^T\mathbf{B}_k\right)\boldsymbol{\theta}_k. \tag{19}$$

Regarding the linear LS problem based on the loss function (12), i.e.

$$\underset{\boldsymbol{\theta}_k}{\text{minimize}}\;\; \frac{1}{M}\|\mathbf{P}_k\boldsymbol{\theta}_k - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}_k\boldsymbol{\theta}_k\|_2^2, \tag{20}$$

and under the assumption that the null-spaces of $\mathbf{P}_k$ and $\mathbf{L}_k$ intersect only trivially, i.e.

$$\mathcal{N}(\mathbf{P}_k)\cap\mathcal{N}(\mathbf{L}_k) = \{\mathbf{0}\}, \tag{21}$$

the LS problem (20) has a unique solution for any $\lambda > 0$, given by Lo:76; El:77; Ha:89

$$\frac{\partial}{\partial\boldsymbol{\theta}_k} l'(\mathcal{W}) = \left(\frac{2}{M}\mathbf{P}_k^T\mathbf{P}_k + 2\lambda\mathbf{L}_k^T\mathbf{L}_k\right)\boldsymbol{\theta}_k - \frac{2}{M}\mathbf{P}_k^T\mathbf{y}, \qquad \left(\mathbf{P}_k^T\mathbf{P}_k + \lambda M\mathbf{L}_k^T\mathbf{L}_k\right)\boldsymbol{\theta}_k = \mathbf{P}_k^T\mathbf{y}, \tag{22a}$$

$$\hat{\boldsymbol{\theta}}_k = \left(\mathbf{P}_k^T\mathbf{P}_k + \lambda M\mathbf{L}_k^T\mathbf{L}_k\right)^{-1}\mathbf{P}_k^T\mathbf{y}. \tag{22b}$$

In case the condition (21) is not met, the solution (22b) is not unique.
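A small NumPy sketch of the regularized solution (22b); the regularization matrix is taken as the identity here for simplicity, which is a special case rather than the Kronecker-structured matrix of (17). The sketch also checks that letting the regularization weight go to zero recovers the ordinary LS solution:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D = 50, 10
P = rng.standard_normal((M, D))
y = rng.standard_normal(M)
L = np.eye(D)        # identity stands in for the structured regularizer
lam = 0.1

# Regularized normal equations (22a): (P^T P + lam*M L^T L) theta = P^T y
theta = np.linalg.solve(P.T @ P + lam * M * L.T @ L, P.T @ y)

# lam -> 0 recovers the ordinary least-squares solution
theta_ols = np.linalg.lstsq(P, y, rcond=None)[0]
theta0 = np.linalg.solve(P.T @ P + 1e-12 * np.eye(D), P.T @ y)
assert np.allclose(theta0, theta_ols, atol=1e-6)
```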

Remark that this regularization term penalizes large values of the weighting coefficients for $\lambda > 0$ and reduces to plain linear regression for $\lambda = 0$ (no regularization). For $\lambda > 0$, it makes the problem nonsingular, as the matrix we need to invert no longer has a determinant near zero, in the sense that its eigenvalues are no longer near zero, which avoids imprecise estimation of the inverse matrix KeStOr:91. Moreover, analogously to principal component analysis (PCA), ridge regression suppresses the low-order principal components, since the parameter $\lambda$ greatly affects the smallest singular values (SV) of $\mathbf{P}_k$, in contrast to the largest SV, which are only slightly affected. In case $\lambda$ is chosen too large, the problem we solve no longer has any connection to the original problem.

Besides solving ill-posed optimization problems, the use of regularization with an appropriate weight decay can also prevent over-fitting by reducing model complexity, through constraints on the model parameters, in order to enhance the compromise between prediction accuracy, i.e. having a small residual error, and robustness and model flexibility. Meanwhile, a very large $\lambda$ leads to under-fitting, since it forces the model towards a constant loss function, i.e. the model becomes too simple to learn the underlying structure of the data GoBeCo:16. Therefore, the adjustment of $\lambda$ controls the effective degrees of freedom of the model HaTiFr:09, i.e. the model's capacity.

Two other common shrinkage methods are: Lasso (Least Absolute Shrinkage and Selection Operator) regression Ti:96, which induces a sparsity constraint by replacing the $\ell_2$-norm with the $\ell_1$-norm, and the Elastic net ZoHa:05, designed to overcome limitations of Lasso (a convex problem but not strictly differentiable), which allows the control of a mix between the regularization terms of Ridge and Lasso HaTiFr:09.

In practical terms, Lasso regression tends to perform feature selection by setting weights near to zero, which leads to ignoring the least important or even irrelevant information in datasets; whereas the Elastic net is preferred when several features are strongly correlated. Additionally, in case the number of observations $M$ is greater than the number of model parameters, ridge regression tends to perform better than the Elastic net ZoHa:05.

There are several variants of the Lasso penalty developed to tackle certain optimization limitations and to address particular problems HaTiFr:09. Furthermore, another alternative is to constrain the norm penalties by inequality constraints given by a constant, which can be dealt with using the Karush-Kuhn-Tucker conditions, an extension of the solution via Lagrange multipliers, which is applicable only to equality constraints. However, analytically deriving a closed-form solution is not always possible, since the penalties can give rise to a non-convex optimization problem (and, consequently, to convergence to local minima), besides the difficulty of determining an appropriate constant GoBeCo:16.

### 2.2 Feature map: Encoding input data

In machine learning, feature maps can be specified in accordance with certain learning tasks in order to exploit the correlation of information inherent in the input data and better classify or estimate it. Thus, input data could implicitly encode localization information with the purpose of associating sets of pixels, for example, to detect a particular object in an image more efficiently. Furthermore, feature mapping can allow non-linearly separable data to become linearly separable by a hyper-plane in a higher-order dimension.

According to (2), the same local feature map, defined by $\phi(\cdot)$, is applied to each input entry. Fitting a linear regression model may not be adequate when interactions between variables are not inherently linear. However, the linear regression framework can still be used if the model is nonlinear but linear with respect to its parameters. This is possible by means of a transformation applied to each input, such as a power or logarithmic transformation, for example. We can include a logarithmic transformation of features by considering an exponential regression model. As an example, for a three-dimensional feature vector, we have

$$\phi(x^{(m)}_n) = \left[\,1 \;\; x^{(m)}_n \;\; \log(x^{(m)}_n)\,\right]^T. \tag{23}$$

Another possible way of generating nonlinear interaction features is to consider a polynomial regression model of degree $p$, which can be expressed by the Vandermonde structure (for $p = 2$) given by

$$\phi(x^{(m)}_n) = \left[\,1 \;\; x^{(m)}_n \;\; x^{(m)2}_n\,\right]^T. \tag{24}$$

Note that a first-order polynomial leads to a multiple linear model, whereas higher orders ($p \geq 2$) allow a better fit for polynomial curves.
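The two maps (23) and (24) are straightforward to implement; the helper names below are ours, for illustration only:

```python
import numpy as np

def phi_log(x):
    # exponential-regression style map (23): [1, x, log x]^T (requires x > 0)
    return np.array([1.0, x, np.log(x)])

def phi_poly(x, p=2):
    # Vandermonde map of degree p, as in (24): [1, x, x^2, ..., x^p]^T
    return np.array([x ** k for k in range(p + 1)])

assert np.allclose(phi_poly(3.0), [1.0, 3.0, 9.0])
```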

Remark that, in our approach, each TT-core is used for mapping the existing interactions between inputs per each categorical feature. Therefore, the number of cores is determined by the number of features for a given data and the feature map regards the structure of inputs by exploiting nonlinear relationships.

## 3 Optimization framework

To design a learning algorithm, the parameters of our model can be derived by minimizing the mean of squared residuals on the training set under the TT-rank constraint. From (3), this leads to

$$\underset{\mathcal{G}^{(1)},\dots,\mathcal{G}^{(N)}}{\text{minimize}}\;\; \frac{1}{M}\|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 \qquad \text{subject to} \quad \text{TT-rank} = R. \tag{25}$$

Several algorithms have been proposed to approximate a TT-tensor by applying well-known approaches to solve (25), such as gradient descent, conjugate gradient, and Riemannian optimization methods StSc:16; NoTrOs:16. However, a weak point of some of these approaches is the estimation of two consecutive cores at a time. After estimating a contraction of both cores, a restoring procedure based on an approximation by means of the SVD or QR decomposition is employed. The performance and convergence behavior of these algorithms are strongly dependent upon the adopted approximation, which is also limited by the TT-rank.

An alternative strategy is to convert the optimization problem (25) into independent linear least squares problems for adaptively estimating only one core tensor at a time by sweeping along all core tensors from left-to-right and right-to-left, by fixing the remaining cores. According to the development made in Section 2, we can rewrite the overall problem (25) with a regularization factor by using (20) as the following optimization approach

$$\underset{\boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_N}{\text{minimize}}\;\left\{\sum_{k=1}^{N}\|\mathbf{P}_k\boldsymbol{\theta}_k - \mathbf{y}\|_2^2 + \lambda M\|\mathbf{L}_k\boldsymbol{\theta}_k\|_2^2\right\} \quad \text{subject to} \quad \text{TT-rank} = R. \tag{26}$$

To reduce the computational effort required to evaluate the solution (22b) for several values of $\lambda$, we can first apply the Generalized Singular Value Decomposition (GSVD) of the matrix pair $(\mathbf{P}_k, \mathbf{L}_k)$, proposed by Van Loan Lo:76, assuming the condition in (21), which is given by

$$\mathbf{P}_k = \mathbf{U}\boldsymbol{\Sigma}_P\mathbf{X}^T, \qquad \mathbf{L}_k = \mathbf{V}\boldsymbol{\Sigma}_L\mathbf{X}^T, \tag{27}$$

where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices, $\boldsymbol{\Sigma}_P$ and $\boldsymbol{\Sigma}_L$ are diagonal matrices, and $\mathbf{X}$ is a non-singular matrix.

By replacing (27) in (22a) and defining $\mathbf{z}_k \triangleq \mathbf{X}^T\boldsymbol{\theta}_k$, after some manipulations we obtain the equivalent problem

$$\left(\boldsymbol{\Sigma}_P^T\boldsymbol{\Sigma}_P + \lambda M\boldsymbol{\Sigma}_L^T\boldsymbol{\Sigma}_L\right)\mathbf{z}_k = \boldsymbol{\Sigma}_P^T\mathbf{U}^T\mathbf{y}, \tag{28a}$$

$$\hat{\boldsymbol{\theta}}_k = \mathbf{X}^{-T}\mathbf{z}_k. \tag{28b}$$

From (28a), the inverse calculation is reduced to inverting each element on the diagonal; the decomposition (27) and the inverse matrix in (28b) are computed just once for several values of $\lambda$.
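The same reuse idea can be illustrated in the special case $\mathbf{L}_k = \mathbf{I}$, where a single SVD of the design matrix plays the role of the GSVD and is computed once for all values of $\lambda$; this is a simplification of the paper's general setting, not its full GSVD procedure:

```python
import numpy as np

rng = np.random.default_rng(5)
M, D = 60, 8
P = rng.standard_normal((M, D))
y = rng.standard_normal(M)

# One SVD of P, reused for every lambda (special case L = I)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
Uty = U.T @ y

def ridge_theta(lam):
    # theta = V diag(s / (s^2 + lam*M)) U^T y: only the diagonal changes
    return Vt.T @ (s / (s ** 2 + lam * M) * Uty)

for lam in (1e-3, 1e-1, 1.0):
    direct = np.linalg.solve(P.T @ P + lam * M * np.eye(D), P.T @ y)
    assert np.allclose(ridge_theta(lam), direct)
```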

There are different approaches to compute the GSVD, or based on the GSVD, such as those discussed in El:82; Ha:89; GoLo:96; MoReSg:07; DyRe:14, with the aim of reducing the computational effort and/or exploiting the structure of the regularization matrix. In El:82; MoReSg:07, the GSVD computations take advantage of the structure of the regularization matrix when it is a band matrix or an orthogonal projection operator, respectively. Additionally, Eldén in El:82 discussed an alternative way to solve (20), when the regularization matrix is not square and invertible, by considering a weighted inverse matrix that allows the transformation of the original problem into a standard-form problem. Unlike those cases, the paper DyRe:14 proposed, based on El:82, a method for computing the GSVD and the truncated GSVD (TGSVD), introduced by Hansen in Ha:89 as a generalization of the truncated SVD, when the regularization matrix does not have an exploitable structure. Furthermore, Dykes and Reichel presented in DyRe:14 an approach for reducing the matrix pair to a pair of simpler matrices in order to reduce the GSVD computations.

Remark that our regularization matrix $\mathbf{L}_k$, defined in (17), is a Kronecker product between $\mathbf{I}_{S_k}$ and $\mathbf{B}_k$. Therefore, it is a band matrix, which makes it possible to exploit the sparseness of its structure in the numerical computations involving the regularization matrix, in accordance with the approaches discussed in El:77; El:82; Bj:88. This analysis was not included in the scope of our study since there are several works on this topic, as commented above.

The algorithmic details of our proposed technique for the multilinear regression model are presented in Algorithm 1. Note that the estimation of each TT-core is conditioned by the knowledge of the previously estimated cores, and an intermediate orthogonalization step is included via the QR decomposition, applied to each reshaped TT-core tensor defined in step 8 (Algorithm 1), with the aim of guaranteeing the left and right orthogonality property of the TT-cores and, consequently, the stability of the algorithm SaOs:11; Os:11. The matrix in step 9 is built taking into account the first columns of the orthogonal factor. The criterion for this selection is detailed in the next sections.

Remark that each core estimation problem can be seen as a layer in the network model, through which the input information flows forward through the network. Hence the estimation of each core propagates the initial information along the whole network, taking into account one feature per layer, and finally produces the desired output. During training, the sweeping procedure, widely applied for approximating the TT structure, also allows the information to flow backwards through the network. Thus it can be analogously associated with back-propagation learning in artificial neural networks.
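A compact sketch of the sweeping idea for a toy TT regression problem (illustrative sizes and random data; the QR orthogonalization step of Algorithm 1 is omitted for brevity). Each core update solves the linear LS problem with the other cores fixed, so the training MSE cannot increase:

```python
import numpy as np

rng = np.random.default_rng(6)
M = 300                               # training samples
S = [3, 3, 3]                         # feature dimension per mode (illustrative)
R = [1, 2, 2, 1]                      # fixed TT-ranks

Phi = [rng.standard_normal((M, s)) for s in S]   # phi(x_n^(m)) per mode
y = rng.standard_normal(M)

cores = [0.5 * rng.standard_normal((R[n], S[n], R[n + 1])) for n in range(len(S))]

def predict():
    # sequential contraction of the train for every sample at once
    p = np.ones((M, 1))
    for G, F in zip(cores, Phi):
        p = np.einsum('mi,isj,ms->mj', p, G, F)
    return p[:, 0]

def update_core(k):
    # build the design matrix P_k with the other cores fixed, then solve LS
    left = np.ones((M, 1))
    for n in range(k):
        left = np.einsum('mi,isj,ms->mj', left, cores[n], Phi[n])
    right = np.ones((M, 1))
    for n in range(len(S) - 1, k, -1):
        right = np.einsum('isj,ms,mj->mi', cores[n], Phi[n], right)
    Pk = np.einsum('mi,ms,mj->misj', left, Phi[k], right).reshape(M, -1)
    theta, *_ = np.linalg.lstsq(Pk, y, rcond=None)
    cores[k] = theta.reshape(cores[k].shape)

mse0 = np.mean((predict() - y) ** 2)
for _ in range(3):                    # a few left-to-right sweeps
    for k in range(len(S)):
        update_core(k)
mse = np.mean((predict() - y) ** 2)
assert mse <= mse0 + 1e-9             # each LS step cannot increase the loss
```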

## 4 General considerations

In regression analysis, it is quite usual to standardize the inputs before solving (26), i.e., to reparametrize using centered inputs, in order to avoid multicollinearity issues, which could affect model convergence, and to preserve a meaningful interpretability of the regression coefficients. Consequently, this leads to estimating the coefficients of the ridge regression model without an intercept HaTiFr:09.

The learning procedure for the network weights is derived from the minimization of a loss function, chosen based on a criterion for approximating the model, leading to a direct approximation between the desired output and its prediction. The mean squared error has classically been used as the loss function in the context of regression problems Bi:06; Ha:13. Depending on the model, it is not possible to derive a direct, closed-form expression for the optimal coefficients. In this sense, several iterative methods have been proposed, based on the well-known gradient descent (GD) optimization technique.

There are several variants that aim to accelerate the standard GD method: Stochastic Gradient Descent (SGD) with momentum Qi:98 (or simply Momentum, which helps accelerate SGD), Nesterov Accelerated Gradient Ne:83 (NAG, a variant of Momentum), Adaptive Gradient DuHaSi:11 (Adagrad, which assigns a learning rate to each parameter according to the history of the gradients for that parameter), RMSProp TiGe:12 (similar to Adagrad, but the update step is normalized by a decaying root mean square, RMS, of recent gradients), Adadelta (which tends to be more robust than Adagrad to the choice of learning rate), and Adaptive Moment Estimation KiBa:14 (Adam, which takes advantage of both RMSProp and momentum), among many others. Adam and RMSProp are two very popular optimizers still used in most neural networks. Both update the variables using exponentially decaying averages of the gradient and of its square.
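A minimal sketch of a single Adam update, following the standard formulation of KiBa:14 (the variable names and default constants are the commonly used ones): two exponentially decaying averages track the gradient and its elementwise square, and both are bias-corrected before the step.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad       # decaying average of gradients
    v = b2 * v + (1 - b2) * grad**2    # decaying average of squared gradients
    m_hat = m / (1 - b1**t)            # bias-corrected first moment
    v_hat = v / (1 - b2**t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The per-parameter normalization by the RMS of recent gradients is what makes the effective step size adaptive, in contrast to the single global learning rate of standard GD.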

In artificial neural networks, the minimization of the loss function involves the chain rule across sequential derivatives with respect to the weights of each layer. This procedure is known as error back-propagation through the network. In addition to this algorithmic complexity, the more layers and non-linearities (through activation functions) are introduced, the more complex the network becomes, increasing the computational effort.

The choice of an adaptive learning algorithm depends on the optimization problem, and the robustness of the method noticeably affects convergence. The focus of this work is mainly to compare tensor and neural networks in terms of their structures, by means of robustness, prediction performance and network complexity. Taking this into consideration, we limit our analysis to the standard GD method and to the Adam algorithm, because of their popularity in the domain.

Differently from standard model parameters, hyper-parameters are employed in most machine learning algorithms to control the behavior of the learning algorithm, and there is no closed formula to uniquely determine them from data. In general, they are set empirically by searching for the best value by trial and error; examples include the regularization factor, the dropout rate, and the parameters of the optimization algorithm (e.g. learning rate, momentum term, decay rate), among others. A usual way to find the best hyper-parameters is to use a validation set and a search interval; this procedure, properly described in Section 5, is applied equally to both approaches.
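The trial-and-error search over a validation set can be sketched generically as follows; the callables `fit` and `loss` are hypothetical placeholders for model training and validation loss, not functions defined in this work.

```python
import numpy as np

def grid_search(train, valid, candidates, fit, loss):
    """Return the hyper-parameter value with the lowest validation loss.
    `fit(train, h)` trains a model for candidate h; `loss(model, valid)`
    evaluates it on the held-out validation set."""
    best_val, best_loss = None, np.inf
    for h in candidates:
        model = fit(train, h)
        current = loss(model, valid)
        if current < best_loss:
            best_val, best_loss = h, current
    return best_val, best_loss
```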

It is usual to evaluate the performance progression, i.e., the convergence speed, of neural networks in terms of epochs, where every epoch uses the entire data set to update the neural network. In order to set a fair comparison between tensor and neural networks, we take into account the contribution of the entire data set to the update of all weights; in this sense, it is reasonable to place algorithmic convergence measured in epochs and in sweeps on the same level.

## 5 Simulation Results

A usual way of evaluating model performance is to learn the weighting tensor from a given training data set and to analyze the model performance on a given test data set, which was not used during the learning step. To validate and better understand different aspects of the neural and tensor networks, we consider three different experiments, described in the following three subsections.

In order to evaluate and compare the performance of the models, we consider the mean squared error (MSE) of predictions, which is given by the loss function, and three other metrics commonly employed in regression problems: the explained variance score (briefly referred to here as score), which measures the discrepancy between the target and its prediction in terms of the sample variance (i.e., the quality of the fit of a model to the data); the sample Pearson correlation coefficient (shortly referred to as SPCC), which measures the linear correlation between both variables (target and prediction) through estimates of their covariance and variances; and the coefficient of determination (known as R-squared), which measures the degree of linear correlation and is unable to determine whether the predictions are biased. These metrics are given by the following expressions:

$$
\left\{
\begin{aligned}
\rho_{\mathrm{MSE}} &= \frac{1}{M}\sum_{m=1}^{M}\bigl(y_{\mathrm{target}}^{(m)}-\hat{y}^{(m)}\bigr)^{2}\\
\rho_{\mathrm{SCORE}} &= 1-\frac{\operatorname{var}\bigl(y_{\mathrm{target}}-\hat{y}\bigr)}{\operatorname{var}\bigl(y_{\mathrm{target}}\bigr)}\\
\rho_{\mathrm{SPCC}} &= \frac{\sum_{m=1}^{M}\bigl(y_{\mathrm{target}}^{(m)}-\bar{y}_{\mathrm{target}}\bigr)\bigl(\hat{y}^{(m)}-\bar{\hat{y}}\bigr)}{\sqrt{\sum_{m=1}^{M}\bigl(y_{\mathrm{target}}^{(m)}-\bar{y}_{\mathrm{target}}\bigr)^{2}}\,\sqrt{\sum_{m=1}^{M}\bigl(\hat{y}^{(m)}-\bar{\hat{y}}\bigr)^{2}}}\\
\rho_{R^{2}} &= 1-\frac{\sum_{m=1}^{M}\bigl(y_{\mathrm{target}}^{(m)}-\hat{y}^{(m)}\bigr)^{2}}{\sum_{m=1}^{M}\bigl(y_{\mathrm{target}}^{(m)}-\bar{y}_{\mathrm{target}}\bigr)^{2}},
\end{aligned}
\right. \qquad (29)
$$

where $\operatorname{var}(\cdot)$ denotes the sample unbiased variance operator, and $\bar{y}_{\mathrm{target}}$ and $\bar{\hat{y}}$ denote the sample means of the target vector and of its prediction, respectively.
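The four metrics of (29) can be computed directly; the sketch below assumes NumPy arrays for the target and prediction vectors and uses the unbiased sample variance for the explained variance score.

```python
import numpy as np

def regression_metrics(y_target, y_pred):
    """Compute the four metrics of Eq. (29): MSE, explained variance
    score, sample Pearson correlation (SPCC) and R-squared."""
    err = y_target - y_pred
    mse = np.mean(err**2)
    # ddof=1 gives the unbiased sample variance of Eq. (29).
    score = 1.0 - np.var(err, ddof=1) / np.var(y_target, ddof=1)
    yc = y_target - y_target.mean()
    pc = y_pred - y_pred.mean()
    spcc = (yc @ pc) / (np.linalg.norm(yc) * np.linalg.norm(pc))
    r2 = 1.0 - np.sum(err**2) / np.sum(yc**2)
    return mse, score, spcc, r2
```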

The weights of the tensor and neural networks are only learned from the training and validation sets, and the inputs of both networks are scaled to fit the range . It is known that this scaling procedure can improve the quality of the solution, as it ensures that all inputs are treated equally in the regularization process and allows a meaningful range for the random starting weights HaTiFr:09. The starting values for the weights are usually chosen to be random values close to zero. A good practice is to initialize the weights following a uniform distribution in the range , where denotes the number of coefficients associated with each neuron, and to set the biases to zero. By analogy, the coefficients of each core tensor are also initialized according to this practice, with interpreted in terms of the number of coefficients of each -th core tensor .

The stopping criterion is based on early stopping (in order to avoid overfitting), defined as a minimum relative improvement of the loss function between the last two consecutive iterations, normalized by the previous value, until some tolerance is achieved. Thus, we impose a minimum relative improvement of over, at least, of the maximum number of epochs or sweeps. In all simulations, the data is separated into three different sets for training (), validation () and test ().
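The relative-improvement criterion can be expressed as a small predicate; the tolerance value below is illustrative, not the one used in the experiments.

```python
def early_stop(loss_history, tol=1e-4):
    """Return True when the relative improvement between the last two
    iterations, normalized by the previous loss, falls below `tol`
    (tol is a hypothetical placeholder value)."""
    if len(loss_history) < 2:
        return False
    prev, curr = loss_history[-2], loss_history[-1]
    return abs(prev - curr) / abs(prev) < tol
```

In practice the criterion would be checked over a minimum fraction of the maximum number of epochs or sweeps before stopping, as described above.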

### 5.1 Recovering multilayer perceptrons

Firstly, we consider a data set with 10000 samples generated by means of a (10-200-1) neural network with 10 inputs and 200 neurons in the hidden layer, totaling 2401 coefficients. The input matrix, randomly generated from a uniform distribution in the range [ ], is propagated through two layers: the hidden and the output layer. Both weights and biases of the neural network are drawn from a Gaussian distribution with zero mean and standard deviation equal to 2. Two activation functions, ReLU and Tanh, are considered in the intermediate layer in order to introduce a non-linearity in the neural network model. We have considered a maximum number of sweeps equal to 12, since the algorithm convergence is achieved with fewer sweeps.

The weights of the tensor networks are only learned from the training and validation sets. The regularization factor is selected according to a searching step based on the well-known golden-section search (GSS), with a rough preliminary search over the given interval . Thus, the optimal regularization factor for each -th core estimate is chosen by taking into account the lowest value of the loss function computed from the validation set.

The neural network output was recovered by the 10-th order Tensor-Train decomposition by fixing a maximum TT-rank (), considering several values, and two values of the dimension array ( and ), regarding the local feature mapping given by the polynomial regression in (24). Tables 1 and 2, for the Tanh and ReLU functions respectively, show the average performance of all configurations, over 100 Monte Carlo simulations, in terms of MSE, score, SPCC and R-squared at convergence, for the training, validation and test sets.

According to Table 1, we verify that the performance improves as both model parameters and increase, since more coefficients are employed. From 232 to 2728 coefficients, for with and , we obtained an improvement of 4.92% on the test set in terms of the explained variance score. Analogously, for with and , going from 108 to 2556 coefficients, we obtained an improvement of 12.53% on the test set. Note that the TT model for and , with 3288 coefficients, does not provide a better score than the one for and , with 2556 coefficients; thus, more coefficients lead to a slight over-fitting of the model.

In contrast to the results for recovering the NN with the Tanh function, Table 2 shows a smaller improvement with the increase of and . From with , i.e. from more than 1960 coefficients, the model does not offer a meaningful improvement on the test set, i.e. lower than four decimal places. From 232 to 1960 coefficients, for with and , we obtain a gain of 1.24% on the test set, against 10.34% for with and (corresponding to an increase from 108 to 2556 coefficients). Analogously to Table 1, we observe a slight trend of over-fitting from to with , because more coefficients did not provide a better score on the test set.

In Figure 1, we present the average explained variance score over 100 Monte Carlo simulations, for all configurations, on the training and test sets. The respective standard deviation is also represented in this figure in order to stress the influence of the selection of the sets and of the initialization of the model coefficients. In accordance with Figure 1, as previously discussed, more coefficients in the TT network lead to an improvement in the performance on the training set; in contrast, the performance on the test set tends to saturate from and for and respectively. In other words, using more than 1400 and 2556 coefficients for and does not improve the test set prediction; hence, using more coefficients is pointless.

It is interesting to observe the compression potential of the TT structures with respect to a (10-200-1) NN with 2401 coefficients: it can be modeled as a TT network with far fewer coefficients. For and , the TT network has only 108 coefficients, which represents less than 5% of the total number of neural network coefficients, and achieves an average score on the test set equal to 0.8110 and 0.8958 for the Tanh and ReLU functions respectively. The best average performance on the test set is obtained for and , with 2556 coefficients, with a score equal to 0.9126 and 0.9884 for the Tanh and ReLU functions respectively.

Furthermore, Figure 1 also allows us to better understand the influence of the parameter , i.e. the dimension array of the encoded features. This parameter controls the degree of the polynomial regression model, i.e. the level of non-linearity introduced by the feature mapping, and can enable a better fit of the data interactions with a lower number of coefficients, as shown in Figure 1.

### 5.2 Mackey-Glass noisy chaotic time series

In the second experiment, we consider the Mackey-Glass noisy chaotic time series in order to compare the neural and tensor networks; the series is governed by the following delay differential equation MaGl:77:

$$
\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1+x(t-\tau)^{n}} - b\,x(t). \qquad (30)
$$

The Mackey-Glass time series with 1000 samples was generated using the 4-th order Runge-Kutta method with the power factor , initial condition , delay constant , time step size , and the other parameters and . According to Mi:09; Fa:82, for , the time series shows chaotic behavior. We consider four non-consecutive points of the time series, spaced by 6 points, in order to build each input vector for the short-term and long-term predictions, i.e.

$$
\begin{aligned}
x(t+6) &= F\bigl(x(t-18),\,x(t-12),\,x(t-6),\,x(t)\bigr)\\
x(t+84) &= F\bigl(x(t-18),\,x(t-12),\,x(t-6),\,x(t)\bigr),
\end{aligned}
$$

which represents a usual test GuWa:07; Mi:09; KoFuLiLe:11. The noiseless case is considered, as well as experiments with additive white Gaussian noise with zero mean and three values of standard deviation, i.e. .
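For illustration, the series can be generated as sketched below with a 4th-order Runge-Kutta step, holding the delayed term constant within each step (a common simplification for delay differential equations). The parameter values shown (a = 0.2, b = 0.1, n = 10, τ = 17, unit time step) are the ones commonly used in the literature and are assumptions here; they need not match the exact experimental setup.

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, n=10, tau=17.0, x0=1.2, dt=1.0):
    """Integrate Eq. (30) with RK4 steps, using a constant initial history
    and freezing the delayed term x(t - tau) within each step.
    Parameter defaults are common literature values, not the paper's."""
    delay = int(round(tau / dt))
    x = np.zeros(n_samples + delay)
    x[:delay + 1] = x0  # constant initial history

    def f(xt, xtau):
        return a * xtau / (1.0 + xtau**n) - b * xt

    for t in range(delay, n_samples + delay - 1):
        xtau = x[t - delay]
        k1 = f(x[t], xtau)
        k2 = f(x[t] + 0.5 * dt * k1, xtau)
        k3 = f(x[t] + 0.5 * dt * k2, xtau)
        k4 = f(x[t] + dt * k3, xtau)
        x[t + 1] = x[t] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x[delay:]
```

From the generated series, each input vector is then formed by the four delayed samples x(t−18), x(t−12), x(t−6), x(t), as in the mapping above.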

Three different 4-th order TT networks with (,), (,), (,) are employed to predict the short-term and long-term indices, as well as three different neural networks, (4-4-1), (4-6-1) and (4-15-1), with two activation functions: Tanh and ReLU. The choice of these neural network parameters is due to the restriction to one hidden layer, as discussed above, and the TT parameters follow from matching the approximate number of coefficients, i.e. (24, 40, 90) and (25, 37, 91) for the TT and NN structures respectively.

Analogously to the previous subsection, the regularization factor search for the tensor network follows the procedure described there, using the validation set, i.e. it is based on the GSS with a rough preliminary search over the same interval. We also adopted this procedure for the neural networks in order to search for an optimal learning rate for the SGD method.

In Tables 3 and 4, we present all the results in terms of MSE, score and SPCC at convergence, for the training, validation and test sets, for the short-term and long-term predictions respectively. All results represent the average over 400 Monte Carlo simulations, which implies 400 different random initializations. As expected, the performance of all models is affected by the noise addition, especially with .

For the noiseless case, the best performance is achieved with the 4-th order TT (,), with scores of 0.9972 and 0.8739 for the short-term and long-term predictions of the test sets respectively. For , the best score is 0.6916, obtained with the (4-15-1) NN with ReLU for the short-term predictions of the test sets. However, for the long-term predictions (and ), the 4-th order TT with (,) provides the best score, with 0.6868. Consequently, the TT model tends to provide better performance than both NN models for the configuration with more coefficients, i.e. (,).

The biggest score difference between and is obtained for the 4-th order TT model with (,), with 35.10% for the short-term predictions, and for the (4-15-1) NN model with Tanh, with 26.42% for the long-term predictions of the test sets, with respect to the noiseless case.

The short-term predictions tend to provide better results, as does increasing the number of coefficients. From 24/25 to 90/91 coefficients, in the best scenario, the score increases by up to 7.23% and 6.35% with the 4-th order TT model, 3.01% and 1.18% with the NN model with ReLU, and 0.38% and 0.23% with the NN model with Tanh, for the short-term and long-term predictions of the test sets respectively. Thus, increasing the number of coefficients tends to provide a bigger improvement on the test sets for the TT models than for the NN models.

Figures 2-3 show the amplitude versus time of the Mackey-Glass time series at convergence, for the training and test sets, regarding the short-term and long-term predictions, with respect to the noiseless case () and the noise standard deviation , respectively. The original targets (referred to in the figures as exact value) were re-scaled into the range , and a Gaussian noise (referred to as noisy target) with the respective standard deviation was added. Note that each prediction curve represents the average over all Monte Carlo simulations, with its respective standard deviation, in order to emphasize the influence of the initialization. The estimates given by all models tend to follow the oscillations in time of the Mackey-Glass time series. The additive noise makes the forecast harder, as do the long-term predictions.

The convergence for all configurations of the Mackey-Glass series is represented in Figures 4-5, regarding the short-term and long-term predictions, with respect to the noiseless case and the noise standard deviation . All the curves represent the average results, in terms of MSE and score, over all Monte Carlo simulations; the mean of the MSE and score at convergence and their respective standard deviations are given in the legend.

According to these figures, TT structures converge faster than NN models for all configurations. We can observe that fewer than 10 sweeps are enough to achieve convergence for all TT structures and, in the best case, only 2 sweeps. In contrast, the NNs with ReLU and Tanh respectively require at least 150 and 250 epochs in the best scenario. The ReLU function provides a better convergence than Tanh, especially for the short-term prediction. Furthermore, it is interesting to notice that the average performance is more representative for the TT model, since the standard deviation is quite small, i.e. lower than four decimal places, as indicated in the legend. Consequently, according to both figures, the initialization of the coefficients in the neural networks tends to have more impact on the performance than in the tensor network, especially in the case of more coefficients and long-term predictions.