1 Introduction
Kernel methods are a class of nonlinear modeling methods based on mapping the input data to a high-dimensional feature space. In this feature space, the inner product of the data is implicitly computed by a user-defined kernel function, correlating the data points through their similarity. Due to this formulation, the data can be modeled with the simplicity of linear algorithms [1], [2], [3]. As a result, kernel methods are an attractive modeling class because of the insight and ease they provide, and they are widely applied in numerous disciplines such as control [4], [5] and signal processing [6]. SVM theory was initially proposed by Vapnik [1], in which a cost function is minimized by solving either the parametric primal problem or the nonparametric dual problem. In either case, the model is found with quadratic programming, generating a unique global solution. The least squares SVM (LS-SVM) reformulates the original SVM problem by changing the loss function to a least squares form and altering the inequality constraints to equality constraints
[7], [8], [9]. This makes the resulting LS-SVM dual problem much easier to solve than its SVM counterpart. The computational complexity of solving the LS-SVM dual problem is $O(N^3)$ for direct methods and $O(I\,N^2)$ for iterative methods, where $N$ denotes the number of data points and $I$ the number of iterations. For small- and medium-scale problems, general solvers from the direct and iterative method classes are applicable [10], [11]. However, these methods cannot be used, due to memory and computational requirements, when the number of data points grows exponentially large. By this we mean that $N$ can be written as $b^d$ for some base $b$ and exponent $d$. As the exponent $d$ grows, more and more methods become problematic to use. This exponential growth is an instance of the "curse of dimensionality". One approach to solving large dual problems is to use a low-rank approximation of the kernel matrix, which is the cornerstone of the Nyström [12] and fixed-size LS-SVM (FS-LSSVM) [13] methods. These methods work around the prohibitive scaling by training on a subset of the data and by employing the primal-dual relationship of the LS-SVM. Current low-rank approximation methods have two drawbacks. Firstly, no confidence bounds on the approximate solution are obtained; only theoretical statistical guarantees can be given. Secondly, the size of the sample subset significantly impacts the accuracy as well as the memory and computational complexity. Methods that rely on a low-rank approximation of the kernel matrix inherently depend on a kernel spectrum that exhibits a strong decay. When such a decay is lacking, these methods are expected to perform poorly. An example of such a slowly decaying spectrum is shown in Fig. 1 for the two-spiral problem. As the sample size increases, the spectrum exhibits an ever-increasing plateau of important eigenvalues. This observation motivates the development of our proposed recursive method, which relies on a low-rank tensor decomposition instead. The power of our proposed method is demonstrated in the experiments, where we train an LS-SVM for the two-spiral problem on
training points and achieve 100% accuracy during validation.
1.1 Contributions of This Work
In this paper, we address the curse of dimensionality associated with solving the large-scale LS-SVM dual problem and the inability of current low-rank approximation methods to handle slowly decaying kernel spectra. The dual problem is cast into a tensor train (TT) form, which reduces the storage complexity from $O(N^2)$ down to $O(d\,N^{2/d}R^2)$, where $R$ denotes the TT-rank. In this form, neither the dual matrix nor any other matrix has to be constructed explicitly. A recursive Bayesian procedure is then developed to solve the large-scale dual problem in TT form through the implementation of a tensor network Kalman filter (TNKF). Such a Bayesian approach also yields a covariance matrix of the dual variables, which can be used to implement an early-stopping criterion. The contributions of this work are:

Circumventing the curse of dimensionality of solving large-scale LS-SVM dual problems by using a low-rank TT representation.

A non-sampling-based algorithm that evaluates the entire dual matrix and constructs low-rank tensor network approximations of each row.

Implementation of a recursive Bayesian filter that, unlike current low-rank approximation methods, allows the computation of confidence bounds on predictions.
1.2 Related Work
Many direct and iterative methods, such as the LU decomposition and conjugate gradient [11], [14], are available for learning small-scale LS-SVMs. The $O(N^2)$ storage complexity and $O(N^3)$ computational complexity of direct methods inhibit their application to large-scale problems. Iterative solvers still suffer from an $O(I\,N^2)$ complexity, where $I$ denotes the number of iterations. A common way to deal with this difficulty is to employ low-rank approximation methods, which use a subset of $M$ samples to approximate the solution. A trade-off between the obtainable accuracy and the associated computational and memory requirements is made for the problem to be feasible. Two well-known low-rank approximation techniques for LS-SVMs are the Nyström method [12] and FS-LSSVM [13]. Both are sampling-based methods that rely on the eigendecomposition of the kernel matrix. The Nyström method solves the LS-SVM dual problem with computational and storage complexities of $O(M^2N)$ and $O(MN)$, respectively. FS-LSSVM estimates a nonlinear mapping function in the dual and then solves the primal problem. The computational and storage complexities of FS-LSSVM are $O(NM^2)$ and $O(NM)$ [10], respectively. For both methods, the sampling procedure has a significant impact on performance. Many sampling procedures exist, such as uniform (random) sampling or, if computationally achievable, more advanced adaptive techniques such as sparse greedy matrix approximation [15], leverage scores [16], column-norm sampling [17], and Rényi-entropy sampling [7]. Other approaches to large-scale linear systems commonly focus on parallel computation or sparsity. For example, in [18] a parallel GPU implementation is used for kernel machines; the linear scaling associated with increasing batch sizes is combined with adaptive kernels to speed up training and enable more efficient parallelization. In [19]
, the computational complexity of training polynomial classifiers is significantly reduced through parallel implementations of tensor trains. In [20], a non-iterative pruning method is employed to generate a sparse LS-SVM solution. By using globally representative points, a support vector subset is found with lower computational complexity than traditional iterative pruning methods.
This paper is organized as follows. In Section II the basics of LS-SVMs are covered, and the Kalman filter is introduced in the context of Bayesian filtering. In Section III, a brief introduction to tensor networks is given, including the required operations and the TT decomposition for the TNKF algorithm. The TNKF algorithm itself is presented in Sections IV and V, covering its derivation, implementation, and complexities. The TNKF is analyzed on five datasets in Section VI. Lastly, conclusions and further work are discussed in Section VII.
Scalars  $(a, b, \ldots)$
Vectors  $(\mathbf{a}, \mathbf{b}, \ldots)$
Matrices  $(\mathbf{A}, \mathbf{B}, \ldots)$
Tensors  $(\mathcal{A}, \mathcal{B}, \ldots)$
Tensor train of $\mathbf{A}$  TT($\mathbf{A}$)
Matrix transpose  $(\mathbf{A}^T, \mathbf{B}^T, \ldots)$
Identity matrix of size $N \times N$  $\mathbf{I}_N$
Kernel matrix of size $N$ by $N$  $\mathbf{K}$
TT  Tensor train
SVM  Support vector machine
LS-SVM  Least squares support vector machine
FS-LSSVM  Fixed-size least squares support vector machine
KKT  Karush-Kuhn-Tucker
SVD  Singular value decomposition
TNKF  Tensor network Kalman filter
2 Solving LS-SVMs with Bayesian filtering
2.1 Least Squares Support Vector Machines
In what follows, we consider the regression problem; the development of our method is easily adjusted to the classification problem. Consider a training set of $N$ inputs $\mathbf{x}_k \in \mathbb{R}^D$ and outputs $y_k \in \mathbb{R}$. The regression model in the primal space is

$$ y_k = \mathbf{w}^T\varphi(\mathbf{x}_k) + b + e_k, $$

where $b$ denotes the bias, $\varphi(\cdot)$ denotes a feature map to a possibly infinite-dimensional space, and $e_k$ are the residuals. The LS-SVM primal problem is then to estimate the weights $\mathbf{w}$, bias $b$, and residuals $e_k$ from

$$ \min_{\mathbf{w},\,b,\,e}\; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\sum_{k=1}^{N} e_k^2 \quad \text{subject to} \quad y_k = \mathbf{w}^T\varphi(\mathbf{x}_k) + b + e_k, \quad k = 1, \ldots, N. $$
Solving the primal problem can be advantageous when the dimension of the feature space is small compared to $N$. However, solving the primal problem is often impossible because $\varphi(\cdot)$ is usually unknown, difficult to determine, and possibly infinite-dimensional. The dual problem circumvents this issue by rewriting the primal problem in terms of a kernel function as

$$ \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \mathbf{K} + \gamma^{-1}\mathbf{I}_N \end{pmatrix} \begin{pmatrix} b \\ \boldsymbol{\alpha} \end{pmatrix} = \begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix}. \tag{1} $$

The matrix $\mathbf{K}$ is called the kernel matrix and is defined as $K_{kl} = K(\mathbf{x}_k, \mathbf{x}_l) = \varphi(\mathbf{x}_k)^T\varphi(\mathbf{x}_l)$. The predictor can then also be written in terms of the dual variables $\boldsymbol{\alpha}$ as

$$ \hat{y}(\mathbf{x}) = \sum_{k=1}^{N} \alpha_k\, K(\mathbf{x}, \mathbf{x}_k) + b. \tag{2} $$

The dual variables are related to the residuals of the primal problem through $\alpha_k = \gamma e_k$. More detailed derivations of both the primal and dual problems can be found in [7].
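To make the dual formulation concrete, the following NumPy sketch builds and solves the dual system (1) directly and evaluates the predictor (2). This is an illustrative small-scale reference implementation only, not the method developed in this paper: the function names, the RBF kernel choice, and the hyperparameter values are all assumptions, and the direct solve scales as $O(N^3)$, which is exactly what the TNKF avoids.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """RBF kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Build and solve the (N+1) x (N+1) dual system (1) directly.

    This is the O(N^3) direct approach; only practical for small N.
    """
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                       # first row: KKT condition on the alphas
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]               # bias b, dual weights alpha

def lssvm_predict(Xt, X, b, alpha, sigma):
    """Predictor (2): y(x) = sum_k alpha_k K(x, x_k) + b."""
    return rbf_kernel(Xt, X, sigma) @ alpha + b
```

For example, fitting a clean sinc function this way reproduces it closely at and between the training points.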
2.2 Bayesian Filtering and the Kalman Filter
A recursive Bayesian filtering approach is adopted to solve the LS-SVM dual problem. Such a recursive approach makes it possible to quantify the uncertainty of the solution during the recursion, which enables the construction of an early-stopping criterion so that large-scale datasets can be trained. Here, the dual problem (1) is written as

$$ \mathbf{y} = \mathbf{A}\,\mathbf{x} + \mathbf{r}, $$

where $\mathbf{A}$ is the dual problem matrix of (1), $\mathbf{x} = (b, \boldsymbol{\alpha}^T)^T$, $\mathbf{y}$ denotes the right-hand side of (1), and the vector $\mathbf{r}$ contains the residuals of the dual problem. The goal now is to compute the posterior of $\mathbf{x}$ conditioned on the data $\mathbf{y}$. Using Bayes' rule, this posterior can be written as

$$ p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}, \tag{3} $$

where $p(\mathbf{y} \mid \mathbf{x})$ is called the likelihood and $p(\mathbf{x})$ the prior.
We assume that the primal residuals $e_k$ are zero-mean Gaussian random variables, $e_k \sim \mathcal{N}(0, \sigma_e^2)$. From $\alpha_k = \gamma e_k$ it then follows that the prior on $\boldsymbol{\alpha}$ is also normally distributed, $\boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{0}, \gamma^2\sigma_e^2\,\mathbf{I}_N)$. For convenience we define the shorthand notation $\sigma_\alpha^2 = \gamma^2\sigma_e^2$. The regularization parameter $\gamma$ hence reflects our confidence in the prior. The residuals of the dual problem are also assumed to be normally distributed, $r_k \sim \mathcal{N}(0, \sigma_r^2)$. Because the prior and the likelihood are Gaussian, the posterior is also Gaussian. From Bayes' rule an analytic solution can be derived to calculate the posterior with

$$ \mathbf{P} = \left(\mathbf{P}_0^{-1} + \sigma_r^{-2}\,\mathbf{A}^T\mathbf{A}\right)^{-1}, \qquad \mathbf{m} = \sigma_r^{-2}\,\mathbf{P}\,\mathbf{A}^T\mathbf{y}, \tag{4} $$

where $\mathbf{P}_0$ denotes the prior covariance.
The matrix inverse computation in (4) has a prohibitive computational complexity of $O(N^3)$. By realizing that the posterior distribution can serve as the prior distribution for each succeeding observation, a recursive framework can be developed that avoids the explicit matrix inversion. In this article, single observations are used to update the distribution, which is equivalent to solving the dual problem row by row. The following model is used to describe the solution for $\mathbf{x}$ at iteration $k$:
$$ \mathbf{x}_k = \mathbf{x}_{k-1} + \mathbf{w}_k, \tag{5} $$
$$ y_k = \mathbf{a}_k^T\,\mathbf{x}_k + r_k, \tag{6} $$

where $\mathbf{a}_k^T$ is the $k$th row of the matrix $\mathbf{A}$ and $\mathbf{w}_k$ is a stochastic term that allows us to introduce a forgetting factor that weights past measurements. We also assume that $\mathbf{w}_k$ is normally distributed, $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_k)$. Note that $\mathbf{x}_k$ denotes the estimate of $\mathbf{x}$ at iteration $k$ and not the $k$th component of the vector $\mathbf{x}$. Assuming the Markov property

$$ p(\mathbf{x}_k \mid \mathbf{x}_{1:k-1}, \mathbf{y}_{1:k-1}) = p(\mathbf{x}_k \mid \mathbf{x}_{k-1}) $$

and the conditional independence of the measurements,

$$ p(y_k \mid \mathbf{x}_{1:k}, \mathbf{y}_{1:k-1}) = p(y_k \mid \mathbf{x}_k), $$

the posterior can be written as

$$ p(\mathbf{x}_k \mid \mathbf{y}_{1:k}) \propto p(y_k \mid \mathbf{x}_k)\, p(\mathbf{x}_k \mid \mathbf{y}_{1:k-1}). \tag{7} $$
This recursive Bayesian framework directly leads to the Kalman filter equations, an optimal closed-form algorithm for linear stochastic models [21], [22]. The Kalman filter equations tell us how to compute the mean vector $\mathbf{m}_k$ and covariance matrix $\mathbf{P}_k$ of the posterior at iteration $k$:

$$ \mathbf{m}_k = \mathbf{m}_{k-1} + \mathbf{K}_k\, v_k, \tag{8} $$
$$ \mathbf{P}_k = \left(\mathbf{P}_{k-1} + \mathbf{Q}_k\right) - \mathbf{K}_k\,\mathbf{a}_k^T\left(\mathbf{P}_{k-1} + \mathbf{Q}_k\right). \tag{9} $$

The covariance matrix $\mathbf{P}_k$ can be understood as a measure of uncertainty in $\mathbf{m}_k$. By choosing the covariance matrix of $\mathbf{w}_k$ as $\mathbf{Q}_k = (\lambda^{-1} - 1)\,\mathbf{P}_{k-1}$ with $0 < \lambda \leq 1$, an approximately exponentially decaying weighting on past data is introduced [23, pg. 240]. The prediction error and its variance at iteration $k$ are $v_k = y_k - \mathbf{a}_k^T\mathbf{m}_{k-1}$ and $s_k = \mathbf{a}_k^T(\mathbf{P}_{k-1} + \mathbf{Q}_k)\,\mathbf{a}_k + \sigma_r^2$, respectively. Lastly, $\mathbf{K}_k = (\mathbf{P}_{k-1} + \mathbf{Q}_k)\,\mathbf{a}_k / s_k$ is the Kalman gain, which weights the observation against the prior information in the computation of $\mathbf{m}_k$ and $\mathbf{P}_k$. Termination occurs when every row of $\mathbf{A}$ has been processed. Once the filter is finished at iteration $k$, one can set the estimates of $b$ and $\boldsymbol{\alpha}$ equal to $\mathbf{m}_k$ and compute confidence bounds from the covariance matrix $\mathbf{P}_k$, similar to Gaussian processes [4, 24]. Although the matrix inversion is avoided by using the Kalman equations, the explicit construction of the covariance matrix $\mathbf{P}_k$ can still be problematic. For this reason, we propose to represent all vector and matrix quantities in the Kalman equations with tensor trains.
3 Tensor Trains
Tensors are multidimensional extensions of vectors and matrices to higher orders, and are relevant in many fields: signal processing, machine learning, and quantum physics [25], [26]. In this article, we define a tensor $\mathcal{A}$ as having an order $d$ and dimensions $I_1 \times I_2 \times \cdots \times I_d$. The elements of a tensor are given by its indices $(i_1, i_2, \ldots, i_d)$, for which the MATLAB notation $\mathcal{A}(i_1, i_2, \ldots, i_d)$ is used. Even though tensors can be worked with in their mathematical form, it is easier to use a visual representation, as shown in Figure 2. A tensor is illustrated as a core (circle) with a number of outgoing edges (lines) equal to its order. Each edge in such a diagram corresponds with a specific index.
Tensor networks are factorizations of tensors, analogous to matrix factorizations of matrices; in fact, they are a generalization of matrix decompositions to higher orders. An easy introduction is given through the singular value decomposition (SVD) of a matrix $\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ into two orthogonal matrices $\mathbf{U}$, $\mathbf{V}$ and a diagonal matrix $\mathbf{S}$. The SVD is one specific type of matrix factorization and can therefore also be represented as a tensor network. The diagram for this matrix factorization is shown in Figure 3. The interconnected edges represent the summations over the indices, also called index contractions. Generalized to higher orders, the shared edges between tensors represent tensor index contractions, which can be understood as a higher-order generalization of matrix multiplication.
3.1 Tensor Train Vectors and Matrices
In their high-dimensional form, tensors can be difficult to work with. Firstly, the computational and memory complexities are burdensome because of the exponential scaling with the order $d$. Secondly, concepts from linear algebra are not directly applicable, making analysis difficult. Tensor networks give more flexibility, as the tensor is factorized into a number of lower-dimensional tensor network components, also called cores. These factorizations can alleviate the adverse scaling and offer more insight. The CP and Tucker decompositions are two well-known tensor network structures, but they have significant drawbacks, such as NP-hardness and poor scaling [27]. The TT decomposition avoids these issues and can therefore be a more easy-to-use and robust format [28, 29]. As the name suggests, a TT decomposition factorizes a tensor into a chain of cores, which are linked through their TT-ranks $R_1, R_2, \ldots, R_{d-1}$. In this paper, two types of tensor trains are used: a TT-vector TT($\mathbf{a}$) to represent a vector $\mathbf{a}$, and a TT-matrix TT($\mathbf{A}$) to represent a matrix $\mathbf{A}$. The diagrams of a TT-vector and a TT-matrix are shown in Figure 4 and Figure 5, respectively. Just like in Figure 3, the connecting edges in the diagrams are indices that are summed over. To generate these TT structures, vectors and matrices are first reshaped into high-order tensors, as illustrated in Figure 6 for the matrix case. Then, a low-rank TT decomposition is used to represent the obtained tensors. For example, a $1331 \times 1331$ matrix can be reshaped into a 6-way tensor with each dimension equal to 11. Both the row and the column index are in this way split into 3 indices each, resulting in the 6-way tensor in Figure 5. This 6-way tensor is then approximated by a low-rank tensor train decomposition. The TT-ranks signify the amount of correlation between the dimensions of a tensor, and determine the complexities of the representation. For small TT-ranks $R$, the TT format yields significant compression ratios, as the storage complexities of a TT-vector and a TT-matrix are $O(d\,I\,R^2)$ and $O(d\,I^2R^2)$, respectively, where $I$ denotes the largest dimension.
Several algorithms are available for converting a vector or matrix into its respective TT form [30, 28, 29, 31]. In this article the TT-SVD is used, as this algorithm is able to find a TT that satisfies guaranteed error bounds. The TT-SVD uses consecutive SVD operations to convert a tensor into a TT. The TT-rounding algorithm will also be used in this article. TT-rounding takes a given TT and truncates its TT-ranks, again through a sequence of consecutive SVD computations. There are two possible implementations of the TT-rounding algorithm: either an upper bound for the relative approximation error is provided, or the TT-ranks can be set to prechosen values. For more details on these two algorithms we refer the reader to [28].
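The successive-SVD construction can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code; the truncation rule follows the standard TT-SVD error-distribution argument, and `tt_to_full` is a hypothetical helper that densifies a TT only for checking.

```python
import numpy as np

def tt_svd(tensor, eps=1e-10):
    """Decompose a d-way tensor into TT cores by successive truncated SVDs.

    Singular values are discarded as long as the total relative error
    stays below eps (the standard TT-SVD guarantee).
    """
    dims = tensor.shape
    d = len(dims)
    delta = eps / np.sqrt(max(d - 1, 1)) * np.linalg.norm(tensor)
    cores, r = [], 1
    C = tensor.reshape(r * dims[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        # err[i] = Frobenius error if only the first i singular values are kept
        err = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        rk = max(1, int(np.sum(err > delta)))
        cores.append(U[:, :rk].reshape(r, dims[k], rk))
        C = (np.diag(s[:rk]) @ Vt[:rk]).reshape(rk * dims[k + 1], -1)
        r = rk
    cores.append(C.reshape(r, dims[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract TT cores back into the full tensor (for checking only)."""
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=(T.ndim - 1, 0))
    return T[0, ..., 0]
```

Applied to a tensor that is exactly of low TT-rank, the algorithm recovers those ranks and reconstructs the tensor to machine precision.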
3.2 Tensor Train Operations
An advantage of the TT-vector and TT-matrix representations is that various linear algebra operations can be performed directly in the compressed form. The two main operations required in the Kalman equations are matrix multiplication and matrix addition. Addition and subtraction in TT form are done by a concatenation procedure between the respective cores [28, p. 2308]. Multiplication between a scalar and a TT is executed by multiplying the elements of one of the TT cores by the scalar. A matrix multiplication in TT-matrix form can be done via summations over indices between the two TT-matrices. The procedure is illustrated for an example in Figure 7, where the matrices are both represented by TT-matrices of 4 cores. The matrix product is then computed in TT-matrix form by summing over the column indices of the first matrix and the row indices of the second. These index summations "merge" corresponding TT cores, as indicated by the dotted black rectangle in Figure 7. As a result of this "merging", the TT-matrix ranks of the matrix product are the products of the TT-matrix ranks of the individual matrices. This increase of the TT-matrix ranks motivates the use of the TT-rounding algorithm to truncate these ranks in order to keep computations feasible.
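The core-concatenation rule for TT addition can be sketched as follows (illustrative; the helper `tt_full` densifies a TT only to verify the result). Note how the ranks of the sum are the sums of the operands' ranks, which is why TT-rounding is applied after additions.

```python
import numpy as np

def tt_add(a_cores, b_cores):
    """Add two TT-vectors by block-diagonal concatenation of their cores."""
    out = []
    d = len(a_cores)
    for k, (A, B) in enumerate(zip(a_cores, b_cores)):
        ra1, I, ra2 = A.shape
        rb1, _, rb2 = B.shape
        if k == 0:                       # first core: concatenate along columns
            C = np.concatenate([A, B], axis=2)
        elif k == d - 1:                 # last core: concatenate along rows
            C = np.concatenate([A, B], axis=0)
        else:                            # middle cores: block-diagonal embedding
            C = np.zeros((ra1 + rb1, I, ra2 + rb2))
            C[:ra1, :, :ra2] = A
            C[ra1:, :, ra2:] = B
        out.append(C)
    return out

def tt_full(cores):
    """Contract TT cores into the full tensor (for checking only)."""
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=(T.ndim - 1, 0))
    return T[0, ..., 0]
```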
4 Tensor Network Kalman Filter
The Kalman filter provides a simple iterative method for solving the dual problem, but the covariance matrix limits its feasibility: its explicit representation demands an $O(N^2)$ storage complexity. Fortunately, recasting the Kalman filter into a tensor network Kalman filter (TNKF) form avoids this problem, as has been done similarly in [32] and [33] for specific control applications. The reformulation is achieved by representing vectors by TT-vectors and matrices by TT-matrices, reducing the storage complexity to be linear in $d$, as indicated in Table 3. The TT-ranks can be understood as hyperparameters that determine the accuracy-complexity trade-off. The Kalman filter in TT form is summarized in Algorithm
1.

Variables  TT variable  Storage complexity
$\mathbf{P}_k$  TT($\mathbf{P}_k$)  $O(d\,I^2R_P^2)$
$\mathbf{m}_k$, $\mathbf{K}_k$, $\mathbf{a}_k$  TT($\mathbf{m}_k$), TT($\mathbf{K}_k$), TT($\mathbf{a}_k$)  $O(d\,I\,R^2)$
$y_k$, $v_k$, $s_k$, $\lambda$  Unchanged  $O(1)$
The initial mean and covariance matrix of the prior distribution are constructed directly in TT form, as the explicit construction of the covariance matrix can be impractical or even impossible for large datasets. In [32], a more detailed explanation is given of how a rank-1 TT-vector and TT-matrix can be constructed for the mean and covariance, respectively. In each iteration, a row of the kernel matrix is constructed from the training data and the kernel function. This row is then reshaped into a tensor, after which it is decomposed into a low-rank TT-vector. The necessity of storing more than one row of the kernel matrix is thereby avoided, which permits application to very large datasets.
Table 4 lists the computational complexity of each step of the TN Kalman filter. The maximal TT-ranks of the TT variables are denoted $R_m$, $R_P$, $R_K$, and $R_a$, respectively. From the TNKF complexities, which include the construction of the $k$th row of the kernel matrix, it is clear that the TT-ranks play an important role in the complexity of the method. Especially the TT-ranks of TT($\mathbf{m}_k$) and TT($\mathbf{P}_k$) determine much of the algorithm's computational cost, as they are required in almost every step and impact the TT-ranks of the other variables. It is also evident that the construction of the TT-vector of the kernel row can be a dominating factor for large datasets.
An alternative to the TT-SVD algorithm with lower computational complexity would be an Alternating Linear Scheme (ALS) [30] approach, where the TT of the previous row could be used as an initial guess for the next row. Other alternatives with lower computational complexity are the randomized TT-SVD [31] and the cross-approximation algorithm [34]. In comparison to other low-rank approximation methods such as the Nyström and FS-LSSVM methods, the TNKF is advantageous if the data admit a low-TT-rank structure and when the spectrum of the kernel matrix decays slowly. The TNKF is also advantageous compared to common iterative methods, such as conjugate gradient. These have computational complexities of $O(I\,N^2)$, where $I$ is the number of iterations until convergence, and therefore suffer from the curse of dimensionality [11]. As a result, iterative methods quickly become infeasible for large-scale datasets.
Step  Explanation
1.  Prediction update TT($\mathbf{m}_{k|k-1}$)
2.  Prediction update TT($\mathbf{P}_{k|k-1}$)
3a.  Construct kernel row
3b.  Reshape to tensor
3c.  TT-SVD
4.  Compute $v_k$
5.  Compute $s_k$
6a.  Compute TT($\mathbf{K}_k$)
6b.  TT-rounding TT($\mathbf{K}_k$)
7a.  Measurement update TT($\mathbf{m}_k$)
7b.  TT-rounding TT($\mathbf{m}_k$)
8a.  Measurement update TT($\mathbf{P}_k$)
8b.  TT-rounding TT($\mathbf{P}_k$)
4.1 Computation of Confidence Bounds
The output of Algorithm 1 is the mean and covariance matrix of the posterior distribution, in TT form, which can then be used for validation and testing. The dual model (2) can be implemented in TT form in order to keep working with lower computational and storage complexities. Consider $N_t$ test points. Similarly to Algorithm 1, rows of the kernel matrix based on the training and test data are constructed and transformed into TT-vectors. The prediction is then obtained by contracting each such TT-vector with TT($\mathbf{m}$). The variance of the prediction is obtained by additionally contracting with TT($\mathbf{P}$) and adding $\sigma_r^2$, and can be used to construct the $3\sigma$ (99.73%) confidence bounds. The bounds describe how well the kernel regression or classification model fits the data. In Algorithm 2, the TT test regression procedure is given. A similar procedure can be derived for classification.
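Ignoring the TT format for clarity, the predictive mean, variance, and 3-sigma bounds per test point amount to the following sketch (a dense stand-in for Algorithm 2; the function name is hypothetical):

```python
import numpy as np

def predict_with_bounds(k_row, m, P, sigma_r2):
    """Predictive mean, variance, and 3-sigma (99.73%) bounds for one test
    point, given a kernel row k_row and the posterior N(m, P) of the dual
    variables. Dense stand-in for the TT-form computation in the text.
    """
    y_hat = k_row @ m                    # predictive mean
    var = k_row @ P @ k_row + sigma_r2   # predictive variance
    band = 3.0 * np.sqrt(var)            # half-width of the 3-sigma interval
    return y_hat, var, (y_hat - band, y_hat + band)
```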
4.2 Practical Considerations
Many different factorizations of the dimensions can be considered when converting vectors and matrices into TT form. For example, each row of a kernel matrix can be "tensorized" into tensors of different orders and dimensions. A general recommendation is to use the prime factorization, since then the dimension is decomposed into its smallest constituents. Using a prime factorization does, however, not uniquely determine how to "tensorize" vectors and matrices, since the TT-ranks are known to depend on the order of the indices. For this reason it is also recommended to sort the dimensions. For the kernel matrix, this means that each row is reshaped into a tensor whose dimensions are the sorted prime factors of its length.
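The prime-factor tensorization recommendation can be sketched as follows (the row length 720 is an arbitrary illustrative choice, not a value from the paper):

```python
import numpy as np

def prime_factors(n):
    """Prime factorization of n, smallest factors first."""
    f, p = [], 2
    while p * p <= n:
        while n % p == 0:
            f.append(p)
            n //= p
        p += 1
    if n > 1:
        f.append(n)
    return f

# A kernel-matrix row of length N is reshaped into a tensor whose
# dimensions are the sorted prime factors of N before applying the TT-SVD.
row = np.arange(720.0)               # hypothetical row, length N = 720
dims = sorted(prime_factors(720))    # [2, 2, 2, 2, 3, 3, 5]
tensor = row.reshape(dims)
```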
Choosing the rounding tolerances for the TT-ranks is nontrivial, as there is no way to determine their values a priori. Manual tuning and grid searches on subsets of the data can be used to determine suitable values.
Early-stopping criteria can be designed based on the norm of the covariance TT($\mathbf{P}_k$). If the norm is small and changes only slightly over the iterations, then there is high confidence in the solution at iteration $k$. Values for the covariance-based stopping criteria can be determined by evaluating the primal error distributions and/or the residuals' variances, which are related to the dual weights by $\alpha_k = \gamma e_k$. Termination criteria can also be based on other conditions, such as relaxed Karush-Kuhn-Tucker (KKT) conditions.
One advantage of the TNKF is that the truncation parameters make it less sensitive to overfitting, because the TT-ranks act as an implicit regularization. Truncations are easiest and most beneficial when the kernel hyperparameters are small, because this produces a rapidly decaying singular value spectrum for which small TT-ranks can be achieved in the TT-SVD. However, the selection of truncation and kernel parameters is interdependent: a balance needs to be found between them in order to fit the data well. A consequence of performing truncations is that the confidence region becomes wider, locally or globally, allowing for more uncertainty in the problem.
The TNKF performance is also impacted by the specified residual variances, which determine the initial distributions. These variances influence the balance between generalization and overfitting, affecting both the width of the confidence bounds and the accuracy of the mean. In the supplied initial distribution, the mean does not impact the regression: the first row of the dual problem is a KKT optimality condition ($\mathbf{1}^T\boldsymbol{\alpha} = 0$) that resets the mean to zero. The initial covariance does impact the regression, but only if significant truncations are performed or early stopping is used in the TNKF, such that local variances do not converge to the dual residual variance. If no primal residual variance is known, the prior distribution has to be chosen by the user. The dual residual covariance determines the lower bound for the covariance in the Kalman equations, and can be used to supply an initial distribution. Alternatively, a user-specified initial distribution can also be supplied.
5 Experiments
In this section, the TNKF algorithm is applied to a number of large-scale regression and classification problems and compared to the FS-LSSVM and Nyström methods. Table 5 gives an overview of the datasets, together with the number of data points used for training and validation and the dimension of the input space.
Dataset  # Training samples  # Test samples  Input dimension
Noisy sinc [7]    1
Two-spiral [7]    2
Adult [35]    99
MNIST [36, 37]    784
Because of the differences between the considered low-rank approximation methods, a direct comparison is not straightforward. The following steps and assumptions are made to simplify the comparison.

The Nyström method uses uniform random sampling to generate a subset of (S) data points. FS-LSSVM uses the Rényi-entropy criterion to generate a subset of (S) data points [7].

The number of Rényi sampling iterations for FS-LSSVM is chosen equal to the number of iterations of the TNKF, which equals the number of data points.

Multiple performance measures are shown for each method, including different truncation parameters for the TNKF, subset sizes (S), and the number of eigenvectors (EV) in the Nyström approximation.

The same kernel hyperparameters are used to compare the methods. These were found by grid searches to determine the approximate orders, followed by final manual iterative tuning of the TNKF.

For regression, the root-mean-square error (RMSE),

$$ \text{RMSE} = \sqrt{\frac{1}{N_t}\sum_{k=1}^{N_t}\left(y_k - \hat{y}_k\right)^2}, $$

and the fit percentage based on the normalized root-mean-square error (NRMSE),

$$ \text{Fit} = 100\%\left(1 - \text{NRMSE}\right), $$

are used as performance measures.

For classification, the performance is given through the percentage of correctly assigned labels. The confidence (%) is based on the total number of misclassifications when the $3\sigma$ confidence bounds are used as decision models.
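The regression measures above can be sketched in a few lines. Note that the NRMSE normalization shown here, by the spread of the targets around their mean, is one common convention and an assumption on our part, since the paper's exact normalization is not recoverable from the text.

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def fit_percentage(y, y_hat):
    """Fit (%) = 100 * (1 - NRMSE), with the error normalized by the
    spread of y around its mean (assumed convention)."""
    nrmse = np.linalg.norm(y - y_hat) / np.linalg.norm(y - np.mean(y))
    return 100.0 * (1.0 - nrmse)
```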
In the upcoming subsections, the following conventions are used. All datasets are centered, effectively removing the bias term and the first column of the dual problem; this means that only the dual weights are learned. The bias term can simply be computed from the mean and added to the final model if desired. In the performance tables, "NA" stands for "not applicable". This indicates that the algorithm is infeasible because of computational or memory requirements, or that no confidence bounds are generated. Multiple instances of the TNKF are tested for each dataset to analyze the effects of TT-rank truncations. Each TNKF instance has different truncation parameters, which are presented below the performance tables. The number of samples for Nyström and FS-LSSVM is given by "#S" after their names.
All experiments were conducted in MATLAB version R2019b on a laptop with an Intel 6-core i7 processor running at 2.6 GHz and 16 GB of RAM. The MATLAB code can be downloaded from https://github.com/maximilianluc/LSSVMTNKF. For the FS-LSSVM and Nyström methods the "lssvmlab" toolbox was used, freely available at https://www.esat.kuleuven.be/sista/lssvmlab/.
The hyperparameters and truncation parameters were found through iterative grid searches with increasingly fine grid spacing. Currently there are no methods to determine these parameter values beforehand. Additionally, the hyperparameter and truncation parameter values are interdependent.
5.1 Regression Problem 1: Noisy Sinc Function
To find an approximation of the sinc function, the TNKF needs to be initialized. The sinc function is corrupted with zero-mean Gaussian noise. It is assumed that this noise distribution is known, and $\sigma_r^2$ is chosen equal to the noise variance. The RBF kernel function is used, with hyperparameter values found by grid search. The initial distribution of the dual weights is given according to the relation $\alpha_k = \gamma e_k$ and supplied in TT form, emphasizing high confidence in the initial solution.


The performance of the TNKF depends on its TT-ranks, which are determined by the truncations and the selected kernel hyperparameters, as given in Table VI(a). If both are appropriately specified, the TNKF can obtain near-optimal RMSE and fit values, as dictated by the noise's standard deviation. The performance is poor when the truncations are specified too large, as shown for one of the TNKF instances in Table VI(b): the truncation of the kernel rows then causes too much information loss, crucial interrelationships in the data are lost, and the prediction power of the TNKF is reduced. The FS-LSSVM also manages to realize near-optimal results. Additionally, through iterative tuning of the sample sizes to analyze the complexity-accuracy trade-off, it is apparent that very few samples are needed to generate an accurate prediction for the LS-SVM; the method may therefore be preferable over the TNKF in terms of complexity. Similarly to the TNKF method, FS-LSSVM is not very sensitive to the chosen hyperparameters. With the Rényi-sampling criterion, this method can select the most informative columns to acquire a nonlinear mapping function. The Rényi-sampling criterion makes it unnecessary to use larger subsets, as the increase in performance is minimal compared to the increase in complexity. The Nyström method is much more sensitive to which and how many columns are sampled. As presented in Table VI(a), it does not match the performances of the TNKF and FS-LSSVM. Unlike the TNKF and FS-LSSVM methods, it is not able to select the most informative columns to overcome its susceptibility to noise; as a result, and due to the noise, the eigendecomposition is not accurate. In Figure 8, the initial distribution of the first TNKF instance is set equal to the noise distribution, which results in smooth confidence bounds.

Truncation parameters of the three TNKF instances (Table 7):
0.001  0.001  0.001  0.001  0.001
0.001  0.5    0.001  0.001  0.001
0.15   0.001  0.001  0.001  0.001
In Figure 9, the RMSEs of three different TNKF instances over the iterations are plotted for a small-scale noisy sinc problem. The respective truncation values are given in Table 7. Convergence of the TNKF depends significantly on the specified truncation values. For one instance, the truncation of the kernel-row TT prohibits convergence to the noise RMSE: the approximation does not accurately describe the dataset. Its convergence plot also oscillates much more, because kernel-row truncations can make the training data less informative. The kernel-row TT thus plays a crucial role in whether the TNKF converges. For the other two instances, early stopping can be performed with small or minimal consequences for the test RMSE if tuned properly.
5.2 Classification Problem 1: Two-Spiral
A large-scale version of the two-spiral problem is considered, a well-known and difficult machine learning problem. The task is to classify two identical, but phase-shifted, spirals. The data are separable, but a highly nonlinear decision boundary has to be found [7]. The RBF kernel is used, with hyperparameters found by grid search. The initial distribution for the weights is chosen according to the relation $\alpha_k = \gamma e_k$. The data adhere to TT-rank-1 structures, for which the memory and computational requirements scale only mildly with the number of data points. The test dataset was generated by selecting every fourth point of the two spirals used for training.


The TNKF instances manage to distinguish the two spirals perfectly, as given in Table VII and displayed in Figure 10. Additionally, confidence bounds on the decision function are found. Because of the small TT-ranks the TNKF is computationally advantageous, requiring no explicit storage of a large matrix. The FS-LSSVM classifier could only label a subset of the data correctly and is therefore assigned "NA": due to the size of the dataset and the small kernel hyperparameter, it was impossible to generate a nonlinear mapping function that generalizes well over the entire dataset. The Nyström method also performed poorly, as the number of samples needed for a good estimation exceeds RAM capabilities; when fewer samples are chosen, the method likewise results in low labeling power. The large-scale two-spiral problem demonstrates that the TNKF is especially suitable in cases where the kernel eigenspectrum decays slowly. The TNKF can incorporate the most important information from the dataset in an implicit way, without requiring a sampling procedure.
5.3 Classification Problem 2: Adult Dataset
In this experiment the goal is to classify whether a person makes over $50k a year based on 14 features, both numerical and categorical [35]. By converting the categorical features to numerical values with one-hot encoding, a total of 99 features is obtained. The RBF kernel is used, and the chosen hyperparameters are (, ). TNKF and TNKF are trained on the entire dataset. TNKF uses early stopping, and therefore sees only a subset of the data. The three TNKFs are implemented with initial distributions . It is assumed that the residual is irrelevant to the consensus data; it is therefore chosen very small, .
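The one-hot conversion of the categorical features can be sketched as follows; the toy `age`/`workclass` rows are invented for illustration and are not rows of the actual Adult dataset.

```python
import numpy as np

def one_hot_encode(column):
    """Map a categorical column to a 0/1 indicator matrix with one
    column per distinct category (sorted for reproducibility)."""
    cats = sorted(set(column))
    index = {c: i for i, c in enumerate(cats)}
    out = np.zeros((len(column), len(cats)))
    for row, c in enumerate(column):
        out[row, index[c]] = 1.0
    return out, cats

# Invented toy rows in the spirit of the Adult data: keep the numerical
# feature (age) as-is and expand the categorical feature (workclass).
age = np.array([[39.0], [50.0], [38.0]])
work = ["State-gov", "Self-emp", "State-gov"]
W, cats = one_hot_encode(work)
features = np.hstack([age, W])   # 1 numerical + 2 indicator columns
```

Applying this to all categorical columns of the 14 original features is what expands the input dimension to 99.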


In Table X(a) the performances of the TNKF are given. The TNKF is capable of achieving performances equal to those found in the literature for SVMs and LSSVMs [7, 38]. The data adheres to small TT-ranks, so severe truncations are unnecessary. TNKF, which employed early stopping and a forgetting factor, managed to reach decent labeling power. The parameters for TNKF are described below:

Forgetting factor ,

TT smaller than ,

The change in TT is at most for at least 5 iterations.
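The stopping criteria above can be combined into a rule of the following shape; this is a hedged sketch in which the function name, the monitored quantity, and the thresholds are assumptions, not the paper's exact rule.

```python
def should_stop(history, tol_value, tol_change, patience=5):
    """Stop once the monitored quantity (e.g. a norm of the TT update)
    is below tol_value AND changed by at most tol_change over the last
    `patience` consecutive iterations."""
    if len(history) < patience + 1:
        return False                 # not enough iterations observed yet
    recent = history[-(patience + 1):]
    small_enough = recent[-1] < tol_value
    stable = all(abs(b - a) <= tol_change
                 for a, b in zip(recent, recent[1:]))
    return small_enough and stable
```

The forgetting factor is a separate ingredient: it down-weights older measurements in the covariance update rather than entering the stopping rule itself.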
Implementation of early stopping required only 4200 rows to be iterated over. This reduced the training time by around 80%, but yielded a lower classification accuracy. The chosen kernel hyperparameter () generates a kernel matrix that can easily be approximated with a low-rank matrix structure. The Nyström method is particularly suitable here and achieves very high performance values with few sampled columns. The nonlinear mapping function, however, is difficult to approximate: FSLSSVM requires many support vectors and is computationally the most demanding method on this problem.
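The Nyström method referred to here builds a low-rank surrogate of the kernel matrix from a few sampled columns. A minimal sketch, in which the sample size, kernel width, and data are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def nystrom(kernel, X, idx):
    """Nystrom approximation K ~= C @ pinv(W) @ C.T built from the
    sampled columns `idx`; the full n x n matrix is never formed."""
    C = kernel(X, X[idx])        # n x m sampled columns
    W = C[idx]                   # m x m intersection block
    return C, np.linalg.pinv(W)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
idx = rng.choice(100, size=20, replace=False)
C, W_pinv = nystrom(rbf, X, idx)
K_approx = C @ W_pinv @ C.T      # low-rank surrogate of the kernel matrix
```

The quality of `K_approx` hinges on how fast the kernel eigenspectrum decays, which explains why the method excels on this problem but struggled on the two-spiral one.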
5.4 Classification Problem 3: binary MNIST
For the MNIST digit recognition dataset, a binary classification problem is considered in this paper. The digits are labeled in two groups, and , and classified based on the 784 normalized pixels per image. A classifier is trained on images and tested on images. No noise is considered to act on the data. The RBF kernel is used with the hyperparameter selection (, ). The initial condition is chosen as . The performances of the low-rank approximation methods on the binary MNIST classification problem are presented in Table XI.


The TNKF achieves high classification and confidence percentages, and is well-suited for the MNIST dataset because the images conform to small TT-ranks. As a result of the low TT-ranks, the memory and computational costs are small, though not as favorable as those of the Nyström method. The FSLSSVM requires many support vectors and therefore demands much computational effort, but can nevertheless reach high labeling power. The Nyström approximations perform well with very few samples; of the three low-rank approximation methods, it is computationally the lightest to apply and yields very high accuracies.
6 Conclusion
This paper presents a recursive Bayesian filtering framework based on the tensor network Kalman filter to solve large-scale LSSVM dual problems. To the best of our knowledge, it is the first non-sampling-based algorithm for large-scale LSSVM dual problems. The recursive Bayesian framework allows early stopping and yields confidence bounds on the predictions; current low-rank approximation methods lack such properties. Moreover, the curse of dimensionality is avoided because the method is iterative and uses a tensor train representation. The TNKF is especially advantageous in situations with strong nonlinearities, when the kernel data conforms to a low-TT-rank structure, or when many samples would be required by alternative low-rank methods. High accuracies are attained on all considered problems.
In further research, the computational complexity of the TNKF can be reduced further. The algorithm can easily be extended to a parallel implementation and a batch framework to reduce training times. Additionally, directly constructing the kernel rows in TT form can save a significant portion of the computational cost for extremely large problems and problems with many features. The tuning of the interdependent truncation parameters and kernel hyperparameters requires further investigation; inference methods to determine these parameters would save much tuning work, currently done by exhaustive grid searches. Lastly, the TNKF can readily be adopted in other kernel methods and in fields involving large-scale linear systems.
Acknowledgment
Johan Suykens acknowledges support from EU: the research leading to these results has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme / ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: C14/18/068. Flemish Government: FWO projects GOA4917N; Impulsfonds AI: VR 2019 2203 DOC.0318/1QUATER; Ford KU Leuven Research Alliance Project KUL0076; ICT 48 TAILOR; Leuven.AI Institute.
References

[1] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg, 1995.
[2] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[4] J. Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer International Publishing, January 2016.
[5] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[6] J.L. Rojo-Alvarez, M. Martinez-Ramon, J. Munoz-Mari, and G. Camps-Valls. Digital Signal Processing with Kernel Methods. Wiley-IEEE Press, 1st edition, 2018.
[7] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.
[8] J. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9:293–300, June 1999.
[9] K. De Brabanter, J. De Brabanter, J.A.K. Suykens, and B. De Moor. Approximate Confidence and Prediction Intervals for Least Squares Support Vector Regression. IEEE Transactions on Neural Networks, 22(1):110–120, 2011.
[10] B. Hamers. Kernel Models for Large Scale Applications. PhD thesis, Katholieke Universiteit Leuven, ESAT, Kasteelpark Arenberg 10, 3001 Leuven, Belgium, June 2004.
[11] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 4th edition, 2013.
[12] C. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[13] K. De Brabanter, J. De Brabanter, J.A.K. Suykens, and B. De Moor. Optimized Fixed-size Kernel Models for Large Data Sets. Computational Statistics & Data Analysis, 54(6):1484–1504, June 2010.
[14] B. Hamers. A Comparison of Iterative Methods for Least Squares Support Vector Machine Classifiers. Technical report, Katholieke Universiteit Leuven, Belgium, 2001.
[15] A.J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 911–918, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[16] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On Fast Leverage Score Sampling and Optimal Learning, 2018.
[17] P. Drineas and M.W. Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research, 6, 2005.
[18] S. Ma and M. Belkin. Kernel Machines that Adapt to GPUs for Effective Large Batch Training, 2018.
[19] Z. Chen, K. Batselier, J.A.K. Suykens, and N. Wong. Parallelized Tensor Train Learning of Polynomial Classifiers. CoRR, abs/1612.06505, 2016.
[20] Y. Ma, X. Liang, G. Sheng, J.T. Kwok, M. Wang, and G. Li. Non-iterative Sparse LSSVM Based on Globally Representative Point Selection. IEEE Transactions on Neural Networks and Learning Systems, pages 1–11, 2020.
[21] S. Särkkä. Bayesian Filtering and Smoothing. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2013.
[22] M. Verhaegen and V. Verdult. Filtering and System Identification: A Least Squares Approach. Cambridge University Press, 2007.
[23] S.S. Haykin. Kalman Filtering and Neural Networks. John Wiley & Sons, Inc., USA, 2001.
[24] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, January 2006.
[25] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H.A. Phan. Tensor Decompositions for Signal Processing Applications: From Two-way to Multiway Component Analysis. IEEE Signal Processing Magazine, 32(2):145–163, March 2015.
[26] A. Avella and F. Mancini. Strongly Correlated Systems: Numerical Methods. Springer Berlin Heidelberg, January 2013.
[27] T.G. Kolda and B.W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, September 2009.
[28] I. Oseledets. Tensor-Train Decomposition. SIAM Journal on Scientific Computing, 33:2295–2317, January 2011.
[29] I. Oseledets. Approximation of Matrices Using Tensor Decomposition. SIAM Journal on Matrix Analysis and Applications, 31, January 2010.
[30] S. Holtz, T. Rohwedder, and R. Schneider. The Alternating Linear Scheme for Tensor Optimization in the Tensor Train Format. SIAM Journal on Scientific Computing, 34(2):A683–A713, 2012.
[31] B. Huber, R. Schneider, and S. Wolf. A Randomized Tensor Train Singular Value Decomposition. In Compressed Sensing and its Applications, pages 261–290. Springer, 2017.
[32] K. Batselier, Z. Chen, and N. Wong. A Tensor Network Kalman Filter with an Application in Recursive MIMO Volterra System Identification. Automatica, 84, October 2016.
[33] D. Gedon, P. Piscaer, K. Batselier, C. Smith, and M. Verhaegen. Tensor Network Kalman Filter for LTI Systems. In 2019 27th European Signal Processing Conference (EUSIPCO), pages 1–5, September 2019.
[34] I. Oseledets and E. Tyrtyshnikov. TT-Cross Approximation for Multidimensional Arrays. Linear Algebra and its Applications, 432(1):70–88, 2010.
[35] D. Dua and C. Graff. UCI Machine Learning Repository, 2017.
[36] Y. LeCun and C. Cortes. MNIST Handwritten Digit Database. 2010.
[37] MNIST in CSV, 2018.
[38] A. Pal and S.K. Pal. Pattern Recognition and Big Data. World Scientific, 2017.