
Tensor Network Kalman Filtering for Large-Scale LS-SVMs

10/26/2021
by   Maximilian Lucassen, et al.

Least squares support vector machines are a commonly used supervised learning method for nonlinear regression and classification. They can be implemented in either their primal or dual form. The latter requires solving a linear system, which can be advantageous as an explicit mapping of the data to a possibly infinite-dimensional feature space is avoided. However, for large-scale applications, current low-rank approximation methods can perform inadequately. For example, current methods are probabilistic due to their sampling procedures, and/or suffer from a poor trade-off between the ranks and approximation power. In this paper, a recursive Bayesian filtering framework based on tensor networks and the Kalman filter is presented to alleviate the demanding memory and computational complexities associated with solving large-scale dual problems. The proposed method is iterative, does not require explicit storage of the kernel matrix, and allows the formulation of early stopping conditions. Additionally, the framework yields confidence estimates of obtained models, unlike alternative methods. The performance is tested on two regression and three classification experiments, and compared to the Nyström and fixed size LS-SVM methods. Results show that our method can achieve high performance and is particularly useful when alternative methods are computationally infeasible due to a slowly decaying kernel matrix spectrum.


1 Introduction

Kernel methods are a class of nonlinear modeling methods based on mapping the nonlinear input data to a high-dimensional feature space. In this high-dimensional feature space, the inner product of the data is implicitly computed by a user-defined kernel function, correlating the data points through their similarity. Due to their formulations, the data can be modeled with the simplicity of linear algorithms [1] [2] [3]. As a result, kernel methods are an attractive modeling class because of the insight and ease they provide, and are widely applied in numerous disciplines such as control [4], machine learning [5] and signal processing [6].

The SVM theory was initially proposed by Vapnik [1], in which a cost function is minimized by solving either the parametric primal problem or the non-parametric dual problem. In either case, the modeling is solved with quadratic programming, generating a unique global solution. In the least squares SVM (LS-SVM) one reformulates the original SVM problem by changing the loss function to a least squares form and altering the inequality constraints to equality constraints [7] [8] [9]. This makes the resulting LS-SVM dual problem much easier to solve compared to its SVM counterpart.

The computational complexity of solving the LS-SVM dual problem is $O(N^3)$ for direct methods and $O(kN^2)$ for iterative methods, where $N$ denotes the number of data points and $k$ the number of iterations. For small and medium-scale problems, general solvers from the direct and iterative method classes are applicable [10] [11]. However, these methods cannot be used when the number of data points grows exponentially large, due to their memory and computational requirements. By this we mean that $N$ can be written as $n^d$ for some $n$ and $d$. As the exponent $d$ grows, more and more methods become problematic to use. This exponential growth is an instance of the "curse of dimensionality". One approach to solving large dual problems is to use a low-rank approximation of the kernel matrix, which is the cornerstone of the Nyström [12] and fixed-size LS-SVM (FS-LSSVM) methods [13]. These methods aim to work around the prohibitive scaling by training on a subset of the data and by employing the primal-dual relationship of the LS-SVM. Two drawbacks of current low-rank approximation methods exist. Firstly, no confidence bounds on the approximate solution are obtained; only theoretical statistical guarantees can be given. Secondly, the size of the sample subset significantly impacts the accuracy as well as the memory and computational complexity. Methods that rely on a low-rank approximation of the kernel matrix inherently depend on a kernel spectrum that exhibits a strong decay. When such a decay is lacking, these methods are expected to perform poorly. An example of such a slowly decaying spectrum is shown in Fig. 1 for the two-spiral problem. As the sample size increases, one can see that the spectrum exhibits an ever-increasing plateau of important eigenvalues. This observation motivates the development of our proposed recursive method, which relies on a low-rank tensor decomposition instead. The power of our proposed method is demonstrated in the experiments, where we train an LS-SVM for the two-spiral problem on a large number of training points and achieve 100% accuracy during validation.

Figure 1: Eigenvalue spectra of differently sized two-spiral kernel matrices. The eigenvalues only become insignificant at the final indices. These spectra are difficult to approximate with low-rank methods that employ eigendecompositions as many eigencomponents and samples need to be used for an accurate solution.
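The qualitative behavior in Figure 1 can be reproduced with a few lines of MATLAB. The sketch below builds RBF kernel matrices for two-spiral datasets of increasing size and plots their eigenvalue spectra; the spiral parametrization, the dataset sizes and the bandwidth sigma are illustrative assumptions, not the values used in the paper.

```matlab
% Sketch: eigenvalue spectra of two-spiral RBF kernel matrices of increasing size.
% The spiral parametrization, sizes and bandwidth sigma are illustrative assumptions.
sigma = 0.5;
figure; hold on;
for N = [256 512 1024]                       % assumed dataset sizes
    t  = linspace(0.5, 4*pi, N/2)';          % spiral parameter
    X  = [ t.*cos(t),  t.*sin(t);            % first spiral
          -t.*cos(t), -t.*sin(t)];           % second spiral, phase shifted by pi
    D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');   % squared pairwise distances
    K  = exp(-D2 / (2*sigma^2));             % RBF kernel matrix
    ev = sort(eig((K + K')/2), 'descend');   % eigenvalue spectrum
    plot(max(ev, eps), 'DisplayName', sprintf('N = %d', N));
end
set(gca, 'YScale', 'log');
xlabel('eigenvalue index'); ylabel('eigenvalue'); legend show;
```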

1.1 Contributions of This Work

In this paper, the curse of dimensionality associated with solving the large-scale LS-SVM dual problem and the inability of current low-rank approximation methods to handle slowly decaying kernel spectra are addressed. The dual problem is cast into a tensor train (TT) form, which reduces the storage complexity from $O(N^2)$ down to $O(dn^2R^2)$ with $N = n^d$, where $R$ denotes the maximal TT-rank. In this form, the dual matrix never has to be constructed explicitly, nor any other matrix. A recursive Bayesian procedure is then developed to solve the large-scale dual problem in TT form by implementation of a tensor network Kalman filter (TNKF). Such a Bayesian approach also results in a covariance matrix of the dual variables, which can be used to implement an early-stopping criterion. The contributions of this work are:

  • Circumventing the curse of dimensionality of solving large-scale LS-SVM dual problems by using a low-rank TT representation.

  • A non-sampling based algorithm that evaluates the entire dual matrix and constructs low-rank tensor network approximations of each row.

  • Implementation of a recursive Bayesian filter that, unlike current low-rank approximation methods, allows the computation of confidence bounds on predictions.

1.2 Related Work

Many direct and iterative methods such as the LU decomposition and conjugate gradient [11] [14] are available for learning small-scale LS-SVMs. The $O(N^2)$ storage complexity and $O(N^3)$ computational complexity of direct methods inhibit their application to large-scale problems. Iterative solvers still suffer from a complexity of $O(kN^2) = O(kn^{2d})$, where $k$ denotes the number of iterations, which is exponential in $d$. A common way to deal with this difficulty is to employ low-rank approximation methods that utilize a subset of samples to approximate the solution. A trade-off between the obtainable accuracy and the associated computational and memory requirements is made for the problem to be feasible. Two well-known low-rank approximation techniques for LS-SVMs are the Nyström method [12] and FS-LSSVM [13]. Both are sampling-based methods that rely on the eigendecomposition of the kernel matrix. The Nyström method solves the LS-SVM dual problem with computational and storage complexities of $O(m^2N)$ and $O(mN)$, respectively, where $m$ denotes the subset size. FS-LSSVM estimates a nonlinear mapping function in the dual and then solves the primal problem; its computational and storage complexities are also $O(m^2N)$ and $O(mN)$ [10]. For both methods, the sampling procedure has a significant impact on the performance. Many sampling procedures exist, such as uniform (random) sampling, or, if computationally achievable, more advanced adaptive techniques such as sparse greedy matrix approximation [15], leverage scores [16], column-norm sampling [17], and Renyi-entropy sampling [7]. Other approaches to deal with large-scale linear systems commonly focus on parallel computation or sparsity. For example, in [18] a parallel GPU implementation is used for kernel machines; the linear scaling associated with increasing batch sizes is combined with adaptive kernels to speed up training times and enable more efficient parallelization. In [19], the computational complexity of training polynomial classifiers is significantly reduced through parallel implementations of tensor trains. In [20], a non-iterative pruning method is employed to generate a sparse LS-SVM solution: by using globally representative points, a support vector subset is found with lower computational complexity than traditional iterative pruning methods.

This paper is organized as follows. In Section 2 the basics of LS-SVMs are covered, and the Kalman filter is introduced in the context of Bayesian filtering. In Section 3, a brief introduction to tensor networks is given, including the required operations and the TT decomposition for the TNKF algorithm. The TNKF algorithm is presented in Section 4, covering its derivation, implementation and complexities. The TNKF is analyzed on several regression and classification datasets in Section 5. Lastly, conclusions and further work are discussed in Section 6.

The notation and abbreviations used in this paper are given in Table 1 and Table 2, respectively.

Scalars ($a, b, \ldots$)
Vectors ($\mathbf{a}, \mathbf{b}, \ldots$)
Matrices ($\mathbf{A}, \mathbf{B}, \ldots$)
Tensors ($\mathcal{A}, \mathcal{B}, \ldots$)
Tensor train of $\mathbf{a}$ (TT($\mathbf{a}$))
Matrix transpose ($\mathbf{A}^T$)
Identity matrix of size $N$ ($\mathbf{I}_N$)
Kernel matrix of size $N$ by $N$ ($\boldsymbol{\Omega}$)
Table 1: Used Notation
TT Tensor train
SVM Support vector machine
LS-SVM Least squares support vector machine
FS-LSSVM Fixed-size least squares support vector machine
KKT Karush-Kuhn-Tucker
SVD Singular value decomposition
TNKF Tensor network Kalman filter
Table 2: Used Abbreviations

2 Solving LS-SVMs with Bayesian filtering

2.1 Least Squares Support Vector Machines

In what follows, we consider the regression problem. The development of our method is easily adjusted to the classification problem. Consider a training set of $N$ inputs $x_i$ and outputs $y_i$. The regression model in the primal space is

$$y_i = \mathbf{w}^T \varphi(x_i) + b + e_i, \qquad i = 1, \ldots, N,$$

where $b$ denotes the bias, $\varphi(\cdot)$ denotes a feature map to a possibly infinite-dimensional space and $e_i$ are the residuals. The LS-SVM primal problem is then to estimate the weights $\mathbf{w}$, bias $b$ and residuals $e_i$ from

$$\min_{\mathbf{w},\, b,\, e}\; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^2 \quad \text{subject to} \quad y_i = \mathbf{w}^T\varphi(x_i) + b + e_i, \quad i = 1, \ldots, N.$$

Solving the primal problem can be advantageous when the dimension of the feature space is much smaller than $N$. However, solving the primal problem is often impossible because $\varphi(\cdot)$ is usually unknown, difficult to determine and possibly infinite-dimensional. The dual problem circumvents this problem by rewriting the primal problem in terms of a kernel function $K(\cdot,\cdot)$ as

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \boldsymbol{\Omega} + \gamma^{-1}\mathbf{I}_N \end{bmatrix} \begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix}. \tag{1}$$

The matrix $\boldsymbol{\Omega} \in \mathbb{R}^{N \times N}$ is called the kernel matrix and is defined as $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j)$. The predictor can then also be written in terms of the dual variables $\boldsymbol{\alpha}$ as

$$\hat{y}(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i) + b. \tag{2}$$

The dual variables are related to the residuals of the primal problem through $\alpha_i = \gamma e_i$. More detailed derivations of both the primal and dual problem can be found in [7].
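As a concrete illustration of (1) and (2), the following MATLAB sketch solves the dual system directly for a small one-dimensional regression problem with an RBF kernel. The data, bandwidth and regularization values are illustrative assumptions; for large $N$ this direct solve is exactly what becomes infeasible.

```matlab
% Sketch: direct solution of the LS-SVM dual system (1) and predictor (2).
% The data, RBF bandwidth sigma and regularization gamma are illustrative assumptions.
N = 200; sigma = 0.5; gamma = 10;
sincf = @(t) sin(pi*t) ./ (pi*t + (t == 0)) + (t == 0);   % sinc with sinc(0) = 1
x = linspace(-3, 3, N)';                      % training inputs
y = sincf(x) + 0.1*randn(N, 1);               % noisy training outputs
Omega = exp(-(x - x').^2 / (2*sigma^2));      % kernel matrix, Omega_ij = K(x_i, x_j)
A   = [0, ones(1, N); ones(N, 1), Omega + eye(N)/gamma];
sol = A \ [0; y];                             % solve the (N+1) x (N+1) dual system (1)
b = sol(1); alpha = sol(2:end);               % bias and dual variables
xt   = linspace(-3, 3, 400)';                 % test inputs
yhat = exp(-(xt - x').^2 / (2*sigma^2)) * alpha + b;   % predictor (2)
```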

2.2 Bayesian Filtering and the Kalman Filter

A recursive Bayesian filtering approach is adopted to solve the LS-SVM dual problem. Such a recursive approach makes it possible to quantify the uncertainty of our solution during the recursion, which enables the construction of an early-stopping criterion so that large-scale datasets can be trained. Here, the dual problem (1) is written as

$$\tilde{\mathbf{y}} = \mathbf{A}\,\mathbf{x} + \mathbf{r},$$

where $\mathbf{A} = \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \boldsymbol{\Omega} + \gamma^{-1}\mathbf{I}_N \end{bmatrix}$ is the dual problem matrix, $\mathbf{x} = \begin{bmatrix} b \\ \boldsymbol{\alpha}\end{bmatrix}$, $\tilde{\mathbf{y}} = \begin{bmatrix} 0 \\ \mathbf{y}\end{bmatrix}$, and the vector $\mathbf{r}$ contains the residuals of the dual problem. The goal now is to compute the posterior of $\mathbf{x}$ conditioned on the data $\tilde{\mathbf{y}}$. Using Bayes' rule this posterior can be written as

$$p(\mathbf{x} \mid \tilde{\mathbf{y}}) = \frac{p(\tilde{\mathbf{y}} \mid \mathbf{x})\, p(\mathbf{x})}{p(\tilde{\mathbf{y}})}, \tag{3}$$

where $p(\tilde{\mathbf{y}} \mid \mathbf{x})$ is called the likelihood and $p(\mathbf{x})$ the prior.

We assume that the primal residuals $e_i$ are zero-mean Gaussian random variables, $e_i \sim \mathcal{N}(0, \sigma_e^2)$. From $\alpha_i = \gamma e_i$ it then follows that the prior on $\boldsymbol{\alpha}$ is also normally distributed, $\boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{0}, \gamma^2\sigma_e^2\,\mathbf{I}_N)$. For convenience we denote the prior mean and covariance of $\mathbf{x}$ by $\mathbf{m}_0$ and $\mathbf{P}_0$. The regularization parameter $\gamma$ hence reflects our confidence in the prior. The residuals of the dual problem are also assumed to be normally distributed, $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \sigma_r^2\,\mathbf{I})$. Because the prior and the likelihood are Gaussian, the posterior is also Gaussian. From Bayes' rule an analytic solution can be derived to calculate the posterior, with mean and covariance

$$\begin{aligned}
\mathbf{m} &= \mathbf{m}_0 + \mathbf{P}_0\mathbf{A}^T\left(\mathbf{A}\mathbf{P}_0\mathbf{A}^T + \sigma_r^2\,\mathbf{I}\right)^{-1}\left(\tilde{\mathbf{y}} - \mathbf{A}\mathbf{m}_0\right),\\
\mathbf{P} &= \mathbf{P}_0 - \mathbf{P}_0\mathbf{A}^T\left(\mathbf{A}\mathbf{P}_0\mathbf{A}^T + \sigma_r^2\,\mathbf{I}\right)^{-1}\mathbf{A}\mathbf{P}_0.
\end{aligned} \tag{4}$$

The matrix inverse computation has a prohibitive computational complexity of $O(N^3)$. By realizing that the posterior distribution can also serve as the prior distribution for each succeeding observation, a recursive framework can be developed that avoids the explicit matrix inversion. In this article, single observations are used to update the distribution, which is equivalent to solving the dual problem row by row. The following model is used to describe the solution for $\mathbf{x}$ at iteration $k$:

$$\mathbf{x}_k = \mathbf{x}_{k-1} + \mathbf{w}_k, \tag{5}$$
$$\tilde{y}_k = \mathbf{a}_k^T\mathbf{x}_k + r_k, \tag{6}$$

where $\mathbf{a}_k^T$ is the $k$th row of the matrix $\mathbf{A}$ and $\mathbf{w}_k$ is a stochastic term that allows us to introduce a forgetting factor that weights past measurements. We also assume that $\mathbf{w}_k$ is normally distributed, $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_k)$. Note that $\mathbf{x}_k$ denotes the estimate of $\mathbf{x}$ at iteration $k$ and not the $k$th component of the vector $\mathbf{x}$. Assuming the Markov property

$$p(\mathbf{x}_k \mid \mathbf{x}_{1:k-1}, \tilde{y}_{1:k-1}) = p(\mathbf{x}_k \mid \mathbf{x}_{k-1})$$

and the conditional independence of the measurements

$$p(\tilde{y}_k \mid \mathbf{x}_{1:k}, \tilde{y}_{1:k-1}) = p(\tilde{y}_k \mid \mathbf{x}_k),$$

the posterior can be written as

$$p(\mathbf{x}_k \mid \tilde{y}_{1:k}) \propto p(\tilde{y}_k \mid \mathbf{x}_k)\, p(\mathbf{x}_k \mid \tilde{y}_{1:k-1}). \tag{7}$$

This recursive Bayesian framework leads directly to the Kalman filter equations, an optimal closed-form algorithm for linear stochastic models [21] [22]. The Kalman filter equations tell us how to compute the mean vector $\mathbf{m}_k$ and covariance matrix $\mathbf{P}_k$ of the posterior at iteration $k$:

$$\mathbf{m}_{k|k-1} = \mathbf{m}_{k-1}, \qquad \mathbf{P}_{k|k-1} = \mathbf{P}_{k-1} + \mathbf{Q}_k, \tag{8}$$

$$\begin{aligned}
v_k &= \tilde{y}_k - \mathbf{a}_k^T\mathbf{m}_{k|k-1},\\
s_k &= \mathbf{a}_k^T\mathbf{P}_{k|k-1}\mathbf{a}_k + \sigma_r^2,\\
\mathbf{k}_k &= \mathbf{P}_{k|k-1}\mathbf{a}_k\, s_k^{-1},\\
\mathbf{m}_k &= \mathbf{m}_{k|k-1} + \mathbf{k}_k v_k,\\
\mathbf{P}_k &= \mathbf{P}_{k|k-1} - \mathbf{k}_k s_k \mathbf{k}_k^T.
\end{aligned} \tag{9}$$

The covariance matrix $\mathbf{P}_k$ can be understood as a measure of uncertainty in $\mathbf{m}_k$. By choosing the covariance matrix of $\mathbf{w}_k$ as $\mathbf{Q}_k = (\lambda^{-1} - 1)\,\mathbf{P}_{k-1}$ with $0 < \lambda \leq 1$, an approximately exponentially decaying weighting on past data is introduced [23, pg. 240]. The prediction error and its variance at iteration $k$ are $v_k$ and $s_k$, respectively. Lastly, $\mathbf{k}_k$ is the Kalman gain, which weights the observation relative to the prior information in the computation of $\mathbf{m}_k$. Termination occurs when all rows of the dual problem have been processed, i.e. when $k = N+1$. Once the filter is finished at iteration $k$, one can set $(\hat{b}, \hat{\boldsymbol{\alpha}}) = \mathbf{m}_k$ and compute confidence bounds from the covariance matrix $\mathbf{P}_k$, similar to Gaussian processes [4, 24].

Although the matrix inversion is avoided by using the Kalman equations, the explicit construction of the covariance matrix $\mathbf{P}_k$ can still be problematic. For this reason, we propose to represent all vector and matrix quantities in the Kalman equations with tensor trains.
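Before introducing tensor trains, the recursion (5)-(9) can be written down in dense form. The following MATLAB sketch solves a small dual system row by row with the Kalman updates; the data, noise levels and forgetting factor are illustrative assumptions. For large $N$ the dense $N \times N$ covariance is exactly what becomes prohibitive, which motivates the TT representation.

```matlab
% Sketch: dense row-by-row Kalman recursion (5)-(9) for the dual system.
% Data sizes, sigma_e, sigma_r, gamma and lambda are illustrative assumptions.
N = 200; gamma = 10; sigma = 0.5; sigma_e = 0.1; sigma_r = 1e-3; lambda = 1;  % lambda = 1: no forgetting
x = linspace(-3, 3, N)';
y = sin(x) + sigma_e*randn(N, 1);
Omega = exp(-(x - x').^2 / (2*sigma^2));
A  = [0, ones(1, N); ones(N, 1), Omega + eye(N)/gamma];
yt = [0; y];

m = zeros(N+1, 1);                      % prior mean
P = (gamma*sigma_e)^2 * eye(N+1);       % prior covariance, following alpha_i = gamma*e_i
for k = 1:N+1                           % one iteration per row of the dual problem
    a = A(k, :)';                       % k-th row of the dual matrix
    P = P / lambda;                     % prediction step with forgetting factor, cf. (8)
    v = yt(k) - a' * m;                 % prediction error
    s = a' * P * a + sigma_r^2;         % prediction error variance
    K = P * a / s;                      % Kalman gain
    m = m + K * v;                      % measurement update of the mean, cf. (9)
    P = P - K * s * K';                 % measurement update of the covariance, cf. (9)
end
% m now approximates [b; alpha]; compare with the direct solve A \ yt.
```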

3 Tensor Trains

Tensors are multidimensional extensions of vectors and matrices to higher dimensions, and are relevant in many fields: signal processing, machine learning, and quantum physics [25] [26]. In this article, we define a tensor $\mathcal{A}$ as having an order $d$ and dimensions $n_1 \times n_2 \times \cdots \times n_d$. The elements of a tensor are given by its indices $(i_1, i_2, \ldots, i_d)$, for which the MATLAB notation is used: $\mathcal{A}(i_1, i_2, \ldots, i_d)$ with $1 \leq i_j \leq n_j$. Even though tensors can be worked with in their mathematical form, it is easier to use a visual representation, as shown in Figure 2. A tensor is illustrated as a core (circle) with a number of outgoing edges (lines) equal to its order. Each edge in such a diagram corresponds with a specific index.


Figure 2: Diagrammatic tensor notation of a scalar ($a$), vector ($\mathbf{a}$), matrix ($\mathbf{A}$), and 3-dimensional tensor ($\mathcal{A}$).

Tensor networks are factorizations of tensors, analogous to matrix factorizations of matrices. In fact, they are a generalization of matrix decompositions to higher orders. An easy introduction is given through the singular value decomposition (SVD) of a matrix $\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ into two orthogonal matrices $\mathbf{U}$, $\mathbf{V}$ and a diagonal matrix $\mathbf{S}$. It is one specific type of matrix factorization, and can therefore also be represented as a tensor network. The diagram for this matrix factorization is shown in Figure 3. The interconnected edges represent the summations over the indices, also called index contractions. Generalized to higher orders, the shared edges between tensors represent tensor index contractions, which can be understood as a higher-order generalization of matrix multiplications.

Figure 3: Tensor network representation of the SVD.

3.1 Tensor Train Vectors and Matrices

In their high-dimensional form, tensors can be difficult to work with. Firstly, the computational and memory complexities are burdensome because of the exponential scaling with the order $d$. Secondly, concepts from linear algebra are not directly applicable, making analysis difficult. Tensor networks give more flexibility, as the tensor is factorized into a number of lower-dimensional tensor network components, also called cores. These factorizations can alleviate the adverse scaling and offer more insight. The CP and Tucker decompositions are two well-known tensor network structures, but they have significant drawbacks, such as NP-hardness and poor scaling [27]. The TT decomposition avoids these issues and can therefore be a more easy-to-use and robust format [28, 29]. As the name suggests, a TT decomposition factorizes a tensor into a chain of cores, which are linked through their TT-ranks $R_1, \ldots, R_{d-1}$. In this paper, two types of tensor trains are used: a TT-vector TT($\mathbf{a}$) to represent a vector $\mathbf{a}$, and a TT-matrix TT($\mathbf{A}$) to represent a matrix $\mathbf{A}$. The diagrams of a TT-vector and a TT-matrix are shown in Figure 4 and Figure 5, respectively. Just like in Figure 3, the connecting edges in the diagrams are indices that are summed over. To generate these TT structures, vectors and matrices are first reshaped into high-order tensors, as illustrated in Figure 6 for the matrix case. Then, a low-rank TT decomposition is used to represent the obtained tensors. For example, a $1331 \times 1331$ matrix can be reshaped into a 6-way tensor with each dimension equal to 11, since $1331 = 11^3$. Both the row and column index are in this way split into 3 indices each, resulting in the 6-way tensor in Figure 5. This 6-way tensor is then approximated by a low-rank tensor train decomposition. The TT-ranks signify the amount of correlation between the dimensions of a tensor, and determine the complexities of the representation. For small TT-ranks $R$, the TT format yields significant compression ratios, as the storage complexities of a TT-vector and a TT-matrix are $O(dnR^2)$ and $O(dn^2R^2)$, respectively.

Figure 4: Diagrammatic representation of a tensor train vector of a vector $\mathbf{a}$. First the vector was reshaped into a 3-dimensional tensor.

Figure 5: Diagrammatic representation of a tensor train matrix decomposition. The matrix $\mathbf{A}$ was first reshaped into a 6-dimensional tensor. The edges pointing downwards are row indices of the original matrix $\mathbf{A}$ and the edges pointing upwards are column indices of $\mathbf{A}$.

Figure 6: Arrow pointing to the right: a 6th-order tensor is reshaped into a matrix. Arrow pointing to the left: a matrix is reshaped into a 6th-order tensor.

Several different algorithms for converting a vector/matrix into their respective TT-form are available [30, 28, 29, 31]. In this article the TT-SVD is used as the algorithm is able to find a TT that satisfies guaranteed error bounds. The TT-SVD uses consecutive SVD operations to convert a tensor into a TT. Also the TT-rounding algorithm will be used in this article. The TT-rounding algorithm takes a given TT and truncates the TT-ranks, also through a sequence of consecutive SVD computations. There are two possible implementations of the TT-rounding algorithm: either an upper bound for the relative approximation error is provided or the TT-rank can be set to a pre-chosen value. For more details on these two algorithms we refer the reader to [28].
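A minimal MATLAB sketch of the TT-SVD for a vector of length $n^d$ is given below. The simple relative-threshold truncation is an assumption standing in for the error-bounded truncation of [28]; the core layout ($r_{i-1} \times n \times r_i$) is a convention chosen for this sketch.

```matlab
function cores = tt_svd(vec, n, d, tol)
% Sketch of the TT-SVD: decompose a vector of length n^d into d TT cores,
% truncating singular values below a relative tolerance tol (simplified from [28]).
% Example usage: cores = tt_svd(randn(3^8, 1), 3, 8, 1e-10);
cores = cell(1, d);
C = vec(:);
r = 1;                                            % current left TT-rank
for i = 1:d-1
    C = reshape(C, r*n, []);                      % unfold: (r*n) x (remaining modes)
    [U, S, V] = svd(C, 'econ');
    sv = diag(S);
    rnew = max(1, sum(sv > tol*sv(1)));           % truncated TT-rank
    cores{i} = reshape(U(:, 1:rnew), r, n, rnew); % i-th TT core
    C = S(1:rnew, 1:rnew) * V(:, 1:rnew)';        % pass the remainder to the next core
    r = rnew;
end
cores{d} = reshape(C, r, n, 1);                   % last core
end
```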

3.2 Tensor Train Operations

An advantage of the TT-vector and TT-matrix representations is that various linear algebra operations can be done explicitly in the compressed form. The two main operations required in the Kalman equations are matrix multiplication and matrix addition. Addition and subtraction in TT form are done by a concatenation procedure between the respective cores [28, p. 2308]. Multiplication between a scalar and a TT is executed by multiplying the elements of one of the TT cores by the scalar. A matrix multiplication in TT-matrix form can be done via summations over indices between the two TT-matrices. The procedure is illustrated for an example in Figure 7. The matrices $\mathbf{A}$ and $\mathbf{B}$ are both represented by TT-matrices of 4 cores. The matrix product is then computed in TT-matrix form by summing over the column indices of $\mathbf{A}$ and the row indices of $\mathbf{B}$. These index summations "merge" corresponding TT cores, as indicated by the dotted black rectangle in Figure 7. As a result of this "merging", the TT-matrix ranks of the matrix product will be the product of the TT-matrix ranks of the individual matrices. This increase of the TT-matrix ranks motivates the use of the TT-rounding algorithm to truncate these ranks in order to keep computations feasible. The matrix product in TT form is denoted as TT($\mathbf{AB}$).

Figure 7: Diagrammatic representation of a matrix product in TT-matrix form. The resulting TT-matrix from this contraction, TT($\mathbf{AB}$), consists of four cores, each with two free indices as given by the edges. The edges pointing downwards are row indices and the edges pointing upwards are column indices.
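The same core-by-core contraction idea also gives the inner product between two TT-vectors, which is what steps 4 and 5 of the TNKF in Section 4 need (e.g. TT($\mathbf{a}_k$)$^T$TT($\mathbf{m}$)). A minimal MATLAB sketch, assuming both TT-vectors are stored as cell arrays of $r_{i-1} \times n \times r_i$ cores as produced by the TT-SVD sketch above:

```matlab
function ip = tt_dot(G, H)
% Sketch: inner product of two TT-vectors given as cell arrays of cores,
% G{i} of size rG(i-1) x n x rG(i) and H{i} of size rH(i-1) x n x rH(i).
M = 1;                                    % running contraction, grows to rG(i) x rH(i)
for i = 1:numel(G)
    [rg0, n, rg1] = size(G{i});
    [rh0, ~, rh1] = size(H{i});
    Mi = zeros(rg1, rh1);
    for j = 1:n                           % contract over the physical index
        Gj = reshape(G{i}(:, j, :), rg0, rg1);
        Hj = reshape(H{i}(:, j, :), rh0, rh1);
        Mi = Mi + Gj' * M * Hj;
    end
    M = Mi;
end
ip = M;                                   % 1 x 1 result after the last core
end
```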

4 Tensor Network Kalman Filter

The Kalman filter gives a simple iterative method for solving the dual problem, but the covariance matrix limits its feasibility: its explicit representation demands an $O(N^2)$ storage complexity. Fortunately, recasting the Kalman filter into a tensor network Kalman filter (TNKF) form avoids this problem, as has been done similarly in [32] and [33] for specific control applications. The reformulation is achieved by representing vectors by TT-vectors and matrices by TT-matrices, reducing the storage complexity to be linear in the order $d$, as indicated in Table 3. The TT-ranks ($R$) can be understood as hyperparameters that determine the accuracy-complexity trade-off. The Kalman filter in TT form is summarized in Algorithm 1.

Variables | TT variable | Storage complexity
$\mathbf{a}_k$ | TT($\mathbf{a}_k$) | $O(dnR^2)$
$\mathbf{m}_k$, $\mathbf{k}_k$, $\mathbf{P}_k$ | TT($\mathbf{m}_k$), TT($\mathbf{k}_k$), TT($\mathbf{P}_k$) | $O(dnR^2)$, $O(dnR^2)$, $O(dn^2R^2)$
$\tilde{y}_k$, $v_k$, $s_k$, $\lambda$ | Unchanged | $O(1)$
Table 3: Variables of the TNKF
Require: TT($\mathbf{m}_0$), TT($\mathbf{P}_0$), $\lambda$, $\sigma_r^2$, $\tilde{\mathbf{y}}$, training inputs, kernel function $K(\cdot,\cdot)$.
  while termination conditions not met do
     1. TT($\mathbf{m}_{k|k-1}$) = TT($\mathbf{m}_{k-1}$)
     2. TT($\mathbf{P}_{k|k-1}$) = $\lambda^{-1}$ TT($\mathbf{P}_{k-1}$)
     3. TT($\mathbf{a}_k$) = TT-SVD($\mathbf{a}_k$)
     4. $v_k$ = $\tilde{y}_k$ $-$ TT($\mathbf{a}_k$)$^T$ TT($\mathbf{m}_{k|k-1}$)
     5. $s_k$ = TT($\mathbf{a}_k$)$^T$ TT($\mathbf{P}_{k|k-1}$) TT($\mathbf{a}_k$) + $\sigma_r^2$
     6. TT($\mathbf{k}_k$) = TT($\mathbf{P}_{k|k-1}$) TT($\mathbf{a}_k$) $s_k^{-1}$
     7. TT($\mathbf{m}_k$) = TT($\mathbf{m}_{k|k-1}$) + TT($\mathbf{k}_k$) $v_k$
     8. TT($\mathbf{P}_k$) = TT($\mathbf{P}_{k|k-1}$) $-$ TT($\mathbf{k}_k$) $s_k$ TT($\mathbf{k}_k$)$^T$
  end while
Algorithm 1 Tensor Network Kalman Filter

The initial mean and covariance matrix of the prior distribution are constructed directly in TT form, as the explicit construction of the covariance matrix can be impractical or even impossible for large datasets. In [32], a more detailed explanation is given on how a rank-1 TT-vector and a rank-1 TT-matrix can be constructed for the mean and covariance, respectively. In each iteration, a row of the kernel matrix is constructed with the training data and kernel function. This row is then reshaped into a tensor, after which it is decomposed into a low-rank TT-vector. The necessity of storing more than one row of the kernel matrix is thereby avoided, which permits application to very large datasets.
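Since the prior mean is a zero vector and the prior covariance is a scaled identity, both admit exact rank-1 TT representations and never have to be formed densely. A minimal MATLAB sketch under the assumption $N = n^d$, with cores stored as cell arrays (TT-matrix cores of size $r \times n \times n \times r$); the sizes and prior variance are illustrative:

```matlab
% Sketch: exact rank-1 TT representations of the prior m0 = 0 and P0 = p0*I.
% Assumes the problem size factors as N = n^d; n, d and p0 are illustrative choices.
n = 4; d = 5; p0 = 1e2;
m0_tt = cell(1, d);
P0_tt = cell(1, d);
for i = 1:d
    m0_tt{i} = zeros(1, n, 1);                  % rank-1 TT-vector cores of the zero mean
    P0_tt{i} = reshape(eye(n), 1, n, n, 1);     % rank-1 TT-matrix cores of the identity
end
P0_tt{1} = p0 * P0_tt{1};                       % scale one core to obtain p0 * I
% Contracting the P0 cores (a Kronecker product of eye(n) factors) gives p0*eye(n^d).
```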

Table 4 lists the computational complexity of each step of the tensor network Kalman filter. The maximal TT-ranks of TT($\mathbf{a}_k$), TT($\mathbf{m}_k$), TT($\mathbf{k}_k$) and TT($\mathbf{P}_k$) are denoted $R_a$, $R_m$, $R_k$ and $R_P$, respectively. From the TNKF complexities, which include the construction of the $k$-th row of the kernel matrix, it is clear that the TT-ranks play an important role in the complexity of the method. Especially the TT-ranks of TT($\mathbf{a}_k$) and TT($\mathbf{P}_k$) determine much of the algorithm's computational cost, as these quantities are required in almost every step and impact the TT-ranks of the other variables. It is also evident that the construction of the TT-vector of $\mathbf{a}_k$ can be a dominating factor for large datasets.

An alternative to the TT-SVD algorithm with lower computational complexity would be an Alternating Linear Scheme (ALS) [30] approach, where the TT of the previous row could be used as an initial guess for the next row. Other alternatives with lower computational complexity are the randomized TT-SVD [31] and the cross-approximation algorithm [34]. In comparison to other low-rank approximation methods such as the Nyström and FS-LSSVM methods, the TNKF is advantageous if the data admits a low-TT-rank structure and when the spectrum of the kernel matrix decays slowly.

The TNKF is also advantageous compared to common iterative methods, such as conjugate gradient. These have computational complexities of $O(kN^2) = O(kn^{2d})$, where $k$ is the number of iterations until convergence, and therefore suffer from the curse of dimensionality [11]. As a result, iterative methods quickly become infeasible for large-scale datasets.

Step | Explanation
1. | Prediction update TT($\mathbf{m}_{k|k-1}$)
2. | Prediction update TT($\mathbf{P}_{k|k-1}$)
3a. | Construct kernel row $\mathbf{a}_k$
3b. | Reshape $\mathbf{a}_k$ to a tensor
3c. | TT-SVD
4. | Compute $v_k$
5. | Compute $s_k$
6a. | Compute TT($\mathbf{k}_k$)
6b. | TT-rounding of TT($\mathbf{k}_k$)
7a. | Measurement update TT($\mathbf{m}_k$)
7b. | TT-rounding of TT($\mathbf{m}_k$)
8a. | Measurement update TT($\mathbf{P}_k$)
8b. | TT-rounding of TT($\mathbf{P}_k$)
Table 4: Computational complexities of the TNKF

4.1 Computation of Confidence Bounds

The outputs of Algorithm 1 are the mean and covariance matrix of the posterior distribution in TT form, which can then be used for validation and testing. The dual model (2) can be implemented in TT form to keep working with low computational and storage complexities. Consider $N_t$ test points. Similarly to Algorithm 1, rows of the kernel matrix based on the training and test data are constructed and transformed into TT-vectors TT($\mathbf{a}_l$). The prediction is then obtained by computing TT($\mathbf{a}_l$)$^T$TT($\mathbf{m}$). The variance of the prediction is TT($\mathbf{a}_l$)$^T$TT($\mathbf{P}$)TT($\mathbf{a}_l$) $+\ \sigma_r^2$ and can be used to construct the $3\sigma$ (99.73%) confidence bounds. The bounds describe how well the kernel regression or classification model fits the data. In Algorithm 2, the TT test regression procedure is given. A similar procedure can be derived for classification.

Require: TT($\mathbf{m}$), TT($\mathbf{P}$), $\sigma_r^2$, kernel function $K(\cdot,\cdot)$, training inputs, test inputs
  for $l$ = 1 : $N_t$ do
     1. TT($\mathbf{a}_l$) = TT-SVD($\mathbf{a}_l$)
     2. $\hat{y}_l$ = TT($\mathbf{a}_l$)$^T$ TT($\mathbf{m}$)
     3. $\hat{\sigma}_l^2$ = TT($\mathbf{a}_l$)$^T$ TT($\mathbf{P}$) TT($\mathbf{a}_l$) + $\sigma_r^2$
  end for
Algorithm 2 TNKF Regression - Prediction and computation of confidence bounds.
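For clarity, the prediction and $3\sigma$ bound computation of Algorithm 2 can also be written in dense form, i.e. without the TT compression. The MATLAB sketch below first computes the exact posterior via (4) and then loops over test points; the data, kernel bandwidth, $\gamma$ and noise levels are illustrative assumptions.

```matlab
% Sketch: dense version of Algorithm 2 -- predictive mean and 3-sigma bounds.
% The data, kernel bandwidth, gamma and noise levels are illustrative assumptions.
N = 200; sigma = 0.5; gamma = 10; sigma_e = 0.1; sigma_r = 1e-3;
x = linspace(-3, 3, N)';
y = sin(x) + sigma_e*randn(N, 1);
Omega = exp(-(x - x').^2 / (2*sigma^2));
A  = [0, ones(1, N); ones(N, 1), Omega + eye(N)/gamma];
yt = [0; y];
P0 = (gamma*sigma_e)^2 * eye(N+1);                 % prior covariance (assumed choice)
S  = A*P0*A' + sigma_r^2*eye(N+1);
m  = P0*A' * (S \ yt);                             % posterior mean, cf. (4)
P  = P0 - P0*A' * (S \ (A*P0));                    % posterior covariance, cf. (4)

xt = linspace(-3, 3, 300)';                        % test inputs
yhat = zeros(size(xt)); halfwidth = zeros(size(xt));
for l = 1:numel(xt)
    a = [1; exp(-(x - xt(l)).^2 / (2*sigma^2))];   % [1, K(xt_l, x_1), ..., K(xt_l, x_N)]'
    yhat(l)      = a' * m;                         % predictive mean, cf. (2)
    halfwidth(l) = 3 * sqrt(a' * P * a + sigma_r^2);  % 3-sigma (99.73%) half-width
end
% Confidence bounds: yhat - halfwidth and yhat + halfwidth.
```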

4.2 Practical Considerations

Many different factorizations of the dimensions can be considered when converting vectors and matrices into TT form. For example, a kernel matrix row of length $N = n^d$ can be "tensorized" into tensors of different orders and dimensions, depending on how $N$ is factorized. A general recommendation is to use the prime factorization, since then the dimension is decomposed into its smallest constituents. Using a prime factorization does, however, not uniquely determine how to "tensorize" vectors and matrices, since the TT-ranks are known to depend on the order of the indices. For this reason it is also recommended to sort the dimensions. For the kernel matrix, this means that each row would be reshaped into a tensor whose dimensions are the sorted prime factors of $N$.
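A small MATLAB sketch of this recommendation: factor $N$ into primes, sort the factors, and reshape a kernel row accordingly. The size and the row contents are placeholders for illustration.

```matlab
% Sketch: choose the tensorization of a length-N kernel row from the sorted
% prime factorization of N. The size N and the row ak are placeholders.
N  = 5832;                       % illustrative size (5832 = 2^3 * 3^6)
ak = randn(N, 1);                % placeholder for one row of the kernel matrix
dims = sort(factor(N));          % prime factors of N, sorted in ascending order
Ak   = reshape(ak, dims);        % tensorize the row into a d-way tensor
% A TT decomposition (e.g. the tt_svd sketch above) is then applied to Ak.
```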

Choosing the rounding tolerances for the TT-ranks is non-trivial, as there is no way to determine their values a priori. Manual tuning and grid searches on subsets of the data can be used to find suitable values.

Early stopping criteria can be designed based on the norm of the covariance TT($\mathbf{P}_k$). If the norm is small and changes only slightly over the iterations, then there is high confidence in the solution at iteration $k$. Values for the covariance-based stopping criteria can be determined by evaluating the primal error distributions and/or the residuals' variances, which are related to the dual weights by $\alpha_i = \gamma e_i$. Termination criteria can also be based on other measures, such as relaxed Karush-Kuhn-Tucker (KKT) conditions.
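As an illustration, a covariance-based check of this kind could look as follows in MATLAB. The tolerance, the window length and the use of a simple norm history are assumptions for the sketch, not the exact criterion used in the paper.

```matlab
function stop = tnkf_early_stop(Pnorm_hist, tol, window)
% Sketch: covariance-norm based early-stopping check. Pnorm_hist is assumed to
% collect the norm of TT(P_k) during the recursion; tol and window are
% illustrative choices, not the paper's exact criterion.
stop = false;
if numel(Pnorm_hist) < window, return; end
recent = Pnorm_hist(end-window+1:end);
small  = recent(end) < tol;                          % norm is small ...
flat   = max(abs(diff(recent))) < tol * recent(end); % ... and barely changing
stop   = small && flat;
end
```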

One advantage of the TNKF is that its truncation parameters make it less sensitive to overfitting, because the TT-ranks act as an implicit regularization. Truncations are easiest and most beneficial when the hyperparameters are small, because this produces a rapidly decaying singular value spectrum for which small TT-ranks can be achieved in the TT-SVD. However, the selection of truncation and kernel parameters is interdependent, and a balance needs to be found between them in order to fit the data well. A consequence of performing truncations is that the confidence region becomes wider, locally or globally, allowing for more uncertainty in the problem.

The TNKF performance is also impacted by the specified residual variances, which determine the initial distributions. These variances influence the balance between generalization and overfitting, affecting both the width of the confidence bounds and the accuracy of the mean. In the supplied initial distribution, the mean does not impact the regression: the first row of the dual problem is a KKT optimality condition ($\sum_i \alpha_i = 0$) that resets the mean to zero. The initial covariance does impact the regression, but only if significant truncations are performed or early stopping is used in the TNKF, such that local variances do not converge to their lower bound. The dual residual covariance $\sigma_r^2$ determines this lower bound for the covariance in the Kalman equations, and can be used to supply an initial distribution. If no primal residual variance is known, a user-specified initial distribution can be supplied instead.

5 Experiments

In this section, the TNKF algorithm is applied to a number of large-scale regression and classification problems and compared to the FS-LSSVM and Nyström methods. Table 5 gives an overview of the datasets, together with how many data points were used for training and validation and the dimension of the input space.

Datasets | # Training samples | # Test samples | Input dimension
Noisy sinc [7] | - | - | 1
Two-spiral [7] | - | - | 2
Adult [35] | - | - | 99
MNIST [36, 37] | - | - | 784
Table 5: Considered datasets in this paper

Because of the differences between the considered low-rank approximation methods, a direct comparison is not straightforward. The following steps and assumptions are made to simplify the comparison.

  • The Nyström method uses uniform random sampling to generate a subset of (S) data points. FS-LSSVM uses the Renyi-entropy criterion to generate a subset of (S) data points [7].

  • The number of Renyi sampling iterations for FS-LSSVM is chosen the same as the number of iterations of the TNKF, equal to the number of data points.

  • Multiple performance measures are shown for each method including different truncation parameters for the TNKF, subset sizes (S), and the number of eigenvectors (EV) in the Nyström approximation.

  • The same kernel hyperparameters are used to compare the methods. These were found by grid searches to determine the approximate orders, and final manual iterative tuning of the TNKF.

  • For regression, the root-mean-square error (RMSE),

    RMSE $= \sqrt{\dfrac{1}{N_t}\sum_{l=1}^{N_t}\left(y_l - \hat{y}_l\right)^2}$,

    and the fit percentage based on the normalized root-mean-square error (NRMSE),

    Fit $= 100\%\,\left(1 - \mathrm{NRMSE}\right)$ with NRMSE $= \dfrac{\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2}{\lVert \mathbf{y} - \bar{y}\,\mathbf{1} \rVert_2}$,

    are used as performance measures (a computation sketch is given after this list).

  • For classification, the performance is given through the percentage of correctly assigned labels. The confidence (%) is based on the total number of misclassifications when the $3\sigma$ confidence bounds are used as decision models.
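A MATLAB sketch of these performance measures; the vectors y and yhat below are placeholder data standing in for test targets and predictions.

```matlab
% Sketch: regression and classification performance measures.
% y and yhat are placeholder test targets and predictions.
y    = randn(100, 1);
yhat = y + 0.1*randn(100, 1);
rmse  = sqrt(mean((y - yhat).^2));                 % root-mean-square error
nrmse = norm(y - yhat) / norm(y - mean(y));        % normalized RMSE
fitpc = 100 * (1 - nrmse);                         % fit percentage
% For classification with labels in {-1, +1}, accuracy is the percentage of
% correctly assigned labels:
% acc = 100 * mean(sign(yhat) == y);
```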

In the upcoming subsections, the following conventions are used: all datasets are centered, effectively removing the bias term and first column of the dual problem. This means that only the dual weights are learned. The bias term can simply be computed by calculation of the mean and it can be added to the final model if desired. In the performance tables, ”NA” stands for ”not applicable”. This is when the algorithm is infeasible because of computational or memory requirements, or because no confidence bounds are generated. Multiple instances of the TNKF are tested for each dataset to analyze the effects of TT-rank truncations. Each TNKF instance has different truncation parameters, which are presented below the performance tables. The number of samples for Nyström and FS-LSSVM is given through ”#S” after their names.

All experiments were conducted in MATLAB (version R2019b) and performed on a laptop with an Intel 6-Core i7 processor running at 2.6 GHz and 16GB of RAM. The MATLAB code can be downloaded from https://github.com/maximilianluc/LSSVM-TNKF. For the FS-LSSVM and Nyström methods the LS-SVMlab toolbox was used, freely available at https://www.esat.kuleuven.be/sista/lssvmlab/.

The hyperparameters and truncation parameters were found through iterative grid searches with increasingly fine grid spacing. Currently there are no methods to determine these parameter values beforehand. Additionally, the hyperparameter and truncation parameter values are interdependent.

5.1 Regression Problem 1: Noisy Sinc Function

To find an approximation of the noisy sinc function, the TNKF needs to be initialized. The sinc function is corrupted with zero-mean Gaussian noise. It is assumed that this noise distribution is known, and $\sigma_e^2$ is chosen equal to the noise variance. The RBF kernel function is used with fixed values for the kernel bandwidth and the regularization parameter $\gamma$. The initial distribution of the dual weights is set according to the relation $\alpha_i = \gamma e_i$ and supplied in TT form, emphasizing high confidence in the initial solution.

Method Training RMSE Test RMSE Test Fit %
TNKF 0.160 0.162 69.5
TNKF 0.104 0.104 75.4
FS-LSSVM 20 S 0.261 0.260 25.5
FS-LSSVM 100 S 0.100 0.102 75.3
FS-LSSVM 500 S
Nyström S, 10 EV 0.788 0.786 31.6
Nyström S, 50 EV 0.146 0.148 60.2
Nyström S, 50 EV 0.194 0.180 34.7
Nyström S, 75 EV 0.197 0.185 28.9
(a) Performance values
0.005 0.1 0.02 0.005 0.001
0 0.001 0.0005 0.2 0.001
(b) Truncation values for the TNKF
Table 6: Test performance of the approximation methods on the noisy sinc function (RBF kernel).

The performance of the TNKF depends on its TT-ranks, which are determined by the truncations and the selected kernel hyperparameters, as given in Table 6(a). If both are appropriately specified, the TNKF can obtain near-optimal RMSE and fit values, as dictated by the noise's standard deviation. The performance is poor when the truncations are set too large, as shown by the TNKF instance with the larger truncation values in Table 6(b): truncating the kernel rows too strongly results in too much information loss, crucial interrelationships in the data are lost, and the prediction power of the TNKF is reduced. The FS-LSSVM also manages to realize near-optimal results. Additionally, through iterative tuning of the sample sizes to analyze the complexity-accuracy trade-off, it is apparent that very few samples are needed to generate an accurate prediction for the LS-SVM; the method may therefore be preferable over the TNKF in terms of complexity. Similarly to the TNKF, FS-LSSVM is less sensitive to the chosen hyperparameters. With the Renyi-sampling criterion, this method can select the most informative columns to acquire a nonlinear mapping function, which makes it unnecessary to use larger subsets, as the increase in performance is minimal compared to the increase in complexity. The Nyström method is much more sensitive to which columns are sampled and how many. As presented in Table 6(a), it does not match the performance of the TNKF and FS-LSSVM. Unlike the TNKF and FS-LSSVM methods, it is not able to select the most informative columns to overcome its susceptibility to noise; as a result, and due to the noise, the eigendecomposition is not accurate. In Figure 8, the initial distribution of the residuals in the TNKF is set equal to the noise distribution, which results in smooth confidence bounds.

Figure 8: Regression performance of the TNKF. The initial residual distribution is set equal to the noise distribution of the underlying data.
Figure 9: Training RMSE of three TNKF implementations on a small noisy sinc problem. The TNKF truncations, hyperparameters, and initialization can have a significant effect on the accuracy. The TNKF truncation values are provided in Table 7.
0.001 0.001 0.001 0.001 0.001
0.001 0.5 0.001 0.001 0.001
0.15 0.001 0.001 0.001 0.001
Table 7: Truncation values for the TNKF error plots Fig. 9.

In Figure 9, the RMSEs of three different TNKF implementations are plotted over the iterations for a small-scale noisy sinc problem. The respective truncation values are given in Table 7. Convergence of the TNKF depends significantly on the specified truncation values. For one of the instances, the truncation of TT($\mathbf{a}_k$) prohibits convergence to the noise RMSE, so the approximation does not accurately describe the dataset. Its convergence plot also oscillates much more than those of the other instances, because truncating TT($\mathbf{a}_k$) can make the training data less informative. TT($\mathbf{a}_k$) therefore plays a crucial role in whether the TNKF converges. Also, for the other two instances, early stopping can be performed with minimal consequences for the test RMSE if tuned properly.

5.2 Classification Problem 1: Two-spiral

A large-scale version of the two-spiral problem is considered, a well-known and difficult machine learning problem. The task is to classify two identical, but phase-shifted, spirals from each other. The data is separable, but a highly nonlinear decision boundary has to be found [7]. The RBF kernel is used with fixed hyperparameters. The initial distribution for the dual weights is chosen as a zero-mean Gaussian. The data adheres to a TT-rank-1 structure, for which the memory and computational costs only scale with $d$ and $n$. The test dataset was generated by selecting every fourth point of the two training spirals.

Figure 10: The two spirals of the test dataset as classified by the TNKF. A subset is shown for clarity.
Method Correctly Labeled % Confidence %
TNKF
FS-LSSVM NA NA
Nyström S, 250EV 50.8 NA
Nyström S, 500EV 50.7 NA
Nyström S, 250EV NA NA
(a) Performance values
1 1 1 1 1
(b) Truncation values for the TNKF
Table 8: Test performance of the approximation methods on the large-scale two-spiral problem (RBF kernel).

The TNKF instances manage to distinguish the two spirals perfectly, as given in Table 8 and displayed in Figure 10. Additionally, confidence bounds of the decision function are found. Because of the small TT-ranks, the TNKF is computationally advantageous and requires no explicit storage of a large matrix. The FS-LSSVM classifier could only label a subset of the data and is therefore assigned "NA": due to the size of the dataset and the small kernel hyperparameter, it was impossible to generate a nonlinear mapping function that generalizes well over the entire dataset. The Nyström method also performed poorly, as the number of samples needed for a good estimation exceeds RAM capacity; when fewer samples are chosen, the method has low labeling power. The large-scale two-spiral problem demonstrates that the TNKF is especially suitable in cases where the kernel eigenspectrum decays slowly. The TNKF can incorporate the most important information from the dataset in an implicit way, without requiring a sampling procedure.

5.3 Classification Problem 2: Adult Dataset

In this experiment the goal is to classify whether a person makes over $50k a year based on 14 features, both numerical and categorical [35]. By converting the categorical features to numerical values with one-hot encoding, a total of 99 features is obtained. The RBF kernel is used with fixed hyperparameters. Two TNKF instances are trained on the entire dataset; a third instance uses early stopping and therefore sees only a subset of the data. The three TNKFs are implemented with zero-mean initial distributions. It is assumed that the residual is irrelevant for this census data, and $\sigma_r^2$ is therefore chosen very small.

Method Correctly Labeled % Confidence %
TNKF 83 98.2
TNKF
TNKF 81.4 96
FS-LSSVM 50 S 44.3 NA
FS-LSSVM 500 S 73.2 NA
FS-LSSVM 800 S 76.2 NA
Nyström , 50 EV 81.2 NA
Nyström , 25 EV 82.7 NA
Nyström , 25 EV 84.4 NA
(a) Performance values
0.001 =6 0.005 =0.005 0.001
0.001 =0.01 0.003 =0.003 0.001
0.001 =30 0.001 =6 0.001
(b) Truncation values for the TNKF
Table 9: Test performance of the approximation methods on the Adult dataset (RBF kernel).

In Table 9(a) the performances on the Adult dataset are given. The TNKF is capable of achieving performances equal to those found in the literature for SVMs and LS-SVMs [7, 38]. The data adheres to small TT-ranks, so severe truncations are unnecessary. The TNKF instance that employed early stopping and a forgetting factor managed to reach decent labeling power. Its early-stopping parameters are described below:

  • Forgetting factor $\lambda$,

  • Norm of TT($\mathbf{P}_k$) smaller than a specified threshold,

  • Change in the norm of TT($\mathbf{P}_k$) at most a specified tolerance for at least 5 iterations.

Implementation of early stopping required only 4200 rows to be iterated over. This reduced training time by around 80%, but yielded a lower classification accuracy. The chosen kernel hyperparameter generates a kernel matrix that can easily be approximated with a low-rank matrix structure. The Nyström method is particularly suitable for this, and achieves very high performance values with few sampled columns. The nonlinear mapping function, however, is difficult to approximate: FS-LSSVM requires many support vectors and is computationally the most demanding method for this problem.

5.4 Classification Problem 3: binary MNIST

For the MNIST digit recognition dataset, a binary classification problem is considered in this paper. The digits are split into two groups and classified based on the 784 normalized pixels per image. A classifier is trained on a set of training images and tested on a separate set of test images. No noise is considered to act on the data. The RBF kernel is used with fixed hyperparameters, and the initial condition is chosen as a zero-mean Gaussian. The performances of the low-rank approximation methods on the binary MNIST classification problem are presented in Table 10.

Method Correctly Labeled % Confidence %
TNKF
FS-LSSVM 50 S 61.4 NA
FS-LSSVM 250 S 75.3 NA
FS-LSSVM 1000 S 92.8 NA
Nyström S, 10 EV 82.4 NA
Nyström S, 50 EV 93.7 NA
Nyström S, 25 EV 96.9 NA
(a) Performance values
4 1 1 2 2
(b) Truncation values for the TNKF
Table 10: Test performance of the approximation methods on the MNIST dataset (RBF kernel).

The TNKF achieves high classification and confidence percentages, and is well-suited for the MNIST dataset because the images conform to small TT-ranks. As a result of the low TT-ranks, the memory and computational requirements are small, though not as favorable as those of the Nyström method. The FS-LSSVM requires many support vectors and therefore demands much computational effort, but can nevertheless reach high labeling power. The Nyström approximations perform well with very few samples. Of the three low-rank approximation methods, it is computationally the lightest to apply and yields very high accuracies.

6 Conclusion

This paper presents a recursive Bayesian filtering framework with the tensor network Kalman filter to solve large-scale LS-SVM dual problems. To the best of our knowledge, it is the first non-sampling based algorithm for large-scale LS-SVM dual problems. The recursive Bayesian framework allows early stopping and yields confidence bounds on the predictions. Current low-rank approximation methods lack such properties. Also, the curse of dimensionality is avoided in this work because the method is iterative and uses a tensor train representation. The TNKF is especially advantageous in situations with high nonlinearities, when the kernel data conforms to a low-TT-rank structure, or when many samples are required for alternative low-rank methods. High accuracies are attained on all considered problems.

In further research, the TNKF computational complexities can be further reduced. The algorithm can easily be extended to a parallel implementation and a batch framework to reduce training times. Additionally, directly constructing the kernel rows in TT form can save a significant portion of the computational complexity for extremely large problems and problems with many features. The tuning of the interdependent truncation parameters and kernel hyperparameters requires further investigation. Inference methods can be developed to determine these parameters and would save much tuning work, currently done by exhaustive grid searches. Lastly, the TNKF can be readily adopted to other kernel methods and fields involved with solving large-scale linear systems.

Acknowledgment

Johan Suykens acknowledges support from EU: The research leading to these results has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program/ERC Advanced Grant E-DUALITY (787960). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: C14/18/068 Flemish Government: FWO: projects: GOA4917N, Impulsfonds AI: VR 2019 2203 DOC.0318/1QUATER, Ford KU Leuven Research Alliance Project KUL0076, ICT 48 TAILOR, Leuven.AI Institute.

References

  • [1] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg, 1995.
  • [2] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
  • [3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • [4] J. Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer International Publishing, January 2016.
  • [5] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • [6] J.L Rojo-Alvarez, M. Martinez-Ramon, J. Munoz-Mari, and G. Camps-Valls. Digital Signal Processing with Kernel Methods. Wiley-IEEE Press, 1st edition, 2018.
  • [7] J.A.K Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.
  • [8] J. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9:293–300, June 1999.
  • [9] K. De Brabanter, J. De Brabanter, J. A. K. Suykens, and B. De Moor. Approximate Confidence and Prediction Intervals for Least Squares Support Vector Regression. IEEE Transactions on Neural Networks, 22(1):110–120, 2011.
  • [10] B. Hamers. Kernel Models for Large Scale Applications. PhD thesis, Katholieke Universiteit Leuven, ESAT, Katholieke Universiteit Leuven, Belgium, Kasteelpark Arenberg 10, 3001 Leuven, June 2004.
  • [11] G.H. Golub and C.F Van Loan. Matrix Computations. The Johns Hopkins University Press, 4 edition, 2013.
  • [12] C. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
  • [13] K. De Brabanter, J. De Brabanter, J. A. K. Suykens, and B. De Moor. Optimized Fixed-size Kernel Models for Large Data Sets. Comput. Stat. Data Anal., 54(6):1484–1504, June 2010.
  • [14] B. Hamers. A Comparison of Iterative Methods for Least Squares Support Vector Machine Classifiers. Technical report, Katholieke Universiteit Leuven, Belgium, 2001.
  • [15] A.J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 911–918, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
  • [16] A. Rudi, D. Calandriello, L. Carratino, and L. Rosasco. On Fast Leverage Score Sampling and Optimal Learning, 2018.
  • [17] P. Drineas and M.W Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning Research, 6, 2005.
  • [18] S. Ma and M. Belkin. Kernel Machines that Adapt to GPUs for Effective Large Batch Training, 2018.
  • [19] Z. Chen, K. Batselier, J.A.K Suykens, and N. Wong. Parallelized Tensor Train Learning of Polynomial Classifiers. CoRR, abs/1612.06505, 2016.
  • [20] Y. Ma, X. Liang, G. Sheng, J. T. Kwok, M. Wang, and G. Li. Noniterative Sparse LS-SVM Based on Globally Representative Point Selection. IEEE Transactions on Neural Networks and Learning Systems, pages 1–11, 2020.
  • [21] Simo Särkkä. Bayesian Filtering and Smoothing. Institute of Mathematical Statistics Textbooks. Cambridge University Press, 2013.
  • [22] M. Verhaegen and V. Verdult. Filtering and System Identification: A Least Squares Approach. Cambridge University Press, 2007.
  • [23] S.S. Haykin. Kalman Filtering and Neural Networks. John Wiley & Sons, Inc., USA, 2001.
  • [24] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, January 2006.
  • [25] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor Decompositions for Signal Processing Applications: From Two-way to Multiway Component Analysis. IEEE Signal Processing Magazine, 32(2):145–163, March 2015.
  • [26] A. Avella and F. Mancini. Strongly Correlated Systems: Numerical Methods. Springer Berlin Heidelberg, January 2013.
  • [27] T.G Kolda and B.W Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, September 2009.
  • [28] I. Oseledets. Tensor-Train Decomposition. SIAM J. Scientific Computing, 33:2295–2317, January 2011.
  • [29] I. Oseledets. Approximation of Matrices Using Tensor Decomposition. SIAM Journal on Matrix Analysis and Applications, 31, January 2010.
  • [30] S. Holtz, T. Rohwedder, and R. Schneider. The Alternating Linear Scheme for Tensor Optimization in the Tensor Train Format. SIAM Journal on Scientific Computing, 34(2):A683–A713, 2012.
  • [31] Benjamin Huber, Reinhold Schneider, and Sebastian Wolf. A Randomized Tensor Train Singular Value Decomposition. In Compressed Sensing and its Applications, pages 261–290. Springer, 2017.
  • [32] K. Batselier, Z. Chen, and N. Wong. A Tensor Network Kalman Filter with an Application in Recursive MIMO Volterra System Identification. Automatica, 84, October 2016.
  • [33] D. Gedon, P. Piscaer, K. Batselier, C. Smith, and M. Verhaegen. Tensor Network Kalman Filter for LTI Systems. In 2019 27th European Signal Processing Conference (EUSIPCO), pages 1–5, September 2019.
  • [34] I. Oseledets and E. Tyrtyshnikov. TT-Cross Approximation for Multidimensional Arrays. Linear Algebra and its Applications, 432(1):70–88, 2010.
  • [35] D. Dua and C. Graff. UCI Machine Learning Repository, 2017.
  • [36] Y. LeCun and C. Cortes. MNIST Handwritten Digit Database. 2010.
  • [37] MNIST in CSV, 2018.
  • [38] A. Pal and S.K. Pal. Pattern Recognition and Big Data. World Scientific, 2017.