Parallelized Tensor Train Learning of Polynomial Classifiers

12/20/2016 · Zhongming Chen et al.

In pattern classification, polynomial classifiers are well-studied methods as they are capable of generating complex decision surfaces. Unfortunately, the use of multivariate polynomials is limited to kernels as in support vector machines, because polynomials quickly become impractical for high-dimensional problems. In this paper, we effectively overcome the curse of dimensionality by employing the tensor train format to represent a polynomial classifier. Based on the structure of tensor trains, two learning algorithms are proposed which involve solving different optimization problems of low computational complexity. Furthermore, we show how both regularization to prevent overfitting and parallelization, which enables the use of large training sets, are incorporated into these methods. Both the efficiency and efficacy of our tensor-based polynomial classifier are then demonstrated on the two popular datasets USPS and MNIST.


Code repository: TTClassifier (Tensor Train polynomial classifier)

I Introduction

Pattern classification is the machine learning task of identifying to which category a new observation belongs, on the basis of a training set of observations whose category membership is known. Algorithms that use a known training dataset to make predictions in this way are called supervised learning methods; they have been extensively studied and have wide applications in the fields of bioinformatics [1], computer-aided diagnosis (CAD) [2], machine vision [3], speech recognition [4], handwriting recognition [5], spam detection, and many others [6, 7, 8]. Usually, different kinds of learning methods use different models to generalize from training examples to novel test examples.

As pointed out in [9, 10], one of the important invariants in these applications is the local structure: variables that are spatially or temporally nearby are highly correlated. Local correlations benefit the extraction of local features, because configurations of neighboring variables can be classified into a small number of categories (e.g. edges, corners, …). For instance, in handwritten character recognition, correlations between nearby image pixels tend to be more reliable than those between distant pixels. Learning methods that incorporate this kind of prior knowledge often demonstrate state-of-the-art performance in practical applications. One popular method for handwritten character recognition uses convolutional neural networks (CNNs) [11, 12], which are variations of multilayer perceptrons designed to require minimal amounts of preprocessing. In this model, each unit in a layer receives inputs from a set of units located in a small neighborhood of the previous layer, and these mappings share the same weight vector and bias within a given convolutional layer. An important component of a CNN is the pooling layer, which implements a nonlinear form of down-sampling and thereby reduces the number of parameters and the computational load of the network. Another popular method uses support vector machines (SVMs) [13, 14]. The original finite-dimensional feature space is mapped into a much higher-dimensional space, where the inner product is easily computed through the ‘kernel trick’. By considering the Wolfe dual representation, one can find the maximum-margin hyperplane that separates the examples of different categories in that space. However, it is worth mentioning that these models require a large amount of memory and a long processing time to train their parameters. For instance, if there are thousands of nodes in a CNN, the weight matrices of the fully-connected layers contain millions of entries. The major limitation of basic SVMs is their high computational complexity, which is at least quadratic in the dataset size. One way to deal with large datasets in the SVM framework is to use a fixed-size least squares SVM (fixed-size LS-SVM) [15], which approximates the kernel mapping in such a way that the problem can be solved in the primal space.

II Tensors in machine learning

Tensors are a multidimensional generalization of matrices to higher orders and have recently gained attention in the field of machine learning. Classification via tensors, as opposed to matrices or vectors, was first considered in [16] by extending the concept of spectral regularization for matrices to tensors. The tensor data is assumed to satisfy a particular low-rank Tucker decomposition, which unfortunately still suffers from an exponential storage complexity. Other work has focused on speeding up the convolution operation in CNNs [17] by approximating this operation with a low-rank polyadic decomposition of a tensor. In [18], the weight matrices of the fully-connected layers of neural networks are represented by tensor trains (TTs), effectively reducing the number of parameters. TTs have also been used to represent nonlinear predictors [19] and classifiers [20]. The key idea is always to approximate a mapping that is determined by an exponential number of parameters by a TT, whose storage complexity grows only linearly with the number of variables. To our knowledge, this idea has not yet been applied to polynomial classifiers, which also suffer from the curse of dimensionality. The usual approach to circumvent the exponential number of polynomial coefficients is to use SVMs with a polynomial kernel and solve the problem in the dual space. In this article, we exploit the efficient representation of a multivariate polynomial as a TT in order to avoid the curse of dimensionality, allowing us to work directly in the feature space. The main contributions are:

  • We derive a compact description of a polynomial classifier using the TT format, avoiding the curse of dimensionality.

  • Two efficient learning algorithms are proposed by exploiting the TT structure.

  • Both regularization and a parallel implementation are incorporated into our methods, thus avoiding overfitting and allowing the use of large training datasets.

This paper is organized as follows. In Section III, we give a brief introduction to tensor basics, including the TT decomposition and important tensor operations and properties. The framework of TT learning for pattern classification is presented in Section IV. Based on different loss functions, two efficient learning algorithms are proposed in Section V, together with a discussion on regularization and parallelization. In Section VI, we test our algorithms on two popular datasets, USPS and MNIST, and compare their performance with polynomial classifiers trained with LS-SVMs [15]. Finally, some conclusions and further work are summarized in Section VII.

Throughout this paper, we use small letters for scalars, small bold letters for vectors, capital letters for matrices, and calligraphic letters for tensors. The transpose of a matrix $A$ or a vector $\mathbf{a}$ is denoted by $A^T$ and $\mathbf{a}^T$, respectively. The identity matrix of dimension $n$ is denoted by $I_n$. A list of abbreviations used in this paper is summarized in Table I.

TT    Tensor Train
CNN   Convolutional Neural Network
SVM   Support Vector Machine
TTLS  Tensor Train learning by Least Squares
TTLR  Tensor Train learning by Logistic Regression
USPS  US Postal Service database
MNIST Modified NIST database
TABLE I: List of Abbreviations

III Preliminaries

III-A Tensors and pure-power polynomials

A real $d$th-order or $d$-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ is a multidimensional array that generalizes the notions of vectors and matrices to higher orders. Each of its entries $\mathcal{A}(i_1, i_2, \ldots, i_d)$ is determined by $d$ indices. The numbers $n_1, n_2, \ldots, n_d$ are called the dimensions of the tensor. An example tensor with dimensions 4, 3 and 2 is shown in Fig. 1. We now give a brief introduction to some required tensor operations and properties; more information can be found in [21].

Fig. 1: An example tensor $\mathcal{A} \in \mathbb{R}^{4 \times 3 \times 2}$, where $i_1, i_2, i_3$ denote the indices for each mode, respectively.

The $k$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ and a matrix $U \in \mathbb{R}^{m \times n_k}$ is defined by

$(\mathcal{A} \times_k U)(i_1, \ldots, i_{k-1}, j, i_{k+1}, \ldots, i_d) = \sum_{i_k=1}^{n_k} \mathcal{A}(i_1, \ldots, i_d)\, U(j, i_k),$   (1)

and $\mathcal{A} \times_k U \in \mathbb{R}^{n_1 \times \cdots \times n_{k-1} \times m \times n_{k+1} \times \cdots \times n_d}$. In particular, given a $d$-way tensor $\mathcal{A} \in \mathbb{R}^{n \times n \times \cdots \times n}$ and a vector $x \in \mathbb{R}^n$, the multidimensional contraction, denoted by $\mathcal{A}x^d$, is the scalar

$\mathcal{A}x^d = \mathcal{A} \times_1 x^T \times_2 x^T \cdots \times_d x^T = \sum_{i_1, \ldots, i_d} \mathcal{A}(i_1, \ldots, i_d)\, x_{i_1} x_{i_2} \cdots x_{i_d},$   (2)

which is a homogeneous polynomial of $x$ with degree $d$. The inner product of two same-sized tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ is the sum of the products of their entries, i.e.,

$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1, \ldots, i_d} \mathcal{A}(i_1, \ldots, i_d)\, \mathcal{B}(i_1, \ldots, i_d).$   (3)

The Frobenius norm of a tensor is given by

$\| \mathcal{A} \|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}.$   (4)

The vectorization of a tensor is denoted by $\mathrm{vec}(\mathcal{A})$ and maps the tensor entry with indices $(i_1, i_2, \ldots, i_d)$ to the vector entry with index $i$, where

$i = i_1 + (i_2 - 1)\, n_1 + \cdots + (i_d - 1)\, n_1 n_2 \cdots n_{d-1}.$

Given vectors $v^{(k)} \in \mathbb{R}^{n_k}$, $k = 1, \ldots, d$, their outer product is denoted by $v^{(1)} \circ v^{(2)} \circ \cdots \circ v^{(d)}$, which is a tensor in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ such that its entry with indices $(i_1, \ldots, i_d)$ is equal to the product of the corresponding vector entries, namely $v^{(1)}_{i_1} v^{(2)}_{i_2} \cdots v^{(d)}_{i_d}$. It follows immediately that

$\mathrm{vec}\big( v^{(1)} \circ v^{(2)} \circ \cdots \circ v^{(d)} \big) = v^{(d)} \otimes \cdots \otimes v^{(2)} \otimes v^{(1)},$   (5)

where the symbol “$\otimes$” denotes the Kronecker product.
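As a quick numerical check of the identity (5), the following NumPy sketch (ours; the dimensions are arbitrary) forms the outer product of three random vectors and compares its column-major vectorization with the Kronecker product of the vectors taken in reverse order.

```python
import numpy as np

rng = np.random.default_rng(0)
v1, v2, v3 = rng.standard_normal(4), rng.standard_normal(3), rng.standard_normal(2)

# Outer product v1 ∘ v2 ∘ v3 is a 4 x 3 x 2 tensor.
T = np.einsum('i,j,k->ijk', v1, v2, v3)

# Column-major (Fortran-order) vectorization, matching vec() in (5).
vec_T = T.reshape(-1, order='F')

# vec(v1 ∘ v2 ∘ v3) = v3 ⊗ v2 ⊗ v1 for column-major vectorization.
assert np.allclose(vec_T, np.kron(np.kron(v3, v2), v1))
```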

We now illustrate how to represent a polynomial using tensors. Denote by $\mathbb{R}[x_1, \ldots, x_d]$ the polynomial ring in the variables $x_1, \ldots, x_d$ with coefficients in the field $\mathbb{R}$.

Definition 1.

Given a vector of non-negative integer degree bounds, one for each variable, a polynomial $f \in \mathbb{R}[x_1, \ldots, x_d]$ is called pure-power if the degree of $f$ with respect to each variable $x_i$ is at most the corresponding bound $d_i$, $i = 1, \ldots, d$.

Example 1.

The polynomial is a pure-power- polynomial with .

The set of all pure-power polynomials with a fixed degree vector forms a linear space containing a total of $\prod_{k=1}^{d} (d_k + 1)$ distinct monomials $x_1^{i_1 - 1} x_2^{i_2 - 1} \cdots x_d^{i_d - 1}$, with $1 \le i_k \le d_k + 1$. For each variable $x_k$, denote by $v^{(k)}(x_k)$ the Vandermonde vector

$v^{(k)}(x_k) = \big( 1,\; x_k,\; x_k^2,\; \ldots,\; x_k^{d_k} \big)^T \in \mathbb{R}^{n_k}, \qquad n_k := d_k + 1.$   (6)

It follows that there is a one-to-one mapping between pure-power polynomials and tensors. To be specific, for any pure-power polynomial $f$ there exists a unique coefficient tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ such that

$f(x_1, \ldots, x_d) = \mathcal{A} \times_1 v^{(1)}(x_1)^T \times_2 v^{(2)}(x_2)^T \cdots \times_d v^{(d)}(x_d)^T.$   (7)
Example 2.

We revisit the polynomial from Example 1 and illustrate its corresponding tensor representation. From the degree vector we construct the Vandermonde vectors as in (6). The nonzero entries of the corresponding tensor are the nonzero coefficients of the polynomial, and their indices are easily found by grouping together the corresponding indices of the Vandermonde vectors: the tensor index associated with a monomial is obtained from the exponents of the variables in that monomial, each shifted by one.
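Since the concrete polynomial of Examples 1 and 2 is not reproduced here, the following sketch (ours) uses a small polynomial of our own choosing to illustrate the correspondence (6)-(7): the coefficient of each monomial is stored at the tensor index given by the exponents shifted by one, and evaluation reduces to contracting the coefficient tensor with the Vandermonde vectors.

```python
import numpy as np

# Hypothetical pure-power polynomial (not the one from Example 1):
#   f(x1, x2, x3) = 2 + 3*x1*x3**2 - x1**2*x2
# Degree bounds per variable: (2, 1, 2), so the coefficient tensor is 3 x 2 x 3.
A = np.zeros((3, 2, 3))
A[0, 0, 0] = 2.0   # constant term   x1^0 * x2^0 * x3^0
A[1, 0, 2] = 3.0   #  3 * x1 * x3^2
A[2, 1, 0] = -1.0  # -1 * x1^2 * x2

def vandermonde(x, degree):
    """v(x) = (1, x, x^2, ..., x^degree), cf. (6)."""
    return x ** np.arange(degree + 1)

def evaluate(A, x):
    """Evaluate the polynomial via the multidimensional contraction (7)."""
    vs = [vandermonde(xi, ni - 1) for xi, ni in zip(x, A.shape)]
    return np.einsum('ijk,i,j,k->', A, *vs)

x = np.array([0.5, -2.0, 3.0])
direct = 2 + 3 * x[0] * x[2]**2 - x[0]**2 * x[1]
assert np.isclose(evaluate(A, x), direct)
```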

III-B Tensor trains

It is well known that the number of tensor elements grows exponentially with the order $d$. Even when the dimensions $n_k$ are small, the storage cost of all $n_1 n_2 \cdots n_d$ elements is prohibitive for large $d$. The TT decomposition [22] gives an efficient way (in storage and computation) to overcome this so-called curse of dimensionality.

The main idea of the TT decomposition is to re-express the entries of a tensor as a product of matrices

$\mathcal{A}(i_1, i_2, \ldots, i_d) = G_1(i_1)\, G_2(i_2) \cdots G_d(i_d),$   (8)

where $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each index $i_k = 1, \ldots, n_k$, also called the TT-core. To turn the matrix-by-matrix product (8) into a scalar, the boundary conditions $r_0 = r_d = 1$ have to be introduced. The quantities $r_0, r_1, \ldots, r_d$ are called the TT-ranks. Note that each core $G_k$ is a third-order tensor with dimensions $r_{k-1}$, $n_k$ and $r_k$. The TT-decomposition of a tensor is illustrated in Fig. 2. The most common way to convert a given tensor into a TT is the TT-SVD algorithm [22, p. 2301].

Example 3.

Using the TT-SVD algorithm [22, p. 2301], we can convert the tensor from Example 2 into a TT that consists of TT-cores $G_1, G_2, \ldots$.

Note that throughout this article we will not need the TT-SVD algorithm. Instead, we initialize the TT-cores randomly and iteratively update the cores one by one in an alternating fashion. It turns out that if all TT-ranks are bounded by $r$, the storage of the TT grows only linearly with the order $d$, namely as $O(d\, n\, r^2)$, where $n = \max_k n_k$.
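The following sketch (ours; dimensions and ranks are arbitrary) illustrates the TT format (8) and the kind of random initialization used later: every entry of the represented tensor is recovered as a product of matrix slices of the cores, and the full tensor is only formed here to verify the result on a tiny example.

```python
import numpy as np

rng = np.random.default_rng(1)
dims  = [4, 3, 2, 3]          # n_1, ..., n_d
ranks = [1, 5, 4, 3, 1]       # r_0, ..., r_d with r_0 = r_d = 1

# TT-cores G_k of size r_{k-1} x n_k x r_k, initialized randomly.
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
         for k in range(len(dims))]

def tt_entry(cores, index):
    """A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d), cf. (8)."""
    out = np.eye(1)
    for G, i in zip(cores, index):
        out = out @ G[:, i, :]
    return out.item()

# Reconstruct the full tensor once (only feasible for tiny examples)
# and compare a single entry.
full = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)
idx = (2, 0, 1, 2)
assert np.isclose(tt_entry(cores, idx), full[idx])
```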

Fig. 2: The TT decomposition of a $d$-way tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$.
Proposition 1 (Theorem 2.1 of [23]).

For any tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$, there exists a TT-decomposition with TT-ranks $r_k$ that do not exceed the ranks of the corresponding unfolding matrices of $\mathcal{A}$.

We also mention that the TT representation of a tensor is not unique. For instance, let $Q$ be an orthogonal matrix in $\mathbb{R}^{r_k \times r_k}$, namely $Q^T Q = I_{r_k}$. Then the tensor in (8) also has the TT-decomposition

$\mathcal{A}(i_1, \ldots, i_d) = G_1(i_1) \cdots \big( G_k(i_k)\, Q \big)\, \big( Q^T G_{k+1}(i_{k+1}) \big) \cdots G_d(i_d),$   (9)

where the cores $G_k$ and $G_{k+1}$ are replaced by $G_k(i_k)\, Q$ and $Q^T G_{k+1}(i_{k+1})$, respectively. Numerical stability of our learning algorithms is guaranteed by keeping all the TT-cores left-orthogonal or right-orthogonal [24], which is achieved through a sequence of QR decompositions as explained in Section V.

Definition 2.

The core $G_k$ is called left-orthogonal if its left unfolding, the $r_{k-1} n_k \times r_k$ matrix obtained by reshaping $G_k$, has orthonormal columns, and it is called right-orthogonal if its right unfolding, the $r_{k-1} \times n_k r_k$ matrix obtained by reshaping $G_k$, has orthonormal rows.

As stated before, the structure of a TT also benefits the computation of the general multidimensional contraction

$\mathcal{A} \times_1 v^{(1)T} \times_2 v^{(2)T} \cdots \times_d v^{(d)T},$   (10)

where $v^{(k)} \in \mathbb{R}^{n_k}$, $k = 1, \ldots, d$. If a tensor is given in the TT-format (8), then we have

$\mathcal{A} \times_1 v^{(1)T} \times_2 v^{(2)T} \cdots \times_d v^{(d)T} = \prod_{k=1}^{d} \Big( \sum_{i_k=1}^{n_k} v^{(k)}_{i_k}\, G_k(i_k) \Big).$   (11)

The described procedure for fast TT contraction is summarized in Algorithm 1. In order to simplify the analysis of the computational complexity of Algorithm 1, we assume that $r_k = r$ and $n_k = n$ for all $k$. There are two steps required to compute the contraction of a TT with vectors. First, we need to construct the $d$ matrices $A_k := \sum_{i_k} v^{(k)}_{i_k} G_k(i_k)$ by contracting the TT-cores with the vectors $v^{(k)}$. This operation is equivalent to $d$ matrix-vector products with a total computational cost of approximately $O(d\, n\, r^2)$ flops. Fortunately, the contraction of one TT-core is completely independent of the other contractions and can hence be done in parallel over $d$ processors, reducing the computational complexity to $O(n\, r^2)$ per processor. The maximal values of $r$ and $n$ in our experiments are 10 and 4, respectively, so that the contraction of one TT-core is approximately equivalent to the product of a $100 \times 4$ matrix with a vector. The final step in Algorithm 1 is the product of all $d$ matrices $A_1 A_2 \cdots A_d$; since the boundary condition $r_0 = 1$ turns this into a sequence of vector-matrix products, its total computational complexity is $O(d\, r^2)$ flops, which for $r = 10$ is again roughly equivalent to one small matrix-vector product. For more basic operations implemented in the TT-format, such as tensor addition and computing the Frobenius norm, the reader is referred to [22].

0:  Input: vectors $v^{(k)} \in \mathbb{R}^{n_k}$, $k = 1, \ldots, d$, and a tensor $\mathcal{A}$ in the TT-format (8) with cores $G_1, \ldots, G_d$
0:  Output: the multidimensional contraction $c = \mathcal{A} \times_1 v^{(1)T} \times_2 v^{(2)T} \cdots \times_d v^{(d)T}$ in (10)
1:  for $k = 1, \ldots, d$ do
2:      $A_k \leftarrow \sum_{i_k=1}^{n_k} v^{(k)}_{i_k}\, G_k(i_k)$   %Computed in parallel
3:  end for
4:  $c \leftarrow A_1$
5:  for $k = 2, \ldots, d$ do
6:      $c \leftarrow c\, A_k$
7:  end for
8:  return $c$
Algorithm 1 Fast TT contraction [22]
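A possible NumPy implementation of Algorithm 1 is sketched below (our illustration, not the authors' code). Step 1 contracts every core with its input vector, which could be distributed over processors; step 2 multiplies the resulting small matrices, which collapses to a scalar because of the boundary ranks.

```python
import numpy as np

def tt_contract(cores, vectors):
    """Fast TT contraction (Algorithm 1):
    A x_1 v1^T x_2 v2^T ... x_d vd^T for a tensor A given in TT-format."""
    # Step 1: contract every core with its vector (independent, parallelizable):
    #   A_k = sum_i v_k[i] * G_k(i),  an r_{k-1} x r_k matrix.
    mats = [np.einsum('rjs,j->rs', G, v) for G, v in zip(cores, vectors)]
    # Step 2: multiply the small matrices; r_0 = r_d = 1 gives a 1 x 1 result.
    out = mats[0]
    for M in mats[1:]:
        out = out @ M
    return out.item()

# Consistency check against the full (dense) tensor for a tiny example.
rng = np.random.default_rng(2)
dims, ranks = [3, 4, 2], [1, 3, 2, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(3)]
vecs  = [rng.standard_normal(n) for n in dims]
full  = np.einsum('aib,bjc,ckd->ijk', *cores)
assert np.isclose(tt_contract(cores, vecs),
                  np.einsum('ijk,i,j,k->', full, *vecs))
```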

IV TT Learning

It is easy for us to recognize a face, understand spoken words, read handwritten characters and identify the gender of a person. Machines, however, make decisions based on data measured by a large number of sensors. In this section, we present the framework of TT learning. Like most pattern recognition systems [25], our TT learning method divides the system into three main modules, shown in Fig. 3.

The first module is called feature extraction, which is of paramount importance in any pattern classification problem. The goal of this module is to build features via transformations of the raw input, namely the original data measured by a large number of sensors. The basic reasoning behind transform-based features is that an appropriately chosen transformation can exploit and remove information redundancies, which usually exist in the set of samples obtained by measuring devices. The resulting set of features exhibits high information-packing properties compared with the original input samples. This means that most of the classification-related information is compressed into a relatively small number of features, leading to a reduction of the necessary feature space dimension. Feature extraction benefits the training of the classifier in terms of memory and computation, and also alleviates the problem of overfitting since redundant information is removed. To deal with the task of feature extraction, some linear or nonlinear transformation techniques are widely used. For example, the Karhunen-Loève transform, related to principal component analysis (PCA), is a popular method for feature generation and dimensionality reduction. A nonlinear kernel version of the classical PCA is kernel PCA, which extends PCA using kernel methods. The discrete Fourier transform (DFT) can be another good choice, due to the fact that for many practical applications most of the energy lies in the low-frequency components. Compared with PCA, the basis vectors of the DFT are fixed and problem-independent, which leads to a low computational complexity.

The second module, the TT classifier, is the core of TT learning. The purpose of this module is to score a new observation based on the features generated by the previous module. As will be discussed, the task of pattern classification can be divided into a sequence of binary classifications. For each particular binary classification, the TT classifier assigns to each new observation a score that indicates to which class it belongs. In order to construct a good classifier, we exploit the fact that we know the labels of each sample in a given dataset: the TT classifier is trained with respect to an optimality criterion. In some ways, the TT classifier can be regarded as a generalized linear classifier: it performs a linear classification in a higher-dimensional space generated by the monomials of a given pure-power polynomial. The local information is encoded by the products of features. In contrast to kernel-based SVM classifiers that work in the dual space, the TT classifier is able to work directly in the high-dimensional space by exploiting the TT-format. Similar to the backpropagation algorithm for multilayer perceptrons, the structure of a TT allows the cores to be updated in an alternating way. In the next section, we describe the training of two TT classifiers through the optimization of two different loss functions.

The last module in Fig. 3 is the decision module, which decides to which category a new observation belongs. For binary classification, decisions are made according to the sign of the score assigned by the TT classifier, that is, the decision depends on the value of the corresponding discriminant function. In an $M$-class problem, there are several strategies to decompose the problem into a sequence of binary classification problems. A straightforward extension is one-against-all, which involves $M$ binary classification problems (one per class, separating that class from all others). We seek to design $M$ discriminant functions $g_1, \ldots, g_M$ so that $g_i(x) > g_j(x)$ for all $j \neq i$ if $x$ belongs to the $i$th class. Classification is then achieved according to the rule:

assign $x$ to the $i$th class if $g_i(x) = \max_j g_j(x)$.

An alternative technique is one-against-one, where we need to consider all $M(M-1)/2$ pairs of classes. The decision is made on the basis of a majority vote: each classifier casts one vote and the final class is the one with the most votes. When the number of classes $M$ is large, one can also apply the technique of binary coding. It turns out that only $\lceil \log_2 M \rceil$ classifiers are needed, where $\lceil \cdot \rceil$ is the ceiling operation. In this case, each class is represented by a unique binary code word of length $\lceil \log_2 M \rceil$. The decision is then made on the basis of the minimal Hamming distance.
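The decision strategies above amount to a few lines of code. The sketch below (ours; `scores`, `bits` and `codebook` are hypothetical inputs standing for the outputs of trained binary TT classifiers) shows the one-against-all rule and the Hamming-distance rule for binary coding.

```python
import numpy as np

def one_vs_all_decision(scores):
    """scores[i] = g_i(x): output of the i-th binary classifier (class i vs rest)."""
    return int(np.argmax(scores))

def binary_coding_decision(bits, codebook):
    """bits: signs produced by ceil(log2 M) classifiers, mapped to {0, 1}.
    codebook[c]: the binary code word assigned to class c.
    Returns the class whose code word has minimal Hamming distance."""
    dists = [np.sum(bits != code) for code in codebook]
    return int(np.argmin(dists))

# Toy usage with M = 4 classes and 2 = ceil(log2 4) binary classifiers.
codebook = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(one_vs_all_decision([0.1, -2.3, 1.7, 0.4]))          # -> 2
print(binary_coding_decision(np.array([1, 1]), codebook))  # -> 3
```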

Fig. 3: Framework of TT learning.

V Learning Algorithms

For notational convenience, we fix the following notation for the remainder of the article. As stated before, TT classifiers are designed for binary classification. We are given a set of $N$ training examples of the form $(x^{(m)}, y_m)$, $m = 1, \ldots, N$, such that $x^{(m)} \in \mathbb{R}^d$ is the feature vector of the $m$th example and $y_m$ is the corresponding class label. Fix a degree vector with entries $d_k$ and set $n_k := d_k + 1$. Each feature vector $x$ is then mapped to a higher-dimensional space generated by all corresponding pure-power monomials through the mapping $\Phi$, whose value is the tensor with entries

$\Phi(x)(i_1, i_2, \ldots, i_d) = x_1^{i_1 - 1} x_2^{i_2 - 1} \cdots x_d^{i_d - 1}.$   (12)

For $k = 1, \ldots, d$, let $v^{(k)}(x_k)$ be the Vandermonde vectors defined in (6). Clearly, we have

$\Phi(x) = v^{(1)}(x_1) \circ v^{(2)}(x_2) \circ \cdots \circ v^{(d)}(x_d).$   (13)

This high-dimensional pure-power polynomial space benefits the learning task in the following ways:

  • all interactions between features are described by the monomials of pure-power polynomials;

  • the dimension of the tensor space grows exponentially with the number of features, namely as $\prod_{k=1}^{d} n_k$, which increases the probability that all training examples can be separated linearly into two classes;

  • the one-to-one mapping between pure-power polynomials and tensors enables the use of tensor trains to lift the curse of dimensionality.

With these preparations, our goal is to find a decision hyperplane that separates the two-class examples in the tensor space, also called the generic feature space. In other words, like the inductive learning described in [16], we try to find a tensor $\mathcal{W}$ such that $\langle \mathcal{W}, \Phi(x^{(m)}) \rangle > 0$ when $y_m = +1$ and $\langle \mathcal{W}, \Phi(x^{(m)}) \rangle < 0$ when $y_m = -1$. Note that the bias is absorbed into the first element of $\mathcal{W}$. Note also that the learning problem can be interpreted as finding a pure-power polynomial $f$ such that $f(x^{(m)}) > 0$ when $y_m = +1$

and $f(x^{(m)}) < 0$ when $y_m = -1$.
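To make the role of the generic feature space concrete, the following sketch (ours; the small degrees, TT-ranks and variable names are arbitrary choices) forms $\Phi(x)$ explicitly as the outer product (13) of Vandermonde vectors and verifies that $\langle \mathcal{W}, \Phi(x) \rangle$ can equally be computed by contracting TT-cores of $\mathcal{W}$ with the Vandermonde vectors, without ever building the exponentially large feature tensor.

```python
import numpy as np

rng = np.random.default_rng(3)
d, degrees = 3, [2, 3, 1]                  # 3 features, per-variable degree bounds
dims = [q + 1 for q in degrees]            # tensor dimensions n_k = d_k + 1

def vandermonde(x, degree):
    return x ** np.arange(degree + 1)

x = rng.standard_normal(d)
vs = [vandermonde(x[k], degrees[k]) for k in range(d)]

# Feature tensor Phi(x): outer product of the Vandermonde vectors, cf. (13).
Phi = np.einsum('i,j,k->ijk', *vs)

# A weight tensor W stored in TT-format with cores of size r_{k-1} x n_k x r_k.
ranks = [1, 2, 2, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(d)]
W = np.einsum('aib,bjc,ckd->ijk', *cores)

# <W, Phi(x)> computed densely ...
dense = np.sum(W * Phi)
# ... equals the TT contraction of the cores with the Vandermonde vectors,
# which never forms the prod(n_k)-dimensional feature space explicitly.
mats = [np.einsum('rjs,j->rs', G, v) for G, v in zip(cores, vs)]
tt_value = np.linalg.multi_dot(mats).item()
assert np.isclose(dense, tt_value)
```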

Here we consider the tensor $\mathcal{W}$ expressed as a tensor train with cores $G_1, \ldots, G_d$. The main idea of the TT learning algorithms is to update the cores in an alternating way by optimizing an appropriate loss function. Prior to updating the TT-cores, the TT-ranks are fixed and a particular initial guess of the cores is made. The TT-ranks can be interpreted as tuning parameters: higher values result in a better fit at the risk of overfitting. It is straightforward to extend our algorithms by means of the Density Matrix Renormalization Group (DMRG) method [26] such that the TT-ranks are updated adaptively. Each core is updated in the sweep order

$G_1 \rightarrow G_2 \rightarrow \cdots \rightarrow G_d \rightarrow G_{d-1} \rightarrow \cdots \rightarrow G_1 \rightarrow \cdots$

until convergence, which is guaranteed under certain conditions as described in [27, 28]. It turns out that updating one TT-core is equivalent to minimizing a loss function in a small number of variables, which can be done in a very efficient manner. The following lemma shows how the inner product in the generic feature space is a linear function of any of the TT-cores $G_k$.

Lemma 1.

Given a vector $x$, let $\Phi$ be the mapping defined by (12), and let $\mathcal{W}$ be a TT with cores $G_1, \ldots, G_d$. For any $x$ and any $k = 1, \ldots, d$, we have that

$\langle \mathcal{W}, \Phi(x) \rangle = c_k(x)^T\, \mathrm{vec}(G_k),$   (14)

where the row vector $c_k(x)^T \in \mathbb{R}^{1 \times r_{k-1} n_k r_k}$ is built from the Kronecker product of

the contraction of the cores $G_1, \ldots, G_{k-1}$ with $v^{(1)}(x_1), \ldots, v^{(k-1)}(x_{k-1})$, the Vandermonde vector $v^{(k)}(x_k)$,

and the contraction of the cores $G_{k+1}, \ldots, G_d$ with $v^{(k+1)}(x_{k+1}), \ldots, v^{(d)}(x_d)$.

Proof. By definition, $\langle \mathcal{W}, \Phi(x) \rangle$ equals the multidimensional contraction (11), which, for fixed remaining cores, is a linear function of the entries of $G_k$; collecting the corresponding coefficients into the row vector $c_k(x)^T$ yields (14). This completes the proof. ∎
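The lemma can be checked numerically. In the sketch below (ours; the vectors `vs` are random stand-ins for the Vandermonde vectors and the name `c_k` follows the row vector introduced above), the score is computed once by the fast TT contraction and once as a linear function of the vectorized third core.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(4)
dims, ranks = [3, 4, 2, 3], [1, 2, 3, 2, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(4)]
vs = [rng.standard_normal(n) for n in dims]        # stand-ins for v^(k)(x_k)

# Contract each core with its vector: an r_{k-1} x r_k matrix per core.
mats = [np.einsum('rjs,j->rs', G, v) for G, v in zip(cores, vs)]
score = reduce(np.matmul, mats).item()             # <W, Phi(x)>, cf. (11)

k = 2                                              # core to be updated
left  = reduce(np.matmul, mats[:k], np.eye(1)).ravel()                 # length r_{k-1}
right = reduce(np.matmul, mats[k + 1:], np.eye(ranks[k + 1])).ravel()  # length r_k

# Row vector c_k(x) with <W, Phi(x)> = c_k(x)^T vec(G_k), cf. Lemma 1 / (14).
c_k = np.einsum('a,j,b->ajb', left, vs[k], right).ravel()
assert np.isclose(c_k @ cores[k].ravel(), score)
```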

Example 4.

In this example we illustrate the advantageous representation of a pure-power polynomial as a TT. Suppose we have a polynomial $f$ in 10 variables. All coefficients of $f$ can then be stored in a 10-way tensor such that the evaluation of $f$ in a particular point is given by (7). The TT-representation of this tensor consists of 10 TT-cores, with a storage complexity of $O(10\, n\, r^2)$ numbers, where $r$ is the maximal TT-rank and $n$ the maximal dimension. This demonstrates the potential of the TT-representation in avoiding the curse of dimensionality when the TT-ranks are small.
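For concreteness, the storage numbers can be compared directly; in the snippet below (ours) we assume a degree of 3 in every variable, i.e. $n = 4$, since the degrees of this example are not reproduced here.

```python
# Storage comparison for a 10-variable pure-power polynomial, assuming
# (our assumption) degree 3 per variable, i.e. n = 4, and a maximal TT-rank r.
d, n = 10, 4
full_storage = n ** d                       # 1,048,576 coefficients
for r in (2, 5, 10):
    tt_storage = d * n * r * r              # upper bound d * n * r^2
    print(r, tt_storage, full_storage / tt_storage)
```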

Example 5.

Next, we illustrate the expressions of Lemma 1 for a quadratic polynomial in two variables. In this case the coefficient tensor is simply a matrix, the Vandermonde vectors contain the powers of each variable up to degree two, and the TT-representation consists of two TT-cores. For the evaluation of the polynomial in a particular point, Lemma 1 expresses the result as a linear function of the vectorization of either core, with the coefficient vector built from the other core and the two Vandermonde vectors.

In what follows, we first present two learning algorithms based on different loss functions. These algorithms will learn the tensor directly in the TT-representation from a given dataset. Two enhancements, namely, regularization for better accuracy and parallelization for higher speed will be described in the last two subsections.

V-A TT Learning by Least Squares

Least squares estimation is the simplest and most common estimation method. In the generic feature space, we attempt to design a linear classifier whose desired output is exactly $+1$ or $-1$. However, we have to live with errors; that is, the true output will not always be equal to the desired one. The least squares estimator is then found by minimizing the mean square error function

$J(\mathcal{W}) = \frac{1}{2} \sum_{m=1}^{N} \big( \langle \mathcal{W}, \Phi(x^{(m)}) \rangle - y_m \big)^2.$   (15)

We now show how updating a TT-core is equivalent to solving a relatively small linear system. First, we define the matrix

$C_k := \big( c_k(x^{(1)}) \;\; c_k(x^{(2)}) \;\; \cdots \;\; c_k(x^{(N)}) \big)^T \in \mathbb{R}^{N \times r_{k-1} n_k r_k}$   (16)

for any $k = 1, \ldots, d$. The matrix $C_k$ is hence obtained from the concatenation of the row vectors $c_k(x^{(m)})^T$ of Lemma 1 for the samples $m = 1, \ldots, N$. It follows from Lemma 1 that

$J(\mathcal{W}) = \frac{1}{2}\, \big\| C_k\, \mathrm{vec}(G_k) - y \big\|_2^2,$   (17)

where

$y := (y_1, y_2, \ldots, y_N)^T.$   (18)

We have thus shown that updating the core $G_k$ is equivalent to solving a least squares optimization problem in $r_{k-1} n_k r_k$ variables. Minimizing (17) with respect to $\mathrm{vec}(G_k)$ for any $k$ results in solving the linear system

$C_k^T C_k\, \mathrm{vec}(G_k) = C_k^T y.$   (19)

Supposing $r_k = r$ and $n_k = n$ for all $k$, the computational complexity of solving (19) is that of a dense linear system of order $r^2 n$. For the maximal values of $r$ and $n$ in our experiments, this implies that we need to solve a linear system of order 400, which takes about 0.01 seconds using MATLAB on our desktop computer.
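A simplified sketch of one sweep of TT learning by least squares is given below (our illustration, not the authors' implementation; it omits the orthogonalization and regularization discussed later, and the tiny diagonal shift in the solve is only a numerical safeguard, not part of (19)).

```python
import numpy as np
from functools import reduce

def vandermonde(x, degree):
    return x ** np.arange(degree + 1)

def row_c(cores, vs, k, ranks):
    """c_k(x)^T from Lemma 1 for one sample, consistent with cores[k].ravel()."""
    mats = [np.einsum('rjs,j->rs', G, v) for G, v in zip(cores, vs)]
    left  = reduce(np.matmul, mats[:k], np.eye(1)).ravel()
    right = reduce(np.matmul, mats[k + 1:], np.eye(ranks[k + 1])).ravel()
    return np.einsum('a,j,b->ajb', left, vs[k], right).ravel()

def update_core_ls(cores, X, y, k, degrees, ranks):
    """One least-squares update of core k, i.e. solve the normal equations (19)."""
    C = np.stack([row_c(cores,
                        [vandermonde(x[i], degrees[i]) for i in range(len(degrees))],
                        k, ranks) for x in X])
    # 1e-10 jitter only guards against a singular Gram matrix in this toy demo.
    g = np.linalg.solve(C.T @ C + 1e-10 * np.eye(C.shape[1]), C.T @ y)
    cores[k] = g.reshape(cores[k].shape)
    return cores

# Toy data: 3 features, per-variable degree 2, labels in {-1, +1}.
rng = np.random.default_rng(5)
degrees, ranks = [2, 2, 2], [1, 2, 2, 1]
dims = [q + 1 for q in degrees]
cores = [0.1 * rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(3)]
X = rng.standard_normal((50, 3))
y = np.sign(X[:, 0] * X[:, 1] + 0.5)

for k in list(range(3)) + list(range(1, -1, -1)):   # one back-and-forth sweep
    cores = update_core_ls(cores, X, y, k, degrees, ranks)
```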

V-B TT Learning by Logistic Regression

Since our goal is to find a hyperplane that separates the two-class training examples in the generic feature space, we may not care about the particular value of the output; only its sign matters. This suggests decreasing the number of sign disagreements as much as possible when updating the TT-cores, i.e., minimizing the number of misclassified examples. However, this model is discrete, so a difficult combinatorial optimization problem is involved. Instead, we look for a suboptimal solution in the sense of minimizing a continuous cost function that penalizes misclassified examples. Here, we consider the logistic regression cost function. First, consider the standard sigmoid function

$\sigma(t) = \frac{1}{1 + e^{-t}},$

whose output always takes values between $0$ and $1$. An important property is that its derivative can be expressed in terms of the function itself, i.e.,

$\sigma'(t) = \sigma(t)\, \big( 1 - \sigma(t) \big).$   (20)

The logistic function for the $m$th example is given by

$p_m := \sigma\big( \langle \mathcal{W}, \Phi(x^{(m)}) \rangle \big).$   (21)

We can also interpret the logistic function $p_m$ as the probability that the $m$th example belongs to the class denoted by the label $+1$. The predicted label for $x^{(m)}$ is then obtained according to the rule: predict $+1$ if $p_m \ge 1/2$ and $-1$ otherwise.

For a particular example, we define the cost function as

$\ell_m(\mathcal{W}) = -\tfrac{1+y_m}{2}\, \log p_m \;-\; \tfrac{1-y_m}{2}\, \log(1 - p_m).$

The goal now is to find a tensor $\mathcal{W}$ such that $p_m$ is near $1$ if $y_m = +1$ or near $0$ if $y_m = -1$. As a result, the logistic regression cost function for the whole training dataset is given by

$J(\mathcal{W}) = \sum_{m=1}^{N} \ell_m(\mathcal{W}).$   (22)

It is important to note that (22) is convex as a function of each TT-core (with the remaining cores fixed), even though the sigmoid function itself is not. This guarantees that each core update can be solved to global optimality instead of getting stuck in a local optimum of the subproblem.

From equation (21) and Lemma 1, one can see that the function $p_m$ can also be regarded as a function of the core $G_k$, since

$p_m = \sigma\big( c_k(x^{(m)})^T\, \mathrm{vec}(G_k) \big),$

where $c_k(x^{(m)})^T$ denotes the $m$th row vector of the matrix $C_k$ defined in (16). It follows that updating the core $G_k$ is equivalent to solving a convex optimization problem in $r_{k-1} n_k r_k$ variables. Let

$p := (p_1, p_2, \ldots, p_N)^T$   (23)

and let $D$ be the diagonal matrix in $\mathbb{R}^{N \times N}$ whose $m$th diagonal element is given by $p_m (1 - p_m)$. By using the property (20), one can derive the gradient and Hessian of (22) with respect to $\mathrm{vec}(G_k)$ as

$\nabla J = C_k^T \Big( p - \tfrac{1}{2}\, (\mathbf{1} + y) \Big)$   (24)

and

$\nabla^2 J = C_k^T\, D\, C_k,$   (25)

respectively, where $y$ is defined in (18) and $\mathbf{1}$ denotes the all-ones vector in $\mathbb{R}^N$. Although we do not have a closed-form solution for updating the core $G_k$, the gradient and Hessian allow us to find the solution with efficient iterative methods, e.g. Newton's method, whose convergence is at least quadratic in a neighbourhood of the solution. A quasi-Newton method, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is another good choice if the inverse of the Hessian is difficult to compute.
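The sketch below (ours) implements a damped Newton iteration for one core update under the logistic model, using the gradient and Hessian as reconstructed in (24)-(25) and the label convention $y_m \in \{-1, +1\}$; the matrix `C` stands for $C_k$ and all helper names are our own.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_core_update(C, y, w, n_iter=20, damping=1e-8):
    """Minimize the logistic cost (22) over w = vec(G_k) for fixed C = C_k.

    Gradient and Hessian follow (24)-(25):  grad = C^T (p - (1 + y)/2),
    H = C^T D C with D = diag(p * (1 - p)).  Labels are assumed in {-1, +1}.
    """
    target = 0.5 * (1.0 + y)                 # 1 for class +1, 0 for class -1
    for _ in range(n_iter):
        p = sigmoid(C @ w)
        grad = C.T @ (p - target)
        H = C.T @ ((p * (1.0 - p))[:, None] * C)
        step = np.linalg.solve(H + damping * np.eye(H.shape[0]), grad)
        w = w - step
        if np.linalg.norm(grad) < 1e-8:
            break
    return w

# Toy usage: a random "C_k" matrix and roughly separable labels.
rng = np.random.default_rng(6)
C = rng.standard_normal((200, 12))
y = np.sign(C @ rng.standard_normal(12) + 0.1 * rng.standard_normal(200))
w = newton_core_update(C, y, np.zeros(12))
print(np.mean(np.sign(C @ w) == y))          # training accuracy after the core update
```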

V-C Regularization

The cost functions (15) and (22) of the two TT learning algorithms do not contain any regularization term, which may result in overfitting and hence bad generalization properties of the obtained TT classifier. Next, we discuss how the addition of a regularization term to (15) or (22) results in only a small modification of the subproblem that needs to be solved when updating the TT-cores $G_k$.

Consider the regularized optimization problem

$\min_{\mathcal{W}}\; J(\mathcal{W}) + \lambda\, R(\mathcal{W}),$   (26)

where $J(\mathcal{W})$ is given by (15) or (22) and $\lambda > 0$ is a parameter that balances the loss function and the regularization term. Here we use Tikhonov regularization, namely,

$R(\mathcal{W}) = \tfrac{1}{2}\, \| \mathcal{W} \|_F^2.$   (27)

Thanks to the TT structure, the gradient of $R(\mathcal{W})$ with respect to the TT-core $G_k$ can be equivalently rewritten as a linear transformation of $\mathrm{vec}(G_k)$. In other words, there is a matrix $S_k$, determined by the remaining cores, such that this gradient equals $S_k\, \mathrm{vec}(G_k)$; see Appendix A for more details. It follows that the gradient and Hessian of the regularized cost are obtained from those of $J$ by adding $\lambda\, S_k\, \mathrm{vec}(G_k)$ and $\lambda\, S_k$, respectively. These small modifications lead to small changes when updating the core $G_k$. For instance, the first-order condition of (26) for the least squares model results in solving the modified linear system

$\big( C_k^T C_k + \lambda\, S_k \big)\, \mathrm{vec}(G_k) = C_k^T y,$   (28)

when compared with the original linear system (19).
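For the least-squares model, the regularized update (28) is a one-line change to the unregularized solve. When the remaining cores are kept left-/right-orthogonal as in Section V-D, the Frobenius norm of the whole TT equals that of the core being updated, so $S_k$ reduces to the identity; the sketch below (ours; helper names are assumptions) uses this simplification, with `lam` playing the role of the balancing parameter in (26).

```python
import numpy as np

def regularized_ls_update(C, y, lam, core_shape):
    """Solve (C^T C + lam * S) vec(G_k) = C^T y, cf. (28).

    Assumption: the other cores are kept left-/right-orthogonal (Section V-D),
    in which case the regularization matrix S reduces to the identity.
    """
    n = C.shape[1]
    g = np.linalg.solve(C.T @ C + lam * np.eye(n), C.T @ y)
    return g.reshape(core_shape)

# Toy usage: lam trades off data fit against the size of the core.
rng = np.random.default_rng(7)
C, y = rng.standard_normal((100, 16)), rng.choice([-1.0, 1.0], size=100)
G = regularized_ls_update(C, y, lam=0.1, core_shape=(2, 4, 2))
```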

V-D Orthogonalization and Parallelization

The matrix $C_k$ from (16) needs to be reconstructed for each TT-core during the execution of the two TT learning algorithms. Fortunately, this can be done efficiently by exploiting the TT structure. In particular, after updating the core $G_k$ in the left-to-right sweep, the row vectors needed to construct the next matrix $C_{k+1}$ are obtained cheaply by contracting the freshly updated core with the corresponding Vandermonde vectors and absorbing the result into the stored left partial products. Similarly, in the right-to-left sweep, the column vectors needed to construct the next matrix $C_{k-1}$ are obtained by updating the stored right partial products.

To make the learning algorithms numerically stable, orthogonalization techniques are also applied. The main idea is to make sure that, before updating the core $G_k$, the cores $G_1, \ldots, G_{k-1}$ are left-orthogonal and the cores $G_{k+1}, \ldots, G_d$ are right-orthogonal, which is enforced by a sequence of QR decompositions. In this way, the condition number of the constructed matrix $C_k$ is upper bounded, so that the subproblem is well-posed. After updating the core $G_k$, we orthogonalize it with an extra QR decomposition and absorb the upper triangular factor into the next core (depending on the direction of updating). More details on the orthogonalization step can be found in [27].
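The left-orthogonalization step of a left-to-right sweep can be sketched as follows (our illustration): the left unfolding of the current core is replaced by its Q factor and the R factor is absorbed into the next core, which leaves the represented tensor unchanged while enforcing Definition 2.

```python
import numpy as np

def left_orthogonalize(cores, k):
    """Make core k left-orthogonal and absorb R into core k+1 (same tensor)."""
    r0, n, r1 = cores[k].shape
    Q, R = np.linalg.qr(cores[k].reshape(r0 * n, r1))   # QR of the left unfolding
    cores[k] = Q.reshape(r0, n, Q.shape[1])
    cores[k + 1] = np.einsum('ab,bjc->ajc', R, cores[k + 1])
    return cores

# Check that the represented tensor is unchanged.
rng = np.random.default_rng(8)
dims, ranks = [3, 2, 4], [1, 3, 2, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(3)]
before = np.einsum('aib,bjc,ckd->ijk', *cores)
cores = left_orthogonalize(cores, 0)
after = np.einsum('aib,bjc,ckd->ijk', *cores)
assert np.allclose(before, after)
# The left unfolding of core 0 now has orthonormal columns (Definition 2).
M = cores[0].reshape(-1, cores[0].shape[2])
assert np.allclose(M.T @ M, np.eye(M.shape[1]))
```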

Another computational challenge is the potentially large size $N$ of the training dataset. Luckily, the dimension of the optimization problem when updating a core in the TT learning algorithms is $r_{k-1} n_k r_k$, which is much smaller than and independent of $N$. We only need to compute the products $C_k^T C_k$, $C_k^T y$, $C_k^T \big( p - \tfrac{1}{2}(\mathbf{1} + y) \big)$ and $C_k^T D\, C_k$ in (19), (24) and (25). These computations are easily done in parallel. Specifically, given a proper partition of the $N$ samples, we divide the large matrix $C_k$ into several blocks, namely,