
# TEC: Tensor Ensemble Classifier for Big Data

The classification of tensor (multidimensional array) data has become popular in modern applications such as image recognition and high-dimensional spatio-temporal data analysis. The Support Tensor Machine (STM) classifier, an extension of the support vector machine, takes the CANDECOMP/PARAFAC (CP) form of tensor data as input and predicts the data labels. Its distribution-free and statistically consistent properties highlight its potential for handling a wide variety of data applications. Training an STM, however, can be computationally expensive for high-dimensional tensors. Reducing the tensor size with a random projection technique lowers the computational time and cost, making it feasible to handle large tensors on regular machines. We call an STM estimated from randomly projected tensors a Random Projection-based Support Tensor Machine (RPSTM). In this work, we propose the Tensor Ensemble Classifier (TEC), which aggregates multiple RPSTMs for big tensor classification. TEC uses the ensemble idea to minimize the excess classification risk introduced by random projection, providing statistically consistent predictions while retaining the computational advantage of RPSTM. Since each RPSTM can be estimated independently, TEC can further benefit from parallel computing and be even more computationally efficient. Theoretical and numerical results demonstrate the strong performance of TEC in high-dimensional tensor classification problems: the model prediction is statistically consistent, as its risk is shown to converge to the optimal Bayes risk. We also highlight the trade-off between computational cost and prediction risk for the TEC model. The method is validated by extensive simulations and a real data example. A Python package for applying TEC is available at our GitHub.


## 1 Introduction

With the advancement of information and engineering technology, modern data science problems often come with data of enormous size and increased complexity. One popular way of storing such data is as multi-dimensional arrays, which preserve the data's anatomy along with its multi-dimensional spatial and spatio-temporal correlations. Unfolding the data may lose this structure, along with other relevant information, and faces the curse of dimensionality. The structural complexity and high dimensionality pose many challenges to statisticians and computer scientists in data processing and analysis. Traditional machine learning and statistical models for high-dimensional data are developed in vector spaces. Adapting these methods to the new types of structured data requires vectorization, which discards critical information such as multi-way correlation and spatio-temporal structure. Moreover, the huge dimension of the vectorized data can overwhelm traditional high-dimensional models, especially in terms of computational complexity. To preserve the multi-way structure of data and perform efficient computation, research has begun on building tensor data models and performing statistical analysis based on them.

A tensor is a multi-dimensional or multi-way array, the generalization of a matrix to higher dimensions. Depending on the definition, tensors can be represented in different forms, such as CANDECOMP/PARAFAC (CP), Tucker, and Tensor Chain. These tensor representations are widely applied in both supervised and unsupervised machine learning problems for multi-dimensional data. Kolda and Bader (2009) provide a comprehensive survey of CP and Tucker tensor decompositions and their applications in psychometrics, neuroscience, latent factor detection, image compression, prediction, and classification. Among all these machine learning applications with tensors, we focus only on classification models in this work. Various classification models for tensor data have been studied in the last few years. Some works, such as Tan et al. (2012), Zhou et al. (2013), and Pan et al. (2018), mostly in the statistics literature, develop logistic regression and probabilistic linear discriminant analysis models. These models rely on the same principal assumption that the tensor data follow a normal distribution, which is too restrictive in real applications. Non-parametric tensor classifiers turn out to be more flexible, as they make no distributional assumption. Examples of non-parametric tensor classifiers include Tao et al. (2005), Kotsia et al. (2012), Signoretto et al. (2014), He et al. (2014), Li and Schonfeld (2014), and He et al. (2017). These methods are more flexible and tend to generalize much better in applications where the tensor data are sparse or have highly auto-correlated modes.

Motivated by the CP tensor and the tensor product space introduced in Hackbusch (2012), we proposed a version of the Support Tensor Machine (STM) using tensor kernel functions based on the tensor cross-product norm in our previous work, Li and Maiti (2019). By definition, CP tensors can be regarded as elements of a tensor product space, and the associated cross-product norm measures the magnitude of tensors in that space. Following this idea, we develop a tensor kernel function based on the cross-product norm to measure the similarity between tensors. The kernel function also represents the resulting separating plane. We call this model CP-STM.

In this work, we consider simplifying the computation in CP-STM by training an STM with randomly projected tensors. The application of random projection in high-dimensional data analysis is motivated by the well-celebrated Johnson-Lindenstrauss lemma (see, e.g., Dasgupta and Gupta (2003)). The lemma says that for an arbitrary ϵ ∈ (0, 1) and any n points x_1, ..., x_n ∈ R^p, there is a linear transformation f : R^p → R^k, with k = O(ϵ^{-2} log n), such that for any two of the points x_i, x_j:

$$(1-\epsilon)\,\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1+\epsilon)\,\|x_i - x_j\|^2$$

with high probability for all pairs i, j. That is, the linear transformation approximately preserves the pairwise Euclidean distances between the points. Random projection has proven to be a strong dimension-reduction technique in the machine learning literature. Durrant and Kabán (2013) present Vapnik-Chervonenkis-type bounds on the generalization error of a linear classifier trained on a single random projection. Chen and Li (2014) provide a convergence rate for the classification error of a support vector machine trained with a single random projection. Cannings and Samworth (2017) prove that a random projection ensemble classifier can reduce the generalization error further; their results hold for several types of base classifiers, such as nearest neighbors and linear/quadratic discriminant analysis. Inspired by this line of work, we propose an ensemble of RPSTMs, i.e., STMs trained with randomly projected tensors, for large-scale tensor classification. We call this method the Tensor Ensemble Classifier (TEC).
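As a quick numerical illustration of this distance-preserving property (a sketch for intuition, not code from the paper; the dimensions and scaling are illustrative choices), one can project Gaussian points with a random matrix and compare pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)

p, k, n = 10_000, 500, 20          # ambient dim, projected dim, number of points
X = rng.normal(size=(n, p))

# Gaussian random projection; the 1/sqrt(k) scaling makes squared norms
# approximately preserved in expectation.
A = rng.normal(size=(k, p)) / np.sqrt(k)
Z = X @ A.T

def pdist2(M):
    # Pairwise squared Euclidean distances via the Gram matrix.
    G = M @ M.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G

iu = np.triu_indices(n, 1)
ratio = pdist2(Z)[iu] / pdist2(X)[iu]
print(ratio.min(), ratio.max())    # both close to 1: distances survive the projection
```

With k = 500 the distortion concentrates well within the ϵ band promised by the lemma, even though the ambient dimension drops by a factor of 20.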

Our contribution: Our work alleviates the limitations of existing tensor approaches in handling big data classification problems. This paper introduces an innovative ensemble classification model coupled with random projection and support tensor machines to provide computationally fast and statistically consistent label predictions for ultra high-dimensional tensor data. Specifically, the contribution of this paper is threefold.

1. We successfully adapt the well-known random projection technique to high-dimensional tensor classification and provide an ensemble classifier that can handle extremely large tensor data. The use of random projection makes it feasible to classify big tensor data directly on regular machines and saves computational cost. We further aggregate multiple RPSTMs to form our TEC classifier, which is statistically consistent while remaining computationally efficient. Since the aggregated base classifiers are independent of each other, the model learning procedure can be accelerated on a parallel computing platform.

2. Theoretical results are established to validate the prediction consistency of our classification model. Unlike Chen and Li (2014) and Cannings and Samworth (2017), we extend the Johnson-Lindenstrauss lemma to tensor data and show that the STM from Li and Maiti (2019) can be estimated with randomly projected tensors. The average classification risk of the estimated model converges to the optimal Bayes risk under some specific conditions. Thus, the ensemble of several RPSTMs has robust parameter estimation and provides strongly consistent label predictions. The results also highlight the trade-off between the classification risk and the dimension reduction created by random projections, so one can strike a balance between computational cost and prediction accuracy in practice.

3. We provide an extensive numerical study with synthetic and real tensor data to demonstrate our ensemble classifier's strong performance. It outperforms traditional methods such as linear discriminant analysis and random forests, as well as other tensor-based methods, in applications such as brain MRI classification and traffic image recognition. It can also handle large tensors generated from tensor CANDECOMP/PARAFAC (CP) models, which are widely applied in spatio-temporal data analysis. Moreover, the computation is much faster for TEC compared with the existing methods. All these results indicate the great potential of the proposed TEC in big data and multi-modal data applications. The Python module for applying TEC is available at

https://github.com/PeterLiPeide/KSTM_Classification/tree/master.

Outline: The paper is organized as follows. Section 2 reviews the current literature on STM and contrasts it with our CP-STM. Section 3 introduces the classification methodology. Section 4 provides an algorithm for model estimation as well as parameter selection. Section 5 contains the main results on our TEC model's prediction consistency. The simulation study and real data analysis are provided in Sections 6 and 7; we analyze traffic imaging data in the real data application. Section 8 concludes the article with a brief discussion. The proofs of our theoretical results are provided in the appendix.

## 2 Related Work

The Support Tensor Machine (STM) generalizes the support vector machine from vectors to tensors and provides a framework for learning a separating hyperplane for tensor data. The optimal separating plane is represented by support tensors, analogous to the support vectors of the support vector machine.

Previous work on STM includes Tao et al. (2005), Kotsia et al. (2012), He et al. (2014), and He et al. (2017). Tao et al. (2005) model the separating plane as the inner product between tensor predictors and a tensor coefficient assumed to be a rank-one tensor; the inner product is defined as a mode-wise product, which is in fact a multi-linear transformation. Kotsia et al. (2012) define the margin via the traces of unfolded tensor coefficients and construct the separating plane using a multi-linear transformation. Both of these works are established with the tensor Tucker model. He et al. (2014) develop their STM from the idea of multi-way feature maps, which yield a kernel function measuring the similarity between CP tensors; the separating plane is also constructed with this kernel function. He et al. (2017) combine tensor CP decomposition and the CP tensor STM into a multi-task learning problem, making their approach slightly different from other STM methods.

Among these existing STM models, only the model of He et al. (2014) is similar to our proposed RPSTM. Apart from the random projection, both methods use kernel functions to measure the similarity between tensors. However, the kernel function we adopt is more general, since it originates from the tensor cross-product norm; we highlight this difference in more detail later in the paper. Moreover, we have proved the universal approximating property of such tensor kernel functions in our recent work, Li and Maiti (2019), which gives a theoretical guarantee of our model's generalization ability.

## 3 Methodology

### 3.1 Notations and Tensor Algebra

We refer to Kolda and Bader (2009) for the standard tensor notations and operations used in this paper. Numbers and scalars are denoted by lowercase letters such as a. Vectors are denoted by boldface lowercase letters, e.g., x. Matrices are denoted by boldface capital letters, e.g., X. Multi-dimensional tensors, the generalization of vectors and matrices to higher-order data, are denoted by boldface Euler script letters such as 𝒳.

The order of a tensor is its number of dimensions, also known as ways or modes. For example, a scalar can be regarded as a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor. In general, a tensor can have d modes for any positive integer d.

In addition to the basic concepts, we need some operations on vectors and matrices to present our work, also referred from Kolda and Bader (2009). The first is the outer product of vectors, the simplest and motivating example of the tensor product. Let a = (a_1, ..., a_p) and b = (b_1, ..., b_q) be two row vectors; their outer product is defined by

$$a \otimes b = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \dots & a_1 b_q \\ a_2 b_1 & a_2 b_2 & \dots & a_2 b_q \\ \vdots & \vdots & \ddots & \vdots \\ a_p b_1 & a_p b_2 & \dots & a_p b_q \end{bmatrix}$$

which is a p × q matrix, identical to aᵀb. We use ⊗ to denote the outer product both for vectors and for CP tensors in our work.

The last concept we mention is the rank of a tensor. Two different definitions of tensor rank are popular in the literature: the CP rank and the Tucker rank; Kolda and Bader (2009) provide details for both. We only consider the CP rank in this paper. The CP rank of a d-mode tensor 𝒳 is the minimum integer r such that the tensor can be written as

$$\mathcal{X} = \sum_{k=1}^{r} x^{(1)}_k \otimes x^{(2)}_k \otimes \dots \otimes x^{(d)}_k = [\![X^{(1)}, \dots, X^{(d)}]\!], \quad X^{(j)} \in \mathbb{R}^{I_j \times r},\ j = 1, \dots, d \qquad (1)$$

Equation (1) is also called the rank-r / CP representation of the tensor (see Hackbusch (2012)). A rank-r tensor always admits a CP representation. For any given tensor, the CP decomposition can be estimated with the Alternating Least Squares (ALS) algorithm (see Kolda and Bader (2009)). The support tensor machine in this paper is constructed with rank-r tensors in their CP representations; as a result, all tensors mentioned in the rest of the article are assumed to have the decomposition (1).
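For concreteness, a rank-r CP tensor of the form (1) can be assembled from its factor matrices. The sketch below (plain NumPy, not the authors' package) performs exactly the sum of outer products in (1):

```python
import numpy as np

def cp_to_tensor(factors):
    """Assemble a full tensor from CP factors.

    factors: list of d matrices X^(j), each of shape (I_j, r); the result is
    sum_{k=1}^r x_k^(1) (o) x_k^(2) (o) ... (o) x_k^(d), (o) = outer product.
    """
    r = factors[0].shape[1]
    dims = tuple(f.shape[0] for f in factors)
    T = np.zeros(dims)
    for k in range(r):
        comp = factors[0][:, k]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, k])  # grow one mode at a time
        T += comp
    return T

# A 3-mode, rank-2 example
rng = np.random.default_rng(1)
factors = [rng.normal(size=(I, 2)) for I in (4, 5, 6)]
T = cp_to_tensor(factors)
print(T.shape)  # (4, 5, 6)
```

The classifier below never needs the full tensor, only the factor matrices, which is what makes the subsequent mode-wise projections cheap.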

### 3.2 CP Support Tensor Machine

Let T_n = {(𝒳_i, y_i) : i = 1, ..., n} be the training set, where the 𝒳_i are d-mode tensors and the y_i ∈ {-1, 1} are labels. If the training risk of a classifier f is R_{T_n}(f), where L is a loss function for classification, the problem is to look for an f_n such that

$$f_n \in \Big\{f : R_{T_n}(f) = \min_{f \in \mathcal{H}} R_{T_n}(f),\ f \in \mathcal{H}\Big\}$$

Here ℋ is a functional space collecting functions that map tensors into scalars. The support tensor machine in Li and Maiti (2019) solves the classification problem above by optimizing the objective function

$$\min_f \ \lambda \|f\|_K^2 + \frac{1}{n}\sum_{i=1}^n L(y_i, f(\mathcal{X}_i)) \qquad (2)$$

where ‖f‖_K is the norm of f in the reproducing kernel Hilbert space induced by the kernel K. Note that L takes the raw output of f rather than predicted labels; examples of such classification losses include the hinge loss and its variants. The optimal solution is of the form

$$f_n = \sum_{i=1}^n \alpha_i y_i K(\mathcal{X}_i, \mathcal{X}) \qquad (3)$$

The kernel function is defined as

$$K(\mathcal{X}, \mathcal{Y}) = \sum_{k,l=1}^{r} \prod_{j=1}^{d} K^{(j)}(x^{(j)}_k, y^{(j)}_l) \qquad (4)$$

where 𝒳 and 𝒴 are d-mode tensors. The label is predicted with Sign(f_n(𝒳)). The K^{(j)} are vector-based kernels, as the x^{(j)}_k and y^{(j)}_l are the vector components of the tensors.

Function (4) is a general form of kernel function, as we do not specify the explicit form of the K^{(j)}. It is developed from the idea of the cross norm defined on the tensor product space; see Hackbusch (2012) and Ryan (2013). It is natural to derive a tensor kernel function of the form (4) when the CP factors of the tensor are available; thus, He et al. (2014) and He et al. (2017) also use similar formulations. However, different choices of kernels may affect the convergence of the classification error, which has not been discussed in their works. We show that by choosing a universal kernel function for tensors (see Li and Maiti (2019)), the classifier (3) is consistent.
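A minimal sketch of the kernel (4), with Gaussian RBF components K^{(j)} as used later in the paper (illustrative NumPy; the bandwidth parameter `gamma` is an arbitrary choice, not a value from the paper):

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    # Gaussian RBF kernel between two vectors (one choice of K^(j)).
    return np.exp(-gamma * np.sum((u - v) ** 2))

def cp_kernel(Xf, Yf, gamma=1.0):
    """Tensor kernel (4): sum over rank pairs (k, l) of products of
    mode-wise kernels. Xf, Yf: lists of d factor matrices of shape (I_j, r)."""
    r = Xf[0].shape[1]
    val = 0.0
    for k in range(r):
        for l in range(r):
            prod = 1.0
            for Xj, Yj in zip(Xf, Yf):
                prod *= rbf(Xj[:, k], Yj[:, l], gamma)
            val += prod
    return val

rng = np.random.default_rng(0)
Xf = [rng.normal(size=(I, 2)) for I in (5, 6)]
Yf = [rng.normal(size=(I, 2)) for I in (5, 6)]
print(cp_kernel(Xf, Yf))
```

Since each factor of (4) is a valid kernel on vectors, the double sum of products is symmetric and positive on matching arguments.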

A direct application of the above procedure can be computationally difficult and expensive, since the components x^{(j)}_k and y^{(j)}_l may belong to high-dimensional spaces. Moreover, as the dimension of each mode increases, it becomes harder for the kernel function to scale and to provide an accurate similarity measure between tensor components. Thus, we apply a random projection to the tensors before calculating the kernels.

### 3.3 Support Tensor Machine with Random Projection

For a single tensor 𝒳 with CP representation, we define the random projection of the tensor as

$$A \circ \mathcal{X} = \sum_{k=1}^{r} A^{(1)}x^{(1)}_k \otimes A^{(2)}x^{(2)}_k \otimes \dots \otimes A^{(d)}x^{(d)}_k = [\![A^{(1)}X^{(1)}, \dots, A^{(d)}X^{(d)}]\!], \quad X^{(j)} \in \mathbb{R}^{I_j \times r} \qquad (5)$$

where the A^{(j)} ∈ R^{P_j × I_j} are matrices whose elements are independent and identically distributed normal variables. Each component of the tensor CP expansion is projected to a lower-dimensional space via a linear projection. We use T_n^A to denote the training set with tensors randomly projected to a lower-dimensional space. The support tensor machine is then trained on the projected data instead of the original data. In general, the estimation procedure is expected to be much more efficient, as we assume the projected dimensions P_j are much smaller than the original dimensions I_j.
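Mode-wise projection as in (5) can be sketched as follows; the 1/sqrt(P_j) variance scaling is a common normalization convention (the paper itself only requires i.i.d. normal entries):

```python
import numpy as np

def project_cp(factors, proj_dims, rng):
    """Randomly project each CP factor matrix:
    X^(j) of shape (I_j, r)  ->  A^(j) X^(j) of shape (P_j, r)."""
    out = []
    for Xj, Pj in zip(factors, proj_dims):
        Ij = Xj.shape[0]
        # i.i.d. Gaussian projection matrix for mode j
        Aj = rng.normal(scale=1.0 / np.sqrt(Pj), size=(Pj, Ij))
        out.append(Aj @ Xj)
    return out

rng = np.random.default_rng(0)
factors = [rng.normal(size=(I, 3)) for I in (200, 150, 100)]
small = project_cp(factors, (20, 20, 20), rng)
print([f.shape for f in small])  # [(20, 3), (20, 3), (20, 3)]
```

Only the factor matrices are projected, so the cost is d small matrix products rather than one enormous multi-linear transformation of the full tensor.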

With the projected tensor data and projection matrices, the original STM model (3) becomes

$$f_{A,n}(\tilde{\mathcal{X}}) = \sum_{i=1}^n \alpha_i y_i K(A \circ \mathcal{X}_i, A \circ \mathcal{X}) \qquad (6)$$

However, this is not the optimal STM for the projected training data. The optimal solution in the projected data space is

$$g_n(\tilde{\mathcal{X}}) = \sum_{i=1}^n \hat{\beta}_i y_i K(A \circ \mathcal{X}_i, A \circ \mathcal{X}) \qquad (7)$$

In binary classification problems, the decision rule is Sign(g_n(𝒳̃)), where g_n is the estimated function of the form (7) with an appropriate choice of kernel function. Choices of kernel functions are discussed in detail in Li and Maiti (2019); we use the Gaussian RBF kernel in this work. We call model (7) the Random Projection-based Support Tensor Machine (RPSTM).

### 3.4 Tensor Ensemble Classifier (TEC)

While random projection provides extra efficiency by transforming the tensor components into a lower dimension, there is no guarantee that a single random projection preserves the margin. Thus, the robustness of model estimation under a single random projection is difficult to maintain. The proposition in Section 5 shows that the classifier estimated with projected tensors retains its discriminative power only on average over multiple random projections. As a result, we aggregate several RPSTMs into a Tensor Ensemble Classifier (TEC) in order to provide a statistically consistent, robust, and computationally efficient classifier. Let

$$\tau_{n,b}(\mathcal{X}) = \frac{1}{b}\sum_{m=1}^{b} \mathrm{Sign}(g_{n,m}(\mathcal{X})) \qquad (8)$$

where b is the number of different RPSTM classifiers and g_{n,m} denotes the m-th RPSTM. The ensemble classifier is then defined as

$$e_{n,b}(\mathcal{X}) = \begin{cases} 1 & \text{if } \tau_{n,b}(\mathcal{X}) \geq \gamma \\ -1 & \text{otherwise} \end{cases} \qquad (9)$$

which is a voting classifier; γ is a thresholding parameter. In the next section, we present the model estimation algorithm.
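The vote (8) and threshold (9) amount to only a few lines. The sketch below assumes the raw outputs of the b base RPSTMs have already been computed (the array layout is an illustrative choice):

```python
import numpy as np

def tec_predict(base_scores, gamma=0.0):
    """TEC vote: average the signs of b base RPSTM outputs, threshold at gamma.

    base_scores: array of shape (b, n) holding raw outputs g_{n,m}(X) of the
    b base classifiers on n test tensors."""
    tau = np.mean(np.sign(base_scores), axis=0)   # (8): average vote in [-1, 1]
    return np.where(tau >= gamma, 1, -1)          # (9): thresholded label

scores = np.array([[ 0.8, -0.2,  0.1],
                   [ 0.5, -0.7, -0.3],
                   [-0.1, -0.4,  0.6]])
print(tec_predict(scores, gamma=0.0))  # [ 1 -1  1]
```

Note that only the signs of the base outputs enter the vote, so a single badly scaled base classifier cannot dominate the ensemble.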

## 4 Model Estimation

Since each base classifier in the ensemble model TEC is estimated independently, we only provide details for learning a single RPSTM. We estimate our model by adopting the optimization procedure of Chapelle (2007): a gradient-descent-type procedure that solves the problem in the primal, handling support-vector-machine-type problems efficiently without transforming them into the dual form.

Let K be the kernel function for tensors, and let A^{(j)}, j = 1, ..., d, be the random projection matrices for each mode. Let D_y be a diagonal matrix whose i-th diagonal element is y_i. Since y_i ∈ {-1, 1}, D_y is an orthogonal matrix, i.e., D_y D_yᵀ = I. The explicit form of the classification function can be derived as

$$g_n(A \circ \mathcal{X}) = \sum_{i=1}^n \beta_i y_i \sum_{k,l=1}^{r} \prod_{j=1}^{d} K^{(j)}\big(A^{(j)}x^{(j)}_{k,i},\, A^{(j)}x^{(j)}_l\big) = \mathbf{K}^T D_y \beta \qquad (10)$$

where β = (β_1, ..., β_n)ᵀ and

$$\mathbf{K}(:, i) = \Big(\sum_{k,l=1}^{r}\prod_{j=1}^{d} K^{(j)}\big(A^{(j)}x^{(j)}_{k,i},\, A^{(j)}x^{(j)}_{l,1}\big),\ \dots,\ \sum_{k,l=1}^{r}\prod_{j=1}^{d} K^{(j)}\big(A^{(j)}x^{(j)}_{k,i},\, A^{(j)}x^{(j)}_{l,n}\big)\Big)^T$$

Here r is the CP rank of the tensors, and the x^{(j)}_{k,i} are the components of the CP decompositions of the corresponding training and testing tensors. In real data applications, we may need to perform a CP decomposition first if the data do not come in CP form. Plugging the solution (10) into the objective function (2), we get

$$\min_\beta \ \lambda\, \beta^T D_y \mathbf{K} D_y \beta + \frac{1}{n}\sum_{i=1}^n L\big(y_i,\, \mathbf{K}^T(:, i)\, D_y \beta\big)$$

where K(:, i) is the i-th column of the matrix K. The derivative with respect to β is

$$2\lambda D_y \mathbf{K} D_y \beta + \frac{1}{n}\sum_{i=1}^n \mathbf{K}(:, i)\, D_y \frac{\partial L}{\partial \beta} \qquad (11)$$

Setting equation (11) to zero, we can solve for β with Newton's method. In our application, we take the squared hinge loss, L(y, f) = max(0, 1 - yf)². The explicit form of equation (11), which we denote by ∇, is

$$\nabla = 2 D_y \mathbf{K} D_y \big[\lambda \beta + I_s(D_y \mathbf{K}\beta - y)\big]$$

where y is the vector of labels in the training data and I_s = diag(s_1, ..., s_n) with s_i = 1 if the i-th observation violates the margin and s_i = 0 otherwise. The Hessian, i.e., the derivative of ∇, is

$$H = 2 D_y \mathbf{K} D_y \big(\lambda I + I_s D_y \mathbf{K}\big)$$

Newton's method updates β ← β - H^{-1}∇ at each iteration. Thus, we obtain the update rule

$$\hat{\beta} = (\lambda I + I_s D_y \mathbf{K})^{-1} I_s\, y \qquad (12)$$
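The fixed-point iteration implied by (12) can be sketched in NumPy. Note that the snippet below uses the standard primal parametrization of Chapelle (2007), f = Kβ, with the labels kept explicit rather than folded into D_y; it illustrates the same Newton fixed point for the squared hinge loss, not the packaged TEC code:

```python
import numpy as np

def fit_primal_sq_hinge(K, y, lam=0.1, n_iter=50):
    """Primal Newton iteration for the squared hinge loss (Chapelle, 2007).
    For a fixed active set s the objective is quadratic, so each Newton step
    solves (lam*n*I + I_s K) beta = I_s y exactly."""
    n = len(y)
    beta = np.zeros(n)
    for _ in range(n_iter):
        s = (y * (K @ beta) < 1).astype(float)    # margin violators
        M = lam * n * np.eye(n) + s[:, None] * K  # row-scaling implements I_s K
        beta_new = np.linalg.solve(M, s * y)
        if np.allclose(beta_new, beta):           # active set has stabilized
            break
        beta = beta_new
    return beta

# Toy check on a linearly separable set with a linear kernel
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta = fit_primal_sq_hinge(X @ X.T, y, lam=0.1)
print(np.sign(X @ X.T @ beta))  # matches y
```

Each iteration is a single linear solve, which is why the method scales with the number of training points rather than the tensor dimensions.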

The algorithms for training and prediction are described in Algorithm 1 and Algorithm 2, respectively.

Note that we use some R-style notation in the algorithms to specify rows and columns of matrices and elements of vectors. For example, K(:, i) stands for the i-th column of the matrix K, and v[j] denotes the j-th element of a vector v.

In Algorithm 1, the time complexity of the training process is much smaller than that of any vectorized method. Even after adding the cost of the random projection itself, a single RPSTM can still be cheaper than a pure SVM applied to the vectorized data.

Moreover, the choices of the projected dimensions P_j are free of the original tensor dimensions. Our theoretical analysis reveals that the P_j are related to the sample size n rather than to the original dimensions I_j. Thus, applications with large tensors can enjoy the full benefit of random projection. Notice that, although the ensemble classifier has to repeat the estimation many times, these repetitions are independent of each other, so they can run in parallel without adding extra computation time.

### 4.1 Choice of Parameters

We end this section with some empirical suggestions for the tuning parameters used in the estimation procedure. The number of ensemble classifiers, b, and the threshold parameter, γ, are chosen by cross-validation. We first set γ = 0, the midpoint of the two labels -1 and 1, and search b over a reasonable range, between 2 and 20; the optimal b is the one yielding the best classification accuracy. In the next step, we fix b and search γ between -1 and 1 with step size 0.1, choosing the value with the best classification accuracy.
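The two-stage search just described can be sketched as follows, assuming the raw validation scores of the base classifiers have been pre-computed (the helper name and score layout are illustrative, not from the released package):

```python
import numpy as np

def tune_b_gamma(val_scores, y_val, b_grid=range(2, 21)):
    """Two-stage search: pick b with gamma fixed at 0, then refine gamma
    over [-1, 1] in steps of 0.1.

    val_scores: (b_max, n) raw outputs of b_max base classifiers on a
    validation set; y_val: the n validation labels in {-1, 1}."""
    def acc(b, gamma):
        tau = np.mean(np.sign(val_scores[:b]), axis=0)
        pred = np.where(tau >= gamma, 1, -1)
        return np.mean(pred == y_val)

    best_b = max(b_grid, key=lambda b: acc(b, 0.0))          # stage 1
    gammas = np.round(np.arange(-1.0, 1.0 + 1e-9, 0.1), 1)
    best_gamma = max(gammas, key=lambda g: acc(best_b, g))   # stage 2
    return best_b, best_gamma

val_scores = np.tile(np.array([[1.0, -1.0]]), (20, 1))  # toy: all bases agree
y_val = np.array([1, -1])
print(tune_b_gamma(val_scores, y_val))
```

Because the base scores are fixed during tuning, the whole grid search costs only cheap sign-and-average passes, not refits of the RPSTMs.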

The choice of random projection matrices is more complicated. Although we can generate the matrices from the standard Gaussian distribution, their dimensions remain unclear. Our guideline, the JL lemma, only provides a lower bound for the projected dimension, and only for the vector case. As a result, in practice we choose the dimension based on intuition and cross-validation results, picking an empirical projection dimension for each mode.

## 5 Statistical Property of TEC Ensemble

In this section, we quantify the uncertainty of the model prediction and provide convergence results supporting the generalization ability of our model. Before the discussion, we need to introduce a few more notations and highlight their differences.

Let L be a loss function for classification problems. In general, the problem is to search for a function f* that minimizes the risk of misclassification:

$$f^* = \arg\min R(f) = \arg\min \mathbb{E}_{(\mathcal{X} \times Y)} L(y, f(\mathcal{X})), \qquad f^* \in \{f : \mathcal{X} \to Y \mid f \in \mathcal{H}\}$$

where ℋ is the collection of all measurable functions that map tensors into scalars. R(f) is the classification risk of a specific decision rule f, defined as

$$R(f) = \mathbb{E}_{(\mathcal{X} \times Y)} L(y, f(\mathcal{X})) = \int L(y, f(\mathcal{X}))\, dP$$

When the optimal decision rule f* is obtained, R(f*) becomes the minimum risk that any decision rule can reach over the data distribution P. It is called the Bayes risk in the literature and will be denoted by R* in what follows. We say a classifier f is statistically consistent if R(f) → R*.

With finite training data T_n, however, we are not able to estimate such an optimal decision rule. Under most circumstances, we try to minimize the regularized empirical risk by selecting rules from a pre-defined collection of classification functions. Such a procedure is called Empirical Risk Minimization (ERM). The empirical risk of a decision rule f estimated on T_n is defined by

$$R_{T_n}(f) = \frac{1}{n}\sum_{i=1}^n L(y_i, f(\mathcal{X}_i))$$

where we assume the probability mass is uniformly distributed over the data points of the training set. Under the random projection of the training data, T_n^A, the empirical risk on the projected training set is

$$R_{T_n^A}(g) = \frac{1}{n}\sum_{i=1}^n L(y_i, g(A \circ \mathcal{X}_i))$$

where g is our decision function. Let ℱ be the reproducing kernel Hilbert space (RKHS) generated by the tensor kernel functions. By minimizing the regularized empirical risk, we obtain the two functions

$$f^{\lambda}_n = \arg\min_{f \in \mathcal{F}} \{R_{T_n}(f) + \lambda \|f\|_K^2\}, \qquad g^{\lambda}_n = \arg\min_{g \in \mathcal{F}} \{R_{T_n^A}(g) + \lambda \|g\|_K^2\}$$

which are the optimal decision rules conditional on the training data T_n and T_n^A, respectively. Ideally, we assume the best-in-class decision rules f^λ and g^λ are independent of the training data, i.e.,

$$f^{\lambda} = \arg\min_{f \in \mathcal{F}} \{R(f) + \lambda \|f\|_K^2\}, \qquad g^{\lambda} = \arg\min_{g \in \mathcal{F}} \{R(g \circ A) + \lambda \|g\|_K^2\}$$

Lastly, suppressing the notational dependence on λ, we denote our ensemble classifier as

$$e_{n,b}(\mathcal{X}) = \mathbb{I}\{\tau_{n,b}(\mathcal{X}) \geq \gamma\} = \mathbb{I}\Big\{\frac{1}{b}\sum_{m=1}^b \mathrm{Sign}(g_{n,m}(\mathcal{X})) \geq \gamma\Big\}$$

where each g_{n,m} is the optimal classifier conditional on the random projection matrices A_m, and 𝕀 is an indicator function.

Notice that both the RPSTM classifier and the TEC are randomized classifiers: their performance, as well as their prediction risk, depends on the random projection A. As a result, we say these randomized classifiers are statistically consistent if E_A[R(g_n)] → R* and E_A[R(e_{n,b})] → R*, where E_A denotes expectation over the distribution of the random projection matrices. We establish both results in this section.

### 5.1 Risk of Ensemble Classifier

We first bound the expected risk of our TEC classifier using Theorem 2 of Cannings and Samworth (2017).

###### Theorem 1.

For each γ,

$$\mathbb{E}_A[R(e_{n,b})] - R^* \leq \frac{1}{\min(\gamma,\, 1-\gamma)}\,\big[\mathbb{E}_A[R(g_n)] - R^*\big]$$

This result says that the ensemble model TEC is statistically consistent as long as the base classifier, RPSTM, is consistent. Hence, in the following part we analyze the consistency of RPSTM and its convergence rate.

### 5.2 Excess Risk of Single Classifier

As a statistically consistent model, we expect the excess risk of RPSTM, E_A[R(g_n)] − R*, to converge to zero. Let D(λ) be a function of the tuning parameter λ, described in the assumptions a few paragraphs below. The following proposition about the excess risk gives the direction of convergence for the model risk.

###### Proposition 1.

The excess risk is bounded above:

 (13)

The bound is proved in Appendix A.3. The notations g_n, f_n, and f_{A,n} stand for the functions (7), (3), and (6) in Section 3. Classifiers superscripted by λ denote norm-restricted classifiers. For a given random projection, g^λ_n represents the norm-restricted optimal SVM classifier on the projected training data, and f^λ_n the norm-restricted optimal SVM classifier on the original training data. For a given random projection characterized by the projection matrices A, f^λ_{A,n} represents a norm-restricted classifier on the projected training data, constructed directly from f^λ_n using (6). f^λ denotes the norm-restricted oracle optimal SVM classifier.

As the left side of equation (13) is non-negative, we only have to show that the right side of equation (13) converges to zero; the bounded convergence theorem can then be adopted to provide convergence in expectation. The convergence of the excess risk and its rate demonstrate the generalization ability of the classifier.

In the following part, we assume all the conditions listed below hold:

1. The loss function L is locally Lipschitz continuous, in the sense that for |f₁|, |f₂| ⩽ W,

$$|L(y, f_1) - L(y, f_2)| \leq C(W)\,|f_1 - f_2|$$

2. The kernel function

$$K(\mathcal{X}_1, \mathcal{X}_2) = \sum_{k,l=1}^{r}\prod_{j=1}^{d} K^{(j)}(x^{(j)}_{1,k}, x^{(j)}_{2,l})$$

satisfies the universal approximating property (see Li and Maiti (2019)).

3. Assume .

4. Each component K^{(j)} of the kernel function can be written as a function h^{(j)} of the distance between its arguments, and every h^{(j)} is L^{(j)}_K-Lipschitz continuous:

$$|h^{(j)}(t_1) - h^{(j)}(t_2)| \leq L^{(j)}_K\,|t_1 - t_2|$$

where the h^{(j)} may differ across modes.

5. All the random projection matrices A^{(j)} have their elements independently and identically distributed as N(0, 1). The dimension of A^{(j)} is P_j × I_j. For given ϵ and δ, each P_j is assumed large enough that the mode-wise projection error is controlled with probability at least 1 − δ.

6. The hyper-parameter λ_n in the regularization term satisfies

$$\lambda_n \to 0 \quad\text{and}\quad n\lambda_n^2 \to \infty \quad \text{as } n \to \infty$$

7. All the tensor data 𝒳 are assumed to take values in a bounded set.

8. The Bayes risk remains unaltered for all randomly projected data: for all A,

$$R^*_A = R^*$$

9. Let ℱ be a reproducing kernel Hilbert space (RKHS). We assume there always exists a function in ℱ that minimizes the risk while having bounded kernel norm; equivalently, it minimizes the Lagrangian given by the sum of the classification risk and the kernel-norm penalty λ‖f‖²_K. We denote the resulting minimum achievable norm-penalized classification risk by D(λ), and assume D(λ) → 0 as λ → 0 at a specified rate. We refer to Definition 5.14 and Section 5.4 in Steinwart and Christmann (2008) for further reading.

10. The projection error ratio for each mode vanishes at a rate depending on the loss function:

$$\frac{\epsilon_n}{\lambda_n^{q}} \to 0$$

where the exponent q differs for the hinge loss and the squared hinge loss.

Assumptions 1, 3, 6, and 7 are commonly used in supervised learning problems with kernel tricks (see, e.g., Vapnik (2013), LaConte et al. (2005)). Assumption 2 is a form of universal kernel, which makes it possible to approximate any continuous function defined on a compact set with a function of the form (3). The universal approximating property for tensor kernels was first developed in Li and Maiti (2019): with a universal tensor kernel function K, any continuous function f can be approximated by

$$f_n = \sum_i \beta_i K(\mathcal{X}_i, \cdot)$$

over a compact metric set B_x. We further assume that our tensor features take values in B_x only. The approximation error is bounded in the sup norm:

$$\sup_{\mathcal{X} \in B_x} |f(\mathcal{X}) - f_n(\mathcal{X})| \leq \frac{\epsilon}{2}.$$

Assumption 8 ensures that the Bayes risk remains unchanged under random projection. This assumption has also been made in Cannings and Samworth (2017), where it is assumed that there is a random projection A such that

$$P\big(\{x : \eta(x) \geq \tfrac{1}{2}\}\ \Delta\ \{x : \eta^A(Ax) \geq \tfrac{1}{2}\}\big) = 0$$

Here Δ denotes the symmetric difference between two sets, and η in Cannings and Samworth (2017) represents the Bayes classifier, a notation different from ours in Assumption 8. Their assumption states that, in probability, there is no feature-level set on which the original Bayes decision half-plane and the Bayes decision half-plane corresponding to the projection differ; thus, the Bayes risk remains unchanged under any random projection. Assumption 5 is a sufficient condition ensuring the preservation of regular (Euclidean) distances between tensors and their projections; such a statement is known as a concentration inequality, and the corresponding lemma for vector data is the JL lemma (Dasgupta and Gupta (2003)). It also indicates how to select the dimensions of the random projections so that the projection error can be controlled with a given threshold probability. Assumption 9 specifies the rate at which the minimum risk attained by a function with bounded kernel norm converges to the Bayes risk, expressed in terms of the kernel-norm bound; it also implies universal consistency (Chen and Li (2014)). This rate is satisfied by a broad class of data distributions that assign low density near the Bayes decision boundary (Steinwart and Christmann (2008), Theorem 8.18). Assumption 4 states that the kernel inner product between two data points can be expressed as a Lipschitz function of the distance between them; it connects the kernel norm with the regular norm on tensors and is therefore critical for establishing the approximate kernel-norm isometry of projected tensors, which holds with large probability for d-mode tensors. Assumption 10 declares that the projection error vanishes relative to the kernel-norm tuning parameter as the sample size increases; it is worth noting that the rate differs according to the loss function (Chen and Li (2014)).

#### 5.2.1 Price for Random Projection

Applying random projection in the training procedure trades prediction accuracy for computational cost. Although the Johnson-Lindenstrauss lemma (e.g., see Johnson and Lindenstrauss (1984) and Dasgupta and Gupta (1999)) indicates an approximate isometry for projected data in terms of kernel values, the estimated decision rule may not be exactly the same as the one estimated from the original data set. Thus, it is necessary to study the asymptotic difference

 E_A[R(f_A,n)] − R(f_n)

We first drop the expectation and present a result for R(f_A,n) − R(f_n). The convergence in expectation is established later.
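The approximate isometry in terms of kernel values is easy to check numerically. The sketch below (plain NumPy; the dimensions, bandwidth, and helper `rbf` are illustrative assumptions) compares a Gaussian-kernel value computed on the original pair of points with the value computed after a Gaussian random projection.

```python
import numpy as np

rng = np.random.default_rng(2)

p, d, gamma = 2_000, 400, 1e-4       # illustrative dimensions and bandwidth
x, y = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(d, p)) / np.sqrt(d)   # scaled Gaussian projection

def rbf(u, v, g):
    # Gaussian (RBF) kernel value for a pair of vectors.
    return np.exp(-g * np.sum((u - v) ** 2))

k_orig = rbf(x, y, gamma)
k_proj = rbf(A @ x, A @ y, gamma)
print(k_orig, k_proj)  # close with high probability
```

Since the kernel depends on the data only through the pairwise distance, approximate distance preservation carries over to approximate preservation of kernel values, which is why the projected decision rule stays close to the original one.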

###### Proposition 2.

Assume the projection matrices are generated independently and identically from a Gaussian distribution. Under Assumptions 1, 4, 5, 6, and 7, and for the projection error ε_d described in 5, with high probability we have, for the hinge loss and the squared hinge loss respectively,

 |R(f_A,n) + λ||f^λ_A,n||²_k − R(f_n) − λ||f^λ_n||²_k| = O(ε_d λ^q)

where n is the size of the training set and K is the number of modes of the tensor.

The proof of this proposition is provided in the appendix A.4. The value of q depends on the loss function as well as the kernel and the geometric configuration of the data, which is discussed in the appendix. This proposition highlights the trade-off between dimension reduction and prediction risk: since the reduced dimension is negatively related to the projection error ε_d, a small projected dimension can make the term converge at a very slow rate.
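The dependence of the projection error on the reduced dimension can be illustrated empirically. The sketch below (NumPy only; dimensions, seed, and trial count are arbitrary choices) averages the relative distortion of a pairwise distance over repeated Gaussian projections for several projected dimensions d.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 5_000
x, y = rng.normal(size=p), rng.normal(size=p)
orig = np.linalg.norm(x - y)

# Average relative distortion of the pairwise distance for several projected
# dimensions: a larger d gives a smaller projection error, at a higher
# computational cost, which is the trade-off discussed above.
errors = {}
for d in (10, 100, 1000):
    trials = []
    for _ in range(20):
        A = rng.normal(size=(d, p)) / np.sqrt(d)
        trials.append(abs(np.linalg.norm(A @ (x - y)) - orig) / orig)
    errors[d] = float(np.mean(trials))
print(errors)  # distortion shrinks as d grows
```

The empirical distortion shrinks roughly like 1/√d, matching the qualitative message of the proposition: a very small d keeps computation cheap but makes the excess-risk term decay slowly.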

#### 5.2.2 Convergence of Risk

Now we provide a result which establishes the convergence of the risk of the RPSTM classifier and reveals the rate of convergence.

###### Theorem 2 (Convergence rate).

Given a training data set, we can obtain a support tensor machine classifier by minimizing the regularized empirical risk (2). Let R∗ be the Bayes risk of the problem. Under Assumptions 1 - 9, and with the projected dimension for each mode chosen as in 5, the generalization error of RPSTM is bounded with high probability, i.e.,

 R(g_λ,n) − R∗ ⩽ V(1) + V(2) + V(3)
 V(1) = 12 C(K_max√L_0/λ) · K_max√L_0/(√n λ) + 9 ζ̃_λ √(log(2/δ2)/(2n)) + 2 ζ_λ √(2 log(2/δ2)/n)
 V(2) = D(λ)
 V(3) = C_d Ψ [C(K_max√L_0/λ) + λΨ] ε_d (14)

The proof, together with the definitions of the remaining quantities, is provided in appendix A.7. The probability term δ2 is the contribution of the random projection. Notice that in the expression above, the upper bound on the risk difference depends on the probability parameter δ2 and on the projection error ε_d. The projection error ε_d, the random projection probability parameter, and the projected dimension d are related, so ε_d depends on d for any fixed sample size n. This expression admits an alternate interpretation if one prefers to treat the mode-wise projected dimensions as the error parameters instead of the mode-wise projection errors; in our case, the error parameter is the projection error ε_d. It is worth noting that the projected dimension grows with the sample size n, in order to satisfy Assumption 10. Assuming the projected dimensions are uniform across all modes, the risk difference above is, with high probability, bounded by a function of n.

The convergence rate differs across loss functions. Since we propose using the squared hinge loss for model estimation in section 4, the specific rate for the squared hinge loss is presented here; the rate for the ordinary hinge loss is provided in the appendix A.9.

###### Proposition 3.

For the squared hinge loss, with λ and the projected dimensions chosen suitably in terms of n, for some μ > 0 and η > 0, and with high probability,

 R(g_λ) − R∗ ⩽ C √(log(2/δ2)) (1/n)^{μη/(2η+3)} (15)

The rate of convergence in the sample size is faster when a larger value of η is chosen; as η → ∞, the exponent in (15) approaches μ/2. The proof of this result is in appendix A.8.

###### Proposition 4.

For the hinge loss, with λ and the projected dimensions chosen suitably in terms of n, for some μ > 0 and η > 0, and with high probability,

 R(g_λ) − R∗ ⩽ C √(log(2/δ2)) (1/n)^{μη/(2η+2)} (16)

The rate of convergence in the sample size is faster when a larger value of η is chosen; as η → ∞, the exponent in (16) approaches μ/2. The proof of this result is in appendix A.9.

###### Theorem 3.

The excess risk goes to zero as the sample size increases, where E_A denotes the expectation with respect to the tensor random projection:

 E_A[R(g_λ,n)] − R∗ → 0

This expected-risk convergence builds on our previous results; the proof is provided in the appendix A.10. The theorem concludes that the expected risk of RPSTM converges to the optimal Bayes risk. As a result, RPSTM, as well as our ensemble model TEC, is statistically consistent.
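The ensemble idea behind TEC can be sketched in a few lines. The toy code below is a simplified stand-in, not the paper's implementation: it uses scikit-learn's `SVC` on vectorized mode-wise projections rather than the CP-based STM, and all shapes, seeds, and the helper `project` are illustrative. Each base learner gets its own random projections (so they could be fit in parallel), and predictions are aggregated by majority vote.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 3-mode tensor data from two classes (mean shift 0.5, as in model F1).
n, shape, low = 200, (10, 10, 10), (4, 4, 4)
X = rng.normal(size=(n,) + shape)
y = np.repeat([0, 1], n // 2)
X[y == 1] += 0.5

def project(T, mats):
    # Mode-wise multiplication: one projection matrix per tensor mode
    # (axis 0 of T indexes samples, so mode k is axis k + 1).
    for k, M in enumerate(mats):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, k + 1)), 0, k + 1)
    return T

# Each ensemble member draws independent Gaussian projections.
votes = []
for _ in range(11):
    A = [rng.normal(size=(low[k], shape[k])) / np.sqrt(low[k]) for k in range(3)]
    Z = project(X, A).reshape(n, -1)      # vectorize the projected tensors
    clf = SVC(kernel="rbf").fit(Z, y)     # stand-in for the RPSTM base learner
    votes.append(clf.predict(Z))

# Majority vote over the base classifiers.
pred = (np.mean(votes, axis=0) > 0.5).astype(int)
print("training accuracy:", np.mean(pred == y))
```

Aggregating over independent projections is what controls the excess risk a single projection would introduce, while each base learner only ever sees a 4 × 4 × 4 tensor instead of the original one.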

## 6 Simulation Study

In this section, we present an extensive simulation study with data generated by different models. Several popular tensor based and traditional vector based classification algorithms are included in the study for comparison. For the traditional techniques, we consider linear discriminant analysis (LDA), random forest (RF) and support vector machine (SVM) (e.g., see Friedman et al. (2001)). For the tensor-based classification, we select MPCA from Lu et al. (2008), DGTDA from Li and Schonfeld (2014), and the CATCH model from Pan et al. (2018). The classification error rate and computation time are presented for comparison.

Synthetic tensors are generated from different types of tensor models introduced in Kolda and Bader (2009). All the models are described below in detail. For simplicity, we use X1 and X2 to denote data from the two different classes.

1. F1 Model: Low-dimensional rank-1 tensor factor model with each component following the same distribution. The shape of the tensors is 30 × 30 × 30.

 X1 = x(1) ⊗ x(2) ⊗ x(3),  x(j) ∼ N(0, I_30), j = 1, 2, 3
 X2 = x(1) ⊗ x(2) ⊗ x(3),  x(j) ∼ N(0.5, I_30), j = 1, 2, 3
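The F1 model above can be generated directly with outer products; a minimal NumPy sketch (one draw per class; the seed and helper name `rank1_sample` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def rank1_sample(mean, dim=30):
    # Rank-1 tensor: outer product of three independent Gaussian vectors.
    x1, x2, x3 = (rng.normal(loc=mean, size=dim) for _ in range(3))
    return np.einsum('i,j,k->ijk', x1, x2, x3)

X1 = rank1_sample(0.0)   # class 1: components ~ N(0, I_30)
X2 = rank1_sample(0.5)   # class 2: components ~ N(0.5, I_30)
print(X1.shape)  # (30, 30, 30)
```

Every mode-1 unfolding of such a sample is a rank-1 matrix, which is what makes the CP form of the data exact for this model.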

2. F2 Model: High-dimensional rank-1 tensor with a normal distribution in each component. Shape of tensors is .

 X1 = x(1) ⊗ x(2) ⊗ x(3) ⊗ x(4),  x(j) ∼ N(0, Σ(j)), j = 1, 2, 3, 4
 X2 = x(1) ⊗ x(2) ⊗ x(3) ⊗ x(4),  x(j) ∼ N(1, Σ(j)), j = 1, 2, 3, 4
 Σ(1) = I,  Σ(2) = Σ(4) = AR(0.7),  Σ(3)_{i,j} = min(i, j)

3. F3 Model: High-dimensional rank-3 tensor factor model. The components follow different Gaussian distributions. Shape of tensors is .