1 Introduction
With the advancement of information and engineering technology, modern-day data science problems often come with data of gigantic size and increased complexity. One popular technique for storing such data is the multidimensional array, which preserves the data anatomy along with multidimensional spatial and spatiotemporal correlations. Unfolding the data may destroy this structure along with other relevant information, and the unfolded data face the curse of dimensionality. The structural complexity and high dimensionality pose many challenges to statisticians and computer scientists in data processing and analysis. Traditional machine learning and statistical models for high-dimensional data are developed in vector spaces. Adapting these methods to the new types of structured data requires vectorization, which loses critical information such as multiway correlation and spatiotemporal structure. Moreover, the huge dimension of the vectorized data can overwhelm traditional high-dimensional models, especially in terms of computational complexity. To preserve the multiway structure of the data and perform efficient computation, research has begun on building tensor data models and performing statistical analysis based on these models.
A tensor is a multidimensional or multiway array, a generalization of a matrix to the multidimensional situation. Depending on the definition, tensors can be represented in different forms, such as Candecomp/Parafac (CP), Tucker, and Tensor Chain. These tensor representations are widely applied in both supervised and unsupervised machine learning problems for multidimensional data. Kolda and Bader (2009) provide a comprehensive survey of CP and Tucker tensor decompositions and their applications in psychometrics, neuroscience, latent factor detection, image compression, prediction, and classification. Among all these machine learning applications with tensors, we focus only on classification models in this work. Various classification models for tensor data have been studied in the last few years. Some works, such as Tan et al. (2012), Zhou et al. (2013), and Pan et al. (2018)
, mostly in the statistics literature, develop logistic regression and probabilistic linear discriminant analysis models. These models rely on the same principal assumption that the tensor data conform to a normal distribution, which is too restrictive in real applications. Nonparametric tensor classifiers turn out to be more flexible as they do not make any distributional assumption. Examples of nonparametric tensor classifiers include
Tao et al. (2005), Kotsia et al. (2012), Signoretto et al. (2014), He et al. (2014), Li and Schonfeld (2014), and He et al. (2017). These methods are more flexible and have much better generalization ability in applications with tensor data that are sparse or have highly autocorrelated modes. Motivated by the CP tensor and the tensor product space introduced in Hackbusch (2012), we proposed a version of the Support Tensor Machine (STM) using tensor kernel functions based on the tensor cross-product norm in our previous work, Li and Maiti (2019). By definition, CP tensors can be considered as elements of a tensor product space, and the associated cross-product norm measures the magnitude of tensors in that space. Following this idea, we develop a tensor kernel function based on the cross-product norm to measure the similarity between tensors. The kernel functions also represent the resulting separating plane. We call this model CP-STM.
In this work, we consider simplifying the computation in CP-STM by training an STM with randomly projected tensors. The application of random projection in high-dimensional data analysis is motivated by the celebrated Johnson-Lindenstrauss lemma (see, e.g., Dasgupta and Gupta (2003)). The lemma says that for arbitrary epsilon in (0, 1) and any n points in R^p, there is a linear transformation T: R^p -> R^k, with k of order epsilon^{-2} log(n), such that for any two of the points u and v,

(1 - epsilon) ||u - v||^2 <= ||T(u) - T(v)||^2 <= (1 + epsilon) ||u - v||^2

with large probability. That is, the linear transformation approximately preserves the pairwise Euclidean distances between these points. Random projection has proven to be a decent dimension reduction technique in the machine learning literature. Durrant and Kabán (2013) present Vapnik-Chervonenkis-type bounds on the generalization error of a linear classifier trained on a single random projection. Chen and Li (2014) provide a convergence rate for the classification error of a support vector machine trained with a single random projection. Cannings and Samworth (2017) prove that a random-projection ensemble classifier can reduce the generalization error further. Their results hold for several types of base classifiers, such as nearest neighbors and linear/quadratic discriminant analysis. Inspired by these works, we propose an ensemble of RPSTMs, STMs trained with randomly projected tensors, for large-sized tensor classification. We call this method the Tensor Ensemble Classifier (TEC).
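The distance-preserving behavior described by the lemma is easy to demonstrate numerically. The sketch below, in Python with NumPy, is our illustration (not the paper's implementation): it projects a few high-dimensional points with a Gaussian random matrix and checks that a pairwise distance survives roughly intact.

```python
import numpy as np

def random_projection(X, k, rng=None):
    """Project rows of X from R^p to R^k with a Gaussian random matrix.

    Entries are drawn i.i.d. from N(0, 1/k) so that squared Euclidean
    distances are preserved in expectation (Johnson-Lindenstrauss).
    """
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(p, k))
    return X @ R

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2000))          # 10 points in a 2000-dim space
Z = random_projection(X, k=500, rng=1)   # projected to 500 dimensions

# A pairwise distance is approximately preserved after projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Z[0] - Z[1])
ratio = d_proj / d_orig                  # close to 1 with high probability
```

With k = 500 the typical distortion is only a few percent, which is the quantitative content of the lemma.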
Our contribution: Our work alleviates the limitations of existing tensor approaches in handling big-data classification problems. This paper introduces an innovative ensemble classification model coupling random projection with support tensor machines to provide computationally fast and statistically consistent label predictions for ultra-high-dimensional tensor data. Specifically, the contribution of this paper is threefold.

We successfully adapt the well-known random-projection technique to high-dimensional tensor classification applications and provide an ensemble classifier that can handle extremely large tensor data. The adoption of random projection makes it feasible to classify big tensor data directly on regular machines and saves computational cost. We further aggregate multiple RPSTMs to form our TEC classifier, which is statistically consistent while remaining computationally efficient. Since the aggregated base classifiers are independent of each other, the model learning procedure can be accelerated on a parallel computing platform.

Theoretical results are established to validate the prediction consistency of our classification model. Unlike Chen and Li (2014) and Cannings and Samworth (2017), we extend the Johnson-Lindenstrauss lemma to tensor data and show that the STM from Li and Maiti (2019) can be estimated with randomly projected tensors. The average classification risk of the estimated model converges to the optimal Bayes risk under some specific conditions. Thus, the ensemble of several RPSTMs can have robust parameter estimation and provide strongly consistent label predictions. The results also highlight the trade-off between classification risk and dimension reduction created by random projections. As a result, one can strike a balance between computational cost and prediction accuracy in practice.

We provide an extensive numerical study with synthetic and real tensor data to demonstrate our ensemble classifier's strong performance. It performs better than traditional methods such as linear discriminant analysis and random forest, as well as other tensor-based methods, in applications like brain MRI classification and traffic image recognition. It can also handle large tensors generated from tensor CANDECOMP/PARAFAC (CP) models, which are widely applied in spatio-temporal data analysis. Besides, the computation is much faster for the TEC compared with the existing methods. All these results indicate a great potential for the proposed TEC in big-data and multimodal-data applications. The Python module implementing TEC is available at
https://github.com/PeterLiPeide/KSTM_Classification/tree/master.
Outline: The paper is organized as follows. Section 2 reviews the current literature on STM and contrasts it with our CP-STM. Section 3 introduces the classification methodology. Section 4 provides an algorithm for model estimation as well as parameter selection. Section 5 contains the main results on our TEC model's prediction consistency. The simulation study and real data analysis are provided in Sections 6 and 7. We analyze traffic imaging data in the real data application. Section 8 concludes the article with a brief discussion. The proofs of our theoretical results are provided in the appendix.
2 Related Work
The Support Tensor Machine (STM) generalizes the support vector machine from vectors and provides a framework for learning a separating hyperplane for tensor data. The optimal separating plane is represented by support tensors, analogous to the support vector machine's support vectors.
Previous work on STM includes Tao et al. (2005), Kotsia et al. (2012), He et al. (2014), and He et al. (2017). Tao et al. (2005) model the separating plane as the inner product between tensor predictors and a tensor coefficient assumed to be a rank-one tensor. The inner product is defined as a mode-wise product, which is indeed a multilinear transformation. Kotsia et al. (2012) define the margin via the traces of unfolded tensor coefficients and construct the separating plane using a multilinear transformation. Both of these works are established with the tensor Tucker model. He et al. (2014) develop their STM based on the idea of multiway feature maps. The multiway feature maps result in a kernel function measuring the similarity between CP tensors, and the separating plane is constructed with this kernel function. He et al. (2017) combine tensor CP decomposition and the CP tensor STM as a multitask learning problem, making it slightly different from other STM methods.
Among all these existing STM models, only the model from He et al. (2014) is similar to our proposed RPSTM. Apart from the random projection, both methods use kernel functions to measure the similarity between tensors. However, the kernel function we adopt is more general, since it originates from the tensor cross-product norm. We highlight this difference in more detail later in the paper. Moreover, we provided the universal approximating property for such tensor kernel functions in our recent work, Li and Maiti (2019). This gives a theoretical guarantee of the generalization ability of our model.
3 Methodology
3.1 Notations and Tensor Algebra
We refer to Kolda and Bader (2009) for the standard tensor notations and operations used in this paper. Numbers and scalars are denoted by lowercase letters such as a. Vectors are denoted by boldface lowercase letters, e.g. a. Matrices are denoted by boldface capital letters, e.g. A. Multidimensional tensors are the generalization of vector and matrix representations for higher-order data, and are denoted by boldface Euler script letters such as X.
The order of a tensor is the number of its dimensions, also known as ways or modes. For example, a scalar can be regarded as an order-zero tensor, a vector is an order-one tensor, and a matrix is an order-two tensor. In general, a tensor can have d modes for any positive integer d.
In addition to the basic concepts, we need some operations on vectors and matrices to present our work, also referred from Kolda and Bader (2009). The first one is the outer product of vectors, which is the simplest and the motivating example for the tensor product. Let a be a vector of length m and b be a vector of length n. Their outer product is defined by

a ∘ b = a b^T,

which is an m x n matrix with (i, j) entry a_i b_j. We use ∘ to denote both the outer product of vectors and of CP tensors in our work.
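For concreteness, the outer product can be computed directly in NumPy; this small illustration is ours, not part of the original paper.

```python
import numpy as np

# The outer product of a (length m) and b (length n) is the m x n matrix
# a b^T whose (i, j) entry is a_i * b_j -- the rank-one building block
# used in CP representations.
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
M = np.outer(a, b)   # identical to a.reshape(-1, 1) @ b.reshape(1, -1)
```

Here `M` is the 2 x 3 matrix [[3, 4, 5], [6, 8, 10]].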
The last concept we mention is the rank of a tensor. There are two definitions of tensor rank that are popular in the literature: the CP rank and the Tucker rank. Kolda and Bader (2009) provide details for both. We only consider the CP rank in this paper. The CP rank of a d-mode tensor X is the minimum integer r such that the tensor can be written as
X = sum_{k=1}^{r} x_1^(k) ∘ x_2^(k) ∘ ... ∘ x_d^(k).     (1)
Equation (1) is also called the tensor rank-r / CP representation (see Hackbusch (2012)). We can always find a CP representation for a tensor of rank r. For any given tensor, the CP decomposition can be estimated with the Alternating Least Squares (ALS) algorithm (see Kolda and Bader (2009)). The support tensor machine in this paper is constructed with rank-r tensors in their CP representations. As a result, all the tensors mentioned in the rest of the article are assumed to have the decomposition (1).
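A rank-r CP representation as in (1) can be assembled from its factor vectors with a few lines of NumPy. The sketch below is illustrative; the factor-matrix layout (mode-j factors stored as the columns of an I_j x r matrix) is our assumption.

```python
import numpy as np

def cp_to_tensor(factors):
    """Assemble a d-mode tensor from its rank-r CP factors.

    `factors` is a list of d matrices; the j-th has shape (I_j, r),
    and column k of every factor gives the k-th rank-one component.
    """
    r = factors[0].shape[1]
    shape = tuple(f.shape[0] for f in factors)
    T = np.zeros(shape)
    for k in range(r):
        comp = factors[0][:, k]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, k])  # outer product per mode
        T += comp                                    # sum of rank-one terms
    return T

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(5, 2))
C = rng.normal(size=(3, 2))
T = cp_to_tensor([A, B, C])   # a 4 x 5 x 3 tensor of CP rank at most 2
```

The same tensor can be written compactly as `np.einsum('ir,jr,kr->ijk', A, B, C)` for three modes.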
3.2 CP Support Tensor Machine
Let T_n = {(X_1, y_1), ..., (X_n, y_n)} be the training set, where the X_i are d-mode tensors and the y_i in {-1, 1} are labels. If we assume the training risk of a classifier f is R_n(f) = (1/n) sum_{i=1}^{n} L(y_i, f(X_i)), where L is a loss function for classification, the problem is to look for an f in F that minimizes R_n(f). Here F is a functional space collecting functions that map tensors to scalars. The support tensor machine in Li and Maiti (2019) solves the classification problem above by optimizing the objective function
min_{f in H}  (1/n) sum_{i=1}^{n} L(y_i, f(X_i)) + λ ||f||^2,     (2)
where ||f|| is the L2 norm of the function f and λ > 0 is a regularization parameter. Note that L takes the raw output f(X) instead of predicted labels. Examples of such loss functions in classification include the hinge loss and its variants. The optimal solution has the form
f(X) = sum_{i=1}^{n} α_i K(X_i, X).     (3)
The kernel function is defined as

K(X, Y) = sum_{k=1}^{r} sum_{l=1}^{r} prod_{j=1}^{d} k_j(x_j^(k), y_j^(l)),     (4)

where X = sum_{k=1}^{r} x_1^(k) ∘ ... ∘ x_d^(k) and Y = sum_{l=1}^{r} y_1^(l) ∘ ... ∘ y_d^(l) are d-mode tensors. The label is predicted with sign(f(X)). The k_j are vector-based kernels, as the x_j^(k) and y_j^(l) are vector components of the tensors.
Equation (4) gives a general form of kernel function, as we do not specify the explicit form of the component kernels k_j. It is developed from the idea of the cross norm defined on the tensor product space; see Hackbusch (2012) and Ryan (2013). It is very natural to derive a tensor kernel function in the form of equation (4) when the CP factors of the tensor are available. Thus, He et al. (2014) and He et al. (2017) also use a similar formulation. However, different choices of kernel may affect the convergence of the classification error, which has not been discussed in their works. We show that by choosing a universal kernel function for tensors (see Li and Maiti (2019)), the classifier (3) is consistent.
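One plausible reading of the kernel (4) — summing, over all pairs of rank-one components, the product of mode-wise kernels — can be sketched as follows. The Gaussian RBF component kernels and the factor-matrix layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Gaussian RBF kernel between two vectors."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def cp_tensor_kernel(Xf, Yf, gamma=1.0):
    """Kernel between two CP tensors given as lists of factor matrices.

    Xf[j] has shape (I_j, r): column k is the mode-j vector of the k-th
    rank-one component. The kernel sums, over all pairs of rank-one
    components, the product of mode-wise RBF kernels.
    """
    r_x, r_y = Xf[0].shape[1], Yf[0].shape[1]
    total = 0.0
    for k in range(r_x):
        for l in range(r_y):
            prod = 1.0
            for Xj, Yj in zip(Xf, Yf):       # product over the d modes
                prod *= rbf(Xj[:, k], Yj[:, l], gamma)
            total += prod
    return total

rng = np.random.default_rng(0)
Xf = [rng.normal(size=(4, 2)), rng.normal(size=(5, 2))]   # 4 x 5 tensor, rank 2
Yf = [rng.normal(size=(4, 2)), rng.normal(size=(5, 2))]
k_xy = cp_tensor_kernel(Xf, Yf)
```

Note the kernel is symmetric, and for a rank-one tensor paired with itself every mode-wise RBF factor equals 1.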
A direct application of the above procedure can be computationally difficult and expensive, since the tensor components may belong to high-dimensional spaces. Also, as the dimension of each mode increases, it becomes harder for the kernel function to scale and to provide an accurate similarity measurement between tensor components. Thus, we apply a random projection to the tensors before calculating the kernels.
3.3 Support Tensor Machine with Random Projection
For a single tensor X = sum_{k=1}^{r} x_1^(k) ∘ ... ∘ x_d^(k) in its CP representation, we define the random projection of the tensor as

RP(X) = sum_{k=1}^{r} (R_1 x_1^(k)) ∘ ... ∘ (R_d x_d^(k)),     (5)

where the R_j are p_j x I_j matrices whose elements are independently and identically distributed normal variables. Each component of the tensor CP expansion is thus projected to a lower-dimensional space via a linear map. We use RP(T_n) to denote the training set with its tensors randomly projected to the lower-dimensional space. The support tensor machine is then trained on the projected data instead of the original data. In general, the estimation procedure is expected to be much more efficient and faster, as we assume the p_j are much smaller than the I_j.
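A minimal sketch of the mode-wise projection in (5), assuming each mode-j factor matrix stores its component vectors as columns; the N(0, 1/p_j) scaling is our illustrative choice.

```python
import numpy as np

def project_cp_factors(factors, proj_dims, rng=None):
    """Project each mode of a CP tensor to a lower dimension.

    factors[j] has shape (I_j, r); the j-th projection matrix R_j has
    shape (p_j, I_j) with i.i.d. N(0, 1/p_j) entries, so distances along
    each mode are preserved in expectation.
    """
    rng = np.random.default_rng(rng)
    projected = []
    for f, p in zip(factors, proj_dims):
        I = f.shape[0]
        R = rng.normal(scale=1.0 / np.sqrt(p), size=(p, I))
        projected.append(R @ f)     # projects all r columns of the factor at once
    return projected

# A 100 x 200 x 50 rank-3 tensor shrinks to a 10 x 10 x 10 one.
factors = [np.ones((100, 3)), np.ones((200, 3)), np.ones((50, 3))]
small = project_cp_factors(factors, proj_dims=[10, 10, 10], rng=0)
```

The kernel in the previous subsection is then evaluated on the projected factor matrices.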
With the projected tensor data and projection matrices, the original STM model (3) becomes
f(X) = sum_{i=1}^{n} α_i K(RP(X_i), RP(X)).     (6)
However, this is not the optimal STM for the projected training data. The optimal solution in the projected data space becomes
f_RP(X) = sum_{i=1}^{n} α̂_i K(RP(X_i), RP(X)),     (7)

where the coefficients α̂_i are estimated from the projected training set.
In binary classification problems, the decision rule is sign(f_RP(X)), where f_RP is the estimated function in the form of (7) with an appropriate choice of kernel function. Choices of kernel functions are discussed in detail in Li and Maiti (2019); we use the Gaussian RBF kernel in this work. We call the model (7) the Random Projection based Support Tensor Machine (RPSTM).
3.4 Tensor Ensemble Classifier (TEC)
While random projection provides extra efficiency by transforming tensor components into a lower dimension, there is no guarantee that the projected data preserve the same margin for every single random projection. Thus, the robustness of model estimation with a single random projection is difficult to maintain. The proposition in Section 5 shows that the classifier estimated with projected tensors only retains its discriminative power on average over multiple random projections. As a result, we aggregate several RPSTMs into a Tensor Ensemble Classifier (TEC) in order to obtain a statistically consistent, robust, and computationally efficient classifier. Let
ĝ_1(X), ĝ_2(X), ..., ĝ_m(X)     (8)
be the base classifiers, where m is the number of different RPSTM classifiers. The ensemble classifier is then defined as
g_TEC(X) = 1 if (1/m) sum_{b=1}^{m} 1{ĝ_b(X) = 1} > α, and -1 otherwise,     (9)
which is a voting classifier, with α a thresholding parameter. In the next section, we present the algorithm for model estimation.
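The voting rule can be sketched as follows, assuming each base RPSTM outputs labels in {-1, +1} and alpha thresholds the average vote; this is an illustration, not the released implementation.

```python
import numpy as np

def tec_predict(base_preds, alpha=0.0):
    """Majority-vote ensemble over m base RPSTM label predictions.

    base_preds has shape (m, n) with entries in {-1, +1}; a test point
    is labeled +1 when the average vote exceeds the threshold alpha.
    """
    avg = np.mean(base_preds, axis=0)
    return np.where(avg > alpha, 1, -1)

# Three base classifiers voting on three test points.
votes = np.array([[1, -1,  1],
                  [1, -1, -1],
                  [1,  1, -1]])
labels = tec_predict(votes)   # unanimous, unanimous-against, 2-of-3 against
```

Here `labels` is `[1, -1, -1]`: the second and third points lose the vote at the default threshold alpha = 0.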
4 Model Estimation
Since each base classifier in the ensemble model TEC is estimated independently, we only provide details for learning a single RPSTM. We estimate our model by adopting the optimization procedure from Chapelle (2007). This is a gradient-based procedure that solves the problem in the primal; it handles support vector machine type problems efficiently without transforming them into the dual form.
Let K denote the kernel function for tensors, and let R_1, ..., R_d be the random projection matrices for each mode, rescaled via a diagonal normalization so that each has orthonormal rows. The explicit form of the classification function can be derived as

f(X) = sum_{i=1}^{n} α_i K(RP(X_i), RP(X)),     (10)

where r is the CP rank of the tensors and the projected vectors R_j x_j^(k) are components of the CP decompositions of the corresponding training and testing tensors. In real data applications, we may need to perform a CP decomposition first if the data do not come in CP form. Plugging the solution (10) into the objective function (2), we get
L(α) = λ α^T K α + (1/n) sum_{i=1}^{n} L(y_i, K_i^T α),

where K is the kernel Gram matrix on the projected training tensors and K_i is its i-th column. The derivative with respect to α is
∇(α) = 2 λ K α + (1/n) sum_{i=1}^{n} K_i ∂L(y_i, K_i^T α)/∂f.     (11)
Setting equation (11) to zero, we solve for α with Newton's method. In our application, we take the squared hinge loss, L(y, f(X)) = max(0, 1 - y f(X))^2. The explicit form of equation (11), denoted ∇(α), is

∇(α) = 2 λ K α + (2/n) K I^0 (K α - y),

where y is the vector of labels in the training data and I^0 is a diagonal matrix whose i-th diagonal entry is 1 if y_i (K α)_i < 1 and 0 otherwise. The Hessian, which is the derivative of ∇(α), is

H = 2 λ K + (2/n) K I^0 K.
Newton's method updates α by subtracting H^{-1} ∇(α) at each iteration. Thus, we obtain the update rule

α^(t+1) = α^(t) - H^{-1} ∇(α^(t)).     (12)
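The Newton iteration above can be sketched for a generic Gram matrix as follows. This is a minimal illustration of the primal approach of Chapelle (2007) with squared hinge loss; the 1/n factor is absorbed into the regularization parameter, and the tiny ridge added to the Hessian is a numerical safeguard of ours.

```python
import numpy as np

def primal_ksvm_newton(K, y, lam=0.1, n_iter=50):
    """Newton's method for a kernel SVM in the primal with squared hinge loss.

    Minimizes  lam * a^T K a + sum_i max(0, 1 - y_i (K a)_i)^2  over a,
    where K is the n x n kernel Gram matrix.
    """
    n = K.shape[0]
    a = np.zeros(n)
    for _ in range(n_iter):
        f = K @ a
        sv = (y * f) < 1.0                            # margin violators (I^0)
        grad = 2 * lam * f + 2 * K @ (sv * (f - y))   # gradient of objective
        H = 2 * lam * K + 2 * K @ (sv[:, None] * K)   # Hessian: 2*lam*K + 2*K I^0 K
        step = np.linalg.solve(H + 1e-8 * np.eye(n), grad)
        a -= step                                      # Newton update (12)
        if np.linalg.norm(step) < 1e-10:
            break
    return a

# Tiny demonstration: two well-separated Gaussian clusters, RBF Gram matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.3, size=(10, 2)),
               rng.normal(-2, 0.3, size=(10, 2))])
y = np.array([1.0] * 10 + [-1.0] * 10)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)
alpha = primal_ksvm_newton(K, y)
preds = np.sign(K @ alpha)
```

On this toy problem the Newton iteration converges in a handful of steps and classifies the training points correctly.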
The algorithms for training and prediction are described in Algorithm 1 and Algorithm 2, respectively:
Note that we use some R-style notation in the algorithms to specify rows and columns of matrices and elements of vectors. For example, K[, i] stands for the i-th column of the matrix K, and v[j] denotes the j-th element of the vector v.
In Algorithm 1, the time complexity of the training process is much smaller than the complexity of any vectorized method. Adding the cost of the random projection, a single RPSTM still has a total computational complexity that can be smaller than that of the pure SVM.
Moreover, the choices of the projected dimensions p_j are free of the original tensor dimensions. Our theoretical analysis reveals that the p_j are related to the sample size n instead of the dimensions I_j. Thus, applications with large tensors can enjoy the benefit of random projection. Notice that, although the ensemble classifier has to repeat the estimation many times, these repetitions are independent of each other, so we can run them in parallel without adding extra computation time.
4.1 Choice of Parameters
We end this section with some empirical suggestions for the tuning parameters used in the estimation procedure. The number of ensemble classifiers, m, and the threshold parameter, α, are chosen by cross-validation. We first let α = 0, the midpoint of the two labels -1 and 1. Then we search for m in a reasonable range, between 2 and 20; the optimal m is the one providing the best classification accuracy. In the next step, we fix m and search for α between -1 and 1 with step size 0.1, choosing the value with the best classification accuracy.
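The two-stage search just described can be sketched generically; `cv_accuracy` is a hypothetical callback returning cross-validated accuracy for a given (m, alpha) pair, not a function from the paper's codebase.

```python
import numpy as np

def tune_m_alpha(cv_accuracy, m_grid=range(2, 21), alpha_grid=None):
    """Two-stage grid search: first pick m with alpha fixed at 0
    (the midpoint of the labels -1 and 1), then pick alpha.

    `cv_accuracy(m, alpha)` is a hypothetical scoring callback assumed to
    return cross-validated accuracy for the given parameter pair.
    """
    if alpha_grid is None:
        alpha_grid = np.arange(-1.0, 1.0 + 1e-9, 0.1)   # step size 0.1
    best_m = max(m_grid, key=lambda m: cv_accuracy(m, 0.0))
    best_alpha = max(alpha_grid, key=lambda a: cv_accuracy(best_m, a))
    return best_m, best_alpha

# Synthetic illustration: an accuracy surface peaked at m = 7, alpha = 0.3.
demo = lambda m, a: -abs(m - 7) - abs(a - 0.3)
best_m, best_alpha = tune_m_alpha(demo)
```

On the synthetic surface the search recovers m = 7 and alpha near 0.3, the grid point closest to the peak.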
The choice of random projection matrices is more complicated. Although we can generate random matrices from the standard Gaussian distribution, the dimensions of the matrices remain unclear. Our guideline, the JL lemma, only provides a lower bound for the dimension, and only for the vector situation. As a result, in practice we choose the dimension based on intuition and cross-validation results. Empirically, we choose the projection dimension
for each mode.

5 Statistical Property of TEC Ensemble
In this section, we quantify the uncertainty of the model prediction and provide convergence results supporting the generalization ability of our model. Before the discussion, we introduce a few more notations and highlight their differences.
Let L be a loss function for classification problems. In general, the problem is to search for a function f in F that minimizes the risk of misclassification. F is the collection of all measurable functions mapping tensors to scalars. The classification risk of a specific decision rule f is defined as

R(f) = E[L(y, f(X))],

where the expectation is over the data distribution. When the optimal decision rule f* is obtained, R(f*) becomes the minimum risk that a decision rule can reach over the data distribution. It is called the Bayes risk in the literature and will be denoted by R* in what follows. We say a classifier f̂ is statistically consistent if R(f̂) converges to R*.
With finite training data T_n, however, we cannot estimate such an optimal decision rule. Under most circumstances, we minimize the regularized empirical risk over a predefined collection of classification functions, a procedure called Empirical Risk Minimization (ERM). The empirical risk of a decision rule f estimated on T_n is defined by

R_n(f) = (1/n) sum_{i=1}^{n} L(y_i, f(X_i)),

where we assume the probability mass is uniformly distributed on the points of the training set. Under random projection of the training data, the empirical risk on the projected training set is

R_n^RP(f) = (1/n) sum_{i=1}^{n} L(y_i, f(RP(X_i))),

where f is our decision function. Let H be the reproducing kernel Hilbert space (RKHS) generated by the tensor kernel functions. By minimizing the regularized empirical risk, we obtain two functions,
which are the optimal decision rules conditional on the training data T_n and its projection RP(T_n), respectively. Ideally, we assume the best-in-class decision rules are independent of the training data, i.e.
Lastly, suppressing some notational dependence, we denote our ensemble classifier as
where each base classifier is optimal conditional on its random projection matrices, and 1{·} is an indicator function.
Notice that both the RPSTM classifier and the TEC are randomized classifiers: their performance and prediction risk also depend on the random projection. As a result, we say these randomized classifiers are statistically consistent if their expected risks converge to the Bayes risk, where the expectation is taken over the distribution of the random projection matrices. We establish these two results in this section.
5.1 Risk of Ensemble Classifier
We first bound the expected risk of our TEC classifier by using Theorem 2 of Cannings and Samworth (2017).
Theorem 1.
For each ,
This result says that the ensemble model TEC is statistically consistent as long as the base classifier, RPSTM, is consistent. Hence, we present an analysis in the following part showing the consistency of RPSTM and its convergence rate.
5.2 Excess Risk of Single Classifier
For RPSTM to be statistically consistent, we expect its excess risk to converge to zero. The following proposition bounds the excess risk and indicates the direction of convergence for the model risk; it involves a function of the tuning parameter described under the assumptions a few paragraphs below.
Proposition 1.
The excess risk is bounded above:
(13) 
The bound is proved in Appendix A.3. The notations refer to functions (7), (3), and (6) in Section 3. Classifiers superscripted by the norm bound denote norm-restricted classifiers. For a given random projection, the first represents the norm-restricted optimal SVM classifier on the projected training data, and the second the norm-restricted optimal SVM classifier on the original training data. For a given random projection characterized by its projection matrices, the third represents a norm-restricted classifier on the projected training data, constructed directly from the original one using (6). The last denotes the norm-restricted oracle optimal SVM classifier.
As the left side of equation (13) is nonnegative, we only have to show that the right side converges to zero. The bounded convergence theorem can then be applied to obtain convergence in expectation. The convergence of the excess risk and its rate demonstrate the generalization ability of the classifier.
In the following part, we assume all the conditions listed below hold:


The loss function L is locally Lipschitz continuous, in the sense that |L(y, t) - L(y, t')| ≤ C |t - t'| for arguments t, t' in a bounded set. In addition, we need L(y, 0) to be bounded.

Assume .

Each component kernel k_j in the tensor kernel function is Lipschitz continuous, i.e., its value changes by at most a constant multiple of the change in the distance between its arguments.

All the random projection matrices have their elements independently and identically distributed Gaussian variables, the j-th matrix having dimension p_j x I_j. For given error and confidence parameters, assume that for each fixed mode the projected dimension is large enough for the concentration inequality to hold.

The hyperparameter in the regularization term satisfies:

For all the tensor data , assume .

The Bayes risk remains unaltered for all randomly projected data.

Let H be a reproducing kernel Hilbert space (RKHS). We assume there always exists a function in H that minimizes the risk while having bounded kernel norm. Equivalently, it minimizes the Lagrangian, i.e., the sum of the classification error and the kernel norm penalty characterized by λ. We thus define the minimum achievable norm-penalized classification risk, attained by this function. We further assume a polynomial relation between this minimum and the tuning parameter λ. We refer to Definition 5.14 and Section 5.4 in Steinwart and Christmann (2008) for further reading.

The projection error ratio for each mode vanishes at a rate depending on the loss function, with different rates for the hinge loss and the squared hinge loss.
Assumptions of this type are commonly used in supervised learning problems with kernel tricks (see, e.g., Vapnik (2013), LaConte et al. (2005)). Assumption 2 is a form of universal kernel, which makes it possible to approximate any continuous function defined on a compact set by a function of the form (3). The universal approximating property for tensor kernels was first developed in Li and Maiti (2019). With a universal tensor kernel function, any continuous function over a compact metric set can be approximated in the supremum norm by functions of the form (3). We further assume that our tensor features take values in this compact set only.
Assumption 8 ensures that the Bayes risk remains unchanged under random projection. This assumption has also been made in Cannings and Samworth (2017), where it is assumed that there is a random projection under which the symmetric difference between the original Bayes decision region and the Bayes decision region of the projected data has probability zero. The corresponding notation in Cannings and Samworth (2017) represents the Bayes classifier and differs from our notation in (9). Their assumption states that, in probability, there is no feature-level set on which the original Bayes decision half-plane and the Bayes decision half-plane corresponding to the projection disagree. Thus, the Bayes risk remains unchanged under any random projection. Assumption 5 is a sufficient condition ensuring the preservation of regular (Euclidean) distances between tensors and their projections; such a statement is known as a concentration inequality, and the corresponding lemma for vector data is the JL lemma (Dasgupta and Gupta (2003)). It also indicates how to select the dimensions of the random projections so that the projection error is controlled with a given threshold probability. Assumption 9 refers to the rate at which the minimum risk attained by a function with bounded kernel norm converges to the Bayes risk; the rate is expressed in terms of the kernel norm bound. It also implies universal consistency (Chen and Li (2014)). This rate is generally satisfied by a broad class of data distributions that assign low density to data near the Bayes decision boundary (Steinwart and Christmann (2008), Theorem 8.18). Assumption 4 states that the kernel inner product between two data points can be expressed as a Lipschitz function of the distance between them. This connects the kernel norm and the regular norm defined on tensors, and is therefore critical for establishing an approximate kernel-norm isometry for projected tensors, which holds with large probability for d-mode tensors. Assumption 10 declares that the projection error vanishes relative to the kernel norm tuning parameter as the sample size increases. It is worth noting that the rate should differ according to the loss function (Chen and Li (2014)).
5.2.1 Price for Random Projection
Applying random projection in the training procedure indeed trades prediction accuracy for computational cost. Although the Johnson-Lindenstrauss lemma (see, e.g., Johnson and Lindenstrauss (1984) and Dasgupta and Gupta (1999)) indicates an approximate isometry for the projected data in terms of kernel values, the decision rule may not be exactly the same as the one estimated from the original data. Thus, it is necessary to study the asymptotic difference between the two.
We first drop the expectation and present a high-probability result; convergence in expectation is established later.
Proposition 2.
Assume the projection matrices are generated independently and identically from a Gaussian distribution. Under assumptions 1, 4, 5, 6, and 7, and with the projected dimensions described in Assumption 5, the stated bound holds with large probability, with the rate differing between the hinge loss and the squared hinge loss; here n is the size of the training set and d is the number of modes of the tensor.
The proof of this proposition is provided in Appendix A.4. The constant depends on the loss function as well as the kernel and the geometric configuration of the data, as discussed in the appendix. This proposition highlights the trade-off between dimension reduction and prediction risk: since the reduced dimension enters the bound negatively, a small projected dimension can make the corresponding term converge very slowly.
5.2.2 Convergence of Risk
Now we provide a result which establishes the convergence for the risk of RPSTM classifier and reveals the rate of convergence.
Theorem 2 (Convergence rate).
Given a training set T_n, we obtain a support tensor machine classifier by minimizing the regularized empirical risk (2). Let R* be the Bayes risk of the problem, and suppose assumptions 1-9 hold. For projected dimensions chosen as in the assumptions, the generalization error of RPSTM is bounded with probability at least the stated level, i.e.,
(14) 
where
The proof is provided in Appendix A.7. The probability term is the contribution of the random projection. We notice that in the expression above, the upper bound on the risk difference depends on the probability parameter and the projection error, with the dependence expressed through the projected dimensions. The projection error, the probability parameter, and the projection dimension are related by an equation, so for fixed projected dimensions corresponding to a fixed sample size, the bound can alternatively be interpreted with the mode-wise projected dimensions as the error parameters instead of the mode-wise projection errors. In our case, the error parameter is the projection error. It is worth noting that the projected dimension grows with the sample size n in order to facilitate Assumption 10. Assuming the projected dimensions are uniform across all modes, the above risk difference is, with the stated probability, bounded by a function of n.
The convergence rate differs across loss functions. Since we propose using the squared hinge loss for model estimation in Section 4, the specific rate for the squared hinge loss is presented here; the rate for the ordinary hinge loss is given in Appendix A.9.
Proposition 3.
For the squared hinge loss, with suitable choices of the tuning parameter and the projected dimensions, the following holds with large probability:
(15) 
The rate of convergence improves with the sample size when a high value of the exponent is chosen. The proof of this result is in Appendix A.8.
Proposition 4.
For the hinge loss, with suitable choices of the tuning parameter and the projected dimensions, the following holds with large probability:
(16) 
The rate of convergence improves with the sample size when a high value of the exponent is chosen. The proof of this result is in Appendix A.9.
Theorem 3.
The excess risk goes to zero as the sample size increases, where the expectation is taken with respect to the tensor random projection.
This expected-risk convergence builds on our previous results; the proof is provided in Appendix A.10. The theorem concludes that the expected risk of RPSTM converges to the optimal Bayes risk. As a result, the RPSTM, as well as our ensemble model TEC, is statistically consistent.
6 Simulation Study
In this section, we present an extensive simulation study with data generated from different models. Several popular tensor-based and traditional vector-based classification algorithms are included for comparison. For the traditional techniques, we consider linear discriminant analysis (LDA), random forest (RF), and support vector machine (SVM) (e.g., see Friedman et al. (2001)). For tensor-based classification, we select MPCA from Lu et al. (2008), DGTDA from Li and Schonfeld (2014), and the CATCH model from Pan et al. (2018). The classification error rate and computation time are reported for comparison.
Synthetic tensors are generated from different types of tensor models introduced in Kolda and Bader (2009). All the models are described in detail below. For simplicity, we use separate symbols to denote data from the two different classes.

F1 Model: Low-dimensional rank-1 tensor factor model with each component conforming to the same distribution. Shape of tensors is .

F2 Model: High-dimensional rank-1 tensor with a normal distribution in each component. Shape of tensors is .

F3 Model: High-dimensional rank-3 tensor factor model. Components conform to different Gaussian distributions. Shape of tensors is .