## 1 Introduction

Supervised learning tasks are encountered in many fields, including pattern recognition, image processing and data mining. To handle such tasks, data are usually represented as feature vectors. Among the algorithms based on the vector framework, the Support Vector Machine (SVM) (Vapnik, 1995) is the most representative one owing to numerous theoretical and computational developments. Later, the support vector method was extended to improve its performance in many applications. Radial basis function classifiers were introduced in SVM to solve nonlinearly separable problems (Scholkopf et al, 1997). The use of SVM for density estimation (Weston et al, 1997) and ANOVA decomposition (Stitson et al, 1997) has also been studied. Least squares SVM (LS-SVM) (Suykens and Vandewalle, 1999) uses equality constraints in the optimization problem, so that a set of linear equations is solved instead of a quadratic program. Transductive SVM (TSVM) (Joachims, 1999) tries to minimize the misclassification error of a particular test set. ν-SVM (Schölkopf et al, 2000) includes a new parameter to effectively control the number of support vectors for both regression and classification. One-Class SVM (OCSVM) (Schölkopf et al, 2001) aims to identify one available class when characterizing the other classes is either expensive or difficult. Twin SVM (TWSVM) (Khemchandani et al, 2007) is a fast algorithm that solves two quadratic programming problems of smaller size instead of the single large one in classical SVM.

However, it is more natural to represent real-world data as matrices or higher-order tensors. Within the last decade, considerable research has been devoted to retaining the structure of tensor data and extending SVM to tensor patterns. Tao et al. proposed a Supervised Tensor Learning (STL) framework to address tensor problems (Tao et al, 2005). Under this framework, the Support Tensor Machine (STM) was studied, which finds multilinear separating hyperplanes by applying alternating projection methods (Cai et al, 2006). Tao et al (2007) extended the classical linear C-SVM (Cortes and Vapnik, 1995), ν-SVM and least squares SVM (Suykens and Vandewalle, 1999) to general tensor patterns. One-Class STM (OCSTM) was generalized to obtain the most interesting tensor class with maximal margin (Chen et al, 2016; Erfani et al, 2016). Joint tensor feature analysis (JTFA) was proposed for tensor feature extraction and recognition by Wong et al (2015). Support Higher-order Tensor Machine (SHTM) (Hao et al, 2013) integrates the merits of linear C-SVM and rank-one decomposition. Kernel methods for tensors were also introduced for nonlinear cases. The factor kernel (Signoretto et al, 2011) calculates the similarity between tensors using tensor unfolding and singular value decomposition (SVD). Dual structure-preserving kernels (DuSK) (He et al, 2014) generalize SHTM with dual-tensorial mapping functions to detect dependencies of nonlinear tensors. Support matrix machines (SMM) (Luo et al, 2015) solve a convex optimization problem consisting of a hinge loss plus a so-called spectral elastic net penalty. These methods essentially take advantage of the low-rank assumption, which can be used for describing the correlation within a tensor.

In this paper we are concerned with classification problems on a set of matrix data. We present a kernel support matrix machine (KSMM), motivated by the use of the matrix Hilbert space (Ye, 2017). Its cornerstone is the introduction of a matrix as the inner product to capture the complicated relationship among samples. KSMM is a general framework for constructing a matrix-based hyperplane through calculating the weighted average distance between data and multiple hyperplanes. We analyze a unifying optimization problem for which we propose an asymptotically convergent algorithm built on the Sequential Minimal Optimization (SMO) algorithm (Platt, 1999). Generalization bounds of SVM were discussed based on Rademacher complexity with respect to a probability distribution (Shalev-Shwartz and Ben-David, 2014); here we extend their definitions to a more general and flexible framework.

The contributions of this paper are as follows. One main contribution is a new classifier for matrix learning in which the optimization problem is solved directly, without the alternating projection method used in STL. Important special cases of the framework include SVM classifiers. Another contribution is the matrix-based hyperplane that we propose to separate objects, instead of determining multiple hyperplanes as in STM.

The rest of this paper is organized as follows. In Sect. 2.1, we discuss the framework of the kernel support matrix machine in the linear case. We show its dual problem and present a template algorithm to solve this problem. In Sect. 2.2 we extend it to the nonlinear task by adopting the methodology of reproducing kernels. Sect. 2.3 deals with the generalization bounds based on Rademacher complexity with respect to a probability distribution. Differences among several classifiers are discussed in Sect. 2.4. In Sect. 3 we study our model’s performance on both a simulation study and benchmark datasets. Finally, concluding remarks are drawn in Sect. 4.

## 2 Kernel Support Matrix Machine

In this section, we put forward the Kernel Support Matrix Machine (KSMM), which makes a close connection between the matrix Hilbert space (Ye, 2017) and supervised tensor learning (STL). We construct a hyperplane in the matrix Hilbert space to separate two classes of examples. Then, the SMO algorithm is introduced to handle the new optimization problem. Next, we derive the generalization bounds for KSMM based on Rademacher complexity with respect to a probability distribution. Finally, we analyze and compare the differences between KSMM and other state-of-the-art methodologies.

### 2.1 Kernel Support Matrix Machine in linear case

We first introduce some basic notations and definitions. In this study, scalars are denoted by lowercase letters, e.g., s, vectors by boldface lowercase letters, e.g., v, matrices by boldface capital letters, e.g., M, and general sets or spaces by gothic letters.

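The Frobenius norm of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ with entries $x_{ij}$ is defined by

$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2},$$

which is a generalization of the Euclidean norm for vectors.

The inner product of two same-sized matrices $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{m \times n}$ is defined as the sum of products of their entries, i.e.,

$$\langle \mathbf{X}, \mathbf{Y} \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij} y_{ij}.$$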
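As a quick numerical illustration of these two definitions, the following sketch computes the entrywise inner product and the Frobenius norm for small matrices stored as nested Python lists. The helper names `frobenius_inner` and `frobenius_norm` are illustrative and do not come from the paper.

```python
import math


def frobenius_inner(X, Y):
    """Sum of entrywise products of two same-sized matrices (lists of rows)."""
    if len(X) != len(Y) or any(len(rx) != len(ry) for rx, ry in zip(X, Y)):
        raise ValueError("matrices must have the same shape")
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def frobenius_norm(X):
    """Frobenius norm, i.e. the square root of the inner product of X with itself."""
    return math.sqrt(frobenius_inner(X, X))


X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[0.5, 0.0], [0.0, 2.0]]

inner_xy = frobenius_inner(X, Y)            # 1*0.5 + 2*0 + 3*0 + 4*2
norm_x = frobenius_norm(X)                  # sqrt(1 + 4 + 9 + 16)
norm_row_vector = frobenius_norm([[3.0, 4.0]])  # a 1 x 2 matrix behaves like a vector
```

For a matrix with a single row or column, the Frobenius norm coincides with the Euclidean norm of the corresponding vector, which is the sense in which it generalizes the vector norm.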

Inspired by the previous work, we introduce the matrix inner product to the framework of STM in matrix space. The matrix inner product is defined as follows:

###### Definition 1 (Matrix Inner Product)

Let be a real linear space, the matrix inner product is a mapping satisfying the following properties, for all

(1)

(2)

(3) if and only if X is a null matrix

(4) is positive semidefinite.
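To make the definition more concrete, the sketch below uses one natural candidate for a matrix-valued inner product, namely the product of the transpose of X with Y. This choice, and all helper names, are assumptions made for illustration; the definition itself only lists abstract properties. The code checks numerically that the candidate maps the null matrix to the null matrix and that its value on a pair (X, X) is positive semidefinite, and it also records the transpose symmetry that this particular candidate happens to have.

```python
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def matrix_inner(X, Y):
    """Hypothetical matrix inner product: X^T Y, an n x n matrix for m x n inputs."""
    return matmul(transpose(X), Y)


def quad_form(M, v):
    """Quadratic form v^T M v."""
    size = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(size) for j in range(size))


rng = random.Random(0)
m, n = 4, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
Y = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
zero = [[0.0] * n for _ in range(m)]

G_xy = matrix_inner(X, Y)
G_yx_t = transpose(matrix_inner(Y, X))   # equals G_xy for this candidate
G_xx = matrix_inner(X, X)
G_00 = matrix_inner(zero, zero)

# v^T (X^T X) v = ||X v||^2 >= 0, so every sampled quadratic form is non-negative.
psd_checks = [
    quad_form(G_xx, [rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(100)
]
```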

This thus motivates us to reformulate the optimization problem in STM. Considering a set of samples for binary classification problem, where are input matrix data and are corresponding class labels. We assume that and are in the matrix Hilbert space , is a symmetric matrix satisfying:

(1) |

for all . In particular, the problem of KSMM can be described in the following way:

(2) |

where the inner product is specified as for and the norm is defined as . is the vector of all slack variables of training examples and is the trade-off between the classification margin and misclassification error.

###### Remark 1

The proposed problem (2) degenerates into the classical SVM when .

###### Remark 2

Two classes of labels are separated by a hyperplane . The expression can be decomposed into two parts: one is controlled by normal matrix W while the other is constrained by weight matrix V. Each entry of the matrix inner product measures a relative “distance” from X to a certain hyperplane. To make explicit those values underlying their own behavior, we introduce a weight matrix V which determines the relative importance of each hyperplane on the average.

Once the model has been solved, the class label of a testing example X can be predicted as follows:

(3) |
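Remark 2 describes the decision value as a combination of the entries of the matrix inner product between W and X, weighted by V. The sketch below illustrates one reading of that description. The concrete inner product (transpose of W times X), the unnormalized weighting by V, and the names `weighted_score` and `predict` are assumptions made for illustration, not the paper's exact formula.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def weighted_score(W, X, V, b):
    """Assumed decision value: sum_ij V[i][j] * (W^T X)[i][j] + b."""
    M = matmul(transpose(W), X)
    return sum(v * m for rv, rm in zip(V, M) for v, m in zip(rv, rm)) + b


def predict(W, X, V, b):
    """Predicted label in {-1, +1} from the sign of the assumed decision value."""
    return 1 if weighted_score(W, X, V, b) >= 0 else -1


# Matrix case: two "hyperplanes" weighted equally through the diagonal of V.
W = [[1.0, 0.0], [0.0, -1.0]]
X = [[2.0, 1.0], [1.0, 3.0]]
V = [[0.5, 0.0], [0.0, 0.5]]
score_matrix = weighted_score(W, X, V, 0.25)

# Vector case (one column, V = [[1]]): reduces to the usual w . x + b.
w_col = [[1.0], [2.0]]
x_col = [[3.0], [-1.0]]
score_vector = weighted_score(w_col, x_col, [[1.0]], 0.0)
```

Under these assumptions the single-column case collapses to the familiar linear SVM score, which is consistent with Remark 1.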

The Lagrangian function of the optimization problem (2) is

(4) |

Setting the partial derivatives of the Lagrangian with respect to W, b, and V to zero, we have

(5) |

where is a positive real number. Substituting (5) into (4) yields the dual of the optimization problem (2) as follows:

(6) |

where are the Lagrange multipliers.

Notice that for all , we have

(7) |

which indicates that the matrix V derived from the Lagrange multiplier method satisfies condition (1).

Furthermore, the Karush-Kuhn-Tucker (KKT) conditions are fulfilled when the optimization problem (4) is solved, that is for all :

(8) |

where . Next, we summarize and adapt the SMO algorithm to solve the optimization problem (6). At each step, SMO chooses two Lagrange multipliers to jointly optimize the objective function while the other multipliers are fixed, which can be computed as follows:

(9) |

For convenience, all quantities that refer to the first multiplier have a subscript 1, while all quantities that refer to the second multiplier have a subscript 2. Without loss of generality, the algorithm computes the second Lagrange multiplier and then updates the first Lagrange multiplier at each step. Noticing that , we can rewrite (9) in terms of as:

where and .

Computing the first and second partial derivatives of the objective function, we obtain

(10) |

We can easily derive that

The second inequality holds by the Cauchy-Schwarz inequality. Hence the second partial derivative of the objective function is non-positive, and the constrained maximum of the objective function is located either at the bounds or at the stationary point.

On the other hand, setting the first partial derivative to zero yields an equation of the sixth degree, which has no closed-form solution. Therefore, Newton’s method is applied to iteratively find the optimal value of the second multiplier. At each step, we update it as:

(11) |

until it converges to .
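The Newton iteration described here can be sketched for a generic one-dimensional concave objective. The toy derivative below is a hypothetical polynomial chosen only so that its root has no closed form; it is not the KSMM objective, and the stopping rule (step size below a tolerance) is an assumption about the inner-loop criterion.

```python
def newton_maximize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton iteration x <- x - grad(x) / hess(x) for a 1-D concave objective."""
    x = x0
    for _ in range(max_iter):
        h = hess(x)
        if h == 0.0:          # flat curvature: the Newton step is undefined
            break
        x_new = x - grad(x) / h
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Toy concave objective g(a) = a - a^2/2 - a^6/6 (illustrative only).
def toy_grad(a):
    return 1.0 - a - a ** 5


def toy_hess(a):
    return -1.0 - 5.0 * a ** 4   # always negative, so g is concave


a_star = newton_maximize(toy_grad, toy_hess, x0=0.0)
```

Because the second derivative is strictly negative here, the stationary point found by the iteration is the unconstrained maximum, which is then subject to clipping in the SMO step.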

Recall that the two Lagrange multipliers must fulfill all of the constraints of problem (4), so that the lower bound and the upper bound of can be derived as follows for labels :

(12) |

If labels , then the following bounds apply to :

(13) |

Next, the constrained maximum is found by clipping the unconstrained maximum to the bounds of the domain:

(14) |

Then the value of is calculated from the new, clipped :

(15) |
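The bounding, clipping and first-multiplier update can be sketched with the box bounds of the classical SMO algorithm (Platt, 1999). It is an assumption that the same bounds carry over unchanged to KSMM; the function names are illustrative.

```python
def smo_bounds(alpha1, alpha2, y1, y2, C):
    """Feasible interval [low, high] for the second multiplier (classical SMO)."""
    if y1 != y2:
        low = max(0.0, alpha2 - alpha1)
        high = min(C, C + alpha2 - alpha1)
    else:
        low = max(0.0, alpha1 + alpha2 - C)
        high = min(C, alpha1 + alpha2)
    return low, high


def update_pair(alpha1, alpha2, y1, y2, C, alpha2_unclipped):
    """Clip the unconstrained optimum, then move alpha1 to keep the equality constraint."""
    low, high = smo_bounds(alpha1, alpha2, y1, y2, C)
    alpha2_new = max(low, min(high, alpha2_unclipped))
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)
    return alpha1_new, alpha2_new


# Example with opposite labels: the unconstrained value 1.5 is clipped to C = 1.
a1_new, a2_new = update_pair(0.3, 0.6, 1, -1, 1.0, 1.5)
```

The update of the first multiplier preserves the weighted sum of the two multipliers with their labels, so the linear equality constraint of the dual stays satisfied after each step.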

This process is repeated iteratively until the maximum number of outer loops M is reached or all of the Lagrange multipliers satisfy the KKT conditions. Typically, we terminate the inner loop of Newton’s method if , where is a threshold parameter.

Next we present the strategy for choosing the two Lagrange multipliers. When the algorithm iterates over the entire training set, the first example that violates the KKT condition (8) determines the first multiplier. Once a violating example is found, the second multiplier is chosen randomly, unlike in the classical SMO, because the closed form of the stationary point cannot be derived directly.
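The selection strategy can be sketched as follows. The KKT test used here is the one of the classical soft-margin SVM, expressed through the margin y_i f(X_i); whether condition (8) has exactly this form is an assumption, and the names `violates_kkt` and `pick_pair` are illustrative.

```python
import random


def violates_kkt(alpha, margin, C, tol=1e-3):
    """Classical soft-margin KKT check, where margin = y_i * f(X_i)."""
    if alpha < tol:                      # alpha = 0 requires margin >= 1
        return margin < 1.0 - tol
    if alpha > C - tol:                  # alpha = C requires margin <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol       # 0 < alpha < C requires margin = 1


def pick_pair(alphas, margins, C, rng):
    """First index violating KKT, paired with a uniformly random different index."""
    for i, (alpha, margin) in enumerate(zip(alphas, margins)):
        if violates_kkt(alpha, margin, C):
            j = rng.choice([k for k in range(len(alphas)) if k != i])
            return i, j
    return None   # every multiplier satisfies the conditions


rng = random.Random(0)
pair = pick_pair([0.0, 0.5, 1.0], [1.2, 0.7, 0.9], C=1.0, rng=rng)
no_pair = pick_pair([0.0, 0.5, 1.0], [1.2, 1.0, 0.9], C=1.0, rng=rng)
```

Choosing the second index at random is cheaper than the second-choice heuristic of the classical SMO, which relies on a closed-form step size that is not available here.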

Algorithm: Kernel Support Matrix Machine

Input: the training data set, the test data set, cost C, maximum number of outer loops M and threshold parameter

Output: the estimated label

Initialization.

while the outer stopping criterion is not satisfied do

Get the first multiplier, which violates condition (8)

Randomly pick the second multiplier

while the inner stopping criterion is not satisfied do

Update the second multiplier using (11)

end while

Clip the second multiplier using (14)

end while

Calculate the parameters of problem (2)

Like the SMO algorithm, we update the parameter using the following strategy: the parameter is updated when the new is not at the bounds, which forces the output to be .

The parameter is updated when the new is not at the bounds, which forces the output to be .

When both and are updated, they are equal. When both Lagrange multipliers are at the bounds, any number in the interval between and is consistent with the KKT conditions. We choose the threshold to be the average of and . The pseudo-code of the overall algorithm is listed above.
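The threshold rule just described can be written as a small selection function. Only the choice between the two candidate thresholds is sketched; how each candidate is computed depends on formulas not reproduced here, so `b1` and `b2` are taken as given inputs, and the function name is illustrative.

```python
def choose_threshold(b1, b2, alpha1_new, alpha2_new, C, eps=1e-12):
    """Pick the threshold from two candidates, following the SMO-style rule."""
    first_inside = eps < alpha1_new < C - eps    # first multiplier strictly inside (0, C)
    second_inside = eps < alpha2_new < C - eps   # second multiplier strictly inside (0, C)
    if first_inside:
        return b1
    if second_inside:
        return b2
    # Both multipliers at the bounds: any value between b1 and b2 is admissible,
    # and the midpoint is used.
    return 0.5 * (b1 + b2)


b_inside = choose_threshold(0.4, 0.9, alpha1_new=0.3, alpha2_new=1.0, C=1.0)
b_second = choose_threshold(0.4, 0.9, alpha1_new=0.0, alpha2_new=0.6, C=1.0)
b_bounds = choose_threshold(0.4, 0.9, alpha1_new=0.0, alpha2_new=1.0, C=1.0)
```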

The objective function increases at every step, so the algorithm converges asymptotically. Even though an extra Newton iteration is run in each step, the overall algorithm works efficiently.

### 2.2 Kernel Support Matrix Machine in nonlinear case

Kernel methods, often referred to as the “kernel trick”, were brought to the field of machine learning in the 20th century to overcome the difficulty of detecting certain dependencies in nonlinear problems. A Reproducing Kernel Matrix Hilbert Space (RKMHS) (Ye, 2017) was introduced to develop kernel theories in the matrix Hilbert space. In this section, we define a nonlinear mapping and apply these algorithms in our KSMM.

We start by defining the following mapping on a matrix .

(16) |

where is in a matrix Hilbert space . Naturally, the kernel function is defined as inner products of elements in the feature space:

(17) |

Further details of the structure of a RKMHS can be found in Ye (2017).

The revised SMO algorithm still applies in this setting. We emphasize that and . The following abbreviations are used to compute the first and second partial derivatives of the objective function in (10).

(19) |

Some possible choices of include

where . is the -th column of X and is the Hadamard product (Horn, 1990).

Additionally, if is an identity mapping with , the optimization problem degenerates into the linear one.
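As an illustration of what a matrix-valued kernel built from columns can look like, the sketch below fills entry (i, j) with a Gaussian similarity between the i-th column of X and the j-th column of Y. This particular form, the width parameter `sigma`, and the function names are assumptions for illustration and are not claimed to be one of the paper's listed choices.

```python
import math


def column(X, j):
    return [row[j] for row in X]


def gaussian_matrix_kernel(X, Y, sigma=1.0):
    """Hypothetical matrix kernel: K[i][j] = exp(-||X[:, i] - Y[:, j]||^2 / (2 sigma^2))."""
    K = []
    for i in range(len(X[0])):
        xi = column(X, i)
        row = []
        for j in range(len(Y[0])):
            yj = column(Y, j)
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, yj))
            row.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        K.append(row)
    return K


X = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
Y = [[0.5, 1.0], [2.0, 0.0], [1.0, 2.0]]

K_xy = gaussian_matrix_kernel(X, Y, sigma=2.0)
K_yx = gaussian_matrix_kernel(Y, X, sigma=2.0)
K_xx = gaussian_matrix_kernel(X, X, sigma=2.0)
```

With this choice the kernel value is itself a matrix, so it can play the role of a matrix inner product in the feature space.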

### 2.3 Generalization Bounds for KSMM

In this section, we use Rademacher complexity to obtain generalization bounds for soft-SVM and STM with Frobenius norm constraint. We will show how this leads to generalization bounds for KSMM.

To simplify the notation, we denote

where is a domain, is a hypothesis class and is a loss function. Given , we define

where is the distribution of elements in , is the training set and is the number of examples in .
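The quantities used in this subsection can be illustrated numerically. The sketch below computes an empirical risk as a sample average and gives a Monte Carlo estimate of the empirical Rademacher complexity of a small finite function class. The scalar hinge-loss class, the sample, and all names are hypothetical and serve only to show the computation.

```python
import random


def empirical_risk(f, sample):
    """Average of f over the training sample."""
    return sum(f(z) for z in sample) / len(sample)


def empirical_rademacher(functions, sample, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (1/N) * sum_i sigma_i * f(z_i) ]."""
    rng = random.Random(seed)
    n = len(sample)
    values = [[f(z) for z in sample] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, vals)) for vals in values)
    return total / (n_draws * n)


def hinge_loss(w):
    """Hinge loss of the scalar predictor x -> w * x on an example z = (x, y)."""
    return lambda z: max(0.0, 1.0 - z[1] * w * z[0])


data_rng = random.Random(1)
sample = [(data_rng.uniform(-1.0, 1.0), data_rng.choice((-1, 1))) for _ in range(30)]

small_class = [hinge_loss(0.5)]
large_class = [hinge_loss(w) for w in (-1.0, -0.5, 0.5, 1.0)]

risk = empirical_risk(hinge_loss(0.5), sample)
r_small = empirical_rademacher(small_class, sample)
r_large = empirical_rademacher(large_class, sample)
```

Because both estimates use the same seed and therefore the same sign vectors, enlarging the class can only increase the estimate, which mirrors the monotonicity of Rademacher complexity in the function class.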

We reuse the symbols and assumptions of Sect. 2.1. An STM problem in matrix space can be reformulated as:

(20) |

where and is the Frobenius norm.

Regarding a vector as a special case of a matrix whose number of rows or columns equals one, we restate the theorem from Shalev-Shwartz and Ben-David (2014) in the following way. It bounds the generalization error for SVM and STM (for matrix data) of all predictors in using their empirical error.

###### Theorem 2.1

Suppose that is a distribution over such that with probability 1 we have that . Let and let be a loss function of the form

such that for all , is a -Lipschitz function and . Then, for any , with probability of at least over the choice of an i.i.d. sample of size N,

(21) |

###### Remark 3

When or , the matrix transforms into a vector and its Frobenius norm is consistent with the corresponding Euclidean norm. The optimization problems in both classifiers are identical.
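To give a feeling for the size of such a bound, the sketch below evaluates the gap term of the vector-case result in Shalev-Shwartz and Ben-David (2014) for norm-bounded linear predictors, which has the form 2ρBR/√N + c√(2 ln(2/δ)/N). It is an assumption that (21) has this shape with the Frobenius norm in place of the Euclidean norm; the symbols B (norm bound on the predictor), R (norm bound on the inputs), ρ, c and δ follow the textbook, and the numeric values are illustrative.

```python
import math


def generalization_gap_bound(rho, B, R, c, delta, N):
    """Gap term 2*rho*B*R/sqrt(N) + c*sqrt(2*ln(2/delta)/N) of the textbook bound."""
    complexity_term = 2.0 * rho * B * R / math.sqrt(N)
    confidence_term = c * math.sqrt(2.0 * math.log(2.0 / delta) / N)
    return complexity_term + confidence_term


# Hinge loss: 1-Lipschitz, and bounded by 1 + B*R on the interval [-B*R, B*R].
B, R = 1.0, 1.0
rho, c = 1.0, 1.0 + B * R

gap_100 = generalization_gap_bound(rho, B, R, c, delta=0.05, N=100)
gap_400 = generalization_gap_bound(rho, B, R, c, delta=0.05, N=400)
```

Both terms decay like one over the square root of N, so quadrupling the sample size halves the gap term.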

In the case of KSMM, we have the following result where we denote by .

###### Theorem 2.2

Suppose that is a distribution over where is a matrix Hilbert space such that with probability 1 we have that . Let and let be a loss function of the form

such that for all , is a -Lipschitz function and . Then, for any , with probability of at least over the choice of an i.i.d. sample of size N,

(22) |

###### Proof

See Appendix A.

∎

The following theorem compares the generalization bounds with the same hinge-loss function .

###### Theorem 2.3

In the same domain of and , we have , and .

###### Proof

For any ,

(23) |

Thus, and so does . With , we have and .

∎

Theorem 2.3 suggests that, with the same probability, the bound on the difference between the true error and the empirical error of KSMM is smaller than that of STM. In other words, if we choose a suitable kernel that performs better on the training set within our method, it is more likely to yield a better result on the test set.

On the other hand, normally we do not obtain prior knowledge of the space , especially for the choice of matrix V. We consider the following general 1-norm and max norm constraint formulation for matrices where and for . The following theorem bounds the generalization error of all predictors in using their empirical error.

###### Theorem 2.4

Suppose that is a distribution over where is a matrix Hilbert space such that with probability 1 we have that . Let and let be a loss function of the form

such that for all , is a -Lipschitz function and . Then, for any , with probability of at least over the choice of an i.i.d. sample of size N,

(24) |

###### Proof

See Appendix B.

∎

Therefore, we have two bounds for KSMM, given in Theorem 2.2 and Theorem 2.4. Apart from the extra factor, they look similar. The two theorems rely on different prior knowledge: the former captures a low -norm assumption, while the latter is limited to a low max norm on W and a low 1-norm on X. Note that deriving the bounds places no limitation on the dimension of W, so kernel methods can be applied naturally.

### 2.4 Analysis of KSMM versus other methods

We discuss the differences of MRMLKSVM (Gao et al, 2015), SVM, STM, SHTM (Hao et al, 2013), DuSK (He et al, 2014) and our new method as follows:

DuSK, which uses CP decomposition and a dual-tensorial mapping to derive a tensor kernel, is a generalization of SHTM. KSMM constructs a matrix-based hyperplane, with Newton’s method applied in the process of seeking appropriate parameters. All the optimization problems mentioned above only need to be solved once. Based on the alternating projection method, STM and MRMLKSVM need to be solved iteratively, which consumes much more time. For a set of matrix samples , the memory space occupied by SVM is , STM requires , DuSK requires , MRMLKSVM requires and KSMM requires , where is the rank of the matrix. KSMM calculates the weight matrix V to determine the relative importance of each hyperplane on the average.

STM is naturally a multilinear support vector machine that uses different hyperplanes to separate the projections of data points. KSMM is a nonlinear supervised tensor learning method and constructs a single hyperplane in the matrix Hilbert space.

From previous work (Chu et al, 2007), we know that the computational complexity of SVM is , thus STM is , DuSK is , MRMLKSVM is , while the complexity of KSMM is , where is the corresponding number of iterations and is the average number of iterations of Newton’s method, which is usually small in practice. Moreover, its complexity can be reduced further, since the best known exponent for the multiplication of square matrices is currently about 2.373 (Le Gall, 2014).

## 3 Experiments

In this section, we conduct one simulation study on synthetic data and four experiments on benchmark datasets. We compare KSMM with other methodologies (DuSK (He et al, 2014), Gaussian-RBF, matrix kernel (Gao et al, 2014) on SVM or STM classifier, SMM (Luo et al, 2015)), since they have proven successful in various applications.

We use two performance measures to verify our claims about the improvement of the proposed approach. We report the accuracy, which is the proportion of correct predictions, and the $F_1$ score,

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

as the harmonic mean of precision and recall. Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. In multi-class classification problems, the macro-averaged F-measure (Yang and Liu, 1999) is adopted as the average of the $F_1$ score over the categories.

All experiments were conducted on a computer with an Intel(R) Core(TM) i7 (2.50 GHz) processor and 8.0 GB of RAM. The algorithms were implemented in Matlab.
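The evaluation measures described above can be computed as in the following sketch. The label vectors are made up for illustration, and classes with no predicted or no true instances are given a score of zero, which is a convention assumed here rather than one stated in the paper.

```python
def accuracy(y_true, y_pred):
    """Proportion of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1_class1 = f1_for_class(y_true, y_pred, 1)
macro = macro_f1(y_true, y_pred)
```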

### 3.1 Simulation study

To gain better insight into the proposed approach, we focus on its behavior for different numbers of attributes and examples in binary classification problems. Datasets are drawn from the Wishart distribution, defined over symmetric, positive-definite matrix-valued random variables, which is a generalization of the chi-squared distribution to multiple dimensions. Its probability density function is given by

$$f(\mathbf{X}) = \frac{|\mathbf{X}|^{(n-p-1)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}(\mathbf{A}^{-1}\mathbf{X})\right)}{2^{np/2}\, |\mathbf{A}|^{n/2}\, \Gamma_p\!\left(\tfrac{n}{2}\right)},$$

where X and A are symmetric, positive-definite $p \times p$ matrices, $n$ is the number of degrees of freedom, greater than $p-1$, and $\Gamma_p$ is the multivariate gamma function.

The problem is verified with the following set-up. It is assumed that the considered objects are described by and matrices respectively. The attributes are generated independently with the Wishart distribution with , for the first class and for the second class, for . Additional Gaussian white noise is considered, and evaluation is performed with and 200 examples, half of which are selected as a training set while the other examples form the test set. For each setting we average results over 10 trials, each of which is obtained from the proposed distribution. The input matrices are converted into vectors for SVM problems. All the kernels select the optimal trade-off parameter from , kernel width parameter from and rank from . All the learning machines use the same training and test set. We first randomly sample of whole data from each dataset for the purpose of parameter selection. Gaussian RBF kernels are used for MRMLKSVM, DuSK and SVM, which are denoted as , and respectively, while we set in KSMM.

| N | p | Accuracy | | | | | |
|---|---|---|---|---|---|---|---|
| | | STM | SMM | | | | |
| 50 | 10 | 76.4(6.7) | 80.8(6.3) | 81.2(5.2) | 80.4(5.0) | 81.2(5.1) | 83.2(6.1) |
| | 15 | 82.8(8.9) | 89.2(4.1) | 90.0(1.8) | 90.4(3.4) | 87.9(3.5) | 92.4(2.7) |
| | 20 | 85.6(5.9) | 89.6(3.4) | 88.8(6.4) | 88.8(4.3) | 88.6(2.5) | 91.2(2.4) |
| | 25 | 88.0(4.2) | 92.8(1.0) | 91.6(2.0) | 93.6(0.8) | 92.2(3.2) | 94.0(1.3) |
| | 30 | 84.8(2.0) | 89.6(3.4) | 92.4(2.9) | 93.2(3.2) | 92.8(2.2) | 95.2(2.0) |
| 100 | 10 | 81.8(4.9) | 86.0(3.2) | 85.0(4.1) | 79.4(7.3) | 81.4(3.2) | 86.6(2.9) |
| | 15 | 85.2(3.2) | 89.0(3.2) | 87.8(2.1) | 87.6(3.0) | 86.7(1.2) | 89.2(2.1) |
| | 20 | 84.8(6.7) | 90.2(2.1) | 90.0(2.1) | 91.0(2.0) | 89.5(1.4) | 91.2(2.5) |
| | 25 | 89.4(3.2) | 92.0(1.1) | 91.6(2.3) | 93.2(1.5) | 93.1(2.1) | 93.4(2.2) |
| | 30 | 86.8(7.4) | 93.6(3.4) | 92.2(2.1) | 94.8(2.9) | 93.2(2.9) | 94.8(2.8) |

The results are presented in Table 1. We can observe that KSMM performs well in general. We focus on accuracy in the comparison, because our kernels are represented as matrices in the calculation and Newton’s method is included, so KSMM occupies much more space and time. In addition, the observations demonstrate that the size of the training set has a positive effect on performance in most cases. KSMM performs well even when the sample size is small. When the training set is large enough, the accuracy increases with the number of attributes. This is reasonable because the expectations of the examples in the two classes are equal to and respectively, which makes the classes easier to distinguish as increases.
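The role of the class expectations can be illustrated by sampling Wishart matrices directly. For an integer number of degrees of freedom n, a Wishart sample with scale matrix A can be generated as the sum of n outer products of Gaussian vectors with covariance A, and its expectation is n times A. The sketch below does this for scale matrices that are multiples of the identity; the dimensions, scales and sample counts are hypothetical and do not reproduce the paper's set-up.

```python
import random


def sample_wishart(p, n, scale, rng):
    """Wishart sample with n degrees of freedom and scale matrix scale * I_p.

    Built as G^T G, where G is n x p with independent N(0, scale) entries.
    """
    sd = scale ** 0.5
    G = [[rng.gauss(0.0, sd) for _ in range(p)] for _ in range(n)]
    return [
        [sum(G[k][i] * G[k][j] for k in range(n)) for j in range(p)]
        for i in range(p)
    ]


def mean_matrix(samples):
    """Entrywise average of a list of equally sized square matrices."""
    p = len(samples[0])
    count = len(samples)
    return [[sum(S[i][j] for S in samples) / count for j in range(p)] for i in range(p)]


rng = random.Random(0)
p, n = 3, 5
class_a = [sample_wishart(p, n, 1.0, rng) for _ in range(400)]   # expectation 5 * I
class_b = [sample_wishart(p, n, 2.0, rng) for _ in range(400)]   # expectation 10 * I

mean_a = mean_matrix(class_a)
mean_b = mean_matrix(class_b)
```

The two sample means differ on every diagonal entry, and the number of such entries grows with p, which is in line with the observation that the classes become easier to tell apart as the number of attributes increases.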

### 3.2 Datasets and Discussion

Next, we evaluate the performance of our classifier on real datasets from a variety of domains. We consider the following benchmark datasets to perform a series of comparative experiments on multi-class classification problems.
We use the ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) (Samaria and Harter, 1994), the Sheffield Face dataset (https://www.sheffield.ac.uk/eee/research/iel/research/face), the Columbia Object Image Library (COIL-20)^{3}