I Introduction
Clustering is a fundamental task in computer vision and pattern recognition, which aims to divide samples into groups based on their similarity without any prior information. It is particularly useful when label information is hard to acquire. Many applications build on clustering, such as image segmentation, dimension reduction, and unsupervised classification. During the past decades, a variety of clustering methods have been proposed. Among them, standard spectral clustering (SPC)
[1], sparse subspace clustering (SSC) [2], and low-rank representation (LRR) [3] are the most popular methods. These single view clustering methods achieve good performance. In practice, however, we often acquire data from various domains or feature spaces. For example, one object can be described with text, images, or videos, and different kinds of features can be extracted to represent each of them. In order to make full use of multi-view information to boost performance, many multi-view clustering methods have been derived from these popular single view methods.
Due to the popularity of SSC [2] and LRR [3], many self-representation based subspace learning methods [4, 5, 6, 7, 8] have been proposed for multi-view clustering, and they achieve promising performance. However, they mainly focus on subspace learning and have high computation complexity. Another important issue is that they mainly investigate the correlations from the perspective of pairwise matrices, while it is more natural and effective to find a comprehensive representation of multiple views from the tensor perspective. As shown in robust multi-view spectral clustering (RMSC) [6], there is a connection between spectral clustering and the Markov chain, so we mainly focus on spectral clustering via the Markov chain in this paper. However, RMSC [6] only learns the shared common information among all views. Since multi-view representations also contain view-specific information, we hope to explore the high order correlations and find the principal components [10, 11, 12, 13, 14] of multi-view representations from the tensor perspective, based on Markov chain clustering.
As for tensor decomposition, we not only need to define the rank, but also need a tight convex relaxation of the tensor rank as a nuclear norm. The CANDECOMP/PARAFAC (CP) [15, 16], Tucker [17], and tensor Singular Value Decomposition (t-SVD) [18] are the three main tensor decomposition techniques. However, the CP rank is generally NP-hard to compute, and its convex relaxation is intractable. For the Tucker decomposition, the commonly used Sum of Nuclear Norms (SNN) [19] is not a tight convex relaxation of the Tucker rank. Since the t-SVD based tensor nuclear norm has been proven to be the tightest convex relaxation [20] to the $\ell_1$ norm of the tensor multi-rank, we adopt it. With the t-SVD based tensor nuclear norm, our model can well capture both the consistent and view-specific information among multiple views, which benefits the clustering.
In Fig. 1, we present the framework of our proposed method. We first construct a similarity matrix and a corresponding transition probability matrix for the features of each view. Then, we collect these transition probability matrices of the multiple views into a 3-order tensor. In order to better investigate the correlations as well as reduce the computation complexity, we rotate the tensor. The essential tensor can then be learnt via tensor low-rank and sparse decomposition based on the tensor nuclear norm minimization defined by the t-SVD.
Main contributions are summarized as follows:

We propose a novel essential tensor learning method for Markov chain based spectral clustering. With the t-SVD based tensor low-rank constraint and tensor rotation, our method is very effective at learning the principal information for clustering among multiple views.

We present an efficient algorithm based on ADMM to solve the proposed problem.

Our method achieves superior performance compared with the state-of-the-art methods on different datasets for various applications. In the meantime, it also has the lowest computation complexity.
II Related Work
Multi-view clustering has been extensively studied during the past decade. Standard spectral clustering (SPC) [1] is the most classic method: it constructs the similarity matrix first, and then learns the affinity matrix by exploiting the properties of the graph Laplacian. Most existing clustering methods are derived from SPC [1], and they mainly differ in the construction of the affinity matrix. Accordingly, existing work can be mainly divided into two classes: graph based affinity matrix learning methods and self-representation based subspace learning methods. We briefly review some related work. The graph based methods learn the affinity matrix based on the similarity matrix. For example, [21] proposes a co-training approach to search for the clusterings that agree across the views. [4] aims to find the complementary information across views based on a co-regularization method. [22] tries to find a universal Laplacian embedding for multi-view features using minimax optimization. The work in [23, 24] shows that there is a natural connection between spectral clustering and the Markov random walk. Then, [25] constructs a transition probability matrix of a Markov chain on each view and combines these matrices via a Markov mixture. Considering that multi-view data might be noisy, RMSC [6] recovers a shared low-rank transition probability matrix for Markov chain based spectral clustering. Recently, [26] proposes structured low-rank matrix factorization methods for multi-view spectral clustering.
For the second class, multi-view subspace learning methods are derived from the popular SSC [2] and LRR [3], which aim to explore the relationships between samples based on self-representation. Most recent work on multi-view clustering mainly focuses on self-representation based subspace learning. For example, [27] combines the advantages of both LRR and SSC. [28] extends LRR to multi-view subspace clustering with a generalized tensor nuclear norm. Then [29] adopts the t-SVD based tensor nuclear norm for better representation, and [30] proposes the tensorial t-product representation. Zhang et al. [31] jointly learn the underlying latent representation of features and the multi-view low-rank representation, and then generalize it to combine with a deep neural network [32]. To explore the complementary property of multi-view representations, [5] utilizes the Hilbert-Schmidt Independence Criterion (HSIC) as a diversity term between views, and [7] adds an exclusivity term to the structured sparse subspace clustering model [33] to preserve the complementary and consistent information.

III Notations and Preliminaries
III-A Notations
For convenience, we summarize the frequently used notations in Table I. In this paper, we mainly consider the 3-order tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. A vector along the $k$th mode is called a mode-$k$ fiber. Here, we define the $\ell_{2,1}$ norm of a tensor as the sum of the $\ell_2$ norms of its mode-3 fibers. $\mathbf{A}_{(k)}$ denotes the matricization of $\mathcal{A}$ along the $k$th mode, constructed by arranging the mode-$k$ fibers as the columns of the resulting matrix. The transpose $\mathcal{A}^{\mathsf{T}}$ is obtained by transposing each frontal slice and then reversing the order of the transposed frontal slices $2$ through $n_3$. $\mathcal{A}_f = \mathrm{fft}(\mathcal{A},[\,],3)$ denotes the fast Fourier transform (FFT) of the tensor $\mathcal{A}$ along the 3rd dimension, and we also have $\mathcal{A} = \mathrm{ifft}(\mathcal{A}_f,[\,],3)$.

TABLE I: Summary of notations.
$a$  A scalar.  $\mathbf{A}$  A matrix.
$\mathbf{a}$  A vector.  $\mathcal{A}$  A tensor.
$\|\mathbf{A}\|_*$  Sum of the singular values.
$\|\mathcal{A}\|_1$  $\sum_{ijk} |a_{ijk}|$.  $\|\mathcal{A}\|_F$  Frobenius norm.
$a_{ijk}$  The $(i,j,k)$th entry of $\mathcal{A}$.  $\mathcal{A}_f$  $\mathrm{fft}(\mathcal{A},[\,],3)$.
$\mathcal{A}(i,:,:)$  The $i$th horizontal slice of $\mathcal{A}$.
$\mathcal{A}(:,j,:)$  The $j$th lateral slice of $\mathcal{A}$.
$\mathcal{A}(:,:,k)$, $\mathbf{A}^{(k)}$  The $k$th frontal slice of $\mathcal{A}$.
$\|\mathcal{A}\|_{\circledast}$  t-SVD based tensor nuclear norm.
$\mathcal{A}^{\mathsf{T}}$  The transpose of $\mathcal{A}$.
$\mathbf{A}_{(k)}$  Mode-$k$ matricization of $\mathcal{A}$.
Besides, for a tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, we also define the block vectorizing operation and its inverse as $\mathrm{bvec}(\mathcal{A}) = [\mathbf{A}^{(1)}; \mathbf{A}^{(2)}; \ldots; \mathbf{A}^{(n_3)}] \in \mathbb{R}^{n_1 n_3 \times n_2}$ and $\mathrm{bvfold}(\mathrm{bvec}(\mathcal{A})) = \mathcal{A}$, respectively. The block diagonal matrix and the block circulant matrix are defined by:

$$\mathrm{bdiag}(\mathcal{A}) = \begin{bmatrix} \mathbf{A}^{(1)} & & \\ & \ddots & \\ & & \mathbf{A}^{(n_3)} \end{bmatrix},\qquad \mathrm{bcirc}(\mathcal{A}) = \begin{bmatrix} \mathbf{A}^{(1)} & \mathbf{A}^{(n_3)} & \cdots & \mathbf{A}^{(2)} \\ \mathbf{A}^{(2)} & \mathbf{A}^{(1)} & \cdots & \mathbf{A}^{(3)} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{A}^{(n_3)} & \mathbf{A}^{(n_3-1)} & \cdots & \mathbf{A}^{(1)} \end{bmatrix}.$$
IiiB Preliminaries
To help understand the definition of the tensor nuclear norm, we first introduce some related definitions [18].
Definition 1 (t-product)
Let $\mathcal{A}$ be $n_1 \times n_2 \times n_3$, and $\mathcal{B}$ be $n_2 \times n_4 \times n_3$. Then the t-product $\mathcal{A} * \mathcal{B}$ is the $n_1 \times n_4 \times n_3$ tensor

$$\mathcal{A} * \mathcal{B} = \mathrm{bvfold}\big(\mathrm{bcirc}(\mathcal{A}) \cdot \mathrm{bvec}(\mathcal{B})\big). \tag{1}$$
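As an illustration, the t-product can be computed either through the block circulant matrix above or, equivalently, slice by slice in the Fourier domain. The NumPy sketch below (the helper names and toy shapes are our own, not from the paper) checks that the two routes agree:

```python
import numpy as np

def bcirc(A):
    """Block circulant matrix of a 3-order tensor A (n1 x n2 x n3)."""
    n1, n2, n3 = A.shape
    C = np.zeros((n1 * n3, n2 * n3))
    for i in range(n3):
        for j in range(n3):
            C[i * n1:(i + 1) * n1, j * n2:(j + 1) * n2] = A[:, :, (i - j) % n3]
    return C

def bvec(B):
    """Stack the frontal slices of B (n2 x n4 x n3) vertically."""
    n2, n4, n3 = B.shape
    return np.vstack([B[:, :, k] for k in range(n3)])

def t_product(A, B):
    """t-product A * B computed slice-wise in the Fourier domain."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)  # one matrix product per slice
    return np.real(np.fft.ifft(Cf, axis=2))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 2, 5))
C = t_product(A, B)                      # n1 x n4 x n3 tensor

# Eq. (1): A * B = bvfold(bcirc(A) @ bvec(B)); fold M back and compare.
M = bcirc(A) @ bvec(B)                   # (n1*n3) x n4
C_ref = np.stack([M[k * 3:(k + 1) * 3, :] for k in range(5)], axis=2)
assert np.allclose(C, C_ref)
```

The Fourier-domain route is the one used in practice, since the FFT block-diagonalizes the block circulant matrix.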
Definition 2 (f-diagonal tensor)
A tensor is called f-diagonal if each of its frontal slices is a diagonal matrix.
Definition 3 (Identity tensor)
For the identity tensor $\mathcal{I} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$, its first frontal slice is the identity matrix of size $n_1 \times n_1$, and all other frontal slices are zero.

Definition 4 (Orthogonal tensor)
A tensor $\mathcal{Q}$ is orthogonal if it satisfies

$$\mathcal{Q}^{\mathsf{T}} * \mathcal{Q} = \mathcal{Q} * \mathcal{Q}^{\mathsf{T}} = \mathcal{I}. \tag{2}$$
Definition 5 (t-SVD)
For a tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, it can be factorized by the t-SVD as

$$\mathcal{A} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{\mathsf{T}}, \tag{3}$$

where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are orthogonal, and $\mathcal{S} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is f-diagonal.
Definition 6 (t-SVD based tensor nuclear norm)
The t-SVD based tensor nuclear norm of a tensor $\mathcal{A}$ is defined as the sum of the singular values of all the frontal slices of $\mathcal{A}_f$:

$$\|\mathcal{A}\|_{\circledast} = \sum_{k=1}^{n_3} \big\|\mathbf{A}_f^{(k)}\big\|_* = \sum_{k=1}^{n_3} \sum_{i=1}^{\min(n_1,n_2)} \big|\mathcal{S}_f^{(k)}(i,i)\big|, \tag{4}$$

where $\mathcal{S}_f^{(k)}$ is computed by the SVD of the frontal slices of $\mathcal{A}_f$, i.e., $\mathbf{A}_f^{(k)} = \mathbf{U}_f^{(k)} \mathcal{S}_f^{(k)} \mathbf{V}_f^{(k)\mathsf{T}}$.
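Concretely, the norm of Definition 6 can be evaluated by taking the FFT along the third dimension and summing the singular values of every frontal slice. A small sketch (assuming, as here, the un-normalized convention without the $1/n_3$ factor that some papers adopt):

```python
import numpy as np

def tsvd_tnn(A):
    """t-SVD based tensor nuclear norm: sum of the singular values of
    all frontal slices of A in the Fourier domain (Definition 6)."""
    Af = np.fft.fft(A, axis=2)
    return sum(np.linalg.svd(Af[:, :, k], compute_uv=False).sum()
               for k in range(A.shape[2]))

# A tensor whose only nonzero frontal slice is the 2x2 identity: every
# Fourier-domain slice then equals the identity, so the norm is 3 * 2 = 6.
A = np.zeros((2, 2, 3))
A[:, :, 0] = np.eye(2)
assert np.isclose(tsvd_tnn(A), 6.0)
```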
IV Essential Tensor Learning for Multi-view Spectral Clustering
In this section, we first give an overview of spectral clustering by Markov chain. Then we present the details and analysis of our proposed essential tensor learning for multi-view spectral clustering (ETLMSC).
IV-A Markov Chain based Spectral Clustering
Denote $\mathbf{X} \in \mathbb{R}^{n \times d}$ as the matrix of data vectors, where $n$ is the number of data points and $d$ is the dimension of the feature vectors. We first compute the similarity matrix $\mathbf{W}$, where $w_{ij}$ denotes the similarity between data points $\mathbf{x}_i$ and $\mathbf{x}_j$. The Gaussian kernel is commonly used to define this similarity: we have $w_{ij} = \exp\!\left(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\sigma^2)\right)$, where the Euclidean distance is adopted and $\sigma$ is the standard deviation. Then we can construct a weighted graph $G = (V, E, \mathbf{W})$, where the vertex set $V$ consists of the sample points, the edge set $E$ denotes the connections between data points, and the similarity $w_{ij}$ defines the weight of each edge. Spectral clustering [1] tries to find an optimal partition of the weighted graph $G$. According to [23, 24], there is a natural connection between spectral clustering and random walks on the weighted graph. We first define the transition probability matrix by $\mathbf{P} = \mathbf{D}^{-1}\mathbf{W}$, where $p_{ij}$ denotes the probability of a random walk from node $i$ to node $j$, and $\mathbf{D}$ is a diagonal matrix with elements $d_{ii} = \sum_j w_{ij}$. For this Markov chain, we hope the random walk over the graph converges to a unique and positive stationary distribution $\boldsymbol{\pi}$, that is, $\boldsymbol{\pi}^{\mathsf{T}}\mathbf{P} = \boldsymbol{\pi}^{\mathsf{T}}$. Let $\boldsymbol{\Pi}$ denote the diagonal matrix with $\Pi_{ii} = \pi_i$; then the Laplacian matrix for Markov chain based spectral clustering can be computed by $\mathbf{L} = \boldsymbol{\Pi} - (\boldsymbol{\Pi}\mathbf{P} + \mathbf{P}^{\mathsf{T}}\boldsymbol{\Pi})/2$. Denote $c$ as the number of clusters; the indicator vectors for clustering can be obtained by computing the eigenvectors corresponding to the $c$ smallest eigenvalues of the generalized eigenvalue problem $\mathbf{L}\mathbf{u} = \lambda\boldsymbol{\Pi}\mathbf{u}$, which is equivalent to computing the eigenvectors corresponding to the $c$ largest eigenvalues of the normalized Laplacian matrix $\boldsymbol{\Pi}^{-1/2}\frac{\boldsymbol{\Pi}\mathbf{P} + \mathbf{P}^{\mathsf{T}}\boldsymbol{\Pi}}{2}\boldsymbol{\Pi}^{-1/2}$. Finally, the k-means algorithm [37] is adopted to cluster based on these indicator vectors. In Algorithm 1, we briefly summarize the outline of spectral clustering by Markov chains. For more details, please refer to [6, 24].

IV-B The Proposed Method
Assume that there are $V$ different views in total. Let $\mathbf{X}^{(v)} \in \mathbb{R}^{n \times d_v}$ denote the data matrix of the $v$th view, where $n$ is the number of samples, $d_v$ is the dimension of the feature vectors in the $v$th view, and $v$ ranges from $1$ to $V$. For multi-view spectral clustering via the Markov chain, we first compute the similarity matrix $\mathbf{W}^{(v)}$, construct the weighted graph $G^{(v)}$, and compute the transition probability matrix $\mathbf{P}^{(v)}$ for each view. From Algorithm 1, we can see that the transition probability matrix plays a very important role in clustering by Markov chain. So we mainly focus on how to learn an essential transition probability matrix for spectral clustering based on the multi-view $\{\mathbf{P}^{(v)}\}_{v=1}^{V}$.
RMSC [6] hopes to capture the shared information among the multi-view transition probability matrices. It divides each $\mathbf{P}^{(v)}$ into two parts: a shared probability matrix $\mathbf{P}$ describing the important information for clustering, and a view-specific deviation error matrix $\mathbf{E}^{(v)}$. As the number of clusters is much smaller than the number of samples, RMSC imposes a low-rank constraint on $\mathbf{P}$. It also assumes that the error matrices should be sparse. Then the objective function of RMSC [6] is formulated as

$$\min_{\mathbf{P},\, \mathbf{E}^{(v)}} \|\mathbf{P}\|_* + \lambda \sum_{v=1}^{V} \big\|\mathbf{E}^{(v)}\big\|_1, \quad \text{s.t. } \mathbf{P}^{(v)} = \mathbf{P} + \mathbf{E}^{(v)},\ \mathbf{P} \ge 0,\ \mathbf{P}\mathbf{1} = \mathbf{1}, \tag{5}$$

where $\lambda$ is a balance parameter.
RMSC only learns the shared common information among multiple views. However, each view also contains unique information that is useful for clustering. Motivated by this, we hope to explore high order correlations among multiple views based on tensor representation.
We divide each $\mathbf{P}^{(v)}$ into two parts $\mathbf{P}^{(v)} = \mathbf{Z}^{(v)} + \mathbf{E}^{(v)}$. Then we construct a 3-order tensor $\mathcal{Z}$ by collecting all $\mathbf{Z}^{(v)}$. As multi-view features are extracted from the same objects, different $\mathbf{Z}^{(v)}$ also contain some similar information. In the meantime, the number of clusters is much smaller than the number of samples. So the tensor $\mathcal{Z}$ should be low-rank. We use the t-SVD based tensor nuclear norm to regularize $\mathcal{Z}$ and get the primary objective function of our model:

$$\min_{\mathbf{Z}^{(v)},\, \mathbf{E}^{(v)}} \|\mathcal{Z}\|_{\circledast} + \lambda \sum_{v=1}^{V} \big\|\mathbf{E}^{(v)}\big\|_1, \quad \text{s.t. } \mathbf{P}^{(v)} = \mathbf{Z}^{(v)} + \mathbf{E}^{(v)}. \tag{6}$$

The minimization of the low-rank tensor can help us find the essential information among different views. Specifically, the consistent information among multiple views may be represented by several principal components of the t-SVD, and view-specific information can be preserved in the other singular values of the corresponding slices of the f-diagonal tensor $\mathcal{S}$, which is computed by the t-SVD. By constructing a 3-order transition probability tensor $\mathcal{P} \in \mathbb{R}^{n \times n \times V}$, where $\mathbf{P}^{(v)}$ is the $v$th frontal slice of the tensor $\mathcal{P}$, the above problem can be reformulated in tensor form:

$$\min_{\mathcal{Z},\, \mathcal{E}} \|\mathcal{Z}\|_{\circledast} + \lambda \|\mathcal{E}\|_1, \quad \text{s.t. } \mathcal{P} = \mathcal{Z} + \mathcal{E}. \tag{7}$$
Instead of optimizing the above problem directly, we first rotate the original transition probability tensor $\mathcal{P} \in \mathbb{R}^{n \times n \times V}$ into $\hat{\mathcal{P}} \in \mathbb{R}^{n \times V \times n}$, as shown in the middle part of Fig. 1 (please pay attention to the rotation of the red edge of the tensor). This tensor rotation can be easily achieved by the shiftdim function in Matlab. There are mainly two advantages of this operation. First, according to the definition of the t-SVD, the FFT operates along the third dimension of the tensor, and then we perform the SVD on each frontal slice. As we hope to capture the essential information among all views, an SVD on each slice containing the information of all views and all samples is more meaningful. Moreover, the FFT along the feature dimension can preserve the relationship among views. Second, this rotation largely reduces the computation complexity of the optimization, which will be analysed in subsection IV-D.
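In NumPy, the rotation corresponds to a simple axis permutation of the $n \times n \times V$ transition probability tensor into an $n \times V \times n$ tensor (a toy sketch; the sizes below are illustrative):

```python
import numpy as np

P = np.random.default_rng(0).standard_normal((100, 100, 3))  # n=100, V=3

# Rotate n x n x V into n x V x n (shiftdim-style rotation): the FFT of
# the t-SVD now runs along a sample dimension, and each Fourier-domain
# frontal slice (n x V) mixes information from all views.
P_rot = np.transpose(P, (0, 2, 1))
assert P_rot.shape == (100, 3, 100)
assert np.allclose(P_rot[:, 0, :], P[:, :, 0])  # a view becomes a lateral slice
```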
Besides, for the error term, if one sample contains much noise or outliers, the transition probability vectors in the tensor related to this sample will be influenced. Noise in these vectors is not sparse, so $\ell_2$ norm regularization on these vectors is more proper. As noisy samples should be sparse, the tensor $\ell_{2,1}$ norm works: it is more robust to outliers and noise. So we use the $\ell_{2,1}$ norm to characterize the sparsity property. Then the final objective function of our proposed ETLMSC method can be formulated as follows:

$$\min_{\mathcal{Z},\, \mathcal{E}} \|\mathcal{Z}\|_{\circledast} + \lambda \|\mathcal{E}\|_{2,1}, \quad \text{s.t. } \hat{\mathcal{P}} = \mathcal{Z} + \mathcal{E}, \tag{8}$$

where $\hat{\mathcal{P}}$ denotes the rotated transition probability tensor. For the tensor after rotation, the $\ell_{2,1}$ norm is defined as the sum of the $\ell_2$ norms of the fibers along the coefficient dimension. According to the definitions of the $\ell_{2,1}$ norm and matricization in Table I, we have $\|\mathcal{E}\|_{2,1} = \|\mathbf{E}_{(3)}\|_{2,1}$, which is helpful for the optimization of $\mathcal{E}$.
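The identity between the tensor $\ell_{2,1}$ norm and the $\ell_{2,1}$ norm of the mode-3 matricization is easy to check numerically. In the sketch below (our own helper names, assuming the mode-3 fiber convention), the fibers become the columns of the matricization, whose column-wise $\ell_2$ norms are then summed:

```python
import numpy as np

def unfold3(E):
    """Mode-3 matricization: mode-3 fibers E[i, j, :] become columns."""
    n1, n2, n3 = E.shape
    return E.reshape(n1 * n2, n3).T   # shape (n3, n1*n2)

def tensor_l21(E):
    """Sum of the l2 norms of the mode-3 fibers of E."""
    return np.linalg.norm(unfold3(E), axis=0).sum()

E = np.random.default_rng(0).standard_normal((4, 5, 6))
direct = sum(np.linalg.norm(E[i, j, :])
             for i in range(4) for j in range(5))
assert np.isclose(tensor_l21(E), direct)
```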
IV-C Optimization
We adopt the alternating direction method of multipliers (ADMM) [38] to solve Eq. (8). The augmented Lagrangian function can be formulated as follows:

$$\mathcal{L}(\mathcal{Z}, \mathcal{E}, \mathcal{Y}) = \|\mathcal{Z}\|_{\circledast} + \lambda \|\mathcal{E}\|_{2,1} + \langle \mathcal{Y}, \hat{\mathcal{P}} - \mathcal{Z} - \mathcal{E} \rangle + \frac{\mu_t}{2} \big\|\hat{\mathcal{P}} - \mathcal{Z} - \mathcal{E}\big\|_F^2, \tag{9}$$

where $\mu_t$ is the penalty parameter at the $t$th iteration and $\mathcal{Y}$ is the Lagrange multiplier. ADMM alternately updates each variable as follows.
$\mathcal{Z}$ subproblem:

$$\mathcal{Z}_{t+1} = \arg\min_{\mathcal{Z}} \|\mathcal{Z}\|_{\circledast} + \frac{\mu_t}{2} \left\|\mathcal{Z} - \left(\hat{\mathcal{P}} - \mathcal{E}_t + \frac{\mathcal{Y}_t}{\mu_t}\right)\right\|_F^2, \tag{10}$$

which is a t-SVD based tensor nuclear norm minimization problem. According to [39], it has the following closed-form solution given by the tensor tubal-shrinkage operator:

$$\mathcal{Z}_{t+1} = \mathcal{C}_{n_3 \tau}(\mathcal{F}) = \mathcal{U} * \mathcal{C}_{n_3 \tau}(\mathcal{S}) * \mathcal{V}^{\mathsf{T}}, \tag{11}$$

where $\tau = 1/\mu_t$, $\mathcal{F} = \hat{\mathcal{P}} - \mathcal{E}_t + \mathcal{Y}_t/\mu_t$, and $\mathcal{F} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{\mathsf{T}}$. $\mathcal{C}_{n_3 \tau}(\mathcal{S})$ is an f-diagonal tensor whose diagonal elements in the Fourier domain are $\max\big(\mathcal{S}_f(i,i,k) - n_3 \tau,\, 0\big)$.
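This proximal step can be sketched as singular value thresholding on every Fourier-domain frontal slice (assuming the un-normalized tensor nuclear norm of Definition 6, which makes the threshold $n_3\tau$; a normalized norm would change the constant):

```python
import numpy as np

def tubal_shrinkage(T, tau):
    """Closed-form minimizer of tau*||Z||_TNN + 0.5*||Z - T||_F^2:
    soft-threshold the singular values of each Fourier-domain slice."""
    n1, n2, n3 = T.shape
    Tf = np.fft.fft(T, axis=2)
    Zf = np.empty_like(Tf)
    for k in range(n3):
        U, s, Vh = np.linalg.svd(Tf[:, :, k], full_matrices=False)
        s = np.maximum(s - n3 * tau, 0.0)   # shrink the singular values
        Zf[:, :, k] = (U * s) @ Vh
    return np.real(np.fft.ifft(Zf, axis=2))

T = np.random.default_rng(0).standard_normal((4, 3, 5))
assert np.allclose(tubal_shrinkage(T, 0.0), T)      # no shrinkage
assert np.allclose(tubal_shrinkage(T, 1e6), 0.0)    # everything shrunk away
```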
$\mathcal{E}$ subproblem:

$$\mathcal{E}_{t+1} = \arg\min_{\mathcal{E}} \lambda \|\mathcal{E}\|_{2,1} + \frac{\mu_t}{2} \left\|\mathcal{E} - \left(\hat{\mathcal{P}} - \mathcal{Z}_{t+1} + \frac{\mathcal{Y}_t}{\mu_t}\right)\right\|_F^2. \tag{12}$$

As the $\ell_{2,1}$ norm of the tensor is defined as the sum of the $\ell_2$ norms of its mode-3 fibers, we matricize each tensor along the 3rd mode, so that $\|\mathcal{E}\|_{2,1} = \|\mathbf{E}_{(3)}\|_{2,1}$. The problem can then be transformed into the matrix form:

$$\mathbf{E}_{t+1} = \arg\min_{\mathbf{E}} \frac{\lambda}{\mu_t} \|\mathbf{E}\|_{2,1} + \frac{1}{2} \|\mathbf{E} - \mathbf{G}\|_F^2. \tag{13}$$

Let $\mathbf{G}$ be the mode-3 matricization of $\hat{\mathcal{P}} - \mathcal{Z}_{t+1} + \mathcal{Y}_t/\mu_t$; according to [3], the problem in Eq. (13) has the following closed-form solution:

$$[\mathbf{E}_{t+1}]_{:,i} = \begin{cases} \dfrac{\|\mathbf{g}_i\|_2 - \lambda/\mu_t}{\|\mathbf{g}_i\|_2}\, \mathbf{g}_i, & \text{if } \|\mathbf{g}_i\|_2 > \lambda/\mu_t, \\ \mathbf{0}, & \text{otherwise}, \end{cases} \tag{14}$$

where $\mathbf{g}_i$ represents the $i$th column of the matrix $\mathbf{G}$. After we get $\mathbf{E}_{t+1}$, we fold it back into tensor form.
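The column-wise shrinkage of Eq. (14) can be sketched as follows (a small NumPy illustration with our own function name):

```python
import numpy as np

def l21_shrinkage(G, tau):
    """Closed-form minimizer of tau*||E||_{2,1} + 0.5*||E - G||_F^2:
    scale each column of G toward zero by its l2 norm."""
    norms = np.linalg.norm(G, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return G * scale[None, :]

G = np.array([[3.0, 0.1],
              [4.0, 0.2]])          # column norms: 5 and about 0.22
E = l21_shrinkage(G, 1.0)
assert np.allclose(E[:, 0], [2.4, 3.2])   # shrunk by the factor (5 - 1)/5
assert np.allclose(E[:, 1], 0.0)          # small column set to zero
```

Columns whose norm falls below the threshold are zeroed out entirely, which is what makes the penalty robust to sample-level noise.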
Update multipliers:

$$\mathcal{Y}_{t+1} = \mathcal{Y}_t + \mu_t \big(\hat{\mathcal{P}} - \mathcal{Z}_{t+1} - \mathcal{E}_{t+1}\big), \qquad \mu_{t+1} = \min(\rho \mu_t,\ \mu_{\max}). \tag{15}$$
The whole optimization process is summarized in Algorithm 2. After we learn the essential transition probability tensor $\mathcal{Z}^*$, we compute the essential transition probability matrix $\mathbf{P}^*$ by summing its lateral slices, $\mathbf{P}^* = \sum_{v=1}^{V} \mathcal{Z}^*(:, v, :)$. Then we feed $\mathbf{P}^*$ into the second step of Algorithm 1 to replace the transition probability matrix, and we obtain the final clustering result.
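Putting the pieces together, the Markov chain based spectral clustering step of Algorithm 1 can be sketched as below. This is an illustrative implementation, not the paper's code: scikit-learn's KMeans stands in for the k-means step, and for a symmetric similarity the stationary distribution is proportional to the node degrees, so the normalized Laplacian step reduces to the eigenvectors of $\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

def markov_spectral_clustering(W, n_clusters, seed=0):
    """Cluster from a symmetric similarity matrix W via the Markov chain
    view: top eigenvectors of D^{-1/2} W D^{-1/2}, rows normalized,
    then k-means on the embedding."""
    d = W.sum(axis=1)
    Wn = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    vals, vecs = np.linalg.eigh(Wn)
    U = vecs[:, -n_clusters:]                     # largest eigenvalues
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(U)

# Two well-separated blobs: the Gaussian-kernel similarity is nearly
# block diagonal, and the two groups are recovered.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.05, (10, 2)), rng.normal(20, 0.05, (10, 2))])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigma = np.sqrt(d2).mean()                        # average pairwise distance
labels = markov_spectral_clustering(np.exp(-d2 / (2 * sigma ** 2)), 2)
assert len(set(labels[:10])) == 1 and len(set(labels[10:])) == 1
assert labels[0] != labels[10]
```

In the proposed method, the learned $\mathbf{P}^*$ would replace the transition matrix derived from `W` before the eigen-decomposition step.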
IV-D Convergence and Complexity
At each iteration, we can get the closed-form solutions of $\mathcal{Z}$ and $\mathcal{E}$. In [38], the convergence of ADMM with two blocks of variables has already been proved. Accordingly, our algorithm will converge to an optimal solution.
As for the computation complexity, at each iteration it takes $O(n^2 V)$ to compute the closed-form solution of $\mathcal{E}$. As for updating $\mathcal{Z}$, on the one hand, we need to calculate the FFT and inverse FFT of a tensor along the third dimension, which takes $O(n^2 V \log(n))$. On the other hand, in the Fourier domain, we need to compute the SVD of each $n \times V$ frontal slice, which takes $O(n^2 V^2)$ in total. So we need $O(n^2 V \log(n) + n^2 V^2)$ to compute the closed-form solution of $\mathcal{Z}$ under the tensor rotation. Without the rotation, however, the SVDs are performed on $V$ frontal slices of size $n \times n$, which requires $O(n^3 V)$. As the number of views is much smaller than the number of samples in the multi-view setting, that is, $V \ll n$, the computation complexity is largely reduced by the tensor rotation. Denoting by $T$ the number of iterations, the complexity of learning the essential tensor in Algorithm 2 is $O\big(T n^2 V (\log(n) + V)\big)$, which is relatively efficient.
After we get the essential transition probability matrix, we adopt the Markov chain based spectral clustering to get the final result, which usually costs $O(n^3)$. Therefore, the overall complexity is $O\big(T n^2 V (\log(n) + V) + n^3\big)$.
V Experiments
V-A Experimental Settings
V-A1 Datasets
We adopt seven commonly used real-world datasets, which cover five different applications: news article clustering, digit clustering, generic object clustering, face clustering, and scene clustering.
In Table II, we summarize the statistics of these seven datasets. Some samples from the image datasets are presented in Fig. 3. We briefly introduce these datasets as follows.
BBCSport [40] (http://mlg.ucd.ie/datasets) contains documents from the BBC Sport website corresponding to sports news in five topical areas: athletics, cricket, football, rugby, and tennis. There are two different views in total.
UCIDigits [41] consists of 2,000 digit images corresponding to 10 classes. Same as [6], we extract three different features to represent these digit images: Fourier coefficients, pixel averages, and morphological features.
COIL20 (http://www.cs.columbia.edu/CAVE/software/softlib/) is the abbreviation of the Columbia Object Image Library dataset, which contains 1,440 images of 20 object categories. Each category contains 72 images, and all images are normalized to size 32×32. For this dataset, we also extract three types of features (intensity, LBP [42], and Gabor [43] features), the same as [28, 29].
NottingHill [44] is a video based face dataset collected from the movie "Notting Hill". It contains faces of the main cast gathered over multiple tracks. Intensity, LBP [42], and Gabor [43] features are extracted for representation.
Scene15 [45] has 15 natural scene categories with both indoor and outdoor environments, including industrial, store, bedroom, kitchen, etc. There are 4,485 images in total. Similar to [29], we extract three kinds of image features for representation: PHOW [46], LBP [42], and CENTRIST [47].
MITIndoor67 [48] contains indoor images of 67 categories. Same as [29], the training subset is adopted for clustering. Besides the three kinds of features used for Scene15, we also extract deep features based on the pretrained VGG-VD [49] network to improve the performance.

Caltech101 [50] includes object images of 102 categories. Each category has about 40 to 800 images. This dataset is the largest one used among all these related multi-view clustering methods. We adopt all images of these classes to test the clustering performance, the same as [29]. Besides the three kinds of features used for Scene15, the Inception V3 [51] network is used to extract deep features.
Datasets  BBCSport  UCIDigits  
Methods  NMI  ACC  AR  F-score  Precision  Recall  NMI  ACC  AR  F-score  Precision  Recall 
SPC  0.735  0.853  0.744  0.798  0.804  0.792  0.642  0.731  0.545  0.591  0.582  0.601 
LRR  0.747  0.886  0.725  0.789  0.803  0.776  0.768  0.871  0.736  0.763  0.759  0.767 
Coreg  0.771  0.849  0.783  0.829  0.836  0.822  0.804  0.780  0.755  0.780  0.764  0.798 
RMSC  0.808  0.912  0.837  0.871  0.879  0.864  0.822  0.915  0.789  0.811  0.797  0.826 
DiMSC  0.814  0.901  0.843  0.880  0.875  0.882  0.772  0.703  0.652  0.695  0.673  0.718 
LTMSC  0.066  0.379  0.005  0.383  0.239  0.953  0.775  0.803  0.725  0.753  0.739  0.767 
ECMSC  0.090  0.408  0.060  0.391  0.267  0.942  0.780  0.718  0.672  0.707  0.660  0.760 
URETLMSC  0.808  0.879  0.823  0.865  0.859  0.873  0.782  0.841  0.719  0.747  0.739  0.756 
tSVDMSC  0.830  0.941  0.853  0.888  0.881  0.896  0.932  0.955  0.924  0.932  0.930  0.934 
ETLMSC  0.984  0.978  0.967  0.977  0.963  0.998  0.977  0.958  0.953  0.958  0.940  0.980 
Datasets  COIL20  NottingHill  
Methods  NMI  ACC  AR  F-score  Precision  Recall  NMI  ACC  AR  F-score  Precision  Recall 
SPC  0.806  0.672  0.619  0.640  0.596  0.692  0.723  0.816  0.712  0.775  0.780  0.776 
LRR  0.829  0.761  0.720  0.734  0.717  0.751  0.579  0.794  0.558  0.653  0.672  0.636 
Coreg  0.774  0.659  0.592  0.613  0.590  0.640  0.703  0.805  0.686  0.754  0.766  0.743 
RMSC  0.800  0.685  0.637  0.656  0.620  0.698  0.585  0.807  0.496  0.603  0.621  0.586 
DiMSC  0.846  0.778  0.732  0.745  0.739  0.751  0.799  0.837  0.787  0.834  0.822  0.847 
LTMSC  0.860  0.804  0.748  0.760  0.741  0.479  0.779  0.868  0.777  0.825  0.830  0.814 
ECMSC  0.942  0.782  0.781  0.794  0.695  0.925  0.817  0.767  0.679  0.764  0.637  0.954 
URETLMSC  0.829  0.750  0.696  0.711  0.692  0.732  0.794  0.835  0.787  0.834  0.828  0.840 
tSVDMSC  0.884  0.830  0.786  0.800  0.785  0.808  0.900  0.957  0.900  0.922  0.937  0.907 
ETLMSC  0.947  0.877  0.862  0.869  0.830  0.914  0.911  0.951  0.898  0.924  0.940  0.908 
V-A2 Compared Methods
We compare our proposed approaches ETLMSC and URETLMSC (the proposed method without tensor rotation) with the following state-of-the-art methods, including two single view and six multi-view methods.
SPC achieves the best result among all views with standard spectral clustering [1].
LRR achieves the best result among all views with the low-rank representation [3].
Coreg [4] is the co-regularization method for spectral clustering, which co-regularizes the clustering hypotheses to explore the complementary information.
RMSC [6] recovers a shared low-rank transition probability matrix as input to the Markov chain based spectral clustering.
DiMSC [5] employs the HSIC as a diversity term to explore the complementarity of multiview representations.
LTMSC [28] adopts the low-rank tensor constraint for multi-view subspace clustering.
ECMSC [7] consists of a position-aware exclusivity term and a consistency term for regularization.
tSVDMSC [29] uses the t-SVD based tensor nuclear norm to learn the optimal subspace.
Among all the above methods, only SPC, Coreg, and RMSC are spectral clustering methods; the other methods are self-representation based subspace clustering methods.
V-A3 Evaluation Metrics
To comprehensively evaluate the clustering performance, we adopt six commonly used metrics: normalized mutual information (NMI), accuracy (ACC), adjusted rand index (AR), F-score, precision, and recall. These six metrics favour different properties of the clustering task. For all metrics, a higher value indicates better performance.
V-B Experimental Results and Analysis
Datasets  Scene15  MITIndoor67  
Methods  NMI  ACC  AR  F-score  Precision  Recall  NMI  ACC  AR  F-score  Precision  Recall 
SPC  0.421  0.437  0.270  0.321  0.314  0.329  0.559  0.443  0.304  0.315  0.294  0.340 
LRR  0.426  0.445  0.272  0.324  0.316  0.333  0.226  0.120  0.031  0.045  0.044  0.047 
Coreg  0.470  0.503  0.334  0.380  0.382  0.378  0.270  0.149  0.054  0.067  0.066  0.070 
RMSC  0.564  0.507  0.394  0.437  0.425  0.450  0.342  0.232  0.110  0.123  0.121  0.125 
DiMSC  0.269  0.300  0.117  0.181  0.173  0.190  0.383  0.246  0.128  0.141  0.138  0.144 
LTMSC  0.571  0.574  0.424  0.465  0.452  0.479  0.226  0.120  0.031  0.045  0.044  0.047 
ECMSC  0.463  0.457  0.303  0.357  0.318  0.408  0.590  0.469  0.323  0.333  0.314  0.355 
URETLMSC  0.536  0.534  0.369  0.419  0.420  0.419  0.467  0.335  0.204  0.216  0.211  0.220 
tSVDMSC  0.858  0.812  0.771  0.788  0.743  0.839  0.750  0.684  0.555  0.562  0.543  0.582 
ETLMSC  0.902  0.878  0.851  0.862  0.848  0.877  0.899  0.775  0.729  0.733  0.709  0.758 
Datasets  Caltech101  
Methods  NMI  ACC  AR  F-score  Precision  Recall 
SPC  0.723  0.484  0.319  0.340  0.597  0.235 
LRR  0.728  0.510  0.304  0.339  0.627  0.231 
Coreg  0.824  0.582  0.401  0.412  0.661  0.301 
RMSC  0.573  0.346  0.246  0.258  0.457  0.182 
DiMSC  0.589  0.351  0.226  0.253  0.362  0.191 
LTMSC  0.788  0.559  0.393  0.403  0.670  0.288 
ECMSC  0.662  0.419  0.312  0.326  0.465  0.251 
URETLMSC  0.740  0.463  0.342  0.352  0.638  0.243 
tSVDMSC  0.858  0.607  0.430  0.440  0.742  0.323 
ETLMSC  0.899  0.639  0.456  0.465  0.825  0.324 
V-B1 Performance Comparison
We present the detailed clustering results on the seven datasets in Tables III–VI. All results are averaged over repeated runs. In each table, the bold values represent the best performance. To better compare the performance of different methods, we divide all methods into four subclasses in the tables: single view methods, spectral clustering methods, subspace learning methods, and tensor based methods. The optimal parameters for these methods are fine-tuned by grid searching.
On all datasets, tSVDMSC and the proposed ETLMSC achieve the top two best results under nearly all metrics. From Tables III–VI, we can see that our proposed ETLMSC achieves the best performance on the BBCSport, UCIDigits, COIL20, Scene15, MITIndoor67, and Caltech101 datasets under all six evaluation metrics. Especially on the BBCSport and MITIndoor67 datasets, our results are substantially higher than the second best results achieved by tSVDMSC. There are also clear improvements over the second best performance of tSVDMSC on the UCIDigits, COIL20, Scene15, and Caltech101 datasets. The NottingHill dataset is a video based face dataset. According to [52, 53], facial images have a subspace structure, and self-representation based subspace learning methods are more suitable for this task. Although tSVDMSC is based on subspace learning, the performance of our method is still comparable to that of tSVDMSC, and much higher than those of all other methods, as shown in the right part of Table IV. Single view methods obtain good performance, but in general, multi-view methods work better than single view methods. Moreover, both ECMSC and DiMSC work very well for this task. As they both try to investigate complementary information, this shows that it is necessary to learn view-specific information.
Tensor based methods, including ETLMSC and tSVDMSC, achieve significant improvements compared with all other state-of-the-art methods in most cases. There is a huge gap between tensor based methods and the others, which can be attributed to the effectiveness of tensor based correlation exploration. In Fig. 4, we also present the confusion matrices of three tensor based methods on the Scene15 dataset. The row and column names correspond to the ground-truth and predicted labels, respectively. We can see that, compared with LTMSC, our proposed ETLMSC and tSVDMSC achieve much better results in almost all classes in terms of accuracy, which can be attributed to the effectiveness of the t-SVD decomposition based tensor nuclear norm. Compared with tSVDMSC, our ETLMSC improves slightly in many categories, which can also be verified by the accuracy.
Compared with RMSC, which is also a Markov chain based method, our proposed ETLMSC gains significant improvement. The main reason is that RMSC only captures the shared information among different views, while ETLMSC incorporates view-specific information that is useful for clustering. By regularizing the essential tensor with the t-SVD based tensor nuclear norm, our method can well preserve the principal components of the multi-view representations.
Tensor rotation plays an important role in our method. Besides the complexity reduction, it can also largely improve the performance, which has already been validated by tSVDMSC [29]. We can see that ETLMSC achieves much better results than URETLMSC on all datasets. The main reason is that after rotation, we can thoroughly investigate the complementary information among different views, as the SVD is performed on each matrix composed of different view features after the FFT. Without rotation, however, the arrangement of similarity coefficients could be destroyed in the Fourier domain, so that complementary information cannot be effectively explored. Therefore, URETLMSC only sometimes shows performance comparable with the state-of-the-art methods.
V-B2 Parameter Sensitivity Analysis
There are mainly two parameters in our model: the balance parameter $\lambda$ and the standard deviation $\sigma$ of the Gaussian kernel used to compute the similarity. In experiments, we find the optimal value of $\lambda$ by grid searching. As for $\sigma_v$ of the $v$th view, we directly set it to the average Euclidean distance between all $v$th view features, the same as RMSC. We present the evaluation results of our proposed ETLMSC method on the first six datasets with respect to different $\lambda$ and different ratios of $\sigma$ in Figs. 5 and 6, respectively. From Fig. 5, we can observe that on these datasets, the performance of our proposed ETLMSC is relatively stable when $\lambda$ varies within a certain range. $\lambda$ plays an important role in balancing the contributions of the two terms. When it is very small (close to 0), the $\ell_{2,1}$ norm regularization on $\mathcal{E}$ will not work: $\|\mathcal{Z}\|_{\circledast}$ will be minimized as much as possible, which drives $\mathcal{Z}$ toward zero, so the result is very bad. Moreover, the optimal parameter $\lambda$ for each dataset is reported in its corresponding table.
As for $\sigma$, all results of ETLMSC presented in Tables III–VI are based on the default ratio. From Fig. 6, we can see that our method is not sensitive to this parameter when it varies within a fairly large range. $\sigma$ controls the discrimination of the similarity. When $\sigma$ is too small (or too large), all similarities will be close to 0 (or 1). It is then hard to distinguish the differences, which leads to bad results. We can also see that with a proper ratio, the performance can be further improved, especially on the BBCSport, UCIDigits, COIL20, Scene15, and MITIndoor67 datasets.
For the parameters $\rho$ and $\mu$ of ADMM, we directly adopt the suggestion of [38] and keep them fixed. These two parameters mainly influence the number of iterations needed for convergence.
V-B3 Convergence Analysis
The theoretical convergence of our algorithm has already been proved in [38]. In Fig. 7, we show the total error of our algorithm at each iteration on the COIL20, NottingHill, and Caltech101 datasets. Here, the total error is defined as the maximum of the changes of the variables in each iteration, $\|\mathcal{Z}_{t+1} - \mathcal{Z}_t\|_\infty$ and $\|\mathcal{E}_{t+1} - \mathcal{E}_t\|_\infty$, and the reconstruction error $\|\hat{\mathcal{P}} - \mathcal{Z}_{t+1} - \mathcal{E}_{t+1}\|_\infty$.
Methods  RMSC  DiMSC  LTMSC  ECMSC  tSVDMSC  ETLMSC(Ours) 

Complexity  
Time on BBCSport  4.5  35.8  23.4  78.7  10.6  2.1 
Time on COIL20  74.8  1075.1  375.9  954.2  103.4  19.6 
Time on UCIDigit  214.6  2706.4  959.3  468.5  225.7  54.6 
Time on NottingHill  2531.3  43813.6  10408.7  6319.3  3373.3  562.8 
Time on Scene15  2407.9  38904.7  9270.6  5663.9  2627.8  489.7 
Time on MITIndoor67  3796.5  66274.3  15759.2  9673.2  5957.5  930.5 
Time on Caltech101  15710.9  218825.5  76833.2  41558.6  18929.7  5395.7 
According to Fig. 7, we can see that the error decreases as the number of iterations increases. Our algorithm converges within a small number of iterations, which also holds on the other datasets. As we can compute the closed-form solutions at each iteration with relatively low computation complexity, our algorithm is very efficient.
V-B4 Complexity Comparison
In Table VII, we present the computational complexity and running time of the state-of-the-art methods on all these datasets. Since all these methods share a similar post-processing procedure with the same complexity, we only report the computational complexity and running time for learning the affinity matrix. We need to mention that the number of iterations has an obvious effect on the running time, and parameter selection influences the number of iterations; this is why the running time of ECMSC on the UCIDigit dataset can be shorter than that on the COIL20 dataset. Our method has the lowest complexity and the shortest processing time among these related approaches on all datasets, which demonstrates its efficiency. For example, on the COIL20 dataset, our algorithm finishes within 20 seconds, while the second fastest method, RMSC, needs more than 70 seconds, and t-SVD-MSC costs more than 100 seconds. On the largest Caltech101 dataset, our method saves much time compared with t-SVD-MSC.
V-B5 Representation Visualization
In Fig. 8, we visualize the learned optimal transition probability matrix. Due to the limitation of space, we only present the results of the two Markov chain based spectral clustering methods (RMSC and our proposed ETLMSC) on the COIL20 dataset. For ETLMSC, the transition probability matrix is computed as the average of the lateral slices of the optimal essential tensor. Yellow represents large values. Compared with the result of RMSC in Fig. 8(a), the result of ETLMSC in Fig. 8(b) is clearly much better, as most large values concentrate on the diagonal blocks. This can also be verified by comparing the experimental results in Tables III-V. While RMSC only captures the information shared among different views, our ETLMSC method more meaningfully explores high-order multi-view correlations through the tensor formulation.
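A minimal sketch of this averaging step (the axis layout, with views stacked along the last axis, is an assumption, as is the final row normalization):

```python
import numpy as np

V, n = 3, 6
# stand-in for the learned essential tensor: one n-by-n slice per view
G = np.random.rand(n, n, V)

# collapse the view dimension by averaging the V slices
P = G.mean(axis=2)

# row-normalize so each row is a valid transition probability
# distribution before feeding it to spectral clustering
P = P / P.sum(axis=1, keepdims=True)
```

The resulting single matrix `P` is what gets visualized as the heat map in Fig. 8(b), where a clean block-diagonal pattern indicates well-separated clusters.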
V-B6 Comparison with t-SVD-MSC
t-SVD-MSC [29] achieves very good performance for the task of multi-view clustering. Both the proposed ETLMSC and t-SVD-MSC [29] are based on the tensor nuclear norm defined by the t-SVD. But there are many differences. First, the construction of the affinity matrix and the tensor is totally different: we adopt the Markov chain to compute the transition probability matrix, while t-SVD-MSC is based on self-representation, which has high computational complexity and relies on the assumption of an underlying subspace structure. Second, the model and optimization process are quite different: we directly decompose the transition probability tensor into two parts with low-rank and sparse constraints, while their method needs to optimize the self-representation coefficients, so the optimization process is also different. Most importantly, based on the experimental results presented above, our method achieves better performance than t-SVD-MSC with much lower complexity and less processing time.
VI Conclusion and Future Work
In this paper, we propose a novel essential tensor learning method for Markov chain based multi-view spectral clustering. Based on the multi-view transition probability matrices, we construct a third-order tensor. We explore the high-order correlations among multiple views by learning the essential tensor with a low-rank constraint based on the t-SVD based tensor nuclear norm. With the tensor rotation operation, the proposed algorithm can be optimized efficiently and the principal components can be well preserved. We evaluate the performance of our method on seven datasets covering different applications, and it achieves superior performance compared with the state-of-the-art methods.
For future work, we would like to focus on fast and scalable algorithms, such as sampling techniques or recovering the subspace of the whole tensor from a much smaller seed tensor, so that the computational complexity of the proposed model can be further reduced, which will make ETLMSC more suitable for large-scale applications.
Acknowledgment
We would like to thank Dr. Yuan Xie for his selfless support in sharing codes and datasets, as well as for his valuable suggestions.
References

[1] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proceedings of the Neural Information Processing Systems, pp. 849–856, 2002.
[2] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[3] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[4] A. Kumar, P. Rai, and H. Daume, “Co-regularized multi-view spectral clustering,” in Proceedings of the Neural Information Processing Systems, pp. 1413–1421, 2011.
[5] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, “Diversity-induced multi-view subspace clustering,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 586–594, 2015.
[6] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in Proceedings of the AAAI, pp. 2149–2155, 2014.
[7] X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Z. Li, “Exclusivity-consistency regularized multi-view subspace clustering,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 923–931, 2017.
[8] X. Xie, X. Guo, G. Liu, and J. Wang, “Implicit block diagonal low-rank representation,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 477–489, 2018.
[9] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in Proceedings of the International Conference on Machine Learning, pp. 663–670, 2010.
[10] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization,” in Proceedings of the Neural Information Processing Systems, pp. 2080–2088, 2009.
[11] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5249–5257, 2016.
[12] P. Zhou and J. Feng, “Outlier-robust tensor PCA,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3938–3946, 2017.
[13] P. Zhou, C. Lu, Z. Lin, and C. Zhang, “Tensor factorization for low-rank tensor completion,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1152–1163, 2018.
[14] H. Kong, X. Xie, and Z. Lin, “t-Schatten-p norm for low-rank tensor recovery,” IEEE Journal of Selected Topics in Signal Processing, 2018.
[15] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
[16] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,” 1970.
[17] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
[18] M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover, “Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 1, pp. 148–172, 2013.
[19] B. Huang, C. Mu, D. Goldfarb, and J. Wright, “Provable low-rank tensor recovery,” Optimization-Online, vol. 4252, p. 2, 2014.
[20] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer, “Novel methods for multilinear data completion and de-noising based on tensor-SVD,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 3842–3849, 2014.
[21] A. Kumar and H. Daumé, “A co-training approach for multi-view spectral clustering,” in Proceedings of the International Conference on Machine Learning, pp. 393–400, 2011.
[22] H. Wang, C. Weng, and J. Yuan, “Multi-feature spectral clustering with minimax optimization,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 4106–4113, 2014.
[23] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[24] D. Zhou, J. Huang, and B. Schölkopf, “Learning from labeled and unlabeled data on a directed graph,” in Proceedings of the International Conference on Machine Learning, pp. 1036–1043, 2005.
[25] D. Zhou and C. J. Burges, “Spectral clustering and transductive learning with multiple views,” in Proceedings of the International Conference on Machine Learning, pp. 1159–1166, 2007.
[26] Y. Wang, L. Wu, X. Lin, and J. Gao, “Multi-view spectral clustering via structured low-rank matrix factorization,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
[27] M. Brbić and I. Kopriva, “Multi-view low-rank sparse subspace clustering,” Pattern Recognition, vol. 73, pp. 247–258, 2018.
[28] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, “Low-rank tensor constrained multi-view subspace clustering,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1582–1590, 2015.
[29] Y. Xie, D. Tao, W. Zhang, Y. Liu, L. Zhang, and Y. Qu, “On unifying multi-view self-representations for clustering by tensor multi-rank minimization,” International Journal of Computer Vision, pp. 1–23, 2018.
[30] M. Yin, J. Gao, S. Xie, and Y. Guo, “Multi-view subspace clustering via tensorial t-product representation,” IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1–14, 2018.
[31] C. Zhang, Q. Hu, H. Fu, P. Zhu, and X. Cao, “Latent multi-view subspace clustering,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, vol. 30, pp. 4279–4287, 2017.
[32] C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu, “Generalized latent multi-view subspace clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[33] C.-G. Li and R. Vidal, “Structured sparse subspace clustering: A unified optimization framework,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 277–286, 2015.
[34] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, “Multi-view clustering via canonical correlation analysis,” in Proceedings of the International Conference on Machine Learning, pp. 129–136, 2009.
[35] C. Cortes, M. Mohri, and A. Rostamizadeh, “Learning non-linear combinations of kernels,” in Proceedings of the Neural Information Processing Systems, pp. 396–404, 2009.
[36] J. Xu, J. Han, and F. Nie, “Discriminatively embedded k-means for multi-view clustering,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5356–5364, 2016.
[37] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[38] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proceedings of the Neural Information Processing Systems, pp. 612–620, 2011.
[39] W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang, “The twist tensor nuclear norm for video completion,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2017.
[40] D. Greene and P. Cunningham, “Practical solutions to the problem of diagonal dominance in kernel document clustering,” in Proceedings of the International Conference on Machine Learning, pp. 377–384, 2006.
[41] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007.
[42] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[43] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. Von Der Malsburg, R. P. Wurtz, and W. Konen, “Distortion invariant object recognition in the dynamic link architecture,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 300–311, 1993.
[44] Y.-F. Zhang, C. Xu, H. Lu, and Y.-M. Huang, “Character identification in feature-length films using global face-name matching,” IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1276–1288, 2009.
[45] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 524–531, 2005.
[46] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random forests and ferns,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1–8, 2007.
[47] J. Wu and J. M. Rehg, “CENTRIST: A visual descriptor for scene categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, 2011.
[48] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 413–420, 2009.
[49] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[50] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
[51] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[52] D. Arpit, I. Nwogu, and V. Govindaraju, “Dimensionality reduction with subspace structure preservation,” in Proceedings of the Neural Information Processing Systems, pp. 712–720, 2014.
[53] G. Zhang, R. He, and L. S. Davis, “Jointly learning dictionaries and subspace structure for video-based face recognition,” in Proceedings of the IEEE Asian Conference on Computer Vision, pp. 97–111, 2014.