I Introduction
In the era of big data, data may have different views (i.e., variety), where observations are represented by multiple sources; such data are known as multiview data. For instance, a specific news story may be available on different broadcasting websites such as BBC and CNN, so that each website represents a view of the same news. As another example, in Wikipedia, the concept of a dog may have multiple representations in the form of images and text. Fig. 1 shows these examples of multiview data.
Multiview data commonly have the following properties:

Each view may be represented by an arbitrary type and number of features, and may be collected from diverse domains (the variety of big data). Different views often contain information that is complementary and compatible with each other [1]. For instance, in Wikipedia, one view (image) consists of visual features, while another view (text) has textual features.

Due to inevitable system errors caused by data extractors, each view may be erroneous (the veracity of big data). Generally, error refers to the deviation between the model assumption and the data; in reality, it can exhibit itself as noise or corruptions [2]. Fig. 2 illustrates three types of errors. Noise refers to slight perturbation of a random subset of entries in the data. Random corruptions indicate that a fraction of random entries are grossly perturbed, while sample-specific corruptions (or outliers) represent the phenomenon that a fraction of the samples (or data points) in each view are far away from their real values. Real-world multiview data can encounter any one or a combination of these error types.
Due to the ever-increasing need to learn from multiview data and the lack of label information in many applications, clustering on multiple views has recently received considerable attention [3, 4, 1]. The problem is referred to as multiview clustering, which aims at finding compatible clusters of data points across all views. One of the most common algorithms used for multiview clustering is spectral clustering [5].
Spectral clustering models the data as a graph, where nodes are data points and edges represent similarities between data points [5]. First, it projects all the data points into a new low-dimensional space where they are easily separable. The new space is built with the eigendecomposition of the Laplacian matrix of the graph. Then, it finds the clusters by applying another clustering algorithm such as k-means. It has been shown that spectral clustering finds the normalized min-cut of the graph [5]. There is also a hidden relation between spectral clustering and Markov chains [6]. In single-view clustering, the Laplacian of the graph can be obtained by real relaxation of the combinatorial normalized cut [7], and it is then converted to a transition probability matrix which generates a Markov chain on the graph [7]. In the context of multiview clustering, the transition probability matrix needs to be built across all views.
A challenging problem arises when the views are erroneous, which causes the corresponding transition probability matrices to be perturbed. This can result in a portion of the data points being assigned to wrong clusters. To address multiview clustering with noisy views, Xia et al. proposed a method named RMSC [7]. This approach decomposes the transition probability matrix of each view into two parts: a shared transition probability matrix across all views and an error matrix which encodes the noise in the transition probability matrix of that view. The error matrix of each view captures the difference between the transition probabilities of that view and their counterparts in the shared transition probability matrix. RMSC assumes a sparse error matrix via the ℓ₁ norm.
One of the shortcomings of RMSC is that it only handles noise in the data. Specifically, since it only imposes the ℓ₁ norm on the error matrices, it cannot deal well with sample-specific corruptions; an error matrix with sample-specific corruptions has sparse row supports instead. Also, RMSC treats each error matrix independently. However, data may come from various sources, which could result in error matrices with inconsistent magnitude values, and thus degrade clustering performance when the error matrices are treated independently [2].
To handle the typical types of errors in multiview data, we propose a novel Error-Robust Multiview Clustering (EMVC) method based on Markov chains. Different from RMSC [7], EMVC integrates low-rank decomposition with group ℓ₁ [8] and ℓ₂,₁ regularization terms, aiming to learn a shared transition probability matrix to which the transition probability matrices of the different views are co-regularized as a common consensus, while improving the robustness of clustering.
In some cases, the features of a certain view are more or less discriminative for the clusters. As a result, error in more discriminative features can substantially decrease clustering performance. To improve robustness against this kind of error, we impose the group ℓ₁ norm on the error matrix. In contrast to the ℓ₁ norm, the group ℓ₁ norm learns group-wise feature importance of one view on each cluster and thus improves robustness of clustering against erroneous discriminative features. Through various experiments, we also show that by using the group ℓ₁ rather than the ℓ₁ norm, the proposed EMVC method achieves better clustering performance on both non-erroneous and erroneous datasets.
To deal with sample-specific corruptions, EMVC imposes the ℓ₂,₁ norm on the error matrix because, similar to [2], an error matrix with this error type has sparse row supports. Furthermore, since data may come from multiple heterogeneous sources, the error matrices could have inconsistent magnitude values. In contrast to RMSC, and as suggested in [2], with the aim of increasing clustering performance we enforce the columns of the error matrices of all views to have jointly consistent magnitude values by vertically concatenating the error matrices of all views.
The ℓ₂,₁ and group ℓ₁ norms are two nonsmooth structured sparsity-inducing norms, which make the corresponding objective function of EMVC challenging to optimize. We present a reformulation of the objective function and propose a new efficient optimization algorithm based on the Augmented Lagrangian Multiplier scheme [9]. We also present a rigorous proof of convergence for the optimization technique. Our contributions can be summarized as follows:

To the best of our knowledge, EMVC is the first work that can address any one or a combination of the typical types of errors in multiview clustering via a combination of the ℓ₂,₁ and group ℓ₁ norms. Since it is generally hard to know which type of error is incurred in each view, it is important to have an all-encompassing approach that can handle any one or a combination of the typical error types.

EMVC is the first Markov chains method that does not treat the error matrices independently. Independent treatment of the error matrices can decrease clustering performance [2]. The proposed EMVC method enforces the error matrices of all views to have jointly consistent magnitude values.

We propose a new efficient optimization algorithm to solve the EMVC optimization problem, along with rigorous proof of convergence.

Through extensive experiments on synthetic and real-world datasets, we show that EMVC is superior to several state-of-the-art multiview clustering methods and robust against the typical error types.
II Preliminaries
In this section, we introduce related concepts and notation. The mathematical notation used in the rest of the paper is summarized in Table I.
II-A Transition Probability Matrix
Given a graph G with n nodes, an n×n square matrix called the transition probability matrix, denoted P, is defined over G; it contains the transitions of a Markov chain. Each element p_ij of P denotes a probability (i.e., p_ij ≥ 0), and all outgoing transitions from a specific state have to sum to one (i.e., Σ_j p_ij = 1). Each row of P is a probability distribution over the transitions of the corresponding state.
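As a minimal illustration (in Python, with hypothetical numbers), the two defining properties just stated, nonnegative entries and rows summing to one, can be checked directly:

```python
import numpy as np

# A hypothetical 3-state transition probability matrix.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])

# Every entry is a probability ...
assert np.all(P >= 0)
# ... and each row (the outgoing transitions of one state) sums to one.
assert np.allclose(P.sum(axis=1), 1.0)
```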
Symbol  Definition and description

x  each lowercase letter represents a scalar
x (boldface)  each boldface lowercase letter represents a vector
X (boldface)  each boldface uppercase letter represents a matrix
⟨·, ·⟩  denotes the inner product
rank(·)  denotes the rank of a matrix
tr(·)  denotes the trace of a matrix
II-B Spectral Clustering via Markov Chains
Spectral clustering seeks clusters of data points in a weighted graph whose vertices are data points and whose edges represent the similarity between the two data points they connect. There is a relationship between spectral clustering and the transition probability matrix [6]. Spectral clustering on a graph G is equivalent to finding clusters on G such that the Markov random walk remains within the same cluster for a long time and jumps infrequently between clusters [7].
In the context of clustering, a natural way to construct a transition probability matrix is to first build a similarity matrix A between pairs of data points and then calculate the corresponding transition probability matrix as P = D⁻¹A, where D denotes the degree matrix of the graph G. One way to build a similarity matrix is to use Gaussian kernels [7]. Let a_ij denote the similarity of a pair of data points x_i and x_j. It can be calculated as follows:
a_ij = exp( −‖x_i − x_j‖₂² / (2σ²) )  (1)

where ‖·‖₂ denotes the ℓ₂ norm and σ indicates the standard deviation (e.g., it can be set to the median of the Euclidean distances over all pairs of data points). Algorithm 1 summarizes the overall scheme for computing the transition probability matrix.

The steps of spectral clustering via Markov chains are described in Algorithm 2 [10]. To perform clustering using Markov chains, a crucial step is to build the transition probability matrix over graph G in Line 1. A stationary distribution of the chain is obtained in Line 2. Two auxiliary matrices are computed in Lines 3 and 4. Finally, k-means is performed on the eigenvectors of the generalized eigenproblem (Lines 5 and 6).

III Error-Robust Multiview Clustering
In this section, we first propose a novel error-robust multiview clustering algorithm, and then develop a new efficient iterative algorithm to solve the formulated nonsmooth objective function.
Problem. In the setting of clustering, given n distinct data points or samples with m related views, the views are denoted as X^(1), …, X^(m). The goal is to derive a clustering solution across all views. We assume that the features in each individual view are sufficient for obtaining most of the clustering information and that each individual view might be erroneous.
III-A Transition Probability Matrix Construction
III-B Problem Formulation
Assuming that each individual view might be erroneous, so that it causes wrong assignment of data points to clusters, each transition probability matrix P^(v) can be decomposed into two terms: a shared transition probability matrix P̂ and an error matrix E^(v) that indicates the error in the transition probabilities of view v:

P^(v) = P̂ + E^(v),  v = 1, …, m  (2)
We use P̂ as the input transition probability matrix to the Markov chains method (i.e., Algorithm 2) to obtain the clustering solution. Using the transition probability construction method described in Algorithm 1, we get the initial transition probability matrices P^(v) with respect to each individual view.
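As a concrete illustration, the per-view construction of Algorithm 1 (Gaussian-kernel similarities followed by degree normalization, as described in Section II-B) might be sketched as follows in Python; the function name and toy data are our own:

```python
import numpy as np

def transition_matrix(X):
    """Build a row-stochastic transition probability matrix from data X (n x d)."""
    # Pairwise Euclidean distances between all data points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Sigma: median distance over all pairs of distinct points.
    sigma = np.median(dist[np.triu_indices_from(dist, k=1)])
    # Gaussian-kernel similarities, as in Eq. (1).
    A = np.exp(-dist ** 2 / (2 * sigma ** 2))
    # Normalize by the degree matrix: P = D^{-1} A.
    return A / A.sum(axis=1, keepdims=True)

# Toy single-view data: 8 points in 3 dimensions.
X = np.random.RandomState(0).randn(8, 3)
P_v = transition_matrix(X)
```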
In order to approximate the shared transition probability matrix while reducing its complexity, we minimize rank(P̂) (i.e., the low-rankness criterion). Since the data may come from different sources, the error matrices could have inconsistent magnitude values, which adversely affects clustering performance [2]. To enforce the columns of the error matrices across views to have jointly consistent magnitude values, we vertically concatenate the error matrices of the m views along their columns (i.e., E = [E^(1); …; E^(m)]).
In some cases, error may appear in features of a specific view which are more or less discriminative for clustering. Compared to error in less discriminative features, error in more discriminative features degrades clustering performance significantly. For example, color features substantially affect the detection of traffic lights and trees, whereas they are irrelevant for finding cars in the context of image clustering; error in color features would thus substantially decrease clustering performance in the former case. To improve the robustness of clustering against error in such features, we impose the group ℓ₁ norm [8] on E. The group ℓ₁ norm can be defined as follows:
‖E‖_G₁ = Σ_{i=1..n} Σ_{v=1..m} ‖e_i^(v)‖₂  (3)

where e_i^(v) denotes the segment of the i-th column of E that belongs to view v. This norm uses the ℓ₂ norm within each view and the ℓ₁ norm between views. Thus, it enforces sparsity between different views, i.e., if the features of one view are not discriminative for clustering, Eq. (3) will assign zeros to them.
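Under our reading of the notation (E as the vertical concatenation of m per-view error matrices, with e_i^(v) the per-view segment of column i), the group ℓ₁ norm of Eq. (3) is a few lines of NumPy:

```python
import numpy as np

def group_l1_norm(E_views):
    """Group l1 norm of Eq. (3): the l2 norm of each view's segment of each
    column, summed (l1-style) over columns and views."""
    return sum(np.linalg.norm(E[:, i])      # l2 within a view segment
               for E in E_views             # l1 across views ...
               for i in range(E.shape[1]))  # ... and across columns

# Two hypothetical 2x2 per-view error matrices.
E1 = np.array([[3.0, 0.0], [4.0, 0.0]])
E2 = np.array([[0.0, 5.0], [0.0, 12.0]])
print(group_l1_norm([E1, E2]))  # 5 + 0 + 0 + 13 = 18.0
```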
In sample-specific corruptions, E has sparse row supports [2]. Thus, to handle error in specific samples, we add the ℓ₂,₁ penalty [12] on E. The ℓ₂,₁ norm is defined as follows:

‖E‖₂,₁ = Σ_j ‖e^j‖₂  (4)

where e^j denotes the j-th row of E.
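Equation (4) is equally direct to compute; rows that are entirely zero contribute nothing, which is why minimizing this norm encourages sparse row supports (few corrupted samples):

```python
import numpy as np

def l21_norm(E):
    """l2,1 norm of Eq. (4): sum of the l2 norms of the rows of E."""
    return np.linalg.norm(E, axis=1).sum()

E = np.array([[3.0, 4.0],
              [0.0, 0.0],     # a clean sample: zero row, zero contribution
              [5.0, 12.0]])   # a corrupted sample: nonzero row
print(l21_norm(E))  # 5 + 0 + 13 = 18.0
```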
Based on the above considerations, the objective function for obtaining a shared transition probability matrix is formulated as follows:

min_{P̂, E^(v)}  rank(P̂) + λ₁‖E‖_G₁ + λ₂‖E‖₂,₁
s.t.  P^(v) = P̂ + E^(v) (v = 1, …, m),  P̂ ≥ 0,  P̂1 = 1  (5)

where rank(P̂) is the rank of P̂, 1 denotes the vector of all ones, and λ₁ and λ₂ are nonnegative tradeoff parameters. E is obtained by vertically concatenating the error matrices of all views. To enforce P̂ to be a transition probability matrix, the two constraints P̂ ≥ 0 and P̂1 = 1 are imposed.
Since rank(P̂) is nonconvex, the objective function in Eq. (5) is an instance of an NP-hard problem. One natural way around this is to replace rank(P̂) with the trace norm ‖P̂‖_*. The resulting objective function is as follows:

min_{P̂, E^(v)}  ‖P̂‖_* + λ₁‖E‖_G₁ + λ₂‖E‖₂,₁
s.t.  P^(v) = P̂ + E^(v) (v = 1, …, m),  P̂ ≥ 0,  P̂1 = 1  (6)
The trace norm is the convex envelope of the rank function. Therefore, minimizing the trace norm of a matrix often induces a low-rank structure on it [13, 14]. Fig. 6 visualizes the proposed EMVC method: we first build initial transition probability matrices that might be erroneous, and then use decomposition via low-rankness and regularization to obtain a shared transition probability matrix.
III-C Optimization Procedure
The objective function in Eq. (6) imposes a probabilistic simplex constraint on each row of P̂. We use an Augmented Lagrangian Multiplier scheme [9] to solve the optimization problem. By introducing an auxiliary variable, the objective function in Eq. (6) can be stated equivalently as follows:
(7)  
The corresponding augmented Lagrangian function of Eq. (7) is:
(8) 
where the two Lagrange multipliers enforce the introduced equality constraints, and the adaptive penalty parameter can be adjusted efficiently according to [15].
The iterative algorithm for solving Eq. (8) is shown in Algorithm 3. We detail each step as follows.
Solving the low-rank variable. When the other variables are fixed, the objective function w.r.t. it can be stated as:
(9) 
The optimization problem in Eq. (9) is equivalent to the following objective function:
(10) 
To solve Eq. (10), we use the Singular Value Thresholding (SVT) method [16]. Let UΣV⊤ denote the SVD form of the matrix being thresholded; the solution can then be obtained using the following equation:

(11)

where the shrinkage operator [9] is applied to the singular values.
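A sketch of this singular value thresholding step, assuming a generic threshold tau (in Algorithm 3 the threshold would come from the penalty parameter; the function name is ours):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: soft-threshold the singular values of M.
    This is the proximal operator of tau * ||.||_* (the trace norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # the shrinkage operator
    return U @ np.diag(s_shrunk) @ Vt

# Shrinking the identity scales its (unit) singular values down to 0.5.
out = svt(np.eye(3), 0.5)
```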
Solving the error matrices. When the other variables are fixed, the objective function w.r.t. them can be stated as:
(12) 
The optimization problem in Eq. (12) can be stated equivalently as follows:
(13) 
where E is constructed by vertically concatenating the per-view matrices along their columns. Taking the derivative of Eq. (13) w.r.t. E and setting it to zero, we have the following result for each column of E:

(14)

where e_i denotes the i-th column of E, and the i-th columns of the other matrices in Eq. (14) are defined analogously. One term is a block diagonal matrix whose v-th diagonal block acts on the segment of e_i that contains the representation errors of the v-th view, together with an identity matrix of the appropriate size; another term is a diagonal matrix whose j-th diagonal element is determined by the ℓ₂ norm of the j-th row of E. E can be obtained by:

(15)
Solving the auxiliary variable. Fixing the other variables, we need to solve the following objective function to obtain it:
(16)  
The objective function in Eq. (16) can be converted to the following equivalent form:
(17)  
For ease of presentation, we define a new variable as follows:
(18) 
Then the objective function in Eq. (17) can be converted into the following equivalent form:
(19) 
Eq. (19) can be further rewritten as:
(20) 
where the i-th rows of the two matrices in Eq. (20) are compared. The optimization problem in Eq. (20) decomposes into n independent subproblems. Each subproblem is a proximal operator problem with a probabilistic simplex constraint that can be efficiently solved by the projection algorithm [17]. The algorithm for this optimization procedure is shown in Algorithm 4.
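Each subproblem projects one row onto the probability simplex. A sketch of the standard sort-based Euclidean projection (a common realization of such projection algorithms; not necessarily the exact routine of [17]):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(v)[::-1]                 # sort in decreasing order
    css = np.cumsum(u) - 1.0             # cumulative sums minus target mass
    ind = np.arange(1, len(v) + 1)
    rho = ind[u - css / ind > 0][-1]     # number of strictly positive entries
    theta = css[rho - 1] / rho           # uniform shift
    return np.maximum(v - theta, 0.0)

row = project_simplex(np.array([0.9, 0.6, -0.2]))
```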
Updating the first Lagrange multiplier. It can be obtained using the following update:
(21) 
Updating the second Lagrange multiplier. It can be obtained using the following update:
(22) 
III-D Computational and Convergence Analysis
In Algorithm 3, Lines 2-5 update the error matrices with quadratic complexity. Lines 6-8 update the auxiliary matrix. Instead of computing a matrix inverse with cubic complexity, we can solve a system of linear equations with quadratic complexity. If sufficient computational resources are available, each column can be computed in parallel efficiently. Updating the low-rank variable in Line 9 requires solving an SVD problem, which has cubic complexity. The Lagrange multipliers can be updated with quadratic complexity. Line 16 applies spectral clustering via Markov chains (i.e., Algorithm 2) on the shared transition probability matrix; this step can be done with cubic complexity. When sufficient computational resources are available and parallel computing is used, both the SVD and the linear systems can be solved efficiently.
For convergence analysis, the following theorem guarantees the convergence of Algorithm 3.
Theorem. Algorithm 3 decreases the objective value of Eq. (6) in each iteration.
Proof. To obtain the updated variables (i.e., their values at the (t+1)-th iteration), according to Algorithm 3, we know that
(23) 
where the superscript denotes the iteration index and the subscript indexes the columns. According to Algorithm 3, to obtain the next iterate, the following problem must be solved:
(24) 
Considering Eq. (23) and Eq. (24), we have the following:
(25) 
Substituting the variables by their definitions results in the following:
(26)  
where the row segments correspond to the individual views. We can derive the following two inequalities from the standard result that, for any vectors a and b with ‖b‖₂ > 0, ‖a‖₂ − ‖a‖₂²/(2‖b‖₂) ≤ ‖b‖₂ − ‖b‖₂²/(2‖b‖₂):
(27) 
and
(28) 
Adding Eqs. (25)-(28) on both sides, we obtain that the objective value at iteration t+1 is less than the objective value at iteration t. Therefore, we can conclude that the objective value decreases in each iteration and Algorithm 3 converges.
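For reference, the standard inequality invoked in ℓ₂,₁-style convergence arguments of this kind (for any vectors a and b with ‖b‖₂ > 0) follows by completing the square:

```latex
\|\mathbf{b}\|_2 - \frac{\|\mathbf{b}\|_2^2}{2\|\mathbf{b}\|_2}
- \left( \|\mathbf{a}\|_2 - \frac{\|\mathbf{a}\|_2^2}{2\|\mathbf{b}\|_2} \right)
= \frac{\left( \|\mathbf{a}\|_2 - \|\mathbf{b}\|_2 \right)^2}{2\|\mathbf{b}\|_2}
\;\ge\; 0 .
```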
IV Experimental Evaluation
To empirically evaluate the performance of the proposed EMVC method, we conduct extensive experiments on synthetic and publicly available real-world multiview datasets and compare with six state-of-the-art methods: (1) Best Single View (BSV) performs standard k-means on the most informative view. (2) Feature Concatenation (Feat. Concat.) concatenates the features of all views and then runs standard k-means on the concatenated feature representations. (3) Kernel Addition constructs a kernel matrix for each individual view and then averages these matrices to obtain a single kernel matrix for spectral clustering. (4) Co-regularized Spectral Clustering (CoReg) performs centroid-based and pairwise co-regularized spectral clustering via a Gaussian kernel [18]. The co-regularization parameter is tuned by searching the range suggested by the authors. (5) Robust Multi-View Spectral Clustering via Low-Rank and Sparse Decomposition (RMSC) uses low-rank decomposition and the ℓ₁ norm [7]. The regularization parameter is tuned by searching the range suggested by the authors (we keep the same value for all views). The parameter σ is set to the median of all Euclidean distances over all pairs of data points for each individual view, as suggested by the authors. (6) Parameter-Free Auto-Weighted Multiple Graph Learning (AMGL) finds a cluster indicator matrix over all views by applying normalized cut algorithms on the graphs of the views [19].
We implement four versions of the proposed EMVC method to investigate the effectiveness of its component terms in multiview learning: EMVC using only the first term in Eq. (6), denoted "EMVC(*)"; EMVC using the trace norm and imposing only the group ℓ₁ norm, denoted "EMVC(G₁)"; EMVC using the trace norm and imposing only the ℓ₂,₁ norm, denoted "EMVC(ℓ₂,₁)"; and the full version of EMVC based on Eq. (6). We apply grid search over a range of candidate values to identify optimal values for each regularization hyperparameter. The standard deviation σ is set to the median of all Euclidean distances over all pairs of data points for each individual view.

We use different evaluation metrics, including F-Score, Precision, Recall, Normalized Mutual Information (NMI), Entropy, Accuracy, and Adjusted Rand Index (AR), for the purpose of comprehensive evaluation [18, 20]. All of these measures except Entropy are positive measures, meaning that larger values stand for better performance; for Entropy, smaller values indicate better performance. Different measures reveal different properties, so together they give a comprehensive view of the results. Each experiment is repeated five times, and the mean and standard deviation of each metric on each dataset are reported. We use k-means to obtain the final clustering solution; since k-means is sensitive to initial seed selection, we run k-means 20 times on each dataset.
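For reference, NMI (one of the metrics listed) can be computed from two label assignments. A compact NumPy sketch using one common normalization, the geometric mean of the entropies, is below; well-tested equivalents exist in sklearn.metrics:

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings
    (geometric-mean normalization)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    # Contingency table of co-occurring cluster assignments.
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    C = np.zeros((len(ua), len(ub)))
    np.add.at(C, (ia, ib), 1)
    Pab = C / n
    Pa, Pb = Pab.sum(1), Pab.sum(0)
    nz = Pab > 0
    mi = (Pab[nz] * np.log(Pab[nz] / np.outer(Pa, Pb)[nz])).sum()
    ha = -(Pa[Pa > 0] * np.log(Pa[Pa > 0])).sum()
    hb = -(Pb[Pb > 0] * np.log(Pb[Pb > 0])).sum()
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Note that NMI is invariant to label permutation: relabeling the clusters of either argument leaves the score unchanged, which is why it suits clustering evaluation.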
IV-A Experiments on Real-World Datasets
We conduct experiments on the following publicly available real-world datasets. Statistics of the datasets are summarized in Table II (Max # features indicates the maximum number of features over all views of a dataset).
Dataset  # data points  Max # features  # views  # clusters 

webKB  1051  3000  2  2 
FOX  1523  2711  2  4 
CNN  2107  3695  2  7 
Citeseer  3312  6654  2  6 
CCV  9317  5000  3  20 
Dataset  Method  FScore  Precision  Recall  NMI  Entropy  Accuracy  AR 

WebKB  BSV  0.889(0.000)  0.889(0.000)  0.889(0.000)  0.532(0.000)  0.406(0.000)  0.913(0.000)  0.618(0.000) 
Feat. Concat.  0.947(0.000)  0.947(0.000)  0.947(0.000)  0.717(0.000)  0.214(0.000)  0.963(0.000)  0.845(0.000)  
Kernel Addition  0.946(0.000)  0.946(0.000)  0.946(0.000)  0.717(0.000)  0.214(0.000)  0.963(0.000)  0.845(0.000)  
CoReg  0.949(0.000)  0.949(0.000)  0.949(0.000)  0.733(0.000)  0.195(0.000)  0.965(0.000)  0.853(0.000)  
AMGL  0.794(0.000)  0.794(0.000)  0.794(0.000)  0.015(0.000)  0.752(0.000)  0.783(0.000)  0.013(0.000)  
RMSC  0.956(0.000)  0.956(0.000)  0.956(0.000)  0.758(0.000)  0.189(0.000)  0.970(0.000)  0.871(0.000)  
EMVC (*)  0.568(0.000)  0.568(0.000)  0.568(0.000)  0.000(0.000)  0.757(0.000)  0.511(0.000)  0.000(0.000)  
EMVC ()  0.949(0.000)  0.949(0.000)  0.949(0.000)  0.731(0.000)  0.199(0.000)  0.965(0.000)  0.853(0.000)  
EMVC ()  0.954(0.000)  0.954(0.000)  0.954(0.000)  0.759(0.000)  0.175(0.000)  0.969(0.000)  0.870(0.000)  
EMVC  0.959(0.000)  0.959(0.000)  0.959(0.000)  0.776(0.000)  0.164(0.000)  0.972(0.000)  0.883(0.000)  
FOX  BSV  0.718(0.000)  0.718(0.000)  0.718(0.000)  0.672(0.000)  0.626(0.000)  0.758(0.000)  0.599(0.000) 
Feat. Concat.  0.314(0.000)  0.314(0.000)  0.314(0.000)  0.041(0.000)  1.787(0.000)  0.356(0.000)  0.050(0.000)  
Kernel Addition  0.358(0.000)  0.358(0.000)  0.358(0.000)  0.103(0.000)  1.669(0.000)  0.460(0.000)  0.113(0.000)  
CoReg  0.477(0.006)  0.477(0.006)  0.477(0.006)  0.242(0.002)  1.410(0.002)  0.547(0.000)  0.262(0.000)  
AMGL  0.456(0.000)  0.456(0.000)  0.456(0.000)  0.010(0.000)  1.857(0.000)  0.419(0.000)  0.001(0.000)  
RMSC  0.364(0.005)  0.364(0.005)  0.364(0.005)  0.141(0.000)  1.593(0.001)  0.401(0.001)  0.127(0.000)  
EMVC (*)  0.270(0.009)  0.270(0.009)  0.270(0.009)  0.002(0.002)  1.862(0.002)  0.267(0.003)  0.000(0.000)  
EMVC ()  0.761(0.005)  0.761(0.005)  0.761(0.005)  0.691(0.003)  0.565(0.002)  0.818(0.004)  0.664(0.002)  
EMVC ()  0.761(0.004)  0.761(0.004)  0.761(0.004)  0.691(0.007)  0.565(0.004)  0.818(0.003)  0.664(0.005)  
EMVC  0.761(0.010)  0.761(0.010)  0.761(0.010)  0.691(0.004)  0.565(0.003)  0.818(0.003)  0.664(0.002)  
CNN  BSV  0.388(0.001)  0.388(0.001)  0.388(0.001)  0.405(0.008)  1.736(0.012)  0.486(0.007)  0.228(0.008) 
Feat. Concat.  0.171(0.000)  0.171(0.000)  0.171(0.000)  0.037(0.000)  2.621(0.001)  0.219(0.001)  0.023(0.000)  
Kernel Addition  0.175(0.000)  0.175(0.000)  0.175(0.000)  0.046(0.000)  2.597(0.002)  0.233(0.000)  0.026(0.000)  
CoReg  0.200(0.005)  0.200(0.005)  0.200(0.005)  0.076(0.002)  2.513(0.003)  0.276(0.002)  0.056(0.004)  
AMGL  0.250(0.002)  0.250(0.002)  0.250(0.002)  0.031(0.001)  2.667(0.003)  0.239(0.004)  0.000(0.001)  
RMSC  0.219(0.010)  0.219(0.010)  0.219(0.010)  0.122(0.000)  2.388(0.001)  0.300(0.000)  0.078(0.000)  
EMVC (*)  0.149(0.003)  0.149(0.003)  0.149(0.003)  0.003(0.004)  2.716(0.004)  0.165(0.002)  0.000(0.000)  
EMVC ()  0.557(0.004)  0.557(0.004)  0.557(0.004)  0.536(0.002)  1.279(0.005)  0.655(0.005)  0.472(0.000)  
EMVC ()  0.558(0.004)  0.558(0.004)  0.558(0.004)  0.536(0.003)  1.281(0.004)  0.656(0.001)  0.472(0.001)  
EMVC  0.560(0.013)  0.560(0.013)  0.560(0.013)  0.542(0.005)  1.264(0.003)  0.657(0.002)  0.474(0.002) 
WebKB (http://www.cs.cmu.edu/afs/cs/project/theo20/www/data/): This dataset contains webpages collected from the Texas, Cornell, Washington, and Wisconsin universities. Each webpage is described by a content view and a link view.
FOX (https://sites.google.com/site/qianmingjie/home/datasets/): This dataset is crawled from FOX web news. Each instance is represented in two views: a text view and an image view. Titles, abstracts, and text body contents are extracted as the text view, and the image included in the article is stored as the image view.
CNN (https://sites.google.com/site/qianmingjie/home/datasets/): This dataset is crawled from CNN web news. As with FOX, titles, abstracts, and text body contents are extracted as the text view, and the image included in the article is stored as the image view.
Citeseer (http://linqs.cs.umd.edu/projects//projects/lbc/index.html): This dataset contains a selection of the Citeseer corpus, chosen such that every paper in the final corpus cites or is cited by at least one other paper. The text view consists of the title and abstract of a paper; the link view contains inbound and outbound references.
CCV (http://www.ee.columbia.edu/ln/dvmm/CCV/): This high-rank dataset contains 9317 videos over 20 semantic categories. Two views contain visual features, while the third view consists of audio features.
Dataset  Method  FScore  Precision  Recall  NMI  Entropy  Accuracy  AR 

Citeseer  BSV  0.322(0.000)  0.322(0.000)  0.322(0.000)  0.199(0.000)  2.013(0.000)  0.443(0.000)  0.180(0.000) 
Feat. Concat.  0.326(0.001)  0.326(0.001)  0.326(0.001)  0.204(0.002)  2.001(0.001)  0.452(0.001)  0.185(0.000)  
Kernel Addition  0.346(0.002)  0.346(0.002)  0.346(0.002)  0.232(0.003)  1.943(0.002)  0.456(0.001)  0.200(0.001)  
CoReg  0.356(0.009)  0.356(0.009)  0.356(0.009)  0.174(0.010)  2.088(0.005)  0.378(0.003)  0.123(0.003)  
AMGL  0.303(0.010)  0.303(0.010)  0.303(0.010)  0.005(0.009)  2.517(0.007)  0.213(0.002)  0.000(0.001)  
RMSC  0.271(0.011)  0.271(0.011)  0.271(0.011)  0.154(0.005)  2.139(0.009)  0.365(0.002)  0.105(0.001)  
EMVC (*)  0.172(0.020)  0.172(0.020)  0.172(0.020)  0.001(0.011)  2.519(0.009)  0.183(0.003)  0.000(0.002)  
EMVC ()  0.386(0.006)  0.386(0.006)  0.386(0.006)  0.283(0.007)  1.800(0.005)  0.532(0.004)  0.251(0.002)  
EMVC ()  0.388(0.007)  0.388(0.007)  0.388(0.007)  0.284(0.008)  1.802(0.004)  0.535(0.003)  0.254(0.002)  
EMVC  0.390(0.007)  0.390(0.007)  0.390(0.007)  0.286(0.011)  1.803(0.002)  0.537(0.004)  0.256(0.002)  
CCV  BSV  0.119(0.001)  0.119(0.001)  0.119(0.001)  0.177(0.001)  3.466(0.003)  0.181(0.006)  0.069(0.002) 
Feat. Concat.  0.096(0.001)  0.096(0.001)  0.096(0.001)  0.119(0.001)  3.739(0.010)  0.170(0.002)  0.023(0.001)  
Kernel Addition  0.124(0.002)  0.124(0.002)  0.124(0.002)  0.171(0.001)  3.496(0.005)  0.189(0.009)  0.072(0.002)  
CoReg  0.119(0.009)  0.119(0.009)  0.119(0.009)  0.176(0.001)  3.473(0.075)  0.180(0.010)  0.068(0.040)  
AMGL  0.080(0.010)  0.080(0.010)  0.080(0.010)  0.089(0.001)  3.901(0.009)  0.165(0.006)  0.019(0.002)  
RMSC  0.130(0.005)  0.130(0.005)  0.130(0.005)  0.203(0.002)  3.225(0.020)  0.196(0.005)  0.082(0.005)  
EMVC (*)  0.070(0.005)  0.070(0.005)  0.070(0.005)  0.085(0.006)  4.001(0.004)  0.152(0.009)  0.012(0.000)  
EMVC ()  0.131(0.004)  0.131(0.004)  0.131(0.004)  0.210(0.007)  3.100(0.003)  0.198(0.004)  0.090(0.004)  
EMVC ()  0.131(0.004)  0.131(0.004)  0.131(0.004)  0.211(0.006)  3.090(0.004)  0.198(0.005)  0.090(0.002)  
EMVC  0.141(0.009)  0.141(0.009)  0.141(0.009)  0.300(0.009)  2.987(0.002)  0.203(0.008)  0.091(0.004) 
Method  FScore  Precision  Recall  NMI  Entropy  Accuracy  AR 

BSV  0.655(0.000)  0.655(0.000)  0.655(0.000)  0.246(0.000)  0.758(0.000)  0.771(0.000)  0.293(0.000) 
Feat. Concat.  0.748(0.000)  0.748(0.000)  0.748(0.000)  0.424(0.000)  0.581(0.000)  0.849(0.000)  0.486(0.000) 
Kernel Addition  0.760(0.000)  0.760(0.000)  0.760(0.000)  0.439(0.000)  0.564(0.000)  0.859(0.000)  0.515(0.000) 
CoReg  0.750(0.000)  0.750(0.000)  0.750(0.000)  0.437(0.000)  0.569(0.000)  0.850(0.000)  0.489(0.000) 
AMGL  0.579(0.000)  0.579(0.000)  0.579(0.000)  0.116(0.003)  0.883(0.003)  0.696(0.002)  0.153(0.004) 
RMSC  0.736(0.000)  0.736(0.000)  0.736(0.000)  0.375(0.000)  0.624(0.000)  0.844(0.000)  0.472(0.000) 
EMVC (*)  0.499(0.000)  0.499(0.000)  0.499(0.000)  0.000(0.000)  1.000(0.000)  0.501(0.000)  0.000(0.000) 
EMVC ()  0.730(0.000)  0.730(0.000)  0.730(0.000)  0.366(0.000)  0.634(0.000)  0.840(0.000)  0.461(0.000) 
EMVC ()  0.730(0.000)  0.730(0.000)  0.730(0.000)  0.366(0.000)  0.634(0.000)  0.840(0.000)  0.461(0.000) 
EMVC  0.762(0.000)  0.762(0.000)  0.762(0.000)  0.449(0.000)  0.555(0.000)  0.860(0.000)  0.517(0.000) 
Tables III and IV report the performance comparison on the real-world datasets. From these tables, we have several observations. First, EMVC is consistently better than the baselines by a large margin. Specifically, compared with RMSC, even the degenerative versions of EMVC do considerably better on some of the datasets, such as FOX, CNN, and Citeseer. This observation is consistent with our analysis that the group ℓ₁ norm achieves better performance than the ℓ₁ norm. Second, the full version of EMVC is superior to all three of its degenerative versions. This validates the correctness of our objective function and demonstrates the importance of having an all-encompassing approach.
IV-B Experiments on Synthetic Noisy Dataset
Using settings similar to [18], the synthetic dataset consists of two views, and the data in each view are partitioned into two clusters. Eq. (29) shows the cluster means and covariances for each view. In each view, the two clusters overlap, which is the source of noise in the transition probabilities of each view. First, we choose the cluster each sample belongs to, and then produce the views from a mixture of two bivariate Gaussian distributions. For each view, we sample 500 data points from each of the clusters.
where the means denote the cluster centers of each cluster in each view. One covariance matrix is shared by the first and second clusters in the first and second views, respectively, and the other covariance matrix by the second and first clusters in the first and second views, respectively. Table V presents the comparison results on this dataset. With this type of noise, the proposed EMVC method shows superior clustering performance over all the baselines.
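The generative process described above can be sketched as follows; the specific means and covariances are our placeholders, since the values of Eq. (29) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster = 500

# Hypothetical (view, cluster) -> mean placeholders for Eq. (29).
means = {
    (1, 1): np.array([1.0, 1.0]),
    (1, 2): np.array([3.0, 3.0]),
    (2, 1): np.array([1.0, 2.0]),
    (2, 2): np.array([3.0, 4.0]),
}
cov = np.array([[1.0, 0.5], [0.5, 1.5]])  # placeholder covariance

# Each view is a mixture of two overlapping bivariate Gaussians.
views = {}
for v in (1, 2):
    samples = [rng.multivariate_normal(means[(v, c)], cov, n_per_cluster)
               for c in (1, 2)]
    views[v] = np.vstack(samples)          # 1000 x 2 per view

labels = np.repeat([0, 1], n_per_cluster)  # ground-truth cluster assignments
```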
IV-C Experiments on Erroneous Real-World Datasets
To evaluate the robustness of the proposed EMVC method to noise and random corruptions, we add white Gaussian noise with different signal-to-noise ratios to FOX (denoted NRC-FOX) and CNN (denoted NRC-CNN). Fig. 4 shows the clustering performance of the methods at various signal-to-noise ratios on the contaminated datasets. We can see that EMVC consistently achieves superior performance over the baselines (we only show the results for the full version of EMVC, which is superior to its degenerative versions). This observation demonstrates that EMVC is robust against random noise and corruptions.
We also investigate the robustness of the proposed EMVC method against sample-specific corruptions on FOX (denoted SSC-FOX) and CNN (denoted SSC-CNN). For this experiment, we randomly select a small portion of samples (2%, 6%, and 10%) and replace their feature values in all views by random values. This setting is similar to the generation of attribute outliers in [21]. Fig. 4 shows the clustering performance of the methods under sample-specific corruptions. The proposed EMVC method outperforms the baselines on this type of error (we only show the results for the full version of EMVC, which is superior to its degenerative versions). This is mainly because of the ℓ₂,₁ norm in our objective function, which models error matrices with sparse row supports.
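The corruption protocol described above, randomly replacing a small fraction of samples' features in every view, can be sketched as follows (function name and toy data are our own; the exact random-value distribution used in the experiments is an assumption):

```python
import numpy as np

def corrupt_samples(views, fraction, rng):
    """Replace the features of a random `fraction` of samples, in all views,
    with random values (sample-specific corruptions / attribute outliers)."""
    n = views[0].shape[0]
    k = max(1, int(round(fraction * n)))
    idx = rng.choice(n, size=k, replace=False)   # same samples in every view
    corrupted = []
    for X in views:
        X = X.copy()
        # Assumed corruption model: uniform values spanning the feature range.
        X[idx] = rng.uniform(X.min(), X.max(), size=(k, X.shape[1]))
        corrupted.append(X)
    return corrupted, idx

rng = np.random.default_rng(0)
views = [rng.standard_normal((100, 5)), rng.standard_normal((100, 3))]
noisy, idx = corrupt_samples(views, 0.10, rng)   # corrupt 10% of samples
```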
IV-D Hyperparameter Analysis
To explore the effects of the hyperparameters on performance, we run experiments on the real-world datasets with different values of the two hyperparameters and report the average accuracy in Fig. 5. In this figure, each grid cell is shaded to reflect clustering quality, where yellow means excellent quality. We can see that the performance is fairly stable: EMVC enjoys more promising results within a certain range of the two hyperparameters and is almost insensitive to their exact values in that range.
V Related Work
Existing methods for multi-view clustering can be classified into two categories: 1) centralized approaches and 2) distributed approaches [22]. Centralized approaches construct a new shared representation (i.e., a common consensus) across all views [3, 11, 18, 7, 19]. For example, Bickel and Scheffer presented an algorithm that interchanges cluster information among the different views [3]. Xia et al. proposed a multi-view clustering method, named RMSC, that recovers a shared transition probability matrix via low-rank and sparse decomposition [7]. The proposed EMVC method belongs to this category. Different from RMSC, EMVC builds the shared transition probability matrix by integrating a low-rank decomposition with two norm-based regularization terms, and it handles the typical error types well. Distributed approaches often build a separate learner for each individual view and use the information in each learner to apply constraints on the other views [18]. Kumar et al. proposed an approach that combines the graphs of the individual views via pairwise co-regularization to achieve a better clustering solution [18]. EMVC differs from this category of approaches in that it does not construct separate learners; instead, it recovers a shared transition probability matrix across all views.
VI Conclusion
In this paper, we developed a Markov-chain-based method named EMVC for multi-view clustering via a low-rank decomposition and two regularization terms. EMVC has several advantages over existing multi-view clustering methods. First, it handles the typical types of error well. Second, we proposed an iterative optimization framework for EMVC that is proven to converge. Compared with existing state-of-the-art multi-view clustering approaches, EMVC showed better performance on five real-world datasets.
References
 [1] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
 [2] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, Jan 2013.
 [3] S. Bickel and T. Scheffer, “Multi-view clustering,” in ICDM, vol. 4, 2004, pp. 19–26.
 [4] V. R. De Sa, “Spectral clustering with two views,” in ICML workshop on learning with multiple views, 2005, pp. 20–27.

 [5] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in NIPS, vol. 14, no. 2, 2001, pp. 849–856.
 [6] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
 [7] R. Xia, Y. Pan, L. Du, and J. Yin, “Robust multi-view spectral clustering via low-rank and sparse decomposition,” in AAAI, 2014, pp. 2149–2155.
 [8] J. Huang and T. Zhang, “The benefits of group sparsity,” arXiv preprint arXiv:0901.2962, 2009.
 [9] Z. Lin, M. Chen, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
 [10] D. Zhou, J. Huang, and B. Schölkopf, “Learning from labeled and unlabeled data on a directed graph,” in ICML. ACM, 2005, pp. 1036–1043.
 [11] D. Zhou and C. J. Burges, “Spectral clustering and transductive learning with multiple views,” in ICML. ACM, 2007, pp. 1159–1166.

 [12] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in NIPS, 2010, pp. 1813–1821.
 [13] M. Fazel, H. Hindi, and S. P. Boyd, “A rank minimization heuristic with application to minimum order system approximation,” in Proceedings of the 2001 American Control Conference, vol. 6. IEEE, 2001, pp. 4734–4739.
 [14] N. Srebro, J. D. Rennie, and T. S. Jaakkola, “Maximum-margin matrix factorization,” in NIPS, vol. 17, 2004, pp. 1329–1336.
 [15] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for lowrank representation,” in NIPS, 2011, pp. 612–620.
 [16] J.F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
 [17] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient projections onto the ℓ1-ball for learning in high dimensions,” in ICML. ACM, 2008, pp. 272–279.
 [18] A. Kumar, P. Rai, and H. Daume, “Co-regularized multi-view spectral clustering,” in NIPS, 2011, pp. 1413–1421.
 [19] F. Nie, J. Li, X. Li et al., “Parameter-free auto-weighted multiple graph learning: A framework for multi-view clustering and semi-supervised classification,” in IJCAI, 2016.

 [20] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang, “Diversity-induced multi-view subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 586–594.
 [21] H. Zhao and Y. Fu, “Dual-regularized multi-view outlier detection,” in IJCAI, 2015, pp. 4077–4083.
 [22] B. Long, P. S. Yu, and Z. Zhang, “A general model for multiple view unsupervised learning,” in SIAM International Conference on Data Mining. SIAM, 2008, pp. 822–833.