I Introduction
Clustering [1] is a paradigm for partitioning subjects into groups based on their similarity. It is a fundamental task in machine learning, pattern recognition and data mining, with widespread applications. The groups obtained by clustering can then feed further analysis tasks that serve different end goals. However, traditional clustering methods use only one feature set, or one view, of the subjects even when multiple feature sets or views are available. Subjects described by multiple feature sets or multiple views are the so-called multiview data.
Multiview data are very common in real-world applications, either because of the innate properties of the data or because the data are collected from different sources. For instance, a web page can be described by the words appearing on the page itself and by the words underlying all links pointing to it from other pages. In multimedia content understanding, multimedia segments can be described simultaneously by their video signals from cameras and their audio signals from voice recorders. The existence of such multiview data raised interest in multiview learning [2, 3, 4], which has been extensively studied in the semisupervised setting. In unsupervised learning, however, single view clustering methods cannot make full use of the information from multiple views. For example, running a single view clustering algorithm on the concatenated features of all views cannot distinguish the different significance of the views. To exploit multiple views and boost clustering accuracy, multiview clustering has attracted increasing attention over the past two decades, which makes it timely and necessary to summarize the current progress and open problems in order to guide further advances.
Based on the above concepts, we now define multiview clustering (MVC). MVC is a machine learning paradigm that places similar subjects into the same group and dissimilar subjects into different groups by combining the available multiview feature information; in other words, MVC searches for clusterings that are consistent across different views. Consistent with the categorization of clustering algorithms in [1], we divide the existing MVC methods into two categories: generative (or model-based) approaches and discriminative (or similarity-based) approaches. Generative approaches try to learn a generative model from the subjects, with each model representing one cluster, while discriminative approaches directly optimize an objective function that involves the pairwise similarities, so as to maximize the average similarity within clusters and minimize the average similarity between clusters. Because discriminative approaches are so numerous, we further classify them into five groups according to how they combine the multiview information: (1) common eigenvector matrix (mainly multiview spectral clustering), (2) common coefficient matrix (mainly multiview subspace clustering), (3) common indicator matrix (mainly multiview nonnegative matrix factorization clustering), (4) direct combination (mainly multikernel clustering), (5) combination after projection (mainly canonical correlation analysis (CCA)). The first three groups have something in common: they share a similar structure to combine multiple views.
Research on MVC is motivated by real multiview applications. The same motivation gave rise to multiview representation learning, multiview supervised learning and multiview semisupervised learning, all of which are well developed, so their similarities to and differences from MVC are worth exploring. What they share is that all of them learn from multiview information. As for the differences, multiview representation learning aims to learn a compact representation while MVC aims to perform clustering, and MVC is learned without any label information while multiview supervised and semisupervised learning have all or part of the label information. Some of the view combining strategies in these related paradigms can be borrowed and adapted to MVC. In addition, the relationships between MVC and ensemble clustering and between MVC and multitask clustering are also elaborated because of their similar clustering style.
Recently, MVC has been applied in many areas such as computer vision, natural language processing, social multimedia, bioinformatics and health informatics. At the same time, MVC papers appear frequently in top venues, including the conferences ICML [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], NIPS [19, 20], CVPR [21, 22, 23, 24], ICCV [25], AAAI [26, 27, 28, 29, 30], IJCAI [31, 32, 33, 34, 35, 36, 37], SDM [38, 39] and ICDM [40, 41, 42, 43, 44], and the journals PAMI [45], TKDE [46, 47, 48, 49, 50], TCYB [51, 52], TIP [53] and TNNLS [54]. Although MVC has permeated many fields and achieved great practical success, some open problems still limit its further advancement. We point out several of them in the hope of promoting the development of MVC. With this survey, we hope to give the reader a complete picture of the current development of MVC and of what can be done in the future.
The remainder of this paper is organized as follows. In Section II, we review the existing generative models of MVC. Section III introduces several categories of discriminative models of MVC. In Section IV, we analyze the relationship between MVC and some related topics. Section V presents the applications of MVC in different areas. In Section VI, we list several open problems in current MVC methods, which may help to advance the further development of MVC. Finally, we conclude the paper.
II Generative Approaches
Generative approaches aim to learn generative models, each of which generates the data of one cluster. In most cases, generative clustering approaches are based on mixture models or expectation maximization (EM) [55]. Therefore, mixture models and the EM algorithm are introduced first. Another popular single view clustering model, convex mixture models (CMMs) [56], is also introduced and later extended to the multiview case.
II-1 Mixture Models and CMMs
In the generative approach, data are considered to be sampled independently from a mixture model of multiple probability distributions. The mixture distribution can be written as
$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, p(x \mid \theta_k), \tag{1}$$
where $\pi_k$ is the prior probability of the $k$th component and satisfies $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$, $\theta_k$ is the parameter of the $k$th probability density model, and $\theta = \{(\pi_k, \theta_k)\}_{k=1}^{K}$ is the parameter set of the mixture model.
EM is a widely used algorithm for estimating the parameters of mixture models. Suppose the observed data and unobserved data are denoted by $X$ and $Z$, respectively. $\{X, Z\}$ and $X$ are called complete data and incomplete data. In the E (expectation) step, the posterior distribution $p(Z \mid X, \theta^{old})$ of the unobserved data is evaluated with the current parameter values $\theta^{old}$. According to maximum likelihood estimation, the E step calculates the expectation of the complete data log likelihood as
$$Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \log p(X, Z \mid \theta). \tag{2}$$
The M step updates the parameters by maximizing the function (2):
$$\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old}). \tag{3}$$
Note that for clustering, $X$ can be considered as the observed data while $Z$ is the latent variable whose entry $z_{ik}$ indicates that the $i$th data point comes from the $k$th component. Note also that the form of the posterior distribution evaluated in the E step and the expectation of the complete data log likelihood used to estimate the parameters depend on the distribution assumptions.
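As an illustration of the E and M steps described above, the following is a minimal sketch for a one-dimensional Gaussian mixture. The Gaussian component assumption, the quantile-based initialization and the function name `em_gmm_1d` are choices made for this example only and are not taken from the surveyed works.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100):
    """EM for a 1-D Gaussian mixture (illustrative sketch, Gaussian components assumed)."""
    N = x.shape[0]
    pi = np.full(K, 1.0 / K)                          # mixing priors pi_k
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)     # deterministic initial means
    var = np.full(K, np.var(x))                       # initial variances
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] = p(z_i = k | x_i, theta_old)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)     # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M step: closed-form maximizers of the expected complete-data log likelihood
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-9
    return pi, mu, var, r
```

A hard clustering can be read off as `r.argmax(axis=1)`; other component families change only the density used in the E step and the closed-form updates in the M step.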
CMMs [56] are simplified mixture models that yield soft assignments of data points to clusters after extracting representative exemplars from the data set. By maximizing the log-likelihood, all instances compete to become the “center” (representative exemplar) of the clusters. The instances corresponding to the components that receive the highest priors are selected as exemplars, and the remaining instances are then assigned to the “closest” exemplar. Note that the priors of the components are the only adjustable parameters of a CMM.
Given a data set $X = \{x_1, \dots, x_N\}$, the CMM distribution is $Q(x) = \sum_{j=1}^{N} q_j f_j(x)$, $x \in \mathbb{R}^d$, where $q_j \ge 0$ denotes the prior probability of the $j$th component, satisfying the constraint $\sum_{j=1}^{N} q_j = 1$, and $f_j(x)$ is an exponential family distribution with its expectation parameters equal to the $j$th data point. Due to the bijection between exponential families and Bregman divergences [57], the exponential family can be written as $f_j(x) = C_\phi(x) \exp(-\beta\, d_\phi(x, x_j))$, with $d_\phi$ denoting the Bregman divergence corresponding to the components’ distributions, $C_\phi(x)$ being independent of $x_j$, and $\beta$ being a constant controlling the sharpness of the components.
The log-likelihood to be maximized is $L(X; \{q_j\}_{j=1}^{N}) = \frac{1}{N} \sum_{i=1}^{N} \log \big( \sum_{j=1}^{N} q_j e^{-\beta d_\phi(x_i, x_j)} \big)$ + const. With the empirical data set distribution defined as $\hat{P}(x) = 1/N$ for $x \in X$, the log-likelihood maximization can be equivalently expressed in terms of the Kullback-Leibler (KL) divergence between $\hat{P}$ and $Q$ as
$$D(\hat{P} \,\|\, Q) = -\sum_{x \in X} \hat{P}(x) \log Q(x) - H(\hat{P}), \tag{4}$$
where $H(\hat{P})$ is the entropy of the empirical distribution, which does not depend on the parameters $q_j$. The problem thus becomes minimizing (4), which is convex and can be solved with an iterative algorithm whose updates for the prior probabilities are given by
$$q_j^{(t+1)} = q_j^{(t)} \sum_{i=1}^{N} \frac{\hat{P}(x_i)\, f_j(x_i)}{\sum_{j'=1}^{N} q_{j'}^{(t)} f_{j'}(x_i)}. \tag{5}$$
Grouping the data points into $K$ disjoint clusters is done by requiring the instances with the $K$ highest $q_j$ values to serve as exemplars and then assigning the remaining instances to the exemplar with the highest posterior probability. Note that the clustering performance is affected by the constant $\beta$; in [56] a reference value $\beta_0$ is determined with the empirical rule $\beta_0 = N^2 \log N / \sum_{i,j=1}^{N} d_\phi(x_i, x_j)$ to identify a reasonable range of $\beta$.
II-2 Multi-View Clustering Based on Mixture Models or EM Algorithm
In [58], under the assumption that the two views are independent, a multinomial distribution is adopted for the document clustering problem. Taking the two-view case as an example, the method executes the M and E steps on each view and then interchanges the posteriors in each iteration. The optimization terminates when a predefined stopping condition is satisfied. Two multiview EM algorithms for finite mixture models are proposed in [59]: the first runs EM in each view and combines all the weighted probabilistic clustering labels generated in each view before each new EM iteration, while the second can be viewed as a probabilistic information fusion for the components of the two views.
Based on the CMMs for single view clustering, the multiview version proposed in [60] is attractive because it can locate the global optimum and thus avoids the initialization and local optima problems of standard mixture models, which require multiple executions of the EM algorithm.
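For multiview CMMs, each $x_i$ with $V$ views is denoted by $\{x_i^{(1)}, \dots, x_i^{(V)}\}$, $x_i^{(v)} \in \mathbb{R}^{d_v}$, and the mixture distribution for each view is given as $Q^{(v)}(x^{(v)}) = \sum_{j=1}^{N} q_j f_j^{(v)}(x^{(v)})$. To pursue a common clustering across all views, all $Q^{(v)}$ share the same priors. In addition, an empirical data set distribution $\hat{P}^{(v)}(x^{(v)}) = 1/N$, $x^{(v)} \in \{x_1^{(v)}, \dots, x_N^{(v)}\}$, is associated with each view, and the multiview algorithm minimizes the sum of KL divergences between $\hat{P}^{(v)}$ and $Q^{(v)}$ across all views under the constraint $\sum_{j=1}^{N} q_j = 1$:
$$\min_{q_1, \dots, q_N} \sum_{v=1}^{V} D(\hat{P}^{(v)} \,\|\, Q^{(v)}) = \min_{q_1, \dots, q_N} -\sum_{v=1}^{V} \sum_{i=1}^{N} \hat{P}^{(v)}(x_i^{(v)}) \log Q^{(v)}(x_i^{(v)}) - \sum_{v=1}^{V} H(\hat{P}^{(v)}). \tag{6}$$
It is straightforward to see that this objective is convex, hence the global minimum can be found. The prior update rule is given as follows:
$$q_j^{(t+1)} = \frac{q_j^{(t)}}{V} \sum_{v=1}^{V} \sum_{i=1}^{N} \frac{\hat{P}^{(v)}(x_i^{(v)})\, f_j^{(v)}(x_i^{(v)})}{\sum_{j'=1}^{N} q_{j'}^{(t)} f_{j'}^{(v)}(x_i^{(v)})}. \tag{7}$$
The prior $q_j$ associated with the $j$th instance is a measure of how likely this instance is to be an exemplar, taking all views into account. The appropriate $\beta^{(v)}$ values are identified in the range of an empirically defined $\beta_0^{(v)}$, given by $\beta_0^{(v)} = N^2 \log N / \sum_{i,j=1}^{N} d_{\phi_v}(x_i^{(v)}, x_j^{(v)})$. From Eq. (6), it can be seen that all views contribute equally to the sum, without considering their different importance. To overcome this limitation, a weighted version of multiview CMMs was proposed in [61].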
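The prior updates of Eqs. (5) and (7) can be sketched in a few lines. The sketch below assumes squared Euclidean distance as the Bregman divergence (which corresponds to Gaussian components), a uniform empirical distribution, and a hypothetical parameter `beta_scale` that multiplies the empirical reference value; none of these names come from [56] or [60].

```python
import numpy as np

def cmm_priors(views, beta_scale=1.0, n_iter=500):
    """Exemplar priors q for (multiview) convex mixture models.

    views: list of (N, d_v) arrays, one per view; a one-element list
    gives the single view CMM. Squared Euclidean distance is assumed
    as the Bregman divergence.
    """
    N = views[0].shape[0]
    F = []
    for X in views:
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # D[i, j] = d(x_i, x_j)
        beta0 = N ** 2 * np.log(N) / D.sum()                 # empirical reference value
        F.append(np.exp(-beta_scale * beta0 * D))            # F[i, j] proportional to f_j(x_i)
    q = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        # q_j <- (q_j / V) * sum_v sum_i (1/N) * f_j(x_i) / sum_j' q_j' f_j'(x_i)
        ratio = sum((Fv / (Fv @ q)[:, None]).sum(axis=0) / N for Fv in F)
        q = q * ratio / len(F)
    return q
```

The instances with the largest entries of `q` would then serve as exemplars. Each update keeps `q` on the probability simplex, which is why no explicit renormalization appears in the loop.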
III Discriminative Approaches
Compared with generative approaches, discriminative approaches directly optimize the objective to seek the best clustering solution, rather than first modelling the subjects and then solving these models to determine the clustering result. Focusing directly on the clustering objective has brought discriminative approaches more attention and more comprehensive development. To date, most existing MVC methods are discriminative approaches. Based on how they combine multiple views, we categorize MVC methods into five main groups and introduce the representative works in each group.
To facilitate the following discussion, we first introduce the setting of MVC. Assume that we are given a data set $X$ with $V$ views. Let $X^{(v)} = \{x_1^{(v)}, \dots, x_N^{(v)}\}$ denote the subjects in view $v$ ($v = 1, \dots, V$), where $x_i^{(v)} \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of the $v$th view data. The aim of MVC is to cluster the $N$ subjects into $K$ groups. That is, we finally obtain a membership matrix $H \in \mathbb{R}^{N \times K}$ that indicates which subjects belong to the same group; the entries of each row of $H$ should sum to 1 so that each row is a probability distribution. If exactly one entry of each row is 1 and all others are 0, the result is so-called hard clustering; otherwise it is soft clustering.
III-A Common Eigenvector Matrix (Mainly Multi-View Spectral Clustering)
This group of MVC methods is based on a commonly used clustering technique, spectral clustering. Since spectral clustering hinges crucially on the construction of the graph Laplacian, and the resulting eigenvectors reflect the cluster structure of the data, these methods obtain a common clustering result by assuming that all views share the same or a similar eigenvector matrix. There are two representative methods: co-training spectral clustering [6] and co-regularized spectral clustering [19]. Before discussing them, we first introduce spectral clustering [62].
III-A1 Spectral Clustering
Spectral clustering is a clustering technique that utilizes the properties of the Laplacian of a graph whose edges denote the similarities between data points, and solves a relaxation of the normalized min-cut problem on this graph [63]. Compared with other widely used methods such as k-means clustering, which fits only spherical clusters, spectral clustering can handle arbitrarily shaped clusters and demonstrates good performance.
Let $G = (\mathcal{V}, \mathcal{E})$ be a weighted undirected graph with vertex set $\mathcal{V} = \{v_1, \dots, v_N\}$. The data adjacency matrix of the graph is defined to be $W$, whose entry $w_{ij}$ represents the similarity of two vertices $v_i$ and $v_j$. If $w_{ij} = 0$, the vertices $v_i$ and $v_j$ are not connected. $W$ is symmetric because $G$ is an undirected graph. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \dots, d_N$ on the diagonal, where $d_i = \sum_{j=1}^{N} w_{ij}$. Generally, the graph Laplacian is $D - W$ and the normalized graph Laplacian is $\tilde{L} = D^{-1/2}(D - W)D^{-1/2}$. In many spectral clustering works like [62, 6, 19], $L = D^{-1/2} W D^{-1/2}$ is also used to change a minimization problem (9) into a maximization problem (8), since $L = I - \tilde{L}$ where $I$ is the identity matrix. Following the convention adopted in [62, 6, 19], we will refer to both $L$ and $\tilde{L}$ as normalized graph Laplacians afterwards. The single view spectral clustering approach can now be formulated as follows:
$$\max_{U \in \mathbb{R}^{N \times K}} \mathrm{tr}(U^T L U) \quad \text{s.t.} \quad U^T U = I, \tag{8}$$
which is equivalent to the following problem:
$$\min_{U \in \mathbb{R}^{N \times K}} \mathrm{tr}(U^T \tilde{L} U) \quad \text{s.t.} \quad U^T U = I, \tag{9}$$
where $\mathrm{tr}$ denotes the matrix trace. The rows of the matrix $U$ are the embeddings of the data points, which can be fed to k-means to obtain the final clustering results. A version of the Rayleigh-Ritz theorem in [64] shows that the solution of the above optimization problem is given by choosing $U$ as the matrix whose columns are the eigenvectors associated with the $K$ largest eigenvalues of $L$ or, equivalently, the $K$ smallest eigenvalues of $\tilde{L}$. To understand the spectral clustering algorithm better, we briefly outline a commonly used procedure [62] to solve Eq. (8) as follows:

1. Construct the adjacency matrix $W$.
2. Compute the normalized Laplacian matrix $L = D^{-1/2} W D^{-1/2}$.
3. Calculate the eigenvectors of $L$ and stack the top $K$ eigenvectors as the columns to construct an $N \times K$ matrix $U$.
4. Normalize each row of $U$ to obtain $\tilde{U}$.
5. Run the k-means algorithm to cluster the row vectors of $\tilde{U}$.
6. Assign subject $i$ to cluster $k$ if the $i$th row of $\tilde{U}$ is assigned to cluster $k$ by the k-means algorithm.
Apart from the symmetric normalization operator $D^{-1/2}(D - W)D^{-1/2}$, there is also another normalization operator, the random walk form $D^{-1}(D - W)$. Refer to [65] for more details about spectral clustering.
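The procedure outlined above can be sketched with NumPy alone. The Gaussian similarity with bandwidth `sigma`, the zeroed diagonal of the adjacency matrix and the small farthest-point initialized k-means helper are assumptions made to keep the example self-contained; they are not prescribed by [62].

```python
import numpy as np

def spectral_embedding(W, K):
    """Row-normalized top-K eigenvectors of D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues come in ascending order
    U = vecs[:, -K:]                         # eigenvectors of the K largest eigenvalues
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(Y, K, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, K):
        dist = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((Y[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Y[labels == k].mean(axis=0)
    return labels

def spectral_clustering(X, K, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))       # Gaussian similarity (assumed)
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return kmeans(spectral_embedding(W, K), K)
```

The sketch assumes every point has a positive degree; an isolated point would need separate handling before the normalization step.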
III-A2 Co-Training Multi-View Spectral Clustering
Co-training is a widely used idea in semisupervised learning, where both labeled and unlabeled data are available. It assumes that the predictor functions in both views will give the same labels for the same sample with high probability. Two main assumptions guarantee its success: (1) Sufficiency: each view is sufficient for classification on its own, (2) Conditional independence: the views are conditionally independent given the class labels. In the original co-training algorithm [66], two initial predictor functions $f_1$ and $f_2$ are trained on the labeled data, then the algorithm repeatedly runs the following steps: the most confident examples predicted by $f_1$ are added to the labeled set for $f_2$ and vice versa, then $f_1$ and $f_2$ are retrained on the enlarged labeled data. After a predefined number of iterations, $f_1$ and $f_2$ will agree with each other on the labels.
For co-training multiview spectral clustering, the motivation is similar: the clustering result in each view should be the same. In spectral clustering, the eigenvectors of the graph Laplacian encode the discriminative information of the clustering. Therefore, co-training multiview spectral clustering [6] uses the eigenvectors of the graph Laplacian in one view to cluster the samples and then uses the clustering result to modify the graph Laplacian in the other view.
Each column $w_i$ of the similarity matrix (also named adjacency matrix) $W \in \mathbb{R}^{N \times N}$ can be considered as an $N$-dimensional vector that indicates the similarities of the $i$th point with all the points in the graph. Since the $K$ largest eigenvectors carry the discriminative information for clustering, the similarity vectors can be projected along those directions to retain the discriminative information and ignore the within-cluster details that might confuse the clustering. After that, the projected information is back-projected to the original $N$-dimensional space to get the modified graph. Due to the orthogonality of the projection matrix, the inverse projection is equivalent to the transpose operation.
To make the co-training spectral clustering algorithm clear, we borrow Algorithm 1 from [6]. Note that the symmetrization operator on a matrix $S$ is defined as $\mathrm{sym}(S) = (S + S^T)/2$ in Algorithm 1.
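The project-and-back-project idea described above can be sketched for two views as follows. This is a hedged reading of the co-training scheme, not a transcription of Algorithm 1 in [6]: the function names, the fixed iteration count and the guard against non-positive degrees are assumptions of this example.

```python
import numpy as np

def sym(S):
    """Symmetrization operator sym(S) = (S + S^T) / 2."""
    return (S + S.T) / 2.0

def top_eigvecs(S, K):
    """Top-K eigenvectors of the normalized matrix D^{-1/2} S D^{-1/2}."""
    d = np.maximum(S.sum(axis=1), 1e-12)   # guard (assumption) against non-positive degrees
    L = S / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(sym(L))
    return vecs[:, -K:]

def cotrain_spectral(W1, W2, K, n_iter=5):
    """Alternately modify each view's graph using the other view's eigenvectors."""
    U1, U2 = top_eigvecs(W1, K), top_eigvecs(W2, K)
    for _ in range(n_iter):
        # Project the similarity vectors of one view onto the eigenvectors of
        # the other view and back-project (U U^T), then symmetrize.
        S1 = sym(U2 @ U2.T @ W1)
        S2 = sym(U1 @ U1.T @ W2)
        U1, U2 = top_eigvecs(S1, K), top_eigvecs(S2, K)
    return U1, U2
```

The final row normalization and k-means step is omitted here; it would operate on the returned eigenvector matrices in the same way as in single view spectral clustering.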
III-A3 Co-Regularized Multi-View Spectral Clustering
Co-regularization is a well-known technique in semisupervised multiview learning. Its core idea is to minimize the disagreement between the predictor functions of two views as one part of the objective function. However, unsupervised learning such as clustering has no predictor functions, which raises the question of how to implement co-regularization in a clustering problem. Co-regularized multiview spectral clustering [19] lets the eigenvectors of the graph Laplacian play the role that predictor functions play in the semisupervised scenario, and proposes two co-regularized clustering approaches.
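Let $U^{(s)}$ and $U^{(t)}$ be the eigenvector matrices corresponding to any pair of view graph Laplacians $L^{(s)}$ and $L^{(t)}$ ($1 \le s, t \le V$, $s \ne t$). The first version uses a pairwise co-regularization criterion that enforces $U^{(s)}$ and $U^{(t)}$ to be as close as possible. The measure of clustering disagreement between the two views $s$ and $t$ is $D(U^{(s)}, U^{(t)}) = \left\| \frac{K_{U^{(s)}}}{\|K_{U^{(s)}}\|_F^2} - \frac{K_{U^{(t)}}}{\|K_{U^{(t)}}\|_F^2} \right\|_F^2$, where $K_{U^{(s)}} = U^{(s)} U^{(s)T}$, using the linear kernel, is the similarity matrix of $U^{(s)}$. Since $\|K_{U^{(s)}}\|_F^2 = K$, where $K$ is the number of clusters, the measure of disagreement becomes, up to constant additive and scaling terms, $D(U^{(s)}, U^{(t)}) = -\mathrm{tr}(U^{(s)} U^{(s)T} U^{(t)} U^{(t)T})$. Integrating the measure of disagreement between every pair of views into the spectral clustering framework, pairwise co-regularized multiview spectral clustering is formulated as the following optimization problem:
$$\max_{U^{(1)}, \dots, U^{(V)}} \sum_{s=1}^{V} \mathrm{tr}(U^{(s)T} L^{(s)} U^{(s)}) + \lambda \sum_{1 \le s, t \le V,\, s \ne t} \mathrm{tr}(U^{(s)} U^{(s)T} U^{(t)} U^{(t)T}) \quad \text{s.t.} \quad U^{(s)T} U^{(s)} = I,\ \forall\, 1 \le s \le V. \tag{10}$$
The hyperparameter $\lambda$ trades off the spectral clustering objectives against the spectral embedding disagreement terms.
The second version, named centroid-based co-regularization, enforces the eigenvector matrices from all views to be similar by regularizing them towards a common consensus eigenvector matrix $U^*$. The corresponding optimization problem is formulated as
$$\max_{U^{(1)}, \dots, U^{(V)}, U^*} \sum_{s=1}^{V} \mathrm{tr}(U^{(s)T} L^{(s)} U^{(s)}) + \sum_{s=1}^{V} \lambda_s\, \mathrm{tr}(U^{(s)} U^{(s)T} U^* U^{*T}) \quad \text{s.t.} \quad U^{(s)T} U^{(s)} = I,\ \forall\, 1 \le s \le V,\ \ U^{*T} U^* = I. \tag{11}$$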
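The step from the normalized-kernel disagreement measure to the trace term used in the objectives above can be checked numerically. For orthonormal $U^{(s)}$ and $U^{(t)}$ with $K$ columns, expanding the squared Frobenius norm gives $2/K - 2\,\mathrm{tr}(U^{(s)} U^{(s)T} U^{(t)} U^{(t)T})/K^2$, so minimizing the disagreement amounts to maximizing the trace. The sketch below verifies this identity on random orthonormal matrices; the helper names are hypothetical.

```python
import numpy as np

def orthonormal_basis(N, K, seed):
    """Random N x K matrix with orthonormal columns (stand-in for an eigenvector matrix)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((N, K)))
    return Q

def disagreement(Us, Ut):
    """|| K_s / ||K_s||_F^2 - K_t / ||K_t||_F^2 ||_F^2 with linear kernels K = U U^T."""
    Ks, Kt = Us @ Us.T, Ut @ Ut.T
    diff = Ks / np.linalg.norm(Ks) ** 2 - Kt / np.linalg.norm(Kt) ** 2
    return np.linalg.norm(diff) ** 2

def trace_agreement(Us, Ut):
    """tr(U_s U_s^T U_t U_t^T), the term that appears in the objectives."""
    return np.trace(Us @ Us.T @ Ut @ Ut.T)

Us, Ut = orthonormal_basis(10, 3, 0), orthonormal_basis(10, 3, 1)
lhs = disagreement(Us, Ut)
rhs = 2 / 3 - 2 * trace_agreement(Us, Ut) / 3 ** 2
```

Here `lhs` and `rhs` agree up to floating-point error, and the disagreement of a matrix with itself is zero.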
Since relaxed kernel k-means and spectral clustering are equivalent, Ye et al. [67] proposed a co-regularized kernel k-means for multiview clustering that learns flexible view weights automatically. With a multilayer Grassmann manifold interpretation, Dong et al. [68] obtained the same formulation as pairwise co-regularized multiview spectral clustering.
III-A4 Others
Besides the two representative multiview spectral clustering methods above, Wang et al. [38] enforce a common eigenvector matrix, formulate a multiobjective problem and then solve it using Pareto optimization.
III-B Common Coefficient Matrix (Mainly Multi-View Subspace Clustering)
In many practical applications, even though the given data set is high dimensional, its intrinsic dimension is often much lower. For example, the number of pixels in a given image can be large, yet only a few parameters are needed to describe the appearance, geometry and dynamics of a scene. This motivates the search for the underlying low dimensional space. In practice, the data could be sampled from multiple subspaces. Subspace clustering [69] is the technique that finds the underlying subspaces and then clusters the data points accordingly.
III-B1 Subspace Clustering
Subspace clustering uses the self-expression property [70] of the data set to represent the data in terms of itself:
$$X = XZ + E, \tag{12}$$
where $Z = [z_1, z_2, \dots, z_N] \in \mathbb{R}^{N \times N}$ is the subspace coefficient matrix (representation matrix), and each $z_i$ is the representation of the original data point $x_i$ based on the subspace. $E$ is the noise matrix.
Subspace clustering can be formulated as the following optimization problem:
$$\min_{Z} \|X - XZ\|_F^2 \quad \text{s.t.} \quad Z(i,i) = 0,\ \ Z^T \mathbf{1} = \mathbf{1}. \tag{13}$$
The constraint $Z(i,i) = 0$ avoids the case that a data point is represented by itself, while $Z^T \mathbf{1} = \mathbf{1}$ denotes that the data point lies in a union of affine subspaces. The nonzero elements of $z_i$ correspond to data points from the same subspace.
After obtaining the subspace representation $Z$, the similarity matrix $W = (|Z| + |Z^T|)/2$ can be computed to construct the graph Laplacian, and spectral clustering is then run on that graph Laplacian to get the final clustering results.
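The self-expression idea can be illustrated with a simplified stand-in for the optimization problem above: each point is regressed on all other points with a small ridge penalty, which enforces the zero-diagonal constraint but drops the affine constraint. The ridge penalty `lam` and the function names are assumptions of this sketch, not the solver used in the cited works.

```python
import numpy as np

def self_expression(X, lam=1e-3):
    """X: (d, N) data matrix with points as columns. Returns Z (N, N) with zero diagonal."""
    d, N = X.shape
    Z = np.zeros((N, N))
    for i in range(N):
        idx = np.delete(np.arange(N), i)        # exclude x_i itself, so Z[i, i] = 0
        A = X[:, idx]
        # Ridge least squares: min_z ||x_i - A z||^2 + lam * ||z||^2
        z = np.linalg.solve(A.T @ A + lam * np.eye(N - 1), A.T @ X[:, i])
        Z[idx, i] = z
    return Z

def affinity(Z):
    """Symmetric similarity matrix built from the coefficient matrix."""
    return (np.abs(Z) + np.abs(Z.T)) / 2.0
```

For points drawn from two orthogonal lines, the coefficients linking points on different lines vanish, so the resulting affinity matrix is block diagonal and spectral clustering on it recovers the two subspaces.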
III-B2 Multi-View Subspace Clustering
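With multiview information, a subspace representation $Z^{(v)}$ can be obtained from each view. To get a consistent clustering result from multiple views, Yin et al. [71] share a common coefficient matrix by enforcing the coefficient matrices from each pair of views to be as similar as possible. The optimization problem is formulated as
$$\min_{Z^{(1)}, \dots, Z^{(V)}} \sum_{v=1}^{V} \|X^{(v)} - X^{(v)} Z^{(v)}\|_F^2 + \lambda_1 \sum_{1 \le s, t \le V,\, s \ne t} \|Z^{(s)} - Z^{(t)}\|_1 + \lambda_2 \sum_{v=1}^{V} \|Z^{(v)}\|_1 \quad \text{s.t.} \quad \mathrm{diag}(Z^{(v)}) = 0,\ \forall\, v, \tag{14}$$
where $\|Z^{(s)} - Z^{(t)}\|_1$ is the $\ell_1$-norm based pairwise co-regularization constraint that can alleviate the noise problem, and $\|Z^{(v)}\|_1$ is used to enforce a sparse solution. $\mathrm{diag}(Z^{(v)})$ denotes the diagonal elements of the matrix $Z^{(v)}$, and the zero constraint is used to avoid the trivial solution in which each data point represents itself.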
Wang et al. [72] used a similar idea to combine multiview information. In addition, they adopted a multigraph regularization in which each graph Laplacian regularizer characterizes the view-dependent nonlinear local data similarity. At the same time, they assume that the view-dependent representation is low rank and sparse, and they account for sparse noise in the data. Wang et al. [53] proposed an angular based similarity to measure the correlation consensus in multiple views and obtained a robust subspace clustering for multiview data. Different from the above approaches, the three works [35, 36, 73] adopted a general nonnegative matrix factorization formulation but shared a common representation matrix for the samples with both views and kept each view representation matrix specific. Zhao et al. [26] adopted a deep semi-nonnegative matrix factorization to perform multiview clustering; in the last layer, a common coefficient matrix is enforced to exploit the multiview information.
III-C Common Indicator Matrix (Mainly Multi-View Nonnegative Matrix Factorization Clustering)
III-C1 Nonnegative Matrix Factorization
For a nonnegative data matrix $X \in \mathbb{R}_+^{d \times N}$, Nonnegative Matrix Factorization (NMF) [74] seeks two nonnegative matrix factors $U \in \mathbb{R}_+^{d \times K}$ and $V \in \mathbb{R}_+^{N \times K}$ whose product is a good approximation to $X$:
$$X \approx U V^T, \tag{15}$$
where $K$ denotes the desired reduced dimension (for clustering, it is the number of clusters). Here, $U$ is the basis matrix while $V$ is the indicator matrix.
Due to the nonnegativity constraints, NMF can learn a part-based representation. This is very intuitive and meaningful in many applications, such as the face recognition in [74]. The observations in many applications like information retrieval [74] and pattern recognition [75] can be explained as additive linear combinations of nonnegative basis vectors. NMF has been applied successfully to clustering [74, 76] and has achieved state-of-the-art performance.
In addition, k-means clustering can be formulated using NMF by introducing an indicator matrix $H$. The NMF formulation of k-means clustering is
$$\min_{U, H} \|X - U H^T\|_F^2 \quad \text{s.t.} \quad H_{ik} \in \{0, 1\},\ \ \sum_{k=1}^{K} H_{ik} = 1,\ \forall\, i = 1, \dots, N, \tag{16}$$
where $U \in \mathbb{R}^{d \times K}$ is the cluster centroid matrix.
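A common way to compute an NMF is the multiplicative update scheme, sketched below for the squared Frobenius loss with the basis matrix `U` and the indicator matrix `V`. The random initialization, the small constant `eps` and the function name `nmf` are assumptions of this example; the surveyed methods may use different solvers.

```python
import numpy as np

def nmf(X, K, n_iter=200, seed=0, eps=1e-10):
    """Factor nonnegative X (d, N) as U @ V.T with U (d, K) >= 0 and V (N, K) >= 0."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    U = rng.random((d, K)) + eps
    V = rng.random((N, K)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep both factors nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        errors.append(np.linalg.norm(X - U @ V.T) ** 2)
    return U, V, errors
```

For clustering, a label for data point $i$ can be read off as `V[i].argmax()`, which mirrors the role of the indicator matrix described above.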
III-C2 Multi-View Nonnegative Matrix Factorization Clustering
To combine multiview information in the NMF framework, Akata et al. [77] extended NMF [74] to the multiview setting by enforcing a shared indicator matrix among different views. However, the indicator matrices of different views might not be comparable at the same scale. In order to keep the clustering solutions across different views meaningful and comparable, Liu et al. [78] enforce a constraint that pushes each view-dependent indicator matrix towards a common indicator matrix, together with a normalization constraint inspired by the connection between NMF and probabilistic latent semantic analysis. The final optimization problem is formulated as:
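$$\min_{U^{(v)}, V^{(v)}, V^*} \sum_{v=1}^{V} \|X^{(v)} - U^{(v)} V^{(v)T}\|_F^2 + \sum_{v=1}^{V} \lambda_v \|V^{(v)} - V^*\|_F^2 \quad \text{s.t.} \quad \|U^{(v)}_{\cdot,k}\|_1 = 1,\ \forall\, 1 \le k \le K,\ \ U^{(v)}, V^{(v)}, V^* \ge 0. \tag{17}$$
The constraint $\|U^{(v)}_{\cdot,k}\|_1 = 1$ guarantees that $V^{(v)}$ lies within the same range for different $v$, so that the comparison between the view-dependent indicator matrix $V^{(v)}$ and the consensus indicator matrix $V^*$ is reasonable.
After obtaining the consensus matrix $V^*$, the cluster label of data point $i$ can be computed as $\arg\max_k V^*_{i,k}$.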
As mentioned above, subspace clustering consists of two steps: computing the subspace representation, and running spectral clustering on the graph Laplacian computed from that representation. To get a consistent clustering from multiple views, Gao et al. [79] merged the two steps of subspace clustering and enforced a common indicator matrix across different views. The formulation is as follows:
(18) 
where is the subspace representation matrix of the th view, , is a diagonal matrix with diagonal elements defined as , is the common indicator matrix which can result in a consistent clustering result across different views. Although this multiview subspace clustering is based on subspace clustering, it does not enforce the common coefficient matrix but shares a common indicator matrix for different views. That is why we categorize it into this group. Similar things happen to spectral clustering later.
Since spectral clustering works on a graph and the eigendecomposition is time consuming, it is ill-suited to large scale applications. The k-means algorithm is not subject to this limitation and is therefore a good choice at large scale. To deal with large scale multiview data, Cai et al. [31] proposed a multiview k-means clustering method that adopts a common indicator matrix for different views. The optimization problem is formulated as follows:
(19) 
where is the weight factor for the th view and is the parameter to control the weights distribution. By learning the weights for different views, the important views will get large weight during multiview clustering.
Wang et al. [7] integrate multiview information via a common indicator matrix and simultaneously take into consideration the varied importance of different features to different data clusters. Their formulation is
(20) 
where , each is the input including the features from all the views and each view has dimension such that . is the weights of each feature for clusters. is the intercept vector, is constant vector of all 1’s, is the cluster indicator matrix. is the group regularization to learn the groupwise feature importance of one view on each cluster while is the norm to learn the individual weight across different clusters.
Before centroidbased coregularization, a similar work [80] used the same idea to perform multiview spectral clustering. The main difference is that [80] used as the disagreement measure between each view eigenvector matrix and the common eigenvector matrix while coregularized multiview spectral clustering [19] adopted . The optimization problem [80] is formulated as
(21) 
where makes become the final cluster indicator matrix. Different from general spectral clustering that get eigenvector matrix first and then run clustering (such as k means that is sensitive to initialization condition) to assign clusters, Cai et al. [80] directly solves the final cluster indicator matrix, thus it will be more robust to the initial condition.
In [81], a matrix factorization approach was adopted to reconcile the groups arising from the individual views. Specifically, a matrix that contains the partitioning of every individual view is created and then decomposed into two matrices: one shows the contribution of those partitionings to the final multiview clusters, called metaclusters, and the other gives the assignment of instances to the metaclusters. Tang et al. [40] considered multiview clustering as clustering with multiple graphs, each of which is approximated by matrix factorization with a graph-specific factor and a factor common to all graphs. Qian et al. [82] used a relaxed common indicator matrix (making the indicator matrices of the views as close as possible) to combine multiview information and employed Laplacian regularization to maintain the latent geometric structure of the views simultaneously. Apart from using a common indicator matrix, [83, 84, 85] introduced a weight matrix that indicates which entries of the corresponding data matrix are missing, so that the missing value problem can be tackled. Multiview self-paced clustering [34] takes the complexities of the samples and views into consideration to alleviate the local minima problem. Tao et al. [32] enforce a common indicator matrix and seek the consensus clustering among all views in an ensemble clustering way. Apart from multiview nonnegative matrix factorization clustering, some other methods also use a common indicator matrix to combine multiple views for clustering, such as [21], which additionally borrowed the idea of linear discriminant analysis and weighted each view automatically. For graph-based clustering methods, a similarity matrix is first obtained for each view; Nie et al. [33] assume a common indicator matrix and then solve the problem by minimizing the differences between the common indicator matrix and each similarity matrix.
III-D Direct Combination (Mainly Multi-Kernel Based Multi-View Clustering)
Apart from sharing a common structure from different views, the direct view combination is a good way to perform multiview clustering. A natural approach is to define a kernel for each view information and then convexly combine these kernels [8, 86, 87].
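III-D1 Kernel Functions and Kernel Combination Forms
The kernel trick allows a nonlinear problem to be learned with a linear learning algorithm, since the kernel function $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ directly gives the inner products in feature space without explicitly defining the nonlinear transformation $\phi$. Some common kernel functions are as follows:
- Linear kernel: $\kappa(x_i, x_j) = x_i^T x_j$,
- Polynomial kernel: $\kappa(x_i, x_j) = (x_i^T x_j + 1)^d$,
- Gaussian kernel (radial basis kernel): $\kappa(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$,
- Sigmoid kernel: $\kappa(x_i, x_j) = \tanh(\eta\, x_i^T x_j + \nu)$.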
In machine learning, common kernel functions can be viewed as similarity functions [88], such that we can use kernel trick to deal with spectral clustering and kernel k means problems. So far, there exist some works on multikernel learning for clustering [89, 90, 91], however, they are all for singleview clustering. If each kernel is derived from each view and then combined elaborately to deal with the clustering problem, it will become the multikernel learning for multiview clustering. Obviously, multikernel learning [92, 93, 94, 95] can be considered as the most important part of multiview clustering. There are three main categories of multikernel combination [96]:

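Linear combination: It includes two basic subcategories: the unweighted sum $K = \sum_{v=1}^{V} K_v$ and the weighted sum $K = \sum_{v=1}^{V} w_v^{q} K_v$, where $K_v$ is the kernel of the $v$th view, $w_v \in \mathbb{R}_+$ denotes the kernel weight for the $v$th view with $\sum_{v=1}^{V} w_v = 1$, and $q$ is the hyperparameter that controls the distribution of the weights,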

Nonlinear combination: It uses some nonlinear functions of kernels, namely, multiplication, power, and exponentiation,

Data-dependent combination: It assigns specific kernel weights for each data instance, which can identify the local distributions in the data and learn proper kernel combination rules for each region.
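The first two categories above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the exponent argument `q`, the fixed weights, and the toy two-view data are all hypothetical choices for illustration, and the elementwise product is only one of the nonlinear forms mentioned. The data-dependent category is not covered, since it requires learning per-instance weights.

```python
import numpy as np

def unweighted_sum(kernels):
    """Unweighted linear combination: the plain sum of the view kernels."""
    return sum(kernels)

def weighted_sum(kernels, weights, q=1.0):
    """Weighted combination sum_v w_v**q * K_v with w_v >= 0 and sum_v w_v = 1."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(wv ** q * Kv for wv, Kv in zip(w, kernels))

def product_combination(kernels):
    """Nonlinear combination by elementwise multiplication of the kernels."""
    out = np.ones_like(kernels[0])
    for Kv in kernels:
        out = out * Kv
    return out

# Toy data (assumption): the same 6 samples described by two views.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 4))     # view 1 features
X2 = rng.normal(size=(6, 2))     # view 2 features
K1, K2 = X1 @ X1.T, X2 @ X2.T    # one linear kernel per view

K_sum = unweighted_sum([K1, K2])
K_w = weighted_sum([K1, K2], weights=[0.7, 0.3], q=2.0)
K_prod = product_combination([K1, K2])
```

All three results remain positive semidefinite when the input kernels are; for the elementwise product this follows from the Schur product theorem.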
III-D2 Kernel k-Means and Spectral Clustering
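Kernel k-means [97] and spectral clustering [98] are two kernel-based clustering methods for optimizing the intracluster variance. Let $\phi(\cdot): \mathcal{X} \to \mathcal{H}$ be a feature mapping which maps $x \in \mathcal{X}$ onto a reproducing kernel Hilbert space $\mathcal{H}$. The kernel k-means problem is formulated as the following optimization problem,
$$\min_{Z \in \{0,1\}^{n \times k}} \sum_{i=1}^{n} \sum_{c=1}^{k} Z_{ic} \|\phi(x_i) - \mu_c\|_2^2 \quad \text{s.t.} \quad \sum_{c=1}^{k} Z_{ic} = 1, \tag{22}$$
where $Z$ is the cluster indicator matrix (also known as the cluster assignment matrix), and $n_c = \sum_{i=1}^{n} Z_{ic}$ and $\mu_c = \frac{1}{n_c} \sum_{i=1}^{n} Z_{ic}\, \phi(x_i)$ are the number of points in and the centroid of the $c$th cluster. With a kernel matrix $K$ whose $(i,j)$th entry is $K_{ij} = \phi(x_i)^\top \phi(x_j)$, $L = \mathrm{diag}(n_1^{-1}, \dots, n_k^{-1})$, and $\mathbf{1}_\ell \in \mathbb{R}^{\ell}$ that is a column vector with all elements 1, Eq. (22) can be equivalently rewritten in the following matrix form,
$$\min_{Z \in \{0,1\}^{n \times k}} \mathrm{tr}(K) - \mathrm{tr}(L^{1/2} Z^\top K Z L^{1/2}) \quad \text{s.t.} \quad Z \mathbf{1}_k = \mathbf{1}_n. \tag{23}$$
In the above matrix form of kernel k-means, the matrix $Z$ is discrete, which makes the optimization problem difficult to solve. By relaxing the matrix to take arbitrary real values, the above problem can be approximated. Specifically, by defining $U = Z L^{1/2}$ and letting $U$ take real values, and further considering that $\mathrm{tr}(K)$ is constant, Eq. (23) is relaxed to
$$\max_{U \in \mathbb{R}^{n \times k}} \mathrm{tr}(U^\top K U) \quad \text{s.t.} \quad U^\top U = I_k. \tag{24}$$
The fact that $Z^\top Z = L^{-1}$ leads to the orthogonality constraint on $U$, which further tells us that the optimal $U$ can be obtained from the top $k$ eigenvectors of the kernel matrix $K$. Therefore, Eq. (24) can be considered as the generalized optimization matrix form of spectral clustering. Note that Eq. (24) is equivalent to Eq. (8) if the kernel matrix takes the normalized Gram matrix form.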
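The relaxed problem in Eq. (24) can be sketched numerically as below. This is a minimal sketch, assuming a symmetric kernel matrix and using NumPy's `eigh`; the function names, the toy data, and the final discretization step (plain Lloyd iterations on the rows of the relaxed solution, a common heuristic that the text above does not prescribe) are assumptions for illustration.

```python
import numpy as np

def relaxed_kernel_kmeans(K, k):
    """Maximize tr(U^T K U) s.t. U^T U = I via the top-k eigenvectors of K."""
    K = (K + K.T) / 2.0                      # enforce symmetry numerically
    eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
    U = eigvecs[:, -k:][:, ::-1]             # top-k eigenvectors, largest first
    return U, eigvals[-k:][::-1]

def discretize(U, n_iter=50, seed=0):
    """Heuristic (assumption): Lloyd's k-means on the rows of U gives hard labels."""
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=U.shape[1], replace=False)]
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(len(centers)):
            if np.any(labels == c):          # keep the old center if a cluster is empty
                centers[c] = U[labels == c].mean(0)
    return labels

# Toy data (assumption): two separated blobs and a Gaussian kernel with sigma = 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

U, top_vals = relaxed_kernel_kmeans(K, k=2)
labels = discretize(U)
```

The optimal value of the relaxed objective equals the sum of the $k$ largest eigenvalues of the kernel matrix, which gives a quick sanity check on the returned solution.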
III-D3 Multikernel-Based Multiview Clustering
Assume there are $V$ kernel matrices $K_1, \dots, K_V$ available, each of which corresponds to one view. To make full use of all views, the weighted combination $K = \sum_{v=1}^{V} w_v^{q} K_v$, with view weights $w_v$ and exponent $q$, is used in kernel k-means (24) and spectral clustering (8) to obtain the corresponding multiview kernel k-means and multiview spectral clustering in [41]. With the same nonlinear combination under a specific setting of its parameter, Guo et al. [99] extended spectral clustering to multiview clustering by further employing kernel alignment. Due to the potential redundancy of the selected kernels, Liu et al. [28] introduced a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels, with the final goal of boosting the clustering performance. By replacing the original Euclidean norm metric in fuzzy c-means with a kernel-induced metric in the data space and adopting the weighted kernel combination, Zhang et al. [100] extended fuzzy c-means to a multiview clustering method that is robust to noise and outliers. For incomplete multiview data sets, Shao et al. [43] collectively complete the kernel matrices of the incomplete data sets by optimizing the alignment of the instances shared among them. To overcome the cluster initialization problem associated with kernel k-means, Tzortzis et al. [54] proposed a global kernel k-means algorithm, a deterministic and incremental approach that adds one cluster at each stage through a global search procedure consisting of several executions of kernel k-means from suitable initializations.
III-D4 Others
Besides multikernel-based multiview clustering, there are some other methods that use direct combination to perform multiview clustering, such as [21, 33]. In [46], two-level weights, namely view weights and variable weights, are assigned in the clustering algorithm for multiview data to identify the importance of the corresponding views and variables. To extend fuzzy clustering to multiview clustering, each view is weighted, and multiview versions of fuzzy c-means and fuzzy k-means are obtained in [42] and [51], respectively.
III-E Combination After Projection (Mainly CCA-Based Multiview Clustering)
For homogeneous multiview data, it is reasonable to combine the views directly. However, in real-world applications, the multiple representations may come from different feature vector spaces. For instance, in bioinformatics, gene information can be one view while clinical symptoms can be the other view in patient clustering [13]. Obviously, such information cannot be combined directly. Moreover, high dimensionality and noise are difficult to handle. To solve the above problems, the last yet important way of combination is introduced: combination after projection. The most commonly used techniques are canonical correlation analysis (CCA) and its kernel version (KCCA).
III-E1 CCA and KCCA
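To better understand this style of multiview combination, CCA and KCCA are briefly introduced (refer to [101] for more detail). Given two data sets $S_x = [x_1, \dots, x_n] \in \mathbb{R}^{d_x \times n}$ and $S_y = [y_1, \dots, y_n] \in \mathbb{R}^{d_y \times n}$, with each entry $x$ or $y$ having zero mean, CCA aims to find a projection $w_x \in \mathbb{R}^{d_x}$ for $x$ and another projection $w_y \in \mathbb{R}^{d_y}$ for $y$ such that the correlation between the projections of $S_x$ and $S_y$ on $w_x$ and $w_y$ is maximized,
$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{(w_x^\top C_{xx} w_x)(w_y^\top C_{yy} w_y)}}, \tag{25}$$
where $\rho$ is the correlation and $C_{xy} = \mathbb{E}[x y^\top]$ denotes the covariance matrix of $x$ and $y$ with zero mean; $C_{xx}$, $C_{yy}$, and $C_{yx} = C_{xy}^\top$ are defined analogously. Observing that $\rho$ is not affected by scaling $w_x$ or $w_y$ either together or independently, CCA can be reformulated as
$$\max_{w_x, w_y} w_x^\top C_{xy} w_y \quad \text{s.t.} \quad w_x^\top C_{xx} w_x = 1, \; w_y^\top C_{yy} w_y = 1. \tag{26}$$
With the method of Lagrange multipliers, the two Lagrange multipliers $\lambda_x$ and $\lambda_y$ are equal to each other, that is, $\lambda_x = \lambda_y = \lambda$. If $C_{yy}$ is invertible, $w_y$ can be obtained as $w_y = \frac{1}{\lambda} C_{yy}^{-1} C_{yx} w_x$ and $C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x$. For different eigenvalues (from large to small), different eigenvectors are obtained, which can be considered as a successive process.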
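The first pair of canonical directions can be computed numerically from the generalized eigenproblem that results from the Lagrangian conditions. The following is a minimal sketch, not the method of any cited work: the function name `cca_first_pair`, the small ridge term `reg` added for numerical stability, the convention that rows are samples, and the toy data are assumptions for illustration.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First canonical pair for X (n x dx) and Y (n x dy); rows are samples."""
    X = X - X.mean(0)                        # center each view
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # ridge term is an assumption
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Eigenproblem Cxx^{-1} Cxy Cyy^{-1} Cyx w_x = lambda^2 w_x.
    M = np.linalg.solve(Cxx, Cxy @ np.linalg.solve(Cyy, Cxy.T))
    eigvals, eigvecs = np.linalg.eig(M)
    i = np.argmax(eigvals.real)
    lam = np.sqrt(max(eigvals.real[i], 0.0))       # canonical correlation
    w_x = eigvecs[:, i].real
    w_x = w_x / np.sqrt(w_x @ Cxx @ w_x)           # enforce w_x^T Cxx w_x = 1
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x) / lam  # assumes lam > 0
    return w_x, w_y, lam

# Toy data (assumption): both views contain a noisy copy of one latent signal.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), z + 0.1 * rng.normal(size=n)])

w_x, w_y, rho = cca_first_pair(X, Y)
```

Further canonical pairs correspond to the remaining eigenvalues taken in decreasing order, which mirrors the successive process described above.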
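The above canonical correlation problem can be transformed into a distance minimization problem. For ease of derivation, the successive formulation of the canonical correlation is replaced by the simultaneous formulation. Assume the number of projections is $p$, and let the matrices $W_x$ and $W_y$ denote $(w_x^1, \dots, w_x^p)$ and $(w_y^1, \dots, w_y^p)$, respectively. The simultaneous formulation is the optimization problem with $p$ iteration steps:
$$\max_{(w_x^1, \dots, w_x^p),\, (w_y^1, \dots, w_y^p)} \sum_{i=1}^{p} (w_x^i)^\top C_{xy} w_y^i \quad \text{s.t.} \quad (w_x^i)^\top C_{xx} w_x^j = \delta_{ij}, \; (w_y^i)^\top C_{yy} w_y^j = \delta_{ij}, \; (w_x^i)^\top C_{xy} w_y^j = 0 \ (i \neq j), \quad i, j = 1, \dots, p, \tag{27}$$
where $C_{xx}$, $C_{yy}$, and $C_{xy}$ are the within-view and cross-view covariance matrices, and $\delta_{ij}$ equals 1 if $i = j$ and 0 otherwise.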
The matrix formulation of the optimization problem (27) is