A Survey on Multi-View Clustering

12/18/2017 ∙ by Guoqing Chao et al. ∙ University of Connecticut and East China Normal University

With the fast development of information technology, especially the popularization of the internet, multi-view learning has become more and more popular in the machine learning and data mining fields. Multi-view semi-supervised learning, such as co-training and co-regularization, has gained considerable attention. Although multi-view clustering (MVC) has recently developed rapidly, there has been no survey to summarize and analyze the current progress. Therefore, this paper sums up the common strategies for combining multiple views and, based on them, proposes a novel taxonomy of MVC approaches. We also discuss the relationships between MVC and multi-view representation, ensemble clustering, multi-task clustering, multi-view supervised learning, and multi-view semi-supervised learning. Several representative real-world applications are elaborated. To promote the further development of MVC, we point out several open problems that are worth exploring in the future.


I Introduction

Clustering [1]

is a paradigm that classifies subjects into several groups based on their similarity. Clustering is a fundamental task in machine learning, pattern recognition, and data mining, and it has widespread applications. With the groups obtained by clustering methods, further analysis can be conducted to achieve different ultimate goals. However, traditional clustering methods use only one feature set, or one view, of the subjects, even when multiple feature sets, or multiple views, of these subjects are available. Subjects of interest described by multiple feature sets or multiple views are the so-called multi-view data.

Multi-view data are very common in real-world applications, either due to their innate properties or because they are collected from different sources. For instance, a web page can be described by the words appearing on the page itself and by the words in the links pointing to it from other pages. In multimedia content understanding, multimedia segments can be simultaneously described by the video signals from visual cameras and the audio signals from voice recording devices. The existence of such multi-view data has raised interest in multi-view learning [2, 3, 4]

, which has been extensively studied in the semi-supervised setting. In the unsupervised setting, however, previous single-view clustering methods cannot make full use of the information from multiple views; for example, running a single-view clustering algorithm on features concatenated from multiple views cannot distinguish the different significance of different views. To make full use of the multi-view information and boost clustering accuracy, multi-view clustering has attracted more and more attention over the past two decades, which makes it timely and necessary to summarize and sort out the current progress and open problems to guide its further advancement.

Now, based on the above concepts, we give the definition of multi-view clustering (MVC). MVC is a machine learning paradigm that classifies similar subjects into the same group and dissimilar subjects into different groups by combining the available multi-view feature information; that is, MVC searches for a clustering that is consistent across different views. Consistent with the categorization of clustering algorithms in [1]

, we divide the existing MVC methods into two categories: generative (or model-based) approaches and discriminative (or similarity-based) approaches. Generative approaches try to learn a generative model from the subjects, with each model representing one cluster, while discriminative approaches directly optimize an objective function that involves the pairwise similarities, maximizing the average similarity within clusters and minimizing the average similarity between clusters. Due to the large number of discriminative approaches, we further classify them into five groups based on how they combine the multi-view information: (1) common eigenvector matrix (mainly multi-view spectral clustering), (2) common coefficient matrix (mainly multi-view subspace clustering), (3) common indicator matrix (mainly multi-view nonnegative matrix factorization clustering), (4) direct combination (mainly multi-kernel clustering), and (5) combination after projection (mainly canonical correlation analysis (CCA)). The first three groups share a commonality: they use a shared structure to combine multiple views.

Research on MVC is motivated by real-world multi-view applications. With the same motivation, multi-view representation learning, multi-view supervised learning, and multi-view semi-supervised learning have emerged and developed well, so their similarities to and differences from MVC are worth exploring. What they have in common is that all of them learn from multi-view information. With regard to the differences, multi-view representation learning aims to learn a compact representation whereas MVC performs clustering, and MVC is learned without any label information whereas multi-view supervised and semi-supervised learning have all or part of the label information. Some of the view-combination strategies in these related paradigms can be borrowed and adapted to MVC. In addition, the relationships between MVC and ensemble clustering and multi-task clustering are also elaborated due to their similar clustering style.

Recently, MVC has been applied to many areas such as computer vision, natural language processing, social multimedia, bioinformatics, and health informatics. At the same time, MVC papers have appeared frequently in many top venues, including the conferences ICML [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], NIPS [19, 20], CVPR [21, 22, 23, 24], ICCV [25], AAAI [26, 27, 28, 29, 30], IJCAI [31, 32, 33, 34, 35, 36, 37], SDM [38, 39], ICDM [40, 41, 42, 43, 44], and the journals PAMI [45], TKDE [46, 47, 48, 49, 50], TCYB [51, 52], TIP [53], and TNNLS [54]. Although MVC has permeated many fields and achieved great practical success, there are still some open problems that limit its further advancement. We point out several open problems and hope they can help promote the development of MVC in the future. With this survey, we hope to give the reader a complete picture of the current development of MVC and of what can be done in the future.

The remainder of this paper is organized as follows. In Section II, we review the existing generative models for MVC. Section III introduces several categories of discriminative models for MVC. In Section IV, we analyze the relationships between MVC and some related topics. Section V presents the applications of MVC in different areas. In Section VI, we list several open problems in current MVC methods, which may help advance the further development of MVC. Finally, we draw conclusions.

II Generative Approaches

Generative approaches aim to learn generative models, each of which generates the data of one cluster. In most cases, generative clustering approaches are based on mixture models estimated with expectation maximization (EM) [55]. Therefore, mixture models and the EM algorithm are introduced first. Another popular single-view clustering model, named convex mixture models (CMMs) [56], is also introduced, which will later be extended to the multi-view case.

II-1 Mixture Models and CMMs

In generative approaches, the data are assumed to be sampled independently from a mixture of multiple probability distributions. The mixture distribution can be written as

$$p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p(x \mid \theta_k), \qquad (1)$$

where $\alpha_k$ is the prior probability of the $k$-th component and satisfies $\alpha_k \ge 0$ and $\sum_{k=1}^{K} \alpha_k = 1$, $\theta_k$ is the parameter of the $k$-th probability density model, and $\Theta = \{\alpha_1, \dots, \alpha_K, \theta_1, \dots, \theta_K\}$ is the parameter set of the mixture model. For instance, $\theta_k = (\mu_k, \Sigma_k)$ for the Gaussian mixture model.

EM is a widely used algorithm for parameter estimation of mixture models. Suppose the observed data and unobserved data are denoted by $X$ and $Z$, respectively; $(X, Z)$ is called the complete data and $X$ the incomplete data. In the E (expectation) step, the posterior distribution $p(Z \mid X, \Theta^{\text{old}})$ of the unobserved data is evaluated with the current parameter values $\Theta^{\text{old}}$. According to maximum likelihood estimation, the E step calculates the expectation of the complete-data log-likelihood as

$$Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \log p(X, Z \mid \Theta). \qquad (2)$$

The M step updates the parameters by maximizing the function (2):

$$\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{\text{old}}). \qquad (3)$$

Note that for clustering, $X$ can be considered the observed data while $Z$ is the latent variable whose entry $z_{ik}$ indicates whether the $i$-th data point comes from the $k$-th component. Note also that the posterior distribution evaluated in the E step and the expectation of the complete-data log-likelihood used to estimate the parameters differ under different distribution assumptions.
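To make the E and M steps concrete, here is a minimal NumPy/SciPy sketch of EM for a Gaussian mixture model. This is an illustration, not code from the surveyed papers; the random initialization, the small covariance regularization, and the fixed iteration count are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform priors, randomly chosen means, shared covariance.
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)

    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point.
        resp = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300

        # M step: maximize the expected complete-data log-likelihood.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

    # Hard cluster assignment from the final responsibilities.
    return resp.argmax(axis=1), pi, mu, sigma
```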

CMMs [56] are simplified mixture models that result in soft assignments of data points to clusters after extracting representative exemplars from the data set. By maximizing the log-likelihood, all instances compete to become the “centers” (representative exemplars) of the clusters. The instances corresponding to the components that receive the highest priors are selected as exemplars, and the remaining instances are then assigned to the “closest” exemplar. Note that the priors of the components are the only adjustable parameters of a CMM.

Given a data set $X = \{x_1, \dots, x_N\}$, the CMM distribution is $Q(x) = \sum_{j=1}^{N} q_j f_j(x)$, $q_j \ge 0$, where $q_j$ denotes the prior probability of the $j$-th component, satisfying the constraint $\sum_{j=1}^{N} q_j = 1$, and $f_j(x)$ is an exponential family distribution with its expectation parameter equal to the $j$-th data point. Due to the bijection relationship between exponential families and Bregman divergences [57], the exponential family can be written as $f_j(x) = C_{\varphi}(x) \exp\!\big(-\beta\, d_{\varphi}(x, x_j)\big)$, with $d_{\varphi}$ denoting the Bregman divergence corresponding to the components’ distributions, $C_{\varphi}(x)$ being independent of $x_j$, and $\beta$ being a constant controlling the sharpness of the components.

The log-likelihood to be maximized is given as $\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \log \sum_{j=1}^{N} q_j \exp\!\big(-\beta\, d_{\varphi}(x_i, x_j)\big)$ + const. With the empirical data-set distribution $\hat{P}(x_i) = 1/N$, the log-likelihood maximization can be equivalently expressed in terms of the Kullback-Leibler (KL) divergence between $\hat{P}$ and $Q$ as

$$\min_{q} D\big(\hat{P} \,\|\, Q\big) = -H(\hat{P}) - \sum_{i=1}^{N} \hat{P}(x_i) \log Q(x_i), \quad \text{s.t. } q_j \ge 0, \; \sum_{j=1}^{N} q_j = 1, \qquad (4)$$

where $H(\hat{P})$ is the entropy of the empirical distribution, which does not depend on the parameters $q$. Now the problem becomes minimizing (4), which is convex and can be solved with an iterative algorithm whose updates for the prior probabilities are given by

$$q_j^{(t+1)} = q_j^{(t)} \sum_{i=1}^{N} \hat{P}(x_i)\, \frac{f_j(x_i)}{Q^{(t)}(x_i)}. \qquad (5)$$

Grouping the data points into disjoint clusters is done by requiring the instances with the highest $q_j$ values to serve as exemplars and then assigning the remaining instances to the exemplar with the highest posterior probability. Note that the clustering performance is affected by the constant $\beta$; in [56] a reference value $\beta_0$ is determined with an empirical rule to identify a reasonable range of $\beta$.

II-2 Multi-View Clustering Based on Mixture Models or the EM Algorithm

In [58], under the assumption that the two views are independent, a multinomial distribution is adopted for the document clustering problem. Taking the two-view case as an example, the algorithm executes the M and E steps in each view and then interchanges the posteriors between views in each iteration. The optimization process is terminated when a predefined stopping condition is satisfied. Two multi-view EM algorithm versions for finite mixture models are proposed in [59]: the first version can be regarded as running EM in each view and combining all the weighted probabilistic clustering labels generated in each view before each new EM iteration, while the second version can be viewed as a probabilistic information fusion of the components of the two views.

Specifically, building on the CMMs for single-view clustering, the multi-view version proposed in [60] is attractive because it can locate the global optimum and thus avoids the initialization and local-optima problems of standard mixture models, which require multiple executions of the EM algorithm.

For multi-view CMMs, each instance with $V$ views is denoted by $x_i = \{x_i^{(v)}\}_{v=1}^{V}$, $i = 1, \dots, N$, and the mixture distribution for each view is given as $Q^{(v)}(x^{(v)}) = \sum_{j=1}^{N} q_j f_j^{(v)}(x^{(v)})$. To pursue a common clustering across all views, all views share the same priors $q_j$. In addition, an empirical data-set distribution $\hat{P}^{(v)}$, $v = 1, \dots, V$, is associated with each view, and the multi-view algorithm minimizes the sum of KL divergences between $\hat{P}^{(v)}$ and $Q^{(v)}$ across all views with the constraint on the priors:

$$\min_{q} \sum_{v=1}^{V} D\big(\hat{P}^{(v)} \,\|\, Q^{(v)}\big), \quad \text{s.t. } q_j \ge 0, \; \sum_{j=1}^{N} q_j = 1, \qquad (6)$$

from which it is straightforward to see that the optimized objective is convex, hence the global minimum can be found. The prior update rule is given as follows:

$$q_j^{(t+1)} = q_j^{(t)} \, \frac{1}{V} \sum_{v=1}^{V} \sum_{i=1}^{N} \hat{P}^{(v)}(x_i^{(v)})\, \frac{f_j^{(v)}(x_i^{(v)})}{Q^{(v)(t)}(x_i^{(v)})}. \qquad (7)$$

The prior $q_j$ associated with the $j$-th instance is a measure of how likely this instance is to be an exemplar, taking all views into account. Appropriate $\beta^{(v)}$ values are identified in the range of an empirically determined reference value $\beta_0^{(v)}$. From Eq. (6), it can be seen that all views contribute equally to the sum, without considering their different importance. To overcome this limitation, a weighted version of multi-view CMMs was proposed in [61].

III Discriminative Approaches

Compared with generative approaches, discriminative approaches directly optimize a clustering objective to seek the best clustering solution, rather than first modelling the subjects and then solving these models to determine the clustering result. Directly focusing on the clustering objective has made discriminative approaches attract more attention and develop more comprehensively. Up to now, most existing MVC methods are discriminative approaches. Based on how multiple views are combined, we categorize MVC methods into five main groups and introduce the representative works in each group.

To facilitate the following discussion, we first introduce the settings of MVC. Assume that we are given a data set $X$ with $V$ views. Let $X^{(v)} = [x_1^{(v)}, \dots, x_N^{(v)}] \in \mathbb{R}^{d_v \times N}$ denote the $N$ subjects in view $v$ ($v = 1, \dots, V$), where $d_v$ is the dimension of the $v$-th view. The aim of MVC is to cluster the $N$ subjects into $K$ groups. That is, we finally obtain a membership matrix $M \in \mathbb{R}^{N \times K}$ indicating which subjects belong to the same group, where the entries of each row of $M$ sum to 1 so that each row is a probability distribution. If only one entry of each row is 1 and all others are 0, this is so-called hard clustering; otherwise it is soft clustering.

III-A Common Eigenvector Matrix (Mainly Multi-View Spectral Clustering)

This group of MVC methods is based on spectral clustering, a commonly used clustering technique. Since spectral clustering hinges crucially on the construction of the graph Laplacian and the resulting eigenvectors reflect the cluster structure of the data, this group of MVC methods obtains a common clustering result by assuming that all views share a common or similar eigenvector matrix. There are two representative methods: co-training spectral clustering [6] and co-regularized spectral clustering [19]. Before discussing them, we first introduce spectral clustering [62].

III-A1 Spectral Clustering

Spectral clustering is a clustering technique that utilizes the properties of the Laplacian of a graph whose edges denote the similarities between the data points, and solves a relaxation of the normalized min-cut problem on this graph [63]. Compared with other widely used methods such as k-means clustering, which only fits spherical clusters, spectral clustering applies to clusters of arbitrary shape and demonstrates good performance.

Let $G$ be a weighted undirected graph with vertex set $\{v_1, \dots, v_N\}$. The adjacency matrix of the graph is defined to be $W \in \mathbb{R}^{N \times N}$, whose entry $w_{ij}$ represents the similarity of the two vertices $v_i$ and $v_j$; if $w_{ij} = 0$, the vertices $v_i$ and $v_j$ are not connected. Apparently $W$ is symmetric because $G$ is an undirected graph. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_i = \sum_j w_{ij}$ on the diagonal. Generally, the graph Laplacian is $L = D - W$ and the normalized graph Laplacian is $\mathcal{L} = D^{-1/2}(D - W)D^{-1/2}$. In many spectral clustering works like [62, 6, 19], $\bar{L} = D^{-1/2} W D^{-1/2}$ is also used to change the minimization problem (9) into the maximization problem (8), since $\mathcal{L} = I - \bar{L}$, where $I$ is the identity matrix. Following the same convention adopted in [62, 6, 19], we will refer to both $\mathcal{L}$ and $\bar{L}$ as normalized graph Laplacians afterwards. Now the single-view spectral clustering approach can be formulated as follows:

$$\max_{U \in \mathbb{R}^{N \times K}} \operatorname{tr}\big(U^{\top} \bar{L} U\big), \quad \text{s.t. } U^{\top} U = I, \qquad (8)$$

which is also equivalent to the following problem:

$$\min_{U \in \mathbb{R}^{N \times K}} \operatorname{tr}\big(U^{\top} \mathcal{L} U\big), \quad \text{s.t. } U^{\top} U = I, \qquad (9)$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace. The rows of the matrix $U$ are the embeddings of the data points, which can be fed to k-means to obtain the final clustering result. A version of the Rayleigh-Ritz theorem [64] shows that the solution of the above optimization problem is given by choosing $U$ as the matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of $\bar{L}$ or the smallest eigenvalues of $\mathcal{L}$. To understand the spectral clustering algorithm better, we briefly outline a commonly used procedure [62] to solve Eq. (8) as follows:

  • Construct the adjacency matrix $W$.

  • Compute the normalized Laplacian matrix $\bar{L} = D^{-1/2} W D^{-1/2}$.

  • Calculate the eigenvectors of $\bar{L}$ and stack the top $K$ eigenvectors as columns to construct the matrix $U \in \mathbb{R}^{N \times K}$.

  • Normalize each row of $U$ to obtain $\tilde{U}$.

  • Run the k-means algorithm to cluster the row vectors of $\tilde{U}$.

  • Assign subject $x_i$ to cluster $k$ if the $i$-th row of $\tilde{U}$ is assigned to cluster $k$ by the k-means algorithm.

Apart from the symmetric normalization $D^{-1/2} W D^{-1/2}$, there is also the random-walk normalization $D^{-1} W$. Refer to [65] for more details about spectral clustering.
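As an illustration of the procedure listed above, here is a minimal NumPy/scikit-learn sketch of normalized spectral clustering. The RBF similarity, the `gamma` value, and the function name are assumptions made for this example rather than choices prescribed by the surveyed methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k, gamma=1.0, seed=0):
    """Normalized spectral clustering following the steps listed above (sketch)."""
    # 1. Adjacency matrix W from a Gaussian (RBF) similarity, zero diagonal.
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0.0)
    # 2. Normalized Laplacian (maximization form) D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    L_bar = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # 3. Top-k eigenvectors of L_bar (largest eigenvalues) as the columns of U.
    _, eigvecs = np.linalg.eigh(L_bar)
    U = eigvecs[:, -k:]
    # 4. Row-normalize U.
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    # 5-6. Run k-means on the rows and return cluster assignments.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```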

III-A2 Co-Training Multi-View Spectral Clustering

Co-training is a widely used idea in semi-supervised learning, where both labeled and unlabeled data are available. It assumes that the predictor functions in the two views will, with high probability, give the same label for the same sample. There are two main assumptions that guarantee its success: (1) sufficiency: each view is sufficient for classification on its own; (2) conditional independence: the views are conditionally independent given the class labels. In the original co-training algorithm [66], two initial predictor functions $f_1$ and $f_2$ are trained on the labeled data of the two views; the co-training algorithm then repeatedly runs the following steps: the examples most confidently predicted by $f_1$ are added to the labeled set of $f_2$ and vice versa, and then $f_1$ and $f_2$ are retrained on the enlarged labeled sets. After a predefined number of iterations, $f_1$ and $f_2$ will agree with each other on the labels.

For co-training multi-view spectral clustering, the motivation is similar: the clustering result in each view should be the same. In spectral clustering, the eigenvectors of the graph Laplacian encode the discriminative information for clustering. Therefore, co-training multi-view spectral clustering [6] uses the eigenvectors of the graph Laplacian in one view to cluster the samples and then uses the clustering result to modify the graph Laplacian in the other view.

Each column of the similarity matrix (also called the adjacency matrix) can be considered an $N$-dimensional vector that indicates the similarities of one point to all the points in the graph. Since the largest eigenvectors carry the discriminative information for clustering, the similarity vectors can be projected along those directions to retain the discriminative information and to discard the within-cluster details that might confuse the clustering. After that, the projected information is back-projected to the original $N$-dimensional space to get the modified graph. Due to the orthonormality of the projection matrix, the inverse projection is equivalent to the transpose operation.

To make the co-training spectral clustering algorithm clear, we borrow Algorithm 1 from [6]. Note that the symmetrization operator used in Algorithm 1 is defined on a matrix $S$ as $\operatorname{sym}(S) = (S + S^{\top})/2$.

  Input: Similarity matrices for the two views: $W^{(1)}$ and $W^{(2)}$.
  Output: Assignments to $K$ clusters.
  Initialize: $U^{(v)}_0$ as the top $K$ eigenvectors of the normalized graph Laplacian of $W^{(v)}$, for $v = 1, 2$.
  for i = 1 to T do
  1. $S^{(1)} = \operatorname{sym}\big(W^{(1)} U^{(2)}_{i-1} U^{(2)\top}_{i-1}\big)$
  2. $S^{(2)} = \operatorname{sym}\big(W^{(2)} U^{(1)}_{i-1} U^{(1)\top}_{i-1}\big)$
  3. Use $S^{(1)}$ and $S^{(2)}$ as the new graph similarities and compute the corresponding graph Laplacians. Solve for the largest $K$ eigenvectors to obtain $U^{(1)}_i$ and $U^{(2)}_i$.
  end for
  4: Normalize each row of $U^{(1)}_T$ and $U^{(2)}_T$.
  5: Form the matrix $F = U^{(v^*)}_T$, where $v^*$ is the most informative view a priori. If there is no prior knowledge on the view informativeness, $F$ can also be set to the column-wise concatenation of the two $U$s.
  6: Assign example $x_i$ to cluster $k$ if the $i$-th row of $F$ is assigned to cluster $k$ by the k-means algorithm.
Algorithm 1 Co-training Multi-View Spectral Clustering
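Below is a hedged Python sketch of Algorithm 1, assuming precomputed similarity matrices `W1` and `W2` and the symmetrically normalized Laplacian used above. The helper names, the fixed iteration count, and the choice to concatenate both embeddings in step 5 are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sym(S):
    """Symmetrization operator sym(S) = (S + S^T) / 2 used in Algorithm 1."""
    return (S + S.T) / 2.0

def norm_laplacian(W):
    """Symmetrically normalized Laplacian D^{-1/2} W D^{-1/2} (maximization form)."""
    d = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return d[:, None] * W * d[None, :]

def top_eigvecs(L, k):
    """Eigenvectors associated with the k largest eigenvalues of L."""
    _, vecs = np.linalg.eigh(L)
    return vecs[:, -k:]

def cotrain_spectral(W1, W2, k, n_iter=10, seed=0):
    """Sketch of co-training multi-view spectral clustering (Algorithm 1)."""
    U1 = top_eigvecs(norm_laplacian(W1), k)
    U2 = top_eigvecs(norm_laplacian(W2), k)
    for _ in range(n_iter):
        # Project each view's similarities onto the other view's spectral embedding.
        S1 = sym(W1 @ U2 @ U2.T)
        S2 = sym(W2 @ U1 @ U1.T)
        U1 = top_eigvecs(norm_laplacian(S1), k)
        U2 = top_eigvecs(norm_laplacian(S2), k)
    # Steps 4-5: row-normalize and column-wise concatenate the two embeddings.
    F = np.hstack([U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
                   for U in (U1, U2)])
    # Step 6: assign clusters with k-means on the rows of F.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(F)
```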

III-A3 Co-Regularized Multi-View Spectral Clustering

Co-regularization is a well-known technique in semi-supervised multi-view learning. Its core idea is to make the disagreement between the predictor functions of the two views part of the objective function to be minimized. However, there are no predictor functions in unsupervised learning such as clustering, so how can the co-regularization idea be implemented in a clustering problem? Co-regularized multi-view spectral clustering [19] lets the eigenvectors of the graph Laplacians play a role similar to that of the predictor functions in the semi-supervised scenario and proposes two co-regularized clustering approaches.

Let $U^{(v)}$ and $U^{(w)}$ be the eigenvector matrices corresponding to any pair of view graph Laplacians $\bar{L}^{(v)}$ and $\bar{L}^{(w)}$ ($1 \le v \ne w \le V$). The first version uses a pair-wise co-regularization criterion that enforces $U^{(v)}$ and $U^{(w)}$ to be as close as possible. The measure of clustering disagreement between the two views $v$ and $w$ is $D(U^{(v)}, U^{(w)}) = \big\| \tfrac{K_{U^{(v)}}}{\|K_{U^{(v)}}\|_F^2} - \tfrac{K_{U^{(w)}}}{\|K_{U^{(w)}}\|_F^2} \big\|_F^2$, where $K_{U^{(v)}} = U^{(v)} U^{(v)\top}$, computed with the linear kernel, is the similarity matrix of $U^{(v)}$. Since $\|K_{U^{(v)}}\|_F^2 = K$, where $K$ is the number of clusters, the disagreement measure reduces (up to constants) to $-\operatorname{tr}\big(U^{(v)} U^{(v)\top} U^{(w)} U^{(w)\top}\big)$. Integrating the measure of disagreement between every pair of views into the spectral clustering framework, pair-wise co-regularized multi-view spectral clustering is formulated as the following optimization problem:

$$\max_{U^{(1)},\dots,U^{(V)}} \; \sum_{v=1}^{V} \operatorname{tr}\big(U^{(v)\top} \bar{L}^{(v)} U^{(v)}\big) + \lambda \sum_{1 \le v \ne w \le V} \operatorname{tr}\big(U^{(v)} U^{(v)\top} U^{(w)} U^{(w)\top}\big), \quad \text{s.t. } U^{(v)\top} U^{(v)} = I \;\; \forall v. \qquad (10)$$

The hyperparameter $\lambda$ trades off the spectral clustering objectives against the spectral embedding disagreement terms.

The second version, named centroid-based co-regularization, enforces the eigenvector matrix of each view to be similar by regularizing them towards a common consensus eigenvector matrix $U^*$. The corresponding optimization problem is formulated as

$$\max_{U^{(1)},\dots,U^{(V)}, U^*} \; \sum_{v=1}^{V} \operatorname{tr}\big(U^{(v)\top} \bar{L}^{(v)} U^{(v)}\big) + \sum_{v=1}^{V} \lambda_v \operatorname{tr}\big(U^{(v)} U^{(v)\top} U^* U^{*\top}\big), \quad \text{s.t. } U^{(v)\top} U^{(v)} = I \;\; \forall v, \; U^{*\top} U^* = I. \qquad (11)$$
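A possible alternating-maximization sketch for the centroid-based objective (11) is given below, assuming the maximization form of the normalized Laplacians and a single regularization weight `lam` shared by all views. This is one illustrative reading of the optimization, not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def top_eigvecs(M, k):
    """Eigenvectors of a symmetric matrix M for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -k:]

def centroid_coreg_spectral(laplacians, k, lam=0.01, n_iter=20, seed=0):
    """Alternating maximization of the centroid-based objective (11) (sketch)."""
    # laplacians: one normalized Laplacian (maximization form) per view.
    Us = [top_eigvecs(L, k) for L in laplacians]            # view embeddings U^(v)
    for _ in range(n_iter):
        # Consensus embedding U*: top eigenvectors of sum_v lam * U^(v) U^(v)^T.
        Ustar = top_eigvecs(sum(lam * U @ U.T for U in Us), k)
        # View embeddings: top eigenvectors of L^(v) + lam * U* U*^T.
        Us = [top_eigvecs(L + lam * Ustar @ Ustar.T, k) for L in laplacians]
    Ustar = top_eigvecs(sum(lam * U @ U.T for U in Us), k)  # final consensus
    # Cluster the rows of the consensus embedding with k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Ustar)
```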

Since relaxed kernel k-means and spectral clustering are equivalent, Ye et al. [67] proposed a co-regularized kernel k-means for multi-view clustering that learns flexible view weights automatically. With a multi-layer Grassmann manifold interpretation, Dong et al. [68] obtained the same formulation as the pair-wise co-regularized multi-view spectral clustering.

III-A4 Others

Besides the two representative multi-view spectral clustering methods mentioned above, Wang et al. [38] enforce a common eigenvector matrix, formulate a multi-objective problem, and then solve it using Pareto optimization.

III-B Common Coefficient Matrix (Mainly Multi-View Subspace Clustering)

In many practical applications, even though the given data set is high dimensional, its intrinsic dimension is often much lower. For example, the number of pixels in a given image can be large, yet only a few parameters are needed to describe the appearance, geometry, and dynamics of the scene. This motivates methods that find the underlying low-dimensional space. In practice, the data may be sampled from multiple subspaces. Subspace clustering [69] is the technique that finds the underlying subspaces and then clusters the data points accordingly.

III-B1 Subspace Clustering

Subspace clustering uses the self-expression property [70] of the data set $X \in \mathbb{R}^{d \times N}$ to represent it as:

$$X = XZ + E, \qquad (12)$$

where $Z = [z_1, z_2, \dots, z_N] \in \mathbb{R}^{N \times N}$ is the subspace coefficient matrix (representation matrix), each $z_i$ is the representation of the original data point $x_i$ based on the subspaces, and $E$ is the noise matrix.

Subspace clustering can be formulated as the following optimization problem:

$$\min_{Z, E} \|Z\|_p + \lambda \|E\|_q, \quad \text{s.t. } X = XZ + E, \; \operatorname{diag}(Z) = 0, \; Z^{\top}\mathbf{1} = \mathbf{1}, \qquad (13)$$

where $\|\cdot\|_p$ and $\|\cdot\|_q$ are properly chosen norms. The constraint $\operatorname{diag}(Z) = 0$ avoids the case that a data point is represented by itself, while $Z^{\top}\mathbf{1} = \mathbf{1}$ denotes that the data points lie in a union of affine subspaces. The nonzero elements of $z_i$ correspond to data points from the same subspace as $x_i$.

After obtaining the subspace representation $Z$, the similarity matrix (e.g., $W = |Z| + |Z^{\top}|$) can be computed to construct the graph Laplacian, and spectral clustering is then run on that graph Laplacian to get the final clustering result.
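The following sketch illustrates single-view sparse subspace clustering under the self-expression model (12)-(13), using an ℓ1 (Lasso) penalty as the choice of norm and dropping the affine constraint for simplicity. The `alpha` value and the affinity construction `|Z| + |Z|^T` are common conventions assumed here rather than prescribed by the survey.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(X, k, alpha=0.01):
    """Sketch of sparse subspace clustering via the self-expression property."""
    # X: (n_samples, n_features); each sample is expressed by the other samples.
    n = X.shape[0]
    Z = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i          # enforce diag(Z) = 0
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[mask].T, X[i])        # columns of the design matrix are the other points
        Z[i, mask] = lasso.coef_
    # Build a symmetric affinity from the coefficients and run spectral clustering.
    W = np.abs(Z) + np.abs(Z).T
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(W)
```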

III-B2 Multi-View Subspace Clustering

With multi-view information, a subspace representation can be obtained from each view. To get a consistent clustering result from multiple views, Yin et al. [71] share a common coefficient matrix by enforcing the coefficient matrices from each pair of views to be as similar as possible. The optimization problem is formulated as

(14)

where the $\ell_1$-norm based pairwise co-regularization term between the coefficient matrices of different views alleviates the noise problem, an $\ell_1$ penalty on each coefficient matrix is used to enforce a sparse solution, and the constraint that the diagonal elements of each coefficient matrix be zero is used to avoid the trivial solution in which each data point is represented by itself.

Wang et al. [72] enforced a similar idea to combine multi-view information. In addition, they adopted a multi-graph regularization, with each graph Laplacian regularization characterizing the view-dependent nonlinear local data similarity. At the same time, they assumed that the view-dependent representations are low rank and sparse and accounted for sparse noise in the data. Wang et al. [53] proposed an angle-based similarity to measure the correlation consensus across multiple views and obtained a robust subspace clustering method for multi-view data. Different from the above approaches, three works [35, 36, 73] adopted a general nonnegative matrix factorization formulation but shared a common representation matrix for the samples across views while keeping a view-specific representation matrix for each view. Zhao et al. [26] adopted a deep semi-nonnegative matrix factorization to perform multi-view clustering; in the last layer, a common coefficient matrix is enforced to exploit the multi-view information.

III-C Common Indicator Matrix (Mainly Multi-View Nonnegative Matrix Factorization Clustering)

III-C1 Nonnegative Matrix Factorization

For a nonnegative data matrix $X \in \mathbb{R}^{d \times N}_{+}$, Nonnegative Matrix Factorization (NMF) [74] aims to seek two nonnegative matrix factors $W \in \mathbb{R}^{d \times K}_{+}$ and $H \in \mathbb{R}^{K \times N}_{+}$ whose product is a good approximation to $X$:

$$\min_{W \ge 0,\, H \ge 0} \|X - WH\|_F^2, \qquad (15)$$

where $K$ denotes the desired reduced dimension (for clustering, it is the number of clusters). Here, $W$ is the basis matrix while $H$ is the indicator matrix.

Due to the nonnegativity constraints, NMF can learn a part-based representation. This is very intuitive and meaningful in many applications, such as face recognition in [74]. The observations in many applications like information retrieval [74] and pattern recognition [75] can be explained as additive linear combinations of nonnegative basis vectors. NMF has been applied successfully to clustering [74, 76] and has achieved state-of-the-art performance.

In addition, k-means clustering can be formulated as NMF by introducing an indicator matrix $H \in \{0, 1\}^{K \times N}$, where $h_{ki} = 1$ if $x_i$ belongs to cluster $k$ and 0 otherwise. The NMF formulation of k-means clustering is

$$\min_{C, H} \|X - CH\|_F^2, \quad \text{s.t. } H \in \{0, 1\}^{K \times N}, \; H^{\top}\mathbf{1} = \mathbf{1}, \qquad (16)$$

where $C \in \mathbb{R}^{d \times K}$ is the cluster centroid matrix.
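A minimal sketch of NMF-based clustering with scikit-learn is shown below. Note that scikit-learn factorizes a samples-by-features matrix as X ≈ WH, so the sample-wise indicator corresponds to W in this convention; the initialization scheme and iteration budget are illustrative assumptions.

```python
from sklearn.decomposition import NMF

def nmf_clustering(X, k, seed=0):
    """Cluster nonnegative data by factorizing X ≈ W H and reading off the factor (sketch)."""
    # X: (n_samples, n_features), all entries must be nonnegative.
    model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=seed)
    W = model.fit_transform(X)        # (n_samples, k): soft cluster indicator per sample
    # Hard assignment: each sample goes to the component with the largest weight.
    return W.argmax(axis=1)
```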

III-C2 Multi-View Nonnegative Matrix Factorization Clustering

By enforcing the indicator matrices of different views to be the same, Akata et al. [77] extended NMF [74] to the multi-view setting.

To combine multi-view information in the NMF framework, Akata et al. [77] enforce a shared indicator matrix among different views. However, the indicator matrices of different views might not be comparable at the same scale. In order to keep the clustering solutions across different views meaningful and comparable, Liu et al. [78] enforce a constraint that pushes each view-dependent indicator matrix towards a common indicator matrix, together with a normalization constraint inspired by the connection between NMF and probabilistic latent semantic analysis. The final optimization problem is formulated as:

(17)

The normalization constraint guarantees that the view-dependent indicator matrices stay within the same range across views, so that the comparison between each view-dependent indicator matrix and the consensus indicator matrix is meaningful.

After obtaining the consensus indicator matrix, the cluster label of each data point can be computed as the index of its largest entry in the consensus matrix.

As mentioned above, subspace clustering consists of two steps: computing the subspace representation and running spectral clustering on the graph Laplacian computed from the obtained subspace representation. To get a consistent clustering from multiple views, Gao et al. [79] merged the two steps of subspace clustering and enforced a common indicator matrix across different views. The formulation is as follows:

(18)

where each view has its own subspace representation matrix, from which a graph Laplacian is built using the corresponding diagonal degree matrix, and a common indicator matrix yields a consistent clustering result across different views. Although this multi-view subspace clustering method is built on subspace clustering, it does not enforce a common coefficient matrix but instead shares a common indicator matrix across views; that is why we categorize it into this group. A similar situation arises for spectral clustering below.

Since spectral clustering works on a graph and the eigendecomposition is time-consuming, it is inappropriate for large-scale applications. The k-means algorithm is not subject to this limitation and is thus a good choice for large-scale applications. To deal with large-scale multi-view data, Cai et al. [31] proposed a multi-view k-means clustering method that adopts a common indicator matrix across different views. The optimization problem is formulated as follows:

(19)

where each view is assigned a weight factor and a parameter controls the distribution of the weights. By learning the weights of different views, the important views receive large weights during multi-view clustering.

Wang et al. [7] integrate multi-view information via a common indicator matrix and simultaneously take into account the varied importance of different features for different data clusters. The formulation is

(20)

where each input data point contains the features from all the views, with each view contributing its own dimensions to the total feature dimension; a weight matrix holds the weights of each feature for the clusters; and the model also includes an intercept vector, a constant vector of all ones, and the cluster indicator matrix. A group regularization term learns the group-wise importance of one view's features for each cluster, while an additional norm learns the individual feature weights across different clusters.

Before centroid-based co-regularization, a similar work [80] used the same idea to perform multi-view spectral clustering. The main difference is that [80] used the Frobenius-norm distance between each view's eigenvector matrix and the common eigenvector matrix as the disagreement measure, while co-regularized multi-view spectral clustering [19] adopted the trace-based measure described above. The optimization problem of [80] is formulated as

(21)

where an additional constraint makes the common matrix the final cluster indicator matrix. Different from general spectral clustering, which obtains the eigenvector matrix first and then runs a clustering algorithm (such as k-means, which is sensitive to initialization) to assign clusters, Cai et al. [80] directly solve for the final cluster indicator matrix, which makes the method more robust to the initial conditions.

In [81], a matrix factorization approach was adopted to reconcile the groups arising from the individual views. Specifically, a matrix that contains the partitioning of every individual view is created and then decomposed into two matrices, one showing the contribution of those partitionings to the final multi-view clusters (called meta-clusters) and the other the assignment of instances to the meta-clusters. Tang et al. [40] considered multi-view clustering as clustering with multiple graphs, each of which is approximated by matrix factorization with a graph-specific factor and a factor common to all graphs. Qian et al. [82] used a relaxed common indicator matrix (making each view's indicator matrix as close as possible to it) to combine multi-view information and employed Laplacian regularization to simultaneously maintain the latent geometric structure of the views. Apart from using a common indicator matrix, [83, 84, 85] introduced a weight matrix to indicate whether each entry of the corresponding data matrix is missing, so that the missing-value problem can be tackled, and multi-view self-paced clustering [34] takes the complexities of the samples and views into consideration to alleviate the local-minima problem. Tao et al. [32] enforce a common indicator matrix and seek the consensus clustering among all views in an ensemble clustering manner. Apart from multi-view nonnegative matrix factorization clustering, some other methods utilize a common indicator matrix to combine multiple views for clustering, such as [21], which additionally borrows the linear discriminant analysis idea and weights each view automatically. For graph-based clustering methods, a similarity matrix is first obtained for each view; Nie et al. [33] assume a common indicator matrix and then solve the problem by minimizing the differences between the common indicator matrix and each similarity matrix.

III-D Direct Combination (Mainly Multi-Kernel Based Multi-View Clustering)

Apart from sharing a common structure across different views, directly combining the views is another good way to perform multi-view clustering. A natural approach is to define a kernel for each view and then combine these kernels convexly [8, 86, 87].

III-D1 Kernel Functions and Kernel Combination Forms

The kernel trick allows nonlinear problems to be learned with linear learning algorithms, since a kernel function directly gives the inner products in the feature space without explicitly defining the nonlinear transformation $\phi$. Some common kernel functions are as follows:

  • Linear kernel: $k(x_i, x_j) = x_i^{\top} x_j$,

  • Polynomial kernel: $k(x_i, x_j) = (x_i^{\top} x_j + c)^d$,

  • Gaussian kernel (radial basis kernel): $k(x_i, x_j) = \exp\!\big(-\|x_i - x_j\|^2 / (2\sigma^2)\big)$,

  • Sigmoid kernel: $k(x_i, x_j) = \tanh\!\big(a\, x_i^{\top} x_j + c\big)$.

In machine learning, common kernel functions can be viewed as similarity functions [88], so the kernel trick can be used to deal with spectral clustering and kernel k-means problems. So far, there exist some works on multi-kernel learning for clustering [89, 90, 91]; however, they are all for single-view clustering. If each kernel is derived from one view and the kernels are then combined carefully to deal with the clustering problem, this becomes multi-kernel learning for multi-view clustering. Obviously, multi-kernel learning [92, 93, 94, 95] can be considered an important part of multi-view clustering. There are three main categories of multi-kernel combination [96]:

  • Linear combination: It includes two basic subcategories, the unweighted sum $\mathbf{K} = \sum_{v=1}^{V} \mathbf{K}^{(v)}$ and the weighted sum $\mathbf{K} = \sum_{v=1}^{V} (\alpha^{(v)})^{\gamma}\, \mathbf{K}^{(v)}$, where $\alpha^{(v)}$ denotes the kernel weight for the $v$-th view with $\alpha^{(v)} \ge 0$ and $\sum_{v=1}^{V} \alpha^{(v)} = 1$, and $\gamma$ is a hyperparameter that controls the distribution of the weights (a minimal sketch of this weighted-sum combination appears after this list),

  • Nonlinear combination: It uses some nonlinear functions of kernels, namely, multiplication, power, and exponentiation,

  • Data-dependent combination: It assigns specific kernel weights for each data instance, which can identify the local distributions in the data and learn proper kernel combination rules for each region.
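As referenced in the linear-combination item above, the sketch below forms one RBF kernel per view and combines them with a weighted sum. The uniform default weights, the shared `gamma`, and the function name are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def weighted_sum_kernel(views, weights=None, gamma=1.0):
    """Linear (weighted-sum) combination of one RBF kernel per view (sketch)."""
    # views: list of (n_samples, d_v) feature matrices describing the same samples.
    if weights is None:
        weights = np.full(len(views), 1.0 / len(views))  # uniform (unweighted) sum
    return sum(w * rbf_kernel(Xv, gamma=gamma) for w, Xv in zip(weights, views))
```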

III-D2 Kernel k-Means and Spectral Clustering

Kernel k-means [97] and spectral clustering [98] are two kernel-based clustering methods that optimize the intra-cluster variance. Let $\phi(\cdot)$ be a feature mapping which maps $x$ onto a reproducing kernel Hilbert space $\mathcal{H}$. The kernel k-means problem is formulated as the following optimization problem,

$$\min_{H} \sum_{i=1}^{N} \sum_{k=1}^{K} h_{ik}\, \big\|\phi(x_i) - \mu_k\big\|^2, \quad \text{s.t. } \mu_k = \frac{1}{n_k} \sum_{i=1}^{N} h_{ik}\, \phi(x_i), \qquad (22)$$

where $H \in \{0, 1\}^{N \times K}$ is the cluster indicator matrix (also known as the cluster assignment matrix), and $n_k$ and $\mu_k$ are the size and centroid of the $k$-th cluster, respectively. With a kernel matrix $\mathbf{K}$ whose entries are $\mathbf{K}_{ij} = \phi(x_i)^{\top}\phi(x_j)$, $L = \operatorname{diag}(n_1^{-1}, \dots, n_K^{-1})$, and $\mathbf{1}$ a column vector with all elements equal to 1, Eq. (22) can be equivalently rewritten in the following matrix-vector form,

$$\min_{H} \operatorname{tr}(\mathbf{K}) - \operatorname{tr}\big(L^{1/2} H^{\top} \mathbf{K} H L^{1/2}\big), \quad \text{s.t. } H\mathbf{1} = \mathbf{1}. \qquad (23)$$

In the above matrix form of kernel k-means, the matrix $H$ is discrete, which makes the optimization problem difficult to solve. By relaxing $H$ to take arbitrary real values, the problem can be approximated. Specifically, by defining $U = H L^{1/2}$, letting $U$ take real values, and further noting that $\operatorname{tr}(\mathbf{K})$ is constant, Eq. (23) is relaxed to

$$\max_{U} \operatorname{tr}\big(U^{\top} \mathbf{K} U\big), \quad \text{s.t. } U^{\top} U = I. \qquad (24)$$

The fact that $U^{\top} U = L^{1/2} H^{\top} H L^{1/2} = I$ leads to the orthogonality constraint on $U$, which further tells us that the optimal $U$ can be obtained from the top $K$ eigenvectors of the kernel matrix $\mathbf{K}$. Therefore, Eq. (24) can be considered a generalized matrix form of spectral clustering. Note that Eq. (24) is equivalent to Eq. (8) if the kernel matrix takes the form of the normalized Gram matrix $D^{-1/2} W D^{-1/2}$.
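The relaxation (24) can be solved directly by eigendecomposition, as in the sketch below. Rounding the relaxed solution with k-means on the rows of U is a common heuristic assumed here, and a combined multi-view kernel such as the one from the previous sketch can be passed in as `K`.

```python
import numpy as np
from sklearn.cluster import KMeans

def relaxed_kernel_kmeans(K, k, seed=0):
    """Spectral relaxation (24) of kernel k-means: top-k eigenvectors of K (sketch)."""
    # K: (n, n) positive semi-definite kernel matrix, e.g. a combined multi-view kernel.
    _, vecs = np.linalg.eigh(K)
    U = vecs[:, -k:]                      # relaxed, real-valued indicator U = H L^{1/2}
    # Round the relaxed solution by running k-means on the rows of U.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```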

III-D3 Multi-Kernel Based Multi-View Clustering

Assume there are $V$ kernel matrices $\mathbf{K}^{(1)}, \dots, \mathbf{K}^{(V)}$ available, each of which corresponds to one view. To make full use of all views, a weighted combination of these kernels can be plugged into kernel k-means (24) and spectral clustering (8) to obtain the corresponding multi-view kernel k-means and multi-view spectral clustering of [41]. With a similar combination and a specific choice of weights, Guo et al. [99] extended spectral clustering to multi-view clustering by further employing kernel alignment. Due to the potential redundancy of the selected kernels, Liu et al. [28] introduced a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels, with the ultimate goal of boosting the clustering performance. By replacing the original Euclidean norm metric in fuzzy c-means with a kernel-induced metric in the data space and adopting a weighted kernel combination, Zhang et al. [100] successfully extended fuzzy c-means to a multi-view clustering method that is robust to noise and outliers. For incomplete multi-view data sets, Shao et al. [43] collectively complete the kernel matrices of the incomplete data sets by optimizing the alignment of the shared instances. To overcome the cluster initialization problem associated with kernel k-means, Tzortzis et al. [54] proposed a global kernel k-means algorithm, a deterministic and incremental approach that adds one cluster at each stage through a global search procedure consisting of several executions of kernel k-means from suitable initializations.

III-D4 Others

Besides multi-kernel based multi-view clustering, there are some other methods that use direct combination to perform multi-view clustering, such as [21, 33]. In [46], two levels of weights, view weights and variable weights, are assigned in the clustering algorithm for multi-view data to identify the importance of the corresponding views and variables. To extend fuzzy clustering to multi-view clustering, each view is weighted, and multi-view versions of fuzzy c-means and fuzzy k-means are obtained in [42] and [51], respectively.

III-E Combination After Projection (Mainly CCA-Based Multi-View Clustering)

For homogeneous multi-view data, it is reasonable to combine the views directly. However, in real-world applications, the multiple representations may come from different feature spaces. For instance, in bioinformatics, gene information can be one view while clinical symptoms can be the other in patient clustering [13]. Obviously, such information cannot be combined directly. Moreover, high dimensionality and noise are difficult to handle. To solve these problems, the last yet important combination strategy is introduced: combination after projection. The most commonly used techniques are Canonical Correlation Analysis (CCA) and its kernel version (KCCA).

III-E1 CCA and KCCA

To better understand this style of multi-view combination, CCA and KCCA are briefly introduced (refer to [101] for more details). Given two zero-mean data sets $X = [x_1, \dots, x_N] \in \mathbb{R}^{d_x \times N}$ and $Y = [y_1, \dots, y_N] \in \mathbb{R}^{d_y \times N}$, CCA aims to find a projection $w_x$ for $X$ and another projection $w_y$ for $Y$ such that the correlation between the projections of $X$ and $Y$ onto $w_x$ and $w_y$ is maximized,

$$\rho = \max_{w_x, w_y} \frac{w_x^{\top} C_{xy} w_y}{\sqrt{w_x^{\top} C_{xx} w_x \; w_y^{\top} C_{yy} w_y}}, \qquad (25)$$

where $\rho$ is the correlation and $C_{xy}$ denotes the covariance matrix of the zero-mean $X$ and $Y$ (similarly for $C_{xx}$ and $C_{yy}$). Observing that $\rho$ is not affected by scaling $w_x$ or $w_y$, either together or independently, CCA can be reformulated as

$$\max_{w_x, w_y} w_x^{\top} C_{xy} w_y, \quad \text{s.t. } w_x^{\top} C_{xx} w_x = 1, \; w_y^{\top} C_{yy} w_y = 1. \qquad (26)$$

With the method of Lagrange multipliers, the two Lagrange multipliers turn out to be equal, that is, $\lambda_x = \lambda_y = \lambda$. If $C_{yy}$ is invertible, $w_y$ can be obtained as $w_y = \frac{1}{\lambda} C_{yy}^{-1} C_{yx} w_x$, and $w_x$ then satisfies the generalized eigenvalue problem $C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x$. Taking the eigenvectors corresponding to the eigenvalues from large to small yields successive pairs of canonical directions, which can be considered a successive process.
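To illustrate combination after projection, the sketch below uses scikit-learn's CCA to project two views into correlated subspaces and then clusters the concatenated projections. Projecting to `k` components (which must not exceed the smaller view dimension) and concatenating the two projections are illustrative choices, not part of the CCA derivation above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans

def cca_then_cluster(X, Y, k, n_components=None, seed=0):
    """Combination after projection: CCA on two views, then k-means on the projections."""
    # X: (n, d_x) and Y: (n, d_y) are two views of the same n samples.
    cca = CCA(n_components=n_components or k)
    Zx, Zy = cca.fit_transform(X, Y)      # canonical projections of each view
    Z = np.hstack([Zx, Zy])               # combine the projected views
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
```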

The above canonical correlation problem can be transformed into a distance minimization problem. For ease of derivation, the successive formulation of canonical correlation is replaced by the simultaneous formulation. Assume the number of projections is $p$, and let the matrices $W_x$ and $W_y$ denote $[w_x^1, \dots, w_x^p]$ and $[w_y^1, \dots, w_y^p]$, respectively. The simultaneous formulation is the following optimization problem over the $p$ projection pairs:

(27)

The matrix formulation of the optimization problem (27) is