1 Introduction
In many real-world machine learning problems, the same data comprises several different representations or views. For example, the same documents may be available in multiple languages [1], or different descriptors can be constructed from the same images [2]. Although each of these individual views may be sufficient to perform a learning task, integrating complementary information from different views can reduce the complexity of a given task [3]. Multi-view clustering seeks to partition data points based on multiple representations by assuming that the same cluster structure is shared across views. By combining information from different views, multi-view clustering algorithms attempt to achieve more accurate cluster assignments than one can get by simply concatenating features from different views.

In practice, high-dimensional data often reside in a low-dimensional subspace. When all data points lie in a single subspace, the problem can be posed as finding a basis of the subspace and a low-dimensional representation of the data points. Depending on the constraints imposed on the low-dimensional representation, this problem can be solved using, e.g., Principal Component Analysis (PCA) [4], Independent Component Analysis (ICA) [5] or Non-negative Matrix Factorization (NMF) [6, 7, 8]. On the other hand, data points can be drawn from different sources and lie in a union of subspaces. By assigning each subspace to one cluster, one can solve the problem by applying standard clustering algorithms, such as k-means [9]. However, these algorithms are based on the assumption that data points are distributed around a centroid and often do not perform well when data points in a subspace are arbitrarily distributed. For example, two points can be close to each other and lie in different subspaces, or can be far apart and still lie in the same subspace [10]. Therefore, methods that rely on the spatial proximity of data points often fail to provide a satisfactory solution. This has motivated the development of subspace clustering algorithms [10]. The goal of subspace clustering is to identify the low-dimensional subspaces and find the cluster membership of data points.

Spectral-based methods [11, 12, 13] present one approach to the subspace clustering problem. They have gained a lot of attention in recent years due to the competitive results they achieve on arbitrarily shaped clusters and their well-defined mathematical principles. These methods are based on spectral graph theory and represent data points as nodes in a weighted graph. The clustering problem is then solved as a relaxation of the min-cut problem on a graph [14].

One of the main challenges in spectral-based methods is the construction of the affinity matrix whose elements define the similarity between data points. Sparse subspace clustering [15] and low-rank subspace clustering [16, 17, 18, 19] are among the most effective methods that solve this problem. These methods rely on the self-expressiveness property of the data by representing each data point as a linear combination of other data points. Low-Rank Representation (LRR) [16, 17]
imposes a low-rank constraint on the data representation matrix and captures the global structure of the data. Low rank implies that the data matrix is represented by a sum of a small number of outer products of left and right singular vectors weighted by the corresponding singular values. Under the assumption that subspaces are independent and data sampling is sufficient, LRR guarantees exact clustering. However, for many real-world datasets this assumption is overly restrictive and the assumption that data is drawn from disjoint subspaces would be more appropriate [20, 21]. On the other hand, Sparse Subspace Clustering (SSC) [15] represents each data point as a sparse linear combination of other points and captures the local structure of the data. Learning the representation matrix in SSC can be interpreted as sparse coding [22, 23, 24, 25, 26, 27]. However, in contrast to sparse coding, where a dictionary is learned such that the representation is sparse [28, 29], SSC is based on the self-representation property, i.e. the data matrix itself serves as the dictionary. SSC also succeeds when data is drawn from independent subspaces, and conditions have been established for clustering data drawn from disjoint subspaces [30]. However, theoretical analysis in [31] shows that it is possible that SSC over-segments subspaces when the dimensionality of data points is higher than three. Experimental results in [32] show that LRR misclassifies different data points than SSC. Therefore, in order to capture both the global and the local structure of the data, it is necessary to combine low-rank and sparsity constraints [32, 33].

Multi-view subspace clustering can be considered as a part of multi-view or multi-modal learning. The multi-view learning method in [34] learns view generation matrices and a representation matrix, relying on the assumption that data from all the views share the same representation matrix. The multi-view method in [35] is based on canonical correlation analysis for the extraction of two-view filterbank-based features for an image classification task. Similarly, in [36]
the authors rely on tensor-based canonical correlation analysis to perform multi-view dimensionality reduction. This approach can be used as a preprocessing step in multi-view learning in the case of high-dimensional data. In [37] a low-rank representation matrix is learned on each view separately and the learned representation matrices are concatenated into a matrix from which a unified graph affinity matrix is obtained. The method in [38] relies on learning a linear projection matrix for each view separately. High-order distance-based multi-view stochastic learning is proposed in [39] to efficiently explore the complementary characteristics of multi-view features for image classification. The method in [40] is application-oriented towards image reranking and assumes that multi-view features are contained in hypergraph Laplacians that define different modalities. In [41] the authors propose a multi-view matrix completion algorithm for handling multi-view features in semi-supervised multi-label image classification.

Previous multi-view subspace clustering works [42, 43, 44, 45] address the problem by constructing an affinity matrix on each view separately and then extending the algorithm to handle multi-view data. However, since input data may often be corrupted by noise, this approach can lead to the propagation of noise in the affinity matrices and degrade clustering performance. Different from the existing approaches, we propose a multi-view spectral clustering framework that jointly learns a subspace representation by constructing a single affinity matrix shared by multi-view data, while at the same time encouraging low rank and sparsity of the representation. We propose Multi-view Low-rank Sparse Subspace Clustering (MLRSSC) algorithms that enforce agreement: (i) between affinity matrices of the pairs of views; (ii) between affinity matrices towards a common centroid. In contrast to [35, 40, 46], the proposed approach can deal with highly heterogeneous multi-view data coming from different modalities. We present an optimization procedure to solve the convex dual optimization problems using the Alternating Direction Method of Multipliers (ADMM) [47]. Furthermore, we propose the kernel extension of our algorithms by solving the problem in a Reproducing Kernel Hilbert Space (RKHS). Experimental results show that the MLRSSC algorithm outperforms state-of-the-art multi-view subspace clustering algorithms on several benchmark datasets. Additionally, we evaluate performance on a novel real-world heterogeneous multi-view dataset from the biological domain.
The remainder of the paper is organized as follows. Section 2 gives a brief overview of the low-rank and sparse subspace clustering methods. Section 3 introduces two novel multi-view subspace clustering algorithms. In Section 4 we present the kernelized version of the proposed algorithms by formulating the subspace clustering problem in RKHS. The performance of the new algorithms is demonstrated in Section 5. Section 6 concludes the paper.
2 Background and Related Work
In this section, we give a brief introduction to Sparse Subspace Clustering (SSC) [15], Low-Rank Representation (LRR) [16, 17] and Low-rank Sparse Subspace Clustering (LRSSC) [32].
2.1 Main Notations
Throughout this paper, matrices are represented with bold capital symbols and vectors with bold lowercase symbols. $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The $\ell_1$ norm, denoted by $\|\cdot\|_1$, is the sum of absolute values of matrix elements; the infinity norm $\|\cdot\|_\infty$ is the maximum absolute element value; and the nuclear norm $\|\cdot\|_*$ is the sum of singular values of a matrix. The trace operator of a matrix is denoted by $\operatorname{tr}(\cdot)$ and $\operatorname{diag}(\cdot)$ is the vector of diagonal elements of a matrix. $\mathbf{0}$ denotes the null vector. Table 1 summarizes some notations used throughout the paper.
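Notation    Definition
$N$    Number of data points
$k$    Number of clusters
$v$    View index
$n_v$    Number of views
$D^{(v)}$    Dimension of data points in a view $v$
$\mathbf{X}^{(v)} \in \mathbb{R}^{D^{(v)} \times N}$    Data matrix in a view $v$
$\mathbf{C}^{(v)} \in \mathbb{R}^{N \times N}$    Representation matrix in a view $v$
$\mathbf{C}^{*} \in \mathbb{R}^{N \times N}$    Centroid representation matrix
$\mathbf{W} \in \mathbb{R}^{N \times N}$    Affinity matrix
$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$    Singular value decomposition (SVD) of $\mathbf{X}$
$\Phi(\mathbf{X}^{(v)})$    Data points in a view $v$ mapped into high-dimensional feature space
$\mathbf{K}^{(v)} \in \mathbb{R}^{N \times N}$    Gram matrix in a view $v$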
2.2 Related Work
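Consider the set of $N$ data points $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ that lie in a union of $k$ linear subspaces of unknown dimensions. Given the set of data points $\mathbf{X}$, the task of subspace clustering is to cluster data points according to the subspaces they belong to. The first step is the construction of the affinity matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$ whose elements define the similarity between data points. Ideally, the affinity matrix is a block diagonal matrix such that a nonzero similarity is assigned only to pairs of points from the same subspace. LRR, SSC and LRSSC construct the affinity matrix by enforcing low-rank, sparsity and low-rank plus sparsity constraints, respectively.
Low-Rank Representation (LRR) [16, 17] seeks to find a low-rank representation matrix $\mathbf{C} \in \mathbb{R}^{N \times N}$ for input data $\mathbf{X}$. The basic model of LRR is the following:
$$\min_{\mathbf{C}} \|\mathbf{C}\|_* \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C} \qquad (1)$$
where the nuclear norm is used to approximate the rank of $\mathbf{C}$, which results in a convex optimization problem.
Denote the (skinny) SVD of $\mathbf{X}$ as $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$. The minimizer of equation (1) is uniquely given by [16]:
$$\mathbf{C} = \mathbf{V}\mathbf{V}^T \qquad (2)$$
In the cases when data is contaminated by noise, the following problem needs to be solved:
$$\min_{\mathbf{C}} \frac{\lambda}{2}\|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2 + \|\mathbf{C}\|_* \qquad (3)$$
The optimal solution of equation (3) has been derived in [18]:
$$\mathbf{C} = \mathbf{V}_1\left(\mathbf{I} - \frac{1}{\lambda}\mathbf{\Sigma}_1^{-2}\right)\mathbf{V}_1^T \qquad (4)$$
where $\mathbf{U} = [\mathbf{U}_1\ \mathbf{U}_2]$, $\mathbf{\Sigma} = \operatorname{diag}(\mathbf{\Sigma}_1, \mathbf{\Sigma}_2)$ and $\mathbf{V} = [\mathbf{V}_1\ \mathbf{V}_2]$. Matrices are partitioned according to the sets $I_1 = \{i : \sigma_i > 1/\sqrt{\lambda}\}$ and $I_2 = \{i : \sigma_i \le 1/\sqrt{\lambda}\}$.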
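The two closed-form LRR solutions can be checked numerically. The following NumPy sketch assumes the noisy model is written as $(\lambda/2)\|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2 + \|\mathbf{C}\|_*$, so that singular values above $1/\sqrt{\lambda}$ are kept; the function names `lrr_clean` and `lrr_noisy` and the rank tolerance are illustrative choices, not part of the original method description.

```python
import numpy as np

def lrr_clean(X, tol=1e-10):
    """C = V V^T from the skinny SVD of X (noise-free LRR)."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s.max()))      # numerical rank of X
    V = Vt[:r].T
    return V @ V.T

def lrr_noisy(X, lam):
    """C = V1 (I - Sigma1^{-2} / lam) V1^T, keeping sigma_i > 1/sqrt(lam)."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > 1.0 / np.sqrt(lam)
    V1 = Vt[keep].T
    weights = 1.0 - 1.0 / (lam * s[keep] ** 2)
    return (V1 * weights) @ V1.T            # scales the columns of V1

rng = np.random.default_rng(0)
# 30 points drawn from a 3-dimensional subspace of R^10
X = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 30))
C = lrr_clean(X)
print(np.allclose(X @ C, X))                # self-expressiveness holds
```

For a very large $\lambda$ the noisy solution approaches the noise-free one, since the shrinkage weights tend to one.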
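Sparse Subspace Clustering (SSC) [15] requires that each data point is represented by a small number of data points from its own subspace, which amounts to solving the following minimization problem:
$$\min_{\mathbf{C}} \|\mathbf{C}\|_1 \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C},\ \operatorname{diag}(\mathbf{C}) = \mathbf{0} \qquad (5)$$
The $\ell_1$ norm is used as the tightest convex relaxation of the $\ell_0$ quasi-norm that counts the number of nonzero elements of the solution. The constraint $\operatorname{diag}(\mathbf{C}) = \mathbf{0}$ is used to avoid the trivial solution of representing a data point as a linear combination of itself.
If data is contaminated by noise, the following minimization problem needs to be solved:
$$\min_{\mathbf{C}} \frac{\lambda}{2}\|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2 + \|\mathbf{C}\|_1 \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}) = \mathbf{0} \qquad (6)$$
This problem can be efficiently solved using the ADMM optimization procedure [47].
Low-Rank Sparse Subspace Clustering (LRSSC) [32] combines low-rank and sparsity constraints:
$$\min_{\mathbf{C}} \beta_1\|\mathbf{C}\|_* + \beta_2\|\mathbf{C}\|_1 \quad \text{s.t.} \quad \mathbf{X} = \mathbf{X}\mathbf{C},\ \operatorname{diag}(\mathbf{C}) = \mathbf{0} \qquad (7)$$
In the case of corrupted data, the following problem needs to be solved to approximate $\mathbf{C}$:
$$\min_{\mathbf{C}} \frac{1}{2}\|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2 + \beta_1\|\mathbf{C}\|_* + \beta_2\|\mathbf{C}\|_1 \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}) = \mathbf{0} \qquad (8)$$
Once the matrix $\mathbf{C}$ is obtained by the LRR, SSC or LRSSC approach, the affinity matrix $\mathbf{W}$ is calculated as:
$$\mathbf{W} = |\mathbf{C}| + |\mathbf{C}|^T \qquad (9)$$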
Given the affinity matrix $\mathbf{W}$, spectral clustering [11, 12] finds the cluster membership of data points by applying k-means clustering to the eigenvectors of the graph Laplacian matrix $\mathbf{L} \in \mathbb{R}^{N \times N}$ computed from the affinity matrix $\mathbf{W}$.

3 Multi-view Low-rank Sparse Subspace Clustering
In this section we present the Multi-view Low-rank Sparse Subspace Clustering (MLRSSC) algorithm with two different regularization approaches. We assume that we are given a dataset $\mathbf{X} = \{\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(n_v)}\}$ of $n_v$ views, where each $\mathbf{X}^{(v)} \in \mathbb{R}^{D^{(v)} \times N}$ is described with its own set of $D^{(v)}$ features. Our objective is to find a joint representation matrix $\mathbf{C}$ that balances the trade-off between the agreement across different views, while at the same time promoting sparsity and low rank of the solution.
We formulate a joint objective function that enforces representation matrices across different views to be regularized towards a common consensus. Motivated by [42], we propose two regularization schemes of the MLRSSC algorithm: (i) MLRSSC based on pairwise similarities and (ii) centroid-based MLRSSC. The first regularization encourages similarity between pairs of representation matrices. The centroid-based approach enforces representations across different views towards a common centroid. The standard spectral clustering algorithm can then be applied to the jointly inferred affinity matrix.
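The final step, turning a learned representation matrix into cluster labels, can be sketched as follows. This is a minimal NumPy illustration that assumes the affinity is built as $|\mathbf{C}| + |\mathbf{C}|^T$ and uses the symmetric normalized Laplacian with row-normalized eigenvectors; the tiny k-means with farthest-point initialization is a stand-in chosen for determinism, not the procedure used in the experiments.

```python
import numpy as np

def affinity(C):
    """Symmetric, nonnegative affinity W = |C| + |C|^T."""
    A = np.abs(C)
    return A + A.T

def spectral_embedding(W, k):
    """Row-normalized eigenvectors of the normalized Laplacian (k smallest)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    E = vecs[:, :k]
    return E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)

def kmeans(E, k, n_iter=50):
    """Lloyd iterations with deterministic farthest-point initialization."""
    centers = [E[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((E - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((E[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = E[labels == j].mean(axis=0)
    return labels

def spectral_clustering(C, k):
    return kmeans(spectral_embedding(affinity(C), k), k)

# Toy block-diagonal representation: points 0-2 and points 3-6 form two groups.
C = np.zeros((7, 7))
C[:3, :3] = 1.0
C[3:, 3:] = 1.0
np.fill_diagonal(C, 0.0)
labels = spectral_clustering(C, 2)
```

With an ideal block-diagonal representation, the embedding rows within one block coincide, so any reasonable k-means initialization recovers the blocks.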
3.1 Pairwise Multi-view Low-rank Sparse Subspace Clustering
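We propose to solve the following joint optimization problem over $n_v$ views:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}} \sum_{v=1}^{n_v}\left(\beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1\right) + \sum_{1 \le v, w \le n_v,\ v \ne w}\lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{(w)}\|_F^2 \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{C}^{(v)},\ \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (10)$$
where $\mathbf{C}^{(v)} \in \mathbb{R}^{N \times N}$ is the representation matrix for view $v$. Parameters $\beta_1$, $\beta_2$ and $\lambda^{(v)}$ define the trade-off between the low-rank constraint, the sparsity constraint and the agreement across views, respectively. In the cases where we do not have prior information that one view is more important than the others, $\lambda^{(v)}$ does not depend on the view $v$ and the same value of $\lambda^{(v)}$ is used across all views $v = 1, \dots, n_v$. The last term in the objective in (10) is introduced to encourage similarities between pairs of representation matrices across views.
With all but one $\mathbf{C}^{(v)}$ fixed, we minimize the function (10) for each $\mathbf{C}^{(v)}$ independently (the symmetric terms weighted by $\lambda^{(w)}$ are absorbed into $\lambda^{(v)}$):
$$\min_{\mathbf{C}^{(v)}} \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1 + \lambda^{(v)}\sum_{1 \le w \le n_v,\ w \ne v}\|\mathbf{C}^{(v)} - \mathbf{C}^{(w)}\|_F^2 \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{C}^{(v)},\ \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0} \qquad (11)$$
By introducing auxiliary variables $\mathbf{C}_1^{(v)}$, $\mathbf{C}_2^{(v)}$, $\mathbf{C}_3^{(v)}$ and $\mathbf{A}^{(v)}$, we reformulate the objective:
$$\min_{\mathbf{C}_1^{(v)}, \mathbf{C}_2^{(v)}, \mathbf{C}_3^{(v)}, \mathbf{A}^{(v)}} \beta_1\|\mathbf{C}_1^{(v)}\|_* + \beta_2\|\mathbf{C}_2^{(v)}\|_1 + \lambda^{(v)}\sum_{w \ne v}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{(w)}\|_F^2 \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{A}^{(v)},\ \mathbf{A}^{(v)} = \mathbf{C}_2^{(v)} - \operatorname{diag}(\mathbf{C}_2^{(v)}),\ \mathbf{A}^{(v)} = \mathbf{C}_1^{(v)},\ \mathbf{A}^{(v)} = \mathbf{C}_3^{(v)} \qquad (12)$$
The augmented Lagrangian is:
$$\mathcal{L}\left(\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}, \mathbf{A}^{(v)}, \{\mathbf{\Lambda}_i^{(v)}\}_{i=1}^{4}\right) = \beta_1\|\mathbf{C}_1^{(v)}\|_* + \beta_2\|\mathbf{C}_2^{(v)}\|_1 + \lambda^{(v)}\sum_{w \ne v}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{(w)}\|_F^2 + \frac{\mu_1}{2}\|\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{A}^{(v)}\|_F^2 + \frac{\mu_2}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)})\|_F^2 + \frac{\mu_3}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)}\|_F^2 + \frac{\mu_4}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\|_F^2 + \operatorname{tr}\left[\mathbf{\Lambda}_1^{(v)T}(\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{A}^{(v)})\right] + \operatorname{tr}\left[\mathbf{\Lambda}_2^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)}))\right] + \operatorname{tr}\left[\mathbf{\Lambda}_3^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)})\right] + \operatorname{tr}\left[\mathbf{\Lambda}_4^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)})\right] \qquad (13)$$
where $\{\mu_i > 0\}_{i=1}^{4}$ are penalty parameters that need to be tuned and $\{\mathbf{\Lambda}_i^{(v)}\}_{i=1}^{4}$ are Lagrange dual variables.
To solve the convex optimization problem in (12), we use the Alternating Direction Method of Multipliers (ADMM) [47]. ADMM converges for objectives composed of two-block convex separable problems, but here the terms $\mathbf{C}_1^{(v)}$, $\mathbf{C}_2^{(v)}$ and $\mathbf{C}_3^{(v)}$ do not depend on each other and can be treated as one variable block.
Update rule for $\mathbf{A}^{(v)}$ at iteration $t+1$. Given $\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}$ and $\{\mathbf{\Lambda}_i^{(v)}\}_{i=1}^{4}$ at iteration $t$, the matrix $\mathbf{A}^{(v)}$ that minimizes the objective in equation (13) is updated by the following update rule:
$$\mathbf{A}^{(v)} = \left[\mu_1\mathbf{X}^{(v)T}\mathbf{X}^{(v)} + (\mu_2 + \mu_3 + \mu_4)\mathbf{I}\right]^{-1}\left(\mu_1\mathbf{X}^{(v)T}\mathbf{X}^{(v)} + \mu_2\mathbf{C}_2^{(v)} + \mu_3\mathbf{C}_1^{(v)} + \mu_4\mathbf{C}_3^{(v)} + \mathbf{X}^{(v)T}\mathbf{\Lambda}_1^{(v)} - \mathbf{\Lambda}_2^{(v)} - \mathbf{\Lambda}_3^{(v)} - \mathbf{\Lambda}_4^{(v)}\right) \qquad (14)$$
The update rule follows straightforwardly by setting the partial derivative of $\mathcal{L}$ in equation (13) with respect to $\mathbf{A}^{(v)}$ to zero, using the fact that $\mathbf{C}_2^{(v)}$ has a zero diagonal after its own update.
Update rule for $\mathbf{C}_1^{(v)}$ at iteration $t+1$. Given $\mathbf{\Lambda}_3^{(v)}$ at iteration $t$ and $\mathbf{A}^{(v)}$ at iteration $t+1$, we minimize the objective in equation (13) with respect to $\mathbf{C}_1^{(v)}$:
$$\min_{\mathbf{C}_1^{(v)}} \beta_1\|\mathbf{C}_1^{(v)}\|_* + \frac{\mu_3}{2}\left\|\mathbf{C}_1^{(v)} - \left(\mathbf{A}^{(v)} + \frac{\mathbf{\Lambda}_3^{(v)}}{\mu_3}\right)\right\|_F^2 \qquad (15)$$
From [48], it follows that the unique minimizer of (15) is:
$$\mathbf{C}_1^{(v)} = \Pi_{\beta_1/\mu_3}\left(\mathbf{A}^{(v)} + \frac{\mathbf{\Lambda}_3^{(v)}}{\mu_3}\right) \qquad (16)$$
where $\Pi_{\beta}(\mathbf{Y}) = \mathbf{U}\pi_{\beta}(\mathbf{\Sigma})\mathbf{V}^T$ performs the soft-thresholding operation on the singular values of $\mathbf{Y}$ and $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ is the skinny SVD of $\mathbf{Y}$, here $\mathbf{Y} = \mathbf{A}^{(v)} + \mathbf{\Lambda}_3^{(v)}/\mu_3$. $\pi_{\beta}(\mathbf{\Sigma})$ denotes the soft-thresholding operator defined as $\pi_{\beta}(\mathbf{\Sigma}) = (|\mathbf{\Sigma}| - \beta)_+\operatorname{sgn}(\mathbf{\Sigma})$ and $t_+ = \max(0, t)$.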
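The singular value thresholding step can be illustrated with a few lines of NumPy. This is a sketch of the standard proximal operator of the nuclear norm; the function name `svt` is an illustrative choice.

```python
import numpy as np

def svt(Y, tau):
    """Soft-threshold the singular values of Y by tau (prox of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # skinny SVD
    s_shrunk = np.maximum(s - tau, 0.0)                # singular values are nonnegative
    return (U * s_shrunk) @ Vt                         # U diag(s_shrunk) V^T

Y = np.diag([3.0, 1.0])
print(svt(Y, 2.0))   # singular value 3 shrinks to 1, singular value 1 shrinks to 0
```

Because singular values below the threshold are set to zero, the operator lowers the rank of its argument, which is how the nuclear norm term promotes low-rank representations.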
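Update rule for $\mathbf{C}_2^{(v)}$ at iteration $t+1$. Given $\mathbf{\Lambda}_2^{(v)}$ at iteration $t$ and $\mathbf{A}^{(v)}$ at iteration $t+1$, we minimize the objective $\mathcal{L}$ in equation (13) with respect to $\mathbf{C}_2^{(v)}$:
$$\min_{\mathbf{C}_2^{(v)}} \beta_2\|\mathbf{C}_2^{(v)}\|_1 + \frac{\mu_2}{2}\left\|\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)})\right\|_F^2 + \operatorname{tr}\left[\mathbf{\Lambda}_2^{(v)T}\left(\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)})\right)\right] \qquad (17)$$
The minimization of (17) gives the following update rules for the matrix $\mathbf{C}_2^{(v)}$ [49, 50]:
$$\mathbf{C}_2^{(v)} = \pi_{\beta_2/\mu_2}\left(\mathbf{A}^{(v)} + \frac{\mathbf{\Lambda}_2^{(v)}}{\mu_2}\right), \qquad \mathbf{C}_2^{(v)} = \mathbf{C}_2^{(v)} - \operatorname{diag}(\mathbf{C}_2^{(v)}) \qquad (18)$$
where $\pi_{\beta_2/\mu_2}$ denotes the soft-thresholding operator applied entry-wise to $\mathbf{A}^{(v)} + \mathbf{\Lambda}_2^{(v)}/\mu_2$.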
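The sparse update can be sketched in NumPy as entry-wise shrinkage followed by zeroing the diagonal. The helper names and the pairing of the penalty `mu` with this block are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(Y, tau):
    """Entry-wise soft-thresholding: sign(y) * max(|y| - tau, 0)."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def update_sparse_block(A, Lam, beta2, mu):
    """Shrink A + Lam / mu by beta2 / mu, then remove the diagonal."""
    C2 = soft_threshold(A + Lam / mu, beta2 / mu)
    np.fill_diagonal(C2, 0.0)      # a point must not represent itself
    return C2

A = np.array([[5.0, -0.2],
              [1.5, -3.0]])
C2 = update_sparse_block(A, np.zeros((2, 2)), beta2=1.0, mu=1.0)
print(C2)   # only the entry 1.5 survives, shrunk to 0.5
```

Entries with magnitude below the threshold are set exactly to zero, which is the mechanism that produces sparse representations.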
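Update rule for $\mathbf{C}_3^{(v)}$ at iteration $t+1$. Given $\mathbf{A}^{(v)}$ at iteration $t+1$ and $\mathbf{\Lambda}_4^{(v)}$, $\mathbf{C}^{(w)}$ at iteration $t$, we minimize the objective in equation (13) with respect to $\mathbf{C}_3^{(v)}$:
$$\min_{\mathbf{C}_3^{(v)}} \lambda^{(v)}\sum_{w \ne v}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{(w)}\|_F^2 + \frac{\mu_4}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\|_F^2 + \operatorname{tr}\left[\mathbf{\Lambda}_4^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)})\right] \qquad (19)$$
The partial derivative of $\mathcal{L}$ in equation (13) with respect to $\mathbf{C}_3^{(v)}$ is:
$$\frac{\partial\mathcal{L}}{\partial\mathbf{C}_3^{(v)}} = 2\lambda^{(v)}\sum_{w \ne v}\left(\mathbf{C}_3^{(v)} - \mathbf{C}^{(w)}\right) - \mu_4\left(\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\right) - \mathbf{\Lambda}_4^{(v)} \qquad (20)$$
Setting the partial derivative in (20) to zero:
$$\mathbf{C}_3^{(v)} = \left[2\lambda^{(v)}(n_v - 1) + \mu_4\right]^{-1}\left(2\lambda^{(v)}\sum_{w \ne v}\mathbf{C}^{(w)} + \mu_4\mathbf{A}^{(v)} + \mathbf{\Lambda}_4^{(v)}\right) \qquad (21)$$
Update rules for dual variables at iteration $t+1$. Given $\mathbf{A}^{(v)}$ and $\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}$ at iteration $t+1$, dual variables are updated with the following equations:
$$\mathbf{\Lambda}_1^{(v)} = \mathbf{\Lambda}_1^{(v)} + \mu_1\left(\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{A}^{(v)}\right), \quad \mathbf{\Lambda}_2^{(v)} = \mathbf{\Lambda}_2^{(v)} + \mu_2\left(\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)}\right), \quad \mathbf{\Lambda}_3^{(v)} = \mathbf{\Lambda}_3^{(v)} + \mu_3\left(\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)}\right), \quad \mathbf{\Lambda}_4^{(v)} = \mathbf{\Lambda}_4^{(v)} + \mu_4\left(\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\right) \qquad (22)$$
If data is contaminated by noise and does not perfectly lie in the union of subspaces, we modify the objective function as follows:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}} \sum_{v=1}^{n_v}\left(\frac{1}{2}\|\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{C}^{(v)}\|_F^2 + \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1\right) + \sum_{1 \le v, w \le n_v,\ v \ne w}\lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{(w)}\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (23)$$
Update rule for $\mathbf{A}^{(v)}$ at iteration $t+1$ for corrupted data. Given $\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}$ and $\{\mathbf{\Lambda}_i^{(v)}\}_{i=2}^{4}$ at iteration $t$, the matrix $\mathbf{A}^{(v)}$ is obtained by equating to zero the partial derivative of the augmented Lagrangian of problem (23):
$$\mathbf{A}^{(v)} = \left[\mathbf{X}^{(v)T}\mathbf{X}^{(v)} + (\mu_2 + \mu_3 + \mu_4)\mathbf{I}\right]^{-1}\left(\mathbf{X}^{(v)T}\mathbf{X}^{(v)} + \mu_2\mathbf{C}_2^{(v)} + \mu_3\mathbf{C}_1^{(v)} + \mu_4\mathbf{C}_3^{(v)} - \mathbf{\Lambda}_2^{(v)} - \mathbf{\Lambda}_3^{(v)} - \mathbf{\Lambda}_4^{(v)}\right) \qquad (24)$$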
These update steps are then repeated until convergence or until the maximum number of iterations is reached. We check the convergence by verifying the following constraints at each iteration $t$: $\|\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)}\|_\infty \le \epsilon$, $\|\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)}\|_\infty \le \epsilon$, $\|\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\|_\infty \le \epsilon$ and $\|\mathbf{A}_t^{(v)} - \mathbf{A}_{t-1}^{(v)}\|_\infty \le \epsilon$, for $v = 1, \dots, n_v$. After obtaining the representation matrix for each view $\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}$, we combine them by taking the element-wise average across all views. The next step of the algorithm is to find the assignment of the data points to corresponding clusters by applying the spectral clustering algorithm to the joint affinity matrix $\mathbf{W}$ computed from the averaged representation as in (9). Algorithm 1 summarizes the steps of the pairwise MLRSSC. For practical reasons, we use the same initial values of $\mu_i$, $i = 1, \dots, 4$ and $\rho$ for different views and update $\mu_i$ after the optimization of all views. However, it is possible to have a more general approach with different initial values of $\mu_i$, $i = 1, \dots, 4$ and $\rho$ for each view $v$, but this significantly increases the number of variables for optimization.
The problem in (10) is convex subject to linear constraints and all its subproblems can be solved exactly. Hence, theoretical results in [51] guarantee the global convergence of ADMM. The computational complexity of Algorithm 1 is $O(T n_v N^3)$, where $T$ is the number of iterations, $n_v$ is the number of views and $N$ is the number of data points. In the experiments, we set a maximal value of $T$, but the algorithm converged before the maximal number of iterations was exceeded. Importantly, the computational complexity of the spectral clustering step is $O(N^3)$, so the computational cost of the proposed representation learning step is $T n_v$ times higher.
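To make the iteration concrete, the following NumPy sketch puts the update steps together for the corrupted-data case. It is an illustration under several assumptions rather than the reference implementation: the penalties are paired with the constraints as $\mu_2$ for the sparse block, $\mu_3$ for the low-rank block and $\mu_4$ for the agreement block; the other views enter the agreement term through their current agreement block; the penalties grow by a factor `rho` after each sweep; and the returned matrix is the view average of the sparse blocks. All function and parameter names are hypothetical.

```python
import numpy as np

def svt(Y, tau):
    """Soft-threshold the singular values of Y by tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(Y, tau):
    """Entry-wise soft-thresholding."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def pairwise_mlrssc(Xs, beta1, beta2, lam, mu=10.0, rho=1.5,
                    max_iter=100, eps=1e-4):
    """ADMM sketch for pairwise MLRSSC with noisy data (assumed update forms)."""
    n_v, N = len(Xs), Xs[0].shape[1]
    I = np.eye(N)
    grams = [X.T @ X for X in Xs]                  # X^T X for every view
    zeros = lambda: [np.zeros((N, N)) for _ in range(n_v)]
    A, C1, C2, C3 = zeros(), zeros(), zeros(), zeros()
    L2, L3, L4 = zeros(), zeros(), zeros()         # dual variables
    mu2 = mu3 = mu4 = float(mu)
    for _ in range(max_iter):
        converged = True
        for v in range(n_v):
            G = grams[v]
            # A-update: linear system from the stationarity condition.
            rhs = (G + mu2 * C2[v] + mu3 * C1[v] + mu4 * C3[v]
                   - L2[v] - L3[v] - L4[v])
            A_new = np.linalg.solve(G + (mu2 + mu3 + mu4) * I, rhs)
            # Low-rank block: singular value thresholding.
            C1[v] = svt(A_new + L3[v] / mu3, beta1 / mu3)
            # Sparse block: entry-wise shrinkage, then zero diagonal.
            C2[v] = soft(A_new + L2[v] / mu2, beta2 / mu2)
            np.fill_diagonal(C2[v], 0.0)
            # Agreement block: closed-form average with the other views.
            others = sum(C3[w] for w in range(n_v) if w != v)
            C3[v] = ((2 * lam * others + mu4 * A_new + L4[v])
                     / (2 * lam * (n_v - 1) + mu4))
            # Dual ascent steps.
            L2[v] += mu2 * (A_new - C2[v])
            L3[v] += mu3 * (A_new - C1[v])
            L4[v] += mu4 * (A_new - C3[v])
            gaps = (A_new - C1[v], A_new - C2[v], A_new - C3[v], A_new - A[v])
            converged = converged and all(np.abs(g).max() <= eps for g in gaps)
            A[v] = A_new
        mu2, mu3, mu4 = rho * mu2, rho * mu3, rho * mu4
        if converged:
            break
    return sum(C2) / n_v                           # element-wise view average

rng = np.random.default_rng(0)
Xs = [rng.standard_normal((5, 12)), rng.standard_normal((7, 12))]
C = pairwise_mlrssc(Xs, beta1=0.3, beta2=0.7, lam=0.5)
```

The dominant cost per view and iteration is the $N \times N$ linear solve and the SVD, both cubic in $N$, which matches the complexity stated above.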
3.2 Centroid-based Multi-view Low-rank Sparse Subspace Clustering
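In addition to the pairwise MLRSSC, we also introduce an objective for the centroid-based MLRSSC, which enforces view-specific representations towards a common centroid. We propose to solve the following minimization problem:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}, \mathbf{C}^{*}} \sum_{v=1}^{n_v}\left(\beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{*}\|_F^2\right) \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{C}^{(v)},\ \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (25)$$
where $\mathbf{C}^{*}$ denotes the consensus variable.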

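Algorithm 1 Pairwise MLRSSC
Input: $\mathbf{X} = \{\mathbf{X}^{(v)}\}_{v=1}^{n_v}$, $k$, $\beta_1$, $\beta_2$, $\{\lambda^{(v)}\}_{v=1}^{n_v}$
Output: Assignment of the data points to $k$ clusters
1: Initialize: $\{\mathbf{C}_i^{(v)} = \mathbf{0}\}_{i=1}^{3}$, $\{\mathbf{\Lambda}_i^{(v)} = \mathbf{0}\}_{i=1}^{4}$, $\{\mu_i > 0\}_{i=1}^{4}$, $\rho > 1$
2: while not converged do
3: for $v = 1$ to $n_v$ do
4: Fix others and update $\mathbf{A}^{(v)}$ by solving (14) in the case of clean data
or (24) in the case of corrupted data
5: Fix others and update $\mathbf{C}_1^{(v)}$ by solving (16)
6: Fix others and update $\mathbf{C}_2^{(v)}$ by solving (18)
7: Fix others and update $\mathbf{C}_3^{(v)}$ by solving (21)
8: Fix others and update dual variables $\mathbf{\Lambda}_2^{(v)}$, $\mathbf{\Lambda}_3^{(v)}$, $\mathbf{\Lambda}_4^{(v)}$ by solving (22)
and also $\mathbf{\Lambda}_1^{(v)}$ in the case of clean data
9: end for
10: Update $\mu_i = \rho\mu_i$, $i = 1, \dots, 4$
11: end while
12: Combine $\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}$ by taking the element-wise average
13: Apply spectral clustering [12] to the affinity matrix $\mathbf{W}$ computed from the averaged representation as in (9)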

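This objective function can be minimized by alternating minimization, cycling over the views and the consensus variable. Specifically, the following two steps are repeated: (1) fix the consensus variable $\mathbf{C}^{*}$ and update each $\mathbf{C}^{(v)}$, $v = 1, \dots, n_v$, while keeping all others fixed and (2) fix $\mathbf{C}^{(v)}$, $v = 1, \dots, n_v$, and update $\mathbf{C}^{*}$.
By fixing all variables except one $\mathbf{C}^{(v)}$, we solve the following problem:
$$\min_{\mathbf{C}^{(v)}} \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{*}\|_F^2 \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{C}^{(v)},\ \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0} \qquad (26)$$
Again, we solve the convex optimization problem using ADMM. We introduce auxiliary variables $\mathbf{C}_1^{(v)}$, $\mathbf{C}_2^{(v)}$, $\mathbf{C}_3^{(v)}$ and $\mathbf{A}^{(v)}$ and reformulate the original problem:
$$\min_{\mathbf{C}_1^{(v)}, \mathbf{C}_2^{(v)}, \mathbf{C}_3^{(v)}, \mathbf{A}^{(v)}} \beta_1\|\mathbf{C}_1^{(v)}\|_* + \beta_2\|\mathbf{C}_2^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{*}\|_F^2 \quad \text{s.t.} \quad \mathbf{X}^{(v)} = \mathbf{X}^{(v)}\mathbf{A}^{(v)},\ \mathbf{A}^{(v)} = \mathbf{C}_2^{(v)} - \operatorname{diag}(\mathbf{C}_2^{(v)}),\ \mathbf{A}^{(v)} = \mathbf{C}_1^{(v)},\ \mathbf{A}^{(v)} = \mathbf{C}_3^{(v)} \qquad (27)$$
The augmented Lagrangian is:
$$\mathcal{L}\left(\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}, \mathbf{A}^{(v)}, \{\mathbf{\Lambda}_i^{(v)}\}_{i=1}^{4}\right) = \beta_1\|\mathbf{C}_1^{(v)}\|_* + \beta_2\|\mathbf{C}_2^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{*}\|_F^2 + \frac{\mu_1}{2}\|\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{A}^{(v)}\|_F^2 + \frac{\mu_2}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)})\|_F^2 + \frac{\mu_3}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)}\|_F^2 + \frac{\mu_4}{2}\|\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)}\|_F^2 + \operatorname{tr}\left[\mathbf{\Lambda}_1^{(v)T}(\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{A}^{(v)})\right] + \operatorname{tr}\left[\mathbf{\Lambda}_2^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_2^{(v)} + \operatorname{diag}(\mathbf{C}_2^{(v)}))\right] + \operatorname{tr}\left[\mathbf{\Lambda}_3^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_1^{(v)})\right] + \operatorname{tr}\left[\mathbf{\Lambda}_4^{(v)T}(\mathbf{A}^{(v)} - \mathbf{C}_3^{(v)})\right] \qquad (28)$$
Update rule for $\mathbf{C}_3^{(v)}$ at iteration $t+1$. Given $\mathbf{A}^{(v)}$ at iteration $t+1$ and $\mathbf{\Lambda}_4^{(v)}$, $\mathbf{C}^{*}$ at iteration $t$, minimization of the objective in equation (28) with respect to $\mathbf{C}_3^{(v)}$ leads to the following update rule for $\mathbf{C}_3^{(v)}$:
$$\mathbf{C}_3^{(v)} = \left[2\lambda^{(v)} + \mu_4\right]^{-1}\left(2\lambda^{(v)}\mathbf{C}^{*} + \mu_4\mathbf{A}^{(v)} + \mathbf{\Lambda}_4^{(v)}\right) \qquad (29)$$
Update rule for $\mathbf{C}^{*}$. By setting the partial derivative of the objective function in equation (25) with respect to $\mathbf{C}^{*}$ to zero we get the closed-form solution for $\mathbf{C}^{*}$:
$$\mathbf{C}^{*} = \frac{\sum_{v=1}^{n_v}\lambda^{(v)}\mathbf{C}^{(v)}}{\sum_{v=1}^{n_v}\lambda^{(v)}} \qquad (30)$$
It is easy to check that the update rules for variables $\mathbf{A}^{(v)}$, $\mathbf{C}_1^{(v)}$, $\mathbf{C}_2^{(v)}$ and dual variables $\{\mathbf{\Lambda}_i^{(v)}\}_{i=1}^{4}$ are the same as in the pairwise similarities based multi-view LRSSC (equations (14), (16), (18) and (22)).
In order to extend the model to data contaminated by additive white Gaussian noise, the objective in (25) is modified as follows:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}, \mathbf{C}^{*}} \sum_{v=1}^{n_v}\left(\frac{1}{2}\|\mathbf{X}^{(v)} - \mathbf{X}^{(v)}\mathbf{C}^{(v)}\|_F^2 + \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{*}\|_F^2\right) \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (31)$$
Compared to the model for clean data, the only update rule that needs to be modified is for $\mathbf{A}^{(v)}$, which is the same as in pairwise MLRSSC given in equation (24).
In centroid-based MLRSSC there is no need to combine affinity matrices across views, since the joint affinity matrix can be directly computed from the centroid matrix, i.e. $\mathbf{W} = |\mathbf{C}^{*}| + |\mathbf{C}^{*}|^T$. Algorithm 2 summarizes the steps of centroid-based MLRSSC. The computational complexity of Algorithm 2 is the same as the complexity of Algorithm 1.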
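The two centroid-specific steps are both closed-form and can be sketched in NumPy. The sketch assumes the agreement term is $\lambda^{(v)}\|\mathbf{C}_3^{(v)} - \mathbf{C}^{*}\|_F^2$ with penalty $\mu_4$ on the constraint $\mathbf{A}^{(v)} = \mathbf{C}_3^{(v)}$, so the centroid is a $\lambda$-weighted average of the view representations; the function names are illustrative.

```python
import numpy as np

def update_c3_centroid(A, Lam4, C_star, lam, mu4):
    """Minimizer of lam*||C3 - C*||_F^2 + (mu4/2)*||A - C3||_F^2 + <Lam4, A - C3>."""
    return (2.0 * lam * C_star + mu4 * A + Lam4) / (2.0 * lam + mu4)

def update_centroid(Cs, lams):
    """Weighted average of view representations with weights lams."""
    lams = np.asarray(lams, dtype=float)
    return np.tensordot(lams, np.stack(Cs), axes=1) / lams.sum()

Cs = [np.eye(2), 3.0 * np.eye(2)]
print(update_centroid(Cs, [1.0, 1.0]))   # plain average of the two views
```

When all views share the same $\lambda$, the centroid update reduces to the element-wise mean, which mirrors the averaging step of the pairwise variant.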

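Algorithm 2 Centroid-based MLRSSC
Input: $\mathbf{X} = \{\mathbf{X}^{(v)}\}_{v=1}^{n_v}$, $k$, $\beta_1$, $\beta_2$, $\{\lambda^{(v)}\}_{v=1}^{n_v}$
Output: Assignment of the data points to $k$ clusters
1: Initialize: $\{\mathbf{C}_i^{(v)} = \mathbf{0}\}_{i=1}^{3}$, $\mathbf{C}^{*} = \mathbf{0}$, $\{\mathbf{\Lambda}_i^{(v)} = \mathbf{0}\}_{i=1}^{4}$, $\{\mu_i > 0\}_{i=1}^{4}$, $\rho > 1$
2: while not converged do
3: for $v = 1$ to $n_v$ do
4: Fix others and update $\mathbf{A}^{(v)}$ by solving (14) in the case of clean data
or (24) in the case of corrupted data
5: Fix others and update $\mathbf{C}_1^{(v)}$ by solving (16)
6: Fix others and update $\mathbf{C}_2^{(v)}$ by solving (18)
7: Fix others and update $\mathbf{C}_3^{(v)}$ by solving (29)
8: Fix others and update dual variables $\mathbf{\Lambda}_2^{(v)}$, $\mathbf{\Lambda}_3^{(v)}$, $\mathbf{\Lambda}_4^{(v)}$ by solving (22)
and also $\mathbf{\Lambda}_1^{(v)}$ in the case of clean data
9: end for
10: Update $\mu_i = \rho\mu_i$, $i = 1, \dots, 4$
11: Fix others and update centroid $\mathbf{C}^{*}$ by solving (30)
12: end while
13: Apply spectral clustering [12] to the affinity matrix $\mathbf{W} = |\mathbf{C}^{*}| + |\mathbf{C}^{*}|^T$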

4 Kernel Multi-view Low-rank Sparse Subspace Clustering
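The spectral decomposition of the Laplacian enables spectral clustering to separate data points with nonlinear hypersurfaces. However, by representing data points as a linear combination of other data points, the MLRSSC algorithm learns an affinity matrix that models the linear subspace structure of the data. In order to recover nonlinear subspaces, we propose to solve the MLRSSC in RKHS by implicitly mapping data points into a high-dimensional feature space.
We define $\Phi: \mathbb{R}^{D} \to \mathcal{H}$ to be a function that maps the original input space to a high (possibly infinite) dimensional feature space $\mathcal{H}$. Since the presented update rules for the corrupted data of both pairwise and centroid-based MLRSSC depend only on the dot products $\langle\mathbf{X}^{(v)}, \mathbf{X}^{(v)}\rangle = \mathbf{X}^{(v)T}\mathbf{X}^{(v)}$, both approaches can be solved in RKHS and extended to model nonlinear manifold structure.
Let $\Phi(\mathbf{X}^{(v)}) = \{\Phi(\mathbf{x}_i^{(v)}) \in \mathcal{H}\}_{i=1}^{N}$ denote the set of data points mapped into the high-dimensional feature space. The objective function of pairwise kernel MLRSSC for data contaminated by noise is the following:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}} \sum_{v=1}^{n_v}\left(\frac{1}{2}\|\Phi(\mathbf{X}^{(v)}) - \Phi(\mathbf{X}^{(v)})\mathbf{C}^{(v)}\|_F^2 + \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1\right) + \sum_{1 \le v, w \le n_v,\ v \ne w}\lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{(w)}\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (32)$$
Similarly, the objective function of centroid-based MLRSSC in feature space for corrupted data is:
$$\min_{\mathbf{C}^{(1)}, \dots, \mathbf{C}^{(n_v)}, \mathbf{C}^{*}} \sum_{v=1}^{n_v}\left(\frac{1}{2}\|\Phi(\mathbf{X}^{(v)}) - \Phi(\mathbf{X}^{(v)})\mathbf{C}^{(v)}\|_F^2 + \beta_1\|\mathbf{C}^{(v)}\|_* + \beta_2\|\mathbf{C}^{(v)}\|_1 + \lambda^{(v)}\|\mathbf{C}^{(v)} - \mathbf{C}^{*}\|_F^2\right) \quad \text{s.t.} \quad \operatorname{diag}(\mathbf{C}^{(v)}) = \mathbf{0},\ v = 1, \dots, n_v \qquad (33)$$
Since $\mathbf{A}^{(v)}$ is the only variable that depends on $\Phi(\mathbf{X}^{(v)})$, the update rules for $\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}$ and dual variables $\{\mathbf{\Lambda}_i^{(v)}\}_{i=2}^{4}$ remain unchanged.
Update rule for $\mathbf{A}^{(v)}$ at iteration $t+1$. Given $\{\mathbf{C}_i^{(v)}\}_{i=1}^{3}$ and $\{\mathbf{\Lambda}_i^{(v)}\}_{i=2}^{4}$ at iteration $t$, the matrix $\mathbf{A}^{(v)}$ is updated by the following update rule:
$$\mathbf{A}^{(v)} = \left[\Phi(\mathbf{X}^{(v)})^T\Phi(\mathbf{X}^{(v)}) + (\mu_2 + \mu_3 + \mu_4)\mathbf{I}\right]^{-1}\left(\Phi(\mathbf{X}^{(v)})^T\Phi(\mathbf{X}^{(v)}) + \mu_2\mathbf{C}_2^{(v)} + \mu_3\mathbf{C}_1^{(v)} + \mu_4\mathbf{C}_3^{(v)} - \mathbf{\Lambda}_2^{(v)} - \mathbf{\Lambda}_3^{(v)} - \mathbf{\Lambda}_4^{(v)}\right) \qquad (34)$$
Substituting the dot product $\langle\Phi(\mathbf{X}^{(v)}), \Phi(\mathbf{X}^{(v)})\rangle$ with the Gram matrix $\mathbf{K}^{(v)}$, we get the following update rule for $\mathbf{A}^{(v)}$:
$$\mathbf{A}^{(v)} = \left[\mathbf{K}^{(v)} + (\mu_2 + \mu_3 + \mu_4)\mathbf{I}\right]^{-1}\left(\mathbf{K}^{(v)} + \mu_2\mathbf{C}_2^{(v)} + \mu_3\mathbf{C}_1^{(v)} + \mu_4\mathbf{C}_3^{(v)} - \mathbf{\Lambda}_2^{(v)} - \mathbf{\Lambda}_3^{(v)} - \mathbf{\Lambda}_4^{(v)}\right) \qquad (35)$$
The update rule for $\mathbf{A}^{(v)}$ is the same in the pairwise and centroid-based versions of the algorithm.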
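The kernelized update only needs the Gram matrix, which the following NumPy sketch makes explicit. The Gaussian kernel and its bandwidth parametrization are assumptions chosen for illustration (any positive semidefinite kernel could be substituted), and the update assumes the same pairing of penalties and dual variables as in the linear corrupted-data case.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the columns x_i of X."""
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def update_A_kernel(K, C1, C2, C3, L2, L3, L4, mu2, mu3, mu4):
    """Solve [K + (mu2+mu3+mu4) I] A = K + mu2 C2 + mu3 C1 + mu4 C3 - L2 - L3 - L4."""
    N = K.shape[0]
    lhs = K + (mu2 + mu3 + mu4) * np.eye(N)
    rhs = K + mu2 * C2 + mu3 * C1 + mu4 * C3 - L2 - L3 - L4
    return np.linalg.solve(lhs, rhs)     # avoids forming the explicit inverse

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))          # 6 points with 4 features
K = gaussian_gram(X, sigma=1.0)
Z = np.zeros((6, 6))
A = update_A_kernel(K, Z, Z, Z, Z, Z, Z, 1.0, 1.0, 1.0)
```

Passing the linear Gram matrix `X.T @ X` instead of `K` recovers the linear update, so the same routine can serve both variants.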
5 Experiments
In this section we present results that demonstrate the effectiveness of the proposed algorithms. The performance is measured on one synthetic and three real-world datasets that are commonly used to evaluate the performance of multi-view algorithms. Moreover, we introduce a novel real-world multi-view dataset from the molecular biology domain. We compared MLRSSC with state-of-the-art multi-view subspace clustering algorithms, as well as with two baselines: best single view LRSSC and feature concatenation LRSSC.
5.1 Datasets
We report the experimental results on one synthetic and four real-world datasets. We give a brief description of each dataset. Statistics of the datasets are summarized in Table 2.
UCI Digit dataset is available from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Multiple+Features). This dataset consists of 2000 examples of handwritten digits (0-9) extracted from Dutch utility maps. There are 200 examples in each class, each represented with six feature sets. Following the experiments in [45], we used three feature sets: 76 Fourier coefficients of the character shapes, 216 profile correlations and 64 Karhunen-Loève coefficients.
Reuters dataset [52] contains features of documents available in five different languages and their translations over a common set of six categories. All documents are in the bag-of-words representation. We use documents originally written in English as one view and their translations to French, German, Spanish and Italian as four other views. We randomly sampled 100 documents from each class, resulting in a dataset of 600 documents.
3-sources dataset (http://mlg.ucd.ie/datasets/3sources.html) is a news articles dataset collected from three online news sources: BBC, Reuters, and The Guardian. All articles are in the bag-of-words representation. Of 948 articles, we used 169 that are available in all three sources. Each article in the dataset is annotated with a dominant topic class.
Prokaryotic phyla dataset contains 551 prokaryotic species described with heterogeneous multi-view data including textual data and different genomic representations [53]. Textual data consists of the bag-of-words representation of documents describing prokaryotic species and is considered as one view. In our experiments we use two genomic representations: (i) the proteome composition, encoded as relative frequencies of amino acids; (ii) the gene repertoire, encoded as presence/absence indicators of gene families in a genome. In order to reduce the dimensionality of the dataset, we apply principal component analysis (PCA) on each of the three views separately and retain principal components explaining
of the variance. Each species in the dataset is labeled with the phylum it belongs to. Unlike previous datasets, this dataset is unbalanced. The most frequently occurring cluster contains
species, while the smallest cluster contains species.
Synthetic dataset was generated in the way described in [42, 54].
points are generated from two views, where data points for each view are generated from twocomponent Gaussian mixture models. Cluster means and covariance matrices for view
are: , ,