Multi-view Low-rank Sparse Subspace Clustering

08/29/2017 · Maria Brbic, et al. · Ruđer Bošković Institute

Most existing approaches address the multi-view subspace clustering problem by constructing the affinity matrix on each view separately and afterwards propose how to extend the spectral clustering algorithm to handle multi-view data. This paper presents an approach to multi-view subspace clustering that learns a joint subspace representation by constructing an affinity matrix shared among all views. Relying on the importance of both low-rank and sparsity constraints in the construction of the affinity matrix, we introduce an objective that balances the agreement across different views with the sparsity and low-rankness of the solution. The related low-rank and sparsity constrained optimization problem is solved for each view using the alternating direction method of multipliers. Furthermore, we extend our approach to cluster data drawn from nonlinear subspaces by solving the corresponding problem in a reproducing kernel Hilbert space. The proposed algorithm outperforms state-of-the-art multi-view subspace clustering algorithms on one synthetic and four real-world datasets.


1 Introduction

In many real-world machine learning problems the same data comprises several different representations or views. For example, the same documents may be available in multiple languages [1], or different descriptors can be constructed from the same images [2]. Although each of these individual views may be sufficient to perform a learning task, integrating complementary information from different views can reduce the complexity of a given task [3]. Multi-view clustering seeks to partition data points based on multiple representations by assuming that the same cluster structure is shared across views. By combining information from different views, multi-view clustering algorithms attempt to achieve more accurate cluster assignments than can be obtained by simply concatenating features from different views.

In practice, high-dimensional data often reside in a low-dimensional subspace. When all data points lie in a single subspace, the problem can be cast as finding a basis of the subspace and a low-dimensional representation of the data points. Depending on the constraints imposed on the low-dimensional representation, this problem can be solved using, e.g., Principal Component Analysis (PCA) [4], Independent Component Analysis (ICA) [5] or Non-negative Matrix Factorization (NMF) [6, 7, 8]. On the other hand, data points can be drawn from different sources and lie in a union of subspaces. By assigning each subspace to one cluster, one can solve the problem by applying standard clustering algorithms, such as k-means [9]. However, these algorithms assume that data points are distributed around a centroid and often do not perform well when data points in a subspace are arbitrarily distributed. For example, two points can be close to each other and lie in different subspaces, or can be far apart and still lie in the same subspace [10]. Therefore, methods that rely on the spatial proximity of data points often fail to provide a satisfactory solution. This has motivated the development of subspace clustering algorithms [10]. The goal of subspace clustering is to identify the low-dimensional subspaces and find the cluster membership of the data points. Spectral based methods [11, 12, 13] present one approach to the subspace clustering problem. They have gained a lot of attention in recent years due to the competitive results they achieve on arbitrarily shaped clusters and their well-defined mathematical principles. These methods are based on spectral graph theory and represent data points as nodes in a weighted graph. The clustering problem is then solved as a relaxation of the min-cut problem on the graph [14].

One of the main challenges in spectral based methods is the construction of the affinity matrix, whose elements define the similarity between data points. Sparse subspace clustering [15] and low-rank subspace clustering [16, 17, 18, 19] are among the most effective methods that solve this problem. These methods rely on the self-expressiveness property of the data, representing each data point as a linear combination of other data points. Low-Rank Representation (LRR) [16, 17] imposes a low-rank constraint on the data representation matrix and captures the global structure of the data. Low rank implies that the data matrix is represented by a sum of a small number of outer products of left and right singular vectors weighted by the corresponding singular values. Under the assumption that subspaces are independent and data sampling is sufficient, LRR guarantees exact clustering. However, for many real-world datasets this assumption is overly restrictive, and the assumption that data are drawn from disjoint subspaces would be more appropriate [20, 21]. On the other hand, Sparse Subspace Clustering (SSC) [15] represents each data point as a sparse linear combination of other points and captures the local structure of the data. Learning the representation matrix in SSC can be interpreted as sparse coding [22, 23, 24, 25, 26, 27]. However, compared to sparse coding, where a dictionary is learned such that the representation is sparse [28, 29], SSC is based on the self-representation property, i.e. the data matrix itself serves as the dictionary. SSC also succeeds when data are drawn from independent subspaces, and conditions have been established for clustering data drawn from disjoint subspaces [30]. However, the theoretical analysis in [31] shows that SSC may over-segment subspaces when the dimensionality of the data points is higher than three. Experimental results in [32] show that LRR misclassifies different data points than SSC. Therefore, in order to capture both the global and the local structure of the data, it is necessary to combine low-rank and sparsity constraints [32, 33].

Multi-view subspace clustering can be considered as a part of multi-view or multi-modal learning. The multi-view learning method in [34] learns view generation matrices and a representation matrix, relying on the assumption that data from all views share the same representation matrix. The multi-view method in [35] is based on canonical correlation analysis for the extraction of two-view filter-bank-based features for an image classification task. Similarly, in [36] the authors rely on tensor-based canonical correlation analysis to perform multi-view dimensionality reduction. This approach can be used as a preprocessing step in multi-view learning in the case of high-dimensional data. In [37] a low-rank representation matrix is learned on each view separately and the learned representation matrices are concatenated into a matrix from which a unified graph affinity matrix is obtained. The method in [38] relies on learning a linear projection matrix for each view separately. High-order distance-based multi-view stochastic learning is proposed in [39] to efficiently explore the complementary characteristics of multi-view features for image classification. The method in [40] is oriented towards the image reranking application and assumes that multi-view features are contained in hypergraph Laplacians that define different modalities. In [41] the authors propose a multi-view matrix completion algorithm for handling multi-view features in semi-supervised multi-label image classification.

Previous multi-view subspace clustering works [42, 43, 44, 45] address the problem by constructing an affinity matrix on each view separately and then extend the algorithm to handle multi-view data. However, since input data may often be corrupted by noise, this approach can lead to the propagation of noise in the affinity matrices and degrade clustering performance. Different from the existing approaches, we propose a multi-view spectral clustering framework that jointly learns a subspace representation by constructing a single affinity matrix shared by the multi-view data, while at the same time encouraging low-rankness and sparsity of the representation. We propose Multi-view Low-rank Sparse Subspace Clustering (MLRSSC) algorithms that enforce agreement: (i) between affinity matrices of pairs of views; or (ii) between affinity matrices and a common centroid. As opposed to [35, 40, 46], the proposed approach can deal with highly heterogeneous multi-view data coming from different modalities. We present an optimization procedure that solves the corresponding convex optimization problems using the Alternating Direction Method of Multipliers (ADMM) [47]. Furthermore, we propose kernel extensions of our algorithms by solving the problem in a Reproducing Kernel Hilbert Space (RKHS). Experimental results show that the MLRSSC algorithms outperform state-of-the-art multi-view subspace clustering algorithms on several benchmark datasets. Additionally, we evaluate performance on a novel real-world heterogeneous multi-view dataset from the biological domain.

The remainder of the paper is organized as follows. Section 2 gives a brief overview of low-rank and sparse subspace clustering methods. Section 3 introduces two novel multi-view subspace clustering algorithms. In Section 4 we present the kernelized versions of the proposed algorithms by formulating the subspace clustering problem in an RKHS. The performance of the new algorithms is demonstrated in Section 5. Section 6 concludes the paper.

2 Background and Related Work

In this section, we give a brief introduction to Sparse Subspace Clustering (SSC) [15], Low-Rank Representation (LRR) [16, 17] and Low-rank Sparse Subspace Clustering (LRSSC) [32].

2.1 Main Notations

Throughout this paper, matrices are represented with bold capital symbols and vectors with bold lower-case symbols. $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The $\ell_1$ norm, denoted by $\|\cdot\|_1$, is the sum of the absolute values of the matrix elements; the infinity norm $\|\cdot\|_\infty$ is the maximum absolute element value; and the nuclear norm $\|\cdot\|_{*}$ is the sum of the singular values of a matrix. The trace operator of a matrix is denoted by $\mathrm{tr}(\cdot)$ and $\mathrm{diag}(\cdot)$ is the vector of diagonal elements of a matrix. $\mathbf{0}$ denotes the null vector. Table 1 summarizes the notations used throughout the paper.

Notation            Definition
$N$                 Number of data points
$k$                 Number of clusters
$v$                 View index
$n_v$               Number of views
$D^{(v)}$           Dimension of data points in view $v$
$X^{(v)}$           Data matrix in view $v$
$C^{(v)}$           Representation matrix in view $v$
$C^{*}$             Centroid representation matrix
$W$                 Affinity matrix
$X = U\Sigma V^{T}$ Singular value decomposition (SVD) of $X$
$\Phi(X^{(v)})$     Data points in view $v$ mapped into high-dimensional feature space
$K^{(v)}$           Gram matrix in view $v$
Table 1: Notations and abbreviations

2.2 Related Work

Consider a set of $N$ data points that lie in a union of $k$ linear subspaces of unknown dimensions. Given the set of data points arranged as columns of the data matrix $X$, the task of subspace clustering is to cluster the data points according to the subspaces they belong to. The first step is the construction of the affinity matrix $W$, whose elements define the similarity between data points. Ideally, the affinity matrix is block diagonal, with nonzero similarities assigned only to points from the same subspace. LRR, SSC and LRSSC construct the affinity matrix by enforcing low-rank, sparsity, and low-rank plus sparsity constraints, respectively.

Low-Rank Representation (LRR) [16, 17] seeks to find a low-rank representation matrix $C$ for the input data $X$. The basic model of LRR is the following:

$$\min_{C} \|C\|_{*} \quad \text{s.t.} \quad X = XC \qquad (1)$$

where the nuclear norm $\|C\|_{*}$ is used to approximate the rank of $C$, which results in a convex optimization problem.

Denote the skinny SVD of $X$ as $X = U \Sigma V^{T}$. The minimizer of equation (1) is uniquely given by [16]:

$$C = V V^{T} \qquad (2)$$
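To make the closed-form solution (2) concrete, a minimal NumPy sketch (not the authors' code; the function name and rank tolerance are illustrative) could look as follows:

```python
import numpy as np

def lrr_noise_free(X, tol=1e-10):
    """Closed-form LRR minimizer C = V V^T for noise-free data (eq. (2)).

    X is a D x N data matrix; V collects the right singular vectors of the
    skinny SVD X = U S V^T whose singular values exceed `tol`.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[s > tol].T
    return V @ V.T  # N x N low-rank representation matrix
```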

In the case when data is contaminated by noise, the following problem needs to be solved:

$$\min_{C} \|C\|_{*} + \frac{\lambda}{2}\|X - XC\|_F^2 \qquad (3)$$

The optimal solution of equation (3) has been derived in [18]:

$$C = V_1\left(I - \frac{1}{\lambda}\Sigma_1^{-2}\right)V_1^{T} \qquad (4)$$

where $U = [U_1\; U_2]$, $\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2)$ and $V = [V_1\; V_2]$. The matrices are partitioned according to the sets $I_1 = \{i : \sigma_i > 1/\sqrt{\lambda}\}$ and $I_2 = \{i : \sigma_i \le 1/\sqrt{\lambda}\}$, where $\sigma_i$ denote the singular values of $X$.

Sparse Subspace Clustering (SSC) [15] requires that each data point is represented by a small number of data points from its own subspace, which amounts to solving the following minimization problem:

$$\min_{C} \|C\|_{1} \quad \text{s.t.} \quad X = XC,\; \mathrm{diag}(C) = \mathbf{0} \qquad (5)$$

The $\ell_1$ norm is used as the tightest convex relaxation of the $\ell_0$ quasi-norm that counts the number of nonzero elements of the solution. The constraint $\mathrm{diag}(C) = \mathbf{0}$ is used to avoid the trivial solution of representing a data point as a linear combination of itself.

If data is contaminated by noise, the following minimization problem needs to be solved:

$$\min_{C} \|C\|_{1} + \frac{\lambda}{2}\|X - XC\|_F^2 \quad \text{s.t.} \quad \mathrm{diag}(C) = \mathbf{0} \qquad (6)$$

This problem can be efficiently solved using the ADMM optimization procedure [47].

Low-Rank Sparse Subspace Clustering (LRSSC) [32] combines the low-rank and sparsity constraints:

$$\min_{C} \beta_1\|C\|_{*} + \beta_2\|C\|_{1} \quad \text{s.t.} \quad X = XC,\; \mathrm{diag}(C) = \mathbf{0} \qquad (7)$$

In the case of corrupted data, the following problem needs to be solved to approximate $C$:

$$\min_{C} \beta_1\|C\|_{*} + \beta_2\|C\|_{1} + \frac{\lambda}{2}\|X - XC\|_F^2 \quad \text{s.t.} \quad \mathrm{diag}(C) = \mathbf{0} \qquad (8)$$

Once the matrix $C$ is obtained by the LRR, SSC or LRSSC approach, the affinity matrix $W$ is calculated as:

$$W = |C| + |C|^{T} \qquad (9)$$

Given the affinity matrix $W$, spectral clustering [11, 12] finds the cluster membership of data points by applying k-means clustering to the eigenvectors of the graph Laplacian matrix computed from the affinity matrix $W$.
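As an illustration of how equation (9) feeds into the spectral clustering step, the following sketch builds the affinity and clusters the points; the use of scikit-learn is an implementation choice, not something specified in the paper:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(C, n_clusters):
    """Form the affinity W = |C| + |C|^T (eq. (9)) and cluster it spectrally."""
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return W, labels
```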

3 Multi-view Low-rank Sparse Subspace Clustering

In this section we present the Multi-view Low-rank Sparse Subspace Clustering (MLRSSC) algorithm with two different regularization approaches. We assume that we are given a dataset with $n_v$ views, where each view is described with its own set of features. Our objective is to find a joint representation matrix that balances the trade-off between the agreement across different views and the sparsity and low-rankness of the solution.

We formulate a joint objective function that enforces the representation matrices across different views to be regularized towards a common consensus. Motivated by [42], we propose two regularization schemes for the MLRSSC algorithm: (i) MLRSSC based on pairwise similarities and (ii) centroid-based MLRSSC. The first regularization encourages similarity between pairs of representation matrices. The centroid-based approach enforces representations across different views towards a common centroid. The standard spectral clustering algorithm can then be applied to the jointly inferred affinity matrix.

3.1 Pairwise Multi-view Low-rank Sparse Subspace Clustering

We propose to solve the following joint optimization problem over $n_v$ views:

$$\min_{C^{(1)},\dots,C^{(n_v)}} \; \sum_{v=1}^{n_v}\left(\beta_1\|C^{(v)}\|_{*} + \beta_2\|C^{(v)}\|_{1}\right) + \sum_{\substack{1\le v,w\le n_v \\ w\ne v}} \lambda^{(v)}\|C^{(v)} - C^{(w)}\|_F^2$$
$$\text{s.t.}\quad X^{(v)} = X^{(v)}C^{(v)},\quad \mathrm{diag}(C^{(v)}) = \mathbf{0},\quad v = 1,\dots,n_v \qquad (10)$$

where $C^{(v)}$ is the representation matrix for view $v$. Parameters $\beta_1$, $\beta_2$ and $\lambda^{(v)}$ define the trade-off between the low-rank constraint, the sparsity constraint and the agreement across views, respectively. In cases where we do not have prior information that one view is more important than the others, $\lambda^{(v)}$ does not depend on the view and the same value of $\lambda$ is used across all views. The last term in the objective (10) is introduced to encourage similarity between pairs of representation matrices across views.

With all but one $C^{(v)}$ fixed, we minimize the function in (10) for each view independently:

(11)

By introducing auxiliary variables, we reformulate the objective:

(12)

The augmented Lagrangian is:

(13)

where the penalty parameters need to be tuned and the dual variables are the Lagrange multipliers associated with the constraints.

To solve the convex optimization problem in (12), we use the Alternating Direction Method of Multipliers (ADMM) [47]. ADMM converges for objectives composed of two-block convex separable problems; here the auxiliary variables do not depend on each other and can be treated as a single variable block.

Update rule for $C^{(v)}$. Given the remaining variables from the current iteration, the matrix $C^{(v)}$ that minimizes the objective in equation (13) is updated by the following rule:

(14)

The update rule follows straightforwardly by setting the partial derivative of the augmented Lagrangian in equation (13) with respect to $C^{(v)}$ to zero.

Update rule for the low-rank auxiliary variable. Given the updated $C^{(v)}$ and the remaining variables from the previous iteration, we minimize the objective in equation (13) with respect to this variable:

(15)

From [48], it follows that the unique minimizer of (15) is:

(16)

where the singular value thresholding operator performs soft-thresholding on the singular values of its matrix argument, i.e. it replaces the skinny SVD $A = U \Sigma V^{T}$ with $U S_{\eta}(\Sigma) V^{T}$. Here $S_{\eta}(\cdot)$ denotes the soft-thresholding operator defined as $S_{\eta}(x) = x - \eta$ for $x > \eta$, $S_{\eta}(x) = x + \eta$ for $x < -\eta$, and $S_{\eta}(x) = 0$ otherwise.

Update rule for the sparse auxiliary variable. Given the updated $C^{(v)}$ and the remaining variables from the previous iteration, we minimize the Lagrangian in equation (13) with respect to this variable:

(17)

The minimization of (17) gives the following update rule [49, 50]:

(18)

where $S_{\eta}(\cdot)$ denotes the soft-thresholding operator applied entry-wise to its matrix argument.
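The two proximal operators used in updates (16) and (18) are standard; a small NumPy sketch (illustrative, not the authors' code) is:

```python
import numpy as np

def soft_threshold(A, eta):
    """Entry-wise soft-thresholding S_eta(x) = sign(x) * max(|x| - eta, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - eta, 0.0)

def singular_value_threshold(A, eta):
    """Soft-threshold the singular values of A (the operator used in eq. (16))."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * soft_threshold(s, eta)) @ Vt
```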

Update rule for the remaining auxiliary variable. Given the other variables from the current and previous iterations, we minimize the objective in equation (13) with respect to it:

(19)

The partial derivative of the Lagrangian in equation (13) with respect to this variable is:

(20)

Setting the partial derivative in (20) to zero gives:

(21)

Update rules for the dual variables. Given the primal variables from the current iteration, the dual variables are updated with the following equations:

(22)

If data is contaminated by noise and does not perfectly lie in the union of subspaces, we modify the objective function as follows:

(23)

Update rule for $C^{(v)}$ for corrupted data. Given the remaining variables from the previous iteration, the matrix $C^{(v)}$ is obtained by setting to zero the partial derivative of the augmented Lagrangian of problem (23):

(24)

The update rules for the auxiliary variables and the dual variables are the same as in (16), (18), (21) and (22), respectively.

These update steps are repeated until convergence or until the maximum number of iterations is reached. Convergence is checked at each iteration by verifying, for every view $v$, that the violations of the equality constraints fall below a chosen tolerance. After obtaining the representation matrix $C^{(v)}$ for each view $v$, we combine them by taking the element-wise average across all views. The next step of the algorithm is to find the assignment of the data points to the corresponding clusters by applying the spectral clustering algorithm to the joint affinity matrix $W$. Algorithm 1 summarizes the steps of the pairwise MLRSSC. For practical reasons, we use the same initial values of the penalty parameters for all views and update them after the optimization of all views. It is possible to use a more general approach with different initial penalty parameters for each view $v$, but this significantly increases the number of parameters to optimize.
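One reading of this fusion step (element-wise averaging of the view-specific representations, followed by the symmetrization of eq. (9)) is sketched below; the exact order of averaging and taking absolute values is an assumption here:

```python
import numpy as np

def joint_affinity(C_views):
    """Element-wise average of the C^(v) across views, then W = |C| + |C|^T."""
    C = np.mean(np.stack(C_views, axis=0), axis=0)
    return np.abs(C) + np.abs(C).T
```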

The problem in (10) is convex subject to linear constraints and all its subproblems can be solved exactly. Hence, the theoretical results in [51] guarantee the global convergence of ADMM. The computational complexity of Algorithm 1 is $O(T\, n_v N^3)$, where $T$ is the number of iterations, $n_v$ is the number of views and $N$ is the number of data points. In the experiments we set a maximal number of iterations, but the algorithm converged before this maximum was exceeded. Importantly, the computational complexity of the spectral clustering step is $O(N^3)$, so the computational cost of the proposed representation learning step is $T\, n_v$ times higher.

3.2 Centroid-based Multi-view Low-rank Sparse Subspace Clustering

In addition to the pairwise MLRSSC, we also introduce an objective for the centroid-based MLRSSC, which enforces the view-specific representations towards a common centroid. We propose to solve the following minimization problem:

$$\min_{C^{(1)},\dots,C^{(n_v)},\,C^{*}} \; \sum_{v=1}^{n_v}\left(\beta_1\|C^{(v)}\|_{*} + \beta_2\|C^{(v)}\|_{1} + \lambda^{(v)}\|C^{(v)} - C^{*}\|_F^2\right)$$
$$\text{s.t.}\quad X^{(v)} = X^{(v)}C^{(v)},\quad \mathrm{diag}(C^{(v)}) = \mathbf{0},\quad v = 1,\dots,n_v \qquad (25)$$

where $C^{*}$ denotes the consensus (centroid) variable.

 

Algorithm 1 Pairwise MLRSSC
Input: data matrices $X^{(v)}$, $v = 1,\dots,n_v$; parameters $\beta_1$, $\beta_2$, $\lambda^{(v)}$; number of clusters $k$
Output: Assignment of the data points to clusters
1: Initialize: representation matrices, auxiliary variables, dual variables and penalty parameters
2: while not converged do
3:    for $v = 1$ to $n_v$ do
4:     Fix others and update $C^{(v)}$ by solving (14) in the case of clean data
      or (24) in the case of corrupted data
5:     Fix others and update the low-rank auxiliary variable by solving (16)
6:     Fix others and update the sparse auxiliary variable by solving (18)
7:     Fix others and update the remaining auxiliary variable by solving (21)
8:     Fix others and update the dual variables by solving (22),
      together with the additional dual variable in the case of clean data
9:    end for
10:   Update the penalty parameters
11: end while
12: Combine the $C^{(v)}$ by taking the element-wise average
13: Apply spectral clustering [12] to the affinity matrix $W$

 

This objective function can be minimized by alternating minimization cycling over the views and the consensus variable. Specifically, the following two steps are repeated: (1) fix the consensus variable $C^{*}$ and update each $C^{(v)}$ in turn, while keeping all others fixed; and (2) fix all $C^{(v)}$ and update $C^{*}$.

By fixing all variables except one $C^{(v)}$, we solve the following problem:

(26)

Again, we solve the convex optimization problem using ADMM. We introduce auxiliary variables and reformulate the original problem:

(27)

The augmented Lagrangian is:

(28)

Update rule for the remaining auxiliary variable. Given the other variables from the current and previous iterations, minimization of the objective in equation (28) with respect to this variable leads to the following update rule:

(29)

Update rule for $C^{*}$. By setting the partial derivative of the objective function in equation (25) with respect to $C^{*}$ to zero, we get the closed-form solution:

$$C^{*} = \frac{\sum_{v=1}^{n_v} \lambda^{(v)} C^{(v)}}{\sum_{v=1}^{n_v} \lambda^{(v)}} \qquad (30)$$
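Under the reconstructed form of the objective (25), the update (30) reduces to a $\lambda$-weighted average of the view-specific representations; the helper below is a sketch under that assumption, not the authors' implementation:

```python
import numpy as np

def update_centroid(C_views, lambdas):
    """Centroid update C* as the lambda-weighted average of the C^(v)
    (eq. (30), assuming the reconstructed objective (25))."""
    weighted = sum(lam * C for lam, C in zip(lambdas, C_views))
    return weighted / sum(lambdas)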

It is easy to check that the update rules for the remaining variables and the dual variables are the same as in the pairwise-similarity-based multi-view LRSSC (equations (14), (16), (18) and (22)).

In order to extend the model to data contaminated by additive white Gaussian noise, the objective in (25) is modified as follows:

(31)

Compared to the model for clean data, the only update rule that needs to be modified is the one for $C^{(v)}$, which is the same as in the pairwise MLRSSC, given in equation (24).

In centroid-based MLRSSC there is no need to combine affinity matrices across views, since the joint affinity matrix can be computed directly from the centroid matrix, i.e. $W = |C^{*}| + |C^{*}|^{T}$. Algorithm 2 summarizes the steps of centroid-based MLRSSC. The computational complexity of Algorithm 2 is the same as that of Algorithm 1.

 

Algorithm 2 Centroid-based MLRSSC
Input: data matrices $X^{(v)}$, $v = 1,\dots,n_v$; parameters $\beta_1$, $\beta_2$, $\lambda^{(v)}$; number of clusters $k$
Output: Assignment of the data points to clusters
1: Initialize: representation matrices, auxiliary variables, dual variables, centroid $C^{*}$ and penalty parameters
2: while not converged do
3:    for $v = 1$ to $n_v$ do
4:     Fix others and update $C^{(v)}$ by solving (14) in the case of clean data
      or (24) in the case of corrupted data
5:     Fix others and update the low-rank auxiliary variable by solving (16)
6:     Fix others and update the sparse auxiliary variable by solving (18)
7:     Fix others and update the remaining auxiliary variable by solving (29)
8:     Fix others and update the dual variables by solving (22),
      together with the additional dual variable in the case of clean data
9:    end for
10:   Update the penalty parameters
11:   Fix others and update the centroid $C^{*}$ by solving (30)
12: end while
13: Apply spectral clustering [12] to the affinity matrix $W$

 

4 Kernel Multi-view Low-rank Sparse Subspace Clustering

The spectral decomposition of the graph Laplacian enables spectral clustering to separate data points with nonlinear hypersurfaces. However, by representing data points as linear combinations of other data points, the MLRSSC algorithm learns an affinity matrix that models the linear subspace structure of the data. In order to recover nonlinear subspaces, we propose to solve the MLRSSC problem in an RKHS by implicitly mapping data points into a high-dimensional feature space.

We define $\Phi$ to be a function that maps the original input space to a high-dimensional (possibly infinite-dimensional) feature space. Since the presented update rules for the corrupted-data case of both pairwise and centroid-based MLRSSC depend on the data only through dot products, both approaches can be solved in an RKHS and extended to model nonlinear manifold structure.

Let $\Phi(X^{(v)})$ denote the set of data points in view $v$ mapped into the high-dimensional feature space. The objective function of the pairwise kernel MLRSSC for data contaminated by noise is the following:

(32)

Similarly, the objective function of the centroid-based kernel MLRSSC in feature space for corrupted data is:

(33)

Since $C^{(v)}$ is the only variable whose update depends on $\Phi(X^{(v)})$, the update rules for the remaining variables and the dual variables remain unchanged.

Update rule for $C^{(v)}$. Given the remaining variables from the previous iteration, $C^{(v)}$ is updated by the following rule:

(34)

Substituting the dot products $\Phi(X^{(v)})^{T}\Phi(X^{(v)})$ with the Gram matrix $K^{(v)}$, we get the following update rule for $C^{(v)}$:

(35)

This update rule for $C^{(v)}$ is the same in the pairwise and centroid-based versions of the algorithm.
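Because only the Gram matrix is needed, the reconstruction error can be evaluated without an explicit feature map. The sketch below uses an RBF kernel purely as an example (the kernel choice and the data layout are assumptions, not taken from this section):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_reconstruction_error(K, C):
    """||Phi(X) - Phi(X)C||_F^2 written as tr((I - C)^T K (I - C))."""
    R = np.eye(K.shape[0]) - C
    return np.trace(R.T @ K @ R)

# Example Gram matrix for one view: rbf_kernel expects samples in rows,
# so a D x N data matrix X_v is transposed first.
# K_v = rbf_kernel(X_v.T)
```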

5 Experiments

In this section we present results that demonstrate the effectiveness of the proposed algorithms. The performance is measured on one synthetic and three real-world datasets that are commonly used to evaluate multi-view algorithms. Moreover, we introduce a novel real-world multi-view dataset from the molecular biology domain. We compare MLRSSC with state-of-the-art multi-view subspace clustering algorithms, as well as with two baselines: best single-view LRSSC and feature-concatenation LRSSC.

5.1 Datasets

We report experimental results on one synthetic and four real-world datasets. We give a brief description of each dataset; statistics of the datasets are summarized in Table 2.

UCI Digit dataset is available from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Multiple+Features). It consists of 2000 examples of handwritten digits (0-9) extracted from Dutch utility maps. There are 200 examples in each class, each represented with six feature sets. Following the experiments in [45], we used three feature sets: 76 Fourier coefficients of the character shapes, 216 profile correlations and 64 Karhunen-Loève coefficients.

Reuters dataset [52] contains features of documents available in five different languages and their translations over a common set of six categories. All documents are in the bag-of-words representation. We use documents originally written in English as one view and their translations to French, German, Spanish and Italian as four other views. We randomly sampled 100 documents from each class, resulting in a dataset of 600 documents.

3-sources dataset (http://mlg.ucd.ie/datasets/3sources.html) is a news articles dataset collected from three online news sources: BBC, Reuters, and The Guardian. All articles are in the bag-of-words representation. Of 948 articles, we used the 169 that are available in all three sources. Each article in the dataset is annotated with a dominant topic class.

Prokaryotic phyla dataset contains 551 prokaryotic species described with heterogeneous multi-view data including textual data and different genomic representations [53]. The textual data consist of the bag-of-words representation of documents describing prokaryotic species and are considered as one view. In our experiments we use two genomic representations: (i) the proteome composition, encoded as relative frequencies of amino acids, and (ii) the gene repertoire, encoded as presence/absence indicators of gene families in a genome. In order to reduce the dimensionality of the dataset, we apply principal component analysis (PCA) on each of the three views separately and retain the principal components explaining a fixed proportion of the variance. Each species in the dataset is labeled with the phylum it belongs to. Unlike the previous datasets, this dataset is unbalanced, with the most frequently occurring cluster containing considerably more species than the smallest cluster.
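The per-view dimensionality reduction described above can be reproduced with a standard PCA call; the retained variance ratio below is a placeholder, since the exact percentage is not restated here:

```python
from sklearn.decomposition import PCA

def reduce_view(X_view, variance_ratio=0.9):
    """PCA on one view (samples in rows), keeping enough components to explain
    `variance_ratio` of the variance; 0.9 is an illustrative value."""
    return PCA(n_components=variance_ratio).fit_transform(X_view)
```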

Synthetic dataset was generated in the way described in [42, 54]. Points are generated from two views, where data points for each view are generated from two-component Gaussian mixture models. Cluster means and covariance matrices for the first view are: