Shared Generative Latent Representation Learning for Multi-view Clustering

07/23/2019 ∙ by Ming Yin, et al. ∙ The University of Sydney

Clustering multi-view data has been a fundamental research topic in the computer vision community. It has been shown that integrating information from all the views achieves better accuracy than using any single view. However, existing methods often scale poorly to large datasets and perform poorly in reconstructing samples. This paper proposes a novel multi-view clustering method that learns a shared generative latent representation obeying a mixture of Gaussian distributions. The motivation is that multi-view data share a common latent embedding despite the diversity among the views. Specifically, benefiting from the success of deep generative learning, the proposed model can not only extract nonlinear features from the views but also capture the correlations among all the views. Extensive experimental results on several datasets of different scales demonstrate that the proposed method outperforms state-of-the-art methods under a range of performance criteria.




1 Introduction

Image clustering is a fundamental research topic that has been widely studied in computer vision and machine learning CaronBojanowskiJoulinDouze2018 ; ChangWangMengXiangPan2017 . As a class of unsupervised learning methods, clustering has also attracted significant attention in various applications. With the advance of information technology, in many real-world scenarios heterogeneous visual features, such as HOG DalalTriggs2005 , SIFT DengDongSocherLiLi2009 and LBP OjalaPietikainenMaenpaa2002 , can be readily acquired and form a new type of data, i.e., multi-view data. Therefore, to efficiently capture the complementary information among different views, multi-view clustering has gained considerable attention in recent years for learning more comprehensive information Sun2013 ; XuTaoXu2013 . In essence, multi-view clustering seeks to partition data points based on multiple representations by assuming that the same cluster structure is shared across all the views GaoNieLiHuang2015 ; WangNieHuang2013 ; YinGaoXieGuo2018 . It is crucial for a learning algorithm to incorporate the heterogeneous view information to enhance its accuracy and robustness.

In general, multi-view clustering methods can be roughly separated into two classes, i.e., similarity-based and feature-based. The former aims to construct an affinity matrix whose elements define the similarity between each pair of samples. Multi-view subspace clustering is one of the best-known similarity-based methods; it pursues a latent subspace shared by multiple views, assuming that each view is built from a common subspace ChaudhuriKakadeLivescuSridharan2009 ; GaoNieLiHuang2015 ; YinGaoXieGuo2018 ; ZhangFuLiuLiuCao2015 . However, these methods often suffer from scalability issues due to the super-quadratic running time for computing spectra JiangZhengTanTangZhou2017 . The feature-based methods, in contrast, seek to partition the samples into clusters so as to minimize the within-cluster sum of squared errors, such as multi-view $k$-means clustering CaiNieHuang2013 ; XuHanNieLi2017 . The selection of the feature space is clearly vital, as clustering with Euclidean distance on raw pixels is completely ineffective.

Inspired by the recent success of deep learning in feature learning, a surge of multi-view learning methods based on deep neural networks (DNN) has been proposed NgiamKhoslaKimNamLeeNg2011 ; WangAroraLivescuBilmes2015 ; XuGuanZhaoNiuWangWang2018 . First, Ngiam et al. NgiamKhoslaKimNamLeeNg2011 explored extracting shared representations by training a bimodal deep autoencoder. Next, by extending canonical correlation analysis (CCA), Wang et al. WangAroraLivescuBilmes2015 proposed deep canonically correlated autoencoders (DCCAE), which introduce an autoencoder regularization term into deep CCA. Unfortunately, these methods are only applicable to the two-view case and fail to handle more than two views. To explicitly summarize the consensus and complementary information in multi-view data, Deep Multi-view Concept Learning (DMCL) XuGuanZhaoNiuWangWang2018 performs non-negative factorization on every view hierarchically.

Though these methods perform well in multi-view clustering, they do not model the generative process of multi-view data and therefore cannot be used to generate samples. In this regard, benefiting from the success of approximate Bayesian inference, the variational autoencoder (VAE) has become the most popular algorithm under the framework that combines differentiable models with variational inference KingmaWelling2014 ; PuGanHenaoYuanLiStevensCarin2016 . By modeling the data generative procedure with a Gaussian Mixture Model (GMM) and a neural network, Jiang et al. JiangZhengTanTangZhou2017 proposed a novel unsupervised generative clustering approach within the framework of VAE, namely Variational Deep Embedding (VaDE). Although it has shown great advantages in clustering, it cannot be applied directly to multi-view learning.

Targeting classification and information retrieval, Srivastava et al. SrivastavaSalakhutdinov2014 presented a deep Boltzmann machine for learning a generative model of multi-view data. However, until recently there was no successful multi-view extension to clustering. The main obstacle is how to efficiently exploit the shared generative latent representation across the views in an unsupervised way. To tackle this issue, in this paper we propose a novel multi-view clustering method that learns a shared generative latent representation obeying a mixture of Gaussian distributions, namely Deep Multi-View Clustering via Variational Autoencoders (DMVCVAE). In particular, our motivation is that multi-view data share a common latent embedding despite the diversity among the views. Meanwhile, the proposed model benefits from the success of deep generative learning, which can capture the data distribution by neural networks.

In summary, our contributions are as follows.

  • We propose to learn a shared generative latent representation for multi-view clustering. Specifically, the generative approach assumes that the data of different views share a common conditional distribution of hidden variables given observed data, and that the hidden data are sampled independently from a mixture of Gaussian distributions.

  • To better exploit the information from multiple views, we introduce a set of non-negative combination weights, which are learned jointly with the deep autoencoder networks in a unified framework.

  • We conduct a number of numerical experiments showing that the proposed method outperforms state-of-the-art clustering models on several well-known datasets, including large-scale multi-view data.

2 Related Works

In the literature, there are a few studies on clustering using deep neural networks JiZhangLiSalzmannReid2017 ; PengXiaoFengYauYi2016 ; TianGaoCuiChenLiu2014 ; XieGirshickFarhadi2016 ; YangFuSidiropoulosHong2017 . These algorithms can be roughly divided into two categories, i.e., separate and joint deep clustering approaches. The earlier deep clustering algorithms JiZhangLiSalzmannReid2017 ; PengXiaoFengYauYi2016 ; TianGaoCuiChenLiu2014 often work in two stages: they first extract deep features and then perform a traditional clustering method, such as $k$-means or spectral clustering, for the final segmentation. Yet the separated process does not help learn clustering-favourable features. To this end, joint feature learning and clustering methods XieGirshickFarhadi2016 ; YangFuSidiropoulosHong2017 were proposed based on deep neural networks. In XieGirshickFarhadi2016 , Xie et al. presented Deep Embedded Clustering (DEC) to learn a mapping from the data space to a lower-dimensional feature space, where it iteratively optimizes a Kullback-Leibler (KL) divergence based clustering objective. In YangFuSidiropoulosHong2017 , Yang et al. proposed a framework that performs dimensionality reduction jointly with $k$-means clustering, where deep neural networks are applied to the dimensionality reduction.

However, due to the limitation of the similarity measures in the aforementioned methods, the hidden, hierarchical dependencies in the latent space of data often cannot be captured effectively. Instead, deep generative models were built to better handle the rich latent structures within data JiangZhengTanTangZhou2017 . In essence, deep generative models are utilized to estimate the density of observed data under some assumptions about its latent structure, i.e., the hidden causes. Recently, Jiang et al. JiangZhengTanTangZhou2017 proposed a novel clustering framework that integrates VAE and a GMM for clustering tasks, namely Variational Deep Embedding (VaDE). Unfortunately, as this method mainly focuses on single-view data, the complementary information from multiple heterogeneous views cannot be efficiently exploited. In other words, the existing generative model cannot deal with shared latent representations for modeling the generative process of each view's data.

3 The Proposed Method

3.1 The Architecture

Given a multi-view data set $\{\mathbf{X}^{(v)}\}_{v=1}^{m}$ with $m$ views in total, it is reasonable to assume that the $i$-th sample $\mathbf{x}_i^{(v)}$ of the $v$-th view is generated by some unknown process, for example, from an unobserved continuous variable $\mathbf{z}$. The variable $\mathbf{z}$ is a common hidden representation shared by all views. Furthermore, in a typical setting, each sample $\mathbf{x}^{(v)}$ of a view is assumed to be generated through a two-stage process: first the hidden variable $\mathbf{z}$ is generated according to some prior distribution $p_{\theta}(\mathbf{z})$, and then the observed sample $\mathbf{x}^{(v)}$ is yielded by some conditional distribution $p_{\theta}(\mathbf{x}^{(v)} \mid \mathbf{z})$. Usually, because $\mathbf{z}$ and the parameters $\theta$ are unknown, the prior $p_{\theta}(\mathbf{z})$ and the likelihood $p_{\theta}(\mathbf{x}^{(v)} \mid \mathbf{z})$ are hidden.

For clustering tasks, it is desired that the observed sample $\mathbf{x}^{(v)}$ is generated jointly according to the latent variable $\mathbf{z}$ and an assumed clustering variable $c$. However, most existing variational autoencoders are not suitable for clustering tasks by design, let alone multi-view clustering. Therefore, we are motivated to present a novel multi-view clustering method under the VAE framework by incorporating a clustering-promoting objective. Ideally we would assume that the sample generative process is given by the new likelihood $p(\mathbf{x}^{(v)} \mid \mathbf{z}, c)$, conditioned on both the hidden variable $\mathbf{z}$ and the cluster label $c$. However, for simplicity we break the direct dependence of $\mathbf{x}^{(v)}$ on $c$ conditioned on an assumed Gaussian mixture variable $\mathbf{z}$. The proposed framework is shown in the right panel of Figure 1. In this architecture, multi-view samples $\mathbf{x}^{(v)}$ are generated by using a DNN to decode the common hidden variable $\mathbf{z}$, which is sampled from a GMM as we assumed. To efficiently infer the posterior of both $\mathbf{z}$ and $c$ from the information of multiple views, a novel weighted target distribution is introduced, based on the individual variational distribution of $\mathbf{z}$ from each view. In order to optimize the evidence lower bound (ELBO), similar to VAE, we use a DNN to encode the observed data and incorporate the distributions of the multiple embeddings to infer the shared latent representation $\mathbf{z}$.

Figure 1: The architecture of the proposed multi-view model. The data generative process under the deep autoencoder framework is performed in three steps: (a) a cluster $c$ is first picked from a pretrained GMM; (b) a shared latent representation (embedding) $\mathbf{z}$, weighted by each view, is generated from the picked cluster; (c) a DNN decodes the latent embedding into an observable $\mathbf{x}^{(v)}$. To optimize the ELBO of the proposed model, the encoder network is applied.

3.2 The Objective

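For the sake of simplicity, we express a generic multi-view variable as $\mathbf{x} = \{\mathbf{x}^{(v)}\}_{v=1}^{m}$, where $\mathbf{x}^{(v)}$ is the general variable of the $v$-th view. Consider the latent variable $\mathbf{z}$ and the discrete latent variable $c$ ($c = 1, \dots, K$). Without loss of generality, in light of the clustering task under the framework of VAE, we aim to compute the probabilistic cluster assignments for each view. By Bayes' theorem, the corresponding posterior of $\mathbf{z}$ and $c$ given $\mathbf{x}$ is computed as follows:

$$p(\mathbf{z}, c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z} \mid c)\, p(c)}{\sum_{c'=1}^{K} \int p(\mathbf{x} \mid \mathbf{z}')\, p(\mathbf{z}' \mid c')\, p(c')\, d\mathbf{z}'}, \tag{1}$$

where we assume the views are independent given $\mathbf{z}$, i.e., $p(\mathbf{x} \mid \mathbf{z}) = \prod_{v=1}^{m} p(\mathbf{x}^{(v)} \mid \mathbf{z})$ (footnote 1: hereafter the model parameter $\theta$ is omitted).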

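As the integral is intractable, it is hard to calculate the posterior. Inspired by the principle of VAE KingmaWelling2014 , we turn to compute an appropriate posterior $q(\mathbf{z}, c \mid \mathbf{x})$ to approximate the true posterior $p(\mathbf{z}, c \mid \mathbf{x})$ by minimizing the following KL divergence between them:

$$D_{KL}\big(q(\mathbf{z}, c \mid \mathbf{x}) \,\|\, p(\mathbf{z}, c \mid \mathbf{x})\big) = \log p(\mathbf{x}) - \mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}), \tag{2}$$

where

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z}, c)}{q(\mathbf{z}, c \mid \mathbf{x})}\right] \tag{3}$$

is called the evidence lower bound (ELBO) and $\log p(\mathbf{x})$ is the log-likelihood.

Since $\log p(\mathbf{x})$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the ELBO. Often $q(\mathbf{z}, c \mid \mathbf{x})$ is assumed to be a mean-field distribution and can be readily factorized as

$$q(\mathbf{z}, c \mid \mathbf{x}) = q(\mathbf{z} \mid \mathbf{x})\, q(c \mid \mathbf{x}). \tag{4}$$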
Owing to the power of DNNs in approximating nonlinear functions, we introduce a neural network $g$ with parameters $\phi$ to infer $q(\mathbf{z} \mid \mathbf{x})$. That is, the DNN is utilized to encode the observed view data into the latent representation. Meanwhile, to incorporate multi-view information, we propose a combined variational approximation $q(\mathbf{z} \mid \mathbf{x})$. Considering the importance of different views, we introduce a weight vector $\mathbf{w} = [w_1, \dots, w_m]$ ($w_v \ge 0$) to fuse the distributions of the hidden variables, so that the consistency and complementarity of multi-view data can be better exploited. In particular, we assume the variational approximation to the posterior of the latent representation to be a Gaussian that integrates information from multiple views as follows.
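The fusion can be sketched in the following form. This is an assumption that is consistent with the surrounding description (per-view encoder outputs, a weighted combination, and a Gaussian with diagonal covariance); in particular, the linear weighting of the variances and the symbols $g$, $\phi^{(v)}$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}$ are illustrative choices rather than confirmed notation.

```latex
\begin{align}
[\tilde{\boldsymbol{\mu}}^{(v)};\, \log (\tilde{\boldsymbol{\sigma}}^{(v)})^{2}] &= g(\mathbf{x}^{(v)}; \phi^{(v)}), \tag{5} \\
\tilde{\boldsymbol{\mu}} &= \sum_{v=1}^{m} w_v\, \tilde{\boldsymbol{\mu}}^{(v)}, \tag{6} \\
\tilde{\boldsymbol{\sigma}}^{2} &= \sum_{v=1}^{m} w_v\, (\tilde{\boldsymbol{\sigma}}^{(v)})^{2}, \tag{7} \\
q(\mathbf{z} \mid \mathbf{x}) &= \mathcal{N}(\mathbf{z};\, \tilde{\boldsymbol{\mu}},\, \tilde{\boldsymbol{\sigma}}^{2} \mathbf{I}). \tag{8}
\end{align}
```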



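Here $\mathbf{I}$ is an identity matrix of suitable dimension. In the standard VAE, each pair of $\tilde{\boldsymbol{\mu}}^{(v)}$ and $\tilde{\boldsymbol{\sigma}}^{(v)}$ defines a Gaussian for the latent variable $\mathbf{z}$ in the $v$-th view. We have fused the information in Eqs. (5) - (8).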
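As a minimal sketch of the weighted fusion described here, the following Python function combines per-view Gaussian means and variances with non-negative weights. The function name `fuse_views`, the normalisation of the weights to sum to one, and the linear combination of variances are assumptions made for illustration, not the paper's confirmed implementation.

```python
def fuse_views(means, variances, weights):
    """Fuse per-view diagonal-Gaussian parameters into one shared Gaussian.

    means, variances: one list of floats per view, all of the same length.
    weights: one non-negative float per view.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Assumption: weights are normalised so that they sum to one.
    norm = [w / total for w in weights]
    dim = len(means[0])
    fused_mean = [sum(w * m[j] for w, m in zip(norm, means)) for j in range(dim)]
    fused_var = [sum(w * s[j] for w, s in zip(norm, variances)) for j in range(dim)]
    return fused_mean, fused_var


# Two views, two latent dimensions, equal weights.
mu, var = fuse_views([[0.0, 2.0], [2.0, 4.0]], [[1.0, 1.0], [3.0, 1.0]], [1.0, 1.0])
print(mu, var)
```

With equal weights the fused parameters are simply the per-view averages; unequal weights let a more informative view dominate the shared posterior.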

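Furthermore, the ELBO can be rewritten as

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \int q(\mathbf{z} \mid \mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z} \mid \mathbf{x})}\, d\mathbf{z} - \int q(\mathbf{z} \mid \mathbf{x})\, D_{KL}\big(q(c \mid \mathbf{x}) \,\|\, p(c \mid \mathbf{z})\big)\, d\mathbf{z}. \tag{9}$$

Hence, we set $D_{KL}\big(q(c \mid \mathbf{x}) \,\|\, p(c \mid \mathbf{z})\big) \equiv 0$ to maximize $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$, since the first term has no relationship with $c$ and the second term is non-negative. As a result, we use the following equation to compute $q(c \mid \mathbf{x})$, i.e.,

$$q(c \mid \mathbf{x}) = p(c \mid \mathbf{z}) = \frac{p(c)\, p(\mathbf{z} \mid c)}{\sum_{c'=1}^{K} p(c')\, p(\mathbf{z} \mid c')}. \tag{10}$$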
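This means we are proposing a mixture model for the latent prior $p(\mathbf{z})$. In particular, we implement the latent prior as a Gaussian mixture as follows,

$$p(c) = \mathrm{Cat}(c \mid \boldsymbol{\pi}), \tag{11}$$

$$p(\mathbf{z} \mid c) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^{2} \mathbf{I}), \tag{12}$$

where $\mathrm{Cat}(c \mid \boldsymbol{\pi})$ is the categorical distribution with parameter $\boldsymbol{\pi}$ such that $\pi_c$ ($c = 1, \dots, K$) is the prior probability for cluster $c$, and $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^{2}$ ($c = 1, \dots, K$) are the mean and the variance of the $c$-th Gaussian component, respectively.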
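The cluster posterior under a Gaussian mixture prior is the usual GMM responsibility, so a small sketch may help make the computation concrete. The code below assumes diagonal covariances and evaluates the responsibilities in log space for numerical stability; the function names are hypothetical.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at z."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(z, mean, var))


def responsibilities(z, pis, means, variances):
    """Return p(c | z) for every cluster c of a diagonal GMM."""
    log_joint = [math.log(p) + log_gaussian_diag(z, m, v)
                 for p, m, v in zip(pis, means, variances)]
    top = max(log_joint)  # subtract the maximum before exponentiating
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two clusters in a 2-D latent space; z lies at the first cluster's mean.
gamma = responsibilities([0.0, 0.0], [0.5, 0.5],
                         [[0.0, 0.0], [4.0, 4.0]],
                         [[1.0, 1.0], [1.0, 1.0]])
cluster = max(range(len(gamma)), key=gamma.__getitem__)
print(cluster)
```

The hard cluster label is then the index of the largest responsibility.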

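Once the latent variable $\mathbf{z}$ is produced according to the GMM prior, the multi-view data generative process is defined as follows. For binary observed data,

$$\boldsymbol{\mu}_x^{(v)} = f(\mathbf{z}; \theta^{(v)}), \tag{13}$$

$$p(\mathbf{x}^{(v)} \mid \mathbf{z}) = \mathrm{Ber}(\mathbf{x}^{(v)} \mid \boldsymbol{\mu}_x^{(v)}), \tag{14}$$

where $f(\mathbf{z}; \theta^{(v)})$ is a deep neural network whose input is $\mathbf{z}$, parameterized by $\theta^{(v)}$, and $\mathrm{Ber}(\boldsymbol{\mu}_x^{(v)})$ is the multivariate Bernoulli distribution parameterized by $\boldsymbol{\mu}_x^{(v)}$. For continuous data,

$$\boldsymbol{\mu}_x^{(v)} = f_1(\mathbf{z}; \theta^{(v)}), \tag{15}$$

$$\log (\boldsymbol{\sigma}_x^{(v)})^{2} = f_2(\mathbf{z}; \theta^{(v)}), \tag{16}$$

$$p(\mathbf{x}^{(v)} \mid \mathbf{z}) = \mathcal{N}\big(\mathbf{x}^{(v)} \mid \boldsymbol{\mu}_x^{(v)}, (\boldsymbol{\sigma}_x^{(v)})^{2} \mathbf{I}\big), \tag{17}$$

where $f_1$ and $f_2$ are deep neural networks with appropriate parameters $\theta^{(v)}$, producing the mean and variance of the Gaussian likelihoods. The generative process is depicted in the right part of Figure 1.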
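The three-step generative process (pick a cluster, sample the shared latent variable, decode once per view) can be sketched as below. The decoder here is a hypothetical one-layer stand-in for the per-view DNN, and all names and numeric values are illustrative assumptions.

```python
import math
import random


def make_toy_decoder(weight_rows):
    """Hypothetical stand-in for a per-view decoder DNN: linear layer + sigmoid."""
    def decode(z):
        return [1.0 / (1.0 + math.exp(-sum(w * zj for w, zj in zip(row, z))))
                for row in weight_rows]
    return decode


def sample_view_data(pis, means, variances, decoders, rng):
    # (a) Pick a cluster from the categorical prior.
    c = rng.choices(range(len(pis)), weights=pis, k=1)[0]
    # (b) Sample the shared latent variable from the chosen Gaussian component.
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[c], variances[c])]
    # (c) Decode the same z once per view to obtain Bernoulli means.
    views = [decode(z) for decode in decoders]
    return c, z, views


rng = random.Random(0)
decoders = [make_toy_decoder([[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]),  # view 1: 3 features
            make_toy_decoder([[2.0, 0.0], [-1.0, 1.0]])]              # view 2: 2 features
c, z, views = sample_view_data([0.3, 0.7],
                               [[0.0, 0.0], [3.0, 3.0]],
                               [[1.0, 1.0], [0.5, 0.5]],
                               decoders, rng)
print(c, [len(v) for v in views])
```

All views are decoded from the same `z`, which is what makes the latent representation shared across views.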

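For the $v$-th view, since $\mathbf{x}^{(v)}$ and $c$ are independent conditioned on $\mathbf{z}$, the joint probability $p(\mathbf{x}^{(v)}, \mathbf{z}, c)$ can be decomposed as

$$p(\mathbf{x}^{(v)}, \mathbf{z}, c) = p(\mathbf{x}^{(v)} \mid \mathbf{z})\, p(\mathbf{z} \mid c)\, p(c). \tag{18}$$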
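Next, by using the reparameterization trick and the Stochastic Gradient Variational Bayes (SGVB) estimator KingmaWelling2014 , the objective function of our method for binary data can be formulated as

$$\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) ={}& \frac{1}{L} \sum_{l=1}^{L} \sum_{v=1}^{m} \sum_{i=1}^{D_v} \Big[ x_i^{(v)} \log \boldsymbol{\mu}_x^{(v,l)}\big|_i + (1 - x_i^{(v)}) \log\big(1 - \boldsymbol{\mu}_x^{(v,l)}\big|_i\big) \Big] \\
& - \frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J} \left( \log \boldsymbol{\sigma}_c^{2}\big|_j + \frac{\tilde{\boldsymbol{\sigma}}^{2}\big|_j}{\boldsymbol{\sigma}_c^{2}\big|_j} + \frac{(\tilde{\boldsymbol{\mu}}\big|_j - \boldsymbol{\mu}_c\big|_j)^{2}}{\boldsymbol{\sigma}_c^{2}\big|_j} \right) \\
& + \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \tilde{\boldsymbol{\sigma}}^{2}\big|_j\big),
\end{aligned} \tag{19}$$

where $\boldsymbol{\mu}_x^{(v,l)}$ is the output of the DNN $f$ for the $l$-th sample, and $L$ denotes the number of Monte Carlo samples in the SGVB estimator, which is usually set to 1. The dimension of $\mathbf{x}^{(v)}$ and $\boldsymbol{\mu}_x^{(v)}$ is $D_v$, while the dimension of $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^{2}$ is $J$. $x_i^{(v)}$ denotes the $i$-th element of $\mathbf{x}^{(v)}$, $\boldsymbol{\mu}_x^{(v,l)}\big|_i$ represents the $l$-th sample in the $i$-th element of $\boldsymbol{\mu}_x^{(v)}$, and $*\big|_j$ means the $j$-th element of $*$. $\gamma_c$ denotes $q(c \mid \mathbf{x})$ for simplicity.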

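For continuous data, the objective function is rewritten as

$$\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) ={}& -\frac{1}{L} \sum_{l=1}^{L} \sum_{v=1}^{m} \sum_{i=1}^{D_v} \left[ \frac{1}{2} \log\big(2\pi\, (\boldsymbol{\sigma}_x^{(v,l)})^{2}\big|_i\big) + \frac{\big(x_i^{(v)} - \boldsymbol{\mu}_x^{(v,l)}\big|_i\big)^{2}}{2\, (\boldsymbol{\sigma}_x^{(v,l)})^{2}\big|_i} \right] \\
& - \frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J} \left( \log \boldsymbol{\sigma}_c^{2}\big|_j + \frac{\tilde{\boldsymbol{\sigma}}^{2}\big|_j}{\boldsymbol{\sigma}_c^{2}\big|_j} + \frac{(\tilde{\boldsymbol{\mu}}\big|_j - \boldsymbol{\mu}_c\big|_j)^{2}}{\boldsymbol{\sigma}_c^{2}\big|_j} \right) \\
& + \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \tilde{\boldsymbol{\sigma}}^{2}\big|_j\big),
\end{aligned} \tag{20}$$

where $\boldsymbol{\mu}_x^{(v)}$ and $\boldsymbol{\sigma}_x^{(v)}$ can be obtained by Eq. (15) and Eq. (16), respectively. Intuitively, the first term of Eq. (20) is used for reconstruction, and the rest is the KL divergence from the Gaussian mixture prior $p(\mathbf{z}, c)$ to the variational posterior $q(\mathbf{z}, c \mid \mathbf{x})$. As such, the model can not only generate the samples well, but also keep the variational inference close to our hypothesis.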
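To illustrate the non-reconstruction part of such a bound, the sketch below evaluates the terms that compare a diagonal-Gaussian variational posterior with a Gaussian mixture prior for a single sample. It follows the single-view VaDE form and is an assumption about how the regulariser could be computed; the function and argument names are hypothetical.

```python
import math


def gmm_regulariser(gamma, pis, mu_c, var_c, mu_tilde, var_tilde):
    """Non-reconstruction terms of a VaDE-style bound for one sample.

    gamma: cluster responsibilities q(c|x); pis: prior cluster probabilities.
    mu_c, var_c: per-cluster means and variances (one list per cluster).
    mu_tilde, var_tilde: mean and variance of the variational posterior q(z|x).
    The returned value is at most zero (a negative KL divergence).
    """
    quad = 0.0
    for g, m_c, v_c in zip(gamma, mu_c, var_c):
        inner = sum(math.log(vc) + vt / vc + (mt - mc) ** 2 / vc
                    for mc, vc, mt, vt in zip(m_c, v_c, mu_tilde, var_tilde))
        quad += g * inner
    # Categorical term; clusters with zero responsibility contribute nothing.
    cat = sum(g * math.log(p / g) for g, p in zip(gamma, pis) if g > 0.0)
    # Entropy-related term of the Gaussian posterior.
    entropy = sum(1.0 + math.log(vt) for vt in var_tilde)
    return -0.5 * quad + cat + 0.5 * entropy


# Posterior equal to a single-component prior: the regulariser vanishes.
same = gmm_regulariser([1.0], [1.0], [[0.3, -1.0]], [[1.0, 1.0]], [0.3, -1.0], [1.0, 1.0])
# Shifting the posterior mean by 1 in one dimension costs 0.5 nats.
shifted = gmm_regulariser([1.0], [1.0], [[0.3, -1.0]], [[1.0, 1.0]], [1.3, -1.0], [1.0, 1.0])
print(same, shifted)
```

The value is zero when the posterior matches the prior and becomes more negative as the two diverge, which is the behaviour expected of a KL-type penalty.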

Note that although our model is also equipped with VAE and GMM, it is distinct from the existing work DuDuHe2017 ; JiangZhengTanTangZhou2017 . Our model focuses on multi-view clustering task by simultaneously learning the generative network, inference network and the weight of each view.

By a direct application of the chain rule and estimators, similar to the work DuDuHe2017 ; JiangZhengTanTangZhou2017 , the gradients of the loss for Eqs. (19) and (20) can be calculated readily. To train the model, the estimated gradients are used in conjunction with standard stochastic gradient based optimization methods, such as SGD or Adam. Overall, using the mixed Gaussian latent variables, the proposed model can be trained by back-propagation with the reparameterization trick. After training, the shared latent representation $\mathbf{z}$ is obtained for each sample $\mathbf{x}_i$. Finally, the cluster assignment is computed by Eq. (10).

4 Experimental Results

4.1 Datasets

To evaluate the performance of the proposed DMVCVAE, we select four real-world datasets including digits, object and facial images. A summary of the dataset statistics is also provided in Table 1.

  • UCI digits consists of features of handwritten digits 0 to 9 extracted from the UCI machine learning repository DuaGraff2017 . It contains 2000 data points with 200 samples for each digit. These digits are represented by six types of features, including pixel averages in $2 \times 3$ windows (PIX) of dimension 240, Fourier coefficients of dimension 76, profile correlations (FAC) of dimension 216, Zernike moments (ZER) of dimension 47, Karhunen-Loeve coefficients (KAR) of dimension 64 and morphological features (MOR) of dimension 6.

  • Caltech 101 is an object recognition dataset LiFergusPerona2004 containing 8677 images of 101 categories. We chose 7 classes of Caltech 101 with 1474 images, i.e., Face, Motorbikes, Dollar-Bill, Garfield, Snoopy, Stop-Sign and Windsor-Chair. There are six different views, including Gabor features of dimension 48, wavelet moments of dimension 40, CENTRIST features of dimension 254, histogram of oriented gradients (HOG) of dimension 1984, GIST features of dimension 512, and local binary patterns (LBP) of dimension 928.

  • ORL contains 10 different images from each of 40 distinct subjects. For some subjects, the images were taken at different times with varying lighting, facial expressions and facial details. It consists of three types of features: intensity of dimension 4096, LBP features of dimension 3304 and Gabor features of dimension 6750.

  • NUS-WIDE-Object (NUS) is a dataset for object recognition which consists of 30000 images in 31 classes. We use 5 features provided by the website, i.e., 65-dimension color histogram (CH), 226-dimension color moments (CM), 145-dimension color correlation (CORR), 74-dimension edge distribution and 129-dimension wavelet texture.

| Datasets | # of samples | # of views | # of classes |
|---|---|---|---|
| UCI digits | 2,000 | 6 | 10 |
| Caltech-7 | 1,474 | 6 | 7 |
| ORL | 400 | 3 | 40 |
| NUS-WIDE-Object | 30,000 | 5 | 31 |

Table 1: Dataset Summary

4.2 Experiment Settings

In our experiments, fully connected networks with the same architecture settings as DEC XieGirshickFarhadi2016 are used. More specifically, the architectures of the encoder $g$ and the decoder $f$ are $D$-500-500-200-10 and 10-2000-500-500-$D$, respectively, where $D$ is the input dimensionality of each view. We use the Adam optimizer KingmaBa2014 to maximize the objective function, and set the learning rate to 0.0001 with a decay of 0.9 every 10 epochs.
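The stated learning-rate schedule can be written as a one-line function. The sketch assumes a staircase decay (the rate is multiplied by 0.9 once every 10 completed epochs) and zero-based epoch indices; a smooth exponential decay would be an equally valid reading of the description.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.9, step=10):
    """Staircase schedule: multiply base_lr by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))
```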

Initializing the parameters of the deep neural network is usually utilized to avoid the problem that the model might get stuck in a undesirable local minima or saddle points. Here, we use layer-wise pre-training method bengio2007greedy for training DNN and . After pre-training, the network is adopted to project input data points into the latent representation , and then we perform -means to to obtain initial centroids of GMM ). Besides, the weights of Eqs. (6) and (7) are initialized to for each view and the parameter of GMM is initialized to .

Three popular metrics are used to evaluate the clustering performance, i.e., clustering accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI), in which the clustering accuracy is defined by

$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = m(c_i)\}}{N},$$

where $l_i$ is the ground-truth label, $c_i$ is the cluster assignment obtained by the model, $N$ is the number of samples, and $m$ ranges over all possible one-to-one mappings between cluster assignments and labels. The mapping can be found efficiently by the Kuhn-Munkres algorithm ChenDonohoSaunders2001 . NMI indicates the correlation between predicted labels and ground-truth labels. ARI ranges from $-1$ to 1 and measures the similarity between two data clusterings; a higher value usually means better clustering performance. As each measure penalizes or favors different properties in the clustering, we report results on all the measures for a comprehensive evaluation.
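A small sketch of the accuracy metric follows. To stay dependency-free it searches all one-to-one mappings by brute force rather than using the Kuhn-Munkres algorithm, so it is only practical for a handful of clusters; the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels, assignments):
    """Best-match accuracy over all one-to-one cluster-to-label mappings.

    Brute force over permutations (factorial cost); for many clusters a
    Kuhn-Munkres (Hungarian) solver should be used instead.
    """
    classes = sorted(set(labels))
    clusters = sorted(set(assignments))
    if len(clusters) > len(classes):
        raise ValueError("this sketch expects no more clusters than classes")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))  # cluster id -> class label
        hits = sum(1 for l, c in zip(labels, assignments) if mapping[c] == l)
        best = max(best, hits)
    return best / len(labels)


# Cluster ids are a pure relabelling of the classes, so accuracy is perfect.
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
# One of six samples ends up in the wrong cluster.
one_wrong = clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
print(perfect, one_wrong)
```

The metric is invariant to how the clusters are numbered, which is why the mapping has to be optimised before counting matches.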

4.3 Baseline Algorithms

We compare the proposed DMVCVAE with the following clustering methods including both shallow models and deep models.

  • Single View: Choosing the single view with the best clustering performance, using the graph Laplacian derived from that view and performing spectral clustering on it.

  • Feature Concatenation (abbreviated to Feature Concat.): Concatenating the features of all views and conducting spectral clustering on it.

  • Kernel Addition: Building an affinity matrix from every feature and taking an average of them, then inputting to a spectral clustering algorithm.

  • MultiNMF LiuWangGaoHan2013 : Multi-view NMF applies NMF to project each view's data to the common latent subspace. This method can be roughly considered as a one-layer version of our proposed method.

  • LT-MSC ZhangFuLiuLiuCao2015 : Low-rank tensor constrained multi-view subspace clustering proposes a multi-view clustering method by considering the subspace representation matrices of different views as a tensor.

  • SCMV-3DT YinGaoXieGuo2018 : Low-rank multi-view clustering in third-order tensor space via t-linear combination, using the t-product based on circular convolution to reconstruct multi-view tensorial data by itself with sparse and low-rank penalties.

  • DCCA andrewAroraBilmesLivescu2013 : Providing flexible nonlinear representations with respect to the correlation objective measured on unseen data.

  • DCCAE WangAroraLivescuBilmes2015 : Combining the DCCA objective and reconstruction errors of the two views.

  • VCCAP wang2016deep : Using a deep generative method to achieve a natural idea that the multiple views can be generated from a small set of shared latent variables.

In our experiments, $k$-means is utilized for the six shallow methods to obtain the final clustering results. For the three deep methods, DCCA, DCCAE and VCCAP, we use spectral clustering to perform the clustering, similar to the work WangAroraLivescuBilmes2015 .

4.4 Performance Evaluation

We first compare our method with six shallow models on the chosen test datasets. The parameters of the compared methods are set according to their authors' suggestions for the best clustering scores. Each method is run for 10 trials and the average score of each performance measure is reported in Table 2. The bold numbers highlight the best results.

As can be seen, all methods except Single View exploit the data of all views, with improved performance over using a single view. In terms of all of these evaluation criteria, our proposed method consistently outperforms the shallow models on the UCI digits and Caltech-7 datasets. In particular, for Caltech-7, our method outperforms the second-best algorithm in terms of ACC and NMI by 17.7% and 25.0%, respectively. For the ORL dataset, LT-MSC and SCMV-3DT achieve the best results in terms of NMI and ARI, respectively. This may be explained by the small size of the ORL dataset, since large-scale datasets often lead to better performance for deep models. The results also verify that our model DMVCVAE significantly benefits from deep learning.

To further verify the performance of our approach among the deep models, we report the comparisons between the deep models in Table 3. Since these three models can only handle two-view data, we tested all the two-view combinations and report the best clustering score. Specifically, FAC and KAR features are chosen for UCI digits, GIST and LBP features for Caltech-7, and LBP and Gabor features for ORL. For a fair comparison, we run the proposed model on the same views. From Table 3, it is observed that our proposed method outperforms the others on all criteria.

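| Methods | UCI-digits ACC | UCI-digits NMI | UCI-digits ARI | Caltech-7 ACC | Caltech-7 NMI | Caltech-7 ARI | ORL ACC | ORL NMI | ORL ARI |
|---|---|---|---|---|---|---|---|---|---|
| Single View | 0.6956 | 0.6424 | 0.7301 | 0.4100 | 0.4119 | 0.2582 | 0.6700 | 0.8477 | 0.5676 |
| Feature Concat. | 0.6973 | 0.6973 | 0.6064 | 0.3800 | 0.3410 | 0.2048 | 0.6700 | 0.8329 | 0.5590 |
| Kernel Addition | 0.7700 | 0.7456 | 0.3700 | 0.3936 | 0.2573 | 0.6570 | 0.6000 | 0.8062 | 0.4797 |
| MultiNMF | 0.7760 | 0.7041 | 0.6031 | 0.3602 | 0.3156 | 0.1965 | 0.6825 | 0.8393 | 0.5736 |
| LT-MSC | 0.8422 | 0.8217 | 0.7584 | 0.5665 | 0.5914 | 0.4182 | 0.7587 | 0.9094 | 0.7093 |
| SCMV-3DT | 0.9300 | 0.8608 | 0.8459 | 0.6246 | 0.6031 | 0.4693 | 0.7947 | 0.9088 | 0.7381 |
| Ours | 0.9570 | 0.9166 | 0.9107 | 0.8014 | 0.8538 | 0.7048 | 0.7975 | 0.9013 | 0.7254 |

Table 2: Clustering performance comparison between the proposed model and the shallow methods.

| Methods | UCI-digits ACC | UCI-digits NMI | UCI-digits ARI | Caltech-7 ACC | Caltech-7 NMI | Caltech-7 ARI | ORL ACC | ORL NMI | ORL ARI |
|---|---|---|---|---|---|---|---|---|---|
| DCCA | 0.8195 | 0.8020 | 0.7424 | 0.8242 | 0.6781 | 0.7131 | 0.6125 | 0.8094 | 0.4699 |
| DCCAE | 0.8205 | 0.8057 | 0.7458 | 0.8462 | 0.7054 | 0.7319 | 0.6425 | 0.8115 | 0.5048 |
| VCCAP | 0.7480 | 0.7320 | 0.6277 | 0.8372 | 0.6301 | 0.7206 | 0.4150 | 0.6440 | 0.2418 |
| Ours | 0.8875 | 0.8076 | 0.7765 | 0.8568 | 0.7386 | 0.7826 | 0.6950 | 0.8356 | 0.5643 |

Table 3: Clustering performance comparison among the deep models.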

4.5 Visualizations

In Figure 2, we visualize the latent space learned by various deep models on the Caltech-7 dataset. t-SNE maaten2008visualizing is applied to reduce the dimensionality to two. It can be observed that the embedding learned by DMVCVAE is better separated than those learned by DCCAE and VCCAP. Figure 3 shows the learned representations of DMVCVAE on the UCI digits dataset. Specifically, we see that, as training progresses, the latent feature clusters become more and more separated, suggesting that the overall architecture encourages informative representations with better clustering performance.

Figure 2: Visualization of the latent subspaces of the Caltech-7 dataset.

Figure 3: Visualization of the latent subspaces of UCI digits learned by DMVCVAE from epoch 10 to 100. Panels: (a) Epoch 10, (b) Epoch 40, (c) Epoch 70, (d) Epoch 100.

4.6 Experiment on large-scale multi-view data

With the explosive growth in the volume of visual data, how to effectively segment large-scale multi-view data has become an interesting but challenging problem LiNieHuangHuang2015 ; ZhangLiuShenShenShao2018 . Therefore, we further test our model on a large-scale dataset, i.e., NUS-WIDE-Object. As the aforementioned compared methods cannot handle large-scale data, we compare with recent work, namely Large-Scale Multi-View Spectral Clustering (LSMVSC) LiNieHuangHuang2015 and Binary Multi-View Clustering (BMVC) ZhangLiuShenShenShao2018 . In this experiment, we replace the ARI measure with PURITY so that the comparison is fair (footnote 3: we cite the results reported in the original papers because the corresponding source code is unavailable; a missing entry means that the original paper did not report the value). Under similar settings, the clustering results are reported in Table 4. As can be seen, our proposed approach achieves better clustering performance than the compared methods, which verifies its capacity for handling large-scale multi-view clustering.

Methods NUS-WIDE-Object
LSMVSC 0.1493 0.2821
BMVC 0.1680 0.1621 0.2872
Ours 0.1909 0.2129 0.3168
Table 4: Clustering performance for large-scale dataset.

5 Conclusions

In this paper, we proposed a novel multi-view clustering algorithm by learning a shared latent representation under the VAE framework. The shared latent embeddings, multi-view weights and deep autoencoders networks are simultaneously learned in a unified framework such that the final clustering assignment is intuitively achieved. Experimental results show that the proposed method can provide better clustering solutions than other state-of-the-art approaches, including the shallow models and deep models.


  • (1) G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.
  • (2) X. Cai, F. Nie, and H. Huang. Multi-view k-means clustering on big data. In IJCAI, pages 2598–2604, 2013.
  • (3) M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  • (4) J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In ICCV, 2017.
  • (5) K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–136, 2009.
  • (6) S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, Jan. 2001.
  • (7) N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
  • (8) J. Deng, W. Dong, R. Socher, L. jia Li, K. Li, and L. Fei-fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • (9) C. Du, C. Du, and H. He. Sharing deep generative representation for perceived image reconstruction from human brain activity. In IJCNN, pages 1049–1056, 2017.
  • (10) D. Dua and C. Graff. UCI machine learning repository, 2017.
  • (11) H. Gao, F. Nie, X. Li, and H. Huang. Multi-view subspace clustering. In ICCV, pages 4238–4246, 2015.
  • (12) G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • (13) P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In NIPS, pages 24–33, 2017.
  • (14) Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, pages 1965–1972, 2017.
  • (15) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, volume abs/1412.6980, 2015.
  • (16) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2014.
  • (17) F.-F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, pages 178–178, 2004.
  • (18) Y. Li, F. Nie, H. Huang, and J. Huang. Large-scale multi-view spectral clustering via bipartite graph. In AAAI, volume 4, pages 2750–2756, 2015.
  • (19) J. Liu, C. Wang, J. Gao, and J. Han. Multi-view clustering via joint nonnegative matrix factorization. In SIAM Data Mining, 2013.
  • (20) L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11):2579–2605, 2008.
  • (21) J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
  • (22) T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
  • (23) X. Peng, S. Xiao, J. Feng, W.-Y. Yau, and Z. Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
  • (24) Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, pages 2352–2360, 2016.
  • (25) N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.
  • (26) S. Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23(7):2031–2038, 2013.
  • (27) F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
  • (28) H. Wang, F. Nie, and H. Huang. Multi-view clustering and feature learning via structured sparsity. In ICML, volume 28, pages 352–360, 2013.
  • (29) W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
  • (30) W. Wang, X. Yan, H. Lee, and K. Livescu. Deep variational canonical correlation analysis. preprint arXiv:1610.03454, 2016.
  • (31) J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, pages 478–487, 2016.
  • (32) C. Xu, Z. Guan, W. Zhao, Y. Niu, Q. Wang, and Z. Wang. Deep multi-view concept learning. In IJCAI, pages 2898-2904, 2018.
  • (33) C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. preprint arXiv:1304.5634, 2013.
  • (34) J. Xu, J. Han, F. Nie, and X. Li. Re-weighted discriminatively embedded k-means for multi-view clustering. IEEE Transactions on Image Processing, 26(6):3016–3027, 2017.
  • (35) B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, pages 3861–3870, 2017.
  • (36) M. Yin, J. Gao, S. Xie, and Y. Guo. Multiview subspace clustering via tensorial t-product representation. IEEE Transactions on Neural Networks and Learning Systems, 30(3):851–864, 2019.
  • (37) C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582-1590, 2015.
  • (38) Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao. Binary multi-view clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2018.2847335, pages 1–1, 2018.
  • (39) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153-160, 2007.