1 Introduction
Image clustering is a fundamental research topic that has been widely studied in computer vision and machine learning CaronBojanowskiJoulinDouze2018 ; ChangWangMengXiangPan2017 . As a class of unsupervised learning methods, clustering has attracted significant attention from various applications. With the advance of information technology, in many real-world scenarios heterogeneous visual features, such as HOG DalalTriggs2005 , SIFT DengDongSocherLiLi2009 and LBP OjalaPietikainenMaenpaa2002 , can be readily acquired and form a new type of data, i.e., multi-view data. Therefore, to efficiently capture the complementary information among different views, multi-view clustering has gained considerable attention in recent years for learning more comprehensive information Sun2013 ; XuTaoXu2013 . In essence, multi-view clustering seeks to partition data points based on multiple representations by assuming that the same cluster structure is shared across all the views GaoNieLiHuang2015 ; WangNieHuang2013 ; YinGaoXieGuo2018 . It is crucial for a learning algorithm to incorporate the heterogeneous view information to enhance its accuracy and robustness.

In general, multi-view clustering can be roughly separated into two classes, i.e., similarity-based and feature-based. The former aims to construct an affinity matrix whose elements define the similarity between each pair of samples. In light of this, multi-view subspace clustering is one of the most famous similarity-based methods, which pursues a latent subspace shared by multiple views, assuming that each view is built from a common subspace ChaudhuriKakadeLivescuSridharan2009 ; GaoNieLiHuang2015 ; YinGaoXieGuo2018 ; ZhangFuLiuLiuCao2015 . However, these methods often suffer from scalability issues due to the super-quadratic running time for computing spectra JiangZhengTanTangZhou2017 . Feature-based methods, in contrast, seek to partition the samples into clusters so as to minimize the within-cluster sum of squared errors, such as multi-view k-means clustering CaiNieHuang2013 ; XuHanNieLi2017 . Clearly, the selection of the feature space is vital, as clustering with the Euclidean distance on raw pixels is completely ineffective.

Inspired by the recent remarkable success of deep learning in feature learning
HintonSalakhutdinov2006 , a surge of multi-view learning methods based on deep neural networks (DNNs) have been proposed NgiamKhoslaKimNamLeeNg2011 ; WangAroraLivescuBilmes2015 ; XuGuanZhaoNiuWangWang2018 . First, Ngiam et al. NgiamKhoslaKimNamLeeNg2011 explored extracting shared representations by training a bimodal deep autoencoder. Next, by extending canonical correlation analysis (CCA), Wang et al. WangAroraLivescuBilmes2015 proposed the deep canonically correlated autoencoders (DCCAE), which introduce an autoencoder regularization term into deep CCA. Unfortunately, the aforementioned methods are only feasible for the two-view case and fail to handle more views. To explicitly summarize the consensus and complementary information in multi-view data, Deep Multi-view Concept Learning (DMCL) XuGuanZhaoNiuWangWang2018 performs non-negative factorization on every view hierarchically.

Though these methods perform well in multi-view clustering, they do not model the generative process of multi-view data and hence cannot be used to generate samples. To this end, benefiting from the success of approximate Bayesian inference, the variational autoencoder (VAE) has become the most popular algorithm in the framework that combines differentiable models with variational inference KingmaWelling2014 ; PuGanHenaoYuanLiStevensCarin2016 . By modeling the data generative procedure with a Gaussian mixture model (GMM) and a neural network, Jiang et al. JiangZhengTanTangZhou2017 proposed a novel unsupervised generative clustering approach within the VAE framework, namely Variational Deep Embedding (VaDE). Although it has shown great advantages in clustering, it cannot be applied directly to multi-view learning. Targeting classification and information retrieval, Srivastava et al. SrivastavaSalakhutdinov2014 presented a deep Boltzmann machine for learning a generative model of multi-view data. However, until recently there was no successful multi-view extension to clustering. The main obstacle is how to efficiently exploit the shared generative latent representation across the views in an unsupervised way.

To tackle this issue, in this paper we propose a novel multi-view clustering method that learns a shared generative latent representation obeying a mixture of Gaussian distributions, namely Deep Multi-View Clustering via Variational Autoencoders (DMVC-VAE). In particular, our motivation is based on the fact that multi-view data share a common latent embedding despite the diversity among the views. Meanwhile, the proposed model benefits from the success of deep generative learning, which can capture the data distribution by neural networks.

In summary, our contributions are as follows.

We propose to learn a shared generative latent representation for multi-view clustering. Specifically, the generative approach assumes that the data of different views share a common conditional distribution of the hidden variables given the observed data, and that the hidden variables are sampled independently from a mixture of Gaussian distributions.

To better exploit the information from multiple views, we introduce a set of non-negative combination weights, which are learned jointly with the deep autoencoder networks in a unified framework.

We conduct a number of numerical experiments showing that the proposed method outperforms state-of-the-art clustering models on several well-known datasets, including large-scale multi-view data.
2 Related Works
In the literature, there are a number of studies on clustering using deep neural networks JiZhangLiSalzmannReid2017 ; PengXiaoFengYauYi2016 ; TianGaoCuiChenLiu2014 ; XieGirshickFarhadi2016 ; YangFuSidiropoulosHong2017 . These algorithms can be roughly divided into two categories, i.e., separate and joint deep clustering approaches. The earlier deep clustering algorithms JiZhangLiSalzmannReid2017 ; PengXiaoFengYauYi2016 ; TianGaoCuiChenLiu2014 often work in two stages: first extracting deep features, then performing traditional clustering, such as $k$-means or spectral clustering, for the final segmentation. Yet the separated process does not help learn clustering-favorable features. To this end, joint feature learning and clustering methods XieGirshickFarhadi2016 ; YangFuSidiropoulosHong2017 have been proposed based on deep neural networks. In XieGirshickFarhadi2016 , Xie et al. presented Deep Embedded Clustering (DEC) to learn a mapping from the data space to a lower-dimensional feature space, where a Kullback-Leibler (KL) divergence based clustering objective is iteratively optimized. In YangFuSidiropoulosHong2017 , Yang et al. proposed a framework of dimensionality reduction joint with $k$-means clustering, where deep neural networks are applied for the dimensionality reduction.

However, due to the limitation of the similarity measures in the aforementioned methods, the hidden, hierarchical dependencies in the latent space of the data often cannot be captured effectively. Instead, deep generative models have been built to better handle the rich latent structures within data JiangZhengTanTangZhou2017 . In essence, deep generative models are utilized to estimate the density of observed data under some assumptions about its latent structure, i.e., the hidden causes. Recently, Jiang et al. JiangZhengTanTangZhou2017 proposed a novel clustering framework, namely Variational Deep Embedding (VaDE), by integrating a VAE and a GMM for clustering tasks. Unfortunately, as this method mainly focuses on single-view data, the complementary information from multiple heterogeneous views cannot be efficiently exploited. In other words, the existing generative model cannot learn shared latent representations for modeling the generative process of each view's data.

3 The Proposed Method
3.1 The Architecture
Given a collection of multi-view data $\{X^{(v)}\}_{v=1}^{V}$ with $V$ views in total, it is reasonable to assume that the $i$-th sample $x_i^{(v)}$ of the $v$-th view is generated by some unknown process, for example, from an unobserved continuous variable $z_i$. The variable $z_i$ is a common hidden representation shared by all views. Furthermore, in a typical setting, each sample $x^{(v)}$ of a view is assumed to be generated through a two-stage process: first the hidden variable $z$ is generated according to some prior distribution $p(z)$, and then the observed sample is yielded by some conditional distribution $p_\theta(x^{(v)}|z)$. Usually, since the hidden variable $z$ and the parameters $\theta$ are unknown, the prior $p(z)$ and the likelihood $p_\theta(x^{(v)}|z)$ are hidden.

For clustering tasks, it is desired that the observed sample is generated jointly according to the latent variable $z$ and an assumed clustering variable $c$. However, most existing variational autoencoders are not suitable for clustering tasks by design, to say nothing of multi-view clustering. Therefore, we are motivated to present a novel multi-view clustering method under the VAE framework by incorporating a clustering-promoting objective. Ideally, we would assume that the sample generative process is given by the new likelihood $p_\theta(x^{(v)}|z,c)$, conditioned on both the hidden variable $z$ and the cluster label $c$. For simplicity, however, we break the direct dependence of $x^{(v)}$ on $c$, conditioning instead on the latent variable $z$ drawn from a Gaussian mixture governed by $c$. The proposed framework is shown in the right panel of Figure 1. In this architecture, multi-view samples are generated by using a DNN to decode the common hidden variable $z$, which is sampled from the assumed GMM. To efficiently infer the posterior of both $z$ and $c$ from the information of multiple views, a novel weighted target distribution is introduced, based on the individual variational distribution of $z$ from each view. To optimize the evidence lower bound (ELBO), similar to VAE, we use DNNs to encode the observed data and incorporate the distributions of multiple embeddings to infer the shared latent representation $z$.
3.2 The Objective
For the sake of simplicity, we express a generic multi-view variable as $x = \{x^{(v)}\}_{v=1}^{V}$, where $x^{(v)}$ is the general variable of the $v$-th view. Consider the latent variable $z$ and the discrete latent variable $c$ ($c \in \{1, \dots, K\}$). Without loss of generality, in light of the clustering task under the VAE framework, we aim to compute the probabilistic cluster assignment of $c$, denoted by $q(c|x)$. By the Bayes theorem, the corresponding posterior of $z$ and $c$ given $x$ is computed as follows:

$$ p(z, c|x) = \frac{p(x|z)\, p(z|c)\, p(c)}{p(x)} = \frac{\prod_{v=1}^{V} p(x^{(v)}|z)\, p(z|c)\, p(c)}{p(x)}, \quad (1) $$

where we assume the views are conditionally independent, i.e., $p(x|z) = \prod_{v=1}^{V} p(x^{(v)}|z)$ (hereafter the model parameter $\theta$ is omitted).
As the integral is intractable, it is hard to calculate the posterior directly. Inspired by the principle of VAE KingmaWelling2014 , we turn to an appropriate variational posterior $q(z,c|x)$ to approximate the true posterior by minimizing the KL divergence between them:

$$ \mathrm{KL}\big(q(z,c|x)\,\|\,p(z,c|x)\big) = \log p(x) - \mathcal{L}_{\mathrm{ELBO}}(x), \quad (2) $$

where

$$ \mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q(z,c|x)}\left[\log \frac{p(x,z,c)}{q(z,c|x)}\right] \quad (3) $$

is called the evidence lower bound (ELBO) and $\log p(x)$ is the log-likelihood. Since $\log p(x)$ is fixed with respect to $q$, minimizing the KL divergence is equivalent to maximizing the ELBO. Often $q(z,c|x)$ is assumed to be a mean-field distribution and can be readily factorized by

$$ q(z,c|x) = q(z|x)\, q(c|x). \quad (4) $$
Due to the power of DNNs to approximate nonlinear functions, we introduce a neural network $f(x^{(v)}; \phi_v)$ to infer $q(z|x^{(v)})$, with parameters $\phi_v$. That is, a DNN is utilized to encode the observed view data into the latent representation. Meanwhile, to incorporate multi-view information, we propose a combined variational approximation $q(z|x)$. Considering the importance of different views, we introduce a weight vector $\lambda = [\lambda_1, \dots, \lambda_V]$ ($\lambda_v \ge 0$, $\sum_{v=1}^{V} \lambda_v = 1$) to fuse the distributions of the hidden variables, so that the consistency and complementarity of multi-view data can be better exploited. In particular, we assume the variational approximation to the posterior of the latent representation to be a Gaussian that integrates information from multiple views as follows:

$$ q(z|x^{(v)}) = \mathcal{N}\big(z;\, \tilde{\mu}^{(v)},\, (\tilde{\sigma}^{(v)})^2 I\big), \quad (5) $$
$$ \tilde{\mu} = \sum_{v=1}^{V} \lambda_v\, \tilde{\mu}^{(v)}, \quad (6) $$
$$ \tilde{\sigma}^2 = \sum_{v=1}^{V} \lambda_v\, (\tilde{\sigma}^{(v)})^2, \quad (7) $$
$$ q(z|x) = \mathcal{N}\big(z;\, \tilde{\mu},\, \tilde{\sigma}^2 I\big), \quad (8) $$

where $I$ is an identity matrix of suitable dimension. In the standard VAE, each pair of $\tilde{\mu}^{(v)}$ and $\tilde{\sigma}^{(v)}$ defines a Gaussian for the latent variable in the $v$-th view. We have fused this information in Eqs. (5)-(8). Furthermore, the ELBO can be rewritten as
$$ \mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] - \mathbb{E}_{q(z|x)}\Big[\mathrm{KL}\big(q(c|x)\,\|\,p(c|z)\big)\Big]. \quad (9) $$

Hence, we set $q(c|x) = p(c|z)$ to maximize $\mathcal{L}_{\mathrm{ELBO}}$, since the first term has no relationship with $c$ and the second term is non-negative. As a result, we use the following equation to compute $q(c|x)$, i.e.,

$$ q(c|x) = p(c|z) = \frac{p(c)\, p(z|c)}{\sum_{c'=1}^{K} p(c')\, p(z|c')}. \quad (10) $$
This means we are proposing a mixture model for the latent prior $p(z)$. In particular, we implement the latent prior as a Gaussian mixture as follows:

$$ p(c) = \mathrm{Cat}(c;\, \pi), \quad (11) $$
$$ p(z|c) = \mathcal{N}\big(z;\, \mu_c,\, \sigma_c^2 I\big), \quad (12) $$

where $\mathrm{Cat}(\pi)$ is the categorical distribution with parameter $\pi$ such that $\pi_c$ ($\sum_{c=1}^{K} \pi_c = 1$) is the prior probability for cluster $c$, and $\mu_c$ and $\sigma_c^2$ ($c = 1, \dots, K$) are the mean and the variance of the $c$-th Gaussian component, respectively.

Once the latent variable $z$ is produced according to the GMM prior, the multi-view data generative process is defined as follows. For binary observed data,

$$ \mu_x^{(v)} = g\big(z;\, \theta_v\big), \quad (13) $$
$$ p(x^{(v)}|z) = \mathrm{Ber}\big(x^{(v)};\, \mu_x^{(v)}\big), \quad (14) $$

where $g(\cdot;\theta_v)$ is a deep neural network parameterized by $\theta_v$ and $\mathrm{Ber}(\mu_x^{(v)})$ is a multivariate Bernoulli distribution parameterized by $\mu_x^{(v)}$. For continuous data,

$$ \mu_x^{(v)} = g_\mu\big(z;\, \theta_v\big), \quad (15) $$
$$ \sigma_x^{(v)2} = g_\sigma\big(z;\, \theta_v\big), \quad (16) $$
$$ p(x^{(v)}|z) = \mathcal{N}\big(x^{(v)};\, \mu_x^{(v)},\, \sigma_x^{(v)2} I\big), \quad (17) $$

where $g_\mu$ and $g_\sigma$ are deep neural networks with appropriate parameters $\theta_v$, producing the mean and variance of the Gaussian likelihoods. The generative process is depicted in the right part of Figure 1.
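Putting the inference pieces together, the fusion of Eqs. (6)-(7) and the cluster-responsibility computation of Eq. (10) under the GMM prior of Eqs. (11)-(12) amount to only a few lines (a hedged numpy sketch: the toy per-view posteriors and mixture parameters below are ours; in the model they would come from the encoder networks and the learned prior).

```python
import numpy as np

def fuse_views(mus, sigma2s, lam):
    """Fuse per-view Gaussian posteriors with non-negative weights summing to 1."""
    lam = np.asarray(lam)[:, None]          # shape (V, 1) for broadcasting
    mu = (lam * mus).sum(axis=0)            # Eq. (6): weighted mean
    sigma2 = (lam * sigma2s).sum(axis=0)    # Eq. (7): weighted variance
    return mu, sigma2

def responsibilities(z, pi, mu_c, sigma2_c):
    """Eq. (10): q(c|x) = p(c) p(z|c) / sum_c' p(c') p(z|c'), diagonal Gaussians."""
    log_p = (np.log(pi)
             - 0.5 * np.sum(np.log(2 * np.pi * sigma2_c), axis=1)
             - 0.5 * np.sum((z - mu_c) ** 2 / sigma2_c, axis=1))
    log_p -= log_p.max()                    # log-sum-exp trick for stability
    gamma = np.exp(log_p)
    return gamma / gamma.sum()

# Toy example: V = 2 views, latent dimension 3, K = 2 clusters.
mus = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # per-view posterior means
sigma2s = np.ones((2, 3))                            # per-view posterior variances
mu, sigma2 = fuse_views(mus, sigma2s, lam=[0.5, 0.5])

pi = np.array([0.5, 0.5])
mu_c = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])  # well-separated components
gamma = responsibilities(mu, pi, mu_c, np.ones((2, 3)))
```

With the fused mean halfway between the view posteriors and one mixture component far away, nearly all responsibility mass lands on the nearby component, as Eq. (10) dictates.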
For the $v$-th view, since $x^{(v)}$ and $c$ are independent conditioned on $z$, the joint probability can be decomposed as

$$ p(x^{(v)}, z, c) = p(x^{(v)}|z)\, p(z|c)\, p(c). \quad (18) $$
Next, by using the reparameterization trick and the Stochastic Gradient Variational Bayes (SGVB) estimator KingmaWelling2014 , the objective function of our method for binary data can be formulated as

$$ \mathcal{L}(x) = \sum_{v=1}^{V} \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{D_v} \Big[ x_i^{(v)} \log \mu_{x,i}^{(v,l)} + \big(1 - x_i^{(v)}\big) \log \big(1 - \mu_{x,i}^{(v,l)}\big) \Big] - \frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{d} \Big[ \log \sigma_{c,j}^2 + \frac{\tilde{\sigma}_j^2}{\sigma_{c,j}^2} + \frac{(\tilde{\mu}_j - \mu_{c,j})^2}{\sigma_{c,j}^2} \Big] + \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2} \sum_{j=1}^{d} \big( 1 + \log \tilde{\sigma}_j^2 \big), \quad (19) $$

where $\mu_x^{(v,l)}$ is the output of the DNN $g$ for the $l$-th Monte Carlo sample, and $L$ denotes the number of Monte Carlo samples in the SGVB estimator and is usually set to 1. The dimension of $\tilde{\mu}$ and $\tilde{\sigma}^2$ is $d$, while the dimension of $x^{(v)}$ and $\mu_x^{(v)}$ is $D_v$. Here $x_i^{(v)}$ denotes the $i$-th element of $x^{(v)}$, $\mu_{x,i}^{(v,l)}$ represents the $l$-th sample of the $i$-th element of $\mu_x^{(v)}$, and $\mu_{c,j}$ denotes the $j$-th element of $\mu_c$. $\gamma_c$ denotes $q(c|x)$ for simplicity.
For continuous data, the objective function is rewritten as

$$ \mathcal{L}(x) = \sum_{v=1}^{V} \frac{1}{L} \sum_{l=1}^{L} \log \mathcal{N}\big(x^{(v)};\, \mu_x^{(v,l)},\, \sigma_x^{(v,l)2} I\big) - \mathrm{KL}\big(q(z,c|x)\,\|\,p(z,c)\big), \quad (20) $$

where $\mu_x^{(v,l)}$ and $\sigma_x^{(v,l)2}$ can be obtained by Eq. (15) and Eq. (16), respectively. Intuitively, the first term of Eq. (20) is used for reconstruction, and the rest is the KL divergence from the Gaussian mixture prior $p(z,c)$ to the variational posterior $q(z,c|x)$. As such, the model can not only generate the samples well, but also keep the variational inference close to our hypothesis.
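The reparameterization trick used in the objectives above isolates the randomness in an auxiliary noise variable, so the sampling step stays differentiable with respect to the posterior parameters. A minimal numpy sketch (dimensions and toy values are ours):

```python
import numpy as np

def reparameterize(mu, sigma2, rng):
    """Draw z ~ N(mu, sigma2 I) as z = mu + sigma * eps with eps ~ N(0, I).

    Because z is a deterministic function of (mu, sigma2) plus independent
    noise, gradients can flow through this sampling step during training.
    """
    eps = rng.normal(size=mu.shape)
    return mu + np.sqrt(sigma2) * eps

rng = np.random.default_rng(0)
mu_tilde = np.zeros(10)              # fused posterior mean (toy values)
sigma2_tilde = np.full(10, 0.01)     # fused posterior variance (toy values)
z = reparameterize(mu_tilde, sigma2_tilde, rng)  # one SGVB sample (L = 1)
```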
Note that although our model is also equipped with a VAE and a GMM, it is distinct from existing work DuDuHe2017 ; JiangZhengTanTangZhou2017 : our model focuses on the multi-view clustering task by simultaneously learning the generative network, the inference network and the weight of each view.
By a direct application of the chain rule and estimators, similar to the work DuDuHe2017 ; JiangZhengTanTangZhou2017 , the gradients of the losses in Eqs. (19) and (20) are readily calculated. To train the model, the estimated gradients are applied in conjunction with standard stochastic-gradient-based optimization methods, such as SGD or Adam. Overall, using the mixed Gaussian latent variables, the proposed model can be trained by backpropagation with the reparameterization trick. After training, the shared latent representation $z$ is obtained for each sample $x$. Finally, the cluster assignment is computed by Eq. (10).

4 Experimental Results
4.1 Datasets
To evaluate the performance of the proposed DMVC-VAE, we select four real-world datasets including digit, object and facial images. A summary of the dataset statistics is provided in Table 1.

UCI digits (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) consists of features of handwritten digits 0 to 9 extracted from the UCI machine learning repository DuaGraff2017 . It contains 2000 data points with 200 samples for each digit. These digits are represented by six types of features, including pixel averages in $2 \times 3$ windows (PIX) of dimension 240, Fourier coefficients of dimension 76, profile correlations (FAC) of dimension 216, Zernike moments (ZER) of dimension 47, Karhunen-Loève coefficients (KAR) of dimension 64 and morphological features (MOR) of dimension 6.

Caltech 101 is an object recognition dataset LiFergusPerona2004 containing 8677 images in 101 categories. We chose 7 classes of Caltech 101 with 1474 images in total, i.e., Face, Motorbikes, Dollar-Bill, Garfield, Snoopy, Stop-Sign and Windsor-Chair. There are six different views, including Gabor features of dimension 48, wavelet moments of dimension 40, CENTRIST features of dimension 254, histogram of oriented gradients (HOG) features of dimension 1984, GIST features of dimension 512, and local binary patterns (LBP) of dimension 928.

ORL contains 10 different images from each of 40 distinct subjects. For some subjects, the images were taken at different times with varying lighting, facial expressions and facial details. It consists of three types of features: intensity of dimension 4096, LBP features of dimension 3304 and Gabor features of dimension 6750.

NUS-WIDE-Object (NUS) is a dataset for object recognition consisting of 30000 images in 31 classes. We use the 5 features provided by the website, i.e., a 65-dimensional color histogram (CH), 226-dimensional color moments (CM), 145-dimensional color correlation (CORR), 74-dimensional edge distribution and 129-dimensional wavelet texture.
Datasets  # of samples  # of views  # of classes 
UCI digits  2,000  6  10 
Caltech7  1,474  6  7 
ORL  400  3  40 
NUSWIDEObject  30,000  5  31 
4.2 Experiment Settings
In our experiments, fully connected networks and the same architecture settings as DEC XieGirshickFarhadi2016 are used. More specifically, the architectures of the encoder $f$ and the decoder $g$ are $D$-500-500-2000-10 and 10-2000-500-500-$D$, respectively, where $D$ is the input dimensionality of each view. We use the Adam optimizer KingmaBa2014 to maximize the objective function, and set the learning rate to 0.0001 with a decay of 0.9 for every 10 epochs.

Initializing the parameters of a deep neural network is usually helpful to avoid the model getting stuck in an undesirable local minimum or saddle point. Here, we use the layer-wise pretraining method bengio2007greedy to train the DNNs $f$ and $g$. After pretraining, the network is adopted to project the input data points into the latent representation $z$, and then we perform $k$-means on $z$ to obtain the initial centroids of the GMM ($\mu_c$, $\sigma_c^2$). Besides, the weights $\lambda_v$ of Eqs. (6) and (7) are initialized uniformly for each view, and the parameter $\pi$ of the GMM is initialized uniformly as well.
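The initialization step described above can be sketched as follows (numpy only; the small Lloyd-style k-means routine is a stand-in for any standard implementation, and the toy two-blob embedding is our own construction):

```python
import numpy as np

def simple_kmeans(z, K, iters=20, seed=0):
    """Plain Lloyd's k-means, standing in for any off-the-shelf routine."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), K, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([z[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(K)])
    return centers, labels

def init_gmm_from_kmeans(z, K, seed=0):
    """Initialize the GMM from k-means on pretrained embeddings z: uniform
    mixing weights, centroids as means, within-cluster variances (with a
    small floor for stability) as component variances."""
    centers, labels = simple_kmeans(z, K, seed=seed)
    pi = np.full(K, 1.0 / K)
    sigma2_c = np.stack([z[labels == c].var(axis=0) + 1e-6 for c in range(K)])
    return pi, centers, sigma2_c

# Toy pretrained embeddings: two well-separated blobs in a 2-D latent space.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
pi, mu_c, sigma2_c = init_gmm_from_kmeans(z, K=2)
```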
Three popular metrics are used to evaluate the clustering performance, i.e., clustering accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI), in which the clustering accuracy is defined by

$$ \mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{ y_i = m(\hat{y}_i) \}}{n}, $$

where $y_i$ is the ground-truth label, $\hat{y}_i$ is the cluster assignment obtained by the model, and $m$ ranges over all possible one-to-one mappings between cluster assignments and labels. The best mapping can be efficiently found by the Kuhn-Munkres algorithm ChenDonohoSaunders2001 . NMI indicates the correlation between predicted labels and ground-truth labels. ARI ranges from $-1$ to 1 and measures the similarity between two data clusterings; a higher value usually means better clustering performance. As each measure penalizes or favors different properties in the clustering, we report results on all the measures for a comprehensive evaluation.
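For reference, the ACC computation with its optimal one-to-one mapping can be written with scipy's Hungarian-method solver (a sketch assuming 0-based integer labels; the function and variable names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one cluster-to-label mapping (Kuhn-Munkres)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    K = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                        # co-occurrence of cluster p, label t
    rows, cols = linear_sum_assignment(-count)  # negate to maximize total matches
    return count[rows, cols].sum() / len(y_true)

# A permuted but perfect clustering still scores 1.0 under the optimal mapping.
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
```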
4.3 Baseline Algorithms
We compare the proposed DMVCVAE with the following clustering methods including both shallow models and deep models.

Single View: choosing the single view with the best clustering performance, building the graph Laplacian from it, and performing spectral clustering.

Feature Concatenation (abbreviated to Feature Concat.): concatenating the features of all views and conducting spectral clustering on the concatenated representation.

Kernel Addition: building an affinity matrix from every feature, taking the average of them, and then feeding the averaged affinity into a spectral clustering algorithm.

MultiNMF LiuWangGaoHan2013 : multi-view NMF applies NMF to project each view's data onto a common latent subspace. This method can be roughly considered a one-layer version of our proposed method.

LTMSC ZhangFuLiuLiuCao2015 : low-rank tensor constrained multi-view subspace clustering treats the subspace representation matrices of the different views as a tensor.

SCMV3DT YinGaoXieGuo2018 : low-rank multi-view clustering in third-order tensor space via t-linear combination, which uses the t-product based on circular convolution to reconstruct the multi-view tensorial data by itself with sparse and low-rank penalties.

DCCA andrewAroraBilmesLivescu2013 : learning flexible nonlinear representations of two views with respect to the canonical correlation objective measured on unseen data.

DCCAE WangAroraLivescuBilmes2015 : Combining the DCCA objective and reconstruction errors of the two views.

VCCAP wang2016deep : using a deep generative method to realize the natural idea that the multiple views are generated from a small set of shared latent variables.
In our experiments, $k$-means is utilized for the six shallow methods to obtain the final clustering results. For the three deep methods, DCCA, DCCAE and VCCAP, we use spectral clustering to perform the clustering, similar to the work WangAroraLivescuBilmes2015 .
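As a concrete illustration of the Kernel Addition baseline above, per-view affinities can be averaged as follows (a sketch under our own choice of RBF affinities and bandwidth; the averaged matrix would then be fed to any spectral clustering routine):

```python
import numpy as np

def rbf_affinity(X, gamma=1.0):
    """Pairwise RBF affinity for one view: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_addition(views, gamma=1.0):
    """Average the per-view affinity matrices (the Kernel Addition baseline)."""
    return np.mean([rbf_affinity(X, gamma) for X in views], axis=0)

# Toy two-view data: 6 samples with 4-D and 3-D features respectively.
rng = np.random.default_rng(0)
views = [rng.normal(size=(6, 4)), rng.normal(size=(6, 3))]
W = kernel_addition(views)   # symmetric affinity, ready for spectral clustering
```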
4.4 Performance Evaluation
We first compare our method with the six shallow models on the chosen test datasets. The parameter settings of the compared methods follow their authors' suggestions for their best clustering scores. The clustering performance of each method is obtained by running 10 trials and reporting the average score of the performance measures, as shown in Table 2. The bold numbers highlight the best results.
As can be seen, except for Single View, the other methods exploit all views of the data and achieve improved performance over using a single view. In terms of all of these evaluation criteria, our proposed method consistently outperforms the shallow models on the UCI digits and Caltech7 datasets. In particular, on Caltech7, our method outperforms the second best algorithm in terms of ACC and NMI by 17.7% and 25.0%, respectively. For the ORL dataset, LTMSC and SCMV3DT achieve the best results in terms of NMI and ARI, respectively. This may be explained by the small size of the ORL dataset, since large-scale datasets often lead to better performance for deep models. The results also verify that our model DMVC-VAE significantly benefits from deep learning.
To further verify the performance of our approach among the deep models, we report the comparisons between the deep models in Table 3. Since these three models can only handle two-view data, we tested all two-view combinations and report the best clustering score. Specifically, the FAC and KAR features are chosen for UCI digits, the GIST and LBP features for Caltech7, and the LBP and Gabor features for ORL. For a fair comparison, we run the proposed model on the same views. From Table 3, it is observed that our proposed method significantly outperforms the others on all criteria.
Methods  UCIdigits  Caltech7  ORL  
ACC  NMI  ARI  ACC  NMI  ARI  ACC  NMI  ARI  
Single View  0.6956  0.6424  0.7301  0.4100  0.4119  0.2582  0.6700  0.8477  0.5676 
Feature Concat.  0.6973  0.6973  0.6064  0.3800  0.3410  0.2048  0.6700  0.8329  0.5590 
Kernel Addition  0.7700  0.7456  0.3700  0.3936  0.2573  0.6570  0.6000  0.8062  0.4797 
MultiNMF  0.7760  0.7041  0.6031  0.3602  0.3156  0.1965  0.6825  0.8393  0.5736 
LTMSC  0.8422  0.8217  0.7584  0.5665  0.5914  0.4182  0.7587  0.9094  0.7093 
SCMV3DT  0.9300  0.8608  0.8459  0.6246  0.6031  0.4693  0.7947  0.9088  0.7381 
Ours  0.9570  0.9166  0.9107  0.8014  0.8538  0.7048  0.7975  0.9013  0.7254 
Methods  UCIdigits  Caltech7  ORL  
ACC  NMI  ARI  ACC  NMI  ARI  ACC  NMI  ARI  
DCCA  0.8195  0.8020  0.7424  0.8242  0.6781  0.7131  0.6125  0.8094  0.4699 
DCCAE  0.8205  0.8057  0.7458  0.8462  0.7054  0.7319  0.6425  0.8115  0.5048 
VCCAP  0.7480  0.7320  0.6277  0.8372  0.6301  0.7206  0.4150  0.6440  0.2418 
Ours  0.8875  0.8076  0.7765  0.8568  0.7386  0.7826  0.6950  0.8356  0.5643 
4.5 Visualizations
In Figure 2, we visualize the latent space on the Caltech7 dataset for various deep models. t-SNE maaten2008visualizing is applied to reduce the dimensionality to a 2-dimensional space. It can be observed that the embedding learned by DMVC-VAE is better separated than those of DCCAE and VCCAP. Figure 3 shows the learned representations of DMVC-VAE on the UCI digits dataset. Specifically, we see that, as training progresses, the latent feature clusters become more and more separated, suggesting that the overall architecture encourages informative representations with better clustering performance.
4.6 Experiment on largescale multiview data
With the unprecedentedly explosive growth in the volume of visual data, how to effectively segment large-scale multi-view data becomes an interesting but challenging problem LiNieHuangHuang2015 ; ZhangLiuShenShenShao2018 . Therefore, we further test our model on the large-scale dataset NUS-WIDE-Object. As the aforementioned compared methods cannot handle large-scale data, we compare with recent work, namely Large-Scale Multi-View Spectral Clustering (LSMVSC) LiNieHuangHuang2015 and Binary Multi-View Clustering (BMVC) ZhangLiuShenShenShao2018 . In this experiment, we replace the ARI measure with PURITY for a fair comparison (we cite the results reported in the original papers due to the lack of the corresponding source codes; '–' means no result is reported in the original paper). With similar settings, the clustering results are reported in Table 4. As can be seen, our proposed approach achieves better clustering performance than the compared methods, verifying its strong capacity for handling large-scale multi-view clustering.
Methods  NUSWIDEObject  
ACC  NMI  PURITY  
LSMVSC  –  0.1493  0.2821 
BMVC  0.1680  0.1621  0.2872 
Ours  0.1909  0.2129  0.3168 
5 Conclusions
In this paper, we proposed a novel multi-view clustering algorithm that learns a shared latent representation under the VAE framework. The shared latent embeddings, multi-view weights and deep autoencoder networks are learned simultaneously in a unified framework such that the final clustering assignment is achieved directly. Experimental results show that the proposed method provides better clustering solutions than other state-of-the-art approaches, including both shallow and deep models.
References
 (1) G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.

 (2) X. Cai, F. Nie, and H. Huang. Multi-view k-means clustering on big data. In IJCAI, pages 2598–2604, 2013.
 (3) M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
 (4) J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Deep adaptive image clustering. In ICCV, 2017.
 (5) K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multiview clustering via canonical correlation analysis. In ICML, pages 129–136, 2009.
 (6) S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, Jan. 2001.
 (7) N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
 (8) J. Deng, W. Dong, R. Socher, L. jia Li, K. Li, and L. Feifei. Imagenet: A largescale hierarchical image database. In CVPR, 2009.
 (9) C. Du, C. Du, and H. He. Sharing deep generative representation for perceived image reconstruction from human brain activity. In IJCNN, pages 1049–1056, 2017.
 (10) D. Dua and C. Graff. UCI machine learning repository, 2017.
 (11) H. Gao, F. Nie, X. Li, and H. Huang. Multiview subspace clustering. In ICCV, pages 4238–4246, 2015.
 (12) G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 (13) P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In NIPS, pages 24–33, 2017.
 (14) Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, pages 1965–1972, 2017.
 (15) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 (16) D. P. Kingma and M. Welling. Autoencoding variational Bayes. CoRR, abs/1312.6114, 2014.
 (17) F.F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, pages 178–178, 2004.
 (18) Y. Li, F. Nie, H. Huang, and J. Huang. Largescale multiview spectral clustering via bipartite graph. In AAAI, volume 4, pages 2750–2756, 2015.
 (19) J. Liu, C. Wang, J. Gao, and J. Han. Multiview clustering via joint nonnegative matrix factorization. In SIAM Data Mining, 2013.
 (20) L. van der Maaten and G. Hinton. Visualizing data using tSNE. Journal of Machine Learning Research, 9(11):2579–2605, 2008.
 (21) J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011.
 (22) T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution grayscale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
 (23) X. Peng, S. Xiao, J. Feng, W.Y. Yau, and Z. Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.
 (24) Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, pages 2352–2360, 2016.
 (25) N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.
 (26) S. Sun. A survey of multiview machine learning. Neural Computing and Applications, 23(7):2031–2038, 2013.
 (27) F. Tian, B. Gao, Q. Cui, E. Chen, and T.Y. Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.
 (28) H. Wang, F. Nie, and H. Huang. Multiview clustering and feature learning via structured sparsity. In ICML, volume 28, pages 352–360, 2013.
 (29) W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
 (30) W. Wang, X. Yan, H. Lee, and K. Livescu. Deep variational canonical correlation analysis. preprint arXiv:1610.03454, 2016.

 (31) J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, pages 478–487, 2016.
 (32) C. Xu, Z. Guan, W. Zhao, Y. Niu, Q. Wang, and Z. Wang. Deep multi-view concept learning. In IJCAI, pages 2898–2904, 2018.
 (33) C. Xu, D. Tao, and C. Xu. A survey on multiview learning. preprint arXiv:1304.5634, 2013.
 (34) J. Xu, J. Han, F. Nie, and X. Li. Re-weighted discriminatively embedded k-means for multi-view clustering. IEEE Transactions on Image Processing, 26(6):3016–3027, 2017.
 (35) B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards kmeansfriendly spaces: Simultaneous deep learning and clustering. In ICML, pages 3861–3870, 2017.
 (36) M. Yin, J. Gao, S. Xie, and Y. Guo. Multiview subspace clustering via tensorial tproduct representation. IEEE Transactions on Neural Networks and Learning Systems, 30(3):851–864, 2019.
 (37) C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582–1590, 2015.
 (38) Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao. Binary multiview clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2018.2847335, pages 1–1, 2018.
 (39) Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153–160, 2007.