1 Introduction
Multi-view learning seeks to represent data collected from different sources (i.e., multi-view data) and fuse them in an unsupervised or semi-supervised manner. This learning strategy helps fully leverage the information in different views, which is beneficial for many real-world learning tasks, e.g., predicting diseases based on multiple clinical testing records yuan2018multi ; zhang2018multi and embedding words semantically across different languages guo2012cross . Especially for predictive tasks with few labeled data (and possibly no labels in some views), multi-view learning methods impose useful regularization on target models, and accordingly help mitigate overfitting. However, traditional multi-view learning methods are built on two questionable assumptions, which may limit their practical application.
Firstly, most existing multi-view learning methods zhao2017multi ; li2018survey assume that their training data in different views are well-aligned. This requirement is inappropriate in many settings. For example, real-world multi-view learning often requires us to collect multi-view data from different organizations, e.g., predicting credit level based on account balances from different banks, diagnosing diseases based on clinical and genetic reports from different hospitals, etc. For security and privacy, the multi-view data from different organizations are anonymized and shuffled before release. Moreover, it is likely that the views of the data collected by different organizations are generated by different groups of individuals. Therefore, real-world multi-view data are often independent, and thus lack a clear correspondence. Secondly, the multi-view learning methods based on co-regularization strategies andrew2013deep ; quang2013unifying ; ding2014low ; wang2015deep ; benton2019deep assume that the latent representations of the data in different views obey the same distribution. In practice, however, the information in a view can be redundant for some views and complementary for the others. For such multi-view data, the views have a clustering structure, and thus obey different latent distributions. Enforcing a single distribution across all the views may cause serious over-regularization problems.
In this work, we propose a new multi-view learning method based on optimal transport theory, mitigating the dependency of multi-view learning on the aforementioned two assumptions. As illustrated in Figure 1, for the latent representations of different views, we leverage the sliced Wasserstein distance bonneel2015sliced to measure the discrepancy between their distributions, which does not require correspondence between samples. This modification is a generalization of a traditional co-regularization strategy. For each pair of views, we introduce a learnable weight to their sliced Wasserstein distance, and these weights can be interpreted as the optimal transport between different views. The optimal transport defined on the sliced Wasserstein distances leads to a hierarchical optimal transport (HOT) model. It provides a new regularization strategy, which represents the pairwise similarity between different views and implicitly indicates their clustering structure. Furthermore, given some learnable global representations as references (the orange distributions in Figure 1), we can apply this HOT model to find the clustering structure of the views explicitly (i.e., the orange optimal transport matrix in Figure 1). We learn this HOT model efficiently by combining mini-batch gradient descent kingma2014adam with the Sinkhorn scaling algorithm cuturi2013sinkhorn . The proposed method achieves robust multi-view learning with fewer assumptions, and is demonstrated to perform well on multiple datasets.
2 Proposed Model
Suppose that we have a set of samples collected from views, i.e., for , where is the sample space of the th view, for is a -dimensional sample in the space, and contains observed samples. We aim to learn encoders to extract latent representations for the views, and leverage these representations as features for various learning tasks. Denote the encoder of the th view as , where is the -dimensional latent space of view .
Multi-view learning achieves the desired aim by learning the encoders jointly. We focus on the multi-view learning strategy called co-regularization, which leverages the information of one view to impose constraints on the others. Typical co-regularization methods include canonical correlation analysis (CCA) chaudhuri2009multi and its variants via2007learning ; white2012convex ; andrew2013deep ; ding2014low ; wang2015deep ; benton2019deep . These methods assume that there is a common -dimensional latent space shared by the outputs of the encoders. The projections of the encoders’ outputs into this space obey the same distribution, or their distributions are highly correlated with each other. For example, the Least Squares based Generalized CCA (LSCCA) via2007learning learns the encoders by
(1) 
where is a matrix projecting the latent representations of each view to a common latent space, and
is an identity matrix. The objective function of LSCCA is the summation of pairwise comparisons between different views, which penalizes the differences between them. Similarly, we can also learn global latent representations shared by all the views, as in Generalized Deep CCA (GDCCA) benton2019deep :

(2)
where contains the global latent representations. GDCCA compares different views indirectly: taking the global representation as a reference, it makes the latent representations of all the views approach this reference, thereby suppressing the differences between views.
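As a concrete illustration, the two co-regularization objectives above can be sketched in a few lines of numpy. This is a simplified sketch rather than the paper's implementation: `latents` stands for the already-projected latent representations of each view and `G` for the shared reference, and both names are our own.

```python
import numpy as np

def lscca_objective(latents):
    """LSCCA-style term: sum of pairwise squared Frobenius distances
    between the projected latent representations of the views."""
    total = 0.0
    for i in range(len(latents)):
        for j in range(i + 1, len(latents)):
            total += np.linalg.norm(latents[i] - latents[j], ord="fro") ** 2
    return total

def gdcca_objective(latents, G):
    """GDCCA-style term: sum of squared Frobenius distances between each
    view's projected latent representations (d x n matrices) and a shared
    reference G of the same shape."""
    return sum(np.linalg.norm(Z - G, ord="fro") ** 2 for Z in latents)
```

Note how the GDCCA form compares every view against the single reference `G`, whereas the LSCCA form compares all pairs of views directly; both implicitly assume the columns of the matrices are aligned across views.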
Semi-supervised Learning. The above multi-view learning methods can be used as regularizers in semi-supervised learning. In particular, when some multi-view data are labeled, we can learn a classifier associated with the encoders by solving the following optimization problem (to simplify notation, we ignore the associated constraints):
(3) 
where and are the sets of indices for labeled and unlabeled data, respectively; is the label associated with the th multi-view data point; and is a classifier taking the concatenation of as its input. The first term in (3) can be the cross-entropy loss for the labeled data. The second term can be an arbitrary regularizer imposed on all the views, which can be implemented as the objective functions in (1, 2). Finally, the last term can be any additional regularizer imposed on each single view. We can implement this term as a manifold-based regularizer that encourages the smoothness of the data manifold quang2013unifying ; sindhwani2008rkhs . Alternatively, we can introduce learnable decoders associated with the encoders to construct autoencoders and implement this term as the reconstruction loss between the sample in each view and its estimation wang2019adversarial ; huang2018multimodal ; ye2016learning ; wang2015deep . The two regularizers are weighted by and .

2.1 Sliced Wasserstein distance for view matching
The multi-view learning methods in (1, 2) require that the samples in different views are well-aligned, i.e., each multi-view sample is drawn jointly across the views. (Besides the co-regularization strategy, other multi-view learning strategies like co-training and multi-kernel fusion also depend on well-aligned multi-view data, as discussed in Section 4.) When the samples in each view are generated independently and only a few samples are labeled and well-aligned, as shown in Figure 1, we need to design a new regularizer to achieve robust multi-view learning without correspondence. A natural way to modify the objective functions in (1, 2) is to introduce permutation matrices to match the samples, i.e., replacing the terms in the objective functions with
(4) 
(5) 
where represents the set of all valid permutation matrices (without loss of generality, we assume the number of samples is the same across views, so the ’s in (4, 5) are permutation matrices). The in (4) is the permutation matrix indicating the correspondence between the samples of the th view and those of the th view, and the in (5) is the permutation matrix indicating the correspondence between the samples of the th view and the global latent representation.
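For intuition, the matching term for a single pair of views can be evaluated exactly by exhaustive search at toy scale. The sketch below (our own hypothetical names, numpy for brevity) makes the factorial cost of explicit matching apparent, which motivates the relaxation that follows.

```python
import itertools
import numpy as np

def best_permutation_cost(A, B):
    """Min over permutations of ||A - B[:, perm]||_F^2, where A and B are
    (d, n) matrices of latent representations from two views.
    Exhaustive search over all n! permutations -- feasible only for toy n."""
    n = A.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        cost = np.linalg.norm(A - B[:, list(perm)], ord="fro") ** 2
        best = min(best, cost)
    return best
```

Even for a single pair of views this search grows as n!, and the joint matching over many views is harder still, which is why an approximation is needed.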
Because such matching problems are NPhard, we propose an approximate algorithm to solve them efficiently based on the sliced Wasserstein distance bonneel2015sliced ; kolouri2018sliced .
Definition 2.1 (Sliced Wasserstein).
Let be the -dimensional hypersphere and the uniform measure on it. For each , we denote the projection onto as , where . For two arbitrary probability measures defined on a compact metric space , denoted and , we define their sliced Wasserstein distance as

(6)

where is the one-dimensional (1D) distribution after the projection, and is the Wasserstein distance between and defined on .
The sliced Wasserstein distance provides a valid metric to measure the discrepancy between distributions. Given the samples of the two distributions, i.e., and , we can calculate the sliced Wasserstein distance empirically as
(7) 
where contains projectors randomly selected from , and is the empirical estimation of the Wasserstein distance between the two 1D distributions and . As shown in (7), when calculating , we do not need to learn permutation matrices explicitly: for each projector, we just need to sort the two sets of projected samples in ascending (or descending) order, and calculate the Euclidean distance between the sorted vectors kolouri2018sliced .

Denote the representations of each view’s samples in the common latent space (i.e., ) as . These latent representations can be viewed as samples of an unknown conditional distribution . From this standpoint, the matching problems in (4, 5) empirically measure the discrepancy between different conditional distributions, which can be replaced approximately by the sliced Wasserstein distance in (7). In theory, we have
Proposition 2.2.
Given two sets of samples, denoted as and , each of which has dimensional samples, .
The proof of this proposition can be found in the Supplementary Material. This result indicates that the sliced Wasserstein distance achieves a lower bound of the optimal objective functions in (4, 5). Therefore, to match the views based on their unaligned samples, we plug the sliced Wasserstein distance into (1, 2) and obtain the following two models:
(8) 
(9) 
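The sorting-based estimator in (7) can be written compactly. The following numpy sketch (function and parameter names are ours, not the paper's) projects both equally-sized sample sets onto random directions on the unit sphere, sorts each 1-D projection, and averages the squared differences:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=50, seed=0):
    """Empirical squared sliced Wasserstein-2 discrepancy between two
    point clouds X, Y of shape (n, d) with equal sample counts:
    project onto random unit directions, sort each 1-D projection,
    then average the squared differences of the sorted values."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # points on S^{d-1}
    px = np.sort(X @ theta.T, axis=0)  # (n, n_proj), each column sorted
    py = np.sort(Y @ theta.T, axis=0)
    return np.mean((px - py) ** 2)    # average over samples and projections
```

Sorting replaces the explicit permutation search: after projection to 1-D, the optimal matching is simply the order statistics, so the whole estimate costs O(n_proj · n log n).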
2.2 Hierarchical optimal transport for view clustering
The new objective functions in (8, 9) do not require well-aligned samples, but they still tend to push the latent representations of different views toward the same distribution. In particular, (8) penalizes the sliced Wasserstein distance between each pair of views, while (9) penalizes the sliced Wasserstein distance between each view and the reference. To overcome this problem, we further modify the multi-view learning methods as follows. For (8), we introduce learnable weights to the sliced Wasserstein distances and obtain
(10) 
where represents a -dimensional all-one vector and is the matrix of the weights. To avoid trivial solutions, we restrict to be (i) a doubly stochastic matrix, and (ii) a symmetric matrix with all-zero diagonal elements. By solving this problem, we find the clustering structure of the views implicitly: the views corresponding to the pairs with large weights belong to the same clusters. Note that in (10) we relax the strict constraint to a least-squares-based regularizer, which allows us to apply mini-batch gradient descent directly to learn the model. In the subsequent experiments, we find that this relaxation does not harm the learning results while simplifying our learning algorithm.

For (9), besides introducing learnable weights, we consider multiple global latent representations, which correspond to different clusters directly. The problem becomes
(11) 
where is the number of clusters we set for the views, which is fixed as 3 in the following experiments; represents the global latent representation matrix corresponding to the th cluster; and is the matrix of the weights, which is restricted to be a doubly stochastic matrix. According to its constraints, we can interpret the matrix as the joint distribution of the views and the clusters, and each element as the probability that the th view belongs to the th cluster. In other words, this method finds the clustering structure explicitly. Similar to (10), we relax the strict constraint to a regularizer in (11).

In both methods, we establish an optimal transport model with a hierarchical architecture. The in (10) achieves an optimal transport across different views, whose ground distance is the sliced Wasserstein distance between the latent representations of the views. Similarly, the in (11) is an optimal transport from the views to their clusters, whose ground distance is the sliced Wasserstein distance between the latent representations of the views and those of the clusters. These optimal transport matrices can be learned efficiently by computing the entropic Wasserstein distance via the Sinkhorn scaling algorithm cuturi2013sinkhorn . To our knowledge, our work is the first to leverage a hierarchical optimal transport model to implement multi-view learning. This framework provides a new way to represent different views and find their clustering structure.
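Once a view-to-cluster transport matrix has been learned, the explicit clustering can be read off directly. A minimal sketch, assuming a learned transport matrix `T` of shape (number of views, number of clusters) interpreted as a joint distribution; the function names are ours:

```python
import numpy as np

def view_clusters(T):
    """Hard assignment: following the joint-distribution reading of T,
    each view is assigned to the cluster carrying the most transport mass."""
    return np.argmax(T, axis=1)

def cluster_posteriors(T):
    """Soft assignment: the conditional distribution over clusters for each
    view, obtained by normalizing each row of the joint distribution."""
    return T / T.sum(axis=1, keepdims=True)
```

The soft assignment preserves the mixed-membership information (e.g., a view whose mass is split between two clusters), while the hard assignment gives the clustering structure discussed above.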
3 Learning Algorithm
We propose an efficient learning algorithm to solve the problems in (10, 11), based on alternating optimization. In each iteration, we first calculate the sliced Wasserstein distances and update the weight matrix via the Sinkhorn scaling algorithm cuturi2013sinkhorn . Then, we fix the weight matrix and learn the encoders and their projection matrices via mini-batch gradient descent, i.e., the Adam algorithm kingma2014adam . Algorithms 1 and 2 show the details of our implementation, where is the set of doubly stochastic matrices with marginals and , and is the weight of the entropic regularizer when applying the Sinkhorn algorithm. The details of the Sinkhorn algorithm can be found in cuturi2013sinkhorn and in our Supplementary Material. In line 6 of Algorithm 1, we set the cost matrix such that for . Additionally, the Sinkhorn algorithm makes converge to a symmetric matrix when the cost matrix is symmetric and the marginals of are the same. Therefore, all the constraints on can be readily satisfied. In Algorithm 2, the in line 2 are initialized as Gaussian random matrices, and for each the number of columns is equal to the batch size. Moreover, when some well-aligned labeled data are available, we can apply these two algorithms to achieve semi-supervised learning. Taking the labeled data into account, we just need to replace the loss in Algorithms 1 and 2 with the loss in (3) and update the model and a classifier jointly.
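The Sinkhorn scaling step used in each iteration can be sketched as follows. This is a textbook entropic-OT routine rather than the paper's exact implementation; `eps` plays the role of the entropic-regularizer weight, and `a`, `b` are the marginals of the transport plan:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.05, n_iter=100):
    """Entropic optimal transport via Sinkhorn scaling.
    C: (m, k) cost matrix; a, b: marginals (sum to 1 each).
    Returns a transport plan T = diag(u) K diag(v) whose marginals
    converge to a and b as the iterations proceed."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # match column marginals
        u = a / (K @ v)           # match row marginals
    return u[:, None] * K * v[None, :]
```

In the alternating scheme described above, `C` would hold the sliced Wasserstein distances between views (or between views and cluster references), and the returned plan corresponds to the weight matrix; the symmetric, zero-diagonal variant needed for (10) additionally requires a symmetric cost matrix with suppressed diagonal, as noted in the text.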
Our HOT model is a new member of the hierarchical optimal transport family, combining the sliced Wasserstein distance with the entropic Wasserstein distance. Compared with existing hierarchical optimal transport models chen2018optimal ; lee2019hierarchical ; yurochkin2019hierarchical ; xu2020learning , our model has advantages in computational complexity and model flexibility. Given views, each of which contains samples in a batch, the computational complexity of our method is for Algorithm 1 and for Algorithm 2. Here, is the number of random projections used to compute a sliced Wasserstein distance, which is much smaller than ; is the number of iterations used in the Sinkhorn scaling algorithm; and is the number of clusters for the views. The first term () corresponds to calculating the cost matrix based on the sliced Wasserstein distance, and the second term () corresponds to computing the entropic Wasserstein distance via the Sinkhorn scaling algorithm. Instead of using the sliced Wasserstein distance, existing hierarchical optimal transport models apply the Wasserstein distance chen2018optimal ; yurochkin2019hierarchical ; xu2020learning or the entropic Wasserstein distance lee2019hierarchical to calculate the cost matrix. As a result, for each element of the cost matrix, the computational complexity is when applying linear programming to compute the Wasserstein distance, or when applying the Sinkhorn scaling algorithm to compute the entropic Wasserstein distance lee2019hierarchical ; yurochkin2019hierarchical . To avoid these computations, such methods have to assume the distribution of the samples to be Gaussian chen2018optimal ; xu2020learning , which limits their applicability and increases the risk of over-regularization. According to this analysis, our learning algorithm has much lower computational complexity, without imposing any assumptions on the latent distributions of the views.

4 Related Work
Multi-view learning Multi-view learning can be broadly categorized into three strategies zhao2017multi ; sun2013survey : co-training, multi-kernel fusion, and co-regularization guo2019canonical . Co-training requires (i) that the views in the training data are conditionally independent, and (ii) that each view is sufficient to predict labels. It iteratively learns a separate classifier for each view using labeled samples and annotates the unlabeled data based on the most confident predictions of each classifier kumar2011co ; ma2017self . Kernel-based methods merge the kernel matrices of different views and learn global representations based on the merged kernel de2010multi ; li2015large . Co-regularization methods add regularization terms to encourage the data from different views to be consistent. Representative regularizers include (i) CCA-based methods guo2019canonical ; chaudhuri2009multi ; sindhwani2008rkhs ; via2007learning ; guo2012cross ; andrew2013deep that penalize the difference between the views in the latent space, and (ii) linear discriminant analysis based methods jin2014multi that require labeled data. Generally, co-training methods are not scalable to cases with more than two views, and kernel-based methods are transductive. Because neither approach is suitable for large-scale multi-view learning tasks, in our work we focus on the co-regularization strategy and its improvements. Additionally, all the methods discussed above require well-aligned multi-view data. Although some methods have been proposed to achieve multi-view learning based on incomplete or noisy views xu2015multi ; jin2014multi ; christoudias2008multi , they rely on labeled data, which are not available in many scenarios.
Optimal transport-based learning Optimal transport theory villani2008optimal has proven useful in distribution matching memoli2011gromov ; su2017order , data clustering agueh2011barycenters ; xu2018distilled ; yurochkin2019hierarchical , and generative modeling arjovsky2017wasserstein . Given two sets of samples, we can calculate the optimal transport between them by linear programming kusner2015word . With the help of the Sinkhorn algorithm benamou2015iterative , an entropic Wasserstein distance has been proposed to accelerate the computation of optimal transport cuturi2013sinkhorn . When only the distance between distributions is needed, one can apply the dual form of the Wasserstein distance arjovsky2017wasserstein or the sliced Wasserstein distance bonneel2015sliced ; kolouri2018sliced to approximate it, and avoid explicitly computing the optimal transport. Recently, hierarchical optimal transport models have been proposed to compare distributions with structural information, e.g., the nonlinear factorization models in xu2018distilled ; xu2020gromov ; schmitzer2013hierarchical and the optimal transport models for multi-modal distributions chen2018optimal ; lee2019hierarchical ; yurochkin2019hierarchical . These hierarchical optimal transport models achieve encouraging performance on multi-modal distribution matching chen2018optimal ; lee2019hierarchical ; xu2020learning and data clustering yurochkin2019hierarchical ; xu2020gromov . Compared with existing HOT models, ours has lower computational complexity and imposes no constraints on the target distributions.
Dataset  # Samples  # Cls.  View 1/  View 2/  View 3/  View 4/  View 5/  View 6/ 

Caltech7/20  1474/2386  7/20  Gabor/48  WM/40  CENTRIST/254  HOG/1984  GIST/512  LBP/928 
Handwritten  2000  10  Pixel/240  Fourier/76  FAC/216  ZER/47  KAR/64  MOR/6 
5 Experiments
To demonstrate the usefulness of the proposed multi-view learning methods, we test them on both synthetic and real-world datasets, with comparisons to state-of-the-art methods. For each method, we consider the following three datasets: Caltech7, Caltech20, and Handwritten li2015large . The Caltech7, Caltech20, and Handwritten datasets correspond to three image classification tasks. Each contains 5–6 kinds of visual features extracted by classic methods. The details of the feature extraction methods are provided at https://github.com/yeqinglee/mvdata. The statistics of these datasets are summarized in Table 1. For each dataset, we test our method and the alternative approaches in 20 trials. In each trial, we randomly select 60% of the samples for training, 20% for validation, and the remaining 20% for testing. For the training data, we keep 5% as well-aligned and labeled data. For existing multi-view learning methods that require well-aligned data, we just remove the labels of the remaining training data. For the proposed robust multi-view learning methods and their variants, we randomly permute the remaining training data in each view and remove their labels to generate unaligned, unlabeled data.

Method  Data  Caltech7  Caltech20  Handwritten
Baseline  LSCCA via2007learning  —  Aligned  87.36±1.43  71.20±2.74  87.98±3.46
DGCCA benton2019deep  —  Aligned  87.60±1.08  71.80±2.61  87.12±3.89
AECCA wang2015deep  AE  Aligned  87.62±1.47  71.50±2.84  88.53±3.22
Ours  SW (8)  —  Unaligned  87.11±1.55  70.94±3.05  88.58±3.29
SW (9)  —  Unaligned  87.55±1.79  72.50±2.41  89.78±2.98
HOT (10)  —  Unaligned  88.31±1.56  72.96±2.69  89.95±3.52
HOT (11),  —  Unaligned  88.29±1.87  73.43±2.70  90.05±2.62
SW (8)  AE  Unaligned  87.49±1.25  70.94±2.73  89.90±3.31
SW (9)  AE  Unaligned  87.24±1.64  72.36±2.27  89.22±3.28
HOT (10)  AE  Unaligned  88.47±1.52  73.18±2.40  90.37±3.00
HOT (11),  AE  Unaligned  88.98±2.15  73.48±2.08  91.07±2.55
Averaged classification accuracy (%) and standard deviation (semi-supervised learning)
Method  Data  Caltech7  Caltech20  Handwritten  
Baseline  LSCCA via2007learning  —  Aligned  82.33±1.84  60.42±2.16  70.20±7.77
DGCCA benton2019deep  —  Aligned  75.56±4.59  54.67±3.13  66.15±5.20
AECCA wang2015deep  AE  Aligned  84.39±1.71  66.75±2.47  83.30±4.83
Ours  SW (8)  —  Unaligned  82.25±1.47  59.93±3.74  65.67±9.80
SW (9)  —  Unaligned  85.71±1.33  69.21±2.51  86.40±2.71
HOT (10)  —  Unaligned  82.89±1.32  60.85±3.31  67.65±7.56
HOT (11),  —  Unaligned  86.27±1.79  71.07±2.29  87.12±2.10
SW (8)  AE  Unaligned  83.88±1.70  67.27±2.96  82.73±3.70
SW (9)  AE  Unaligned  86.64±1.57  69.29±2.37  87.63±3.08
HOT (10)  AE  Unaligned  83.98±1.55  67.41±2.32  83.85±3.84
HOT (11),  AE  Unaligned  87.32±1.33  69.48±2.53  87.50±3.15

Averaged classification accuracy (%) and standard deviation (unsupervised learning)
Caltech7  HOT (11)  Handwritten  HOT (11)

All  86.27±1.79  All  87.12±2.10
−Gabor  86.03±1.90  −Pixel  84.45±3.17
−WM  86.15±1.69  −Fourier  86.45±3.08
−CENTRIST  86.01±1.21  −FAC  84.65±2.57
−HOG  82.95±2.12  −ZER  84.40±2.15
−GIST  85.87±1.10  −KAR  84.75±3.38
−LBP  85.43±2.84  −MOR  79.30±2.58

“All” means training models using all the views.
“−View” means removing the data of that view and training models accordingly.
5.1 Comparisons on semi-supervised and unsupervised learning
Applying the semi-supervised framework proposed in (3), we test different multi-view learning methods and record their classification accuracy. For each method, we train multilayer perceptron (MLP) models as encoders and a softmax layer as a classifier. For fairness, all methods apply models with the same architecture and the same hyperparameters. In particular, we set the hyperparameters empirically as follows: the number of epochs is ; the learning rate is fixed as ; the batch size is ; for each , the dimension of its output is ; the dimension of the common latent space is ; in (3), and ; in (10, 11), ; for the Sinkhorn algorithm, the number of iterations is and ; for the sliced Wasserstein distance, the number of projections is set to in our methods. The robustness of our learning algorithms to these key hyperparameters is analyzed in the Supplementary Material. The average performance of these methods over the 20 trials is reported in Table 2. The baselines include LSCCA via2007learning , DGCCA benton2019deep , and the autoencoder-assisted CCA (AECCA) wang2015deep . Compared with these baselines, which require well-aligned training data, our methods and their variants use unaligned training data but achieve at least comparable classification accuracy. Moreover, among our methods, applying our hierarchical optimal transport model generally achieves higher accuracy than applying the sliced Wasserstein distance directly. These phenomena demonstrate the feasibility of the sliced Wasserstein distance in multi-view learning and the advantage of our HOT model. Additionally, we can introduce a set of decoders corresponding to the encoders and construct the regularizer as the reconstruction loss of an autoencoder (AE), which further improves the classification accuracy.

Besides semi-supervised learning, we can learn the latent representations of different views in an unsupervised manner, and then train a classifier based on them. The performance of different methods is shown in Table 3, which helps us evaluate the power of different methods for unsupervised feature extraction. Compared with the performance achieved by semi-supervised learning, the performance of the baselines drops precipitously in this setting.
For the proposed methods, those depending on pairwise comparisons between views (i.e., those implementing (8, 10)) suffer performance degradation as well, while those learning clusters of views explicitly (i.e., those implementing (9, 11)) retain high classification accuracy. This implies that learning the clustering structure of the views explicitly may be more suitable for unsupervised multi-view learning. Again, applying our hierarchical optimal transport model improves the performance in this experiment.
5.2 Justification of view clustering
According to Table 3, applying our HOT model in (11) achieves encouraging performance, which implies that the clustering structure of views learned by this method is reasonable. In Figure 3, we visualize the corresponding optimal transport matrices learned for different datasets. We find that for the views of the Caltech7 dataset, “Gabor” and “CENTRIST” belong to one cluster while “WM” and “GIST” belong to another, and “HOG” and “LBP” correspond to a mixture of clusters 2 and 3. To verify the rationality of this clustering structure, we evaluate the significance of each view in our learning tasks by removing it and retraining the model. As shown in Table 3, compared with the result achieved using all views, removing either “Gabor” or “CENTRIST” (either “WM” or “GIST”) degrades the classification accuracy only slightly, while removing “HOG” or “LBP” harms the accuracy severely. This result demonstrates that the clusters we find indeed group views with redundant information and comparable contributions. Similarly, for the views of the Handwritten dataset, the distribution of “Pix” over the three clusters is similar to those of “Fac” and “KAR”. Therefore, removing any one of them from the training views leads to similar classification accuracy. On the other hand, “MOR” is a unique view belonging to cluster 1, so removing it leads to serious degradation in performance.
6 Conclusion
We have proposed a hierarchical optimal transport model to achieve robust multi-view learning. This method neither depends on the correspondence between the samples of different views nor requires the views to obey the same latent distribution. The proposed approach consistently outperforms many strong baseline models on multiple datasets, demonstrating its potential for complicated learning tasks in real-world scenarios. In the future, we plan to explore practical applications of our method, e.g., introducing it to federated learning and to predictive tasks in financial and healthcare data analysis. Additionally, we would like to extend our method to multi-view multi-task learning.
7 Broader Impact Statement
This paper makes a significant contribution to extending the frontier of multi-view learning with fewer assumptions and better interpretability. To the best of our knowledge, our method is the first approach to learn latent representations for different views without correspondence while exploring the clustering structure of the views at the same time. Its relationship to traditional multi-view learning methods is clarified in the paper as well. A potential application scenario of our work is multi-view learning based on distributed private data, e.g., patients’ records in different hospitals and individuals’ financial statements in different banks. Our method provides a potential solution that allows different organizations to share their data with better protection of data privacy: (i) the data can be not only anonymous but also from different individuals; (ii) in the training phase, we do not need to learn their correspondence explicitly. These advantages greatly reduce the risk of information leakage.
References
 (1) M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

 (2) G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
 (3) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 (4) J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
 (5) A. Benton, H. Khayrallah, B. Gujral, D. A. Reisinger, S. Zhang, and R. Arora. Deep generalized canonical correlation analysis. ACL 2019, page 1, 2019.
 (6) N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
 (7) K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 129–136, 2009.

 (8) Y. Chen, T. T. Georgiou, and A. Tannenbaum. Optimal transport for Gaussian mixture models. IEEE Access, 7:6269–6278, 2018.
 (9) C. M. Christoudias, R. Urtasun, and T. Darrell. Multi-view learning in the presence of view disagreement. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 88–96, 2008.
 (10) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
 (11) V. R. De Sa, P. W. Gallagher, J. M. Lewis, and V. L. Malave. Multi-view kernel construction. Machine Learning, 79(1–2):47–71, 2010.
 (12) Z. Ding and Y. Fu. Low-rank common subspace for multi-view learning. In 2014 IEEE International Conference on Data Mining, pages 110–119. IEEE, 2014.
 (13) C. Guo and D. Wu. Canonical correlation analysis (CCA) based multi-view learning: An overview. arXiv preprint arXiv:1907.01693, 2019.
 (14) Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In ICML, 2012.
 (15) F. Huang, X. Zhang, C. Li, Z. Li, Y. He, and Z. Zhao. Multimodal network embedding via attention based multiview variational autoencoder. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pages 108–116, 2018.
 (16) X. Jin, F. Zhuang, H. Xiong, C. Du, P. Luo, and Q. He. Multitask multiview learning for heterogeneous tasks. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pages 441–450, 2014.
 (17) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (18) S. Kolouri, G. K. Rohde, and H. Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3427–3436, 2018.
 (19) A. Kumar and H. Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 393–400, 2011.
 (20) M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International conference on machine learning, pages 957–966, 2015.
 (21) J. Lee, M. Dabagia, E. Dyer, and C. Rozell. Hierarchical optimal transport for multimodal distribution alignment. In Advances in Neural Information Processing Systems, pages 13453–13463, 2019.
 (22) Y. Li, F. Nie, H. Huang, and J. Huang. Large-scale multi-view spectral clustering via bipartite graph. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 (23) Y. Li, M. Yang, and Z. Zhang. A survey of multi-view representation learning. IEEE transactions on knowledge and data engineering, 31(10):1863–1883, 2018.
 (24) F. Ma, D. Meng, Q. Xie, Z. Li, and X. Dong. Self-paced co-training. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2275–2284. JMLR.org, 2017.
 (25) F. Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417–487, 2011.
 (26) M. H. Quang, L. Bazzani, and V. Murino. A unifying framework for vector-valued manifold regularization and multi-view learning. In International conference on machine learning, pages 100–108, 2013.
 (27) B. Schmitzer and C. Schnörr. A hierarchical approach to optimal transport. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 452–464. Springer, 2013.
 (28) V. Sindhwani and D. S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th international conference on Machine learning, pages 976–983, 2008.
 (29) B. Su and G. Hua. Order-preserving Wasserstein distance for sequence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1057, 2017.
 (30) S. Sun. A survey of multi-view machine learning. Neural computing and applications, 23(7-8):2031–2038, 2013.
 (31) J. Vía, I. Santamaría, and J. Pérez. A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks, 20(1):139–152, 2007.
 (32) C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 (33) W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In International Conference on Machine Learning, pages 1083–1092, 2015.
 (34) X. Wang, D. Peng, P. Hu, and Y. Sang. Adversarial correlated autoencoder for unsupervised multi-view representation learning. Knowledge-Based Systems, 168:109–120, 2019.
 (35) M. White, X. Zhang, D. Schuurmans, and Y.-l. Yu. Convex multi-view subspace learning. In Advances in Neural Information Processing Systems, pages 1673–1681, 2012.
 (36) C. Xu, D. Tao, and C. Xu. Multi-view learning with incomplete views. IEEE Transactions on Image Processing, 24(12):5812–5825, 2015.
 (37) H. Xu. Gromov-Wasserstein factorization models for graph clustering. arXiv preprint arXiv:1911.08530, 2019.
 (38) H. Xu, D. Luo, R. Henao, S. Shah, and L. Carin. Learning autoencoders with relational regularization. arXiv preprint arXiv:2002.02913, 2020.
 (39) H. Xu, W. Wang, W. Liu, and L. Carin. Distilled Wasserstein learning for word embedding and topic modeling. In Advances in Neural Information Processing Systems, pages 1716–1725, 2018.
 (40) T. Ye, T. Wang, K. McGuinness, Y. Guo, and C. Gurrin. Learning multiple views with orthogonal denoising autoencoders. In International Conference on Multimedia Modeling, pages 313–324. Springer, 2016.
 (41) Y. Yuan, G. Xun, K. Jia, and A. Zhang. A multi-view deep learning framework for EEG seizure detection. IEEE journal of biomedical and health informatics, 23(1):83–94, 2018.
 (42) M. Yurochkin, S. Claici, E. Chien, F. Mirzazadeh, and J. M. Solomon. Hierarchical optimal transport for document representation. In Advances in Neural Information Processing Systems, pages 1599–1609, 2019.
 (43) C. Zhang, E. Adeli, T. Zhou, X. Chen, and D. Shen. Multi-layer multi-view classification for Alzheimer’s disease diagnosis. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 (44) J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.
8 Supplementary Material
8.1 The proof of Proposition 2.2
Proposition 2.2 Given two sets of samples, denoted as and , each of which has dimensional samples, .
Proof.
Given two sets of samples, denoted as and , each of which has dimensional samples, we have
(12) 
Here, . Because each , we have . ∎
8.2 The Sinkhorn scaling algorithm
The scheme of the Sinkhorn scaling algorithm is shown below:
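As a reference, the scaling iterations can be sketched in a few lines of NumPy; this is a minimal illustration that alternately rescales the Gibbs kernel to match the two marginals (the function name, the regularization weight `epsilon`, and the iteration count are illustrative placeholders, not the values used in our experiments):

```python
import numpy as np

def sinkhorn(a, b, C, epsilon=0.1, num_iters=1000):
    """Entropic-regularized OT between histograms a (n,) and b (m,)
    with cost matrix C of shape (n, m). Returns the transport plan."""
    K = np.exp(-C / epsilon)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(num_iters):
        v = b / (K.T @ u)                 # rescale to match column marginal b
        u = a / (K @ v)                   # rescale to match row marginal a
    return u[:, None] * K * v[None, :]    # transport plan T = diag(u) K diag(v)
```

By construction, the returned plan's row sums match `a` exactly after the last update, while its column sums converge to `b` as the iterations proceed.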
8.3 The configuration of models and hyperparameters
We implement all the models with PyTorch and train them on a single NVIDIA GTX 1080 Ti GPU. For our methods, the hyperparameters are set empirically as follows: the number of epochs is ; the learning rate is fixed as ; the batch size is ; for each , the dimension of its output is 20; the dimension of the common latent space is 10; in (3), and ; in (10, 11), ; for the Sinkhorn algorithm, the number of iterations is and ; for the sliced Wasserstein distance, the number of projections is set to be in our methods; for the HOT in (11), the number of clusters of views is set to be .

Among these hyperparameters, three are key: the batch size, the number of projections used when calculating the sliced Wasserstein distance, and the number of clusters . In particular, the sliced Wasserstein distance used in our work provides an empirical estimate of the expected distance between distributions based on their samples. The batch size controls the number of samples used to calculate the sliced Wasserstein distance, while the number of projections controls the precision and stability of the estimate. Generally, a larger batch size and more projections give a better estimate but increase the computational cost. Figures 4(a) and 4(b) show that the performance of our multi-view learning method (the HOT in (11)) is relatively robust to changes in these two hyperparameters. According to these two figures, we set the batch size to be 400 and the number of projections to be 3. Similarly, Figure 4(c) shows that our method is robust to the number of clusters we set, and the best performance is achieved when . For semi-supervised learning, there are two more key hyperparameters: the weight of the multi-view learning regularizer and the weight in (10, 11). According to Figure 5, when and are set to be , our method can achieve encouraging performance.
Figure caption: the robustness of our method (the HOT in (11)) to its hyperparameters in unsupervised learning. We use the Handwritten dataset in this experiment.
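The sample-based sliced Wasserstein estimate discussed above can be sketched as follows. This is a minimal NumPy version for illustration: the function name, the default projection count, and the assumption that both sample sets share the same batch size are ours, not the paper's exact implementation. Each random unit direction reduces the problem to one-dimensional optimal transport, which is solved exactly by sorting the projected samples:

```python
import numpy as np

def sliced_wasserstein(X, Y, num_projections=3, seed=None):
    """Monte Carlo estimate of the (squared) sliced Wasserstein distance
    between two equally sized sample sets X, Y of shape (n, d)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)            # random unit direction
        x_proj = np.sort(X @ theta)               # sorted 1D projections
        y_proj = np.sort(Y @ theta)
        total += np.mean((x_proj - y_proj) ** 2)  # exact 1D squared-W2 cost
    return total / num_projections
```

This sketch makes the roles of the two key hyperparameters concrete: `n` is the batch size (the number of samples per estimate) and `num_projections` controls the variance of the Monte Carlo average.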
8.4 More results of optimal transport matrices
Figure 6 visualizes the optimal transport matrices obtained for different datasets when we learn the latent representations of their views in an unsupervised way. We find that both of our HOT models can predict the clusters of the views. However, the clustering structures they learn are inconsistent in some cases. For example, for the Caltech7 dataset, “Gabor”, “WM”, and “HOG” are likely to be in the same cluster when we apply the HOT model in (10). On the other hand, when we use the HOT model in (11), “Gabor” is more likely to be grouped with “CENTRIST”. Because the second model outperforms the first for unsupervised learning, we believe the clusters detected by the second model are more reliable, which is also verified in Table 3 of the main paper.
Additionally, we find that for simple tasks (e.g., the Caltech7 and the Handwritten), our methods can learn sparse optimal transport matrices and detect clusters clearly. For complicated tasks, e.g., those with many classes (the Caltech20), the learned optimal transport matrices are often dense, and the clusters are not as obvious as those in the simple cases.