1 Introduction
Autoencoders have been used widely in many challenging machine learning tasks for generative modeling, e.g., image (Kingma & Welling, 2013; Tolstikhin et al., 2018) and sentence (Bowman et al., 2016; Wang et al., 2019) generation. Typically, an autoencoder assumes that the data in the sample space can be mapped to a low-dimensional manifold, which is represented in a latent space. The autoencoder fits the unknown data distribution via a latent-variable model, specified by a prior distribution on the latent code and a generative model mapping the latent code to the data. Learning seeks to minimize the discrepancy between the data distribution and the model distribution. According to the choice of the discrepancy, we can derive different autoencoders. For example, the variational autoencoder (VAE) (Kingma & Welling, 2013) applies the KL divergence as the discrepancy and learns a probabilistic autoencoder via maximizing the evidence lower bound (ELBO). The Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) minimizes a relaxed form of the Wasserstein distance between the two distributions and learns a deterministic autoencoder. In general, the objective function approximating the discrepancy consists of a reconstruction loss on the observed data and a regularizer penalizing the difference between the prior distribution and the posterior derived from the encoded data. Although existing autoencoders have achieved success in many generative tasks, they often suffer from the following two problems.

Regularizer misspecification. Typical autoencoders like the VAE and WAE fix the prior as a normal distribution, which often leads to the problem of over-regularization. Moreover, applying such an unstructured prior increases the difficulty of conditional generation tasks. To avoid oversimplified priors, the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and the VAE with VampPrior (Tomczak & Welling, 2018) characterize their priors as learnable mixture models. However, without side information (Wang et al., 2019), jointly learning the autoencoder and the prior carries a high risk of under-regularization and is sensitive to the setting of hyperparameters (e.g., the number of mixture components and the initialization of the prior).

Co-training of heterogeneous autoencoders. Solving a single task often relies on data from different domains (i.e., multi-view data). For example, predicting the mortality of a patient may require both her clinical record and genetic information. In such a situation, we may need to learn multiple autoencoders to extract latent variables as features from the different views. Traditional multi-view learning strategies either assume that the co-trained autoencoders share the same latent distribution (Wang et al., 2015; Ye et al., 2016), or assume that there exists an explicit transform between the different latent spaces (Wang et al., 2016). These assumptions are questionable in practice, as the corresponding autoencoders can have heterogeneous architectures and incomparable latent spaces. How to co-train such heterogeneous autoencoders is still an open problem.
To overcome the aforementioned problems, we propose a new Relational regularized AutoEncoder (RAE). As illustrated in Figure 1(a), we formulate the prior as a Gaussian mixture model. Differing from existing methods, however, we leverage the Gromov-Wasserstein (GW) distance (Mémoli, 2011) to regularize the structural difference between the prior and the posterior in a relational manner, i.e., comparing the distances between samples drawn from the prior with those between samples drawn from the posterior, and restricting their difference. Considering this relational regularizer allows us to implement the discrepancy between the posterior and the prior as the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a). Besides imposing structural constraints on the prior distribution within a single autoencoder, for multiple autoencoders with different latent spaces (e.g., the 2D and 3D latent spaces shown in Figure 1(b)), we can train them jointly by applying the relational regularizer to their posterior distributions.

The proposed relational regularizer is applicable to both probabilistic and deterministic autoencoders, corresponding to approximating the FGW distance as the hierarchical FGW and the sliced FGW, respectively. We demonstrate the rationality of these two approximations and analyze their computational complexity. Experimental results show that i) learning RAEs helps achieve structured prior distributions and suppresses the under-regularization problem, outperforming related approaches in image-generation tasks; and ii) the proposed relational co-training strategy is beneficial for learning heterogeneous autoencoders, which has potential for multi-view learning tasks.
2 Relational Regularized Autoencoders
2.1 Learning mixture models as structured prior
Following prior work with autoencoders (Tolstikhin et al., 2018; Kolouri et al., 2018), we fit the model distribution $p_\theta$ by minimizing its Wasserstein distance to the data distribution $\mu$, i.e., $\min_\theta d_w(p_\theta, \mu)$. According to Theorem 1 in (Tolstikhin et al., 2018), we can relax the Wasserstein distance and formulate the learning problem as follows:
\min_{\theta,\phi}\ \mathbb{E}_{x\sim\mu}\,\mathbb{E}_{z\sim q_\phi(z|x)}\big[d(x, G_\theta(z))\big] + \beta\, D\big(q_\phi(z),\, p(z)\big),   (1)
where $G_\theta$ is the target generative model (decoder) and $q_\phi(z|x)$ is the posterior of the latent code $z$ given $x$, parameterized by an encoder. Accordingly, $q_\phi(z) = \mathbb{E}_{x\sim\mu}[q_\phi(z|x)]$ is the marginal distribution derived from the posterior; $d(\cdot,\cdot)$ represents the distance between samples, and $D$ is an arbitrary discrepancy between distributions. The parameter $\beta > 0$ achieves a tradeoff between the reconstruction loss and the regularizer.
Instead of fixing the prior $p(z)$ as a normal distribution, we seek to learn a structured prior associated with the autoencoder:
\min_{\theta,\phi}\ \min_{p\in\mathcal{P}}\ \mathbb{E}_{x\sim\mu}\,\mathbb{E}_{z\sim q_\phi(z|x)}\big[d(x, G_\theta(z))\big] + \beta\, D\big(q_\phi(z),\, p(z)\big),   (2)
where $\mathcal{P}$ is the set of valid prior distributions, which is often assumed to be a set of (Gaussian) mixture models (Dilokthanakul et al., 2016; Tomczak & Welling, 2018). Learning the structured prior allows us to explore the clustering structure of the data and to achieve conditional generation (e.g., sampling latent variables from a single component of the prior and generating samples accordingly).
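As a concrete illustration of such a structured prior, the following minimal numpy sketch (all names hypothetical, not the implementation used in our experiments) parameterizes a GMM prior with equally weighted, diagonal-covariance components and supports both unconditional sampling and conditional generation from a single component:

```python
import numpy as np

class GMMPrior:
    """Hypothetical sketch of a learnable GMM prior with K equally
    weighted, diagonal-covariance Gaussian components."""

    def __init__(self, n_components, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # learnable parameters: component means and log-variances
        self.means = rng.normal(size=(n_components, latent_dim))
        self.log_vars = np.zeros((n_components, latent_dim))
        self.weights = np.full(n_components, 1.0 / n_components)

    def sample(self, n, component=None, seed=0):
        """Draw n latent codes; fixing `component` yields conditional
        generation from a single mixture component."""
        rng = np.random.default_rng(seed)
        K = len(self.weights)
        ks = (np.full(n, component) if component is not None
              else rng.choice(K, size=n, p=self.weights))
        eps = rng.normal(size=(n, self.means.shape[1]))
        # reparameterization: z = mu_k + sigma_k * eps
        return self.means[ks] + np.exp(0.5 * self.log_vars[ks]) * eps

prior = GMMPrior(n_components=10, latent_dim=8)
z = prior.sample(32)                    # unconditional latent samples
z_cond = prior.sample(16, component=3)  # samples from one component
```

In a full autoencoder, the sampled codes would be passed through the decoder; the conditional path is what enables component-wise generation.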
2.2 Relational regularization via Gromov-Wasserstein
Jointly learning the prior and the autoencoder may lead to under-regularization in the training phase: it is easy to fit the prior to the aggregated posterior without harming the reconstruction loss. Solving this problem requires introducing structural constraints when comparing these two distributions, motivating a relational regularized autoencoder (RAE). In particular, besides commonly-used regularizers like the KL divergence (Dilokthanakul et al., 2016) and the Wasserstein distance (Titouan et al., 2019), which compare the distributions directly, we consider a relational regularizer based on the Gromov-Wasserstein (GW) distance (Mémoli, 2011) in our learning problem:
\min_{\theta,\phi}\ \min_{p\in\mathcal{P}}\ \mathbb{E}_{x\sim\mu}\,\mathbb{E}_{z\sim q_\phi(z|x)}\big[d(x, G_\theta(z))\big] + \beta\big[(1-\gamma)\, D\big(q_\phi(z), p(z)\big) + \gamma\, d_{gw}\big(q_\phi(z), p(z)\big)\big],   (3)
where $\gamma\in[0,1]$ controls the tradeoff between the two regularizers, and $d_{gw}$ is the GW distance defined as follows.
Definition 2.1.
Let $(\mathcal{X}, d_{\mathcal{X}}, \mu_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \mu_{\mathcal{Y}})$ be two metric measure spaces, where $(\mathcal{X}, d_{\mathcal{X}})$ is a compact metric space and $\mu_{\mathcal{X}}$ is a probability measure on $\mathcal{X}$ (with $(\mathcal{Y}, d_{\mathcal{Y}})$ and $\mu_{\mathcal{Y}}$ defined in the same way). The Gromov-Wasserstein distance is defined as

d_{gw}(\mu_{\mathcal{X}}, \mu_{\mathcal{Y}}) := \inf_{\pi\in\Pi(\mu_{\mathcal{X}},\, \mu_{\mathcal{Y}})} \mathbb{E}_{(x,y),(x',y')\sim\pi\times\pi}\big[\,|d_{\mathcal{X}}(x, x') - d_{\mathcal{Y}}(y, y')|^{2}\,\big],

where $\pi$ is a joint distribution of $x$ and $y$, and $\Pi(\mu_{\mathcal{X}}, \mu_{\mathcal{Y}})$ is the set of all probability measures on $\mathcal{X}\times\mathcal{Y}$ with $\mu_{\mathcal{X}}$ and $\mu_{\mathcal{Y}}$ as marginals.
The loss $|d_{\mathcal{X}}(x,x') - d_{\mathcal{Y}}(y,y')|^{2}$ defines a relational loss, comparing the difference between pairs of samples from the two distributions. Accordingly, the GW distance corresponds to the minimum expectation of the relational loss. The optimal joint distribution $\pi^{*}$ corresponding to the GW distance is called the optimal transport between the two distributions. The $d_{gw}$ in (3) penalizes the structural difference between the two distributions, mutually enhancing the clustering structure of the prior and that of the posterior. We prefer using the GW distance to implement the relational regularizer because of the ease with which it may be combined with existing regularizers, allowing the design of scalable learning algorithms. In particular, when the direct regularizer is the Wasserstein distance (Titouan et al., 2019), i.e., $D = d_{w}$, we can combine it with the $d_{gw}$ and derive a new regularizer as follows:
d_{fgw}\big(q_\phi(z), p(z)\big) := \min_{\pi\in\Pi(q_\phi(z),\, p(z))}\ (1-\gamma)\, \mathbb{E}_{(z,\hat{z})\sim\pi}\big[c(z, \hat{z})\big] + \gamma\, \mathbb{E}_{(z,\hat{z}),(z',\hat{z}')\sim\pi\times\pi}\big[\,|d(z, z') - d(\hat{z}, \hat{z}')|^{2}\,\big],   (4)
where $c(\cdot,\cdot)$ is a direct loss function between the two spaces. The new regularizer enforces a shared optimal transport for the Wasserstein and the Gromov-Wasserstein terms, corresponding to the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a) between the distributions. The rationality of this combination has two perspectives. First, the optimal transport indicates the correspondence between two spaces (Mémoli, 2011; Xu et al., 2019b). In the following section, we show that this optimal transport maps encoded data to the clusters defined by the prior; enforcing a shared optimal transport helps ensure the consistency of the clustering structure. Additionally, as shown in (4), $d_{fgw}(q_\phi(z), p(z)) \ge (1-\gamma)\, d_{w}(q_\phi(z), p(z)) + \gamma\, d_{gw}(q_\phi(z), p(z))$, because the shared transport plan is in general suboptimal for each term individually. When replacing the regularizers in (3) with the FGW regularizer, we thus minimize an upper bound of the original objective function, which is useful from the viewpoint of optimization.

Accordingly, we learn an autoencoder with relational regularization by solving the following optimization problem:
\min_{\theta,\phi}\ \min_{p\in\mathcal{P}}\ \mathbb{E}_{x\sim\mu}\,\mathbb{E}_{z\sim q_\phi(z|x)}\big[d(x, G_\theta(z))\big] + \beta\, d_{fgw}\big(q_\phi(z),\, p(z)\big),   (5)
where the prior $p(z)$ is parameterized as a Gaussian mixture model (GMM) with $K$ components $\{\mathcal{N}(z;\, \mu_k, \Sigma_k)\}_{k=1}^{K}$, and we set the probability of each component to $1/K$. The autoencoder can be either probabilistic or deterministic, leading to different learning algorithms.
3 Learning algorithms
3.1 Probabilistic autoencoder with hierarchical FGW
When the autoencoder is probabilistic, for each sample $x_i$ the encoder outputs the mean and the logarithmic variance of the posterior $q_\phi(z|x_i)$. Accordingly, the marginal distribution $q_\phi(z)$ becomes a GMM as well, with the number of components equal to the batch size, and the regularizer corresponds to the FGW distance between two GMMs. Inspired by the hierarchical Wasserstein distance (Chen et al., 2018; Yurochkin et al., 2019; Lee et al., 2019), we leverage the structure of the GMMs and propose a hierarchical FGW distance to replace the original regularizer. In particular, given two GMMs, we define the hierarchical FGW distance between them as follows.

Definition 3.1 (Hierarchical FGW).
Let $p = \sum_{k=1}^{K} a_k\, p_k$ and $q = \sum_{n=1}^{N} b_n\, q_n$ be two GMMs, whose components $p_k$ and $q_n$ are $D$-dimensional Gaussian distributions. $a = (a_1, \dots, a_K)$ and $b = (b_1, \dots, b_N)$ are the distributions of the Gaussian components. For $\gamma\in[0,1]$, the hierarchical fused Gromov-Wasserstein distance between these two GMMs is

d_{hfgw}(p, q) := \min_{T\in\Pi(a,\, b)}\ (1-\gamma) \sum_{k,n} W_2^2(p_k, q_n)\, T_{kn} + \gamma \sum_{k,k',n,n'} \big|\, W_2(p_k, p_{k'}) - W_2(q_n, q_{n'}) \,\big|^{2}\, T_{kn}\, T_{k'n'},   (6)

where $W_2$ denotes the Wasserstein distance between Gaussian distributions.
As shown in (6), the hierarchical FGW corresponds to an FGW distance between the distributions of the Gaussian components, whose ground distance is the Wasserstein distance between the Gaussian components. Figures 2(a) and 2(b) further illustrate the difference between the FGW and our hierarchical FGW: for the two GMMs, instead of computing the optimal transport between them in the sample space, the hierarchical FGW builds the optimal transport between their Gaussian components. Additionally, we have

Proposition 3.2.
$d_{hfgw}(p, q) = d_{fgw}(p, q)$ when the covariance matrices vanish for all the Gaussian components.
Replacing the FGW with the hierarchical FGW, we convert an optimization problem over a continuous distribution (the $\pi$ in (4)) into a much simpler optimization problem over a discrete distribution (the $T$ in (6)). Rewriting (6) in matrix form, we compute the hierarchical FGW distance by solving the following nonconvex optimization problem:
\min_{T\in\Pi(a,\, b)}\ (1-\gamma)\, \langle C,\, T \rangle + \gamma\, \langle L(C_p, C_q, T),\, T \rangle,   (7)

where $\langle\cdot,\cdot\rangle$ indicates the inner product between matrices, $\Pi(a, b) = \{T \ge 0 : T\mathbf{1}_N = a,\ T^{\top}\mathbf{1}_K = b\}$, and $\mathbf{1}_N$ is an $N$-dimensional all-one vector. The optimal transport matrix $T^{*}$ is a joint distribution of the Gaussian components in the two GMMs. Here, $C = [W_2^2(p_k, q_n)]$, while $C_p = [W_2(p_k, p_{k'})]$ and $C_q = [W_2(q_n, q_{n'})]$ are the matrices whose elements are the Wasserstein distances between Gaussian components, and

L(C_p, C_q, T) = (C_p \circ C_p)\, a\, \mathbf{1}_N^{\top} + \mathbf{1}_K\, b^{\top} (C_q \circ C_q)^{\top} - 2\, C_p\, T\, C_q^{\top},   (8)

where $\circ$ represents the Hadamard product. The Wasserstein distance between Gaussian distributions has a closed-form solution:
Definition 3.3.
Let $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ be two $D$-dimensional Gaussian distributions, where $\mu_i$ and $\Sigma_i$ represent the mean and the covariance matrix, respectively. The Wasserstein distance between them satisfies

W_2^2\big(\mathcal{N}(\mu_1, \Sigma_1),\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \|\mu_1 - \mu_2\|_2^2 + \mathrm{tr}\big(\Sigma_1 + \Sigma_2 - 2\,(\Sigma_1^{1/2}\, \Sigma_2\, \Sigma_1^{1/2})^{1/2}\big).   (9)

When the covariance matrices are diagonal, i.e., $\Sigma_i = \mathrm{diag}(\sigma_i^2)$, where $\sigma_i$ is the vector of standard deviations, (9) can be rewritten as

W_2^2\big(\mathcal{N}(\mu_1, \Sigma_1),\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \|\mu_1 - \mu_2\|_2^2 + \|\sigma_1 - \sigma_2\|_2^2.   (10)
We solve (7) via the proximal gradient method in (Xu et al., 2019b), with further details in Appendix A.
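To make the computation concrete, the following numpy sketch (a simplified, hypothetical variant of the proximal gradient idea, not the exact solver of Xu et al., 2019b) alternates between linearizing the quadratic GW term via the decomposition in (8) and solving the resulting entropic subproblem with Sinkhorn iterations:

```python
import numpy as np

def sinkhorn_scaling(K, a, b, n_iter=200):
    """Rescale a positive kernel K to a coupling with marginals a and b."""
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def fgw_coupling(C, C1, C2, a, b, gamma=0.5, tau=1.0, n_outer=20):
    """Proximal-gradient sketch for the matrix problem (7): at each step
    we form the linearization L of the GW term via (8), then take an
    entropic proximal step around the previous coupling."""
    T = np.outer(a, b)
    row = ((C1 ** 2) @ a)[:, None]  # the (C1 o C1) a 1^T part of (8)
    for _ in range(n_outer):
        L = row + ((C2 ** 2) @ b)[None, :] - 2.0 * C1 @ T @ C2.T
        G = (1.0 - gamma) * C + gamma * L  # gradient of the objective
        T = sinkhorn_scaling(T * np.exp(-G / tau), a, b)
    return T

# toy demo: coupling between two small point clouds
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(6, 2)), rng.normal(size=(4, 2))
C = np.linalg.norm(X[:, None] - Y[None, :], axis=-1)
C1 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
C2 = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
a, b = np.full(6, 1 / 6), np.full(4, 1 / 4)
T = fgw_coupling(C, C1, C2, a, b)
```

For the hierarchical FGW, the same routine would be called with the Wasserstein-based cost matrices of Definition 3.3 instead of Euclidean distances.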
The hierarchical FGW is a good substitute for the original FGW, imposing structural constraints while being more efficient computationally. Plugging the hierarchical FGW and its computation into (5), we apply Algorithm 1 to learn the proposed RAE. Note that, taking advantage of the Envelope Theorem (Afriat, 1971), we treat the optimal transport matrix as constant when applying backpropagation, reducing the computational complexity significantly. The optimal transport matrix maps the components of the posterior to those of the prior. Because the components of the posterior correspond to samples and the components of the prior correspond to clusters, this matrix indicates the clustering structure of the samples.

3.2 Deterministic autoencoder with sliced FGW
When the autoencoder is deterministic, its encoder outputs the latent codes corresponding to the observed samples. These latent codes can be viewed as samples of $q_\phi(z)$. For the prior $p(z)$, we can also generate samples with the help of the reparameterization trick. In such a situation, we estimate the FGW distance in (5) based on the samples of the two distributions. For two arbitrary metric measure spaces $(\mathcal{X}, d_{\mathcal{X}}, \mu_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \mu_{\mathcal{Y}})$, the empirical FGW between their samples $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{n}$ is

\hat{d}_{fgw} = \min_{T\in\Pi(\frac{1}{n}\mathbf{1}_n,\, \frac{1}{n}\mathbf{1}_n)}\ (1-\gamma) \sum_{i,j} c(x_i, y_j)\, T_{ij} + \gamma \sum_{i,j,i',j'} \big|\, d_{\mathcal{X}}(x_i, x_{i'}) - d_{\mathcal{Y}}(y_j, y_{j'}) \,\big|^{2}\, T_{ij}\, T_{i'j'}.   (11)
We can rewrite this empirical FGW in matrix form as in (7) and solve it by the proximal gradient method discussed above. When the samples are in a 1D space and the metric is the Euclidean distance, however, according to the sliced GW distance in (Titouan et al., 2019) and the sliced Wasserstein distance in (Kolouri et al., 2016), the optimal transport matrix corresponds to a permutation matrix, and the empirical FGW can be rewritten as:
\hat{d}_{fgw} = \min_{\sigma\in\mathrm{Perm}(n)}\ \frac{1-\gamma}{n} \sum_{i=1}^{n} (x_i - y_{\sigma(i)})^2 + \frac{\gamma}{n^2} \sum_{i,i'=1}^{n} \big(\, |x_i - x_{i'}| - |y_{\sigma(i)} - y_{\sigma(i')}| \,\big)^{2},   (12)

where $\mathrm{Perm}(n)$ is the set of all permutations of $\{1, \dots, n\}$. Without loss of generality, we assume the 1D samples are sorted, i.e., $x_1 \le \dots \le x_n$ and $y_1 \le \dots \le y_n$, and demonstrate that the solution of (12) is characterized by the following theorem.
Theorem 3.4.
For $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$, we denote their zero-mean translations as $\{\tilde{x}_i\}_{i=1}^{n}$ and $\{\tilde{y}_i\}_{i=1}^{n}$, respectively. The solution of (12) satisfies: 1) when the condition on $\{\tilde{x}_i\}$ and $\{\tilde{y}_i\}$ derived in Appendix B holds, the solution is the identity permutation, $\sigma(i) = i$; 2) otherwise, the solution is the anti-identity permutation, $\sigma(i) = n + 1 - i$.
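Theorem 3.4 suggests a simple recipe for the 1D case: sort both sample sets, evaluate the objective under the identity and the anti-identity permutations, and keep the smaller value. A minimal numpy sketch (the normalization of the two terms is an assumed convention) is:

```python
import numpy as np

def fgw_1d(x, y, gamma=0.5):
    """Empirical FGW between two equal-size 1D sample sets. Following
    Theorem 3.4, after sorting, the optimal permutation is either the
    identity or the anti-identity, so we evaluate both candidates."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    dx = np.abs(x[:, None] - x[None, :])

    def objective(yp):
        dy = np.abs(yp[:, None] - yp[None, :])
        w_term = np.mean((x - yp) ** 2)    # direct (Wasserstein) part
        gw_term = np.mean((dx - dy) ** 2)  # relational (GW) part
        return (1.0 - gamma) * w_term + gamma * gw_term

    return min(objective(y), objective(y[::-1]))
```

Evaluating two candidate permutations costs only $O(n^2)$, versus the factorial search over all permutations.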
The proof of Theorem 3.4 is provided in Appendix B. Consequently, for samples in a 1D space, we can calculate the empirical FGW distance efficiently via permuting the samples. To leverage this property for high-dimensional samples, we propose the following sliced FGW distance:
Definition 3.5 (Sliced FGW).
Let $\mathbb{S}^{D-1}$ be the $(D-1)$-dimensional unit hypersphere and $u_D$ the uniform measure on it. For each $z\in\mathbb{R}^{D}$, we denote its projection along a direction $s\in\mathbb{S}^{D-1}$ as $z_s = \langle z, s \rangle$, where $\langle\cdot,\cdot\rangle$ is the inner product. For $\mu_{\mathcal{X}}$ on $\mathbb{R}^{D_x}$ and $\mu_{\mathcal{Y}}$ on $\mathbb{R}^{D_y}$, we define their sliced fused Gromov-Wasserstein distance as

d_{sfgw}(\mu_{\mathcal{X}}, \mu_{\mathcal{Y}}) := \mathbb{E}_{s\sim u_{D_x},\, t\sim u_{D_y}}\big[\, d_{fgw}(\mu_{\mathcal{X},s},\, \mu_{\mathcal{Y},t}) \,\big],

where $\mu_{\mathcal{X},s}$ represents the distribution after the projection, and $d_{fgw}(\mu_{\mathcal{X},s}, \mu_{\mathcal{Y},t})$ is the FGW distance between $\mu_{\mathcal{X},s}$ and $\mu_{\mathcal{Y},t}$.
According to this definition, the sliced FGW projects the original metric measure spaces into 1D spaces and calculates the FGW distances between the projected spaces. The sliced FGW corresponds to the expectation of the FGW distances under different projections. We can approximate the sliced FGW distance based on samples of the distributions as well. In particular, given $\{x_i\}_{i=1}^{n}$ from $\mu_{\mathcal{X}}$, $\{y_j\}_{j=1}^{n}$ from $\mu_{\mathcal{Y}}$, and projections $\{(s_l, t_l)\}_{l=1}^{L}$, the empirical sliced FGW is
\hat{d}_{sfgw} = \frac{1}{L} \sum_{l=1}^{L} \hat{d}_{fgw}\big( \{\langle x_i, s_l \rangle\}_{i=1}^{n},\ \{\langle y_j, t_l \rangle\}_{j=1}^{n} \big),   (13)
where $\langle x_i, s_l \rangle$ represents the projected sample. Figure 2(c) further illustrates the principle of the sliced FGW distance. Replacing the empirical FGW with the empirical sliced FGW, we learn the relational regularized autoencoder via Algorithm 2.
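A minimal numpy sketch of this procedure (hypothetical names; it assumes equal batch sizes for the two sample sets and reuses the identity/anti-identity shortcut of Theorem 3.4 for each projection) might look as follows:

```python
import numpy as np

def sliced_fgw(X, Y, gamma=0.5, n_proj=50, seed=0):
    """Empirical sliced FGW in the spirit of (13): draw random
    directions on the unit spheres of the two (possibly different-
    dimensional) spaces, project, and average the 1D FGW values."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        s = rng.normal(size=X.shape[1]); s /= np.linalg.norm(s)
        t = rng.normal(size=Y.shape[1]); t /= np.linalg.norm(t)
        x, y = np.sort(X @ s), np.sort(Y @ t)
        dx = np.abs(x[:, None] - x[None, :])
        best = np.inf
        for yp in (y, y[::-1]):  # Theorem 3.4: two candidate permutations
            dy = np.abs(yp[:, None] - yp[None, :])
            obj = ((1.0 - gamma) * np.mean((x - yp) ** 2)
                   + gamma * np.mean((dx - dy) ** 2))
            best = min(best, obj)
        total += best
    return total / n_proj

# toy check: latent samples living in different dimensions
rng = np.random.default_rng(1)
d = sliced_fgw(rng.normal(size=(16, 3)), rng.normal(size=(16, 5)), n_proj=8)
```

Each projection reduces the problem to the 1D case of the previous subsection, which is why the overall cost stays quadratic in the batch size.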
3.3 Comparisons on computational complexity
Compared with calculating the empirical FGW distance directly, our hierarchical FGW and sliced FGW have much lower computational complexity. Following the notation in the previous two subsections, we denote the batch size as $n$, the number of Gaussian components in the prior as $K$, and the dimension of the latent code as $D$. If we apply the proximal gradient method in (Xu et al., 2019b) to calculate the empirical FGW directly, the computational complexity is $\mathcal{O}(n^3 + Jn^2)$ per update, where $J$ is the number of Sinkhorn iterations used in the algorithm. For our hierarchical FGW, we apply the proximal gradient method to a problem with a much smaller size (i.e., solving (7)), because $K \ll n$ in general; accordingly, the computational complexity becomes $\mathcal{O}(Kn^2 + JKn)$. For our sliced FGW, we apply $L$ random projections to project the latent codes to 1D spaces, whose complexity is $\mathcal{O}(LDn)$. For each pair of projected samples, we sort them with $\mathcal{O}(n\log n)$ operations and compute (12) with $\mathcal{O}(n^2)$ operations. Overall, the computational complexity of our sliced FGW is $\mathcal{O}(L(Dn + n^2))$. Because $L$ and $K$ are of the same order in general, the computational complexity of the sliced FGW is comparable to that of the hierarchical FGW.
4 Relational Co-Training of Autoencoders
Besides learning a single autoencoder, we can apply our relational regularization to learn multiple autoencoders. As shown in Figure 1(b), when learning two autoencoders we can penalize the GW distance between their posterior distributions, and accordingly the learning problem becomes:
\min_{\{\theta_k, \phi_k, p_k\}_{k=1}^{2}}\ \sum_{k=1}^{2} \Big( \mathbb{E}_{x\sim\mu_k}\, \mathbb{E}_{z\sim q_{\phi_k}(z|x)}\big[ d_k(x, G_{\theta_k}(z)) \big] + \beta (1-\gamma)\, D\big(q_{\phi_k}(z),\, p_k(z)\big) \Big) + \beta \gamma\, d_{gw}\big(q_{\phi_1}(z),\, q_{\phi_2}(z)\big).   (14)
The regularizer $D$ quantifies the discrepancy between each marginalized posterior $q_{\phi_k}(z)$ and its prior $p_k(z)$; the prior distributions can be predefined or treated as learnable parameters; $\gamma$ achieves a tradeoff between $D$ and the relational regularizer $d_{gw}$; and $\beta$ controls the overall significance of these two kinds of regularizers. When learning probabilistic autoencoders, we set $D$ to the hierarchical Wasserstein distance between GMMs (Chen et al., 2018) and approximate the relational regularizer by a hierarchical GW distance (the hierarchical FGW with $\gamma = 1$). When learning deterministic autoencoders, we set $D$ to the sliced Wasserstein distance used in (Kolouri et al., 2018) and approximate the relational regularizer via the sliced GW (Titouan et al., 2019) (the sliced FGW with $\gamma = 1$).
The main advantage of the proposed relational regularization is that it is applicable to co-training heterogeneous autoencoders. As shown in (14), the data used to train the autoencoders can come from different domains and have different data distributions. To fully capture the information in each domain, the autoencoders sometimes have heterogeneous architectures, and the corresponding latent codes lie in incomparable spaces, e.g., spaces with different dimensions. Taking the GW distance as the relational regularizer, we impose a constraint on the posterior distributions defined in the different latent spaces, encouraging structural similarity between them. This regularizer helps avoid over-regularization because it does not enforce a shared latent distribution across the domains. Moreover, the proposed regularizer is imposed on the posterior distributions; in other words, it does not require samples from different domains to be paired.
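To illustrate that a GW-based relational regularizer only needs intra-space distances, the following numpy sketch (a hypothetical simplification, not our training code) computes a GW coupling between latent samples of different dimensions:

```python
import numpy as np

def gw_coupling(C1, C2, a, b, tau=1.0, n_outer=30, n_sink=200):
    """Proximal-gradient sketch of the GW coupling between two latent
    spaces given only their intra-space distance matrices, so the
    spaces may have different dimensions (e.g., 2D vs. 3D codes)."""
    T = np.outer(a, b)
    for _ in range(n_outer):
        grad = (((C1 ** 2) @ a)[:, None] + ((C2 ** 2) @ b)[None, :]
                - 2.0 * C1 @ T @ C2.T)
        K = T * np.exp(-grad / tau)  # entropic proximal step
        u, v = np.ones_like(a), np.ones_like(b)
        for _ in range(n_sink):      # Sinkhorn projection onto Pi(a, b)
            u = a / (K @ v)
            v = b / (K.T @ u)
        T = u[:, None] * K * v[None, :]
    return T

# toy heterogeneous posteriors: 2D and 3D latent samples, unpaired sizes
rng = np.random.default_rng(0)
Z1 = rng.normal(size=(8, 2))   # latent codes of autoencoder 1
Z2 = rng.normal(size=(6, 3))   # latent codes of autoencoder 2
C1 = np.linalg.norm(Z1[:, None] - Z1[None, :], axis=-1)
C2 = np.linalg.norm(Z2[:, None] - Z2[None, :], axis=-1)
a, b = np.full(8, 1 / 8), np.full(6, 1 / 6)
T = gw_coupling(C1, C2, a, b)
```

No cross-space cost matrix appears anywhere, which is exactly why neither a shared latent space nor paired samples are required.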
According to the analysis above, our relational co-training strategy has potential for multi-view learning, especially in scenarios with unpaired samples. In particular, given the data in different domains, we first learn their latent codes via solving (14). We then concatenate the latent codes from the different domains and use the concatenation as the features for downstream learning tasks.
5 Related Work
Gromov-Wasserstein distance. The GW distance has been used as a metric for shape registration (Mémoli, 2009, 2011), vocabulary set alignment (Alvarez-Melis & Jaakkola, 2018), and graph matching (Chowdhury & Mémoli, 2018; Vayer et al., 2018b; Xu et al., 2019b). The work in (Peyré et al., 2016) proposes an entropy-regularized GW distance and calculates it based on Sinkhorn iterations (Cuturi, 2013). Following this direction, the work in (Xu et al., 2019b) replaces the entropy regularizer with a Bregman proximal term, and the work in (Xu, 2019) proposes an ADMM-based method to calculate the GW distance. To further reduce the computational complexity, the recursive GW distance (Xu et al., 2019a) and the sliced GW distance (Titouan et al., 2019) have been proposed. For generative models, the work in (Bunne et al., 2019) leverages the GW distance to learn coupled adversarial generative networks. However, none of the existing autoencoders consider using the GW distance as a regularizer.
Autoencoders. The principle of the autoencoder is to minimize the discrepancy between the data and model distributions. Common choices of the discrepancy include the KL divergence (Kingma & Welling, 2013; Dilokthanakul et al., 2016; Tomczak & Welling, 2018; Takahashi et al., 2019) and the Wasserstein distance (Tolstikhin et al., 2018; Kolouri et al., 2018), which lead to different learning algorithms. Our relational regularized autoencoder can be viewed as a new member of the Wasserstein autoencoder family. Compared with the MMD and the GAN loss used in the WAE (Tolstikhin et al., 2018), and the sliced Wasserstein distance used in (Kolouri et al., 2018), our FGW-based regularizer imposes relational constraints and allows the learning of an autoencoder with a structured prior distribution.
Co-training methods. For data in different domains, a commonly-used co-training strategy maps them to a shared latent space and encourages similarity between their latent codes. This strategy suppresses the risk of overfitting for each model and enhances their generalization power, achieving encouraging performance in multi-view learning (Kumar & Daumé, 2011; Chen & Denoyer, 2017; Sindhwani et al., 2005). However, it assumes that the latent codes follow the same distribution, which may lead to over-regularization. Additionally, it often requires well-aligned data, i.e., the samples in different domains must be paired. Our relational co-training strategy provides a potential solution that relaxes these restrictions in practical applications.
6 Experiments
6.1 Image generation
We test our relational regularized autoencoder (RAE) on image-generation tasks and compare it with the following alternatives: the variational autoencoder (VAE) (Kingma & Welling, 2013), the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018), the sliced Wasserstein autoencoder (SWAE) (Kolouri et al., 2018), the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016), and the VAE with VampPrior (Tomczak & Welling, 2018). Table 1 lists the main differences between our RAE and these baselines.
We test the methods on the MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015) datasets. For fairness, all the autoencoders have the same DCGAN-style architecture (Radford et al., 2015; Tolstikhin et al., 2018) and are learned with the same hyperparameters: the same learning rate, Adam optimizer (Kingma & Ba, 2014) settings, number of epochs, batch size, regularizer weight $\beta$, and latent-code dimension for each dataset. For the autoencoders with structured priors, we use the same number of Gaussian components across methods and initialize their prior distributions at random. For the proposed RAE, the hyperparameter $\gamma$ is chosen so that, empirically, the Wasserstein term and the GW term in our FGW distance have the same magnitude. The probabilistic RAE calculates the hierarchical FGW via the proximal gradient method, and the deterministic RAE calculates the sliced FGW with random projections. All the autoencoders use the Euclidean distance as the distance between samples, and accordingly the reconstruction loss is the mean-square error (MSE). We implement all the autoencoders in PyTorch and train them on a single NVIDIA GTX 1080 Ti GPU. More implementation details, e.g., the architecture of the autoencoders, are provided in Appendix C.

Table 1: Comparison between our RAE and the baseline autoencoders.
Method     Encoder                        Regularizer
VAE        Probabilistic                  KL
WAE        Deterministic                  MMD
SWAE       Deterministic                  Sliced Wasserstein
GMVAE      Probabilistic                  KL
VampPrior  Probabilistic                  KL
Our RAE    Probabilistic / Deterministic  Hierarchical FGW / Sliced FGW
Table 2: Performance of the various autoencoders.
Encoder         Method      MNIST                 CelebA
                            Rec. loss    FID      Rec. loss    FID
Probabilistic   VAE           16.60     156.11      96.36      59.99
                GMVAE         16.76      60.88     108.13     353.17
                VampPrior     22.41     127.81          —          —
                RAE           14.14      41.99      63.21      52.20
Deterministic   WAE            9.97      52.78      63.83      52.07
                SWAE          11.10      35.63      87.02      88.91
                RAE           10.37      49.39      64.49      51.45
For each dataset, we compare the proposed RAE with the baselines on i) the reconstruction loss on testing samples; and ii) the Fréchet Inception Distance (FID) between 10,000 testing samples and 10,000 randomly generated samples. We list the performance of the various autoencoders in Table 2. Among the probabilistic autoencoders, our RAE consistently achieves the best performance on both testing reconstruction loss and FID score. When learning deterministic autoencoders, our RAE is at least comparable to the considered alternatives on these measurements. Figure 3 compares the autoencoders on the convergence of their reconstruction loss. The convergence of our RAE is almost the same as that of state-of-the-art methods, which further verifies its feasibility.
For the autoencoders learning GMMs as their priors, we further compare them in conditional generation tasks, i.e., generating samples conditioned on specific Gaussian components. Figures 4 and 5 visualize the generation results of the various methods. For the MNIST dataset, the GMVAE, our probabilistic RAE, and our deterministic RAE achieve the desired generation results: the images conditioned on different Gaussian components correspond to different digits/writing styles. The VampPrior, however, suffers from severe mode collapse. The images conditioned on different Gaussian components are similar to each other and exhibit limited modes – most of them are “0”, “2”, “3”, and “8”. As shown in Table 1, for each Gaussian component of the prior, the VampPrior parameterizes it by passing a landmark through the encoder. Because the landmarks are in the sample space, this implicit model requires more parameters, making it sensitive to initialization with a high risk of overfitting. Figures 3(a) and 3(b) verify our claim: the testing loss of the VampPrior is unstable and does not converge well during training. For the CelebA dataset, the GMVAE fails to learn a GMM-based prior. As shown in Figure 5(a), the GMVAE trains a single Gaussian distribution while ignoring the remaining components; as a result, only one Gaussian component can generate face images. Our probabilistic and deterministic RAEs, by contrast, learn their GMM-based priors successfully. In particular, all the components of our probabilistic RAE can generate face images, but the components are indistinguishable. Our deterministic RAE achieves the best performance in this conditional generation task – different components generate semantically meaningful images with interpretable diversity. For each component, we add tags to highlight the semantic meaning. Visual comparisons of the various autoencoders on their reconstructed and generated samples are shown in Appendix D.
Table 3: Classification accuracy (%) on the four multi-view datasets.
Method            Data type   Caltech101-7   Caltech101-20   Handwritten   Cathgen
Independent AEs   Unpaired        56.76          32.22          40.50       64.60
AEs+CoReg         Paired          76.35          59.83          46.50       66.86
AEs+W             Unpaired        82.43          69.46          49.00       65.84
AEs+GW            Unpaired        84.46          69.46          71.50       66.93
6.2 Multi-view learning via co-training autoencoders
We test our relational co-training strategy on four multi-view learning datasets (Li et al., 2015) (available at https://github.com/yeqinglee/mvdata). Caltech101-7 is a subset of the Caltech101 dataset (Fei-Fei et al., 2004) with 1,474 images in 7 classes; each image is represented by Gabor features and Wavelet moments. Caltech101-20 is a subset of Caltech101 with 2,386 images in 20 classes; the features are the same as those of Caltech101-7. Handwritten is a dataset of 2,000 images of handwritten digits; each image has pixel-based features and Fourier-coefficient features. Cathgen is a real-world dataset of 8,000 patients; for each patient, we leverage clinical features and genetic features to predict the occurrence of myocardial infarction.

For each dataset, we use 80% of the data for training, 10% for validation, and the remaining 10% for testing. We test various multi-view learning methods. For each method, we first learn two autoencoders for the data in the two views in an unsupervised way, then concatenate the latent codes of the autoencoders as features and train a classifier based on softmax regression. When learning the autoencoders, our relational co-training method solves (14); the influence of $\gamma$ on the learning results is shown in Appendix D. For simplicity, we set the prior distributions in (14) as normal distributions. The autoencoders are probabilistic, and their encoders and decoders are MLPs; more implementation details, e.g., the dimension of the latent codes, are provided in Appendix C. We set $D$ to the hierarchical Wasserstein distance and the relational regularizer to the hierarchical GW distance. In addition to the proposed method, denoted AEs+GW, we consider the following baselines: i) learning two variational autoencoders independently (Independent AEs); ii) learning two variational autoencoders jointly with a least-square co-regularization (Sindhwani et al., 2005) (AEs+CoReg); iii) learning two autoencoders by replacing the $d_{gw}$ in (14) with a Wasserstein regularizer (AEs+W). AEs+CoReg penalizes the Euclidean distance between the latent codes from different views, which requires paired samples. The remaining methods penalize the discrepancy between the distributions of the latent codes, and are thus applicable to unpaired samples. The classification accuracy in Table 3 demonstrates the effectiveness of our relational co-training strategy, as the proposed method outperforms the baselines consistently across the different datasets.

7 Conclusions
A new framework has been proposed for learning autoencoders with relational regularization. Leveraging the GW distance, this framework allows the learning of structured prior distributions associated with the autoencoders and prevents under-regularization of the model. Besides learning a single autoencoder, the proposed relational regularizer is beneficial for co-training heterogeneous autoencoders. In the future, we plan to make the relational regularizer applicable to co-training more than two autoencoders and to further reduce its computational complexity.
References
 Afriat (1971) Afriat, S. Theory of maxima and the method of Lagrange. SIAM Journal on Applied Mathematics, 20(3):343–357, 1971.

 Alvarez-Melis & Jaakkola (2018) Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1881–1890, 2018.
 Bowman et al. (2016) Bowman, S., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21, 2016.
 Bunne et al. (2019) Bunne, C., AlvarezMelis, D., Krause, A., and Jegelka, S. Learning generative models across incomparable spaces. In International Conference on Machine Learning, pp. 851–861, 2019.
 Chen & Denoyer (2017) Chen, M. and Denoyer, L. Multi-view generative adversarial networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 175–188. Springer, 2017.
 Chen et al. (2018) Chen, Y., Georgiou, T. T., and Tannenbaum, A. Optimal transport for Gaussian mixture models. IEEE Access, 7:6269–6278, 2018.
 Chowdhury & Mémoli (2018) Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.
 Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300, 2013.
 Dilokthanakul et al. (2016) Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
 Fei-Fei et al. (2004) Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, pp. 178–178. IEEE, 2004.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kolouri et al. (2016)
Kolouri, S., Zou, Y., and Rohde, G. K.
Sliced Wasserstein kernels for probability distributions.
InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 5258–5267, 2016.  Kolouri et al. (2018) Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Slicedwasserstein autoencoder: an embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.

Kumar & Daumé (2011)
Kumar, A. and Daumé, H.
A cotraining approach for multiview spectral clustering.
In Proceedings of the 28th International Conference on Machine Learning (ICML11), pp. 393–400, 2011.  LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2019) Lee, J., Dabagia, M., Dyer, E., and Rozell, C. Hierarchical optimal transport for multimodal distribution alignment. In Advances in Neural Information Processing Systems, pp. 13453–13463, 2019.
 Li et al. (2015) Li, Y., Nie, F., Huang, H., and Huang, J. Large-scale multi-view spectral clustering via bipartite graph. In AAAI, 2015.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, 2015.
 Mémoli (2009) Mémoli, F. Spectral Gromov-Wasserstein distances for shape matching. In ICCV Workshops, pp. 256–263, 2009.
 Mémoli (2011) Mémoli, F. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.
 Peyré et al. (2016) Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pp. 2664–2672, 2016.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Sindhwani et al. (2005) Sindhwani, V., Niyogi, P., and Belkin, M. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML workshop on learning with multiple views, volume 2005, pp. 74–79, 2005.
 Takahashi et al. (2019) Takahashi, H., Iwata, T., Yamanaka, Y., Yamada, M., and Yagi, S. Variational autoencoder with implicit optimal priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5066–5073, 2019.
 Titouan et al. (2019) Titouan, V., Flamary, R., Courty, N., Tavenard, R., and Chapel, L. Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems, pp. 14726–14736, 2019.
 Tolstikhin et al. (2018) Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein autoencoders. In International Conference on Learning Representations (ICLR 2018). OpenReview.net, 2018.
 Tomczak & Welling (2018) Tomczak, J. and Welling, M. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, 2018.
 Vayer et al. (2018a) Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018a.
 Vayer et al. (2018b) Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data. arXiv preprint arXiv:1805.09114, 2018b.
 Wang et al. (2016) Wang, S., Ding, Z., and Fu, Y. Coupled marginalized autoencoders for cross-domain multi-view learning. In IJCAI, pp. 2125–2131, 2016.
 Wang et al. (2015) Wang, W., Arora, R., Livescu, K., and Bilmes, J. On deep multi-view representation learning. In International Conference on Machine Learning, pp. 1083–1092, 2015.
 Wang et al. (2019) Wang, W., Gan, Z., Xu, H., Zhang, R., Wang, G., Shen, D., Chen, C., and Carin, L. Topic-guided variational autoencoder for text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 166–177, 2019.
 Xu (2019) Xu, H. Gromov-Wasserstein factorization models for graph clustering. arXiv preprint arXiv:1911.08530, 2019.
 Xu et al. (2019a) Xu, H., Luo, D., and Carin, L. Scalable Gromov-Wasserstein learning for graph partitioning and matching. arXiv preprint arXiv:1905.07645, 2019a.
 Xu et al. (2019b) Xu, H., Luo, D., Zha, H., and Duke, L. C. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, pp. 6932–6941, 2019b.
 Ye et al. (2016) Ye, T., Wang, T., McGuinness, K., Guo, Y., and Gurrin, C. Learning multiple views with orthogonal denoising autoencoders. In International Conference on Multimedia Modeling, pp. 313–324. Springer, 2016.
 Yurochkin et al. (2019) Yurochkin, M., Claici, S., Chien, E., Mirzazadeh, F., and Solomon, J. Hierarchical optimal transport for document representation. arXiv preprint arXiv:1906.10827, 2019.
Appendix A The proximal gradient method
Both the hierarchical FGW in (6) and the empirical FGW in (11) can be rewritten in matrix format. As shown in (7), the calculation of the distance corresponds to solving the following nonconvex optimization problem:
(15) $\min_{\mathbf{T}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\ \langle \mathbf{C}(\mathbf{T}),\ \mathbf{T}\rangle$,
where $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ are predefined discrete distributions, $\Pi(\boldsymbol{\mu},\boldsymbol{\nu})=\{\mathbf{T}\geq \mathbf{0}\,|\,\mathbf{T}\mathbf{1}=\boldsymbol{\mu},\ \mathbf{T}^{\top}\mathbf{1}=\boldsymbol{\nu}\}$, and $\mathbf{C}(\mathbf{T})$ denotes the FGW cost matrix, which itself depends on the transport matrix $\mathbf{T}$. This problem can be solved iteratively by the following proximal gradient method (Xu et al., 2019b). In each $k$-th iteration, given the current estimate $\mathbf{T}^{(k)}$, we consider the following subproblem with a proximal term:
(16) $\mathbf{T}^{(k+1)} = \arg\min_{\mathbf{T}\in\Pi(\boldsymbol{\mu},\boldsymbol{\nu})}\ \langle \mathbf{C}(\mathbf{T}^{(k)}),\ \mathbf{T}\rangle + \gamma\,\mathrm{KL}(\mathbf{T}\,\|\,\mathbf{T}^{(k)})$.
This subproblem can be solved easily via Sinkhorn iterations (Cuturi, 2013). The details of the algorithm are shown in Algorithm 3, which we apply when learning the probabilistic RAE (PRAE). The hyperparameter $\gamma$ is set adaptively: in each iteration, given the cost matrix $\mathbf{C}(\mathbf{T}^{(k)})$, we scale $\gamma$ with the magnitude of its elements. This setting improves the numerical stability when calculating the Sinkhorn kernel $\exp(-\mathbf{C}/\gamma)$ in Algorithm 3. When learning the deterministic RAE (DRAE), we apply the sliced FGW distance with random projections, such that the training times of the DRAE and the PRAE are comparable.
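Algorithm 3 is not reproduced here. As an illustration, the proximal point iteration for a squared-loss Gromov-Wasserstein term can be sketched as follows; the cost construction, the adaptive choice of $\gamma$, and all function names are assumptions for illustration rather than the paper's exact implementation:

```python
import numpy as np

def sinkhorn_prox(C, mu, nu, T_prev, gamma, n_iter=200):
    """Solve min_T <C, T> + gamma * KL(T || T_prev) over Pi(mu, nu)
    via Sinkhorn scaling on the kernel K = T_prev * exp(-C / gamma)."""
    K = T_prev * np.exp(-C / gamma)
    a = np.ones_like(mu)
    for _ in range(n_iter):
        b = nu / (K.T @ a)  # match column marginals
        a = mu / (K @ b)    # match row marginals
    return a[:, None] * K * b[None, :]

def proximal_gw(Dx, Dy, mu, nu, outer=20, gamma_scale=0.1):
    """Proximal gradient method for the squared-loss GW problem between
    two metric-measure spaces given by distance matrices Dx, Dy."""
    N, M = len(mu), len(nu)
    T = np.outer(mu, nu)  # initialize with the independent coupling
    for _ in range(outer):
        # Linearized GW cost at the current coupling (Peyre et al., 2016):
        # L(T) = Dx^2 mu 1_M^T + 1_N nu^T (Dy^2)^T - 2 Dx T Dy^T
        L = (Dx ** 2) @ np.outer(mu, np.ones(M)) \
            + np.outer(np.ones(N), nu) @ (Dy ** 2).T \
            - 2.0 * Dx @ T @ Dy.T
        gamma = gamma_scale * np.abs(L).max()  # adaptive gamma (assumed heuristic)
        T = sinkhorn_prox(L, mu, nu, T, gamma)
    return T
```

Each outer iteration linearizes the GW objective at the current coupling and solves the resulting KL-proximal subproblem by Sinkhorn scaling; scaling $\gamma$ with the magnitude of the cost matrix keeps the entries of $\exp(-\mathbf{C}/\gamma)$ in a numerically safe range.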
Appendix B The Proof of Theorem 3.4
Theorem B.1.
For $\mathbf{x}=[x_1,\dots,x_N]^{\top}$ and $\mathbf{y}=[y_1,\dots,y_N]^{\top}$, where $x_1\leq\cdots\leq x_N$ and $y_1\leq\cdots\leq y_N$, the solution of the problem
(17) $\hat{\sigma}=\arg\min_{\sigma\in\mathcal{S}_N}\ (1-\alpha)\sum_{n=1}^{N}\big(x_n-y_{\sigma(n)}\big)^2+\alpha\sum_{n=1}^{N}\sum_{n'=1}^{N}\big(|x_n-x_{n'}|-|y_{\sigma(n)}-y_{\sigma(n')}|\big)^2$,
where $\mathcal{S}_N$ is the set of all permutations of $\{1,\dots,N\}$, is invariant with respect to any translations of $\mathbf{x}$ and $\mathbf{y}$.
Proof.
Translating $\mathbf{x}$ and $\mathbf{y}$ by constants $c_1$ and $c_2$ leaves every pairwise difference $x_n-x_{n'}$ and $y_{\sigma(n)}-y_{\sigma(n')}$ unchanged, so the second term is unaffected. For the first term, $\sum_n(x_n+c_1-y_{\sigma(n)}-c_2)^2 = \sum_n(x_n-y_{\sigma(n)})^2 + 2(c_1-c_2)\sum_n(x_n-y_{\sigma(n)}) + N(c_1-c_2)^2$, and the added terms do not depend on $\sigma$ because $\sum_n y_{\sigma(n)}=\sum_n y_n$. Hence the minimizer $\hat{\sigma}$ is unchanged. ∎
Theorem B.2.
Following the notation in Theorem B.1, for $\mathbf{x}$ and $\mathbf{y}$ we denote their zero-mean translations as $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$, respectively. The solution $\hat{\sigma}$ satisfies:
1) When $\hat{\mathbf{x}}^{\top}\hat{\mathbf{y}}\geq 0$, the solution is the identity permutation $\hat{\sigma}(n)=n$.
2) Otherwise, the solution is the anti-identity permutation $\hat{\sigma}(n)=N+1-n$.
Proof.
The proposed problem is equivalent to maximizing the permutation-dependent cross terms obtained by expanding the squares in (17); the claim then follows from the rearrangement inequality applied to the sorted sequences. ∎
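As a numerical sanity check on Theorem B.1, the translation invariance of the minimizer can be verified by brute force over all permutations. The objective below, including the trade-off weight `alpha`, is an assumed 1-D fused Gromov-Wasserstein form of problem (17), written for illustration:

```python
import itertools
import numpy as np

def fgw_1d_cost(x, y, sigma, alpha=0.5):
    """Cost of permutation sigma under an assumed 1-D fused GW objective:
    a Wasserstein term on matched points plus a GW term comparing
    pairwise distances (a sketch of problem (17), not the exact form)."""
    ys = y[list(sigma)]
    w = np.sum((x - ys) ** 2)
    Dx = np.abs(x[:, None] - x[None, :])
    Dy = np.abs(ys[:, None] - ys[None, :])
    gw = np.sum((Dx - Dy) ** 2)
    return (1 - alpha) * w + alpha * gw

def best_permutation(x, y, alpha=0.5):
    """Brute-force argmin over all N! permutations (feasible for small N)."""
    return min(itertools.permutations(range(len(x))),
               key=lambda s: fgw_1d_cost(x, y, s, alpha))
```

Translating `x` and `y` by arbitrary constants shifts every permutation's cost by the same amount, so the brute-force argmin is unchanged, which is exactly the content of Theorem B.1.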