Learning Autoencoders with Relational Regularization

02/07/2020 · Hongteng Xu et al.

A new algorithmic framework is proposed for learning autoencoders of data distributions. We minimize the discrepancy between the model and target distributions, with a relational regularization on the learnable latent prior. This regularization penalizes the fused Gromov-Wasserstein (FGW) distance between the latent prior and its corresponding posterior, allowing one to flexibly learn a structured prior distribution associated with the generative model. Moreover, it helps co-training of multiple autoencoders even if they have heterogeneous architectures and incomparable latent spaces. We implement the framework with two scalable algorithms, making it applicable for both probabilistic and deterministic autoencoders. Our relational regularized autoencoder (RAE) outperforms existing methods, e.g., the variational autoencoder, Wasserstein autoencoder, and their variants, on generating images. Additionally, our relational co-training strategy for autoencoders achieves encouraging results in both synthesis and real-world multi-view learning tasks.


1 Introduction

Autoencoders have been used widely in many challenging machine learning tasks for generative modeling, e.g., image (Kingma & Welling, 2013; Tolstikhin et al., 2018) and sentence (Bowman et al., 2016; Wang et al., 2019) generation. Typically, an autoencoder assumes that the data in the sample space can be mapped to a low-dimensional manifold represented in a latent space. The autoencoder fits the unknown data distribution via a latent-variable model, specified by a prior distribution on the latent code and a generative model mapping the latent code to the data. Learning seeks to minimize the discrepancy between the model distribution and the data distribution. Depending on the choice of this discrepancy, we can derive different autoencoders. For example, the variational autoencoder (VAE) (Kingma & Welling, 2013) applies the KL divergence as the discrepancy and learns a probabilistic autoencoder by maximizing the evidence lower bound (ELBO). The Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) minimizes a relaxed form of the Wasserstein distance between the model and data distributions and learns a deterministic autoencoder. In general, the objective function approximating the discrepancy consists of a reconstruction loss on the observed data and a regularizer penalizing the difference between the prior distribution and the posterior derived from the encoded data. Although existing autoencoders have achieved success in many generative tasks, they often suffer from the following two problems.

Regularizer misspecification Typical autoencoders like the VAE and WAE fix the prior as a normal distribution, which often leads to over-regularization. Moreover, applying such an unstructured prior increases the difficulty of conditional generation tasks. To avoid oversimplified priors, the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016) and the VAE with VampPrior (Tomczak & Welling, 2018) characterize their priors as learnable mixture models. However, without side information (Wang et al., 2019), jointly learning the autoencoder and the prior carries a high risk of under-regularization and is sensitive to the setting of hyperparameters (e.g., the number of mixture components and the initialization of the prior).

Co-training of heterogeneous autoencoders Solving a single task often relies on data from different domains (e.g., multi-view data). For example, predicting the mortality of a patient may require both her clinical record and genetic information. In such a situation, we may need to learn multiple autoencoders to extract latent variables as features from the different views. Traditional multi-view learning strategies either assume that the co-trained autoencoders share the same latent distribution (Wang et al., 2015; Ye et al., 2016) or assume that there exists an explicit transform between the different latent spaces (Wang et al., 2016). These assumptions are questionable in practice, as the corresponding autoencoders can have heterogeneous architectures and incomparable latent spaces. How to co-train such heterogeneous autoencoders is still an open problem.

To overcome the aforementioned problems, we propose a new Relational regularized AutoEncoder (RAE). As illustrated in Figure 1(a), we formulate the prior as a Gaussian mixture model. Differing from existing methods, however, we leverage the Gromov-Wasserstein (GW) distance (Mémoli, 2011) to regularize the structural difference between the prior and the posterior in a relational manner, i.e., comparing the distances between samples drawn from the prior with the distances between samples drawn from the posterior, and restricting their difference. This relational regularizer allows us to implement the discrepancy between the posterior and the prior as the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a). Besides imposing structural constraints on the prior distribution within a single autoencoder, for multiple autoencoders with different latent spaces (e.g., the 2D and 3D latent spaces shown in Figure 1(b)), we can train them jointly by applying the relational regularizer to their posterior distributions.

The proposed relational regularizer is applicable to both probabilistic and deterministic autoencoders, corresponding to approximating the FGW distance by a hierarchical FGW and a sliced FGW, respectively. We discuss the rationale behind these two approximations and analyze their computational complexity. Experimental results show that 1) learning RAEs helps achieve structured prior distributions and suppresses the under-regularization problem, outperforming related approaches in image-generation tasks; and 2) the proposed relational co-training strategy is beneficial for learning heterogeneous autoencoders, showing potential for multi-view learning tasks.

2 Relational Regularized Autoencoders

2.1 Learning mixture models as structured prior

Following prior work on autoencoders (Tolstikhin et al., 2018; Kolouri et al., 2018), we fit the model distribution by minimizing its Wasserstein distance to the data distribution. According to Theorem 1 in (Tolstikhin et al., 2018), we can relax the Wasserstein distance and formulate the learning problem as follows:

$\min_{\theta,\phi}\ \mathbb{E}_{x\sim p_{\text{data}}}\,\mathbb{E}_{z\sim q_{\phi}(z|x)}\big[d\big(x,\,G_{\theta}(z)\big)\big]\;+\;\gamma\,D\big(q_{\phi}(z),\,p(z)\big) \quad (1)$

where $G_{\theta}$ is the generative model (decoder) and $q_{\phi}(z|x)$ is the posterior of the latent code $z$ given the data $x$, parameterized by an encoder. Accordingly, $q_{\phi}(z)=\mathbb{E}_{x\sim p_{\text{data}}}[q_{\phi}(z|x)]$ is the marginal distribution derived from the posterior; $d(\cdot,\cdot)$ represents the distance between samples, and $D(\cdot,\cdot)$ is an arbitrary discrepancy between distributions. The parameter $\gamma$ controls the trade-off between the reconstruction loss and the regularizer.

Instead of fixing the prior $p(z)$ as a normal distribution, we seek to learn a structured prior associated with the autoencoder:

$\min_{\theta,\phi}\ \min_{p(z)\in\mathcal{P}}\ \mathbb{E}_{x\sim p_{\text{data}}}\,\mathbb{E}_{z\sim q_{\phi}(z|x)}\big[d\big(x,\,G_{\theta}(z)\big)\big]\;+\;\gamma\,D\big(q_{\phi}(z),\,p(z)\big) \quad (2)$

where $\mathcal{P}$ is the set of valid prior distributions, which is often assumed to be a set of (Gaussian) mixture models (Dilokthanakul et al., 2016; Tomczak & Welling, 2018). Learning a structured prior allows us to explore the clustering structure of the data and achieve conditional generation (e.g., sampling latent variables from a single component of the prior and generating samples accordingly).
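To make the learnable structured prior concrete, below is a minimal PyTorch sketch of a Gaussian-mixture prior with diagonal covariances; the module name, the uniform mixture weights, and the sampling routine are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GMMPrior(nn.Module):
    """A learnable Gaussian-mixture prior with diagonal covariances (illustrative sketch)."""
    def __init__(self, n_components: int, latent_dim: int):
        super().__init__()
        # Component means and log standard deviations are learned jointly with the autoencoder.
        self.means = nn.Parameter(0.1 * torch.randn(n_components, latent_dim))
        self.log_stds = nn.Parameter(torch.zeros(n_components, latent_dim))
        # Uniform mixture weights (an assumption; they could also be made learnable).
        self.register_buffer("weights", torch.full((n_components,), 1.0 / n_components))

    def sample(self, n_samples: int) -> torch.Tensor:
        """Draw latent codes by picking a component and applying the reparameterization trick."""
        idx = torch.multinomial(self.weights, n_samples, replacement=True)
        mu, sigma = self.means[idx], self.log_stds[idx].exp()
        return mu + sigma * torch.randn_like(mu)

prior = GMMPrior(n_components=10, latent_dim=8)
z = prior.sample(64)  # latent codes usable by the decoder
```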

(a) Proposed RAE
(b) Relational Co-training
Figure 1: (a) Learning a single autoencoder with relational regularization. (b) Relational co-training of the autoencoders with incomparable latent spaces.

2.2 Relational regularization via Gromov-Wasserstein

Jointly learning the prior and the autoencoder may lead to under-regularization in the training phase: it is easy for the learnable prior to fit the posterior without harming the reconstruction loss, so the regularizer imposes little constraint on the model. Solving this problem requires introducing structural constraints when comparing these two distributions, which motivates a relational regularized autoencoder (RAE). In particular, besides commonly-used regularizers like the KL divergence (Dilokthanakul et al., 2016) and the Wasserstein distance (Titouan et al., 2019), which compare the two distributions directly, we consider a relational regularizer based on the Gromov-Wasserstein (GW) distance (Mémoli, 2011) in our learning problem:

$\min_{\theta,\phi}\ \min_{p(z)\in\mathcal{P}}\ \mathbb{E}_{x\sim p_{\text{data}}}\,\mathbb{E}_{z\sim q_{\phi}(z|x)}\big[d\big(x,\,G_{\theta}(z)\big)\big]\;+\;\gamma\,\Big(D\big(q_{\phi}(z),\,p(z)\big)\;+\;\beta\, d_{\mathrm{GW}}\big(q_{\phi}(z),\,p(z)\big)\Big) \quad (3)$

where $\beta$ controls the trade-off between the two regularizers, and $d_{\mathrm{GW}}$ is the GW distance defined as follows.

Definition 2.1.

Let $(\mathcal{X}, d_{\mathcal{X}}, \mu_{\mathcal{X}})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \mu_{\mathcal{Y}})$ be two metric measure spaces, where $(\mathcal{X}, d_{\mathcal{X}})$ is a compact metric space and $\mu_{\mathcal{X}}$ is a probability measure on $\mathcal{X}$ (with $(\mathcal{Y}, d_{\mathcal{Y}}, \mu_{\mathcal{Y}})$ defined in the same way). The Gromov-Wasserstein distance between $\mu_{\mathcal{X}}$ and $\mu_{\mathcal{Y}}$ is defined as

$d_{\mathrm{GW}}(\mu_{\mathcal{X}},\mu_{\mathcal{Y}}) \;=\; \inf_{\pi\in\Pi(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})}\ \mathbb{E}_{(x,y)\sim\pi,\ (x',y')\sim\pi}\big[L\big(x,x',y,y'\big)\big],$

where $L(x,x',y,y')=|d_{\mathcal{X}}(x,x')-d_{\mathcal{Y}}(y,y')|^{2}$, and $\Pi(\mu_{\mathcal{X}},\mu_{\mathcal{Y}})$ is the set of all probability measures on $\mathcal{X}\times\mathcal{Y}$ with $\mu_{\mathcal{X}}$ and $\mu_{\mathcal{Y}}$ as marginals.

The loss $L$ defines a relational loss, comparing the difference between pairs of samples drawn from the two distributions. Accordingly, the GW distance corresponds to the minimum expectation of the relational loss, and the optimal joint distribution $\pi^{*}$ achieving this minimum is called the optimal transport between the two distributions.
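To illustrate Definition 2.1 for empirical distributions, the sketch below evaluates the relational (GW) cost of a given coupling between two sample sets; the quadratic expansion used to avoid a four-index loop follows standard practice, and the uniform coupling in the demo is only a placeholder, not an optimal transport plan.

```python
import torch

def gw_cost(Cx: torch.Tensor, Cy: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Relational cost sum_{i,k,j,l} (Cx[i,k] - Cy[j,l])**2 * T[i,j] * T[k,l] for a coupling T.

    Cx, Cy are pairwise-distance matrices within each space; T is a coupling whose
    marginals are the two empirical distributions. Expanding (a - b)^2 avoids the
    explicit four-index sum.
    """
    p = T.sum(dim=1)  # marginal over the first space
    q = T.sum(dim=0)  # marginal over the second space
    term1 = ((Cx ** 2) @ p) @ p              # expectation of Cx^2 under the first marginal (twice)
    term2 = ((Cy ** 2) @ q) @ q              # expectation of Cy^2 under the second marginal (twice)
    cross = torch.sum((Cx @ T @ Cy.T) * T)   # cross term under the coupling
    return term1 + term2 - 2.0 * cross

# Demo with a uniform (non-optimized) coupling between two small point clouds.
x, y = torch.randn(5, 2), torch.randn(4, 3)
Cx, Cy = torch.cdist(x, x), torch.cdist(y, y)
T = torch.full((5, 4), 1.0 / (5 * 4))
print(gw_cost(Cx, Cy, T))
```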

The GW term in (3) penalizes the structural difference between the two distributions, mutually enhancing the clustering structure of the prior and that of the posterior. We prefer the GW distance for implementing the relational regularizer because it can be readily combined with existing regularizers, allowing the design of scalable learning algorithms. In particular, when the direct regularizer is the Wasserstein distance (Titouan et al., 2019), we can combine it with the GW term and derive a new regularizer as follows:

(4)

where the ground cost is a direct loss function between the two spaces. The new regularizer enforces a shared optimal transport for the Wasserstein and the Gromov-Wasserstein terms, which corresponds to the fused Gromov-Wasserstein (FGW) distance (Vayer et al., 2018a) between the distributions. The rationale for this combination has two aspects. First, the optimal transport indicates the correspondence between the two spaces (Mémoli, 2011; Xu et al., 2019b); in the following section, we show that this optimal transport maps encoded data to the clusters defined by the prior, so enforcing a shared optimal transport helps ensure the consistency of the clustering structure. Second, as shown in (4), the FGW regularizer is no smaller than the sum of the separately minimized Wasserstein and GW regularizers, because both terms are restricted to share one optimal transport. Replacing the regularizers in (3) with the FGW regularizer therefore minimizes an upper bound of the original objective function, which is useful from the viewpoint of optimization.

Therefore, we learn an autoencoder with relational regularization by solving the following optimization problem:

$\min_{\theta,\phi}\ \min_{p(z)\in\mathcal{P}}\ \mathbb{E}_{x\sim p_{\text{data}}}\,\mathbb{E}_{z\sim q_{\phi}(z|x)}\big[d\big(x,\,G_{\theta}(z)\big)\big]\;+\;\gamma\, d_{\mathrm{FGW}}\big(q_{\phi}(z),\,p(z)\big) \quad (5)$

where the prior is parameterized as a Gaussian mixture model (GMM) with K Gaussian components, and the probability of each component is set to 1/K. The autoencoder can be either probabilistic or deterministic, leading to different learning algorithms.

3 Learning algorithms

3.1 Probabilistic autoencoder with hierarchical FGW

When the autoencoder is probabilistic, for each sample the encoder outputs the mean and the logarithmic variance of the corresponding Gaussian posterior. Accordingly, the marginal distribution derived from the posterior becomes a GMM as well, with the number of components equal to the batch size, and the regularizer corresponds to the FGW distance between two GMMs. Inspired by the hierarchical Wasserstein distance (Chen et al., 2018; Yurochkin et al., 2019; Lee et al., 2019), we leverage the structure of the GMMs and propose a hierarchical FGW distance to replace the original regularizer. In particular, given two GMMs, we define the hierarchical FGW distance between them as follows.

Definition 3.1 (Hierarchical FGW).

Let two GMMs be given, whose components are D-dimensional Gaussian distributions and whose mixture weights define discrete distributions over the components. The hierarchical fused Gromov-Wasserstein distance between these two GMMs is

(6)

As shown in (6), the hierarchical FGW corresponds to an FGW distance between the distributions of the Gaussian components, whose ground distance is the Wasserstein distance between the Gaussian components. Figures 2(a) and 2(b) further illustrate the difference between the FGW and our hierarchical FGW: for two GMMs, instead of computing the optimal transport between them in the sample space, the hierarchical FGW builds the optimal transport between their Gaussian components. Additionally, we have

Proposition 3.2.

when for all the Gaussian components.

(a) FGW
(b) Hierarchical FGW
(c) Sliced FGW
Figure 2: Illustrations of FGW, hierarchical FGW, and sliced FGW. The black solid arrow represents the distance between samples while the black dotted arrow represents the relational loss between sample pairs. In (a, c), the red and blue arrows represent the Euclidean distance between samples. In (b), the red and blue arrows represent the Wasserstein distance between Gaussian components.

Replacing the FGW with the hierarchical FGW, we convert an optimization problem over a continuous coupling (the one in (4)) into a much simpler optimization problem over a discrete coupling (the one in (6)). Rewriting (6) in matrix form, we compute the hierarchical FGW distance by solving the following non-convex optimization problem:

(7)

where ⟨·,·⟩ denotes the inner product between matrices and the marginal constraints are written with all-one vectors of the appropriate dimensions. The optimal transport matrix is a joint distribution of the Gaussian components of the two GMMs; the ground-cost matrix collects the Wasserstein distances between the Gaussian components, and the relational cost matrix is given by

(8)

where ⊙ represents the Hadamard product. The Wasserstein distance between Gaussian distributions has a closed-form solution:

Definition 3.3.

Let $\mathcal{N}(m_1,\Sigma_1)$ and $\mathcal{N}(m_2,\Sigma_2)$ be two D-dimensional Gaussian distributions, where $m_i$ and $\Sigma_i$ represent the mean and the covariance matrix, respectively. The squared 2-Wasserstein distance between them is

$W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\big)=\|m_1-m_2\|_2^2+\mathrm{tr}\Big(\Sigma_1+\Sigma_2-2\big(\Sigma_1^{1/2}\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\Big). \quad (9)$

When the covariance matrices are diagonal, i.e., $\Sigma_i=\mathrm{diag}(\sigma_i^2)$, where $\sigma_i$ is the vector of standard deviations, (9) can be rewritten as

$W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\big)=\|m_1-m_2\|_2^2+\|\sigma_1-\sigma_2\|_2^2. \quad (10)$

We solve (7) via the proximal gradient method in (Xu et al., 2019b), with further details in Appendix A.
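For reference, a short sketch of the closed form in (10) for diagonal-covariance Gaussians, which matches what the encoder of the probabilistic RAE outputs; the function names and the batched interface are our own choices.

```python
import torch

def w2_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal-covariance Gaussians, as in (10).

    mu*, sigma*: tensors of shape (..., d) holding means and standard deviations.
    For diagonal covariances, the trace term in (9) reduces to the squared difference
    of the standard-deviation vectors.
    """
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

def pairwise_w2(mu_q, sigma_q, mu_p, sigma_p):
    """Pairwise distances between posterior components (rows) and prior components (columns),
    usable as the ground-cost matrix in (7)."""
    return w2_diag_gaussians(
        mu_q[:, None, :], sigma_q[:, None, :],
        mu_p[None, :, :], sigma_p[None, :, :],
    )

mu_q, sigma_q = torch.randn(32, 8), torch.rand(32, 8) + 0.1  # batch of Gaussian posteriors
mu_p, sigma_p = torch.randn(10, 8), torch.rand(10, 8) + 0.1  # K prior components
D = pairwise_w2(mu_q, sigma_q, mu_p, sigma_p)                # shape (32, 10)
```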

The hierarchical FGW is a good substitute for the original FGW, imposing structural constraints while being more efficient computationally. Plugging the hierarchical FGW and its computation into (5), we apply Algorithm 1 to learn the proposed RAE. Note that, taking advantage of the Envelope Theorem (Afriat, 1971), we treat the optimal transport matrix as a constant when applying backpropagation, which reduces the computational complexity significantly. The optimal transport matrix maps the components of the posterior GMM to those of the prior GMM. Because the components of the posterior correspond to samples and the components of the prior correspond to clusters, this matrix indicates the clustering structure of the samples.

1:  Input Samples in
2:  Output The autoencoder and the prior with Gaussian components .
3:  for each epoch
4:   for each batch of samples
5:     for .
6:    Reparameterize z = μ + σ ⊙ ε, where ε ~ N(0, I).
7:    .
8:    Calculate the Wasserstein distances between the Gaussian components via (10), and calculate the relational cost via (8).
9:    Obtain the optimal transport matrix via solving (7).
10:    .
11:    .
12:    Update the model parameters.
Algorithm 1 Learning RAE with hierarchical FGW

3.2 Deterministic autoencoder with sliced FGW

When the autoencoder is deterministic, its encoder outputs the latent codes corresponding to the observed samples. These latent codes can be viewed as samples of the marginal posterior. For the prior, we can also generate samples with the help of the reparameterization trick. In this situation, we estimate the FGW distance in (5) based on the samples of the two distributions. For two arbitrary metric measure spaces, the empirical FGW between their samples is

(11)

We can rewrite this empirical FGW in matrix form as in (7) and solve it by the proximal gradient method discussed above. When the samples lie in a 1D space and the metric is the Euclidean distance, however, according to the sliced GW distance in (Titouan et al., 2019) and the sliced Wasserstein distance in (Kolouri et al., 2016), the optimal transport matrix corresponds to a permutation matrix, and the empirical FGW can be rewritten as:

(12)

where the feasible set is the set of all permutations of the sample indices. Without loss of generality, we assume the 1D samples are sorted in ascending order, and show that the solution of (12) is characterized by the following theorem.

Theorem 3.4.

For , we denote their zero-mean translations as and , respectively. The solution of (12) satisfies: 1) When , the solution is the identity permutation . 2) Otherwise, the solution is the anti-identity permutation .

The proof of Theorem 3.4 is provided in Appendix B. Consequently, for the samples in 1D space, we can calculate the empirical FGW distance efficiently via permuting the samples. To leverage this property for high-dimensional samples, we propose the following sliced FGW distance:

Definition 3.5 (Sliced FGW).

Let $\mathbb{S}^{d-1}$ be the (d-1)-dimensional hypersphere and $\mathcal{U}(\mathbb{S}^{d-1})$ the uniform measure on it. For each $\theta\in\mathbb{S}^{d-1}$, we denote the projection of a point $z$ onto $\theta$ as $\theta^{\top}z$. For two distributions $\mu$ and $\nu$ on the d-dimensional latent space, we define their sliced fused Gromov-Wasserstein distance as

$d_{\mathrm{SFGW}}(\mu,\nu)=\mathbb{E}_{\theta\sim\mathcal{U}(\mathbb{S}^{d-1})}\big[d_{\mathrm{FGW}}\big(\theta_{\#}\mu,\ \theta_{\#}\nu\big)\big],$

where $\theta_{\#}\mu$ represents the distribution after projection onto $\theta$, and $d_{\mathrm{FGW}}$ is the FGW distance between the two projected distributions.

According to this definition, the sliced FGW projects the original metric measure spaces into 1D spaces and calculates the FGW distance between these projected spaces; it corresponds to the expectation of the FGW distances under different projections. We can approximate the sliced FGW distance based on samples of the distributions as well. In particular, given samples of the two distributions and a set of random projections, the empirical sliced FGW is

(13)

where each FGW term is computed from the projected samples. Figure 2(c) further illustrates the principle of the sliced FGW distance. Replacing the empirical FGW with the empirical sliced FGW, we learn the relational regularized autoencoder via Algorithm 2.

1:  Input Samples in
2:  Output The autoencoder and the prior with Gaussian components .
3:  for each epoch
4:   for each batch of samples
5:    for
6:     Samples of : .
7:     Samples of : , .
8:    for
9:     Create a random projection
10:     , for .
11:     Sort the projected posterior samples and the projected prior samples, respectively.
12:     Calculate the 1D empirical FGW based on the sorted samples and Theorem 3.4.
13:    .
14:    Calculate the empirical sliced FGW via (13).
15:    Update the model parameters.
Algorithm 2 Learning RAE with sliced FGW
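As a rough illustration of Algorithm 2, the sketch below estimates an empirical sliced FGW in the spirit of (13): it projects both sample sets onto random directions, sorts the 1D projections, and, rather than re-deriving the sign condition of Theorem 3.4, evaluates the cost of the two candidate permutations (the identity and anti-identity of the sorted order) and keeps the smaller one. The convex combination of the Wasserstein and GW terms through a single coefficient alpha is an assumption about the exact form of the FGW cost.

```python
import torch

def fgw_1d(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    """FGW cost of matching 1D samples x[i] <-> y[i] in their given order."""
    n = x.shape[0]
    wasserstein = ((x - y) ** 2).mean()
    dx = (x[:, None] - x[None, :]).abs()
    dy = (y[:, None] - y[None, :]).abs()
    gw = ((dx - dy) ** 2).sum() / (n * n)
    return (1.0 - alpha) * wasserstein + alpha * gw  # assumed convex combination of the two terms

def sliced_fgw(zq: torch.Tensor, zp: torch.Tensor, n_proj: int = 50, alpha: float = 0.5):
    """Monte-Carlo estimate of the sliced FGW between two equal-size sample sets."""
    d = zq.shape[1]
    total = zq.new_zeros(())
    for _ in range(n_proj):
        theta = torch.randn(d)
        theta = theta / theta.norm()        # random direction on the hypersphere
        xq, _ = torch.sort(zq @ theta)      # sorted 1D projections of the posterior samples
        xp, _ = torch.sort(zp @ theta)      # sorted 1D projections of the prior samples
        # By Theorem 3.4, the optimal permutation is either the identity or the
        # anti-identity of the sorted order; evaluate both and keep the smaller cost.
        cost = torch.minimum(fgw_1d(xq, xp, alpha), fgw_1d(xq, xp.flip(0), alpha))
        total = total + cost
    return total / n_proj

zq, zp = torch.randn(64, 8), torch.randn(64, 8)  # encoded batch and prior samples
loss = sliced_fgw(zq, zp)
```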

3.3 Comparisons on computational complexity

Compared with calculating the empirical FGW distance directly, our hierarchical FGW and sliced FGW have much lower computational complexity. Following the notation in the previous two subsections, consider a batch of samples, a prior with a small number of Gaussian components, and latent codes of moderate dimension. If we apply the proximal gradient method in (Xu et al., 2019b) to calculate the empirical FGW directly, the cost is dominated by the Sinkhorn-based iterations over a coupling whose size is quadratic in the batch size. For our hierarchical FGW, we apply the proximal gradient method to a much smaller problem (i.e., solving (7)), because the number of Gaussian components in the prior is in general much smaller than the batch size, so the complexity is reduced accordingly. For our sliced FGW, we apply random projections to map the latent codes to 1D spaces; for each pair of projected sample sets, we sort the samples and compute (12) directly. Because the number of projections is moderate in general, the computational complexity of the sliced FGW is comparable to that of the hierarchical FGW.

4 Relational Co-Training of Autoencoders

Besides learning a single autoencoder, we can apply our relational regularization to learn multiple autoencoders. As shown in Figure 1(b), when learning two autoencoders we can penalize the GW distance between their posterior distributions, and accordingly the learning problem becomes:

(14)

In (14), a direct regularizer quantifies the discrepancy between each marginalized posterior and its prior; the prior distributions can be predefined or treated as learnable parameters; one weight balances the direct regularizers against the relational regularizer, and another controls the overall significance of these two kinds of regularizers. When learning probabilistic autoencoders, we set the direct regularizer to the hierarchical Wasserstein distance between GMMs (Chen et al., 2018) and approximate the relational regularizer by a hierarchical GW distance, i.e., the hierarchical FGW with the direct (Wasserstein) term removed. When learning deterministic autoencoders, we set the direct regularizer to the sliced Wasserstein distance used in (Kolouri et al., 2018) and approximate the relational regularizer via the sliced GW distance (Titouan et al., 2019), i.e., the sliced FGW with the direct term removed.

The main advantage of the proposed relational regularization is that it is applicable to co-training heterogeneous autoencoders. As shown in (14), the data used to train the autoencoders can come from different domains and follow different data distributions. To fully capture the information in each domain, the autoencoders sometimes have heterogeneous architectures, and the corresponding latent codes lie in incomparable spaces, e.g., spaces with different dimensions. Taking the GW distance as the relational regularizer, we impose a constraint on the posterior distributions defined in the different latent spaces, encouraging structural similarity between them. This regularizer helps avoid over-regularization because it does not enforce a shared latent distribution across domains. Moreover, the proposed regularizer is imposed on the posterior distributions; in other words, it does not require samples from different domains to be paired.

According to the analysis above, our relational co-training strategy has potential for multi-view learning, especially in scenarios with unpaired samples. In particular, given data from different domains, we first learn their latent codes by solving (14); we then concatenate the latent codes from the different domains and use them as features for downstream learning tasks.
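As a schematic of this pipeline, the sketch below encodes two views with (already co-trained) heterogeneous encoders, concatenates the latent codes, and fits a softmax-regression classifier on top; all architectures, dimensions, and class counts are placeholders.

```python
import torch
import torch.nn as nn

# Heterogeneous encoders with incomparable latent spaces (placeholder architectures).
encoder_view1 = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 16))  # 16-dim codes
encoder_view2 = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 32))  # 32-dim codes

def multiview_features(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """Concatenate the latent codes of the two views as features for a downstream task."""
    with torch.no_grad():
        z1, z2 = encoder_view1(x1), encoder_view2(x2)
    return torch.cat([z1, z2], dim=1)  # shape (batch, 16 + 32)

# Softmax-regression classifier trained on the concatenated codes.
classifier = nn.Linear(16 + 32, 10)
loss_fn = nn.CrossEntropyLoss()
x1, x2 = torch.randn(20, 100), torch.randn(20, 200)
labels = torch.randint(0, 10, (20,))
loss = loss_fn(classifier(multiview_features(x1, x2)), labels)
```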

5 Related Work

Gromov-Wasserstein distance The GW distance has been used as a metric for shape registration (Mémoli, 2009, 2011), vocabulary set alignment (Alvarez-Melis & Jaakkola, 2018), and graph matching (Chowdhury & Mémoli, 2018; Vayer et al., 2018b; Xu et al., 2019b). The work in (Peyré et al., 2016) proposes an entropy-regularized GW distance and calculates it based on Sinkhorn iterations (Cuturi, 2013). Following this direction, the work in (Xu et al., 2019b) replaces the entropy regularizer with a Bregman proximal term. The work in (Xu, 2019) proposes an ADMM-based method to calculate the GW distance. To further reduce the computational complexity, the recursive GW distance (Xu et al., 2019a) and the sliced GW distance (Titouan et al., 2019) have been proposed. For generative models, the work in (Bunne et al., 2019) leverages the GW distance to learn coupled adversarial generative networks. However, none of the existing autoencoders consider using the GW distance as their regularizer.

Autoencoders The principle of the autoencoder is to minimize the discrepancy between the data and model distributions. The common choices of the discrepancy include the KL divergence (Kingma & Welling, 2013; Dilokthanakul et al., 2016; Tomczak & Welling, 2018; Takahashi et al., 2019) and the Wasserstein distance (Tolstikhin et al., 2018; Kolouri et al., 2018), which lead to different learning algorithms. Our relational regularized autoencoder can be viewed as a new member of the Wasserstein autoencoder family. Compared with the MMD and the GAN loss used in WAE (Tolstikhin et al., 2018), and the sliced Wasserstein distance used in (Kolouri et al., 2018), our FGW-based regularizer imposes relational constraints and allows the learning of an autoencoder with structured prior distribution.

Co-training methods For data in different domains, a commonly-used co-training strategy maps them to a shared latent space and encourages similarity between their latent codes. This strategy suppresses the risk of overfitting for each model and enhances their generalization power, achieving encouraging performance in multi-view learning (Kumar & Daumé, 2011; Chen & Denoyer, 2017; Sindhwani et al., 2005). However, it assumes that the latent codes follow the same distribution, which may lead to over-regularization. Additionally, it often requires well-aligned data, i.e., the samples in different domains must be paired. Our relational co-training strategy provides a potential solution that relaxes these restrictions for practical applications.

6 Experiments

6.1 Image generation

We test our relational regularized autoencoder (RAE) in image-generation tasks and compare it with the following alternatives: the variational autoencoder (VAE) (Kingma & Welling, 2013), the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018), the sliced Wasserstein autoencoder (SWAE) (Kolouri et al., 2018), the Gaussian mixture VAE (GMVAE) (Dilokthanakul et al., 2016), and the VampPrior (Tomczak & Welling, 2018). Table 1 lists the main differences between our RAE and these baselines.

We test the methods on the MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015) datasets. For fairness, all the autoencoders have the same DCGAN-style architecture (Radford et al., 2015; Tolstikhin et al., 2018) and are trained with the same hyperparameters: the same learning rate, the Adam optimizer (Kingma & Ba, 2014), the same number of epochs, the same batch size, the same regularizer weight, and the same latent-code dimension for each dataset (set separately for MNIST and CelebA). For the autoencoders with structured priors, we use the same number of Gaussian components and initialize the prior distributions at random. For the proposed RAE, the trade-off weight is set so that the Wasserstein term and the GW term in our FGW distance empirically have the same magnitude. The probabilistic RAE calculates the hierarchical FGW with the proximal gradient method, and the deterministic RAE calculates the sliced FGW with a fixed number of random projections. All the autoencoders use the Euclidean distance as the distance between samples, so the reconstruction loss is the mean-square error (MSE). We implement all the autoencoders in PyTorch and train them on a single NVIDIA GTX 1080 Ti GPU. More implementation details, e.g., the architecture of the autoencoders, are provided in Appendix C.

Method      Encoder          Regularizer
VAE         Probabilistic    KL
WAE         Deterministic    MMD
SWAE        Deterministic    Sliced Wasserstein
GMVAE       Probabilistic    KL
VampPrior   Probabilistic    KL
Our RAE     Probabilistic    Hierarchical FGW
            Deterministic    Sliced FGW
Table 1: Comparisons for different autoencoders
Encoder         Method      MNIST                  CelebA
                            Rec. loss    FID       Rec. loss    FID
Probabilistic   VAE         16.60        156.11    96.36        59.99
                GMVAE       16.76        60.88     108.13       353.17
                VampPrior   22.41        127.81
                RAE         14.14        41.99     63.21        52.20
Deterministic   WAE         9.97         52.78     63.83        52.07
                SWAE        11.10        35.63     87.02        88.91
                RAE         10.37        49.39     64.49        51.45
Table 2: Comparisons on learning image generators
(a) MNIST
(b) Enlarged (a)
(c) CelebA
Figure 3: Comparisons for various methods on their convergence.

(a) VampPrior
(b) GMVAE
(c) Probabilistic RAE
(d) Deterministic RAE

Figure 4: Comparisons on conditional digit generation.

For each dataset, we compare the proposed RAE with the baselines on 1) the reconstruction loss on testing samples, and 2) the Fréchet Inception Distance (FID) between 10,000 testing samples and 10,000 randomly generated samples. We list the performance of the various autoencoders in Table 2. Among probabilistic autoencoders, our RAE consistently achieves the best performance on both the testing reconstruction loss and the FID score. When learning deterministic autoencoders, our RAE is at least comparable to the considered alternatives on these measurements. Figure 3 compares the autoencoders on the convergence of the reconstruction loss. The convergence of our RAE is almost the same as that of state-of-the-art methods, which further verifies its feasibility.
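For completeness, a minimal sketch of how an FID score can be computed once Inception activations for real and generated images are available; the random activation matrices are placeholders, and a practical evaluation would use a standard Inception-v3 feature extractor and the sample sizes reported above.

```python
import numpy as np
from scipy import linalg

def fid(act_real: np.ndarray, act_fake: np.ndarray) -> float:
    """Frechet Inception Distance between two sets of Inception activations (rows = samples)."""
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov1 = np.cov(act_real, rowvar=False)
    cov2 = np.cov(act_fake, rowvar=False)
    # Matrix square root of the product of covariances (may carry tiny imaginary parts).
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))

# Placeholder activations; in practice these come from an Inception-v3 feature extractor.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))))
```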


(a) GMVAE
(b) Probabilistic RAE
(c) Deterministic RAE

Figure 5: Comparisons on conditional face generation.

For the autoencoders that learn GMMs as their priors, we further compare them in conditional generation tasks, i.e., generating samples conditioned on specific Gaussian components. Figures 4 and 5 visualize the generation results of the various methods. On the MNIST dataset, the GMVAE, our probabilistic RAE, and our deterministic RAE achieve the desired generation results: the images conditioned on different Gaussian components correspond to different digits and writing styles. The VampPrior, however, suffers from severe mode collapse. Its images conditioned on different Gaussian components are similar to each other and cover limited modes; most of them are "0", "2", "3", and "8". As shown in Table 1, the VampPrior parameterizes each Gaussian component of the prior by passing a landmark through the encoder. Because the landmarks live in the sample space, this implicit model requires more parameters, making it sensitive to initialization and prone to overfitting. Figures 3(a) and 3(b) support this claim: the testing loss of the VampPrior is unstable and does not converge well during training. On the CelebA dataset, the GMVAE fails to learn a GMM-based prior. As shown in Figure 5(a), the GMVAE trains a single Gaussian distribution while ignoring the remaining components; as a result, only one Gaussian component can generate face images. Our probabilistic and deterministic RAEs, by contrast, learn their GMM-based priors successfully. In particular, all the components of our probabilistic RAE can generate face images, but the components are indistinguishable from one another. Our deterministic RAE achieves the best performance in this conditional generation task: different components generate semantically meaningful images with interpretable diversity. For each component, we add tags to highlight its semantic meaning. Visual comparisons of the various autoencoders on their reconstructed and generated samples are shown in Appendix D.

Method            Data type   Caltech101-7   Caltech101-20   Handwritten   Cathgen
Independent AEs   Unpaired    56.76          32.22           40.50         64.60
AEs+CoReg         Paired      76.35          59.83           46.50         66.86
AEs+W             Unpaired    82.43          69.46           49.00         65.84
AEs+GW            Unpaired    84.46          69.46           71.50         66.93
Table 3: Comparisons on classification accuracy (%)

6.2 Multi-view learning via co-training autoencoders

We test our relational co-training strategy on four multi-view learning datasets (Li et al., 2015) (available at https://github.com/yeqinglee/mvdata). Caltech101-7 is a subset of the Caltech-101 dataset (Fei-Fei et al., 2004) with 1,474 images in 7 classes; each image is represented by Gabor features and Wavelet moments. Caltech101-20 is a subset of Caltech-101 with 2,386 images in 20 classes, using the same features as Caltech101-7. Handwritten is a dataset of 2,000 images of the ten digits; each image is described by pixel-based features and Fourier coefficients. Cathgen is a real-world dataset of 8,000 patients; for each patient, we leverage clinical features and genetic features to predict the occurrence of myocardial infarction.

For each dataset, we use 80% of the data for training, 10% for validation, and the remaining 10% for testing. We test various multi-view learning methods. For each method, we first learn two autoencoders for the data from the different views in an unsupervised way, then concatenate the latent codes of the autoencoders as features and train a classifier based on softmax regression. When learning the autoencoders, our relational co-training method solves (14); the influence of the regularizer weights on the learning results is shown in Appendix D. For simplicity, we set the prior distributions in (14) to normal distributions. The autoencoders are probabilistic, with MLP encoders and decoders; more implementation details, e.g., the latent-code dimensions, are provided in Appendix C. We set the direct regularizer to the hierarchical Wasserstein distance and the relational regularizer to the hierarchical GW distance. In addition to the proposed method, denoted AEs+GW, we consider the following baselines: 1) learning two variational autoencoders independently (Independent AEs); 2) learning two variational autoencoders jointly with a least-squares co-regularization (Sindhwani et al., 2005) (AEs+CoReg); and 3) learning two autoencoders by replacing the relational (GW) regularizer in (14) with a Wasserstein regularizer (AEs+W). AEs+CoReg penalizes the Euclidean distance between the latent codes from different views, which requires paired samples. The remaining methods penalize the discrepancy between the distributions of the latent codes and are therefore applicable to unpaired samples. The classification accuracy in Table 3 demonstrates the effectiveness of our relational co-training strategy: the proposed method outperforms the baselines consistently across the different datasets.

7 Conclusions

A new framework has been proposed for learning autoencoders with relational regularization. Leveraging the GW distance, this framework allows the learning of structured prior distributions associated with the autoencoders and helps prevent under-regularization. Beyond learning a single autoencoder, the proposed relational regularizer is beneficial for co-training heterogeneous autoencoders. In the future, we plan to make the relational regularizer applicable to co-training more than two autoencoders and to further reduce its computational complexity.

References

Appendix A The proximal gradient method

Both the hierarchical FGW in (6) and the empirical FGW in (11) can be rewritten in matrix format. As shown in (7), the calculation of the distance corresponds to solving the following non-convex optimization problem:

(15)

where and are predefined discrete distributions. This problem can be solved iteratively by the following proximal gradient method (Xu et al., 2019b). In each -th iteration, given current estimation , we consider the following problem with a proximal term

(16)

This subproblem can be solved easily via Sinkhorn iterations (Cuturi, 2013). The details of the algorithm are shown in Algorithm 3. In our experiments, we set when learning probabilistic RAE (P-RAE). The hyperparameter is set adaptively. In particular, in each iteration, given the matrix , we set . This setting helps us improve the numerical stability when calculating the in Algorithm 3. When learning deterministic RAE (D-RAE), we apply the sliced FGW distance with random projections, such that the training time of the D-RAE and that of the P-RAE are comparable.

1:  Initialize ,
2:  for
3:   .
4:   Sinkhorn iteration: , ,
5:   .
6:  Return
Algorithm 3 Proximal gradient method for solving (15)
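A compact NumPy sketch of this proximal scheme under a KL (Bregman) proximal term: at each outer iteration, a cost matrix is formed from the current coupling and the resulting regularized OT subproblem is solved by Sinkhorn iterations. The step sizes, iteration counts, and the exact form of the cost update are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sinkhorn(K: np.ndarray, p: np.ndarray, q: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """Scale the kernel K so that the resulting coupling has marginals p and q."""
    a = np.ones_like(p)
    for _ in range(n_iter):
        b = q / (K.T @ a)
        a = p / (K @ b)
    return a[:, None] * K * b[None, :]

def fgw_proximal(D, Cx, Cy, p, q, alpha=0.5, tau=0.1, n_outer=20):
    """Proximal iterations for the matrix-form FGW problem (a sketch in the spirit of Algorithm 3).

    D: ground-cost matrix (e.g., Wasserstein distances between components).
    Cx, Cy: intra-space distance matrices; p, q: marginal distributions.
    """
    T = np.outer(p, q)  # initialize with the independent coupling
    const = (Cx ** 2) @ p[:, None] + (q[None, :] @ (Cy ** 2).T)  # decomposition of Peyre et al. (2016)
    for _ in range(n_outer):
        grad_gw = const - 2.0 * Cx @ T @ Cy.T        # gradient-like term of the GW part
        cost = (1.0 - alpha) * D + alpha * grad_gw   # assumed convex combination of the two terms
        K = T * np.exp(-cost / tau)                  # KL-proximal step around the current coupling
        T = sinkhorn(K, p, q)
    return T

# Toy example: couple a 6-component "posterior" with a 4-component "prior".
rng = np.random.default_rng(0)
n, m = 6, 4
Cx, Cy = rng.random((n, n)), rng.random((m, m))
Cx, Cy = (Cx + Cx.T) / 2, (Cy + Cy.T) / 2
D = rng.random((n, m))
p, q = np.full(n, 1 / n), np.full(m, 1 / m)
T = fgw_proximal(D, Cx, Cy, p, q)
print(T.sum(axis=1), T.sum(axis=0))  # marginals approximately equal to p and q
```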

Appendix B The Proof of Theorem 3.4

Theorem B.1.

For two sets of 1D samples, the solution of the problem

(17)

where the feasible set is the set of all permutations of the sample indices, is invariant with respect to any translations of the two sample sets.

Proof.

Denote the translated sample sets, obtained by shifting each set of samples by a constant, and denote the objective function in (17) evaluated on them accordingly. We then have

(18)

where the additional terms introduced by the translations do not depend on the permutation. Based on (18), we have

(19)

whose minimizer over permutations is invariant to the translations. ∎

Based on Theorem B.1 and the proof of the sliced GW distance in (Titouan et al., 2019), we have

Theorem B.2.

Following the notations in Theorem B.1, for we denote their zero-mean translations as and , respectively. satisfies:
1) When , the solution is the identity permutation .
2) Otherwise, the solution is the anti-identity permutation .

Proof.

The proposed problem is equivalent to the following problem