
Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model

In this paper we study generative modeling via autoencoders while using the elegant geometric properties of the optimal transport (OT) problem and the Wasserstein distances. We introduce Sliced-Wasserstein Autoencoders (SWAE), which are generative models that enable one to shape the distribution of the latent space into any samplable probability distribution without the need for training an adversarial network or defining a closed-form for the distribution. In short, we regularize the autoencoder loss with the sliced-Wasserstein distance between the distribution of the encoded training samples and a predefined samplable distribution. We show that the proposed formulation has an efficient numerical solution that provides similar capabilities to Wasserstein Autoencoders (WAE) and Variational Autoencoders (VAE), while benefiting from an embarrassingly simple implementation.





I Introduction

Scalable generative models that capture the rich and often nonlinear distribution of high-dimensional data (i.e., image, video, and audio) play a central role in various applications of machine learning, including transfer learning [14, 25], super-resolution [21, 16], image inpainting and completion, and image retrieval [7], among many others. The recent generative models, including Generative Adversarial Networks (GANs) [11, 30, 1, 2] and Variational Autoencoders (VAE) [15, 24, 5], enable an unsupervised and end-to-end modeling of the high-dimensional distribution of the training data.

Learning such generative models boils down to minimizing a dissimilarity measure between the data distribution and the output distribution of the generative model. To this end, and following the work of Arjovsky et al. [1] and Bousquet et al. [5], we approach the problem of generative modeling from the optimal transport point of view. The optimal transport problem [34, 18] provides a way to measure the distances between probability distributions by transporting (i.e., morphing) one distribution into another. Moreover, and as opposed to the common information-theoretic dissimilarity measures (e.g., f-divergences), the p-Wasserstein dissimilarity measures that arise from the optimal transport problem: 1) are true distances, and 2) metrize the weak convergence of probability measures (at least on compact spaces). Wasserstein distances have recently attracted a lot of interest in the learning community [9, 12, 5, 1, 18] due to their exquisite geometric characteristics [31]. See the supplementary material for an intuitive example showing the benefit of the Wasserstein distance over commonly used f-divergences.

In this paper, we introduce a new type of autoencoders for generative modeling (Algorithm 1), which we call Sliced-Wasserstein Autoencoders (SWAE), that minimize the sliced-Wasserstein distance between the distribution of the encoded samples and a predefined samplable distribution. Our work is most closely related to the recent work by Bousquet et al. [5] and the follow-up work by Tolstikhin et al. [33]. However, our approach avoids the need to perform costly adversarial training in the encoding space and is not restricted to closed-form distributions, while still benefiting from a Wasserstein-like distance measure in the encoding space that permits a simple numerical solution to the problem.

In what follows we first provide an extensive review of the preliminary concepts that are needed for our formulation. In Section 3 we formulate our proposed method. The proposed numerical scheme to solve the problem is presented in Section 4. Our experiments are summarized in Section 5. Finally, our work is concluded in Section 6.

II Notation and Preliminaries

Let X denote the compact domain of a manifold in Euclidean space and let x ∈ X denote an individual input data point. Furthermore, let p_X be a Borel probability measure defined on X, with probability density function p_X(x) for the input data x.

Let φ : X → Z denote a deterministic parametric mapping from the input space to a latent space Z (e.g., a neural network encoder). Utilizing a technique often used in the theoretical physics community, known as Random Variable Transformation (RVT), the probability density function of the encoded samples z = φ(x) can be expressed in terms of φ and p_X by:

p_Z(z) = ∫_X p_X(x) δ(z − φ(x)) dx,    (1)

where δ denotes the Dirac distribution function. The main objective of Variational Auto-Encoders (VAEs) is to encode the input data points x ∈ X into latent codes z ∈ Z such that: 1) x can be recovered/approximated from z, and 2) the probability density function of the encoded samples, p_Z, follows a prior distribution q_Z. Similar to classic auto-encoders, a decoder ψ : Z → X is required to map the latent codes back to the original space such that

p_Y(y) = ∫_X p_X(x) δ(y − ψ(φ(x))) dx,    (2)

where y denotes the decoded samples. It is straightforward to see that when ψ(φ(x)) = x (i.e., the decoder inverts the encoder), the distribution of the decoder p_Y and the input distribution p_X are identical. Hence, the objective of a variational auto-encoder simplifies to learning φ and ψ such that they minimize a dissimilarity measure between p_Y and p_X, and between p_Z and q_Z. Defining and implementing the dissimilarity measures is a key design decision and one of the main contributions of this work; we therefore dedicate the next section to describing existing methods for measuring these dissimilarities.

II-A Dissimilarity between p_X and p_Y

We first emphasize that the VAE work in the literature often assumes stochastic encoders and decoders [15], while we consider the case of only deterministic mappings. Different dissimilarity measures have been used between p_X and p_Y in various works in the literature. Most notably, Nowozin et al. [26] showed that for the general family of f-divergences, D_f(p_X, p_Y) (including the KL-divergence, Jensen-Shannon, etc.), using the Fenchel conjugate of the convex function f and minimizing D_f leads to a min-max problem that is equivalent to the adversarial training widely used in the generative modeling literature [11, 23, 24].

Others have utilized the rich mathematical foundation of the OT problem and Wasserstein distances [1, 12, 5, 33]. In Wasserstein-GAN, Arjovsky et al. [1] utilized the Kantorovich-Rubinstein duality for the 1-Wasserstein distance, W_1(p_X, p_Y), and reformulated the problem as a min-max optimization that is solved through an adversarial training scheme. In a different approach, Bousquet et al. [5] utilized the autoencoding nature of the problem and showed that W_c(p_X, p_Y) could be simplified as:

W_c(p_X, p_Y) = ∫_X c(x, ψ(φ(x))) p_X(x) dx,    (3)

Note that Eq. (3) is equivalent to Theorem 1 in [5] for a deterministic encoder-decoder pair, and also note that φ and ψ are parametric differentiable models (e.g., neural networks). Furthermore, Eq. (3) supports a simple implementation where, for i.i.d. samples x_1, ..., x_M of the input distribution, the minimization can be written as:

min_{φ,ψ} (1/M) Σ_{m=1}^M c(x_m, ψ(φ(x_m))),    (4)

We emphasize that Eq. (3) (and consequently Eq. (4)) takes advantage of the fact that the pairs x_m and ψ(φ(x_m)) are available, hence calculating the transport distance coincides with summing the transportation costs between all pairs (x_m, ψ(φ(x_m))). For example, with c(x, y) = ‖x − y‖² the total transport cost is the sum of squared Euclidean distances between all pairs of points. In this paper, we also follow Eq. (4) to measure the discrepancy between p_X and p_Y. Next, we review the methods used for measuring the discrepancy between p_Z and q_Z.
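Because the pairs (x_m, ψ(φ(x_m))) are available, Eq. (4) reduces to averaging a per-pair cost. A minimal NumPy sketch, assuming the squared Euclidean cost as the illustrative choice of c:

```python
import numpy as np

def paired_transport_cost(X, X_hat):
    """Empirical transport cost of Eq. (4) when correspondences are known:
    the mean cost c(x_m, psi(phi(x_m))) over all sample pairs.
    Here c is the squared Euclidean distance (an illustrative choice)."""
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Perfect reconstruction gives zero cost.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
assert paired_transport_cost(X, X) == 0.0
```

No optimization over transport plans is needed here: the correspondence between each sample and its reconstruction is fixed by the autoencoder itself.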

II-B Dissimilarity between p_Z and q_Z

When q_Z is a known distribution with an explicit formulation (e.g., the Normal distribution), the most straightforward approach for measuring the (dis)similarity between p_Z and q_Z is the log-likelihood of z = φ(x) with respect to q_Z, formally:

sup_φ ∫_X p_X(x) log(q_Z(φ(x))) dx,    (5)

maximizing this log-likelihood is equivalent to minimizing the KL-divergence between p_Z and q_Z (see the supplementary material for more details and the derivation of Eq. (5)). This approach has two major limitations: 1) the KL-divergence, and f-divergences in general, do not provide meaningful dissimilarity measures for distributions supported on non-overlapping low-dimensional manifolds [1, 19] (see supplementary material), which is common in hidden layers of neural networks, and therefore they do not provide informative gradients for training φ; and 2) we are limited to distributions that have known explicit formulations, which is very restrictive because it eliminates the ability to use the much broader class of distributions for which we know how to sample but do not know their explicit form.

Various alternatives exist in the literature to address the above-mentioned limitations. These methods often sample z̃ from q_Z and z = φ(x) from p_Z and measure the discrepancy between these sets (i.e., point clouds). Note that there are no one-to-one correspondences between the z̃'s and the z's. Tolstikhin et al. [33], for instance, proposed two different approaches for measuring the discrepancy between p_Z and q_Z, namely a GAN-based and a maximum mean discrepancy (MMD)-based approach. The GAN-based approach proposed in [33] defines a discriminator network to classify the z̃'s and z's as coming from the 'true' and 'fake' distributions, correspondingly, and proposes a min-max adversarial optimization for learning φ and the discriminator. This approach can be thought of as using the Fenchel conjugate of some f-divergence between p_Z and q_Z. The MMD-based approach, on the other hand, utilizes a positive-definite reproducing kernel to measure the discrepancy between p_Z and q_Z; however, the choice of the kernel remains a data-dependent design parameter.

An interesting alternative approach is to use the Wasserstein distance between p_Z and q_Z, the reason being that Wasserstein metrics have been shown to be particularly beneficial for measuring the distance between distributions supported on non-overlapping low-dimensional manifolds. Following the work of Arjovsky et al. [1], this can be accomplished by utilizing the Kantorovich-Rubinstein duality and introducing a min-max problem, which leads to yet another adversarial training scheme similar to the GAN-based method in [33]. Note that, since the elements of p_Z and q_Z are not paired, an approach similar to Eq. (4) cannot be used to calculate the Wasserstein distance. In this paper, we propose to use the sliced-Wasserstein metric [28, 29, 3, 17, 19, 6] to measure the discrepancy between p_Z and q_Z. We show that using the sliced-Wasserstein distance eliminates the need for training an adversarial network and provides an efficient yet simple numerical implementation.

Before explaining our proposed approach, it is worthwhile to point out the benefits of learning autoencoders as generative models over GANs. In GANs, one needs to minimize a distance between the generated samples and the training samples, which are high-dimensional point clouds with no correspondences between the two sets. For autoencoders, on the other hand, correspondences exist between the high-dimensional point clouds {x_m} and {ψ(φ(x_m))}, and the problem simplifies to matching the lower-dimensional point clouds {φ(x_m)} and {z̃_m}. In other words, the encoder performs a nonlinear dimensionality reduction that enables us to solve a much simpler problem compared to GANs. Next we introduce the details of our approach.

III Proposed method

In what follows we first provide a brief review of the equations necessary to understand the Wasserstein and sliced-Wasserstein distances, and then present our Sliced-Wasserstein Autoencoders (SWAE).

III-A Wasserstein distances

The Wasserstein distance between probability measures μ_X and μ_Y, with corresponding densities p_X and p_Y, is defined as:

W_c(p_X, p_Y) = inf_{γ ∈ Γ(p_X, p_Y)} ∫_{X×Y} c(x, y) dγ(x, y),    (6)

where Γ(p_X, p_Y) is the set of all transportation plans (i.e., joint measures) with marginal densities p_X and p_Y, and c : X × Y → R⁺ is the transportation cost. Eq. (6) is known as the Kantorovich formulation of the optimal mass transportation problem, which seeks the optimal transportation plan between p_X and p_Y. If there exist diffeomorphic mappings f : X → Y (i.e., transport maps) such that y = f(x) and consequently

p_Y(y) = p_X(f⁻¹(y)) |det(D f⁻¹(y))|,    (7)

where det(D f⁻¹) is the determinant of the Jacobian of f⁻¹, then the Wasserstein distance can be defined based on the Monge formulation of the problem (see [34] and [18]) as:

W_c(p_X, p_Y) = inf_{f ∈ MP} ∫_X c(x, f(x)) p_X(x) dx,    (8)

where MP is the set of all diffeomorphisms that satisfy Eq. (7). As can be seen from Eqs. (6) and (8), obtaining the Wasserstein distance requires solving an optimization problem. Various efficient optimization techniques have been proposed in the past (e.g., [8, 32, 27]).

The case of one-dimensional probability densities, p_X and p_Y, is especially interesting, as the Wasserstein distance has a closed-form solution. Let P_X and P_Y be the cumulative distribution functions of the one-dimensional probability densities p_X and p_Y, correspondingly. The Wasserstein distance can then be calculated as:

W_c(p_X, p_Y) = ∫_0^1 c(P_X⁻¹(τ), P_Y⁻¹(τ)) dτ.    (9)

The closed-form solution of the Wasserstein distance for one-dimensional probability densities motivates the definition of sliced-Wasserstein distances.

Fig. 1: Visualization of the slicing process defined in Eq. (10)

III-B Sliced-Wasserstein distances

The interest in the sliced-Wasserstein distance is due to the fact that it has qualitative properties very similar to those of the Wasserstein distance, but it is much easier to compute, since it only depends on one-dimensional computations. The sliced-Wasserstein distance was used in [28, 29] to calculate barycenters of distributions and point clouds. Bonneel et al. [3] provided a nice theoretical overview of barycentric calculations using the sliced-Wasserstein distance. Kolouri et al. [17] used this distance to define positive-definite kernels for distributions, and Carriere et al. [6] used it as a distance for persistence diagrams. Sliced-Wasserstein was also recently used for learning Gaussian mixture models.
The main idea behind the sliced-Wasserstein distance is to slice (i.e., project) higher-dimensional probability densities into sets of one-dimensional distributions and compare their one-dimensional representations via the Wasserstein distance. The slicing/projection process is related to the field of Integral Geometry and specifically the Radon transform [13]. The relevant result to our discussion is that a d-dimensional probability density p_X can be uniquely represented by the set of its one-dimensional marginal distributions following the Radon transform and the Fourier slice theorem [13]. These one-dimensional marginal distributions of p_X are defined as:

Rp_X(t; θ) = ∫_X p_X(x) δ(t − θ·x) dx,  ∀θ ∈ S^{d−1}, ∀t ∈ R,    (10)

where S^{d−1} is the unit sphere in R^d. Note that for any fixed θ ∈ S^{d−1}, Rp_X(·; θ) is a one-dimensional slice of the distribution p_X. In other words, Rp_X(·; θ) is a marginal distribution of p_X that is obtained from integrating p_X over the hyperplane orthogonal to θ (see Figure 1). Utilizing the one-dimensional marginal distributions in Eq. (10), the sliced-Wasserstein distance can be defined as:

SW_c(p_X, p_Y) = ∫_{S^{d−1}} W_c(Rp_X(·; θ), Rp_Y(·; θ)) dθ.    (11)

Given that Rp_X(·; θ) and Rp_Y(·; θ) are one-dimensional, the Wasserstein distance in the integrand has the closed-form solution demonstrated in Eq. (9). The fact that SW_c is a distance comes from W_c being a distance. Moreover, the two distances also induce the same topology, at least on compact sets [31].

A natural transportation cost that has been extensively studied in the past is the squared Euclidean cost, c(x, y) = ‖x − y‖₂², for which there are theoretical guarantees on the existence and uniqueness of transportation plans and maps (see [31] and [34]). When c(x, y) = ‖x − y‖₂², the following inequality bounds hold for the SW distance:

SW_2(p_X, p_Y) ≤ W_2(p_X, p_Y) ≤ α SW_2^β(p_X, p_Y),    (12)

where α is a constant. Chapter 5 in [4] proves this inequality with β = (2(d+1))⁻¹ (see [31] for more details). The inequalities in (12) are the main reason we can use the sliced-Wasserstein distance, SW_2, as an approximation for W_2.

III-C Sliced-Wasserstein auto-encoder

Our proposed formulation for the SWAE is as follows:

argmin_{φ,ψ}  W_c(p_X, p_Y) + λ SW_c(p_Z, q_Z),    (13)

where φ is the encoder, ψ is the decoder, p_X is the data distribution, p_Y is the data distribution after encoding and decoding (Eq. (2)), p_Z is the distribution of the encoded data (Eq. (1)), q_Z is the predefined distribution (or a distribution we know how to sample from), and λ is a hyperparameter that identifies the relative importance of the two loss terms.

To further clarify why we use the Wasserstein distance to measure the difference between p_X and p_Y, but the sliced-Wasserstein distance to measure the difference between p_Z and q_Z, we reiterate that the Wasserstein distance for the first term can be computed via Eq. (4) due to the existence of correspondences between x_m and ψ(φ(x_m)) (i.e., we desire ψ(φ(x_m)) ≈ x_m); however, for p_Z and q_Z, analogous correspondences between the z's and z̃'s do not exist, and therefore calculation of the Wasserstein distance requires an additional optimization step (e.g., in the form of an adversarial network). To avoid this additional optimization, while maintaining the favorable characteristics of the Wasserstein distance, we use the sliced-Wasserstein distance to measure the discrepancy between p_Z and q_Z.

IV Numerical optimization

Fig. 2: The Wasserstein distance for one-dimensional probability distributions p_X and p_Y (top left) is calculated based on Eq. (9). For a numerical implementation, the integral in Eq. (9) is substituted with a finite sum over uniformly spaced samples τ_m of the unit interval (top right). When only samples from the distributions are available, x_m ∼ p_X and y_m ∼ p_Y (bottom left), the Wasserstein distance is approximated by sorting the x_m's and y_m's and letting W_c(p_X, p_Y) ≈ (1/M) Σ_m c(x_{i[m]}, y_{j[m]}), where i[m] and j[m] are the sorted indices (bottom right).

IV-A Numerical implementation of the Wasserstein distance in 1D

The Wasserstein distance between two one-dimensional distributions p_X and p_Y is obtained from Eq. (9). The integral in Eq. (9) can be numerically calculated using (1/M) Σ_{m=1}^M c(P_X⁻¹(τ_m), P_Y⁻¹(τ_m)), where the τ_m are uniformly spaced points in (0, 1) (see Fig. 2). In scenarios where only samples from the distributions are available, x_m ∼ p_X and y_m ∼ p_Y, the empirical distributions can be estimated as p_X = (1/M) Σ_{m=1}^M δ_{x_m} and p_Y = (1/M) Σ_{m=1}^M δ_{y_m}, where δ_{x_m} is the Dirac delta function centered at x_m. Therefore, the corresponding empirical cumulative distribution of p_X is P_X(t) = (1/M) Σ_{m=1}^M u(t − x_m), where u is the step function (P_Y is defined similarly). Sorting the x_m's in ascending order, such that x_{i[m]} ≤ x_{i[m+1]}, where i[m] is the index of the m-th smallest sample, it is straightforward to confirm that P_X⁻¹(τ_m) = x_{i[m]} (see Fig. 2 for a visual confirmation). Therefore, the Wasserstein distance can be approximated by first sorting the x_m's and y_m's and then calculating:

W_c(p_X, p_Y) ≈ (1/M) Σ_{m=1}^M c(x_{i[m]}, y_{j[m]}).    (14)

Eq. (14) turns the problem of calculating the Wasserstein distance for two one-dimensional probability densities from their samples into a sorting problem that can be solved efficiently (O(M) in the best case and O(M log M) in the worst case).
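The sorting-based estimate of Eq. (14) can be sketched as follows; the squared-difference cost is an assumed illustrative choice of c:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Approximate the 1D Wasserstein distance of Eq. (14):
    sort both sample sets and average the costs between matched quantiles.
    The cost c is taken to be the squared difference."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean((xs - ys) ** 2)

# Sorting aligns the empirical quantiles of the two sample sets.
x = np.array([3.0, 1.0, 2.0])
y = np.array([2.0, 4.0, 3.0])
print(wasserstein_1d(x, y))  # each sorted pair differs by 1 -> 1.0
```

Note that no pairing between x and y is given in advance; the sort itself produces the optimal one-dimensional matching.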

IV-B Slicing empirical distributions

In scenarios where only samples from the d-dimensional distribution p_X are available, x_m ∼ p_X, the empirical distribution can be estimated as p_X = (1/M) Σ_{m=1}^M δ_{x_m}. Following Eq. (10), it is straightforward to show that the marginal distributions (i.e., slices) of the empirical distribution are obtained from:

Rp_X(t; θ) = (1/M) Σ_{m=1}^M δ(t − θ·x_m),    (15)

see the supplementary material for a proof.
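Eq. (15) says that slicing an empirical distribution amounts to projecting its samples onto the direction θ; a short sketch:

```python
import numpy as np

def slice_samples(X, theta):
    """Project d-dimensional samples onto the direction theta.
    By Eq. (15), the resulting scalars are samples from the
    one-dimensional marginal (slice) of the empirical distribution."""
    theta = theta / np.linalg.norm(theta)  # ensure theta lies on the unit sphere
    return X @ theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([1.0, 0.0])
print(slice_samples(X, theta))  # projections onto the first axis: [1. 0. 1.]
```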

IV-C Minimizing sliced-Wasserstein via random slicing

Minimizing the sliced-Wasserstein distance (i.e., the second term of Eq. (13)) requires an integration over the unit sphere in R^K, i.e., S^{K−1}. In practice, this integration is substituted by a summation over a finite set Θ ⊂ S^{K−1}:

SW_c(p_Z, q_Z) ≈ (1/|Θ|) Σ_{θ ∈ Θ} W_c(Rp_Z(·; θ), Rq_Z(·; θ)).

Moreover, a global minimum of SW_c(p_Z, q_Z) is also a global minimum for each W_c(Rp_Z(·; θ), Rq_Z(·; θ)). A fine sampling of S^{K−1} is required for a good approximation of SW_c(p_Z, q_Z). Such a sampling, however, becomes prohibitively expensive as the dimension of the embedding space grows. Alternatively, following the approach presented by Rabin and Peyré [28], and later by Bonneel et al. [3] and subsequently by Kolouri et al. [19], we utilize random samples of S^{K−1} at each minimization step to approximate the sliced-Wasserstein distance. Intuitively, if p_Z and q_Z are similar, then their projections with respect to any finite subset of S^{K−1} would also be similar. This leads to a stochastic gradient descent scheme where, in addition to the random sampling of the input data, we also randomly sample the projection angles from S^{K−1}.
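The random-slicing approximation can be sketched by drawing directions uniformly from the unit sphere (via normalized Gaussians), projecting both sample sets, and averaging the sorted one-dimensional distances. The squared cost and equal sample counts are assumed:

```python
import numpy as np

def sliced_wasserstein(X, Z, L=50, rng=None):
    """Monte Carlo approximation of the sliced-Wasserstein distance
    between two empirical distributions with equal sample counts.
    Directions are drawn uniformly from the unit sphere S^{d-1}."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    thetas = rng.normal(size=(L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # normalize rows
    total = 0.0
    for theta in thetas:
        xp = np.sort(X @ theta)  # sorted 1D projections of X
        zp = np.sort(Z @ theta)  # sorted 1D projections of Z
        total += np.mean((xp - zp) ** 2)
    return total / L

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
print(sliced_wasserstein(X, X, L=10, rng=1))  # identical sets -> 0.0
```

With a fresh random set of directions at each SGD step, the estimate is cheap even when the embedding dimension is large.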


IV-D Putting it all together

To optimize the proposed SWAE objective function in Eq. (13), we use a stochastic gradient descent scheme as described here. In each iteration, let {x_m} and {z̃_m} be i.i.d. random samples from the input data and the predefined distribution q_Z, correspondingly, and let {θ_l} be randomly sampled from a uniform distribution on S^{K−1}. Then, using the numerical approximations described in this section, the loss function in Eq. (13) can be rewritten as:

(1/M) Σ_{m=1}^M c(x_m, ψ(φ(x_m))) + (λ/(L·M)) Σ_{l=1}^L Σ_{m=1}^M c(θ_l·z̃_{i[m]}, θ_l·φ(x_{j[m]})),    (16)

where i[m] and j[m] are the indices of the sorted θ_l·z̃_m's and θ_l·φ(x_m)'s, correspondingly. The steps of our proposed method are presented in Algorithm 1. It is worth pointing out that sorting is by itself an optimization problem (which can be solved very efficiently), and therefore the sorting followed by the gradient descent update on φ and ψ is in essence a min-max problem, which is solved in an alternating fashion.

Require: Regularization coefficient λ, and number of random projections, L. Initialize the parameters of the encoder, φ, and decoder, ψ
  while φ and ψ have not converged do
     Sample {x_1, ..., x_M} from the training set (i.e., p_X)
     Sample {z̃_1, ..., z̃_M} from q_Z
     Sample {θ_1, ..., θ_L} from S^{K−1}
     Sort θ_l·z̃_m such that θ_l·z̃_{i[m]} ≤ θ_l·z̃_{i[m+1]}
     Sort θ_l·φ(x_m) such that θ_l·φ(x_{j[m]}) ≤ θ_l·φ(x_{j[m+1]})
     Update φ and ψ by descending the gradient of Eq. (16)
  end while
Algorithm 1  Sliced-Wasserstein Auto-Encoder (SWAE)
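One evaluation of the minibatch loss in Algorithm 1 can be sketched in NumPy as below; `encode` and `decode` are hypothetical stand-ins for the networks φ and ψ, and the squared cost is assumed:

```python
import numpy as np

def swae_loss(X, Z_prior, encode, decode, thetas, lam=10.0):
    """Evaluate the SWAE objective of Eq. (16) for one minibatch:
    reconstruction cost plus lam times the sliced-Wasserstein penalty
    between encoded samples and prior samples (squared cost assumed).
    `encode`/`decode` are placeholders for the networks phi and psi."""
    Z = encode(X)
    X_hat = decode(Z)
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sw = 0.0
    for theta in thetas:
        zp = np.sort(Z_prior @ theta)  # sorted projections of prior samples
        ep = np.sort(Z @ theta)        # sorted projections of encoded samples
        sw += np.mean((zp - ep) ** 2)
    return recon + lam * sw / len(thetas)

# Toy check with identity encoder/decoder and a matched "prior".
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
thetas = rng.normal(size=(5, 2))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
print(swae_loss(X, X.copy(), lambda a: a, lambda a: a, thetas))  # -> 0.0
```

In a real training loop, the gradient of this scalar with respect to the encoder and decoder parameters would be taken by an autodiff framework, as done in the Keras implementation in the supplementary material.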

V Experiments

Here we show the results of SWAE for two mid-size image datasets, namely the MNIST dataset [20] and the CelebFaces Attributes Dataset (CelebA) [22]. For the encoder and the decoder we used mirrored classic deep convolutional neural networks with 2D average poolings and leaky rectified linear units (Leaky-ReLU) as the activation functions. The implementation details are included in the supplementary material.

For the MNIST dataset, we designed a deep convolutional encoder that embeds the handwritten digits into a two-dimensional embedding space (for visualization). To demonstrate the capability of SWAE in matching the distributions p_Z and q_Z in the embedding/encoder space, we chose four different q_Z's, namely a ring distribution, a uniform distribution, a circle distribution, and a bowl distribution. Figure 3 shows the results of our experiment on the MNIST dataset. The left column shows samples from q_Z, the middle column shows the encoded samples φ(x_m) for the trained φ, where the colors represent the digit labels (note that the labels were used only for visualization). Finally, the right column depicts a grid in Z passed through the trained decoder ψ. As can be seen, the embedding/encoder space closely follows the predefined q_Z, while the space remains decodable.

Fig. 3: The results of SWAE on the MNIST dataset for three different distributions as q_Z, namely the ring distribution, the uniform distribution, and the circle distribution. Note that the far-right visualization shows the decoding of a grid in Z (the encoding space).

The CelebA face dataset contains a higher degree of variation than the MNIST dataset, and therefore a two-dimensional embedding space does not suffice to capture the variations in this dataset: while the SWAE loss function still decreases and the network achieves a good match between p_Z and q_Z, the decoder is unable to match p_Y and p_X. Therefore, a higher-dimensional embedding/encoder space is needed. In our experiments for this dataset we chose a 128-dimensional embedding space. Figure 4 demonstrates the outputs of trained SWAEs with embedding spaces of different dimensionality for sample input images. The input images were resized before being fed to our autoencoder structure.

Fig. 4: Trained SWAE outputs for sample input images with embedding spaces of different sizes.

For the CelebA dataset we set q_Z to be a 128-dimensional uniform distribution and trained our SWAE accordingly. Given the convex nature of q_Z, any linear combination of the encoded faces should also result in a new face. With that in mind, we ran two experiments in the embedding space to check that the embedding space does in fact satisfy this convexity assumption. First, we calculated linear interpolations of sampled pairs of faces in the embedding space and fed the interpolations to the decoder network to visualize the corresponding faces. Figure 5 (right) shows the interpolation results for random pairs of encoded faces. It is clear that the interpolations remain faithful, as expected from a uniform q_Z. Finally, we performed Principal Component Analysis (PCA) of the encoded faces and visualized the faces corresponding to these principal components via ψ. The PCA components are shown on the left column of Figure 5. Various interesting modes, including hair color, skin color, gender, and pose, can be observed in the principal components.

VI Conclusions

We introduced Sliced-Wasserstein Autoencoders (SWAE), which enable one to shape the distribution of the encoded samples into any samplable distribution. We theoretically showed that utilizing the sliced-Wasserstein distance as a dissimilarity measure between the distribution of the encoded samples and a predefined distribution eliminates the need for training an adversarial network in the embedding space. In addition, we provided a simple and efficient numerical scheme for this problem, which relies only on a few inner products and sorting operations in each SGD iteration. We further demonstrated the capability of our method on two mid-size image datasets, namely the MNIST dataset and the CelebA face dataset, and showed results comparable to techniques that rely on additional adversarial training. Our implementation is publicly available.

Fig. 5: The results of SWAE on the CelebA face dataset with a 128-dimensional uniform distribution as . Linear interpolation in the encoding space for random samples (on the right) and the first 10 PCA components calculated in the encoding space.


This work was partially supported by NSF (CCF 1421502). The authors would like to thank Drs. Dejan Slepčev and Heiko Hoffmann for their invaluable input and many hours of constructive conversations.


Supplementary Material

Comparison of different distances

Following the example by Arjovsky et al. [1] and later Kolouri et al. [19], here we show a simple example comparing the Jensen-Shannon divergence with the Wasserstein distance. First note that the Jensen-Shannon divergence is defined as

JS(p_X, p_Y) = (1/2) KL(p_X ‖ (p_X + p_Y)/2) + (1/2) KL(p_Y ‖ (p_X + p_Y)/2),

where KL is the Kullback-Leibler divergence. Now consider the following densities: let p_X be a uniform distribution around zero and let p_Y(x) = p_X(x − τ) be a shifted version of p_X. Figure 6 shows JS(p_X, p_Y) and W_2(p_X, p_Y) as functions of the shift τ. As can be seen, the JS divergence fails to provide a useful gradient when the distributions are supported on non-overlapping domains.
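This effect can be checked numerically on discretized densities. A sketch, assuming unit-width uniforms on a fixed grid: once the supports stop overlapping, JS saturates at log 2 while the 1-Wasserstein distance keeps growing with the shift τ:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete densities."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(p, q, dx):
    """1D 1-Wasserstein distance via the integrated difference of CDFs."""
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]
p = ((x >= 0) & (x < 1)).astype(float); p /= p.sum()   # uniform on [0, 1)
for tau in [0.5, 2.0, 4.0]:
    q = ((x >= tau) & (x < tau + 1)).astype(float); q /= q.sum()  # shifted copy
    print(tau, js_divergence(p, q), w1_1d(p, q, dx))
```

For τ ≥ 1 the supports are disjoint and JS stays pinned at log 2 ≈ 0.693 regardless of how far apart the two uniforms are, while W1 grows linearly with τ.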


To maximize (minimize) the similarity (dissimilarity) between p_Z and q_Z, we can write:

argmax_φ ∫_Z p_Z(z) log(q_Z(z)) dz = argmax_φ ∫_Z ∫_X p_X(x) δ(z − φ(x)) log(q_Z(z)) dx dz = argmax_φ ∫_X p_X(x) log(q_Z(φ(x))) dx,

where we replaced p_Z with Eq. (1). Furthermore, it is straightforward to show:

∫_Z p_Z(z) log(q_Z(z)) dz = −KL(p_Z ‖ q_Z) − H(p_Z),

where H(p_Z) denotes the entropy of p_Z; hence maximizing the log-likelihood in Eq. (5) is equivalent to minimizing the KL-divergence between p_Z and q_Z.

Slicing empirical distributions

Here we calculate a Radon slice of the empirical distribution p_X = (1/M) Σ_{m=1}^M δ_{x_m} with respect to θ. Using the definition of the Radon transform in Eq. (10) and RVT in Eq. (1) we have:

Rp_X(t; θ) = ∫_X ((1/M) Σ_{m=1}^M δ_{x_m}(x)) δ(t − θ·x) dx = (1/M) Σ_{m=1}^M δ(t − θ·x_m),

which is Eq. (15).

Simple manifold learning experiment

Figure 7 demonstrates the results of SWAE with different random initializations, embedding a 3D nonlinear manifold into a 2D uniform distribution.

Fig. 7: Different runs of SWAE to embed a 3D nonlinear manifold into a 2D uniform distribution.

The implementation details of our algorithm

The following text walks you through the implementation of our Sliced Wasserstein Autoencoders (SWAE).

To run this notebook you’ll require the following packages:

In [1]:
import numpy as np
import keras.utils
from keras.layers import Input, Dense, Flatten, Activation
from keras.models import load_model, Model
from keras.layers import Conv2D, UpSampling2D, AveragePooling2D
from keras.layers import LeakyReLU, Reshape
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import RMSprop
from keras.datasets import mnist
from keras.models import save_model
from keras import backend as K
import tensorflow as tf
import matplotlib.pyplot as plt
from IPython import display
import time

Using TensorFlow backend.

  • generateTheta(L, endim) -> Generates L random samples from the unit sphere S^{endim−1}

  • generateZ(batchsize, endim) -> Generates 'batchsize' endim-dimensional samples from q_Z

  • stitchImages(I, axis=0) -> Helps us with visualization

In [2]:
def generateTheta(L, endim):
    # Draw L random directions and normalize them onto the unit sphere
    theta_ = np.random.normal(size=(L, endim))
    for l in range(L):
        theta_[l, :] = theta_[l, :] / np.sqrt(np.sum(theta_[l, :]**2))
    return theta_

def generateZ(batchsize, endim):
    # Samples from a uniform distribution on [-1, 1]^endim
    z_ = 2*(np.random.uniform(size=(batchsize, endim)) - 0.5)
    return z_

def stitchImages(I, axis=0):
    # Concatenate a batch of images along rows (axis=0) or columns
    n, N, M, K = I.shape
    if axis == 0:
        img = np.zeros((N*n, M, K))
        for i in range(n):
            img[i*N:(i+1)*N, :, :] = I[i, :, :, :]
    else:
        img = np.zeros((N, M*n, K))
        for i in range(n):
            img[:, i*M:(i+1)*M, :] = I[i, :, :, :]
    return img

In [3]:
img = Input((28, 28, 1))   # Input image
interdim = 128             # Dimension of the intermediate latent variable
                           # (after convolution and before embedding)
endim = 2                  # Dimension of the embedding space
embedd = Input((endim,))   # Keras input to the decoder
depth = 16                 # Design parameter (base number of filters)
L = 50                     # Number of random projections
batchsize = 500

In [4]:
x = Conv2D(depth*1, (3, 3), padding='same')(img)
x = LeakyReLU(alpha=0.2)(x)
x = Conv2D(depth*1, (3, 3), padding='same')(x)
x = LeakyReLU(alpha=0.2)(x)
x = AveragePooling2D((2, 2), padding='same')(x)
x = Conv2D(depth*2, (3, 3), padding='same')(x)
x = LeakyReLU(alpha=0.2)(x)
x = Conv2D(depth*2, (3, 3), padding='same')(x)
x = LeakyReLU(alpha=0.2)(x)
x = AveragePooling2D((2, 2), padding='same')(x)
x = Conv2D(depth*4, (3, 3), padding='same')(x)
x = LeakyReLU(alpha=0.2)(x)
x = Conv2D(depth*4, (3, 3), padding='same')(x)
x = LeakyReLU(alpha=0.2)(x)
x = AveragePooling2D((2, 2), padding='same')(x)
x = Flatten()(x)
x = Dense(interdim, activation='relu')(x)
encoded = Dense(endim)(x)

encoder = Model(inputs=[img], outputs=[encoded])
encoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 28, 28, 1)         0
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160
leaky_re_lu_1 (LeakyReLU)    (None, 28, 28, 16)        0
conv2d_2 (Conv2D)            (None, 28, 28, 16)        2320
leaky_re_lu_2 (LeakyReLU)    (None, 28, 28, 16)        0
average_pooling2d_1 (Average (None, 14, 14, 16)        0
conv2d_3 (Conv2D)            (None, 14, 14, 32)        4640
leaky_re_lu_3 (LeakyReLU)    (None, 14, 14, 32)        0
conv2d_4 (Conv2D)            (None, 14, 14, 32)        9248
leaky_re_lu_4 (LeakyReLU)    (None, 14, 14, 32)        0
average_pooling2d_2 (Average (None, 7, 7, 32)          0
conv2d_5 (Conv2D)            (None, 7, 7, 64)          18496
leaky_re_lu_5 (LeakyReLU)    (None, 7, 7, 64)          0
conv2d_6 (Conv2D)            (None, 7, 7, 64)          36928
leaky_re_lu_6 (LeakyReLU)    (None, 7, 7, 64)          0
average_pooling2d_3 (Average (None, 4, 4, 64)          0
flatten_1 (Flatten)          (None, 1024)              0
dense_1 (Dense)              (None, 128)               131200
dense_2 (Dense)              (None, 2)                 258
=================================================================
Total params: 203,250
Trainable params: 203,250
Non-trainable params: 0
_________________________________________________________________

In [5]: x=Dense(interdim)(embedd)
x=Dense(depth*64,activation='relu')(x)
# x=BatchNormalization(momentum=0.8)(x)
x=Reshape((4,4,4*depth))(x)
x=UpSampling2D((2, 2))(x)
x=Conv2D(depth*4, (3, 3), padding='same')(x)
x=LeakyReLU(alpha=0.2)(x)
# x=BatchNormalization(momentum=0.8)(x)
x=Conv2D(depth*4, (3, 3), padding='same')(x)
x=LeakyReLU(alpha=0.2)(x)
x=UpSampling2D((2, 2))(x)
x=Conv2D(depth*4, (3, 3), padding='valid')(x)
x=LeakyReLU(alpha=0.2)(x)
# x=BatchNormalization(momentum=0.8)(x)
x=Conv2D(depth*4, (3, 3), padding='same')(x)
x=LeakyReLU(alpha=0.2)(x)
x=UpSampling2D((2, 2))(x)
x=Conv2D(depth*2, (3, 3), padding='same')(x)
x=LeakyReLU(alpha=0.2)(x)
# x=BatchNormalization(momentum=0.8)(x)
x=Conv2D(depth*2, (3, 3), padding='same')(x)
x=LeakyReLU(alpha=0.2)(x)
# x=BatchNormalization(momentum=0.8)(x)
decoded=Conv2D(1, (3, 3), padding='same',activation='sigmoid')(x)

decoder=Model(inputs=[embedd],outputs=[decoded])
decoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 2)                 0
_________________________________________________________________
dense_3 (Dense)              (None, 128)               384
_________________________________________________________________
dense_4 (Dense)              (None, 1024)              132096
_________________________________________________________________
reshape_1 (Reshape)          (None, 4, 4, 64)          0
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 8, 8, 64)          0
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 8, 8, 64)          36928
_________________________________________________________________
leaky_re_lu_7 (LeakyReLU)    (None, 8, 8, 64)          0
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 8, 8, 64)          36928
_________________________________________________________________
leaky_re_lu_8 (LeakyReLU)    (None, 8, 8, 64)          0
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 16, 16, 64)        0
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 14, 14, 64)        36928
_________________________________________________________________
leaky_re_lu_9 (LeakyReLU)    (None, 14, 14, 64)        0
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 14, 14, 64)        36928
_________________________________________________________________
leaky_re_lu_10 (LeakyReLU)   (None, 14, 14, 64)        0
_________________________________________________________________
up_sampling2d_3 (UpSampling2 (None, 28, 28, 64)        0
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 28, 28, 32)        18464
_________________________________________________________________
leaky_re_lu_11 (LeakyReLU)   (None, 28, 28, 32)        0
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 28, 28, 32)        9248
_________________________________________________________________
leaky_re_lu_12 (LeakyReLU)   (None, 28, 28, 32)        0
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 28, 28, 1)         289
=================================================================
Total params: 308,193
Trainable params: 308,193
Non-trainable params: 0
_________________________________________________________________

Here we define Keras variables for the projection directions \theta_l and for samples z drawn from the prior q_Z.


In [6]: # Define a Keras variable for the \theta_l's
theta=K.variable(generateTheta(L,endim))
# Define a Keras variable for samples of z
z=K.variable(generateZ(batchsize,endim))
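`generateTheta` and `generateZ` are defined earlier in the notebook. As a hedged sketch of what they presumably do: `generateTheta` draws L random unit vectors (the random slicing directions) in the endim-dimensional latent space, and `generateZ` draws batchsize samples from the chosen prior q_Z; a uniform square over [-1,1]^2 is one simple choice, though the actual prior is whatever the earlier cell defines.

```python
import numpy as np

def generateTheta(L, endim):
    # L random directions on the unit sphere in R^endim, used as
    # projection axes for the sliced-Wasserstein distance.
    theta = np.random.normal(size=(L, endim))
    return theta / np.sqrt((theta ** 2).sum(axis=1, keepdims=True))

def generateZ(batchsize, endim):
    # Samples from the target prior q_Z; a uniform square is an
    # illustrative choice (the notebook's actual prior may differ).
    return np.random.uniform(-1, 1, size=(batchsize, endim))
```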

Put the encoder and decoder together to get the autoencoder.

In [7]: # Generate the autoencoder by combining encoder and decoder
aencoded=encoder(img)
ae=decoder(aencoded)
autoencoder=Model(inputs=[img],outputs=[ae])
autoencoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 28, 28, 1)         0
_________________________________________________________________
model_1 (Model)              (None, 2)                 203250
_________________________________________________________________
model_2 (Model)              (None, 28, 28, 1)         308193
=================================================================
Total params: 511,443
Trainable params: 511,443
Non-trainable params: 0
_________________________________________________________________

In [8]: # Let projae be the projection of the encoded samples
projae=K.dot(aencoded,K.transpose(theta))
# Let projz be the projection of the $q_Z$ samples
projz=K.dot(z,K.transpose(theta))
# Calculate the sliced-Wasserstein distance by sorting the
# projections and taking the squared L2 distance between them
W2=(tf.nn.top_k(tf.transpose(projae),k=batchsize).values-
    tf.nn.top_k(tf.transpose(projz),k=batchsize).values)**2
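The cell above builds this distance symbolically in Keras: project both sample sets onto each direction, sort the projections (here via `tf.nn.top_k`), and take squared differences of the sorted values. The same computation can be sketched in plain NumPy (the function name `swd` is mine, not from the notebook):

```python
import numpy as np

def swd(X, Y, L=50):
    # Monte-Carlo estimate of the squared sliced-Wasserstein-2 distance
    # between two equal-sized sample sets X, Y of shape (n, d),
    # averaged over L random one-dimensional slices.
    n, d = X.shape
    theta = np.random.normal(size=(L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    projX = np.sort(X @ theta.T, axis=0)  # (n, L), each slice sorted
    projY = np.sort(Y @ theta.T, axis=0)
    return ((projX - projY) ** 2).mean()
```

Sorting works because in one dimension the optimal transport plan simply matches order statistics, which is what makes the sliced formulation so cheap.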

In [9]: crossEntropyLoss= (1.0)*K.mean(K.binary_crossentropy(K.flatten(img), K.flatten(ae)))
L1Loss= (1.0)*K.mean(K.abs(K.flatten(img)-K.flatten(ae)))
W2Loss= (10.0)*K.mean(W2)
# Use a combination of L1 and cross-entropy loss for the
# reconstruction term, and W2 for the distribution-matching term
vae_Loss=crossEntropyLoss+L1Loss+W2Loss
autoencoder.add_loss(vae_Loss) # Add the custom loss to the model
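The reconstruction part of this objective is ordinary pixel-wise binary cross-entropy plus L1; only the 10x-weighted W2 term is specific to SWAE. A NumPy sketch of the reconstruction terms (function name and `eps` clipping are mine, added for numerical safety):

```python
import numpy as np

def combined_recon_loss(img, rec, eps=1e-7):
    # Mean binary cross-entropy plus mean L1 between an image batch
    # and its reconstruction, mirroring the Keras cell above.
    img, rec = img.ravel(), np.clip(rec.ravel(), eps, 1 - eps)
    bce = -(img * np.log(rec) + (1 - img) * np.log(1 - rec)).mean()
    l1 = np.abs(img - rec).mean()
    return bce + l1
```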

In [10]: # Compile the model; the loss was already attached via add_loss
autoencoder.compile(optimizer='rmsprop',loss=None)

In [11]: (x_train,y_train),(x_test,_)=mnist.load_data()
x_train=np.expand_dims(x_train.astype('float32')/255.,3)

In [12]: plt.imshow(np.squeeze(x_train[0,...]))

[Figure: MNIST_SlicedWassersteinAutoEncoder_19_0.png]

In [13]: loss=[]
for epoch in range(20):
    ind=np.random.permutation(x_train.shape[0])
    for i in range(int(x_train.shape[0]/batchsize)):
        Xtr=x_train[ind[i*batchsize:(i+1)*batchsize],...]
        theta_=generateTheta(L,endim)
        z_=generateZ(batchsize,endim)
        K.set_value(z,z_)
        K.set_value(theta,theta_)
        loss.append(autoencoder.train_on_batch(x=Xtr,y=None))
    plt.plot(np.asarray(loss))
    display.clear_output(wait=True)
    display.display(plt.gcf())
    time.sleep(1e-3)

[Figure: MNIST_SlicedWassersteinAutoEncoder_21_0.png]

In [15]: # Test the autoencoder
en=encoder.predict(x_train)  # Encode the images
dec=decoder.predict(en)      # Decode the encodings

In [16]: # Sanity check for the autoencoder
# Note that we could use a more complex autoencoder that yields
# better reconstructions. Also, the autoencoders used in the
# literature often use a much larger latent space (we are using only 2!)
fig,[ax1,ax2]=plt.subplots(2,1,figsize=(100,10))
I_temp=(stitchImages(x_train[:10,...],1)*255.0).astype('uint8')
Idec_temp=(stitchImages(dec[:10,...],1)*255.0).astype('uint8')
ax1.imshow(np.squeeze(I_temp))
ax1.set_xticks([])
ax1.set_yticks([])
ax2.imshow(np.squeeze(Idec_temp))
ax2.set_xticks([])
ax2.set_yticks([])

[Figure: MNIST_SlicedWassersteinAutoEncoder_24_1.png]

In [17]: # Distribution of the encoded samples
plt.figure(figsize=(10,10))
plt.scatter(en[:,0],-en[:,1],c=10*y_train)
plt.xlim([-1.5,1.5])
plt.ylim([-1.5,1.5])

[Figure: MNIST_SlicedWassersteinAutoEncoder_26_0.png]

In [18]: # Sample the latent variable on an Nsample x Nsample grid
Nsample=25
hiddenv=np.meshgrid(np.linspace(-1,1,Nsample),np.linspace(-1,1,Nsample))
v=np.concatenate((np.expand_dims(hiddenv[0].flatten(),1),
                  np.expand_dims(hiddenv[1].flatten(),1)),1)
# Decode the grid
decodeimg=np.squeeze(decoder.predict(v))

In [19]: # Visualize the grid
count=0
img=np.zeros((Nsample*28,Nsample*28))
for i in range(Nsample):
    for j in range(Nsample):
        img[i*28:(i+1)*28,j*28:(j+1)*28]=decodeimg[count,...]
        count+=1
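The double loop above can equivalently be written as a single reshape/transpose, which is convenient for larger grids (the helper name `tile_grid` is mine; this is an alternative, not what the notebook uses):

```python
import numpy as np

def tile_grid(imgs, nrows, ncols):
    # imgs: (nrows*ncols, h, w) -> one (nrows*h, ncols*w) mosaic,
    # filling row by row exactly like the loop in the notebook.
    n, h, w = imgs.shape
    assert n == nrows * ncols
    return (imgs.reshape(nrows, ncols, h, w)
                .transpose(0, 2, 1, 3)
                .reshape(nrows * h, ncols * w))
```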

In [20]: fig=plt.figure(figsize=(10,10))
plt.imshow(img)

[Figure: MNIST_SlicedWassersteinAutoEncoder_30_0.png]

In [21]: # Visualize the z samples
plt.figure(figsize=(10,10))
Z=generateZ(10000,2)
plt.scatter(Z[:,0],Z[:,1])
plt.xlim([-1.5,1.5])
plt.ylim([-1.5,1.5])

[Figure: MNIST_SlicedWassersteinAutoEncoder_31_0.png]