Dual Contradistinctive Generative Autoencoder

11/19/2020 · Gaurav Parmar, et al. · Carnegie Mellon University, University of California San Diego

We present a new generative autoencoder model with dual contradistinctive losses to improve the generative autoencoder, which performs simultaneous inference (reconstruction) and synthesis (sampling). Our model, named dual contradistinctive generative autoencoder (DC-VAE), integrates an instance-level discriminative loss (maintaining instance-level fidelity for the reconstruction/synthesis) with a set-level adversarial loss (encouraging set-level fidelity for the reconstruction/synthesis), both being contradistinctive. Extensive experimental results for DC-VAE across different resolutions, including 32×32, 64×64, 128×128, and 512×512, are reported. The two contradistinctive losses work harmoniously in DC-VAE, leading to a significant qualitative and quantitative performance enhancement over the baseline VAEs without architectural changes. State-of-the-art or competitive results among generative autoencoders for image reconstruction, image synthesis, image interpolation, and representation learning are observed. DC-VAE is a general-purpose VAE model, applicable to a wide variety of downstream tasks in computer vision and machine learning.


1 Introduction

Tremendous progress has been made in deep learning with the development of various learning frameworks [krizhevsky2012imagenet, he2016deep, goodfellow2014generative, vaswani2017attention]. The autoencoder (AE) [lecun1987modeles, hinton1994autoencoders] aims to compactly represent and faithfully reproduce the original input signal by concatenating an encoder and a decoder in an end-to-end learning framework. The goal of AE is to make the encoded representation semantically efficient and sufficient for its decoder to reproduce the input signal. The autoencoder's generative companion, the variational autoencoder (VAE) [kingma2013auto], additionally learns a variational model for the latent variables to capture the underlying sample distribution.

The key objective for a generative autoencoder is to maintain two types of fidelity: (1) an instance-level fidelity that makes the reconstruction/synthesis faithful to the individual input data sample, and (2) a set-level fidelity that makes the reconstruction/synthesis of the decoder faithful to the entire input data set. The VAE/GAN algorithm [VAEGAN] combines a reconstruction loss with an adversarial loss. However, the result of VAE/GAN is sub-optimal, as shown in Table 1.

The pixel-wise reconstruction loss in the standard VAE [kingma2013auto] typically results in blurry images with degenerated semantics. A possible solution to the above conflict lies in two aspects: (1) turning the measure in the pixel space into one in an induced feature space that is more semantically meaningful; and (2) changing the per-pixel L2 distance into a learned instance-level distance function for the entire image (akin to generative adversarial networks, which learn set-level distance functions). Taking these two steps allows us to design an instance-level classification loss that is aligned with the adversarial loss in the GAN model enforcing set-level fidelity. Motivated by the above observations, we develop a new generative autoencoder model with dual contradistinctive losses by adopting a discriminative loss that performs instance-level classification (enforcing instance-level fidelity), which is rooted in metric learning [kulis2012metric] and contrastive learning [hadsell2006dimensionality, wu2018unsupervised, infoNCE]. Combined with the adversarial loss for set-level fidelity, both terms are formulated in the induced feature space and perform contradistinction: (1) the instance-level contrastive loss considers each input instance (image) itself as a class, and (2) the set-level adversarial loss treats the entire input set as a positive class. We name our method dual contradistinctive generative autoencoder (DC-VAE) and make the following contributions.

  • We develop a new algorithm, dual contradistinctive generative autoencoder (DC-VAE), by combining instance-level and set-level classification losses in the VAE framework, and systematically show the significance of these two loss terms in DC-VAE.

  • The effectiveness of DC-VAE is illustrated in a number of tasks, including image reconstruction, image synthesis, image interpolation, and representation learning, by reconstructing and sampling images across different resolutions including 32×32, 64×64, 128×128, and 512×512.

  • With the new loss terms, DC-VAE attains a significant performance boost over the competing methods without architectural change, making it a general-purpose model applicable to a variety of computer vision tasks. DC-VAE greatly reduces the performance gap in image synthesis between the baseline VAE and competitive GAN models.

2 Related Work

Related work can be roughly divided into three categories: (1) generative autoencoder, (2) deep generative model, and (3) contrastive learning.

Generative autoencoder. The variational autoencoder (VAE) [kingma2013auto] points to an exciting direction for generative models by developing an Evidence Lower BOund (ELBO) objective [higgins2017beta, ding2020guided]. However, VAE reconstruction/synthesis is known to be blurry. To improve the image quality, a sequence of VAE-based models has been developed [VAEGAN, dumoulin2017adversarially, huang2018introvae, brock2018large, zhang2019perceptual]. VAE/GAN [VAEGAN] adopts an adversarial loss to improve image quality, but its output for both reconstruction and synthesis (new samples) is still unsatisfactory. IntroVAE [huang2018introvae] adds a loop from the output back to the input and is able to attain image quality that is on par with some modern GANs in some aspects; however, its full capability for both reconstruction and synthesis remains unclear. PGA [zhang2019perceptual] adds a constraint to the latent variables.

Deep generative model. The pioneering works of [tu2007learning, NCE] alleviate the difficulty of learning densities by approximating likelihoods via classification (real (positive) samples vs. fake (pseudo-negative or adversarial) samples). The generative adversarial network (GAN) [goodfellow2014generative] builds on neural networks and amortized sampling (a decoder network that maps a noise vector into an image). The subsequent development after GAN [DCGAN, WGAN, gulrajani2017improved, karras2018progressive, gong2019autogan, dumoulin2017adversarially, donahue2017bigan] has led to a great leap forward in building decoder-based generative models. It has been widely observed that the adversarial loss in GANs contributes significantly to the improved quality of image synthesis. Energy-based generative models [pmlr-v5-salakhutdinov09a, xie2016theory, jin2017introspective, lee2018wasserstein], which aim to directly model the data density, are making steady progress toward a single model that is simultaneously generative and discriminative.

Contrastive learning. From another angle, contrastive learning [hadsell2006dimensionality, wu2018unsupervised, he2020momentum, chen2020simple] has lately shown a particular advantage in unsupervised training of CNN features. It overcomes the limitation of unsupervised learning, where class labels are missing, by turning each image instance into its own class; thus, the softmax function from standard discriminative classification training can be applied. Contrastive learning can be connected to metric learning [bromley93, chopra2005, chechik2010].

In this paper, we aim to improve VAE [kingma2013auto] by introducing a contrastive loss [infoNCE] to address instance-level fidelity between the input and the reconstruction in the induced feature space. Unlike in self-supervised representation learning methods [infoNCE, he2020momentum, chen2020simple], where self-supervision requires generating a transformed input (via data augmentation operations), the reconstruction naturally fits into the contrastive term, which encourages the matching between the reconstruction and the input image instance while pushing the reconstruction away from the rest of the images in the entire training set. Thus, the instance-level and set-level contradistinctive terms collaborate with each other to encourage high fidelity of the reconstruction and synthesis. In Figure 3, we systematically show the significance of the instance-level and set-level contradistinctive terms by comparing models with and without each of them. In addition, we explore multi-scale contrastive learning via two schemes in Section 4.2: 1) deep supervision for contrastive learning in different convolution layers, and 2) patch-based contrastive learning for fine-grained data fidelity. In the experiments, we show competitive results for the proposed DC-VAE on a number of benchmarks for three tasks: image synthesis, image reconstruction, and representation learning.

3 Preliminaries: VAE and VAE/GAN

Variational autoencoder (VAE)

Assume a given training set of images, where each image $x \in \mathbb{R}^d$. We suppose that each $x$ is sampled from a generative process $p(x|z)$. In the literature, the vector $z$ refers to the latent variables. In practice, the latent variables and the generative process are unknown. The objective of a variational autoencoder (VAE) [kingma2013auto] is to simultaneously train an inference network $q_\phi(z|x)$ and a generator network $p_\theta(x|z)$. In VAE [kingma2013auto], the inference network is a neural network that outputs the parameters of a Gaussian distribution $q_\phi(z|x) = \mathcal{N}\big(\mu_\phi(x), \sigma_\phi(x)\big)$. The generator $G_\theta$ is a deterministic neural network parameterized by $\theta$. The generative density is assumed to be Gaussian: $p_\theta(x|z) = \mathcal{N}\big(G_\theta(z), \sigma^2 I\big)$. These models can be trained by minimizing the negative of the evidence lower bound (ELBO) in Eq. (1) below.

$\mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (1)

where $p(z)$ is the prior, which is assumed to be $\mathcal{N}(0, I)$. The first term reduces to the standard pixel-wise reconstruction loss (up to a constant) due to the Gaussian assumption. The second term is the regularization term, which prevents the conditional $q_\phi(z|x)$ from deviating from the Gaussian prior $p(z)$. The inference network and generator network are jointly optimized over the training samples by:

$\min_{\theta, \phi}\; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\mathcal{L}_{\mathrm{ELBO}}(x; \theta, \phi)\big]$   (2)

where $p_{\mathrm{data}}(x)$ is the distribution induced by the training set.

VAE has an elegant formulation. However, it relies on a pixel-wise reconstruction loss, which is known to be a poor proxy for perceptual realism [Johnson2016Perceptual, pix2pix2017] and often results in blurry images. From another viewpoint, it can be thought of as using a kernel density estimator (with an isotropic Gaussian kernel) in the pixel space. Although this allows efficient training and inference, such a non-parametric approach is overly simplistic for modeling the semantics and perception of natural images.
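For concreteness, a minimal PyTorch-style sketch of the negative ELBO of Eq. (1) under the Gaussian assumptions above is given below; the function and variable names are ours for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through the sample.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO of Eq. (1) for a Gaussian decoder with fixed variance.

    x, x_recon : (B, C, H, W) input and its reconstruction G_theta(z)
    mu, log_var: (B, D) parameters of q_phi(z|x) produced by the encoder
    """
    # The Gaussian log-likelihood term reduces to a pixel-wise L2 loss (up to a constant).
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # KL divergence between N(mu, diag(sigma^2)) and the standard normal prior N(0, I).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / x.size(0)
    return recon + kl
```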

VAE/GAN

Generative adversarial networks (GANs) [goodfellow2014generative] and their variants [DCGAN], on the other hand, have been shown to produce highly realistic images. The success is largely attributed to learning a fidelity function (often referred to as a discriminator) that measures how realistic the generated images are. This can be achieved by learning to contrast (classify) the set of training images against the set of generated images [tu2007learning, NCE, goodfellow2014generative].

VAE/GAN [VAEGAN] augments the ELBO objective (Eq. (2)) with the GAN objective. Specifically, the objective of VAE/GAN consists of two terms, namely the modified ELBO (Eq. (3)) and the GAN objective. To keep the notation consistent in what follows, we define the set of given training images as $X = \{x_i\}_{i=1}^{n}$, in which a total of $n$ unlabeled training images are present. For each input image $x_i$, the modified ELBO computes the reconstruction loss in the feature space of the discriminator instead of the pixel space:

$\mathcal{L}_{\mathrm{ELBO}'}(x; \theta, \phi) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta\big(f_D(x)\,|\,z\big)\big] + D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)$   (3)

where $f_D(\cdot)$ denotes the feature embedding from the discriminator $D$. This feature reconstruction loss (also referred to as a perceptual loss) is similar to that used in style transfer [Johnson2016Perceptual]. The modified GAN objective considers both reconstructed images (latent code from $q_\phi(z|x)$) and sampled images (latent code from the prior $p(z)$) as its fake samples:

$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log\big(1 - D(G_\theta(z))\big)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G_\theta(z))\big)\big]$   (4)

The VAE/GAN objective becomes:

$\min_{\theta, \phi}\max_{D}\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\mathcal{L}_{\mathrm{ELBO}'}(x; \theta, \phi)\big] + \mathcal{L}_{\mathrm{GAN}}$   (5)
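A rough sketch of the VAE/GAN terms in Eqs. (3)-(5) is shown below, assuming a discriminator object that exposes an intermediate feature map via disc.features and a non-saturating BCE formulation of the GAN loss; these interface and loss-form choices are our assumptions for illustration, not necessarily those of [VAEGAN].

```python
import torch
import torch.nn.functional as F

def vaegan_losses(x, encoder, decoder, disc, prior_dim):
    """Sketch of VAE/GAN: feature-space reconstruction (Eq. 3) plus a GAN term (Eq. 4)
    that treats both reconstructions and prior samples as fakes."""
    mu, log_var = encoder(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    x_recon = decoder(z)                                                     # reconstruction
    x_sample = decoder(torch.randn(x.size(0), prior_dim, device=x.device))   # sample from p(z)

    # Eq. (3): reconstruct in the discriminator's feature space (perceptual loss) + KL term.
    recon = F.mse_loss(disc.features(x_recon), disc.features(x))
    kl = -0.5 * torch.mean(torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1))

    # Eq. (4), discriminator side: real images vs. reconstructions and samples.
    real_logit = disc(x)
    fake_logits = torch.cat([disc(x_recon.detach()), disc(x_sample.detach())])
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    # Eq. (4), generator/decoder side: make reconstructions and samples look real.
    g_fake_logits = torch.cat([disc(x_recon), disc(x_sample)])
    g_loss = F.binary_cross_entropy_with_logits(g_fake_logits, torch.ones_like(g_fake_logits))

    return recon + kl, d_loss, g_loss
```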
Figure 2: Model architecture for the proposed DC-VAE algorithm.

4 Dual contradistinctive generative autoencoder (DC-VAE)

Here we want to address the following questions: Is the degeneration of the synthesized images by VAE always the case once the decoder is joined with an encoder? Can the problem be remedied by using a more informative loss?

Although VAE/GAN improves the image quality of VAE by integrating a set-level adversarial loss (the GAN objective of Eq. (4)), it still does not accurately model instance-level fidelity. Inspired by the literature on instance-level classification [exemplar-svm], approximating likelihood by classification [tu2007learning], and contrastive learning [hadsell2006dimensionality, wu2018unsupervised, he2020momentum], we propose to model instance-level fidelity with a contrastive loss (commonly referred to as the InfoNCE loss) [infoNCE]. In DC-VAE, we perform the following minimization and loosely call each term a loss.

$\mathcal{L}_{\mathrm{contrast}}(x_i) = -\log \dfrac{\exp\big(h(\hat{x}_i, x_i)\big)}{\sum_{x_j \in S}\exp\big(h(\hat{x}_i, x_j)\big)}$   (6)

where $i$ is an index for a training sample (instance), $\hat{x}_i$ is the reconstruction of $x_i$, $S$ is the union of the positive sample $\{x_i\}$ and the negative samples, and $h(\cdot, \cdot)$ is the critic function that measures the compatibility between $\hat{x}_i$ and $x_j$. Following the popular choice from [he2020momentum], $h(\hat{x}_i, x_j)$ is the cosine similarity between the embeddings of $\hat{x}_i$ and $x_j$. Note that unlike in contrastive self-supervised learning methods [infoNCE, he2020momentum, chen2020simple], where two views (independent augmentations) of an instance constitute a positive pair, an input instance and its reconstruction comprise a positive pair in DC-VAE. Likewise, the reconstruction $\hat{x}_i$ and any instance that is not $x_i$ form a negative pair.
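A minimal sketch of the instance-level contrastive term of Eq. (6) with in-batch negatives is shown below; the temperature value and the use of only in-batch negatives are our simplifications (DC-VAE uses a much larger pool of 8096 negatives, see Section 5.1).

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(recon_emb, inst_emb, temperature=0.07):
    """InfoNCE term of Eq. (6) with the cosine-similarity critic.

    recon_emb: (B, E) embeddings of the reconstructions x_hat_i
    inst_emb : (B, E) embeddings of the corresponding inputs x_i;
               row i is the positive for recon_emb[i], all other rows act as negatives
    """
    recon_emb = F.normalize(recon_emb, dim=1)
    inst_emb = F.normalize(inst_emb, dim=1)
    # logits[i, j] = h(x_hat_i, x_j): cosine similarity (scaled by a temperature).
    logits = recon_emb @ inst_emb.t() / temperature
    # The positive candidate for reconstruction i is instance i itself.
    targets = torch.arange(recon_emb.size(0), device=recon_emb.device)
    return F.cross_entropy(logits, targets)
```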

To bridge the gap between the instance-level contrastive loss (Eq. (6)) and the log-likelihood in the ELBO term (Eq. (1)), we observe the following connection.

Remark 1

(From [ma-collins-2018-noise, pmlr-v97-poole19a]) The following objective is minimized, i.e., the optimal critic is achieved, when $h(\hat{x}_i, x) = \log p_\theta(x\,|\,z_i) + c(\hat{x}_i)$, where $c(\cdot)$ is any function that does not depend on $x$:

$\mathbb{E}_{x_i \sim p_{\mathrm{data}}}\left[-\log \dfrac{\exp\big(h(\hat{x}_i, x_i)\big)}{\sum_{x_j \in S}\exp\big(h(\hat{x}_i, x_j)\big)}\right]$   (7)

It can be seen from [ma-collins-2018-noise, pmlr-v97-poole19a] that the contrastive loss of Eq. (6) implicitly estimates the log-likelihood required for the evidence lower bound (ELBO). Hence, we modify the ELBO objective of Eq. (1) as follows and name it the implicit ELBO (IELBO):

$\mathcal{L}_{\mathrm{IELBO}}(x_i; \theta, \phi, h) = -\log \dfrac{\exp\big(h(\hat{x}_i, x_i)\big)}{\sum_{x_j \in S}\exp\big(h(\hat{x}_i, x_j)\big)} + D_{\mathrm{KL}}\big(q_\phi(z|x_i)\,\|\,p(z)\big)$   (8)

Finally, the combined objective for the proposed DC-VAE algorithm becomes:

$\min_{\theta, \phi, h}\max_{D}\; \mathbb{E}_{x_i \sim p_{\mathrm{data}}}\big[\mathcal{L}_{\mathrm{IELBO}}(x_i; \theta, \phi, h)\big] + \mathcal{L}_{\mathrm{GAN}}$   (9)
Figure 3: Qualitative results on CIFAR-10 [cifar10] images (resolution 32×32) for the experiments in Table 1.

The definition of $\mathcal{L}_{\mathrm{GAN}}$ follows Eq. (4). Note that we also consider the term in Eq. (4) contradistinctive, since it performs discriminative classification between the input (“real”) image set and the reconstructed/generated (“fake”) image set. Below we highlight the significance of the two contradistinctive terms. Figure 2 shows the model architecture.
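Putting the pieces together, a hedged sketch of the DC-VAE objective of Eq. (9) for a single generator-side update is given below; the embedding head, the negative bank, the equal weighting of the terms, and the omission of the separate discriminator update are our simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def dc_vae_objective(x, encoder, decoder, disc, embed_head, neg_bank, prior_dim):
    """Simplified DC-VAE loss: IELBO (contrastive + KL, Eq. 8) plus the GAN term (Eq. 4)."""
    mu, log_var = encoder(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    x_recon = decoder(z)
    x_sample = decoder(torch.randn(x.size(0), prior_dim, device=x.device))

    # Instance-level term of Eq. (8): contrast each reconstruction against its input
    # (positive) and a pool of other instances (negatives). neg_bank: (N, E), pre-normalized.
    recon_emb = F.normalize(embed_head(disc.features(x_recon)), dim=1)
    pos_emb = F.normalize(embed_head(disc.features(x)), dim=1)
    candidates = torch.cat([pos_emb, neg_bank], dim=0)           # (B + N, E)
    logits = recon_emb @ candidates.t()                          # cosine-similarity critic
    targets = torch.arange(x.size(0), device=x.device)           # positives occupy the first B columns
    contrastive = F.cross_entropy(logits, targets)

    kl = -0.5 * torch.mean(torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1))

    # Set-level term of Eq. (4), generator view (the discriminator update is omitted here).
    fake_logits = torch.cat([disc(x_recon), disc(x_sample)])
    gan_g = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

    return contrastive + kl + gan_g
```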

4.1 Understanding the loss terms

Instance-level fidelity. The first term in Eq. (8) is an instance-level fidelity term encouraging the reconstruction to be as close as possible to the input image while being different from all the rest of the images. A key advantage of the contrastive loss in Eq. (8) over the standard reconstruction loss in Eq. (3) is its relaxed, background-instance-aware formulation. In general, the reconstruction loss in Eq. (3) demands a perfect match between the reconstruction and the input, whereas the contrastive loss in Eq. (8) only asks the reconstruction to be the most similar to its input among the training samples. This way, the contrastive loss is more cooperative with, and in less conflict with, the GAN loss than the reconstruction loss is. The introduction of the contrastive loss results in a significant improvement over VAE and VAE/GAN.

We further explain the difference between the reconstruction and contrastive losses based on an input $x$ and its reconstruction $\hat{x}$. To simplify the notation, we use $x$ and $\hat{x}$ directly instead of the output-layer features (shown in Eq. (4)) for illustration purposes. The reconstruction loss $\|x - \hat{x}\|^2$ enforces the similarity between the reconstructed image and the input image, while the GAN loss computes an adversarial loss $\log D(x) + \log\big(1 - D(\hat{x})\big)$, where $D$ refers to the discriminative classifier. The reconstruction loss term enforces pixel-wise/feature matching between the input and the reconstruction, while the GAN loss encourages the reconstruction and the input to be discriminatively non-separable; the two are measured in different ways, resulting in a conflict. Our contrastive loss, on the other hand, is also a discriminative term: it can be viewed as $-\log \frac{\exp(h(\hat{x}, x))}{\sum_{x' \in S}\exp(h(\hat{x}, x'))}$. Comparing the reconstruction loss with the contrastive loss: the former wants an exact match between the reconstruction and the input, whereas the latter is more relaxed, being satisfied if the reconstruction, even without an exact match, is the closest one to the input among all the training samples.

In other words, the reconstruction loss demands a perfect match for instance-level fidelity, whereas the contrastive loss only asks for the reconstruction to be the most similar one among the given training samples. Using the contrastive loss gives more room and creates less conflict with the GAN loss.

Set-level fidelity. The second term in Eq. (9) is a set-level fidelity term encouraging the entire set of synthesized images to be indistinguishable from the input image set. Having this term (Eq. (4)) is still important, since the instance-level contrastive loss alone (Eq. (8)) leads to a degenerate situation: the input image and its reconstruction can be projected to the same point in the induced feature space, but without a guarantee that the reconstruction itself lies on the valid “real” image manifold.

Figure 3 and Table 1 show the comparison with and without the individual terms in Eq. (9). We observe the evident effectiveness of the proposed DC-VAE, which combines both the instance-level fidelity term (Eq. (6)) and the set-level fidelity term (Eq. (4)), compared with VAE (pixel-wise reconstruction loss without the GAN objective), VAE/GAN (feature reconstruction loss with the GAN objective), and VAE-Contrastive (contrastive loss without the GAN objective).

In the experiments, we show that both terms are required to achieve faithful reconstruction (captured by the InfoNCE loss) with perceptual realism (captured by the GAN loss).

4.2 Multi-scale contrastive learning

Inspired by [lee2015deeply], we utilize information from feature maps at different scales. In addition to contrasting on the last layer of the discriminator $D$ in Eq. (9), we add a contrastive objective on $g(D_\ell(x))$, where $g$ is some function on top of an intermediate layer $D_\ell$ of $D$. We do this in two different ways.

  1. Deep supervision: We use a 1×1 convolution to reduce the dimension channel-wise, and use a linear layer to obtain the contrastive embedding.

  2. Local patch: We use the feature vector at a random spatial location of layer $D_\ell$, taken across channels (size: 1×1×d, where d is the channel depth).

The intuition for the second scheme is that, in a convolutional neural network, one location in a feature map corresponds to a receptive area (patch) in the original image. Thus, by contrasting locations across channels in the same feature maps, we encourage the original image and its reconstruction to have locally similar content, while encouraging the reconstruction to have locally dissimilar content from other images. We use deep supervision for the initial training, and add the local patch objective after a certain number of iterations.
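A sketch of the two multi-scale schemes is given below, assuming access to an intermediate discriminator feature map; the layer choice, channel widths, spatial size, and embedding dimension are illustrative placeholders. For the local patch scheme, the same random location would be used for an input and its reconstruction so that corresponding patches are contrasted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Scheme 1: reduce the intermediate feature map channel-wise with a 1x1 convolution,
    then map it to the contrastive embedding with a linear layer."""
    def __init__(self, in_channels, reduced=32, spatial=8, embed_dim=16):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.fc = nn.Linear(reduced * spatial * spatial, embed_dim)  # assumes a fixed spatial size

    def forward(self, feat):                      # feat: (B, C, H, W) from an intermediate layer of D
        h = self.reduce(feat).flatten(1)
        return F.normalize(self.fc(h), dim=1)

def local_patch_embedding(feat, loc=None):
    """Scheme 2: the d-dimensional feature vector at one spatial location (a 1x1xd slice),
    which corresponds to a local patch (receptive field) of the original image."""
    b, c, h, w = feat.shape
    if loc is None:
        loc = (torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item())
    i, j = loc
    return F.normalize(feat[:, :, i, j], dim=1)   # (B, d)
```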

5 Experiments

5.1 Implementation

Datasets To validate our method, we train our method on several different datasets: CIFAR-10 [cifar10], STL-10 [STL10], CelebA [CelebA], CelebA-HQ [karras2018progressive], and LSUN Bedroom [yu15lsun]. See the appendix for more detailed descriptions.

Network architecture For the low-resolution experiments (CIFAR-10 and STL-10), we design the encoder and decoder subnetworks of our model in a similar way to the discriminator and generator found through neural architecture search in AutoGAN [gong2019autogan]. For the higher-resolution experiments (CelebA, CelebA-HQ, and LSUN Bedroom), we use Progressive GAN [karras2018progressive] as the backbone. Network architecture diagrams are available in the appendix.

Training details

The number of negative samples for contrastive learning is 8096 for all datasets (an analysis of this hyperparameter is provided in the supplementary material). The latent dimension for the VAE decoder is 128 for CIFAR-10 and STL-10, and 512 for CelebA, CelebA-HQ, and LSUN Bedroom. We train with Adam using a learning rate of 0.0002 and a batch size of 128 for CIFAR-10 and STL-10. For the CelebA, CelebA-HQ, and LSUN Bedroom datasets, we use the optimizer parameters given in [karras2018progressive]. The contrastive embedding dimension is 16 for all experiments.
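For reference, these reported hyperparameters can be summarized as a small configuration sketch; values not stated above (e.g., the Adam beta parameters) are intentionally omitted.

```python
# Training hyperparameters reported in Section 5.1.
dc_vae_config = {
    "num_negatives": 8096,              # negative samples for contrastive learning (all datasets)
    "latent_dim": {"cifar10": 128, "stl10": 128,
                   "celeba": 512, "celeba_hq": 512, "lsun_bedroom": 512},
    "learning_rate": 2e-4,              # Adam, CIFAR-10 / STL-10
    "batch_size": 128,                  # CIFAR-10 / STL-10
    "contrastive_embed_dim": 16,        # used in all experiments
    # CelebA, CelebA-HQ, and LSUN Bedroom follow the optimizer settings of Progressive GAN.
}
```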

5.2 Ablation Study

| Method | FID↓ / IS↑ (Sampling) | FID↓ / IS↑ (Reconstruction) | Pixel Distance↓ | Perceptual Distance↓ |
|---|---|---|---|---|
| VAE | 115.8 / 3.8 | 108.4 / 4.3 | 21.8 | 65.8 |
| VAE/GAN | 39.8 / 7.4 | 29.0 / 7.6 | 62.7 | 57.2 |
| VAE-Contrastive | 240.4 / 1.8 | 242 / 1.9 | 53.6 | 104.2 |
| DC-VAE | 17.9 / 8.2 | 21.4 / 7.9 | 45.9 | 52.9 |

Table 1: Ablation studies on CIFAR-10 for the proposed DC-VAE algorithm. We follow [Johnson2016Perceptual] and measure perceptual distance in the relu4_3 layer of a pretrained VGG network. ↓ means lower is better; ↑ means higher is better.

To demonstrate the necessity of the GAN loss (Eq. (4)) and the contrastive loss (Eq. (8)), we conduct four experiments with the same backbone: VAE (no GAN, no Contrastive), VAE/GAN (with GAN, no Contrastive), VAE-Contrastive (no GAN, with Contrastive), and ours (with GAN, with Contrastive). Here, GAN denotes Eq. (4), and Contrastive denotes Eq. (8).

| Method | CIFAR-10 IS | CIFAR-10 FID | STL-10 IS | STL-10 FID |
|---|---|---|---|---|
| Methods based on GAN: | | | | |
| DCGAN [DCGAN] | 6.6 | - | - | - |
| ProbGAN [he2019probgan] | 7.8 | 24.6 | 8.9 | 46.7 |
| WGAN-GP ResNet [gulrajani2017improved] | 7.9 | - | - | - |
| RaGAN [jolicoeur2018relativistic] | - | 23.5 | - | - |
| SN-GAN [miyato2018spectral] | 8.2 | 21.7 | 9.1 | 40.1 |
| MGAN [hoang2018mgan] | 8.3 | 26.7 | - | - |
| Progressive GAN [karras2018progressive] | 8.8 | - | - | - |
| Improving MMD GAN [wang2019improving] | 8.3 | 16.2 | 9.3 | 37.6 |
| PULSGAN [PUGAN] | - | 22.3 | - | - |
| AutoGAN [gong2019autogan] | 8.6 | 12.4 | 9.2 | 31.0 |
| Methods based on VAE: | | | | |
| VAE | 3.8 | 115.8 | - | - |
| VAE/GAN | 7.4 | 39.8 | - | - |
| VEEGAN [veegan2017] | - | 95.2 | - | - |
| WAE-GAN [WAE] | - | 93.1 | - | - |
| NVAE [vahdat2020NVAE] Sampling | - | 50.8 | - | - |
| NVAE [vahdat2020NVAE] Reconstruction | - | 2.67 | - | - |
| DC-VAE Sampling (ours) | 8.2 | 17.9 | 8.1 | 41.9 |
| DC-VAE Recon. (ours) | 7.9 | 21.4 | 8.4 | 43.6 |

Table 2: Comparison on CIFAR-10 and STL-10. Average Inception scores (IS) [salimans2016improved] and FID scores [FID]. Results derived from [gong2019autogan]. Table style based on [lee2019meta]. Result from [aneja2020ncpvae]. Result from [dieng2019prescribed].

Qualitative analysis From Figure 3, we see that without the GAN and contrastive losses, the images are blurry; without the GAN loss, the contrastive head can classify images, but the reconstructions do not lie on the image manifold; without the contrastive loss, the reconstructed images lie on the image manifold because of the discriminator, but they differ from the input images. These experiments show that it is necessary to combine both instance-level and set-level fidelity, and in a contradistinctive manner.

Quantitative analysis In Table 1 we observe the same trend. VAE generates blurry images; thus its FID/IS (Inception Score) is not ideal. VAE-Contrastive does not generate images on the natural image manifold; thus its FID/IS is poor. VAE/GAN combines set-level and instance-level information, but its L2 objective is not ideal; thus its FID/IS is sub-optimal. For both the reconstruction and sampling tasks, DC-VAE generates high-fidelity images and has favorable FID and Inception scores. This illustrates the advantage of having a contradistinctive objective at both the set level and the instance level. To measure the faithfulness of the reconstructed image, we compute the pixel-wise L2 distance and the perceptual distance [Johnson2016Perceptual]. For the pixel distance, VAE has the lowest value because it directly optimizes this distance during training; our pixel-wise distance is better than those of VAE/GAN and VAE-Contrastive. For the perceptual distance, our method outperforms the other three, which confirms that using contrastive learning helps reconstruct images semantically.

5.3 Comparison to existing generative models

Figure 4: Comparison of DC-VAE (resolution 512×512) with IntroVAE [huang2018introvae] (resolution 1024×1024). Zoom in for a better visualization.

Table 2 gives a quantitative comparison on the CIFAR-10 and STL-10 datasets. In general, there is a large difference in FID and IS between the GAN family and the VAE family of models. Our model achieves state-of-the-art results within the VAE family and is comparable to state-of-the-art GAN models on CIFAR-10. Similarly, Tables 3, 4, and 5 show that DC-VAE is able to generate images that are comparable to GAN-based methods even on higher-resolution datasets such as LSUN Bedrooms, CelebA, and CelebA-HQ. Our method achieves state-of-the-art results on these datasets among VAE-based methods, including those that focus on building better architectures. Figure 4 and Table 8 show that our model yields more faithful reconstructions than existing state-of-the-art generative autoencoder methods.

| Method | FID↓ (Sampling) | FID↓ (Reconstruction) |
|---|---|---|
| Progressive GAN [karras2018progressive] | 8.3 | - |
| SNGAN [miyato2018spectral] (from [chen2019self]) | 16.0 | - |
| SSGAN [chen2019self] | 13.3 | - |
| StyleALAE [pidhorskyi2020adversarial] | 17.13 | 15.92 |
| DC-VAE (ours) | 14.3 | 10.57 |

Table 3: Quality of image generation (FID) comparison on LSUN Bedrooms, at 128×128 and 256×256 resolution. ↓ means lower is better.
| Method | FID↓ |
|---|---|
| StyleALAE [pidhorskyi2020adversarial] | 19.21 |
| NVAE [vahdat2020NVAE] (from [aneja2020ncpvae]) | 40.26 |
| NCP-VAE [aneja2020ncpvae] | 24.69 |
| DC-VAE (ours) | 15.81 |

Table 4: FID comparison on CelebA-HQ at 256×256 resolution. ↓ means lower is better.

| Method | FID↓ |
|---|---|
| Methods based on GAN: | |
| PresGAN [dieng2019prescribed] | 29.1 |
| LSGAN [mao2017least] (from [glann2019]) | 53.9 |
| COCO-GAN [lin2019coco] | 5.7 |
| ProGAN [karras2018progressive] (from [lin2019coco]) | 7.30 |
| Methods based on VAE: | |
| VEE-GAN [veegan2017] (from [dieng2019prescribed]) | 46.2 |
| WAE-GAN [WAE] | 42 |
| DC-VAE (ours) Reconstruction | 14.3 |
| DC-VAE (ours) Sampling | 19.9 |

Table 5: FID comparison on CelebA, at 64×64 and 128×128 resolution. ↓ means lower is better.
Figure 5: Latent traversal on CelebA-HQ [karras2018progressive] and LSUN Bedroom [yu15lsun], and example image editing on a CelebA-HQ [karras2018progressive] image. (Zoom in for a better visualization.)
Figure 6: Interpolation results generated by DC-VAE (ours) on CelebA-HQ [karras2018progressive] images (left) and LSUN Bedroom [yu15lsun] images (right). (Zoom in for a better visualization.)

5.4 Latent Space Representation: Image and style interpolation

We further validate the effectiveness of DC-VAE for representation learning. One benefit of having an AE/VAE framework, compared with having just a decoder as in GAN [goodfellow2014generative], is being able to directly obtain the latent representation from the input images. The encoder and decoder modules in VAE allow us to readily perform image/style interpolation by mixing the latent variables of different images and reconstructing/synthesizing new ones. We demonstrate qualitative results on image interpolation (Fig. 6), and on style interpolation and image editing (Fig. 5); the method used for the latter is outlined in the supplementary material. We directly use the trained DC-VAE model without disentanglement learning [karras2019style]. We also quantitatively compare latent space disentanglement through the perceptual path length (PPL) [karras2019style] (Table 7). We observe that DC-VAE learns a more disentangled latent space representation than its Progressive GAN [karras2018progressive] backbone and than StyleALAE [pidhorskyi2020adversarial], which uses a much more capable StyleGAN [karras2019style] backbone.

5.5 Latent Space Representation: Classification

| Method | Classification error↓ | | |
|---|---|---|---|
| VAE [kingma2013auto] | 2.92% ± 0.12 | 3.05% ± 0.42 | 2.98% ± 0.14 |
| β-VAE (β=2) [higgins2017beta] | 4.69% ± 0.18 | 5.26% ± 0.22 | 5.40% ± 0.33 |
| FactorVAE (γ=5) [kim2018disentangling] | 6.07% ± 0.05 | 6.18% ± 0.20 | 6.35% ± 0.48 |
| β-TCVAE (α=1, β=5, γ=1) [chen2018isolating] | 1.62% ± 0.07 | 1.24% ± 0.05 | 1.32% ± 0.09 |
| Guided-VAE [ding2020guided] | 1.85% ± 0.08 | 1.60% ± 0.08 | 1.49% ± 0.06 |
| Guided-β-TCVAE [ding2020guided] | 1.47% ± 0.12 | 1.10% ± 0.03 | 1.31% ± 0.06 |
| DC-VAE (Ours) | 1.30% ± 0.035 | 1.27% ± 0.037 | 1.29% ± 0.034 |

Table 6: Comparison to prior VAE-based representation learning methods: classification error on the MNIST dataset. ↓: lower is better. 95% confidence intervals are from 5 trials. Results derived from [ding2020guided].

To show that our model learns a good representation, we measure performance on a downstream MNIST classification task [ding2020guided]. The VAE models are trained on the MNIST dataset [lecun2010mnist]. We feed input images into our VAE encoder to obtain the latent representation, and then train a linear classifier on this representation to predict the class of the input image. The results in Table 6 show that our model gives the lowest classification error in most cases. This experiment demonstrates that our model not only gains the ability to perform faithful synthesis and reconstruction, but also gains better representation ability on the VAE side.
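A sketch of this linear-probe protocol is shown below, assuming a trained and frozen DC-VAE encoder that returns the posterior mean and log-variance; the optimizer, epoch count, and learning rate are illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, train_loader, latent_dim, num_classes=10, epochs=20, lr=1e-3):
    """Freeze the DC-VAE encoder and train a linear classifier on its latent codes."""
    encoder.eval()
    clf = nn.Linear(latent_dim, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                mu, _ = encoder(x)            # use the posterior mean as the representation
            loss = F.cross_entropy(clf(mu), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```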

| Method | Backbone | PPL (full) |
|---|---|---|
| StyleALAE [pidhorskyi2020adversarial] | StyleGAN [karras2019style] | 33.29 |
| ProGAN [karras2018progressive] | ProGAN [karras2018progressive] | 40.71 |
| DC-VAE (ours) | ProGAN [karras2018progressive] | 24.66 |

Table 7: PPL comparison on CelebA-HQ [karras2018progressive].

| Method | Backbone | Pixel Distance↓ | Perceptual Distance↓ |
|---|---|---|---|
| StyleALAE [pidhorskyi2020adversarial] | StyleGAN [karras2019style] | 0.117 | 40.40 |
| DC-VAE (ours) | ProGAN [karras2018progressive] | 0.072 | 38.63 |

Table 8: Reconstruction comparison on the CelebA-HQ [karras2018progressive] validation set. We follow [Johnson2016Perceptual] and measure perceptual distance in the relu4_3 layer of a pretrained VGG network. ↓ means lower is better.

6 Conclusion

In this paper, we have developed the dual contradistinctive generative autoencoder (DC-VAE), a new framework that integrates an instance-level discriminative loss (InfoNCE) and a set-level adversarial loss (GAN) into a single variational autoencoder framework. Our experiments show state-of-the-art or competitive results on several tasks, including image synthesis, image reconstruction, representation learning for image interpolation, and representation learning for classification. DC-VAE is a general-purpose VAE model, and it points to an encouraging direction that attains high-quality synthesis (decoding) and inference (encoding).

7 Acknowledgment

This work is funded by NSF IIS-1717431 and NSF IIS-1618477. Zhuowen Tu is also funded under a Qualcomm Faculty Award.

References

Appendix A Appendix

A.1 Additional reconstruction results

In Figures 8 and 14 we show a large collection of additional reconstruction images on the CelebA-HQ [karras2018progressive] and LSUN Bedroom [yu15lsun] datasets.

A.2 Smoothness of latent space

In this section we analyse the smoothness of the latent space learnt by DC-VAE. In Figure 11 we show additional high-resolution CelebA-HQ [karras2018progressive] images generated by an evenly spaced linear blending between two latent vectors. In Fig. 5 we show that DC-VAE is able to perform meaningful attribute editing on images while retaining the original identity. To perform image editing, we first need to compute the direction vector in the latent space that corresponds to a desired attribute (e.g., has glasses, has blonde hair, is a woman, has facial hair). We compute these attribute direction vectors by selecting 20 images that have the attribute and 20 images that do not, obtaining the corresponding two sets of 20 latent vectors, and taking the difference of their means. The results in Fig. 5 show that these direction vectors can be added to a latent vector to add a diverse combination of desired image attributes while retaining the original identity of the individual.
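A sketch of this editing procedure is given below; the encoder/decoder interfaces and the edit strength are our assumptions for illustration.

```python
import torch

def attribute_direction(encoder, imgs_with_attr, imgs_without_attr):
    """Estimate a latent attribute direction as the difference between the mean latent
    codes of ~20 images with the attribute and ~20 images without it."""
    with torch.no_grad():
        z_pos, _ = encoder(imgs_with_attr)       # (20, D) posterior means
        z_neg, _ = encoder(imgs_without_attr)    # (20, D) posterior means
    return z_pos.mean(dim=0) - z_neg.mean(dim=0)

def edit_image(encoder, decoder, img, direction, strength=1.0):
    """Move an image's latent code along the attribute direction and decode the result."""
    with torch.no_grad():
        z, _ = encoder(img.unsqueeze(0))
        return decoder(z + strength * direction)
```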

A.3 Effect of negative samples

In this section we analyse the effect of varying the number of negative samples used for contrastive learning. Figure 7 shows the reconstruction error on the CIFAR-10 [cifar10] test set as the number of negative samples is varied. We observe that a higher number of negative samples results in better reconstruction. We choose 8096 for all of our experiments because of memory constraints.

Figure 7: Pixel reconstruction error on the CIFAR-10 [cifar10] test set for a varying number of negative samples.
Figure 8: Additional CelebA-HQ [karras2018progressive] reconstruction images generated by DC-VAE (ours).

A.4 Dataset details

CIFAR-10 comprises 50,000 training images and 10,000 test images with a spatial resolution of 32×32. STL-10 is a similar dataset that contains 5,000 training images and 100,000 unlabeled images at 96×96 resolution. We follow the procedure in AutoGAN [gong2019autogan] and resize the STL-10 images to 48×48. The CelebA dataset has 162,770 training images and 19,962 testing images, CelebA-HQ contains 30,000 images of size 1024×1024, and LSUN Bedroom has approximately 3M images. For CelebA-HQ we split the dataset into 29,000 training images and 1,000 validation images following the method in [huang2018introvae]. For the progressive training, we resize the images in these three datasets progressively from a low starting resolution up to the target resolution.

A.5 Network architecture diagrams

In Figure 15 we show the detailed network architecture of DC-VAE for the CIFAR-10 and STL-10 input resolutions. Note that the comparison results shown in Figure 3 and Table 1 in the main paper, for VAE, VAE/GAN, VAE-Contrastive, and our proposed DC-VAE, are all based on the same network architecture (shown in Figure 15 here), for a fair comparison.

The network architectures shown in Figure 15 are adapted closely from the networks discovered by [gong2019autogan] through neural architecture search. The DC-VAE developed in our paper is not tied to any particular CNN architecture; we choose the AutoGAN architecture [gong2019autogan] to start with a strong baseline. The decoder in Figure 15 matches the generator in [gong2019autogan]. The encoder is built by modifying the output shape of the final linear layer in the discriminator of AutoGAN [gong2019autogan] to match the latent dimension and adding spectral normalization. The discriminator is used both for classifying real/fake images and for contrastive learning. For each layer we choose, we first apply a 1×1 convolution and a linear layer, and then use this feature as an input to the contrastive module. For the low-resolution experiments, we pick two different positions: the output of the second residual conv block (lower level) and the output of the first linear layer (higher level). For experiments on higher-resolution datasets, we use a Progressive GAN [karras2018progressive] generator and discriminator as our backbone and apply similar modifications as described above.

A.6 Further details about the representation learning experiments

As seen in Table 6 in the main paper, we show the representation capability of DC-VAE following the procedure outlined in [ding2020guided]. We train our model on the MNIST dataset [lecun2010mnist] and measure the transferability through a classification task on the latent embedding vector. Specifically, we first pretrain the DC-VAE model on the training split of the MNIST dataset. Following that, we freeze the DC-VAE model and train a linear classifier that takes the latent embedding vector as input and predicts the class label of the original image.

Figure 9: Visualization of the effect of adding each of the instance-level and set-level objectives. Table 1 and Figure 3 contain FID [FID] results and qualitative comparisons on CIFAR-10 [cifar10] that correspond to these settings.
(a) STL-10 Reconstructions generated by DC-VAE
(b) STL-10 Samples generated by DC-VAE
Figure 10: DC-VAE reconstruction (a) and synthesis (b) results on STL-10 [STL10] images. In (a), the top two rows are input images and the bottom two rows are the corresponding reconstructions.

A.7 Evaluation details

In Tables 1 and 8, the perceptual distance is computed as the average MSE distance between features extracted by a pretrained VGG-16 network. We borrow from [Johnson2016Perceptual] and use the activation of the relu4_3 layer. For computing the FID scores, we follow the standard practice ([huang2018introvae], [pidhorskyi2020adversarial]) and use 50,000 generated images. In Table 5 we use the version of the DC-VAE model trained on CelebA-HQ [karras2018progressive] for a fair comparison with other methods that are trained at the same resolution.
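A sketch of this perceptual-distance computation is shown below; relu4_3 corresponds to index 22 of torchvision's VGG-16 feature extractor, and the weights API assumes torchvision >= 0.13.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Feature extractor up to and including relu4_3 (index 22 in vgg16.features).
_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:23].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_distance(x, x_recon):
    """Average MSE between relu4_3 VGG-16 features of an input and its reconstruction."""
    with torch.no_grad():
        return F.mse_loss(_vgg(x), _vgg(x_recon)).item()
```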

Figure 11: Additional latent space interpolations on CelebA-HQ [karras2018progressive].
Figure 12: Latent mixing results on CelebA-HQ [karras2018progressive]. Each combined image in the grid is generated by replacing an arbitrary subset of the Source A latent with the corresponding Source B latent.
Figure 13: Additional image editing on CelebA-HQ [karras2018progressive] reconstruction images.
Figure 14: Additional LSUN Bedroom [yu15lsun] reconstruction images.
(a) Encoder (b) Decoder (c) Discriminator
Figure 15: Network architecture of DC-VAE for the CIFAR-10 [cifar10] and STL-10 [STL10] experiments. (a) is the Encoder, (b) is the Decoder, and (c) is the Discriminator.