1 Introduction
Tremendous progress has been made in deep learning through the development of various learning frameworks [krizhevsky2012imagenet, he2016deep, goodfellow2014generative, vaswani2017attention]. An autoencoder (AE) [lecun1987modeles, hinton1994autoencoders] aims to compactly represent and faithfully reproduce the original input signal by concatenating an encoder and a decoder in an end-to-end learning framework. The goal of an AE is to make the encoded representation semantically efficient and sufficient for its decoder to reproduce the input signal. The autoencoder's generative companion, the variational autoencoder (VAE) [kingma2013auto], additionally learns a variational model for the latent variables to capture the underlying sample distribution.
The key objective for a generative autoencoder is to maintain two types of fidelity: (1) an instance-level fidelity to make the reconstruction/synthesis faithful to the individual input data sample, and (2) a set-level fidelity to make the reconstruction/synthesis of the decoder faithful to the entire input data set. The VAE/GAN algorithm [VAEGAN] combines a reconstruction loss with an adversarial loss. However, the result of VAE/GAN is suboptimal, as shown in Table 1.
* indicates equal contribution.
The pixel-wise reconstruction loss in the standard VAE [kingma2013auto] typically results in blurry images with degenerated semantics. A possible solution to the conflict between the reconstruction loss and the adversarial loss lies in two aspects: (1) turning the measure in the pixel space into one in an induced feature space that is more semantically meaningful; (2) changing the L2 distance (per-pixel) into a learned instance-level distance function for the entire image (akin to generative adversarial networks, which learn set-level distance functions). Taking these two steps allows us to design an instance-level classification loss that is aligned with the adversarial loss in the GAN model enforcing set-level fidelity. Motivated by the above observations, we develop a new generative autoencoder model with dual contradistinctive losses by adopting a discriminative loss performing instance-level classification (enforcing the instance-level fidelity), which is rooted in metric learning [kulis2012metric] and contrastive learning [hadsell2006dimensionality, wu2018unsupervised, infoNCE]. Combined with the adversarial loss for the set-level fidelity, both terms are formulated in the induced feature space and perform contradistinction: (1) the instance-level contrastive loss considers each input instance (image) itself as a class, and (2) the set-level adversarial loss treats the entire input set as a positive class. We name our method dual contradistinctive generative autoencoder (DCVAE) and make the following contributions.
We develop a new algorithm, dual contradistinctive generative autoencoder (DCVAE), by combining instance-level and set-level classification losses in the VAE framework, and systematically show the significance of these two loss terms in DCVAE.

The effectiveness of DCVAE is illustrated in a number of tasks, including image reconstruction, image synthesis, image interpolation, and representation learning, by reconstructing and sampling images across four different resolutions.

Under the new loss term, DCVAE attains a significant performance boost over the competing methods without architectural change, making it a general-purpose model applicable to a variety of computer vision tasks. DCVAE greatly reduces the performance gap in image synthesis between the baseline VAE and competitive GAN models.
2 Related Work
Related work can be roughly divided into three categories: (1) generative autoencoder, (2) deep generative model, and (3) contrastive learning.
Generative autoencoder. The variational autoencoder (VAE) [kingma2013auto] points to an exciting direction for generative models by developing an Evidence Lower BOund (ELBO) objective [higgins2017beta, ding2020guided]. However, the VAE reconstruction/synthesis is known to be blurry. To improve the image quality, a sequence of VAE-based models has been developed [VAEGAN, dumoulin2017adversarially, huang2018introvae, brock2018large, zhang2019perceptual]. VAE/GAN [VAEGAN] adopts an adversarial loss to improve the quality of the image, but its output for both reconstruction and synthesis (new samples) is still unsatisfactory. IntroVAE [huang2018introvae] adds a loop from the output back to the input and is able to attain image quality that is on par with some modern GANs in some aspects. However, its full illustration for both reconstruction and synthesis is unclear. PGA [zhang2019perceptual] adds a constraint to the latent variables.
Deep generative model. Pioneering works [tu2007learning, NCE] alleviate the difficulty of learning densities by approximating likelihoods via classification (real (positive) samples vs. fake (pseudo-negative or adversarial) samples). The generative adversarial network (GAN) [goodfellow2014generative] builds on neural networks and amortized sampling (a decoder network that maps noise into an image). The subsequent development after GAN [DCGAN, WGAN, gulrajani2017improved, karras2018progressive, gong2019autogan, dumoulin2017adversarially, donahue2017bigan] has led to a great leap forward in building decoder-based generative models. It has been widely observed that the adversarial loss in GANs contributes significantly to the improved quality of image synthesis. Energy-based generative models [pmlrv5salakhutdinov09a, xie2016theory, jin2017introspective, lee2018wasserstein], which aim to directly model the data density, are making steady improvements toward a single model that is simultaneously generative and discriminative.
Contrastive learning. From another angle, contrastive learning [hadsell2006dimensionality, wu2018unsupervised, he2020momentum, chen2020simple] has lately shown its particular advantage in the unsupervised training of CNN features. It overcomes the limitation of unsupervised learning, where class labels are missing, by turning each image instance into one class. Thus, the softmax function in standard discriminative classification training can be applied. Contrastive learning can be connected to metric learning [bromley93, chopra2005, chechik2010].
In this paper, we aim to improve VAE [kingma2013auto] by introducing a contrastive loss [infoNCE] to address instance-level fidelity between the input and the reconstruction in the induced feature space. In self-supervised representation learning methods [infoNCE, he2020momentum, chen2020simple], self-supervision requires generating a transformed input (via data augmentation operations). Here, by contrast, the reconstruction naturally fits into the contrastive term, which encourages the matching between the reconstruction and the input image instance while pushing the reconstruction away from the rest of the images in the entire training set. Thus, the instance-level and set-level contradistinctive terms collaborate with each other to encourage high fidelity of the reconstruction and synthesis. In Figure 3, we systematically show the effect of including or omitting the instance-level and the set-level contradistinctive terms. In addition, we explore multi-scale contrastive learning via two schemes in Section 4.2: 1) deep supervision for contrastive learning in different convolution layers, and 2) patch-based contrastive learning for fine-grained data fidelity. In the experiments, we show competitive results for the proposed DCVAE on a number of benchmarks for three tasks: image synthesis, image reconstruction, and representation learning.
3 Preliminaries: VAE and VAE/GAN
Variational autoencoder (VAE)
Assume a given training set $S = \{x_i\}_{i=1}^{n}$ where each $x_i \in \mathbb{R}^m$. We suppose that each $x_i$ is sampled from a generative process $p(x|z)$. In the literature, the vector $z$ refers to latent variables. In practice, the latent variables and the generative process are unknown. The objective of a variational autoencoder (VAE) [kingma2013auto] is to simultaneously train an inference network $q_\phi(z|x)$ and a generator network $p_\theta(x|z)$. In VAE [kingma2013auto], the inference network is a neural network that outputs the parameters of a Gaussian distribution $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$. The generator is a deterministic neural network $f_\theta(z)$ parameterized by $\theta$. The generative density $p_\theta(x|z)$ is assumed to be Gaussian: $p_\theta(x|z) = \mathcal{N}(f_\theta(z), \sigma^2 I)$. These models can be trained by minimizing the negative of the evidence lower bound (ELBO) in Eq. (1) below.
$$\mathcal{L}_{ELBO}(\theta, \phi; x) = -\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + \mathrm{KL}\big[q_\phi(z|x) \,\|\, p(z)\big] \tag{1}$$
where $p(z)$ is the prior, which is assumed to be $\mathcal{N}(0, I)$. The first term reduces to the standard pixel-wise reconstruction loss (up to a constant) due to the Gaussian assumption. The second term is the regularization term, which prevents the conditional $q_\phi(z|x)$ from deviating from the Gaussian prior $\mathcal{N}(0, I)$. The inference network and generator network are jointly optimized over training samples by:
$$\min_{\theta, \phi} \; \mathbb{E}_{x \sim p_{data}(x)}\big[\mathcal{L}_{ELBO}(\theta, \phi; x)\big] \tag{2}$$
where $p_{data}$ is the distribution induced by the training set $S$.
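As an illustration of Eq. (1), the following is a minimal NumPy sketch of a single-sample estimate of the negative ELBO. It assumes a diagonal covariance for the inference network (parameterized by a log-variance), a fixed likelihood standard deviation `sigma`, and a toy linear decoder; the names `negative_elbo`, `decode`, and `W` are hypothetical and are not taken from the paper.

```python
import numpy as np

def negative_elbo(x, mu, log_var, decode, rng, sigma=1.0):
    """Single-sample Monte Carlo estimate of the negative ELBO for one input.

    x       : flattened input, shape (m,)
    mu      : encoder mean, shape (d,)
    log_var : encoder log-variance (diagonal covariance assumed), shape (d,)
    decode  : callable mapping a latent (d,) to a reconstruction (m,)
    sigma   : assumed fixed standard deviation of the Gaussian likelihood
    """
    # Reparameterization: z = mu + std * eps with eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # -log p(x|z) up to an additive constant: scaled squared pixel error
    recon = np.sum((x - decode(z)) ** 2) / (2.0 * sigma ** 2)
    # Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + kl, recon, kl

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 2))   # hypothetical linear decoder weights
x = rng.standard_normal(8)        # toy "image" with 8 pixels
loss, recon, kl = negative_elbo(x, np.zeros(2), np.zeros(2), lambda z: W @ z, rng)
```

When the encoder output equals the prior (zero mean, unit variance), the KL term vanishes and the loss reduces to the pixel-wise reconstruction term alone.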
VAE has an elegant formulation. However, it relies on a pixel-wise reconstruction loss, which is known to reflect perceptual realism poorly [Johnson2016Perceptual, pix2pix2017], often resulting in blurry images. From another viewpoint, it can be thought of as using a kernel density estimator (with an isotropic Gaussian kernel) in the pixel space. Although it allows efficient training and inference, such a non-parametric approach is overly simplistic for modeling the semantics and perception of natural images.
VAE/GAN
Generative adversarial networks (GANs) [goodfellow2014generative] and their variants [DCGAN], on the other hand, have been shown to produce highly realistic images. Their success is largely attributed to learning a fidelity function (often referred to as a discriminator) that measures how realistic the generated images are. This can be achieved by learning to contrast (classify) the set of training images with the set of generated images [tu2007learning, NCE, goodfellow2014generative].
VAE/GAN [VAEGAN] augments the ELBO objective (Eq. (2)) with the GAN objective. Specifically, the objective of VAE/GAN consists of two terms, namely the modified ELBO (Eq. (3)) and the GAN objective. To keep the later notation consistent, we now define the set of given training images as $S = \{x_i\}_{i=1}^{n}$, in which a total of $n$ unlabeled training images are present. For each input image $x_i$, the modified ELBO computes the reconstruction loss in the feature space of the discriminator instead of the pixel space:
$$\mathcal{L}_{ELBO}(\theta, \phi, D; x_i) = \mathbb{E}_{z \sim q_\phi(z|x_i)}\big[\|F_D(f_\theta(z)) - F_D(x_i)\|_2^2\big] + \mathrm{KL}\big[q_\phi(z|x_i) \,\|\, p(z)\big] \tag{3}$$
where $F_D(\cdot)$ denotes the feature embedding from the discriminator $D$. This feature reconstruction loss (also referred to as a perceptual loss) is similar to that used in style transfer [Johnson2016Perceptual]. The modified GAN objective considers both reconstructed images (latent code from $q_\phi(z|x_i)$) and sampled images (latent code from the prior $p(z)$) as its fake samples:
$$\mathcal{L}_{GAN}(\theta, \phi, D; x_i) = \log D(x_i) + \mathbb{E}_{z \sim q_\phi(z|x_i)}\big[\log(1 - D(f_\theta(z)))\big] + \mathbb{E}_{z \sim p(z)}\big[\log(1 - D(f_\theta(z)))\big] \tag{4}$$
The VAE/GAN objective becomes:
$$\min_{\theta, \phi} \max_{D} \sum_{i=1}^{n} \big[\mathcal{L}_{ELBO}(\theta, \phi, D; x_i) + \mathcal{L}_{GAN}(\theta, \phi, D; x_i)\big] \tag{5}$$
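The two ingredients of the VAE/GAN objective described above can be sketched as follows: a squared error measured on discriminator features rather than pixels, and a GAN value that treats both reconstructions and prior samples as fake. This is a hedged illustration on precomputed arrays; the function names and the `eps` stabilizer are assumptions, and the discriminator outputs are assumed to be probabilities in (0, 1).

```python
import numpy as np

def feature_recon_loss(feat_x, feat_rec):
    """Squared L2 distance between discriminator features of input and reconstruction."""
    return float(np.sum((feat_x - feat_rec) ** 2))

def gan_objective(d_real, d_rec, d_sample, eps=1e-12):
    """GAN value with two kinds of fake samples: reconstructions and prior samples.

    Each argument is a discriminator output in (0, 1), read as P(real).
    The discriminator would maximize this value; the encoder/decoder minimize it.
    eps is a small constant (an assumption) that guards against log(0).
    """
    return float(np.log(d_real + eps)
                 + np.log(1.0 - d_rec + eps)
                 + np.log(1.0 - d_sample + eps))
```

A discriminator that confidently separates real from fake images (for example outputs 0.9, 0.1, 0.1) yields a higher value than an undecided one (0.5, 0.5, 0.5), which is the quantity the generator side tries to push down.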
4 Dual contradistinctive generative autoencoder (DCVAE)
Here we want to address a question: Is the degeneration of the synthesized images by VAE always the case once the decoder is joined with an encoder? Can the problem be remedied by using a more informative loss?
Although it improves the image quality of VAE by integrating a set-level contrastive loss (the GAN objective of Eq. (4)), VAE/GAN still does not accurately model instance-level fidelity. Inspired by the literature on instance-level classification [exemplarsvm], approximating likelihood by classification [tu2007learning], and contrastive learning [hadsell2006dimensionality, wu2018unsupervised, he2020momentum], we propose to model instance-level fidelity by a contrastive loss (commonly referred to as the InfoNCE loss) [infoNCE]. In DCVAE, we perform the following minimization and loosely call each term a loss.
$$\mathcal{L}_{instance}(\theta, \phi, D; i, \{x_j\}_{j=1}^{n}) \triangleq -\mathbb{E}_{z \sim q_\phi(z|x_i)}\left[\log \frac{e^{h(x_i, f_\theta(z))}}{\sum_{j=1}^{n} e^{h(x_j, f_\theta(z))}}\right] \tag{6}$$
where $i$ is an index for a training sample (instance), $\{x_j\}_{j=1}^{n}$ is the union of positive samples and negative samples, and $h(x, y)$ is the critic function that measures compatibility between $x$ and $y$. Following the popular choice from [he2020momentum], $h(x, y)$ is the cosine similarity between the embeddings of $x$ and $y$: $h(x, y) = \frac{F_D(x)^\top F_D(y)}{\|F_D(x)\|_2 \|F_D(y)\|_2}$. Note that unlike in contrastive self-supervised learning methods [infoNCE, he2020momentum, chen2020simple], where two views (independent augmentations) of an instance constitute a positive pair, an input instance $x_i$ and its reconstruction $f_\theta(z)$ comprise a positive pair in DCVAE. Likewise, the reconstruction $f_\theta(z)$ and any instance that is not $x_i$ can be a negative pair.
To bridge the gap between the instance-level contrastive loss (Eq. (6)) and the log-likelihood in the ELBO term (Eq. (1)), we observe the following connection.
Remark 1
(From [macollins2018noise, pmlrv97poole19a]) The following objective is minimized, i.e., the optimal critic $h$ is achieved, when $h(x, f_\theta(z)) = \log p(x|z) + c(x)$, where $c(x)$ is any function that does not depend on $z$.
$$I_{NCE} \triangleq \mathbb{E}_{x_1, \ldots, x_n} \mathbb{E}_{i}\big[\mathcal{L}_{instance}(\theta, \phi, D; i, \{x_j\}_{j=1}^{n})\big] \tag{7}$$
It can be seen from [macollins2018noise, pmlrv97poole19a] that the contrastive loss of Eq. (6) implicitly estimates the log-likelihood required for the evidence lower bound (ELBO). Hence, we modify the ELBO objective of Eq. (1) as follows and name it the implicit ELBO (IELBO):
$$\mathcal{L}_{IELBO}(\theta, \phi, D; x_i) = \mathcal{L}_{instance}(\theta, \phi, D; i, \{x_j\}_{j=1}^{n}) + \mathrm{KL}\big[q_\phi(z|x_i) \,\|\, p(z)\big] \tag{8}$$
Finally, the combined objective for the proposed DCVAE algorithm becomes:
$$\min_{\theta, \phi} \max_{D} \sum_{i=1}^{n} \big[\mathcal{L}_{IELBO}(\theta, \phi, D; x_i) + \mathcal{L}_{GAN}(\theta, \phi, D; x_i)\big] \tag{9}$$
The definition of $\mathcal{L}_{GAN}$ follows Eq. (4). Note that we also consider the term in Eq. (4) as contradistinctive since it tries to minimize the difference/discriminative classification between the input ("real") image set and the reconstructed/generated ("fake") image set. Below we highlight the significance of the two contradistinctive terms. Figure 2 shows the model architecture.
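The instance-level contrastive (InfoNCE) loss with a cosine-similarity critic can be sketched in NumPy as below. This is a minimal illustration on precomputed embeddings, not the authors' implementation: the function names are hypothetical, and the `temperature` parameter is an assumption that the text does not specify (it defaults to 1, i.e. raw cosine similarities as logits).

```python
import numpy as np

def cosine_critic(a, b):
    """Cosine similarity between every row of a (n, d) and every row of b (k, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def instance_contrastive_loss(feat_inputs, feat_recon, i, temperature=1.0):
    """InfoNCE loss for the reconstruction of instance i.

    feat_inputs : (n, d) embeddings of the positive (row i) and negative inputs
    feat_recon  : (d,) embedding of the reconstruction of input i
    temperature : assumed logit scaling (not given in the text)
    """
    logits = cosine_critic(feat_inputs, feat_recon[None, :])[:, 0] / temperature
    logits = logits - logits.max()                    # numerical stability
    log_softmax = logits - np.log(np.sum(np.exp(logits)))
    return float(-log_softmax[i])                     # cross-entropy with class i
```

Each input instance acts as its own class: the loss is the softmax cross-entropy of classifying the reconstruction as instance `i` among all `n` candidates.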
4.1 Understanding the loss terms
Instance-level fidelity. The first item in Eq. (8) is an instance-level fidelity term encouraging the reconstruction to be as close as possible to the input image while being different from all the rest of the images. A key advantage of the contrastive loss in Eq. (8) over the standard reconstruction loss in Eq. (3) is its relaxed formulation, which is aware of the background instances. In general, the reconstruction loss in Eq. (3) wants a perfect match between the reconstruction and the input, whereas the contrastive loss in Eq. (8) only asks the reconstruction to be the most similar one among the training samples. This way, the contrastive loss cooperates better and conflicts less with the GAN loss than the reconstruction loss does. The introduction of the contrastive loss results in a significant improvement over VAE and VAE/GAN.
We further explain the difference between the reconstruction loss and the contrastive loss based on an input and its reconstruction. To simplify the notation, we use the image itself instead of its output layer feature (shown in Eq. (4)) for illustration purposes. The reconstruction loss enforces similarity between the reconstructed image and the input image, while the GAN loss computes an adversarial loss through a learned classifier. The reconstruction loss term enforces pixel-wise/feature matching between the input and the reconstruction, while the GAN loss encourages the reconstruction and the input to be discriminatively non-separable; the two are measured in different ways, resulting in a conflict. Our contrastive loss, on the other hand, is also a discriminative term: it can be viewed as classifying the reconstruction as its own input instance among all training samples. To compare the reconstruction loss with the contrastive loss: the former wants an exact match between the reconstruction and the input, whereas the latter is more relaxed and is satisfied when the reconstruction, even if not an exact match, is the closest one among all the training samples.
In other words, the reconstruction loss wants a perfect match for the instance-level fidelity, whereas the contrastive loss only asks the reconstruction to be the most similar one among the given training samples. Using the contrastive loss gives more room and creates less conflict with the GAN loss.
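The difference between the two criteria can be made concrete with a toy example. The sketch below uses random vectors as stand-in embeddings (an assumption purely for illustration): the reconstruction is an imperfect copy of input 3, so the feature reconstruction loss is strictly positive, yet the reconstruction is still the nearest neighbor of input 3 under cosine similarity, which is all the contrastive criterion asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 32))                # toy embeddings of 6 training inputs
recon = feats[3] + 0.3 * rng.standard_normal(32)    # imperfect reconstruction of input 3

# Feature reconstruction loss: zero only for an exact match
l2 = float(np.sum((feats[3] - recon) ** 2))

def unit(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

sims = unit(feats) @ unit(recon)                    # cosine similarity to every input
nearest = int(np.argmax(sims))                      # index of the most similar input
```

Here `l2` penalizes every residual deviation, whereas the ranking in `sims` is unaffected by deviations that do not make another input more similar than the true one.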
Set-level fidelity. The second item in Eq. (9) is a set-level fidelity term encouraging the entire set of synthesized images to be indistinguishable from the input image set. Having this term (Eq. (4)) is still important since the instance contrastive loss alone (Eq. (6)) will also lead to a degenerate situation: the input image and its reconstruction can be projected to the same point in the new feature space, but without a guarantee that the reconstruction itself lies on the valid "real" image manifold.
Figure 3 and Table 1 compare the results with and without the individual terms in Eq. (9). We observe the evident effectiveness of the proposed DCVAE, which combines both the instance-level fidelity term (Eq. (6)) and the set-level fidelity term (Eq. (4)), compared with VAE (using the pixel-wise reconstruction loss without the GAN objective), VAE/GAN (using the feature reconstruction loss and the GAN objective), and VAEContrastive (using the contrastive loss but without the GAN objective).
In the experiments, we show that both terms are required to achieve faithful reconstruction (captured by the InfoNCE loss) with perceptual realism (captured by the GAN loss).
4.2 Multi scale contrastive learning
Inspired by [lee2015deeply], we utilize information from feature maps at different scales. In addition to contrasting on the last layer of $D$ in Eq. (9), we add a contrastive objective on $f_l$, where $f_l$ is some function on top of an intermediate layer $l$ of $D$. We do it in two different ways.

Deep supervision: We use a 1×1 convolution to reduce the dimension channel-wise, and use a linear layer to obtain $f_l$.

Local patch: We use a random location across channels at layer $l$ (size: 1×1×d, where d is the channel depth).
The intuition for the second scheme is that in a convolutional neural network, one location in a feature map corresponds to a receptive area (patch) in the original image. Thus, by contrasting locations across channels in the same feature maps, we encourage the original image and the reconstruction to have locally similar content, while encouraging the reconstruction to have locally dissimilar content from other images. We use deep supervision for initial training, and add the local patch scheme after a certain number of iterations.
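The local patch scheme can be sketched as picking one random spatial location in an intermediate feature map and taking the 1×1×d channel vector at that location. The snippet below is a hypothetical illustration on NumPy arrays with a (d, H, W) layout; using the same location for the input and its reconstruction is an assumption made here so that both vectors cover the same image patch.

```python
import numpy as np

def sample_patch_features(fmap_x, fmap_rec, rng):
    """Pick one random spatial location and return the channel vectors found there.

    fmap_x, fmap_rec : (d, H, W) feature maps of the input and its reconstruction
    Returns two (d,) vectors and the sampled (row, col) location.
    """
    d, h, w = fmap_x.shape
    r, c = int(rng.integers(h)), int(rng.integers(w))
    return fmap_x[:, r, c], fmap_rec[:, r, c], (r, c)

rng = np.random.default_rng(0)
fmap_x = rng.standard_normal((8, 4, 4))     # toy feature map, d=8 channels
fmap_rec = rng.standard_normal((8, 4, 4))
patch_x, patch_rec, loc = sample_patch_features(fmap_x, fmap_rec, rng)
```

The two returned vectors would form a positive pair for a contrastive loss, with patch vectors from other images serving as negatives.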
5 Experiments
5.1 Implementation
Datasets. To validate our method, we train it on several different datasets: CIFAR10 [cifar10], STL10 [STL10], CelebA [CelebA], CelebAHQ [karras2018progressive], and LSUN bedroom [yu15lsun]. See the appendix for more detailed descriptions.
Network architecture. For the lower-resolution experiments, we design the encoder and decoder subnetworks of our model in a similar way to the discriminator and generator found through neural architecture search in AutoGAN [gong2019autogan]. For the higher-resolution experiments, we use Progressive GAN [karras2018progressive] as the backbone. A network architecture diagram is available in the appendix.
Training details
The number of negative samples for contrastive learning is 8096 for all datasets (an analysis of this hyperparameter is provided in the supplementary material). The latent dimension for the VAE decoder is 128 for CIFAR10 and STL10, and 512 for CelebA, CelebAHQ, and LSUN Bedroom. For CIFAR10 and STL10, we use the Adam optimizer with a learning rate of 0.0002 and a batch size of 128. For the CelebA, CelebAHQ, and LSUN Bedroom datasets, we use the optimizer parameters given in [karras2018progressive]. The contrastive embedding dimension is 16 in all experiments.
5.2 Ablation Study
Method  FID ↓ / IS ↑ (Sampling)  FID ↓ / IS ↑ (Reconstruction)  Pixel distance ↓  Perceptual distance ↓
VAE  115.8 / 3.8  108.4 / 4.3  21.8  65.8
VAE/GAN  39.8 / 7.4  29.0 / 7.6  62.7  57.2
VAEContrastive  240.4 / 1.8  242 / 1.9  53.6  104.2
DCVAE  17.9 / 8.2  21.4 / 7.9  45.9  52.9
The pixel distance is the pixel-wise L2 distance, and the perceptual distance is measured in a relu layer of a pretrained VGG network. ↓ means lower is better. ↑ means higher is better.
To demonstrate the necessity of the GAN loss (Eq. 4) and the contrastive loss (Eq. 8), we conduct four experiments with the same backbone: VAE (no GAN, no Contrastive), VAE/GAN (with GAN, no Contrastive), VAEContrastive (no GAN, with Contrastive), and ours (with GAN, with Contrastive). Here, GAN denotes Eq. 4, and Contrastive denotes Eq. 8.
Method  CIFAR10 IS  CIFAR10 FID  STL10 IS  STL10 FID
Methods based on GAN:  
DCGAN [DCGAN]  6.6       
ProbGAN [he2019probgan]  7.8  24.6  8.9  46.7 
WGANGP ResNet [gulrajani2017improved]  7.9       
RaGAN [jolicoeur2018relativistic]    23.5     
SNGAN [miyato2018spectral]  8.2  21.7  9.1  40.1 
MGAN [hoang2018mgan]  8.3  26.7     
Progressive GAN [karras2018progressive]  8.8       
Improving MMD GAN [wang2019improving]  8.3  16.2  9.3  37.6 
PULSGAN [PUGAN]    22.3     
AutoGAN [gong2019autogan]  8.6  12.4  9.2  31.0 
Methods based on VAE:  
VAE  3.8  115.8     
VAE/GAN  7.4  39.8     
VEEGAN^{∗} [veegan2017]    95.2     
WAEGAN [WAE]    93.1     
NVAE^{†} [vahdat2020NVAE] Sampling    50.8     
NVAE^{†} [vahdat2020NVAE] Reconstruction    2.67     
DCVAE Sampling (ours)  8.2  17.9  8.1  41.9 
DCVAE Recon. (ours)  7.9  21.4  8.4  43.6 
Qualitative analysis. From Figure 3, we see that without both the GAN and contrastive terms, images are blurry. Without the GAN term, the contrastive head can classify images, but the reconstructions do not lie on the image manifold. Without the contrastive term, reconstructed images are on the image manifold because of the discriminator, but they differ from the input images. These experiments show that it is necessary to combine both instance-level and set-level fidelity, and to do so in a contradistinctive manner.
Quantitative analysis. In Table 1 we observe the same trend. VAE generates blurry images; thus its FID and IS (Inception Score) are not ideal. VAEContrastive does not generate images on the natural image manifold; thus its FID and IS are poor. VAE/GAN combines set-level and instance-level information; however, the L2 objective is not ideal, so its FID and IS are suboptimal. For both reconstruction and sampling tasks, DCVAE generates high-fidelity images and has a favorable FID and Inception Score. This illustrates the advantage of having a contradistinctive objective at both the set level and the instance level. To measure the faithfulness of the reconstructed images, we compute the pixel-wise L2 distance and the perceptual distance [Johnson2016Perceptual]. For the pixel distance, VAE has the lowest value because it directly optimizes this distance during training; our pixel-wise distance is better than those of VAE/GAN and VAEContrastive. For the perceptual distance, our method outperforms the other three, which confirms that using contrastive learning helps reconstruct images semantically.
5.3 Comparison to existing generative models
Table 2 gives a quantitative comparison on the CIFAR10 and STL10 datasets. In general, there is a large difference in FID and IS between the GAN family and the VAE family of models. Our model achieves state-of-the-art results within the VAE family and is comparable to state-of-the-art GAN models on CIFAR10. Similarly, Tables 3, 4, and 5 show that DCVAE is able to generate images that are comparable to those of GAN-based methods even on higher-resolution datasets such as LSUN Bedrooms, CelebA, and CelebAHQ. Our method achieves state-of-the-art results on these datasets among VAE-based methods that focus on building better architectures. Figure 4 and Table 8 show that our model yields more faithful reconstructions than existing state-of-the-art generative autoencoder methods.
Method  FID (Sampling)  FID (Reconstruction)
Progressive GAN^{‡} [karras2018progressive]  8.3    
SNGAN^{†} [miyato2018spectral] (from [chen2019self])  16.0    
SSGAN^{†}[chen2019self]  13.3    
StyleALAE^{‡} [pidhorskyi2020adversarial]  17.13  15.92  
DCVAE ^{†} (ours)  14.3  10.57 
Method  FID
StyleALAE [pidhorskyi2020adversarial]  19.21
NVAE [vahdat2020NVAE] (from [aneja2020ncpvae])  40.26
NCPVAE [aneja2020ncpvae]  24.69
DCVAE (ours)  15.81

Method  FID
Methods based on GAN:
PresGAN^{∗} [dieng2019prescribed]  29.1
LSGAN [mao2017least] (from [glann2019])  53.9
COCOGAN^{†} [lin2019coco]  5.7
ProGAN^{†} [karras2018progressive] (from [lin2019coco])  7.30
Methods based on VAE:
VEEGAN^{†} [veegan2017] (from [dieng2019prescribed])  46.2
WAEGAN^{∗} [WAE]  42
DCVAE^{†} (ours) Reconstruction  14.3
DCVAE^{†} (ours) Sampling  19.9
5.4 Latent Space Representation: Image and style interpolation
We further validate the effectiveness of DCVAE for representation learning. One benefit of having an AE/VAE framework, compared with just a decoder as in GAN [goodfellow2014generative], is the ability to directly obtain the latent representation from the input images. The encoder and decoder modules in VAE allow us to readily perform image/style interpolation by mixing the latent variables of different images and reconstructing/synthesizing new ones. We demonstrate qualitative results on image interpolation (Fig. 6), and on style interpolation and image editing (Fig. 5); the method used for this is outlined in the supplementary materials. We directly use the trained DCVAE model without disentanglement learning [karras2019style]. We also quantitatively compare the latent space disentanglement through the perceptual path length (PPL) [karras2019style] (Table 7). We observe that DCVAE learns a more disentangled latent space representation than both the backbone Progressive GAN [karras2018progressive] and StyleALAE [pidhorskyi2020adversarial], which uses a much more capable StyleGAN [karras2019style] backbone.
5.5 Latent Space Representation: Classification
Method
VAE [kingma2013auto]  2.92% ± 0.12  3.05% ± 0.42  2.98% ± 0.14
β-VAE (β=2) [higgins2017beta]  4.69% ± 0.18  5.26% ± 0.22  5.40% ± 0.33
FactorVAE (γ=5) [kim2018disentangling]  6.07% ± 0.05  6.18% ± 0.20  6.35% ± 0.48
β-TCVAE (α=1, β=5, γ=1) [chen2018isolating]  1.62% ± 0.07  1.24% ± 0.05  1.32% ± 0.09
GuidedVAE [ding2020guided]  1.85% ± 0.08  1.60% ± 0.08  1.49% ± 0.06
GuidedTCVAE [ding2020guided]  1.47% ± 0.12  1.10% ± 0.03  1.31% ± 0.06
DCVAE (Ours)  1.30% ± 0.035  1.27% ± 0.037  1.29% ± 0.034
Classification error on the MNIST dataset. ↓: lower is better. 95% confidence intervals are from 5 trials. Results derived from [ding2020guided].
To show that our model learns a good representation, we measure its performance on the downstream MNIST classification task [ding2020guided]. The VAE models were trained on the MNIST dataset [lecun2010mnist]. We feed input images into our VAE encoder and obtain the latent representation. Then we train a linear classifier on the latent representation to predict the classes of the input images. The results in Table 6 show that our model gives the lowest classification error in most cases. This experiment demonstrates that our model not only gains the ability to perform faithful synthesis and reconstruction, but also gains better representation ability on the VAE side.
Method  Backbone  PPL Full 

StyleALAE [pidhorskyi2020adversarial]  StyleGAN [karras2019style]  33.29 
ProGAN [karras2018progressive]  ProGAN [karras2018progressive]  40.71 
DCVAE (ours)  ProGAN [karras2018progressive]  24.66 
Method  Backbone 




StyleALAE [pidhorskyi2020adversarial]  StyleGAN [karras2019style]  0.117  40.40  
DCVAE (ours)  ProGAN [karras2018progressive]  0.072  38.63 
6 Conclusion
In this paper, we have developed the dual contradistinctive generative autoencoder (DCVAE), a new framework that integrates an instance-level discriminative loss (InfoNCE) with a set-level adversarial loss (GAN) into a single variational autoencoder framework. Our experiments show state-of-the-art or competitive results in several tasks, including image synthesis, image reconstruction, representation learning for image interpolation, and representation learning for classification. DCVAE is a general-purpose VAE model, and it points to an encouraging direction that attains high-quality synthesis (decoding) and inference (encoding).
7 Acknowledgment
This work is funded by NSF IIS1717431 and NSF IIS1618477. Zhuowen Tu is also funded under Qualcomm Faculty Award.
References
Appendix A Appendix
A.1 Additional reconstruction results
A.2 Smoothness of latent space
In this section we analyse the smoothness of the latent space learnt by DCVAE. In Figure 11 we show additional high-resolution CelebAHQ [karras2018progressive] images generated by an evenly spaced linear blending between two latent vectors. In Fig. 5 we show that DCVAE is able to perform meaningful attribute editing on images while retaining the original identity. To perform image editing, we first need to compute the direction vector in the latent space that corresponds to a desired attribute (e.g. has glasses, has blonde hair, is a woman, has facial hair). We compute these attribute direction vectors by selecting 20 images that have the attribute and 20 images that do not have the attribute, obtaining the two corresponding sets of 20 latent vectors, and calculating the difference of their means. The results in Fig. 5 show that these direction vectors can be added to a latent vector to add a diverse combination of desired image attributes while retaining the original identity of the individual.
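The two latent-space operations described above, the difference-of-means attribute direction and the evenly spaced linear blend, can be sketched as follows. The function names are hypothetical, and the latent vectors are assumed to be rows of NumPy arrays already produced by a trained encoder.

```python
import numpy as np

def attribute_direction(z_with, z_without):
    """Mean latent of images with the attribute minus mean latent of images without it."""
    return z_with.mean(axis=0) - z_without.mean(axis=0)

def interpolate(z_a, z_b, steps):
    """Evenly spaced linear blend between two latent vectors, endpoints included."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b
```

An edit would then take the form `z_edit = z + strength * direction`, where `strength` is a hypothetical scalar controlling how strongly the attribute is added, and the edited latent is passed through the decoder.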
A.3 Effect of negative samples
In this section we analyse the effect of varying the number of negative samples used for contrastive learning. Figure 7 shows the reconstruction error on the CIFAR10 [cifar10] test set as the number of negative samples is varied. We observe that a higher number of negative samples results in better reconstruction. We choose 8096 for all of our experiments because of memory constraints.
A.4 Dataset details
CIFAR10 comprises 50,000 training images and 10,000 test images with a spatial resolution of 32×32. STL10 is a similar dataset that contains 5,000 training images and 100,000 unlabeled images at 96×96 resolution. We follow the procedure in AutoGAN [gong2019autogan] and resize the STL10 images to a lower resolution. The CelebA dataset has 162,770 training images and 19,962 testing images, CelebAHQ contains 30,000 images, and LSUN Bedroom has approximately 3M images. For CelebAHQ we split the dataset into 29,000 training images and 1,000 validation images following the method in [huang2018introvae]. We resize all images in these three datasets progressively, from low to high resolution, for the progressive training.
A.5 Network architecture diagrams
In Figure 15 we show the detailed network architecture of DCVAE. Note that the comparison results shown in Figure 3 and Table 1 in the main paper for VAE, VAE/GAN, VAE w/o GAN, and our proposed DCVAE are all based on the same network architecture (shown in Figure 15 here), for a fair comparison.
The network architectures shown in Figure 15 are adapted closely from the networks discovered by [gong2019autogan] through Neural Architecture Search. The DCVAE developed in our paper is not tied to any particular CNN architecture. We choose the AutoGAN architecture [gong2019autogan] to start with a strong baseline. The decoder in Figure 15 matches the generator in [gong2019autogan]. The encoder is built by modifying the output shape of the final linear layer in the discriminator of AutoGAN [gong2019autogan] to match the latent dimension and adding spectral normalization. The discriminator is used both for classifying real/fake images and for contrastive learning. For each layer we choose, we first apply a 1x1 convolution and a linear layer, and then use the resulting feature as an input to the contrastive module. For the lower-resolution experiments, we pick two different positions: the output of the second residual conv block (lower level) and the output of the first linear layer (higher level). For experiments on higher-resolution datasets we use a Progressive GAN [karras2018progressive] generator and discriminator as our backbone and apply modifications similar to those described above.
A.6 Further details about the representation learning experiments
As seen in Table 4 in the main paper, we show the representation capability of DCVAE following the procedure outlined in [ding2020guided]. We train our model on the MNIST dataset [lecun2010mnist] and measure the transferability through a classification task on the latent embedding vector. Specifically, we first pretrain the DCVAE model on the training split of the MNIST dataset. We then freeze the DCVAE model and train a linear classifier that takes the latent embedding vector as input and predicts the class label of the original image.
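The linear evaluation protocol above (frozen encoder, linear classifier on the latent vectors) can be sketched with a plain softmax regression. This is a minimal NumPy illustration, not the authors' training code: the function names, learning rate, and number of epochs are assumptions, and the latent vectors `z` are assumed to have been computed beforehand by the frozen encoder.

```python
import numpy as np

def train_linear_probe(z, y, num_classes, lr=0.5, epochs=200):
    """Fit a softmax (linear) classifier on fixed latent vectors by gradient descent.

    z : (n, d) latent vectors from the frozen encoder
    y : (n,) integer class labels
    """
    n, d = z.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(epochs):
        logits = z @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                       # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (z.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def error_rate(z, y, W, b):
    """Fraction of samples whose predicted class differs from the label."""
    return float(np.mean(np.argmax(z @ W + b, axis=1) != y))
```

Only `W` and `b` are trained; the encoder that produced `z` is never updated, so the resulting error rate reflects the quality of the learned representation.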
A.7 Evaluation details
The perceptual distance is computed as the average MSE distance of the features extracted by a pretrained VGG16 network. Following [Johnson2016Perceptual], we use the activation of the relu4_3 layer. For computing the FID scores we follow the standard practice ([huang2018introvae], [pidhorskyi2020adversarial]) and use 50,000 generated images. In Table 5 we use the version of the DCVAE model trained on CelebAHQ [karras2018progressive] for a fair comparison with other methods which are trained at the same resolution.
(a) Encoder  (b) Decoder  (c) Discriminator