
Exemplar VAEs for Exemplar based Generation and Data Augmentation

by Sajad Norouzi, et al.

This paper presents a framework for exemplar based generative modeling, featuring Exemplar VAEs. To generate a sample from the Exemplar VAE, one first draws a random exemplar from a training dataset, and then stochastically transforms that exemplar into a latent code, which is then used to generate a new observation. We show that the Exemplar VAE can be interpreted as a VAE with a mixture of Gaussians prior in the latent space, with Gaussian means defined by the latent encoding of the exemplars. To enable optimization and avoid overfitting, Exemplar VAE's parameters are learned using leave-one-out and exemplar subsampling, where, for the generation of each data point, we build a prior based on a random subset of the remaining data points. To accelerate learning, which requires finding the exemplars that exert the greatest influence on the generation of each data point, we use approximate nearest neighbor search in the latent space, yielding a lower bound on the log marginal likelihood. Experiments demonstrate the effectiveness of Exemplar VAEs in density estimation, representation learning, and generative data augmentation for supervised learning.






1 Introduction

Consider the problem of conditional image generation, given a natural language description of a scene such as
       “A woman is staring at Monet’s Water Lilies”.
There are two general strategies for addressing this problem. One can resort to exemplar based methods, e.g., using web search engines to retrieve photographs with similar captions, and then edit the retrieved images to generate new ones. Alternatively, one can adopt parametric models such as deep neural networks optimized for text to image translation to synthesize new relevant scenes.

This paper presents a machine learning framework for exemplar based generative models using expressive neural nets, combining the advantages of both exemplar based and parametric paradigms. Here we focus on simple unconditional generation tasks, but the learning formulation and the methods developed are generally applicable to many potential applications including text to image translation.

Exemplar based methods depend on large and diverse datasets of exemplars and relatively simple machine learning algorithms, such as Parzen window estimation (Parzen, 1962) and conditional random fields (Lafferty et al., 2001). Despite their simplicity, they deliver impressive results on texture synthesis (Efros and Leung, 1999), image super resolution (Freeman et al., 2002), and inpainting (Criminisi et al., 2003; Hays and Efros, 2007). These techniques can accommodate web scale datasets with an improvement in sample quality as the dataset size increases, without the need for further optimization of model parameters. The success of exemplar based methods hinges on the distance metric used to build a local density model for each neighborhood. Unfortunately, finding an effective distance metric in a high dimensional space is challenging on its own (Xing et al., 2003; Johnson et al., 2016). Further, while exemplar based methods excel in interpolation tasks, they often underperform their parametric counterparts in extrapolation.

Parametric generative models based on deep neural nets enable learning complex data distributions across myriad problem domains (e.g., Oord et al. (2016); Reed et al. (2016)). Predominant models, such as Variational Autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014), Normalizing Flows (Dinh et al., 2014, 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014b), adopt a decoder network to convert samples from a prior distribution, often a factored Gaussian, into samples from the target distribution. Once training is complete, these models discard the training data and generate new samples using the decoder network alone. Hence, the burden of generative modeling rests entirely on the parametric model. Further, when additional training data becomes available, these models require re-training or fine-tuning.

This paper investigates a general framework for exemplar based generative modeling and a particular instantiation of this framework called the Exemplar VAE. To sample from the Exemplar VAE, one first draws a random exemplar from a training dataset and then stochastically transforms that exemplar into a new observation. We are inspired by recent work on generative models augmented with external memory (e.g., Guu et al. (2018); Li et al. (2019); Tomczak and Welling (2018); Khandelwal et al. (2019); Bornschein et al. (2017)), but unlike most existing work, we do not rely on a prespecified distance metric to define the neighborhood structure. Instead, we simultaneously learn a latent space and a distance metric suited for generative modeling.

Exemplar VAE can be interpreted as a VAE with a Gaussian mixture prior in the latent space, with one component per exemplar. The component means are defined by the latent encoding of the exemplars. We build on the VampPrior formulation of Tomczak and Welling (2018), and our work is a continuation of recent papers on enhancing VAEs with richer latent priors (Kunin et al., 2019; Bauer and Mnih, 2018; Lawson et al., 2019).

The main contributions of this paper include:


  • The development of the Exemplar VAE and a framework for exemplar based generative modeling.

  • The proposal of critical regularization methods, enhancing generalization of exemplar based generative models.

  • The use of approximate nearest neighbor search to formulate a lower bound on the ELBO, accelerating learning.

Our experiments demonstrate that Exemplar VAEs consistently outperform VAEs with a Gaussian prior and a VampPrior on density estimation and representation learning. Further, unsupervised data augmentation using Exemplar VAEs proves to be extremely helpful, reducing the classification error rate on permutation invariant MNIST.

2 Exemplar based Generative Models

We define an exemplar based generative model in terms of a dataset of exemplars, D ≡ {x_n}_{n=1}^{N}, and a parametric transition distribution, T_\theta(x \mid x'), which stochastically transforms an exemplar x' into a new observation x. The log density of a data point x under an exemplar based generative model is expressed as

\log p(x \mid D, \theta) \,=\, \log \sum_{n=1}^{N} \frac{1}{N}\, T_\theta(x \mid x_n) ,   (1)

where we assume the prior probability of selecting each exemplar is uniform.

The transition distribution can be defined using any expressive parametric generative model, including VAEs, Normalizing Flows, and auto-regressive models. Any reasonable transition distribution should put considerable probability mass on the reconstruction of an exemplar from itself, i.e., T_\theta(x \mid x) should be large for all x. Further, an ideal transition distribution should be able to model the conditional dependencies between different dimensions of x given x', since the dependence of x on x' is often insufficient to make the dimensions of x conditionally independent.

One can think of Kernel Density Estimation (KDE), also known as Parzen window estimation (Parzen, 1962), as the simplest instance of exemplar based generative models, in which the transition distribution is defined in terms of a prespecified kernel function and its meta-parameters. For example, with a Gaussian kernel, KDE takes the form

\log p(x \mid D, \sigma) \,=\, \log \sum_{n=1}^{N} \frac{1}{N} \exp\!\Big( \frac{-\lVert x - x_n \rVert^2}{2\sigma^2} - c \Big) ,   (2)

where c = \frac{d}{2} \log 2\pi\sigma^2 is the log normalizing constant for an isotropic Gaussian in d dimensions. The non-parametric nature of KDE enables one to exploit extremely large heterogeneous datasets of exemplars and apply nearest neighbor search techniques to density estimation. That said, simple KDEs underperform neural density estimation, especially in high dimensional spaces, due to the inflexibility of typical transition distributions, e.g., when T_\sigma(x \mid x') = \mathcal{N}(x;\, x',\, \sigma^2 I).
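Under a Gaussian kernel, this estimator can be sketched in a few lines of numpy (a minimal illustration on synthetic data, not tied to any particular dataset from the paper):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a))) for a 1-D array."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def kde_log_density(x, exemplars, sigma):
    """Log density of x under a Gaussian KDE with N exemplars, as in eq. (2):
    log p(x) = log sum_n (1/N) exp(-||x - x_n||^2 / (2 sigma^2) - c),
    where c = (d/2) log(2 pi sigma^2) is the Gaussian log normalizer."""
    N, d = exemplars.shape
    sq_dists = np.sum((exemplars - x) ** 2, axis=1)   # ||x - x_n||^2 for all n
    c = 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return logsumexp(-sq_dists / (2.0 * sigma ** 2) - c) - np.log(N)
```

The log-sum-exp form matters in practice: in high dimensions the individual kernel responsibilities underflow long before the mixture density does.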

This work aims to adopt desirable properties of non-parametric exemplar based models to help scale parametric models to large heterogeneous datasets.

2.1 Optimization

We use expressive parametric density functions to represent the transition distribution within the exemplar based framework of (1). Consistent with recent work (Tomczak and Welling, 2018; Bornschein et al., 2017), we find that simply maximizing the expected log marginal likelihood over the empirical training data,

\mathcal{O}(\theta; D) \,=\, \sum_{i=1}^{N} \log \sum_{n=1}^{N} \frac{1}{N}\, T_\theta(x_i \mid x_n) ,   (3)

to find the parameters of the transition distribution (\theta) results in massive overfitting. This is not surprising, since a flexible transition distribution can put all its probability mass on the reconstruction of each exemplar from itself, i.e., T_\theta(x \mid x') = \delta(x - x'), yielding high log-likelihood on training data but poor generalization.

We propose two simple but effective regularization strategies to mitigate overfitting in exemplar based generative models:

  1. Leave-one-out during training. The generation of a given data point is expressed in terms of all exemplars except that point. The non-parametric nature of the generative model enables easy adoption of such a leave-one-out (LOO) objective during training, to optimize

    \mathcal{O}_{\mathrm{LOO}}(\theta; D) \,=\, \sum_{i=1}^{N} \log \sum_{n=1}^{N} \frac{1}{N-1}\, \mathbb{1}[n \neq i]\; T_\theta(x_i \mid x_n) ,   (4)

    where \mathbb{1}[n \neq i] is an indicator function taking the value of 1 if and only if n \neq i.

  2. Exemplar subsampling during training. In addition to LOO, we observe that explaining a training point using a subset of the remaining training exemplars improves generalization. To that end we use a hyper-parameter M to define the exemplar subset size for the generative model. To generate x_i we draw M exemplar indices, denoted \pi = \{\pi_1, \ldots, \pi_M\}, uniformly at random from subsets of \{1, \ldots, N\} \setminus \{i\}. Let \pi \sim \Pi denote this sampling procedure, with (N-1 choose M) possible subset outcomes. Combining LOO and exemplar subsampling, the objective takes the form

    \mathcal{O}_{\mathrm{LOO},M}(\theta; D) \,=\, \sum_{i=1}^{N} \mathbb{E}_{\pi \sim \Pi} \log \sum_{m=1}^{M} \frac{1}{M}\, T_\theta(x_i \mid x_{\pi_m}) .   (5)

    Note that by moving the expectation over \pi inside the log in (5) we recover \mathcal{O}_{\mathrm{LOO}}; i.e., via Jensen's inequality, \mathcal{O}_{\mathrm{LOO},M} is a lower bound on \mathcal{O}_{\mathrm{LOO}}. Nevertheless, even when training with \mathcal{O}_{\mathrm{LOO}} is possible, we find that exemplar subsampling yields better generalization.

Once training is completed, we use all training exemplars to explain the generation of validation or test points using (1). Hence, the regularization techniques discussed above (LOO and exemplar subsampling) are not relevant to inference. The non-parametric nature of an exemplar based model is compatible with such regularization techniques, whose use is not straightforward when training parametric generative models. Even though cross validation is commonly used for parameter tuning and model selection, here cross validation is used directly as a training objective, suggestive of a meta-learning perspective.

Exemplar subsampling is also very similar to Dropout (Srivastava et al., 2014), in that we drop some mixture components in the latent space during training to encourage better generalization at test time. Like conventional dropout, we still use the complete model for evaluation.
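The two regularizers can be sketched for a generic matrix of transition log-densities (a toy numpy illustration; `log_T` stands in for any learned transition distribution, and the subset sampler is one Monte Carlo draw rather than the full expectation):

```python
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def loo_objective(log_T):
    """Leave-one-out objective: each x_i is explained by the other
    N-1 exemplars. log_T[i, j] holds log T_theta(x_i | x_j)."""
    N = log_T.shape[0]
    masked = log_T.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)          # drop the exemplar x_i itself
    return sum(logsumexp(masked[i]) - np.log(N - 1) for i in range(N))

def subsampled_loo_objective(log_T, M, rng):
    """One Monte Carlo sample of the subsampled objective: each x_i is
    explained by a random size-M subset of the remaining exemplars."""
    N = log_T.shape[0]
    total = 0.0
    for i in range(N):
        pool = np.delete(np.arange(N), i)      # candidate exemplars, excluding i
        pi = rng.choice(pool, size=M, replace=False)
        total += logsumexp(log_T[i, pi]) - np.log(M)
    return total
```

With M = N - 1 the subset is the entire remaining dataset, so the subsampled objective reduces exactly to the LOO objective; smaller M trades variance for regularization.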

3 Exemplar Variational Autoencoders

We present the Exemplar VAE as an instance of neural exemplar based generative models, in which the transition distribution is defined in terms of the encoder and the decoder of a VAE, i.e.,

T_{\theta,\phi}(x \mid x') \,=\, \int_z r_\phi(z \mid x')\, p_\theta(x \mid z)\, \mathrm{d}z .

The Exemplar VAE assumes that, given the latent code z, an observation x is conditionally independent from the associated exemplar x'. This conditional independence assumption helps simplify the formulation, enabling efficient optimization.

Figure 1: Exemplar VAEs generate a new sample by drawing a random exemplar x_n from a training set, then stochastically mapping that exemplar to the latent space by sampling z from r_\phi(z \mid x_n), followed by sampling x from p_\theta(x \mid z). We distinguish between the variational posterior distribution q_\phi(z \mid x) and the exemplar based latent prior distribution r_\phi(z \mid x_n).

The generative process of an Exemplar VAE has three steps:

  1. Sample n \sim \mathrm{Uniform}(1, \ldots, N) to draw a random exemplar x_n from the training set D.

  2. Sample z \sim r_\phi(z \mid x_n), using the VAE's encoder to transform the exemplar into a distribution over latent codes, from which z is drawn.

  3. Sample x \sim p_\theta(x \mid z), using the VAE's decoder to transform z into a distribution over the observation space, from which x is drawn.

In a VAE with a Gaussian prior on latent codes, the encoder network is used during training to define a variational bound on the log marginal likelihood (Kingma and Welling, 2014). Once training is complete, a conventional VAE generates new observations using the decoder alone. To sample from an Exemplar VAE, in addition to the decoder, one needs access to a set of exemplars and the encoder, or at least the latent encoding of the exemplars. Importantly, given the non-parametric nature of Exemplar VAEs, one can train this model with one set of exemplars and perform inference with another, potentially much larger set.
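The three-step generative process can be sketched with toy stand-in networks (the linear maps, dimensions, and Gaussian observation noise below are assumptions for illustration only, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained networks: linear maps acting as the
# encoder mean mu_phi and the decoder mean.
W_enc = rng.normal(size=(2, 4))    # x in R^4 -> latent mean in R^2
W_dec = rng.normal(size=(4, 2))    # z in R^2 -> observation mean in R^4
sigma = 0.1                        # shared scalar std of the exemplar prior
obs_noise = 0.05                   # observation noise (assumed Gaussian here)

def sample_exemplar_vae(exemplars):
    """The three-step Exemplar VAE generative process."""
    N = exemplars.shape[0]
    n = rng.integers(N)                                    # 1) n ~ Uniform(1..N)
    z = W_enc @ exemplars[n] + sigma * rng.normal(size=2)  # 2) z ~ r(z | x_n)
    x = W_dec @ z + obs_noise * rng.normal(size=4)         # 3) x ~ p(x | z)
    return x

exemplars = rng.normal(size=(10, 4))
x_new = sample_exemplar_vae(exemplars)
```

Note that only the exemplar set and the encoder means are needed at sampling time, which is what allows swapping in a different (possibly larger) exemplar set after training.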

Marginalizing out the exemplar index n and the latent variable z, we derive an evidence lower bound (ELBO) (Jordan et al., 1999; Blei et al., 2017) on the Exemplar VAE's log marginal likelihood for a single data point x as

\log p(x \mid D) \,\geq\, \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) \,-\, \mathrm{KL}\Big( q_\phi(z \mid x) \,\Big\|\, \frac{1}{N} \sum_{n=1}^{N} r_\phi(z \mid x_n) \Big) .   (9)

The separation of the reconstruction and KL terms in (9) summarizes the impact of the exemplars on the learning objective as a mixture prior distribution in the latent space, with each mixture component defined using the latent encoding of one exemplar, i.e.,

p(z \mid D, \phi) \,=\, \frac{1}{N} \sum_{n=1}^{N} r_\phi(z \mid x_n) .   (10)
This paper considers the simplest form of the Exemplar VAE, with a factored Gaussian variational family and a Bernoulli observation model. Extensions to other distribution families are straightforward.

The Exemplar VAE employs two encoder networks: q_\phi(z \mid x) for inference over latent codes given an observation x, and r_\phi(z \mid x') for mapping an exemplar x' to the latent space to obtain an exemplar based prior. We share almost all of the parameters between the two, inspired by the VAE VampPrior (Tomczak and Welling, 2018) and the derivation of the ELBO with a marginal KL between the aggregated variational posterior and the prior (Hoffman and Johnson, 2016; Makhzani et al., 2015). Accordingly, we define

q_\phi(z \mid x) \,=\, \mathcal{N}(z;\, \mu_\phi(x),\, \Lambda_\phi(x)) ,
r_{\phi,\sigma^2}(z \mid x') \,=\, \mathcal{N}(z;\, \mu_\phi(x'),\, \sigma^2 I) ,

where the two encoders use the same Gaussian mean function \mu_\phi, but differ in the covariance structure. The inference network uses a standard diagonal covariance matrix \Lambda_\phi(x), whereas the exemplar based prior uses a shared scalar parameter \sigma^2 to define an isotropic Gaussian per exemplar.

To complete the definition of the learning objective for Exemplar VAEs, recall that \pi \sim \Pi represents sampling a subset of M exemplar indices uniformly at random from \{1, \ldots, N\} \setminus \{i\}. Incorporating subsampling regularization into (9), with some further algebraic manipulation, we obtain the following Exemplar VAE objective:

\mathcal{O}(\theta, \phi, \sigma^2; D) \,=\, \sum_{i=1}^{N} \mathbb{E}_{\pi \sim \Pi}\, \mathbb{E}_{q_\phi(z_i \mid x_i)} \Big[ \log p_\theta(x_i \mid z_i) + \log p(z_i \mid \pi) - \log q_\phi(z_i \mid x_i) \Big] ,   (13)

with the aggregated exemplar based prior defined as

p(z_i \mid \pi) \,=\, \sum_{j \in \pi} \frac{1}{M}\, r_{\phi,\sigma^2}(z_i \mid x_j) \,=\, \sum_{j \in \pi} \frac{1}{M}\, \mathcal{N}(z_i;\, \mu_\phi(x_j),\, \sigma^2 I) .   (14)

For learning, we use a single Monte Carlo sample z_i \sim q_\phi(z \mid x_i) per data point and the reparameterization trick to estimate the gradient of (13) with respect to (\theta, \phi, \sigma^2). In other words, z_i in (13) is obtained via a stochastic latent encoding of a data point using \epsilon \sim \mathcal{N}(0, I) and z_i = \mu_\phi(x_i) + \Lambda_\phi(x_i)^{1/2} \epsilon for a diagonal \Lambda_\phi(x_i).

The simplicity of the Gaussian mixture prior in (14) enables efficient computation of all pairwise distances between a minibatch of latent codes and the Gaussian means using a single matrix product. Moreover, it makes it possible to rely on existing approximate nearest neighbor search methods in Euclidean space to speed up training (e.g., Muja and Lowe (2014)).
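The matrix-product trick can be sketched as follows (a hedged numpy illustration of evaluating the mixture prior (14) for a minibatch; shapes and names are ours, not the paper's):

```python
import numpy as np

def log_exemplar_prior(Z, Mu, sigma):
    """Log of the Gaussian mixture prior (14) for a minibatch of latent codes.

    Z:  (B, d) latent codes; Mu: (M, d) component means (encoded exemplars).
    All B*M pairwise squared distances come from one matrix product Z @ Mu.T,
    via ||z - mu||^2 = ||z||^2 - 2 z.mu + ||mu||^2.
    """
    B, d = Z.shape
    M = Mu.shape[0]
    sq = (Z ** 2).sum(1, keepdims=True) - 2.0 * Z @ Mu.T + (Mu ** 2).sum(1)
    log_kernel = -sq / (2.0 * sigma ** 2) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    m = log_kernel.max(axis=1, keepdims=True)              # stable logsumexp
    return m.ravel() + np.log(np.exp(log_kernel - m).sum(axis=1)) - np.log(M)
```

The same distance matrix can also feed the nearest neighbor search used to accelerate training.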

Recall the definition of Parzen window estimates using Gaussian kernels in (2) and note the similarity between (2) and (14). The Exemplar VAE’s Gaussian mixture prior can be thought of as a Parzen window estimate in the latent space. Therefore, Exemplar VAEs can be interpreted as deep Parzen window estimators, which learn a latent space well suited to Parzen window density estimation.

3.1 Approximate Nearest Neighbor Search for Efficient Optimization

The computational cost during training can become a burden as the number of exemplars N increases. As explained next, this can be mitigated by fast, approximate nearest neighbor search in the latent space to find subsets of relevant exemplars, providing a lower bound on the ELBO.

For each training point x_i, we sample z_i \sim q_\phi(z \mid x_i) and then compute (14). As an alternative, rather than using all exemplars in \pi, one could evaluate (14) with respect to the k Gaussian means nearest to z_i in the latent space, where k is much smaller than the subset size. Back propagation is therefore faster, and because probability density is non-negative and log is monotonically increasing, it follows that

\log \sum_{j \in \pi} \frac{1}{M}\, r_{\phi,\sigma^2}(z \mid x_j) \,\geq\, \log \sum_{j \in \mathrm{kNN}(z, \pi)} \frac{1}{M}\, r_{\phi,\sigma^2}(z \mid x_j) ,   (15)

where \mathrm{kNN}(z, \pi) denotes the k indices in \pi whose Gaussian means are nearest to z. That is, approximating the exemplar prior with k nearest neighbors yields a lower bound on the log prior term and hence on (13).
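This kNN lower bound can be checked numerically with a toy Gaussian mixture (a sketch under assumed shapes; the key detail is that the 1/M normalizer is kept even when summing over the k-term subset):

```python
import numpy as np

def log_mixture(z, Mu, sigma, idx=None):
    """log sum_{j in idx} (1/M) N(z; mu_j, sigma^2 I); idx=None uses all M.
    Keeping the 1/M normalizer while dropping non-negative terms is exactly
    what makes the kNN restriction a lower bound."""
    M, d = Mu.shape
    if idx is None:
        idx = np.arange(M)
    sq = ((Mu[idx] - z) ** 2).sum(1)
    logk = -sq / (2.0 * sigma ** 2) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    m = logk.max()
    return m + np.log(np.exp(logk - m).sum()) - np.log(M)

rng = np.random.default_rng(1)
Mu = rng.normal(size=(50, 8))                    # 50 component means in R^8
z = rng.normal(size=8)
knn = np.argsort(((Mu - z) ** 2).sum(1))[:10]    # 10 nearest means to z
full = log_mixture(z, Mu, 0.5)
approx = log_mixture(z, Mu, 0.5, idx=knn)
```

Because the nearest components dominate the sum, the bound is typically tight in practice.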

We store a cache of latent codes for training points to facilitate nearest neighbor search. The cache is updated whenever a new latent code of a training point is available, i.e., we update the cache for any points in the training minibatch and the kNN prior batch as they are passed through the encoder. Algorithm 1 summarizes the efficient learning procedure.

  Input: Training dataset D = {x_n}_{n=1}^N
  Define Cache:
    initialize cache = []
    insert(i, z): insert value z with index i into cache
    update(i, z): update the value at index i to z
    knn(k, z): return indices of the k nearest neighbors of z in cache
  for x_i in D do
    cache.insert(i, mu_phi(x_i))
  for epoch in 1, ..., max_epochs do
    for minibatch B in D do
      pi <- union over x_i in B of cache.knn(k, mu_phi(x_i)) to obtain a set of exemplar indices
      gradient ascent on the objective (13), restricted to exemplars {x_j : j in pi}, to update phi, theta, and sigma^2
      for x_i in B and x_j with j in pi do cache.update(i, mu_phi(x_i)), cache.update(j, mu_phi(x_j))
Algorithm 1: Efficient Exemplar VAE learning algorithm

4 Related Work

Variational Autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) are a versatile class of latent variable generative models used for non-linear dimensionality reduction (Gregor et al., 2016), generating discrete data (Bowman et al., 2015), and learning disentangled representations (Higgins et al., 2016; Chen et al., 2018), while providing a tractable lower bound on the log marginal likelihood. Improved variants of the VAE have been proposed through modifications to the VAE objective function (Burda et al., 2015), more flexible variational families (Kingma et al., 2016; Rezende and Mohamed, 2015), and more powerful decoding models (Chen et al., 2016; Gulrajani et al., 2016).

In particular, recent work shows that more powerful latent priors (Tomczak and Welling, 2018; Bauer and Mnih, 2018; Dai and Wipf, 2019; Lawson et al., 2019) can significantly improve the effectiveness of VAEs for density estimation, as was hinted by Hoffman and Johnson (2016). This line of work is motivated in part by the empirical observation of the gap between the prior and the aggregated posterior (e.g., Makhzani et al. (2015)). Using more powerful latent priors may help avoid posterior collapse, which is a barrier to the use of VAEs with autoregressive decoders (Bowman et al., 2015). Unlike most existing work, the Exemplar VAE makes limited assumptions about the structure of the latent space and uses a non-parametric exemplar based prior in the latent space.

VAEs with a VampPrior (Tomczak and Welling, 2018) optimize a set of pseudo-inputs together with the encoder network to obtain a mixture model approximation to the aggregated posterior. They argue that computing the exact aggregated posterior is expensive and suffers from overfitting; to avoid overfitting they keep the number of pseudo-inputs much smaller than the dataset size. The VampPrior and the Exemplar VAE are similar in their reuse of the encoding network and a mixture distribution over the latent space. The Exemplar VAE does not require an increase in the number of model parameters, and it avoids overfitting through simple but effective regularization techniques (Sec. 2). Computational costs are reduced through approximate kNN search during training. The Exemplar VAE also extends naturally to large high dimensional datasets, and to discrete data, which are challenging for pseudo-input optimization.

Attention and external memory are effective elements in deep learning architectures. Li et al. (2016) used memory augmented networks with attention to enhance generative models. Bornschein et al. (2017) used hard attention with memory in a VAE, with generation conditioned on a sample from the memory, using both learnable and exemplar memory. One can view the Exemplar VAE as a VAE augmented with memory. The Exemplar VAE and Bornschein et al. (2017) are similar in conditioning generation on external samples. However, Bornschein et al. (2017) optimized a variational posterior over a discrete variable to select the suitable memory index, which can be challenging. In contrast, we use a uniform distribution over exemplars, and computation costs are reduced through approximate kNN search in the latent space. Further, their formulation of the variational posterior is a barrier to sharing the encoder between the approximate posterior and the memory addressing component, unlike the VampPrior and the Exemplar VAE.

While introduced in the context of VAEs, the use of exemplar based generative models is not limited to VAEs. Indeed, one can extend the use of exemplar based priors to other powerful generative frameworks, such as Normalizing Flow (Dinh et al., 2016, 2014; Kingma and Dhariwal, 2018). For example, Li et al. (2019) propose to exploit the local manifold of data points in the latent space to improve Normalizing Flow. They use a simple distance metric based on PCA to define the neighborhood structure. Here, we show that one can train models in an end-to-end manner and let the network design the distance metric.

In natural language processing, Guu et al. (2018) propose a way to edit samples from a training corpus using a neural editor, given a sentence and an edit vector. For each training sentence, they consider a set of similar prototypes selected based on Jaccard distance. This is similar in spirit to the Exemplar VAE; however, like Li et al. (2019), they pre-specify rather than learn the distance metric. Following (Guu et al., 2018; Li et al., 2019), we cast the global generation problem into one of transforming data points. This also bridges the gap between data augmentation and generation.

5 Experiments

To assess the effectiveness of Exemplar VAEs we conduct three sets of experiments, on density estimation, representation learning, and unsupervised data augmentation.

Experimental setup. For training generative models, we mirror the experimental setup of the VampPrior (Tomczak and Welling, 2018) as much as possible. We use the same hyper-parameters and optimizers (gradient normalized Adam (Kingma and Ba, 2014; Yu et al., 2017a)), but we change the generative model. We use a learning rate of 5e-4 and stop training if the validation ELBO does not improve for 50 consecutive epochs. Dynamic binarization of training data is used, along with linear KL annealing for 100 epochs.

We use MNIST comprising 50K training and 10K validation images. For Omniglot, 1345 randomly selected points are used for validation; the remaining 23K images are used for training. All results reported for the VampPrior and the VAE with a Gaussian prior are based on the github implementation of the VampPrior generously provided by the authors. Below we consider three architectures. In VAE we use an MLP with 2 hidden layers (300 units each) for both the encoder and the decoder. HVAE has two stochastic layers, with the generative distribution and approximate posterior factored across the two layers; like the VAE, its encoders and decoders are MLPs. We fix the prior over one of the stochastic layers to a standard Gaussian and use the more expressive priors only for the other. ConvHVAE is more powerful; its generative and posterior distributions are like HVAE's, but with convolutional layers. Details about the architectures and hyper-parameters are given in the supplementary material.

Evaluation. For evaluation of density models, we use the multi-sample bound of Importance Weighted Autoencoders (IWAE) (Burda et al., 2015) with 5000 samples to lower bound the log probability of test data. We use the whole training dataset as the exemplar set, without any regularization or kNN acceleration. This makes evaluation time consuming, but generating an unbiased sample from the Exemplar VAE is efficient. Our preliminary experiments suggest that using kNN for evaluation is feasible.
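The multi-sample bound itself is a short computation. A minimal sketch (the importance log-weights `log_w` would come from the trained model; here they are an abstract input):

```python
import numpy as np

def iwae_bound(log_w):
    """K-sample IWAE bound (Burda et al., 2015): log (1/K) sum_k exp(log w_k),
    where log w_k = log p(x, z_k) - log q(z_k | x) for z_k ~ q(z | x)."""
    K = log_w.shape[0]
    m = log_w.max()                              # stable logsumexp
    return m + np.log(np.exp(log_w - m).sum()) - np.log(K)
```

By Jensen's inequality the bound with K samples is at least as tight as the single-sample ELBO (the mean of the log-weights), which is why 5000 samples give a close estimate of the log marginal likelihood.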

Figure 2: Training and validation ELBO on Dynamic MNIST for Exemplar VAE with and without LOO. LOO reduces the generalization gap between training and validation ELBO.

5.1 Ablation Study

First, we evaluate the effectiveness of the proposed regularization techniques (Sec. 2), i.e., leave-one-out and exemplar subsampling, for enhancing generalization.

Leave-one-out. Here we consider a VAE trained with an optimal prior (the aggregated posterior) and report the gap between the ELBO computed on the training and validation sets. The VampPrior work showed that increasing the number of pseudo-inputs causes overfitting, but here we consider an exact optimal prior. Fig. 2 demonstrates the effectiveness of leave-one-out in helping to avoid overfitting. Table 1 gives the test log-likelihood lower bounds for the Exemplar VAE on both MNIST and Omniglot, with and without LOO.

Exemplar VAE
Dataset w/ LOO w/o LOO
Table 1: Log likelihood lower bounds on the test set (in nats) for Exemplar VAE with and without leave-one-out (LOO).

Exemplar subsampling. As explained in Section 2, the Exemplar VAE uses a hyper-parameter M to define the number of exemplars used for estimating the prior. Here, we report the Exemplar VAE's performance as a function of M divided by the number of training data points, N. We consider M/N ∈ {1, 0.5, 0.2, 0.1}. All models employ LOO, so M/N = 1 refers to M = N − 1. Table 2 presents the results for both MNIST and Omniglot; the fraction used in all of the following experiments is chosen based on these results.

Dataset    M/N = 1    0.5    0.2    0.1
Table 2: Log likelihood lower bounds on the test set (in nats) for Exemplar VAE with different fractions of exemplar subsampling.
Method Dynamic MNIST Fashion MNIST Omniglot
VAE w/ Gaussian prior
VAE w/ VampPrior
Exemplar VAE
HVAE w/ Gaussian prior
HVAE w/ VampPrior
Exemplar HVAE
ConvHVAE w/ Gaussian prior
ConvHVAE w/ Lars
ConvHVAE w/ VampPrior
Exemplar ConvHVAE
Table 3: Density estimation on dynamic MNIST, Fashion MNIST, and Omniglot for different methods and architectures. Log likelihood lower bounds (in nats) averaged across 5 training runs are estimated using IWAE with 5000 samples. All of the architectures use 40-dimensional latent spaces. For LARS (Bauer and Mnih, 2018) and SNIS (Lawson et al., 2019), the IWAE is computed based on 1000 (vs 5000) samples; their architectures and training procedures are also somewhat different.

Efficient learning. For simple VAE architectures, finding the exact prior probability of latent codes based on (10) is feasible. We use this opportunity to ablate the use of approximate nearest neighbor search and caching for efficient training. Table 4 shows that the efficient kNN Exemplar VAE is competitive with the vanilla Exemplar VAE.

Model    M/N = 1    0.5    0.2    0.1
Exemplar VAE
kNN Exemplar VAE
Table 4: Log likelihood lower bounds on the test set (in nats) for Exemplar VAE and kNN Exemplar VAE with k=10.

Fashion MNIST Omniglot
Figure 3: Given the input exemplar on the top left of each plate, conditional Exemplar VAE samples are generated and shown. The Exemplar VAE generates diverse samples while typically preserving the class label and the style of the input exemplar.
MNIST Fashion MNIST Omniglot
Figure 4: Samples generated by the Exemplar VAE. Given the input exemplar on the left, conditional model samples are shown on the right. Note that Exemplar VAE is able to preserve the style and identity of the input exemplar, yet generates novel and diverse samples.

5.2 Density Estimation

We report density estimation on MNIST, Omniglot, and Fashion MNIST, using three different architectures, namely VAE, HVAE, and ConvHVAE (Tomczak and Welling, 2018). For each architecture we consider a Gaussian prior, the VampPrior, and an exemplar based prior. For training VAE and HVAE we used the exact exemplar prior, but for ConvHVAE we used the 10 nearest neighbor approximation (see Sec. 3.1).

Test log likelihood lower bounds were estimated using 5000 samples from the variational posterior (Burda et al., 2015). In the case of the Exemplar VAE, the training set was used as the exemplars.

Table 3 shows that the Exemplar VAE outperforms the other models (in all but one case), including the VampPrior, which represents the state of the art among VAEs with a factored variational posterior. Improvements of the Exemplar VAE on Omniglot are greater than on the other datasets, likely due to the significant diversity of this dataset. One can enhance the VampPrior with more pseudo-inputs, but we find this often makes optimization challenging and leads to overfitting. We posit that, by comparison, Exemplar VAEs have the potential to scale more easily to large, diverse datasets.

Fig. 3 shows samples generated from the Exemplar ConvHVAE (top-left is the input exemplar for each plate). These samples highlight the power of the Exemplar VAE in maintaining the content of the source image while adding diversity. In the case of MNIST the most evident changes are in stroke width and brightness, with slight variation in shape. Fashion MNIST and Omniglot samples show more pronounced variations in the style of the source image, possibly because both datasets exhibit greater diversity than MNIST.

5.3 Representation Learning

Exemplar VAE VAE w/ Gaussian Prior
Figure 5: t-SNE visualization of learned latent representations for MNIST test points, colored by labels.

We next explore the structure of the latent representation learned by the Exemplar VAE. Fig. 5 shows a t-SNE visualization of the latent representations of MNIST test data for the Exemplar VAE and for a VAE with a Gaussian prior. Test points are colored by their digit label. (No labels were used during training.) The Exemplar VAE representation appears more meaningful, with tighter clusters. We also use k-nearest neighbor (kNN) classification performance as a proxy for representation quality. As is clear from Table 5, the Exemplar VAE consistently outperforms the other approaches. Results on Omniglot are not reported since the low resolution version of this dataset does not include class labels.
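The kNN proxy can be sketched as follows (a minimal numpy implementation; the latent codes and labels here are synthetic stand-ins for encoded test data):

```python
import numpy as np

def knn_error(train_Z, train_y, test_Z, test_y, k=5):
    """kNN classification error on latent codes, used as a proxy for
    representation quality (labels are never seen by the generative model)."""
    sq = ((test_Z ** 2).sum(1, keepdims=True)
          - 2.0 * test_Z @ train_Z.T
          + (train_Z ** 2).sum(1))                 # pairwise squared distances
    nn = np.argsort(sq, axis=1)[:, :k]             # k nearest training codes
    preds = np.array([np.bincount(train_y[row]).argmax() for row in nn])
    return float(np.mean(preds != test_y))
```

A representation that clusters classes tightly, as the t-SNE plots suggest for the Exemplar VAE, yields low error under this metric.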

Method MNIST Fashion MNIST
VAE w/ Gaussian Prior
VAE w/ VampPrior
Exemplar VAE
Table 5: kNN classification error (%) on 40-D representations.
Figure 6: MNIST validation error as a function of λ in (16) for different generative models. The parameter λ controls the strength of the loss on real vs. augmented data. Here, we use a simple MLP with ReLU activations and two hidden layers of 1024 units.

5.4 Generative Model Evaluation

The Exemplar VAE also inherits some properties of non-parametric models. For instance, after fitting an Exemplar VAE to a particular dataset, we need not use the original training data for evaluation. We can instead take samples from any generative model (GANs, energy based models, VAEs) and use those samples as exemplars for the Exemplar VAE. If a generative model captures the diversity and properties of the training dataset well, then the test-set probability under an Exemplar VAE built from that model's samples should not differ much from that of an Exemplar VAE built from the original training data. Furthermore, we can rank different generative models by this evaluation.

We evaluated the three VAE variants trained in the previous section. We drew 50k samples from each generative model and used them as exemplars for evaluating test-set likelihood under the Exemplar VAE.

Evaluation Algorithm
Model IWAE ExVAE ExVAE (250k)
VAE + Gaussian Prior
VAE + VampPrior
Exemplar VAE
Table 6: Log likelihood lower bounds on the test set (in nats) evaluated based on Exemplar VAE and classical IWAE both with 5000 samples.

5.5 Generative Data Augmentation

We assess the effectiveness of the Exemplar VAE for generating augmented data to improve supervised learning. Recent generative models have achieved impressive sample quality and diversity, but they have seen limited success in improving discriminative models. Ravuri and Vinyals (2019) used class-conditional generative models to synthesize extra training data, yielding marginal gains on ImageNet accuracy. Alternative techniques for optimizing data augmentation policies (Cubuk et al., 2019; Lim et al., 2019; Hataya et al., 2019) or approaches based on adversarial perturbations (Goodfellow et al., 2014a; Miyato et al., 2018) have been more successful in improving classification accuracy.

In our experiments we use the training data points as exemplars and generate additional samples from the Exemplar VAE. Class labels of the exemplars are transferred to the corresponding new images, and a combination of real and generated data is used for training. More specifically, during each training iteration, we perform the following steps:

  1. Draw a minibatch from the training data.

  2. For each , draw , and then set , which inherits the class label . This yields a synthetic minibatch .

  3. Take a gradient descent step on the weighted cross entropy loss,
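The steps above can be sketched as follows. The `encode`/`decode` callables stand in for the trained Exemplar VAE networks, and `lam` plays the role of the weighting parameter of Eq. (16), whose symbol is elided in this copy; all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_minibatch(x, y, encode, decode, noise_std=0.1):
    """Steps 1-2: stochastically re-encode each real example and decode
    a synthetic one that inherits the real example's class label."""
    z = encode(x)
    z = z + noise_std * rng.standard_normal(z.shape)  # stochastic transform
    return decode(z), y

def weighted_cross_entropy(logits, y, logits_aug, y_aug, lam):
    """Step 3: weighted cross entropy over the real and synthetic
    minibatches; lam trades off the real vs. augmented terms."""
    def ce(lg, t):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -float(logp[np.arange(len(t)), t].mean())
    return lam * ce(logits, y) + (1.0 - lam) * ce(logits_aug, y_aug)
```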


We train MLPs with ReLU activations and two hidden layers of 1024 or 8192 units on MNIST and Fashion MNIST, using the proposed generative data augmentation method. We use label smoothing (Szegedy et al., 2016) with a smoothing parameter of for our experiments with and without data augmentation. Note that the Exemplar VAEs used for data augmentation adopt fully connected layers and do not observe class labels during training. Networks are optimized using stochastic gradient descent with a momentum of for epochs. The learning rate is linearly decayed from an initial value of to . The meta-parameters in (16) and were tuned on the validation set.

Figure 6 shows that the Exemplar VAE is far more effective than other VAEs for data augmentation, and that even a small amount of generative data augmentation improves classifier accuracy. Interestingly, a classifier trained solely on synthetic data achieves error rates smaller than classifiers trained on the original data. Given on MNIST and on Fashion MNIST, we train 10 networks on the union of training and validation sets and report average test errors. On permutation invariant MNIST, Exemplar VAE augmentations achieve an impressive average error rate of . Table 7 summarizes our results in comparison with previous work. On MNIST, Ladder Networks (Sønderby et al., 2016) and Virtual Adversarial Training (Miyato et al., 2018) report and error rates respectively, using deeper architectures and much more complex training procedures. Table 8 summarizes our results on permutation invariant Fashion MNIST, showing consistent gains stemming from Exemplar VAE augmentations.

  Method Hidden layers Test error
  Dropout [1]
  Label smoothing [2]
  Dropconnect [3]
  Variational Info. Bottleneck [4]
  Dropout + Max norm contraint [1]
  Manifold Tangent Classifier [5]
  DBM + dropout finetuning [1]
  Label Smoothing (LS)
  LS + Exemplar VAE Augmentation
  Label Smoothing
  LS + Exemplar VAE Augmentation
Table 7: Test error (%) on permutation invariant MNIST from [1] Srivastava et al. (2014), [2] Pereyra et al. (2017), [3] Wan et al. (2013), [4] Alemi et al. (2016), and [5] Rifai et al. (2011), as well as our results with and without generative data augmentation.
  Method Hidden layers Test error
  Label Smoothing
  LS + Exemplar VAE Augmentation
  Label Smoothing
  LS + Exemplar VAE Augmentation
Table 8: Test error (%) on permutation invariant Fashion MNIST.

6 Conclusion

This paper develops a new framework for exemplar based generative modeling. We extend this framework to VAEs by defining an exemplar based prior distribution on the latent variables. We present two simple but effective regularization techniques for Exemplar VAEs, and propose an efficient learning algorithm based on approximate nearest neighbor search. The effectiveness of the Exemplar VAE is demonstrated on density estimation, representation learning, and data augmentation for supervised learning.

The development of Exemplar VAEs opens up interesting future research directions. Application to language modeling and other discrete data, especially in the context of unsupervised data augmentation, is worth exploring. Extensions to other generative models such as normalizing flows and GANs, and to larger and more complex image datasets, are promising. Exploring the effect of the exemplar based prior on posterior collapse and on learning disentangled representations would be valuable. Last but not least, the non-parametric properties of the Exemplar VAE may enable evaluation of generative models with intractable log-likelihoods.


We are extremely grateful to Micha Livne, Will Grathwohl, and Kevin Swersky for extensive discussions. We thank Diederik Kingma, Chen Li, and Danijar Hafner for their feedback on an initial draft of this paper. This work was financially supported in part by a grant from NSERC Canada.


  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. arXiv:1612.00410. Cited by: Table 8.
  • M. Bauer and A. Mnih (2018) Resampled priors for variational autoencoders. arXiv:1810.11428. Cited by: Appendix A, §D.1, §1, §4, Table 3.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association. Cited by: §3.
  • J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende (2017) Variational memory addressing in generative models. NeurIPS. Cited by: §1, §2.1, §4.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv:1511.06349. Cited by: Appendix A, §4, §4.
  • Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv:1509.00519. Cited by: Table 9, Appendix A, §4, §5.2, §5.
  • R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems. Cited by: §4.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv:1611.02731. Cited by: §4.
  • A. Criminisi, P. Perez, and K. Toyama (2003) Object removal by exemplar-based inpainting. CVPR. Cited by: §1.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §5.5.
  • B. Dai and D. Wipf (2019) Diagnosing and enhancing VAE models. ICLR. Cited by: §4.
  • Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. International Conference on Machine Learning 70, pp. 933–941. Cited by: §D.1.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv:1410.8516. Cited by: §1, §4.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv:1605.08803. Cited by: §1, §4.
  • A. A. Efros and T. K. Leung (1999) Texture synthesis by non-parametric sampling. International Conference on Computer Vision. Cited by: §1.
  • W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE Computer graphics and Applications. Cited by: §1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014a) Explaining and harnessing adversarial examples. arXiv:1412.6572. Cited by: §5.5.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014b) Generative adversarial nets. Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra (2016) Towards conceptual compression. NeurIPS. Cited by: §4.
  • I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville (2016) Pixelvae: a latent variable model for natural images. arXiv:1611.05013. Cited by: §4.
  • K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang (2018) Generating sentences by editing prototypes. TACL. Cited by: §1, §4.
  • R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama (2019) Faster autoaugment: learning augmentation strategies using backpropagation. arXiv:1911.06987. Cited by: §5.5.
  • J. Hays and A. A. Efros (2007) Scene completion using millions of photographs. ACM Transac. on Graphics (TOG). Cited by: §1.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations. Cited by: §4.
  • M. D. Hoffman and M. J. Johnson (2016) Elbo surgery: yet another way to carve up the variational evidence lower bound. Workshop in Advances in Approximate Bayesian Inference, NIPS 1, pp. 2. Cited by: §3, §4.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. ECCV. Cited by: §1.
  • M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine Learning. Cited by: §3.
  • U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2019) Generalization through memorization: nearest neighbor language models. arXiv:1911.00172. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §5.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. ICLR. Cited by: §1, §3, §4.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §4.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. NeurIPS. Cited by: §4.
  • D. Kunin, J. M. Bloom, A. Goeva, and C. Seed (2019) Loss landscapes of regularized linear autoencoders. arXiv:1901.08168. Cited by: §1.
  • J. Lafferty, A. McCallum, and F. C. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML. Cited by: §1.
  • J. Lawson, G. Tucker, B. Dai, and R. Ranganath (2019) Energy-inspired models: learning with sampler-induced distributions. NeurIPS. Cited by: §1, §4, Table 3.
  • C. Li, J. Zhu, and B. Zhang (2016) Learning to generate with memory. ICML. Cited by: §4.
  • Y. Li, T. Gao, and J. Oliva (2019) A forest from the trees: generation through neighborhoods. arXiv:1902.01435. Cited by: §1, §4, §4.
  • S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019) Fast autoaugment. NeurIPS. Cited by: §5.5.
  • J. Lucas, G. Tucker, R. B. Grosse, and M. Norouzi (2019) Don’t blame the elbo! a linear vae perspective on posterior collapse. NeurIPS. Cited by: Appendix A.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv:1511.05644. Cited by: §3, §4.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. PAMI 41 (8), pp. 1979–1993. Cited by: §5.5, §5.5.
  • M. Muja and D. G. Lowe (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. PAMI. Cited by: §3.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv:1609.03499. Cited by: §1.
  • E. Parzen (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics. Cited by: §1, §2.
  • G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton (2017) Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548. Cited by: Table 8.
  • S. Ravuri and O. Vinyals (2019) Classification accuracy score for conditional generative models. Advances in Neural Information Processing Systems, pp. 12247–12258. Cited by: §5.5.
  • S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee (2016) Generative adversarial text to image synthesis. ICLR. Cited by: §1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082. Cited by: §1, §4.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv:1505.05770. Cited by: §4.
  • S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller (2011) The manifold tangent classifier. Advances in Neural Information Processing Systems, pp. 2294–2302. Cited by: Table 8.
  • C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. Advances in Neural Information Processing Systems, pp. 3738–3746. Cited by: §5.5.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §2.1, Table 8.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition. Cited by: §5.5.
  • J. M. Tomczak and M. Welling (2018) VAE with a vampprior. AISTATS. Cited by: Appendix A, §D.1, §1, §1, §2.1, §3, §4, §4, §5.2, §5.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. International Conference on Machine Learning, pp. 1058–1066. Cited by: Table 8.
  • E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng (2003) Distance metric learning with application to clustering with side-information. NeurIPS. Cited by: §1.
  • A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell (2017a) Block-normalized gradient method: an empirical study for training deep neural network. arXiv:1707.04822. Cited by: §5.
  • A. W. Yu, Q. Lin, R. Salakhutdinov, and J. Carbonell (2017b) Normalized gradient with adaptive stepsize method for deep neural network training. arXiv:1707.04822 18 (1). Cited by: §D.2.

Appendix A Number of Active Dimensions in the Latent Space

The problem of posterior collapse (Bowman et al., 2015; Lucas et al., 2019), which leaves a number of inactive dimensions in the latent space of a VAE, can be reduced by using a VampPrior (Tomczak and Welling, 2018) instead of a factored Gaussian prior. We investigate this phenomenon by counting the number of active dimensions according to the metric proposed by Burda et al. (2015). This metric computes, for each dimension of the latent space, the variance of the mean of the latent encoding across data points sampled from the dataset. If the computed variance is above a certain threshold, that dimension is considered active; we use the same threshold value as Bauer and Mnih (2018). We observe that the Exemplar VAE has the largest number of active dimensions in all cases except one. For the ConvHVAE on MNIST and Fashion MNIST, the gap between the Exemplar VAE and the other methods is considerable.
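This activity count admits a very short implementation. The sketch below assumes `mu` holds the per-example posterior means, one row per data point; the threshold shown is a common choice, as the exact value used in the experiments is elided in this copy:

```python
import numpy as np

def active_dimensions(mu, threshold=1e-2):
    """Burda et al.'s activity metric: a latent dimension counts as
    active if the variance, across the dataset, of the posterior mean
    in that dimension exceeds the threshold."""
    return int((mu.var(axis=0) > threshold).sum())
```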

Number of Active Dimensions out of
Model Dynamic MNIST Fashion MNIST Omniglot
VAE w/ Gaussian prior
VAE w/ Vampprior
Exemplar VAE
HVAE w/ Gaussian prior
HVAE w/ VampPrior
Exemplar HVAE
ConvHVAE w/ Gaussian prior
ConvHVAE w/ VampPrior
Exemplar ConvHVAE
Table 9: The number of active dimensions, computed with the metric proposed by Burda et al. (2015). This metric considers a latent dimension active if the variance of its mean over the dataset exceeds a threshold. For hierarchical architectures, the reported number is for the highest stochastic layer.

Appendix B Cyclic Generation

The Exemplar VAE generates a new sample by transforming a randomly selected exemplar. The newly generated data point can itself serve as an exemplar, and this procedure can be repeated indefinitely. Such chained generation bears some similarity to MCMC sampling in energy-based models. Figure 7 shows how samples evolve while consistently staying near the manifold of MNIST digits. The same procedure can also be started from a noisy input image as the exemplar; Figure 8 shows that the model quickly transforms noisy images into samples that resemble real MNIST digits.
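The generation chain can be sketched as follows; `encode` and `decode` stand in for the Exemplar VAE's stochastic encoder and decoder (assumptions, with the encoder's sampling noise made explicit here):

```python
import numpy as np

rng = np.random.default_rng(0)

def cyclic_generate(x0, encode, decode, steps=8, noise_std=0.1):
    """Repeatedly treat the previous sample as the exemplar for the
    next round of generation, starting from x0 (a data point or pure
    noise). Returns the whole chain of samples."""
    samples = [x0]
    x = x0
    for _ in range(steps):
        z = encode(x)
        z = z + noise_std * rng.standard_normal(z.shape)  # stochastic transform
        x = decode(z)
        samples.append(x)
    return samples
```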

Figure 7: Cyclic generation starting from a training data point. Samples generated from an Exemplar VAE starting from a training data point, and then reusing the generated data as exemplars for the next round of generation (left to right).
Figure 8: Cyclic generation starting from a noise input (left to right).

Appendix C Reconstruction vs. KL

Table 10 reports the KL and reconstruction terms of the ELBO, computed with a single sample from the variational posterior and averaged over the test set. These numbers show that the Exemplar VAE not only improves the KL term, but also yields reconstruction terms comparable to the VampPrior.

Dynamic MNIST Fashion MNIST Omniglot
Model KL Neg.Reconst. KL Neg. Reconst. KL Neg. Reconst.
VAE w/ Gaussian prior
VAE w/ VampPrior
Exemplar VAE
HVAE w/ Gaussian prior
HVAE w/ VampPrior
Exemplar HVAE
ConvHVAE w/ Gaussian prior
ConvHVAE w/ VampPrior
Exemplar ConvHVAE
Table 10: KL and reconstruction part of ELBO averaged over test set by a single sample from posterior.

Appendix D Experimental Details

D.1 Architectures

All of the neural network architectures are based on the VampPrior of Tomczak and Welling (2018), the implementation of which is available online. We leave tuning the architecture of Exemplar VAEs to future work. To describe the network architectures, we follow the notation of LARS (Bauer and Mnih, 2018). Neural network layers are either convolutional (denoted CNN) or fully connected (denoted MLP), and the numbers of units are written inside brackets separated by dashes (e.g., MLP[300-784] denotes a fully-connected layer with 300 input units and 784 output units). We use curly brackets to denote concatenation.

Three different architectures are used in the experiments, described below. refers to the dimensionality of the latent space.

a) VAE:

b) HVAE:

c) ConvHVAE: The generative and variational posterior distributions are identical to HVAE.

As the activation function, the gating mechanism of Dauphin et al. (2017) is used throughout: each layer has two parallel branches, and the sigmoid of one branch multiplies the output of the other. In the ConvHVAE, the kernel size of the first layer of is 7, and the third layer uses a kernel size of 5. The last layer of uses a kernel size of 1, and all other layers use kernels.
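The gating mechanism can be sketched as follows, here with plain affine branches for clarity; in the paper's architectures the two branches are the convolutional or fully connected layers themselves, and the numpy weight arrays below are illustrative:

```python
import numpy as np

def gated_layer(x, w_out, b_out, w_gate, b_gate):
    """Gating of Dauphin et al. (2017): two parallel branches, where
    the sigmoid of the gate branch multiplies the output branch."""
    out = x @ w_out + b_out         # output branch
    gate = x @ w_gate + b_gate      # gate branch
    return out * (1.0 / (1.0 + np.exp(-gate)))  # elementwise sigmoid gate
```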

D.2 Hyper-parameters

We use Gradient Normalized Adam (Yu et al., 2017b) with a learning rate of and a minibatch size of for all of the datasets. We dynamically binarize each training data point, but we do not binarize the exemplars that serve as the prior. We use early stopping, halting training if the validation ELBO does not improve for consecutive epochs. In all experiments, we use 40-dimensional latent spaces for both hierarchical and non-hierarchical architectures. To limit the computational cost of the ConvHVAE, we use kNN based on Euclidean distance in the latent space, with set to . The number of exemplars is set to half of the training data, except in the ablation study section.

Appendix E Misclassified MNIST Digits

Figure 9: Misclassified images from MNIST test set for a two layer MLP trained with Exemplar VAE augmentation.

Appendix F Additional Samples

MNIST Fashion MNIST Omniglot
Figure 10: Random samples generated by three Exemplar VAEs trained on MNIST, Fashion MNIST, and Omniglot.

Figure 11: Given the input exemplar on the top left of each plate, conditional Exemplar VAE samples are generated and shown.