Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation
This paper presents a framework for exemplar-based generative modeling, featuring Exemplar VAEs. To generate a sample from the Exemplar VAE, one first draws a random exemplar from a training dataset, and then stochastically transforms that exemplar into a latent code, which is then used to generate a new observation. We show that the Exemplar VAE can be interpreted as a VAE with a mixture of Gaussians prior in the latent space, with Gaussian means defined by the latent encoding of the exemplars. To enable optimization and avoid overfitting, Exemplar VAE's parameters are learned using leave-one-out and exemplar subsampling, where, for the generation of each data point, we build a prior based on a random subset of the remaining data points. To accelerate learning, which requires finding the exemplars that exert the greatest influence on the generation of each data point, we use approximate nearest neighbor search in the latent space, yielding a lower bound on the log marginal likelihood. Experiments demonstrate the effectiveness of Exemplar VAEs in density estimation, representation learning, and generative data augmentation for supervised learning.
Consider the problem of conditional image generation, given a natural language description of a scene such as “A woman is staring at Monet’s Water Lilies”. There are two general strategies for addressing this problem. One can resort to exemplar-based methods, e.g., using web search engines to retrieve photographs with similar captions, and then edit the retrieved images to generate new ones. Alternatively, one can adopt parametric models such as deep neural networks optimized for text-to-image translation to synthesize new relevant scenes.
This paper presents a machine learning framework for exemplar based generative models using expressive neural nets, combining the advantages of both exemplar based and parametric paradigms. Here we focus on simple unconditional generation tasks, but the learning formulation and the methods developed are generally applicable to many potential applications including text to image translation.
Exemplar-based methods depend on large and diverse datasets of exemplars and relatively simple machine learning algorithms, such as Parzen window estimation (Parzen, 1962) and conditional random fields (Lafferty et al., 2001). Despite their simplicity, they deliver impressive results on texture synthesis (Efros and Leung, 1999), image super-resolution (Freeman et al., 2002), and inpainting (Criminisi et al., 2003; Hays and Efros, 2007). These techniques can accommodate web-scale datasets, with sample quality improving as the dataset size increases, without further optimization of model parameters. The success of exemplar-based methods hinges on the distance metric used to build a local density model for each neighborhood. Unfortunately, finding an effective distance metric in a high dimensional space is challenging on its own (Xing et al., 2003; Johnson et al., 2016). Further, while exemplar-based methods excel at interpolation, they often underperform their parametric counterparts at extrapolation.
Parametric generative models based on deep neural nets enable learning complex data distributions across myriad problem domains (e.g., Oord et al. (2016); Reed et al. (2016)). Predominant models, such as Variational Autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014), Normalizing Flows (Dinh et al., 2014, 2016), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014b), adopt a decoder network to convert samples from a prior distribution, often a factored Gaussian, into samples from the target distribution. Once training is complete, these models discard the training data and generate new samples using the decoder network alone. Hence, the burden of generative modeling rests entirely on the parametric model. Further, when additional training data become available, these models require retraining or fine-tuning.

This paper investigates a general framework for exemplar-based generative modeling and a particular instantiation of this framework called the Exemplar VAE. To sample from the Exemplar VAE, one first draws a random exemplar from a training dataset and then stochastically transforms that exemplar into a new observation. We are inspired by recent work on generative models augmented with external memory (e.g., Guu et al. (2018); Li et al. (2019); Tomczak and Welling (2018); Khandelwal et al. (2019); Bornschein et al. (2017)), but unlike most existing work, we do not rely on a pre-specified distance metric to define the neighborhood structure. Instead, we simultaneously learn a latent space and a distance metric suited for generative modeling.
The Exemplar VAE can be interpreted as a VAE with a Gaussian mixture prior in the latent space, with one component per exemplar. The component means are defined by the latent encodings of the exemplars. We build on the VampPrior formulation of Tomczak and Welling (2018), and our work is a continuation of recent papers on enhancing VAEs with richer latent priors (Kunin et al., 2019; Bauer and Mnih, 2018; Lawson et al., 2019).
The main contributions of this paper include:
The development of the Exemplar VAE and a framework for exemplar-based generative modeling.
The proposal of critical regularization methods, enhancing the generalization of exemplar-based generative models.
The use of approximate nearest neighbor search to formulate a lower bound on the ELBO, accelerating learning.
Our experiments demonstrate that Exemplar VAEs consistently outperform VAEs with a Gaussian prior and a VampPrior on density estimation and representation learning. Further, unsupervised data augmentation using Exemplar VAEs proves extremely helpful, substantially reducing the classification error rate on permutation invariant MNIST.
We define an exemplar-based generative model in terms of a dataset of exemplars, $X = \{x_n\}_{n=1}^N$, and a parametric transition distribution, $T_\theta(x \mid x')$, which stochastically transforms an exemplar $x'$ into a new observation $x$. The log density of a data point $x$ under an exemplar-based generative model is expressed as

$$\log p(x; X, \theta) \,=\, \log \sum_{n=1}^{N} \frac{1}{N}\, T_\theta(x \mid x_n)~, \tag{1}$$

where we assume the prior probability of selecting each exemplar is uniform.
The transition distribution can be defined using any expressive parametric generative model, including VAEs, Normalizing Flows, and autoregressive models. Any reasonable transition distribution should put considerable probability mass on the reconstruction of an exemplar from itself, i.e., $T_\theta(x_n \mid x_n)$ should be large for all $n$. Further, an ideal transition distribution should be able to model the conditional dependencies between different dimensions of $x$ given $x'$, since the dependence of $x$ on $x'$ is often insufficient to make the dimensions of $x$ conditionally independent.
One can think of Kernel Density Estimation (KDE), also known as Parzen window estimation (Parzen, 1962), as the simplest instance of exemplar-based generative models, in which the transition distribution is defined in terms of a pre-specified kernel function and its meta-parameters. For example, with a Gaussian kernel, KDE takes the form

$$\log p(x; X, \sigma) \,=\, \log \sum_{n=1}^{N} \exp\!\Big(\frac{-\lVert x - x_n \rVert^2}{2\sigma^2}\Big) \,-\, \log N \,-\, Z_\sigma~, \tag{2}$$

where $Z_\sigma = \frac{d}{2}\log 2\pi\sigma^2$ is the log normalizing constant for an isotropic Gaussian in $d$ dimensions. The nonparametric nature of KDE enables one to exploit extremely large heterogeneous datasets of exemplars and apply nearest neighbor search techniques to density estimation. That said, simple KDEs underperform neural density estimation, especially in high dimensional spaces, due to the inflexibility of typical transition distributions, e.g., when $T_\sigma(x \mid x') = \mathcal{N}(x;\, x',\, \sigma^2 I)$.
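To make (2) concrete, here is a minimal NumPy sketch of the Gaussian-kernel KDE log density; the function and argument names are our own:

```python
import numpy as np

def kde_log_density(x, exemplars, sigma):
    """Log density of x under a Gaussian KDE, as in Eq. (2).

    x: (d,) query point; exemplars: (N, d) dataset X; sigma: kernel bandwidth.
    """
    N, d = exemplars.shape
    sq_dists = np.sum((exemplars - x) ** 2, axis=1)   # ||x - x_n||^2 for all n
    log_kernels = -sq_dists / (2.0 * sigma ** 2)      # unnormalized Gaussian log kernels
    # log-sum-exp for numerical stability, then subtract log N and the
    # log normalizer of an isotropic Gaussian in d dimensions
    m = log_kernels.max()
    log_sum = m + np.log(np.exp(log_kernels - m).sum())
    return log_sum - np.log(N) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
```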
This work aims to adopt desirable properties of nonparametric exemplar-based models to help scale parametric models to large heterogeneous datasets.
We use expressive parametric density functions to represent the transition distribution within the exemplar-based framework (1). Consistent with recent work (Tomczak and Welling, 2018; Bornschein et al., 2017), we find that simply maximizing the expected log marginal likelihood over the empirical training data,

$$\mathcal{L}(\theta; X) \,=\, \sum_{i=1}^{N} \log \sum_{n=1}^{N} \frac{1}{N}\, T_\theta(x_i \mid x_n)~, \tag{3}$$

to find the parameters $\theta$ of the transition distribution results in massive overfitting. This is not surprising, since a flexible transition distribution can put all its probability mass on the reconstruction of each exemplar, i.e., $T_\theta(x \mid x') \approx \delta(x - x')$, yielding high log-likelihood on training data but poor generalization.
We propose two simple but effective regularization strategies to mitigate overfitting in exemplar-based generative models:
Leave-one-out during training. The generation of a given data point is expressed in terms of all exemplars except that point. The nonparametric nature of the generative model enables easy adoption of such a leave-one-out (LOO) objective during training, to optimize

$$\mathcal{L}_{\mathrm{LOO}}(\theta; X) \,=\, \sum_{i=1}^{N} \log \sum_{n=1}^{N} \frac{\mathbb{1}[n \neq i]}{N-1}\, T_\theta(x_i \mid x_n)~, \tag{4}$$

where $\mathbb{1}[n \neq i]$ is an indicator function taking the value 1 if and only if $n \neq i$.
Exemplar subsampling during training. In addition to LOO, we observe that explaining a training point using a random subset of the remaining training exemplars improves generalization. To that end we use a hyperparameter $M$ to define the exemplar subset size for the generative model. To generate $x_i$ we draw $M$ exemplar indices, denoted $\pi = \{\pi_1, \ldots, \pi_M\}$, uniformly at random from size-$M$ subsets of $\{1, \ldots, N\} \setminus \{i\}$. Let $\Pi$ denote this sampling procedure, with $\binom{N-1}{M}$ possible subset outcomes. Combining LOO and exemplar subsampling, the objective takes the form

$$\mathcal{L}_M(\theta; X) \,=\, \sum_{i=1}^{N} \mathop{\mathbb{E}}_{\pi \sim \Pi}\, \log \sum_{m=1}^{M} \frac{1}{M}\, T_\theta(x_i \mid x_{\pi_m})~. \tag{5}$$
Note that by moving the expectation inside the log in (5) we recover $\mathcal{L}_{\mathrm{LOO}}$; i.e., via Jensen’s inequality, $\mathcal{L}_M$ is a lower bound on $\mathcal{L}_{\mathrm{LOO}}$. Nevertheless, even when training with $\mathcal{L}_{\mathrm{LOO}}$ is possible, we find that $M < N-1$ yields better generalization.
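For illustration, a sketch of a one-sample Monte Carlo estimate of (5); the dense matrix of transition log probabilities is an assumption made purely for clarity (in practice each term is a VAE bound, as in Sec. 3):

```python
import math
import torch

def loo_subsampled_objective(log_T: torch.Tensor, M: int) -> torch.Tensor:
    """One-sample Monte Carlo estimate of the LOO + subsampling objective (5).

    log_T: (N, N) tensor with log_T[i, n] = log T_theta(x_i | x_n).
    M: exemplar subset size, M <= N - 1.
    """
    N = log_T.shape[0]
    total = log_T.new_zeros(())
    for i in range(N):
        # leave-one-out: candidate indices exclude i itself
        candidates = torch.cat([torch.arange(i), torch.arange(i + 1, N)])
        pi = candidates[torch.randperm(N - 1)[:M]]          # one draw pi ~ Pi
        # log (1/M) sum_m T_theta(x_i | x_{pi_m}) via log-sum-exp
        total = total + torch.logsumexp(log_T[i, pi], dim=0) - math.log(M)
    return total
```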
Once training is complete, we use all training exemplars to explain the generation of validation or test points using (1). Hence, the regularization techniques discussed above (LOO and exemplar subsampling) are not relevant to inference. The nonparametric nature of an exemplar-based model is compatible with such regularization techniques, whose use is not straightforward when training parametric generative models. Even though cross-validation is commonly used for hyperparameter tuning and model selection, here cross-validation is used directly as a training objective, suggestive of a meta-learning perspective.
Exemplar subsampling is also reminiscent of Dropout (Srivastava et al., 2014): we drop some mixture components in the latent space during training to encourage better generalization at test time. As with conventional dropout, we still use the complete model for evaluation.
We present the Exemplar VAE as an instance of neural exemplar-based generative models, in which the transition distribution is defined in terms of the encoder and the decoder of a VAE, i.e.,

$$T_{\theta,\phi}(x \mid x') \,=\, \int_z r_\theta(x \mid z)\, q_\phi(z \mid x')\, dz~. \tag{6}$$
The Exemplar VAE assumes that, given the latent code $z$, an observation $x$ is conditionally independent of the associated exemplar $x'$. This conditional independence assumption helps simplify the formulation, enabling efficient optimization.
The generative process of an Exemplar VAE has three steps:
1. Sample $n \sim \mathrm{Uniform}(1, \ldots, N)$ to draw a random exemplar $x_n$ from the training set $X$.
2. Sample $z \sim q_\phi(z \mid x_n)$ using the VAE’s encoder to transform the exemplar into a distribution over latent codes, from which $z$ is drawn.
3. Sample $x \sim r_\theta(x \mid z)$ using the VAE’s decoder to transform $z$ into a distribution over the observation space, from which $x$ is drawn.
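The three steps translate directly into code; a rough sketch, where enc_mu, dec, and sigma_r are placeholder handles for the trained prior-mean encoder, decoder, and prior scale:

```python
import torch

@torch.no_grad()
def sample_exemplar_vae(exemplars, enc_mu, dec, sigma_r):
    """Draw one sample via the three-step generative process.

    exemplars: (N, d) training set X; enc_mu: maps x' -> prior mean mu_phi(x');
    dec: maps z -> Bernoulli logits over pixels; sigma_r: scalar prior std.
    All module names are illustrative placeholders.
    """
    N = exemplars.shape[0]
    n = torch.randint(N, (1,))                  # 1) n ~ Uniform(1..N)
    mu = enc_mu(exemplars[n])                   # encode the chosen exemplar
    z = mu + sigma_r * torch.randn_like(mu)     # 2) z ~ N(mu_phi(x_n), sigma^2 I)
    probs = torch.sigmoid(dec(z))               # 3) x ~ r_theta(x | z), Bernoulli
    return torch.bernoulli(probs)
```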
In a VAE with a Gaussian prior on latent codes, the encoder network is used during training to define a variational bound on the log marginal likelihood (Kingma and Welling, 2014). Once training is complete, a conventional VAE generates new observations using the decoder alone. To sample from an Exemplar VAE, in addition to the decoder, one needs access to a set of exemplars and the encoder, or at least the latent encoding of the exemplars. Importantly, given the nonparametric nature of Exemplar VAEs, one can train this model with one set of exemplars and perform inference with another, potentially much larger set.
Marginalizing out the exemplar index $n$ and the latent variable $z$, we derive an evidence lower bound (ELBO) (Jordan et al., 1999; Blei et al., 2017) on the Exemplar VAE’s log marginal likelihood for a single data point $x$:

$$\log p(x; X) \,=\, \log \sum_{n=1}^{N} \frac{1}{N} \int_z r_\theta(x \mid z)\, q_\phi(z \mid x_n)\, dz \tag{7}$$
$$\ge\, \mathop{\mathbb{E}}_{q(z \mid x)} \log \frac{r_\theta(x \mid z)\, \sum_{n=1}^N q_\phi(z \mid x_n)}{N\, q(z \mid x)} \tag{8}$$
$$=\, \mathop{\mathbb{E}}_{q(z \mid x)} \log r_\theta(x \mid z) \,-\, \mathrm{KL}\Big( q(z \mid x) \,\Big\|\, \tfrac{1}{N} \textstyle\sum_{n=1}^N q_\phi(z \mid x_n) \Big)~. \tag{9}$$
The separation of the reconstruction and KL terms in (9) summarizes the impact of the exemplars on the learning objective as a mixture prior distribution in the latent space, with each mixture component defined by the latent encoding of one exemplar, i.e.,

$$p(z; X, \phi) \,=\, \frac{1}{N} \sum_{n=1}^{N} q_\phi(z \mid x_n)~. \tag{10}$$
This paper considers the simplest form of the Exemplar VAE, with a factored Gaussian variational family and a Bernoulli observation model. Extensions to other distribution families are straightforward.
The Exemplar VAE employs two encoder networks: $q(z \mid x)$ for inference over latent codes given an observation $x$, and $q_\phi(z \mid x_n)$ for mapping an exemplar $x_n$ to the latent space to obtain an exemplar-based prior. We share almost all of the parameters between the two encoders, inspired by the VampPrior VAE (Tomczak and Welling, 2018) and the derivation of the ELBO with a marginal KL between the aggregated variational posterior and the prior (Hoffman and Johnson, 2016; Makhzani et al., 2015). Accordingly, we define

$$q(z \mid x) \,=\, \mathcal{N}\big(z;\; \mu_\phi(x),\; \Lambda_\phi(x)\big)~, \tag{11}$$
$$q_\phi(z \mid x_n) \,=\, \mathcal{N}\big(z;\; \mu_\phi(x_n),\; \sigma^2 I\big)~, \tag{12}$$

where the two encoders use the same Gaussian mean function $\mu_\phi(\cdot)$, but differ in covariance structure. The inference network uses a standard diagonal covariance matrix $\Lambda_\phi(x)$, whereas the exemplar-based prior uses a shared scalar parameter $\sigma^2$ to define an isotropic Gaussian per exemplar.
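A sketch of this parameter sharing, with an assumed single-hidden-layer trunk (the actual encoders follow the architectures in the supplementary material):

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Shared mean function mu_phi; diagonal covariance for inference,
    a single learned scalar sigma^2 for the exemplar-based prior."""

    def __init__(self, d_in, d_z, d_h=300):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)             # shared Gaussian mean mu_phi(x)
        self.log_var = nn.Linear(d_h, d_z)        # diagonal Lambda_phi(x), inference only
        self.log_sigma2 = nn.Parameter(torch.zeros(1))  # shared isotropic prior variance

    def posterior(self, x):                       # q(z | x): mean and diagonal variance
        h = self.trunk(x)
        return self.mu(h), self.log_var(h).exp()

    def prior_mean(self, x_n):                    # mu_phi(x_n) for q(z | x_n) = N(mu, sigma^2 I)
        return self.mu(self.trunk(x_n))
```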
To complete the definition of the learning objective for Exemplar VAEs, recall that $\pi \sim \Pi$ represents sampling a subset of exemplar indices of size $M$ uniformly at random from $\{1, \ldots, N\} \setminus \{j\}$. Incorporating subsampling regularization into (9), with some further algebraic manipulation, we obtain the following Exemplar VAE objective:

$$O(\theta, \phi, \sigma; X) \,=\, \sum_{j=1}^{N}\, \mathop{\mathbb{E}}_{\pi \sim \Pi}\, \mathop{\mathbb{E}}_{z_j \sim q(z \mid x_j)} \Big[ \log r_\theta(x_j \mid z_j) \,+\, \log \hat{p}(z_j; x_\pi) \,-\, \log q(z_j \mid x_j) \Big]~, \tag{13}$$

with the aggregated exemplar-based prior defined as

$$\log \hat{p}(z_j; x_\pi) \,=\, \log \sum_{m=1}^{M} \frac{1}{M}\, \mathcal{N}\big(z_j;\; \mu_{\pi_m},\; \sigma^2 I\big)~, \tag{14}$$

where $\mu_{\pi_m} = \mu_\phi(x_{\pi_m})$. For learning, we use a single Monte Carlo sample $z_j$ per data point and the reparameterization trick to estimate the gradient of (13) with respect to $(\theta, \phi, \sigma)$. In other words, $z_j$ in (13) is obtained via a stochastic latent encoding of a data point $x_j$, i.e., $z_j = \mu_\phi(x_j) + \epsilon \odot \Lambda_\phi(x_j)^{1/2}$ with $\epsilon \sim \mathcal{N}(0, I)$, for a diagonal $\Lambda_\phi(x_j)$.
The simplicity of the Gaussian mixture prior in (14) enables efficient computation of all pairwise distances between a minibatch of latent codes and the Gaussian means using a single matrix product. Moreover, it makes it possible to rely on existing approximate nearest neighbor search methods in Euclidean space to speed up training (e.g., Muja and Lowe (2014)).
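A sketch of the mixture-prior log density in (14), computing all pairwise squared distances with one matrix product; shapes and names are our own:

```python
import math
import torch

def log_exemplar_prior(z, prior_mu, log_sigma2):
    """log p-hat(z) under the mixture in (14), for a batch of codes.

    z: (B, d_z) latent codes; prior_mu: (M, d_z) exemplar means mu_phi(x_pi_m);
    log_sigma2: scalar tensor, log of the shared prior variance.
    """
    B, d_z = z.shape
    M = prior_mu.shape[0]
    sigma2 = log_sigma2.exp()
    sq_dist = (z.pow(2).sum(1, keepdim=True)       # ||z||^2
               - 2.0 * z @ prior_mu.t()            # the single matrix product
               + prior_mu.pow(2).sum(1))           # ||mu_m||^2
    log_comps = -sq_dist / (2.0 * sigma2)          # (B, M) unnormalized log components
    log_norm = 0.5 * d_z * torch.log(2.0 * math.pi * sigma2)
    return torch.logsumexp(log_comps, dim=1) - math.log(M) - log_norm
```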
Recall the definition of Parzen window estimates using Gaussian kernels in (2) and note the similarity between (2) and (14). The Exemplar VAE’s Gaussian mixture prior can be thought of as a Parzen window estimate in the latent space. Therefore, Exemplar VAEs can be interpreted as deep Parzen window estimators, which learn a latent space well suited to Parzen window density estimation.
The computational cost during training can become a burden as the number of exemplars $N$ increases. As explained next, this can be mitigated with the use of fast, approximate nearest neighbor search in the latent space to find subsets of relevant exemplars, providing a lower bound on the ELBO.
For each training point $x_j$, we sample $\pi$ and then compute (14). Alternatively, rather than using all $M$ exemplars, one can evaluate each $z_j$ with respect to only its $k$ nearest exemplar means in the latent space, where $k \ll M$. Backpropagation is then faster, and because probability density is non-negative and $\log$ is monotonically increasing, it follows that

$$\log \sum_{m=1}^{M} \frac{1}{M}\, \mathcal{N}\big(z_j;\; \mu_{\pi_m},\; \sigma^2 I\big) \;\ge\; \log \sum_{m \,\in\, \mathrm{kNN}(z_j)} \frac{1}{M}\, \mathcal{N}\big(z_j;\; \mu_{\pi_m},\; \sigma^2 I\big)~. \tag{15}$$

That is, approximating the exemplar prior with $k$ nearest neighbors yields a lower bound on the log prior term, and hence on (13).
We store a cache of latent codes for the training points to facilitate nearest neighbor search. The cache is updated whenever a new latent code of a training point is available, i.e., we update the cache for any points in the training minibatch and in the kNN prior batch as they pass through the encoder.
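A sketch of the kNN-truncated prior in (15) against such a cache; torch.cdist and topk stand in for a proper approximate nearest neighbor index:

```python
import math
import torch

def log_knn_prior(z, cache, log_sigma2, k):
    """kNN lower bound (15) on the log exemplar prior.

    z: (B, d_z) codes; cache: (M, d_z) cached exemplar means, refreshed
    whenever points pass through the encoder; k: number of neighbors, k << M.
    """
    sigma2 = log_sigma2.exp()
    sq_dist = torch.cdist(z, cache).pow(2)                # (B, M) squared distances
    knn_sq, _ = sq_dist.topk(k, dim=1, largest=False)     # keep only k nearest means
    log_comps = -knn_sq / (2.0 * sigma2)
    log_norm = 0.5 * z.shape[1] * torch.log(2.0 * math.pi * sigma2)
    # normalizing by the full number of cached components keeps this a valid
    # lower bound on the complete mixture's log density
    return torch.logsumexp(log_comps, dim=1) - math.log(cache.shape[0]) - log_norm
```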
Algorithm 1 summarizes the efficient learning procedure.

Variational Autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014) are a versatile class of latent variable generative models used for nonlinear dimensionality reduction (Gregor et al., 2016), generating discrete data (Bowman et al., 2015), and learning disentangled representations (Higgins et al., 2016; Chen et al., 2018), while providing a tractable lower bound on the log marginal likelihood. Improved variants of the VAE have been proposed through modifications to the VAE objective function (Burda et al., 2015), more flexible variational families (Kingma et al., 2016; Rezende and Mohamed, 2015), and more powerful decoding models (Chen et al., 2016; Gulrajani et al., 2016).
In particular, recent work shows that more powerful latent priors (Tomczak and Welling, 2018; Bauer and Mnih, 2018; Dai and Wipf, 2019; Lawson et al., 2019) can significantly improve the effectiveness of VAEs for density estimation, as was hinted by Hoffman and Johnson (2016). This line of work is motivated in part by the empirical observation of the gap between the prior and the aggregated posterior (e.g., Makhzani et al. (2015)). Using more powerful latent priors may help avoid posterior collapse, which is a barrier to the use of VAEs with autoregressive decoders (Bowman et al., 2015). Unlike most existing work, the Exemplar VAE makes limited assumptions about the structure of the latent space and uses a nonparametric exemplar-based prior.
VAEs with a VampPrior (Tomczak and Welling, 2018) optimize a set of pseudo-inputs together with the encoder network to obtain a mixture model approximation to the aggregated posterior. Tomczak and Welling argue that computing the exact aggregated posterior is expensive and suffers from overfitting; to avoid overfitting they keep the number of pseudo-inputs much smaller than the number of training points. The VampPrior and the Exemplar VAE are similar in their reuse of the encoder network and a mixture distribution over the latent space. The Exemplar VAE does not require an increase in the number of model parameters, and it avoids overfitting through simple but effective regularization techniques (Sec. 2). Computational costs are reduced through approximate kNN search during training. The Exemplar VAE also extends naturally to large high dimensional datasets, and to discrete data, which are challenging for pseudo-input optimization.
Attention and external memory are effective elements in deep learning architectures.
Li et al. (2016) used memory-augmented networks with attention to enhance generative models. Bornschein et al. (2017) used hard attention with memory in a VAE, with generation conditioned on a sample from the memory, using both learnable and exemplar memory. One can view the Exemplar VAE as a VAE augmented with memory. The Exemplar VAE and Bornschein et al. (2017) are similar in conditioning generation on external samples. However, Bornschein et al. (2017) optimized a discrete variable to select a suitable memory index, which can be challenging. In contrast, we use a uniform distribution over exemplars, and computation costs are reduced through approximate kNN search in the latent space. Further, their formulation of the variational posterior is a barrier to sharing the encoder between the approximate posterior and the memory addressing component, unlike the VampPrior and the Exemplar VAE.
While introduced in the context of VAEs, the use of exemplar-based generative models is not limited to VAEs. Indeed, one can extend the use of exemplar-based priors to other powerful generative frameworks, such as Normalizing Flows (Dinh et al., 2016, 2014; Kingma and Dhariwal, 2018). For example, Li et al. (2019) propose to exploit the local manifold of data points in the latent space to improve Normalizing Flows. They use a simple distance metric based on PCA to define the neighborhood structure. Here, we show that one can train models in an end-to-end manner and let the network design the distance metric.
In natural language processing, Guu et al. (2018) propose a way to edit samples from a training corpus using a neural editor, given a sentence and an edit vector. For each training sentence, they consider a set of similar prototypes selected based on Jaccard distance. This is similar in spirit to the Exemplar VAE; however, like Li et al. (2019), they pre-specify rather than learn the distance metric. Following Guu et al. (2018) and Li et al. (2019), we cast the global generation problem as one of transforming existing data points, which also bridges the gap between data augmentation and generation.

To assess the effectiveness of Exemplar VAEs we conduct three sets of experiments, on density estimation, representation learning, and unsupervised data augmentation.
Experimental setup. For training generative models, we mirror the experimental setup of the VampPrior (Tomczak and Welling, 2018) as much as possible. We use the same hyperparameters and optimizers (gradient-normalized Adam (Kingma and Ba, 2014; Yu et al., 2017a)), but change the generative model. We use a learning rate of 5e-4, and stop training if the validation ELBO does not improve for 50 consecutive epochs. Dynamic binarization of training data is used, along with linear KL annealing for 100 epochs.
Datasets and architectures. We use MNIST, comprising 50K training and 10K validation images. For Omniglot, 1345 randomly selected points are used for validation; the 23K remaining images are used for training. All results reported for the VampPrior and the VAE with a Gaussian prior are based on the github implementation of the VampPrior generously provided by the authors. Below we consider three architectures. In VAE we use an MLP with 2 hidden layers (300 units each) for both the encoder and decoder. HVAE has two stochastic layers, $z_1$ and $z_2$, with the generative distribution and approximate posterior factored accordingly; like the VAE, the encoders and decoders are MLPs. We fix the prior over $z_1$ to a standard Gaussian and use more expressive priors just for $z_2$. ConvHVAE is more powerful: its generative and posterior distributions match HVAE, but with convolutional layers. Details about the architectures and hyperparameters are given in the supplementary material.
Evaluation. For evaluation of density models, we use the multi-sample bound of Importance Weighted Autoencoders (IWAE) (Burda et al., 2015) with 5000 samples to lower bound the log probability of test data. We use the whole training dataset as the exemplar set, without any regularization or kNN acceleration. This makes evaluation time consuming, but generating an unbiased sample from the Exemplar VAE remains efficient. Our preliminary experiments suggest that using kNN for evaluation is also feasible.
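For concreteness, a sketch of the multi-sample importance-weighted bound with K = 5000, assuming placeholder callables encode (returning the posterior mean and variance), decode_log_prob, and log_prior; the chunking is our own device to bound memory:

```python
import math
import torch

@torch.no_grad()
def iwae_bound(x, encode, decode_log_prob, log_prior, K=5000, chunk=100):
    """Importance-weighted lower bound on log p(x) for one test point.

    encode(x) -> (mu, var) of q(z|x); decode_log_prob(x, z) -> log r(x|z) per
    sample; log_prior(z) -> log p(z) under the full exemplar prior.
    """
    mu, var = encode(x)                                       # each (d_z,)
    log_w = []
    for _ in range(K // chunk):
        z = mu + var.sqrt() * torch.randn(chunk, mu.shape[0])  # (chunk, d_z)
        log_q = torch.distributions.Normal(mu, var.sqrt()).log_prob(z).sum(-1)
        log_w.append(decode_log_prob(x, z) + log_prior(z) - log_q)
    log_w = torch.cat(log_w)                                  # (K,) log weights
    return torch.logsumexp(log_w, dim=0) - math.log(K)
```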
First, we evaluate the effectiveness of the proposed regularization techniques (Sec. 2), i.e., leave-one-out and exemplar subsampling, for enhancing generalization.

Leave-one-out. Here we consider a VAE trained with an optimal prior (the aggregated posterior) and report the gap between the ELBO computed on the training and validation sets. Tomczak and Welling (2018) showed that increasing the number of pseudo-inputs causes overfitting, but here we consider an exact optimal prior. Fig. 2 demonstrates the effectiveness of leave-one-out in helping to avoid overfitting. Table 1 gives the test log-likelihood lower bounds for the Exemplar VAE on both MNIST and Omniglot, with and without LOO.
Table 1: Test log-likelihood lower bounds for the Exemplar VAE, with and without LOO.

Dataset  | w/ LOO | w/o LOO
---------|--------|--------
MNIST    |        |
Omniglot |        |
Exemplar subsampling. As explained in Section 2, the Exemplar VAE uses a hyperparameter $M$ to define the number of exemplars used for estimating the prior. Here, we report the Exemplar VAE’s performance as a function of $M$ divided by the number of training points $N$, for $M/N \in \{1, 0.5, 0.2, 0.1\}$. All models employ LOO, so $M/N = 1$ refers to $M = N-1$. Table 2 presents the results for both MNIST and Omniglot. In all of the following experiments we use $M/N = 0.5$.
Table 2: Exemplar VAE performance as a function of M/N.

Dataset  | M/N = 1 | 0.5 | 0.2 | 0.1
---------|---------|-----|-----|----
MNIST    |         |     |     |
Omniglot |         |     |     |
Method                     | Dynamic MNIST | Fashion MNIST | Omniglot
---------------------------|---------------|---------------|---------
VAE w/ Gaussian prior      |               |               |
VAE w/ VampPrior           |               |               |
Exemplar VAE               |               |               |
HVAE w/ Gaussian prior     |               |               |
HVAE w/ VampPrior          |               |               |
Exemplar HVAE              |               |               |
ConvHVAE w/ Gaussian prior |               |               |
ConvHVAE w/ LARS           |               |               |
ConvHVAE w/ SNIS           | N/A           |               |
ConvHVAE w/ VampPrior      |               |               |
Exemplar ConvHVAE          |               |               |
Efficient learning. For simple VAE architectures, computing the exact prior probability of latent codes based on (10), even for $M = N-1$, is feasible. We use this opportunity to ablate the use of approximate nearest neighbor search and caching for efficient training. Table 6 shows that the efficient kNN-based Exemplar VAE is competitive with the vanilla Exemplar VAE.
Model            | M/N = 1 | 0.5 | 0.2 | 0.1
-----------------|---------|-----|-----|----
Exemplar VAE     |         |     |     |
kNN Exemplar VAE |         |     |     |
[Figures: qualitative results shown for MNIST, Fashion MNIST, and Omniglot.]
We report density estimation results on MNIST, Omniglot, and Fashion MNIST, using three different architectures: VAE, HVAE, and ConvHVAE (Tomczak and Welling, 2018). For each architecture we consider a Gaussian prior, a VampPrior, and an exemplar-based prior. For training VAE and HVAE we use the exact exemplar prior, but for ConvHVAE we use 10-NN exemplars (see Sec. 3.1).
Test log-likelihood lower bounds were estimated using 5000 samples from the variational posterior (Burda et al., 2015). In the case of the Exemplar VAE, the training set was used as the exemplar set.
The density estimation table above shows that the Exemplar VAE outperforms the other models in all but one case, including the VampPrior, which represents the state of the art among VAEs with a factored variational posterior. Improvements of the Exemplar VAE on Omniglot are greater than on the other datasets, likely due to the significant diversity of that dataset. One can enhance the VampPrior with more pseudo-inputs, but we find this often makes optimization challenging and leads to overfitting. We posit that, by comparison, Exemplar VAEs have the potential to scale more easily to large, diverse datasets.
The figure above shows samples generated from the Exemplar ConvHVAE (the top-left image of each plate is the input exemplar). These samples highlight the power of the Exemplar VAE in maintaining the content of the source image while adding diversity. For MNIST, the most evident changes are in stroke width and brightness, with slight variation in shape. Fashion MNIST and Omniglot samples show more pronounced variations in the style of the source image, possibly because both datasets exhibit greater diversity than MNIST.
[Figure 5: t-SNE of MNIST test latent representations, Exemplar VAE vs. VAE w/ Gaussian prior.]
We next explore the structure of the latent representation of the Exemplar VAE. Fig. 5 shows a t-SNE visualization of the latent representations of MNIST test data for the Exemplar VAE and for a VAE with a Gaussian prior. Test points are colored by their digit label (no labels were used during training). The Exemplar VAE representation appears more meaningful, with tighter clusters. We also use k-nearest neighbor (kNN) classification performance as a proxy for representation quality. As is clear from the table below, the Exemplar VAE consistently outperforms the other approaches. Results on Omniglot are not reported since the low resolution version of this dataset does not include class labels.
Method                | MNIST | Fashion MNIST
----------------------|-------|--------------
VAE w/ Gaussian prior |       |
VAE w/ VampPrior      |       |
Exemplar VAE          |       |
The Exemplar VAE also benefits from some properties of nonparametric models. For instance, after fitting an Exemplar VAE to a particular dataset, we need not use the original training data as exemplars for evaluation. We can instead take samples from any generative model (GANs, energy-based models, VAEs) and use those samples as exemplars for the Exemplar VAE. If a generative model captures the diversity and properties of the training dataset well, then the test set probability under an Exemplar VAE built on its samples should not differ much from that of an Exemplar VAE using the original training data. Furthermore, we can rank different generative models by this evaluation.
We evaluated the three VAE variants trained in the previous section, sampling 50k points from each generative model and using them as exemplars for this evaluation.
                     | Evaluation algorithm
Model                | IWAE | ExVAE | ExVAE (250k)
---------------------|------|-------|-------------
VAE + Gaussian prior |      |       |
VAE + VampPrior      |      |       |
Exemplar VAE         |      |       |
We assess the effectiveness of the Exemplar VAE for generating augmented data to improve supervised learning. Recent generative models have achieved impressive sample quality and diversity, but they have seen limited success in improving discriminative models. Ravuri and Vinyals (2019) used class-conditional generative models to synthesize extra training data, yielding marginal gains in ImageNet accuracy. Alternative techniques for optimizing data augmentation policies (Cubuk et al., 2019; Lim et al., 2019; Hataya et al., 2019) or approaches based on adversarial perturbations (Goodfellow et al., 2014a; Miyato et al., 2018) have been more successful in improving classification accuracy.

In our experiments we use the training data points as exemplars and generate additional samples from the Exemplar VAE. Class labels of the exemplars are transferred to the corresponding new images, and a combination of real and generated data is used for training. More specifically, during each training iteration we take the following steps (a sketch of one such step follows the list):
Draw a minibatch $B$ of labeled examples $(x_i, y_i)$ from the training data.
For each $x_i \in B$, draw $z_i \sim q_\phi(z \mid x_i)$, and then generate $\hat{x}_i$ from $r_\theta(x \mid z_i)$, which inherits the class label $y_i$. This yields a synthetic minibatch $\hat{B}$.
Take a gradient step on the weighted cross-entropy loss,

$$\mathcal{L} \,=\, \sum_{(x_i,\, y_i) \in B} H\big(y_i, f(x_i)\big) \;+\; \lambda \sum_{(\hat{x}_i,\, y_i) \in \hat{B}} H\big(y_i, f(\hat{x}_i)\big)~, \tag{16}$$

where $H(\cdot, \cdot)$ denotes cross entropy, $f$ is the classifier, and $\lambda$ weights the contribution of the synthetic data.
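A minimal PyTorch sketch of one augmented training step, under assumed names (enc_mu, enc_var, dec for the trained Exemplar VAE networks; model for the classifier); this illustrates the procedure above rather than reproducing the authors' exact code:

```python
import torch
import torch.nn.functional as F

def augmented_step(model, opt, x, y, enc_mu, enc_var, dec, lam):
    """One classifier training step with Exemplar VAE data augmentation.

    x, y: a real minibatch; enc_mu, enc_var, dec: trained Exemplar VAE
    networks (illustrative names); lam: weight on the synthetic term in (16).
    """
    with torch.no_grad():                    # the Exemplar VAE stays fixed here
        mu = enc_mu(x)
        z = mu + enc_var(x).sqrt() * torch.randn_like(mu)   # z_i ~ q(z | x_i)
        x_hat = torch.bernoulli(torch.sigmoid(dec(z)))      # synthetic batch, labels reused
    loss = F.cross_entropy(model(x), y) + lam * F.cross_entropy(model(x_hat), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)
```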
We train MLPs with ReLU activations and two hidden layers of 1024 or 8192 units on MNIST and Fashion MNIST, using the proposed generative data augmentation method. We apply label smoothing (Szegedy et al., 2016) in our experiments both with and without data augmentation. Note that the Exemplar VAEs used for data augmentation adopt fully connected layers and do not observe class labels during training. Networks are optimized using stochastic gradient descent with momentum, with the learning rate linearly decayed from its initial value. The metaparameters in (16) were tuned using the validation set.

Figure 6 shows that the Exemplar VAE is far more effective than other VAEs for data augmentation, and that even small amounts of generative data augmentation improve classifier accuracy. Interestingly, a classifier trained solely on synthetic data achieves error rates smaller than classifiers trained on the original data. Given the tuned metaparameters for MNIST and Fashion MNIST, we train 10 networks on the union of the training and validation sets and report average test errors. On permutation invariant MNIST, Exemplar VAE augmentation achieves an impressively low average error rate. The first table below summarizes our results on MNIST in comparison with previous work; Ladder Networks (Sønderby et al., 2016) and Virtual Adversarial Training (Miyato et al., 2018) report competitive error rates using deeper architectures and much more complex training procedures. The second table summarizes our results on permutation invariant Fashion MNIST, showing consistent gains from Exemplar VAE augmentation.

Method | Hidden layers | Test error
----------------------------------|---------------|-----------
Dropout [1]                       |               |
Label smoothing [2]               |               |
Dropconnect [3]                   |               |
Variational Info. Bottleneck [4]  |               |
Dropout + max-norm constraint [1] |               |
Manifold Tangent Classifier [5]   |               |
DBM + dropout finetuning [1]      |               |
Label Smoothing (LS)              |               |
LS + Exemplar VAE Augmentation    |               |
Label Smoothing                   |               |
LS + Exemplar VAE Augmentation    |               |
Method                         | Hidden layers | Test error
-------------------------------|---------------|-----------
Label Smoothing                |               |
LS + Exemplar VAE Augmentation |               |
Label Smoothing                |               |
LS + Exemplar VAE Augmentation |               |
This paper develops a new framework for exemplar-based generative modeling and instantiates it with VAEs, defining an exemplar-based prior distribution on latent variables. We present two simple but efficacious regularization techniques for Exemplar VAEs, and propose an efficient learning algorithm based on approximate nearest neighbor search. The effectiveness of the Exemplar VAE is demonstrated on density estimation, representation learning, and data augmentation for supervised learning.
The development of Exemplar VAEs opens up interesting future research directions. Application to language modeling and other discrete data, especially in the context of unsupervised data augmentation, is worth exploring. Extensions to other generative models such as Normalizing Flows and GANs, and to larger and more complex image datasets, are promising. Exploring the effect of the exemplar-based prior on posterior collapse and on learning disentangled representations would be valuable. Last but not least, the nonparametric properties of the Exemplar VAE may enable evaluation of generative models with intractable log-likelihoods.
We are extremely grateful to Micha Livne, Will Grathwohl, and Kevin Swersky for extensive discussions. We thank Diederik Kingma, Chen Li, and Danijar Hafner for their feedback on an initial draft of this paper. This work was financially supported in part by a grant from NSERC Canada.
Hataya, R., Zdenek, J., Yoshizoe, K., and Nakayama, H. (2019). Faster AutoAugment: learning augmentation strategies using backpropagation. arXiv:1911.06987.
Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. Workshop in Advances in Approximate Bayesian Inference, NIPS.
Miyato, T., Maeda, S., Koyama, M., and Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. PAMI, 41(8), pp. 1979–1993.
Muja, M. and Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. PAMI.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics.

The problem of posterior collapse (Bowman et al., 2015; Lucas et al., 2019), which results in a number of inactive dimensions in the latent space of a VAE, can be reduced via the use of a VampPrior (Tomczak and Welling, 2018) as opposed to a factored Gaussian prior. We also investigate this phenomenon by counting the number of active dimensions based on the metric of Burda et al. (2015). For each dimension $u$ of the latent space, this metric computes the variance, across data points, of the mean of the latent encoding, $A_u = \mathrm{Var}_{x}\big(\mathbb{E}_{q(z \mid x)}[z_u]\big)$, where $x$ is sampled from the dataset. If the computed variance is above a certain threshold, that dimension is considered active; we use the same threshold as Bauer and Mnih (2018). We observe that the Exemplar VAE has the largest number of active dimensions in all cases except one. For ConvHVAE on MNIST and Fashion MNIST, the gap between the Exemplar VAE and the other methods is more considerable. (A sketch of this computation follows the table below.)

Number of active dimensions out of 40
Model                      | Dynamic MNIST | Fashion MNIST | Omniglot
---------------------------|---------------|---------------|---------
VAE w/ Gaussian prior      |               |               |
VAE w/ VampPrior           |               |               |
Exemplar VAE               |               |               |
HVAE w/ Gaussian prior     |               |               |
HVAE w/ VampPrior          |               |               |
Exemplar HVAE              |               |               |
ConvHVAE w/ Gaussian prior |               |               |
ConvHVAE w/ VampPrior      |               |               |
Exemplar ConvHVAE          |               |               |
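A small sketch of the active-units computation described above; enc_mu (the posterior mean network) and the 1e-2 threshold are assumptions, the latter matching the value commonly used in this line of work:

```python
import torch

@torch.no_grad()
def count_active_units(enc_mu, data_loader, threshold=1e-2):
    """Count active latent dimensions via Var_x(E_q[z]) (Burda et al., 2015).

    enc_mu maps a batch of inputs to posterior means; data_loader is assumed
    to yield (x, ...) batches. The threshold value is an assumption.
    """
    means = torch.cat([enc_mu(x) for x, *_ in data_loader])  # (N, d_z) posterior means
    return int((means.var(dim=0) > threshold).sum())         # dims with variance above threshold
```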
The Exemplar VAE generates a new sample by transforming a randomly selected exemplar. The newly generated data point can itself be used as an exemplar, and this procedure can be repeated indefinitely. This kind of generation bears some similarity to MCMC sampling in energy-based models. Figure 7 shows how samples evolve and consistently stay near the manifold of MNIST digits. We can apply the same procedure starting from a noisy input image as the initial exemplar. Figure 8 shows that the model quickly transforms noisy images into samples that resemble real MNIST digits.
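A rough sketch of this iterative generation, with placeholder names enc_mu, dec, and sigma_r as before:

```python
import torch

@torch.no_grad()
def exemplar_chain(x0, enc_mu, dec, sigma_r, steps=10):
    """Repeatedly treat the latest sample as the next exemplar.

    x0: starting exemplar (or a noisy image); enc_mu, dec: Exemplar VAE
    networks (placeholder names); sigma_r: prior std. Returns the chain.
    """
    chain, x = [x0], x0
    for _ in range(steps):
        mu = enc_mu(x)                               # encode the current exemplar
        z = mu + sigma_r * torch.randn_like(mu)      # z ~ N(mu_phi(x), sigma^2 I)
        x = torch.bernoulli(torch.sigmoid(dec(z)))   # decode a new observation
        chain.append(x)
    return chain
```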
Table 10 shows the KL and reconstruction terms of the ELBO, computed from a single sample from the variational posterior and averaged across the test set. These numbers show that not only does the Exemplar VAE improve the KL term, but its reconstruction terms are also comparable to the VampPrior’s.
                           | Dynamic MNIST      | Fashion MNIST      | Omniglot
Model                      | KL | Neg. Reconst. | KL | Neg. Reconst. | KL | Neg. Reconst.
---------------------------|----|---------------|----|---------------|----|--------------
VAE w/ Gaussian prior      |    |               |    |               |    |
VAE w/ VampPrior           |    |               |    |               |    |
Exemplar VAE               |    |               |    |               |    |
HVAE w/ Gaussian prior     |    |               |    |               |    |
HVAE w/ VampPrior          |    |               |    |               |    |
Exemplar HVAE              |    |               |    |               |    |
ConvHVAE w/ Gaussian prior |    |               |    |               |    |
ConvHVAE w/ VampPrior      |    |               |    |               |    |
Exemplar ConvHVAE          |    |               |    |               |    |
All of the neural network architectures are based on the VampPrior of Tomczak and Welling (2018), the implementation of which is available online (https://github.com/jmtomczak/vae_vampprior). We leave tuning the architecture of Exemplar VAEs to future work. To describe the network architectures, we follow the notation of LARS (Bauer and Mnih, 2018). Neural network layers are either convolutional (denoted CNN) or fully connected (denoted MLP), and the numbers of units are written inside brackets separated by a dash (e.g., MLP[300-784] means a fully connected layer with 300 input units and 784 output units). We use curly brackets to show concatenation.
Three different architectures are used in the experiments, described below; $d_z$ refers to the dimensionality of the latent space.
a) VAE:
b) HVAE:
c) ConvHVAE: The generative and variational posterior distributions are identical to HVAE.
As the activation function, the gating mechanism of Dauphin et al. (2017) is used throughout: each layer has two parallel branches, and the sigmoid of one branch is multiplied by the output of the other. In ConvHVAE, the first convolutional layer uses a kernel size of 7, the third layer a kernel size of 5, and the last layer a kernel size of 1; all of the other layers share a common kernel size. We use gradient-normalized Adam (Yu et al., 2017b) with a learning rate of 5e-4 and the same minibatch size for all of the datasets. We dynamically binarize each training image, but we do not binarize the exemplars that serve as the prior. We use early stopping for training VAEs, halting training if the validation ELBO does not improve for 50 consecutive epochs. In all of the experiments, we use 40-dimensional latent spaces for both hierarchical and non-hierarchical architectures. To limit the computational cost of ConvHVAE, we use kNN based on Euclidean distance in the latent space, with k set to 10. The number of exemplars M is set to half the size of the training data, except in the ablation study.