1 Introduction
The variational autoencoder (VAE) framework (Kingma & Welling, 2013) and the generative adversarial network (GAN) framework (Goodfellow et al., 2014) have been the two dominant options for training deep generative models. Despite recent excitement about GANs, the VAE remains a popular option, featuring ease of training and wide applicability, with an encoder-decoder pair being the only problem-specific requirement. The VAE encoder encodes training examples into posterior distributions in an abstract latent space, latent vectors are sampled from these posteriors, and the VAE decoder is trained to reconstruct the training examples from their respective latent vectors. In addition to minimizing mistakes in reconstruction (‘reconstruction loss’), the VAE features the competing objective of minimizing the difference between the posterior distributions and an assumed prior (‘latent loss’, measured by KL-divergence). The total VAE loss the model tries to minimize is the sum of these two losses, and the competing objective of minimizing the latent loss creates an information bottleneck between the encoder and the decoder. Ideally, the learned compression allows a random vector in the latent space to be decoded into a realistic sample at generation time.
Compared to GANs, the VAE is directly trained to encode all training examples and is therefore less prone to the failure mode of generating only a few memorized training examples (‘mode collapse’). On the other hand, it tends to have lower precision, which manifests as blurry images for visual problems (Sajjadi et al., 2018). It has been theorized that such blurred reconstructions correspond to multiple distinct training examples and are due to their overlapping posterior distributions in the latent space. Conversely, holes in the latent space that do not correspond to any posterior distributions of training examples may result in generated samples unconstrained by training data (Rezende & Viola, 2018). One may note that these two issues are two sides of the same coin: a strong information bottleneck leads to too much noise in the sampled latent vectors and overlapping posterior distributions, whereas a weak information bottleneck leads to too little noise and leaves holes in the latent space. Unsurprisingly, the simplest approach to improving the VAE has been fine-tuning the strength of the information bottleneck through the KL-divergence weight $\beta$ as a hyperparameter. In addition to hyperparameter sweeps of the KL-divergence weight $\beta$ (Higgins et al., 2017), manual annealing in both directions, KL-divergence warm-up (Bowman et al., 2015; Sønderby et al., 2016; Akuzawa et al., 2018) and controlled capacity increase (Burgess et al., 2018), has been employed to achieve good latent space structure and accurate reconstruction simultaneously. Such a training scheme relies on the model’s memory of training steps with a different KL-divergence weight, even though there is no a priori reason to prefer any particular one in this case. This generalized VAE with manual annealing in either direction serves as both baseline and inspiration for our work.

Our motivation for studying VAEs is to generate fake yet realistic test data. Such data has a wide range of applications, including testing systems involving input validation, performance testing, and UI design testing. We are particularly interested in generating samples that respect the correlations among multiple fields/columns of the training data, and we would like our generative model to discover and learn such correlations in an unsupervised fashion. Generating such fake yet realistic data is beyond the capability of a simple fuzzer, and to our knowledge such correlation is rarely measured independently from the reconstruction loss in the existing literature. In the following sections, we will first provide further background on the generalized VAE. We will then describe the benchmark data set, followed by the tree recursive model generated by our framework and the baseline generation quality achievable with a generalized VAE. With the stage set, we keep the encoder-decoder pair constant and proceed to diagnose why the model fails to capture the full extent of the field correlations. First, we measure what the total VAE loss would be in our models for generated samples as if they were training or testing data, and we term it generated loss. We find that generated loss lags behind or even increases during the training process, in comparison to the training/testing loss.
We believe that elevated generated loss indicates that information about the training examples is not diffused properly in the latent space, due either to overlapping posterior distributions or to holes in the latent space. Motivated by this discovery, we seek improved variational methods that are more adaptive to the local distribution of the mean latent vectors of training examples and capable of diffusing information throughout the latent space. Finally, we demonstrate that augmenting training data with generated variants under small $\beta$ (augmented training) and training a VAE with multiple values of $\beta$ simultaneously (multiscale VAE) are two such variational methods and are closely related.
Our main contributions are as follows: 1) We propose generated loss, the total VAE loss of generated samples, as a diagnostic metric of the generation quality of VAEs. 2) We propose augmented training, augmenting training data with generated variants, as a variational method for training VAEs to achieve superior generation quality. 3) Alternatively, we propose the multiscale VAE, a VAE trained with multiple $\beta$ values simultaneously, which is more tunable and captures aggregated characteristics like correlations more accurately, but tends to encode less detail.
2 Background
2.1 Generalized VAE
Neural network-based autoencoders have long been used for unsupervised learning (Ballard, 1987), and variations like the denoising autoencoder have been proposed to learn a more robust representation (Vincent et al., 2010). The use of the autoencoder as a generative model, however, only took off after the invention of the VAE (Kingma & Welling, 2013), which is trained to maximize the evidence lower bound (ELBO) of the log-likelihood of training examples:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \quad (1)$$

where $D_{KL}$ is the KL-divergence between two distributions and $z$ is the latent vector, whose prior distribution $p(z)$ is most commonly assumed to be a multivariate unit Gaussian. $p_\theta(x|z)$ is given by the decoder, and $q_\phi(z|x)$ is the posterior distribution of the latent vector given by the stochastic encoder, whose operation can be made differentiable through the reparameterization trick $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, if $q_\phi(z|x)$ is assumed to be a diagonal-covariance Gaussian.
A common modification to the ELBO of the VAE is to add a hyperparameter $\beta$ to the KL-divergence term and use the following objective function:

$$\mathcal{L}_\beta(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \quad (2)$$

where $\beta$ controls the strength of the information bottleneck on the latent vector. For higher values of $\beta$, we accept lossier reconstruction in exchange for a higher effective compression ratio. This hyperparameter has been theoretically justified as a KKT multiplier for maximizing the reconstruction term under the inequality constraint that the KL-divergence must be less than a constant (Higgins et al., 2017). In practice, $\beta$ is usually kept constant (Higgins et al., 2017) or manually annealed to increase over time (Bowman et al., 2015; Sønderby et al., 2016; Akuzawa et al., 2018).
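The objective in Eq (2), together with the reparameterization trick, can be sketched in a few lines of numpy. This is a minimal illustration; the function names are ours and not part of any framework described here:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(recon_nll, mu, log_var, beta=1.0):
    """Negative of the objective in Eq (2): reconstruction NLL + beta * KL."""
    return recon_nll + beta * kl_diag_gaussian(mu, log_var)

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

When the posterior equals the prior ($\mu = 0$, $\log \sigma^2 = 0$), the KL term vanishes and only the reconstruction term remains; larger $\beta$ scales the bottleneck penalty linearly.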
In both cases, the generator samples from the probability distribution given by the decoder, where $z$ is a random latent vector drawn from the prior at generation time:

$$x \sim p_\theta(x|z), \quad z \sim p(z) \quad (3)$$
2.2 Benchmark data set and metric
Addresses are a frequently encountered data type here at Google. An address is a simple data type, but it features intuitive yet nontrivial correlations among fields. Such correlation is perhaps easy to capture for specifically designed classifiers and regressors, but it is far harder to train generative models to generate samples that respect such correlation in an unsupervised fashion. Therefore, an address data set can serve as a context-relevant benchmark data set for our framework for training structured data VAEs. Specifically, the OpenAddresses Vermont state data set is chosen for its moderate size (see Appendix A for more details).

We focus on the correlation between zip (postal) code and coordinates (latitude, longitude) as an example of field correlations. We estimate the distribution of coordinates of addresses in a given zip code from the training examples, and use the p-value as the metric for generated samples. Recall that the p-value is defined as the probability that the given sample is more likely than another sample from the same distribution, given the null hypothesis. In our case, we would like the null hypothesis to be true, i.e., training examples and generated samples follow the same distribution. For a perfect model, the p-values of the generated samples follow the uniform distribution between 0 and 1.
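Under the 2-dimensional Gaussian simplification adopted below, this p-value has a closed form, since the chi-squared CDF with $k = 2$ degrees of freedom is $1 - e^{-x/2}$. A minimal sketch (the function name is ours):

```python
import math
import numpy as np

def p_value_2d(x, mean, cov):
    """p-value of a 2-D coordinate under a fitted Gaussian.

    For k = 2 the chi-squared CDF is 1 - exp(-d2/2), so the p-value
    1 - CDF reduces to exp(-d2/2), with d2 the squared Mahalanobis distance.
    """
    d = np.asarray(x, float) - np.asarray(mean, float)
    d2 = float(d @ np.linalg.inv(np.asarray(cov, float)) @ d)  # Mahalanobis^2
    return math.exp(-d2 / 2.0)
```

A point at the fitted mean gets p-value 1, and the p-value decays toward 0 as the point moves away from the distribution, as expected.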
In practice, we make the simplifying assumption that the coordinates follow a 2-dimensional Gaussian distribution for the addresses of a given zip code. We consider the zip code a categorical variable and calculate the mean $\mu$ and the sample covariance matrix $\Sigma$ of the coordinates in the zip code. We can then apply the multivariate version of the two-tailed t-test to determine whether generated coordinates $x$ in the zip code follow the same distribution:

$$p = 1 - F_{\chi^2_k}(D_M^2), \quad D_M^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$

where $D_M^2$ is the squared Mahalanobis distance and $F_{\chi^2_k}$ is the cumulative distribution function (CDF) of the chi-squared distribution with $k = 2$ degrees of freedom.

2.3 Tree recursive model
Our address model consists of encoder-decoder modules. The latent space is 128-dimensional, so all encoder-decoder modules produce and consume 128-dim vectors. The string fields (street number, street name, unit, city, district, region, and zip code) are modeled by a shared seq-to-seq char-rnn StringLiteral module, whereas the two float fields (latitude and longitude) are collected and modeled jointly by the ScalarTuple module. The full address is then modeled by the Tuple module, whose encoder RNN consumes the embedding vectors generated by the encoders of these child modules and whose decoder RNN generates embedding vectors to be decoded by the decoders of these child modules. The reconstruction loss term for each field is given equal weight as follows:

- String field loss: calculated as the cross-entropy loss per character in nats, given by the StringLiteral decoder. Each string field is given equal weight 1.0 regardless of the length of the string, so characters in a shorter string are given more weight than those in a longer string. The StringLiteral decoder implements scheduled sampling (Bengio et al., 2015) and can be trained with character input drawn from its own softmax distribution (always sampling, AS), ground-truth characters of the training example (teacher forcing, TF), or arbitrary scheduled sampling (SS) in between.

- Float field loss: The ScalarTuple module models latitude and longitude jointly and performs PCA whitening as a preprocessing step on the fly with a moving mean vector and covariance matrix. The decoder network then tries to predict the two resulting zero-mean, unit-variance values, with mean squared errors as the loss function.

- Skew loss: The Tuple decoder adds a special loss term dubbed skew loss, the mean squared error between the embedding generated by itself and the embedding given by the respective child encoder. It is given equal weight as the child module’s reconstruction loss, was experimentally found to help stabilize the training process, and makes sure that the child encoder and decoder use the same representation. The Tuple decoder performs autoregression on its own output and implements scheduled sampling, where the embedding given by the child encoder is considered the ground truth and using the embedding generated by the Tuple decoder itself is considered ‘sampling’.
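The three feeding regimes of the StringLiteral decoder (AS, TF, and SS in between) can be illustrated with a toy decoding loop. Here `step_fn` stands in for the RNN cell, and the start token is our own convention:

```python
import random

def decode_with_scheduled_sampling(step_fn, ground_truth, p_truth, rng=random):
    """One decoding pass under scheduled sampling (Bengio et al., 2015).

    step_fn(prev_char) -> predicted next char (stands in for the RNN cell).
    p_truth = 1.0 is teacher forcing (TF); p_truth = 0.0 is always
    sampling (AS); values in between give scheduled sampling (SS).
    """
    out, prev = [], "^"  # "^" as a start-of-string token (our convention)
    for gt in ground_truth:
        pred = step_fn(prev)
        out.append(pred)
        # Feed either the ground-truth char or the model's own output next.
        prev = gt if rng.random() < p_truth else pred
    return "".join(out)
```

Under TF the cell always sees the ground-truth prefix; under AS it only ever sees its own predictions, which is what allows the degenerate position-memorizing behavior discussed in Sec 3.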
The latent loss is the standard KL-divergence loss between the 128-dim unit Gaussian prior and the diagonal-covariance Gaussian posterior. Since we use a weighted average for the reconstruction loss, we consider the KL-divergence per latent dimension the latent loss and report its relative weight as $\beta$ in Eq (2).
The encoders of our framework only produce an embedding vector. In order to train a VAE, we interpret the embedding vector as the mean vector $\mu$ and generate the standard deviation vector $\sigma$ from it with a standard deviation network. Our justification is that the embedding vector of a generative model should contain all the relevant information about the example, and this design simplifies the modular architecture. We also tested a more conventional architecture that generates both $\mu$ and $\sigma$ from the last layer on equal footing, but found no qualitative difference in the model’s behaviors. For more implementation details, see Appendix B.

2.4 Training and generation
Throughout the experiments reported in this paper, the model is trained end-to-end using the Adam optimizer (Kingma & Ba, 2014) with our chosen initial learning rate and otherwise TensorFlow default hyperparameters. The learning rate decays continuously by a factor of 0.99 per 1000 steps, and the gradients are clipped by the L2 global norm at 0.01. The experiments have a fixed budget of 2M training steps with batch size 256, running on 32 workers unless indicated otherwise. When KL-divergence warm-up and/or scheduled sampling are used, they have the same warm-up period with a linear schedule.

With our focus on the difference between generated samples and training/testing examples, we do not want their difference to be trivially attributed to the difference in mean or covariance of their latent space distributions. Therefore, we sample from the multivariate Gaussian distribution closest to the distribution of the sampled latent vectors of the training data, instead of making the stronger assumption that they follow the unit Gaussian distribution. That is, we keep track of the moving mean and the moving covariance matrix of the sampled latent vectors during the training process, and sample from the corresponding multivariate Gaussian for generation. To assess the generation quality of trained models, we measure the p-values of the generated coordinates in the generated zip code for 10000 generated samples. As described in Sec 2.2, their ideal distribution is the uniform distribution between 0 and 1, with mean = median = 0.5 and standard deviation $= 1/\sqrt{12} \approx 0.289$. If the generated zip code is not found in the training data, the p-value is considered 0. Other than the p-values, we subjectively inspect the street names of generated samples and the interpolation between training examples on the map, and measure the average Levenshtein distance per character between the original street name and its reconstruction as a proxy for how much detail is encoded. Since we divide the Levenshtein distance by the length of the original street name, this metric falls between 0 and 1 as long as the reconstructed street name is not longer than the original. We measure the average Levenshtein distance per character for each model over 10000 training examples that are randomly selected each time.

3 VAE baseline
Here we report the generation quality of the baseline generalized VAE, measured by the p-values of generated samples and of reconstructed training examples. These experiments use the full 2M steps as the warm-up period, and the ground-truth probability decreases linearly from 1 to 0 for the scheduled sampling experiments. For a rough measure of reproducibility, we rerun the best experiments with the same hyperparameters.
Training scheme        mean           median           stddev         Levenshtein
Tuple SS + String TF   0.246 – 0.261  0.0450 – 0.0606  0.321 – 0.329  0.114 – 0.111
Tuple AS + String TF   0.240 – 0.249  0.0448 – 0.0505  0.317 – 0.322  0.184 – 0.149
Always Sampling        0.179 – 0.203  < 0.01           0.293 – 0.303  0.0234 – 0.0511
Scheduled Sampling     0.215 – 0.247  < 0.01 – 0.0237  0.317 – 0.331  0.0865 – 0.0974
Teacher Forcing        0.0178         0                0.0970         0.0961
Teacher forcing for strings makes generated street names more realistic, even though sampling for strings seems to drive down the Levenshtein distance per character. What is happening is that sampling forces the string model to mindlessly generate the exact nth letter at the nth position, regardless of which letters were generated previously. This results in reconstructions such as “PAINT WORKS RD” → “PAINT POIKS RD” and nonsensical generated street names such as “LINDINS PNON ”. Sampling for Tuple is essential for the model to capture the correlations between fields, and scheduled sampling seems to hold a slight edge over always sampling by starting with an information shortcut directly from the child encoder to the Tuple RNN (see also Fig 2). We find that models trained with fixed $\beta$ often experience a training failure characterized by increasing KL-divergence and elevated bits per character (BPC) during the training process relative to their KL-divergence warm-up counterparts. We hypothesize that the model has difficulty performing autoregression to take advantage of the autocorrelations in the presence of persistent noise due to latent vector sampling.

Interpolation between two training examples by a model trained with the Tuple SS + String TF scheme tends to be a straight line on the map. Even though it bends toward nearby population centers, the interpolation still passes through multiple sparsely populated areas like state forests, where there are few addresses. The model seems to recognize that city and zip code are categorical variables, but it indiscriminately tries to interpolate street name and number, even though these two are details of the training examples and may not make sense to interpolate. In the shown example, the model makes up multiple addresses that start with “147, HARTS RD, GROTON” due to a nearby training example that starts with “147, HARTS RD, TOPSHAM”.
Simple char-rnn VAEs trained on concatenated string expressions of the training examples can actually generate good samples in terms of zip-coordinate correlations, but occasionally generate malformed samples that result in errors when converted back to structured data (Appendix F). A simpler multi-seq-to-multi-seq model without autoregression at the Tuple level fails to capture any of the correlations (Appendix G), and neither do generative model frameworks that regularize global latent vector distributions but not the continuity of the decoder, like the adversarial autoencoder (Appendix H) and the Wasserstein autoencoder (Appendix I).
4 Generated loss
For these Vermont state address models, we never observe overfitting. That is, training loss and testing loss always change in sync, and the model is indeed maximizing the log-likelihood of the data from the true distribution. This is not the case, however, for samples generated by the model. We measured what the VAE loss would be in our models for generated samples as if they were training or testing data, and we term it generated loss. We found that generated loss lags behind under the Tuple sampling + String TF schemes, and actually increases under all of the other training schemes during the training process (Fig 1 and 2).
Except for the Tuple sampling + String TF training schemes, generated loss actually increases during the training process, so the model is not maximizing the log-likelihood of its generated samples by its own estimate. In other words, the VAE fails to establish a bijection between the latent space and the data space for generated samples not from the true distribution. Perhaps we should not find this surprising: in the training process of a VAE, we minimize the reconstruction loss from a latent vector sampled from a distribution centered around the mean latent vector. So if we start from a generated sample that is mapped into the neighborhood of a training example in the latent space by the encoder, encoding using the mean vector followed by decoding will result in a sample more similar to the said training example. Indeed, we observe that the p-values of generated samples tend to increase over repeated encoding and decoding. What is more surprising is that even the training examples themselves are not immune to this. In fact, they seem to converge faster, but to the same distribution of p-values as the generated samples, after just one round of encoding/decoding. Apparently, the gravitational pull of training examples is additive and exerted on themselves. Since the latent vectors during training are sampled from a Gaussian distribution, this gravitational pull should diminish exponentially as the distance in the latent space increases, perhaps not unlike gravity modified with a Yukawa-type potential. For models that encode the street names, the street names of generated samples also tend to converge to real street names in the training data after repeated encoding and decoding. For detailed formalism and plots, see Appendix E.
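The repeated encoding/decoding experiment above can be expressed as a tiny iteration loop. Here `encode_mean` and `decode` are hypothetical stand-ins for the trained model’s components, and the toy usage merely illustrates convergence toward an attractor, not the actual model:

```python
def iterated_reconstruction(encode_mean, decode, x, n_rounds):
    """Repeatedly map a sample through the encoder's mean vector and the
    decoder, as in the encoding/decoding iteration described above.

    encode_mean and decode are hypothetical stand-ins for the model.
    """
    for _ in range(n_rounds):
        x = decode(encode_mean(x))
    return x

# Toy illustration of the 'gravitational pull': an identity encoder with a
# decoder that contracts halfway toward 0 (standing in for a nearby training
# example) pulls any starting sample toward that attractor.
# 8.0 -> 4.0 -> 2.0 -> 1.0 after three rounds.
```

In the paper’s setting, the attractors are training examples, which is why p-values of generated samples drift upward under this iteration.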
5 Augmented training
Our generated loss measurements suggest that information from the training data is not sufficiently propagated in the latent space. We therefore propose the following scheme to facilitate biased diffusion in the latent space by training on generated variants:
1. After gen_start_step steps: initialize the augmented latent vectors with the sampled latent vectors of the current training batch.
2. Augment the next training batch with variants generated from the augmented latent vectors.
3. After a training step, replace each augmented latent vector with either:
   - the sampled latent vector of an example from the current training batch, selected without replacement, with a fixed reset probability, or
   - the sampled latent vector of the variant generated from it, otherwise.
4. Repeat from step 2.
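One step of this loop can be sketched as follows, with `model` as a hypothetical object exposing `sample_latent(x)` and `generate(z)` (the method names are ours, not our framework’s API):

```python
import random

def augmented_training_step(batch, aug_latents, model, reset_prob, rng=random):
    """One step of the augmented-training loop sketched above.

    `model` is a hypothetical object with sample_latent(x) and generate(z).
    Returns the augmented batch to train on and the updated latent pool.
    """
    variants = [model.generate(z) for z in aug_latents]   # generated variants
    train_batch = list(batch) + variants                  # augment the batch
    pool = list(batch)  # examples available for reinitialization
    new_latents = []
    for z, v in zip(aug_latents, variants):
        if pool and rng.random() < reset_prob:
            # Reinitialize from a real training example, without replacement.
            x = pool.pop(rng.randrange(len(pool)))
            new_latents.append(model.sample_latent(x))
        else:
            # Keep diffusing: follow the generated variant's sampled latent.
            new_latents.append(model.sample_latent(v))
    return train_batch, new_latents
```

A reset probability of 1/5 gives the 5-step average lifetime used in the optimized experiments below.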
Intuitively, augmented training extends the standard VAE training scheme. Instead of just taking one ‘hop’ from the mean vector of a training example and minimizing the reconstruction loss from the sampled latent vector, we actually generate a reconstruction from the sampled latent vector, run it through the encoder, take a second ‘hop’ from the mean vector of the reconstruction, minimize the reconstruction loss from the augmented latent vector to the reconstruction, and so on. The augmented latent vector is initialized from the sampled latent vector of a training example, and before we run generation from it the model was just trained to minimize the reconstruction loss from it. Therefore, the reconstruction generated from the sampled latent vector is likely to be similar to the original training example. The similarity will decay over repeated encoding/decoding due to the model’s capacity limit and the noise introduced by latent vector sampling, so we reinitialize each augmented latent vector with a reset probability chosen such that its average lifetime is a fixed number of steps, which turned out to be 5 steps for the optimized experiments below. We only start augmented training after gen_start_step steps to make sure that the model is ready to generate reasonable reconstructions, and the number of augmented latent vectors we use is another hyperparameter. Formally, starting from a training example $x^{(0)}$, we train the model with reconstructions from the sequence

$$z^{(t)} \sim q_\phi\big(z \,\big|\, x^{(t)}\big), \quad x^{(t+1)} = g\big(z^{(t)}\big),$$

where $g$ denotes generation by the decoder, in addition to the training examples $x^{(0)}$. In terms of the objective function, we have

$$\mathcal{L}_{aug} = \sum_i \mathcal{L}_\beta\big(x_i^{(0)}\big) + \sum_j \mathcal{L}_\beta\big(x_j^{(t_j)}\big) \quad (4)$$

where $\mathcal{L}_\beta$ is the objective of Eq (2) and $t_j$ is the number of hops taken by the $j$-th augmented latent vector, assuming that the number of augmented latent vectors is equal to the training batch size, as is the case for our experiments. It is worth pointing out that the $z^{(t)}$ are sampled latent vectors, instead of the mean vectors given by the encoder as in the previous section. Our experiments showed that augmented training does not improve generation quality without such noise injection.
Training scheme        mean           median          stddev         Levenshtein
Tuple SS + String TF   0.401 – 0.401  0.321 – 0.324   0.373 – 0.376  0.239 – 0.195
Scheduled Sampling     0.317 – 0.335  0.137 – 0.180   0.354 – 0.358  0.116 – 0.115
We can see that augmented training improves the model’s generation quality and generated loss (Fig 3, taken from the run marked by bold font in Table 5). Reduced generated loss indicates better embedding of the generated samples, even though it is still not as low as training/testing loss. KL-divergence warm-up followed by cool-down outperforms simple warm-up, despite an identical final $\beta$. We suspect that with simple KL-divergence warm-up the difference between real and fake data becomes more entrenched, so it is harder for augmented latent vectors to escape their potential wells, following the gravity analogy.
As part of our observation of the model’s interpolation, we find that an augmented training model settles more often on a generated street name instead of fully interpolating the street names of training examples. For example, in an interpolation between training examples with street names “HARTS RD” and “SECOND ST”, the most common street name given is actually “S MAIN ST”. The theme of nonlinearity continues as we plot the interpolation on the map: it twists and turns to avoid sparsely populated areas like state forests and goes through population centers like cities and towns instead.
6 Multiscale VAE
The fact that we can improve the model with KL-divergence warm-up and cool-down indicates that it does not take many steps to train the standard deviation network. We also make the observation that objective functions with different $\beta$ are not necessarily in conflict with each other. Training with the highest possible $\beta$ before the posterior collapses into the multivariate unit Gaussian optimizes the global structure of the latent space, training with smaller $\beta$ optimizes the local structure, and training with $\beta = 0$ optimizes autoencoding. Therefore, we propose the following training scheme:
1. Initialize $n$ standard deviation networks $\sigma_1, \ldots, \sigma_n$, where each standard deviation network $\sigma_i$ is associated with a distinct but constant value $\beta_i$. Without loss of generality, we assume $\beta_1 < \beta_2 < \cdots < \beta_n$.
2. Assign the $(\sigma_i, \beta_i)$ pairs evenly to workers, which otherwise share the same encoder/decoder.
3. Train the model with these workers.
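The even worker assignment in step 2 can be sketched as follows (a trivial illustration; the helper name is ours):

```python
def assign_betas(betas, n_workers):
    """Evenly assign the distinct beta values (each paired with its own
    standard deviation network) to workers, as in the multiscale scheme
    above; each worker then optimizes Eq (2) with its assigned beta."""
    assert n_workers % len(betas) == 0, "workers must divide evenly among betas"
    per_beta = n_workers // len(betas)
    return [betas[w // per_beta] for w in range(n_workers)]
```

With 32 workers and, say, four $\beta$ values, each $(\sigma_i, \beta_i)$ pair is trained by eight workers against the shared encoder/decoder.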
In terms of the objective function, we have

$$\mathcal{L}_{multi} = \sum_{i=1}^{n} \mathcal{L}_{\beta_i}(x;\, \theta, \phi, \sigma_i) \quad (5)$$

where each term is the objective of Eq (2) evaluated with its own KL-divergence weight $\beta_i$ and standard deviation network $\sigma_i$.
It is tempting to draw connections and contrasts between this multiscale objective function and the augmented objective function Eq (4). Intuitively, augmented latent vectors should get farther and farther away from the mean vector on average after more and more ‘hops’, and indeed in the limit of a perfect encoder/decoder, small $\sigma$, and locally constant $\sigma$ in the neighborhood of $\mu$,

$$z^{(t)} \approx \mu + \sqrt{t}\,\sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$

as a sum of Gaussian random variables. Perhaps augmented training has effects on the model similar to those of a multiscale VAE with geometrically spaced $\beta$ values, where terms with higher $t$ in Eq (4) serve the role of workers with higher $\beta$. For experiments partially motivated by this observation, see Appendix J. In this section, we set the number and spacing of the $\beta$ values to fixed choices that seem to have better behavior. For the experiments below, no augmented training is used, and they always employ scheduled sampling and KL-divergence cool-down for the first 1M training steps. For experiments combining this setup with augmented training, see Appendix K.
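The sum-of-Gaussians observation above can be checked numerically: after t independent hops of scale sigma, the standard deviation of each latent coordinate grows as sigma times the square root of t. This is a simulation sketch in the idealized identity encoder/decoder limit, not part of our training code:

```python
import numpy as np

def hop_std(sigma, t, n_trials=200_000, seed=0):
    """Std of an augmented latent coordinate after t 'hops' of scale sigma,
    in the idealized limit where encode/decode is the identity map."""
    rng = np.random.default_rng(seed)
    z = np.zeros(n_trials)
    for _ in range(t):
        z += sigma * rng.standard_normal(n_trials)  # one Gaussian hop
    return z.std()
```

For example, four hops of scale 0.1 give a standard deviation close to 0.2, matching the square-root growth.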
Training scheme        mean           median          stddev         Levenshtein
Tuple SS + String TF   0.476 – 0.509  0.494 – 0.580   0.382 – 0.384  0.916 – 0.921
Scheduled Sampling     0.402 – 0.411  0.349 – 0.361   0.367 – 0.371  0.497 – 0.525
We can see that the multiscale VAE alone outperforms even the augmented training models in terms of generation quality. In fact, it is no longer obvious which model is the best. With optimized hyperparameters, one run results in a model that generates samples whose p-values are slightly below those of training examples, and the other results in a model that generates samples whose p-values are slightly above (the latter is arbitrarily chosen for the following figures). The optimal value of the largest $\beta$ in our case is in the range 0.64 – 1.28, consistent with our results with the generalized VAE. Since the multiscale VAE by design already optimizes the log-likelihood of training examples for $\beta$ values in the trained interval, KL-divergence warm-up from 0 unsurprisingly does not help. However, modest KL-divergence cool-down seems to be beneficial and improves the robustness of model performance w.r.t. hyperparameter tuning of the $\beta$ values.
While not designed to do so, the optimized multiscale VAE alone does have lower generated loss (Fig 4) than the VAE baseline. While the multiscale VAE model still reconstructs and interpolates between street numbers, it no longer does so for street names. Instead, it makes up street names by autoregression, in a sense exhibiting partial posterior collapse. This behavior is intuitively sensible: the model sees a wide variety of street names in close proximity to each other and associated with the same city and zip code, and subsequently concludes that street names are details not to be memorized for each individual training example. In aggregate, however, the multiscale VAE model actually generates more samples with street names from the training data than the optimized VAE model (60% vs. 44%). When the interpolation given by a multiscale VAE is plotted on the map, it snakes through multiple population centers and tries to stay within their neighborhoods as long as possible to minimize the coordinate and zip code loss terms. In the optimal case, the interpolation takes on the character of a space-filling curve. Both tendencies are present in the augmented training models, but even stronger in the multiscale VAE.
7 Conclusion
We described the Vermont state address benchmark data set, a field correlation metric used to quantify generation quality, and our discrete VAE based on a generated tree recursive model. We showed that even when trained with KL-divergence warm-up and scheduled sampling, the generalized VAE demonstrates only limited capacity to capture such field correlations, and most of the issue lies with the variational method instead of the encoder-decoder pair. More specifically,

- The VAE loss of generated samples (generated loss) may lag behind or even increase during the training process, and it serves as a useful metric for VAE optimization. The model tends to make mistakes in the direction of typical training examples, even for the training examples themselves.

- Both generation quality and generated loss can be improved by augmenting training data with generated variants (augmented training).

- Training a VAE with multiple $\beta$ values and standard deviation networks simultaneously (multiscale VAE) is a formally related, tunable technique. The resulting model tends to encode less detail but offers superior generation quality in terms of aggregated properties like field correlations.
Admittedly, we have not fully solved the issues observed in our work, and it remains speculative whether we have hit upon certain fundamental limitations of the VAE framework. For early results of applying these ideas to an image VAE, see Chou (2019).
Author contributions
J.C. contributed the idea and implementation of the generated loss measurement, augmented training, the multiscale VAE, and the use of the p-value as a correlation metric. J.C. also implemented the Alala framework in collaboration with DeLesley Hutchins, with a focus on the decoders, the VAE training scheme, and the engine for model generation from Protocol Buffer message definitions. G.H. contributed the idea of using the Vermont state address data set and correlations of generated data as the generation quality metric. G.H. also implemented the trainer binary for the Vermont state address model and the Python module for measuring correlations.
Acknowledgments
DeLesley Hutchins designed and implemented the modular encoder-decoder architecture and the bidirectional RNN Tuple encoder for a different purpose, and both are incorporated into the Alala framework. DeLesley Hutchins also first suggested the use of scheduled sampling and the pass-through baseline model, and provided valuable critiques of an early draft of this paper. We would also like to thank Irina Higgins for extensive internal review and helpful feedback.
References
 Akuzawa et al. (2018) Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. Expressive speech synthesis via modeling expressions with variational autoencoder. CoRR, abs/1804.02135, 2018. URL http://arxiv.org/abs/1804.02135.
 Ballard (1987) Dana H Ballard. Modular learning in neural networks. In AAAI, pp. 279–284, 1987.
 Barron (2017) Jonathan T. Barron. Continuously differentiable exponential linear units. CoRR, abs/1704.07483, 2017. URL http://arxiv.org/abs/1704.07483.
 Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099, 2015. URL http://arxiv.org/abs/1506.03099.
 Bowman et al. (2015) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
 Burgess et al. (2018) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
 Carlini et al. (2018) Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018. URL http://arxiv.org/abs/1802.08232.
 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
 Chou (2019) Jason Chou. Generated loss and augmented training of MNIST VAE. arXiv preprint, 2019.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

 Krizhevsky & Hinton (2010) Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40(7), 2010.
 Looks et al. (2017) Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. CoRR, abs/1702.02181, 2017. URL http://arxiv.org/abs/1702.02181.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders. CoRR, abs/1511.05644, 2015. URL http://arxiv.org/abs/1511.05644.
 Rezende & Viola (2018) Danilo Jimenez Rezende and Fabio Viola. Taming vaes. arXiv preprint arXiv:1810.00597, 2018.
 Sajjadi et al. (2018) Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. arXiv preprint arXiv:1806.00035, 2018.
 Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746, 2016.
 Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
 Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953039.
Appendix A Vermont state address data set
We downloaded a corpus of Vermont state addresses, the zip file us_northeast.zip, from OpenAddresses and decompressed it to use its us/vt/statewide.csv as the raw data. We then defined a simple Protocol Buffer message to represent its rows, for the purpose of the model generation described in Appendix B:
message Address {
  optional float lat = 1;
  optional float long = 2;
  // We don't discount the possibility that some addresses could have
  // non-numerical entries for fields that seem like they should be numerical,
  // like street numbers for example:
  optional string number = 3;
  optional string street = 4;
  optional string unit = 5;
  optional string city = 6;
  optional string district = 7;
  optional string region = 8;
  optional string postcode = 9;
}
We then split the data set into training, testing, and validation sets with an expected ratio of 8:1:1. The training set contains 266450 examples, the testing set contains 33304 examples, and the validation set remains unused. In terms of the total number of characters, the training set contains 7504720 characters across all string fields, or an average of 28.17 per training example. The exact sets used in our work can be downloaded from https://github.com/EIFY/vermont_address. We have also trained models on different slices and found the results to be robust to slice change.
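The 8:1:1 split can be reproduced with a per-row random draw. This is only a sketch under assumed mechanics; the actual split procedure and seed used for the published sets are not specified here:

```python
import random

def split_dataset(rows, seed=0):
    """Split rows into train/test/validation sets with expected ratio 8:1:1.
    The seed and per-row-draw mechanism are illustrative assumptions."""
    rng = random.Random(seed)
    train, test, valid = [], [], []
    for row in rows:
        r = rng.random()
        if r < 0.8:
            train.append(row)
        elif r < 0.9:
            test.append(row)
        else:
            valid.append(row)
    return train, test, valid
```

Because each row is assigned independently, the realized set sizes fluctuate around the expected ratio, which is consistent with the training set being slightly larger than 80% of the corpus.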
Regarding zip-coordinate correlations, we expect the p-values of the coordinates given the zip code to be uniformly distributed between 0 and 1 for the training set itself. As a sanity check, here are the stats of the p-values of the training set:
Mean: 0.521861141342, Median: 0.537469273433, Standard deviation: 0.298400
Given finite training examples, these stats seem reasonable relative to the limit mean = median = 0.5, standard deviation = 1/√12 ≈ 0.2887.
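As an illustration of this limit, drawing the same number of ideal uniform p-values yields stats of the same flavor (a sketch; only the sample size matches the training set):

```python
import numpy as np

# Under the null hypothesis, p-values are Uniform(0, 1):
# mean = median = 0.5 and standard deviation = 1/sqrt(12) ~ 0.2887.
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=266_450)
print(p.mean(), np.median(p), p.std())
```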
Appendix B Tree recursive model implementation details
Here we provide the implementation details of the StringLiteral module (Fig 5), the ScalarTuple module (Fig 6), the Tuple module (Fig 7), and the standard deviation network (Fig 8). If not specified otherwise, the continuously differentiable exponential linear unit (CELU) with α = 3 (Barron, 2017) is the default activation function in our model. It is chosen for its compatibility with the prior distribution of the latent vector, since its image covers 99.865% of the unit Gaussian distribution. Due to its relatively weak nonlinearity, we simply initialize the weights with in-degree scaled unit variance truncated at 2 standard deviations (tf.variance_scaling_initializer(scale=1.0) in TensorFlow) and zero bias.
B.1 The StringLiteral module
The encoder and decoder of the StringLiteral module are character-RNNs based on a 128-dim gated recurrent unit (GRU) (Cho et al., 2014), with a 16-dim trainable character embedding initialized with the uniform distribution between 0 and 1 and shared between the encoder and decoder. The GRU differs from the original in that it uses CELU whose output is capped at 6 in the same way as ReLU6 (Krizhevsky & Hinton, 2010) to prevent blow-up, i.e. given the 16-dim character embedding x_t at step t and the 128-dim RNN state h_{t−1} at step t−1, the RNN state at step t is given by

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
h̃_t = φ(W x_t + U (r_t ⊙ h_{t−1}) + b)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t

where σ is still the sigmoid function but φ(x) = min(CELU(x), 6). W, U, and b are initialized with in-degree scaled unit variance weights and zero bias, but the gate layers (W_z, U_z, b_z and W_r, U_r, b_r) are initialized with zero weights and unit bias to make sure that the GRU cell isn't too forgetful from the beginning. For the encoder, h_0 is the zero vector and a fully-connected layer is applied to the final state of the RNN to generate the embedding. For the decoder, h_0 is initialized by running another fully-connected layer on the embedding vector, and a softmax layer predicts the t-th character from h_{t−1}. We use cross-entropy loss in nats and normalize such that each string field is given total loss weight 1, e.g. the zip code always has 5 digits, so each of them plus the end-of-string token is given loss weight 1/6.
B.2 The ScalarTuple module
The two float fields (lat and long) are collected and modelled jointly by the ScalarTuple module. The module keeps track of the moving mean μ and the moving covariance matrix Σ of the training data, i.e. given the values x_1, …, x_n of the scalar tuple in a minibatch,

μ ← γ μ + (1 − γ) mean(x_1, …, x_n)
Σ ← γ Σ + (1 − γ) cov(x_1, …, x_n)

where γ is the moving average decay, set to be 0.999. The ScalarTuple module then performs PCA-whitening on the raw input x for both training and testing:

Σ + εI = U Λ Uᵀ    (6)
w = Λ^(−1/2) Uᵀ (x − μ)    (7)

where I is the identity matrix, ε is a small regularization coefficient, and Λ^(−1/2) denotes the element-wise inverse square root of the diagonal eigenvalue matrix Λ. The encoder then generates the embedding from w with a sigmoid layer, and the decoder generates ŵ from the embedding with a linear layer and computes the squared error loss. Both float fields are given total loss weight 1, so the sum of the squared errors is used instead of the average. PCA-whitening is similar to, and reduces to, batch normalization (Ioffe & Szegedy, 2015) without scale and shift when the components of x are uncorrelated, but it automatically handles strong correlations, which can cause batch normalization to overestimate the true variances of the data. For generation, the inverse of Eq (7) is used to un-whiten the prediction.
B.3 The Tuple module
Unlike the StringLiteral module and the ScalarTuple module, the Tuple module is not a leaf module, i.e. the input of its encoder is the embeddings generated by the encoders of its child modules, and the output of its decoder is the embeddings used by the decoders of its child modules. The encoder is based on a bidirectional RNN with the same GRU implementation as the StringLiteral module, but with 128-dim state size and 128-dim input size. In addition, since the size of the tuple is fixed and each element of the tuple is different, the GRU cells for different tuple elements are distinct and do not share parameters. For example, with 7 string fields and 2 float fields modelled by a ScalarTuple module, the Tuple module encoder for the address model has 2 × 8 = 16 distinct GRU cells. The bidirectional RNN has a shared 128-dim trainable initial state for both directions, and the two final states of the bidirectional RNN are concatenated and fed to a fully-connected layer to produce the final embedding.
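The modified GRU cell shared by the StringLiteral and Tuple modules can be sketched as follows. This is a NumPy sketch under assumptions: standard GRU gating is assumed, and α = 3 for CELU is inferred from the 99.865% unit-Gaussian coverage stated in Sec B:

```python
import numpy as np

def celu6(x, alpha=3.0):
    # CELU capped at 6 in the same way as ReLU6; alpha = 3.0 is inferred
    # from the stated 99.865% coverage of the unit Gaussian.
    celu = np.maximum(x, 0.0) + np.minimum(0.0, alpha * (np.exp(x / alpha) - 1.0))
    return np.minimum(celu, 6.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(params, x_t, h_prev):
    """One step of the modified GRU: sigmoid gates as usual, but the
    candidate state uses capped CELU in place of tanh.  Shapes follow the
    16-dim input and 128-dim state of the StringLiteral module."""
    Wz, Uz, bz, Wr, Ur, br, W, U, b = params
    z = sigmoid(x_t @ Wz + h_prev @ Uz + bz)        # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)        # reset gate
    h_cand = celu6(x_t @ W + (r * h_prev) @ U + b)  # candidate state
    return z * h_prev + (1.0 - z) * h_cand
```

Note that with the gate layers initialized to zero weights and unit bias, z starts near sigmoid(1) ≈ 0.73, so the cell initially retains most of its previous state.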
The Tuple module decoder is also based on an RNN. The initial state for the decoder RNN is initialized from the embedding with a fully-connected layer. Each element of the tuple again has its own GRU cell, and the same GRU implementation is used with 128-dim state size and 128-dim input size. In addition, each GRU cell includes a fully-connected layer that generates its child module embedding from the current state. This child module embedding is then fed back to the GRU cell to increment to the next state for generation and scheduled sampling (Bengio et al., 2015). For training/testing with teacher forcing, the embedding given by the child encoder is used as the ground-truth input. In order to make sure that the encoder and decoder of the child module use the same representation, we add an extra loss term dubbed the skew loss, which is the mean squared error between the generated embedding and the embedding given by the child encoder. The skew loss is somewhat arbitrarily given the same weight as the reconstruction loss of the respective child module.
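The decoder unroll with teacher forcing and the skew loss can be sketched as follows; `cells` and `out_layers` are hypothetical callables standing in for the per-element GRU cells and their embedding heads:

```python
import numpy as np

def tuple_decoder_unroll(cells, out_layers, h0, child_embeddings,
                         teacher_forcing=True):
    """Sketch of the Tuple decoder.  Each tuple element has its own GRU
    cell and a fully-connected head that emits the child embedding from
    the current state.  With teacher forcing, the ground-truth child
    embedding is fed back into the RNN; otherwise the generated one is
    (as in generation and scheduled sampling).  The skew loss is the MSE
    between each generated embedding and the child encoder's embedding."""
    h, outputs, skew = h0, [], 0.0
    for cell, head, target in zip(cells, out_layers, child_embeddings):
        e = head(h)                                # generated child embedding
        skew += float(((e - target) ** 2).mean())  # skew loss term
        feedback = target if teacher_forcing else e
        h = cell(feedback, h)                      # advance the RNN state
        outputs.append(e)
    return outputs, skew
```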
The model described here is generated from the Protocol Buffer message definition by an internal framework. Codenamed Alala in reference to the Greek goddess and the Hawaiian crow Corvus hawaiiensis, the framework is based on TensorFlow Fold (Looks et al., 2017) and developed for training VAEs on arbitrarily-defined protocol buffers. The order of the elements of the tuple follows that of the Protocol Buffer message definition, except that (lat, long) are collected and modelled jointly by the ScalarTuple module as the last element of the tuple.
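The PCA-whitening performed by the ScalarTuple module (Sec B.2, Eqs (6)–(7)) and its inverse can be sketched as follows; the value of `eps` is an assumption, and the moving statistics are taken as given:

```python
import numpy as np

def pca_whiten(x, mean, cov, eps=1e-5):
    """PCA-whitening per Eqs (6)-(7): eigendecompose the regularized
    covariance, rotate the centered input onto the principal axes, and
    rescale each axis by the inverse square root of its eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    w = (x - mean) @ eigvecs / np.sqrt(eigvals)
    return w, (eigvecs, eigvals)

def pca_unwhiten(w, mean, basis):
    """Inverse of Eq (7), used at generation time to un-whiten predictions."""
    eigvecs, eigvals = basis
    return w * np.sqrt(eigvals) @ eigvecs.T + mean
```

Unlike per-component batch normalization, the rotation onto the principal axes decorrelates the two coordinates, which is the point of using PCA-whitening for strongly correlated (lat, long) pairs.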
B.4 Standard deviation network
Encoders of Alala modules produce an embedding vector. In order to train a VAE with Alala modules, we interpret the embedding vector as the mean vector and generate the standard deviation vector from it with a standard deviation network. For the model described here, the standard deviation network consists of 3 fully-connected layers, topped with a sigmoid layer to produce the standard deviation vector, whose elements are always in the range (0, 1). The sigmoid layer is initialized with zero weights and −5 bias to make sure that the standard deviation vector starts out small in the beginning of the training process.
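A sketch of the standard deviation network under these initializations; the 128-dim layer widths and hidden CELU activations are assumptions matching the rest of the model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def celu6(x, alpha=3.0):
    # Capped CELU as elsewhere in the model; alpha = 3.0 is inferred.
    return np.minimum(np.maximum(x, 0.0)
                      + np.minimum(0.0, alpha * (np.exp(x / alpha) - 1.0)), 6.0)

class StdDevNetwork:
    """3 fully-connected layers topped with a sigmoid layer.  The sigmoid
    layer starts with zero weights and a -5 bias, so every initial
    standard deviation is sigmoid(-5) ~ 0.0067."""
    def __init__(self, dim=128, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = 1.0 / np.sqrt(dim)  # in-degree scaled unit variance
        self.hidden = [(rng.normal(0.0, scale, (dim, dim)), np.zeros(dim))
                       for _ in range(3)]
        self.w_out = np.zeros((dim, dim))   # zero weights
        self.b_out = np.full(dim, -5.0)     # -5 bias -> small initial sigma

    def __call__(self, mean_vec):
        h = mean_vec
        for w, b in self.hidden:
            h = celu6(h @ w + b)
        return sigmoid(h @ self.w_out + self.b_out)
```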
Appendix C Full hyperparameter sweep result
The VAE baseline experiments use the full 2M steps as the warmup period, and the ground-truth probability decreases linearly from 1 to 0 for the scheduled sampling experiments. For a rough measure of reproducibility, we re-run the best experiments with the same hyperparameters.
For the augmented training experiments below, gen_start_step is set for scheduled sampling to roughly when the generated loss stops decreasing, and for Tuple SS + String TF to roughly when the training loss stops rapidly decreasing. We use 256 augmented latent vectors, so training batches now consist of 256 training examples and 256 generated variants. In practice, data generation is slow due to the lack of parallelism, so we actually shut down the initial training process with 32 workers after gen_start_step and relaunch it with 512 workers for augmented training. These augmented training experiments always employ scheduled sampling and KL-divergence warmup simultaneously, with the first 1M training steps as the warmup period. We also find that it is beneficial to have a KL-divergence cooldown period after the warmup period.
mean  median  stddev  
Tuple SS + String TF  
0.128  0.316  0.123  0.356  0.0402  
0.64  0.128  0.401 – 0.401  0.321 – 0.324  0.373 – 0.376  0.239 – 0.195  
Scheduled Sampling  
0.128  0.272  0.0327  0.348  0.0266  
0.279  0.0481  0.349  0.0207  
0.275  0.0348  0.350  0.0242  
0.64  0.128  0.311 – 0.333  0.105 – 0.159  0.360 – 0.362  0.113 – 0.112  
0.317 – 0.335  0.137 – 0.180  0.354 – 0.358  0.116 – 0.115  
0.307  0.106  0.355  0.115  
0.312  0.110  0.359  0.114  
1  0.311  0.110  0.358  0.108 
For the multiscale VAE experiments below, no augmented training is used, and they always employ scheduled sampling with the first 1M training steps as the warmup period, in some cases with KL-divergence warmup or cooldown also employed during that period. For experiments combining this setup and augmented training, see Appendix K.
mean  median  stddev  
Tuple SS + String TF  
1.28  0.64  0.476 – 0.509  0.494 – 0.580  0.382 – 0.384  0.916 – 0.921 
1.28  0.460  0.474  0.373  0.921  
Scheduled Sampling  
5.12  0.64  0.386  0.297  0.368  0.605 
2.56  0.376  0.275  0.367  0.619  
1.28  0.402 – 0.411  0.349 – 0.361  0.367 – 0.371  0.497 – 0.525  
0  0.218  0.00738  0.315  0.0441  
5.12  0.224  0.00101  0.343  0.965  
2.56  0.396  0.306  0.379  0.685  
1.28  0.395  0.326  0.367  0.581  
0.64  0.313  0.152  0.344  0.268 
Appendix D Map interpolations
Map interpolations of the models featured in the main text. For the live version and more interpolation examples, see the HTML files of the repository https://github.com/EIFY/vermont_address.
Appendix E Stats over repeated encoding and decoding
In the following, we plot the p-value distributions of 10000 generated samples and 10000 training examples over repeated encoding and decoding for the models featured in the main text. For generated samples g_0, we examine the p-values of the sequence g_{k+1} = Dec(Enc(g_k)); for training examples x_0, we examine the p-values of the sequence x_{k+1} = Dec(Enc(x_k)), for k from 0 to 9. The results strongly suggest that x_k ∼ g_k for large k, where ∼ indicates that two random variables follow the same distribution.
We also examine the proportion of street names of the generated samples that are present in the training data over repeated encoding and decoding. For models that encode the street names, this proportion also increases over repeated encoding and decoding. This observation has privacy implications: namely, repeated encoding and decoding may be an efficient tool to extract rare or unique sequences of the training data from a VAE. It would be valuable to quantify such risk à la Carlini et al. (2018), and we leave it to future work.
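The repeated encoding and decoding procedure can be sketched as follows; `encode` and `decode` are stand-ins for the trained VAE modules:

```python
def repeated_autoencode(encode, decode, samples, rounds=10):
    """Feed each sample through the encoder and decoder `rounds` times,
    recording every intermediate reconstruction so that per-round
    statistics (p-values, street-name membership in the training data)
    can be computed afterwards."""
    history = [list(samples)]
    for _ in range(rounds):
        samples = [decode(encode(s)) for s in samples]
        history.append(list(samples))
    return history
```

With a contractive encode/decode pair, the iterates drift toward fixed points, which is the dynamic behind samples converging onto training-data-like street names.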
Appendix F Commaseparated text model
The simplest approach to modelling the Vermont state address data is to model it in its original format as comma-separated text. So we trained a simple seq-to-seq model using just the StringLiteral module and the standard deviation network (Table 7). We doubled the size of the embedding vector to 256-dim to compensate for the difference in the number of parameters, omitted the string fields that are always empty so that we only expect 6 comma-separated values (number, street, city, postcode, lat, long), and rounded the floating point numbers for the coordinates to five decimal places. If the generated comma-separated text does not have enough comma-separated values, or the last two values do not represent valid input for Python's float() function, we consider the generated sample to be malformed and its p-value to be zero. We used teacher forcing as the training scheme for these experiments, as scheduled sampling simply does not work. Running multiscale VAE with one set of weights resulted in posterior collapse; the weight is fixed at 0.768 for the other multiscale VAE experiment.
                    malformed  mean   median  stddev
KL warmup           23         0.424  0.411   0.344
Posterior collapse  17         0.436  0.428   0.340
Multiscale VAE      43         0.394  0.356   0.344
We can see that these comma-separated text models are quite good at generating valid samples (fewer than 0.5% of 10000 samples are malformed) and capturing zip-coordinate correlations. While it is possible to train a meaningful latent comma-separated text model (Fig 18), it does not improve the generation quality over the purely autoregressive model resulting from posterior collapse (Fig 19). We speculate that a per-character cross-entropy reconstruction loss simply does not yield a latent space with good structure: from the model's perspective, addresses that start with "147,HARTS RD" are the addresses closest to each other, not addresses that are geographically close.
Appendix G Passthrough model
To establish a baseline against the tree recursive model, we implemented a simple passthrough model by replacing the Tuple module with a SimpleTuple module. The decoder of the SimpleTuple module just passes the embedding vector down to the child decoders, and the encoder just concatenates the embedding vectors generated by the child encoders and applies a fully-connected layer to generate the final embedding. This architecture requires separate string models for each of the four non-empty string fields (number, street, city, and postcode), and we again omitted the string fields that are always empty. We then used the same p-value metric to quantify the generation quality of passthrough models in the table below. Other than the gen_start_step used for augmented training, hyperparameters not specified in the table are the same as in the corresponding tree recursive model experiments.
mean  median  stddev  
Generalized VAE  
Always Sampling  0.128  0.0699  0.176  
Scheduled Sampling  0.0603  0.168  
0  0.384  0.0652  0.175  
Teacher Forcing  0.0835  0.192  
Augmented Training  
Scheduled Sampling  0.0614  0.171  
Teacher Forcing  0.118  0.00110  0.229  
Multiscale VAE only  
Scheduled Sampling  1.28  0.64  0.0612  0.168  
Teacher Forcing  0.0766  0.186 
Teacher forcing for the strings remains the best training scheme, but the lack of the RNN-based Tuple decoder significantly cripples the model's capacity for capturing correlations. Augmented training manages to improve over the low baseline, but multiscale VAE fails to yield improvements. Most likely, the passthrough model lacks the capacity to learn which zip code is associated with given coordinates (or vice versa) and resorts to encoding both zip code and coordinates in the latent vector in parallel. Generated loss/BPC increases during most of the training process for these passthrough models (Fig 20), unless augmented training is employed (Fig 21).
Appendix H AAE (Adversarial Autoencoder) model
We have tested using the Adversarial Autoencoder (AAE) framework (Makhzani et al., 2015) with the tree recursive model for this problem. To keep the number of parameters approximately the same, we repurposed the standard deviation network described in Sec B.4 for the role of discriminator by replacing the final sigmoid layer with a softmax layer to distinguish the latent vector generated by the deterministic encoder from the random vector drawn from the unit Gaussian distribution. We found that despite running the same Tuple SS + String TF training scheme, the AAE model failed to capture the zip-coordinate correlations: the p-value stats turned out to be mean = 0.0576, median = 0, standard deviation = 0.183 after 2M autoencoder training steps and 2M discriminator training steps on the same learning rate schedule as the VAE models, with the cross-entropy adversarial loss given the same weight as the reconstruction loss.
Appendix I WAE (Wasserstein Autoencoder) model
We have also tested using the Wasserstein Autoencoder (WAE) framework (Tolstikhin et al., 2017) with the tree recursive model for this problem. In our implementation, we make the simplifying assumption that the latent vectors follow a multivariate Gaussian distribution and proceed to estimate its mean vector m and covariance matrix Σ. The squared Wasserstein distance from this multivariate Gaussian distribution to the 128-dim unit Gaussian distribution is then given by

W₂² = ‖m‖₂² + ‖Σ^(1/2) − I‖_F²

where ‖·‖₂ is the Euclidean norm, Σ^(1/2) is the matrix square root of Σ, I is the identity matrix, and ‖·‖_F is the matrix Frobenius norm. We also found that a WAE model running the same Tuple SS + String TF training scheme failed to capture the zip-coordinate correlations: the p-value stats turned out to be mean = 0.0240, median = 0, standard deviation = 0.112 after 2M training steps on the same learning rate schedule as the VAE models, with the squared Wasserstein distance latent loss given the weight 0.128 per latent dimension. As pointed out by Tolstikhin et al. (2017), AAE can be considered a special case of WAE, so we find these results to be consistent. We suspect that this problem requires the framework to ensure the continuity of the decoder w.r.t. individual training examples, rendering generative model frameworks that only regularize the latent vector distribution insufficient.
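Under the multivariate Gaussian assumption, this squared 2-Wasserstein distance to the unit Gaussian can be computed from the eigenvalues of the covariance estimate, using the identity Tr((Σ^(1/2) − I)²) = Tr(Σ) + d − 2 Tr(Σ^(1/2)); a sketch:

```python
import numpy as np

def w2_squared_to_unit_gaussian(mean, cov):
    """Squared 2-Wasserstein distance from N(mean, cov) to N(0, I_d):
    ||mean||^2 + ||cov^(1/2) - I||_F^2, computed via the eigenvalues of
    the symmetric PSD covariance estimate."""
    d = mean.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(mean @ mean + eigvals.sum() + d - 2.0 * np.sqrt(eigvals).sum())
```

The distance vanishes exactly when the estimated latent distribution matches the unit Gaussian prior.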
Appendix J Alternative Multiscale VAE formulations
In Sec 6, we initialize the KL-divergence weights to be equally spaced on an interval for simplicity, and we find the optimized multiscale VAE's behavior to be sensible. That is, the model accurately captures the zip-coordinate correlations and considers the street name beyond its capacity to encode, as its variation falls within the reconstruction loss of the coordinates. We also note that adding augmented training partially restores street name autoencoding by making up its own details, at a price in terms of correlation accuracy and street name realism. However, we are still curious whether multiscale VAE can be forced to encode more details such as street names by setting more workers to train with lower KL-divergence weights.
Inspired by the exponential decay term in the augmented objective function Eq (4), the most obvious alternative is to use geometric spacing instead of linear spacing, i.e. worker i trains with KL-divergence weight β_0 r^i for some common ratio r. Again comparing the multiscale objective function Eq (5) with the augmented objective function Eq (4), we noticed that although the KL-divergence weight is usually applied to the KL-divergence directly, the multiscale objective function with geometric spacing would match the augmented objective function Eq (4) more closely if we invert the weights and divide the reconstruction loss by β instead:
L(x) = Reconstruction(x) / β + D_KL(q(z|x) ‖ p(z))    (8)
For a plain VAE this does not matter, especially when a gradient-normalizing optimizer like Adam is used. For a multiscale VAE, however, the difference in formulation changes how the gradients from workers training at different scales are weighted. For the results below, we tested both. The training scheme is fixed to be Tuple SS + String TF, and we measure both the zip-coordinate p-value stats and the average Levenshtein distance per character between the original street name and its reconstruction.
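Geometric spacing of the per-worker KL-divergence weights can be generated as follows; the minimum weight and common ratio here are placeholders, since the values used in the sweep are not recoverable from this copy:

```python
import numpy as np

def geometric_betas(n_workers=32, beta_min=0.01, ratio=1.2):
    """Per-worker KL-divergence weights beta_min * ratio**i, i = 0..n-1.
    beta_min and ratio are illustrative assumptions."""
    return beta_min * ratio ** np.arange(n_workers)
```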
The following are the street name reconstructions for the first 10 training examples by the run marked in bold font:
"HARTS RD" > "HARTS RD" "SECOND ST" > "SPRING ST" "LIME KILN RD" > "LAKE DAVIS RD" "JOHNSON HILL RD" > "BAY HILL RD" "JACKSON CROSS RD" > "MOUNTAINSON AVE" "WAUGH FARM RD" > "RUNNS LN" "PAINT WORKS RD" > "PORTECHAN WINOORD RD" "FURLONG RD" > "FLATC ST" "SABIN ST" > "SOUTHWIN TER" "POISSON DR" > "POPE PKWY"
We can see that multiscale VAE can indeed be tuned to encode more details such as street names. However, there remains a tradeoff between accurate zip-coordinate correlations and details such as street names, and the fine-tuned multiscale VAE is no better than multiscale VAE trained with augmented training (Appendix K) at comparable correlation accuracy. Interestingly, weight inversion does seem to make the model more reliable at street name reconstruction at comparable zip-coordinate correlation accuracy.
A more radical alternative is to assign a different target KL-divergence value to each worker à la Burgess et al. (2018), instead of a different KL-divergence weight that still allows each worker to find its own tradeoff between latent loss and reconstruction loss. We were surprised to find, however, that it does not work to specify the target total KL-divergence C in terms of the L1 loss |D_KL − C|. Instead, we only punish the model for going over the capacity budget with the loss term max(0, D_KL − C), and trust the reconstruction loss alone to use as much of the capacity budget as possible, so we have the following objective function:
L(x) = Reconstruction(x) + γ max(0, D_KL(q(z|x) ‖ p(z)) − C)    (9)
For the results below, the capacity penalty weight γ is set to be 128 per latent dimension, while the reconstruction loss is still the weighted average of the nat-per-character and mean squared error loss terms, to make sure that the capacity budget is respected. The target capacities are specified with a minimum capacity C_min and a capacity increment ΔC, so worker i runs with capacity budget C_min + i ΔC; for example, with C_min = 10 and ΔC = 0.2, the 32 workers run with capacity budgets 10, 10.2, …, 16.2. Target capacities remain fixed throughout the training process for these runs.
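The arithmetic capacity schedule can be sketched in one line; C_min = 10 and ΔC = 0.2 match one row of the table below:

```python
def capacity_budgets(n_workers=32, c_min=10.0, dc=0.2):
    """Worker i trains with target KL capacity c_min + i * dc."""
    return [c_min + i * dc for i in range(n_workers)]
```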
C_min  ΔC  mean  median  stddev  Levenshtein per character
10  0.2  0.434 – 0.449  0.402 – 0.434  0.382 – 0.375  0.913 – 0.914 
0.35  0.403 – 0.414  0.339 – 0.372  0.364 – 0.359  0.734 – 0.759  
0.5  0.276 – 0.301  0.0874 – 0.125  0.332 – 0.341  0.483 – 0.517  
1.0  0.216 – 0.223  0.0265 – 0.0388  0.301 – 0.302  0.118 – 0.125  
15  0.2  0.319 – 0.343  0.153 – 0.223  0.350 – 0.347  0.642 – 0.628 
We observe the same tradeoff between accurate zip-coordinate correlations and details such as street names, and are not able to get better results. Perhaps it is harder to tune multiscale VAE with target capacities, since the model is not allowed to make tradeoffs between reconstruction loss and latent loss on its own.
Appendix K Multiscale VAE + augmented training experiments
We explore whether multiscale VAE can be improved by augmented training. The experiments below are based on the hyperparameters optimized in Sec 6. To compensate for slow data generation, we always shut down the training process at gen_start_step and bring it back up with 256 augmented latent vectors and 512 workers. Due to the observed variability, we always run experiments with the same hyperparameters twice.
gen_start_step  mean  median  stddev  
Tuple SS + String TF  
0.465 – 0.471  0.471 – 0.491  0.377 – 0.383  0.770 – 0.780  
0.444 – 0.457  0.425 – 0.466  0.385 – 0.383  0.719 – 0.662  
Scheduled Sampling  
0.358 – 0.385  0.196 – 0.267  0.376 – 0.381  0.427 – 0.441  
0.338 – 0.378  0.154 – 0.270  0.369 – 0.377  0.423 – 0.407  
0.357 – 0.382  0.217 – 0.268  0.368 – 0.377  0.378 – 0.373 
Multiscale VAE alone is capable of capturing the zip-coordinate correlations, and generated loss decreases during the training process under the Tuple SS + String TF training scheme, so it turns out to be beneficial to delay the start of augmented training until at least 1M steps. Optimized multiscale VAE + augmented training strikes a good compromise between the two, with accurate zip-coordinate correlations, low generated loss (Fig 22, taken from the run of Table 12 marked in bold font), and partially restored street name reconstructions. We noticed that even though optimized multiscale VAE + augmented training usually reconstructs the first few letters of the street name, sometimes the corresponding embedding (mean) vector actually encodes a different but typical street name, especially when the training example features an unusual street name like "LIME KILN RD" or "WAUGH FARM RD". We believe this is another manifestation of the additive gravitational pull of the training examples, in combination with augmented training.
With the speculation that the sampled latent vector loses information about the training example faster when σ is higher, the experiments below use adaptive σ values equally spaced on the linear scale over an interval, so that the σ values of the workers form an arithmetic sequence.
We have also tested multiscale VAE + augmented training (scheduled sampling) with fewer augmented latent vectors and adaptive σ values equally spaced on the log scale over an interval, so that the σ values of the workers form a geometric sequence. These variants do not work better and turned out to be irrelevant in the context of later findings, so we report them here only for completeness. Just like the previous experiments, we always shut down and bring the training process back up at gen_start_step with 512 workers.
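Both adaptive σ schedules can be sketched with one helper; the interval endpoints here are placeholders, since the values used are not recoverable from this copy:

```python
import numpy as np

def sigma_schedule(n_workers, lo, hi, scale="linear"):
    """Per-worker sigma values on [lo, hi]: an arithmetic sequence on the
    linear scale, or a geometric sequence on the log scale."""
    if scale == "linear":
        return np.linspace(lo, hi, n_workers)
    return np.geomspace(lo, hi, n_workers)
```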
Appendix L Other negative results

- It may be counterintuitive to initialize the character embedding with the uniform distribution between 0 and 1 and to use the sigmoid function as the activation function for the ScalarTuple encoder, while CELU is used for most of the model. However, changing them so that their respective embeddings cover more of the unit Gaussian distribution doesn't yield any improvement.
- It's critical to bias the information diffusion in the latent space towards spreading information from the training examples for augmented training. Generation quality is improved only when augmented latent vectors are initialized from the sampled latent vectors of the training examples.
- Replacing the generated loss criterion with a score given by a discriminator for augmented training doesn't work as intended. Without a way to use gradient descent to increasingly confuse the discriminator, what does confuse the discriminator tends to be near-exact copies of training examples. Variants that keep a fraction of the augmented latent vectors that correspond to lower scores don't work either.
- Since we maintain full data parallelism for multiscale VAE training, a training batch on a worker is always used with the same KL-divergence weight and standard deviation network. Using a training batch with multiple weights and standard deviation networks, either deterministically or randomly, doesn't yield further improvement within the same training budget.
- Early attempts to replace the multiple standard deviation networks with one parameterized by the KL-divergence weight, e.g. by appending the weight to the input of one or more of its fully-connected layers, do not work. It may be inherently difficult to learn a shared representation across multiple scales.