Generated Loss, Augmented Training, and Multiscale VAE

04/23/2019
by   Jason Chou, et al.
Google

The variational autoencoder (VAE) framework remains a popular option for training unsupervised generative models, especially for discrete data, where generative adversarial networks (GANs) require workarounds to create gradients for the generator. In our work modeling US postal addresses, we show that our discrete VAE with a tree recursive architecture demonstrates only a limited capability to capture field correlations within structured data, even after overcoming the challenge of posterior collapse with scheduled sampling and tuning of the KL-divergence weight β. Worse, the VAE seems to have difficulty mapping its generated samples to the latent space, as their VAE loss lags behind or even increases during the training process. Motivated by this observation, we show that augmenting training data with generated variants (augmented training) and training a VAE with multiple values of β simultaneously (multiscale VAE) both improve the generation quality of the VAE. Despite their differences in motivation and emphasis, we show that augmented training and multiscale VAE are actually connected and have similar effects on the model.


1 Introduction

The variational autoencoder (VAE) framework (Kingma & Welling, 2013) and the generative adversarial network (GAN) framework (Goodfellow et al., 2014) have been the two dominant options for training deep generative models. Despite recent excitement about GAN, VAE remains a popular option, featuring ease of training and wide applicability, with an encoder-decoder pair being the only problem-specific requirement. The VAE encoder encodes training examples into posterior distributions in an abstract latent space, from which sampled latent vectors are drawn, and the VAE decoder is trained to reconstruct the training examples from their respective latent vectors. In addition to minimizing mistakes in reconstruction ('reconstruction loss'), VAE features the competing objective of minimizing the difference between the posterior distributions and an assumed prior ('latent loss', measured by KL-divergence). The total VAE loss the model tries to minimize is the sum of these two losses, and the competing objective of minimizing the latent loss creates an information bottleneck between the encoder and the decoder. Ideally, the learned compression allows a random vector in the latent space to be decoded into a realistic sample at generation time.

Compared to GAN, VAE is directly trained to encode all training examples and therefore is less prone to the failure mode of generating a few memorized training examples ('mode collapse'). On the other hand, it tends to have lower precision, which manifests as blurry images for visual problems (Sajjadi et al., 2018). It has been theorized that such blurred reconstructions correspond to multiple distinct training examples and are due to their overlapping posterior distributions in the latent space. Conversely, holes in the latent space that do not correspond to any posterior distributions of training examples may result in generated samples unconstrained by training data (Rezende & Viola, 2018). One may note that these two issues are two sides of the same coin: a strong information bottleneck leads to too much noise in the sampled latent vectors and overlapping posterior distributions, whereas a weak information bottleneck leads to too little noise and leaves holes in the latent space. Unsurprisingly, the simplest approach to improving the VAE has been fine-tuning the strength of the information bottleneck by introducing the KL-divergence weight β as a hyperparameter. In addition to the hyperparameter sweep of the KL-divergence weight β (Higgins et al., 2017), manual annealing in both directions, KL-divergence warm-up (Bowman et al., 2015; Sønderby et al., 2016; Akuzawa et al., 2018) and controlled capacity increase (Burgess et al., 2018), has been employed to achieve good latent space structure and accurate reconstruction simultaneously. Such a training scheme relies on the model's memory of training steps with a different KL-divergence weight, even though there is no a priori reason to prefer any particular weight in this case. This generalized β-VAE with manual annealing in either direction serves as both the baseline and the inspiration for our work.

Our motivation for studying VAE is to generate fake yet realistic test data. Such data has a wide range of applications, including testing systems involving input validation, performance testing, and UI design testing. We are particularly interested in generating samples that respect the correlations among multiple fields/columns of the training data, and we would like our generative model to discover and learn such correlations in an unsupervised fashion. Generating such fake yet realistic data is beyond the capability of a simple fuzzer, and to our knowledge such correlation is rarely measured independently from the reconstruction loss in the existing literature. In the following sections, we will first provide further background on generalized β-VAE. We will then describe the benchmark data set, followed by the tree recursive model generated by our framework and the baseline generation quality achievable with a generalized β-VAE. With the stage set, we keep the encoder-decoder pair constant and proceed to diagnose why the model fails to capture the full extent of the field correlations. First, we measure what the total VAE loss would be in our models for generated samples as if they were training or testing data, and we term it generated loss. We find that generated loss lags behind or even increases during the training process, in comparison to the training/testing loss. We believe that elevated generated loss indicates that information about the training examples is not diffused properly in the latent space, either due to overlapping posterior distributions or holes in the latent space. Motivated by this discovery, we seek improved variational methods that are more adaptive to the local distribution of the mean latent vectors of training examples and capable of diffusing information throughout the latent space.
Finally, we demonstrate that augmenting training data with generated variants under small β (augmented training) and training a VAE with multiple values of β simultaneously (multiscale VAE) are such variational methods and are closely related.

Our main contributions are as follows: 1) We propose generated loss, the total VAE loss of generated samples, as a diagnostic metric of the generation quality of VAEs. 2) We propose augmented training, augmenting training data with generated variants, as a variational method for training VAEs to achieve superior generation quality. 3) Alternatively, we propose multiscale VAE, a VAE trained with multiple β values simultaneously, which is more tunable and captures aggregated characteristics like correlations more accurately, but tends to encode fewer details.

2 Background

2.1 Generalized β-VAE

Neural network-based autoencoders have long been used for unsupervised learning (Ballard, 1987), and variations like the denoising autoencoder have been proposed to learn a more robust representation (Vincent et al., 2010). The use of the autoencoder as a generative model, however, only took off after the invention of VAE (Kingma & Welling, 2013), which is trained to maximize the evidence lower bound (ELBO) of the log-likelihood of training examples

    ELBO(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\mathrm{KL}}(q(z|x) \,\|\, p(z))    (1)

where D_KL is the KL-divergence between two distributions and z is the latent vector, whose prior distribution p(z) is most commonly assumed to be a multivariate unit Gaussian. p(x|z) is given by the decoder, and q(z|x) is the posterior distribution of the latent vector given by the stochastic encoder, whose operation can be made differentiable through the reparameterization trick z = μ + σ ⊙ ε, ε ∼ N(0, I), if q(z|x) is assumed to be a diagonal-covariance Gaussian.

A common modification to the ELBO of VAE is to add a hyperparameter β to the KL-divergence term and use the following objective function:

    ELBO_\beta(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \, D_{\mathrm{KL}}(q(z|x) \,\|\, p(z))    (2)

where β controls the strength of the information bottleneck on the latent vector. For higher values of β, we accept lossier reconstruction in exchange for a higher effective compression ratio. This hyperparameter has been theoretically justified as a KKT multiplier for maximizing \mathbb{E}_{q(z|x)}[\log p(x|z)] under the inequality constraint that the KL-divergence must be less than a constant (Higgins et al., 2017). In practice, β is usually kept constant (Higgins et al., 2017) or manually annealed to increase over time (Bowman et al., 2015; Sønderby et al., 2016; Akuzawa et al., 2018).
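As a minimal concrete sketch of the β-weighted objective and the reparameterization trick (in numpy, with function names of our own choosing; we assume a Gaussian decoder with unit variance so that the reconstruction term reduces to a squared error up to a constant):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    """Negative beta-ELBO: reconstruction error plus beta-weighted KL term."""
    recon_loss = 0.5 * np.sum((x - x_recon) ** 2)  # -log p(x|z) up to a constant
    return recon_loss + beta * kl_diag_gaussian(mu, log_var)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps: makes sampling differentiable w.r.t. mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Setting beta = 1 recovers the standard ELBO of Eq (1); beta > 1 tightens the information bottleneck.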

In both cases, the generator samples from the probability distribution given by the decoder p(x|z), where z is a random latent vector at generation time:

    x \sim p(x|z), \quad z \sim p(z)    (3)

2.2 Benchmark data set and metric

Addresses are a frequently encountered data type here at Google. It is a simple data type, but features intuitive yet non-trivial correlations among fields. Such correlation is perhaps easy to capture for specifically designed classifiers and regressors, but it is far harder to train generative models to generate samples that respect such correlation in an unsupervised fashion. Therefore, an address data set can serve as a context-relevant benchmark data set for our framework for training structured data VAEs. Specifically, the OpenAddresses Vermont state data set is chosen for its moderate size (see Appendix A for more details).

We focus on the correlation between zip (postal) code and coordinates (latitude, longitude) as an example of field correlations. We estimate the distribution of coordinates of addresses in a given zip code from the training examples, and use the p-value as the metric for generated samples. Recall that the p-value is defined as the probability that the given sample is more likely than another sample from the same distribution, under the null hypothesis. In our case, we would like the null hypothesis to be true, i.e., training examples and generated samples follow the same distribution. For a perfect model, the p-values of the generated samples follow a uniform distribution between 0 and 1.

In practice, we make the simplifying assumption that the coordinates follow a 2-dimensional Gaussian distribution for addresses of a given zip code. We consider the zip code a categorical variable and calculate the mean μ and the sample covariance matrix Σ of the coordinates in the zip code. We can then apply the multivariate version of the two-tailed t-test to determine whether generated coordinates x in the zip code follow the same distribution:

    p = 1 - F_{\chi^2_2}(d^2), \quad d^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu)

where d² is the squared Mahalanobis distance and F_{χ²₂} is the cumulative distribution function (CDF) of the chi-squared distribution with 2 degrees of freedom.
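Since the chi-squared CDF with 2 degrees of freedom has the closed form F(d²) = 1 − exp(−d²/2), this p-value can be computed without any statistics library. A minimal sketch (function name ours):

```python
import numpy as np

def zip_code_p_value(coord, mean, cov):
    """Two-tailed p-value of a generated (lat, lon) pair against the
    2-D Gaussian fitted to training addresses in the same zip code.
    For 2 degrees of freedom, 1 - CDF_chi2(d2) = exp(-d2 / 2)."""
    diff = np.asarray(coord, dtype=float) - np.asarray(mean, dtype=float)
    d2 = diff @ np.linalg.inv(cov) @ diff  # squared Mahalanobis distance
    return float(np.exp(-d2 / 2.0))
```

A coordinate at the fitted mean gets p = 1; p decays toward 0 as the point moves away from the cluster of training addresses.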

2.3 Tree recursive model

Our address model consists of encoder-decoder modules. The latent space is 128-dim so all encoder-decoder modules produce and consume 128-dim vectors. The string fields (street number, street name, unit, city, district, region, and zip code) are modelled by a shared seq-to-seq char-rnn StringLiteral module, whereas the two float fields (latitude and longitude) are collected and modelled jointly by the ScalarTuple module. The full address data is then modelled by the Tuple module, whose encoder RNN consumes the embedding vectors generated by the encoders of these child modules and whose decoder RNN generates embedding vectors to be decoded by the decoders of these child modules. The reconstruction loss term for each field is given equal weight as follows:

  1. String field loss: calculated as cross-entropy loss per character in nat, given by the StringLiteral decoder. Each string field is given equal weight 1.0 regardless of the length of the string, so characters in a shorter string are given more weight than ones in a longer string. The StringLiteral decoder implements scheduled sampling (Bengio et al., 2015) and can be trained with character input drawn from its own softmax distribution (always sampling, AS), ground-truth characters of the training example (teacher forcing, TF), or arbitrary scheduled sampling (SS) in between.

  2. Float field loss: The ScalarTuple module models latitude and longitude jointly and performs PCA-whitening as a preprocessing step on the fly with a moving mean vector and covariance matrix. The decoder network then tries to predict the 2 resulting zero-mean unit-variance values, with mean squared error as the loss function.

  3. Skew loss: The Tuple decoder adds a special loss term dubbed skew loss, the mean squared error between the embedding generated by itself and the embedding given by the respective child encoder. It is given equal weight as the child module's reconstruction loss, was experimentally found to help stabilize the training process, and makes sure that the child encoder and decoder use the same representation. The Tuple decoder performs autoregression on its own output and implements scheduled sampling, where the embedding given by the child encoder is considered the ground truth and using the embedding generated by the Tuple decoder itself is considered 'sampling'.

The latent loss is the standard KL-divergence loss between the 128-dim unit Gaussian and the diagonal-covariance Gaussian. Since we use a weighted average for the reconstruction loss, we consider the KL-divergence per latent dimension the latent loss and report its relative weight as β in Eq (2).
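One plausible reading of this weighting, as a sketch (the function names and the exact aggregation are our assumptions, not taken verbatim from the paper):

```python
import numpy as np

def skew_loss(decoder_embedding, child_encoder_embedding):
    """MSE between the embedding the Tuple decoder generates and the one the
    child encoder produced; keeps both sides on the same representation."""
    d = np.asarray(decoder_embedding) - np.asarray(child_encoder_embedding)
    return float(np.mean(d ** 2))

def total_vae_loss(field_losses, skew_losses, kl_divergence, beta, latent_dim=128):
    """Equal-weight average of field reconstruction losses and skew losses,
    plus beta times the KL-divergence *per latent dimension* (Sec 2.3)."""
    recon = float(np.mean(list(field_losses) + list(skew_losses)))
    return recon + beta * kl_divergence / latent_dim
```

With this convention, the reported β is directly comparable across models with different numbers of fields.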

The encoders of our framework only produce an embedding vector. In order to train a VAE, we interpret the embedding vector as the mean vector μ and generate the standard deviation vector σ from it with a standard deviation network. Our justification is that the embedding vector of a generative model should contain all the relevant information about the example, and this design simplifies the modular architecture. We also tested a more conventional architecture that generates both μ and σ from the last layer on equal footing, but found no qualitative difference in the model's behavior. For more implementation details, see Appendix B.
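A toy stand-in for this design might look as follows (the network here is an untrained random projection purely for illustration; the real standard deviation network is trained end-to-end with the rest of the model):

```python
import numpy as np

class StdDevNetwork:
    """Maps the encoder's embedding (interpreted as the mean vector mu)
    to a log-variance vector, from which sigma is derived."""
    def __init__(self, dim, rng):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.b = np.zeros(dim)

    def __call__(self, mu):
        return np.tanh(mu @ self.w + self.b)  # bounded log-variance

rng = np.random.default_rng(0)
net = StdDevNetwork(128, rng)
mu = rng.standard_normal(128)      # embedding produced by an encoder
log_var = net(mu)
sigma = np.exp(0.5 * log_var)      # per-dimension standard deviation
```

The mean vector doubles as the deterministic embedding, so downstream modules never need to know whether they are running inside a VAE.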

2.4 Training and generation

Throughout the experiments reported in this paper, the model is trained end-to-end using the Adam optimizer (Kingma & Ba, 2014); apart from the initial learning rate, the TensorFlow defaults are used. The learning rate decays continuously by a factor of 0.99 per 1000 steps, and the gradients are clipped by the L2 global norm at 0.01. The experiments have a fixed budget of 2M training steps with batch size 256, running on 32 workers unless indicated otherwise. When KL-divergence warm-up and/or scheduled sampling are used, they have the same warm-up period with a linear schedule.

With our focus on the difference between generated samples and training/testing examples, we do not want their difference to be trivially attributed to the difference in mean or covariance of their latent space distributions. Therefore, we sample from the multivariate Gaussian distribution closest to the distribution of the sampled latent vectors of the training data instead of making the stronger assumption that they follow the unit Gaussian distribution. That is, we keep track of the moving mean vector and the moving covariance matrix of the sampled latent vectors during the training process, and sample from the corresponding multivariate Gaussian for generation. To assess the generation quality of trained models, we measure the p-values of generated coordinates in the generated zip code for 10000 generated samples. As described in Sec 2.2, their ideal distribution is the uniform distribution between 0 and 1, with mean = median = 0.5 and standard deviation = 1/√12 ≈ 0.289. If the generated zip code is not found in the training data, the p-value is considered 0. Other than the p-values, we subjectively inspect the street names of generated samples and the interpolation between training examples on the map, and measure the average Levenshtein distance per character between the original street name and its reconstruction as a proxy for how much detail is encoded. Since we divide the Levenshtein distance by the length of the original street name, the metric is bounded by 1 as long as the reconstructed street name is not longer than the original. We measure the average Levenshtein distance per character for each model over 10000 training examples that are randomly selected each time.
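The generation-time sampling described above can be sketched as follows (for brevity we use batch moment matching rather than moving averages; function names are ours):

```python
import numpy as np

def fit_latent_gaussian(latent_vectors):
    """Closest multivariate Gaussian (by moment matching) to the sampled
    latent vectors of the training data."""
    z = np.asarray(latent_vectors)
    return z.mean(axis=0), np.cov(z, rowvar=False)

def sample_for_generation(mean, cov, n, rng):
    """Draw generation-time latent vectors from N(mean, cov) instead of
    assuming the unit Gaussian prior."""
    return rng.multivariate_normal(mean, cov, size=n)
```

This way, a gap between generated samples and training examples cannot be blamed on a mere shift or rescaling of the latent distribution.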

3 β-VAE baseline

Here we report the generation quality of the baseline generalized β-VAE, measured by the p-values of generated samples and of reconstructed training examples. These experiments use the full 2M steps as the warm-up period, and the ground-truth probability decreases linearly from 1 to 0 for the scheduled sampling experiments. For a rough measure of reproducibility, we rerun the best experiments with the same hyperparameters.

Training scheme       | p-value mean  | p-value median  | p-value stddev | Levenshtein / char
Tuple SS + String TF  | 0.246 – 0.261 | 0.0450 – 0.0606 | 0.321 – 0.329  | 0.114 – 0.111
Tuple AS + String TF  | 0.240 – 0.249 | 0.0448 – 0.0505 | 0.317 – 0.322  | 0.184 – 0.149
Always Sampling       | 0.179 – 0.203 | < 0.01          | 0.293 – 0.303  | 0.0234 – 0.0511
Scheduled Sampling    | 0.215 – 0.247 | < 0.01 – 0.0237 | 0.317 – 0.331  | 0.0865 – 0.0974
Teacher Forcing       | 0.0178        | 0               | 0.0970         | 0.0961
Table 1: β-VAE baseline performance

Teacher forcing for strings makes generated street names more realistic, even though sampling for strings seems to drive down the average Levenshtein distance per character. What is happening is that sampling forces the string model to mindlessly generate the exact nth letter at the nth position, regardless of which letters were generated previously. This results in reconstructions such as “PAINT WORKS RD” → “PAINT POIKS RD” and nonsensical generated street names such as “LINDINS PNON ”.

Sampling for Tuple is essential for the model to capture the correlations between fields, and scheduled sampling seems to hold a slight edge over always sampling by starting with an information shortcut directly from the child encoder to the Tuple RNN (see also Fig 2). We find that models trained with fixed β often experience training failure, characterized by increasing KL-divergence and elevated bits per character (BPC) during the training process relative to the KL-divergence warm-up counterpart. We hypothesize that the model has difficulty performing autoregression to take advantage of the autocorrelations in the presence of persistent noise due to latent vector sampling.

Interpolation between two training examples by a model trained with the Tuple SS + String TF scheme tends to be a straight line on the map. Even though it bends toward nearby population centers, the interpolation still passes through multiple sparsely populated areas like state forests, where there are few addresses. The model seems to recognize that city and zip code are categorical variables, but it indiscriminately tries to interpolate street name and number, even though these two are details of the training examples and may not make sense to interpolate. In the shown example, the model makes up multiple addresses that start with “147, HARTS RD, GROTON” due to a training example nearby that starts with “147, HARTS RD, TOPSHAM”.

Simple char-rnn VAEs trained on concatenated string representations of the training examples can actually generate good samples in terms of zip–coordinate correlations, but occasionally generate malformed samples that result in errors when converted back to structured data (Appendix F). A simpler multi-seq-to-multi-seq model without autoregression at the Tuple level fails to capture any of the correlations (Appendix G), and neither do generative model frameworks that regularize the global latent vector distribution but not the continuity of the decoder, like the adversarial autoencoder (Appendix H) and the Wasserstein autoencoder (Appendix I).

4 Generated loss

For these Vermont state address models, we never observe overfitting. That is, training loss and testing loss always change in sync and the model is indeed maximizing the log-likelihood of the data from the true distribution. This is not the case, however, for samples generated by the model. We measured what the VAE loss would be in our models for generated samples as if they were training or testing data, and we term it generated loss. We found that generated loss lags behind under the Tuple sampling + String TF schemes, and actually increases under all of the other training schemes during the training process (Fig 1 and 2).

Except for the Tuple sampling + String TF training schemes, generated loss actually increases during the training process, so the model is not maximizing the log-likelihood of its generated samples by its own estimate. In other words, β-VAE fails to establish a bijection between the latent space and the data space for generated samples not from the true distribution. Perhaps we should not find this surprising: in the training process of VAE, we minimize the reconstruction loss from a latent vector sampled from a distribution centered around the mean latent vector. So if we start from a generated sample that is mapped into the neighborhood of a training example in the latent space by the encoder, encoding using the mean vector followed by decoding will result in a sample more similar to the said training example. Indeed, we observe that p-values of generated samples tend to increase over repeated encoding and decoding. What is more surprising is that even the training examples themselves are not immune to this. In fact, they seem to converge faster, but to the same distribution of p-values as the generated samples, after just one round of encoding / decoding. Apparently, the gravitational pull of training examples is additive and exerts even on the training examples themselves. Since the latent vectors during training are sampled from a Gaussian distribution, this gravitational pull should diminish exponentially as the distance in the latent space increases, perhaps not unlike gravity modified with a Yukawa-type potential. For models that encode the street names, the street names of generated samples also tend to converge to real street names in the training data after repeated encoding and decoding. For detailed formalism and plots, see Appendix E.
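The repeated encoding / decoding map can be sketched as follows (with stubs in place of the real encoder and decoder; if the composite map is a contraction toward a training example, the iterates converge to it):

```python
def iterate_encode_decode(sample, encode_mean, decode, rounds):
    """Repeatedly map a sample to its mean latent vector and decode it back.
    Sec 4 observes that samples drift toward training examples under this map."""
    for _ in range(rounds):
        sample = decode(encode_mean(sample))
    return sample
```

For example, with a toy 1-D 'model' whose decode halves the latent value, iterates shrink geometrically toward the fixed point at 0.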

Figure 1: Loss (left) and BPC (right) for training data, testing data, and generated samples during KL-divergence warm-up, Tuple SS + String TF. KL-divergence warm-up drives the increase in training and testing loss. Due to KL-divergence warm-up, steady generated loss actually implies decreasing reconstruction loss for generated samples, as evidenced by the decreasing generated BPC.
Figure 2: Loss (left) and BPC (right) for generated samples during the training process with the same KL-divergence warm-up under different training schemes; the generated loss of the Tuple SS + String TF scheme is the same as that of Fig 1.

5 Augmented training

Our generated loss measurements suggest that information from the training data is not sufficiently propagated through the latent space. We therefore propose the following scheme to facilitate biased diffusion in the latent space by training on generated variants.

After gen_start_step steps:

  1. Initialize the augmented latent vectors with sampled latent vectors of the current training batch.

  2. Augment the next training batch with variants generated from the augmented latent vectors.

  3. After a training step, replace each augmented latent vector with either: the sampled latent vector of an example from the current training batch, selected without replacement, with the re-initialization probability; or the sampled latent vector of the variant generated from it, otherwise.

  4. Repeat from step 2.
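The loop above can be sketched as follows (a skeleton with stand-in `encode`, `decode`, and `train_step` callables; this is our illustration, not the paper's implementation):

```python
import random

def augmented_training(batches, encode, decode, train_step,
                       gen_start_step, reinit_prob, rng=random):
    """Skeleton of the augmented-training loop: `encode` returns a sampled
    latent vector, `decode` generates a variant from a latent vector, and
    `train_step` consumes a (possibly augmented) batch."""
    augmented = None
    for step, batch in enumerate(batches):
        variants = []
        if step >= gen_start_step:
            if augmented is None:                            # step 1: initialize
                augmented = [encode(x) for x in batch]
            variants = [decode(z) for z in augmented]        # step 2: generate
        train_step(list(batch) + variants)                   # train on augmented batch
        if augmented is not None:                            # step 3: replace
            pool = list(batch)
            rng.shuffle(pool)  # crude 'without replacement' selection
            augmented = [encode(pool[i]) if rng.random() < reinit_prob
                         else encode(variants[i])
                         for i in range(len(augmented))]
    return augmented
```

With `reinit_prob = 0.2`, for instance, each augmented latent vector survives 5 steps on average before being re-seeded from a fresh training example.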

Intuitively, augmented training extends the standard VAE training scheme. Instead of just taking one 'hop' from the mean vector of a training example and minimizing the reconstruction loss from the sampled latent vector, we actually generate a reconstruction from the sampled latent vector, run it through the encoder, take a second 'hop' from the mean vector of the reconstruction, minimize the reconstruction loss from the augmented latent vector to the reconstruction, and so on. The augmented latent vector is initialized from the sampled latent vector of a training example, and before we run generation from it the model was just trained to minimize the reconstruction loss from it. Therefore, the reconstruction generated from the sampled latent vector is likely to be similar to the original training example. The similarity decays over repeated encoding / decoding due to the model's capacity limit and the noise introduced by latent vector sampling, so we re-initialize each augmented latent vector with a probability chosen such that its average lifetime is a fixed number of steps, which turned out to be 5 steps for the optimized experiments below. We only start augmented training after gen_start_step steps to make sure that the model is ready to generate reasonable reconstructions, and the number of augmented latent vectors controls the strength of the augmentation. Formally, in addition to the training examples, we train the model with reconstructions generated from the sequence of augmented latent vectors described above.

In terms of objective function, we have

    \mathcal{L}_{\mathrm{aug}} = \sum_i \mathrm{ELBO}_\beta(x_i) + \sum_j \mathrm{ELBO}_\beta(\tilde{x}_j)    (4)

where the \tilde{x}_j are the generated variants, assuming that the number of augmented latent vectors is equal to the training batch size, as is the case for our experiments. It is worth pointing out that the augmented latent vectors are sampled latent vectors instead of the mean vectors given by the encoder as in the previous section. Our experiments showed that augmented training does not improve generation quality without such noise injection.

Training scheme       | p-value mean  | p-value median | p-value stddev | Levenshtein / char
Tuple SS + String TF  | 0.401 – 0.401 | 0.321 – 0.324  | 0.373 – 0.376  | 0.239 – 0.195
Scheduled Sampling    | 0.317 – 0.335 | 0.137 – 0.180  | 0.354 – 0.358  | 0.116 – 0.115
Table 2: Augmented training performance

We can see that augmented training improves the model's generation quality and generated loss (Fig 3, taken from the run marked by the bold font in Table 5). Reduced generated loss indicates better embedding of the generated samples, even though it is still not as low as the training/testing loss. KL-divergence warm-up followed by cool-down outperforms simple warm-up, despite an identical final β. We suspect that with simple KL-divergence warm-up the difference between real and fake data gets more entrenched, so it is harder for augmented latent vectors to escape their potential wells, following the gravity analogy.

As part of our observation of the model's interpolation, we find that an augmented training model settles more often on a generated street name instead of fully interpolating the street names of training examples. For example, in an interpolation between training examples with street names “HARTS RD” and “SECOND ST”, the most common street name given is actually “S MAIN ST”. The theme of non-linearity continues as we plot the interpolation on the map, which twists and turns to avoid sparsely populated areas like state forests and goes through population centers like cities and towns instead.

Figure 3: Loss (left) and BPC (right) for training data, testing data, and generated samples during augmented training. Generated loss/BPC now move in sync with their training/testing counterparts. Unusually, training loss/BPC are higher than their testing counterparts since training batches are augmented with generated variants, which are not from the true distribution exactly and therefore more challenging for the model. Loss is mostly driven by KL-divergence warm-up followed by cool-down. Due to build optimization issues, we had to interrupt the training process twice between 500k and 1M steps.

6 Multiscale VAE

The fact that we can improve the model with KL-divergence warm-up and cool-down indicates that it does not take many steps to train the standard deviation network. We also observe that objective functions with different β values are not necessarily in conflict with each other. Training with the highest β possible before the posterior collapses into the multivariate unit Gaussian optimizes the global structure of the latent space, training with smaller β optimizes the local structure, and training with β = 0 optimizes autoencoding. Therefore, we propose the following training scheme.

  1. Initialize n standard deviation networks σ_1, …, σ_n, where each standard deviation network σ_i is associated with a distinct but constant value β_i. Without loss of generality, we assume β_1 < β_2 < … < β_n.

  2. Assign the (σ_i, β_i) pairs evenly to the workers, which otherwise share the same encoder / decoder.

  3. Train the model with these workers.

In terms of objective function, we have

    \mathcal{L}_{\mathrm{multi}} = \sum_{i=1}^{n} \mathrm{ELBO}_{\beta_i}(x; \sigma_i)    (5)

where σ_i is the standard deviation network associated with β_i.
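A sketch of the worker assignment and the summed per-β objective (the round-robin assignment, the β values shown, and all function names are our assumptions):

```python
def assign_betas_to_workers(betas, n_workers):
    """Spread (sigma-network index, beta) pairs evenly across workers that
    share one encoder / decoder (round-robin assignment)."""
    return [[(i, betas[i]) for i in range(w, len(betas), n_workers)]
            for w in range(n_workers)]

def multiscale_loss(recon_losses, kl_divergences, betas):
    """Sum of per-beta objectives: each beta_i has its own sigma network,
    hence its own KL term, while the reconstruction path is shared."""
    return sum(r + b * kl
               for r, kl, b in zip(recon_losses, kl_divergences, betas))

# Hypothetical geometrically spaced beta values for four sigma networks:
groups = assign_betas_to_workers([0.16, 0.32, 0.64, 1.28], 2)
```

Each worker then trains the shared encoder / decoder through its own (σ_i, β_i) pairs, so no single β dominates the latent space geometry.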

It is tempting to draw connections and contrasts between this multiscale objective function and the augmented objective function Eq (4). Intuitively, augmented latent vectors should get further and further away from the mean vector on average after more and more 'hops', and indeed, in the limit of a perfect encoder / decoder, small β, and locally constant σ in the neighborhood of the mean vector, the augmented latent vector after k hops is a sum of k independent Gaussian random variables and is therefore Gaussian with k times the variance. Perhaps augmented training has similar effects on the model as multiscale VAE with geometrically spaced β values, where terms with more 'hops' in Eq (4) serve the role of workers with higher β. For experiments partially motivated by this observation, see Appendix J. In this section, we fix the number of standard deviation networks and the β values to the settings that seemed to behave better in our experiments.

For the experiments below, no augmented training is used and they always employ scheduled sampling and KL-divergence cool-down for the first 1M training steps. For experiments combining this setup and augmented training, see Appendix K.

Training scheme       | p-value mean  | p-value median | p-value stddev | Levenshtein / char
Tuple SS + String TF  | 0.476 – 0.509 | 0.494 – 0.580  | 0.382 – 0.384  | 0.916 – 0.921
Scheduled Sampling    | 0.402 – 0.411 | 0.349 – 0.361  | 0.367 – 0.371  | 0.497 – 0.525
Table 3: Multiscale VAE only performance

We can see that multiscale VAE alone outperforms even the augmented training models in terms of generation quality. In fact, it is no longer obvious which model is the best. With optimized hyperparameters, one run results in a model that generates samples whose p-values are slightly below those of training examples, and the other results in a model that generates samples whose p-values are slightly above (the latter is arbitrarily chosen for the following figures). The optimal value of the largest β in our case is in the range 0.64 – 1.28, consistent with our results with β-VAE. Since multiscale VAE by design already optimizes the log-likelihood of training examples for β values in the interval [β_1, β_n], KL-divergence warm-up from 0 unsurprisingly doesn't help. However, modest KL-divergence cool-down seems to be beneficial and improves the robustness of model performance w.r.t. hyperparameter tuning of β.

While not designed to do so, optimized multiscale VAE alone does have lower generated loss (Fig 4) than the β-VAE baseline. While the multiscale VAE model still reconstructs and interpolates between street numbers, it no longer does so for street names. Instead, it makes up street names by autoregression, in a sense exhibiting partial posterior collapse. This behavior is intuitively sensible: the model sees a wide variety of street names in close proximity to each other and associated with the same city and zip code, and subsequently concludes that street names are details not to memorize for each individual training example. In aggregate, however, the multiscale VAE model actually generates more samples with street names from the training data than the optimized β-VAE model (60% vs. 44%). When the interpolation given by a multiscale VAE is plotted on the map, it snakes through multiple population centers and tries to stay within their neighborhoods as long as possible to minimize the coordinate and zip code loss terms. In the optimal case, the interpolation takes on the character of a space-filling curve. Both tendencies are present in the augmented training models, but are even stronger for multiscale VAE.

Figure 4: Loss (left) and BPC (right) for training data, testing data, and generated samples during the training process of multiscale VAE only, on the worker with the lowest β value.

7 Conclusion

We described the Vermont state address benchmark data set, a field correlation metric used to quantify generation quality, and our discrete VAE based on a generated tree recursive model. We showed that even when trained with KL-divergence warm-up and scheduled sampling, generalized β-VAE only demonstrates limited capacity to capture such field correlations, and most of the issue lies with the variational method rather than the encoder-decoder pair. More specifically,

  1. VAE loss of generated samples (generated loss) may lag behind or even increase during the training process and serves as a useful metric for VAE optimization. The model tends to make mistakes in the direction of typical training examples, even for the training examples themselves.

  2. Both generation quality and generated loss can be improved by augmenting training data with generated variants (augmented training).

  3. Training VAE with multiple β values and standard deviation networks simultaneously (multiscale VAE) is a formally related, tunable technique. The resulting model tends to encode fewer details but offers superior generation quality in terms of aggregated properties like field correlations.

Admittedly, we have not fully solved the issues observed in our work, and it remains unclear whether we have hit upon fundamental limitations of the VAE framework. For early results of applying these ideas to an image VAE, see Chou (2019).

Author contributions

J.C. contributed the idea and implementation of generated loss measurement, augmented training, multiscale VAE, and the use of p-value as correlation metric. J.C. also implemented the Alala framework in collaboration with DeLesley Hutchins, with a focus on the decoders, VAE training scheme, and the engine for model generation from Protocol Buffer message definition. G.H. contributed the idea of using the Vermont state address data set and correlations of generated data as the generation quality metric. G.H. also implemented the trainer binary for the Vermont state address model and the Python module for measuring correlations.

Acknowledgments

DeLesley Hutchins designed and implemented the modular encoder-decoder architecture and the bidirectional RNN Tuple encoder for a different purpose, and both are incorporated into the Alala framework. DeLesley Hutchins also first suggested the use of scheduled sampling and the pass-through baseline model, and provided valuable critiques of an early draft of this paper. We would also like to thank Irina Higgins for extensive internal review and helpful feedback.

References

Appendix A Vermont state address data set

A corpus of Vermont state addresses, from the zip file us_northeast.zip, was downloaded from OpenAddresses. We decompressed it and used its us/vt/statewide.csv as the raw data. We then defined a simple Protocol Buffer message to represent its rows, for the purpose of model generation described in Appendix B:

message Address {
  optional float lat = 1;
  optional float long = 2;

  // We don’t discount the possibility that some addresses could have
  // non-numerical entries for fields that seem like they should be numerical,
  // like street numbers for example:
  optional string number = 3;
  optional string street = 4;
  optional string unit = 5;
  optional string city = 6;
  optional string district = 7;
  optional string region = 8;
  optional string postcode = 9;
}

We then split the data set into training, testing, and validation sets with an expected 8:1:1 ratio. The training set contains 266450 examples, the testing set contains 33304 examples, and the validation set remains unused. In terms of the total number of characters, the training set contains 7504720 characters among all string fields, or an average of 28.17 per training example. The sets used in our work can be downloaded from https://github.com/EIFY/vermont_address. We have also trained models on different slices and found the results to be robust to slice change.

Regarding zip-coordinate correlations, we expect the p-values of coordinates given zip code to be uniformly distributed between 0 and 1 for the training set itself. As a sanity check, here are the stats of the p-values of the training set:

Mean: 0.521861141342
Median: 0.537469273433
Standard deviation: 0.298400

Given finite training examples, these stats seem reasonable relative to the limit mean = median = 0.5, standard deviation = $1/\sqrt{12} \approx 0.2887$.
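As a quick numerical illustration (a sketch, not the paper's actual pipeline; the sample size merely mirrors the training set size), uniformly distributed p-values reproduce these limit stats:

```python
import numpy as np

# p-values of training examples should be ~Uniform(0, 1) in the limit;
# for Uniform(0, 1): mean = median = 0.5, stddev = 1/sqrt(12) ~= 0.2887.
rng = np.random.default_rng(0)
p_values = rng.uniform(0.0, 1.0, size=266_450)  # one draw per training example

limit_std = 1.0 / np.sqrt(12.0)  # ~0.288675

assert abs(p_values.mean() - 0.5) < 0.01
assert abs(np.median(p_values) - 0.5) < 0.01
assert abs(p_values.std() - limit_std) < 0.01
```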

Appendix B Tree recursive model implementation details

(a)
(b)
Figure 5: StringLiteral Module
(a)
(b)
Figure 6: ScalarTuple Module
(a)
(b)
Figure 7: Tuple Module
Figure 8: Standard Deviation Network

Here we provide the implementation details of the StringLiteral module (Fig 5), ScalarTuple module (Fig 6), Tuple module (Fig 7), and the standard deviation network (Fig 8). Unless specified otherwise, the continuously differentiable exponential linear unit (CELU) (Barron, 2017) is the default activation function in our model. It is chosen for its compatibility with the prior distribution of the latent vector, since its image covers 99.865% of the unit Gaussian distribution. Due to its relatively weak nonlinearity, we simply initialize the weights with in-degree scaled unit variance, truncated at 2 standard deviations (this is tf.variance_scaling_initializer(scale=1.0) in TensorFlow), and zero bias.
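A minimal numpy sketch of the activation and initializer described above (the CELU α and the truncation-by-clipping are assumptions for illustration; TensorFlow's truncated initializer resamples rather than clips):

```python
import numpy as np

def celu(x, alpha=1.0):
    """Continuously differentiable ELU (Barron, 2017).
    Image is (-alpha, inf); alpha=1.0 here is illustrative, not
    necessarily the paper's value."""
    return np.maximum(0.0, x) + np.minimum(
        0.0, alpha * (np.exp(np.minimum(x, 0.0) / alpha) - 1.0))

def variance_scaled_init(fan_in, fan_out, rng):
    """In-degree scaled unit variance, truncated at 2 standard deviations
    (in the spirit of tf.variance_scaling_initializer(scale=1.0))."""
    std = np.sqrt(1.0 / fan_in)
    w = rng.normal(0.0, std, size=(fan_in, fan_out))
    # Sketch of truncation: clip instead of resampling as TF does.
    return np.clip(w, -2.0 * std, 2.0 * std)
```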

b.1 The StringLiteral module

The encoder and decoder of the StringLiteral module are character-RNNs based on a 128-dim gated recurrent unit (GRU) (Cho et al., 2014), with a 16-dim trainable character embedding initialized with uniform distribution between 0 and 1 and shared between the encoder and decoder. The GRU differs from the original in that it uses CELU whose output is capped at 6 in the same way as ReLU6 (Krizhevsky & Hinton, 2010) to prevent blow-up, i.e. given the 16-dim character embedding $x_t$ at step $t$ and the 128-dim RNN state $h_{t-1}$ at step $t-1$, the RNN state $h_t$ at step $t$ is given by

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \min(\mathrm{CELU}(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),\ 6)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\sigma$ is still the sigmoid function but the usual $\tanh$ is replaced by the capped CELU. $W_h$ and $U_h$ are initialized with in-degree scaled unit variance weights and zero bias, but the gate parameters $W_z, U_z, W_r, U_r$ are initialized with zero weights and unit bias to make sure that the GRU cell isn’t too forgetful from the beginning.

For the encoder, $h_0$ is the zero vector and a fully-connected layer is applied to the final state of the RNN to generate the embedding. For the decoder, $h_0$ is initialized by running another fully-connected layer on the embedding vector, and a softmax layer predicts the next character from $h_t$. We use cross-entropy loss in nats and normalize such that each string field is given total loss weight 1, e.g. the zip code always has 5 digits, so each of them plus the end-of-string token is given loss weight 1/6.
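One step of the modified GRU can be sketched in numpy as follows, assuming the standard Cho et al. (2014) gate layout with tanh swapped for the capped CELU (parameter names and α = 1 are illustrative, not the Alala implementation):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def celu6(v, alpha=1.0):
    # CELU whose positive part is capped at 6, in the spirit of ReLU6.
    c = np.maximum(0.0, v) + np.minimum(
        0.0, alpha * (np.exp(np.minimum(v, 0.0) / alpha) - 1.0))
    return np.minimum(c, 6.0)

def gru_step(x_t, h_prev, params):
    """One step of the modified GRU. Shapes: x_t (16,), h_prev (128,).
    params maps "z"/"r"/"h" to (W, U, b) triples."""
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    z = sigmoid(x_t @ Wz + h_prev @ Uz + bz)            # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)            # reset gate
    h_tilde = celu6(x_t @ Wh + (r * h_prev) @ Uh + bh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde
```

With the gate parameters initialized to zero weights and unit bias, both gates start near σ(1) ≈ 0.73, so the cell neither forgets nor overwrites its state too aggressively at the start of training.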

b.2 The ScalarTuple module

The two float fields (lat and long) are collected and modelled jointly by the ScalarTuple module. The module keeps track of the moving mean $\mu$ and the moving covariance matrix $\Sigma$ of the training data, i.e. given the values $x$ of the scalar tuple in a mini-batch,

$$\mu \leftarrow \lambda \mu + (1 - \lambda)\,\mathrm{mean}(x)$$
$$\Sigma \leftarrow \lambda \Sigma + (1 - \lambda)\,\mathrm{cov}(x)$$

where $\lambda$ is the moving average decay, set to be 0.999. The ScalarTuple module then performs PCA-whitening on the raw input for both training and testing:

$$\Sigma = U \Lambda U^\top \quad (6)$$
$$\hat{x} = (\Lambda + \epsilon I)^{-1/2}\, U^\top (x - \mu) \quad (7)$$

where $I$ is the identity matrix, $\epsilon$ is a small regularization coefficient, and $(\cdot)^{-1/2}$ denotes element-wise inverse square root. The encoder then generates the embedding from $\hat{x}$ with a sigmoid layer, and the decoder generates $\hat{x}$ from the embedding with a linear layer and computes the squared error loss. Both float fields are given total loss weight 1, so the sum of the squared errors is used instead of the average. PCA-whitening is similar to, and reduces to, batch normalization (Ioffe & Szegedy, 2015) without scale and shift when the components of $x$ are uncorrelated, but automatically handles strong correlation, which can cause batch normalization to overestimate the true variances of the data. For generation, the inverse of Eq (7) is used to un-whiten the prediction.
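A hedged numpy sketch of the moving-statistics update and PCA-whitening described above (the regularization coefficient here is illustrative, not the paper's value):

```python
import numpy as np

def pca_whiten(x, mean, cov, eps=1e-5):
    """PCA-whitening sketch: eigendecompose the (moving) covariance,
    rotate the centered input into the eigenbasis, and rescale each
    component by the regularized inverse square root of its eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov = U diag(lam) U^T
    scale = 1.0 / np.sqrt(eigvals + eps)     # element-wise inverse sqrt
    return (x - mean) @ eigvecs * scale

def update_moving_stats(mean, cov, batch, decay=0.999):
    """Moving mean / covariance update with decay 0.999, as in the paper."""
    batch_mean = batch.mean(axis=0)
    centered = batch - batch_mean
    batch_cov = centered.T @ centered / len(batch)
    return (decay * mean + (1 - decay) * batch_mean,
            decay * cov + (1 - decay) * batch_cov)
```

Whitening correlated samples with their own mean and covariance yields components with approximately identity covariance, which is what makes it a drop-in alternative to batch normalization here.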

b.3 The Tuple module

Unlike the StringLiteral module and the ScalarTuple module, the Tuple module is not a leaf module, i.e. the input of its encoder is the embeddings generated by the encoders of its child modules, and the output of its decoder is the embeddings used by the decoders of its child modules. The encoder is based on a bidirectional RNN with the same GRU implementation as the StringLiteral module, but with 128-dim state size and 128-dim input size. In addition, since the size of the tuple is fixed and each element of the tuple is different, GRU cells for different tuple elements are distinct and do not share parameters. For example, with 7 string fields and 2 float fields modelled by a ScalarTuple module, the Tuple module encoder for the address model has 8 × 2 = 16 distinct GRU cells (one per tuple element per direction). The bidirectional RNN has a shared 128-dim trainable initial state for both directions, and the two final states of the bidirectional RNN are concatenated and fed to a fully-connected layer to produce the final embedding.

The Tuple module decoder is also based on an RNN. The initial state for the decoder RNN is initialized from the embedding with a fully-connected layer. Each element of the tuple again has its own GRU cell, and the same GRU implementation is used with 128-dim state size and 128-dim input size. In addition, each GRU cell includes a fully-connected layer that generates its child module embedding from the current state. This child module embedding is then fed back to the GRU cell to increment to the next state for generation and scheduled sampling (Bengio et al., 2015). For training/testing with teacher forcing, embedding given by the child encoder is used as the ground-truth input. In order to make sure that the encoder and decoder of the child module use the same representation, we add an extra loss term dubbed skew loss, which is the mean squared error between the generated embedding and the embedding given by the child encoder. This skew loss is somewhat arbitrarily given the same weight as the reconstruction loss of the respective child module.

The model described here is generated from the Protocol Buffer message definition by an internal framework. Code-named Alala in reference to the Greek goddess and the Hawaiian crow Corvus hawaiiensis, the framework is based on TensorFlow Fold (Looks et al., 2017) and developed for training VAEs on arbitrarily-defined protocol buffers. The order of the elements of the tuple follows that of the Protocol Buffer message definition, except that (lat, long) are collected and modelled jointly by the ScalarTuple module as the last element of the tuple.

b.4 Standard deviation network

Encoders of Alala modules produce an embedding vector. In order to train a VAE with Alala modules, we interpret the embedding vector as the mean vector and generate the standard deviation vector from it with a standard deviation network. For the model described here, the standard deviation network consists of 3 fully-connected layers, topped with a sigmoid layer to produce a standard deviation vector whose elements are always in the range (0, 1). The sigmoid layer is initialized with zero weights and -5 bias to make sure that the standard deviation vector starts out small in the beginning of the training process.
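A numpy sketch of such a standard deviation network under the stated initialization (layer widths and CELU α = 1 are assumptions; only the zero-weight, −5-bias sigmoid output layer is taken from the text):

```python
import numpy as np

def make_std_network(dim, rng):
    """Standard deviation network sketch: 3 fully-connected CELU layers
    plus a sigmoid output layer whose weights start at zero and bias at -5,
    so initial standard deviations are sigmoid(-5) ~= 0.0067."""
    def fc(n_in, n_out):
        std = np.sqrt(1.0 / n_in)
        w = np.clip(rng.normal(0, std, (n_in, n_out)), -2 * std, 2 * std)
        return w, np.zeros(n_out)
    layers = [fc(dim, dim) for _ in range(3)]
    w_out, b_out = np.zeros((dim, dim)), np.full(dim, -5.0)

    def forward(mean_vec):
        h = mean_vec
        for w, b in layers:
            a = h @ w + b
            # CELU with alpha = 1 (illustrative choice)
            h = np.maximum(0, a) + np.minimum(0, np.exp(np.minimum(a, 0)) - 1)
        return 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))  # sigmoid -> (0, 1)
    return forward
```

Because the output weights start at zero, the network initially emits sigmoid(−5) ≈ 0.0067 for every latent dimension regardless of the input, matching the intent of starting with small standard deviations.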

Appendix C Full hyperparameter sweep result

The β-VAE baseline experiments use the full 2M steps as the warm-up period, and the ground-truth probability decreases linearly from 1 to 0 for the scheduled sampling experiments. For a rough measure of reproducibility, we rerun the best experiments with the same hyperparameters.

mean median stddev
Tuple SS + String TF
0 0.384 0.246 – 0.261 0.0450 – 0.0606 0.321 – 0.329 0.114 – 0.111
0.128 0.137 < 0.01 0.267 0.600
Tuple AS + String TF
0 0.384 0.240 – 0.249 0.0448 – 0.0505 0.317 – 0.322 0.184 – 0.149
0.128 0.179 < 0.01 0.298 0.487
Always Sampling
0 0.384 0.173 < 0.01 0.294 0.0306
0.128 0.179 – 0.203 < 0.01 0.293 – 0.303 0.0234 – 0.0511
Scheduled Sampling
0 0.64 0.225 < 0.01 0.326 0.144
0.384 0.215 – 0.247 < 0.01 – 0.0237 0.317 – 0.331 0.0865 – 0.0974
0.128 0.208 < 0.01 0.309 0.0445
0.64 0.0864 0 0.220 0.884
0.384 0.103 0 0.240 0.751
0.128 0.119 – 0.248 < 0.01 – 0.0361 0.253 – 0.325 0.291 – 0.166
0.064 < 0.01 0 0.00365 0.975
Teacher Forcing
0 0.384 0.0178 0 0.0970 0.0961
0.128 < 0.01 0 0.0283 0.535

Table 4: β-VAE baseline performance

For augmented training experiments below, , gen_start_step = for scheduled sampling (roughly when generated loss stops decreasing) and gen_start_step = for Tuple SS + String TF (roughly when training loss stops rapidly decreasing). We use = 256 augmented latent vectors, so training batches now consist of 256 training examples and 256 generated variants. In practice, data generation is slow due to the lack of parallelism, so we actually shut down the initial training process with 32 workers after gen_start_step and relaunch it with 512 workers for augmented training. These augmented training experiments always employ simultaneous scheduled sampling and KL-divergence warm-up, with and the first 1M training steps as the warm-up period. We also find that it’s beneficial to have a KL-divergence cool-down period after the warm-up period, in which case we have .

mean median stddev
Tuple SS + String TF
0.128 0.316 0.123 0.356 0.0402
0.64 0.128 0.401 – 0.401 0.321 – 0.324 0.373 – 0.376 0.239 – 0.195
Scheduled Sampling
0.128 0.272 0.0327 0.348 0.0266
0.279 0.0481 0.349 0.0207
0.275 0.0348 0.350 0.0242
0.64 0.128 0.311 – 0.333 0.105 – 0.159 0.360 – 0.362 0.113 – 0.112
0.317 – 0.335 0.137 – 0.180 0.354 – 0.358 0.116 – 0.115
0.307 0.106 0.355 0.115
0.312 0.110 0.359 0.114
1 0.311 0.110 0.358 0.108
Table 5: Augmented training performance

For multiscale VAE experiments below, no augmented training is used and they always employ scheduled sampling with the first 1M training steps as the warm-up period. In case KL-divergence warm-up or cool-down is also employed during the warm-up period, we have . For experiments combining this setup and augmented training, see Appendix K.

mean median stddev
Tuple SS + String TF
1.28 0.64 0.476 – 0.509 0.494 – 0.580 0.382 – 0.384 0.916 – 0.921
1.28 0.460 0.474 0.373 0.921
Scheduled Sampling
5.12 0.64 0.386 0.297 0.368 0.605
2.56 0.376 0.275 0.367 0.619
1.28 0.402 – 0.411 0.349 – 0.361 0.367 – 0.371 0.497 – 0.525
0 0.218 0.00738 0.315 0.0441
5.12 0.224 0.00101 0.343 0.965
2.56 0.396 0.306 0.379 0.685
1.28 0.395 0.326 0.367 0.581
0.64 0.313 0.152 0.344 0.268
Table 6: Multiscale VAE only performance

Appendix D Map interpolations

Map interpolations of the models featured in the main text. For the live version and more interpolation examples, see the HTML files of the repository https://github.com/EIFY/vermont_address.

Figure 9: Interpolation between the first 2 training examples by the VAE trained with KL-divergence warm-up (Tuple SS + String TF).
Figure 10: Interpolation between the first 2 training examples by the VAE trained with augmented training.
Figure 11: Interpolation between the first 2 training examples by the multiscale VAE (without augmented training).

Appendix E Stats over repeated encoding and decoding

In the following, we plot the p-value distributions of 10000 generated samples and 10000 training examples over repeated encoding and decoding for the models featured in the main text. For generated samples , we examine the p-values of the following sequence:

For training examples , we examine the p-values of the following sequence:

For from 0 to 9. The results strongly suggest that for and where indicates that two random variables follow the same distribution.
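The repeated encoding-decoding procedure can be sketched generically as follows; encode and decode are placeholders for the VAE's encoder and decoder, not the actual model:

```python
def repeated_encode_decode(x0, encode, decode, n=10):
    """Sequence examined above: x_{k+1} = decode(encode(x_k)).
    The p-value distribution is measured at each repetition k."""
    xs = [x0]
    for _ in range(n):
        xs.append(decode(encode(xs[-1])))
    return xs
```

For instance, plugging in a contractive toy decode makes the sequence converge to a fixed point, loosely mirroring the observed pull toward typical training examples.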

Figure 12: Box plot of p-values over repeated encoding and decoding (Tuple SS + String TF). The ‘generated’ sequence at repetition = corresponds to the p-value distribution of 10000 samples of , and the ‘reconstructed’ sequence at repetition = corresponds to that of 10000 randomly selected training examples .
Figure 13:

Box plot of p-values over repeated encoding and decoding (augmented training). Some of the p-values are unaffected by repeated encoding and decoding and show up as outliers more than 1.5 IQR (interquartile range) away from the lower quartile.

Figure 14: Box plot of p-values over repeated encoding and decoding (multiscale VAE only)

We also examine the proportion of street names of the generated samples present in the training data over repeated encoding and decoding. For models that encode the street names, the proportion also increases over repeated encoding and decoding. This observation has privacy implications: namely, repeated encoding and decoding may be an efficient tool to extract rare or unique sequences of the training data from a VAE. It would be worthwhile to quantify such risk à la Carlini et al. (2018), and we leave it to future work.

Figure 15: Number of generated street names present in the training data out of 10000 samples of over repeated encoding and decoding (Tuple SS + String TF)
Figure 16: Number of generated street names present in the training data out of 10000 samples of over repeated encoding and decoding (augmented training).
Figure 17: Number of generated street names present in the training data out of 10000 samples of over repeated encoding and decoding (multiscale VAE only). The proportion of generated street names present in the training data is higher than that of optimized β-VAE (Fig 15) and stays constant throughout repeated encoding/decoding.

Appendix F Comma-separated text model

The simplest approach to modeling the Vermont state address data is to model it in its original format as comma-separated text. So we trained a simple seq-to-seq model using just the StringLiteral module and the standard deviation network (Table 7). We doubled the size of the embedding vector to 256-dim to compensate for the difference in number of parameters, omitted the string fields that are always empty so we only expect 6 comma-separated values (number, street, city, postcode, lat, long), and rounded the floating point numbers for the coordinates to five decimal places. If the generated comma-separated text does not have enough comma-separated values, or the last two values do not represent valid input for Python’s float() function, we consider the generated sample to be malformed and its p-value to be zero. We used teacher forcing as the training scheme for these experiments, as scheduled sampling simply does not work. For the KL-divergence warm-up experiment, running multiscale VAE with the same setup resulted in posterior collapse; β is fixed at 0.768 for the other multiscale VAE experiment.
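The malformedness check described above can be sketched directly (a naive comma split; field contents containing commas would need real CSV parsing, and the example address below is made up):

```python
def is_well_formed(sample: str) -> bool:
    """Validity check for generated comma-separated text: at least 6
    comma-separated values (number, street, city, postcode, lat, long),
    with the last two parsing as floats; otherwise the sample counts as
    malformed and its p-value is taken to be zero."""
    values = sample.split(",")
    if len(values) < 6:
        return False
    try:
        float(values[-2])  # lat
        float(values[-1])  # long
    except ValueError:
        return False
    return True
```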

malformed mean median stddev
KL warm-up 23 0.424 0.411 0.344
Posterior collapse 17 0.436 0.428 0.340
Multiscale VAE 43 0.394 0.356 0.344
Table 7: Comma-separated text model performance

We can see that these comma-separated text models are quite good at generating valid samples (with fewer than 0.5% of 10000 samples malformed) and capturing zip-coordinate correlations. While it is possible to train a meaningful latent comma-separated text model (Fig 18), it does not improve the generation quality over a purely autoregressive model resulting from posterior collapse (Fig 19). We speculate that the per-character cross-entropy reconstruction loss simply does not yield a latent space with good structure – from the model’s perspective, addresses that start with "147,HARTS RD" are the addresses closest to each other, not addresses that are geographically close.

Figure 18: Interpolation between the first 2 training examples by the comma-separated text model (KL warm-up).
Figure 19: "Interpolation" between the first 2 training examples by the comma-separated text model (posterior collapse). More purple and red-to-orange markers are visible here in comparison to Fig 18 since they are no longer concentrated around their closest training examples.

Appendix G Pass-through model

To establish a baseline against the tree recursive model, we implemented a simple pass-through model by replacing the Tuple module with a SimpleTuple module. The decoder of the SimpleTuple module just passes the embedding vector down to child decoders, and the encoder just concatenates the embedding vectors generated by child encoders and applies a fully-connected layer to generate the final embedding. This architecture requires separate string models for each of the four non-empty string fields (number, street, city, and postcode), and we again omitted the string fields that are always empty. We then used the same p-value metric to quantify the generation quality of pass-through models in the table below. Other than and gen_start_step = for augmented training, hyperparameters not specified in the table are the same as in the corresponding tree recursive model experiments.

mean median stddev
Generalized -VAE
Always Sampling 0.128 0.0699 0.176
Scheduled Sampling 0.0603 0.168
0 0.384 0.0652 0.175
Teacher Forcing 0.0835 0.192
Augmented Training
Scheduled Sampling 0.0614 0.171
Teacher Forcing 0.118 0.00110 0.229
Multiscale VAE only
Scheduled Sampling 1.28 0.64 0.0612 0.168
Teacher Forcing 0.0766 0.186
Table 8: Pass-through model performance

Teacher forcing for the strings remains the best training scheme, but the lack of the RNN-based Tuple decoder significantly cripples the model’s capacity to capture correlations. Augmented training manages to improve over the low baseline, but multiscale VAE fails to yield improvements. Most likely, the pass-through model lacks the capacity to learn which zip code is associated with given coordinates (or vice versa) and resorts to encoding both zip code and coordinates in the latent vector in parallel. Generated loss/BPC increases during most of the training process for these pass-through models (Fig 20), unless augmented training is employed (Fig 21).

Figure 20: Loss (left) and BPC (right) for training data, testing data, and generated samples during the training process of the pass-through model with teacher forcing.
Figure 21: Loss (left) and BPC (right) for training data, testing data, and generated samples during the training process of the pass-through model with augmented training and teacher forcing.

Appendix H AAE (Adversarial Autoencoder) model

We have tested the Adversarial Autoencoder (AAE) framework (Makhzani et al., 2015) with the tree recursive model for this problem. To keep the number of parameters approximately the same, we repurposed the standard deviation network described in Sec B.4 as the discriminator by replacing the final sigmoid layer with a softmax layer, to distinguish the latent vector generated by the deterministic encoder from a random vector drawn from the unit Gaussian distribution. We found that despite running the same Tuple SS + String TF training scheme, the AAE model failed to capture the zip-coordinate correlations (p-value stats: mean = 0.0576, median = 0, standard deviation = 0.183 after 2M autoencoder training steps and 2M discriminator training steps on the same learning rate schedule as the VAE models; the cross-entropy adversarial loss was given the same weight as the reconstruction loss).

Appendix I WAE (Wasserstein Autoencoder) model

We have also tested the Wasserstein Autoencoder (Tolstikhin et al., 2017) with the tree recursive model for this problem. In our implementation, we make the simplifying assumption that the latent vectors follow a multivariate Gaussian distribution and proceed to estimate its mean vector $\mu$ and covariance matrix $\Sigma$. The squared Wasserstein distance from this multivariate Gaussian distribution to the 128-dim unit Gaussian distribution is then given by

$$W_2^2 = \lVert \mu \rVert_2^2 + \lVert \Sigma^{1/2} - I \rVert_F^2$$

where $\lVert \cdot \rVert_2$ is the $L_2$ norm, $\Sigma^{1/2}$ is the matrix square root of $\Sigma$, $I$ is the identity matrix, and $\lVert \cdot \rVert_F$ is the matrix Frobenius norm. We also found that a WAE model running the same Tuple SS + String TF training scheme failed to capture the zip-coordinate correlations (p-value stats turned out to be mean = 0.0240, median = 0, standard deviation = 0.112 after 2M training steps on the same learning rate schedule as the VAE models; the squared Wasserstein distance latent loss was given the weight 0.128 per latent dimension). As pointed out by Tolstikhin et al. (2017), AAE can be considered a special case of WAE, so we find these results to be consistent. We suspect that this problem requires the framework to ensure the continuity of the decoder w.r.t. individual training examples, rendering generative model frameworks that only regularize the latent vector distribution insufficient.
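The closed-form distance above is straightforward to compute; here is a numpy sketch that uses an eigendecomposition for the symmetric PSD matrix square root:

```python
import numpy as np

def w2_squared_to_unit_gaussian(mu, sigma):
    """Squared 2-Wasserstein distance from N(mu, sigma) to N(0, I):
    W2^2 = ||mu||_2^2 + ||sigma^{1/2} - I||_F^2."""
    eigvals, eigvecs = np.linalg.eigh(sigma)  # sigma is symmetric PSD
    sqrt_sigma = eigvecs @ np.diag(np.sqrt(np.maximum(eigvals, 0.0))) @ eigvecs.T
    d = len(mu)
    return float(mu @ mu + np.linalg.norm(sqrt_sigma - np.eye(d), "fro") ** 2)
```

For example, with sigma = I the expression reduces to the squared norm of the mean, and with mu = 0 it penalizes only the deviation of the covariance's square root from the identity.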

Appendix J Alternative Multiscale VAE formulations

In Sec 6, we initialize the β values to be equally spaced on the interval for simplicity, and we find the optimized multiscale VAE’s behavior to be sensible. That is, the model accurately captures the zip-coordinate correlations and considers street names beyond its capacity to encode, as they vary within the reconstruction loss of the coordinates. We also note that adding augmented training partially restores street name autoencoding by making up its own details, at a price in terms of correlation accuracy and street name realism. However, we are still curious whether multiscale VAE can be forced to encode more details such as street names by setting more workers to train with lower β values.

Inspired by the exponential decay term in the augmented objective function Eq (4), the most obvious alternative is to use geometric spacing instead of linear spacing. For example, with common ratio and , we have for 32 workers. Again comparing the multiscale objective function Eq (5) with the augmented objective function Eq (4), we noticed that although the KL-divergence weight is usually applied to the KL-divergence directly, the multiscale objective function with geometric spacing would match the augmented objective function Eq (4) more closely if we invert the weights and divide the reconstruction loss by β instead:

(8)

For a plain β-VAE this does not matter, especially when a gradient-normalizing optimizer like Adam is used. For a multiscale VAE, however, the difference in formulation changes how gradients from workers training at different scales are weighted. For the results below, we tested both. The training scheme is fixed to be Tuple SS + String TF, and we measure both the zip-coordinate p-value stats and the average Levenshtein distance per character between the original street name and its reconstruction.
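The geometric β spacing discussed above can be sketched as follows (beta_max, ratio, and worker count are placeholders matching the sweep style of Tables 9-10; the exact values used in each run are not restated here):

```python
def geometric_betas(beta_max, ratio, n_workers=32):
    """Geometrically spaced KL-divergence weights: beta_i = beta_max * ratio**i,
    one per worker, in contrast to the linearly spaced values of Sec 6."""
    return [beta_max * ratio**i for i in range(n_workers)]
```

With a common ratio below 1, more workers end up training at low β, which is exactly the lever intended to push the model toward encoding details such as street names.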

mean median stddev
0.9 1.28 0.64 0.425 – 0.445 0.403 – 0.448 0.362 – 0.368 0.601 – 0.732
2.56 1.28 0.441 – 0.488 0.433 – 0.538 0.376 – 0.379 0.904 – 0.918
5.12 2.56 0.395 – 0.395 0.295 – 0.314 0.370 – 0.364 0.922 – 0.917
0.8 1.28 0.64 0.264 – 0.264 0.0830 – 0.0939 0.322 – 0.317 0.0354 – 0.0475
2.56 1.28 0.308 – 0.363 0.184 – 0.291 0.325 – 0.334 0.264 – 0.391
5.12 2.56 0.400 – 0.403 0.354 – 0.360 0.362 – 0.358 0.594 – 0.710

Table 9: Multiscale VAE with geometric spacing (weight inverted)

mean median stddev
0.9 1.28 0.64 0.342 – 0.461 0.213 – 0.477 0.353 – 0.351 0.479 – 0.725
2.56 1.28 0.493 – 0.494 0.548 – 0.549 0.385 – 0.380 0.910 – 0.921
5.12 2.56 0.364 – 0.460 0.251 – 0.471 0.359 – 0.376 0.920 – 0.920
0.8 1.28 0.64 0.247 – 0.269 0.0563 – 0.0998 0.316 – 0.320 0.0620 – 0.0714
2.56 1.28 0.325 – 0.332 0.212 – 0.223 0.332 – 0.336 0.315 – 0.278
5.12 2.56 0.384 – 0.403 0.313 – 0.348 0.357 – 0.371 0.565 – 0.855

Table 10: Multiscale VAE with geometric spacing

The following is the street name reconstructions for the first 10 training examples by the run marked by the bold font:

"HARTS RD" -> "HARTS RD"
"SECOND ST" -> "SPRING ST"
"LIME KILN RD" -> "LAKE DAVIS RD"
"JOHNSON HILL RD" -> "BAY HILL RD"
"JACKSON CROSS RD" -> "MOUNTAINSON AVE"
"WAUGH FARM RD" -> "RUNNS LN"
"PAINT WORKS RD" -> "PORTECHAN WINOORD RD"
"FURLONG RD" -> "FLATC ST"
"SABIN ST" -> "SOUTHWIN TER"
"POISSON DR" -> "POPE PKWY"

We can see that multiscale VAE can indeed be tuned to encode more details such as street names. However, there remains a trade-off between accurate zip-coordinate correlations and details such as street names, and fine-tuned multiscale VAE is no better than multiscale VAE trained with augmented training (Appendix K) at comparable correlation accuracy. Interestingly, weight inversion does seem to make the model more reliable at street name reconstruction with comparable zip-coordinate correlation accuracy.

A more radical alternative is to assign a different target KL-divergence value to each worker à la Burgess et al. (2018), instead of a different KL-divergence weight that still allows each worker to find its own trade-off between latent loss and reconstruction loss. We were surprised to find, however, that specifying the target total KL-divergence in terms of an L1 loss does not work. Instead, we only punish the model for going over the capacity budget, and trust the reconstruction loss alone to use as much of the capacity budget as possible, so we have the following objective function:

(9)

For the result below, capacity penalty weight is set to be 128 per latent dimension while the reconstruction loss is still the weighted average of nat-per-character and mean squared error loss terms to make sure that capacity budget is respected. The target capacities are specified with minimum capacity and capacity increment . For example, if and , the 32 workers run with capacity budgets . Target capacities remain fixed throughout the training process for these runs.
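The per-worker capacity budgets and the one-sided capacity penalty of Eq (9) can be sketched as follows (C_min = 10 and ΔC = 0.2 are taken from Table 11; the helper names are ours, and the penalty weight of 128 is applied per latent dimension in the paper):

```python
def capacity_budgets(c_min, c_inc, n_workers=32):
    """Per-worker target capacities: an arithmetic sequence starting at
    c_min with increment c_inc (e.g. c_min=10, c_inc=0.2 from Table 11)."""
    return [c_min + i * c_inc for i in range(n_workers)]

def capacity_penalty(kl_divergence, budget, weight=128.0):
    """One-sided penalty in the spirit of Eq (9): only exceeding the
    capacity budget is punished; staying under budget costs nothing."""
    return weight * max(0.0, kl_divergence - budget)
```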

mean median stddev
10 0.2 0.434 – 0.449 0.402 – 0.434 0.382 – 0.375 0.913 – 0.914
0.35 0.403 – 0.414 0.339 – 0.372 0.364 – 0.359 0.734 – 0.759
0.5 0.276 – 0.301 0.0874 – 0.125 0.332 – 0.341 0.483 – 0.517
1.0 0.216 – 0.223 0.0265 – 0.0388 0.301 – 0.302 0.118 – 0.125
15 0.2 0.319 – 0.343 0.153 – 0.223 0.350 – 0.347 0.642 – 0.628
Table 11: Multiscale VAE with target capacities

We observe the same trade-off between accurate zip-coordinate correlations and details such as street names, and are not able to get better results. Perhaps multiscale VAE with target capacities is harder to tune because the model is not allowed to make its own trade-offs between reconstruction loss and latent loss.

Appendix K Multiscale VAE + augmented training experiments

We explore whether multiscale VAE can be improved by augmented training. The experiments below are based on the hyperparameters optimized in Sec 6. To compensate for slow data generation, we always shut down the training process at gen_start_step and bring it back up with = 256 and 512 workers. Due to the observed variability, we always run experiments with the same hyperparameters twice.

gen_start_step mean median stddev
Tuple SS + String TF
0.465 – 0.471 0.471 – 0.491 0.377 – 0.383 0.770 – 0.780
0.444 – 0.457 0.425 – 0.466 0.385 – 0.383 0.719 – 0.662
Scheduled Sampling
0.358 – 0.385 0.196 – 0.267 0.376 – 0.381 0.427 – 0.441
0.338 – 0.378 0.154 – 0.270 0.369 – 0.377 0.423 – 0.407
0.357 – 0.382 0.217 – 0.268 0.368 – 0.377 0.378 – 0.373
Table 12: Multiscale VAE + augmented training performance

Multiscale VAE alone is capable of capturing the zip-coordinate correlations, and generated loss decreases during the training process under the Tuple SS + String TF training scheme, so it turns out to be beneficial to delay the start of augmented training until at least 1M steps. Optimized multiscale VAE + augmented training features a good compromise between the two, with accurate zip-coordinate correlations, low generated loss (Fig 22, taken from the run of Table 12 marked by the bold font), and partially restored street name reconstructions. We noticed that even though optimized multiscale VAE + augmented training usually reconstructs the first few letters of the street name, sometimes the corresponding embedding (mean) vector actually encodes a different but typical street name, especially when the training example features an unusual street name like "LIME KILN RD" or "WAUGH FARM RD". We believe this is another manifestation of the additive gravitational pull of the training examples, in combination with augmented training.

Figure 22: Loss (left) and BPC (right) for training data, testing data, and generated samples during the training process of multiscale VAE + augmented training, on the worker with the lowest β value.
Figure 23: Box plot of p-values over repeated encoding and decoding (multiscale VAE + augmented training).
Figure 24: Number of generated street names present in the training data, out of 10000 samples, over repeated encoding and decoding (multiscale VAE + augmented training). The proportion (51%) is higher than that of the optimized β-VAE (44%, Fig 15) and stays roughly constant throughout repeated encoding/decoding.
Figure 25: Interpolation between the first 2 training examples by the multiscale VAE with augmented training.

Speculating that a sampled latent vector loses information about the training example faster when β is higher, the experiments below use adaptive β values on an interval, equally spaced on the linear scale: given the chosen interval endpoints and worker count, the β values form an arithmetic sequence, and each worker runs with one of them.
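The arithmetic spacing across workers can be sketched as follows; the function name, interval endpoints, and worker count here are illustrative, not the values used in our experiments:

```python
import numpy as np

def linear_beta_schedule(beta_center, half_width, n_workers):
    """Assign each worker a beta from an arithmetic sequence
    spanning [beta_center - half_width, beta_center + half_width]."""
    return np.linspace(beta_center - half_width,
                       beta_center + half_width,
                       num=n_workers)

# e.g. 5 workers centered on beta = 1.0 with half-width 0.5
betas = linear_beta_schedule(1.0, 0.5, 5)
print(betas)  # arithmetic sequence 0.5, 0.75, 1.0, 1.25, 1.5
```

Worker i then trains its own standard deviation network with latent loss weighted by `betas[i]`, while all workers share the rest of the model parameters.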

gen_start_step mean median stddev
Tuple SS + String TF
1 0.428 – 0.484 0.420 – 0.528 0.357 – 0.376 0.621 – 0.715
0.350 – 0.408 0.233 – 0.346 0.355 – 0.374 0.486 – 0.507
Scheduled Sampling
1 0.374 – 0.393 0.238 – 0.316 0.380 – 0.376 0.527 – 0.426
0.381 – 0.390 0.262 – 0.283 0.381 – 0.381 0.446 – 0.448
0.382 – 0.417 0.279 – 0.357 0.375 – 0.386 0.392 – 0.536
0.358 – 0.391 0.218 – 0.291 0.368 – 0.380 0.391 – 0.568
0.361 – 0.365 0.226 – 0.246 0.370 – 0.368 0.441 – 0.369

Table 13: Multiscale VAE + augmented training performance (linearly-spaced)

We have also tested multiscale VAE + augmented training (scheduled sampling) with lower values and with adaptive β on an interval equally spaced on the log scale, so that the β values form a geometric sequence across workers. These variants do not work better and turned out to be irrelevant in the context of later findings, but we report them here for completeness. As in previous experiments, we always shut down the training process and bring it back up at gen_start_step with additional workers.
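The log-scale spacing is the geometric analogue of the arithmetic schedule above; again the endpoints and worker count below are illustrative only:

```python
import numpy as np

def geometric_beta_schedule(beta_lo, beta_hi, n_workers):
    """Assign betas equally spaced on the log scale, i.e. a
    geometric sequence from beta_lo to beta_hi inclusive."""
    return np.geomspace(beta_lo, beta_hi, num=n_workers)

# e.g. 5 workers spanning two octaves around beta = 1.0
betas = geometric_beta_schedule(0.25, 4.0, 5)
print(betas)  # geometric sequence 0.25, 0.5, 1.0, 2.0, 4.0
```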

mean median stddev
Scheduled Sampling
16 1 0.373 – 0.380 0.251 – 0.283 0.374 – 0.368 0.542 – 0.435
32 0.373 – 0.383 0.257 – 0.303 0.371 – 0.366 0.529 – 0.410
64 0.367 – 0.375 0.246 – 0.269 0.369 – 0.370 0.472 – 0.571
128 0.368 – 0.378 0.246 – 0.273 0.370 – 0.372 0.445 – 0.524
256 0.318 – 0.388 0.175 – 0.295 0.340 – 0.377 0.575 – 0.422
Geometric sequence
1 0.374 – 0.379 0.228 – 0.267 0.382 – 0.375 0.427 – 0.583
0.361 – 0.405 0.224 – 0.321 0.369 – 0.386 0.573 – 0.631
0.358 – 0.389 0.249 – 0.286 0.359 – 0.378 0.582 – 0.519
0.371 – 0.389 0.268 – 0.276 0.367 – 0.381 0.397 – 0.537
0.361 – 0.377 0.232 – 0.266 0.368 – 0.369 0.628 – 0.432

Table 14: Multiscale VAE + augmented training performance (geometrically-spaced)

Appendix L Other negative results

  1. It may be counter-intuitive to initialize the character embedding with a uniform distribution between 0 and 1 and to use the sigmoid function as the activation for the ScalarTuple encoder, while CELU is used for most of the model. However, changing them so that their respective embeddings cover more of the unit Gaussian distribution doesn't yield any improvement.

  2. It’s critical to bias the information diffusion in the latent space towards spreading information from the training examples for augmented training. Generation quality is improved only when augmented latent vectors are initialized from the sampled latent vectors of the training examples.

  3. Replacing with given by a discriminator for augmented training doesn't work as intended. Without a way to use gradient descent to increasingly confuse the discriminator, what confuses the discriminator tends to be near-exact copies of training examples. Variants that keep only the fraction of augmented latent vectors corresponding to lower don't work either.

  4. Since we maintain full data parallelism for multiscale VAE training, a training batch on a worker is always used with the same β value and standard deviation network. Using a training batch with multiple β values and standard deviation networks, either deterministically or randomly, doesn't yield further improvement within the same training budget.

  5. Early attempts to replace the multiple standard deviation networks with a single one parameterized by the value β, e.g. appending β to the input of one or more of its fully-connected layers, do not work. It may be inherently difficult to learn a shared representation across multiple scales.
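A minimal sketch of the shared, β-conditioned standard deviation network attempted in item 5, using numpy in place of the actual training framework; the single-hidden-layer shape, layer sizes, and class name are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def celu(x, alpha=1.0):
    # CELU activation, as used in most of the model
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0) / alpha) - 1.0))

class BetaConditionedStdNet:
    """One standard deviation network shared across scales,
    conditioned on the scale by appending beta to its input
    (illustrative sketch of a negative result, not our final model)."""
    def __init__(self, latent_dim, hidden_dim):
        # +1 input feature for the appended scalar beta
        self.w1 = rng.standard_normal((latent_dim + 1, hidden_dim)) * 0.1
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, latent_dim)) * 0.1
        self.b2 = np.zeros(latent_dim)

    def __call__(self, h, beta):
        # Append the scalar beta to the encoder features h
        x = np.concatenate([h, np.full((h.shape[0], 1), beta)], axis=1)
        hidden = celu(x @ self.w1 + self.b1)
        # Softplus keeps the predicted standard deviation positive
        return np.log1p(np.exp(hidden @ self.w2 + self.b2))

net = BetaConditionedStdNet(latent_dim=8, hidden_dim=16)
std = net(rng.standard_normal((4, 8)), beta=0.5)
print(std.shape)  # (4, 8)
```

In the per-scale setup that did work, each worker instead owns an unconditioned copy of this network for its fixed β.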