## 1 Introduction

The goal of generative modeling for discrete sequences is to learn the joint distribution of a sequence of random variables. One class of models that has shown particularly strong performance is discrete autoregressive models, which parameterize the joint density such that each variable depends on all previous assignments. These models give state-of-the-art performance across many tasks in natural language processing and related areas (Vaswani2017; Al-Rfou2018). A downside of discrete autoregressive models, however, is that their sampling procedure requires both a) sampling explicit discrete tokens and b) sampling them serially in the length of the sequence. This can pose problems in real-world generation applications.

Normalizing flows are a class of generative models for continuous random variables that implicitly represent the joint distribution of a high-dimensional random variable via an invertible deterministic transformation from a base density (Rezende2015; Kingma2016). Normalizing flows have been explored both to increase the flexibility of the variational posterior distribution in the context of variational autoencoders (Rezende2015; Kingma2016), and to model observed space, which is the focus of this work. Normalizing flows provide two key advantages: model flexibility and control over computational tradeoffs. Flows generalize continuous autoregressive models (Papamakarios2017) and give more distributional flexibility. Furthermore, normalizing flows can be designed to be non-autoregressive during sampling (Oord2017; Jul), enabling parallel generation. Recent work on images has demonstrated accuracy for non-autoregressive models approaching that of autoregressive models in the continuous setting (Jul).

Both properties are desirable for generative models of discrete sequences. Unfortunately, normalizing flows rely on parameterized applications of the change-of-variables formula. Applying related methods, e.g. via the discrete change of variables or a relaxation, to discrete random variables leads to significant additional challenges. A method for applying flows to discrete data, with flows flexible enough to model typically highly multimodal discrete data, has not yet been demonstrated.

In this work, we propose an approach for discrete sequence modeling with a latent normalizing flow. We develop a generative model that jointly learns a flow-based density in the latent space and a simple mapping to discrete observations. Specifically we propose (1) a latent variable model that learns all dynamics of the observed discrete space in the latent continuous space, and (2) three specific normalizing flow architectures designed to capture these dynamics, in particular the extreme multimodality inherent in discrete data.

Experiments consider discrete latent generative models for character-level language modeling and polyphonic music modeling. We find that the latent flow model is able to describe the character-level dataset as well as a discrete autoregressive LSTM-based model, and is able to describe the polyphonic music datasets comparably to other autoregressive latent-variable models. We further find that the parallel-generation version of the model is able to generate sentences faster than the baseline model, with a penalty to modeling performance. Finally, we analyze the functionality of the model and demonstrate how it induces the high degree of multimodality needed to map between continuous and discrete spaces.

Code is available at https://github.com/harvardnlp/TextFlow.

## 2 Related Work

#### Latent Variable Models for Sequences

In the context of language modeling, Bowman2015 experiment with a variational autoencoder (VAE) of fixed size continuous latent space and an autoregressive RNN decoder. In practice, the VAE encodes little information about the sentence in the latent space because the decoder is powerful enough to model the data well, reducing the VAE to a standard autoregressive model. Recent work has focused on increasing the amount of information the model places in the latent space, either by modifying the prior density (Xu2018), the decoder structure (Yang2017a), or the variational inference procedure (Kim2018), though in all cases the model still relies heavily on the discrete decoder. Our proposed model removes the discrete autoregressive decoder entirely.

Other methods construct VAEs for sequence data with a variable size latent variable composed of one latent vector per input token (Bayer2014; Chung2015; Gu2015). While similar to the model proposed in this work in the layout of the latent dimensions, these models also include an autoregressive discrete decoder.

Chen2016a propose a VAE model for images with a learned normalizing-flow based prior and a weaker decoder. The latent size is fixed, the model is applied to continuous random variables, and the decoder still allows for dependence between the random variables. This differs from our latent sequence model. To the best of our knowledge, no previous works explore the setting of a latent continuous sequence model with a weak discrete decoder.

#### Non-Autoregressive Generation

In the domain of natural images, Dinh2016 and Jul propose flow-based models using affine “coupling layers” to allow for non-autoregressive generation. Compared to state-of-the-art autoregressive models, their non-autoregressive model performs both training and generation in parallel but suffers a penalty to model accuracy.

In the domain of text, Gu2017 propose a model which uses fertility scores as a latent variable, approaching the performance of autoregressive models. While this works for translation due to the aligned nature of the sentences, the fertility framework and required pre-trained autoregressive model preclude the technique from more general application. Lee2018 propose a deterministic model based on a denoising process to iteratively improve the quality of a non-autoregressively generated sentence. The authors demonstrate strong performance at neural machine translation, but the technique does not model the full distribution and requires a task-specific predetermined denoising process.

In an alternative approach for faster neural machine translation, Kaiser2018 propose to use a discrete latent space of variable but reduced size (e.g. 8x fewer tokens than the length of the sentence). While this technique speeds up the translation process, it is still serial. Furthermore, the method makes no claims about fully modeling the distribution of the data.

## 3 Background: Normalizing Flows

Normalizing flows are a class of models that define a density through a parameterized invertible deterministic transformation from a base density, such as a standard Gaussian (Tabak2010). Define an invertible transformation $f_\theta: \epsilon \mapsto z$ and base density $p_\epsilon(\epsilon)$. These specify the density $p_Z(z)$ via the change-of-variables formula:

$$p_Z(z) = p_\epsilon\left(f_\theta^{-1}(z)\right) \left| \det \frac{\partial f_\theta^{-1}(z)}{\partial z} \right|$$

Consider two core operations defined with flows: (a) sampling, $z \sim p_Z$, is performed by first sampling from the base distribution, $\epsilon \sim p_\epsilon$, and then applying the forward transformation $z = f_\theta(\epsilon)$; (b) density evaluation, $p_Z(z)$ for a known $z$, is computed by inverting the transformation, $\epsilon = f_\theta^{-1}(z)$, and computing the base density $p_\epsilon(\epsilon)$. If $f_\theta$ is chosen to have an easily computable Jacobian determinant and inverse, both of these can be computed efficiently.
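The two operations can be illustrated with the simplest possible flow, a single scalar affine map over a standard Gaussian base density. This is only a sketch: the function names and the `shift` and `scale` parameters are hypothetical and are not taken from the paper's code.

```python
import math
import random


def std_normal_logpdf(eps):
    """Log-density of the standard Gaussian base distribution."""
    return -0.5 * (eps * eps + math.log(2.0 * math.pi))


def forward(eps, shift, scale):
    """Forward transformation: base noise -> sample."""
    return shift + scale * eps


def inverse(z, shift, scale):
    """Inverse transformation: sample -> base noise."""
    return (z - shift) / scale


def sample(rng, shift, scale):
    """(a) Sampling: draw base noise, then apply the forward map."""
    return forward(rng.gauss(0.0, 1.0), shift, scale)


def log_density(z, shift, scale):
    """(b) Density evaluation via the change-of-variables formula."""
    eps = inverse(z, shift, scale)
    # The inverse map has derivative 1 / scale, so its log-abs-determinant
    # is -log|scale|.
    return std_normal_logpdf(eps) - math.log(abs(scale))


if __name__ == "__main__":
    rng = random.Random(0)
    z = sample(rng, shift=1.0, scale=2.0)
    print(z, log_density(z, shift=1.0, scale=2.0))
```

With `shift=1.0` and `scale=2.0` the resulting density is that of a Gaussian with mean 1 and standard deviation 2, which gives an easy sanity check.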

One method for satisfying these criteria is to compose invertible components, such as scalar affine transformations, and arrange them to ensure a triangular Jacobian matrix and therefore a determinant that can be computed in linear time. We consider three different variants on this theme, and discuss the computational tradeoffs for sampling and density evaluation. For this section we assume without loss of generality that $z \in \mathbb{R}^D$ with ordered dimensions $1, \dots, D$.

#### Autoregressive Flow (AF)

Autoregressive flows, originally proposed in Papamakarios2017, ensure an invertible transformation and triangular Jacobian matrix by conditioning each scalar affine transformation on all previously observed variables $z_{<d}$,

$$z_d = f_\theta(\epsilon)_d = a(z_{<d}; \theta) + b(z_{<d}; \theta) \cdot \epsilon_d, \qquad \epsilon_d = f_\theta^{-1}(z)_d = \frac{z_d - a(z_{<d}; \theta)}{b(z_{<d}; \theta)}$$

where $a$ and $b$ are the shift and scale functions with shared parameters $\theta$. The Jacobian matrix is triangular because $\partial z_i / \partial \epsilon_j$ is non-zero only for $j \le i$, with determinant $\prod_{d=1}^{D} b(z_{<d}; \theta)$.

A flow diagram of AF is shown in Figure 1a. To sample $z$, we sample each $\epsilon_d$ on the left. The first $z_1$ is computed through an affine transformation, and then each subsequent $z_d$ is sampled in serial based on $\epsilon_d$ and $z_{<d}$. To evaluate the density, we simply apply individual scalar affine transformations in parallel, each depending on all previous observed $z_{<d}$, and compute the base density.
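The asymmetry between serial sampling and parallel density evaluation can be seen in a small sketch. The conditioner `shift_scale` below is a made-up toy function standing in for a learned network; only the dependency structure is meant to match the description above.

```python
import math


def shift_scale(previous):
    """Toy conditioner (hypothetical): shift and positive scale computed
    from the previously generated values."""
    s = sum(previous)
    return 0.5 * s, math.exp(0.1 * s)


def af_sample(eps):
    """Sampling is serial: each output needs all earlier outputs."""
    z = []
    for e in eps:
        a, b = shift_scale(z)
        z.append(a + b * e)
    return z


def af_invert(z):
    """Density evaluation: every dimension conditions only on the observed z,
    so the iterations are independent and could run in parallel."""
    eps, log_det = [], 0.0
    for d, z_d in enumerate(z):
        a, b = shift_scale(z[:d])
        eps.append((z_d - a) / b)
        log_det += math.log(b)  # log-determinant of the forward Jacobian
    return eps, log_det


if __name__ == "__main__":
    z = af_sample([0.3, -1.2, 0.7])
    print(z, af_invert(z))
```

The loop in `af_invert` is written sequentially for readability; none of its iterations reads a value produced by another iteration.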

#### Inverse Autoregressive Flow (IAF)

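Inverse autoregressive flows, proposed in Kingma2016, use affine transformations that depend on previous $\epsilon_{<d}$ instead of $z_{<d}$. The transformation for IAF has the form:

$$z_d = f_\theta(\epsilon)_d = a(\epsilon_{<d}; \theta) + b(\epsilon_{<d}; \theta) \cdot \epsilon_d, \qquad \epsilon_d = f_\theta^{-1}(z)_d = \frac{z_d - a(\epsilon_{<d}; \theta)}{b(\epsilon_{<d}; \theta)}$$

A flow diagram for IAF is shown in Figure 1b. For the sampling process all $z_d$ can be computed given $\epsilon$ in parallel; conversely, density evaluation requires computing each $\epsilon_d$ serially since $\epsilon_{<d}$ is needed for the transformation. In practice AF and IAF encode different inductive biases which can hinder the ability of IAF to generalize as well as AF (Oord2017).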
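A matching sketch for IAF shows the roles reversed. As before, `shift_scale` is a hypothetical toy conditioner, not the learned function used in the paper.

```python
import math


def shift_scale(previous):
    """Toy conditioner (hypothetical): shift and positive scale computed
    from the previous base-noise values."""
    s = sum(previous)
    return 0.5 * s, math.exp(0.1 * s)


def iaf_sample(eps):
    """Sampling: all conditioning inputs are known noise values, so each
    output can be computed independently (in parallel)."""
    z = []
    for d, e in enumerate(eps):
        a, b = shift_scale(eps[:d])
        z.append(a + b * e)
    return z


def iaf_invert(z):
    """Density evaluation is serial: recovering each noise value requires
    the noise values recovered before it."""
    eps, log_det = [], 0.0
    for z_d in z:
        a, b = shift_scale(eps)
        eps.append((z_d - a) / b)
        log_det += math.log(b)  # log-determinant of the forward Jacobian
    return eps, log_det


if __name__ == "__main__":
    z = iaf_sample([0.3, -1.2, 0.7])
    print(z, iaf_invert(z))
```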

#### Split Coupling Flow (SCF)

Split coupling flows, initially proposed in Dinh2016 and followed up on in Jul, utilize “coupling layers” that keep a subset $S \subset \{1, \dots, D\}$ of the random variables unchanged, i.e. $z_S = \epsilon_S$, and use these to condition the transformation for the rest of the random variables $\bar{S}$. The transformation for SCF and $d \in \bar{S}$ can be written:

$$z_d = a(z_S; \theta) + b(z_S; \theta) \cdot \epsilon_d, \qquad \epsilon_d = \frac{z_d - a(z_S; \theta)}{b(z_S; \theta)}$$

A flow diagram for SCF is shown in Figure 1c, where $S = \{1, 2\}$ for visualization. Because only the first two variables are used to condition the rest of the affine transformations, both sampling and density evaluation are parallel. As SCF is a special case of AF, it has strictly reduced modeling flexibility in exchange for improved computational efficiency (Papamakarios2017).
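A coupling layer can be sketched in a few lines. The example assumes the unchanged subset is the first `n_keep` dimensions and, for brevity, uses one shared shift and scale for all transformed dimensions; a learned coupling layer would typically emit one pair per dimension. The `conditioner` function is hypothetical.

```python
import math


def conditioner(kept):
    """Toy conditioner (hypothetical): shift and positive scale computed
    from the unchanged variables only."""
    s = sum(kept)
    return 0.5 * s, math.exp(0.1 * s)


def scf_forward(eps, n_keep=2):
    """Copy the first n_keep values, transform the rest conditioned on them."""
    kept = list(eps[:n_keep])
    a, b = conditioner(kept)
    rest = [a + b * e for e in eps[n_keep:]]
    log_det = len(rest) * math.log(b)
    return kept + rest, log_det


def scf_inverse(z, n_keep=2):
    """The kept values are visible in z, so inversion needs no serial loop."""
    kept = list(z[:n_keep])
    a, b = conditioner(kept)
    return kept + [(v - a) / b for v in z[n_keep:]]


if __name__ == "__main__":
    z, log_det = scf_forward([1.0, 1.0, 0.5, -0.5])
    print(z, log_det, scf_inverse(z))
```

Both directions read only the kept values, which is why neither sampling nor density evaluation has a serial dependency.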

#### Layered Flows

Each flow encodes an invertible function whose Jacobian determinant can be computed in linear time. Because invertibility is closed under function composition, and the Jacobian determinant of composed functions is the product of the individual Jacobian determinants, more flexible distributions can be created by layering flows and changing the ordering of the dependencies at each layer (Salimans2017). Changing the ordering between layers allows all dimensions to interact with each other, and is usually implemented by reversing or shuffling the ordering of dependencies (Jul).

Figure 1d shows an example with three layers of AF, with reversed dependency ordering between layers. Stacking multiple layers of flow has been shown to significantly increase the modeling flexibility of this class of normalizing flows (Jul; Oord2017).
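The sketch below stacks several identical toy coupling layers and reverses the dimension order between them, accumulating log-determinants by addition. The layer definition and its conditioner are hypothetical simplifications chosen to keep the example short; reversal is a permutation and contributes nothing to the log-determinant.

```python
import math


def coupling_forward(x):
    """One toy coupling layer: the first half passes through unchanged and
    conditions an affine map of the second half."""
    k = len(x) // 2
    kept = list(x[:k])
    shift, scale = 0.5 * sum(kept), math.exp(0.1 * sum(kept))
    out = kept + [shift + scale * v for v in x[k:]]
    return out, (len(x) - k) * math.log(scale)


def coupling_inverse(y):
    k = len(y) // 2
    kept = list(y[:k])
    shift, scale = 0.5 * sum(kept), math.exp(0.1 * sum(kept))
    return kept + [(v - shift) / scale for v in y[k:]]


def stacked_forward(eps, n_layers=3):
    """Apply the layers in sequence, reversing the ordering after each one."""
    x, total_log_det = list(eps), 0.0
    for _ in range(n_layers):
        x, log_det = coupling_forward(x)
        total_log_det += log_det  # determinants multiply, so logs add
        x = x[::-1]
    return x, total_log_det


def stacked_inverse(z, n_layers=3):
    """Undo each layer: reverse the ordering back, then invert the coupling."""
    x = list(z)
    for _ in range(n_layers):
        x = coupling_inverse(x[::-1])
    return x


if __name__ == "__main__":
    z, log_det = stacked_forward([0.2, -0.4, 1.1, 0.6])
    print(z, log_det, stacked_inverse(z))
```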

A multilayer flow represents a true invertible vector transformation with a dense Jacobian matrix. Forming the building blocks for the discrete flow models, we denote a multilayer AF as , a multilayer IAF as , and a multilayer SCF .

## 4 Latent Flows for Discrete Sequences

Using these building blocks, we aim to develop flexible flow-based models for discrete sequences. The first difficulty is that any deterministic non-trivial mapping between a discrete space and a continuous space or between two discrete spaces is not invertible. Instead we explore using a latent-variable model, with a continuous latent sequence modeled through normalizing flows. We begin by describing the full generative process and then focus on the flow-based prior.

### 4.1 Generating Discrete Sequences

Our central process will be a latent-variable model for a discrete sequence. However, unlike standard discrete autoregressive models, we aim to lift the main dynamics of the system into continuous space, i.e. into the prior. In particular, we make the strong assumption that each discrete symbol is conditionally independent given the latent.

Concretely, we model the generation of a discrete sequence conditioned on a latent sequence made up of continuous random vectors with and is a hidden dimension. Define as our prior distribution, and generate from the conditional distribution over discrete observed variables . The conditional likelihood generates each conditionally independently: , where the emission distribution depends on the dataset.

To allow for non-autoregressive generation, the length of the sequence is explicitly modeled as a latent variable and all parts of the model are conditioned on it. Length conditioning is elided in the following discussion (see the Supplementary Materials for details). The complete graphical model is shown in Figure 2.

### 4.2 Criteria for Effective Flow Parameterization

The prior in this process needs to capture the dynamics of the discrete system in a continuous space. Unlike common continuous spaces such as images, in which conditional distributions are often modeled well by unimodal or few-modal distributions, discrete spaces with fixed generation order are highly multimodal.

Figure 3 illustrates this difficulty. First consider the continuous distributions generated by an AF model (PixelCNN++; Salimans2017) with 10 mixture components. Despite its flexibility, the resulting distributions have limited modality, indicating that increasing flexibility does not better model the data. Further corroborating this hypothesis, Salimans2017 report that using more than 5 mixture components does not improve performance.

In contrast, Figure 3b shows a similar experiment on discrete data. Here the first and third distributions are highly multimodal (given previous characters there are multiple different possibilities for the next character). Furthermore, the degree of multimodality can vary significantly, as in the second example, requiring models to be able to adjust the number of indicated modes in addition to their locations. In the proposed model, because the conditional likelihood models each as independent, this multimodality at each time step needs to exist almost exclusively in the latent space with each likelihood being highly constrained in its conditioning.

### 4.3 Flow Architectures for Sequence Dynamics

We consider three flow architectures that describe relations across the time and hidden dimensions that aim to maximize the potential for multimodal distributions. These differ in their inductive biases as well as the sampling and density evaluation processes. Note, that throughout this section represents a random vector, and so the model is over random variables. The main concern is the interactions between time and hidden dimensions.

#### Model 1: AF in time, AF in hidden (AF / AF)

First consider an autoregressive flow along the time dimension with each time step applying an autoregressive flow along the hidden dimension. The transformation function can be written as,

where is a layered AF transformation described above with each constituent affine transformation conditioned on in addition to . A proof that this represents a valid normalizing flow is given in the Supplementary Materials.

The flow diagram is shown in Figure 4a. At each time step the AF-in-hidden induces dependencies along the hidden dimension (inside ) to create a multimodal distribution. The AF-in-time conditions each subsequent on the previous latent vectors . For density evaluation, , both the dependencies within each and the dependencies across time can be computed in parallel. For sampling, each hidden dimension at each time step must be computed in serial.

#### Model 2: AF in time, SCF in hidden (AF / SCF)

Model 1 can be evaluated efficiently, but the serial sampling procedure may be an issue in applications. As an alternative we consider a flow which replaces AF-in-hidden dimension with a layered SCF. The prior is defined by the forward and inverse transformation functions,

The flow diagram is shown in Figure 4b. This model allows for similar parallel density evaluation as Model 1, however it is parallel in sampling along the hidden dimension, which can help efficiency. The downside is that SCF may not be able to induce the flexible multimodality required for the discrete case.

#### Model 3: IAF in time, SCF in hidden (IAF / SCF)

Finally, the autoregressive sampling behavior can be removed completely. The final model uses an IAF-in-time to remove this serial dependency in sampling. The transformation functions are:

The flow diagram is shown in Figure 4c. For sampling, given the time-wise and hidden dependencies can be satisfied in parallel (they all appear on the right side of the forward transformation function). Density evaluation, on the other hand, becomes parallel along hidden and serial in time. (We also considered an IAF / IAF model; however, having fully serial operation in density evaluation makes training prohibitively expensive.)

#### Extension: The Non-Linear Squared Flow

We can add further flexibility to the model by modifying the core flows. Building on the observations of (Huang2018), we propose replacing the affine scalar transformation with an invertible non-linear squared transformation (designated NLSq):

This transformation has five pseudo-parameters instead of the two for the affine. It reduces to the affine function in the case where . When , the function effectively adds a perturbation with position controlled by and scale controlled by and , which even in 1D can induce multimodality. Under conditions on the scale parameter the function can be guaranteed to be invertible, and the analytical inverse is the solution to a cubic equation (see Supplementary Materials for details).
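The behaviour described above can be explored numerically. The sketch below assumes one five-parameter form that matches the description, `a + b*eps + c / (1 + (d*eps + g)**2)`. This exact form and the parameter names are assumptions, since the equation is not recoverable from the text here. The inverse uses bisection rather than the analytical cubic solution mentioned above, purely to keep the example short.

```python
import math


def nlsq(eps, a, b, c, d, g):
    """Assumed NLSq form: an affine map plus a bounded bump of height c,
    centred where d * eps + g = 0."""
    return a + b * eps + c / (1.0 + (d * eps + g) ** 2)


def is_invertible(b, c, d):
    """Sufficient condition for a strictly positive slope (b > 0, d > 0):
    the minimum slope is b - (3 * sqrt(3) / 8) * |c| * d."""
    return b > 0 and d > 0 and abs(c) * d < (8.0 * math.sqrt(3.0) / 9.0) * b


def nlsq_inverse(z, a, b, c, d, g, iterations=200):
    """Invert by bisection (a stand-in for the analytical cubic root).
    The bump lies between 0 and c, so the root is bracketed by the two
    affine inverses below."""
    lo = (z - a - abs(c)) / b
    hi = (z - a + abs(c)) / b
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if nlsq(mid, a, b, c, d, g) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    params = (0.5, 1.0, 1.2, 1.0, -0.3)
    z = nlsq(0.7, *params)
    print(z, nlsq_inverse(z, *params))
```

Setting `c = 0` removes the bump and recovers the plain affine transformation, consistent with the reduction described above.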

Figure 5 illustrates the transformation. Figure 5a, b show an example of four compositions of NLSq functions, and the initial and final density. Whereas the affine transformation would simply scale and shift the Gaussian, the NLSq function induces multimodality. As a final example of the ability of this function to model a multimodal distribution within the flow framework, Figure 5c shows the learned 2D density for a toy dataset consisting of a mixture of four Gaussians. Consistent with Huang2018, we find that an AF even with many layers fails to learn to model the same distribution.

## 5 Variational Inference and Training

To train the model, we need to learn both the simple likelihood and the prior models. This requires being able to efficiently perform posterior inference, i.e. compute the posterior distribution $p(z \mid x)$, which is computationally intractable. We instead use the standard approach of amortized variational inference (Kingma2014) by introducing a trained inference network, $q_\phi(z \mid x)$. This distribution models each $z_t$ as a diagonal Gaussian with learned mean and variance:

$$q_\phi(z \mid x) = \prod_{t=1}^{T} \mathcal{N}\left(z_t \mid \mu_t(x; \phi), \mathrm{diag}\left(\sigma_t^2(x; \phi)\right)\right)$$

While this mean-field factorization results in a weak inference model, preliminary experiments indicated that increasing the flexibility of the inference model with e.g. IAF (Kingma2016) did not improve performance.

This inference network is trained jointly with the model to maximize the evidence lower bound (ELBO),

$$\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

Training proceeds by estimating the expectation with Monte Carlo samples and optimizing the lower bound for both the inference network parameters $\phi$ as well as the prior and likelihood parameters $\theta$.

## 6 Methods and Experiments

We consider two standard discrete sequence modeling tasks: character-level language modeling and polyphonic music modeling. For all experiments, we compute the negative log-likelihood (NLL) estimated with importance sampling and evaluate on a held-out test set to evaluate distribution-modeling performance. As a baseline, we use an LSTM-based language model as in Press2017, the standard discrete autoregressive model. For all experiments we use a baseline LSTM of the same size as the flow-based model. For all flow-based models, a BiLSTM is used to compute the likelihood model and the inference network. All flow-based models use NLSq unless otherwise noted. Optimization and hyperparameter details are given in the Supplementary Materials.
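An importance-sampled NLL estimate can be sketched on a toy model where the answer is known in closed form. The model below (a standard Gaussian prior over a scalar latent and a unit-variance Gaussian likelihood) is a hypothetical stand-in chosen only so the estimate can be checked; it is not the model evaluated in this section.

```python
import math
import random


def normal_logpdf(value, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def importance_nll(x, q_mean, q_var, num_samples, rng):
    """Estimate -log p(x) with samples from a proposal q(z | x).
    Toy model: z ~ N(0, 1), x | z ~ N(z, 1)."""
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, 1.0)
        log_weights.append(log_joint - normal_logpdf(z, q_mean, q_var))
    return -(logsumexp(log_weights) - math.log(num_samples))


if __name__ == "__main__":
    x = 1.3
    rng = random.Random(0)
    # A deliberately imperfect proposal; the estimate tightens with more samples.
    print(importance_nll(x, q_mean=0.5, q_var=1.0, num_samples=1000, rng=rng))
    # Exact value for comparison: marginally x ~ N(0, 2).
    print(-normal_logpdf(x, 0.0, 2.0))
```

When the proposal equals the true posterior (here mean `x / 2` and variance `1 / 2`), every importance weight equals the marginal likelihood and the estimate is exact for any number of samples.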

Table 1: Character-level language modeling results on Penn Treebank.

| Model | Test NLL (bpc) | Reconst. (bpc) | KL (bpc) |
| --- | --- | --- | --- |
| LSTM | 1.38 | - | - |
| LSTM (sentence-wise) | 1.41 | - | - |
| AF-only | 2.90 | 0.15 | 2.77 |
| AF / AF | 1.42 | 0.10 | 1.37 |
| AF / SCF | 1.46 | 0.10 | 1.43 |
| IAF / SCF | 1.63 | 0.21 | 1.55 |

### 6.1 Character-Level Language Modeling

Character-level language modeling tests the ability of a model to capture the full distribution of high-entropy data with long-term dependencies. We use the Penn Treebank dataset, with the standard preprocessing as in Mikolov2012. The dataset consists of approximately 5M characters, with rare words replaced by “unk” and a character-level vocabulary size of . Unlike previous works on character-level language modeling, which consider the dataset to be a single continuous string of characters, non-autoregressive generation requires the dataset to be split up into finite chunks. Following previous text-based VAE works in the literature (Bowman2015), the dataset is split into sentences. To avoid extreme outliers, the dataset is limited to sentences of length less than 288 tokens, which accounts for 99.3% of the original dataset. Due to these two modifications the absolute NLL scores are not precisely comparable between this dataset and the one used in previous works, although the difference is small.

Table 1 shows results. The LSTM baseline establishes a “gold standard” representing a model trained directly on the observed discrete sequence with the same conditioning as the proposed model. In terms of absolute NLL score, AF / AF nearly matches the LSTM baseline, whereas AF / SCF is within 0.05 of the LSTM baseline. These results demonstrate that the combination of AF-in-hidden and the NLSq scalar invertible function induces enough multimodality in the continuous distribution to model the discrete data. The AF-only “unigram” model removes the relationships across time in the prior model, effectively dropping the time dynamics.

The IAF / SCF model performs worse than the other models, which reflects the additional challenges associated with non-autoregressive sampling. The same effect is seen with normalizing flow-based generative models for images (Dinh2016; Jul), where non-autoregressive models have not reached state-of-the-art performance. Still, compared to the AF-only baseline, the IAF / SCF model clearly learns important dependencies between characters.

Interestingly, in all models the KL term dominates the ELBO, accounting for roughly 88% to 95% of it in Table 1 (over 90% for every model except IAF / SCF). This is in stark contrast to previous NLP latent-variable models with strong likelihood models. In these models, the KL term accounts for less than 5% of the ELBO (Bowman2015; Kim2018; Xu2018), or less than 30% of the ELBO when using a specially designed auxiliary loss (Goyal2017). This indicates that the model 1) is using the latent space to predict each letter, and 2) is rewarded in terms of NLL for accurately encoding the discrete tokens in both the reconstruction term and the KL term.
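The share of the ELBO taken by the KL term can be recomputed directly from the reconstruction and KL columns of Table 1. The snippet treats the negative ELBO as the sum of those two columns, which is an assumption about how the table is laid out.

```python
# (reconstruction, KL) in bits per character, copied from Table 1.
TABLE1 = {
    "AF-only": (0.15, 2.77),
    "AF / AF": (0.10, 1.37),
    "AF / SCF": (0.10, 1.43),
    "IAF / SCF": (0.21, 1.55),
}


def kl_fraction(reconst, kl):
    """Fraction of the (negative) ELBO contributed by the KL term."""
    return kl / (reconst + kl)


fractions = {name: kl_fraction(r, k) for name, (r, k) in TABLE1.items()}

if __name__ == "__main__":
    for name, frac in fractions.items():
        print(f"{name}: {100.0 * frac:.1f}%")
```

Under this reading the autoregressive-in-time models sit between about 93% and 95%, while IAF / SCF comes out slightly below 90%.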

Table 2: Ablations of the AF / AF model.

| Model | Test NLL (bpc) | Reconst. (bpc) | KL (bpc) |
| --- | --- | --- | --- |
| AF / AF | 1.42 | 0.10 | 1.37 |
| - NLSq | 1.50 | 0.11 | 1.51 |
| - AF hidden | 1.57 | 0.14 | 1.57 |
| - AF hidden and NLSq | 1.56 | 0.29 | 1.56 |

Table 2 shows model ablations. Removing either the NLSq function or the AF-in-hidden dependencies degrades performance. Once AF-in-hidden is removed, however, further removing NLSq appears to make only a small difference in terms of NLL. These results provide further evidence for our hypothesis that modeling discrete data requires a high degree of multimodality. Furthermore, standard normalizing flows without these additions do not achieve the required flexibility.

#### Visualizing learned distributions

Figure 6 shows the prior densities of AF / AF with a latent size of 2. A continuous sequence of 2-vectors is sampled from . The AF / AF model is used to evaluate , which gives at every timestep. The figure shows the series of 8 distributions corresponding to the characters “_groups_”. In the first plot we can see that given the previous the prior distribution is unimodal, indicating the model identifies that following the previous word there is only one likely token (a space). At the next timestep, the distribution is highly multimodal, indicating uncertainty about the new word. As the model sees more of the context in the continuous space corresponding to successive characters in the word “groups”, the number of modes decreases. In two cases, corresponding to the token following “gro” and the token following “group”, the distribution is bimodal, indicating a clear two-way branching decision.

### 6.2 Polyphonic Music Modeling

Next we consider the polyphonic music modeling task (Boulanger-LewandowskiNicolas;BengioYoshua;Vincent2009). Here each timestep consists of an 88-dimensional binary vector indicating the musical notes played. Unlike character-level language modeling, where one token appears at each time step, multiple notes are played simultaneously, giving a maximum effective vocabulary size of $2^{88}$. For this dataset all models are modified so that the emission distributions are independent Bernoulli distributions instead of Categorical distributions.

Table 3 presents the results, split into model classes. RNN/LSTM is the weakest class, capturing the temporal dependencies but treating the 88 notes as independent. RNN-NADE is the strongest class, explicitly modeling the joint distribution of notes in addition to the temporal dependencies. The rest are different latent variable approaches to this problem. They each treat the 88 notes as conditionally independent given a variable-length latent variable. All models make different modeling choices and all except DMM include dependencies between the observed random variables.

Table 3: Results on the polyphonic music datasets.

| Model | Nottingham | Piano | Musedata | JSB |
| --- | --- | --- | --- | --- |
| RNN-NADE | 2.31 | 7.05 | 5.6 | 5.19 |
| TSBN | 3.67 | 7.89 | 6.81 | 7.48 |
| STORN | 2.85 | 7.13 | 6.16 | 6.91 |
| NASMC | 2.72 | 7.61 | 6.89 | 3.99 |
| SRNN | 2.94 | 8.2 | 6.28 | 4.74 |
| DMM | 2.77 | 7.83 | 6.83 | 6.39 |
| LSTM | 3.43 | 7.77 | 7.23 | 8.17 |
| AF / AF | 2.39 | 8.19 | 6.92 | 6.53 |
| AF / SCF | 2.56 | 8.26 | 6.95 | 6.64 |
| IAF / SCF | 2.54 | 8.25 | 7.06 | 6.59 |

The AF / AF model outperforms all models except RNN-NADE on the Nottingham dataset, SRNN on the Piano dataset, and TSBN and STORN on the JSB dataset. On Nottingham, AF / AF also approaches the RNN-NADE model (2.39 versus 2.31). AF / AF performs most poorly on the Piano dataset, which has the longest sequences but only 87 individual sequences. The dataset therefore poorly matches the inductive bias of the discrete flow models, which are designed to ingest whole sequences. The AF / SCF model performs slightly worse than AF / AF on all datasets, which is expected given the loss of modeling power. IAF / SCF performs slightly worse than AF / AF but, surprisingly, better than AF / SCF on all datasets except Musedata. Given the small amount of training data, IAF / SCF overfits less than AF / SCF, explaining the improved generalization despite being overall a weaker model.

Overall, the performance on the polyphonic music datasets demonstrates that the discrete flow model can work at least as well as models which explicitly include connections between the observed variables, and that the weakness of the inference model is made up for by the flexibility of the prior.

### 6.3 Non-Autoregressive Generation

While our main goal was to develop a flexible multimodal latent flow model, a secondary goal was to develop a non-autoregressive approach to discrete generation. IAF / SCF best fits this goal. We examine the practical speed of this model compared to discrete autoregressive models.

Figure 7 shows generation speed for both tasks. Experiments are run on a single Tesla V100 GPU with a batch size of one, with the IAF / SCF model using an LSTM to implement time-wise conditioning. Compared to the baseline LSTM model, the speedup comes from the fact that in the IAF formulation all of the inputs are available to the LSTM in parallel and therefore cuDNN can parallelize parts of the computation.

Figure 7 shows that for very short sequences the overhead of the proposed model makes generation slower than the baseline LSTM, whereas after that point IAF / SCF is faster than the LSTM. This experiment was run with a batch size of 1. For small batch sizes the trend holds, while for large batch sizes the additional parallelization afforded by having access to all LSTM inputs becomes less important.

## 7 Conclusion

This work proposes a latent-variable model for discrete sequences that learns a highly multimodal normalizing flow-based continuous distribution. We show that two flows, AF / AF and AF / SCF, succeed in learning rich multimodal distributions. Furthermore, we show that IAF / SCF, while less accurate, is an efficient approach for non-autoregressive generation. The proposed models can also be adapted for conditional language modeling for use in e.g. character-level translation. Future work can also explore moving to alternate architectures such as those based on self-attention, which give strong performance and are more parallelizable than LSTMs. We hope this work encourages further exploration of the interplay between and relative merits of discrete and continuous representations.

## Acknowledgements

We thank Yoon Kim, Justin Chiu, and Yuntian Deng for helpful discussions and comments.

## References

## Appendix A Proposed flow validity

A transformation function represents a valid normalizing flow if it is invertible. It represents a useful normalizing flow if its Jacobian determinant can be computed with complexity linear in the dimension of the data. We show that the three proposed flows in this work have both of these properties.

First consider the AF / AF flow, defined by transformation function :

To prove the mapping is invertible it suffices to find the inverse:

is a normalizing flow and therefore an invertible function. Each can thus be calculated from giving .

For the latent flows considered in the main text . Here we equivalently view as a large vector. We write . In this case the Jacobian matrix can be written as a block matrix

where each block is a Jacobian matrix.

For the AF / AF flow because depends only on and , which itself only depends on . Therefore the Jacobian matrix is block triangular with determinant

Thus, the Jacobian determinant is simply the product of the Jacobian determinants of the AF-in-hidden transformations at each time step. (Papamakarios2017) show that the Jacobian determinant is linear in for AF, thus the overall complexity for the determinant calculation of AF / AF is .

The proof holds when is replaced with , as (Dinh2016) show that the Jacobian of can be computed with linear complexity. This concludes the proof that AF / AF and AF / SCF are valid normalizing flows with Jacobian determinant calculations linear in the data dimension.

For IAF / SCF the transformation function pair is:

This is invertible because an inverse function is found. because depends only on and . The Jacobian matrix is thus block triangular with determinant . The same argument as for AF / AF gives a Jacobian determinant complexity of .

## Appendix B NLSq invertibility

The NLSq function is

(1) |

In the following discussion we assume A real scalar function is invertible if its derivative is positive everywhere.

Taking another derivative and setting it equal to 0 gives the critical points . The distinction between maximum and minimum depends on the sign of . In either case, the minimum slope is

Thus invertibility is guaranteed if . In our implementation , , , , and , where are unrestricted and output from the model, and is a constant included for stability. We found allows significant freedom of the perturbation while disallowing “barely invertible” functions.
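The steps of this argument can be written out explicitly for one five-parameter function consistent with the description in the main text. The specific form below, and the assumption $b > 0$, $d > 0$, are not recoverable from the text here and should be read as an illustrative assumption rather than a quotation of Eq. 1.

```latex
% Assumed NLSq form with pseudo-parameters a, b, c, d, g (b > 0, d > 0):
f(\epsilon) = a + b\,\epsilon + \frac{c}{1 + (d\,\epsilon + g)^2}

% Slope of the function:
f'(\epsilon) = b - \frac{2\,c\,d\,(d\,\epsilon + g)}{\left(1 + (d\,\epsilon + g)^2\right)^2}

% The second derivative vanishes where 3 (d\,\epsilon + g)^2 = 1:
\epsilon^{*} = \frac{-g \pm 1/\sqrt{3}}{d}

% Evaluating the slope at these two points, the smaller value is
\min_{\epsilon} f'(\epsilon) = b - \frac{3\sqrt{3}}{8}\,|c|\,d

% so the slope is positive everywhere, and f is invertible, whenever
|c| < \frac{8\sqrt{3}}{9}\,\frac{b}{d}
```

Which of the two critical points attains the minimum depends on the sign of $c$, matching the remark above about the distinction between maximum and minimum.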

The inverse of the NLSq function is analytically computable, which is important for efficient generation. Solving for in Eq. 1 gives the cubic equation

Under the invertibility condition above this is guaranteed to have one real root which can be found analytically (G.C.Holmes2002).

In practice, because the forward direction as written (applying ) requires fewer operations it is used for the reverse function , and the solution to the cubic equation is used for the forward function .

## Appendix C Variable length input

When working with non-autoregressive models we need to additionally deal with the variable length nature of the observed sequences. Unlike autoregressive models, which can emit an end-of-sentence token, non-autoregressive models require the length to be sampled initially. Given a sequence of length we can write

where the second equality comes from the fact that for . For unconditional sequence modeling we can use the empirical likelihood for , and then condition all parts of the model itself on . In this work we implement the conditioning as two one-hot vectors at every timestep , indicating the distance from the beginning and end of the sequence. Compared to other popular position encodings in the literature, such as the one commonly used in the Transformer (Vaswani2017), this primarily encodes the absolute length instead of the relative position between tokens needed in a self-attention based architecture.
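One way to build such conditioning features is sketched below. The function names, the zero-based indexing, and the fixed `max_len` used as the one-hot width are illustrative assumptions.

```python
def one_hot(index, size):
    """Return a list of zeros with a single one at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def length_features(seq_len, max_len):
    """For every timestep, concatenate a one-hot of the distance from the
    start with a one-hot of the distance from the end of the sequence."""
    features = []
    for t in range(seq_len):
        from_start = one_hot(t, max_len)
        from_end = one_hot(seq_len - 1 - t, max_len)
        features.append(from_start + from_end)
    return features


if __name__ == "__main__":
    for row in length_features(seq_len=3, max_len=4):
        print(row)
```

Because every timestep sees both distances, their sum reveals the total sequence length at each position.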

The generative process becomes:

## Appendix D Implementation and optimization details

During optimization, the expectation in the ELBO is approximated with 10 samples. 5 layers of AF-in-hidden or SCF-in-hidden flow are used for the AF / AF and AF / SCF models and 3 layers are used for the IAF / SCF models, for character-level language modeling. 5 layers of SCF-in-hidden are used for all models on the polyphonic datasets. The base density is a standard Gaussian. Adam is used as the optimizer with a learning rate of 1e-3 and a gradient clipping cutoff of 0.25. Dropout is used to regularize the baseline model and the LSTM in the prior of the AF / AF and AF / SCF models. All LSTMs are two layers deep, and all embedding and hidden layers are made up of 500 units. Weight tying between the input embedding of the encoder and output embedding of the decoder is employed.

A latent size of for each random vector and is used. During preliminary experiments we found that for character-level language modeling the results were nearly identical for .

Many recent works have found that it is necessary to bias the variational optimization to prevent posterior collapse, most commonly by using KL annealing or modifying the objective (Bowman2015; Kingma2016; Chen2016a), without which it is easy for the model to obtain strong performance by simply ignoring the latent code. In our case we similarly find that KL annealing is essential to learn a strong mapping. We hypothesize that while the decoder is extremely weak, the prior itself is powerful and thus the generative model overall is still powerful enough to require such a bias.

Specifically, for the language modeling task we use KL annealing with an initial period of 0 weight on the KL term for 4 epochs followed by a linear increase to the full ELBO across 10 epochs. This schedule allows the models to first encode the vocabulary in the continuous space with 0 reconstruction loss and then learn the statistical dependencies between tokens. For the polyphonic datasets we extend this to 0 weight for 20 epochs followed by a linear increase over 15 epochs, due to the reduced dataset size.
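The annealing schedule described above can be written as a small function of the epoch index. The zero-based epoch counting and the choice to start the linear ramp at weight 0 are assumptions; the paper's code may differ in these details.

```python
def kl_weight(epoch, zero_epochs=4, ramp_epochs=10):
    """Weight on the KL term: 0 for the first `zero_epochs` epochs, then a
    linear increase to 1 over the next `ramp_epochs` epochs."""
    if epoch < zero_epochs:
        return 0.0
    return min(1.0, (epoch - zero_epochs) / ramp_epochs)


if __name__ == "__main__":
    # Language modeling schedule: 4 epochs at zero, 10-epoch ramp.
    print([round(kl_weight(e), 2) for e in range(16)])
    # Polyphonic music schedule: 20 epochs at zero, 15-epoch ramp.
    print([round(kl_weight(e, 20, 15), 2) for e in range(0, 40, 5)])
```

The annealed training objective would then be the reconstruction term plus `kl_weight(epoch)` times the KL term.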