1 Introduction
There have been many recent advances in normalizing flows, a technique for constructing high-dimensional continuous distributions from invertible transformations of simple distributions (Rezende and Mohamed, 2015; Tabak and Turner, 2013; Rippel and Adams, 2013). Applications for high-dimensional continuous distributions are widespread: these include latent variable models with expressive posterior approximations (Rezende and Mohamed, 2015; Ranganath et al., 2016; Kingma et al., 2016a), parallel image generation (Dinh et al., 2017; Kingma and Dhariwal, 2018), parallel speech synthesis (Oord et al., 2017; Ping et al., 2018; Prenger et al., 2018), and general-purpose density estimation (Papamakarios et al., 2017).
Normalizing flows are based on the change-of-variables formula, which derives a density given an invertible function applied to continuous events. There have not been analogous advances for discrete distributions, where flows are typically thought to not be applicable. Instead, most research for discrete data has focused on building either latent-variable models with approximate inference (Bowman et al., 2015), or increasingly sophisticated autoregressive models that assume a fixed ordering of the data (Bengio et al., 2003; Vaswani et al., 2017).
In this paper, we present an alternative for flexible modeling of discrete sequences by extending continuous normalizing flows to the discrete setting. We construct discrete flows with two architectures:

Discrete autoregressive flows enable multiple levels of autoregressivity. For example, one can design a bidirectional language model of text where each token depends on both left-to-right and right-to-left contexts while maintaining an exact likelihood and sampling.

Discrete bipartite flows (i.e., with coupling layers similar to RealNVP (Dinh et al., 2017)) enable flexible models with parallel generation. For example, one can design non-autoregressive text models which maintain an exact likelihood for training and evaluation.
We evaluate discrete flows on a number of controlled problems: discretized mixture of Gaussians, full-rank discrete distributions, an addition task, and Potts models. In all settings we find that stacking discrete autoregressive flows yields improved performance over autoregressive baselines, and that bipartite flows can reach performance similar to autoregressive baselines while being fast to generate. Finally, we scale up discrete bipartite flows to character-level language modeling where we reach 1.38 bits per character on Penn Tree Bank and 1.23 bits per character on text8; their generation speed is over 100x faster than state-of-the-art autoregressive models.
1.1 Related Work
Bidirectional models.
Classically, bidirectional language models such as log-linear models and Markov random fields have been pursued, but they require either approximate inference (Mnih and Teh, 2012; Jernite et al., 2015) or approximate sampling (Berglund et al., 2015). Unlike bidirectional models, autoregressive models must impose a specific ordering, and this has been shown to matter across natural language processing tasks (Vinyals et al., 2015; Ford et al., 2018; Xia et al., 2017). Bidirectionality, such as in encoders, has been shown to significantly improve results in neural machine translation (Britz et al., 2017). Most recently, BERT has shown that bidirectional representations can significantly improve transfer tasks (Devlin et al., 2018). In this work, discrete autoregressive flows enable bidirectionality while maintaining the benefits of a (tractable) generative model.
Non-autoregressive models.
There have been several advances for flexible modeling with non-autoregressive dependencies, mostly for continuous distributions (Dinh et al., 2014, 2017; Kingma and Dhariwal, 2018). For discrete distributions, Reed et al. (2017) and Stern et al. (2018) have considered retaining blockwise dependencies while factorizing the graphical model structure in order to simulate hierarchically. Gu et al. (2018) and Kaiser et al. (2018) apply latent variable models for fast translation, where the prior is autoregressive and the decoder is conditionally independent. Lee et al. (2018) add an iterative refinement stage to initial parallel generations. Ziegler and Rush (2019) also apply latent variable models, with continuous non-autoregressive normalizing flows as the prior. In this work, discrete bipartite flows enable non-autoregressive generation while maintaining an exact density, analogous to RealNVP advances for image generation (Dinh et al., 2017). Most recently, Hoogeboom et al. (2019) proposed integer discrete flows, a concurrent unpublished work with ideas similar to discrete flows but with a flow transformation for ordinal data and applications to image compression and image generation. We find their results complement ours in illustrating the advantages of invertible functions which do not require log-determinant Jacobians and apply an approximate gradient.
2 Background
2.1 Normalizing Flows
Normalizing flows transform a probability distribution using an invertible function (Tabak and Turner, 2013; Rezende and Mohamed, 2015; Rippel and Adams, 2013). Let $\mathbf{x}$ be a $D$-dimensional continuous random variable whose density can be computed efficiently. Given an invertible function $f: \mathbb{R}^D \to \mathbb{R}^D$, the change-of-variables formula provides an explicit construction of the induced distribution on the function's output, $\mathbf{y} = f(\mathbf{x})$:
$$p(\mathbf{y}) = p(f^{-1}(\mathbf{y})) \left| \det \frac{d\mathbf{x}}{d\mathbf{y}} \right|. \tag{1}$$
The transformation $f$ is referred to as a flow and $p(\mathbf{x})$ is referred to as the base distribution. Composing multiple flows can induce further complex distributions that increase the expressivity of $p(\mathbf{y})$ (Rezende and Mohamed, 2015; Papamakarios et al., 2017).
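As a concrete check of the change-of-variables formula, the following minimal sketch applies it to a one-dimensional affine flow $y = a x + b$ with a standard normal base and compares the result with the closed-form normal density. The function names and the values of `a` and `b` are illustrative assumptions, not taken from the paper.

```python
import math

def std_normal_pdf(x):
    """Base density p(x) for x ~ N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def affine_flow_density(y, a, b):
    """Density of y = a * x + b via the change-of-variables formula."""
    x = (y - b) / a          # f^{-1}(y)
    jac = abs(1.0 / a)       # |dx/dy| for the affine map
    return std_normal_pdf(x) * jac

def normal_pdf(y, mean, std):
    """Closed-form N(mean, std^2) density, used only as a reference."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

a, b = 2.0, 0.5              # hypothetical flow parameters
gap = abs(affine_flow_density(1.3, a, b) - normal_pdf(1.3, b, abs(a)))
```

Under these assumptions `gap` is zero up to floating-point error, since an affine map of a Gaussian is again Gaussian.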
2.2 Flow Transformation
For an arbitrary invertible $f$, the determinant of the Jacobian incurs an $O(D^3)$ complexity, which is infeasible for high-dimensional datasets. Thus, normalizing flows are designed so that the determinant of the flow's Jacobian can be computed efficiently. Here, we review two popular flow transformations.
Autoregressive flows.
Autoregressive functions such as recurrent neural networks and Transformers (Vaswani et al., 2017) have been shown to successfully model sequential data across many domains. Specifically, assume a base distribution $\mathbf{x} \sim p(\mathbf{x})$. With $\mu$ and $\sigma$ as autoregressive functions of $\mathbf{y}$, i.e. $\mu_d, \sigma_d = f(y_1, \ldots, y_{d-1})$, and $\sigma_d > 0$ for all $d$, the flow computes a location-scale transform (Papamakarios et al., 2017; Kingma et al., 2016b),
$$y_d = \mu_d + \sigma_d \cdot x_d \quad \text{for } d \in 1, \ldots, D.$$
The transformation is invertible and the inverse can be vectorized and computed in parallel:
$$x_d = \sigma_d^{-1} (y_d - \mu_d) \quad \text{for } d \in 1, \ldots, D.$$
In addition to a fast-to-compute inverse, the autoregressive flow's Jacobian is lower-triangular, so its determinant is the product of the diagonal elements, $\prod_{d=1}^{D} \sigma_d$. This enables autoregressive flow models to have efficient log-probabilities for training and evaluation.
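The asymmetry between the two directions can be sketched as follows. The `mu_sigma` function is a hypothetical stand-in for an autoregressive network (it is an assumption, chosen only so that the scale stays positive), and the loops are written out explicitly for clarity.

```python
import math

def mu_sigma(prev_y):
    """Hypothetical autoregressive network: depends only on y_1..y_{d-1}."""
    s = sum(prev_y)
    return 0.5 * s, math.exp(0.1 * s)   # exp keeps the scale positive

def forward(x):
    """x -> y. Sequential, because mu_d and sigma_d need earlier outputs."""
    y = []
    for x_d in x:
        mu, sigma = mu_sigma(y)
        y.append(mu + sigma * x_d)
    return y

def inverse(y):
    """y -> x. Each dimension reads only the given y, so the loop body has
    no dependence between iterations and could run in parallel."""
    x, log_det = [], 0.0
    for d in range(len(y)):
        mu, sigma = mu_sigma(y[:d])
        x.append((y[d] - mu) / sigma)
        log_det += math.log(sigma)      # log of the product of diagonal scales
    return x, log_det

x = [0.3, -1.2, 0.7]
y = forward(x)
x_rec, log_det = inverse(y)             # x_rec recovers x up to rounding
```

Here `log_det` is the log-determinant of the map from the base variable to the output, accumulated from the diagonal scales only.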
Bipartite flows.
Real-valued non-volume preserving (RealNVP) uses another type of invertible transformation (Dinh et al., 2017) that nonlinearly transforms subsets of the input. For some $d < D$, a coupling layer follows a bipartite rather than autoregressive factorization:
$$y_{1:d} = x_{1:d}, \tag{2}$$
$$y_{d+1:D} = \mu + \sigma \cdot x_{d+1:D}, \tag{3}$$
where $\sigma$ and $\mu$ are functions of $x_{1:d}$ with $\sigma > 0$. Due to the bipartite nature of the transformation in coupling layers, we refer to them as bipartite flows. By changing the ordering of variables between each flow, the composition of bipartite flows can learn highly flexible distributions. By design, their Jacobian is lower-triangular, with a determinant that is the product of diagonal elements, $\prod_{i=d+1}^{D} \sigma_i$.
Bipartite flows are not as expressive as autoregressive flows, as a subset of variables do not undergo a transformation. However, both their forward and inverse computations are fast to compute, making them suitable for generative modeling where fast generation is desired.
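A coupling layer can be sketched in a few lines. As a simplifying assumption, the hypothetical `coupling_params` function below returns a single scalar location and scale shared by all transformed dimensions; a real coupling layer would typically output one pair per dimension from a neural network.

```python
import math

def coupling_params(x_first):
    """Hypothetical network: location and positive scale from x_{1:d} only."""
    s = sum(x_first)
    return math.tanh(s), 1.0 + s * s

def coupling_forward(x, d):
    """Copy the first d entries; location-scale transform the rest."""
    mu, sigma = coupling_params(x[:d])
    return x[:d] + [mu + sigma * v for v in x[d:]]

def coupling_inverse(y, d):
    """The first d entries are unchanged, so mu and sigma can be recomputed
    from y directly; both directions are a single pass."""
    mu, sigma = coupling_params(y[:d])
    x = y[:d] + [(v - mu) / sigma for v in y[d:]]
    log_det = (len(y) - d) * math.log(sigma)   # shared scalar scale
    return x, log_det

x = [0.5, -0.2, 1.0, 2.0]
y = coupling_forward(x, d=2)
x_rec, log_det = coupling_inverse(y, d=2)
```

Because neither direction loops over previously generated outputs, sampling and density evaluation have the same cost, which is the property that makes these flows attractive when fast generation matters.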
3 Discrete Flows
Normalizing flows depend on the change of variables formula (Equation 1) to compute the change in probability mass for the transformation. However, the change of variables formula applies only to continuous random variables. We extend normalizing flows to discrete events.
3.1 Discrete Change of Variables
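Let $\mathbf{x}$ be a discrete random variable and $\mathbf{y} = f(\mathbf{x})$ where $f$ is some function of $\mathbf{x}$. The induced probability mass function of $\mathbf{y}$ is the sum over the pre-image of $f$:
$$p(\mathbf{y} = y) = \sum_{x \in f^{-1}(y)} p(\mathbf{x} = x),$$
where $f^{-1}(y)$ is the set of all elements $x$ such that $f(x) = y$. For an invertible function $f$, this simplifies to
$$p(\mathbf{y} = y) = p(\mathbf{x} = f^{-1}(y)). \tag{4}$$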
This change of variables formula for discrete variables is similar to the continuous change of variables formula (Equation 1), but without the log-determinant-Jacobian. Intuitively, the log-determinant-Jacobian corrects for changes to the volume of a continuous space; volume does not exist for discrete distributions so there is no need to account for it. Computationally, Equation 4 is appealing as there are no restrictions on $f$ such as fast Jacobian computations in the continuous case, or tradeoffs in how the log-determinant-Jacobian influences the output density compared to the base density.
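The discrete change of variables amounts to moving probability mass between labels. A minimal sketch, using a made-up four-state distribution and two hypothetical functions (a cyclic shift, which is invertible, and a parity map, which is not):

```python
from collections import defaultdict

def pushforward(pmf_x, f):
    """Induced pmf of y = f(x): sum the mass over the pre-image of each y."""
    pmf_y = defaultdict(float)
    for x, p in pmf_x.items():
        pmf_y[f(x)] += p
    return dict(pmf_y)

pmf_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # illustrative values only

def shift(x):
    return (x + 1) % 4      # invertible: a relabeling of the states

def parity(x):
    return x % 2            # not invertible: two states map to each output

pmf_shift = pushforward(pmf_x, shift)
pmf_parity = pushforward(pmf_x, parity)
```

For the invertible `shift`, each output inherits exactly one input's mass (for example, `pmf_shift[0]` equals `pmf_x[3]`), with no volume correction. For `parity`, masses in the pre-image are added together.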
3.2 Discrete Flow Transformation
Next we develop discrete invertible functions. To build intuition, first consider the binary case. Given a $D$-dimensional binary vector $\mathbf{x}$, one natural function applies the XOR bitwise operator,
$$y_d = \mu_d \oplus x_d, \quad \text{for } d \in 1, \ldots, D,$$
where $\mu_d$ is a function of previous outputs, $y_1, \ldots, y_{d-1}$; $\oplus$ is the XOR function (0 if $\mu_d$ and $x_d$ are equal and 1 otherwise). The inverse is $x_d = \mu_d \oplus y_d$. We provide an example next.
Example.
Let $D = 2$, where $p(\text{data})$ is defined by a $2 \times 2$ probability table over the two binary variables.

The data distribution cannot be captured by a factorized one, $p(x_1)p(x_2)$. However, it can with a flow: set $f(x_1, x_2) = (x_1, x_1 \oplus x_2)$, with suitable probabilities for the factorized base distributions $p(x_1)$ and $p(x_2)$. The flow captures correlations that cannot be captured by the base alone. More broadly, discrete flows perform a multi-dimensional relabeling of the data such that it's easier to model with the base. This is analogous to continuous flows, which whiten the data such that it's easier to model with the base (typically, a spherical Gaussian).
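To make the XOR example concrete, the sketch below pushes a factorized two-bit base through an XOR flow and checks that the result is correlated. The base probabilities are hypothetical values chosen for illustration.

```python
from itertools import product

p_x1 = [0.7, 0.3]   # hypothetical factorized base, first bit
p_x2 = [0.9, 0.1]   # hypothetical factorized base, second bit

def flow(x1, x2):
    """XOR flow: the second output is shifted by the first."""
    return x1, x1 ^ x2

# Induced joint over outputs; the flow is invertible, so mass is only relabeled.
joint = {}
for x1, x2 in product([0, 1], repeat=2):
    joint[flow(x1, x2)] = p_x1[x1] * p_x2[x2]

# Compare the joint with the product of its own marginals.
p_y1_0 = joint[(0, 0)] + joint[(0, 1)]
p_y2_0 = joint[(0, 0)] + joint[(1, 0)]
dependence_gap = abs(joint[(0, 0)] - p_y1_0 * p_y2_0)
```

With these values the induced joint is 0.63, 0.07, 0.03, 0.27 over the outputs (0,0), (0,1), (1,0), (1,1), and `dependence_gap` is clearly nonzero, so no factorized distribution matches it even though the base itself is factorized.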
Modulo location-scale transform.
To extend XOR to the categorical setting, consider a $D$-dimensional vector $\mathbf{x}$, each element of which takes on values in $0, 1, \ldots, K-1$. One can perform location-scale transformations on the modulo integer space,
$$y_d = (\mu_d + \sigma_d \cdot x_d) \bmod K. \tag{5}$$
Here, $\mu_d$ and $\sigma_d$ are autoregressive functions of $\mathbf{y}$ taking on values in $0, 1, \ldots, K-1$ and $1, \ldots, K-1$ respectively. For this transformation to be invertible, $\sigma_d$ and $K$ must be coprime (an explicit solution for $\sigma_d^{-1}$ is given by Euclid's algorithm). An easy way to ensure coprimality is to set $K$ to be prime; mask noninvertible $\sigma$ values for a given $K$; or fix $\sigma = 1$. Setting $K = 2$ and $\sigma = 1$, it's easy to see that the modulo location-scale transform generalizes XOR. (We use $\sigma = 1$ for all experiments except character-level language modeling.)
The idea also extends to the bipartite flow setting: the functions $(\mu, \sigma)$ are set to $(0, 1)$ for a subset of the data dimensions, and are functions of that subset otherwise. Invertible discrete functions are widely used in random number generation, and could provide inspiration for alternatives to the location-scale transformation for constructing flows (Salmon et al., 2011).
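A minimal sketch of the modulo location-scale transform and its inverse follows. It relies on Python's three-argument `pow` with a negative exponent (available from Python 3.8) for the modular inverse; the autoregressive location function, a running sum of earlier outputs, is a hypothetical stand-in for a learned network.

```python
from math import gcd

def forward(x, mu, sigma, K):
    """y = (mu + sigma * x) mod K."""
    return (mu + sigma * x) % K

def inverse(y, mu, sigma, K):
    """x = sigma^{-1} (y - mu) mod K; requires sigma and K to be coprime."""
    if gcd(sigma, K) != 1:
        raise ValueError("sigma must be coprime with K")
    sigma_inv = pow(sigma, -1, K)       # modular inverse (Python 3.8+)
    return (sigma_inv * (y - mu)) % K

def ar_forward(x, K):
    """Autoregressive version with sigma fixed to 1 and a toy location."""
    y = []
    for x_d in x:
        mu = sum(y) % K                 # hypothetical location "network"
        y.append(forward(x_d, mu, 1, K))
    return y

def ar_inverse(y, K):
    """Each location depends only on the given y, so no sequential loop."""
    return [inverse(y[d], sum(y[:d]) % K, 1, K) for d in range(len(y))]

x = [3, 0, 4, 1]
y = ar_forward(x, K=5)
x_rec = ar_inverse(y, K=5)
```

Two quick checks illustrate the text: with `K = 2` and `sigma = 1` the forward map coincides with XOR, and with a non-coprime pair such as `sigma = 2`, `K = 6` the inputs 0 and 3 collide on the same output, so the map is no longer invertible.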
Example.
Figure 2 illustrates an example of using flows to model correlated categorical data. Following Metz et al. (2016), the data is drawn from a mixture of Gaussians with 8 means evenly spaced around a circle of radius 2. The samples have a fixed output variance, are truncated to a bounded range, and are discretized on a regular grid, resulting in two categorical variables (one for each coordinate) each with 90 states. A factorized base distribution cannot capture the data correlations, while a single discrete flow can. (Note the modulo location-scale transform does not make an ordinal assumption. We display ordinal data as an example only for visualization; other experiments use nonordinal data.)
3.3 Training Discrete Flows
With discrete flow models, the maximum likelihood objective per datapoint is
$$\log p(\mathbf{y}) = \log p(f^{-1}(\mathbf{y})),$$
where the flow $f$ has free parameters according to its autoregressive or bipartite network, and the base distribution $p$ has free parameters as a factorized (or itself an autoregressive) distribution. Gradient descent with respect to base distribution parameters is straightforward. To perform gradient descent with respect to flow parameters, one must backpropagate through the discrete-output functions $\mu$ and $\sigma$. We use the straight-through gradient estimator (Bengio et al., 2013). In particular, the (autoregressive or bipartite) network outputs two vectors of logits $\theta_d$ for each dimension $d$, one for the location and one for the scale. For the scale, we add a mask whose elements are negative infinity on noninvertible values such as 0. On the forward pass, we take the argmax of the logits, where for the location,
$$\mu_d = \operatorname{one\_hot}(\operatorname{argmax}(\theta_d)). \tag{6}$$
Because the argmax operation is not differentiable, we replace Equation 6 on the backward pass with the softmax-temperature function:
$$\frac{d\mu_d}{d\theta_d} \approx \frac{d}{d\theta_d} \operatorname{softmax}\left(\frac{\theta_d}{\tau}\right).$$
As the temperature $\tau \to 0$, the softmax-temperature becomes close to the argmax and the bias of the gradient estimator disappears. However, when $\tau$ is too low, the gradients vanish, inhibiting the optimization. Work with the Gumbel-softmax distribution indicates that this approximation works well when the number of classes $K$ is small (Maddison et al., 2016; Jang et al., 2017), which aligns with our experimental settings; we also fix $\tau$ to a constant value.
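The straight-through estimator can be sketched without an autodiff library by writing the two passes by hand. The forward pass emits a one-hot argmax; the backward pass multiplies the incoming gradient by the Jacobian of the temperature softmax. The logits, upstream gradient, and temperature below are illustrative assumptions.

```python
import math

def softmax(logits, tau):
    """Numerically stable softmax of logits / tau."""
    m = max(logits)
    e = [math.exp((l - m) / tau) for l in logits]
    z = sum(e)
    return [v / z for v in e]

def st_forward(logits):
    """Forward pass: hard one-hot vector at the argmax."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(logits))]

def st_backward(logits, grad_out, tau):
    """Backward pass: pretend the forward pass was softmax(logits / tau).
    For s = softmax, d s_i / d theta_j = s_i * (delta_ij - s_j) / tau."""
    s = softmax(logits, tau)
    dot = sum(g * p for g, p in zip(grad_out, s))
    return [p * (g - dot) / tau for g, p in zip(grad_out, s)]

logits = [0.1, 2.0, -1.0]
one_hot = st_forward(logits)                       # [0.0, 1.0, 0.0]
grad = st_backward(logits, [0.3, -0.7, 1.1], tau=0.5)
```

Lowering `tau` sharpens the softmax toward the argmax, which reduces the mismatch between the two passes but also drives the entries of `grad` toward zero, matching the tradeoff described above.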
4 Experiments
We perform a series of synthetic tasks to better understand discrete flows, and also perform character-level language modeling tasks. For all experiments with discrete autoregressive flows, we used an autoregressive Categorical base distribution where the first flow is applied in reverse ordering. (This setup lets us compare its advantage of bidirectionality to the baseline of an autoregressive base with 0 flows.) For all experiments with discrete bipartite flows, we used a factorized Categorical base distribution where the bipartite flows alternate masking of even and odd dimensions. We implement and make available discrete flows as part of Bayesian Layers (Tran et al., 2018).

Table 1: Negative log-likelihoods (nats) on full-rank discrete distributions; each row is one setting of the data dimension and number of classes.
Autoregressive Base  Autoregressive Flow  Factorized Base  Bipartite Flow
0.9                  0.9                  1.3              1.0
7.7                  7.6                  8.0              7.9
10.7                 10.3                 11.5             10.7
15.9                 15.7                 16.6             16.0
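
Table 2: Negative log-likelihoods (nats) on data simulated from Potts models, grouped by number of states; rows within a group vary the lattice size and coupling strength.
AR Base  AR Flow
number of states = 3
9.27     9.124
3.79     3.79
16.66    11.23
6.30     5.62
number of states = 4
11.64    10.45
5.87     5.56
number of states = 5
13.58    10.25
7.94     7.07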
4.1 Full-rank Discrete Distribution
To better understand the expressivity of discrete flows, we examined how well they could fit random full-rank discrete distributions. In particular, we sample a true set of probabilities for all $D$ dimensions of $K$ classes according to a Dirichlet distribution over the joint outcomes. For the network for the autoregressive base distribution and location parameters of the flows, we used a Transformer with 64 hidden units. We used a composition of 1 flow for the autoregressive flow models, and 4 flows for the bipartite flow models.
Table 1 displays negative log-likelihoods (nats) of trained models over data simulated from this distribution. Across the data dimension $D$ and number of classes $K$, autoregressive flows match or improve on the autoregressive base distribution, which has no flow on top, by up to 0.4 nats. Bipartite flows improve over their factorized base and in fact obtain nats competitive with the autoregressive base while remaining fully parallel for generation.
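One way to generate such a target distribution is sketched below: draw a Dirichlet sample over all joint configurations by normalizing independent Gamma draws, then compute its entropy, which is the lowest expected negative log-likelihood any model can reach on data from it. The concentration `alpha`, the seed, and the sizes are assumptions for illustration; the paper's exact settings are not reproduced here.

```python
import math
import random
from itertools import product

def sample_full_rank(D, K, alpha, rng):
    """Dirichlet(alpha, ..., alpha) sample over all K**D joint configurations,
    obtained by normalizing independent Gamma(alpha, 1) draws."""
    configs = list(product(range(K), repeat=D))
    g = [rng.gammavariate(alpha, 1.0) for _ in configs]
    z = sum(g)
    return {c: v / z for c, v in zip(configs, g)}

def entropy_nats(pmf):
    """Entropy in nats: a lower bound on the expected NLL of any model."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

rng = random.Random(0)                      # fixed seed for repeatability
pmf = sample_full_rank(D=2, K=3, alpha=1.0, rng=rng)
floor_nats = entropy_nats(pmf)
```

Comparing a trained model's negative log-likelihood with `floor_nats` indicates how much of the remaining gap is due to the model rather than to irreducible randomness in the data.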
4.2 Addition
Following Zaremba and Sutskever (2014), we examine an addition task: there are two input numbers with $D$ digits (each digit takes $K = 10$ values), and the output is their sum with $D$ digits (we remove the $(D+1)$-th digit if it appears). Addition naturally follows a right-to-left ordering: computing the leftmost digit requires carrying the remainder from the rightmost computations. Given an autoregressive base which imposes a left-to-right ordering, we examine whether the bidirectionality that flows offer can adjust for wrong orderings. While the output is deterministic, the flexibility of discrete flows may enable more accurate outputs. We use an LSTM to encode both inputs, apply 0 or 1 flows on the output, and then apply an LSTM to parameterize the autoregressive base where its initial state is set to the concatenation of the two encodings. All LSTMs use 256 hidden units for $D = 10$, and 512 hidden units for the larger setting of $D$.
For $D = 10$, an autoregressive base achieves 4.0 nats; an autoregressive flow achieves 0.2 nats (i.e., close to the true deterministic solution over all pairs of 10-digit numbers). A bipartite model with 1, 2, and 4 flows achieves 4.0, 3.17, and 2.58 nats respectively. For the larger setting of $D$, an autoregressive base achieves 12.2 nats; an autoregressive flow achieves 4.8 nats. A bipartite model with 1, 2, 4, and 8 flows achieves 12.2, 8.8, 7.6, and 5.08 nats respectively.
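The data for this task can be generated as in the sketch below, which encodes numbers as most-significant-first digit lists and drops the overflow digit by reducing the sum modulo $K^D$. The helper names and the digit ordering are assumptions made for illustration.

```python
import random

def to_digits(n, D, K):
    """Base-K digits of n, most significant first, zero-padded to length D."""
    digits = []
    for _ in range(D):
        digits.append(n % K)
        n //= K
    return digits[::-1]

def from_digits(digits, K):
    """Inverse of to_digits."""
    n = 0
    for d in digits:
        n = n * K + d
    return n

def addition_example(D, K, rng):
    """Two random D-digit inputs and their D-digit sum."""
    a = rng.randrange(K ** D)
    b = rng.randrange(K ** D)
    total = (a + b) % (K ** D)      # drops the (D+1)-th digit if it appears
    return to_digits(a, D, K), to_digits(b, D, K), to_digits(total, D, K)

a, b, s = addition_example(D=10, K=10, rng=random.Random(1))
```

Reading the output left to right, the first digit of `s` can depend on a carry that originates at the last position, which is why a left-to-right autoregressive model is poorly matched to this task.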
4.3 Potts Model
Given the bidirectional dependency enabled by discrete flows, we examined how they could be used for distilling undirected models with tractable energies but intractable sampling and likelihoods. We sampled from Potts models (the Categorical generalization of Ising models), which are 2D Markov random fields with pairwise interactions between neighbors (above/below, left/right, but not diagonally connected) (Wu, 1982). To generate data we ran 500 steps of Metropolis-Hastings, and evaluated the NLL of baselines and discrete flows as a function of the coupling strength, $J$. Low coupling strengths correspond to more independent states, while high coupling strengths result in more correlated states across space. For the base network, we used a single-layer LSTM with 32 hidden units. For the flow network, we used an embedding layer which returns a trainable location parameter for every unique combination of inputs.
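A Metropolis-Hastings sampler for a Potts model can be sketched as follows. Several details are assumptions rather than the paper's setup: the unnormalized log-probability is taken to be the coupling times the number of agreeing neighbor pairs, the lattice has free (non-wrapping) boundaries, and one "step" is a single-site proposal.

```python
import math
import random

def neighbors(i, j, n):
    """Above/below and left/right neighbors on an n x n grid (no wraparound)."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < n and 0 <= b < n:
            yield a, b

def agreement(grid, i, j, state):
    """Number of neighbors of site (i, j) that equal `state`."""
    return sum(1 for a, b in neighbors(i, j, len(grid)) if grid[a][b] == state)

def metropolis_hastings(n, num_states, coupling, steps, rng):
    """Single-site Metropolis-Hastings with uniform proposals."""
    grid = [[rng.randrange(num_states) for _ in range(n)] for _ in range(n)]
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)
        proposal = rng.randrange(num_states)
        # Change in unnormalized log-probability if the site is flipped.
        delta = coupling * (agreement(grid, i, j, proposal)
                            - agreement(grid, i, j, grid[i][j]))
        if delta >= 0 or rng.random() < math.exp(delta):
            grid[i][j] = proposal
    return grid

sample = metropolis_hastings(n=3, num_states=3, coupling=0.5,
                             steps=500, rng=random.Random(0))
```

Flattening `sample` row by row gives a sequence of categorical variables of the kind an autoregressive base or a discrete flow could be trained on; larger `coupling` values make neighboring entries agree more often.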
Table 2 displays negative log-likelihoods (nats) of trained models over data simulated from Potts models with varying lattice size and coupling strength. As Potts models are undirected models, the autoregressive base posits a poor inductive bias by fixing an ordering and sharing network parameters across the individual conditional distributions. Over data dimension $D$ and coupling $J$, autoregressive flows perform as well as, or improve upon, autoregressive base models. Appendix A includes samples from the model; they are visually indistinguishable from the data.
4.4 Character-Level Penn Tree Bank

Table 3: Character-level language modeling results on Penn Tree Bank.
Model                               Test NLL (bpc)  Generation
3-layer LSTM (Merity et al., 2018)  1.18            3.8 min
Ziegler and Rush (2019) (AF/SCF)    1.46            -
Ziegler and Rush (2019) (IAF/SCF)   1.63            -
Bipartite flow                      1.38            0.17 sec

We follow the setup of Ziegler and Rush (2019), which to the best of our knowledge is the only comparable work with non-autoregressive language modeling. We use Penn Tree Bank with minimal processing from Mikolov et al. (2012), consisting of roughly 5M characters and a character-level vocabulary of size $K$. We split the data into sentences and restrict to a max sequence length of 288. The LSTM baseline of Merity et al. (2018) uses 3 layers, truncated backpropagation with a sequence length of 200, embedding size of 400, and hidden size of 1850. (The LSTM results are only approximately comparable, as they do not apply the extra preprocessing step of removing sentences with more than 288 tokens.) Ziegler and Rush (2019)'s non-autoregressive models have two variants, in which they use a specific prior with a conditionally independent likelihood and fully factorized variational approximation. AF/SCF uses a prior over latent time steps and hidden dimension that's autoregressive in time and non-autoregressive in the hidden dimension; IAF/SCF is non-autoregressive in both. For the bipartite flow, we use 8 flows each with an embedding of size 400 and an LSTM with 915 hidden units.
Table 3 compares the test negative log-likelihood in bits per character as well as the time to generate a 288-dimensional sequence of tokens on an NVIDIA P100 GPU. The bipartite flow significantly outperforms Ziegler and Rush (2019), including their autoregressive/non-autoregressive hybrid. In addition, generation is over 1000x faster than the LSTM baseline. Intuitively, the use of bipartite flows means that we only have to perform one forward pass over the model as opposed to the 288 forward passes for a typical autoregressive model.
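The speedup follows directly from the two generation times reported for the LSTM baseline and the bipartite flow (3.8 minutes and 0.17 seconds). The short sketch below reproduces that arithmetic and also includes a nats-to-bits conversion, since the synthetic experiments report nats while the language modeling experiments report bits per character. The helper name is an illustrative assumption.

```python
import math

lstm_seconds = 3.8 * 60          # LSTM baseline: 3.8 minutes per sequence
flow_seconds = 0.17              # bipartite flow: 0.17 seconds per sequence
speedup = lstm_seconds / flow_seconds

def nats_to_bits(nats):
    """Convert a log-likelihood from nats to bits (divide by ln 2)."""
    return nats / math.log(2)

one_bit = nats_to_bits(math.log(2))
```

The ratio comes out at roughly 1340, consistent with the "over 1000x" figure.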
4.5 Character-Level text8
We also evaluated on text8, using the preprocessing of Mikolov et al. (2012) and Zhang et al. (2016), with 100M characters and a character-level vocabulary of size $K$. We split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. For discrete bipartite flows, we use a batch size of 128, sequence length of 256, a varying number of flows, and parameterize each flow with a Transformer with 2 or 3 layers, 512 hidden units, 2048 filter size, and 8 heads.
Figure 3 compares the test negative log-likelihood in bits per character as well as the time to generate one data point, i.e., a 256-dimensional sequence, on an NVIDIA P100 GPU. The bipartite flow reaches competitive performance, i.e., better than an LSTM baseline but not as good as the state-of-the-art bpc from the 235M parameter 64-layer Transformer (we're unfamiliar with previous non-autoregressive results to compare to). We also find that having a flexible scale ("w/ $\sigma$") improves performance over fixing $\sigma = 1$ and only learning the location transform $\mu$. The bipartite flows' generation times are significantly faster than the baselines, with upwards of a 100x speedup.
5 Discussion
We describe discrete flows, a class of invertible functions for flexible modeling of discrete data. Discrete autoregressive flows enable bidirectionality by stacking multiple levels of autoregressivity, each with varying order. Discrete bipartite flows enable non-autoregressive generation by flexibly modeling data with a sequence of bipartite-factorized flow transformations. Our experiments across a range of synthetic tasks and character-level text data show the promise of such approaches.
As future work, we're also investigating discrete inverse autoregressive flows, which enable flexible variational approximations for discrete latent variable models. An open question remains with scaling discrete flows to large numbers of classes: in particular, the straight-through gradient estimator works well for small numbers of classes such as for character-level language modeling, but it may not work for (sub)word-level modeling where the vocabulary size is greater than 5,000. In such settings, alternative gradient estimators or data representations may be fruitful. Lastly, there may be other invertible discrete functions used in cryptography and random number generation that could be leveraged for building up alternative forms for discrete flows (Salmon et al., 2011).
Acknowledgements
Keyon Vafa is supported by NSF grant DGE-1644869. We thank Michael W. Dusenberry, Kevin Murphy, Wei Ping, and Jakob Uszkoreit for helpful comments and feedback.
References

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Berglund, M., Raiko, T., Honkala, M., Kärkkäinen, L., Vetek, A., and Karhunen, J. T. (2015). Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
Britz, D., Goldie, A., Luong, M.-T., and Le, Q. (2017). Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.
Ford, N., Duckworth, D., Norouzi, M., and Dahl, G. E. (2018). The importance of generation order in language modeling. In Empirical Methods in Natural Language Processing.
Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. (2018). Non-autoregressive neural machine translation. In International Conference on Learning Representations.
Hoogeboom, E., Peters, J. W., van den Berg, R., and Welling, M. (2019). Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376.
Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations.
Jernite, Y., Rush, A., and Sontag, D. (2015). A fast variational approach for learning Markov random field language models. In International Conference on Machine Learning, pages 2209–2217.
Kaiser, Ł., Roy, A., Vaswani, A., Parmar, N., Bengio, S., Uszkoreit, J., and Shazeer, N. (2018). Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016a). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.
Kingma, D. P., Salimans, T., and Welling, M. (2016b). Improving variational inference with inverse autoregressive flow. In Neural Information Processing Systems.
Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
Merity, S., Keskar, N. S., and Socher, R. (2018). An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.
Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012). Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 8.
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.
Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2335–2344.
Ping, W., Peng, K., and Chen, J. (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.
Prenger, R., Valle, R., and Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002.
Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.
Reed, S., van den Oord, A., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Chen, Y., Belov, D., and de Freitas, N. (2017). Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2912–2921. JMLR.org.
Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning.
Rippel, O. and Adams, R. P. (2013). High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.
Salmon, J. K., Moraes, M. A., Dror, R. O., and Shaw, D. E. (2011). Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 16. ACM.
Stern, M., Shazeer, N., and Uszkoreit, J. (2018). Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pages 10107–10116.
Tabak, E. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164.
Tran, D., Dusenberry, M. W., van der Wilk, M., and Hafner, D. (2018). Bayesian layers: A module for neural network uncertainty. arXiv preprint arXiv:1812.03973.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vinyals, O., Bengio, S., and Kudlur, M. (2015). Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
Wu, F.-Y. (1982). The Potts model. Reviews of Modern Physics, 54(1):235.
Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., and Liu, T.-Y. (2017). Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794.
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv preprint arXiv:1410.4615.
Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R., and Bengio, Y. (2016). Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1822–1830.
Ziegler, Z. M. and Rush, A. M. (2019). Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548.