Discrete Flows: Invertible Generative Models of Discrete Data

05/24/2019 · by Dustin Tran, et al.

While normalizing flows have led to significant advances in modeling high-dimensional continuous distributions, their applicability to discrete distributions remains unknown. In this paper, we show that flows can in fact be extended to discrete events---and under a simple change-of-variables formula not requiring log-determinant-Jacobian computations. Discrete flows have numerous applications. We consider two flow architectures: discrete autoregressive flows that enable bidirectionality, allowing, for example, tokens in text to depend on both left-to-right and right-to-left contexts in an exact language model; and discrete bipartite flows that enable efficient non-autoregressive generation as in RealNVP. Empirically, we find that discrete autoregressive flows outperform autoregressive baselines on synthetic discrete distributions, an addition task, and Potts models; and bipartite flows can obtain competitive performance with autoregressive baselines on character-level language modeling for Penn Tree Bank and text8.


1 Introduction

There have been many recent advances in normalizing flows, a technique for constructing high-dimensional continuous distributions from invertible transformations of simple distributions (Rezende and Mohamed, 2015; Tabak and Turner, 2013; Rippel and Adams, 2013). Applications for high-dimensional continuous distributions are widespread: these include latent variable models with expressive posterior approximations (Rezende and Mohamed, 2015; Ranganath et al., 2016; Kingma et al., 2016a), parallel image generation (Dinh et al., 2017; Kingma and Dhariwal, 2018), parallel speech synthesis (Oord et al., 2017; Ping et al., 2018; Prenger et al., 2018), and general-purpose density estimation (Papamakarios et al., 2017).

Normalizing flows are based on the change-of-variables formula, which derives a density given an invertible function applied to continuous events. There have not been analogous advances for discrete distributions, where flows are typically thought to not be applicable. Instead, most research for discrete data has focused on building either latent-variable models with approximate inference (Bowman et al., 2015), or increasingly sophisticated autoregressive models that assume a fixed ordering of the data (Bengio et al., 2003; Vaswani et al., 2017).

In this paper, we present an alternative for flexible modeling of discrete sequences by extending continuous normalizing flows to the discrete setting. We construct discrete flows with two architectures:

  1. Discrete autoregressive flows enable multiple levels of autoregressivity. For example, one can design a bidirectional language model of text where each token depends on both left-to-right and right-to-left contexts while maintaining an exact likelihood and sampling.

  2. Discrete bipartite flows (i.e., with coupling layers similar to RealNVP (Dinh et al., 2017)) enable flexible models with parallel generation. For example, one can design nonautoregressive text models which maintain an exact likelihood for training and evaluation.

We evaluate discrete flows on a number of controlled problems: discretized mixture of Gaussians, full-rank discrete distributions, an addition task, and Potts models. In all settings we find that stacking discrete autoregressive flows yields improved performance over autoregressive baselines, and that bipartite flows can reach similar performance as autoregressive baselines while being fast to generate. Finally, we scale up discrete bipartite flows to character-level language modeling where we reach 1.38 bits per character on Penn Tree Bank and 1.23 bits per character on text8; their generation speed is over 100x faster than state-of-the-art autoregressive models.

1.1 Related Work

Bidirectional models.

Classically, bidirectional language models such as log-linear models and Markov random fields have been pursued, but they require either approximate inference (Mnih and Teh, 2012; Jernite et al., 2015) or approximate sampling (Berglund et al., 2015). Unlike bidirectional models, autoregressive models must impose a specific ordering, and this has been shown to matter across natural language processing tasks (Vinyals et al., 2015; Ford et al., 2018; Xia et al., 2017). Bidirectionality, such as in encoders, has been shown to significantly improve results in neural machine translation (Britz et al., 2017). Most recently, BERT has shown that bidirectional representations can significantly improve transfer tasks (Devlin et al., 2018). In this work, discrete autoregressive flows enable bidirectionality while maintaining the benefits of a (tractable) generative model.

Nonautoregressive models.

There have been several advances in flexible modeling with nonautoregressive dependencies, mostly for continuous distributions (Dinh et al., 2014, 2017; Kingma and Dhariwal, 2018). For discrete distributions, Reed et al. (2017) and Stern et al. (2018) have considered retaining blockwise dependencies while factorizing the graphical model structure in order to simulate hierarchically. Gu et al. (2018) and Kaiser et al. (2018) apply latent variable models for fast translation, where the prior is autoregressive and the decoder is conditionally independent. Lee et al. (2018) add an iterative refinement stage to initial parallel generations. Ziegler and Rush (2019) also apply latent variable models, with continuous non-autoregressive normalizing flows as the prior. In this work, discrete bipartite flows enable nonautoregressive generation while maintaining an exact density, analogous to RealNVP's advances for image generation (Dinh et al., 2017). Most recently, Hoogeboom et al. (2019) proposed integer discrete flows, concurrent unpublished work with ideas similar to discrete flows but with a flow transformation for ordinal data and applications to image compression and image generation. We find their results complement ours in illustrating the advantages of invertible functions that do not require log-determinant Jacobians and that apply an approximate gradient.

2 Background

2.1 Normalizing Flows

Normalizing flows transform a probability distribution using an invertible function (Tabak and Turner, 2013; Rezende and Mohamed, 2015; Rippel and Adams, 2013). Let $\mathbf{x}$ be a $D$-dimensional continuous random variable whose density $p(\mathbf{x})$ can be computed efficiently. Given an invertible function $f$, the change-of-variables formula provides an explicit construction of the induced distribution on the function's output, $\mathbf{y} = f(\mathbf{x})$:

p(\mathbf{y}) = p\big(f^{-1}(\mathbf{y})\big) \left| \det \frac{\partial f^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right|. \qquad (1)

The transformation $f$ is referred to as a flow and $p(\mathbf{x})$ is referred to as the base distribution. Composing multiple flows can induce further complex distributions that increase the expressivity of $p(\mathbf{y})$ (Rezende and Mohamed, 2015; Papamakarios et al., 2017).
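To make Equation 1 concrete, here is a minimal numpy sketch (ours, not from the paper) that pushes a standard-normal base through a toy invertible affine map and evaluates the induced density via the change-of-variables formula; the scale and shift values are arbitrary.

```python
import numpy as np

# Toy invertible flow y = f(x) = a * x + b, applied elementwise to a D-dim vector.
a, b = 2.0, 0.5                      # arbitrary scale/shift; a != 0 keeps f invertible

def flow(x):
    return a * x + b

def inverse_flow(y):
    return (y - b) / a

def log_prob_y(y):
    """Eq. 1: log p(y) = log p(f^{-1}(y)) + log |det d f^{-1}(y)/dy|."""
    x = inverse_flow(y)
    log_base = -0.5 * np.sum(x ** 2 + np.log(2 * np.pi))   # factorized standard-normal base
    log_det_inv_jac = -y.size * np.log(abs(a))             # d f^{-1}/dy = 1/a per dimension
    return log_base + log_det_inv_jac

y = flow(np.array([0.3, -1.2]))
print(log_prob_y(y))
```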

2.2 Flow Transformation

For an arbitrary invertible $f$, computing the determinant of the Jacobian incurs an $O(D^3)$ complexity, which is infeasible for high-dimensional datasets. Thus, normalizing flows are designed so that the determinant of the flow's Jacobian can be computed efficiently. Here, we review two popular flow transformations.

Autoregressive flows.

Autoregressive functions such as recurrent neural networks and Transformers (Vaswani et al., 2017) have been shown to successfully model sequential data across many domains. Specifically, assume a base distribution $\mathbf{x} \sim p(\mathbf{x})$. With $\mu_d$ and $\sigma_d$ as autoregressive functions of $\mathbf{y}$, i.e. functions of $y_1, \ldots, y_{d-1}$, and $\sigma_d > 0$ for all $d$, the flow computes a location-scale transform (Papamakarios et al., 2017; Kingma et al., 2016b),

y_d = \mu_d + \sigma_d \cdot x_d, \quad d = 1, \ldots, D.

The transformation is invertible, and the inverse can be vectorized and computed in parallel:

x_d = \frac{y_d - \mu_d}{\sigma_d}, \quad d = 1, \ldots, D.

In addition to a fast-to-compute inverse, the autoregressive flow's Jacobian is lower-triangular, so its determinant is the product of the diagonal elements, $\prod_{d=1}^{D} \sigma_d$. This enables autoregressive flow models to have efficient log-probabilities for training and evaluation.
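As an illustration of this location-scale flow (a sketch with our own toy parameterization, not the paper's code), the snippet below uses strictly lower-triangular linear maps for the location and log-scale so that each depends only on earlier outputs; the forward pass is sequential while the inverse is vectorized.

```python
import numpy as np

D = 5
rng = np.random.default_rng(0)
# Toy "network": mu_d and log sigma_d are linear in the previous outputs y_{<d}.
# Strictly lower-triangular weights ensure dimension d only sees y_1, ..., y_{d-1}.
W_mu = np.tril(rng.normal(size=(D, D)), k=-1)
W_ls = np.tril(rng.normal(size=(D, D)), k=-1)

def params(y):
    mu = W_mu @ y
    sigma = np.exp(W_ls @ y)          # sigma_d > 0
    return mu, sigma

def forward(x):
    """Sample: y_d = mu_d(y_<d) + sigma_d(y_<d) * x_d, computed sequentially."""
    y = np.zeros(D)
    for d in range(D):
        mu, sigma = params(y)         # row d only uses the already-filled entries y[:d]
        y[d] = mu[d] + sigma[d] * x[d]
    return y

def inverse(y):
    """Invert in parallel: x = (y - mu(y)) / sigma(y)."""
    mu, sigma = params(y)
    return (y - mu) / sigma

x = rng.normal(size=D)
y = forward(x)
print(np.allclose(inverse(y), x))     # True; log|det| of the inverse is -sum(log(sigma))
```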

Bipartite flows.

Real-valued non-volume preserving (RealNVP) uses another type of invertible transformation (Dinh et al., 2017) that nonlinearly transforms subsets of the input. For some $d < D$, a coupling layer follows a bipartite rather than autoregressive factorization:

y_{1:d} = x_{1:d}, \qquad (2)
y_{d+1:D} = \mu + \sigma \cdot x_{d+1:D}, \qquad (3)

where $\mu$ and $\sigma$ are functions of $x_{1:d}$ with $\sigma > 0$. Due to the bipartite nature of the transformation in coupling layers, we refer to them as bipartite flows. By changing the ordering of variables between each flow, the composition of bipartite flows can learn highly flexible distributions. By design, their Jacobian is lower-triangular, with a determinant that is the product of the diagonal elements, $\prod_{d'=d+1}^{D} \sigma_{d'}$.

Bipartite flows are not as expressive as autoregressive flows, since a subset of the variables does not undergo a transformation. However, both their forward and inverse passes are fast to compute, making them suitable for generative modeling where fast generation is desired.
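A minimal coupling-layer sketch, again with our own toy parameterization rather than the paper's networks, makes the bipartite structure explicit: the first d dimensions pass through unchanged and parameterize the transform of the remaining dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3                              # transform dims d..D-1 conditioned on dims 0..d-1
W_mu = rng.normal(size=(D - d, d))
W_ls = rng.normal(size=(D - d, d))

def coupling_forward(x):
    """Eqs. 2-3: y_{1:d} = x_{1:d};  y_{d+1:D} = mu(x_{1:d}) + sigma(x_{1:d}) * x_{d+1:D}."""
    mu = W_mu @ x[:d]
    sigma = np.exp(W_ls @ x[:d])
    return np.concatenate([x[:d], mu + sigma * x[d:]])

def coupling_inverse(y):
    mu = W_mu @ y[:d]                    # y_{1:d} equals x_{1:d}, so the parameters match
    sigma = np.exp(W_ls @ y[:d])
    return np.concatenate([y[:d], (y[d:] - mu) / sigma])

x = rng.normal(size=D)
print(np.allclose(coupling_inverse(coupling_forward(x)), x))   # True
```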

3 Discrete Flows

Normalizing flows depend on the change of variables formula (Equation 1) to compute the change in probability mass for the transformation. However, the change of variables formula applies only to continuous random variables. We extend normalizing flows to discrete events.

3.1 Discrete Change of Variables

Let $\mathbf{x}$ be a discrete random variable and $\mathbf{y} = f(\mathbf{x})$, where $f$ is some function of $\mathbf{x}$. The induced probability mass function of $\mathbf{y}$ is the sum over the pre-image of $f$:

p(\mathbf{y} = y) = \sum_{x \in f^{-1}(y)} p(\mathbf{x} = x),

where $f^{-1}(y)$ is the set of all elements $x$ such that $f(x) = y$. For an invertible function $f$, this simplifies to

p(\mathbf{y} = y) = p\big(\mathbf{x} = f^{-1}(y)\big). \qquad (4)

This change of variables formula for discrete variables is similar to the continuous change of variables formula (Equation 1), but without the log-determinant-Jacobian. Intuitively, the log-determinant-Jacobian corrects for changes to the volume of a continuous space; volume does not exist for discrete distributions, so there is no need to account for it. Computationally, Equation 4 is appealing as there are no restrictions on $f$ such as fast Jacobian computations in the continuous case, or tradeoffs in how the log-determinant-Jacobian influences the output density compared to the base density.
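A tiny illustration of Equation 4 (ours): if the invertible map is a permutation of K states, the induced pmf is simply the base pmf with relabeled outcomes, and no Jacobian correction appears.

```python
import numpy as np

K = 4
rng = np.random.default_rng(0)
base = rng.dirichlet(np.ones(K))          # base pmf p(x) over K states
perm = rng.permutation(K)                 # an invertible map y = f(x) = perm[x]
inv_perm = np.argsort(perm)               # f^{-1}: index x such that perm[x] == y

# Discrete change of variables (Eq. 4): p(y) = p(x = f^{-1}(y)), no Jacobian term.
p_y = base[inv_perm]

print(p_y.sum())                          # 1.0: still a valid pmf
```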

3.2 Discrete Flow Transformation

Figure 1: Flow transformation when computing log-likelihoods. (a) Discrete autoregressive flows stack multiple levels of autoregressivity. The receptive field of output unit 2 (red) includes both left and right contexts. (b) Discrete bipartite flows apply a binary mask (blue and green) which determines the subset of variables to transform. With 2 flows, the receptive field of output unit 2 covers all input units.

Next we develop discrete invertible functions. To build intuition, first consider the binary case. Given a $D$-dimensional binary vector $\mathbf{x}$, one natural function applies the XOR bitwise operator,

y_d = \mu_d \oplus x_d, \quad d = 1, \ldots, D,

where $\mu_d$ is a function of the previous outputs $y_1, \ldots, y_{d-1}$; $\oplus$ is the XOR function (0 if $\mu_d$ and $x_d$ are equal and 1 otherwise). The inverse is $x_d = \mu_d \oplus y_d$. We provide an example next.

Example.

Let $D = 2$, where $p(\mathbf{y})$ is defined by the following probability table over $(y_1, y_2)$:

           y2 = 0   y2 = 1
y1 = 0      0.63     0.07
y1 = 1      0.03     0.27

The data distribution cannot be captured by a factorized one, $p(y_1)p(y_2)$. However, it can with a flow: set $y_1 = x_1$ and $y_2 = x_1 \oplus x_2$; $p(x_1)$ with probabilities $(0.7, 0.3)$; and $p(x_2)$ with probabilities $(0.9, 0.1)$. The flow captures correlations that cannot be captured alone with the base. More broadly, discrete flows perform a multi-dimensional relabeling of the data such that it's easier to model with the base. This is analogous to continuous flows, which whiten the data such that it's easier to model with the base (typically, a spherical Gaussian).
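The arithmetic in this example can be verified directly; the short script below (ours) recovers the probability table from the factorized base and the XOR flow.

```python
import itertools

p_x1 = {0: 0.7, 1: 0.3}          # base marginal for x1
p_x2 = {0: 0.9, 1: 0.1}          # base marginal for x2

joint = {}
for x1, x2 in itertools.product([0, 1], repeat=2):
    y1, y2 = x1, x1 ^ x2         # the flow: y1 = x1, y2 = x1 XOR x2
    joint[(y1, y2)] = p_x1[x1] * p_x2[x2]

# Matches the table up to floating-point rounding:
# (0,0): 0.63, (0,1): 0.07, (1,0): 0.03, (1,1): 0.27
print(joint)
```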

Modulo location-scale transform. To extend XOR to the categorical setting, consider a $D$-dimensional vector $\mathbf{x}$, each element of which takes on values in $\{0, 1, \ldots, K-1\}$. One can perform location-scale transformations on the modulo integer space,

y_d = (\mu_d + \sigma_d \cdot x_d) \bmod K. \qquad (5)

Here, $\mu_d$ and $\sigma_d$ are autoregressive functions of $\mathbf{y}$ taking on values in $\{0, \ldots, K-1\}$ and $\{1, \ldots, K-1\}$ respectively. For this transformation to be invertible, $\sigma_d$ and $K$ must be coprime (an explicit solution for $\sigma_d^{-1}$ is given by Euclid's algorithm). An easy way to ensure coprimality is to set $K$ to be prime; to mask noninvertible $\sigma_d$ values for a given $K$; or to fix $\sigma_d = 1$. Setting $K = 2$ and $\sigma_d = 1$, it's easy to see that the modulo location-scale transform generalizes XOR. (We use $\sigma_d = 1$ for all experiments except character-level language modeling.)
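The following sketch (our code, with arbitrary values for K, the location, and the scale) applies the modulo location-scale transform of Equation 5 to a single dimension and inverts it using the modular inverse of the scale, which Python's built-in pow computes via the extended Euclidean algorithm.

```python
K = 5                      # number of classes (prime, so any nonzero sigma is coprime with K)
mu, sigma = 3, 2           # location and scale for one dimension (sigma != 0)

def forward(x):
    """Eq. 5: y = (mu + sigma * x) mod K."""
    return (mu + sigma * x) % K

def inverse(y):
    """x = sigma^{-1} * (y - mu) mod K, using the modular inverse of sigma."""
    sigma_inv = pow(sigma, -1, K)          # Python 3.8+: modular inverse via extended Euclid
    return (sigma_inv * (y - mu)) % K

xs = list(range(K))
ys = [forward(x) for x in xs]
print(ys)                                  # a permutation of {0, ..., K-1}
print([inverse(y) for y in ys] == xs)      # True
```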

The idea also extends to the bipartite flow setting: $(\mu_d, \sigma_d)$ are set to $(0, 1)$ for a subset of the data dimensions, and are functions of that subset otherwise. Invertible discrete functions are widely used in random number generation, and could provide inspiration for alternatives to the location-scale transform for constructing flows (Salmon et al., 2011).

Example.

Figure 2 illustrates an example of using flows to model correlated categorical data. Following Metz et al. (2016), the data is drawn from a mixture of Gaussians with 8 means evenly spaced around a circle of radius 2. The output variance is small, with samples truncated to lie between -2.25 and 2.25, and we discretize at the 0.05 level, resulting in two categorical variables (one per coordinate), each with 90 states. A factorized base distribution cannot capture the data correlations, while a single discrete flow can. (Note the modulo location-scale transform does not make an ordinal assumption. We display ordinal data as an example only for visualization; other experiments use non-ordinal data.)

Figure 2: Learning a discretized mixture of Gaussians with maximum likelihood; panels show (a) the data, (b) a factorized base, and (c) 1 flow. Discrete flows help capture the multi-dimensional modes, which a factorized distribution cannot. (Note that because the data is 2-D, discrete autoregressive flows and discrete bipartite flows are equivalent.)

3.3 Training Discrete Flows

With discrete flow models, the maximum likelihood objective per datapoint is

\log p(\mathbf{y}) = \log p\big(f^{-1}(\mathbf{y})\big),

where the flow $f$ has free parameters according to its autoregressive or bipartite network, and the base distribution $p$ has free parameters as a factorized (or itself an autoregressive) distribution. Gradient descent with respect to base distribution parameters is straightforward. To perform gradient descent with respect to flow parameters, one must backpropagate through the discrete-output functions $\mu$ and $\sigma$. We use the straight-through gradient estimator (Bengio et al., 2013). In particular, the (autoregressive or bipartite) network outputs two vectors of logits $\boldsymbol{\theta}_d$ for each dimension $d$, one each for the location and scale. For the scale, we add a mask whose elements are negative infinity on non-invertible values such as 0. On the forward pass, we take the argmax of the logits, where for the location,

\mu_d = \text{one\_hot}\big(\text{argmax}(\boldsymbol{\theta}_d)\big). \qquad (6)

Because the argmax operation is not differentiable, we replace Equation 6 on the backward pass with the softmax-temperature function:

\mu_d \approx \text{softmax}\left(\frac{\boldsymbol{\theta}_d}{\tau}\right).

As the temperature $\tau \to 0$, the softmax-temperature becomes close to the argmax and the bias of the gradient estimator disappears. However, when $\tau$ is too low, the gradients vanish, inhibiting the optimization. Work with the Gumbel-softmax distribution indicates that this approximation works well for moderate numbers of classes (Maddison et al., 2016; Jang et al., 2017), which aligns with our experimental settings; we also fix the temperature $\tau$ throughout.
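To illustrate the straight-through trick described above (a self-contained sketch, not the paper's implementation), the snippet below takes a hard one-hot argmax on the forward pass and backpropagates through a tempered-softmax surrogate, computing the surrogate gradient explicitly.

```python
import numpy as np

tau = 0.5                                    # softmax temperature (our choice)

def one_hot(i, K):
    v = np.zeros(K)
    v[i] = 1.0
    return v

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(theta):
    """Forward pass: hard one-hot argmax of the location logits (Eq. 6)."""
    return one_hot(np.argmax(theta), theta.size)

def straight_through_grad(theta, dloss_dmu):
    """Backward pass: pretend mu = softmax(theta / tau) and chain through it."""
    s = softmax(theta / tau)
    # Jacobian of softmax(theta / tau) w.r.t. theta: (diag(s) - s s^T) / tau
    jac = (np.diag(s) - np.outer(s, s)) / tau
    return jac.T @ dloss_dmu

theta = np.array([0.2, 1.5, -0.3])
mu = forward(theta)                          # e.g. [0, 1, 0]
dloss_dmu = np.array([0.1, -0.4, 0.3])       # some upstream gradient w.r.t. the one-hot output
print(mu, straight_through_grad(theta, dloss_dmu))
```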

4 Experiments

We perform a series of synthetic tasks to better understand discrete flows, and also perform character-level language modeling tasks. For all experiments with discrete autoregressive flows, we used an autoregressive Categorical base distribution where the first flow is applied in reverse ordering. (This setup lets us compare its advantage of bidirectionality to the baseline of an autoregressive base with 0 flows.) For all experiments with discrete bipartite flows, we used a factorized Categorical base distribution where the bipartite flows alternate masking of even and odd dimensions. We implement and make available discrete flows as part of Bayesian Layers (Tran et al., 2018).

Autoregressive Base Autoregressive Flow Factorized Base Bipartite Flow
0.9 0.9 1.3 1.0
7.7 7.6 8.0 7.9
10.7 10.3 11.5 10.7
15.9 15.7 16.6 16.0
Table 1: Negative log-likelihoods for the full-rank discrete distribution (lower is better). Autoregressive flows improve over their autoregressive base. Bipartite flows improve over their factorized base and achieve negative log-likelihoods close to those of an autoregressive model while remaining parallel.
AR Base  AR Flow
number of states = 3
9.27   9.124
3.79   3.79
16.66  11.23
6.30   5.62
number of states = 4
11.64  10.45
5.87   5.56
number of states = 5
13.58  10.25
7.94   7.07
Table 2: Negative log-likelihoods on the square-lattice Potts model (lower is better). Rows within each block vary the lattice size and coupling strength; higher coupling strength corresponds to more spatial correlations.

4.1 Full-rank Discrete Distribution

To better understand the expressivity of discrete flows, we examined how well they could fit random full-rank discrete distributions. In particular, we sample a true set of probabilities over all $D$ dimensions of $K$ classes according to a Dirichlet distribution. For the network parameterizing the autoregressive base distribution and the location parameters of the flows, we used a Transformer with 64 hidden units. We used a composition of 1 flow for the autoregressive flow models, and 4 flows for the bipartite flow models.
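As a sketch of how such a synthetic target can be built (our own construction; the exact Dirichlet size and concentration from the paper are not reproduced), one can draw a probability vector over all K^D joint outcomes and sample training data from it:

```python
import numpy as np

D, K = 2, 5                                   # data dimension and number of classes
rng = np.random.default_rng(0)

# A "full-rank" discrete target: one probability per joint outcome in {0..K-1}^D.
probs = rng.dirichlet(np.ones(K ** D))

# Sample flat outcome indices, then unravel into D categorical coordinates.
flat = rng.choice(K ** D, size=1000, p=probs)
data = np.stack(np.unravel_index(flat, (K,) * D), axis=-1)   # shape (1000, D)

print(data.shape, data.max())                 # (1000, 2) and a value < K
```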

Table 1 displays negative log-likelihoods (nats) of trained models over data simulated from this distribution. Across data dimensions and numbers of classes, autoregressive flows gain several nats over the autoregressive base distribution, which has no flow on top. Bipartite flows improve over their factorized base and in fact obtain negative log-likelihoods competitive with the autoregressive base while remaining fully parallel for generation.

4.2 Addition

Following Zaremba and Sutskever (2014), we examine an addition task: there are two input numbers with $D$ digits (each digit takes one of 10 values), and the output is their sum, also with $D$ digits (we drop the overflow digit if it appears). Addition naturally follows a right-to-left ordering: computing the leftmost digit requires carrying the remainder from the rightmost computations. Given an autoregressive base that imposes a left-to-right ordering, we examine whether the bidirectionality that flows offer can adjust for wrong orderings. While the output is deterministic, the flexibility of discrete flows may enable more accurate outputs. We use an LSTM to encode both inputs, apply 0 or 1 flows on the output, and then apply an LSTM to parameterize the autoregressive base, where its initial state is set to the concatenation of the two encodings. All LSTMs use 256 hidden units for the smaller digit setting and 512 hidden units for the larger one.
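A hedged sketch of the data generation for this task (ours; the exact encoding used in the paper may differ): sample two D-digit numbers, add them, and keep only the lowest D digits of the sum.

```python
import random

def addition_example(D=10):
    """Return (a_digits, b_digits, sum_digits) with the overflow digit dropped."""
    a = random.randrange(10 ** D)
    b = random.randrange(10 ** D)
    s = (a + b) % (10 ** D)                      # drop the (D+1)-th digit if it appears
    to_digits = lambda n: [int(c) for c in str(n).zfill(D)]
    return to_digits(a), to_digits(b), to_digits(s)

random.seed(0)
print(addition_example(D=4))
```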

For $D = 10$, an autoregressive base achieves 4.0 nats; an autoregressive flow achieves 0.2 nats (i.e., close to the true deterministic solution over all pairs of 10-digit numbers). A bipartite model with 1, 2, and 4 flows achieves 4.0, 3.17, and 2.58 nats respectively. For the larger digit setting, an autoregressive base achieves 12.2 nats; an autoregressive flow achieves 4.8 nats. A bipartite model with 1, 2, 4, and 8 flows achieves 12.2, 8.8, 7.6, and 5.08 nats respectively.

4.3 Potts Model

Given the bidirectional dependency enabled by discrete flows, we examined how they could be used for distilling undirected models with tractable energies but intractable sampling and likelihoods. We sampled from Potts models (the Categorical generalization of Ising models), which are 2-D Markov random fields with pairwise interactions between neighbors (above/below and left/right, but not diagonally connected) (Wu, 1982). To generate data we ran 500 steps of Metropolis-Hastings, and evaluated the NLL of baselines and discrete flows as a function of the coupling strength. Low coupling strengths correspond to more independent states, while high coupling strengths result in more correlated states across space. For the base network, we used a single-layer LSTM with 32 hidden units. For the flow network, we used an embedding layer which returns a trainable location parameter for every unique combination of inputs.
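For concreteness, here is a minimal Metropolis-Hastings sampler for a square-lattice Potts model (our own sketch with periodic boundaries; the paper's lattice sizes, coupling values, and proposal details are not reproduced exactly).

```python
import numpy as np

def potts_mh(L=4, num_states=3, coupling=0.5, steps=500, seed=0):
    """Metropolis-Hastings for a square-lattice Potts model with nearest-neighbor
    interactions: p(s) proportional to exp(coupling * sum over neighbors of 1[s_i == s_j])."""
    rng = np.random.default_rng(seed)
    s = rng.integers(num_states, size=(L, L))

    def local_score(grid, i, j, value):
        # Coupling times the number of agreements with the four (wrap-around) neighbors.
        neighbors = [grid[(i - 1) % L, j], grid[(i + 1) % L, j],
                     grid[i, (j - 1) % L], grid[i, (j + 1) % L]]
        return coupling * sum(value == n for n in neighbors)

    for _ in range(steps):
        for i in range(L):
            for j in range(L):
                proposal = rng.integers(num_states)
                delta = local_score(s, i, j, proposal) - local_score(s, i, j, s[i, j])
                if np.log(rng.random()) < delta:     # accept with prob min(1, exp(delta))
                    s[i, j] = proposal
    return s

print(potts_mh())
```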

Table 2 displays negative log-likelihoods (nats) of trained models over data simulated from Potts models with varying lattice size and coupling strength. As Potts models are undirected models, the autoregressive base posits a poor inductive bias by fixing an ordering and sharing network parameters across the individual conditional distributions. Across lattice sizes and coupling strengths, autoregressive flows perform as well as, or improve upon, autoregressive base models. Appendix A includes samples from the model; they are visually indistinguishable from the data.

4.4 Character-Level Penn Tree Bank

Test NLL (bpc) Generation
3-layer LSTM (Merity et al., 2018) 1.18 3.8 min
Ziegler and Rush (2019) (AF/SCF) 1.46 -
Ziegler and Rush (2019) (IAF/SCF) 1.63 -
Bipartite flow 1.38 0.17 sec
Table 3: Character-level language modeling results on Penn Tree Bank.

We follow the setup of Ziegler and Rush (2019), which to the best of our knowledge is the only comparable work with nonautoregressive language modeling. We use Penn Tree Bank with minimal processing from Mikolov et al. (2012), consisting of roughly 5M characters with a character-level vocabulary. We split the data into sentences and restrict to a max sequence length of 288. The LSTM baseline of Merity et al. (2018) uses 3 layers, truncated backpropagation with a sequence length of 200, an embedding size of 400, and a hidden size of 1850. (The LSTM results are only approximately comparable, as they do not apply the extra preprocessing step of removing sentences longer than 288 tokens.) Ziegler and Rush (2019)'s nonautoregressive models have two variants, in which they use a specific prior with a conditionally independent likelihood and a fully factorized variational approximation. AF/SCF uses a prior over latent time steps and hidden dimensions that is autoregressive in time and nonautoregressive in the hidden dimension; IAF/SCF is nonautoregressive in both. For the bipartite flow, we use 8 flows, each with an embedding of size 400 and an LSTM with 915 hidden units.

Table 3 compares the test negative log-likelihood in bits per character as well as the time to generate a 288-dimensional sequence of tokens on an NVIDIA P100 GPU. The bipartite flow significantly outperforms Ziegler and Rush (2019), including their autoregressive/nonautoregressive hybrid. In addition, the generation time is over 1000x faster than the LSTM baseline. Intuitively, the use of bipartite flows means that we only have to perform one forward pass over the model, as opposed to the 288 forward passes for a typical autoregressive model.

4.5 Character-Level text8

Model | bpc | Gen.
LSTM (Cooijmans+2016) | 1.43 | 19.8s
64-layer Transformer (Al-Rfou+2018) | 1.13 | 35.5s
Bipartite flow (4 flows, w/ scale) | 1.60 | 0.15s
Bipartite flow (8 flows, w/o scale) | 1.29 | 0.16s
Bipartite flow (8 flows, w/ scale) | 1.23 | 0.16s
Figure 3: Character-level language modeling results on text8. The test bits per character decrease as the number of flows increases. Using more hidden units and layers in the Transformer per flow, and applying a scale transformation instead of only a location transformation, also improve performance.

We also evaluated on text8, using the preprocessing of Mikolov et al. (2012) and Zhang et al. (2016), with 100M characters and a vocabulary size of 27. We split the data into 90M characters for train, 5M characters for dev, and 5M characters for test. For discrete bipartite flows, we use a batch size of 128, a sequence length of 256, and a varying number of flows, and parameterize each flow with a Transformer with 2 or 3 layers, 512 hidden units, a filter size of 2048, and 8 heads.

Figure 3 compares the test negative log-likelihood in bits per character as well as the time to generate one data point, i.e., a 256-dimensional sequence, on an NVIDIA P100 GPU. The bipartite flow reaches competitive performance: better than an LSTM baseline but not as good as the state-of-the-art bpc from the 235M-parameter 64-layer Transformer (we are unfamiliar with previous nonautoregressive results to compare to). We also find that having a flexible scale ("w/ scale") improves performance over fixing the scale to 1 and learning only the location transform. The bipartite flows' generation times are significantly faster than the baselines, with upwards of a 100x speedup.

5 Discussion

We describe discrete flows, a class of invertible functions for flexible modeling of discrete data. Discrete autoregressive flows enable bidirectionality by stacking multiple levels of autoregressivity, each with varying order. Discrete bipartite flows enable nonautoregressive generation by flexibly modeling data with a sequence of bipartite-factorized flow transformations. Our experiments across a range of synthetic tasks and character-level text data show the promise of such approaches.

As future work, we’re also investigating discrete inverse autoregressive flows, which enable flexible variational approximations for discrete latent variable models. An open question remains with scaling discrete flows to large numbers of classes: in particular, the straight-through gradient estimator works well for small numbers of classes such as for character-level language modeling, but it may not work for (sub)word-level modeling where the vocabulary size is greater than 5,000. In such settings, alternative gradient estimators or data representations may be fruitful. Lastly, there may be other invertible discrete functions used in cryptography and random number generation that could be leveraged for building up alternative forms for discrete flows (Salmon et al., 2011).

Acknowledgements

Keyon Vafa is supported by NSF grant DGE-1644869. We thank Michael W. Dusenberry, Kevin Murphy, Wei Ping, and Jakob Uszkoreit for helpful comments and feedback.

References

Appendix A Samples from Potts Model

Figure 4: Samples from the Potts model, for a 4x4 grid with 3 states and coupling strength 0.1.