 Cubic-Spline Flows

A normalizing flow models a complex probability density as an invertible transformation of a simple density. The invertibility means that we can evaluate densities and generate samples from a flow. In practice, autoregressive flow-based models are slow to invert, making either density estimation or sample generation slow. Flows based on coupling transforms are fast for both tasks, but have previously performed less well at density estimation than autoregressive flows. We stack a new coupling transform, based on monotonic cubic splines, with LU-decomposed linear layers. The resulting cubic-spline flow retains an exact one-pass inverse, can be used to generate high-quality images, and closes the gap with autoregressive flows on a suite of density-estimation tasks.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Normalizing flows are flexible probabilistic models of continuous data. A normalizing flow models data as the output of an invertible smooth transformation of noise :

 \bfx=\bff\roundbr\bfuwhere\bfu∼π\roundbr\bfu. (1)

Typically, the noise distribution is taken to be simple (a standard normal is a common choice), whereas the transformation and its inverse

are implemented by an invertible neural network. The probability density of

under the flow is obtained by a change of variables:

 p\roundbr\bfx=π\roundbr\bff−1\roundbr\bfx\vertbrdet\roundbr∂\bff−1∂\bfx. (2)

Given training data

, the flow is trained with backpropagation, by maximizing the total log likelihood

with respect to the parameters of the transformation . Normalizing flows have been successfully used for density estimation [5, 23], variational inference [17, 27, 29], image, audio and video generation [14, 16, 18, 26], and likelihood-free inference .

In practice, the Jacobian determinant of must be tractable, so that the density can be computed efficiently. Several invertible transformations with a tractable Jacobian determinant have been proposed, including coupling transforms [4, 5], autoregressive transforms [3, 13, 17, 23], invertible convolutions [11, 16], transformations based on the matrix-determinant lemma [27, 29], and continuous transformations based on differential equations [2, 8].

Even though the above transformations are invertible in theory, they are not always efficiently invertible in practice. For instance, the affine autoregressive transforms used by MAF  and IAF  are times slower to invert than to evaluate, where is the dimensionality of . Even worse, inverting the non-affine transformations used by NAF  and block-NAF  would require numerical optimization. In practice, the forward transformation is needed for generating data, whereas the inverse transformation is needed for computing ; for a flow to be applicable in both situations, both directions need to be fast.

Transformations that are equally fast to invert and evaluate do exist. Affine coupling transforms, used by Real NVP  and Glow , only require a single neural-network pass in either direction; however, more flexible autoregressive transforms have obtained better density-estimation results. More flexible coupling transforms have been proposed, based on piecewise-quadratic splines , although these have not previously been evaluated as density estimators. Continuous flows such as Neural ODEs  and FFJORD  are also equally fast in both directions, but they require numerically integrating a differential equation in each direction, which can be slower than a single neural-network pass.

In this paper we introduce cubic-spline flows, which, like Real NVP and Glow, only require a single neural-network pass in either the forward or the inverse direction, but in practice are as flexible as state-of-the-art autoregressive flows. Cubic-spline flows alternate between non-affine coupling transforms that use flexible monotonic cubic splines to transform their input, and LU-decomposed linear layers that combine their input across dimensions. Experimentally, we show that cubic-spline flows match or exceed state-of-the-art performance in density estimation, and that they are capable of generating high-quality images with only a single neural-network pass. Figure 1: Left: Illustration of a random monotonic cubic spline with K=10 bins which maps [0,1] to [0,1]. Monotonic splines act as drop-in replacements for affine or additive transformations in coupling layers. Right: Since the cubic spline can be interpreted as a cumulative distribution function, its derivative defines a probability density function which is piecewise quadratic.

2 Method

2.1 Coupling transforms

A coupling transform maps an input to an output in the following way:

1. [leftmargin=*, topsep=0pt, itemsep=0pt]

2. Split the input into two parts, .

3. Compute parameters , where NN is an arbitrary neural network.

4. Compute for , where is an invertible function parameterized by .

5. Set , and return .

The Jacobian matrix of a coupling transform is lower triangular, since is given by transforming elementwise as a function of , and is equal to . Thus, the Jacobian determinant of the coupling transform is

 det\roundbr∂ϕ∂\bfx=D∏i=d∂g\bfthetai∂xi. (3)

Coupling transforms solve two important problems for normalizing flows: first, they have a tractable Jacobian determinant; second, they can be inverted exactly in a single pass. The inverse of a coupling transform can be easily computed by running steps 1–4 above, this time inputting , and using to compute in step 3.

2.2 Monotonic-spline coupling transforms

Typically, the function takes the form of an affine or additive transformation for computational ease. The affine transformation is given by:

 g\bfthetai(xi)=αixi+βiwhere\bfthetai=\curlybrαi,βi. (4)

The additive transformation corresponds to the special case . Both the affine and the additive transformation are easy to invert, but they lack flexibility.

Nonetheless, we are free to choose any function as long as it is differentiable and we can easily invert it. Recently, Müller et al.  proposed a powerful generalization of the above forms, based on monotonic piecewise polynomials. The idea is to restrict the input domain of between and , partition the input domain into bins, and define to be a simple polynomial segment within each bin (see fig. 1). Müller et al.  restrict themselves to monotonically-increasing linear and quadratic polynomial segments, whose coefficients are parameterized by . Moreover, the polynomial segments are restricted to match at the boundaries so that is continuous. Functions of this form are known as polynomial splines.

2.3 Monotonic cubic-spline coupling transforms

Inspired by this direction of research, we introduce cubic-spline flows, a natural extension to the framework of Müller et al. . We propose to implement as a monotonic cubic spline , where each segment is defined by a monotonically-increasing cubic polynomial (fig. 1). Cubic splines are well-known across science, finding use in many applications, and their theory is well-explored [6, 12].

We employ Steffen’s method  to parameterize a monotonically-increasing cubic spline that maps to . Given a fixed number of bins , Steffen’s method takes coordinates , known as knots, and two values for the derivatives at inputs and . Then, the method constructs a continuously-differentiable cubic spline that passes through the knots, has the given boundary derivatives, and is monotonic on each segment (for details, see section A.1). To ensure the spline is monotonically-increasing and maps to , we require:

 (x0,y0)=(0,0),(xK,yK)=(1,1), (5) xk

Our implementation of the cubic-spline coupling transform is as follows:

1. [leftmargin=*, topsep=0pt, itemsep=0pt]

2. A neural network NN takes

and outputs an unconstrained parameter vector

of length for each .

3. Vector is partitioned as , where and have length , and has length .

4. Vectors and are each passed through a softmax; the outputs are interpreted as the widths and heights of the bins, which must be positive and sum to .

5. A cumulative sum of the bin widths and heights yields the knots . Vector is interpreted as the two boundary derivatives.

Evaluating a cubic spline at location requires finding the bin in which lies, which can be done efficiently with binary search, since the bins are sorted. The Jacobian determinant of the cubic-spline coupling transform can be computed in closed form as a product of quadratic-polynomial derivatives (see section A.2). For the inverse, we need to compute the roots of a cubic polynomial whose constant term depends on the desired value to invert, which can be done analytically. However, in early experiments, we found that naïve root finding using trigonometric and hyperbolic methods was unstable for certain values of the coefficients. To address this issue, we follow a modified version, provided by Peters , of the scheme outlined by Blinn , which stabilizes many of the potentially difficult cases (see section A.3).

Unlike the affine and additive transformations which have limited flexibility, a monotonic spline can approximate any continuous monotonic function arbitrarily well, given sufficiently many bins. Yet, cubic-spline coupling transforms have a closed-form Jacobian determinant and can be inverted in one pass. Finally, our parameterization of cubic splines using Steffen’s method is fully differentiable, so the transform can be trained by gradient methods.

In addition to transforming by monotonic cubic splines whose parameters are a function of , we introduce a set of monotonic cubic splines which act elementwise on

, whose parameters are optimized directly with stochastic gradient descent. This means that our coupling layer transforms all input variables at once, rather than being restricted to a subset, as follows:

 \bftheta1:d−1 =Trainable parameters (7) \bfthetad:D =NN(\bfx1:d−1) (8) yi =g\bfthetai(xi)for each i=1:D. (9)

2.4 Linear transforms with an LU decomposition

To ensure all input variables can interact with each other, it is common to permute the elements of intermediate layers in a normalizing flow. Permutation is an invertible linear transformation, with absolute determinant equal to

. Kingma & Dhariwal  generalize this approach using an efficient parameterization of a linear transformation, demonstrating improvements on a range of image tasks. In particular, they parameterize the lower-upper or LU decomposition of a square matrix as , where is a fixed permutation matrix, is lower-triangular with ones on the diagonal, and is upper triangular. Written in this form, the determinant of can be calculated in time as the product of the diagonal elements of . Also, by parameterizing the diagonal elements of to be positive, is guaranteed to be invertible. Given that we have the LU factorization of , we can efficiently apply the inverse transformation, and sample from the model in one pass.

2.5 Cubic-spline flows

We propose a general-purpose flow based on alternating LU-decomposed linear layers and monotonic cubic-spline coupling transforms. Since the splines map to

, we apply a sigmoid, and its inverse, the logit, before and after each coupling layer, so the overall transform is compatible with unconstrained data and linear layers. We clip the inputs of the logit to

to prevent saturation with 32-bit floating-point precision. The function is no longer strictly reversible for all inputs due to this clipping.

Like Real NVP or Glow, cubic-spline flows can represent either a transformation from data to noise, or from noise to data. In both cases, the transformation requires only a single pass of the neural network defining the flow. Overall, our proposed cubic-spline flow resembles a traditional feed-forward neural network architecture, alternating between linear transformations and piecewise nonlinearities, while retaining exact invertibility. Figure 2: Qualitative results for two-dimensional synthetic datasets using cubic-spline flows with two coupling layers. Some previous flows struggle to model such fine details, as demonstrated by, e.g., Nash & Durkan .

3 Experiments

For our experiments, NN in eq. 8

is a fully-connected or convolutional neural network with two pre-activation residual blocks

[9, 10]. We use the Adam optimizer , and a cosine schedule for annealing the learning rate . All experiments use a standard normal for the noise distribution

, except for the two-dimensional experiments, which use a uniform distribution.

3.1 Synthetic datasets

We first demonstrate the flexibility of cubic-spline flows on synthetic two-dimensional datasets (fig. 2). The mixture of Gaussians and Einstein are taken from , and we generate another dataset using an image of Gauss. For each task, we train on one-million data points, use a cubic-spline flow with two coupling layers, no linear layers, and bins. As we can see, the model is flexible enough to fit complex, multimodal densities that other flows would struggle with. In each case, the model consists of fewer than of the parameters than the histograms which are used to display the data.

3.2 Density estimation on tabular data

We next consider the UCI and BSDS300 datasets, a standard suite of benchmarks for density estimation of tabular data. Our spline flows alternate coupling layers with LU-decomposed linear layers, and we use

bins in all experiments. For all datasets, we stack ten of these composite linear and coupling layers, apart from MINIBOONE, where we use five. Hyperparameters such as the number of training steps and the hidden dimension used by NN in coupling layers are tuned separately for each cubic model. As an ablation study, we compare with a flow using quadratic instead of cubic splines, which extends

 with LU layers and elementwise transformations as in section 2.

Table 1

shows our results. The cubic and quadratic models are close in performance, with the error bars on test log likelihood overlapping. However, a paired comparison computing the mean and standard error of the difference between models on individual examples shows that the cubic-spline flow is statistically significantly better on three of the five datasets, while being indistinguishable on POWER and slightly worse than the quadratic-spline flow on BSDS300. On the other hand, the quadratic-spline flow is slightly faster and less prone to numerical issues.

Section A.4 further compares the cubic and quadratic models.

Spline flows achieve state-of-the-art performance on all tasks against FFJORD 

, the previously strongest flow-based model with a one-pass inverse. Moreover, the spline flows are competitive against the best-performing autoregressive flows, achieving state of the art for any flow model on POWER and GAS. These results demonstrate that it is possible for neural density estimators based on coupling layers to challenge autoregressive models, and that it may not be necessary to sacrifice one-pass sampling for density-estimation performance.

3.3 Image generation

Finally, we demonstrate that cubic-spline flows scale to high-dimensional data using the Fashion-MNIST dataset

. We use a Glow-like model  with cubic-spline coupling transforms, which includes a multi-scale architecture, actnorm layers with data-dependent initialization, and invertible convolutions. In line with Kingma & Dhariwal , we illustrate the effect of scaling the standard deviation of the noise distribution by a temperature . Qualitative results are shown in fig. 3. We leave the comparison with state-of-the-art flow-based image models on higher-dimensional image datasets for future work. Figure 3: Qualitative results for Fashion-MNIST dataset. Left: Training data. Right: Unconditional samples from the cubic-spline flow. Samples for three temperatures are shown in separate blocks, top-to-bottom: 0.5,0.75,1.0.

4 Conclusion

The cubic-spline flow is the first model that combines the density-estimation performance of state-of-the-art autoregressive models like NAF  and TAN , while still being able to generate realistic samples in a single pass like Real NVP  and Glow . More generally, we have shown that flexible monotonic splines are a useful differentiable module, which could be included in many models that can be trained end-to-end. We intend to implement autoregressive flows with these transformations, which may further improve the state of the art in density estimation.

Acknowledgements

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh. George Papamakarios was also supported by Microsoft Research through its PhD Scholarship Programme.

Appendix A More details on monotonic cubic splines

a.1 Parameterization of the spline

We here include an outline of the method of Steffen  which closely matches the original paper. Let be a given set of knot points in the plane which satisfy

 (x0,y0)=(0,0),(xK,yK)=(1,1), (10) xk

If additionally the derivatives at the knot points are known, a unique cubic polynomial

 fk(ξ)=αk0+αk1ξ+αk2ξ2+αk3ξ3, (12)

where , is determined on each bin. The coefficients are given by

 αk0 =yk (13) αk1 =dk (14) αk2 =3sk−2dk−dk+1wk (15) αk3 =dk+dk+1−2sk\roundbrwk2, (16)

where is the width of each bin, and is the slope of the line joining consecutive knots. The resulting cubic spline is continuously differentiable on the interval , and it remains only to determine the derivative values so that the overall spline is monotonic on this interval.

To this end, Steffen  uses the unique quadratic function passing through the points , and to determine the derivative at location . Monotonicity of this quadratic on the interval defined by these points is sufficient to specify an appropriate derivative, which is given by

 pk=sk−1wk+skwk−1wk−1+wk. (17)

However, it is possible that the quadratic in question is not monotonic on the given interval, and in this case the derivative value needs to be modified. Steffen  proposes using the smallest value for the derivative at such that the quadratic is monotonic on the given interval, showing that

 dk={2min\roundbrsk−1,skif pk>2min\roundbrsk−1,skpkotherwise (18)

is sufficient.

a.2 Computing the derivative

The derivative of eq. 12 is straightforwardly given by

 \derivfk(ξ)x=αk1+2αk2ξ+3αk3ξ2. (19)

The logarithm of the absolute value of the determinant of a cubic-spline coupling transform is thus given by a sum of the logarithm of eq. 19 for each transformed .

a.3 Computing the inverse

Computing the inverse of a cubic polynomial requires solving for the roots of a cubic polynomial whose constant term depends on the value to invert. That is, for a given , we wish to find which satisfies

 \roundbrαk0−y+αk1ξ+αk2ξ2+αk3ξ3=0, (20)

where is defined as before. Monotonicity means there is at most one root in each bin, and it is possible to determine this value analytically, using either Cardano’s formula, or trigonometric or hyperbolic methods. In practice, careful numerical consideration is necessary to ensure stable root-finding. We found the comprehensive treatment of Blinn  particularly useful, ultimately settling on a slightly modified version of a procedure outlined in this latter work . We refer readers to Blinn  for full details.

a.4 Comparison with monotonic quadratic splines

Following Müller et al. , we construct a quadratic spline by integrating an unconstrained linear spline. The result is a monotonic piecewise-quadratic function, where the values and derivatives of the function are continuous at the knots. The derivatives of the quadratic spline at the knots, corresponding to the density, are set directly by the linear spline. However, some monotonically-increasing knot locations cannot be expressed through this construction.

In contrast, Steffen’s cubic-spline construction 

can interpolate arbitrary monotonic settings of the knot locations

, so unlike the quadratic spline can set arbitrary quantiles of the implied distribution at the knot locations. However, the method sets the derivatives of the function based on prior smoothness assumptions, rather than allowing the user to set them directly as in the quadratic spline. As a result, the number of parameters for the two spline constructions are similar. The cubic spline could be made strictly more flexible by adding parameters for the derivatives at the knots, constrained to maintain monotonicity.

As they are, the cubic and quadratic splines give different functions for knots specifying the same quantiles, corresponding to different inductive biases. Steffen  discusses several nice properties of the interpolants chosen by his cubic-spline method.

References

• Blinn  Blinn, J. F. How to solve a cubic equation, part 5: Back to numerics. IEEE Computer Graphics and Applications, 27(3):78–89, 2007.
• Chen et al.  Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Advances in Neural Information Processing Systems, 2018.
• De Cao et al.  De Cao, N., Titov, I., and Aziz, W. Block neural autoregressive flow.

Conference on Uncertainty in Artificial Intelligence

, 2019.
• Dinh et al.  Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. International Conference on Learning Representations, Workshop track, 2015.
• Dinh et al.  Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, 2017.
• Durrleman & Simon  Durrleman, S. and Simon, R. Flexible regression models with cubic splines. Statistics in medicine, 8(5):551–561, 1989.
• Fritsch & Carlson  Fritsch, F. N. and Carlson, R. E. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.
• Grathwohl et al.  Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. K. FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2018.
• He et al. [2016a] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition.

IEEE Conference on Computer Vision and Pattern Recognition

, 2016a.
• He et al. [2016b] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. European Conference on Computer Vision, 2016b.
• Hoogeboom et al.  Hoogeboom, E., van den Berg, R., and Welling, M. Emerging convolutions for generative normalizing flows. arXiv:1901.11137, 2019.
• Hou & Andrews  Hou, H. and Andrews, H. Cubic splines for image interpolation and digital filtering. IEEE Transactions on acoustics, speech, and signal processing, 26(6):508–517, 1978.
• Huang et al.  Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows.

International Conference on Machine Learning

, 2018.
• Kim et al.  Kim, S., Lee, S.-g., Song, J., and Yoon, S. FloWaveNet: A generative flow for raw audio. arXiv:1811.02155, 2018.
• Kingma & Ba  Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
• Kingma & Dhariwal  Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible convolutions. Advances in Neural Information Processing Systems, 2018.
• Kingma et al.  Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.
• Kumar et al.  Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. VideoFlow: A flow-based generative model for video. arXiv:1903.01434, 2019.
• Loshchilov & Hutter  Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.
• Müller et al.  Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling. arXiv:1808.03856, 2018.
• Nash & Durkan  Nash, C. and Durkan, C. Autoregressive energy machines. International Conference on Machine Learning, 2019.
• Oliva et al.  Oliva, J., Dubey, A., Zaheer, M., Poczos, B., Salakhutdinov, R., Xing, E., and Schneider, J. Transformation autoregressive networks. International Conference on Machine Learning, 2018.
• Papamakarios et al.  Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.
• Papamakarios et al.  Papamakarios, G., Sterratt, D. C., and Murray, I. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. International Conference on Artificial Intelligence and Statistics, 2019.
• Peters  Peters, C. How to solve a cubic equation, revisited. Accessed: 2019-04-17.
• Prenger et al.  Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. arXiv:1811.00002, 2018.
• Rezende & Mohamed  Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
• Steffen  Steffen, M. A simple method for monotonic interpolation in one dimension. Astronomy and Astrophysics, 239:443, 1990.
• van den Berg et al.  van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. Sylvester normalizing flows for variational inference. Conference on Uncertainty in Artificial Intelligence, 2018.
• Xiao et al.  Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.