1 Introduction
Normalizing flows are flexible probabilistic models of continuous data. A normalizing flow models data $x$ as the output of an invertible smooth transformation $f$ of noise $u$:

(1)  $x = f(u), \quad \text{where} \quad u \sim \pi(u).$

Typically, the noise distribution $\pi(u)$ is taken to be simple (a standard normal is a common choice), whereas the transformation $f$ and its inverse $f^{-1}$ are implemented by an invertible neural network. The probability density of $x$ under the flow is obtained by a change of variables:

(2)  $p(x) = \pi\big(f^{-1}(x)\big)\,\left|\det\!\left(\frac{\partial f^{-1}}{\partial x}\right)\right|.$
Given training data $\{x_n\}_{n=1}^{N}$, the flow is trained with backpropagation, by maximizing the total log likelihood $\sum_n \log p(x_n)$ with respect to the parameters of the transformation $f$. Normalizing flows have been successfully used for density estimation [5, 23], variational inference [17, 27, 29], image, audio and video generation [14, 16, 18, 26], and likelihood-free inference [24].

In practice, the Jacobian determinant of $f^{-1}$ must be tractable, so that the density $p(x)$ can be computed efficiently. Several invertible transformations with a tractable Jacobian determinant have been proposed, including coupling transforms [4, 5], autoregressive transforms [3, 13, 17, 23], invertible convolutions [11, 16], transformations based on the matrix-determinant lemma [27, 29], and continuous transformations based on differential equations [2, 8].
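As a minimal concrete instance of eq. (2), the following sketch computes a log density under a hypothetical one-dimensional affine flow $x = f(u) = a u + b$, where the Jacobian determinant is trivially tractable:

```python
# A minimal 1-D illustration of the change of variables in eq. (2),
# for a hypothetical affine flow x = f(u) = a * u + b.
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0                       # parameters of the toy transformation

def log_density(x):
    u = (x - b) / a                   # f^{-1}(x)
    log_abs_det = -np.log(abs(a))     # log |d f^{-1} / dx|
    return norm.logpdf(u) + log_abs_det

# Sanity check against the exact density of x ~ N(b, a^2):
x = 0.3
assert np.isclose(log_density(x), norm.logpdf(x, loc=b, scale=a))
```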
Even though the above transformations are invertible in theory, they are not always efficiently invertible in practice. For instance, the affine autoregressive transforms used by MAF [23] and IAF [17] are $D$ times slower to invert than to evaluate, where $D$ is the dimensionality of $x$. Even worse, inverting the non-affine transformations used by NAF [13] and Block-NAF [3] would require numerical optimization. In practice, the forward transformation $f$ is needed for generating data, whereas the inverse transformation $f^{-1}$ is needed for computing $p(x)$; for a flow to be applicable in both situations, both directions need to be fast.
Transformations that are equally fast to invert and evaluate do exist. Affine coupling transforms, used by Real NVP [5] and Glow [16], only require a single neural-network pass in either direction; however, more flexible autoregressive transforms have obtained better density-estimation results. More flexible coupling transforms have been proposed, based on piecewise-quadratic splines [20], although these have not previously been evaluated as density estimators. Continuous flows such as Neural ODEs [2] and FFJORD [8] are also equally fast in both directions, but they require numerically integrating a differential equation in each direction, which can be slower than a single neural-network pass.
In this paper we introduce cubic-spline flows, which, like Real NVP and Glow, only require a single neural-network pass in either the forward or the inverse direction, but in practice are as flexible as state-of-the-art autoregressive flows. Cubic-spline flows alternate between non-affine coupling transforms that use flexible monotonic cubic splines to transform their input, and LU-decomposed linear layers that combine their input across dimensions. Experimentally, we show that cubic-spline flows match or exceed state-of-the-art performance in density estimation, and that they are capable of generating high-quality images with only a single neural-network pass.
[Figure 1: Since the cubic spline can be interpreted as a cumulative distribution function, its derivative defines a probability density function which is piecewise quadratic.]
2 Method
2.1 Coupling transforms
A coupling transform maps an input $x$ to an output $y$ in the following way:

1. Split the input into two parts, $x = [x_{1:d-1}, x_{d:D}]$.

2. Compute parameters $\theta = \mathrm{NN}(x_{1:d-1})$, where NN is an arbitrary neural network.

3. Compute $y_i = g_{\theta_i}(x_i)$ for $i = d, \ldots, D$, where $g_{\theta_i}$ is an invertible function parameterized by $\theta_i$.

4. Set $y_{1:d-1} = x_{1:d-1}$, and return $y = [y_{1:d-1}, y_{d:D}]$.
The Jacobian matrix of a coupling transform is lower triangular, since $y_{d:D}$ is given by transforming $x_{d:D}$ elementwise as a function of $x_{1:d-1}$, and $y_{1:d-1}$ is equal to $x_{1:d-1}$. Thus, the Jacobian determinant of the coupling transform is

(3)  $\det\!\left(\frac{\partial y}{\partial x}\right) = \prod_{i=d}^{D} \frac{\partial g_{\theta_i}}{\partial x_i}.$

Coupling transforms solve two important problems for normalizing flows: first, they have a tractable Jacobian determinant; second, they can be inverted exactly in a single pass. The inverse of a coupling transform can be easily computed by running steps 1–4 above, this time inputting $y$, and using $g_{\theta_i}^{-1}$ to compute $x_i$ in step 3.
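A minimal sketch of this recipe, using an affine elementwise transform (introduced as eq. (4) in the next section) and a hypothetical stand-in for NN, might look as follows; the split point and the stand-in network are illustrative:

```python
# A minimal sketch of a coupling transform and its one-pass inverse,
# with an affine elementwise function g and a hypothetical stand-in
# parameter network `nn` (a real model would use a learned network).
import numpy as np

def nn(x_a, D_b):
    # Deterministic stand-in producing (alpha, beta) per transformed
    # dimension; alpha is exponentiated to keep it positive (invertible).
    rng = np.random.default_rng(0)
    W = rng.normal(size=(2 * D_b, x_a.size))
    raw = W @ x_a
    return np.exp(raw[:D_b]), raw[D_b:]           # alpha > 0, beta

def coupling_forward(x, d):
    x_a, x_b = x[:d], x[d:]                        # step 1: split
    alpha, beta = nn(x_a, x_b.size)                # step 2: parameters
    y_b = alpha * x_b + beta                       # step 3: transform
    log_det = np.sum(np.log(alpha))                # eq. (3)
    return np.concatenate([x_a, y_b]), log_det    # step 4: identity part

def coupling_inverse(y, d):
    y_a, y_b = y[:d], y[d:]
    alpha, beta = nn(y_a, y_b.size)                # same parameters: y_a = x_a
    x_b = (y_b - beta) / alpha
    return np.concatenate([y_a, x_b])

x = np.random.default_rng(1).normal(size=6)
y, _ = coupling_forward(x, d=3)
assert np.allclose(coupling_inverse(y, d=3), x)
```

Note that the inverse recovers the parameters from $y_{1:d-1}$ alone, which is exactly why the identity part must pass through unchanged.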
2.2 Monotonic-spline coupling transforms
Typically, the function $g_{\theta_i}$ takes the form of an affine or additive transformation for computational ease. The affine transformation is given by:

(4)  $g_{\theta_i}(x_i) = \alpha_i x_i + \beta_i, \quad \text{where} \quad \theta_i = \{\alpha_i, \beta_i\}.$

The additive transformation corresponds to the special case $\alpha_i = 1$. Both the affine and the additive transformation are easy to invert, but they lack flexibility.
Nonetheless, we are free to choose any function $g_{\theta_i}$ as long as it is differentiable and we can easily invert it. Recently, Müller et al. [20] proposed a powerful generalization of the above forms, based on monotonic piecewise polynomials. The idea is to restrict the input domain of $g_{\theta_i}$ to $[0, 1]$, partition the input domain into $K$ bins, and define $g_{\theta_i}$ to be a simple polynomial segment within each bin (see fig. 1). Müller et al. [20] restrict themselves to monotonically-increasing linear and quadratic polynomial segments, whose coefficients are parameterized by $\theta_i$. Moreover, the polynomial segments are restricted to match at the boundaries so that $g_{\theta_i}$ is continuous. Functions of this form are known as polynomial splines.
2.3 Monotonic cubic-spline coupling transforms
Inspired by this direction of research, we introduce cubic-spline flows, a natural extension to the framework of Müller et al. [20]. We propose to implement $g_{\theta_i}$ as a monotonic cubic spline [7], where each segment is defined by a monotonically-increasing cubic polynomial (fig. 1). Cubic splines are well known across science, finding use in many applications, and their theory is well explored [6, 12].
We employ Steffen’s method [28] to parameterize a monotonically-increasing cubic spline that maps $[0, 1]$ to $[0, 1]$. Given a fixed number of bins $K$, Steffen’s method takes $K+1$ coordinates $\{(x_k, y_k)\}_{k=0}^{K}$, known as knots, and two values $\delta_0, \delta_K$ for the derivatives at inputs $0$ and $1$. Then, the method constructs a continuously-differentiable cubic spline that passes through the knots, has the given boundary derivatives, and is monotonic on each segment (for details, see section A.1). To ensure the spline is monotonically-increasing and maps $[0, 1]$ to $[0, 1]$, we require:

(5)  $x_0 = y_0 = 0, \quad x_K = y_K = 1,$

(6)  $x_k < x_{k+1} \ \text{and} \ y_k < y_{k+1} \ \text{for} \ k = 0, \ldots, K-1, \quad \text{and} \quad \delta_0, \delta_K \geq 0.$
Our implementation of the cubic-spline coupling transform is as follows (a code sketch follows the list):

1. Vector $\theta_i$ is partitioned as $\theta_i = [\theta_i^w, \theta_i^h, \theta_i^d]$, where $\theta_i^w$ and $\theta_i^h$ have length $K$, and $\theta_i^d$ has length $2$.

2. Vectors $\theta_i^w$ and $\theta_i^h$ are each passed through a softmax; the outputs are interpreted as the widths and heights of the $K$ bins, which must be positive and sum to $1$.

3. A cumulative sum of the bin widths and heights yields the $K+1$ knots $\{(x_k, y_k)\}_{k=0}^{K}$. Vector $\theta_i^d$ is interpreted as the two boundary derivatives $\delta_0, \delta_K$.
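A minimal sketch of steps 1–3, where the softplus used to make the boundary derivatives non-negative is an illustrative choice (not necessarily the exact parameterization used in our experiments):

```python
# Turning an unconstrained parameter vector theta_i into valid spline
# knots and boundary derivatives; a sketch assuming a softplus for the
# derivatives.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def params_to_spline(theta, K):
    theta_w, theta_h, theta_d = theta[:K], theta[K:2*K], theta[2*K:]
    widths = softmax(theta_w)                  # positive, sum to 1
    heights = softmax(theta_h)
    x_knots = np.concatenate([[0.0], np.cumsum(widths)])   # x_0 = 0, x_K = 1
    y_knots = np.concatenate([[0.0], np.cumsum(heights)])  # y_0 = 0, y_K = 1
    deltas = np.log1p(np.exp(theta_d))         # softplus: derivatives >= 0
    return x_knots, y_knots, deltas

# Locating the bin containing an input is a binary search over sorted knots:
x_knots, y_knots, deltas = params_to_spline(np.zeros(2 * 5 + 2), K=5)
k = np.searchsorted(x_knots, 0.37) - 1         # index of the bin holding 0.37
```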
Evaluating a cubic spline at location $x$ requires finding the bin in which $x$ lies, which can be done efficiently with binary search, since the bins are sorted. The Jacobian determinant of the cubic-spline coupling transform can be computed in closed form as a product of quadratic-polynomial derivatives (see section A.2). For the inverse, we need to compute the roots of a cubic polynomial whose constant term depends on the desired value to invert, which can be done analytically. However, in early experiments, we found that naïve root finding using trigonometric and hyperbolic methods was unstable for certain values of the coefficients. To address this issue, we follow a modified version, provided by Peters [25], of the scheme outlined by Blinn [1], which stabilizes many of the potentially difficult cases (see section A.3).
Unlike the affine and additive transformations, which have limited flexibility, a monotonic spline can approximate any continuous monotonic function arbitrarily well, given sufficiently many bins. Yet, cubic-spline coupling transforms have a closed-form Jacobian determinant and can be inverted in one pass. Finally, our parameterization of cubic splines using Steffen’s method is fully differentiable, so the transform can be trained by gradient methods.
In addition to transforming $x_{d:D}$ by monotonic cubic splines whose parameters are a function of $x_{1:d-1}$, we introduce a set of monotonic cubic splines which act elementwise on $x_{1:d-1}$, whose parameters $\phi$ are optimized directly with stochastic gradient descent. This means that our coupling layer transforms all input variables at once, rather than being restricted to a subset, as follows:

(7)  $\phi_{1:d-1} = \text{trainable parameters},$

(8)  $\theta_{d:D} = \mathrm{NN}(x_{1:d-1}),$

(9)  $y_i = g_{\phi_i}(x_i) \ \text{for} \ i = 1, \ldots, d-1, \qquad y_i = g_{\theta_i}(x_i) \ \text{for} \ i = d, \ldots, D.$
2.4 Linear transforms with an LU decomposition
To ensure all input variables can interact with each other, it is common to permute the elements of intermediate layers in a normalizing flow. Permutation is an invertible linear transformation, with absolute determinant equal to $1$. Kingma & Dhariwal [16] generalize this approach using an efficient parameterization of a linear transformation, demonstrating improvements on a range of image tasks. In particular, they parameterize the lower–upper or LU decomposition of a square matrix $W$ as $W = PLU$, where $P$ is a fixed permutation matrix, $L$ is lower triangular with ones on the diagonal, and $U$ is upper triangular. Written in this form, the determinant of $W$ can be calculated in $\mathcal{O}(D)$ time as the product of the diagonal elements of $U$. Also, by parameterizing the diagonal elements of $U$ to be positive, $W$ is guaranteed to be invertible. Given that we have the LU factorization of $W$, we can efficiently apply the inverse transformation, and sample from the model in one pass.
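A minimal sketch of such an LU-decomposed linear layer, where the log-parameterization of the diagonal of $U$ is one assumed way of keeping it positive:

```python
# An LU-decomposed invertible linear layer W = P L U: O(D) log-determinant
# and an O(D^2) inverse via two triangular solves.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
D = 4
P = np.eye(D)[rng.permutation(D)]                        # fixed permutation
L = np.tril(rng.normal(size=(D, D)), k=-1) + np.eye(D)   # unit lower triangular
log_diag = rng.normal(size=D)                            # positive diag of U
U = np.triu(rng.normal(size=(D, D)), k=1) + np.diag(np.exp(log_diag))

W = P @ L @ U
log_abs_det = log_diag.sum()               # product of diag(U), in log space

def forward(x):
    return W @ x

def inverse(y):
    # Undo P, then solve L z = P^T y and U x = z.
    z = solve_triangular(L, P.T @ y, lower=True, unit_diagonal=True)
    return solve_triangular(U, z, lower=False)

x = rng.normal(size=D)
assert np.allclose(inverse(forward(x)), x)
assert np.isclose(np.log(abs(np.linalg.det(W))), log_abs_det)
```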
2.5 Cubic-spline flows

We propose a general-purpose flow based on alternating LU-decomposed linear layers and monotonic cubic-spline coupling transforms. Since the splines map $[0, 1]$ to $[0, 1]$, we apply a sigmoid, and its inverse, the logit, before and after each coupling layer, so the overall transform is compatible with unconstrained data and linear layers. We clip the inputs of the logit away from $0$ and $1$ to prevent saturation with 32-bit floating-point precision; due to this clipping, the function is no longer strictly invertible for all inputs.

Like Real NVP or Glow, cubic-spline flows can represent either a transformation from data to noise, or from noise to data. In both cases, the transformation requires only a single pass of the neural network defining the flow. Overall, our proposed cubic-spline flow resembles a traditional feedforward neural network architecture, alternating between linear transformations and piecewise nonlinearities, while retaining exact invertibility.
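For concreteness, a minimal sketch of the clipped logit used around each coupling layer; the clipping constant here is illustrative, not the exact value from our implementation:

```python
# Sigmoid/logit sandwich around a coupling layer; eps is an illustrative
# clipping bound to avoid saturation in 32-bit floating point.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)   # clipping breaks strict invertibility
    return np.log(p) - np.log1p(-p)
```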
[Figure 2: training data, flow density, and flow samples for each synthetic two-dimensional dataset.]

[Table 1: rows list the one-pass flows (FFJORD [8], quadratic-spline, cubic-spline) and the autoregressive flows (MAF [23], NAF [13], Block-NAF [3], TAN (various) [22]); columns report POWER, GAS, HEPMASS, MINIBOONE, and BSDS300.]
Table 1: Test log likelihood (in nats) for UCI datasets and BSDS300; higher is better. Error bars correspond to two standard deviations (FFJORD do not report error bars). Apart from the quadratic and cubic splines, all results are taken from existing literature. NAF [13] report error bars across five repeated runs rather than across the test set.

3 Experiments
For our experiments, NN in eq. 8 is a fully-connected or convolutional neural network with two pre-activation residual blocks [9, 10]. We use the Adam optimizer [15], and a cosine schedule for annealing the learning rate [19]. All experiments use a standard normal for the noise distribution, except for the two-dimensional experiments, which use a uniform distribution.
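For reference, a minimal sketch of the cosine annealing schedule of [19]; the maximum and minimum learning rates shown are illustrative hyperparameters, not those used in our runs:

```python
# Cosine learning-rate annealing (Loshchilov & Hutter [19]), single cycle.
import numpy as np

def cosine_lr(step, num_steps, lr_max=5e-4, lr_min=0.0):
    t = step / num_steps                           # progress through training
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * t))
```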
3.1 Synthetic datasets
We first demonstrate the flexibility of cubic-spline flows on synthetic two-dimensional datasets (fig. 2). The mixture of Gaussians and Einstein are taken from [21], and we generate another dataset using an image of Gauss. For each task, we train on one million data points, and use a cubic-spline flow with two coupling layers, no linear layers, and $K$ bins. As we can see, the model is flexible enough to fit complex, multimodal densities that other flows would struggle with. In each case, the model has only a small fraction of the parameters of the histograms used to display the data.
3.2 Density estimation on tabular data
We next consider the UCI and BSDS300 datasets, a standard suite of benchmarks for density estimation of tabular data. Our spline flows alternate coupling layers with LU-decomposed linear layers, and we use $K$ bins in all experiments. For all datasets, we stack ten of these composite linear and coupling layers, apart from MINIBOONE, where we use five. Hyperparameters such as the number of training steps and the hidden dimension used by NN in the coupling layers are tuned separately for each cubic model. As an ablation study, we compare with a flow using quadratic instead of cubic splines, which extends [20] with LU layers and elementwise transformations as in section 2.

Table 1 shows our results. The cubic and quadratic models are close in performance, with the error bars on test log likelihood overlapping. However, a paired comparison computing the mean and standard error of the difference between models on individual examples shows that the cubic-spline flow is statistically significantly better on three of the five datasets, while being indistinguishable on POWER and slightly worse than the quadratic-spline flow on BSDS300. On the other hand, the quadratic-spline flow is slightly faster and less prone to numerical issues.
Section A.4 further compares the cubic and quadratic models.

Spline flows outperform FFJORD [8], the previously strongest flow-based model with a one-pass inverse, on all tasks. Moreover, the spline flows are competitive with the best-performing autoregressive flows, achieving state of the art for any flow model on POWER and GAS. These results demonstrate that it is possible for neural density estimators based on coupling layers to challenge autoregressive models, and that it may not be necessary to sacrifice one-pass sampling for density-estimation performance.
3.3 Image generation
Finally, we demonstrate that cubic-spline flows scale to high-dimensional data using the Fashion-MNIST dataset [30]. We use a Glow-like model [16] with cubic-spline coupling transforms, which includes a multi-scale architecture, actnorm layers with data-dependent initialization, and invertible 1×1 convolutions. In line with Kingma & Dhariwal [16], we illustrate the effect of scaling the standard deviation of the noise distribution by a temperature $T$. Qualitative results are shown in fig. 3. We leave the comparison with state-of-the-art flow-based image models on higher-dimensional image datasets for future work.

4 Conclusion
The cubic-spline flow is the first model that combines the density-estimation performance of state-of-the-art autoregressive models like NAF [13] and TAN [22] with the ability to generate realistic samples in a single pass like Real NVP [5] and Glow [16]. More generally, we have shown that flexible monotonic splines are a useful differentiable module, which could be included in many models that can be trained end-to-end. We intend to implement autoregressive flows with these transformations, which may further improve the state of the art in density estimation.
Acknowledgements
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh. George Papamakarios was also supported by Microsoft Research through its PhD Scholarship Programme.
Appendix A More details on monotonic cubic splines
A.1 Parameterization of the spline
We here include an outline of the method of Steffen [28] which closely matches the original paper. Let $\{(x_k, y_k)\}_{k=0}^{K}$ be a given set of knot points in the plane which satisfy

(10)  $x_0 < x_1 < \cdots < x_K,$

(11)  $y_0 < y_1 < \cdots < y_K.$
If additionally the derivatives $y_k'$ at the knot points are known, a unique cubic polynomial

(12)  $f_k(\xi) = a_k \xi^3 + b_k \xi^2 + c_k \xi + d_k,$

where $\xi = x - x_k$, is determined on each bin. The coefficients are given by

(13)  $a_k = \dfrac{y_k' + y_{k+1}' - 2 s_k}{h_k^2},$

(14)  $b_k = \dfrac{3 s_k - 2 y_k' - y_{k+1}'}{h_k},$

(15)  $c_k = y_k',$

(16)  $d_k = y_k,$
where $h_k = x_{k+1} - x_k$ is the width of each bin, and $s_k = (y_{k+1} - y_k)/h_k$ is the slope of the line joining consecutive knots. The resulting cubic spline is continuously differentiable on the interval $[x_0, x_K]$, and it remains only to determine the derivative values $y_k'$ so that the overall spline is monotonic on this interval.
To this end, Steffen [28] uses the unique quadratic function passing through the points $(x_{k-1}, y_{k-1})$, $(x_k, y_k)$, and $(x_{k+1}, y_{k+1})$ to determine the derivative at location $x_k$. Monotonicity of this quadratic on the interval defined by these points is sufficient to specify an appropriate derivative, which is given by

(17)  $y_k' = \dfrac{s_{k-1} h_k + s_k h_{k-1}}{h_{k-1} + h_k}.$

However, it is possible that the quadratic in question is not monotonic on the given interval, and in this case the derivative value needs to be modified. Steffen [28] proposes using the smallest value for the derivative at $x_k$ such that the quadratic is monotonic on the given interval, showing that

(18)  $y_k' = 2 \min(s_{k-1}, s_k)$
is sufficient.
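For monotonically-increasing knots, eqs. (13)–(18) can be sketched in a few lines of NumPy; this is a minimal illustration, not the exact code used in our experiments:

```python
# Steffen's construction for monotone increasing knots. Arrays x, y hold
# the K+1 knot coordinates; d0 and dK are the boundary derivatives.
import numpy as np

def steffen_coefficients(x, y, d0, dK):
    h = np.diff(x)                         # bin widths h_k
    s = np.diff(y) / h                     # secant slopes s_k
    # Eq. (17): slope of the quadratic through three consecutive knots,
    # capped by eq. (18) to guarantee monotonicity on each segment.
    p = (s[:-1] * h[1:] + s[1:] * h[:-1]) / (h[:-1] + h[1:])
    d = np.concatenate([[d0], np.minimum(p, 2.0 * np.minimum(s[:-1], s[1:])), [dK]])
    # Eqs. (13)-(16): cubic coefficients on each bin.
    a = (d[:-1] + d[1:] - 2.0 * s) / h**2
    b = (3.0 * s - 2.0 * d[:-1] - d[1:]) / h
    c = d[:-1]
    e = y[:-1]                             # constant term (d_k in eq. 16)
    return a, b, c, e

def evaluate(x_knots, coeffs, xq):
    a, b, c, e = coeffs
    k = min(np.searchsorted(x_knots, xq, side='right') - 1, len(a) - 1)
    xi = xq - x_knots[k]
    return ((a[k] * xi + b[k]) * xi + c[k]) * xi + e[k]
```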
A.2 Computing the derivative

Differentiating eq. 12 gives the derivative of the spline in closed form on each bin:

(19)  $f_k'(\xi) = 3 a_k \xi^2 + 2 b_k \xi + c_k,$

where $\xi = x - x_k$ as before. The Jacobian determinant of the cubic-spline coupling transform is therefore a product of these quadratic-polynomial derivatives, one per transformed dimension.
A.3 Computing the inverse
Computing the inverse of the spline requires solving for the roots of a cubic polynomial whose constant term depends on the value to invert. That is, for a given $y$, we wish to find $x$ which satisfies

(20)  $f_k(x - x_k) - y = 0,$

where $f_k$ is defined as before. Monotonicity means there is at most one root in each bin, and it is possible to determine this value analytically, using either Cardano’s formula, or trigonometric or hyperbolic methods. In practice, careful numerical consideration is necessary to ensure stable root-finding. We found the comprehensive treatment of Blinn [1] particularly useful, ultimately settling on a slightly modified version [25] of a procedure outlined in that work. We refer readers to Blinn [1] for full details.
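For illustration, the following sketch inverts the spline on a single bin using numpy.roots for brevity; it is not the stabilized Blinn/Peters scheme used in our implementation:

```python
# Invert f_k on one bin by cubic root-finding (eq. 20); np.roots is used
# for brevity instead of the numerically stabilized scheme of [1, 25].
import numpy as np

def invert_bin(coeffs, k, x_knots, y_query):
    a, b, c, e = coeffs                    # as produced in section A.1
    h_k = x_knots[k + 1] - x_knots[k]
    # Solve a xi^3 + b xi^2 + c xi + (e - y) = 0 for xi in [0, h_k].
    roots = np.roots([a[k], b[k], c[k], e[k] - y_query])
    real = roots[np.abs(roots.imag) < 1e-9].real
    # Monotonicity guarantees exactly one root inside the bin.
    xi = real[(real >= -1e-9) & (real <= h_k + 1e-9)][0]
    return x_knots[k] + xi

# The bin index is found by binary search over the sorted y-knots, e.g.:
# k = np.searchsorted(y_knots, y_query, side='right') - 1
```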
A.4 Comparison with monotonic quadratic splines
Following Müller et al. [20], we construct a quadratic spline by integrating an unconstrained linear spline. The result is a monotonic piecewise-quadratic function, where the values and derivatives of the function are continuous at the knots. The derivatives of the quadratic spline at the knots, corresponding to the density, are set directly by the linear spline. However, some monotonically-increasing knot locations cannot be expressed through this construction.
In contrast, Steffen’s cubic-spline construction [28] can interpolate arbitrary monotonic settings of the knot locations $\{(x_k, y_k)\}_{k=0}^{K}$, so unlike the quadratic spline it can set arbitrary quantiles of the implied distribution at the knot locations. However, the method sets the derivatives of the function based on prior smoothness assumptions, rather than allowing the user to set them directly as in the quadratic spline. As a result, the number of parameters for the two spline constructions is similar. The cubic spline could be made strictly more flexible by adding parameters for the derivatives at the knots, constrained to maintain monotonicity.
As they are, the cubic and quadratic splines give different functions for knots specifying the same quantiles, corresponding to different inductive biases. Steffen [28] discusses several nice properties of the interpolants chosen by his cubic-spline method.
References
 Blinn [2007] Blinn, J. F. How to solve a cubic equation, part 5: Back to numerics. IEEE Computer Graphics and Applications, 27(3):78–89, 2007.

Chen et al. [2018] Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.
De Cao et al. [2019] De Cao, N., Titov, I., and Aziz, W. Block neural autoregressive flow. Conference on Uncertainty in Artificial Intelligence, 2019.
Dinh et al. [2015] Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. International Conference on Learning Representations, Workshop track, 2015.
 Dinh et al. [2017] Dinh, L., SohlDickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, 2017.
Durrleman & Simon [1989] Durrleman, S. and Simon, R. Flexible regression models with cubic splines. Statistics in Medicine, 8(5):551–561, 1989.
 Fritsch & Carlson [1980] Fritsch, F. N. and Carlson, R. E. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.
 Grathwohl et al. [2018] Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. K. FFJORD: Freeform continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2018.

He et al. [2016a] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016a.
He et al. [2016b] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. European Conference on Computer Vision, 2016b.
 Hoogeboom et al. [2019] Hoogeboom, E., van den Berg, R., and Welling, M. Emerging convolutions for generative normalizing flows. arXiv:1901.11137, 2019.
Hou & Andrews [1978] Hou, H. and Andrews, H. Cubic splines for image interpolation and digital filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(6):508–517, 1978.

Huang et al. [2018] Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. International Conference on Machine Learning, 2018.
Kim et al. [2018] Kim, S., Lee, S.-g., Song, J., and Yoon, S. FloWaveNet: A generative flow for raw audio. arXiv:1811.02155, 2018.
 Kingma & Ba [2014] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
Kingma & Dhariwal [2018] Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1×1 convolutions. Advances in Neural Information Processing Systems, 2018.
 Kingma et al. [2016] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 2016.
 Kumar et al. [2019] Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. VideoFlow: A flowbased generative model for video. arXiv:1903.01434, 2019.
 Loshchilov & Hutter [2016] Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.
 Müller et al. [2018] Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling. arXiv:1808.03856, 2018.
 Nash & Durkan [2019] Nash, C. and Durkan, C. Autoregressive energy machines. International Conference on Machine Learning, 2019.
 Oliva et al. [2018] Oliva, J., Dubey, A., Zaheer, M., Poczos, B., Salakhutdinov, R., Xing, E., and Schneider, J. Transformation autoregressive networks. International Conference on Machine Learning, 2018.
 Papamakarios et al. [2017] Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.
 Papamakarios et al. [2019] Papamakarios, G., Sterratt, D. C., and Murray, I. Sequential neural likelihood: Fast likelihoodfree inference with autoregressive flows. International Conference on Artificial Intelligence and Statistics, 2019.
 Peters [2016] Peters, C. How to solve a cubic equation, revisited. http://momentsingraphics.de/?p=105, 2016. Accessed: 20190417.
 Prenger et al. [2018] Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flowbased generative network for speech synthesis. arXiv:1811.00002, 2018.
 Rezende & Mohamed [2015] Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.
 Steffen [1990] Steffen, M. A simple method for monotonic interpolation in one dimension. Astronomy and Astrophysics, 239:443, 1990.
 van den Berg et al. [2018] van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. Sylvester normalizing flows for variational inference. Conference on Uncertainty in Artificial Intelligence, 2018.
 Xiao et al. [2017] Xiao, H., Rasul, K., and Vollgraf, R. FashionMNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.